Data & Preprocessing
Dataset Overview
- Volume: ~15 GB total, over 90,000 labeled images
- Sources: COCO dataset, public images from Roboflow, Kaggle
- Labeling Tools: Roboflow Annotation Tool + Manual QA

COCO dataset

Roboflow dataset
Object Classes
Common Classes (14)
- Person
- Bicycle
- Car
- Motorcycle
- Bus
- Truck
- Traffic Light
- Fire Hydrant
- Stop Sign
- Parking Meter
- Bench
- Bird
- Cat
- Dog
Additional Classes (13)
- Pothole
- Scooter
- Tree
- Trash Bin
- Bollard
- Fence/Barrier
- Traffic Cone
- Bad Roads
- Crosswalk
- Pole
- Stairs
- Upstairs
- Downstairs
Challenges & Solutions
Challenges
- Mislabeling and noise in public datasets
- Minority classes such as potholes are hard to source
- Discrepancies between real-world and training data
Solutions
- Manual relabeling and quality control
- Data augmentation and undersampling strategies
- Custom datasets captured with GoPro / phone cameras

Example: real-world capture vs. public dataset image quality

Example of mislabeling error from auto-label tool
Preprocessing Workflow
- Data augmentation for minority classes (rotation, flipping, brightness)
- Undersampling of dominant classes (e.g., person)
- Manual quality checks and relabeling
- Dataset split: 80% train / 10% validation / 10% test
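
The undersampling and split steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the sample layout (a dict with `id` and `classes`), the `keep_fraction` value, and the rule of thinning only images whose sole class is the dominant one are all assumptions made for the example.

```python
import random

def undersample_dominant(samples, dominant="person", keep_fraction=0.3, seed=0):
    """Keep every image that shows a non-dominant class; thin out the rest.

    Assumption: each sample is a dict like {"id": ..., "classes": [...]}.
    """
    rng = random.Random(seed)
    only_dominant = [s for s in samples if set(s["classes"]) == {dominant}]
    others = [s for s in samples if set(s["classes"]) != {dominant}]
    n_keep = round(len(only_dominant) * keep_fraction)
    return others + rng.sample(only_dominant, n_keep)

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    rng = random.Random(seed)
    shuffled = samples[:]          # leave the caller's list untouched
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # remainder goes to test
    return train, val, test

# Hypothetical toy data: 100 person-only images, 50 images containing potholes.
samples = [{"id": i, "classes": ["person"]} for i in range(100)]
samples += [{"id": 100 + i, "classes": ["pothole", "person"]} for i in range(50)]

balanced = undersample_dominant(samples)
train, val, test = split_dataset(balanced)
```

Fixing the seed keeps the split reproducible, so the same images stay in the test set between runs.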

Augmentation applied to minority-class images to enhance balance
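
A minimal sketch of the three augmentations named above (rotation, flipping, brightness), written with NumPy. Several details are assumptions for illustration only: images are H x W x C `uint8` arrays, boxes are pixel-coordinate `[x1, y1, x2, y2]` rows, rotation is limited to 90-degree steps, and the brightness range of 0.7 to 1.3 is arbitrary. For object detection, the boxes have to be remapped together with the pixels, which is why each geometric function returns both.

```python
import numpy as np

def hflip(image, boxes):
    """Mirror the image left-right and remap [x1, y1, x2, y2] boxes."""
    width = image.shape[1]
    out = boxes.astype(np.float32).copy()
    out[:, 0] = width - boxes[:, 2]
    out[:, 2] = width - boxes[:, 0]
    return image[:, ::-1].copy(), out

def rot90_ccw(image, boxes):
    """Rotate 90 degrees counter-clockwise; a point (x, y) moves to (y, W - x)."""
    width = image.shape[1]
    x1, y1, x2, y2 = boxes.astype(np.float32).T
    out = np.stack([y1, width - x2, y2, width - x1], axis=1)
    return np.rot90(image, k=1, axes=(0, 1)).copy(), out

def adjust_brightness(image, factor):
    """Scale pixel intensities and clip back into the uint8 range."""
    scaled = image.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)

def augment(image, boxes, rng):
    """Apply a random combination of the transforms to one labeled image."""
    if rng.random() < 0.5:
        image, boxes = hflip(image, boxes)
    if rng.random() < 0.5:
        image, boxes = rot90_ccw(image, boxes)
    image = adjust_brightness(image, rng.uniform(0.7, 1.3))
    return image, boxes

# Hypothetical usage on a blank 480x640 image with one box.
rng = np.random.default_rng(0)
image = np.zeros((480, 640, 3), dtype=np.uint8)
boxes = np.array([[100, 50, 200, 150]], dtype=np.float32)
aug_image, aug_boxes = augment(image, boxes, rng)
```

Applying `augment` several times to each minority-class image, with different random draws, is one way to raise that class's share of the training set.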
Replay-Based Learning
To prevent catastrophic forgetting when adding new classes, we used replay-based learning: examples of previously learned classes are mixed back into training alongside the new data. This preserves prior model performance while enabling continual learning from new examples.
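
One common way to set up replay is sketched below. The source does not describe how old examples were stored or mixed, so everything here is an assumption for illustration: the `ReplayBuffer` class, the use of reservoir sampling to fill it, the batch size, and the 50/50 ratio of new to replayed samples are all hypothetical choices.

```python
import random

class ReplayBuffer:
    """Fixed-size store of samples from previously learned classes."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        # Reservoir sampling: every sample seen so far has an equal
        # chance of being in the buffer, without storing them all.
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))

def mixed_batches(new_samples, buffer, batch_size=8, replay_fraction=0.5):
    """Yield batches that combine new-class samples with replayed old ones."""
    n_replay = int(batch_size * replay_fraction)
    n_new = max(1, batch_size - n_replay)
    for start in range(0, len(new_samples), n_new):
        yield new_samples[start:start + n_new] + buffer.sample(n_replay)

# Hypothetical usage: old-class data fills the buffer, new-class data is trained on.
buffer = ReplayBuffer(capacity=10)
for i in range(100):
    buffer.add(("old", i))

new_data = [("new", i) for i in range(8)]
batches = list(mixed_batches(new_data, buffer))
```

Each batch fed to the training step then contains both old and new classes, so gradient updates for the new classes are balanced against examples the model has already learned.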