# ML Training Pipeline
Complete machine learning pipeline for waste classification using PyTorch and EfficientNet-B0.
## Setup
### 1. Install Dependencies
```bash
pip install -r ml/requirements.txt
```
### 2. Prepare Dataset
#### Option A: Use Public Datasets
```bash
# List the available public datasets
python ml/dataset_prep.py info

# Download the datasets listed in DATASET_SOURCES.txt, then extract them
# into ml/data/raw/ with one folder per category

# Split the dataset into train/val/test sets
python ml/dataset_prep.py
```
#### Option B: Use Custom Data
Place your images in:
```
ml/data/raw/
  recyclable/
  organic/
  wet-waste/
  dry-waste/
  ewaste/
  hazardous/
  landfill/
```
Then run:
```bash
python ml/dataset_prep.py
```
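The internals of `dataset_prep.py` are not shown here. As a rough sketch of what a per-category train/val/test split can look like, the hypothetical `split_dataset` helper below shuffles each category's files and slices them. The 70/15/15 ratio and the fixed seed are assumptions, not values taken from the script.

```python
import random
from collections import defaultdict

def split_dataset(files_by_category, val_frac=0.15, test_frac=0.15, seed=42):
    """Split each category's file list into train/val/test, category by category."""
    rng = random.Random(seed)  # fixed seed (assumed) so the split is reproducible
    splits = {"train": defaultdict(list), "val": defaultdict(list), "test": defaultdict(list)}
    for category, files in files_by_category.items():
        shuffled = sorted(files)  # sort first so the result does not depend on listing order
        rng.shuffle(shuffled)
        n_val = int(len(shuffled) * val_frac)
        n_test = int(len(shuffled) * test_frac)
        splits["val"][category] = shuffled[:n_val]
        splits["test"][category] = shuffled[n_val:n_val + n_test]
        splits["train"][category] = shuffled[n_val + n_test:]
    return splits
```

Splitting within each category keeps the class proportions similar across the three sets, which matters when some categories have far fewer images than others.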
## Training
### Initial Training
Fine-tune an ImageNet-pretrained EfficientNet-B0 on the prepared dataset:
```bash
python ml/train.py
```
Training will:
- Use transfer learning with ImageNet pretrained weights
- Apply data augmentation for better generalization
- Save best model to `ml/models/best_model.pth`
- Generate confusion matrix
- Log training history
### Model Architecture
- **Base**: EfficientNet-B0 (pretrained on ImageNet)
- **Input**: 224x224 RGB images
- **Output**: 7 waste categories
- **Parameters**: ~5.3M
- **Inference Time**: ~50ms per image on CPU (varies with hardware)
### Why EfficientNet-B0?
1. **Accuracy**: About 77% top-1 on ImageNet, strong for a model of this size
2. **Speed**: Low compute cost, suitable for mobile/edge devices
3. **Size**: Compact model (~20MB)
4. **Efficiency**: High accuracy for its parameter count compared with larger CNNs such as ResNet-50
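The ~20MB figure follows from the parameter count. As a quick check, assuming weights are stored as 32-bit floats (4 bytes each) and ignoring any extra data in the checkpoint file:

```python
def model_size_mb(num_params, bytes_per_param=4):
    """Approximate size of the stored weights in megabytes."""
    return num_params * bytes_per_param / 1e6

print(f"{model_size_mb(5.3e6):.1f} MB")  # 21.2 MB for float32 weights
```

Checkpoints that also store optimizer state will be larger than this estimate.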
## Inference
### Python Inference
```python
from ml.predict import WasteClassifier

classifier = WasteClassifier('ml/models/best_model.pth')

# From a file path
result = classifier.predict('image.jpg')

# From a base64 data URI
result = classifier.predict('data:image/jpeg;base64,...')

print(result)
# {
#     'category': 'recyclable',
#     'confidence': 0.95,
#     'probabilities': {...},
#     'timestamp': 1234567890
# }
```
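`predict` accepts either a file path or a base64 data URI. How `WasteClassifier` tells the two apart is not shown here; one plausible approach is sketched below. The helper name `load_image_bytes` is hypothetical and is not part of `ml.predict`.

```python
import base64

def load_image_bytes(source):
    """Return raw image bytes from a base64 data URI or from a file path."""
    if source.startswith("data:"):
        # A data URI looks like "data:image/jpeg;base64,<payload>"
        header, _, payload = source.partition(",")
        if not header.endswith(";base64"):
            raise ValueError("only base64-encoded data URIs are handled in this sketch")
        return base64.b64decode(payload)
    # Anything else is treated as a path on disk
    with open(source, "rb") as f:
        return f.read()
```

The returned bytes could then be opened with an image library and resized to the 224x224 input the model expects.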
### Export to ONNX
For production deployment:
```bash
python -c "from ml.predict import export_to_onnx; export_to_onnx()"
```
## Continuous Learning
### Collect Feedback
User corrections are saved to:
```
ml/data/retraining/
  recyclable/
  organic/
  ...
```
### Retrain Model
Fine-tune the model on the new samples:
```bash
python ml/retrain.py
```
Retraining will:
1. Add new samples to training set
2. Fine-tune existing model (lower learning rate)
3. Evaluate improvement
4. Promote model if accuracy improves by >1%
5. Version models (v1, v2, v3, ...)
6. Archive retraining samples
7. Log retraining events
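The promotion rule in step 4 can be sketched as a simple comparison. This assumes accuracy is a fraction between 0 and 1 and that ">1%" means an absolute gain of more than one percentage point. `retrain.py` may define the threshold differently (for example as a relative improvement), and `should_promote` is a hypothetical name.

```python
def should_promote(current_acc, candidate_acc, min_gain=0.01):
    """Promote the retrained model only if it beats the current one by more than min_gain."""
    return (candidate_acc - current_acc) > min_gain

print(should_promote(0.900, 0.915))  # True: gain of 1.5 points
print(should_promote(0.900, 0.905))  # False: gain of only 0.5 points
```

Both models should be evaluated on the same held-out set, otherwise the comparison is not meaningful.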
### Automated Retraining
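Set up a cron job or scheduled task. Cron does not start in the project directory, so change into it first (replace the placeholder path with your checkout):
```bash
# Weekly retraining: every Sunday at 02:00
0 2 * * 0 cd /path/to/project && python ml/retrain.py
```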
## Model Versioning
Models are versioned automatically:
- `best_model.pth` - Current production model
- `model_v1.pth` - Version 1 (archived)
- `model_v2.pth` - Version 2 (archived)
- `best_model_backup_*.pth` - Backup before promotion
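One way to pick the next archive name under this scheme is to scan the existing `model_vN.pth` files and add one to the highest `N`. This is a sketch of that idea only; `next_version_name` is a hypothetical helper and the actual logic in `retrain.py` is not shown here.

```python
import re

def next_version_name(existing_files):
    """Return the next 'model_vN.pth' name given the file names already in ml/models/."""
    versions = [
        int(m.group(1))
        for name in existing_files
        if (m := re.fullmatch(r"model_v(\d+)\.pth", name))
    ]
    # Start at v1 when no versioned model exists yet
    return f"model_v{max(versions, default=0) + 1}.pth"

print(next_version_name(["best_model.pth", "model_v1.pth", "model_v2.pth"]))  # model_v3.pth
```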
## Evaluation Metrics
- **Accuracy**: Overall classification accuracy
- **F1 Score (Macro)**: Average F1 across all categories
- **F1 Score (Weighted)**: Weighted by class frequency
- **Confusion Matrix**: Per-category performance
## Dataset Requirements
### Minimum Samples per Category
- Training: 500+ images per category
- Validation: 100+ images per category
- Test: 100+ images per category
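A quick way to check a split against these minimums is to count the image files in each category folder. The sketch below assumes each split directory contains one subfolder per category. Both helper names are hypothetical, and the counts in the last line are made up.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def count_images(split_dir):
    """Count JPG/PNG files in each category folder directly under split_dir."""
    return {
        d.name: sum(1 for f in d.iterdir() if f.suffix.lower() in IMAGE_SUFFIXES)
        for d in sorted(Path(split_dir).iterdir())
        if d.is_dir()
    }

def under_minimum(counts, minimum):
    """Return the categories that have fewer than `minimum` images."""
    return {category: n for category, n in counts.items() if n < minimum}

print(under_minimum({"recyclable": 620, "ewaste": 310}, minimum=500))  # {'ewaste': 310}
```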
### Image Quality
- Resolution: 640x480 or higher
- Format: JPG or PNG
- Lighting: Various conditions
- Backgrounds: Real-world environments
- Variety: Different angles, distances, overlaps
## Performance Optimization
### CPU Inference
- EfficientNet-B0 is small enough to run on CPU
- Inference time: ~50ms per image (varies with hardware)
- No GPU required for deployment
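Latency depends heavily on the machine, so it is worth measuring on the target hardware. A minimal timing harness might look like the following; `benchmark_ms` is a hypothetical helper, and the warm-up and run counts are arbitrary defaults.

```python
import statistics
import time

def benchmark_ms(fn, warmup=3, runs=20):
    """Return the median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (caches, lazy initialization)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

# Example with the classifier from the Inference section:
# latency = benchmark_ms(lambda: classifier.predict('image.jpg'))
```

The median is less sensitive than the mean to occasional slow runs caused by other processes.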
### GPU Training
- Typically trains 10-20x faster on a GPU than on CPU
- Automatically detects CUDA availability
- Falls back to CPU if no GPU is available
## Troubleshooting
### Low Accuracy
1. Add more diverse training data
2. Balance dataset (equal samples per category)
3. Increase training epochs
4. Adjust learning rate
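When collecting equal numbers of samples per category is not practical, a common alternative to rebalancing is to weight the loss by inverse class frequency. This is a different technique from the one in item 2 and is not something `train.py` is known to do. The helper name and the counts below are made up for illustration.

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * class_count), so rare classes count more."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {category: total / (n_classes * n) for category, n in counts.items()}

weights = inverse_frequency_weights({"recyclable": 100, "ewaste": 50})
print(weights)  # {'recyclable': 0.75, 'ewaste': 1.5}
```

In PyTorch, weights like these can be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, ordered by class index.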
### Overfitting
1. Increase dropout rate
2. Add more data augmentation
3. Use early stopping (already enabled)
4. Collect more training data
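Early stopping is listed as already enabled, but its configuration is not shown here. The general idea is sketched below: stop once the validation loss has not improved for a set number of epochs. The class name, `patience`, and `min_delta` are illustrative and may not match `train.py`.

```python
class EarlyStopping:
    """Signal a stop when validation loss fails to improve for `patience` epochs in a row."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
for loss in [1.0, 0.8, 0.9, 0.85]:
    if stopper.step(loss):
        break  # reached after the second epoch without improvement
```

A larger `patience` tolerates noisy validation curves at the cost of a few extra epochs.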
### Class Confusion
1. Check confusion matrix
2. Add more examples for confused classes
3. Make sure the categories are visually distinguishable in the training images
4. Review mislabeled data
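To make step 1 concrete, the largest off-diagonal cells of the confusion matrix show which category pairs to look at first. The sketch below assumes rows are true labels and columns are predictions. `most_confused` is a hypothetical helper and the counts are invented.

```python
def most_confused(cm, labels, top_k=3):
    """Return up to top_k (true_label, predicted_label, count) tuples, largest errors first."""
    pairs = [
        (cm[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and cm[i][j] > 0  # off-diagonal cells are misclassifications
    ]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return [(true, pred, count) for count, true, pred in pairs[:top_k]]

labels = ["organic", "wet-waste", "ewaste"]
cm = [[50, 12, 0],
      [9, 40, 1],
      [0, 2, 30]]
print(most_confused(cm, labels, top_k=2))
# [('organic', 'wet-waste', 12), ('wet-waste', 'organic', 9)]
```

A pair that is confused in both directions, as in this invented example, often points to overlapping category definitions or inconsistent labeling rather than a lack of data.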
## Next Steps
1. **Collect Data**: Gather images of waste as found in India
2. **Initial Training**: Train base model
3. **Deploy**: Integrate with backend API
4. **Monitor**: Track prediction accuracy
5. **Improve**: Continuous learning pipeline