---
language: en
license: mit
tags:
- text-classification
- code-quality
- documentation
- code-comments
- developer-tools
- distilbert
datasets:
- synthetic
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: text-classification
widget:
- text: 'This function calculates the Fibonacci sequence using dynamic programming
    to avoid redundant calculations. Time complexity: O(n), Space complexity: O(n)'
  example_title: Excellent Comment
- text: Calculates the sum of two numbers and returns the result
  example_title: Helpful Comment
- text: does stuff with numbers
  example_title: Unclear Comment
- text: 'DEPRECATED: Use calculate_new() instead. This method will be removed in v2.0'
  example_title: Outdated Comment
---

# Code Comment Quality Classifier

A machine learning model that automatically classifies code comments into quality categories to help improve code documentation and review processes.

## What Does This Model Do?

This model analyzes code comments and classifies them into four categories:

- **Excellent**: Clear, comprehensive, and highly informative comments
- **Helpful**: Good comments that add value but could be improved
- **Unclear**: Vague or confusing comments that don't add much value
- **Outdated**: Comments that may no longer reflect the current code

## Quick Start

### Installation

```bash
pip install -r requirements.txt
```

### Using the Model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "Snaseem2026/code-comment-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify a comment
comment = "This function calculates the Fibonacci sequence using dynamic programming"
inputs = tokenizer(comment, return_tensors="pt", truncation=True, max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()

# Map the predicted class index to its label
labels = ["excellent", "helpful", "unclear", "outdated"]
print(f"Comment quality: {labels[predicted_class]}")
```
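
The snippet above prints only the top label. When a confidence score is also useful, the logits can be post-processed as in the following minimal sketch. It uses plain Python so it runs without the model; `softmax` and `describe_prediction` are hypothetical helper names, the logits are made up for illustration, and the label order is assumed to match the list in the Quick Start code.

```python
import math

# Label order copied from the Quick Start snippet; this is an assumption.
# Prefer model.config.id2label if the checkpoint defines it.
LABELS = ["excellent", "helpful", "unclear", "outdated"]


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe_prediction(logits, labels=LABELS):
    """Return (label, confidence) for one row of logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits for one comment; with the real model this could be
# outputs.logits[0].tolist()
label, confidence = describe_prediction([2.0, 0.5, -1.0, -1.5])
print(f"{label} ({confidence:.2f})")
```

A caller could compare `confidence` against a threshold of its own choosing before acting on a prediction.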

## Training the Model

To train the model on your own data:

```bash
python train.py --config config.yaml
```

To generate synthetic training data:

```bash
python scripts/generate_data.py
```

## Model Details

- **Base Model**: DistilBERT (distilbert-base-uncased)
- **Task**: Multi-class text classification
- **Classes**: 4 (excellent, helpful, unclear, outdated)
- **Training Data**: Synthetic code comments with quality labels
- **License**: MIT
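
The four classes are usually attached to a classification head as explicit index-to-label mappings. The sketch below builds those mappings, assuming the index order matches the `labels` list in the Quick Start code; the repository's own training script may set this up differently.

```python
# Index order is an assumption taken from the Quick Start labels list.
LABELS = ["excellent", "helpful", "unclear", "outdated"]

id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}

print(id2label)
print(label2id)

# These dicts could then be passed when creating the classification head, e.g.
# AutoModelForSequenceClassification.from_pretrained(
#     "distilbert-base-uncased",
#     num_labels=len(LABELS), id2label=id2label, label2id=label2id,
# )
```

Storing the mappings in the model config lets downstream code read label names from `model.config.id2label` instead of hard-coding the list.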

## Use Cases

- **Code Review Automation**: Automatically flag low-quality comments during PR reviews
- **Documentation Quality Checks**: Audit codebases for documentation quality
- **Developer Education**: Help developers learn what makes good code comments
- **IDE Integration**: Real-time feedback on comment quality while coding
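
As an illustration of the code review and audit use cases, the following sketch extracts `#` comments from Python source with the standard `tokenize` module and reports those a classifier rates as low quality. Everything here is an assumption for illustration: `extract_comments`, `flag_comments`, and `stub_classify` are hypothetical names, treating "unclear" and "outdated" as the low-quality labels is a choice made for this example, and the stub stands in for the real model so the sketch runs offline.

```python
import io
import tokenize

# Labels treated as low quality here; this choice is an assumption.
LOW_QUALITY = {"unclear", "outdated"}


def extract_comments(source):
    """Yield (line_number, text) for each '#' comment in Python source."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type == tokenize.COMMENT:
            yield tok.start[0], tok.string.lstrip("#").strip()


def flag_comments(source, classify):
    """Return (line, text, label) for comments that classify() rates as low quality.

    classify is any callable mapping a comment string to one of the four
    labels, for example a thin wrapper around the Quick Start snippet.
    """
    flagged = []
    for line, text in extract_comments(source):
        label = classify(text)
        if label in LOW_QUALITY:
            flagged.append((line, text, label))
    return flagged


def stub_classify(text):
    """Stand-in for the real model so the sketch runs offline."""
    return "unclear" if len(text.split()) < 5 else "helpful"


sample = (
    "def add(a, b):\n"
    "    # does stuff with numbers\n"
    "    return a + b  # Calculates the sum of two numbers and returns the result\n"
)
for line, text, label in flag_comments(sample, stub_classify):
    print(f"line {line}: {label}: {text}")
```

Passing the classifier in as a callable keeps the comment extraction independent of the model, so the stub can be swapped for real inference without changing the rest of the code.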

## Project Structure

```
.
├── README.md
├── LICENSE
├── requirements.txt
├── config.yaml
├── train.py                  # Main training script
├── inference.py              # Inference script
├── src/
│   ├── __init__.py
│   ├── data_loader.py        # Data loading utilities
│   ├── model.py              # Model definition
│   └── utils.py              # Helper functions
├── scripts/
│   ├── generate_data.py      # Generate synthetic training data
│   ├── evaluate.py           # Evaluation script
│   └── upload_to_hub.py      # Upload model to Hugging Face Hub
├── data/
│   └── .gitkeep
└── MODEL_CARD.md             # Hugging Face model card
```

## Contributing

This is an open-source project! Contributions are welcome. Please feel free to:

- Report bugs or issues
- Suggest new features
- Submit pull requests
- Improve documentation

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Built with [Hugging Face Transformers](https://huggingface.co/transformers/)
- Base model: [DistilBERT](https://huggingface.co/distilbert-base-uncased)

## Contact

For questions or feedback, please open a discussion on the model's [Hugging Face page](https://huggingface.co/Snaseem2026/code-comment-classifier/discussions) or reach out via Hugging Face.

---

**Note**: This model is designed for educational and productivity purposes. Always review automated suggestions with human judgment.