---
base_model: Qwen/Qwen1.5-1.8B
library_name: peft
pipeline_tag: text-generation
tags:
- base_model:adapter:Qwen/Qwen1.5-1.8B
- lora
- transformers
- code-generation
- python
- reasoning
- synthetic-data
language:
- en
license: apache-2.0
---

# Qwen 1.5 1.8B - Python Code Generation with Step-by-Step Reasoning

A fine-tuned version of Qwen 1.5 1.8B that generates Python code preceded by step-by-step reasoning. The model explains its approach to a programming problem before writing the code, so users can follow how the solution was reached.

## Model Details

### Model Description

This model was fine-tuned with QLoRA on a synthetic dataset of 1,000 Python programming problems, each enriched with step-by-step reasoning. It learns to explain its problem-solving approach before generating code, which suits educational use and transparent code generation.

- **Developed by:** Rachit Verma
- **Model type:** Causal Language Model (Fine-tuned with LoRA adapters)
- **Language(s):** English (code generation in Python)
- **License:** Apache 2.0
- **Finetuned from model:** Qwen/Qwen1.5-1.8B

### Model Sources

- **Base Model:** [Qwen/Qwen1.5-1.8B](https://huggingface.co/Qwen/Qwen1.5-1.8B)
- **Training Data:** Synthetic dataset generated from MBPP and CodeAlpaca using Llama 3.1 8B

## Uses

### Direct Use

This model is designed for:
- **Educational code generation**: Teaching programming concepts through explained solutions
- **Transparent AI coding assistants**: Understanding how the model approaches problems
- **Code explanation**: Generating step-by-step breakdowns of problem-solving strategies
- **Learning tool**: Helping beginners understand algorithmic thinking

### Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-1.8B",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-1.8B")

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "[YOUR_MODEL_PATH]")

# Generate code with reasoning
prompt = "Write a Python function to find the longest common prefix in a list of strings."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Out-of-Scope Use

- **Production-critical systems**: This model is fine-tuned on a limited dataset and should not be used for safety-critical applications
- **Non-Python languages**: The model is specifically trained on Python problems
- **Complex software architecture**: Best suited for algorithm-level problems, not large-scale system design
- **Security-sensitive code**: Should not be used for generating cryptographic or security-critical code without expert review

## Bias, Risks, and Limitations

### Limitations

1. **Dataset size**: The model was trained on only 1,000 examples, so it may not generalize to all problem types
2. **Teacher model quality**: The synthetic data was generated by Llama 3.1 8B and may contain errors
3. **Small test set**: The model was evaluated on only 7 problems, so its true generalization is unknown
4. **Potential overfitting**: High accuracy on the test set may indicate memorization rather than true learning
5. **No code validation**: The training data was not validated for correctness before fine-tuning

### Recommendations

- Always review and test generated code before using in production
- Use as a learning tool rather than a replacement for human expertise
- Validate outputs against test cases and edge cases
- Consider the model's explanations as one perspective, not absolute truth
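
One way to act on the testing recommendation is to run each generated solution together with a few assertions in a separate interpreter process. The sketch below is illustrative only: `passes_tests`, the sample solution, and the test strings are hypothetical, and a subprocess with a timeout is not a real sandbox, so untrusted code should still be isolated properly.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(solution_code: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """Return True if the solution followed by its assertions exits cleanly."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Treat infinite loops or very slow code as a failure.
            return False
        return result.returncode == 0


# Hypothetical model output (code part only) and hand-written checks.
generated = '''
def longest_common_prefix(strings):
    if not strings:
        return ""
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix
'''

tests = '''
assert longest_common_prefix(["flower", "flow", "flight"]) == "fl"
assert longest_common_prefix(["dog", "car"]) == ""
assert longest_common_prefix([]) == ""
'''

print(passes_tests(generated, tests))
```

Including edge cases such as the empty list in the assertions is what catches solutions that look plausible but fail outside the happy path.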

## Training Details

### Training Data

- **Source datasets**: MBPP (Mostly Basic Programming Problems) and CodeAlpaca
- **Dataset size**: 1,000 Python programming problems
- **Data generation**: Synthetic step-by-step reasoning generated using Llama 3.1 8B Instant via Groq API
- **Data structure**: Each example contains:
  - Original programming problem
  - Step-by-step reasoning (problem understanding, algorithm design, implementation strategy)
  - Python solution
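
The three-part record structure above could be serialized into a single training string roughly as follows. The field names (`problem`, `reasoning`, `solution`) and the section headers in the template are assumptions for illustration; this card does not specify the actual prompt format used during fine-tuning.

```python
from dataclasses import dataclass


@dataclass
class Example:
    problem: str    # original programming problem
    reasoning: str  # step-by-step reasoning from the teacher model
    solution: str   # Python solution


def format_example(ex: Example) -> str:
    """Join the three fields so that reasoning always precedes the code."""
    return (
        f"### Problem\n{ex.problem.strip()}\n\n"
        f"### Reasoning\n{ex.reasoning.strip()}\n\n"
        f"### Solution\n{ex.solution.strip()}\n"
    )


sample = Example(
    problem="Write a function that returns the square of a number.",
    reasoning="Step 1: Take one numeric argument.\nStep 2: Multiply it by itself.",
    solution="def square(x):\n    return x * x",
)
print(format_example(sample))
```

Placing the reasoning before the solution in every record is what lets a causal language model learn to produce the explanation first.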

### Training Procedure

#### Fine-tuning Method

- **Technique**: QLoRA (Quantized Low-Rank Adaptation)
- **Quantization**: 4-bit quantization for memory efficiency
- **LoRA Configuration**:
  - Rank (r): 8
  - Alpha: 16
  - Target modules: q_proj, k_proj, v_proj, o_proj (attention layers)
  - Dropout: 0.05
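
As a rough sketch, the settings listed above map onto PEFT and bitsandbytes configuration objects as shown below. The rank, alpha, dropout, and target modules come from this card; `bias`, `task_type`, and the NF4 quantization type are assumptions, since the card only states "4-bit quantization". The library imports are deferred so the plain dictionaries can be inspected without `peft` or `transformers` installed.

```python
# Values taken from the LoRA configuration listed in this card.
LORA_KWARGS = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    # Assumptions, not stated in this card:
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

# Assumed 4-bit settings; only "4-bit" itself comes from this card.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
}


def build_configs():
    """Build the config objects (requires peft and transformers)."""
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    return LoraConfig(**LORA_KWARGS), BitsAndBytesConfig(**QUANT_KWARGS)


# Standard LoRA scales the adapter update by lora_alpha / r.
scaling = LORA_KWARGS["lora_alpha"] / LORA_KWARGS["r"]
print(scaling)  # 2.0
```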

#### Training Hyperparameters

- **Training epochs**: 3
- **Learning rate**: 2e-4
- **Optimizer**: paged_adamw_8bit
- **Batch size**: [Specify if known]
- **Training regime**: 4-bit quantized base weights with LoRA adapters (QLoRA); compute precision not specified
- **Hardware**: Google Colab T4 GPU (free tier)
- **Framework**: PEFT 0.17.1, Transformers, bitsandbytes
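
The hyperparameters above could be passed to the Transformers `TrainingArguments` class along these lines. This is a sketch, not the original training script: the output directory is hypothetical, `fp16` is an assumption based on the T4 lacking native bfloat16 support, and the batch size is left at the library default because the card does not state it.

```python
TRAINING_KWARGS = {
    "output_dir": "qwen15-code-reasoning-lora",  # hypothetical path
    "num_train_epochs": 3,
    "learning_rate": 2e-4,
    "optim": "paged_adamw_8bit",
    "fp16": True,  # assumption, not stated in this card
}


def build_training_args():
    """Build the arguments object (requires transformers and a CUDA GPU for fp16)."""
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)


print(TRAINING_KWARGS["optim"], TRAINING_KWARGS["learning_rate"])
```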

#### Training Time

- Approximately [X hours] on Google Colab T4 GPU

## Evaluation

### Testing Data & Metrics

#### Testing Data

- **Test set size**: 7 diverse Python programming problems
- **Problem types**: Mix of algorithmic challenges from the training distribution

#### Metrics

- **Primary metric**: Pass@1 (functional correctness: the share of problems for which the single generated solution runs correctly)
- **Secondary metric**: Reasoning structure presence (does output include step-by-step explanation?)
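
With one generated sample per problem, Pass@1 reduces to the fraction of problems whose solution passes, and the reasoning metric is the fraction of outputs that contain the expected structure. The sketch below shows both calculations; the marker strings used to detect reasoning are hypothetical, as the card does not say how structure was checked.

```python
from __future__ import annotations


def pass_at_1(passed: list[bool]) -> float:
    """Fraction of problems whose single generated solution passed."""
    if not passed:
        raise ValueError("need at least one result")
    return sum(passed) / len(passed)


def has_reasoning(output: str, markers: tuple[str, ...] = ("Step 1", "Step 2")) -> bool:
    """Crude structure check: every marker must appear in the output."""
    return all(marker in output for marker in markers)


def reasoning_rate(outputs: list[str]) -> float:
    """Fraction of outputs that contain the expected reasoning markers."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(has_reasoning(o) for o in outputs) / len(outputs)


# Seven problems, all solved on the first attempt.
print(pass_at_1([True] * 7))  # 1.0
```

On a test set this small, each problem shifts the score by about 14 percentage points, which is why a larger benchmark gives a more stable estimate.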

### Results

| Metric | Base Model (Qwen 1.5 1.8B) | Fine-tuned Model |
|--------|---------------------------|------------------|
| Pass@1 | 75% | 100% |
| Reasoning Structure | Inconsistent | 100% |

**Key Findings**:
- **+25 percentage point improvement** in functional correctness
- **100% of outputs** now include structured step-by-step reasoning
- All 7 test cases passed successfully

**Important Note**: Results are based on a small test set (7 examples). Larger-scale evaluation needed to confirm generalization.

## Environmental Impact

- **Hardware Type**: NVIDIA T4 GPU (Google Colab)
- **Hours used**: ~[X hours for fine-tuning]
- **Cloud Provider**: Google Cloud Platform
- **Compute Region**: [Specify if known]
- **Carbon Emitted**: Not measured; expected to be small because QLoRA fine-tuning ran on a single T4 GPU

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

## Technical Specifications

### Model Architecture

- **Base architecture**: Qwen 1.5 1.8B (Transformer decoder)
- **Fine-tuning method**: LoRA adapters on attention layers
- **Total parameters**: 1.8B (base) + ~4.7M (LoRA adapters)
- **Trainable parameters**: ~4.7M (0.26% of total)
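
For reference, the number of LoRA parameters can be estimated from the rank and the shapes of the adapted weight matrices: each adapted matrix of shape `d_in x d_out` adds `r * (d_in + d_out)` parameters. The dimensions used below (hidden size 2048, 24 decoder layers, square attention projections) are assumptions about the base model's configuration, not values taken from this card. If the estimate differs from the figure listed above, the training run may have used settings other than those assumed here; PEFT's `model.print_trainable_parameters()` reports the actual count for a loaded adapter.

```python
def lora_param_count(rank: int, num_layers: int, module_shapes: list) -> int:
    """Total LoRA parameters: rank * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes)
    return num_layers * per_layer


# Assumed base-model dimensions (not from this card).
hidden_size = 2048
num_layers = 24

# q_proj, k_proj, v_proj, o_proj, each assumed to be hidden_size x hidden_size.
attention_shapes = [(hidden_size, hidden_size)] * 4

estimate = lora_param_count(rank=8, num_layers=num_layers, module_shapes=attention_shapes)
print(f"{estimate:,}")  # 3,145,728
```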

### Compute Infrastructure

#### Hardware

- GPU: NVIDIA T4 (16GB VRAM)
- Platform: Google Colab (free tier)

#### Software

- PEFT 0.17.1
- Transformers
- bitsandbytes (for 4-bit quantization)
- PyTorch
- Groq API (for synthetic data generation)

## Project Insights

### What Worked Well

- Cross-model knowledge distillation (8B teacher → 1.8B student)
- QLoRA enabled fine-tuning on free-tier GPU
- Structured prompts for synthetic data generation
- Teaching reasoning process alongside code generation

### Future Improvements

1. **Better teacher model**: Use Llama 3.1 70B for higher-quality synthetic data
2. **Data validation**: Verify all generated code executes correctly before training
3. **Larger dataset**: Scale to 5,000-10,000 examples
4. **Robust evaluation**: Test on 50-100 problems from benchmarks like HumanEval
5. **Higher LoRA rank**: Experiment with rank 16 or 32 for more capacity

## Citation

If you use this model, please cite:

```bibtex
@misc{qwen15-code-reasoning,
  author = {Rachit Verma},
  title = {Qwen 1.5 1.8B Fine-tuned for Python Code Generation with Reasoning},
  year = {2025},
  publisher = {HuggingFace},
}
```

## Model Card Authors

Rachit Verma