|
|
--- |
|
|
language: |
|
|
- zh |
|
|
- en |
|
|
license: apache-2.0 |
|
|
library_name: transformers |
|
|
tags: |
|
|
- code |
|
|
- qwen |
|
|
- lora |
|
|
- repository-understanding |
|
|
- code-assistant |
|
|
- fine-tuning |
|
|
- multi-agent-systems |
|
|
base_model: Qwen/Qwen3-8B |
|
|
datasets: |
|
|
- custom |
|
|
metrics: |
|
|
- accuracy |
|
|
- code_understanding |
|
|
pipeline_tag: text-generation |
|
|
model-index: |
|
|
- name: code_repo_finetuning |
|
|
results: |
|
|
- task: |
|
|
type: text-generation |
|
|
name: Code Repository Understanding |
|
|
metrics: |
|
|
- type: accuracy |
|
|
value: 71.5 |
|
|
name: Overall Score |
|
|
- type: improvement |
|
|
value: 22.1 |
|
|
name: Improvement over Base Model |
|
|
--- |
|
|
|
|
|
# Fine-tune any base model (e.g. Qwen3-8B) on any given code repository
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is a fine-tuned version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B), trained specifically to understand and answer questions about a given project repository (including private or newly created ones). The example repository here is [Laddr](https://github.com/AgnetLabs/Laddr), a framework for building scalable multi-agent systems.
|
|
|
|
|
The fine-tuning was performed using **LoRA (Low-Rank Adaptation)** with an innovative training data generation approach that **does not rely on LLM-generated synthetic data**, avoiding circular dependencies and hallucination issues. |
|
|
|
|
|
### Key Features |
|
|
|
|
|
- ✅ **Project-Specific Knowledge**: Deep understanding of Laddr's architecture, codebase, and APIs |
|
|
- ✅ **Code Location**: Accurately locates functions, classes, and modules (+30% improvement) |
|
|
- ✅ **Code Understanding**: Explains code functionality with detailed context (+19.3% improvement) |
|
|
- ✅ **Maintains General Abilities**: Retains base model's general knowledge capabilities |
|
|
- ✅ **Zero Hallucination Training Data**: Generated from real code via AST parsing, not LLM synthesis |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Base Model |
|
|
- **Model**: Qwen/Qwen3-8B |
|
|
- **Parameters**: 8 Billion |
|
|
- **Architecture**: Transformer-based causal language model |
|
|
|
|
|
### Fine-tuning Specifications |
|
|
- **Method**: LoRA (Low-Rank Adaptation); a configuration sketch is shown after this list
|
|
- **LoRA Rank**: 64 |
|
|
- **LoRA Alpha**: 128 |
|
|
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
|
|
- **Training Framework**: DeepSpeed ZeRO-3 |
|
|
- **Precision**: BF16 |
|
|
- **Epochs**: 3 |
|
|
- **Training Samples**: 650+ |
|
|
- **Training Time**: ~2-3 hours on 2x GPUs (48GB each) |
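
For reference, below is a minimal sketch of the corresponding LoRA configuration, assuming the PEFT library's `LoraConfig`/`get_peft_model` API; the rank, alpha, and target modules are taken from the list above, while `lora_dropout` is an illustrative value not specified in this card.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model in its native precision (training used BF16).
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto")

lora_config = LoraConfig(
    r=64,                # LoRA rank
    lora_alpha=128,      # LoRA alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,   # illustrative; not specified in this card
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```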
|
|
|
|
|
### Training Data |
|
|
|
|
|
The training dataset was **automatically generated** from the Laddr repository using the following (a simplified extraction sketch follows the list):
|
|
- **Python AST parsing** for code structure extraction |
|
|
- **Real docstrings** and code comments |
|
|
- **Function signatures** and parameter information |
|
|
- **Call graph relationships** |
|
|
- **Project statistics** and module structure |
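
As an illustration of the AST-based extraction step, here is a simplified sketch (not the project's actual analyzer); it collects names, locations, and docstrings from every Python file in a local repository checkout.

```python
import ast
from pathlib import Path

def extract_code_elements(py_file: Path) -> list[dict]:
    """Collect function/class names, locations, and docstrings from one file."""
    tree = ast.parse(py_file.read_text(encoding="utf-8"))
    elements = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            elements.append({
                "name": node.name,
                "kind": type(node).__name__,
                "file": str(py_file),
                "lineno": node.lineno,
                "docstring": ast.get_docstring(node) or "",
            })
    return elements

# Walk a local clone of the repository and gather all code elements.
repo_root = Path("Laddr")  # hypothetical path to a local checkout
all_elements = [e for f in repo_root.rglob("*.py") for e in extract_code_elements(f)]
```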
|
|
|
|
|
**Data Composition**: |
|
|
- Code Explanation: 300+ samples (46%) |
|
|
- API Usage: 150+ samples (23%) |
|
|
- Code Location: 100+ samples (15%) |
|
|
- Project Overview: 50+ samples (8%) |
|
|
- Design Proposals: 50+ samples (8%) |
|
|
|
|
|
**Data Split** (a reproducible split sketch follows the list):
|
|
- Training: 80% (520+ samples) |
|
|
- Validation: 10% (65+ samples) |
|
|
- Test: 10% (65+ samples) |
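
A minimal sketch of producing such a deterministic 80/10/10 split is shown below; the seed and function name are illustrative, not taken from the project.

```python
import random

def split_dataset(samples, seed=42, train_frac=0.8, val_frac=0.1):
    """Shuffle once with a fixed seed, then slice into train/val/test."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val], shuffled[n_train + n_val:]
```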
|
|
|
|
|
## Performance |
|
|
|
|
|
### Overall Results |
|
|
|
|
|
| Metric | Base Model | Fine-tuned | Improvement | |
|
|
|--------|-----------|-----------|-------------| |
|
|
| **Overall Score** | 49.4% | 71.5% | **+22.1%** ✅ | |
|
|
| Code Location | 60.0% | 90.0% | **+30.0%** ⭐ | |
|
|
| Code Understanding | 59.3% | 78.6% | +19.3% | |
|
|
| Project Overview | 35.0% | 51.7% | +16.7% | |
|
|
| General Knowledge | 10.0% | 30.0% | +20.0% | |
|
|
|
|
|
### Detailed Performance by Task Type |
|
|
|
|
|
**Code Location Tasks** (+30.0%): |
|
|
- Accurately identifies file locations of functions/classes |
|
|
- Provides complete file paths with line numbers |
|
|
- Eliminates uncertainty in location queries |
|
|
|
|
|
**Code Understanding Tasks** (+19.3%): |
|
|
- Explains code functionality with context |
|
|
- Includes function signatures and parameters |
|
|
- Extracts and presents real docstrings |
|
|
|
|
|
**Project Overview Tasks** (+16.7%): |
|
|
- Describes project purpose and architecture |
|
|
- Lists technology stack and dependencies |
|
|
- Provides project statistics and structure |
|
|
|
|
|
**General Knowledge** (+20.0%): |
|
|
- Maintains ability to answer general programming questions |
|
|
- Improves conciseness (from verbose to clear) |
|
|
- Preserves base model capabilities |
|
|
|
|
|
## Use Cases |
|
|
|
|
|
### 1. Code Assistant for Laddr Project |
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tensense/code_repo_finetuning"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Load in the checkpoint's native precision and place layers automatically (requires `accelerate`).
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Build a chat-formatted prompt for a repository-specific question.
question = "Where is the create_system_tools function located in Laddr?"
messages = [{"role": "user", "content": question}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens before decoding the answer.
outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
print(response)
|
|
``` |
|
|
|
|
|
### 2. API Documentation Assistant |
|
|
Query about function usage, parameters, and return values. |
|
|
|
|
|
### 3. Codebase Onboarding |
|
|
Help new developers quickly understand the Laddr project structure and key components. |
|
|
|
|
|
### 4. Code Review Support |
|
|
Provide context and explanations for code review processes. |
|
|
|
|
|
## Example Queries |
|
|
|
|
|
### Query 1: Code Location |
|
|
**Question**: "Where is `create_system_tools` located in Laddr?" |
|
|
|
|
|
**Response**: |
|
|
> `create_system_tools` is located in `lib/laddr/src/laddr/core/system_tools.py`. |
|
|
> |
|
|
> **Function Description**: |
|
|
> Create system tools for task delegation and artifact management. This function checks for user-provided overrides before registering the default implementations. |
|
|
> |
|
|
> **Function Signature**: |
|
|
> ```python |
|
|
> def create_system_tools(message_bus, storage_backend=None, agent=None) -> dict[str, tuple[Any, list[str]]]: |
|
|
> ``` |
|
|
> |
|
|
> **Parameters**: |
|
|
> - `message_bus` (Any): Message queue backend |
|
|
> - `storage_backend` (Any): Optional storage backend |
|
|
> - `agent` (Any): Optional agent instance |
|
|
> |
|
|
> **Returns**: `dict[str, tuple[Any, list[str]]]` |
|
|
|
|
|
### Query 2: Project Overview |
|
|
**Question**: "What is Laddr's main functionality?" |
|
|
|
|
|
**Response**: |
|
|
> Laddr is a Python project that serves as a framework for building scalable multi-agent systems. |
|
|
> |
|
|
> **Main Technologies**: bandit, black, boto3-stubs, build, click, etc. |
|
|
> |
|
|
> The project contains 279 code elements, consisting of 67 classes and 212 functions/methods. |
|
|
> |
|
|
> **Core Modules**: |
|
|
> - `core` (279 elements) |
|
|
> - `cli` (52 elements) |
|
|
> - `llms` (39 elements) |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Project-Specific**: Optimized for Laddr project; may not perform as well on other codebases |
|
|
- **Knowledge Cutoff**: Based on the Laddr repository as of training time (2025-01) |
|
|
- **Language Focus**: Primarily trained on Python code and English/Chinese documentation |
|
|
- **Limited General Coding**: While it maintains general knowledge, it's optimized for Laddr-specific queries |
|
|
|
|
|
## Training Methodology |
|
|
|
|
|
### Innovation: LLM-Free Training Data Generation |
|
|
|
|
|
Unlike traditional approaches that use LLMs to generate synthetic training data, this project employs a novel methodology: |
|
|
|
|
|
1. **AST-Based Code Parsing**: Python Abstract Syntax Tree analysis extracts accurate code structure |
|
|
2. **Real Documentation**: Utilizes actual docstrings, comments, and code signatures |
|
|
3. **Call Graph Analysis**: Builds function dependency relationships |
|
|
4. **Pattern Extraction**: Identifies code patterns (implementation, usage, interaction) |
|
|
5. **Template-Based QA**: Generates question-answer pairs using templates filled with real code context (see the sketch below)
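
A simplified sketch of step 5, turning one extracted code element into a training pair; the template wording and field names are illustrative (they mirror the extraction sketch earlier in this card, not the project's actual generator).

```python
def make_location_qa(element: dict) -> dict:
    """Build a chat-style QA sample from one extracted code element."""
    question = f"Where is the `{element['name']}` function located in Laddr?"
    answer = (
        f"`{element['name']}` is located in `{element['file']}` (line {element['lineno']}).\n\n"
        f"**Function Description**: {element['docstring'] or 'No docstring available.'}"
    )
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }
```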
|
|
|
|
|
**Benefits**: |
|
|
- ✅ Avoids circular dependencies (using LLM-generated data to train an LLM)
|
|
- ✅ Eliminates hallucination in training data |
|
|
- ✅ Ensures factual accuracy |
|
|
- ✅ Provides complete reasoning traces |
|
|
|
|
|
### Training Pipeline |
|
|
|
|
|
``` |
|
|
GitHub Repository |
|
|
↓ |
|
|
[1. Repository Analyzer] |
|
|
→ Extracts code elements, patterns, call graph |
|
|
↓ |
|
|
[2. Data Generator] |
|
|
→ Creates QA pairs with code context |
|
|
↓ |
|
|
[3. Model Fine-tuner] |
|
|
→ LoRA + DeepSpeed ZeRO-3 training |
|
|
↓ |
|
|
[4. LoRA Merger] |
|
|
→ Merges adapter into base model |
|
|
↓ |
|
|
[5. Model Evaluator] |
|
|
→ Compares base vs fine-tuned |
|
|
↓ |
|
|
Fine-tuned Model |
|
|
``` |
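
For step 4, merging the trained adapter into the base model typically uses PEFT's `merge_and_unload`; a minimal sketch is shown below, with placeholder paths for the adapter and output directories.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, attach the trained LoRA adapter, then fold the adapter weights in.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "output/lora_adapter")  # placeholder adapter path
merged = model.merge_and_unload()

# Save the standalone merged model together with its tokenizer.
merged.save_pretrained("output/merged_model")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("output/merged_model")
```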
|
|
|
|
|
## Extensibility |
|
|
|
|
|
The training methodology is **repository-agnostic** and can be applied to any codebase: |
|
|
|
|
|
### Adapt to Your Repository |
|
|
|
|
|
```bash |
|
|
# 1. Update configuration |
|
|
python utils/config_manager.py https://github.com/your-org/your-repo |
|
|
|
|
|
# 2. Analyze repository |
|
|
python scripts/01_analyze_repo.py |
|
|
|
|
|
# 3. Generate training data |
|
|
python scripts/02_generate_data.py |
|
|
|
|
|
# 4. Fine-tune model |
|
|
deepspeed --num_gpus=2 scripts/03_train_model.py |
|
|
|
|
|
# 5. Merge LoRA weights |
|
|
python scripts/04_merge_weights.py |
|
|
|
|
|
# 6. Evaluate |
|
|
python scripts/05_evaluate.py |
|
|
``` |
|
|
|
|
|
**Supported Languages** (currently): |
|
|
- Python (primary) |
|
|
- Markdown (documentation) |
|
|
|
|
|
**Extensible to**: |
|
|
- JavaScript/TypeScript |
|
|
- Java |
|
|
- Go |
|
|
- Rust |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
- **Code Attribution**: All training data comes from the open-source Laddr repository |
|
|
- **License Compliance**: Respects Apache 2.0 license of both base model and Laddr project |
|
|
- **No Private Data**: Only uses publicly available code |
|
|
- **Reproducibility**: Complete methodology documented for transparency |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model or methodology in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{qwen3-code-repo-finetuned-2025, |
|
|
title={Finetune any base model (e.g. Qwen3-8B) on any given code repository}, |
|
|
author={Tensense}, |
|
|
year={2025}, |
|
|
publisher={HuggingFace}, |
|
|
url={https://huggingface.co/tensense/code_repo_finetuning} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **Base Model**: [Qwen Team](https://huggingface.co/Qwen) for Qwen3-8B |
|
|
- **Laddr Project**: [AgnetLabs](https://github.com/AgnetLabs/Laddr) for the multi-agent framework |
|
|
- **Training Framework**: HuggingFace Transformers, DeepSpeed, PEFT (LoRA) |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the **Apache 2.0 License**, consistent with: |
|
|
- Qwen3-8B base model license |
|
|
- Laddr project license |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
Tensense
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions or issues, please contact: |
|
|
- Email: xu@tensense.org |
|
|
- GitHub: [TopologyApplied](https://github.com/TopologyApplied) |
|
|
- HuggingFace: [tensense](https://huggingface.co/tensense) |
|
|
|
|
|
--- |
|
|
|
|
|
## Additional Resources |
|
|
|
|
|
- **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) |
|
|
- **Training Code**: [GitHub Repository](https://github.com/TopologyApplied/code_repo_finetuning) |
|
|
- **Checkpoint & Finetuned Model**: [Huggingface](https://huggingface.co/tensense/code_repo_finetuning) |
|
|
- **Laddr Project**: [GitHub](https://github.com/AgnetLabs/Laddr) |
|
|
- **Evaluation Report**: [comparison_report_Laddr.json](https://github.com/TopologyApplied/code_repo_finetuning/blob/main/output/comparison_report_Laddr.json)
|
|
- **Design Documentation**: [design document (Chinese)](https://github.com/TopologyApplied/code_repo_finetuning/blob/main/代码仓库智能训练数据生成系统_设计文档.md)
|
|
|
|
|
## Version History |
|
|
|
|
|
### v1.0 (2025-11-15) |
|
|
- Initial release |
|
|
- Fine-tuned on Laddr repository |
|
|
- 650+ training samples |
|
|
- LoRA rank 64, alpha 128 |
|
|
- 3 epochs training |
|
|
- Overall improvement: +22.1% |
|
|
|
|
|
--- |
|
|
|
|
|
**Note**: This is a demonstration of repository-specific fine-tuning methodology. The approach can be adapted to any codebase for creating custom code assistants. |
|
|
|