|
|
--- |
|
|
language: |
|
|
- zh |
|
|
- en |
|
|
license: apache-2.0 |
|
|
library_name: transformers |
|
|
tags: |
|
|
- code |
|
|
- qwen |
|
|
- lora |
|
|
- repository-understanding |
|
|
- code-assistant |
|
|
- fine-tuning |
|
|
- multi-agent-systems |
|
|
base_model: Qwen/Qwen3-8B |
|
|
datasets: |
|
|
- custom |
|
|
metrics: |
|
|
- accuracy |
|
|
- code_understanding |
|
|
pipeline_tag: text-generation |
|
|
model-index: |
|
|
- name: code_repo_finetuning |
|
|
results: |
|
|
- task: |
|
|
type: text-generation |
|
|
name: Code Repository Understanding |
|
|
metrics: |
|
|
- type: accuracy |
|
|
value: 71.5 |
|
|
name: Overall Score |
|
|
- type: improvement |
|
|
value: 22.1 |
|
|
name: Improvement over Base Model |
|
|
--- |
|
|
|
|
|
# Fine-tune any base model (e.g. Qwen3-8B) on any given code repository
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model is a fine-tuned version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B), trained specifically to understand and answer questions about a given project repository (including private or newly created ones). The example repository here is [Laddr](https://github.com/AgnetLabs/Laddr), a framework for building scalable multi-agent systems.
|
|
|
|
|
The fine-tuning was performed using **LoRA (Low-Rank Adaptation)** with an innovative training data generation approach that **does not rely on LLM-generated synthetic data**, avoiding circular dependencies and hallucination issues. |
|
|
|
|
|
### Key Features |
|
|
|
|
|
- ✅ **Project-Specific Knowledge**: Deep understanding of Laddr's architecture, codebase, and APIs |
|
|
- ✅ **Code Location**: Accurately locates functions, classes, and modules (+30% improvement) |
|
|
- ✅ **Code Understanding**: Explains code functionality with detailed context (+19.3% improvement) |
|
|
- ✅ **Maintains General Abilities**: Retains base model's general knowledge capabilities |
|
|
- ✅ **Zero Hallucination Training Data**: Generated from real code via AST parsing, not LLM synthesis |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Base Model |
|
|
- **Model**: Qwen/Qwen3-8B |
|
|
- **Parameters**: 8 Billion |
|
|
- **Architecture**: Transformer-based causal language model |
|
|
|
|
|
### Fine-tuning Specifications |
|
|
- **Method**: LoRA (Low-Rank Adaptation); a configuration sketch is shown after this list
|
|
- **LoRA Rank**: 64 |
|
|
- **LoRA Alpha**: 128 |
|
|
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
|
|
- **Training Framework**: DeepSpeed ZeRO-3 |
|
|
- **Precision**: BF16 |
|
|
- **Epochs**: 3 |
|
|
- **Training Samples**: 650+ |
|
|
- **Training Time**: ~2-3 hours on 2x GPUs (48GB each) |
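
For reference, below is a minimal sketch of the corresponding LoRA configuration, assuming the PEFT library's `LoraConfig`/`get_peft_model` API; the rank, alpha, and target modules are taken from the list above, while `lora_dropout` is an illustrative value not specified in this card.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model in its native precision (training used BF16).
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto")

lora_config = LoraConfig(
    r=64,                # LoRA rank
    lora_alpha=128,      # LoRA alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,   # illustrative; not specified in this card
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```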
|
|
|
|
|
### Training Data |
|
|
|
|
|
The training dataset was **automatically generated** from the Laddr repository using the following (a simplified extraction sketch follows the list):
|
|
- **Python AST parsing** for code structure extraction |
|
|
- **Real docstrings** and code comments |
|
|
- **Function signatures** and parameter information |
|
|
- **Call graph relationships** |
|
|
- **Project statistics** and module structure |
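
As an illustration of the AST-based extraction step, here is a simplified sketch (not the project's actual analyzer); it collects names, locations, and docstrings from every Python file in a local repository checkout.

```python
import ast
from pathlib import Path

def extract_code_elements(py_file: Path) -> list[dict]:
    """Collect function/class names, locations, and docstrings from one file."""
    tree = ast.parse(py_file.read_text(encoding="utf-8"))
    elements = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            elements.append({
                "name": node.name,
                "kind": type(node).__name__,
                "file": str(py_file),
                "lineno": node.lineno,
                "docstring": ast.get_docstring(node) or "",
            })
    return elements

# Walk a local clone of the repository and gather all code elements.
repo_root = Path("Laddr")  # hypothetical path to a local checkout
all_elements = [e for f in repo_root.rglob("*.py") for e in extract_code_elements(f)]
```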
|
|
|
|
|
**Data Composition**: |
|
|
- Code Explanation: 300+ samples (46%) |
|
|
- API Usage: 150+ samples (23%) |
|
|
- Code Location: 100+ samples (15%) |
|
|
- Project Overview: 50+ samples (8%) |
|
|
- Design Proposals: 50+ samples (8%) |
|
|
|
|
|
**Data Split** (a reproducible split sketch follows the list):
|
|
- Training: 80% (520+ samples) |
|
|
- Validation: 10% (65+ samples) |
|
|
- Test: 10% (65+ samples) |
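
A minimal sketch of producing such a deterministic 80/10/10 split is shown below; the seed and function name are illustrative, not taken from the project.

```python
import random

def split_dataset(samples, seed=42, train_frac=0.8, val_frac=0.1):
    """Shuffle once with a fixed seed, then slice into train/val/test."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val], shuffled[n_train + n_val:]
```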
|
|
|
|
|
## Performance |
|
|
|
|
|
### Overall Results |
|
|
|
|
|
| Metric | Base Model | Fine-tuned | Improvement | |
|
|
|--------|-----------|-----------|-------------| |
|
|
| **Overall Score** | 49.4% | 71.5% | **+22.1%** ✅ | |
|
|
| Code Location | 60.0% | 90.0% | **+30.0%** ⭐ | |
|
|
| Code Understanding | 59.3% | 78.6% | +19.3% | |
|
|
| Project Overview | 35.0% | 51.7% | +16.7% | |
|
|
| General Knowledge | 10.0% | 30.0% | +20.0% | |
|
|
|
|
|
### Detailed Performance by Task Type |
|
|
|
|
|
**Code Location Tasks** (+30.0%): |
|
|
- Accurately identifies file locations of functions/classes |
|
|
- Provides complete file paths with line numbers |
|
|
- Eliminates uncertainty in location queries |
|
|
|
|
|
**Code Understanding Tasks** (+19.3%): |
|
|
- Explains code functionality with context |
|
|
- Includes function signatures and parameters |
|
|
- Extracts and presents real docstrings |
|
|
|
|
|
**Project Overview Tasks** (+16.7%): |
|
|
- Describes project purpose and architecture |
|
|
- Lists technology stack and dependencies |
|
|
- Provides project statistics and structure |
|
|
|
|
|
**General Knowledge** (+20.0%): |
|
|
- Maintains ability to answer general programming questions |
|
|
- Improves conciseness (from verbose to clear) |
|
|
- Preserves base model capabilities |
|
|
|
|
|
## Use Cases |
|
|
|
|
|
### 1. Code Assistant for Laddr Project |
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tensense/code_repo_finetuning"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Load in the checkpoint's native precision and place layers automatically (requires `accelerate`).
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Build a chat-formatted prompt for a repository-specific question.
question = "Where is the create_system_tools function located in Laddr?"
messages = [{"role": "user", "content": question}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens before decoding the answer.
outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
print(response)
|
|
``` |
|
|
|
|
|
### 2. API Documentation Assistant |
|
|
Query about function usage, parameters, and return values. |
|
|
|
|
|
### 3. Codebase Onboarding |
|
|
Help new developers quickly understand the Laddr project structure and key components. |
|
|
|
|
|
### 4. Code Review Support |
|
|
Provide context and explanations for code review processes. |
|
|
|
|
|
## Example Queries |
|
|
|
|
|
### Query 1: Code Location |
|
|
**Question**: "Where is `create_system_tools` located in Laddr?" |
|
|
|
|
|
**Response**: |
|
|
> `create_system_tools` is located in `lib/laddr/src/laddr/core/system_tools.py`. |
|
|
> |
|
|
> **Function Description**: |
|
|
> Create system tools for task delegation and artifact management. This function checks for user-provided overrides before registering the default implementations. |
|
|
> |
|
|
> **Function Signature**: |
|
|
> ```python |
|
|
> def create_system_tools(message_bus, storage_backend=None, agent=None) -> dict[str, tuple[Any, list[str]]]: |
|
|
> ``` |
|
|
> |
|
|
> **Parameters**: |
|
|
> - `message_bus` (Any): Message queue backend |
|
|
> - `storage_backend` (Any): Optional storage backend |
|
|
> - `agent` (Any): Optional agent instance |
|
|
> |
|
|
> **Returns**: `dict[str, tuple[Any, list[str]]]` |
|
|
|
|
|
### Query 2: Project Overview |
|
|
**Question**: "What is Laddr's main functionality?" |
|
|
|
|
|
**Response**: |
|
|
> Laddr is a Python project that serves as a framework for building scalable multi-agent systems. |
|
|
> |
|
|
> **Main Technologies**: bandit, black, boto3-stubs, build, click, etc. |
|
|
> |
|
|
> The project contains 279 code elements, consisting of 67 classes and 212 functions/methods. |
|
|
> |
|
|
> **Core Modules**: |
|
|
> - `core` (279 elements) |
|
|
> - `cli` (52 elements) |
|
|
> - `llms` (39 elements) |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Project-Specific**: Optimized for Laddr project; may not perform as well on other codebases |
|
|
- **Knowledge Cutoff**: Based on the Laddr repository as of training time (2025-01) |
|
|
- **Language Focus**: Primarily trained on Python code and English/Chinese documentation |
|
|
- **Limited General Coding**: While it maintains general knowledge, it's optimized for Laddr-specific queries |
|
|
|
|
|
## Training Methodology |
|
|
|
|
|
### Innovation: LLM-Free Training Data Generation |
|
|
|
|
|
Unlike traditional approaches that use LLMs to generate synthetic training data, this project employs a novel methodology: |
|
|
|
|
|
1. **AST-Based Code Parsing**: Python Abstract Syntax Tree analysis extracts accurate code structure |
|
|
2. **Real Documentation**: Utilizes actual docstrings, comments, and code signatures |
|
|
3. **Call Graph Analysis**: Builds function dependency relationships |
|
|
4. **Pattern Extraction**: Identifies code patterns (implementation, usage, interaction) |
|
|
5. **Template-Based QA**: Generates question-answer pairs using templates filled with real code context (see the sketch below)
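
A simplified sketch of step 5, turning one extracted code element into a training pair; the template wording and field names are illustrative (they mirror the extraction sketch earlier in this card, not the project's actual generator).

```python
def make_location_qa(element: dict) -> dict:
    """Build a chat-style QA sample from one extracted code element."""
    question = f"Where is the `{element['name']}` function located in Laddr?"
    answer = (
        f"`{element['name']}` is located in `{element['file']}` (line {element['lineno']}).\n\n"
        f"**Function Description**: {element['docstring'] or 'No docstring available.'}"
    )
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }
```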
|
|
|
|
|
**Benefits**: |
|
|
- ✅ Avoids circular dependencies (using LLM-generated data to train an LLM)
|
|
- ✅ Eliminates hallucination in training data |
|
|
- ✅ Ensures factual accuracy |
|
|
- ✅ Provides complete reasoning traces |
|
|
|
|
|
### Training Pipeline |
|
|
|
|
|
``` |
|
|
GitHub Repository |
|
|
↓ |
|
|
[1. Repository Analyzer] |
|
|
→ Extracts code elements, patterns, call graph |
|
|
↓ |
|
|
[2. Data Generator] |
|
|
→ Creates QA pairs with code context |
|
|
↓ |
|
|
[3. Model Fine-tuner] |
|
|
→ LoRA + DeepSpeed ZeRO-3 training |
|
|
↓ |
|
|
[4. LoRA Merger] |
|
|
→ Merges adapter into base model |
|
|
↓ |
|
|
[5. Model Evaluator] |
|
|
→ Compares base vs fine-tuned |
|
|
↓ |
|
|
Fine-tuned Model |
|
|
``` |
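
For step 4, merging the trained adapter into the base model typically uses PEFT's `merge_and_unload`; a minimal sketch is shown below, with placeholder paths for the adapter and output directories.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, attach the trained LoRA adapter, then fold the adapter weights in.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "output/lora_adapter")  # placeholder adapter path
merged = model.merge_and_unload()

# Save the standalone merged model together with its tokenizer.
merged.save_pretrained("output/merged_model")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("output/merged_model")
```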
|
|
|
|
|
## Extensibility |
|
|
|
|
|
The training methodology is **repository-agnostic** and can be applied to any codebase: |
|
|
|
|
|
### Adapt to Your Repository |
|
|
|
|
|
```bash |
|
|
# 1. Update configuration |
|
|
python utils/config_manager.py https://github.com/your-org/your-repo |
|
|
|
|
|
# 2. Analyze repository |
|
|
python scripts/01_analyze_repo.py |
|
|
|
|
|
# 3. Generate training data |
|
|
python scripts/02_generate_data.py |
|
|
|
|
|
# 4. Fine-tune model |
|
|
deepspeed --num_gpus=2 scripts/03_train_model.py |
|
|
|
|
|
# 5. Merge LoRA weights |
|
|
python scripts/04_merge_weights.py |
|
|
|
|
|
# 6. Evaluate |
|
|
python scripts/05_evaluate.py |
|
|
``` |
|
|
|
|
|
**Supported Languages** (currently): |
|
|
- Python (primary) |
|
|
- Markdown (documentation) |
|
|
|
|
|
**Extensible to**: |
|
|
- JavaScript/TypeScript |
|
|
- Java |
|
|
- Go |
|
|
- Rust |
|
|
|
|
|
## Ethical Considerations |
|
|
|
|
|
- **Code Attribution**: All training data comes from the open-source Laddr repository |
|
|
- **License Compliance**: Respects Apache 2.0 license of both base model and Laddr project |
|
|
- **No Private Data**: Only uses publicly available code |
|
|
- **Reproducibility**: Complete methodology documented for transparency |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model or methodology in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{qwen3-code-repo-finetuned-2025, |
|
|
title={Finetune any base model (e.g. Qwen3-8B) on any given code repository}, |
|
|
author={Tensense}, |
|
|
year={2025}, |
|
|
publisher={HuggingFace}, |
|
|
url={https://huggingface.co/tensense/code_repo_finetuning} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **Base Model**: [Qwen Team](https://huggingface.co/Qwen) for Qwen3-8B |
|
|
- **Laddr Project**: [AgnetLabs](https://github.com/AgnetLabs/Laddr) for the multi-agent framework |
|
|
- **Training Framework**: HuggingFace Transformers, DeepSpeed, PEFT (LoRA) |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the **Apache 2.0 License**, consistent with: |
|
|
- Qwen3-8B base model license |
|
|
- Laddr project license |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
Tensense
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions or issues, please contact: |
|
|
- Email: xu@tensense.org |
|
|
- GitHub: [TopologyApplied](https://github.com/TopologyApplied) |
|
|
- HuggingFace: [tensense](https://huggingface.co/tensense) |
|
|
|
|
|
--- |
|
|
|
|
|
## Additional Resources |
|
|
|
|
|
- **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) |
|
|
- **Training Code**: [GitHub Repository](https://github.com/TopologyApplied/code_repo_finetuning) |
|
|
- **Checkpoint & Finetuned Model**: [Huggingface](https://huggingface.co/tensense/code_repo_finetuning) |
|
|
- **Laddr Project**: [GitHub](https://github.com/AgnetLabs/Laddr) |
|
|
- **Evaluation Report**: [comparison_report_Laddr.json](https://github.com/TopologyApplied/code_repo_finetuning/blob/main/output/comparison_report_Laddr.json)
|
|
- **Design Documentation**: [design document (Chinese)](https://github.com/TopologyApplied/code_repo_finetuning/blob/main/代码仓库智能训练数据生成系统_设计文档.md)
|
|
|
|
|
## Version History |
|
|
|
|
|
### v1.0 (2025-11-15) |
|
|
- Initial release |
|
|
- Fine-tuned on Laddr repository |
|
|
- 650+ training samples |
|
|
- LoRA rank 64, alpha 128 |
|
|
- 3 epochs training |
|
|
- Overall improvement: +22.1% |
|
|
|
|
|
--- |
|
|
|
|
|
**Note**: This is a demonstration of repository-specific fine-tuning methodology. The approach can be adapted to any codebase for creating custom code assistants. |
|
|
|