---
license: apache-2.0
---
# JT-Coder-8B-Instruct

<p align="center">
  <a href="#" target="_blank">
    <img src="https://img.shields.io/badge/Paper-ArXiv-red">
  </a>
  <a href="https://huggingface.co/JT-LM/JT-Coder-8B-Instruct" target="_blank">
    <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue">
  </a>
  <a href="./LICENSE" target="_blank">
    <img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-yellow.svg">
  </a>
</p>

**JT-Coder** is a series of **high-performance, energy-efficient** code large language models (LLMs) developed by the JiuTian team. Our core philosophy is that **high-quality data matters more than massive quantities of data**. Thanks to our data-centric framework, JT-Coder is pre-trained on only **1.6T** tokens yet outperforms multiple similarly sized models trained on roughly 4x as much data, offering a more efficient and reproducible path for developing code LLMs.



*Figure 1: Performance of JT-Coder-8B-Instruct on multiple mainstream code generation benchmarks.*

## Core Features

- 🚀 **State-of-the-Art Performance**: At both the 1.5B and 8B scales, JT-Coder matches or surpasses top open-source models on multiple code generation and comprehension benchmarks, including `EvalPlus`, `BigCodeBench`, `LiveCodeBench`, and `FullstackBench`.

- 🧠 **Extreme Data Efficiency**: Pre-training used only **1.6T** high-quality tokens. Compared to similar models typically trained on 5-6T tokens, this is roughly a **4x** gain in data efficiency and demonstrates the value of our data processing pipeline.

- 💡 **Innovative Data-Centric Framework**:

  - **Pre-training Phase**: We meticulously cleaned open-source code data, filtering out low-quality and sensitive content. We also recovered and enriched high-value data such as Jupyter Notebooks, and synthesized large-scale, context-rich Q&A data and programming guides.

  - **Instruction Tuning Phase**: We pioneered an **"Instruction Evolution"** technique: it reverse-engineers the model's effective outputs for simple instructions, turning implicit characteristics of the code (e.g., algorithm selection, error handling) into explicit, complex instruction constraints, which significantly enriches the diversity and complexity of the instruction data (see the sketch below).

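To make the idea concrete, here is a toy sketch of one instruction-evolution step. It is illustrative only, not our released pipeline; `llm(prompt) -> str` is a hypothetical stand-in for any chat-completion call.

```python
def evolve_instruction(llm, simple_instruction: str) -> str:
    """Toy sketch of instruction evolution; `llm` is a hypothetical completion callable."""
    # 1. Let the model solve the simple instruction with code.
    solution = llm(f"Solve this task with code:\n{simple_instruction}")
    # 2. Reverse-engineer implicit properties of the solution
    #    (algorithm choice, error handling, I/O conventions, ...).
    constraints = llm(f"List the implicit design decisions made in this code:\n{solution}")
    # 3. Fold those properties back into an explicit, more complex instruction.
    return llm(
        f"Rewrite the instruction '{simple_instruction}' so that it explicitly "
        f"requires these constraints:\n{constraints}"
    )
```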

## Model List

We have released the following pre-trained base models and instruction-tuned models:

| Model Name | Type | Size |
| ------------------------ | -------- | ---- |
| `JT-Coder-8B-Instruct` **(you are here!)** | Instruct | 8B |
| `JT-Coder-8B-Base` | Base | 8B |
| `JT-Coder-1.5B-Instruct` | Instruct | 1.5B |
| `JT-Coder-1.5B-Base` | Base | 1.5B |

## Quick Start: Inference with Transformers

You can easily run our models using the standard `transformers` library.

### 1. Install Dependencies

```bash
pip install torch transformers accelerate
```

The script below optionally enables FlashAttention-2; to use it, also install the `flash-attn` package.

### 2. Inference Code Example

Below is an example Python script for running inference with `JT-Coder-8B-Instruct`.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# --- 1. Configure Model Path and Device ---
# Model ID on the Hugging Face Hub
model_path = "JT-LM/JT-Coder-8B-Instruct"
# Automatically select the device (GPU preferred)
device = "cuda" if torch.cuda.is_available() else "cpu"

# --- 2. Load Tokenizer and Model ---
# trust_remote_code=True is required because the model uses a custom architecture
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # Optional: requires the flash-attn package; drop this line to use default attention
    attn_implementation="flash_attention_2",
).to(device)
model.eval()

# --- 3. Construct Dialogue Input ---
# Represent the dialogue history as a list of {role, content} dictionaries
messages = [
    {"role": "user", "content": "Please write a Python function to calculate the nth Fibonacci number, including detailed comments."},
]

# --- 4. Format and Encode with apply_chat_template ---
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(device)

# --- 5. Perform Inference ---
# Set generation parameters
generation_params = {
    "max_new_tokens": 2048,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.85,
    "top_k": 20,
}

# Generate the response
with torch.no_grad():
    outputs = model.generate(inputs, **generation_params)

# --- 6. Decode and Print Results ---
# Decode only the newly generated tokens, skipping the original prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

print("--- User Query ---")
print(messages[0]['content'])
print("\n--- Model Response ---")
print(response)
```
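
For interactive use, you can stream tokens to the terminal as they are generated instead of waiting for the full completion. Below is a minimal sketch using the `TextStreamer` utility bundled with `transformers`, reusing the `model`, `tokenizer`, `inputs`, and `generation_params` objects defined in the script above:

```python
from transformers import TextStreamer

# Print decoded tokens to stdout as they are generated, omitting the prompt itself
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

with torch.no_grad():
    model.generate(inputs, streamer=streamer, **generation_params)
```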

## License

The source code in this project is licensed under the [Apache 2.0 license](LICENSE). Distribution and use of the model weights are subject to their respective licensing agreements.

## Disclaimer

JT-Coder is a large language model. Although it has undergone rigorous data filtering and training, it may still generate inaccurate, biased, or otherwise harmful content. Evaluate the model's output carefully; users are responsible for any consequences arising from its use.