|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- de |
|
|
datasets: |
|
|
- stefan-it/nanochat-german-alpaca |
|
|
- argilla/databricks-dolly-15k-curated-multilingual |
|
|
- FreedomIntelligence/evol-instruct-deutsch |
|
|
- LSX-UniWue/Guanako |
|
|
- stefan-it/nanochat-german-openhermes |
|
|
- FreedomIntelligence/sharegpt-deutsch |
|
|
tags: |
|
|
- nanochat |
|
|
- german |
|
|
- v1 |
|
|
base_model: |
|
|
- stefan-it/nanochat-german-base |
|
|
--- |
|
|
|
|
|
# 🇩🇪 nanochat German: v1 |
|
|
|
|
|
<p align="left"> |
|
|
<picture> |
|
|
<img alt="nanochat German logo" src="https://raw.githubusercontent.com/stefan-it/nanochat-german/main/assets/nanochat-german.png" style="max-width: 75%;"> |
|
|
</picture> |
|
|
<br/> |
|
|
</p> |
|
|
|
|
|
This repository hosts the first German nanochat model. It was fine-tuned (during nanochat's mid-training phase) on various German SFT datasets.
|
|
|
|
|
💬 A demo Space for the model can be found [here](https://huggingface.co/spaces/stefan-it/nanochat-german-v1).
|
|
|
|
|
## Datasets |
|
|
|
|
|
The chat model was fine-tuned on the following datasets: |
|
|
|
|
|
* [German Alpaca](https://huggingface.co/datasets/stefan-it/nanochat-german-alpaca) |
|
|
* [German Dolly](https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual) |
|
|
* [German Evol Instruct](https://huggingface.co/datasets/FreedomIntelligence/evol-instruct-deutsch) |
|
|
* [German Guanako](https://huggingface.co/datasets/LSX-UniWue/Guanako) |
|
|
* [German Openhermes](https://huggingface.co/datasets/stefan-it/nanochat-german-openhermes) |
|
|
* [German ShareGPT](https://huggingface.co/datasets/FreedomIntelligence/sharegpt-deutsch) |
|
|
* German Spelling Tasks |
|
|
|
|
|
More information can be found in the corresponding [German nanochat repository](https://github.com/stefan-it/nanochat-german). |
|
|
|
|
|
## Fine-Tuning Stats |
|
|
|
|
|
- run: nanochat-german |
|
|
- device_type: |
|
|
- dtype: bfloat16 |
|
|
- num_iterations: -1 |
|
|
- max_seq_len: 2048 |
|
|
- device_batch_size: 32 |
|
|
- unembedding_lr: 0.0040 |
|
|
- embedding_lr: 0.2000 |
|
|
- matrix_lr: 0.0200 |
|
|
- init_lr_frac: 1.0000 |
|
|
- weight_decay: 0.0000 |
|
|
- eval_every: 150 |
|
|
- eval_tokens: 10,485,760 |
|
|
- total_batch_size: 524,288 |
|
|
- dry_run: 0 |
|
|
- Number of iterations: 346 |
|
|
- DDP world size: 8 |
|
|
- Minimum validation bpb: 0.6001 |
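
For context, the effective batch is counted in tokens: per-device sequences × sequence length × DDP world size. A minimal sanity check of the numbers above (assuming nanochat's token-based batch accounting):

```python
# Sanity check of the batch accounting above (assumption: nanochat counts
# total_batch_size in tokens, not sequences).
device_batch_size = 32      # sequences per device per micro-step
max_seq_len = 2048          # tokens per sequence
world_size = 8              # DDP ranks
total_batch_size = 524_288  # tokens per optimizer step

tokens_per_micro_step = device_batch_size * max_seq_len * world_size
grad_accum_steps = total_batch_size // tokens_per_micro_step

print(tokens_per_micro_step)  # 524288
print(grad_accum_steps)       # 1 -> one micro-step already fills the batch
```

With these settings, a single micro-step already covers the full 524,288-token batch, so no gradient accumulation is needed.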
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
We use `lm_eval` to measure and compare the model's performance against other language models in the same parameter range (note: this list is not exhaustive): |
|
|
|
|
|
<table class="model-comparison"> |
|
|
<thead> |
|
|
<tr> |
|
|
<th align="left">Model</th> |
|
|
<th align="center" colspan="2">arc_de</th> |
|
|
<th align="center" colspan="2">hellaswag_de</th> |
|
|
<th align="center">m_mmlu_de</th> |
|
|
<th align="center">truthfulqa_de_mc1</th> |
|
|
<th align="center">truthfulqa_de_mc2</th> |
|
|
</tr> |
|
|
<tr> |
|
|
<th></th> |
|
|
<th align="center">acc</th> |
|
|
<th align="center">acc_norm</th> |
|
|
<th align="center">acc</th> |
|
|
<th align="center">acc_norm</th> |
|
|
<th align="center">acc</th> |
|
|
<th align="center">acc</th> |
|
|
<th align="center">acc</th> |
|
|
</tr> |
|
|
</thead> |
|
|
<tbody> |
|
|
<tr> |
|
|
<td><a href="https://huggingface.co/stefan-it/nanochat-german-v1" target="_blank">nanochat German v1</a></td> |
|
|
<td align="center">0.2241</td> |
|
|
<td align="center">0.2626</td> |
|
|
<td align="center">0.3203</td> |
|
|
<td align="center">0.3581</td> |
|
|
<td align="center">0.2285</td> |
|
|
<td align="center">0.2500</td> |
|
|
<td align="center">0.4184</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><a href="https://huggingface.co/LSX-UniWue/LLaMmlein_120M" target="_blank">LLäMmlein-120M</a></td> |
|
|
<td align="center">0.1942</td> |
|
|
<td align="center">0.2301</td> |
|
|
<td align="center">0.2945</td> |
|
|
<td align="center">0.3178</td> |
|
|
<td align="center">0.2285</td> |
|
|
<td align="center">0.2310</td> |
|
|
<td align="center">0.4055</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><a href="https://huggingface.co/LSX-UniWue/LLaMmlein_1B" target="_blank">LLäMmlein-1B</a></td> |
|
|
<td align="center">0.2515</td> |
|
|
<td align="center">0.2960</td> |
|
|
<td align="center">0.3703</td> |
|
|
<td align="center">0.4490</td> |
|
|
<td align="center">0.2317</td> |
|
|
<td align="center">0.2322</td> |
|
|
<td align="center">0.3617</td> |
|
|
</tr> |
|
|
</tbody> |
|
|
</table> |
|
|
|
|
|
The following command was used to run the evaluation for our model:
|
|
|
|
|
```bash
|
|
lm_eval --model hf \ |
|
|
--model_args pretrained="stefan-it/nanochat-german-v1" \ |
|
|
--tasks "arc_de,hellaswag_de,m_mmlu_de,truthfulqa_de_mc1,truthfulqa_de_mc2" \ |
|
|
--device cuda:0 \ |
|
|
--batch_size auto \ |
|
|
--trust_remote_code \ |
|
|
--log_samples \ |
|
|
--output_path ./nanochat-german-v1 |
|
|
``` |
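
With `--output_path`, `lm_eval` writes a per-run results JSON below the given directory. A minimal sketch for pulling the metrics back out (the exact file layout differs across `lm_eval` versions, so the glob pattern below is an assumption):

```python
import glob
import json

# lm_eval writes a timestamped results JSON below --output_path; the exact
# directory layout differs between versions, so we search for it recursively.
for path in sorted(glob.glob("./nanochat-german-v1/**/results*.json", recursive=True)):
    with open(path) as f:
        report = json.load(f)
    # "results" maps each task name to its metrics (acc, acc_norm, stderr, ...).
    for task, metrics in report.get("results", {}).items():
        print(f"{task}: {metrics}")
```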
|
|
|
|
|
## Demo |
|
|
|
|
|
To generate text, please make sure that you are using the Transformers branch from [this PR](https://github.com/huggingface/transformers/pull/41634).
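
One way to install that branch directly is `pip install "git+https://github.com/huggingface/transformers.git@refs/pull/41634/head"` (this pulls the PR's head ref; a local checkout of the branch works just as well).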
|
|
|
|
|
Then the following code can be used: |
|
|
|
|
|
```python |
|
|
import torch |
|
|
|
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|
|
|
|
|
|
|
model_id = "stefan-it/nanochat-german-v1" |
|
|
revision = "main" |
|
|
max_new_tokens = 64 |
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False, revision=revision) |
|
|
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=False, dtype=torch.bfloat16, revision=revision).to(device) |
|
|
model.eval() |
|
|
|
|
|
conversation = [ |
|
|
{"role": "user", "content": "Was ist die Hauptstadt von Bayern?"}, |
|
|
] |
|
|
|
|
|
inputs = tokenizer.apply_chat_template( |
|
|
conversation, |
|
|
add_generation_prompt=True, |
|
|
tokenize=True, |
|
|
return_dict=True, |
|
|
return_tensors="pt" |
|
|
).to(device) |
|
|
|
|
|
with torch.no_grad(): |
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=max_new_tokens, |
|
|
) |
|
|
|
|
|
# Decode only the generated tokens (excluding the input prompt) |
|
|
generated_tokens = outputs[0, inputs["input_ids"].shape[1]:] |
|
|
print(tokenizer.decode(generated_tokens, skip_special_tokens=True)) |
|
|
``` |
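
Note that `model.generate` decodes greedily here (the default, unless the checkpoint's generation config overrides it); passing `do_sample=True`, optionally with a `temperature`, yields more varied replies.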
|
|
|
|
|
## License |
|
|
|
|
|
The model is licensed under the permissive Apache 2.0 license.
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
- Many thanks to Andrej Karpathy for the original [nanochat](https://github.com/karpathy/nanochat) repo!
|
|
- Thanks to the [LLäMmlein team](https://huggingface.co/LSX-UniWue) for making the pretraining data publicly available. |
|
|
- Thanks to [Ben](https://huggingface.co/burtenshaw) and [Joshua](https://huggingface.co/Xenova) for their help with, and work on, the nanochat [HF integration](https://github.com/huggingface/transformers/pull/41634).