|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- de |
|
|
datasets: |
|
|
- stefan-it/nanochat-german-alpaca |
|
|
- argilla/databricks-dolly-15k-curated-multilingual |
|
|
- FreedomIntelligence/evol-instruct-deutsch |
|
|
- LSX-UniWue/Guanako |
|
|
- stefan-it/nanochat-german-openhermes |
|
|
- FreedomIntelligence/sharegpt-deutsch |
|
|
tags: |
|
|
- nanochat |
|
|
- german |
|
|
- v1 |
|
|
base_model: |
|
|
- stefan-it/nanochat-german-base |
|
|
--- |
|
|
|
|
|
# 🇩🇪 nanochat German: v1 |
|
|
|
|
|
<p align="left"> |
|
|
<picture> |
|
|
<img alt="nanochat German logo" src="https://raw.githubusercontent.com/stefan-it/nanochat-german/main/assets/nanochat-german.png" style="max-width: 75%;"> |
|
|
</picture> |
|
|
<br/> |
|
|
</p> |
|
|
|
|
|
This repository hosts the first German nanochat model. It was fine-tuned (during nanochat's mid-training phase) on various German SFT datasets.
|
|
|
|
|
💬 A demo Space for the model can be found [here](https://huggingface.co/spaces/stefan-it/nanochat-german-v1).
|
|
|
|
|
## Datasets |
|
|
|
|
|
The chat model was fine-tuned on the following datasets: |
|
|
|
|
|
* [German Alpaca](https://huggingface.co/datasets/stefan-it/nanochat-german-alpaca) |
|
|
* [German Dolly](https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual) |
|
|
* [German Evol Instruct](https://huggingface.co/datasets/FreedomIntelligence/evol-instruct-deutsch) |
|
|
* [German Guanako](https://huggingface.co/datasets/LSX-UniWue/Guanako) |
|
|
* [German Openhermes](https://huggingface.co/datasets/stefan-it/nanochat-german-openhermes) |
|
|
* [German ShareGPT](https://huggingface.co/datasets/FreedomIntelligence/sharegpt-deutsch) |
|
|
* German Spelling Tasks |
|
|
|
|
|
More information can be found in the corresponding [German nanochat repository](https://github.com/stefan-it/nanochat-german). |
|
|
|
|
|
## Fine-Tuning Stats |
|
|
|
|
|
- run: nanochat-german |
|
|
- device_type: |
|
|
- dtype: bfloat16 |
|
|
- num_iterations: -1 |
|
|
- max_seq_len: 2048 |
|
|
- device_batch_size: 32 |
|
|
- unembedding_lr: 0.0040 |
|
|
- embedding_lr: 0.2000 |
|
|
- matrix_lr: 0.0200 |
|
|
- init_lr_frac: 1.0000 |
|
|
- weight_decay: 0.0000 |
|
|
- eval_every: 150 |
|
|
- eval_tokens: 10,485,760 |
|
|
- total_batch_size: 524,288 |
|
|
- dry_run: 0 |
|
|
- Number of iterations: 346 |
|
|
- DDP world size: 8 |
|
|
- Minimum validation bpb: 0.6001 |
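
For context, the effective batch is counted in tokens: per-device sequences × sequence length × DDP world size. A minimal sanity check of the numbers above (assuming nanochat's token-based batch accounting):

```python
# Sanity check of the batch accounting above (assumption: nanochat counts
# total_batch_size in tokens, not sequences).
device_batch_size = 32      # sequences per device per micro-step
max_seq_len = 2048          # tokens per sequence
world_size = 8              # DDP ranks
total_batch_size = 524_288  # tokens per optimizer step

tokens_per_micro_step = device_batch_size * max_seq_len * world_size
grad_accum_steps = total_batch_size // tokens_per_micro_step

print(tokens_per_micro_step)  # 524288
print(grad_accum_steps)       # 1 -> one micro-step already fills the batch
```

With these settings, a single micro-step already covers the full 524,288-token batch, so no gradient accumulation is needed.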
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
We use `lm_eval` to measure and compare the model's performance against other language models in the same parameter range (note: this list is not exhaustive): |
|
|
|
|
|
<table class="model-comparison"> |
|
|
<thead> |
|
|
<tr> |
|
|
<th align="left">Model</th> |
|
|
<th align="center" colspan="2">arc_de</th> |
|
|
<th align="center" colspan="2">hellaswag_de</th> |
|
|
<th align="center">m_mmlu_de</th> |
|
|
<th align="center">truthfulqa_de_mc1</th> |
|
|
<th align="center">truthfulqa_de_mc2</th> |
|
|
</tr> |
|
|
<tr> |
|
|
<th></th> |
|
|
<th align="center">acc</th> |
|
|
<th align="center">acc_norm</th> |
|
|
<th align="center">acc</th> |
|
|
<th align="center">acc_norm</th> |
|
|
<th align="center">acc</th> |
|
|
<th align="center">acc</th> |
|
|
<th align="center">acc</th> |
|
|
</tr> |
|
|
</thead> |
|
|
<tbody> |
|
|
<tr> |
|
|
<td><a href="https://huggingface.co/stefan-it/nanochat-german-v1" target="_blank">nanochat German v1</a></td> |
|
|
<td align="center">0.2241</td> |
|
|
<td align="center">0.2626</td> |
|
|
<td align="center">0.3203</td> |
|
|
<td align="center">0.3581</td> |
|
|
<td align="center">0.2285</td> |
|
|
<td align="center">0.2500</td> |
|
|
<td align="center">0.4184</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><a href="https://huggingface.co/LSX-UniWue/LLaMmlein_120M" target="_blank">LLäMmlein-120M</a></td> |
|
|
<td align="center">0.1942</td> |
|
|
<td align="center">0.2301</td> |
|
|
<td align="center">0.2945</td> |
|
|
<td align="center">0.3178</td> |
|
|
<td align="center">0.2285</td> |
|
|
<td align="center">0.2310</td> |
|
|
<td align="center">0.4055</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><a href="https://huggingface.co/LSX-UniWue/LLaMmlein_1B" target="_blank">LLäMmlein-1B</a></td> |
|
|
<td align="center">0.2515</td> |
|
|
<td align="center">0.2960</td> |
|
|
<td align="center">0.3703</td> |
|
|
<td align="center">0.4490</td> |
|
|
<td align="center">0.2317</td> |
|
|
<td align="center">0.2322</td> |
|
|
<td align="center">0.3617</td> |
|
|
</tr> |
|
|
</tbody> |
|
|
</table> |
|
|
|
|
|
The following command was used to run the evaluation for our model:
|
|
|
|
|
```bash
|
|
lm_eval --model hf \ |
|
|
--model_args pretrained="stefan-it/nanochat-german-v1" \ |
|
|
--tasks "arc_de,hellaswag_de,m_mmlu_de,truthfulqa_de_mc1,truthfulqa_de_mc2" \ |
|
|
--device cuda:0 \ |
|
|
--batch_size auto \ |
|
|
--trust_remote_code \ |
|
|
--log_samples \ |
|
|
--output_path ./nanochat-german-v1 |
|
|
``` |
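
With `--output_path`, `lm_eval` writes a per-run results JSON below the given directory. A minimal sketch for pulling the metrics back out (the exact file layout differs across `lm_eval` versions, so the glob pattern below is an assumption):

```python
import glob
import json

# lm_eval writes a timestamped results JSON below --output_path; the exact
# directory layout differs between versions, so we search for it recursively.
for path in sorted(glob.glob("./nanochat-german-v1/**/results*.json", recursive=True)):
    with open(path) as f:
        report = json.load(f)
    # "results" maps each task name to its metrics (acc, acc_norm, stderr, ...).
    for task, metrics in report.get("results", {}).items():
        print(f"{task}: {metrics}")
```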
|
|
|
|
|
## Demo |
|
|
|
|
|
To generate text, please make sure that you are using the Transformers branch from [this PR](https://github.com/huggingface/transformers/pull/41634).
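
One way to install that branch directly is `pip install "git+https://github.com/huggingface/transformers.git@refs/pull/41634/head"` (this pulls the PR's head ref; a local checkout of the branch works just as well).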
|
|
|
|
|
Then the following code can be used: |
|
|
|
|
|
```python |
|
|
import torch |
|
|
|
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|
|
|
|
|
|
|
model_id = "stefan-it/nanochat-german-v1" |
|
|
revision = "main" |
|
|
max_new_tokens = 64 |
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False, revision=revision) |
|
|
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=False, dtype=torch.bfloat16, revision=revision).to(device) |
|
|
model.eval() |
|
|
|
|
|
conversation = [ |
|
|
{"role": "user", "content": "Was ist die Hauptstadt von Bayern?"}, |
|
|
] |
|
|
|
|
|
inputs = tokenizer.apply_chat_template( |
|
|
conversation, |
|
|
add_generation_prompt=True, |
|
|
tokenize=True, |
|
|
return_dict=True, |
|
|
return_tensors="pt" |
|
|
).to(device) |
|
|
|
|
|
with torch.no_grad(): |
|
|
outputs = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=max_new_tokens, |
|
|
) |
|
|
|
|
|
# Decode only the generated tokens (excluding the input prompt) |
|
|
generated_tokens = outputs[0, inputs["input_ids"].shape[1]:] |
|
|
print(tokenizer.decode(generated_tokens, skip_special_tokens=True)) |
|
|
``` |
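
Note that `model.generate` decodes greedily here (the default, unless the checkpoint's generation config overrides it); passing `do_sample=True`, optionally with a `temperature`, yields more varied replies.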
|
|
|
|
|
## License |
|
|
|
|
|
The model is licensed under the permissive Apache 2.0 license.
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
- Many thanks to Andrej Karpathy for the original [nanochat](https://github.com/karpathy/nanochat) repo!
|
|
- Thanks to the [LLäMmlein team](https://huggingface.co/LSX-UniWue) for making the pretraining data publicly available. |
|
|
- Thanks to [Ben](https://huggingface.co/burtenshaw) and [Joshua](https://huggingface.co/Xenova) for their help with, and work on, the nanochat [HF integration](https://github.com/huggingface/transformers/pull/41634).