ai-forever committed on
Commit a166072 · verified · 1 Parent(s): 5a93199

Update README.md

Files changed (1): README.md +113 -3
README.md CHANGED
---
license: mit
language:
- ru
- en
---


# CharLLama-2.6B Pretrained Language Model

This repository contains a pre-trained language model based on the [Llama](https://arxiv.org/abs/2302.13971) architecture that uses character-level tokenization. The model was developed for experiments in generating Russian accentual-syllabic poetry, as described in our paper *"Generation of Russian Poetry of Different Genres and Styles Using Neural Networks with Character-Level Tokenization"*. Note that for practical applications, fine-tuning on a task-specific dataset is recommended.


## Model Specifications

- **Number of parameters**: `2,641,199,664`
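
The parameter count can be verified directly from the checkpoint; a minimal sketch (note that this downloads the full model):

```
import transformers

# Load the checkpoint and sum the sizes of all parameter tensors;
# the total should match the figure quoted above.
model = transformers.AutoModelForCausalLM.from_pretrained('ai-forever/charllama-2.6B')
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params:,}')  # expected: 2,641,199,664
```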


## Pretraining Data

The model was pretrained on a mixed dataset of approximately 100 GB of Russian and English texts, with a focus on Russian-language content. The dataset includes diverse domains such as fiction and poetry across various genres and styles. All texts were accentuated, i.e. stress marks were placed on the words (for example, `У бу́рных чу́вств неи́стовый коне́ц` rather than `У бурных чувств неистовый конец`).


## Character-Level Tokenization

The model employs character-by-character tokenization. To use the tokenizer, install it via:

```
pip install git+https://github.com/Koziev/character-tokenizer
```

The tokenizer includes the special tokens `<s>` and `</s>`.
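
A quick sanity check of the tokenizer, using only the `CharacterTokenizer` calls shown in the Usage section below:

```
import charactertokenizer

tokenizer = charactertokenizer.CharacterTokenizer.from_pretrained('ai-forever/charllama-2.6B')

# With character-level tokenization, the number of tokens tracks the
# number of characters in the string (plus any special tokens).
ids = tokenizer('Привет', return_tensors='pt').input_ids
print(ids.shape)                       # roughly one token per character
print(tokenizer.decode(ids[0].tolist()))  # round-trips back to the text
```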


## Usage

To use the model with the `transformers` library, follow this example:

```
import torch
import transformers
import charactertokenizer

# Sampling settings for poetry generation
generation_args = {'max_length': 1024,
                   'num_return_sequences': 1,
                   'do_sample': True,
                   'no_repeat_ngram_size': 10,
                   'temperature': 0.8,
                   'top_p': 0.6,
                   'top_k': 0,
                   }

device = "cuda:0"

model_dir = 'ai-forever/charllama-2.6B'

tokenizer = charactertokenizer.CharacterTokenizer.from_pretrained(model_dir)

model = transformers.AutoModelForCausalLM.from_pretrained(model_dir)
model.to(device)

# Poetry completion: the prompt is the chr(8) marker character
# followed by an accentuated first line of the poem.
prompt = chr(8) + 'У бу́рных чу́вств неи́стовый коне́ц'

input_ids = tokenizer(prompt, return_tensors='pt').input_ids
out_ids = model.generate(input_ids=input_ids.to(device),
                         eos_token_id=tokenizer.eos_token_id,
                         **generation_args).tolist()

for seq in out_ids:
    # Drop the leading <s> token before decoding.
    seq = seq[1:]
    output = tokenizer.decode(seq)

    # Truncate at the end-of-sequence marker if the model emitted one.
    if '</s>' in output:
        output = output[:output.find('</s>')].strip()

    print('-' * 80)
    print(output)
```
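
If you want only the generated continuation, without echoing the prompt, you can slice off the prompt tokens before decoding (a small variation on the loop above):

```
prompt_len = input_ids.shape[1]
for seq in out_ids:
    # Decode only the tokens generated after the prompt.
    continuation = tokenizer.decode(seq[prompt_len:])
    if '</s>' in continuation:
        continuation = continuation[:continuation.find('</s>')].strip()
    print(continuation)
```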

Example output (may vary):

```
У бу́рных чу́вств неи́стовый коне́ц,
И в э́том не́т ни ка́пельки сомне́нья.
Прихо́дит сро́к, и го́рестный вене́ц
Наде́нет на себя́ душа́ смире́нно.

И не помо́гут в э́том Небеса́,
И не поми́лует Судьба́ - подру́га.
И бу́дет на душе́ твое́й тоска́,
И ста́нет в жи́зни нестерпи́мо ту́го.
```
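
The generated text carries combining stress marks (U+0301), as in the example above. If you need plain text for display or downstream processing, they can be stripped with the standard library; a minimal sketch:

```
import unicodedata

def strip_stress_marks(text: str) -> str:
    # Decompose characters, then drop the combining acute accents
    # used as stress marks in the model's output.
    return ''.join(ch for ch in unicodedata.normalize('NFD', text)
                   if ch != '\u0301')

print(strip_stress_marks('У бу́рных чу́вств неи́стовый коне́ц'))
# У бурных чувств неистовый конец
```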


## Limitations

The model may generate inappropriate content, including hate speech, offensive language, or biased outputs that reflect the training data. Use it with caution and consider post-processing or filtering mechanisms.
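
For instance, a minimal keyword-based output filter might look like the sketch below. This is purely illustrative and not part of this repository; the banned list is a hypothetical placeholder, and a trained classifier would be a stronger choice in practice.

```
BANNED_SUBSTRINGS = []  # hypothetical: populate with unwanted terms

def is_safe(text: str) -> bool:
    # Reject generations containing any banned substring.
    lowered = text.lower()
    return not any(term in lowered for term in BANNED_SUBSTRINGS)
```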


## Citation

If you use this model in your research, please cite it.

*Citation information will be available soon.*