ai-forever committed on
Commit a166072 · verified · 1 Parent(s): 5a93199

Update README.md

Files changed (1): README.md +113 -3
README.md CHANGED
---
license: mit
language:
- ru
- en
---


# CharLLama-2.6B Pretrained Language Model

This repository contains a pre-trained language model based on the [Llama](https://arxiv.org/abs/2302.13971) architecture that uses character-level tokenization. The model was developed for experiments in generating Russian accentual-syllabic poetry, as described in our paper *"Generation of Russian Poetry of Different Genres and Styles Using Neural Networks with Character-Level Tokenization"*. Note that for practical applications, fine-tuning on a task-specific dataset is recommended.


## Model Specifications

- **Number of parameters**: `2,641,199,664`
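
The parameter count can be verified directly from the checkpoint; a minimal sketch (note that this downloads the full model):

```
import transformers

# Load the checkpoint and sum the sizes of all parameter tensors;
# the total should match the figure quoted above.
model = transformers.AutoModelForCausalLM.from_pretrained('ai-forever/charllama-2.6B')
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params:,}')  # expected: 2,641,199,664
```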


## Pretraining Data

The model was pretrained on a mixed dataset of approximately 100 GB of Russian and English texts, with a focus on Russian-language content. The dataset includes diverse domains such as fiction and poetry across various genres and styles. All texts were accentuated, i.e. stress marks were placed on the words (for example, `У бу́рных чу́вств неи́стовый коне́ц` rather than `У бурных чувств неистовый конец`).


## Character-Level Tokenization

The model employs character-by-character tokenization. To use the tokenizer, install it via:

```
pip install git+https://github.com/Koziev/character-tokenizer
```

The tokenizer includes the special tokens `<s>` and `</s>`.
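
A quick sanity check of the tokenizer, using only the `CharacterTokenizer` calls shown in the Usage section below:

```
import charactertokenizer

tokenizer = charactertokenizer.CharacterTokenizer.from_pretrained('ai-forever/charllama-2.6B')

# With character-level tokenization, the number of tokens tracks the
# number of characters in the string (plus any special tokens).
ids = tokenizer('Привет', return_tensors='pt').input_ids
print(ids.shape)                       # roughly one token per character
print(tokenizer.decode(ids[0].tolist()))  # round-trips back to the text
```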


## Usage

To use the model with the `transformers` library, follow this example:

```
import torch
import transformers
import charactertokenizer

# Sampling settings for poetry generation
generation_args = {'max_length': 1024,
                   'num_return_sequences': 1,
                   'do_sample': True,
                   'no_repeat_ngram_size': 10,
                   'temperature': 0.8,
                   'top_p': 0.6,
                   'top_k': 0,
                   }

device = "cuda:0"

model_dir = 'ai-forever/charllama-2.6B'

tokenizer = charactertokenizer.CharacterTokenizer.from_pretrained(model_dir)

model = transformers.AutoModelForCausalLM.from_pretrained(model_dir)
model.to(device)

# Poetry completion: the prompt is the chr(8) marker character
# followed by an accentuated first line of the poem.
prompt = chr(8) + 'У бу́рных чу́вств неи́стовый коне́ц'

input_ids = tokenizer(prompt, return_tensors='pt').input_ids
out_ids = model.generate(input_ids=input_ids.to(device),
                         eos_token_id=tokenizer.eos_token_id,
                         **generation_args).tolist()

for seq in out_ids:
    # Drop the leading <s> token before decoding.
    seq = seq[1:]
    output = tokenizer.decode(seq)

    # Truncate at the end-of-sequence marker if the model emitted one.
    if '</s>' in output:
        output = output[:output.find('</s>')].strip()

    print('-' * 80)
    print(output)
```
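
If you want only the generated continuation, without echoing the prompt, you can slice off the prompt tokens before decoding (a small variation on the loop above):

```
prompt_len = input_ids.shape[1]
for seq in out_ids:
    # Decode only the tokens generated after the prompt.
    continuation = tokenizer.decode(seq[prompt_len:])
    if '</s>' in continuation:
        continuation = continuation[:continuation.find('</s>')].strip()
    print(continuation)
```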

Example output (may vary):

```
У бу́рных чу́вств неи́стовый коне́ц,
И в э́том не́т ни ка́пельки сомне́нья.
Прихо́дит сро́к, и го́рестный вене́ц
Наде́нет на себя́ душа́ смире́нно.

И не помо́гут в э́том Небеса́,
И не поми́лует Судьба́ - подру́га.
И бу́дет на душе́ твое́й тоска́,
И ста́нет в жи́зни нестерпи́мо ту́го.
```
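
The generated text carries combining stress marks (U+0301), as in the example above. If you need plain text for display or downstream processing, they can be stripped with the standard library; a minimal sketch:

```
import unicodedata

def strip_stress_marks(text: str) -> str:
    # Decompose characters, then drop the combining acute accents
    # used as stress marks in the model's output.
    return ''.join(ch for ch in unicodedata.normalize('NFD', text)
                   if ch != '\u0301')

print(strip_stress_marks('У бу́рных чу́вств неи́стовый коне́ц'))
# У бурных чувств неистовый конец
```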


## Limitations

The model may generate inappropriate content, including hate speech, offensive language, or biased outputs that reflect the training data. Use it with caution and consider post-processing or filtering mechanisms.
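
For instance, a minimal keyword-based output filter might look like the sketch below. This is purely illustrative and not part of this repository; the banned list is a hypothetical placeholder, and a trained classifier would be a stronger choice in practice.

```
BANNED_SUBSTRINGS = []  # hypothetical: populate with unwanted terms

def is_safe(text: str) -> bool:
    # Reject generations containing any banned substring.
    lowered = text.lower()
    return not any(term in lowered for term in BANNED_SUBSTRINGS)
```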


## Citation

If you use this model in your research, please cite it.

*Citation information will be available soon.*