Clémentine committed
Commit 8fa19a6 · 1 Parent(s): 8bcaee3

embed space

app/src/content/chapters/general-knowledge/model-inference-and-evaluation.mdx CHANGED
@@ -71,8 +71,15 @@ However, if you want to allow your tokenizer to correctly split text in other la
 
 This effect leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens to generate a sentence of equivalent length as English.
 
+<iframe
+  src="https://OpenEvals-tokenizers-languages.hf.space"
+  frameborder="0"
+  width="850"
+  height="450"
+></iframe>
+
 <Note title="Going further: Language and tokenization" emoji="📚" variant="warning">
-- ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized): The breakdown in itself is very clear, and it's worth playing around with the [demo space](https://huggingface.co/spaces/yenniejun/tokenizers-languages)
+- ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized): The breakdown in itself is very clear, and the embedded space comes from her work.
 - ⭐ [A demo by Aleksandar Petrov on unfairness of tokenization](https://aleksandarpetrov.github.io/tokenization-fairness/): I recommend looking at `Compare tokenization of sentences` to get a feel for the differences in cost of inference depending on languages
 </Note>
 
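The token-count unfairness described in the changed paragraph is easy to check yourself. Here is a minimal sketch using the `transformers` library; the `gpt2` tokenizer and the parallel sentences are illustrative assumptions (not part of this commit), and exact counts will vary by tokenizer:

```python
from transformers import AutoTokenizer

# gpt2 is an arbitrary, illustrative choice; any pretrained tokenizer works here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Roughly parallel sentences (illustrative translations, not from the commit).
sentences = {
    "English": "Hello, how are you doing today?",
    "French": "Bonjour, comment allez-vous aujourd'hui ?",
    "Greek": "Γεια σας, πώς είστε σήμερα;",
    "Thai": "สวัสดี วันนี้คุณเป็นอย่างไรบ้าง",
}

# Count the tokens each sentence costs under the same vocabulary.
for language, text in sentences.items():
    n_tokens = len(tokenizer.encode(text))
    print(f"{language:>7}: {n_tokens} tokens")
```

A byte-level BPE vocabulary trained mostly on English text will typically spend several times more tokens on the Thai or Greek sentence than on the English one, which is exactly the disparity the embedded space lets readers explore interactively.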