Clémentine committed · Commit 8fa19a6 · Parent(s): 8bcaee3
embed space
app/src/content/chapters/general-knowledge/model-inference-and-evaluation.mdx
CHANGED

@@ -71,8 +71,15 @@ However, if you want to allow your tokenizer to correctly split text in other la
 
 This effect leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens to generate a sentence of equivalent length as English.
 
+<iframe
+  src="https://OpenEvals-tokenizers-languages.hf.space"
+  frameborder="0"
+  width="850"
+  height="450"
+></iframe>
+
 <Note title="Going further: Language and tokenization" emoji="📚" variant="warning">
-- ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized): The breakdown in itself is very clear, and
+- ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized): The breakdown in itself is very clear, and the embedded space comes from her work.
 - ⭐ [A demo by Aleksandar Petrov on unfairness of tokenization](https://aleksandarpetrov.github.io/tokenization-fairness/): I recommend looking at `Compare tokenization of sentences` to get a feel for the differences in cost of inference depending on languages
 </Note>
 
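To make the gap described in the changed paragraph concrete, here is a minimal sketch (not part of the commit) that counts the tokens an English-centric byte-level BPE tokenizer spends on roughly equivalent sentences. The model checkpoint and the sentences are illustrative assumptions:

```python
# Minimal sketch, not part of the commit: compare how many tokens an
# English-centric tokenizer spends on roughly equivalent sentences.
# The checkpoint and sentence choices are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # byte-level BPE, English-heavy vocab

sentences = {
    "English": "Hello, how are you today?",
    "French": "Bonjour, comment allez-vous aujourd'hui ?",
    "Thai": "สวัสดี วันนี้คุณเป็นอย่างไร",
}

for language, text in sentences.items():
    n_tokens = len(tokenizer.encode(text))
    print(f"{language:>7}: {n_tokens:3d} tokens for {len(text)} characters")
```

Because GPT-2's vocabulary contains few merges for Thai script, each Thai character decomposes into several byte-level tokens, so that sentence costs several times more tokens per character than the English one; this is the same inference-cost disparity the embedded space and the linked demos visualize.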