Clémentine committed · Commit 8fa19a6 · Parent(s): 8bcaee3
embed space
app/src/content/chapters/general-knowledge/model-inference-and-evaluation.mdx
CHANGED

@@ -71,8 +71,15 @@ However, if you want to allow your tokenizer to correctly split text in other la
 
 This effect leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens to generate a sentence of equivalent length as English.
 
+<iframe
+  src="https://OpenEvals-tokenizers-languages.hf.space"
+  frameborder="0"
+  width="850"
+  height="450"
+></iframe>
+
 <Note title="Going further: Language and tokenization" emoji="📚" variant="warning">
-- ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized): The breakdown in itself is very clear, and
+- ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized): The breakdown in itself is very clear, and the embedded space comes from her work.
 - ⭐ [A demo by Aleksandar Petrov on unfairness of tokenization](https://aleksandarpetrov.github.io/tokenization-fairness/): I recommend looking at `Compare tokenization of sentences` to get a feel for the differences in cost of inference depending on languages
 </Note>
 
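To make the gap described in the changed paragraph concrete, here is a minimal sketch (not part of the commit) that counts the tokens an English-centric byte-level BPE tokenizer spends on roughly equivalent sentences. The model checkpoint and the sentences are illustrative assumptions:

```python
# Minimal sketch, not part of the commit: compare how many tokens an
# English-centric tokenizer spends on roughly equivalent sentences.
# The checkpoint and sentence choices are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # byte-level BPE, English-heavy vocab

sentences = {
    "English": "Hello, how are you today?",
    "French": "Bonjour, comment allez-vous aujourd'hui ?",
    "Thai": "สวัสดี วันนี้คุณเป็นอย่างไร",
}

for language, text in sentences.items():
    n_tokens = len(tokenizer.encode(text))
    print(f"{language:>7}: {n_tokens:3d} tokens for {len(text)} characters")
```

Because GPT-2's vocabulary contains few merges for Thai script, each Thai character decomposes into several byte-level tokens, so that sentence costs several times more tokens per character than the English one; this is the same inference-cost disparity the embedded space and the linked demos visualize.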