How is BertTokenizer trained and what is its data corpus?
Could you specify how to train the tokenizer and where to obtain its corpus?
Tokenizer Training and Corpus
Training a tokenizer is a crucial step in preparing text data for natural language processing (NLP) models. Here's an overview of how to train a tokenizer and where to obtain a suitable corpus:
1. How to Train a Tokenizer
The specific steps depend on the type of tokenizer you are training (e.g., WordPiece, Byte Pair Encoding (BPE), or Unigram), but the general process involves the following:
A. Preparation
- Corpus Collection: Gather a large and representative text corpus (see Section 2).
- Cleaning/Preprocessing (Optional but Recommended): While not always strictly necessary, cleaning the corpus (e.g., removing boilerplate, non-text content, or HTML tags) can improve the quality of the learned vocabulary; a minimal cleaning sketch follows this list.
- Tool Selection: Choose a library or tool for training. The Hugging Face `tokenizers` library is a popular and efficient choice that supports most modern algorithms.
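As a rough illustration of the light cleaning mentioned above, here is a minimal sketch; the tag-stripping regex is deliberately naive, and the file paths are hypothetical.

```python
import re

# Hypothetical paths; adjust to your own corpus files.
RAW_PATH = "raw_corpus.txt"
CLEAN_PATH = "clean_corpus.txt"

TAG_RE = re.compile(r"<[^>]+>")  # naive HTML tag stripper

with open(RAW_PATH, encoding="utf-8") as src, open(CLEAN_PATH, "w", encoding="utf-8") as dst:
    for line in src:
        text = TAG_RE.sub(" ", line)               # drop HTML tags
        text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
        if len(text) > 20:                         # skip very short / boilerplate lines
            dst.write(text + "\n")
```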
B. Training Algorithm
The tokenizer training algorithm iteratively builds (or prunes) the vocabulary based on statistics of character sequences in your corpus.
- BPE (Byte Pair Encoding):
  1. Start with a base vocabulary of all individual characters (and often bytes).
  2. Find the most frequent adjacent pair of symbols in the corpus.
  3. Replace all occurrences of that pair with a new, merged symbol.
  4. Repeat steps 2 and 3 until the desired vocabulary size is reached.
  A toy sketch of this merge loop is shown after this list.
- WordPiece (Used by BERT, DistilBERT):
  - Starts with base characters, similar to BPE.
  - The merging criterion is based on likelihood: it chooses the pair that most increases the likelihood of the training corpus when merged, rather than just the most frequent pair.
- Unigram (Used by SentencePiece, T5, ALBERT):
  - Starts with a large initial vocabulary (e.g., all frequent subwords/words).
  - Iteratively prunes (removes) subwords that are less useful; the pruning decision is typically based on how much the corpus loss increases when a subword is removed.
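To make the BPE merge loop concrete, here is a toy, pure-Python sketch of the pair-counting and merging steps described above; the tiny corpus is made up, and real implementations are far more optimized.

```python
from collections import Counter

# Toy corpus as word -> frequency, with each word pre-split into symbols.
vocab = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}

def most_frequent_pair(vocab):
    """Count all adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Each merge adds one new symbol to the vocabulary; repeat until the target size.
for _ in range(5):
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(pair, vocab)
    print("merged", pair)
```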
C. Configuration and Execution
- Define Special Tokens: Specify tokens like `[UNK]` (unknown), `[CLS]` (classification), `[SEP]` (separator), and `[PAD]` (padding).
- Set Vocabulary Size: Decide on the target size of the vocabulary (e.g., 30,000 or 50,000). A larger vocabulary generally leads to shorter sequences but a larger embedding layer in the model.
- Train: Call the chosen library's `train` function, passing in the corpus file path(s) and the trainer configuration.
Example (using the Hugging Face `tokenizers` library):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# 1. Instantiate the model (e.g., BPE)
tokenizer = Tokenizer(models.BPE())

# 2. Set the pre-tokenizer (how the text is split initially)
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

# 3. Set up the trainer
trainer = trainers.BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    vocab_size=30000,
)

# 4. Train the tokenizer on a list of files
files = ["path/to/your/corpus.txt"]
tokenizer.train(files, trainer=trainer)

# 5. Save the tokenizer
tokenizer.save("my_new_tokenizer.json")
```
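Since the original question is about BERT's tokenizer specifically, here is a comparable sketch for WordPiece, the algorithm BERT uses. The normalizer and pre-tokenizer mirror BERT's usual setup, and 30,522 is the vocabulary size of the original BERT models, but the corpus path and output filename are placeholders.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# WordPiece model, as used by BERT; [UNK] handles out-of-vocabulary pieces.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))

# BERT-style text normalization and pre-tokenization.
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

trainer = trainers.WordPieceTrainer(
    vocab_size=30522,  # vocabulary size of the original bert-base models
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

tokenizer.train(["path/to/your/corpus.txt"], trainer=trainer)
tokenizer.save("bert_style_tokenizer.json")

# Quick check: WordPiece marks word-internal pieces with "##".
print(tokenizer.encode("Tokenization is fun").tokens)
```

The `##` prefix you will see on word-internal pieces is WordPiece's (and BERT's) continuation marker.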
2. Where to Obtain the Corpus
The corpus is the large collection of raw text data that the tokenizer "reads" to determine the most effective subwords. The choice of corpus is critical: it should match the domain of the data your final NLP model will be used on.
| Corpus Type | Description | Examples & Sources |
|---|---|---|
| General Domain | Massive, diverse collections of text covering a wide range of topics. Best for foundational models (like BERT, GPT). | * Common Crawl: A public archive of web data (trillions of words). * Wikipedia: Complete text dumps of many languages. * Books Corpus: Large collection of books. |
| Specific Domain | Text data focused on a particular topic, industry, or style. Crucial for specialized models (e.g., medical, legal, code). | * Hugging Face Datasets: A vast repository of easily downloadable, pre-processed datasets (e.g., SQuAD, PubMed). * GitHub Repositories: For code-related tokenizers. * Private/Proprietary Data: Internal company documents, emails, or chat logs (if you have permission). |
| Multilingual | Text from multiple languages combined into a single corpus, often used for multilingual models. | * OSCAR: A cleaned, filtered, and multilingual version of the Common Crawl dataset. * OPUS: A growing collection of parallel corpora (translations). |
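As one concrete way to obtain a general-domain corpus, the sketch below pulls a public dataset through the Hugging Face `datasets` library and trains from an iterator instead of text files; the dataset name is only an example, and the BPE setup repeats the configuration from Section 1.

```python
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Example public corpus; swap in any dataset that matches your domain.
ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

def batch_iterator(batch_size=1000):
    """Yield batches of raw text lines from the dataset."""
    for i in range(0, len(ds), batch_size):
        yield ds[i : i + batch_size]["text"]

# Train directly from the iterator; no intermediate text files needed.
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(ds))
```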
Key Considerations:
- Size: Modern tokenizers are often trained on billions of tokens (hundreds of gigabytes of text) to ensure comprehensive coverage.
- Representativeness: The corpus must represent the text your final model will encounter. If your model works on legal documents, training the tokenizer only on Wikipedia will result in poor performance on legal jargon; a quick coverage check is sketched below.
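One way to sanity-check representativeness, assuming a tokenizer saved as in the example above, is to measure how often it falls back to `[UNK]` and how many pieces it needs per word on a sample of in-domain text; the sample sentence and filename are placeholders.

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("my_new_tokenizer.json")  # trained earlier

sample = "The lessor shall indemnify the lessee against all encumbrances."
enc = tok.encode(sample)

unk_rate = enc.tokens.count("[UNK]") / len(enc.tokens)
pieces_per_word = len(enc.tokens) / len(sample.split())

print(f"tokens: {enc.tokens}")
print(f"[UNK] rate: {unk_rate:.2%}, pieces per word: {pieces_per_word:.2f}")
```

A high `[UNK]` rate or a high pieces-per-word ratio suggests the training corpus did not cover the target domain well; note that byte-level BPE tokenizers rarely emit `[UNK]` at all, so for them the pieces-per-word ratio is the more telling signal.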
Thanks for the GPT-like answer, but I'm looking for specific information about the BERT tokenizer; could you single that out?
It's basically a language model with the capacity to learn specific language tasks. Since it can learn any language and be fine-tuned for a specific task, it helps automate things; you could call it a time saver. At the same time, pretraining BERT is very time consuming, but it's worth it.
Are you looking for anything specific, or are you working on something?
I can help, but I work for money.
Someone said, "If you are good at something, never do it for free."