Update README.md
README.md
CHANGED
@@ -9,7 +9,6 @@ This is the official pre-trained model introduced in [DNA language model GROVER
 
 
 from transformers import AutoTokenizer, AutoModelForMaskedLM
-import torch
 
 # Import the tokenizer and the model
 tokenizer = AutoTokenizer.from_pretrained("PoetschLab/GROVER")
@@ -17,7 +16,7 @@ This is the official pre-trained model introduced in [DNA language model GROVER
 
 
 Some preliminary analysis shows that sequence re-tokenization using Byte Pair Encoding (BPE) changes significantly if the sequence is less than 50 nucleotides long. Longer than 50 nucleotides, you should still be careful with sequence edges.
-We advice to add 100 nucleotides at the beginning and end of every sequence in order to
+We advice to add 100 nucleotides at the beginning and end of every sequence in order to guarantee that your sequence is represented with the same tokens as the original tokenization.
 We also provide the tokenized chromosomes with their respective nucleotide mappers (They are available in the folder tokenized chromosomes).
 
 ### BibTeX entry and citation info
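The README text in this diff advises adding 100 nucleotides at the beginning and end of every sequence before tokenizing. Below is a minimal sketch of that padding step, assuming the region of interest is cut from a chromosome held in memory as a plain Python string. The function name `extend_with_flanks` and its parameters are hypothetical and are not part of GROVER or `transformers`.

```python
def extend_with_flanks(chromosome: str, start: int, end: int, flank: int = 100):
    """Return the region [start, end) padded with up to `flank` nucleotides per side.

    Also returns the offsets of the original region inside the padded string,
    so the flanks can be told apart from the region afterwards.
    Hypothetical helper for illustration only.
    """
    if not 0 <= start <= end <= len(chromosome):
        raise ValueError("region must lie within the chromosome")
    # Clamp the flanks at the chromosome ends, where fewer than
    # `flank` nucleotides may be available.
    left = max(0, start - flank)
    right = min(len(chromosome), end + flank)
    padded = chromosome[left:right]
    return padded, start - left, end - left


# Toy 400-nt "chromosome" (assumption: real input would come from a FASTA file).
chrom = "ACGT" * 100
padded, lo, hi = extend_with_flanks(chrom, 150, 210)
# The original 60-nt region is recoverable from the padded string.
assert padded[lo:hi] == chrom[150:210]
```

The padded string is what would be passed to the tokenizer loaded in the README snippet. Mapping the resulting tokens back to the original region is not shown here and depends on how token boundaries fall relative to `lo` and `hi`.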