microsoft/layoutlm-base-cased

Yiheng Xu committed on Sep 27, 2021

Commit 91acf0f · 1 Parent(s): 4678df1

Update README.md

Files changed (1)

README.md +8 -6

README.md CHANGED

@@ -10,12 +10,14 @@ LayoutLM is a simple but effective pre-training method of text and layout for do
 [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318)
 Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, [KDD 2020](https://www.kdd.org/kdd2020/accepted-papers)
-## Training data
-We pre-train LayoutLM on IIT-CDIP Test Collection 1.0\* dataset with two settings.
-* LayoutLM-Base, Uncased (11M documents, 2 epochs): 12-layer, 768-hidden, 12-heads, 113M parameters **(This Model)**
-* LayoutLM-Large, Uncased (11M documents, 2 epochs): 24-layer, 1024-hidden, 16-heads, 343M parameters
 ## Citation

 [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318)
 Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, [KDD 2020](https://www.kdd.org/kdd2020/accepted-papers)
+## Different Tokenizer
+Note that LayoutLM-Cased requires a different tokenizer, based on RobertaTokenizer. You can
+initialize it as follows:
+~~~
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained('microsoft/layoutlm-base-cased')
+~~~
 ## Citation
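
Because the cased checkpoint uses a different tokenizer, words may split into different subword pieces than with the uncased models, so any word-level layout boxes have to be re-aligned to the new pieces. The following is a minimal sketch of that alignment, not part of this commit: `expand_boxes_to_tokens` and `toy_tokenize` are hypothetical names introduced for illustration, and the toy splitter only stands in for a real tokenizer so the example runs without downloading anything.

```python
from typing import Callable, List, Sequence, Tuple

# (x0, y0, x1, y1) bounding box of one word on the page
Box = Tuple[int, int, int, int]


def expand_boxes_to_tokens(
    words: Sequence[str],
    boxes: Sequence[Box],
    tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[Box]]:
    """Repeat each word's box once per subword piece produced by `tokenize`."""
    if len(words) != len(boxes):
        raise ValueError("words and boxes must have the same length")
    tokens: List[str] = []
    token_boxes: List[Box] = []
    for word, box in zip(words, boxes):
        pieces = tokenize(word)
        tokens.extend(pieces)
        # Every piece of a word inherits that word's box.
        token_boxes.extend([box] * len(pieces))
    return tokens, token_boxes


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in tokenizer: chunks of at most 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


tokens, token_boxes = expand_boxes_to_tokens(
    ["Invoice", "No"],
    [(10, 10, 80, 20), (90, 10, 110, 20)],
    toy_tokenize,
)
```

With the tokenizer loaded as in the README snippet, `tokenizer.tokenize` could be passed in place of `toy_tokenize`. Byte-level BPE tokenizers treat a leading space as part of a token, so tokenizing word by word may not match tokenizing the full sentence; treat this as a starting point and check the pieces it produces.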