htagourti redmoe-ai-v1 committed on
Commit bc1682b · verified · 0 Parent(s):

Duplicate from rednote-hilab/dots.ocr


Co-authored-by: redmoe-ai-v1 <redmoe-ai-v1@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
NOTICE ADDED
The diff for this file is too large to render. See raw diff
 
README.md ADDED
@@ -0,0 +1,1234 @@
1
+ ---
2
+ license: mit
3
+ library_name: dots_ocr
4
+ pipeline_tag: image-text-to-text
5
+ tags:
6
+ - image-to-text
7
+ - ocr
8
+ - document-parse
9
+ - layout
10
+ - table
11
+ - formula
12
+ language:
13
+ - en
14
+ - zh
15
+ - multilingual
16
+ ---
17
+
18
+ <div align="center">
19
+
20
+ <p align="center">
21
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/logo.png" width="300"/>
22
+ </p>
23
+
24
+ <h1 align="center">
25
+ dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
26
+ </h1>
27
+
28
+ [![Blog](https://img.shields.io/badge/Blog-View_on_GitHub-333.svg?logo=github)](https://github.com/rednote-hilab/dots.ocr/blob/master/assets/blog.md)
29
+ [![HuggingFace](https://img.shields.io/badge/HuggingFace%20Weights-black.svg?logo=HuggingFace)](https://huggingface.co/rednote-hilab/dots.ocr)
30
+
31
+
32
+ <div align="center">
33
+ <a href="https://dotsocr.xiaohongshu.com" target="_blank" rel="noopener noreferrer"><strong>🖥️ Live Demo</strong></a> |
34
+ <a href="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/wechat.jpg" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> |
35
+ <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a>
36
+ </div>
37
+
38
+ </div>
39
+
40
+
41
+
42
+ ## Introduction
43
+
44
+ **dots.ocr** is a powerful, multilingual document parser that unifies layout detection and content recognition within a single vision-language model while maintaining good reading order. Despite its compact 1.7B-parameter LLM foundation, it achieves state-of-the-art (SOTA) performance.
45
+
46
+ 1. **Powerful Performance:** **dots.ocr** achieves SOTA performance for text, tables, and reading order on [OmniDocBench](https://github.com/opendatalab/OmniDocBench), while delivering formula recognition results comparable to much larger models like Doubao-1.5 and Gemini 2.5 Pro.
47
+ 2. **Multilingual Support:** **dots.ocr** demonstrates robust parsing capabilities for low-resource languages, achieving decisive advantages across both layout detection and content recognition on our in-house multilingual document benchmark.
48
+ 3. **Unified and Simple Architecture:** By leveraging a single vision-language model, **dots.ocr** offers a significantly more streamlined architecture than conventional methods that rely on complex, multi-model pipelines. Switching between tasks is accomplished simply by altering the input prompt, proving that a VLM can achieve competitive detection results compared to traditional detection models like DocLayout-YOLO.
49
+ 4. **Efficient and Fast Performance:** Built upon a compact 1.7B LLM, **dots.ocr** provides faster inference speeds than many other high-performing models based on larger foundations.
50
+
51
+
52
+ ### Performance Comparison: dots.ocr vs. Competing Models
53
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/chart.png" border="0" />
54
+
55
+ > **Notes:**
56
+ > - The EN and ZH metrics are the end-to-end evaluation results on [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and the Multilingual metric is the end-to-end evaluation result on dots.ocr-bench.
57
+
58
+
59
+ ## News
60
+ * ```2025.07.30``` 🚀 We release [dots.ocr](https://github.com/rednote-hilab/dots.ocr), a multilingual document parsing model based on a 1.7B LLM, with SOTA performance.
61
+
62
+
63
+
64
+ ## Benchmark Results
65
+
66
+ ### 1. OmniDocBench
67
+
68
+ #### The end-to-end evaluation results of different tasks.
69
+
70
+ <table>
71
+ <thead>
72
+ <tr>
73
+ <th rowspan="2"><strong>Model<br>Type</strong></th>
74
+ <th rowspan="2"><strong>Methods</strong></th>
75
+ <th colspan="2"><strong>Overall<sup>Edit</sup>↓</strong></th>
76
+ <th colspan="2"><strong>Text<sup>Edit</sup>↓</strong></th>
77
+ <th colspan="2"><strong>Formula<sup>Edit</sup>↓</strong></th>
78
+ <th colspan="2"><strong>Table<sup>TEDS</sup>↑</strong></th>
79
+ <th colspan="2"><strong>Table<sup>Edit</sup>↓</strong></th>
80
+ <th colspan="2"><strong>Read Order<sup>Edit</sup>↓</strong></th>
81
+ </tr>
82
+ <tr>
83
+ <th><em>EN</em></th>
84
+ <th><em>ZH</em></th>
85
+ <th><em>EN</em></th>
86
+ <th><em>ZH</em></th>
87
+ <th><em>EN</em></th>
88
+ <th><em>ZH</em></th>
89
+ <th><em>EN</em></th>
90
+ <th><em>ZH</em></th>
91
+ <th><em>EN</em></th>
92
+ <th><em>ZH</em></th>
93
+ <th><em>EN</em></th>
94
+ <th><em>ZH</em></th>
95
+ </tr>
96
+ </thead>
97
+ <tbody>
98
+ <tr>
99
+ <td rowspan="8"><strong>Pipeline<br>Tools</strong></td>
100
+ <td>MinerU</td>
101
+ <td>0.150</td>
102
+ <td>0.357</td>
103
+ <td>0.061</td>
104
+ <td>0.215</td>
105
+ <td>0.278</td>
106
+ <td>0.577</td>
107
+ <td>78.6</td>
108
+ <td>62.1</td>
109
+ <td>0.180</td>
110
+ <td>0.344</td>
111
+ <td>0.079</td>
112
+ <td>0.292</td>
113
+ </tr>
114
+ <tr>
115
+ <td>Marker</td>
116
+ <td>0.336</td>
117
+ <td>0.556</td>
118
+ <td>0.080</td>
119
+ <td>0.315</td>
120
+ <td>0.530</td>
121
+ <td>0.883</td>
122
+ <td>67.6</td>
123
+ <td>49.2</td>
124
+ <td>0.619</td>
125
+ <td>0.685</td>
126
+ <td>0.114</td>
127
+ <td>0.340</td>
128
+ </tr>
129
+ <tr>
130
+ <td>Mathpix</td>
131
+ <td>0.191</td>
132
+ <td>0.365</td>
133
+ <td>0.105</td>
134
+ <td>0.384</td>
135
+ <td>0.306</td>
136
+ <td>0.454</td>
137
+ <td>77.0</td>
138
+ <td>67.1</td>
139
+ <td>0.243</td>
140
+ <td>0.320</td>
141
+ <td>0.108</td>
142
+ <td>0.304</td>
143
+ </tr>
144
+ <tr>
145
+ <td>Docling</td>
146
+ <td>0.589</td>
147
+ <td>0.909</td>
148
+ <td>0.416</td>
149
+ <td>0.987</td>
150
+ <td>0.999</td>
151
+ <td>1</td>
152
+ <td>61.3</td>
153
+ <td>25.0</td>
154
+ <td>0.627</td>
155
+ <td>0.810</td>
156
+ <td>0.313</td>
157
+ <td>0.837</td>
158
+ </tr>
159
+ <tr>
160
+ <td>Pix2Text</td>
161
+ <td>0.320</td>
162
+ <td>0.528</td>
163
+ <td>0.138</td>
164
+ <td>0.356</td>
165
+ <td>0.276</td>
166
+ <td>0.611</td>
167
+ <td>73.6</td>
168
+ <td>66.2</td>
169
+ <td>0.584</td>
170
+ <td>0.645</td>
171
+ <td>0.281</td>
172
+ <td>0.499</td>
173
+ </tr>
174
+ <tr>
175
+ <td>Unstructured</td>
176
+ <td>0.586</td>
177
+ <td>0.716</td>
178
+ <td>0.198</td>
179
+ <td>0.481</td>
180
+ <td>0.999</td>
181
+ <td>1</td>
182
+ <td>0</td>
183
+ <td>0.06</td>
184
+ <td>1</td>
185
+ <td>0.998</td>
186
+ <td>0.145</td>
187
+ <td>0.387</td>
188
+ </tr>
189
+ <tr>
190
+ <td>OpenParse</td>
191
+ <td>0.646</td>
192
+ <td>0.814</td>
193
+ <td>0.681</td>
194
+ <td>0.974</td>
195
+ <td>0.996</td>
196
+ <td>1</td>
197
+ <td>64.8</td>
198
+ <td>27.5</td>
199
+ <td>0.284</td>
200
+ <td>0.639</td>
201
+ <td>0.595</td>
202
+ <td>0.641</td>
203
+ </tr>
204
+ <tr>
205
+ <td>PPStruct-V3</td>
206
+ <td>0.145</td>
207
+ <td>0.206</td>
208
+ <td>0.058</td>
209
+ <td>0.088</td>
210
+ <td>0.295</td>
211
+ <td>0.535</td>
212
+ <td>-</td>
213
+ <td>-</td>
214
+ <td>0.159</td>
215
+ <td>0.109</td>
216
+ <td>0.069</td>
217
+ <td>0.091</td>
218
+ </tr>
219
+ <tr>
220
+ <td rowspan="9"><strong>Expert<br>VLMs</strong></td>
221
+ <td>GOT-OCR</td>
222
+ <td>0.287</td>
223
+ <td>0.411</td>
224
+ <td>0.189</td>
225
+ <td>0.315</td>
226
+ <td>0.360</td>
227
+ <td>0.528</td>
228
+ <td>53.2</td>
229
+ <td>47.2</td>
230
+ <td>0.459</td>
231
+ <td>0.520</td>
232
+ <td>0.141</td>
233
+ <td>0.280</td>
234
+ </tr>
235
+ <tr>
236
+ <td>Nougat</td>
237
+ <td>0.452</td>
238
+ <td>0.973</td>
239
+ <td>0.365</td>
240
+ <td>0.998</td>
241
+ <td>0.488</td>
242
+ <td>0.941</td>
243
+ <td>39.9</td>
244
+ <td>0</td>
245
+ <td>0.572</td>
246
+ <td>1.000</td>
247
+ <td>0.382</td>
248
+ <td>0.954</td>
249
+ </tr>
250
+ <tr>
251
+ <td>Mistral OCR</td>
252
+ <td>0.268</td>
253
+ <td>0.439</td>
254
+ <td>0.072</td>
255
+ <td>0.325</td>
256
+ <td>0.318</td>
257
+ <td>0.495</td>
258
+ <td>75.8</td>
259
+ <td>63.6</td>
260
+ <td>0.600</td>
261
+ <td>0.650</td>
262
+ <td>0.083</td>
263
+ <td>0.284</td>
264
+ </tr>
265
+ <tr>
266
+ <td>OLMOCR-sglang</td>
267
+ <td>0.326</td>
268
+ <td>0.469</td>
269
+ <td>0.097</td>
270
+ <td>0.293</td>
271
+ <td>0.455</td>
272
+ <td>0.655</td>
273
+ <td>68.1</td>
274
+ <td>61.3</td>
275
+ <td>0.608</td>
276
+ <td>0.652</td>
277
+ <td>0.145</td>
278
+ <td>0.277</td>
279
+ </tr>
280
+ <tr>
281
+ <td>SmolDocling-256M</td>
282
+ <td>0.493</td>
283
+ <td>0.816</td>
284
+ <td>0.262</td>
285
+ <td>0.838</td>
286
+ <td>0.753</td>
287
+ <td>0.997</td>
288
+ <td>44.9</td>
289
+ <td>16.5</td>
290
+ <td>0.729</td>
291
+ <td>0.907</td>
292
+ <td>0.227</td>
293
+ <td>0.522</td>
294
+ </tr>
295
+ <tr>
296
+ <td>Dolphin</td>
297
+ <td>0.206</td>
298
+ <td>0.306</td>
299
+ <td>0.107</td>
300
+ <td>0.197</td>
301
+ <td>0.447</td>
302
+ <td>0.580</td>
303
+ <td>77.3</td>
304
+ <td>67.2</td>
305
+ <td>0.180</td>
306
+ <td>0.285</td>
307
+ <td>0.091</td>
308
+ <td>0.162</td>
309
+ </tr>
310
+ <tr>
311
+ <td>MinerU 2</td>
312
+ <td>0.139</td>
313
+ <td>0.240</td>
314
+ <td>0.047</td>
315
+ <td>0.109</td>
316
+ <td>0.297</td>
317
+ <td>0.536</td>
318
+ <td>82.5</td>
319
+ <td>79.0</td>
320
+ <td>0.141</td>
321
+ <td>0.195</td>
322
+ <td>0.069</td>
323
+ <td>0.118</td>
324
+ </tr>
325
+ <tr>
326
+ <td>OCRFlux</td>
327
+ <td>0.195</td>
328
+ <td>0.281</td>
329
+ <td>0.064</td>
330
+ <td>0.183</td>
331
+ <td>0.379</td>
332
+ <td>0.613</td>
333
+ <td>71.6</td>
334
+ <td>81.3</td>
335
+ <td>0.253</td>
336
+ <td>0.139</td>
337
+ <td>0.086</td>
338
+ <td>0.187</td>
339
+ </tr>
340
+ <tr>
341
+ <td>MonkeyOCR-pro-3B</td>
342
+ <td>0.138</td>
343
+ <td>0.206</td>
344
+ <td>0.067</td>
345
+ <td>0.107</td>
346
+ <td><strong>0.246</strong></td>
347
+ <td>0.421</td>
348
+ <td>81.5</td>
349
+ <td>87.5</td>
350
+ <td>0.139</td>
351
+ <td>0.111</td>
352
+ <td>0.100</td>
353
+ <td>0.185</td>
354
+ </tr>
355
+ <tr>
356
+
357
+ <td rowspan="5"><strong>General<br>VLMs</strong></td>
358
+ <td>GPT4o</td>
359
+ <td>0.233</td>
360
+ <td>0.399</td>
361
+ <td>0.144</td>
362
+ <td>0.409</td>
363
+ <td>0.425</td>
364
+ <td>0.606</td>
365
+ <td>72.0</td>
366
+ <td>62.9</td>
367
+ <td>0.234</td>
368
+ <td>0.329</td>
369
+ <td>0.128</td>
370
+ <td>0.251</td>
371
+ </tr>
372
+ <tr>
373
+ <td>Qwen2-VL-72B</td>
374
+ <td>0.252</td>
375
+ <td>0.327</td>
376
+ <td>0.096</td>
377
+ <td>0.218</td>
378
+ <td>0.404</td>
379
+ <td>0.487</td>
380
+ <td>76.8</td>
381
+ <td>76.4</td>
382
+ <td>0.387</td>
383
+ <td>0.408</td>
384
+ <td>0.119</td>
385
+ <td>0.193</td>
386
+ </tr>
387
+ <tr>
388
+ <td>Qwen2.5-VL-72B</td>
389
+ <td>0.214</td>
390
+ <td>0.261</td>
391
+ <td>0.092</td>
392
+ <td>0.18</td>
393
+ <td>0.315</td>
394
+ <td>0.434</td>
395
+ <td>82.9</td>
396
+ <td>83.9</td>
397
+ <td>0.341</td>
398
+ <td>0.262</td>
399
+ <td>0.106</td>
400
+ <td>0.168</td>
401
+ </tr>
402
+ <tr>
403
+ <td>Gemini2.5-Pro</td>
404
+ <td>0.148</td>
405
+ <td>0.212</td>
406
+ <td>0.055</td>
407
+ <td>0.168</td>
408
+ <td>0.356</td>
409
+ <td>0.439</td>
410
+ <td>85.8</td>
411
+ <td>86.4</td>
412
+ <td>0.13</td>
413
+ <td>0.119</td>
414
+ <td>0.049</td>
415
+ <td>0.121</td>
416
+ </tr>
417
+ <tr>
418
+ <td>doubao-1-5-thinking-vision-pro-250428</td>
419
+ <td>0.140</td>
420
+ <td>0.162</td>
421
+ <td>0.043</td>
422
+ <td>0.085</td>
423
+ <td>0.295</td>
424
+ <td><strong>0.384</strong></td>
425
+ <td>83.3</td>
426
+ <td><strong>89.3</strong></td>
427
+ <td>0.165</td>
428
+ <td><strong>0.085</strong></td>
429
+ <td>0.058</td>
430
+ <td>0.094</td>
431
+ </tr>
432
+ <tr>
433
+ <td rowspan="1"><strong>Expert VLMs</strong></td>
434
+ <td><strong>dots.ocr</strong></td>
435
+ <td><strong>0.125</strong></td>
436
+ <td><strong>0.160</strong></td>
437
+ <td><strong>0.032</strong></td>
438
+ <td><strong>0.066</strong></td>
439
+ <td>0.329</td>
440
+ <td>0.416</td>
441
+ <td><strong>88.6</strong></td>
442
+ <td>89.0</td>
443
+ <td><strong>0.099</strong></td>
444
+ <td>0.092</td>
445
+ <td><strong>0.040</strong></td>
446
+ <td><strong>0.067</strong></td>
447
+ </tr>
448
+ <tr>
449
+ </tbody>
450
+ </table>
451
+
452
+
453
+ #### The end-to-end text recognition performance across 9 PDF page types.
454
+
455
+ <table>
456
+ <thead>
457
+ <tr>
458
+ <th><strong>Model<br>Type</strong></th>
459
+ <th><strong>Models</strong></th>
460
+ <th><strong>Book</strong></th>
461
+ <th><strong>Slides</strong></th>
462
+ <th><strong>Financial<br>Report</strong></th>
463
+ <th><strong>Textbook</strong></th>
464
+ <th><strong>Exam<br>Paper</strong></th>
465
+ <th><strong>Magazine</strong></th>
466
+ <th><strong>Academic<br>Papers</strong></th>
467
+ <th><strong>Notes</strong></th>
468
+ <th><strong>Newspaper</strong></th>
469
+ <th><strong>Overall</strong></th>
470
+ </tr>
471
+ </thead>
472
+ <tbody>
473
+ <tr>
474
+ <td rowspan="3"><strong>Pipeline<br>Tools</strong></td>
475
+ <td>MinerU</td>
476
+ <td>0.055</td>
477
+ <td>0.124</td>
478
+ <td><u>0.033</u></td>
479
+ <td>0.102</td>
480
+ <td>0.159</td>
481
+ <td><strong>0.072</strong></td>
482
+ <td><u>0.025</u></td>
483
+ <td>0.984</td>
484
+ <td>0.171</td>
485
+ <td>0.206</td>
486
+ </tr>
487
+ <tr>
488
+ <td>Marker</td>
489
+ <td>0.074</td>
490
+ <td>0.340</td>
491
+ <td>0.089</td>
492
+ <td>0.319</td>
493
+ <td>0.452</td>
494
+ <td>0.153</td>
495
+ <td>0.059</td>
496
+ <td>0.651</td>
497
+ <td>0.192</td>
498
+ <td>0.274</td>
499
+ </tr>
500
+ <tr>
501
+ <td>Mathpix</td>
502
+ <td>0.131</td>
503
+ <td>0.220</td>
504
+ <td>0.202</td>
505
+ <td>0.216</td>
506
+ <td>0.278</td>
507
+ <td>0.147</td>
508
+ <td>0.091</td>
509
+ <td>0.634</td>
510
+ <td>0.690</td>
511
+ <td>0.300</td>
512
+ </tr>
513
+ <tr>
514
+ <td rowspan="5"><strong>Expert<br>VLMs</strong></td>
515
+ <td>GOT-OCR</td>
516
+ <td>0.111</td>
517
+ <td>0.222</td>
518
+ <td>0.067</td>
519
+ <td>0.132</td>
520
+ <td>0.204</td>
521
+ <td>0.198</td>
522
+ <td>0.179</td>
523
+ <td>0.388</td>
524
+ <td>0.771</td>
525
+ <td>0.267</td>
526
+ </tr>
527
+ <tr>
528
+ <td>Nougat</td>
529
+ <td>0.734</td>
530
+ <td>0.958</td>
531
+ <td>1.000</td>
532
+ <td>0.820</td>
533
+ <td>0.930</td>
534
+ <td>0.830</td>
535
+ <td>0.214</td>
536
+ <td>0.991</td>
537
+ <td>0.871</td>
538
+ <td>0.806</td>
539
+ </tr>
540
+ <tr>
541
+ <td>Dolphin</td>
542
+ <td>0.091</td>
543
+ <td>0.131</td>
544
+ <td>0.057</td>
545
+ <td>0.146</td>
546
+ <td>0.231</td>
547
+ <td>0.121</td>
548
+ <td>0.074</td>
549
+ <td>0.363</td>
550
+ <td>0.307</td>
551
+ <td>0.177</td>
552
+ </tr>
553
+ <tr>
554
+ <td>OCRFlux</td>
555
+ <td>0.068</td>
556
+ <td>0.125</td>
557
+ <td>0.092</td>
558
+ <td>0.102</td>
559
+ <td>0.119</td>
560
+ <td>0.083</td>
561
+ <td>0.047</td>
562
+ <td>0.223</td>
563
+ <td>0.536</td>
564
+ <td>0.149</td>
565
+ </tr>
566
+ <tr>
567
+ <td>MonkeyOCR-pro-3B</td>
568
+ <td>0.084</td>
569
+ <td>0.129</td>
570
+ <td>0.060</td>
571
+ <td>0.090</td>
572
+ <td>0.107</td>
573
+ <td>0.073</td>
574
+ <td>0.050</td>
575
+ <td>0.171</td>
576
+ <td>0.107</td>
577
+ <td>0.100</td>
578
+ </tr>
579
+ <tr>
580
+ <td rowspan="4"><strong>General<br>VLMs</strong></td>
581
+ <td>GPT4o</td>
582
+ <td>0.157</td>
583
+ <td>0.163</td>
584
+ <td>0.348</td>
585
+ <td>0.187</td>
586
+ <td>0.281</td>
587
+ <td>0.173</td>
588
+ <td>0.146</td>
589
+ <td>0.607</td>
590
+ <td>0.751</td>
591
+ <td>0.316</td>
592
+ </tr>
593
+ <tr>
594
+ <td>Qwen2.5-VL-7B</td>
595
+ <td>0.148</td>
596
+ <td>0.053</td>
597
+ <td>0.111</td>
598
+ <td>0.137</td>
599
+ <td>0.189</td>
600
+ <td>0.117</td>
601
+ <td>0.134</td>
602
+ <td>0.204</td>
603
+ <td>0.706</td>
604
+ <td>0.205</td>
605
+ </tr>
606
+ <tr>
607
+ <td>InternVL3-8B</td>
608
+ <td>0.163</td>
609
+ <td>0.056</td>
610
+ <td>0.107</td>
611
+ <td>0.109</td>
612
+ <td>0.129</td>
613
+ <td>0.100</td>
614
+ <td>0.159</td>
615
+ <td>0.150</td>
616
+ <td>0.681</td>
617
+ <td>0.188</td>
618
+ </tr>
619
+ <tr>
620
+ <td>doubao-1-5-thinking-vision-pro-250428</td>
621
+ <td>0.048</td>
622
+ <td>0.048</td>
623
+ <td>0.024</td>
624
+ <td><strong>0.062</strong></td>
625
+ <td>0.085</td>
626
+ <td>0.051</td>
627
+ <td>0.039</td>
628
+ <td><strong>0.096</strong></td>
629
+ <td>0.181</td>
630
+ <td>0.073</td>
631
+ </tr>
632
+ <tr>
633
+ <td rowspan="1"><strong>Expert VLMs</strong></td>
634
+ <td><strong>dots.ocr</strong></td>
635
+ <td><strong>0.031</strong></td>
636
+ <td><strong>0.047</strong></td>
637
+ <td><strong>0.011</strong></td>
638
+ <td>0.082</td>
639
+ <td><strong>0.079</strong></td>
640
+ <td><strong>0.028</strong></td>
641
+ <td><strong>0.029</strong></td>
642
+ <td>0.109</td>
643
+ <td><strong>0.056</strong></td>
644
+ <td><strong>0.055</strong></td>
645
+ </tr>
646
+
647
+ </tbody>
648
+ </table>
649
+
650
+ > **Notes:**
651
+ > - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR), [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and our own internal evaluations.
652
+ > - We remove the Page-header and Page-footer cells from the result markdown.
653
+ > - We use the tikz_preprocess pipeline to upsample the images to 200 DPI.
654
+
655
+
656
+ ### 2. **dots.ocr-bench**
657
+
658
+ This is an in-house benchmark containing 1,493 PDF images across 100 languages.
659
+
660
+ #### The end-to-end evaluation results of different tasks.
661
+
662
+ <table>
663
+ <thead>
664
+ <tr>
665
+ <th rowspan="1"><strong>Methods</strong></th>
666
+ <th colspan="1"><strong>Overall<sup>Edit</sup>↓</strong></th>
667
+ <th colspan="1"><strong>Text<sup>Edit</sup>↓</strong></th>
668
+ <th colspan="1"><strong>Formula<sup>Edit</sup>↓</strong></th>
669
+ <th colspan="1"><strong>Table<sup>TEDS</sup>↑</strong></th>
670
+ <th colspan="1"><strong>Table<sup>Edit</sup>↓</strong></th>
671
+ <th colspan="1"><strong>Read Order<sup>Edit</sup>↓</strong></th>
672
+ </tr>
673
+ </thead>
674
+ <tbody>
675
+ <td>MonkeyOCR-3B</td>
676
+ <td>0.483</td>
677
+ <td>0.445</td>
678
+ <td>0.627</td>
679
+ <td>50.93</td>
680
+ <td>0.452</td>
681
+ <td>0.409</td>
682
+ </tr>
683
+ <tr>
684
+ <td>doubao-1-5-thinking-vision-pro-250428</td>
685
+ <td>0.291</td>
686
+ <td>0.226</td>
687
+ <td>0.440</td>
688
+ <td>71.2</td>
689
+ <td>0.260</td>
690
+ <td>0.238</td>
691
+ </tr>
692
+ <tr>
693
+ <td>doubao-1-6</td>
694
+ <td>0.299</td>
695
+ <td>0.270</td>
696
+ <td>0.417</td>
697
+ <td>71.0</td>
698
+ <td>0.258</td>
699
+ <td>0.253</td>
700
+ </tr>
701
+ <tr>
702
+ <td>Gemini2.5-Pro</td>
703
+ <td>0.251</td>
704
+ <td>0.163</td>
705
+ <td>0.402</td>
706
+ <td>77.1</td>
707
+ <td>0.236</td>
708
+ <td>0.202</td>
709
+ </tr>
710
+ <tr>
711
+ <td><strong>dots.ocr</strong> </td>
712
+ <td><strong>0.177</strong></td>
713
+ <td><strong>0.075</strong></td>
714
+ <td><strong>0.297</strong></td>
715
+ <td><strong>79.2</strong></td>
716
+ <td><strong>0.186</strong></td>
717
+ <td><strong>0.152</strong></td>
718
+ </tr>
719
+
720
+ </tbody>
721
+ </table>
722
+
723
+ > **Notes:**
724
+ > - We use the same metric calculation pipeline as [OmniDocBench](https://github.com/opendatalab/OmniDocBench).
725
+ > - We remove the Page-header and Page-footer cells from the result markdown.
726
+
727
+ #### Layout Detection
728
+
729
+ <table>
730
+ <thead>
731
+ <tr>
732
+ <th rowspan="2"><strong>Method</strong></th>
733
+ <th colspan="5" style="text-align: center;"><strong>F1@IoU=.50:.05:.95↑</strong></th>
734
+ <th colspan="5" style="text-align: center;"><strong>F1@IoU=.50↑</strong></th>
735
+ </tr>
736
+ <tr>
737
+ <th>Overall</th>
738
+ <th>Text</th>
739
+ <th>Formula</th>
740
+ <th>Table</th>
741
+ <th>Picture</th>
742
+ <th>Overall</th>
743
+ <th>Text</th>
744
+ <th>Formula</th>
745
+ <th>Table</th>
746
+ <th>Picture</th>
747
+ </tr>
748
+ </thead>
749
+
750
+ <tbody>
751
+ <td>DocLayout-YOLO-DocStructBench</td>
752
+ <td>0.733</td>
753
+ <td>0.694</td>
754
+ <td>0.480</td>
755
+ <td>0.803</td>
756
+ <td>0.619</td>
757
+ <td>0.806</td>
758
+ <td>0.779</td>
759
+ <td>0.620</td>
760
+ <td>0.858</td>
761
+ <td>0.678</td>
762
+ </tr>
763
+
764
+ <tr>
765
+ <td>dots.ocr-parse all</td>
766
+ <td>0.831</td>
767
+ <td>0.801</td>
768
+ <td>0.654</td>
769
+ <td>0.838</td>
770
+ <td>0.748</td>
771
+ <td>0.922</td>
772
+ <td>0.909</td>
773
+ <td>0.770</td>
774
+ <td>0.888</td>
775
+ <td>0.831</td>
776
+ </tr>
777
+
778
+ <tr>
779
+ <td> <strong>dots.ocr-detection only</strong> </td>
780
+ <td><strong>0.845</strong></td>
781
+ <td><strong>0.816</strong></td>
782
+ <td><strong>0.716</strong></td>
783
+ <td><strong>0.875</strong></td>
784
+ <td><strong>0.765</strong></td>
785
+ <td><strong>0.930</strong></td>
786
+ <td><strong>0.917</strong></td>
787
+ <td><strong>0.832</strong></td>
788
+ <td><strong>0.918</strong></td>
789
+ <td><strong>0.843</strong></td>
790
+ </tr>
791
+
792
+ </tbody>
793
+ </table>
794
+
795
+ > **Notes:**
796
+ > - We use prompt_layout_all_en for **parse all** and prompt_layout_only_en for **detection only**; please refer to [prompts](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py) for details.
797
+
798
+
799
+ ### 3. olmOCR-bench
800
+
801
+ <table>
802
+ <thead>
803
+ <tr>
804
+ <th>Model</th>
805
+ <th>ArXiv</th>
806
+ <th>Old Scans<br>Math</th>
807
+ <th>Tables</th>
808
+ <th>Old Scans</th>
809
+ <th>Headers and<br>Footers</th>
810
+ <th>Multi<br>column</th>
811
+ <th>Long Tiny<br>Text</th>
812
+ <th>Base</th>
813
+ <th>Overall</th>
814
+ </tr>
815
+ </thead>
816
+ <tbody>
817
+ <tr>
818
+ <td>GOT OCR</td>
819
+ <td>52.7</td>
820
+ <td>52.0</td>
821
+ <td>0.2</td>
822
+ <td>22.1</td>
823
+ <td>93.6</td>
824
+ <td>42.0</td>
825
+ <td>29.9</td>
826
+ <td>94.0</td>
827
+ <td>48.3 ± 1.1</td>
828
+ </tr>
829
+ <tr>
830
+ <td>Marker</td>
831
+ <td>76.0</td>
832
+ <td>57.9</td>
833
+ <td>57.6</td>
834
+ <td>27.8</td>
835
+ <td>84.9</td>
836
+ <td>72.9</td>
837
+ <td>84.6</td>
838
+ <td>99.1</td>
839
+ <td>70.1 ± 1.1</td>
840
+ </tr>
841
+ <tr>
842
+ <td>MinerU</td>
843
+ <td>75.4</td>
844
+ <td>47.4</td>
845
+ <td>60.9</td>
846
+ <td>17.3</td>
847
+ <td><strong>96.6</strong></td>
848
+ <td>59.0</td>
849
+ <td>39.1</td>
850
+ <td>96.6</td>
851
+ <td>61.5 ± 1.1</td>
852
+ </tr>
853
+ <tr>
854
+ <td>Mistral OCR</td>
855
+ <td>77.2</td>
856
+ <td>67.5</td>
857
+ <td>60.6</td>
858
+ <td>29.3</td>
859
+ <td>93.6</td>
860
+ <td>71.3</td>
861
+ <td>77.1</td>
862
+ <td>99.4</td>
863
+ <td>72.0 ± 1.1</td>
864
+ </tr>
865
+ <tr>
866
+ <td>Nanonets OCR</td>
867
+ <td>67.0</td>
868
+ <td>68.6</td>
869
+ <td>77.7</td>
870
+ <td>39.5</td>
871
+ <td>40.7</td>
872
+ <td>69.9</td>
873
+ <td>53.4</td>
874
+ <td>99.3</td>
875
+ <td>64.5 ± 1.1</td>
876
+ </tr>
877
+ <tr>
878
+ <td>GPT-4o<br>(No Anchor)</td>
879
+ <td>51.5</td>
880
+ <td><strong>75.5</strong></td>
881
+ <td>69.1</td>
882
+ <td>40.9</td>
883
+ <td>94.2</td>
884
+ <td>68.9</td>
885
+ <td>54.1</td>
886
+ <td>96.7</td>
887
+ <td>68.9 ± 1.1</td>
888
+ </tr>
889
+ <tr>
890
+ <td>GPT-4o<br>(Anchored)</td>
891
+ <td>53.5</td>
892
+ <td>74.5</td>
893
+ <td>70.0</td>
894
+ <td>40.7</td>
895
+ <td>93.8</td>
896
+ <td>69.3</td>
897
+ <td>60.6</td>
898
+ <td>96.8</td>
899
+ <td>69.9 ± 1.1</td>
900
+ </tr>
901
+ <tr>
902
+ <td>Gemini Flash 2<br>(No Anchor)</td>
903
+ <td>32.1</td>
904
+ <td>56.3</td>
905
+ <td>61.4</td>
906
+ <td>27.8</td>
907
+ <td>48.0</td>
908
+ <td>58.7</td>
909
+ <td><strong>84.4</strong></td>
910
+ <td>94.0</td>
911
+ <td>57.8 ± 1.1</td>
912
+ </tr>
913
+ <tr>
914
+ <td>Gemini Flash 2<br>(Anchored)</td>
915
+ <td>54.5</td>
916
+ <td>56.1</td>
917
+ <td>72.1</td>
918
+ <td>34.2</td>
919
+ <td>64.7</td>
920
+ <td>61.5</td>
921
+ <td>71.5</td>
922
+ <td>95.6</td>
923
+ <td>63.8 ± 1.2</td>
924
+ </tr>
925
+ <tr>
926
+ <td>Qwen 2 VL<br>(No Anchor)</td>
927
+ <td>19.7</td>
928
+ <td>31.7</td>
929
+ <td>24.2</td>
930
+ <td>17.1</td>
931
+ <td>88.9</td>
932
+ <td>8.3</td>
933
+ <td>6.8</td>
934
+ <td>55.5</td>
935
+ <td>31.5 ± 0.9</td>
936
+ </tr>
937
+ <tr>
938
+ <td>Qwen 2.5 VL<br>(No Anchor)</td>
939
+ <td>63.1</td>
940
+ <td>65.7</td>
941
+ <td>67.3</td>
942
+ <td>38.6</td>
943
+ <td>73.6</td>
944
+ <td>68.3</td>
945
+ <td>49.1</td>
946
+ <td>98.3</td>
947
+ <td>65.5 ± 1.2</td>
948
+ </tr>
949
+ <tr>
950
+ <td>olmOCR v0.1.75<br>(No Anchor)</td>
951
+ <td>71.5</td>
952
+ <td>71.4</td>
953
+ <td>71.4</td>
954
+ <td><strong>42.8</strong></td>
955
+ <td>94.1</td>
956
+ <td>77.7</td>
957
+ <td>71.0</td>
958
+ <td>97.8</td>
959
+ <td>74.7 ± 1.1</td>
960
+ </tr>
961
+ <tr>
962
+ <td>olmOCR v0.1.75<br>(Anchored)</td>
963
+ <td>74.9</td>
964
+ <td>71.2</td>
965
+ <td>71.0</td>
966
+ <td>42.2</td>
967
+ <td>94.5</td>
968
+ <td>78.3</td>
969
+ <td>73.3</td>
970
+ <td>98.3</td>
971
+ <td>75.5 ± 1.0</td>
972
+ </tr>
973
+ <tr>
974
+ <td>MonkeyOCR-pro-3B</td>
975
+ <td><strong>83.8</strong></td>
976
+ <td>68.8</td>
977
+ <td>74.6</td>
978
+ <td>36.1</td>
979
+ <td>91.2</td>
980
+ <td>76.6</td>
981
+ <td>80.1</td>
982
+ <td>95.3</td>
983
+ <td>75.8 ± 1.0</td>
984
+ </tr>
985
+ <tr>
986
+ <td><strong>dots.ocr</strong></td>
987
+ <td>82.1</td>
988
+ <td>64.2</td>
989
+ <td><strong>88.3</strong></td>
990
+ <td>40.9</td>
991
+ <td>94.1</td>
992
+ <td><strong>82.4</strong></td>
993
+ <td>81.2</td>
994
+ <td><strong>99.5</strong></td>
995
+ <td><strong>79.1 ± 1.0</strong></td>
996
+ </tr>
997
+ </tbody>
998
+ </table>
999
+
1000
+
1001
+ > **Note:**
1002
+ > - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
1003
+ > [olmOCR](https://github.com/allenai/olmocr), and our own internal evaluations.
1004
+ > - We remove the Page-header and Page-footer cells from the result markdown.
1005
+
1006
+
1007
+
1008
+ # Quick Start
1009
+ ## 1. Installation
1010
+ ### Install dots.ocr
1011
+ ```shell
1012
+ conda create -n dots_ocr python=3.12
1013
+ conda activate dots_ocr
1014
+
1015
+ git clone https://github.com/rednote-hilab/dots.ocr.git
1016
+ cd dots.ocr
1017
+
1018
+ # Install pytorch, see https://pytorch.org/get-started/previous-versions/ for your cuda version
1019
+ pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
1020
+ pip install -e .
1021
+ ```
1022
+
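After `pip install -e .`, a quick import check can confirm that PyTorch sees your GPU and that the package is importable. This is only an optional sanity-check sketch, not part of the official setup:

```python
# Optional sanity check (not part of the official setup): verify the CUDA build
# of PyTorch and the editable install of the dots_ocr package.
import torch
import dots_ocr  # installed above via `pip install -e .`

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("dots_ocr import OK")
```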
1023
+ If you have trouble with the installation, try our [Docker Image](https://hub.docker.com/r/rednotehilab/dots.ocr) for an easier setup, and follow these steps:
1024
+ ```shell
1025
+ git clone https://github.com/rednote-hilab/dots.ocr.git
1026
+ cd dots.ocr
1027
+ pip install -e .
1028
+ ```
1029
+
1030
+
1031
+ ### Download Model Weights
1032
+ > 💡**Note:** Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path. This is a temporary workaround pending our integration with Transformers.
1033
+ ```shell
1034
+ python3 tools/download_model.py
1035
+ ```
1036
+
1037
+
1038
+ ## 2. Deployment
1039
+ ### vLLM inference
1040
+ We highly recommend using vLLM for deployment and inference. All of our evaluation results are based on vLLM version 0.9.1.
1041
+ The [Docker Image](https://hub.docker.com/r/rednotehilab/dots.ocr) is based on the official vllm image. You can also follow [Dockerfile](https://github.com/rednote-hilab/dots.ocr/blob/master/docker/Dockerfile) to build the deployment environment by yourself.
1042
+
1043
+ ```shell
1044
+ # You need to register model to vllm at first
1045
+ python3 tools/download_model.py
1046
+ export hf_model_path=./weights/DotsOCR # Path to your downloaded model weights. Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path; this is a temporary workaround pending our integration with Transformers.
1047
+ export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH
1048
+ sed -i '/^from vllm\.entrypoints\.cli\.main import main$/a\
1049
+ from DotsOCR import modeling_dots_ocr_vllm' `which vllm` # If you downloaded the model weights yourself, replace `DotsOCR` with the directory name you saved the model under, and remember to use a name without periods (e.g., `DotsOCR` instead of `dots.ocr`)
1050
+
1051
+ # launch vllm server
1052
+ CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --chat-template-content-format string --served-model-name model --trust-remote-code
1053
+
1054
+ # If you get a ModuleNotFoundError: No module named 'DotsOCR', please check the note above on the saved model directory name.
1055
+
1056
+ # vllm api demo
1057
+ python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en
1058
+ ```
1059
+
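Once the server is running, it exposes vLLM's OpenAI-compatible chat API, which `demo/demo_vllm.py` wraps. The sketch below is only an illustration: the port (vLLM's default 8000), the base64 image payload, the token budget, and the abbreviated prompt are assumptions, so adapt them to your deployment and take the full prompt text from `dots_ocr/utils/prompts.py`.

```python
# Minimal OpenAI-compatible client sketch for the vLLM server started above.
# Assumptions: the server listens on localhost:8000 and was launched with
# `--served-model-name model`; adjust both to match your deployment.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("demo/demo_image1.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Abbreviated here; use the full prompt_layout_all_en from dots_ocr/utils/prompts.py.
prompt = "Please output the layout information from the PDF image, ..."

response = client.chat.completions.create(
    model="model",  # matches --served-model-name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": prompt},
        ],
    }],
    max_tokens=8192,   # raise for dense pages
    temperature=0.0,
)
print(response.choices[0].message.content)
```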
1060
+ ### Hugging Face inference
1061
+ ```shell
1062
+ python3 demo/demo_hf.py
1063
+ ```
1064
+
1065
+ <details>
1066
+ <summary><b>Hugging Face inference details</b></summary>
1067
+
1068
+ ```python
1069
+ import torch
1070
+ from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
1071
+ from qwen_vl_utils import process_vision_info
1072
+ from dots_ocr.utils import dict_promptmode_to_prompt
1073
+
1074
+ model_path = "./weights/DotsOCR"
1075
+ model = AutoModelForCausalLM.from_pretrained(
1076
+ model_path,
1077
+ attn_implementation="flash_attention_2",
1078
+ torch_dtype=torch.bfloat16,
1079
+ device_map="auto",
1080
+ trust_remote_code=True
1081
+ )
1082
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
1083
+
1084
+ image_path = "demo/demo_image1.jpg"
1085
+ prompt = """Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
1086
+
1087
+ 1. Bbox format: [x1, y1, x2, y2]
1088
+
1089
+ 2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].
1090
+
1091
+ 3. Text Extraction & Formatting Rules:
1092
+ - Picture: For the 'Picture' category, the text field should be omitted.
1093
+ - Formula: Format its text as LaTeX.
1094
+ - Table: Format its text as HTML.
1095
+ - All Others (Text, Title, etc.): Format their text as Markdown.
1096
+
1097
+ 4. Constraints:
1098
+ - The output text must be the original text from the image, with no translation.
1099
+ - All layout elements must be sorted according to human reading order.
1100
+
1101
+ 5. Final Output: The entire output must be a single JSON object.
1102
+ """
1103
+
1104
+ messages = [
1105
+ {
1106
+ "role": "user",
1107
+ "content": [
1108
+ {
1109
+ "type": "image",
1110
+ "image": image_path
1111
+ },
1112
+ {"type": "text", "text": prompt}
1113
+ ]
1114
+ }
1115
+ ]
1116
+
1117
+ # Preparation for inference
1118
+ text = processor.apply_chat_template(
1119
+ messages,
1120
+ tokenize=False,
1121
+ add_generation_prompt=True
1122
+ )
1123
+ image_inputs, video_inputs = process_vision_info(messages)
1124
+ inputs = processor(
1125
+ text=[text],
1126
+ images=image_inputs,
1127
+ videos=video_inputs,
1128
+ padding=True,
1129
+ return_tensors="pt",
1130
+ )
1131
+
1132
+ inputs = inputs.to("cuda")
1133
+
1134
+ # Inference: Generation of the output
1135
+ generated_ids = model.generate(**inputs, max_new_tokens=24000)
1136
+ generated_ids_trimmed = [
1137
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
1138
+ ]
1139
+ output_text = processor.batch_decode(
1140
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
1141
+ )
1142
+ print(output_text)
1143
+
1144
+ ```
1145
+
1146
+ </details>
1147
+
1148
+ ## 3. Document Parse
1149
+ **Based on the vLLM server**, you can parse an image or a PDF file with the following commands:
1150
+ ```bash
1151
+
1152
+ # Parse all layout info, both detection and recognition
1153
+ # Parse a single image
1154
+ python3 dots_ocr/parser.py demo/demo_image1.jpg
1155
+ # Parse a single PDF
1156
+ python3 dots_ocr/parser.py demo/demo_pdf1.pdf --num_threads 64 # try a larger num_threads for PDFs with many pages
1157
+
1158
+ # Layout detection only
1159
+ python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en
1160
+
1161
+ # Parse text only, except Page-header and Page-footer
1162
+ python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr
1163
+
1164
+ # Parse layout info by bbox
1165
+ python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_grounding_ocr --bbox 163 241 1536 705
1166
+
1167
+ ```
1168
+
1169
+ <details>
1170
+ <summary><b>Output Results</b></summary>
1171
+
1172
+ 1. **Structured Layout Data** (`demo_image1.json`): A JSON file containing the detected layout elements, including their bounding boxes, categories, and extracted text.
1173
+ 2. **Processed Markdown File** (`demo_image1.md`): A Markdown file generated from the concatenated text of all detected cells.
1174
+ * An additional version, `demo_image1_nohf.md`, is also provided, which excludes page headers and footers for compatibility with benchmarks like OmniDocBench and olmOCR-bench (see the post-processing sketch below).
1175
+ 3. **Layout Visualization** (`demo_image1.jpg`): The original image with the detected layout bounding boxes drawn on it.
1176
+
1177
+ </details>
1178
+
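The structured JSON output is easy to post-process. The sketch below is an illustration (assuming each cell is a dict with `bbox`, `category`, and an optional `text` field, as specified by the layout prompt) that rebuilds a header/footer-free Markdown file in the same spirit as `demo_image1_nohf.md`:

```python
# Sketch: turn the parser's JSON output into Markdown, skipping headers/footers.
# Field names follow the layout prompt above; verify against your own output.
import json

with open("demo_image1.json", "r", encoding="utf-8") as f:
    cells = json.load(f)

skip = {"Page-header", "Page-footer"}
lines = []
for cell in cells:  # cells are already sorted in reading order by the model
    if cell.get("category") in skip:
        continue
    text = cell.get("text")  # 'Picture' cells have no text field
    if text:
        lines.append(text)

with open("demo_image1_nohf_rebuilt.md", "w", encoding="utf-8") as f:
    f.write("\n\n".join(lines))
```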
1179
+ ## 4. Demo
1180
+ You can run the demo with the following command, or try it directly at the [live demo](https://dotsocr.xiaohongshu.com/):
1181
+ ```bash
1182
+ python demo/demo_gradio.py
1183
+ ```
1184
+
1185
+ We also provide a demo for grounding OCR:
1186
+ ```bash
1187
+ python demo/demo_gradio_annotion.py
1188
+ ```
1189
+
1190
+
1191
+ ### Example for formula document
1192
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/formula1.png" alt="formula1.png" border="0" />
1193
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/formula2.png" alt="formula2.png" border="0" />
1194
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/formula3.png" alt="formula3.png" border="0" />
1195
+
1196
+ ### Example for table document
1197
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/table1.png" alt="table1.png" border="0" />
1198
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/table2.png" alt="table2.png" border="0" />
1199
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/table3.png" alt="table3.png" border="0" />
1200
+
1201
+ ### Example for multilingual document
1202
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/Tibetan.png" alt="Tibetan.png" border="0" />
1203
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/tradition_zh.png" alt="tradition_zh.png" border="0" />
1204
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/nl.png" alt="nl.png" border="0" />
1205
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/kannada.png" alt="kannada.png" border="0" />
1206
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/russian.png" alt="russian.png" border="0" />
1207
+
1208
+ ### Example for reading order
1209
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/reading_order.png" alt="reading_order.png" border="0" />
1210
+
1211
+ ### Example for grounding OCR
1212
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/grounding.png" alt="grounding.png" border="0" />
1213
+
1214
+
1215
+ ## Acknowledgments
1216
+ We would like to thank [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [aimv2](https://github.com/apple/ml-aim), [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
1217
+ [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and [PyMuPDF](https://github.com/pymupdf/PyMuPDF) for providing code and models.
1218
+
1219
+ We also thank [DocLayNet](https://github.com/DS4SD/DocLayNet), [M6Doc](https://github.com/HCIILAB/M6Doc), [CDLA](https://github.com/buptlihang/CDLA), [D4LA](https://github.com/AlibabaResearch/AdvancedLiterateMachinery) for providing valuable datasets.
1220
+
1221
+ ## Limitation & Future Work
1222
+
1223
+ - **Complex Document Elements:**
1224
+ - **Table & Formula**: dots.ocr is not yet perfect at extracting highly complex tables and formulas.
1225
+ - **Picture**: Pictures in documents are currently not parsed.
1226
+
1227
+ - **Parsing Failures:** The model may fail to parse under certain conditions:
1228
+ - When the character-to-pixel ratio is excessively high. Try enlarging the image or increasing the PDF parsing DPI (a setting of 200 is recommended). However, please note that the model performs optimally on images with a resolution under 11,289,600 pixels.
1229
+ - Continuous special characters, such as ellipses (`...`) and underscores (`_`), may cause the prediction output to repeat endlessly. In such scenarios, consider using alternative prompts like `prompt_layout_only_en`, `prompt_ocr`, or `prompt_grounding_ocr` ([details here](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)).
1230
+
1231
+ - **Performance Bottleneck:** Despite its 1.7B parameter LLM foundation, **dots.ocr** is not yet optimized for high-throughput processing of large PDF volumes.
1232
+
1233
+ We are committed to achieving more accurate table and formula parsing, as well as enhancing the model's OCR capabilities for broader generalization, all while aiming for **a more powerful, more efficient model**. Furthermore, we are actively considering the development of **a more general-purpose perception model** based on Vision-Language Models (VLMs), which would integrate general detection, image captioning, and OCR tasks into a unified framework. **Parsing the content of the pictures in the documents** is also a key priority for our future work.
1234
+ We believe that collaboration is the key to tackling these exciting challenges. If you are passionate about advancing the frontiers of document intelligence and are interested in contributing to these future endeavors, we would love to hear from you. Please reach out to us via email at: [yanqing4@xiaohongshu.com].
chat_template.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{%- for m in messages %}{%- if m.role == 'system' %}{{- '<|system|>' + m.content + '<|endofsystem|>\n' }}{%- elif m.role == 'user' %}{% if m.content is string %}{{- '<|user|>' + m.content + '<|endofuser|>' }}{% else %} {% for content in m.content %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|img|><|imgpad|><|endofimg|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|img|><|video_pad|><|endofimg|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}{%- endif %}{%- elif m.role == 'assistant' %}{{- '<|assistant|>' + m.content }}{%- if not loop.last %}{{- '<|endofassistant|>' }}{%- endif %}{%- endif %}{%- endfor %}{%- if messages[-1].role != 'assistant' %}{{- '<|assistant|>' }}{%- endif %}"
3
+ }
config.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "architectures": [
3
+ "DotsOCRForCausalLM"
4
+ ],
5
+ "model_type": "dots_ocr",
6
+ "auto_map": {
7
+ "AutoConfig": "configuration_dots.DotsOCRConfig",
8
+ "AutoModelForCausalLM": "modeling_dots_ocr.DotsOCRForCausalLM"
9
+ },
10
+ "attention_bias": true,
11
+ "attention_dropout": 0.0,
12
+ "hidden_act": "silu",
13
+ "hidden_size": 1536,
14
+ "initializer_range": 0.02,
15
+ "intermediate_size": 8960,
16
+ "max_position_embeddings": 131072,
17
+ "max_window_layers": 28,
18
+ "num_attention_heads": 12,
19
+ "num_hidden_layers": 28,
20
+ "num_key_value_heads": 2,
21
+ "rms_norm_eps": 1e-06,
22
+ "rope_scaling": null,
23
+ "rope_theta": 1000000,
24
+ "sliding_window": 131072,
25
+ "tie_word_embeddings": false,
26
+ "torch_dtype": "bfloat16",
27
+ "transformers_version": "4.51.0",
28
+ "use_cache": true,
29
+ "use_sliding_window": false,
30
+ "vocab_size": 151936,
31
+ "image_token_id": 151665,
32
+ "video_token_id": 151656,
33
+ "vision_config": {
34
+ "embed_dim": 1536,
35
+ "hidden_size": 1536,
36
+ "intermediate_size": 4224,
37
+ "num_hidden_layers": 42,
38
+ "num_attention_heads": 12,
39
+ "num_channels": 3,
40
+ "patch_size": 14,
41
+ "post_norm": true,
42
+ "rms_norm_eps": 1e-05,
43
+ "spatial_merge_size": 2,
44
+ "temporal_patch_size": 1,
45
+ "use_bias": false,
46
+ "attn_implementation": "flash_attention_2",
47
+ "init_merger_std": 0.02,
48
+ "initializer_range": 0.02,
49
+ "is_causal": false
50
+ }
51
+ }
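For intuition, the `vision_config` above also determines roughly how many visual tokens a page consumes: 14×14-pixel patches are merged in `spatial_merge_size` × `spatial_merge_size` groups before reaching the language model. The estimate below is only a back-of-the-envelope sketch based on those two fields and the usual merger semantics; the processor's actual resizing rules change the exact count.

```python
# Back-of-the-envelope visual-token estimate from vision_config
# (patch_size=14, spatial_merge_size=2); ignores the processor's resize rules.
def approx_image_tokens(height: int, width: int, patch_size: int = 14, merge: int = 2) -> int:
    patches = (height // patch_size) * (width // patch_size)
    return patches // (merge * merge)

print(approx_image_tokens(1024, 768))   # ~985 tokens
print(approx_image_tokens(3360, 3360))  # ~14400 tokens; 3360*3360 = 11,289,600 px, the practical limit noted in the README
```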
configuration_dots.py ADDED
@@ -0,0 +1,76 @@
1
+ from typing import Any, Optional
2
+ from transformers.configuration_utils import PretrainedConfig
3
+ from transformers.models.qwen2 import Qwen2Config
4
+ from transformers import Qwen2_5_VLProcessor, AutoProcessor
5
+ from transformers.models.auto.configuration_auto import CONFIG_MAPPING
6
+
7
+
8
+ class DotsVisionConfig(PretrainedConfig):
9
+ model_type: str = "dots_vit"
10
+
11
+ def __init__(
12
+ self,
13
+ embed_dim: int = 1536, # vision encoder embed size
14
+ hidden_size: int = 1536, # after merger hidden size
15
+ intermediate_size: int = 4224,
16
+ num_hidden_layers: int = 42,
17
+ num_attention_heads: int = 12,
18
+ num_channels: int = 3,
19
+ patch_size: int = 14,
20
+ spatial_merge_size: int = 2,
21
+ temporal_patch_size: int = 1,
22
+ rms_norm_eps: float = 1e-5,
23
+ use_bias: bool = False,
24
+ attn_implementation="flash_attention_2", # "eager","sdpa","flash_attention_2"
25
+ initializer_range=0.02,
26
+ init_merger_std=0.02,
27
+ is_causal=False, # ve causal forward
28
+ post_norm=True,
29
+ gradient_checkpointing=False,
30
+ **kwargs: Any,
31
+ ):
32
+ super().__init__(**kwargs)
33
+ self.embed_dim = embed_dim
34
+ self.hidden_size = hidden_size
35
+ self.intermediate_size = intermediate_size
36
+ self.num_hidden_layers = num_hidden_layers
37
+ self.num_attention_heads = num_attention_heads
38
+ self.num_channels = num_channels
39
+ self.patch_size = patch_size
40
+ self.spatial_merge_size = spatial_merge_size
41
+ self.temporal_patch_size = temporal_patch_size
42
+ self.rms_norm_eps = rms_norm_eps
43
+ self.use_bias = use_bias
44
+ self.attn_implementation = attn_implementation
45
+ self.initializer_range = initializer_range
46
+ self.init_merger_std = init_merger_std
47
+ self.is_causal = is_causal
48
+ self.post_norm = post_norm
49
+ self.gradient_checkpointing = gradient_checkpointing
50
+
51
+
52
+
53
+ class DotsOCRConfig(Qwen2Config):
54
+ model_type = "dots_ocr"
55
+ def __init__(self,
56
+ image_token_id = 151665,
57
+ video_token_id = 151656,
58
+ vision_config: Optional[dict] = None, *args, **kwargs):
59
+ super().__init__(*args, **kwargs)
60
+ self.image_token_id = image_token_id
61
+ self.video_token_id = video_token_id
62
+ self.vision_config = DotsVisionConfig(**(vision_config or {}))
63
+
64
+ def save_pretrained(self, save_directory, **kwargs):
65
+ self._auto_class = None
66
+ super().save_pretrained(save_directory, **kwargs)
67
+
68
+
69
+ class DotsVLProcessor(Qwen2_5_VLProcessor):
70
+ def __init__(self, image_processor=None, tokenizer=None, chat_template=None, **kwargs):
71
+ super().__init__(image_processor, tokenizer, chat_template=chat_template)
72
+ self.image_token = "<|imgpad|>" if not hasattr(tokenizer, "image_token") else tokenizer.image_token
73
+
74
+
75
+ AutoProcessor.register("dots_ocr", DotsVLProcessor)
76
+ CONFIG_MAPPING.register("dots_ocr", DotsOCRConfig)
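With the weights downloaded (e.g., to `./weights/DotsOCR` as in the Quick Start), the custom config registered above can be inspected through the standard `trust_remote_code` path; a minimal sketch:

```python
# Sketch: load the custom config via the auto_map in config.json
# (assumes the weights were downloaded to ./weights/DotsOCR as in the README).
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("./weights/DotsOCR", trust_remote_code=True)
print(cfg.model_type)                        # "dots_ocr"
print(cfg.vision_config.patch_size)          # 14
print(cfg.vision_config.spatial_merge_size)  # 2
```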
dots.ocr LICENSE AGREEMENT ADDED
@@ -0,0 +1,109 @@
1
+ dots.ocr LICENSE AGREEMENT
2
+
3
+ Effective Date: [August 8, 2025]
4
+
5
+ Copyright Holder: [Xingyin Information Technology (Shanghai) Co., Ltd]
6
+
7
+ This License Agreement (“Agreement”) governs Your use, reproduction, modification, and distribution of dots.ocr (the "Model Materials"). This Agreement is designed to maximize the openness and use of the Model Materials while addressing the unique legal, ethical, and technical challenges posed by large language models.
8
+
9
+ WHEREAS, Licensor has developed the dots.ocr document parsing model and intends to distribute the Model Materials under an open‑source framework;
10
+ WHEREAS, traditional open-source licenses (e.g., the MIT License) may not fully address the inherent complexities of document parsing models, namely their multiple components (code, weights, training data), potential ethical risks, data‑governance issues, and intellectual‑property and liability questions regarding AI‑generated content;
11
+ WHEREAS, Licensor seeks to provide a legal framework that ensures maximum access to and use of the Model Materials while clearly defining the rights, obligations, and liabilities of Licensee;
12
+
13
+ THEREFORE, the parties agree that, subject to the MIT License, they shall be bound by the following terms and conditions:
14
+
15
+ 1. Definitions and Interpretation
16
+ Purpose: To define key terms used in this Agreement, particularly "Model Materials," ensuring clarity of the license scope beyond traditional software code. To clarify the order of precedence between this Agreement and the MIT License to avoid conflict.
17
+
18
+ 1.1 “Licensor” shall mean the entity providing the Model Materials under this Agreement, namely [Xingyin Information Technology (Shanghai) Co., Ltd].
19
+
20
+ 1.2 “Licensee” or "You" shall mean any individual or entity exercising permissions granted by this Agreement.
21
+
22
+ 1.3 “Model Materials” shall mean all materials provided by Licensor under this Agreement, including but not limited to:
23
+         (a) one or more machine‑learning models, including architecture and trained parameters (i.e., model weights);
24
+         (b) all associated preprocessing, training, inference, and fine‑tuning code;
25
+         (c) training datasets and evaluation scripts (or their detailed descriptions and access mechanisms); and
26
+         (d) any accompanying documentation, metadata, and tools.
27
+ The above Model Materials shall be subject to the content published on the Licensor’s website or GitHub repository at https://github.com/rednote-hilab/dots.ocr.
28
+
29
+ 1.4 “Outputs” shall mean any content generated through the use of the Model Materials, such as text, tables, code, layout information, and formulas extracted from documents.
30
+
31
+ 1.5 “MIT License” shall mean The MIT Open Source License published by the Massachusetts Institute of Technology.
32
+
33
+ 1.6   Priority of Agreement. In the event of any conflict or inconsistency between this Agreement and the MIT License, the terms of the MIT License shall prevail. However, if the terms of the MIT License are ambiguous or silent on a particular matter, the provisions of this Agreement shall apply and supplement the MIT License.
34
+
35
+ 2. Grant of Rights and Scope of Use
36
+
37
+ Purpose: To grant broad, permissive rights to the Licensee for the Model Materials—including code, weights, data, and documentation—to ensure maximum openness and flexibility while clarifying the free use of model-generated content. Additionally, it clarifies the feasibility of transitioning from open-source to commercial‑use and the use of OpenAPI interfaces.
38
+
39
+ 2.1   Grant of Copyright License. Subject to Licensee's compliance with this Agreement, Licensor hereby grants Licensee a perpetual, worldwide, non‑exclusive, no-charge, royalty‑free copyright license to use (run or test), reproduce, modify, create derivative works of, merge, publish, distribute the Model Materials; sublicense and/or sell copies of the Model Materials or any derivative works thereof; and incorporate the unmodified or modified Model Materials into proprietary products or services, including for commercial purposes, software‑as‑a‑service (SaaS) offerings, or via OpenAPI or other interfaces.
40
+
41
+ 2.2   Fundamental Capabilities. The Model Materials only provide the fundamental model’s capabilities. Licensees may develop derivative AI applications or undertake task‑specific training thereon.
42
+
43
+ 2.3   From Open Source to Commercial Use. The open-source release does not preclude Licensor’s commercial exploitation of the Model Materials, in whole or in part. Any such commercial use shall, at that time, be subject to license agreements between Licensor and applicable users.
44
+
45
+ 2.4   API‑Service Exception. Licensees who access the Model Materials through API calls or provide model services via API interfaces (without directly distributing model weights) shall not be subject to this Agreement unless otherwise expressly agreed. Instead, such use shall be governed by the API terms of use published by Licensor (if any).
46
+
47
+ 3. Acceptable Use Policy and Prohibited Uses
48
+
49
+ 3.1   Responsible Use. Licensee must use the Model Materials in a responsible, ethical, and lawful manner, in compliance with all applicable laws, regulations, industry standards, and best practices.
50
+
51
+ 3.2   Enterprise On‑Premises Deployment. The Licensee may deploy the Model Materials in closed‑source, on‑premises enterprise environments.
52
+
53
+ 3.3   Prohibited Uses. Any breach of the prohibitions below will result in the automatic termination of all licenses granted under this Agreement. Licensee agrees not to use the Model Materials or any derivative works thereof, in connection with:
54
+ (a) Identification and Utilization of Illegal/Harmful Content: Includes identifying graphic/text materials used for counterfeiting certificates/invoices, perpetrating fraud, or launching cyberattacks; or processing images containing illegal content such as violence, criminal activities, disinformation, or child exploitation.
55
+ (b) Privacy Infringement and Discriminatory Practices: Extracting personal sensitive information (e.g., ID numbers, medical records, biometric data) or protected characteristics (e.g., race, gender) from images without legal authorization or consent, for purposes of privacy violation, automated discriminatory decision-making, or harassment.
56
+ (c) Copyright Restrictions: Licensees shall not use the tool for unauthorized digitization of publications/document scanning or bulk scraping of content. Any use involving publications or other copyright-protected materials must first obtain relevant permissions.
57
+
58
+ 4. Intellectual Property Ownership and Contributions
59
+
60
+ 4.1   Licensor's Copyright Reservation. Licensor reserves all right, title, and interest in and to the Model Materials (including the model architecture, parameters, code, and original training data), except as expressly licensed herein. The original copyright of the Model Materials belongs to the Licensor.
61
+
62
+ 4.2   Patent License. Subject to the terms and conditions of this Agreement, Licensor hereby grants Licensee a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Model Materials, where such license applies only to those patent claims licensable by the Licensor that are necessarily infringed by its contribution(s).
63
+ If Licensee institutes patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Model Materials constitute direct or contributory patent infringement, then any patent licenses granted under this License for the Model Materials shall terminate as of the date such litigation is asserted or filed.
64
+
65
+ 4.3   Outputs: The Outputs generated through the use of the Model Materials generally refer to text, tables, layouts, and other content extracted from documents or images. The extracted content itself does not generate new intellectual property rights, and all intellectual property remains with the original authors or copyright holders. The Licensee is responsible for due diligence regarding the legality of the Outputs, particularly where the content extracted by the OCR model may be substantially similar to existing copyrighted works, which could present intellectual property infringement risks. The Licensor assumes no liability for such infringements.
66
+ 4.4   Trademarks. Nothing in this License permits Licensee to make use of Licensor’s trademarks, trade names, logos (e.g., “rednote,” “Xiaohongshu,” “dots.ocr”) or to otherwise suggest endorsement or misrepresent the relationship between the parties, unless Licensor’s prior written approval is granted.
67
+
68
+ 5. Data Governance, Privacy, and Security
69
+
70
+ 5.1   Data Quality and Bias. Licensee shall use training data from lawful sources and is encouraged to conduct due diligence before deploying the Model Materials and to take reasonable steps to mitigate any known biases in its training data or applications.
71
+
72
+ 5.2   Privacy Protection.
73
+         (a) Sensitive‑Data Restrictions. It is prohibited to use the Model Materials to process, extract, or infer sensitive personal data protected under specific laws (such as GDPR or HIPAA), particularly when dealing with documents containing personally identifiable information (such as ID numbers, health data, financial information, etc.), unless Licensee has obtained all necessary consents, lawful bases, or authorizations, and has implemented adequate anonymization, pseudonymization, or other privacy-enhancing technologies.
74
+         (b) Data Minimization and Purpose Limitation. The Licensee shall follow the principle of data minimization when using the OCR Model, processing only the user data necessary for specific, explicit, and lawful purposes. Specifically, the OCR Model should avoid processing unnecessary sensitive data and ensure compliance with applicable privacy protection laws during data handling.
75
+         (c) Transparency. Licensee shall provide clear and transparent privacy policies and terms of use when processing user data, particularly during document scanning and information extraction.
76
+
77
+ 5.3   Security Measures. Licensee shall implement appropriate technical and administrative safeguards to protect the Model Materials and any associated data against unauthorized access, disclosure, alteration, or destruction. Such measures may include, but are not limited to, encryption, access controls, logging, and audit trails.
78
+
79
+ 5.4   Further Training. Licensee may use user‑provided input or Outputs for training, fine-tuning, or improving other AI models only if it has obtained the specific and informed consent of the data subjects.
80
+
81
+ 6. Disclaimer of Warranty and Limitation of Liability
82
+
83
+ 6.1   “AS IS” Basis. Unless required by applicable law, the Model Materials are provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. Licensee is solely responsible for determining the appropriateness of using or redistributing the Model Materials and assumes any risks associated with the exercise of permissions under this License. Licensor does not provide any warranty of non-infringement but represents that no infringing code has been knowingly included.
84
+
85
+ 6.2   Outputs Disclaimer. The Model Materials are provided as a neutral technology, and Licensor disclaims all liability for the accuracy, completeness, reliability, safety, legality, or suitability of any Outputs. The Licensee is solely responsible for verifying the accuracy and appropriateness of AI-generated content and shall provide appropriate disclosures when publishing or relying upon such content.
86
+
87
+ 6.3   Limitation of Liability and Recourse. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, shall Licensor or contributors be liable for any claims or damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Model Materials (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if Licensor has been advised of the possibility of such damages. If such losses are incurred, recourse may be sought against the Licensee responsible for causing the loss.
88
+
89
+ 6.4   Content‑Filtering Disclaimer. Although the Model Materials may include content‑filtering mechanisms, Licensor makes no warranties of any kind regarding the stability, quality, accuracy, completeness, or any specific outcome of Outputs. Licensee is solely responsible for reviewing, verifying, and performing quality control on Outputs and assumes all associated risks and liabilities.
90
+
91
+ 7. Attribution and License Reservation
92
+
93
+ 7.1   License. When distributing or redistributing the Model Materials, Licensee must give any other recipients of the Model Materials a copy of this Agreement.
94
+
95
+ 7.2   Copyright and Notices. When distributing any part of the Model Materials, Licensee must retain all copyright, patent, trademark, and attribution notices included in the Model Materials.
96
+
97
+ 7.3   Attribution. Licensee is encouraged to prominently display the name of Licensor and the Model Materials in any public statements, products, or services that contain the Model Materials (or any derivative works thereof), to promote transparency and community trust. If Licensee distributes modified weights or fine‑tuned models based on the Model Materials, Licensee must prominently display the following statement in the related website or documentation: “Built with dots.ocr.”
98
+
99
+ 8. Governing Law and Dispute Resolution
100
+
101
+ 8.1   Governing Law. This Agreement shall be governed by and construed in accordance with the laws of the People’s Republic of China, without regard to its conflict of laws principles.
102
+
103
+ 8.2   Dispute Resolution. Any dispute, claim, or disagreement arising out of or relating to this Agreement shall first be resolved through amicable consultation. If such consultation fails, the dispute shall be submitted to the Hangzhou Arbitration Commission for arbitration. The arbitration shall be conducted in accordance with the laws of China, and the place of arbitration shall be [Hangzhou, China]. The arbitral award shall be final and binding upon both parties.
104
+
105
+ 9. Regulatory Compliance Amendments
106
+ In the event that any part of this Agreement becomes invalid or requires adjustment due to changes in applicable laws or regulations, Licensor reserves the right to issue a revised version of this Agreement. Licensee shall migrate to the new version within [e.g., ninety (90)] days of its release; otherwise, all rights granted under this Agreement shall automatically terminate.
107
+
108
+ 10. Security Reporting
109
+ Any Licensee who discovers a security vulnerability in the Model Materials may report it to Licensor via dots-feedback@xiaohongshu.com. Licensee shall not disclose vulnerability details until Licensor issues an official remediation, unless otherwise required by law.
generation_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "max_length": 32768,
3
+ "eos_token_id": [
4
+ 151643,
5
+ 151673
6
+ ]
7
+ }
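
The generation_config.json above sets a 32768-token max_length and two end-of-sequence token ids. As a minimal illustration (not part of this repository), the sketch below shows how these fields are typically read through Hugging Face transformers' GenerationConfig; the repo id follows this model card, and running it assumes the transformers library is installed and the repo (or a local copy) is reachable.

```python
# Minimal sketch (not part of this repo): reading the generation_config.json
# above with Hugging Face transformers. Assumes `transformers` is installed
# and that the repo id "rednote-hilab/dots.ocr" (or a local copy) resolves.
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("rednote-hilab/dots.ocr")

# max_length bounds the total sequence length (prompt + generated tokens);
# generation stops when either of the two eos_token_id values is emitted.
print(gen_cfg.max_length)    # expected: 32768
print(gen_cfg.eos_token_id)  # expected: [151643, 151673]
```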
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea1d532184f3adf5cbcfcc00b2cf5b2abfa6fe182768a3ae63d441a9b5fc99ac
3
+ size 4292758192
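
The three lines above are a Git LFS pointer rather than the weights themselves: the roughly 4.3 GB shard is stored in LFS and identified by its SHA-256 oid. Below is a hedged sketch of how a downloaded copy of this shard could be checked against that digest; the local filename is an assumption.

```python
# Minimal verification sketch (not part of this repo): recompute the SHA-256
# digest of a locally downloaded shard and compare it with the oid recorded
# in the Git LFS pointer above. The local path is a placeholder assumption.
import hashlib

EXPECTED_OID = "ea1d532184f3adf5cbcfcc00b2cf5b2abfa6fe182768a3ae63d441a9b5fc99ac"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so a multi-GB shard never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    actual = sha256_of("model-00001-of-00002.safetensors")  # hypothetical local path
    print("match" if actual == EXPECTED_OID else f"mismatch: {actual}")
```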
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:26ab1ec6c8b4e4116befbd59af42159f1dbcb0ad0c045a15e890bb2f6e8b0dae
3
+ size 1785673544
model.safetensors.index.json ADDED
@@ -0,0 +1,650 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 6078358528
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "model-00001-of-00002.safetensors",
7
+ "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
8
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
9
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
10
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
11
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
12
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
13
+ "model.layers.0.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
14
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
15
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
16
+ "model.layers.0.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
17
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
18
+ "model.layers.0.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
19
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
20
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
21
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
22
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
23
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
24
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
25
+ "model.layers.1.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
26
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
27
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
28
+ "model.layers.1.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
29
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
30
+ "model.layers.1.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
31
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
32
+ "model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
33
+ "model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
34
+ "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
35
+ "model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
36
+ "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
37
+ "model.layers.10.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
38
+ "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
39
+ "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
40
+ "model.layers.10.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
41
+ "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
42
+ "model.layers.10.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
43
+ "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
44
+ "model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
45
+ "model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
46
+ "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
47
+ "model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
48
+ "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
49
+ "model.layers.11.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
50
+ "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
51
+ "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
52
+ "model.layers.11.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
53
+ "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
54
+ "model.layers.11.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
55
+ "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
56
+ "model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
57
+ "model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
58
+ "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
59
+ "model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
60
+ "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
61
+ "model.layers.12.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
62
+ "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
63
+ "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
64
+ "model.layers.12.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
65
+ "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
66
+ "model.layers.12.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
67
+ "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
68
+ "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
69
+ "model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
70
+ "model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
71
+ "model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
72
+ "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
73
+ "model.layers.13.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
74
+ "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
75
+ "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
76
+ "model.layers.13.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
77
+ "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
78
+ "model.layers.13.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
79
+ "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
80
+ "model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
81
+ "model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
82
+ "model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
83
+ "model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
84
+ "model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
85
+ "model.layers.14.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
86
+ "model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
87
+ "model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
88
+ "model.layers.14.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
89
+ "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
90
+ "model.layers.14.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
91
+ "model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
92
+ "model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
93
+ "model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
94
+ "model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
95
+ "model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
96
+ "model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
97
+ "model.layers.15.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
98
+ "model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
99
+ "model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
100
+ "model.layers.15.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
101
+ "model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
102
+ "model.layers.15.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
103
+ "model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
104
+ "model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
105
+ "model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
106
+ "model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
107
+ "model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
108
+ "model.layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
109
+ "model.layers.16.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
110
+ "model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
111
+ "model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
112
+ "model.layers.16.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
113
+ "model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
114
+ "model.layers.16.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
115
+ "model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
116
+ "model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
117
+ "model.layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
118
+ "model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
119
+ "model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
120
+ "model.layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
121
+ "model.layers.17.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
122
+ "model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
123
+ "model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
124
+ "model.layers.17.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
125
+ "model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
126
+ "model.layers.17.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
127
+ "model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
128
+ "model.layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
129
+ "model.layers.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
130
+ "model.layers.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
131
+ "model.layers.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
132
+ "model.layers.18.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
133
+ "model.layers.18.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
134
+ "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
135
+ "model.layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
136
+ "model.layers.18.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
137
+ "model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
138
+ "model.layers.18.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
139
+ "model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
140
+ "model.layers.19.input_layernorm.weight": "model-00001-of-00002.safetensors",
141
+ "model.layers.19.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
142
+ "model.layers.19.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
143
+ "model.layers.19.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
144
+ "model.layers.19.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
145
+ "model.layers.19.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
146
+ "model.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
147
+ "model.layers.19.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
148
+ "model.layers.19.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
149
+ "model.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
150
+ "model.layers.19.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
151
+ "model.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
152
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
153
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
154
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
155
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
156
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
157
+ "model.layers.2.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
158
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
159
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
160
+ "model.layers.2.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
161
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
162
+ "model.layers.2.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
163
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
164
+ "model.layers.20.input_layernorm.weight": "model-00001-of-00002.safetensors",
165
+ "model.layers.20.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
166
+ "model.layers.20.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
167
+ "model.layers.20.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
168
+ "model.layers.20.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
169
+ "model.layers.20.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
170
+ "model.layers.20.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
171
+ "model.layers.20.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
172
+ "model.layers.20.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
173
+ "model.layers.20.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
174
+ "model.layers.20.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
175
+ "model.layers.20.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
176
+ "model.layers.21.input_layernorm.weight": "model-00001-of-00002.safetensors",
177
+ "model.layers.21.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
178
+ "model.layers.21.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
179
+ "model.layers.21.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
180
+ "model.layers.21.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
181
+ "model.layers.21.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
182
+ "model.layers.21.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
183
+ "model.layers.21.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
184
+ "model.layers.21.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
185
+ "model.layers.21.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
186
+ "model.layers.21.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
187
+ "model.layers.21.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
188
+ "model.layers.22.input_layernorm.weight": "model-00001-of-00002.safetensors",
189
+ "model.layers.22.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
190
+ "model.layers.22.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
191
+ "model.layers.22.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
192
+ "model.layers.22.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
193
+ "model.layers.22.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
194
+ "model.layers.22.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
195
+ "model.layers.22.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
196
+ "model.layers.22.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
197
+ "model.layers.22.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
198
+ "model.layers.22.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
199
+ "model.layers.22.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
200
+ "model.layers.23.input_layernorm.weight": "model-00001-of-00002.safetensors",
201
+ "model.layers.23.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
202
+ "model.layers.23.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
203
+ "model.layers.23.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
204
+ "model.layers.23.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
205
+ "model.layers.23.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
206
+ "model.layers.23.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
207
+ "model.layers.23.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
208
+ "model.layers.23.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
209
+ "model.layers.23.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
210
+ "model.layers.23.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
211
+ "model.layers.23.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
212
+ "model.layers.24.input_layernorm.weight": "model-00001-of-00002.safetensors",
213
+ "model.layers.24.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
214
+ "model.layers.24.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
215
+ "model.layers.24.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
216
+ "model.layers.24.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
217
+ "model.layers.24.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
218
+ "model.layers.24.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
219
+ "model.layers.24.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
220
+ "model.layers.24.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
221
+ "model.layers.24.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
222
+ "model.layers.24.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
223
+ "model.layers.24.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
224
+ "model.layers.25.input_layernorm.weight": "model-00001-of-00002.safetensors",
225
+ "model.layers.25.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
226
+ "model.layers.25.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
227
+ "model.layers.25.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
228
+ "model.layers.25.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
229
+ "model.layers.25.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
230
+ "model.layers.25.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
231
+ "model.layers.25.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
232
+ "model.layers.25.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
233
+ "model.layers.25.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
234
+ "model.layers.25.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
235
+ "model.layers.25.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
236
+ "model.layers.26.input_layernorm.weight": "model-00001-of-00002.safetensors",
237
+ "model.layers.26.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
238
+ "model.layers.26.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
239
+ "model.layers.26.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
240
+ "model.layers.26.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
241
+ "model.layers.26.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
242
+ "model.layers.26.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
243
+ "model.layers.26.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
244
+ "model.layers.26.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
245
+ "model.layers.26.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
246
+ "model.layers.26.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
247
+ "model.layers.26.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
248
+ "model.layers.27.input_layernorm.weight": "model-00001-of-00002.safetensors",
249
+ "model.layers.27.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
250
+ "model.layers.27.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
251
+ "model.layers.27.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
252
+ "model.layers.27.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
253
+ "model.layers.27.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
254
+ "model.layers.27.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
255
+ "model.layers.27.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
256
+ "model.layers.27.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
257
+ "model.layers.27.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
258
+ "model.layers.27.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
259
+ "model.layers.27.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
260
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
261
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
262
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
263
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
264
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
265
+ "model.layers.3.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
266
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
267
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
268
+ "model.layers.3.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
269
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
270
+ "model.layers.3.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
271
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
272
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
273
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
274
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
275
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
276
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
277
+ "model.layers.4.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
278
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
279
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
280
+ "model.layers.4.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
281
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
282
+ "model.layers.4.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
283
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
284
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
285
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
286
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
287
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
288
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
289
+ "model.layers.5.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
290
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
291
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
292
+ "model.layers.5.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
293
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
294
+ "model.layers.5.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
295
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
296
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
297
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
298
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
299
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
300
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
301
+ "model.layers.6.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
302
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
303
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
304
+ "model.layers.6.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
305
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
306
+ "model.layers.6.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
307
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
308
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
309
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
310
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
311
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
312
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
313
+ "model.layers.7.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
314
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
315
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
316
+ "model.layers.7.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
317
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
318
+ "model.layers.7.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
319
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
320
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
321
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
322
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
323
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
324
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
325
+ "model.layers.8.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
326
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
327
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
328
+ "model.layers.8.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
329
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
330
+ "model.layers.8.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
331
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
332
+ "model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
333
+ "model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
334
+ "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
335
+ "model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
336
+ "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
337
+ "model.layers.9.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
338
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
339
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
340
+ "model.layers.9.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
341
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
342
+ "model.layers.9.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
343
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
344
+ "model.norm.weight": "model-00001-of-00002.safetensors",
345
+ "vision_tower.blocks.0.attn.proj.weight": "model-00001-of-00002.safetensors",
346
+ "vision_tower.blocks.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
347
+ "vision_tower.blocks.0.mlp.fc1.weight": "model-00001-of-00002.safetensors",
348
+ "vision_tower.blocks.0.mlp.fc2.weight": "model-00001-of-00002.safetensors",
349
+ "vision_tower.blocks.0.mlp.fc3.weight": "model-00001-of-00002.safetensors",
350
+ "vision_tower.blocks.0.norm1.weight": "model-00001-of-00002.safetensors",
351
+ "vision_tower.blocks.0.norm2.weight": "model-00001-of-00002.safetensors",
352
+ "vision_tower.blocks.1.attn.proj.weight": "model-00001-of-00002.safetensors",
353
+ "vision_tower.blocks.1.attn.qkv.weight": "model-00001-of-00002.safetensors",
354
+ "vision_tower.blocks.1.mlp.fc1.weight": "model-00001-of-00002.safetensors",
355
+ "vision_tower.blocks.1.mlp.fc2.weight": "model-00001-of-00002.safetensors",
356
+ "vision_tower.blocks.1.mlp.fc3.weight": "model-00001-of-00002.safetensors",
357
+ "vision_tower.blocks.1.norm1.weight": "model-00001-of-00002.safetensors",
358
+ "vision_tower.blocks.1.norm2.weight": "model-00001-of-00002.safetensors",
359
+ "vision_tower.blocks.10.attn.proj.weight": "model-00001-of-00002.safetensors",
360
+ "vision_tower.blocks.10.attn.qkv.weight": "model-00001-of-00002.safetensors",
361
+ "vision_tower.blocks.10.mlp.fc1.weight": "model-00001-of-00002.safetensors",
362
+ "vision_tower.blocks.10.mlp.fc2.weight": "model-00001-of-00002.safetensors",
363
+ "vision_tower.blocks.10.mlp.fc3.weight": "model-00001-of-00002.safetensors",
364
+ "vision_tower.blocks.10.norm1.weight": "model-00001-of-00002.safetensors",
365
+ "vision_tower.blocks.10.norm2.weight": "model-00001-of-00002.safetensors",
366
+ "vision_tower.blocks.11.attn.proj.weight": "model-00001-of-00002.safetensors",
367
+ "vision_tower.blocks.11.attn.qkv.weight": "model-00001-of-00002.safetensors",
368
+ "vision_tower.blocks.11.mlp.fc1.weight": "model-00001-of-00002.safetensors",
369
+ "vision_tower.blocks.11.mlp.fc2.weight": "model-00001-of-00002.safetensors",
370
+ "vision_tower.blocks.11.mlp.fc3.weight": "model-00001-of-00002.safetensors",
371
+ "vision_tower.blocks.11.norm1.weight": "model-00001-of-00002.safetensors",
372
+ "vision_tower.blocks.11.norm2.weight": "model-00001-of-00002.safetensors",
373
+ "vision_tower.blocks.12.attn.proj.weight": "model-00001-of-00002.safetensors",
374
+ "vision_tower.blocks.12.attn.qkv.weight": "model-00001-of-00002.safetensors",
375
+ "vision_tower.blocks.12.mlp.fc1.weight": "model-00001-of-00002.safetensors",
376
+ "vision_tower.blocks.12.mlp.fc2.weight": "model-00001-of-00002.safetensors",
377
+ "vision_tower.blocks.12.mlp.fc3.weight": "model-00001-of-00002.safetensors",
378
+ "vision_tower.blocks.12.norm1.weight": "model-00001-of-00002.safetensors",
379
+ "vision_tower.blocks.12.norm2.weight": "model-00001-of-00002.safetensors",
380
+ "vision_tower.blocks.13.attn.proj.weight": "model-00001-of-00002.safetensors",
381
+ "vision_tower.blocks.13.attn.qkv.weight": "model-00001-of-00002.safetensors",
382
+ "vision_tower.blocks.13.mlp.fc1.weight": "model-00001-of-00002.safetensors",
383
+ "vision_tower.blocks.13.mlp.fc2.weight": "model-00001-of-00002.safetensors",
384
+ "vision_tower.blocks.13.mlp.fc3.weight": "model-00001-of-00002.safetensors",
385
+ "vision_tower.blocks.13.norm1.weight": "model-00001-of-00002.safetensors",
386
+ "vision_tower.blocks.13.norm2.weight": "model-00001-of-00002.safetensors",
387
+ "vision_tower.blocks.14.attn.proj.weight": "model-00001-of-00002.safetensors",
388
+ "vision_tower.blocks.14.attn.qkv.weight": "model-00001-of-00002.safetensors",
389
+ "vision_tower.blocks.14.mlp.fc1.weight": "model-00001-of-00002.safetensors",
390
+ "vision_tower.blocks.14.mlp.fc2.weight": "model-00001-of-00002.safetensors",
391
+ "vision_tower.blocks.14.mlp.fc3.weight": "model-00001-of-00002.safetensors",
392
+ "vision_tower.blocks.14.norm1.weight": "model-00001-of-00002.safetensors",
393
+ "vision_tower.blocks.14.norm2.weight": "model-00001-of-00002.safetensors",
394
+ "vision_tower.blocks.15.attn.proj.weight": "model-00001-of-00002.safetensors",
395
+ "vision_tower.blocks.15.attn.qkv.weight": "model-00001-of-00002.safetensors",
396
+ "vision_tower.blocks.15.mlp.fc1.weight": "model-00001-of-00002.safetensors",
397
+ "vision_tower.blocks.15.mlp.fc2.weight": "model-00001-of-00002.safetensors",
398
+ "vision_tower.blocks.15.mlp.fc3.weight": "model-00001-of-00002.safetensors",
399
+ "vision_tower.blocks.15.norm1.weight": "model-00001-of-00002.safetensors",
400
+ "vision_tower.blocks.15.norm2.weight": "model-00001-of-00002.safetensors",
401
+ "vision_tower.blocks.16.attn.proj.weight": "model-00001-of-00002.safetensors",
402
+ "vision_tower.blocks.16.attn.qkv.weight": "model-00001-of-00002.safetensors",
403
+ "vision_tower.blocks.16.mlp.fc1.weight": "model-00001-of-00002.safetensors",
404
+ "vision_tower.blocks.16.mlp.fc2.weight": "model-00001-of-00002.safetensors",
405
+ "vision_tower.blocks.16.mlp.fc3.weight": "model-00001-of-00002.safetensors",
406
+ "vision_tower.blocks.16.norm1.weight": "model-00001-of-00002.safetensors",
407
+ "vision_tower.blocks.16.norm2.weight": "model-00001-of-00002.safetensors",
408
+ "vision_tower.blocks.17.attn.proj.weight": "model-00001-of-00002.safetensors",
409
+ "vision_tower.blocks.17.attn.qkv.weight": "model-00001-of-00002.safetensors",
410
+ "vision_tower.blocks.17.mlp.fc1.weight": "model-00001-of-00002.safetensors",
411
+ "vision_tower.blocks.17.mlp.fc2.weight": "model-00001-of-00002.safetensors",
412
+ "vision_tower.blocks.17.mlp.fc3.weight": "model-00001-of-00002.safetensors",
413
+ "vision_tower.blocks.17.norm1.weight": "model-00001-of-00002.safetensors",
414
+ "vision_tower.blocks.17.norm2.weight": "model-00001-of-00002.safetensors",
415
+ "vision_tower.blocks.18.attn.proj.weight": "model-00001-of-00002.safetensors",
416
+ "vision_tower.blocks.18.attn.qkv.weight": "model-00001-of-00002.safetensors",
417
+ "vision_tower.blocks.18.mlp.fc1.weight": "model-00001-of-00002.safetensors",
418
+ "vision_tower.blocks.18.mlp.fc2.weight": "model-00001-of-00002.safetensors",
419
+ "vision_tower.blocks.18.mlp.fc3.weight": "model-00001-of-00002.safetensors",
420
+ "vision_tower.blocks.18.norm1.weight": "model-00001-of-00002.safetensors",
421
+ "vision_tower.blocks.18.norm2.weight": "model-00001-of-00002.safetensors",
422
+ "vision_tower.blocks.19.attn.proj.weight": "model-00001-of-00002.safetensors",
423
+ "vision_tower.blocks.19.attn.qkv.weight": "model-00001-of-00002.safetensors",
424
+ "vision_tower.blocks.19.mlp.fc1.weight": "model-00001-of-00002.safetensors",
425
+ "vision_tower.blocks.19.mlp.fc2.weight": "model-00001-of-00002.safetensors",
426
+ "vision_tower.blocks.19.mlp.fc3.weight": "model-00001-of-00002.safetensors",
427
+ "vision_tower.blocks.19.norm1.weight": "model-00001-of-00002.safetensors",
428
+ "vision_tower.blocks.19.norm2.weight": "model-00001-of-00002.safetensors",
429
+ "vision_tower.blocks.2.attn.proj.weight": "model-00001-of-00002.safetensors",
430
+ "vision_tower.blocks.2.attn.qkv.weight": "model-00001-of-00002.safetensors",
431
+ "vision_tower.blocks.2.mlp.fc1.weight": "model-00001-of-00002.safetensors",
432
+ "vision_tower.blocks.2.mlp.fc2.weight": "model-00001-of-00002.safetensors",
433
+ "vision_tower.blocks.2.mlp.fc3.weight": "model-00002-of-00002.safetensors",
434
+ "vision_tower.blocks.2.norm1.weight": "model-00002-of-00002.safetensors",
435
+ "vision_tower.blocks.2.norm2.weight": "model-00002-of-00002.safetensors",
436
+ "vision_tower.blocks.20.attn.proj.weight": "model-00002-of-00002.safetensors",
437
+ "vision_tower.blocks.20.attn.qkv.weight": "model-00002-of-00002.safetensors",
438
+ "vision_tower.blocks.20.mlp.fc1.weight": "model-00002-of-00002.safetensors",
439
+ "vision_tower.blocks.20.mlp.fc2.weight": "model-00002-of-00002.safetensors",
440
+ "vision_tower.blocks.20.mlp.fc3.weight": "model-00002-of-00002.safetensors",
441
+ "vision_tower.blocks.20.norm1.weight": "model-00002-of-00002.safetensors",
442
+ "vision_tower.blocks.20.norm2.weight": "model-00002-of-00002.safetensors",
443
+ "vision_tower.blocks.21.attn.proj.weight": "model-00002-of-00002.safetensors",
444
+ "vision_tower.blocks.21.attn.qkv.weight": "model-00002-of-00002.safetensors",
445
+ "vision_tower.blocks.21.mlp.fc1.weight": "model-00002-of-00002.safetensors",
446
+ "vision_tower.blocks.21.mlp.fc2.weight": "model-00002-of-00002.safetensors",
447
+ "vision_tower.blocks.21.mlp.fc3.weight": "model-00002-of-00002.safetensors",
448
+ "vision_tower.blocks.21.norm1.weight": "model-00002-of-00002.safetensors",
449
+ "vision_tower.blocks.21.norm2.weight": "model-00002-of-00002.safetensors",
450
+ "vision_tower.blocks.22.attn.proj.weight": "model-00002-of-00002.safetensors",
451
+ "vision_tower.blocks.22.attn.qkv.weight": "model-00002-of-00002.safetensors",
452
+ "vision_tower.blocks.22.mlp.fc1.weight": "model-00002-of-00002.safetensors",
453
+ "vision_tower.blocks.22.mlp.fc2.weight": "model-00002-of-00002.safetensors",
454
+ "vision_tower.blocks.22.mlp.fc3.weight": "model-00002-of-00002.safetensors",
455
+ "vision_tower.blocks.22.norm1.weight": "model-00002-of-00002.safetensors",
456
+ "vision_tower.blocks.22.norm2.weight": "model-00002-of-00002.safetensors",
457
+ "vision_tower.blocks.23.attn.proj.weight": "model-00002-of-00002.safetensors",
458
+ "vision_tower.blocks.23.attn.qkv.weight": "model-00002-of-00002.safetensors",
459
+ "vision_tower.blocks.23.mlp.fc1.weight": "model-00002-of-00002.safetensors",
460
+ "vision_tower.blocks.23.mlp.fc2.weight": "model-00002-of-00002.safetensors",
461
+ "vision_tower.blocks.23.mlp.fc3.weight": "model-00002-of-00002.safetensors",
462
+ "vision_tower.blocks.23.norm1.weight": "model-00002-of-00002.safetensors",
463
+ "vision_tower.blocks.23.norm2.weight": "model-00002-of-00002.safetensors",
464
+ "vision_tower.blocks.24.attn.proj.weight": "model-00002-of-00002.safetensors",
465
+ "vision_tower.blocks.24.attn.qkv.weight": "model-00002-of-00002.safetensors",
466
+ "vision_tower.blocks.24.mlp.fc1.weight": "model-00002-of-00002.safetensors",
467
+ "vision_tower.blocks.24.mlp.fc2.weight": "model-00002-of-00002.safetensors",
468
+ "vision_tower.blocks.24.mlp.fc3.weight": "model-00002-of-00002.safetensors",
469
+ "vision_tower.blocks.24.norm1.weight": "model-00002-of-00002.safetensors",
470
+ "vision_tower.blocks.24.norm2.weight": "model-00002-of-00002.safetensors",
471
+ "vision_tower.blocks.25.attn.proj.weight": "model-00002-of-00002.safetensors",
472
+ "vision_tower.blocks.25.attn.qkv.weight": "model-00002-of-00002.safetensors",
473
+ "vision_tower.blocks.25.mlp.fc1.weight": "model-00002-of-00002.safetensors",
474
+ "vision_tower.blocks.25.mlp.fc2.weight": "model-00002-of-00002.safetensors",
475
+ "vision_tower.blocks.25.mlp.fc3.weight": "model-00002-of-00002.safetensors",
476
+ "vision_tower.blocks.25.norm1.weight": "model-00002-of-00002.safetensors",
477
+ "vision_tower.blocks.25.norm2.weight": "model-00002-of-00002.safetensors",
478
+ "vision_tower.blocks.26.attn.proj.weight": "model-00002-of-00002.safetensors",
479
+ "vision_tower.blocks.26.attn.qkv.weight": "model-00002-of-00002.safetensors",
480
+ "vision_tower.blocks.26.mlp.fc1.weight": "model-00002-of-00002.safetensors",
481
+ "vision_tower.blocks.26.mlp.fc2.weight": "model-00002-of-00002.safetensors",
482
+ "vision_tower.blocks.26.mlp.fc3.weight": "model-00002-of-00002.safetensors",
483
+ "vision_tower.blocks.26.norm1.weight": "model-00002-of-00002.safetensors",
484
+ "vision_tower.blocks.26.norm2.weight": "model-00002-of-00002.safetensors",
485
+ "vision_tower.blocks.27.attn.proj.weight": "model-00002-of-00002.safetensors",
486
+ "vision_tower.blocks.27.attn.qkv.weight": "model-00002-of-00002.safetensors",
487
+ "vision_tower.blocks.27.mlp.fc1.weight": "model-00002-of-00002.safetensors",
488
+ "vision_tower.blocks.27.mlp.fc2.weight": "model-00002-of-00002.safetensors",
489
+ "vision_tower.blocks.27.mlp.fc3.weight": "model-00002-of-00002.safetensors",
490
+ "vision_tower.blocks.27.norm1.weight": "model-00002-of-00002.safetensors",
491
+ "vision_tower.blocks.27.norm2.weight": "model-00002-of-00002.safetensors",
492
+ "vision_tower.blocks.28.attn.proj.weight": "model-00002-of-00002.safetensors",
493
+ "vision_tower.blocks.28.attn.qkv.weight": "model-00002-of-00002.safetensors",
494
+ "vision_tower.blocks.28.mlp.fc1.weight": "model-00002-of-00002.safetensors",
495
+ "vision_tower.blocks.28.mlp.fc2.weight": "model-00002-of-00002.safetensors",
496
+ "vision_tower.blocks.28.mlp.fc3.weight": "model-00002-of-00002.safetensors",
497
+ "vision_tower.blocks.28.norm1.weight": "model-00002-of-00002.safetensors",
498
+ "vision_tower.blocks.28.norm2.weight": "model-00002-of-00002.safetensors",
499
+ "vision_tower.blocks.29.attn.proj.weight": "model-00002-of-00002.safetensors",
500
+ "vision_tower.blocks.29.attn.qkv.weight": "model-00002-of-00002.safetensors",
501
+ "vision_tower.blocks.29.mlp.fc1.weight": "model-00002-of-00002.safetensors",
502
+ "vision_tower.blocks.29.mlp.fc2.weight": "model-00002-of-00002.safetensors",
503
+ "vision_tower.blocks.29.mlp.fc3.weight": "model-00002-of-00002.safetensors",
504
+ "vision_tower.blocks.29.norm1.weight": "model-00002-of-00002.safetensors",
505
+ "vision_tower.blocks.29.norm2.weight": "model-00002-of-00002.safetensors",
506
+ "vision_tower.blocks.3.attn.proj.weight": "model-00002-of-00002.safetensors",
507
+ "vision_tower.blocks.3.attn.qkv.weight": "model-00002-of-00002.safetensors",
508
+ "vision_tower.blocks.3.mlp.fc1.weight": "model-00002-of-00002.safetensors",
509
+ "vision_tower.blocks.3.mlp.fc2.weight": "model-00002-of-00002.safetensors",
510
+ "vision_tower.blocks.3.mlp.fc3.weight": "model-00002-of-00002.safetensors",
511
+ "vision_tower.blocks.3.norm1.weight": "model-00002-of-00002.safetensors",
512
+ "vision_tower.blocks.3.norm2.weight": "model-00002-of-00002.safetensors",
513
+ "vision_tower.blocks.30.attn.proj.weight": "model-00002-of-00002.safetensors",
514
+ "vision_tower.blocks.30.attn.qkv.weight": "model-00002-of-00002.safetensors",
515
+ "vision_tower.blocks.30.mlp.fc1.weight": "model-00002-of-00002.safetensors",
516
+ "vision_tower.blocks.30.mlp.fc2.weight": "model-00002-of-00002.safetensors",
517
+ "vision_tower.blocks.30.mlp.fc3.weight": "model-00002-of-00002.safetensors",
518
+ "vision_tower.blocks.30.norm1.weight": "model-00002-of-00002.safetensors",
519
+ "vision_tower.blocks.30.norm2.weight": "model-00002-of-00002.safetensors",
520
+ "vision_tower.blocks.31.attn.proj.weight": "model-00002-of-00002.safetensors",
521
+ "vision_tower.blocks.31.attn.qkv.weight": "model-00002-of-00002.safetensors",
522
+ "vision_tower.blocks.31.mlp.fc1.weight": "model-00002-of-00002.safetensors",
523
+ "vision_tower.blocks.31.mlp.fc2.weight": "model-00002-of-00002.safetensors",
524
+ "vision_tower.blocks.31.mlp.fc3.weight": "model-00002-of-00002.safetensors",
525
+ "vision_tower.blocks.31.norm1.weight": "model-00002-of-00002.safetensors",
526
+ "vision_tower.blocks.31.norm2.weight": "model-00002-of-00002.safetensors",
527
+ "vision_tower.blocks.32.attn.proj.weight": "model-00002-of-00002.safetensors",
528
+ "vision_tower.blocks.32.attn.qkv.weight": "model-00002-of-00002.safetensors",
529
+ "vision_tower.blocks.32.mlp.fc1.weight": "model-00002-of-00002.safetensors",
530
+ "vision_tower.blocks.32.mlp.fc2.weight": "model-00002-of-00002.safetensors",
531
+ "vision_tower.blocks.32.mlp.fc3.weight": "model-00002-of-00002.safetensors",
532
+ "vision_tower.blocks.32.norm1.weight": "model-00002-of-00002.safetensors",
533
+ "vision_tower.blocks.32.norm2.weight": "model-00002-of-00002.safetensors",
534
+ "vision_tower.blocks.33.attn.proj.weight": "model-00002-of-00002.safetensors",
535
+ "vision_tower.blocks.33.attn.qkv.weight": "model-00002-of-00002.safetensors",
536
+ "vision_tower.blocks.33.mlp.fc1.weight": "model-00002-of-00002.safetensors",
537
+ "vision_tower.blocks.33.mlp.fc2.weight": "model-00002-of-00002.safetensors",
538
+ "vision_tower.blocks.33.mlp.fc3.weight": "model-00002-of-00002.safetensors",
539
+ "vision_tower.blocks.33.norm1.weight": "model-00002-of-00002.safetensors",
540
+ "vision_tower.blocks.33.norm2.weight": "model-00002-of-00002.safetensors",
541
+ "vision_tower.blocks.34.attn.proj.weight": "model-00002-of-00002.safetensors",
542
+ "vision_tower.blocks.34.attn.qkv.weight": "model-00002-of-00002.safetensors",
543
+ "vision_tower.blocks.34.mlp.fc1.weight": "model-00002-of-00002.safetensors",
544
+ "vision_tower.blocks.34.mlp.fc2.weight": "model-00002-of-00002.safetensors",
545
+ "vision_tower.blocks.34.mlp.fc3.weight": "model-00002-of-00002.safetensors",
546
+ "vision_tower.blocks.34.norm1.weight": "model-00002-of-00002.safetensors",
547
+ "vision_tower.blocks.34.norm2.weight": "model-00002-of-00002.safetensors",
548
+ "vision_tower.blocks.35.attn.proj.weight": "model-00002-of-00002.safetensors",
549
+ "vision_tower.blocks.35.attn.qkv.weight": "model-00002-of-00002.safetensors",
550
+ "vision_tower.blocks.35.mlp.fc1.weight": "model-00002-of-00002.safetensors",
551
+ "vision_tower.blocks.35.mlp.fc2.weight": "model-00002-of-00002.safetensors",
552
+ "vision_tower.blocks.35.mlp.fc3.weight": "model-00002-of-00002.safetensors",
553
+ "vision_tower.blocks.35.norm1.weight": "model-00002-of-00002.safetensors",
554
+ "vision_tower.blocks.35.norm2.weight": "model-00002-of-00002.safetensors",
555
+ "vision_tower.blocks.36.attn.proj.weight": "model-00002-of-00002.safetensors",
556
+ "vision_tower.blocks.36.attn.qkv.weight": "model-00002-of-00002.safetensors",
557
+ "vision_tower.blocks.36.mlp.fc1.weight": "model-00002-of-00002.safetensors",
558
+ "vision_tower.blocks.36.mlp.fc2.weight": "model-00002-of-00002.safetensors",
559
+ "vision_tower.blocks.36.mlp.fc3.weight": "model-00002-of-00002.safetensors",
560
+ "vision_tower.blocks.36.norm1.weight": "model-00002-of-00002.safetensors",
561
+ "vision_tower.blocks.36.norm2.weight": "model-00002-of-00002.safetensors",
562
+ "vision_tower.blocks.37.attn.proj.weight": "model-00002-of-00002.safetensors",
563
+ "vision_tower.blocks.37.attn.qkv.weight": "model-00002-of-00002.safetensors",
564
+ "vision_tower.blocks.37.mlp.fc1.weight": "model-00002-of-00002.safetensors",
565
+ "vision_tower.blocks.37.mlp.fc2.weight": "model-00002-of-00002.safetensors",
566
+ "vision_tower.blocks.37.mlp.fc3.weight": "model-00002-of-00002.safetensors",
567
+ "vision_tower.blocks.37.norm1.weight": "model-00002-of-00002.safetensors",
568
+ "vision_tower.blocks.37.norm2.weight": "model-00002-of-00002.safetensors",
569
+ "vision_tower.blocks.38.attn.proj.weight": "model-00002-of-00002.safetensors",
570
+ "vision_tower.blocks.38.attn.qkv.weight": "model-00002-of-00002.safetensors",
571
+ "vision_tower.blocks.38.mlp.fc1.weight": "model-00002-of-00002.safetensors",
572
+ "vision_tower.blocks.38.mlp.fc2.weight": "model-00002-of-00002.safetensors",
573
+ "vision_tower.blocks.38.mlp.fc3.weight": "model-00002-of-00002.safetensors",
574
+ "vision_tower.blocks.38.norm1.weight": "model-00002-of-00002.safetensors",
575
+ "vision_tower.blocks.38.norm2.weight": "model-00002-of-00002.safetensors",
576
+ "vision_tower.blocks.39.attn.proj.weight": "model-00002-of-00002.safetensors",
577
+ "vision_tower.blocks.39.attn.qkv.weight": "model-00002-of-00002.safetensors",
578
+ "vision_tower.blocks.39.mlp.fc1.weight": "model-00002-of-00002.safetensors",
579
+ "vision_tower.blocks.39.mlp.fc2.weight": "model-00002-of-00002.safetensors",
580
+ "vision_tower.blocks.39.mlp.fc3.weight": "model-00002-of-00002.safetensors",
581
+ "vision_tower.blocks.39.norm1.weight": "model-00002-of-00002.safetensors",
582
+ "vision_tower.blocks.39.norm2.weight": "model-00002-of-00002.safetensors",
583
+ "vision_tower.blocks.4.attn.proj.weight": "model-00002-of-00002.safetensors",
584
+ "vision_tower.blocks.4.attn.qkv.weight": "model-00002-of-00002.safetensors",
585
+ "vision_tower.blocks.4.mlp.fc1.weight": "model-00002-of-00002.safetensors",
586
+ "vision_tower.blocks.4.mlp.fc2.weight": "model-00002-of-00002.safetensors",
587
+ "vision_tower.blocks.4.mlp.fc3.weight": "model-00002-of-00002.safetensors",
588
+ "vision_tower.blocks.4.norm1.weight": "model-00002-of-00002.safetensors",
589
+ "vision_tower.blocks.4.norm2.weight": "model-00002-of-00002.safetensors",
590
+ "vision_tower.blocks.40.attn.proj.weight": "model-00002-of-00002.safetensors",
591
+ "vision_tower.blocks.40.attn.qkv.weight": "model-00002-of-00002.safetensors",
592
+ "vision_tower.blocks.40.mlp.fc1.weight": "model-00002-of-00002.safetensors",
593
+ "vision_tower.blocks.40.mlp.fc2.weight": "model-00002-of-00002.safetensors",
594
+ "vision_tower.blocks.40.mlp.fc3.weight": "model-00002-of-00002.safetensors",
595
+ "vision_tower.blocks.40.norm1.weight": "model-00002-of-00002.safetensors",
596
+ "vision_tower.blocks.40.norm2.weight": "model-00002-of-00002.safetensors",
597
+ "vision_tower.blocks.41.attn.proj.weight": "model-00002-of-00002.safetensors",
598
+ "vision_tower.blocks.41.attn.qkv.weight": "model-00002-of-00002.safetensors",
599
+ "vision_tower.blocks.41.mlp.fc1.weight": "model-00002-of-00002.safetensors",
600
+ "vision_tower.blocks.41.mlp.fc2.weight": "model-00002-of-00002.safetensors",
601
+ "vision_tower.blocks.41.mlp.fc3.weight": "model-00002-of-00002.safetensors",
602
+ "vision_tower.blocks.41.norm1.weight": "model-00002-of-00002.safetensors",
603
+ "vision_tower.blocks.41.norm2.weight": "model-00002-of-00002.safetensors",
604
+ "vision_tower.blocks.5.attn.proj.weight": "model-00002-of-00002.safetensors",
605
+ "vision_tower.blocks.5.attn.qkv.weight": "model-00002-of-00002.safetensors",
606
+ "vision_tower.blocks.5.mlp.fc1.weight": "model-00002-of-00002.safetensors",
607
+ "vision_tower.blocks.5.mlp.fc2.weight": "model-00002-of-00002.safetensors",
608
+ "vision_tower.blocks.5.mlp.fc3.weight": "model-00002-of-00002.safetensors",
609
+ "vision_tower.blocks.5.norm1.weight": "model-00002-of-00002.safetensors",
610
+ "vision_tower.blocks.5.norm2.weight": "model-00002-of-00002.safetensors",
611
+ "vision_tower.blocks.6.attn.proj.weight": "model-00002-of-00002.safetensors",
612
+ "vision_tower.blocks.6.attn.qkv.weight": "model-00002-of-00002.safetensors",
613
+ "vision_tower.blocks.6.mlp.fc1.weight": "model-00002-of-00002.safetensors",
614
+ "vision_tower.blocks.6.mlp.fc2.weight": "model-00002-of-00002.safetensors",
615
+ "vision_tower.blocks.6.mlp.fc3.weight": "model-00002-of-00002.safetensors",
616
+ "vision_tower.blocks.6.norm1.weight": "model-00002-of-00002.safetensors",
617
+ "vision_tower.blocks.6.norm2.weight": "model-00002-of-00002.safetensors",
618
+ "vision_tower.blocks.7.attn.proj.weight": "model-00002-of-00002.safetensors",
619
+ "vision_tower.blocks.7.attn.qkv.weight": "model-00002-of-00002.safetensors",
620
+ "vision_tower.blocks.7.mlp.fc1.weight": "model-00002-of-00002.safetensors",
621
+ "vision_tower.blocks.7.mlp.fc2.weight": "model-00002-of-00002.safetensors",
622
+ "vision_tower.blocks.7.mlp.fc3.weight": "model-00002-of-00002.safetensors",
623
+ "vision_tower.blocks.7.norm1.weight": "model-00002-of-00002.safetensors",
624
+ "vision_tower.blocks.7.norm2.weight": "model-00002-of-00002.safetensors",
625
+ "vision_tower.blocks.8.attn.proj.weight": "model-00002-of-00002.safetensors",
626
+ "vision_tower.blocks.8.attn.qkv.weight": "model-00002-of-00002.safetensors",
627
+ "vision_tower.blocks.8.mlp.fc1.weight": "model-00002-of-00002.safetensors",
628
+ "vision_tower.blocks.8.mlp.fc2.weight": "model-00002-of-00002.safetensors",
629
+ "vision_tower.blocks.8.mlp.fc3.weight": "model-00002-of-00002.safetensors",
630
+ "vision_tower.blocks.8.norm1.weight": "model-00002-of-00002.safetensors",
631
+ "vision_tower.blocks.8.norm2.weight": "model-00002-of-00002.safetensors",
632
+ "vision_tower.blocks.9.attn.proj.weight": "model-00002-of-00002.safetensors",
633
+ "vision_tower.blocks.9.attn.qkv.weight": "model-00002-of-00002.safetensors",
634
+ "vision_tower.blocks.9.mlp.fc1.weight": "model-00002-of-00002.safetensors",
635
+ "vision_tower.blocks.9.mlp.fc2.weight": "model-00002-of-00002.safetensors",
636
+ "vision_tower.blocks.9.mlp.fc3.weight": "model-00002-of-00002.safetensors",
637
+ "vision_tower.blocks.9.norm1.weight": "model-00002-of-00002.safetensors",
638
+ "vision_tower.blocks.9.norm2.weight": "model-00002-of-00002.safetensors",
639
+ "vision_tower.merger.ln_q.bias": "model-00002-of-00002.safetensors",
640
+ "vision_tower.merger.ln_q.weight": "model-00002-of-00002.safetensors",
641
+ "vision_tower.merger.mlp.0.bias": "model-00002-of-00002.safetensors",
642
+ "vision_tower.merger.mlp.0.weight": "model-00002-of-00002.safetensors",
643
+ "vision_tower.merger.mlp.2.bias": "model-00002-of-00002.safetensors",
644
+ "vision_tower.merger.mlp.2.weight": "model-00002-of-00002.safetensors",
645
+ "vision_tower.patch_embed.patchifier.norm.weight": "model-00002-of-00002.safetensors",
646
+ "vision_tower.patch_embed.patchifier.proj.bias": "model-00002-of-00002.safetensors",
647
+ "vision_tower.patch_embed.patchifier.proj.weight": "model-00002-of-00002.safetensors",
648
+ "vision_tower.post_trunk_norm.weight": "model-00002-of-00002.safetensors"
649
+ }
650
+ }
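
The `weight_map` above assigns each parameter name to one of the two safetensors shards. A minimal sketch (assuming the index above is saved as `model.safetensors.index.json` next to the shards in a local checkpoint directory) of grouping parameters per shard before loading:

```python
import json
from collections import defaultdict

# Hypothetical local path to the downloaded checkpoint directory.
index_path = "dots.ocr/model.safetensors.index.json"

with open(index_path) as f:
    index = json.load(f)

# weight_map: parameter name -> shard file (e.g. "model-00002-of-00002.safetensors")
shards = defaultdict(list)
for name, shard_file in index["weight_map"].items():
    shards[shard_file].append(name)

for shard_file, names in sorted(shards.items()):
    print(f"{shard_file}: {len(names)} tensors")
```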
modeling_dots_ocr.py ADDED
@@ -0,0 +1,131 @@
1
+ from typing import List, Optional, Tuple, Union
2
+
3
+ import torch
4
+ from transformers.modeling_outputs import CausalLMOutputWithPast
5
+ from transformers.models.qwen2 import Qwen2ForCausalLM
6
+
7
+ from .configuration_dots import DotsVisionConfig, DotsOCRConfig
8
+ from .modeling_dots_vision import DotsVisionTransformer
9
+
10
+
11
+ DOTS_VLM_MAX_IMAGES = 200
12
+
13
+
14
+ class DotsOCRForCausalLM(Qwen2ForCausalLM):
15
+ config_class = DotsOCRConfig
16
+
17
+ def __init__(self, config: DotsOCRConfig):
18
+ super().__init__(config)
19
+
20
+ if isinstance(self.config.vision_config, dict):
21
+ vision_config = DotsVisionConfig(**self.config.vision_config)
22
+ self.config.vision_config = vision_config
23
+ else:
24
+ vision_config = self.config.vision_config
25
+
26
+ self.vision_tower = DotsVisionTransformer(vision_config)
27
+
28
+ def prepare_inputs_embeds(
29
+ self,
30
+ input_ids: torch.LongTensor,
31
+ pixel_values: Optional[torch.FloatTensor] = None,
32
+ grid_thw: Optional[torch.FloatTensor] = None,
33
+ img_mask: Optional[torch.BoolTensor] = None,
34
+ ) -> torch.Tensor:
35
+ inputs_embeds = self.get_input_embeddings()(input_ids)
36
+
37
+ if pixel_values is not None:
38
+ assert img_mask is not None
39
+ if grid_thw.shape[0] > DOTS_VLM_MAX_IMAGES:
40
+ print(
41
+ f"Number of images exceeds limit: {grid_thw.shape[0]} > {DOTS_VLM_MAX_IMAGES}, which may cause FSDP to hang"
42
+ )
43
+
44
+ vision_embeddings = self.vision_tower(pixel_values, grid_thw)
45
+
46
+ true_indices = torch.nonzero(img_mask).squeeze()
47
+ if len(true_indices) > vision_embeddings.size(0):
48
+ print(
49
+ f"img_mask sum > VE and will be truncated, mask.sum()={len(true_indices)} {vision_embeddings.size(0)=}"
50
+ )
51
+ true_indices = true_indices[: vision_embeddings.size(0)]
52
+ new_img_mask = torch.zeros_like(img_mask, device=img_mask.device)
53
+ new_img_mask[true_indices[:, 0], true_indices[:, 1]] = True
54
+ else:
55
+ new_img_mask = img_mask
56
+
57
+ assert (
58
+ vision_embeddings.size(0) == new_img_mask.sum()
59
+ ), f"{vision_embeddings.size(0)=}, {new_img_mask.sum()=}"
60
+
61
+ inputs_embeds = inputs_embeds.masked_scatter(
62
+ new_img_mask.to(inputs_embeds.device).unsqueeze(-1).expand_as(inputs_embeds),
63
+ vision_embeddings.to(inputs_embeds.device).type(inputs_embeds.dtype),
64
+ )
65
+
66
+ return inputs_embeds
67
+
68
+ def forward(
69
+ self,
70
+ input_ids: torch.LongTensor,
71
+ pixel_values: Optional[torch.FloatTensor] = None,
72
+ image_grid_thw: Optional[torch.FloatTensor] = None,
73
+ inputs_embeds: Optional[torch.Tensor] = None,
74
+ attention_mask: Optional[torch.Tensor] = None,
75
+ position_ids: Optional[torch.LongTensor] = None,
76
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
77
+ labels: Optional[torch.LongTensor] = None,
78
+ output_attentions: Optional[bool] = None,
79
+ output_hidden_states: Optional[bool] = None,
80
+ return_dict: Optional[bool] = None,
81
+ use_cache: Optional[bool] = None,
82
+ logits_to_keep: int = 0,
83
+ **loss_kwargs,
84
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
85
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
86
+ assert len(input_ids) >= 1, f"empty input_ids {input_ids.shape=} will cause gradnorm nan"
87
+ if inputs_embeds is None:
88
+ img_mask = input_ids == self.config.image_token_id
89
+ inputs_embeds = self.prepare_inputs_embeds(input_ids, pixel_values, image_grid_thw, img_mask)
90
+
91
+ outputs = super().forward(
92
+ inputs_embeds=inputs_embeds,
93
+ attention_mask=attention_mask,
94
+ position_ids=position_ids,
95
+ past_key_values=past_key_values,
96
+ labels=labels,
97
+ use_cache=use_cache if use_cache is not None else self.config.use_cache,
98
+ output_attentions=output_attentions,
99
+ output_hidden_states=output_hidden_states,
100
+ # return_dict=return_dict,
101
+ logits_to_keep=logits_to_keep,
102
+ **loss_kwargs,
103
+ )
104
+
105
+ return outputs
106
+
107
+ def prepare_inputs_for_generation(
108
+ self,
109
+ input_ids,
110
+ past_key_values=None,
111
+ inputs_embeds=None,
112
+ pixel_values=None,
113
+ attention_mask=None,
114
+ cache_position=None,
115
+ num_logits_to_keep=None,
116
+ **kwargs,
117
+ ):
118
+ model_inputs = super().prepare_inputs_for_generation(
119
+ input_ids,
120
+ past_key_values=past_key_values,
121
+ inputs_embeds=inputs_embeds,
122
+ attention_mask=attention_mask,
123
+ cache_position=cache_position,
124
+ num_logits_to_keep=num_logits_to_keep,
125
+ **kwargs,
126
+ )
127
+
128
+ if cache_position[0] == 0:
129
+ model_inputs["pixel_values"] = pixel_values
130
+
131
+ return model_inputs
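
A minimal usage sketch for the Hugging Face path defined above. This is not the official inference script: the repository id, prompt format, dtype, and the assumption that the bundled Qwen2-VL style processor is exposed via `AutoProcessor` are all illustrative.

```python
# A minimal sketch, assuming the repo id below and that trust_remote_code
# resolves modeling_dots_ocr.DotsOCRForCausalLM and the bundled processor.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rednote-hilab/dots.ocr"  # assumed repository id
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("page.png")  # hypothetical input document image
# The image placeholder mirrors the special tokens defined in tokenizer_config.json.
prompt = "<|user|><|img|><|imgpad|><|endofimg|>Extract the text in this image.<|endofuser|><|assistant|>"

inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```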
modeling_dots_ocr_vllm.py ADDED
@@ -0,0 +1,451 @@
1
+ from functools import cached_property
2
+ from typing import Iterable, Literal, Mapping, Optional, Set, Tuple, TypedDict, Union
3
+
4
+ import torch
5
+ import torch.nn as nn
6
+ from transformers.models.qwen2_vl import Qwen2VLImageProcessor, Qwen2VLProcessor
7
+ from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize
8
+ from vllm import ModelRegistry
9
+ from vllm.config import VllmConfig
10
+ from vllm.model_executor.layers.sampler import SamplerOutput, get_sampler
11
+ from vllm.model_executor.models.interfaces import MultiModalEmbeddings, SupportsMultiModal
12
+ from vllm.model_executor.models.qwen2 import Qwen2ForCausalLM
13
+ from vllm.model_executor.models.qwen2_5_vl import (
14
+ Qwen2_5_VLMultiModalProcessor,
15
+ Qwen2_5_VLProcessingInfo,
16
+ )
17
+ from vllm.model_executor.models.qwen2_vl import Qwen2VLDummyInputsBuilder
18
+ from vllm.model_executor.models.utils import (
19
+ AutoWeightsLoader,
20
+ WeightsMapper,
21
+ init_vllm_registered_model,
22
+ maybe_prefix,
23
+ merge_multimodal_embeddings,
24
+ )
25
+ from vllm.model_executor.sampling_metadata import SamplingMetadata
26
+ from vllm.multimodal import MULTIMODAL_REGISTRY
27
+ from vllm.multimodal.inputs import MultiModalDataDict
28
+ from vllm.multimodal.parse import ImageSize
29
+ from vllm.sequence import IntermediateTensors
30
+
31
+ from .configuration_dots import DotsVisionConfig
32
+ from .configuration_dots import DotsOCRConfig
33
+ from .modeling_dots_vision import DotsVisionTransformer
34
+
35
+
36
+ class DotsOCRImagePixelInputs(TypedDict):
37
+ type: Literal["pixel_values", "image_grid_thw"]
38
+
39
+ pixel_values: torch.Tensor
40
+ image_grid_thw: torch.Tensor
41
+
42
+
43
+ class DotsOCRImageEmbeddingInputs(TypedDict):
44
+ type: Literal["image_embeds", "image_grid_thw"]
45
+ image_embeds: torch.Tensor
46
+ """Supported types:
47
+ - List[`torch.Tensor`]: A list of tensors holding all images' features.
48
+ Each tensor holds an image's features.
49
+ - `torch.Tensor`: A tensor holding all images' features
50
+ (concatenation of all images' feature tensors).
51
+
52
+ Tensor shape: `(num_image_features, hidden_size)`
53
+ - `num_image_features` varies based on
54
+ the number and resolution of the images.
55
+ - `hidden_size` must match the hidden size of language model backbone.
56
+ """
57
+
58
+ image_grid_thw: torch.Tensor
59
+
60
+
61
+ DotsOCRImageInputs = Union[DotsOCRImagePixelInputs, DotsOCRImageEmbeddingInputs]
62
+
63
+
64
+ class DotsOCRMultiModalProcessor(Qwen2_5_VLMultiModalProcessor):
65
+ pass
66
+
67
+
68
+ class DotsOCRDummyInputsBuilder(Qwen2VLDummyInputsBuilder):
69
+ def get_dummy_mm_data(
70
+ self,
71
+ seq_len: int,
72
+ mm_counts: Mapping[str, int],
73
+ ) -> MultiModalDataDict:
74
+ num_images = mm_counts.get("image", 0)
75
+
76
+ target_width, target_height = self.info.get_image_size_with_most_features()
77
+
78
+ return {
79
+ "image": self._get_dummy_images(width=target_width, height=target_height, num_images=num_images),
80
+ }
81
+
82
+
83
+ class DotsOCRProcessingInfo(Qwen2_5_VLProcessingInfo):
84
+ def get_hf_config(self) -> DotsOCRConfig:
85
+ config = self.ctx.get_hf_config()
86
+ if not config.__class__.__name__ == 'DotsOCRConfig':
87
+ raise TypeError(f"Expected DotsOCRConfig, got {type(config)}")
88
+
89
+ if hasattr(config, "vision_config") and isinstance(config.vision_config, dict):
90
+ config.vision_config = DotsVisionConfig(**config.vision_config)
91
+
92
+ return config
93
+
94
+ def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]:
95
+ return {"image": None, "video": 0}
96
+
97
+ def get_mm_max_tokens_per_item(
98
+ self,
99
+ seq_len: int,
100
+ mm_counts: Mapping[str, int],
101
+ ) -> Mapping[str, int]:
102
+ max_image_tokens = self.get_max_image_tokens()
103
+ return {"image": max_image_tokens, "video": 0}
104
+
105
+ def get_hf_processor(
106
+ self,
107
+ *,
108
+ min_pixels: Optional[int] = None,
109
+ max_pixels: Optional[int] = None,
110
+ size: Optional[dict[str, int]] = None,
111
+ **kwargs: object,
112
+ ) -> Qwen2VLProcessor:
113
+ self.get_tokenizer().image_token = "<|imgpad|>" # Ensure image token is set
114
+ processor = self.ctx.get_hf_processor(
115
+ Qwen2VLProcessor,
116
+ image_processor=self.get_image_processor(min_pixels=min_pixels, max_pixels=max_pixels, size=size),
117
+ **kwargs,
118
+ )
119
+ processor.image_token = "<|imgpad|>"
120
+ processor.video_token = "<|video_pad|>"
121
+ return processor
122
+
123
+ def _get_vision_info(
124
+ self,
125
+ *,
126
+ image_width: int,
127
+ image_height: int,
128
+ num_frames: int = 1,
129
+ do_resize: bool = True,
130
+ image_processor: Optional[Qwen2VLImageProcessor],
131
+ ) -> tuple[ImageSize, int]:
132
+ if image_processor is None:
133
+ image_processor = self.get_image_processor()
134
+
135
+ hf_config: DotsOCRConfig = self.get_hf_config()
136
+ vision_config = hf_config.vision_config
137
+ patch_size = vision_config.patch_size
138
+ merge_size = vision_config.spatial_merge_size
139
+ temporal_patch_size = vision_config.temporal_patch_size
140
+
141
+ if do_resize:
142
+ resized_height, resized_width = smart_resize(
143
+ height=image_height,
144
+ width=image_width,
145
+ factor=patch_size * merge_size,
146
+ min_pixels=image_processor.min_pixels,
147
+ max_pixels=image_processor.max_pixels,
148
+ )
149
+ preprocessed_size = ImageSize(width=resized_width, height=resized_height)
150
+ else:
151
+ preprocessed_size = ImageSize(width=image_width, height=image_height)
152
+
153
+ # NOTE: Frames are padded to be divisible by `temporal_patch_size`
154
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L294
155
+ padded_num_frames = num_frames + num_frames % temporal_patch_size
156
+
157
+ grid_t = max(padded_num_frames // temporal_patch_size, 1)
158
+ grid_h = preprocessed_size.height // patch_size
159
+ grid_w = preprocessed_size.width // patch_size
160
+
161
+ num_patches = grid_t * grid_h * grid_w
162
+ num_vision_tokens = num_patches // (merge_size**2)
163
+
164
+ return preprocessed_size, num_vision_tokens
165
+
166
+
167
+ @MULTIMODAL_REGISTRY.register_processor(
168
+ Qwen2_5_VLMultiModalProcessor,
169
+ info=DotsOCRProcessingInfo,
170
+ dummy_inputs=DotsOCRDummyInputsBuilder,
171
+ )
172
+ class DotsOCRForCausalLM(nn.Module, SupportsMultiModal):
173
+ hf_to_vllm_mapper = WeightsMapper(
174
+ orig_to_new_prefix={
175
+ "lm_head.": "language_model.lm_head.",
176
+ "model.": "language_model.model.",
177
+ }
178
+ )
179
+ _tp_plan = {}
180
+
181
+ @classmethod
182
+ def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]:
183
+ if modality in ("image",):
184
+ return "<|img|><|imgpad|><|endofimg|>"
185
+
186
+ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
187
+ super().__init__()
188
+
189
+ self.config: DotsOCRConfig = vllm_config.model_config.hf_config
190
+ self.quant_config = vllm_config.quant_config
191
+ self.multimodal_config = vllm_config.model_config.multimodal_config
192
+
193
+ if isinstance(self.config.vision_config, dict):
194
+ vision_config = DotsVisionConfig(**self.config.vision_config)
195
+ self.config.vision_config = vision_config
196
+ else:
197
+ vision_config = self.config.vision_config
198
+
199
+ self.vision_tower = DotsVisionTransformer(vision_config)
200
+ self.language_model: Qwen2ForCausalLM = init_vllm_registered_model(
201
+ vllm_config=vllm_config,
202
+ hf_config=self.config,
203
+ prefix=maybe_prefix(prefix, "language_model"),
204
+ architectures=["Qwen2ForCausalLM"],
205
+ )
206
+
207
+ @cached_property
208
+ def sampler(self):
209
+ if hasattr(self.language_model, "sampler"):
210
+ return self.language_model.sampler
211
+
212
+ return get_sampler()
213
+
214
+ def _validate_and_reshape_mm_tensor(self, mm_input: object, name: str) -> torch.Tensor:
215
+ if not isinstance(mm_input, (torch.Tensor, list)):
216
+ raise ValueError(f"Incorrect type of {name}. " f"Got type: {type(mm_input)}")
217
+ if isinstance(mm_input, torch.Tensor):
218
+ if mm_input.ndim == 2:
219
+ return mm_input
220
+ if mm_input.ndim != 3:
221
+ raise ValueError(
222
+ f"{name} should be 2D or batched 3D tensor. "
223
+ f"Got ndim: {mm_input.ndim} "
224
+ f"(shape={mm_input.shape})"
225
+ )
226
+ return torch.concat(list(mm_input))
227
+ else:
228
+ return torch.concat(mm_input)
229
+
230
+ def _parse_and_validate_image_input(self, **kwargs: object) -> Optional[DotsOCRImageInputs]:
231
+ pixel_values = kwargs.pop("pixel_values", None)
232
+ image_embeds = kwargs.pop("image_embeds", None)
233
+ image_grid_thw = kwargs.pop("image_grid_thw", None)
234
+
235
+ if pixel_values is None and image_embeds is None:
236
+ return None
237
+
238
+ if pixel_values is not None:
239
+ pixel_values = self._validate_and_reshape_mm_tensor(pixel_values, "image pixel values")
240
+ image_grid_thw = self._validate_and_reshape_mm_tensor(image_grid_thw, "image grid_thw")
241
+
242
+ if not isinstance(pixel_values, (torch.Tensor, list)):
243
+ raise ValueError("Incorrect type of image pixel values. " f"Got type: {type(pixel_values)}")
244
+
245
+ return DotsOCRImagePixelInputs(
246
+ type="pixel_values", pixel_values=pixel_values, image_grid_thw=image_grid_thw
247
+ )
248
+
249
+ if image_embeds is not None:
250
+ image_embeds = self._validate_and_reshape_mm_tensor(image_embeds, "image embeds")
251
+ image_grid_thw = self._validate_and_reshape_mm_tensor(image_grid_thw, "image grid_thw")
252
+
253
+ if not isinstance(image_embeds, torch.Tensor):
254
+ raise ValueError("Incorrect type of image embeddings. " f"Got type: {type(image_embeds)}")
255
+ return DotsOCRImageEmbeddingInputs(
256
+ type="image_embeds", image_embeds=image_embeds, image_grid_thw=image_grid_thw
257
+ )
258
+
259
+ def vision_forward(self, pixel_values: torch.Tensor, image_grid_thw: torch.Tensor):
260
+ from vllm.distributed import (
261
+ get_tensor_model_parallel_group,
262
+ get_tensor_model_parallel_rank,
263
+ get_tensor_model_parallel_world_size,
264
+ )
265
+
266
+ assert self.vision_tower is not None
267
+
268
+ tp_rank = get_tensor_model_parallel_rank()
269
+ tp = get_tensor_model_parallel_world_size()
270
+
271
+ image_grid_thw_chunk = image_grid_thw.chunk(tp)
272
+ image_sizes_consum = torch.tensor([i.prod(-1).sum() for i in image_grid_thw_chunk]).cumsum(dim=0)
273
+ merge_size_square = self.vision_tower.config.spatial_merge_size**2
274
+ image_embedding = torch.zeros(
275
+ (
276
+ pixel_values.shape[0] // merge_size_square,
277
+ self.vision_tower.config.hidden_size,
278
+ ),
279
+ device=pixel_values.device,
280
+ dtype=pixel_values.dtype,
281
+ )
282
+
283
+ if tp_rank < len(image_sizes_consum):
284
+ idx_start = 0 if tp_rank == 0 else image_sizes_consum[tp_rank - 1].item()
285
+ idx_end = image_sizes_consum[tp_rank].item()
286
+ pixel_values_part = pixel_values[idx_start:idx_end]
287
+ image_grid_thw_part = image_grid_thw_chunk[tp_rank]
288
+ image_embedding_part = self.vision_tower(pixel_values_part, image_grid_thw_part)
289
+ image_embedding[idx_start // merge_size_square : idx_end // merge_size_square] = image_embedding_part
290
+
291
+ group = get_tensor_model_parallel_group().device_group
292
+ torch.distributed.all_reduce(image_embedding, group=group)
293
+ return image_embedding
294
+
295
+ def _process_image_input(self, image_input: DotsOCRImageInputs) -> tuple[torch.Tensor, ...]:
296
+ grid_thw = image_input["image_grid_thw"]
297
+ assert grid_thw.ndim == 2
298
+
299
+ if image_input["type"] == "image_embeds":
300
+ image_embeds = image_input["image_embeds"].type(self.vision_tower.dtype)
301
+ else:
302
+ pixel_values = image_input["pixel_values"].type(self.vision_tower.dtype)
303
+ image_embeds = self.vision_forward(pixel_values, grid_thw)[
304
+ :, : self.config.hidden_size
305
+ ]
306
+
307
+ # Split concatenated embeddings for each image item.
308
+ merge_size = self.vision_tower.config.spatial_merge_size
309
+ sizes = grid_thw.prod(-1) // merge_size // merge_size
310
+
311
+ return image_embeds.split(sizes.tolist())
312
+
313
+ def _parse_and_validate_multimodal_inputs(self, **kwargs: object) -> dict:
314
+ modalities = {}
315
+
316
+ # Preserve the order of modalities if there are multiple of them
317
+ # from the order of kwargs.
318
+ for input_key in kwargs:
319
+ if input_key in ("pixel_values", "image_embeds") and "images" not in modalities:
320
+ modalities["images"] = self._parse_and_validate_image_input(**kwargs)
321
+ return modalities
322
+
323
+ def get_language_model(self) -> torch.nn.Module:
324
+ return self.language_model
325
+
326
+ def get_multimodal_embeddings(self, **kwargs: object) -> Optional[MultiModalEmbeddings]:
327
+ modalities = self._parse_and_validate_multimodal_inputs(**kwargs)
328
+ if not modalities:
329
+ return None
330
+
331
+ # The result multimodal_embeddings is tuple of tensors, with each
332
+ # tensor corresponding to a multimodal data item (image or video).
333
+ multimodal_embeddings: tuple[torch.Tensor, ...] = ()
334
+
335
+ # NOTE: It is important to iterate over the keys in this dictionary
336
+ # to preserve the order of the modalities.
337
+ for modality in modalities:
338
+ if modality == "images":
339
+ image_input = modalities["images"]
340
+ vision_embeddings = self._process_image_input(image_input)
341
+ multimodal_embeddings += vision_embeddings
342
+
343
+ return multimodal_embeddings
344
+
345
+ def get_input_embeddings(
346
+ self,
347
+ input_ids: torch.Tensor,
348
+ multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
349
+ ) -> torch.Tensor:
350
+ inputs_embeds = self.language_model.get_input_embeddings(input_ids)
351
+ if multimodal_embeddings is not None:
352
+ inputs_embeds = merge_multimodal_embeddings(
353
+ input_ids,
354
+ inputs_embeds,
355
+ multimodal_embeddings,
356
+ [self.config.image_token_id, self.config.video_token_id],
357
+ )
358
+
359
+ return inputs_embeds
360
+
361
+ def get_input_embeddings_v0(
362
+ self,
363
+ input_ids: torch.Tensor,
364
+ image_input: Optional[DotsOCRImagePixelInputs] = None,
365
+ ) -> torch.Tensor:
366
+ inputs_embeds = self.get_input_embeddings(input_ids)
367
+ if image_input is not None:
368
+ image_embeds = self._process_image_input(image_input)
369
+ inputs_embeds = merge_multimodal_embeddings(
370
+ input_ids,
371
+ inputs_embeds,
372
+ image_embeds,
373
+ placeholder_token_id=self.config.image_token_id,
374
+ )
375
+ return inputs_embeds
376
+
377
+ def forward(
378
+ self,
379
+ input_ids: Optional[torch.Tensor],
380
+ positions: torch.Tensor,
381
+ intermediate_tensors: Optional[IntermediateTensors] = None,
382
+ inputs_embeds: Optional[torch.Tensor] = None,
383
+ **kwargs,
384
+ ) -> Union[torch.Tensor, IntermediateTensors]:
385
+ if intermediate_tensors is not None:
386
+ inputs_embeds = None
387
+ elif inputs_embeds is None and kwargs.get("pixel_values") is not None:
388
+ image_input = self._parse_and_validate_image_input(**kwargs)
389
+ if image_input is None:
390
+ inputs_embeds = None
391
+ else:
392
+ assert input_ids is not None
393
+ inputs_embeds = self.get_input_embeddings_v0(
394
+ input_ids,
395
+ image_input=image_input,
396
+ )
397
+ input_ids = None
398
+
399
+ hidden_states = self.language_model(
400
+ input_ids=input_ids,
401
+ positions=positions,
402
+ intermediate_tensors=intermediate_tensors,
403
+ inputs_embeds=inputs_embeds,
404
+ )
405
+
406
+ return hidden_states
407
+
408
+ def compute_logits(
409
+ self,
410
+ hidden_states: torch.Tensor,
411
+ sampling_metadata: SamplingMetadata,
412
+ ) -> Optional[torch.Tensor]:
413
+ return self.language_model.compute_logits(hidden_states, sampling_metadata)
414
+
415
+ def sample(
416
+ self,
417
+ logits: Optional[torch.Tensor],
418
+ sampling_metadata: SamplingMetadata,
419
+ ) -> Optional[SamplerOutput]:
420
+ next_tokens = self.sampler(logits, sampling_metadata)
421
+ return next_tokens
422
+
423
+ def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]) -> Set[str]:
424
+ loader = AutoWeightsLoader(self)
425
+ return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
426
+
427
+
428
+ def patch_vllm_chat_placeholder():
429
+ import vllm
430
+ # Skip the patch on vLLM versions newer than 0.9.1.
431
+ if vllm.__version_tuple__[:3] > (0, 9, 1):
432
+ return
433
+ from vllm.entrypoints.chat_utils import BaseMultiModalItemTracker
434
+
435
+ ori = BaseMultiModalItemTracker._placeholder_str
436
+
437
+ def _placeholder_str(self, modality, current_count: int) -> Optional[str]:
438
+ hf_config = self._model_config.hf_config
439
+ model_type = hf_config.model_type
440
+ if modality in ("image",) and model_type in ["dots_ocr"]:
441
+ return "<|img|><|imgpad|><|endofimg|>"
442
+ return ori(self, modality, current_count)
443
+
444
+ BaseMultiModalItemTracker._placeholder_str = _placeholder_str
445
+
446
+ ModelRegistry.register_model(
447
+ "DotsOCRForCausalLM", DotsOCRForCausalLM,
448
+ )
449
+
450
+
451
+ patch_vllm_chat_placeholder()
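
A minimal offline-inference sketch for the vLLM integration above. The model path and sampling parameters are assumptions, and the image placeholder string mirrors `get_placeholder_str`; the module above must have been imported so that `DotsOCRForCausalLM` is registered with `ModelRegistry`.

```python
# A minimal sketch, not the official serving recipe: model path and
# generation settings are assumptions.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="rednote-hilab/dots.ocr", trust_remote_code=True)  # assumed model path

image = Image.open("page.png")  # hypothetical input image
prompt = "<|user|><|img|><|imgpad|><|endofimg|>Extract the text in this image.<|endofuser|><|assistant|>"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```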
modeling_dots_vision.py ADDED
@@ -0,0 +1,520 @@
1
+ import math
2
+
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+ import torch.utils.checkpoint
7
+
8
+ flash_attn_available = True
9
+ npu_available = True
10
+
11
+ try:
12
+ from flash_attn import flash_attn_varlen_func
13
+ except ImportError:
14
+ flash_attn_available = False
15
+
16
+ from torch.nn import LayerNorm
17
+ from transformers.modeling_utils import PreTrainedModel
18
+ from .configuration_dots import DotsVisionConfig
19
+
20
+ try:
21
+ import torch_npu
22
+ except ImportError:
23
+ npu_available = False
24
+
25
+
26
+ def rotate_half(x):
27
+ """Rotates half the hidden dims of the input."""
28
+ x1 = x[..., : x.shape[-1] // 2]
29
+ x2 = x[..., x.shape[-1] // 2:]
30
+ return torch.cat((-x2, x1), dim=-1)
31
+
32
+
33
+ def apply_rotary_pos_emb_vision(tensor: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
34
+ orig_dtype = tensor.dtype
35
+ tensor = tensor.float()
36
+
37
+ cos = freqs.cos()
38
+ sin = freqs.sin()
39
+
40
+ cos = cos.unsqueeze(1).repeat(1, 1, 2).unsqueeze(0).float()
41
+ sin = sin.unsqueeze(1).repeat(1, 1, 2).unsqueeze(0).float()
42
+
43
+ output = (tensor * cos) + (rotate_half(tensor) * sin)
44
+
45
+ output = output.to(orig_dtype)
46
+
47
+ return output
48
+
49
+
50
+ class VisionRotaryEmbedding(nn.Module):
51
+ def __init__(self, dim: int, theta: float = 10000.0) -> None:
52
+ super().__init__()
53
+ inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
54
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
55
+
56
+ def forward(self, seqlen: int) -> torch.Tensor:
57
+ seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
58
+ freqs = torch.outer(seq, self.inv_freq)
59
+ return freqs
60
+
61
+
62
+ class PatchMerger(nn.Module):
63
+ def __init__(
64
+ self,
65
+ dim: int,
66
+ context_dim: int,
67
+ spatial_merge_size: int = 2,
68
+ pre_norm="layernorm",
69
+ init_merger_std=None,
70
+ ) -> None:
71
+ super().__init__()
72
+ self.hidden_size = context_dim * (spatial_merge_size ** 2)
73
+ self.pre_norm = pre_norm
74
+ if self.pre_norm == "layernorm":
75
+ self.ln_q = LayerNorm(context_dim, eps=1e-6)
76
+ elif self.pre_norm == "rmsnorm":
77
+ self.ln_q = RMSNorm(context_dim, eps=1e-6)
78
+ else:
79
+ print("no norm in patch merger")
80
+
81
+ self.mlp = nn.Sequential(
82
+ nn.Linear(self.hidden_size, self.hidden_size),
83
+ nn.GELU(),
84
+ nn.Linear(self.hidden_size, dim),
85
+ )
86
+
87
+ if init_merger_std is not None:
88
+ nn.init.normal_(self.mlp[0].weight, mean=0.0, std=init_merger_std)
89
+ nn.init.zeros_(self.mlp[0].bias)
90
+ nn.init.normal_(self.mlp[2].weight, mean=0.0, std=init_merger_std)
91
+ nn.init.zeros_(self.mlp[2].bias)
92
+
93
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
94
+ if self.pre_norm:
95
+ x = self.mlp(self.ln_q(x).view(-1, self.hidden_size))
96
+ else:
97
+ x = self.mlp(x.view(-1, self.hidden_size))
98
+ return x
99
+
100
+
101
+ class VisionAttention(nn.Module):
102
+ def __init__(self, config, dim: int, num_heads: int = 16, bias=True) -> None:
103
+ super().__init__()
104
+ self.num_heads = num_heads
105
+ self.head_dim = dim // num_heads
106
+ self.qkv = nn.Linear(dim, dim * 3, bias=bias)
107
+ self.proj = nn.Linear(dim, dim, bias=bias)
108
+
109
+ def forward(
110
+ self,
111
+ hidden_states: torch.Tensor,
112
+ cu_seqlens: torch.Tensor,
113
+ rotary_pos_emb: torch.Tensor = None,
114
+ ) -> torch.Tensor:
115
+ seq_length = hidden_states.shape[0]
116
+
117
+ q, k, v = self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
118
+ q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
119
+ k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)
120
+
121
+ attention_mask = torch.full(
122
+ [1, seq_length, seq_length], torch.finfo(q.dtype).min, device=q.device, dtype=q.dtype
123
+ )
124
+ for i in range(1, len(cu_seqlens)):
125
+ attention_mask[..., cu_seqlens[i - 1]: cu_seqlens[i], cu_seqlens[i - 1]: cu_seqlens[i]] = 0
126
+
127
+ q = q.transpose(0, 1)
128
+ k = k.transpose(0, 1)
129
+ v = v.transpose(0, 1)
130
+ attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim)
131
+ attn_weights = attn_weights + attention_mask
132
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(q.dtype)
133
+ attn_output = torch.matmul(attn_weights, v)
134
+ attn_output = attn_output.transpose(0, 1)
135
+ attn_output = attn_output.reshape(seq_length, -1)
136
+ attn_output = self.proj(attn_output)
137
+ return attn_output
138
+
139
+
140
+ class VisionFlashAttention2(nn.Module):
141
+ def __init__(self, config, dim: int, num_heads: int = 16, bias=True) -> None:
142
+ super().__init__()
143
+ self.num_heads = num_heads
144
+ self.qkv = nn.Linear(dim, dim * 3, bias=bias)
145
+ self.proj = nn.Linear(dim, dim, bias=bias)
146
+ self.config = config
147
+ self.is_causal = config.is_causal
148
+
149
+ def forward(
150
+ self,
151
+ hidden_states: torch.Tensor,
152
+ cu_seqlens: torch.Tensor,
153
+ rotary_pos_emb: torch.Tensor = None,
154
+ ) -> torch.Tensor:
155
+ seq_length = hidden_states.shape[0]
156
+ q, k, v = (
157
+ self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
158
+ ) # 'shd'
159
+ q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
160
+ k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)
161
+ max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()
162
+ attn_output = flash_attn_varlen_func(
163
+ q, k, v, cu_seqlens, cu_seqlens, max_seqlen, max_seqlen, causal=self.is_causal
164
+ ).reshape(seq_length, -1)
165
+ attn_output = self.proj(attn_output)
166
+
167
+ return attn_output
168
+
169
+
170
+ class VisionAttentionV2(nn.Module):
171
+ def __init__(self, config, dim: int, num_heads: int = 16, bias=True) -> None:
172
+ super().__init__()
173
+ self.num_heads = num_heads
174
+ self.head_dim = dim // num_heads
175
+ self.qkv = nn.Linear(dim, dim * 3, bias=bias)
176
+ self.proj = nn.Linear(dim, dim, bias=bias)
177
+
178
+ def forward(
179
+ self,
180
+ hidden_states: torch.Tensor,
181
+ cu_seqlens: torch.Tensor,
182
+ rotary_pos_emb: torch.Tensor = None,
183
+ ) -> torch.Tensor:
184
+ seq_length = hidden_states.shape[0]
185
+
186
+ q, k, v = self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
187
+ q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
188
+ k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)
189
+
190
+ seqlens = torch.diff(cu_seqlens).tolist()
191
+
192
+ q_list = torch.split(q, seqlens, 0)
193
+ k_list = torch.split(k, seqlens, 0)
194
+ v_list = torch.split(v, seqlens, 0)
195
+ # Eager attention needs O(n^2) memory with n = b*s (batch_size * seq_len), so long sequences easily OOM.
196
+ # This implementation splits the sequence per sample to reduce memory, at the cost of slower compute than continuous batching.
197
+ outputs = []
198
+ for q_i, k_i, v_i in zip(q_list, k_list, v_list):
199
+ q_i = q_i.transpose(0, 1)
200
+ k_i = k_i.transpose(0, 1)
201
+ v_i = v_i.transpose(0, 1)
202
+ out = torch.matmul(q_i, k_i.transpose(1, 2)) / math.sqrt(self.head_dim)
203
+ out = nn.functional.softmax(out, dim=-1, dtype=torch.float32).to(q.dtype)
204
+ out = torch.matmul(out, v_i)
205
+ out = out.transpose(0, 1)
206
+ outputs.append(out)
207
+
208
+ attn_output = torch.concat(outputs, dim=0)
209
+ attn_output = attn_output.reshape(seq_length, -1)
210
+ attn_output = self.proj(attn_output)
211
+ return attn_output
212
+
213
+
214
+ class VisionAscendAttention(nn.Module):
215
+ def __init__(self, config, dim: int, num_heads: int = 16, bias=True) -> None:
216
+ super().__init__()
217
+ self.num_heads = num_heads
218
+ self.head_dim = dim // num_heads
219
+ self.qkv = nn.Linear(dim, dim * 3, bias=bias)
220
+ self.proj = nn.Linear(dim, dim, bias=bias)
221
+ self.config = config
222
+
223
+ def forward(
224
+ self,
225
+ hidden_states: torch.Tensor,
226
+ cu_seqlens: torch.Tensor,
227
+ rotary_pos_emb: torch.Tensor = None,
228
+ ) -> torch.Tensor:
229
+ seq_length = hidden_states.shape[0]
230
+ q, k, v = self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
231
+
232
+ q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
233
+ k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)
234
+
235
+ attention_mask = torch.ones([1, seq_length, seq_length], device=q.device, dtype=torch.bool)
236
+ for i in range(1, len(cu_seqlens)):
237
+ attention_mask[..., cu_seqlens[i - 1]: cu_seqlens[i], cu_seqlens[i - 1]: cu_seqlens[i]] = False
238
+
239
+ q = q.transpose(0, 1).unsqueeze(0)
240
+ k = k.transpose(0, 1).unsqueeze(0)
241
+ v = v.transpose(0, 1).unsqueeze(0)
242
+
243
+ attn_output = torch_npu.npu_prompt_flash_attention(q, k, v,
244
+ atten_mask=attention_mask,
245
+ num_heads=self.num_heads, input_layout="BNSD",
246
+ scale_value=self.head_dim ** -0.5)
247
+ attn_output = attn_output.squeeze(0).transpose(0, 1)
248
+ attn_output = attn_output.reshape(seq_length, -1)
249
+ attn_output = self.proj(attn_output)
250
+ return attn_output
251
+
252
+
253
+ class VisionSdpaAttention(nn.Module):
254
+ def __init__(self, config, dim: int, num_heads: int = 16, bias=True) -> None:
255
+ super().__init__()
256
+ self.num_heads = num_heads
257
+ self.qkv = nn.Linear(dim, dim * 3, bias=bias)
258
+ self.proj = nn.Linear(dim, dim, bias=bias)
259
+ self.config = config
260
+
261
+ def forward(
262
+ self,
263
+ hidden_states: torch.Tensor,
264
+ cu_seqlens: torch.Tensor,
265
+ rotary_pos_emb: torch.Tensor = None,
266
+ ) -> torch.Tensor:
267
+ seq_length = hidden_states.shape[0]
268
+ q, k, v = self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
269
+
270
+ q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
271
+ k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)
272
+
273
+ attention_mask = torch.zeros([1, seq_length, seq_length], device=q.device, dtype=torch.bool)
274
+ for i in range(1, len(cu_seqlens)):
275
+ attention_mask[..., cu_seqlens[i - 1]: cu_seqlens[i], cu_seqlens[i - 1]: cu_seqlens[i]] = True
276
+
277
+ # Reshape q, k, v to 4D for scaled_dot_product_attention: (1, num_heads, seq_length, head_dim)
278
+ q = q.transpose(0, 1).unsqueeze(0) # (1, num_heads, seq_length, head_dim)
279
+ k = k.transpose(0, 1).unsqueeze(0)
280
+ v = v.transpose(0, 1).unsqueeze(0)
281
+
282
+ # See: https://github.com/pytorch/pytorch/issues/127523
283
+ if attention_mask.stride(-1) != 1:
284
+ attention_mask = torch.empty_like(attention_mask, memory_format=torch.contiguous_format).copy_(attention_mask)
285
+
286
+ # use memory efficient backend
287
+ from torch.nn.attention import SDPBackend, sdpa_kernel
288
+ with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
289
+ attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)
290
+
291
+ attn_output = attn_output.squeeze(0).transpose(0, 1) # (seq_length, num_heads, head_dim)
292
+ attn_output = attn_output.reshape(seq_length, -1)
293
+
294
+ attn_output = self.proj(attn_output)
295
+ return attn_output
296
+
297
+
298
+ DOTS_VISION_ATTENTION_CLASSES = {
299
+ "eager": VisionAttention,
300
+ "eager_v2": VisionAttentionV2, # lower memory usage
301
+ "flash_attention_2": VisionFlashAttention2,
302
+ "sdpa": VisionSdpaAttention,
303
+ "ascend_fa": VisionAscendAttention, # Ascend NPU; accuracy degrades noticeably on long sequences.
304
+ }
305
+
306
+
307
+ class RMSNorm(nn.Module):
308
+ def __init__(self, dim: int, eps: float = 1e-6):
309
+ super().__init__()
310
+ self.weight = nn.Parameter(torch.ones(dim))
311
+ self.eps = eps
312
+
313
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
314
+ output = self._norm(x.float()).type_as(x)
315
+ return output * self.weight
316
+
317
+ def extra_repr(self) -> str:
318
+ return f"{tuple(self.weight.shape)}, eps={self.eps}"
319
+
320
+ def _norm(self, x: torch.Tensor) -> torch.Tensor:
321
+ return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
322
+
323
+
324
+ class DotsSwiGLUFFN(nn.Module):
325
+ def __init__(self, config):
326
+ super().__init__()
327
+ hidden_features = config.intermediate_size
328
+ in_features = config.embed_dim
329
+ bias = config.use_bias
330
+
331
+ self.fc1 = nn.Linear(in_features, hidden_features, bias=bias)
332
+ self.fc2 = nn.Linear(hidden_features, in_features, bias=bias)
333
+ self.fc3 = nn.Linear(in_features, hidden_features, bias=bias)
334
+
335
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
336
+ x = F.silu(self.fc1(x)) * self.fc3(x)
337
+ x = self.fc2(x)
338
+ return x
339
+
340
+
341
+ class DotsPatchEmbed(nn.Module):
342
+ def __init__(self, config):
343
+ super().__init__()
344
+ self.num_channels = config.num_channels
345
+ self.patch_size = config.patch_size
346
+ self.temporal_patch_size = config.temporal_patch_size
347
+ self.embed_dim = config.embed_dim
348
+ self.config = config
349
+ self.proj = nn.Conv2d(
350
+ config.num_channels,
351
+ config.embed_dim,
352
+ kernel_size=(config.patch_size, config.patch_size),
353
+ stride=(config.patch_size, config.patch_size),
354
+ )
355
+ self.norm = RMSNorm(config.embed_dim, eps=config.rms_norm_eps)
356
+
357
+ def forward(self, x: torch.Tensor, grid_thw=None) -> torch.Tensor:
358
+ x = x.view(-1, self.num_channels, self.temporal_patch_size, self.patch_size, self.patch_size)[:, :, 0]
359
+ x = self.proj(x).view(-1, self.embed_dim)
360
+ x = self.norm(x)
361
+ return x
362
+
363
+
364
+ class DotsViTPreprocessor(nn.Module):
365
+ def __init__(self, config):
366
+ super().__init__()
367
+ self.patch_h = config.patch_size
368
+ self.patch_w = config.patch_size
369
+ self.embed_dim = config.embed_dim
370
+ self.config = config
371
+ self.patchifier = DotsPatchEmbed(config)
372
+
373
+ def forward(self, x: torch.Tensor, grid_thw=None) -> torch.Tensor:
374
+ tokens = self.patchifier(x, grid_thw)
375
+ return tokens
376
+
377
+
378
+ class DotsVisionBlock(nn.Module):
379
+ def __init__(self, config, attn_implementation: str = "flash_attention_2"):
380
+ super().__init__()
381
+
382
+ if attn_implementation == "flash_attention_2" and not flash_attn_available:
383
+ # fallback to eager
384
+ attn_implementation = "eager"
385
+ print("flash attention not available! falling back to eager implementation")
386
+
387
+ if attn_implementation == "ascend_fa" and not npu_available:
388
+ attn_implementation = "eager"
389
+ print("torch_npu not available! falling back to eager implementation")
390
+
391
+ self.attn = DOTS_VISION_ATTENTION_CLASSES[attn_implementation](
392
+ config, config.embed_dim, num_heads=config.num_attention_heads, bias=config.use_bias
393
+ )
394
+ self.norm1 = RMSNorm(config.embed_dim, eps=config.rms_norm_eps)
395
+ self.mlp = DotsSwiGLUFFN(config)
396
+ self.norm2 = RMSNorm(config.embed_dim, eps=config.rms_norm_eps)
397
+
398
+ def forward(self, hidden_states, cu_seqlens, rotary_pos_emb) -> torch.Tensor:
399
+ hidden_states = hidden_states + self.attn(
400
+ self.norm1(hidden_states), cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb
401
+ )
402
+ hidden_states = hidden_states + self.mlp(self.norm2(hidden_states))
403
+ return hidden_states
404
+
405
+
406
+ class DotsVisionTransformer(PreTrainedModel):
407
+ def __init__(self, config: DotsVisionConfig) -> None:
408
+ super().__init__(config)
409
+ self.config = config
410
+ self.spatial_merge_size = config.spatial_merge_size
411
+
412
+ self.patch_embed = DotsViTPreprocessor(config)
413
+ self._init_weights(self.patch_embed.patchifier.proj)
414
+
415
+ head_dim = config.embed_dim // config.num_attention_heads
416
+
417
+ self.rotary_pos_emb = VisionRotaryEmbedding(head_dim // 2)
418
+
419
+ _num_hidden_layers = config.num_hidden_layers
420
+ self.blocks = nn.ModuleList(
421
+ [DotsVisionBlock(config, config.attn_implementation) for _ in range(_num_hidden_layers)]
422
+ )
423
+
424
+ if self.config.post_norm:
425
+ self.post_trunk_norm = RMSNorm(config.embed_dim, eps=config.rms_norm_eps)
426
+
427
+ self.merger = PatchMerger(
428
+ dim=config.hidden_size,
429
+ context_dim=config.embed_dim,
430
+ spatial_merge_size=config.spatial_merge_size,
431
+ init_merger_std=self.config.init_merger_std,
432
+ )
433
+
434
+ self.gradient_checkpointing = False
435
+ self._gradient_checkpointing_func = torch.utils.checkpoint.checkpoint
436
+
437
+ def _init_weights(self, module):
438
+ std = self.config.initializer_range
439
+ if isinstance(module, (nn.Linear, nn.Conv3d)):
440
+ module.weight.data.normal_(mean=0.0, std=std)
441
+ if module.bias is not None:
442
+ module.bias.data.zero_()
443
+ elif isinstance(module, nn.Embedding):
444
+ module.weight.data.normal_(mean=0.0, std=std)
445
+ if module.padding_idx is not None:
446
+ module.weight.data[module.padding_idx].zero_()
447
+
448
+ @property
449
+ def dtype(self) -> torch.dtype:
450
+ return self.blocks[0].mlp.fc2.weight.dtype
451
+
452
+ @property
453
+ def device(self) -> torch.device:
454
+ return self.blocks[0].mlp.fc2.weight.device
455
+
456
+ def get_pos_ids_by_grid(self, grid_thw):
457
+ pos_ids = []
458
+ for t, h, w in grid_thw:
459
+ hpos_ids = torch.arange(h).unsqueeze(1).expand(-1, w)
460
+ hpos_ids = hpos_ids.reshape(
461
+ h // self.spatial_merge_size,
462
+ self.spatial_merge_size,
463
+ w // self.spatial_merge_size,
464
+ self.spatial_merge_size,
465
+ )
466
+ hpos_ids = hpos_ids.permute(0, 2, 1, 3)
467
+ hpos_ids = hpos_ids.flatten()
468
+
469
+ wpos_ids = torch.arange(w).unsqueeze(0).expand(h, -1)
470
+ wpos_ids = wpos_ids.reshape(
471
+ h // self.spatial_merge_size,
472
+ self.spatial_merge_size,
473
+ w // self.spatial_merge_size,
474
+ self.spatial_merge_size,
475
+ )
476
+ wpos_ids = wpos_ids.permute(0, 2, 1, 3)
477
+ wpos_ids = wpos_ids.flatten()
478
+ pos_ids.append(
479
+ torch.stack([hpos_ids, wpos_ids], dim=-1).repeat(t, 1)
480
+ )
481
+
482
+ return pos_ids
483
+
484
+ def rot_pos_emb(self, grid_thw):
485
+ pos_ids = self.get_pos_ids_by_grid(grid_thw)
486
+ pos_ids = torch.cat(pos_ids, dim=0)
487
+ max_grid_size = grid_thw[:, 1:].max()
488
+ rotary_pos_emb_full = self.rotary_pos_emb(max_grid_size)
489
+ rotary_pos_emb = rotary_pos_emb_full[pos_ids].flatten(1)
490
+ return rotary_pos_emb
491
+
492
+ def forward(self, hidden_states: torch.Tensor, grid_thw: torch.Tensor, bf16=True) -> torch.Tensor:
493
+ if bf16:
494
+ hidden_states = hidden_states.bfloat16()
495
+ hidden_states = self.patch_embed(hidden_states, grid_thw)
496
+
497
+ rotary_pos_emb = self.rot_pos_emb(grid_thw)
498
+
499
+ cu_seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]).cumsum(
500
+ dim=0,
501
+ dtype=grid_thw.dtype if torch.jit.is_tracing() else torch.int32,
502
+ )
503
+ cu_seqlens = F.pad(cu_seqlens, (1, 0), value=0)
504
+
505
+ for blk in self.blocks:
506
+ if self.gradient_checkpointing and self.training:
507
+ hidden_states = self._gradient_checkpointing_func(
508
+ blk.__call__,
509
+ hidden_states,
510
+ cu_seqlens,
511
+ rotary_pos_emb,
512
+ )
513
+ else:
514
+ hidden_states = blk(hidden_states, cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb)
515
+
516
+ if self.config.post_norm:
517
+ hidden_states = self.post_trunk_norm(hidden_states)
518
+
519
+ hidden_states = self.merger(hidden_states)
520
+ return hidden_states
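
To illustrate the token accounting in `DotsVisionTransformer.forward` above: each image contributes `t * h * w` patch tokens, `cu_seqlens` marks the per-image boundaries used for block-diagonal attention, and the `PatchMerger` divides the count by `spatial_merge_size**2` before handing embeddings to the language model. A small sketch with assumed grid values:

```python
import torch
import torch.nn.functional as F

# Two hypothetical images, each with t=1 temporal slice and an h x w patch grid.
grid_thw = torch.tensor([[1, 32, 24], [1, 16, 16]])
spatial_merge_size = 2  # from the vision config

# Same construction as in DotsVisionTransformer.forward: cumulative patch counts
# per image, padded with a leading zero.
cu_seqlens = torch.repeat_interleave(
    grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]
).cumsum(dim=0, dtype=torch.int32)
cu_seqlens = F.pad(cu_seqlens, (1, 0), value=0)
print(cu_seqlens.tolist())  # [0, 768, 1024]

# Tokens handed to the language model after the PatchMerger.
merged_tokens = grid_thw.prod(-1) // spatial_merge_size**2
print(merged_tokens.tolist())  # [192, 64]
```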
preprocessor_config.json ADDED
@@ -0,0 +1,19 @@
1
+ {
2
+ "min_pixels": 3136,
3
+ "max_pixels": 11289600,
4
+ "patch_size": 14,
5
+ "temporal_patch_size": 1,
6
+ "merge_size": 2,
7
+ "image_mean": [
8
+ 0.48145466,
9
+ 0.4578275,
10
+ 0.40821073
11
+ ],
12
+ "image_std": [
13
+ 0.26862954,
14
+ 0.26130258,
15
+ 0.27577711
16
+ ],
17
+ "image_processor_type": "Qwen2VLImageProcessor",
18
+ "processor_class": "DotsVLProcessor"
19
+ }
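
The preprocessor settings above follow the Qwen2-VL image processor: images are resized so both sides are multiples of `patch_size * merge_size` while the total pixel count stays between `min_pixels` and `max_pixels`. A sketch of the resulting image token count, assuming an arbitrary input resolution:

```python
# A minimal sketch using values from preprocessor_config.json;
# the input resolution below is an assumption.
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

patch_size, merge_size = 14, 2
min_pixels, max_pixels = 3136, 11289600

height, width = 1654, 1170  # e.g. an A4 page scanned at roughly 140 DPI
resized_h, resized_w = smart_resize(
    height=height,
    width=width,
    factor=patch_size * merge_size,
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
num_patches = (resized_h // patch_size) * (resized_w // patch_size)
num_image_tokens = num_patches // merge_size**2
print(resized_h, resized_w, num_image_tokens)
```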
special_tokens_map.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|endoftext|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": "[PAD]"
25
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,391 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<|imgpad|>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": true
188
+ },
189
+ "151666": {
190
+ "content": "<|img|>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": true
196
+ },
197
+ "151667": {
198
+ "content": "<|endofimg|>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": true
204
+ },
205
+ "151668": {
206
+ "content": "<|systemprompt|>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": true
212
+ },
213
+ "151669": {
214
+ "content": "<|endofsystemprompt|>",
215
+ "lstrip": false,
216
+ "normalized": false,
217
+ "rstrip": false,
218
+ "single_word": false,
219
+ "special": true
220
+ },
221
+ "151670": {
222
+ "content": "<|user|>",
223
+ "lstrip": false,
224
+ "normalized": false,
225
+ "rstrip": false,
226
+ "single_word": false,
227
+ "special": true
228
+ },
229
+ "151671": {
230
+ "content": "<|endofuser|>",
231
+ "lstrip": false,
232
+ "normalized": false,
233
+ "rstrip": false,
234
+ "single_word": false,
235
+ "special": true
236
+ },
237
+ "151672": {
238
+ "content": "<|assistant|>",
239
+ "lstrip": false,
240
+ "normalized": false,
241
+ "rstrip": false,
242
+ "single_word": false,
243
+ "special": true
244
+ },
245
+ "151673": {
246
+ "content": "<|endofassistant|>",
247
+ "lstrip": false,
248
+ "normalized": false,
249
+ "rstrip": false,
250
+ "single_word": false,
251
+ "special": true
252
+ },
253
+ "151674": {
254
+ "content": "<|ref_start|>",
255
+ "lstrip": false,
256
+ "normalized": false,
257
+ "rstrip": false,
258
+ "single_word": false,
259
+ "special": true
260
+ },
261
+ "151675": {
262
+ "content": "<|ref_end|>",
263
+ "lstrip": false,
264
+ "normalized": false,
265
+ "rstrip": false,
266
+ "single_word": false,
267
+ "special": true
268
+ },
269
+ "151676": {
270
+ "content": "[SEP]",
271
+ "lstrip": false,
272
+ "normalized": false,
273
+ "rstrip": false,
274
+ "single_word": false,
275
+ "special": true
276
+ },
277
+ "151677": {
278
+ "content": "<|pic|>",
279
+ "lstrip": false,
280
+ "normalized": false,
281
+ "rstrip": false,
282
+ "single_word": false,
283
+ "special": true
284
+ },
285
+ "151678": {
286
+ "content": "<|text|>",
287
+ "lstrip": false,
288
+ "normalized": false,
289
+ "rstrip": false,
290
+ "single_word": false,
291
+ "special": true
292
+ },
293
+ "151679": {
294
+ "content": "<|pictotext|>",
295
+ "lstrip": false,
296
+ "normalized": false,
297
+ "rstrip": false,
298
+ "single_word": false,
299
+ "special": true
300
+ },
301
+ "151680": {
302
+ "content": "[PAD]",
303
+ "lstrip": false,
304
+ "normalized": false,
305
+ "rstrip": false,
306
+ "single_word": false,
307
+ "special": true
308
+ },
309
+ "151681": {
310
+ "content": "<|slice|>",
311
+ "lstrip": false,
312
+ "normalized": false,
313
+ "rstrip": false,
314
+ "single_word": false,
315
+ "special": true
316
+ },
317
+ "151682": {
318
+ "content": "<|endofslice|>",
319
+ "lstrip": false,
320
+ "normalized": false,
321
+ "rstrip": false,
322
+ "single_word": false,
323
+ "special": true
324
+ },
325
+ "151683": {
326
+ "content": "<|imgrowend|>",
327
+ "lstrip": false,
328
+ "normalized": false,
329
+ "rstrip": false,
330
+ "single_word": false,
331
+ "special": true
332
+ },
333
+ "151684": {
334
+ "content": "<|polygon_start|>",
335
+ "lstrip": false,
336
+ "normalized": false,
337
+ "rstrip": false,
338
+ "single_word": false,
339
+ "special": true
340
+ },
341
+ "151685": {
342
+ "content": "<|polygon_end|>",
343
+ "lstrip": false,
344
+ "normalized": false,
345
+ "rstrip": false,
346
+ "single_word": false,
347
+ "special": true
348
+ },
349
+ "151686": {
350
+ "content": "<|image_gen_start|>",
351
+ "lstrip": false,
352
+ "normalized": false,
353
+ "rstrip": false,
354
+ "single_word": false,
355
+ "special": true
356
+ },
357
+ "151687": {
358
+ "content": "<|image_gen_end|>",
359
+ "lstrip": false,
360
+ "normalized": false,
361
+ "rstrip": false,
362
+ "single_word": false,
363
+ "special": true
364
+ }
365
+ },
366
+ "additional_special_tokens": [
367
+ "<|im_start|>",
368
+ "<|im_end|>",
369
+ "<|object_ref_start|>",
370
+ "<|object_ref_end|>",
371
+ "<|box_start|>",
372
+ "<|box_end|>",
373
+ "<|quad_start|>",
374
+ "<|quad_end|>",
375
+ "<|vision_start|>",
376
+ "<|vision_end|>",
377
+ "<|vision_pad|>",
378
+ "<|image_pad|>",
379
+ "<|video_pad|>"
380
+ ],
381
+ "bos_token": null,
382
+ "chat_template": "{%- for m in messages %}\n {%- if m.role == 'system' %}\n {{- '<|system|>' + m.content + '<|endofsystem|>\\n' }}\n {%- elif m.role == 'user' %}\n {{- '<|user|>' + m.content + '<|endofuser|>' }}\n {%- elif m.role == 'assistant' %}\n {{- '<|assistant|>' + m.content }}\n {%- if not loop.last %}\n {{- '<|endofassistant|>' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if messages[-1].role != 'assistant' %}\n {{- '<|assistant|>' }}\n{%- endif %}",
383
+ "clean_up_tokenization_spaces": false,
384
+ "eos_token": "<|endoftext|>",
385
+ "errors": "replace",
386
+ "model_max_length": 131072,
387
+ "pad_token": "[PAD]",
388
+ "split_special_tokens": false,
389
+ "tokenizer_class": "Qwen2Tokenizer",
390
+ "unk_token": null
391
+ }
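
The `chat_template` above wraps system, user, and assistant turns in the role tokens defined earlier and appends `<|assistant|>` whenever the last message is not from the assistant. A sketch of rendering it (the repository id is an assumption):

```python
# A minimal sketch, assuming the repo id below; the tokenizer_class is
# Qwen2Tokenizer, so loading should work with the standard tokenizer code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rednote-hilab/dots.ocr", trust_remote_code=True)

messages = [
    {"role": "user", "content": "<|img|><|imgpad|><|endofimg|>Extract the text in this image."},
]
rendered = tokenizer.apply_chat_template(messages, tokenize=False)
print(rendered)
# Expected per the template: "<|user|>...<|endofuser|><|assistant|>"
```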
vocab.json ADDED
The diff for this file is too large to render. See raw diff