htagourti redmoe-ai-v1 committed on
Commit bc1682b · verified · 0 Parent(s):

Duplicate from rednote-hilab/dots.ocr


Co-authored-by: redmoe-ai-v1 <redmoe-ai-v1@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
NOTICE ADDED
The diff for this file is too large to render. See raw diff
 
README.md ADDED
@@ -0,0 +1,1234 @@
1
+ ---
2
+ license: mit
3
+ library_name: dots_ocr
4
+ pipeline_tag: image-text-to-text
5
+ tags:
6
+ - image-to-text
7
+ - ocr
8
+ - document-parse
9
+ - layout
10
+ - table
11
+ - formula
12
+ language:
13
+ - en
14
+ - zh
15
+ - multilingual
16
+ ---
17
+
18
+ <div align="center">
19
+
20
+ <p align="center">
21
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/logo.png" width="300"/>
22
+ </p>
23
+
24
+ <h1 align="center">
25
+ dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
26
+ </h1>
27
+
28
+ [![Blog](https://img.shields.io/badge/Blog-View_on_GitHub-333.svg?logo=github)](https://github.com/rednote-hilab/dots.ocr/blob/master/assets/blog.md)
29
+ [![HuggingFace](https://img.shields.io/badge/HuggingFace%20Weights-black.svg?logo=HuggingFace)](https://huggingface.co/rednote-hilab/dots.ocr)
30
+
31
+
32
+ <div align="center">
33
+ <a href="https://dotsocr.xiaohongshu.com" target="_blank" rel="noopener noreferrer"><strong>🖥️ Live Demo</strong></a> |
34
+ <a href="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/wechat.jpg" target="_blank" rel="noopener noreferrer"><strong>💬 WeChat</strong></a> |
35
+ <a href="https://www.xiaohongshu.com/user/profile/683ffe42000000001d021a4c" target="_blank" rel="noopener noreferrer"><strong>📕 rednote</strong></a>
36
+ </div>
37
+
38
+ </div>
39
+
40
+
41
+
42
+ ## Introduction
43
+
44
+ **dots.ocr** is a powerful, multilingual document parser that unifies layout detection and content recognition within a single vision-language model while maintaining good reading order. Despite its compact 1.7B-parameter LLM foundation, it achieves state-of-the-art (SOTA) performance.
45
+
46
+ 1. **Powerful Performance:** **dots.ocr** achieves SOTA performance for text, tables, and reading order on [OmniDocBench](https://github.com/opendatalab/OmniDocBench), while delivering formula recognition results comparable to much larger models like Doubao-1.5 and Gemini 2.5 Pro.
47
+ 2. **Multilingual Support:** **dots.ocr** demonstrates robust parsing capabilities for low-resource languages, achieving decisive advantages across both layout detection and content recognition on our in-house multilingual document benchmark.
48
+ 3. **Unified and Simple Architecture:** By leveraging a single vision-language model, **dots.ocr** offers a significantly more streamlined architecture than conventional methods that rely on complex, multi-model pipelines. Switching between tasks is accomplished simply by altering the input prompt, proving that a VLM can achieve competitive detection results compared to traditional detection models like DocLayout-YOLO.
49
+ 4. **Efficient and Fast Performance:** Built upon a compact 1.7B LLM, **dots.ocr** provides faster inference speeds than many other high-performing models based on larger foundations.
50
+
51
+
52
+ ### Performance Comparison: dots.ocr vs. Competing Models
53
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/chart.png" border="0" />
54
+
55
+ > **Notes:**
56
+ > - The EN and ZH metrics are the end-to-end evaluation results on [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and the Multilingual metric is the end-to-end evaluation result on dots.ocr-bench.
57
+
58
+
59
+ ## News
60
+ * ```2025.07.30``` 🚀 We release [dots.ocr](https://github.com/rednote-hilab/dots.ocr), a multilingual document parsing model based on a 1.7B LLM, with SOTA performance.
61
+
62
+
63
+
64
+ ## Benchmark Results
65
+
66
+ ### 1. OmniDocBench
67
+
68
+ #### The end-to-end evaluation results of different tasks.
69
+
70
+ <table>
71
+ <thead>
72
+ <tr>
73
+ <th rowspan="2"><strong>Model<br>Type</strong></th>
74
+ <th rowspan="2"><strong>Methods</strong></th>
75
+ <th colspan="2"><strong>Overall<sup>Edit</sup>↓</strong></th>
76
+ <th colspan="2"><strong>Text<sup>Edit</sup>↓</strong></th>
77
+ <th colspan="2"><strong>Formula<sup>Edit</sup>↓</strong></th>
78
+ <th colspan="2"><strong>Table<sup>TEDS</sup>↑</strong></th>
79
+ <th colspan="2"><strong>Table<sup>Edit</sup>↓</strong></th>
80
+ <th colspan="2"><strong>Read Order<sup>Edit</sup>↓</strong></th>
81
+ </tr>
82
+ <tr>
83
+ <th><em>EN</em></th>
84
+ <th><em>ZH</em></th>
85
+ <th><em>EN</em></th>
86
+ <th><em>ZH</em></th>
87
+ <th><em>EN</em></th>
88
+ <th><em>ZH</em></th>
89
+ <th><em>EN</em></th>
90
+ <th><em>ZH</em></th>
91
+ <th><em>EN</em></th>
92
+ <th><em>ZH</em></th>
93
+ <th><em>EN</em></th>
94
+ <th><em>ZH</em></th>
95
+ </tr>
96
+ </thead>
97
+ <tbody>
98
+ <tr>
99
+ <td rowspan="8"><strong>Pipeline<br>Tools</strong></td>
100
+ <td>MinerU</td>
101
+ <td>0.150</td>
102
+ <td>0.357</td>
103
+ <td>0.061</td>
104
+ <td>0.215</td>
105
+ <td>0.278</td>
106
+ <td>0.577</td>
107
+ <td>78.6</td>
108
+ <td>62.1</td>
109
+ <td>0.180</td>
110
+ <td>0.344</td>
111
+ <td>0.079</td>
112
+ <td>0.292</td>
113
+ </tr>
114
+ <tr>
115
+ <td>Marker</td>
116
+ <td>0.336</td>
117
+ <td>0.556</td>
118
+ <td>0.080</td>
119
+ <td>0.315</td>
120
+ <td>0.530</td>
121
+ <td>0.883</td>
122
+ <td>67.6</td>
123
+ <td>49.2</td>
124
+ <td>0.619</td>
125
+ <td>0.685</td>
126
+ <td>0.114</td>
127
+ <td>0.340</td>
128
+ </tr>
129
+ <tr>
130
+ <td>Mathpix</td>
131
+ <td>0.191</td>
132
+ <td>0.365</td>
133
+ <td>0.105</td>
134
+ <td>0.384</td>
135
+ <td>0.306</td>
136
+ <td>0.454</td>
137
+ <td>77.0</td>
138
+ <td>67.1</td>
139
+ <td>0.243</td>
140
+ <td>0.320</td>
141
+ <td>0.108</td>
142
+ <td>0.304</td>
143
+ </tr>
144
+ <tr>
145
+ <td>Docling</td>
146
+ <td>0.589</td>
147
+ <td>0.909</td>
148
+ <td>0.416</td>
149
+ <td>0.987</td>
150
+ <td>0.999</td>
151
+ <td>1</td>
152
+ <td>61.3</td>
153
+ <td>25.0</td>
154
+ <td>0.627</td>
155
+ <td>0.810</td>
156
+ <td>0.313</td>
157
+ <td>0.837</td>
158
+ </tr>
159
+ <tr>
160
+ <td>Pix2Text</td>
161
+ <td>0.320</td>
162
+ <td>0.528</td>
163
+ <td>0.138</td>
164
+ <td>0.356</td>
165
+ <td>0.276</td>
166
+ <td>0.611</td>
167
+ <td>73.6</td>
168
+ <td>66.2</td>
169
+ <td>0.584</td>
170
+ <td>0.645</td>
171
+ <td>0.281</td>
172
+ <td>0.499</td>
173
+ </tr>
174
+ <tr>
175
+ <td>Unstructured</td>
176
+ <td>0.586</td>
177
+ <td>0.716</td>
178
+ <td>0.198</td>
179
+ <td>0.481</td>
180
+ <td>0.999</td>
181
+ <td>1</td>
182
+ <td>0</td>
183
+ <td>0.06</td>
184
+ <td>1</td>
185
+ <td>0.998</td>
186
+ <td>0.145</td>
187
+ <td>0.387</td>
188
+ </tr>
189
+ <tr>
190
+ <td>OpenParse</td>
191
+ <td>0.646</td>
192
+ <td>0.814</td>
193
+ <td>0.681</td>
194
+ <td>0.974</td>
195
+ <td>0.996</td>
196
+ <td>1</td>
197
+ <td>64.8</td>
198
+ <td>27.5</td>
199
+ <td>0.284</td>
200
+ <td>0.639</td>
201
+ <td>0.595</td>
202
+ <td>0.641</td>
203
+ </tr>
204
+ <tr>
205
+ <td>PPStruct-V3</td>
206
+ <td>0.145</td>
207
+ <td>0.206</td>
208
+ <td>0.058</td>
209
+ <td>0.088</td>
210
+ <td>0.295</td>
211
+ <td>0.535</td>
212
+ <td>-</td>
213
+ <td>-</td>
214
+ <td>0.159</td>
215
+ <td>0.109</td>
216
+ <td>0.069</td>
217
+ <td>0.091</td>
218
+ </tr>
219
+ <tr>
220
+ <td rowspan="9"><strong>Expert<br>VLMs</strong></td>
221
+ <td>GOT-OCR</td>
222
+ <td>0.287</td>
223
+ <td>0.411</td>
224
+ <td>0.189</td>
225
+ <td>0.315</td>
226
+ <td>0.360</td>
227
+ <td>0.528</td>
228
+ <td>53.2</td>
229
+ <td>47.2</td>
230
+ <td>0.459</td>
231
+ <td>0.520</td>
232
+ <td>0.141</td>
233
+ <td>0.280</td>
234
+ </tr>
235
+ <tr>
236
+ <td>Nougat</td>
237
+ <td>0.452</td>
238
+ <td>0.973</td>
239
+ <td>0.365</td>
240
+ <td>0.998</td>
241
+ <td>0.488</td>
242
+ <td>0.941</td>
243
+ <td>39.9</td>
244
+ <td>0</td>
245
+ <td>0.572</td>
246
+ <td>1.000</td>
247
+ <td>0.382</td>
248
+ <td>0.954</td>
249
+ </tr>
250
+ <tr>
251
+ <td>Mistral OCR</td>
252
+ <td>0.268</td>
253
+ <td>0.439</td>
254
+ <td>0.072</td>
255
+ <td>0.325</td>
256
+ <td>0.318</td>
257
+ <td>0.495</td>
258
+ <td>75.8</td>
259
+ <td>63.6</td>
260
+ <td>0.600</td>
261
+ <td>0.650</td>
262
+ <td>0.083</td>
263
+ <td>0.284</td>
264
+ </tr>
265
+ <tr>
266
+ <td>OLMOCR-sglang</td>
267
+ <td>0.326</td>
268
+ <td>0.469</td>
269
+ <td>0.097</td>
270
+ <td>0.293</td>
271
+ <td>0.455</td>
272
+ <td>0.655</td>
273
+ <td>68.1</td>
274
+ <td>61.3</td>
275
+ <td>0.608</td>
276
+ <td>0.652</td>
277
+ <td>0.145</td>
278
+ <td>0.277</td>
279
+ </tr>
280
+ <tr>
281
+ <td>SmolDocling-256M</td>
282
+ <td>0.493</td>
283
+ <td>0.816</td>
284
+ <td>0.262</td>
285
+ <td>0.838</td>
286
+ <td>0.753</td>
287
+ <td>0.997</td>
288
+ <td>44.9</td>
289
+ <td>16.5</td>
290
+ <td>0.729</td>
291
+ <td>0.907</td>
292
+ <td>0.227</td>
293
+ <td>0.522</td>
294
+ </tr>
295
+ <tr>
296
+ <td>Dolphin</td>
297
+ <td>0.206</td>
298
+ <td>0.306</td>
299
+ <td>0.107</td>
300
+ <td>0.197</td>
301
+ <td>0.447</td>
302
+ <td>0.580</td>
303
+ <td>77.3</td>
304
+ <td>67.2</td>
305
+ <td>0.180</td>
306
+ <td>0.285</td>
307
+ <td>0.091</td>
308
+ <td>0.162</td>
309
+ </tr>
310
+ <tr>
311
+ <td>MinerU 2</td>
312
+ <td>0.139</td>
313
+ <td>0.240</td>
314
+ <td>0.047</td>
315
+ <td>0.109</td>
316
+ <td>0.297</td>
317
+ <td>0.536</td>
318
+ <td>82.5</td>
319
+ <td>79.0</td>
320
+ <td>0.141</td>
321
+ <td>0.195</td>
322
+ <td>0.069</td>
323
+ <td>0.118</td>
324
+ </tr>
325
+ <tr>
326
+ <td>OCRFlux</td>
327
+ <td>0.195</td>
328
+ <td>0.281</td>
329
+ <td>0.064</td>
330
+ <td>0.183</td>
331
+ <td>0.379</td>
332
+ <td>0.613</td>
333
+ <td>71.6</td>
334
+ <td>81.3</td>
335
+ <td>0.253</td>
336
+ <td>0.139</td>
337
+ <td>0.086</td>
338
+ <td>0.187</td>
339
+ </tr>
340
+ <tr>
341
+ <td>MonkeyOCR-pro-3B</td>
342
+ <td>0.138</td>
343
+ <td>0.206</td>
344
+ <td>0.067</td>
345
+ <td>0.107</td>
346
+ <td><strong>0.246</strong></td>
347
+ <td>0.421</td>
348
+ <td>81.5</td>
349
+ <td>87.5</td>
350
+ <td>0.139</td>
351
+ <td>0.111</td>
352
+ <td>0.100</td>
353
+ <td>0.185</td>
354
+ </tr>
355
+ <tr>
356
+
357
+ <td rowspan="5"><strong>General<br>VLMs</strong></td>
358
+ <td>GPT4o</td>
359
+ <td>0.233</td>
360
+ <td>0.399</td>
361
+ <td>0.144</td>
362
+ <td>0.409</td>
363
+ <td>0.425</td>
364
+ <td>0.606</td>
365
+ <td>72.0</td>
366
+ <td>62.9</td>
367
+ <td>0.234</td>
368
+ <td>0.329</td>
369
+ <td>0.128</td>
370
+ <td>0.251</td>
371
+ </tr>
372
+ <tr>
373
+ <td>Qwen2-VL-72B</td>
374
+ <td>0.252</td>
375
+ <td>0.327</td>
376
+ <td>0.096</td>
377
+ <td>0.218</td>
378
+ <td>0.404</td>
379
+ <td>0.487</td>
380
+ <td>76.8</td>
381
+ <td>76.4</td>
382
+ <td>0.387</td>
383
+ <td>0.408</td>
384
+ <td>0.119</td>
385
+ <td>0.193</td>
386
+ </tr>
387
+ <tr>
388
+ <td>Qwen2.5-VL-72B</td>
389
+ <td>0.214</td>
390
+ <td>0.261</td>
391
+ <td>0.092</td>
392
+ <td>0.18</td>
393
+ <td>0.315</td>
394
+ <td>0.434</td>
395
+ <td>82.9</td>
396
+ <td>83.9</td>
397
+ <td>0.341</td>
398
+ <td>0.262</td>
399
+ <td>0.106</td>
400
+ <td>0.168</td>
401
+ </tr>
402
+ <tr>
403
+ <td>Gemini2.5-Pro</td>
404
+ <td>0.148</td>
405
+ <td>0.212</td>
406
+ <td>0.055</td>
407
+ <td>0.168</td>
408
+ <td>0.356</td>
409
+ <td>0.439</td>
410
+ <td>85.8</td>
411
+ <td>86.4</td>
412
+ <td>0.13</td>
413
+ <td>0.119</td>
414
+ <td>0.049</td>
415
+ <td>0.121</td>
416
+ </tr>
417
+ <tr>
418
+ <td>doubao-1-5-thinking-vision-pro-250428</td>
419
+ <td>0.140</td>
420
+ <td>0.162</td>
421
+ <td>0.043</td>
422
+ <td>0.085</td>
423
+ <td>0.295</td>
424
+ <td><strong>0.384</strong></td>
425
+ <td>83.3</td>
426
+ <td><strong>89.3</strong></td>
427
+ <td>0.165</td>
428
+ <td><strong>0.085</strong></td>
429
+ <td>0.058</td>
430
+ <td>0.094</td>
431
+ </tr>
432
+ <tr>
433
+ <td rowspan="1"><strong>Expert VLMs</strong></td>
434
+ <td><strong>dots.ocr</strong></td>
435
+ <td><strong>0.125</strong></td>
436
+ <td><strong>0.160</strong></td>
437
+ <td><strong>0.032</strong></td>
438
+ <td><strong>0.066</strong></td>
439
+ <td>0.329</td>
440
+ <td>0.416</td>
441
+ <td><strong>88.6</strong></td>
442
+ <td>89.0</td>
443
+ <td><strong>0.099</strong></td>
444
+ <td>0.092</td>
445
+ <td><strong>0.040</strong></td>
446
+ <td><strong>0.067</strong></td>
447
+ </tr>
448
+ <tr>
449
+ </tbody>
450
+ </table>
451
+
452
+
453
+ #### The end-to-end text recognition performance across 9 PDF page types.
454
+
455
+ <table>
456
+ <thead>
457
+ <tr>
458
+ <th><strong>Model<br>Type</strong></th>
459
+ <th><strong>Models</strong></th>
460
+ <th><strong>Book</strong></th>
461
+ <th><strong>Slides</strong></th>
462
+ <th><strong>Financial<br>Report</strong></th>
463
+ <th><strong>Textbook</strong></th>
464
+ <th><strong>Exam<br>Paper</strong></th>
465
+ <th><strong>Magazine</strong></th>
466
+ <th><strong>Academic<br>Papers</strong></th>
467
+ <th><strong>Notes</strong></th>
468
+ <th><strong>Newspaper</strong></th>
469
+ <th><strong>Overall</strong></th>
470
+ </tr>
471
+ </thead>
472
+ <tbody>
473
+ <tr>
474
+ <td rowspan="3"><strong>Pipeline<br>Tools</strong></td>
475
+ <td>MinerU</td>
476
+ <td>0.055</td>
477
+ <td>0.124</td>
478
+ <td><u>0.033</u></td>
479
+ <td>0.102</td>
480
+ <td>0.159</td>
481
+ <td><strong>0.072</strong></td>
482
+ <td><u>0.025</u></td>
483
+ <td>0.984</td>
484
+ <td>0.171</td>
485
+ <td>0.206</td>
486
+ </tr>
487
+ <tr>
488
+ <td>Marker</td>
489
+ <td>0.074</td>
490
+ <td>0.340</td>
491
+ <td>0.089</td>
492
+ <td>0.319</td>
493
+ <td>0.452</td>
494
+ <td>0.153</td>
495
+ <td>0.059</td>
496
+ <td>0.651</td>
497
+ <td>0.192</td>
498
+ <td>0.274</td>
499
+ </tr>
500
+ <tr>
501
+ <td>Mathpix</td>
502
+ <td>0.131</td>
503
+ <td>0.220</td>
504
+ <td>0.202</td>
505
+ <td>0.216</td>
506
+ <td>0.278</td>
507
+ <td>0.147</td>
508
+ <td>0.091</td>
509
+ <td>0.634</td>
510
+ <td>0.690</td>
511
+ <td>0.300</td>
512
+ </tr>
513
+ <tr>
514
+ <td rowspan="5"><strong>Expert<br>VLMs</strong></td>
515
+ <td>GOT-OCR</td>
516
+ <td>0.111</td>
517
+ <td>0.222</td>
518
+ <td>0.067</td>
519
+ <td>0.132</td>
520
+ <td>0.204</td>
521
+ <td>0.198</td>
522
+ <td>0.179</td>
523
+ <td>0.388</td>
524
+ <td>0.771</td>
525
+ <td>0.267</td>
526
+ </tr>
527
+ <tr>
528
+ <td>Nougat</td>
529
+ <td>0.734</td>
530
+ <td>0.958</td>
531
+ <td>1.000</td>
532
+ <td>0.820</td>
533
+ <td>0.930</td>
534
+ <td>0.830</td>
535
+ <td>0.214</td>
536
+ <td>0.991</td>
537
+ <td>0.871</td>
538
+ <td>0.806</td>
539
+ </tr>
540
+ <tr>
541
+ <td>Dolphin</td>
542
+ <td>0.091</td>
543
+ <td>0.131</td>
544
+ <td>0.057</td>
545
+ <td>0.146</td>
546
+ <td>0.231</td>
547
+ <td>0.121</td>
548
+ <td>0.074</td>
549
+ <td>0.363</td>
550
+ <td>0.307</td>
551
+ <td>0.177</td>
552
+ </tr>
553
+ <tr>
554
+ <td>OCRFlux</td>
555
+ <td>0.068</td>
556
+ <td>0.125</td>
557
+ <td>0.092</td>
558
+ <td>0.102</td>
559
+ <td>0.119</td>
560
+ <td>0.083</td>
561
+ <td>0.047</td>
562
+ <td>0.223</td>
563
+ <td>0.536</td>
564
+ <td>0.149</td>
565
+ </tr>
566
+ <tr>
567
+ <td>MonkeyOCR-pro-3B</td>
568
+ <td>0.084</td>
569
+ <td>0.129</td>
570
+ <td>0.060</td>
571
+ <td>0.090</td>
572
+ <td>0.107</td>
573
+ <td>0.073</td>
574
+ <td>0.050</td>
575
+ <td>0.171</td>
576
+ <td>0.107</td>
577
+ <td>0.100</td>
578
+ </tr>
579
+ <tr>
580
+ <td rowspan="4"><strong>General<br>VLMs</strong></td>
581
+ <td>GPT4o</td>
582
+ <td>0.157</td>
583
+ <td>0.163</td>
584
+ <td>0.348</td>
585
+ <td>0.187</td>
586
+ <td>0.281</td>
587
+ <td>0.173</td>
588
+ <td>0.146</td>
589
+ <td>0.607</td>
590
+ <td>0.751</td>
591
+ <td>0.316</td>
592
+ </tr>
593
+ <tr>
594
+ <td>Qwen2.5-VL-7B</td>
595
+ <td>0.148</td>
596
+ <td>0.053</td>
597
+ <td>0.111</td>
598
+ <td>0.137</td>
599
+ <td>0.189</td>
600
+ <td>0.117</td>
601
+ <td>0.134</td>
602
+ <td>0.204</td>
603
+ <td>0.706</td>
604
+ <td>0.205</td>
605
+ </tr>
606
+ <tr>
607
+ <td>InternVL3-8B</td>
608
+ <td>0.163</td>
609
+ <td>0.056</td>
610
+ <td>0.107</td>
611
+ <td>0.109</td>
612
+ <td>0.129</td>
613
+ <td>0.100</td>
614
+ <td>0.159</td>
615
+ <td>0.150</td>
616
+ <td>0.681</td>
617
+ <td>0.188</td>
618
+ </tr>
619
+ <tr>
620
+ <td>doubao-1-5-thinking-vision-pro-250428</td>
621
+ <td>0.048</td>
622
+ <td>0.048</td>
623
+ <td>0.024</td>
624
+ <td><strong>0.062</strong></td>
625
+ <td>0.085</td>
626
+ <td>0.051</td>
627
+ <td>0.039</td>
628
+ <td><strong>0.096</strong></td>
629
+ <td>0.181</td>
630
+ <td>0.073</td>
631
+ </tr>
632
+ <tr>
633
+ <td rowspan="1"><strong>Expert VLMs</strong></td>
634
+ <td><strong>dots.ocr</strong></td>
635
+ <td><strong>0.031</strong></td>
636
+ <td><strong>0.047</strong></td>
637
+ <td><strong>0.011</strong></td>
638
+ <td>0.082</td>
639
+ <td><strong>0.079</strong></td>
640
+ <td><strong>0.028</strong></td>
641
+ <td><strong>0.029</strong></td>
642
+ <td>0.109</td>
643
+ <td><strong>0.056</strong></td>
644
+ <td><strong>0.055</strong></td>
645
+ </tr>
646
+
647
+ </tbody>
648
+ </table>
649
+
650
+ > **Notes:**
651
+ > - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR), [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and our own internal evaluations.
652
+ > - We remove the Page-header and Page-footer cells from the result markdown.
653
+ > - We use the tikz_preprocess pipeline to upsample the images to 200 DPI.
654
+
655
+
656
+ ### 2. **dots.ocr-bench**
657
+
658
+ This is an in-house benchmark containing 1,493 PDF images across 100 languages.
659
+
660
+ #### The end-to-end evaluation results of different tasks.
661
+
662
+ <table>
663
+ <thead>
664
+ <tr>
665
+ <th rowspan="1"><strong>Methods</strong></th>
666
+ <th colspan="1"><strong>Overall<sup>Edit</sup>↓</strong></th>
667
+ <th colspan="1"><strong>Text<sup>Edit</sup>↓</strong></th>
668
+ <th colspan="1"><strong>Formula<sup>Edit</sup>↓</strong></th>
669
+ <th colspan="1"><strong>Table<sup>TEDS</sup>↑</strong></th>
670
+ <th colspan="1"><strong>Table<sup>Edit</sup>↓</strong></th>
671
+ <th colspan="1"><strong>Read Order<sup>Edit</sup>↓</strong></th>
672
+ </tr>
673
+ </thead>
674
+ <tbody>
675
+ <td>MonkeyOCR-3B</td>
676
+ <td>0.483</td>
677
+ <td>0.445</td>
678
+ <td>0.627</td>
679
+ <td>50.93</td>
680
+ <td>0.452</td>
681
+ <td>0.409</td>
682
+ </tr>
683
+ <tr>
684
+ <td>doubao-1-5-thinking-vision-pro-250428</td>
685
+ <td>0.291</td>
686
+ <td>0.226</td>
687
+ <td>0.440</td>
688
+ <td>71.2</td>
689
+ <td>0.260</td>
690
+ <td>0.238</td>
691
+ </tr>
692
+ <tr>
693
+ <td>doubao-1-6</td>
694
+ <td>0.299</td>
695
+ <td>0.270</td>
696
+ <td>0.417</td>
697
+ <td>71.0</td>
698
+ <td>0.258</td>
699
+ <td>0.253</td>
700
+ </tr>
701
+ <tr>
702
+ <td>Gemini2.5-Pro</td>
703
+ <td>0.251</td>
704
+ <td>0.163</td>
705
+ <td>0.402</td>
706
+ <td>77.1</td>
707
+ <td>0.236</td>
708
+ <td>0.202</td>
709
+ </tr>
710
+ <tr>
711
+ <td><strong>dots.ocr</strong> </td>
712
+ <td><strong>0.177</strong></td>
713
+ <td><strong>0.075</strong></td>
714
+ <td><strong>0.297</strong></td>
715
+ <td><strong>79.2</strong></td>
716
+ <td><strong>0.186</strong></td>
717
+ <td><strong>0.152</strong></td>
718
+ </tr>
719
+
720
+ </tbody>
721
+ </table>
722
+
723
+ > **Notes:**
724
+ > - We use the same metric calculation pipeline as [OmniDocBench](https://github.com/opendatalab/OmniDocBench).
725
+ > - We remove the Page-header and Page-footer cells from the result markdown.
726
+
727
+ #### Layout Detection
728
+
729
+ <table>
730
+ <thead>
731
+ <tr>
732
+ <th rowspan="2"><strong>Method</strong></th>
733
+ <th colspan="5" style="text-align: center;"><strong>F1@IoU=.50:.05:.95↑</strong></th>
734
+ <th colspan="5" style="text-align: center;"><strong>F1@IoU=.50↑</strong></th>
735
+ </tr>
736
+ <tr>
737
+ <th>Overall</th>
738
+ <th>Text</th>
739
+ <th>Formula</th>
740
+ <th>Table</th>
741
+ <th>Picture</th>
742
+ <th>Overall</th>
743
+ <th>Text</th>
744
+ <th>Formula</th>
745
+ <th>Table</th>
746
+ <th>Picture</th>
747
+ </tr>
748
+ </thead>
749
+
750
+ <tbody>
751
+ <td>DocLayout-YOLO-DocStructBench</td>
752
+ <td>0.733</td>
753
+ <td>0.694</td>
754
+ <td>0.480</td>
755
+ <td>0.803</td>
756
+ <td>0.619</td>
757
+ <td>0.806</td>
758
+ <td>0.779</td>
759
+ <td>0.620</td>
760
+ <td>0.858</td>
761
+ <td>0.678</td>
762
+ </tr>
763
+
764
+ <tr>
765
+ <td>dots.ocr-parse all</td>
766
+ <td>0.831</td>
767
+ <td>0.801</td>
768
+ <td>0.654</td>
769
+ <td>0.838</td>
770
+ <td>0.748</td>
771
+ <td>0.922</td>
772
+ <td>0.909</td>
773
+ <td>0.770</td>
774
+ <td>0.888</td>
775
+ <td>0.831</td>
776
+ </tr>
777
+
778
+ <tr>
779
+ <td> <strong>dots.ocr-detection only</strong> </td>
780
+ <td><strong>0.845</strong></td>
781
+ <td><strong>0.816</strong></td>
782
+ <td><strong>0.716</strong></td>
783
+ <td><strong>0.875</strong></td>
784
+ <td><strong>0.765</strong></td>
785
+ <td><strong>0.930</strong></td>
786
+ <td><strong>0.917</strong></td>
787
+ <td><strong>0.832</strong></td>
788
+ <td><strong>0.918</strong></td>
789
+ <td><strong>0.843</strong></td>
790
+ </tr>
791
+
792
+ </tbody>
793
+ </table>
794
+
795
+ > **Notes:**
796
+ > - We use prompt_layout_all_en for **parse all** and prompt_layout_only_en for **detection only**; please refer to [prompts](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py) for details.
797
+
798
+
799
+ ### 3. olmOCR-bench
800
+
801
+ <table>
802
+ <thead>
803
+ <tr>
804
+ <th>Model</th>
805
+ <th>ArXiv</th>
806
+ <th>Old Scans<br>Math</th>
807
+ <th>Tables</th>
808
+ <th>Old Scans</th>
809
+ <th>Headers and<br>Footers</th>
810
+ <th>Multi<br>column</th>
811
+ <th>Long Tiny<br>Text</th>
812
+ <th>Base</th>
813
+ <th>Overall</th>
814
+ </tr>
815
+ </thead>
816
+ <tbody>
817
+ <tr>
818
+ <td>GOT OCR</td>
819
+ <td>52.7</td>
820
+ <td>52.0</td>
821
+ <td>0.2</td>
822
+ <td>22.1</td>
823
+ <td>93.6</td>
824
+ <td>42.0</td>
825
+ <td>29.9</td>
826
+ <td>94.0</td>
827
+ <td>48.3 ± 1.1</td>
828
+ </tr>
829
+ <tr>
830
+ <td>Marker</td>
831
+ <td>76.0</td>
832
+ <td>57.9</td>
833
+ <td>57.6</td>
834
+ <td>27.8</td>
835
+ <td>84.9</td>
836
+ <td>72.9</td>
837
+ <td>84.6</td>
838
+ <td>99.1</td>
839
+ <td>70.1 ± 1.1</td>
840
+ </tr>
841
+ <tr>
842
+ <td>MinerU</td>
843
+ <td>75.4</td>
844
+ <td>47.4</td>
845
+ <td>60.9</td>
846
+ <td>17.3</td>
847
+ <td><strong>96.6</strong></td>
848
+ <td>59.0</td>
849
+ <td>39.1</td>
850
+ <td>96.6</td>
851
+ <td>61.5 ± 1.1</td>
852
+ </tr>
853
+ <tr>
854
+ <td>Mistral OCR</td>
855
+ <td>77.2</td>
856
+ <td>67.5</td>
857
+ <td>60.6</td>
858
+ <td>29.3</td>
859
+ <td>93.6</td>
860
+ <td>71.3</td>
861
+ <td>77.1</td>
862
+ <td>99.4</td>
863
+ <td>72.0 ± 1.1</td>
864
+ </tr>
865
+ <tr>
866
+ <td>Nanonets OCR</td>
867
+ <td>67.0</td>
868
+ <td>68.6</td>
869
+ <td>77.7</td>
870
+ <td>39.5</td>
871
+ <td>40.7</td>
872
+ <td>69.9</td>
873
+ <td>53.4</td>
874
+ <td>99.3</td>
875
+ <td>64.5 ± 1.1</td>
876
+ </tr>
877
+ <tr>
878
+ <td>GPT-4o<br>(No Anchor)</td>
879
+ <td>51.5</td>
880
+ <td><strong>75.5</strong></td>
881
+ <td>69.1</td>
882
+ <td>40.9</td>
883
+ <td>94.2</td>
884
+ <td>68.9</td>
885
+ <td>54.1</td>
886
+ <td>96.7</td>
887
+ <td>68.9 ± 1.1</td>
888
+ </tr>
889
+ <tr>
890
+ <td>GPT-4o<br>(Anchored)</td>
891
+ <td>53.5</td>
892
+ <td>74.5</td>
893
+ <td>70.0</td>
894
+ <td>40.7</td>
895
+ <td>93.8</td>
896
+ <td>69.3</td>
897
+ <td>60.6</td>
898
+ <td>96.8</td>
899
+ <td>69.9 ± 1.1</td>
900
+ </tr>
901
+ <tr>
902
+ <td>Gemini Flash 2<br>(No Anchor)</td>
903
+ <td>32.1</td>
904
+ <td>56.3</td>
905
+ <td>61.4</td>
906
+ <td>27.8</td>
907
+ <td>48.0</td>
908
+ <td>58.7</td>
909
+ <td><strong>84.4</strong></td>
910
+ <td>94.0</td>
911
+ <td>57.8 ± 1.1</td>
912
+ </tr>
913
+ <tr>
914
+ <td>Gemini Flash 2<br>(Anchored)</td>
915
+ <td>54.5</td>
916
+ <td>56.1</td>
917
+ <td>72.1</td>
918
+ <td>34.2</td>
919
+ <td>64.7</td>
920
+ <td>61.5</td>
921
+ <td>71.5</td>
922
+ <td>95.6</td>
923
+ <td>63.8 ± 1.2</td>
924
+ </tr>
925
+ <tr>
926
+ <td>Qwen 2 VL<br>(No Anchor)</td>
927
+ <td>19.7</td>
928
+ <td>31.7</td>
929
+ <td>24.2</td>
930
+ <td>17.1</td>
931
+ <td>88.9</td>
932
+ <td>8.3</td>
933
+ <td>6.8</td>
934
+ <td>55.5</td>
935
+ <td>31.5 ± 0.9</td>
936
+ </tr>
937
+ <tr>
938
+ <td>Qwen 2.5 VL<br>(No Anchor)</td>
939
+ <td>63.1</td>
940
+ <td>65.7</td>
941
+ <td>67.3</td>
942
+ <td>38.6</td>
943
+ <td>73.6</td>
944
+ <td>68.3</td>
945
+ <td>49.1</td>
946
+ <td>98.3</td>
947
+ <td>65.5 ± 1.2</td>
948
+ </tr>
949
+ <tr>
950
+ <td>olmOCR v0.1.75<br>(No Anchor)</td>
951
+ <td>71.5</td>
952
+ <td>71.4</td>
953
+ <td>71.4</td>
954
+ <td><strong>42.8</strong></td>
955
+ <td>94.1</td>
956
+ <td>77.7</td>
957
+ <td>71.0</td>
958
+ <td>97.8</td>
959
+ <td>74.7 ± 1.1</td>
960
+ </tr>
961
+ <tr>
962
+ <td>olmOCR v0.1.75<br>(Anchored)</td>
963
+ <td>74.9</td>
964
+ <td>71.2</td>
965
+ <td>71.0</td>
966
+ <td>42.2</td>
967
+ <td>94.5</td>
968
+ <td>78.3</td>
969
+ <td>73.3</td>
970
+ <td>98.3</td>
971
+ <td>75.5 ± 1.0</td>
972
+ </tr>
973
+ <tr>
974
+ <td>MonkeyOCR-pro-3B</td>
975
+ <td><strong>83.8</strong></td>
976
+ <td>68.8</td>
977
+ <td>74.6</td>
978
+ <td>36.1</td>
979
+ <td>91.2</td>
980
+ <td>76.6</td>
981
+ <td>80.1</td>
982
+ <td>95.3</td>
983
+ <td>75.8 ± 1.0</td>
984
+ </tr>
985
+ <tr>
986
+ <td><strong>dots.ocr</strong></td>
987
+ <td>82.1</td>
988
+ <td>64.2</td>
989
+ <td><strong>88.3</strong></td>
990
+ <td>40.9</td>
991
+ <td>94.1</td>
992
+ <td><strong>82.4</strong></td>
993
+ <td>81.2</td>
994
+ <td><strong>99.5</strong></td>
995
+ <td><strong>79.1 ± 1.0</strong></td>
996
+ </tr>
997
+ </tbody>
998
+ </table>
999
+
1000
+
1001
+ > **Note:**
1002
+ > - The metrics are from [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
1003
+ > [olmOCR](https://github.com/allenai/olmocr), and our own internal evaluations.
1004
+ > - We remove the Page-header and Page-footer cells from the result markdown.
1005
+
1006
+
1007
+
1008
+ # Quick Start
1009
+ ## 1. Installation
1010
+ ### Install dots.ocr
1011
+ ```shell
1012
+ conda create -n dots_ocr python=3.12
1013
+ conda activate dots_ocr
1014
+
1015
+ git clone https://github.com/rednote-hilab/dots.ocr.git
1016
+ cd dots.ocr
1017
+
1018
+ # Install pytorch, see https://pytorch.org/get-started/previous-versions/ for your cuda version
1019
+ pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
1020
+ pip install -e .
1021
+ ```
1022
+
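After `pip install -e .`, a quick import check can confirm that PyTorch sees your GPU and that the package is importable. This is only an optional sanity-check sketch, not part of the official setup:

```python
# Optional sanity check (not part of the official setup): verify the CUDA build
# of PyTorch and the editable install of the dots_ocr package.
import torch
import dots_ocr  # installed above via `pip install -e .`

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("dots_ocr import OK")
```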
1023
+ If you have trouble with the installation, try our [Docker Image](https://hub.docker.com/r/rednotehilab/dots.ocr) for an easier setup, and follow these steps:
1024
+ ```shell
1025
+ git clone https://github.com/rednote-hilab/dots.ocr.git
1026
+ cd dots.ocr
1027
+ pip install -e .
1028
+ ```
1029
+
1030
+
1031
+ ### Download Model Weights
1032
+ > 💡**Note:** Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path. This is a temporary workaround pending our integration with Transformers.
1033
+ ```shell
1034
+ python3 tools/download_model.py
1035
+ ```
1036
+
1037
+
1038
+ ## 2. Deployment
1039
+ ### vLLM inference
1040
+ We highly recommend using vLLM for deployment and inference. All of our evaluation results are based on vLLM version 0.9.1.
1041
+ The [Docker Image](https://hub.docker.com/r/rednotehilab/dots.ocr) is based on the official vllm image. You can also follow [Dockerfile](https://github.com/rednote-hilab/dots.ocr/blob/master/docker/Dockerfile) to build the deployment environment by yourself.
1042
+
1043
+ ```shell
1044
+ # You need to register model to vllm at first
1045
+ python3 tools/download_model.py
1046
+ export hf_model_path=./weights/DotsOCR # Path to your downloaded model weights. Please use a directory name without periods (e.g., `DotsOCR` instead of `dots.ocr`) for the model save path; this is a temporary workaround pending our integration with Transformers.
1047
+ export PYTHONPATH=$(dirname "$hf_model_path"):$PYTHONPATH
1048
+ sed -i '/^from vllm\.entrypoints\.cli\.main import main$/a\
1049
+ from DotsOCR import modeling_dots_ocr_vllm' `which vllm` # If you downloaded the model weights yourself, replace `DotsOCR` with the directory name you saved the model under, and remember to use a name without periods (e.g., `DotsOCR` instead of `dots.ocr`)
1050
+
1051
+ # launch vllm server
1052
+ CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --chat-template-content-format string --served-model-name model --trust-remote-code
1053
+
1054
+ # If you get a ModuleNotFoundError: No module named 'DotsOCR', please check the note above on the saved model directory name.
1055
+
1056
+ # vllm api demo
1057
+ python3 ./demo/demo_vllm.py --prompt_mode prompt_layout_all_en
1058
+ ```
1059
+
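Once the server is running, it exposes vLLM's OpenAI-compatible chat API, which `demo/demo_vllm.py` wraps. The sketch below is only an illustration: the port (vLLM's default 8000), the base64 image payload, the token budget, and the abbreviated prompt are assumptions, so adapt them to your deployment and take the full prompt text from `dots_ocr/utils/prompts.py`.

```python
# Minimal OpenAI-compatible client sketch for the vLLM server started above.
# Assumptions: the server listens on localhost:8000 and was launched with
# `--served-model-name model`; adjust both to match your deployment.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("demo/demo_image1.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Abbreviated here; use the full prompt_layout_all_en from dots_ocr/utils/prompts.py.
prompt = "Please output the layout information from the PDF image, ..."

response = client.chat.completions.create(
    model="model",  # matches --served-model-name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": prompt},
        ],
    }],
    max_tokens=8192,   # raise for dense pages
    temperature=0.0,
)
print(response.choices[0].message.content)
```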
1060
+ ### Hugging Face inference
1061
+ ```shell
1062
+ python3 demo/demo_hf.py
1063
+ ```
1064
+
1065
+ <details>
1066
+ <summary><b>Hugging Face inference details</b></summary>
1067
+
1068
+ ```python
1069
+ import torch
1070
+ from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer
1071
+ from qwen_vl_utils import process_vision_info
1072
+ from dots_ocr.utils import dict_promptmode_to_prompt
1073
+
1074
+ model_path = "./weights/DotsOCR"
1075
+ model = AutoModelForCausalLM.from_pretrained(
1076
+ model_path,
1077
+ attn_implementation="flash_attention_2",
1078
+ torch_dtype=torch.bfloat16,
1079
+ device_map="auto",
1080
+ trust_remote_code=True
1081
+ )
1082
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
1083
+
1084
+ image_path = "demo/demo_image1.jpg"
1085
+ prompt = """Please output the layout information from the PDF image, including each layout element's bbox, its category, and the corresponding text content within the bbox.
1086
+
1087
+ 1. Bbox format: [x1, y1, x2, y2]
1088
+
1089
+ 2. Layout Categories: The possible categories are ['Caption', 'Footnote', 'Formula', 'List-item', 'Page-footer', 'Page-header', 'Picture', 'Section-header', 'Table', 'Text', 'Title'].
1090
+
1091
+ 3. Text Extraction & Formatting Rules:
1092
+ - Picture: For the 'Picture' category, the text field should be omitted.
1093
+ - Formula: Format its text as LaTeX.
1094
+ - Table: Format its text as HTML.
1095
+ - All Others (Text, Title, etc.): Format their text as Markdown.
1096
+
1097
+ 4. Constraints:
1098
+ - The output text must be the original text from the image, with no translation.
1099
+ - All layout elements must be sorted according to human reading order.
1100
+
1101
+ 5. Final Output: The entire output must be a single JSON object.
1102
+ """
1103
+
1104
+ messages = [
1105
+ {
1106
+ "role": "user",
1107
+ "content": [
1108
+ {
1109
+ "type": "image",
1110
+ "image": image_path
1111
+ },
1112
+ {"type": "text", "text": prompt}
1113
+ ]
1114
+ }
1115
+ ]
1116
+
1117
+ # Preparation for inference
1118
+ text = processor.apply_chat_template(
1119
+ messages,
1120
+ tokenize=False,
1121
+ add_generation_prompt=True
1122
+ )
1123
+ image_inputs, video_inputs = process_vision_info(messages)
1124
+ inputs = processor(
1125
+ text=[text],
1126
+ images=image_inputs,
1127
+ videos=video_inputs,
1128
+ padding=True,
1129
+ return_tensors="pt",
1130
+ )
1131
+
1132
+ inputs = inputs.to("cuda")
1133
+
1134
+ # Inference: Generation of the output
1135
+ generated_ids = model.generate(**inputs, max_new_tokens=24000)
1136
+ generated_ids_trimmed = [
1137
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
1138
+ ]
1139
+ output_text = processor.batch_decode(
1140
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
1141
+ )
1142
+ print(output_text)
1143
+
1144
+ ```
1145
+
1146
+ </details>
1147
+
1148
+ ## 3. Document Parse
1149
+ **Based on the vLLM server**, you can parse an image or a PDF file with the following commands:
1150
+ ```bash
1151
+
1152
+ # Parse all layout info, both detection and recognition
1153
+ # Parse a single image
1154
+ python3 dots_ocr/parser.py demo/demo_image1.jpg
1155
+ # Parse a single PDF
1156
+ python3 dots_ocr/parser.py demo/demo_pdf1.pdf --num_threads 64 # try a larger num_threads for PDFs with many pages
1157
+
1158
+ # Layout detection only
1159
+ python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_layout_only_en
1160
+
1161
+ # Parse text only, except Page-header and Page-footer
1162
+ python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_ocr
1163
+
1164
+ # Parse layout info by bbox
1165
+ python3 dots_ocr/parser.py demo/demo_image1.jpg --prompt prompt_grounding_ocr --bbox 163 241 1536 705
1166
+
1167
+ ```
1168
+
1169
+ <details>
1170
+ <summary><b>Output Results</b></summary>
1171
+
1172
+ 1. **Structured Layout Data** (`demo_image1.json`): A JSON file containing the detected layout elements, including their bounding boxes, categories, and extracted text.
1173
+ 2. **Processed Markdown File** (`demo_image1.md`): A Markdown file generated from the concatenated text of all detected cells.
1174
+ * An additional version, `demo_image1_nohf.md`, is also provided, which excludes page headers and footers for compatibility with benchmarks like OmniDocBench and olmOCR-bench (see the post-processing sketch below).
1175
+ 3. **Layout Visualization** (`demo_image1.jpg`): The original image with the detected layout bounding boxes drawn on it.
1176
+
1177
+ </details>
1178
+
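The structured JSON output is easy to post-process. The sketch below is an illustration (assuming each cell is a dict with `bbox`, `category`, and an optional `text` field, as specified by the layout prompt) that rebuilds a header/footer-free Markdown file in the same spirit as `demo_image1_nohf.md`:

```python
# Sketch: turn the parser's JSON output into Markdown, skipping headers/footers.
# Field names follow the layout prompt above; verify against your own output.
import json

with open("demo_image1.json", "r", encoding="utf-8") as f:
    cells = json.load(f)

skip = {"Page-header", "Page-footer"}
lines = []
for cell in cells:  # cells are already sorted in reading order by the model
    if cell.get("category") in skip:
        continue
    text = cell.get("text")  # 'Picture' cells have no text field
    if text:
        lines.append(text)

with open("demo_image1_nohf_rebuilt.md", "w", encoding="utf-8") as f:
    f.write("\n\n".join(lines))
```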
1179
+ ## 4. Demo
1180
+ You can run the demo with the following command, or try it directly at the [live demo](https://dotsocr.xiaohongshu.com/):
1181
+ ```bash
1182
+ python demo/demo_gradio.py
1183
+ ```
1184
+
1185
+ We also provide a demo for grounding OCR:
1186
+ ```bash
1187
+ python demo/demo_gradio_annotion.py
1188
+ ```
1189
+
1190
+
1191
+ ### Example for formula document
1192
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/formula1.png" alt="formula1.png" border="0" />
1193
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/formula2.png" alt="formula2.png" border="0" />
1194
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/formula3.png" alt="formula3.png" border="0" />
1195
+
1196
+ ### Example for table document
1197
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/table1.png" alt="table1.png" border="0" />
1198
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/table2.png" alt="table2.png" border="0" />
1199
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/table3.png" alt="table3.png" border="0" />
1200
+
1201
+ ### Example for multilingual document
1202
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/Tibetan.png" alt="Tibetan.png" border="0" />
1203
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/tradition_zh.png" alt="tradition_zh.png" border="0" />
1204
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/nl.png" alt="nl.png" border="0" />
1205
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/kannada.png" alt="kannada.png" border="0" />
1206
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/russian.png" alt="russian.png" border="0" />
1207
+
1208
+ ### Example for reading order
1209
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/reading_order.png" alt="reading_order.png" border="0" />
1210
+
1211
+ ### Example for grounding OCR
1212
+ <img src="https://raw.githubusercontent.com/rednote-hilab/dots.ocr/master/assets/showcase/grounding.png" alt="grounding.png" border="0" />
1213
+
1214
+
1215
+ ## Acknowledgments
1216
+ We would like to thank [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [aimv2](https://github.com/apple/ml-aim), [MonkeyOCR](https://github.com/Yuliang-Liu/MonkeyOCR),
1217
+ [OmniDocBench](https://github.com/opendatalab/OmniDocBench), and [PyMuPDF](https://github.com/pymupdf/PyMuPDF) for providing code and models.
1218
+
1219
+ We also thank [DocLayNet](https://github.com/DS4SD/DocLayNet), [M6Doc](https://github.com/HCIILAB/M6Doc), [CDLA](https://github.com/buptlihang/CDLA), [D4LA](https://github.com/AlibabaResearch/AdvancedLiterateMachinery) for providing valuable datasets.
1220
+
1221
+ ## Limitation & Future Work
1222
+
1223
+ - **Complex Document Elements:**
1224
+ - **Table & Formula**: dots.ocr is not yet perfect at extracting highly complex tables and formulas.
1225
+ - **Picture**: Pictures in documents are currently not parsed.
1226
+
1227
+ - **Parsing Failures:** The model may fail to parse under certain conditions:
1228
+ - When the character-to-pixel ratio is excessively high. Try enlarging the image or increasing the PDF parsing DPI (a setting of 200 is recommended). However, please note that the model performs optimally on images with a resolution under 11,289,600 pixels.
1229
+ - Continuous special characters, such as ellipses (`...`) and underscores (`_`), may cause the prediction output to repeat endlessly. In such scenarios, consider using alternative prompts like `prompt_layout_only_en`, `prompt_ocr`, or `prompt_grounding_ocr` ([details here](https://github.com/rednote-hilab/dots.ocr/blob/master/dots_ocr/utils/prompts.py)).
1230
+
1231
+ - **Performance Bottleneck:** Despite its 1.7B parameter LLM foundation, **dots.ocr** is not yet optimized for high-throughput processing of large PDF volumes.
1232
+
1233
+ We are committed to achieving more accurate table and formula parsing, as well as enhancing the model's OCR capabilities for broader generalization, all while aiming for **a more powerful, more efficient model**. Furthermore, we are actively considering the development of **a more general-purpose perception model** based on Vision-Language Models (VLMs), which would integrate general detection, image captioning, and OCR tasks into a unified framework. **Parsing the content of the pictures in the documents** is also a key priority for our future work.
1234
+ We believe that collaboration is the key to tackling these exciting challenges. If you are passionate about advancing the frontiers of document intelligence and are interested in contributing to these future endeavors, we would love to hear from you. Please reach out to us via email at: [yanqing4@xiaohongshu.com].
chat_template.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "chat_template": "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{%- for m in messages %}{%- if m.role == 'system' %}{{- '<|system|>' + m.content + '<|endofsystem|>\n' }}{%- elif m.role == 'user' %}{% if m.content is string %}{{- '<|user|>' + m.content + '<|endofuser|>' }}{% else %} {% for content in m.content %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|img|><|imgpad|><|endofimg|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|img|><|video_pad|><|endofimg|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}{%- endif %}{%- elif m.role == 'assistant' %}{{- '<|assistant|>' + m.content }}{%- if not loop.last %}{{- '<|endofassistant|>' }}{%- endif %}{%- endif %}{%- endfor %}{%- if messages[-1].role != 'assistant' %}{{- '<|assistant|>' }}{%- endif %}"
3
+ }
config.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "architectures": [
3
+ "DotsOCRForCausalLM"
4
+ ],
5
+ "model_type": "dots_ocr",
6
+ "auto_map": {
7
+ "AutoConfig": "configuration_dots.DotsOCRConfig",
8
+ "AutoModelForCausalLM": "modeling_dots_ocr.DotsOCRForCausalLM"
9
+ },
10
+ "attention_bias": true,
11
+ "attention_dropout": 0.0,
12
+ "hidden_act": "silu",
13
+ "hidden_size": 1536,
14
+ "initializer_range": 0.02,
15
+ "intermediate_size": 8960,
16
+ "max_position_embeddings": 131072,
17
+ "max_window_layers": 28,
18
+ "num_attention_heads": 12,
19
+ "num_hidden_layers": 28,
20
+ "num_key_value_heads": 2,
21
+ "rms_norm_eps": 1e-06,
22
+ "rope_scaling": null,
23
+ "rope_theta": 1000000,
24
+ "sliding_window": 131072,
25
+ "tie_word_embeddings": false,
26
+ "torch_dtype": "bfloat16",
27
+ "transformers_version": "4.51.0",
28
+ "use_cache": true,
29
+ "use_sliding_window": false,
30
+ "vocab_size": 151936,
31
+ "image_token_id": 151665,
32
+ "video_token_id": 151656,
33
+ "vision_config": {
34
+ "embed_dim": 1536,
35
+ "hidden_size": 1536,
36
+ "intermediate_size": 4224,
37
+ "num_hidden_layers": 42,
38
+ "num_attention_heads": 12,
39
+ "num_channels": 3,
40
+ "patch_size": 14,
41
+ "post_norm": true,
42
+ "rms_norm_eps": 1e-05,
43
+ "spatial_merge_size": 2,
44
+ "temporal_patch_size": 1,
45
+ "use_bias": false,
46
+ "attn_implementation": "flash_attention_2",
47
+ "init_merger_std": 0.02,
48
+ "initializer_range": 0.02,
49
+ "is_causal": false
50
+ }
51
+ }
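For intuition, the `vision_config` above also determines roughly how many visual tokens a page consumes: 14×14-pixel patches are merged in `spatial_merge_size` × `spatial_merge_size` groups before reaching the language model. The estimate below is only a back-of-the-envelope sketch based on those two fields and the usual merger semantics; the processor's actual resizing rules change the exact count.

```python
# Back-of-the-envelope visual-token estimate from vision_config
# (patch_size=14, spatial_merge_size=2); ignores the processor's resize rules.
def approx_image_tokens(height: int, width: int, patch_size: int = 14, merge: int = 2) -> int:
    patches = (height // patch_size) * (width // patch_size)
    return patches // (merge * merge)

print(approx_image_tokens(1024, 768))   # ~985 tokens
print(approx_image_tokens(3360, 3360))  # ~14400 tokens; 3360*3360 = 11,289,600 px, the practical limit noted in the README
```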
configuration_dots.py ADDED
@@ -0,0 +1,76 @@
1
+ from typing import Any, Optional
2
+ from transformers.configuration_utils import PretrainedConfig
3
+ from transformers.models.qwen2 import Qwen2Config
4
+ from transformers import Qwen2_5_VLProcessor, AutoProcessor
5
+ from transformers.models.auto.configuration_auto import CONFIG_MAPPING
6
+
7
+
8
+ class DotsVisionConfig(PretrainedConfig):
9
+ model_type: str = "dots_vit"
10
+
11
+ def __init__(
12
+ self,
13
+ embed_dim: int = 1536, # vision encoder embed size
14
+ hidden_size: int = 1536, # after merger hidden size
15
+ intermediate_size: int = 4224,
16
+ num_hidden_layers: int = 42,
17
+ num_attention_heads: int = 12,
18
+ num_channels: int = 3,
19
+ patch_size: int = 14,
20
+ spatial_merge_size: int = 2,
21
+ temporal_patch_size: int = 1,
22
+ rms_norm_eps: float = 1e-5,
23
+ use_bias: bool = False,
24
+ attn_implementation="flash_attention_2", # "eager","sdpa","flash_attention_2"
25
+ initializer_range=0.02,
26
+ init_merger_std=0.02,
27
+ is_causal=False, # ve causal forward
28
+ post_norm=True,
29
+ gradient_checkpointing=False,
30
+ **kwargs: Any,
31
+ ):
32
+ super().__init__(**kwargs)
33
+ self.embed_dim = embed_dim
34
+ self.hidden_size = hidden_size
35
+ self.intermediate_size = intermediate_size
36
+ self.num_hidden_layers = num_hidden_layers
37
+ self.num_attention_heads = num_attention_heads
38
+ self.num_channels = num_channels
39
+ self.patch_size = patch_size
40
+ self.spatial_merge_size = spatial_merge_size
41
+ self.temporal_patch_size = temporal_patch_size
42
+ self.rms_norm_eps = rms_norm_eps
43
+ self.use_bias = use_bias
44
+ self.attn_implementation = attn_implementation
45
+ self.initializer_range = initializer_range
46
+ self.init_merger_std = init_merger_std
47
+ self.is_causal = is_causal
48
+ self.post_norm = post_norm
49
+ self.gradient_checkpointing = gradient_checkpointing
50
+
51
+
52
+
53
+ class DotsOCRConfig(Qwen2Config):
54
+ model_type = "dots_ocr"
55
+ def __init__(self,
56
+ image_token_id = 151665,
57
+ video_token_id = 151656,
58
+ vision_config: Optional[dict] = None, *args, **kwargs):
59
+ super().__init__(*args, **kwargs)
60
+ self.image_token_id = image_token_id
61
+ self.video_token_id = video_token_id
62
+ self.vision_config = DotsVisionConfig(**(vision_config or {}))
63
+
64
+ def save_pretrained(self, save_directory, **kwargs):
65
+ self._auto_class = None
66
+ super().save_pretrained(save_directory, **kwargs)
67
+
68
+
69
+ class DotsVLProcessor(Qwen2_5_VLProcessor):
70
+ def __init__(self, image_processor=None, tokenizer=None, chat_template=None, **kwargs):
71
+ super().__init__(image_processor, tokenizer, chat_template=chat_template)
72
+ self.image_token = "<|imgpad|>" if not hasattr(tokenizer, "image_token") else tokenizer.image_token
73
+
74
+
75
+ AutoProcessor.register("dots_ocr", DotsVLProcessor)
76
+ CONFIG_MAPPING.register("dots_ocr", DotsOCRConfig)
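With the weights downloaded (e.g., to `./weights/DotsOCR` as in the Quick Start), the custom config registered above can be inspected through the standard `trust_remote_code` path; a minimal sketch:

```python
# Sketch: load the custom config via the auto_map in config.json
# (assumes the weights were downloaded to ./weights/DotsOCR as in the README).
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("./weights/DotsOCR", trust_remote_code=True)
print(cfg.model_type)                        # "dots_ocr"
print(cfg.vision_config.patch_size)          # 14
print(cfg.vision_config.spatial_merge_size)  # 2
```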
dots.ocr LICENSE AGREEMENT ADDED
@@ -0,0 +1,109 @@
1
+ dots.ocr LICENSE AGREEMENT
2
+
3
+ Effective Date: [August 8, 2025]
4
+
5
+ Copyright Holder: [Xingyin Information Technology (Shanghai) Co., Ltd]
6
+
7
+ This License Agreement (“Agreement”) governs Your use, reproduction, modification, and distribution of dots.ocr (the "Model Materials"). This Agreement is designed to maximize the openness and use of the Model Materials while addressing the unique legal, ethical, and technical challenges posed by large language models.
8
+
9
+ WHEREAS, Licensor has developed the dots.ocr document parsing model and intends to distribute the Model Materials under an open‑source framework;
10
+ WHEREAS, traditional open-source licenses (e.g., the MIT License) may not fully address the inherent complexities of document parsing models, namely their multiple components (code, weights, training data), potential ethical risks, data‑governance issues, and intellectual‑property and liability questions regarding AI‑generated content;
11
+ WHEREAS, Licensor seeks to provide a legal framework that ensures maximum access to and use of the Model Materials while clearly defining the rights, obligations, and liabilities of Licensee;
12
+
13
+ THEREFORE, the parties agree that, subject to the MIT License, they shall be bound by the following terms and conditions:
14
+
15
+ 1. Definitions and Interpretation
16
+ Purpose: To define key terms used in this Agreement, particularly "Model Materials," ensuring clarity of the license scope beyond traditional software code. To clarify the order of precedence between this Agreement and the MIT License to avoid conflict.
17
+
18
+ 1.1 “Licensor” shall mean the entity providing the Model Materials under this Agreement, namely [Xingyin Information Technology (Shanghai) Co., Ltd].
19
+
20
+ 1.2 “Licensee” or "You" shall mean any individual or entity exercising permissions granted by this Agreement.
21
+
22
+ 1.3 “Model Materials” shall mean all materials provided by Licensor under this Agreement, including but not limited to:
23
+         (a) one or more machine‑learning models, including architecture and trained parameters (i.e., model weights);
24
+         (b) all associated preprocessing, training, inference, and fine‑tuning code;
25
+         (c) training datasets and evaluation scripts (or their detailed descriptions and access mechanisms); and
26
+         (d) any accompanying documentation, metadata, and tools.
27
+ The above Model Materials shall be subject to the content published on the Licensor’s website or GitHub repository at https://github.com/rednote-hilab/dots.ocr.
28
+
29
+ 1.4 “Outputs” shall mean any content generated through the use of the Model Materials, such as text, tables, code, layout information, and formulas extracted from documents.
30
+
31
+ 1.5 “MIT License” shall mean The MIT Open Source License published by the Massachusetts Institute of Technology.
32
+
33
+ 1.6   Priority of Agreement. In the event of any conflict or inconsistency between this Agreement and the MIT License, the terms of the MIT License shall prevail. However, if the terms of the MIT License are ambiguous or silent on a particular matter, the provisions of this Agreement shall apply and supplement the MIT License.
34
+
35
+ 2. Grant of Rights and Scope of Use
36
+
37
+ Purpose: To grant broad, permissive rights to the Licensee for the Model Materials—including code, weights, data, and documentation—to ensure maximum openness and flexibility while clarifying the free use of model-generated content. Additionally, it clarifies the feasibility of transitioning from open-source to commercial‑use and the use of OpenAPI interfaces.
38
+
39
+ 2.1   Grant of Copyright License. Subject to Licensee's compliance with this Agreement, Licensor hereby grants Licensee a perpetual, worldwide, non‑exclusive, no-charge, royalty‑free copyright license to use (run or test), reproduce, modify, create derivative works of, merge, publish, distribute the Model Materials; sublicense and/or sell copies of the Model Materials or any derivative works thereof; and incorporate the unmodified or modified Model Materials into proprietary products or services, including for commercial purposes, software‑as‑a‑service (SaaS) offerings, or via OpenAPI or other interfaces.
40
+
41
+ 2.2   Fundamental Capabilities. The Model Materials only provide the fundamental model’s capabilities. Licensees may develop derivative AI applications or undertake task‑specific training thereon.
42
+
43
+ 2.3   From Open Source to Commercial Use. The open-source release does not preclude Licensor’s commercial exploitation of the Model Materials, in whole or in part. Any such commercial use shall, at that time, be subject to license agreements between Licensor and applicable users.
44
+
45
+ 2.4   API‑Service Exception. Licensees who access the Model Materials through API calls or provide model services via API interfaces (without directly distributing model weights) shall not be subject to this Agreement unless otherwise expressly agreed. Instead, such use shall be governed by the API terms of use published by Licensor (if any).
46
+
47
+ 3. Acceptable Use Policy and Prohibited Uses
48
+
49
+ 3.1   Responsible Use. Licensee must use the Model Materials in a responsible, ethical, and lawful manner, in compliance with all applicable laws, regulations, industry standards, and best practices.
50
+
51
+ 3.2   Enterprise On‑Premises Deployment. The Licensee may deploy the Model Materials in closed‑source, on‑premises enterprise environments.
52
+
53
+ 3.3   Prohibited Uses. Any breach of the prohibitions below will result in the automatic termination of all licenses granted under this Agreement. Licensee agrees not to use the Model Materials or any derivative works thereof, in connection with:
54
+ (a) Identification and Utilization of Illegal/Harmful Content: Includes identifying graphic/text materials used for counterfeiting certificates/invoices, perpetrating fraud, or launching cyberattacks; or processing images containing illegal content such as violence, criminal activities, disinformation, or child exploitation.
55
+ (b) Privacy Infringement and Discriminatory Practices: Extracting personal sensitive information (e.g., ID numbers, medical records, biometric data) or protected characteristics (e.g., race, gender) from images without legal authorization or consent, for purposes of privacy violation, automated discriminatory decision-making, or harassment.
56
+ (c) Copyright Restrictions: Licensees shall not use the tool for unauthorized digitization of publications/document scanning or bulk scraping of content. Any use involving publications or other copyright-protected materials must first obtain relevant permissions.
57
+
58
+ 4. Intellectual Property Ownership and Contributions
59
+
60
+ 4.1   Licensor's Copyright Reservation. Licensor reserves all right, title, and interest in and to the Model Materials (including the model architecture, parameters, code, and original training data), except as expressly licensed herein. The original copyright of the Model Materials belongs to the Licensor.
61
+
62
+ 4.2   Patent License. Subject to the terms and conditions of this Agreement, Licensor hereby grants Licensee a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Model Materials, where such license applies only to those patent claims licensable by the Licensor that are necessarily infringed by its contribution(s).
63
+ If Licensee institutes patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Model Materials constitute direct or contributory patent infringement, then any patent licenses granted under this License for the Model Materials shall terminate as of the date such litigation is asserted or filed.
64
+
65
+ 4.3   Outputs: The Outputs generated through the use of the Model Materials generally refer to text, tables, layouts, and other content extracted from documents or images. The extracted content itself does not generate new intellectual property rights, and all intellectual property remains with the original authors or copyright holders. The Licensee is responsible for due diligence regarding the legality of the Outputs, particularly where the content extracted by the OCR model may be substantially similar to existing copyrighted works, which could present intellectual property infringement risks. The Licensor assumes no liability for such infringements.
66
+ 4.4   Trademarks. Nothing in this License permits Licensee to make use of Licensor’s trademarks, trade names, logos (e.g., “rednote,” “Xiaohongshu,” “dots.ocr”) or to otherwise suggest endorsement or misrepresent the relationship between the parties, unless Licensor’s prior written approval is granted.
67
+
68
+ 5. Data Governance, Privacy, and Security
69
+
70
+ 5.1   Data Quality and Bias. Licensee shall use training data from lawful sources and is encouraged to conduct due diligence before deploying the Model Materials and to take reasonable steps to mitigate any known biases in its training data or applications.
71
+
72
+ 5.2   Privacy Protection.
73
+         (a) Sensitive‑Data Restrictions. It is prohibited to use the Model Materials to process, extract, or infer sensitive personal data protected under specific laws (such as GDPR or HIPAA), particularly when dealing with documents containing personally identifiable information (such as ID numbers, health data, financial information, etc.), unless Licensee has obtained all necessary consents, lawful bases, or authorizations, and has implemented adequate anonymization, pseudonymization, or other privacy-enhancing technologies.
74
+         (b) Data Minimization and Purpose Limitation. The Licensee shall follow the principle of data minimization when using the OCR Model, processing only the user data necessary for specific, explicit, and lawful purposes. Specifically, the OCR Model should avoid processing unnecessary sensitive data and ensure compliance with applicable privacy protection laws during data handling.
75
+         (c) Transparency. Licensee shall provide clear and transparent privacy policies and terms of use when processing user data, particularly during document scanning and information extraction.
76
+
77
+ 5.3   Security Measures. Licensee shall implement appropriate technical and administrative safeguards to protect the Model Materials and any associated data against unauthorized access, disclosure, alteration, or destruction. Such measures may include, but are not limited to, encryption, access controls, logging, and audit trails.
78
+
79
+ 5.4   Further Training. Licensee may use user‑provided input or Outputs for training, fine-tuning, or improving other AI models only if it has obtained the specific and informed consent of the data subjects.
80
+
81
+ 6. Disclaimer of Warranty and Limitation of Liability
82
+
83
+ 6.1   “AS IS” Basis. Unless required by applicable law, the Model Materials are provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. Licensee is solely responsible for determining the appropriateness of using or redistributing the Model Materials and assumes any risks associated with the exercise of permissions under this License. Licensor does not provide any warranty of non-infringement but represents that no infringing code has been knowingly included.
84
+
85
+ 6.2   Outputs Disclaimer. The Model Materials are provided as a neutral technology, and Licensor disclaims all liability for the accuracy, completeness, reliability, safety, legality, or suitability of any Outputs. The Licensee is solely responsible for verifying the accuracy and appropriateness of AI-generated content and shall provide appropriate disclosures when publishing or relying upon such content.
86
+
87
+ 6.3   Limitation of Liability and Recourse. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, shall Licensor or contributors be liable for any claims or damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Model Materials (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if Licensor has been advised of the possibility of such damages. If such losses are incurred, recourse may be sought against the Licensee responsible for causing the loss.
88
+
89
+ 6.4   Content‑Filtering Disclaimer. Although the Model Materials may include content‑filtering mechanisms, Licensor makes no warranties of any kind regarding the stability, quality, accuracy, completeness, or any specific outcome of Outputs. Licensee is solely responsible for reviewing, verifying, and performing quality control on Outputs and assumes all associated risks and liabilities.
90
+
91
+ 7. Attribution and License Reservation
92
+
93
+ 7.1   License. When distributing or redistributing the Model Materials, Licensee must give any other recipients of the Model Materials a copy of this Agreement.
94
+
95
+ 7.2   Copyright and Notices. When distributing any part of the Model Materials, Licensee must retain all copyright, patent, trademark, and attribution notices included in the Model Materials.
96
+
97
+ 7.3   Attribution. Licensee is encouraged to prominently display the name of Licensor and the Model Materials in any public statements, products, or services that contain the Model Materials (or any derivative works thereof), to promote transparency and community trust. If Licensee distributes modified weights or fine‑tuned models based on the Model Materials, Licensee must prominently display the following statement in the related website or documentation: “Built with dots.ocr.”
98
+
99
+ 8. Governing Law and Dispute Resolution
100
+
101
+ 8.1   Governing Law. This Agreement shall be governed by and construed in accordance with the laws of the People’s Republic of China, without regard to its conflict of laws principles.
102
+
103
+ 8.2   Dispute Resolution. Any dispute, claim, or disagreement arising out of or relating to this Agreement shall first be resolved through amicable consultation. If such consultation fails, the dispute shall be submitted to the Hangzhou Arbitration Commission for arbitration. The arbitration shall be conducted in accordance with the laws of China, and the place of arbitration shall be [Hangzhou, China]. The arbitral award shall be final and binding upon both parties.
104
+
105
+ 9. Regulatory Compliance Amendments
106
+ In the event that any part of this Agreement becomes invalid or requires adjustment due to changes in applicable laws or regulations, Licensor reserves the right to issue a revised version of this Agreement. Licensee shall migrate to the new version within [e.g., ninety (90)] days of its release; otherwise, all rights granted under this Agreement shall automatically terminate.
107
+
108
+ 10. Security Reporting
109
+ Any Licensee who discovers a security vulnerability in the Model Materials may report it to Licensor via dots-feedback@xiaohongshu.com. Licensee shall not disclose vulnerability details until Licensor issues an official remediation, unless otherwise required by law.
generation_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "max_length": 32768,
3
+ "eos_token_id": [
4
+ 151643,
5
+ 151673
6
+ ]
7
+ }
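
The generation_config.json above sets a 32768-token max_length and two end-of-sequence token ids. As a minimal illustration (not part of this repository), the sketch below shows how these fields are typically read through Hugging Face transformers' GenerationConfig; the repo id follows this model card, and running it assumes the transformers library is installed and the repo (or a local copy) is reachable.

```python
# Minimal sketch (not part of this repo): reading the generation_config.json
# above with Hugging Face transformers. Assumes `transformers` is installed
# and that the repo id "rednote-hilab/dots.ocr" (or a local copy) resolves.
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("rednote-hilab/dots.ocr")

# max_length bounds the total sequence length (prompt + generated tokens);
# generation stops when either of the two eos_token_id values is emitted.
print(gen_cfg.max_length)    # expected: 32768
print(gen_cfg.eos_token_id)  # expected: [151643, 151673]
```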
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea1d532184f3adf5cbcfcc00b2cf5b2abfa6fe182768a3ae63d441a9b5fc99ac
3
+ size 4292758192
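
The three lines above are a Git LFS pointer rather than the weights themselves: the roughly 4.3 GB shard is stored in LFS and identified by its SHA-256 oid. Below is a hedged sketch of how a downloaded copy of this shard could be checked against that digest; the local filename is an assumption.

```python
# Minimal verification sketch (not part of this repo): recompute the SHA-256
# digest of a locally downloaded shard and compare it with the oid recorded
# in the Git LFS pointer above. The local path is a placeholder assumption.
import hashlib

EXPECTED_OID = "ea1d532184f3adf5cbcfcc00b2cf5b2abfa6fe182768a3ae63d441a9b5fc99ac"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so a multi-GB shard never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    actual = sha256_of("model-00001-of-00002.safetensors")  # hypothetical local path
    print("match" if actual == EXPECTED_OID else f"mismatch: {actual}")
```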
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:26ab1ec6c8b4e4116befbd59af42159f1dbcb0ad0c045a15e890bb2f6e8b0dae
3
+ size 1785673544
model.safetensors.index.json ADDED
@@ -0,0 +1,650 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 6078358528
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "model-00001-of-00002.safetensors",
7
+ "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
8
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
9
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
10
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
11
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
12
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
13
+ "model.layers.0.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
14
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
15
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
16
+ "model.layers.0.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
17
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
18
+ "model.layers.0.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
19
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
20
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
21
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
22
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
23
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
24
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
25
+ "model.layers.1.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
26
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
27
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
28
+ "model.layers.1.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
29
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
30
+ "model.layers.1.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
31
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
32
+ "model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
33
+ "model.layers.10.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
34
+ "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
35
+ "model.layers.10.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
36
+ "model.layers.10.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
37
+ "model.layers.10.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
38
+ "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
39
+ "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
40
+ "model.layers.10.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
41
+ "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
42
+ "model.layers.10.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
43
+ "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
44
+ "model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
45
+ "model.layers.11.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
46
+ "model.layers.11.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
47
+ "model.layers.11.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
48
+ "model.layers.11.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
49
+ "model.layers.11.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
50
+ "model.layers.11.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
51
+ "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
52
+ "model.layers.11.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
53
+ "model.layers.11.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
54
+ "model.layers.11.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
55
+ "model.layers.11.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
56
+ "model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
57
+ "model.layers.12.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
58
+ "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
59
+ "model.layers.12.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
60
+ "model.layers.12.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
61
+ "model.layers.12.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
62
+ "model.layers.12.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
63
+ "model.layers.12.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
64
+ "model.layers.12.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
65
+ "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
66
+ "model.layers.12.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
67
+ "model.layers.12.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
68
+ "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
69
+ "model.layers.13.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
70
+ "model.layers.13.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
71
+ "model.layers.13.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
72
+ "model.layers.13.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
73
+ "model.layers.13.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
74
+ "model.layers.13.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
75
+ "model.layers.13.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
76
+ "model.layers.13.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
77
+ "model.layers.13.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
78
+ "model.layers.13.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
79
+ "model.layers.13.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
80
+ "model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
81
+ "model.layers.14.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
82
+ "model.layers.14.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
83
+ "model.layers.14.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
84
+ "model.layers.14.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
85
+ "model.layers.14.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
86
+ "model.layers.14.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
87
+ "model.layers.14.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
88
+ "model.layers.14.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
89
+ "model.layers.14.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
90
+ "model.layers.14.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
91
+ "model.layers.14.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
92
+ "model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
93
+ "model.layers.15.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
94
+ "model.layers.15.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
95
+ "model.layers.15.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
96
+ "model.layers.15.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
97
+ "model.layers.15.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
98
+ "model.layers.15.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
99
+ "model.layers.15.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
100
+ "model.layers.15.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
101
+ "model.layers.15.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
102
+ "model.layers.15.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
103
+ "model.layers.15.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
104
+ "model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
105
+ "model.layers.16.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
106
+ "model.layers.16.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
107
+ "model.layers.16.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
108
+ "model.layers.16.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
109
+ "model.layers.16.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
110
+ "model.layers.16.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
111
+ "model.layers.16.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
112
+ "model.layers.16.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
113
+ "model.layers.16.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
114
+ "model.layers.16.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
115
+ "model.layers.16.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
116
+ "model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
117
+ "model.layers.17.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
118
+ "model.layers.17.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
119
+ "model.layers.17.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
120
+ "model.layers.17.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
121
+ "model.layers.17.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
122
+ "model.layers.17.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
123
+ "model.layers.17.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
124
+ "model.layers.17.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
125
+ "model.layers.17.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
126
+ "model.layers.17.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
127
+ "model.layers.17.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
128
+ "model.layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
129
+ "model.layers.18.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
130
+ "model.layers.18.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
131
+ "model.layers.18.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
132
+ "model.layers.18.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
133
+ "model.layers.18.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
134
+ "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
135
+ "model.layers.18.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
136
+ "model.layers.18.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
137
+ "model.layers.18.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
138
+ "model.layers.18.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
139
+ "model.layers.18.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
140
+ "model.layers.19.input_layernorm.weight": "model-00001-of-00002.safetensors",
141
+ "model.layers.19.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
142
+ "model.layers.19.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
143
+ "model.layers.19.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
144
+ "model.layers.19.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
145
+ "model.layers.19.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
146
+ "model.layers.19.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
147
+ "model.layers.19.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
148
+ "model.layers.19.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
149
+ "model.layers.19.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
150
+ "model.layers.19.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
151
+ "model.layers.19.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
152
+ "model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
153
+ "model.layers.2.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
154
+ "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
155
+ "model.layers.2.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
156
+ "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
157
+ "model.layers.2.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
158
+ "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
159
+ "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
160
+ "model.layers.2.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
161
+ "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
162
+ "model.layers.2.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
163
+ "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
164
+ "model.layers.20.input_layernorm.weight": "model-00001-of-00002.safetensors",
165
+ "model.layers.20.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
166
+ "model.layers.20.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
167
+ "model.layers.20.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
168
+ "model.layers.20.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
169
+ "model.layers.20.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
170
+ "model.layers.20.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
171
+ "model.layers.20.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
172
+ "model.layers.20.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
173
+ "model.layers.20.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
174
+ "model.layers.20.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
175
+ "model.layers.20.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
176
+ "model.layers.21.input_layernorm.weight": "model-00001-of-00002.safetensors",
177
+ "model.layers.21.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
178
+ "model.layers.21.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
179
+ "model.layers.21.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
180
+ "model.layers.21.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
181
+ "model.layers.21.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
182
+ "model.layers.21.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
183
+ "model.layers.21.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
184
+ "model.layers.21.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
185
+ "model.layers.21.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
186
+ "model.layers.21.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
187
+ "model.layers.21.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
188
+ "model.layers.22.input_layernorm.weight": "model-00001-of-00002.safetensors",
189
+ "model.layers.22.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
190
+ "model.layers.22.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
191
+ "model.layers.22.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
192
+ "model.layers.22.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
193
+ "model.layers.22.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
194
+ "model.layers.22.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
195
+ "model.layers.22.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
196
+ "model.layers.22.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
197
+ "model.layers.22.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
198
+ "model.layers.22.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
199
+ "model.layers.22.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
200
+ "model.layers.23.input_layernorm.weight": "model-00001-of-00002.safetensors",
201
+ "model.layers.23.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
202
+ "model.layers.23.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
203
+ "model.layers.23.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
204
+ "model.layers.23.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
205
+ "model.layers.23.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
206
+ "model.layers.23.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
207
+ "model.layers.23.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
208
+ "model.layers.23.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
209
+ "model.layers.23.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
210
+ "model.layers.23.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
211
+ "model.layers.23.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
212
+ "model.layers.24.input_layernorm.weight": "model-00001-of-00002.safetensors",
213
+ "model.layers.24.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
214
+ "model.layers.24.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
215
+ "model.layers.24.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
216
+ "model.layers.24.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
217
+ "model.layers.24.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
218
+ "model.layers.24.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
219
+ "model.layers.24.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
220
+ "model.layers.24.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
221
+ "model.layers.24.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
222
+ "model.layers.24.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
223
+ "model.layers.24.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
224
+ "model.layers.25.input_layernorm.weight": "model-00001-of-00002.safetensors",
225
+ "model.layers.25.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
226
+ "model.layers.25.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
227
+ "model.layers.25.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
228
+ "model.layers.25.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
229
+ "model.layers.25.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
230
+ "model.layers.25.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
231
+ "model.layers.25.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
232
+ "model.layers.25.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
233
+ "model.layers.25.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
234
+ "model.layers.25.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
235
+ "model.layers.25.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
236
+ "model.layers.26.input_layernorm.weight": "model-00001-of-00002.safetensors",
237
+ "model.layers.26.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
238
+ "model.layers.26.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
239
+ "model.layers.26.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
240
+ "model.layers.26.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
241
+ "model.layers.26.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
242
+ "model.layers.26.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
243
+ "model.layers.26.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
244
+ "model.layers.26.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
245
+ "model.layers.26.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
246
+ "model.layers.26.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
247
+ "model.layers.26.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
248
+ "model.layers.27.input_layernorm.weight": "model-00001-of-00002.safetensors",
249
+ "model.layers.27.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
250
+ "model.layers.27.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
251
+ "model.layers.27.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
252
+ "model.layers.27.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
253
+ "model.layers.27.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
254
+ "model.layers.27.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
255
+ "model.layers.27.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
256
+ "model.layers.27.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
257
+ "model.layers.27.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
258
+ "model.layers.27.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
259
+ "model.layers.27.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
260
+ "model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
261
+ "model.layers.3.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
262
+ "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
263
+ "model.layers.3.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
264
+ "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
265
+ "model.layers.3.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
266
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
267
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
268
+ "model.layers.3.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
269
+ "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
270
+ "model.layers.3.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
271
+ "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
272
+ "model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
273
+ "model.layers.4.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
274
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
275
+ "model.layers.4.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
276
+ "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
277
+ "model.layers.4.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
278
+ "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
279
+ "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
280
+ "model.layers.4.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
281
+ "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
282
+ "model.layers.4.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
283
+ "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
284
+ "model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
285
+ "model.layers.5.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
286
+ "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
287
+ "model.layers.5.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
288
+ "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
289
+ "model.layers.5.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
290
+ "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
291
+ "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
292
+ "model.layers.5.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
293
+ "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
294
+ "model.layers.5.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
295
+ "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
296
+ "model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
297
+ "model.layers.6.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
298
+ "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
299
+ "model.layers.6.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
300
+ "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
301
+ "model.layers.6.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
302
+ "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
303
+ "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
304
+ "model.layers.6.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
305
+ "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
306
+ "model.layers.6.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
307
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
308
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
309
+ "model.layers.7.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
310
+ "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
311
+ "model.layers.7.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
312
+ "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
313
+ "model.layers.7.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
314
+ "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
315
+ "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
316
+ "model.layers.7.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
317
+ "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
318
+ "model.layers.7.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
319
+ "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
320
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
321
+ "model.layers.8.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
322
+ "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
323
+ "model.layers.8.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
324
+ "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
325
+ "model.layers.8.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
326
+ "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
327
+ "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
328
+ "model.layers.8.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
329
+ "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
330
+ "model.layers.8.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
331
+ "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
332
+ "model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
333
+ "model.layers.9.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
334
+ "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
335
+ "model.layers.9.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
336
+ "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
337
+ "model.layers.9.self_attn.k_proj.bias": "model-00001-of-00002.safetensors",
338
+ "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
339
+ "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
340
+ "model.layers.9.self_attn.q_proj.bias": "model-00001-of-00002.safetensors",
341
+ "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
342
+ "model.layers.9.self_attn.v_proj.bias": "model-00001-of-00002.safetensors",
343
+ "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
344
+ "model.norm.weight": "model-00001-of-00002.safetensors",
345
+ "vision_tower.blocks.0.attn.proj.weight": "model-00001-of-00002.safetensors",
346
+ "vision_tower.blocks.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
347
+ "vision_tower.blocks.0.mlp.fc1.weight": "model-00001-of-00002.safetensors",
348
+ "vision_tower.blocks.0.mlp.fc2.weight": "model-00001-of-00002.safetensors",
349
+ "vision_tower.blocks.0.mlp.fc3.weight": "model-00001-of-00002.safetensors",
350
+ "vision_tower.blocks.0.norm1.weight": "model-00001-of-00002.safetensors",
351
+ "vision_tower.blocks.0.norm2.weight": "model-00001-of-00002.safetensors",
352
+ "vision_tower.blocks.1.attn.proj.weight": "model-00001-of-00002.safetensors",
353
+ "vision_tower.blocks.1.attn.qkv.weight": "model-00001-of-00002.safetensors",
354
+ "vision_tower.blocks.1.mlp.fc1.weight": "model-00001-of-00002.safetensors",
355
+ "vision_tower.blocks.1.mlp.fc2.weight": "model-00001-of-00002.safetensors",
356
+ "vision_tower.blocks.1.mlp.fc3.weight": "model-00001-of-00002.safetensors",
357
+ "vision_tower.blocks.1.norm1.weight": "model-00001-of-00002.safetensors",
358
+ "vision_tower.blocks.1.norm2.weight": "model-00001-of-00002.safetensors",
359
+ "vision_tower.blocks.10.attn.proj.weight": "model-00001-of-00002.safetensors",
360
+ "vision_tower.blocks.10.attn.qkv.weight": "model-00001-of-00002.safetensors",
361
+ "vision_tower.blocks.10.mlp.fc1.weight": "model-00001-of-00002.safetensors",
362
+ "vision_tower.blocks.10.mlp.fc2.weight": "model-00001-of-00002.safetensors",
363
+ "vision_tower.blocks.10.mlp.fc3.weight": "model-00001-of-00002.safetensors",
364
+ "vision_tower.blocks.10.norm1.weight": "model-00001-of-00002.safetensors",
365
+ "vision_tower.blocks.10.norm2.weight": "model-00001-of-00002.safetensors",
366
+ "vision_tower.blocks.11.attn.proj.weight": "model-00001-of-00002.safetensors",
367
+ "vision_tower.blocks.11.attn.qkv.weight": "model-00001-of-00002.safetensors",
368
+ "vision_tower.blocks.11.mlp.fc1.weight": "model-00001-of-00002.safetensors",
369
+ "vision_tower.blocks.11.mlp.fc2.weight": "model-00001-of-00002.safetensors",
370
+ "vision_tower.blocks.11.mlp.fc3.weight": "model-00001-of-00002.safetensors",
371
+ "vision_tower.blocks.11.norm1.weight": "model-00001-of-00002.safetensors",
372
+ "vision_tower.blocks.11.norm2.weight": "model-00001-of-00002.safetensors",
373
+ "vision_tower.blocks.12.attn.proj.weight": "model-00001-of-00002.safetensors",
374
+ "vision_tower.blocks.12.attn.qkv.weight": "model-00001-of-00002.safetensors",
375
+ "vision_tower.blocks.12.mlp.fc1.weight": "model-00001-of-00002.safetensors",
376
+ "vision_tower.blocks.12.mlp.fc2.weight": "model-00001-of-00002.safetensors",
377
+ "vision_tower.blocks.12.mlp.fc3.weight": "model-00001-of-00002.safetensors",
378
+ "vision_tower.blocks.12.norm1.weight": "model-00001-of-00002.safetensors",
379
+ "vision_tower.blocks.12.norm2.weight": "model-00001-of-00002.safetensors",
380
+ "vision_tower.blocks.13.attn.proj.weight": "model-00001-of-00002.safetensors",
381
+ "vision_tower.blocks.13.attn.qkv.weight": "model-00001-of-00002.safetensors",
382
+ "vision_tower.blocks.13.mlp.fc1.weight": "model-00001-of-00002.safetensors",
383
+ "vision_tower.blocks.13.mlp.fc2.weight": "model-00001-of-00002.safetensors",
384
+ "vision_tower.blocks.13.mlp.fc3.weight": "model-00001-of-00002.safetensors",
385
+ "vision_tower.blocks.13.norm1.weight": "model-00001-of-00002.safetensors",
386
+ "vision_tower.blocks.13.norm2.weight": "model-00001-of-00002.safetensors",
387
+ "vision_tower.blocks.14.attn.proj.weight": "model-00001-of-00002.safetensors",
388
+ "vision_tower.blocks.14.attn.qkv.weight": "model-00001-of-00002.safetensors",
389
+ "vision_tower.blocks.14.mlp.fc1.weight": "model-00001-of-00002.safetensors",
390
+ "vision_tower.blocks.14.mlp.fc2.weight": "model-00001-of-00002.safetensors",
391
+ "vision_tower.blocks.14.mlp.fc3.weight": "model-00001-of-00002.safetensors",
392
+ "vision_tower.blocks.14.norm1.weight": "model-00001-of-00002.safetensors",
393
+ "vision_tower.blocks.14.norm2.weight": "model-00001-of-00002.safetensors",
394
+ "vision_tower.blocks.15.attn.proj.weight": "model-00001-of-00002.safetensors",
395
+ "vision_tower.blocks.15.attn.qkv.weight": "model-00001-of-00002.safetensors",
396
+ "vision_tower.blocks.15.mlp.fc1.weight": "model-00001-of-00002.safetensors",
397
+ "vision_tower.blocks.15.mlp.fc2.weight": "model-00001-of-00002.safetensors",
398
+ "vision_tower.blocks.15.mlp.fc3.weight": "model-00001-of-00002.safetensors",
399
+ "vision_tower.blocks.15.norm1.weight": "model-00001-of-00002.safetensors",
400
+ "vision_tower.blocks.15.norm2.weight": "model-00001-of-00002.safetensors",
401
+ "vision_tower.blocks.16.attn.proj.weight": "model-00001-of-00002.safetensors",
402
+ "vision_tower.blocks.16.attn.qkv.weight": "model-00001-of-00002.safetensors",
403
+ "vision_tower.blocks.16.mlp.fc1.weight": "model-00001-of-00002.safetensors",
404
+ "vision_tower.blocks.16.mlp.fc2.weight": "model-00001-of-00002.safetensors",
405
+ "vision_tower.blocks.16.mlp.fc3.weight": "model-00001-of-00002.safetensors",
406
+ "vision_tower.blocks.16.norm1.weight": "model-00001-of-00002.safetensors",
407
+ "vision_tower.blocks.16.norm2.weight": "model-00001-of-00002.safetensors",
408
+ "vision_tower.blocks.17.attn.proj.weight": "model-00001-of-00002.safetensors",
409
+ "vision_tower.blocks.17.attn.qkv.weight": "model-00001-of-00002.safetensors",
410
+ "vision_tower.blocks.17.mlp.fc1.weight": "model-00001-of-00002.safetensors",
411
+ "vision_tower.blocks.17.mlp.fc2.weight": "model-00001-of-00002.safetensors",
412
+ "vision_tower.blocks.17.mlp.fc3.weight": "model-00001-of-00002.safetensors",
413
+ "vision_tower.blocks.17.norm1.weight": "model-00001-of-00002.safetensors",
414
+ "vision_tower.blocks.17.norm2.weight": "model-00001-of-00002.safetensors",
415
+ "vision_tower.blocks.18.attn.proj.weight": "model-00001-of-00002.safetensors",
416
+ "vision_tower.blocks.18.attn.qkv.weight": "model-00001-of-00002.safetensors",
417
+ "vision_tower.blocks.18.mlp.fc1.weight": "model-00001-of-00002.safetensors",
418
+ "vision_tower.blocks.18.mlp.fc2.weight": "model-00001-of-00002.safetensors",
419
+ "vision_tower.blocks.18.mlp.fc3.weight": "model-00001-of-00002.safetensors",
420
+ "vision_tower.blocks.18.norm1.weight": "model-00001-of-00002.safetensors",
421
+ "vision_tower.blocks.18.norm2.weight": "model-00001-of-00002.safetensors",
422
+ "vision_tower.blocks.19.attn.proj.weight": "model-00001-of-00002.safetensors",
423
+ "vision_tower.blocks.19.attn.qkv.weight": "model-00001-of-00002.safetensors",
424
+ "vision_tower.blocks.19.mlp.fc1.weight": "model-00001-of-00002.safetensors",
425
+ "vision_tower.blocks.19.mlp.fc2.weight": "model-00001-of-00002.safetensors",
426
+ "vision_tower.blocks.19.mlp.fc3.weight": "model-00001-of-00002.safetensors",
427
+ "vision_tower.blocks.19.norm1.weight": "model-00001-of-00002.safetensors",
428
+ "vision_tower.blocks.19.norm2.weight": "model-00001-of-00002.safetensors",
429
+ "vision_tower.blocks.2.attn.proj.weight": "model-00001-of-00002.safetensors",
430
+ "vision_tower.blocks.2.attn.qkv.weight": "model-00001-of-00002.safetensors",
431
+ "vision_tower.blocks.2.mlp.fc1.weight": "model-00001-of-00002.safetensors",
432
+ "vision_tower.blocks.2.mlp.fc2.weight": "model-00001-of-00002.safetensors",
433
+ "vision_tower.blocks.2.mlp.fc3.weight": "model-00002-of-00002.safetensors",
434
+ "vision_tower.blocks.2.norm1.weight": "model-00002-of-00002.safetensors",
435
+ "vision_tower.blocks.2.norm2.weight": "model-00002-of-00002.safetensors",
436
+ "vision_tower.blocks.20.attn.proj.weight": "model-00002-of-00002.safetensors",
437
+ "vision_tower.blocks.20.attn.qkv.weight": "model-00002-of-00002.safetensors",
438
+ "vision_tower.blocks.20.mlp.fc1.weight": "model-00002-of-00002.safetensors",
439
+ "vision_tower.blocks.20.mlp.fc2.weight": "model-00002-of-00002.safetensors",
440
+ "vision_tower.blocks.20.mlp.fc3.weight": "model-00002-of-00002.safetensors",
441
+ "vision_tower.blocks.20.norm1.weight": "model-00002-of-00002.safetensors",
442
+ "vision_tower.blocks.20.norm2.weight": "model-00002-of-00002.safetensors",
443
+ "vision_tower.blocks.21.attn.proj.weight": "model-00002-of-00002.safetensors",
444
+ "vision_tower.blocks.21.attn.qkv.weight": "model-00002-of-00002.safetensors",
445
+ "vision_tower.blocks.21.mlp.fc1.weight": "model-00002-of-00002.safetensors",
446
+ "vision_tower.blocks.21.mlp.fc2.weight": "model-00002-of-00002.safetensors",
447
+ "vision_tower.blocks.21.mlp.fc3.weight": "model-00002-of-00002.safetensors",
448
+ "vision_tower.blocks.21.norm1.weight": "model-00002-of-00002.safetensors",
449
+ "vision_tower.blocks.21.norm2.weight": "model-00002-of-00002.safetensors",
450
+ "vision_tower.blocks.22.attn.proj.weight": "model-00002-of-00002.safetensors",
451
+ "vision_tower.blocks.22.attn.qkv.weight": "model-00002-of-00002.safetensors",
452
+ "vision_tower.blocks.22.mlp.fc1.weight": "model-00002-of-00002.safetensors",
453
+ "vision_tower.blocks.22.mlp.fc2.weight": "model-00002-of-00002.safetensors",
454
+ "vision_tower.blocks.22.mlp.fc3.weight": "model-00002-of-00002.safetensors",
455
+ "vision_tower.blocks.22.norm1.weight": "model-00002-of-00002.safetensors",
456
+ "vision_tower.blocks.22.norm2.weight": "model-00002-of-00002.safetensors",
457
+ "vision_tower.blocks.23.attn.proj.weight": "model-00002-of-00002.safetensors",
458
+ "vision_tower.blocks.23.attn.qkv.weight": "model-00002-of-00002.safetensors",
459
+ "vision_tower.blocks.23.mlp.fc1.weight": "model-00002-of-00002.safetensors",
460
+ "vision_tower.blocks.23.mlp.fc2.weight": "model-00002-of-00002.safetensors",
461
+ "vision_tower.blocks.23.mlp.fc3.weight": "model-00002-of-00002.safetensors",
462
+ "vision_tower.blocks.23.norm1.weight": "model-00002-of-00002.safetensors",
463
+ "vision_tower.blocks.23.norm2.weight": "model-00002-of-00002.safetensors",
464
+ "vision_tower.blocks.24.attn.proj.weight": "model-00002-of-00002.safetensors",
465
+ "vision_tower.blocks.24.attn.qkv.weight": "model-00002-of-00002.safetensors",
466
+ "vision_tower.blocks.24.mlp.fc1.weight": "model-00002-of-00002.safetensors",
467
+ "vision_tower.blocks.24.mlp.fc2.weight": "model-00002-of-00002.safetensors",
468
+ "vision_tower.blocks.24.mlp.fc3.weight": "model-00002-of-00002.safetensors",
469
+ "vision_tower.blocks.24.norm1.weight": "model-00002-of-00002.safetensors",
470
+ "vision_tower.blocks.24.norm2.weight": "model-00002-of-00002.safetensors",
471
+ "vision_tower.blocks.25.attn.proj.weight": "model-00002-of-00002.safetensors",
472
+ "vision_tower.blocks.25.attn.qkv.weight": "model-00002-of-00002.safetensors",
473
+ "vision_tower.blocks.25.mlp.fc1.weight": "model-00002-of-00002.safetensors",
474
+ "vision_tower.blocks.25.mlp.fc2.weight": "model-00002-of-00002.safetensors",
475
+ "vision_tower.blocks.25.mlp.fc3.weight": "model-00002-of-00002.safetensors",
476
+ "vision_tower.blocks.25.norm1.weight": "model-00002-of-00002.safetensors",
477
+ "vision_tower.blocks.25.norm2.weight": "model-00002-of-00002.safetensors",
478
+ "vision_tower.blocks.26.attn.proj.weight": "model-00002-of-00002.safetensors",
479
+ "vision_tower.blocks.26.attn.qkv.weight": "model-00002-of-00002.safetensors",
480
+ "vision_tower.blocks.26.mlp.fc1.weight": "model-00002-of-00002.safetensors",
481
+ "vision_tower.blocks.26.mlp.fc2.weight": "model-00002-of-00002.safetensors",
482
+ "vision_tower.blocks.26.mlp.fc3.weight": "model-00002-of-00002.safetensors",
483
+ "vision_tower.blocks.26.norm1.weight": "model-00002-of-00002.safetensors",
484
+ "vision_tower.blocks.26.norm2.weight": "model-00002-of-00002.safetensors",
485
+ "vision_tower.blocks.27.attn.proj.weight": "model-00002-of-00002.safetensors",
486
+ "vision_tower.blocks.27.attn.qkv.weight": "model-00002-of-00002.safetensors",
487
+ "vision_tower.blocks.27.mlp.fc1.weight": "model-00002-of-00002.safetensors",
488
+ "vision_tower.blocks.27.mlp.fc2.weight": "model-00002-of-00002.safetensors",
489
+ "vision_tower.blocks.27.mlp.fc3.weight": "model-00002-of-00002.safetensors",
490
+ "vision_tower.blocks.27.norm1.weight": "model-00002-of-00002.safetensors",
491
+ "vision_tower.blocks.27.norm2.weight": "model-00002-of-00002.safetensors",
492
+ "vision_tower.blocks.28.attn.proj.weight": "model-00002-of-00002.safetensors",
493
+ "vision_tower.blocks.28.attn.qkv.weight": "model-00002-of-00002.safetensors",
494
+ "vision_tower.blocks.28.mlp.fc1.weight": "model-00002-of-00002.safetensors",
495
+ "vision_tower.blocks.28.mlp.fc2.weight": "model-00002-of-00002.safetensors",
496
+ "vision_tower.blocks.28.mlp.fc3.weight": "model-00002-of-00002.safetensors",
497
+ "vision_tower.blocks.28.norm1.weight": "model-00002-of-00002.safetensors",
498
+ "vision_tower.blocks.28.norm2.weight": "model-00002-of-00002.safetensors",
499
+ "vision_tower.blocks.29.attn.proj.weight": "model-00002-of-00002.safetensors",
500
+ "vision_tower.blocks.29.attn.qkv.weight": "model-00002-of-00002.safetensors",
501
+ "vision_tower.blocks.29.mlp.fc1.weight": "model-00002-of-00002.safetensors",
502
+ "vision_tower.blocks.29.mlp.fc2.weight": "model-00002-of-00002.safetensors",
503
+ "vision_tower.blocks.29.mlp.fc3.weight": "model-00002-of-00002.safetensors",
504
+ "vision_tower.blocks.29.norm1.weight": "model-00002-of-00002.safetensors",
505
+ "vision_tower.blocks.29.norm2.weight": "model-00002-of-00002.safetensors",
506
+ "vision_tower.blocks.3.attn.proj.weight": "model-00002-of-00002.safetensors",
507
+ "vision_tower.blocks.3.attn.qkv.weight": "model-00002-of-00002.safetensors",
508
+ "vision_tower.blocks.3.mlp.fc1.weight": "model-00002-of-00002.safetensors",
509
+ "vision_tower.blocks.3.mlp.fc2.weight": "model-00002-of-00002.safetensors",
510
+ "vision_tower.blocks.3.mlp.fc3.weight": "model-00002-of-00002.safetensors",
511
+ "vision_tower.blocks.3.norm1.weight": "model-00002-of-00002.safetensors",
512
+ "vision_tower.blocks.3.norm2.weight": "model-00002-of-00002.safetensors",
513
+ "vision_tower.blocks.30.attn.proj.weight": "model-00002-of-00002.safetensors",
514
+ "vision_tower.blocks.30.attn.qkv.weight": "model-00002-of-00002.safetensors",
515
+ "vision_tower.blocks.30.mlp.fc1.weight": "model-00002-of-00002.safetensors",
516
+ "vision_tower.blocks.30.mlp.fc2.weight": "model-00002-of-00002.safetensors",
517
+ "vision_tower.blocks.30.mlp.fc3.weight": "model-00002-of-00002.safetensors",
518
+ "vision_tower.blocks.30.norm1.weight": "model-00002-of-00002.safetensors",
519
+ "vision_tower.blocks.30.norm2.weight": "model-00002-of-00002.safetensors",
520
+ "vision_tower.blocks.31.attn.proj.weight": "model-00002-of-00002.safetensors",
521
+ "vision_tower.blocks.31.attn.qkv.weight": "model-00002-of-00002.safetensors",
522
+ "vision_tower.blocks.31.mlp.fc1.weight": "model-00002-of-00002.safetensors",
523
+ "vision_tower.blocks.31.mlp.fc2.weight": "model-00002-of-00002.safetensors",
524
+ "vision_tower.blocks.31.mlp.fc3.weight": "model-00002-of-00002.safetensors",
525
+ "vision_tower.blocks.31.norm1.weight": "model-00002-of-00002.safetensors",
526
+ "vision_tower.blocks.31.norm2.weight": "model-00002-of-00002.safetensors",
527
+ "vision_tower.blocks.32.attn.proj.weight": "model-00002-of-00002.safetensors",
528
+ "vision_tower.blocks.32.attn.qkv.weight": "model-00002-of-00002.safetensors",
529
+ "vision_tower.blocks.32.mlp.fc1.weight": "model-00002-of-00002.safetensors",
530
+ "vision_tower.blocks.32.mlp.fc2.weight": "model-00002-of-00002.safetensors",
531
+ "vision_tower.blocks.32.mlp.fc3.weight": "model-00002-of-00002.safetensors",
532
+ "vision_tower.blocks.32.norm1.weight": "model-00002-of-00002.safetensors",
533
+ "vision_tower.blocks.32.norm2.weight": "model-00002-of-00002.safetensors",
534
+ "vision_tower.blocks.33.attn.proj.weight": "model-00002-of-00002.safetensors",
535
+ "vision_tower.blocks.33.attn.qkv.weight": "model-00002-of-00002.safetensors",
536
+ "vision_tower.blocks.33.mlp.fc1.weight": "model-00002-of-00002.safetensors",
537
+ "vision_tower.blocks.33.mlp.fc2.weight": "model-00002-of-00002.safetensors",
538
+ "vision_tower.blocks.33.mlp.fc3.weight": "model-00002-of-00002.safetensors",
539
+ "vision_tower.blocks.33.norm1.weight": "model-00002-of-00002.safetensors",
540
+ "vision_tower.blocks.33.norm2.weight": "model-00002-of-00002.safetensors",
541
+ "vision_tower.blocks.34.attn.proj.weight": "model-00002-of-00002.safetensors",
542
+ "vision_tower.blocks.34.attn.qkv.weight": "model-00002-of-00002.safetensors",
543
+ "vision_tower.blocks.34.mlp.fc1.weight": "model-00002-of-00002.safetensors",
544
+ "vision_tower.blocks.34.mlp.fc2.weight": "model-00002-of-00002.safetensors",
545
+ "vision_tower.blocks.34.mlp.fc3.weight": "model-00002-of-00002.safetensors",
546
+ "vision_tower.blocks.34.norm1.weight": "model-00002-of-00002.safetensors",
547
+ "vision_tower.blocks.34.norm2.weight": "model-00002-of-00002.safetensors",
548
+ "vision_tower.blocks.35.attn.proj.weight": "model-00002-of-00002.safetensors",
549
+ "vision_tower.blocks.35.attn.qkv.weight": "model-00002-of-00002.safetensors",
550
+ "vision_tower.blocks.35.mlp.fc1.weight": "model-00002-of-00002.safetensors",
551
+ "vision_tower.blocks.35.mlp.fc2.weight": "model-00002-of-00002.safetensors",
552
+ "vision_tower.blocks.35.mlp.fc3.weight": "model-00002-of-00002.safetensors",
553
+ "vision_tower.blocks.35.norm1.weight": "model-00002-of-00002.safetensors",
554
+ "vision_tower.blocks.35.norm2.weight": "model-00002-of-00002.safetensors",
555
+ "vision_tower.blocks.36.attn.proj.weight": "model-00002-of-00002.safetensors",
556
+ "vision_tower.blocks.36.attn.qkv.weight": "model-00002-of-00002.safetensors",
557
+ "vision_tower.blocks.36.mlp.fc1.weight": "model-00002-of-00002.safetensors",
558
+ "vision_tower.blocks.36.mlp.fc2.weight": "model-00002-of-00002.safetensors",
559
+ "vision_tower.blocks.36.mlp.fc3.weight": "model-00002-of-00002.safetensors",
560
+ "vision_tower.blocks.36.norm1.weight": "model-00002-of-00002.safetensors",
561
+ "vision_tower.blocks.36.norm2.weight": "model-00002-of-00002.safetensors",
562
+ "vision_tower.blocks.37.attn.proj.weight": "model-00002-of-00002.safetensors",
563
+ "vision_tower.blocks.37.attn.qkv.weight": "model-00002-of-00002.safetensors",
564
+ "vision_tower.blocks.37.mlp.fc1.weight": "model-00002-of-00002.safetensors",
565
+ "vision_tower.blocks.37.mlp.fc2.weight": "model-00002-of-00002.safetensors",
566
+ "vision_tower.blocks.37.mlp.fc3.weight": "model-00002-of-00002.safetensors",
567
+ "vision_tower.blocks.37.norm1.weight": "model-00002-of-00002.safetensors",
568
+ "vision_tower.blocks.37.norm2.weight": "model-00002-of-00002.safetensors",
569
+ "vision_tower.blocks.38.attn.proj.weight": "model-00002-of-00002.safetensors",
570
+ "vision_tower.blocks.38.attn.qkv.weight": "model-00002-of-00002.safetensors",
571
+ "vision_tower.blocks.38.mlp.fc1.weight": "model-00002-of-00002.safetensors",
572
+ "vision_tower.blocks.38.mlp.fc2.weight": "model-00002-of-00002.safetensors",
573
+ "vision_tower.blocks.38.mlp.fc3.weight": "model-00002-of-00002.safetensors",
574
+ "vision_tower.blocks.38.norm1.weight": "model-00002-of-00002.safetensors",
575
+ "vision_tower.blocks.38.norm2.weight": "model-00002-of-00002.safetensors",
576
+ "vision_tower.blocks.39.attn.proj.weight": "model-00002-of-00002.safetensors",
577
+ "vision_tower.blocks.39.attn.qkv.weight": "model-00002-of-00002.safetensors",
578
+ "vision_tower.blocks.39.mlp.fc1.weight": "model-00002-of-00002.safetensors",
579
+ "vision_tower.blocks.39.mlp.fc2.weight": "model-00002-of-00002.safetensors",
580
+ "vision_tower.blocks.39.mlp.fc3.weight": "model-00002-of-00002.safetensors",
581
+ "vision_tower.blocks.39.norm1.weight": "model-00002-of-00002.safetensors",
582
+ "vision_tower.blocks.39.norm2.weight": "model-00002-of-00002.safetensors",
583
+ "vision_tower.blocks.4.attn.proj.weight": "model-00002-of-00002.safetensors",
584
+ "vision_tower.blocks.4.attn.qkv.weight": "model-00002-of-00002.safetensors",
585
+ "vision_tower.blocks.4.mlp.fc1.weight": "model-00002-of-00002.safetensors",
586
+ "vision_tower.blocks.4.mlp.fc2.weight": "model-00002-of-00002.safetensors",
587
+ "vision_tower.blocks.4.mlp.fc3.weight": "model-00002-of-00002.safetensors",
588
+ "vision_tower.blocks.4.norm1.weight": "model-00002-of-00002.safetensors",
589
+ "vision_tower.blocks.4.norm2.weight": "model-00002-of-00002.safetensors",
590
+ "vision_tower.blocks.40.attn.proj.weight": "model-00002-of-00002.safetensors",
591
+ "vision_tower.blocks.40.attn.qkv.weight": "model-00002-of-00002.safetensors",
592
+ "vision_tower.blocks.40.mlp.fc1.weight": "model-00002-of-00002.safetensors",
593
+ "vision_tower.blocks.40.mlp.fc2.weight": "model-00002-of-00002.safetensors",
594
+ "vision_tower.blocks.40.mlp.fc3.weight": "model-00002-of-00002.safetensors",
595
+ "vision_tower.blocks.40.norm1.weight": "model-00002-of-00002.safetensors",
596
+ "vision_tower.blocks.40.norm2.weight": "model-00002-of-00002.safetensors",
597
+ "vision_tower.blocks.41.attn.proj.weight": "model-00002-of-00002.safetensors",
598
+ "vision_tower.blocks.41.attn.qkv.weight": "model-00002-of-00002.safetensors",
599
+ "vision_tower.blocks.41.mlp.fc1.weight": "model-00002-of-00002.safetensors",
600
+ "vision_tower.blocks.41.mlp.fc2.weight": "model-00002-of-00002.safetensors",
601
+ "vision_tower.blocks.41.mlp.fc3.weight": "model-00002-of-00002.safetensors",
602
+ "vision_tower.blocks.41.norm1.weight": "model-00002-of-00002.safetensors",
603
+ "vision_tower.blocks.41.norm2.weight": "model-00002-of-00002.safetensors",
604
+ "vision_tower.blocks.5.attn.proj.weight": "model-00002-of-00002.safetensors",
605
+ "vision_tower.blocks.5.attn.qkv.weight": "model-00002-of-00002.safetensors",
606
+ "vision_tower.blocks.5.mlp.fc1.weight": "model-00002-of-00002.safetensors",
607
+ "vision_tower.blocks.5.mlp.fc2.weight": "model-00002-of-00002.safetensors",
608
+ "vision_tower.blocks.5.mlp.fc3.weight": "model-00002-of-00002.safetensors",
609
+ "vision_tower.blocks.5.norm1.weight": "model-00002-of-00002.safetensors",
610
+ "vision_tower.blocks.5.norm2.weight": "model-00002-of-00002.safetensors",
611
+ "vision_tower.blocks.6.attn.proj.weight": "model-00002-of-00002.safetensors",
612
+ "vision_tower.blocks.6.attn.qkv.weight": "model-00002-of-00002.safetensors",
613
+ "vision_tower.blocks.6.mlp.fc1.weight": "model-00002-of-00002.safetensors",
614
+ "vision_tower.blocks.6.mlp.fc2.weight": "model-00002-of-00002.safetensors",
615
+ "vision_tower.blocks.6.mlp.fc3.weight": "model-00002-of-00002.safetensors",
616
+ "vision_tower.blocks.6.norm1.weight": "model-00002-of-00002.safetensors",
617
+ "vision_tower.blocks.6.norm2.weight": "model-00002-of-00002.safetensors",
618
+ "vision_tower.blocks.7.attn.proj.weight": "model-00002-of-00002.safetensors",
619
+ "vision_tower.blocks.7.attn.qkv.weight": "model-00002-of-00002.safetensors",
620
+ "vision_tower.blocks.7.mlp.fc1.weight": "model-00002-of-00002.safetensors",
621
+ "vision_tower.blocks.7.mlp.fc2.weight": "model-00002-of-00002.safetensors",
622
+ "vision_tower.blocks.7.mlp.fc3.weight": "model-00002-of-00002.safetensors",
623
+ "vision_tower.blocks.7.norm1.weight": "model-00002-of-00002.safetensors",
624
+ "vision_tower.blocks.7.norm2.weight": "model-00002-of-00002.safetensors",
625
+ "vision_tower.blocks.8.attn.proj.weight": "model-00002-of-00002.safetensors",
626
+ "vision_tower.blocks.8.attn.qkv.weight": "model-00002-of-00002.safetensors",
627
+ "vision_tower.blocks.8.mlp.fc1.weight": "model-00002-of-00002.safetensors",
628
+ "vision_tower.blocks.8.mlp.fc2.weight": "model-00002-of-00002.safetensors",
629
+ "vision_tower.blocks.8.mlp.fc3.weight": "model-00002-of-00002.safetensors",
630
+ "vision_tower.blocks.8.norm1.weight": "model-00002-of-00002.safetensors",
631
+ "vision_tower.blocks.8.norm2.weight": "model-00002-of-00002.safetensors",
632
+ "vision_tower.blocks.9.attn.proj.weight": "model-00002-of-00002.safetensors",
633
+ "vision_tower.blocks.9.attn.qkv.weight": "model-00002-of-00002.safetensors",
634
+ "vision_tower.blocks.9.mlp.fc1.weight": "model-00002-of-00002.safetensors",
635
+ "vision_tower.blocks.9.mlp.fc2.weight": "model-00002-of-00002.safetensors",
636
+ "vision_tower.blocks.9.mlp.fc3.weight": "model-00002-of-00002.safetensors",
637
+ "vision_tower.blocks.9.norm1.weight": "model-00002-of-00002.safetensors",
638
+ "vision_tower.blocks.9.norm2.weight": "model-00002-of-00002.safetensors",
639
+ "vision_tower.merger.ln_q.bias": "model-00002-of-00002.safetensors",
640
+ "vision_tower.merger.ln_q.weight": "model-00002-of-00002.safetensors",
641
+ "vision_tower.merger.mlp.0.bias": "model-00002-of-00002.safetensors",
642
+ "vision_tower.merger.mlp.0.weight": "model-00002-of-00002.safetensors",
643
+ "vision_tower.merger.mlp.2.bias": "model-00002-of-00002.safetensors",
644
+ "vision_tower.merger.mlp.2.weight": "model-00002-of-00002.safetensors",
645
+ "vision_tower.patch_embed.patchifier.norm.weight": "model-00002-of-00002.safetensors",
646
+ "vision_tower.patch_embed.patchifier.proj.bias": "model-00002-of-00002.safetensors",
647
+ "vision_tower.patch_embed.patchifier.proj.weight": "model-00002-of-00002.safetensors",
648
+ "vision_tower.post_trunk_norm.weight": "model-00002-of-00002.safetensors"
649
+ }
650
+ }
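
The `weight_map` above assigns each parameter name to one of the two safetensors shards. A minimal sketch (assuming the index above is saved as `model.safetensors.index.json` next to the shards in a local checkpoint directory) of grouping parameters per shard before loading:

```python
import json
from collections import defaultdict

# Hypothetical local path to the downloaded checkpoint directory.
index_path = "dots.ocr/model.safetensors.index.json"

with open(index_path) as f:
    index = json.load(f)

# weight_map: parameter name -> shard file (e.g. "model-00002-of-00002.safetensors")
shards = defaultdict(list)
for name, shard_file in index["weight_map"].items():
    shards[shard_file].append(name)

for shard_file, names in sorted(shards.items()):
    print(f"{shard_file}: {len(names)} tensors")
```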
modeling_dots_ocr.py ADDED
@@ -0,0 +1,131 @@
1
+ from typing import List, Optional, Tuple, Union
2
+
3
+ import torch
4
+ from transformers.modeling_outputs import CausalLMOutputWithPast
5
+ from transformers.models.qwen2 import Qwen2ForCausalLM
6
+
7
+ from .configuration_dots import DotsVisionConfig, DotsOCRConfig
8
+ from .modeling_dots_vision import DotsVisionTransformer
9
+
10
+
11
+ DOTS_VLM_MAX_IMAGES = 200
12
+
13
+
14
+ class DotsOCRForCausalLM(Qwen2ForCausalLM):
15
+ config_class = DotsOCRConfig
16
+
17
+ def __init__(self, config: DotsOCRConfig):
18
+ super().__init__(config)
19
+
20
+ if isinstance(self.config.vision_config, dict):
21
+ vision_config = DotsVisionConfig(**self.config.vision_config)
22
+ self.config.vision_config = vision_config
23
+ else:
24
+ vision_config = self.config.vision_config
25
+
26
+ self.vision_tower = DotsVisionTransformer(vision_config)
27
+
28
+ def prepare_inputs_embeds(
29
+ self,
30
+ input_ids: torch.LongTensor,
31
+ pixel_values: Optional[torch.FloatTensor] = None,
32
+ grid_thw: Optional[torch.FloatTensor] = None,
33
+ img_mask: Optional[torch.BoolTensor] = None,
34
+ ) -> torch.Tensor:
35
+ inputs_embeds = self.get_input_embeddings()(input_ids)
36
+
37
+ if pixel_values is not None:
38
+ assert img_mask is not None
39
+ if grid_thw.shape[0] > DOTS_VLM_MAX_IMAGES:
40
+ print(
41
+ f"Number of images exceeds limit: {grid_thw.shape[0]} > {DOTS_VLM_MAX_IMAGES}, which may cause FSDP to hang"
42
+ )
43
+
44
+ vision_embeddings = self.vision_tower(pixel_values, grid_thw)
45
+
46
+ true_indices = torch.nonzero(img_mask).squeeze()
47
+ if len(true_indices) > vision_embeddings.size(0):
48
+ print(
49
+ f"img_mask sum > VE and will be truncated, mask.sum()={len(true_indices)} {vision_embeddings.size(0)=}"
50
+ )
51
+ true_indices = true_indices[: vision_embeddings.size(0)]
52
+ new_img_mask = torch.zeros_like(img_mask, device=img_mask.device)
53
+ new_img_mask[true_indices[:, 0], true_indices[:, 1]] = True
54
+ else:
55
+ new_img_mask = img_mask
56
+
57
+ assert (
58
+ vision_embeddings.size(0) == new_img_mask.sum()
59
+ ), f"{vision_embeddings.size(0)=}, {new_img_mask.sum()=}"
60
+
61
+ inputs_embeds = inputs_embeds.masked_scatter(
62
+ new_img_mask.to(inputs_embeds.device).unsqueeze(-1).expand_as(inputs_embeds),
63
+ vision_embeddings.to(inputs_embeds.device).type(inputs_embeds.dtype),
64
+ )
65
+
66
+ return inputs_embeds
67
+
68
+ def forward(
69
+ self,
70
+ input_ids: torch.LongTensor,
71
+ pixel_values: Optional[torch.FloatTensor] = None,
72
+ image_grid_thw: Optional[torch.FloatTensor] = None,
73
+ inputs_embeds: Optional[torch.Tensor] = None,
74
+ attention_mask: Optional[torch.Tensor] = None,
75
+ position_ids: Optional[torch.LongTensor] = None,
76
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
77
+ labels: Optional[torch.LongTensor] = None,
78
+ output_attentions: Optional[bool] = None,
79
+ output_hidden_states: Optional[bool] = None,
80
+ return_dict: Optional[bool] = None,
81
+ use_cache: Optional[bool] = None,
82
+ logits_to_keep: int = 0,
83
+ **loss_kwargs,
84
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
85
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
86
+ assert len(input_ids) >= 1, f"empty input_ids {input_ids.shape=} will cause gradnorm nan"
87
+ if inputs_embeds is None:
88
+ img_mask = input_ids == self.config.image_token_id
89
+ inputs_embeds = self.prepare_inputs_embeds(input_ids, pixel_values, image_grid_thw, img_mask)
90
+
91
+ outputs = super().forward(
92
+ inputs_embeds=inputs_embeds,
93
+ attention_mask=attention_mask,
94
+ position_ids=position_ids,
95
+ past_key_values=past_key_values,
96
+ labels=labels,
97
+ use_cache=use_cache if use_cache is not None else self.config.use_cache,
98
+ output_attentions=output_attentions,
99
+ output_hidden_states=output_hidden_states,
100
+ # return_dict=return_dict,
101
+ logits_to_keep=logits_to_keep,
102
+ **loss_kwargs,
103
+ )
104
+
105
+ return outputs
106
+
107
+ def prepare_inputs_for_generation(
108
+ self,
109
+ input_ids,
110
+ past_key_values=None,
111
+ inputs_embeds=None,
112
+ pixel_values=None,
113
+ attention_mask=None,
114
+ cache_position=None,
115
+ num_logits_to_keep=None,
116
+ **kwargs,
117
+ ):
118
+ model_inputs = super().prepare_inputs_for_generation(
119
+ input_ids,
120
+ past_key_values=past_key_values,
121
+ inputs_embeds=inputs_embeds,
122
+ attention_mask=attention_mask,
123
+ cache_position=cache_position,
124
+ num_logits_to_keep=num_logits_to_keep,
125
+ **kwargs,
126
+ )
127
+
128
+ if cache_position[0] == 0:
129
+ model_inputs["pixel_values"] = pixel_values
130
+
131
+ return model_inputs
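
A minimal usage sketch for the Hugging Face path defined above. This is not the official inference script: the repository id, prompt format, dtype, and the assumption that the bundled Qwen2-VL style processor is exposed via `AutoProcessor` are all illustrative.

```python
# A minimal sketch, assuming the repo id below and that trust_remote_code
# resolves modeling_dots_ocr.DotsOCRForCausalLM and the bundled processor.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rednote-hilab/dots.ocr"  # assumed repository id
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("page.png")  # hypothetical input document image
# The image placeholder mirrors the special tokens defined in tokenizer_config.json.
prompt = "<|user|><|img|><|imgpad|><|endofimg|>Extract the text in this image.<|endofuser|><|assistant|>"

inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```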
modeling_dots_ocr_vllm.py ADDED
@@ -0,0 +1,451 @@
1
+ from functools import cached_property
2
+ from typing import Iterable, Literal, Mapping, Optional, Set, Tuple, TypedDict, Union
3
+
4
+ import torch
5
+ import torch.nn as nn
6
+ from transformers.models.qwen2_vl import Qwen2VLImageProcessor, Qwen2VLProcessor
7
+ from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize
8
+ from vllm import ModelRegistry
9
+ from vllm.config import VllmConfig
10
+ from vllm.model_executor.layers.sampler import SamplerOutput, get_sampler
11
+ from vllm.model_executor.models.interfaces import MultiModalEmbeddings, SupportsMultiModal
12
+ from vllm.model_executor.models.qwen2 import Qwen2ForCausalLM
13
+ from vllm.model_executor.models.qwen2_5_vl import (
14
+ Qwen2_5_VLMultiModalProcessor,
15
+ Qwen2_5_VLProcessingInfo,
16
+ )
17
+ from vllm.model_executor.models.qwen2_vl import Qwen2VLDummyInputsBuilder
18
+ from vllm.model_executor.models.utils import (
19
+ AutoWeightsLoader,
20
+ WeightsMapper,
21
+ init_vllm_registered_model,
22
+ maybe_prefix,
23
+ merge_multimodal_embeddings,
24
+ )
25
+ from vllm.model_executor.sampling_metadata import SamplingMetadata
26
+ from vllm.multimodal import MULTIMODAL_REGISTRY
27
+ from vllm.multimodal.inputs import MultiModalDataDict
28
+ from vllm.multimodal.parse import ImageSize
29
+ from vllm.sequence import IntermediateTensors
30
+
31
+ from .configuration_dots import DotsVisionConfig
32
+ from .configuration_dots import DotsOCRConfig
33
+ from .modeling_dots_vision import DotsVisionTransformer
34
+
35
+
36
+ class DotsOCRImagePixelInputs(TypedDict):
37
+ type: Literal["pixel_values", "image_grid_thw"]
38
+
39
+ pixel_values: torch.Tensor
40
+ image_grid_thw: torch.Tensor
41
+
42
+
43
+ class DotsOCRImageEmbeddingInputs(TypedDict):
44
+ type: Literal["image_embeds", "image_grid_thw"]
45
+ image_embeds: torch.Tensor
46
+ """Supported types:
47
+ - List[`torch.Tensor`]: A list of tensors holding all images' features.
48
+ Each tensor holds an image's features.
49
+ - `torch.Tensor`: A tensor holding all images' features
50
+ (concatenation of all images' feature tensors).
51
+
52
+ Tensor shape: `(num_image_features, hidden_size)`
53
+ - `num_image_features` varies based on
54
+ the number and resolution of the images.
55
+ - `hidden_size` must match the hidden size of language model backbone.
56
+ """
57
+
58
+ image_grid_thw: torch.Tensor
59
+
60
+
61
+ DotsOCRImageInputs = Union[DotsOCRImagePixelInputs, DotsOCRImageEmbeddingInputs]
62
+
63
+
64
+ class DotsOCRMultiModalProcessor(Qwen2_5_VLMultiModalProcessor):
65
+ pass
66
+
67
+
68
+ class DotsOCRDummyInputsBuilder(Qwen2VLDummyInputsBuilder):
69
+ def get_dummy_mm_data(
70
+ self,
71
+ seq_len: int,
72
+ mm_counts: Mapping[str, int],
73
+ ) -> MultiModalDataDict:
74
+ num_images = mm_counts.get("image", 0)
75
+
76
+ target_width, target_height = self.info.get_image_size_with_most_features()
77
+
78
+ return {
79
+ "image": self._get_dummy_images(width=target_width, height=target_height, num_images=num_images),
80
+ }
81
+
82
+
83
+ class DotsOCRProcessingInfo(Qwen2_5_VLProcessingInfo):
84
+ def get_hf_config(self) -> DotsOCRConfig:
85
+ config = self.ctx.get_hf_config()
86
+ if not config.__class__.__name__ == 'DotsOCRConfig':
87
+ raise TypeError(f"Expected DotsOCRConfig, got {type(config)}")
88
+
89
+ if hasattr(config, "vision_config") and isinstance(config.vision_config, dict):
90
+ config.vision_config = DotsVisionConfig(**config.vision_config)
91
+
92
+ return config
93
+
94
+ def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]:
95
+ return {"image": None, "video": 0}
96
+
97
+ def get_mm_max_tokens_per_item(
98
+ self,
99
+ seq_len: int,
100
+ mm_counts: Mapping[str, int],
101
+ ) -> Mapping[str, int]:
102
+ max_image_tokens = self.get_max_image_tokens()
103
+ return {"image": max_image_tokens, "video": 0}
104
+
105
+ def get_hf_processor(
106
+ self,
107
+ *,
108
+ min_pixels: Optional[int] = None,
109
+ max_pixels: Optional[int] = None,
110
+ size: Optional[dict[str, int]] = None,
111
+ **kwargs: object,
112
+ ) -> Qwen2VLProcessor:
113
+ self.get_tokenizer().image_token = "<|imgpad|>" # Ensure image token is set
114
+ processor = self.ctx.get_hf_processor(
115
+ Qwen2VLProcessor,
116
+ image_processor=self.get_image_processor(min_pixels=min_pixels, max_pixels=max_pixels, size=size),
117
+ **kwargs,
118
+ )
119
+ processor.image_token = "<|imgpad|>"
120
+ processor.video_token = "<|video_pad|>"
121
+ return processor
122
+
123
+ def _get_vision_info(
124
+ self,
125
+ *,
126
+ image_width: int,
127
+ image_height: int,
128
+ num_frames: int = 1,
129
+ do_resize: bool = True,
130
+ image_processor: Optional[Qwen2VLImageProcessor],
131
+ ) -> tuple[ImageSize, int]:
132
+ if image_processor is None:
133
+ image_processor = self.get_image_processor()
134
+
135
+ hf_config: DotsOCRConfig = self.get_hf_config()
136
+ vision_config = hf_config.vision_config
137
+ patch_size = vision_config.patch_size
138
+ merge_size = vision_config.spatial_merge_size
139
+ temporal_patch_size = vision_config.temporal_patch_size
140
+
141
+ if do_resize:
142
+ resized_height, resized_width = smart_resize(
143
+ height=image_height,
144
+ width=image_width,
145
+ factor=patch_size * merge_size,
146
+ min_pixels=image_processor.min_pixels,
147
+ max_pixels=image_processor.max_pixels,
148
+ )
149
+ preprocessed_size = ImageSize(width=resized_width, height=resized_height)
150
+ else:
151
+ preprocessed_size = ImageSize(width=image_width, height=image_height)
152
+
153
+ # NOTE: Frames are padded to be divisible by `temporal_patch_size`
154
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L294
155
+ padded_num_frames = num_frames + num_frames % temporal_patch_size
156
+
157
+ grid_t = max(padded_num_frames // temporal_patch_size, 1)
158
+ grid_h = preprocessed_size.height // patch_size
159
+ grid_w = preprocessed_size.width // patch_size
160
+
161
+ num_patches = grid_t * grid_h * grid_w
162
+ num_vision_tokens = num_patches // (merge_size**2)
163
+
164
+ return preprocessed_size, num_vision_tokens
165
+
166
+
167
+ @MULTIMODAL_REGISTRY.register_processor(
168
+ Qwen2_5_VLMultiModalProcessor,
169
+ info=DotsOCRProcessingInfo,
170
+ dummy_inputs=DotsOCRDummyInputsBuilder,
171
+ )
172
+ class DotsOCRForCausalLM(nn.Module, SupportsMultiModal):
173
+ hf_to_vllm_mapper = WeightsMapper(
174
+ orig_to_new_prefix={
175
+ "lm_head.": "language_model.lm_head.",
176
+ "model.": "language_model.model.",
177
+ }
178
+ )
179
+ _tp_plan = {}
180
+
181
+ @classmethod
182
+ def get_placeholder_str(cls, modality: str, i: int) -> Optional[str]:
183
+ if modality in ("image",):
184
+ return "<|img|><|imgpad|><|endofimg|>"
185
+
186
+ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
187
+ super().__init__()
188
+
189
+ self.config: DotsOCRConfig = vllm_config.model_config.hf_config
190
+ self.quant_config = vllm_config.quant_config
191
+ self.multimodal_config = vllm_config.model_config.multimodal_config
192
+
193
+ if isinstance(self.config.vision_config, dict):
194
+ vision_config = DotsVisionConfig(**self.config.vision_config)
195
+ self.config.vision_config = vision_config
196
+ else:
197
+ vision_config = self.config.vision_config
198
+
199
+ self.vision_tower = DotsVisionTransformer(vision_config)
200
+ self.language_model: Qwen2ForCausalLM = init_vllm_registered_model(
201
+ vllm_config=vllm_config,
202
+ hf_config=self.config,
203
+ prefix=maybe_prefix(prefix, "language_model"),
204
+ architectures=["Qwen2ForCausalLM"],
205
+ )
206
+
207
+ @cached_property
208
+ def sampler(self):
209
+ if hasattr(self.language_model, "sampler"):
210
+ return self.language_model.sampler
211
+
212
+ return get_sampler()
213
+
214
+ def _validate_and_reshape_mm_tensor(self, mm_input: object, name: str) -> torch.Tensor:
215
+ if not isinstance(mm_input, (torch.Tensor, list)):
216
+ raise ValueError(f"Incorrect type of {name}. " f"Got type: {type(mm_input)}")
217
+ if isinstance(mm_input, torch.Tensor):
218
+ if mm_input.ndim == 2:
219
+ return mm_input
220
+ if mm_input.ndim != 3:
221
+ raise ValueError(
222
+ f"{name} should be 2D or batched 3D tensor. "
223
+ f"Got ndim: {mm_input.ndim} "
224
+ f"(shape={mm_input.shape})"
225
+ )
226
+ return torch.concat(list(mm_input))
227
+ else:
228
+ return torch.concat(mm_input)
229
+
230
+ def _parse_and_validate_image_input(self, **kwargs: object) -> Optional[DotsOCRImageInputs]:
231
+ pixel_values = kwargs.pop("pixel_values", None)
232
+ image_embeds = kwargs.pop("image_embeds", None)
233
+ image_grid_thw = kwargs.pop("image_grid_thw", None)
234
+
235
+ if pixel_values is None and image_embeds is None:
236
+ return None
237
+
238
+ if pixel_values is not None:
239
+ pixel_values = self._validate_and_reshape_mm_tensor(pixel_values, "image pixel values")
240
+ image_grid_thw = self._validate_and_reshape_mm_tensor(image_grid_thw, "image grid_thw")
241
+
242
+ if not isinstance(pixel_values, (torch.Tensor, list)):
243
+ raise ValueError("Incorrect type of image pixel values. " f"Got type: {type(pixel_values)}")
244
+
245
+ return DotsOCRImagePixelInputs(
246
+ type="pixel_values", pixel_values=pixel_values, image_grid_thw=image_grid_thw
247
+ )
248
+
249
+ if image_embeds is not None:
250
+ image_embeds = self._validate_and_reshape_mm_tensor(image_embeds, "image embeds")
251
+ image_grid_thw = self._validate_and_reshape_mm_tensor(image_grid_thw, "image grid_thw")
252
+
253
+ if not isinstance(image_embeds, torch.Tensor):
254
+ raise ValueError("Incorrect type of image embeddings. " f"Got type: {type(image_embeds)}")
255
+ return DotsOCRImageEmbeddingInputs(
256
+ type="image_embeds", image_embeds=image_embeds, image_grid_thw=image_grid_thw
257
+ )
258
+
259
+ def vision_forward(self, pixel_values: torch.Tensor, image_grid_thw: torch.Tensor):
260
+ from vllm.distributed import (
261
+ get_tensor_model_parallel_group,
262
+ get_tensor_model_parallel_rank,
263
+ get_tensor_model_parallel_world_size,
264
+ )
265
+
266
+ assert self.vision_tower is not None
267
+
268
+ tp_rank = get_tensor_model_parallel_rank()
269
+ tp = get_tensor_model_parallel_world_size()
270
+
271
+ image_grid_thw_chunk = image_grid_thw.chunk(tp)
272
+ image_sizes_consum = torch.tensor([i.prod(-1).sum() for i in image_grid_thw_chunk]).cumsum(dim=0)
273
+ merge_size_square = self.vision_tower.config.spatial_merge_size**2
274
+ image_embedding = torch.zeros(
275
+ (
276
+ pixel_values.shape[0] // merge_size_square,
277
+ self.vision_tower.config.hidden_size,
278
+ ),
279
+ device=pixel_values.device,
280
+ dtype=pixel_values.dtype,
281
+ )
282
+
283
+ if tp_rank < len(image_sizes_consum):
284
+ idx_start = 0 if tp_rank == 0 else image_sizes_consum[tp_rank - 1].item()
285
+ idx_end = image_sizes_consum[tp_rank].item()
286
+ pixel_values_part = pixel_values[idx_start:idx_end]
287
+ image_grid_thw_part = image_grid_thw_chunk[tp_rank]
288
+ image_embedding_part = self.vision_tower(pixel_values_part, image_grid_thw_part)
289
+ image_embedding[idx_start // merge_size_square : idx_end // merge_size_square] = image_embedding_part
290
+
291
+ group = get_tensor_model_parallel_group().device_group
292
+ torch.distributed.all_reduce(image_embedding, group=group)
293
+ return image_embedding
294
+
295
+ def _process_image_input(self, image_input: DotsOCRImageInputs) -> tuple[torch.Tensor, ...]:
296
+ grid_thw = image_input["image_grid_thw"]
297
+ assert grid_thw.ndim == 2
298
+
299
+ if image_input["type"] == "image_embeds":
300
+ image_embeds = image_input["image_embeds"].type(self.vision_tower.dtype)
301
+ else:
302
+ pixel_values = image_input["pixel_values"].type(self.vision_tower.dtype)
303
+ image_embeds = self.vision_forward(pixel_values, grid_thw)[
304
+ :, : self.config.hidden_size
305
+ ]
306
+
307
+ # Split concatenated embeddings for each image item.
308
+ merge_size = self.vision_tower.config.spatial_merge_size
309
+ sizes = grid_thw.prod(-1) // merge_size // merge_size
310
+
311
+ return image_embeds.split(sizes.tolist())
312
+
313
+ def _parse_and_validate_multimodal_inputs(self, **kwargs: object) -> dict:
314
+ modalities = {}
315
+
316
+ # Preserve the order of modalities if there are multiple of them
317
+ # from the order of kwargs.
318
+ for input_key in kwargs:
319
+ if input_key in ("pixel_values", "image_embeds") and "images" not in modalities:
320
+ modalities["images"] = self._parse_and_validate_image_input(**kwargs)
321
+ return modalities
322
+
323
+ def get_language_model(self) -> torch.nn.Module:
324
+ return self.language_model
325
+
326
+ def get_multimodal_embeddings(self, **kwargs: object) -> Optional[MultiModalEmbeddings]:
327
+ modalities = self._parse_and_validate_multimodal_inputs(**kwargs)
328
+ if not modalities:
329
+ return None
330
+
331
+ # The result multimodal_embeddings is tuple of tensors, with each
332
+ # tensor corresponding to a multimodal data item (image or video).
333
+ multimodal_embeddings: tuple[torch.Tensor, ...] = ()
334
+
335
+ # NOTE: It is important to iterate over the keys in this dictionary
336
+ # to preserve the order of the modalities.
337
+ for modality in modalities:
338
+ if modality == "images":
339
+ image_input = modalities["images"]
340
+ vision_embeddings = self._process_image_input(image_input)
341
+ multimodal_embeddings += vision_embeddings
342
+
343
+ return multimodal_embeddings
344
+
345
+ def get_input_embeddings(
346
+ self,
347
+ input_ids: torch.Tensor,
348
+ multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
349
+ ) -> torch.Tensor:
350
+ inputs_embeds = self.language_model.get_input_embeddings(input_ids)
351
+ if multimodal_embeddings is not None:
352
+ inputs_embeds = merge_multimodal_embeddings(
353
+ input_ids,
354
+ inputs_embeds,
355
+ multimodal_embeddings,
356
+ [self.config.image_token_id, self.config.video_token_id],
357
+ )
358
+
359
+ return inputs_embeds
360
+
361
+ def get_input_embeddings_v0(
362
+ self,
363
+ input_ids: torch.Tensor,
364
+ image_input: Optional[DotsOCRImagePixelInputs] = None,
365
+ ) -> torch.Tensor:
366
+ inputs_embeds = self.get_input_embeddings(input_ids)
367
+ if image_input is not None:
368
+ image_embeds = self._process_image_input(image_input)
369
+ inputs_embeds = merge_multimodal_embeddings(
370
+ input_ids,
371
+ inputs_embeds,
372
+ image_embeds,
373
+ placeholder_token_id=self.config.image_token_id,
374
+ )
375
+ return inputs_embeds
376
+
377
+ def forward(
378
+ self,
379
+ input_ids: Optional[torch.Tensor],
380
+ positions: torch.Tensor,
381
+ intermediate_tensors: Optional[IntermediateTensors] = None,
382
+ inputs_embeds: Optional[torch.Tensor] = None,
383
+ **kwargs,
384
+ ) -> Union[torch.Tensor, IntermediateTensors]:
385
+ if intermediate_tensors is not None:
386
+ inputs_embeds = None
387
+ elif inputs_embeds is None and kwargs.get("pixel_values") is not None:
388
+ image_input = self._parse_and_validate_image_input(**kwargs)
389
+ if image_input is None:
390
+ inputs_embeds = None
391
+ else:
392
+ assert input_ids is not None
393
+ inputs_embeds = self.get_input_embeddings_v0(
394
+ input_ids,
395
+ image_input=image_input,
396
+ )
397
+ input_ids = None
398
+
399
+ hidden_states = self.language_model(
400
+ input_ids=input_ids,
401
+ positions=positions,
402
+ intermediate_tensors=intermediate_tensors,
403
+ inputs_embeds=inputs_embeds,
404
+ )
405
+
406
+ return hidden_states
407
+
408
+ def compute_logits(
409
+ self,
410
+ hidden_states: torch.Tensor,
411
+ sampling_metadata: SamplingMetadata,
412
+ ) -> Optional[torch.Tensor]:
413
+ return self.language_model.compute_logits(hidden_states, sampling_metadata)
414
+
415
+ def sample(
416
+ self,
417
+ logits: Optional[torch.Tensor],
418
+ sampling_metadata: SamplingMetadata,
419
+ ) -> Optional[SamplerOutput]:
420
+ next_tokens = self.sampler(logits, sampling_metadata)
421
+ return next_tokens
422
+
423
+ def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]) -> Set[str]:
424
+ loader = AutoWeightsLoader(self)
425
+ return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
426
+
427
+
428
+ def patch_vllm_chat_placeholder():
429
+ import vllm
430
+ # Skip the patch on vLLM versions newer than 0.9.1.
431
+ if vllm.__version_tuple__[:3] > (0, 9, 1):
432
+ return
433
+ from vllm.entrypoints.chat_utils import BaseMultiModalItemTracker
434
+
435
+ ori = BaseMultiModalItemTracker._placeholder_str
436
+
437
+ def _placeholder_str(self, modality, current_count: int) -> Optional[str]:
438
+ hf_config = self._model_config.hf_config
439
+ model_type = hf_config.model_type
440
+ if modality in ("image",) and model_type in ["dots_ocr"]:
441
+ return "<|img|><|imgpad|><|endofimg|>"
442
+ return ori(self, modality, current_count)
443
+
444
+ BaseMultiModalItemTracker._placeholder_str = _placeholder_str
445
+
446
+ ModelRegistry.register_model(
447
+ "DotsOCRForCausalLM", DotsOCRForCausalLM,
448
+ )
449
+
450
+
451
+ patch_vllm_chat_placeholder()
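
A minimal offline-inference sketch for the vLLM integration above. The model path and sampling parameters are assumptions, and the image placeholder string mirrors `get_placeholder_str`; the module above must have been imported so that `DotsOCRForCausalLM` is registered with `ModelRegistry`.

```python
# A minimal sketch, not the official serving recipe: model path and
# generation settings are assumptions.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="rednote-hilab/dots.ocr", trust_remote_code=True)  # assumed model path

image = Image.open("page.png")  # hypothetical input image
prompt = "<|user|><|img|><|imgpad|><|endofimg|>Extract the text in this image.<|endofuser|><|assistant|>"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```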
modeling_dots_vision.py ADDED
@@ -0,0 +1,520 @@
1
+ import math
2
+
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+ import torch.utils.checkpoint
7
+
8
+ flash_attn_available = True
9
+ npu_available = True
10
+
11
+ try:
12
+ from flash_attn import flash_attn_varlen_func
13
+ except ImportError:
14
+ flash_attn_available = False
15
+
16
+ from torch.nn import LayerNorm
17
+ from transformers.modeling_utils import PreTrainedModel
18
+ from .configuration_dots import DotsVisionConfig
19
+
20
+ try:
21
+ import torch_npu
22
+ except ImportError:
23
+ npu_available = False
24
+
25
+
26
+ def rotate_half(x):
27
+ """Rotates half the hidden dims of the input."""
28
+ x1 = x[..., : x.shape[-1] // 2]
29
+ x2 = x[..., x.shape[-1] // 2:]
30
+ return torch.cat((-x2, x1), dim=-1)
31
+
32
+
33
+ def apply_rotary_pos_emb_vision(tensor: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
34
+ orig_dtype = tensor.dtype
35
+ tensor = tensor.float()
36
+
37
+ cos = freqs.cos()
38
+ sin = freqs.sin()
39
+
40
+ cos = cos.unsqueeze(1).repeat(1, 1, 2).unsqueeze(0).float()
41
+ sin = sin.unsqueeze(1).repeat(1, 1, 2).unsqueeze(0).float()
42
+
43
+ output = (tensor * cos) + (rotate_half(tensor) * sin)
44
+
45
+ output = output.to(orig_dtype)
46
+
47
+ return output
48
+
49
+
50
+ class VisionRotaryEmbedding(nn.Module):
51
+ def __init__(self, dim: int, theta: float = 10000.0) -> None:
52
+ super().__init__()
53
+ inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
54
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
55
+
56
+ def forward(self, seqlen: int) -> torch.Tensor:
57
+ seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
58
+ freqs = torch.outer(seq, self.inv_freq)
59
+ return freqs
60
+
61
+
62
+ class PatchMerger(nn.Module):
63
+ def __init__(
64
+ self,
65
+ dim: int,
66
+ context_dim: int,
67
+ spatial_merge_size: int = 2,
68
+ pre_norm="layernorm",
69
+ init_merger_std=None,
70
+ ) -> None:
71
+ super().__init__()
72
+ self.hidden_size = context_dim * (spatial_merge_size ** 2)
73
+ self.pre_norm = pre_norm
74
+ if self.pre_norm == "layernorm":
75
+ self.ln_q = LayerNorm(context_dim, eps=1e-6)
76
+ elif self.pre_norm == "rmsnorm":
77
+ self.ln_q = RMSNorm(context_dim, eps=1e-6)
78
+ else:
79
+ print("no norm in patch merger")
80
+
81
+ self.mlp = nn.Sequential(
82
+ nn.Linear(self.hidden_size, self.hidden_size),
83
+ nn.GELU(),
84
+ nn.Linear(self.hidden_size, dim),
85
+ )
86
+
87
+ if init_merger_std is not None:
88
+ nn.init.normal_(self.mlp[0].weight, mean=0.0, std=init_merger_std)
89
+ nn.init.zeros_(self.mlp[0].bias)
90
+ nn.init.normal_(self.mlp[2].weight, mean=0.0, std=init_merger_std)
91
+ nn.init.zeros_(self.mlp[2].bias)
92
+
93
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
94
+ if self.pre_norm:
95
+ x = self.mlp(self.ln_q(x).view(-1, self.hidden_size))
96
+ else:
97
+ x = self.mlp(x.view(-1, self.hidden_size))
98
+ return x
99
+
100
+
101
+ class VisionAttention(nn.Module):
102
+ def __init__(self, config, dim: int, num_heads: int = 16, bias=True) -> None:
103
+ super().__init__()
104
+ self.num_heads = num_heads
105
+ self.head_dim = dim // num_heads
106
+ self.qkv = nn.Linear(dim, dim * 3, bias=bias)
107
+ self.proj = nn.Linear(dim, dim, bias=bias)
108
+
109
+ def forward(
110
+ self,
111
+ hidden_states: torch.Tensor,
112
+ cu_seqlens: torch.Tensor,
113
+ rotary_pos_emb: torch.Tensor = None,
114
+ ) -> torch.Tensor:
115
+ seq_length = hidden_states.shape[0]
116
+
117
+ q, k, v = self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
118
+ q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
119
+ k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)
120
+
121
+ attention_mask = torch.full(
122
+ [1, seq_length, seq_length], torch.finfo(q.dtype).min, device=q.device, dtype=q.dtype
123
+ )
124
+ for i in range(1, len(cu_seqlens)):
125
+ attention_mask[..., cu_seqlens[i - 1]: cu_seqlens[i], cu_seqlens[i - 1]: cu_seqlens[i]] = 0
126
+
127
+ q = q.transpose(0, 1)
128
+ k = k.transpose(0, 1)
129
+ v = v.transpose(0, 1)
130
+ attn_weights = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.head_dim)
131
+ attn_weights = attn_weights + attention_mask
132
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(q.dtype)
133
+ attn_output = torch.matmul(attn_weights, v)
134
+ attn_output = attn_output.transpose(0, 1)
135
+ attn_output = attn_output.reshape(seq_length, -1)
136
+ attn_output = self.proj(attn_output)
137
+ return attn_output
138
+
139
+
140
+ class VisionFlashAttention2(nn.Module):
141
+ def __init__(self, config, dim: int, num_heads: int = 16, bias=True) -> None:
142
+ super().__init__()
143
+ self.num_heads = num_heads
144
+ self.qkv = nn.Linear(dim, dim * 3, bias=bias)
145
+ self.proj = nn.Linear(dim, dim, bias=bias)
146
+ self.config = config
147
+ self.is_causal = config.is_causal
148
+
149
+ def forward(
150
+ self,
151
+ hidden_states: torch.Tensor,
152
+ cu_seqlens: torch.Tensor,
153
+ rotary_pos_emb: torch.Tensor = None,
154
+ ) -> torch.Tensor:
155
+ seq_length = hidden_states.shape[0]
156
+ q, k, v = (
157
+ self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
158
+ ) # 'shd'
159
+ q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
160
+ k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)
161
+ max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()
162
+ attn_output = flash_attn_varlen_func(
163
+ q, k, v, cu_seqlens, cu_seqlens, max_seqlen, max_seqlen, causal=self.is_causal
164
+ ).reshape(seq_length, -1)
165
+ attn_output = self.proj(attn_output)
166
+
167
+ return attn_output
168
+
169
+
170
+ class VisionAttentionV2(nn.Module):
171
+ def __init__(self, config, dim: int, num_heads: int = 16, bias=True) -> None:
172
+ super().__init__()
173
+ self.num_heads = num_heads
174
+ self.head_dim = dim // num_heads
175
+ self.qkv = nn.Linear(dim, dim * 3, bias=bias)
176
+ self.proj = nn.Linear(dim, dim, bias=bias)
177
+
178
+ def forward(
179
+ self,
180
+ hidden_states: torch.Tensor,
181
+ cu_seqlens: torch.Tensor,
182
+ rotary_pos_emb: torch.Tensor = None,
183
+ ) -> torch.Tensor:
184
+ seq_length = hidden_states.shape[0]
185
+
186
+ q, k, v = self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
187
+ q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
188
+ k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)
189
+
190
+ seqlens = torch.diff(cu_seqlens).tolist()
191
+
192
+ q_list = torch.split(q, seqlens, 0)
193
+ k_list = torch.split(k, seqlens, 0)
194
+ v_list = torch.split(v, seqlens, 0)
195
+ # Eager attention needs O(n^2) memory with n = b*s (batch_size * seq_len), so long sequences easily OOM.
196
+ # This implementation splits the sequence per sample to reduce memory, at the cost of slower compute than continuous batching.
197
+ outputs = []
198
+ for q_i, k_i, v_i in zip(q_list, k_list, v_list):
199
+ q_i = q_i.transpose(0, 1)
200
+ k_i = k_i.transpose(0, 1)
201
+ v_i = v_i.transpose(0, 1)
202
+ out = torch.matmul(q_i, k_i.transpose(1, 2)) / math.sqrt(self.head_dim)
203
+ out = nn.functional.softmax(out, dim=-1, dtype=torch.float32).to(q.dtype)
204
+ out = torch.matmul(out, v_i)
205
+ out = out.transpose(0, 1)
206
+ outputs.append(out)
207
+
208
+ attn_output = torch.concat(outputs, dim=0)
209
+ attn_output = attn_output.reshape(seq_length, -1)
210
+ attn_output = self.proj(attn_output)
211
+ return attn_output
212
+
213
+
214
+ class VisionAscendAttention(nn.Module):
215
+ def __init__(self, config, dim: int, num_heads: int = 16, bias=True) -> None:
216
+ super().__init__()
217
+ self.num_heads = num_heads
218
+ self.head_dim = dim // num_heads
219
+ self.qkv = nn.Linear(dim, dim * 3, bias=bias)
220
+ self.proj = nn.Linear(dim, dim, bias=bias)
221
+ self.config = config
222
+
223
+ def forward(
224
+ self,
225
+ hidden_states: torch.Tensor,
226
+ cu_seqlens: torch.Tensor,
227
+ rotary_pos_emb: torch.Tensor = None,
228
+ ) -> torch.Tensor:
229
+ seq_length = hidden_states.shape[0]
230
+ q, k, v = self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
231
+
232
+ q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
233
+ k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)
234
+
235
+ attention_mask = torch.ones([1, seq_length, seq_length], device=q.device, dtype=torch.bool)
236
+ for i in range(1, len(cu_seqlens)):
237
+ attention_mask[..., cu_seqlens[i - 1]: cu_seqlens[i], cu_seqlens[i - 1]: cu_seqlens[i]] = False
238
+
239
+ q = q.transpose(0, 1).unsqueeze(0)
240
+ k = k.transpose(0, 1).unsqueeze(0)
241
+ v = v.transpose(0, 1).unsqueeze(0)
242
+
243
+ attn_output = torch_npu.npu_prompt_flash_attention(q, k, v,
244
+ atten_mask=attention_mask,
245
+ num_heads=self.num_heads, input_layout="BNSD",
246
+ scale_value=self.head_dim ** -0.5)
247
+ attn_output = attn_output.squeeze(0).transpose(0, 1)
248
+ attn_output = attn_output.reshape(seq_length, -1)
249
+ attn_output = self.proj(attn_output)
250
+ return attn_output
251
+
252
+
253
+ class VisionSdpaAttention(nn.Module):
254
+ def __init__(self, config, dim: int, num_heads: int = 16, bias=True) -> None:
255
+ super().__init__()
256
+ self.num_heads = num_heads
257
+ self.qkv = nn.Linear(dim, dim * 3, bias=bias)
258
+ self.proj = nn.Linear(dim, dim, bias=bias)
259
+ self.config = config
260
+
261
+ def forward(
262
+ self,
263
+ hidden_states: torch.Tensor,
264
+ cu_seqlens: torch.Tensor,
265
+ rotary_pos_emb: torch.Tensor = None,
266
+ ) -> torch.Tensor:
267
+ seq_length = hidden_states.shape[0]
268
+ q, k, v = self.qkv(hidden_states).reshape(seq_length, 3, self.num_heads, -1).permute(1, 0, 2, 3).unbind(0)
269
+
270
+ q = apply_rotary_pos_emb_vision(q.unsqueeze(0), rotary_pos_emb).squeeze(0)
271
+ k = apply_rotary_pos_emb_vision(k.unsqueeze(0), rotary_pos_emb).squeeze(0)
272
+
273
+ attention_mask = torch.zeros([1, seq_length, seq_length], device=q.device, dtype=torch.bool)
274
+ for i in range(1, len(cu_seqlens)):
275
+ attention_mask[..., cu_seqlens[i - 1]: cu_seqlens[i], cu_seqlens[i - 1]: cu_seqlens[i]] = True
276
+
277
+ # Reshape q, k, v to 4D for scaled_dot_product_attention: (1, num_heads, seq_length, head_dim)
278
+ q = q.transpose(0, 1).unsqueeze(0) # (1, num_heads, seq_length, head_dim)
279
+ k = k.transpose(0, 1).unsqueeze(0)
280
+ v = v.transpose(0, 1).unsqueeze(0)
281
+
282
+ # See: https://github.com/pytorch/pytorch/issues/127523
283
+ if attention_mask.stride(-1) != 1:
284
+ attention_mask = torch.empty_like(attention_mask, memory_format=torch.contiguous_format).copy_(attention_mask)
285
+
286
+ # use memory efficient backend
287
+ from torch.nn.attention import SDPBackend, sdpa_kernel
288
+ with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
289
+ attn_output = F.scaled_dot_product_attention(q, k, v, attention_mask, dropout_p=0.0)
290
+
291
+ attn_output = attn_output.squeeze(0).transpose(0, 1) # (seq_length, num_heads, head_dim)
292
+ attn_output = attn_output.reshape(seq_length, -1)
293
+
294
+ attn_output = self.proj(attn_output)
295
+ return attn_output
296
+
297
+
298
+ DOTS_VISION_ATTENTION_CLASSES = {
299
+ "eager": VisionAttention,
300
+ "eager_v2": VisionAttentionV2, # lower memory usage
301
+ "flash_attention_2": VisionFlashAttention2,
302
+ "sdpa": VisionSdpaAttention,
303
+ "ascend_fa": VisionAscendAttention, # Ascend NPU; accuracy degrades noticeably on long sequences.
304
+ }
305
+
306
+
307
+ class RMSNorm(nn.Module):
308
+ def __init__(self, dim: int, eps: float = 1e-6):
309
+ super().__init__()
310
+ self.weight = nn.Parameter(torch.ones(dim))
311
+ self.eps = eps
312
+
313
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
314
+ output = self._norm(x.float()).type_as(x)
315
+ return output * self.weight
316
+
317
+ def extra_repr(self) -> str:
318
+ return f"{tuple(self.weight.shape)}, eps={self.eps}"
319
+
320
+ def _norm(self, x: torch.Tensor) -> torch.Tensor:
321
+ return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
322
+
323
+
324
+ class DotsSwiGLUFFN(nn.Module):
325
+ def __init__(self, config):
326
+ super().__init__()
327
+ hidden_features = config.intermediate_size
328
+ in_features = config.embed_dim
329
+ bias = config.use_bias
330
+
331
+ self.fc1 = nn.Linear(in_features, hidden_features, bias=bias)
332
+ self.fc2 = nn.Linear(hidden_features, in_features, bias=bias)
333
+ self.fc3 = nn.Linear(in_features, hidden_features, bias=bias)
334
+
335
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
336
+ x = F.silu(self.fc1(x)) * self.fc3(x)
337
+ x = self.fc2(x)
338
+ return x
339
+
340
+
341
+ class DotsPatchEmbed(nn.Module):
342
+ def __init__(self, config):
343
+ super().__init__()
344
+ self.num_channels = config.num_channels
345
+ self.patch_size = config.patch_size
346
+ self.temporal_patch_size = config.temporal_patch_size
347
+ self.embed_dim = config.embed_dim
348
+ self.config = config
349
+ self.proj = nn.Conv2d(
350
+ config.num_channels,
351
+ config.embed_dim,
352
+ kernel_size=(config.patch_size, config.patch_size),
353
+ stride=(config.patch_size, config.patch_size),
354
+ )
355
+ self.norm = RMSNorm(config.embed_dim, eps=config.rms_norm_eps)
356
+
357
+ def forward(self, x: torch.Tensor, grid_thw=None) -> torch.Tensor:
358
+ x = x.view(-1, self.num_channels, self.temporal_patch_size, self.patch_size, self.patch_size)[:, :, 0]
359
+ x = self.proj(x).view(-1, self.embed_dim)
360
+ x = self.norm(x)
361
+ return x
362
+
363
+
364
+ class DotsViTPreprocessor(nn.Module):
365
+ def __init__(self, config):
366
+ super().__init__()
367
+ self.patch_h = config.patch_size
368
+ self.patch_w = config.patch_size
369
+ self.embed_dim = config.embed_dim
370
+ self.config = config
371
+ self.patchifier = DotsPatchEmbed(config)
372
+
373
+ def forward(self, x: torch.Tensor, grid_thw=None) -> torch.Tensor:
374
+ tokens = self.patchifier(x, grid_thw)
375
+ return tokens
376
+
377
+
378
+ class DotsVisionBlock(nn.Module):
379
+ def __init__(self, config, attn_implementation: str = "flash_attention_2"):
380
+ super().__init__()
381
+
382
+ if attn_implementation == "flash_attention_2" and not flash_attn_available:
383
+ # fallback to eager
384
+ attn_implementation = "eager"
385
+ print("flash attention not available! falling back to eager implementation")
386
+
387
+ if attn_implementation == "ascend_fa" and not npu_available:
388
+ attn_implementation = "eager"
389
+ print("torch_npu not available! falling back to eager implementation")
390
+
391
+ self.attn = DOTS_VISION_ATTENTION_CLASSES[attn_implementation](
392
+ config, config.embed_dim, num_heads=config.num_attention_heads, bias=config.use_bias
393
+ )
394
+ self.norm1 = RMSNorm(config.embed_dim, eps=config.rms_norm_eps)
395
+ self.mlp = DotsSwiGLUFFN(config)
396
+ self.norm2 = RMSNorm(config.embed_dim, eps=config.rms_norm_eps)
397
+
398
+ def forward(self, hidden_states, cu_seqlens, rotary_pos_emb) -> torch.Tensor:
399
+ hidden_states = hidden_states + self.attn(
400
+ self.norm1(hidden_states), cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb
401
+ )
402
+ hidden_states = hidden_states + self.mlp(self.norm2(hidden_states))
403
+ return hidden_states
404
+
405
+
406
+ class DotsVisionTransformer(PreTrainedModel):
407
+ def __init__(self, config: DotsVisionConfig) -> None:
408
+ super().__init__(config)
409
+ self.config = config
410
+ self.spatial_merge_size = config.spatial_merge_size
411
+
412
+ self.patch_embed = DotsViTPreprocessor(config)
413
+ self._init_weights(self.patch_embed.patchifier.proj)
414
+
415
+ head_dim = config.embed_dim // config.num_attention_heads
416
+
417
+ self.rotary_pos_emb = VisionRotaryEmbedding(head_dim // 2)
418
+
419
+ _num_hidden_layers = config.num_hidden_layers
420
+ self.blocks = nn.ModuleList(
421
+ [DotsVisionBlock(config, config.attn_implementation) for _ in range(_num_hidden_layers)]
422
+ )
423
+
424
+ if self.config.post_norm:
425
+ self.post_trunk_norm = RMSNorm(config.embed_dim, eps=config.rms_norm_eps)
426
+
427
+ self.merger = PatchMerger(
428
+ dim=config.hidden_size,
429
+ context_dim=config.embed_dim,
430
+ spatial_merge_size=config.spatial_merge_size,
431
+ init_merger_std=self.config.init_merger_std,
432
+ )
433
+
434
+ self.gradient_checkpointing = False
435
+ self._gradient_checkpointing_func = torch.utils.checkpoint.checkpoint
436
+
437
+ def _init_weights(self, module):
438
+ std = self.config.initializer_range
439
+ if isinstance(module, (nn.Linear, nn.Conv3d)):
440
+ module.weight.data.normal_(mean=0.0, std=std)
441
+ if module.bias is not None:
442
+ module.bias.data.zero_()
443
+ elif isinstance(module, nn.Embedding):
444
+ module.weight.data.normal_(mean=0.0, std=std)
445
+ if module.padding_idx is not None:
446
+ module.weight.data[module.padding_idx].zero_()
447
+
448
+ @property
449
+ def dtype(self) -> torch.dtype:
450
+ return self.blocks[0].mlp.fc2.weight.dtype
451
+
452
+ @property
453
+ def device(self) -> torch.device:
454
+ return self.blocks[0].mlp.fc2.weight.device
455
+
456
+ def get_pos_ids_by_grid(self, grid_thw):
457
+ pos_ids = []
458
+ for t, h, w in grid_thw:
459
+ hpos_ids = torch.arange(h).unsqueeze(1).expand(-1, w)
460
+ hpos_ids = hpos_ids.reshape(
461
+ h // self.spatial_merge_size,
462
+ self.spatial_merge_size,
463
+ w // self.spatial_merge_size,
464
+ self.spatial_merge_size,
465
+ )
466
+ hpos_ids = hpos_ids.permute(0, 2, 1, 3)
467
+ hpos_ids = hpos_ids.flatten()
468
+
469
+ wpos_ids = torch.arange(w).unsqueeze(0).expand(h, -1)
470
+ wpos_ids = wpos_ids.reshape(
471
+ h // self.spatial_merge_size,
472
+ self.spatial_merge_size,
473
+ w // self.spatial_merge_size,
474
+ self.spatial_merge_size,
475
+ )
476
+ wpos_ids = wpos_ids.permute(0, 2, 1, 3)
477
+ wpos_ids = wpos_ids.flatten()
478
+ pos_ids.append(
479
+ torch.stack([hpos_ids, wpos_ids], dim=-1).repeat(t, 1)
480
+ )
481
+
482
+ return pos_ids
483
+
484
+ def rot_pos_emb(self, grid_thw):
485
+ pos_ids = self.get_pos_ids_by_grid(grid_thw)
486
+ pos_ids = torch.cat(pos_ids, dim=0)
487
+ max_grid_size = grid_thw[:, 1:].max()
488
+ rotary_pos_emb_full = self.rotary_pos_emb(max_grid_size)
489
+ rotary_pos_emb = rotary_pos_emb_full[pos_ids].flatten(1)
490
+ return rotary_pos_emb
491
+
492
+ def forward(self, hidden_states: torch.Tensor, grid_thw: torch.Tensor, bf16=True) -> torch.Tensor:
493
+ if bf16:
494
+ hidden_states = hidden_states.bfloat16()
495
+ hidden_states = self.patch_embed(hidden_states, grid_thw)
496
+
497
+ rotary_pos_emb = self.rot_pos_emb(grid_thw)
498
+
499
+ cu_seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]).cumsum(
500
+ dim=0,
501
+ dtype=grid_thw.dtype if torch.jit.is_tracing() else torch.int32,
502
+ )
503
+ cu_seqlens = F.pad(cu_seqlens, (1, 0), value=0)
504
+
505
+ for blk in self.blocks:
506
+ if self.gradient_checkpointing and self.training:
507
+ hidden_states = self._gradient_checkpointing_func(
508
+ blk.__call__,
509
+ hidden_states,
510
+ cu_seqlens,
511
+ rotary_pos_emb,
512
+ )
513
+ else:
514
+ hidden_states = blk(hidden_states, cu_seqlens=cu_seqlens, rotary_pos_emb=rotary_pos_emb)
515
+
516
+ if self.config.post_norm:
517
+ hidden_states = self.post_trunk_norm(hidden_states)
518
+
519
+ hidden_states = self.merger(hidden_states)
520
+ return hidden_states
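
To illustrate the token accounting in `DotsVisionTransformer.forward` above: each image contributes `t * h * w` patch tokens, `cu_seqlens` marks the per-image boundaries used for block-diagonal attention, and the `PatchMerger` divides the count by `spatial_merge_size**2` before handing embeddings to the language model. A small sketch with assumed grid values:

```python
import torch
import torch.nn.functional as F

# Two hypothetical images, each with t=1 temporal slice and an h x w patch grid.
grid_thw = torch.tensor([[1, 32, 24], [1, 16, 16]])
spatial_merge_size = 2  # from the vision config

# Same construction as in DotsVisionTransformer.forward: cumulative patch counts
# per image, padded with a leading zero.
cu_seqlens = torch.repeat_interleave(
    grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]
).cumsum(dim=0, dtype=torch.int32)
cu_seqlens = F.pad(cu_seqlens, (1, 0), value=0)
print(cu_seqlens.tolist())  # [0, 768, 1024]

# Tokens handed to the language model after the PatchMerger.
merged_tokens = grid_thw.prod(-1) // spatial_merge_size**2
print(merged_tokens.tolist())  # [192, 64]
```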
preprocessor_config.json ADDED
@@ -0,0 +1,19 @@
1
+ {
2
+ "min_pixels": 3136,
3
+ "max_pixels": 11289600,
4
+ "patch_size": 14,
5
+ "temporal_patch_size": 1,
6
+ "merge_size": 2,
7
+ "image_mean": [
8
+ 0.48145466,
9
+ 0.4578275,
10
+ 0.40821073
11
+ ],
12
+ "image_std": [
13
+ 0.26862954,
14
+ 0.26130258,
15
+ 0.27577711
16
+ ],
17
+ "image_processor_type": "Qwen2VLImageProcessor",
18
+ "processor_class": "DotsVLProcessor"
19
+ }
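
The preprocessor settings above follow the Qwen2-VL image processor: images are resized so both sides are multiples of `patch_size * merge_size` while the total pixel count stays between `min_pixels` and `max_pixels`. A sketch of the resulting image token count, assuming an arbitrary input resolution:

```python
# A minimal sketch using values from preprocessor_config.json;
# the input resolution below is an assumption.
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

patch_size, merge_size = 14, 2
min_pixels, max_pixels = 3136, 11289600

height, width = 1654, 1170  # e.g. an A4 page scanned at roughly 140 DPI
resized_h, resized_w = smart_resize(
    height=height,
    width=width,
    factor=patch_size * merge_size,
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
num_patches = (resized_h // patch_size) * (resized_w // patch_size)
num_image_tokens = num_patches // merge_size**2
print(resized_h, resized_w, num_image_tokens)
```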
special_tokens_map.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|endoftext|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": "[PAD]"
25
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,391 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ },
181
+ "151665": {
182
+ "content": "<|imgpad|>",
183
+ "lstrip": false,
184
+ "normalized": false,
185
+ "rstrip": false,
186
+ "single_word": false,
187
+ "special": true
188
+ },
189
+ "151666": {
190
+ "content": "<|img|>",
191
+ "lstrip": false,
192
+ "normalized": false,
193
+ "rstrip": false,
194
+ "single_word": false,
195
+ "special": true
196
+ },
197
+ "151667": {
198
+ "content": "<|endofimg|>",
199
+ "lstrip": false,
200
+ "normalized": false,
201
+ "rstrip": false,
202
+ "single_word": false,
203
+ "special": true
204
+ },
205
+ "151668": {
206
+ "content": "<|systemprompt|>",
207
+ "lstrip": false,
208
+ "normalized": false,
209
+ "rstrip": false,
210
+ "single_word": false,
211
+ "special": true
212
+ },
213
+ "151669": {
214
+ "content": "<|endofsystemprompt|>",
215
+ "lstrip": false,
216
+ "normalized": false,
217
+ "rstrip": false,
218
+ "single_word": false,
219
+ "special": true
220
+ },
221
+ "151670": {
222
+ "content": "<|user|>",
223
+ "lstrip": false,
224
+ "normalized": false,
225
+ "rstrip": false,
226
+ "single_word": false,
227
+ "special": true
228
+ },
229
+ "151671": {
230
+ "content": "<|endofuser|>",
231
+ "lstrip": false,
232
+ "normalized": false,
233
+ "rstrip": false,
234
+ "single_word": false,
235
+ "special": true
236
+ },
237
+ "151672": {
238
+ "content": "<|assistant|>",
239
+ "lstrip": false,
240
+ "normalized": false,
241
+ "rstrip": false,
242
+ "single_word": false,
243
+ "special": true
244
+ },
245
+ "151673": {
246
+ "content": "<|endofassistant|>",
247
+ "lstrip": false,
248
+ "normalized": false,
249
+ "rstrip": false,
250
+ "single_word": false,
251
+ "special": true
252
+ },
253
+ "151674": {
254
+ "content": "<|ref_start|>",
255
+ "lstrip": false,
256
+ "normalized": false,
257
+ "rstrip": false,
258
+ "single_word": false,
259
+ "special": true
260
+ },
261
+ "151675": {
262
+ "content": "<|ref_end|>",
263
+ "lstrip": false,
264
+ "normalized": false,
265
+ "rstrip": false,
266
+ "single_word": false,
267
+ "special": true
268
+ },
269
+ "151676": {
270
+ "content": "[SEP]",
271
+ "lstrip": false,
272
+ "normalized": false,
273
+ "rstrip": false,
274
+ "single_word": false,
275
+ "special": true
276
+ },
277
+ "151677": {
278
+ "content": "<|pic|>",
279
+ "lstrip": false,
280
+ "normalized": false,
281
+ "rstrip": false,
282
+ "single_word": false,
283
+ "special": true
284
+ },
285
+ "151678": {
286
+ "content": "<|text|>",
287
+ "lstrip": false,
288
+ "normalized": false,
289
+ "rstrip": false,
290
+ "single_word": false,
291
+ "special": true
292
+ },
293
+ "151679": {
294
+ "content": "<|pictotext|>",
295
+ "lstrip": false,
296
+ "normalized": false,
297
+ "rstrip": false,
298
+ "single_word": false,
299
+ "special": true
300
+ },
301
+ "151680": {
302
+ "content": "[PAD]",
303
+ "lstrip": false,
304
+ "normalized": false,
305
+ "rstrip": false,
306
+ "single_word": false,
307
+ "special": true
308
+ },
309
+ "151681": {
310
+ "content": "<|slice|>",
311
+ "lstrip": false,
312
+ "normalized": false,
313
+ "rstrip": false,
314
+ "single_word": false,
315
+ "special": true
316
+ },
317
+ "151682": {
318
+ "content": "<|endofslice|>",
319
+ "lstrip": false,
320
+ "normalized": false,
321
+ "rstrip": false,
322
+ "single_word": false,
323
+ "special": true
324
+ },
325
+ "151683": {
326
+ "content": "<|imgrowend|>",
327
+ "lstrip": false,
328
+ "normalized": false,
329
+ "rstrip": false,
330
+ "single_word": false,
331
+ "special": true
332
+ },
333
+ "151684": {
334
+ "content": "<|polygon_start|>",
335
+ "lstrip": false,
336
+ "normalized": false,
337
+ "rstrip": false,
338
+ "single_word": false,
339
+ "special": true
340
+ },
341
+ "151685": {
342
+ "content": "<|polygon_end|>",
343
+ "lstrip": false,
344
+ "normalized": false,
345
+ "rstrip": false,
346
+ "single_word": false,
347
+ "special": true
348
+ },
349
+ "151686": {
350
+ "content": "<|image_gen_start|>",
351
+ "lstrip": false,
352
+ "normalized": false,
353
+ "rstrip": false,
354
+ "single_word": false,
355
+ "special": true
356
+ },
357
+ "151687": {
358
+ "content": "<|image_gen_end|>",
359
+ "lstrip": false,
360
+ "normalized": false,
361
+ "rstrip": false,
362
+ "single_word": false,
363
+ "special": true
364
+ }
365
+ },
366
+ "additional_special_tokens": [
367
+ "<|im_start|>",
368
+ "<|im_end|>",
369
+ "<|object_ref_start|>",
370
+ "<|object_ref_end|>",
371
+ "<|box_start|>",
372
+ "<|box_end|>",
373
+ "<|quad_start|>",
374
+ "<|quad_end|>",
375
+ "<|vision_start|>",
376
+ "<|vision_end|>",
377
+ "<|vision_pad|>",
378
+ "<|image_pad|>",
379
+ "<|video_pad|>"
380
+ ],
381
+ "bos_token": null,
382
+ "chat_template": "{%- for m in messages %}\n {%- if m.role == 'system' %}\n {{- '<|system|>' + m.content + '<|endofsystem|>\\n' }}\n {%- elif m.role == 'user' %}\n {{- '<|user|>' + m.content + '<|endofuser|>' }}\n {%- elif m.role == 'assistant' %}\n {{- '<|assistant|>' + m.content }}\n {%- if not loop.last %}\n {{- '<|endofassistant|>' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if messages[-1].role != 'assistant' %}\n {{- '<|assistant|>' }}\n{%- endif %}",
383
+ "clean_up_tokenization_spaces": false,
384
+ "eos_token": "<|endoftext|>",
385
+ "errors": "replace",
386
+ "model_max_length": 131072,
387
+ "pad_token": "[PAD]",
388
+ "split_special_tokens": false,
389
+ "tokenizer_class": "Qwen2Tokenizer",
390
+ "unk_token": null
391
+ }
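
The `chat_template` above wraps system, user, and assistant turns in the role tokens defined earlier and appends `<|assistant|>` whenever the last message is not from the assistant. A sketch of rendering it (the repository id is an assumption):

```python
# A minimal sketch, assuming the repo id below; the tokenizer_class is
# Qwen2Tokenizer, so loading should work with the standard tokenizer code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rednote-hilab/dots.ocr", trust_remote_code=True)

messages = [
    {"role": "user", "content": "<|img|><|imgpad|><|endofimg|>Extract the text in this image."},
]
rendered = tokenizer.apply_chat_template(messages, tokenize=False)
print(rendered)
# Expected per the template: "<|user|>...<|endofuser|><|assistant|>"
```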
vocab.json ADDED
The diff for this file is too large to render. See raw diff