Upload folder using huggingface_hub
- README.md +55 -0
- clip_encoder.pth +3 -0
- projector.pth +3 -0
- sam_encoder.pth +3 -0
- special_tokens.pth +3 -0
README.md
ADDED
@@ -0,0 +1,55 @@
# DeepEncoder

This is the encoder component of DeepSeek-OCR, containing:

- **sam_encoder.pth**: SAM ViT-B encoder for high-resolution feature extraction
- **clip_encoder.pth**: CLIP-L encoder for semantic feature extraction
- **projector.pth**: linear projector layer

## Architecture

The DeepEncoder processes images through:

1. SAM encoder: extracts high-resolution visual features with window attention
2. CLIP encoder: extracts semantic features with global attention, using the SAM features as input
3. Projector: projects the concatenated features to the decoder dimension (1280), as sketched below

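To make the dimension bookkeeping concrete, here is a minimal shape-only sketch with dummy tensors. The 64×64 feature grid is an illustrative assumption, and `torch.nn.Linear` merely stands in for the `MlpProjector`; the real feature maps come from the encoders, as shown in the Usage section below.

```python
import torch

# Shape-only sketch with dummy tensors; the 64x64 grid is an assumed size.
B, C, Hf, Wf = 1, 1024, 64, 64
N = Hf * Wf  # number of patch tokens

sam_features = torch.randn(B, C, Hf, Wf)     # [B, 1024, H/16, W/16]
clip_features = torch.randn(B, N + 1, 1024)  # [B, N+1, 1024], CLS token at index 0

# Drop the CLS token, flatten the SAM grid, and concatenate along channels
combined = torch.cat(
    (clip_features[:, 1:], sam_features.flatten(2).permute(0, 2, 1)),
    dim=-1,
)  # [B, N, 2048]

projector = torch.nn.Linear(2048, 1280)      # stand-in for the linear MlpProjector
vision_embeddings = projector(combined)      # [B, N, 1280]
print(vision_embeddings.shape)               # torch.Size([1, 4096, 1280])
```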
## Usage

```python
import torch
from addict import Dict

# These modules come from the DeepSeek-OCR code base (see Source below)
from deepencoder.sam_vary_sdpa import build_sam_vit_b
from deepencoder.clip_sdpa import build_clip_l
from deepencoder.build_linear import MlpProjector

# Build models
sam_model = build_sam_vit_b()
vision_model = build_clip_l()
projector = MlpProjector(Dict(projector_type="linear", input_dim=2048, n_embed=1280))

# Load weights
sam_model.load_state_dict(torch.load("sam_encoder.pth", map_location="cpu"))
vision_model.load_state_dict(torch.load("clip_encoder.pth", map_location="cpu"))
projector.load_state_dict(torch.load("projector.pth", map_location="cpu"))
sam_model.eval()
vision_model.eval()
projector.eval()

# Process an image tensor of shape [B, 3, H, W]
# (preprocess/resize it the same way DeepSeek-OCR does)
with torch.no_grad():
    sam_features = sam_model(image)                    # [B, 1024, H/16, W/16]
    clip_features = vision_model(image, sam_features)  # [B, N, 1024]

    # Concatenate CLIP patch tokens (CLS dropped) with flattened SAM features
    combined_features = torch.cat(
        (clip_features[:, 1:], sam_features.flatten(2).permute(0, 2, 1)),
        dim=-1,
    )  # [B, N, 2048]

    # Project to the decoder dimension
    vision_embeddings = projector(combined_features)   # [B, N, 1280]
```

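The checkpoints can also be fetched programmatically with `huggingface_hub` (the same library used to upload this folder). A minimal sketch, where `"<this-repo-id>"` is a placeholder for this repository's id:

```python
from huggingface_hub import hf_hub_download

repo_id = "<this-repo-id>"  # placeholder: replace with this repository's id
files = ["sam_encoder.pth", "clip_encoder.pth", "projector.pth", "special_tokens.pth"]

# Download each weight file into the local Hugging Face cache
local_paths = {name: hf_hub_download(repo_id=repo_id, filename=name) for name in files}
print(local_paths)
```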
## Source

Extracted from [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)

## Reference

Similar structure to [Volkopat/DeepSeek-DeepEncoder](https://huggingface.co/Volkopat/DeepSeek-DeepEncoder)
clip_encoder.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3d3f97c24bd69378a5f5a657ad81223134025ebf52f784eb32042d4d2b57404f
size 606449932
projector.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e0acc5973ed8d2025990af99da202c1519ebe26454acc61ae9072dac635c35cb
size 5247349
sam_encoder.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bb795b0351d9dfdb05015e49b2e83bbb6b7f1397831ac391121ae7f4eaf5c5d4
size 191192957
special_tokens.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7d439e0a460139f7764bc3e5c9eb9f3c217d9becdbbc9dc77f4506fa32a8754e
size 7069
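The `.pth` files are stored through Git LFS, so each entry above is a pointer recording the blob's sha256 and size rather than the weights themselves. A minimal sketch for sanity-checking downloaded files against these pointers (assumes the four files sit in the current directory):

```python
import hashlib
from pathlib import Path

# sha256 oids and sizes copied from the LFS pointer files above
expected = {
    "clip_encoder.pth": ("3d3f97c24bd69378a5f5a657ad81223134025ebf52f784eb32042d4d2b57404f", 606449932),
    "projector.pth": ("e0acc5973ed8d2025990af99da202c1519ebe26454acc61ae9072dac635c35cb", 5247349),
    "sam_encoder.pth": ("bb795b0351d9dfdb05015e49b2e83bbb6b7f1397831ac391121ae7f4eaf5c5d4", 191192957),
    "special_tokens.pth": ("7d439e0a460139f7764bc3e5c9eb9f3c217d9becdbbc9dc77f4506fa32a8754e", 7069),
}

for name, (oid, size) in expected.items():
    path = Path(name)
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Hash in 1 MiB chunks so large checkpoints are not read into memory at once
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    assert path.stat().st_size == size, f"{name}: unexpected size"
    assert h.hexdigest() == oid, f"{name}: sha256 mismatch"
    print(f"{name}: OK")
```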