junkim100 committed
Commit dba1050 · verified · 1 Parent(s): 5dbe42b

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +55 -0
  2. clip_encoder.pth +3 -0
  3. projector.pth +3 -0
  4. sam_encoder.pth +3 -0
  5. special_tokens.pth +3 -0
README.md ADDED
@@ -0,0 +1,55 @@
# DeepEncoder

This is the encoder component of DeepSeek-OCR, containing:
- **sam_encoder.pth**: SAM ViT-B encoder for high-resolution feature extraction
- **clip_encoder.pth**: CLIP-L encoder for semantic feature extraction
- **projector.pth**: Linear projector layer
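
A quick way to sanity-check the downloaded checkpoints is to load each one as a plain state dict and count its parameters. This is a minimal sketch, not part of the original repository; it assumes each file is an ordinary PyTorch state dict, as the usage example below implies:

```python
import torch

# Inspect each checkpoint without instantiating the model classes.
for name in ["sam_encoder.pth", "clip_encoder.pth", "projector.pth"]:
    state = torch.load(name, map_location="cpu")
    n_params = sum(t.numel() for t in state.values())
    print(f"{name}: {len(state)} tensors, {n_params:,} parameters")
```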

## Architecture

The DeepEncoder processes an image in three stages:
1. SAM encoder: extracts high-resolution visual features with window attention
2. CLIP encoder: extracts semantic features with global attention, using the SAM features as input
3. Projector: projects the concatenated features to the decoder dimension (1280)

## Usage

```python
import torch
from deepencoder.sam_vary_sdpa import build_sam_vit_b
from deepencoder.clip_sdpa import build_clip_l
from deepencoder.build_linear import MlpProjector
from addict import Dict

# Build the three modules
sam_model = build_sam_vit_b()
vision_model = build_clip_l()
projector = MlpProjector(Dict(projector_type="linear", input_dim=2048, n_embed=1280))

# Load the checkpoint weights
sam_model.load_state_dict(torch.load("sam_encoder.pth", map_location="cpu"))
vision_model.load_state_dict(torch.load("clip_encoder.pth", map_location="cpu"))
projector.load_state_dict(torch.load("projector.pth", map_location="cpu"))

# Process an image; `image` is a preprocessed tensor of shape [B, 3, H, W]
with torch.no_grad():
    sam_features = sam_model(image)                    # [B, 1024, H/16, W/16]
    clip_features = vision_model(image, sam_features)  # [B, N+1, 1024], incl. CLS token

    # Drop the CLS token, then concatenate CLIP and flattened SAM features
    combined_features = torch.cat(
        (clip_features[:, 1:], sam_features.flatten(2).permute(0, 2, 1)),
        dim=-1,
    )  # [B, N, 2048]

    # Project to the decoder dimension
    vision_embeddings = projector(combined_features)  # [B, N, 1280]
```
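
The shape comments imply that the CLIP sequence (minus its leading CLS token) and the flattened SAM feature map must have the same token count N. As an illustration only, assuming a 1024×1024 input (the README does not state the expected resolution):

```python
# Illustrative shape arithmetic, assuming H = W = 1024 and B = 1.
B, H, W = 1, 1024, 1024
n_tokens = (H // 16) * (W // 16)  # 64 * 64 = 4096 spatial tokens

# sam_features.flatten(2).permute(0, 2, 1) -> (1, 4096, 1024)
# clip_features[:, 1:]                     -> (1, 4096, 1024), CLS dropped
# torch.cat(..., dim=-1)                   -> (1, 4096, 2048)
# projector(...)                           -> (1, 4096, 1280)
print(n_tokens)  # number of vision embeddings handed to the decoder
```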

## Source

Extracted from [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR).

## Reference

Similar in structure to [Volkopat/DeepSeek-DeepEncoder](https://huggingface.co/Volkopat/DeepSeek-DeepEncoder).
clip_encoder.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3d3f97c24bd69378a5f5a657ad81223134025ebf52f784eb32042d4d2b57404f
size 606449932
projector.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e0acc5973ed8d2025990af99da202c1519ebe26454acc61ae9072dac635c35cb
size 5247349
sam_encoder.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bb795b0351d9dfdb05015e49b2e83bbb6b7f1397831ac391121ae7f4eaf5c5d4
size 191192957
special_tokens.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7d439e0a460139f7764bc3e5c9eb9f3c217d9becdbbc9dc77f4506fa32a8754e
size 7069