Upload folder using huggingface_hub
- README.md +55 -0
- clip_encoder.pth +3 -0
- projector.pth +3 -0
- sam_encoder.pth +3 -0
- special_tokens.pth +3 -0
README.md
ADDED
@@ -0,0 +1,55 @@
# DeepEncoder

This is the encoder component of DeepSeek-OCR, containing:

- **sam_encoder.pth**: SAM ViT-B encoder for high-resolution feature extraction
- **clip_encoder.pth**: CLIP-L encoder for semantic feature extraction
- **projector.pth**: linear projector layer

## Architecture

The DeepEncoder processes images through:

1. SAM encoder: extracts high-resolution visual features with window attention
2. CLIP encoder: extracts semantic features with global attention, using the SAM features as input
3. Projector: projects the concatenated features to the decoder dimension (1280), as sketched below

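To make the dimension bookkeeping concrete, here is a minimal shape-only sketch with dummy tensors. The 64×64 feature grid is an illustrative assumption, and `torch.nn.Linear` merely stands in for the `MlpProjector`; the real feature maps come from the encoders, as shown in the Usage section below.

```python
import torch

# Shape-only sketch with dummy tensors; the 64x64 grid is an assumed size.
B, C, Hf, Wf = 1, 1024, 64, 64
N = Hf * Wf  # number of patch tokens

sam_features = torch.randn(B, C, Hf, Wf)     # [B, 1024, H/16, W/16]
clip_features = torch.randn(B, N + 1, 1024)  # [B, N+1, 1024], CLS token at index 0

# Drop the CLS token, flatten the SAM grid, and concatenate along channels
combined = torch.cat(
    (clip_features[:, 1:], sam_features.flatten(2).permute(0, 2, 1)),
    dim=-1,
)  # [B, N, 2048]

projector = torch.nn.Linear(2048, 1280)      # stand-in for the linear MlpProjector
vision_embeddings = projector(combined)      # [B, N, 1280]
print(vision_embeddings.shape)               # torch.Size([1, 4096, 1280])
```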
## Usage

```python
import torch
from addict import Dict

# These modules come from the DeepSeek-OCR code base (see Source below)
from deepencoder.sam_vary_sdpa import build_sam_vit_b
from deepencoder.clip_sdpa import build_clip_l
from deepencoder.build_linear import MlpProjector

# Build models
sam_model = build_sam_vit_b()
vision_model = build_clip_l()
projector = MlpProjector(Dict(projector_type="linear", input_dim=2048, n_embed=1280))

# Load weights
sam_model.load_state_dict(torch.load("sam_encoder.pth", map_location="cpu"))
vision_model.load_state_dict(torch.load("clip_encoder.pth", map_location="cpu"))
projector.load_state_dict(torch.load("projector.pth", map_location="cpu"))
sam_model.eval()
vision_model.eval()
projector.eval()

# Process an image tensor of shape [B, 3, H, W]
# (preprocess/resize it the same way DeepSeek-OCR does)
with torch.no_grad():
    sam_features = sam_model(image)                    # [B, 1024, H/16, W/16]
    clip_features = vision_model(image, sam_features)  # [B, N, 1024]

    # Concatenate CLIP patch tokens (CLS dropped) with flattened SAM features
    combined_features = torch.cat(
        (clip_features[:, 1:], sam_features.flatten(2).permute(0, 2, 1)),
        dim=-1,
    )  # [B, N, 2048]

    # Project to the decoder dimension
    vision_embeddings = projector(combined_features)   # [B, N, 1280]
```

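The checkpoints can also be fetched programmatically with `huggingface_hub` (the same library used to upload this folder). A minimal sketch, where `"<this-repo-id>"` is a placeholder for this repository's id:

```python
from huggingface_hub import hf_hub_download

repo_id = "<this-repo-id>"  # placeholder: replace with this repository's id
files = ["sam_encoder.pth", "clip_encoder.pth", "projector.pth", "special_tokens.pth"]

# Download each weight file into the local Hugging Face cache
local_paths = {name: hf_hub_download(repo_id=repo_id, filename=name) for name in files}
print(local_paths)
```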
## Source

Extracted from [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)

## Reference

Similar structure to [Volkopat/DeepSeek-DeepEncoder](https://huggingface.co/Volkopat/DeepSeek-DeepEncoder)
clip_encoder.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3d3f97c24bd69378a5f5a657ad81223134025ebf52f784eb32042d4d2b57404f
size 606449932
projector.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e0acc5973ed8d2025990af99da202c1519ebe26454acc61ae9072dac635c35cb
size 5247349
sam_encoder.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bb795b0351d9dfdb05015e49b2e83bbb6b7f1397831ac391121ae7f4eaf5c5d4
size 191192957
special_tokens.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7d439e0a460139f7764bc3e5c9eb9f3c217d9becdbbc9dc77f4506fa32a8754e
size 7069
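The `.pth` files are stored through Git LFS, so each entry above is a pointer recording the blob's sha256 and size rather than the weights themselves. A minimal sketch for sanity-checking downloaded files against these pointers (assumes the four files sit in the current directory):

```python
import hashlib
from pathlib import Path

# sha256 oids and sizes copied from the LFS pointer files above
expected = {
    "clip_encoder.pth": ("3d3f97c24bd69378a5f5a657ad81223134025ebf52f784eb32042d4d2b57404f", 606449932),
    "projector.pth": ("e0acc5973ed8d2025990af99da202c1519ebe26454acc61ae9072dac635c35cb", 5247349),
    "sam_encoder.pth": ("bb795b0351d9dfdb05015e49b2e83bbb6b7f1397831ac391121ae7f4eaf5c5d4", 191192957),
    "special_tokens.pth": ("7d439e0a460139f7764bc3e5c9eb9f3c217d9becdbbc9dc77f4506fa32a8754e", 7069),
}

for name, (oid, size) in expected.items():
    path = Path(name)
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Hash in 1 MiB chunks so large checkpoints are not read into memory at once
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    assert path.stat().st_size == size, f"{name}: unexpected size"
    assert h.hexdigest() == oid, f"{name}: sha256 mismatch"
    print(f"{name}: OK")
```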