# DeepEncoder

This is the encoder component of DeepSeek-OCR, containing:

- **sam_encoder.pth**: SAM ViT-B encoder for high-resolution feature extraction
- **clip_encoder.pth**: CLIP-L encoder for semantic feature extraction
- **projector.pth**: Linear projector layer
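The weight files can be fetched programmatically. Below is a minimal sketch using `huggingface_hub`; the repository id is a placeholder that you would replace with this repository's actual id.

```python
from huggingface_hub import hf_hub_download

# Hypothetical repository id: replace with the actual id of this repository.
REPO_ID = "your-namespace/DeepEncoder"

sam_path = hf_hub_download(repo_id=REPO_ID, filename="sam_encoder.pth")
clip_path = hf_hub_download(repo_id=REPO_ID, filename="clip_encoder.pth")
projector_path = hf_hub_download(repo_id=REPO_ID, filename="projector.pth")
```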
## Architecture

The DeepEncoder processes images through:

1. SAM encoder: Extracts high-resolution visual features with window attention
2. CLIP encoder: Extracts semantic features with global attention (uses SAM features as input)
3. Projector: Projects concatenated features to the decoder dimension (1280); see the sketch below
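For intuition, a `projector_type` of `"linear"` amounts to a single fully connected layer from the 2048-dim concatenated features to the 1280-dim decoder space. The following is an illustrative sketch, not the repository's `MlpProjector` class, and the token count of 256 is an assumption for the example.

```python
import torch
import torch.nn as nn

# Sketch of a linear projector: maps the concatenated SAM+CLIP features
# (2048-dim per token) onto the decoder's 1280-dim embedding space.
projector = nn.Linear(2048, 1280)

tokens = torch.randn(1, 256, 2048)   # [B, N, 2048] concatenated features (N = 256 assumed here)
embeddings = projector(tokens)       # [B, N, 1280] decoder-ready vision embeddings
print(embeddings.shape)              # torch.Size([1, 256, 1280])
```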
## Usage

```python
import torch
from addict import Dict

from deepencoder.sam_vary_sdpa import build_sam_vit_b
from deepencoder.clip_sdpa import build_clip_l
from deepencoder.build_linear import MlpProjector

# Build the models
sam_model = build_sam_vit_b()
vision_model = build_clip_l()
projector = MlpProjector(Dict(projector_type="linear", input_dim=2048, n_embed=1280))

# Load the extracted weights
sam_model.load_state_dict(torch.load("sam_encoder.pth"))
vision_model.load_state_dict(torch.load("clip_encoder.pth"))
projector.load_state_dict(torch.load("projector.pth"))

# Process an image (`image` is a preprocessed tensor of shape [B, 3, H, W])
with torch.no_grad():
    sam_features = sam_model(image)                    # [B, 1024, H/16, W/16]
    clip_features = vision_model(image, sam_features)  # [B, N, 1024]

    # Concatenate CLIP patch tokens (CLS token dropped) with flattened SAM features
    combined_features = torch.cat(
        (clip_features[:, 1:], sam_features.flatten(2).permute(0, 2, 1)),
        dim=-1,
    )  # [B, N, 2048]

    # Project to the decoder dimension
    vision_embeddings = projector(combined_features)   # [B, N, 1280]
```
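The snippet above assumes `image` is already a batched, normalized tensor. One possible way to prepare it is sketched below; the 1024×1024 resolution and the 0.5 mean/std normalization are assumptions, so consult the upstream DeepSeek-OCR preprocessing for the exact transform.

```python
import torch
from PIL import Image
from torchvision import transforms

# Assumed preprocessing: resize to 1024x1024, convert to tensor, normalize with mean/std 0.5.
preprocess = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

image = preprocess(Image.open("page.png").convert("RGB")).unsqueeze(0)  # [1, 3, 1024, 1024]
```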
## Source

Extracted from [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR).