# DeepEncoder

This is the encoder component of DeepSeek-OCR, containing:

- **sam_encoder.pth**: SAM ViT-B encoder for high-resolution feature extraction
- **clip_encoder.pth**: CLIP-L encoder for semantic feature extraction
- **projector.pth**: Linear projector layer
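The weight files can be fetched programmatically. Below is a minimal sketch using `huggingface_hub`; the repository id is a placeholder that you would replace with this repository's actual id.

```python
from huggingface_hub import hf_hub_download

# Hypothetical repository id: replace with the actual id of this repository.
REPO_ID = "your-namespace/DeepEncoder"

sam_path = hf_hub_download(repo_id=REPO_ID, filename="sam_encoder.pth")
clip_path = hf_hub_download(repo_id=REPO_ID, filename="clip_encoder.pth")
projector_path = hf_hub_download(repo_id=REPO_ID, filename="projector.pth")
```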
## Architecture

The DeepEncoder processes images through:

1. SAM encoder: Extracts high-resolution visual features with window attention
2. CLIP encoder: Extracts semantic features with global attention (uses SAM features as input)
3. Projector: Projects concatenated features to the decoder dimension (1280); see the sketch below
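For intuition, a `projector_type` of `"linear"` amounts to a single fully connected layer from the 2048-dim concatenated features to the 1280-dim decoder space. The following is an illustrative sketch, not the repository's `MlpProjector` class, and the token count of 256 is an assumption for the example.

```python
import torch
import torch.nn as nn

# Sketch of a linear projector: maps the concatenated SAM+CLIP features
# (2048-dim per token) onto the decoder's 1280-dim embedding space.
projector = nn.Linear(2048, 1280)

tokens = torch.randn(1, 256, 2048)   # [B, N, 2048] concatenated features (N = 256 assumed here)
embeddings = projector(tokens)       # [B, N, 1280] decoder-ready vision embeddings
print(embeddings.shape)              # torch.Size([1, 256, 1280])
```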
## Usage

```python
import torch
from addict import Dict

from deepencoder.sam_vary_sdpa import build_sam_vit_b
from deepencoder.clip_sdpa import build_clip_l
from deepencoder.build_linear import MlpProjector

# Build the models
sam_model = build_sam_vit_b()
vision_model = build_clip_l()
projector = MlpProjector(Dict(projector_type="linear", input_dim=2048, n_embed=1280))

# Load the extracted weights
sam_model.load_state_dict(torch.load("sam_encoder.pth"))
vision_model.load_state_dict(torch.load("clip_encoder.pth"))
projector.load_state_dict(torch.load("projector.pth"))

# Process an image (`image` is a preprocessed tensor of shape [B, 3, H, W])
with torch.no_grad():
    sam_features = sam_model(image)                    # [B, 1024, H/16, W/16]
    clip_features = vision_model(image, sam_features)  # [B, N, 1024]

    # Concatenate CLIP patch tokens (CLS token dropped) with flattened SAM features
    combined_features = torch.cat(
        (clip_features[:, 1:], sam_features.flatten(2).permute(0, 2, 1)),
        dim=-1,
    )  # [B, N, 2048]

    # Project to the decoder dimension
    vision_embeddings = projector(combined_features)   # [B, N, 1280]
```
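The snippet above assumes `image` is already a batched, normalized tensor. One possible way to prepare it is sketched below; the 1024×1024 resolution and the 0.5 mean/std normalization are assumptions, so consult the upstream DeepSeek-OCR preprocessing for the exact transform.

```python
import torch
from PIL import Image
from torchvision import transforms

# Assumed preprocessing: resize to 1024x1024, convert to tensor, normalize with mean/std 0.5.
preprocess = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

image = preprocess(Image.open("page.png").convert("RGB")).unsqueeze(0)  # [1, 3, 1024, 1024]
```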
## Source

Extracted from [deepseek-ai/DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR).