Add model card with pipeline tag, paper, code, and usage example

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +66 -3
README.md CHANGED
@@ -1,3 +1,66 @@
- ---
- license: cc-by-nc-4.0
- ---
+ ---
+ license: cc-by-nc-4.0
+ pipeline_tag: zero-shot-image-classification
+ ---
+
+ # SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
+
+ This repository is the official implementation of **SmartCLIP**, presented in the paper [SmartCLIP: Modular Vision-language Alignment with Identification Guarantees](https://huggingface.co/papers/2507.22264).
+
+ SmartCLIP improves CLIP training with a mask-based, modular vision-language alignment scheme that handles captions at different levels of granularity, from short prompts to long descriptions. It addresses the information misalignment and entangled representations that arise in existing contrastive learning methods, preserving cross-modal semantic information while disentangling visual representations so that they capture fine-grained textual concepts.
+
+ Code: [https://github.com/MidPush/SmartCLIP](https://github.com/MidPush/SmartCLIP)
+
+ ## Abstract
+
+ Contrastive Language-Image Pre-training (CLIP) has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance at aligning visual and textual representations through contrastive learning. However, CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representation. On the one hand, short captions for a single image in datasets like MSCOCO may describe disjoint regions in the image, leaving the model uncertain about which visual features to retain or disregard. On the other hand, directly aligning long captions with images can lead to the retention of entangled details, preventing the model from learning disentangled, atomic concepts -- ultimately limiting its generalization on certain downstream tasks involving short prompts. In this paper, we establish theoretical conditions that enable flexible alignment between textual and visual representations across varying levels of granularity. Specifically, our framework ensures that a model can not only *preserve* cross-modal semantic information in its entirety but also *disentangle* visual representations to capture fine-grained textual concepts. Building on this foundation, we introduce **SmartCLIP**, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner. Superior performance across various tasks demonstrates its capability to handle information misalignment and supports our identification theory.
+
+ ## Usage
+
+ Our model is based on the [CLIP](https://github.com/openai/CLIP) framework. You can use it by loading the pre-trained checkpoints and performing inference as shown below.
+
+ First, clone the repository and download the trained models:
+
+ ```bash
+ git clone https://github.com/MidPush/SmartCLIP.git
+ cd SmartCLIP
+ mkdir checkpoints
+ wget https://huggingface.co/Shaoan/SmartCLIP/resolve/main/smartclip_l14.pt -O checkpoints/smartclip_l14.pt
+ wget https://huggingface.co/Shaoan/SmartCLIP/resolve/main/smartclip_b16.pt -O checkpoints/smartclip_b16.pt
+ ```
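+
+ If you prefer to fetch the checkpoints programmatically, the snippet below is a minimal sketch using the `huggingface_hub` client; the repo id `Shaoan/SmartCLIP` and the filename are taken from the URLs above.
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # Download the ViT-L/14 checkpoint into ./checkpoints (the hub client also caches it).
+ ckpt_path = hf_hub_download(
+     repo_id="Shaoan/SmartCLIP",
+     filename="smartclip_l14.pt",
+     local_dir="checkpoints",
+ )
+ print(ckpt_path)
+ ```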
+
+ Then, you can use the model for tasks like zero-shot image classification:
+
+ ```python
+ from model import longclip
+ import torch
+ from PIL import Image
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model, preprocess = longclip.load("./checkpoints/smartclip_l14.pt", device=device)
+
+ # Candidate captions and the query image.
+ text = longclip.tokenize(["A cat is holding a yellow sign", "A dog is holding a yellow sign"]).to(device)
+ image = preprocess(Image.open("./assets/cat.webp")).unsqueeze(0).to(device)
+
+ with torch.no_grad():
+     # Encode the image and the captions into the shared embedding space.
+     image_features = model.encode_image(image)
+     text_features = model.encode_text(text)
+
+     # Image-to-text similarity scores, converted to probabilities over the captions.
+     logits_per_image = image_features @ text_features.T
+     probs = logits_per_image.softmax(dim=-1).cpu().numpy()
+
+ print("Label probabilities:", probs)
+ ```
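+
+ Note that the dot products above are computed on unnormalized features. If you want cosine similarities (the usual CLIP convention), a minimal sketch reusing the `model`, `image`, and `text` variables from the example above is:
+
+ ```python
+ import torch.nn.functional as F
+
+ with torch.no_grad():
+     # L2-normalize so the dot product becomes a cosine similarity in [-1, 1].
+     image_emb = F.normalize(model.encode_image(image), dim=-1)
+     text_emb = F.normalize(model.encode_text(text), dim=-1)
+     cosine_sim = image_emb @ text_emb.T
+
+ print("Cosine similarities:", cosine_sim.cpu().numpy())
+ ```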
+
+ ## Citation
+
+ If you find our work helpful for your research, please consider citing our paper:
+
+ ```bibtex
+ @inproceedings{xie2025smartclip,
+   title={SmartCLIP: Modular Vision-language Alignment with Identification Guarantees},
+   author={Xie, Shaoan and Kong, Lingjing and Zheng, Yujia and Yao, Yu and Tang, Zeyu and Xing, Eric P. and Chen, Guangyi and Zhang, Kun},
+   booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
+   pages={29780--29790},
+   year={2025}
+ }
+ ```