finhdev committed on
Commit ce813cc · verified · 1 Parent(s): 407a13c

Update README.md

Files changed (1)
  1. README.md +130 -44

README.md CHANGED
@@ -1,65 +1,151 @@
  ---
- license: apple-amlr
- license_name: apple-ascl
- license_link: https://github.com/apple/ml-mobileclip/blob/main/LICENSE_weights_data
- library_name: mobileclip
- ---

- # MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

- MobileCLIP was introduced in [MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
- ](https://arxiv.org/pdf/2311.17049.pdf) (CVPR 2024), by Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel.

- This repository contains the **MobileCLIP-B** checkpoint.

- ![MobileCLIP Performance Figure](fig_accuracy_latency.png)

- ### Highlights

- * Our smallest variant `MobileCLIP-S0` obtains similar zero-shot performance as [OpenAI](https://arxiv.org/abs/2103.00020)'s ViT-B/16 model while being 4.8x faster and 2.8x smaller.
- * `MobileCLIP-S2` obtains better avg zero-shot performance than [SigLIP](https://arxiv.org/abs/2303.15343)'s ViT-B/16 model while being 2.3x faster and 2.1x smaller, and trained with 3x less seen samples.
- * `MobileCLIP-B`(LT) attains zero-shot ImageNet performance of **77.2%** which is significantly better than recent works like [DFN](https://arxiv.org/abs/2309.17425) and [SigLIP](https://arxiv.org/abs/2303.15343) with similar architectures or even [OpenAI's ViT-L/14@336](https://arxiv.org/abs/2103.00020).

- ## Checkpoints

- | Model | # Seen <BR>Samples (B) | # Params (M) <BR> (img + txt) | Latency (ms) <BR> (img + txt) | IN-1k Zero-Shot <BR> Top-1 Acc. (%) | Avg. Perf. (%) <BR> on 38 datasets |
- |:----------------------------------------------------------|:----------------------:|:-----------------------------:|:-----------------------------:|:-----------------------------------:|:----------------------------------:|
- | [MobileCLIP-S0](https://hf.co/pcuenq/MobileCLIP-S0) | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 |
- | [MobileCLIP-S1](https://hf.co/pcuenq/MobileCLIP-S1) | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 |
- | [MobileCLIP-S2](https://hf.co/pcuenq/MobileCLIP-S2) | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 |
- | [MobileCLIP-B](https://hf.co/pcuenq/MobileCLIP-B) | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 |
- | [MobileCLIP-B (LT)](https://hf.co/pcuenq/MobileCLIP-B-LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 |

- ## How to Use

- First, download the desired checkpoint visiting one of the links in the table above, then click the `Files and versions` tab, and download the PyTorch checkpoint.
- For programmatic downloading, if you have `huggingface_hub` installed, you can also run:

  ```
- huggingface-cli download pcuenq/MobileCLIP-B
- ```

- Then, install [`ml-mobileclip`](https://github.com/apple/ml-mobileclip) by following the instructions in the repo. It uses an API similar to [`open_clip`'s](https://github.com/mlfoundations/open_clip).
- You can run inference with a code snippet like the following:

- ```py
- import torch
- from PIL import Image
- import mobileclip

- model, _, preprocess = mobileclip.create_model_and_transforms('mobileclip_b', pretrained='/path/to/mobileclip_b.pt')
- tokenizer = mobileclip.get_tokenizer('mobileclip_b')

- image = preprocess(Image.open("docs/fig_accuracy_latency.png").convert('RGB')).unsqueeze(0)
- text = tokenizer(["a diagram", "a dog", "a cat"])

- with torch.no_grad(), torch.cuda.amp.autocast():
-     image_features = model.encode_image(image)
-     text_features = model.encode_text(text)
-     image_features /= image_features.norm(dim=-1, keepdim=True)
-     text_features /= text_features.norm(dim=-1, keepdim=True)

-     text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

- print("Label probs:", text_probs)
  ```
+ # 📸 MobileCLIP-B Zero-Shot Image Classifier — HF Inference Endpoint
+
+ This repository packages Apple’s **MobileCLIP-B** model as a production-ready
+ Hugging Face Inference Endpoint.
+
+ * **One-shot image → class probabilities**
+   ⚡ < 30 ms on an A10G / T4 once the image arrives.
+ * **Branch-fused / FP16** MobileCLIP for fast GPU inference.
+ * **Pre-computed text embeddings** for your custom label set
+   (`items.json`) — every request encodes **only** the image.
+ * Built with vanilla **`open-clip-torch`** (no forks) and a
+   60-line local helper (`reparam.py`) to fuse MobileOne blocks.
+
  ---

+ ## What’s inside
+
+ | File | Purpose |
+ |------|---------|
+ | `handler.py` | Hugging Face entry-point — loads weights, caches text features, serves requests |
+ | `reparam.py` | Stand-alone copy of `reparameterize_model` from Apple’s repo (removes heavy upstream dependency) |
+ | `requirements.txt` | Minimal, conflict-free dependency set (`torch`, `torchvision`, `open-clip-torch`) |
+ | `items.json` | Your label spec — each element must have `id`, `name`, and `prompt` fields |
+ | `README.md` | You are here |
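+
+ For reference, here is a minimal sketch of how the helper in `reparam.py` could be applied
+ on top of vanilla `open-clip-torch` at load time. The model and tag names (`MobileCLIP-B`,
+ `datacompdr`) follow the checkpoint mentioned below and should be treated as assumptions,
+ not as the exact contents of `handler.py`:
+
+ ```py
+ import torch
+ import open_clip
+
+ from reparam import reparameterize_model  # local copy of Apple's helper
+
+ # Load the MobileCLIP-B weights through vanilla open-clip-torch (names assumed).
+ model, _, preprocess = open_clip.create_model_and_transforms(
+     "MobileCLIP-B", pretrained="datacompdr"
+ )
+ tokenizer = open_clip.get_tokenizer("MobileCLIP-B")
+
+ # Fuse the MobileOne branches for inference, then switch to FP16 on GPU if available.
+ model.eval()
+ model = reparameterize_model(model)
+ if torch.cuda.is_available():
+     model = model.half().cuda()
+ ```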
+
+ ---
+
+ ## 🔧 Quick start (local smoke-test)
+
+ ```bash
+ python -m venv venv && source venv/bin/activate
+ pip install -r requirements.txt
+ python - <<'PY'
+ import base64
+ from pathlib import Path
+
+ # Load a demo image and base64-encode it
+ img_path = Path("tests/cat.jpg")
+ payload = {
+     "image": base64.b64encode(img_path.read_bytes()).decode()
+ }
+
+ # Local smoke-test: instantiate the handler directly instead of going through the HTTP server
+ import handler
+ app = handler.EndpointHandler()
+
+ print(app({"inputs": payload})[:5])  # top-5 classes
+ PY
+ ```
+
+ ---
+
+ ## 🚀 Calling the deployed endpoint
+
+ ```bash
+ export ENDPOINT_URL="https://<your-endpoint>.aws.endpoints.huggingface.cloud"
+ export HF_TOKEN="hf_xxxxxxxxxxxxxxxxx"
+ IMG="cat.jpg"
+
+ python - "$IMG" <<'PY'
+ import base64, json, requests, sys, os
+ url = os.environ["ENDPOINT_URL"]
+ token = os.environ["HF_TOKEN"]
+ img = sys.argv[1]
+
+ payload = {
+     "inputs": {
+         "image": base64.b64encode(open(img, "rb").read()).decode()
+     }
+ }
+ resp = requests.post(
+     url,
+     headers={
+         "Authorization": f"Bearer {token}",
+         "Content-Type": "application/json",
+         "Accept": "application/json",
+     },
+     json=payload,
+     timeout=60,
+ )
+ print(json.dumps(resp.json()[:5], indent=2))  # top-5
+ PY
+ ```
+
+ Sample response:
+
+ ```json
+ [
+   { "id": 23, "label": "cat", "score": 0.92 },
+   { "id": 11, "label": "tiger cat", "score": 0.05 },
+   { "id": 48, "label": "siamese cat", "score": 0.02 }
+ ]
  ```
+
+ ---
+
+ ## 🏗️ How the handler works (high-level)
+
+ 1. **Startup**
+
+    * Downloads / loads the `datacompdr` MobileCLIP-B checkpoint.
+    * Runs `reparameterize_model` to fuse MobileOne branches.
+    * Reads `items.json`, tokenises all prompts, and caches the resulting
+      text embeddings (`[n_classes, 512]`).
+
+ 2. **Per request**
+
+    * Decodes the incoming base-64 JPEG/PNG.
+    * Applies the exact OpenCLIP preprocessing (224 × 224 center-crop,
+      mean/std normalisation).
+    * Encodes the image, L2-normalises, and performs one `softmax(cosine)`
+      against the cached text matrix.
+    * Returns a sorted JSON list `[{"id", "label", "score"}, …]`.
+
+ This design keeps bandwidth low (compressed image over the wire) and
+ latency low (no per-request text encoding).
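+
+ For orientation, the per-request path described above might look roughly like the sketch
+ below. Attribute names such as `self.text_features` and `self.items` are illustrative
+ assumptions, not the exact contents of `handler.py`:
+
+ ```py
+ import base64, io
+ from typing import Any, Dict, List
+
+ import torch
+ from PIL import Image
+
+
+ class EndpointHandler:
+     # __init__ (not shown) is assumed to have prepared:
+     #   self.model, self.preprocess  - fused MobileCLIP-B loaded via open_clip
+     #   self.text_features           - cached [n_classes, 512] matrix, L2-normalised
+     #   self.items                   - parsed items.json entries
+     #   self.device                  - "cuda" if available, else "cpu"
+
+     def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
+         # Decode the incoming base-64 JPEG/PNG.
+         raw = base64.b64decode(data["inputs"]["image"])
+         image = Image.open(io.BytesIO(raw)).convert("RGB")
+
+         # OpenCLIP preprocessing: resize, 224 x 224 centre-crop, mean/std normalisation.
+         pixel_values = self.preprocess(image).unsqueeze(0).to(self.device)
+
+         with torch.no_grad():
+             # Encode the image and L2-normalise.
+             image_features = self.model.encode_image(pixel_values)
+             image_features /= image_features.norm(dim=-1, keepdim=True)
+
+             # One softmax over cosine similarities against the cached text matrix.
+             probs = (100.0 * image_features @ self.text_features.T).softmax(dim=-1)[0]
+
+         # Sorted, JSON-serialisable list of {"id", "label", "score"}.
+         order = probs.argsort(descending=True).tolist()
+         return [
+             {"id": self.items[i]["id"], "label": self.items[i]["name"], "score": float(probs[i])}
+             for i in order
+         ]
+ ```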
+
+ ---
+
+ ## 📝 Updating the label set
+
+ Edit `items.json`, **rebuild the endpoint**, done.
+
+ ```json
+ [
+   { "id": 0, "name": "cat", "prompt": "a photo of a cat" },
+   { "id": 1, "name": "dog", "prompt": "a photo of a dog" }
+ ]
+ ```
+
+ * `id` is your internal numeric key (stays stable).
+ * `name` is the human-readable label returned to clients.
+ * `prompt` is what the model actually “sees” — tweak wording to improve accuracy (see the sketch below).
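+
+ As a rough illustration of what the startup step does with these fields (variable names are
+ assumptions, with `model` and `tokenizer` loaded via `open_clip` as sketched earlier, not the
+ exact code in `handler.py`):
+
+ ```py
+ import json
+
+ import torch
+
+ # Read the label spec and encode every prompt once; requests then reuse this matrix.
+ with open("items.json") as f:
+     items = json.load(f)
+
+ tokens = tokenizer([item["prompt"] for item in items])
+ with torch.no_grad():
+     text_features = model.encode_text(tokens)
+     text_features /= text_features.norm(dim=-1, keepdim=True)  # shape [n_classes, 512]
+ ```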
+
+ ---
+
+ ## ⚖️ Licence
+
+ * **Weights**: Apple AMLR (see [`LICENSE_weights_data`](./LICENSE_weights_data)).
+ * **Code in this repo**: MIT.
+
+ ---
+
+ <div align="center"><sub>Maintained with ❤️ by Your Team — August 2025</sub></div>