Remove info
app.py
CHANGED
|
@@ -1286,71 +1286,11 @@ with gr.Blocks(title="Babel-ImageNet Quiz") as demo:
|
|
| 1286 |
# Title Area
|
| 1287 |
gr.Markdown(
|
| 1288 |
"""
|
| 1289 |
-
# Are you smarter🤓 than CLIP🤖?
|
| 1290 |
-
|
| 1291 |
-
<small>by Gregor Geigle, WüNLP & Computer Vision Lab, University of Würzburg</small>
|
| 1292 |
-
|
| 1293 |
-
In this quiz, you play against a CLIP model (specifically: [mSigLIP](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256), a multilingual [SigLIP](https://arxiv.org/abs/2303.15343) model) and try to correctly classify the images over the 1000 ImageNet classes (in English) or over our (partial) Babel-ImageNet translations of those classes.
|
| 1294 |
-
Select your language, click 'Start' and start guessing! We'll keep track of your score and of your opponent's.
|
| 1295 |
-
> **Disclaimer:** Translations and images are derived automatically and can be wrong, unusual, or mismatch! This is supposed to be a fun game to explore the dataset and see how a CLIP model would answer the questions and not a product.
|
| 1296 |
-
> We do *not* use the official ImageNet images. Instead, we use images linked in BabelNet for each class, which are often from Wikipedia and have not been checked for suitability.
|
| 1297 |
-
|
| 1298 |
-
> **Content Warning:** There are spiders, insects, and various animals under the images. Please take caution if those might scare you.
|
| 1299 |
-
|
| 1300 |
-
<details>
|
| 1301 |
-
<summary> <b> FAQ</b> (click me to read)</summary>
|
| 1302 |
-
<p><b>'Over 1000 classes? I just see 4.'</b> True, you have it easier and you only have to choose between 4 classes. These are the top-4 picks of your opponent (+ the correct class if they are wrong). Your opponent has it harder: they have to deal with all classes.</p>
|
| 1303 |
-
<p><b>'Who is my opponent?'</b> Your opponent CLIP model is [mSigLIP](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256), a powerful but small multilingual model with only 370M parameters.</p>
|
| 1304 |
-
<p><b>'My game crashed/ I got an error!'</b> This usually happens because of problems with the image URLs. You can try the button to reroll the image or start a new round by clicking the 'Start' button again.</p>
|
| 1305 |
-
</details>
|
| 1306 |
"""
|
| 1307 |
)
|
| 1308 |
-
|
| 1309 |
-
with gr.Column(scale=1):
|
| 1310 |
-
gr.Markdown(
|
| 1311 |
-
"""
|
| 1312 |
-
<details>
|
| 1313 |
-
<summary> <b>What is CLIP? </b> (click me to read)</summary>
|
| 1314 |
-
<p>
|
| 1315 |
-
<a href='https://arxiv.org/abs/2103.00020'>CLIP</a> are vision-language models that learn to encode images and text in a joint semantic embedding space, where related concepts are close together.
|
| 1316 |
-
With CLIP, you can search through, filter, or group large image datasets. The image encoder in CLIP also powers many of the large vision language models like Llava 1.5!
|
| 1317 |
-
</p>
|
| 1318 |
-
<p>
|
| 1319 |
-
Your opponent CLIP model [mSigLIP](https://arxiv.org/abs/2303.15343) in this quiz does 'zero-shot image classification': We encode all possible class labels and the image and we check which class is most similar; this is then the class chosen by CLIP.
|
| 1320 |
-
</p>
|
| 1321 |
-
</details>
|
| 1322 |
-
"""
|
| 1323 |
-
)
|
| 1324 |
-
with gr.Column(scale=1):
|
| 1325 |
-
gr.Markdown(
|
| 1326 |
-
"""
|
| 1327 |
-
<details>
|
| 1328 |
-
<summary> <b>What is ImageNet? </b> (click me to read)</summary>
|
| 1329 |
-
<p>
|
| 1330 |
-
ImageNet is a challenging image classification dataset with 1000 diverse classes covering animals, plants, human-made objects and more.
|
| 1331 |
-
It is a very popular dataset used to benchmark CLIP models because strong results here usually indicate that the image model is overall useful for many tasks.
|
| 1332 |
-
</p>
|
| 1333 |
-
</details>
|
| 1334 |
-
"""
|
| 1335 |
-
)
|
| 1336 |
-
with gr.Column(scale=1):
|
| 1337 |
-
gr.Markdown(
|
| 1338 |
-
"""
|
| 1339 |
-
<details>
|
| 1340 |
-
<summary> <b>What is Babel-ImageNet? </b> (click me to read)</summary>
|
| 1341 |
-
<p>
|
| 1342 |
-
ImageNet class labels are only in English but we want to use CLIP models also in other languages. How can we know how good a CLIP model is outside of English?
|
| 1343 |
-
This is the goal of Babel-ImageNet: to translate the English labels to other languages. However, automatic translation can give bad results for many languages and human translation is expensive.
|
| 1344 |
-
</p>
|
| 1345 |
-
<p>
|
| 1346 |
-
Instead, we use the fact that ImageNet was constructed using WordNet and WordNet in turn can be linked to the multilingual resource BabelNet.
|
| 1347 |
-
Using this link, we can get reliable (partial) translations of the English labels.
|
| 1348 |
-
For more details, please read our <a href='https://arxiv.org/abs/2306.08658'>paper.</a>
|
| 1349 |
-
</p>
|
| 1350 |
-
</details>
|
| 1351 |
-
"""
|
| 1352 |
-
)
|
| 1353 |
-
# language select dropdown
|
| 1354 |
with gr.Row():
|
| 1355 |
# language_select = gr.Dropdown(
|
| 1356 |
# choices=main_language_values,
|
| 1286 |
# Title Area
|
| 1287 |
gr.Markdown(
|
| 1288 |
"""
|
| 1289 |
+
# Are you smarter🤓 than CLIP🤖?
|
| 1290 |
+
<small>adapted from the original code by Gregor Geigle, WüNLP & Computer Vision Lab, University of Würzburg</small>
|
| 1291 |
"""
|
| 1292 |
)
|
| 1293 |
+
|
| 1294 |
with gr.Row():
|
| 1295 |
# language_select = gr.Dropdown(
|
| 1296 |
# choices=main_language_values,
|
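The info text removed in this commit describes zero-shot image classification (encode every class label and the image, then pick the most similar class) and says the four quiz options are the opponent's top-4 picks plus the correct class. The following is a minimal sketch of that idea, not the code from app.py: the helper names `cosine`, `rank_classes`, and `quiz_options` are hypothetical, the 3-d vectors are toy stand-ins for real encoder outputs, and replacing the fourth pick with the correct class when it is missing from the top-4 is an assumption about how the options are assembled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_classes(image_emb, label_embs):
    """Return class names sorted from most to least similar to the image."""
    scores = {name: cosine(image_emb, emb) for name, emb in label_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def quiz_options(ranking, correct, k=4):
    """Take the model's top-k picks as answer options.

    Assumption: if the correct class is not among them, it replaces
    the lowest-ranked pick so the player can always answer correctly.
    """
    options = ranking[:k]
    if correct not in options:
        options = options[:k - 1] + [correct]
    return options


# Toy embeddings (hypothetical values); a real setup would take these
# from the text and image encoders of a CLIP-style model.
label_embs = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.6, 0.0],
    "car": [0.0, 1.0, 0.0],
    "tree": [0.0, 0.0, 1.0],
    "boat": [0.0, 0.6, 0.8],
}
image_emb = [0.9, 0.1, 0.0]

ranking = rank_classes(image_emb, label_embs)
model_answer = ranking[0]  # the zero-shot prediction
options = quiz_options(ranking, correct="tree")
```

Here the model ranks over all classes while the player only sees `options`, which matches the asymmetry described in the removed FAQ. Whether the real app shuffles the options or handles ties differently is not visible in this diff.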