# Evaluation of GAR
## 1. GARBench
### 1.1 GARBench-Caption-Simple
First, perform inference, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=9811 \
evaluation/GAR-Bench/inference.py \
--model_name_or_path HaochenWang/GAR-8B \
--anno_file evaluation/GAR-Bench/annotations/GAR-Bench-Caption-Simple.json \
--mode simple \
--cache_name ${CACHE_NAME} \
--data_type bf16 \
--seed 42
```
The generated descriptions will be saved to ```evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_simple.json```.
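Here, ```${CACHE_NAME}``` is simply a label for this run, set before launching inference; predictions (and, later, evaluation results) are cached under it. For example (the label below is only an illustration):
```bash
# Any string works; the same value is reused by the evaluation step below.
export CACHE_NAME=gar8b_caption_simple
```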
Next, run the evaluation, which uses GPT-4o (with access to the images) as the judge.
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
python3 evaluation/GAR-Bench/eval_simple.py --pred evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_simple.json
```
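The two environment variables follow the usual Azure OpenAI conventions; a hypothetical example (the resource name and key below are placeholders, not real values):
```bash
# Placeholders only: substitute your own Azure OpenAI endpoint and key.
export AZURE_OPENAI_ENDPOINT="https://my-resource.openai.azure.com/"
export AZURE_OPENAI_KEY="0123456789abcdef0123456789abcdef"
```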
The reference cache (including model predictions and evaluation results) is stored in ```model_outputs/```. Due to randomness in the LLM judge, the final scores may differ slightly even with identical predictions (even with ```temperature=0```).
To re-run the evaluation from scratch, use your own ```CACHE_NAME```.
Reference results:
```bash
# GAR-1B
Accuracy: 0.5567010309278351
# GAR-8B
Accuracy: 0.6391752577319587
```
### 1.2 GARBench-Caption-Detailed
First, perform inference, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=9811 \
evaluation/GAR-Bench/inference.py \
--model_name_or_path HaochenWang/GAR-8B \
--anno_file evaluation/GAR-Bench/annotations/GAR-Bench-Caption-Detailed.json \
--mode detailed \
--cache_name ${CACHE_NAME} \
--data_type bf16 \
--seed 42
```
The generated descriptions will be saved to ```evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_detailed.json```.
Next, run the evaluation, which uses GPT-4o (with access to the images) as the judge.
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
python3 evaluation/GAR-Bench/eval_detailed.py --pred evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_detailed.json
```
The reference cache (including model predictions and evaluation results) is stored in ```model_outputs/```. Due to randomness in the LLM judge, the final scores may differ slightly even with identical predictions (even with ```temperature=0```).
To re-run the evaluation from scratch, use your own ```CACHE_NAME```.
Reference results:
```bash
# GAR-1B
Accuracy: 0.6635514018691588
# GAR-8B
Accuracy: 0.6915887850467289
```
### 1.3 GARBench-VQA
Perform inference, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=9811 \
evaluation/GAR-Bench/inference.py \
--model_name_or_path HaochenWang/GAR-8B \
--anno_file evaluation/GAR-Bench/annotations/GAR-Bench-VQA.json \
--mode vqa \
--cache_name ${CACHE_NAME} \
--data_type bf16 \
--seed 42
```
The reference cache (including model predictions and evaluation results) is stored in ```model_outputs/```.
To re-run the evaluation from scratch, use your own ```CACHE_NAME```.
Reference results:
```
# GAR-1B
color: [34/69]=49.3
texture/pattern: [17/29]=58.6
mirror: [36/61]=59.0
ordering: [13/64]=20.3
material: [14/36]=38.9
shape: [32/64]=50.0
relation: [57/101]=56.4
=> overall: [203/424]=47.9
# GAR-8B
texture/pattern: [22/29]=75.9
material: [19/36]=52.8
mirror: [36/61]=59.0
relation: [66/101]=65.4
shape: [34/64]=53.1
ordering: [28/64]=43.8
color: [40/69]=58.0
=> overall: [245/424]=57.8
```
## 2. DLC-Bench
First, download the DLC-Bench images and place the ```images``` folder inside the ```annotations``` directory:
```bash
cd evaluation/DLC-Bench/annotations
hf download nvidia/DLC-Bench --repo-type dataset --include "images/*" --exclude "*" --local-dir ./
```
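If the newer ```hf``` CLI is not available in your environment, the older ```huggingface-cli``` entry point from ```huggingface_hub``` accepts the same subcommand and flags (an equivalent sketch, assuming a reasonably recent ```huggingface_hub``` install):
```bash
cd evaluation/DLC-Bench/annotations
huggingface-cli download nvidia/DLC-Bench --repo-type dataset --include "images/*" --exclude "*" --local-dir ./
```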
The overall structure should be:
```bash
evaluation/DLC-Bench/annotations
├── annotations.json
├── class_names.json
├── images
│   └── objects365_v2_*.jpg
└── qa.json
```
Next, perform inference to obtain detailed descriptions, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=8841 \
evaluation/DLC-Bench/inference.py \
--model_name_or_path HaochenWang/GAR-8B \
--cache_name ${CACHE_NAME} \
--data_type bf16 \
--seed 42
```
The generated descriptions will be saved to ```evaluation/DLC-Bench/model_outputs/${CACHE_NAME}.json```.
Finally, perform the evaluation, either with images using GPT-4o or without images using Llama3.1-8B.
**Option 1. Using GPT-4o *with* images (Recommended)**
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
python3 evaluation/DLC-Bench/eval_gpt_with_image.py --pred evaluation/DLC-Bench/model_outputs/${CACHE_NAME}.json
```
**Option 2. Using Llama3.1-8B *without* images**
First, we need to serve Llama3.1-8B using vLLM.
```bash
bash evaluation/DLC-Bench/serve_judge.sh
```
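```serve_judge.sh``` is the authoritative launcher; roughly, it is expected to start an OpenAI-compatible vLLM server for the judge on the port used by the evaluation command below. A minimal sketch, assuming the ```meta-llama/Llama-3.1-8B-Instruct``` checkpoint and default flags (the script itself may differ):
```bash
# Sketch only: serve an OpenAI-compatible judge endpoint on port 8007.
# The actual model path, port, and flags are defined in serve_judge.sh.
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8007 --dtype bfloat16
```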
Next, on the *same* node, run the evaluation.
```bash
python3 evaluation/DLC-Bench/eval_llama_without_image.py --pred evaluation/DLC-Bench/model_outputs/${CACHE_NAME}.json --base_url http://localhost:8007/v1
```
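If the evaluation cannot reach the judge, first confirm the server has finished loading and is reachable; vLLM's OpenAI-compatible server exposes the standard ```/v1/models``` route:
```bash
# Should return a JSON list containing the served judge model.
curl http://localhost:8007/v1/models
```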
For more details on the differences between these two evaluation settings, please refer to Appendix F of our paper.
The reference cache (including model predictions and evaluation results) is stored in ```model_outputs/```. Due to randomness in the LLM judge, the final scores may differ slightly even with identical predictions (even with ```temperature=0```).
To re-run the evaluation from scratch, use your own ```CACHE_NAME```.
Reference results:
```bash
# GAR-1B
# By GPT-4o (with images):
Summary (Pos Neg Avg(Pos, Neg)): 0.662, 0.880, 0.771
# By Llama3.1-8B (without images):
Summary (Pos Neg Avg(Pos, Neg)): 0.489, 0.870, 0.679
# GAR-8B
# By GPT-4o (with images):
Summary (Pos Neg Avg(Pos, Neg)): 0.680, 0.860, 0.770
# By Llama3.1-8B (without images):
Summary (Pos Neg Avg(Pos, Neg)): 0.502, 0.846, 0.674
```
## 3. Ferret-Bench
First, perform inference to obtain detailed descriptions, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=8841 \
evaluation/Ferret-Bench/inference.py \
--model_name_or_path HaochenWang/GAR-8B \
--cache_name ${CACHE_NAME} \
--data_type bf16 \
--seed 42
```
The generated descriptions will be saved to ```evaluation/Ferret-Bench/model_outputs/${CACHE_NAME}.json```.
Then, perform the evaluation using GPT-4o as the judge.
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
cd evaluation/Ferret-Bench
bash eval.sh ${CACHE_NAME}
```
Reference model predictions are stored in ```model_outputs/```, and reference evaluation results are stored in ```gpt4_result/```. Due to randomness in the LLM judge, the final scores may differ slightly even with identical predictions (even with ```temperature=0```).
To re-run the evaluation from scratch, use your own ```CACHE_NAME```.
Reference results:
```bash
# GAR-1B
review_refer_desc
all 56.0
refer_desc 56.0
=================================
# GAR-8B
review_refer_desc
all 64.8
refer_desc 64.8
=================================
```
## 4. MDVP-Bench
First, perform inference to obtain detailed descriptions, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=8841 \
evaluation/MDVP-Bench/inference.py \
--model_name_or_path HaochenWang/GAR-8B \
--cache_name ${CACHE_NAME} \
--data_type bf16 \
--seed 42
```
The generated descriptions will be saved to ```evaluation/MDVP-Bench/model_outputs/${CACHE_NAME}.json```.
Then, perform the evaluation using GPT-4o as the judge.
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
cd evaluation/MDVP-Bench
bash eval.sh model_outputs/${CACHE_NAME}.json
```
Reference model predictions are stored in ```model_outputs/```. Due to randomness in the LLM judge, the final scores may differ slightly even with identical predictions (even with ```temperature=0```).
To re-run the evaluation from scratch, use your own ```CACHE_NAME```.
Reference results:
```bash
# GAR-1B
android_detailed_caption_box 80.65
multipanel_detailed_caption_box 103.7
natural_detailed_caption_box 152.63
ocr_doc_detailed_caption_box 146.87
ocr_spotting_detailed_caption_box 152.38
web_detailed_caption_box 150.0
# Natural = natural_detailed_caption_box = 152.6
# OCR = (ocr_doc_detailed_caption_box + ocr_spotting_detailed_caption_box) / 2 = 149.6
# Multi-Panel = multipanel_detailed_caption_box = 103.7
# Screenshot = (android_detailed_caption_box + web_detailed_caption_box) / 2 = 115.3
# GAR-8B
android_detailed_caption_box 113.79
multipanel_detailed_caption_box 117.24
natural_detailed_caption_box 178.57
ocr_doc_detailed_caption_box 138.10
ocr_spotting_detailed_caption_box 160.0
web_detailed_caption_box 132.26
# Natural = natural_detailed_caption_box = 178.6
# OCR = (ocr_doc_detailed_caption_box + ocr_spotting_detailed_caption_box) / 2 = 149.1
# Multi-Panel = multipanel_detailed_caption_box = 117.2
# Screenshot = (android_detailed_caption_box + web_detailed_caption_box) / 2 = 123.0
```