# Evaluation of GAR
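All steps below tag their cached outputs with the `CACHE_NAME` environment variable, so set it once before running any command. The value itself is arbitrary; the one below is only an example.

```bash
# CACHE_NAME is an arbitrary tag used to name the cached predictions and results;
# "gar-8b-eval" is just an example value.
export CACHE_NAME=gar-8b-eval
```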
## 1. GAR-Bench

### 1.1 GAR-Bench-Caption-Simple
First, perform inference, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=9811 \
    evaluation/GAR-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --anno_file evaluation/GAR-Bench/annotations/GAR-Bench-Caption-Simple.json \
    --mode simple \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```
The generated descriptions will be saved to evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_simple.json
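If you want to spot-check the generated file before running the judge, pretty-printing its first lines is enough (the exact JSON layout depends on the inference script):

```bash
# Optional sanity check: pretty-print the beginning of the cached predictions.
python3 -m json.tool evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_simple.json | head -n 20
```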
Next, perform the evaluation using GPT-4o as the judge (with images).
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
python3 evaluation/GAR-Bench/eval_simple.py --pred evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_simple.json
```
The reference cache (including model predictions and evaluation results) is stored in `model_outputs/`. Due to randomness in the LLM judge, the final performance may differ slightly even with the same predictions (even with temperature=0).
To re-run the evaluation, change `CACHE_NAME` to your own value.
Reference results:
```bash
# GAR-1B
Accuracy: 0.5567010309278351
# GAR-8B
Accuracy: 0.6391752577319587
```
### 1.2 GAR-Bench-Caption-Detailed
First, perform inference, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=9811 \
    evaluation/GAR-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --anno_file evaluation/GAR-Bench/annotations/GAR-Bench-Caption-Detailed.json \
    --mode detailed \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```
The generated descriptions will be saved to evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_detailed.json
Next, perform the evaluation using GPT-4o as the judge (with images).
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
python3 evaluation/GAR-Bench/eval_detailed.py --pred evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_detailed.json
```
The reference cache (including model predictions and evaluation results) is stored in `model_outputs/`. Due to randomness in the LLM judge, the final performance may differ slightly even with the same predictions (even with temperature=0).
To re-run the evaluation, change `CACHE_NAME` to your own value.
Reference results:
```bash
# GAR-1B
Accuracy: 0.6635514018691588
# GAR-8B
Accuracy: 0.6915887850467289
```
### 1.3 GAR-Bench-VQA
Perform inference, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=9811 \
    evaluation/GAR-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --anno_file evaluation/GAR-Bench/annotations/GAR-Bench-VQA.json \
    --mode vqa \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```
The reference cache (including model predictions and evaluation results) is stored in `model_outputs/`.
To re-run the evaluation, change `CACHE_NAME` to your own value.
Reference results:
```bash
# GAR-1B
color: [34/69]=49.3
texture/pattern: [17/29]=58.6
mirror: [36/61]=59.0
ordering: [13/64]=20.3
material: [14/36]=38.9
shape: [32/64]=50.0
relation: [57/101]=56.4
=> overall: [203/424]=47.9

# GAR-8B
texture/pattern: [22/29]=75.9
material: [19/36]=52.8
mirror: [36/61]=59.0
relation: [66/101]=65.4
shape: [34/64]=53.1
ordering: [28/64]=43.8
color: [40/69]=58.0
=> overall: [245/424]=57.8
```
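Each per-category line reads `[correct/total]=accuracy`, and the overall score is the micro-average (total correct over total questions) rather than the mean of the category accuracies; e.g., for GAR-1B:

```bash
# 34+17+36+13+14+32+57 = 203 correct out of 69+29+61+64+36+64+101 = 424 questions
python3 -c "print(f'{(34+17+36+13+14+32+57) / (69+29+61+64+36+64+101) * 100:.1f}')"  # -> 47.9
```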
## 2. DLC-Bench
First, download the DLC-Bench images and put the images folder in the annotations directory:
```bash
cd evaluation/DLC-Bench/annotations
hf download nvidia/DLC-Bench --repo-type dataset --include "images/*" --exclude "*" --local-dir ./
```
The overall structure should be:
```
evaluation/DLC-Bench/annotations
├── annotations.json
├── class_names.json
├── images
│   └── objects365_v2_*.jpg
└── qa.json
```
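Optionally, confirm the images landed in the right place; the expected count depends on the DLC-Bench release, so treat a non-zero number as a pass:

```bash
# Count the downloaded benchmark images.
ls evaluation/DLC-Bench/annotations/images/objects365_v2_*.jpg | wc -l
```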
Next, perform inference to obtain detailed descriptions, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=8841 \
    evaluation/DLC-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```
The generated descriptions will be saved to evaluation/DLC-Bench/model_outputs/${CACHE_NAME}.json
Finally, perform evaluation (with images using GPT-4o or without images using Llama3.1-8B).
Option 1. Using GPT-4o with images (Recommended)
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
python3 evaluation/DLC-Bench/eval_gpt_with_image.py --pred evaluation/DLC-Bench/model_outputs/${CACHE_NAME}.json
```
Option 2. Using Llama3.1-8B without images
First, we need to serve Llama3.1-8B using vLLM.
```bash
bash evaluation/DLC-Bench/serve_judge.sh
```
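Once the server is up, you can verify it is reachable before launching the judge. This assumes serve_judge.sh exposes the standard OpenAI-compatible endpoint on port 8007, matching the `--base_url` used below; adjust the port if the script serves on a different one.

```bash
# Should return a JSON list containing the served Llama3.1-8B model.
curl http://localhost:8007/v1/models
```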
Next, on the same node, run evaluation.
```bash
python3 eval_llama_without_image.py --pred ../model_outputs/${CACHE_NAME}.json --base_url http://localhost:8007/v1
```
For more details on the differences between these two evaluation settings, please refer to Appendix F of our paper.
The reference cache (including model predictions and evaluation results) is stored in `model_outputs/`. Due to randomness in the LLM judge, the final performance may differ slightly even with the same predictions (even with temperature=0).
To re-run the evaluation, change `CACHE_NAME` to your own value.
Reference results:
```bash
# GAR-1B
# By GPT-4o (with images):
Summary (Pos Neg Avg(Pos, Neg)): 0.662, 0.880, 0.771
# By Llama3.1-8B (without images):
Summary (Pos Neg Avg(Pos, Neg)): 0.489, 0.870, 0.679

# GAR-8B
# By GPT-4o (with images):
Summary (Pos Neg Avg(Pos, Neg)): 0.680, 0.860, 0.770
# By Llama3.1-8B (without images):
Summary (Pos Neg Avg(Pos, Neg)): 0.502, 0.846, 0.674
```
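The three numbers in each `Summary` line are the positive-question score, the negative-question score, and their average (as the `Avg(Pos, Neg)` header indicates); e.g., for GAR-1B judged by GPT-4o:

```bash
python3 -c "print(round((0.662 + 0.880) / 2, 3))"  # -> 0.771
```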
## 3. Ferret-Bench
First, perform inference to obtain detailed descriptions, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=8841 \
    evaluation/Ferret-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```
The generated descriptions will be saved to evaluation/Ferret-Bench/model_outputs/${CACHE_NAME}.json
Then, perform evaluation using GPT-4o.
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
cd evaluation/Ferret-Bench
bash eval.sh ${CACHE_NAME}
```
Reference model predictions are stored in `model_outputs/`, and reference evaluation results are stored in `gpt4_result/`. Due to randomness in the LLM judge, the final performance may differ slightly even with the same predictions (even with temperature=0).
To re-run the evaluation, change `CACHE_NAME` to your own value.
Reference results:
```bash
# GAR-1B
review_refer_desc
all 56.0
refer_desc 56.0

# GAR-8B
review_refer_desc
all 64.8
refer_desc 64.8
```

## 4. MDVP-Bench
First, perform inference to obtain detailed descriptions, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=8841 \
    evaluation/MDVP-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```
The generated descriptions will be saved to evaluation/MDVP-Bench/model_outputs/${CACHE_NAME}.json
Then, perform evaluation using GPT-4o.
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
cd evaluation/MDVP-Bench
bash eval.sh model_outputs/${CACHE_NAME}.json
```
Reference model predictions are stored in `model_outputs/`. Due to randomness in the LLM judge, the final performance may differ slightly even with the same predictions (even with temperature=0).
To re-run the evaluation, change `CACHE_NAME` to your own value.
Reference results:
```bash
# GAR-1B
android_detailed_caption_box 80.65
multipanel_detailed_caption_box 103.7
natural_detailed_caption_box 152.63
ocr_doc_detailed_caption_box 146.87
ocr_spotting_detailed_caption_box 152.38
web_detailed_caption_box 150.0
# Natural = natural_detailed_caption_box = 152.6
# OCR = (ocr_doc_detailed_caption_box + ocr_spotting_detailed_caption_box) / 2 = 149.6
# Multi-Panel = multipanel_detailed_caption_box = 103.7
# Screenshot = (android_detailed_caption_box + web_detailed_caption_box) / 2 = 115.3

# GAR-8B
android_detailed_caption_box 113.79
multipanel_detailed_caption_box 117.24
natural_detailed_caption_box 178.57
ocr_doc_detailed_caption_box 138.10
ocr_spotting_detailed_caption_box 160.0
web_detailed_caption_box 132.26
# Natural = natural_detailed_caption_box = 178.6
# OCR = (ocr_doc_detailed_caption_box + ocr_spotting_detailed_caption_box) / 2 = 149.1
# Multi-Panel = multipanel_detailed_caption_box = 117.2
# Screenshot = (android_detailed_caption_box + web_detailed_caption_box) / 2 = 123.0
```