# Evaluation of GAR

## 1. GAR-Bench

### 1.1 GAR-Bench-Caption-Simple

First, perform inference, e.g., using GAR-8B.

```bash
torchrun --nproc-per-node=1 --master-port=9811 \
    evaluation/GAR-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --anno_file evaluation/GAR-Bench/annotations/GAR-Bench-Caption-Simple.json \
    --mode simple \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```

The generated descriptions will be saved to `evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_simple.json`.

Next, perform the evaluation (using GPT-4o with images as the judge).

```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY

python3 evaluation/GAR-Bench/eval_simple.py --pred evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_simple.json
```

A reference cache (including model predictions and evaluation results) is stored in `model_outputs/`. Due to the randomness of the LLM judge, the final numbers may differ slightly even with identical predictions (even with `temperature=0`). To re-run the evaluation, switch to your own `CACHE_NAME`.

Reference results:

```bash
# GAR-1B
Accuracy: 0.5567010309278351

# GAR-8B
Accuracy: 0.6391752577319587
```

### 1.2 GAR-Bench-Caption-Detailed

First, perform inference, e.g., using GAR-8B.

```bash
torchrun --nproc-per-node=1 --master-port=9811 \
    evaluation/GAR-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --anno_file evaluation/GAR-Bench/annotations/GAR-Bench-Caption-Detailed.json \
    --mode detailed \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```

The generated descriptions will be saved to `evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_detailed.json`.

Next, perform the evaluation (using GPT-4o with images as the judge).

```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY

python3 evaluation/GAR-Bench/eval_detailed.py --pred evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_detailed.json
```

A reference cache (including model predictions and evaluation results) is stored in `model_outputs/`. Due to the randomness of the LLM judge, the final numbers may differ slightly even with identical predictions (even with `temperature=0`). To re-run the evaluation, switch to your own `CACHE_NAME`.

Reference results:

```bash
# GAR-1B
Accuracy: 0.6635514018691588

# GAR-8B
Accuracy: 0.6915887850467289
```

### 1.3 GAR-Bench-VQA

Perform inference, e.g., using GAR-8B.

```bash
torchrun --nproc-per-node=1 --master-port=9811 \
    evaluation/GAR-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --anno_file evaluation/GAR-Bench/annotations/GAR-Bench-VQA.json \
    --mode vqa \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```

A reference cache (including model predictions and evaluation results) is stored in `model_outputs/`. To re-run the evaluation, switch to your own `CACHE_NAME`.

Reference results:

```
# GAR-1B
color: [34/69]=49.3
texture/pattern: [17/29]=58.6
mirror: [36/61]=59.0
ordering: [13/64]=20.3
material: [14/36]=38.9
shape: [32/64]=50.0
relation: [57/101]=56.4
=> overall: [203/424]=47.9

# GAR-8B
texture/pattern: [22/29]=75.9
material: [19/36]=52.8
mirror: [36/61]=59.0
relation: [66/101]=65.4
shape: [34/64]=53.1
ordering: [28/64]=43.8
color: [40/69]=58.0
=> overall: [245/424]=57.8
```
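The three GAR-Bench splits share the same inference entry point and differ only in the annotation file and `--mode`, so they can be run back-to-back. Below is a minimal convenience sketch (not a script shipped with this repo) that assumes a hypothetical cache name `gar-8b-bench`; since no separate judge command is listed for the VQA split above, only the two caption splits are sent to GPT-4o.

```bash
export CACHE_NAME=gar-8b-bench   # hypothetical name; it only controls the output file prefix

# Inference on all three GAR-Bench splits with GAR-8B.
for PAIR in simple:GAR-Bench-Caption-Simple detailed:GAR-Bench-Caption-Detailed vqa:GAR-Bench-VQA; do
    MODE=${PAIR%%:*}                                          # part before the colon, e.g. "simple"
    ANNO=evaluation/GAR-Bench/annotations/${PAIR#*:}.json     # part after the colon, e.g. "GAR-Bench-Caption-Simple"
    torchrun --nproc-per-node=1 --master-port=9811 \
        evaluation/GAR-Bench/inference.py \
        --model_name_or_path HaochenWang/GAR-8B \
        --anno_file ${ANNO} \
        --mode ${MODE} \
        --cache_name ${CACHE_NAME} \
        --data_type bf16 \
        --seed 42
done

# Judge the two caption splits with GPT-4o.
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
python3 evaluation/GAR-Bench/eval_simple.py   --pred evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_simple.json
python3 evaluation/GAR-Bench/eval_detailed.py --pred evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_detailed.json
```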
## 2. DLC-Bench

First, download the DLC-Bench images and put the `images` folder in the `annotations` directory:

```bash
cd evaluation/DLC-Bench/annotations
hf download nvidia/DLC-Bench --repo-type dataset --include "images/*" --exclude "*" --local-dir ./
```

The overall structure should be:

```bash
evaluation/DLC-Bench/annotations
├── annotations.json
├── class_names.json
├── images
│   └── objects365_v2_*.jpg
└── qa.json
```

Next, perform inference to obtain detailed descriptions, e.g., using GAR-8B.

```bash
torchrun --nproc-per-node=1 --master-port=8841 \
    evaluation/DLC-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```

The generated descriptions will be saved to `evaluation/DLC-Bench/model_outputs/${CACHE_NAME}.json`.

Finally, perform the evaluation, either with images using GPT-4o or without images using Llama3.1-8B.

**Option 1. Using GPT-4o *with* images (Recommended)**

```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY

python3 evaluation/DLC-Bench/eval_gpt_with_image.py --pred evaluation/DLC-Bench/model_outputs/${CACHE_NAME}.json
```

**Option 2. Using Llama3.1-8B *without* images**

First, serve Llama3.1-8B with vLLM.

```bash
bash evaluation/DLC-Bench/serve_judge.sh
```

Next, on the *same* node, run the evaluation.

```bash
python3 eval_llama_without_image.py --pred ../model_outputs/${CACHE_NAME}.json --base_url http://localhost:8007/v1
```
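Before launching the Llama3.1-8B evaluation, it can help to confirm that the judge started by `serve_judge.sh` is actually reachable at the `--base_url` used above. A minimal sanity check against the OpenAI-compatible endpoint, assuming the default port 8007 shown above, is:

```bash
# Lists the model(s) currently served by vLLM; a connection error means the judge is not up yet.
curl http://localhost:8007/v1/models
```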
For more details on the differences between these two evaluation settings, please refer to Appendix F of our paper.

A reference cache (including model predictions and evaluation results) is stored in `model_outputs/`. Due to the randomness of the LLM judge, the final numbers may differ slightly even with identical predictions (even with `temperature=0`). To re-run the evaluation, switch to your own `CACHE_NAME`.

Reference results:

```bash
# GAR-1B
# By GPT-4o (with images):
Summary (Pos Neg Avg(Pos, Neg)): 0.662, 0.880, 0.771
# By Llama3.1-8B (without images):
Summary (Pos Neg Avg(Pos, Neg)): 0.489, 0.870, 0.679

# GAR-8B
# By GPT-4o (with images):
Summary (Pos Neg Avg(Pos, Neg)): 0.680, 0.860, 0.770
# By Llama3.1-8B (without images):
Summary (Pos Neg Avg(Pos, Neg)): 0.502, 0.846, 0.674
```

## 3. Ferret-Bench

First, perform inference to obtain detailed descriptions, e.g., using GAR-8B.

```bash
torchrun --nproc-per-node=1 --master-port=8841 \
    evaluation/Ferret-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```

The generated descriptions will be saved to `evaluation/Ferret-Bench/model_outputs/${CACHE_NAME}.json`.

Then, perform the evaluation using GPT-4o.

```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY

cd evaluation/Ferret-Bench
bash eval.sh ${CACHE_NAME}
```

Reference model predictions are stored in `model_outputs/`, and reference evaluation results are stored in `gpt4_result/`. Due to the randomness of the LLM judge, the final numbers may differ slightly even with identical predictions (even with `temperature=0`). To re-run the evaluation, switch to your own `CACHE_NAME`.

Reference results:

```bash
# GAR-1B
review_refer_desc
all 56.0
refer_desc 56.0
=================================

# GAR-8B
review_refer_desc
all 64.8
refer_desc 64.8
=================================
```

## 4. MDVP-Bench

First, perform inference to obtain detailed descriptions, e.g., using GAR-8B.

```bash
torchrun --nproc-per-node=1 --master-port=8841 \
    evaluation/MDVP-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```

The generated descriptions will be saved to `evaluation/MDVP-Bench/model_outputs/${CACHE_NAME}.json`.

Then, perform the evaluation using GPT-4o.

```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY

cd evaluation/MDVP-Bench
bash eval.sh model_outputs/${CACHE_NAME}.json
```

Reference model predictions are stored in `model_outputs/`. Due to the randomness of the LLM judge, the final numbers may differ slightly even with identical predictions (even with `temperature=0`). To re-run the evaluation, switch to your own `CACHE_NAME`.

Reference results:

```bash
# GAR-1B
android_detailed_caption_box 80.65
multipanel_detailed_caption_box 103.7
natural_detailed_caption_box 152.63
ocr_doc_detailed_caption_box 146.87
ocr_spotting_detailed_caption_box 152.38
web_detailed_caption_box 150.0

# Natural = natural_detailed_caption_box = 152.6
# OCR = (ocr_doc_detailed_caption_box + ocr_spotting_detailed_caption_box) / 2 = 149.6
# Multi-Panel = multipanel_detailed_caption_box = 103.7
# Screenshot = (android_detailed_caption_box + web_detailed_caption_box) / 2 = 115.3

# GAR-8B
android_detailed_caption_box 113.79
multipanel_detailed_caption_box 117.24
natural_detailed_caption_box 178.57
ocr_doc_detailed_caption_box 138.10
ocr_spotting_detailed_caption_box 160.0
web_detailed_caption_box 132.26

# Natural = natural_detailed_caption_box = 178.6
# OCR = (ocr_doc_detailed_caption_box + ocr_spotting_detailed_caption_box) / 2 = 149.1
# Multi-Panel = multipanel_detailed_caption_box = 117.2
# Screenshot = (android_detailed_caption_box + web_detailed_caption_box) / 2 = 123.0
```
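The grouped numbers in the comments above are plain means over the corresponding per-split scores printed by `eval.sh`; for example, reproducing the GAR-1B OCR and Screenshot aggregates:

```bash
# OCR = mean of the two OCR splits
python3 -c "print((146.87 + 152.38) / 2)"   # 149.625, reported as 149.6
# Screenshot = mean of the android and web splits
python3 -c "print((80.65 + 150.0) / 2)"     # 115.325, reported as 115.3
```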