# Evaluation of GAR

## 1. GAR-Bench

### 1.1 GAR-Bench-Caption-Simple

First, perform inference, e.g., using GAR-8B.

```bash
torchrun --nproc-per-node=1 --master-port=9811 \
    evaluation/GAR-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --anno_file evaluation/GAR-Bench/annotations/GAR-Bench-Caption-Simple.json \
    --mode simple \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```

The generated descriptions will be saved to ```evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_simple.json```.

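Here, ```CACHE_NAME``` is simply a tag you choose for the run; it only serves as the prefix of the cached output files. For example (the value below is a hypothetical placeholder):

```bash
# Hypothetical cache tag; any string works, it only names the cached output files
export CACHE_NAME=gar8b_caption_eval
```
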
Next, perform evaluation using GPT-4o (with images).

```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
python3 evaluation/GAR-Bench/eval_simple.py --pred evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_simple.json
```

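If the judge step fails with authentication errors, a quick check of the two environment variables can save a round trip (a minimal sketch, not part of the released scripts):

```bash
# Abort early with a clear message if either Azure variable is unset or empty
: "${AZURE_OPENAI_ENDPOINT:?export AZURE_OPENAI_ENDPOINT first}"
: "${AZURE_OPENAI_KEY:?export AZURE_OPENAI_KEY first}"
```
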
A reference cache (including model predictions and evaluation results) is stored in ```model_outputs/```. Due to randomness in the LLM judge, the final numbers may differ slightly even with identical predictions (and even with ```temperature=0```).
To re-run the evaluation, switch to your own ```CACHE_NAME```.

Reference results:

```bash
# GAR-1B
Accuracy: 0.5567010309278351
# GAR-8B
Accuracy: 0.6391752577319587
```

### 1.2 GAR-Bench-Caption-Detailed

First, perform inference, e.g., using GAR-8B.

```bash
torchrun --nproc-per-node=1 --master-port=9811 \
    evaluation/GAR-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --anno_file evaluation/GAR-Bench/annotations/GAR-Bench-Caption-Detailed.json \
    --mode detailed \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```

The generated descriptions will be saved to ```evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_detailed.json```.

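Since the simple and detailed runs differ only in ```--anno_file``` and ```--mode```, both can be launched in one loop (a sketch; the cache name is again a hypothetical placeholder):

```bash
# Run both caption splits back to back with a shared cache name
for SPLIT in Simple Detailed; do
  MODE=$(echo "${SPLIT}" | tr '[:upper:]' '[:lower:]')  # "simple" / "detailed"
  torchrun --nproc-per-node=1 --master-port=9811 \
      evaluation/GAR-Bench/inference.py \
      --model_name_or_path HaochenWang/GAR-8B \
      --anno_file evaluation/GAR-Bench/annotations/GAR-Bench-Caption-${SPLIT}.json \
      --mode ${MODE} \
      --cache_name gar8b_caption_eval \
      --data_type bf16 \
      --seed 42
done
```
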
Next, perform evaluation using GPT-4o (with images).

```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
python3 evaluation/GAR-Bench/eval_detailed.py --pred evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_detailed.json
```

A reference cache (including model predictions and evaluation results) is stored in ```model_outputs/```. Due to randomness in the LLM judge, the final numbers may differ slightly even with identical predictions (and even with ```temperature=0```).
To re-run the evaluation, switch to your own ```CACHE_NAME```.

Reference results:

```bash
# GAR-1B
Accuracy: 0.6635514018691588
# GAR-8B
Accuracy: 0.6915887850467289
```

### 1.3 GAR-Bench-VQA

Perform inference, e.g., using GAR-8B.

```bash
torchrun --nproc-per-node=1 --master-port=9811 \
    evaluation/GAR-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --anno_file evaluation/GAR-Bench/annotations/GAR-Bench-VQA.json \
    --mode vqa \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```

A reference cache (including model predictions and evaluation results) is stored in ```model_outputs/```.
To re-run the evaluation, switch to your own ```CACHE_NAME```.

Reference results:

```
# GAR-1B
color: [34/69]=49.3
texture/pattern: [17/29]=58.6
mirror: [36/61]=59.0
ordering: [13/64]=20.3
material: [14/36]=38.9
shape: [32/64]=50.0
relation: [57/101]=56.4
=> overall: [203/424]=47.9
# GAR-8B
texture/pattern: [22/29]=75.9
material: [19/36]=52.8
mirror: [36/61]=59.0
relation: [66/101]=65.4
shape: [34/64]=53.1
ordering: [28/64]=43.8
color: [40/69]=58.0
=> overall: [245/424]=57.8
```

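The overall score is the pooled count over all categories; for example, the GAR-1B row above can be cross-checked with a one-liner:

```bash
# 34+17+36+13+14+32+57 correct answers out of 69+29+61+64+36+64+101 questions
python3 -c "print(f'{34+17+36+13+14+32+57}/{69+29+61+64+36+64+101} = {100*(34+17+36+13+14+32+57)/(69+29+61+64+36+64+101):.1f}')"  # 203/424 = 47.9
```
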
## 2. DLC-Bench

First, download the images of DLC-Bench and put the ```images``` folder in the ```annotations``` directory:

```bash
cd evaluation/DLC-Bench/annotations
hf download nvidia/DLC-Bench --repo-type dataset --include "images/*" --exclude "*" --local-dir ./
```

The overall structure should be:

```bash
evaluation/DLC-Bench/annotations
├── annotations.json
├── class_names.json
├── images
│   └── objects365_v2_*.jpg
└── qa.json
```

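A quick, optional sanity check that the layout above is in place:

```bash
ls evaluation/DLC-Bench/annotations                 # expect annotations.json, class_names.json, images/, qa.json
ls evaluation/DLC-Bench/annotations/images | wc -l  # number of downloaded objects365_v2_*.jpg files
```
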
Next, perform inference to obtain detailed descriptions, e.g., using GAR-8B.

```bash
torchrun --nproc-per-node=1 --master-port=8841 \
    evaluation/DLC-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```

The generated descriptions will be saved to ```evaluation/DLC-Bench/model_outputs/${CACHE_NAME}.json```.

Finally, perform evaluation, either with images using GPT-4o or without images using Llama3.1-8B.

**Option 1. Using GPT-4o *with* images (Recommended)**

```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
python3 evaluation/DLC-Bench/eval_gpt_with_image.py --pred evaluation/DLC-Bench/model_outputs/${CACHE_NAME}.json
```

**Option 2. Using Llama3.1-8B *without* images**

First, we need to serve Llama3.1-8B using vLLM.

```bash
bash evaluation/DLC-Bench/serve_judge.sh
```

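The script wraps an OpenAI-compatible vLLM server. If you prefer to launch the judge manually, the command would look roughly like the sketch below (the exact model id and extra flags used by ```serve_judge.sh``` may differ; port 8007 is assumed so that it matches the ```--base_url``` in the next step):

```bash
# Minimal sketch: serve Llama3.1-8B as an OpenAI-compatible endpoint on port 8007
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8007
```
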
Next, on the *same* node, run evaluation.

```bash
python3 eval_llama_without_image.py --pred ../model_outputs/${CACHE_NAME}.json --base_url http://localhost:8007/v1
```

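Before kicking off the evaluation, you can confirm that the judge endpoint is reachable (any OpenAI-compatible server, including vLLM, exposes ```/v1/models```):

```bash
# Should return a JSON payload listing the served judge model
curl -s http://localhost:8007/v1/models
```
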
For more details on the differences between these two evaluation settings, please refer to Appendix F of our paper.
A reference cache (including model predictions and evaluation results) is stored in ```model_outputs/```. Due to randomness in the LLM judge, the final numbers may differ slightly even with identical predictions (and even with ```temperature=0```).
To re-run the evaluation, switch to your own ```CACHE_NAME```.

Reference results:

```bash
# GAR-1B
# By GPT-4o (with images):
Summary (Pos Neg Avg(Pos, Neg)): 0.662, 0.880, 0.771
# By Llama3.1-8B (without images):
Summary (Pos Neg Avg(Pos, Neg)): 0.489, 0.870, 0.679
# GAR-8B
# By GPT-4o (with images):
Summary (Pos Neg Avg(Pos, Neg)): 0.680, 0.860, 0.770
# By Llama3.1-8B (without images):
Summary (Pos Neg Avg(Pos, Neg)): 0.502, 0.846, 0.674
```

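The third number on each summary line is simply the mean of the positive and negative scores, e.g. for GAR-8B judged by GPT-4o:

```bash
python3 -c "print(f'{(0.680 + 0.860) / 2:.3f}')"  # 0.770
```
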
## 3. Ferret-Bench

First, perform inference to obtain detailed descriptions, e.g., using GAR-8B.

```bash
torchrun --nproc-per-node=1 --master-port=8841 \
    evaluation/Ferret-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```

The generated descriptions will be saved to ```evaluation/Ferret-Bench/model_outputs/${CACHE_NAME}.json```.

Then, perform evaluation using GPT-4o.

```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
cd evaluation/Ferret-Bench
bash eval.sh ${CACHE_NAME}
```

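To reproduce both reference rows below in one go, you can loop over the two checkpoints (this assumes the 1B model is published under an analogous repo id such as ```HaochenWang/GAR-1B```, and that the Azure variables are already exported; the cache names are hypothetical):

```bash
for MODEL in HaochenWang/GAR-1B HaochenWang/GAR-8B; do  # the GAR-1B repo id is an assumption
  CACHE_NAME=ferret_$(basename "${MODEL}" | tr '[:upper:]' '[:lower:]')  # e.g. ferret_gar-8b
  torchrun --nproc-per-node=1 --master-port=8841 \
      evaluation/Ferret-Bench/inference.py \
      --model_name_or_path ${MODEL} \
      --cache_name ${CACHE_NAME} \
      --data_type bf16 \
      --seed 42
  (cd evaluation/Ferret-Bench && bash eval.sh ${CACHE_NAME})
done
```
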
Reference model predictions are stored in ```model_outputs/```, and reference evaluation results are stored in ```gpt4_result/```. Due to randomness in the LLM judge, the final numbers may differ slightly even with identical predictions (and even with ```temperature=0```).
To re-run the evaluation, switch to your own ```CACHE_NAME```.

Reference results:

```bash
# GAR-1B
review_refer_desc
all 56.0
refer_desc 56.0
=================================
# GAR-8B
review_refer_desc
all 64.8
refer_desc 64.8
=================================
```

## 4. MDVP-Bench

First, perform inference to obtain detailed descriptions, e.g., using GAR-8B.

```bash
torchrun --nproc-per-node=1 --master-port=8841 \
    evaluation/MDVP-Bench/inference.py \
    --model_name_or_path HaochenWang/GAR-8B \
    --cache_name ${CACHE_NAME} \
    --data_type bf16 \
    --seed 42
```

The generated descriptions will be saved to ```evaluation/MDVP-Bench/model_outputs/${CACHE_NAME}.json```.

Then, perform evaluation using GPT-4o.

```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
cd evaluation/MDVP-Bench
bash eval.sh model_outputs/${CACHE_NAME}.json
```

Reference model predictions are stored in ```model_outputs/```. Due to randomness in the LLM judge, the final numbers may differ slightly even with identical predictions (and even with ```temperature=0```).
To re-run the evaluation, switch to your own ```CACHE_NAME```.

Reference results:

```bash
# GAR-1B
android_detailed_caption_box 80.65
multipanel_detailed_caption_box 103.7
natural_detailed_caption_box 152.63
ocr_doc_detailed_caption_box 146.87
ocr_spotting_detailed_caption_box 152.38
web_detailed_caption_box 150.0
# Natural = natural_detailed_caption_box = 152.6
# OCR = (ocr_doc_detailed_caption_box + ocr_spotting_detailed_caption_box) / 2 = 149.6
# Multi-Panel = multipanel_detailed_caption_box = 103.7
# Screenshot = (android_detailed_caption_box + web_detailed_caption_box) / 2 = 115.3
# GAR-8B
android_detailed_caption_box 113.79
multipanel_detailed_caption_box 117.24
natural_detailed_caption_box 178.57
ocr_doc_detailed_caption_box 138.10
ocr_spotting_detailed_caption_box 160.0
web_detailed_caption_box 132.26
# Natural = natural_detailed_caption_box = 178.6
# OCR = (ocr_doc_detailed_caption_box + ocr_spotting_detailed_caption_box) / 2 = 149.1
# Multi-Panel = multipanel_detailed_caption_box = 117.2
# Screenshot = (android_detailed_caption_box + web_detailed_caption_box) / 2 = 123.0
```

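The aggregated categories in the comments above are plain means of the corresponding subsets; the GAR-1B OCR and Screenshot entries, for example, can be reproduced with:

```bash
python3 -c "print(f'OCR        = {(146.87 + 152.38) / 2:.1f}')"  # 149.6
python3 -c "print(f'Screenshot = {(80.65 + 150.0) / 2:.1f}')"    # 115.3
```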