# Evaluation of GAR
## 1. GARBench
### 1.1 GARBench-Caption-Simple
First, perform inference, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=9811 \
evaluation/GAR-Bench/inference.py \
--model_name_or_path HaochenWang/GAR-8B \
--anno_file evaluation/GAR-Bench/annotations/GAR-Bench-Caption-Simple.json \
--mode simple \
--cache_name ${CACHE_NAME} \
--data_type bf16 \
--seed 42
```
The generated descriptions will be saved to ```evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_simple.json```.
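Here, ```${CACHE_NAME}``` is simply a label for this run, set before launching inference; predictions (and, later, evaluation results) are cached under it. For example (the label below is only an illustration):
```bash
# Any string works; the same value is reused by the evaluation step below.
export CACHE_NAME=gar8b_caption_simple
```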
Next, run the evaluation, which uses GPT-4o (with access to the images) as the judge.
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
python3 evaluation/GAR-Bench/eval_simple.py --pred evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_simple.json
```
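The two environment variables follow the usual Azure OpenAI conventions; a hypothetical example (the resource name and key below are placeholders, not real values):
```bash
# Placeholders only: substitute your own Azure OpenAI endpoint and key.
export AZURE_OPENAI_ENDPOINT="https://my-resource.openai.azure.com/"
export AZURE_OPENAI_KEY="0123456789abcdef0123456789abcdef"
```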
The reference cache (including model predictions and evaluation results) is stored in ```model_outputs/```. Due to randomness in the LLM judge, the final scores may differ slightly even with identical predictions (even with ```temperature=0```).
To re-run the evaluation from scratch, use your own ```CACHE_NAME```.
Reference results:
```bash
# GAR-1B
Accuracy: 0.5567010309278351
# GAR-8B
Accuracy: 0.6391752577319587
```
### 1.2 GARBench-Caption-Detailed
First, perform inference, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=9811 \
evaluation/GAR-Bench/inference.py \
--model_name_or_path HaochenWang/GAR-8B \
--anno_file evaluation/GAR-Bench/annotations/GAR-Bench-Caption-Detailed.json \
--mode detailed \
--cache_name ${CACHE_NAME} \
--data_type bf16 \
--seed 42
```
The generated descriptions will be saved to ```evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_detailed.json```.
Next, run the evaluation, which uses GPT-4o (with access to the images) as the judge.
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
python3 evaluation/GAR-Bench/eval_detailed.py --pred evaluation/GAR-Bench/model_outputs/${CACHE_NAME}_detailed.json
```
The reference cache (including model predictions and evaluation results) is stored in ```model_outputs/```. Due to randomness in the LLM judge, the final scores may differ slightly even with identical predictions (even with ```temperature=0```).
To re-run the evaluation from scratch, use your own ```CACHE_NAME```.
Reference results:
```bash
# GAR-1B
Accuracy: 0.6635514018691588
# GAR-8B
Accuracy: 0.6915887850467289
```
### 1.3 GARBench-VQA
Perform inference, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=9811 \
evaluation/GAR-Bench/inference.py \
--model_name_or_path HaochenWang/GAR-8B \
--anno_file evaluation/GAR-Bench/annotations/GAR-Bench-VQA.json \
--mode vqa \
--cache_name ${CACHE_NAME} \
--data_type bf16 \
--seed 42
```
The reference cache (including model predictions and evaluation results) is stored in ```model_outputs/```.
To re-run the evaluation from scratch, use your own ```CACHE_NAME```.
Reference results:
```
# GAR-1B
color: [34/69]=49.3
texture/pattern: [17/29]=58.6
mirror: [36/61]=59.0
ordering: [13/64]=20.3
material: [14/36]=38.9
shape: [32/64]=50.0
relation: [57/101]=56.4
=> overall: [203/424]=47.9
# GAR-8B
texture/pattern: [22/29]=75.9
material: [19/36]=52.8
mirror: [36/61]=59.0
relation: [66/101]=65.4
shape: [34/64]=53.1
ordering: [28/64]=43.8
color: [40/69]=58.0
=> overall: [245/424]=57.8
```
## 2. DLC-Bench
First, download the DLC-Bench images and place the ```images``` folder inside the ```annotations``` directory:
```bash
cd evaluation/DLC-Bench/annotations
hf download nvidia/DLC-Bench --repo-type dataset --include "images/*" --exclude "*" --local-dir ./
```
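If the newer ```hf``` CLI is not available in your environment, the older ```huggingface-cli``` entry point from ```huggingface_hub``` accepts the same subcommand and flags (an equivalent sketch, assuming a reasonably recent ```huggingface_hub``` install):
```bash
cd evaluation/DLC-Bench/annotations
huggingface-cli download nvidia/DLC-Bench --repo-type dataset --include "images/*" --exclude "*" --local-dir ./
```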
The overall structure should be:
```bash
evaluation/DLC-Bench/annotations
├── annotations.json
├── class_names.json
├── images
│   └── objects365_v2_*.jpg
└── qa.json
```
Next, perform inference to obtain detailed descriptions, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=8841 \
evaluation/DLC-Bench/inference.py \
--model_name_or_path HaochenWang/GAR-8B \
--cache_name ${CACHE_NAME} \
--data_type bf16 \
--seed 42
```
The generated descriptions will be saved to ```evaluation/DLC-Bench/model_outputs/${CACHE_NAME}.json```.
Finally, perform the evaluation, either with images using GPT-4o or without images using Llama3.1-8B.
**Option 1. Using GPT-4o *with* images (Recommended)**
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
python3 evaluation/DLC-Bench/eval_gpt_with_image.py --pred evaluation/DLC-Bench/model_outputs/${CACHE_NAME}.json
```
**Option 2. Using Llama3.1-8B *without* images**
First, we need to serve Llama3.1-8B using vLLM.
```bash
bash evaluation/DLC-Bench/serve_judge.sh
```
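```serve_judge.sh``` is the authoritative launcher; roughly, it is expected to start an OpenAI-compatible vLLM server for the judge on the port used by the evaluation command below. A minimal sketch, assuming the ```meta-llama/Llama-3.1-8B-Instruct``` checkpoint and default flags (the script itself may differ):
```bash
# Sketch only: serve an OpenAI-compatible judge endpoint on port 8007.
# The actual model path, port, and flags are defined in serve_judge.sh.
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8007 --dtype bfloat16
```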
Next, on the *same* node, run the evaluation.
```bash
python3 evaluation/DLC-Bench/eval_llama_without_image.py --pred evaluation/DLC-Bench/model_outputs/${CACHE_NAME}.json --base_url http://localhost:8007/v1
```
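If the evaluation cannot reach the judge, first confirm the server has finished loading and is reachable; vLLM's OpenAI-compatible server exposes the standard ```/v1/models``` route:
```bash
# Should return a JSON list containing the served judge model.
curl http://localhost:8007/v1/models
```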
For more details on the differences between these two evaluation settings, please refer to Appendix F of our paper.
The reference cache (including model predictions and evaluation results) is stored in ```model_outputs/```. Due to randomness in the LLM judge, the final scores may differ slightly even with identical predictions (even with ```temperature=0```).
To re-run the evaluation from scratch, use your own ```CACHE_NAME```.
Reference results:
```bash
# GAR-1B
# By GPT-4o (with images):
Summary (Pos Neg Avg(Pos, Neg)): 0.662, 0.880, 0.771
# By Llama3.1-8B (without images):
Summary (Pos Neg Avg(Pos, Neg)): 0.489, 0.870, 0.679
# GAR-8B
# By GPT-4o (with images):
Summary (Pos Neg Avg(Pos, Neg)): 0.680, 0.860, 0.770
# By Llama3.1-8B (without images):
Summary (Pos Neg Avg(Pos, Neg)): 0.502, 0.846, 0.674
```
## 3. Ferret-Bench
First, perform inference to obtain detailed descriptions, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=8841 \
evaluation/Ferret-Bench/inference.py \
--model_name_or_path HaochenWang/GAR-8B \
--cache_name ${CACHE_NAME} \
--data_type bf16 \
--seed 42
```
The generated descriptions will be saved to ```evaluation/Ferret-Bench/model_outputs/${CACHE_NAME}.json```.
Then, perform the evaluation using GPT-4o as the judge.
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
cd evaluation/Ferret-Bench
bash eval.sh ${CACHE_NAME}
```
Reference model predictions are stored in ```model_outputs/```, and reference evaluation results are stored in ```gpt4_result/```. Due to randomness in the LLM judge, the final scores may differ slightly even with identical predictions (even with ```temperature=0```).
To re-run the evaluation from scratch, use your own ```CACHE_NAME```.
Reference results:
```bash
# GAR-1B
review_refer_desc
all 56.0
refer_desc 56.0
=================================
# GAR-8B
review_refer_desc
all 64.8
refer_desc 64.8
=================================
```
## 4. MDVP-Bench
First, perform inference to obtain detailed descriptions, e.g., using GAR-8B.
```bash
torchrun --nproc-per-node=1 --master-port=8841 \
evaluation/MDVP-Bench/inference.py \
--model_name_or_path HaochenWang/GAR-8B \
--cache_name ${CACHE_NAME} \
--data_type bf16 \
--seed 42
```
The generated descriptions will be saved to ```evaluation/MDVP-Bench/model_outputs/${CACHE_NAME}.json```.
Then, perform the evaluation using GPT-4o as the judge.
```bash
export AZURE_OPENAI_ENDPOINT=YOUR_AZURE_OPENAI_ENDPOINT
export AZURE_OPENAI_KEY=YOUR_AZURE_OPENAI_KEY
cd evaluation/MDVP-Bench
bash eval.sh model_outputs/${CACHE_NAME}.json
```
Reference model predictions are stored in ```model_outputs/```. Due to randomness in the LLM judge, the final scores may differ slightly even with identical predictions (even with ```temperature=0```).
To re-run the evaluation from scratch, use your own ```CACHE_NAME```.
Reference results:
```bash
# GAR-1B
android_detailed_caption_box 80.65
multipanel_detailed_caption_box 103.7
natural_detailed_caption_box 152.63
ocr_doc_detailed_caption_box 146.87
ocr_spotting_detailed_caption_box 152.38
web_detailed_caption_box 150.0
# Natural = natural_detailed_caption_box = 152.6
# OCR = (ocr_doc_detailed_caption_box + ocr_spotting_detailed_caption_box) / 2 = 149.6
# Multi-Panel = multipanel_detailed_caption_box = 103.7
# Screenshot = (android_detailed_caption_box + web_detailed_caption_box) / 2 = 115.3
# GAR-8B
android_detailed_caption_box 113.79
multipanel_detailed_caption_box 117.24
natural_detailed_caption_box 178.57
ocr_doc_detailed_caption_box 138.10
ocr_spotting_detailed_caption_box 160.0
web_detailed_caption_box 132.26
# Natural = natural_detailed_caption_box = 178.6
# OCR = (ocr_doc_detailed_caption_box + ocr_spotting_detailed_caption_box) / 2 = 149.1
# Multi-Panel = multipanel_detailed_caption_box = 117.2
# Screenshot = (android_detailed_caption_box + web_detailed_caption_box) / 2 = 123.0
```