nielsr (HF Staff) committed on
Commit 974e7a4 · verified · 1 Parent(s): 731ff41

Improve model card with metadata, paper link, and description

Hi! I'm Niels from the community science team at Hugging Face. I've opened this PR to enhance your model card with relevant metadata and information.

This update includes:
- Adding the `image-text-to-text` pipeline tag for better discoverability.
- Adding `library_name: transformers` metadata, as the `config.json` confirms compatibility with the Transformers library.
- Linking the model to the research paper on Hugging Face Papers.
- Adding a descriptive summary of the model's architecture and key features based on the paper.
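
If it helps with review, the metadata changes listed above can be sanity-checked with a small script. This is only a minimal sketch: `parse_front_matter` is a hypothetical helper written for illustration, not part of any library, and it handles only the flat `key: value` and `key:` plus `- item` shapes that appear in this card rather than full YAML. The `CARD` string is an abbreviated copy of the updated README.

```python
def parse_front_matter(card_text):
    """Parse the simple YAML front matter of a model card (hypothetical helper).

    Supports only top-level `key: value` pairs and `key:` followed by
    `- item` lines, which is all this particular card uses.
    """
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter block at the top of the card
    meta = {}
    current_key = None  # key of the list currently being filled, if any
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # closing delimiter of the front matter
        if not stripped:
            continue
        if stripped.startswith("- "):
            if current_key is not None:
                meta[current_key].append(stripped[2:].strip())
            continue
        key, sep, value = stripped.partition(":")
        if not sep:
            continue  # not a key line; ignored in this sketch
        key, value = key.strip(), value.strip()
        if value:
            meta[key] = value
            current_key = None
        else:
            meta[key] = []
            current_key = key
    return meta


# Abbreviated copy of the updated README.md, for illustration only.
CARD = """---
base_model:
- erenzhou/GeoGround
- liuhaotian/llava-v1.5-7b
datasets:
- erenzhou/AirSpatial
- erenzhou/refGeo
library_name: transformers
pipeline_tag: image-text-to-text
---

# AirSpatialBot
"""

metadata = parse_front_matter(CARD)
for key in ("library_name", "pipeline_tag", "datasets", "base_model"):
    print(f"{key}: {metadata.get(key)}")
```

For real use, the `ModelCard` class in `huggingface_hub` is likely the better route, since it parses the full YAML block instead of this simplified subset.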

Please let me know if you have any questions!

Files changed (1)
  1. README.md +34 -4
README.md CHANGED
@@ -1,9 +1,39 @@
  ---
- datasets:
- - erenzhou/AirSpatial
- - erenzhou/refGeo
  base_model:
  - erenzhou/GeoGround
  - liuhaotian/llava-v1.5-7b
+ datasets:
+ - erenzhou/AirSpatial
+ - erenzhou/refGeo
+ library_name: transformers
+ pipeline_tag: image-text-to-text
  ---
- https://github.com/VisionXLab/AirSpatialBot
+
+ # AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognition and Retrieval
+
+ [**Paper**](https://huggingface.co/papers/2601.01416) | [**Code**](https://github.com/VisionXLab/AirSpatialBot) | [**Dataset**](https://huggingface.co/datasets/erenzhou/AirSpatial)
+
+ AirSpatialBot is a Vision-Language Model (VLM) specifically designed for remote sensing and aerial drone imagery. It addresses the limitations of existing VLMs in spatial understanding by introducing specialized tasks like Spatial Grounding (SG) and Spatial Question Answering (SQA).
+
+ ## Key Features
+ - **Spatially-Aware Training:** Employs a two-stage training strategy (Image Understanding Pre-training and Spatial Understanding Fine-tuning) to bridge the gap between general vision tasks and aerial spatial awareness.
+ - **3D Grounding:** It is the first remote sensing grounding model to utilize 3D Bounding Boxes (3DBB), enhancing its capability for precise vehicle localization.
+ - **Fine-Grained Attribute Recognition:** Capable of identifying specific vehicle brands, models, and pricing information from high-altitude imagery.
+ - **Aerial Agent Capabilities:** Integrates task planning and spatial reasoning to act as an agent for complex retrieval queries in remote sensing scenarios.
+
+ ## Model Training
+ The model is built upon the LLaVA-v1.5-7b architecture and was fine-tuned using the **AirSpatial** dataset, which comprises over 206K instructions tailored for spatial tasks in aerial imagery.
+
+ ## Citation
+ ```bibtex
+ @ARTICLE{zhou2025airspatialbot,
+ author={Zhou, Yue and Ding, Ran and Yang, Xue and Jiang, Xue and Liu, Xingzhao},
+ journal={IEEE Transactions on Geoscience and Remote Sensing},
+ title={AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognization and Retrieval},
+ year={2025},
+ volume={},
+ number={},
+ pages={1-1},
+ doi={10.1109/TGRS.2025.3570895}
+ }
+ ```