Improve model card with metadata, paper link, and description
Hi! I'm Niels from the community science team at Hugging Face. I've opened this PR to enhance your model card with relevant metadata and information.
This update includes:
- Adding the `image-text-to-text` pipeline tag for better discoverability.
- Adding `library_name: transformers` metadata, as the `config.json` confirms compatibility with the Transformers library.
- Linking the model to the research paper on Hugging Face Papers.
- Adding a descriptive summary of the model's architecture and key features based on the paper.
Please let me know if you have any questions!
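For anyone reviewing the metadata changes listed above, here is a minimal sketch of how the resulting front matter could be sanity-checked locally. It is not part of the PR: `parse_front_matter`, `missing_keys`, and `REQUIRED_KEYS` are hypothetical helper names, and the parser only handles the flat `key: value` and `key:` plus `- item` shapes that appear in this card, not general YAML.

```python
# Hypothetical helpers for illustration; they cover only the simple
# front-matter shapes used in this model card, not full YAML.

README = """---
base_model:
- erenzhou/GeoGround
- liuhaotian/llava-v1.5-7b
datasets:
- erenzhou/AirSpatial
- erenzhou/refGeo
library_name: transformers
pipeline_tag: image-text-to-text
---

# AirSpatialBot
"""

# Assumption: these are the two keys this PR is meant to add.
REQUIRED_KEYS = ("library_name", "pipeline_tag")


def parse_front_matter(readme_text):
    """Return the metadata between the leading '---' markers as a dict."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at all
    metadata = {}
    current_key = None  # key whose list items are being collected
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # closing marker ends the front matter
        if not stripped:
            continue
        if stripped.startswith("- "):
            if current_key is not None:
                metadata[current_key].append(stripped[2:].strip())
            continue
        key, _, value = stripped.partition(":")
        key, value = key.strip(), value.strip()
        if value:
            metadata[key] = value  # scalar entry such as library_name
            current_key = None
        else:
            metadata[key] = []  # list entry such as datasets
            current_key = key
    return metadata


def missing_keys(metadata, required=REQUIRED_KEYS):
    """List the required keys that are absent from the metadata."""
    return [key for key in required if key not in metadata]


card = parse_front_matter(README)
print(card["pipeline_tag"], card["library_name"])
print("missing:", missing_keys(card))
```

Running the same check against the pre-PR front matter (only `datasets` and `base_model`) would report both keys as missing, which is the gap this PR closes.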
README.md CHANGED
@@ -1,9 +1,39 @@
 ---
-datasets:
-- erenzhou/AirSpatial
-- erenzhou/refGeo
 base_model:
 - erenzhou/GeoGround
 - liuhaotian/llava-v1.5-7b
+datasets:
+- erenzhou/AirSpatial
+- erenzhou/refGeo
+library_name: transformers
+pipeline_tag: image-text-to-text
 ---
-
+
+# AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognition and Retrieval
+
+[**Paper**](https://huggingface.co/papers/2601.01416) | [**Code**](https://github.com/VisionXLab/AirSpatialBot) | [**Dataset**](https://huggingface.co/datasets/erenzhou/AirSpatial)
+
+AirSpatialBot is a Vision-Language Model (VLM) specifically designed for remote sensing and aerial drone imagery. It addresses the limitations of existing VLMs in spatial understanding by introducing specialized tasks like Spatial Grounding (SG) and Spatial Question Answering (SQA).
+
+## Key Features
+- **Spatially-Aware Training:** Employs a two-stage training strategy (Image Understanding Pre-training and Spatial Understanding Fine-tuning) to bridge the gap between general vision tasks and aerial spatial awareness.
+- **3D Grounding:** It is the first remote sensing grounding model to utilize 3D Bounding Boxes (3DBB), enhancing its capability for precise vehicle localization.
+- **Fine-Grained Attribute Recognition:** Capable of identifying specific vehicle brands, models, and pricing information from high-altitude imagery.
+- **Aerial Agent Capabilities:** Integrates task planning and spatial reasoning to act as an agent for complex retrieval queries in remote sensing scenarios.
+
+## Model Training
+The model is built upon the LLaVA-v1.5-7b architecture and was fine-tuned using the **AirSpatial** dataset, which comprises over 206K instructions tailored for spatial tasks in aerial imagery.
+
+## Citation
+```bibtex
+@ARTICLE{zhou2025airspatialbot,
+  author={Zhou, Yue and Ding, Ran and Yang, Xue and Jiang, Xue and Liu, Xingzhao},
+  journal={IEEE Transactions on Geoscience and Remote Sensing},
+  title={AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognization and Retrieval},
+  year={2025},
+  volume={},
+  number={},
+  pages={1-1},
+  doi={10.1109/TGRS.2025.3570895}
+}
+```