EarthMind-R1

EarthMind-R1 is a vision-language model fine-tuned using GRPO (Group Relative Policy Optimization) for geospatial and remote sensing image understanding tasks.

Model Description

  • Base Model: EarthMind-4B
  • Training Method: GRPO (Group Relative Policy Optimization)
  • Training Data: Geospatial instruction dataset
  • Fine-tuning: LoRA adapters merged into base weights

Usage

Quick Start

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_id = "aadex/Earthmind-R1"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load an image
image = Image.open("your_image.jpg").convert("RGB")

# Ask a question
question = "Describe what you see in this satellite image."

# Use the model's chat interface (a custom method, presumably supplied by the repo's remote code)
response = model.chat(
    tokenizer=tokenizer,
    question=question,
    images=[image],
    generation_config={
        "max_new_tokens": 512,
        "temperature": 0.7,
        "do_sample": True,
    },
)

print(response)

Expected Output Format

The model is trained to provide structured responses:

<think>
[Reasoning about the image content]
</think>
<answer>
[Final answer to the question]
</answer>
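
Downstream code usually needs only the final answer, so it can help to split the response on these tags. The following is a minimal sketch, assuming the response is a plain string in the format shown above; `parse_response` is a hypothetical helper and not part of the model's API.

```python
import re


def parse_response(text):
    """Split a response into (reasoning, answer).

    If the <think> block is absent, reasoning is None. If the <answer> block is
    absent, the answer falls back to the text with any <think> block removed.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)

    reasoning = think.group(1).strip() if think else None
    if answer:
        final = answer.group(1).strip()
    else:
        # No <answer> tags: drop the reasoning block and keep whatever remains.
        final = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return reasoning, final


# Illustrative input only; not real model output.
sample = "<think>\nA runway is visible.\n</think>\n<answer>\nAn airport.\n</answer>"
reasoning, final = parse_response(sample)
```

The fallback branch is a design choice for robustness: sampled generations may occasionally omit a tag, and returning the remaining text is often more useful than raising an error.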

Requirements

torch>=2.0
transformers>=4.40
accelerate
pillow
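
To confirm that an environment meets these requirements before loading the model, a small check along the following lines may help. It is a sketch using only the standard library; the helper names are hypothetical, and the version parsing is deliberately simple (it compares leading numeric components and ignores local suffixes such as `+cu118`).

```python
import re
from importlib import metadata

# Minimum versions copied from the list above; None means any version is accepted.
REQUIREMENTS = {
    "torch": (2, 0),
    "transformers": (4, 40),
    "accelerate": None,
    "pillow": None,
}


def parse_version(text):
    """Return up to three leading numeric components of a version string."""
    public = text.split("+")[0]  # drop local suffixes such as "+cu118"
    return tuple(int(part) for part in re.findall(r"\d+", public)[:3])


def check_requirements(requirements=REQUIREMENTS):
    """Map each package name to 'ok', 'missing', or 'too old (found X)'."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if minimum is not None and parse_version(installed) < minimum:
            report[name] = f"too old (found {installed})"
        else:
            report[name] = "ok"
    return report


for package, status in check_requirements().items():
    print(f"{package}: {status}")
```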

Hardware Requirements

  • Minimum: 16 GB VRAM (with bfloat16)
  • Recommended: 24 GB VRAM for comfortable inference
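
As a rough sanity check on these figures, the memory taken by the weights alone can be estimated from the parameter count. This sketch assumes exactly 4 billion parameters (taken from the base model name) at 2 bytes each in bfloat16; it does not account for activations, the KV cache, image features, or framework overhead, which need the remaining headroom.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Estimate the memory needed for model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: 4e9 parameters, bfloat16 (2 bytes per parameter).
weights_gib = weight_memory_gib(4e9)
print(f"Weights alone: about {weights_gib:.1f} GiB")
```

Under these assumptions the weights take roughly 7.5 GiB, which is consistent with a 16 GB card being workable but tight for long generations or large images.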

Training Details

  • Framework: VLM-R1 + TRL
  • Optimizer: AdamW
  • Learning Rate: 1e-6
  • LoRA Configuration:
    • r: 32
    • alpha: 64
    • dropout: 0.05
  • GRPO Settings:
    • num_generations: 4
    • num_iterations: 2
    • beta: 0.01
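
With num_generations set to 4, GRPO samples four completions per prompt and scores each one relative to the others in its group. The sketch below illustrates that group-relative normalization, assuming scalar rewards and a population standard deviation. The reward functions used for this model are not documented here, and the helper name and `eps` value are illustrative rather than taken from the training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps), so completions
    that beat the group average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for the 4 completions sampled for one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no policy gradient, which is one reason small groups can give a noisy training signal.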

Limitations

  • Optimized for geospatial and remote sensing imagery
  • May perform worse on general-domain images
  • Response quality depends on image resolution and clarity

Citation

If you use this model, please cite:

@misc{earthmind-r1,
  title={EarthMind-R1: GRPO Fine-tuned Vision-Language Model for Geospatial Understanding},
  author={Your Name},
  year={2024},
  publisher={HuggingFace}
}

License

Apache 2.0
