GATE-VLAP: Grounded Action Trajectory Embeddings with Vision-Language Action Planning

Trained on LIBERO-10 Benchmark

This model is trained for robotic manipulation tasks using vision-language-action learning with semantic action chunking.

Model Details

  • Architecture: CLIP-RT (CLIP-based Robot Transformer)
  • Training Dataset: GATE-VLAP LIBERO-10
  • Training Epochs: 90
  • Task Type: Long-horizon robotic manipulation
  • Input: RGB images (128×128) + language instructions
  • Output: 7-DOF actions (xyz translation, rpy rotation, gripper); an input/output sketch follows this list
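
The inference interface is not documented on this card. The toy module below is a minimal sketch of the expected input/output contract only (a 128x128 RGB observation plus language features in, a 7-DOF action out); it is not the actual CLIP-RT architecture, and every name in it is a hypothetical placeholder.

import torch
import torch.nn as nn

class DummyVLAPolicy(nn.Module):
    """Toy stand-in with the same I/O shapes as the released checkpoint:
    RGB image (3, 128, 128) + language features -> 7-DOF action."""
    def __init__(self, text_dim: int = 512):
        super().__init__()
        self.vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128 * 128, 256))
        self.text = nn.Linear(text_dim, 256)
        self.head = nn.Linear(512, 7)  # (x, y, z, roll, pitch, yaw, gripper)

    def forward(self, image: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.vision(image), self.text(text_emb)], dim=-1)
        return self.head(fused)

policy = DummyVLAPolicy()
image = torch.rand(1, 3, 128, 128)   # RGB observation scaled to [0, 1]
text_emb = torch.rand(1, 512)        # placeholder language embedding
action = policy(image, text_emb)     # shape (1, 7)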

Training Details

  • Dataset: LIBERO-10 (29 subtasks, 1,354 demonstrations)
  • Segmentation: Semantic action chunking using the Gemini Vision API (see the chunking sketch after this list)
  • Framework: PyTorch
  • Checkpoint: Epoch 90 (best_epoch)
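
The segmentation pipeline itself is not released on this card. The sketch below only illustrates the slicing step, assuming chunk boundaries and subtask labels have already been produced by a vision-language annotator (in the actual pipeline, the Gemini Vision API); the function and field names are hypothetical.

from typing import Dict, List

def chunk_demonstration(frames: List[dict],
                        boundaries: List[int],
                        labels: List[str]) -> List[Dict]:
    """Slice one demonstration into labeled subtask chunks.
    `boundaries` holds the start index of each chunk (first entry 0);
    `labels` holds the matching language descriptions."""
    assert len(boundaries) == len(labels) and boundaries[0] == 0
    ends = boundaries[1:] + [len(frames)]
    return [
        {"instruction": label, "frames": frames[start:end]}
        for start, end, label in zip(boundaries, ends, labels)
    ]

# Toy usage: a 10-frame demo split into two semantic chunks.
demo = [{"rgb": None, "action": None} for _ in range(10)]
chunks = chunk_demonstration(demo, boundaries=[0, 6],
                             labels=["pick up the bowl", "place it in the drawer"])
print([(c["instruction"], len(c["frames"])) for c in chunks])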

Performance

Training run: libero_10_fixed_training_v1

Overall task success rate: 88.8% on LIBERO-LONG, about 5% higher than raw CLIP-RT.

Dataset

This model was trained on the GATE-VLAP Datasets, which include:

  • LIBERO-10: 103,650 frames across 29 subtasks
  • Semantic action segmentation
  • Vision-language annotations (a loading sketch follows this list)
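
If the data is published as a Hugging Face dataset, loading it might look like the sketch below; the repository id, split name, and column names are assumptions, not confirmed by this card.

from datasets import load_dataset

# Assumed repository id and split; check the dataset card for the real ones.
ds = load_dataset("gate-institute/GATE-VLAP", split="train")

sample = ds[0]
# Assumed per-frame fields:
#   sample["image"]        -> 128x128 RGB observation
#   sample["instruction"]  -> subtask language annotation
#   sample["action"]       -> 7-DOF action (x, y, z, roll, pitch, yaw, gripper)
print(sample.keys())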

Citation

@article{gateVLAP_SAC2026,
  title={Atomic Action Slicing: Planner-Aligned Options for Generalist VLA Agents},
  author={Stefan Tabakov and Asen Popov and Dimitar Dimitrov and Ensiye Kiyamousavi and Boris Kraychev},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  note={The 41st ACM/SIGAPP Symposium on Applied Computing (SAC2026), track on Intelligent Robotics and Multi-Agent Systems (IRMAS)},
  year={2025}
}

Maintainer

GATE Institute - Advanced AI Research Group, Sofia, Bulgaria
