Apply for community grant: Company project (GPU and storage)
Hi Hugging Face Team,
On behalf of the SmilyAI Labs Research Team, we’re requesting GPU and persistent storage support for our open project:
🥥 COCONUT-VLM: A 3-Stage Chain-of-Thought Training Framework for Vision-Language Models
We’ve built a Hugging Face Space that trains a VLM live in-browser — no simulation. Users can:
- ✅ Trigger training for Stage 1 (basic CoT), Stage 2 (masked reasoning), or Stage 3 (reflective COCONUT mode)
- ✅ Chat with the latest trained stage — model reloads from saved checkpoint
- ✅ Upload images and receive CoT responses formatted per stage (a sketch of the reload-and-template flow follows this list):
  - Stage 1: “Let’s think step by step...”
  - Stage 2: internal reasoning hidden, only the final answer shown
  - Stage 3: self-revision with confidence-aware output
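For concreteness, here is a minimal sketch of how the checkpoint reload and stage-specific templates described above might look. Everything here (the template wording, the `STAGE_TEMPLATES` dict, `load_latest_stage`, and the checkpoint paths) is illustrative, not the Space’s actual source:

```python
# Illustrative sketch only: names, template wording, and paths are assumptions.
from peft import PeftModel
from transformers import LlavaForConditionalGeneration

STAGE_TEMPLATES = {
    1: "USER: <image>\n{question}\nASSISTANT: Let's think step by step...",
    2: "USER: <image>\n{question}\nASSISTANT (final answer only): ",
    3: "USER: <image>\n{question}\nASSISTANT (draft, self-revision, confidence): ",
}

def load_latest_stage(base_model: str, stage: int):
    """Reload the chat model from the most recently trained stage checkpoint."""
    base = LlavaForConditionalGeneration.from_pretrained(base_model)
    return PeftModel.from_pretrained(base, f"./checkpoints/stage_{stage}")

def build_prompt(stage: int, question: str) -> str:
    """Format a chat prompt using the stage-specific CoT template."""
    return STAGE_TEMPLATES[stage].format(question=question)
```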
⚙️ What We Need (to run what we’ve already built):
- GPU Access: 1× A10G or T4, for LoRA training (~12–24 hrs per stage)
- Persistent Storage: any size you can provide; we save checkpoints to ./checkpoints/stage_X, and without persistence that progress is lost on every reboot
- Session Duration: ~72 hrs total (non-consecutive is fine), i.e. 3 stages × ~24 hrs
🚫 We respectfully opt out of ZeroGPU: not out of preference, but to protect community users.
In our Space, chatting or uploading an image triggers model inference. On ZeroGPU, this silently consumes the visitor’s free GPU quota, leading to confusion and frustration (“Why is my GPU time gone?”).
We want users to engage freely without burning their own resources. A dedicated GPU lets us carry the compute burden, keeping the experience fair and welcoming.
✅ What Our Code Currently Delivers:
- 🧠 Real LoRA fine-tuning of LLaVA/TinyLLaVA using PEFT + Transformers (see the training sketch after this list)
- 💾 Checkpoints saved locally per stage (manual HF Hub push planned post-training)
- 🖼️ Image + text CoT chat with stage-specific prompt templates
- 🔄 Training runs in a background thread, so the UI stays responsive
- 🚫 No reliance on ZeroGPU — avoids unfair quota consumption for visitors
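As a rough illustration of the training points above, the sketch below runs LoRA fine-tuning with PEFT from a background thread and saves adapters to the per-stage checkpoint path from our request. The base model ID, LoRA rank, and target modules are assumed defaults, not our exact configuration:

```python
import threading
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

# Assumed base checkpoint; the Space may load TinyLLaVA instead.
BASE_MODEL = "llava-hf/llava-1.5-7b-hf"

def train_stage(stage: int) -> None:
    model = LlavaForConditionalGeneration.from_pretrained(BASE_MODEL)
    # LoRA hyperparameters here are illustrative defaults, not ours.
    lora = LoraConfig(r=16, lora_alpha=32,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    # ... stage-specific training loop elided ...
    model.save_pretrained(f"./checkpoints/stage_{stage}")  # path cited in our request

# Run training off the main thread so the Gradio UI stays responsive.
threading.Thread(target=train_stage, args=(1,), daemon=True).start()
```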
📅 What We’ll Add Post-Grant (Planned, Not Promised):
- 📤 Auto-push trained adapters to HF Hub
- 📚 Technical blog + Colab tutorial (“How COCONUT-VLM Works”)
- 🎥 Short demo video + community tweet thread (tagging @HuggingFace)
👥 About SmilyAI Labs
We’re a research team focused on transparent, interactive AI education. This project uses:
- HF Transformers, PEFT, Datasets, Gradio
- Real training — not simulation
- Community-first design — no hidden quota costs for users
We believe reasoning should be visible and improvable — and we’re building the tools to prove it.
We’d be honored to partner with Hugging Face on this. Your support would enable us to build something real, reproducible, and respectful of the community’s resources.
Thank you for your time — and for building the ecosystem that makes this possible.
Warmly,
Bc
On behalf of the SmilyAI Labs Research Team