I appreciate the clarification! My goal with this project was purely educational: to understand the mechanics of how vision-language connectors work. While the base weights belong to Google and OpenAI, the implementation of the projection layers and the fine-tuning process were my contributions. I'm still learning the ropes, so I appreciate the call-out on the terminology!
Sagar pallai
AI & ML interests: LLM and Stable Diffusion

Recent Activity

New activity (about 6 hours ago) on sagar007/multigemma: License Violation Alert: Illegally Re-Licensing Google's Gemma Model as "Open Source"

Replied to their post (about 6 hours ago):
I built a Multimodal Vision-Language Model using Gemma-270M + CLIP!
Just finished training my multimodal model on the full LLaVA-Instruct-150K dataset (157K samples) and wanted to share the results!
What I Built:
A vision-language model that can understand images and answer questions about them, combining (a rough connector sketch follows this list):
- Google Gemma-3-270M (language)
- OpenAI CLIP ViT-Large/14 (vision)
- LoRA fine-tuning for efficiency
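For readers curious how a connector like this is usually wired, here is a minimal, hedged sketch of a LLaVA-style projection layer; the architecture and dimensions are my assumptions, not the repo's actual code.

```python
# Minimal sketch of a LLaVA-style projector, NOT the repo's exact code:
# it maps frozen CLIP ViT-L/14 patch features into the Gemma embedding space
# so they can be prepended to the text token embeddings. The dimensions are
# assumptions; read the real values from the two model configs.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, clip_dim: int = 1024, lm_dim: int = 640):
        super().__init__()
        # Two-layer MLP connector in the style of LLaVA-1.5
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, clip_dim) hidden states from CLIP
        return self.proj(patch_feats)  # (batch, num_patches, lm_dim)

# Projected image tokens are then concatenated with the text embeddings:
# inputs_embeds = torch.cat([VisionProjector()(clip_feats), text_embeds], dim=1)
```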
Training Stats:
- 157,712 training samples (full LLaVA dataset)
- 3 epochs on A100 40GB
- ~9 hours training time
- Final loss: 1.333 training / 1.430 validation
- Only 18.6M trainable params (3.4% of 539M total; LoRA config sketch below)
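To make the parameter-efficiency point concrete, here is a hedged sketch of the kind of PEFT/LoRA setup that leaves only a few percent of weights trainable; the rank, alpha, and target modules are my guesses, not the repo's config.

```python
# Hedged illustration of a LoRA setup with a small trainable fraction.
# The hyperparameters and target modules below are assumptions, and the
# 18.6M figure above also includes the vision projector, which LoRA alone
# would not reproduce.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")  # gated repo: needs HF auth
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts
```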
https://huggingface.co/sagar007/multigemma
Benchmark Results:
- VQA Accuracy: 53.8%
- Works great for: animal detection, room identification, scene understanding
**Try it yourself** (a programmatic query sketch follows the links):
- Model: https://huggingface.co/sagar007/multigemma
- Demo: https://huggingface.co/spaces/sagar007/Multimodal-Gemma
- GitHub: https://github.com/sagar431/multimodal-gemma-270m
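If you prefer hitting the demo from a script rather than the web UI, a sketch using gradio_client is below; the endpoint name and input order are assumptions, so inspect the Space's API first.

```python
# Hypothetical way to query the demo Space from code using gradio_client.
# The endpoint name and argument order are guesses; call view_api() first
# to see the Space's real signature.
from gradio_client import Client, handle_file

client = Client("sagar007/Multimodal-Gemma")
client.view_api()  # prints the actual endpoints and their parameters

result = client.predict(
    handle_file("cat.jpg"),             # local image file (assumed input)
    "What animal is in this picture?",  # question about the image (assumed input)
    api_name="/predict",                # assumed endpoint name
)
print(result)
```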
Built with PyTorch Lightning + MLflow for experiment tracking. Full MLOps pipeline with CI/CD!
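For context, this is roughly what a Lightning trainer with an MLflow logger looks like; the module and datamodule names are placeholders, not the actual classes from the GitHub repo.

```python
# Rough sketch of Lightning + MLflow wiring; the LightningModule / DataModule
# class names in the commented fit() call are placeholders.
import lightning.pytorch as pl
from lightning.pytorch.loggers import MLFlowLogger

mlflow_logger = MLFlowLogger(
    experiment_name="multimodal-gemma",
    tracking_uri="file:./mlruns",  # local MLflow store; point at a server in CI
)
trainer = pl.Trainer(
    max_epochs=3,
    accelerator="gpu",
    precision="bf16-mixed",
    logger=mlflow_logger,
)
# trainer.fit(MultimodalGemmaModule(), datamodule=LLaVADataModule())  # placeholder names
```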
Would love to hear your feedback!
#multimodal #gemma #clip #llava #vision-language #pytorch