# VINE: Video Understanding with Natural Language

[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-video--fm%2Fvine-blue)](https://huggingface.co/video-fm/vine)
[![GitHub](https://img.shields.io/badge/GitHub-LASER-green)](https://github.com/kevinxuez/LASER)

VINE is a video understanding model. Given a video and three keyword lists (categorical, unary, and binary), it returns probability distributions over those keywords for each detected object and for the relationships between objects.

## Quick Start

```python
from transformers import AutoModel
from vine_hf import VineConfig, VineModel, VinePipeline

# Load VINE model from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create pipeline with your checkpoint paths
vine_pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path="/path/to/sam2_config.yaml",
    sam_checkpoint_path="/path/to/sam2_checkpoint.pt",
    gd_config_path="/path/to/grounding_dino_config.py",
    gd_checkpoint_path="/path/to/grounding_dino_checkpoint.pth",
    device="cuda",
    trust_remote_code=True
)

# Process a video
results = vine_pipeline(
    'path/to/video.mp4',
    categorical_keywords=['human', 'dog', 'frisbee'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'behind'],
    return_top_k=3
)
```

## Installation

### Option 1: Automated Setup (Recommended)

```bash
# Download the setup script
wget https://raw.githubusercontent.com/kevinxuez/vine_hf/main/setup_vine_demo.sh

# Run the setup
bash setup_vine_demo.sh

# Activate environment
conda activate vine_demo
```

### Option 2: Manual Installation

```bash
# 1. Create conda environment
conda create -n vine_demo python=3.10 -y
conda activate vine_demo

# 2. Install PyTorch with CUDA support
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126

# 3. Install core dependencies
pip install transformers huggingface-hub safetensors

# 4. Clone and install required repositories
git clone https://github.com/video-fm/video-sam2.git
git clone https://github.com/video-fm/GroundingDINO.git
git clone https://github.com/kevinxuez/LASER.git
git clone https://github.com/kevinxuez/vine_hf.git

# Install in editable mode
pip install -e ./video-sam2
pip install -e ./GroundingDINO
pip install -e ./LASER
pip install -e ./vine_hf

# Build GroundingDINO extensions
cd GroundingDINO && python setup.py build_ext --force --inplace && cd ..
```

## Required Checkpoints

VINE relies on GroundingDINO for object detection and SAM2 for segmentation. Their checkpoints and config files are not bundled with the VINE weights, so download them separately:

### SAM2 Checkpoint
```bash
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
# Config for the SAM 2.0 tiny checkpoint above (saved as sam2_hiera_t.yaml)
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2/sam2_hiera_t.yaml
```

### GroundingDINO Checkpoint
```bash
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
```
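
Before constructing the pipeline, it can help to confirm that all four files are actually in place. The following is a minimal sketch using only the standard library; `missing_checkpoint_files` is a hypothetical helper (not part of `vine_hf`), and the paths are the same placeholders used in Quick Start.

```python
from pathlib import Path


def missing_checkpoint_files(paths):
    """Return the subset of `paths` that do not exist or are empty files."""
    missing = []
    for raw in paths:
        path = Path(raw)
        # An interrupted download can leave a zero-byte file behind.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(str(path))
    return missing


if __name__ == "__main__":
    # Placeholder paths; replace with your own download locations.
    required = [
        "/path/to/sam2_config.yaml",
        "/path/to/sam2_checkpoint.pt",
        "/path/to/grounding_dino_config.py",
        "/path/to/grounding_dino_checkpoint.pth",
    ]
    for path in missing_checkpoint_files(required):
        print(f"missing or empty: {path}")
```

An empty result means every file exists and is non-empty; it does not verify that a checkpoint is complete or matches its config.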

## Architecture

```
video-fm/vine (HuggingFace Hub)
├── VINE Model Weights (~1.8GB)
│   ├── Categorical CLIP model (fine-tuned)
│   ├── Unary CLIP model (fine-tuned)
│   └── Binary CLIP model (fine-tuned)
└── Architecture Files
    ├── vine_config.py
    ├── vine_model.py
    ├── vine_pipeline.py
    └── utilities

User Provides:
├── Dependencies (via pip/conda)
│   ├── laser (video processing utilities)
│   ├── sam2 (segmentation)
│   └── groundingdino (object detection)
└── Checkpoints (downloaded separately)
    ├── SAM2 model files
    └── GroundingDINO model files
```
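
To make "probability distributions over keywords" concrete, the sketch below shows the general CLIP-style recipe: cosine similarity between a visual embedding and each keyword embedding, temperature scaling, softmax, then top-k. It is an illustration of the technique only, not VINE's actual code. The function name `keyword_distribution`, the `logit_scale` value, and the toy two-dimensional embeddings are all assumptions.

```python
import math


def keyword_distribution(image_emb, keyword_embs, keywords, logit_scale=100.0, top_k=3):
    """Softmax over scaled cosine similarities, as (probability, keyword) tuples."""

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

    logits = [logit_scale * cosine(image_emb, emb) for emb in keyword_embs]
    # Subtract the max logit for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    scored = [(e / total, kw) for e, kw in zip(exps, keywords)]
    # Highest probability first, truncated to top_k entries.
    return sorted(scored, reverse=True)[:top_k]


# Toy embeddings, made up for illustration only.
toy_image = [1.0, 0.0]
toy_keywords = ["human", "dog", "frisbee"]
toy_keyword_embs = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
for prob, keyword in keyword_distribution(toy_image, toy_keyword_embs, toy_keywords):
    print(f"{keyword}: {prob:.3f}")
```

The `(probability, keyword)` ordering mirrors the prediction tuples described under Output Format.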

## Why This Architecture?

This separation of concerns provides several benefits:

1. **Lightweight Distribution**: Only VINE-specific weights (~1.8GB) are on HuggingFace
2. **Version Control**: Users can choose their preferred SAM2/GroundingDINO versions
3. **Licensing**: Keeps different model licenses separate
4. **Flexibility**: Easy to swap segmentation backends
5. **Standard Practice**: Similar to models like LLaVA, BLIP-2, etc.

## Full Usage Example

```python
import os
from pathlib import Path
from transformers import AutoModel
from vine_hf import VinePipeline

# Set up paths
checkpoint_dir = Path("/path/to/checkpoints")
sam_config = checkpoint_dir / "sam2_hiera_t.yaml"
sam_checkpoint = checkpoint_dir / "sam2_hiera_tiny.pt"
gd_config = checkpoint_dir / "GroundingDINO_SwinT_OGC.py"
gd_checkpoint = checkpoint_dir / "groundingdino_swint_ogc.pth"

# Load VINE from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create pipeline
vine_pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path=str(sam_config),
    sam_checkpoint_path=str(sam_checkpoint),
    gd_config_path=str(gd_config),
    gd_checkpoint_path=str(gd_checkpoint),
    device="cuda:0",
    trust_remote_code=True
)

# Process video
results = vine_pipeline(
    "path/to/video.mp4",
    categorical_keywords=['person', 'dog', 'ball'],
    unary_keywords=['running', 'jumping', 'sitting'],
    binary_keywords=['chasing', 'next to', 'holding'],
    object_pairs=[(0, 1), (0, 2)],  # person-dog, person-ball
    return_top_k=5,
    include_visualizations=True
)

# Access results
print(f"Detected {results['summary']['num_objects_detected']} objects")
print(f"Top categories: {results['summary']['top_categories']}")
print(f"Top actions: {results['summary']['top_actions']}")
print(f"Top relations: {results['summary']['top_relations']}")

# Access detailed predictions
for obj_id, predictions in results['categorical_predictions'].items():
    print(f"\nObject {obj_id}:")
    for prob, category in predictions:
        print(f"  {category}: {prob:.3f}")
```

## Output Format

```python
{
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "confidence_scores": {
        "categorical": float,
        "unary": float,
        "binary": float
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    },
    "visualizations": {  # if include_visualizations=True
        "vine": {
            "all": {"frames": [...], "video_path": "..."},
            ...
        }
    }
}
```
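
The tuple-keyed dictionaries can be awkward to scan by eye. As a sketch of how to walk them, the helper below picks the most probable relation for each frame and object pair. It assumes the structure shown above, with plain-float probabilities in `(probability, relation)` tuples; `top_relation_per_pair` is a hypothetical helper and the sample data is made up.

```python
def top_relation_per_pair(binary_predictions):
    """Map each (frame_id, (obj1_id, obj2_id)) key to its best (probability, relation)."""
    best = {}
    for key, candidates in binary_predictions.items():
        if candidates:  # skip pairs with no predictions
            best[key] = max(candidates, key=lambda item: item[0])
    return best


# Made-up sample in the documented format, for illustration only.
sample = {
    (0, (0, 1)): [(0.2, "behind"), (0.7, "chasing")],
    (1, (0, 1)): [(0.6, "behind"), (0.3, "chasing")],
}
for (frame_id, pair), (prob, relation) in sorted(top_relation_per_pair(sample).items()):
    print(f"frame {frame_id}, objects {pair}: {relation} ({prob:.2f})")
```

The same pattern applies to `unary_predictions`, whose keys are `(frame_id, object_id)`.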

## Configuration Options

```python
from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
    segmentation_method="grounding_dino_sam2",   # or "sam2"
    box_threshold=0.35,                          # GroundingDINO threshold
    text_threshold=0.25,                         # GroundingDINO threshold
    target_fps=5,                                # Video sampling rate
    visualize=True,                              # Enable visualizations
    visualization_dir="outputs/",                # Output directory
    debug_visualizations=False,                  # Debug mode
    device="cuda:0"                              # Device
)
```
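
To give a feel for what `target_fps` implies, the sketch below computes which frame indices a uniform sampler would keep when reducing a video from its native frame rate. This illustrates the general technique under stated assumptions; it is not VINE's actual sampling code, and `sampled_frame_indices` is a hypothetical name.

```python
def sampled_frame_indices(num_frames, native_fps, target_fps):
    """Indices of the frames kept when uniformly downsampling to target_fps."""
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Never upsample: keep every frame if the target rate exceeds the native rate.
    step = max(native_fps / target_fps, 1.0)
    indices = []
    i = 0
    while int(i * step) < num_frames:
        indices.append(int(i * step))
        i += 1
    return indices


# A 2-second clip at 30 fps, sampled at target_fps=5.
kept = sampled_frame_indices(num_frames=60, native_fps=30, target_fps=5)
print(len(kept), kept[:3])
```

Under this scheme, lowering `target_fps` reduces the number of frames processed roughly in proportion.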

## Deployment Examples

### Local Script
```python
# test_vine.py
from transformers import AutoModel
from vine_hf import VinePipeline

model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(
    model=model,
    # ... plus checkpoint paths and device, as in Quick Start
)
results = pipeline("video.mp4")  # plus keyword lists, as in Quick Start
```

### HuggingFace Spaces
```python
# app.py for Gradio Space
import gradio as gr
from transformers import AutoModel
from vine_hf import VinePipeline

model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
# ... set up pipeline and Gradio interface
```

### API Server
```python
# FastAPI server
from fastapi import FastAPI
from transformers import AutoModel
from vine_hf import VinePipeline

app = FastAPI()
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
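pipeline = VinePipeline(
    model=model,
    # ... plus checkpoint paths and device, as in Quick Start
)

@app.post("/process")
def process_video(video_path: str):
    # Plain `def`: FastAPI runs it in a worker thread, so the blocking
    # inference call does not stall the event loop.
    return pipeline(video_path)  # plus keyword lists, as in Quick Start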
```
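
One thing to watch when returning pipeline output over HTTP: the result dictionaries described under Output Format use tuple keys such as `(frame_id, object_id)`, which JSON cannot represent. The sketch below is one possible conversion, assuming the values are plain Python floats, strings, lists, and tuples; `to_json_safe` is a hypothetical helper, and encoding keys as JSON strings is just one choice.

```python
import json


def to_json_safe(value):
    """Recursively turn non-string dict keys into JSON strings and tuples into lists."""
    if isinstance(value, dict):
        return {
            key if isinstance(key, str) else json.dumps(key): to_json_safe(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    return value


# Made-up sample in the documented format, for illustration only.
sample_results = {"unary_predictions": {(0, 1): [(0.9, "running")]}}
print(json.dumps(to_json_safe(sample_results)))
```

A client can recover the original key with `json.loads` on each key string. Tensor or NumPy values, if present, would need their own conversion first.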

## Troubleshooting

### Import Errors
```bash
# Make sure all dependencies are installed
pip list | grep -iE "laser|sam-?2|groundingdino"

# Reinstall if needed
pip install -e ./LASER
pip install -e ./video-sam2
pip install -e ./GroundingDINO
```

### CUDA Errors
```python
# Check CUDA availability
import torch
print(torch.cuda.is_available())
print(torch.version.cuda)

# Use CPU if needed
pipeline = VinePipeline(
    model=model,
    device="cpu",
    # ... plus checkpoint paths, as in Quick Start
)
```
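
The check and the CPU fallback above can be combined into a single step. This is a minimal sketch; `pick_device` is a hypothetical helper, not part of `vine_hf`, and it only decides which device string to pass.

```python
def pick_device(preferred="cuda:0"):
    """Return `preferred` if it is usable, otherwise fall back to "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no CUDA to query; report CPU.
        return "cpu"
    if preferred.startswith("cuda") and not torch.cuda.is_available():
        return "cpu"
    return preferred


print(pick_device())
```

The returned string can then be passed as the `device` argument when constructing `VinePipeline`. Expect CPU inference to be considerably slower than GPU inference.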

### Checkpoint Not Found
```bash
# Verify checkpoint paths
ls -lh /path/to/sam2_hiera_tiny.pt
ls -lh /path/to/groundingdino_swint_ogc.pth
```

## System Requirements

- **Python**: 3.10+
- **CUDA**: 11.8+ (for GPU)
- **GPU**: 8GB+ VRAM recommended (T4, V100, A100, etc.)
- **RAM**: 16GB+ recommended
- **Storage**: ~3GB for checkpoints

## Citation

```bibtex
@article{laser2024,
  title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
  author={Your Authors},
  journal={Your Conference/Journal},
  year={2024}
}
```

## License

This model and code are released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.

## Links

- **Model**: https://huggingface.co/video-fm/vine
- **Code**: https://github.com/kevinxuez/LASER
- **vine_hf Package**: https://github.com/kevinxuez/vine_hf
- **SAM2**: https://github.com/facebookresearch/sam2
- **GroundingDINO**: https://github.com/IDEA-Research/GroundingDINO

## Support

For issues or questions:
- **Model/Architecture**: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions)
- **LASER Framework**: [GitHub Issues](https://github.com/kevinxuez/LASER/issues)
- **vine_hf Package**: [GitHub Issues](https://github.com/kevinxuez/vine_hf/issues)