
VINE: Video Understanding with Natural Language

Model: https://huggingface.co/video-fm/vine | Code: https://github.com/kevinxuez/vine_hf

VINE is a video understanding model that processes a video together with categorical, unary, and binary keywords and returns probability distributions over those keywords for detected objects and their relationships. Categorical keywords name object classes, unary keywords describe single-object actions or attributes, and binary keywords describe relations between pairs of objects.

Quick Start

from transformers import AutoModel
from vine_hf import VinePipeline

# Load VINE model from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create pipeline with your checkpoint paths
vine_pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path="/path/to/sam2_config.yaml",
    sam_checkpoint_path="/path/to/sam2_checkpoint.pt",
    gd_config_path="/path/to/grounding_dino_config.py",
    gd_checkpoint_path="/path/to/grounding_dino_checkpoint.pth",
    device="cuda",
    trust_remote_code=True
)

# Process a video
results = vine_pipeline(
    'path/to/video.mp4',
    categorical_keywords=['human', 'dog', 'frisbee'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'behind'],
    return_top_k=3
)
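
The call returns a plain Python dictionary with per-object predictions and summary statistics; see the Output Format section below for the full schema.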

Installation

Option 1: Automated Setup (Recommended)

# Download the setup script
wget https://raw.githubusercontent.com/kevinxuez/vine_hf/main/setup_vine_demo.sh

# Run the setup
bash setup_vine_demo.sh

# Activate environment
conda activate vine_demo

Option 2: Manual Installation

# 1. Create conda environment
conda create -n vine_demo python=3.10 -y
conda activate vine_demo

# 2. Install PyTorch with CUDA support
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126

# 3. Install core dependencies
pip install transformers huggingface-hub safetensors

# 4. Clone and install required repositories
git clone https://github.com/video-fm/video-sam2.git
git clone https://github.com/video-fm/GroundingDINO.git
git clone https://github.com/kevinxuez/LASER.git
git clone https://github.com/kevinxuez/vine_hf.git

# Install in editable mode
pip install -e ./video-sam2
pip install -e ./GroundingDINO
pip install -e ./LASER
pip install -e ./vine_hf

# Build GroundingDINO extensions
cd GroundingDINO && python setup.py build_ext --force --inplace && cd ..
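
To confirm everything resolved, a quick import check can help (a minimal sketch; the module names sam2, groundingdino, and laser are assumed from the editable installs above):

# verify_install.py -- sanity-check that the editable installs are importable
import torch
import sam2            # from video-sam2
import groundingdino   # from GroundingDINO
import laser           # from LASER
import vine_hf

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())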

Required Checkpoints

VINE requires SAM2 and GroundingDINO checkpoints for object detection and segmentation. Download these separately:

SAM2 Checkpoint

wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml

GroundingDINO Checkpoint

wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
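
Before wiring these into the pipeline, it is worth verifying that the files landed where you expect (a small sketch; adjust checkpoint_dir to wherever you saved them):

from pathlib import Path

checkpoint_dir = Path("/path/to/checkpoints")
expected = [
    "sam2_hiera_tiny.pt",
    "sam2.1_hiera_t.yaml",
    "groundingdino_swint_ogc.pth",
    "GroundingDINO_SwinT_OGC.py",
]
for name in expected:
    path = checkpoint_dir / name
    print(f"{name}: {'OK' if path.exists() else 'MISSING'} ({path})")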

Architecture

video-fm/vine (HuggingFace Hub)
├── VINE Model Weights (~1.8GB)
│   ├── Categorical CLIP model (fine-tuned)
│   ├── Unary CLIP model (fine-tuned)
│   └── Binary CLIP model (fine-tuned)
└── Architecture Files
    ├── vine_config.py
    ├── vine_model.py
    ├── vine_pipeline.py
    └── utilities

User Provides:
├── Dependencies (via pip/conda)
│   ├── laser (video processing utilities)
│   ├── sam2 (segmentation)
│   └── groundingdino (object detection)
└── Checkpoints (downloaded separately)
    ├── SAM2 model files
    └── GroundingDINO model files

Why This Architecture?

This separation of concerns provides several benefits:

  1. Lightweight Distribution: Only VINE-specific weights (~1.8GB) are on HuggingFace
  2. Version Control: Users can choose their preferred SAM2/GroundingDINO versions
  3. Licensing: Keeps different model licenses separate
  4. Flexibility: Easy to swap segmentation backends
  5. Standard Practice: Similar to models like LLaVA, BLIP-2, etc.

Full Usage Example

import os
from pathlib import Path
from transformers import AutoModel
from vine_hf import VinePipeline

# Set up paths
checkpoint_dir = Path("/path/to/checkpoints")
sam_config = checkpoint_dir / "sam2.1_hiera_t.yaml"  # filename as downloaded above
sam_checkpoint = checkpoint_dir / "sam2_hiera_tiny.pt"
gd_config = checkpoint_dir / "GroundingDINO_SwinT_OGC.py"
gd_checkpoint = checkpoint_dir / "groundingdino_swint_ogc.pth"

# Load VINE from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create pipeline
vine_pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path=str(sam_config),
    sam_checkpoint_path=str(sam_checkpoint),
    gd_config_path=str(gd_config),
    gd_checkpoint_path=str(gd_checkpoint),
    device="cuda:0",
    trust_remote_code=True
)

# Process video
results = vine_pipeline(
    "path/to/video.mp4",
    categorical_keywords=['person', 'dog', 'ball'],
    unary_keywords=['running', 'jumping', 'sitting'],
    binary_keywords=['chasing', 'next to', 'holding'],
    object_pairs=[(0, 1), (0, 2)],  # person-dog, person-ball
    return_top_k=5,
    include_visualizations=True
)

# Access results
print(f"Detected {results['summary']['num_objects_detected']} objects")
print(f"Top categories: {results['summary']['top_categories']}")
print(f"Top actions: {results['summary']['top_actions']}")
print(f"Top relations: {results['summary']['top_relations']}")

# Access detailed predictions
for obj_id, predictions in results['categorical_predictions'].items():
    print(f"\nObject {obj_id}:")
    for prob, category in predictions:
        print(f"  {category}: {prob:.3f}")

Output Format

{
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "confidence_scores": {
        "categorical": float,
        "unary": float,
        "binary": float
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    },
    "visualizations": {  # if include_visualizations=True
        "vine": {
            "all": {"frames": [...], "video_path": "..."},
            ...
        }
    }
}
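
Because the unary and binary dictionaries are keyed by tuples, iterating over them differs slightly from the categorical case. A short sketch, assuming results follows the schema above and each prediction list is sorted by descending probability:

# Per-frame actions: keys are (frame_id, object_id)
for (frame_id, obj_id), predictions in results["unary_predictions"].items():
    prob, action = predictions[0]  # highest-probability entry
    print(f"frame {frame_id}, object {obj_id}: {action} ({prob:.3f})")

# Per-frame relations: keys are (frame_id, (obj1_id, obj2_id))
for (frame_id, (obj1, obj2)), predictions in results["binary_predictions"].items():
    prob, relation = predictions[0]
    print(f"frame {frame_id}: object {obj1} {relation} object {obj2} ({prob:.3f})")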

Configuration Options

from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
    segmentation_method="grounding_dino_sam2",   # or "sam2"
    box_threshold=0.35,                          # GroundingDINO threshold
    text_threshold=0.25,                         # GroundingDINO threshold
    target_fps=5,                                # Video sampling rate
    visualize=True,                              # Enable visualizations
    visualization_dir="outputs/",                # Output directory
    debug_visualizations=False,                  # Debug mode
    device="cuda:0"                              # Device
)
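
VineConfig follows the standard transformers configuration pattern, so (assuming VineModel subclasses PreTrainedModel, as the AutoModel integration suggests) a config instance can be used to construct a model directly. A sketch, not the canonical loading path:

from vine_hf import VineConfig, VineModel

# Hypothetical: building from a config yields randomly initialized weights;
# use AutoModel.from_pretrained('video-fm/vine', ...) for the released weights.
config = VineConfig(target_fps=2, visualize=False, device="cpu")
model = VineModel(config)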

Deployment Examples

Local Script

# test_vine.py
from transformers import AutoModel
from vine_hf import VinePipeline

model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(model=model, ...)
results = pipeline("video.mp4", ...)

HuggingFace Spaces

# app.py for Gradio Space
import gradio as gr
from transformers import AutoModel
from vine_hf import VinePipeline

model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
# ... set up pipeline and Gradio interface

API Server

# FastAPI server
from fastapi import FastAPI
from transformers import AutoModel
from vine_hf import VinePipeline

app = FastAPI()
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(model=model, ...)

@app.post("/process")
async def process_video(video_path: str):
    return pipeline(video_path, ...)
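
One caveat for the API server: the unary and binary prediction dictionaries use tuple keys (see Output Format above), which JSON cannot encode, so the endpoint should flatten them before returning. A minimal sketch using a hypothetical jsonable helper:

# JSON object keys must be strings; flatten tuple keys such as (frame_id, object_id)
def jsonable(preds: dict) -> dict:
    return {str(k): v for k, v in preds.items()}

@app.post("/process")
async def process_video(video_path: str):
    results = pipeline(video_path, ...)
    for key in ("unary_predictions", "binary_predictions"):
        results[key] = jsonable(results[key])
    return results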

Troubleshooting

Import Errors

# Make sure all dependencies are installed
pip list | grep -E "laser|sam2|groundingdino"

# Reinstall if needed
pip install -e ./LASER
pip install -e ./video-sam2
pip install -e ./GroundingDINO

CUDA Errors

# Check CUDA availability
import torch
print(torch.cuda.is_available())
print(torch.version.cuda)

# Use CPU if needed
pipeline = VinePipeline(model=model, device="cpu", ...)

Checkpoint Not Found

# Verify checkpoint paths
ls -lh /path/to/sam2_hiera_tiny.pt
ls -lh /path/to/groundingdino_swint_ogc.pth

System Requirements

  • Python: 3.10+
  • CUDA: 11.8+ (for GPU)
  • GPU: 8GB+ VRAM recommended (T4, V100, A100, etc.)
  • RAM: 16GB+ recommended
  • Storage: ~3GB for checkpoints

Citation

@article{laser2024,
  title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
  author={Your Authors},
  journal={Your Conference/Journal},
  year={2024}
}

License

This model and code are released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.


Support

For issues or questions, please open an issue on GitHub: https://github.com/kevinxuez/vine_hf/issues