VINE HuggingFace Interface
VINE (Video Understanding with Natural Language) is a model that processes videos along with categorical, unary, and binary keywords to return probability distributions over those keywords for detected objects and their relationships.
This package provides a HuggingFace-compatible interface for the VINE model, making it easy to use for video understanding tasks.
Features
- Categorical Classification: Classify objects in videos (e.g., "human", "dog", "frisbee")
- Unary Predicates: Detect actions on single objects (e.g., "running", "jumping", "sitting")
- Binary Relations: Detect relationships between object pairs (e.g., "behind", "in front of", "chasing")
- Multiple Segmentation Methods: Support for SAM2 and Grounding DINO + SAM2
- HuggingFace Integration: Full compatibility with HuggingFace transformers and pipelines
- Visualization Hooks: Optional high-level visualizations plus lightweight debug mask dumps for quick sanity checks
Installation
# Install the core dependencies (the vine_hf package itself is assumed to be on your Python path)
pip install transformers torch torchvision
pip install opencv-python pillow numpy
# For segmentation functionality, you'll also need:
# - SAM2: https://github.com/facebookresearch/sam2
# - Grounding DINO: https://github.com/IDEA-Research/GroundingDINO
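After installing the backends, a quick import check is a cheap way to confirm they are visible to Python before wiring up the pipeline. This minimal sketch only reuses the import paths shown later in this README:

import torch
from sam2.build_sam import build_sam2, build_sam2_video_predictor
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator
from groundingdino.util.inference import Model as GroundingDINOModel

# If these imports succeed, both segmentation backends are installed.
# CUDA availability decides whether device="cuda:0" is usable later.
print("CUDA available:", torch.cuda.is_available())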
Segmentation Model Configuration
VinePipeline initializes the segmentation stack lazily, the first time a call needs masks. Thresholds, FPS, visualization toggles, and device selection live in VineConfig; the pipeline constructor either takes the file paths from which to load the SAM2 / Grounding DINO weights or accepts already-instantiated modules.
Provide file paths at construction (most common)
from vine_hf import VineConfig, VineModel, VinePipeline
vine_config = VineConfig(
segmentation_method="grounding_dino_sam2", # or "sam2"
box_threshold=0.35,
text_threshold=0.25,
target_fps=5,
visualization_dir="output/visualizations", # where to write visualizations (and debug visualizations if enabled)
debug_visualizations=True,  # write videos of intermediate Grounding DINO / SAM2 / unary / binary outputs
pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
device="cuda:0", # accepts int, str, or torch.device
)
vine_model = VineModel(vine_config)
vine_pipeline = VinePipeline(
model=vine_model,
tokenizer=None,
sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
device=vine_config._device,
)
When segmentation_method="grounding_dino_sam2", both SAM2 and GroundingDINO must be reachable. The pipeline validates the paths; missing files raise a ValueError. If you pick "sam2", only the SAM2 config and checkpoint are required.
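For example, a SAM2-only setup can drop the Grounding DINO arguments entirely. This is a sketch; the weight paths are placeholders to replace with your own:

from vine_hf import VineConfig, VineModel, VinePipeline

config = VineConfig(segmentation_method="sam2", device="cuda:0")
model = VineModel(config)

vine_pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
    # gd_config_path / gd_checkpoint_path are not needed for "sam2"
    device=config._device,
)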
Reuse pre-initialized segmentation modules
If you build the segmentation stack elsewhere, inject the components with set_segmentation_models before running the pipeline:
from sam2.build_sam import build_sam2_video_predictor, build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator
from groundingdino.util.inference import Model as GroundingDINOModel
sam_predictor = build_sam2_video_predictor(..., device=vine_config._device)
mask_generator = SAM2AutomaticMaskGenerator(build_sam2(..., device=vine_config._device))
grounding_model = GroundingDINOModel(..., device=vine_config._device)
vine_pipeline.set_segmentation_models(
sam_predictor=sam_predictor,
mask_generator=mask_generator,
grounding_model=grounding_model,
)
Any argument left as None is initialized lazily from the file paths when the pipeline first needs that backend.
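For instance, assuming the constructor above was given valid SAM2 paths, you can inject only a pre-built Grounding DINO model and let the SAM2 components be created lazily; a sketch:

from groundingdino.util.inference import Model as GroundingDINOModel

grounding_model = GroundingDINOModel(..., device=vine_config._device)

# Only the Grounding DINO backend is injected; sam_predictor and mask_generator
# stay None and are built lazily from the paths passed to VinePipeline.
vine_pipeline.set_segmentation_models(grounding_model=grounding_model)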
Quick Start
Requirements
- torch
- torchvision
- transformers
- opencv-python
- matplotlib
- seaborn
- pandas
- numpy
- ipywidgets
- tqdm
- scikit-learn
- sam2 (from Facebook Research): https://github.com/video-fm/video-sam2
- sam2 weights (downloaded separately, e.g. https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt)
- groundingdino (from IDEA Research)
- groundingdino weights (downloaded separately, e.g. https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth)
- spacy-fastlang
- en-core-web-sm (for spacy-fastlang)
- ffmpeg (for video processing)
- (optional) LASER weights / full model checkpoint (downloaded separately, e.g. https://huggingface.co/video-fm/vine_v0)
Running laser/environments/laser_env.yml from the LASER repo installs most of these dependencies; sam2 and groundingdino must be installed manually, following their respective instructions.
Using the Pipeline (Recommended)
from transformers.pipelines import PIPELINE_REGISTRY
from vine_hf import VineConfig, VineModel, VinePipeline
PIPELINE_REGISTRY.register_pipeline(
"vine-video-understanding",
pipeline_class=VinePipeline,
pt_model=VineModel,
type="multimodal",
)
config = VineConfig(
segmentation_method="grounding_dino_sam2",
pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
visualization_dir="output",
visualize=True,
device="cuda:0",
)
model = VineModel(config)
vine_pipeline = VinePipeline(
model=model,
tokenizer=None,
sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
device=config._device,
)
results = vine_pipeline(
"/path/to/video.mp4",
categorical_keywords=["dog", "human"],
unary_keywords=["running"],
binary_keywords=["chasing"],
object_pairs=[(0, 1)],
return_top_k=3,
include_visualizations=True,
)
print(results["summary"])
Using the Model Directly (Advanced)
For advanced users who want to provide their own segmentation:
from vine_hf import VineConfig, VineModel
import torch
# Create configuration
config = VineConfig(
pretrained_vine_path="/path/to/your/vine/weights" # Optional: your fine-tuned weights
)
# Initialize model
model = VineModel(config)
# If you have your own video frames, masks, and bboxes from external segmentation
video_frames = torch.rand(3, 224, 224, 3) * 255  # Your video frames: (num_frames, H, W, 3), values in [0, 255]
masks = {0: {1: torch.ones(224, 224, 1)}} # Your segmentation masks
bboxes = {0: {1: [50, 50, 150, 150]}} # Your bounding boxes
# Run prediction
results = model.predict(
video_frames=video_frames,
masks=masks,
bboxes=bboxes,
categorical_keywords=['human', 'dog', 'frisbee'],
unary_keywords=['running', 'jumping'],
binary_keywords=['chasing', 'following'],
object_pairs=[(1, 2)],
return_top_k=3
)
Note: For most users, the pipeline approach above is recommended as it handles video loading and segmentation automatically.
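If your frames come from a real video file instead, a minimal loading sketch might look like the following. It assumes the (num_frames, H, W, 3) layout with RGB values in [0, 255] used by the dummy tensor above:

import cv2
import numpy as np
import torch

def load_video_frames(path, max_frames=100):
    """Read a video into a (num_frames, H, W, 3) float tensor with RGB values in [0, 255]."""
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes as BGR
    cap.release()
    return torch.from_numpy(np.stack(frames)).float()

video_frames = load_video_frames("/path/to/video.mp4")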
Configuration Options
The VineConfig class supports the following parameters (non-exhaustive):
- model_name: CLIP model backbone (default: "openai/clip-vit-large-patch14-336")
- pretrained_vine_path: Optional path or Hugging Face repo with pretrained VINE weights
- segmentation_method: "sam2" or "grounding_dino_sam2" (default: "grounding_dino_sam2")
- box_threshold / text_threshold: Grounding DINO thresholds
- target_fps: Target FPS for video processing (default: 1)
- alpha, white_alpha: Rendering parameters used when extracting masked crops
- topk_cate: Top-k categories to return per object (default: 3)
- max_video_length: Maximum number of frames to process (default: 100)
- visualize: When True, pipeline post-processing attempts to create stitched visualizations
- visualization_dir: Optional base directory where visualization assets are written
- debug_visualizations: When True, the model saves a single first-frame mask composite for quick inspection
- debug_visualization_path: Target filepath for the debug mask composite (must point to a writable file)
- return_flattened_segments, return_valid_pairs, interested_object_pairs: Advanced geometry outputs for downstream consumers
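As an illustration, here is a sketch combining several of these options (the values are arbitrary):

from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-large-patch14-336",  # CLIP backbone (default)
    segmentation_method="grounding_dino_sam2",
    box_threshold=0.35,
    text_threshold=0.25,
    target_fps=2,              # sample the video at 2 FPS
    topk_cate=5,               # return top-5 categories per object
    max_video_length=100,      # process at most 100 frames
    visualize=True,
    visualization_dir="output/visualizations",
    debug_visualizations=False,
)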
Output Format
The model returns a dictionary with the following structure:
{
"masks" : {},
"boxes" : {},
"categorical_predictions": {
object_id: [(probability, category), ...]
},
"unary_predictions": {
(frame_id, object_id): [(probability, action), ...]
},
"binary_predictions": {
(frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
},
"confidence_scores": {
"categorical": max_categorical_confidence,
"unary": max_unary_confidence,
"binary": max_binary_confidence
},
"summary": {
"num_objects_detected": int,
"top_categories": [(category, probability), ...],
"top_actions": [(action, probability), ...],
"top_relations": [(relation, probability), ...]
}
}
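For example, a short sketch that walks this structure, assuming results is the dictionary returned by one of the calls above:

# Top category per detected object
for object_id, ranked in results["categorical_predictions"].items():
    prob, category = ranked[0]  # entries are (probability, category) tuples
    print(f"object {object_id}: {category} ({prob:.2f})")

# Top action per (frame, object)
for (frame_id, object_id), ranked in results["unary_predictions"].items():
    prob, action = ranked[0]
    print(f"frame {frame_id}, object {object_id}: {action} ({prob:.2f})")

print("Top relations:", results["summary"]["top_relations"])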
Visualization & Debugging
There are two complementary visualization layers:
- Post-process visualizations (include_visualizations=True in the pipeline call): produces a high-level stitched video summarizing detections, actions, and relations over time.
- Debug visualizations (debug_visualizations=True in VineConfig): dumps videos of intermediate segmentation masks and outputs from Grounding DINO, SAM2, unary, binary, etc. for quick sanity checks.
If you plan to enable either option, ensure the relevant output directories exist before running the pipeline.
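A small sketch, assuming the directory names used in the examples above:

import os

# Create the directory referenced by visualization_dir before running the pipeline;
# do the same for the parent of debug_visualization_path if you set one.
os.makedirs("output/visualizations", exist_ok=True)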
Segmentation Methods
Grounding DINO + SAM2 (Recommended)
Uses Grounding DINO for object detection based on text prompts, then SAM2 for precise segmentation.
Requirements:
- Grounding DINO model and weights
- SAM2 model and weights
- Properly configured paths to model checkpoints
SAM2 Only
Uses SAM2's automatic mask generation without text-based object detection.
Requirements:
- SAM2 model and weights
Model Architecture
VINE is built on top of CLIP and uses three separate CLIP models for different tasks:
- Categorical Model: For object classification
- Unary Model: For single-object action recognition
- Binary Model: For relationship detection between object pairs
Each model processes both visual and textual features to compute similarity scores and probability distributions.
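Conceptually, each head follows the familiar CLIP pattern: embed the object crop and the keyword texts, score their similarity, and normalize the scores into a distribution over the keyword list. The snippet below is an illustrative sketch of that pattern using the plain Hugging Face CLIP API, not the actual VINE code:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

keywords = ["running", "jumping", "sitting"]   # e.g. unary keywords
crop = Image.new("RGB", (336, 336))            # stand-in for a masked object crop

inputs = processor(text=keywords, images=crop, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = clip(**inputs)

# logits_per_image holds scaled image-text similarities; softmax turns them
# into a probability distribution over the keyword list.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(keywords, probs[0].tolist())))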
Pushing to HuggingFace Hub
from vine_hf import VineConfig, VineModel
# Create and configure your model
config = VineConfig()
model = VineModel(config)
# Load your pretrained weights
# model.load_state_dict(torch.load('path/to/your/weights.pth'))
# Register for auto classes
config.register_for_auto_class()
model.register_for_auto_class("AutoModel")
# Push to Hub
config.push_to_hub('your-username/vine-model')
model.push_to_hub('your-username/vine-model')
Loading from HuggingFace Hub
from transformers import AutoModel, pipeline
# Load model
model = AutoModel.from_pretrained('your-username/vine-model', trust_remote_code=True)
# Or use with pipeline
vine_pipeline = pipeline(
'vine-video-understanding',
model='your-username/vine-model',
trust_remote_code=True
)
Examples
See example_usage.py for comprehensive examples including:
- Direct model usage
- Pipeline usage
- HuggingFace Hub integration
- Real video processing
Requirements
- Python 3.7+
- PyTorch 1.9+
- transformers 4.20+
- OpenCV
- PIL/Pillow
- NumPy
For segmentation:
- SAM2 (Facebook Research)
- Grounding DINO (IDEA Research)
Citation
If you use VINE in your research, please cite:
@article{vine2024,
title={VINE: Video Understanding with Natural Language},
author={Your Authors},
journal={Your Journal},
year={2024}
}
License
[Your License Here]
Contact
[Your Contact Information Here]