
VINE HuggingFace Interface

VINE (Video Understanding with Natural Language) is a model that processes videos along with categorical, unary, and binary keywords to return probability distributions over those keywords for detected objects and their relationships.

This package provides a HuggingFace-compatible interface for the VINE model, making it easy to use for video understanding tasks.

Features

  • Categorical Classification: Classify objects in videos (e.g., "human", "dog", "frisbee")
  • Unary Predicates: Detect actions on single objects (e.g., "running", "jumping", "sitting")
  • Binary Relations: Detect relationships between object pairs (e.g., "behind", "in front of", "chasing")
  • Multiple Segmentation Methods: Support for SAM2 and Grounding DINO + SAM2
  • HuggingFace Integration: Full compatibility with HuggingFace transformers and pipelines
  • Visualization Hooks: Optional high-level visualizations plus lightweight debug mask dumps for quick sanity checks

Installation

# Install the package (assuming it's in your Python path)
pip install transformers torch torchvision
pip install opencv-python pillow numpy

# For segmentation functionality, you'll also need:
# - SAM2: https://github.com/facebookresearch/sam2
# - Grounding DINO: https://github.com/IDEA-Research/GroundingDINO

Segmentation Model Configuration

VinePipeline lazily brings up the segmentation stack the first time a call needs masks. Thresholds, FPS, visualization toggles, and device selection live in VineConfig; the pipeline constructor takes the paths to the SAM2 / GroundingDINO configs and weights, or you can inject already-instantiated modules.

Provide file paths at construction (most common)

from vine_hf import VineConfig, VineModel, VinePipeline

vine_config = VineConfig(
    segmentation_method="grounding_dino_sam2",  # or "sam2"
    box_threshold=0.35,
    text_threshold=0.25,
    target_fps=5,
    visualization_dir="output/visualizations", # where to write visualizations (and debug visualizations if enabled)
    debug_visualizations=True,  # write videos of intermediate GroundingDINO/SAM2/unary/binary outputs
    pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
    device="cuda:0",  # accepts int, str, or torch.device
)

vine_model = VineModel(vine_config)

vine_pipeline = VinePipeline(
    model=vine_model,
    tokenizer=None,
    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
    gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
    device=vine_config._device,
)

When segmentation_method="grounding_dino_sam2", both SAM2 and GroundingDINO must be reachable. The pipeline validates the paths; missing files raise a ValueError. If you pick "sam2", only the SAM2 config and checkpoint are required.
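
For a SAM2-only setup, the GroundingDINO arguments are simply omitted. A minimal sketch (paths are placeholders):

from vine_hf import VineConfig, VineModel, VinePipeline

# SAM2-only: automatic mask generation, no text-prompted detection
sam2_config = VineConfig(
    segmentation_method="sam2",
    pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
    device="cuda:0",
)

sam2_pipeline = VinePipeline(
    model=VineModel(sam2_config),
    tokenizer=None,
    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
    device=sam2_config._device,
)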

Reuse pre-initialized segmentation modules

If you build the segmentation stack elsewhere, inject the components with set_segmentation_models before running the pipeline:

from sam2.build_sam import build_sam2_video_predictor, build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator
from groundingdino.util.inference import Model as GroundingDINOModel

sam_predictor = build_sam2_video_predictor(..., device=vine_config._device)
mask_generator = SAM2AutomaticMaskGenerator(build_sam2(..., device=vine_config._device))
grounding_model = GroundingDINOModel(..., device=vine_config._device)

vine_pipeline.set_segmentation_models(
    sam_predictor=sam_predictor,
    mask_generator=mask_generator,
    grounding_model=grounding_model,
)

Any argument left as None is initialized lazily from the file paths when the pipeline first needs that backend.
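
For example, you can inject only the GroundingDINO model built above and let the SAM2 components be created from the configured paths (a sketch reusing the objects from the previous snippets):

# Only the grounding model is injected; sam_predictor and mask_generator stay None
# and are built from sam_config_path / sam_checkpoint_path on first use.
vine_pipeline.set_segmentation_models(grounding_model=grounding_model)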

Quick Start

Requirements

  • torch
  • torchvision
  • transformers
  • opencv-python
  • matplotlib
  • seaborn
  • pandas
  • numpy
  • ipywidgets
  • tqdm
  • scikit-learn
  • sam2 (from Facebook Research): https://github.com/video-fm/video-sam2
  • sam2 weights (downloaded separately, e.g. https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt)
  • groundingdino (from IDEA Research)
  • groundingdino weights (downloaded separately, e.g. https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth)
  • spacy-fastlang
  • en-core-web-sm (for spacy-fastlang)
  • ffmpeg (for video processing)
  • (optional) LASER weights / full model checkpoint (downloaded separately, e.g. https://huggingface.co/video-fm/vine_v0)

Most dependencies are installed by creating the environment from laser/environments/laser_env.yml in the LASER repo; sam2 and groundingdino must be installed manually, following their own instructions (see the sketch below).
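
A typical setup might look like the following sketch (assumes a conda-based workflow and a local LASER checkout; adjust paths to your environment):

# Create the LASER environment (installs most Python dependencies)
conda env create -f laser/environments/laser_env.yml
# Activate it (the environment name is defined inside the yml file)
conda activate <env-name>

# sam2 and groundingdino are installed manually, following each repo's instructions
# (typically: clone the repo and run `pip install -e .` inside it)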

Using the Pipeline (Recommended)

from transformers.pipelines import PIPELINE_REGISTRY
from vine_hf import VineConfig, VineModel, VinePipeline

PIPELINE_REGISTRY.register_pipeline(
    "vine-video-understanding",
    pipeline_class=VinePipeline,
    pt_model=VineModel,
    type="multimodal",
)

config = VineConfig(
    segmentation_method="grounding_dino_sam2",
    pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
    visualization_dir="output",
    visualize=True,
    device="cuda:0",
)

model = VineModel(config)

vine_pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
    gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
    device=config._device,
)

results = vine_pipeline(
    "/path/to/video.mp4",
    categorical_keywords=["dog", "human"],
    unary_keywords=["running"],
    binary_keywords=["chasing"],
    object_pairs=[(0, 1)],
    return_top_k=3,
    include_visualizations=True,
)
print(results["summary"])

Using the Model Directly (Advanced)

For advanced users who want to provide their own segmentation:

from vine_hf import VineConfig, VineModel
import torch

# Create configuration
config = VineConfig(
    pretrained_vine_path="/path/to/your/vine/weights"  # Optional: your fine-tuned weights
)

# Initialize model
model = VineModel(config)

# If you have your own video frames, masks, and bboxes from external segmentation
video_frames = torch.rand(3, 224, 224, 3) * 255  # Your video frames: (num_frames, H, W, 3), values in [0, 255]
masks = {0: {1: torch.ones(224, 224, 1)}}  # Your segmentation masks
bboxes = {0: {1: [50, 50, 150, 150]}}  # Your bounding boxes

# Run prediction
results = model.predict(
    video_frames=video_frames,
    masks=masks,
    bboxes=bboxes,
    categorical_keywords=['human', 'dog', 'frisbee'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'following'],
    object_pairs=[(1, 2)],
    return_top_k=3
)

Note: For most users, the pipeline approach above is recommended as it handles video loading and segmentation automatically.
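
If your segmentation comes from another tool, you can assemble the expected nested dictionaries yourself. Below is a minimal sketch with a hypothetical helper (to_vine_inputs); the nested key layout and the (H, W, 1) mask shape mirror the dummy example above, so adapt it to your own data:

import numpy as np
import torch

def to_vine_inputs(np_masks, box_lists):
    """Hypothetical helper: convert nested dicts of numpy masks (H, W) and
    [x1, y1, x2, y2] boxes into the torch-based structures shown in the
    dummy example above (same nested key layout)."""
    masks = {
        outer: {inner: torch.from_numpy(m.astype(np.float32)).unsqueeze(-1)  # -> (H, W, 1)
                for inner, m in d.items()}
        for outer, d in np_masks.items()
    }
    bboxes = {outer: {inner: list(b) for inner, b in d.items()}
              for outer, d in box_lists.items()}
    return masks, bboxes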

Configuration Options

The VineConfig class supports the following parameters (non-exhaustive); a combined example follows the list:

  • model_name: CLIP model backbone (default: "openai/clip-vit-large-patch14-336")
  • pretrained_vine_path: Optional path or Hugging Face repo with pretrained VINE weights
  • segmentation_method: "sam2" or "grounding_dino_sam2" (default: "grounding_dino_sam2")
  • box_threshold / text_threshold: Grounding DINO thresholds
  • target_fps: Target FPS for video processing (default: 1)
  • alpha, white_alpha: Rendering parameters used when extracting masked crops
  • topk_cate: Top-k categories to return per object (default: 3)
  • max_video_length: Maximum frames to process (default: 100)
  • visualize: When True, pipeline post-processing attempts to create stitched visualizations
  • visualization_dir: Optional base directory where visualization assets are written
  • debug_visualizations: When True, the model saves a single first-frame mask composite for quick inspection
  • debug_visualization_path: Target filepath for the debug mask composite (must point to a writable file)
  • return_flattened_segments, return_valid_pairs, interested_object_pairs: Advanced geometry outputs for downstream consumers
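
A combined sketch that sets several of these options explicitly (values are illustrative defaults or placeholders, not recommendations):

from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-large-patch14-336",
    segmentation_method="grounding_dino_sam2",
    box_threshold=0.35,
    text_threshold=0.25,
    target_fps=2,
    topk_cate=3,
    max_video_length=100,
    visualize=True,
    visualization_dir="output/visualizations",
    debug_visualizations=False,
    device="cuda:0",
)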

Output Format

The model returns a dictionary with the following structure:

{
    "masks": {},
    "boxes": {},
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "confidence_scores": {
        "categorical": max_categorical_confidence,
        "unary": max_unary_confidence,
        "binary": max_binary_confidence
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    }
}
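
A short sketch of consuming this structure (results here is the return value of the Quick Start pipeline call):

# Top categorical guess per detected object
for obj_id, ranked in results["categorical_predictions"].items():
    prob, category = ranked[0]
    print(f"object {obj_id}: {category} ({prob:.2f})")

# Highest-probability relation per (frame, object-pair) key
for (frame_id, pair), ranked in results["binary_predictions"].items():
    prob, relation = ranked[0]
    print(f"frame {frame_id}, objects {pair}: {relation} ({prob:.2f})")

# Overall summary
print(results["summary"]["top_categories"])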

Visualization & Debugging

There are two complementary visualization layers:

  • Post-process visualizations (include_visualizations=True in the pipeline call) produce a high-level stitched video summarizing detections, actions, and relations over time.

  • Debug visualizations (debug_visualizations=True in VineConfig) dump videos of intermediate outputs (GroundingDINO detections, SAM2 masks, unary and binary predictions) for quick sanity checks.

If you plan to enable either option, ensure the relevant output directories exist before running the pipeline.
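
For example (a minimal sketch matching the directories used earlier in this README):

from pathlib import Path

# Create the visualization output directory (and, if set, the parent directory of
# debug_visualization_path) before invoking the pipeline
Path("output/visualizations").mkdir(parents=True, exist_ok=True)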

Segmentation Methods

Grounding DINO + SAM2 (Recommended)

Uses Grounding DINO for object detection based on text prompts, then SAM2 for precise segmentation.

Requirements:

  • Grounding DINO model and weights
  • SAM2 model and weights
  • Properly configured paths to model checkpoints

SAM2 Only

Uses SAM2's automatic mask generation without text-based object detection.

Requirements:

  • SAM2 model and weights

Model Architecture

VINE is built on top of CLIP and uses three separate CLIP models for different tasks:

  • Categorical Model: For object classification
  • Unary Model: For single-object action recognition
  • Binary Model: For relationship detection between object pairs

Each model processes both visual and textual features to compute similarity scores and probability distributions.
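
As a rough illustration of this CLIP-style scoring (not VINE's actual implementation), cosine similarities between an object crop and the keyword texts can be turned into a probability distribution with an off-the-shelf CLIP model:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative only: score one masked object crop against categorical keywords
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

crop = Image.open("object_crop.png")  # hypothetical masked crop of one detected object
keywords = ["human", "dog", "frisbee"]

inputs = processor(text=keywords, images=crop, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)  # distribution over keywords
print(dict(zip(keywords, probs[0].tolist())))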

Pushing to HuggingFace Hub

from vine_hf import VineConfig, VineModel

# Create and configure your model
config = VineConfig()
model = VineModel(config)

# Load your pretrained weights
# model.load_state_dict(torch.load('path/to/your/weights.pth'))

# Register for auto classes
config.register_for_auto_class()
model.register_for_auto_class("AutoModel")

# Push to Hub
config.push_to_hub('your-username/vine-model')
model.push_to_hub('your-username/vine-model')

Loading from HuggingFace Hub

from transformers import AutoModel, pipeline

# Load model
model = AutoModel.from_pretrained('your-username/vine-model', trust_remote_code=True)

# Or use with pipeline
vine_pipeline = pipeline(
    'vine-video-understanding', 
    model='your-username/vine-model', 
    trust_remote_code=True
)

Examples

See example_usage.py for comprehensive examples including:

  • Direct model usage
  • Pipeline usage
  • HuggingFace Hub integration
  • Real video processing

Requirements

  • Python 3.7+
  • PyTorch 1.9+
  • transformers 4.20+
  • OpenCV
  • PIL/Pillow
  • NumPy

For segmentation:

  • SAM2 (Facebook Research)
  • Grounding DINO (IDEA Research)

Citation

If you use VINE in your research, please cite:

@article{vine2024,
  title={VINE: Video Understanding with Natural Language},
  author={Your Authors},
  journal={Your Journal},
  year={2024}
}

License

[Your License Here]

Contact

[Your Contact Information Here]