# VINE HuggingFace Interface

VINE (Video Understanding with Natural Language) is a model that processes videos along with categorical, unary, and binary keywords and returns probability distributions over those keywords for detected objects and their relationships. This package provides a HuggingFace-compatible interface for the VINE model, making it easy to use for video understanding tasks.

## Features

- **Categorical Classification**: Classify objects in videos (e.g., "human", "dog", "frisbee")
- **Unary Predicates**: Detect actions on single objects (e.g., "running", "jumping", "sitting")
- **Binary Relations**: Detect relationships between object pairs (e.g., "behind", "in front of", "chasing")
- **Multiple Segmentation Methods**: Support for SAM2 and Grounding DINO + SAM2
- **HuggingFace Integration**: Full compatibility with HuggingFace transformers and pipelines
- **Visualization Hooks**: Optional high-level visualizations plus lightweight debug mask dumps for quick sanity checks

## Installation

```bash
# Install the package (assuming it's in your Python path)
pip install transformers torch torchvision
pip install opencv-python pillow numpy

# For segmentation functionality, you'll also need:
# - SAM2: https://github.com/facebookresearch/sam2
# - Grounding DINO: https://github.com/IDEA-Research/GroundingDINO
```

## Segmentation Model Configuration

`VinePipeline` lazily brings up the segmentation stack the first time a call needs masks. Thresholds, FPS, visualization toggles, and device selection live in `VineConfig`; the pipeline constructor supplies the SAM2 / GroundingDINO config and checkpoint paths, or you can inject already-instantiated modules.

### Provide file paths at construction (most common)

```python
from vine_hf import VineConfig, VineModel, VinePipeline

vine_config = VineConfig(
    segmentation_method="grounding_dino_sam2",  # or "sam2"
    box_threshold=0.35,
    text_threshold=0.25,
    target_fps=5,
    visualization_dir="output/visualizations",  # where to write visualizations (and debug visualizations, if enabled)
    debug_visualizations=True,                  # write videos of the GroundingDINO / SAM2 / unary / binary intermediate outputs
    pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
    device="cuda:0",                            # accepts int, str, or torch.device
)

vine_model = VineModel(vine_config)

vine_pipeline = VinePipeline(
    model=vine_model,
    tokenizer=None,
    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
    gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
    device=vine_config._device,
)
```

When `segmentation_method="grounding_dino_sam2"`, both the SAM2 and GroundingDINO files must be reachable; the pipeline validates the paths, and missing files raise a `ValueError`. If you pick `"sam2"`, only the SAM2 config and checkpoint are required.
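For the SAM2-only method, a minimal sketch reusing the constructor arguments shown above (the file paths are placeholders you must point at your own checkpoints):

```python
from vine_hf import VineConfig, VineModel, VinePipeline

# SAM2 automatic mask generation, no text-prompted detection
sam2_only_config = VineConfig(
    segmentation_method="sam2",
    target_fps=5,
    pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
    device="cuda:0",
)

sam2_only_pipeline = VinePipeline(
    model=VineModel(sam2_only_config),
    tokenizer=None,
    # Only the SAM2 config and checkpoint are needed for this method.
    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
    device=sam2_only_config._device,
)
```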
### Reuse pre-initialized segmentation modules

If you build the segmentation stack elsewhere, inject the components with `set_segmentation_models` before running the pipeline:

```python
from sam2.build_sam import build_sam2_video_predictor, build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator
from groundingdino.util.inference import Model as GroundingDINOModel

sam_predictor = build_sam2_video_predictor(..., device=vine_config._device)
mask_generator = SAM2AutomaticMaskGenerator(build_sam2(..., device=vine_config._device))
grounding_model = GroundingDINOModel(..., device=vine_config._device)

vine_pipeline.set_segmentation_models(
    sam_predictor=sam_predictor,
    mask_generator=mask_generator,
    grounding_model=grounding_model,
)
```

Any argument left as `None` is initialized lazily from the file paths when the pipeline first needs that backend.

## Requirements

- torch
- torchvision
- transformers
- opencv-python
- matplotlib
- seaborn
- pandas
- numpy
- ipywidgets
- tqdm
- scikit-learn
- sam2 (from Facebook Research): https://github.com/video-fm/video-sam2
- SAM2 weights (downloaded separately, e.g. https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt)
- groundingdino (from IDEA Research)
- GroundingDINO weights (downloaded separately, e.g. https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth)
- spacy-fastlang
- en-core-web-sm (the spaCy model required by spacy-fastlang)
- ffmpeg (for video processing)
- (optional) LASER weights / full model checkpoint (downloaded separately, e.g. https://huggingface.co/video-fm/vine_v0)

Creating an environment from `laser/environments/laser_env.yml` in the LASER repo installs most of these dependencies. You will still need to install sam2 and groundingdino manually, following their respective instructions.
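If you want to confirm the environment before a long run, here is a minimal stdlib-only sketch; the module names are my best mapping of the pip packages listed above and may need adjusting for your installation:

```python
import importlib.util
import shutil

# Python packages VINE expects at runtime (see the Requirements list above).
required_modules = [
    "torch", "torchvision", "transformers", "cv2", "numpy",
    "sam2", "groundingdino", "spacy", "spacy_fastlang",
]

missing = [name for name in required_modules if importlib.util.find_spec(name) is None]
if missing:
    print(f"Missing Python packages: {missing}")

# ffmpeg is a system binary, not a Python package.
if shutil.which("ffmpeg") is None:
    print("ffmpeg not found on PATH; video processing will fail.")
```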
## Quick Start

### Using the Pipeline (Recommended)

```python
from transformers.pipelines import PIPELINE_REGISTRY
from vine_hf import VineConfig, VineModel, VinePipeline

PIPELINE_REGISTRY.register_pipeline(
    "vine-video-understanding",
    pipeline_class=VinePipeline,
    pt_model=VineModel,
    type="multimodal",
)

config = VineConfig(
    segmentation_method="grounding_dino_sam2",
    pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
    visualization_dir="output",
    visualize=True,
    device="cuda:0",
)
model = VineModel(config)

vine_pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
    gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
    device=config._device,
)

results = vine_pipeline(
    "/path/to/video.mp4",
    categorical_keywords=["dog", "human"],
    unary_keywords=["running"],
    binary_keywords=["chasing"],
    object_pairs=[(0, 1)],
    return_top_k=3,
    include_visualizations=True,
)
print(results["summary"])
```

### Using the Model Directly (Advanced)

For advanced users who want to provide their own segmentation:

```python
import torch

from vine_hf import VineConfig, VineModel

# Create configuration
config = VineConfig(
    pretrained_vine_path="/path/to/your/vine/weights"  # Optional: your fine-tuned weights
)

# Initialize model
model = VineModel(config)

# If you have your own video frames, masks, and bboxes from external segmentation
video_frames = torch.rand(3, 224, 224, 3) * 255    # your video frames as a (num_frames, H, W, C) tensor in the 0-255 range
masks = {0: {1: torch.ones(224, 224, 1)}}          # your segmentation masks, keyed by frame and object id
bboxes = {0: {1: [50, 50, 150, 150]}}              # your bounding boxes, keyed by frame and object id

# Run prediction
results = model.predict(
    video_frames=video_frames,
    masks=masks,
    bboxes=bboxes,
    categorical_keywords=["human", "dog", "frisbee"],
    unary_keywords=["running", "jumping"],
    binary_keywords=["chasing", "following"],
    object_pairs=[(1, 2)],
    return_top_k=3,
)
```

**Note**: For most users, the pipeline approach above is recommended, as it handles video loading and segmentation automatically.
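The returned dictionaries can be inspected directly. A sketch, assuming the output format documented in the next section and that each ranked list is ordered best-first (as implied by `return_top_k`):

```python
# Assumes `results` comes from model.predict(...) or vine_pipeline(...) above.

# Top category per detected object
for object_id, ranked in results["categorical_predictions"].items():
    top_prob, top_category = ranked[0]
    print(f"object {object_id}: {top_category} ({top_prob:.2f})")

# Per-frame unary predictions
for (frame_id, object_id), ranked in results["unary_predictions"].items():
    top_prob, top_action = ranked[0]
    print(f"frame {frame_id}, object {object_id}: {top_action} ({top_prob:.2f})")

# Per-frame binary predictions over object pairs
for (frame_id, pair), ranked in results["binary_predictions"].items():
    top_prob, top_relation = ranked[0]
    print(f"frame {frame_id}, pair {pair}: {top_relation} ({top_prob:.2f})")
```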
## Configuration Options

The `VineConfig` class supports the following parameters (non-exhaustive):

- `model_name`: CLIP model backbone (default: `"openai/clip-vit-large-patch14-336"`)
- `pretrained_vine_path`: Optional path or Hugging Face repo with pretrained VINE weights
- `segmentation_method`: `"sam2"` or `"grounding_dino_sam2"` (default: `"grounding_dino_sam2"`)
- `box_threshold` / `text_threshold`: Grounding DINO thresholds
- `target_fps`: Target FPS for video processing (default: `1`)
- `alpha`, `white_alpha`: Rendering parameters used when extracting masked crops
- `topk_cate`: Top-k categories to return per object (default: `3`)
- `max_video_length`: Maximum number of frames to process (default: `100`)
- `visualize`: When `True`, pipeline post-processing attempts to create stitched visualizations
- `visualization_dir`: Optional base directory where visualization assets are written
- `debug_visualizations`: When `True`, the model saves a single first-frame mask composite for quick inspection
- `debug_visualization_path`: Target filepath for the debug mask composite (must point to a writable file)
- `return_flattened_segments`, `return_valid_pairs`, `interested_object_pairs`: Advanced geometry outputs for downstream consumers

## Output Format

The model returns a dictionary with the following structure:

```python
{
    "masks": {},
    "boxes": {},
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "confidence_scores": {
        "categorical": max_categorical_confidence,
        "unary": max_unary_confidence,
        "binary": max_binary_confidence
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    }
}
```

## Visualization & Debugging

There are two complementary visualization layers:

- **Post-process visualizations** (`include_visualizations=True` in the pipeline call) produce a high-level stitched video summarizing detections, actions, and relations over time.
- **Debug visualizations** (`debug_visualizations=True` in `VineConfig`) dump videos of intermediate outputs (GroundingDINO detections, SAM2 masks, unary and binary predictions, etc.) for quick sanity checks.

If you plan to enable either option, ensure the relevant output directories exist before running the pipeline.

## Segmentation Methods

### Grounding DINO + SAM2 (Recommended)

Uses Grounding DINO for object detection based on text prompts, then SAM2 for precise segmentation.

Requirements:
- Grounding DINO model and weights
- SAM2 model and weights
- Properly configured paths to the model checkpoints

### SAM2 Only

Uses SAM2's automatic mask generation without text-based object detection.

Requirements:
- SAM2 model and weights

## Model Architecture

VINE is built on top of CLIP and uses three separate CLIP models for different tasks:

- **Categorical Model**: For object classification
- **Unary Model**: For single-object action recognition
- **Binary Model**: For relationship detection between object pairs

Each model processes both visual and textual features to compute similarity scores and probability distributions.
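To make the similarity-then-softmax idea concrete, here is an illustrative sketch using the default CLIP backbone named above. This is not the VINE implementation (its heads and masked-crop preprocessing differ); the crop and keyword list are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# How a CLIP backbone turns keyword prompts into a probability
# distribution for a single object crop.
backbone = "openai/clip-vit-large-patch14-336"  # default `model_name` in VineConfig
model = CLIPModel.from_pretrained(backbone)
processor = CLIPProcessor.from_pretrained(backbone)

object_crop = Image.new("RGB", (336, 336))  # placeholder for a masked object crop
keywords = ["human", "dog", "frisbee"]      # categorical keywords

inputs = processor(text=keywords, images=object_crop, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for keyword, prob in zip(keywords, probs.tolist()):
    print(f"{keyword}: {prob:.3f}")
```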
## Pushing to HuggingFace Hub

```python
from vine_hf import VineConfig, VineModel

# Create and configure your model
config = VineConfig()
model = VineModel(config)

# Load your pretrained weights
# model.load_state_dict(torch.load('path/to/your/weights.pth'))

# Register for auto classes
config.register_for_auto_class()
model.register_for_auto_class("AutoModel")

# Push to Hub
config.push_to_hub('your-username/vine-model')
model.push_to_hub('your-username/vine-model')
```

## Loading from HuggingFace Hub

```python
from transformers import AutoModel, pipeline

# Load model
model = AutoModel.from_pretrained('your-username/vine-model', trust_remote_code=True)

# Or use with pipeline
vine_pipeline = pipeline(
    'vine-video-understanding',
    model='your-username/vine-model',
    trust_remote_code=True
)
```

## Examples

See `example_usage.py` for comprehensive examples, including:

- Direct model usage
- Pipeline usage
- HuggingFace Hub integration
- Real video processing

## Requirements

- Python 3.7+
- PyTorch 1.9+
- transformers 4.20+
- OpenCV
- PIL/Pillow
- NumPy

For segmentation:

- SAM2 (Facebook Research)
- Grounding DINO (IDEA Research)

## Citation

If you use VINE in your research, please cite:

```bibtex
@article{vine2024,
  title={VINE: Video Understanding with Natural Language},
  author={Your Authors},
  journal={Your Journal},
  year={2024}
}
```

## License

[Your License Here]

## Contact

[Your Contact Information Here]