# VINE HuggingFace Interface

VINE (Video Understanding with Natural Language) is a model that processes videos along with categorical, unary, and binary keywords and returns probability distributions over those keywords for detected objects and their relationships.

This package provides a HuggingFace-compatible interface for the VINE model, making it easy to use for video understanding tasks.
## Features

- **Categorical Classification**: Classify objects in videos (e.g., "human", "dog", "frisbee")
- **Unary Predicates**: Detect actions on single objects (e.g., "running", "jumping", "sitting")
- **Binary Relations**: Detect relationships between object pairs (e.g., "behind", "in front of", "chasing")
- **Multiple Segmentation Methods**: Support for SAM2 and Grounding DINO + SAM2
- **HuggingFace Integration**: Full compatibility with HuggingFace transformers and pipelines
- **Visualization Hooks**: Optional high-level visualizations plus lightweight debug mask dumps for quick sanity checks
## Installation

```bash
# Install the core dependencies (the package itself is assumed to be on your Python path)
pip install transformers torch torchvision
pip install opencv-python pillow numpy

# For segmentation functionality, you'll also need:
# - SAM2: https://github.com/facebookresearch/sam2
# - Grounding DINO: https://github.com/IDEA-Research/GroundingDINO
```
## Segmentation Model Configuration

`VinePipeline` lazily initializes the segmentation stack the first time a call needs masks. Thresholds, FPS, visualization toggles, and device selection live in `VineConfig`; the pipeline constructor specifies where to load the SAM2 / GroundingDINO weights from, or lets you inject already-instantiated modules.
### Provide file paths at construction (most common)

```python
from vine_hf import VineConfig, VineModel, VinePipeline

vine_config = VineConfig(
    segmentation_method="grounding_dino_sam2",  # or "sam2"
    box_threshold=0.35,
    text_threshold=0.25,
    target_fps=5,
    visualization_dir="output/visualizations",  # where visualizations (and debug visualizations, if enabled) are written
    debug_visualizations=True,  # write debug videos of the GroundingDINO, SAM2, unary, and binary outputs
    pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
    device="cuda:0",  # accepts int, str, or torch.device
)

vine_model = VineModel(vine_config)
vine_pipeline = VinePipeline(
    model=vine_model,
    tokenizer=None,
    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
    gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
    device=vine_config._device,
)
```
When `segmentation_method="grounding_dino_sam2"`, both SAM2 and GroundingDINO must be reachable. The pipeline validates the paths; missing files raise a `ValueError`. If you pick `"sam2"`, only the SAM2 config and checkpoint are required, as in the sketch below.
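For example, a minimal SAM2-only setup (paths illustrative) needs just the two SAM2 arguments:

```python
from vine_hf import VineConfig, VineModel, VinePipeline

# SAM2-only segmentation: the GroundingDINO paths can be omitted entirely.
sam2_config = VineConfig(segmentation_method="sam2", device="cuda:0")
sam2_pipeline = VinePipeline(
    model=VineModel(sam2_config),
    tokenizer=None,
    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
    device=sam2_config._device,
)
```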
### Reuse pre-initialized segmentation modules

If you build the segmentation stack elsewhere, inject the components with `set_segmentation_models` before running the pipeline:

```python
from sam2.build_sam import build_sam2_video_predictor, build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator
from groundingdino.util.inference import Model as GroundingDINOModel

sam_predictor = build_sam2_video_predictor(..., device=vine_config._device)
mask_generator = SAM2AutomaticMaskGenerator(build_sam2(..., device=vine_config._device))
grounding_model = GroundingDINOModel(..., device=vine_config._device)

vine_pipeline.set_segmentation_models(
    sam_predictor=sam_predictor,
    mask_generator=mask_generator,
    grounding_model=grounding_model,
)
```
Any argument left as `None` is initialized lazily from the file paths when the pipeline first needs that backend.
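For instance, to inject only the GroundingDINO model (a minimal sketch reusing `grounding_model` from above) and let the SAM2 components load lazily from the configured paths:

```python
# Only the grounding model is injected; sam_predictor and mask_generator are
# left as None and will be built lazily from the file paths on first use.
vine_pipeline.set_segmentation_models(grounding_model=grounding_model)
```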
## Requirements
- torch
- torchvision
- transformers
- opencv-python
- matplotlib
- seaborn
- pandas
- numpy
- ipywidgets
- tqdm
- scikit-learn
- sam2 (from Facebook Research): https://github.com/video-fm/video-sam2
- sam2 weights (downloaded separately, e.g. https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt)
- groundingdino (from IDEA Research)
- groundingdino weights (downloaded separately, e.g. https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth)
- spacy-fastlang
- en-core-web-sm (for spacy-fastlang)
- ffmpeg (for video processing)
- (optional) LASER weights / full model checkpoint (downloaded separately, e.g. https://huggingface.co/video-fm/vine_v0)
Most of these dependencies are installed by creating the environment from `laser/environments/laser_env.yml` in the LASER repo; sam2 and groundingdino must be installed manually, following their respective instructions.

## Quick Start
### Using the Pipeline (Recommended)

```python
from transformers.pipelines import PIPELINE_REGISTRY
from vine_hf import VineConfig, VineModel, VinePipeline

PIPELINE_REGISTRY.register_pipeline(
    "vine-video-understanding",
    pipeline_class=VinePipeline,
    pt_model=VineModel,
    type="multimodal",
)

config = VineConfig(
    segmentation_method="grounding_dino_sam2",
    pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
    visualization_dir="output",
    visualize=True,
    device="cuda:0",
)

model = VineModel(config)
vine_pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
    gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
    device=config._device,
)

results = vine_pipeline(
    "/path/to/video.mp4",
    categorical_keywords=["dog", "human"],
    unary_keywords=["running"],
    binary_keywords=["chasing"],
    object_pairs=[(0, 1)],
    return_top_k=3,
    include_visualizations=True,
)
print(results["summary"])
```
### Using the Model Directly (Advanced)

For advanced users who want to provide their own segmentation:

```python
from vine_hf import VineConfig, VineModel
import torch

# Create configuration
config = VineConfig(
    pretrained_vine_path="/path/to/your/vine/weights"  # Optional: your fine-tuned weights
)

# Initialize model
model = VineModel(config)

# If you have your own video frames, masks, and bboxes from external segmentation
video_frames = torch.randn(3, 224, 224, 3) * 255  # dummy frames (num_frames, height, width, channels)
masks = {0: {1: torch.ones(224, 224, 1)}}  # your segmentation masks
bboxes = {0: {1: [50, 50, 150, 150]}}  # your bounding boxes

# Run prediction
results = model.predict(
    video_frames=video_frames,
    masks=masks,
    bboxes=bboxes,
    categorical_keywords=['human', 'dog', 'frisbee'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'following'],
    object_pairs=[(1, 2)],
    return_top_k=3,
)
```
**Note**: For most users, the pipeline approach above is recommended, as it handles video loading and segmentation automatically.
## Configuration Options

The `VineConfig` class supports the following parameters (non-exhaustive):

- `model_name`: CLIP model backbone (default: `"openai/clip-vit-large-patch14-336"`)
- `pretrained_vine_path`: Optional path or Hugging Face repo with pretrained VINE weights
- `segmentation_method`: `"sam2"` or `"grounding_dino_sam2"` (default: `"grounding_dino_sam2"`)
- `box_threshold` / `text_threshold`: Grounding DINO thresholds
- `target_fps`: Target FPS for video processing (default: `1`)
- `alpha`, `white_alpha`: Rendering parameters used when extracting masked crops
- `topk_cate`: Top-k categories to return per object (default: `3`)
- `max_video_length`: Maximum frames to process (default: `100`)
- `visualize`: When `True`, pipeline post-processing attempts to create stitched visualizations
- `visualization_dir`: Optional base directory where visualization assets are written
- `debug_visualizations`: When `True`, the model saves a single first-frame mask composite for quick inspection
- `debug_visualization_path`: Target filepath for the debug mask composite (must point to a writable file)
- `return_flattened_segments`, `return_valid_pairs`, `interested_object_pairs`: Advanced geometry outputs for downstream consumers
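As an illustration, here is a configuration exercising several of these options (paths and values are placeholders):

```python
from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-large-patch14-336",  # default CLIP backbone
    segmentation_method="sam2",                      # automatic masks, no text prompts
    target_fps=2,                                    # sample two frames per second
    topk_cate=5,                                     # return five categories per object
    max_video_length=50,                             # cap processing at 50 frames
    visualize=True,
    visualization_dir="output/visualizations",
    debug_visualizations=True,
    debug_visualization_path="output/debug/first_frame_masks.png",
)
```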
## Output Format

The model returns a dictionary with the following structure:

```python
{
    "masks": {},
    "boxes": {},
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "confidence_scores": {
        "categorical": max_categorical_confidence,
        "unary": max_unary_confidence,
        "binary": max_binary_confidence
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    }
}
```
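For example, to walk the ranked predictions (assuming `results` is the dictionary returned by a completed call):

```python
# Top-ranked category per detected object.
for object_id, ranked in results["categorical_predictions"].items():
    probability, category = ranked[0]
    print(f"object {object_id}: {category} ({probability:.2f})")

# Top-ranked action per (frame, object) pair.
for (frame_id, object_id), ranked in results["unary_predictions"].items():
    probability, action = ranked[0]
    print(f"frame {frame_id}, object {object_id}: {action} ({probability:.2f})")
```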
## Visualization & Debugging

There are two complementary visualization layers:

- **Post-process visualizations** (`include_visualizations=True` in the pipeline call) produce a high-level stitched video summarizing detections, actions, and relations over time.
- **Debug visualizations** (`debug_visualizations=True` in `VineConfig`) dump videos of intermediate outputs (GroundingDINO detections, SAM2 masks, unary and binary predictions) for quick sanity checks.

If you plan to enable either option, ensure the relevant output directories exist before running the pipeline, as sketched below.
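A minimal preparation sketch (directory names are just examples matching the configs above):

```python
import os

# Create the output directories up front so visualization writes don't fail mid-run.
os.makedirs("output/visualizations", exist_ok=True)
os.makedirs("output/debug", exist_ok=True)
```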
## Segmentation Methods

### Grounding DINO + SAM2 (Recommended)

Uses Grounding DINO for object detection based on text prompts, then SAM2 for precise segmentation.

Requirements:
- Grounding DINO model and weights
- SAM2 model and weights
- Properly configured paths to model checkpoints

### SAM2 Only

Uses SAM2's automatic mask generation without text-based object detection.

Requirements:
- SAM2 model and weights
## Model Architecture

VINE is built on top of CLIP and uses three separate CLIP models for different tasks:

- **Categorical Model**: For object classification
- **Unary Model**: For single-object action recognition
- **Binary Model**: For relationship detection between object pairs

Each model processes both visual and textual features to compute similarity scores and probability distributions.
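As a conceptual sketch only, not the actual VINE implementation: each head can be viewed as scoring CLIP text embeddings of the keywords against a visual embedding and normalizing with a softmax. The function below is purely illustrative:

```python
import torch

def keyword_distribution(visual_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Illustrative only: visual_emb (D,), text_embs (K, D) -> probabilities (K,)."""
    visual_emb = visual_emb / visual_emb.norm()                   # L2-normalize the visual feature
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)  # L2-normalize each keyword embedding
    similarities = text_embs @ visual_emb                         # cosine similarity per keyword, shape (K,)
    return torch.softmax(similarities, dim=-1)                    # probability distribution over keywords
```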
## Pushing to HuggingFace Hub

```python
from vine_hf import VineConfig, VineModel

# Create and configure your model
config = VineConfig()
model = VineModel(config)

# Load your pretrained weights
# model.load_state_dict(torch.load('path/to/your/weights.pth'))

# Register for auto classes
config.register_for_auto_class()
model.register_for_auto_class("AutoModel")

# Push to Hub
config.push_to_hub('your-username/vine-model')
model.push_to_hub('your-username/vine-model')
```
## Loading from HuggingFace Hub

```python
from transformers import AutoModel, pipeline

# Load the model directly
model = AutoModel.from_pretrained('your-username/vine-model', trust_remote_code=True)

# Or use it with a pipeline
vine_pipeline = pipeline(
    'vine-video-understanding',
    model='your-username/vine-model',
    trust_remote_code=True,
)
```
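The Hub-loaded pipeline takes the same call signature as the locally constructed one in Quick Start, for example:

```python
results = vine_pipeline(
    "/path/to/video.mp4",
    categorical_keywords=["dog", "human"],
    unary_keywords=["running"],
    binary_keywords=["chasing"],
    object_pairs=[(0, 1)],
)
print(results["summary"])
```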
## Examples

See `example_usage.py` for comprehensive examples including:

- Direct model usage
- Pipeline usage
- HuggingFace Hub integration
- Real video processing
## Version Requirements

- Python 3.7+
- PyTorch 1.9+
- transformers 4.20+
- OpenCV
- PIL/Pillow
- NumPy

For segmentation:

- SAM2 (Facebook Research)
- Grounding DINO (IDEA Research)
## Citation

If you use VINE in your research, please cite:

```bibtex
@article{vine2024,
  title={VINE: Video Understanding with Natural Language},
  author={Your Authors},
  journal={Your Journal},
  year={2024}
}
```
## License

[Your License Here]

## Contact

[Your Contact Information Here]