# VINE HuggingFace Interface

VINE (Video Understanding with Natural Language) is a model that processes videos along with categorical, unary, and binary keywords and returns probability distributions over those keywords for detected objects and their relationships.

This package provides a HuggingFace-compatible interface for the VINE model, making it easy to use for video understanding tasks.
## Features

- **Categorical Classification**: Classify objects in videos (e.g., "human", "dog", "frisbee")
- **Unary Predicates**: Detect actions on single objects (e.g., "running", "jumping", "sitting")
- **Binary Relations**: Detect relationships between object pairs (e.g., "behind", "in front of", "chasing")
- **Multiple Segmentation Methods**: Support for SAM2 and Grounding DINO + SAM2
- **HuggingFace Integration**: Full compatibility with HuggingFace transformers and pipelines
- **Visualization Hooks**: Optional high-level visualizations plus lightweight debug mask dumps for quick sanity checks
## Installation

```bash
# Install the core dependencies (the package itself is assumed to be on your Python path)
pip install transformers torch torchvision
pip install opencv-python pillow numpy

# For segmentation functionality, you'll also need:
# - SAM2: https://github.com/facebookresearch/sam2
# - Grounding DINO: https://github.com/IDEA-Research/GroundingDINO
```
## Segmentation Model Configuration

`VinePipeline` lazily initializes the segmentation stack the first time a call needs masks. Thresholds, FPS, visualization toggles, and device selection live in `VineConfig`; the pipeline constructor specifies where to load the SAM2 / GroundingDINO weights from, or lets you inject already-instantiated modules.
### Provide file paths at construction (most common)

```python
from vine_hf import VineConfig, VineModel, VinePipeline

vine_config = VineConfig(
    segmentation_method="grounding_dino_sam2",  # or "sam2"
    box_threshold=0.35,
    text_threshold=0.25,
    target_fps=5,
    visualization_dir="output/visualizations",  # where visualizations (and debug visualizations, if enabled) are written
    debug_visualizations=True,  # write debug videos of the GroundingDINO, SAM2, unary, and binary outputs
    pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
    device="cuda:0",  # accepts int, str, or torch.device
)

vine_model = VineModel(vine_config)
vine_pipeline = VinePipeline(
    model=vine_model,
    tokenizer=None,
    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
    gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
    device=vine_config._device,
)
```
When `segmentation_method="grounding_dino_sam2"`, both SAM2 and GroundingDINO must be reachable. The pipeline validates the paths; missing files raise a `ValueError`. If you pick `"sam2"`, only the SAM2 config and checkpoint are required, as in the sketch below.
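For example, a minimal SAM2-only setup (paths illustrative) needs just the two SAM2 arguments:

```python
from vine_hf import VineConfig, VineModel, VinePipeline

# SAM2-only segmentation: the GroundingDINO paths can be omitted entirely.
sam2_config = VineConfig(segmentation_method="sam2", device="cuda:0")
sam2_pipeline = VinePipeline(
    model=VineModel(sam2_config),
    tokenizer=None,
    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
    device=sam2_config._device,
)
```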
### Reuse pre-initialized segmentation modules

If you build the segmentation stack elsewhere, inject the components with `set_segmentation_models` before running the pipeline:

```python
from sam2.build_sam import build_sam2_video_predictor, build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator
from groundingdino.util.inference import Model as GroundingDINOModel

sam_predictor = build_sam2_video_predictor(..., device=vine_config._device)
mask_generator = SAM2AutomaticMaskGenerator(build_sam2(..., device=vine_config._device))
grounding_model = GroundingDINOModel(..., device=vine_config._device)

vine_pipeline.set_segmentation_models(
    sam_predictor=sam_predictor,
    mask_generator=mask_generator,
    grounding_model=grounding_model,
)
```
Any argument left as `None` is initialized lazily from the file paths when the pipeline first needs that backend.
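For instance, to inject only the GroundingDINO model (a minimal sketch reusing `grounding_model` from above) and let the SAM2 components load lazily from the configured paths:

```python
# Only the grounding model is injected; sam_predictor and mask_generator are
# left as None and will be built lazily from the file paths on first use.
vine_pipeline.set_segmentation_models(grounding_model=grounding_model)
```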
## Requirements
- torch
- torchvision
- transformers
- opencv-python
- matplotlib
- seaborn
- pandas
- numpy
- ipywidgets
- tqdm
- scikit-learn
- sam2 (from Facebook Research): https://github.com/video-fm/video-sam2
- sam2 weights (downloaded separately, e.g. https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt)
- groundingdino (from IDEA Research)
- groundingdino weights (downloaded separately, e.g. https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth)
- spacy-fastlang
- en-core-web-sm (for spacy-fastlang)
- ffmpeg (for video processing)
- (optional) LASER weights / full model checkpoint (downloaded separately, e.g. https://huggingface.co/video-fm/vine_v0)
Most of these dependencies are installed by creating the environment from `laser/environments/laser_env.yml` in the LASER repo; sam2 and groundingdino must be installed manually, following their respective instructions.

## Quick Start
### Using the Pipeline (Recommended)

```python
from transformers.pipelines import PIPELINE_REGISTRY
from vine_hf import VineConfig, VineModel, VinePipeline

PIPELINE_REGISTRY.register_pipeline(
    "vine-video-understanding",
    pipeline_class=VinePipeline,
    pt_model=VineModel,
    type="multimodal",
)

config = VineConfig(
    segmentation_method="grounding_dino_sam2",
    pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
    visualization_dir="output",
    visualize=True,
    device="cuda:0",
)

model = VineModel(config)
vine_pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
    gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
    device=config._device,
)

results = vine_pipeline(
    "/path/to/video.mp4",
    categorical_keywords=["dog", "human"],
    unary_keywords=["running"],
    binary_keywords=["chasing"],
    object_pairs=[(0, 1)],
    return_top_k=3,
    include_visualizations=True,
)
print(results["summary"])
```
### Using the Model Directly (Advanced)

For advanced users who want to provide their own segmentation:

```python
from vine_hf import VineConfig, VineModel
import torch

# Create configuration
config = VineConfig(
    pretrained_vine_path="/path/to/your/vine/weights"  # Optional: your fine-tuned weights
)

# Initialize model
model = VineModel(config)

# If you have your own video frames, masks, and bboxes from external segmentation
video_frames = torch.randn(3, 224, 224, 3) * 255  # dummy frames (num_frames, height, width, channels)
masks = {0: {1: torch.ones(224, 224, 1)}}  # your segmentation masks
bboxes = {0: {1: [50, 50, 150, 150]}}  # your bounding boxes

# Run prediction
results = model.predict(
    video_frames=video_frames,
    masks=masks,
    bboxes=bboxes,
    categorical_keywords=['human', 'dog', 'frisbee'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'following'],
    object_pairs=[(1, 2)],
    return_top_k=3,
)
```
**Note**: For most users, the pipeline approach above is recommended, as it handles video loading and segmentation automatically.
## Configuration Options

The `VineConfig` class supports the following parameters (non-exhaustive):

- `model_name`: CLIP model backbone (default: `"openai/clip-vit-large-patch14-336"`)
- `pretrained_vine_path`: Optional path or Hugging Face repo with pretrained VINE weights
- `segmentation_method`: `"sam2"` or `"grounding_dino_sam2"` (default: `"grounding_dino_sam2"`)
- `box_threshold` / `text_threshold`: Grounding DINO thresholds
- `target_fps`: Target FPS for video processing (default: `1`)
- `alpha`, `white_alpha`: Rendering parameters used when extracting masked crops
- `topk_cate`: Top-k categories to return per object (default: `3`)
- `max_video_length`: Maximum frames to process (default: `100`)
- `visualize`: When `True`, pipeline post-processing attempts to create stitched visualizations
- `visualization_dir`: Optional base directory where visualization assets are written
- `debug_visualizations`: When `True`, the model saves a single first-frame mask composite for quick inspection
- `debug_visualization_path`: Target filepath for the debug mask composite (must point to a writable file)
- `return_flattened_segments`, `return_valid_pairs`, `interested_object_pairs`: Advanced geometry outputs for downstream consumers
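As an illustration, here is a configuration exercising several of these options (paths and values are placeholders):

```python
from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-large-patch14-336",  # default CLIP backbone
    segmentation_method="sam2",                      # automatic masks, no text prompts
    target_fps=2,                                    # sample two frames per second
    topk_cate=5,                                     # return five categories per object
    max_video_length=50,                             # cap processing at 50 frames
    visualize=True,
    visualization_dir="output/visualizations",
    debug_visualizations=True,
    debug_visualization_path="output/debug/first_frame_masks.png",
)
```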
## Output Format

The model returns a dictionary with the following structure:

```python
{
    "masks": {},
    "boxes": {},
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "confidence_scores": {
        "categorical": max_categorical_confidence,
        "unary": max_unary_confidence,
        "binary": max_binary_confidence
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    }
}
```
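For example, to walk the ranked predictions (assuming `results` is the dictionary returned by a completed call):

```python
# Top-ranked category per detected object.
for object_id, ranked in results["categorical_predictions"].items():
    probability, category = ranked[0]
    print(f"object {object_id}: {category} ({probability:.2f})")

# Top-ranked action per (frame, object) pair.
for (frame_id, object_id), ranked in results["unary_predictions"].items():
    probability, action = ranked[0]
    print(f"frame {frame_id}, object {object_id}: {action} ({probability:.2f})")
```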
## Visualization & Debugging

There are two complementary visualization layers:

- **Post-process visualizations** (`include_visualizations=True` in the pipeline call) produce a high-level stitched video summarizing detections, actions, and relations over time.
- **Debug visualizations** (`debug_visualizations=True` in `VineConfig`) dump videos of intermediate outputs (GroundingDINO detections, SAM2 masks, unary and binary predictions) for quick sanity checks.

If you plan to enable either option, ensure the relevant output directories exist before running the pipeline, as sketched below.
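A minimal preparation sketch (directory names are just examples matching the configs above):

```python
import os

# Create the output directories up front so visualization writes don't fail mid-run.
os.makedirs("output/visualizations", exist_ok=True)
os.makedirs("output/debug", exist_ok=True)
```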
## Segmentation Methods

### Grounding DINO + SAM2 (Recommended)

Uses Grounding DINO for object detection based on text prompts, then SAM2 for precise segmentation.

Requirements:
- Grounding DINO model and weights
- SAM2 model and weights
- Properly configured paths to model checkpoints

### SAM2 Only

Uses SAM2's automatic mask generation without text-based object detection.

Requirements:
- SAM2 model and weights
## Model Architecture

VINE is built on top of CLIP and uses three separate CLIP models for different tasks:

- **Categorical Model**: For object classification
- **Unary Model**: For single-object action recognition
- **Binary Model**: For relationship detection between object pairs

Each model processes both visual and textual features to compute similarity scores and probability distributions.
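As a conceptual sketch only, not the actual VINE implementation: each head can be viewed as scoring CLIP text embeddings of the keywords against a visual embedding and normalizing with a softmax. The function below is purely illustrative:

```python
import torch

def keyword_distribution(visual_emb: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Illustrative only: visual_emb (D,), text_embs (K, D) -> probabilities (K,)."""
    visual_emb = visual_emb / visual_emb.norm()                   # L2-normalize the visual feature
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)  # L2-normalize each keyword embedding
    similarities = text_embs @ visual_emb                         # cosine similarity per keyword, shape (K,)
    return torch.softmax(similarities, dim=-1)                    # probability distribution over keywords
```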
## Pushing to HuggingFace Hub

```python
from vine_hf import VineConfig, VineModel

# Create and configure your model
config = VineConfig()
model = VineModel(config)

# Load your pretrained weights
# model.load_state_dict(torch.load('path/to/your/weights.pth'))

# Register for auto classes
config.register_for_auto_class()
model.register_for_auto_class("AutoModel")

# Push to Hub
config.push_to_hub('your-username/vine-model')
model.push_to_hub('your-username/vine-model')
```
## Loading from HuggingFace Hub

```python
from transformers import AutoModel, pipeline

# Load the model directly
model = AutoModel.from_pretrained('your-username/vine-model', trust_remote_code=True)

# Or use it with a pipeline
vine_pipeline = pipeline(
    'vine-video-understanding',
    model='your-username/vine-model',
    trust_remote_code=True,
)
```
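The Hub-loaded pipeline takes the same call signature as the locally constructed one in Quick Start, for example:

```python
results = vine_pipeline(
    "/path/to/video.mp4",
    categorical_keywords=["dog", "human"],
    unary_keywords=["running"],
    binary_keywords=["chasing"],
    object_pairs=[(0, 1)],
)
print(results["summary"])
```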
## Examples

See `example_usage.py` for comprehensive examples including:

- Direct model usage
- Pipeline usage
- HuggingFace Hub integration
- Real video processing
## Version Requirements

- Python 3.7+
- PyTorch 1.9+
- transformers 4.20+
- OpenCV
- PIL/Pillow
- NumPy

For segmentation:

- SAM2 (Facebook Research)
- Grounding DINO (IDEA Research)
## Citation

If you use VINE in your research, please cite:

```bibtex
@article{vine2024,
  title={VINE: Video Understanding with Natural Language},
  author={Your Authors},
  journal={Your Journal},
  year={2024}
}
```
## License

[Your License Here]

## Contact

[Your Contact Information Here]