MC3-18 HMDB51 (UCF-101 Init)
Model Description
MC3-18 (Mixed Convolution 3D) finetuned on HMDB51 split 1 for human action recognition. This model was initialized with weights from an MC3-18 model pretrained on UCF-101 (87% accuracy) rather than Kinetics-400.
Validation Accuracy: 55.46%
This model demonstrates transfer learning from UCF-101 to HMDB51. Its validation accuracy is close to that of Kinetics-400 initialization (56.34%), and it generalizes better, with a noticeably smaller train-validation gap.
Model Details
- Architecture: MC3-18 (11.7M parameters)
- Initialization: UCF-101 pretrained weights (87% accuracy on UCF-101)
- Dataset: HMDB51 split 1
- Train: 3,570 videos across 51 action classes
- Validation: 1,530 videos
- Input: RGB video clips (16 frames, 112x112 spatial resolution)
- Output: 51-class action predictions
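The initialization step is straightforward to reproduce. A minimal sketch, assuming a UCF-101 checkpoint file name and state-dict key (both hypothetical; adjust to your checkpoint):

import torch
from torchvision.models.video import mc3_18

# Build MC3-18 with a 101-way head, load the UCF-101 checkpoint
# ('mc3_18_ucf101.pth' and 'model_state_dict' are assumed names),
# then swap in a fresh 51-way head for HMDB51 before fine-tuning.
model = mc3_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 101)   # UCF-101 head
state = torch.load('mc3_18_ucf101.pth', map_location='cpu')
model.load_state_dict(state['model_state_dict'])
model.fc = torch.nn.Linear(model.fc.in_features, 51)    # fresh HMDB51 head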
Training Configuration
Frames: 16
Frame Interval: 2
Spatial Size: 112x112
Batch Size: 16
Epochs: 100
Learning Rate: 0.0003
Weight Decay: 2e-3
Optimizer: SGD (momentum=0.9)
Augmentation:
- MixUp (alpha=0.4)
- CutMix (alpha=0.8)
- Label Smoothing (0.1)
- Random horizontal flip
- Color jitter
- Random grayscale
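For reference, a minimal sketch of how the optimizer, loss, and batch-level MixUp/CutMix could be wired up with the values above (the actual training code is not included here; details such as batch-level mixing are assumptions):

import torch

# Optimizer and loss exactly as listed above
optimizer = torch.optim.SGD(model.parameters(), lr=3e-4,
                            momentum=0.9, weight_decay=2e-3)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # accepts soft targets (PyTorch >= 1.10)

def mixup(clips, labels, num_classes=51, alpha=0.4):
    # Blend pairs of clips and their one-hot labels with a Beta-sampled weight
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    onehot = torch.nn.functional.one_hot(labels, num_classes).float()
    return lam * clips + (1 - lam) * clips[perm], lam * onehot + (1 - lam) * onehot[perm]

def cutmix(clips, labels, num_classes=51, alpha=0.8):
    # Paste a random spatial box from a permuted batch into every frame of each clip
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(clips.size(0))
    _, _, _, H, W = clips.shape                      # (B, C, T, H, W)
    ch, cw = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    y = torch.randint(0, H - ch + 1, (1,)).item()
    x = torch.randint(0, W - cw + 1, (1,)).item()
    clips = clips.clone()
    clips[:, :, :, y:y + ch, x:x + cw] = clips[perm, :, :, y:y + ch, x:x + cw]
    lam = 1 - (ch * cw) / (H * W)                    # adjust lambda to the actual box area
    onehot = torch.nn.functional.one_hot(labels, num_classes).float()
    return clips, lam * onehot + (1 - lam) * onehot[perm]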
Performance
| Metric | Value |
|---|---|
| Validation Accuracy | 55.46% |
| Training Accuracy | 68.59% |
| Train-Val Gap | 13.13% |
| Val F1 Score | 0.5359 |
| Val Precision | 0.5379 |
Why UCF-101 Transfer?
UCF-101 and HMDB51 share similar characteristics:
- Both are human action recognition datasets
- Similar video sources (YouTube, movies)
- Overlapping action categories (basketball, biking, diving, etc.)
- Similar temporal and spatial scales
This makes UCF-101 a more natural pretraining source than Kinetics-400 for HMDB51 transfer.
Better Generalization
Compared to Kinetics-400 initialization:
| Initialization | Val Acc | Train Acc | Gap |
|---|---|---|---|
| Kinetics-400 | 56.34% | ~75% | 19% |
| UCF-101 (this) | 55.46% | 68.59% | 13% |
The UCF-101 initialization achieves:
- Comparable validation accuracy (only 0.88 percentage points lower)
- Much better generalization (a train-val gap roughly 6 percentage points smaller)
- Lower training accuracy (less memorization of the training set)
This suggests UCF-101 features are better regularized for HMDB51, even though Kinetics-400 is a larger pretraining dataset.
Frame Tiling Issue
Important caveat: the UCF-101 checkpoint was trained with num_frames=16 and frame_interval=2, which requires a span of roughly 32 consecutive frames. However, many HMDB51 videos are shorter than that.
For short videos, the data loader tiles/repeats frames to reach 16 frames. This may hurt performance on those samples but was necessary to match the UCF-101 pretraining configuration.
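The tiling step itself is simple. A minimal sketch of one plausible implementation (cyclic repetition; the actual loader's strategy may differ):

def tile_to_length(frames, num_frames=16):
    # Repeat the decoded frame list cyclically until it is long enough,
    # then truncate to exactly num_frames
    if not frames:
        raise ValueError("cannot tile an empty clip")
    reps = -(-num_frames // len(frames))   # ceiling division
    return (frames * reps)[:num_frames]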
The Kinetics-initialized model uses num_frames=8, frame_interval=1 to avoid this issue, which may explain its slightly higher validation accuracy despite worse generalization.
Usage
import torch
from torchvision.models.video import mc3_18
from torchvision import transforms
import cv2

# Load model and attach the 51-way HMDB51 head
model = mc3_18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 51)
checkpoint = torch.load('best.pth', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Per-frame preprocessing (standard torchvision video-model statistics)
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((128, 171)),
    transforms.CenterCrop(112),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.43216, 0.394666, 0.37645],
                         std=[0.22803, 0.22145, 0.216989])
])

# Load 16 frames from the video, sampling every 2nd frame
# (cv2 decodes BGR, so convert each frame to RGB)
cap = cv2.VideoCapture('video.avi')  # path to your clip
frames = []
idx = 0
while len(frames) < 16:
    ret, frame = cap.read()
    if not ret:
        break
    if idx % 2 == 0:  # frame_interval = 2
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    idx += 1
cap.release()

# Tile short videos by cyclic repetition (see Frame Tiling Issue above)
if frames and len(frames) < 16:
    frames = (frames * (16 // len(frames) + 1))[:16]

frames = [transform(f) for f in frames]
video_tensor = torch.stack(frames).permute(1, 0, 2, 3).unsqueeze(0)  # (1, 3, 16, 112, 112)

# Inference
with torch.no_grad():
    output = model(video_tensor)
    pred = output.argmax(dim=1)
Alternative Approach
We also provide a model initialized from Kinetics-400 instead of UCF-101. That model achieves 56.34% validation accuracy but with worse generalization (19% train-val gap).
See: mc3-18-hmdb51-kinetics
UCF-101 vs Kinetics initialization:
- UCF-101 (this): Better generalization (13% gap), requires 16 frames, closer domain to HMDB51
- Kinetics: Slightly higher accuracy (56.34%), optimized for short clips (8 frames), larger pretraining dataset
Limitations
- Requires 16-frame inputs (causes frame tiling on short HMDB51 videos)
- Still overfits despite better generalization (13% train-val gap)
- Single model without ensemble
- No test-time augmentation
- Trained on HMDB51 split 1 only
Transfer Learning Insight
This model demonstrates that:
- Pretraining dataset size is not everything: UCF-101 (~13K videos) transfers nearly as well as Kinetics-400 (~300K videos) in accuracy, and generalizes better, when the domains are similar
- Domain alignment matters: UCF-101 and HMDB51 share similar action types and video characteristics
- Config matching cuts both ways: matching the pretraining configuration (16 frames) can conflict with target dataset characteristics (short videos)
HMDB51 Classes
The model predicts 51 action classes including: brush_hair, cartwheel, catch, chew, clap, climb, climb_stairs, dive, draw_sword, dribble, drink, eat, fall_floor, fencing, flic_flac, golf, handstand, hit, hug, jump, kick, kick_ball, kiss, laugh, pick, pour, pullup, punch, push, pushup, ride_bike, ride_horse, run, shake_hands, shoot_ball, shoot_bow, shoot_gun, sit, situp, smile, smoke, somersault, stand, swing_baseball, sword, sword_exercise, talk, throw, turn, walk, wave.
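To map the predicted index from the Usage snippet back to a class name, one option is the list below. This assumes label indices follow the alphabetical order of the class folders (as in torchvision's HMDB51 dataset); verify against the ordering used at training time:

CLASSES = sorted([
    'brush_hair', 'cartwheel', 'catch', 'chew', 'clap', 'climb', 'climb_stairs',
    'dive', 'draw_sword', 'dribble', 'drink', 'eat', 'fall_floor', 'fencing',
    'flic_flac', 'golf', 'handstand', 'hit', 'hug', 'jump', 'kick', 'kick_ball',
    'kiss', 'laugh', 'pick', 'pour', 'pullup', 'punch', 'push', 'pushup',
    'ride_bike', 'ride_horse', 'run', 'shake_hands', 'shoot_ball', 'shoot_bow',
    'shoot_gun', 'sit', 'situp', 'smile', 'smoke', 'somersault', 'stand',
    'swing_baseball', 'sword', 'sword_exercise', 'talk', 'throw', 'turn',
    'walk', 'wave'
])
print(CLASSES[pred.item()])  # e.g. 'dive'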
Training Details
- Framework: PyTorch
- Hardware: Single GPU (CUDA)
- Training Time: ~1.5 hours (100 epochs)
- Convergence: Best model saved around epoch 90-100
Citation
If you use this model, please cite:
HMDB51 dataset:
@inproceedings{kuehne2011hmdb,
title={HMDB: a large video database for human motion recognition},
author={Kuehne, Hildegard and Jhuang, Hueihan and Garrote, Est{\'\i}baliz and Poggio, Tomaso and Serre, Thomas},
booktitle={2011 International Conference on Computer Vision},
pages={2556--2563},
year={2011},
organization={IEEE}
}
UCF-101 dataset:
@article{soomro2012ucf101,
title={UCF101: A dataset of 101 human actions classes from videos in the wild},
author={Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
journal={arXiv preprint arXiv:1212.0402},
year={2012}
}
MC3 architecture:
@inproceedings{tran2018closer,
title={A closer look at spatiotemporal convolutions for action recognition},
author={Tran, Du and Wang, Heng and Torresani, Lorenzo and Ray, Jamie and LeCun, Yann and Paluri, Manohar},
booktitle={Proceedings of the IEEE conference on Computer Vision and Pattern Recognition},
pages={6450--6459},
year={2018}
}
License
- Model weights: [Apache]
- Code: [Apache]
- HMDB51 Dataset: [Original dataset license]
- UCF-101 Dataset: [Original dataset license]