Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction
Abstract
A conditional binary segmentation framework with cycle-consistency training enables robust object correspondence across egocentric and exocentric viewpoints without ground-truth annotations.
We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. To encourage robust, view-invariant representations, we introduce a cycle-consistency training objective: the predicted mask in the target view is projected back to the source view to reconstruct the original query mask. This bidirectional constraint provides a strong self-supervisory signal without requiring ground-truth annotations and enables test-time training (TTT) at inference. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance. The code is available at https://github.com/shannany0606/CCMP.
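To make the bidirectional constraint concrete, below is a minimal sketch of how such a cycle-consistency loss can be expressed, assuming a generic conditional segmentation model `f(source_image, query_mask, target_image) -> mask logits`. The function name and signature are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch of the cycle-consistency objective; `f` stands in for
# any conditional binary segmentation model, not the paper's exact architecture.
import torch
import torch.nn.functional as F

def cycle_consistency_loss(f, src_img, src_mask, tgt_img):
    """Forward pass source->target, then project the prediction back
    target->source and penalize disagreement with the original query mask."""
    # Forward: localize the queried object in the target (e.g., exocentric) view.
    tgt_mask_logits = f(src_img, src_mask, tgt_img)   # (B, 1, H, W)
    tgt_mask = torch.sigmoid(tgt_mask_logits)

    # Backward: use the predicted target-view mask as the query to
    # reconstruct the original source-view mask.
    src_mask_logits = f(tgt_img, tgt_mask, src_img)   # (B, 1, H, W)

    # The reconstruction error against the known query mask is a
    # self-supervisory signal: no target-view ground truth is needed.
    return F.binary_cross_entropy_with_logits(src_mask_logits, src_mask)
```

Because this loss requires no annotations in the target view, the same objective can in principle be minimized for a few gradient steps on each test pair, which is how a test-time training strategy of the kind described above can be realized.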
Community
The paper has been accepted to CVPR 2026 with high review scores (5/5/4). Our approach is intentionally simple and effective: we use a straightforward pipeline and show that such a simple design already achieves strong performance and generalization across benchmarks. We believe that presenting a simple, effective solution to a difficult problem is valuable for the community.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- 3AM: 3egment Anything with Geometric Consistency in Videos (2026)
- VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization (2026)
- Revisiting Multi-Task Visual Representation Learning (2026)
- 3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence (2026)
- Beyond Single Prompts: Synergistic Fusion and Arrangement for VICL (2026)
- UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass (2026)
- Segment and Matte Anything in a Unified Model (2026)