Segment Anything Meets Point Tracking
Foundation models have marked a significant stride toward addressing generalization challenges in deep learning. While the Segment Anything Model (SAM) has established a strong foothold in image segmentation, existing video segmentation methods still require extensive mask labeling for fine-tuning, or otherwise suffer performance drops on unseen data domains. In this paper, we show how foundation models for image segmentation make a step toward enhancing domain generalizability in video segmentation. We discover that, combined with long-term point tracking, image segmentation models yield state-of-the-art results in zero-shot video segmentation across multiple benchmarks. Surprisingly, point trackers exhibit generalization to domains beyond their synthetic pre-training sequences, which we attribute to the trackers' ability to harness the rich local information in the vicinity of each tracked point. Thus, we introduce SAM-PT, an innovative method for point-centric video segmentation, leveraging the capabilities of SAM alongside long-term point tracking. SAM-PT extends SAM's capability to tracking and segmenting anything in dynamic videos. Unlike traditional video segmentation methods that focus on object-centric mask propagation, our approach uniquely exploits point propagation to utilize local structure information independent of object semantics. The effectiveness of point-based tracking is underscored by direct evaluation on the zero-shot open-world UVO benchmark. Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions. We release our code at https://github.com/SysCV/sam-pt.
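The point-centric idea described in the abstract (propagate query points through the video, then prompt an image segmenter with the propagated points in each frame) can be illustrated with a minimal sketch. Everything named below is an assumption made for illustration: `segment_video`, `toy_tracker`, `toy_segmenter`, and `make_frames` are hypothetical helpers, the tracker is a constant-motion stand-in rather than a learned long-term point tracker, and the segmenter is a flood fill rather than SAM. It is not the released SAM-PT code.

```python
from typing import Callable, List, Tuple

Point = Tuple[int, int]            # (row, col)
Frame = List[List[int]]            # toy grayscale image
Mask = List[List[bool]]
Track = List[Tuple[Point, bool]]   # per-frame (position, visible) for each query point


def segment_video(
    frames: List[Frame],
    query_points: List[Point],
    track_points: Callable[[List[Frame], List[Point]], List[Track]],
    segment_with_points: Callable[[Frame, List[Point]], Mask],
) -> List[Mask]:
    """Propagate points through the video, then prompt the segmenter per frame."""
    trajectories = track_points(frames, query_points)
    masks: List[Mask] = []
    for frame, tracked in zip(frames, trajectories):
        # Only points the tracker reports as visible are used as prompts.
        prompts = [p for p, visible in tracked if visible]
        if prompts:
            masks.append(segment_with_points(frame, prompts))
        else:
            masks.append([[False] * len(frame[0]) for _ in frame])
    return masks


def toy_tracker(frames: List[Frame], query_points: List[Point]) -> List[Track]:
    # Hypothetical stand-in: assumes every point moves one column right per frame.
    width = len(frames[0][0])
    return [
        [((r, c + t), c + t < width) for r, c in query_points]
        for t in range(len(frames))
    ]


def toy_segmenter(frame: Frame, prompts: List[Point]) -> Mask:
    # Hypothetical stand-in: flood fill over same-intensity pixels from each prompt.
    h, w = len(frame), len(frame[0])
    mask = [[False] * w for _ in range(h)]
    stack = list(prompts)
    while stack:
        r, c = stack.pop()
        if mask[r][c]:
            continue
        mask[r][c] = True
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w and frame[nr][nc] == frame[r][c]:
                stack.append((nr, nc))
    return mask


def make_frames(n: int = 3, size: int = 6) -> List[Frame]:
    # A bright 2x2 square that moves one column right in each frame.
    frames = []
    for t in range(n):
        frame = [[0] * size for _ in range(size)]
        for r in (2, 3):
            for c in (1 + t, 2 + t):
                frame[r][c] = 255
        frames.append(frame)
    return frames


frames = make_frames()
masks = segment_video(frames, [(2, 1)], toy_tracker, toy_segmenter)
```

In this toy setup a single query point placed on the square in the first frame is enough to obtain a mask in every frame, because the prompts follow the object rather than a propagated mask. Swapping in a real tracker or segmenter would only require functions with the same two signatures.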
WOS:001521272600411
École Polytechnique Fédérale de Lausanne
Hong Kong University of Science & Technology
Hong Kong University of Science & Technology
Hong Kong University of Science & Technology
Swiss Federal Institutes of Technology Domain
Swiss Federal Institutes of Technology Domain
2025-01-01
Los Alamitos
979-8-3315-1084-8
979-8-3315-1083-1
IEEE Winter Conference on Applications of Computer Vision
2472-6737
9302
9311
REVIEWED
EPFL
| Event name | Event acronym | Event place | Event date |
| | WACV 2024 | Tucson, AZ, USA | 2025-02-26 - 2025-03-06 |