[Paper Review] Segment Anything Meets Point Tracking, arXiv'2307

Video Segmentation

hobin-e 2023. 11. 6. 10:00

Background

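Segment Anything [1] from Meta AI performs so well that a large number of projects built on the Segment Anything Model (SAM) have been released.
- It is a zero-shot image segmentation model and accepts several kinds of prompts, from points and boxes to text (text prompting is only explored as a proof of concept in the paper [1]), which makes it widely applicable.
Some projects build applications that add stronger language understanding on top of SAM,
and others use it to solve tracking problems.
Tracking models that use SAM usually take segmented images instead of raw RGB images as input, thereby borrowing SAM's supervision.
- Their main concern has therefore been how to propagate masks well.
SAM-PT is unusual in that it exploits SAM's point-promptable nature and solves video object segmentation using only a trained point tracker to carry information across frames.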

Video segmentation is hard to solve zero-shot.
- There are studies [2, 3] that build on SAM, a zero-shot image segmentation model, but SAM's zero-shot performance in the image domain did not carry over to the video domain.
- There are also zero-shot video segmentation studies that do not use SAM [4, 5].
- However, Painter [4] falls short in performance, and SegGPT [5] performs well but requires a mask annotation on the first frame.
Video carries rich local structure information.
- Object-centric feature matching and mask propagation need association at the object level, but under heavy deformation the same object looks different depending on viewing direction or viewpoint, which makes association hard.
- Compared with these methods that look at an object's global structure, a method like SAM-PT that looks at the local structure around points enables more robust tracking.
SAM-PT achieves SoTA performance on the zero-shot video segmentation task while requiring only points on the first frame.

Points are placed on the objects in the first frame.
Each object can get multiple points, points are assigned per object, and negative points that lie on no object can be added as well.
- A single negative point or so is enough.
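
To make the prompt layout concrete, here is a minimal sketch of how the first-frame points for one object could be stored and turned into a point prompt. The `ObjectQuery` container and its field names are hypothetical and not from the paper. The label convention (1 for positive, 0 for negative) follows what SAM's point prompts use.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ObjectQuery:
    """First-frame query points for one object (hypothetical container)."""
    object_id: int
    positive: list                                  # (x, y) points on the object
    negative: list = field(default_factory=list)    # (x, y) points on no object

    def as_prompt(self):
        """Return (coords, labels); label 1 = positive, 0 = negative."""
        pts = list(self.positive) + list(self.negative)
        coords = np.array(pts, dtype=np.float32).reshape(-1, 2)
        labels = np.array(
            [1] * len(self.positive) + [0] * len(self.negative), dtype=np.int32
        )
        return coords, labels


# One object with two positive points and a single negative point.
query = ObjectQuery(object_id=1, positive=[(10, 20), (30, 40)], negative=[(5, 5)])
coords, labels = query.as_prompt()
```

Keeping positive and negative points in separate lists per object makes it easy to drop occluded positives later without touching the negatives.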

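Where and how many points to place is up to you, but finding a good strategy matters.
- Assuming a GT mask is available, four ways of placing points were tried. K-Medoids [6] is a K-Means-like clustering whose cluster centers are actual data points (medoids) rather than means, and Shi-Tomasi [7] extracts corner points that are well suited to tracking.
- Measured by DAVIS 2017 J&F score, placing about 8 points with K-Medoids looks best.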
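
As a rough illustration of the K-Medoids option, the sketch below picks `k` query points from a binary GT mask with a basic alternating k-medoids on pixel coordinates. This is an assumption-laden sketch, not necessarily the variant used in the paper: the function name, the `max_pixels` subsampling, and the iteration cap are all made up for the example.

```python
import numpy as np


def kmedoids_points(mask, k=8, n_iter=20, max_pixels=1000, seed=0):
    """Pick k query points (x, y) inside a binary mask via simple k-medoids."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(np.float64)
    if len(pts) == 0:
        raise ValueError("mask is empty")
    # Subsample so the pairwise distance matrix stays small.
    if len(pts) > max_pixels:
        pts = pts[rng.choice(len(pts), max_pixels, replace=False)]
    k = min(k, len(pts))
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

    medoids = rng.choice(len(pts), k, replace=False)
    for _ in range(n_iter):
        # Assign every pixel to its nearest medoid.
        assign = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.nonzero(assign == c)[0]
            if len(members) == 0:
                continue
            # New medoid: the member with the smallest total distance to its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return pts[medoids].astype(int)


# Toy rectangular "object" and 8 query points spread over it.
mask = np.zeros((40, 60), dtype=bool)
mask[10:30, 15:45] = True
points = kmedoids_points(mask, k=8)
```

Because medoids are actual mask pixels, every returned point is guaranteed to lie on the object, which a plain K-Means centroid would not guarantee for non-convex shapes.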

PIPS [8], a SoTA point tracker, is used.
- It is robust to long-term tracking and to occlusion & re-appearance.
- It outputs occlusion scores in addition to point trajectories.
Compared with RAFT [9], SuperGlue [10], and TapNet [11], it performed best empirically.
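
The tracker output described above can be pictured as two arrays: trajectories of shape `(T, N, 2)` and occlusion scores of shape `(T, N)`. The sketch below selects the points treated as non-occluded in one frame. The array layout, the "higher score means more likely occluded" convention, and the `0.5` threshold are assumptions for illustration, not values taken from the paper or from the PIPS code.

```python
import numpy as np


def visible_points(trajectories, occlusion, frame, threshold=0.5):
    """Return the (x, y) points considered non-occluded at a given frame.

    trajectories: array of shape (T, N, 2) with per-frame point positions.
    occlusion:    array of shape (T, N); higher = more likely occluded (assumed).
    """
    keep = occlusion[frame] < threshold
    return trajectories[frame][keep]


# Toy example: 2 frames, 3 tracked points; the last point is occluded in frame 1.
traj = np.array(
    [[[10, 10], [20, 20], [30, 30]],
     [[11, 10], [21, 20], [31, 30]]], dtype=np.float32
)
occ = np.array([[0.1, 0.2, 0.1],
                [0.1, 0.3, 0.9]], dtype=np.float32)
pts_frame1 = visible_points(traj, occ, frame=1)
```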

The segmentation mask is obtained through two inference steps with SAM.
1. Use the positive (non-occluded) points as the prompt to get an initial mask.
2. Use the positive points, the initial mask, and the negative points together as the prompt to get the final mask.
Iterative refinement, which repeats step 2 several times to improve mask quality, sometimes improved performance (DAVIS 2017) and sometimes hurt it (MOSE), so whether to apply it should probably be decided empirically from the results.
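
The two steps above can be sketched roughly as follows. The function `two_pass_mask` and its `n_refine` parameter are hypothetical. `predictor` is assumed to behave like `SamPredictor` from the `segment_anything` package, with `set_image()` already called for the current frame, so that `predict()` returns `(masks, scores, low_res_logits)` and accepts the low-resolution logits back through `mask_input`. How the authors actually wire the passes together may differ.

```python
import numpy as np


def two_pass_mask(predictor, pos_pts, neg_pts, n_refine=0):
    """Two-step prompting sketch; n_refine > 0 repeats step 2 (iterative refinement)."""
    pos_pts = np.asarray(pos_pts, dtype=np.float32).reshape(-1, 2)
    neg_pts = np.asarray(neg_pts, dtype=np.float32).reshape(-1, 2)

    # Step 1: positive (non-occluded) points only -> initial mask.
    masks, scores, logits = predictor.predict(
        point_coords=pos_pts,
        point_labels=np.ones(len(pos_pts), dtype=np.int32),
        multimask_output=False,
    )

    # Step 2: positive + negative points plus the previous mask logits -> final mask.
    coords = np.concatenate([pos_pts, neg_pts], axis=0)
    labels = np.concatenate(
        [np.ones(len(pos_pts)), np.zeros(len(neg_pts))]
    ).astype(np.int32)
    for _ in range(1 + n_refine):
        masks, scores, logits = predictor.predict(
            point_coords=coords,
            point_labels=labels,
            mask_input=logits,  # low-res logits from the previous pass
            multimask_output=False,
        )
    return masks[0]
```

With `n_refine=0` this is exactly one initial pass plus one final pass. Since refinement helped on DAVIS 2017 but hurt on MOSE, exposing it as a parameter makes it easy to toggle per dataset.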

It performs worse than SegGPT on YouTube-VOS and MOSE, but since it is a zero-shot model, the difference in training data should be taken into account.
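
For reading these comparisons, it helps to recall what the J&F score measures: it averages region similarity J (mask IoU) and contour accuracy F (a boundary F-measure). The sketch below computes J only, and takes F as a given number, because a faithful boundary F-measure is longer than fits here. The function names are made up for this example, and scoring two empty masks as 1.0 is a convention assumed here.

```python
import numpy as np


def region_similarity(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumed convention)
    return float(np.logical_and(pred, gt).sum() / union)


def j_and_f(j, f):
    """J&F is the plain average of region similarity J and contour accuracy F."""
    return 0.5 * (j + f)


pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
j = region_similarity(pred, gt)  # intersection 1 pixel, union 2 pixels
```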

When the point tracker fails because of extensive occlusion, small objects, motion blur, re-identification, and so on, video segmentation fails as well.

The approach of splitting video segmentation into point tracking and point-prompted image segmentation is impressive.
- It is worth keeping an eye on progress in point trackers and promptable image segmentation models.
- Image segmentation results across consecutive frames look like they could be inconsistent, so a mechanism to compensate for this would be welcome.
The claim that point propagation is more robust than mask propagation made intuitive sense.

[1] Segment Anything, arXiv'2304
[2] Track Anything: Segment Anything Meets Videos, arXiv'2304
[3] Segment and Track Anything, arXiv'2305
[4] Images Speak in Images: A Generalist Painter for In-Context Visual Learning, CVPR'23
[5] SegGPT: Towards Segmenting Everything In Context, ICCV'23
[6] A Simple and Fast Algorithm for K-Medoids Clustering, Expert Systems with Applications 2009
[7] Good Features to Track, CVPR'94
[8] Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories, ECCV'22
[9] RAFT: Recurrent All-Pairs Field Transforms for Optical Flow, ECCV'20
[10] SuperGlue: Learning Feature Matching with Graph Neural Networks, CVPR'20
[11] TAP-Vid: A Benchmark for Tracking Any Point in a Video, NeurIPS'22