There is no SAMantics!

Exploring SAM as a Backbone for Visual Understanding Tasks

University of Edinburgh
arXiv 2024



Abstract

The Segment Anything Model (SAM) was originally designed for label-agnostic mask generation. Does this model also possess inherent semantic understanding, of value to broader visual tasks? In this work we follow a multi-staged approach to answer this question. We first quantify SAM's semantic capabilities by comparing the efficacy of its base image encoder on classification tasks against established models (CLIP and DINOv2). Our findings reveal a significant lack of semantic discriminability in SAM feature representations, limiting their potential for tasks that require class differentiation. This initial result motivates an exploratory study in which we attempt to enable semantic information via in-context learning with lightweight fine-tuning, where we observe that generalisability to unseen classes remains limited. Our observations culminate in the proposal of a training-free approach that leverages DINOv2 features, towards better endowing SAM with semantic understanding and achieving instance-level class differentiation through feature-based similarity. Our study suggests that incorporating external semantic sources is a promising direction for enhancing SAM's utility on complex visual tasks that require semantic understanding.

Quantifying Semantics in SAM

We benchmark SAM feature representations against popular vision encoders (CLIP and DINOv2) on image classification tasks, measuring the presence of semantics via linear probing on the respective features. Despite its impressive segmentation abilities, we find that SAM lacks the discriminative feature quality needed for successful classification, indicating limited semantic encoding.

ImageNet1K performance gap between SAM and CLIP/DINOv2

Model     Top-1 Acc. (%)   Top-5 Acc. (%)
SAM            11.06            25.37
SAM 2          23.16            44.44
CLIP           73.92            92.89
DINOv2         77.43            93.78

Image classification accuracy of SAM, SAM 2, CLIP, and DINOv2 on the ImageNet1K dataset using linear probing.
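To make the evaluation protocol concrete, below is a minimal linear-probing sketch in PyTorch. The `encoder` argument stands in for any frozen backbone (SAM's image encoder, CLIP, or DINOv2) that returns one pooled feature vector per image; the function and loader names are illustrative and not the exact training setup used in the paper.

import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, num_classes, feat_dim, epochs=10, device="cuda"):
    """Train a single linear classifier on top of frozen encoder features.

    `encoder` is a placeholder for any frozen backbone returning a pooled
    (B, feat_dim) feature per image; only the linear layer is trained.
    """
    encoder.eval().to(device)
    for p in encoder.parameters():
        p.requires_grad_(False)              # backbone stays frozen

    probe = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = encoder(images)      # frozen features, no gradients
            loss = criterion(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe

Classification accuracy of this probe then serves as a proxy for how much class-discriminative (semantic) information the frozen features already contain.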

Recovering Semantics in SAM

We then explore whether SAM's inherent representations can be adapted for semantic tasks through lightweight fine-tuning. By introducing in-context learning with reference images, we help SAM capture class-specific information for more effective segmentation. This approach shows some success but reveals a critical limitation: the adapted model overfits to known classes, struggling to generalise to new ones.

Architecture diagram
Training pipeline: Reference image and category are encoded by SAM to get token embeddings, which condition a DETR decoder to predict boxes on the target image. These boxes then prompt SAM to generate masks. Only DETR and token-merge MLPs are trained, while SAM remains frozen.
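For readers who prefer code to diagrams, the following is a schematic sketch of the conditioning flow described in the caption. All module names (`sam_encoder`, `token_merge_mlp`, `detr_decoder`, `sam_prompt_segment`) are placeholders for the actual components, and the interfaces are illustrative assumptions rather than the paper's implementation.

import torch

def in_context_forward(sam_encoder, token_merge_mlp, detr_decoder, sam_prompt_segment,
                       reference_image, category_embedding, target_image):
    """One forward pass of the in-context pipeline, as sketched in the diagram.

    Modules are passed in as placeholders: only `token_merge_mlp` and
    `detr_decoder` would receive gradients, while SAM components stay frozen.
    """
    with torch.no_grad():                                # frozen SAM encoder
        ref_feats = sam_encoder(reference_image)         # reference image features
        tgt_feats = sam_encoder(target_image)            # target image features

    # Merge reference features with the category embedding into condition tokens.
    condition_tokens = token_merge_mlp(ref_feats, category_embedding)

    # A DETR-style decoder predicts boxes on the target image,
    # conditioned on the reference-derived tokens.
    pred_boxes = detr_decoder(tgt_feats, condition_tokens)

    # The predicted boxes then prompt the frozen SAM mask decoder.
    with torch.no_grad():
        masks = sam_prompt_segment(target_image, boxes=pred_boxes)

    return pred_boxes, masks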

Failure Cases

While our approach shows promise on known classes, it faces significant challenges when encountering novel categories not seen during training. These failure cases highlight a fundamental limitation in the model's ability to generalise semantic understanding beyond its training distribution.

Failure cases
Top: skateboards from the BASE set are accurately segmented, but persons from the NOVEL set are not. Middle: mouse and keyboard instances from BASE are identified, but person and tv from NOVEL are not. Bottom: trucks from BASE are identified, but car, motorcycle, and person from NOVEL are not.

t-SNE Exploration

We use t-SNE plots to visualise the features, as shown in the figure below. These plots reveal that class semantic information is not present before the MLP; it emerges only within the MLP layer itself, which consequently overfits to the classes seen during training. In other words, the semantics recovered by the model are specific to the classes we tune on, and these class-specific cues do not generalise to unseen classes.

t-SNE visualisation
t-SNE visualisation of the feature space before and after fine-tuning, and comparison with DINOv2 and CLIP features.
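The t-SNE projections themselves follow the standard recipe. Below is a minimal sketch with scikit-learn, assuming per-instance feature vectors (e.g. taken before or after the MLP) and integer class labels have already been extracted; parameter choices are illustrative.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, labels, title="t-SNE of features"):
    """Project high-dimensional features to 2-D with t-SNE and colour by class.

    features: (N, D) array of per-instance embeddings.
    labels:   (N,) array of integer class ids.
    """
    features = np.asarray(features, dtype=np.float32)
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(features)

    plt.figure(figsize=(6, 6))
    scatter = plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab20")
    plt.legend(*scatter.legend_elements(), title="class", loc="best", fontsize=6)
    plt.title(title)
    plt.tight_layout()
    plt.show()

Well-separated, class-coloured clusters indicate semantically discriminative features; the absence of such structure before the MLP is what points to the overfitting described above.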

Injecting External Semantics

Motivated by SAM's inherent semantic gap, we experiment with injecting semantics directly from a pretrained, semantically rich model. Using DINOv2 as a backbone, we create a hybrid architecture that fuses SAM's segmentation strengths with external semantic knowledge. Preliminary results suggest a promising direction for the integration of semantic awareness, without exhaustive model retraining.

Model        Box AP   Box AR   Mask AP   Mask AR
SAM+DINOv2    13.3     39.9      11.3      34.1

Preliminary box and mask AP and AR results on the COCO validation split using DINOv2 with SAM. Since the method is training-free, these results can be interpreted as generalisation to unseen classes.
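As a minimal sketch of the underlying idea, each SAM mask can be assigned a class by comparing mask-pooled DINOv2 features against class prototype embeddings via cosine similarity. The mask-average pooling and the prototype construction below are simplifying assumptions for illustration, not necessarily the paper's exact procedure.

import torch
import torch.nn.functional as F

def label_masks_by_similarity(dino_feature_map, masks, class_prototypes):
    """Assign a class to each SAM mask via cosine similarity of pooled DINOv2 features.

    dino_feature_map: (D, H, W) patch features from DINOv2, resized to image resolution.
    masks:            (M, H, W) boolean instance masks from SAM.
    class_prototypes: (C, D) reference embeddings, one per class
                      (e.g. averaged DINOv2 features of exemplar images).
    Returns per-mask class ids and similarity scores.
    """
    D, H, W = dino_feature_map.shape
    flat = dino_feature_map.reshape(D, H * W)                  # (D, HW)

    labels, scores = [], []
    for mask in masks:
        idx = mask.reshape(-1).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:                                   # skip empty masks
            labels.append(-1)
            scores.append(0.0)
            continue
        mask_feat = flat[:, idx].mean(dim=1)                   # average-pool features inside the mask
        sims = F.cosine_similarity(mask_feat.unsqueeze(0), class_prototypes, dim=1)
        best = sims.argmax()
        labels.append(int(best))
        scores.append(float(sims[best]))
    return labels, scores

Because no component is trained, the semantic labels come entirely from DINOv2's feature space, while SAM contributes only the class-agnostic masks.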

Conclusion

Our study reveals a significant semantic gap in SAM's feature representations, limiting its utility for tasks that require class differentiation. While our in-context learning approach shows promise, it struggles to generalise to unseen classes. We therefore propose a training-free approach that leverages DINOv2 features to enhance SAM's semantic understanding and enable instance-level class differentiation through feature-based similarity. Overall, our findings suggest that incorporating external semantic sources offers a promising direction for enhancing SAM's utility on complex visual tasks that require semantic understanding.

BibTeX

@article{espinosa2024samantics,
  title={There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks},
  author={Miguel Espinosa and Chenhongyi Yang and Linus Ericsson and Steven McDonagh and Elliot J. Crowley},
  journal={arXiv preprint arXiv:2411.15288},
  year={2024},
  url={https://arxiv.org/abs/2411.15288}
}