The Segment Anything Model (SAM) was originally designed for label-agnostic mask generation. Does this model also possess inherent semantic understanding of value to broader visual tasks? In this work we follow a multi-staged approach towards answering this question. We first quantify SAM's semantic capabilities by comparing the efficacy of its base image encoder on classification tasks against established models (CLIP and DINOv2). Our findings reveal a significant lack of semantic discriminability in SAM's feature representations, limiting its potential for tasks that require class differentiation. This initial result motivates an exploratory study in which we attempt to enable semantic information via in-context learning with lightweight fine-tuning, where we observe that generalisability to unseen classes remains limited. Our observations culminate in the proposal of a training-free approach that leverages DINOv2 features to better endow SAM with semantic understanding and achieve instance-level class differentiation through feature-based similarity. Our study suggests that incorporating external semantic sources is a promising direction for enhancing SAM's utility on complex visual tasks that require semantic understanding.
We benchmark SAM feature representations against popular vision encoders (CLIP and DINOv2) on image classification tasks, measuring the presence of semantics via linear probing on each model's features. Despite SAM's impressive segmentation abilities, we find its features lack the discriminative quality needed for successful classification, indicating limited semantic encoding.
| Model  | Top-1 Acc. (%) | Top-5 Acc. (%) |
|--------|----------------|----------------|
| SAM    | 11.06          | 25.37          |
| SAM 2  | 23.16          | 44.44          |
| CLIP   | 73.92          | 92.89          |
| DINOv2 | 77.43          | 93.78          |
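As a rough illustration of the linear-probing protocol behind these numbers, the sketch below trains a single linear layer on frozen, pooled encoder features and reports top-1 accuracy. The `encoder` and data loaders are placeholders for any of the backbones above and a labelled image dataset; this is not the paper's exact training setup.

```python
# Minimal linear-probe sketch: fit one linear layer on frozen encoder features.
# `encoder` and the loaders are placeholders, not the paper's exact setup.
import torch
import torch.nn as nn

def extract_features(encoder, loader, device="cuda"):
    """Run the frozen encoder and pool its output into one vector per image."""
    encoder.eval().to(device)
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            out = encoder(images.to(device))
            if out.dim() == 4:            # (B, C, H, W) feature map
                out = out.flatten(2).mean(-1)
            elif out.dim() == 3:          # (B, N, C) token sequence
                out = out.mean(1)
            feats.append(out.cpu())
            labels.append(targets)
    return torch.cat(feats), torch.cat(labels)

def linear_probe(train_feats, train_labels, val_feats, val_labels,
                 num_classes, epochs=20, lr=1e-3):
    """Train a linear classifier on frozen features, return top-1 accuracy."""
    probe = nn.Linear(train_feats.shape[1], num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(probe(train_feats), train_labels)
        opt.zero_grad(); loss.backward(); opt.step()
    top1 = (probe(val_feats).argmax(-1) == val_labels).float().mean()
    return top1.item()
```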
We then explore whether SAM's inherent representations can be adapted for semantic tasks through lightweight fine-tuning. By introducing in-context learning with reference images, we help SAM capture class-specific information for more effective segmentation. This approach shows some success but reveals a critical limitation: the adapted model overfits to known classes, struggling to generalise to new ones.
While our approach shows promise on known classes, it faces significant challenges when encountering novel categories not seen during training. These failure cases highlight a fundamental limitation in the model's ability to generalise semantic understanding beyond its training distribution.
We use t-SNE plots to visualise the features, as shown in the figure below. These plots reveal that class semantic information is not present before the MLP; instead, class semantics are learned in the MLP layer itself, which therefore overfits to the classes seen during training. The semantics recovered by fine-tuning are thus specific to the tuned classes, and these class-specific cues do not generalise to unseen classes.
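A minimal sketch of such a visualisation is given below, assuming hypothetical arrays `pre_mlp_feats` and `post_mlp_feats` (one feature vector per instance) and per-instance `class_ids` collected from the fine-tuned model; it only shows how the before/after comparison can be produced.

```python
# Sketch of the t-SNE comparison described above: project features taken
# before and after the tuned MLP into 2-D and colour points by class.
# `pre_mlp_feats`, `post_mlp_feats` (N, D) and `class_ids` (N,) are assumed.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(pre_mlp_feats, post_mlp_feats, class_ids):
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, feats, title in zip(axes,
                                (pre_mlp_feats, post_mlp_feats),
                                ("Before MLP", "After MLP")):
        emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(feats)
        ax.scatter(emb[:, 0], emb[:, 1], c=class_ids, cmap="tab20", s=5)
        ax.set_title(title)
        ax.set_xticks([]); ax.set_yticks([])
    fig.tight_layout()
    return fig
```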
Motivated by SAM's inherent semantic gap, we experiment with injecting semantics directly from a pretrained, semantically rich model. Using DINOv2 as a backbone, we create a hybrid architecture that fuses SAM's segmentation strengths with external semantic knowledge. Preliminary results suggest a promising direction for the integration of semantic awareness, without exhaustive model retraining.
| Model      | Box AP | Box AR | Mask AP | Mask AR |
|------------|--------|--------|---------|---------|
| SAM+DINOv2 | 13.3   | 39.9   | 11.3    | 34.1    |
Preliminary box and mask AP and AR results on the COCO validation split using DINOv2 with SAM. Since the method is training-free, these results can be interpreted as generalisation to unseen classes.
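To make the idea concrete, the sketch below illustrates one way such a training-free pipeline could assign classes to SAM's class-agnostic masks: pool DINOv2 features inside each mask and match the pooled vector to class prototype vectors by cosine similarity. The input names and the prototype construction are assumptions for illustration, not the exact pipeline evaluated above.

```python
# Illustrative sketch: label class-agnostic SAM masks with DINOv2 features.
# Assumed inputs: `sam_masks` (M, H, W bool), `dino_feats` (h, w, D) dense
# features, `class_prototypes` (K, D) built elsewhere (e.g. from a few
# labelled reference images).
import torch
import torch.nn.functional as F

def label_masks(sam_masks, dino_feats, class_prototypes):
    # Upsample the (h, w, D) feature map to the mask resolution (H, W).
    feats = F.interpolate(
        dino_feats.permute(2, 0, 1).unsqueeze(0),
        size=sam_masks.shape[-2:], mode="bilinear", align_corners=False,
    )[0]                                            # (D, H, W)

    protos = F.normalize(class_prototypes, dim=-1)  # (K, D)
    labels, scores = [], []
    for mask in sam_masks.bool():
        # Average the features inside the mask, then compare to prototypes.
        region = F.normalize(feats[:, mask].mean(dim=1), dim=0)
        sim = protos @ region                       # cosine similarity per class
        labels.append(int(sim.argmax()))
        scores.append(float(sim.max()))
    return labels, scores
```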
Our study reveals a significant semantic gap in SAM's feature representations, limiting its utility for tasks that require class differentiation. While our in-context learning adaptation shows promise, it struggles to generalise to unseen classes. We therefore propose a training-free approach that leverages DINOv2 features to enhance SAM's semantic understanding and enable instance-level class differentiation through feature-based similarity. Our study suggests that incorporating external semantic sources offers a promising direction for enhancing SAM's utility on complex visual tasks that require semantic understanding.
@misc{espinosa2024samantics,
  title={There is no SAMantics! Exploring SAM as a Backbone for Visual Understanding Tasks},
  author={Miguel Espinosa and Chenhongyi Yang and Linus Ericsson and Steven McDonagh and Elliot J. Crowley},
  year={2024},
  eprint={2411.15288},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.15288}
}