The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this problem through a promptable, semantics-agnostic segmentation paradigm, yet it still requires manual visual prompts or complex, domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage the strong semantic priors learned by foundation models to identify corresponding regions between a reference and a target image. We find that such correspondences enable the automatic generation of instance-level segmentation masks for downstream tasks, and we instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction, (2) representation aggregation and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP) and PASCAL VOC Few-Shot (71.2% nAP50), and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
The problem:
Collecting annotations for segmentation is resource-intensive, and while recent
promptable segmentation models like SAM reduce manual effort, they still lack semantic
understanding and scalability. Reference-based instance segmentation offers an alternative
by using annotated reference images to guide segmentation, but current methods often require
fine-tuning and struggle with generalisation. Existing attempts to integrate foundation
models such as DINOv2 and SAM are hindered by computational inefficiency and poor
performance on complex instance-level tasks.
The solution:
We propose a training-free, three-stage method comprising (1) memory bank construction, (2) representation aggregation and (3) semantic-aware feature matching.
At its core, we employ a memory-based approach that stores discriminative representations for object categories.
Given a set of reference images \( \{ I_r^j \}_{j=1}^{N_r} \) and their corresponding instance masks \( \{ M_r^{j,i} \}_{j=1}^{N_r} \) for category \( i \), we extract dense feature maps \( F_r^j \in \mathbb{R}^{H' \times W' \times d} \) using a pretrained frozen encoder \( \mathcal{E} \), where \( d \) is the feature dimension and \( H', W' \) denote the spatial resolution of the feature map. The instance masks \( M_r^{j,i} \in \{0,1\}^{H' \times W'} \) are resized to match this resolution. For each category \( i \), we store the masked features:
\[ \mathcal{F}_r^{j,i} = F_r^j \odot M_r^{j,i} \]
where \( \odot \) denotes element-wise multiplication. These category-wise feature sets are stored in a memory bank \( \mathcal{M}_i \), which is synchronised across GPUs to ensure consistency in distributed settings.
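To make this step concrete, the following is a minimal PyTorch sketch of the memory-bank construction, assuming dense features have already been extracted by the frozen encoder and that instance masks arrive at image resolution. Function and variable names (e.g. `add_to_memory_bank`) are illustrative rather than taken from the released code, and the cross-GPU synchronisation mentioned above is omitted.

```python
from collections import defaultdict
import torch
import torch.nn.functional as F


def add_to_memory_bank(memory_bank, feats, masks, categories):
    """Store masked features per category.

    feats:      (H', W', d) dense feature map of one reference image
    masks:      (N, H, W) binary instance masks for that image
    categories: list of N category ids, one per mask
    """
    Hp, Wp, _ = feats.shape
    # Resize masks to the feature resolution (nearest keeps them binary).
    masks = F.interpolate(masks[None].float(), size=(Hp, Wp), mode="nearest")[0]
    for mask, cat in zip(masks, categories):
        # Element-wise product F_r ⊙ M_r keeps only features inside the instance.
        masked_feats = feats * mask.unsqueeze(-1)
        memory_bank[cat].append(masked_feats)
    return memory_bank


# Usage with random tensors standing in for real encoder outputs.
bank = defaultdict(list)
feats = torch.randn(37, 37, 768)          # e.g. ViT patch features at stride 14
masks = (torch.rand(2, 518, 518) > 0.5)   # two dummy instance masks
bank = add_to_memory_bank(bank, feats, masks, categories=[3, 7])
print({k: len(v) for k, v in bank.items()})
```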
To construct category prototypes, we first compute instance-wise feature representations, and then aggregate them into class-wise prototypes:
\[ P_r^{j,k} = \frac{1}{\| M_r^{j,k} \|_1} \sum_{(u,v)} M_r^{j,k}(u,v) F_r^j(u,v) \]
where \( P_r^{j,k} \in \mathbb{R}^d \) represents the mean feature representation of the \( k \)-th instance in image \( I_r^j \).
\[ P_i = \frac{1}{N_i} \sum_{j=1}^{N_r} \sum_{k \in \mathcal{K}_i^j} P_r^{j,k}, \]
where \( \mathcal{K}_i^j \) is the set of instances in image \( I_r^j \) that belong to category \( i \), and \( N_i = \sum_{j=1}^{N_r} |\mathcal{K}_i^j| \) is the total number of instances belonging to category \( i \). These class-wise prototypes \( P_i \) are stored in the memory bank.
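The prototype computation reduces to masked average pooling per instance followed by a plain mean over all instances of the category. A small sketch under the same assumptions (features and masks already at resolution \( H' \times W' \); names are illustrative):

```python
import torch


def instance_prototype(feats, mask):
    """Masked average pooling: mean feature over the instance pixels (P_r^{j,k})."""
    mask = mask.float()
    denom = mask.sum().clamp(min=1.0)                 # ||M||_1
    return (feats * mask.unsqueeze(-1)).sum(dim=(0, 1)) / denom


def class_prototype(feats_list, masks_list):
    """Average all instance prototypes of one category across reference images (P_i)."""
    protos = [instance_prototype(f, m) for f, m in zip(feats_list, masks_list)]
    return torch.stack(protos).mean(dim=0)


# Two dummy reference instances of the same category.
f1, m1 = torch.randn(37, 37, 768), (torch.rand(37, 37) > 0.5)
f2, m2 = torch.randn(37, 37, 768), (torch.rand(37, 37) > 0.5)
P_i = class_prototype([f1, f2], [m1, m2])
print(P_i.shape)  # torch.Size([768])
```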
For a target image \( I_t \), we extract dense features \( F_t \in \mathbb{R}^{H' \times W' \times d} \) using the same encoder \( \mathcal{E} \). We use the frozen SAM model to generate \( N_m \) candidate instance masks \( \{ M_t^m \}_{m=1}^{N_m} \), where \( M_t^m \in \{0,1\}^{H' \times W'} \). Each mask \( M_t^m \) is used to compute a feature representation via average pooling and L2 normalisation:
\[ P_t^m = \frac{1}{\| M_t^m \|_1} \sum_{(u,v)} M_t^m(u,v) F_t(u,v), \quad \hat{P}_t^m = \frac{P_t^m}{\| P_t^m \|_2} \]
where \( \hat{P}_t^m \in \mathbb{R}^d \) is the normalised mask feature.
To classify each candidate mask, we compute:
\[ S_t^m = \max_i \left( \frac{\hat{P}_t^m \cdot P_i}{\| P_i \|_2} \right) \]
which provides the classification score \( S_t^m \) for mask \( M_t^m \).
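Putting the target-side steps together, the sketch below pools target features under each SAM proposal, L2-normalises the result, and scores it against the class prototypes by cosine similarity. SAM mask generation itself is omitted: `candidate_masks` stands in for its output resized to the feature resolution, and all names are illustrative.

```python
import torch
import torch.nn.functional as F


def classify_masks(target_feats, candidate_masks, prototypes):
    """
    target_feats:    (H', W', d) dense features of the target image
    candidate_masks: (N_m, H', W') binary SAM proposals at feature resolution
    prototypes:      (C, d) class prototypes P_i from the memory bank
    Returns per-mask scores S_t^m and predicted class indices.
    """
    m = candidate_masks.float()
    denom = m.sum(dim=(1, 2)).clamp(min=1.0)                     # ||M||_1
    pooled = torch.einsum("nhw,hwd->nd", m, target_feats) / denom[:, None]
    pooled = F.normalize(pooled, dim=-1)                         # \hat{P}_t^m
    protos = F.normalize(prototypes, dim=-1)
    sims = pooled @ protos.T                                     # cosine similarities
    scores, labels = sims.max(dim=-1)                            # max over categories i
    return scores, labels, pooled


feats = torch.randn(37, 37, 768)
masks = (torch.rand(5, 37, 37) > 0.5)
protos = torch.randn(20, 768)
scores, labels, _ = classify_masks(feats, masks, protos)
print(scores.shape, labels.shape)  # torch.Size([5]) torch.Size([5])
```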
To suppress duplicate detections, for each pair of candidate masks we measure how much of \( M_t^m \) is covered by \( M_t^{m'} \):
\[ \text{IoS}(M_t^m, M_t^{m'}) = \frac{\sum_{(u,v)} M_t^m(u,v)\, M_t^{m'}(u,v)}{\sum_{(u,v)} M_t^m(u,v)} \]
and weight it by feature similarity:
\[ w_{m,m'} = \frac{\hat{P}_t^m \cdot \hat{P}_t^{m'}}{\|\hat{P}_t^{m'}\|_2} \]
The final score for each mask is adjusted using a decay factor:
\[ S_t^m \leftarrow S_t^m \cdot \sqrt{(1 - \text{IoS}(M_t^m, M_t^{m'}) w_{m,m'})} \]
reducing redundant detections while preserving distinct instances that may partially overlap. Finally, we rank masks by their adjusted scores, and select the top-\( K \) predictions as the final output.
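A simplified greedy version of this semantic-aware soft merging is sketched below: higher-ranked masks decay the scores of overlapping, semantically similar lower-ranked masks instead of discarding them. The exact ordering and scheduling used in the paper may differ; this is only an illustration of the decay rule above, with illustrative names.

```python
import torch


def semantic_soft_merge(masks, feats, scores, top_k=100):
    """
    masks:  (N, H', W') binary candidate masks
    feats:  (N, d) L2-normalised mask features \hat{P}_t^m
    scores: (N,) classification scores S_t^m
    """
    order = scores.argsort(descending=True)
    masks, feats, scores = masks[order].float(), feats[order], scores[order].clone()
    for m in range(len(scores)):
        for mp in range(m):                                  # only higher-ranked masks suppress
            inter = (masks[m] * masks[mp]).sum()
            ios = inter / masks[m].sum().clamp(min=1.0)      # IoS(M_m, M_m')
            w = (feats[m] * feats[mp]).sum().clamp(min=0.0)  # semantic similarity weight
            scores[m] = scores[m] * torch.sqrt(1.0 - ios * w)
    keep = scores.argsort(descending=True)[:top_k]
    return masks[keep].bool(), scores[keep]


masks = (torch.rand(6, 37, 37) > 0.5)
feats = torch.nn.functional.normalize(torch.randn(6, 768), dim=-1)
scores = torch.rand(6)
kept_masks, kept_scores = semantic_soft_merge(masks, feats, scores, top_k=3)
print(kept_scores)
```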
We evaluate our method on three standard few-shot object detection benchmarks: COCO-FSOD, PASCAL VOC, and CD-FSOD. Our training-free approach achieves state-of-the-art performance across all three benchmarks, outperforming both fine-tuned and training-free methods. On COCO-FSOD and PASCAL VOC, we set new state-of-the-art results for novel classes. On CD-FSOD, which tests cross-domain generalization to diverse visual domains like aerial imagery and microscopy, we achieve the best performance among training-free methods while remaining competitive with fine-tuned approaches.
The proposed method is evaluated on the COCO-20i dataset under strict few-shot settings (10-shot and 30-shot) using the COCO-FSOD benchmark. Despite being completely training-free, it achieves state-of-the-art performance on novel classes (those overlapping with PASCAL VOC categories), outperforming methods that require fine-tuning. Qualitative results, shown in Figure 3, demonstrate strong instance segmentation in crowded scenes with fine-grained semantics and precise localization, aided by a semantic-aware soft-merging strategy that reduces duplicate detections and false positives. Failure cases are shown in Figure 12.
Method | Ft. on novel | 10-shot nAP | 10-shot nAP50 | 10-shot nAP75 | 30-shot nAP | 30-shot nAP50 | 30-shot nAP75 |
---|---|---|---|---|---|---|---|
TFA | ✓ | 10.0 | 19.2 | 9.2 | 13.5 | 24.9 | 13.2 |
FSCE | ✓ | 11.9 | — | 10.5 | 16.4 | — | 16.2 |
Retentive RCNN | ✓ | 10.5 | 19.5 | 9.3 | 13.8 | 22.9 | 13.8 |
HeteroGraph | ✓ | 11.6 | 23.9 | 9.8 | 16.5 | 31.9 | 15.5 |
Meta F. R-CNN | ✓ | 12.7 | 25.7 | 10.8 | 16.6 | 31.8 | 15.8 |
LVC | ✓ | 19.0 | 34.1 | 19.0 | 26.8 | 45.8 | 27.5 |
C. Transformer | ✓ | 17.1 | 30.2 | 17.0 | 21.4 | 35.5 | 22.1 |
NIFF | ✓ | 18.8 | — | — | 20.9 | — | — |
DiGeo | ✓ | 10.3 | 18.7 | 9.9 | 14.2 | 26.2 | 14.8 |
CD-ViTO (ViT-L) | ✓ | 35.3 | 54.9 | 37.2 | 35.9 | 54.5 | 38.0 |
FSRW | ✗ | 5.6 | 12.3 | 4.6 | 9.1 | 19.0 | 7.6 |
Meta R-CNN | ✗ | 6.1 | 19.1 | 6.6 | 9.9 | 25.3 | 10.8 |
DE-ViT (ViT-L) | ✗ | 34.0 | 53.0 | 37.0 | 34.0 | 52.9 | 37.2 |
Training-free (ours) | ✗ | 36.6 | 54.1 | 38.3 | 36.8 | 54.5 | 38.7 |
Table 1: Comparison of our training-free method against state-of-the-art approaches on the COCO-FSOD benchmark under 10-shot and 30-shot settings. Our approach achieves state-of-the-art performance without fine-tuning on novel classes (Ft. on novel). Results are reported in terms of nAP, nAP50, and nAP75 (nAP = mAP for novel classes). Competing results taken from [10]. Segmentation AP is omitted for brevity.
The PASCAL VOC dataset is split into three groups for few-shot evaluation, each containing 15 base and 5 novel classes. Following standard protocol, AP50 is reported on the novel classes. As shown in Table 2, the proposed method outperforms all prior approaches across all splits, achieving state-of-the-art results whether or not competing methods use fine-tuning.
Method | Ft. on novel | Split 1 (1-shot) | Split 1 (2-shot) | Split 1 (3-shot) | Split 1 (5-shot) | Split 1 (10-shot) | Split 2 (1-shot) | Split 2 (2-shot) | Split 2 (3-shot) | Split 2 (5-shot) | Split 2 (10-shot) | Split 3 (1-shot) | Split 3 (2-shot) | Split 3 (3-shot) | Split 3 (5-shot) | Split 3 (10-shot) | Avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FsDetView | ✓ | 25.4 | 20.4 | 37.4 | 36.1 | 42.3 | 22.9 | 21.7 | 22.6 | 25.6 | 29.2 | 32.4 | 19.0 | 29.8 | 33.2 | 39.8 | 29.2 |
TFA | ✓ | 39.8 | 36.1 | 44.7 | 55.7 | 56.0 | 23.5 | 26.9 | 34.1 | 35.1 | 39.1 | 30.8 | 34.8 | 42.8 | 49.5 | 49.8 | 39.9 |
Retentive RCNN | ✓ | 42.4 | 45.8 | 45.9 | 53.7 | 56.1 | 21.7 | 27.8 | 35.2 | 37.0 | 40.3 | 30.2 | 37.6 | 43.0 | 49.7 | 50.1 | 41.1 |
DiGeo | ✓ | 37.9 | 39.4 | 48.5 | 58.6 | 61.5 | 26.6 | 28.9 | 41.9 | 42.1 | 49.1 | 30.4 | 40.1 | 46.9 | 52.7 | 54.7 | 44.0 |
HeteroGraph | ✓ | 42.4 | 51.9 | 55.7 | 62.6 | 63.4 | 25.9 | 37.8 | 46.6 | 48.9 | 51.1 | 35.2 | 42.9 | 47.8 | 54.8 | 53.5 | 48.0 |
Meta Faster R-CNN | ✓ | 43.0 | 54.5 | 60.6 | 66.1 | 65.4 | 27.7 | 35.5 | 46.1 | 47.8 | 51.4 | 40.6 | 46.4 | 53.4 | 59.9 | 58.6 | 50.5 |
CrossTransformer | ✓ | 49.9 | 57.1 | 57.9 | 63.2 | 67.1 | 27.6 | 34.5 | 43.7 | 49.2 | 51.2 | 39.5 | 54.7 | 52.3 | 57.0 | 58.7 | 50.9 |
LVC | ✓ | 54.5 | 53.2 | 58.8 | 63.2 | 65.7 | 32.8 | 29.2 | 50.7 | 49.8 | 50.6 | 48.4 | 52.7 | 55.0 | 59.6 | 59.6 | 52.3 |
NIFF | ✓ | 62.8 | 67.2 | 68.0 | 70.3 | 68.8 | 38.4 | 42.9 | 54.0 | 56.4 | 54.0 | 56.4 | 62.1 | 61.2 | 64.1 | 63.9 | 59.4 |
Multi-Relation Det | ✗ | 37.8 | 43.6 | 51.6 | 56.5 | 58.6 | 22.5 | 30.6 | 40.7 | 43.1 | 47.6 | 31.0 | 37.9 | 43.7 | 51.3 | 49.8 | 43.1 |
DE-ViT (ViT-S/14) | ✗ | 47.5 | 64.5 | 57.0 | 68.5 | 67.3 | 43.1 | 34.1 | 49.7 | 56.7 | 60.8 | 52.5 | 62.1 | 60.7 | 61.4 | 64.5 | 56.7 |
DE-ViT (ViT-B/14) | ✗ | 56.9 | 61.8 | 68.0 | 73.9 | 72.8 | 45.3 | 47.3 | 58.2 | 59.8 | 60.6 | 58.6 | 62.3 | 62.7 | 64.6 | 67.8 | 61.4 |
DE-ViT (ViT-L/14) | ✗ | 55.4 | 56.1 | 68.1 | 70.9 | 71.9 | 43.0 | 39.3 | 58.1 | 61.6 | 63.1 | 58.2 | 64.0 | 61.3 | 64.2 | 67.3 | 60.2 |
Training-free (ours) | ✗ | 70.8 | 72.3 | 73.3 | 77.2 | 79.1 | 54.5 | 67.0 | 76.3 | 75.9 | 78.2 | 61.1 | 67.9 | 71.3 | 70.8 | 72.6 | 71.2 |
Table 2: AP50 results on the novel classes of the PASCAL VOC few-shot benchmark for 1, 2, 3, 5 and 10 shots on each of the three novel splits. Bold numbers indicate state-of-the-art. *NIFF implementation is not publicly available. Our training-free approach consistently outperforms both fine-tuned and other training-free methods across all splits.
The CD-FSOD benchmark evaluates cross-domain few-shot object detection using COCO as the source dataset and six diverse target datasets spanning various visual domains. Unlike most methods that fine-tune on labeled target data, the proposed model is entirely training-free and evaluated directly on all targets. As shown in Table 3, it achieves state-of-the-art performance among training-free methods and remains competitive with fine-tuned approaches, demonstrating strong cross-domain generalization and robustness without retraining.
Method | Ft. on novel | ArTaxOr | Clipart1k | DIOR | DeepFish | NEU-DET | UODD | Avg |
---|---|---|---|---|---|---|---|---|
1-shot | | | | | | | | |
TFA w/cos ◦ | ✓ | 3.1 | — | 8.0 | — | — | 4.4 | — |
FSCE ◦ | ✓ | 3.7 | — | 8.6 | — | — | 3.9 | — |
DeFRCN ◦ | ✓ | 3.6 | — | 9.3 | — | — | 4.5 | — |
Distill-cdfsod ◦ | ✓ | 5.1 | 7.6 | 10.5 | — | — | 5.9 | — |
ViTDeT-FT † | ✓ | 5.9 | 6.1 | 12.9 | 0.9 | 2.4 | 4.0 | 5.4 |
Detic-FT † | ✓ | 3.2 | 15.1 | 4.1 | 9.0 | 3.8 | 4.2 | 6.6 |
DE-ViT-FT † | ✓ | 10.5 | 13.0 | 14.7 | 19.3 | 0.6 | 2.4 | 10.1 |
CD-ViTO † | ✓ | 21.0 | 17.7 | 17.8 | 20.3 | 3.6 | 3.1 | 13.9 |
Meta-RCNN ◦ | ✗ | 2.8 | — | 7.8 | — | — | 3.6 | — |
Detic † | ✗ | 0.6 | 11.4 | 0.1 | 0.9 | 0.0 | 0.0 | 2.2 |
DE-ViT † | ✗ | 0.4 | 0.5 | 2.7 | 0.4 | 0.4 | 1.5 | 1.0 |
Training-free (ours) | ✗ | 28.2 | 18.9 | 14.9 | 30.5 | 5.5 | 10.0 | 18.0 |
5-shot | | | | | | | | |
TFA w/cos ◦ | ✓ | 8.8 | — | 18.1 | — | — | 8.7 | — |
FSCE ◦ | ✓ | 10.2 | — | 18.7 | — | — | 9.6 | — |
DeFRCN ◦ | ✓ | 9.9 | — | 18.9 | — | — | 9.9 | — |
Distill-cdfsod ◦ | ✓ | 12.5 | 23.3 | 19.1 | 15.5 | 16.0 | 12.2 | 16.4 |
ViTDeT-FT † | ✓ | 20.9 | 23.3 | 23.3 | 9.0 | 13.5 | 11.1 | 16.9 |
Detic-FT † | ✓ | 8.7 | 20.2 | 12.1 | 14.3 | 14.1 | 10.4 | 13.3 |
DE-ViT-FT † | ✓ | 38.0 | 38.1 | 23.4 | 21.2 | 7.8 | 5.0 | 22.3 |
CD-ViTO † | ✓ | 47.9 | 41.1 | 26.9 | 22.3 | 11.4 | 6.8 | 26.1 |
Meta-RCNN ◦ | ✗ | 8.5 | — | 17.7 | — | — | 8.8 | — |
Detic † | ✗ | 0.6 | 11.4 | 0.1 | 0.9 | 0.0 | 0.0 | 2.2 |
DE-ViT † | ✗ | 10.1 | 5.5 | 7.8 | 2.5 | 1.5 | 3.1 | 5.1 |
Training-free (ours) | ✗ | 35.7 | 24.9 | 18.5 | 29.6 | 5.2 | 20.2 | 22.4 |
10-shot | | | | | | | | |
TFA w/cos ◦ | ✓ | 14.8 | — | 20.5 | — | — | 11.8 | — |
FSCE ◦ | ✓ | 15.9 | — | 21.9 | — | — | 12.0 | — |
DeFRCN ◦ | ✓ | 15.5 | — | 22.9 | — | — | 12.1 | — |
Distill-cdfsod ◦ | ✓ | 18.1 | 27.3 | 26.5 | 15.5 | 21.1 | 14.5 | 20.5 |
ViTDeT-FT † | ✓ | 23.4 | 25.6 | 29.4 | 6.5 | 15.8 | 15.6 | 19.4 |
Detic-FT † | ✓ | 12.0 | 22.3 | 15.4 | 17.9 | 16.8 | 14.4 | 16.5 |
DE-ViT-FT † | ✓ | 49.2 | 40.8 | 25.6 | 21.3 | 8.8 | 5.4 | 25.2 |
CD-ViTO † | ✓ | 60.5 | 44.3 | 30.8 | 22.3 | 12.8 | 7.0 | 29.6 |
Meta-RCNN ◦ | ✗ | 14.0 | — | 20.6 | — | — | 11.2 | — |
Detic † | ✗ | 0.6 | 11.4 | 0.1 | 0.9 | 0.0 | 0.0 | 2.2 |
DE-ViT † | ✗ | 9.2 | 11.0 | 8.4 | 2.1 | 1.8 | 3.1 | 5.9 |
Training-free (ours) | ✗ | 35.0 | 25.9 | 16.4 | 29.6 | 5.5 | 16.0 | 21.4 |
Table 3: Performance comparison (mAP) on the CD-FSOD benchmark. The second column indicates whether methods fine-tune on novel classes. ◦ values come from Distill-cdfsod; † values from CD-ViTO. Bold denotes best performance.
Performance varies with the choice of reference images, as results depend on their quality. To measure this, the method was evaluated on COCO-20i using different random seeds. Figure 4 shows that variance decreases with more reference images: higher n-shot settings yield lower standard deviations. While 1- to 3-shot settings are more sensitive to reference selection, 5 or more shots give stable results. This suggests that certain reference images are more effective than others, and identifying optimal reference sets is a promising direction for future work.
The proposed training-free pipeline is lightweight and efficient:
Method | Time (sec/img) |
---|---|
Matcher [45] | 120.014 |
Training-free (ours) with SAM AMG | 3.5092 |
Training-free (ours) | 0.9292 |
Table 4: Time to process a single image with 20 reference classes, measured on an A100 GPU.
Table 5 highlights the effectiveness of the proposed semantic-aware soft-merging strategy, which outperforms the hard-merging and semantics-free soft-merging baselines reported in the table, as well as other aggregation variants such as covariance similarity, instance softmax, score decay, iterative mask refinement, and attention-guided global averaging. A sketch of the hard-merging baseline is given after the table.
Aggregation strategy | 10-shot nAP |
---|---|
Hard-merging (hard IoS threshold of 1) | 31.2 |
Soft-merging (without semantics) | 35.7 |
Soft-merging (with semantics) | 36.6 |
Table 5: Ablation on aggregation strategies.
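For reference, the hard-merging baseline in the first row of Table 5 corresponds to removing any mask that is fully covered by a higher-scoring one, rather than decaying its score. A sketch of that variant, under the same assumptions as the soft-merging sketch above and with illustrative names:

```python
import torch


def hard_merge(masks, scores, ios_threshold=1.0):
    """Drop masks whose IoS with a higher-scoring mask reaches the threshold."""
    order = scores.argsort(descending=True)
    masks, scores = masks[order].float(), scores[order]
    keep = torch.ones(len(scores), dtype=torch.bool)
    for m in range(len(scores)):
        for mp in range(m):
            if not keep[mp]:
                continue
            ios = (masks[m] * masks[mp]).sum() / masks[m].sum().clamp(min=1.0)
            if ios >= ios_threshold:
                keep[m] = False   # hard removal instead of soft score decay
                break
    return masks[keep].bool(), scores[keep]


masks = (torch.rand(6, 37, 37) > 0.5)
scores = torch.rand(6)
kept_masks, kept_scores = hard_merge(masks, scores)
print(len(kept_scores))
```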
We provide additional visualisations on the different datasets of the CD-FSOD benchmark.
We also provide visualisations of some failure cases on the COCO-FSOD dataset.
This work presents a novel training-free few-shot instance segmentation method that combines SAM's mask proposals with DINOv2's semantic features. The approach builds a reference-based memory bank, refines features through aggregation, and classifies new instances using cosine similarity and semantic-aware soft merging. It achieves state-of-the-art results on COCO-FSOD, PASCAL VOC, and CD-FSOD without any fine-tuning, and shows potential for semantic segmentation as well. Future directions include learning to select optimal reference images, improving DINOv2's feature localization, and exploring lightweight fine-tuning to enhance the memory bank.
@article{espinosa2025notimetotrain,
title={No time to train! Training-Free Reference-Based Instance Segmentation},
author={Miguel Espinosa and Chenhongyi Yang and Linus Ericsson and Steven McDonagh and Elliot J. Crowley},
year={2025},
journal={arXiv},
primaryclass={cs.CV},
url={https://arxiv.org/abs/2507.01300}
}