No time to train!

Training-Free Reference-Based Instance Segmentation

School of Engineering, University of Edinburgh
arXiv 2025

*Indicates Equal Contribution

Figure 1: Cross-domain 1-shot segmentation results using our training-free method on CD-FSOD benchmark. Our method directly evaluates on diverse datasets without any fine-tuning, using frozen SAMv2 and DINOv2 models. The reference set contains a single example image per class. The model then segments the entire target dataset based on the reference set. Results show: (1) generalization capabilities to out-of-distribution domains (e.g., underwater images, cartoons, microscopic textures); (2) state-of-the-art performance in 1-shot segmentation without training or domain adaptation; (3) limitations in cases with ambiguous annotations or highly similar classes (e.g., "harbor" vs. "ships" in DIOR).

Abstract

The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic segmentation paradigm, yet it still requires manual visual prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided, instead, with only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction, (2) representation aggregation, and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50) and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).

Introduction

The problem:
Collecting annotations for segmentation is resource-intensive, and while recent promptable segmentation models like SAM reduce manual effort, they still lack semantic understanding and scalability. Reference-based instance segmentation offers an alternative by using annotated reference images to guide segmentation, but current methods often require fine-tuning and struggle with generalisation. Existing attempts to integrate foundation models such as DINOv2 and SAM are hindered by computational inefficiency and poor performance on complex instance-level tasks.

The solution:
We propose a training-free, three-stage method that

  • (1) builds a memory bank of features,
  • (2) refines them through aggregation,
  • (3) infers segmentation via semantic-aware feature matching.
This approach achieves strong performance on multiple benchmarks without fine-tuning, demonstrating robust generalisation across domains with fixed hyperparameters.

Method

We employ a memory-based approach to store discriminative representations for object categories.

  1. Constructing a memory bank from reference images:
    Details

    Given a set of reference images \( \{ I_r^j \}_{j=1}^{N_r} \) and their corresponding instance masks \( \{ M_r^{j,i} \}_{j=1}^{N_r} \) for category \( i \), we extract dense feature maps \( F_r^j \in \mathbb{R}^{H' \times W' \times d} \) using a pretrained frozen encoder \( \mathcal{E} \), where \( d \) is the feature dimension and \( H', W' \) denote the spatial resolution of the feature map. The instance masks \( M_r^{j,i} \in \{0,1\}^{H' \times W'} \) are resized to match this resolution. For each category \( i \), we store the masked features:

    \[ \mathcal{F}_r^{j,i} = F_r^j \odot M_r^{j,i} \]

    where \( \odot \) denotes element-wise multiplication. These category-wise feature sets are stored in a memory bank \( \mathcal{M}_i \), which is synchronised across GPUs to ensure consistency in distributed settings.
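A minimal PyTorch sketch of this stage is given below. It assumes a frozen DINOv2-style backbone exposed as a callable `encoder` that returns an \( H' \times W' \times d \) feature map; the function and variable names are illustrative rather than the released implementation.

```python
# Sketch of stage (1): memory bank construction.
# Assumption: `encoder` is a frozen DINOv2-style backbone mapping a (1, 3, H, W)
# image tensor to a (1, H', W', d) feature map. Names are illustrative.
from collections import defaultdict

import torch
import torch.nn.functional as F


@torch.no_grad()
def build_memory_bank(encoder, ref_images, ref_masks):
    """ref_images: list of (3, H, W) image tensors.
    ref_masks: list of dicts mapping category_id -> (num_instances, H, W) binary masks."""
    memory_bank = defaultdict(list)  # category_id -> list of masked feature maps
    for image, masks_per_cat in zip(ref_images, ref_masks):
        feats = encoder(image.unsqueeze(0))[0]                 # (H', W', d)
        h, w, _ = feats.shape
        for cat_id, inst_masks in masks_per_cat.items():
            # Resize instance masks to the feature resolution (nearest keeps them binary).
            m = F.interpolate(inst_masks[None].float(), size=(h, w), mode="nearest")[0]
            # Masked features F_r ⊙ M_r for every instance of this category.
            masked = feats.unsqueeze(0) * m.unsqueeze(-1)      # (num_instances, H', W', d)
            memory_bank[cat_id].append(masked)
    return memory_bank
```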

  2. Refine representations via two-stage feature aggregation:
    Details

    To construct category prototypes, we first compute instance-wise feature representations, and then aggregate them into class-wise prototypes:

    1. Instance-wise prototypes:
      Each instance \( k \) in reference image \( I_r^j \) has its own prototype, computed by averaging the feature embeddings within its corresponding mask:

      \[ P_r^{j,k} = \frac{1}{\| M_r^{j,k} \|_1} \sum_{(u,v)} M_r^{j,k}(u,v) F_r^j(u,v) \]

      where \( P_r^{j,k} \in \mathbb{R}^d \) represents the mean feature representation of the \( k \)-th instance in image \( I_r^j \).

    2. Class-wise prototype:
      We compute the category prototype \( P_i \) by averaging all instance-wise prototypes belonging to the same category \( i \):

      \[ P_i = \frac{1}{N_i} \sum_{j=1}^{N_r} \sum_{k \in \mathcal{K}_i^j} P_r^{j,k}, \]

      where \( \mathcal{K}_i^j \) is the set of instances in image \( I_r^j \) that belong to category \( i \), and \( N_i = \sum_{j=1}^{N_r} |\mathcal{K}_i^j| \) is the total number of instances belonging to category \( i \). These class-wise prototypes \( P_i \) are stored in the memory bank.
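The two aggregation levels can be written compactly as below. This is a sketch under the same assumptions as the previous snippet (feature maps and masks already at the same resolution), not the released code.

```python
# Sketch of stage (2): instance-wise prototypes averaged into a class-wise prototype.
import torch


def instance_prototype(feats, mask):
    """feats: (H', W', d) feature map; mask: (H', W') binary instance mask.
    Returns the (d,) mean feature over the masked region, i.e. P_r^{j,k}."""
    m = mask.float()
    return (feats * m.unsqueeze(-1)).sum(dim=(0, 1)) / m.sum().clamp(min=1.0)


def class_prototype(instance_protos):
    """instance_protos: list of (d,) tensors for one category, pooled over all
    reference images. Returns the (d,) class-wise prototype P_i."""
    return torch.stack(instance_protos, dim=0).mean(dim=0)
```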

  3. Performing inference on the target images through feature matching and semantic-aware soft merging:
    Details

    For a target image \( I_t \), we extract dense features \( F_t \in \mathbb{R}^{H' \times W' \times d} \) using the same encoder \( \mathcal{E} \). We use the frozen SAM model to generate \( N_m \) candidate instance masks \( \{ M_t^m \}_{m=1}^{N_m} \), where \( M_t^m \in \{0,1\}^{H' \times W'} \). Each mask \( M_t^m \) is used to compute a feature representation via average pooling and L2 normalisation:

    \[ P_t^m = \frac{1}{\| M_t^m \|_1} \sum_{(u,v)} M_t^m(u,v) F_t(u,v), \quad \hat{P}_t^m = \frac{P_t^m}{\| P_t^m \|_2} \]

    where \( \hat{P}_t^m \in \mathbb{R}^d \) is the normalised mask feature.

    To classify each candidate mask, we compute:

    1. Feature Matching:
      We compute the cosine similarity between \( \hat{P}_t^m \) and category prototypes \( P_i \):

      \[ S_t^m = \max_i \left( \frac{\hat{P}_t^m \cdot P_i}{\| P_i \|_2} \right) \]

      which provides the classification score \( S_t^m \) for mask \( M_t^m \).

    2. Semantic-Aware Soft Merging:
      To handle overlapping predictions, we introduce a novel soft merging strategy. Given two masks \( M_t^m \) and \( M_t^{m'} \) of the same category, we compute their intersection-over-self (IoS):

      \[ \text{IoS}(M_t^m, M_t^{m'}) = \frac{\sum (M_t^m \cap M_t^{m'})}{\sum M_t^m} \]

      and weight it by feature similarity:

      \[ w_{m,m'} = \frac{\hat{P}_t^m \cdot \hat{P}_t^{m'}}{\|\hat{P}_t^{m'}\|_2} \]

      The final score for each mask is adjusted using a decay factor:

      \[ S_t^m \leftarrow S_t^m \cdot \sqrt{(1 - \text{IoS}(M_t^m, M_t^{m'}) w_{m,m'})} \]

      reducing redundant detections while preserving distinct instances that may partially overlap. Finally, we rank masks by their adjusted scores, and select the top-\( K \) predictions as the final output.
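A hedged sketch of the full inference stage follows, written as plain loops for readability (a vectorised view of the same computation is sketched in the Efficiency section). Here `sam_masks` are assumed to be SAM proposals already resized to the feature resolution, and `top_k` is an illustrative parameter; this is not the released implementation.

```python
# Sketch of stage (3): cosine matching against class prototypes followed by
# semantic-aware soft merging. Illustrative only; the released code may differ.
import torch
import torch.nn.functional as F


@torch.no_grad()
def classify_and_merge(feat_map, sam_masks, class_protos, top_k=100):
    """feat_map: (H', W', d); sam_masks: (N_m, H', W') binary; class_protos: (C, d)."""
    m = sam_masks.float()
    # Average-pool features inside each proposal, then L2-normalise (P̂_t^m).
    areas = m.sum(dim=(1, 2)).clamp(min=1.0)                     # (N_m,)
    pooled = torch.einsum("nhw,hwd->nd", m, feat_map) / areas[:, None]
    p_hat = F.normalize(pooled, dim=-1)                          # (N_m, d)
    protos = F.normalize(class_protos, dim=-1)                   # (C, d)

    sims = p_hat @ protos.T                                      # (N_m, C) cosine similarities
    scores, labels = sims.max(dim=-1)                            # S_t^m and predicted category

    # Decay the score of a lower-ranked mask that is largely covered by a
    # higher-ranked mask of the same class with a similar feature.
    order = scores.argsort(descending=True)
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if labels[a] != labels[b]:
                continue
            ios = (m[b] * m[a]).sum() / areas[b]                 # IoS of the lower-ranked mask
            w = (p_hat[a] * p_hat[b]).sum().clamp(min=0.0)       # feature-similarity weight
            scores[b] = scores[b] * torch.sqrt((1.0 - ios * w).clamp(min=0.0))

    keep = scores.argsort(descending=True)[:top_k]
    return scores[keep], labels[keep], keep
```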

Method Overview
Figure 2: Overview of our training-free method for few-shot instance segmentation and object detection. We (1) create a reference memory bank using DINOv2 features from segmented images, (2) aggregate them into class prototypes, and (3) perform inference by matching target image features to the memory bank via semantic-aware soft merging.

Results

We evaluate our method on three standard few-shot object detection benchmarks: COCO-FSOD, PASCAL VOC, and CD-FSOD. On COCO-FSOD and PASCAL VOC, our training-free approach sets new state-of-the-art results for novel classes, outperforming both fine-tuned and training-free methods. On CD-FSOD, which tests cross-domain generalization to diverse visual domains such as aerial imagery and microscopy, we achieve the best performance among training-free methods while remaining competitive with fine-tuned approaches.


COCO-FSOD Benchmark

The proposed method is evaluated on the COCO-20i dataset under strict few-shot settings (10-shot and 30-shot) using the COCO-FSOD benchmark. Despite being completely training-free, it achieves state-of-the-art performance on the novel classes (those overlapping with PASCAL VOC categories), outperforming methods that require fine-tuning. Qualitative results, shown in Figure 3, demonstrate strong instance segmentation in crowded scenes with fine-grained semantics and precise localization, aided by a semantic-aware soft-merging strategy that reduces duplicate detections and false positives. Failure cases are shown in Figure 12.

Figure 3: Qualitative results on the COCO val2017 set under the 10-shot setting, illustrating the method's ability to handle crowded scenes with overlapping instances, fine-grained semantics, and precise localization. The use of semantic-aware soft merging helps reduce duplicate detections and false positives.
COCO-FSOD Comparison Table

| Method | Ft. on novel | 10-shot nAP | 10-shot nAP50 | 10-shot nAP75 | 30-shot nAP | 30-shot nAP50 | 30-shot nAP75 |
|---|---|---|---|---|---|---|---|
| TFA | | 10.0 | 19.2 | 9.2 | 13.5 | 24.9 | 13.2 |
| FSCE | | 11.9 | – | 10.5 | 16.4 | – | 16.2 |
| Retentive RCNN | | 10.5 | 19.5 | 9.3 | 13.8 | 22.9 | 13.8 |
| HeteroGraph | | 11.6 | 23.9 | 9.8 | 16.5 | 31.9 | 15.5 |
| Meta F. R-CNN | | 12.7 | 25.7 | 10.8 | 16.6 | 31.8 | 15.8 |
| LVC | | 19.0 | 34.1 | 19.0 | 26.8 | 45.8 | 27.5 |
| C. Transformer | | 17.1 | 30.2 | 17.0 | 21.4 | 35.5 | 22.1 |
| NIFF | | 18.8 | – | – | 20.9 | – | – |
| DiGeo | | 10.3 | 18.7 | 9.9 | 14.2 | 26.2 | 14.8 |
| CD-ViTO (ViT-L) | | 35.3 | 54.9 | 37.2 | 35.9 | 54.5 | 38.0 |
| FSRW | | 5.6 | 12.3 | 4.6 | 9.1 | 19.0 | 7.6 |
| Meta R-CNN | | 6.1 | 19.1 | 6.6 | 9.9 | 25.3 | 10.8 |
| DE-ViT (ViT-L) | | 34.0 | 53.0 | 37.0 | 34.0 | 52.9 | 37.2 |
| Training-free (ours) | | 36.6 | 54.1 | 38.3 | 36.8 | 54.5 | 38.7 |

Table 1: Comparison of our training-free method against state-of-the-art approaches on the COCO-FSOD benchmark under 10-shot and 30-shot settings. Our approach achieves state-of-the-art performance without fine-tuning on novel classes (Ft. on novel). Results are reported in terms of nAP, nAP50, and nAP75 (nAP = mAP for novel classes). Competing results taken from [10]. Segmentation AP is omitted for brevity.


PASCAL VOC Few-Shot Benchmark

The PASCAL-VOC dataset is split into three groups for few-shot evaluation, each containing 15 base and 5 novel classes. Following standard protocol, AP50 is reported on the novel classes. As shown in Table 2, the proposed method outperforms all prior approaches across all splits, achieving state-of-the-art results regardless of whether competing methods use fine-tuning or not.

PASCAL VOC Few-Shot Benchmark Comparison Table

| Method | Ft. on novel | Novel Split 1 (1 / 2 / 3 / 5 / 10-shot) | Novel Split 2 (1 / 2 / 3 / 5 / 10-shot) | Novel Split 3 (1 / 2 / 3 / 5 / 10-shot) | Avg |
|---|---|---|---|---|---|
| FsDetView | | 25.4 / 20.4 / 37.4 / 36.1 / 42.3 | 22.9 / 21.7 / 22.6 / 25.6 / 29.2 | 32.4 / 19.0 / 29.8 / 33.2 / 39.8 | 29.2 |
| TFA | | 39.8 / 36.1 / 44.7 / 55.7 / 56.0 | 23.5 / 26.9 / 34.1 / 35.1 / 39.1 | 30.8 / 34.8 / 42.8 / 49.5 / 49.8 | 39.9 |
| Retentive RCNN | | 42.4 / 45.8 / 45.9 / 53.7 / 56.1 | 21.7 / 27.8 / 35.2 / 37.0 / 40.3 | 30.2 / 37.6 / 43.0 / 49.7 / 50.1 | 41.1 |
| DiGeo | | 37.9 / 39.4 / 48.5 / 58.6 / 61.5 | 26.6 / 28.9 / 41.9 / 42.1 / 49.1 | 30.4 / 40.1 / 46.9 / 52.7 / 54.7 | 44.0 |
| HeteroGraph | | 42.4 / 51.9 / 55.7 / 62.6 / 63.4 | 25.9 / 37.8 / 46.6 / 48.9 / 51.1 | 35.2 / 42.9 / 47.8 / 54.8 / 53.5 | 48.0 |
| Meta Faster R-CNN | | 43.0 / 54.5 / 60.6 / 66.1 / 65.4 | 27.7 / 35.5 / 46.1 / 47.8 / 51.4 | 40.6 / 46.4 / 53.4 / 59.9 / 58.6 | 50.5 |
| CrossTransformer | | 49.9 / 57.1 / 57.9 / 63.2 / 67.1 | 27.6 / 34.5 / 43.7 / 49.2 / 51.2 | 39.5 / 54.7 / 52.3 / 57.0 / 58.7 | 50.9 |
| LVC | | 54.5 / 53.2 / 58.8 / 63.2 / 65.7 | 32.8 / 29.2 / 50.7 / 49.8 / 50.6 | 48.4 / 52.7 / 55.0 / 59.6 / 59.6 | 52.3 |
| NIFF | | 62.8 / 67.2 / 68.0 / 70.3 / 68.8 | 38.4 / 42.9 / 54.0 / 56.4 / 54.0 | 56.4 / 62.1 / 61.2 / 64.1 / 63.9 | 59.4 |
| Multi-Relation Det | | 37.8 / 43.6 / 51.6 / 56.5 / 58.6 | 22.5 / 30.6 / 40.7 / 43.1 / 47.6 | 31.0 / 37.9 / 43.7 / 51.3 / 49.8 | 43.1 |
| DE-ViT (ViT-S/14) | | 47.5 / 64.5 / 57.0 / 68.5 / 67.3 | 43.1 / 34.1 / 49.7 / 56.7 / 60.8 | 52.5 / 62.1 / 60.7 / 61.4 / 64.5 | 56.7 |
| DE-ViT (ViT-B/14) | | 56.9 / 61.8 / 68.0 / 73.9 / 72.8 | 45.3 / 47.3 / 58.2 / 59.8 / 60.6 | 58.6 / 62.3 / 62.7 / 64.6 / 67.8 | 61.4 |
| DE-ViT (ViT-L/14) | | 55.4 / 56.1 / 68.1 / 70.9 / 71.9 | 43.0 / 39.3 / 58.1 / 61.6 / 63.1 | 58.2 / 64.0 / 61.3 / 64.2 / 67.3 | 60.2 |
| Training-free (ours) | | 70.8 / 72.3 / 73.3 / 77.2 / 79.1 | 54.5 / 67.0 / 76.3 / 75.9 / 78.2 | 61.1 / 67.9 / 71.3 / 70.8 / 72.6 | 71.2 |

Table 2: AP50 results on the novel classes of the Pascal VOC few-shot benchmark. Bold numbers indicate state-of-the-art. *NIFF implementation is not publicly available. Our training-free approach consistently outperforms both finetuned and other training-free methods across all splits.


Cross-Domain Few-Shot Object Detection

The CD-FSOD benchmark evaluates cross-domain few-shot object detection using COCO as the source dataset and six diverse target datasets spanning various visual domains. Unlike most methods that fine-tune on labeled target data, the proposed model is entirely training-free and evaluated directly on all targets. As shown in Table 3, it achieves state-of-the-art performance among training-free methods and remains competitive with fine-tuned approaches, demonstrating strong cross-domain generalization and robustness without retraining.

Figure 4: Cross-domain 5-shot segmentation results using our training-free method. Our approach is evaluated on diverse datasets spanning multiple domains, including aerial, underwater, microscopic, and cartoon imagery, without requiring fine-tuning. Results demonstrate the robustness and generalisability of our method.
Cross-Domain Few-Shot Object Detection Comparison Table

1-shot

| Method | Ft. on novel | ArTaxOr | Clipart1k | DIOR | DeepFish | NEU-DET | UODD | Avg |
|---|---|---|---|---|---|---|---|---|
| TFA w/cos ◦ | | 3.1 | – | 8.0 | – | – | 4.4 | – |
| FSCE ◦ | | 3.7 | – | 8.6 | – | – | 3.9 | – |
| DeFRCN ◦ | | 3.6 | – | 9.3 | – | – | 4.5 | – |
| Distill-cdfsod ◦ | | 5.1 | 7.6 | 10.5 | – | – | 5.9 | – |
| ViTDeT-FT † | | 5.9 | 6.1 | 12.9 | 0.9 | 2.4 | 4.0 | 5.4 |
| Detic-FT † | | 3.2 | 15.1 | 4.1 | 9.0 | 3.8 | 4.2 | 6.6 |
| DE-ViT-FT † | | 10.5 | 13.0 | 14.7 | 19.3 | 0.6 | 2.4 | 10.1 |
| CD-ViTO † | | 21.0 | 17.7 | 17.8 | 20.3 | 3.6 | 3.1 | 13.9 |
| Meta-RCNN ◦ | | 2.8 | – | 7.8 | – | – | 3.6 | – |
| Detic † | | 0.6 | 11.4 | 0.1 | 0.9 | 0.0 | 0.0 | 2.2 |
| DE-ViT † | | 0.4 | 0.5 | 2.7 | 0.4 | 0.4 | 1.5 | 1.0 |
| Training-free (ours) | | 28.2 | 18.9 | 14.9 | 30.5 | 5.5 | 10.0 | 18.0 |

5-shot

| Method | Ft. on novel | ArTaxOr | Clipart1k | DIOR | DeepFish | NEU-DET | UODD | Avg |
|---|---|---|---|---|---|---|---|---|
| TFA w/cos ◦ | | 8.8 | – | 18.1 | – | – | 8.7 | – |
| FSCE ◦ | | 10.2 | – | 18.7 | – | – | 9.6 | – |
| DeFRCN ◦ | | 9.9 | – | 18.9 | – | – | 9.9 | – |
| Distill-cdfsod ◦ | | 12.5 | 23.3 | 19.1 | 15.5 | 16.0 | 12.2 | 16.4 |
| ViTDeT-FT † | | 20.9 | 23.3 | 23.3 | 9.0 | 13.5 | 11.1 | 16.9 |
| Detic-FT † | | 8.7 | 20.2 | 12.1 | 14.3 | 14.1 | 10.4 | 13.3 |
| DE-ViT-FT † | | 38.0 | 38.1 | 23.4 | 21.2 | 7.8 | 5.0 | 22.3 |
| CD-ViTO † | | 47.9 | 41.1 | 26.9 | 22.3 | 11.4 | 6.8 | 26.1 |
| Meta-RCNN ◦ | | 8.5 | – | 17.7 | – | – | 8.8 | – |
| Detic † | | 0.6 | 11.4 | 0.1 | 0.9 | 0.0 | 0.0 | 2.2 |
| DE-ViT † | | 10.1 | 5.5 | 7.8 | 2.5 | 1.5 | 3.1 | 5.1 |
| Training-free (ours) | | 35.7 | 24.9 | 18.5 | 29.6 | 5.2 | 20.2 | 22.4 |

10-shot

| Method | Ft. on novel | ArTaxOr | Clipart1k | DIOR | DeepFish | NEU-DET | UODD | Avg |
|---|---|---|---|---|---|---|---|---|
| TFA w/cos ◦ | | 14.8 | – | 20.5 | – | – | 11.8 | – |
| FSCE ◦ | | 15.9 | – | 21.9 | – | – | 12.0 | – |
| DeFRCN ◦ | | 15.5 | – | 22.9 | – | – | 12.1 | – |
| Distill-cdfsod ◦ | | 18.1 | 27.3 | 26.5 | 15.5 | 21.1 | 14.5 | 20.5 |
| ViTDeT-FT † | | 23.4 | 25.6 | 29.4 | 6.5 | 15.8 | 15.6 | 19.4 |
| Detic-FT † | | 12.0 | 22.3 | 15.4 | 17.9 | 16.8 | 14.4 | 16.5 |
| DE-ViT-FT † | | 49.2 | 40.8 | 25.6 | 21.3 | 8.8 | 5.4 | 25.2 |
| CD-ViTO † | | 60.5 | 44.3 | 30.8 | 22.3 | 12.8 | 7.0 | 29.6 |
| Meta-RCNN ◦ | | 14.0 | – | 20.6 | – | – | 11.2 | – |
| Detic † | | 0.6 | 11.4 | 0.1 | 0.9 | 0.0 | 0.0 | 2.2 |
| DE-ViT † | | 9.2 | 11.0 | 8.4 | 2.1 | 1.8 | 3.1 | 5.9 |
| Training-free (ours) | | 35.0 | 25.9 | 16.4 | 29.6 | 5.5 | 16.0 | 21.4 |

Table 3: Performance comparison (mAP) on the CD-FSOD benchmark. The second column indicates whether methods fine-tune on novel classes. ◦ values come from Distill-cdfsod; † values from CD-ViTO. Bold denotes best performance.

Ablations


Variance in reference set

Performance varies with the choice of reference images, as results depend on their quality. To measure this, the method was evaluated on COCO-20i using different random seeds. Figure 5 shows that variance decreases with more reference images: higher n-shot settings yield lower standard deviations. While 1-3 shots show greater sensitivity to reference selection, 5+ shots offer stable results. This suggests that certain reference images are more effective than others, and identifying optimal reference sets is a promising direction for future work.

Figure 5: Performance variance on the COCO-20i benchmark across different n-shot settings. Lower shot counts (1-3 shots) lead to higher variability due to reliance on reference image selection; as the number of shots increases, variance decreases, indicating the robustness of the training-free method to changes in the reference set.

Efficiency

The proposed training-free pipeline is lightweight and efficient:

  • memory bank construction is done once at 0.1 seconds per image
  • semantic matching is fast at 0.0003 seconds per image using parallel cosine similarity
  • soft merging runs at 0.006 seconds per image via a parallel NMS implementation
As shown in Table 4, the method outperforms Matcher and accelerates SAM's automatic mask generation by 3x through optimized point sampling, faster mask filtering, and reduced post-processing.
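Both matching and merging reduce to batched matrix products on the GPU, which is where most of the speed comes from. Below is a hedged sketch of one way to compute all pairwise IoS values and feature similarities in a single pass; the function name and layout are illustrative, not the released kernels.

```python
# Vectorised pairwise IoS and cosine similarity for the soft-merging step.
import torch


def pairwise_ios_and_sim(masks, feats_hat):
    """masks: (N, H', W') binary SAM proposals; feats_hat: (N, d) L2-normalised
    mask features. Returns (N, N) IoS and feature-similarity matrices."""
    flat = masks.float().flatten(1)                  # (N, H'*W')
    inter = flat @ flat.T                            # pairwise intersection areas
    area = flat.sum(dim=1, keepdim=True)             # (N, 1) mask areas
    ios = inter / area.clamp(min=1.0)                # IoS[i, j] = |M_i ∩ M_j| / |M_i|
    sim = feats_hat @ feats_hat.T                    # all cosine similarities in one matmul
    return ios, sim
```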

Efficiency Comparison Table

| Method | Time (sec/img) |
|---|---|
| Matcher [45] | 120.014 |
| Training-free (ours) with SAM AMG | 3.5092 |
| Training-free (ours) | 0.9292 |

Table 4: Time to process one image, with 20 reference classes, on an A100 GPU.


Aggregation strategies

Table 5 highlights the effectiveness of the proposed semantic-aware soft-merging strategy, which outperforms alternative aggregation variants such as covariance similarity, instance softmax, score decay, iterative mask refinement, and attention-guided global averaging.

Aggregation Strategy Table

| Aggregation strategy | 10-shot nAP |
|---|---|
| Hard-merging (hard IoS threshold of 1) | 31.2 |
| Soft-merging (without semantics) | 35.7 |
| Soft-merging (with semantics) | 36.6 |

Table 5: Ablation on aggregation strategies.
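For concreteness, the three variants in Table 5 differ only in how the score of a lower-ranked mask is adjusted given its overlap with a higher-ranked mask of the same class. A hedged sketch, with illustrative names and threshold values:

```python
# Score adjustment under the three aggregation strategies of Table 5.
import torch


def adjust_score(score, ios, w, strategy="soft_semantic", hard_thresh=1.0):
    """score: score of the lower-ranked mask; ios: its intersection-over-self with a
    higher-ranked mask of the same class; w: feature similarity between the two masks.
    All inputs are 0-dim tensors."""
    if strategy == "hard":           # hard-merging: suppress fully covered duplicates
        return torch.zeros(()) if ios >= hard_thresh else score
    if strategy == "soft":           # soft-merging without semantics
        return score * torch.sqrt((1.0 - ios).clamp(min=0.0))
    # semantic-aware soft-merging: decay weighted by feature similarity
    return score * torch.sqrt((1.0 - ios * w).clamp(min=0.0))
```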

Visualisations


CD-FSOD

We provide more visualisations on the different datasets in the CD-FSOD benchmark.

CD-FSOD Visualisations

Figure 6: 5-shot results on the DIOR dataset.

Figure 7: 5-shot results on the CLIPART dataset.

Figure 8: 5-shot results on the ARTAXOR dataset.

Figure 9: 5-shot results on the DEEPFISH dataset.

Figure 10: 5-shot results on the NEU-DET dataset.

Figure 11: 5-shot results on the UODD dataset.

Failure cases

We provide visualisations of some of the failure cases on the COCO-FSOD benchmark.

COCO-FSOD Failure Cases
Figure 12: We show failure cases on the COCO val2017 set under the 10-shot setting. The method sometimes confuses semantically similar classes (e.g., bread vs. hot dog, armchair vs. couch), misses small or fine objects, and struggles with complete instance detection in highly crowded scenes.

Conclusion

This work presents a novel training-free few-shot instance segmentation method that combines SAM's mask proposals with DINOv2's semantic features. The approach builds a reference-based memory bank, refines features through aggregation, and classifies new instances using cosine similarity and semantic-aware soft merging. It achieves state-of-the-art results on COCO-FSOD, PASCAL VOC, and CD-FSOD without any fine-tuning, and shows potential for semantic segmentation as well. Future directions include learning to select optimal reference images, improving DINOv2's feature localization, and exploring lightweight fine-tuning to enhance the memory bank.

BibTeX

@article{espinosa2025notimetotrain,
  title={No time to train! Training-Free Reference-Based Instance Segmentation},
  author={Miguel Espinosa and Chenhongyi Yang and Linus Ericsson and Steven McDonagh and Elliot J. Crowley},
  year={2025},
  journal={arXiv},
  primaryclass={cs.CV},
  url={https://arxiv.org/abs/2507.01300}
}