NearID: Identity Representation Learning
via Near-Identity Distractors

Aleksandar Cvejic1, Rameen Abdal2,†, Abdelrahman Eldesokey1, Bernard Ghanem1, Peter Wonka1

1King Abdullah University of Science and Technology (KAUST)    2Snap Research

† Served in advisory capacity.

Paper (arXiv) · Code · Model · Dataset

Current vision encoders confuse context for identity. So do the metrics built on them. NearID solves both.

  • SSR: 99.2% vs. 30.7% baseline
  • PA: 99.7% vs. 48.8% baseline
  • M-H: 0.545 human alignment
  • M-O: 0.465 oracle alignment
  • Only 15M trainable parameters
  • ~324x cheaper than VLM evaluation
NearID teaser: three representation learning paradigms

NearID: From Context to Identity. (Left) Traditional representations entangle object identity with background context. (Middle) Synthetic data lacks explicit control over visually similar distractors. (Right) NearID introduces matched-context distractors to remove contextual shortcuts and isolate intrinsic identity signals.

Method Frozen SigLIP2 + 15M-param MAP head reshapes similarity geometry
NearID method: frozen encoder + trained MAP head

Left: In the pretrained manifold, positives can be misaligned (d_i < d_j). Right: The MAP head boosts true positives (green) and suppresses distractors (red). Multi-view positives are produced via depth-conditioned generation and cross-view feature warping from Objaverse 3D assets (SynCD pipeline, illustrated for completeness).
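As a concrete sketch of the attention-pooling idea behind a MAP head: a small module with a learnable query attends over frozen patch tokens and emits a unit-norm identity embedding. The single-head simplification, the dimensions, and all names here are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MAPHead:
    """Hypothetical single-head attention-pooling (MAP) head sketch.

    A learnable query attends over frozen encoder patch tokens; a linear
    projection then produces the identity embedding. Dimensions are
    illustrative, not the paper's configuration.
    """
    def __init__(self, dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.q = rng.standard_normal(dim) / np.sqrt(dim)        # learnable query
        self.Wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wo = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, tokens):
        # tokens: (num_patches, dim) frozen backbone features
        k, v = tokens @ self.Wk, tokens @ self.Wv
        attn = softmax(k @ self.q / np.sqrt(k.shape[-1]))       # (num_patches,)
        pooled = attn @ v                                       # attention-weighted pool
        z = pooled @ self.Wo
        return z / np.linalg.norm(z)                            # unit-norm embedding
```

Because only the head's parameters are trained, the backbone's features stay fixed and the head merely reshapes the similarity geometry on top of them.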

📊 Main Results +68% SSR, +51% PA over frozen SigLIP2; human-aligned on DB++
Scoring Model   NearID                           MTG                                                           DB++
                SSR↑            PA↑              M-O↑            M-O_pair↑       SSR↑          PA↑             M-H↑
Qwen3VL 30B     49.73           69.20            0.219           0.329           17.0          26.0            --
CLIP            10.31           20.92            0.239           0.484           0.0           0.0             0.493
DINOv2          20.43           34.55            0.324           0.519           0.0           0.0             0.492
VSM*            32.13           46.70            0.394           0.445           7.0           24.5            0.190
SigLIP2         30.74           48.81            0.180           0.366           0.0           0.0             0.516
NearID (Ours)   99.17 (+68.43)  99.71 (+50.90)   0.465 (+0.285)  0.486 (+0.120)  35.0 (+35.0)  46.5 (+46.5)    0.545 (+0.029)

Top rows: existing embedding and VLM methods under the near-identity protocol. Bottom row: NearID achieves 99.17% SSR (vs. SigLIP2's 30.74%) while improving MTG oracle alignment (M-O: 0.465 vs. 0.180) and DB++ human alignment (M-H: 0.545 vs. 0.516); deltas in parentheses are relative to the frozen SigLIP2 baseline. * VSM was trained directly on MTG. -- Qwen3VL's DB++ evaluation was skipped due to computational cost (~324x that of embedding-based evaluation).

Paradigm Contrastive framework with explicit structured near-identity distractors
SimCLR vs StableRep vs NearID contrastive paradigms

(A) SimCLR: augmented copies of one real image. (B) StableRep: multiple diffusion samples from the same text prompt. (C) NearID (Ours): 3D-consistent multi-view images of the same identity, produced via depth-conditioned generation and cross-view feature warping from Objaverse 3D assets (SynCD); the structured negative is a different but visually similar instance inpainted into the same background.

(A) SimCLR

Strong stochastic augmentations of one real image form the positive pair. Learns pixel-level invariance; no semantic distractors are present.

(B) StableRep

A diffusion model generates multiple images from the same text caption. Learns caption-level invariance but cannot distinguish intra-class instances.

(C) NearID (Ours)

Positives are 3D-consistent multi-view images of the same object identity, produced via depth-conditioned diffusion generation and cross-view feature warping from Objaverse 3D assets (SynCD). Near-identity distractors – a different but visually similar instance inpainted into the exact same background – are the key negative signal absent from both SimCLR and StableRep.
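The role of the structured negative can be sketched as an InfoNCE-style objective in which the matched-context distractor appears as an explicit hard negative alongside the multi-view positives. This is an illustrative sketch under that assumption; the paper's actual loss formulation may differ.

```python
import numpy as np

def info_nce_with_distractor(anchor, positives, distractor, temp=0.07):
    """InfoNCE-style loss with a matched-context distractor as an explicit
    hard negative. All inputs are unit-norm embeddings:
    anchor (D,), positives (P, D), distractor (D,). Hypothetical sketch."""
    pos_sims = positives @ anchor / temp            # similarity to each positive view
    neg_sim = distractor @ anchor / temp            # similarity to the hard negative
    logits = np.concatenate([pos_sims, [neg_sim]])
    log_denom = np.log(np.exp(logits).sum())
    # -log p(positive) averaged over the multi-view positives
    return float(np.mean(log_denom - pos_sims))
```

By construction the loss is smaller when the distractor is far from the anchor than when it sits on top of it, which is exactly the pressure that pushes matched-context distractors away in embedding space.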

🎬 KPCA Evolution Positives cluster, distractors separate over 3300 training steps

What to look for

In the SigLIP2 baseline (frozen), positives and near-identity distractors of the same identity are interleaved – the embedding cannot tell them apart.

As NearID training progresses, same-identity positives (•) collapse into tight clusters while distractors (✖) are pushed to distinct regions – even though they share the exact same background as the anchor.

  • Step 0: SigLIP2 frozen – distractors and positives overlap
  • Steps 100–900: identity clusters begin to form
  • Steps 1000–2500: distractors progressively rejected
  • Step 3300 (final): tight positive clusters, clear separation

7 identities, val split. KernelPCA (cosine), Procrustes-aligned to final frame, global axis limits. 35 frames at 180 ms/frame.
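The visualization recipe stated above (KernelPCA with a cosine kernel, Procrustes-aligned to a reference frame) can be sketched in a few lines of numpy. This is a generic reimplementation of those two steps, not the authors' plotting code.

```python
import numpy as np

def cosine_kpca_2d(X):
    """Kernel PCA with a cosine kernel, projected to 2-D.
    X: (n, d) embeddings; returns (n, 2) coordinates."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = Xn @ Xn.T                                   # cosine kernel matrix
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                                  # double-center the kernel
    vals, vecs = np.linalg.eigh(Kc)
    top = np.argsort(vals)[::-1][:2]                # two leading components
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

def procrustes_align(Y, Y_ref):
    """Orthogonal Procrustes: rotate/reflect Y onto Y_ref so projections
    from different training steps share one orientation."""
    U, _, Vt = np.linalg.svd(Y.T @ Y_ref)
    return Y @ U @ Vt
```

Aligning every frame to the final one (as the figure does) just means calling `procrustes_align(frame, final_frame)` for each step before plotting with shared axis limits.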

KernelPCA timelapse: SigLIP2 baseline through NearID step 3300

KernelPCA summary: SigLIP2 / step 1600 / NearID step 3300

Static keyframes. Left: SigLIP2 frozen – positives (•) and distractors (✖) cluster indistinguishably. Center: Step 1600 – identity structure emerges. Right: NearID final (step 3300) – compact positive clusters, distractors pushed away. Same axis limits and alignment across all panels.

🖼 NearID Qualitative Anchor | Positives | Distractors – S = SigLIP2, N = NearID
🔍 MTG Qualitative Part-level edits – O = Oracle, red = edited region
👁 Attention MAP head re-focuses on identity-discriminative regions
Attention maps on NearID dataset

NearID dataset. Left: input. Center: SigLIP2 backbone attention (before training). Right: NearID attention after training – re-focused onto identity-discriminative parts.

Positive vs confound attention split

Positive vs. confound. NearID attention concentrates on surface details of the positive (left) while spreading diffusely over the confound (right), despite identical background context.
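Heatmaps like these can be produced directly from the pooling attention: the per-patch weights of a learnable query are normalized and reshaped onto the patch grid. A minimal sketch, assuming a single query and a hypothetical 16x16 patch grid:

```python
import numpy as np

def attention_heatmap(tokens, query, grid=(16, 16)):
    """Render per-patch attention weights as a 2-D heatmap.
    tokens: (num_patches, dim) features; query: (dim,) pooling query.
    The 16x16 grid is an illustrative assumption."""
    logits = tokens @ query / np.sqrt(tokens.shape[-1])
    w = np.exp(logits - logits.max())               # stable softmax
    w /= w.sum()
    return w.reshape(grid)                          # weights sum to 1
```

Overlaying the (upsampled) heatmap on the input image then shows which regions the head treats as identity-discriminative.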

DreamBench++ Per-category M-H – comparable human alignment at ~324x lower cost than VLM evaluation
DreamBench++ per-category Pearson correlation bar chart

M-H by category. NearID (red) leads on Object (0.549) and ranks competitively on Animal, Human, and Style. GPT-4o (orange) leads on Object but requires a costly closed-source API call per image pair. NearID achieves comparable performance with a 15M-param MAP head at <1 ms per embedding.

📈 Score Distributions NearID tracks Oracle; VSM and VLM do not
MTG ECDF: NearID vs VLM, VSM, Oracle

ECDF on MTG test set. Oracle (blue) is the reference. NearID (pink) closely tracks Oracle – scores correlate with actual part-edit severity. VSM (green) clusters near 1, insensitive to part-level differences. VLM 30B (brown) accumulates at low scores, reflecting categorical reasoning.
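The ECDF curves compared here are simple to reproduce for any scorer; a minimal sketch:

```python
import numpy as np

def ecdf(scores):
    """Empirical CDF of a score distribution: returns sorted scores and
    the cumulative fraction of samples at or below each one."""
    x = np.sort(np.asarray(scores, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y
```

Plotting `ecdf(method_scores)` against `ecdf(oracle_scores)` makes pathologies visible at a glance: a scorer insensitive to part-level edits piles mass near 1, while one that tracks edit severity follows the Oracle curve.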