NearID: Identity Representation Learning
via Near-Identity Distractors

Aleksandar Cvejic1, Rameen Abdal2,†, Abdelrahman Eldesokey1, Bernard Ghanem1, Peter Wonka1

1King Abdullah University of Science and Technology (KAUST)    2Snap Research

† Served in advisory capacity.

Paper (arXiv) · Code · Model · Dataset

Current vision encoders confuse context for identity. So do the metrics built on them. NearID solves both.

  • SSR: 99.2% vs. 30.7% baseline
  • PA: 99.7% vs. 48.8% baseline
  • M-H: 0.545 human alignment
  • M-O: 0.465 oracle alignment
  • Only 15M trainable parameters
  • ~324x cheaper than VLM evaluation
NearID teaser: three representation learning paradigms

NearID: From Context to Identity. (Left) Traditional representations entangle object identity with background context. (Middle) Synthetic data lacks explicit control over visually similar distractors. (Right) NearID introduces matched-context distractors to remove contextual shortcuts and isolate intrinsic identity signals.

Method Frozen SigLIP2 + 15M-param MAP head reshapes similarity geometry
NearID method: frozen encoder + trained MAP head

Left: In the pretrained manifold, positives can be misaligned (d_i < d_j). Right: The MAP head boosts true positives (green) and suppresses distractors (red). Multi-view positives are produced via depth-conditioned generation and cross-view feature warping from Objaverse 3D assets (SynCD pipeline, illustrated for completeness).
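As a concrete sketch of the attention-pooling idea behind a MAP head: a small module with a learnable query attends over frozen patch tokens and emits a unit-norm identity embedding. The single-head simplification, the dimensions, and all names here are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MAPHead:
    """Hypothetical single-head attention-pooling (MAP) head sketch.

    A learnable query attends over frozen encoder patch tokens; a linear
    projection then produces the identity embedding. Dimensions are
    illustrative, not the paper's configuration.
    """
    def __init__(self, dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.q = rng.standard_normal(dim) / np.sqrt(dim)        # learnable query
        self.Wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wo = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, tokens):
        # tokens: (num_patches, dim) frozen backbone features
        k, v = tokens @ self.Wk, tokens @ self.Wv
        attn = softmax(k @ self.q / np.sqrt(k.shape[-1]))       # (num_patches,)
        pooled = attn @ v                                       # attention-weighted pool
        z = pooled @ self.Wo
        return z / np.linalg.norm(z)                            # unit-norm embedding
```

Because only the head's parameters are trained, the backbone's features stay fixed and the head merely reshapes the similarity geometry on top of them.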

📊 Main Results +68% SSR, +51% PA over frozen SigLIP2; human-aligned on DB++
Scoring Model   NearID                           MTG                                                           DB++
                SSR↑            PA↑              M-O↑            M-O_pair↑       SSR↑          PA↑             M-H↑
Qwen3VL 30B     49.73           69.20            0.219           0.329           17.0          26.0            --
CLIP            10.31           20.92            0.239           0.484           0.0           0.0             0.493
DINOv2          20.43           34.55            0.324           0.519           0.0           0.0             0.492
VSM*            32.13           46.70            0.394           0.445           7.0           24.5            0.190
SigLIP2         30.74           48.81            0.180           0.366           0.0           0.0             0.516
NearID (Ours)   99.17 (+68.43)  99.71 (+50.90)   0.465 (+0.285)  0.486 (+0.120)  35.0 (+35.0)  46.5 (+46.5)    0.545 (+0.029)

Top rows: existing embedding and VLM methods under the near-identity protocol. Bottom row: NearID achieves 99.17% SSR (vs. SigLIP2's 30.74%) while improving MTG oracle alignment (M-O: 0.465 vs. 0.180) and DB++ human alignment (M-H: 0.545 vs. 0.516); deltas in parentheses are relative to the frozen SigLIP2 baseline. * VSM was trained directly on MTG. -- Qwen3VL's DB++ evaluation was skipped due to computational cost (~324x that of embedding-based evaluation).

Paradigm Contrastive framework with explicit structured near-identity distractors
SimCLR vs StableRep vs NearID contrastive paradigms

(A) SimCLR: augmented copies of one real image. (B) StableRep: multiple diffusion samples from the same text prompt. (C) NearID (Ours): 3D-consistent multi-view images of the same identity, produced via depth-conditioned generation and cross-view feature warping from Objaverse 3D assets (SynCD); the structured negative is a different but visually similar instance inpainted into the same background.

(A) SimCLR

Strong stochastic augmentations of one real image form the positive pair. Learns pixel-level invariance; no semantic distractors are present.

(B) StableRep

A diffusion model generates multiple images from the same text caption. Learns caption-level invariance but cannot distinguish intra-class instances.

(C) NearID (Ours)

Positives are 3D-consistent multi-view images of the same object identity, produced via depth-conditioned diffusion generation and cross-view feature warping from Objaverse 3D assets (SynCD). Near-identity distractors – a different but visually similar instance inpainted into the exact same background – are the key negative signal absent from both SimCLR and StableRep.
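The role of the structured negative can be sketched as an InfoNCE-style objective in which the matched-context distractor appears as an explicit hard negative alongside the multi-view positives. This is an illustrative sketch under that assumption; the paper's actual loss formulation may differ.

```python
import numpy as np

def info_nce_with_distractor(anchor, positives, distractor, temp=0.07):
    """InfoNCE-style loss with a matched-context distractor as an explicit
    hard negative. All inputs are unit-norm embeddings:
    anchor (D,), positives (P, D), distractor (D,). Hypothetical sketch."""
    pos_sims = positives @ anchor / temp            # similarity to each positive view
    neg_sim = distractor @ anchor / temp            # similarity to the hard negative
    logits = np.concatenate([pos_sims, [neg_sim]])
    log_denom = np.log(np.exp(logits).sum())
    # -log p(positive) averaged over the multi-view positives
    return float(np.mean(log_denom - pos_sims))
```

By construction the loss is smaller when the distractor is far from the anchor than when it sits on top of it, which is exactly the pressure that pushes matched-context distractors away in embedding space.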

🎬 KPCA Evolution Positives cluster, distractors separate over 3300 training steps

What to look for

In the SigLIP2 baseline (frozen), positives and near-identity distractors of the same identity are interleaved – the embedding cannot tell them apart.

As NearID training progresses, same-identity positives (•) collapse into tight clusters while distractors (✖) are pushed to distinct regions – even though they share the exact same background as the anchor.

  • Step 0: SigLIP2 frozen – distractors and positives overlap
  • Steps 100–900: identity clusters begin to form
  • Steps 1000–2500: distractors progressively rejected
  • Step 3300 (final): tight positive clusters, clear separation

7 identities, val split. KernelPCA (cosine), Procrustes-aligned to final frame, global axis limits. 35 frames at 180 ms/frame.
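The visualization recipe stated above (KernelPCA with a cosine kernel, Procrustes-aligned to a reference frame) can be sketched in a few lines of numpy. This is a generic reimplementation of those two steps, not the authors' plotting code.

```python
import numpy as np

def cosine_kpca_2d(X):
    """Kernel PCA with a cosine kernel, projected to 2-D.
    X: (n, d) embeddings; returns (n, 2) coordinates."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = Xn @ Xn.T                                   # cosine kernel matrix
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H                                  # double-center the kernel
    vals, vecs = np.linalg.eigh(Kc)
    top = np.argsort(vals)[::-1][:2]                # two leading components
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

def procrustes_align(Y, Y_ref):
    """Orthogonal Procrustes: rotate/reflect Y onto Y_ref so projections
    from different training steps share one orientation."""
    U, _, Vt = np.linalg.svd(Y.T @ Y_ref)
    return Y @ U @ Vt
```

Aligning every frame to the final one (as the figure does) just means calling `procrustes_align(frame, final_frame)` for each step before plotting with shared axis limits.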

KernelPCA timelapse: SigLIP2 baseline through NearID step 3300

KernelPCA summary: SigLIP2 / step 1600 / NearID step 3300

Static keyframes. Left: SigLIP2 frozen – positives (•) and distractors (✖) cluster indistinguishably. Center: Step 1600 – identity structure emerges. Right: NearID final (step 3300) – compact positive clusters, distractors pushed away. Same axis limits and alignment across all panels.

🖼 NearID Qualitative Anchor | Positives | Distractors – S = SigLIP2, N = NearID
🔍 MTG Qualitative Part-level edits – O = Oracle, red = edited region
👁 Attention MAP head re-focuses on identity-discriminative regions
Attention maps on NearID dataset

NearID dataset. Left: input. Center: SigLIP2 backbone attention (before training). Right: NearID attention after training – re-focused onto identity-discriminative parts.

Positive vs confound attention split

Positive vs. confound. NearID attention concentrates on surface details of the positive (left) while spreading diffusely over the confound (right), despite identical background context.
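Heatmaps like these can be produced directly from the pooling attention: the per-patch weights of a learnable query are normalized and reshaped onto the patch grid. A minimal sketch, assuming a single query and a hypothetical 16x16 patch grid:

```python
import numpy as np

def attention_heatmap(tokens, query, grid=(16, 16)):
    """Render per-patch attention weights as a 2-D heatmap.
    tokens: (num_patches, dim) features; query: (dim,) pooling query.
    The 16x16 grid is an illustrative assumption."""
    logits = tokens @ query / np.sqrt(tokens.shape[-1])
    w = np.exp(logits - logits.max())               # stable softmax
    w /= w.sum()
    return w.reshape(grid)                          # weights sum to 1
```

Overlaying the (upsampled) heatmap on the input image then shows which regions the head treats as identity-discriminative.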

DreamBench++ Per-category M-H – comparable human alignment at ~324x lower cost than VLM evaluation
DreamBench++ per-category Pearson correlation bar chart

M-H by category. NearID (red) leads on Object (0.549) and ranks competitively on Animal, Human, and Style. GPT-4o (orange) leads on Object but requires a costly closed-source API call per image pair. NearID achieves comparable performance with a 15M-param MAP head at <1 ms per embedding.

📈 Score Distributions NearID tracks Oracle; VSM and VLM do not
MTG ECDF: NearID vs VLM, VSM, Oracle

ECDF on MTG test set. Oracle (blue) is the reference. NearID (pink) closely tracks Oracle – scores correlate with actual part-edit severity. VSM (green) clusters near 1, insensitive to part-level differences. VLM 30B (brown) accumulates at low scores, reflecting categorical reasoning.
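The ECDF curves compared here are simple to reproduce for any scorer; a minimal sketch:

```python
import numpy as np

def ecdf(scores):
    """Empirical CDF of a score distribution: returns sorted scores and
    the cumulative fraction of samples at or below each one."""
    x = np.sort(np.asarray(scores, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y
```

Plotting `ecdf(method_scores)` against `ecdf(oracle_scores)` makes pathologies visible at a glance: a scorer insensitive to part-level edits piles mass near 1, while one that tracks edit severity follows the Oracle curve.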