Loss Functions

RecSys
ML
Author

Aayush Agrawal

Published

February 22, 2026

Evaluation Metrics Primer

Before talking about loss functions, we need to know what we’re trying to optimize. These are the metrics that matter in ranking.

Precision@K and Recall@K

The simplest ranking metrics. Given the top-K items the model returns:

  • Precision@K = (relevant items in top-K) / K — “Of what I showed, how much was good?”
  • Recall@K = (relevant items in top-K) / (total relevant items) — “Of all the good stuff, how much did I find?”
import numpy as np

# 10 items, 4 are relevant (indices 0, 2, 5, 7)
relevant = {0, 2, 5, 7}
# Model's top-5 ranking: items [2, 5, 3, 0, 9]
top_k = [2, 5, 3, 0, 9]

hits = [item in relevant for item in top_k]
print(f"Hits in top-5: {hits}")  

precision_at_5 = sum(hits) / len(top_k)
recall_at_5 = sum(hits) / len(relevant)
print(f"Precision@5: {precision_at_5:.2f}")  
print(f"Recall@5: {recall_at_5:.2f}")        
Hits in top-5: [True, True, False, True, False]
Precision@5: 0.60
Recall@5: 0.75

Limitation: these treat position as binary – you’re either in the top-K or you’re not. Being ranked 1st vs. 5th doesn’t matter. That’s a problem because users look at the top of the list first.

Mean Reciprocal Rank (MRR)

“How far down does the user have to scroll to find the first relevant item?”

\[\text{MRR} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{1}{\text{rank}_q}\]

where rank_q is the position of the first relevant item for query q.

# Three queries, each with a ranked list of items (1 = relevant, 0 = not)
queries = [
    [0, 0, 1, 0, 1],  # first relevant at position 3 → 1/3
    [1, 0, 0, 0, 0],  # first relevant at position 1 → 1/1
    [0, 0, 0, 1, 0],  # first relevant at position 4 → 1/4
]

reciprocal_ranks = []
for ranked_list in queries:
    for i, label in enumerate(ranked_list):
        if label == 1:
            reciprocal_ranks.append(1.0 / (i + 1))
            break
print(f"Reciprocal rank: {reciprocal_ranks}")

mrr = np.mean(reciprocal_ranks)
print(f"MRR: {mrr}")
Reciprocal rank: [0.3333333333333333, 1.0, 0.25]
MRR: 0.5277777777777778

When it matters: Single-answer retrieval (search, Q&A). Less useful when multiple items are relevant since it ignores everything after the first hit.

DCG and NDCG

This is the ranking metric you’ll see most often. It answers: “Are the best items near the top of the list?”

DCG (Discounted Cumulative Gain) adds up relevance scores, but discounts items lower in the list by 1/log₂(position + 1):

\[\text{DCG@K} = \sum_{i=1}^{K} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}\]

Two key ideas baked in:

  • Gain: 2^rel - 1 makes highly relevant items worth exponentially more. A 5-star item isn’t 5x as valuable as a 1-star, it’s 31x (2⁵-1 vs 2¹-1).
  • Discount: 1/log₂(i+1) means position 1 gets full credit, position 2 gets ~63% credit, position 10 gets ~29%. Items buried at the bottom barely count.

NDCG normalizes DCG by dividing by the best possible DCG (the “ideal” ranking where all relevant items are sorted to the top):

\[\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}\]

This gives a 0-to-1 score. NDCG = 1.0 means the model’s ranking is perfect.

def dcg_at_k(relevances, k):
    """Compute DCG@K for a single ranked list."""
    rel = np.array(relevances[:k])
    gains = 2**rel - 1
    discounts = np.log2(np.arange(2, k + 2))
    return np.sum(gains / discounts)

def ndcg_at_k(relevances, k):
    """Compute NDCG@K for a single ranked list."""
    dcg = dcg_at_k(relevances, k)
    # Ideal: sort relevances descending, then compute DCG
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal_relevances, k)
    if idcg == 0:
        return 0.0
    return dcg / idcg

# Model ranks 6 items. Relevance: 3=highly relevant, 0=irrelevant
model_ranking = [3, 2, 0, 1, 3, 0]  # model's output order
k = 6

dcg = dcg_at_k(model_ranking, k)
ndcg = ndcg_at_k(model_ranking, k)
print(f"DCG@{k}: {dcg:.4f}")    # 12.0314
print(f"NDCG@{k}: {ndcg:.4f}")  # 0.9014

# What if we had the perfect ranking? 
perfect = sorted(model_ranking, reverse=True)
print(f"Perfect ranking: {perfect}") # [3, 3, 2, 1, 0, 0]
print(f"NDCG@{k} (perfect): {ndcg_at_k(perfect, k):.4f}")  # 1.0000
DCG@6: 12.0314
NDCG@6: 0.9014
Perfect ranking: [3, 3, 2, 1, 0, 0]
NDCG@6 (perfect): 1.0000

The model ranked a 0-relevance item at position 3, pushing a relevance-3 item down to position 5. NDCG captures exactly how much that costs.

When it matters: Anywhere you care about the quality of the full ranked list, which is most recommendation and search settings. This is the metric that LambdaRank/LambdaMART directly optimize.

Quick Reference

| Metric | What it measures | Position-aware? | Best for |
|---|---|---|---|
| Precision@K | Fraction of top-K that's relevant | No | Retrieval stage filtering |
| Recall@K | Fraction of relevant items found in top-K | No | Retrieval coverage |
| MRR | How quickly you find the first relevant item | Yes (first hit only) | Single-answer tasks (search) |
| NDCG@K | Quality of the full ranked list, top-weighted | Yes (all positions) | Ranking stage evaluation |

Mental Model: What Unit Does the Loss Operate On?

Every loss function in recommendation ranking answers one question: what do you compare against what?

The answer falls on a spectrum:

| Level | Input to Loss | What It Optimizes | Analogy |
|---|---|---|---|
| Pointwise | Single (user, item) score vs. label | Absolute accuracy per item | "Is this rating correct?" |
| Pairwise | Two items: one positive, one negative | Relative order between a pair | "Is A ranked above B?" |
| Listwise | Full ranked list of items | Global ranking quality (e.g., NDCG) | "Is the whole list sorted well?" |

Why this spectrum matters:

  • Moving from pointwise to listwise, you get closer to the actual objective (serve a well-ranked list), but further from a clean, easy-to-optimize loss.
  • Pointwise losses are simple to implement and scale, but they optimize a proxy. A model can nail per-item accuracy and still produce a bad ranking.
  • Listwise losses directly target ranking metrics, but the gradients are noisier and the implementation is more involved.
  • Pairwise sits in the middle – it captures relative ordering (which is what ranking is) without needing to reason over full permutations.

Contrastive losses (InfoNCE, etc.) don’t fit neatly on this spectrum. They’re structurally similar to softmax over a batch (listwise over a sampled subset), but their goal is learning a similarity space rather than optimizing a ranking metric directly. Think of them as a separate axis: “learn distances” vs. “learn orderings.”

In practice, most production systems use multiple losses across stages:

  • Retrieval: contrastive (learn embedding space)
  • Ranking: pointwise BCE (calibrated scores for mixing with other signals)
  • Re-ranking/blending: sometimes a listwise objective on top

The loss isn’t chosen in isolation. It’s chosen per stage, based on what that stage needs to produce.

Pointwise Losses

Pointwise losses treat each (user, item) pair independently. Predict a score, compare it to a label, compute the loss. No awareness of other items in the list.

Binary Cross-Entropy (BCE)

The workhorse of CTR prediction. You have implicit feedback – clicked or not, watched or not – and you predict a probability.

Formula:

\[\mathcal{L} = -\left[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})\right]\]

import torch
import torch.nn.functional as F

# Model predicts raw logits for 5 (user, item) pairs
logits = torch.tensor([2.0, -1.0, 0.5, 3.0, -0.5])

# Ground truth: clicked (1) or not (0)
labels = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])

loss = F.binary_cross_entropy_with_logits(logits, labels)
print(f"BCE loss: {loss.item():.4f}")
BCE loss: 0.2874

BCE gives you calibrated probabilities – the output actually means something (“70% chance of click”). This matters when you need to blend scores from multiple models (e.g., P(click) * P(purchase) * bid) downstream in an auction or score fusion layer.
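A toy sketch of that downstream fusion (all numbers here are made up for illustration): because calibrated BCE outputs are actual probabilities, they can be multiplied with other calibrated signals to get an expected value per impression.

```python
import numpy as np

# Calibrated outputs from two separate models plus a bid (toy values)
p_click = np.array([0.07, 0.02, 0.12])     # P(click | impression)
p_purchase = np.array([0.30, 0.55, 0.10])  # P(purchase | click)
bid = np.array([1.00, 4.00, 0.50])         # advertiser bid

# Expected value per impression. This multiplication is only meaningful
# because the scores are calibrated probabilities; raw ranking scores
# could not be combined this way.
expected_value = p_click * p_purchase * bid
print(expected_value)  # [0.021 0.044 0.006]
```

Note how the middle item wins despite the lowest click probability: the fusion depends on the probabilities meaning what they claim to mean.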

Mean Squared Error (MSE)

For explicit ratings. User gave 4 stars, you predicted 3.2, penalty is (4 - 3.2)².

Formula:

\[\mathcal{L} = \frac{1}{n}\sum(y - \hat{y})^2\]

# Predicted ratings vs actual ratings
predictions = torch.tensor([3.2, 4.8, 2.1, 3.9])
actuals     = torch.tensor([4.0, 5.0, 1.0, 4.0])

loss = F.mse_loss(predictions, actuals)
print(f"MSE loss: {loss.item():.4f}")
MSE loss: 0.4750

MSE assumes Gaussian noise around the true rating. It penalizes large errors quadratically, so a prediction that’s off by 2 is 4x worse than one that’s off by 1. This makes it sensitive to outliers – a single “hate-click” 1-star rating on something a user usually rates 4-5 can disproportionately warp the model.
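A quick numpy illustration of that outlier sensitivity: one hate-click dominates the average loss.

```python
import numpy as np

# A user who normally rates 4-5, plus one 1-star "hate-click"
actuals = np.array([4.0, 5.0, 4.0, 5.0, 1.0])
predictions = np.full(5, 4.5)  # model predicts the user's typical rating

sq_errors = (actuals - predictions) ** 2
print(sq_errors)         # [ 0.25  0.25  0.25  0.25 12.25]
print(sq_errors.mean())  # 2.65 -- the single outlier is ~92% of the loss
```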

The Core Limitation: Low Pointwise Loss ≠ Good Ranking

This is the critical insight. A model can be well-calibrated per item but rank terribly.

import numpy as np

# User has 5 candidate items. Items 0, 1, 2 were clicked (relevant).
labels =  np.array([1, 1, 1, 0, 0])

# Model A: Nails 4 out of 5 items with high confidence,
#          but swaps item 2 (positive) and item 3 (negative)
scores_a =    np.array([0.99, 0.98, 0.40, 0.60, 0.01])
# Ranked order: [0, 1, 3, 2, 4] — one negative leaks above a positive

# Model B: Moderate confidence everywhere, but perfectly ordered
scores_b =    np.array([0.70, 0.65, 0.60, 0.40, 0.35])
# Ranked order: [0, 1, 2, 3, 4] — all relevant items on top

def bce(y, p):
    p = np.clip(p, 1e-7, 1-1e-7)
    return -np.mean(y * np.log(p) + (1-y) * np.log(1-p))

def ndcg_at_k(labels, scores, k=5):
    order = np.argsort(-scores)
    gains = (2**labels[order[:k]] - 1) / np.log2(np.arange(2, k+2))
    ideal = np.sort(2**labels - 1)[::-1][:k]
    ideal_dcg = np.sum(ideal / np.log2(np.arange(2, k+2)))
    return np.sum(gains) / ideal_dcg

print(f"Model A — BCE: {bce(labels, scores_a):.4f}, NDCG@5: {ndcg_at_k(labels, scores_a):.4f}")
print(f"Model B — BCE: {bce(labels, scores_b):.4f}, NDCG@5: {ndcg_at_k(labels, scores_b):.4f}")
Model A — BCE: 0.3746, NDCG@5: 0.9675
Model B — BCE: 0.4480, NDCG@5: 1.0000

Model A is extremely confident on 4 out of 5 items (0.99, 0.98, 0.01 contribute almost zero BCE), pulling its average loss below Model B. But it swaps one positive-negative pair at the boundary – item 3 (negative, scored 0.60) sits above item 2 (positive, scored 0.40). Model B’s scores are moderate across the board (higher per-item BCE) but the ordering is perfect.

The takeaway: BCE rewards confidence on easy items. Ranking rewards getting the boundaries right. A model can “spend” its BCE budget being extremely sure about obvious cases while botching the one pair that actually matters for list quality. This gap is exactly why pairwise and listwise losses exist.

Pairwise Losses

Pairwise losses shift the question from “is this score correct?” to “is this item ranked above that one?” You sample a pair – one positive (interacted), one negative (not interacted) – and push the positive’s score higher.

This is a fundamental shift: you’re optimizing relative order, not absolute accuracy. That’s closer to what ranking actually is.

BPR (Bayesian Personalized Ranking)

The classic pairwise loss for implicit feedback. Derived from maximizing the posterior probability that a user prefers item \(i\) over item \(j\).

\[\mathcal{L}_{\text{BPR}} = -\log\left(\sigma(\hat{s}_{\text{pos}} - \hat{s}_{\text{neg}})\right)\]

where \(\sigma\) is the sigmoid function. Intuition: the larger the gap \(\hat{s}_{\text{pos}} - \hat{s}_{\text{neg}}\), the closer \(\sigma(\cdot)\) is to 1, the closer \(-\log(\cdot)\) is to 0.

import torch
import torch.nn.functional as F

def bpr_loss(score_pos, score_neg):
    return -F.logsigmoid(score_pos - score_neg)

# 4 users, each with one positive and one sampled negative
score_pos = torch.tensor([2.5, 1.8, 3.0, 0.9])
score_neg = torch.tensor([1.0, 2.0, 0.5, 0.8])

loss = bpr_loss(score_pos, score_neg)
print(f"BPR loss: {loss}") 
BPR loss: tensor([0.2014, 0.7981, 0.0789, 0.6444])

Look at user 2: positive scored 1.8, negative scored 2.0. The model ranks this pair wrong – the negative is higher. That pair contributes heavily to the loss. User 3 (gap of 2.5) contributes almost nothing. The loss naturally focuses on the violated pairs.

When to use: Matrix factorization with implicit feedback. BPR was the standard loss for collaborative filtering models (ALS, LightFM, etc.) before deep learning took over retrieval.

Margin / Hinge Loss

Enforce a minimum gap (margin \(m\)) between positive and negative scores.

\[\mathcal{L}_{\text{hinge}} = \max\left(0,\; m - (\hat{s}_{\text{pos}} - \hat{s}_{\text{neg}})\right)\]

def hinge_loss(score_pos, score_neg, margin=1.0):
    return torch.clamp(margin - (score_pos - score_neg), min=0)

score_pos = torch.tensor([2.5, 1.8, 3.0, 0.9])
score_neg = torch.tensor([1.0, 2.0, 0.5, 0.8])

loss = hinge_loss(score_pos, score_neg, margin=1.0)
print(f"Hinge loss: {loss}")
Hinge loss: tensor([0.0000, 1.2000, 0.0000, 0.9000])

Breaking it down per pair:

  • User 1: margin - (2.5 - 1.0) = 1.0 - 1.5 = -0.5 → clamped to 0 (satisfied)
  • User 2: 1.0 - (1.8 - 2.0) = 1.0 - (-0.2) = 1.2 → loss = 1.2 (violated, wrong order AND no margin)
  • User 3: 1.0 - (3.0 - 0.5) = 1.0 - 2.5 = -1.5 → clamped to 0 (satisfied)
  • User 4: 1.0 - (0.9 - 0.8) = 1.0 - 0.1 = 0.9 → loss = 0.9 (right order, but gap too small)

The key difference from BPR: once the gap exceeds \(m\), hinge loss is exactly zero – no further gradient. BPR always pushes scores apart (asymptotically). This makes hinge loss more suitable for embedding learning where you want items to be “far enough” apart but don’t need infinite separation.

When to use: Metric learning, embedding spaces (think: learning user/item embeddings where cosine distance matters). Triplet loss is the three-term version of this (anchor, positive, negative).

BPR vs. Hinge: When to Pick Which

| | BPR | Hinge |
|---|---|---|
| Gradient when pair is correct | Still nonzero (decaying) | Zero once margin met |
| Sensitive to margin hyperparameter | No | Yes (\(m\) matters a lot) |
| Probabilistic interpretation | Yes (posterior maximization) | No |
| Common in | Collaborative filtering | Embedding / metric learning |

The Core Limitation: Not All Pairs Are Equal

Pairwise losses treat every (positive, negative) pair with equal weight. But ranking is top-heavy – swapping items at positions 1 and 2 is catastrophic, while swapping at positions 99 and 100 is irrelevant.

import numpy as np

# 10 items, first 3 are relevant
labels = np.array([1,1,1,0,0,0,0,0,0,0])

# Model ranks perfectly except it swaps ONE pair
scores_top_swap = np.array([0.95, 0.50, 0.85, 0.90, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20])
# Rank order: [0, 3, 2, 1, 4, 5, 6, 7, 8, 9]
# Position 2 is a negative (item 3). User sees a bad rec at slot 2.

scores_bot_swap = np.array([0.95, 0.90, 0.85, 0.45, 0.40, 0.35, 0.30, 0.25, 0.15, 0.20])
# Rank order: [0, 1, 2, 3, 4, 5, 6, 7, 9, 8]
# Items 8 and 9 (both negative) swapped. User never scrolls that far. Zero impact.

# Both models violate exactly ONE pair, so BPR/hinge treat them identically.
# But the ranking quality is vastly different.

def ndcg_at_k(labels, scores, k=5):
    order = np.argsort(-scores)
    gains = (2**labels[order[:k]] - 1) / np.log2(np.arange(2, k+2))
    ideal = np.sort(2**labels - 1)[::-1][:k]
    ideal_dcg = np.sum(ideal / np.log2(np.arange(2, k+2)))
    return np.sum(gains) / ideal_dcg

print(f"Top swap — NDCG@5: {ndcg_at_k(labels, scores_top_swap):.4f}")
print(f"Bottom swap — NDCG@5: {ndcg_at_k(labels, scores_bot_swap):.4f}")
Top swap — NDCG@5: 0.9060
Bottom swap — NDCG@5: 1.0000

One pair violation at the top costs about 9% NDCG (0.906 vs. 1.0). The same violation at the bottom costs nothing. BPR and hinge are blind to this distinction. This is exactly the gap that LambdaRank fills – it weights each pair by |ΔNDCG|, so top-of-list swaps get massive gradients and bottom-of-list swaps get near-zero.

The takeaway: Pairwise losses get you from “predict accurately” to “rank correctly,” but they still don’t know where in the list the pair sits. That’s what listwise losses solve.

Listwise Losses

Listwise losses operate on the entire ranked list at once. Instead of asking “is A above B?” for one pair, they ask “how good is this entire ordering?” This directly aligns the loss with the metrics we actually evaluate (NDCG, MAP).

The challenge: ranking metrics like NDCG involve sorting, which is non-differentiable. You can’t backpropagate through argmax. Every listwise loss is a different strategy for getting around this.

LambdaRank

The key insight: you don’t need the loss function itself, you just need its gradients.

LambdaRank starts from a pairwise setup (every pair of items where one is relevant and the other isn’t), then multiplies each pair’s gradient by |ΔNDCG| – the change in NDCG that would result from swapping those two items.

\[\lambda_{ij} = \frac{-1}{1 + e^{\hat{s}_i - \hat{s}_j}} \cdot |\Delta\text{NDCG}_{ij}|\]

The |ΔNDCG| term does all the heavy lifting:

  • Swapping items at positions 1 and 2? Huge ΔNDCG, huge gradient.
  • Swapping items at positions 98 and 99? Tiny ΔNDCG, near-zero gradient.
  • Swapping two items that have the same relevance? Zero ΔNDCG, no gradient at all.

import numpy as np

def compute_dcg(relevances, k=None):
    if k is None: k = len(relevances)
    rel = np.array(relevances[:k])
    discounts = np.log2(np.arange(2, k + 2))
    return np.sum((2**rel - 1) / discounts)

def compute_delta_ndcg(relevances, i, j):
    """NDCG change from swapping items at positions i and j."""
    k = len(relevances)
    ideal_dcg = compute_dcg(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0: return 0.0

    # Current NDCG
    ndcg_before = compute_dcg(relevances, k) / ideal_dcg

    # Swap and recompute
    swapped = list(relevances)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    ndcg_after = compute_dcg(swapped, k) / ideal_dcg

    return abs(ndcg_after - ndcg_before)

# 6 items in current model order, relevance labels
relevances = [3, 0, 2, 1, 0, 3]

# Compare swapping different pairs
pairs = [(0, 1), (1, 2), (3, 4), (4, 5)]
for i, j in pairs:
    delta = compute_delta_ndcg(relevances, i, j)
    print(f"Swap positions {i}<->{j} (rel {relevances[i]}<->{relevances[j]}): |ΔNDCG| = {delta:.4f}")

# Swap positions 0<->1 (rel 3<->0): |ΔNDCG| = 0.1936  ← huge, top of list + big relevance gap
# Swap positions 1<->2 (rel 0<->2): |ΔNDCG| = 0.0294  ← moderate
# Swap positions 3<->4 (rel 1<->0): |ΔNDCG| = 0.0033  ← tiny, bottom of list + small gap
# Swap positions 4<->5 (rel 0<->3): |ΔNDCG| = 0.0161  ← moderate, big relevance gap but low position
Swap positions 0<->1 (rel 3<->0): |ΔNDCG| = 0.1936
Swap positions 1<->2 (rel 0<->2): |ΔNDCG| = 0.0294
Swap positions 3<->4 (rel 1<->0): |ΔNDCG| = 0.0033
Swap positions 4<->5 (rel 0<->3): |ΔNDCG| = 0.0161

Position 0↔1 swap has nearly 60x the gradient weight of the position 3↔4 swap (0.1936 vs. 0.0033). The model learns to fight hard for the top of the list and not waste capacity on the tail. This is exactly what NDCG rewards.

LambdaMART = LambdaRank + gradient-boosted decision trees (GBDT). Instead of a neural net, the model is an ensemble of trees, each fitted to the lambda gradients. This powers XGBoost/LightGBM rankers and is still the backbone of many production search engines.

When to use: When you have graded relevance labels and care about NDCG. Search ranking, recommendation re-ranking, any setting where a learning-to-rank model takes features and produces a final ordering.
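To see how the lambdas come together, here is a minimal single-query sketch (not a production implementation; it fixes the RankNet shape parameter at 1 and uses the sign convention that a positive lambda means "push this score up"):

```python
import numpy as np

def dcg(rels):
    rels = np.asarray(rels, dtype=float)
    return np.sum((2**rels - 1) / np.log2(np.arange(2, len(rels) + 2)))

def lambda_gradients(scores, rels):
    """Accumulate per-item lambdas: pairwise sigmoid term x |delta NDCG|."""
    order = np.argsort(-scores)                       # current ranked order
    ranked_rels = np.asarray(rels, dtype=float)[order]
    ideal = dcg(sorted(rels, reverse=True))
    lambdas = np.zeros(len(scores))
    for i in range(len(scores)):
        for j in range(len(scores)):
            if ranked_rels[i] <= ranked_rels[j]:
                continue  # only pairs where position i should outrank j
            swapped = ranked_rels.copy()
            swapped[i], swapped[j] = swapped[j], swapped[i]
            delta_ndcg = abs(dcg(ranked_rels) - dcg(swapped)) / ideal
            # pairwise gradient term, scaled by the NDCG impact of the swap
            rho = 1.0 / (1.0 + np.exp(scores[order[i]] - scores[order[j]]))
            lambdas[order[i]] += rho * delta_ndcg     # push the better item up
            lambdas[order[j]] -= rho * delta_ndcg     # push the worse item down
    return lambdas

scores = np.array([2.0, 1.5, 1.0, 0.5, 0.2, 0.1])
rels = [3, 0, 2, 1, 0, 3]
print(np.round(lambda_gradients(scores, rels), 4))
```

The relevance-3 item buried at the bottom collects a large positive lambda, while the relevance-0 item sitting at position 2 collects a large negative one. In LambdaMART, each tree in the GBDT ensemble is fitted to exactly these per-item lambda values.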

Softmax Cross-Entropy (Sampled Softmax)

Treat the relevant item as the correct “class” among all candidates. Given user \(u\), the probability of the positive item \(i^+\) is:

\[P(i^+ | u) = \frac{\exp(\hat{s}_{i^+})}{\sum_{j \in \mathcal{C}} \exp(\hat{s}_j)}\]

\[\mathcal{L}_{\text{softmax}} = -\log P(i^+ | u)\]

This is identical to classification cross-entropy where the number of classes = number of items. Obviously computing the denominator over millions of items is infeasible, so in practice you sample:

import torch
import torch.nn.functional as F

def sampled_softmax_loss(score_pos, scores_neg):
    """
    score_pos: (batch_size,) — score for the positive item per user
    scores_neg: (batch_size, num_neg) — scores for sampled negatives
    """
    # Positive is "class 0", negatives are the rest
    # Shape: (batch_size, 1 + num_neg)
    logits = torch.cat([score_pos.unsqueeze(1), scores_neg], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long)  # correct class = 0
    return F.cross_entropy(logits, labels)

batch_size, num_neg = 4, 99
score_pos = torch.tensor([3.0, 1.5, 2.8, 0.5])
scores_neg = torch.randn(batch_size, num_neg)  # 99 random negatives per user

loss = sampled_softmax_loss(score_pos, scores_neg)
print(f"Sampled softmax loss: {loss.item():.4f}")
Sampled softmax loss: 3.1978

The gradient pushes the positive score up and all negative scores down, but weighted by their softmax probability. High-scoring negatives (hard negatives) get pushed down harder. This is a natural hard-negative weighting built into the loss.
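That weighting falls straight out of the softmax gradient: for cross-entropy with the positive as class 0, the derivative of the loss with respect to each negative's logit is exactly that negative's softmax probability. A small numpy sketch with made-up logits:

```python
import numpy as np

# One user: positive logit first, then three sampled negatives
logits = np.array([2.0, 1.8, 0.0, -1.0])  # one hard negative (1.8), two easy
probs = np.exp(logits) / np.exp(logits).sum()

# Gradient of -log(softmax[0]) w.r.t. each logit: softmax minus one-hot label
grad = probs.copy()
grad[0] -= 1.0  # the positive gets (probability - 1), i.e. pushed up
print(np.round(grad, 4))
# The hard negative (~0.41) gets roughly 6x the push-down of the
# easy negative at logit 0.0 (~0.07), and far more than the one at -1.0.
```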

When to use: Two-tower retrieval models. YouTube’s Deep Retrieval, Google’s two-tower, Meta’s EBR – they all use some variant of this. It’s the standard loss for learning embedding spaces where you retrieve via approximate nearest neighbor.

The Trade-off

| | Pointwise | Pairwise | Listwise |
|---|---|---|---|
| Optimizes | Per-item accuracy | Relative order of pairs | Full list quality |
| Gradient signal | From individual labels | From pair comparisons | From list-level metric |
| Complexity | O(n) | O(n²) worst case | O(n²) for lambda, O(n) for softmax |
| Position-aware | No | No | Yes (LambdaRank), partially (softmax) |
| Production use | CTR prediction | CF models (BPR) | Search ranking, retrieval |

The movement from pointwise → listwise is a movement from “easy to optimize, loose proxy” to “hard to optimize, tight alignment with the real objective.” Most production systems don’t pick one – they use different losses at different stages.

Contrastive Losses

Contrastive losses learn a similarity space: items that should be close get pulled together, items that should be far get pushed apart. The output isn’t a ranking score directly – it’s an embedding where distance is the score.

This is the engine behind modern retrieval: CLIP, SimCLR, two-tower recommenders. The core question shifts from “what score should this item get?” to “should these two embeddings be close or far?”

InfoNCE

The dominant contrastive loss. Given a query \(q\), one positive key \(k^+\), and \(N-1\) negative keys \(\{k^-\}\):

\[\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(q, k^+) / \tau)}{\sum_{i=1}^{N} \exp(\text{sim}(q, k_i) / \tau)}\]

This is literally softmax cross-entropy where the positive is the correct class and all negatives are the other classes. The connection is not an analogy – it’s algebraically identical.

import torch
import torch.nn.functional as F

def info_nce_loss(query, positive, negatives, temperature=0.07):
    """
    query: (batch_size, embed_dim)
    positive: (batch_size, embed_dim)
    negatives: (batch_size, num_neg, embed_dim)
    """
    # Normalize embeddings (cosine similarity)
    query = F.normalize(query, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Positive similarity: (batch_size,)
    pos_sim = torch.sum(query * positive, dim=-1, keepdim=True) / temperature

    # Negative similarities: (batch_size, num_neg)
    neg_sim = torch.bmm(negatives, query.unsqueeze(-1)).squeeze(-1) / temperature

    # Logits: positive first, then negatives
    logits = torch.cat([pos_sim, neg_sim], dim=1)  # (batch_size, 1 + num_neg)
    labels = torch.zeros(query.size(0), dtype=torch.long)

    return F.cross_entropy(logits, labels)

batch_size, embed_dim, num_neg = 32, 128, 255

query = torch.randn(batch_size, embed_dim)
positive = torch.randn(batch_size, embed_dim)
negatives = torch.randn(batch_size, num_neg, embed_dim)

loss = info_nce_loss(query, positive, negatives, temperature=0.07)
print(f"InfoNCE loss: {loss.item():.4f}")
InfoNCE loss: 6.0130

Temperature: The One Hyperparameter That Matters

Temperature \(\tau\) controls the sharpness of the similarity distribution. This is worth understanding deeply because getting it wrong can quietly ruin your model.

import torch
import torch.nn.functional as F

# 5 items with cosine similarities to query
similarities = torch.tensor([0.9, 0.7, 0.5, 0.3, 0.1])

for temp in [0.01, 0.07, 0.5, 1.0]:
    probs = F.softmax(similarities / temp, dim=0)
    print(f"τ={temp:<5} → probs: [{', '.join(f'{p:.3f}' for p in probs)}]")
τ=0.01  → probs: [1.000, 0.000, 0.000, 0.000, 0.000]
τ=0.07  → probs: [0.943, 0.054, 0.003, 0.000, 0.000]
τ=0.5   → probs: [0.381, 0.256, 0.171, 0.115, 0.077]
τ=1.0   → probs: [0.287, 0.235, 0.192, 0.157, 0.129]

Low τ (0.01-0.05): Only the hardest negatives contribute gradients. Learns very fine-grained distinctions but training is unstable – the loss landscape is spiky.

High τ (0.5-1.0): All negatives contribute roughly equally. Stable training but the model doesn’t learn to separate similar items. Everything clusters into broad blobs.

Sweet spot (0.05-0.1): Sharp enough to focus on hard negatives, smooth enough for stable gradients. CLIP uses 0.07 (learned). SimCLR uses 0.1. Most two-tower recommenders land in this range.

In-Batch Negatives

Instead of explicitly sampling negatives, use the other positives in the same batch as negatives. If the batch has \(B\) (query, item) pairs, each query treats its own item as positive and the other \(B-1\) items as negatives.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, item_emb, temperature=0.07):
    """
    query_emb: (B, D) — user/query embeddings
    item_emb: (B, D) — item embeddings (query_emb[i] pairs with item_emb[i])
    """
    query_emb = F.normalize(query_emb, dim=-1)
    item_emb = F.normalize(item_emb, dim=-1)

    # All-pairs similarity matrix: (B, B)
    sim_matrix = query_emb @ item_emb.T / temperature

    # Diagonal entries are the positives
    labels = torch.arange(sim_matrix.size(0))

    return F.cross_entropy(sim_matrix, labels)

B, D = 512, 128
query_emb = torch.randn(B, D)
item_emb = torch.randn(B, D)

loss = in_batch_contrastive_loss(query_emb, item_emb)
print(f"In-batch contrastive loss: {loss.item():.4f}")
In-batch contrastive loss: 7.0379

Simple, efficient, no negative sampler needed. But there’s a catch: popular items appear in many batches, so they show up as negatives disproportionately often. The model learns to push popular items away from everything, which is the opposite of what you want (popular items are often relevant).

The fix: log-popularity correction. Subtract \(\log(p_j)\) from each negative’s score, where \(p_j\) is the item’s sampling probability (proportional to its frequency):

def corrected_in_batch_loss(query_emb, item_emb, item_freq, temperature=0.07):
    query_emb = F.normalize(query_emb, dim=-1)
    item_emb = F.normalize(item_emb, dim=-1)

    sim_matrix = query_emb @ item_emb.T / temperature

    # Subtract log-frequency to correct for popularity bias
    log_correction = torch.log(item_freq + 1e-8)
    sim_matrix = sim_matrix - log_correction.unsqueeze(0)

    labels = torch.arange(sim_matrix.size(0))
    return F.cross_entropy(sim_matrix, labels)

This comes from the theory of sampled softmax: if your negatives are sampled from distribution \(Q\) but you want to approximate the full softmax, you subtract \(\log Q(j)\) from each sampled negative. Google’s two-tower retrieval paper showed this correction matters a lot in practice.

Hard Negative Mining

Random negatives are easy to distinguish – a cooking video vs. a car review. The model learns to separate these quickly and then stops improving. The interesting pairs are items that are similar but wrong – a cooking video the user didn’t click vs. one they did.

The two-phase strategy used in production:

Phase 1: Train with random/in-batch negatives. Get a baseline model that has reasonable embeddings.

Phase 2: Mine hard negatives from the model’s own predictions. For each query, retrieve the top-K items using the Phase 1 model. Items in the top-K that are not relevant are your hard negatives. Re-train (or fine-tune) with a mix of hard and random negatives.

# Pseudocode for hard negative mining

def mine_hard_negatives(model, queries, positive_items, all_items, k=100, num_hard=10):
    """
    For each query, retrieve top-K with current model,
    filter out true positives, take the top `num_hard` as hard negatives.
    """
    hard_negatives = []
    for q, pos_set in zip(queries, positive_items):
        # Retrieve top-K from current model
        scores = model.score(q, all_items)        # score all items
        top_k_indices = scores.argsort()[-k:][::-1]  # top K by score

        # Filter: keep only items NOT in the positive set
        hard_negs = [idx for idx in top_k_indices if idx not in pos_set]
        hard_negatives.append(hard_negs[:num_hard])

    return hard_negatives

# Training loop sketch
# Phase 1: train with random negatives for N epochs
# Phase 2: every M steps, re-mine hard negatives, mix 50/50 with random

Why mix hard and random? Pure hard negatives cause the model to oscillate – it fixes one hard pair, which changes the top-K, which creates new hard pairs that break old ones. Mixing with random negatives acts as a regularizer, keeping the embedding space globally coherent while fine-tuning boundaries.

How much does this matter? Typically the single biggest improvement you can make to a retrieval model. Going from random-only to hard negative mining can improve Recall@100 by 10-30% depending on the domain.

The Connection Back to Other Losses

InfoNCE with in-batch negatives is algebraically equivalent to:

  • Sampled softmax (from the listwise section) when the “classes” are items in the batch
  • Multi-class cross-entropy where each item gets one positive class

The differences are in framing, not math:

  • Softmax CE: “classify this query to the right item”
  • InfoNCE: “make this query-item pair more similar than other pairs”
  • Both produce the same gradients.

The real difference is what negatives you use and how you sample them, not the loss formula itself.

Choosing the Right Loss

By Stage

| Stage | Loss | Why |
|---|---|---|
| Retrieval (two-tower, ANN) | Sampled softmax / InfoNCE | You need embeddings where nearest neighbors are relevant items. Contrastive losses learn this space directly. Softmax over the batch is cheap and effective. |
| Ranking (CTR/CVR prediction) | BCE | Downstream systems need calibrated probabilities, not just orderings. Auction mechanisms, score fusion across multiple models, and business rules all depend on "this item has a 4.2% click probability," not "this item is better than that one." |
| Ranking (order optimization) | LambdaRank / LambdaMART | When the final metric is NDCG or MAP and you have graded relevance labels. Common in search, less common in recommendations where labels are binary. |
| Embedding learning (similarity) | Contrastive / Triplet / Hinge | Goal is a distance metric, not a ranked list. User-to-user similarity, item-to-item similarity, content understanding. |
| Collaborative filtering (MF) | BPR | Clean pairwise optimization for implicit feedback with matrix factorization. If you're using ALS or SGD-based MF, BPR is the default. |

By Data Type

| Your Data Looks Like | Loss | Reasoning |
|---|---|---|
| Binary implicit (click/no-click) | BCE or BPR | BCE if you need probabilities, BPR if you need ranking |
| Binary implicit + embeddings | InfoNCE / sampled softmax | Learning retrieval representations |
| Explicit ratings (1-5 stars) | MSE | Predicting the rating value itself |
| Graded relevance (0/1/2/3) | LambdaRank | Multiple relevance levels, NDCG is the right metric |
| Pairwise preferences (A > B) | BPR / Hinge | Data is already in pairwise form |

Common Mistakes

1. Using BCE for retrieval. BCE optimizes per-item accuracy, not embedding geometry. Your embeddings might be accurate classifiers but form a useless nearest-neighbor space – items with similar scores don’t end up close in embedding space.

2. Skipping log-popularity correction with in-batch negatives. Popular items get pushed away from everything. Your retrieval model develops a systematic bias against the items users actually want most.

3. Using only random negatives. The model quickly learns to separate obviously different items and then plateaus. Hard negative mining is almost always the highest-ROI improvement for retrieval quality.

4. Choosing LambdaRank when labels are binary. With binary labels (relevant / not relevant), ΔNDCG reduces to a function of position only. LambdaRank still works but the advantage over pairwise losses is smaller. LambdaRank shines with graded relevance (0/1/2/3) where the gain differences between grades create richer gradient signals.

5. Optimizing one loss, evaluating another. Training on BCE, evaluating on NDCG, and being surprised they don’t correlate. As the pointwise section showed, these can diverge. If your evaluation metric is NDCG, at minimum add a pairwise or listwise loss component.

Multi-Loss Training

Production systems often combine losses. A retrieval model might use:

\[\mathcal{L} = \mathcal{L}_{\text{InfoNCE}} + \alpha \cdot \mathcal{L}_{\text{BCE}}\]

InfoNCE shapes the embedding space for retrieval. The BCE term on a side prediction head keeps the model calibrated for downstream use. The weight \(\alpha\) is tuned to prevent one loss from dominating.

Similarly, a ranking model might use:

\[\mathcal{L} = \mathcal{L}_{\text{BCE}} + \beta \cdot \mathcal{L}_{\text{pairwise}}\]

BCE handles calibration. The pairwise term directly penalizes misordered pairs that BCE might miss (the Model A vs. Model B problem from the pointwise section).
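A minimal sketch of what that combined ranking loss might look like in PyTorch (the in-list pairing scheme and the `beta` value are illustrative choices, not a prescription):

```python
import torch
import torch.nn.functional as F

def combined_ranking_loss(logits, labels, beta=0.5):
    """BCE for calibration + a BPR-style pairwise term on one candidate list.
    `beta` is a hypothetical weight you would tune per application."""
    bce = F.binary_cross_entropy_with_logits(logits, labels)

    # All (positive, negative) score gaps within the list
    pos = logits[labels == 1].unsqueeze(1)   # (num_pos, 1)
    neg = logits[labels == 0].unsqueeze(0)   # (1, num_neg)
    pairwise = -F.logsigmoid(pos - neg).mean()

    return bce + beta * pairwise

logits = torch.tensor([2.0, -1.0, 0.5, 3.0, -0.5])
labels = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])
print(combined_ranking_loss(logits, labels).item())
```

The BCE term is unchanged from the pointwise section; the pairwise term adds an explicit penalty whenever any negative scores close to (or above) any positive in the same list.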