Limitation: these treat position as binary – you’re either in the top-K or you’re not. Being ranked 1st vs. 5th doesn’t matter. That’s a problem because users look at the top of the list first.
Mean Reciprocal Rank (MRR)
“How far down does the user have to scroll to find the first relevant item?”
\[\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}\]
where \(\text{rank}_q\) is the position of the first relevant item for query \(q\).
```python
import numpy as np

# Three queries, each with a ranked list of items (1 = relevant, 0 = not)
queries = [
    [0, 0, 1, 0, 1],  # first relevant at position 3 → 1/3
    [1, 0, 0, 0, 0],  # first relevant at position 1 → 1/1
    [0, 0, 0, 1, 0],  # first relevant at position 4 → 1/4
]

reciprocal_ranks = []
for ranked_list in queries:
    for i, label in enumerate(ranked_list):
        if label == 1:
            reciprocal_ranks.append(1.0 / (i + 1))
            break

print(f"Reciprocal ranks: {reciprocal_ranks}")
mrr = np.mean(reciprocal_ranks)
print(f"MRR: {mrr}")
```
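NDCG (normalized discounted cumulative gain) compares the model's ordering against the ideal ordering, discounting each item's gain by its position. As a concrete sketch (the graded relevance labels 0–3 here are hypothetical):

```python
import numpy as np

def dcg(rels):
    """Discounted cumulative gain: gain (2^rel - 1) discounted by log2 of position."""
    rels = np.asarray(rels, dtype=float)
    return np.sum((2 ** rels - 1) / np.log2(np.arange(2, len(rels) + 2)))

# Relevance labels in the order the model ranked them: a 0-relevance item
# sits at position 3, pushing a relevance-3 item down to position 5
model_order = [3, 2, 0, 1, 3]
ideal_order = sorted(model_order, reverse=True)  # [3, 3, 2, 1, 0]

ndcg = dcg(model_order) / dcg(ideal_order)
print(f"NDCG: {ndcg:.4f}")  # → NDCG: 0.9014
```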
The model ranked a 0-relevance item at position 3, pushing a relevance-3 item down to position 5. NDCG captures exactly how much that costs.
When it matters: Anywhere you care about the quality of the full ranked list, which is most recommendation and search settings. This is the metric that LambdaRank/LambdaMART directly optimize.
Quick Reference
| Metric | What it measures | Position-aware? | Best for |
|---|---|---|---|
| Precision@K | Fraction of top-K that’s relevant | No | Retrieval stage filtering |
| Recall@K | Fraction of relevant items found in top-K | No | Retrieval coverage |
| MRR | How quickly you find the first relevant item | Yes (first hit only) | Single-answer tasks (search) |
| NDCG@K | Quality of the full ranked list, top-weighted | Yes (all positions) | Ranking stage evaluation |
Mental Model: What Unit Does the Loss Operate On?
Every loss function in recommendation ranking answers one question: what do you compare against what?
The answer falls on a spectrum:
| Level | Input to Loss | What It Optimizes | Analogy |
|---|---|---|---|
| Pointwise | Single (user, item) score vs. label | Absolute accuracy per item | “Is this rating correct?” |
| Pairwise | Two items: one positive, one negative | Relative order between a pair | “Is A ranked above B?” |
| Listwise | Full ranked list of items | Global ranking quality (e.g., NDCG) | “Is the whole list sorted well?” |
Why this spectrum matters:
Moving from pointwise toward listwise, you get closer to the actual objective (serving a well-ranked list), but further from a clean, easy-to-optimize loss.
Pointwise losses are simple to implement and scale, but they optimize a proxy. A model can nail per-item accuracy and still produce a bad ranking.
Listwise losses directly target ranking metrics, but the gradients are noisier and the implementation is more involved.
Pairwise sits in the middle – it captures relative ordering (which is what ranking is) without needing to reason over full permutations.
Contrastive losses (InfoNCE, etc.) don’t fit neatly on this spectrum. They’re structurally similar to softmax over a batch (listwise over a sampled subset), but their goal is learning a similarity space rather than optimizing a ranking metric directly. Think of them as a separate axis: “learn distances” vs. “learn orderings.”
In practice, most production systems use multiple losses across stages:
Retrieval: contrastive (learn embedding space)
Ranking: pointwise BCE (calibrated scores for mixing with other signals)
Re-ranking/blending: sometimes a listwise objective on top
The loss isn’t chosen in isolation. It’s chosen per stage, based on what that stage needs to produce.
Pointwise Losses
Pointwise losses treat each (user, item) pair independently. Predict a score, compare it to a label, compute the loss. No awareness of other items in the list.
Binary Cross-Entropy (BCE)
The workhorse of CTR prediction. You have implicit feedback – clicked or not, watched or not – and you predict a probability.
```python
import torch
import torch.nn.functional as F

# Model predicts raw logits for 5 (user, item) pairs
logits = torch.tensor([2.0, -1.0, 0.5, 3.0, -0.5])

# Ground truth: clicked (1) or not (0)
labels = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])

loss = F.binary_cross_entropy_with_logits(logits, labels)
print(f"BCE loss: {loss.item():.4f}")
```
BCE loss: 0.2874
BCE gives you calibrated probabilities – the output actually means something (“70% chance of click”). This matters when you need to blend scores from multiple models (e.g., P(click) * P(purchase) * bid) downstream in an auction or score fusion layer.
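A toy illustration of that fusion (all numbers are made up for the example):

```python
# Hypothetical score fusion for an ad auction: expected value per impression.
# Each probability comes from a separately trained, BCE-calibrated model.
p_click = 0.042     # calibrated P(click) from the CTR model
p_purchase = 0.10   # calibrated P(purchase | click) from a conversion model
bid = 2.50          # advertiser bid per purchase, in dollars

expected_value = p_click * p_purchase * bid
print(f"Expected value per impression: ${expected_value:.5f}")
```

This multiplication is only meaningful because each factor is a calibrated probability; with uncalibrated ranking scores the product would be arbitrary.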
Mean Squared Error (MSE)
For explicit ratings. User gave 4 stars, you predicted 3.2, penalty is (4 - 3.2)².
MSE assumes Gaussian noise around the true rating. It penalizes large errors quadratically, so a prediction that’s off by 2 is 4x worse than one that’s off by 1. This makes it sensitive to outliers – a single “hate-click” 1-star rating on something a user usually rates 4-5 can disproportionately warp the model.
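A minimal sketch of MSE on explicit ratings (the rating values here are illustrative):

```python
import torch
import torch.nn.functional as F

# Explicit ratings (1-5 stars) and the model's predictions
ratings = torch.tensor([4.0, 2.0, 5.0, 3.0])
preds = torch.tensor([3.2, 2.5, 4.8, 3.0])

# Mean of squared errors: (0.8^2 + 0.5^2 + 0.2^2 + 0.0^2) / 4
loss = F.mse_loss(preds, ratings)
print(f"MSE loss: {loss.item():.4f}")  # → MSE loss: 0.2325
```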
The Core Limitation: Low Pointwise Loss ≠ Good Ranking
This is the critical insight. A model can be well-calibrated per item but rank terribly.
```python
import numpy as np

# User has 5 candidate items. Items 0, 1, 2 were clicked (relevant).
labels = np.array([1, 1, 1, 0, 0])

# Model A: nails 4 out of 5 items with high confidence,
# but swaps item 2 (positive) and item 3 (negative)
scores_a = np.array([0.99, 0.98, 0.40, 0.60, 0.01])
# Ranked order: [0, 1, 3, 2, 4] — one negative leaks above a positive

# Model B: moderate confidence everywhere, but perfectly ordered
scores_b = np.array([0.70, 0.65, 0.60, 0.40, 0.35])
# Ranked order: [0, 1, 2, 3, 4] — all relevant items on top

def bce(y, p):
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def ndcg_at_k(labels, scores, k=5):
    order = np.argsort(-scores)
    gains = (2 ** labels[order[:k]] - 1) / np.log2(np.arange(2, k + 2))
    ideal = np.sort(2 ** labels - 1)[::-1][:k]
    ideal_dcg = np.sum(ideal / np.log2(np.arange(2, k + 2)))
    return np.sum(gains) / ideal_dcg

print(f"Model A — BCE: {bce(labels, scores_a):.4f}, NDCG@5: {ndcg_at_k(labels, scores_a):.4f}")
print(f"Model B — BCE: {bce(labels, scores_b):.4f}, NDCG@5: {ndcg_at_k(labels, scores_b):.4f}")
```
Model A — BCE: 0.3746, NDCG@5: 0.9675
Model B — BCE: 0.4480, NDCG@5: 1.0000
Model A is extremely confident on 4 out of 5 items (0.99, 0.98, 0.01 contribute almost zero BCE), pulling its average loss below Model B. But it swaps one positive-negative pair at the boundary – item 3 (negative, scored 0.60) sits above item 2 (positive, scored 0.40). Model B’s scores are moderate across the board (higher per-item BCE) but the ordering is perfect.
The takeaway: BCE rewards confidence on easy items. Ranking rewards getting the boundaries right. A model can “spend” its BCE budget being extremely sure about obvious cases while botching the one pair that actually matters for list quality. This gap is exactly why pairwise and listwise losses exist.
Pairwise Losses
Pairwise losses shift the question from “is this score correct?” to “is this item ranked above that one?” You sample a pair – one positive (interacted), one negative (not interacted) – and push the positive’s score higher.
This is a fundamental shift: you’re optimizing relative order, not absolute accuracy. That’s closer to what ranking actually is.
BPR (Bayesian Personalized Ranking)
The classic pairwise loss for implicit feedback. Derived from maximizing the posterior probability that a user prefers item \(i\) over item \(j\).
\[\mathcal{L}_{\text{BPR}} = -\log \sigma\left(\hat{s}_{\text{pos}} - \hat{s}_{\text{neg}}\right)\]
where \(\sigma\) is the sigmoid function. Intuition: the larger the gap \(\hat{s}_{\text{pos}} - \hat{s}_{\text{neg}}\), the closer \(\sigma(\cdot)\) gets to 1, and the closer \(-\log(\cdot)\) gets to 0.
```python
import torch
import torch.nn.functional as F

def bpr_loss(score_pos, score_neg):
    return -F.logsigmoid(score_pos - score_neg)

# 4 users, each with one positive and one sampled negative
score_pos = torch.tensor([2.5, 1.8, 3.0, 0.9])
score_neg = torch.tensor([1.0, 2.0, 0.5, 0.8])

loss = bpr_loss(score_pos, score_neg)
print(f"BPR loss: {loss}")
```
Look at user 2: positive scored 1.8, negative scored 2.0. The model ranks this pair wrong – the negative is higher. That pair contributes heavily to the loss. User 3 (gap of 2.5) contributes almost nothing. The loss naturally focuses on the violated pairs.
When to use: Matrix factorization with implicit feedback. BPR was the standard loss for collaborative filtering models (ALS, LightFM, etc.) before deep learning took over retrieval.
Margin / Hinge Loss
Enforce a minimum gap (margin \(m\)) between positive and negative scores.
\[\mathcal{L}_{\text{hinge}} = \max\left(0,\; m - (\hat{s}_{\text{pos}} - \hat{s}_{\text{neg}})\right)\]
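Applying this to the same four (positive, negative) score pairs from the BPR example, with margin \(m = 1.0\):

```python
import torch

def hinge_loss(score_pos, score_neg, margin=1.0):
    # Zero loss once the positive leads by at least `margin`; linear penalty otherwise
    return torch.clamp(margin - (score_pos - score_neg), min=0.0)

# Same 4 users as the BPR example
score_pos = torch.tensor([2.5, 1.8, 3.0, 0.9])
score_neg = torch.tensor([1.0, 2.0, 0.5, 0.8])

loss = hinge_loss(score_pos, score_neg)
print(f"Hinge loss: {loss}")
```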
Hinge loss: tensor([0.0000, 1.2000, 0.0000, 0.9000])
Breaking it down per pair:
User 1: margin - (2.5 - 1.0) = 1.0 - 1.5 = -0.5 → clamped to 0 (satisfied)
User 2: 1.0 - (1.8 - 2.0) = 1.0 - (-0.2) = 1.2 → loss = 1.2 (violated, wrong order AND no margin)
User 3: 1.0 - (3.0 - 0.5) = 1.0 - 2.5 = -1.5 → clamped to 0 (satisfied)
User 4: 1.0 - (0.9 - 0.8) = 1.0 - 0.1 = 0.9 → loss = 0.9 (right order, but gap too small)
The key difference from BPR: once the gap exceeds \(m\), hinge loss is exactly zero – no further gradient. BPR always pushes scores apart (asymptotically). This makes hinge loss more suitable for embedding learning where you want items to be “far enough” apart but don’t need infinite separation.
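A quick autograd check of that difference, looking at the gradient with respect to the score gap (margin \(m = 1.0\) assumed):

```python
import torch
import torch.nn.functional as F

# Gradient of each loss w.r.t. the score gap (pos - neg), at increasing gap values
results = {}
for gap_val in [0.5, 1.5, 3.0]:
    gap = torch.tensor(gap_val, requires_grad=True)
    (-F.logsigmoid(gap)).backward()          # BPR loss
    bpr_grad = gap.grad.item()

    gap = torch.tensor(gap_val, requires_grad=True)
    torch.clamp(1.0 - gap, min=0.0).backward()  # hinge loss, margin m = 1.0
    hinge_grad = gap.grad.item()

    results[gap_val] = (bpr_grad, hinge_grad)
    print(f"gap={gap_val}: BPR grad = {bpr_grad:+.4f}, hinge grad = {hinge_grad:+.4f}")

# BPR keeps pushing (its gradient decays but never reaches zero);
# hinge goes silent as soon as the margin is satisfied (gap >= 1.0)
```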
When to use: Metric learning, embedding spaces (think: learning user/item embeddings where cosine distance matters). Triplet loss is the three-term version of this (anchor, positive, negative).
BPR vs. Hinge: When to Pick Which
| | BPR | Hinge |
|---|---|---|
| Gradient when pair is correct | Still nonzero (decaying) | Zero once margin met |
| Sensitive to margin hyperparameter | No | Yes (\(m\) matters a lot) |
| Probabilistic interpretation | Yes (posterior maximization) | No |
| Common in | Collaborative filtering | Embedding / metric learning |
The Core Limitation: Not All Pairs Are Equal
Pairwise losses treat every (positive, negative) pair with equal weight. But ranking is top-heavy – swapping items at positions 1 and 2 is catastrophic, while swapping at positions 99 and 100 is irrelevant.
```python
import numpy as np

# 10 items, first 3 are relevant
labels = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

# Model ranks perfectly except it swaps ONE pair
scores_top_swap = np.array([0.95, 0.50, 0.85, 0.90, 0.45, 0.40, 0.35, 0.30, 0.25, 0.20])
# Rank order: [0, 3, 2, 1, 4, 5, 6, 7, 8, 9]
# Position 2 is a negative (item 3). User sees a bad rec at slot 2.

scores_bot_swap = np.array([0.95, 0.90, 0.85, 0.45, 0.40, 0.35, 0.30, 0.25, 0.15, 0.20])
# Rank order: [0, 1, 2, 3, 4, 5, 6, 7, 9, 8]
# Items 8 and 9 (both negative) swapped. User never scrolls that far. Zero impact.

# Both models violate exactly ONE pair, so BPR/hinge treat them identically.
# But the ranking quality is vastly different.

def ndcg_at_k(labels, scores, k=5):
    order = np.argsort(-scores)
    gains = (2 ** labels[order[:k]] - 1) / np.log2(np.arange(2, k + 2))
    ideal = np.sort(2 ** labels - 1)[::-1][:k]
    ideal_dcg = np.sum(ideal / np.log2(np.arange(2, k + 2)))
    return np.sum(gains) / ideal_dcg

print(f"Top swap — NDCG@5: {ndcg_at_k(labels, scores_top_swap):.4f}")
print(f"Bottom swap — NDCG@5: {ndcg_at_k(labels, scores_bot_swap):.4f}")
```
Top swap — NDCG@5: 0.9060
Bottom swap — NDCG@5: 1.0000
One pair violation at the top costs you ~9% NDCG. The same violation at the bottom costs nothing. BPR and hinge are blind to this distinction. This is exactly the gap that LambdaRank fills – it weights each pair by |ΔNDCG|, so top-of-list swaps get massive gradients and bottom-of-list swaps get near-zero.
The takeaway: Pairwise losses get you from “predict accurately” to “rank correctly,” but they still don’t know where in the list the pair sits. That’s what listwise losses solve.
Listwise Losses
Listwise losses operate on the entire ranked list at once. Instead of asking “is A above B?” for one pair, they ask “how good is this entire ordering?” This directly aligns the loss with the metrics we actually evaluate (NDCG, MAP).
The challenge: ranking metrics like NDCG involve sorting, which is non-differentiable. You can’t backpropagate through argmax. Every listwise loss is a different strategy for getting around this.
LambdaRank
The key insight: you don’t need the loss function itself, you just need its gradients.
LambdaRank starts from a pairwise setup (every pair of items where one is relevant and the other isn’t), then multiplies each pair’s gradient by |ΔNDCG| – the change in NDCG that would result from swapping those two items.
The |ΔNDCG| term does all the heavy lifting:
- Swapping items at positions 1 and 2? Huge ΔNDCG, huge gradient.
- Swapping items at positions 98 and 99? Tiny ΔNDCG, near-zero gradient.
- Swapping two items that have the same relevance? Zero ΔNDCG, no gradient at all.
import numpy as npdef compute_dcg(relevances, k=None):if k isNone: k =len(relevances) rel = np.array(relevances[:k]) discounts = np.log2(np.arange(2, k +2))return np.sum((2**rel -1) / discounts)def compute_delta_ndcg(relevances, i, j):"""NDCG change from swapping items at positions i and j.""" k =len(relevances) ideal_dcg = compute_dcg(sorted(relevances, reverse=True), k)if ideal_dcg ==0: return0.0# Current NDCG ndcg_before = compute_dcg(relevances, k) / ideal_dcg# Swap and recompute swapped =list(relevances) swapped[i], swapped[j] = swapped[j], swapped[i] ndcg_after = compute_dcg(swapped, k) / ideal_dcgreturnabs(ndcg_after - ndcg_before)# 6 items in current model order, relevance labelsrelevances = [3, 0, 2, 1, 0, 3]# Compare swapping different pairspairs = [(0, 1), (1, 2), (3, 4), (4, 5)]for i, j in pairs: delta = compute_delta_ndcg(relevances, i, j)print(f"Swap positions {i}<->{j} (rel {relevances[i]}<->{relevances[j]}): |ΔNDCG| = {delta:.4f}")# Swap positions 0<->1 (rel 3<->0): |ΔNDCG| = 0.1936 ← huge, top of list + big relevance gap# Swap positions 1<->2 (rel 0<->2): |ΔNDCG| = 0.0294 ← moderate# Swap positions 3<->4 (rel 1<->0): |ΔNDCG| = 0.0033 ← tiny, bottom of list + small gap# Swap positions 4<->5 (rel 0<->3): |ΔNDCG| = 0.0161 ← moderate, big relevance gap but low position
The position 0↔1 swap carries roughly 60x the gradient weight of the position 3↔4 swap. The model learns to fight hard for the top of the list and not waste capacity on the tail. This is exactly what NDCG rewards.
LambdaMART = LambdaRank + gradient-boosted decision trees (GBDT). Instead of a neural net, the model is an ensemble of trees, each fitted to the lambda gradients. This powers XGBoost/LightGBM rankers and is still the backbone of many production search engines.
When to use: When you have graded relevance labels and care about NDCG. Search ranking, recommendation re-ranking, any setting where a learning-to-rank model takes features and produces a final ordering.
Softmax Cross-Entropy (Sampled Softmax)
Treat the relevant item as the correct “class” among all candidates. Given user \(u\), the probability of the positive item \(i^+\) is:
\[P(i^+ \mid u) = \frac{\exp(\hat{s}_{u,i^+})}{\sum_{j \in \mathcal{I}} \exp(\hat{s}_{u,j})}\]
This is identical to classification cross-entropy where the number of classes equals the number of items. Computing the denominator over millions of items is infeasible, so in practice you sample:
```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(score_pos, scores_neg):
    """
    score_pos:  (batch_size,)          — score for the positive item per user
    scores_neg: (batch_size, num_neg)  — scores for sampled negatives
    """
    # Positive is "class 0", negatives are the rest
    # Shape: (batch_size, 1 + num_neg)
    logits = torch.cat([score_pos.unsqueeze(1), scores_neg], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long)  # correct class = 0
    return F.cross_entropy(logits, labels)

batch_size, num_neg = 4, 99
score_pos = torch.tensor([3.0, 1.5, 2.8, 0.5])
scores_neg = torch.randn(batch_size, num_neg)  # 99 random negatives per user

loss = sampled_softmax_loss(score_pos, scores_neg)
print(f"Sampled softmax loss: {loss.item():.4f}")
```
Sampled softmax loss: 3.1978
The gradient pushes the positive score up and all negative scores down, but weighted by their softmax probability. High-scoring negatives (hard negatives) get pushed down harder. This is a natural hard-negative weighting built into the loss.
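You can read that weighting directly off the gradients (a single-user sketch; the logit values are illustrative):

```python
import torch
import torch.nn.functional as F

# One user: class 0 is the positive; the rest are negatives of varying difficulty
logits = torch.tensor([[2.0, 1.8, 0.5, -1.0, -3.0]], requires_grad=True)
loss = F.cross_entropy(logits, torch.tensor([0]))
loss.backward()

# Each negative's gradient equals its softmax probability: the hard negative
# (scored 1.8) is pushed down far harder than the easy one (scored -3.0)
print(logits.grad)
```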
When to use: Two-tower retrieval models. YouTube’s Deep Retrieval, Google’s two-tower, Meta’s EBR – they all use some variant of this. It’s the standard loss for learning embedding spaces where you retrieve via approximate nearest neighbor.
The Trade-off
| | Pointwise | Pairwise | Listwise |
|---|---|---|---|
| Optimizes | Per-item accuracy | Relative order of pairs | Full list quality |
| Gradient signal | From individual labels | From pair comparisons | From list-level metric |
| Complexity | O(n) | O(n²) worst case | O(n²) for lambda, O(n) for softmax |
| Position-aware | No | No | Yes (LambdaRank), partially (softmax) |
| Production use | CTR prediction | CF models (BPR) | Search ranking, retrieval |
The movement from pointwise → listwise is a movement from “easy to optimize, loose proxy” to “hard to optimize, tight alignment with the real objective.” Most production systems don’t pick one – they use different losses at different stages.
Contrastive Losses
Contrastive losses learn a similarity space: items that should be close get pulled together, items that should be far get pushed apart. The output isn’t a ranking score directly – it’s an embedding where distance is the score.
This is the engine behind modern retrieval: CLIP, SimCLR, two-tower recommenders. The core question shifts from “what score should this item get?” to “should these two embeddings be close or far?”
InfoNCE
The dominant contrastive loss. Given a query \(q\), one positive key \(k^+\), and \(N-1\) negative keys \(\{k^-\}\):
\[\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\left(\mathrm{sim}(q, k^+)/\tau\right)}{\exp\left(\mathrm{sim}(q, k^+)/\tau\right) + \sum_{k^-} \exp\left(\mathrm{sim}(q, k^-)/\tau\right)}\]
This is literally softmax cross-entropy where the positive is the correct class and all negatives are the other classes. The connection is not an analogy – it’s algebraically identical.
Temperature \(\tau\) controls the sharpness of the similarity distribution. This is worth understanding deeply because getting it wrong can quietly ruin your model.
```python
import torch
import torch.nn.functional as F

# 5 items with cosine similarities to the query
similarities = torch.tensor([0.9, 0.7, 0.5, 0.3, 0.1])

for temp in [0.01, 0.07, 0.5, 1.0]:
    probs = F.softmax(similarities / temp, dim=0)
    print(f"τ={temp:<5} → probs: [{', '.join(f'{p:.3f}' for p in probs)}]")
```
Low τ (0.01-0.05): Only the hardest negatives contribute gradients. Learns very fine-grained distinctions but training is unstable – the loss landscape is spiky.
High τ (0.5-1.0): All negatives contribute roughly equally. Stable training but the model doesn’t learn to separate similar items. Everything clusters into broad blobs.
Sweet spot (0.05-0.1): Sharp enough to focus on hard negatives, smooth enough for stable gradients. CLIP uses 0.07 (learned). SimCLR uses 0.1. Most two-tower recommenders land in this range.
In-Batch Negatives
Instead of explicitly sampling negatives, use the other positives in the same batch as negatives. If the batch has \(B\) (query, item) pairs, each query treats its own item as positive and the other \(B-1\) items as negatives.
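A minimal sketch of in-batch InfoNCE (random embeddings stand in for the two tower outputs; \(\tau = 0.07\) assumed):

```python
import torch
import torch.nn.functional as F

B, d = 4, 8
torch.manual_seed(0)
query_emb = F.normalize(torch.randn(B, d), dim=1)  # query tower output
item_emb = F.normalize(torch.randn(B, d), dim=1)   # item tower output

sim = query_emb @ item_emb.T / 0.07  # (B, B): row i = query i vs every item in batch
labels = torch.arange(B)             # positive for query i is item i (the diagonal)
loss = F.cross_entropy(sim, labels)
print(f"In-batch InfoNCE: {loss.item():.4f}")
```

Each row of `sim` is one query against one positive and \(B-1\) in-batch negatives, so the whole batch needs only one matrix multiply.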
Simple, efficient, no negative sampler needed. But there’s a catch: popular items appear in many batches, so they show up as negatives disproportionately often. The model learns to push popular items away from everything, which is the opposite of what you want (popular items are often relevant).
The fix: log-popularity correction. Subtract \(\log(p_j)\) from each negative's score, where \(p_j\) is the item's sampling probability (proportional to its frequency):
\[\hat{s}^{\,\text{corrected}}_{u,j} = \hat{s}_{u,j} - \log(p_j)\]
This comes from the theory of sampled softmax: if your negatives are sampled from distribution \(Q\) but you want to approximate the full softmax, you subtract \(\log Q(j)\) from each sampled negative. Google’s two-tower retrieval paper showed this correction matters a lot in practice.
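A sketch of the correction applied to in-batch logits (the sampling probabilities here are made up for illustration; in practice they are estimated from item frequency):

```python
import torch
import torch.nn.functional as F

def corrected_in_batch_loss(sim, item_probs):
    """sim: (B, B) similarity logits; item_probs: (B,) estimated sampling prob per item."""
    logits = sim - torch.log(item_probs)  # logQ correction, broadcast over item columns
    return F.cross_entropy(logits, torch.arange(sim.size(0)))

torch.manual_seed(0)
sim = torch.randn(4, 4)
item_probs = torch.tensor([0.001, 0.2, 0.01, 0.005])  # item 1 is very popular

uncorrected = F.cross_entropy(sim, torch.arange(4))
corrected = corrected_in_batch_loss(sim, item_probs)
print(f"uncorrected: {uncorrected.item():.4f}, corrected: {corrected.item():.4f}")
```

The popular item's logit gets a small subtraction (its \(\log p_j\) is closest to zero), so it is penalized less as a negative relative to rare items.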
Hard Negative Mining
Random negatives are easy to distinguish – a cooking video vs. a car review. The model learns to separate these quickly and then stops improving. The interesting pairs are items that are similar but wrong – a cooking video the user didn’t click vs. one they did.
The two-phase strategy used in production:
Phase 1: Train with random/in-batch negatives. Get a baseline model that has reasonable embeddings.
Phase 2: Mine hard negatives from the model’s own predictions. For each query, retrieve the top-K items using the Phase 1 model. Items in the top-K that are not relevant are your hard negatives. Re-train (or fine-tune) with a mix of hard and random negatives.
```python
# Pseudocode for hard negative mining
def mine_hard_negatives(model, queries, positive_items, all_items, k=100, num_hard=10):
    """
    For each query, retrieve top-K with the current model, filter out
    true positives, take the top `num_hard` as hard negatives.
    """
    hard_negatives = []
    for q, pos_set in zip(queries, positive_items):
        # Retrieve top-K from current model
        scores = model.score(q, all_items)           # score all items
        top_k_indices = scores.argsort()[-k:][::-1]  # top K by score
        # Filter: keep only items NOT in the positive set
        hard_negs = [idx for idx in top_k_indices if idx not in pos_set]
        hard_negatives.append(hard_negs[:num_hard])
    return hard_negatives

# Training loop sketch:
# Phase 1: train with random negatives for N epochs
# Phase 2: every M steps, re-mine hard negatives, mix 50/50 with random
```
Why mix hard and random? Pure hard negatives cause the model to oscillate – it fixes one hard pair, which changes the top-K, which creates new hard pairs that break old ones. Mixing with random negatives acts as a regularizer, keeping the embedding space globally coherent while fine-tuning boundaries.
How much does this matter? Typically the single biggest improvement you can make to a retrieval model. Going from random-only to hard negative mining can improve Recall@100 by 10-30% depending on the domain.
The Connection Back to Other Losses
InfoNCE with in-batch negatives is algebraically equivalent to:
Sampled softmax (see the Listwise Losses section) when the “classes” are items in the batch
Multi-class cross-entropy where each item gets one positive class
The differences are in framing, not math:
Softmax CE: “classify this query to the right item”
InfoNCE: “make this query-item pair more similar than other pairs”
Both produce the same gradients.
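A quick numerical check of the equivalence (random similarities, \(\tau = 0.07\)):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
sim = torch.randn(4, 4)  # query-item similarities; diagonal entries are the positives
tau = 0.07
logits = sim / tau
labels = torch.arange(4)

# InfoNCE written out by hand: -log softmax probability of the positive
log_probs = logits - torch.logsumexp(logits, dim=1, keepdim=True)
infonce = -log_probs[labels, labels].mean()

# Plain softmax cross-entropy over the same logits
ce = F.cross_entropy(logits, labels)

print(torch.allclose(infonce, ce))  # True — same loss, hence same gradients
```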
The real difference is what negatives you use and how you sample them, not the loss formula itself.
Choosing the Right Loss
By Stage
| Stage | Loss | Why |
|---|---|---|
| Retrieval (two-tower, ANN) | Sampled softmax / InfoNCE | You need embeddings where nearest neighbors are relevant items. Contrastive losses learn this space directly; softmax over the batch is cheap and effective. |
| Ranking (CTR/CVR prediction) | BCE | Downstream systems need calibrated probabilities, not just orderings. Auction mechanisms, score fusion across multiple models, and business rules all depend on “this item has a 4.2% click probability,” not “this item is better than that one.” |
| Ranking (order optimization) | LambdaRank / LambdaMART | When the final metric is NDCG or MAP and you have graded relevance labels. Common in search, less common in recommendations where labels are binary. |
| Embedding learning (similarity) | Contrastive / Triplet / Hinge | Goal is a distance metric, not a ranked list. User-to-user similarity, item-to-item similarity, content understanding. |
| Collaborative filtering (MF) | BPR | Clean pairwise optimization for implicit feedback with matrix factorization. If you’re using ALS or SGD-based MF, BPR is the default. |
By Data Type
| Your Data Looks Like | Loss | Reasoning |
|---|---|---|
| Binary implicit (click/no-click) | BCE or BPR | BCE if you need probabilities, BPR if you need ranking |
| Binary implicit + embeddings | InfoNCE / sampled softmax | Learning retrieval representations |
| Explicit ratings (1-5 stars) | MSE | Predicting the rating value itself |
| Graded relevance (0/1/2/3) | LambdaRank | Multiple relevance levels, NDCG is the right metric |
| Pairwise preferences (A > B) | BPR / Hinge | Data is already in pairwise form |
Common Mistakes
1. Using BCE for retrieval. BCE optimizes per-item accuracy, not embedding geometry. Your embeddings might be accurate classifiers but form a useless nearest-neighbor space – items with similar scores don’t end up close in embedding space.
2. Skipping log-popularity correction with in-batch negatives. Popular items get pushed away from everything. Your retrieval model develops a systematic bias against the items users actually want most.
3. Using only random negatives. The model quickly learns to separate obviously different items and then plateaus. Hard negative mining is almost always the highest-ROI improvement for retrieval quality.
4. Choosing LambdaRank when labels are binary. With binary labels (relevant / not relevant), ΔNDCG reduces to a function of position only. LambdaRank still works but the advantage over pairwise losses is smaller. LambdaRank shines with graded relevance (0/1/2/3) where the gain differences between grades create richer gradient signals.
5. Optimizing one loss, evaluating another. Training on BCE, evaluating on NDCG, and being surprised they don’t correlate. As the pointwise-loss section showed, these can diverge. If your evaluation metric is NDCG, at minimum add a pairwise or listwise loss component.
Multi-Loss Training
Production systems often combine losses. A retrieval model might use:
\[\mathcal{L} = \mathcal{L}_{\text{InfoNCE}} + \alpha \cdot \mathcal{L}_{\text{BCE}}\]
InfoNCE shapes the embedding space for retrieval; the BCE term on a side prediction head keeps the model calibrated for downstream use. The weight \(\alpha\) is tuned to prevent one loss from dominating.
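A sketch of what such a combination can look like in code (the head names and \(\alpha = 0.3\) are illustrative, not taken from any specific system):

```python
import torch
import torch.nn.functional as F

def retrieval_loss(sim, ctr_logits, click_labels, alpha=0.3):
    """
    sim:          (B, B) in-batch similarity logits, diagonal = positives
    ctr_logits:   (B,) logits from a side prediction head
    click_labels: (B,) binary click labels
    """
    infonce = F.cross_entropy(sim, torch.arange(sim.size(0)))
    bce = F.binary_cross_entropy_with_logits(ctr_logits, click_labels)
    return infonce + alpha * bce

torch.manual_seed(0)
loss = retrieval_loss(torch.randn(4, 4), torch.randn(4), torch.tensor([1.0, 0.0, 1.0, 0.0]))
print(f"Combined loss: {loss.item():.4f}")
```

In practice \(\alpha\) is swept on a validation set, watching both retrieval recall and calibration so neither term silently dominates.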