Ranking Metrics
If you’re preparing for ML system design or ranking interviews, you will be asked to define and compare these metrics. I’ve organized them from simplest to most expressive, then covered the online vs. offline gap.
Let’s start with a shared setup we’ll use across all examples:
import numpy as np
# Simulated scenario: a query returns 10 items
# Binary relevance: 1 = relevant, 0 = not relevant
binary_relevance = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]
# Graded relevance: scale of 0-5 (e.g., star ratings)
graded_relevance = [3, 0, 2, 0, 0, 5, 4, 0, 0, 1]
# Total relevant items that exist in the entire corpus (not just top-K)
total_relevant = 8
1. Precision@K
What it measures: Of the top-K items you returned, how many were actually relevant?
- Simple to compute and explain
- Weakness: treats all K positions equally. Being relevant at position 1 vs. position K counts the same.
def precision_at_k(relevance, k):
"""Fraction of top-K items that are relevant."""
top_k = relevance[:k]
return sum(top_k) / k
for k in [3, 5, 10]:
print(f"Precision@{k}: {precision_at_k(binary_relevance, k):.3f}")
# Precision@3: 0.667 (2 out of 3 are relevant)
# Precision@5: 0.400 (2 out of 5 are relevant)
# Precision@10: 0.500 (5 out of 10 are relevant)
Notice how Precision@5 is lower than Precision@10 here. That’s because positions 6 (relevant) and 7 (relevant) pull the number back up. This metric has no notion of “better to be relevant earlier.”
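To make that position-blindness concrete, here is a small contrast between two hypothetical rankings (the lists below are illustrative, not part of the shared setup):

```python
def precision_at_k(relevance, k):
    """Fraction of top-K items that are relevant."""
    return sum(relevance[:k]) / k

front_loaded = [1, 1, 0, 0, 0]  # relevant items at positions 1-2
back_loaded  = [0, 0, 0, 1, 1]  # relevant items at positions 4-5
print(precision_at_k(front_loaded, 5))  # 0.4
print(precision_at_k(back_loaded, 5))   # 0.4 -- indistinguishable to the metric
```

The front-loaded ranking is clearly better for the user, yet Precision@5 cannot tell them apart; that gap is what the position-aware metrics below fix.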
2. Recall@K
What it measures: Of all relevant items that exist, how many made it into your top-K list?
def recall_at_k(relevance, k, total_relevant):
"""Fraction of all relevant items captured in top-K."""
top_k = relevance[:k]
return sum(top_k) / total_relevant
for k in [3, 5, 10]:
print(f"Recall@{k}: {recall_at_k(binary_relevance, k, total_relevant):.3f}")
# Recall@3: 0.250 (2 out of 8 total relevant)
# Recall@5: 0.250 (2 out of 8 total relevant)
# Recall@10: 0.625 (5 out of 8 total relevant)
Useful for retrieval-stage evaluation where you want to make sure you’re not missing good candidates before re-ranking.
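One practical consequence: Recall@K can only go up as K grows, so retrieval stages typically sweep K and pick the point where recall saturates. A quick sketch using the shared setup (redefined here so the snippet stands alone):

```python
def recall_at_k(relevance, k, total_relevant):
    """Fraction of all relevant items captured in top-K."""
    return sum(relevance[:k]) / total_relevant

binary_relevance = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]  # shared setup from above
total_relevant = 8
recalls = [recall_at_k(binary_relevance, k, total_relevant) for k in range(1, 11)]
assert all(a <= b for a, b in zip(recalls, recalls[1:]))  # never decreases with K
print(recalls[-1])  # 0.625 -- even K=10 misses 3 of the 8 relevant items
```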
3. Mean Reciprocal Rank (MRR)
What it measures: How quickly does the user find the first relevant result? You take 1/rank of the first relevant item and average across all queries.
MRR = (1/|Q|) · Σ_q 1/rank_q
- Great for navigational queries (user wants one specific thing)
- Limitation: completely ignores every relevant item after the first one
def reciprocal_rank(relevance):
"""1/rank of the first relevant item."""
for i, rel in enumerate(relevance):
if rel == 1:
return 1.0 / (i + 1)
return 0.0
def mrr(queries):
"""Mean Reciprocal Rank across multiple queries."""
return np.mean([reciprocal_rank(q) for q in queries])
# Three different queries with their ranked results
query_results = [
[1, 0, 0, 0, 0], # first relevant at position 1 -> RR = 1.0
[0, 0, 1, 0, 0], # first relevant at position 3 -> RR = 0.333
[0, 1, 0, 0, 1], # first relevant at position 2 -> RR = 0.5
]
for i, q in enumerate(query_results):
print(f"Query {i+1}: RR = {reciprocal_rank(q):.3f}")
print(f"MRR = {mrr(query_results):.3f}")
# Query 1: RR = 1.000
# Query 2: RR = 0.333
# Query 3: RR = 0.500
# MRR = 0.611
Query 3 has two relevant items, but MRR only cares about the first one at position 2. That’s the key limitation.
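A minimal illustration of that limitation, using two hypothetical rankings whose first hit sits in the same position:

```python
def reciprocal_rank(relevance):
    """1/rank of the first relevant item."""
    for i, rel in enumerate(relevance):
        if rel == 1:
            return 1.0 / (i + 1)
    return 0.0

one_hit    = [0, 1, 0, 0, 0]  # a single relevant item at position 2
three_hits = [0, 1, 1, 1, 0]  # three relevant items, first also at position 2
print(reciprocal_rank(one_hit))     # 0.5
print(reciprocal_rank(three_hits))  # 0.5 -- the extra hits change nothing
```

If you need credit for every relevant item, not just the first, MAP or NDCG is the better choice.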
4. Mean Average Precision (MAP)
What it measures: The standard metric for binary relevance. Unlike MRR, it considers all relevant items and where they appear.
AP = (1/R) · Σ_{k=1}^{n} Precision@k · rel(k)
def average_precision(relevance, total_relevant):
"""AP: average of Precision@k at each relevant position."""
hits = 0
sum_precision = 0.0
for i, rel in enumerate(relevance):
if rel == 1:
hits += 1
precision_at_i = hits / (i + 1)
sum_precision += precision_at_i
print(f" Position {i+1}: Precision@{i+1} = {hits}/{i+1} = {precision_at_i:.3f}")
return sum_precision / total_relevant
print("Step-by-step AP calculation:")
ap = average_precision(binary_relevance, total_relevant)
print(f"\nAP = {ap:.3f}")
# Step-by-step AP calculation:
#   Position 1: Precision@1 = 1/1 = 1.000
#   Position 3: Precision@3 = 2/3 = 0.667
#   Position 6: Precision@6 = 3/6 = 0.500
#   Position 7: Precision@7 = 4/7 = 0.571
#   Position 10: Precision@10 = 5/10 = 0.500
# AP = 0.405
MAP is just the mean of AP across multiple queries. The step-by-step printout shows exactly what makes MAP position-aware: relevant items ranked higher get a better Precision@k, which pulls the average up.
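A compact MAP helper, reusing the three-query example from the MRR section (redefined here so the snippet stands alone; it assumes every relevant item appears somewhere in the returned list, so R is just the count of 1s per query):

```python
def average_precision(relevance, total_relevant):
    """AP: average of Precision@k over the relevant positions."""
    hits, sum_precision = 0, 0.0
    for i, rel in enumerate(relevance):
        if rel == 1:
            hits += 1
            sum_precision += hits / (i + 1)
    return sum_precision / total_relevant

def mean_average_precision(queries):
    # Assumption: all relevant items were retrieved, so R = sum(q) per query
    return sum(average_precision(q, sum(q)) for q in queries) / len(queries)

query_results = [
    [1, 0, 0, 0, 0],  # AP = 1.000
    [0, 0, 1, 0, 0],  # AP = 0.333
    [0, 1, 0, 0, 1],  # AP = (1/2 + 2/5) / 2 = 0.450
]
print(round(mean_average_precision(query_results), 3))  # 0.594
```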
5. NDCG@K (Normalized Discounted Cumulative Gain)
The most widely used ranking metric in practice; most production search and recommendation teams use NDCG as their primary offline metric.
DCG@K = Σ_{i=1}^{K} (2^{rel_i} - 1) / log₂(i+1)
NDCG@K = DCG@K / IDCG@K
- Numerator (exponential gain): A 5-star item contributes way more than a 3-star item
- Denominator (log discount): Positions further down matter less
def dcg_at_k(relevance, k):
"""Discounted Cumulative Gain at K."""
relevance = np.array(relevance[:k])
positions = np.arange(1, len(relevance) + 1)
gains = 2**relevance - 1 # exponential gain
discounts = np.log2(positions + 1) # log discount
# Show the math for each position
for i in range(len(relevance)):
print(f" Pos {i+1}: gain={gains[i]:5.1f}, "
f"discount={discounts[i]:.3f}, "
f"contribution={gains[i]/discounts[i]:.3f}")
return np.sum(gains / discounts)
def ndcg_at_k(relevance, k):
"""Normalized DCG: actual DCG / ideal DCG."""
print("Actual ranking:")
dcg = dcg_at_k(relevance, k)
print(f" DCG@{k} = {dcg:.3f}\n")
# Ideal ranking: sort by relevance descending
ideal = sorted(relevance, reverse=True)
print("Ideal ranking (sorted desc):")
idcg = dcg_at_k(ideal, k)
print(f" IDCG@{k} = {idcg:.3f}\n")
ndcg = dcg / idcg if idcg > 0 else 0.0
return ndcg
K = 6
ndcg = ndcg_at_k(graded_relevance, K)
print(f"NDCG@{K} = {ndcg:.3f}")
# Actual ranking:
#   Pos 1: gain=  7.0, discount=1.000, contribution=7.000
#   Pos 2: gain=  0.0, discount=1.585, contribution=0.000
#   Pos 3: gain=  3.0, discount=2.000, contribution=1.500
#   Pos 4: gain=  0.0, discount=2.322, contribution=0.000
#   Pos 5: gain=  0.0, discount=2.585, contribution=0.000
#   Pos 6: gain= 31.0, discount=2.807, contribution=11.042
#   DCG@6 = 19.542
# Ideal ranking (sorted desc):
#   Pos 1: gain= 31.0, discount=1.000, contribution=31.000
#   Pos 2: gain= 15.0, discount=1.585, contribution=9.464
#   Pos 3: gain=  7.0, discount=2.000, contribution=3.500
#   Pos 4: gain=  3.0, discount=2.322, contribution=1.292
#   Pos 5: gain=  1.0, discount=2.585, contribution=0.387
#   Pos 6: gain=  0.0, discount=2.807, contribution=0.000
#   IDCG@6 = 45.643
# NDCG@6 = 0.428
The 5-star item at position 6 contributes 11.042 in the actual ranking, but would contribute 31.0 at position 1 in the ideal ranking. That’s where the “loss” comes from. NDCG captures exactly this.
Why NDCG dominates:
- Handles graded relevance (not just binary)
- Position-aware
- Normalized to [0, 1], so you can compare across queries
- Well-studied with known statistical properties
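To see that position sensitivity directly, here is a compact stdlib-only sketch of the same formula, comparing the actual ranking against a hypothetical one where the 5-star item has been moved to position 1:

```python
from math import log2

def ndcg_at_k(relevance, k):
    """NDCG@K with exponential gain and log2 discount (minimal sketch)."""
    def dcg(rels):
        return sum((2**r - 1) / log2(i + 2) for i, r in enumerate(rels[:k]))
    idcg = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / idcg if idcg > 0 else 0.0

graded  = [3, 0, 2, 0, 0, 5, 4, 0, 0, 1]  # shared setup from above
swapped = [5, 0, 2, 0, 0, 3, 4, 0, 0, 1]  # hypothetical: 5-star item at position 1
print(round(ndcg_at_k(graded, 6), 3))   # 0.428
print(round(ndcg_at_k(swapped, 6), 3))  # 0.767
```

Moving a single high-grade item to the top nearly doubles NDCG@6, which is exactly the behavior you want from an offline ranking metric.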
6. AUC and GAUC
What it measures: The probability that a randomly chosen positive item is scored higher than a randomly chosen negative item.
def auc_from_scratch(labels, scores):
"""
AUC via the Wilcoxon-Mann-Whitney interpretation:
fraction of (positive, negative) pairs where positive is scored higher.
"""
positives = [s for s, l in zip(scores, labels) if l == 1]
negatives = [s for s, l in zip(scores, labels) if l == 0]
concordant = 0
total_pairs = 0
for pos_score in positives:
for neg_score in negatives:
total_pairs += 1
if pos_score > neg_score:
concordant += 1
elif pos_score == neg_score:
concordant += 0.5 # ties count as half a concordant pair
auc = concordant / total_pairs
return auc
# Model scores and true labels
labels = [1, 0, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.8, 0.4, 0.7, 0.2, 0.5, 0.6]
print(f"AUC = {auc_from_scratch(labels, scores):.3f}")
# AUC = 1.000
# GAUC: AUC per user, then weighted average
def gauc(user_labels, user_scores):
"""Group AUC: per-user AUC, weighted by number of interactions."""
total_weight = 0
weighted_auc = 0
for user_id in user_labels:
labels = user_labels[user_id]
scores = user_scores[user_id]
# Skip users with only positives or only negatives
if len(set(labels)) < 2:
continue
user_auc = auc_from_scratch(labels, scores)
weight = len(labels)
weighted_auc += user_auc * weight
total_weight += weight
print(f" User {user_id}: AUC={user_auc:.3f}, weight={weight}")
return weighted_auc / total_weight
user_labels = {"u1": [1, 0, 1, 0], "u2": [1, 0, 0, 1], "u3": [1, 1, 1, 0]}
user_scores = {"u1": [0.9, 0.2, 0.8, 0.3], "u2": [0.6, 0.7, 0.4, 0.5], "u3": [0.9, 0.8, 0.7, 0.1]}
print(f"\nGAUC = {gauc(user_labels, user_scores):.3f}")
#   User u1: AUC=1.000, weight=4
#   User u2: AUC=0.500, weight=4
#   User u3: AUC=1.000, weight=4
# GAUC = 0.833
Global AUC can be misleading when users have very different baseline click rates. GAUC fixes this by measuring discrimination within each user’s session, which is what actually matters for personalized ranking.
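A toy illustration of that failure mode, with hypothetical scores: both users are ranked perfectly within their own sessions, yet global AUC drops because one user's scores sit uniformly above the other's:

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (pos, neg) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    pairs = [(p, n) for p in pos for n in neg]
    return sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in pairs) / len(pairs)

u_hi_labels, u_hi_scores = [1, 0], [0.9, 0.8]  # high-baseline user
u_lo_labels, u_lo_scores = [1, 0], [0.3, 0.2]  # low-baseline user

global_auc = auc(u_hi_labels + u_lo_labels, u_hi_scores + u_lo_scores)
per_user_gauc = (auc(u_hi_labels, u_hi_scores) + auc(u_lo_labels, u_lo_scores)) / 2
print(global_auc)     # 0.75 -- penalized for the cross-user score gap
print(per_user_gauc)  # 1.0  -- within-user ranking is perfect
```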
7. Online Metrics (What Actually Matters)
All the metrics above are offline proxies. In production, what you ship gets measured by:
- Engagement: CTR, session length, items per session, return rate
- Business: CVR, RPM, LTV, retention
# Quick example: CTR computation from impression logs
impressions = [
{"item": "A", "shown": True, "clicked": True},
{"item": "B", "shown": True, "clicked": False},
{"item": "C", "shown": True, "clicked": True},
{"item": "D", "shown": True, "clicked": False},
{"item": "E", "shown": True, "clicked": True},
]
clicks = sum(1 for imp in impressions if imp["clicked"])
total = len(impressions)
ctr = clicks / total
print(f"Clicks: {clicks}, Impressions: {total}, CTR: {ctr:.1%}")
# Clicks: 3, Impressions: 5, CTR: 60.0%
The offline-online gap is something senior and principal candidates need to speak to directly. Offline metrics (NDCG, AUC) and online A/B test results don’t always agree. A model that wins on NDCG can lose on CTR, and vice versa. Strategies to manage this gap include:
- Counterfactual evaluation (e.g., IPS-weighted estimators) to get better offline estimates
- Tracking offline-online metric correlation over time and across launches
- Interleaving experiments for faster, more sensitive online signal than traditional A/B tests
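As a sketch of the first strategy, here is a toy IPS estimator; the items, propensities, and new-policy choices below are all hypothetical:

```python
# Toy IPS (inverse propensity scoring) sketch: estimate a new policy's CTR
# from logs collected under an old policy. Assumes each log row recorded
# the old policy's propensity for the item it showed.
logged = [
    # (item shown by old policy, old-policy propensity, clicked?)
    ("A", 0.8, 1),
    ("B", 0.1, 1),
    ("A", 0.8, 0),
    ("C", 0.5, 0),
]
new_policy_would_show = {"A": False, "B": True, "C": True}  # hypothetical

# Keep only impressions the new policy would also serve, and reweight each
# click by 1/propensity to correct for the old policy's exposure bias.
ips_ctr = sum(
    click / prop
    for item, prop, click in logged
    if new_policy_would_show[item]
) / len(logged)
print(ips_ctr)  # 2.5 -- the rare B impression gets a 10x weight; unclipped
                # IPS is high-variance, hence weight clipping / SNIPS in practice
```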
Quick Reference Table
| Metric | Relevance Type | Position-Aware | Multi-Item | Best For |
|--------------|----------------|----------------|------------|-----------------------------|
| Precision@K | Binary | No | Yes | Simple top-K quality |
| Recall@K | Binary | No | Yes | Retrieval coverage |
| MRR | Binary | Yes | No (1st) | Navigational search |
| MAP | Binary | Yes | Yes | Standard IR evaluation |
| NDCG@K | Graded | Yes | Yes | Production ranking (gold) |
| AUC | Binary | No | Yes | Overall CTR model quality |
| GAUC | Binary | No | Yes | Per-user discrimination |