Ranking Metrics

RecSys
ML
Author

Aayush Agrawal

Published

February 22, 2026

If you’re preparing for ML system design or ranking interviews, you will be asked to define and compare these metrics. I’ve organized them from simplest to most expressive, then covered the online vs. offline gap.

Let’s start with a shared setup we’ll use across all examples:

import numpy as np

# Simulated scenario: a query returns 10 items
# Binary relevance: 1 = relevant, 0 = not relevant
binary_relevance = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]

# Graded relevance: scale of 0-5 (e.g., star ratings)
graded_relevance = [3, 0, 2, 0, 0, 5, 4, 0, 0, 1]

# Total relevant items that exist in the entire corpus (not just top-K)
total_relevant = 8

1. Precision@K

What it measures: Of the top-K items you returned, how many were actually relevant?

  • Simple to compute and explain
  • Weakness: treats all K positions equally. Being relevant at position 1 vs. position K counts the same.
def precision_at_k(relevance, k):
    """Fraction of top-K items that are relevant."""
    top_k = relevance[:k]
    return sum(top_k) / k

for k in [3, 5, 10]:
    print(f"Precision@{k}: {precision_at_k(binary_relevance, k):.3f}")
# Precision@3:  0.667  (2 out of 3 are relevant)
# Precision@5:  0.400  (2 out of 5 are relevant)
# Precision@10: 0.500  (5 out of 10 are relevant)
Precision@3: 0.667
Precision@5: 0.400
Precision@10: 0.500

Notice that Precision@5 is lower than Precision@10 here: the relevant items at positions 6 and 7 pull the number back up. This metric has no notion of “better to be relevant earlier.”
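To make that position-blindness concrete, here is a small sketch (helper repeated so the snippet runs standalone): two rankings that place the same two relevant items at opposite ends of the top 5 score identically.

```python
def precision_at_k(relevance, k):
    """Fraction of top-K items that are relevant."""
    return sum(relevance[:k]) / k

# Same two relevant items, placed first vs. last in the top 5
front_loaded = [1, 1, 0, 0, 0]
back_loaded  = [0, 0, 0, 1, 1]

print(precision_at_k(front_loaded, 5))  # 0.4
print(precision_at_k(back_loaded, 5))   # 0.4 -- identical, despite worse ordering
```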

2. Recall@K

What it measures: Of all relevant items that exist, how many made it into your top-K list?

def recall_at_k(relevance, k, total_relevant):
    """Fraction of all relevant items captured in top-K."""
    top_k = relevance[:k]
    return sum(top_k) / total_relevant

for k in [3, 5, 10]:
    print(f"Recall@{k}: {recall_at_k(binary_relevance, k, total_relevant):.3f}")
# Recall@3:  0.250  (2 out of 8 total relevant)
# Recall@5:  0.250  (2 out of 8 total relevant)
# Recall@10: 0.625  (5 out of 8 total relevant)
Recall@3: 0.250
Recall@5: 0.250
Recall@10: 0.625

Useful for retrieval-stage evaluation where you want to make sure you’re not missing good candidates before re-ranking.
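As a quick illustration of why recall fits the retrieval stage (helper repeated so the snippet runs standalone), two hypothetical candidate generators that surface the same three relevant items in different orders get identical Recall@5 — ordering is the re-ranker’s job:

```python
def recall_at_k(relevance, k, total_relevant):
    """Fraction of all relevant items captured in top-K."""
    return sum(relevance[:k]) / total_relevant

# Two hypothetical candidate generators, same 3 relevant items, different order
generator_a = [1, 1, 1, 0, 0]
generator_b = [0, 0, 1, 1, 1]

print(recall_at_k(generator_a, 5, 8))  # 0.375
print(recall_at_k(generator_b, 5, 8))  # 0.375 -- order doesn't matter here
```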

3. Mean Reciprocal Rank (MRR)

What it measures: How quickly does the user find the first relevant result? You take 1/rank of the first relevant item and average across all queries.

MRR = (1/|Q|) · Σ_q  1/rank_q
  • Great for navigational queries (user wants one specific thing)
  • Limitation: completely ignores every relevant item after the first one
def reciprocal_rank(relevance):
    """1/rank of the first relevant item."""
    for i, rel in enumerate(relevance):
        if rel == 1:
            return 1.0 / (i + 1)
    return 0.0

def mrr(queries):
    """Mean Reciprocal Rank across multiple queries."""
    return np.mean([reciprocal_rank(q) for q in queries])

# Three different queries with their ranked results
query_results = [
    [1, 0, 0, 0, 0],  # first relevant at position 1 -> RR = 1.0
    [0, 0, 1, 0, 0],  # first relevant at position 3 -> RR = 0.333
    [0, 1, 0, 0, 1],  # first relevant at position 2 -> RR = 0.5
]

for i, q in enumerate(query_results):
    print(f"Query {i+1}: RR = {reciprocal_rank(q):.3f}")
print(f"MRR = {mrr(query_results):.3f}")
Query 1: RR = 1.000
Query 2: RR = 0.333
Query 3: RR = 0.500
MRR = 0.611

Query 3 has two relevant items, but MRR only cares about the first one at position 2. That’s the key limitation.
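A tiny sketch of that limitation (function repeated for self-containment): adding a second relevant item later in the list doesn’t move RR at all.

```python
def reciprocal_rank(relevance):
    """1/rank of the first relevant item."""
    for i, rel in enumerate(relevance):
        if rel == 1:
            return 1.0 / (i + 1)
    return 0.0

# One relevant item vs. two -- identical RR, because only the first counts
print(reciprocal_rank([0, 1, 0, 0, 0]))  # 0.5
print(reciprocal_rank([0, 1, 0, 0, 1]))  # 0.5
```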

4. Mean Average Precision (MAP)

What it measures: The standard metric for binary relevance. Unlike MRR, it considers all relevant items and where they appear.

AP = (1/R) · Σ_{k=1}^{n}  Precision@k · rel(k)
def average_precision(relevance, total_relevant):
    """AP: average of Precision@k at each relevant position."""
    hits = 0
    sum_precision = 0.0

    for i, rel in enumerate(relevance):
        if rel == 1:
            hits += 1
            precision_at_i = hits / (i + 1)
            sum_precision += precision_at_i
            print(f"  Position {i+1}: Precision@{i+1} = {hits}/{i+1} = {precision_at_i:.3f}")

    return sum_precision / total_relevant

print("Step-by-step AP calculation:")
ap = average_precision(binary_relevance, total_relevant)
print(f"\nAP = {ap:.3f}")
Step-by-step AP calculation:
  Position 1: Precision@1 = 1/1 = 1.000
  Position 3: Precision@3 = 2/3 = 0.667
  Position 6: Precision@6 = 3/6 = 0.500
  Position 7: Precision@7 = 4/7 = 0.571
  Position 10: Precision@10 = 5/10 = 0.500

AP = 0.405

MAP is just the mean of AP across multiple queries. The step-by-step printout shows exactly what makes MAP position-aware: relevant items ranked higher get a better Precision@k, which pulls the average up.
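Since the snippet above only computes AP for a single query, here is a minimal MAP sketch over multiple queries. The queries and per-query relevant counts are hypothetical, chosen just for illustration:

```python
def average_precision(relevance, total_relevant):
    """AP: average of Precision@k at each relevant position."""
    hits, sum_precision = 0, 0.0
    for i, rel in enumerate(relevance):
        if rel == 1:
            hits += 1
            sum_precision += hits / (i + 1)
    return sum_precision / total_relevant

def mean_average_precision(queries):
    """MAP: mean of AP across (relevance_list, total_relevant) pairs."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

queries = [
    ([1, 0, 1, 0, 0], 2),  # AP = (1 + 2/3) / 2 = 0.833
    ([0, 1, 0, 0, 1], 3),  # AP = (1/2 + 2/5) / 3 = 0.300
]
print(f"MAP = {mean_average_precision(queries):.3f}")  # MAP = 0.567
```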

5. NDCG@K (Normalized Discounted Cumulative Gain)

The most widely used ranking metric in practice. Most production search and ranking teams use NDCG as a primary offline metric.

DCG@K  = Σ_{i=1}^{K}  (2^{rel_i} - 1) / log₂(i+1)
NDCG@K = DCG@K / IDCG@K
  • Numerator (exponential gain): A 5-star item contributes way more than a 3-star item
  • Denominator (log discount): Positions further down matter less
def dcg_at_k(relevance, k):
    """Discounted Cumulative Gain at K."""
    relevance = np.array(relevance[:k])
    positions = np.arange(1, len(relevance) + 1)

    gains = 2**relevance - 1        # exponential gain
    discounts = np.log2(positions + 1)  # log discount

    # Show the math for each position
    for i in range(len(relevance)):
        print(f"  Pos {i+1}: gain={gains[i]:5.1f}, "
              f"discount={discounts[i]:.3f}, "
              f"contribution={gains[i]/discounts[i]:.3f}")

    return np.sum(gains / discounts)


def ndcg_at_k(relevance, k):
    """Normalized DCG: actual DCG / ideal DCG."""
    print("Actual ranking:")
    dcg = dcg_at_k(relevance, k)
    print(f"  DCG@{k} = {dcg:.3f}\n")

    # Ideal ranking: sort by relevance descending
    ideal = sorted(relevance, reverse=True)
    print("Ideal ranking (sorted desc):")
    idcg = dcg_at_k(ideal, k)
    print(f"  IDCG@{k} = {idcg:.3f}\n")

    ndcg = dcg / idcg if idcg > 0 else 0.0
    return ndcg


K = 6
ndcg = ndcg_at_k(graded_relevance, K)
print(f"NDCG@{K} = {ndcg:.3f}")
Actual ranking:
  Pos 1: gain=  7.0, discount=1.000, contribution=7.000
  Pos 2: gain=  0.0, discount=1.585, contribution=0.000
  Pos 3: gain=  3.0, discount=2.000, contribution=1.500
  Pos 4: gain=  0.0, discount=2.322, contribution=0.000
  Pos 5: gain=  0.0, discount=2.585, contribution=0.000
  Pos 6: gain= 31.0, discount=2.807, contribution=11.042
  DCG@6 = 19.542

Ideal ranking (sorted desc):
  Pos 1: gain= 31.0, discount=1.000, contribution=31.000
  Pos 2: gain= 15.0, discount=1.585, contribution=9.464
  Pos 3: gain=  7.0, discount=2.000, contribution=3.500
  Pos 4: gain=  3.0, discount=2.322, contribution=1.292
  Pos 5: gain=  1.0, discount=2.585, contribution=0.387
  Pos 6: gain=  0.0, discount=2.807, contribution=0.000
  IDCG@6 = 45.643

NDCG@6 = 0.428

The 5-star item at position 6 contributes 11.042 in the actual ranking, but would contribute 31.0 at position 1 in the ideal ranking. That’s where the “loss” comes from. NDCG captures exactly this.

Why NDCG dominates:

  • Handles graded relevance (not just binary)
  • Position-aware
  • Normalized to [0, 1], so you can compare across queries
  • Well-studied with known statistical properties
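One consequence of the normalization, sketched with a compact version of the functions above: a perfectly sorted ranking always scores exactly 1.0, regardless of the relevance scale, and any other ordering of the same items scores below it.

```python
import numpy as np

def ndcg_at_k(relevance, k):
    """Compact NDCG: DCG of the given order divided by DCG of the ideal sort."""
    def dcg(rels):
        rels = np.array(rels[:k])
        return float(np.sum((2.0**rels - 1) / np.log2(np.arange(2, len(rels) + 2))))
    idcg = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / idcg if idcg > 0 else 0.0

print(ndcg_at_k([5, 4, 3, 2, 1, 0], 6))        # 1.0 -- already ideally sorted
print(ndcg_at_k([1, 2, 3, 4, 5, 0], 6) < 1.0)  # True -- any other order loses
```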

6. AUC and GAUC

What it measures: The probability that a randomly chosen positive item is scored higher than a randomly chosen negative item.

def auc_from_scratch(labels, scores):
    """
    AUC via the Wilcoxon-Mann-Whitney interpretation:
    fraction of (positive, negative) pairs where positive is scored higher.
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]

    concordant = 0
    total_pairs = 0

    for pos_score in positives:
        for neg_score in negatives:
            total_pairs += 1
            if pos_score > neg_score:
                concordant += 1
            elif pos_score == neg_score:
                concordant += 0.5  # ties count as half a concordant pair

    auc = concordant / total_pairs
    return auc

# Model scores and true labels
labels = [1, 0, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.8, 0.4, 0.7, 0.2, 0.5, 0.6]

print(f"AUC = {auc_from_scratch(labels, scores):.3f}")
AUC = 1.000
# GAUC: AUC per user, then weighted average
def gauc(user_labels, user_scores):
    """Group AUC: per-user AUC, weighted by number of interactions."""
    total_weight = 0
    weighted_auc = 0

    for user_id in user_labels:
        labels = user_labels[user_id]
        scores = user_scores[user_id]

        # Skip users with only positives or only negatives
        if len(set(labels)) < 2:
            continue

        user_auc = auc_from_scratch(labels, scores)
        weight = len(labels)
        weighted_auc += user_auc * weight
        total_weight += weight
        print(f"  User {user_id}: AUC={user_auc:.3f}, weight={weight}")

    return weighted_auc / total_weight

user_labels = {"u1": [1, 0, 1, 0], "u2": [1, 0, 0, 1], "u3": [1, 1, 1, 0]}
user_scores = {"u1": [0.9, 0.2, 0.8, 0.3], "u2": [0.6, 0.7, 0.4, 0.5], "u3": [0.9, 0.8, 0.7, 0.1]}

print(f"\nGAUC = {gauc(user_labels, user_scores):.3f}")
  User u1: AUC=1.000, weight=4
  User u2: AUC=0.500, weight=4
  User u3: AUC=1.000, weight=4

GAUC = 0.833

Global AUC can be misleading when users have very different baseline click rates. GAUC fixes this by measuring discrimination within each user’s session, which is what actually matters for personalized ranking.
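A sketch of that failure mode, using a compact pairwise AUC and two made-up users: each user’s own ranking is perfect (per-user AUC = 1.0, so GAUC = 1.0), yet pooling scores across users whose score baselines differ drags the global AUC down.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    pairs = [(p, n) for p in pos for n in neg]
    return sum((p > n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)

# Each user ranks their own items perfectly, but u2's scores sit lower overall
u1_labels, u1_scores = [1, 0], [0.9, 0.8]
u2_labels, u2_scores = [1, 0], [0.3, 0.2]

print(auc(u1_labels, u1_scores))  # 1.0  (perfect within u1)
print(auc(u2_labels, u2_scores))  # 1.0  (perfect within u2)
print(auc(u1_labels + u2_labels, u1_scores + u2_scores))  # 0.75 pooled
```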

7. Online Metrics (What Actually Matters)

All the metrics above are offline proxies. In production, what you ship gets measured by:

  • Engagement: CTR, session length, items per session, return rate
  • Business: CVR, RPM, LTV, retention
# Quick example: CTR computation from impression logs
impressions = [
    {"item": "A", "shown": True, "clicked": True},
    {"item": "B", "shown": True, "clicked": False},
    {"item": "C", "shown": True, "clicked": True},
    {"item": "D", "shown": True, "clicked": False},
    {"item": "E", "shown": True, "clicked": True},
]

clicks = sum(1 for imp in impressions if imp["clicked"])
total = len(impressions)
ctr = clicks / total

print(f"Clicks: {clicks}, Impressions: {total}, CTR: {ctr:.1%}")
Clicks: 3, Impressions: 5, CTR: 60.0%

The offline-online gap is something senior and principal candidates need to speak to directly. Offline metrics (NDCG, AUC) and online A/B test results don’t always agree. A model that wins on NDCG can lose on CTR, and vice versa. Strategies to manage this gap include:

  1. Counterfactual evaluation (e.g., IPS-weighted estimators) to get better offline estimates
  2. Tracking offline-online metric correlation over time and across launches
  3. Interleaving experiments for faster, more sensitive online signal than traditional A/B tests
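As a sketch of strategy 1, here is a minimal inverse propensity scoring (IPS) estimator of a new policy’s CTR from logs of the old policy. The records, the propensity values, and the `new_policy_prob` field are all hypothetical; in practice the propensities come from logged randomization in the serving policy.

```python
# Hypothetical impression logs: click outcome, the logging policy's probability
# of showing the item (propensity), and the new policy's probability for it
logs = [
    {"clicked": 1, "propensity": 0.5,  "new_policy_prob": 0.8},
    {"clicked": 0, "propensity": 0.5,  "new_policy_prob": 0.1},
    {"clicked": 1, "propensity": 0.25, "new_policy_prob": 0.5},
    {"clicked": 0, "propensity": 0.8,  "new_policy_prob": 0.4},
]

def ips_ctr_estimate(logs):
    """Reweight each logged click by new_policy_prob / propensity to estimate
    the CTR the new policy would have achieved on this traffic."""
    weighted_clicks = [
        (r["new_policy_prob"] / r["propensity"]) * r["clicked"] for r in logs
    ]
    return sum(weighted_clicks) / len(logs)

print(f"IPS-estimated CTR: {ips_ctr_estimate(logs):.3f}")
```

The estimator is unbiased when the logged propensities are correct, but its variance blows up when propensities are small, which is why clipped or self-normalized variants are common in practice.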

Quick Reference Table

| Metric       | Relevance Type | Position-Aware | Multi-Item | Best For                    |
|--------------|----------------|----------------|------------|-----------------------------|
| Precision@K  | Binary         | No             | Yes        | Simple top-K quality        |
| Recall@K     | Binary         | No             | Yes        | Retrieval coverage          |
| MRR          | Binary         | Yes            | No (1st)   | Navigational search         |
| MAP          | Binary         | Yes            | Yes        | Standard IR evaluation      |
| NDCG@K       | Graded         | Yes            | Yes        | Production ranking (gold)   |
| AUC          | Binary         | No             | Yes        | Overall CTR model quality   |
| GAUC         | Binary         | No             | Yes        | Per-user discrimination     |