LLMxRecSys
AI × Recommender Systems in Big Tech (Outline)
1. Zero-Shot / Direct LLM Integration in RecSys
Using LLMs as plug-and-play components:
- Re-ranking for diversity/exploration: LLMs can score items on subjective qualities (novelty, serendipity) without labeled data
- Cold-start item understanding: New content gets rich embeddings from multimodal LLMs before any engagement signals exist
- Query understanding & intent expansion: User searches like “something chill to watch” get semantic expansion
- Conversational recommendation: Natural language interfaces layered on traditional retrieval (Netflix’s “what to watch” experiments)
- Explanation generation: Post-hoc reasoning for why something was recommended
Challenges: Latency vs. quality tradeoff. LLM inference is expensive. Most production systems use LLMs offline or for re-ranking small candidate sets.
2. Conceptual Borrowing: LLM Ideas → RecSys Architecture
Adapting LLM research ideas for RecSys architectures.
- Semantic IDs (Google’s TIGER, etc.): Treating item IDs as “tokens” that can be generated autoregressively rather than retrieved
- Generative retrieval: Instead of embedding similarity search, directly generate item identifiers
- Content authenticity/originality detection: Using LLMs to flag low-effort reposts, AI-generated spam, or engagement bait
- Foundation models for user understanding: Pre-training on behavioral sequences (like Meta’s actions transformer)
- In-context learning for personalization: User history as “prompt” rather than learned embeddings
Good discussion point: These blur the line between retrieval and generation, a paradigm shift worth debating.
3. AI for RecSys Developer Productivity
This is where day-to-day impact is massive:
- Automated Analytics: natural language to SQL/metric queries; anomaly root-cause agents
- Code Assistance: Copilot for model code; config generation, boilerplate reduction
- Experiment Velocity: auto-summarizing A/B results; flagging metric movements; hypothesis generation
- Research → Production: LLMs help translate paper pseudocode to production implementations
- Documentation: auto-generating design docs; onboarding materials from code
- Debug/Triage: “Why did this user see this recommendation?” investigative agents
Section 1: Zero-Shot / Direct LLM Integration in RecSys
The core insight here is that LLMs bring semantic understanding that traditional collaborative filtering lacks. In the recommendation context, LLMs “generalize to new users or items by leveraging semantic cues from textual metadata, behavior descriptions, or demographic features, bypassing the need for supervised training or interaction histories on the specific platform.”
1.1 Re-ranking for Diversity & Exploration
The Problem
Traditional re-rankers optimize for a single objective, usually relevance or CTR. But modern recommendation systems need to balance multiple goals at once: accuracy, diversity, fairness, and exploration. Building separate models for each objective is expensive and hard to maintain.
Key Papers
- LLM4Rerank (2024) Link: Introduces a graph-based reranking framework that uses zero-shot LLMs to handle multiple ranking aspects simultaneously. The key idea is representing different requirements (relevance, diversity, novelty, etc.) as nodes in a graph, then using Chain-of-Thought prompting to let the LLM reason through each aspect step by step before producing the final ranking.
High-Level Approach
- Start with a candidate set from your existing retrieval/ranking pipeline
- Define the aspects you care about (diversity, freshness, user preference alignment)
- Structure these as a reasoning graph for the LLM
- Let the LLM score and reorder candidates by reasoning through each node (see the sketch below)
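A minimal sketch of what this might look like in code, assuming a small pre-retrieved candidate set and a generic `call_llm` helper standing in for whatever completion endpoint you use; the prompt structure and canned response are illustrative, not LLM4Rerank's exact implementation:

```python
import json

# Aspect "nodes" the LLM reasons through before the final ordering.
ASPECTS = ["relevance to the user's recent history", "catalog diversity", "novelty / serendipity"]

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your actual LLM client. A canned response keeps the sketch runnable.
    return json.dumps({"ranking": ["item_3", "item_1", "item_2"]})

def rerank(user_summary: str, candidates: dict[str, str]) -> list[str]:
    """Ask the LLM to reason aspect by aspect, then emit a JSON ranking."""
    catalog = "\n".join(f"- {cid}: {desc}" for cid, desc in candidates.items())
    steps = "\n".join(f"{i + 1}. Score each candidate for {a}." for i, a in enumerate(ASPECTS))
    prompt = (
        f"User profile: {user_summary}\n"
        f"Candidates:\n{catalog}\n\n"
        f"Reason step by step:\n{steps}\n"
        f"{len(ASPECTS) + 1}. Combine the scores and return JSON: {{'ranking': [ids, best first]}}"
    )
    ranked = json.loads(call_llm(prompt))["ranking"]
    # Guard against hallucinated IDs: keep known candidates, append anything the LLM dropped.
    valid = [cid for cid in ranked if cid in candidates]
    return valid + [cid for cid in candidates if cid not in valid]

print(rerank("enjoys slow-burn sci-fi", {
    "item_1": "Action blockbuster",
    "item_2": "Sci-fi drama",
    "item_3": "Indie space documentary",
}))
```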
Key Learnings
- Zero-shot LLMs can incorporate subjective qualities like “novelty” or “serendipity” without needing labeled training data
- The Chain-of-Thought structure helps the model reason more systematically than single-prompt approaches
- Works best when combined with traditional retrievers rather than replacing them entirely
Tradeoffs and Challenges
- Latency is the biggest concern. LLM inference is slow and expensive, so most production systems restrict LLM reranking to small candidate sets (top 50-100 items) or run it offline
- Cost scales with candidate set size and prompt length
- Results are promising but still lag behind well-tuned traditional rerankers in controlled benchmarks
- Reproducibility can be tricky since LLM outputs are non-deterministic
1.2 Cold-Start Item & User Understanding
The Problem
New content has no engagement history. Traditional recommender systems rely heavily on collaborative signals (who clicked what, who watched what), so when an item is brand new, the model has almost nothing to work with. Current approaches try to generate synthetic behavioral embeddings from content features, but they’re fundamentally limited because there’s no actual user interaction data to learn from.
Key Papers
ColdLLM (2024) Link: Uses LLMs to simulate user interactions for cold items, generating synthetic engagement signals before real users ever see the content.
LLM-based Cold-Start Survey (2025) Link: Comprehensive roadmap covering how LLMs are reshaping cold-start approaches across the industry.
LLMs as Near Cold-start Recommenders (2023) Link: Shows that LLMs can provide competitive zero-shot recommendations using pure language-based preferences, even without supervised training.
High-Level Approach
- Take new item metadata (title, description, categories, thumbnails)
- Use LLMs to generate predicted user reactions or synthetic engagement patterns
- Feed these synthetic signals into your existing recommendation model as “warm-up” data
- Once real engagement arrives, blend or replace the synthetic signals
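Below is a hedged sketch of that offline flow, assuming a `call_llm` stub, hypothetical audience segments, and a simple linear hand-off from synthetic to real signals; ColdLLM's actual interaction simulation is considerably more elaborate:

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder LLM client; returns canned affinity scores so the sketch runs end to end.
    return json.dumps({"anime_fans": 0.8, "casual_viewers": 0.3, "documentary_fans": 0.1})

def synthetic_signals(item_meta: dict) -> dict[str, float]:
    """Ask the LLM to predict per-segment affinity for a brand-new item."""
    prompt = (
        "Given this new item's metadata, estimate affinity (0-1) for each audience "
        f"segment as JSON.\nMetadata: {json.dumps(item_meta)}\n"
        "Segments: anime_fans, casual_viewers, documentary_fans"
    )
    return json.loads(call_llm(prompt))

def blended_signal(synthetic: float, real: float, real_count: int, ramp: int = 100) -> float:
    """Linearly hand off from synthetic to real engagement as impressions accumulate."""
    w = min(real_count / ramp, 1.0)
    return (1 - w) * synthetic + w * real

syn = synthetic_signals({"title": "Lo-fi Space Ambience", "category": "music", "tags": ["chill"]})
print(syn)
print(blended_signal(syn["casual_viewers"], real=0.55, real_count=40))
```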
Key Learnings
- Language-based preference representations are more explainable and auditable than pure embedding vectors
- Zero-shot LLM recommendations are surprisingly competitive with collaborative filtering for users who express preferences in natural language
- The real value is in offline signal generation, not real-time LLM inference
Tradeoffs and Challenges
- LLM-simulated engagement may not match real user behavior, especially for niche or culturally specific content
- Synthetic signals can introduce bias if the LLM has blind spots
- Cost of running LLMs at scale for every new item adds up quickly
- Evaluation is tricky: how do you measure if synthetic signals actually helped before real data arrives?
1.3 Query Understanding & Intent Expansion
The Problem
Users often express vague or underspecified queries like “something fun to watch” or “chill vibes.” Traditional query processing relies on exact term matching or pre-defined taxonomies, which breaks down when users speak naturally. This gap between how users think about what they want and how items are indexed creates a retrieval bottleneck.
Key Papers
- Query2Doc (EMNLP 2023) Link: Generates pseudo-documents from queries using LLMs to enrich sparse retrieval.
- LLMs Know Your Contextual Search Intent (EMNLP 2023) Link: Demonstrates LLMs can infer implicit user intent from ambiguous queries without fine-tuning.
- Query Expansion Survey (2025) Link: Comprehensive overview of LLM-based expansion techniques and production considerations.
High-Level Approach
Take the raw user query and use LLMs to infer latent intent. Generate expanded terms, entities, or reformulations. Optionally use Chain-of-Thought prompting to surface structured expansions. Feed the expanded query into your retrieval pipeline alongside or instead of the original.
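A minimal sketch of that pipeline; the prompt, the `call_llm` stub, and the `lru_cache` stand-in for a shared expansion cache are all assumptions rather than any specific paper's implementation:

```python
import json
from functools import lru_cache

def call_llm(prompt: str) -> str:
    # Placeholder LLM client with a canned response so the sketch runs.
    return json.dumps({"intent": "relaxing, low-commitment entertainment",
                       "expansions": ["feel-good comedy", "light documentary", "ambient music"]})

@lru_cache(maxsize=10_000)  # stand-in for a shared cache of pre-computed expansions
def expand_query(query: str) -> dict:
    prompt = (
        f"User query: {query!r}\n"
        "1. State the latent intent in one phrase.\n"
        "2. List 3-5 expanded search phrases.\n"
        "Return JSON with keys 'intent' and 'expansions'."
    )
    return json.loads(call_llm(prompt))

expanded = expand_query("something chill to watch")
# Downstream: issue the original query plus expansions to the retriever,
# or attach the intent string as a feature for the ranker.
print(expanded)
```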
Key Learnings
- Zero-shot and few-shot expansion works well without task-specific fine-tuning
- LLMs can perform spelling correction and query segmentation in a single pass
- Pre-computing expansions for high-traffic queries makes production deployment feasible
- Distilling GPT-4 outputs into smaller models provides a cost-effective path to scale
Tradeoffs and Challenges
- Latency blocks real-time expansion; caching and pre-computation are essential
- Over-expansion can introduce drift, retrieving items that technically match the expanded query but miss user intent
- Evaluation is difficult because “better understanding” is subjective
- Cost scales with query volume unless you invest in distillation or caching infrastructure
1.4 Conversational Recommendation
The Problem
Static recommendation interfaces force users into passive consumption. Users cannot easily express nuanced preferences like “I liked that movie but it was too long” or “something like last week’s suggestion but more upbeat.” Traditional systems require explicit actions (clicks, ratings) to learn preferences, missing the rich signal that natural conversation could provide.
Key Papers
- RecLLM / Leveraging LLMs in CRS (2023, Google) Link: End-to-end conversational recommender for YouTube using LaMDA, with synthetic conversation generation to bootstrap training data.
High-Level Approach
Layer a conversational interface on top of existing retrieval and ranking infrastructure. The LLM extracts structured preferences from natural language turns and maps them to retrieval queries or filter constraints. Responses explain recommendations and invite refinement. Synthetic conversation data (generated via LLM-based user simulators) can bootstrap the system before real user conversations exist.
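The extraction step might look like the sketch below, where the JSON schema and the `call_llm` stub are illustrative assumptions rather than RecLLM's implementation; the returned constraints would then drive the existing retrieval and ranking stack:

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder LLM client; canned output keeps the sketch runnable.
    return json.dumps({"liked_reference": "last week's suggestion",
                       "mood": "upbeat", "max_runtime_min": 110, "exclude_genres": []})

def extract_constraints(turn: str, conversation_summary: str) -> dict:
    """Turn one natural-language message into structured retrieval constraints."""
    prompt = (
        f"Conversation so far: {conversation_summary}\n"
        f"Latest user message: {turn!r}\n"
        "Extract JSON with keys: liked_reference, mood, max_runtime_min, exclude_genres."
    )
    return json.loads(call_llm(prompt))

constraints = extract_constraints(
    "something like last week's suggestion but more upbeat and not three hours long",
    conversation_summary="User previously enjoyed a slow sci-fi drama.",
)
# Downstream: translate `constraints` into retrieval filters / boosts, rank as usual,
# then have the LLM phrase the response and invite refinement.
print(constraints)
```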
Key Learnings
- LLMs excel at preference elicitation through clarifying questions
- Multi-turn context enables progressive refinement that single-query systems cannot achieve
- Hybrid architectures combining LLM reasoning with traditional retrieval outperform pure LLM approaches
- Synthetic conversations help solve the cold-start problem for building CRS systems
Tradeoffs and Challenges
- Latency compounds across turns; users expect sub-second responses
- Maintaining coherent context across long conversations is technically difficult and token-expensive
- Evaluation metrics for conversational quality are still immature compared to traditional ranking metrics
- Safety and hallucination risks increase when LLMs generate free-form responses about items
1.5 Explanation Generation
The Problem
Users increasingly demand transparency in why they see specific recommendations. Traditional systems either provide no explanation or rely on templated approaches (“Because you watched X”). When used directly for explanations, off-the-shelf LLMs often fail to capture individual user preferences, producing text that feels generic and impersonal; despite their strong natural language generation capabilities, most models struggle to produce reliable zero-shot explanations.
Key Papers
- Logic-Scaffolding (WSDM 2024) Link: Combines aspect-based explanation with chain-of-thought prompting to generate personalized explanations through intermediate reasoning steps.
High-Level Approach
- Extract key aspects (themes, genres, moods) from recommended items using the LLM
- Ground the explanation in user history by pulling relevant items from their watching history alongside the recommended item’s plot and aspects
- Use chain-of-thought prompting with three distinct steps that force the model to reason through intermediate logic before producing the final explanation
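A loose sketch of that three-step chain, assuming a `call_llm` stub and a naive genre-overlap heuristic for grounding; it approximates the shape of the Logic-Scaffolding pipeline rather than reproducing its prompts:

```python
def call_llm(prompt: str) -> str:
    # Placeholder LLM client returning a canned explanation so the sketch runs.
    return ("Because you watched Arrival and enjoy slow-burn, idea-driven sci-fi, "
            "this recommendation shares its contemplative tone and first-contact theme.")

def explain(rec_item: dict, user_history: list[dict]) -> str:
    # Step 1: extract key aspects of the recommended item.
    aspects = call_llm(f"List 3 key aspects (theme, genre, mood) of: {rec_item['plot']}")
    # Step 2: ground in relevant history items (here: naive genre overlap).
    relevant = [h["title"] for h in user_history if set(h["genres"]) & set(rec_item["genres"])]
    # Step 3: chain-of-thought explanation conditioned on aspects + grounding.
    final_prompt = (
        f"Item aspects: {aspects}\nUser has watched: {relevant}\n"
        "Reason step by step about which aspects match the user's history, "
        "then write a one-sentence personalized explanation."
    )
    return call_llm(final_prompt)

print(explain(
    {"title": "Contact", "plot": "A scientist decodes an alien signal.", "genres": ["sci-fi", "drama"]},
    [{"title": "Arrival", "genres": ["sci-fi", "drama"]}, {"title": "Step Brothers", "genres": ["comedy"]}],
))
```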
Key Learnings
- Structured prompting (aspect extraction + chain-of-thought) dramatically outperforms naive LLM explanation generation
- Grounding explanations in specific user history items improves perceived relevance significantly
- The framework shows large effect sizes across properness, readability, and relevance, addressing multiple limitations of generic approaches simultaneously
Tradeoffs and Challenges
- Faithfulness gap: LLM explanations are post-hoc rationalizations and may not reflect what the ranking model actually optimized for
- Latency prohibits real-time generation for every impression; batch generation or caching is required
- Evaluation remains subjective since what makes an explanation “good” varies by user and context
- Demonstrated on MovieLens 1M using Falcon-40B; production deployment at scale requires additional engineering work
Section 2: Conceptual Borrowing — LLM Ideas Reshaping RecSys Architecture
Section 1 covered using LLMs directly as components. This section is different: it’s about architectural ideas that originated in LLM research and are now being adapted for recommender systems, even when the production system runs no LLM at inference time.
These are the changes worth watching because they affect how we design systems, not just how we prompt models.
2.1 Semantic IDs: From Random Hashes to Meaningful Representations
The Problem
Traditional recommender systems assign each item a random integer ID and learn an embedding for it through user interactions. This creates several fundamental issues. New items have no meaningful embedding until users interact with them. Items with few interactions end up with poorly learned representations. And billion-item catalogs require massive embedding tables with enormous parameter counts. The core limitation is that item IDs carry no inherent meaning, so the system cannot generalize across similar items.
Key Papers
TIGER (2023) Link: First Semantic ID-based generative recommender, introducing the paradigm of autoregressively decoding item identifiers instead of embedding lookup.
Better Generalization with Semantic IDs (2024) Link: YouTube production deployment showing Semantic IDs can replace video IDs while improving generalization on new and long-tail items.
Semantic IDs for Joint Search & Rec (Spotify, 2025) Link: Cross-task Semantic ID construction enabling shared representations across search and recommendation.
Generative Rec with Semantic IDs: A Handbook (2025) Link: Practitioner-focused guide covering implementation details and lessons learned.
High-Level Approach
Instead of mapping each item to a random ID, Semantic IDs encode items as sequences of learned codewords that capture hierarchical meaning. The process typically uses RQ-VAE (Residual Quantized Variational Autoencoder) to compress content embeddings into discrete tokens. So instead of item_47291 → random embedding lookup, you get item_47291 → (cluster_5, sub_25, leaf_78) → compositional embedding. The recommendation model then autoregressively generates these Semantic ID tokens, similar to how LLMs generate text tokens. This draws a direct connection to subword tokenization in NLP.
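The quantization step can be illustrated with a few lines of numpy; real systems learn the codebooks jointly inside an RQ-VAE, whereas the random codebooks below are purely for showing how residuals turn an embedding into a short token tuple:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, LEVELS, CODEWORDS = 64, 3, 256
codebooks = rng.normal(size=(LEVELS, CODEWORDS, DIM))   # one codebook per level

def semantic_id(item_embedding: np.ndarray) -> tuple[int, ...]:
    """Greedy residual quantization: nearest codeword per level, then quantize the residual."""
    residual, tokens = item_embedding.copy(), []
    for level in range(LEVELS):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(dists.argmin())             # nearest codeword at this level
        tokens.append(idx)
        residual -= codebooks[level][idx]     # pass what's left to the next level
    return tuple(tokens)

content_embedding = rng.normal(size=DIM)      # e.g. from a multimodal content encoder
print(semantic_id(content_embedding))         # -> something like (137, 52, 201)
```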
Key Learnings
- The hierarchical nature of Semantic IDs allows granularity control by using various levels of prefixes
- SentencePiece tokenization (borrowed directly from LLM practice) outperforms manually crafted approaches like N-grams
- Moving from “Semantic IDs as better representations” to “Semantic IDs as generative targets” was the critical paradigm shift
- Multiple companies (Google/YouTube, Spotify, Alibaba, Kuaishou) have adopted variants in production
Tradeoffs and Challenges
- RQ-VAE training adds complexity to the pipeline and requires careful tuning
- Semantic IDs are learned from frozen content embeddings, which creates adaptation challenges when content or user behavior shifts
- The compactness that makes Semantic IDs efficient also limits their expressiveness for highly nuanced items
- Balancing semantic similarity with recommendation relevance is non-trivial since semantically similar items are not always good recommendations together
2.2 Generative Retrieval: From Retrieve-then-Rank to Direct Generation
The Problem
Traditional recommender systems follow a rigid multi-stage pipeline: embed the query, run approximate nearest neighbor search, retrieve candidates, then rank them. Each stage is optimized separately with its own objective function, creating interface mismatches between components. The retrieval stage optimizes for embedding similarity while the ranker optimizes for engagement, and these objectives do not always align. This fragmentation limits end-to-end learning and makes the system harder to reason about holistically.
Key Papers
DSI (Differentiable Search Index) (2022) Link: Introduced the paradigm of encoding document corpora directly into model parameters, enabling retrieval through generation rather than embedding lookup.
TIGER (2023) Link: Applied generative retrieval to recommendations using Semantic IDs, achieving 17.3% higher recall@5 than prior methods while reducing memory usage by 20x on Amazon datasets.
OneRec (2025) Link: First industrial-scale deployment of generative recommendation, proving the paradigm works beyond academic benchmarks in real-world production systems.
RIPOR (2024) Link: Addressed scalability challenges in generative retrieval through progressive optimization techniques.
High-Level Approach
Instead of the traditional Query → Embedding → ANN Search → Candidates → Ranker → Results pipeline, generative retrieval collapses this into a single autoregressive model. Given a user’s interaction history encoded as Semantic IDs, a Transformer-based sequence-to-sequence model directly decodes the Semantic ID of the next item the user will interact with. The model learns to generate item identifiers token by token, similar to how language models generate text. Beam search at inference time naturally produces a ranked list of candidates.
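A toy illustration of the decoding side, where `next_token_logprobs` stands in for a trained sequence-to-sequence decoder over Semantic ID tokens; the vocabulary and scores are made up so the sketch runs on its own:

```python
import math

VOCAB = range(4)          # tiny per-level codeword vocabulary
LEVELS = 3                # Semantic IDs are 3 tokens long in this toy

def next_token_logprobs(prefix: tuple[int, ...]) -> dict[int, float]:
    # Placeholder decoder: gives 0.4 probability to token (len(prefix) % 4), 0.2 to the rest.
    favored = len(prefix) % 4
    return {t: (math.log(0.4) if t == favored else math.log(0.2)) for t in VOCAB}

def beam_search(beam_size: int = 3) -> list[tuple[tuple[int, ...], float]]:
    beams = [((), 0.0)]                       # (token prefix, cumulative log-prob)
    for _ in range(LEVELS):
        expanded = []
        for prefix, score in beams:
            for tok, lp in next_token_logprobs(prefix).items():
                expanded.append((prefix + (tok,), score + lp))
        beams = sorted(expanded, key=lambda x: x[1], reverse=True)[:beam_size]
    return beams                              # ranked list of candidate Semantic IDs

for sid, score in beam_search():
    print(sid, round(score, 3))
```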
Key Learnings
- End-to-end optimization eliminates the objective mismatch between retrieval and ranking stages
- Beam search provides natural diversity in results without explicit diversification logic
- Content-based Semantic IDs give the system a meaningful starting point for cold-start items
- The paradigm shift from “retrieval as search” to “retrieval as generation” unlocks new architectural possibilities
- Industrial deployments (OneRec at Kuaishou) have shown this approach can outperform well-tuned traditional systems at scale
Tradeoffs and Challenges
- Autoregressive decoding is inherently sequential, creating latency concerns for real-time serving
- Training requires careful curriculum design to handle the combinatorial space of possible item sequences
- Scaling to billion-item catalogs pushes the limits of current generative architectures
- Debugging is harder since there is no explicit retrieval index to inspect
- The approach is still maturing, with fewer established best practices compared to traditional pipelines
2.3 Foundation Models for User Behavior: Treating Actions as a Modality
The Problem
Large-scale recommendation systems rely on high-cardinality, heterogeneous features and handle tens of billions of user actions daily. Despite training on massive data with thousands of features, most Deep Learning Recommendation Models (DLRMs) in industry fail to scale with compute. Traditional architectures hit diminishing returns: adding more parameters or data does not reliably improve quality. Meanwhile, LLMs demonstrated that pre-training on massive corpora creates powerful foundation models that follow predictable scaling laws. The question became: can we do the same for user behavior sequences?
Key Papers
HSTU / Actions Speak Louder than Words (Meta, ICML 2024) Link: Introduces trillion-parameter sequential transducers that reformulate recommendations as generative sequence modeling, achieving 12.4% online gains.
HLLM (ByteDance, 2024) Link: Hierarchical LLM architecture that separates item-level and user-level modeling for efficiency.
ReLLa (2023) Link: Retrieval-enhanced approach for handling lifelong user behavior sequences that exceed context limits.
High-Level Approach
Reformulate recommendation as sequential transduction within a generative modeling framework. User behavior (clicks, watches, purchases) becomes a “language” of actions that can be modeled autoregressively. The HSTU (Hierarchical Sequential Transduction Unit) architecture replaces traditional DLRMs with Transformer-based sequence models operating over user action histories. Key innovations include pointwise normalization for non-stationary vocabularies, hierarchical temporal units for long-range dependencies, and the M-FALCON inference algorithm, which serves models roughly 285x more complex within the same compute budget.
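The data framing can be shown with a tiny sketch: each prefix of a user's action sequence predicts the next action, exactly as in language-model pretraining. The `action:item` token scheme below is an illustrative simplification of what HSTU actually encodes (features, timestamps, and target types are richer in practice):

```python
def to_training_examples(history: list[tuple[str, str]]) -> list[tuple[list[str], str]]:
    """Turn a user's action history into (context prefix, next-action target) pairs."""
    tokens = [f"{action}:{item}" for action, item in history]
    # Every prefix predicts the next token, just like next-token prediction in LLM pretraining.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

user_history = [("click", "v101"), ("watch", "v101"), ("like", "v101"), ("click", "v202")]
for context, target in to_training_examples(user_history):
    print(context, "->", target)
```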
Key Learnings
- Recommendation model quality follows scaling laws as a power-law of training compute across three orders of magnitude, matching GPT-3/LLaMA-2 scale behavior
- Pure sequential transduction architectures significantly outperform traditional DLRMs in large-scale industrial settings
- HSTU outperforms baselines by up to 65.8% in NDCG and runs 5.3x to 15.2x faster than FlashAttention2-based Transformers on 8192-length sequences
- 1.5 trillion parameter models have been deployed on multiple surfaces at Meta with billions of users
- This represents what some call the “ChatGPT moment” for recommendation systems
Tradeoffs and Challenges
- Training at this scale requires significant infrastructure investment (row-wise AdamW for 6x HBM reduction, custom kernels)
- Long user histories create memory and latency challenges even with efficient attention
- Non-stationary user behavior distributions differ fundamentally from static text corpora
- Debugging and interpretability become harder as models scale
- The gap between academic reproducibility and industrial deployment remains wide
2.4 In-Context Learning for Personalization
The Problem
Traditional recommendation systems encode user preferences into learned embedding vectors tied to user IDs. This creates several limitations: new users have no meaningful embedding until sufficient interactions accumulate, preference shifts require retraining or fine-tuning, and embeddings are not transferable across domains or platforms. The fundamental issue is that user representations are static parameters rather than dynamic functions of observed behavior.
Key Papers
P5 (2022) Link: Unified text-to-text framework that reformulates five recommendation tasks (rating, sequential, explanation, review, direct) as natural language generation with personalized prompts.
CALRec (Google, RecSys 2024) Link: Contrastive alignment method that bridges the gap between LLM representations and recommendation-specific objectives, improving generalization without task-specific fine-tuning.
RecMind (NAACL 2024) Link: LLM-powered autonomous recommendation agent that uses self-inspiring reasoning to plan and execute personalized recommendations through tool use.
High-Level Approach
Instead of learning a fixed embedding vector for each user, treat the user’s interaction history as a prompt that conditions model output. The user’s past items become few-shot examples that demonstrate their preferences. Given this “prompt,” the model generates recommendations by continuing the sequence, similar to how LLMs perform few-shot learning from examples in context. No per-user parameters are stored or updated; personalization emerges entirely from conditioning on history at inference time.
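A minimal sketch of history-as-prompt personalization, with the prompt template and `call_llm` stub as assumptions; note that no per-user embedding is stored anywhere:

```python
def call_llm(prompt: str) -> str:
    # Placeholder LLM client; canned response keeps the sketch runnable.
    return "A quiet, character-driven space drama with an optimistic tone."

def recommend_in_context(history: list[dict], candidate_pool_hint: str) -> str:
    """Format the user's history as in-context evidence and let the model continue it."""
    evidence = "\n".join(
        f"- {h['title']} ({h['genre']}): user rating {h['rating']}/5" for h in history
    )
    prompt = (
        "The items below describe one user's recent preferences:\n"
        f"{evidence}\n\n"
        f"From the {candidate_pool_hint}, describe the single item this user would "
        "most enjoy next, and why, in one sentence."
    )
    return call_llm(prompt)

print(recommend_in_context(
    [{"title": "Arrival", "genre": "sci-fi", "rating": 5},
     {"title": "The Martian", "genre": "sci-fi", "rating": 4}],
    candidate_pool_hint="current streaming catalog",
))
```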
Key Learnings
- Framing user history as context enables instant adaptation to preference shifts without retraining
- The approach transfers naturally across domains since preferences are expressed through item semantics rather than opaque IDs
- Combining contrastive alignment with generative LLMs significantly improves recommendation quality over naive prompting
- Self-inspiring reasoning (where the model generates its own planning steps) outperforms rigid prompting templates
Tradeoffs and Challenges
- Context length limits constrain how much user history can be included; long-term preferences may be truncated
- Inference cost scales with history length since the full context must be processed for each recommendation
- The approach assumes item descriptions carry sufficient semantic signal, which may not hold for all domains
- Evaluation is tricky because improvements in personalization quality are hard to disentangle from general LLM capabilities
- Production deployment requires careful prompt engineering and is sensitive to formatting choices
2.5 Content Authenticity and Fighting Low-Quality Content
The Problem
Generative AI has made content creation trivially easy, flooding platforms with AI-generated spam, low-effort reposts, engagement bait, and synthetic reviews. Traditional moderation relied on keyword filters and user reports, which cannot keep pace with the volume or sophistication of machine-generated content. For recommender systems, surfacing this content degrades user trust and engagement. Quality signals have become as important as relevance signals.
High-Level Approach
Score content on multiple authenticity dimensions before it enters the recommendation pipeline. Originality detection identifies near-duplicates through semantic similarity. Quality scoring evaluates effort, coherence, and informativeness as ranking features. Hybrid systems combine behavioral signals (posting patterns, account age, engagement velocity) with linguistic analysis from LLMs. These signals feed into downstream rankers as features or hard filters.
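One way to sketch combining these signal families into a single soft feature; the weights, thresholds, and `llm_quality_score` stub below are illustrative assumptions, not a production policy:

```python
import numpy as np

def near_duplicate_score(embedding: np.ndarray, known_embeddings: np.ndarray) -> float:
    """Max cosine similarity against previously seen content (1.0 = exact repost)."""
    sims = known_embeddings @ embedding / (
        np.linalg.norm(known_embeddings, axis=1) * np.linalg.norm(embedding) + 1e-9)
    return float(sims.max())

def llm_quality_score(text: str) -> float:
    # Placeholder for an LLM call that rates effort/coherence/informativeness in [0, 1].
    return 0.35

def authenticity_feature(text, embedding, known_embeddings, account_age_days, posts_last_hour):
    dup = near_duplicate_score(embedding, known_embeddings)
    # Simple behavioral heuristic: high posting velocity from a young account is riskier.
    behavioral_risk = min(posts_last_hour / 20, 1.0) * (1.0 if account_age_days < 7 else 0.3)
    quality = llm_quality_score(text)
    # Soft score fed to the ranker as a feature rather than a hard block.
    return 0.5 * quality + 0.3 * (1 - dup) + 0.2 * (1 - behavioral_risk)

rng = np.random.default_rng(1)
print(authenticity_feature("Amazing!!! Click here!!!", rng.normal(size=32),
                           rng.normal(size=(100, 32)), account_age_days=2, posts_last_hour=15))
```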
Key Learnings
- LLMs significantly outperform previous moderation tools in detecting policy violations
- Hybrid systems combining graph-based spam detection with LLM content analysis beat either approach alone
- Quality scores work better as soft ranking signals than binary filters
- Detection models benefit from platform-specific training data since AI-generated content varies by domain
Tradeoffs and Challenges
- Adversarial evolution is constant; as detection improves, generation adapts to evade it
- False positives on legitimate content create trust issues and potential legal exposure
- Running LLM inference on every piece of content is expensive; sampling and tiered approaches are necessary
- Defining “quality” is subjective and culturally dependent, making ground truth labeling difficult
- Transparency requirements may conflict with keeping detection methods confidential
Section 3 Summary: AI for Developer Productivity
The Big Picture: ML engineers spend most of their time on infrastructure, debugging, analysis, and documentation — not model architecture. AI tools are transforming this day-to-day work.
Use Cases & What’s Helping
1. Code Assistance
- Autocomplete and boilerplate generation (GitHub Copilot, Cursor, Claude Code)
- Feature pipeline scaffolding from descriptions
- Auto-generating unit tests for data processing code
- Converting code to documentation
- Senior engineers report 55%+ productivity gains; tools amplify existing skills rather than replace expertise
2. Text-to-SQL & Automated Analytics
- Natural language → SQL queries (“Why did CTR drop 2% yesterday for iOS users?”)
- Metric debugging, cohort analysis, feature validation
- Works well for common queries; complex enterprise schemas still need human refinement
- Decomposing queries into focused sub-problems improves accuracy (see the sketch below)
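A hedged sketch of the text-to-SQL flow, where the schema snippet, prompt, and canned response are assumptions; real deployments add schema retrieval, query validation, and a read-only execution sandbox:

```python
# Hypothetical table used only for illustration.
SCHEMA = "table impressions(date DATE, platform TEXT, surface TEXT, impressions BIGINT, clicks BIGINT)"

def call_llm(prompt: str) -> str:
    # Placeholder LLM client returning a canned query so the sketch runs.
    return ("SELECT date, SUM(clicks)::FLOAT / SUM(impressions) AS ctr\n"
            "FROM impressions WHERE platform = 'iOS'\n"
            "AND date BETWEEN CURRENT_DATE - 2 AND CURRENT_DATE - 1\n"
            "GROUP BY date ORDER BY date;")

def nl_to_sql(question: str) -> str:
    """Ask the model to decompose the question, then emit one read-only SQL query."""
    prompt = (
        f"Schema:\n{SCHEMA}\n"
        f"Question: {question}\n"
        "Decompose the question into sub-metrics, then write a single read-only SQL query."
    )
    return call_llm(prompt)

print(nl_to_sql("Why did CTR drop 2% yesterday for iOS users?"))
```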
3. Autonomous Coding Agents
- Moving beyond suggestions to actually executing tasks (SWE-agent, Claude Code agents)
- Bug fixes, feature implementation, refactoring, test generation
- SWE-bench performance jumped from ~20% to 74%+ in 18 months
- Now used at Meta, NVIDIA, and others for real engineering work
4. Experiment Analysis & Results Synthesis
- Auto-monitoring metrics during A/B tests
- Generating summary reports and highlighting anomalies
- Suggesting follow-up experiments based on results
- Reducing the manual SQL → charts → analysis doc → presentation cycle
5. Research-to-Production Translation
- Summarizing papers and identifying key implementation details
- Converting pseudocode to working code in your stack
- Suggesting optimizations (batching, caching, vectorization)
- Generating test cases based on expected paper behavior
6. Documentation & Knowledge Management
- Auto-generating docstrings from code
- Creating onboarding guides from codebase analysis
- Writing design docs from code changes
- Migration guides for upgrades
7. Multi-Agent Systems (Emerging)
- Teams of specialized agents: Researcher, Engineer, Analyst, Reviewer
- Structured workflows mimicking real engineering teams
- Still early but showing promise for end-to-end task automation
Key Takeaway
AI tools are eating the “undifferentiated heavy lifting” of ML engineering — the SQL queries, boilerplate code, documentation, and routine analysis. This frees engineers to focus on the hard problems: system design, novel algorithms, and understanding user behavior. The engineers who master these tools will dramatically outpace those who don’t.