LLMxRecSys
AI × Recommender Systems in Big Tech (Outline)
1. Zero-Shot / Direct LLM Integration in RecSys
Using LLMs as plug-and-play components:
- Re-ranking for diversity/exploration: LLMs can score items on subjective qualities (novelty, serendipity) without labeled data
- Cold-start item understanding: New content gets rich embeddings from multimodal LLMs before any engagement signals exist
- Query understanding & intent expansion: User searches like “something chill to watch” get semantic expansion
- Conversational recommendation: Natural language interfaces layered on traditional retrieval (Netflix’s “what to watch” experiments)
- Explanation generation: Post-hoc reasoning for why something was recommended
Challenges: Latency vs. quality tradeoff. LLM inference is expensive. Most production systems use LLMs offline or for re-ranking small candidate sets.
2. Conceptual Borrowing: LLM Ideas → RecSys Architecture
Adapting LLM research ideas for RecSys architectures.
- Semantic IDs (Google’s TIGER, etc.): Treating item IDs as “tokens” that can be generated autoregressively rather than retrieved
- Generative retrieval: Instead of embedding similarity search, directly generate item identifiers
- Content authenticity/originality detection: Using LLMs to flag low-effort reposts, AI-generated spam, or engagement bait
- Foundation models for user understanding: Pre-training on behavioral sequences (like Meta’s actions transformer)
- In-context learning for personalization: User history as “prompt” rather than learned embeddings
Good discussion point: These blur the line between retrieval and generation, a paradigm shift worth debating.
3. AI for RecSys Developer Productivity
This is where day-to-day impact is massive:
- Automated Analytics: natural language to SQL/metric queries; anomaly root-cause agents
- Code Assistance: Copilot for model code; config generation, boilerplate reduction
- Experiment Velocity: auto-summarizing A/B results; flagging metric movements; hypothesis generation
- Research → Production: LLMs help translate paper pseudocode to production implementations
- Documentation: auto-generating design docs; onboarding materials from code
- Debug/Triage: “Why did this user see this recommendation?” investigative agents
Section 1: Zero-Shot / Direct LLM Integration in RecSys
The core insight here is that LLMs bring semantic understanding that traditional collaborative filtering lacks. In the recommendation context, LLMs “generalize to new users or items by leveraging semantic cues from textual metadata, behavior descriptions, or demographic features, bypassing the need for supervised training or interaction histories on the specific platform.”
1.1 Re-ranking for Diversity & Exploration
The Problem
Traditional re-rankers optimize for a single objective, usually relevance or CTR. But modern recommendation systems need to balance multiple goals at once: accuracy, diversity, fairness, and exploration. Building separate models for each objective is expensive and hard to maintain.
Key Papers
- LLM4Rerank (2024) Link: Introduces a graph-based reranking framework that uses zero-shot LLMs to handle multiple ranking aspects simultaneously. The key idea is representing different requirements (relevance, diversity, novelty, etc.) as nodes in a graph, then using Chain-of-Thought prompting to let the LLM reason through each aspect step by step before producing the final ranking.
High-Level Approach
- Start with a candidate set from your existing retrieval/ranking pipeline
- Define the aspects you care about (diversity, freshness, user preference alignment)
- Structure these as a reasoning graph for the LLM
- Let the LLM score and reorder candidates by reasoning through each node (see the sketch below)
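A minimal sketch of what this might look like in code, assuming a small pre-retrieved candidate set and a generic `call_llm` helper standing in for whatever completion endpoint you use; the prompt structure and canned response are illustrative, not LLM4Rerank's exact implementation:

```python
import json

# Aspect "nodes" the LLM reasons through before the final ordering.
ASPECTS = ["relevance to the user's recent history", "catalog diversity", "novelty / serendipity"]

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your actual LLM client. A canned response keeps the sketch runnable.
    return json.dumps({"ranking": ["item_3", "item_1", "item_2"]})

def rerank(user_summary: str, candidates: dict[str, str]) -> list[str]:
    """Ask the LLM to reason aspect by aspect, then emit a JSON ranking."""
    catalog = "\n".join(f"- {cid}: {desc}" for cid, desc in candidates.items())
    steps = "\n".join(f"{i + 1}. Score each candidate for {a}." for i, a in enumerate(ASPECTS))
    prompt = (
        f"User profile: {user_summary}\n"
        f"Candidates:\n{catalog}\n\n"
        f"Reason step by step:\n{steps}\n"
        f"{len(ASPECTS) + 1}. Combine the scores and return JSON: {{'ranking': [ids, best first]}}"
    )
    ranked = json.loads(call_llm(prompt))["ranking"]
    # Guard against hallucinated IDs: keep known candidates, append anything the LLM dropped.
    valid = [cid for cid in ranked if cid in candidates]
    return valid + [cid for cid in candidates if cid not in valid]

print(rerank("enjoys slow-burn sci-fi", {
    "item_1": "Action blockbuster",
    "item_2": "Sci-fi drama",
    "item_3": "Indie space documentary",
}))
```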
Key Learnings
- Zero-shot LLMs can incorporate subjective qualities like “novelty” or “serendipity” without needing labeled training data
- The Chain-of-Thought structure helps the model reason more systematically than single-prompt approaches
- Works best when combined with traditional retrievers rather than replacing them entirely
Tradeoffs and Challenges
- Latency is the biggest concern. LLM inference is slow and expensive, so most production systems restrict LLM reranking to small candidate sets (top 50-100 items) or run it offline
- Cost scales with candidate set size and prompt length
- Results are promising but still lag behind well-tuned traditional rerankers in controlled benchmarks
- Reproducibility can be tricky since LLM outputs are non-deterministic
1.2 Cold-Start Item & User Understanding
The Problem
New content has no engagement history. Traditional recommender systems rely heavily on collaborative signals (who clicked what, who watched what), so when an item is brand new, the model has almost nothing to work with. Current approaches try to generate synthetic behavioral embeddings from content features, but they’re fundamentally limited because there’s no actual user interaction data to learn from.
Key Papers
ColdLLM (2024) Link: Uses LLMs to simulate user interactions for cold items, generating synthetic engagement signals before real users ever see the content.
LLM-based Cold-Start Survey (2025) Link: Comprehensive roadmap covering how LLMs are reshaping cold-start approaches across the industry.
LLMs as Near Cold-start Recommenders (2023) Link: Shows that LLMs can provide competitive zero-shot recommendations using pure language-based preferences, even without supervised training.
High-Level Approach
- Take new item metadata (title, description, categories, thumbnails)
- Use LLMs to generate predicted user reactions or synthetic engagement patterns
- Feed these synthetic signals into your existing recommendation model as “warm-up” data
- Once real engagement arrives, blend or replace the synthetic signals
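Below is a hedged sketch of that offline flow, assuming a `call_llm` stub, hypothetical audience segments, and a simple linear hand-off from synthetic to real signals; ColdLLM's actual interaction simulation is considerably more elaborate:

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder LLM client; returns canned affinity scores so the sketch runs end to end.
    return json.dumps({"anime_fans": 0.8, "casual_viewers": 0.3, "documentary_fans": 0.1})

def synthetic_signals(item_meta: dict) -> dict[str, float]:
    """Ask the LLM to predict per-segment affinity for a brand-new item."""
    prompt = (
        "Given this new item's metadata, estimate affinity (0-1) for each audience "
        f"segment as JSON.\nMetadata: {json.dumps(item_meta)}\n"
        "Segments: anime_fans, casual_viewers, documentary_fans"
    )
    return json.loads(call_llm(prompt))

def blended_signal(synthetic: float, real: float, real_count: int, ramp: int = 100) -> float:
    """Linearly hand off from synthetic to real engagement as impressions accumulate."""
    w = min(real_count / ramp, 1.0)
    return (1 - w) * synthetic + w * real

syn = synthetic_signals({"title": "Lo-fi Space Ambience", "category": "music", "tags": ["chill"]})
print(syn)
print(blended_signal(syn["casual_viewers"], real=0.55, real_count=40))
```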
Key Learnings
- Language-based preference representations are more explainable and auditable than pure embedding vectors
- Zero-shot LLM recommendations are surprisingly competitive with collaborative filtering for users who express preferences in natural language
- The real value is in offline signal generation, not real-time LLM inference
Tradeoffs and Challenges
- LLM-simulated engagement may not match real user behavior, especially for niche or culturally specific content
- Synthetic signals can introduce bias if the LLM has blind spots
- Cost of running LLMs at scale for every new item adds up quickly
- Evaluation is tricky: how do you measure if synthetic signals actually helped before real data arrives?
1.3 Query Understanding & Intent Expansion
The Problem
Users often express vague or underspecified queries like “something fun to watch” or “chill vibes.” Traditional query processing relies on exact term matching or pre-defined taxonomies, which breaks down when users speak naturally. This gap between how users think about what they want and how items are indexed creates a retrieval bottleneck.
Key Papers
- Query2Doc (EMNLP 2023) Link: Generates pseudo-documents from queries using LLMs to enrich sparse retrieval.
- LLMs Know Your Contextual Search Intent (EMNLP 2023) Link: Demonstrates LLMs can infer implicit user intent from ambiguous queries without fine-tuning.
- Query Expansion Survey (2025) Link: Comprehensive overview of LLM-based expansion techniques and production considerations.
High-Level Approach
Take the raw user query and use LLMs to infer latent intent. Generate expanded terms, entities, or reformulations. Optionally use Chain-of-Thought prompting to surface structured expansions. Feed the expanded query into your retrieval pipeline alongside or instead of the original.
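A minimal sketch of that pipeline; the prompt, the `call_llm` stub, and the `lru_cache` stand-in for a shared expansion cache are all assumptions rather than any specific paper's implementation:

```python
import json
from functools import lru_cache

def call_llm(prompt: str) -> str:
    # Placeholder LLM client with a canned response so the sketch runs.
    return json.dumps({"intent": "relaxing, low-commitment entertainment",
                       "expansions": ["feel-good comedy", "light documentary", "ambient music"]})

@lru_cache(maxsize=10_000)  # stand-in for a shared cache of pre-computed expansions
def expand_query(query: str) -> dict:
    prompt = (
        f"User query: {query!r}\n"
        "1. State the latent intent in one phrase.\n"
        "2. List 3-5 expanded search phrases.\n"
        "Return JSON with keys 'intent' and 'expansions'."
    )
    return json.loads(call_llm(prompt))

expanded = expand_query("something chill to watch")
# Downstream: issue the original query plus expansions to the retriever,
# or attach the intent string as a feature for the ranker.
print(expanded)
```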
Key Learnings
- Zero-shot and few-shot expansion works well without task-specific fine-tuning
- LLMs can perform spelling correction and query segmentation in a single pass
- Pre-computing expansions for high-traffic queries makes production deployment feasible
- Distilling GPT-4 outputs into smaller models provides a cost-effective path to scale
Tradeoffs and Challenges
- Latency blocks real-time expansion; caching and pre-computation are essential
- Over-expansion can introduce drift, retrieving items that technically match the expanded query but miss user intent
- Evaluation is difficult because “better understanding” is subjective
- Cost scales with query volume unless you invest in distillation or caching infrastructure
1.4 Conversational Recommendation
The Problem
Static recommendation interfaces force users into passive consumption. Users cannot easily express nuanced preferences like “I liked that movie but it was too long” or “something like last week’s suggestion but more upbeat.” Traditional systems require explicit actions (clicks, ratings) to learn preferences, missing the rich signal that natural conversation could provide.
Key Papers
- RecLLM / Leveraging LLMs in CRS (2023, Google) Link: End-to-end conversational recommender for YouTube using LaMDA, with synthetic conversation generation to bootstrap training data.
High-Level Approach
Layer a conversational interface on top of existing retrieval and ranking infrastructure. The LLM extracts structured preferences from natural language turns and maps them to retrieval queries or filter constraints. Responses explain recommendations and invite refinement. Synthetic conversation data (generated via LLM-based user simulators) can bootstrap the system before real user conversations exist.
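The extraction step might look like the sketch below, where the JSON schema and the `call_llm` stub are illustrative assumptions rather than RecLLM's implementation; the returned constraints would then drive the existing retrieval and ranking stack:

```python
import json

def call_llm(prompt: str) -> str:
    # Placeholder LLM client; canned output keeps the sketch runnable.
    return json.dumps({"liked_reference": "last week's suggestion",
                       "mood": "upbeat", "max_runtime_min": 110, "exclude_genres": []})

def extract_constraints(turn: str, conversation_summary: str) -> dict:
    """Turn one natural-language message into structured retrieval constraints."""
    prompt = (
        f"Conversation so far: {conversation_summary}\n"
        f"Latest user message: {turn!r}\n"
        "Extract JSON with keys: liked_reference, mood, max_runtime_min, exclude_genres."
    )
    return json.loads(call_llm(prompt))

constraints = extract_constraints(
    "something like last week's suggestion but more upbeat and not three hours long",
    conversation_summary="User previously enjoyed a slow sci-fi drama.",
)
# Downstream: translate `constraints` into retrieval filters / boosts, rank as usual,
# then have the LLM phrase the response and invite refinement.
print(constraints)
```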
Key Learnings
- LLMs excel at preference elicitation through clarifying questions
- Multi-turn context enables progressive refinement that single-query systems cannot achieve
- Hybrid architectures combining LLM reasoning with traditional retrieval outperform pure LLM approaches
- Synthetic conversations help solve the cold-start problem for building CRS systems
Tradeoffs and Challenges
- Latency compounds across turns; users expect sub-second responses
- Maintaining coherent context across long conversations is technically difficult and token-expensive
- Evaluation metrics for conversational quality are still immature compared to traditional ranking metrics
- Safety and hallucination risks increase when LLMs generate free-form responses about items
1.5 Explanation Generation
The Problem
Users increasingly demand transparency in why they see specific recommendations. Traditional systems either provide no explanation or rely on templated approaches (“Because you watched X”). When used directly for explanations, off-the-shelf LLMs often fail to capture individual user preferences, producing text that feels generic and impersonal; despite their strong natural language generation capabilities, most models struggle to produce reliable zero-shot explanations.
Key Papers
- Logic-Scaffolding (WSDM 2024) Link: Combines aspect-based explanation with chain-of-thought prompting to generate personalized explanations through intermediate reasoning steps.
High-Level Approach
- Extract key aspects (themes, genres, moods) from recommended items using the LLM
- Ground the explanation in user history by pulling relevant items from their watching history alongside the recommended item’s plot and aspects
- Use chain-of-thought prompting with three distinct steps that force the model to reason through intermediate logic before producing the final explanation
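A loose sketch of that three-step chain, assuming a `call_llm` stub and a naive genre-overlap heuristic for grounding; it approximates the shape of the Logic-Scaffolding pipeline rather than reproducing its prompts:

```python
def call_llm(prompt: str) -> str:
    # Placeholder LLM client returning a canned explanation so the sketch runs.
    return ("Because you watched Arrival and enjoy slow-burn, idea-driven sci-fi, "
            "this recommendation shares its contemplative tone and first-contact theme.")

def explain(rec_item: dict, user_history: list[dict]) -> str:
    # Step 1: extract key aspects of the recommended item.
    aspects = call_llm(f"List 3 key aspects (theme, genre, mood) of: {rec_item['plot']}")
    # Step 2: ground in relevant history items (here: naive genre overlap).
    relevant = [h["title"] for h in user_history if set(h["genres"]) & set(rec_item["genres"])]
    # Step 3: chain-of-thought explanation conditioned on aspects + grounding.
    final_prompt = (
        f"Item aspects: {aspects}\nUser has watched: {relevant}\n"
        "Reason step by step about which aspects match the user's history, "
        "then write a one-sentence personalized explanation."
    )
    return call_llm(final_prompt)

print(explain(
    {"title": "Contact", "plot": "A scientist decodes an alien signal.", "genres": ["sci-fi", "drama"]},
    [{"title": "Arrival", "genres": ["sci-fi", "drama"]}, {"title": "Step Brothers", "genres": ["comedy"]}],
))
```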
Key Learnings
- Structured prompting (aspect extraction + chain-of-thought) dramatically outperforms naive LLM explanation generation
- Grounding explanations in specific user history items improves perceived relevance significantly
- The framework shows large effect sizes across properness, readability, and relevance, addressing multiple limitations of generic approaches simultaneously
Tradeoffs and Challenges
- Faithfulness gap: LLM explanations are post-hoc rationalizations and may not reflect what the ranking model actually optimized for
- Latency prohibits real-time generation for every impression; batch generation or caching is required
- Evaluation remains subjective since what makes an explanation “good” varies by user and context
- Demonstrated on MovieLens 1M using Falcon-40B; production deployment at scale requires additional engineering work
Section 2: Conceptual Borrowing — LLM Ideas Reshaping RecSys Architecture
Section 1 covered using LLMs directly as components. This section is different: it’s about architectural ideas that originated in LLM research and are now being adapted for recommender systems, even when the production system runs no LLM at inference time.
These are the changes worth watching because they affect how we design systems, not just how we prompt models.
2.1 Semantic IDs: From Random Hashes to Meaningful Representations
The Problem
Traditional recommender systems assign each item a random integer ID and learn an embedding for it through user interactions. This creates several fundamental issues. New items have no meaningful embedding until users interact with them. Items with few interactions end up with poorly learned representations. And billion-item catalogs require massive embedding tables with enormous parameter counts. The core limitation is that item IDs carry no inherent meaning, so the system cannot generalize across similar items.
Key Papers
TIGER (2023) Link: First Semantic ID-based generative recommender, introducing the paradigm of autoregressively decoding item identifiers instead of embedding lookup.
Better Generalization with Semantic IDs (2024) Link: YouTube production deployment showing Semantic IDs can replace video IDs while improving generalization on new and long-tail items.
Semantic IDs for Joint Search & Rec (Spotify, 2025) Link: Cross-task Semantic ID construction enabling shared representations across search and recommendation.
Generative Rec with Semantic IDs: A Handbook (2025) Link: Practitioner-focused guide covering implementation details and lessons learned.
High-Level Approach
Instead of mapping each item to a random ID, Semantic IDs encode items as sequences of learned codewords that capture hierarchical meaning. The process typically uses RQ-VAE (Residual Quantized Variational Autoencoder) to compress content embeddings into discrete tokens. So instead of item_47291 → random embedding lookup, you get item_47291 → (cluster_5, sub_25, leaf_78) → compositional embedding. The recommendation model then autoregressively generates these Semantic ID tokens, similar to how LLMs generate text tokens. This draws a direct connection to subword tokenization in NLP.
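The quantization step can be illustrated with a few lines of numpy; real systems learn the codebooks jointly inside an RQ-VAE, whereas the random codebooks below are purely for showing how residuals turn an embedding into a short token tuple:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, LEVELS, CODEWORDS = 64, 3, 256
codebooks = rng.normal(size=(LEVELS, CODEWORDS, DIM))   # one codebook per level

def semantic_id(item_embedding: np.ndarray) -> tuple[int, ...]:
    """Greedy residual quantization: nearest codeword per level, then quantize the residual."""
    residual, tokens = item_embedding.copy(), []
    for level in range(LEVELS):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(dists.argmin())             # nearest codeword at this level
        tokens.append(idx)
        residual -= codebooks[level][idx]     # pass what's left to the next level
    return tuple(tokens)

content_embedding = rng.normal(size=DIM)      # e.g. from a multimodal content encoder
print(semantic_id(content_embedding))         # -> something like (137, 52, 201)
```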
Key Learnings
- The hierarchical nature of Semantic IDs allows granularity control by using various levels of prefixes
- SentencePiece tokenization (borrowed directly from LLM practice) outperforms manually crafted approaches like N-grams
- Moving from “Semantic IDs as better representations” to “Semantic IDs as generative targets” was the critical paradigm shift
- Multiple companies (Google/YouTube, Spotify, Alibaba, Kuaishou) have adopted variants in production
Tradeoffs and Challenges
- RQ-VAE training adds complexity to the pipeline and requires careful tuning
- Semantic IDs are learned from frozen content embeddings, which creates adaptation challenges when content or user behavior shifts
- The compactness that makes Semantic IDs efficient also limits their expressiveness for highly nuanced items
- Balancing semantic similarity with recommendation relevance is non-trivial since semantically similar items are not always good recommendations together
2.2 Generative Retrieval: From Retrieve-then-Rank to Direct Generation
The Problem
Traditional recommender systems follow a rigid multi-stage pipeline: embed the query, run approximate nearest neighbor search, retrieve candidates, then rank them. Each stage is optimized separately with its own objective function, creating interface mismatches between components. The retrieval stage optimizes for embedding similarity while the ranker optimizes for engagement, and these objectives do not always align. This fragmentation limits end-to-end learning and makes the system harder to reason about holistically.
Key Papers
DSI (Differentiable Search Index) (2022) Link: Introduced the paradigm of encoding document corpora directly into model parameters, enabling retrieval through generation rather than embedding lookup.
TIGER (2023) Link: Applied generative retrieval to recommendations using Semantic IDs, achieving 17.3% higher recall@5 than prior methods while reducing memory usage by 20x on Amazon datasets.
OneRec (2025) Link: First industrial-scale deployment of generative recommendation, proving the paradigm works beyond academic benchmarks in real-world production systems.
RIPOR (2024) Link: Addressed scalability challenges in generative retrieval through progressive optimization techniques.
High-Level Approach
Instead of the traditional Query → Embedding → ANN Search → Candidates → Ranker → Results pipeline, generative retrieval collapses this into a single autoregressive model. Given a user’s interaction history encoded as Semantic IDs, a Transformer-based sequence-to-sequence model directly decodes the Semantic ID of the next item the user will interact with. The model learns to generate item identifiers token by token, similar to how language models generate text. Beam search at inference time naturally produces a ranked list of candidates.
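A toy illustration of the decoding side, where `next_token_logprobs` stands in for a trained sequence-to-sequence decoder over Semantic ID tokens; the vocabulary and scores are made up so the sketch runs on its own:

```python
import math

VOCAB = range(4)          # tiny per-level codeword vocabulary
LEVELS = 3                # Semantic IDs are 3 tokens long in this toy

def next_token_logprobs(prefix: tuple[int, ...]) -> dict[int, float]:
    # Placeholder decoder: gives 0.4 probability to token (len(prefix) % 4), 0.2 to the rest.
    favored = len(prefix) % 4
    return {t: (math.log(0.4) if t == favored else math.log(0.2)) for t in VOCAB}

def beam_search(beam_size: int = 3) -> list[tuple[tuple[int, ...], float]]:
    beams = [((), 0.0)]                       # (token prefix, cumulative log-prob)
    for _ in range(LEVELS):
        expanded = []
        for prefix, score in beams:
            for tok, lp in next_token_logprobs(prefix).items():
                expanded.append((prefix + (tok,), score + lp))
        beams = sorted(expanded, key=lambda x: x[1], reverse=True)[:beam_size]
    return beams                              # ranked list of candidate Semantic IDs

for sid, score in beam_search():
    print(sid, round(score, 3))
```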
Key Learnings
- End-to-end optimization eliminates the objective mismatch between retrieval and ranking stages
- Beam search provides natural diversity in results without explicit diversification logic
- Content-based Semantic IDs give the system a meaningful starting point for cold-start items
- The paradigm shift from “retrieval as search” to “retrieval as generation” unlocks new architectural possibilities
- Industrial deployments (OneRec at Kuaishou) have shown this approach can outperform well-tuned traditional systems at scale
Tradeoffs and Challenges
- Autoregressive decoding is inherently sequential, creating latency concerns for real-time serving
- Training requires careful curriculum design to handle the combinatorial space of possible item sequences
- Scaling to billion-item catalogs pushes the limits of current generative architectures
- Debugging is harder since there is no explicit retrieval index to inspect
- The approach is still maturing, with fewer established best practices compared to traditional pipelines
2.3 Foundation Models for User Behavior: Treating Actions as a Modality
The Problem
Large-scale recommendation systems rely on high-cardinality, heterogeneous features and handle tens of billions of user actions daily. Despite training on massive data with thousands of features, most Deep Learning Recommendation Models (DLRMs) in industry fail to scale with compute. Traditional architectures hit diminishing returns: adding more parameters or data does not reliably improve quality. Meanwhile, LLMs demonstrated that pre-training on massive corpora creates powerful foundation models that follow predictable scaling laws. The question became: can we do the same for user behavior sequences?
Key Papers
HSTU / Actions Speak Louder than Words (Meta, ICML 2024) Link: Introduces trillion-parameter sequential transducers that reformulate recommendations as generative sequence modeling, achieving 12.4% online gains.
HLLM (ByteDance, 2024) Link: Hierarchical LLM architecture that separates item-level and user-level modeling for efficiency.
ReLLa (2023) Link: Retrieval-enhanced approach for handling lifelong user behavior sequences that exceed context limits.
High-Level Approach
Reformulate recommendation as sequential transduction within a generative modeling framework. User behavior (clicks, watches, purchases) becomes a “language” of actions that can be modeled autoregressively. The HSTU (Hierarchical Sequential Transduction Unit) architecture replaces traditional DLRMs with Transformer-based sequence models operating over user action histories. Key innovations include pointwise normalization for non-stationary vocabularies, hierarchical temporal units for long-range dependencies, and the M-FALCON inference algorithm, which serves models roughly 285x more complex within the same compute budget.
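The data framing can be shown with a tiny sketch: each prefix of a user's action sequence predicts the next action, exactly as in language-model pretraining. The `action:item` token scheme below is an illustrative simplification of what HSTU actually encodes (features, timestamps, and target types are richer in practice):

```python
def to_training_examples(history: list[tuple[str, str]]) -> list[tuple[list[str], str]]:
    """Turn a user's action history into (context prefix, next-action target) pairs."""
    tokens = [f"{action}:{item}" for action, item in history]
    # Every prefix predicts the next token, just like next-token prediction in LLM pretraining.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

user_history = [("click", "v101"), ("watch", "v101"), ("like", "v101"), ("click", "v202")]
for context, target in to_training_examples(user_history):
    print(context, "->", target)
```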
Key Learnings
- Recommendation model quality follows scaling laws as a power-law of training compute across three orders of magnitude, matching GPT-3/LLaMA-2 scale behavior
- Pure sequential transduction architectures significantly outperform traditional DLRMs in large-scale industrial settings
- HSTU outperforms baselines by up to 65.8% in NDCG and runs 5.3x to 15.2x faster than FlashAttention2-based Transformers on 8192-length sequences
- 1.5 trillion parameter models have been deployed on multiple surfaces at Meta with billions of users
- This represents what some call the “ChatGPT moment” for recommendation systems
Tradeoffs and Challenges
- Training at this scale requires significant infrastructure investment (row-wise AdamW for 6x HBM reduction, custom kernels)
- Long user histories create memory and latency challenges even with efficient attention
- Non-stationary user behavior distributions differ fundamentally from static text corpora
- Debugging and interpretability become harder as models scale
- The gap between academic reproducibility and industrial deployment remains wide
2.4 In-Context Learning for Personalization
The Problem
Traditional recommendation systems encode user preferences into learned embedding vectors tied to user IDs. This creates several limitations: new users have no meaningful embedding until sufficient interactions accumulate, preference shifts require retraining or fine-tuning, and embeddings are not transferable across domains or platforms. The fundamental issue is that user representations are static parameters rather than dynamic functions of observed behavior.
Key Papers
P5 (2022) Link: Unified text-to-text framework that reformulates five recommendation tasks (rating, sequential, explanation, review, direct) as natural language generation with personalized prompts.
CALRec (Google, RecSys 2024) Link: Contrastive alignment method that bridges the gap between LLM representations and recommendation-specific objectives, improving generalization without task-specific fine-tuning.
RecMind (NAACL 2024) Link: LLM-powered autonomous recommendation agent that uses self-inspiring reasoning to plan and execute personalized recommendations through tool use.
High-Level Approach
Instead of learning a fixed embedding vector for each user, treat the user’s interaction history as a prompt that conditions model output. The user’s past items become few-shot examples that demonstrate their preferences. Given this “prompt,” the model generates recommendations by continuing the sequence, similar to how LLMs perform few-shot learning from examples in context. No per-user parameters are stored or updated; personalization emerges entirely from conditioning on history at inference time.
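A minimal sketch of history-as-prompt personalization, with the prompt template and `call_llm` stub as assumptions; note that no per-user embedding is stored anywhere:

```python
def call_llm(prompt: str) -> str:
    # Placeholder LLM client; canned response keeps the sketch runnable.
    return "A quiet, character-driven space drama with an optimistic tone."

def recommend_in_context(history: list[dict], candidate_pool_hint: str) -> str:
    """Format the user's history as in-context evidence and let the model continue it."""
    evidence = "\n".join(
        f"- {h['title']} ({h['genre']}): user rating {h['rating']}/5" for h in history
    )
    prompt = (
        "The items below describe one user's recent preferences:\n"
        f"{evidence}\n\n"
        f"From the {candidate_pool_hint}, describe the single item this user would "
        "most enjoy next, and why, in one sentence."
    )
    return call_llm(prompt)

print(recommend_in_context(
    [{"title": "Arrival", "genre": "sci-fi", "rating": 5},
     {"title": "The Martian", "genre": "sci-fi", "rating": 4}],
    candidate_pool_hint="current streaming catalog",
))
```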
Key Learnings
- Framing user history as context enables instant adaptation to preference shifts without retraining
- The approach transfers naturally across domains since preferences are expressed through item semantics rather than opaque IDs
- Combining contrastive alignment with generative LLMs significantly improves recommendation quality over naive prompting
- Self-inspiring reasoning (where the model generates its own planning steps) outperforms rigid prompting templates
Tradeoffs and Challenges
- Context length limits constrain how much user history can be included; long-term preferences may be truncated
- Inference cost scales with history length since the full context must be processed for each recommendation
- The approach assumes item descriptions carry sufficient semantic signal, which may not hold for all domains
- Evaluation is tricky because improvements in personalization quality are hard to disentangle from general LLM capabilities
- Production deployment requires careful prompt engineering and is sensitive to formatting choices
2.5 Content Authenticity and Fighting Low-Quality Content
The Problem
Generative AI has made content creation trivially easy, flooding platforms with AI-generated spam, low-effort reposts, engagement bait, and synthetic reviews. Traditional moderation relied on keyword filters and user reports, which cannot keep pace with the volume or sophistication of machine-generated content. For recommender systems, surfacing this content degrades user trust and engagement. Quality signals have become as important as relevance signals.
High-Level Approach
Score content on multiple authenticity dimensions before it enters the recommendation pipeline. Originality detection identifies near-duplicates through semantic similarity. Quality scoring evaluates effort, coherence, and informativeness as ranking features. Hybrid systems combine behavioral signals (posting patterns, account age, engagement velocity) with linguistic analysis from LLMs. These signals feed into downstream rankers as features or hard filters.
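One way to sketch combining these signal families into a single soft feature; the weights, thresholds, and `llm_quality_score` stub below are illustrative assumptions, not a production policy:

```python
import numpy as np

def near_duplicate_score(embedding: np.ndarray, known_embeddings: np.ndarray) -> float:
    """Max cosine similarity against previously seen content (1.0 = exact repost)."""
    sims = known_embeddings @ embedding / (
        np.linalg.norm(known_embeddings, axis=1) * np.linalg.norm(embedding) + 1e-9)
    return float(sims.max())

def llm_quality_score(text: str) -> float:
    # Placeholder for an LLM call that rates effort/coherence/informativeness in [0, 1].
    return 0.35

def authenticity_feature(text, embedding, known_embeddings, account_age_days, posts_last_hour):
    dup = near_duplicate_score(embedding, known_embeddings)
    # Simple behavioral heuristic: high posting velocity from a young account is riskier.
    behavioral_risk = min(posts_last_hour / 20, 1.0) * (1.0 if account_age_days < 7 else 0.3)
    quality = llm_quality_score(text)
    # Soft score fed to the ranker as a feature rather than a hard block.
    return 0.5 * quality + 0.3 * (1 - dup) + 0.2 * (1 - behavioral_risk)

rng = np.random.default_rng(1)
print(authenticity_feature("Amazing!!! Click here!!!", rng.normal(size=32),
                           rng.normal(size=(100, 32)), account_age_days=2, posts_last_hour=15))
```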
Key Learnings
- LLMs significantly outperform previous moderation tools in detecting policy violations
- Hybrid systems combining graph-based spam detection with LLM content analysis beat either approach alone
- Quality scores work better as soft ranking signals than binary filters
- Detection models benefit from platform-specific training data since AI-generated content varies by domain
Tradeoffs and Challenges
- Adversarial evolution is constant; as detection improves, generation adapts to evade it
- False positives on legitimate content create trust issues and potential legal exposure
- Running LLM inference on every piece of content is expensive; sampling and tiered approaches are necessary
- Defining “quality” is subjective and culturally dependent, making ground truth labeling difficult
- Transparency requirements may conflict with keeping detection methods confidential
Section 3 Summary: AI for Developer Productivity
The Big Picture: ML engineers spend most of their time on infrastructure, debugging, analysis, and documentation — not model architecture. AI tools are transforming this day-to-day work.
Use Cases & What’s Helping
1. Code Assistance
- Autocomplete and boilerplate generation (GitHub Copilot, Cursor, Claude Code)
- Feature pipeline scaffolding from descriptions
- Auto-generating unit tests for data processing code
- Converting code to documentation
- Senior engineers report 55%+ productivity gains; tools amplify existing skills rather than replace expertise
2. Text-to-SQL & Automated Analytics
- Natural language → SQL queries (“Why did CTR drop 2% yesterday for iOS users?”)
- Metric debugging, cohort analysis, feature validation
- Works well for common queries; complex enterprise schemas still need human refinement
- Decomposing queries into focused sub-problems improves accuracy (see the sketch below)
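A hedged sketch of the text-to-SQL flow, where the schema snippet, prompt, and canned response are assumptions; real deployments add schema retrieval, query validation, and a read-only execution sandbox:

```python
# Hypothetical table used only for illustration.
SCHEMA = "table impressions(date DATE, platform TEXT, surface TEXT, impressions BIGINT, clicks BIGINT)"

def call_llm(prompt: str) -> str:
    # Placeholder LLM client returning a canned query so the sketch runs.
    return ("SELECT date, SUM(clicks)::FLOAT / SUM(impressions) AS ctr\n"
            "FROM impressions WHERE platform = 'iOS'\n"
            "AND date BETWEEN CURRENT_DATE - 2 AND CURRENT_DATE - 1\n"
            "GROUP BY date ORDER BY date;")

def nl_to_sql(question: str) -> str:
    """Ask the model to decompose the question, then emit one read-only SQL query."""
    prompt = (
        f"Schema:\n{SCHEMA}\n"
        f"Question: {question}\n"
        "Decompose the question into sub-metrics, then write a single read-only SQL query."
    )
    return call_llm(prompt)

print(nl_to_sql("Why did CTR drop 2% yesterday for iOS users?"))
```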
3. Autonomous Coding Agents
- Moving beyond suggestions to actually executing tasks (SWE-agent, Claude Code agents)
- Bug fixes, feature implementation, refactoring, test generation
- SWE-bench performance jumped from ~20% to 74%+ in 18 months
- Now used at Meta, NVIDIA, and others for real engineering work
4. Experiment Analysis & Results Synthesis
- Auto-monitoring metrics during A/B tests
- Generating summary reports and highlighting anomalies
- Suggesting follow-up experiments based on results
- Reducing the manual SQL → charts → analysis doc → presentation cycle
5. Research-to-Production Translation
- Summarizing papers and identifying key implementation details
- Converting pseudocode to working code in your stack
- Suggesting optimizations (batching, caching, vectorization)
- Generating test cases based on expected paper behavior
6. Documentation & Knowledge Management
- Auto-generating docstrings from code
- Creating onboarding guides from codebase analysis
- Writing design docs from code changes
- Migration guides for upgrades
7. Multi-Agent Systems (Emerging)
- Teams of specialized agents: Researcher, Engineer, Analyst, Reviewer
- Structured workflows mimicking real engineering teams
- Still early but showing promise for end-to-end task automation
Key Takeaway
AI tools are eating the “undifferentiated heavy lifting” of ML engineering — the SQL queries, boilerplate code, documentation, and routine analysis. This frees engineers to focus on the hard problems: system design, novel algorithms, and understanding user behavior. The engineers who master these tools will dramatically outpace those who don’t.