DL Recsys

Deep Ranking Models — Comprehensive Deep Dive
Deep ranking models sit in the late-stage ranking part of the pipeline. Their job: given a few hundred candidates from retrieval, score each candidate with a rich model that captures complex feature interactions. This is where the majority of recommendation quality comes from.
Why Deep Ranking Models Exist
Classical ranking used hand-crafted feature crosses (e.g., “user_age=25 AND item_category=shoes”). This has two problems:
- Feature engineering is expensive — you need domain experts to manually define every useful interaction.
- Generalization is poor — a cross feature only fires for exact matches. “user_age=25 × category=shoes” learns nothing about “user_age=26 × category=sneakers.”
Deep ranking models aim to learn feature interactions automatically while retaining the ability to memorize important specific patterns.
The Models, In Depth
Wide & Deep (Google, 2016)
The paper that launched deep learning in production recommendation.
Architecture:
- Wide component: A linear model that takes manually engineered cross-product features as input. For example, installed_app × impression_app — this captures the memorization of specific, known-to-be-important feature co-occurrences.
- Deep component: A feed-forward MLP (typically 3 layers, 1024→512→256) that takes dense embeddings of categorical features + continuous features. This captures generalization — it can learn that “young users who browse athletic categories” is a meaningful pattern even without explicit feature engineering.
- Joint training: Both components are trained simultaneously with a combined logistic loss. The final prediction is σ(w_wide · x_cross + w_deep · a_final + bias).
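A minimal PyTorch sketch of the joint wide + deep forward pass described above; the single shared embedding table, vocabulary size, and field counts are illustrative assumptions, not the production configuration.

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    """Sketch of Wide & Deep: linear wide part over cross features + MLP deep part."""
    def __init__(self, n_cross_features, n_sparse_fields, vocab_size, emb_dim=32, n_dense=8):
        super().__init__()
        self.wide = nn.Linear(n_cross_features, 1)              # memorization
        self.embedding = nn.Embedding(vocab_size, emb_dim)      # one shared table for simplicity
        deep_in = n_sparse_fields * emb_dim + n_dense
        self.deep = nn.Sequential(                              # generalization
            nn.Linear(deep_in, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, x_cross, x_sparse, x_dense):
        # x_cross: (B, n_cross) binary cross-product features
        # x_sparse: (B, n_sparse_fields) categorical ids; x_dense: (B, n_dense) continuous features
        emb = self.embedding(x_sparse).flatten(1)
        logit = self.wide(x_cross) + self.deep(torch.cat([emb, x_dense], dim=1))
        return torch.sigmoid(logit)                             # σ(w_wide·x_cross + deep output + bias)
```

Both components receive gradients from the same logistic loss, which is what "joint training" means here.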
Why it matters:
- Memorization (wide) handles the long tail of specific patterns: “users who installed Netflix and see an HBO ad convert at 3x.”
- Generalization (deep) handles unseen combinations: a new user with similar demographics to Netflix users might also like HBO, even without the install signal.
- This was deployed for Google Play app recommendations and became the template for the industry.
Weaknesses:
- The wide component still requires manual feature engineering — you have to decide which crosses matter.
- The deep component learns interactions implicitly through hidden layers, but MLPs are actually not great at learning multiplicative interactions — they approximate them through addition.
DeepFM (Harbin Institute of Technology & Huawei, 2017)
Key idea: Replace the manually engineered wide component with a Factorization Machine (FM) that automatically learns all 2nd-order feature interactions.
Architecture:
- FM component: For features x₁, x₂, …, xₙ with embeddings v₁, v₂, …, vₙ, computes all pairwise interactions: Σᵢ Σⱼ>ᵢ ⟨vᵢ, vⱼ⟩ · xᵢ · xⱼ. This captures every 2nd-order cross-feature automatically.
- Deep component: Same MLP as Wide & Deep.
- Shared embeddings: Crucially, the FM component and the deep component share the same embedding layer. This means the FM’s learned feature representations inform the deep component and vice versa — no separate feature engineering needed.
- Final output: σ(y_FM + y_deep).
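A compact sketch of the shared-embedding design, assuming per-field embedding tables and a small MLP; the FM term uses the same O(nk) identity that NFM exploits later in this note.

```python
import torch
import torch.nn as nn

class DeepFM(nn.Module):
    """Sketch of DeepFM: FM and MLP read the same embeddings; output is σ(y_FM + y_deep)."""
    def __init__(self, field_vocab_sizes, emb_dim=16):
        super().__init__()
        self.embeddings = nn.ModuleList([nn.Embedding(v, emb_dim) for v in field_vocab_sizes])
        self.first_order = nn.ModuleList([nn.Embedding(v, 1) for v in field_vocab_sizes])
        self.mlp = nn.Sequential(
            nn.Linear(len(field_vocab_sizes) * emb_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x):                                   # x: (B, n_fields) categorical ids
        embs = torch.stack([e(x[:, i]) for i, e in enumerate(self.embeddings)], dim=1)  # (B, F, k)
        # FM 2nd-order term via 0.5 * [(Σv)² − Σv²], summed over the embedding dimension
        y_fm = 0.5 * (embs.sum(1).pow(2) - embs.pow(2).sum(1)).sum(1, keepdim=True)
        y_fm = y_fm + sum(lin(x[:, i]) for i, lin in enumerate(self.first_order))
        y_deep = self.mlp(embs.flatten(1))                  # the deep part reuses the same embeddings
        return torch.sigmoid(y_fm + y_deep)
```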
Why it’s better than Wide & Deep:
- No manual feature crosses. The FM learns all pairwise interactions.
- Shared embeddings mean fewer parameters and better training signal for embeddings.
- FM captures explicit 2nd-order interactions; deep captures higher-order implicit interactions.
Connection to classical FMs:
- Factorization Machines (Rendle, 2010) were already the state-of-the-art for sparse, high-dimensional CTR prediction. They generalize matrix factorization to arbitrary feature sets. DeepFM simply adds a deep component on top.
Weakness: FM only captures 2nd-order interactions explicitly. For 3rd-order and above, you still rely on the MLP.
DCN — Deep & Cross Network (Google, 2017) and DCN-v2 (2020)
Key idea: Instead of relying on the MLP to implicitly learn feature crosses, add an explicit cross network that learns bounded-degree feature interactions.
DCN-v1 Architecture:
- Cross network: A series of layers where each layer computes: x_{l+1} = x₀ · (xₗᵀ · w_l) + b_l + xₗ. Here x₀ is the input, and each layer adds one more degree of interaction. After L layers, you get interactions up to degree L+1.
- This is computationally cheap — each layer is just a rank-1 matrix multiplication plus a residual connection.
- The cross network is explicit — you can reason about what degree of interaction each layer captures.
- Deep network: Standard MLP, same as before.
- Outputs are concatenated and passed through a final linear layer.
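A sketch of the v1 cross network under the formula above; the input dimension and layer count are arbitrary.

```python
import torch
import torch.nn as nn

class CrossNetV1(nn.Module):
    """DCN-v1 cross network: x_{l+1} = x0 · (x_lᵀ w_l) + b_l + x_l, repeated L times."""
    def __init__(self, dim, n_layers=3):
        super().__init__()
        self.w = nn.ParameterList([nn.Parameter(0.01 * torch.randn(dim)) for _ in range(n_layers)])
        self.b = nn.ParameterList([nn.Parameter(torch.zeros(dim)) for _ in range(n_layers)])

    def forward(self, x0):                          # x0: (B, dim) stacked embeddings + dense features
        x = x0
        for w, b in zip(self.w, self.b):
            xw = (x * w).sum(dim=1, keepdim=True)   # scalar x_lᵀ w_l per example
            x = x0 * xw + b + x                     # rank-1 cross term plus residual
        return x                                    # interactions up to degree n_layers + 1
```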
DCN-v2 Improvements:
- The v1 cross network has limited expressiveness — each layer is rank-1. DCN-v2 replaces the rank-1 weight vector with a full weight matrix: x_{l+1} = x₀ ⊙ (W_l · xₗ + b_l) + xₗ.
- Uses Mixture of Experts (MoE) in the cross layers — multiple expert weight matrices, with a gating network selecting which experts to use. This dramatically increases capacity while keeping compute bounded.
- Two architectures: stacked (cross → deep sequentially) and parallel (cross and deep in parallel, concatenated). Stacked generally performs better.
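The v2 cross layer differs only in swapping the weight vector for a full matrix; here is a sketch without the MoE / low-rank machinery (which would add several expert matrices and a gating network on top).

```python
import torch
import torch.nn as nn

class CrossNetV2(nn.Module):
    """DCN-v2 cross layer: x_{l+1} = x0 ⊙ (W_l x_l + b_l) + x_l."""
    def __init__(self, dim, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])

    def forward(self, x0):
        x = x0
        for layer in self.layers:
            x = x0 * layer(x) + x               # element-wise product with the input, plus residual
        return x
```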
Why DCN matters:
- MLPs are inefficient at learning multiplicative feature interactions. They can approximate them, but it requires many parameters and training steps. The cross network provides a direct, efficient path.
- You get explicit control over the interaction degree (network depth = max interaction order).
- DCN-v2 with MoE is the current production standard at Google for ad ranking.
DIN — Deep Interest Network (Alibaba, 2018)
This is arguably the most important architecture for e-commerce recommendation.
The Problem DIN Solves: In shopping, a user’s intent is highly context-dependent. If I’m looking at a winter jacket, my past purchase of hiking boots is very relevant, but my purchase of a phone charger is not. Standard models (including Wide & Deep, DeepFM, DCN) represent a user as a fixed-length vector by averaging or pooling all their historical behaviors. This loses critical context.
Architecture:
- Behavior sequence: The user’s past N interactions, each represented as an item embedding.
- Candidate item: The item being scored, also represented as an embedding.
- Attention mechanism: For each historical item, compute an attention weight relative to the candidate item:
- a(eᵢ, e_candidate) = MLP([eᵢ; e_candidate; eᵢ ⊙ e_candidate; eᵢ − e_candidate])
- This outputs a scalar weight for each historical item.
- Weighted sum: The user representation = Σ aᵢ · eᵢ — a candidate-aware weighted average of historical behaviors.
- This user representation, along with other user/context features, is fed to a standard MLP for final scoring.
Key Design Decisions:
- No softmax normalization on attention weights — unlike standard attention, DIN uses raw attention scores. The reasoning: if a user has no relevant history for a candidate, the attention weights should all be low, producing a near-zero user representation for that interest. Softmax would force a distribution that sums to 1, artificially amplifying irrelevant behaviors.
- Activation function: PReLU (parametric ReLU) instead of ReLU — its learnable negative slope copes better with the sparse, imbalanced input distributions the hidden layers see.
- Mini-batch Aware Regularization: With massive categorical feature spaces (billions of product IDs), standard L2 regularization computes over all parameters. DIN regularizes only the parameters present in each mini-batch, making training tractable.
- Data Adaptive Activation Function (Dice): A smooth, data-dependent alternative to PReLU that adapts its shape to the input distribution.
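A sketch of the candidate-aware attention pooling with the design choices above (raw, unnormalized weights; PReLU in the attention MLP). Dimensions and hidden sizes are placeholders.

```python
import torch
import torch.nn as nn

class DINAttention(nn.Module):
    """DIN local activation unit: score each history item against the candidate, no softmax."""
    def __init__(self, emb_dim=32, hidden=64):
        super().__init__()
        self.att_mlp = nn.Sequential(           # input: [e_i; e_cand; e_i ⊙ e_cand; e_i − e_cand]
            nn.Linear(4 * emb_dim, hidden), nn.PReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, history, candidate, mask):
        # history: (B, N, k) behavior embeddings, candidate: (B, k), mask: (B, N) 1 for real behaviors
        cand = candidate.unsqueeze(1).expand_as(history)
        att_in = torch.cat([history, cand, history * cand, history - cand], dim=-1)
        scores = self.att_mlp(att_in).squeeze(-1) * mask      # raw weights: irrelevant history stays near 0
        return (scores.unsqueeze(-1) * history).sum(dim=1)    # candidate-aware user representation (B, k)
```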
Why it’s transformative for shopping: Before DIN, if you bought running shoes, a phone case, and diapers, your user embedding was a blurry average of an athlete, a tech user, and a parent. With DIN, when scoring a protein bar, the attention mechanism upweights running shoes and downweights phone case and diapers. Your user representation changes for every candidate.
DIEN — Deep Interest Evolution Network (Alibaba, 2019)
Key idea: User interests aren’t just a bag of behaviors — they evolve over time. DIN captures relevance but ignores temporal dynamics.
Architecture (two key modules):
Interest Extractor Layer:
- A GRU (Gated Recurrent Unit) over the user behavior sequence, producing hidden states h₁, h₂, …, hₜ.
- Auxiliary loss: At each time step, the GRU should predict the next item in the sequence. An auxiliary binary cross-entropy loss ensures each hidden state hₜ captures the user’s interest at time t (positive = next actual click, negative = random item).
- Without this auxiliary loss, the GRU hidden states tend to just encode item identity rather than evolving interest.
Interest Evolution Layer:
- An attention-based GRU (AUGRU) that evolves interest states relative to the candidate item.
- Attention scores are computed between each hidden state and the candidate item (similar to DIN).
- These attention scores modulate the GRU update gate: u’ₜ = aₜ · uₜ. When attention is low (irrelevant behavior), the update gate is suppressed and that behavior barely affects the evolving state. When attention is high, the behavior strongly updates the interest state.
- The final hidden state of the AUGRU is the user’s evolved interest representation for this specific candidate.
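A sketch of a single AUGRU step, showing where the attention score enters the update gate; the gate layout follows a standard GRU cell and is an illustrative simplification.

```python
import torch
import torch.nn as nn

class AUGRUCell(nn.Module):
    """GRU cell whose update gate is rescaled by the attention score a_t (u'_t = a_t · u_t)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.gates = nn.Linear(input_dim + hidden_dim, 2 * hidden_dim)   # update and reset gates
        self.cand = nn.Linear(input_dim + hidden_dim, hidden_dim)        # candidate state

    def forward(self, x_t, h_prev, a_t):
        # x_t: (B, D) interest state from the extractor GRU, h_prev: (B, H), a_t: (B, 1) attention score
        u, r = torch.sigmoid(self.gates(torch.cat([x_t, h_prev], dim=1))).chunk(2, dim=1)
        u = a_t * u                                   # low attention ⇒ this step barely updates the state
        h_tilde = torch.tanh(self.cand(torch.cat([x_t, r * h_prev], dim=1)))
        return (1 - u) * h_prev + u * h_tilde
```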
Why it matters:
- Captures interest drift — a user who was into camping gear last month but has shifted to home office equipment this month. The GRU forgets the camping interest as newer signals dominate.
- The attention-gated evolution means the model tracks different interest evolution trajectories for different candidates.
SIM — Search-based Interest Model (Alibaba, 2020)
The Problem: DIN and DIEN only handle short-term behavior sequences (typically the last 50-100 interactions) because attention/GRU over long sequences is computationally expensive. But in e-commerce, users have months or years of behavior history, and long-term patterns matter (seasonal purchases, life events).
Architecture (two stages):
General Search Unit (GSU) — hard search:
- Given the candidate item, retrieve the top-K most relevant behaviors from the user’s entire history (potentially thousands of items).
- Two strategies:
- Hard search: Category-based matching — select behaviors in the same category as the candidate. Simple but effective.
- Soft search: Embedding-based nearest neighbor search over all historical item embeddings.
Exact Search Unit (ESU):
- Apply DIN-style attention over only the K retrieved behaviors (typically K=50-200).
- Since K is small, this is computationally tractable even with rich attention.
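A sketch of the hard-search GSU; the tensor layout, category ids, and cutoff K are assumptions. The ESU is then just DIN-style attention (as in the earlier sketch) over the returned K items.

```python
import torch

def hard_search_gsu(hist_embs, hist_categories, candidate_category, k=100):
    """GSU hard search: keep the most recent K behaviors in the candidate's category."""
    # hist_embs: (N, emb_dim), hist_categories: (N,) int ids, ordered oldest → newest
    idx = (hist_categories == candidate_category).nonzero(as_tuple=True)[0]
    return hist_embs[idx[-k:]]                  # at most K matching behaviors for the ESU to attend over
```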
Why it matters:
- At Alibaba’s scale, users have 10,000+ historical interactions. Running attention over all of them is infeasible. SIM makes it O(K) instead of O(N).
- Captures long-range dependencies: “this user bought a tent 8 months ago → relevant when scoring a sleeping bag today.”
AutoRec (Sedhain et al., 2015)
Branch: Autoencoder → Recommendation
Key idea: Use an autoencoder to reconstruct the user-item interaction matrix. The bottleneck layer learns a compressed representation of either users or items.
How it works:
- Item-based AutoRec: Input is a partial column of the rating matrix (all users’ ratings for one item, with missing entries masked). The autoencoder reconstructs the full column. Missing entries in the reconstruction are predictions.
- Architecture: Input (n_users) → hidden layer (k units, sigmoid) → output (n_users, identity).
- Loss: MSE only on observed entries: Σ ||r - f(r; θ)||² (masked).
- The hidden layer is effectively learning a latent item representation, similar to MF — but the non-linear activation gives it more expressiveness.
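A sketch of item-based AutoRec and its masked reconstruction loss, assuming a dense ratings tensor with an explicit observation mask.

```python
import torch
import torch.nn as nn

class ItemAutoRec(nn.Module):
    """Item-based AutoRec: reconstruct one item's rating column through a k-unit bottleneck."""
    def __init__(self, n_users, k=500):
        super().__init__()
        self.encoder = nn.Linear(n_users, k)
        self.decoder = nn.Linear(k, n_users)

    def forward(self, r):                       # r: (B, n_users) partially observed rating columns
        return self.decoder(torch.sigmoid(self.encoder(r)))   # identity activation on the output

def masked_mse(pred, r, observed):
    """MSE over observed entries only; reconstructions of missing entries are the predictions."""
    return ((pred - r) ** 2 * observed).sum() / observed.sum()
```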
Why it matters:
- Simplest possible neural rec model — showed that even a basic autoencoder beats MF baselines, establishing that neural approaches have value for recommendations.
- Directly inspired the line of research into deeper autoencoders for recs (Variational Autoencoders → MultVAE, which is still competitive today).
Limitations: Shallow (one hidden layer), doesn’t incorporate side features, fundamentally still CF-only.
NeuralCF — Neural Collaborative Filtering (He et al., 2017)
Branch: Replace the dot product in MF with a neural network.
Key idea: Matrix Factorization uses a fixed dot product to combine user and item embeddings. But why limit ourselves to a dot product? Let a neural network learn an arbitrary interaction function.
Architecture (NeuMF = GMF + MLP): Two parallel pathways:
- GMF (Generalized Matrix Factorization): Element-wise product of user and item embeddings: φ = pᵤ ⊙ qᵢ. This is a generalization of MF — if you put a linear layer with uniform weights on top, you get standard dot product MF.
- MLP pathway: Concatenate user and item embeddings, pass through multiple hidden layers: [pᵤ; qᵢ] → h₁ → h₂ → h₃. This learns non-linear interactions.
- Fusion: Concatenate outputs of GMF and MLP, pass through a final output layer for prediction.
- Uses separate embeddings for GMF and MLP pathways (different embedding spaces optimized for different interaction types).
Training: Pointwise BCE loss with negative sampling (4 negatives per positive is the standard).
Pre-training trick: First train GMF and MLP separately, then initialize NeuMF with their weights and fine-tune jointly. This stabilizes training.
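A sketch of the NeuMF fusion with the separate GMF/MLP embedding tables noted above; layer widths are placeholders.

```python
import torch
import torch.nn as nn

class NeuMF(nn.Module):
    """GMF (element-wise product) and MLP pathways with separate embeddings, fused at the end."""
    def __init__(self, n_users, n_items, gmf_dim=16, mlp_dim=32):
        super().__init__()
        self.user_gmf = nn.Embedding(n_users, gmf_dim)
        self.item_gmf = nn.Embedding(n_items, gmf_dim)
        self.user_mlp = nn.Embedding(n_users, mlp_dim)
        self.item_mlp = nn.Embedding(n_items, mlp_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * mlp_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
        )
        self.out = nn.Linear(gmf_dim + 16, 1)   # fusion layer over both pathways

    def forward(self, user, item):
        gmf = self.user_gmf(user) * self.item_gmf(item)        # φ = p_u ⊙ q_i
        mlp = self.mlp(torch.cat([self.user_mlp(user), self.item_mlp(item)], dim=1))
        return torch.sigmoid(self.out(torch.cat([gmf, mlp], dim=1)))
```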
Why it matters:
- Foundational paper — established that learned interaction functions outperform fixed dot products for collaborative filtering.
- Cited 5000+ times, launched the “neural recommendation” wave.
Limitations:
- Later work showed that with proper tuning, a simple dot product MF can match or beat NeuralCF (Rendle et al., 2020 — “Neural Collaborative Filtering vs. Matrix Factorization Revisited”). The gains were partly from the MLP’s additional capacity rather than the non-linear interaction per se.
- Doesn’t incorporate features beyond user/item IDs — purely CF.
- Not suitable for retrieval: the interaction function is non-linear and needs both user and item at inference time, so scoring can't be reduced to a dot product for approximate nearest-neighbor search.
Deep Crossing (Microsoft, 2016)
Branch: ResNet-style architecture for CTR prediction.
Key idea: Stack residual layers on top of feature embeddings to learn feature crosses, with no manual feature engineering at all.
Architecture:
- Embedding layer: Each categorical feature → embedding vector. Numeric features passed through.
- Stacking layer: Concatenate all embeddings + numeric features into one vector.
- Multiple residual layers: Each block computes: output = ReLU(W₂ · ReLU(W₁ · x + b₁) + b₂) + x. The residual connection (+ x) enables training very deep networks.
- Scoring layer: Sigmoid for CTR prediction.
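The residual unit as written above, as a small sketch (the hidden width is arbitrary).

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Deep Crossing residual block: ReLU(W₂ · ReLU(W₁ · x + b₁) + b₂) + x."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return torch.relu(self.fc2(torch.relu(self.fc1(x)))) + x   # residual skip enables deep stacks
```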
Why it matters:
- One of the first purely deep learning CTR models that worked in production (deployed for Bing Ads).
- Demonstrated that residual connections solve the vanishing gradient problem for deep recommendation models, just as they did for computer vision.
- No feature engineering, no manual crosses — the network learns everything from raw features.
Comparison to Wide & Deep (published same year, 2016):
- Deep Crossing is “all deep, no wide” — it relies entirely on the deep network for both memorization and generalization.
- Wide & Deep hedges by keeping a linear wide component for memorization.
- Deep Crossing needs to be deeper to compensate (5+ residual layers), but avoids the manual feature engineering that the wide component requires.
PNN — Product-based Neural Network (Qu et al., 2016)
Branch: Explicit product layer on top of embeddings before the MLP.
Key idea: Before feeding embeddings into the MLP, add an explicit product layer that computes pairwise feature interactions, then let the MLP process these interactions.
Architecture:
- Embedding layer: Same as other models.
- Product layer: Computes two types of signals:
- Linear signal (lz): Simply passes embeddings through (like a standard stacking layer).
- Pairwise product signal (lp): Computes pairwise interactions between all embedding pairs.
- Deep layers: MLP on top of the concatenated linear + product signals.
Two variants:
- IPNN (Inner Product NN): lp = ⟨fᵢ, fⱼ⟩ — inner product between embedding pairs. Produces a matrix of pairwise similarities.
- OPNN (Outer Product NN): lp = fᵢ · fⱼᵀ — outer product between embedding pairs. Richer (captures dimension-wise interactions) but much more expensive (each pair produces a k×k matrix). Approximated using sum-pooling to keep it tractable.
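A sketch of the IPNN product signal (the cheaper variant); field count and shapes are illustrative.

```python
import torch

def ipnn_product_signal(embs):
    """IPNN pairwise signal: inner products ⟨f_i, f_j⟩ for every field pair i < j."""
    # embs: (B, F, k) field embeddings
    sims = torch.bmm(embs, embs.transpose(1, 2))                   # (B, F, F) pairwise inner products
    i, j = torch.triu_indices(embs.size(1), embs.size(1), offset=1)
    return sims[:, i, j]                                           # (B, F·(F−1)/2) upper-triangular pairs

# PNN concatenates this product signal l_p with the linear signal l_z (the flattened embeddings)
# and feeds the result into the MLP.
```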
Why it matters:
- Makes feature interactions explicit before the MLP, so the MLP doesn’t need to learn them from scratch.
- The product layer is analogous to FM’s pairwise interactions, but feeds into a deep network for higher-order learning.
- Bridges FM and deep learning — you can think of PNN as “FM output as features for an MLP.”
Relationship to DeepFM: Both add explicit 2nd-order interactions. The difference is that DeepFM uses a parallel FM and MLP with shared embeddings, while PNN uses a sequential product layer → MLP pipeline. PNN’s product signals are processed further by the deep network, potentially learning more complex patterns on top of the interactions.
FNN — Factorization-Machine Supported Neural Networks (Zhang et al., 2016)
Branch: Use pre-trained FM embeddings to initialize a deep network.
The problem it solves: Training deep networks on sparse categorical features is hard — embeddings start random and take many epochs to converge. Can we get a head start?
How it works:
- Pre-train an FM on the CTR prediction task. This produces learned latent vectors (embeddings) for each feature.
- Initialize a deep neural network’s embedding layer with the pre-trained FM embeddings.
- Fine-tune the entire network end-to-end.
Architecture: After initialization, it’s a standard embedding → concatenation → MLP → sigmoid pipeline. The innovation is purely in the initialization strategy.
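A sketch of the warm-start step itself, assuming the FM's latent vectors have already been trained and exported as a (vocab_size, k) tensor.

```python
import torch
import torch.nn as nn

def init_embeddings_from_fm(dnn_embedding: nn.Embedding, fm_latent_vectors: torch.Tensor):
    """FNN-style warm start: copy pre-trained FM latent vectors into the DNN embedding table."""
    # fm_latent_vectors: (vocab_size, k), one latent vector per feature value from the trained FM
    with torch.no_grad():
        dnn_embedding.weight.copy_(fm_latent_vectors)
    # The whole network, including these embeddings, is then fine-tuned end-to-end;
    # the FM itself is not updated further.
```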
Why it matters:
- Showed that transfer learning from FM to DNN significantly speeds up convergence and improves final performance.
- Addressed a practical problem — in production, training deep models from scratch on sparse data is unstable and slow.
Limitations:
- Two-stage training — the FM and DNN are not jointly optimized (the FM is pre-trained, then used only to initialize the DNN's embeddings). Information lost in FM training can't be recovered.
- DeepFM later solved this by training FM and DNN jointly end-to-end with shared embeddings, making FNN somewhat obsolete.
FNN is a good example of how pre-training/transfer learning ideas (now ubiquitous with LLMs) appeared early in the recommendation space.
AFM — Attentional Factorization Machines (Xiao et al., 2017)
Branch: Add attention to the FM interaction layer (an improvement to the FM/wide part).
The problem: FM computes ALL pairwise feature interactions with equal weight: Σᵢ Σⱼ>ᵢ ⟨vᵢ, vⱼ⟩. But not all interactions are equally useful — “user_age × item_price” might be very predictive, while “user_city × item_color” might be noise. FM wastes capacity on useless interactions.
Key idea: Learn an attention weight for each pairwise interaction, so the model can focus on informative crosses and suppress noisy ones.
Architecture:
- Compute all pairwise interactions: eᵢⱼ = vᵢ ⊙ vⱼ (element-wise product, not inner product — richer representation).
- Attention network: a small MLP that scores each interaction: αᵢⱼ = softmax(hᵀ · ReLU(W · eᵢⱼ + b)).
- Weighted sum: Σᵢⱼ αᵢⱼ · eᵢⱼ → compressed interaction representation.
- Final prediction from this weighted representation + linear terms.
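A sketch of the attention-weighted interaction pooling defined above; the attention dimension and output head are placeholders.

```python
import torch
import torch.nn as nn

class AFMInteraction(nn.Module):
    """AFM: softmax attention over all pairwise element-wise products v_i ⊙ v_j."""
    def __init__(self, emb_dim=16, att_dim=16):
        super().__init__()
        self.att = nn.Linear(emb_dim, att_dim)
        self.h = nn.Linear(att_dim, 1, bias=False)
        self.out = nn.Linear(emb_dim, 1, bias=False)

    def forward(self, embs):                                       # embs: (B, F, k) field embeddings
        i, j = torch.triu_indices(embs.size(1), embs.size(1), offset=1)
        e_ij = embs[:, i, :] * embs[:, j, :]                       # (B, P, k) all pairwise products
        alpha = torch.softmax(self.h(torch.relu(self.att(e_ij))), dim=1)   # (B, P, 1) attention weights
        return self.out((alpha * e_ij).sum(dim=1))                 # weighted pooling → interaction term
```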
Why it matters:
- Interpretable: The attention weights tell you which feature interactions matter most — useful for debugging and understanding model behavior.
- Improved FM performance significantly, especially on noisy feature sets where many crosses are irrelevant.
- The idea of “attending to feature interactions” influenced later architectures.
Limitations: Still only 2nd-order interactions (like FM). No deep component for higher-order learning. In practice, DeepFM or DCN-v2 offer better performance because they combine explicit interactions with deep learning.
NFM — Neural Factorization Machines (He & Chua, 2017)
Branch: Replace the Deep Part with a Bi-Interaction + MLP.
Key idea: FM’s pairwise interactions are powerful but shallow. What if we feed FM’s interaction output into a deep network to learn higher-order patterns on top of 2nd-order interactions?
Architecture:
- Embedding layer: Same as FM.
- Bi-Interaction pooling layer: Compute the sum of all pairwise element-wise products: f_BI = Σᵢ Σⱼ>ᵢ (vᵢ ⊙ vⱼ). This produces a single k-dimensional vector that encodes all 2nd-order interactions. Crucially, this can be computed in O(nk) time using the identity: Σᵢ Σⱼ>ᵢ (vᵢ ⊙ vⱼ) = ½[(Σvᵢ)² − Σ(vᵢ²)].
- Deep layers: MLP stack on top of the bi-interaction vector.
- Final prediction: y = w₀ + Σwᵢxᵢ + MLPₗ(f_BI).
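The O(nk) bi-interaction pooling as a one-function sketch; it produces the k-dimensional vector that the MLP stack consumes.

```python
import torch

def bi_interaction_pooling(embs):
    """NFM bi-interaction: Σᵢ Σⱼ>ᵢ (vᵢ ⊙ vⱼ) computed in O(F·k) via ½[(Σv)² − Σv²]."""
    # embs: (B, F, k) field embeddings (already scaled by feature values if not binary)
    sum_sq = embs.sum(dim=1).pow(2)            # (Σ vᵢ)², element-wise square
    sq_sum = embs.pow(2).sum(dim=1)            # Σ vᵢ², element-wise
    return 0.5 * (sum_sq - sq_sum)             # (B, k) vector fed to the deep layers
```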
How it relates to FM: If you remove the MLP layers entirely, NFM reduces to FM. The MLP adds the ability to learn higher-order interactions on top of 2nd-order interactions — which FM alone cannot do.
Comparison to DeepFM:
- DeepFM: FM and MLP are parallel paths with separate roles (FM for 2nd-order, MLP for higher-order, from raw embeddings).
- NFM: FM and MLP are sequential — FM’s interaction output is the MLP’s input. The MLP explicitly builds on top of FM’s 2nd-order interactions.
- In practice, NFM often works better with fewer MLP layers because the bi-interaction layer already provides a rich starting representation.