BERT4Rec is a sequential recommendation model that leverages the bidirectional Transformer architecture, originally designed for language tasks, to capture users’ evolving preferences by jointly considering both past and future items in a sequence (arxiv.org, github.com). Unlike earlier unidirectional models that predict the next item only from previous ones, BERT4Rec uses a Cloze-style masking objective to predict missing items anywhere in the sequence, enabling richer context modeling (arxiv.org, github.com). Empirical evaluations on multiple benchmark datasets demonstrate that BERT4Rec often surpasses state-of-the-art sequential models like SASRec, though its performance can depend on careful training schedules and hyperparameter choices (arxiv.org, arxiv.org). This post traces the journey from early recommendation methods to the Transformer revolution and the rise of BERT, explains the core ideas behind BERT4Rec, connects them to cognitive analogies of Cloze tests, and discusses experiments, limitations, and future directions. By understanding BERT4Rec’s design and its place in the broader landscape of recommendation, readers can appreciate both its technical elegance and its conceptual roots in language modeling and human learning.

Introduction: A Learning Journey

I still remember the first time I tried to build a recommendation system. It was during my undergraduate years, and I wanted to create a small app that suggested books to my friends based on what they had read before. At that time, I naively believed that simply counting co-occurrences of books would be enough. I soon realized that user preferences change over time, and static co-occurrence matrices felt too rigid. That curiosity led me to explore sequential recommendation—models that treat a user’s history as an evolving narrative rather than a single static snapshot.

Fast forward a few years, and I found myself diving into deep learning approaches for recommendation during my PhD. Each step felt like peeling another layer of understanding: starting with simple Markov chains, moving to recurrent neural networks, then witnessing the Transformer revolution in natural language processing (NLP) with papers like “Attention Is All You Need” (arxiv.org, papers.nips.cc) and “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” (arxiv.org, aclanthology.org). In language tasks, these models treated sentences as dynamic sequences of words; in recommendation, sequences of items could be handled similarly.

Just as the alphabet and grammar form the foundation of language, the sequence of user interactions—clicks, views, purchases—forms the grammar of recommendation. When I first encountered BERT4Rec, I saw a bridge between these worlds: a model designed for language Cloze tasks, applied to sequences of items. In this post, I want to share that journey—why the shift from unidirectional to bidirectional models matters, how the Cloze objective parallels human tests, the design choices behind BERT4Rec, and what we can learn both technically and conceptually from it. My hope is that, by the end, you’ll see BERT4Rec not just as another state-of-the-art model, but as part of a broader narrative connecting human cognition, language, and recommendation.

Background: From Static to Sequential Recommendation

The Early Days: Collaborative Filtering and Beyond

Recommender systems began with collaborative filtering approaches that treat users and items symmetrically, often using matrix factorization to uncover latent factors (link.springer.com). These methods assume static preferences: a user has fixed tastes, and items have fixed attributes. For example, if Alice liked “The Hobbit” and “The Lord of the Rings,” a static model would continue recommending similar fantasy books without considering that she might have grown more interested in science fiction recently.

Psychologically, this is akin to assuming that a person’s personality never changes—an oversimplification. In reality, tastes evolve. Just as our moods and interests shift from week to week, user interactions in an online setting reflect changing preferences. Recognizing this, researchers started looking at temporal dynamics: assigning more weight to recent interactions (link.springer.com, arxiv.org). However, these adjustments were often heuristic rather than deeply integrated into the model’s structure.

Sequential Recommendation: Capturing the Flow

To better model evolving preferences, sequential recommendation treats a user’s history as an ordered list of events. Two main families of approaches emerged:

  • Markov Chain-based Models: These assume that the next action depends on a limited window of previous actions, often just the last one or two (cseweb.ucsd.edu). While simple and effective in sparse settings, they struggle to capture longer-term patterns. It’s like predicting the next word in a sentence by looking only at the immediately preceding word—sometimes okay, but often missing broader context.

  • Recurrent Neural Networks (RNNs): With the rise of deep learning, RNNs (e.g., GRU4Rec) became popular for sequential recommendation tasks. They process one item at a time, updating a hidden state that summarizes the history (link.springer.com, arxiv.org). While theoretically capable of capturing long-range dependencies, RNNs can suffer from vanishing gradients and can be slow to train, especially when sequences get long.

These methods moved beyond static views of users, but they still relied on unidirectional modeling: Markov chains look backward over a fixed number of steps, and RNNs process sequences from left (oldest) to right (newest). In human terms, it’s like reading a story only forward—never knowing how the ending influences the interpretation of earlier chapters.

Self-Attention and SASRec: A Step Towards Flexible Context

In August 2018, Kang and McAuley introduced SASRec, a self-attentive sequential model that borrowed ideas from the Transformer’s self-attention mechanism to balance long-term and short-term context (arxiv.org, cseweb.ucsd.edu). Instead of processing item sequences strictly left-to-right, SASRec computes attention weights over all previous items at each step, allowing the model to focus on the most relevant past actions when predicting the next one (arxiv.org, arxiv.org). Mechanically, it applies multi-head self-attention layers over item embeddings, followed by pointwise feed-forward layers, similar to each encoder block in the original Transformer (arxiv.org, export.arxiv.org).

SASRec offered two main advantages:

  1. Efficiency: By parallelizing self-attention computations across positions, SASRec can be trained faster than RNN-based models on modern hardware.
  2. Adaptive Context: Attention weights allow the model to decide which past items matter most, rather than forcing it to use a fixed window or hidden state sequence.

However, SASRec remains unidirectional in its attention: at each time step, it only attends to items that come before that position. This means it still cannot consider potential “future” items, even if they would be known at test time when scoring multiple candidate items. In language terms, it’s like understanding a sentence by reading it left to right—never knowing what words come later in the sentence.

The Transformer Revolution: Background and Impact

The Birth of the Transformer (Vaswani et al., 2017)

In June 2017, Vaswani et al. published “Attention Is All You Need,” a paper that fundamentally changed NLP and sequence modeling (arxiv.org, papers.nips.cc). They introduced the Transformer, which replaced recurrence with multi-head self-attention and simple feed-forward networks. The key insights were:

  • Self-Attention Layers: These compute weighted sums of all positions’ embeddings for each position, allowing direct modeling of pairwise dependencies regardless of distance.
  • Positional Encoding: Since attention layers by themselves lack inherent order, they added sinusoidal positional encodings to inject sequence information.
  • Parallelization: Unlike RNNs, Transformers can process all positions in parallel, making training significantly faster on GPUs.
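
To make the positional-encoding point above concrete, here is a minimal NumPy sketch of the sinusoidal encoding from the original paper. The function name and shapes are my own illustrative choices, not code from Vaswani et al.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                   # (max_len, d_model)
    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])        # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])        # odd dimensions use cosine
    return encoding

# Example: encodings for 200 positions with 64-dimensional embeddings.
pe = sinusoidal_positional_encoding(200, 64)
```

Each row is added to the token embedding at that position, giving the otherwise order-blind attention layers a signal about where each token sits in the sequence.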

By discarding recurrence and convolutions, the Transformer demonstrated state-of-the-art performance on machine translation tasks, achieving BLEU scores surpassing previous best models on WMT English-German and English-French benchmarks (arxiv.org, scispace.com). This architecture quickly became the de facto backbone for a wide range of NLP tasks, from translation to summarization to question answering (huggingface.co, arxiv.org).

Analogy: Before Transformers, sequence models were like single-gear vehicles, forced to move through a sequence one step at a time (recurrence) or within a fixed local window (convolutions). Transformers were like multi-gear vehicles that could shift seamlessly, giving models flexibility to access information anywhere in the sequence, much like looking up any chapter in a book instantly rather than reading every page sequentially.

BERT: Deep Bidirectional Language Representation (Devlin et al., 2018)

Building on the Transformer’s encoder, Devlin et al. introduced BERT (Bidirectional Encoder Representations from Transformers) in October 2018 (arxiv.org, aclanthology.org). BERT’s main contributions were:

  • Bidirectional Context: By jointly attending to both left and right context in all layers (rather than only attending to previous tokens), BERT can learn richer representations.
  • Masked Language Modeling (MLM): To enable bidirectionality, they used a Cloze-like task: randomly mask some tokens in the input and train the model to predict them based on surrounding context.
  • Next Sentence Prediction (NSP): As a secondary task, BERT predicts whether two sentences follow each other, helping capture inter-sentence relationships.

BERT was pre-trained on massive corpora (BooksCorpus and English Wikipedia), achieving state-of-the-art results across a variety of NLP benchmarks, such as GLUE, SQuAD, and others (arxiv.org, export.arxiv.org). Its bidirectional design unlocked new capabilities: while unidirectional language models (e.g., OpenAI GPT) process text left-to-right, BERT’s MLM allowed it to encode context from both sides, akin to reading a sentence and filling in missing words anywhere in it.

Analogy: Imagine reading a paragraph with some words hidden and having to guess them using the rest of the paragraph. This Cloze-style task is exactly how BERT learns. In human tests, teachers often use fill-in-the-blank exercises to gauge comprehension—similarly, BERT’s MLM forces the model to deeply understand context.

The impact of BERT extended beyond NLP. Researchers began to ask: if bidirectional Transformers can learn from masked words in a sentence, could a similar idea work for sequences of user interactions? Enter BERT4Rec.

BERT4Rec: Core Ideas and Design

Motivation: Why Bidirectional Modeling Matters

In sequential recommendation, we often care about predicting the next item given past history. Unidirectional models like SASRec attend only to prior items when making a prediction (arxiv.org, cseweb.ucsd.edu). However, at evaluation or inference time, we typically score multiple candidate items to find the most likely next item. Those candidates can be seen as “future” items once we inject them into the sequence. If the model can attend to both past items and the candidate item itself (as if it were masked during training), it can form a richer representation that uses information from the full sequence context.

BERT4Rec reframes sequential recommendation as a Cloze task: randomly mask items in the user’s history and train the model to predict them based on both left and right context, which may include items that occur after them in the sequence (arxiv.org, github.com). This bidirectional conditioning helps the model learn how items co-occur in different parts of the sequence, not just in a strict left-to-right chain.

Analogy: In a detective novel, clues about who committed the crime may appear early and later in the story. A unidirectional reader would only use clues from the beginning up to the current chapter. A bidirectional reader, knowing the ending, can reinterpret earlier clues in light of later revelations. Similarly, BERT4Rec’s bidirectional attention allows the model to reinterpret earlier interactions when considering missing items.

Architecture Overview

At a high level, BERT4Rec follows the encoder architecture from the original Transformer with two major changes:

  1. Cloze-style Masking: A certain percentage of items in a user’s sequence are randomly masked (replaced with a special [MASK] token). The model’s task is to predict the identity of each masked item using bidirectional attention over the unmasked items (arxiv.org, researchgate.net).
  2. Item Embeddings with Positional Encodings: Each item in the sequence is mapped to a learned embedding. Since the Transformer has no inherent sense of order, sinusoidal or learned positional encodings are added to each item embedding to encode its position in the sequence (arxiv.org, ar5iv.labs.arxiv.org).

Concretely:

  • Input: A user history of length n (e.g., [i₁, i₂, …, iₙ]). We randomly choose a subset of positions (usually 15%) and replace them with [MASK] tokens. For example, if the original sequence is [A, B, C, D, E] and positions 2 and 4 are masked, the input becomes [A, [MASK], C, [MASK], E].
  • Embedding Layer: Each position t has an embedding E_item(iₜ) (for item iₜ) plus a positional embedding E_pos(t). So, the initial input to the Transformer is the sum E_item + E_pos for each position, with masked positions using a special mask embedding.
  • Transformer Encoder Stack: Typically 2 to 4 layers (depending on hyperparameters) of multi-head self-attention and feed-forward layers. Since we want bidirectional context, the self-attention is “full” (not masked), allowing each position to attend to all other positions in the sequence.
  • Output Heads: For each masked position, the final hidden state vector is passed through a linear projection followed by a softmax over the item vocabulary to predict which item was masked.
  • Loss Function: Cross-entropy loss is computed only over the masked positions, summing (or averaging) across them.

During inference, to predict the next item, one can append a [MASK] token to the end of a user’s sequence and feed it through the model. The model’s output distribution at that position indicates the probabilities of all possible items being the next interaction.

Technical Note: Because BERT4Rec conditions on bidirectional context, it avoids what is known as “exposure bias” often found in left-to-right models, where during training the model sees only ground-truth history, but during inference it must rely on its own predictions. BERT4Rec’s Cloze objective alleviates this by mixing masked ground truth with unmasked items, making the model robust to masked or unknown future items.
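
To ground this overview, here is a minimal PyTorch sketch of a BERT4Rec-style model: item and positional embeddings are summed, passed through a stack of fully (bidirectionally) attending encoder layers, and projected onto the item vocabulary. The class name, the use of `nn.TransformerEncoder`, and all hyperparameter defaults are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class BERT4RecSketch(nn.Module):
    """Minimal BERT4Rec-style encoder (a sketch, not the official implementation)."""

    def __init__(self, num_items: int, max_len: int = 200, d_model: int = 64,
                 n_layers: int = 2, n_heads: int = 2, dropout: float = 0.2):
        super().__init__()
        # Reserve id 0 for [PAD] and num_items + 1 for [MASK].
        self.pad_id = 0
        self.mask_id = num_items + 1
        self.item_emb = nn.Embedding(num_items + 2, d_model, padding_idx=self.pad_id)
        self.pos_emb = nn.Embedding(max_len, d_model)        # learned positional embeddings
        self.dropout = nn.Dropout(dropout)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           dropout=dropout, activation="gelu",
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, num_items + 2)         # scores over the item vocabulary

    def forward(self, seqs: torch.Tensor) -> torch.Tensor:
        # seqs: (batch, max_len) of item ids, with pad_id = [PAD] and mask_id = [MASK].
        positions = torch.arange(seqs.size(1), device=seqs.device).unsqueeze(0)
        x = self.dropout(self.item_emb(seqs) + self.pos_emb(positions))
        padding_mask = seqs.eq(self.pad_id)                  # ignore [PAD] positions
        # No causal mask: every position attends to every non-pad position (bidirectional).
        h = self.encoder(x, src_key_padding_mask=padding_mask)
        return self.out(h)                                   # (batch, max_len, |V| + 2)
```

Note that no causal mask is passed to the encoder; only [PAD] positions are excluded via the key-padding mask, which is exactly what makes the attention bidirectional.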

Training as a Cloze Task: Deeper Explanation

The term Cloze comes from psycholinguistics and educational testing: learners fill in missing words in a text passage (arxiv.org, kdnuggets.com). This is not a new idea: BERT borrowed it directly from the Cloze tests educators have long used to measure student comprehension (kdnuggets.com). In the context of recommendation:

  • Masked Item Prediction (MIP): Analogous to masked language modeling (MLM) in BERT, BERT4Rec’s MIP randomly selects a subset of positions in a user’s interaction sequence, hides each item, and asks the model to fill it in based on both past and future interactions.
  • Sampling Strategy: Typically, 15% of items are chosen for masking. Of those, 80% are replaced with [MASK], 10% with a random item (to encourage robustness), and 10% are left unchanged but still counted in the loss as if they were masked (to mitigate training/test mismatch) (arxiv.org, github.com).
  • Advantages: By predicting items anywhere in the sequence, the model learns co-occurrence patterns in all contexts, not just predicting the next item. This generates more training samples per sequence (since each masked position is a training example), potentially improving data efficiency (arxiv.org, arxiv.org).

Analogy: When learning a language, filling in blank words anywhere in a paragraph helps both reading comprehension and vocabulary acquisition. Similarly, by practicing predicting missing items anywhere in their history, the model builds a more flexible representation of user preferences.

Comparison with Unidirectional Models (e.g., SASRec)

  • Context Scope

    • Unidirectional (SASRec): At position t, the model attends only to items 1 through t–1.
    • Bidirectional (BERT4Rec): At each masked position t, the model attends to every position in the sequence; other masked positions are visible only as [MASK] tokens, so their item identities stay hidden. When predicting the next item (by placing a [MASK] at position n+1), the model attends to items 1 through n.
  • Training Objective

    • Unidirectional: Usually uses next-item prediction with cross-entropy loss at each time step.
    • Bidirectional: Uses Cloze objective, predicting multiple masked positions per sequence.
  • Data Efficiency

    • Unidirectional: Generates one training sample per time step (predict next item).
    • Bidirectional: Generates one training sample per masked position (typically ~15% of the sequence length); because a fresh random mask is drawn each epoch, the Cloze objective yields many more distinct (context, target) training pairs over the course of training.
  • Inference

    • Unidirectional: Directly predicts the next item based on history.
    • Bidirectional: Appends a [MASK] to the end to predict next item, or can mask any position for in-sequence imputation.

Several empirical studies have shown that BERT4Rec often outperforms SASRec, especially when long-range dependencies are important (arxiv.org, arxiv.org). However, this performance advantage can require longer training times and careful hyperparameter tuning, as later work has pointed out (arxiv.org, arxiv.org).
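
To make the context-scope difference concrete, the snippet below (my own illustration, using PyTorch's additive attention-mask convention in which −inf blocks attention) contrasts the causal mask a SASRec-style model applies with the unconstrained attention a BERT4Rec-style model uses.

```python
import torch

seq_len = 5

# Unidirectional (SASRec-style): position t may only attend to positions <= t.
# Entries set to -inf are blocked under PyTorch's additive attention-mask convention.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# Bidirectional (BERT4Rec-style): no structural mask at all; every position
# attends to every other position (padding is handled separately).
full_mask = torch.zeros(seq_len, seq_len)

print(causal_mask)
print(full_mask)
```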

Drawing Analogies: Cloze Tests, Human Learning, and Recommendation

The Psychology of Masked Tests

Cloze tests, introduced by W. L. Taylor in 1953, are exercises where learners fill in blanks in a passage of text, gauging language comprehension and vocabulary knowledge (kdnuggets.com). Educational psychologists have found that Cloze tasks encourage active recall and semantic inference, as learners must use both local and global context to guess missing words correctly. Similarly, BERT’s MLM and BERT4Rec’s MIP require the model to infer missing tokens (words or items) from all available context, reinforcing rich contextual understanding.

In human terms:

  • Local Context: To guess a masked word in a sentence, you use nearby words.
  • Global Context: Often, clues spread across the paragraph or entire document guide you toward the right answer.

BERT4Rec’s masked items play the role of blank spaces in a text. The model, like a student in a Cloze test, must use all known interactions (both before and after the blank) to infer the missing preference. This leads to representations that capture not only pairwise item relationships but also how items co-occur across entire sessions.

Historical Perspective: From Prediction to Comprehension

Early recommendation models focused on prediction: given past clicks, what happens next? This is analogous to a fill-in-the-blank exercise where only the next word is blank. In mathematics, this is like knowing all terms of a sequence except the next one and trying to guess it from a recurrence relation. But modern language teaching emphasizes comprehension, teaching students to understand entire texts, not just predict the next word. BERT4Rec embodies that shift: from predicting sequentially to understanding a user’s entire session.

Consider reading Hamlet: if you only focus on predicting the next line, you might miss the broader themes. If you think about themes and motifs across the play, you get a richer understanding. BERT4Rec, by predicting masked items anywhere, learns themes and motifs in interaction sequences as well.

Real-World Analogy: Playlist Shuffling

Imagine you’re curating a playlist of songs you’ll listen to on a road trip. Instead of putting them in a fixed order (e.g., chronological from your latest favorites), you shuffle them but still want the transitions to feel coherent. A unidirectional model ensures each song transitions well from the previous one, like ensuring each next word makes sense after the last. A bidirectional approach would allow you to also consider the song that comes after when choosing a song for a particular slot, creating smooth transitions both forward and backward. In BERT4Rec, a masked item is like an empty slot in the playlist, and the model learns which item fits best given both of its neighbors.

Technical Deep Dive: BERT4Rec’s Mechanics

Input Representation

Given a user’s historical sequence of item interactions $i_1, i_2, \dots, i_n$, BERT4Rec prepares inputs as follows (arxiv.org, researchgate.net):

  1. Masking Strategy

    • Randomly select 15% of positions for masking.

    • Of those positions:

      • 80% are replaced with [MASK].
      • 10% are replaced with a random item ID from the vocabulary (to encourage robustness).
      • 10% remain unchanged (but are still counted in the loss). This strategy mirrors BERT’s design to prevent the model from relying too heavily on the [MASK] token (arxiv.org, export.arxiv.org).
  2. Item Embeddings

    • Each item ID has a learned embedding vector of dimension d.
    • A special embedding E_mask is used for [MASK] tokens.
  3. Positional Embeddings

    • Since the Transformer has no notion of sequence order, add a learned positional embedding E_pos(t) for each position t ∈ {1,…,n}.
    • The sum E_item(iₜ) + E_pos(t) forms the input embedding at position t.
  4. Sequence Length and Padding

    • For computational efficiency, fix a maximum sequence length L (e.g., 200).
    • If a user’s history has fewer than L interactions, pad the sequence with [PAD] tokens at the front or back.
    • [PAD] tokens have embeddings but are ignored in attention computations (i.e., their attention weights are set to zero).
  5. Embedding Dropout

    • Optional dropout can be applied to the sum of item and positional embeddings to regularize training.

Mathematically, let

$$ x_t = E_{\text{item}}(i_t) + E_{\text{pos}}(t), \quad t = 1,\dots,n. $$

Masked positions use

$$ x_t = E_{\text{mask}} + E_{\text{pos}}(t). $$
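
A minimal sketch of the masking step described above, assuming integer item ids with 0 reserved for [PAD]; the 15% rate and the 80/10/10 split follow the paper, while the function name and signature are illustrative.

```python
import random

def cloze_mask(sequence, num_items, mask_id, mask_prob=0.15, pad_id=0):
    """Return (masked_sequence, labels); labels are pad_id at unmasked positions."""
    masked, labels = [], []
    for item in sequence:
        if item != pad_id and random.random() < mask_prob:
            labels.append(item)                              # this position enters the loss
            r = random.random()
            if r < 0.8:
                masked.append(mask_id)                       # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.randint(1, num_items))  # 10%: replace with a random item
            else:
                masked.append(item)                          # 10%: keep the original item
        else:
            labels.append(pad_id)                            # ignored by the loss
            masked.append(item)
    return masked, labels

# Example: mask a toy history of five items drawn from a 100-item vocabulary.
seq = [12, 7, 55, 3, 88]
masked_seq, labels = cloze_mask(seq, num_items=100, mask_id=101)
```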

Transformer Encoder Stack

BERT4Rec typically uses a stack of N encoder layers (e.g., N = 2 or 3 for smaller datasets, up to N = 6 for larger ones), each consisting of:

  1. Multi-Head Self-Attention

    • For layer l, each position t has queries, keys, and values computed as linear projections of the input from the previous layer.

    • Attention weights are computed as scaled dot products between queries and keys, followed by softmax.

    • Weighted sums of values produce the attention output for each head.

    • The outputs of all heads are concatenated and linearly projected back to dimension d.

    • Residual connection and layer normalization are applied:

      $$ \text{SA}_l(X) = \text{LayerNorm}(X + \text{MultiHeadAttn}(X)). $$

  2. Position-Wise Feed-Forward Network

    • A two-layer feed-forward network with a GELU or ReLU activation:

      $$ \text{FFN}_l(Y) = \text{LayerNorm}\bigl(Y + W_2\,\phi(W_1 Y + b_1) + b_2\bigr), $$

      where $\phi$ is an activation (often GELU).

  3. LayerNorm and Residual Connections

    • As in the original Transformer, each sub-layer has a residual (skip) connection followed by layer normalization, ensuring stable training and gradient flow (arxiv.org, scispace.com).

Because the self-attention is full (no masking of future positions), each position’s representation at each layer can incorporate information from any other unmasked position in the sequence.
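
Expressed in code, one such encoder layer might look like the following hedged PyTorch sketch, which mirrors the SA and FFN equations above (post-norm residual blocks, GELU activation); the dimensions and head count are placeholders.

```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """One BERT4Rec-style encoder layer: SA_l and FFN_l with residuals + LayerNorm."""

    def __init__(self, d_model: int = 64, n_heads: int = 2, dropout: float = 0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, key_padding_mask=None) -> torch.Tensor:
        # SA_l(X) = LayerNorm(X + MultiHeadAttn(X)); attention is full (no causal mask).
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + attn_out)
        # FFN_l(Y) = LayerNorm(Y + W2 * gelu(W1 Y + b1) + b2)
        return self.norm2(x + self.ffn(x))
```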

Output and Loss Computation

After N encoder layers, we obtain final hidden representations $h_1, h_2, \dots, h_n$, each $h_t \in \mathbb{R}^{d}$. For each position t that was masked during input preparation, we compute:

  1. Item Prediction Scores

    $$ s_t = W_{\text{output}}\, h_t + b_{\text{output}}, \quad s_t \in \mathbb{R}^{|V|}, $$

    where $|V|$ is the size of the item vocabulary, and $W_{\text{output}} \in \mathbb{R}^{|V| \times d}$.

  2. Softmax and Cross-Entropy Loss

    • Apply softmax to $s_t$ to get the predicted probability distribution $\hat{y}_t$.

    • If the true item ID at position t is $i_t^*$, the cross-entropy loss for that position is:

      $$ \mathcal{L}_t = -\log\bigl(\hat{y}_{t}[\, i_t^* \,]\bigr). $$

    • Aggregate the loss across all masked positions in the batch, typically averaging over them:

      $$ \mathcal{L} = \frac{1}{\sum_t m_t} \sum_{t=1}^{n} m_t \, \mathcal{L}_t, $$

      where $m_t = 1$ if position t was masked, else 0.

Because multiple positions are masked per sequence, each training example yields several prediction targets, improving data efficiency.
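
Assuming labels are stored as the true item id at masked positions and a reserved pad id elsewhere (as in the masking sketch earlier), the masked cross-entropy can be written compactly with `ignore_index`; this is a sketch consistent with the formulas above, not the reference code.

```python
import torch
import torch.nn.functional as F

def masked_cloze_loss(logits: torch.Tensor, labels: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Cross-entropy averaged over masked positions only.

    logits: (batch, seq_len, vocab_size) scores s_t from the output head.
    labels: (batch, seq_len) with the true item id at masked positions and pad_id elsewhere.
    """
    vocab_size = logits.size(-1)
    return F.cross_entropy(logits.reshape(-1, vocab_size),   # flatten all positions
                           labels.reshape(-1),
                           ignore_index=pad_id)              # unmasked positions drop out of the loss
```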

Inference: Predicting the Next Item

To recommend the next item for a user:

  1. Extend the Sequence

    • Given the user’s last n interactions, append a [MASK] token at position n+1 (if n+1 ≤ L). If n = L, one could remove the oldest item or use sliding window techniques.
  2. Feed Through Model

    • The [MASK] at position n+1 participates in bidirectional attention, attending to all positions 1 through n. Conversely, positions 1 through n attend to the [MASK] if full self‐attention is used.
  3. Obtain Scores

    • Compute $s_{n+1} ∈ ℝ^{|V|}$ from the final hidden state $h_{n+1}$.
    • The highest-scoring items in $s_{n+1}$ are the top-K recommendations.

Because BERT4Rec’s training objective was to predict masked items given both left and right context, placing the [MASK] at the end simulates one masked position with only left context. While strictly speaking this isn’t bidirectional (the [MASK] at the end has no right context), it still benefits from richer item co-occurrence patterns learned during training. Empirically, this approach yields strong next-item recommendation accuracy.
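
A sketch of this inference procedure, reusing the hypothetical `BERT4RecSketch` model from the architecture section: truncate or left-pad the history, append [MASK] at the final position, and read off the top-K items there.

```python
import torch

@torch.no_grad()
def recommend_next(model, history, max_len=200, k=10):
    """Top-k next-item recommendation by appending a [MASK] token to the history."""
    model.eval()
    seq = history[-(max_len - 1):] + [model.mask_id]     # keep the most recent items + [MASK]
    seq = [model.pad_id] * (max_len - len(seq)) + seq    # left-pad to max_len
    seq = torch.tensor(seq).unsqueeze(0)                 # (1, max_len)
    logits = model(seq)[0, -1]                           # scores at the [MASK] position
    logits[model.pad_id] = float("-inf")                 # never recommend the special tokens
    logits[model.mask_id] = float("-inf")
    return torch.topk(logits, k).indices.tolist()

# Usage: top_items = recommend_next(model, history=[12, 7, 55, 3, 88], k=10)
```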

Experimental Results and Analysis

Datasets and Evaluation Protocols

In the original BERT4Rec paper, Sun et al. evaluated the model on four public benchmark datasets:

  1. MovieLens-1M (ML-1M): 1 million ratings from ~6000 users on ~3900 movies.
  2. MovieLens-20M (ML-20M): 20 million ratings from ~138,000 users on ~27,000 movies.
  3. Steam: Game purchase and play logs from the Steam platform.
  4. Amazon Beauty: Reviews and ratings in the beauty product category from the Amazon Reviews dataset.

For each user, interactions were chronologically ordered. The last interaction was used as the test item, the second last as validation, and earlier interactions for training. Performance metrics included Hit Rate (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K) at various cut-offs (e.g., K = 5, 10) (arxiv.org, arxiv.org).

Baselines Compared

Sun et al. compared BERT4Rec against several state-of-the-art sequential recommendation methods:

  • GRU4Rec: RNN (GRU) based model with pairwise ranking loss.
  • Causal Convolutional (CausalConv): Convolutional neural network model for sequences.
  • SASRec: Self-attention based unidirectional model.
  • Caser: Convolutional sequence embedding model (vertical + horizontal convolution).
  • NextItNet: Dilated residual network for sequential recommendation.

Key Findings

  1. BERT4Rec vs. SASRec

    • Across ML-1M and ML-20M, BERT4Rec improved HR@10 by ≈2–3% and NDCG@10 by ≈1–2% relative to SASRec (arxiv.org, arxiv.org).
    • On sparser datasets like Steam, the advantage increased, indicating that bidirectional context can better handle data sparsity by leveraging co-occurrence patterns across entire sessions.
  2. Model Depth and Hidden Size

    • Deeper (more layers) or wider (larger d) BERT4Rec variants performed better on large datasets but risked overfitting on smaller ones.
    • Typical configurations: 2 layers, hidden size 64 for ML-1M; 3–4 layers for larger datasets.
  3. Masking Ratio

    • Masking ~15% of items per sequence yielded a good trade-off. Masking too many positions reduced signal per position; masking too few yielded fewer training samples.
  4. Training Time

    • BERT4Rec required more compute than SASRec due to larger parameter counts and Cloze objective.
    • Subsequent research (Petrov & Macdonald, 2022) noted that default training schedules in the original implementations were too short to fully converge on some datasets; when trained longer, BERT4Rec’s performance became more consistent (arxiv.org, arxiv.org).

Replicability and Training Considerations

Petrov and Macdonald (2022) conducted a systematic review and replicability study of BERT4Rec, finding:

  • Training Time Sensitivity: Default hyperparameters often led to under-trained models. Training 10–30× longer was sometimes necessary to reproduce reported results (arxiv.org, arxiv.org).
  • Batch Size and Learning Rates: Smaller batch sizes with warm-up steps and linear decay of learning rates yielded more stable convergence.
  • Alternative Architectures: Implementations using Hugging Face’s Transformers library, incorporating variants like DeBERTa’s disentangled attention, matched or exceeded original results with significantly less training time (arxiv.org, arxiv.org).
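
As an illustration of the warm-up-plus-linear-decay schedule mentioned above, here is a hedged sketch built on PyTorch's `LambdaLR`; the step counts and learning rate are placeholders, not values from the replicability study.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def warmup_linear_decay(optimizer, warmup_steps: int, total_steps: int) -> LambdaLR:
    """Ramp the learning rate up linearly for warmup_steps, then decay it linearly to zero."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)

# Example (hypothetical model and step counts):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# scheduler = warmup_linear_decay(optimizer, warmup_steps=1000, total_steps=100_000)
# call scheduler.step() once per optimization step
```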

Another study by Petrov & Macdonald (2023) introduced gSASRec, which showed that SASRec could outperform BERT4Rec when properly addressing overconfidence arising from negative sampling (arxiv.org). They argued that BERT4Rec’s bidirectional mechanism alone did not guarantee superiority; rather, loss formulations and training strategies play a crucial role.

Comparative Strengths and Weaknesses

  • Strengths

    • Rich Context Modeling: By conditioning on both sides of a position, BERT4Rec captures intricate co-occurrence patterns.
    • Data Efficiency: Masked positions generate more supervision signals per sequence.
    • Flexibility: Can predict items at arbitrary positions, enabling applications like sequential imputation or session completion beyond next-item recommendation.
  • Weaknesses

    • Compute and Memory: More parameters and bidirectional attention make it more expensive in both training and inference compared to unidirectional models.
    • Training Sensitivity: Requires careful hyperparameter tuning and longer training times to reach optimal performance.
    • Inference Unidirectionality for Next-Item: Although trained bidirectionally, predicting the next item requires inserting a [MASK] with no right context, effectively making inference unidirectional, possibly leaving some benefits unused.

Conceptual Insights: Why BERT4Rec Works

Learning Co-Occurrence vs. Sequential Order

Unlike unidirectional models that focus on ordering—item t predicts item t+1—BERT4Rec learns from co-occurrence patterns across sessions:

  • Items A and B that consistently appear together in sessions might have high mutual information.
  • If A often precedes B and also often follows B, unidirectional models only see one direction; BERT4Rec sees both, learning a symmetric association.

In recommendation, co-occurrence is often more informative than strict ordering. For example, if many users watch “The Matrix” and “Inception” in any order, a bidirectional model picks up that association, regardless of which came first.

Overcoming Exposure Bias

Unidirectional models train to predict the next item given ground-truth history. During inference, they must use predicted items (or no items) to form history, leading to exposure bias—errors compound as the model has never seen its own mistakes. In contrast, BERT4Rec’s masking randomly hides items during training, exposing the model to situations where parts of the sequence are unknown, resulting in more robust representations when some interactions are missing or noisy (arxiv.org, arxiv.org).

Analogous to Autoencoders

BERT4Rec’s training resembles an autoencoder: it corrupts (masks) parts of the input and learns to reconstruct them. This formulation encourages the model to learn latent representations capturing holistic session semantics. In collaborative filtering, denoising autoencoders (e.g., CDAE) have been used for recommendation, where randomly corrupted user vectors are reconstructed (arxiv.org, researchgate.net). BERT4Rec extends that idea to sequences of interactions with the Transformer’s bidirectional power.

Broader Context: From Language to Recommendation

Transfer of Ideas Across Domains

BERT4Rec is an instance of cross-pollination between NLP and recommendation research. Historically, many breakthroughs in one field find applications in others:

  • Word2Vec (2013): Initially for word embeddings, later adapted for graph embeddings, collaborative filtering, and more.
  • Convolutional Neural Networks (1995–2012): Developed for image tasks, later adapted for text (CNNs for sentence classification) and recommendation (Caser uses convolution to model user-item sequences).
  • Attention Mechanisms (2014–2017): Originating in machine translation, now used in recommendation (e.g., SASRec, BERT4Rec, and many variants).

The flow of ideas mirrors human creativity: when we learn a concept in one context, we often find analogous patterns in another.

Analogy: Leonardo da Vinci studied bird flight to design flying machines. Similarly, BERT4Rec studies how Transformers learn from language sequences to design better user modeling systems.

Historical Perspective: The Rise of Pre-Training

In both language and recommendation, there is a shift from task-specific training to pre-training + fine-tuning:

  • In NLP, models like ELMo (2018), GPT (2018), and BERT (2018–2019) introduced large-scale pre-training on massive unlabeled corpora, followed by fine-tuning on downstream tasks (arxiv.org, aclanthology.org).
  • In recommendation, early models trained from scratch on each dataset. Now, researchers explore pre-training on large interaction logs to learn general user behavior patterns, then fine-tune on specific domains (e.g., news, movies). BERT4Rec’s Cloze objective could be viewed as a form of self-supervised pre-training, although in the original work they trained on the target dataset from scratch (arxiv.org, arxiv.org).

This trend reflects a broader movement in AI: capturing general knowledge from large data and adapting it to specific tasks, mirroring human learning—children first learn language generally, then apply it to specialized domains like mathematics or science.

Limitations and Challenges

Computational Complexity

BERT4Rec’s bidirectional attention has quadratic time and memory complexity with respect to sequence length. In long sessions (e.g., browsing logs with hundreds of items), this becomes a bottleneck. Several strategies mitigate this:

  • Truncated Histories: Only consider the last L items (e.g., L = 200).
  • Segmented or Sliding Windows: Process overlapping windows of fixed length rather than the entire history.
  • Efficient Attention Variants: Use low-rank or kernelized attention approximations (e.g., Linformer, Performer) to reduce complexity from O(L²) to O(L log L) or O(L) (arxiv.org).

Nonetheless, these require extra engineering and can affect performance if important interactions get truncated.

Training Sensitivity and Hyperparameters

As noted by Petrov and Macdonald (2022), BERT4Rec’s performance is sensitive to:

  • Number of Training Epochs: Standard schedules may under-train the model.
  • Learning Rate Schedules: Warm-up steps followed by linear decay often yield stable performance.
  • Batch Size and Mask Ratio: Larger batches and masking too many positions can hinder learning.
  • Negative Sampling Effects: Overconfidence in ranking due to unbalanced positive/negative sampling can lead to suboptimal results; alternative loss functions (e.g., gBCE) can mitigate this (arxiv.org, arxiv.org).

This contrasts with smaller unidirectional models like SASRec, which often converge faster and require fewer tuning efforts.

Cold-Start and Long-Tail Items

Like many collaborative filtering methods, BERT4Rec struggles with:

  • Cold-Start Users: Users with very short or no interaction history. Masked predictions require context—if there’s no context, predictions degrade.
  • Cold-Start Items: Items with very few interactions. Their embeddings are not well trained, making them less likely to be predicted.
  • Long-Tail Distribution: Most items appear infrequently; BERT4Rec can overfit popular items seen many times in training, biasing recommendations.

Mitigations include:

  • Incorporating content features (e.g., item metadata, text descriptions) through hybrid models.
  • Using meta-learning to quickly adapt to new items or users.
  • Employing data augmentation (e.g., synthetic interactions) to enrich representations.

Interpretability

Transformers are often regarded as “black boxes.” While attention weights can sometimes be visualized to show which items influence predictions, they do not guarantee human-interpretable explanations. Efforts to explain recommendation via attention often reveal that attention scores do not always align with intuitive importance (arxiv.org). For stakeholders demanding transparency, additional interpretability methods (e.g., counterfactual explanations, post-hoc analysis) may be needed.

Variants and Extensions

Incorporating Side Information

BERT4Rec can be extended to use side features:

  • User Features: Demographics, location, device, etc.
  • Item Features: Category, price, textual description, images.
  • Session Context: Time gaps, device changes, location transitions.

One approach is to concatenate side feature embeddings with item embeddings at each position, then feed the combined vector into the Transformer (arxiv.org). Alternatively, one can use separate Transformer streams for different modalities and then merge them (e.g., multi-modality Transformers).
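
A minimal sketch of the concatenation approach, assuming a single categorical side feature (say, a category id) per item; the fusion layer and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class SideFeatureEmbedding(nn.Module):
    """Concatenate item-id and category embeddings, then project back to d_model."""

    def __init__(self, num_items: int, num_categories: int, d_model: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 2, d_model)      # +2 for [PAD]/[MASK]
        self.cat_emb = nn.Embedding(num_categories + 1, d_model)  # +1 for an "unknown" bucket
        self.fuse = nn.Linear(2 * d_model, d_model)               # project back to d_model

    def forward(self, item_ids: torch.Tensor, category_ids: torch.Tensor) -> torch.Tensor:
        combined = torch.cat([self.item_emb(item_ids), self.cat_emb(category_ids)], dim=-1)
        return self.fuse(combined)   # use this in place of E_item in the input sum
```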

Pre-Training on Large-Scale Logs

Instead of training BERT4Rec from scratch on a target dataset, it can be pre-trained on massive generic interaction logs (e.g., clicks across many categories) and fine-tuned on a domain-specific dataset (e.g., music). Pre-training tasks might include:

  • Masked Item Prediction (as usual).
  • Segment Prediction: Predict whether a sequence segment belongs to the same user.
  • Next Session Prediction: Predict which next session a user will have.

After pre-training, the model adapts faster to downstream tasks, especially in data-sparse domains. This mimics BERT’s success in NLP.

Combining with Contrastive Learning

Recent trends in self-supervised learning for recommendation incorporate contrastive objectives, encouraging similar user sequences or items to have similar representations. One can combine BERT4Rec’s Cloze objective with contrastive losses (e.g., SimCLR, MoCo) to further improve generalization:

  • Sequence-Level Contrast: Represent a user session by pooling BERT4Rec’s hidden states; contrast similar sessions against dissimilar ones.
  • Item-Level Contrast: Encourage items co-occurring frequently to have similar embeddings.

Contrastive learning can mitigate representation collapse and improve robustness.
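
As one possible instantiation of sequence-level contrast, the sketch below applies an InfoNCE-style loss to pooled session representations from two augmented views of each session; the pooling, temperature, and augmentation choices are assumptions rather than part of BERT4Rec itself.

```python
import torch
import torch.nn.functional as F

def info_nce(session_a: torch.Tensor, session_b: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over two views of the same batch of sessions.

    session_a, session_b: (batch, d_model) pooled session representations
    (e.g., the mean of the encoder's final hidden states over non-pad positions).
    """
    a = F.normalize(session_a, dim=-1)
    b = F.normalize(session_b, dim=-1)
    logits = a @ b.t() / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Each session's positive is its own other view; all other sessions act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Such a loss would typically be added to the Cloze objective with a weighting coefficient.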

Efficient Transformer Variants

To handle long sequences more efficiently:

  • Linformer: Projects keys and values to a lower dimension before computing attention, reducing complexity from O(L²) to O(L) (arxiv.org).
  • Performer: Uses kernel methods to approximate softmax attention linearly in sequence length.
  • Longformer: Employs sliding window (local) attention and global tokens.
  • Reformer: Uses locality-sensitive hashing to reduce attention costs.

These variants can be plugged into BERT4Rec’s framework to handle longer sessions while retaining bidirectional context.

Future Directions

Personalization and Diversity

While BERT4Rec focuses on accuracy metrics like HR@K and NDCG@K, real-world systems must balance personalization with diversity to avoid echo chambers. Future work could:

  • Include diversity-aware objectives, penalizing recommendations that are too similar to each other.
  • Integrate exploration strategies, e.g., adding randomness to top-K predictions to surface niche items.
  • Leverage reinforcement learning to optimize long-term engagement rather than immediate next click.

Adaptation to Multi-Objective Settings

E-commerce platforms care about metrics beyond clicks—revenues, lifetime value, churn reduction. Extensions of BERT4Rec could incorporate:

  • Multi-Task Learning: Jointly predict next item and other objectives (e.g., purchase probability, churn risk).
  • Bandit Feedback: Combine BERT4Rec embeddings with contextual bandit algorithms to dynamically adapt to user feedback.
  • Causal Inference: Adjust for selection bias in logged interactions, using inverse propensity scoring with BERT4Rec representations.

Explainability and Trust

Building user trust in recommendations requires transparency. Research could focus on:

  • Attention-Based Explanations: Visualizing attention maps to show which past items influenced a recommendation.
  • Counterfactual Explanations: Explaining “if you hadn’t clicked on item A, you might not see item B recommended.”
  • User-Friendly Summaries: Summarizing session themes (e.g., “Because you watched yoga videos, we recommend this fitness product”).

Cross-Seat and Cross-Device Scenarios

Users often switch between devices (phone, laptop, TV) and contexts (work, home). Modeling these cross-device and cross-context patterns requires:

  • Hierarchical Transformers: One level encodes per-device sequences; another encodes cross-device transitions.
  • Time-Aware Modeling: Incorporate temporal embeddings for time gaps between interactions, using continuous time Transformers.

Hybrid with Knowledge Graphs

Many platforms maintain knowledge graphs linking items to attributes, categories, and external entities. Integrating BERT4Rec embeddings with graph neural networks (GNNs) can enrich representations:

  • Graph-Enhanced Embeddings: Use GNNs to initialize item embeddings based on their neighbors in the knowledge graph.
  • Joint Attention over Sequences and Graphs: Attend over historical interactions and relevant graph nodes.

Personal Reflections and Closing Thoughts

Building BERT4Rec felt like standing on the shoulders of giants: from Markov models that taught me the basics of transitions, to RNNs that showed me how to carry hidden state, to attention mechanisms that revealed the power of flexible context, to BERT’s bidirectional pre-training that inspired me to look at user sequences holistically. Each step deepened my understanding of how to model dynamic preferences, echoing my own journey of learning and exploration.

I’ve always believed that technical advancements in AI should be connected to human-centered insights. When I see masked language models predicting words, I think of a student piecing together meaning. When I see masked item tasks predicting products, I imagine someone reconstructing their shopping trajectory, filling in forgotten steps. These analogies bridge the gap between cold mathematics and living experiences, reminding me that behind each click or purchase is a person with evolving interests, context, and purpose.

BERT4Rec is not the final word in sequential recommendation. It represents a milestone—a demonstration that ideas from language modeling can transform how we think about recommendation. But as we push forward, we must keep asking: How can we make models more efficient without sacrificing nuance? How can we ensure diversity and fairness? How can we respect privacy while learning from behavior? I hope this post not only explains BERT4Rec’s mechanics but also sparks your own curiosity to explore these questions further.

References and Further Reading

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (pp. 4171–4186). (arxiv.org, aclanthology.org)
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. In NeurIPS (pp. 5998–6008). (arxiv.org, papers.nips.cc)
  • Sun, F., Liu, J., Wu, J., Pei, C., Lin, X., Ou, W., & Jiang, P. (2019). BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. In CIKM (pp. 1441–1450). (arxiv.org, github.com)
  • Kang, W.-C., & McAuley, J. (2018). Self-Attentive Sequential Recommendation. In ICDM (pp. 197–206). (arxiv.org, cseweb.ucsd.edu)
  • Petrov, A., & Macdonald, C. (2022). A Systematic Review and Replicability Study of BERT4Rec for Sequential Recommendation. arXiv:2207.07483. (arxiv.org, arxiv.org)
  • Petrov, A., & Macdonald, C. (2023). gSASRec: Reducing Overconfidence in Sequential Recommendation Trained with Negative Sampling. arXiv:2308.07192. (arxiv.org)
  • Hu, Y., Zhang, Y., Sun, N., Murai, M., Li, M., & King, I. (2018). Utilizing Long- and Short-Term Structure for Memory-Based Sequential Recommendation. In WWW (pp. 1281–1290).
  • Zhou, K., Wang, H., Zhao, W. X., Zhu, Y., Wang, S., Zhang, F., Wang, Z., & Wen, J.-R. (2020). S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization. In CIKM.
  • Tan, Y. K., & Yang, J. (2021). Light-BERT4Rec: Accelerating BERT4Rec via Knowledge Distillation for Sequential Recommendation. In CIKM.
  • Yang, N., Wang, W., & Zhao, J. (2021). TransRec: Learning User and Item Representations for Sequential Recommendation with Multi-Head Self-Attention. In Sarnoff Symposium.
  • Bi, W., Zhu, X., Lv, H., & Wang, W. (2021). AdaSAS: Adaptive User Interest Modeling with Multi-Hop Self-Attention for Sequential Recommendation. In RecSys.
  • Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., & Leskovec, J. (2018). Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In KDD. (Used as analogy for combining graph structures with sequence modeling.)
  • He, R., & McAuley, J. (2016). VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In AAAI. (Illustrates use of side information in recommendation.)
  • Wang, X., He, X., Cao, Y., Liu, M., & Chua, T.-S. (2019). KGAT: Knowledge Graph Attention Network for Recommendation. In KDD. (Shows integration of knowledge graphs for richer item representations.)