Attention Mechanism in Recommendation Systems

Produced by NewBeeNLP

Author: @Uesugi Shoji

Leisure Meeting · Information Retrieval

Now that the attention mechanism has become a fairly commonplace technique, using or modifying it has to start with a good story: why use Attention here, and why modify Attention in this particular way.

From traditional CF and FM methods to NFM and DeepFM, even though DNNs are used to model deep feature interactions, two main shortcomings remain:

  • Weak feature extraction from user historical behavior.
  • Feature redundancy: second-order or higher-order features are essentially enumerated exhaustively.

DIN and DIEN are models developed by Alibaba for CTR estimation, mainly focusing on further mining user historical behavior data. The CTR estimation task is to predict clicks based on given ads/items, users, and a large amount of contextual information, making the understanding of user interests and historical behavior data very important.

Since Transformers and BERT came to dominate NLP, it is only natural that their application in recommendation systems has been upgraded as well. This post summarizes several papers on attention in recommendation, from plain Attention all the way to BERT.

DIN

  • Paper: Deep Interest Network for Click-Through Rate Prediction
  • Link: https://arxiv.org/abs/1706.06978

The story of Attention in this article starts from two observations:

  • Diversity: users have broad interests, and a single user may be interested in items from many different fields.
  • Local activation: only a portion of the historical behavior is relevant to the item currently being recommended (for example, recommending snacks has nothing to do with what equipment the user previously bought).

So how can we dynamically capture the relevant history? DIN's approach is to compute the similarity between each of the user's historical behaviors and the current candidate item, i.e., the Attention weight, assign a different weight to each interest representation, and then take a weighted sum. First, let's look at the formula:

$$v_U(A) = f(v_A, e_1, e_2, \ldots, e_H) = \sum_{j=1}^{H} a(e_j, v_A)\, e_j = \sum_{j=1}^{H} w_j\, e_j$$

Here $v_U(A)$ is the user's feature vector with respect to candidate item $A$, $e_1, \ldots, e_H$ are the embeddings of the user's historical behaviors (the interest features), and $v_A$ is the candidate item's feature vector. The attention weight $a(e_j, v_A)$ measures the relevance between a historical interest and the candidate item; aligning user interests with the candidate in this way is essentially what solves the local activation problem. The user representation is then the weighted sum over all historical behaviors.

The entire model framework is shown in the figure above. The left side is the base model: features are one-hot or multi-hot encoded (multi-hot mainly for user historical behavior) and then embedded. Note that the number of historical clicks differs from user to user, yet the model needs a fixed-length vector, so the multi-hot features go through an element-wise sum pooling (the '+' in the figure); no matter how long the behavior sequence is, it is pooled into the same dimension. All fields are then concatenated and an MLP predicts the final score. The author argues that this pooling clearly loses a lot of information, and Attention can strengthen the expression of user behavior features, so the right side of the figure shows the model after Attention is added.
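As a rough sketch of this local activation idea (the scoring MLP, its input features, and all shapes below are illustrative assumptions rather than the exact activation unit from the paper or its released code), each behavior embedding is scored against the candidate item and the weighted behaviors are sum-pooled:

import numpy as np

def din_attention_pooling(behavior_embs, candidate_emb, w1, b1, w2, b2):
    """Toy sketch of a DIN-style local activation unit (shapes and MLP are illustrative).

    behavior_embs: (H, d) embeddings of the user's H historical behaviors
    candidate_emb: (d,)   embedding of the candidate item
    w1, b1, w2, b2: parameters of a tiny scoring MLP (hypothetical sizes)
    """
    H, d = behavior_embs.shape
    cand = np.tile(candidate_emb, (H, 1))                      # broadcast candidate to each behavior
    # concatenate [behavior, candidate, behavior*candidate] as input to the scoring MLP
    feats = np.concatenate([behavior_embs, cand, behavior_embs * cand], axis=1)  # (H, 3d)
    hidden = np.maximum(feats @ w1 + b1, 0.0)                  # ReLU hidden layer (the paper uses PReLU/Dice)
    scores = hidden @ w2 + b2                                  # (H, 1) unnormalized attention weights
    weights = scores                                           # DIN does NOT softmax-normalize the weights
    return (weights * behavior_embs).sum(axis=0)               # weighted sum pooling -> user representation

# usage with random toy parameters
d, H, hidden_dim = 8, 5, 16
rng = np.random.default_rng(0)
user_vec = din_attention_pooling(
    rng.normal(size=(H, d)), rng.normal(size=d),
    rng.normal(size=(3 * d, hidden_dim)), np.zeros(hidden_dim),
    rng.normal(size=(hidden_dim, 1)), np.zeros(1))
print(user_vec.shape)   # (8,)

One design detail worth noting: DIN deliberately does not normalize the weights with a softmax, so the magnitude of the pooled vector can also reflect how strongly the history matches the candidate.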

The idea of adding attention is quite simple. There are also two important tricks:

1. Data Dependent Activation Function (Dice Activation Function)

The standard ReLU activation outputs y = x when the input is greater than 0 and 0 otherwise, which leads to many "dead" neurons that update slowly. Leaky ReLU therefore gives the negative side a small slope, y = ax for x < 0. However, this is still not enough: both fix the switch point at 0 (everything is either left or right of 0), which is arbitrary; the switch point should be determined by the data, hence the need for Dice.

Dice keeps the Leaky ReLU/PReLU form but adds a data-dependent control gate $p(s)$ for the two sides:

$$f(s) = p(s)\cdot s + \big(1 - p(s)\big)\cdot \alpha s, \qquad p(s) = \frac{1}{1 + e^{-\frac{s - E[s]}{\sqrt{Var[s] + \epsilon}}}}$$

Here $p$ is computed from the normalized input (i.e., adjusted with the mean and variance of the data), which effectively shifts the switch point of the activation to the mean of the data.

  • Advantages: flexibly adapts the switch point to the data distribution and inherits the benefits of BN.
  • Disadvantages: also inherits BN's extra computation, so it is relatively time-consuming.
def dice(_x, axis=-1, epsilon=0.000000001, name=''):
  # Dice: data adaptive activation function
  with tf.variable_scope(name_or_scope='', reuse=tf.AUTO_REUSE):
    alphas = tf.get_variable('alpha'+name, _x.get_shape()[-1],
                         initializer=tf.constant_initializer(0.0),
                         dtype=tf.float32)
    beta = tf.get_variable('beta'+name, _x.get_shape()[-1],
                         initializer=tf.constant_initializer(0.0),
                         dtype=tf.float32)
  input_shape = list(_x.get_shape())

  reduction_axes = list(range(len(input_shape)))
  del reduction_axes[axis]
  broadcast_shape = [1] * len(input_shape)
  broadcast_shape[axis] = input_shape[axis]

  # case: train mode (uses stats of the current batch)
  # compute the batch mean and std manually; these only feed the commented-out manual
  # normalization below, while the code actually uses tf.layers.batch_normalization
  mean = tf.reduce_mean(_x, axis=reduction_axes)
  brodcast_mean = tf.reshape(mean, broadcast_shape)
  std = tf.reduce_mean(tf.square(_x - brodcast_mean) + epsilon, axis=reduction_axes)
  std = tf.sqrt(std)
  brodcast_std = tf.reshape(std, broadcast_shape)
  x_normed = tf.layers.batch_normalization(_x, center=False, scale=False, name=name, reuse=tf.AUTO_REUSE)
  # x_normed = (_x - brodcast_mean) / (brodcast_std + epsilon)
  x_p = tf.sigmoid(beta * x_normed)   # data-dependent control gate p(s)

  return alphas * (1.0 - x_p) * _x + x_p * _x  # f(s) = p(s)*s + (1-p(s))*alpha*s, as in the paper


def parametric_relu(_x):
  # PReLU activation: same form as Leaky ReLU, except alpha is learnable.
  # With alpha = 0 it degenerates to ReLU; if alpha is never updated, it degenerates to Leaky ReLU.
  with tf.variable_scope(name_or_scope='', reuse=tf.AUTO_REUSE):
    alphas = tf.get_variable('alpha', _x.get_shape()[-1],
                         initializer=tf.constant_initializer(0.0),
                         dtype=tf.float32)
  pos = tf.nn.relu(_x)
  neg = alphas * (_x - abs(_x)) * 0.5  # negative part, scaled by alpha

  return pos + neg

Complete source code notes: https://github.com/nakaizura/Source-Code-Notebook/tree/master/DIN

2. Adaptive Regularization

The motivation here is to prevent overfitting given the long-tailed distribution of the input and the extreme sparsity of the feature dimensions. Plain L1/L2 regularization or Dropout does not work well, and simply discarding rare features loses information and can make overfitting worse. The adaptive regularization method therefore adjusts the regularization strength according to feature frequency: frequently occurring features are regularized less, and rarely occurring features are regularized more, i.e., the penalty falls mainly on features that appear infrequently.
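A minimal sketch of this idea (an illustrative frequency-scaled L2 penalty, not the paper's exact mini-batch aware regularization; the function name and parameters are assumptions):

import numpy as np

def frequency_scaled_l2(embedding_table, feature_counts, base_lambda=1e-4):
    """Toy sketch of frequency-aware regularization (illustrative only).

    embedding_table: (V, d) embedding matrix
    feature_counts:  (V,)   how often each feature id appears in the training data
    Rare features (small count) receive a larger penalty, frequent ones a smaller penalty.
    """
    counts = np.maximum(feature_counts, 1)              # avoid division by zero
    per_row_weight = base_lambda / counts                # penalty strength proportional to 1 / frequency
    per_row_l2 = np.sum(embedding_table ** 2, axis=1)    # ||w_j||^2 for each embedding row
    return np.sum(per_row_weight * per_row_l2)

# usage with toy data
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 16))
counts = rng.integers(1, 10000, size=1000)
reg_loss = frequency_scaled_l2(emb, counts)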

  • DIN's design is particularly valuable in industry: a deployed system is limited by memory, so the user embedding cannot be very large, which makes it hard to represent user features well, let alone a user's multiple (multi-modal) interests. By attending over the user's historical behavior, DIN alleviates this problem nicely.
  • Disadvantage: it uses historical behavior data but ignores the sequential relationship between behaviors.

DIEN

  • Paper: Deep Interest Evolution Network for Click-Through Rate Prediction
  • Link: https://arxiv.org/abs/1809.03672

DIEN upgrades DIN, addressing two shortcomings:

  • User interests should evolve continuously. DIN extracts fixed user interests, failing to capture the evolutionary nature of interests.
  • How to ensure that the interests obtained from users’ explicit behavior are valid.

Thus, DIEN mainly develops the Interest Extractor Layer and the Interest Evolution Layer to solve the above two shortcomings.

Interest Extractor Layer

The main goal of the Interest Extractor Layer is to extract the sequence of interest states behind the user's behaviors; a user's interest at one moment has temporal dependencies on earlier moments, so a GRU is used to model the behavior sequence. (Attention comes in later, in the Evolution Layer, in the form of AUGRU, a GRU with attentional update gate, which enhances the influence of relevant interests during interest changes and weakens the influence of irrelevant ones.) Additionally, to make sure the extracted interest representation is reasonable, an auxiliary loss is added to improve its accuracy:

As shown in the small block on the left of the figure, the auxiliary network takes the user's actual next behavior e(t+1) as the positive example and a negatively sampled behavior e(t+1)' as the negative example; both are fed into the auxiliary network together with the interest state h(t) extracted by the GRU, which substantially strengthens the interest representation.

def auxiliary_loss(self, h_states, click_seq, noclick_seq, mask, stag=None):
    mask = tf.cast(mask, tf.float32)
    click_input_ = tf.concat([h_states, click_seq], -1)      # positive examples
    noclick_input_ = tf.concat([h_states, noclick_seq], -1)  # negative examples
    # feed both into the auxiliary network to obtain click probabilities
    click_prop_ = self.auxiliary_net(click_input_, stag=stag)[:, :, 0]
    noclick_prop_ = self.auxiliary_net(noclick_input_, stag=stag)[:, :, 0]
    # compute the (masked) log loss
    click_loss_ = - tf.reshape(tf.log(click_prop_), [-1, tf.shape(click_seq)[1]]) * mask
    noclick_loss_ = - tf.reshape(tf.log(1.0 - noclick_prop_), [-1, tf.shape(noclick_seq)[1]]) * mask
    loss_ = tf.reduce_mean(click_loss_ + noclick_loss_)
    return loss_

# structure of the auxiliary network
def auxiliary_net(self, in_, stag='auxiliary_net'):
    bn1 = tf.layers.batch_normalization(inputs=in_, name='bn1' + stag, reuse=tf.AUTO_REUSE)
    dnn1 = tf.layers.dense(bn1, 100, activation=None, name='f1' + stag, reuse=tf.AUTO_REUSE)
    dnn1 = tf.nn.sigmoid(dnn1)
    dnn2 = tf.layers.dense(dnn1, 50, activation=None, name='f2' + stag, reuse=tf.AUTO_REUSE)
    dnn2 = tf.nn.sigmoid(dnn2)
    dnn3 = tf.layers.dense(dnn2, 2, activation=None, name='f3' + stag, reuse=tf.AUTO_REUSE)
    y_hat = tf.nn.softmax(dnn3) + 0.00000001
    return y_hat

Interest Evolution Layer

The goal of the Interest Evolution Layer is to characterize the evolution of user interests. The attention story here is:

  • Interest drift: user interests are not static; they drift over time and show certain tendencies.
  • Interest individual: different interests evolve largely independently of each other.

Thus, an attention mechanism is needed to enhance the influence of relevant interests during interest changes and weaken the influence of irrelevant interests, that is, to compute Attention weights for GRU, as shown in the red part of the figure. There are three variants of attention that can be chosen:

  • GRU with attentional input (AIGRU): Attention is used as input.

  • Attention based GRU (AGRU): Attention replaces the GRU update gate.

  • GRU with attentional update gate (AUGRU): the attention score rescales the update gate, so the hidden state barely changes for irrelevant behaviors (a minimal sketch of one AUGRU step follows this list).
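A minimal sketch of a single AUGRU step, assuming standard GRU gates and an attention score in [0, 1] (parameter names and shapes are illustrative, not from the official DIEN code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def augru_step(x_t, h_prev, att_score, W, U, b):
    """One AUGRU step (toy sketch). W, U, b hold parameters for the update gate (u),
    reset gate (r) and candidate state (c); att_score is the attention weight in [0, 1]."""
    Wu, Wr, Wc = W            # each (input_dim, hidden_dim)
    Uu, Ur, Uc = U            # each (hidden_dim, hidden_dim)
    bu, br, bc = b            # each (hidden_dim,)
    u = sigmoid(x_t @ Wu + h_prev @ Uu + bu)         # standard update gate
    r = sigmoid(x_t @ Wr + h_prev @ Ur + br)         # reset gate
    c = np.tanh(x_t @ Wc + (r * h_prev) @ Uc + bc)   # candidate hidden state
    u_tilde = att_score * u                          # AUGRU: attention rescales the update gate
    return (1.0 - u_tilde) * h_prev + u_tilde * c    # small att_score -> hidden state barely changes

# usage with toy shapes
rng = np.random.default_rng(0)
in_dim, hid = 8, 4
W = [rng.normal(size=(in_dim, hid)) for _ in range(3)]
U = [rng.normal(size=(hid, hid)) for _ in range(3)]
b = [np.zeros(hid) for _ in range(3)]
h = augru_step(rng.normal(size=in_dim), np.zeros(hid), att_score=0.9, W=W, U=U, b=b)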

BST

  • Paper: Behavior Sequence Transformer for E-commerce Recommendation in Alibaba
  • Link: https://arxiv.org/abs/1905.06874

Upgrading from plain Attention to the Transformer is straightforward; the Transformer[1] has already been written up by the author, so I will not elaborate on it here. BST simply feeds the embedded features directly into a Transformer layer. Let's take a look at the input features:

  • The "other features" block on the far left is divided into four parts: user features, item features, contextual features, and cross features.
  • Within each item of the behavior sequence, blue represents the positional feature and red represents the item itself.

Unlike the sinusoidal positional encoding in the Transformer, BST derives each position from the timestamp (the gap between the recommendation time and the time of that click) and maps it directly to a learned embedding rather than using a sine function. Finally, it uses a standard cross-entropy loss:

$$L = -\frac{1}{N}\sum_{(x, y) \in D}\Big(y \log p(x) + (1 - y)\log\big(1 - p(x)\big)\Big)$$

where $D$ is the training set, $y \in \{0, 1\}$ is the click label, and $p(x)$ is the predicted click probability.
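As an illustration of this learned positional feature, one could bucket the time gap between the recommendation time and each historical click and look the bucket up in an embedding table; the log-scale bucketing and all names below are assumptions for the sketch, not the paper's scheme:

import numpy as np

def time_gap_position_ids(click_timestamps, recommend_timestamp, num_buckets=64):
    """Toy sketch: turn time gaps into discrete ids for a learned positional embedding."""
    gaps = np.maximum(recommend_timestamp - np.asarray(click_timestamps), 0)  # seconds since each click
    # log-scale bucketing so recent clicks get finer resolution (bucketing scheme is an assumption)
    bucket_ids = np.minimum(np.log1p(gaps).astype(int), num_buckets - 1)
    return bucket_ids

pos_embedding = np.random.default_rng(0).normal(size=(64, 16))   # learned (num_buckets, d) table
ids = time_gap_position_ids([1_700_000_000, 1_699_990_000], 1_700_003_600)
pos_vectors = pos_embedding[ids]   # (2, 16) positional vectors combined with the item embeddings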

BERT4Rec

  • Paper: BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer
  • Link: https://arxiv.org/abs/1904.06690
  • You can also reply ‘0037’ in the public account backend to get it directly.

With the Transformer in place, upgrading to BERT[2] is quite natural. BERT was first used in NLP, and a user's behavior sequence closely resembles a text sequence, so BERT4Rec is a natural fit. The model structure is shown in the figure above: the input is a sequence [v1, v2, …, vt], the last item vt is masked, and the Transformer is forced to predict vt from the surrounding context. For a user's behavior sequence, the item predicted at that final position naturally becomes the recommendation.

There are two tricks to note:

  • The input is not the user's entire behavior sequence, because sequence lengths vary greatly across users; only the most recent N behaviors are used as input.
  • During training, it is not only the last item that is masked; as in BERT itself, items are masked at random and the model predicts the masked positions (a minimal masking sketch follows this list).
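A minimal sketch of this Cloze-style masking (the mask id and masking ratio below are assumptions, not values from the paper or its released code):

import numpy as np

MASK_ID = 0          # assumed id reserved for the [mask] token
MASK_PROB = 0.2      # assumed masking ratio (tuned per dataset in practice)

def mask_behavior_sequence(item_ids, rng):
    """Randomly mask items in a behavior sequence (Cloze task); return model inputs and targets."""
    item_ids = np.asarray(item_ids)
    masked = rng.random(item_ids.shape) < MASK_PROB
    inputs = np.where(masked, MASK_ID, item_ids)   # replace masked positions with [mask]
    targets = np.where(masked, item_ids, -1)       # -1 marks positions excluded from the loss
    return inputs, targets

# at inference time, only the position after the most recent N behaviors is masked
rng = np.random.default_rng(0)
inputs, targets = mask_behavior_sequence([5, 17, 42, 8, 23], rng)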

PRM

  • Paper: Personalized Re-ranking for Recommendation
  • Link: https://arxiv.org/abs/1904.06813

This paper, from Alibaba at RecSys '19, focuses on re-ranking the recommendation list. The main contribution is Transformer-based personalized re-ranking:

  • A user's interactions with the items in a list are biased by personal preference, so re-ranking should be personalized rather than user-agnostic.
  • The Transformer's self-attention can effectively capture the mutual influences between the items (features) in the list.

The specific model architecture is shown in the figure above. After the initial list is obtained, the feature vector x of each item is concatenated with the user's personalized vector p. These personalized vectors come from a pre-trained model (for example, any CTR model trained on the user's historical behavior, item information, and so on):

Then, the current item’s position encoding in the list is added:

Finally, the Transformer captures the interactions between the items in the list and outputs a score for each item, which gives the re-ranked result.
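A toy, single-head and single-layer sketch of this re-ranking computation (all parameter names, shapes, and the one-layer encoder are simplifications, not PRM's actual implementation):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prm_rerank(item_feats, personal_vec, pos_emb, Wq, Wk, Wv, w_out):
    """Toy single-head self-attention re-ranker in the spirit of PRM."""
    n, _ = item_feats.shape
    p = np.tile(personal_vec, (n, 1))
    x = np.concatenate([item_feats, p], axis=1) + pos_emb[:n]   # concat(x, p) + learned position embedding
    q, k, v = x @ Wq, x @ Wk, x @ Wv                            # self-attention over the whole list
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)
    h = attn @ v
    scores = softmax(h @ w_out)                                 # one score per item in the list
    return np.argsort(-scores), scores                          # re-ranked order and scores

# usage with toy shapes: 5 items, 8-dim item features, 4-dim personalized vector
rng = np.random.default_rng(0)
d_in, d_h = 12, 16
order, scores = prm_rerank(
    rng.normal(size=(5, 8)), rng.normal(size=4), rng.normal(size=(50, d_in)),
    rng.normal(size=(d_in, d_h)), rng.normal(size=(d_in, d_h)), rng.normal(size=(d_in, d_h)),
    rng.normal(size=d_h))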

Other Uses of Attention

Recommendation is a very broad field, so there are many ways to use attention; the key is to be clear about why Attention is being used. For example:

  • To fuse various features (attention or co-attention, cross-attention).
  • Whether features are hierarchical (Hierarchical, level-attention).
  • Whether there are interactions between features (interactive attention).
  • Whether there are group relationships among features (Group-Attention).
  • Whether there are dynamic changes (dynamic attention, typically computed several times along a timeline to capture the dynamics).

Moreover, plain Attention may not be sufficient, so modifying and upgrading it leads to high-order attention, channel-wise attention, spatial attention, and so on. There are also other attention variants[3] that the author has organized previously and will not discuss further here.

These days the upgrades to Attention have become increasingly aggressive, from self-attention to the Transformer to BERT, and the results have naturally improved as well.
