Prompt-Based Reinforcement Learning for Next Item Recommendation Systems

Introduction

Next item recommendation is one of the core components of modern online services. Embedded in applications such as music, video, and e-commerce platforms, it helps users navigate and discover new content. The task is generally modeled as sequence prediction, often implemented with recurrent neural networks or other generative sequence models, and aims to answer the question: given a user’s past interactions, what is the next item of interest? Reinforcement learning (RL) trains an agent to take appropriate actions based on observed environmental states so as to maximize a predefined reward; existing value-based RL algorithms typically alternate between policy evaluation and policy improvement, as shown in Figures 1a and 1b. RL naturally aligns with the optimization goal of recommendation, namely maximizing the overall reward of an interaction session, and its flexible reward definition can be tailored to recommendation objectives. Using RL for recommendation has therefore become an emerging topic.
(Figure 1: Policy Evaluation Algorithm, Policy Improvement Algorithm, and PRL Paradigm)
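To make the loop in Figures 1a and 1b concrete, the following is a minimal, illustrative PyTorch sketch (not the paper's implementation) of policy evaluation as a temporal-difference update on logged (state, action, reward, next state) tuples, and of greedy policy improvement; the network names and batch fields are assumptions.

```python
# A minimal sketch of value-based RL for recommendation: policy evaluation
# estimates Q(state, action), policy improvement picks the highest-valued items.
import torch
import torch.nn.functional as F

def policy_evaluation_step(q_net, target_net, batch, optimizer, gamma=0.9):
    """One TD update on logged (state, action, reward, next_state) tuples."""
    q_values = q_net(batch["state"])                              # [B, num_items]
    q_taken = q_values.gather(1, batch["action"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(batch["next_state"]).max(dim=1).values
        target = batch["reward"] + gamma * next_q                 # bootstrapped target
    loss = F.mse_loss(q_taken, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def policy_improvement(q_net, state, k=10):
    """Greedy improvement: recommend the top-k items by estimated Q-value."""
    return q_net(state).topk(k, dim=-1).indices
```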

However, developing recommendation methods based on reinforcement learning is not easy. In the current RL paradigm, the agent learns by interacting with the environment and observing the resulting rewards, which requires a large number of interactions carried out by the agent itself. Traditional RL therefore relies on extensive online exploration and trial-and-error to train the recommendation engine, but recommendation systems cannot afford such trial-and-error, since poor recommendations degrade the user experience. The recommendation engine must instead be trained offline on historical implicit feedback collected under different recommendation strategies. This historical data, however, is not generated by the agent itself but by various, even unknown, behavior policies, so the expectation estimated during policy evaluation is easily biased by the distribution shift. This is known as the offline training challenge.
For this offline training scenario, we propose a new prompt-based learning paradigm called Prompt-Based Reinforcement Learning (PRL). Traditional RL algorithms attempt to map “state-action” input pairs to expected rewards, whereas PRL directly infers actions from “state-reward” inputs, as shown in Figure 1c. In short, the agent is trained with simple supervised learning to predict the recommended item from the previous interactions and the observed reward value. At deployment, the historical (training) data serves as a knowledge base and “state-reward” pairs act as prompts, so the agent answers the question: given the previous interactions and a prompt reward value, which item should be recommended? We instantiate PRL on four recommendation models and conduct experiments on two e-commerce datasets, demonstrating the effectiveness of our method.
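As an illustration of this paradigm, the sketch below trains a model to predict the observed action from a “state-reward” prompt with plain cross-entropy. It assumes a GRU sequence encoder, a single self-attention block, and illustrative module names; it is a sketch of the idea rather than the authors' implementation.

```python
# A minimal sketch of the PRL paradigm: map a "state-reward" prompt to a
# distribution over items and train with supervised cross-entropy.
import torch
import torch.nn as nn

class PRLSketch(nn.Module):
    def __init__(self, num_items, dim=64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim, padding_idx=0)
        self.encoder = nn.GRU(dim, dim, batch_first=True)     # encodes the "state"
        self.reward_proj = nn.Linear(1, dim)                   # encodes the prompt reward
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.head = nn.Linear(dim, num_items)                  # scores the next item

    def forward(self, item_seq, cum_reward):
        _, h = self.encoder(self.item_emb(item_seq))           # h: [1, B, dim]
        state = h[-1].unsqueeze(1)                             # [B, 1, dim]
        reward = self.reward_proj(cum_reward.unsqueeze(-1)).unsqueeze(1)
        prompt = torch.cat([state, reward], dim=1)             # "state-reward" prompt
        out, _ = self.attn(prompt, prompt, prompt)             # supervised self-attention
        return self.head(out.mean(dim=1))                      # logits over items

# Training is plain supervised learning: predict the observed action.
model = PRLSketch(num_items=10000)
loss_fn = nn.CrossEntropyLoss()
logits = model(torch.randint(1, 10000, (32, 10)), torch.rand(32))
loss = loss_fn(logits, torch.randint(0, 10000, (32,)))
```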
Our contributions are summarized as follows:
● For the offline training of RL-based next item recommendation systems, we propose PRL. We suggest using “state-reward” pairs as prompts to infer actions by querying a knowledge base built from historical implicit feedback.
● We propose a supervised self-attention module to learn and store the mapping between “state-reward” inputs and output actions.
● We instantiate PRL on four recommendation models and conduct experiments on two real-world e-commerce datasets. The experimental results show a consistent improvement in recommendation performance.

Prompt-Based Reinforcement Learning

In this section, we detail the training and inference procedures of PRL for next item recommendation.
During training, a generative sequence model encodes the user’s previous interactions into hidden states, the historical data is organized with the template {state, cumulative reward} → {observed action}, and a supervised self-attention module learns and stores these signals. The PRL training process consists of three parts, prompt generation, prompt encoding, and supervised attention learning, as shown in Figure 2.

(Figure 2: Training Framework of PRL)
Prompt generation formulates the offline training data into knowledge templates, i.e., given a previous user-item interaction sequence, which action should be taken to obtain a given cumulative reward. Prompt encoding uses deep neural networks to map the generated prompts into hidden-state representations. The supervised self-attention learning module then learns and stores the mapping between the encoded prompt representations and the observed actions. Figure 3 presents the prompt generation algorithm of PRL.

(Figure 3: Prompt Generation Process of PRL)
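The following is a minimal sketch of prompt generation under this template: for each step of a logged session, the interaction prefix (state) and the discounted reward-to-go are paired with the observed next item. The per-interaction reward values and the discount factor are assumptions for illustration.

```python
# A minimal sketch of prompt generation: build {state, cumulative reward} ->
# {observed action} training examples from one logged session.
def generate_prompts(session_items, session_rewards, gamma=0.9):
    """session_items: [i_1, ..., i_T]; session_rewards: per-step rewards."""
    prompts = []
    for t in range(1, len(session_items)):
        state = session_items[:t]                     # previous interactions
        action = session_items[t]                     # observed next item
        cum_reward = sum(                             # discounted reward-to-go
            (gamma ** k) * r
            for k, r in enumerate(session_rewards[t:])
        )
        prompts.append({"state": state, "reward": cum_reward, "action": action})
    return prompts

# Example: a session of four items where the last interaction is a purchase
# (assumed rewards: 0.2 per click, 1.0 per purchase).
print(generate_prompts([3, 17, 42, 8], [0.2, 0.2, 0.2, 1.0]))
```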
During the inference phase, given the current state, we supply the model with the expected cumulative reward we wish to achieve, and the model directly infers an action by querying the historical knowledge base. During training, cumulative rewards can be computed from the offline data; at inference time, we instead need to provide an inference reward (prompt reward) so that the agent can adjust its behavior according to the reward and thereby achieve exploration.
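A minimal sketch of this inference step, reusing the PRLSketch model from earlier: the prompt reward is set to a target value μ (optionally shifted by a small deviation ϵ, the quantities examined in the experiments below), and the top-N items are read off the model's scores. The concrete values of μ and ϵ here are assumptions.

```python
# A minimal sketch of PRL inference: query the model with a prompt reward.
import torch

@torch.no_grad()
def recommend(model, item_seq, mu=1.0, eps=0.1, top_n=10):
    prompt_reward = torch.full((item_seq.size(0),), mu + eps)  # prompt reward
    logits = model(item_seq, prompt_reward)
    return logits.topk(top_n, dim=-1).indices                  # top-N item ids

# Example with the PRLSketch model above and a batch of two sessions:
# recs = recommend(model, torch.randint(1, 10000, (2, 10)))
```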

Experiments

We instantiated PRL on four deep learning-based sequential recommendation models and conducted experiments on two e-commerce datasets, Challenge15 and RetailRocket. The experiments aim to answer the following three research questions and thereby verify the effectiveness of the PRL learning paradigm:
● How does PRL perform when instantiated on different sequential recommendation models?
● What is the effect of supervised attention learning, including self-attention modules and weighted loss functions?
● How does the prompt reward setting affect the performance of PRL during the inference phase?
For question one, we compared PRL with the baseline models, as shown in Tables 1 and 2. On both the Challenge15 and RetailRocket datasets, PRL achieves the best performance in almost all cases, showing that PRL consistently and significantly improves the offline learning performance of RL-based recommendation and can be applied to a variety of sequential recommendation models.
(Table 1: Comparison of PRL and Other Models in Top-N Recommendation Performance on Challenge15 Dataset)

(Table 2: Comparison of PRL and Other Models in Top-N Recommendation Performance on RetailRocket4 Dataset)

For question two, we conducted ablation experiments on the individual components. To study the effect of the self-attention module, we replaced the self-attention block with average pooling (PRL-mean) or a multi-layer perceptron (PRL-MLP). The results in Table 3 show that PRL with the self-attention module achieves a significant performance improvement.
(Table 3: Effect of Self-Attention Module)
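For concreteness, the two ablation variants can be sketched as drop-in replacements for the self-attention block, operating on the same [B, 2, dim] prompt tensor as in the earlier PRLSketch; the layer sizes are assumptions.

```python
# Illustrative sketches of the ablation variants replacing the self-attention block.
import torch.nn as nn

class MeanPoolBlock(nn.Module):           # PRL-mean: average pooling
    def forward(self, prompt):            # prompt: [B, 2, dim]
        return prompt.mean(dim=1)

class MLPBlock(nn.Module):                # PRL-MLP: small multi-layer perceptron
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
    def forward(self, prompt):
        return self.mlp(prompt.flatten(start_dim=1))
```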

To study the effect of re-weighting, we compared PRL without any re-weighting (PRL-w/o) against PRL re-weighted by cumulative rewards (PRL-cumu). The results in Table 4 show that PRL’s re-weighting successfully helps the model recommend more items with higher prompt rewards.
(Table 4: Impact of Weighted Loss)
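A minimal sketch of the re-weighted objective compared here: each example's cross-entropy term is scaled by its (normalized) cumulative reward, so high-reward behaviours such as purchases contribute more to the gradient. The normalization is an assumption for illustration.

```python
# A minimal sketch of a reward-weighted cross-entropy loss.
import torch
import torch.nn.functional as F

def reweighted_loss(logits, actions, rewards):
    per_example = F.cross_entropy(logits, actions, reduction="none")  # [B]
    weights = rewards / rewards.sum().clamp(min=1e-8)                 # normalize weights
    return (weights * per_example).sum()
```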

For question three, we studied how the expected inference reward μ and the inference reward deviation ϵ affect model performance. The results are shown in Figures 4 and 5, respectively:

(Figure 4: Effect of Expected Inference Reward μ)

(Figure 5: Effect of Inference Reward Deviation ϵ)

Conclusion

We proposed a prompt-based reinforcement learning method for the offline training of RL-based next item recommendation engines. We theoretically analyzed the offline training challenge of using RL for recommendation, and we suggested using historical offline data as a knowledge base, framing the recommendation task as the question: given the observed state, which action should be taken to obtain a given prompt reward? We instantiated PRL on four sequential recommendation models and conducted experiments on two real-world datasets, demonstrating the effectiveness of the proposed method.

Editor: Li Chenliang
