A Survey on Bayesian Deep Learning

Source: Algorithm Advancement
Author: Ziyue Wu@Zhihu
Link: https://zhuanlan.zhihu.com/p/283633149

This article is about 4,500 words long; a 10-minute read is recommended.
This article mainly summarizes a survey on Bayesian deep learning: A Survey on Bayesian Deep Learning. The content includes a basic introduction to Bayesian deep learning and its applications in recommendation systems, topic models, control, and other fields.

An integrated artificial intelligence system should not only be able to “perceive” the environment but also be able to “infer” relationships and their uncertainties. Deep learning performs well in various perception tasks, such as image recognition and speech recognition. However, probabilistic graphical models are more suitable for inference work.
Bayesian Deep Learning (BDL) combines neural networks (NN) and probabilistic graphical models (PGM). This article introduces the principles of Bayesian deep learning and its applications in recommendation systems, topic models, control, and other fields.
The table of contents is as follows:
  • 1 Introduction
  • 2 Deep Learning
      • 2.2 AutoEncoders
  • 3 Probabilistic Graphical Models
  • 4 Bayesian Deep Learning
      • 4.1 A Brief History of Bayesian Neural Networks and Bayesian Deep Learning
      • 4.2 General Framework
      • 4.3 Perception Component
      • 4.4 Task-Specific Component
  • 5 Concrete BDL Models and Applications
      • 5.1 Supervised Bayesian Deep Learning for Recommender Systems
      • 5.2 Unsupervised Bayesian Deep Learning for Topic Models
      • 5.3 Bayesian Deep Representation Learning for Control
      • 5.4 Bayesian Deep Learning for Other Applications
  • 6 Conclusions and Future Research

1 Introduction

Deep learning-based AI models are often proficient in “perception” tasks; however, perception alone is not enough. “Inference” is a crucial component of higher-level artificial intelligence. For instance, in medical diagnosis, a doctor not only perceives symptoms through images and audio, but must also infer the relationships among symptoms, as well as the probabilities of various conditions; in other words, the doctor must be able to “think”. Concretely, this involves identifying conditional dependencies, causal inference, logical reasoning, and handling uncertainty.
Probabilistic graphical models (PGM) can effectively address probabilistic inference problems; however, the drawback of PGM is its difficulty in handling large-scale high-dimensional data, such as images and texts. Therefore, this article attempts to combine the two, integrating them into the BDL framework.
For instance, in a movie recommendation system, deep learning is suitable for processing high-dimensional data, such as reviews (text) or posters (images), while probabilistic graphical models are suitable for modeling conditional dependencies, such as the network relationships between viewers and movies.
Considering uncertainty, BDL is suitable for handling such complex tasks. The parameter uncertainties of complex tasks generally fall into the following categories: (1) uncertainty of neural network parameters; (2) task-related parameter uncertainty; (3) uncertainty in the information transfer between the perception and task-specific components. By representing unknown parameters using probability distributions instead of point estimates, it becomes convenient to unify and handle these three types of uncertainty (this is what the BDL framework aims to achieve).
Additionally, BDL has an “implicit” regularization effect, which helps avoid overfitting when data is scarce. Generally, BDL consists of two components: the perception module and the task-specific module. The former can be regularized through weight decay or dropout (both of which have Bayesian interpretations), while the latter can incorporate priors, allowing effective modeling even with limited data.
Of course, BDL also faces challenges in practical applications, such as issues of time complexity and the effectiveness of information transfer between the two modules.

2 Deep Learning

This chapter mainly introduces classical deep learning methods, and there is no need to elaborate too much. The methods mentioned in the article include MLP, AutoEncoder, CNN, RNN, etc.

2.2 AutoEncoders

This section covers autoencoders: neural networks that encode the input into a more compact representation and can reconstruct the input from that representation. There is plenty of material on this topic; here we mainly describe one AE variant, SDAE (Stacked Denoising AutoEncoders).
[Figure: structure of SDAE]

The structure of SDAE is shown in the figure above. Unlike a plain AE, the input X_0 can be seen as the clean data X_c with noise added or otherwise randomly corrupted (for example, randomly setting 30% of the entries of X_c to zero). What SDAE does is try to recover the clean data from the corrupted input. SDAE can be cast as the following optimization problem:

$$
\min_{\{W_l\},\,\{b_l\}} \; \left\|X_c - X_L\right\|_F^2 + \lambda \sum_l \left\|W_l\right\|_F^2,
$$

where $X_L$ is the reconstruction produced by the last layer and $\lambda$ is a regularization parameter.
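As a minimal illustration of this objective, here is a sketch of a one-hidden-layer denoising autoencoder in PyTorch; the layer sizes, the 30% masking rate, and the weight-decay strength are arbitrary assumptions for the example, not values from the survey:

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Minimal denoising autoencoder: corrupt the clean input X_c into X_0,
    then reconstruct X_c from X_0."""
    def __init__(self, d_in=784, d_hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(d_hidden, d_in), nn.Sigmoid())

    def forward(self, x_clean, mask_prob=0.3):
        # Corrupt X_c by randomly zeroing out a fraction of its entries.
        mask = (torch.rand_like(x_clean) > mask_prob).float()
        return self.decoder(self.encoder(x_clean * mask))

model = DenoisingAE()
# weight_decay plays the role of the lambda * ||W||_F^2 penalty above.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
x_clean = torch.rand(32, 784)                    # toy batch standing in for X_c
opt.zero_grad()
loss = ((model(x_clean) - x_clean) ** 2).sum()   # reconstruction error
loss.backward()
opt.step()
```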

3 Probabilistic Graphical Models

This chapter introduces probabilistic graphical models, laying the groundwork for what follows. Since material on PGMs is abundant, we will not elaborate much. The article mainly introduces directed PGMs (Bayesian networks), such as LDA. LDA can be extended into richer topic models, such as collaborative topic regression (CTR) for recommender systems.

4 Bayesian Deep Learning

In this section, the author lists applications of BDL models in recommender systems, control, and other fields. Many models in practical use today can be unified under the BDL framework:
[Table: a summary of BDL models and their applications across recommender systems, topic models, control, and other areas]

4.1 A Brief History of Bayesian Neural Networks and Bayesian Deep Learning

Like BDL, BNN (Bayesian neural networks) is a rather old topic; however, BNN is merely a subset of the BDL framework: it is equivalent to a BDL model that has only the perception component.

4.2 General Framework

The following figure presents the basic framework proposed in the article, where the red part is the perception component and the blue part is the task-specific component. The red part usually employs some probabilistic formulation of a neural network model, while the blue part can be a Bayesian network, a DBN, or even a stochastic process, all represented as probabilistic graphs.
[Figure: the general BDL framework; red: perception component, blue: task-specific component]

In this basic framework, there are usually three types of variables: perception variables Ω_p (X and W in the figure), hinge variables Ω_h (H in the figure), and task variables Ω_t (A, B, and C in the figure). Generally, the connections between the perception component and the hinge variables are assumed independent. Thus, for any model that fits the BDL framework, we should be able to find this structure: two components and three types of variables.

BDL can model the uncertainty of the information exchanged between the red and blue components, which amounts to studying the uncertainty of the hinge variables Ω_h (reflected in the formulas as a conditional variance). The different variance assumptions are: Zero-Variance (ZV: no uncertainty, the variance is zero), Hyper-Variance (HV: the variance is set by hyperparameters), and Learnable Variance (LV: the variance is represented by learnable parameters). Clearly, in terms of flexibility, LV > HV > ZV.
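To make these assumptions concrete, suppose the perception component computes f(x; W) for a hinge variable h (this notation is ours, for illustration):

$$
\text{ZV: } h = f(x; W); \qquad
\text{HV: } h \sim \mathcal{N}\big(f(x; W),\, \sigma_0^2 I\big); \qquad
\text{LV: } h \sim \mathcal{N}\big(f(x; W),\, \operatorname{diag}(\sigma^2)\big),
$$

where $\sigma_0$ is a fixed hyperparameter and $\sigma$ is learned along with the other parameters.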

4.3 Perception Component

Generally, this part would adopt a model like a BNN, but we can also use more flexible models, such as RBM, VAE, or the recently popular GAN. The article mentions the following examples:
Restricted Boltzmann Machine: RBM is a special type of BNN, characterized by:
(1) No need for backpropagation during training;
(2) Hidden neurons are binary.
The specific structure of RBM is as follows:
[Figure: structure of an RBM]
RBM is trained through contrastive divergence (instead of backpropagation); after training, we obtain the conditional distributions p(h | v) and p(v | h).
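Here is a sketch of one CD-1 training step for a binary RBM in NumPy; the layer sizes, learning rate, and toy data are arbitrary assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.
    v0: batch of visible vectors; W: weights; b: visible bias; c: hidden bias."""
    # Up: sample binary hidden units from p(h | v0).
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Down: reconstruct visibles from p(v | h0), then re-infer hidden probabilities.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # Gradient approximation: data statistics minus reconstruction statistics.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)

v = (rng.random((16, 20)) < 0.5).astype(float)   # toy binary data
W = 0.01 * rng.standard_normal((20, 8))
b = np.zeros(20); c = np.zeros(8)
cd1_step(v, W, b, c)
```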
Probabilistic Generalized SDAE: We now improve the SDAE from section 2.2. If we assume that both the clean input X_c and the corrupted input X_0 are observable, we can define the following probabilistic SDAE:
$$
\begin{aligned}
W_l &\sim \mathcal{N}(0, \lambda_w^{-1} I), \qquad b_l \sim \mathcal{N}(0, \lambda_w^{-1} I), \\
X_{l,j*} &\sim \mathcal{N}\!\left(\sigma(X_{l-1,j*} W_l + b_l),\; \lambda_s^{-1} I\right), \\
X_{c,j*} &\sim \mathcal{N}\!\left(X_{L,j*},\; \lambda_n^{-1} I\right),
\end{aligned}
$$

where $L$ is the number of layers, $\sigma(\cdot)$ the activation function, and $\lambda_w$, $\lambda_s$, $\lambda_n$ are hyperparameters.

Variational Autoencoders: The basic idea of VAE is to learn parameters that maximize the evidence lower bound (ELBO). VAE also has many variants, such as IWAE (Importance Weighted Autoencoders) and variational RNNs.
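For reference, the ELBO that VAE maximizes has the standard form

$$
\log p_\theta(x) \;\ge\; \mathcal{L}(\theta, \phi; x)
= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
- \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right),
$$

where $q_\phi(z \mid x)$ is the encoder (inference network) and $p_\theta(x \mid z)$ the decoder.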

Natural-Parameter Networks: Unlike a vanilla NN with deterministic inputs, NPN takes a distribution as input (in contrast to VAE, where only the intermediate layer's output is a distribution). Besides Gaussian distributions, other exponential-family distributions, such as the Gamma or Poisson, can also serve as NPN inputs.
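As a sketch of the Gaussian case, here is a linear NPN layer that propagates an input mean/variance pair through Gaussian weights by moment matching; the dimensions and toy inputs are our own assumptions:

```python
import torch

def npn_linear(a_m, a_s, W_m, W_s, b_m, b_s):
    """Gaussian NPN linear layer: propagates an input mean/variance pair
    (a_m, a_s) through weights that are themselves Gaussian (mean W_m,
    variance W_s), using the moment-matching rule for independent products."""
    o_m = a_m @ W_m + b_m
    o_s = a_s @ W_s + a_s @ (W_m ** 2) + (a_m ** 2) @ W_s + b_s
    return o_m, o_s

d_in, d_out = 5, 3
a_m = torch.randn(4, d_in)                   # input means
a_s = torch.rand(4, d_in)                    # input variances (non-negative)
W_m = torch.randn(d_in, d_out)
W_s = torch.rand(d_in, d_out)                # kept positive via softplus in practice
b_m, b_s = torch.zeros(d_out), 0.1 * torch.ones(d_out)
o_m, o_s = npn_linear(a_m, a_s, W_m, W_s, b_m, b_s)
```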

4.4 Task-Specific Component

This module’s main purpose is to integrate probabilistic priors into the BDL model (naturally, we can use PGM to represent this). This module can be a Bayesian Network, bidirectional inference network, or even a stochastic process.
Bayesian Networks: Bayesian networks are the most common task-specific component. Besides LDA, another example is PMF (Probabilistic Matrix Factorization), which uses BN to model the conditional dependencies of users, items, and ratings. The following is the generative process assumed by PMF:
$$
\begin{aligned}
u_i &\sim \mathcal{N}(0, \lambda_u^{-1} I_K), \\
v_j &\sim \mathcal{N}(0, \lambda_v^{-1} I_K), \\
R_{ij} &\sim \mathcal{N}\!\left(u_i^\top v_j,\; C_{ij}^{-1}\right),
\end{aligned}
$$

where $C_{ij}$ is the confidence placed on the rating $R_{ij}$.
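As a toy illustration of MAP training for PMF (the dimensions, learning rate, and regularization strengths below are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, K = 100, 80, 10
U = 0.1 * rng.standard_normal((n_users, K))   # user latent vectors u_i
V = 0.1 * rng.standard_normal((n_items, K))   # item latent vectors v_j
lam_u = lam_v = 0.1                           # from the priors N(0, lambda^{-1} I)
lr = 0.01

def sgd_step(i, j, r_ij):
    """One MAP gradient step on a single observed rating R_ij."""
    u, v = U[i].copy(), V[j].copy()
    err = r_ij - u @ v
    U[i] += lr * (err * v - lam_u * u)   # gradient of log-posterior w.r.t. u_i
    V[j] += lr * (err * u - lam_v * v)   # gradient of log-posterior w.r.t. v_j

sgd_step(3, 7, 4.0)        # e.g. user 3 rated item 7 with 4 stars
prediction = U[3] @ V[7]   # predicted rating after the update
```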
Bidirectional Inference Networks: Deep Bayesian networks do not only capture “shallow”, linear structure; they also model nonlinear correlations among random variables and nonlinear model structures. BIN is one such example.
[Figure: a bidirectional inference network (BIN)]
Stochastic Processes: A stochastic process can also serve as the task component, for example a Wiener process to model Brownian motion, or a Poisson process for speech recognition tasks. A stochastic process here can be viewed as a type of dynamic Bayesian network (DBN).

5 Concrete BDL Models and Applications

After discussing the basic model structure that constitutes BDL, we naturally hope to apply this unified framework to some practical problems. Therefore, this chapter mainly discusses various application scenarios of BDL, including recommendation systems, control problems, etc. Here, we assume that the task module uses vanilla Bayesian networks as the model for this part.

5.1 Supervised Bayesian Deep Learning for Recommender Systems

Collaborative Deep Learning. In this section, the article proposes Collaborative Deep Learning (CDL) to address the recommendation system problem. This method connects content information (generally handled using deep learning methods) and rating matrices (typically handled using collaborative filtering).
Using the probabilistic SDAE introduced in section 4.3, the generative process of the CDL model is as follows:
$$
\begin{aligned}
&\text{1. For each layer } l \text{ of the SDAE: } W_l \sim \mathcal{N}(0, \lambda_w^{-1} I),\;\; b_l \sim \mathcal{N}(0, \lambda_w^{-1} I),\\
&\qquad X_{l,j*} \sim \mathcal{N}\!\left(\sigma(X_{l-1,j*} W_l + b_l),\; \lambda_s^{-1} I\right). \\
&\text{2. For each item } j:\;\; X_{c,j*} \sim \mathcal{N}\!\left(X_{L,j*},\; \lambda_n^{-1} I\right), \qquad
v_j \sim \mathcal{N}\!\left(X_{\frac{L}{2},j*}^\top,\; \lambda_v^{-1} I_K\right). \\
&\text{3. For each user } i:\;\; u_i \sim \mathcal{N}(0, \lambda_u^{-1} I_K). \\
&\text{4. } R_{ij} \sim \mathcal{N}\!\left(u_i^\top v_j,\; C_{ij}^{-1}\right).
\end{aligned}
$$
For efficiency, we can let λ_s tend to positive infinity, at which point the graphical model of CDL can be represented as follows:
[Figure: graphical model of CDL, with the degenerate version for λ_s → ∞ on the right]
The red dashed box contains the SDAE (the figure shows the case where L = 2), while the right side is the degenerate CDL, which keeps only the encoder part of the SDAE. In terms of our earlier definitions, V is the hinge variable, U and R are the task variables, and the remaining variables (X_0, X_c, the intermediate layers, and the weights) are perception variables.
So how should we train this model? Intuitively, since all parameters are treated as random variables, we could use fully Bayesian methods such as VI or MCMC; however, these often incur a huge computational cost. Instead, we use an EM-style algorithm to obtain MAP estimates. First, we define the objective: we want to maximize the posterior probability, which is equivalent to maximizing the joint log-likelihood of U, V, the weights, and the observations given the hyperparameters:
$$
\mathcal{L} = -\frac{\lambda_u}{2}\sum_i \|u_i\|_2^2
-\frac{\lambda_w}{2}\sum_l \left(\|W_l\|_F^2 + \|b_l\|_2^2\right)
-\frac{\lambda_v}{2}\sum_j \left\|v_j - X_{\frac{L}{2},j*}^\top\right\|_2^2
-\frac{\lambda_n}{2}\sum_j \left\|X_{L,j*} - X_{c,j*}\right\|_2^2
-\sum_{i,j}\frac{C_{ij}}{2}\left(R_{ij} - u_i^\top v_j\right)^2
$$
Note that as λ_s approaches infinity, the probabilistic graphical model for training CDL degenerates into the neural-network model depicted below: the two networks share the same corrupted input, while their outputs (and training targets) differ.
[Figure: the degenerate CDL as two neural networks with a shared corrupted input and different training targets]
With the optimization objective defined, how should we update the parameters? Similar to the clever idea of the EM algorithm, we iteratively find a local optimum:
$$
u_i \leftarrow \left(V C_i V^\top + \lambda_u I_K\right)^{-1} V C_i R_i, \qquad
v_j \leftarrow \left(U C_j U^\top + \lambda_v I_K\right)^{-1} \left(U C_j R_j + \lambda_v f_e(X_{0,j*}, W^+)^\top\right),
$$

where $C_i$ and $C_j$ are diagonal confidence matrices and $f_e(\cdot, W^+)$ is the SDAE encoder; given $U$ and $V$, the weights $W_l$ and biases $b_l$ are then updated by backpropagation.
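As a sketch of the v_j update above in code (the names U, C_j, R_j, and enc_j mirror the symbols in the formula; the data is toy):

```python
import numpy as np

def update_item(U, C_j, R_j, enc_j, lam_v):
    """Closed-form ridge-style update for one item vector v_j: the rating
    evidence pulls v_j toward the users who rated it, while the encoder
    output enc_j acts as its prior mean."""
    K = U.shape[1]
    A = U.T @ (C_j[:, None] * U) + lam_v * np.eye(K)
    b = U.T @ (C_j * R_j) + lam_v * enc_j
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
U = rng.standard_normal((100, 10))            # user matrix (rows are u_i)
R_j = (rng.random(100) < 0.05).astype(float)  # implicit-feedback column for item j
C_j = np.where(R_j > 0, 1.0, 0.01)            # higher confidence on observed entries
enc_j = rng.standard_normal(10)               # encoder output f_e(X_{0,j*}, W+)
v_j = update_item(U, C_j, R_j, enc_j, lam_v=10.0)
```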
Once we have estimated the parameters, predicting new ratings becomes easy; we just need to compute the expectation based on the following formula:
$$
\hat{R}_{ij} \approx \mathbb{E}[u_i]^\top \left(\mathbb{E}\!\left[f_e(X_{0,j*}, W^+)^\top\right] + \mathbb{E}[\epsilon_j]\right)
= \mathbb{E}[u_i]^\top\, \mathbb{E}[v_j].
$$
Bayesian Collaborative Deep Learning: We can also extend CDL in another direction. Here, instead of MAP estimates, we use a sampling-based algorithm. The main procedure is as follows:
[Equations: sampling updates for U, V, and the network weights in Bayesian CDL]
When λ_s approaches positive infinity, sampling the weights W⁺ with adaptive rejection Metropolis sampling becomes equivalent to a Bayesian, generalized version of backpropagation (BP).
Marginalized Collaborative Deep Learning: In SDAE training, different epochs use different corrupted versions of the input, so training must sweep over many epochs. Marginalized SDAE improves on this by marginalizing out the corrupted input, which directly yields a closed-form solution.
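To see why marginalizing over corruptions helps, here is a sketch in the style of marginalized denoising autoencoders for the linear, zero-masking case; it illustrates the closed-form idea under these assumptions and is not necessarily the exact formulation used in marginalized CDL:

```python
import numpy as np

def marginalized_da(X, p):
    """Closed-form linear denoising map W minimizing the expected loss
    E||X - W X_corrupt||^2, marginalized over random zero-masking with
    probability p. X: d x n data matrix. Returns W (d x d)."""
    d = X.shape[0]
    S = X @ X.T                          # scatter matrix
    q = np.full(d, 1.0 - p)              # probability each feature survives
    P = S * q[None, :]                   # E[X X_corrupt^T]
    Q = S * np.outer(q, q)               # E[X_corrupt X_corrupt^T], off-diagonal
    np.fill_diagonal(Q, q * np.diag(S))  # diagonal entries survive with prob q
    return P @ np.linalg.inv(Q + 1e-6 * np.eye(d))  # ridge term for stability

X = np.random.default_rng(0).standard_normal((20, 500))
W = marginalized_da(X, p=0.3)   # one pass replaces many corrupted epochs
```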
Collaborative Deep Ranking: Besides targeting precise ratings, we can directly model the ranking of items, as in the CDR algorithm:
[Equations: generative process of CDR]
At this point, the log-likelihood we need to optimize will become:
[Equation: the pairwise ranking log-likelihood of CDR]
Collaborative Variational Autoencoders: Additionally, we can replace the Probabilistic SDAE of the perception module with VAE, resulting in the following generative process:
[Equations: generative process of CVAE]
In summary, recommender-system problems often involve both high-dimensional data (text, images) and inference over conditional relationships (user-item relationships, etc.). Models like CDL built on the BDL framework can therefore play a significant role. Other supervised learning tasks can likewise follow the way CDL is applied to recommender systems.

5.2 Unsupervised Bayesian Deep Learning for Topic Models

This subsection turns to unsupervised problems, where we no longer aim to “fit” a target but rather to “describe” the objects under study.

Relational Stacked Denoising Autoencoders as Topic Models (RSDAE): In RSDAE, we aim to learn a set of topics (or latent factors) under the constraints of relational graphs. RSDAE can “natively” integrate the hierarchical structure of latent factors and available relational information. Its graphical model form and generative process are as follows:

[Figure: graphical model of RSDAE]
[Equations: generative process of RSDAE]

Similarly, we maximize the posterior probability, which is to maximize the joint log-likelihood of various parameters:

[Equation: joint log-likelihood of RSDAE]

During training, we again use an EM-style algorithm to find MAP estimates and reach a local optimum (of course, we can also use tricks to try to escape local optima), as follows:

[Equations: EM-style update rules for RSDAE]

Deep Poisson Factor Analysis with Sigmoid Belief Networks: The Poisson distribution is suited to modeling non-negative counts. Exploiting this, we can use Poisson factor analysis (PFA) for non-negative matrix factorization. Taking topic modeling of text as an example, different priors yield a variety of different models.

[Equations: the PFA formulation]
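In its simplest Poisson-factorization form (notation ours), the basic PFA building block models each count as

$$
x_{pn} \sim \operatorname{Pois}\!\Big(\sum_{k=1}^{K} \phi_{pk}\, \theta_{kn}\Big),
$$

where the $\phi_k$ are topics (distributions over words) and $\theta_n$ holds the topic weights of document $n$; placing different priors on $\Phi$ and $\Theta$ yields the different variants mentioned above.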

For example, we can construct the DeepPFA model using deep priors based on sigmoid belief networks (SBN). The generative process of DeepPFA is as follows:

[Equations: generative process of DeepPFA with an SBN prior]

This model can be trained with Bayesian conditional density filtering (BCDF), an online version of MCMC, or with stochastic gradient thermostats (SGT), a hybrid Monte Carlo sampling method.

Deep Poisson Factor Analysis with Restricted Boltzmann Machine: We can also replace the SBN in DeepPFA with the RBM model to achieve similar effects.

In BDL-based topic models, then, the perception component infers the topic hierarchy of the text, while the task component models the word-topic generation process, word-topic relationships, and the internal relationships of the documents.

5.3 Bayesian Deep Representation Learning for Control

The previous two sections mainly discussed the applications of BDL in supervised and unsupervised learning; this section focuses on another area: representation learning, using control problems as an example.

Stochastic Optimal Control: In this section, we consider the stochastic optimal control problem for an unknown dynamic system, solved under the BDL framework as follows:

[Equations: the stochastic optimal control formulation]

BDL-Based Representation Learning for Control: To optimize the above problem, three key parts are needed, as follows:

[Figure/Equations: the three key components of BDL-based representation learning for control]
Learning Using Stochastic Gradient Variational Bayes: The loss function of this model is in the following form:
[Equation: the SGVB loss for this model]
In control problems, we generally want to extract semantic information from raw inputs while maintaining local linearity in the system's state space. The BDL framework is well suited to this, with the two components playing different roles: the perception component processes the raw video frames, while the task component infers the latent state of the dynamical system.
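Concretely, “local linearity” here usually means latent transition dynamics of the form used in Embed-to-Control-style models (notation ours):

$$
z_{t+1} = A_t z_t + B_t u_t + o_t + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \Sigma_t),
$$

where $z_t$ is the latent state inferred from the raw frame, $u_t$ is the control input, and $A_t$, $B_t$, $o_t$ are predicted locally at each time step.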

5.4 Bayesian Deep Learning for Other Applications

In addition to the applications above, BDL has many other use cases: link prediction, natural language processing, computer vision, speech, time-series forecasting, and so on. For example, in link prediction, a GCN can serve as the perception component while a stochastic block model serves as the task component.

6 Conclusions and Future Research

Many real-world tasks involve two aspects: perceiving high-dimensional data (images, signals, etc.) and probabilistic inference of random variables. Bayesian Deep Learning (BDL) is precisely the solution to such problems: it combines the strengths of neural networks (NN) and probabilistic graphical models (PGM). The wide-ranging applications make BDL a very valuable research subject, and there are still many areas to explore in these models.
Editor: Wang Jing
Proofreader: Cheng Anle
