Author | Tomer Amit    Translator | Wan Yue    Editor | Tu Min    Produced by | CSDN (ID: CSDNnews)
The following is the translation:
In this article, I will share 25 questions about deep learning, hoping to help you prepare for interviews.
1. Why must non-linearity be introduced in neural networks?
Answer: Otherwise, we would get a composition of linear functions, which is itself a linear function, so the whole network would reduce to a linear model. A linear model has very few effective parameters, so the complexity it can model is very limited.
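A tiny NumPy check (illustrative, not from the original article) makes the collapse concrete: two stacked linear layers with no activation in between are equivalent to a single linear layer.

```python
# Two stacked linear layers without a non-linearity collapse into one.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))    # "layer 1"
W2 = rng.normal(size=(2, 4))    # "layer 2"
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)      # no non-linearity between the layers
one_layer = (W2 @ W1) @ x       # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True
```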
2. Describe two methods to solve the vanishing gradient problem in neural networks.
Answer:
- Use the ReLU activation function instead of the sigmoid activation function.
- Use Xavier initialization.
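A minimal PyTorch sketch combining both fixes (the layer sizes are illustrative): ReLU activations plus Xavier (Glorot) initialization of the linear layers.

```python
# ReLU activations plus Xavier initialization; sizes are illustrative.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

for layer in model:
    if isinstance(layer, nn.Linear):
        nn.init.xavier_uniform_(layer.weight)  # keeps activation variance stable
        nn.init.zeros_(layer.bias)
```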
3. In image classification tasks, what are the advantages of using Convolutional Neural Networks (CNN) over Dense Neural Networks (DNN)?
Answer: Although both models can capture relationships between nearby pixels, a CNN has the following properties:
- Translation invariance: for the filters, the exact position of a pixel is irrelevant.
- Less prone to overfitting: a CNN generally has far fewer parameters than a DNN (a rough parameter count follows this list).
- Helps us better understand the model: we can examine the filters' weights and visualize what the network has learned.
- Hierarchical nature: the network learns complex patterns by composing simpler ones.
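As a rough illustration of the parameter-count point (PyTorch assumed; the sizes are arbitrary), compare a small conv layer with a dense layer producing the same number of outputs from a 32x32x3 input:

```python
# A 3x3 conv layer vs. a dense layer with the same output size.
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)  # -> 16x30x30
dense = nn.Linear(32 * 32 * 3, 16 * 30 * 30)                     # same output size

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv))    # 448
print(count(dense))   # 44,251,200
```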
4. Describe two methods to visualize CNN features in image classification tasks.
Answer:
- Input occlusion: occlude part of the input image and see which region has the most significant impact on the classification (see the sketch after this list). For example, with a trained dog classifier, if the full image is classified as a dog with 98% probability but occluding the dog's eyes drops that to 65%, then the eyes have a significant impact on the classification.
- Activation maximization: create an artificial input image that maximizes the target response (gradient ascent).
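A minimal occlusion sketch (PyTorch assumed; `model`, the patch size, and the stride are illustrative placeholders):

```python
# Slide a gray patch over the image and record how much the
# target-class probability drops at each position.
import torch

def occlusion_map(model, image, target_class, patch=16, stride=8):
    model.eval()
    _, H, W = image.shape                     # image is (channels, H, W)
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heatmap = torch.zeros(rows, cols)
    with torch.no_grad():
        base = model(image.unsqueeze(0)).softmax(dim=1)[0, target_class]
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, y:y + patch, x:x + patch] = 0.5   # gray patch
                prob = model(occluded.unsqueeze(0)).softmax(dim=1)[0, target_class]
                heatmap[i, j] = base - prob   # large drop = important region
    return heatmap
```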
5. Is trying learning rates of 0.1, 0.2, …, 0.5 a good approach when optimizing the learning rate?
Answer: No. This samples the learning rate on a linear scale; it is better to search on a logarithmic scale (e.g., 0.001, 0.01, 0.1, …).
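For instance, with NumPy one can sample candidates evenly in log space:

```python
# Learning-rate candidates evenly spaced in log space.
import numpy as np

lrs = np.logspace(-5, -1, num=6)
print(lrs)  # [1.0e-05 6.3e-05 4.0e-04 2.5e-03 1.6e-02 1.0e-01]
```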
6. Assume a neural network has a structure of 3 layers and ReLU activation function. What will happen if we initialize all weights with the same value? What if we only have 1 layer (i.e., linear/logistic regression)?
Answer: If all weights are initialized to the same value, symmetry cannot be broken. In other words, all gradients will update the weights to the same value, and the neural network will not learn. However, if the network has only 1 layer, the cost function is convex (linear/sigmoid), so the weights will always converge to the optimal point regardless of the initial value (convergence may just be slower).
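A quick PyTorch demonstration of the symmetry problem (the sizes and the toy loss are illustrative): with identical initial weights, every hidden unit receives an identical gradient, so they can never differentiate.

```python
# With constant initialization, all weight rows get identical gradients.
import torch
import torch.nn as nn

layer = nn.Linear(4, 3)
nn.init.constant_(layer.weight, 0.5)  # every weight the same
nn.init.zeros_(layer.bias)

x = torch.randn(8, 4)
loss = torch.relu(layer(x)).sum()     # a toy loss through ReLU
loss.backward()
print(layer.weight.grad)              # all three rows are identical
```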
7. Explain the concept of the Adam optimizer.
Answer: Adam combines two ideas to improve convergence: a per-parameter (adaptive) learning rate accelerates convergence, and momentum helps avoid getting stuck at saddle points.
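For reference, a sketch of the Adam update rule with its usual default hyperparameters (NumPy; variable names are illustrative):

```python
# One Adam step: momentum plus an adaptive per-parameter step size.
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad         # momentum: running average of gradients
    v = b2 * v + (1 - b2) * grad ** 2    # running average of squared gradients
    m_hat = m / (1 - b1 ** t)            # bias correction for early steps (t >= 1)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return w, m, v
```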
8. Compare batch, mini-batch, and stochastic gradient descent.
Answer: Batch gradient descent estimates the gradient on the entire dataset; mini-batch gradient descent samples a small number of data points per step; and stochastic gradient descent updates using a single data point at a time. We must balance the accuracy of the gradient estimate against the batch size that fits in memory. Additionally, the random noise added at each step by mini-batches (rather than the entire batch) has a regularizing effect.
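A sketch contrasting the three variants (`compute_grad` here is a stand-in, using the mean-squared-error gradient for a linear model, so the code is self-contained):

```python
# Batch vs. mini-batch vs. stochastic gradient descent for one epoch.
import numpy as np

def compute_grad(w, X, y):
    # Stand-in gradient: MSE loss for a linear model.
    return 2 * X.T @ (X @ w - y) / len(X)

def batch_gd(w, X, y, lr):
    return w - lr * compute_grad(w, X, y)            # entire dataset at once

def minibatch_gd(w, X, y, lr, batch_size=32):
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        w = w - lr * compute_grad(w, X[b], y[b])     # a sampled subset
    return w

def sgd(w, X, y, lr):
    for i in np.random.permutation(len(X)):
        w = w - lr * compute_grad(w, X[i:i + 1], y[i:i + 1])  # one point at a time
    return w
```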
9. What is data augmentation? Give an example.
Answer: Data augmentation is a technique for increasing the amount of input data by manipulating the original data. For images, we can perform operations such as rotating images, flipping images, adding Gaussian blur, etc.
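For example, a typical pipeline combining these operations (torchvision assumed):

```python
# Each transform manipulates the original image to yield extra examples.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.GaussianBlur(kernel_size=3),
    transforms.ToTensor(),
])
# augmented = augment(pil_image)  # applied to each PIL image during training
```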
10. Explain the concept of GAN.
Answer: A GAN (Generative Adversarial Network) typically consists of two neural networks, G and D, where G is the Generator and D is the Discriminator. The goal of the model is to create data, for example images indistinguishable from real ones. Suppose we want to generate images of cats: G generates the images, while D judges whether an image is a real cat. G's goal is to "fool" D so that D always classifies G's output as a cat.
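A minimal training-loop sketch of this adversarial game (PyTorch assumed; the network sizes, `latent_dim`, and `real_batch` are illustrative placeholders, not the article's code):

```python
# Alternate updates: D learns to separate real from fake; G learns to fool D.
import torch
import torch.nn as nn

latent_dim = 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                  nn.Linear(128, 784), nn.Tanh())           # generator
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2),
                  nn.Linear(128, 1), nn.Sigmoid())          # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    n = real_batch.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator: label real images 1 and G's fakes 0.
    fake = G(torch.randn(n, latent_dim)).detach()
    loss_d = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: try to make D output 1 ("real") on fakes.
    fake = G(torch.randn(n, latent_dim))
    loss_g = bce(D(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```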
11. What are the advantages of using Batchnorm?
Answer: Batchnorm accelerates the training process and also has a regularizing effect (a side effect of the noise it introduces).
12. What is multi-task learning? When should it be used?
Answer: Multi-task learning is useful when we handle multiple tasks with only a small amount of data for each, and we can also leverage models trained on large datasets from other tasks. The model's parameters can be shared in a "hard" way (the tasks use the same parameters) or a "soft" way (via a regularization/penalty term in the cost function).
13. What is end-to-end learning? List some advantages.
Answer: End-to-end learning usually refers to a model that takes raw data and directly outputs the desired result, with no intermediate tasks or feature engineering. Its advantages include: no need for manual feature construction, and it generally reduces bias.
14. What will happen if, in the last layer, we use a ReLU activation and then a Sigmoid function?
Answer: Since ReLU always outputs non-negative values, the sigmoid of its output is always at least 0.5, so this neural network will predict the same category for every input!
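A quick check (PyTorch assumed) makes this concrete:

```python
# Sigmoid of a ReLU output never falls below 0.5, so no input
# can ever be assigned to the negative class.
import torch

x = torch.linspace(-5, 5, steps=11)
out = torch.sigmoid(torch.relu(x))
print(out.min())  # tensor(0.5000)
```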
15. How to solve the problem of gradient explosion?
Answer: One of the simplest remedies for exploding gradients is gradient clipping: when the absolute value of a gradient exceeds some threshold M, clip it to ±M.
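In PyTorch, for example, two standard clipping utilities can be applied between `backward()` and `step()` (the function wrapper and its arguments are illustrative):

```python
# Clip gradients between the backward pass and the optimizer step.
import torch
from torch import nn

def clipped_step(model: nn.Module, loss: torch.Tensor, optimizer, M: float = 1.0):
    optimizer.zero_grad()
    loss.backward()
    # Clip each gradient entry to the range [-M, M]:
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=M)
    # (Alternative: rescale the whole gradient if its norm exceeds M.)
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=M)
    optimizer.step()
```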
16. Is it necessary to shuffle training data when using batch gradient descent?
Answer: No, it is not necessary. Since the gradient at each epoch is computed over the entire training set, shuffling the order has no effect.
17. Why is it important to shuffle data when using mini-batch gradient descent?
Answer: If we do not shuffle the data, then (suppose we train a neural network classifier with two categories, A and B) the mini-batches will be identical across epochs, which slows convergence and can even bias the network toward the order of the data.
18. List the hyperparameters of transfer learning.
Answer:How many layers to retain, how many layers to add, how many layers to freeze.
19. Is dropout needed on the test set?
Answer: No! Dropout is a regularization technique applied only during training; it must be disabled at test time.
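In PyTorch, for example, switching between `train()` and `eval()` handles this automatically:

```python
# Dropout is active only in training mode; model.eval() disables it.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5))
x = torch.ones(1, 10)

model.train()
print(model(x))  # about half the outputs zeroed (the rest rescaled by 2)

model.eval()
print(model(x))  # dropout is a no-op: deterministic output
```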
20. Explain why dropout in neural networks can serve as regularization.
Answer: There are several explanations for why dropout works. It can be viewed as a form of model averaging: at each step we "drop out" a random part of the model and average over the resulting sub-models. It also adds noise, which naturally has a regularizing effect. Finally, it spreads the weights out, preventing neurons in the network from co-adapting.
21. Give an example suitable for a many-to-one RNN architecture.
Answer: For example: sentiment analysis, gender recognition from speech, etc.
22. When can we not use BiLSTM? Explain the assumptions we must make when using BiLSTM.
Answer: In any bidirectional model, we assume access to the next elements of the sequence at a given "time". Text data (e.g., sentiment analysis, translation) satisfies this assumption, while live time-series data does not.
23. True or false: Adding L2 regularization to RNN helps to solve the vanishing gradient problem.
Answer: False! Adding L2 regularization shrinks the weights toward zero, which in some cases can actually make the vanishing gradient problem worse.
24. Suppose the training error/cost is high, and the validation cost/error is nearly equal to it. What does this mean? What should we do?
Answer: This indicates underfitting. We can add more parameters, increase the model's complexity, or reduce regularization.
25. Explain why L2 regularization can be interpreted as weight decay.
Answer: Suppose our cost function is C(w) and we add a term c|w|^2. With learning rate η, the gradient-descent update becomes:

w ← w − η·grad(C)(w) − 2ηc·w = (1 − 2ηc)·w − η·grad(C)(w)

In this update, the weights are first multiplied by a factor smaller than 1, i.e., they "decay" a little at every step before the gradient is applied.
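A quick numeric check of this equivalence (NumPy; the values are arbitrary):

```python
# The L2-regularized step equals a plain gradient step with the
# weights first multiplied by the decay factor (1 - 2*lr*c).
import numpy as np

lr, c = 0.1, 0.01
w = np.array([1.0, -2.0, 3.0])
grad = np.array([0.5, 0.5, 0.5])           # stand-in for grad(C)(w)

step_l2 = w - lr * (grad + 2 * c * w)      # gradient of C(w) + c*|w|^2
step_decay = (1 - 2 * lr * c) * w - lr * grad

print(np.allclose(step_l2, step_decay))    # True
```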
Original article: https://towardsdatascience.com/50-deep-learning-interview-questions-part-1-2-8bbc8a00ec61
This article is a CSDN translation; please indicate the source when reprinting.