Source | Deep Learning Enthusiasts
Editor | Jishi Platform
Summary of tuning techniques, all about CNN optimization.
Summary of CNN Optimization
The tips below are drawn from the paper "Systematic evaluation of CNN advances on the ImageNet":
- Use the ELU non-linearity without batch normalization, or ReLU with batch normalization.
- Pre-train a small network of 1×1 convolutions on the RGB input (a learned colorspace transformation); it can yield better results.
- Use a linear learning rate decay schedule.
- Use the sum of average pooling and max pooling rather than either alone (see the sketch below).
- Use a mini-batch size of about 128 (learning rate around 0.005) to 256 (learning rate around 0.01). If this is too large for your GPU, reduce the batch size and the learning rate proportionally.
- Replace the fully connected (linear) layers of MLP-style heads with convolutional layers and use average pooling to produce the prediction.
- When considering a larger training set, find the point at which adding more data stops producing a meaningful performance gain.
- The quality of the data matters more than its size.
- If you cannot increase the size of the input images, reduce the stride in subsequent layers; this achieves a similar effect.
- If your network has a complex and highly optimized architecture, such as GoogLeNet, modify it with caution.

For further details, refer to the paper, where the authors carefully compare the effect of each of these choices on CNN performance; it is well worth a read.
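To make the pooling tip concrete, here is a minimal Keras sketch that sums average pooling and max pooling and uses a 1×1 convolution plus global average pooling as the prediction head. The input shape, layer widths, and class count are arbitrary placeholders, not values from the paper:

```python
from tensorflow.keras import layers, Input, Model

inputs = Input(shape=(32, 32, 3))                                    # placeholder input size
x = layers.Conv2D(64, 3, padding="same", activation="elu")(inputs)   # ELU, no batch norm

# Sum of average pooling and max pooling over the same feature map
avg = layers.AveragePooling2D(pool_size=2)(x)
mx = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Add()([avg, mx])

# 1x1 convolution plus global average pooling instead of fully connected layers
x = layers.Conv2D(10, 1)(x)                 # 10 = placeholder number of classes
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Softmax()(x)

model = Model(inputs, outputs)
model.summary()
```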
Below is reprinted from: https://nmarkou.blogspot.com.cy/2017/02/the-black-magic-of-deep-learning-tips.html

Tips for Getting the Most Out of DNNs
- Remember to shuffle. Don't let your network see exactly the same mini-batches over and over; if the framework allows it, shuffle the data once every epoch.
- Augment the dataset. DNNs need a lot of data, and models overfit easily on small datasets. I strongly recommend augmenting your original dataset. For vision tasks you can add noise, adjust brightness, downsample, rotate, shift colors, blur, and so on. The downside is that if you over-augment, much of the training data may end up looking alike; I solved this by creating a layer that applies random transformations, so no two samples are ever identical. For speech data, you can apply time shifting and distortion.
- Before training on the entire dataset, first overfit a very small subset, so you know your network is capable of converging at all. This tip comes from Karpathy.
- Always use dropout to minimize the chance of overfitting. Use it after wide layers (fully connected or convolutional layers with more than 256 units). There is a great paper on this: Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning [Gal Yarin & Zoubin Ghahramani, 2015].
- Avoid LRN pooling; MAX pooling is faster.
- Avoid Sigmoid and TanH gates; they are expensive, saturate easily, and can stop gradients from propagating. In fact, the deeper your network, the more you should avoid Sigmoid and TanH. Use the cheaper and more effective ReLU and PReLU gates instead; as discussed in Deep Sparse Rectifier Neural Networks by Yoshua Bengio et al., they promote sparsity and their gradients flow more robustly during backpropagation.
- Don't apply ReLU before max pooling; apply it afterwards to save computation.
- Don't rely on plain ReLU; it is somewhat dated. It is a very useful non-linearity that solved many problems, but if you try to fine-tune a new model with it, poor initialization combined with ReLU blocking backpropagation may leave you with no progress at all. Instead, use PReLU with a very small multiplier, usually 0.1. PReLU converges faster and does not get stuck in the initial phase the way ReLU can. ELU is also good, but more expensive.
- Use batch normalization regularly (a sketch combining batch normalization with PReLU follows this list). See Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [Sergey Ioffe & Christian Szegedy, 2015]. It is very effective: batch normalization lets you converge faster (much faster) and get by with smaller datasets, saving time and resources.
- Most people prefer removing the mean; I don't. I like to squash the input data into [-1, +1]. Treat this as a training-and-deployment tip rather than a performance tip.
- Favor smaller models. If you deploy deep learning models the way I do, you will quickly feel the pain of pushing gigabyte-sized models to users or to servers on the other side of the world. Go small, even if it costs some accuracy.
- If you are using relatively small models, try ensembles. An ensemble of 5 networks typically raises accuracy by about 3%.
- Use Xavier initialization wherever possible. Apply it to the large fully connected layers and avoid it on the CNN layers. For an explanation, read An Explanation of Xavier Initialization (by Andy Jones).
- If your input data has a spatial component, try an end-to-end CNN. See SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size [Forrest N. Iandola et al., 2016], which introduces a new approach, performs very well, and is a good place to apply the tips above.
- Modify your model to use 1×1 convolutional layers wherever possible; placed well, they can significantly improve performance.
- If you don't have a high-end GPU, don't attempt to train anything.
- If you are turning your model or your layers into templates, remember to parameterize everything; otherwise you will be rebuilding binaries all the time.
- Finally, understand what you are doing. Deep learning is the neutron bomb of machine learning, but it is not effective for every task at every time. Understand the architecture you are using and what you are trying to achieve, so you don't blindly copy models.
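To make a few of these tips concrete (batch normalization, PReLU with a small multiplier, and dropout after wide layers), here is a small illustrative Keras block; the feature size, layer widths, and dropout rate are arbitrary choices, not prescriptions:

```python
from tensorflow.keras import layers, models, initializers

small = initializers.Constant(0.1)   # "very small multiplier" for PReLU, per the tip above

model = models.Sequential([
    layers.Input(shape=(100,)),                  # placeholder feature size
    layers.Dense(512),
    layers.BatchNormalization(),                 # normalize before the non-linearity
    layers.PReLU(alpha_initializer=small),
    layers.Dropout(0.5),                         # dropout after a wide (>256) layer
    layers.Dense(256),
    layers.BatchNormalization(),
    layers.PReLU(alpha_initializer=small),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```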
Ideas for Improving Algorithm Performance
The ideas mentioned in this list are not exhaustive, but they are a good starting point.
My goal is to provide many approaches to try, hoping that one or two might be new to you. Often, you only need one good idea to achieve performance enhancement.
If you achieve results from any of these ideas, please let me know in the comments. I would love to hear the good news.
If you have more ideas or extensions to the listed approaches, please share them; both I and other readers will benefit! Sometimes, just one idea might lead to breakthroughs for others.
I have divided this blog post into four parts:
- Improving Performance Through Data
- Improving Performance Through Algorithms
- Improving Performance Through Algorithm Tuning
- Improving Performance Through Nested Models
Generally speaking, as you go down the list, the performance improvements tend to decrease. For example, creating new architectures for the problem or obtaining more data usually yields better results than tweaking the parameters of the optimal algorithm. While this is not always the case, it is generally true.
Where relevant, I have added links to tutorials on the blog, to related questions on other sites, and to the classic Neural Net FAQ.
Some ideas apply only to artificial neural networks, but most are universal. They are general enough for you to combine with other techniques to discover methods for enhancing model performance.
OK, let’s get started.
1. Improving Performance Through Data
By appropriately changing your training data and your problem definition, you can achieve significant performance improvements, perhaps the largest available.
Here are the ideas I will mention:
- Acquire more data
- Create more data
- Rescale your data
- Transform your data
- Feature selection
- Re-architect your problem
1) Acquire more data
Can you acquire more training data?
The quality of your model is often limited by the quality of your training data. To obtain the best model, you should first find ways to get the best data. You also want to gather as much of that best data as possible.
With more data, deep learning and other modern non-linear machine learning techniques have a fuller learning source and can learn better, especially deep learning. This is also a major reason why machine learning is so attractive to everyone (there’s data everywhere in the world).
More data is not always useful, but it certainly helps. For me, if possible, I would choose to acquire more data.
Refer to the following reading: Datasets Over Algorithms (www.edge.org/response-detail/26587)
2) Create more data
The previous section mentioned that having more data often improves deep learning algorithms. Sometimes you may not be able to reasonably acquire more data, so you can try creating more data.
- If your data is numerical vectors, you can randomly construct modified versions of existing vectors.
- If your data is images, you can randomly create modified versions of existing images (translation, cropping, rotation, etc.).
- If your data is text, similar operations apply…
This is commonly referred to as data augmentation or data generation.
You can utilize a generative model. You can also use some simple techniques. For image data, you can achieve performance improvements by randomly translating or rotating existing images. If the new data contains such transformations, it enhances the model’s generalization ability.
This is also related to increasing noise, which we are accustomed to calling adding perturbations. It serves a similar purpose to regularization methods, namely suppressing overfitting on training data.
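As a minimal sketch of this kind of augmentation with Keras (in the spirit of the first related reading below); the data here is dummy data and the transformation ranges are arbitrary examples:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random shifts, rotations, and flips applied on the fly to existing images.
datagen = ImageDataGenerator(
    rotation_range=15,        # random rotation up to 15 degrees
    width_shift_range=0.1,    # random horizontal shift (fraction of width)
    height_shift_range=0.1,   # random vertical shift (fraction of height)
    horizontal_flip=True,
)

# x_train: (num_samples, height, width, channels), y_train: labels (dummy here).
x_train = np.random.rand(32, 64, 64, 3)
y_train = np.random.randint(0, 2, size=(32,))

# Each batch drawn from this iterator is a freshly transformed version of the data.
augmented = datagen.flow(x_train, y_train, batch_size=8)
x_batch, y_batch = next(augmented)
print(x_batch.shape)   # (8, 64, 64, 3)
```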
Here are related readings:
- Image Augmentation for Deep Learning With Keras (http://machinelearningmastery.com/image-augmentation-deep-learning-keras/)
- What is jitter? (Training with noise) (ftp://ftp.sas.com/pub/neural/FAQ3.html#A_jitter)
3) Rescale your data
This is a quick way to achieve performance improvement. When applying neural networks, a traditional rule of thumb is to rescale your data to the boundaries of the activation function.
If you are using the sigmoid activation function, rescale your data to the range of 0 to 1. If you are using the hyperbolic tangent (tanh) activation function, rescale your data to the range of -1 to 1.
This method can be applied to both input data (x) and output data (y). For example, if you use the sigmoid function in the output layer to predict binary classification results, you should normalize the y values to make them binary. If you are using the softmax function, you can still benefit from normalizing y values.
This is still a good rule of thumb, but I want to delve deeper. I suggest you consider the following methods to create different versions of your training data:
- Normalize to the range of 0 to 1.
- Rescale to the range of -1 to 1.
- Standardize (i.e. rescale the data to have zero mean and unit variance).
Then evaluate your model’s performance for each method and choose the best one to use. If you change your activation function, repeat this process.
In neural networks, it is generally bad for large values to accumulate and compound through the layers. Besides the methods above, there are other ways to control the numerical scale of the data flowing through your network, such as normalizing activations and weights, which we will discuss later.
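Here is a minimal scikit-learn sketch that produces the three candidate versions of the training data described above; the input is a dummy array, purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.random.rand(100, 5) * 50.0   # dummy raw features on an arbitrary scale

# Three candidate versions of the training data, as suggested above.
X_01  = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)    # normalize to [0, 1]
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)   # rescale to [-1, 1]
X_std = StandardScaler().fit_transform(X)                       # zero mean, unit variance

# Train and evaluate the same model on each version, then keep the best performer.
```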
Here are related readings:
- Should I standardize the input variables (column vectors)? (ftp://ftp.sas.com/pub/neural/FAQ2.html#A_std)
- How To Prepare Your Data For Machine Learning in Python with Scikit-Learn (http://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/)
4) Data transformation
This data transformation is similar to the rescaling method mentioned above but requires more work. You must be very familiar with your data. Examine outliers by visualization.
Guess the univariate distribution of each column of data.
- Does the column data look like a skewed Gaussian distribution? Consider using a Box-Cox transformation to adjust the skewness.
- Does the column data appear to be exponentially distributed? Consider using a logarithmic transformation.
- Does the column data exhibit certain features, but they are obscured by something obvious? Try squaring or taking the square root to transform the data.
- Can you discretize a feature or combine features in some way to better highlight certain characteristics?
Rely on your intuition and try the following methods.
- Can you use projection methods like PCA to preprocess the data?
- Can you combine multi-dimensional features into a single value (feature)?
- Can you use a new boolean label to discover interesting aspects present in the problem?
- Can you explore other special structures in the current scenario using different methods?
Neural network layers excel at feature learning; they can do much of this themselves. But if you can expose the structure of the problem to the network more clearly, it will learn faster. Experiment with different transformations of your data, or with specific derived properties, and see which are useful and which are not.
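As a small sketch of the transformations mentioned above (log, Box-Cox, square root) using NumPy and SciPy; the skewed column here is synthetic:

```python
import numpy as np
from scipy.stats import boxcox

# Dummy skewed, strictly positive column (Box-Cox requires positive values).
x = np.random.exponential(scale=2.0, size=1000)

x_log = np.log(x)              # log transform for exponential-looking data
x_boxcox, lam = boxcox(x)      # Box-Cox transform; lam is the fitted lambda
x_sqrt = np.sqrt(x)            # square root as a milder alternative

print(f"fitted Box-Cox lambda: {lam:.3f}")
```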
Here are related readings:
- How to Define Your Machine Learning Problem (http://machinelearningmastery.com/how-to-define-your-machine-learning-problem/)
- Discover Feature Engineering, How to Engineer Features and How to Get Good at It (http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)
- How To Prepare Your Data For Machine Learning in Python with Scikit-Learn (http://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/)
5) Feature selection
Generally speaking, neural networks are robust to irrelevant features (i.e., irrelevant features do not significantly affect the training and performance of neural networks). They will weaken the contribution of features with no predictive power by assigning near-zero weights.
However, these irrelevant data features still consume significant resources during training cycles. So can you remove some features from the data?
There are many feature selection methods and feature importance methods that can provide insights into which features to keep and which to discard. The simplest way is to compare the effects of all features versus a subset of features. Similarly, if you have time, I recommend trying different perspectives on your problem within the same network, evaluating them to see how they perform.
- Perhaps you can achieve the same or even better performance with fewer features. Moreover, this will make the model faster!
- Perhaps all feature selection methods discard the same subset of features. Great, these methods have reached a consensus on which features are useless.
- Perhaps the filtered feature subset can provide new insights for feature engineering.
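As one illustrative option, here is a scikit-learn sketch using recursive feature elimination (RFE) to compare the full feature set against a reduced subset; the dataset, estimator, and number of retained features are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Dummy problem with a mix of informative and irrelevant features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Recursive feature elimination keeps the n features the estimator finds most useful.
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)

print("kept features:", [i for i, keep in enumerate(selector.support_) if keep])
X_reduced = selector.transform(X)   # compare a model trained on X vs. X_reduced
```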
Here are related readings:
- An Introduction to Feature Selection (http://machinelearningmastery.com/an-introduction-to-feature-selection/)
- Feature Selection For Machine Learning in Python (http://machinelearningmastery.com/feature-selection-machine-learning-python/)
6) Re-architect your problem
Sometimes, you should try to step outside your current problem definition and think about whether the observations you have collected are the only way to define your problem. Perhaps there are other methods. Maybe other ways of constructing the problem can better reveal the structure of the learning problem.
I really like this attempt because it forces you to open your mind. It is indeed difficult, especially when you have already invested a lot of time and money in the current method.
However, let’s think about it: even if you list 3-5 alternative construction schemes and ultimately abandon them, it at least shows you are more confident in the current scheme.
- Look at whether you can merge existing features/data within a time window.
- Perhaps your classification problem can become a regression problem (or sometimes the reverse).
- Perhaps your binary output can turn into a softmax output?
- Perhaps you can model sub-problems instead.
Carefully consider your problem, preferably before you settle on tools, as you haven’t invested much in the solution at this point. Additionally, if you are stuck on a particular problem, such a simple attempt can unlock new ideas.
Moreover, this doesn’t mean your previous work was in vain; you can refer to the subsequent nested model section for that.
Here are related readings:
- How to Define Your Machine Learning Problem (http://machinelearningmastery.com/how-to-define-your-machine-learning-problem/)
2. Improving Performance Through Algorithms
Machine learning is certainly about solving problems with algorithms.
All theories and mathematics depict how to apply different methods to learn a decision process from data (if we are only discussing predictive models).
You have chosen deep learning to solve your problem. But is it really the best choice? In this section, we touch on ideas for algorithm selection before returning to how to get the most out of the deep learning method you have chosen.
Here is a brief list:
- Sampling algorithms
- Referencing existing literature
- Resampling methods
Below, I explain a few of the methods mentioned above.
1) Sampling algorithms
In fact, you cannot know in advance which algorithm will perform best on your problem. If you knew, you probably wouldn't need machine learning. So what evidence do you have that the approach you have chosen is the right one?
Let’s solve this dilemma. When averaging the performance of all algorithms across all possible problems, no single algorithm consistently outperforms others. All algorithms are equal, as summarized in the no free lunch theorem.
Perhaps the algorithm you selected is not the optimal one for your problem.
The problem you are solving is not every possible problem; the newest, hottest methods in the field may not be the best algorithms for your dataset.
My suggestion is to gather evidence. Accept that better algorithms may exist, and give other algorithms a "fair chance" to compete on your problem.
Sample a range of feasible methods to see which perform well and which do not.
- First, try evaluating some linear methods, such as logistic regression and linear discriminant analysis.
- Evaluate some tree-based methods, such as CART, random forests, and gradient boosting.
- Evaluate some instance-based methods, such as support vector machines (SVM) and k-nearest neighbors (kNN).
- Evaluate some other neural network methods, such as LVQ, MLP, CNN, LSTM, hybrids, etc.
Select the best-performing algorithm, then enhance it through further tuning and data preparation. Pay special attention to comparing deep learning with other conventional machine learning methods, ranking the results to compare their strengths and weaknesses.
Many times, you will find that you can solve your problem without deep learning, using simpler, faster-to-train, and even more interpretable algorithms.
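Here is a rough scikit-learn sketch of this kind of spot check: the same cross-validation setup applied to a spread of algorithm families on a dummy dataset. The model list and settings are just examples:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # dummy data

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "lda":      LinearDiscriminantAnalysis(),
    "cart":     DecisionTreeClassifier(),
    "rf":       RandomForestClassifier(),
    "gbm":      GradientBoostingClassifier(),
    "svm":      SVC(),
    "knn":      KNeighborsClassifier(),
}

# Rank several algorithm families under the same cross-validation protocol.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:10s} {scores.mean():.3f} (+/- {scores.std():.3f})")
```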
Here are related readings:
- A Data-Driven Approach to Machine Learning (http://machinelearningmastery.com/a-data-driven-approach-to-machine-learning/)
- Why you should be Spot-Checking Algorithms on your Machine Learning Problems (http://machinelearningmastery.com/why-you-should-be-spot-checking-algorithms-on-your-machine-learning-problems/)
- Spot-Check Classification Machine Learning Algorithms in Python with scikit-learn (http://machinelearningmastery.com/spot-check-classification-machine-learning-algorithms-python-scikit-learn/)
2) Referencing existing literature
A shortcut to method selection is to reference existing literature. Someone may have already researched problems related to yours, and you can see what methods they used.
You can read papers, books, blogs, Q&A sites, tutorials, and anything you can find on Google.
Write down all the ideas, then study them in your own way.
This is not about copying others’ research but about inspiring you to come up with new ideas, some of which you may not have thought of but could lead to performance improvements.
Published research is usually excellent. There are many smart people in the world who have written a lot of interesting things. You should thoroughly explore this “library” to find what you want.
Here are related readings:
- How to Research a Machine Learning Algorithm (http://machinelearningmastery.com/how-to-research-a-machine-learning-algorithm/)
- Google Scholar (http://scholar.google.com/)
3) Resampling methods
You need to know how your model performs. Is your estimate of model performance reliable?
Deep learning models are very slow to train. This often means we cannot estimate model performance using some common methods, such as k-fold cross-validation.
- Perhaps you are using a simple train/test split, which is the conventional approach. If so, you need to make sure the split is representative of your problem. Univariate statistics and visualization are a good start.
- Perhaps you can leverage hardware to speed up the estimation process. For example, if you have a cluster or an AWS account, you can train n models in parallel, then take the mean and standard deviation of the results for a more robust estimate.
- Perhaps you can use a hold-out validation set to get an idea of the model's performance as it trains (this is very useful for early stopping, which will be discussed later).
- Perhaps you can hold back a completely unused validation set and use it only after you have completed model selection.
Going the other way, you may be able to make the dataset smaller and use stronger resampling methods.
- In some cases, you may find that the performance of models trained on a portion of the training set is highly correlated with the performance of models trained on the entire dataset. Maybe you can first complete model selection and parameter tuning on a small dataset, then extend the final method to the entire dataset.
- Perhaps you can limit the dataset in some way, taking only a portion of samples, then use it for the entire modeling process.
Here are related readings:
- Evaluate the Performance Of Deep Learning Models in Keras (http://machinelearningmastery.com/evaluate-performance-deep-learning-models-keras/)
- Evaluate the Performance of Machine Learning Algorithms in Python using Resampling (http://machinelearningmastery.com/evaluate-performance-machine-learning-algorithms-python-using-resampling/)
3. Improving Performance Through Algorithm Tuning
This is often the key to the work. You can often quickly discover one or two high-performing algorithms through sampling. However, obtaining the optimal algorithm may take days, weeks, or even months.
To achieve a better model, here are some ideas for tuning neural network algorithms:
- Diagnostics
- Weight Initialization
- Learning Rate
- Activation Functions
- Network Topology
- Batches and Epochs
- Regularization
- Optimization and Loss
- Early Stopping
You may need to train a network with a given "parameter configuration" many times (3-10 times or more) to get a reliable estimate of how that configuration performs. This applies to every aspect you can tune in this section.
For hyperparameter optimization, refer to the blog post:
- How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras (http://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/)
1) Diagnostics
If you can understand why your model’s performance is no longer improving, you can obtain a better-performing model.
Is your model overfitting or underfitting? Always keep this question in mind. Always.
Models will always encounter overfitting or underfitting, just to different extents. A quick way to understand the learning behavior of the model is to evaluate its performance on the training set and validation set at each epoch and plot the results.
- If the model on the training set always outperforms the model on the validation set, you may be encountering overfitting, and you can use methods such as regularization.
- If both the training set and validation set models perform poorly, you may be encountering underfitting, and you can increase the network's capacity and train more or longer.
- If there is a turning point where the model on the training set starts outperforming the validation set, you may need to use early stopping.
Plot these curves regularly and study them to understand the different ways you might improve the model's performance. They may be the most valuable diagnostic information about the state of your model that you can create.
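A minimal Keras sketch of this diagnostic, plotting training and validation loss per epoch; the data and model here are placeholders so the example runs on its own:

```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models

# Dummy binary-classification data, purely to make the sketch runnable.
X = np.random.rand(500, 20)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

model = models.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# validation_split holds out part of the data; history records per-epoch metrics.
history = model.fit(X, y, validation_split=0.2, epochs=50, verbose=0)

plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.legend()
plt.show()   # a widening gap suggests overfitting; both curves high suggests underfitting
```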
Another useful diagnostic is observing the correct and incorrect predictions made by the network model.
- For hard-to-train samples, you may need more data.
- Perhaps you should remove redundant samples in the training set that are easy to model.
- Maybe you can try partitioning the training set into different regions and use more specialized models in specific regions.
Here are related readings:
- Display Deep Learning Model Training History in Keras (http://machinelearningmastery.com/display-deep-learning-model-training-history-in-keras/)
- Overfitting and Underfitting With Machine Learning Algorithms (http://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/)
2) Weight Initialization
The rule of thumb is usually to initialize with small random numbers.
In practice, this may still work well, but is it the best for your network? There are some heuristic initialization methods for different activation functions, but they don’t differ much in practice.
Fix your network and try various initialization methods.
Remember, weights are the true parameters of your model, and you need to find them. There are many sets of weights that can perform well, but we want to find the best.
- Try all the different initialization methods to see if one performs better under otherwise unchanged conditions.
- Try using unsupervised methods, such as autoencoders, for pre-training.
- Try using an existing model and retraining only the input and output layers for your problem (transfer learning).
It’s worth noting that changing weight initialization methods and activation functions is closely related to changing optimization/loss functions.
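A small Keras sketch of this experiment: the architecture is held fixed while only the kernel initializer is swapped. The initializer list and layer sizes are arbitrary examples:

```python
from tensorflow.keras import layers, models

# Keep the architecture fixed and swap only the weight initializer.
for init in ["glorot_uniform", "he_normal", "lecun_normal", "random_normal"]:
    model = models.Sequential([
        layers.Input(shape=(20,)),
        layers.Dense(64, activation="relu", kernel_initializer=init),
        layers.Dense(64, activation="relu", kernel_initializer=init),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # Fit and evaluate each variant on the same data, e.g.:
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20)
```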
Here are related readings:
- Initialization of deep networks (http://deepdish.io/2015/02/24/network-initialization/)
3) Learning Rate
Tuning the learning rate is often effective.
Here are some ideas for exploration:
- Experiment with very large and very small learning rates.
- Grid search common learning rate values from the literature and see how deep a network you can train.
- Try decreasing the learning rate every epoch.
- Try reducing the learning rate by a fixed proportion after a fixed number of epochs.
- Try adding a momentum term and grid search learning rate and momentum together.
Larger networks require more training, and vice versa. If you add too many neurons and layers, appropriately increase your learning rate. Meanwhile, the learning rate should be considered in conjunction with training epochs, batch size, and optimization methods.
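As one possible sketch, here is a step-decay schedule in Keras combined with SGD plus momentum; the decay factor, interval, and starting values are placeholders to grid-search, not recommendations:

```python
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.optimizers import SGD

def step_decay(epoch, lr):
    """Halve the learning rate every 10 epochs (arbitrary schedule for illustration)."""
    if epoch > 0 and epoch % 10 == 0:
        return lr * 0.5
    return lr

# SGD with momentum plus a per-epoch decay schedule.
opt = SGD(learning_rate=0.01, momentum=0.9)
schedule = LearningRateScheduler(step_decay, verbose=1)

# model.compile(optimizer=opt, loss="binary_crossentropy")
# model.fit(X_train, y_train, epochs=50, callbacks=[schedule])
```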
Here are related readings:
- Using Learning Rate Schedules for Deep Learning Models in Python with Keras (http://machinelearningmastery.com/using-learning-rate-schedules-deep-learning-models-python-keras/)
- What learning rate should be used for backprop? (ftp://ftp.sas.com/pub/neural/FAQ2.html#A_learn_rate)
4) Activation Functions
You might want to use rectifier activation functions. They may provide better performance.
Before rectifiers, the standard choices were sigmoid and tanh, with a softmax, linear, or sigmoid activation on the output layer. I don't recommend trying anything beyond these unless you know what you are doing.
Try all three activation functions and rescale your data to meet the boundaries of the activation functions.
Clearly, you want to choose the correct transfer function for the output format, but consider exploring different representations. For example, switch the sigmoid function used in binary classification problems to a linear function used in regression problems, then post-process your output. This may require changing the loss function to make it more appropriate. Refer to the data transformation section for more details.
Here are related readings:
- Why use activation functions? (ftp://ftp.sas.com/pub/neural/FAQ2.html#A_act)
5) Network Topology
Changes in network structure can be beneficial.
How many layers and how many neurons do you need? Sorry, no one knows. Don’t ask this question…
So how do you find the configuration that suits your problem? Experiment.
- Try a hidden layer with many neurons (broad model).
- Try a deep network but with very few neurons per layer (deep model).
- Try a combination of the above two methods.
- Refer to the structure in papers that tackle problems similar to yours.
- Try topology patterns (fan-out then fan-in) and empirical rules from books and papers (links below).
Choosing is always difficult. Generally speaking, larger networks have stronger representational capabilities; perhaps you need that. More layers can provide a stronger ability to learn abstract features from data. You may need that.
Deep neural networks require more training, and adjustments to training epochs and learning rates should be made accordingly.
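A minimal Keras sketch of the "broad versus deep" comparison; the layer counts and widths are arbitrary and meant only to illustrate the two shapes:

```python
from tensorflow.keras import layers, models

def wide_model(n_inputs):
    """One hidden layer with many neurons (the broad option)."""
    return models.Sequential([
        layers.Input(shape=(n_inputs,)),
        layers.Dense(512, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

def deep_model(n_inputs):
    """Several hidden layers with few neurons each (the deep option)."""
    return models.Sequential([
        layers.Input(shape=(n_inputs,)),
        layers.Dense(32, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

# Train both (and a broad-and-deep combination) on the same data and compare.
```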
Here are related readings: These links will give you many insights on what to try, at least they have for me.
- How many hidden layers should I use? (ftp://ftp.sas.com/pub/neural/FAQ3.html#A_hl)
- How many hidden units should I use? (ftp://ftp.sas.com/pub/neural/FAQ3.html#A_hu)
6) Batches and Epochs
The batch size will determine the final gradient and the frequency of weight updates. An epoch refers to the process of the neural network seeing the entire training data once.
Have you experimented with different batch sizes and epoch numbers? Previously, we discussed the relationship between learning rate, network size, and epochs.
In very deep network structures, you often see small batch sizes paired with large training epochs.
The following may help your problem, or may not. You need to try and observe on your data.
- Try selecting a batch size equal to the size of the training data, but be mindful of memory (batch learning).
- Try selecting 1 as the batch size (online learning).
- Try grid searching different small batch sizes (8, 16, 32, …).
- Experiment with training for a few epochs versus many epochs.
Consider a near-infinite epoch value (continuous training) to record the best model obtained so far.
Some network architectures are more sensitive to batch size than others. I find multi-layer perceptrons (MLPs) usually robust to batch size, while LSTMs and CNNs are more sensitive, but this is anecdotal (for reference only).
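Here is a small Keras sketch that grid-searches a few mini-batch sizes with the same model-building function; the data, model, and epoch count are placeholders:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_model():
    """Placeholder model; substitute whatever architecture you are tuning."""
    model = models.Sequential([
        layers.Input(shape=(20,)),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

X = np.random.rand(1000, 20)               # dummy data for the sketch
y = (X.sum(axis=1) > 10).astype(int)

# Grid over small mini-batch sizes; track the best validation loss for each.
results = {}
for batch_size in [8, 16, 32, 64, 128]:
    history = build_model().fit(X, y, validation_split=0.2,
                                epochs=30, batch_size=batch_size, verbose=0)
    results[batch_size] = min(history.history["val_loss"])
print(results)
```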
Here are related readings:
- What are batch, incremental, online… learning? (ftp://ftp.sas.com/pub/neural/FAQ2.html#A_styles)
- Intuitively, how does mini-batch size affect the performance of (stochastic) gradient descent? (https://www.quora.com/Intuitively-how-does-mini-batch-size-affect-the-performance-of-stochastic-gradient-descent)
7) Regularization
Regularization is a good method to prevent the model from overfitting on the training set.
The newest and most popular regularization technique in neural networks is dropout; have you tried it? Dropout randomly skips neurons during training, forcing the other neurons in that layer to pick up the slack. Simple and effective. Start with dropout.
- Grid search different dropout rates.
- Experiment with dropout in the input, hidden, and output layers.
- Dropout also has some extensions; you can also try the drop connect method.
You can also try other more traditional neural network regularization methods, such as:
- Weight decay to penalize large weights.
- Activation constraints to penalize large activation values.
You can also experiment with penalizing different aspects or using different types of penalties/regularization (L1, L2, or both simultaneously).
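A short Keras sketch combining dropout with an L2 weight-decay penalty; the rates and coefficients are arbitrary starting points to tune, not recommendations:

```python
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Input(shape=(20,)),
    layers.Dropout(0.2),                                     # dropout on the input layer
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # weight decay (L2 penalty)
    layers.Dropout(0.5),                                     # dropout on a hidden layer
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Grid-search the dropout rates (e.g. 0.2-0.5) and the L2 coefficient for your data.
```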
Here are related readings:
- Dropout Regularization in Deep Learning Models With Keras (http://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/)
- What is Weight Decay? (ftp://ftp.sas.com/pub/neural/FAQ3.html#A_decay)
8) Optimization and Loss
The most common method is to apply stochastic gradient descent (SGD), but there are now many optimizers. Have you experimented with different optimization processes? Stochastic gradient descent is the default choice. First, make good use of it, combined with different learning rates and momentum.
Many more advanced optimization methods have more parameters, are more complex, and converge faster. Whether they are good or bad depends on your problem.
To make better use of a given optimization method, you really need to understand the meaning of each parameter and then perform grid search on different values for your problem. It’s difficult and time-consuming, but worth it.
I have found some newer, more popular methods that can converge faster and provide a quick understanding of the capacity of a given network, such as:
- ADAM
- RMSprop
You can also explore other optimization algorithms, from the more traditional (Levenberg-Marquardt) to the less conventional (genetic algorithms). Other methods can provide good starting points for stochastic gradient descent and its relatives to refine.
The loss function to be optimized is highly related to the problem you are trying to solve. However, you usually still have some leeway (you can fine-tune, e.g., mean squared error (MSE) and mean absolute error (MAE) in regression problems, etc.); sometimes changing the loss function can lead to small performance improvements, depending on the scale of your output data and the activation functions used.
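A minimal Keras sketch comparing SGD with momentum, Adam, and RMSprop under the same model and data; all settings are placeholder defaults to tune for your own problem:

```python
import numpy as np
from tensorflow.keras import layers, models, optimizers

X = np.random.rand(500, 20)               # dummy data for the sketch
y = (X[:, 0] > 0.5).astype(int)

candidates = {
    "sgd_momentum": optimizers.SGD(learning_rate=0.01, momentum=0.9),
    "adam":         optimizers.Adam(learning_rate=0.001),
    "rmsprop":      optimizers.RMSprop(learning_rate=0.001),
}

# Same architecture and data for each optimizer; compare best validation loss.
for name, opt in candidates.items():
    model = models.Sequential([
        layers.Input(shape=(20,)),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=opt, loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(X, y, validation_split=0.2, epochs=20, verbose=0)
    print(name, min(history.history["val_loss"]))
```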
Here are related readings:
- An overview of gradient descent optimization algorithms (http://sebastianruder.com/optimizing-gradient-descent/)
- What are conjugate gradients, Levenberg-Marquardt, etc.? (ftp://ftp.sas.com/pub/neural/FAQ2.html#A_numanal)
- On Optimization Methods for Deep Learning, 2011 PDF (http://ai.stanford.edu/~ang/papers/icml11-OptimizationForDeepLearning.pdf)
9) Early Stopping
Once the performance on the validation set begins to decline during training, you can stop training and learning. This can save a lot of time and even allow you to use more detailed resampling methods to evaluate your model’s performance.
Early stopping is a regularization method used to avoid overfitting the model on the training data; it requires you to monitor the model’s performance on both the training set and validation set at each round. Once the model’s performance on the validation set begins to decline, training can stop.
You can also set up check-points to save the model whenever a given condition is met (for example, an improvement in validation accuracy) and let it keep learning. Check-pointing gives you the benefit of early stopping without actually stopping training, so at the end of the run you have several models to choose from.
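A short Keras sketch of early stopping combined with check-pointing; the monitored metric, patience value, and file name are arbitrary choices:

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Stop when validation loss stops improving, while saving the best model seen so far.
callbacks = [
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
]

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1000, callbacks=callbacks)
```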
Here are related readings:
- How to Check-Point Deep Learning Models in Keras (http://machinelearningmastery.com/check-point-deep-learning-models-keras/)
- What is early stopping? (ftp://ftp.sas.com/pub/neural/FAQ3.html#A_stop)
4. Improving Performance Through Nested Models
You can combine the predictive capabilities of multiple models. As previously mentioned, algorithm tuning can enhance final performance; after tuning, this is the next major area for improvement.
In fact, you can often achieve excellent predictive capabilities by combining multiple “good enough” models rather than combining multiple highly tuned (fragile) models.
You can consider the following three aspects of nesting:
- Combining models
- Combining perspectives
- Stacking
1) Combining models
Sometimes we simply don’t choose a model but directly combine them.
If you have multiple different deep learning models, each performing reasonably well on your research problem, you can combine them by taking the average of their predictions.
The greater the differences between the models, the better the final effect. For example, you can apply very different network topologies or different techniques.
If each model performs well but in different ways, the predictive ability after nesting will be more robust.
Each time you train a network, you initialize different weights, and it will converge to different final weights. You can repeat this process multiple times to obtain many networks and then combine the predictions of these networks.
Their predictions will be highly correlated, but on the samples that are hard to predict, combining them can give you a small but surprising boost.
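As a minimal sketch, here is how averaging the class-probability predictions of several independently trained networks might look in NumPy; the prediction arrays here are random placeholders:

```python
import numpy as np

def average_ensemble(list_of_predictions):
    """Average the class probabilities of several models and pick the most likely class."""
    stacked = np.stack(list_of_predictions, axis=0)   # (num_models, num_samples, num_classes)
    mean_probs = stacked.mean(axis=0)
    return mean_probs.argmax(axis=1)

# Dummy probability outputs from three models on the same 10 test samples, 3 classes.
p1 = np.random.dirichlet(np.ones(3), size=10)
p2 = np.random.dirichlet(np.ones(3), size=10)
p3 = np.random.dirichlet(np.ones(3), size=10)
print(average_ensemble([p1, p2, p3]))
```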
Here are related readings:
- Ensemble Machine Learning Algorithms in Python with scikit-learn (http://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/)
- How to Improve Machine Learning Results (http://machinelearningmastery.com/how-to-improve-machine-learning-results/)
2) Combining perspectives
Similar to the above, but reconstruct your problem from different perspectives when training your models.
Again, the goal is to obtain models that perform well but are different (e.g., uncorrelated predictions). The more different transformation methods you use to train models, the more you can enhance your results.
Simply using the average of the predictions will be a good start.
3) Stacking
You can also learn how to best combine the predictions of multiple models. This is called stacked generalization, or simply stacking.
You can usually do better than a simple average of predictions by using a simple linear method, such as regularized regression, that learns how to weight the different models. Averaging the sub-models' predictions gives you a baseline, but a model that learns the weights on top of them will typically improve performance.
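As an illustrative sketch, scikit-learn's StackingClassifier learns a regularized linear meta-model over base-model predictions; the base models and settings below are arbitrary examples, not the only choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # dummy data

# Base models produce predictions; a simple linear model learns how to weight them,
# instead of using a plain average.
stack = StackingClassifier(
    estimators=[
        ("rf",  RandomForestClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
print(cross_val_score(stack, X, y, cv=5).mean())
```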
- Stacked Generalization (Stacking) (http://machine-learning.martinsewell.com/ensembles/stacking/)
Additional Resources
There are many excellent resources elsewhere, but very few that tie all the ideas together. If you want to delve deeper, I have listed the following resources and corresponding blogs where you can discover many interesting things.
- Neural Network FAQ (ftp://ftp.sas.com/pub/neural/FAQ.html)
- How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras (http://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/)
- Must Know Tips/Tricks in Deep Neural Networks (http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html)
- How to increase validation accuracy with deep neural net? (http://stackoverflow.com/questions/37020754/how-to-increase-validation-accuracy-with-deep-neural-net)