Overfitting
Once a neural network's weights and biases are adjusted to improve its performance on the training data, we hope that what it has learned generalizes well beyond that data. To test this, after training the network we show it labeled data it has never seen before and measure how accurately it classifies or predicts the outcome.
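A minimal sketch of this held-out evaluation, assuming scikit-learn and a hypothetical feature matrix X with label vector y (not part of the original text), might look like this:

    # Hold back part of the labeled data; the network never sees it during training.
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
    model.fit(X_train, y_train)                                 # adjust weights and biases on training data
    print("held-out accuracy:", model.score(X_test, y_test))    # measure generalization on unseen data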
How to Choose Hyperparameters?
There are no rules of thumb for choosing hyperparameters. However, most hyperparameters have a default value, which is a good starting point for creating the model. After getting started with the defaults, you search for a better combination of hyperparameters. Several techniques are available for this, such as grid search, random search, and Bayesian optimization. Sometimes there are more hyperparameters to be defined by a human than parameters the algorithm learns. So what is the magic of deep learning if we still have to find the optimal combination of hyperparameters for the data at hand? Removing hyperparameters has been a long-standing goal of the machine learning community, and in this regard deep learning has helped remove two important ones: model selection and feature engineering.
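As a hedged illustration of one such technique, grid search, the sketch below reuses the hypothetical X_train and y_train from before and tries a small set of alternatives around the defaults:

    # Try a small grid of hyperparameter combinations and keep the best one.
    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPClassifier

    param_grid = {
        "hidden_layer_sizes": [(32,), (64,), (64, 32)],   # model capacity
        "learning_rate_init": [0.001, 0.01],              # optimizer step size
        "activation": ["relu", "tanh"],                   # activation function
    }
    search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=3)
    search.fit(X_train, y_train)
    print("best combination:", search.best_params_)

Random search and Bayesian optimization follow the same pattern but sample the combinations differently instead of enumerating them all.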
Neural networks introduced two powerful breakthroughs to the machine learning community. If built properly, a deep network can represent any complex function; hence neural networks are called universal function approximators. In theory, then, deep learning can be applied to any arbitrary problem and we do not need any other machine learning model. Deep learning algorithms also eliminated a very taxing task: feature selection, or feature engineering. In situations where features are highly correlated, some algorithms such as logistic and linear regression do not perform well; this problem is called multicollinearity. With traditional machine learning algorithms, you either need to remove such features manually or apply a dimensionality reduction technique such as PCA before feeding the features to the model. With deep learning, however, dimensionality reduction happens automatically as long as a layer's width is less than the number of inputs, so the network tends to retain the more important bits of information and disregard the unimportant parts. Deep learning therefore eliminated two main hyperparameters: which model to choose, and feature engineering and selection.
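A minimal PyTorch sketch (the framework is an assumption, as are the layer sizes) shows how hidden layers narrower than the input act as a learned dimensionality reduction:

    import torch.nn as nn

    n_inputs = 100              # hypothetical input dimension
    model = nn.Sequential(
        nn.Linear(n_inputs, 32),  # width 32 < 100: compresses the input representation
        nn.ReLU(),
        nn.Linear(32, 8),         # further compression keeps the most useful information
        nn.ReLU(),
        nn.Linear(8, 1),          # output layer, e.g. a single regression target
    )

Each successive layer has fewer units than its input, so the network is forced to learn a compact representation rather than passing every raw feature through unchanged.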
There are some simple rules that practitioners follow to choose hyperparameters beyond the default values. For example, if the problem at hand is not complex and the data is limited, a shallow model with as few as one hidden layer is sufficient. Depending on the properties of the problem, you might also be able to choose an activation function that helps the optimizer converge faster during training. For classification problems, the sigmoid family of functions, including sigmoid and tanh, generally works well. The main drawback of sigmoid functions is the vanishing gradient problem, which can prevent the neural network from training any further. In general, the ReLU function is the default choice in deep learning models these days, since networks built from ReLU units can still model any arbitrary function. There are several variations of ReLU; for example, if you encounter dead neurons in the network, the leaky ReLU function is a better choice. As a general guideline, begin with ReLU and move to other activation functions if ReLU does not give optimal results, as in the sketch below.
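The following PyTorch sketch (again an assumed framework, with hypothetical layer sizes) shows that swapping activations is a small change: start with ReLU, and switch to leaky ReLU if units get stuck outputting zero.

    import torch.nn as nn

    def make_classifier(n_inputs, n_classes, leaky=False):
        # Leaky ReLU keeps a small gradient for negative inputs, so "dead" units can recover.
        act = nn.LeakyReLU(negative_slope=0.01) if leaky else nn.ReLU()
        return nn.Sequential(
            nn.Linear(n_inputs, 64),
            act,
            nn.Linear(64, n_classes),   # logits; pair with a cross-entropy loss
        )

    model = make_classifier(20, 3)                      # default: ReLU
    model_leaky = make_classifier(20, 3, leaky=True)    # alternative if ReLU units die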