Recently I watched the new deep learning course offered by Andrew Ng. It is indeed a great course that complements my understanding of deep learning from multiple perspectives. Here I want to share some of what I learned from it.
To avoid overfitting, a regularization term is often added to the cost function. For example, with $J = (y - \hat y)^2 + \lambda w^2$, the regularization parameter $\lambda$ penalizes $w$ for getting too large. A side benefit is that the cost function stays well defined, so we can plot $J$ and check that it decreases with each iteration. Notably, the regularization used here is also called the $L_2\ norm$.
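To make this concrete, here is a minimal sketch of gradient descent on that one-dimensional cost. The function name, data values, and hyperparameters are all illustrative, not from the course:

```python
# Minimal sketch: gradient descent on J = (y - w*x)^2 + lam * w^2
# for a single scalar weight (hypothetical example values).
def train(x, y, lam=0.0, lr=0.01, steps=1000):
    w = 0.0
    for _ in range(steps):
        y_hat = w * x
        # dJ/dw = -2*x*(y - y_hat) + 2*lam*w
        grad = -2 * x * (y - y_hat) + 2 * lam * w
        w -= lr * grad
    return w

w_noreg = train(2.0, 4.0, lam=0.0)  # converges to the unregularized fit w = 2
w_reg = train(2.0, 4.0, lam=0.5)    # the penalty pulls w toward zero
```

With $\lambda > 0$ the learned weight is strictly smaller in magnitude than the unregularized one, which is exactly the shrinkage effect the penalty is designed to produce.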
In addition to this regularization, we can also use the $L_1\ norm$ (which is selected much less frequently), data augmentation, and early stopping to avoid overfitting. But in this post I want to focus on the use of dropout for regularization.
Unlike the methods mentioned above, dropout in my understanding is tricky but practical. As the figure below shows, dropout randomly mutes some neurons in the neural network, so we effectively train a sparser network, which greatly reduces the chance of overfitting. More importantly, dropout encourages the weights to spread over the input features instead of concentrating on a few of them.
The probability of keeping a neuron (the keep probability) is often set to 0.5, though you are free to try other values such as 0.3 or 1.0. When the keep probability is 1.0, you simply don't drop out any neurons. But experience tells us 0.5 is usually a good default.
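The course implements this as "inverted dropout": mask the activations, then divide by the keep probability so the expected activation is unchanged. A sketch for one layer's activations (shapes and names are my own, not the course's):

```python
import numpy as np

# Inverted dropout applied to one layer's activations `a` (sketch).
def dropout_forward(a, keep_prob=0.5, rng=None):
    rng = rng or np.random.default_rng(0)
    # Each neuron survives independently with probability keep_prob.
    mask = rng.random(a.shape) < keep_prob
    # Divide by keep_prob so E[output] equals the original activation.
    return (a * mask) / keep_prob

a = np.ones((4, 3))
a_drop = dropout_forward(a, keep_prob=0.5)
# Surviving entries are scaled up to 2.0; muted entries are 0.
```

The division by `keep_prob` is what makes the test-time behavior simple, as discussed next: no extra scaling is needed once dropout is switched off.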
After finishing training, it is important to turn off dropout during development (validation) and testing. Otherwise the model's predictions are not stable, since dropout adds randomness to them.
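In code this usually means a training switch: the mask is only applied while training, and at test time the activations pass through untouched. A hypothetical layer forward illustrating the idea (the `training` flag and function name are my own):

```python
import numpy as np

# Hypothetical forward pass with a training switch: dropout is active
# only during training; at test time activations are left unchanged.
def forward(a, keep_prob=0.5, training=True, rng=None):
    if not training:
        # Inverted dropout already rescaled during training,
        # so no correction is needed here.
        return a
    rng = rng or np.random.default_rng()
    mask = rng.random(a.shape) < keep_prob
    return (a * mask) / keep_prob

a = np.ones((2, 2))
out = forward(a, training=False)
# Output is identical to the input: predictions are deterministic.
```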
Basically, this is what I have concluded from Andrew's course. I believe the intuition behind this method is easy to understand, and many people are astonished by its amazing performance. On the topic of regularization, I will write another post about Batch Norm next time.