where $x_i$ represents the input feature and the pre-activation value in each hidden layer. As the figure below shows, we apply Batch Norm to the input features $X$ as well as to the pre-activation values in each hidden layer. Notably, Batch Normalization also uses exponentially weighted averages and bias correction. The parameters $\gamma$ and $\beta$ are not hyperparameters but learnable parameters of the network; they allow each layer to give the normalized values any fixed mean and variance rather than forcing them to stay at 0 and 1.
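To make the transform concrete, here is a minimal NumPy sketch of the Batch Norm forward step (the function name and `eps` constant are my own choices, not from the original):

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Normalize a mini-batch of pre-activations z, then scale and shift.

    z:     pre-activations, shape (batch_size, n_units)
    gamma: learnable scale, shape (n_units,)
    beta:  learnable shift, shape (n_units,)
    """
    mu = z.mean(axis=0)                     # per-unit mean over the mini-batch
    var = z.var(axis=0)                     # per-unit variance over the mini-batch
    z_norm = (z - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * z_norm + beta            # learnable mean beta, std gamma
```

With $\gamma = 1$ and $\beta = 0$ the output has mean 0 and variance 1; any other values shift the normalized activations to the mean and variance the network prefers.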
After normalizing the inputs in the neural network, we no longer need to worry about input features being on extremely different scales. Gradient descent therefore oscillates less as it approaches the minimum and converges faster.
Another important effect is that Batch Norm reduces the impact of earlier layers on later layers in a deep neural network. If we take a slice from the middle of the network, say layer 5, we find that the distribution of its input values shifts during training as the earlier layers' weights change. The layer must keep adapting to these shifting inputs, so the model takes longer to converge.
However, Batch Norm reduces this coupling by keeping the mean and variance of each layer's inputs fixed, which in some sense makes the layers independent of each other. Consequently, convergence becomes much faster.
Another side effect of Batch Norm is regularization. Because the mean and variance are estimated on each mini-batch, a small amount of noise is added to the values in each layer, which has a slight regularizing effect. This effect weakens as the batch size grows, since the noise diminishes.
Batch Norm is not difficult to implement: during forward propagation we simply insert it before the activation, as in the figure above. Backward propagation is usually handled automatically by deep-learning frameworks. The detailed process of backpropagation through Batch Norm can be found here.
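A minimal sketch of where Batch Norm sits in one hidden layer (linear step, then Batch Norm, then activation); the function names are illustrative, not from any framework. Note that the layer's bias term can be dropped, since subtracting the batch mean cancels any constant and $\beta$ takes over its role:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def dense_bn_relu(a_prev, W, gamma, beta, eps=1e-5):
    """One hidden layer: linear -> Batch Norm -> activation.

    a_prev: activations from the previous layer, shape (batch_size, n_in)
    W:      weight matrix, shape (n_in, n_out)
    """
    z = a_prev @ W                                    # linear step; bias omitted,
                                                      # beta replaces it after centring
    mu, var = z.mean(axis=0), z.var(axis=0)
    z_tilde = gamma * (z - mu) / np.sqrt(var + eps) + beta
    return relu(z_tilde)                              # Batch Norm sits before the activation
```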
Moreover, you might notice that we cannot compute $\mu$ and $\sigma$ from a single example at test time. In this case, we keep exponentially weighted averages of $\mu$ and $\sigma$ across the mini-batches during training, and then use these final values on the dev and test sets.
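The test-time scheme above can be sketched as follows; the class name, `momentum` value, and method names are my own illustrative choices:

```python
import numpy as np

class BatchNormStats:
    """Track exponentially weighted averages of mu and sigma^2 during
    training, then reuse them to normalize single examples at test time."""

    def __init__(self, n_units, momentum=0.9, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mu = np.zeros(n_units)
        self.running_var = np.ones(n_units)

    def train_step(self, z):
        mu, var = z.mean(axis=0), z.var(axis=0)
        # exponentially weighted averages across mini-batches
        self.running_mu = self.momentum * self.running_mu + (1 - self.momentum) * mu
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        return (z - mu) / np.sqrt(var + self.eps)          # batch statistics

    def test_step(self, z):
        # a single test example reuses the stored statistics
        return (z - self.running_mu) / np.sqrt(self.running_var + self.eps)
```

After enough training batches, `running_mu` and `running_var` settle near the true feature statistics, so a lone test example is normalized consistently with training.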