Backpropagation
1 Forward Propagation
This section is a brief summary of forward propagation.
We set $a^{[0]} = x$ as the input to the network, and let $\ell = 1,2,\dots,N$, where $N$ is the number of layers. Then, for each layer, we have
$$z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}, \qquad a^{[\ell]} = g^{[\ell]}\left(z^{[\ell]}\right),$$
where $g^{[\ell]}$ is the same for all layers except the last. For the last layer, the choice depends on the task:
1 regression: $g(x) = x$
2 binary classification: $g(x) = \mathrm{sigmoid}(x)$
3 multi-class classification: $g(x) = \mathrm{softmax}(x)$
Finally, we obtain the output of the network $a^{[N]}$ and compute its loss.
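The forward pass above can be sketched in NumPy. The hidden activation is assumed to be ReLU here, since the notes leave $g^{[\ell]}$ for hidden layers unspecified; the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def forward(x, Ws, bs, out_activation=sigmoid):
    """Forward propagation through N layers.

    Ws, bs are lists of weight matrices W^{[l]} and bias vectors b^{[l]}.
    Hidden layers use ReLU (an assumed choice); the last layer uses
    `out_activation`, chosen per the task (identity, sigmoid, or softmax).
    """
    a = x  # a^{[0]} = x
    for l, (W, b) in enumerate(zip(Ws, bs), start=1):
        z = W @ a + b  # z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}
        g = out_activation if l == len(Ws) else relu
        a = g(z)       # a^{[l]} = g^{[l]}(z^{[l]})
    return a
```

With a single identity-weight layer and zero bias, the output is just the sigmoid of the input.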
For regression, we have:
$$J = \frac{1}{2}\left(a^{[N]} - y\right)^2$$
For binary classification, we have:
$$J = -\left(y \log a^{[N]} + (1-y)\log\left(1 - a^{[N]}\right)\right)$$
For multi-class classification, we have:
$$J = \mathrm{CE}(y, \hat{y}), \quad \text{where } \hat{y} = a^{[N]}.$$
Note that for multi-class, since $\hat{y}$ is a $k$-dimensional vector, the cross-entropy loss expands to
$$\mathrm{CE}(y, \hat{y}) = -\sum_{j=1}^{k} y_j \log \hat{y}_j,$$
where $y$ is the one-hot label vector.
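The three losses can be sketched as follows. The $\frac{1}{2}$ factor in the squared error and the clipping constant `eps` are assumptions (the clipping only guards against $\log 0$ numerically).

```python
import numpy as np

def mse_loss(a, y):
    # Regression: squared error with a 1/2 factor (assumed convention,
    # which makes the gradient simply a - y).
    return 0.5 * np.sum((a - y) ** 2)

def binary_ce(a, y, eps=1e-12):
    # Binary classification: cross-entropy with a sigmoid output a in (0, 1).
    a = np.clip(a, eps, 1.0 - eps)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def cross_entropy(y, y_hat, eps=1e-12):
    # Multi-class: y is a one-hot k-vector, y_hat a softmax output.
    return -np.sum(y * np.log(np.clip(y_hat, eps, None)))
```

For a maximally uncertain binary prediction of 0.5, both cross-entropy losses reduce to $-\log 0.5 \approx 0.693$.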
2 Backpropagation
We define the error at layer $\ell$ as:
$$\delta^{[\ell]} := \frac{\partial J}{\partial z^{[\ell]}}$$
So computing the gradient for any layer takes three steps:
1 For the output layer $N$, apply the chain rule:
$$\delta^{[N]} = \frac{\partial J}{\partial a^{[N]}} \cdot \frac{\partial a^{[N]}}{\partial z^{[N]}}$$
Since the softmax function is not applied element-wise, its derivative is computed as a whole; combined with the cross-entropy loss, this gives $\delta^{[N]} = \hat{y} - y$ directly. The sigmoid, by contrast, is applied element-wise, so we compute:
$$\delta^{[N]} = \frac{\partial J}{\partial a^{[N]}} \odot g'\left(z^{[N]}\right)$$
Note that $\odot$ denotes the element-wise (Hadamard) product.
2 For $\ell = N-1, N-2, \dots, 1$, we have:
$$\delta^{[\ell]} = \left(W^{[\ell+1]\top} \delta^{[\ell+1]}\right) \odot g'\left(z^{[\ell]}\right)$$
3 For each layer $\ell$, the gradients are:
$$\frac{\partial J}{\partial W^{[\ell]}} = \delta^{[\ell]} a^{[\ell-1]\top}, \qquad \frac{\partial J}{\partial b^{[\ell]}} = \delta^{[\ell]}$$
These expressions can be translated directly into code as update formulas.
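The three steps can be sketched end-to-end. This sketch assumes sigmoid hidden layers and a softmax output with cross-entropy loss (the notes allow other choices), which makes the output error simply $\hat{y} - y$; all function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, Ws, bs):
    """Gradients of the cross-entropy loss w.r.t. every W^{[l]} and b^{[l]}.

    Assumes sigmoid hidden layers and a softmax output (a sketch, not the
    only valid configuration). Returns lists dWs, dbs matching Ws, bs.
    """
    # Forward pass, caching a^{[l]} for every layer.
    N = len(Ws)
    a_s = [x]
    a = x
    for l, (W, b) in enumerate(zip(Ws, bs), start=1):
        z = W @ a + b
        if l == N:
            e = np.exp(z - z.max())   # softmax output layer
            a = e / e.sum()
        else:
            a = sigmoid(z)
        a_s.append(a)

    # Step 1: output-layer error (softmax + cross-entropy => y_hat - y).
    delta = a_s[-1] - y

    dWs, dbs = [None] * N, [None] * N
    for l in range(N, 0, -1):
        # Step 3: per-layer gradients dJ/dW = delta a^{T}, dJ/db = delta.
        dWs[l - 1] = np.outer(delta, a_s[l - 1])
        dbs[l - 1] = delta
        if l > 1:
            # Step 2: propagate error; sigmoid' = a(1 - a), element-wise.
            g_prime = a_s[l - 1] * (1 - a_s[l - 1])
            delta = (Ws[l - 1].T @ delta) * g_prime
    return dWs, dbs
```

A quick sanity check is to compare one entry of the analytic gradient against a central finite difference of the loss.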