In this case, the activation function does not depend on the scores of classes in \(C\) other than \(C_1 = C_i\). So the gradient with respect to each score \(s_i\) in \(s\) will only depend on the loss given by its binary problem.
- Caffe: Sigmoid Cross-Entropy Loss Layer
- PyTorch: BCEWithLogitsLoss
- TensorFlow: sigmoid_cross_entropy
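To make that independence concrete, here is a minimal pure-Python sketch. The helper names `sigmoid` and `bce_grad` are illustrative and are not taken from any of the frameworks listed above. It relies on the standard result that the gradient of sigmoid cross-entropy with respect to a raw score is \(f(s_i) - t_i\), and shows that changing the score of one class leaves the gradients of the other classes untouched.

```python
import math

def sigmoid(x):
    """Logistic function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def bce_grad(score, target):
    """Gradient of sigmoid cross-entropy w.r.t. the raw score: f(s) - t."""
    return sigmoid(score) - target

targets = [1, 0, 1]
scores_a = [2.0, -1.0, 0.5]
scores_b = [2.0, 7.0, 0.5]   # only the score of the second class differs

# Each gradient is computed from its own score and label only.
grads_a = [bce_grad(s, t) for s, t in zip(scores_a, targets)]
grads_b = [bce_grad(s, t) for s, t in zip(scores_b, targets)]

print(grads_a)
print(grads_b)
```

Only the second entry differs between the two printed lists, which is the behaviour a softmax-based loss would not have.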
Focal Loss was introduced by Lin et al., from Facebook, in this paper. They claim to improve one-stage object detectors using Focal Loss to train a detector they name RetinaNet. Focal Loss is a Cross-Entropy Loss that weighs the contribution of each sample to the loss based on the classification error. The idea is that, if a sample is already classified correctly by the CNN, its contribution to the loss decreases. With this strategy, they claim to solve the problem of class imbalance by making the loss implicitly focus on those problematic classes. Moreover, they also weight the contribution of each class to the loss in a more explicit class balancing. They use Sigmoid activations, so Focal Loss could also be considered a Binary Cross-Entropy Loss. We define it for each binary problem as:

\[FL = -\sum_{i=1}^{C=2} (1 - s_i)^{\gamma} \, t_i \log(s_i)\]
Where \((1 - s_i)^{\gamma}\), with the focusing parameter \(\gamma \geq 0\), is a modulating factor to reduce the influence of correctly classified samples in the loss. With \(\gamma = 0\), Focal Loss is equivalent to Binary Cross-Entropy Loss.
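The loss can also be written piecewise:

\[FL = \begin{cases} -(1 - s_1)^{\gamma} \log(s_1) & \text{if } t_1 = 1 \\ -s_1^{\gamma} \log(1 - s_1) & \text{if } t_1 = 0 \end{cases}\]

Where we have separated the formulation for when the class \(C_i = C_1\) is positive or negative (and therefore, the class \(C_2\) is positive). As before, we have \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\).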
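The effect of the modulating factor is easier to see with numbers. The following is a small illustrative sketch, not code from the paper: the function names are hypothetical, \(s_1\) is assumed to be a probability that has already gone through the sigmoid, and \(\gamma = 2\) is used as the default.

```python
import math

def focal_loss_binary(s1, t1, gamma=2.0):
    """Focal loss of one binary problem.

    s1: predicted probability of class C_1 (already passed through a sigmoid).
    t1: groundtruth label of C_1, either 1 or 0.
    """
    if t1 == 1:
        # C_1 is positive: the factor (1 - s1)^gamma shrinks as s1 approaches 1.
        return -((1.0 - s1) ** gamma) * math.log(s1)
    # C_1 is negative, so the background class C_2 is positive with s2 = 1 - s1.
    s2 = 1.0 - s1
    return -((1.0 - s2) ** gamma) * math.log(s2)

def binary_cross_entropy(s1, t1):
    """Plain binary cross-entropy, for comparison."""
    return -math.log(s1) if t1 == 1 else -math.log(1.0 - s1)

# Compare an easy positive (s1 = 0.95) with a hard one (s1 = 0.3).
for s1 in (0.95, 0.3):
    ce = binary_cross_entropy(s1, 1)
    fl = focal_loss_binary(s1, 1)
    print(f"s1={s1}: CE={ce:.4f}, FL={fl:.4f}, FL/CE={fl / ce:.4f}")
```

The ratio FL/CE is just \((1 - s_1)^{\gamma}\): with \(\gamma = 2\) the easy sample is scaled by about 0.0025 and the hard one by about 0.49, so the hard sample dominates the loss.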
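The gradient gets a bit more complex due to the inclusion of the modulating factor \((1 - s_i)^{\gamma}\) in the loss formulation, but it can be deduced using the Binary Cross-Entropy gradient expression. Taking \(s_i\) as the raw score and \(f(s_i)\) as its activated output, the gradient for a positive \(C_i\) \((t_i = 1)\) is:

\[\frac{\partial FL}{\partial s_i} = (1 - f(s_i))^{\gamma} \left(\gamma f(s_i) \log(f(s_i)) + f(s_i) - 1\right)\]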
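Where \(f()\) is the sigmoid function. To get the gradient expression for a negative \(C_i\) \((t_i = 0)\), we just need to replace \(f(s_i)\) with \((1 - f(s_i))\) in the expression above and flip its sign, because the derivative of \(1 - f(s_i)\) with respect to \(s_i\) is the negative of the derivative of \(f(s_i)\).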
Notice that, if the focusing parameter \(\gamma = 0\), the loss is equivalent to the CE Loss, and we end up with the same gradient expression.
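A derivation like this is easy to get wrong by a sign, so it is worth checking numerically. The sketch below is an assumption-laden illustration rather than reference code: `focal_loss`, `focal_grad` and `numeric_grad` are hypothetical helper names, scores are raw (pre-sigmoid) values, and the negative branch applies the substitution of \(f(s_i)\) by \(1 - f(s_i)\) together with a sign flip. A central finite difference serves as the independent check.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss(score, target, gamma):
    """Focal loss of one binary problem, computed from a raw score."""
    p = sigmoid(score)
    if target == 1:
        return -((1.0 - p) ** gamma) * math.log(p)
    return -(p ** gamma) * math.log(1.0 - p)

def focal_grad(score, target, gamma):
    """Analytic gradient of focal_loss with respect to the raw score."""
    p = sigmoid(score)
    if target == 1:
        return ((1.0 - p) ** gamma) * (gamma * p * math.log(p) + p - 1.0)
    # Negative case: swap p for (1 - p) and flip the sign,
    # since d(1 - p)/ds = -dp/ds.
    return -(p ** gamma) * (gamma * (1.0 - p) * math.log(1.0 - p) - p)

def numeric_grad(score, target, gamma, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    up = focal_loss(score + eps, target, gamma)
    down = focal_loss(score - eps, target, gamma)
    return (up - down) / (2.0 * eps)

for target in (1, 0):
    analytic = focal_grad(0.8, target, 2.0)
    numeric = numeric_grad(0.8, target, 2.0)
    print(f"t={target}: analytic={analytic:.6f}, numeric={numeric:.6f}")
```

With `gamma = 0.0` the same `focal_grad` reduces to `sigmoid(score) - target`, the usual sigmoid cross-entropy gradient.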
Forward pass: Loss computation
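A minimal pure-Python sketch of such a forward pass is given here. It is not the Caffe layer itself: the function name `focal_forward`, the list-of-rows batch layout and the list format for `class_balances` are assumptions made for illustration, while `logprobs`, `focusing_parameter` and `class_balances` simply echo the names used in this text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_forward(scores, labels, focusing_parameter=2.0, class_balances=None):
    """Return (mean loss, logprobs) for a batch of raw scores and 0/1 labels.

    scores, labels: lists of rows, one row per batch element, one column per class.
    class_balances: optional per-class weights (assumed here to be a plain list).
    """
    num_classes = len(scores[0])
    if class_balances is None:
        class_balances = [1.0] * num_classes
    logprobs = [0.0] * len(scores)
    for r, (row_scores, row_labels) in enumerate(zip(scores, labels)):
        for c, (s, t) in enumerate(zip(row_scores, row_labels)):
            p = sigmoid(s)
            if t == 1:
                term = ((1.0 - p) ** focusing_parameter) * math.log(p)
            else:
                term = (p ** focusing_parameter) * math.log(1.0 - p)
            # Accumulate the weighted, modulated log-probability of this class.
            logprobs[r] += class_balances[c] * term
    # Negate and average over the batch to obtain the loss value.
    loss = -sum(logprobs) / len(scores)
    return loss, logprobs

scores = [[0.5, -1.0], [2.0, 0.3]]
labels = [[1, 0], [0, 1]]
loss, logprobs = focal_forward(scores, labels)
print(loss, logprobs)
```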
Where logprobs[r] stores, for each element of the batch, the sum of the binary cross entropy of each class. The focusing_parameter is \(\gamma\), which by default is 2 and should be defined as a layer parameter in the net prototxt. The class_balances can be used to introduce different loss contributions per class, as they do in the Facebook paper.
Backward pass: Gradients computation
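The backward pass can be sketched in the same spirit. Again this is an illustrative pure-Python version under stated assumptions, not the Caffe layer: `focal_backward` and the name `delta` are hypothetical, scores are raw pre-sigmoid values, `class_balances` is assumed to be a plain list, and the gradient is divided by the batch size on the assumption that the loss is averaged over the batch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_backward(scores, labels, focusing_parameter=2.0, class_balances=None):
    """Gradient of the batch-mean focal loss with respect to every raw score."""
    batch_size = len(scores)
    num_classes = len(scores[0])
    if class_balances is None:
        class_balances = [1.0] * num_classes
    g = focusing_parameter
    delta = [[0.0] * num_classes for _ in range(batch_size)]
    for r in range(batch_size):
        for c in range(num_classes):
            p = sigmoid(scores[r][c])
            if labels[r][c] == 1:
                # Gradient for a class with a positive label.
                grad = ((1.0 - p) ** g) * (g * p * math.log(p) + p - 1.0)
            else:
                # Gradient for a class with a negative label (note the sign flip).
                grad = -(p ** g) * (g * (1.0 - p) * math.log(1.0 - p) - p)
            delta[r][c] = class_balances[c] * grad / batch_size
    return delta

scores = [[0.5, -1.0], [2.0, 0.3]]
labels = [[1, 0], [0, 1]]
delta = focal_backward(scores, labels)
print(delta)
```

Positive labels receive negative gradients (the score should go up) and negative labels receive positive ones, each scaled by its modulating factor.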
In the specific (and usual) case of Multi-Class classification the labels are one-hot, so only the positive class \(C_p\) keeps its term in the loss. There is only one element of the Target vector \(t\) which is not zero, \(t_i = t_p\). So discarding the elements of the summation which are zero due to the target labels, we can write:

\[CE = -\sum_{i}^{C} t_i \log(s_i) = -\log(s_p)\]
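As a quick check of that simplification, the sketch below compares the full cross-entropy sum with the single surviving term. The probabilities are made-up illustrative values, and the sketch assumes the entries of `probs` are already normalized outputs of the activation function.

```python
import math

def cross_entropy(probs, targets):
    """Full cross-entropy sum: -sum_i t_i * log(s_i)."""
    return -sum(t * math.log(p) for p, t in zip(probs, targets))

probs = [0.1, 0.7, 0.2]      # illustrative predicted probabilities
targets = [0, 1, 0]          # one-hot label, positive class at index 1
positive = targets.index(1)

full_sum = cross_entropy(probs, targets)
shortcut = -math.log(probs[positive])   # only the positive class term survives

print(full_sum, shortcut)
```

Both values agree, since every term with \(t_i = 0\) contributes nothing to the sum.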
This would be the pipeline for each one of the \(C\) classes. We set \(C\) independent binary classification problems \((C' = 2)\). Then we sum up the loss over the different binary problems: we sum up the gradients of every binary problem to backpropagate, and the losses to monitor the global loss. \(s_1\) and \(t_1\) are the score and the groundtruth label for the class \(C_1\), which is also the class \(C_i\) in \(C\). \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\) are the score and the groundtruth label of the class \(C_2\), which is not a "class" in our original problem with \(C\) classes, but a class we create to set up the binary problem with \(C_1 = C_i\). We can understand it as a background class.
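The decomposition into \(C\) binary problems can be written out directly. The following pure-Python sketch uses hypothetical helper names and assumes raw scores that are squashed by a sigmoid. Each class gets its own two-class problem with an explicit background class, and the global loss is the sum over those problems.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_problem_loss(score, target):
    """Cross-entropy of one binary problem: real class C_1 vs background C_2."""
    s1 = sigmoid(score)
    t1 = target
    s2, t2 = 1.0 - s1, 1 - t1      # the background class mirrors C_1
    return -(t1 * math.log(s1) + t2 * math.log(s2))

def multilabel_loss(scores, targets):
    """Global loss: sum of the C independent binary losses."""
    return sum(binary_problem_loss(s, t) for s, t in zip(scores, targets))

scores = [1.2, -0.3, 2.5]          # one raw score per class
targets = [1, 0, 1]                # several classes can be positive at once

per_class = [binary_problem_loss(s, t) for s, t in zip(scores, targets)]
total = multilabel_loss(scores, targets)
print(per_class, total)
```

Keeping the per-class losses separate, as in `per_class`, is also a convenient way to see which binary problem dominates the global value.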