Response of Loss Functions to Label Noise (draft)

August 13, 2019 · Xingyu Li · Noise and Generalization · math information theory

Distortion of the dataset distribution by noisy labels

Let $t$ and $t^*$ be the correct target label and the noisy label, respectively. Here we focus on binary classification and assume that the label noise depends only on the correct label, i.e. the error rates are:

$$e _- = P(t^* = +1 | t= -1) \quad \text{and} \quad e _+ = P(t^*=-1|t=+1).$$

The noisy dataset defines the joint distribution $P(x, t^*)$, which relates to the clean distribution $P(x, t)$ through

$$\begin{aligned}P(x, t^*=+1) &= (1 - e _+) * P(x, t=+1) + e _- * P(x, t= -1),\\ P(x, t^*=-1) &= e _+ *P(x, t=+1) + (1 - e _-) * P(x, t= -1).\end{aligned}\tag{1}$$
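As a quick sanity check on equation (1), here is a minimal numerical sketch; the discrete feature space, the clean joint distribution, and the error rates below are made up purely for illustration.

```python
import numpy as np

# Made-up clean joint distribution P(x, t) over a small discrete feature space;
# rows index x, columns are ordered (t = -1, t = +1).
P_clean = np.array([[0.25, 0.02],
                    [0.15, 0.08],
                    [0.08, 0.15],
                    [0.02, 0.25]])
e_plus, e_minus = 0.10, 0.20   # made-up error rates e_+ and e_-

# Equation (1): the noisy joint P(x, t*) mixes the two t-columns.
P_noisy = np.empty_like(P_clean)
P_noisy[:, 1] = (1 - e_plus) * P_clean[:, 1] + e_minus * P_clean[:, 0]   # t* = +1
P_noisy[:, 0] = e_plus * P_clean[:, 1] + (1 - e_minus) * P_clean[:, 0]   # t* = -1

# The noisy joint is still normalized, and the feature marginal P(x) is unchanged.
assert np.isclose(P_noisy.sum(), 1.0)
assert np.allclose(P_noisy.sum(axis=1), P_clean.sum(axis=1))
```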

For now, we assume the model has been trained elsewhere. It defines a decision boundary $\partial _d$, which splits the feature space into two parts: one where the model predicts $+1$ and one where it predicts $-1$. Let’s denote these regions by $\Omega _+$ and $\Omega _-$, respectively. To distinguish the prediction from the clean and noisy labels, we use $y$ to denote the label predicted by the model.

Mutual Information Loss

The mutual information loss is defined as follows:

$$\mathcal{I}(Y; T) = \sum _{y,t} P(y, t)* \log \frac{P(y, t)}{P(y)P(t)},$$

where $P(y,t)$ can be calculated by

$$P(y=\pm 1, t) = \sum _{\Omega _{\pm}} P(x, t).$$

Now, taking the label noise into account, we have, for example,

$$\begin{aligned}P(y= +1, t^* = +1) &= \sum _{\Omega _+} P(x, t^*=+1)\\&= \sum _{\Omega _+} [(1 - e _+) * P(x, t=+1) + e _- * P(x, t= -1)]\\ &= (1 - e _+ - e _-)*\sum _{\Omega _+} P(x, t = +1) + e _- * P(y=+1)\\ &= \mathfrak{E}* P(y=+1, t = +1) + e _- * P(y=+1),\end{aligned}$$

where the third line uses $\sum _{\Omega _+} P(x, t=-1) = P(y=+1) - \sum _{\Omega _+} P(x, t=+1)$, and in the last line we set $(1 - e _+ - e _-) = \mathfrak{E}$ for brevity. Generally, one has

$$P(y, t^* = \pm 1) = \mathfrak{E} * P(y, t = \pm 1) + e _{\mp}* P(y),$$

or, written more compactly,

$$P(y, t^*) = \mathfrak{E} * P(y, t) + e _{-t} * P(y),$$

where it is implicitly assumed that $t^*$ and $t$ take the same value. Summing over $y$, it further follows that

$$P(t^*) = \mathfrak{E} * P(t) + e _{-t} .$$
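These relations are straightforward to verify numerically. The sketch below (with a made-up $2\times 2$ table for $P(y,t)$ and made-up error rates) checks the compact forms against the direct column mixing of equation (1).

```python
import numpy as np

# Made-up clean joint P(y, t); rows index y, columns index t, both ordered (-1, +1).
P_yt = np.array([[0.35, 0.10],
                 [0.15, 0.40]])
e_plus, e_minus = 0.10, 0.20          # made-up error rates
E = 1 - e_plus - e_minus              # the shorthand \mathfrak{E}
e_flip = np.array([e_plus, e_minus])  # e_{-t} for t = -1 and t = +1, respectively

# Direct mixing of the t-columns, exactly as in equation (1) with y in place of x.
P_yt_star = np.empty_like(P_yt)
P_yt_star[:, 1] = (1 - e_plus) * P_yt[:, 1] + e_minus * P_yt[:, 0]
P_yt_star[:, 0] = e_plus * P_yt[:, 1] + (1 - e_minus) * P_yt[:, 0]

# Compact form: P(y, t*) = E * P(y, t) + e_{-t} * P(y).
P_y = P_yt.sum(axis=1, keepdims=True)
assert np.allclose(P_yt_star, E * P_yt + e_flip * P_y)

# Marginal form: P(t*) = E * P(t) + e_{-t}.
assert np.allclose(P_yt_star.sum(axis=0), E * P_yt.sum(axis=0) + e_flip)
```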

Based on the above, if the error rates are small, i.e. $e _\pm \sim 0 ^+$, one may approximate

$$\frac{P( y, t^* )}{ P(y) P(t^*)} \approx \frac{P(y, t)}{ P(y) P(t)}.$$

As a result, the mutual information loss on the noisy labels becomes

$$\begin{aligned}\mathcal{I}(Y;T^*) &\approx \sum _{y,t} [ \mathfrak{E} * P(y, t) + e _{-t}* P(y)] * \log \frac{P(y, t)}{P(y)P(t)}\\ &= \mathcal{I}(Y; T) - (e _+ + e _-) * \mathcal{I}(Y; T) + (e _+ + e _-)* H(Y) + \mathfrak{D}\\ &= \mathcal{I}(Y; T) + (e _+ + e _-) * H(Y|T) + \mathfrak{D},\end{aligned}$$

where

$$\begin{aligned}\mathfrak{D} &= \sum _{y,t} e _{-t} * P(y) * \log P(y|t)\\ &= \mathbb{E} _y \left[ \sum_t e _{-t} * \log P(y|t) \right].\end{aligned}$$
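For a concrete feel of the approximation, one can compare the exact mutual information computed on the noisy joint with the first-order expression $\mathcal{I}(Y;T) + (e _+ + e _-) * H(Y|T) + \mathfrak{D}$; below is a minimal sketch with a made-up $P(y,t)$ and small, made-up error rates.

```python
import numpy as np

def mutual_information(P):
    """I(Y;T) for a joint table P with rows indexed by y and columns by t."""
    Py = P.sum(axis=1, keepdims=True)
    Pt = P.sum(axis=0, keepdims=True)
    return float(np.sum(P * np.log(P / (Py * Pt))))

# Made-up clean joint P(y, t) and small, made-up error rates.
P_yt = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
e_plus, e_minus = 0.02, 0.03
E = 1 - e_plus - e_minus
e_flip = np.array([e_plus, e_minus])           # e_{-t} for t = -1, +1

# Exact noisy joint and the exact I(Y; T*).
P_y = P_yt.sum(axis=1, keepdims=True)
P_yt_star = E * P_yt + e_flip * P_y
I_noisy_exact = mutual_information(P_yt_star)

# First-order expression: I(Y;T) + (e_+ + e_-) * H(Y|T) + D.
I_clean = mutual_information(P_yt)
P_y_given_t = P_yt / P_yt.sum(axis=0)          # columns are P(y | t)
H_Y_given_T = -float(np.sum(P_yt * np.log(P_y_given_t)))
D = float(np.sum(e_flip * P_y * np.log(P_y_given_t)))
I_noisy_approx = I_clean + (e_plus + e_minus) * H_Y_given_T + D

print(I_noisy_exact, I_noisy_approx)           # close for small e_+, e_-
```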

Cross Entropy Loss / Negative Log-Likelihood Loss

Cross entropy loss (equivalently, negative log-likelihood loss) is usually defined, up to an overall minus sign which we drop to lighten notation, as

$$L(Y, T) = \frac{1}{N} \sum_i \mathfrak{t}^i \cdot \log \mathfrak{y}^i.$$

Here, $\mathfrak{t}^i$ is the one-hot encoding of the training label, and the model prediction $\mathfrak{y}^i$ is a probability vector over the classes, i.e. $\mathfrak{y} ^i _l = Q(c_l|x)$, where $c_l$ refers to the $l$-th class and $Q$ emphasizes that this probability is produced by our model. $N$ stands for the size of the dataset.

Let’s first consider the limit of this loss as $N\rightarrow \infty$. We claim

$$L(Y,T) \underset{N\rightarrow\infty}{\longrightarrow} \sum _{x, t} P(x)*P(t|x)\log Q(t|x).\tag{2}$$

Note that we have changed the symbol for the target label back to $t \in \{ -1,+1 \}$, which has the same meaning as in the previous section. $P$ refers to the empirical/true distribution of the dataset.
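A small simulation illustrates the claim: sampling $(x, t)$ pairs from a toy discrete distribution and averaging $\mathfrak{t}^i \cdot \log \mathfrak{y}^i$ over the sample approaches the population expression in (2). The particular distribution and the model probabilities $Q$ below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up clean distribution P(x, t) on x in {0, 1, 2}; columns ordered (t = -1, t = +1).
P = np.array([[0.30, 0.05],
              [0.15, 0.15],
              [0.05, 0.30]])
# Made-up model probabilities Q(t | x), one row per x (these stand in for the
# model outputs; they need not match P(t | x)).
Q = np.array([[0.8, 0.2],
              [0.5, 0.5],
              [0.2, 0.8]])

# Population expression (2): sum_{x,t} P(x, t) * log Q(t | x).
population = float(np.sum(P * np.log(Q)))

# Finite-sample average of the one-hot dot product t^i . log y^i, which just
# picks out log Q(t^i | x^i) for each drawn pair.
N = 200_000
flat = rng.choice(P.size, size=N, p=P.ravel())
x_idx, t_idx = np.unravel_index(flat, P.shape)
sample_mean = float(np.mean(np.log(Q[x_idx, t_idx])))

print(population, sample_mean)   # the two numbers agree up to sampling error
```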

We also recognize that equation (2) is just the cross entropy between $P$ and $Q$ conditioned on $x$, denoted $H(P,Q|X)$ (again with the overall sign dropped). Now, we can employ equation (1) to write down the response of the cross entropy loss to the label noise. Since $Q$ is solely determined by the model, one has

$$\begin{aligned}H(P^* ,Q | X) &= \sum _{x, t^*} P(x) * P(t^* | x)\log Q(t^* | x)\\&= \sum _{x, t} [\mathfrak{E}*P(x,t)+ e _{-t} * P(x)]*\log Q(t|x) \\ &= \mathfrak{E} * H(P,Q | X) + \mathbb{E} _x \left[ \sum_t e _{-t} * \log Q(t|x)\right].\end{aligned}$$

Above, we use $H(P^*,Q|X)$ to denote the conditional cross entropy on the noisy data. One may rearrange terms to write the r.h.s. of the above equation as

$$H(P,Q|X) - (e _+ + e _-) * H(P,Q) + \mathbb{E}_x \left[ \sum_t e _{-t} * \log Q(x,t) \right],$$

where

$$H(P,Q) = \sum _{x,t} P(x,t)\log Q(x,t)$$

and $Q(x,t) = Q(t|x)*P(x)$.
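Since the relation between $H(P^*,Q|X)$ and $H(P,Q|X)$ is exact (no small-noise approximation is involved), it can be checked directly; below is a sketch with made-up $P$, $Q$, and error rates, using the same sign convention as the text (no overall minus sign).

```python
import numpy as np

# Made-up clean joint P(x, t); columns ordered (t = -1, t = +1).
P = np.array([[0.30, 0.05],
              [0.15, 0.15],
              [0.05, 0.30]])
Px = P.sum(axis=1, keepdims=True)
# Made-up model probabilities Q(t | x), one row per x.
Q = np.array([[0.8, 0.2],
              [0.5, 0.5],
              [0.2, 0.8]])

e_plus, e_minus = 0.10, 0.20
E = 1 - e_plus - e_minus
e_flip = np.array([e_plus, e_minus])   # e_{-t} for t = -1, +1

# Noisy joint from equation (1), then both sides of
#   H(P*, Q | X) = E * H(P, Q | X) + E_x[ sum_t e_{-t} * log Q(t | x) ].
P_star = E * P + e_flip * Px
H_noisy = float(np.sum(P_star * np.log(Q)))
H_clean = float(np.sum(P * np.log(Q)))
correction = float(np.sum(Px * e_flip * np.log(Q)))
assert np.isclose(H_noisy, E * H_clean + correction)

# Rearranged form with the joint "cross entropy" H(P, Q) and Q(x, t) = Q(t|x) * P(x).
H_joint = float(np.sum(P * np.log(Q * Px)))
corr_joint = float(np.sum(Px * e_flip * np.log(Q * Px)))
assert np.isclose(H_noisy, H_clean - (e_plus + e_minus) * H_joint + corr_joint)
```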



There is still a long way to go.