Bayes Classifier and MI Classifier With Noisy Labels

August 8, 2019 · Xingyu Li · Noise and Generalization · review · math · information theory

Definition and Relation

This section is a partial review of Bao-Gang Hu’s paper, “What Are the Differences Between Bayesian Classifiers and Mutual-Information Classifiers?” (IEEE TNNLS, 2014).

Bayes Classifier without rejection and its Decision Rule

The Bayes Classifier makes decisions based on an (often subjectively designed) risk function $\mathfrak{R}$:

$$\mathfrak{R}(y_j | x) = \sum_i \lambda _{ij} P(x | t_i)P(t_i),$$

where $x \in \R^d$ is the input feature vector; $y_j$ and $t_i$ stand for the predicted class and the true class, respectively; and $\lambda _{ij}$ is the cost incurred when true class $t_i$ is predicted as $y_j$. Up to the $y$-independent factor $P(x)$, this is the posterior-weighted cost $\sum_i \lambda _{ij} P(t_i|x)$, so it ranks predictions the same way. Note that generally one doesn’t have a one-to-one mapping between $x$ and $t_i$; instead, one has a joint distribution $P(x, t_i)$. The Bayes Classifier then chooses $y^*$ such that

$$y^* = \underset{y}{\arg\min}\, \mathfrak{R}(y|x)$$

as its prediction.

Let’s restrict attention to binary classification, where $i, j \in \{1, 2\}$, and assume, as usual, that misclassification costs more than correct classification, i.e., $\lambda _{12} > \lambda _{11}$ and $\lambda _{21} > \lambda _{22}$. It is then straightforward to write down the corresponding decision rule for the Bayes Classifier. Let

$$ \Lambda = \frac{P(t_1|x)}{P(t_2|x)} = \frac{P(x|t_1)P(t_1)}{P(x|t_2)P(t_2)} \quad\text{and}\quad \delta = \frac{\lambda _{21} - \lambda _{22}}{\lambda _{12} - \lambda _{11}}.$$

One has

$$y^* = \begin{cases}y_1, & \text{if}\quad \Lambda > \delta\\ y_2, & \text{if}\quad \Lambda \le \delta\end{cases}$$

The threshold $\delta$ implicitly defines a decision boundary in $\R^d$.
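As a concrete illustration, here is a minimal sketch of this decision rule in Python. The Gaussian likelihoods, priors, and costs are placeholder choices for illustration only, not values from the paper:

```python
import numpy as np
from scipy.stats import norm

# Placeholder costs lam[(i, j)]: true class i predicted as class j.
lam = {(1, 1): 0.0, (1, 2): 1.0, (2, 1): 2.0, (2, 2): 0.0}
delta = (lam[(2, 1)] - lam[(2, 2)]) / (lam[(1, 2)] - lam[(1, 1)])

def bayes_predict(x, p1=0.5, p2=0.5):
    """Predict class 1 where the likelihood ratio Lambda exceeds delta, else class 2."""
    # Placeholder class-conditional densities P(x | t_1) and P(x | t_2).
    lr = norm.pdf(x, loc=-1, scale=1) * p1 / (norm.pdf(x, loc=1, scale=1) * p2)
    return np.where(lr > delta, 1, 2)

print(bayes_predict(np.array([-2.0, 0.0, 2.0])))  # e.g., [1 2 2] for these placeholders
```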

In the above, the threshold $\delta$ is computed from the costs $\lambda _{ij}$. In practice, however, one may have better intuition about the threshold for a given problem than about the costs themselves. It is perfectly fine to fix the threshold $\delta$ first; doing so imposes only one constraint on the costs $\lambda _{ij}$ rather than determining them. Note that in that case the risk $\mathfrak{R}$ would not be well-defined.

Mutual Information Classifier without rejection and its relation to Bayes Classifier

Mutual Information (MI) Classifier is defined as

$$y^ \dagger = \underset{y}{\arg\max}\,\frac{I(T, Y)}{H(T)},$$

where the maximization runs over decision rules (equivalently, over the decision regions $\Omega_j$ introduced below), since $I(T, Y)$ depends on the classifier as a whole rather than on a single prediction. Here

$$I(T,Y) = \sum _{ij} {P(t_i, y_j)\log \frac{P(t_i, y_j)}{P(t_i)P(y_j)}}$$

and

$$H(T) = -\sum_i P(t_i)\log P(t_i).$$

$H(T)$ serves as a normalization factor. It remains unchanged once the problem itself is defined.
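To make the objective concrete, here is a small sketch that evaluates $I(T, Y)/H(T)$ from a joint table $P(t_i, y_j)$; the example table is made up for illustration:

```python
import numpy as np

def normalized_mi(joint):
    """I(T, Y) / H(T) for a joint table with joint[i, j] = P(t_i, y_j)."""
    pt = joint.sum(axis=1)   # marginal P(t_i)
    py = joint.sum(axis=0)   # marginal P(y_j)
    nz = joint > 0           # zero cells contribute nothing to the sum
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pt, py)[nz]))
    h_t = -np.sum(pt[pt > 0] * np.log(pt[pt > 0]))
    return mi / h_t

# A made-up joint distribution where Y mostly agrees with T.
joint = np.array([[0.45, 0.05],
                  [0.10, 0.40]])
print(normalized_mi(joint))
```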

Solving for the MI classifier is considerably harder, since it involves searching for the decision boundary in $\R^d$ directly. Recall that for the Bayes Classifier, the decision boundary is determined once the threshold is found. To solve for the MI classifier, one needs to compute $P(t_i, y_j)$:

$$P(t_i, y_j) = \int _{\Omega_j} P(x|t_i)P(t_i)\,\text{d} x,$$

where $\Omega_j$ is the region in which the MI classifier predicts $y_j$ for a given feature $x$. The marginal distributions $P(t_i)$ and $P(y_j)$ can then be computed easily.
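In one dimension this integral is easy to carry out. Below is a sketch for the special case where $\Omega_1$ is a half-line $x < b$ (a simplifying assumption; general regions need numerical integration), using for concreteness the Gaussian example introduced below, with $N(\mu, \sigma)$ read as mean and variance:

```python
import numpy as np
from scipy.stats import norm

def joint_table(b, p1=0.5, p2=0.5):
    """P(t_i, y_j) for the assumed single-threshold rule: predict y_1 if x < b,
    else y_2. Rows index t_i, columns index y_j."""
    c1 = norm(loc=-1, scale=np.sqrt(2))  # P(x | t_1); N(-1, 2) read as variance 2
    c2 = norm(loc=1, scale=1)            # P(x | t_2)
    # P(t_i, y_j) = P(t_i) * integral of P(x | t_i) over Omega_j, via Gaussian CDFs.
    return np.array([[p1 * c1.cdf(b), p1 * c1.sf(b)],
                     [p2 * c2.cdf(b), p2 * c2.sf(b)]])

print(joint_table(0.0))
```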

Of course, the MI Classifier and the Bayes Classifier are essentially different objects, as summarized by Bao-Gang Hu:

Bayesian and mutual-information classifiers are different essentially from their applied learning targets.

Bao-Gang Hu also points out that under specific circumstances the two classifiers can be equivalent, in the sense that they give the same decision boundary. As an example, consider the binary classification problem

$$\begin{aligned}P(x|t_1) &\sim N(-1, 2), & P(t_1) &= 0.5, \\ P(x|t_2) &\sim N(1, 1), & P(t_2) &= 0.5,\end{aligned}$$

where $x\in \R$ and $N(\mu, \sigma)$ is the Gaussian distribution. Bao-Gang Hu shows that if one chooses

$$\lambda _{11} = 0, \lambda _{12} = 1, \lambda _{21} = 2.002, \lambda _{22} = 0,$$

then the decision boundaries of the two classifiers are identical. In general, however, no such equivalence holds.
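One can check this example numerically. The sketch below finds the Bayes threshold by solving $\Lambda = \delta$, and the MI-optimal threshold by a grid search restricted to single-threshold rules. Both are simplifying assumptions: $N(\mu, \sigma)$ is again read as mean and variance, and the root search on $[-3, 3]$ ignores a second crossing of $\Lambda = \delta$ far in the right tail caused by the unequal variances:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

c1, c2 = norm(-1, np.sqrt(2)), norm(1, 1)  # N(-1, 2), N(1, 1) read as (mean, variance)
p1 = p2 = 0.5
delta = (2.002 - 0.0) / (1.0 - 0.0)        # delta computed from the costs above

def nmi(b):
    """Normalized MI of the single-threshold rule: predict y_1 if x < b."""
    joint = np.array([[p1 * c1.cdf(b), p1 * c1.sf(b)],
                      [p2 * c2.cdf(b), p2 * c2.sf(b)]])
    pt, py = joint.sum(axis=1), joint.sum(axis=0)
    mi = np.sum(joint * np.log(joint / np.outer(pt, py)))
    return mi / -np.sum(pt * np.log(pt))

# MI-optimal threshold by grid search over single-threshold rules.
grid = np.linspace(-3, 3, 6001)
b_mi = grid[np.argmax([nmi(b) for b in grid])]

# Bayes threshold: the root of Lambda(x) = delta near the class overlap.
b_bayes = brentq(lambda x: c1.pdf(x) * p1 - delta * c2.pdf(x) * p2, -3, 3)

print(f"MI-optimal threshold ~ {b_mi:.3f}, Bayes threshold ~ {b_bayes:.3f}")
```

If the reading of $N(\mu, \sigma)$ matches the paper’s convention, the two printed thresholds should come out close to each other.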

Remarks

For the Bayes Classifier in the binary classification case,

$$\text{Accuracy} = \int _{\Omega_1} P(x|t_1)P(t_1)\,\text{d}x + \int _{\Omega_2} P(x|t_2)P(t_2)\,\text{d}x.$$

By the decision rule, within $\Omega_1$ one has $P(x|t_1)P(t_1) > \delta \cdot P(x|t_2)P(t_2)$. Substituting this into the first integral and using $\int _{\Omega_1} P(x|t_2)P(t_2)\,\text{d}x + \int _{\Omega_2} P(x|t_2)P(t_2)\,\text{d}x = P(t_2)$ gives

$$\text{Accuracy} > P(t_2) + (\delta - 1)\int _{\Omega_1} P(x|t_2)P(t_2)\,\text{d}x.$$

Likewise, since $P(x|t_1)P(t_1) \le \delta \cdot P(x|t_2)P(t_2)$ within $\Omega_2$,

$$\text{Accuracy} \ge P(t_1) + \Big(\frac{1}{\delta} - 1\Big)\int _{\Omega_2} P(x|t_1)P(t_1)\,\text{d}x.$$

However, high overall accuracy doesn’t guarantee high accuracy on each class.
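These bounds are easy to check by Monte Carlo. The sketch below reuses the Gaussian example and $\delta = 2.002$ from above; since the decision is applied pointwise, no assumption about the shape of $\Omega_1$ is needed:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
c1, c2 = norm(-1, np.sqrt(2)), norm(1, 1)  # the same (mean, variance) reading as above
p1 = p2 = 0.5
delta = 2.002

n = 1_000_000
is_t1 = rng.random(n) < p1                             # sample the true class
x = np.where(is_t1,
             c1.rvs(size=n, random_state=rng),
             c2.rvs(size=n, random_state=rng))
in_omega1 = c1.pdf(x) * p1 > delta * c2.pdf(x) * p2    # pointwise Bayes decision

accuracy = np.mean(in_omega1 == is_t1)
i1 = np.mean(in_omega1 & ~is_t1)   # estimates the integral over Omega_1 of P(x|t_2)P(t_2)
i2 = np.mean(~in_omega1 & is_t1)   # estimates the integral over Omega_2 of P(x|t_1)P(t_1)
print(accuracy, p2 + (delta - 1) * i1, p1 + (1 / delta - 1) * i2)
```

The first printed number should exceed the other two, consistent with the bounds.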

When Noisy Labels Come into Play



“End? No, the journey doesn't end here.”