Variational Form for H(X)

July 29, 2019 · Xingyu Li · DeepInfoFlow · information theory, math, variational form

For a continuous random variable $X$, the differential entropy is defined as

$$H(X) = \int P(x)\log\dfrac{1}{P(x)}\,\text{d}x,\tag{1}$$

where $P(x)$ is the probability density of $X$.
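As a quick sanity check of eq. (1) (a standard closed form, added here for reference, not part of the original derivation): for a one-dimensional Gaussian $X \sim \mathcal{N}(\mu, \sigma^2)$,

$$H(X) = \frac{1}{2}\log\left(2\pi e \sigma^2\right) \approx 1.4189 \ \text{ for } \sigma = 1.$$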

The function $-\log v$ is strictly convex on $v > 0$. Let $\phi$ be its convex conjugate (Fenchel dual); then $-\log v = \underset{u\in\mathbb{R}}{\text{sup}}\left[uv - \phi(u)\right]$, where $\phi(u) = -1 - \log(-u)$ for $u<0$ and $+\infty$ otherwise.
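As a quick check of this conjugate pair (a standard Fenchel-duality computation, added for completeness): with $\phi(u) = \underset{v>0}{\text{sup}}\left[uv + \log v\right]$, setting the derivative in $v$ to zero gives $u + 1/v = 0$, i.e. $v = -1/u$, which requires $u < 0$. Substituting back,

$$\phi(u) = u\left(-\frac{1}{u}\right) + \log\left(-\frac{1}{u}\right) = -1 - \log(-u), \qquad u < 0,$$

and $\phi(u) = +\infty$ for $u \ge 0$, since $uv + \log v$ is then unbounded above in $v$.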

Using the above with $v = \frac{1}{P(x)}$, we have

$$\begin{aligned} \int P(x)\log\frac{1}{P(x)}\,\text{d}x &= - \int P(x)\left(- \log\frac{1}{P(x)}\right)\,\text{d}x \\&= -\int P(x)\, \underset{u\in\mathbb{R}}{\text{sup}}\left[ u\,\frac{1}{P(x)} - \phi(u) \right]\,\text{d}x \\&= -\underset{f}{\text{sup}}\left[ \int f(x)\,\text{d}x - \int \phi(f(x))\,P(x)\,\text{d}x\right] \\&= -\underset{f}{\text{sup}}\left[ \int f(x)\,\text{d}x - \mathbb{E}_{P}[\phi(f(x))]\right],\end{aligned}$$

where in the third line the pointwise supremum over $u$ is replaced by a supremum over all measurable functions $f:\mathcal{X} \rightarrow \mathbb{R}$ ($\mathcal{X}$ being the sample space); the interchange is valid because the supremum can be taken independently at each $x$. We also see that for any function class $\mathbf{F}$ of maps from $\mathcal{X}$ to $\mathbb{R}$, restricting the supremum can only decrease it (and hence increase its negative), so

$$\int P(x)\log\dfrac{1}{P(x)}\,\text{d}x \le -\underset{f\in\mathbf{F}}{\text{sup}}\left[ \int f(x)\,\text{d}x - \mathbb{E}_{P}[\phi(f)]\right]. $$
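A one-line check (added, not in the original note) that the unrestricted supremum is attained at $f(x) = -P(x)$, so the inequality becomes an equality whenever $\mathbf{F}$ contains $-P$: from the conjugate computation above, the pointwise maximizer is $u = -1/v = -P(x)$ with $v = 1/P(x)$, and $\phi(-P(x)) = -1 - \log P(x)$, so

$$\int \left(-P(x)\right)\text{d}x - \mathbb{E}_P\left[\phi(-P(x))\right] = -1 - \mathbb{E}_P\left[-1 - \log P(x)\right] = \mathbb{E}_P\left[\log P(x)\right],$$

and negating recovers $\int P(x)\log\frac{1}{P(x)}\,\text{d}x$ exactly.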

Approximating the expectation $\mathbb{E}_{P}[\phi(f)]$ with an empirical mean, we obtain

$$ -\underset{f\in\mathbf{F}}{\text{sup}}\left[ \int f(x)\,\text{d}x - \frac{1}{N}\sum_{i=1}^N\phi(f(x_i))\right],$$

where the $x_i$ are samples drawn from the distribution $P(x)$.

Specifically, plugging in the explicit form of $\phi$, the differential entropy case becomes

$$ -\underset{g>0}{\text{sup}}\left[\frac{1}{N}\sum_{i=1}^N \log g(x_i) -\int g(x)\,\text{d}x + 1\right], $$

where we have substituted $f = -g$ with $g > 0$.
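A minimal numerical sketch of this estimator, under assumptions that are mine rather than the post's: one-dimensional data, $g$ restricted to scaled Gaussian densities $g(x) = c\,\mathcal{N}(x;\mu, s^2)$ so that $\int g(x)\,\text{d}x = c$ in closed form, and `scipy.optimize.minimize` as the optimizer. It simply checks that $-\text{sup}[\cdot]$ lands near the true entropy of $\mathcal{N}(0,1)$.

```python
# Sketch: variational estimate of differential entropy (assumptions noted above).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)          # samples x_i from P = N(0, 1)

def neg_bracket(theta):
    """Negative of (1/N) sum_i log g(x_i) - int g dx + 1, for g = c * N(mu, s^2)."""
    mu, log_s, log_c = theta
    s, c = np.exp(log_s), np.exp(log_c)      # positivity via log-parametrization
    log_g_at_x = log_c + norm.logpdf(x, loc=mu, scale=s)
    return -(np.mean(log_g_at_x) - c + 1.0)  # int g(x) dx = c for a scaled density

res = minimize(neg_bracket, x0=np.zeros(3))  # start at g = N(0, 1), c = 1
print(f"variational estimate of H: {res.fun:.4f}")
print(f"true entropy of N(0,1):    {0.5 * np.log(2 * np.pi * np.e):.4f}")
```

Because the true density lies in this parametric family, the optimum essentially recovers $g = P$ and the estimate matches $H$ up to sampling noise; with a misspecified family the printed value would only upper-bound $H$.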

We may further restrict the function space to a Reproducing Kernel Hilbert Space (RKHS) induced by a Gaussian kernel.

Let $\mathcal{G}$ denote the RKHS we use.

In an RKHS there is a feature map $\psi_x$ for each point $x$; by the reproducing property, any function $g \in \mathcal{G}$ can be evaluated pointwise as $g(x) = \langle g, \psi_x\rangle$.

For the kernel $K$, one has $K(x, y) = \langle \psi_x, \psi_y\rangle$.

The Gaussian kernel is

$$ K(x,y) = e^{-|x-y|^2/\sigma}. $$
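One concrete fact worth recording for later (one-dimensional case; added here, not in the original note), since the objective below involves $\int g(x)\,\text{d}x$: each kernel section integrates to a constant,

$$\int_{-\infty}^{\infty} K(x, y)\,\text{d}x = \int_{-\infty}^{\infty} e^{-(x-y)^2/\sigma}\,\text{d}x = \sqrt{\pi\sigma},$$

independent of $y$. So if $g$ is expanded as a finite sum $g = \sum_j \alpha_j K(\cdot, x_j)$ (an ansatz of mine, not something derived above), then $\int g(x)\,\text{d}x = \sqrt{\pi\sigma}\,\sum_j \alpha_j$.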

To seek an optimal solution $\hat{g}$ for the differential entropy bound, using the variational form of eq. (1) and the reproducing property of the RKHS, we need to solve

$$\hat{g}=\underset{g\in\mathcal{G}}{\text{arg}\min}\left[ \int g(x)\,\text{d}x - \frac{1}{N}\sum_{i=1}^N\log\langle g,\psi_{x_i} \rangle + \frac{\lambda_N}{2}\mathbf{I}^2(g)\right],$$

where $\frac{\lambda_N}{2}\mathbf{I}^2(g)$ is a regularization term that further reduces the search space in $\mathcal{G}$.

One may choose $\mathbf{I}^2(g)=\|g\|^2_{\mathcal{G}}$, the squared RKHS norm.
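A rough numerical sketch of this regularized objective, with assumptions of my own: a finite expansion $g(x) = \sum_j \alpha_j K(x, x_j)$ over the sample points (a representer-style ansatz, not stated in the post), the one-dimensional kernel integral recorded above for $\int g\,\text{d}x$, $\mathbf{I}^2(g) = \|g\|^2_{\mathcal{G}} = \alpha^{\top}K\alpha$, and illustrative values for `sigma` and `lam`.

```python
# Sketch: regularized RKHS objective with a finite kernel expansion (assumptions above).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)                     # samples x_i from P = N(0, 1)
N, sigma, lam = len(x), 1.0, 1e-3

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / sigma)    # Gram matrix K_ij = K(x_i, x_j)
kernel_mass = np.sqrt(np.pi * sigma)                   # int K(x, x_j) dx in 1-D

def objective(alpha):
    g_at_x = np.clip(K @ alpha, 1e-12, None)      # g(x_i) = <g, psi_{x_i}>, kept positive for the log
    integral_g = kernel_mass * alpha.sum()        # int g(x) dx for the finite expansion
    reg = 0.5 * lam * alpha @ K @ alpha           # (lambda_N / 2) * ||g||_G^2
    return integral_g - np.mean(np.log(g_at_x)) + reg

def grad(alpha):
    g_at_x = np.clip(K @ alpha, 1e-12, None)
    return kernel_mass - (K @ (1.0 / g_at_x)) / N + lam * (K @ alpha)

alpha_hat = minimize(objective, np.full(N, 1.0 / N), jac=grad, method="L-BFGS-B").x

# Entropy value implied by the fitted g (regularizer dropped, "+1" restored);
# only an estimate, up to sampling and optimization error.
g_hat = np.clip(K @ alpha_hat, 1e-12, None)
H_hat = kernel_mass * alpha_hat.sum() - np.mean(np.log(g_hat)) - 1.0
print(f"RKHS variational estimate of H: {H_hat:.3f}  (true N(0,1) entropy is about 1.419)")
```

Within this ansatz the objective is convex in $\alpha$ (a linear term, $-\log$ of a linear map, and a quadratic), so the finite-dimensional problem is easy to solve; whether this ansatz is the right way to restrict $\mathcal{G}$ is exactly the kind of question the notes below raise.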

What is the new form after applying the inf-convolution theorem?

Do we need $ \int g(x)\,\text{d}x = \int \langle g, \psi_x\rangle\,\text{d}x $?


