Neural Tangent Kernel and NN Dynamics (draft)

August 1, 2019 · Xingyu Li · Noise and Generalization · math neural network dynamics review

Discussion on Neural Tangent Kernel

The Neural Tangent Kernel was introduced by Arthur Jacot et al. to study the dynamics of (S)GD-based learning models such as Neural Networks. It is expected to enable one to “study the training of ANNs in the functional space $\mathcal{F}$, on which the cost $C$ is convex.”

In the following, we review the derivations of Arthur Jacot et al. and argue that, by definition, the Neural Tangent Kernel only captures the first-order dynamics of the Neural Network. To be practical, one needs to include higher-order kernels.

Basic Definitions

  • Let $f(x, \theta): \mathbb{R}^n\otimes\mathbb{R}^p\rightarrow\mathbb{R}^m$ be the function learned by the model, where $x\in\mathbb{R}^n$ is the input feature and $\theta\in\mathbb{R}^p$ is the parameter vector. The output is an encoding of the input features in the space $\mathbb{R}^m$. This function belongs to the functional space $\mathcal{F}$.
  • Let $\mathbb{C}$ be the cost function. We will view it as a functional: $\mathbb{C}: \mathcal{F}\rightarrow\mathbb{R}$. This is well defined since, in practice, the training dataset $\{(x_i, y_i)\}$ is fixed during training, hence the cost is determined by the learned function $f$.
    • Let $\mathcal{L}$ be the loss function; one has $\mathbb{C}=\frac{1}{N}\sum_i\mathcal{L}(f(x_i,\theta), y_i)$, where $N$ is the training dataset size (a minimal numerical sketch of these objects follows this list).
  • To investigate the evolution and convergence of the learning process, one needs to trace the variation of $f$ and $\mathbb{C}$ along the learning iterations, which serve as the time dimension of the learning process.
    • Note both $f$ and $\mathbb{C}$ depend on $t$ through $\theta$, which updates according to (S)GD.
    • The time scale is related to the magnitude of the learning rate $\eta$.
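As a concrete toy instance of the objects above, here is a minimal sketch (the one-hidden-layer network, the squared loss, and all the sizes are assumptions made purely for illustration, not anything specific to the paper):

```python
# Minimal sketch (assumed toy model): a one-hidden-layer network f(x, theta)
# with n = 3, m = 2, and the cost C over a fixed training set of size N = 10,
# using the squared loss L(f, y) = 0.5 * ||f - y||^2.
import numpy as np

rng = np.random.default_rng(0)
n, m, hidden, N = 3, 2, 8, 10

# theta consists of the two weight matrices, so p = hidden*n + m*hidden.
W1 = rng.normal(size=(hidden, n))
W2 = rng.normal(size=(m, hidden))

def f(x, W1=W1, W2=W2):
    """f(x, theta): R^n -> R^m."""
    return W2 @ np.tanh(W1 @ x)

# Fixed training set {(x_i, y_i)}.
X = rng.normal(size=(N, n))
Y = rng.normal(size=(N, m))

def cost(W1=W1, W2=W2):
    """C = (1/N) sum_i L(f(x_i, theta), y_i)."""
    return np.mean([0.5 * np.sum((f(x, W1, W2) - y) ** 2) for x, y in zip(X, Y)])

print(cost())
```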

Kernel Gradient and Neural Tangent Kernel

To study the training process, Arthur Jacot et al. focus on the behavior of $\partial_t f$ and $\partial_t \mathbb{C}$. They first introduce a seminorm $|\cdot| _{p ^{in}}$, defined in terms of the bilinear form

$$\langle f, g\rangle _{p ^{in}} = \mathbb{E} _{p ^{in}}[f(x)^T g(x)],$$

where $p ^{in}$ is the empirical distribution on the training dataset, i.e. $\frac{1}{N}\sum_i \delta _{x_i}$. Hence,

$$\langle f, g\rangle _{p ^{in}} = \frac{1}{N}\sum_i f(x_i)\cdot g(x_i).$$
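In code, evaluating this bilinear form amounts to an averaged dot product over the training points; a minimal sketch, assuming the two functions have already been evaluated at the $x_i$:

```python
# Sketch of the empirical bilinear form <f, g>_{p_in}: average over the N
# training points of the dot product f(x_i) . g(x_i).
import numpy as np

def inner_pin(F_vals, G_vals):
    """F_vals, G_vals: arrays of shape (N, m) holding f(x_i) and g(x_i)."""
    return np.mean(np.sum(F_vals * G_vals, axis=1))

# Toy usage with random "function values".
rng = np.random.default_rng(0)
F_vals, G_vals = rng.normal(size=(10, 2)), rng.normal(size=(10, 2))
print(inner_pin(F_vals, G_vals))
```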

In this way, they can form a dual functional space $\mathcal{F}^*$ of $\mathcal{F}$, whose elements are linear forms $\mu: \mathcal{F}\rightarrow \mathbb{R}$ of the form $\mu=\langle d, \cdot\rangle _{p ^{in}}$ for some $d\in \mathcal{F}$.

Note that the functional derivative of $\mathbb{C}$ at the point $f_0$ (i.e. the model function at $t_0$) can be viewed as an element of $\mathcal{F}^*$. In fact, let $\alpha$ be any parameter of $f$; then

$$\begin{aligned}\partial _\alpha \mathbb{C}| _{f_0} &= \frac{1}{N}\sum_i \partial_f \mathcal{L}(f_0)\cdot \partial _\alpha f_0 \\ &=\partial_f\mathbb{C}| _{f_0}* \partial _{\alpha}f_0,\end{aligned}$$

where we use $\partial_f \mathbb{C}| _{f_0}$ to denote the functional derivative and introduce a special product symbol, $*$, in the last step, in order to write the formula in a chain-rule-like form. Note that we have abbreviated irrelevant variables in the above formula. It is easy to see that “$\partial_f \mathbb{C}| _{f_0} *$” is a linear operator. Based on this observation, one may write $\partial_f \mathbb{C}| _{f_0} * =\langle d _{f_0}, \cdot\rangle _{p ^{in}}$ with a corresponding $d _{f_0}\in \mathcal{F}$.
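As a concrete example, take the squared loss $\mathcal{L}(f, y) = \frac{1}{2}\|f - y\|^2$. Then $\partial_f \mathcal{L}(f_0(x_i), y_i) = f_0(x_i) - y_i$, so the dual element is simply the residual evaluated on the training points:

$$\partial_f \mathbb{C}| _{f_0} * g = \frac{1}{N}\sum_i \left(f_0(x_i) - y_i\right)\cdot g(x_i) = \langle f_0 - y, g\rangle _{p ^{in}}, \qquad\text{i.e.}\quad d _{f_0} = f_0 - y.$$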

At this point, Arthur Jacot et al. introduce the kernel gradient of the Neural Network model. Basically, a kernel is a function $K(x,x'):\mathbb{R}^n\otimes\mathbb{R}^n\rightarrow\mathbb{R}^{m\times m}$ that maps a feature pair to an $m\times m$ matrix, symmetric in the sense that $K(x,x')=K(x',x)^T$. Since $K _{i;\cdot}(x, \cdot)$ is a function in $\mathcal{F}$, one can define a map $\Phi_K: \mathcal{F}^* \rightarrow \mathcal{F}$ mapping a dual element $\mu = \langle d, \cdot\rangle _{p ^{in}}$ to an element $f _\mu \in \mathcal{F}$ with components

$$[\Phi_K(\mu)]_i = f _{\mu; i}(x)= \langle d, K _{i;\cdot}(x, \cdot)\rangle _{p ^{in}}.$$
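For illustration, a direct implementation of $\Phi_K$ (and hence of the kernel gradient introduced next) might look like the sketch below; the RBF kernel and the toy data in the usage are assumptions made purely for demonstration.

```python
# Sketch of Phi_K: map the dual element <d, .>_{p_in} to the function
# x |-> (1/N) sum_i K(x, x_i) d(x_i), given a kernel K returning (m, m) matrices.
import numpy as np

def phi_K(K, X_train, d_vals):
    """Return the function f_mu(x) = (1/N) sum_i K(x, x_i) d(x_i).

    K       : callable (x, x') -> (m, m) matrix
    X_train : array of shape (N, n), the training inputs x_i
    d_vals  : array of shape (N, m), the values d(x_i)
    """
    N = len(X_train)
    return lambda x: sum(K(x, xi) @ di for xi, di in zip(X_train, d_vals)) / N

# Toy usage: an RBF kernel acting identically on each of the m output components.
m = 2
K = lambda x, xp: np.exp(-np.sum((x - xp) ** 2)) * np.eye(m)
rng = np.random.default_rng(0)
X_train, d_vals = rng.normal(size=(5, 3)), rng.normal(size=(5, m))
f_mu = phi_K(K, X_train, d_vals)
print(f_mu(np.ones(3)))
```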

Now the Kernel gradient is defined by

$$\begin{aligned}\nabla_K \mathbb{C}| _{f_0}(x)&= \Phi_K(\partial_f \mathbb{C}| _{f_0})\\&=\frac{1}{N}\sum_i K(x, x_i) d _{f_0}(x_i),\end{aligned}$$

where the last step involves matrix multiplication between $K(x, x_i)$ and $d _{f_0}(x_i)$. It turns out that there is a special kernel $\Theta(x, x')$ such that

$$\partial_t f(x, \theta (t_0)) = - \nabla_\Theta \mathbb{C}| _{f_0},$$

as long as the model is updated according to (S)GD. Arthur Jacot et al. call this $\Theta$ the Neural Tangent Kernel and show that it is given by

$$\Theta(x,x'; \theta)=\sum_p \partial _{\theta_p}f(x,\theta)\otimes \partial _{\theta_p}f(x',\theta),$$

where $p$ indexes the components of the parameter vector $\theta$. We will see the reason for choosing this definition in the next section.
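To make the definition concrete, the following sketch computes $\Theta(x, x'; \theta)$ for an assumed toy one-hidden-layer network by stacking the per-parameter derivatives into a Jacobian and taking $J(x)J(x')^T$; the use of JAX for the differentiation is an illustrative choice.

```python
# Sketch (assumed toy model): the empirical Neural Tangent Kernel
# Theta(x, x'; theta) = sum_p df(x)/dtheta_p (outer) df(x')/dtheta_p,
# computed as J(x) @ J(x')^T, where J is the (m, P) Jacobian w.r.t. parameters.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_params(key, n_in=3, n_hidden=8, n_out=2):
    k1, k2 = jax.random.split(key)
    return {"W1": jax.random.normal(k1, (n_hidden, n_in)) / jnp.sqrt(n_in),
            "W2": jax.random.normal(k2, (n_out, n_hidden)) / jnp.sqrt(n_hidden)}

def f(params, x):
    # f(x, theta): R^n -> R^m, a one-hidden-layer network.
    return params["W2"] @ jnp.tanh(params["W1"] @ x)

def ntk(params, x, xp):
    flat, unravel = ravel_pytree(params)
    jac = lambda z: jax.jacobian(lambda p: f(unravel(p), z))(flat)  # shape (m, P)
    return jac(x) @ jac(xp).T                                       # shape (m, m)

params = init_params(jax.random.PRNGKey(0))
x, xp = jnp.ones(3), jnp.arange(3.0)
print(ntk(params, x, xp))   # Theta(x, x'), an m-by-m matrix
print(ntk(params, x, x))    # Theta(x, x) is symmetric positive semi-definite
```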

In addition, for $\partial_t \mathbb{C}$ one has

$$\begin{aligned}\partial_t \mathbb{C}| _{f_0} &= \partial_f \mathbb{C}| _{f_0}* \partial_t f(t_0)\\&=-\frac{1}{N^2}\sum_i\sum_j d^T _{f_0}(x_i) \Theta(x_i, x_j) d _{f_0}(x_j).\end{aligned}$$

Again, matrix multiplication is assumed. We see that as long as $\Theta$ is positive definite with respect to $p ^{in}$, the cost will always decrease.
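To see why this positivity condition is natural, note that $\Theta$ is a Gram matrix built from the parameter gradients, so the double sum above is a sum of squares:

$$\frac{1}{N^2}\sum_i\sum_j d^T _{f_0}(x_i)\, \Theta(x_i, x_j)\, d _{f_0}(x_j) = \sum_p\left(\frac{1}{N}\sum_i d _{f_0}(x_i)\cdot \partial _{\theta_p}f(x_i,\theta)\right)^2 \geq 0,$$

so $\partial_t \mathbb{C}| _{f_0}\leq 0$ always holds; positive definiteness with respect to $p ^{in}$ is what upgrades this to a strict decrease.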

First Order Dynamics

To see what is going on underneath the fancy appearance of kernel representations, it is helpful to expand all the details of every step. In doing so, we find that the Neural Tangent Kernel actually only captures the first-order dynamics of the Neural Network model.

The (S)GD updating rule defines the dynamics of the model at each iteration:

$$\begin{aligned}\Delta \theta_p(t_0) &= -\eta \times\partial _{\theta_p} \mathbb{C}| _{f_0}\\&=-\eta \times\frac{1}{N}\sum_i \partial_f\mathcal{L}(f_0(x_i,\theta), y_i)\cdot\partial _{\theta_p}f_0(x_i,\theta)\\&=-\eta \times\partial_f\mathbb{C}| _{f_0}* \partial _{\theta_p}f_0,\end{aligned}$$

where $\times$ indicates multiplication by a scalar. The derivatives are evaluated at epoch $t_0$, and the corresponding model function is $f_0$. To emphasize that the function $f$ is an object in the functional space $\mathcal{F}$, we also write $f_0(x,\theta)$ as $f(t_0)$.
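In code, one such full-batch update is simply a gradient evaluation at the current parameters; a minimal sketch, assuming the cost function already closes over the fixed training set:

```python
# Sketch of one (full-batch) gradient-descent step: Delta theta = -eta * dC/dtheta.
import jax
import jax.numpy as jnp

def gd_step(cost, theta, eta):
    # cost: R^P -> R, evaluated on the fixed training set; theta: flat parameter vector.
    return theta - eta * jax.grad(cost)(theta)

# Toy usage with a quadratic cost (a stand-in for the empirical cost C).
toy_cost = lambda t: 0.5 * jnp.sum(t ** 2)
theta = jnp.ones(4)
print(gd_step(toy_cost, theta, 0.1))   # each component shrinks by a factor (1 - eta)
```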

We are interested in the responses of $f$ and $\mathbb{C}$ with respect to the update:

$$\begin{aligned}\Delta f(t_0) &= f(t_0 + \Delta t) - f(t_0)\\&=f(x, \theta(t_0)+\Delta\theta(t_0)) - f(x, \theta(t_0))\end{aligned}$$

and

$$\Delta \mathbb{C}(t_0) = \frac{1}{N}\sum_i\mathcal{L}(f(x_i,\theta(t_0)+\Delta\theta(t_0)), y_i) - \frac{1}{N}\sum_i\mathcal{L}(f(x_i,\theta(t_0)), y_i).$$

In the limit $\eta \rightarrow 0^+$, $\Delta \theta_p \rightarrow 0$ (written as $\delta \theta_p$ instead). Thus we only need to consider the first-order derivatives with respect to $\theta_p$ in $\Delta f$ and $\Delta \mathbb{C}$. Namely,

$$\begin{aligned}\delta f(x,\theta(t_0)) &= \sum_p \partial _{\theta_p} f(x,\theta(t_0)) \times \delta\theta_p(t_0)\\&=-\eta \times\sum_p \partial _{\theta_p} f(x,\theta(t_0)) \times \left[\partial_f\mathbb{C}| _{f_0}*\partial _{\theta_p}f_0\right].\end{aligned}$$

Notice that the term $\partial_f\mathbb{C}| _{f_0}*\partial _{\theta_p}f_0$ is a real number that does not depend on $x$. The corresponding variation of $\mathbb{C}$ is given by

$$\begin{aligned}\delta\mathbb{C}| _{f_0} &= \partial_f \mathbb{C}| _{f_0} * \delta f(x, \theta(t_0))\\&=-\eta \times\sum_p \left[ \partial_f \mathbb{C}| _{f_0}* \partial _{\theta_p} f_0 \right] \times \left[\partial_f\mathbb{C}| _{f_0}*\partial _{\theta_p} f_0 \right]\\&=-\eta \times\sum_p \left[ \partial_f \mathbb{C}| _{f_0} * \partial _{\theta_p} f_0 \right] ^2.\end{aligned}$$

Now it is natural to group the terms

$$\sum_p \partial _{\theta_p} f(x, \theta) \otimes \partial _{\theta_p} f(x', \theta),$$

which just corresponds to the Neural Tangent Kernel $\Theta(x, x'; \theta)$. We use the tensor product $\otimes$ since, in the formulae, the two factors $\partial _{\theta_p} f$ act on different spaces.
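The grouping can also be checked numerically: for a small learning rate, the actual change of $f$ after one full-batch update should match the first-order prediction $\delta f(x) = -\frac{\eta}{N}\sum_i \Theta(x, x_i)\, d _{f_0}(x_i)$. The sketch below does this for an assumed toy network with squared loss, for which $d _{f_0}(x_i) = f_0(x_i) - y_i$.

```python
# Numerical check (assumed toy MLP, squared loss): after one full-batch GD step
# with a small eta, f(theta + Delta theta, x) - f(theta, x) should agree with
# the first-order NTK prediction -eta/N * sum_i Theta(x, x_i) (f(x_i) - y_i).
import jax
import jax.numpy as jnp

n_in, n_hidden, n_out, N = 3, 8, 2, 5

def f(theta, x):
    W1 = theta[: n_hidden * n_in].reshape(n_hidden, n_in)
    W2 = theta[n_hidden * n_in :].reshape(n_out, n_hidden)
    return W2 @ jnp.tanh(W1 @ x)

def cost(theta, X, Y):
    # C = (1/N) sum_i 0.5 * ||f(x_i, theta) - y_i||^2
    errs = jax.vmap(lambda x, y: f(theta, x) - y)(X, Y)
    return 0.5 * jnp.mean(jnp.sum(errs ** 2, axis=1))

theta = 0.5 * jax.random.normal(jax.random.PRNGKey(0), (n_hidden * n_in + n_out * n_hidden,))
X = jax.random.normal(jax.random.PRNGKey(1), (N, n_in))
Y = jax.random.normal(jax.random.PRNGKey(2), (N, n_out))

eta = 1e-4
x_test = jnp.ones(n_in)

# Actual change of f at x_test after one gradient-descent step.
delta_theta = -eta * jax.grad(cost)(theta, X, Y)
actual = f(theta + delta_theta, x_test) - f(theta, x_test)

# First-order NTK prediction, with d(x_i) = f(x_i) - y_i for the squared loss.
jac = lambda z: jax.jacobian(lambda t: f(t, z))(theta)     # shape (m, P)
d = jax.vmap(lambda x, y: f(theta, x) - y)(X, Y)           # shape (N, m)
predicted = -eta / N * sum(jac(x_test) @ jac(X[i]).T @ d[i] for i in range(N))

print(actual)
print(predicted)   # agrees with `actual` up to O(eta^2) corrections
```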

We see that, locally, the cost $\mathbb{C}$ is always non-increasing as long as we keep using the same set of training data for the update. This non-increasing guarantee is not particularly interesting: it is just another way of saying that, locally, (S)GD updates the model along a direction that decreases the cost $\mathbb{C}$, which follows directly from the definition of the gradient.

The above analysis only applies to cases in which $\theta$ undergoes a very small perturbation; in this sense, it only describes the first-order dynamics of the underlying model.

Remarks

  • The above analysis is discouraging, since it indicates that the Neural Tangent Kernel approach can only describe the first-order dynamics. To be practical, higher-order (at least second-order) terms should be included; for instance, one may consider a Neural Hessian Kernel;
    • of course, the Neural Tangent Kernel (NTK) approach may still be of interest in limiting cases, e.g. the infinite-width fully-connected network, where the NTK tends to a constant kernel.
  • There is another restriction in the derivation that makes the results quite impractical: it requires the dataset used at each update step to be fixed. In practice, the situation is totally different: the use of mini-batches introduces randomness into the learning process, making it more like a Markov chain. One actually needs stochastic integration to trace the trajectory of the model in the functional space during training.


“End? No, the journey doesn't end here.”