Implicit Regularization of Random Feature Models

and an explanation for variance explosion in the Double-Descent phenomenon

Random Feature (RF) models are used as computationally efficient approximations of kernel methods [1] and as a simple model for studying neural networks. Using tools from Random Matrix Theory (RMT), we find [2]

  1. a surprising implicit regularization effect that comes from the finite sampling of random features, and
  2. an explanation for the variance explosion at the interpolation threshold.

In this blog post, I will try to give the intuition behind these two main results without going into the wilderness of the random matrix theory analysis. 

Introduction & Setup

We are given a training dataset $ (x_i, y_i) \in \mathcal{X} \times \mathbb{R} $ of $ N $ samples for $ i = 1, \ldots, N $. Our goal is to find a predictor $ f : \mathcal{X} \to \mathbb{R} $ that mimics the true function $ f^{*} $ on points sampled from a distribution $ \mu $. We evaluate the generalization performance of $ f $ using the risk $$ L( f ) = \mathbb{E}_{ x \sim \mu } [ (f(x) - f^{*}(x))^2 ]. $$
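
For concreteness, here is a minimal sketch of how the risk can be estimated by Monte Carlo; the helper names (`risk`, `sample_mu`, `f_star`) are illustrative assumptions, not notation from the paper.

```python
import numpy as np

# Minimal sketch: estimate the risk L(f) by Monte Carlo, assuming we can sample
# test inputs from mu and evaluate both the predictor f and the target f_star.
def risk(f, f_star, sample_mu, n_test=10_000, seed=0):
    rng = np.random.default_rng(seed)
    x = sample_mu(n_test, rng)                  # test inputs x ~ mu
    return np.mean((f(x) - f_star(x)) ** 2)     # average squared error
```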

Kernel ridge regression is a learning algorithm that finds a predictor $ f^{(K)} $ using a similarity kernel $ K : \mathcal{X} \times \mathcal{X} \to \mathbb{R} $ and a regularization parameter (ridge) $ \lambda > 0 $ $$ f_\lambda^{(K)}(x) = K(x, X) (K(X, X) + \lambda I_N)^{-1} y $$ where $K(x, X)_i = K(x, x_i)$ and $K(X, X)_{ij} = K(x_i, x_j)$. The random feature method was introduced as an approximation of kernel regression that avoids the computational cost of the $ N \times N $ matrix inversion [1]. First, we sample $ P $ random features $ f^{(1)}, \ldots, f^{(P)} : \mathcal{X} \to \mathbb{R} $ with zero mean and covariance $ K $. Given the features, we find the minimizer of the regularized least squares problem $$ \begin{align} \hat{\theta} &= \arg\min_{\theta \in \mathbb{R}^P} \frac{1}{N} \sum_{i=1}^N \left( \frac{1}{\sqrt{P}} [f^{(1)}(x_i), \ldots, f^{(P)}(x_i)] \cdot \theta - y_i \right)^2 + \frac{\lambda}{N} \| \theta \|^2 \\ &= \arg\min_{\theta \in \mathbb{R}^P} \frac{1}{N} \left( \| F \theta - y \| ^2 + \lambda \| \theta \|^2 \right) \end{align} $$ where $ F $ is the $ N \times P $ data matrix with entries $ F_{ij} = \frac{1}{\sqrt{P}} f^{(j)}(x_i) $, $ y \in \mathbb{R}^N $ is the output vector, and $ \cdot $ denotes the Euclidean inner product. Instead of focusing on the optimal parameter $ \hat{\theta} = F^T (F F^T + \lambda I_N)^{-1} y $, we will focus on the optimal function that is parameterized by $ \hat{\theta} $ $$
f^{(RF)}(x) = \frac{1}{\sqrt{P}} \sum_{j=1}^P f^{(j)}(x) \hat{\theta}_j, $$ which we call the RF predictor. We assume that the features are Gaussian processes for the theoretical analysis, but we find empirically that the results carry over to other feature types, for example Fourier features [2].
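
To make the setup concrete, here is a minimal NumPy sketch (my own, not from the paper) that builds the kernel ridge predictor and an RF predictor on a toy sinusoidal dataset, using random Fourier features for an RBF kernel; the dataset, the kernel bandwidth, and the feature construction are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sinusoidal dataset of N points (illustrative, in the spirit of Figure 1).
N, lam, gamma = 6, 1e-3, 5.0
X = np.linspace(-1, 1, N)[:, None]               # training inputs, shape (N, 1)
y = np.sin(np.pi * X[:, 0])                      # training targets
x_test = np.linspace(-1, 1, 200)[:, None]        # test grid

def rbf_kernel(A, B):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

# Kernel ridge regression predictor f^(K)_lambda on the test grid.
K_train = rbf_kernel(X, X)
f_kernel = rbf_kernel(x_test, X) @ np.linalg.solve(K_train + lam * np.eye(N), y)

# Random Fourier features for the RBF kernel (Rahimi & Recht [1]):
# f^(j)(x) = sqrt(2) cos(w_j . x + b_j), w_j ~ N(0, 2*gamma*I), b_j ~ U[0, 2*pi];
# over the draw of (w_j, b_j) they have zero mean and covariance K.
P = 50
W = rng.normal(scale=np.sqrt(2 * gamma), size=(P, 1))
b = rng.uniform(0, 2 * np.pi, size=P)

def data_matrix(A):
    return np.sqrt(2.0) * np.cos(A @ W.T + b) / np.sqrt(P)   # F_ij = f^(j)(x_i) / sqrt(P)

F = data_matrix(X)                                            # N x P data matrix
theta_hat = F.T @ np.linalg.solve(F @ F.T + lam * np.eye(N), y)
f_rf = data_matrix(x_test) @ theta_hat                        # RF predictor on the test grid
```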

Moreover, a feed-forward neural network in which only the last layer is trained is also an RF predictor, so understanding the generalization behavior of RF predictors can pave the way toward understanding neural networks!
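
As a minimal illustration of this correspondence (my own sketch, assuming a one-hidden-layer ReLU network with Gaussian initialization): freezing the hidden layer and training only the readout with ridge regression is exactly the RF regression above, with the hidden activations playing the role of the random features. These ReLU features are not zero-mean Gaussian processes, so they fall outside the exact theoretical assumptions; the point here is only the correspondence in form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch: a one-hidden-layer network with a frozen random hidden layer.
# Training only the readout with ridge regression is RF regression where the
# hidden activations x -> relu(w_j . x + b_j) act as the random features.
N, P, d, lam = 6, 200, 1, 1e-3
X = np.linspace(-1, 1, N)[:, None]
y = np.sin(np.pi * X[:, 0])

W1 = rng.normal(size=(P, d)) / np.sqrt(d)        # frozen first-layer weights
b1 = rng.normal(size=P)                          # frozen biases

def data_matrix(A):
    return np.maximum(A @ W1.T + b1, 0.0) / np.sqrt(P)   # F_ij = f^(j)(x_i) / sqrt(P)

F = data_matrix(X)
theta_hat = F.T @ np.linalg.solve(F @ F.T + lam * np.eye(N), y)   # trained readout
f_nn = data_matrix(np.linspace(-1, 1, 200)[:, None]) @ theta_hat  # network output on a grid
```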

Let us start by looking at some basic properties of the RF predictor.

Basics of the RF Predictor

It is important to note that the RF predictor $ f^{(RF)} $ is a random function (i.e. a random process) and that the randomness comes from the sampling of the features $ f^{(1)}, \ldots, f^{(P)}$. Let us first see how the first two moments of the RF predictor behave for a sinusoidal dataset of $ N = 6 $ points, for three different kernels and various values of $ P $:

Figure 1. The mean (shown in black) and the variance ($\pm 2 $ standard deviations) of the RF predictor as $ P $ increases, for a small ridge $ \lambda = 10^{-3} $. Two data-dependent quantities, $\tilde{\lambda}$ and $\partial \tilde{\lambda}$, measure the amount of implicit regularization and the size of the variance respectively: as $ P $ increases, $\tilde{\lambda}$ decreases, indicating less implicit regularization; at the interpolation threshold $ P = N $, $ \partial \tilde{\lambda} $ explodes, tracking the variance explosion.

Nice… But how do we study the risk of the RF predictor? The good news is that, thanks to the following bias-variance decomposition,

$$ \mathbb{E} [L( f^{(RF)} )] = L( \mathbb{E} [f^{(RF)}] ) + \mathbb{E}_\mu [\text{Var}(f^{(RF)}(x))], $$

we do not need higher-order moments of the RF predictor for the (expected) risk!
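
As a sanity check, both sides of this decomposition can be estimated by resampling the features. A minimal sketch, assuming a helper `fit_rf(seed)` that returns RF predictions on a fixed test grid drawn from $ \mu $, and `f_star_test` holding the target values on that grid (both names are my own):

```python
import numpy as np

# Minimal sketch: estimate both sides of the bias-variance decomposition by
# resampling the random features. `fit_rf(seed)` is assumed to return the RF
# predictions on a fixed test grid sampled from mu; `f_star_test` holds the
# target values f*(x) on that grid.
def bias_variance(fit_rf, f_star_test, n_draws=200):
    preds = np.stack([fit_rf(seed) for seed in range(n_draws)])   # (n_draws, n_test)
    bias_term = np.mean((preds.mean(axis=0) - f_star_test) ** 2)  # L(E[f_RF])
    var_term = np.mean(preds.var(axis=0))                         # E_mu[Var(f_RF(x))]
    expected_risk = np.mean((preds - f_star_test) ** 2)           # E[L(f_RF)]
    return expected_risk, bias_term, var_term                     # risk = bias + variance
```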

Implicit Regularization Effect

In the classic limit where $ P \to \infty $ with $ N $ fixed, we know that the RF predictor is asymptotically equivalent to the kernel predictor [1]. However, for finite $ P $, the RF predictor remains random. Our key result is a fine approximation of the average (i.e. expected) RF predictor with ridge $ \lambda $ in terms of the kernel predictor with an effective ridge $ \tilde{\lambda} > \lambda $ $$ \mathbb{E}[ f_\lambda^{(RF)}(x)] \approx f_{\tilde{\lambda}}^{(K)}(x), $$ where $ \tilde{\lambda} $ decreases monotonically to $ \lambda $ as $ P $ increases ($ \tilde{\lambda} \searrow \lambda$). The emergence of a larger ridge $ \tilde{\lambda} $ for finite $ P $ is what we call the implicit regularization effect of random features!

As a side note, we use $ A \approx B $ to say that the difference between $ A $ and $ B $ is upper-bounded by a quantity that decreases to $ 0 $ as $ P $ grows.

Next, I will present a formula for the average predictor, give an explicit formula for the effective ridge, and explain where this magical formula comes from!

The average RF predictor in terms of the hat matrix $ A $

We will use the law of total expectation to study the average RF predictor in two steps:

  1. averaging over the realizations $ f^{(j)}(x) $ at test points $ x \neq x_i $, conditionally on the realizations on the training data, and
  2. averaging over the realizations $ f^{(j)}(x_i) $ on the training data $ x_1, \ldots, x_N $.

Following our strategy, we first average out the realizations on test points
$$ \mathbb{E}[f^{(RF)}(x) | F] = K(x, X) K(X, X)^{-1} F \hat{\theta} = K(x, X) K(X, X)^{-1} A y $$ using the conditional Gaussian formula [3], where $ A = F (F^T F + \lambda I_P)^{-1} F^T $ is the hat matrix, i.e. it puts a hat on the output vector: $ A y = \hat{y} $. Finally, we average out the realizations on the training data $$ \mathbb{E}[f^{(RF)}(x)] = K(x, X) K(X, X)^{-1} \mathbb{E} [A] y. $$
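
A one-line sanity check of the hat-matrix identity, in the same NumPy sketch style (placeholder Gaussian data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch: with any data matrix F and outputs y, the hat matrix A maps y to the
# fitted values F theta_hat (here F and y are placeholder Gaussian data).
N, P, lam = 6, 50, 1e-3
F = rng.normal(size=(N, P)) / np.sqrt(P)
y = rng.normal(size=N)

A = F @ np.linalg.solve(F.T @ F + lam * np.eye(P), F.T)            # hat matrix
theta_hat = np.linalg.solve(F.T @ F + lam * np.eye(P), F.T @ y)    # = F^T (F F^T + lam I_N)^{-1} y
assert np.allclose(A @ y, F @ theta_hat)                           # A "puts a hat" on y
```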

Figure 2. The empirical eigenvalues of the average hat matrix $ \mathbb{E} [A] $ (circles) for various values of $ P $ and $ \lambda $. We consider a diagonal kernel matrix $ K(X, X) $ with entries $ d_i = 2^{1-i} $ for $i = 1, \ldots, N $. We compare two approximations given by: (1) the classic one $\frac{d_i}{d_i + \lambda}$ (grey dashed line) and (2) $ P $-sensitive ones $ \frac{d_i}{d_i + \tilde{\lambda} } $ with the effective ridge $ \tilde{\lambda} $ (blue lines). $ \tilde{\lambda} $ decreases down to $ \lambda $ for big $ P $, where the two approximations agree.

The effective ridge and where it comes from

The average hat matrix $ \mathbb{E} [A] $ is well approximated by $ K(X, X) (K(X, X) + \tilde{\lambda} I_N)^{-1} $ even for small $ P $ (see Fig. 2), where $\tilde{\lambda}$ is the unique positive solution of the fixed-point equation $$ \tilde{\lambda} = \lambda + \frac{\tilde{\lambda}}{P}\sum_{i=1}^N \frac{d_i}{d_i + \tilde{\lambda}}, $$ where $d_1, \ldots, d_N$ are the eigenvalues of $ K(X, X) $. This formula looks mysterious at first sight if one is not familiar with RMT tools; at least that was my feeling six months ago! So, for the curious minds, I will present some intuition as to where the formula comes from.
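
As a minimal sketch (my own), $ \tilde{\lambda} $ can be computed numerically from the eigenvalues of the kernel matrix by simply iterating the fixed-point equation; the right-hand side is increasing and concave in $ \tilde{\lambda} $, so the iteration settles on the unique positive solution. The example spectrum below is the diagonal kernel of Figure 2.

```python
import numpy as np

# Sketch: solve the fixed-point equation for the effective ridge by simple
# iteration; d holds the eigenvalues of K(X, X).
def effective_ridge(d, P, lam, n_iter=2000):
    d = np.asarray(d, dtype=float)
    tl = lam + 1.0                     # any positive starting point works
    for _ in range(n_iter):
        tl = lam + (tl / P) * np.sum(d / (d + tl))
    return tl

# Example with the diagonal kernel of Figure 2: d_i = 2^{1-i}, i = 1..N.
N, lam = 10, 1e-3
d = 2.0 ** (1 - np.arange(1, N + 1))
print([round(effective_ridge(d, P, lam), 4) for P in [2, 5, 10, 20, 100]])
# the effective ridge decreases toward lam as P grows
```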

For the case of a diagonal kernel matrix, we expand the diagonal entries of $ A $ using the Sherman-Morrison formula and a rank-one perturbation lemma [4] $$ A_{ii} \approx \frac{\frac{1}{P}w_i^T (F^T F + \lambda I_P)^{-1} w_i}{1 + \frac{1}{P}w_i^T (F^T F + \lambda I_P)^{-1} w_i} \approx \frac{d_i \frac{1}{P} \mathrm{Tr} [(F^T F + \lambda I_P)^{-1}]}{1 + d_i \frac{1}{P} \mathrm{Tr}[(F^T F + \lambda I_P)^{-1}]} = \frac{d_i m_P}{1 + d_i m_P} $$

where $ m_P $ is the Stieltjes transform of $ F^T F $ evaluated at $ -\lambda $ $$ m_P = \frac{1}{P} \mathrm{Tr}[(F^T F + \lambda I_P)^{-1}] $$ and $ w_i $ is the $ i $-th row of $F$ multiplied by $ \sqrt{P} $. Using the simple identity $ \frac{1}{P} \mathrm{Tr}[A] + \frac{\lambda}{P} \mathrm{Tr}[(F^T F + \lambda I_P)^{-1}] = 1 $ and the above approximation for $ A_{ii} $, we find $$ \frac{1}{P} \sum_{i=1}^N \frac{d_i m_P}{1 + d_i m_P} + \lambda m_P \approx 1. $$ As a consequence, $ m_P $ converges to $ \tilde{m} $, the unique positive solution of $$ \frac{1}{P} \sum_{i=1}^N \frac{d_i \tilde{m}}{1 + d_i \tilde{m}} + \lambda \tilde{m} = 1. $$ In fact, we just found an approximation for the diagonal entries of the average hat matrix, $ \mathbb{E}[A_{ii}] \approx \frac{d_i \tilde{m} }{1 + d_i \tilde{m}} $ or, equivalently, $ \frac{d_i }{1 / \tilde{m} + d_i}$. We note that the off-diagonal terms $ A_{ij} $ have zero mean. Putting it all together, we find $ \mathbb{E}[f_{\lambda}^{(RF)}(x)] \approx f_{1 / \tilde{m}}^{(K)}(x)$, and we call the reciprocal of the limiting Stieltjes transform the effective ridge $ \tilde{\lambda} := 1 / \tilde{m} $.
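
This approximation is easy to check numerically for the diagonal kernel of Figure 2: average the hat matrix over many feature draws and compare its diagonal to $ d_i / (d_i + \tilde{\lambda}) $. A minimal sketch (Gaussian features with diagonal covariance, illustrative sample sizes):

```python
import numpy as np

rng = np.random.default_rng(3)

# Sketch: diagonal kernel d_i = 2^{1-i}; average the hat matrix over feature
# draws and compare its diagonal to d_i / (d_i + effective ridge).
N, P, lam, n_draws = 10, 20, 1e-3, 2000
d = 2.0 ** (1 - np.arange(1, N + 1))

A_mean = np.zeros((N, N))
for _ in range(n_draws):
    # Gaussian features: f^(j)(x_i) ~ N(0, d_i), independent over i and j.
    F = np.sqrt(d)[:, None] * rng.normal(size=(N, P)) / np.sqrt(P)
    A_mean += F @ np.linalg.solve(F.T @ F + lam * np.eye(P), F.T) / n_draws

tl = lam + 1.0                         # effective ridge via the fixed-point iteration
for _ in range(2000):
    tl = lam + (tl / P) * np.sum(d / (d + tl))

print(np.round(np.diag(A_mean), 3))
print(np.round(d / (d + tl), 3))       # the two rows should roughly agree
```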

An Explanation for the Variance Explosion

The variance of the RF predictor is more challenging to study than the average. Here I will only explain the variance explosion at the interpolation threshold $ P = N $ in the ridgeless case, using the derivative of the effective ridge $ \partial \tilde{\lambda} $ (with respect to $ \lambda $). Let us first observe the double-descent phenomenon in function space (see Fig. 1): $ \partial \tilde{\lambda} $ grows as $ P $ increases up to $ N $, then starts decreasing as soon as $ P $ exceeds $ N $, and keeps decreasing down to $ 1 $ for large $ P $, tracking the variance of the RF predictor.
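
To see this numerically, $ \partial \tilde{\lambda} $ can be obtained from the fixed-point equation of the previous section: implicitly differentiating it with respect to $ \lambda $ gives, as a sketch of my own (not quoted from the paper), $ \partial \tilde{\lambda} = \big(1 - \tfrac{1}{P}\sum_{i=1}^N (d_i/(d_i+\tilde{\lambda}))^2\big)^{-1} $, which indeed blows up at $ P = N $ in the ridgeless limit since then $ \tilde{\lambda} \to 0 $ and every term in the sum tends to $ 1 $.

```python
import numpy as np

# Sketch: compute the effective ridge and its derivative for a range of P
# (diagonal kernel d_i = 2^{1-i}, small ridge), and watch the derivative peak
# near the interpolation threshold P = N.
def effective_ridge_and_derivative(d, P, lam, n_iter=5000):
    tl = lam + 1.0
    for _ in range(n_iter):
        tl = lam + (tl / P) * np.sum(d / (d + tl))         # fixed-point iteration
    dtl = 1.0 / (1.0 - np.sum((d / (d + tl)) ** 2) / P)    # implicit differentiation
    return tl, dtl

N, lam = 10, 1e-4
d = 2.0 ** (1 - np.arange(1, N + 1))
for P in [2, 5, 9, 10, 11, 20, 100]:
    tl, dtl = effective_ridge_and_derivative(d, P, lam)
    print(P, round(tl, 5), round(dtl, 2))                  # dtl is largest around P = N
```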

Now let us see why $ \partial \tilde{\lambda} $ is an important quantity for studying the variance. First, we split the variance of the RF predictor into two components using the law of total variance $$ \text{Var}(f^{(RF)}(x)) = \mathbb{E} [\text{Var}(f^{(RF)}(x) | F)] + \text{Var} (\mathbb{E}[f^{(RF)}(x) | F]). $$ Since the second term is non-negative, we can lower-bound the total variance by the first term. Using the conditional Gaussian formula, we have $$ \mathbb{E} [\text{Var}(f^{(RF)}(x) | F)] = \frac{\mathbb{E} [ \| \hat{\theta} \|^2 ]}{P} \tilde{K}(x, x) $$ where $ \tilde{K}(x, x') = K(x, x') - K(x, X) K(X, X)^{-1} K(X, x')$. The important factor here is the squared norm of the optimal parameter $\hat{\theta}$, that is, $$ \mathbb{E} [\| \hat{\theta} \|^2] = y^T \mathbb{E} [F (F^T F + \lambda I_P)^{-2} F^T] y. $$
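
A minimal Monte Carlo sketch of this factor: resampling Gaussian features for the same diagonal kernel and averaging $ \| \hat{\theta} \|^2 $ shows the blow-up near $ P = N $ for a small ridge (placeholder targets $ y $ and sample sizes are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(4)

# Sketch: estimate E[||theta_hat||^2] by resampling Gaussian features with
# diagonal covariance d_i = 2^{1-i}; it peaks near the threshold P = N when
# the ridge is small.
N, lam, n_draws = 10, 1e-4, 500
d = 2.0 ** (1 - np.arange(1, N + 1))
y = rng.normal(size=N)                            # placeholder targets

for P in [5, 9, 10, 11, 20, 100]:
    norms = []
    for _ in range(n_draws):
        F = np.sqrt(d)[:, None] * rng.normal(size=(N, P)) / np.sqrt(P)
        theta = np.linalg.solve(F.T @ F + lam * np.eye(P), F.T @ y)
        norms.append(theta @ theta)               # = y^T F (F^T F + lam I)^{-2} F^T y
    print(P, round(float(np.mean(norms)), 2))     # largest near P = N
```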

At this point, let us stop for a moment and ask why $ \mathbb{E} [F (F^T F + \lambda I_P)^{-2} F^T] $ looks familiar. It is the negative of the derivative (with respect to $ \lambda $) of the average hat matrix $ \mathbb{E} [A] $, so its eigenvalues can be approximated by $ - \partial \left(\frac{d_i}{ d_i + \tilde{\lambda} } \right) = \partial \tilde{\lambda} \left(\frac{ d_i}{ (d_i + \tilde{\lambda})^2 } \right). $ We now notice that $ \partial \tilde{\lambda} $ appears as a common factor in all eigenvalues of $ \mathbb{E} [F (F^T F + \lambda I_P)^{-2} F^T] $.

In the paper [2], we show that at the interpolation threshold $ \partial \tilde{\lambda} $ explodes in the ridgeless case but not in the ridge case, suggesting that it can serve as a data-dependent indicator of the presence or absence of the double-descent behavior.

Summary

We find that, for finite $ P $, the average RF predictor with ridge $ \lambda $ is approximated by a kernel predictor with the effective ridge $ \tilde{\lambda} $, for which we give explicit bounds in the paper [2]. This means that, in practice, one performs kernel regression with a larger ridge $ \tilde{\lambda} $ on average when using random features, revealing an implicit regularization effect of finite sampling. Finally, the derivative of the effective ridge $ \partial \tilde{\lambda} $ tracks the variance of the RF predictor, giving a mathematical explanation for the variance explosion in the double-descent phenomenon for arbitrary datasets.

Acknowledgement

Big thanks to François for being the first one who read my blog and for giving valuable feedback. Thank you Isak for introducing me to Hugo that made blog writing very much enjoyable! Huge thanks to the co-authors of the paper for their detailed and precious feedback: Clément, Arthur, Francesco, and Franck!


  1. A. Rahimi and B. Recht. “Random features for large-scale kernel machines.” Advances in Neural Information Processing Systems (NeurIPS), 2008.

  2. A. Jacot*, B. Şimşek*, F. Spadaro, C. Hongler, F. Gabriel. “Implicit regularization of random feature models.” Proceedings of the International Conference on Machine Learning (ICML), 2020.

  3. https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions

  4. J. Silverstein and Z. D. Bai. “On the empirical distribution of eigenvalues of a class of large dimensional random matrices.” Journal of Multivariate Analysis, 1995.
