Distillation | Şimşek

A Neural Net Model for Distillation with Weights Explained

To make large models efficient and safe, it is important to understand how they represent knowledge. We study a toy model of neural nets that exhibits non-linear dynamics and a phase transition. Although the model is complex, it admits a family of so-called 'copy-average' critical points of the loss. For networks up to a certain width, gradient flow initialized with random weights consistently converges to one such critical point, which we prove to be optimal among all copy-average points. Moreover, we can explain every neuron of a trained neural network of any width. As the width grows, the network changes its compression strategy and exhibits a phase transition. We close by listing open questions that call for further mathematical analysis and for extensions of the model considered here.

Should Under-parameterized Student Networks Copy or Average Teacher Weights?

Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a …