Distillation

Should Under-parameterized Student Networks Copy or Average Teacher Weights?

Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a …