guest @ Flatiron CCM

Hi there! I am a Faculty Fellow at New York University (CDS) and a guest researcher at Flatiron (CCM). My research centers on developing mathematical theories for modern machine learning. Notably, I introduced a novel method to study the loss landscape complexity of neural nets. These days I am analyzing end-to-end learning dynamics in solvable models and interpreting both the learning process and the resulting weights. I am also co-teaching a Machine Learning course.

Previously, I was a doctoral researcher at EPFL in Computer and Communication Sciences, supervised by Clément Hongler and Wulfram Gerstner. During my Ph.D., I did a research internship at Meta AI.

Before my Ph.D., I completed a double major in Electrical-Electronics Engineering and Mathematics at Koç University. I also earned two bronze medals at the International Mathematical Olympiad (IMO). I am an active Argentine Tango dancer and I enjoy yoga.

See my Google Scholar page for an up-to-date list of publications.


Deep Linear Networks Dynamics: Low-Rank Biases Induced by Initialization Scale and L2 Regularization

Large models in deep learning are powerful. It is important to understand how they represent knowledge in order to make them efficient and safe. We study a toy model of neural nets that exhibits non-linear dynamics and a phase transition. Although the model is complex, we find a family of 'copy-average' critical points of the loss. Gradient flow initialized with random weights consistently converges to one such critical point for networks up to a certain width, and we prove that this point is optimal among all copy-average points. Moreover, we can explain every neuron of a trained network of any width. As the width grows, the network changes its compression strategy and exhibits a phase transition. We close by listing open questions calling for further mathematical analysis and extensions of the model considered here.
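The flavor of this setup can be illustrated with a minimal teacher-student sketch. This is not the paper's actual model: I use a 1-D input, `tanh` as a stand-in activation, and plain gradient descent as a discrete surrogate for gradient flow. An under-parameterized student (width k = 2) is trained on a wider teacher (width n = 4), the regime where such compressing critical points arise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher: sum of n = 4 tanh units in 1-D (illustrative choice, not the
# paper's exact architecture or activation).
x = rng.standard_normal(256)                  # input samples
w_teacher = np.array([0.5, 1.0, 1.5, 2.0])    # teacher inner weights
y = np.tanh(np.outer(x, w_teacher)).sum(axis=1)

# Under-parameterized student: k = 2 units, random initialization.
k = 2
w = 0.5 * rng.standard_normal(k)              # student inner weights
a = 0.5 * rng.standard_normal(k)              # student outer weights

lr, losses = 0.02, []
for _ in range(2000):
    h = np.tanh(np.outer(x, w))               # hidden activations, (256, k)
    f = h @ a                                 # student output
    r = f - y                                 # residual
    losses.append(np.mean(r ** 2))
    # Exact gradients of the mean-squared loss w.r.t. a and w.
    grad_a = 2 * (r @ h) / len(x)
    grad_w = 2 * ((r * x) @ (a * (1 - h ** 2))) / len(x)
    a -= lr * grad_a
    w -= lr * grad_w
```

Since k < n, the student cannot fit the teacher exactly; the loss plateaus at a positive value set by whichever critical point the dynamics select, which is the kind of behavior the analysis above characterizes precisely.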

We discuss how the effective ridge reveals the implicit regularization effect of finite sampling in random features. The derivative of the effective ridge tracks the variance of the optimal predictor, yielding an explanation for the variance explosion at the interpolation threshold for arbitrary datasets.
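As a sketch of how the effective ridge can be computed in practice, the snippet below solves a self-consistent fixed-point equation of the form found in the random-features literature (the exact equation here is my assumption, stated in the docstring, not a quote from the paper): the effective ridge exceeds the explicit ridge, and the gap shrinks as the number of features grows.

```python
import numpy as np

def effective_ridge(eigs, lam, P, tol=1e-12, max_iter=10_000):
    """Solve for the effective ridge lam_eff by fixed-point iteration.

    Assumed self-consistent equation (my reconstruction from the
    random-features literature, treat as an assumption):

        lam_eff = lam + (lam_eff / P) * sum_i d_i / (d_i + lam_eff)

    where d_i are the eigenvalues of the data Gram matrix, P is the
    number of random features, and lam > 0 is the explicit ridge.
    The map is a contraction when P exceeds the number of data points.
    """
    lam_eff = lam
    for _ in range(max_iter):
        new = lam + (lam_eff / P) * np.sum(eigs / (eigs + lam_eff))
        if abs(new - lam_eff) < tol:
            return new
        lam_eff = new
    return lam_eff
```

For a fixed spectrum, increasing `P` drives the effective ridge back down toward the explicit ridge `lam`, which is one way to read the implicit regularization of finite sampling: fewer features act like a larger ridge.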