We have little mechanistic understanding of how deep learning models overfit to their training data, even though memorization is a central problem. Here we extend our previous work on toy models to shed light on how models generalize beyond their training data.
https://transformer-circuits.pub/2023/toy-double-descent/index.html
For small training sets, models use superposition to memorize more data points than they have hidden neurons (just two, in our setup). For large training sets, models instead learn features in superposition, as observed in our previous work, allowing them to generalize.
https://transformer-circuits.pub/2022/toy_model/index.html
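For concreteness, here is a minimal PyTorch sketch of the kind of setup involved. The hyperparameters, dataset construction, and initialization below are illustrative assumptions rather than the papers' exact configuration; the essential ingredients are sparse synthetic feature vectors, a bottleneck of two hidden neurons, and a ReLU output layer trained to reconstruct the features.

```python
import torch

# Hypothetical hyperparameters (assumptions, not the paper's exact values):
# many sparse features, a hidden layer of just m = 2 neurons, ReLU output.
n_features, m_hidden, sparsity = 10_000, 2, 0.999

def make_dataset(num_points, seed=0):
    # Each data point is a sparse vector of feature activations in [0, 1].
    g = torch.Generator().manual_seed(seed)
    x = torch.rand(num_points, n_features, generator=g)
    mask = torch.rand(num_points, n_features, generator=g) > sparsity
    return x * mask

class ToyModel(torch.nn.Module):
    # ReLU-output toy model: x -> h = W x -> ReLU(W^T h + b).
    def __init__(self):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(m_hidden, n_features) * 0.02)
        self.b = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = x @ self.W.T                        # project into the 2-dim hidden space
        return torch.relu(h @ self.W + self.b)  # reconstruct the features
```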
Models struggle to transition between these strategies, as shown by a spike in test loss. This spike moves to larger datasets as model capacity increases. This is a clear signature of double descent, a phenomenon now well known in the ML literature.
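A sweep over training-set size is the natural way to look for this spike. The sketch below reuses the hypothetical `ToyModel` and `make_dataset` from above; the dataset sizes, optimizer, and training budget are again assumptions, not the papers' settings.

```python
def train(model, x_train, steps=5_000, lr=1e-3):
    # Train on one *fixed* dataset of the given size (full-batch MSE).
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(x_train) - x_train) ** 2).mean()
        loss.backward()
        opt.step()
    return model

x_test = make_dataset(2_048, seed=1)  # freshly sampled points, never trained on

for num_train in [4, 16, 64, 256, 1024, 4096]:
    x_train = make_dataset(num_train, seed=2)
    model = train(ToyModel(), x_train)
    with torch.no_grad():
        test_loss = ((model(x_test) - x_test) ** 2).mean().item()
    print(f"train size {num_train:5d}   test loss {test_loss:.4f}")
```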
We hope these results are a step towards a mechanistic theory of memorization. There are many open questions, such as understanding the loss spike, or what happens when only a subset of the data is repeated.
Thanks to Adam Jermyn for his comments reproducing and extending these results!
https://transformer-circuits.pub/2023/toy-double-descent/index.html#comment-jermyn-1