We have little mechanistic understanding of how deep learning models overfit to their training data, even though memorization is a central problem. Here we extend our previous work on toy models to shed light on how models generalize beyond their training data.
https://transformer-circuits.pub/2023/toy-double-descent/index.html
For small training sets, models use superposition to memorize more data points than they have hidden neurons (just two, in our setup). For large training sets, models instead learn features in superposition, as observed in our previous work, allowing them to generalize.
https://transformer-circuits.pub/2022/toy_model/index.html
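For concreteness, here is a minimal PyTorch sketch of the kind of setup involved. The hyperparameters, dataset construction, and initialization below are illustrative assumptions rather than the papers' exact configuration; the essential ingredients are sparse synthetic feature vectors, a bottleneck of two hidden neurons, and a ReLU output layer trained to reconstruct the features.

```python
import torch

# Hypothetical hyperparameters (assumptions, not the paper's exact values):
# many sparse features, a hidden layer of just m = 2 neurons, ReLU output.
n_features, m_hidden, sparsity = 10_000, 2, 0.999

def make_dataset(num_points, seed=0):
    # Each data point is a sparse vector of feature activations in [0, 1].
    g = torch.Generator().manual_seed(seed)
    x = torch.rand(num_points, n_features, generator=g)
    mask = torch.rand(num_points, n_features, generator=g) > sparsity
    return x * mask

class ToyModel(torch.nn.Module):
    # ReLU-output toy model: x -> h = W x -> ReLU(W^T h + b).
    def __init__(self):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(m_hidden, n_features) * 0.02)
        self.b = torch.nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = x @ self.W.T                        # project into the 2-dim hidden space
        return torch.relu(h @ self.W + self.b)  # reconstruct the features
```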
Models struggle to transition between these strategies, as shown by a spike in test loss. This spike moves to larger datasets as model capacity increases. This is a clear signature of double descent, a phenomenon now well known in the ML literature.
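A sweep over training-set size is the natural way to look for this spike. The sketch below reuses the hypothetical `ToyModel` and `make_dataset` from above; the dataset sizes, optimizer, and training budget are again assumptions, not the papers' settings.

```python
def train(model, x_train, steps=5_000, lr=1e-3):
    # Train on one *fixed* dataset of the given size (full-batch MSE).
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(x_train) - x_train) ** 2).mean()
        loss.backward()
        opt.step()
    return model

x_test = make_dataset(2_048, seed=1)  # freshly sampled points, never trained on

for num_train in [4, 16, 64, 256, 1024, 4096]:
    x_train = make_dataset(num_train, seed=2)
    model = train(ToyModel(), x_train)
    with torch.no_grad():
        test_loss = ((model(x_test) - x_test) ** 2).mean().item()
    print(f"train size {num_train:5d}   test loss {test_loss:.4f}")
```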
We hope these results are a step towards a mechanistic theory of memorization. There are many open questions, such as understanding the loss spike, or what happens when only a subset of the data is repeated.
Thanks to Adam Jermyn for his comments reproducing and extending these results!
https://transformer-circuits.pub/2023/toy-double-descent/index.html#comment-jermyn-1