Grokking Machine Learning

Thanks to Ardavan Saeedi, Crystal Qian, Emily Reif, Fernanda Viégas, Kathy Meier-Hellstern, Mahima Pushkarna, Minsuk Chang, Neel Nanda and Ryan Mullins for their help with this piece.

The model’s architecture is similarly simple: $\text{ReLU}\left(\mathbf{a}_{\text{one-hot}} \mathbf{W}_{\text{input}} + \mathbf{b}_{\text{one-hot}} \mathbf{W}_{\text{input}}\right) \mathbf{W}_{\text{output}}$ (a short code sketch of this forward pass is given below).

Do more complex models also suddenly generalize after they’re trained longer? Large language models can certainly seem like they have a rich understanding of the world, but they might just be regurgitating memorized bits of the enormous amount of text they’ve been trained on. How can we tell if they’re generalizing or memorizing?

While we now have a solid understanding of the mechanisms a one-layer MLP uses to solve modular addition and why they emerge during training, there are still many interesting open questions about memorization and generalization.

Which Model Constraints Work Best?

There have also been promising results in predicting grokking before it happens. While some approaches require knowledge of the generalizing solution or the overall data domain, others rely solely on analysis of the training loss and might also apply to larger models — hopefully we’ll be able to build tools and techniques that can tell us when a model is parroting memorized information and when it’s using richer models.

Figuring out both of these questions simultaneously is hard. Let’s make an even simpler task, one where we know what the generalizing solution should look like, and try to understand why the model eventually learns it.

In this article we’ll examine the training dynamics of a tiny model and reverse engineer the solution it finds – and in the process provide an illustration of the exciting emerging field of mechanistic interpretability. While it isn’t yet clear how to apply these techniques to today’s largest models, starting small makes it easier to develop intuitions as we progress towards answering these critical questions about large language models.

Grokking Modular Addition

The periodic patterns suggest the model is learning some sort of mathematical structure; the fact that it happens when the model starts to solve the test examples hints that it’s related to the model generalizing. But why does the model move away from the memorizing solution? And what is the generalizing solution?

Generalizing With 1s and 0s
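For concreteness, here is a minimal NumPy sketch of the forward pass written out above. The modulus of 67 comes from the modular addition task discussed in this piece; the hidden width, the random placeholder weights, and the helper names one_hot and forward are illustrative assumptions rather than details from the original model.

```python
import numpy as np

P = 67        # modulus of the modular addition task (a + b mod 67)
HIDDEN = 48   # hidden width; an illustrative choice, not necessarily the original model's

rng = np.random.default_rng(0)
W_input = rng.normal(0.0, 0.1, size=(P, HIDDEN))    # shared input embedding for a and b
W_output = rng.normal(0.0, 0.1, size=(HIDDEN, P))   # maps hidden activations to logits

def one_hot(i, n=P):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def forward(a, b):
    """ReLU(a_one_hot @ W_input + b_one_hot @ W_input) @ W_output"""
    hidden = np.maximum(0.0, one_hot(a) @ W_input + one_hot(b) @ W_input)
    return hidden @ W_output  # one logit per possible answer 0..P-1

print(forward(3, 5).shape)  # (67,)
```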

Directly training the model visualized above — $\text{ReLU}\left(\mathbf{a}_{\text{one-hot}} \mathbf{W}_{\text{input}} + \mathbf{b}_{\text{one-hot}} \mathbf{W}_{\text{input}}\right) \mathbf{W}_{\text{output}}$ — does not actually result in generalization on modular arithmetic, even with the addition of weight decay. At least one of the matrices has to be factored.

We can induce memorization and generalization on this somewhat contrived 1s and 0s task — but why does it happen with modular addition? Let’s first understand a little more about how a one-layer MLP can solve modular addition by constructing a generalizing solution that’s interpretable.

Modular Addition With Five Neurons

We observed that the generalizing solution is sparse after taking the discrete Fourier transform, but the collapsed matrices have high norms. This suggests that direct weight decay on $\mathbf{W}_{\text{output}}$ and $\mathbf{W}_{\text{input}}$ doesn’t provide the right inductive bias for the task.
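“Sparse after taking the discrete Fourier transform” can be made concrete with a quick check. The trained weights aren’t reproduced in this text, so the sketch below uses stand-ins: a hand-built $\mathbf{W}_{\text{input}}$ whose columns are single-frequency sinusoids over the 67 input ids (generalizing-style structure) and an unstructured random matrix (memorizing-style); the hidden width, the number of frequencies, and the concentration measure are illustrative assumptions.

```python
import numpy as np

P = 67        # modulus of the task
HIDDEN = 24   # illustrative hidden width, not the original model's exact value
rng = np.random.default_rng(0)

# Stand-in for a generalizing W_input: each column is a single sinusoid over the
# 67 input ids, drawn from a handful of shared frequencies with random phases.
ids = np.arange(P)
freqs = rng.choice(np.arange(1, P // 2), size=4, replace=False)
W_gen = np.stack(
    [np.cos(2 * np.pi * freqs[k % len(freqs)] * ids / P + rng.uniform(0, 2 * np.pi))
     for k in range(HIDDEN)], axis=1)                    # shape (P, HIDDEN)

# Stand-in for a memorizing solution: unstructured random weights.
W_mem = rng.normal(size=(P, HIDDEN))

def spectral_concentration(W, top_k=4):
    """Average fraction of each column's spectral energy held by its top_k frequencies."""
    spectrum = np.abs(np.fft.rfft(W, axis=0))            # DFT over the input-id axis
    energy = spectrum ** 2
    top = np.sort(energy, axis=0)[-top_k:].sum(axis=0)
    return float((top / energy.sum(axis=0)).mean())

print("periodic stand-in:", spectral_concentration(W_gen))  # near 1.0: sparse in frequency space
print("random stand-in:  ", spectral_concentration(W_mem))  # much lower: energy spread out
```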

Does grokking happen in larger models trained on real-world tasks? Earlier observations reported the grokking phenomenon in algorithmic tasks in small transformers and MLPs. Grokking has subsequently been found in more complex tasks involving images, text, and tabular data within certain ranges of hyperparameters. It’s also possible that the largest models, which are able to do many types of tasks, may be grokking many things at different speeds during training.

Recall that our modular arithmetic problem $a + b \bmod 67$ is naturally periodic, with answers wrapping around if the sum ever passes 67. Mathematically, this can be mirrored by thinking of the sum as wrapping $a$ and $b$ around a circle (a short numeric check of this picture appears at the end of this piece). The weights of the generalizing model also had periodic patterns, indicating that the solution might use this property.

Understanding the solution to modular addition wasn’t trivial. Do we have any hope of understanding larger models? One route forward — like our digression into the 20-parameter model and the even simpler boolean parity problem — is to: 1) train simpler models with more inductive biases and fewer moving parts, 2) use them to explain inscrutable parts of how a larger model works, and 3) repeat as needed. We believe this could be a fruitful approach to better understanding larger models, complementary both to efforts that aim to use larger models to explain smaller ones and to other work on disentangling internal representations. Moreover, this kind of mechanistic approach to interpretability may, in time, help identify patterns that themselves ease or automate the uncovering of algorithms learned by neural networks.
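Because the argument leans on the picture of wrapping $a$ and $b$ around a circle, here is a tiny self-contained check of that arithmetic fact (it is not the learned circuit itself, just the identity such a circuit can exploit): placing each residue at the angle $2\pi x / 67$ and adding angles reproduces $a + b \bmod 67$.

```python
import numpy as np

P = 67  # modulus from the text: a + b mod 67

def to_circle(x):
    """Place residue x on the unit circle at angle 2*pi*x / P."""
    return np.exp(2j * np.pi * x / P)

def add_on_circle(a, b):
    """Add a and b by composing rotations, then read the angle back off as a residue."""
    angle = np.angle(to_circle(a) * to_circle(b))      # angles add when unit complex numbers multiply
    return int(np.rint(angle * P / (2 * np.pi))) % P   # map the angle back to 0..P-1

# The circle picture agrees with ordinary modular addition for every input pair.
assert all(add_on_circle(a, b) == (a + b) % P for a in range(P) for b in range(P))
print("adding angles on the circle reproduces (a + b) mod", P)
```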


