The results of the experiment seem counterintuitive only because the learning rates used are huge (up to 10 or even 100). These are not the learning rates you would use in a normal setting. If you look at the region of small learning rates, it seems all of them converge.
So I would say the experiment is interesting, but not representative of real world deep learning.
In the experiment, you have a function of 272 variables with a lot of minima and maxima, and at each gradient descent step you take huge steps (because of the big learning rate). So my intuition is that convergence is more a matter of luck than of hyperparameters.
If convergence were a matter of luck, it would look completely different, like white noise, but it clearly has well-defined structure.
The reason for the high learning rates is that they used full-batch training (see the first cell in https://colab.research.google.com/github/Sohl-Dickstein/frac...), and when batch sizes are large, learning rates typically can be large as well. Plus, as others said, it's more of a toy problem; it would be hard to get this level of detail on anything non-toy.
Yes, the author is clear in this regard. But I've seen people interpreting this paper as "training deep networks is chaotic", and I don't think that's the case. I interpret it more as "if you're not careful with your learning rate, your training will be chaotic".
Yes. I immediately had visions of a misconfigured PID controller or similar chaotic emergence encountered in control theory. I was surprised and delighted though, that this type of chaos has so much fractal beauty.
The way the author defines the best hyperparameters is also a bit odd. To assign a "score" to each hyperparameter pair, the author sums all the losses during training (i.e. score = Σ_{t=0}^{T} l_t) instead of just taking the value of the loss at the last step (score = l_T). This makes the converging solutions with high learning rates appear better, since their losses are lower from the beginning (so the sum is smaller), but it doesn't mean that their final loss is better than that of runs with smaller learning rates.
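To see how the two scoring rules can disagree, here is a minimal sketch on a toy 1-D quadratic loss (a stand-in for the paper's training runs, not the actual network; the learning-rate values are made up for illustration). The sum-of-losses score is dominated by the early part of training, so the fast run wins by a wide margin even though both runs end at essentially the same final loss:

```python
import numpy as np

def gradient_descent_losses(lr, steps=1000):
    # Gradient descent on the toy loss l(w) = w^2, starting from w = 1.
    w = 1.0
    losses = []
    for _ in range(steps):
        losses.append(w * w)
        w -= lr * 2 * w  # gradient of w^2 is 2w
    return np.array(losses)

fast = gradient_descent_losses(lr=0.4)   # big (but stable) steps
slow = gradient_descent_losses(lr=0.01)  # tiny steps

# Sum-of-losses score: strongly favors the fast run,
# because its losses are small from the very first steps.
print("sum score:  ", fast.sum(), "vs", slow.sum())

# Final-loss score: both runs have converged to ~0 by now,
# so this score barely distinguishes them.
print("final score:", fast[-1], "vs", slow[-1])
```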
"Some fractals -- for instance those associated with the Mandelbrot and quadratic Julia sets -- are computed by iterating a function, and identifying the boundary between hyperparameters for which the resulting series diverges or remains bounded. Neural network training similarly involves iterating an update function (e.g. repeated steps of gradient descent), can result in convergent or divergent behavior, and can be extremely sensitive to small changes in hyperparameters. Motivated by these similarities, we experimentally examine the boundary between neural network hyperparameters that lead to stable and divergent training. We find that this boundary is fractal over more than ten decades of scale in all tested configurations."
Contains several cool animations zooming in to show the fractal boundary between convergent and divergent training, just like the classic Mandelbrot and Julia set animations.
I find this result absolutely fascinating, and it is exactly the type of research into neural networks we should be expanding.
We've rapidly engineered our way to some very impressive models this past decade, and yet the gap in our real understanding of what's going on has widened. There's a long list of very basic questions about LLMs that we haven't answered (or in some cases, really asked). This is not a failing of the people researching in this area; it's only that things move so quickly that there's not enough time to ponder things like this.
At the same time, the result, unless I'm really misunderstanding, gives me the impression that anything other than grid-search hyperparameter optimization is a fool's errand. This would give credence to the notion that hyperparameter tuning really is akin to re-rolling a character sheet until you get one that is overpowered.
>exactly the type of research into neural networks we should be expanding.
While it certainly makes for some nice visualizations, the technical insight here is pretty limited. First of all, this fractal structure emerges at learning rates that are far higher than those used in training actual neural networks nowadays. It's interesting that training still converges for some combinations and that the (expected) hit-and-miss behavior yields a fractal structure. But if you look closely at the images, you'll see the best hyperparameters are close to the border, yet not on it. So even if you want to follow the meta-learning approach outlined in the post, your gradient descent has already gone wrong before it ever ends up in this fractal boundary region.
There is an expanding field of study looking at machine learning with statistical physics tools. While there is still a lot of work to do in this area, it yields interesting insights on neural networks, e.g. linking their training with the evolution of spin glasses (a typical statistical physics problem). We can even talk about phase transition and universal exponents.
Most of the research is done with simpler models though (because mainly math people do it, and it's hard to prove anything on something as complex as a transformer).
> gives me the impression that anything other than grid-search hyperparameter optimization is a fool's errand. This would give credence to the notion that hyperparameter tuning really is akin to re-rolling a character sheet until you get one that is overpowered.
The visualizations only show that there is a lot of fractal structure at the border between convergence and divergence, not in every part of the space. (Although the highest performance is often achieved close to the border.) I would not call hyperparameter search as bad as that.
What was the name of the Google app (Dreaming or some such) that would iterate frames similar to this, finding lizards/snakes/dogs/eyes, and getting super trippy the longer it ran? The demos would start with RGB noise, and within a few iterations it was a full-on psychedelic trip. It was the best visual of AI "hallucinating" I've seen yet.
Yes. It all started with a leaked image on Reddit. Even though at that point it was just a rumour that it was generated by a neural net, it was so strikingly unlike anything else that it caused a huge stir. Wish I could find that image again, it was a fairly noisy "slugdog" thing.
If you are a fan of the fractals but feel intimidated by neural networks, the networks used here are actually pretty simple and not so difficult to understand if you are familiar with matrix multiplication. To generate a dataset, he samples random vectors (say of size 8) as inputs, and for each vector a target output, which is a single number. The network consists of an 8x8 matrix and an 8x1 matrix, both randomly initialized.
To generate an output from an input vector, you just multiply by your 8x8 matrix (getting a new size 8 vector), apply the tanh function to each element (look up a plot of tanh - it just squeezes its inputs to be between -1 and 1), and then multiply by the 8x1 matrix, getting a single value as an output. The elements of the two matrices are the 'weights' of the neural network, and they are updated to push the output we got towards the target.
When we update our weights, we have to decide on a step size - do we make just a little tiny nudge in the right direction, or take a giant step? The plots are showing what happens if we choose different step sizes for the two matrices ("input layer learning rate" is how big of a step we take for the 8x8 matrix, and "output layer learning rate" for the 8x1 matrix).
If your steps are too big, you run into a problem. Imagine trying to find the bottom of a parabola by taking steps in the direction of downward slope - if you take a giant step, you'll pass right over the bottom and land on the opposite slope, maybe even higher than you started! This is the red region of the plots. If you take really really tiny steps, you'll be safe, but it'll take you a long time to reach the bottom. This is the dark blue section. Another way you can take a long time is to take big steps that jump from one slope to the other, but just barely small enough to end up a little lower each time (this is why there's a dark blue stripe near the boundary). The light green region is where you take goldilocks steps - big enough to find the bottom quickly, but small enough to not jump over it.
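The whole setup described above fits in a few lines of numpy. This is a sketch under my own assumptions (dataset size, number of steps, and the specific learning-rate values are made up; the paper's actual configuration may differ), showing the two-layer tanh network with a separate step size per layer, and how one choice of step sizes stays stable while a much bigger one blows up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset as described above: random 8-d input vectors, scalar targets.
X = rng.normal(size=(16, 8))
y = rng.normal(size=(16, 1))

def train(lr_in, lr_out, steps=200):
    """Full-batch gradient descent on mean squared error.
    Returns the final loss, or inf if training diverges."""
    r = np.random.default_rng(1)  # same weight init for every lr pair,
                                  # as in the plots
    W1 = r.normal(size=(8, 8)) / np.sqrt(8)  # input layer (8x8 matrix)
    W2 = r.normal(size=(8, 1)) / np.sqrt(8)  # output layer (8x1 matrix)
    for _ in range(steps):
        h = np.tanh(X @ W1)      # multiply, then squeeze to (-1, 1)
        out = h @ W2             # multiply again: one number per input
        err = out - y
        loss = np.mean(err ** 2)
        if not np.isfinite(loss):
            return np.inf        # we jumped right out of the valley
        # Backpropagation by hand:
        g_out = 2 * err / len(X)
        g_W2 = h.T @ g_out
        g_W1 = X.T @ ((g_out @ W2.T) * (1 - h ** 2))  # tanh' = 1 - tanh^2
        W1 -= lr_in * g_W1       # "input layer learning rate"
        W2 -= lr_out * g_W2      # "output layer learning rate"
    return loss

print(train(0.1, 0.1))      # goldilocks steps: stays finite
print(train(100.0, 100.0))  # giant steps: overshoots and diverges
```

Each pixel in the paper's plots is essentially one call to `train` with a different (lr_in, lr_out) pair, colored by how training went.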
This is really fun, and beautiful. Also, despite what people are saying about the learning rates being unrealistic, the findings really fit my own experience using optimisation algorithms in the real world. If our code ever had a significant difference in results between processor architectures (e.g. a machine taking an AVX code path vs an SSE one), you could be sure that every time the difference began during execution of an optimisation algorithm. The chaotic sensitivity to initial conditions really showed up there, just as it did in the author's Newton solver plot. Although I knew at some level that this behaviour was chaotic, it never would have occurred to me to ask whether it made a pretty fractal!
This is kind of random, but I wonder: if you had a sufficiently complex lens, or series of lenses, perhaps with specific areas darkened, could you make a lens that shone light through if presented with, say, a cat, but not with anything else? Bending light and darkening it selectively could probably reproduce a layer of a neural net. That would be cool. I suppose you would need some substance that responds to light in a nonlinear way.
But yes, you are restricted to linear things and you can't make a good photonic cat detector out of that easily. So all the photonic neural networks you may have heard of like https://arxiv.org/abs/2106.11747 wind up sticking some mechanical or electrical nonlinearity somewhere.
You can simulate materials, apply the wave equation, and get "layers" that compute outputs from given inputs, each modeled as points in space. It may be possible to manufacture such layers with metamaterials or something like that.
This is really fun to see. I love toy experiments like this. I see that each plot is always using the same initialization of weights, which presumably makes it possible to have more smoothness between each pixel. I also would guess it's using the same random seed for training (shuffling data).
I'd be curious to know what the plots would look like with a different randomness/shuffling of each pixel's dataset. I'd guess for the high learning rates it would be too noisy, but you might see fractal behavior at more typical and practical learning rates. You could also do the same with the random initialization of each dataset. This would get at if the chaotic boundary also exists in more practical use cases.
This is pretty interesting. Can’t help but be reminded of all the times I’ve done acid. Having been deep in ‘fractal country’ a few times, I’ve always felt the psychedelic effect comes from my brain going haywire and messing up its pattern recognition. I wonder if it’s related to this.
Generating a single still image is not an issue, obviously. It's more about picking "good" fractal parameters, a "good" spot to zoom in on said fractal, and "good" colors to dress it all up. Even more so if you are trying to time it all to specific music matching the visuals, as some of the fractal art videos do.
This is what I initially thought too, but I am less certain now. I assumed fractal implied self-similarity (which this doesn't seem to have), but that is in fact not true. I think to actually say whether it's fractal or not, someone needs to estimate its dimension using box counting or some analytical method that is way beyond me.
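Box counting is less scary than it sounds: cover the set with boxes of side s, count how many boxes N(s) are occupied, and fit the slope of log N(s) against log(1/s). A minimal sketch (the helper name and the test sets are my own, not from the paper), sanity-checked on sets whose dimension we already know:

```python
import numpy as np

def box_counting_dimension(points, scales):
    """Estimate the fractal dimension of a 2-D point set: count occupied
    boxes N(s) at each box size s, then fit log N(s) vs log(1/s)."""
    points = np.asarray(points)
    counts = []
    for s in scales:
        # Assign each point to a box of side s and count distinct boxes.
        occupied = np.unique(np.floor(points / s), axis=0)
        counts.append(len(occupied))
    slope, _ = np.polyfit(np.log(1.0 / np.array(scales)), np.log(counts), 1)
    return slope

# Sanity checks: a line segment should come out near 1,
# a filled square near 2.
n = 100_000
line = np.column_stack([np.linspace(0, 1, n), np.zeros(n)])
square = np.random.default_rng(0).random((n, 2))
scales = [1 / 8, 1 / 16, 1 / 32, 1 / 64]

print(box_counting_dimension(line, scales))    # ~1
print(box_counting_dimension(square, scales))  # ~2
```

For the paper's plots you'd feed in the boundary pixels between the converged and diverged regions; a non-integer slope is the "fractal" signature.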
That's not what it shows at all. It shows how varying hyperparameters (which are floats and can thus be set at arbitrary precision) affects the speed at which convergence happens, so it's some function F: R^n -> Z. It has literally nothing to do with the nodes in the neural network...
A fractal is not necessarily self-similar; it just has to show detailed structure when zoomed in ad infinitum. While no single definition of fractals exists, that is at least one of the common denominators (because otherwise many "fractals" in nature couldn't be called fractals).
Searching around, I stumbled over a paper, which I found interesting, on characterizing the fractal-ness of geological structures.
They point out that the usual measure of fractal dimension, or capacity dimension, doesn't consider the physical size of the features and can thus be inaccurate. Instead they suggest using the information dimension, which is bounded by the capacity dimension.
I'm only a fractal enthusiast, but my impression is that the key distinction that makes these fractals, or at least fractal-like, is not their detail per se, but that there is complexity at every scale. From the article:
> we find intricate structure at every scale
> At every length scale, small changes in the hyperparameters can lead to large changes in training dynamics