The results of the experiment seem counterintuitive only because the learning rates used are huge (up to 10 or even 100). These are not the learning rates you would use in a normal setting. If you look at the region of small learning rates, it seems all of them converge.
So I would say the experiment is interesting, but not representative of real world deep learning.
In the experiment, you have a function of 272 variables with a lot of minima and maxima, and at each gradient descent step you take huge steps (because of the big learning rate). So my intuition is that convergence is more a matter of luck than of hyperparameters.
If convergence were a matter of luck, it would look completely different, like white noise, but it clearly has well-defined structure.
The reason for the high learning rates is that they used full-batch training (see the first cell in https://colab.research.google.com/github/Sohl-Dickstein/frac...), and when batch sizes are large, learning rates typically can be large as well. Plus, as others said, it's more of a toy problem; it would be hard to get this level of detail on anything non-toy.
Yes, the author is clear in this regard. But I've seen people interpreting this paper as "training deep networks is chaotic", and I don't think that's the case. I interpret it more as "if you're not careful with your learning rate, your training will be chaotic".
Yes. I immediately had visions of a misconfigured PID controller or similar chaotic emergence encountered in control theory. I was surprised and delighted though, that this type of chaos has so much fractal beauty.
The way the author defines the best hyperparameters is also a bit odd. To assign a "score" to each hyperparameter pair, the author sums all the losses during training (i.e. score = Σ_{t=0}^{T} l_t) instead of just taking the value of the loss at the last step (score = l_T). This makes the converging solutions with high learning rates appear better, since their losses are lower from the beginning (so the sum is smaller), but it doesn't mean that their final loss is better than that of runs with smaller learning rates.
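To see how the two scoring rules can disagree, here is a minimal sketch on a toy 1-D quadratic loss (a stand-in for the paper's training runs, not the actual network; the learning-rate values are made up for illustration). The sum-of-losses score is dominated by the early part of training, so the fast run wins by a wide margin even though both runs end at essentially the same final loss:

```python
import numpy as np

def gradient_descent_losses(lr, steps=1000):
    # Gradient descent on the toy loss l(w) = w^2, starting from w = 1.
    w = 1.0
    losses = []
    for _ in range(steps):
        losses.append(w * w)
        w -= lr * 2 * w  # gradient of w^2 is 2w
    return np.array(losses)

fast = gradient_descent_losses(lr=0.4)   # big (but stable) steps
slow = gradient_descent_losses(lr=0.01)  # tiny steps

# Sum-of-losses score: strongly favors the fast run,
# because its losses are small from the very first steps.
print("sum score:  ", fast.sum(), "vs", slow.sum())

# Final-loss score: both runs have converged to ~0 by now,
# so this score barely distinguishes them.
print("final score:", fast[-1], "vs", slow[-1])
```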
"Some fractals -- for instance those associated with the Mandelbrot and quadratic Julia sets -- are computed by iterating a function, and identifying the boundary between hyperparameters for which the resulting series diverges or remains bounded. Neural network training similarly involves iterating an update function (e.g. repeated steps of gradient descent), can result in convergent or divergent behavior, and can be extremely sensitive to small changes in hyperparameters. Motivated by these similarities, we experimentally examine the boundary between neural network hyperparameters that lead to stable and divergent training. We find that this boundary is fractal over more than ten decades of scale in all tested configurations."
Contains several cool animations zooming in to show the fractal boundary between convergent and divergent training, just like the classic Mandelbrot and Julia set animations.
I find this result absolutely fascinating, and it is exactly the type of research into neural networks we should be expanding.
We've rapidly engineered our way to some very impressive models this past decade, and yet the gap in our real understanding of what's going on has widened. There's a long list of very basic questions about LLMs that we haven't answered (or in some cases, really asked). This is not a failing of the people researching in this area; it's only that things move so quickly that there's not enough time to ponder things like this.
At the same time, the result, unless I'm really misunderstanding, gives me the impression that anything other than grid-search hyperparameter optimization is a fool's errand. This would give credence to the notion that hyperparameter tuning really is akin to re-rolling a character sheet until you get one that is overpowered.
>exactly the type of research into neural networks we should be expanding.
While it certainly makes for some nice visualizations, the technical insight here is pretty limited. First of all, this fractal structure emerges at learning rates that are far higher than those used in training actual neural networks nowadays. It's interesting that training still converges for some combinations and that the (expected) hit-and-miss behavior yields a fractal structure. But if you look closely at the images, you'll see the best hyperparameters are close to the border, yet not on it. So even if you want to follow the meta-learning approach outlined in the post, your gradient descent has already gone wrong before it ever ends up in this fractal boundary region.
There is an expanding field of study looking at machine learning with statistical physics tools. While there is still a lot of work to do in this area, it yields interesting insights on neural networks, e.g. linking their training with the evolution of spin glasses (a typical statistical physics problem). We can even talk about phase transition and universal exponents.
Most of the research is done with simpler models though (because mainly math people do it, and it's hard to prove anything on something as complex as a transformer).
> gives me the impression that anything other than grid-search hyperparameter optimization is a fool's errand. This would give credence to the notion that hyperparameter tuning really is akin to re-rolling a character sheet until you get one that is overpowered.
The visualizations only show that there is a lot of fractal structure at the border between convergence and divergence, not in every part of the space. (Although the highest performance is often achieved close to the border.) I would not call hyperparameter search as bad as that.
What was the name of the Google app (Dreaming or some such) that would iterate frames similar to this, finding lizards/snakes/dogs/eyes, and getting super trippy the longer it ran? The demos would start with RGB noise, and within a few iterations it was a full-on psychedelic trip. It was the best visual of AI "hallucinating" I've seen yet.
Yes. It all started with a leaked image on Reddit. Even though at that point it was just a rumour that it was generated by a neural net, it was so strikingly unlike anything else that it caused a huge stir. Wish I could find that image again, it was a fairly noisy "slugdog" thing.
If you are a fan of the fractals but feel intimidated by neural networks, the networks used here are actually pretty simple and not so difficult to understand if you are familiar with matrix multiplication. To generate a dataset, he samples random vectors (say of size 8) as inputs, and for each vector a target output, which is a single number. The network consists of an 8x8 matrix and an 8x1 matrix, both randomly initialized.
To generate an output from an input vector, you just multiply by your 8x8 matrix (getting a new size 8 vector), apply the tanh function to each element (look up a plot of tanh - it just squeezes its inputs to be between -1 and 1), and then multiply by the 8x1 matrix, getting a single value as an output. The elements of the two matrices are the 'weights' of the neural network, and they are updated to push the output we got towards the target.
When we update our weights, we have to decide on a step size - do we make just a little tiny nudge in the right direction, or take a giant step? The plots are showing what happens if we choose different step sizes for the two matrices ("input layer learning rate" is how big of a step we take for the 8x8 matrix, and "output layer learning rate" for the 8x1 matrix).
If your steps are too big, you run into a problem. Imagine trying to find the bottom of a parabola by taking steps in the direction of downward slope - if you take a giant step, you'll pass right over the bottom and land on the opposite slope, maybe even higher than you started! This is the red region of the plots. If you take really really tiny steps, you'll be safe, but it'll take you a long time to reach the bottom. This is the dark blue section. Another way you can take a long time is to take big steps that jump from one slope to the other, but just barely small enough to end up a little lower each time (this is why there's a dark blue stripe near the boundary). The light green region is where you take goldilocks steps - big enough to find the bottom quickly, but small enough to not jump over it.
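The whole setup described above fits in a few lines of numpy. This is a sketch under my own assumptions (dataset size, number of steps, and the specific learning-rate values are made up; the paper's actual configuration may differ), showing the two-layer tanh network with a separate step size per layer, and how one choice of step sizes stays stable while a much bigger one blows up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset as described above: random 8-d input vectors, scalar targets.
X = rng.normal(size=(16, 8))
y = rng.normal(size=(16, 1))

def train(lr_in, lr_out, steps=200):
    """Full-batch gradient descent on mean squared error.
    Returns the final loss, or inf if training diverges."""
    r = np.random.default_rng(1)  # same weight init for every lr pair,
                                  # as in the plots
    W1 = r.normal(size=(8, 8)) / np.sqrt(8)  # input layer (8x8 matrix)
    W2 = r.normal(size=(8, 1)) / np.sqrt(8)  # output layer (8x1 matrix)
    for _ in range(steps):
        h = np.tanh(X @ W1)      # multiply, then squeeze to (-1, 1)
        out = h @ W2             # multiply again: one number per input
        err = out - y
        loss = np.mean(err ** 2)
        if not np.isfinite(loss):
            return np.inf        # we jumped right out of the valley
        # Backpropagation by hand:
        g_out = 2 * err / len(X)
        g_W2 = h.T @ g_out
        g_W1 = X.T @ ((g_out @ W2.T) * (1 - h ** 2))  # tanh' = 1 - tanh^2
        W1 -= lr_in * g_W1       # "input layer learning rate"
        W2 -= lr_out * g_W2      # "output layer learning rate"
    return loss

print(train(0.1, 0.1))      # goldilocks steps: stays finite
print(train(100.0, 100.0))  # giant steps: overshoots and diverges
```

Each pixel in the paper's plots is essentially one call to `train` with a different (lr_in, lr_out) pair, colored by how training went.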
This is really fun, and beautiful. Also, despite what people are saying about the learning rates being unrealistic, the findings really fit my own experience using optimisation algorithms in the real world. If our code ever had a significant difference in results between processor architectures (e.g. a machine taking an AVX code path vs an SSE one), you could be sure that every time the difference began during execution of an optimisation algorithm. The chaotic sensitivity to initial conditions really showed up there, just as it did in the author's Newton solver plot. Although I knew at some level that this behaviour was chaotic, it never would have occurred to me to ask whether it made a pretty fractal!
This is kind of random, but I wonder: if you had a sufficiently complex lens, or series of lenses, perhaps with specific areas darkened, could you make a lens that shone light through if presented with, say, a cat, but not with anything else? Bending light and darkening it selectively could probably reproduce a layer of a neural net. That would be cool. I suppose you would need some substance that responds to light in a nonlinear way.
But yes, you are restricted to linear things and you can't make a good photonic cat detector out of that easily. So all the photonic neural networks you may have heard of like https://arxiv.org/abs/2106.11747 wind up sticking some mechanical or electrical nonlinearity somewhere.
You can simulate materials, apply the wave equation, and get "layers" that compute outputs from given inputs, each modeled as points in space. It may be possible to manufacture such layers with metamaterials or something like that.
This is really fun to see. I love toy experiments like this. I see that each plot is always using the same initialization of weights, which presumably makes it possible to have more smoothness between each pixel. I also would guess it's using the same random seed for training (shuffling data).
I'd be curious to know what the plots would look like with a different randomness/shuffling of each pixel's dataset. I'd guess for the high learning rates it would be too noisy, but you might see fractal behavior at more typical and practical learning rates. You could also do the same with the random initialization of each dataset. This would get at if the chaotic boundary also exists in more practical use cases.
This is pretty interesting. Can’t help but be reminded of all the times I’ve done acid. Having been deep in ‘fractal country’ a few times, I’ve always felt the psychedelic effect comes from my brain going haywire and messing up its pattern recognition. I wonder if it’s related to this.
Generating a single still image is not an issue, obviously. It's more about picking "good" fractal parameters, a "good" spot to zoom in on said fractal, and "good" colors to dress it all up. Even more so if you are trying to time it all to specific music matching the visuals, as some of the fractal art videos do.
This is what I initially thought too, but I am less certain now. I assumed fractal implied self-similarity (which this doesn't seem to have), but that is in fact not true. I think to actually say whether it's fractal or not, someone needs to estimate its dimension using box counting or some analytical method that is way beyond me.
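Box counting is less scary than it sounds: cover the set with boxes of side s, count how many boxes N(s) are occupied, and fit the slope of log N(s) against log(1/s). A minimal sketch (the helper name and the test sets are my own, not from the paper), sanity-checked on sets whose dimension we already know:

```python
import numpy as np

def box_counting_dimension(points, scales):
    """Estimate the fractal dimension of a 2-D point set: count occupied
    boxes N(s) at each box size s, then fit log N(s) vs log(1/s)."""
    points = np.asarray(points)
    counts = []
    for s in scales:
        # Assign each point to a box of side s and count distinct boxes.
        occupied = np.unique(np.floor(points / s), axis=0)
        counts.append(len(occupied))
    slope, _ = np.polyfit(np.log(1.0 / np.array(scales)), np.log(counts), 1)
    return slope

# Sanity checks: a line segment should come out near 1,
# a filled square near 2.
n = 100_000
line = np.column_stack([np.linspace(0, 1, n), np.zeros(n)])
square = np.random.default_rng(0).random((n, 2))
scales = [1 / 8, 1 / 16, 1 / 32, 1 / 64]

print(box_counting_dimension(line, scales))    # ~1
print(box_counting_dimension(square, scales))  # ~2
```

For the paper's plots you'd feed in the boundary pixels between the converged and diverged regions; a non-integer slope is the "fractal" signature.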
That's not what it shows at all. It shows how varying hyperparameters (which are floats and can thus be set at arbitrary precision) affects the speed at which convergence happens, so it's some function F: R^n -> Z. It has literally nothing to do with the nodes in the neural network...
A fractal is not necessarily self-similar; it just has to show detailed structure when zoomed in ad infinitum. While no single definition of fractals exists, that is at least one of the common denominators (because otherwise many "fractals" in nature couldn't be called fractals).
Searching around, I stumbled over a paper, which I found interesting, on characterizing the fractal-ness of geological structures.
They point out that the usual measure of fractal dimension, or capacity dimension, doesn't consider the physical size of the features and can thus be inaccurate. Instead they suggest using the information dimension, which is bounded by the capacity dimension.
I'm only a fractal enthusiast, but my impression is that the key distinction that makes these fractals, or at least fractal-like, is not their detail per se, but that there is complexity at every scale. From the article:
> we find intricate structure at every scale
> At every length scale, small changes in the hyperparameters can lead to large changes in training dynamics