Neural network training makes beautiful fractals


316 points | by telotortium 12 days ago


  • alexmolas 11 days ago
    The results of the experiment seem counterintuitive just because the learning rates used are huge (up to 10 or even 100). These are not the learning rates you would use in a normal setting. If you look at the region of small learning rates, it seems all of them converge.

    So I would say the experiment is interesting, but not representative of real world deep learning.

    In the experiment, you have a function of 272 variables with a lot of minima and maxima, and at each gradient-descent step you take huge steps (due to the big learning rate). So my intuition is that convergence is more a matter of luck than of hyperparameters.

    • lopuhin 11 days ago
      If convergence were a matter of luck, it would look completely different, like white noise, but it clearly has well-defined structure.

      The reason for the high learning rate is that they used full-batch training (see the first cell of the notebook), and when batch sizes are large, learning rates can typically be large as well. Plus, as others said, it's more of a toy problem; it would be hard to get this level of detail on anything non-toy.

    • dmarchand90 11 days ago
      I think the article is very honest about this just being a fun exploration. They even show how you can get similar patterns with Newton's method, which is a more "classical" take.
      • alexmolas 11 days ago
        Yes, the author is clear in this regard. But I've seen people interpret this paper as "training deep networks is chaotic", and I don't think that's the case. I interpret it more as "if you're not careful with your learning rate, your training will be chaotic".
        • repelsteeltje 11 days ago
          Yes. I immediately had visions of a misconfigured PID controller or similar chaotic emergence encountered in control theory. I was surprised and delighted though, that this type of chaos has so much fractal beauty.
          • nyrikki 11 days ago
            Note that not all fractals are chaotic.

            Non-integer dimensionality is the defining feature of fractals.

            A multilayer ANN performs compression that may be fractal-like. But in theory a feed-forward network could be a single layer, just with a lot more neurons.

            I have a feeling this result is due to the representation, but there are some features that look like riddled basins.

            Riddled basins do arise in SNNs, or spiking neural networks, which have continuous output vs. the binary output of ANNs.

            But as all PAC learning is just compression, that may be the cause as well.

            I'll be digging into this on the weekend though.

        • breather 11 days ago
          With all due respect, why are you commenting here rather than replying to the person who gave your comment context?
    • sdenton4 11 days ago
      There was a nice paper about five years ago that also found that the best hyperparameters were at the boundary of divergence, even for 'real' ImageNet models.
    • vintermann 11 days ago
      Yes, but as he also says, the best hyperparameters are near the boundary. You want to be a bit greedy here.
      • alexmolas 11 days ago
        The way the author defines the best hyperparameters is also a bit weird. To assign a "score" to each pair of hyperparameters, the author sums all the losses during training (i.e. score = ∑ᵀₜ₌₀ lₜ) instead of just taking the value of the loss at the last epoch (score = l_T). This makes the converging solutions with high learning rates appear better, since they have lower losses from the beginning (so the sum is smaller), but it doesn't mean that the final value of the loss is better than with smaller learning rates.
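        To make the distinction concrete, here's a tiny sketch with invented loss curves (none of these numbers come from the post):

```python
# Two hypothetical training runs (loss values made up for illustration).
fast = [1.0, 0.2, 0.1, 0.1, 0.1]    # high lr: drops quickly, plateaus higher
slow = [1.0, 0.8, 0.5, 0.2, 0.05]   # low lr: drops slowly, ends lower

def score_sum(losses):    # the post's score: sum of losses over training
    return sum(losses)

def score_final(losses):  # alternative: just the last-epoch loss
    return losses[-1]

print(score_sum(fast) < score_sum(slow))      # True: summed score prefers fast
print(score_final(slow) < score_final(fast))  # True: final loss prefers slow
```

        The same pair of runs ranks in opposite orders under the two scores, which is exactly the ambiguity being pointed out.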
  • telotortium 12 days ago
    Twitter: ArXiv:


    "Some fractals -- for instance those associated with the Mandelbrot and quadratic Julia sets -- are computed by iterating a function, and identifying the boundary between hyperparameters for which the resulting series diverges or remains bounded. Neural network training similarly involves iterating an update function (e.g. repeated steps of gradient descent), can result in convergent or divergent behavior, and can be extremely sensitive to small changes in hyperparameters. Motivated by these similarities, we experimentally examine the boundary between neural network hyperparameters that lead to stable and divergent training. We find that this boundary is fractal over more than ten decades of scale in all tested configurations."

    Contains several cool animations zooming in to show the fractal boundary between convergent and divergent training, just like the classic Mandelbrot and Julia set animations.
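    The escape-time construction the abstract alludes to is only a few lines; here's a minimal sketch (the function name and thresholds are mine):

```python
# Escape-time test for the Mandelbrot set: iterate z -> z^2 + c and check
# whether the orbit stays bounded. The paper's analogy replaces c with a
# pair of learning rates and the map with a gradient-descent update.
def mandelbrot_escape_time(c, max_iter=100, bound=2.0):
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > bound:
            return i  # diverged after i iterations
    return max_iter   # still bounded: the "convergent" case

print(mandelbrot_escape_time(0j))      # 100 (bounded, inside the set)
print(mandelbrot_escape_time(2 + 2j))  # 0 (escapes immediately)
```

    The fractal is the boundary between the bounded and escaping regions of the c-plane, just as the paper's fractal is the boundary between converging and diverging hyperparameter pairs.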

  • PheonixPharts 11 days ago
    I find this result absolutely fascinating, and it is exactly the type of research into neural networks we should be expanding.

    We've rapidly engineered our way to some very impressive models this past decade, and yet the gap in our real understanding of what's going on has widened. There's a large list of very basic questions about LLMs that we haven't answered (or in some cases, really asked). This is not a failing of the people researching in this area; it's only that things move so quickly that there's not enough time to ponder things like this.

    At the same time, the result, unless I'm really misunderstanding, gives me the impression that anything other than grid-search hyperparameter optimization is a fool's errand. This would give credence to the notion that hyperparameter tuning really is akin to just re-rolling a character sheet until you get one that is overpowered.

    • sigmoid10 11 days ago
      >exactly the type of research into neural networks we should be expanding.

      While it certainly makes for some nice visualizations, the technical insight here is pretty limited. First of all, this fractal structure emerges at learning rates that are far higher than those used to train actual neural networks nowadays. It's interesting that training still converges for some combinations and that the (expected) hit-and-miss procedure yields a fractal structure. But if you look closely at the images, you'll see the best hyperparameters are close to the border, yet not on it. So even if you want to follow the meta-learning approach outlined in the post, your gradient descent has already gone wrong earlier if it ever ends up in this fractal boundary region.

    • johan_felisaz 11 days ago
      There is an expanding field of study looking at machine learning with statistical-physics tools. While there is still a lot of work to do in this area, it yields interesting insights on neural networks, e.g. linking their training with the evolution of spin glasses (a typical statistical-physics problem). We can even talk about phase transitions and universal exponents.

      Most of the research is done with simpler models though (because mainly math people do it, and it's hard to prove anything on something as complex as a transformer).

    • krallistic 11 days ago
      > gives me the impression that anything other than grid search hyper parameter optimization is a fools errand. This would give credence to the notion that hyper parameter tuning really is akin to just re-rolling a character sheet until you get one that is over powered.

      The visualizations only show fractal structure at the border, not in every part of the space (although the highest performance is often achieved close to the border). I wouldn't call hyperparameter search as bad as that.

    • dylan604 11 days ago
      What was the name of the Google app (Dreaming or some such) that would iterate frames similar to this, finding lizards/snakes/dogs/eyes, and got super trippy the longer it ran? The demos would start with RGB noise, and within a few iterations it was a full-on psychedelic trip. It was the best visual of AI "hallucinating" I've seen yet.
      • 0xDEADFED5 11 days ago
        • dylan604 11 days ago
          winner winner chicken dinner. thanks!

          I always thought it showed us what an android's dreams were like

      • Sharlin 11 days ago
        DeepDream is essentially what started the current image-creating generative AI craze.
        • vintermann 11 days ago
          Yes. It all started with a leaked image on Reddit. Even though at that point it was just a rumour that it was generated by a neural net, it was so strikingly unlike anything else that it caused a huge stir. Wish I could find that image again, it was a fairly noisy "slugdog" thing.
  • Imnimo 11 days ago
    If you are a fan of the fractals but feel intimidated by neural networks, the networks used here are actually pretty simple and not so difficult to understand if you are familiar with matrix multiplication. To generate a dataset, he samples random vectors (say of size 8) as inputs, and for each vector a target output, which is a single number. The network consists of an 8x8 matrix and an 8x1 matrix, also randomly initialized.

    To generate an output from an input vector, you just multiply by your 8x8 matrix (getting a new size 8 vector), apply the tanh function to each element (look up a plot of tanh - it just squeezes its inputs to be between -1 and 1), and then multiply by the 8x1 matrix, getting a single value as an output. The elements of the two matrices are the 'weights' of the neural network, and they are updated to push the output we got towards the target.

    When we update our weights, we have to decide on a step size - do we make just a little tiny nudge in the right direction, or take a giant step? The plots are showing what happens if we choose different step sizes for the two matrices ("input layer learning rate" is how big of a step we take for the 8x8 matrix, and "output layer learning rate" for the 8x1 matrix).
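    A minimal NumPy sketch of the setup described above (my paraphrase of the comment, not the post's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 8))  # input layer (the "8x8 matrix")
W2 = rng.normal(size=(8, 1))  # output layer (the "8x1 matrix")
x = rng.normal(size=(8,))     # one random input vector
y = rng.normal()              # its scalar target

def forward(x, W1, W2):
    h = np.tanh(W1 @ x)       # squeeze hidden values into (-1, 1)
    return (W2.T @ h).item()  # single output number

def step(x, y, W1, W2, lr1, lr2):
    """One gradient step on the loss 0.5 * (forward(x) - y)**2,
    with a separate learning rate for each layer."""
    h = np.tanh(W1 @ x)
    err = (W2.T @ h).item() - y
    grad_W2 = err * h[:, None]
    grad_h = err * W2[:, 0]
    grad_W1 = ((1 - h**2) * grad_h)[:, None] @ x[None, :]  # tanh' = 1 - tanh^2
    return W1 - lr1 * grad_W1, W2 - lr2 * grad_W2
```

    Each pixel in the post's plots corresponds to one (lr1, lr2) pair, colored by whether repeated calls to a step like this converge or blow up.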

    If your steps are too big, you run into a problem. Imagine trying to find the bottom of a parabola by taking steps in the direction of downward slope - if you take a giant step, you'll pass right over the bottom and land on the opposite slope, maybe even higher than you started! This is the red region of the plots. If you take really really tiny steps, you'll be safe, but it'll take you a long time to reach the bottom. This is the dark blue section. Another way you can take a long time is to take big steps that jump from one slope to the other, but just barely small enough to end up a little lower each time (this is why there's a dark blue stripe near the boundary). The light green region is where you take goldilocks steps - big enough to find the bottom quickly, but small enough to not jump over it.
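    The parabola picture maps onto a one-line update rule: gradient descent on f(w) = w² does w ← w − lr·2w = (1 − 2·lr)·w, so the regimes above fall out of the factor |1 − 2·lr| (a toy sketch, not the post's code):

```python
def descend(lr, w=1.0, steps=50):
    """Gradient descent on f(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * 2 * w  # each step multiplies w by (1 - 2*lr)
    return abs(w)

print(descend(0.1))   # tiny steps: safe but slow (dark blue)
print(descend(0.45))  # goldilocks steps: fast convergence (light green)
print(descend(0.95))  # jumps across the bottom, shrinking slowly (blue stripe)
print(descend(1.5))   # overshoots higher each time: diverges (red)
```

    On a real loss surface the curvature varies from point to point, which is what turns this clean threshold into a fractal boundary.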

    • nighthawk454 11 days ago
      Great description! Now we just need one on fractals for all us NN people haha
  • magicalhippo 11 days ago
    Here's the associated blog post, which includes the videos:

    Not an ML'er, so I'm not sure what to make of it beyond a fascinating connection.

  • fancyfredbot 11 days ago
    This is really fun, and beautiful. Also, despite what people are saying about the learning rates being unrealistic, the findings really fit well with my own experience using optimisation algorithms in the real world. If our code ever had a significant difference in results between processor architectures (e.g. a machine taking an AVX code path vs an SSE one), you could be sure that every time the difference began during execution of an optimisation algorithm. The chaotic sensitivity to initial conditions really showed up there, just as it did in the author's Newton-solver plot. Although I knew at some level that this behaviour was chaotic, it never would have occurred to me to ask if it made a pretty fractal!
  • why_only_15 11 days ago
    I appreciate that his acknowledgements here were to his daughter ("for detailed feedback on the generated fractals") and wife ("for providing feedback on a draft of this post")
  • fallingfrog 11 days ago
    This is kind of random but- I wonder, if you had a sufficiently complex lens, or series of lenses, perhaps with specific areas darkened, could you make a lens that shone light through if presented with, say, a cat, but not with anything else? Bending light and darkening it selectively could probably reproduce a layer of a neural net. That would be cool. I suppose, you would need some substance that responded to light in a nonlinear way.
  • mchinen 11 days ago
    This is really fun to see. I love toy experiments like this. I see that each plot is always using the same initialization of weights, which presumably makes it possible to have more smoothness between each pixel. I also would guess it's using the same random seed for training (shuffling data).

    I'd be curious to know what the plots would look like with a different randomness/shuffling of each pixel's dataset. I'd guess for the high learning rates it would be too noisy, but you might see fractal behavior at more typical and practical learning rates. You could also do the same with the random initialization of each dataset. This would get at whether the chaotic boundary also exists in more practical use cases.

  • arkano 11 days ago
    If you liked this, you may also enjoy "Back Propagation is Sensitive to Initial Conditions" from the early '90s. The discussion section is fun.

  • radarsat1 11 days ago
    I'm really curious what effect the common tricks for training have on the smoothness of this landscape: momentum, skip connections, batch/layer/etc normalization, even model size.

    I imagine the fractal or chaos is still there, but maybe "smoother" and easier for metalearning to deal with?

  • Wherecombinator 11 days ago
    This is pretty interesting. Can’t help but be reminded of all the times I’ve done acid. Having been deep in ‘fractal country’ a few times, I’ve always felt the psychedelic effect comes from my brain going haywire and messing up its pattern recognition. I wonder if it’s related to this.
  • KuzMenachem 11 days ago
    Reminds me of an excellent 3blue1brown video about Newton’s method [1]. You can see similar fractal patterns emerge there too.
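    The fractal in that video comes from coloring each starting point by which root Newton's method converges to. A minimal sketch for p(z) = z³ − 1 (names and iteration count are mine):

```python
# Newton's method on p(z) = z^3 - 1; the basin boundaries in the complex
# plane form the Newton fractal shown in the video.
ROOTS = [1 + 0j, complex(-0.5, 3**0.5 / 2), complex(-0.5, -(3**0.5) / 2)]

def newton(z, iters=50):
    for _ in range(iters):
        z = z - (z**3 - 1) / (3 * z**2)  # z <- z - p(z) / p'(z)
    return z

def basin(z0):
    """Index of the cube root of 1 that the iteration lands on."""
    z = newton(complex(z0))
    return min(range(3), key=lambda i: abs(z - ROOTS[i]))

print(basin(2))        # real start -> the real root (index 0)
print(basin(-1 + 1j))  # complex start -> one of the complex roots
```

    Coloring a grid of z0 values by basin(z0) reproduces the familiar three-lobed fractal.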


  • int_19h 11 days ago
    I hope one day we'll have generative AI capable of producing stuff like this on demand:

    • passion__desire 11 days ago
      Not just that; AI can overlay artistic styles like the one below on top of such fractal structures.

      • int_19h 11 days ago
        Generating a single still image is not an issue, obviously. It's more about picking "good" fractal parameters, a "good" spot to zoom in on said fractal, and "good" colors to dress it all up. Even more so if you are trying to time it all to specific music matching the visuals, as some of the fractal art videos do.
  • kalu 11 days ago
    So this author trained a neural network billions of times using different hyperparameters? How much did that cost?
  • milliams 11 days ago
    I'd argue that these are not fractals in the mathematical sense, but they do seem to be demonstrating chaos.
    • bob88jg 11 days ago
      This is what I initially thought too, but I am less certain now. I assumed fractal implied self-similarity (which this doesn't seem to have), but that is in fact not true. I think to actually say whether it's fractal or not, someone needs to estimate its dimension using box counting or some analytical method that's way beyond me.
      • wang_li 11 days ago
        They should have definition at arbitrary scales. These images do not; they are built from a fixed-size matrix. Beyond a certain point you are between the data points.
        • bob88jg 11 days ago
          The same can be said for the Mandelbrot set or any other fractal - there is a physical bound on the precision of the values - and you can in theory run a NN with arbitrary-precision floats...
          • wang_li 10 days ago
            But the image is based on the attributes of each node in the NN matrix. The Mandelbrot set can be calculated at any precision you desire and it keeps going. An NN is discrete and finite.
            • bob88jg 9 days ago
              That's not what it shows at all - it shows how varying the hyperparameters (which are floats and can thus be set at arbitrary precision) affects the speed at which convergence happens - so it's some function F: R^n -> Z. It has literally nothing to do with the nodes in the neural network...
  • karxxm 11 days ago
    What’s that color-map called?
  • albertgt 11 days ago
    Dave Bowman> omg it’s full of fractals

    HAL> why yes Dave what did you think I was made of

  • 7e 11 days ago
    Today I learned that if something is detailed, it is now fractal.
    • lifthrasiir 11 days ago
      Fractals are not necessarily self-similar; they just have to show enough detailed structure when zoomed in ad infinitum. While no single definition for fractals exists, that's at least one of the common denominators (because otherwise many "fractals" in nature couldn't be called that).
    • magicalhippo 11 days ago
      It would be interesting to compute the fractal dimension[1] to see if it really is fractal[2] or just looks like it.

      I recall similar tests being done on paintings by Pollock and of other artists trying to copy his style to determine authenticity[3].
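    A box-counting estimate is simple enough to sketch; here it's sanity-checked on a filled square, whose dimension should come out as 2 (a toy version, not a rigorous estimator):

```python
import numpy as np

# Box-counting sketch: count occupied boxes at several box sizes and fit
# a line to log(count) vs log(1/size); the slope estimates the dimension.
def box_count_dimension(img, sizes=(1, 2, 4, 8, 16)):
    counts = []
    for s in sizes:
        n = 0
        for i in range(0, img.shape[0], s):
            for j in range(0, img.shape[1], s):
                if img[i:i+s, j:j+s].any():
                    n += 1
        counts.append(n)
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return slope

square = np.ones((64, 64), dtype=bool)
print(round(box_count_dimension(square), 2))  # ~2.0 for a filled plane region
```

    Applied to a binarized convergence/divergence map, a non-integer slope over a wide range of box sizes would support the fractal claim.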




    • paulrudy 11 days ago
      I'm only a fractal enthusiast, but my impression is that the key distinction that makes these fractals, or at least fractal-like, is not their detail per se, but that there is complexity at every scale. From the article:

      > we find intricate structure at every scale

      > At every length scale, small changes in the hyperparameters can lead to large changes in training dynamics

      • bob88jg 11 days ago
        "small changes in the hyperparameters can lead to large changes in training dynamics"

          This is the definition of chaos, though, no? A butterfly flaps its wings, and there's a hurricane on the other side of the planet...
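          The standard toy demonstration of that sensitivity is the logistic map at r = 4 (a generic chaos illustration, not something from the post):

```python
def max_divergence(x0, eps=1e-10, steps=60, r=4.0):
    """Largest gap between two logistic-map orbits started eps apart."""
    a, b, gap = x0, x0 + eps, 0.0
    for _ in range(steps):
        a, b = r * a * (1 - a), r * b * (1 - b)
        gap = max(gap, abs(a - b))
    return gap

# A 1e-10 "butterfly" perturbation is amplified by many orders of magnitude.
print(max_divergence(0.2))
```

          Sensitivity like this makes individual trajectories unpredictable, but it doesn't by itself make the set of diverging starting points a fractal; that's the extra thing the paper measures.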

        • paulrudy 9 days ago
          As far as I am aware, these kinds of nonlinear relationships are a feature of fractal dynamics, but I'm not a mathematician.