If all the world were a monorepo

(jtibs.substack.com)

116 points | by sebg 3 days ago

15 comments

  • derefr 5 hours ago
    CRAN’s approach here sounds like it has all the disadvantages of a monorepo without any of the advantages.

    In a true monorepo — the one for the FreeBSD base system, say — if you make a PR that updates some low-level code, then the expectation is that you 1. compile the tree and run all the tests (so far so good), 2. update the high-level code so the tests pass (hmm), and 3. include those updates in your PR. In a true centralized monorepo, a single atomic commit can effect a vertical-slice change through a dependency and all of its transitive dependents.

    I don’t know what the equivalent would be in distributed “meta-monorepo” development à la CRAN, but it’s not what they’re currently doing.

    (One hypothetical approach I could imagine is that a dependency’s major-version release can ship with AST-rewriting code migrations, which automatically push “dependency-computed” PRs to the dependents’ repos while also applying those same patches as temporary forced overlays onto releases of the dependent packages, until the related PRs get merged. So your dependents’ tests still have to pass before you can release your package — but you can iteratively update things on your end until those tests do pass, and then trigger a simultaneous release of your package and your dependent packages. It’s then in your dependents’ court to modify and merge your PR to undo the forced overlay, asynchronously, as they wish.)
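
    For concreteness, a minimal sketch of what one of those shipped migrations could look like, assuming a hypothetical breaking rename of fit() to fit_model() in the dependency's API (Python's ast module standing in here for whatever rewriting machinery the ecosystem would actually use):

      import ast

      class FitRename(ast.NodeTransformer):
          """Rewrite calls to the old entry point into calls to the new one."""
          def visit_Call(self, node: ast.Call) -> ast.Call:
              self.generic_visit(node)
              if isinstance(node.func, ast.Name) and node.func.id == "fit":
                  node.func.id = "fit_model"
              return node

      def migrate(source: str) -> str:
          tree = FitRename().visit(ast.parse(source))
          ast.fix_missing_locations(tree)
          return ast.unparse(tree)  # ast.unparse requires Python 3.9+

      print(migrate("model = fit(x, y)"))  # -> model = fit_model(x, y)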

    • boris 1 hour ago
      There is a parallel with database transactions: it's great if you can do everything in a single database/transaction (atomic monorepo commit). But that only scales so far (on both dimensions: single database and single transaction). You can try distributed transactions (multiple coordinated commits) but that also has limits. The next step is eventual consistency, which would be equivalent to releasing a new version of the component while preserving the old one and with dependents eventually migrating to it at their own pace.
      • awesome_dude 20 minutes ago
        Doesn't that rely on the code being able to work in both states?

        I mean, to use a different metaphor, an incremental rollout is all fine and dandy until the old code discovers that it cannot work with the state generated by the new code.

    • joek1301 5 hours ago
      > One hypothetical approach I could imagine is that a dependency’s major-version release can ship with AST-rewriting code migrations

      Jane Street has something similar called a "tree smash" [1]. When someone makes a breaking change to their internal dialect of OCaml, they also push a commit updating the entire company monorepo.

      It's not explicitly stated whether such migrations happen via AST rewrites, but one can imagine leveraging the existing compiler infrastructure to do that.

      [1]: https://signalsandthreads.com/future-of-programming/#3535

    • chii 3 hours ago
      > In a true monorepo ...

      Ideally, yes. However, such a monorepo can become increasingly complex as the software being maintained grows larger and larger (and/or as more and more people work on it).

      You end up with massive changes - ones that might eventually become something no single person can realistically hold in their head. Not to mention clashes: people will make contradictory/conflicting changes, and there will have to be some sort of external resolution mechanism (or the "default" one, which is first come, first served).

      Of course, you could "manage" this complexity by introducing API boundaries/layers and treating those APIs as too important to change often. But that just means you're a monorepo in name only - not too different from having separate repos with versioned artefacts and a defined API boundary.

      • rafaelmn 48 minutes ago
        > Of course, you could "manage" this complexity by introducing API boundaries/layers and treating those APIs as too important to change often. But that just means you're a monorepo in name only - not too different from having separate repos with versioned artefacts and a defined API boundary.

        You have visibility into who is using what and you still get to do an atomic update commit even if a commit will touch multiple boundaries - I would say that's a big difference. I hated working with shared repos in big companies.

    • skybrian 3 hours ago
      Yes, it's nice when you can update arbitrarily distant files in a single commit. But when an API is popular enough to be used by dozens of independent projects, this is no longer practical. Even in a monorepo, you'll still need to break the change up: add the new API, gradually migrate the usages, and then delete the old API.
      • vasvir 2 hours ago
        Yes,

        Also, the other problem with a big monorepo is that nothing ever dies. Let's say you have a library and there are 1000 client programs or other libraries using your API. Some of them are pretty popular and some of them are fringe.

        However, when you change the API, they all carry the same weight: you have to fix them all. In the non-monorepo case, the fringe clients will eventually die, or their maintainers will invest in them and update them. It's like capitalism vs. communism with central planning and all.

        • malkia 54 minutes ago
          If the monorepo is built and tested by a single build system (Bazel, Buck, etc.), then the build graph can surface leaf dependencies with no users - for example, a library plus its tests that nothing else depends on (granted, it might be something new that just popped up and is still in early development).
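
          As a toy illustration (made-up target names, plain Python standing in for the build system's own query tooling):

            # Flag targets that nothing else in the graph depends on.
            deps = {
                "//lib/core": [],
                "//lib/stats": ["//lib/core"],
                "//app/report": ["//lib/stats"],
                "//lib/experimental": ["//lib/core"],  # built and tested, but unused
            }

            used = {d for target_deps in deps.values() for d in target_deps}
            unused = [target for target in deps if target not in used]
            print(unused)  # ['//app/report', '//lib/experimental']
            # In practice you would exclude deliberate roots like binaries/apps.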

          Bazel has the concept of visibility, where, while you are developing something in the tree, you can explicitly say who is allowed to use it (like a trial version).

          But the point is: if something is built, it must be tested, and coverage should catch what is built but not tested, and also what is built and tested but not really used much.

          But why remove it, if it takes no time to build and test? If it does start taking real time to test, it's usually on your team to stand up your own testing environment rather than rely on the general presubmit/preflight one - and since the last capacity planning only left you with so much budget, you'll soon ask yourselves: do we really need this piece of code and its tests?

          I mean, it's not perfect; there will always be something churning away, using time and money, but until it's a pretty big problem it won't go away automatically (yet).

  • 0xbadcafebee 3 hours ago
    The author is a little confused. A system that blocks releases on defects and doesn't pin versions is continuous integration, not a monorepo. The two are not synonymous. Monorepos often use continuous integration to ensure their integrity, but you can use continuous integration without a monorepo, and monorepos can be used without continuous integration.

    > But the migration had a steep cost: over 6 years later, there are thousands of projects still stuck on an older version.

    This is a feature, not a bug. The pinning of versions allows systems to independently maintain their own dependency trees. This is how your Linux distribution actually remains stable (or used to, before the onslaught of "rolling release" distributions and the infection of the "automatically updating application" into product development culture, which constantly leaves me with non-functional mobile applications that I'm forced to update once a week). You set the versions, and nothing changes, so you can keep using the same software, and it doesn't break. Until you choose to upgrade it and deal with all the breaking shit.

    Every decision in life is a tradeoff. Do you go with no version numbers at all, always updating, always fixing things? Or do you always require version numbers, keeping things stable, but having difficulty updating because of a lack of compatible versions? Or do you find some middle ground? There are pros and cons to all these decisions. There is no one best way, only different ways.

    • summis 2 hours ago
      For me the comparison to a monorepo made a lot of sense. One of the main features of a monorepo is maintaining a DAG of dependencies and using it to decide which tests to run given a code change. CRAN package publishing seems to follow the same idea.
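
      As a rough sketch of that idea (hypothetical package names; a real system would derive the graph from package metadata rather than a hand-written dict):

        # Edges point from a package to the packages it depends on.
        deps = {
            "grf": [],
            "mypkg": ["grf"],
            "downstream": ["mypkg"],
            "unrelated": [],
        }

        def tests_to_run(changed: str) -> set[str]:
            """The changed package plus everything that transitively depends on it."""
            affected, frontier = {changed}, [changed]
            while frontier:
                pkg = frontier.pop()
                for candidate, its_deps in deps.items():
                    if pkg in its_deps and candidate not in affected:
                        affected.add(candidate)
                        frontier.append(candidate)
            return affected

        print(sorted(tests_to_run("grf")))        # ['downstream', 'grf', 'mypkg']
        print(sorted(tests_to_run("unrelated")))  # ['unrelated']
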
      • procaryote 1 hour ago
        That's something you can do just as well with multiple repos though

        What a monorepo gives you on top of that is that you can change the dependents in the same PR

      • malkia 52 minutes ago
        For me too - in a way it's a "virtual" monorepo, as if all these packages belonged in some ideal monorepo, even though they don't.
    • Jyaif 1 hour ago
      > There is no one best way

      I think the laws of physics dictate that there is. If your developers are spanning the galaxy, the speed of development is slower with continuous integration than with pinned deps.

  • croemer 1 hour ago
    One workaround that isn't mentioned is that one could just release an entirely new package for each blocked release: grf1, grf2, grf3...

    The downside is that dependents have to manually change their dependency, and you get a proliferation of packages with only informal relationships between them.

  • quelsolaar 1 hour ago
    This is awesome. I run a team that uses software I produce, and I have a rule that I can’t deliver breaking changes and I can’t force migrations. I can do the migration myself, or I have to emulate the old behavior next to the new. It makes you think really hard about releasing new APIs. I wish this were standard practice.
  • haberman 4 hours ago
    This was an interesting article, but it made me even more interested in the author's larger take on R as a language:

    > In the years since, my discomfort has given away to fascination. I’ve come to respect R’s bold choices, its clarity of focus, and the R community’s continued confidence to ‘do their own thing’.

    I would love to see a follow-up article about the key insights that the author took away from diving more deeply into R.

  • noname123 1 hour ago
    To be honest, I don't know which is worse: installing an R library that requires re-installing a bunch of updates and getting stuck in R installation hell, or experiencing a conda install that is stuck in "Resolving Dependencies" hell. The only thing I've learned that mitigates both is to just containerize everything.
  • esafak 6 hours ago
    > In what other ecosystem would a top package introduce itself using an eight-variable equation?

    That's the objective function of Hastie et al's GLM. I had a good chuckle when I realized the author's last name is Tibshirani. If you know you know.

  • kazinator 6 hours ago
    > But… CRAN had also rerun the tests for all packages that depend on mine, even if they don’t belong to me!

    When you propose a change to something that other things depend on, it makes sense to test those dependents for a regression; this is not earth shattering.

    If you want to change something in a way that breaks them, you then have to do it differently. First provide a new way of doing the thing. Then get all the dependents that use the old way to migrate to the new way. Then, once the dependents no longer rely on the old way, you can push out a change that removes it.
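
    A minimal sketch of the "provide a new way first" step, with made-up names rather than anything from a real package:

      import warnings

      def train_model(data, *, regularization=0.0):
          """The new way of doing it."""
          ...  # real work would go here

      def train(data):
          """The old way, kept around temporarily as a compatibility shim."""
          warnings.warn(
              "train() is deprecated; use train_model() instead",
              DeprecationWarning,
              stacklevel=2,
          )
          return train_model(data)

    Once the dependents have migrated, a later release can delete train() outright.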

  • ants_everywhere 4 hours ago
    I genuinely enjoy R. I use it for calculations daily. In comparison using Python feels tedious and clunky even though I know it better.

    > CRAN had also rerun the tests for all packages that depend on mine, even if they don’t belong to me!

    Another way to frame this is that these are the customers of your package's API. If you break them, you are required to ship a fix.

    I see why this isn't the default (e.g. on GitHub you have no idea how many people depend on you). But the developer experience is much nicer like this. Google, for example, makes this promise with some of their public tools.

    Outside the world of professional software developers, R is used by many academics in statistics, economics, the social sciences, etc. This rule makes it less likely that their research breaks because of some obscure dependency they don't understand.

  • croemer 1 hour ago
    Might be useful to add "R" somewhere to the title to make it clearer what this article is about.
  • maxbond 6 hours ago
    > When declaring dependencies, most packages don’t specify any version requirements, and if they do, it’s usually just a lower bound like ‘grf >= 1.0’.

    I like the perspective presented in this article; I think CRAN is taking an interesting approach. But the nuts and bolts here are wild. Explicitly saying you're compatible with any future breaking changes!? You can't possibly know that!

    I get that a lot of R programmers might be data scientists first and programmers second, so many of them probably don't know semver, but I feel like the language should guide them to a safe choice here. If CRAN is going to email you about reverse dependencies, maybe publishing a package with a crazy semver expression should also trigger an email.
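
    To make that concrete, here is what a lower-bound-only constraint actually accepts, sketched with Python's third-party packaging library standing in for whatever resolver R uses (pip install packaging):

      from packaging.specifiers import SpecifierSet

      constraint = SpecifierSet(">=1.0")  # the CRAN-style 'grf >= 1.0'
      print("1.4.0" in constraint)        # True: a compatible minor release
      print("2.0.0" in constraint)        # True: a breaking major bump is accepted too
      print("0.9.0" in constraint)        # False: only the floor is enforced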

    • remus 54 minutes ago
      > Explicitly saying you're compatible with any future breaking changes!? You can't possibly know that!

      I kind of like it, in a way. In a lot of ecosystems it's easy for package publishers to be a bit lazy about compatibility, which can push a huge amount of work onto package consumers. R seems similar to Go in this regard: there is a big focus on not breaking compatibility, which means maintainers are conservative about adding new stuff until they're happy to support it for a long time.

      • maxbond 9 minutes ago
        I guess it wouldn't bother me if it weren't a semver expression. As a semver expression it's ridiculous on its face: a breaking release will break your code until proven otherwise. "foo >= 2024R1"? Well, I'm not entirely comfortable with it, but if you've got a comprehensive plan to address the potential dangers (as CRAN appears to), godspeed.
  • pabs3 2 hours ago
    Debian is kind of like that, except packages broken by upgrades are mostly just removed.
    • malkia 50 minutes ago
      I've been using it for so many years, and it makes complete sense now that you mention it! Thanks!
    • dima55 1 hour ago
      Eventually, yes, I guess. But long before that, the breaker and the breakee are both notified, and the breakage hopefully gets fixed. As it should be.

      I would hope the other aspirational software distribution systems (pip, npm, et al.) ALSO do that, but according to this article, I guess they don't? Not shocked, to be honest.

  • cortesoft 3 hours ago
    I feel like if more package repositories did this, you would end up just finding more and more workarounds and alternative distribution methods.

    I mean, just look at how many projects use “curl and bash” as their distribution method, even though the project repositories they could use instead don’t even require anything nearly as onerous as the reverse-dependency checks described in this article. If the minimal requirements the current repos have are enough to push projects to alternate distribution methods, I can’t imagine what would happen if checks like these were added.

  • jiggawatts 5 hours ago
    This (with some tweaks) is what I envision the future of NPM, Cargo, and NuGet should look like.

    Automated tests, compilation by the package publisher, and enforcement of portability flags and SemVer semantics.

  • didip 1 hour ago
    Damn. Well, time to fork everything and keep internal patches internal.

    This system is unworkable.

    • gorset 39 minutes ago
      That’s the default in the monorepos I’ve worked on.

      When a third-party dep is broken or needs a workaround, just include a patch in the build (or fork). Then those patches can be upstreamed asynchronously without slowing down development.