Dolt is Git for Data

(github.com)

334 points | by gjvc 1371 days ago

35 comments

throwaway892238 1371 days ago
This is the future of databases, but nobody seems to realize it yet.
One of the biggest problems with databases (particularly SQL ones) is they're a giant pile of mutable state. The whole idea of "migrations" exists because it is impossible to "just" revert any arbitrary change to a database, diff changes automatically, merge changes automatically. You need some kind of intelligent tool or framework to generate DDL, DML, DCL, they have to be applied in turn, something has to check if they've already been applied, etc. And of course you can't roll back a change once it's been applied, unless you create even more program logic to figure out how to do that. It's all a big hack.
By treating a database as version-controlled, you can treat any operation as immutable. Make any change you want and don't worry about conflicts. You can always just go back to the last working version, revert a specific change, merge in one or more changes from different working databases. Make a thousand changes a day, and when one breaks, revert it. No snapshotting and slowly restoring the whole database due to a non-reversible change. Somebody dropped the main table in prod? Just revert the drop. Need to make a change to the prod database but the staging database is different? Branch the prod database, make a change, test it, merge back into prod.
The effect is going to be as radical as the popularization of containers. Whether you like them or not, they are revolutionizing an industry and are a productivity force multiplier.
[-]
- jandrewrogers 1370 days ago
  This is how relational databases have commonly worked since at least the 1990s and is called multi-version concurrency control (MVCC). Welcome to the future, it is called PostgreSQL. There are at least two reasons no sensible database designer would allow users to operate a database in this way even though they are technically capable of it.
  First, keeping every version of every piece of data forever is an excellent way to consume non-intuitively vast amounts of storage even if your data model is tiny. Every time this feature has been offered by databases, it immediately causes a rash of "out of storage" errors that force the user to manually and permanently delete large numbers of old versions. This is extremely user-unfriendly, so the feature is almost immediately removed in subsequent versions because the pain it causes far outweighs the benefits even when used carefully. In typical MVCC systems, old versions are aggressively garbage collected automatically to limit out-of-storage errors.
  Second, finding or reconstructing an arbitrary number of old versions of data is unavoidably expensive. Much of the architectural difference between various MVCC implementations are trying to manage the rather severe performance tradeoffs of maintaining multiple versions of data and navigating to the version you need, with the understanding that all of these versions live on storage and rarely in a single place. There is no optimal way, and keeping version chains short is critical for good performance.
  There is very deep literature around MVCC-style databases. The challenges of generalizing and maximally exploiting MVCC as a user feature while having performance that is not poor to the point of unusability are thoroughly documented.
  [-]
  - zachmu 1370 days ago
    MVCC is not version control, and time travel / historical querying is not version control.
    Dolt's unique functionality isn't time travel, although it has that. It's version control, i.e. branch and merge, push and pull, fork and clone. A bunch of database products give you some of this for schema migrations, but Dolt is the only one that does it for table data as well.
    [-]
    - jandrewrogers 1370 days ago
      The conceit here is the assumption that this has not been built many times by very clever software engineers. It is not a new idea. True git-like version control systems for managing large volumes of data have been built on MVCC kernels for a decades -- branch and merge, push and pull, fork and clone.
      There are fundamental computer science and technical issues that make scaling these systems for arbitrary data models extremely difficult. The platforms always had the ambition to be general but the design tradeoffs required to make them scale requires narrowly overfitting for a particular type of data model such that they can only be used for the original use case. And even then, the performance ends up being not good.
      I've never designed one from scratch but I've worked on a few at large companies. All of them started with the vision you are proposing, all of them failed at achieving that vision because of the technical tradeoffs required to enable something resembling scalability. Unless you are proposing some novel computer science that renders these issues moot, you aren't presenting a credible defense that this hasn't been done before.
      [-]
      - zachmu 1370 days ago
        Git-like version control requires a Merkle DAG. Unless you know something I don't, there are no RDBMS products that incorporate a Merkle DAG for storage. Dolt is the first.
        Table data is stored in a cross between a Merkle DAG and a B Tree (a prolly tree), which is what makes diff / merge performant and scalable. We didn't invent these data structures but we believe we are the first to build a SQL database on them.
        https://docs.dolthub.com/architecture/storage-engine/prolly-...
        [-]
        jandrewrogers 1370 days ago
        > Git-like version control requires a Merkle DAG.
        This is false, you are conflating the abstract algorithm with a narrow implementation. That's like saying the only possible sorting algorithm is quicksort.
        With all due respect, you seem to be only loosely familiar with database architecture, both theory and historical practice. Nothing you've described is actually novel. That you are unfamiliar with why no one builds things this way, despite many attempts, does not lend confidence.
        I am actually a big fan of people trying unorthodox approaches to databases that have never been tried before, this just isn't such an example. Which doesn't make your approach wrong per se, but it leaves you exposed to learning why other people tried and abandoned it.
        Tangentially, the "prolly tree" is intrinsically not a scalable data structure. That may satisfy your design requirements but I can't tell.
        [-]
        wstuartcl 1367 days ago
        I agree with everything you have said in this thread, the one area I would take a slightly different approach with is that some of the best algorithmic advances do come from people and teams looking at a problem anew -- without examining all previous works and failures.
        Are they likely to find a way to jump forward? no. Are they likely to fall into the same traps and dead ends that are well documented in this space? yep. But in general I believe we should temper the negative feedback with constructive details:
        Hey have you seen this previous and very similar work in the same space, here is some info where and how they failed to scale.
        Taking a stance (even an implied one) that this is not novel and therefore you should stop could have been made against many of the teams working in vast areas of algorithms and architectures that have lead to jumps forward. CS is and should be science and to that end, our current knowledge and understanding is simply what is known today and must remain mutable to move forward. While their current approach is not scalable and a dead end, who knows if this experience will lead them to identify previously unrealized solutions (even if partial).
        throwaway892238 1370 days ago
        > you are conflating the abstract algorithm with a narrow implementation
        They're literally telling you Git uses a Merkle DAG and they wanted to recreate Git so they used a Merkle DAG. That's not conflating, it's copying.
        > you seem to be only loosely familiar with database architecture
        Based on what? The only comments GP has made about DBA in this entire HN thread is "this is not MVCC", which is correct.
        > I am actually a big fan of people trying unorthodox approaches to databases that have never been tried before
        So stop discouraging the neophyte? Sometimes innovation requires the ignorant to figure things out without knowing better, because that way they won't quit before they've begun. Let them figure it out. And if it's not novel, who cares? It doesn't have to be some perfect architectural masterpiece to solve people's problems. If it works well enough just for WordPress blogs, that's pretty great already.
        [-]
        discreteevent 1370 days ago
        > They're literally telling you Git uses a Merkle DAG
        They literally told them: "Git-like version control requires a Merkle DAG."
        aboodman 1369 days ago
        Sigh. Databases seems to be one of the last bastions of ivorytower-ism in software engineering.
        Databases aren't magic. They are software. And as you say, there is a deep well of literature, and almost all interesting modern databases are open source. So it's possible for anyone with the interest and motivation to learn how they work, and yes, improve them.
        > With all due respect, you seem to be only loosely familiar with database architecture, both theory and historical practice. Nothing you've described is actually novel. That you are unfamiliar with why no one builds things this way, despite many attempts, does not lend confidence.
        You claim that there is all this information out there saying why this won't work, and that it has been tried, but don't point to any of it.
        > Tangentially, the "prolly tree" is intrinsically not a scalable data structure.
        First: Dolt publishes industry standard performance benchmarks and they are an average of 4.4x slower than MySQL:
        https://docs.dolthub.com/sql-reference/benchmarks/latency
        This is using the original storage format from noms which wasn't written for oltp workloads. A new storage format is coming which is and will dramatically improve this:
        https://www.dolthub.com/blog/2022-05-20-new-format-alpha/
        In any case 5x slower already shows, experimentally, that this approach works for database-style problems.
        ===
        Second: In a purely algorithmic sense, prolly trees are the definition of scalable. They are log(n) (with a large base) for inserts, deletes, and seeks and they have efficient ordered scans. They have basically the same algorithmic complexity as B Trees, B+ Trees, LSM trees, etc -- very similar properties to the data structures used by other databases. The problem with prolly trees is actually the reverse: they are scalable, but have large constant factors due to the overhead of hashing.
        But a single-digit constant factor slower performance than MySQL but with versioning seems like a great product for many applications.
        If anyone reading this is interested, the Dolt team did a great job writing up the how prolly trees work and how they compare to classic databases here:
        https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-tabl...
        [-]
        heisjustsosmart 1369 days ago
        Tip: I'd read the jandrewrogers comments before engaging. He actually does know more than you. He makes some unusual claims then doesn't back them up because he has no need to. I've spent and hour or three and learnt nothing because he's so far ahead I can't keep up. Perhaps read his profile about the amazing thing he's invented such as http://www.jandrewrogers.com/2015/10/08/spacecurve/
        Let me quote a bit. I admit, I don't understand it but then how could I?
        Algorithm design using topology manipulation can be enormously challenging to reason about. You are often taking a conceptually simple algorithm, like a nested loop or hash join, and replacing it with a much more efficient algorithm involving the non-trivial manipulation of complex high-dimensionality constraint spaces that effect the same result. Routinely reasoning about complex object relationships in greater than three dimensions, and constructing correct parallel algorithms that exploit them, becomes easier but never easy.
        Incredible! I wish I could grok this stuff.
        You should save your breath and just get out of the way of the real experts.
        [-]
        aboodman 1367 days ago
        Yeah, I should have really checked who I was dealing with first. Standing aside.
        aboodman 1369 days ago
        Oh. Somebody said it better than me:
        https://news.ycombinator.com/item?id=31851592
- 411111111111111 1371 days ago
  > This is the future of databases, but nobody seems to realize it yet
  It's a pipedream, not the future.
  Your database is either too big / has too much throughput or migrations just don't matter. And it's not like you wouldn't need migrations with a versioned schema, as otherwise a rollback would mean data loss.
  [-]
  - zachmu 1371 days ago
    You're suffering from a failure of imagination.
    Consider a CMS, one of the most common forms of database backed applications. What if you could give your customer a "dev" branch of all their data to make their changes on and test out new content, that you could then merge with back to prod after somebody reviews it in a standard PR workflow?
    This is the workflow one of our earliest customers built. They run network configuration software, and they use Dolt to implement a PR workflow for all changes their customers make.
    More details here:
    https://www.dolthub.com/blog/2021-11-19-dolt-nautobot/
    [-]
    - Johannesbourg 1371 days ago
      Personally working with timeseries data my experience is that clients typically underestimate how much storage they need for a single state, let alone including historic versions. The decision people want more data, not more snapshots for a given storage spend. But that's timeseries.
      [-]
      - hinkley 1371 days ago
        They want more data but they don't want to pay for it. People want lots of things, doesn't mean they get it or deserve it.
        I can't recall which it was but one of the timeseries databases was bragging on the fact that there are certain situations where scanning a block of data is as cheap as trying to add finer grained indexes to it, especially with ad hoc queries. They did a bunch of benchmarks that said block scanning with compression and parallelism was workable.
        And while compression typically leads to write amplification (or very poor compression ratios), in a timeseries database, or a regular database architected in a timeseries-like fashion, modifying the old data is deeply frowned upon (and in fact I've heard people argue for quasi-timeseries behavior because modifying old records is so punishing, especially as the application scales), so as long as you can decide not to compress some pages - new pages - this is not a problem.
    - lazzlazzlazz 1370 days ago
      This is Hacker News. The norm is a deep, thoughtful understanding of the state of the art and a total inability to apply the imagination to the future.
    - password4321 1370 days ago
      See also: https://versionpress.com (not being actively developed)
      [-]
      - zachmu 1370 days ago
        If we were less busy we would adopt versionpress and port it to dolt, such a cool product. Some day.
  - throwaway892238 1371 days ago
    You're thinking in terms of the present, but I'm saying it's the future. At present it doesn't make sense, because nobody has yet made a database which does version control on very big datasets with a lot of throughput. But when somebody does make it, it will be completely obvious that this was something we always needed.
    It's all just software. There is essentially no limit to what we can make software do as long as the hardware supports it. And there's no hardware limit I know of that says version-controlled databases can't work. We just need to figure out how they will work, and then make 'em (or try to make 'em and in the process figure it out).
    > And it's not like you wouldn't need migrations with a versioned schema, as otherwise a rollback would mean data loss.
    When you roll back a code change, you don't lose code, as it's still in history. If you need to revert but keep some code, you branch the code, copy the needed code into some other part of the app, revert the old change, merge everything. If on merge there is a conflict (let's presume losing data is a conflict), it can prompt you to issue a set of commands to resolve the conflict before merge. You could do all of that in a branch, test it, merge into prod, and on merge it could perform the same operations. The database does all the heavy lifting and the user just uses a console the way they use Git today.
    It's probably going to be required to lock the version of software and the version of the database together, such that both are changed/reverted at the same time. But because this is version control, we could actually serve multiple versions of the same database at the same time. You could have the database present two different versions of itself with the same data COW-overlayed for each version, and two different versions of an application. You could then blue/green deploy both the application and database, each modifying only its version. If you need to revert, you can diff and merge changes from one version to another.
    [-]
    - hinkley 1371 days ago
      Do we make databases smarter, or start asking for database-like behavior from version control systems?
      I can't help thinking how much time the git team or Jetbrains or Chrome or the Confluence team have spent on backing store implementation/migration and file formats instead of using sqlite (like why aren't V8 heap and perf snapshots just a sqlite database?) but then many of these things operate in that gap. So do we keep improving change over time functionality in databases, or make VCS backends with more formal database-like behavior?
      IIRC, Trac stores its wiki history in a subversion repository. Since it already had to understand commit histories and show diffs, that was a sensible choice. Of course it is easier to live with such a decision if the API is good, but I haven't heard anyone say that about any version control system yet.
      [-]
      - throwaway892238 1370 days ago
        Well, they're discrete problems. Version control of source code, packaged applications, container images, databases are all quite different.
        Git is a distributed file manager that operates on files where every change is a commit, and a commit is a set of operations on files, and/or a change to a block of text strings terminated by newlines. Versions are merkle trees of commits.
        RPM/Deb/etc is a semi-centralized file manager that operates on files assuming each change is a collection of files with executable stages before and after copying/linking/unlinking. Versions are arbitrary key=value pairs which optionally depend on other versions, with extra logic to resolve relative versions.
        Docker/OCI is a distributed file manager that operates on layers assuming every layer is a collection of files overlaid on other layers, with extra logic to do extra things with the layers at runtime. Versions are (I think?) merkle trees of layers.
        The database is going to need a helluva lot of custom heuristics and operations to do version-control, because how you use it is so much different than the above. Databases are much more complex beasts, require higher performance, higher reliability, tons more functionality.
        [-]
        hinkley 1369 days ago
        > Well, they're discrete problems.
        I'm not convinced they are. Invention is often a case of adopting a solution well known in another discipline. Discovering that these two things share a category. We keep discovering things that are isomorphic to each other.
        In the end git is keeping a database of code changes as write once data entries, and presenting the snapshots for human consumption. It does a very bad job of pretending that's not the case.
    - iamnafets 1371 days ago
      I think the problem is that the tradeoffs already exist. Most users would prefer more usable space or less money to a full history of their data.
      You might be making the argument that the usability of immutable data is not there yet, but there are well-established theoretical costs of maintaining full history and I don't think they're within bounds of many real-world use-cases.
      [-]
      - throwaway892238 1371 days ago
        If the user doesn't want full history they could configure the database to expunge it with a lifecycle policy, though I think keeping deltas of the changes would make any excess file storage negligible, as most people don't seem to ever get rid of data anyway.
      - CPLX 1371 days ago
        As a guy who's been doing technical stuff of one kind or another since the mid 90's I would say that any analysis that insists that a specific use case has tradeoffs due to lack of memory or processing speed has an inevitable expiration date.
  - packetlost 1371 days ago
    I think the problem is relational datasets like that don't fit well into that model. In reality, it's very possible. Look at Datomic. While I agree, for high-throughput systems, storage is a concern, but the general trends seem to be towards streaming data and retention policies anyways.
  - whazor 1370 days ago
    With big data such a model is even more promising. One of the big problems is that people keep copying data sets, which does not scale. Just syncing newer versions is much more efficient, look at delta.io.
- hinkley 1371 days ago
  In a similar vein, I am trying to work on a tool for a volunteer group, and one of the problems I'm trying to figure out how to model is not just tracking when the data changed, but why it changed.
  Sometimes when you're first entering data you just get it wrong. You said the wrench was in storeroom A1 or the customer lives on 3rd St or the tree was five feet from the sidewalk. If the state of the asset changes due to human action, that's a new thing. But if you opened the wrong list taking inventory, or you missed a keystroke for 33rd St or you just eyeballed the coordinates, then that row was simply wrong, and the correction should be back-dated for most things.
  But if I emptied out A1 because we aren't renting that space anymore, the customer moves or the tree got replanted, then it was here and now it's over there. Which might be important for figuring out things like overstock, taxes or success rates.
  Similarly if the migration introduces derived data, then the migrated data is assumed/inferred, whereas if we took that data from a user, that information is confirmed, which might introduce subtle differences in how best to relate to the user. Things a mediocre business could easily ignore but a quality establishment might be ill-pleased with such a request.
- jahnu 1371 days ago
  Doesn’t Datomic do all this for some years now?
  [-]
  - zachmu 1371 days ago
    Lots of databases offer time travel / historical querying, including datomic, MySQL, Postgres, etc (plugins required in some cases).
    Dolt's unique functionality isn't time travel, although it has that. It's version control, i.e. branch and merge, push and pull, fork and clone. A bunch of database products give you some of this for schema migrations, but Dolt is the only one that does it for table data as well.
    [-]
    - rapnie 1371 days ago
      I think TerminusDB does that as well.
      [-]
      - zachmu 1371 days ago
        Yup, TerminusDB has a very similar set of capabilities, just for a graph DB instead of SQL / relational. Very cool product if you're in the market for a graph DB.
- LukeEF 1371 days ago
  This has to be imagined in the context of a post-SQL future. Unless of course you are a 'SQL is the end of history' person!
- blowski 1371 days ago
  It looks incredible, but somehow seems too good to be true.
  What are the tradeoffs here? When wouldn't I want to use this?
  [-]
  - timsehn 1371 days ago
    Creator here.
    It's slower. This is `sysbench` Dolt vs MySQL.
    https://docs.dolthub.com/sql-reference/benchmarks/latency
    We've dedicated this year to performance with a storage engine rewrite. We'll have some performance wins coming in the back half of the year. We think we can get under 2X MySQL.
    It also requires more disk. Each change is at least on average 4K on disk. So, you might need more/bigger hard drives.
    [-]
    - kragen 1371 days ago
      (Disclaimer: I haven't tried Dolt.)
      In your benchmark it's only 2.1–7.4 times slower than MySQL, average 4.4. And any database someone could fit on a disk 20 years ago (I forget, maybe 8 GB?) fits in RAM now, which makes it about 256 times faster, which is a lot bigger than 4.4. You can get a 20 TB disk now, which is enough space So anything that could be done with MySQL 20 years ago can be done faster and cheaper with Dolt now, which covers, I think the technical term is, a fucking shitload of applications. It probably includes literally every 20th-century application of relational databases.
      Well, except for things that have over 5 billion transactions (20 TB ÷ 4 kB/txn) over their lifetime, I guess, so it might be important to find a way to compact that 4K. 5 billion transactions is 19 months at 100 TPS. If you could get that down to 256 bytes it would be almost 25 years of 100 TPS.
      Also, as I understand it, and correct me if I'm wrong here, that 4.4× slowdown buys you a bulletproof and highly performant and scalable strategy for backups (with PITR), staging servers, data warehousing, readslaves, disk error detection and recovery, cryptographically secure audit logs, bug reproduction, and backtesting. Along with the legal security the Apache 2 license gives you, which you don't have with Datomic.
      Sounds fantastic! It sounds like you're selling its performance a bit short. If someone is really concerned about such a small performance loss they probably aren't really in the market for a new RDBMS.
    - EarthLaunch 1371 days ago
      Another commenter noted a need for migrations in order to handle rollbacks without data loss.
- k_bx 1370 days ago
  Hot take: migrations are useless
  In modern world of SAASes and connectivity, you use multiple "databases" and such (queues, identity providers) anyways, and the world got connected enough to not need to have a local db most of the time.
- qaq 1371 days ago
  You do realize that how MVCC works right? Turn off GC process that collects old versions on production DB and see what happens. Reverting changes out of order is not possible in many cases.
  [-]
  - zachmu 1371 days ago
    Right, but it is with Dolt. That's the point.
    CALL DOLT_REVERT('bad-commit-hash')
    Works just like git revert: creates an inverse of the changes in the commit given and applies it as a patch.
    [-]
    - qaq 1370 days ago
      so I have commit1 drop table blah commit2 create table blah with diff schema
      I revert commit1 what would be the state?
      [-]
      - aboodman 1369 days ago
        It will be a conflict. But you can commit the revert to a different branch and run both schemas at the same time. This provides a lot of flexibility! Check out https://www.dolthub.com/blog/2022-04-29-dolt-revert/ for some discussion of this.
- dizhn 1370 days ago
  How would you revert, merge, rollback or otherwise make sense of real world data that necessarily reflects "state"? Or is this only for development?
- simlevesque 1371 days ago
  QLDB seems like something that goes in this direction. What's your opinion of it ?
  [-]
  - zachmu 1371 days ago
    QLDB isn't version controlled, it's an immutable ledger database. If that's what you're looking for, definitely choose it, it's a great product.
    More discussion of the immutable DB product space:
    https://www.dolthub.com/blog/2022-03-21-immutable-database/
- AaronLasseigne 1369 days ago
  One issue with immutable data stores for businesses is compliance with things like GDPR. You need some mechanism for scrubbing data and its version history effectively making it non-immutable.
cosmic_quanta 1371 days ago
That looks awesome. One of the listed use-cases is 'time-travel': https://dolthub.com/blog/2021-03-09-dolt-use-cases-in-the-wi...
I wish we could use this at work. We're trying to predict time-series stuff. However, there's a lot of infrastructure complexity which is there to ensure that when we're training on data from years ago, that we're not using data that would be in the future from this point (future data leaking into the past).
Using Dolt, as far as I understand it, we could simply set the DB to a point in the past where the 'future' data wasn't available. Very cool
[-]
- wodenokoto 1371 days ago
  For at least your use-case, this is a fairly common feature in enterprise databases.
  For a lot of types of data, auditing requires history.
  Have a look at MSSQLs history table, for example: https://docs.microsoft.com/en-us/sql/relational-databases/ta...
- lichtenberger 1371 days ago
  Basically my research project[1] I'm working on in my spare time is all about versioning and efficiently storing small sized revisions of the data as well as allowing sophisticated time travel queries for audits and analysis.
  Of course all secondary user-defined, typed indexes are also versioned.
  Basically the technical idea is to map a huge tree of index tries (with revisions as indexed leave pages at the top-level and a document index as well as secondary indexes on the second level) to an append-only file. To reduce write amplification and to reduce the size of each snapshot data pages are first compressed and second versioned through a sliding snapshot algorithm. Thus, Sirix does not simply do a copy on write per page. Instead it writes nodes, which have been changed in the current revision plus nodes which fall out of the sliding window (therefore it needs a fast random-read drive).
  [1] https://github.com/sirixdb/sirix
  [-]
  - awmarthur 1370 days ago
    That sounds somewhat similar to Dolt's storage index structure: Prolly Trees https://www.dolthub.com/blog/2020-04-01-how-dolt-stores-tabl...
- kortex 1371 days ago
  Have you looked at dvc www.dvc.org? Takes a little bit to figure out how you want to handle the backing store (usually s3) but then it's very straightforward. You could do a similar pattern: have a data repository and simply move the git HEAD to the desired spot and dvc automatically adds/removes the data files based on what's in the commit. You can even version binaries, without blowing up your .git tree.
  [-]
  - nerdponx 1371 days ago
    DVC is great for tracking locally-stored data and artifacts generated in the course of a research project, and for sharing those artifacts across a team of collaborators (and/or future users).
    However DVC is fundamentally limited because you can only have dependencies and outputs that are files on the filesystem. Theoretically they could start supporting pluggable non-file-but-file-like artifacts, but for now it's just a feature request and I don't know if it's on their roadmap at all.
    This is fine, of course, but it kind of sucks for when your data is "big"-ish and you can't or don't want to keep it on your local machine, e.g. generating intermediate datasets that live in some kind of "scratch" workspace within your data lake/warehouse. You can use DBT for that in some cases, but that's not really what DBT is for and then you have two incompatibile workflow graphs within your project and a whole other set of CLI touch points and program semantics to learn.
    The universal solution is something like Airflow, but it's way too verbose for use during a research project, and running it is way too complicated. It's an industrial-strength data engineering tool, not a research workflow-and-artifact-tracking tool.
    I think my ideal tool would be "DVC, but pluggable/extensible with an Airflow-like API."
    [-]
    - henrydark 1371 days ago
      I have dvc pipelines such that input/output is iceberg snapshot files. The data gets medium-big and it works well.
      [-]
      - nerdponx 1371 days ago
        I never heard of Apache Iceberg before. I've used Databricks Delta Lake; is it similar? What is a snapshot file in this case?
        [-]
        henrydark 1370 days ago
        It's basically the same, I just went with iceberg because the specification is a bit more transparent
        [-]
        nerdponx 1370 days ago
        Interesting. So the snapshot file acts much in the same way as a manual "sentinel" file? I generally try to avoid such things because they are brittle and it's easy make a mistake and get the "ad hoc database on your filesystem" out of sync with the actual data.
  - isolli 1371 days ago
    I'm looking into DVC right now, and I feel like the code history (in git) and the data history are too intertwined. If you move the git HEAD back, then you get the old data back, but you also get the old code back. I wish there was a way to move the two "heads" independently. Or is there?
    Edit: I can always revert the contents of the .dvc folder to a previous commit, but I wonder if there's a more natural way of doing it.
    [-]
    - arjvik 1370 days ago
      If you want the dataset to be independent, I would recommend having a seperate repository for the dataset, and using Git Submodules to pull it in. That way you can checkout different versions of the dataset and code because they are essentially in seperate working trees.
    - george_ciobanu 1371 days ago
      also check out Datomic.
- Cthulhu_ 1370 days ago
  I've (partially / POC) implemented time travel in a SQLite database; the TL;DR is that whenever you create a table, you add a second, identical or nearly-identical table with a `_history` suffix; the history table has a valid from and valid to. Then you add a trigger on the primary table that, on update or on delete, makes a copy of the old values into the history table, setting the 'valid_to' column to the current timestamp.
  The reason I used a separate table is so that you don't have to compromise or complicate the primary table's constraints, indices and foreign keys; the history table doesn't really need those because it's not responsible for data integrity.
  Anyway, once you have that, you can run queries with a `where $timestamp is between start_date and end_date` condition, which will also allow you to join many tables at a certain point in time. To also be able to get the latest version, you can use a union query (iirc).
  I'm sure there's a lot of caveats there though. What I should do is take some time in the weekend and work on that POC, publish it for posterity / blog post fuel.
- yread 1371 days ago
  We use DataVault for that. And perhaps Databricks at some point in the future
aratob 1371 days ago
Previous discussions:
- Dolt is Git for Data: a SQL database that you can fork, clone, branch, merge (2021) - 177 comments https://news.ycombinator.com/item?id=26370572
- Dolt is Git for data (2020) - 191 comments https://news.ycombinator.com/item?id=22731928
cgio 1371 days ago
I have been interested in this space, but have failed to understand how these versioning solutions for data work in the context of environments. There are aspects of time travel that line up better e.g. with data modelling approaches (such as bitemporal DBs, xtdb etc.) others more with git-like use cases (e.g. schema evolution, backfilling) some combinations. The challenge is, with data I don’t see how you’d like to have all environments in same place/repo and there may be additional considerations coupled with directionality of moves, such as anonymisation for moving from prod to non-prod , back filling for moving from non-prod to prod etc. Keen to read more on other people experiences in this space and how they might be combining different solutions.
[-]
- nerdponx 1371 days ago
  These tools aren't really meant for developers. They are meant for researchers, analysts, and other "offline" users and managers of data sources. Data science research workflows generally don't need the same "dev/test/prod" kind of environment setup.
  [-]
  - zachmu 1371 days ago
    I won't speak for other data versioning products, but Dolt is definitely for developers. Our customers are software engineers writing database applications that need version control features.
- ollien 1371 days ago
  I've only used Dolt once, but it was very helpful when I was working in an ML class on a group project. Previously we would commit gigantic CSVs to git, which sucked. Putting it in Dolt made a lot of the data exploration and sharing easier, and separated our data from the code.
remram 1371 days ago
What kind of merge actually happens, e.g. how does it deal with conflicts? Does it merge at the SQL command level or at the changed tuple level?
If I have
```
    name   job age
    remram dev 32
```
and I concurrently do
```
    UPDATE table SET age = age + 2 WHERE job = 'dev';
    UPDATE table SET job = 'mgr' WHERE age = 32;
```
Do I get a conflict? Do I get something consistent with the two transactions serialized, e.g. dev 34 or op 32)? Can I get something that no ordering of transaction could have given me, like mgr 34?
[-]
- zachmu 1371 days ago
  Merge is cell-wise. If the two heads being merged edited disjoint cells, there's no conflict, they merge together. If they touched the same cell, or if one branch deleted a row another added, then you get a conflict.
  Merge conflicts are stored in a special system table so you can decide what to do with them, kind of analogous to conflict markers in a source file. More details here:
  https://docs.dolthub.com/sql-reference/version-control/merge...
  The situation you're talking about with two transactions isn't a merge operation, it's just normal database transaction isolation level stuff. Dolt supports REPEATABLE_READ right now, with others coming in future releases. So in the example above, whichever transaction committed last would fail and get rolled back (assuming they touched the same rows).
  [-]
  - remram 1370 days ago
    > The situation you're talking about with two transactions isn't a merge operation, it's just normal database transaction isolation level stuff
    I mean, arguably. It's not like there is a standard definition for "merge operation" on data. Even Git tries to do more than line-level changes, taking context into account, and turning one line change into two line changes if merging across a file copy for example.
    Dolt markets itself as "a version controlled SQL database", so I think it is perfectly reasonable to consider the standard that already exists for concurrent changes to a SQL database, and that's transaction isolation.
    I guess anything more complex than this would be pretty unwieldy though, with probably little benefits. I am struggling to come up with a good example for the kind of anomaly I imagine.
    [-]
    - zachmu 1370 days ago
      You're thinking too small. The transaction is generally not the unit you want to apply version control to, databases already have robust concurrency support at that level.
      What you want is to have branches on your data that have many transactions applied to them, for days or weeks. Then you merge the branch back to main when it's ready to ship.
      An early customer built pull requests for their configuration application using Dolt, read more here:
      https://www.dolthub.com/blog/2021-11-19-dolt-nautobot/
      [-]
      - remram 1370 days ago
        Please don't tell me what I'm thinking or what I want to do ;-)
        The use case I had in mind is closer to a data.world or dbhub, collaborative data cleaning (which often include bulk operations) rather than merging OLTP databases after days or weeks of changes.
        What I would use now is a script or Jupyter Notebook checked in Git, where cells are strongly ordered and if someone sends me a pull request (changing the code) I have to re-run the notebook to obtain the "merged dataset". I can't say that I have a use for Dolt but it is definitely cool tech.
        [-]
        zachmu 1370 days ago
        If you're a data cleaning guy, you might be interested in our data bounties program. It's collaborative data importing / cleaning. The current one is building a catalog of museum collections. We launch a new one every month.
        https://www.dolthub.com/blog/2022-06-14-announcing-museums-b...
adamw2k 1371 days ago
Was going to say it reminded me of noms, only to realize it's a fork...
https://github.com/attic-labs/noms
[-]
- timsehn 1371 days ago
  Creator here...
  We are a fork of Noms. Lots of incremental changes and we're in the process of a major storage engine overhaul (what we use Noms for) for performance as we speak.
pdpi 1371 days ago
Given how many databases already implement MVCC, some form of versioning is already baked into DB systems. It seems obvious in hindsight that this could be reified into its own feature instead of a book-keeping detail. Cool to see a project actually tackle that challenge.
patrickdevivo 1371 days ago
a very cool project they also maintain is a MySQL server framework for arbitrary backends (in Go): https://github.com/dolthub/go-mysql-server
You can define a "virtual" table (schema, how to retrieve rows/columns) and then a MySQL client can connect and execute arbitrary queries on your table (which could just be an API or other source)
bestouff 1371 days ago
I don't find the name very nice: there's too much potential of mixing it with "doit" with a capital "d" and a capital "i".
[-]
- LandStander 1371 days ago
  I blame font designers.
  [-]
  - cal85 1371 days ago
    The font designer designed a font, not a logo.
- CPLX 1371 days ago
  I mean forget about mixing it up, what about the actual meaning of the word:
  dolt noun \ ˈdōlt \ Definition of dolt : a stupid person
  https://www.merriam-webster.com/dictionary/dolt
  [-]
  - graedus 1371 days ago
    I thought this was an odd name too, but I'm guessing it's intentional, as git also is a derogatory word with a similar meaning.
    : a foolish or worthless person
    https://www.merriam-webster.com/dictionary/git
    [-]
    - timsehn 1371 days ago
      One of the creators here... Indeed. Intentional.
      https://docs.dolthub.com/other/faq#why-is-it-called-dolt-are...
  - ghkbrew 1371 days ago
    Are you aware of the meaning of the word "git"?
    https://www.merriam-webster.com/dictionary/git
    [-]
    - CPLX 1371 days ago
      Without ever thinking hard about it I always assumed git came from American cowboy/western slang:
      Specifically this: https://en.wikipedia.org/wiki/Git_Along,_Little_Dogies
      Every time I type it in I am visualizing my little code changes as sickly orphaned calves being herded into the main pen with all the mature cows currently working away at their various functions.
      Of course given the actual author the British/European meaning makes a lot more sense.
  - bravogamma 1371 days ago
    https://www.folklore.org/StoryView.py?story=Do_It.txt
- lyso 1371 days ago
  https://www.folklore.org/StoryView.py?story=Do_It.txt
  [-]
  - bestouff 1370 days ago
    Nice read, thanks !
- moffkalast 1371 days ago
  Dew it.
  Good, good. It was too dangerous to not keep a backup.
- amelius 1371 days ago
  At first I thought it was d01t
- evanb 1371 days ago
  I’d have gone with digit.
pgt 1371 days ago
XTDB is a general-purpose bitemporal database with graph query capabilities: https://xtdb.com/
[-]
- infogulch 1370 days ago
  There was a short talk about xtdb in the recent 'rubbing a database on it' conference: https://www.hytradboi.com/2022/baking-in-time-at-the-bottom-...
  The talk expands on what "bitemporal" means in this context, namely, separate "valid" and "audit" time dimensions. The discussion upthread is a good example: if you accidentally mark down inventory in the wrong location and you find out later, you should backdate the fix to when you did the count -- this is the 'valid' dimension, but the fact that you changed it today should also be recorded so it can be reviewed later -- this is the 'audit' dimension.
chirau 1371 days ago
They have a pretty interesting data bounty program as well.
https://www.dolthub.com/bounties
I participated in one a few months back. I had to stop midway inbetween because work was calling but it was a lot of fun. I know I earned some dollars, lol, but probably too few to count for the competition.
I am actually trying to convince my manager to put up a few bounties later this year.
Their Discord support is also super impressive.
anyfactor 1371 days ago
I use dolthub's public database to practice queries.
I gave their bounty program a shot but I quit when I was prompted to download/clone a 8gb database. So the best thing would be to have a dolthub account and develop the data from there.
I think if you do a little bit of git hacking you can create a branch with just an empty or a dummy database based on the schema, clone only that branch and make a PR to append to the original database. But I couldn't figure that out.
[-]
- ascar 1371 days ago
  Could you elaborate what's so bad about 8GB of test data? This is a database project after all and many things need a certain amount of data to show up.
  [-]
  - anyfactor 1371 days ago
    I tried downloading the data and my network connection glitched at 6gb, I was thrown an error and the processed exited. I believe I was recommended to restart the entire process again as there was no way continue the process where I left out. I didn’t try again.
    I think there could be a better design to clone/download the database on local machine. I understand people don't actually download databases to their local machine but I expected a better solution.
    [-]
    - zachmu 1371 days ago
      Definitely. We have plans to allow clone to be a shallow operation, so that the original data stays remote and only your changes are local. Will make queries slower obviously, but that's the right trade-off for some people.
dmd 1371 days ago
See also: https://www.datalad.org/
MontyCarloHall 1371 days ago
No mention of performance on their GitHub homepage. Anyone know how this compares to a standard MySQL distribution?
[-]
- NickHoff 1371 days ago
  In this [1] blog post, they say it's "between 2 and 20 times slower than MySQL (but committed to getting down to 4x slower)".
  [1]: https://www.dolthub.com/blog/2021-03-09-dolt-use-cases-in-th...
- jpnc 1371 days ago
  They say it's slower but are trying to close the gap [1]
  [1] - https://docs.dolthub.com/sql-reference/benchmarks/latency
magicalhippo 1370 days ago
Somewhat unrelated but...
```
    docs $ dolt conflicts cat docs
    +-----+--------+----+----+
    |     |        | pk | c1 |
    +-----+--------+----+----+
    |     | base   | 1  | 1  |
    |  *  | ours   | 1  | 10 |
    |  *  | theirs | 1  | 0  |
    +-----+--------+----+----+
```
For some reason my brain just can't seem to remember what's "ours" and what's "theirs" when resolving merge conflicts. Like which one represents what was in the working copy and which one represents the data I'm trying to merge. I'd much prefer absolute terms rather than those relative ones.
[-]
- zachmu 1370 days ago
  Yeah I'm kinda the same way. Ultimately we decided to copy git's terminology for this rather than making users learn new terms.
  (ours is the working copy, theirs is the thing you're merging in)
  [-]
  - magicalhippo 1370 days ago
    Yeah I totally get keeping the nomenclature given Dolt tries to be Git-like.
nibab 1369 days ago
Git-like functionality on top of existing RDBMS solutions would significantly reduce the overhead of managing ephemeral test environments for us. It would be awesome to see Dolt as a thin layer on top of MSSQL or Postgres that would allow my team to easily test changes before deploying to prod.
Today, this means creating a snapshot of the DB, creating a new DB instance, copying the data over, doing a deploy, testing changes, teardown and cleanup. Branching with copy-on-write (very similar to what neon.tech offers) on top of on-prem instances would be lifechanging for teams operating in legacy dbs.
[-]
- mildbyte 1369 days ago
  We might have what you're looking for with sgr [0] (I'm the co-founder) which works by running SQL commands on top of a PostgreSQL instance and uses PostgreSQL audit triggers for versioning, though note it currently only works on top of a custom PostgreSQL Docker image, since it requires some extensions.
  [0] https://www.splitgraph.com/docs/sgr-advanced/getting-started...
nailer 1371 days ago
I thought dat was git for data. But it looks like it's dead now.
Docs site is still up: https://docs.dat.foundation/
melvinroest 1371 days ago
Shoutout to Smalltalk [1] for using doit [2] (do it) to execute code! Name collisions happen, it's handy to know about them ;-)
[1] https://en.wikipedia.org/wiki/Smalltalk
[2] https://twitter.com/worrydream/status/1424952628207702022 <-- Brett Victor talking about doit
[-]
- allknowingfrog 1371 days ago
  I think you mistook the "l" for an "i". This is "dolt", not "doit".
  [-]
  - melvinroest 1371 days ago
    dolt
    DoIt
    Ah... whoops, my bad
throwaway787544 1371 days ago
Aw man, why did they have to saddle themselves with Oracle? :(
[-]
- timsehn 1371 days ago
  Creator here...
  Only need Oracle clients but we will eventually have a Postgres foreign data wrapper for the storage engine. I explain how we ended up MySQL in this blog post if you're curious.
  https://www.dolthub.com/blog/2022-03-28-have-postgres-want-d...
henrydark 1371 days ago
I get "git for data" by putting iceberg snapshot files _in_ git. That's it. It's literally git for data.
mriet 1371 days ago
Dolt is git for databases.
Nessie is git for data.
gniap 1371 days ago
It's quite similar to https://qri.io functionality, but unfortunately Qri stopped providing their Cloud service(https://qri.io/winding_down).
amelius 1371 days ago
The nice thing about Git is that it is batteries-included.
I don't want to deal with databases and everything that comes with it. Just give me something that works with the filesystem directly.
arealaccount 1371 days ago
I’ve been looking for something like this where you can bring your own DB. The ability to version schema and metadata is much more interesting over a saas DB host.
bjarneh 1371 days ago
This looks super handy, probably worth the performance penalty in most cases to make every command undo-able (except drop database).
[-]
- timsehn 1371 days ago
  We're trying to close the performance gap. Current performance on sysbench can be found here:
  https://docs.dolthub.com/sql-reference/benchmarks/latency
  We're even slower on heavy transactional use cases.
  We're dedicating the year to it and we think we can get under 2X MySQL. We have customers who switched from other No-SQL DBs and actually got faster so getting close to MySQL is a great accomplishment.
iRomain 1371 days ago
Also in the same vein, check out https://lakefs.io/
wonderwonder 1371 days ago
very cool. Would love to know the size of the team that built this. Like learning about people that create interesting new products.
[-]
- zachmu 1371 days ago
  https://www.dolthub.com/team
  [-]
  - wonderwonder 1371 days ago
    lol, that would probably be the best place for me to have looked. Appreciate it!
memorable 1371 days ago
Seems interesting! Will keep an eye on this.
layer8 1371 days ago
Only MySQL-compatible, apparently?
[-]
- timsehn 1371 days ago
  Creator here.
  You only connect with a MySQL-client. There is no MySQL code in it. Here's the story if you're curious.
  https://www.dolthub.com/blog/2022-03-28-have-postgres-want-d...
  We will eventually build a Postgres Foreign Data Wrapper.
- sitkack 1371 days ago
  https://github.com/dolthub/go-mysql-server 170kloc of pure Go
Hakeemmidan 1371 days ago
this looks really fun
smm11 1371 days ago
Data? rsync
alex-treebeard 1371 days ago
great name
[-]
- rwoerz 1371 days ago
  I misread it as "do it" and wonder if that was intended too.
  [-]
  - obliquely 1371 days ago
    DOLT vs DO-IT: https://www.folklore.org/StoryView.py?story=Do_It.txt