OpenZFS deduplication is good now and you shouldn't use it

(despairlabs.com)

454 points | by type0 11 days ago

33 comments

  • Wowfunhappy 11 days ago
    I want "offline" dedupe, or "lazy" dedupe that doesn't require the pool to be fully offline, but doesn't happen immediately.

    Because:

    > When dedup is enabled [...] every single write and free operation requires a lookup and a then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.

    To me, this is "obviously" the wrong approach in most cases. When I'm writing data, I want that write to complete as fast as possible, even at the cost of disk space. That's why I don't save files I'm actively working on in 7zip archives.

    But later on, when the system is quiet, I would love for ZFS to go back and figure out which data is duplicated, and use the BRT or whatever to reclaim space. This could be part of a normal scrub operation.

    • cryptonector 11 days ago
      Lazy/off-line dedup requires block pointer rewrite, but ZFS _cannot_ and will not ever get true BP rewrite because ZFS is not truly a CAS system. The problem is that physical locations are hashed into the Merkle hash tree, and that makes moving physical locations prohibitively expensive as you have to rewrite all the interior nodes on the way to the nodes you want to rewrite.

      A better design would have been to split every node that has block pointers into two sections, one that has only logical block pointers and all of whose contents gets hashed into the tree, and one that has only the physical locations (as if it were a cache) of the corresponding logical block pointers in the first section, with the second section _not_ hashed into the Merkle hash tree. Then BP rewrite would only require re-writing blocks that are not part of the Merkle hash tree.

      But as it is you can't get BP rewrite to work on ZFS, so you can't get what you're asking for.

      Well... maybe. Perhaps on read hash mismatch ZFS could attempt to locate the pointed-to block in the dedup table using the hash from the pointer. Then ZFS could reallocate the dedup'ed block. The price you'd pay then is one pointless read -- not too bad. The impossibility of BP rewrite generally leads to band-aids like this.

      • Pet_Ant 10 days ago
        • cryptonector 10 days ago
          Sorry, yes, CAS really means that pointers are hash values -- maybe with extra metadata, yes, but _not_ including physical locations. The point is that you need some other way to map logical pointers to physical locations. The easiest way to do that is to store the mappings nearby to the references so that they are easy to find, but the mappings must be left out of the Merkle hash tree in order to make it possible to change the physical locations of the referenced blocks.
    • EvanAnderson 11 days ago
      > I just wish we had "offline" dedupe, or even "lazy" dedupe...

      This is the Windows dedupe methodology. I've used it pretty extensively and I'm generally happy with it when the underlying hardware is sufficient. It's very RAM and I/O hungry but you can schedule and throttle the "groveler".

      I have had some data-eating corruption from bugs in the Windows 2012 R2 timeframe.

    • DannyBee 11 days ago
      You can use any of the offline dupe finders to do this.

      Like jdupes or duperemove.

      I sent PR's to both the ZFS folks and the duperemove folks to support the syscalls needed.

      I actually have to go follow up on the ZFS one; it took a while to review and I realized I completely forgot to finish it up.

    • Dylan16807 11 days ago
      The ability to alter existing snapshots, even in ways that fully preserve the data, is extremely limited in ZFS. So yes that would be great, but if I was holding my breath for Block Pointer Rewrite I'd be long dead.
      • Wowfunhappy 11 days ago
        You need block pointer rewrite for this?
        • Dylan16807 11 days ago
          You don't need it to dedup writable files. But redundant copies in snapshots are stuck there as far as I'm aware. So if you search for duplicates every once in a while, you're not going to reap the space savings until your snapshots fully rotate.
          • Wowfunhappy 10 days ago
            Thanks. I do think dedupe for non-snapshots would still be useful, since as you say most people will get rid of old snapshots eventually.

            I also wonder if it would make sense for ZFS to always automatically dedupe before taking a snapshot. But you'd have to make this behavior configurable since it would turn snapshotting from a quick operation into an expensive one.

          • lazide 11 days ago
            The issue with this, in my experience, is that at some point that pro (exactly, and literally, only one copy of a specific bit of data despite many apparent copies) can become a con if there is some data corruption somewhere.

            Sometimes it can be a similar issue performance-wise in some edge cases, but usually caching can address those problems.

            Efficiency being the enemy of reliability, sometimes.

            • Dylan16807 10 days ago
              Redundant copies on a single volume are a waste of resources. Spend less on size, spend more on an extra parity drive, or another backup of your most important files. That way you get more safety per gigabyte.
              • lazide 10 days ago
                Notably, having to duplicate all data x2 (or more) is more of a waste than having 2 copies of a few files - if full drive failure is not the expected failure mode, and not all files should be protected this heavily.

                It’s why metadata gets duplicated in ZFS the way it does on all volumes.

                Having seen this play out a bunch of times, it isn’t an uncommon need either.

                • Dylan16807 10 days ago
                  > having to duplicate all data x2

                  Well I didn't suggest that. I said important files only for the extra backup, and I was talking about reallocating resources not getting new ones.

                  The simplest version is the scenario where turning on dedup means you need one less drive of space. Convert that drive to parity and you'll be better off. Split that drive from the pool and use it to backup the most important files and you'll be better off.

                  If you can't save much space with dedup then don't bother.

                  • lazide 10 days ago
                    There was an implication in your statement that volume level was the level of granularity, yeah?

                    I'm noting that turning on volume-wide dedup can have the con that you can't choose (but it looks like you can!) to manually duplicate data.

                    • Dylan16807 10 days ago
                      Note: I assume volume means pool?

                      > There was an implication in your statement that volume level was the level of granularity, yeah?

                      There was an implication that the volume level was the level of granularity for adding parity.

                      But that was not the implication for "another backup of your most important files".

                      > I'm noting that turning on volume-wide dedup can have the con that you can't choose (but it looks like you can!) to manually duplicate data.

                      You can't choose just by copying files around, but it's pretty easy to set copies=2 on specific datasets. And I'd say that's generally a better option, because it keeps your copies up to date at all times. Just make sure snapshots are happening, and files in there will be very safe.
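
                      For example, something like this (dataset name hypothetical; note that copies only applies to data written after the property is set):

                        zfs set copies=2 tank/important
                        zfs get copies tank/important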

                      Manual duplication is the worst kind of duplication, so while it's good to warn people that it won't work with dedup on, actually losing the ability is not a big deal when you look at the variety of alternatives. It only tips the balance in situations where dedup is near-useless to start with.

    • UltraSane 11 days ago
      The neat thing about inline dedupe is that if the block hash already exists then the block doesn't have to be written. This can save a LOT of write IO in many situations. There are even extensions where a file copy between two VMs on a dedupe storage array will not actually copy any data but just increment the original blocks' reference counts. You will see absurd TB/s write speeds in the OS; it is pretty cool.
      • aidenn0 11 days ago
        This is only a win if the dedupe table fits in RAM; otherwise you pay for it in a LOT of read IO. I have a storage array where dedupe would give me about a 2.2x reduction in disk usage, but there isn't nearly enough RAM for it.
        • UltraSane 11 days ago
          yes inline dedupe has to fit in RAM. Perhaps enterprise storage arrays have spoiled me.
          • aidenn0 11 days ago
            This array is a bit long-in-the-tooth and only has 192GB of RAM, but a bit over 40TB of net storage, which would be a 200GB dedup table size using the back-of-the-envelope estimate of 5GB/TB.

            A more precise calculation on my actual data shows that today's data would allow the dedup table to fit in RAM, but if I ever want to actually use most of the 40TB of storage, I'd need more RAM. I've had a ZFS system swap dedup to disk before, and the performance dropped to approximately zero; fixing it was a PITA, so I'm not doing that anytime soon.
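
            (For anyone wanting to make the same estimate: `zdb -S <pool>` simulates dedup against the existing data and prints a projected DDT histogram and ratio without enabling anything; the pool name below is made up.)

              zdb -S tank    # read-only: simulated dedup table histogram plus projected ratio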

            • barrkel 10 days ago
              Be aware that ZFS performance rapidly drops off north of 80% utilization; when you head into 90%, you will want to buy a bigger array just to escape the pain.
              • gnu8 10 days ago
                I think that is well known amongst storage experts, though maybe not everyone who might be interested in using ZFS for storage in a professional or personal application. What I’m curious about is how ZFS’s full-disk performance (what is the best term for this?) compares to btrfs, WAFL, and so on. Is ZFS abnormally sensitive to this condition, or is it a normal property?

                In any case it doesn’t stick out to me as a problem that needs to be fixed. You can’t fill a propane tank to 100% either.

              • aidenn0 10 days ago
                ZFS has gotten significantly better at 80%, but 90% is painful enough that I almost wish it would reserve 10% a bit more explicitly (maybe like the old Unix systems that would prevent non-root users from using the last 5% of the root partition).
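
                One way to approximate that explicit reserve is to park a reservation on an empty dataset, sized at roughly 10% of the pool (names and size hypothetical):

                  zfs create -o reservation=2T tank/slack   # everything else hits ENOSPC before the pool truly fills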

                All my arrays send me nightly e-mails at 80% so I'm aware when I hit that point, but on a desktop system that's typically not the case.

    • magicalhippo 11 days ago
      The author of the new file-based block cloning code had this in mind. A background process would scan files and identify dupes, delete them, and replace them with cloned versions.

      There are of course edge cases to consider to avoid data loss, but I imagine it might come soon, either officially or as a third-party tool.

    • hinkley 11 days ago
      I get the feeling that a hypothetical ZFS maintainer reading some literature on concurrent mark and sweep would find it... inspirational, if not immediately helpful.

      You should be able to detect duplicates online. Low priority sweeping is something else. But you can at least reduce pause times.

      • p_l 10 days ago
        They were aware. The reasons it works the way it does come down to higher-priority decisions regarding reliability in the face of hardware or software corruption.

        That said, you can still do a two-space GC, but it's slow and possibly wasteful.

    • LeoPanthera 11 days ago
      btrfs has this. You can deduplicate a filesystem after the fact, as an overnight cron job or whatever. I really wish ZFS could do this.
      • Sakos 10 days ago
        This is my favourite hack for the Steam Deck. By switching my SD cards and the internal SSD to btrfs, the space savings are unreal (easily halving used space). Every game gets its own prefix which means a crazy amount of file duplication.
        • kccqzy 10 days ago
          Which tool did you end up using for btrfs? I tried out bees https://github.com/Zygo/bees but it is way too slow.
          • Dylan16807 10 days ago
            How established was the drive when you set up bees? In my experience you really want to do the initial pass without snapshots existing, but after that it's pretty smooth.
            • kccqzy 9 days ago
              Quite established. Although having read about snapshots causing slowness, I actually deleted all my snapshots and didn't notice any improvement.
      • DannyBee 11 days ago
        I sent a PR to add support for the necessary syscall (FIDEDUPERANGE) to ZFS that I just have to clean up again.

        Once that is in, any of the existing dupe-finding tools that use it (i.e. jdupes, duperemove) will just work on ZFS.
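
        At that point it's the usual invocations, e.g. (paths hypothetical):

          duperemove -dr /tank/data   # hash files, then issue the dedupe ioctl for matching extents
          jdupes -rB /tank/data       # -B asks the filesystem to dedupe matched files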

        • edelbitter 10 days ago
          Knowing what you had to know to write that, would you dare using it?

          Compression, encryption and streaming sparse files together are impressive already. But now we get a new BRT entry appearing out of nowhere, dedup index pruning one that was there a moment ago, all while correctly handling arbitrary errors in whatever simultaneous deduped writes, O_DIRECT writes, FALLOC_FL_PUNCH_HOLE and reads were waiting for the same range? Sounds like adding six new places to hold the wrong lock to me.

          • DannyBee 10 days ago
            "Knowing what you had to know to write that, would you dare using it?"

            It's no worse than anything else related to block cloning :)

            ZFS already supports FICLONERANGE; the thing FIDEDUPERANGE changes is that the compare is part of the atomic guarantee.

            So in fact, I'd argue it's actually better than what is there now - yes, the hardest part is the locking, but the locking is handled by the dedup range call getting the right locks upfront and passing them along, so nothing else is grabbing the wrong locks. It actually has to because of the requirements to implement the ioctl properly. We have to be able to read both ranges, compare them, and clone them, all as an atomic operation with respect to concurrent writes. So instead of random things grabbing random locks, we pass the right locks around and everything verifies the locks.

            This means FIDEDUPERANGE is not as fast as it maybe could be, but it does not run into the "oops we forgot the right kind of lock" issue. At worst, it would deadlock, because it's holding exclusive locks on everything it could need before it starts to do anything, in order to guarantee both the compare and the clone are atomic. So something trying to grab a lock forever under it will just deadlock.

            This seemed the safest course of implementation.

            FICLONERANGE is only atomic in the cloning, which means it does not have to read anything first; it can just do blind block cloning. So it actually has a more complex (but theoretically faster) lock structure because of the relaxed constraints.

        • DannyBee 10 days ago
          Note - anyone bored enough could already make any of these tools work by using FICLONERANGE (which ZFS already supports), but you'd have to do locking - lock, compare file ranges, clone, unlock.

          Because FIDEDUPERANGE has the compare as part of the atomic guarantee, you don't need to lock in userspace around using it, and so no dedup utility bothers to do FICLONERANGE + locking. Also, ZFS is the only FS that implements FICLONERANGE but not FIDEDUPERANGE :)

        • rattt 10 days ago
          Shouldn't jdupes-like tools already work now that ZFS has reflink copy support?
          • DannyBee 10 days ago
            No, because none of these tools use copy_file_range. Because copy_file_range doesn't guarantee deduplication or anything. It is meant to copy data. So you could just end up copying data, when you aren't even trying to copy anything at all.

            All modern tools use FIDEDUPERANGE, which is an ioctl meant for exactly this use case - telling the FS that two files have bytes that should be shared.

            Under the covers, the FS does block cloning or whatever to make it happen.

            Nothing is copied.

            ZFS does support FICLONERANGE, which is the same as FIDEDUPERANGE except that it does not verify the contents are the same prior to cloning.

            Both are atomic with respect to concurrent writes, but for FIDEDUPERANGE that means the compare is part of the atomicity. So you don't have to do any locking.

            If you used FICLONERANGE, you'd need to lock the two file ranges, verify, clone, unlock.

            FIDEDUPERANGE does this for you.

            So it is possible, with no changes to ZFS, to modify dedup tools to work on ZFS by changing them to use FICLONERANGE + locking if FIDEDUPERANGE does not exist.
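
            You can poke at the difference from the command line with xfs_io, which can issue both ioctls directly (file names and length hypothetical; per the above, only the reflink one works on ZFS today):

              xfs_io -c "reflink a.bin 0 0 1048576" b.bin   # FICLONERANGE: blind clone, no content check
              xfs_io -c "dedupe a.bin 0 0 1048576" b.bin    # FIDEDUPERANGE: shares blocks only if the ranges already match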

        • Wowfunhappy 10 days ago
          Oh cool! Does this work on the block level or only the file level?
    • tiagod 11 days ago
      I run rdfind[1] as a cronjob to replace duplicates with hardlinks. Works fine!

      https://github.com/pauldreik/rdfind
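
      The cron job is just something like this (path hypothetical; -dryrun true first shows what it would do):

        rdfind -makehardlinks true /tank/media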

      • AndrewDavis 11 days ago
        So this is great if you're just looking to deduplicate read-only files. Less so if you intend to write to them. Write to one and they're both updated.

        Anyway. Offline/lazy dedup (not in the ZFS dedup sense) is something that could be done in userspace, at the file level, on any filesystem that supports reflinks. When a tool like rdfind finds a duplicate, instead of replacing with a hardlink, create a copy of the file with `copy_file_range(2)` and let the filesystem create a reflink to it. Now you've got space savings and they're two separate files, so if one is written to, the other remains the same.
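
        A sketch of that replace step for one duplicate (names hypothetical), so each file keeps its own identity but shares blocks until one of them is written:

          cp --reflink=always keeper.bin dupe.bin.tmp && mv dupe.bin.tmp dupe.bin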

        • spockz 11 days ago
          How would this work if I have snapshots? Wouldn't then the version of the file I just replaced still be in use there? But maybe I also need to store the copy again if I make another snapshot because the "original" file isn't part of the snapshot? So now I'm effectively storing more, not less?
          • magicalhippo 10 days ago
            AFAIK, yes. Blocks are reference counted, so if the duplicate file is in a snapshot then the blocks would be referenced by the snapshot and hence not be eligible for deallocation. Only once the reference count falls to zero would the block be freed.

            This is par for the course with ZFS though. If you delete a non-duplicated file you don't get the space back until any snapshots referencing the file are deleted.

            • spockz 9 days ago
              Yes, I know that snapshots incur a cost. But I'm wondering whether the act of deduplicating has now actually created an extra copy instead of saving one.
              • magicalhippo 9 days ago
                I don't fully understand the scenario you mentioned. Could you perhaps explain in a bit more detail?
        • DannyBee 10 days ago
          copy_file_range already works on zfs, but it doesn't guarantee anything interesting.

          Basically all modern dupe tools use FIDEDUPERANGE, which is meant to tell the FS which things should be sharing data, and let it take care of the rest. (BTRFS, bcachefs, etc. support this ioctl, and ZFS will soon too.)

          Unlike copy_file_range, it is meant for exactly this use case, and will tell you how many bytes were dedup'd, etc.

      • Wowfunhappy 11 days ago
        But then you have to be careful not to remove the one which happens to be the "original" or the hardlinks will break, right?
        • Dylan16807 11 days ago
          No, pointing to an original is how soft links work.

          Hard links are all equivalent. A file has any number of hard links, and at least in theory you can't distinguish between them.

          The risk with hardlinks is that you might alter the file. Reflinks remove that risk, and also perform very well.

          • Wowfunhappy 10 days ago
            Thank you, I was unaware of this.

            However, the fact that editing one copy edits all of them still makes this a non-solution for me at least. I'd also strongly prefer deduping at the block level vs file level.

            • mdaniel 10 days ago
              I would suspect a call to $(chmod a-w) would fix that, or at least serve as a very fine reminder that there's something special about them
      • sureglymop 10 days ago
        Quite cool, though it doesn't save as much storage as deduplicating at the block level (e.g. N-byte blocks).
    • nixdev 7 days ago
      You can already do offline/lazy dedupe.

        zfs set mountpoint=/mnt/foo foopy/foo
        zfs set dedup=off  foopy/foo

        zfs set mountpoint=/mnt/baz foopy/baz
        zfs set dedup=on   foopy/baz
      
      Save all your stuff in /mnt/foo, then when you want to dedup do

        mv /mnt/foo/bar /mnt/baz/
      
      
      Yeah... this feels like picrel, and it is

        https://i.pinimg.com/originals/cb/09/16/cb091697350736aae53afe4b548b9d43.jpg
      
      but it's here and now and you can do it now.
  • UltraSane 11 days ago
    "And this is the fundamental issue with traditional dedup: these overheads are so outrageous that you are unlikely to ever get them back except on rare and specific workloads."

    This struck me as a very odd claim. I've worked with Pure and Dell/EMC arrays and for VMWare workloads they normally got at least 3:1 dedupe/compression savings. Only storing one copy of the base VM image works extremely well. Dedupe/compression works really well on syslog servers where I've seen 6:1 savings.

    The effectiveness of dedupe is strongly affected by the size of the blocks being hashed: the smaller, the better. As the blocks get smaller, the odds of having a matching block grow rapidly. In my experience 4KB is my preferred block size.

    • abrookewood 11 days ago
      A couple of comments. Firstly, you are talking about highly redundant information when referencing VM images (e.g. the C drive on all Windows Server images will be virtually identical), whereas he was using his own laptop contents as an example.

      Secondly, I think you are conflating two different features: compression & de-duplication. In ZFS you can have compression turned on (almost always worth it) for a pool, but still have de-duplication disabled.

      • UltraSane 11 days ago
        Fair point. My experience is with enterprise storage arrays and I have always used dedupe/compression at the same time. Dedupe is going to be a lot less useful on single computers.

        I consider dedupe/compression to be two different forms of the same thing: compression reduces short-range duplication while deduplication reduces long-range duplication of data.

        • abrookewood 11 days ago
          Yeah agreed, very closely related - even more so on ZFS where the compression (AFAIK) is on a block level rather than a file level.
          • E39M5S62 11 days ago
            ZFS compression is for sure at the block level - it's fully transparent to the userland tools.
            • lazide 11 days ago
              It could be at a file level and still transparent to user land tools, FYI. Depending on what you mean by ‘file level’, I guess.
              • UltraSane 10 days ago
                Windows NTFS has transparent file level compression that works quite well.
                • Dylan16807 10 days ago
                  I don't know how much I agree with that.

                  The old kind of NTFS compression from 1993 is completely transparent, but it uses a weak algorithm and processes each 64KB of file completely independently. It also fragments files to hell and back.

                  The new kind from Windows 10 has a better algorithm and can have up to 2MB of context, which is quite reasonable. But it's not transparent to writes, only to reads. You have to manually apply it and if anything writes to the file it decompresses.

                  I've gotten okay use out of both in certain directories, with the latter being better despite the annoyances, but I think they both have a lot of missed potential compared to how ZFS and BTRFS handle compression.

                  • UltraSane 9 days ago
                    I'm talking about the "Compress contents to save disk space" option in the Advanced Attributes. It makes the file blue. I enable it on all .txt log files because it is so effective and completely transparent. It compresses a 25MB Google Drive log file to 8MB
                    • Dylan16807 9 days ago
                      That's the old kind.

                      It's useful, but if they updated it it could get significantly better ratios and have less impact on performance.

    • phil21 11 days ago
      Base VM images would be a rare and specific workload. One of the few cases where dedupe makes sense. However, you are likely using better strategies like block or filesystem cloning if you are doing VM hosting off a ZFS filesystem. Not doing so would be throwing away one of its primary differentiators as a filesystem in such an environment.

      General purpose fileserving or personal desktop/laptop use generally has very few duplicated blocks and is not worth the overhead. Backups are hit or miss depending on both how the backups are implemented, and if they are encrypted prior to the filesystem level.

      Compression is a totally different thing and current ZFS best practice is to enable it by default for pretty much every workload - the CPU used is barely worth mentioning these days, and the I/O savings can be considerable even ignoring any storage space savings. Log storage is likely going to see a lot better than 6:1 savings if you have typical logging, at least in my experience.
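
      In practice that's just (dataset name hypothetical; zstd needs a reasonably recent OpenZFS, otherwise lz4):

        zfs set compression=zstd tank
        zfs get compressratio tank    # see what it's actually saving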

      • XorNot 10 days ago
        > General purpose fileserving or personal desktop/laptop use generally has very few duplicated blocks and is not worth the overhead.

        I would contest this; it's because we don't have good transparent deduplication right now - just some bad compromises. Hardlinks? Edit anything and it gets edited everywhere - not what you want. Symlinks? They look different enough that programs treat them differently.

        I would argue your regular desktop user actually has an enormous demand for a good deduplicating file system - there's no end of use cases where the first step is "make a separate copy of all the relevant files just in case" and a lot of the time we don't do it because it's just too slow and wasteful of disk space.

        If you're working with, say, large video files, then a good dedupe system would make copies basically instant, and then have a decent enough split algorithm that edits/cuts/etc. of the type people try to do losslessly or with editing programs are stored efficiently without special effort. How many people are producing video content today? Thanks to TikTok we've dragged that skill right down to "random teenagers" who might hopefully pipeline into working with larger content.

        • armada651 10 days ago
          But according to the article the regular desktop already has such a dedup system:

          > If you put all this together, you end up in a place where so long as the client program (like /bin/cp) can issue the right copy offload call, and all the layers in between can translate it (eg the Window application does FSCTL_SRV_COPYCHUNK, which Samba converts to copy_file_range() and ships down to OpenZFS). And again, because there’s that clear and unambiguous signal that the data already exists and also it’s right there, OpenZFS can just bump the refcount in the BRT.

    • wongarsu 11 days ago
      I haven't tried it myself, but the widely quoted number for old ZFS dedup is that you need 5GB of RAM for every 1TB of disk space. Considering that 1 TB of disk space currently costs about $15 and 5GB of server RAM about $25, you need a 3:1 dedupe ratio just to break even.
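
      Rough arithmetic behind that break-even, using those same prices:

        1 TB of plain pool:   $15 of disk
        1 TB of deduped pool: $15 disk + 5 GB RAM (~$25) = $40
        break-even ratio:     $40 / $15 ≈ 2.7, i.e. roughly 3:1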

      If your data is a good fit you might get away with 1GB per TB, but if you are out of luck the 5GB might not even be enough. That's why the article speaks of ZFS dedup having a small sweet spot that your data has to hit, and why most people don't bother.

      Other file systems tend to prefer offline dedupe, which has more favorable economics.

      • floating-io 11 days ago
        That doesn't account for OpEx, though, such as power...
        • wongarsu 11 days ago
          Assuming something reasonable like 20TB Toshiba MG10 HDDs and 64GB DDR4 ECC RAM, quick googling suggests that 1TB of disk space uses about 0.2-0.4W of power (0.2 in idle, 0.4 while writing), 5GB of RAM about 0.3-0.5W. So your break even on power is a bit earlier depending on the access pattern, but in the same ball park.
          • UltraSane 11 days ago
            What about rack space?
            • spockz 11 days ago
              Not just rack space. At a certain number of disks you also need to get a separate server (chassis + main board + CPU + RAM) to host the disks. Maybe you need that for performance reasons anyway. But saving disk space and only paying for it with some RAM sounds cost-effective.
              • janc_ 9 days ago
                That works out only as long as you don’t have to replace the whole machine (motherboard & possibly CPU/CPUs) to be able to add more RAM… So essentially the same problem as with disks.

                In the end it all comes down to: there are a whole lot of trade-offs that you have to take into account, and which ceilings you hit first depends entirely on everyone’s specific situation.

      • UltraSane 11 days ago
        Why does it need so much RAM? It should only need to store the block hashes which should not need anywhere near that much RAM. Inline dedupe is pretty much standard on high-end storage arrays nowadays.
        • AndrewDavis 11 days ago
          The linked blog post covers this, and the improvements made to make the new dedup better.
        • remexre 11 days ago

              (5GiB / 1TiB) * 4KiB to bits
          
                ((5 gibibytes) / (1 tebibyte)) × (4 kibibytes) = 160 bits
    • wmf 11 days ago
      VMs are known to benefit from dedupe so yes, you'll see benefits there. ZFS is a general-purpose filesystem not just an enterprise SAN so many ZFS users aren't running VMs.

      > Dedupe/compression works really well on syslog

      I apologize for the pedantry but dedupe and compression aren't the same thing (although they tend to be bundled in the enterprise storage world). Logs are probably benefiting from compression not dedupe and ZFS had compression all along.

      • tw04 11 days ago
        They are not the same thing, but when you boil it down to the raw math, while they aren't identical twins, they're absolutely fraternal twins.

        Both are trying to eliminate repeating data, it's just the frame of reference that changes. Compression in this context is operating on a given block or handful of blocks. Deduplication is operating on the entire "volume" of data. "Volume" having a different meaning depending on the filesystem/storage array in question.

        • UltraSane 11 days ago
          Well put. I like to say compression is just short-range dedupe. Hash-based dedupe wouldn't be needed if you could just do real-time LZMA on all of the data on a storage array, but that just isn't feasible and hash-based dedupe is a very effective compromise.
        • ShroudedNight 11 days ago
          Is "paternal twins" a linguistic borrowing of some sort? It seems a relatively novel form of what I've mostly seen referred to as monozygotic / 'identical' twins. Searching for some kind of semi-canonical confirmation of its widespread use turns up one, maybe two articles where it's treated as an orthodox term, and at least an equal number of discussions admonishing its use.
          • spockz 11 days ago
            If anything I would expect the term “maternal” twin to be used as whether or not a twin is monozygotic or “identical” depends on the amount of eggs from the mother.
        • xxs 10 days ago
          compression tends NOT to use a global dictionary. So to me they are vastly different even if they have the same goal of reducing the output size.

          Compression with a global dict would likely do better than dedup, yet it would have a lot of other issues.

      • ants_everywhere 11 days ago
        If we're being pedants, then storing the same information in fewer bits than the input is by definition a form of compression, no?

        (Although yes I understand that file-level compression with a standard algorithm is a different thing than dedup)

    • Maakuth 10 days ago
      Certainly it makes sense to not have deep copies of VM base images, but deduplication is not the right way to do it in ZFS. Instead, you can clone the base image and, until changes are made, it will take almost no space at all. This is thanks to the copy-on-write nature of ZFS.

      ZFS deduplication instead tries to find existing copies of data that is being written to the volume. For some use cases it could make a lot of sense (container image storage maybe?), but it's very inefficient if you already know some datasets to be clones of the others, at least initially.

      • UltraSane 10 days ago
        When a new VM is created from a template on a ZFS file system with dedupe enabled, what actually happens? Isn't the ref count of every block of the template simply incremented by one? The only time new data will actually be stored is when a block has a hash that doesn't already exist.
        • Maakuth 9 days ago
          That's right, though the deduplication feature is not the way to do it. The VM template would be a zvol, which is a block device backed by the lower levels of ZFS, and it would be cloned to a new zvol for each VM. Alternatively, if image files were used, the image file could be a reflinked copy. In both cases, new data would be stored only when changes accumulate.

          Compare this to the deduplication approach: the filesystem would need to keep tabs on the data that's already on disk, identify the case where the same data is being written and then make that a reference to the existing data instead. Very inefficient if on application level you already know that it is just a copy being made.

          In both of these cases, you could say that the data ends up being deduplicated. But the second approach is what the deduplication feature does. The first one is "just" copy-on-write.

    • jorvi 11 days ago
      > In my experience 4KB is my preferred block size

      That makes sense considering Advanced Format hard drives already have a 4K physical sector size, and if you properly low-level format them (to get rid of the ridiculous Windows XP compatibility) they also have a 4K logical sector size. I imagine there might be some real performance benefits to having all of those match up.

      • UltraSane 11 days ago
        In the early days of VMware, people had a lot of VMs that were converted from physical machines, and this caused a nasty alignment issue between the VMDK blocks and the blocks on your storage array. The effect was to always add one block to every read operation; in the worst case, reading one block would double the load on the storage array. On NetApp this could only be fixed when the VM wasn't running.
    • Joe_Cool 11 days ago
      Even with the rudimentary Dedup features of NTFS on a Windows Hyper-V Server all running the same base image I can overprovision the 512GB partition to almost 2 GB.

      You need to be careful and do staggered updates in the VMs or it'll spectacularly explode but it's possible and quite performant for less than mission critical VMs.

      • tw04 11 days ago
        I think you mean 2TB volume? But yes, this works. But also: if you're doing anything production, I'd strongly recommend doing deduplication on the back-end storage array, not at the NTFS layer. It'll be more performant and almost assuredly have better space savings.
        • Joe_Cool 10 days ago
          For sure it's not for production. At least not for stuff that's critical. MS also doesn't recommend using it for live VHDX.

          The partition/NTFS volume is 512GB. It currently stores 1.3 TB of "dedupped" data and has about 200GB free. Dedup runs asynchronously in the background and as a job during off hours.

          It's a typo, yes. Thanks.

    • m463 11 days ago
      I would think VMs qualify as a specific workload, since cloning is almost a given.
    • EasyMark 11 days ago
      I figured he was mostly talking about using dedup on your work (dev machine) computer or family computer at home, not on something like a cloud or streaming server or other back end type operations.
    • bobmcnamara 11 days ago
      > In my experience 4KB is my preferred block size.

      This probably has something to do with the VM's filesystem block size. If you have a 4KB filesystem and an 8KB file, the file might be fragmented differently but is still the same 2x4KB blocks just in different places.

      Now I wonder if filesystems zero the slack space at the end of the last block in a file in hopes of better host compression, vs. leaving whatever stale bytes were there before.

    • mrgaro 10 days ago
      For text-based logs I'm almost entirely sure that just using compression is more than enough. ZFS supports compression natively at the block level and it's almost always turned on. Trying to use dedup alongside compression for syslog most likely will not yield any benefits.
    • acdha 11 days ago
      > Dedupe/compression works really well on syslog servers where I've seen 6:1 savings.

      Don’t you compress these directly? I normally see at least twice that for logs doing it at the process level.

      • pezezin 11 days ago
        Yes, that ratio is very small.

        I built a very simple, custom syslog solution, a syslog-ng server writing directly to a TimescaleDB hypertable (https://www.timescale.com/) that is then presented as a Grafana dashboard, and I am getting a 30x compression ratio.

        • pdimitar 10 days ago
          Would love to see your solution -- if it's open source.
          • pezezin 9 days ago
            Sure, no problem. But after you asked me, I realized that the system was not properly documented anywhere, I didn't even have a repo with the configuration files, what an embarrassment :(

            I just created the repo and uploaded the documentation, please give me some more time to write the documentation: https://github.com/josefrcm/simple-syslog-service

      • UltraSane 11 days ago
        What software?
        • acdha 11 days ago
          Logrotate, cron, or simply having something like Varnish or Apache log to a pipe into something like bzip2 or zstd. The main question is whether you want to easily access the current stream - e.g. I had uncompressed logs being forwarded to CloudWatch so I had daemons logging to timestamped files with a post-rotate compression command which would run after the last write.
          • UltraSane 11 days ago
            That is one wrinkle of using storage-based dedupe/compression: you need to avoid doing compression on the client so you aren't compressing already-compressed data. When a company I worked at first got their Pure array they were using Windows file compression heavily and had to disable it, as the storage array was now doing it automatically.
            • acdha 11 days ago
              Definitely. We love building abstraction layers but at some point you really need to make decisions across the entire stack.
        • chasil 11 days ago
          Logrotate is the RHEL utility, likely present in Fedora, that is easily adapted for custom log handling. I still have RHEL 5 and I use it there.

          CentOS made it famous. I don't know if it has a foothold in the Debian family.

          • E39M5S62 11 days ago
            logrotate is used on Debian and plenty of other distros. It seems pretty widely used, though maybe not as much so now that things log through systemd.
        • SteveNuts 11 days ago
          Logrotate
    • jyounker 10 days ago
      TL;DR: Declares the claim that "that feature is only good for specific rare workloads" to be odd. Justifies that statement by pointing out that the feature works well for their specific rare workload.
  • simonjgreen 10 days ago
    We used to make extensive use of, and gained huge benefit from, dedup in ZFS. The specific use case was storage for VMWare clusters where we had hundreds of Linux and Windows VMs that were largely the same content. [this was pre-Docker]
    • aniviacat 10 days ago
      I've read multiple comments on using dedup for VMs here. Wouldn't it be a lot more efficient for this to be implemented by the hypervisor rather than the filesystem?
      • UltraSane 10 days ago
        I'm a former VMware certified admin. How do you envision this to work? All the data written to the VM's virtual disk will cause blocks to change and the storage array is the best place to keep track of that.
        • wang_li 10 days ago
          You do it at the file system layer. Clone the template which creates only metadata referencing the original blocks then you perform copy-on-write as needed.
          • SteveNuts 10 days ago
            VMware allows linked clones which you can do when deploying from template

            https://docs.vmware.com/en/VMware-Fusion/13/com.vmware.fusio...

          • UltraSane 10 days ago
            But that is exactly what the storage array is doing. What is the advantage?
            • anyfoo 10 days ago
              > When dedup is enabled [...] every single write and free operation requires a lookup and a then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.

              Linked clones shouldn’t need that. They likely start out with only references to the original blocks, and then replace them when they change. If so, it’s a different concept (as it would mean that any new duplicate blocks are not shared), but for the use case of “spin up a hundred identical VMs that only change comparably little” it sounds more efficient performance-wise, with a negligible loss in space efficiency.

              Am I certain of this? No, this is just what I quickly pieced together based on some assumptions (albeit reasonable ones). Happy to be told otherwise.

              • UltraSane 10 days ago
                Linked clones aren't used in ESXi; instant clones are, and they ARE pretty nifty and heavily used in VDI where you need to spin up many thousands of desktop VMs. But they have to keep track of what blocks change and so every clone has a delta disk. At the end of the day you are just moving around where this bookkeeping happens. And it is best for that to happen on an enterprise-grade array with ultra-optimized inline dedupe like a Pure array.

                https://www.yellow-bricks.com/2018/05/01/instant-clone-vsphe...

                • anyfoo 10 days ago
                  I’m not sure that’s true, because the hypervisor can know which blocks are related to begin with? From what I quoted above it seems that the file system instead does a lookup based on the block content to determine if a block is a dupe (I don’t know if it uses a hash, necessitating processing the whole block, or something like an RB tree, which avoids having to read the whole block if it already differs early from candidates). Unless there is a way to explicitly tell the file system that you are copying blocks for that purpose, and that VMware is actually doing that. If not, then leaving it to the file system or even the storage layer should have a definite impact on performance, albeit in exchange for higher space efficiency because a lookup can deduplicate blocks that are identical but not directly related. This would give a space benefit if you do things like installing the same applications across many VMs after the cloning, but assuming that this isn’t commonly done (I think you should clone after establishing all common state like app installations if possible), then my gut feeling is very much that the performance benefit of more semantic-level hypervisor bookkeeping outweighs the space gains from “dumb” block-oriented fs/storage bookkeeping.
                  • Dylan16807 10 days ago
                    Your phrasing sounds like you're unaware that filesystems can also do the same kind of cloning that a hypervisor does, where the initial data takes no storage space and only changes get written.

                    In fact, it's a much more common feature than active deduplication.

                    VM drives are just files, and it's weird that you imply a filesystem wouldn't know about the semantics of a file getting copied and altered, and would only understand blocks.

                    • anyfoo 10 days ago
                      Uh, thanks for the personal attack? I am aware that cloning exists, and I very explicitly allowed for the use of such a mechanism to change the conclusion in both of my comments. My trouble was that I wasn't sure how much filesystem-cloning is actually in use in relevant contexts. Does POSIX have some sort of "copyfile()" system call nowadays? Last I knew (outdated, I'm sure), the cp command for example seemed to just read() blocks into a buffer and write() them out again. I'm not sure how the filesystem layer would detect this as a clone without a lookup. I was quoting and basing my assumptions on the article:

                      > The downside is that every single write and free operation requires a lookup and a then a write to the dedup table, regardless of whether or not the write or free proper was actually done by the pool.

                      Which, if universally true, is very much different from what a hypervisor could do instead, and I've detailed the potential differences. But if a hypervisor does use some sort of clone system call instead, that can indeed shift the same approach into the fs layer, and my genuine question is whether it does.

                      • Dylan16807 10 days ago
                        I said "your phrasing sounds like" specifically to make it not personal. Clearly some information was missing but I wasn't sure exactly what. I'll try to phrase that better in the future.

                        It sounds like the information you need is that cp has a flag to make cloning happen. I think it even became default behavior recently.

                        Also that the article quote is strictly talking about dedup. That downside does not generalize to the clone/reflink features. They use a much more lightweight method.

                        This is one of the relevant syscalls: https://man7.org/linux/man-pages/man2/copy_file_range.2.html

                        • anyfoo 10 days ago
                          Thanks, that was educational.
                • dpedu 10 days ago
                  > Linked clones aren't used in ESXi

                  Huh? What do you mean? They absolutely are. I've made extensive use of them in ESXi/vsphere clusters in situations where I'm spinning up and down many temporary VMs.

                  • UltraSane 10 days ago
                    Linked clones do not exist in ESXi. Horizon Composer is what is/was used to create them, and that requires a vCenter Server and a bit of infrastructure, including a database.
                    • dpedu 8 days ago
                      No, you can create them if you only have vCenter via its API. No extra infrastructure beyond that, though. The pyVmomi library has example code of how to do it. IIRC it is true that standalone ESXi does not offer the option to create a linked clone by itself, but if I wanted to be a pedant I'd argue that linked clones do exist in ESXi, as that is where vCenter deploys them.
        • PittleyDunkin 10 days ago
          > VMware certified admin

          Not to be rude, but does this have any meaning?

          • UltraSane 9 days ago
            I understand how VMware ESXi works better than most people.
      • iwontberude 10 days ago
        COW is significantly slower and has nesting limits when compared to these deduped clones. Great question!
    • jamesbfb 10 days ago
      Can relate. I’ve recently taken ownership of a new work laptop with Ubuntu (with “experimental” zfs) and using dedupe on my nix store has been an absolute blessing!
      • amarshall 10 days ago
        Nix already has some builtin deduplication, see `man nix-store-optimise`. Nix’s own hardlinking optimization reduces disk usage of the store (for me) by 30–40%.
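
        Concretely (the nix.conf setting makes it happen automatically as things are built):

          nix-store --optimise          # one-off hardlinking pass over the whole store
          # or in nix.conf:
          # auto-optimise-store = true
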
        • jamesbfb 9 days ago
          Update. Turns out PyCharm does not play nice with a plethora of symlinks. :(
          • amarshall 9 days ago
            Nix optimise does not use symlinks, it uses hardlinks.
        • jamesbfb 10 days ago
          Well, TIL. Being relatively new to nix, you've led me down another rabbit hole :)
      • rwarfield 10 days ago
        Isn't it better to use `nix store optimise` for dedup of the nix store? The nix command has more knowledge of the structure of the nix store so should be able to do a better job with fewer resources. Also the store is immutable so you don't actually need reflinks - hard links are enough.
        • Filligree 10 days ago
          It is, yeah, though you have to turn it on. I'm not actually sure why it's off by default.
          • amarshall 10 days ago
            It’s off by default as it can make builds slower (regardless of platform)—you should test this if you care. There also are (or were) some bugs on macOS that would cause corruption.
            • Filligree 9 days ago
              That seems like the wrong default. Most people do very little building on their desktops; they get all their software from the cache.
  • nikisweeting 11 days ago
    I'm so excited about fast dedup. I've been wanting to use ZFS deduping for ArchiveBox data for years, as I think fast dedup may finally make it viable to archive many millions of URLs in one collection and let the filesystem take care of compression across everything. So much of archive data is the same jquery.min.js, bootstrap.min.css, logo images, etc. repeated over and over in thousands of snapshots. Other tools compress within a crawl to create wacz or warc.gz files, but I don't think anyone has tried to do compression across the entire database of all snapshots ever taken by a tool.

    Big thank you to all the people that worked on it!

    BTW has anyone tried a probabilistic dedup approach using something like a bloom filter so you don't have to store the entire dedup table of hashes verbatim? Collect groups of ~100 block hashes into a bucket each, and store a hyper compressed representation in a bloom filter. On write, look up the hash of the block to write in the bloom filter, and if a potential dedup hit is detected, walk the 100 blocks in the matching bucket manually to look for any identical hashes. In theory you could do this with layers of bloom filters with different resolutions and dynamically swap out the heavier ones to disk when memory pressure is too high to keep the high resolution ones in RAM. Allowing the accuracy of the bloom filter to be changed as a tunable parameter would let people choose their preference around CPU time/overhead:bytes saved ratio.

    • mappu 11 days ago
      Even with this change ZFS dedupe is still block-aligned, so it will not match repeated web assets well unless they exist at consistently identical offsets within the warc archives.

      dm-vdo has the same behaviour.

      You may be better off with long-range solid compression instead, or unpacking the warc files into a directory equivalent, or maybe there is some CDC-based FUSE system out there (Seafile perhaps)

      • nikisweeting 11 days ago
        I should clarify I don't use WARCs at all with archivebox; it just stores raw files on the filesystem because I rely on ZFS for all my compression, so there is no offset alignment issue.

        The wget extractor within archivebox can produce WARCs as an output but no parts of ArchiveBox are built to rely on those, they are just one of the optional extractors that can be run.

    • uniqueuid 11 days ago
      I get the use case, but in most cases (and particularly this one) I'm sure it would be much better to implement that client-side.

      You may have seen in the WARC standard that they already do de-duplication based on hashes and use pointers after the first store. So this is exactly a case where FS-level dedup is not all that good.

      • nikisweeting 11 days ago
        WARC only does deduping within a single WARC, I'm talking about deduping across millions of WARCs.
        • uniqueuid 11 days ago
          That's not true, you commonly have CDX index files which allow for de-duplication across arbitrarily large archives. The internet archive could not reasonably operate without this level of abstraction.

          [edit] Should add a link, this is a pretty good overview, but you can also look at implementations such as the new zeno crawler.

          https://support.archive-it.org/hc/en-us/articles/208001016-A...

          • nikisweeting 11 days ago
            Ah cool, TIL, thanks for the link. I didn't realize that was possible.

            I know of the CDX index files produced by some tools but don't know anything about the details/that they could be used to dedup across WARCs, I've only been referencing the WARC file specs via IIPC's old standards docs.

    • alchemist1e9 11 days ago
      While a slightly different use case, I suspect you’d like zbackup if you don’t know about it.
  • dark-star 11 days ago
    I wonder why they are having so much trouble getting this working properly with smaller RAM footprints. We have been using commercial storage appliances that have been able to do this for about a decade (at least) now, even on systems with "little" RAM (compared to the amount of disk storage attached).

    Just store fingerprints in a database and run through that at night and fixup the block pointers...

    • magicalhippo 11 days ago
      > and fixup the block pointers

      That's why. Due to reasons[1], ZFS does not have the capability to rewrite block pointers. It's been a long requested feature[2] as it would also allow for defragmentation.

      I've been thinking this could be solved using block pointer indirection, like virtual memory, at the cost of a bit of speed.

      But I'm by no means a ZFS developer, so there's surely something I'm missing.

      [1]: http://eworldproblems.mbaynton.com/posts/2014/zfs-block-poin...

      [2]: https://github.com/openzfs/zfs/issues/3582

      • phongn 11 days ago
        It looks like they’re playing more with indirection features now (created for vdev removal) for other features. One of the recent summit hackathons sketched out using indirect vdevs to perform rebalancing.

        Once you get a lot of snapshots, though, the indirection costs start to rise.

    • wmf 11 days ago
      Fixing up block pointers is the one thing ZFS didn't want to do.
    • olavgg 10 days ago
      You can also use DragonFlyBSD with Hammer2, which supports both online and offline deduplication. It is very similar to ZFS in many ways. The big drawback though, is lack of file transfer protocols using RDMA.

      I've also heard there are some experimental branches that makes it possible to run Hammer2 on FreeBSD. But FreeBSD also lacks RDMA support. For FreeBSD 15, Chelsio has sponsored NVMe-oF target, and initiator support. I think this is just TCP though.

  • nabla9 10 days ago
    You should use:

       cp --reflink=auto
    
    You get file-level deduplication. The command above performs a lightweight copy (a ZFS clone at the file level), where the data blocks are copied only when modified. It's a copy, not a hard link. The same should work in other copy-on-write transactional filesystems as well if they have reflink support.
  • BodyCulture 10 days ago
    I wanted to use ZFS badly, but of course all data must be encrypted. It was surprising to see how much more complicated usage gets than expected, and how many people just don't encrypt their data because things get wild then.

    Look, even Proxmox, which I totally expected to support encryption in the default installation (it has „Enterprise“ on the website), does lose important features when used with encryption.

    Also, please study the issue tracker; there are a few surprising things I would not have expected to exist in a production file system.

    • eadmund 10 days ago
      The best way to encrypt ZFS is to run unencrypted ZFS atop encrypted volumes (e.g. LUKS volumes). ZFS ‘encryption’ leaves too much in plaintext for my comfort.
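
      A minimal sketch of that layout (device names are placeholders; you'd normally use /dev/disk/by-id paths and one LUKS container per pool member):

          cryptsetup luksFormat /dev/sdb
          cryptsetup open /dev/sdb crypt0
          zpool create tank /dev/mapper/crypt0

      The trade-off is that every member device has to be unlocked (crypttab, keyfile, etc.) before the pool can be imported at boot.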
      • BodyCulture 10 days ago
        In the Proxmox forum, some people have tried this method and don't report much success. I can't recommend it for production.

        Still the same picture: encryption doesn't seem to be a first-class citizen in ZFS land.

  • klysm 11 days ago
    I really wish we just had a completely different API as a filesystem. The API surface of filesystem on every OS is a complete disaster that we are locked into via backwards compatibility.
    • magicalhippo 11 days ago
      Internally ZFS is essentially an object store. There was some work which tried to expose it through an object store API. Sadly it seems to not have gone anywhere.

      I tried to find the talk but failed; I was sure I had seen it at a Developer Summit, but alas.

    • UltraSane 11 days ago
      Why is it a disaster and what would you replace it with? Is the AWS S3 style API an improvement?
      • lazide 11 days ago
        It’s only a ‘disaster’ if you are using it exclusively programmatically and want to do special tuning.

        File systems are pretty good if you have a mix of human and programmatic uses, especially when the programmatic cases are not very heavy duty.

        The programmatic scenarios are often entirely human hostile, if you try to imagine what would be involved in actually using them. Like direct S3 access, for example.

      • mappu 11 days ago
        High-density drives are usually zoned storage, and it's pretty difficult to implement the regular filesystem API on top of that with any kind of reasonable performance (device- vs host- managed SMR). The S3 API can work great on zones, but only because it doesn't let you modify an existing object without rewriting the whole thing, which is an extremely rough tradeoff.
      • perlgeek 10 days ago
        One way it's a disaster is that file names (on Linux at least; I haven't used Windows in a long time) are arbitrary byte strings, and a single directory path can cross different/multiple file systems.

        So if you have non-ASCII characters in your paths, encoding/decoding is guesswork; at worst it differs from path segment to path segment, and there's no metadata attached to say which encoding to use.

        • p_l 10 days ago
          ZFS actually has settings related to that, which originated from serving filesystems to different OSes: it can enforce canonical UTF-8 with a specific normalization rule. AFAIK the reason it exists was cooperation between Solaris, Linux, Windows, and Mac OS X computers all sharing the same network filesystem hosted from ZFS.
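
          For reference, these are per-dataset properties that can only be set at creation time (the values below are just one common choice):

              zfs create -o utf8only=on -o normalization=formD -o casesensitivity=mixed tank/share

          utf8only rejects file names that aren't valid UTF-8, normalization picks the Unicode form used when comparing names, and casesensitivity mostly matters for SMB clients.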
        • UltraSane 10 days ago
          That definitely does not sound like much fun to deal with.
  • bastloing 11 days ago
    Forget dedupe, just use ZFS compression; a lot more bang for your buck.
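
    For anyone who wants to try it: compression is a per-dataset property and only applies to newly written blocks, so set it early and check the ratio later:

        zfs set compression=lz4 tank
        zfs get compressratio tank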
    • Joel_Mckay 11 days ago
      Unless your data-set is highly compressed media files.

      In general, even during rsync operations one often turns off compression for large video files, as compression has little or even negative benefit for storage/transfer while eating RAM and CPU power.

      De-duplication is good for Virtual Machine OS images, as the majority of the storage cost is a replicated backing image. =3

      • bastloing 10 days ago
        Compression is still king. Check out HP's Nimble storage arrays. Way quicker to do compression, fewer IOPS, and less overhead. Even when it misses, as with video files, it's still a winner.
  • rodarmor 11 days ago
    General-purpose deduplication sounds good in theory but tends not to work out in practice. IPFS uses a rolling hash with variable-sized pieces in an attempt to deduplicate data rsync-style. However, in practice it doesn't actually make a difference, and adds complexity for no reason.
  • tilt_error 11 days ago
    If writing performance is critical, why bother with deduplication at writing time? Do deduplication afterwards, concurrently and with lower priority?
    • magicalhippo 11 days ago
      Keep in mind ZFS was created at a time when disks were glacial in comparison to CPUs. And, the fastest write is the one you don't perform, so you can afford some CPU time to check for duplicate blocks.

      That said, NVMe has changed that balance a lot, and you can afford a lot less before you're bottlenecking the drives.

    • 0x457 11 days ago
      Because to make this work without a lot of copying, you would need to mutate things that ZFS absolutely does not want to make mutable.
    • UltraSane 11 days ago
      If the block to be written is already being stored then you will match the hash and the block won't have to be written. This can save a lot of write IO in real world use.
    • klysm 11 days ago
      Kinda like a log-structured merge tree?
  • rkagerer 10 days ago
    I'd love it if the dedicated hardware that already exists in disk controllers for calculating stuff like ECC could be enhanced to expose hashes of blocks to the system. Getting this for free for all your I/O would allow some pretty awesome things.
    • UltraSane 10 days ago
      That is a neat idea. Hard drives could do dedupe from the ECC they calculate for each sector. The main issue with that is that the current ECC is optimal for detecting bit errors but doesn't have the same kind of statistical guarantee of uniqueness that SHA256 or MetroHash has. You need to be VERY confident of the statistical properties of the hash used in dedupe if you are going to increment the ref count of the block hash instead of writing the data to disk.
  • cmiller1 10 days ago
    So if a sweet spot exists where dedup is widely beneficial then:

    Is there an easy way to analyze your dataset to find if you're in this sweet spot?

    If so, is anyone working on some kind of automated partial dedup system where only portions of the filesystem are dedupped based on analysis of how beneficial it would be?
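
    For the first question, as far as I know zdb can already simulate the dedup table for an existing pool without turning anything on:

        zdb -S tank

    It walks the pool, builds an in-memory DDT, and prints a histogram plus the projected dedup ratio, so you can judge whether the savings would ever pay for the table overhead.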

    • mlfreeman 10 days ago
      Are there any tools that can run (even across network on another box) to analyze possible duplication at various block sizes?

      I am NOT interested in finding duplicate files, but duplicate slices within all my files overall.

      I can easily throw together code myself to find duplicate files.

      EDIT: I guess I’m looking for a ZFS/BTRFS/other dedupe preview tool that would say “you might save this much if you used this dedupe process.”

    • mdaniel 10 days ago
      I can't speak to the first one, but AIUI the ZFS way of thinking about the second one is to create a new filesystem (dataset) with dedup enabled and just mount it where you want it, rather than deduping "portions of the filesystem", which I doubt very seriously that ZFS allows. Bonus points that in that scenario the dedupe and compression would likely work even better, since any such setup is likely to contain more homogeneous content (music, photos, etc.)
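
      Something like this (names are illustrative), so only that one dataset ever touches the dedup table:

          zfs create -o dedup=on -o compression=lz4 tank/vm-images

      Everything else on the pool stays at dedup=off and pays none of the DDT overhead.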
  • wpollock 11 days ago
    When the lookup key is a hash, there's no locality over the megabytes of the table. So don't all the extra memory accesses to support dedup affect the L1 and L2 caches? Has anyone at OpenZFS measured that?

    It also occurs to me that spatial locality on spinning rust disks might be affected, also hurting performance.

  • xmodem 10 days ago
    We have a read-heavy zpool with some data that's used as part of our build process, on which we see a roughly 8x savings with dedup - and because of this ZFS dedup makes it economically viable for us to store the pool on NVMe rather than spinning rust.
    • myself248 10 days ago
      And being read-heavy, suboptimal performance at write time is an infrequent pain, I guess?
      • xmodem 10 days ago
        Not even that - the data being written is coming straight from the network, and the pool has no issues keeping up.
  • nobrains 10 days ago
    What are the use cases where it makes sense to use de-dup? Backup comes to mind. What else?
  • watersb 10 days ago
    I've used ZFS dedupe for a personal archive since dedupe was first introduced.

    Currently, it seems to be reducing on-disk footprint by a factor of 3.

    When I first started this project, 2TB hard drives were the largest available.

    My current setup uses slow 2.5-inch hard drives; I attempt to improve things somewhat via NVMe-based Optane drives for cache.

    Every few years, I try to do a better job of things but at this point, the best improvement would be radical simplification.

    ZFS has served very well in terms of reliability. I haven't lost data, and I've been able to catch lots of episodes of almost losing data. Or writing the wrong data.

    Not entirely sure how I'd replace it, if I want something that can spot bit rot and correct it. ZFS scrub.

    • roygbiv2 10 days ago
      Do you have data that is very obviously dedupeable? Or just a mix of things? A factor of three is not to be sniffed at.
      • watersb 10 days ago
        This archive was created as a dumping ground for all of my various computers, so a complete file by file dump of their system drives.

        While that obviously leads to duplicate data from files installed by operating systems, there's a lot of duplicate media libraries. (File-oriented dedupe might not be very effective for old iTunes collections, as iTunes stores metadata like how many times a song has been played in the same file as the audio data. So the hash value of a song will change every time it's played; it looks like a different file. ZFS block-level dedupe might still work ok here because nearly all of the blocks that comprise the song data will be identical.)

        Anyway. It's a huge pile of stuff, a holding area of data that should really be sorted into something small and rational.

        The application leads to a big payoff for deduplication.

    • emptiestplace 10 days ago
      Cache or ZIL (SLOG device)?
      • watersb 10 days ago
        Both the ZIL and the L2ARC, plus a third "special" vdev which aggregates small blocks and could hold the dedupe table.

        The ZIL is the "ZFS Intent Log", a log-structured ordered stream of file operations to be performed on the ZFS volume.

        If power goes out, or the disk controller goes away unexpectedly, this ZIL is the log that will get replayed to bring the volume back to a consistent state. I think.

        Usually the ZIL is on the same storage devices as the rest of the data. So a write to the ZIL has to wait for disk in the same line as everybody else. It might improve performance to give the ZIL its own, dedicated storage devices. NVMe is great, lower latency the better.

        Since the ZFS Intent Log gets flushed to disk every five seconds or so, a dedicated ZIL device doesn't have to be very big. But it has to be reliable and durable.

        Windows made small, durable NVMe cache drives a mainstream item for a while, when most laptops still used rotating hard drives. Optane NVMe at 16GB is cheap, like twenty bucks, buy three of them and use a mirrored pair of two for your ZIL.

        ----

        Then there's the second-level read cache, the L2ARC (the ARC itself lives in RAM). I use 1TB mirrored NVMe devices.

        Finally, there's a "special" device that can, for example, be designated to hold intensive things like the dedupe table (which Fast Dedupe is making smaller!).
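
        For reference, all three are attached with zpool add (device names below are placeholders):

            zpool add tank log mirror nvme0n1 nvme1n1
            zpool add tank cache nvme2n1
            zpool add tank special mirror nvme3n1 nvme4n1

        Unlike log and cache devices, the special vdev holds real pool data (metadata and optionally small blocks), so losing it loses the pool; hence the mirror.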

        • emptiestplace 10 days ago
          A couple things:

          - The ZIL is used exclusively for synchronous writes - critical for VMs, databases, NFS shares, and other applications requiring strict write consistency. Many conventional workloads won't benefit. Use `zilstat` to monitor.

          - The cheap 16GB Optane devices are indeed great in terms of latency, but they were designed primarily for read caching and have significantly limited write speeds. If you need better throughput, look for the larger Optane models which don't have these limitations.

          - SLOG doesn't need to be mirrored - the only risk is if your SLOG device fails at the exact moment your system crashes. While mirroring is reasonable for production systems, with these cheap 16GB Optanes you're just guaranteeing they'll wear out at the same time. You could kill one at a time instead. :)

          - As for those 1TB NVMe devices for read cache (L2ARC) - that's probably overkill unless you have a very specific use case. L2ARC actually consumes RAM to track what's in the cache, and that RAM might be better used for ARC (the main memory cache). L2ARC only makes sense when you have well-understood workload patterns and your ARC is consistently under pressure - like in a busy database server or similar high-traffic scenario. Use `arcstat` to monitor your cache hit ratios before deciding if you need L2ARC.

        • shiroiushi 10 days ago
          >Optane NVMe at 16GB is cheap, like twenty bucks, buy three of them and use a mirrored pair of two for your ZIL.

          I've been building a home media server lately and have thought about doing something like this. However, there's a big problem: these little 16GB Optane drives are NVMe. My main boot drive and where I keep the apps is also NVME (not mirrored, yet: for now I'm just regularly copying to the spinning disks for backup, but a mirror would be better). So ideally that's 4 NVMe drives, and that's with me "cheating" and making the boot drive a partition on the main NVMe drive instead of a separate drive as normally recommended.

          So where are you supposed to plug all these things in? My pretty-typical motherboard has only 2 NVMe slots, one that connects directly to the CPU (PCIe 4.0) and one that connects through the chipset (slower PCIe 3.0). Is the normal method to use some kind of PCIe-to-NVMe adapter card and plug that into the PCIe x16 video slot?

          • emptiestplace 9 days ago
            Why are you looking at 16GB Optane drives? You probably don't need a SLOG device for your media server.

            I think you're pretty far into XY territory here. I'd recommend hanging out in r/homelab and r/zfs, read the FAQs, and then if you still have questions, maybe start out with a post explaining your high level goals and challenges.

            • shiroiushi 9 days ago
              I'm not using them yet; I've already built my server without one, but I was wondering if it would be beneficial to add one for ZIL. Again, this is a home media server, so the main uses are pretty standard for a "home server" these days I think: NFS share, backups (of our PCs), video/music/photo storage, Jellyfin server, Immich server. I've read tons of FAQs and /r/homelab and /r/homeserver (honestly, /r/homelab isn't very useful, it's overkill for this kind of thing, with people building ridiculous rack-mount mega-systems; /r/homeserver is a lot better but it seems like a lot of people are just cobbling together a bunch of old junk, not building a single storage/media server).

              My main question here was just what I asked about NVMe drives. Many times in my research that you recommended, people recommended using multiple NVMe drives. But even a mirror is going to be problematic: on a typical motherboard (I'm using a AMD B550 chipset), there's only 2 slots, and they're connected very differently, with one slot being much faster (PCIe4) than the other (PCIe3) and having very different latency, since the fast one connects to the CPU and the slow one goes through the chipset.

              • emptiestplace 9 days ago
                Ok, understood. The part I'm confused about is the focus on NVMe devices - do you also have a bunch of SATA/SAS SSDs, or even conventional disks for your media? If not, I'd definitely start there. Maybe something like six spinners in RAIDZ2, this would allow you to lose up to two drives without any data loss.

                If NVMe is your only option, I'd try to find a couple used 1.92TB enterprise class drives on ebay, and go ahead and mirror those without worrying about the different performance characteristics (the pool will perform as fast as the slowest device, that's all) - but 1.92TB isn't much for a media server.

                In general, I'd say consumer class SSDs aren't worth the time it'll take you to install them. I'd happily deploy almost any enterprise class SSD with 50% beat out of it over almost any brand new consumer class drive. The difference is stark - enterprise drives offer superior performance through PLP-improved latency and better sustained writes (thanks to higher quality NAND and over-provisioning), while also delivering much better longevity.

                • shiroiushi 9 days ago
                  >The part I'm confused about is the focus on NVMe devices - do you also have a bunch of SATA/SAS SSDs

                  I do have 4 regular SATA spinning disks (enterprise-class), for bulk data storage, in a RAIDZ1 array. I know it's not as safe as RAIDZ2, but I thought it'd be safe enough with only 4 disks, and I want to keep power usage down if possible.

                  I'm using (right now) a single 512GB NVMe drive for both booting and app storage, since it's so much faster. The main data will be on the spinners, but the apps themselves are on the NVMe, which should improve performance a lot. It's not mirrored obviously, so that's one big reason I'm asking about the NVMe slots; sticking a 2nd NVMe drive in this system would actually slow it down, since the 2nd slot is only PCIe3 and connected through the chipset, so I'm wondering if people do something different, like using some kind of adapter card for the x16 video slot. I just haven't seen any good recommendations online in this regard. For now, I'm just doing daily syncs to the RAID array, so if the NVMe drive suddenly dies somehow, it won't be that hard to recover, though obviously not nearly as easy as with a mirror. This obviously isn't some kind of mission-critical system, so I'm ok with this setup for now; some downtime is OK, but data loss is not.

                  Thanks for the advice!

                  • emptiestplace 9 days ago
                    Yeah, RAIDZ1 is a reasonable trade-off for the four disks.

                    Move your NVMe to the other slot, I bet you can't tell a difference without synthetic benchmarks.

  • qwertox 10 days ago
    What happened to the issue with ZFS which occurred around half a year ago?

    I never changed a thing (because it also had some cons), and I believe that as long as a ZFS scrub shows no errors, all is OK. Could I be missing a problem?

  • david_draco 10 days ago
    In addition to the copy_file_range discussion at the end, it would be great to be able to apply deduplication to selected files, identified by searching the filesystem for, say, >1MB files that have identical hashes.
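
    In the meantime, something in that spirit can be scripted in user space. A rough sketch (it assumes the filesystem honours the dedupe ioctl that duperemove issues, which as far as I know mainline OpenZFS doesn't yet):

        find /tank/docs -type f -size +1M -print0 | xargs -0 duperemove -d --hashfile=/tmp/dedupe.db

    duperemove hashes extents, keeps the hashes in the given file so reruns are incremental, and only asks the kernel to share ranges that actually compare identical.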
  • girishso 11 days ago
    Off topic, any tool to deduplicate files across different external Hard disks?

    Over the years I made multiple copies of my laptop HDD to different external HDDs, ended up with lots of duplicate copies of files.

    • nikisweeting 11 days ago
      How would you want the duplicates resolved? Just reported in some interface or would you want the duplicates deleted off some machines automatically?

      There are a few different ways you could solve it but it depends on what final outcome you need.

      • girishso 9 days ago
        Just reporting in some plain text format so I can manually delete the duplicates, or create some script to delete.

        I can't have like 10 external HDDs attached at the same time, so the tool needs to dump details (hashes?) somewhere on Mac HDD, and compare against those to find the duplicates.

        • nikisweeting 8 days ago
          Here you go:

              cd /path/to/drive
              # sha256sum puts two spaces between hash and path; rewrite each line as "hash,path"
              find . -type f -exec sha256sum {} + | sed -E 's/^([^ ]+) +\./\1,/' >> ~/all_hashes.txt
          
          Run that for each drive, then when you're done run:

              sort ~/all_hashes.txt > ~/sorted_hashes.txt
              # keep only lines whose first 64 characters (the sha256 hash) appear more than once
              uniq -w64 -D ~/sorted_hashes.txt > ~/non_unique_hashes.txt
          
          The output in ~/non_unique_hashes.txt will contain only the non-unique hashes that appear on more than one path.
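
          One caveat since you're on a Mac: the stock tools differ slightly. sha256sum is "shasum -a 256" there, and the BSD uniq may not support -w/-D, so you may want the GNU coreutils versions (e.g. from Homebrew) for the second step.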
    • UltraSane 10 days ago
      dupeGuru works pretty well.
  • forrestthewoods 11 days ago
    My dream Git successor would use either dedupe or a simple cache plus copy-on-write so that repos can commit toolchains and dependencies and users wouldn’t need to worry about disk drive bloat.

    Maybe someday…

    • fragmede 11 days ago
      It does dedup using SHA-1 on entire file contents. You might try git-lfs for your use case, though.
      • forrestthewoods 10 days ago
        Git LFS is a really really bad gross hack. It’s awful.

        https://www.forrestthewoods.com/blog/dependencies-belong-in-...

        • fragmede 10 days ago
          It's quite functional and usable now, so I'd agree with "hack", just not the rest of your adjectives.

          That was a good read! I've been thinking a lot about what comes after git too. One thing you don't address is that no one wants all the parts at once either, nor would it all fit on one computer, so I should be able to check out just one subdirectory of a repo.

          • forrestthewoods 8 days ago
            > One thing you don't address is that no one wants all parts at once either, not would it fit on one computer, so I should be able to checkout just one subdirectory of a repo.

            That’s the problem that a virtual file system solves. When you clone a repo it only materializes files when they’re accessed. This is how my work repo operates. It’s great.

            • fragmede 8 days ago
              The problem I'm trying to avoid is having to dive down a hierarchy, and that not everyone needs or wants to know there is such a hierarchy. Like, graphic design for foo-team only needs access to some subset of graphics. Arguably they could be given a symlink into the checkout, but the problem with that is having a singular checkout.

              The problem I had with Google3 is that the tools weren't great at branching and didn't fit my workflow, which tends to involve multiple checkouts (or worktrees using git). Being forced to check out the root of the repo, and then having to manage a symlink on top of that, is no good for users who don't need/want to manage the complexity of having a single machine-global checkout.

  • UltraSane 11 days ago
    Knowing that your storage has really good inline dedupe is awesome and will affect how you design systems. Solid dedupe lets you effectively treat multiple copies of data as symlinks.
  • teilo 10 days ago
    Why are enterprise SANs so good at dedupe, but filesystems so bad? We use HPE Nimble (yeah, they changed the name recently but I can't be bothered to remember it), and the space savings are insane for the large filesystems we work with. And there is no performance hit.

    Some of this is straight up VM storage volumes for ESX virtual disks, some direct LUNs for our file servers. Our gains are upwards of 70%.

    • growse 10 days ago
      Totally naive question: is this better than what you get from simply compressing?

      It's not 100% clear to me why explicit deduping blocks would give you any significant benefit over a properly chosen compression algorithm.

  • onnimonni 9 days ago
    I'm storing a lot of text documents (.html) which contain long similar sections and are thus not full copies but "partial copies".

    Would anyone know whether fast dedup also works for this? Anything else I could be using instead?

  • hhdhdbdb 11 days ago
    Any timing attacks possible on a virtualized system using dedupe?

    E.g. find out what my neighbours have installed.

    Or, if the data preceding an SSH key is predictable, keep writing that out to disk while guessing the next byte, or something like that.

    • aidenn0 11 days ago
      I don't think you even need timing attacks if you can read the zpool statistics; you can ask for a histogram of deduped blocks.

      Guessing one byte at a time is not possible though because dedupe is block-level in ZFS.
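
      Concretely, anyone who can query the pool can see it:

          zpool status -D tank
          zdb -DD tank

      The first prints aggregate DDT statistics, the second a full histogram bucketed by reference count; exactly the kind of cross-tenant signal you wouldn't want exposed.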

      • beng-nl 10 days ago
        Gosh, you’re likely right, but what if comparing the blocks (to decide on deduping) is a byte at a time and somehow that can be detected (with a timing channel or a uarch side channel)? Zfs likely compares the hash, but I think KSM doesn’t use hashes but memcmp (or something in that spirit) to avoid collisions. So just maybe… just maybe GP is onto something.. interesting fantasy ;-)
        • hhdhdbdb 10 days ago
          Thanks for putting meat on the (speculative) bone I threw out! Very interesting.
    • UltraSane 10 days ago
      VMware ESXi used to dedupe RAM and had to disable this by default because of a security issue it caused that leaked data between VMs.
  • eek2121 11 days ago
    So many flaws. I want to see the author repeat this across 100TB of random data from multiple clients. He/she/whatever will quickly realize why this feature exists. One scenario I am aware of that uses another filesystem in a cloud setup saved 43% of disk space by using dedupe.

    No, you won't save much on a client system. That isn't what the feature is made for.

    • hinkley 11 days ago
      When ZFS first came out I had visions of it being a turnkey RAID array replacement for nontechnical users. Pop out the oldest disk, pop in a new (larger one), wait for the pretty lights to change color. Done.

      It is very clear that the consumer market was never a priority, and so I wonder how big the overlap is in the Venn diagram of 'client system' and 'ZFS filesystem'. Not that big, right?

    • doublepg23 11 days ago
      I assume the author is aware of why the feature exists, since they state in the second sentence that they funded the improvement over the course of two years?
    • UltraSane 11 days ago
      My reaction also. Dedupe is a must-have when you are storing hundreds of VMs. You WILL save so much data, and inline dedupe will save a lot of write IO.
      • XorNot 11 days ago
        It's an odd notion in the age of containers, where dedupe is like, one of the core things we do (but stupidly: amongst dissimilar images there are definitely more identical files than different ones).
    • edelbitter 11 days ago
      I tried two of the most non-random archives I had and was disappointed, just as the author was. For mail archives, I got 10%. For entire filesystems, I got... just as much as with any other COW. Because indeed, I duplicate them only once; later, shared blocks are all over the place.
  • nisten 11 days ago
    Can someone smarter than me explain what happens when, instead of the regular 4kb block size in kernel builds, we use a 16kb or 64kb block size? Or is that only for the memory part? I am confused. Will a larger block size make this thing better or worse?
    • UltraSane 11 days ago
      Generally the smaller the dedupe block the better as you are far more likely to find a matching block. But larger blocks will reduce the number of hashes you have to store. In my experience 4KB is the sweet spot to maximize how much data you save.
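
      In ZFS the dedup unit is the record, so the knob is recordsize for datasets and volblocksize for zvols (names below are placeholders):

          zfs set recordsize=16K tank/builds
          zfs create -V 100G -o volblocksize=16K tank/vm-disk

      Just remember that smaller records mean proportionally more DDT entries per byte stored, which is where the RAM and metadata cost comes from.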
      • spockz 11 days ago
        So in this case I think it would make sense to have a separate pool where you store large files like media so you can save on the dedup for them.

        Is there an inherent performance loss from using 64kB blocks at the FS level when the storage devices are 4kB under the hood?

        • nisten 10 days ago
          Hmmm, you might be able to do both, no? Like, the dedupe is gonna run at the filesystem level, but your memory security & ownership stuff is gonna run more efficiently. I am not sure.
  • tiffanyh 11 days ago
    OT: does anyone have a good way to dedupe iCloud Photos? Or my Dropbox photos?
    • acdha 11 days ago
      The built in Photos duplicate feature is the best choice for most people: it’s not just generic file-level dedupe but smart enough to do things like take three versions of the same photo and pick the highest-quality one, which is great if you ever had something like a RAW/TIFF+JPEG workflow or mixed full res and thumbnails.
    • spockz 10 days ago
      Or better yet: a single photo I take of the kids will be stored in my camera roll. I will then share it with family using three different messengers. Now I have 4 copies. Each of the individual (recoded) copies is stored inside those messengers and also backed up. This even happens when sharing the same photo multiple times in different chats within the same messenger.

      Is there any way to do de duplication here? Or just outright delete all the derivatives?

    • EraYaN 10 days ago
      digiKam can dedupe on actual similarity (so different resizes and formats of the same image). But it does take some time to calculate all the hashes.
  • merpkz 10 days ago
    I don't get it - many people in this thread claim that VM base image deduplication is a great use case for this. So let's assume there are a couple of hundred VMs on a ZFS dataset with dedupe on, each of them run by different people for entirely different purposes - some databases, some web frontends/backends, minio S3 storage or backups, etc. This might save you those measly hundreds of megabytes of Linux system files the VMs have in common (and given how many Linux versions with different patch levels are out there, even that is unlikely). It still won't be worth it, considering ZFS will keep track of each user's individual files - databases and backup files and whatnot - data which is almost guaranteed to be unique between users, so it will completely miss the point of ZFS deduplication. What am I missing?
    • jeroenhd 10 days ago
      It largely depends on how you set up your environment. On my home server, most VMs consist of a few gigabytes of a base Linux system and then a couple of hundred megabytes of application code. Some of those VMs also store large amounts of data, but most of that data could be stored in something like a dedicated minio server and maybe a dedicated database server. I could probably get rid of a huge chunk of my used storage if I switched to a deduplicating system (but I have plenty of storage so I don't really need to).

      If you're selling VMs to customers then there's probably no advantage in using deduplication.

    • 3np 10 days ago
      In such a scenario you'd probably have several partitions, so dedupe activated on the root filesystem (/bin, /lib, etc.) but not on /home and /var.
  • tjwds 11 days ago
    Edit: disregard this, I was wrong and missed the comment deletion window.
  • kderbe 11 days ago
    I clicked because of the bait-y title, but ended up reading pretty much the whole post, even though I have no reason to be interested in ZFS. (I skipped most of the stuff about logs...) Everything was explained clearly, I enjoyed the writing style, and the mobile CSS theme was particularly pleasing to my eyes. (It appears to be Pixyll theme with text set to the all-important #000, although I shouldn't derail this discussion with opinions on contrast ratios...)

    For less patient readers, note that the concise summary is at the bottom of the post, not the top.

    • Aachen 10 days ago
      That being:

      > As we’ve seen from the last 7000+ words, the overheads are not trivial. Even with all these changes, you still need to have a lot of deduplicated blocks to offset the weight of all the unique entries in your dedup table. [...] what might surprise you is how rare it is to find blocks eligible for deduplication are on most general purpose workloads.

      > But the real reason you probably don’t want dedup these days is because since OpenZFS 2.2 we have the BRT (aka “block cloning” aka “reflinks”). [...] it’s actually pretty rare these days that you have a write operation coming from some kind of copy operation, but you don’t know that came from a copy operation. [...] [This isn't] saving as much raw data as dedup would get me, though it’s pretty close. But I’m not spending a fortune tracking all those uncloned and forgotten blocks.

      > [Dedup is only useful if] you have a very very specific workload where data is heavily duplicated and clients can’t or won’t give direct “copy me!” signal

      The section labeled "summary" imo doesn't do the article justice by being fairly vague. I hope these quotes from near the end of the article give a more concrete idea of why (not) to use it.

      • londons_explore 10 days ago
        > offset the weight of all the unique entries in your dedup table

        Didn't read the 7000 words... But isn't the dedup table in the form of a bunch of bloom filters so the whole dedup table can be stored with ~1 bit per block?

        When you know there is likely a duplicate, you can create a table of blocks where there is a likely duplicate, and find all the duplicates in a single scan later.

        That saves having massive amounts of accounting overhead storing any per-block metadata.

    • emptiestplace 10 days ago
      It scrolls horizontally :(
      • going_north 10 days ago
        It's because of this element in one of the final sections [1]:

            <code>kstat.zfs.<pool>.misc.ddt_stats_<checksum></code>
        
        Typesetting code on a narrow screen is tricky!

        [1] https://despairlabs.com/blog/posts/2024-10-27-openzfs-dedup-...

      • ThePowerOfFuet 10 days ago
        Not on Firefox on Android it doesn't.
        • dspillett 10 days ago
          It does in Chrome on Android (1080 px wide screen, standard PPI & zoom levels), but not by enough that you see it on the main body text (scrolling just reveals more margin), so you might find it does for you too, just not enough that you noticed.

          Since it does scroll here, though inconsequentially, it might be bad on a smaller device with less screen and/or other PPI settings.

  • burnt-resistor 10 days ago
    I already don't use ZoL because of its history of arms-shrug-level support coupled with a lack of QA. ZoL != Solaris ZFS. It is mostly an aspirational cargo cult. Only a few filesystems like XFS and Ext4 have meaningful real-world, enterprise deployment hours. Technically, btrfs has significant (web ops rather than IT ops) deployment exposure due to its use on 10M boxes at Meta. Many non-mainstream filesystems also aren't assured to be trustworthy because of their low usage and prevalent lack of thorough, formalized QA. There's nothing wrong with experimentation, but it's necessary to have an accurate understanding of the risk budget for a given technology for a given use case.
    • volkadav 10 days ago
      I sympathize with your concerns for stability and testing, but I think that you might reconsider things in open-source ZFS land. OpenZFS/ZoL have been merged since the 2.0 release several years back, and some very large (e.g. Netflix) environments use FreeBSD which in turn uses OpenZFS, as well as being in use by the various Illumos derivatives and such. It is true that there has been some feature divergence between Oracle ZFS and OpenZFS since the fork, but as I recall that was more "nice to haves" like fs-native encryption than essential reliability fixes, fwiw.
      • ComputerGuru 9 days ago
        I don't disagree with your post, but Netflix doesn't use ZFS for a couple of reasons, one of which is broken sendfile support (though that might be fixed soon!).