Cache-Friendly B+Tree Nodes with Dynamic Fanout

(jacobsherin.com)

87 points | by jasim 17 hours ago

18 comments

  • kazinator 13 hours ago
    > This pattern was officially standardized in C99,

    No, it wasn't; the C99 flexible array uses [], not [1] or [0].

    When using the [1] hack, you cannot use the sizeof of the structure to get the offset, because it includes the [1] array.

    When using C99, you also cannot use sizeof to get the offset of the [0] element of the flexible array; sizeof is allowed to pretend that there is padding as if the flexible member were not there.

    Quoting the article's allocation:

      // The (N - 1) adjusts for the 1-element array in Payload struct
      Payload *item = malloc(sizeof(Payload) + (N - 1) * sizeof(char));

    If you are in C++ you need a cast; the void * return value of malloc cannot implicitly convert to Payload *.

      Payload *item = static_cast<Payload *>(malloc(...));
    
    Or of course a C cast if the code has to compile as C or C++:

      Payload *item = (Payload *) malloc(...);
    
    Setting aside that issue for brevity, pretending we are in C, I would make the malloc expression:

      Payload *item = malloc(offsetof(Payload, elements) + N);
    
    sizeof(char) is by definition 1, so we do not need to multiply N by it.

    By taking the offset of the elements array, we don't need to subtract 1 from N to account for the [1] element being skipped by sizeof.

    These kinds of little things take away complexity from something that must be carefully coded to avoid memory safety issues. You really want the calculations around the memory to use the simplest possible formulas, ones that are as easy as possible to reason about, so you can convince yourself they are correct.

    Also, when you do use sizeof in a malloc expression, the following pattern avoids repeating the type name for the size, and also lets a pair of parentheses be dropped since sizeof only requires parentheses when the operand is a type:

      Payload *item = malloc(sizeof *item);
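
    Putting those pieces together, a minimal sketch of the whole allocation (assuming a Payload layout like the article's, and back in C++ where the cast is needed):

      #include <cstddef>
      #include <cstdlib>

      struct Payload {
          std::size_t len;
          char elements[1];  // the pre-C99 "[1] hack" (also valid C++)
      };

      Payload *make_payload(std::size_t n) {
          // offsetof names the start of elements, so there is no
          // (n - 1) fixup and no sizeof(char) multiplication.
          Payload *item = static_cast<Payload *>(
              std::malloc(offsetof(Payload, elements) + n));
          if (item) item->len = n;
          return item;
      }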
    • halayli 13 hours ago
      Strictly speaking, in the C++ object model, malloc allocates storage but doesn't create objects. Accessing that memory as if it contains an object (even a trivial one like int) without properly giving it object lifetime is technically UB. For trivial types, this is rarely enforced in practice, but the standard says to use placement new or std::start_lifetime_as (C++23) to properly begin object lifetime.
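
      For illustration, a minimal sketch of the placement-new route (the Header type is hypothetical, not from the article):

        #include <cstdlib>
        #include <new>

        struct Header { int n; };

        Header *make_header() {
            void *mem = std::malloc(sizeof(Header));
            if (!mem) return nullptr;
            // Placement new formally begins the object's lifetime
            // inside the malloc'd storage.
            return new (mem) Header{42};
        }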
      • dataflow 7 hours ago
        > Strictly speaking, in the C++ object model, malloc allocates storage but doesn't create objects.

        No - strictly speaking, it does create objects. https://en.cppreference.com/w/cpp/memory/c/malloc.html#:~:te...

        It gets confusing (to say the least) if you start questioning the details, but the spec does formally intend the objects to be implicitly created.

    • whizzter 13 hours ago
      Even better and simpler:

        Payload *item = (Payload *)malloc(offsetof(Payload, elements[N]));
      
      The rest of the article does make me wary of a lot of other things that aren't done "per spec"; if you're making your own container, those will probably cause unintended bugs in the future.
      • kazinator 3 hours ago
        That's really great; it's easy to forget that the second argument of offsetof isn't simply a member name, but a designator expression.
  • thesz 14 hours ago
    One would be better off implementing a cache-oblivious lookahead array [1] or even a log-structured merge tree.

    [1] https://www3.cs.stonybrook.edu/~bender/newpub/BenderFaFi07.p...

    Both structures exploit the fact that most of the data does not change much and can be packed as tightly as one wishes. Even prefixes (and suffixes) can be factored out.

    • whizzter 13 hours ago
      This article doesn't seem to be about a disk-based structure but rather an in-memory one, and in an in-memory scenario with ordering requirements, some users have reported B-trees as being quite competitive in mostly-read scenarios.

      "Write-heavy" scenarios will probably be just fine with std::map (backed by an RB-tree), since the main downside of B-trees, write amplification, isn't an issue when memory doesn't really have any block granularity.

      LSM trees in-memory will probably not be that useful, as scanning becomes much more complicated (it's an interesting structure, though, if you have append-only workloads and want to implement a dictionary on top of a list for size-coded projects).

      • thesz 2 hours ago
        We are talking about cache friendliness here, including compact data representation.

        The B-tree algorithm requires leaf nodes to be filled only to some prespecified fill factor, usually from 50% to 70%. This is not compact by any measure.

        Both LSM trees and COLA allow for much more compact storage. They are also pretty cache-friendly in the "cache oblivious" sense.

    • moab 13 hours ago
      Do you know of important real-world use-cases where cache-oblivious data structures are used? They are frequently mentioned on HN when relevant discussions like this one pop up, but I would love to hear about places where they are actually used in production.
      • thesz 1 hour ago
        You may not encounter a situation where a cache-oblivious structure is needed, but the knowledge that some algorithms are cache-oblivious can help.

        For example, if you represent a set of values as an ordered array, you can perform many set operations with the (array) merge algorithm, which is a cache-oblivious one. You avoid pointer chasing, etc., but your structure becomes static instead of dynamic, as it would be with, say, red-black trees. This is already a win, because you can save memory and compute if at least one of your sets does not change much.
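
        As a sketch of that point, set union over two sorted arrays is a single streaming merge pass with no pointer chasing (a minimal illustration, using std::set_union as the merge):

          #include <algorithm>
          #include <iterator>
          #include <vector>

          // Union of two sorted sets stored as arrays: one sequential
          // merge pass over both inputs, no pointer chasing.
          std::vector<int> sorted_union(const std::vector<int> &a,
                                        const std::vector<int> &b) {
              std::vector<int> out;
              out.reserve(a.size() + b.size());
              std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                             std::back_inserter(out));
              return out;
          }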

        But if you apply the logarithmic method [1], you can have a dynamic data structure (built over the static one, a sorted array) and still perform most operations without pointer chasing and with cache-oblivious merging.

        [1] https://www.scilit.com/publications/4dd8074c2d05ecc9ed96b5cf...

    • pluto_modadic 13 hours ago
      LSM is great for write-heavy loads. Not sure about random reads; isn't that a B+ tree's turf?
      • thesz 1 hour ago
        The B+-tree is an optimization for slightly faster sequential reads, where the leaves are organized into a list.

        LSM trees allow for a very compact data representation, because most levels of the tree do not change much through time, and that is what we are talking about here. This compact representation makes LSM trees faster at sequential reads too (less pointer chasing).

        Also, you can construct LSM trees differently for different operations: a taller LSM tree with narrower levels allows for faster writes, while a flatter LSM tree with fewer but wider levels allows for faster reads.

  • apelapan 14 hours ago
    There are many mentions of things being too slow and other things being high-performance. I'm not doubting the truthfulness, but it would have been really nice to see some hard numbers showing the magnitude of improvement in a scenario or two.
    • aidenn0 11 hours ago
      Also, it's very hard to make a B+tree that is ever faster than other data structures in RAM.

      Obviously, if you don't need (or only rarely need) in-order traversal (or related operations like successor), hash tables are very fast.

      If you do need in-order traversal, sorted arrays are very fast for small amounts of data, and various types of prefix tries do very well for large amounts of data.

    • whizzter 13 hours ago
      Yeah, this is some weird "performance" optimizer who half-misuses terminology and complains that using an underlying container is bad when implementing their own basic container.

      Eh, yes: you're implementing your own basic container, so naturally an existing basic container won't cut it.

  • pluto_modadic 13 hours ago
    is there an implementation of B+ trees that fluidly pulls from disk vs RAM?

    e.g., two B+ trees, one in RAM and one on disk, with the RAM one evicted with sieve caching? possibly a very lite WAL?

    something that lets you use a B+ tree bigger than RAM, and persist to disk

    • aidenn0 11 hours ago
      That's a type of log-structured merge tree.
  • aidenn0 11 hours ago
    This never explains why dynamic fanout is desired. Static fanout is, of course, trivial with templates.
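
    For comparison, a minimal sketch of what static fanout via a template parameter might look like (names hypothetical):

      #include <array>
      #include <cstddef>

      // Fanout is a compile-time constant: the per-node arrays are sized
      // exactly, and the compiler can unroll or vectorize the in-node search.
      template <typename Key, std::size_t Fanout>
      struct Node {
          std::array<Key, Fanout - 1> keys;
          std::array<Node *, Fanout> children;
          std::size_t nkeys = 0;
      };

      using Node64 = Node<int, 64>;  // e.g. fanout chosen to fit cache lines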