Automated integer hash function discovery

(github.com)

272 points | by danny00 13 days ago

18 comments

  • skissane 13 days ago
    I don't know this guy personally but I like his code.

    I particularly like his JSON library [0], option parsing libraries [1], branchless UTF-8 decoder [2], lock-free stack [3], and trie library [4].

    I also like his taste in licensing (all of the above is released under The Unlicense)

    [0] https://github.com/skeeto/pdjson

    [1] https://github.com/skeeto/optparse and https://github.com/skeeto/getopt

    [2] https://github.com/skeeto/branchless-utf8

    [3] https://github.com/skeeto/lstack

    [4] https://github.com/skeeto/trie

    • junon 13 days ago
      Skeeto is a legend. Up there with Fabrice Bellard in my book. I've followed him on GH for years and he just drops bangers of weird niche little tools that are always interesting.

      Branchless UTF-8 is a famous one, for example.

    • cyfex 13 days ago
      He's also the author of elfeed [0], "An Emacs web feeds client". I've found his minimalist implementation very inspiring.

      [0] https://github.com/skeeto/elfeed

  • aappleby 13 days ago
    Hi, I'm the MurmurHash guy. Neat stuff, and I'm amused that multiply-shift-xor has held up so well for all this time.
    • edflsafoiewq 13 days ago
      Xor-shift counteracts the two weaknesses of a multiply: the high bits have nothing above them to influence, and the low bits have nothing below them to be influenced by.
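
      A tiny demonstration of this (my own sketch, not from the repo): with a multiply alone, flipping the top input bit only ever changes the top output bit, while a following xor-shift pulls that difference down into the low half.

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint32_t c = 0x7feb352dU;            /* any odd multiplier works here */
            uint32_t a = 1, b = a ^ 0x80000000U; /* inputs differing in the top bit */
            /* multiply only: outputs differ solely in the top bit (prints 80000000) */
            printf("mul only:     %08x\n", (a * c) ^ (b * c));
            /* multiply then xor-shift: the difference reaches bit 15 too (80008000) */
            uint32_t ha = a * c; ha ^= ha >> 16;
            uint32_t hb = b * c; hb ^= hb >> 16;
            printf("mul+xorshift: %08x\n", ha ^ hb);
            return 0;
        }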
    • araes 12 days ago
      Like MurmurHash, I expect these are meant to be non-cryptographic.

      However, that said, I feel like the Avalanche + Bias idea is missing quite a bit.

      Take this function (the last one listed) as an example:

        // exact bias: 0.020888578919738908
        uint32_t
        triple32(uint32_t x)
        {
            x ^= x >> 17;
            x *= 0xed5ad4bbU;
            x ^= x >> 11;
            x *= 0xac4c1b51U;
            x ^= x >> 15;
            x *= 0x31848babU;
            x ^= x >> 14;
            return x;
        }
      
      It produces this image when implemented by FabriceNeyret2 on ShaderToy:

      https://www.shadertoy.com/view/WttXWX

      or

      https://i.imgur.com/qU2P5rx.png

      However, a simple normal-map slope differentiation shows quite a few obvious "crystal" lines. There's probably a technical term for those types of ridges.

      https://i.imgur.com/IHWT1GM.png

      Edit: Also, this entire idea is half a decade old? https://nullprogram.com/blog/2018/07/31/

      • the_mitsuhiko 12 days ago
        > Also, this entire idea is half a decade old?

        Is that a bad thing?

        • araes 12 days ago
          Yes?

          In any field touching hash functions, cryptography, or information warfare, five years is an enormous length of time.

          That means "somebody" in humanity has been brute-force generating improved integer hash functions for the last half decade.

          For all I know, Google, NVidia, Microsoft, NSA, Spetssvyaz, 3PLA, and Unit 8200 may have been in an integer hash function generation arms race for the last half decade.

          There's a long Reuters article today on drones, and how the US is floundering (they're just not spending as much money as people want). [1] However, if a drone can dodge better than the enemy can shoot it down, then an integer hash function is suddenly not "civilian". Missiles have known this forever. Talk to the MIRV industry.

          [1] https://www.reuters.com/business/aerospace-defense/sea-drone...

          • rurban 4 days ago
            The big ones probably not, but I extended and ran my fast-hash generator again just this year, with new ideas and new builtin opcodes, and the results haven't changed at all since. You need to find a combination of fast and good-enough hashes.

            But fast string search has changed a lot in the last decade. The previous winners were all overtaken by a new two-way algorithm, as implemented in musl. This was a surprise. String search is definitely more important than integer hashing. Maybe glibc will catch up in the next decade.

          • addaon 11 days ago
            > Talk to the MIRV industry.

            I didn't get an answer at 800-I-AM-MIRV, and don't know another way to contact this industry. Can you perhaps instead hint why you think that this industry might imply that non-cryptographic hashes are not "civilian"?

  • keepamovin 13 days ago
    I've often thought about the idea of auto hash finding based on my experiences developing good hash functions [Rain].

    Cool to see this! I think it would be cool to hook it up with SMHasher3 [SMHasher3] (a much-improved and faster variant, developed by Frank J. T. Wojcik, of the venerable hash test suite) to automatically evaluate its output. You could use a subset of tests and fail fast for speed.

    It would also be cool to expand it to 64 and 128 bit hashes (tho obviously the search space is larger).

    Somewhat related: I created some NodeJS code to measure avalanche under multiplication for 64-bit primes in order to select the values for Rain.

    [Rain]: https://github.com/dosyago/rain

    [SMHasher3]: https://gitlab.com/fwojcik/smhasher3

  • nullc 13 days ago
    It would be interesting to generalize this with the operations available in the RISC-V bitmanip extension -- it might discover some strong functions for use in the future, when those instructions are more widely available.

    Carry-less multiply would also extend the set of reversible operations, and it's fast on some existing hardware (CRC too, on a somewhat wider set of hardware, though what it enables should be a strict subset of what CLMUL could find).
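
    For illustration, a carry-less multiply mixing step might look like the sketch below (mine; the intrinsic is x86 PCLMULQDQ, compile with -mpclmul). Keeping only the low 64 bits stays reversible whenever the constant's low bit is set, mirroring the odd-multiplier rule:

      #include <stdint.h>
      #include <immintrin.h> /* _mm_clmulepi64_si128; needs -mpclmul */

      static uint64_t clmul_mix(uint64_t x)
      {
          const uint64_t k = 0x9e3779b97f4a7c15ULL; /* illustrative constant, low bit set */
          __m128i p = _mm_clmulepi64_si128(_mm_set_epi64x(0, (long long)x),
                                           _mm_set_epi64x(0, (long long)k), 0x00);
          /* low 64 bits of the carry-less product: invertible in GF(2)[x]/x^64 */
          return (uint64_t)_mm_cvtsi128_si64(p);
      }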

    Many applications of hashes only care about the LSB or MSB of the hashes, so evaluating the bias on MSB or LSB windows would be interesting (or mod various numbers), and some overall-unbiased functions will look better or worse on metrics that don't gauge the entire output (or on non-uniform inputs, for that matter, e.g. ASCII text).

  • throwaway_1237 13 days ago
    Can someone explain why this is cool and what it’s used for?
    • tommiegannert 13 days ago
      It seems to be a tool that generates instruction sequences for building hash functions, then evaluates the hash functions to see how good they are. The goal metric: when any single input bit changes, each output bit should flip with probability as close to 1/2 as possible.

      It outputs C code for the best hash function it generated.
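
      To make the metric concrete, here's a rough sketch of such an avalanche test (mine, not the repo's actual measurement code): flip each input bit over many sample inputs, and record how far each output bit's flip rate strays from the ideal 1/2.

        #include <stdint.h>

        extern uint32_t hash(uint32_t); /* candidate under test, e.g. a prospector output */

        double estimate_bias(void)
        {
            enum { N = 1 << 18 };
            long flips[32][32] = {{0}};           /* [input bit][output bit] */
            uint64_t s = 0x9e3779b97f4a7c15ULL;
            for (long n = 0; n < N; n++) {
                s = s * 0x5851f42d4c957f2dULL + 1; /* cheap LCG for sample inputs */
                uint32_t x = (uint32_t)(s >> 32);
                uint32_t h = hash(x);
                for (int i = 0; i < 32; i++) {
                    /* which output bits flip when input bit i flips? */
                    uint32_t d = h ^ hash(x ^ (UINT32_C(1) << i));
                    for (int j = 0; j < 32; j++)
                        flips[i][j] += (d >> j) & 1;
                }
            }
            double bias = 0.0; /* mean absolute deviation from the ideal 1/2 */
            for (int i = 0; i < 32; i++)
                for (int j = 0; j < 32; j++) {
                    double p = flips[i][j] / (double)N;
                    bias += p > 0.5 ? p - 0.5 : 0.5 - p;
                }
            return bias / (32.0 * 32.0);
        }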

      So it's useful if you want a hash function, and don't think any of the existing ones are good enough. Or if you're researching hash functions, and want new ideas for structures.

      Cool? Well, generating code is cool. Doing it randomly is a first step towards genetic programming, which would be even cooler. ... and making computers burn CPU cycles computing (mostly unused) hashes seems to be something we humans have wanted to do for the last 15 years or so.

      • eru 13 days ago
        > [...] making computers burn CPU cycles computing (mostly unused) hashes seems to be something we humans want to do since around 15 years ago.

        Keep in mind that there's a big difference between cryptographic hash functions and the kind of hash functions investigated here.

        • baq 13 days ago
          Yeah notably examples are given with their inverse functions. I’m pretty sure you don’t want that in a crypto hash ;)
          • keepamovin 13 days ago
            Haha! :) No, not overall, but you do want reversible steps in a crypto hash. Reversible steps make better mixing functions: because you're not losing information, you avoid iteratively diminishing it, which would otherwise lead to a smaller state space and more state collisions.
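
            A toy 8-bit experiment (mine, not from any real hash) makes the shrinkage visible: a lossy step maps many states onto one, while a reversible step keeps all 256 states (an odd multiply is bijective mod 256).

              #include <stdint.h>
              #include <stdio.h>

              /* count distinct outputs over all 256 possible inputs */
              static int image_size(uint8_t (*f)(uint8_t))
              {
                  unsigned char seen[256] = {0};
                  int n = 0;
                  for (int x = 0; x < 256; x++) {
                      uint8_t y = f((uint8_t)x);
                      if (!seen[y]) { seen[y] = 1; n++; }
                  }
                  return n;
              }

              static uint8_t lossy(uint8_t x)      { return x & (x >> 1); }
              static uint8_t reversible(uint8_t x) { return (uint8_t)(x * 37u); }

              int main(void)
              {
                  printf("lossy step keeps      %3d/256 states\n", image_size(lossy));
                  printf("reversible step keeps %3d/256 states\n", image_size(reversible));
                  return 0;
              }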
    • AgentOrange1234 13 days ago
      These functions are essential to hash tables (other related names: hash maps, hash sets).

      A hash table is a remarkable data structure that can be used to implement many algorithms simply and efficiently.

      This efficiency depends on being able to create small (say 32- or 64-bit) and nearly-unique “hashes” for your data. For example, to hash user names, it wouldn’t work very well to just use the ASCII code for the first letter of the names, because then a lot of usernames would map to the same number. That’s called a collision. If there are a lot of collisions, hash tables become very inefficient.

      A better approach would be to grab bits from the entire username and smash them together somehow, so that “throwaway_1237” and “throwaway_12373” still map to different numbers.
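
      For instance, a classic way to do that smashing is FNV-1a (my example, not from the linked project, which is about integer hashes; the folding idea is the same):

        #include <stdint.h>

        /* FNV-1a: mix every byte of the string into a 32-bit state */
        uint32_t fnv1a(const char *s)
        {
            uint32_t h = 0x811c9dc5U;   /* offset basis */
            for (; *s; s++) {
                h ^= (uint8_t)*s;       /* bring in the next byte */
                h *= 0x01000193U;       /* spread it across the whole word */
            }
            return h;
        }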

      Hash functions do this mapping. The property of “avalanche” describes how thoroughly input changes get spread across the output bits, which is what keeps collisions rare.

      There’s generally a tradeoff between (1) how fast the actual hash function is, and (2) how good of a job it does of avoiding collisions.

      World-class hash functions look remarkably weird, multiplying and xoring and shifting by strange amounts. It’s very hard for humans to look at these cryptic functions and guess how well they will do.

      So this code is trying out a bunch of hash functions at random and baking them off against each other. This is cool because success here could improve real-world performance of a core data structure used across many languages and libraries.

    • masklinn 13 days ago
      Hash functions on integers, so if you want a fast integer hash for a set or map. If the functions diverge sufficiently, it also provides fast hashes for a Bloom filter.
    • alegeaa 13 days ago
      I am asking myself the same question.
  • infogulch 13 days ago
    I implemented the 1brc in Go a few weeks back [1], and this repo was the inspiration to try to find a custom perfect hash function that put each station into its own bucket with no collisions. Then I noticed the rule that you can't customize the hash function to the data before the program starts, and dropped the idea.

    I made a test harness to check random constants (start values, multiplication constants, shift/rotate amounts, etc) and print out the best constants found so far based on the number of colliding buckets and how many collisions. I think I got it down to 1 bucket with just two colliding values with a ~40% fill rate. Interestingly, I found that the best performing constants included similar values for the number of positions to shift regardless of the other constants, so I ended up hard coding them.

    [1]: https://github.com/infogulch/1brc-go
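
    For anyone curious what such a harness boils down to, here's a stripped-down C sketch of the same idea (names and constants are mine, not from the repo above): try random multiply/shift constants against a fixed key set, and keep the best found so far.

      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define NKEYS    400
      #define NBUCKETS 1024 /* power of two; ~40% fill for ~400 keys */

      /* count keys landing in an already-occupied bucket */
      static int collisions(const uint32_t *keys, uint32_t mul, int shift)
      {
          int used[NBUCKETS] = {0};
          int c = 0;
          for (int i = 0; i < NKEYS; i++) {
              uint32_t h = ((keys[i] * mul) >> shift) & (NBUCKETS - 1);
              c += used[h]++ ? 1 : 0;
          }
          return c;
      }

      int main(void)
      {
          uint32_t keys[NKEYS]; /* random stand-ins for the station-name keys */
          for (int i = 0; i < NKEYS; i++)
              keys[i] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();

          int best = NKEYS;
          for (long t = 0; t < 1000000; t++) {
              uint32_t mul = (((uint32_t)rand() << 16) ^ (uint32_t)rand()) | 1;
              int shift = rand() % 22; /* leave at least 10 bits after the shift */
              int c = collisions(keys, mul, shift);
              if (c < best) {
                  best = c;
                  printf("mul=0x%08x shift=%d collisions=%d\n", mul, shift, c);
              }
          }
          return 0;
      }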

  • rowanG077 13 days ago
    What would make this really interesting is if you could give it a generator for the input data yourself. Very often you don't have random binary data, but data structured in some way. The structure might imply you can get a very nice hash function for it.
  • o11c 13 days ago
    Limiting itself to reversible operations gives some mathematical niceties, but also locks out a lot of things.

    When I did something similar, I was thinking about perfect hashing, where the set of inputs is known ahead of time. The usual approach uses a constant array, but I wanted to see - in particular, if the inputs are already small integers - whether I could compact them further (obviously this is possible: hash -= hash >> gap_index). I sat down and wrote out a list of ... probably 100? ... primitive operations (some of which are admittedly redundant with each other, but which are useful to think of separately). And then I got bored and never did anything with the project.
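
    To make that compaction primitive concrete, here's one instance as a toy of mine, under the assumption that bit 1 is never set in any key: subtracting a shifted copy squeezes the unused bit out and packs the keys into a dense range.

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
          /* keys never have bit 1 set: 0, 1, 4, 5, 8, 9, 12, 13, ... */
          for (uint32_t k = 0; k < 16; k++) {
              if (k & 2) continue;
              uint32_t h = k - (k >> 1);    /* squeeze out the hole */
              printf("%2u -> %u\n", k, h);  /* lands 1:1 on 0, 1, 2, 3, ... */
          }
          return 0;
      }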

    • mshroyer 13 days ago
      > Limiting itself to reversible operations gives some mathematical niceties

      What are those niceties, why is a reversible operation desirable in this context?

      • tyingq 13 days ago
        "One of the important properties of an integer hash function is that it maps its inputs to outputs 1:1. In other words, there are no collisions. If there’s a collision, then some outputs aren’t possible, and the function isn’t making efficient use of its entropy.

        This is actually a lot easier than it sounds. As long as every n-bit integer operation used in the hash function is reversible, then the hash function has this property. An operation is reversible if, given its output, you can unambiguously compute its input."

        https://nullprogram.com/blog/2018/07/31/
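
        To illustrate with the triple32 code quoted upthread (my sketch, not code from the post), both primitives have cheap inverses, so the whole pipeline can be run backwards:

          #include <stdint.h>

          /* invert y = x ^ (x >> k): each pass recovers the next k bits
             below the already-correct top k bits */
          static uint32_t unxorshift(uint32_t y, int k)
          {
              uint32_t x = y;
              for (int i = 0; i < 32 / k; i++)
                  x = y ^ (x >> k);
              return x;
          }

          /* multiplicative inverse of an odd c modulo 2^32; Newton's
             iteration doubles the number of correct low bits per step */
          static uint32_t modinv32(uint32_t c)
          {
              uint32_t inv = c; /* already correct mod 8 for odd c */
              for (int i = 0; i < 4; i++)
                  inv *= 2 - c * inv;
              return inv;
          }

        Undoing triple32 is then unxorshift(h, 14), multiply by modinv32(0x31848babU), unxorshift by 15, multiply by modinv32(0xac4c1b51U), unxorshift by 11, multiply by modinv32(0xed5ad4bbU), and a final unxorshift by 17.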

    • pyinstallwoes 13 days ago
      XOR xor xor reversibility through cyclical permutations?
  • grok22 13 days ago
    It's unclear to me what this is doing exactly -- is it finding the best ever? If not, why would the best change every time you run this?

    Also, does anyone have pointers to a hash function discovery mechanism that discovers a good hash function for integer values within a specific range (i.e I know that my integer values are between, say, 10,000 and 200,000, for example) and I want to hash them into some optimal number of hash buckets?

    • zamadatix 13 days ago
      It's randomly trying values to find the best of the ones it has tried that run. The values would change between runs because you can't feasibly exhaust the search space in a single run to find the absolute best (+the order you try is random).

      If you're just looking for "good", the best approach is almost always to just use a normal hash function. If your numbers are extremely large and the range extremely small, you can offset so the minimum becomes 0 again and use a smaller/faster hash. If you're more looking for a "perfect choice for the exact range", I think the closest you'll get is something like this randomized approach, but with the tests modified to run on your interval.
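
      For the 10,000..200,000 example above, the offsetting trick could look like this sketch (it reuses the triple32 function quoted upthread; the multiply-shift at the end is a common division-free way to map a 32-bit hash onto n buckets):

        #include <stdint.h>

        uint32_t triple32(uint32_t x); /* quoted upthread */

        uint32_t bucket(uint32_t key, uint32_t nbuckets)
        {
            uint32_t h = triple32(key - 10000); /* rebase the range to start at 0 */
            return (uint32_t)(((uint64_t)h * nbuckets) >> 32);
        }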

  • thomasmg 13 days ago
    I wonder if using the same constant for both multiplications would reduce code size and so could (slightly) speed up computation?

    I have updated my StackOverflow answer at https://stackoverflow.com/questions/664014/what-integer-hash...
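
    Concretely, the idea would be something like the sketch below (the constant is illustrative, not a tuned value; whether the bias stays acceptable with a shared constant is exactly what the prospector would have to re-measure):

      #include <stdint.h>

      uint32_t hash_shared_const(uint32_t x)
      {
          const uint32_t c = 0x7feb352dU; /* illustrative odd constant */
          x ^= x >> 16;
          x *= c;       /* same constant twice: one load, smaller code */
          x ^= x >> 15;
          x *= c;
          x ^= x >> 16;
          return x;
      }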

  • renonce 13 days ago
    I know I’m comparing oranges to apples here, as these functions are not well suited for cryptographic operations, but how does the measured “bias” affect cryptanalysis? Can someone familiar with differential cryptanalysis explain whether a hash function with a lower bias defeats cryptanalysis with fewer rounds or with less compute? Will this thing help find better cryptographic hash functions?
    • pbsd 13 days ago
      This bias testing is essentially ruling out candidates with very high probability truncated differentials of Hamming weight 1. In a cryptographic primitive you want to rule out _all_ high-probability differentials, which requires different methods. You also want to rule out other high-probability statistical distinguishing properties, such as linear approximations, higher-order differentials, etc.

      That being said, this sort of quick-and-dirty testing is useful to filter out obviously bad candidates at the early design phase of, say, an ARX primitive.

    • eru 13 days ago
      > Will this thing help find better cryptographic hash functions?

      Very, very unlikely.

  • rurban 13 days ago
    I also have something similar here, with a slower generator (interpreted, not jitted), but better quality functions: https://github.com/rurban/fast-hash/tree/master/hashgen

    But I found nothing better than the old favorites.

  • adgjlsfhk1 13 days ago
    this is a very cool library, but it needs someone to figure out a good way of validating 64 bit hash functions.
  • Retr0id 13 days ago
    > Three round functions: [...] this hash function is indistinguishable from a perfect PRF

    I assume this means "indistinguishable through naive statistical tests", because any more literal meaning would surprise me

    • evujumenuk 13 days ago
      Maybe I'm misinterpreting something, but… isn't this really about https://en.wikipedia.org/wiki/Perfect_hash_function? "Perfect" doesn't represent a statement of quality here, if that's what you're alluding to.
      • Retr0id 13 days ago
        No, I mean in the sense given on the page (I should've included the parenthetical in my quote)

        > (e.g. a random permutation of all 32-bit integers)

        Since they only use bijective primitives, it's trivially "perfect" in the "perfect hash function" sense.

        To elaborate further on the definition I was going off (which may well not be what the author meant):

        > A PRF is considered to be good if its behavior is indistinguishable from a truly random function. Therefore, given an output from either the truly random function or a PRF, there should be no efficient method to correctly determine whether the output was produced by the truly random function or the PRF.

        https://en.wikipedia.org/wiki/Pseudorandom_function_family
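
        In the strict sense, a keyless public permutation can't meet that definition, since a distinguisher can simply evaluate the function itself (trivial sketch of mine):

          #include <stdint.h>

          uint32_t triple32(uint32_t x); /* quoted upthread */

          /* returns 1 for the real function, almost never for a random one */
          int distinguish(uint32_t (*oracle)(uint32_t))
          {
              return oracle(0) == triple32(0) && oracle(1) == triple32(1);
          }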

  • dataangel 12 days ago
    Wouldn't these be slow, since all the operations are dependent? E.g. on x86 I'd use one of the two hash functions that have built-in instruction support.
  • pcranaway 13 days ago
    Is this something like a superoptimizer? Looks cool!
  • azubinski 13 days ago
    OMG, is it finally something useful, not in Rust for RISC-V?

    I saw "Skeeto is a legend" at the top of the comments; that is the best compliment.

    My sincere gratitude to the author.