Analyzing multi-gigabyte JSON files locally


215 points | by bubblehack3r 13 days ago


  • jmmv 13 days ago
    Some random comments:

    * A few GBs of data isn't really that much. Even /considering/ the use of cloud services just for this sounds crazy to me... but I'm sure there are people out there that believe it's the only way to do this (not the author, fortunately).

    * "You might find out that the data doesn’t fit into RAM (which it well might, JSON is a human-readable format after all)" -- if I'm reading this right, the author is saying that the parsed data takes _more_ space than the JSON version? JSON is a text format and interning it into proper data structures is likely going to take _less_ space, not more.

    * "When you’re ~trial-and-error~iteratively building jq commands as I do, you’ll quickly grow tired of having to wait about a minute for your command to succeed" -- well, change your workflow then. When tackling new queries, it's usually a good idea to reduce the data set. Operate on a few records until you have the right query so that you can iterate as fast as possible. Only once you are confident with the query, run it on the full data.

    * Importing the data into a SQLite database may be better overall for exploration. Again, JSON is slow to operate on because it's text. Pay the cost of parsing only once.
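
    A minimal sketch of that "pay the parsing cost only once" idea, assuming line-delimited records; the table layout and field names here are invented for illustration:

```python
import json
import sqlite3

def import_jsonl(jsonl_path, db_path):
    """Parse each JSON record once and keep selected fields in SQLite,
    so later exploration queries never touch the raw text again."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER, name TEXT)")
    with open(jsonl_path) as f:
        rows = (json.loads(line) for line in f)
        db.executemany(
            "INSERT INTO records VALUES (?, ?)",
            ((r.get("id"), r.get("name")) for r in rows),
        )
    db.commit()
    return db
```

    From there, exploration is ordinary SQL (SELECT, GROUP BY, indexes) instead of re-parsing gigabytes of text per query.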

    * Or write a custom little program that streams data from the JSON file without buffering it all in memory. JSON parsing libraries are plentiful so this should not take a lot of code in your favorite language.
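
    For instance, Python's stdlib `json.JSONDecoder.raw_decode` parses one value and reports where it stopped, which is enough to stream concatenated top-level records with a bounded buffer. A sketch (note one caveat: a bare top-level number could be split across chunk boundaries, so it assumes records are objects, arrays, or strings):

```python
import json

def iter_json_values(f, chunk_size=1 << 16):
    """Yield top-level JSON values from a file of concatenated records,
    keeping only a bounded amount of raw text in memory at a time."""
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk and not buf.strip():
            return  # clean EOF
        buf += chunk
        while True:
            buf = buf.lstrip()
            try:
                value, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # need more data (or trailing garbage at EOF)
            yield value
            buf = buf[end:]
        if not chunk:
            return  # EOF with an unparseable remainder
```

    A record larger than one chunk just accumulates in the buffer until it parses, so peak memory tracks the largest single record, not the whole file.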

    • the_duke 13 days ago
      > JSON is a text format and interning it into proper data structures is likely going to take _less_ space, not more.

      That depends a lot on the language and the json library.

      Let's take `{"ab": 22}` as an example.

      That's 10 bytes.

      In a language like Rust and using the serde library, this could be deserialized directly into a struct with one integer, let's pick a u32. So that would only be four bytes.

      But if it was deserialized to serde's dynamic Value type, this would be a HashMap<String, u32>, which has a constant size of 48 bytes, plus an allocation of I don't know how much (the first allocation will cover more than one entry), plus 16 bytes of overhead for the string, plus 2 bytes for the actual string contents, plus the 4 bytes for the u32. So that's already ~90 bytes, a lot more than the JSON.

      Dynamic languages like Python also have a lot of overhead for all the objects.

      Keys can of course be interned, but not that many default JSON parser libraries do that afaik.
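
      As an illustration in Python, `json.loads` accepts an `object_pairs_hook`, so you can intern keys yourself; repeated keys across separately parsed records then share one string object (a sketch of the technique, not default parser behavior):

```python
import json
import sys

def loads_interned(text):
    """Parse JSON, interning object keys so that identical keys
    parsed from different records share a single string object."""
    return json.loads(
        text,
        object_pairs_hook=lambda pairs: {sys.intern(k): v for k, v in pairs},
    )
```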

      • btown 13 days ago
        I feel like many of the same principles from V8's hidden classes (V8 being the engine for Chrome/Node.js) would apply here. It's not just that you're interning keys; you're actually creating a set of optimized data classes that are transitioned to when you encounter a key that wasn't seen before. In such a mode, you're not even storing the keys, just a pointer to the type information tree. Only when you have a sub-object whose keys are truly unpredictable do you go into "dictionary mode (aka 'slow mode')" - which I think says it all!

      • DonHopkins 13 days ago
        The key names of all the identically shaped objects are repeated and can all be interned into sharing the same strings, saving a huge amount of memory. That's why csv files can be much more compact than the equivalent json files: they only list the keys once in the first row.
        • coldtea 13 days ago
          The size of the keys isn't even the problem.

          The numeric value 1 is 1 byte, but it takes several bytes to be wrapped into a native numeric object instance in, say, Python. In C, it would be 8 bytes to get it into an int64, unless the parser does something fancy (and even if there are only values with 1 and 2 digits, it would at least need a 2-byte int holder).

          The string "x" is one byte, but it'll take two bytes in C ("x\0").

          And so on...

          • jiggawatts 13 days ago
            You're all massively underestimating the memory requirements of a parsed data structure, which is very entertaining to watch. There's this rule-of-thumb that most programmers are wrong about performance costs by at least a couple of orders of magnitude, and it's on full display here.

            In most languages, dynamically allocated variable-sized objects have a minimum overhead of two pointer-sized fields: a pointer and a length. On a 64-bit platform, that's 16 bytes before you actually start having any data.

            Next, depending on the allocator used, the requested data size is probably rounded up, either to the next power of 2 or to a minimum allocation size of one pointer. That's the safe, compatible approach, and it also has some performance advantages on most platforms.

            Last but not least, most heap-based allocators have some "bookkeeping" overheads. Similarly, interpreted languages also have their internal "object" metadata overheads. Typically this is 1 or 2 extra pointers (+8 or +16 bytes).

            Assuming 24-48 bytes for all variable-length values is actually a pretty safe bet!

            There are exceptions:

            Unusually, C skips the 'length' value for strings by using null-termination, so C strings are often just 16 bytes (8 for the pointer, and 8 for the allocated object on the heap, assuming some sort of small-object optimisation is going on).

            The approach in C++ is to use "small string optimisation" where the string values are inlined into the std::string structure itself. This works up to 23 bytes packed into the 24-byte string structure. There's an awesome CppCon presentation on how Andrei Alexandrescu did this optimisation at Facebook:

            Interpreted or "VM" languages like JavaScript, C# or Java are far worse than this. For one, they convert UTF8 to UTF16, doubling the bytes required per character for typical "ASCII" identifiers. JavaScript converts integers into 64-bit floats. Java has weird overheads for all objects. Etc...

            Update: I just did an experiment with .NET 6

                Allocating ~1 billion characters as 100M strings
                    800,000,056 bytes (0.7 GB) for holding the strings.
                 14,021,088,768 bytes (13.1 GB) for the strings themselves
                 14,821,216,680 bytes (13.8 GB) total
            Unsurprisingly, simply "referencing" (holding on to) the strings needs an 8-byte pointer per string. The actual strings hold 1-20 characters randomly, but require 140 bytes in memory on average. This bloats out the original 1 GB to just under 14 GB in memory.
            • fegu 13 days ago
              I have been hit by this many times. Do you know of any good resources on how to reduce this overhead? (Choosing other data structures, choosing other string types etc)
              • jiggawatts 13 days ago
                There's a whole range of techniques! Many of them are language dependent, and won't be applicable elsewhere.

                A blunt instrument is running your code in 32-bit mode, which halves pointer sizes. The downside is it also limits your maximum data size, and blocks the use of the 64-bit instruction sets that generally speed up data processing.

                Java has an interesting hybrid mode where it uses 32-bit pointers but runs in 64-bit mode.

                Not storing the entire decoded document all at once is usually the recommended approach, via "streaming" parsers that give you one element at a time. This then lets you immediately throw away data that you're done with processing.

                The downside is that it makes certain common idioms impossible or difficult. For example, MVC web architectures generally assume that the controller produces a "complete" model object that then gets passed to the view.

                Unfortunately, language-level support for efficient data processing is generally lacking. Rust and C++ are okay, but have annoying gaps in their capabilities.

                A common trick with something like parsing is to keep the original encoded string to be decoded "as-is", and then simply reference into it using integer indexes. I.e.: don't copy the strings out individually, instead just use "slices" into the original document string.
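
                A tiny Python illustration of that trick: keep the document in memory and store (start, end) offsets instead of copying substrings out (the names here are invented for the sketch):

```python
class Span:
    """A lazy reference into the original document text."""
    __slots__ = ("doc", "start", "end")  # avoid per-instance dict overhead

    def __init__(self, doc, start, end):
        self.doc, self.start, self.end = doc, start, end

    def materialize(self):
        # Only copy the bytes out when the value is actually needed.
        return self.doc[self.start:self.end]

doc = '{"name": "alice", "city": "berlin"}'
name = Span(doc, 10, 15)  # offsets of the value "alice", minus the quotes
```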

                Most commonly, unique strings are detected using a hash table and not stored separately. This works better with garbage-collecting languages like C#, Java, and JavaScript. With Rust or C++ you have to use reference counting or other tricks.

          • antonvs 13 days ago
            In the text representation of JSON, the string “x” is three characters. Short of using a statically-typed length, C’s system is about as efficient as it gets in terms of space overhead.
            • coldtea 12 days ago
              You'd also need to add the overhead of memory representation as jiggawatts says, so you'd need to start from 8 bytes just for the pointer alone (more than double that 3-byte textual representation)! And it's not gonna allocate just one byte (unless you do something fancy with your allocator), but 8 bytes for the string, 2 of which would be the "x" and the null terminator. So 3 bytes becomes 16 for a single 1-byte value.

              And that's in C, where the length calculations and the type management is up to you! It gets worse soon...

              • antonvs 12 days ago
                There’s a weird sort of apples to oranges comparison going on here. Having pointers to separately allocated strings gives you much more functionality than an unindexed blob of JSON text. That obviously has a cost in terms of memory.

                But if you need to, you can represent the same data as a contiguous block of C-style formatted structures, in which case the “x” wouldn’t need more than two or three bytes depending on the representation you choose. In fact, there are formats designed for this exact purpose. The idea that JSON is somehow surprisingly efficient for what it does is just silly.

        • mindslight 13 days ago
          "ab" is 4 bytes, while a 64 bit pointer to an interned string is 8 bytes. It would seem that the savings of CSV would be better analogized to static typing - the type is defined once, then each record only contains data.

          I had the same intuition as the original comment. But no, the relative sizes of data formats aren't that straightforward. One could intern symbols to say 16 bit values, or one could infer structural types and compress the data that way. But those are both creating additional assumptions and processing that likely aren't done by commonly available tools.

      • rootw0rm 13 days ago
        i'm using bincode to serialize btrees and the files are smaller than i expected. it's pretty cool. must be storing my u32s as u16s or u8s sometimes.
        • fabrice_d 13 days ago
          Yes, bincode supports the varint zigzag encoding which helps with small enough values.
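
          A minimal sketch of the idea (bincode's actual wire format may differ in details): zigzag maps small-magnitude signed values to small unsigned ones, and a varint then spends bytes proportional to magnitude rather than a fixed 8:

```python
def zigzag(n):
    """Interleave signed values so small magnitudes stay small:
    0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (n << 1) ^ (n >> 63)

def encode_varint(n):
    """Emit 7 payload bits per byte; the high bit means 'more follows'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)
```
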
      • closeparen 13 days ago
        I would read "proper data structure" charitably, as a struct or record type, maybe a tuple. And if it must be a hashmap, then at least using symbols and not strings for keys.
      • pkaye 13 days ago
        How about something like a Go interface which is essentially two pointers and the data itself.
        • Groxx 13 days ago
          That's the same thing they're describing, and equivalent to what Go does for map[string]int64 (plus or minus minor details about how strings are optimized).

          The map has its two slices (pointer + length + capacity = 8*3 times two slices) and the string needs a separate allocation somewhere because it's essentially a pointer to a slice of bytes. All of which is true for almost all reasonably efficiency-focused languages, Rust and Go included - it's just how you make a compact hashmap.

          • Yoric 13 days ago
            Does Go perform interning of strings?
            • Groxx 13 days ago
              Compile-time strings, I think so? In that all identical strings simply point to the same data. Otherwise no.
              • xh-dude 13 days ago
                Use of Go’s runtime reflection to produce a struct definition would result in one instance of the key string, many instances of the numeric value

                This approach is available/implemented in the standard library. The whole story with JSON in Go includes a few different approaches, but this is definitely anticipated.

    • ot 13 days ago
      > Or write a custom little program that streams data from the JSON file without buffering it all in memory. JSON parsing libraries are plentiful so this should not take a lot of code in your favorite language.

      Several years ago I wrote a paper [1] on representing the parse tree of a JSON document in a tiny fraction of the JSON size itself, using succinct data structures. The representation could be built with a single pass of the JSON, and basically constant additional memory.

      The idea was to pre-process the JSON and then save the parse tree, so it could be kept in memory over several passes of the JSON data (which may not fit in memory), avoiding re-doing the parsing work on each pass.

      I don't think I've seen this idea used anywhere, but I still wonder if it could have applications :)


    • bastawhiz 13 days ago
      > JSON is a text format and interning it into proper data structures is likely going to take _less_ space, not more.

      If you're parsing to structs, yes. Otherwise, no. Each object key is going to be a short string, which is going to have some amount of overhead. You're probably storing the objects as hash tables, which will necessarily be larger than the two bytes needed to represent them as text (and probably far more than you expect, so they have enough free space for there to be sufficiently few hash collisions).

      JSON numbers are also 64-bit floats, which will almost universally take up more bytes per number than their serialized format for most JSON data.

      • matja 13 days ago
        > JSON numbers are also 64-bit floats

        In common implementations they are, but RFC 8259 and ECMA-404 do not specify the range, precision, or underlying implementation for the storage of numbers in JSON.

        An implementation that guarantees interoperability between all implementations of JSON would use an arbitrary-sized number format, but they seldom do.

        No idea what ISO/IEC 21778:2017 says because it's not free.
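
        For what it's worth, some parsers do let you opt into lossless numbers: Python's json accepts a parse_float hook, and its ints are already arbitrary precision, so an exact read is one argument away (a sketch):

```python
import json
from decimal import Decimal

def loads_lossless(text):
    """Parse JSON keeping exact numeric values: ints stay exact by
    default in Python, and floats are read as Decimals, not binary64."""
    return json.loads(text, parse_float=Decimal)
```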

        • bastawhiz 13 days ago
          I mean, I suppose you could use a numeric format that takes more than 64 bits, but you certainly couldn't use one that uses fewer. You have no way to know whether a number should be an int or a float. And since JavaScript (the JS in JSON) encodes with 64 bits, you'd be losing precision if you choose to use any less. The point is that an integer written as one byte in the JSON will take up more space (at least seven bytes more) when decoded, all the way up to numbers that take seven characters.
          • gpderetta 13 days ago
            (Subsets of) ASN.1, LEB128, and of course UTF-8 are all examples of encoding binary integers with variable length.
            • bastawhiz 13 days ago
              Sure, and JSON is as well. 0 takes three fewer bytes to encode than 1234. But if I parse JSON and my numbers come back as a type backed by UTF-8, I'm going to be left confused and angry. Nobody expects their JSON parser to give them back numeric values that their CPU can't natively perform operations on.
              • gpderetta 13 days ago
                After you have spent some time confused and angry you can then read the documentation of whatever library you are using and verify whether it overloads numerical operations for its special number type or it has a function to convert them to whatever numerical type you need.

                In any case you wouldn't of course use UTF-8 to represent JSON numbers, as it can only encode 21-bit integers.

      • vlovich123 13 days ago
        I think even structs have this problem because typically you heap allocate all the structs/arrays. You could try to arena allocate contiguous objects in place, but that sounds hard enough that I doubt that anyone bothers. Using a SAX parser is almost certainly the tool you want to use.
    • coldtea 13 days ago
      >JSON is a text format and interning it into proper data structures is likely going to take _less_ space, not more.

      Wanna bet?

        >>> import sys
        >>> import json
        >>> data_as_json_string = '{"a": 5000, "b": 1000}'
        >>> len(data_as_json_string)
        22
        >>> data_as_native_structure = json.loads(data_as_json_string)
        >>> sys.getsizeof(data_as_native_structure)
        232
      That's not even the whole story, as what's 232 bytes is not the contents of the dict, but just the Python object with the dict metadata. So the total for the struct inside is much bigger than 232 bytes.

      A single int wrapped as a Python object can be quite a lot by itself:

        >>> sys.getsizeof(1)
        28

      A binary int64 would be 8 bytes for comparison.

      • hsn915 13 days ago
        I'm pretty sure the GP intended the statement to be for statically typed languages with a struct-like type.
      • killingtime74 13 days ago
        Python is my favorite language but you literally chose one of the least efficient. This would probably be different in Java or Rust.
        • bufferoverflow 13 days ago
          I thought Java stores strings as 16-bit character arrays internally. Which means it's very inefficient when it comes to memory consumption.
          • samus 13 days ago
            Since version 9, Java uses ISO-8859-1 (Latin-1) for strings when possible. Most raw JSON should therefore take up very little space.
      • berkle4455 13 days ago
        Maybe 2023 is the year people stop using python.
        • IshKebab 12 days ago
          We can dream, but I seriously doubt it. Not only are there very few nice alternatives (Dart is great but unpopular; Typescript is great but has a terrible standard library, etc.), but very few of them have a decent REPL, which I think is probably Python's most underrated feature.

          I mean... its REPL is extremely basic, but anything you can do in Python you can do in the REPL. Very useful, especially for beginners.

    • ZephyrBlu 13 days ago
      Have you done much data analysis?

      In my experience adding more steps to your pipeline (E.g. database, deserializing, etc) is a pain when you are figuring things out because nothing has solidified yet, so you're literally just adding overhead that requires even more work to remove/alter later on. If you're not careful you end up with something unmaintainable extremely quickly.

      Only analyzing a subset of your data is usually not a magic bullet either. Unless your data is extremely well cleaned and standardized you're probably going to run into edge cases on the full dataset that were not in your subset.

      Being able to run your full pipeline on the entire dataset in a short period of time is very useful for testing on the full dataset and seeing realistic analysis results. If you're doing any sort of aggregate analysis it becomes even more important, if not required.

      I now believe a relatively fast clean run is one of the most important things for performing data analysis. It increases your velocity tremendously.

    • saidinesh5 13 days ago
      > * "You might find out that the data doesn’t fit into RAM (which it well might, JSON is a human-readable format after all)" -- if I'm reading this right, the author is saying that the parsed data takes _more_ space than the JSON version? JSON is a text format and interning it into proper data structures is likely going to take _less_ space, not more.

        Not to mention, even when using bad data structures (eg. a hashmap of hashmaps), one can just add a large enough swapfile and brute-force their way through it, no?

      • samus 13 days ago
        This will fail if the algorithm is not O(n). Even if only a bounded number of passes is required, there will be trouble.
    • nerdponx 13 days ago
      If the JSON data has a regular structure, you probably want a database and/or a "data frame" library and Parquet as the storage file format. SQLite, DuckDB, Polars, plenty of options nowadays that are usable from several different programming languages.
    • zigzag312 13 days ago
      > if I'm reading this right, the author is saying that the parsed data takes _more_ space than the JSON version? JSON is a text format and interning it into proper data structures is likely going to take _less_ space, not more.

      UTF-8 JSON strings will get converted to UTF-16 strings in some languages, doubling the size of strings in memory compared to the size on disk.

    • walrus01 13 days ago
      > A few GBs of data isn't really that much.

      The entire FCC radio license database (the ULS) is about 14GB in text CSV format and can be imported into a sqlite or sql db and easily queried in RAM on a local workstation...

    • taeric 13 days ago
      It still surprises me how many have the intuition that loading the data will take more space than the file.

      Even more annoying when it is. (Compressed or binary formats notwithstanding.)

      • vlovich123 13 days ago
        Isn’t that almost always true, unless the format is specifically designed to take the same amount of space on disk and in RAM? If that’s the case, I think it typically means the file on disk is larger than it needs to be. That’s because you have to convert the file into some in-memory representation that’s easier to manipulate, and that structure isn’t necessary at rest.

        Can you give an example where the file on disk is going to be larger than what it is in memory? Provided you’re not just reading it and working with it as an opaque binary blob.

        • taeric 13 days ago
          Depends on the data, obviously; but record-format data should be compact in memory in ways that text isn't. My favorite example is the megs of commas and quotes that just aren't needed in memory.

          So, my examples are largely csv files. JSON and xml are the same, in many ways.

          The big curve ball will be text heavy data. But all too often textual data is categorical, such that even that should be smaller in memory.

          In large part, this is why parquet files are a ridiculous win in space.

          • coldtea 13 days ago
            > My favorite example is the megs of commas and quotes that just aren't needed in memory.

            Depending on the language and internal representation, for every comma and quote not needed (which for a JSON document with, say, an object without nesting, is just 5 bytes per entry), you might get your strings doubled in size (because e.g. the language converted your strings, which fit in 8-bit UTF-8, to its 16-bit native string format), or have some huge constant boxing overhead (in, say, Python), and several other fun things besides...

            • taeric 13 days ago
              Right, is why I said strings can be a curve ball. But most of the data I see can avoid that fairly easily.
        • taeric 11 days ago
          Coming back to this, I was curious and decided to challenge my assertion some. is a few examples where I was fishing around. I didn't find any where my assertion is not true.

          Curious if you have any easy examples to look at that do show the opposite.

        • hnlmorg 13 days ago
          XML would probably fall into that category.

          But I do agree with your point more generally speaking.

        • nice2meetu 13 days ago
          I think it comes more from the thought that human text readable format is usually more verbose than binary representations.
    • vlovich123 13 days ago
      > JSON is a text format and interning it into proper data structures is likely going to take _less_ space, not more.

      ~~Yeah but parsing it can require ~2x the RAM available and push you into swap / make it not possible.~~

      > Or write a custom little program that streams data from the JSON file without buffering it all in memory. JSON parsing libraries are plentiful so this should not take a lot of code in your favorite language.

      What is the state of SAX JSON parsing? I used yajl a long time ago but not sure if that’s still the state of the art (and it’s C interface was not the easiest to work with).

      EDIT: Actually, I think the reason is that you typically will have pointers (8 bytes) in place of 2 byte demarcations in the text version (eg “”, {}, [] become pointers). It’s very hard to avoid that (maybe impossible? Not sure) and no surprise that Python has a problem with this.

      • saidinesh5 13 days ago
        There's libraries like ijson. At my last job, i wrote a wrapper around this C++ library called rapidjson to parse streaming json. It worked quite well.

        Here's the write up about it: And here's the code:

        The API isn't the best. I'd have preferred an iterator-based solution as opposed to this callback-based one. But we worked with what rapidjson gave us for the proof of concept. The reason for this specific implementation was that we wanted to build an index to query the server directly about its huge json files (compressed size of 20+GB per file) using http range queries.

    • fdgddggfddfg 13 days ago
      so on this note, unless you have massive json lists... having a json object per line (jsonl) can really help this process... I use jq when necessary .. but I usually use `cat file.json | grep foo | grep bar | jq -r .thing` since it's a string, use string comparison tools rather than jq :)
  • ddulaney 13 days ago
    I really like using line-delimited JSON [0] for stuff like this. If you're looking at a multi-GB JSON file, it's often made of a large number of individual objects (e.g. semi-structured JSON log data or transaction records).

    If you can get to a point where each line is a reasonably-sized JSON file, a lot of things gets way easier. jq will be streaming by default. You can use traditional Unixy tools (grep, sed, etc.) in the normal way because it's just lines of text. And you can jump to any point in the file, skip forward to the next line boundary, and know that you're not in the middle of a record.

    The company I work for added line-delimited JSON output to lots of our internal tools, and working with anything else feels painful now. It scales up really well -- I've been able to do things like process full days of OPRA reporting data in a bash script.
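
    A sketch of why this composes so well: each line is independent, so you can prefilter with cheap string matching and only pay the JSON parse for surviving lines (the field names here are invented):

```python
import json

def sum_field(lines, needle, field):
    """grep-style prefilter on the raw line, full JSON parse only on hits."""
    total = 0
    for line in lines:
        if needle in line:             # cheap string test, like grep
            record = json.loads(line)  # parse only the survivors
            total += record.get(field, 0)
    return total
```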


    • klabb3 13 days ago
      +1. While yes, you can have a giant json object, and you can hack your way around the obvious memory issues, it’s still a bad idea, imo. Even if you solve it for one use case in one language, you’ll have a bad time as soon as you use different tooling. JSON really is a universal message format, which is useful precisely because it’s so interoperable. And it’s only interoperable as long as messages are reasonably sized.

      The only thing I miss from json lines is allowing a type specifier, so you can mix different types of messages. It’s not at all impossible to work around with wrapping or just roll a custom format, but still, it would be great to have a little bit of metadata for those use cases.

      • ddulaney 13 days ago
        An out-of-band type specifier would be cool, though you still have to know the implicit schema implied by each type.

        In the system I work with, we standardized on objects that have a "type" key at the top level that contains a string identifying the type. Of course, that only works because we have lots of different tools that all output the same 30 or so data types. It definitely wouldn't scale to interoperability in general. But that's also one of the great things about JSON: it's flexible enough that you can work out a system that works at your scale, no more and no less.

    • wwader 12 days ago
      Confusingly, jq also has a streaming mode that streams JSON values as [<path>,<value>] pairs. This can also be combined with null input, and enables one to reduce, foreach, etc. in a memory-efficient way, e.g. to sum all .a in an array without loading the whole array into memory:

          $ echo '[{"a":1},{"b":2},{"a":3}]' | jq -n --stream 'reduce (inputs | select(.[0][1:] == ["a"])[1]) as $v (0; .+$v)'
    • stonecolddevin 13 days ago
      Isn't this pretty much what JSON streaming does?
      • ddulaney 13 days ago
        Yep, it’s a subset of JSON streaming (using Wikipedia’s definition [0], it’s the second major heading on that page). I like it because it preserves existing Unix tools like grep, but the other methods of streaming JSON have their own advantages.


  • jahewson 13 days ago
    I had to parse a database backup from Firebase, which was, remarkably, a 300GB JSON file. The database is a tree rooted at a single object, which means that any tool that attempts to stream individual objects always wanted to buffer this single 300GB root object. It wasn’t enough to strip off the root either, as the really big records were arrays a couple of levels down, with a few different formats depending on the schema. For added fun our data included some JSON serialised inside strings too.

    This was a few years ago and I threw every tool and language I could at it, but they were either far too slow or buffered records larger than memory, even the fancy C++ SIMD parsers did this. I eventually got something working in Go and it was impressively fast and ran on my MacBook, but we never ended up using it as another engineer just wrote a script that read the entire database from the Firebase API record-by-record throttled over several days, lol.

    • wruza 13 days ago
      The decision they made to store potentially multigigabyte-sized backups into a single json is just idiotic, to begin with.
    • taspeotis 13 days ago
      .NET has this built in with Utf8JsonReader [1].

      > Utf8JsonReader is a high-performance, low allocation, forward-only reader for UTF-8 encoded JSON text, read from a ReadOnlySpan<byte> or ReadOnlySequence<byte>

      Although it's a bit cumbersome to use with a stream [2].



    • simonw 13 days ago
      I've used ijson in Python for this kind of thing in the past, it's pretty effective:
    • Liron 13 days ago
      I downloaded a huge Firebase backup looking for a particular record.

      I ended up using the “split” shell command to get a bunch of 1gb files, then grepping for which file had the record I was looking for, then using my own custom script to scan outward from the position of matched text until it detected a valid parsable JSON object within the larger unparseable file, and return that.
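
      A hypothetical reconstruction of that scan-outward step in Python (the actual script's heuristics likely differ): from the match offset, walk back to each '{' and attempt a parse until a complete object enclosing the match comes out:

```python
import json

def recover_object(text, match_pos):
    """Scan backwards from a match offset to candidate '{' positions and
    return the first complete JSON object that encloses the match."""
    decoder = json.JSONDecoder()
    start = match_pos
    while True:
        start = text.rfind("{", 0, start)
        if start == -1:
            return None  # no enclosing object parses
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # this '{' doesn't start a valid object; keep scanning
        if isinstance(obj, dict) and end >= match_pos:
            return obj
```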

    • mlhpdx 13 days ago
      Back in the bad old days when XML consumers hit similar problems we'd use an event-based parser like SAX. I'm a little shocked there isn't a mainstream equivalent for JSON — is there something I've missed?
      • jahewson 13 days ago
        Oh yes, some time ago I wrote a nodejs module to handle large xml files like that

        For JSON, given that large files are generally record-based ndjson is the solution I’ve encountered and it works nicely with various tools out there using the .ndjson file extension

  • isoprophlex 13 days ago
    Nice writeup, but is jq & GNU parallel or a notebook full of python spaghetti the best (least complex) tool for the job?

    DuckDB might be nice here, too. See

    • samwillis 13 days ago
      DuckDB is awesome. As a comparison, I have a dataset that starts life as a 35gb set of json files. Imported into Postgres it's ~6gb, and a key query I run takes 3 min 33 seconds.

      Imported into DuckDB (still about ~6gb for all columns), the same SQL query takes 1.1 second!

      The key thing is that the columns the query scans (across all rows) total only about 100mb, so DuckDB has a lot less to scan. But on top of that, its vectorised query execution is incredibly quick.

      • pletnes 13 days ago
        I found that exporting big tables as a bunch of parquet files is faster and uses less memory than duckdb’s internal format.
      • samus 13 days ago
        PostgreSQL would probably be way faster if you add proper indexes.
        • mattpallissard 11 days ago
          And not store the data as Json/jsonb to begin with.
    • e12e 13 days ago
    • nojito 12 days ago
      Calling Dask python spaghetti is quite hilarious.

      That spaghetti can auto scale to hundreds of machines without skipping a beat. Which is far more useful than the other tools you mentioned which are only useful for one off tasks.

      • isoprophlex 12 days ago
        Parallelizing a turd across hundreds of machines doesn't mean you're doing something genius, it just means you now have a hundred machines that have to deal with your shit.
    • pletnes 13 days ago
      Duckdb is fantastic. Doesn’t need a schema, either.
  • hamilyon2 13 days ago
    Clickhouse is the best way to analyze 10GB sized json by far.

    The latest bunch of features adds near-native JSON support. Coupled with the ability to add extracted columns, it makes the whole process easy. It is fast, you can use familiar SQL syntax, and you're not constrained by RAM limits.

    It is a bit hard if you want to iteratively process the file line-by-line or use advanced SQL. And you have the one-time cost of writing a schema. Apart from that, I can't think of any downsides.

    Edit: clarify a bit

  • jeffbee 13 days ago
    One thing that will greatly help with `jq` is rebuilding it so it suits your machine. The package of jq that comes with Debian or Ubuntu Linux is garbage that targets k8-generic (on the x86_64 variant), is built with debug assertions, and uses the GNU system allocator which is the worst allocator on the market. Rebuilding it targeting your platform, without assertions, and with tcmalloc makes it twice as fast in many cases.

    On this 988MB dataset I happen to have at hand, compare Ubuntu jq with my local build, with hot caches on an Intel Core i5-1240P.

      time parallel -n 100 /usr/bin/jq -rf ../program.jq ::: * -> 1.843s
      time parallel -n 100 ~/bin/jq -rf ../program.jq ::: * -> 1.121s
    I know it stinks of Gentoo, but if you have any performance requirements at all, you can help yourself by rebuilding the relevant packages. Never use the upstream mysql, postgres, redis, jq, ripgrep, etc etc.
    • jeffbee 13 days ago
      I guess another interesting fact worth mentioning here is the "efficiency" cores on a modern Intel CPU are every bit as good as the performance cores for this purpose. The 8C/8T Atom side of the i5-1240P has the same throughput as the 4C/8T Core side for this workload. I get 1.79s using CPUs 0-7 and 1.82s on CPUs 8-15.
  • rvanlaar 13 days ago
    Recently had 28GB json of IOT data with no guarantees on the data structure inside.

    Used simdjson [1] together with python bindings [2]. Achieved massive speedups for analyzing the data. Before it was in the order of minutes, then it became fast enough to not leave my desk. Reading from disk became the bottleneck, not cpu power and memory.

    [1] [2]

    • isoprophlex 12 days ago
      If reading from disk is now your bottleneck, next time put it in a (compressed?) ramdisk if you want to feel particularly clever/enjoy sick speedups
  • Groxx 13 days ago
    tbh my usual strategy is to drop into a real programming language and use whatever JSON stream parsing exists there, and dump the contents into a half-parsed file that can be split with `split`. Then you can use "normal" tools on one of those pieces for fast iteration, and simply `cat * | ...` for the final slow run on all the data.

    Go is quite good for this, as it's extremely permissive about errors and structure, has very good performance, and comes with a streaming parser in the standard library. It's pretty easy to be finished after only a couple minutes, and you'll be bottlenecked on I/O unless you did something truly horrific.

    And when jq isn't enough because you need to do joins or something, shove it into SQLite. Add an index or three. It'll massively outperform almost anything else unless you need rich text content searches (and even then, a fulltext index might be just as good), and it's plenty happy with a terabyte of data.
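    The "half-parse and dump to a splittable format" step boils down to decoding one record at a time instead of the whole array. A minimal sketch in Python (the thread's Go version would use json.Decoder the same way); it operates on an in-memory string for brevity, whereas a real version would read the file in chunks:

    ```python
    import json

    def array_to_jsonl(text):
        """Convert a JSON array (as text) into JSONL lines, one record per
        line, decoding records one at a time via raw_decode instead of
        loading the whole array into a single structure."""
        decoder = json.JSONDecoder()
        idx = text.index('[') + 1
        lines = []
        while True:
            # Skip whitespace and commas between records.
            while idx < len(text) and text[idx] in ' \t\r\n,':
                idx += 1
            if idx >= len(text) or text[idx] == ']':
                break
            obj, idx = decoder.raw_decode(text, idx)
            lines.append(json.dumps(obj))
        return lines

    print(array_to_jsonl('[{"a": 1}, {"a": 2}]'))
    ```

    Once it's one record per line, `split -l` and `cat * | ...` work exactly as described.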

  • zeitlupe 13 days ago
    Spark is my favorite tool to deal with jsons. It can read as many jsons – in any format located in any even nested folder structure – as you want, offers parallelization, and is great to flatten structs. I've never run into memory issues (or never ran out of workarounds) so far.
    • pidge 13 days ago
      Yeah, given that everything is now multi-core, it makes sense to use a natively parallel tool for anything compute-bound. And Spark will happily run locally and (unlike previous big data paradigms) doesn’t require excessive mental contortions.

      Of course while you’re at it, you should probably just convert all your JSON into Parquet to speed up successive queries…

    • iknownothow 13 days ago
      How much memory would a spark worker need to process a single JSON file that is 25GB?

      To clarify, this is not JSONL or NDJSON file. Just a single JSON object.

  • 19h 13 days ago
    To analyze and process the pushshift Reddit comment & submission archives we used Rust with simd-json and currently get to around 1 - 2GB/s (that’s including the decompression of the zstd stream). Still takes a load of time when the decompressed files are 300GB+.

    Weirdly enough we ended up networking a bunch of Apple silicon MacBooks together as the Ryzen 32C servers didn’t even closely match its performance :/

    • e12e 13 days ago
      It looks like a local load into clickhouse is expected to take 6-7 hours (in 2017?).

      I wonder how clickhouse-local would fare today (I'm guessing the dataset is so big, that load/store - then analyze would be better....).

      • zX41ZdbW 13 days ago
        Most of the time is spent in decompression - the source dataset used to have files in .bz2, which is the main contributor to total time.

        The dataset itself is just around 10 billion records.

    • xk3 13 days ago
      zstd decompression should almost always be very fast. It's faster to decompress than DEFLATE in all the benchmarks that I've seen.

      you might be interested in converting the pushshift data to parquet. Using octosql I'm able to query the submissions data (from the beginning of reddit to Sept 2022) in about 10 min

      Although if you're sending the data to postgres or BigQuery you can probably get better query performance via indexes or parallelism.

      • 19h 12 days ago
        Unfortunately we're not just searching for things but extracting word frequencies of every user for stylometric analysis, so we need to do custom crunching.

        Spreading this task into many sub-slices of the files is annoying because the frequencies per user add up quite a lot, which results in quite a massive amount of data.

      • zX41ZdbW 13 days ago
        • zX41ZdbW 13 days ago
          Two SSDs on AWS machine only give 3800 MB/sec :(
          • eska 12 days ago
            Meanwhile a single consumer Samsung 980 Pro 2TB for 200€ gives me stable 7000 MB/sec
  • ginko 13 days ago
    This is something I did recently. We have this binary format we use for content traces. You can dump it to JSON, but that turns a ~10GB into a ~100GB file. I needed to check some aspects of this with Python, so I used ijson[1] to parse the JSON without having to keep it in memory.

    The nice thing is that our dumping tool can also output JSON to STDOUT so you don't even need to dump the JSON representation to the hard disk. Just open the tool in a subprocess and pipe the output to the ijson parser. Pretty handy.


  • version_five 13 days ago
    For a hacky solution, I've often just used grep, tr, awk, etc. If it's a well structured file and all the records are the same or similar enough, it's often possible to grep your way into getting the thing you want on each line, and then use awk or sed to parse out the data. Obviously lots of ways this can break down, but 9GB is nothing if you can make it work with these tools. I have found jq much slower.
    • philwelch 13 days ago
      Yeah, if the JSON is relatively flat, converting to TSV makes the data fairly trivial to consume using awk and other classic command line tools. I did a lot of this when a past employer decided they couldn’t afford Splunk.
  • chrisweekly 13 days ago
    LNAV is ideally suited for this kind of thing, with an embedded sqlite engine and what amounts to a local laptop-scale mini-ETL toolkit w/ a nice CLI. I've been recommending it for the last 7 years since I discovered this awesome little underappreciated util.
  • Nihilartikel 13 days ago
    If you're doing interactive analysis, converting the json to parquet is a great first step.. After that duckdb or spark are a good way to go. I only fall back to spark if some aggregations are too big to fit in RAM. Spark spills to disk and subdivides the physical plans better in my experience..
    • lmeyerov 13 days ago

      We do the switch to parquet, and then as they say, use dask so we can stick with python for interesting bits as SQL is relatively anti-productive there

      Interestingly, most of the dask can actually be dask_cudf and cudf nowadays: dask/pandas on a GPU, so can stay in the same computer, no need for distributed, even if TBs etc of json

    • pradeepchhetri 13 days ago
      Well, if you need to convert the json to parquet to do anything fast, then what's the point? You end up spending more resources on the conversion itself, and the benefit is offset by the cost of extra storage (since you now have both the json and the parquet files). The whole point is to do fast operations on the json itself. Try out clickhouse/clickhouse-local.
      • closeparen 13 days ago
        If you're doing interactive analysis, generally you're going to have multiple queries, so it can be worthwhile to pay the conversion cost once upfront. You don't necessarily retain the JSON form, or at least not for as long.
  • mastax 13 days ago
    Dask looks really cool, I hope I remember it exists next time I need it.

    I've been pretty baffled, and disappointed, by how bad Python is at parallel processing. Yeah, yeah, I know: The GIL. But so much time and effort has been spent engineering around every other flaw in Python and yet this part is still so bad. I've tried every "easy to use" parallelism library that gets recommended and none of them has satisfied. Always: "couldn't pickle this function" or spawning loads of processes that use up all my RAM for no visible reason but don't use any CPU or make any indication of progress. I'm sure I'm missing something, I'm not a Python guy. But every other language I've used has an easy to use stateless parallel map that hasn't given me any trouble.

    • xk3 13 days ago
      ThreadPoolExecutor if IO-bound

      ProcessPoolExecutor if CPU-bound

      for example (imports included so it runs as-is)

          import shutil
          from concurrent.futures import ThreadPoolExecutor

          with ThreadPoolExecutor(max_workers=4) as e:
              e.submit(shutil.copy, 'src1.txt', 'dest1.txt')
              e.submit(shutil.copy, 'src2.txt', 'dest2.txt')
      but yeah if you're truly CPU bound then move to something lower level like C or Rust
    • isoprophlex 13 days ago
      I've been seeing python at least once every week for a looooong time. Years. A decade maybe. You are not missing anything. It's a big steamy pile of horse manure.
    • dermesser 13 days ago
      I can recommend Julia for easier parallelization while being reasonably Python-like. It's compiled, too, which helps even with single-threaded throughput.
  • thakoppno 13 days ago
    Would sampling the JSON down to 20MB and running jq experimentally until one has found an adequate solution be a decent alternative approach?

    It depends on the dataset one supposes.
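    For record-based (JSONL) data, sampling is a one-liner with itertools.islice — a sketch with a hypothetical helper and a throwaway demo file:

    ```python
    import itertools
    import json
    import os
    import tempfile

    def sample_jsonl(path, out_path, n):
        """Copy the first n lines of a JSONL file into a small sample
        you can iterate on quickly with jq."""
        with open(path) as src, open(out_path, 'w') as dst:
            dst.writelines(itertools.islice(src, n))

    # Demo on a throwaway file standing in for the big dataset.
    tmp = tempfile.mkdtemp()
    big, small = os.path.join(tmp, 'big.jsonl'), os.path.join(tmp, 'small.jsonl')
    with open(big, 'w') as f:
        for i in range(100):
            f.write(json.dumps({'i': i}) + '\n')
    sample_jsonl(big, small, 5)
    print(sum(1 for _ in open(small)))  # 5
    ```

    (`head -c 20M` gets you roughly the same thing from the shell, modulo a truncated last line.)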

    • epalm 13 days ago
      Yeah, I do this when querying sql databases. I limit the data to some small/local range, iteratively work on the query, and when I'm happy with the local results, I remove the filter and get the big results.
  • tylerhannan 13 days ago
    There was an interesting article on this recently...

    It prompted quite some conversation and discussion and, in the end, an updated benchmark across a variety of tools conveniently right in the 10GB dataset size.

  • berkle4455 13 days ago
    Just use clickhouse-local or duckdb. Handling data measured in terabytes is easy.
  • cpuguy83 13 days ago
    Jq does support slurp mode so you should be able to do this using that... granted I've never attempted this and the syntax is very different.

    --- edit ---

    I used the wrong term, the correct term is streaming mode.

    • Groxx 13 days ago
      It does work, but it is a huge headache to use, in part because the documentation around it is nowhere near enough to understand how to use it.

      If I used it regularly I'd probably develop a feel for it and be much faster - it is reasonable, just abnormal and extremely low level, and much harder to use with other jq stuff. But I almost always start looking for alternatives well before I reach that point.

  • funstuff007 13 days ago
    Anyone who's generating multi-GB JSON files on purpose has some explaining to do.
    • ghshephard 13 days ago
      Logs. jsonl is a popular streaming format.
      • funstuff007 13 days ago
        I guess, but you can grep JSONL just like you can a regular log file. As such, you don't need any sophisticated tools as discussed in this article.

        > 2. Each Line is a Valid JSON Value

        > 3. Line Separator is '\n'
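        The same grep-then-parse trick works in-process too: because each line is a complete JSON value, you can do a cheap substring filter before paying for a full parse. A small sketch (the function name is made up):

        ```python
        import json

        def grep_jsonl(lines, needle):
            """Cheap pre-filter: only JSON-parse lines containing the needle,
            mirroring `grep pattern big.jsonl | jq ...` from the shell."""
            for line in lines:
                if needle in line:
                    yield json.loads(line)

        log = [
            '{"level": "info", "msg": "ok"}',
            '{"level": "error", "msg": "disk full"}',
        ]
        print(list(grep_jsonl(log, 'error')))
        ```

        The substring test can false-positive (the needle may appear in an unrelated field), so you'd still apply the real predicate after parsing.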

        • ghshephard 13 days ago
          Yes - 100% I spend hours a day blasting through line json and I always pre-filter with egrep, and only move to things like jq with the hopefully (dramatically) reduced log size.

          Also - with linejson - you can just grab the first 10,000 or so lines and tweak your query with that before throwing it against the full log structure as well.

          With that said - this entire thread has been gold - lots of useful strategies for working with large json files.

  • hprotagonist 13 days ago
    i would seriously consider sqlite-utils here.

    • qbasic_forever 13 days ago
      Was going to post the same thing, I suspect converting the dataset to a SQLite db would be infinitely faster and more productive than pecking away at it with pandas and such.
      • sgarland 13 days ago
        At this size, I doubt it. While SQLite can read JSON if compiled with support for it, it stores it as TEXT. The only native indexing possible for that, as far as I'm aware, is full-text search, and I suspect the cardinality of JSON characters would make that inefficient. Not to mention that the author stated they didn't have enough memory to store the entire file, so with a DB you'd be reading from disk.

        MySQL or Postgres with their native JSON datatypes _might_ be faster, but you still have to load it in, and storing/indexing it in either of those is [0] its own [1] special nightmare full of footguns.

        Having done similar text manipulation and searches with giant CSV files, parallel and xsv [2] is the way to go.




        • simonw 13 days ago
          The trick sqlite-utils uses is to automatically create a SQLite table schema that matches the shape of the JSON object - specifically if you give it a JSON array of objects.

          It doesn't handle streaming JSON out of the box though, so you'd need to write some custom code on top of something like ijson to avoid loading the entire JSON file into memory first.
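          The streaming-insert part is straightforward with the stdlib sqlite3 module once something (ijson, or a hand-rolled incremental decoder) yields records one at a time. A sketch under that assumption — `records()` here is a stand-in generator, not a real data source, and the schema is hand-written rather than inferred the way sqlite-utils does it:

          ```python
          import itertools
          import sqlite3

          def records():
              # Stand-in for a streaming JSON source (e.g. ijson over a big file).
              for i in range(10):
                  yield {'id': i, 'name': f'row{i}'}

          con = sqlite3.connect(':memory:')
          con.execute('CREATE TABLE items (id INTEGER, name TEXT)')
          it = iter(records())
          while True:
              # Insert in batches so memory stays flat regardless of input size.
              batch = list(itertools.islice(it, 1000))
              if not batch:
                  break
              con.executemany('INSERT INTO items VALUES (:id, :name)', batch)
          con.commit()
          print(con.execute('SELECT COUNT(*) FROM items').fetchone()[0])  # 10
          ```

          With a file-backed database you'd also wrap the batches in explicit transactions and add indexes after loading, not before.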

      • philwelch 13 days ago
        SQLite is great for datasets that fit comfortably into memory, but otherwise it starts to struggle.
        • fbdab103 13 days ago
          Come again? It is a database. The entire raison d'être is managing bigger-than-memory data. On the SQLite forum[0], Dr. Hipp indicated a customer approached them who was using a ~40 TB SQLite database. Now, you could argue it is less good at that scale, particularly for analytical workloads, but SQLite has an enormous amount of headroom that you are unlikely to exhaust.


        • simonw 13 days ago
          That's not accurate: SQLite works great with data that's too big to fit in memory provided you create indexes and write queries that can take advantage of them - just like any other relational database (MySQL/PostgreSQL etc).
          • philwelch 13 days ago
            I guess it depends on what you’re trying to do. Creating those indices is going to require a scan through the whole data set, and you’re going to need to run a fair number of queries to amortize that. And if I’m going to be querying this data over and over again, that raises the question of why it wasn’t in a relational database to begin with.

            In my personal experience I’m usually digging through structured logs to answer one or two questions, after which point I won’t need the exact same data set to be indexed the exact same way again. That’s often more easily done by converting the data to TSV and using awk and other command line tools, which is typically quicker and more parallelizable than loading the whole works into SQLite and doing the work there.

        • hprotagonist 13 days ago
          happily, i have multiple gigabytes of memory …
          • philwelch 13 days ago
            Sure, but a 40 GB SQLite database on a machine with 16 GB of RAM is not gonna be happy
            • qbasic_forever 13 days ago
              You're not going to do better with pandas or similar tools. If it can't fit in memory, it's going to be painful. SQLite is the least painful in my experience, and it sets you up for working with the data in a proper DB like postgres or similar for when you get fed up with the memory constraints.
              • philwelch 13 days ago
                I wouldn’t use pandas in that situation either.
            • kosherhurricane 13 days ago
              In the original article, they were trying to read a 10GB file into 30GB RAM.
  • cube2222 13 days ago
    OctoSQL[0] or DuckDB[1] will most likely be much simpler, while going through 10 GB of JSON in a couple seconds at most.

    Disclaimer: author of OctoSQL



  • reegnz 12 days ago
    Allow me to advertise my zsh jq plugin +jq-repl:

    I find that for big datasets choosing the right format is crucial. Using json-lines format + some shell filtering (eg. head, tail to limit the range, egrep or ripgrep for the more trivial filtering) to reduce the dataset to a couple of megabytes, then use that jq-repl of mine to iterate fast on the final jq expression.

    I found that the REPL form factor works really well when you don't exactly know what you're digging for.

  • UnCommonLisp 13 days ago
    Use ClickHouse, either clickhouse-server or clickhouse-local. No fuss, no muss.
  • mattewong 13 days ago
    If it could be tabular in nature, maybe convert to sqlite3 so you can make use of indexing, or CSV to make use of high-performance tools like xsv or zsv (the latter of which I'm an author).

  • code-faster 13 days ago
    > Also note that this approach generalizes to other text-based formats. If you have 10 gigabyte of CSV, you can use Miller for processing. For binary formats, you could use fq if you can find a workable record separator.

    You can also generalize it without learning a new minilanguage by using a tool that converts csv/binary/whatever to/from json

  • Animats 13 days ago
    Rust's serde-json will iterate over a file of JSON without difficulty, and will write one from an iterative process without building it all in memory. I routinely create and read multi-gigabyte JSON files. They're debug dumps of the scene my metaverse viewer is looking at.

    Streaming from large files was routine for XML, but for some reason, JSON users don't seem to work with streams much.

  • liammclennan 13 days ago
    Flare’s command line tool can query CLEF formatted (newline-delimited) JSON files and is perhaps an order of magnitude faster.

    Good for searching and aggregating. Probably not great for transformation.

  • DeathArrow 12 days ago
    You can deserialize the JSONs and filter the resulting arrays or lists. For C# the IDE can automatically generate the classes from JSON and I think there are tools for other languages to generate data structures from JSON.
  • maCDzP 13 days ago
    I like SQLite and JSON columns. I wonder how fast it would be if you save the whole JSON file in one record and then query SQLite. I bet it’s fast.

    You could probably use that one record to then build tables in SQLite that you can query.
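    That single-record approach works with SQLite's JSON1 functions (json_each to unnest arrays, json_extract to pull fields). A sketch via Python's sqlite3 module — this assumes your SQLite build includes JSON1, which has shipped by default for years and is built into SQLite 3.38+:

    ```python
    import json
    import sqlite3

    con = sqlite3.connect(':memory:')
    con.execute('CREATE TABLE doc (body TEXT)')
    con.execute('INSERT INTO doc VALUES (?)',
                (json.dumps({'users': [{'name': 'ada'}, {'name': 'bob'}]}),))
    # json_each expands the array into rows; json_extract pulls out fields.
    rows = con.execute(
        "SELECT json_extract(value, '$.name') "
        "FROM doc, json_each(doc.body, '$.users')"
    ).fetchall()
    print(rows)  # [('ada',), ('bob',)]
    ```

    Whether it's fast on a multi-gigabyte record is another question — every query re-parses the TEXT blob unless you materialize extracted columns into real tables, which is the second step suggested above.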

  • kosherhurricane 13 days ago
    What I would have done is first create a map of the file, just the keys and shapes, without the data. That way I can traverse the file. And then mmap the file to traverse and read the data.

    A couple of dozen lines of code would do it.
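    The "map of keys and shapes" pass really is only a couple of dozen lines once the document (or a sampled record) is parsed — a sketch, with the usual caveat that sampling only the first array element assumes homogeneous arrays:

    ```python
    import json

    def shape(value, path='$'):
        """Yield (path, type-name) pairs describing the structure of a
        parsed JSON value, without the data itself."""
        yield path, type(value).__name__
        if isinstance(value, dict):
            for k, v in value.items():
                yield from shape(v, f'{path}.{k}')
        elif isinstance(value, list) and value:
            # Sample only the first element; assumes homogeneous arrays.
            yield from shape(value[0], f'{path}[0]')

    doc = json.loads('{"users": [{"id": 1, "tags": ["a"]}]}')
    for p, t in shape(doc):
        print(p, t)
    ```

    With the path map in hand you know which offsets/keys to seek to when traversing the mmapped file.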

  • 2h 13 days ago
    Note the Go standard library has a streaming parser: encoding/json's Decoder type.

  • kashif 13 days ago
  • zerop 13 days ago
    The other day I discovered duckdb on HN, which allows firing SQL at JSON. But I am not sure if it can take this much volume of data.
  • nn3 13 days ago
    the real trick is to do the debugging/exploration on a small subset of the data. Then usually you don't need all these extra measures because the real processing is only done a small number of times.