SQLite JSON at Full Index Speed Using Generated Columns

(dbpro.app)

103 points | by upmostly 2 hours ago

16 comments

pawelduda 3 minutes ago
I've been coding a lot of small apps recently, and going from local JSON file storage to SQLite has been a very natural path of progression, as data's order of magnitude ramps up. A fully performant database which still feels as simple as opening and reading from a plain JSON file. The trick you describe in the article is actually an unexpected performance buffer that'll come in handy when I start hitting next bottleneck :) Thank you
tracker1 5 minutes ago
As others mention, you can create indexes directly against the json without projecting in to a computed column... though the computed column has the added benefit of making certain queries easier.
That said, this is pretty much what you have to do with MS-SQL's limited support for JSON before 2025 (v17). Glad I double checked, since I wasn't even aware they had added the JSON type to 2025.
jelder 1 hour ago
I thought this was common practice, generated columns for JSON performance. I've even used this (although it was in Postgres) to maintain foreign key constraints where the key is buried in a JSON column. What we were doing was slightly cursed but it worked perfectly.
[-]
- ramon156 56 minutes ago
  It works until you realize some of these usages would've been better as individual key/value rows.
  For example, if you want to store settings as JSON, you first have to parse it through e.g. Zod, hope that it isn't failing due to schema changes (or write migrations and hope that succeeds).
  When a simple key/value row just works fine, and you can even do partial fetches / updates
- jasonthorsness 13 minutes ago
  This is the typical practice for most index types in SingleStore as well except with the Multi-Value Hash Index which is defined over a JSON or BSON path
- sigwinch 1 hour ago
  It is. I’d wondered if STORED is necessary and this example uses VIRTUAL.
Lex-2008 1 hour ago
interesting, but can't you use "Index On Expression" <https://sqlite.org/expridx.html>?
i.e. something like this: CREATE INDEX idx_events_type ON events(json_extract(data, '$.type'))?
i guess caveat here is that slight change in json path syntax (can't think of any right now) can cause SQLite to not use this index, while in case of explicitly specified Virtual Generated Columns you're guaranteed to use the index.
[-]
- pkhuong 1 hour ago
  Yeah, you can use index on expression and views to ensure the expression matches, like https://github.com/fsaintjacques/recordlite . The view + index approach decouples the convenience of having a column for a given expression and the need to materialise the column for performance.
- paulddraper 1 hour ago
  Yes, that’s the simpler and faster solution.
  You need to ensure your queries match your index, but when isn’t that true :)
upmostly 2 hours ago
I was inspired to write this blog post after reading bambax's comment on a HN post back in 2023: https://news.ycombinator.com/item?id=37082941
meindnoch 41 minutes ago
In the 2nd section you're using a CREATE TABLE plus three separate ALTER TABLE calls to add the virtual columns. In the 3rd section you're using a single CREATE TABLE with the virtual columns included from the get go.
Why?
[-]
- hamburglar 31 minutes ago
  I think the intent is to separate the virtual column creation out when it’s introduced in order to highlight that it’s a very lightweight operation. When moving onto the 3rd example, the existence of the virtual columns is just a given.
- hiccuphippo 23 minutes ago
  In 2 they show how to add virtual columns to an existing table, in 3 how to add indexes to existing virtual columns so they are pre-cooked. Like a cooking show.
  [-]
  - meindnoch 20 minutes ago
    >In 2 they show how to add virtual columns to an existing table
    No, in section 2 the table is created afresh. All 3 sections start with a CREATE TABLE.
    [-]
    - hiccuphippo 7 minutes ago
      Yes, it seems each section has its own independent database so you have to create everything on each of them.
  - upmostly 20 minutes ago
    Literally exactly as I meant it. I watch a lot of cooking shows, too, so this analogy holds up.
moregrist 31 minutes ago
Generated columns are pretty great, but what I would really love is a Postgres-style gin index, which dramatically speeds up json queries for unanticipated keys.
pipe01 5 minutes ago
MongoDB is dead, long live MongoDB
ellisv 45 minutes ago
I wish devs would normalize their data rather than shove everything into a JSON(B) column, especially when there is a consistent schema across records.
It's much harder to setup proper indexes, enforce constraints, and adds overhead every time you actually want to use the data.
[-]
- jasonthorsness 3 minutes ago
  When a data tree is tightly coupled (like a complex sample of nested data with some arrays from a sensor) and the entire tree is treated like a single thing by writes, the JSON column just keeps things easier. Reads can be accelerated with indexes as demonstrated here.
- tracker1 10 minutes ago
  I find that JSON(B) works best when you have a collection of data with different or variant concrete types of data that aren't 1:1 matches. Ex: the actual transaction result if you have different payment processors (paypal, amazon, google, apple-pay, etc)... you don't necessarily want/care about having N different tables for a clean mapping (along with the overhead of a join) to pull the transaction details in the original format(s).
  Another example is a classifieds website, where your extra details for a Dress are going to be quite a bit different than the details for a Car or Watch. But, again, you don't necessarily want to inflate the table structure for a fully normalized flow.
  If you're using a concretely typed service language it can help. C# does a decent job here. But even then, mixing in Zod with Hono and OpenAPI isn't exactly difficult on the JS/TS front.
- nh2 28 minutes ago
  JSON columns shine when
  * The data does not map well to database tables, e.g. when it's tree structures (of course that could be represented as many table rows too, but it's complicated and may be slower when you always need to operate on the whole tree anyway)
  * your programming language has better types and programming facilities than SQL offers; for example in our Haskell+TypeScript code base, we can conveniently serialise large nested data structures with 100s of types into JSON, without having to think about how to represent those trees as tables.
- crazygringo 35 minutes ago
  For very simple JSON data whose schema never changes, I agree.
  But the more complex it is, the more complex the relational representation becomes. JSON responses from some API's could easily require 8 new tables to store the data in, with lots of arbitrary new primary keys and lots of foreign key constraints, your queries will be full of JOIN's that need proper indexing set up...
  Oftentimes it's just not worth it, especially if your queries are relatively simple, but you still need to store the full JSON in case you need the data in the future.
  Obviously storing JSON in a relational database feels a bit like a Frankenstein monster. But at the end of the day, it's really just about what's simplest to maintain and provides the necessary performance.
  And the whole point of the article is how easy it is to set up indexes on JSON.
- konart 42 minutes ago
  Normalisation brings its own overhead though.
simonw 1 hour ago
Tiny bug report: I couldn't edit text in those SQL editor widgets from my iPhone, and I couldn't scroll them to see text that extended past the width of the page either.
[-]
- hamburglar 1 hour ago
  The examples also needed a “drop table if exists” so they could be run more than once without errors.
  [-]
  - upmostly 52 minutes ago
    Great catch, I'll add that now!
- upmostly 55 minutes ago
  Thanks Simon! Looking into that now. Big fan. Hope you enjoyed the post.
  Edit: This should now be fixed for you.
AlexErrant 45 minutes ago
I was looking for a way to index a JSON column that contains a JSON array, like a list of tags. AFAIK this method won't work for that; you'll either need to use FTS or a separate "tag" table that you index.
[-]
- garaetjjte 25 minutes ago
  I would want that too. It's possible in MySQL: https://dev.mysql.com/doc/refman/8.4/en/create-index.html#cr...
mcluck 1 hour ago
Very cool article. To really drill it home, I would have loved to see how the query plan changes. It _looks_ like it should Just Work(tm) but my brain refuses to believe that it's able to use those new indexes so flawlessly
N_Lens 39 minutes ago
What a neat trick, I love SQLite as well.
baq 51 minutes ago
My understanding is Snowflake works kinda like that behind the scenes right?
bushbaba 58 minutes ago
For smaller datasets (100s of thousands of rows) I don’t see why you wouldn’t just use json columns with generated column/index where needed
mring33621 1 hour ago
IIRC, Vertica had/has a similar feature.
[-]
- xp84 44 minutes ago
  Now there’s a name I haven’t heard in 10 years. (I’m only tenuously connected to the kinds of teams that use/would have used that, so it doesn’t mean much.)