We built a new powerful JSON data type for ClickHouse

(clickhouse.com)

230 points | by markhneedham 9 hours ago

12 comments

  • fuziontech 6 hours ago
    Using ClickHouse is one of the best decisions we've made here at PostHog. It has allowed us to scale performance while building more products on the same set of data.

    Since we'd been using ClickHouse long before this JSON functionality was available (or even before the earlier version of it, `Object('json')`, was available), we ended up setting up a job that would materialize JSON fields out of a JSON blob into materialized columns, based on query patterns against the keys in the blob. Then, once those materialized columns were created, we would just route queries to them at runtime if they were available. This saved us a _ton_ of CPU and IO utilization. Even though ClickHouse uses some really fast SIMD JSON functions, the best way to make a computer go faster is to make it do less, and this new JSON type does exactly that. And it's so turnkey!

    https://posthog.com/handbook/engineering/databases/materiali...
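
    The pattern looked roughly like this (a minimal sketch with hypothetical table and column names, using plain JSONExtract functions rather than our actual tooling; note that a column added with MATERIALIZED is only populated for newly written parts unless you also run MATERIALIZE COLUMN):

        -- extract one hot key out of the raw JSON blob into its own column
        ALTER TABLE events
            ADD COLUMN plan String
            MATERIALIZED JSONExtractString(properties, 'plan');

        -- instead of parsing the blob on every query ...
        SELECT count() FROM events WHERE JSONExtractString(properties, 'plan') = 'enterprise';

        -- ... the query gets routed to the cheap pre-extracted column:
        SELECT count() FROM events WHERE plan = 'enterprise';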

    The team over at ClickHouse Inc., as well as the community behind it, moves surprisingly fast. I can't recommend it enough, and I'm excited for everything else on the roadmap, especially what's on the horizon with Parquet and Iceberg support.

  • breadwinner 5 hours ago
    If you're evaluating ClickHouse, take a look at Apache Pinot as well. ClickHouse was designed for single-machine installations, although it has been enhanced to support clusters. But that support is lacking: for example, if you add additional nodes, it is not easy to redistribute data. Pinot is much easier to scale horizontally. Also take a look at Pinot's star-tree indexes [1]. If you're doing multi-dimensional analysis (pivot tables etc.), there is a huge difference in performance if you take advantage of star-tree.

    [1] https://docs.pinot.apache.org/basics/indexing/star-tree-inde...

    • zX41ZdbW 4 hours ago
      > ClickHouse was designed for single-machine installations

      This is incorrect. ClickHouse has been designed for distributed setups from the beginning, including cross-DC installations. It was used on large production clusters even before it was open-sourced. When it became open source in June 2016, the largest cluster was 394 machines across 6 data centers, with 25 ms RTT between the most distant data centers.

    • cvalka 2 hours ago
      Absolutely incorrect. ClickHouse was created by Yandex and has been cluster-ready from day one.
    • haolez 5 hours ago
      What's the use case? Analytics on humongous quantities of data? Something besides that?
      • breadwinner 5 hours ago
        The use case is “user-facing analytics”. For example, consider ordering food from Uber Eats: you have thousands of concurrent users, latency should be in milliseconds, and things like the delivery time estimate must be updated in real time.

        Spark can do analysis on huge quantities of data, and so can Microsoft Fabric. What Pinot can do that those tools can't is deliver extremely low latency (milliseconds vs. seconds), high concurrency (thousands of queries per second), and real-time data updates.

        Excellent intro video on Pinot: https://www.youtube.com/watch?v=_lqdfq2c9cQ

        • listenallyall 4 hours ago
          I don't think Uber's estimated time-to-arrival is a statistic that a database vendor, or development team, should brag about. It's horribly imprecise.
          • akavi 3 hours ago
            Also, it isn't something that a (geo)sharded Postgres DB with the appropriate indexes couldn't handle with aplomb. The number of orders to a given restaurant can't be more than a dozen a minute or so.
          • cyanydeez 2 hours ago
            What about its ability to choose pricing based on source-destination and projected incomes?
      • whalesalad 4 hours ago
        I thought “humongous quantities of data” was a baseline assumption for a discussion involving ClickHouse et al.?
  • ramraj07 7 hours ago
    Great to see it in ClickHouse.

    Snowflake released a white paper back before its IPO that mentioned this same feature (secretly exploding JSON into columns). It explains why Snowflake feels faster than it should: they've quietly done a lot of amazing things and just offered them as a polished product, like Apple.

  • everfrustrated 5 hours ago
    >Dynamically changing data: allow values with different data types (possibly incompatible and not known beforehand) for the same JSON paths without unification into a least common type, preserving the integrity of mixed-type data.

    I'm so excited for this! One of my major bugbears with storing logs in Elasticsearch is the set-type-on-first-seen-occurrence headache.

    Hope to see this leave experimental status soon!
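
    Based on the post, a minimal sketch of what that looks like (untested; table and key names made up, setting name as given in the current docs):

        SET allow_experimental_json_type = 1;

        CREATE TABLE logs (msg JSON) ENGINE = MergeTree ORDER BY tuple();

        -- same path, two different types: no unification into a least common type
        INSERT INTO logs VALUES ('{"status": 200}'), ('{"status": "timeout"}');

        -- each value keeps its own type
        SELECT msg.status, dynamicType(msg.status) FROM logs;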

    • atombender 2 hours ago
      I never understood why ELK/Kibana chose this method when there's a much simpler solution: augment each field name with the data type.

      For example, consider the documents {"value": 42} and {"value": "foo"}. To index this, index {"value::int": 42} and {"value::str": "foo"} instead. Now you have two distinct fields that don't conflict with each other.

      To search this, the logical choice would be to first make sure the query language is typed. A query like value=42 would then know to search the int field, while a query like value="42" would look in the string field. There's never any situation where there's any ambiguity about which data type is to be searched. KQL doesn't have this, but that's one of its many design mistakes.

      You can do the same for any data type, including arrays and objects. There is absolutely no downside; I've successfully implemented it for a specific project. (OK, one downside: more fields. But that's the nature of the beast. These are, after all, distinct sets of data.)
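
      Incidentally, ClickHouse's new JSON type exposes much the same idea through typed subcolumn access. Something like this should work (a sketch based on the documented `.:Type` syntax, assuming a hypothetical table t with a JSON column named data; not tested):

          -- {"value": 42} and {"value": "foo"} land in separate typed subcolumns
          SELECT data.value.:Int64  FROM t WHERE data.value.:Int64 = 42;
          SELECT data.value.:String FROM t WHERE data.value.:String = 'foo';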

      • mr_toad 1 hour ago
        > For example, consider the documents {"value": 42} and {"value": "foo"}. To index this, index {"value::int": 42} and {"value::str": "foo"} instead. Now you have two distinct fields that don't conflict with each other.

        But now all my queries that look for “value” don’t work. And I’ve got two columns in my report where I only want one.

  • abe94 9 hours ago
    We've been waiting for more JSON support in ClickHouse. The new type looks promising, and the dynamic columns, with no need to specify subtypes, are particularly helpful for us.
  • notamy 7 hours ago
    Clickhouse is great stuff. I use it for OLAP with a modest database (~600mil rows, ~300GB before compression) and it handles everything I throw at it without issues. I'm hopeful this new JSON data type will be better at a use-case that I currently solve with nested tuples.
    • jabart 56 minutes ago
      Similar for us, except 700mil rows in one table and 2.5 billion total rows. That's growing quickly because we started shoving OTEL data into the cluster. None of our queries seem to faze ClickHouse. It's like magic. The 48 cores per node also help.
    • philosopher1234 5 hours ago
      Postgres should be good enough for 300GB, no?
      • wiredfool 3 hours ago
        I had a Postgres database where the main index (160GB) was larger than the entire equivalent ClickHouse database (60GB). And between the partitioning and the natural keys, the primary key index in ClickHouse was about 20k per partition * ~1k partitions.

        Now, it wasn't a good schema to start with, and there was about a factor of 3 or 4 in size that could have been pulled out, but ClickHouse was a factor of 20 better in on-disk size for what we were doing.

      • marginalia_nu 3 hours ago
        At least in my experience, that's about when regular DBMSes kinda start to suck for ad-hoc queries. You can push them a bit further for non-analytical use cases if you're really careful and have prepared indexes that assist every query you make, but that's rarely a luxury you have in OLAP-land.
      • tempest_ 4 hours ago
        It depends. If you want to do any kind of aggregation, counts, or count distinct, pg falls over pretty quickly.
      • notamy 5 hours ago
        Probably, but ClickHouse has been zero-maintenance for me, and my dataset is growing at 100~200GB/month. Having ClickHouse's automatic compression makes me worry a lot less about disk space.
      • whalesalad 4 hours ago
        For write-heavy workloads I find psql to be a dog, tbh. I use it everywhere but am anxious to try new tools.

        For truly big data (terabytes per month) we rely on BigQuery. For smaller data that is more OLTP and write-heavy we use psql… but I think there is room in the middle.

  • CSDude 3 hours ago
    When I tried it a few weeks ago, ClickHouse names the files based on column names, so weird JSON keys resulted in very long filenames (with slashes) that did not play well with the file system and gave errors. I wonder if that has been fixed?
  • officex 7 hours ago
    Great to see! I remember checking you guys out in Q1, great team
  • baq 7 hours ago
    Clickhouse is criminally underused.

    It's common knowledge that 'postgres is all you need' - but if you somehow reach the stage of 'postgres isn't all I need and I have hard proof', this should be the next tech you look at.

    Also, clickhouse-local is rather amazing at CSV processing using SQL. Highly recommended for when you are fed up with Google Sheets or even Excel.
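
    For example, something like this (hypothetical file and column names) can replace a whole spreadsheet session:

        clickhouse-local --query "
            SELECT country, count(), sum(amount)
            FROM file('orders.csv', CSVWithNames)
            GROUP BY country
            ORDER BY sum(amount) DESC
        "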

    • mrsilencedogood 6 hours ago
      This is my take too. At one of my old jobs, we were early (very early) to the Hadoop and then Spark games. Maybe too early, because by the time Spark 2 made it all easy, we had already written a lot of mapreduce-streaming and then some RDD-based code. Towards the end of my tenure there, I was experimenting with alternate datastores, and ClickHouse was one I evaluated. It worked really, really well in my demos. But I couldn't get buy-in because management was a little wary of the Russian side of it (which it has now distanced/divorced itself from, I think?), and they also didn't really have the appetite for such a large undertaking anymore. (The org was going through some things.) (So instead a different team blessed by the company owner basically DIY'd a system to store .feather files on NVMe SSDs... anyway.)

      If I were still there, I'd be pushing a lot harder to finally throw away the legacy system (which has lost so many people it's basically ossified, anyway) and just "rebase" it all onto clickhouse and pyspark sparksql. We would throw away so much shitty cruft, and a lot of the newer mapreduce and RDD code is pretty portable to the point that it could be plugged into RDD's pipe() method.

      Anyway. At my current job, we just stood up a new product that, from day 1, was ingesting billions of rows of event data (~nothing for ClickHouse, to be clear, but obviously way too much for pg). And it's just chugging along. ClickHouse is definitely in my toolbox right after Postgres, as you state.

    • osigurdson 6 hours ago
      Agree. CH is a great technology to have some awareness of. I use it for "real things" (100B+ data points) but honestly it can really simplify little things as well.

      I'd throw in one more to round it out, however. The three rings of power are Postgres, ClickHouse, and NATS. Postgres is the most powerful ring, though, and lots of times it's all you need.

    • oulipo 7 hours ago
      Would you recommend ClickHouse over DuckDB, and why?
      • PeterCorless 6 hours ago
        Note that every use case is different and YMMV.

        https://www.vantage.sh/blog/clickhouse-local-vs-duckdb

        • hn1986 5 hours ago
          Great link. Curious how it compares now that DuckDB is 1.0+.
      • nasretdinov 5 hours ago
        IMO the only reason not to use ClickHouse is when you have either a "small" amount of data or "small" servers (<100 GB of data, servers with <64 GB of RAM). Otherwise ClickHouse is a better solution, since it's a standalone DB that supports replication and in general has very, very robust cluster support, easily scaling to hundreds of nodes.

        Typically you discover the need for an OLAP DB when you reach that scale, so I'm personally not sure what the real use case for DuckDB is, to be completely honest.

        • geysersam 5 hours ago
          DuckDB probably performs better per core than ClickHouse does for most queries. So as long as your workload fits on a single machine (it likely does), it's often the most performant option.

          Besides, it's so simple, just a single executable.

          Of course if you're at a scale where you need a cluster it's not an option anymore.

          • zX41ZdbW 4 hours ago
            The good parts of DuckDB that you've mentioned, including the fact that it is a single executable, are modeled after ClickHouse.
            • RyanHamilton 2 hours ago
              Can you provide a reference for that belief? To me that's not true. They started from solving very different problems.
      • theLiminator 6 hours ago
        Not to mention polars, datafusion, etc. Single node OLAP space is really heating up.
        • fiddlerwoaroof 6 hours ago
          Clickhouse scales from a local tool like Duckdb to a database cluster that can back your reporting applications and other OLAP applications.
    • CalRobert 5 hours ago
      ClickHouse and Postgres are just different tools though: OLAP vs. OLTP.
      • fiddlerwoaroof 2 hours ago
        It’s fairly common in my experience for reports to initially be driven by a Postgres database until you hit data volumes Postgres cannot handle.
  • trollied 5 hours ago
    What on earth has gone wrong? Just use a database with rows and columns.

    The amount of extra storage required to deal with the metadata for non-structured data is crazy.

    JSON is OK for chucking data about.

    • rockostrich 4 hours ago
      Analytical databases have rows and columns? What do you do when you're ingesting TBs, if not PBs, of unstructured data and need to make it actually usable?

      A couple of MBs (or even GBs) of storage for metadata is peanuts compared to the actual data, not to mention the material savings from storing it in a column-oriented engine.

  • anonygler 7 hours ago
    I keep misreading this company as ClickHole and expecting some sort of satirical content.