It also works with fixed, pre-built pipelines. This is all very much in the style of most stream processing platforms today, but I hope we'll continue to move closer as an industry to having our cake and eating it: ingest everything in real-time, while serving any query (with joins) over the full dataset (either incrementally or ad-hoc).
In SQL we currently support specifying watermarks based on SQL expressions, so it can be a bit more expressive than just the basic fixed-delay watermark: https://doc.arroyo.dev/sql/ddl#options. For watermark-based systems, I think the ideal is something that allows users to express either max latencies or statistical completeness (like, wait up to 1 minute or until an estimated 99.5% of the data is there).
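The "max latency or statistical completeness" idea above can be sketched as a small policy object. This is purely conceptual, not Arroyo's actual watermark implementation; the class and parameter names (`HybridWatermark`, `can_close`, `target_completeness`) are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class HybridWatermark:
    """Conceptual hybrid watermark policy: close a window once EITHER a
    maximum delay has elapsed OR an estimated fraction of events arrived."""
    max_delay_s: float = 60.0           # wait at most 1 minute...
    target_completeness: float = 0.995  # ...or until ~99.5% estimated complete

    def can_close(self, window_end: float, now: float,
                  seen: int, expected: int) -> bool:
        # Hard upper bound on latency: never wait longer than max_delay_s.
        if now - window_end >= self.max_delay_s:
            return True
        # Otherwise, close early if the estimated completeness target is met.
        return expected > 0 and seen / expected >= self.target_completeness

wm = HybridWatermark()
# 30s after window end, but 99.7% of the expected events have arrived: close.
print(wm.can_close(window_end=100.0, now=130.0, seen=997, expected=1000))  # True
# 30s after window end with only 90% arrived: keep waiting.
print(wm.can_close(window_end=100.0, now=130.0, seen=900, expected=1000))  # False
```

The `expected` count would come from some completeness estimator (historical arrival curves, source-reported offsets, etc.), which is the genuinely hard part in practice.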
Beyond that, there are systems that are more integrated end-to-end and can update results as late-arriving data comes in (like Materialize), and I think those have their place. However, for many uses of stream processing what's important is taking action once the data is complete enough, and watermarks are a useful and pretty straightforward mechanism for that.
There's a spectrum of streaming systems from simple, stateless processors to complex, stateful distributed processors.
I'm not very familiar with Benthos, but it appears to be on the simple, stateless end. Systems like that can be great if they fit your use cases—they tend to be much easier to understand and operate.
But they are not as flexible or powerful as stateful systems like Arroyo or Flink. For example, they cannot accurately compute aggregations, joins, or windows over long time ranges.
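To make the stateless/stateful distinction concrete, here is a minimal sketch (not from either project) of an event-time tumbling-window count. The `state` dict is exactly what a stateless, one-event-at-a-time processor lacks: the result for each window depends on every event that falls into it, not just the current one.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_s):
    """Count events per (window, key) over non-overlapping event-time windows.

    events: iterable of (event_time_seconds, key) pairs.
    """
    state = defaultdict(int)  # accumulated per-window state
    for ts, key in events:
        window_start = (ts // window_s) * window_s  # align to window boundary
        state[(window_start, key)] += 1
    return dict(state)

events = [(1, "a"), (5, "a"), (12, "b"), (13, "a")]
print(tumbling_window_counts(events, window_s=10))
# {(0, 'a'): 2, (10, 'b'): 1, (10, 'a'): 1}
```

A real engine additionally has to checkpoint that state, shard it across workers, and decide when a window is complete (watermarks), which is where most of the complexity lives.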
This is a really exciting project! I recently learned about https://github.com/vmware/database-stream-processor which builds on a new theoretical foundation and claims to be 9x faster than Flink. It is also written in Rust, and there is a compiler from SQL to Rust executables. Can you comment on the differences?
As an aside - we are evaluating different streaming engines to power data projections for Prisma.io. Excited to see support for Debezium coming in v0.4.0.
Hi there! We actually already have a built-in Nexmark source. It's pretty useful for developing new capabilities, and available as a source out of the box.
Just read through the DBSP docs and it looks like it is working in a similar space. The biggest differences in my mind are around distribution and reliability. Arroyo works across a cluster of machines and has built-in fault tolerance, while for DBSP that's still just planned for the future.
That'd be great! We have versions of most of the Nexmark queries and some internal benchmarks vs. Flink, and we'd love to help if you're interested in benchmarking against more systems. Reach out at [email protected].
Between Flink, Spark, and KSQL, streaming is very JVM-centric. It is nice to see more non-JVM projects emerge.
I am not sure about your premise that the operations side is difficult. With Flink or Spark it tends to just be a matter of submitting a job to a cluster.
The harder barrier to entry is the functional style of transformation code. Even though other frameworks have it, I think the SQL API as a first-class citizen is the bigger differentiator.
In the watermarks documentation it mentions that events arriving after the watermark are dropped. Are there any plans to make this configurable (to disable dropping or trigger exception handling) and/or alertable?
I can think of quite a few use cases (particularly in finance) where we'd want late-arrivals to be recorded and possibly incorporated into later or revised results, not silently dropped on the floor.
Yeah, currently late-arriving data is dropped, but we will be making this more configurable. We're currently working on what we call "update tables," which means being able to emit incremental changes to state as well as final results. Once that's in, we'll be able to give richer semantics around late-arriving data.
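The configurability being asked for above might look something like a routing policy for events behind the watermark. This is a hypothetical sketch, not Arroyo's API; the policy names (`drop`, `side_output`, `allow`) are made up, loosely modeled on the side-output pattern other engines use:

```python
def route_event(event_ts: float, watermark: float, policy: str = "drop") -> str:
    """Decide what to do with an event relative to the current watermark.

    policy:
      'drop'        - silently discard late events (the current behavior)
      'side_output' - divert late events to a separate stream for auditing
      'allow'       - process anyway, emitting a revised/updated result
                      downstream (what update tables could enable)
    """
    if event_ts >= watermark:
        return "process"       # on-time: normal processing
    if policy == "drop":
        return "dropped"
    if policy == "side_output":
        return "late_output"
    return "process"           # 'allow': incorporate into revised results

print(route_event(90, 100, policy="side_output"))  # late_output
print(route_event(110, 100))                       # process
```

For the finance use cases mentioned above, 'side_output' gives you an audit trail of late arrivals even before revised results are supported.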
Very interesting project, Arroyo has been on my watch list for a while! How would you say Arroyo compares to Apache Flink, i.e. what are the pros and cons? For instance, given it's implemented in Rust, I'd assume Arroyo's resource consumption might be lower?
(Disclaimer: I work for Decodable, where we build a SaaS based on Flink)
Yep! In California where we live, it typically refers to a seasonal desert stream that can range from a trickle to a torrent. We chose it because Arroyo is a stream processor, and it's very good at autoscaling.
I am not sure of the specifics on features, but I think the fundamental difference is that Arroyo is a stream processing engine, i.e. it doesn't have a database, whereas Tinybird has the statefulness afforded by ClickHouse as its primary data store. Arroyo would be more like Flink; Tinybird would be more like ClickHouse.
Yes, that's mostly true. Arroyo is a stream processor like Flink. In Arroyo, you pre-register the queries you are interested in, and they will be compiled into streaming dataflow jobs that continuously execute as events come in. This means that we're not storing all of the raw events that come in for later querying.
However, like a database we do have a serving layer (currently only in our cloud version due to its reliance on our distributed state backend: https://doc.arroyo.dev/connectors/state) so it is possible to query the results directly from Arroyo as well.
Generally you would want to use something like Arroyo when your data is too high-volume to reasonably store it all in a DB like ClickHouse, or when your queries are too expensive to run on demand, since Arroyo incrementally computes the results of the query as events come in.
There are also opportunities to use these systems together: Arroyo can pre-aggregate the high-volume raw data, which can then be inserted into a ClickHouse-based system for final processing along different dimensions.
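The pre-aggregation step described above can be sketched as follows (a conceptual illustration, not either product's API): roll raw events up into per-bucket counts so the downstream store sees one row per (bucket, key) instead of one row per event.

```python
from collections import Counter

def preaggregate(raw_events, bucket_s):
    """Roll high-volume raw events up into per-time-bucket counts.

    raw_events: iterable of (event_time_seconds, key) pairs.
    Returns rows of (bucket_start, key, count), ready for a bulk insert
    into an analytical store.
    """
    counts = Counter()
    for ts, key in raw_events:
        bucket = (ts // bucket_s) * bucket_s  # align to bucket boundary
        counts[(bucket, key)] += 1
    return [(bucket, key, n) for (bucket, key), n in sorted(counts.items())]

raw = [(0, "click"), (1, "click"), (2, "view"), (61, "click")]
print(preaggregate(raw, bucket_s=60))
# [(0, 'click', 2), (0, 'view', 1), (60, 'click', 1)]
```

At high event rates this is where the cost savings come from: the streaming engine absorbs the raw firehose, and the database only ingests the compacted rows.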
We're big fans of Nomad! It's what we use for scheduling in our cloud platform, due to its simplicity and scheduling speed. We also have great support for Kubernetes, though, as that's what most folks will be running.
Yes, Arroyo is an alternative to KSQL, although more in the design-space of Flink.
KSQL is pretty simple and easy to run if you already have Kafka, but it will be much more expensive and harder to scale due to its reliance on Kafka Streams for persistence and for shuffling data through the processing DAG.
https://doc.arroyo.dev/concepts#watermarks
See feldera.com for more info.
Really happy to see all the activity at the new repo. For a while there I thought the project was dying...
Would be interesting to somehow make Arroyo run the Nexmark benchmark so we can clearly compare to Flink and DBSP: https://liveandletlearn.net/post/vmware-take-3-experience-wi...
(I'm the co-creator of Arroyo, for context)
I'll try to get something set up to compare performance of the two on the same machine.
"Arroyo" is a Spanish word meaning creek, or stream
I live in a California town called Arroyo Grande ("big creek")
https://www.tinybird.co/
Disclaimer: I work for Tinybird.
I wish more products would support (or at least document how to run on) Nomad.
And with Confluent's embrace of Flink in the past year (https://www.confluent.io/blog/cloud-kafka-meets-cloud-flink-...) it's not clear that KSQL has much of a future.