It also works with fixed, pre-built pipelines. This is all very much in the style of most stream processing platforms today, but I hope we'll continue to move closer as an industry to having our cake and eating it: ingest everything in real-time, while serving any query (with joins) over the full dataset (either incrementally or ad-hoc).
In SQL we currently support specifying watermarks based on SQL expressions, so it can be a bit more expressive than just the basic fixed-delay watermark: https://doc.arroyo.dev/sql/ddl#options. For watermark-based systems, I think the ideal is something that allows users to express either max latencies or statistical completeness (like, wait up to 1 minute or until an estimated 99.5% of the data is there).
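The "max latency or statistical completeness" idea above can be sketched as a small policy object. This is purely conceptual, not Arroyo's actual watermark implementation; the class and parameter names (`HybridWatermark`, `can_close`, `target_completeness`) are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class HybridWatermark:
    """Conceptual hybrid watermark policy: close a window once EITHER a
    maximum delay has elapsed OR an estimated fraction of events arrived."""
    max_delay_s: float = 60.0           # wait at most 1 minute...
    target_completeness: float = 0.995  # ...or until ~99.5% estimated complete

    def can_close(self, window_end: float, now: float,
                  seen: int, expected: int) -> bool:
        # Hard upper bound on latency: never wait longer than max_delay_s.
        if now - window_end >= self.max_delay_s:
            return True
        # Otherwise, close early if the estimated completeness target is met.
        return expected > 0 and seen / expected >= self.target_completeness

wm = HybridWatermark()
# 30s after window end, but 99.7% of the expected events have arrived: close.
print(wm.can_close(window_end=100.0, now=130.0, seen=997, expected=1000))  # True
# 30s after window end with only 90% arrived: keep waiting.
print(wm.can_close(window_end=100.0, now=130.0, seen=900, expected=1000))  # False
```

The `expected` count would come from some completeness estimator (historical arrival curves, source-reported offsets, etc.), which is the genuinely hard part in practice.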
Beyond that, there are systems that are more integrated end-to-end and can update results as late-arriving data comes in (like Materialize), and I think those have their place. However, for many uses of stream processing what's important is taking action once the data is complete enough, and watermarks are a useful and pretty straightforward mechanism for that.
There's a spectrum of streaming systems from simple, stateless processors to complex, stateful distributed processors.
I'm not very familiar with Benthos, but it appears to be on the simple, stateless end. Systems like that can be great if they fit your use cases—they tend to be much easier to understand and operate.
But they are not as flexible or powerful as stateful systems like Arroyo or Flink. For example, they cannot accurately compute aggregations, joins, or windows over long time ranges.
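To make the stateless/stateful distinction concrete, here is a minimal sketch (not from either project) of an event-time tumbling-window count. The `state` dict is exactly what a stateless, one-event-at-a-time processor lacks: the result for each window depends on every event that falls into it, not just the current one.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_s):
    """Count events per (window, key) over non-overlapping event-time windows.

    events: iterable of (event_time_seconds, key) pairs.
    """
    state = defaultdict(int)  # accumulated per-window state
    for ts, key in events:
        window_start = (ts // window_s) * window_s  # align to window boundary
        state[(window_start, key)] += 1
    return dict(state)

events = [(1, "a"), (5, "a"), (12, "b"), (13, "a")]
print(tumbling_window_counts(events, window_s=10))
# {(0, 'a'): 2, (10, 'b'): 1, (10, 'a'): 1}
```

A real engine additionally has to checkpoint that state, shard it across workers, and decide when a window is complete (watermarks), which is where most of the complexity lives.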
This is a really exciting project! I recently learned about https://github.com/vmware/database-stream-processor which builds on a new theoretical foundation and claims to be 9x faster than Flink. It is also written in Rust, and there is a compiler from SQL to Rust executables. Can you comment on the differences?
As an aside - we are evaluating different streaming engines to power data projections for Prisma.io. Excited to see support for Debezium coming in v0.4.0.
Hi there! We actually already have a built-in Nexmark source. It's pretty useful for developing new capabilities, and available as a source out of the box.
Just read through the DBSP docs and it looks like it is working in a similar space. The biggest differences in my mind are around distribution and reliability. Arroyo works across a cluster of machines and has built-in fault tolerance, while for DBSP that's still just planned for the future.
That'd be great! We have versions of most of the Nexmark queries and some internal benchmarks vs. Flink, and we'd love to help if you're interested in benchmarking against more systems. Reach out at [email protected].
Between Flink, Spark, and KSQL, streaming is very JVM-centric. It is nice to see more non-JVM projects emerge.
I am not sure about your premise that the operations side is difficult. With Flink or Spark it tends to just be a matter of submitting a job to a cluster.
The harder barrier to entry is the functional style of transformation code. Even though other frameworks have it, I think the SQL API as a first-class citizen is the bigger differentiator.
In the watermarks documentation it mentions that events arriving after the watermark are dropped. Are there any plans to make this configurable (to disable dropping or trigger exception handling) and/or alertable?
I can think of quite a few use cases (particularly in finance) where we'd want late-arrivals to be recorded and possibly incorporated into later or revised results, not silently dropped on the floor.
Yeah, currently late-arriving data is dropped, but we will be making this more configurable. We're currently working on what we call "update tables," which means being able to emit incremental changes to state as well as final results. Once that's in, we'll be able to give richer semantics around late-arriving data.
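The configurability being asked for above might look something like a routing policy for events behind the watermark. This is a hypothetical sketch, not Arroyo's API; the policy names (`drop`, `side_output`, `allow`) are made up, loosely modeled on the side-output pattern other engines use:

```python
def route_event(event_ts: float, watermark: float, policy: str = "drop") -> str:
    """Decide what to do with an event relative to the current watermark.

    policy:
      'drop'        - silently discard late events (the current behavior)
      'side_output' - divert late events to a separate stream for auditing
      'allow'       - process anyway, emitting a revised/updated result
                      downstream (what update tables could enable)
    """
    if event_ts >= watermark:
        return "process"       # on-time: normal processing
    if policy == "drop":
        return "dropped"
    if policy == "side_output":
        return "late_output"
    return "process"           # 'allow': incorporate into revised results

print(route_event(90, 100, policy="side_output"))  # late_output
print(route_event(110, 100))                       # process
```

For the finance use cases mentioned above, 'side_output' gives you an audit trail of late arrivals even before revised results are supported.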
Very interesting project, Arroyo has been on my watch list for a while! How would you say Arroyo compares to Apache Flink, i.e. what are the pros and cons? For instance, given it's implemented in Rust, I'd assume Arroyo's resource consumption might be lower?
(Disclaimer: I work for Decodable, where we build a SaaS based on Flink)
Yep! In California where we live, it typically refers to a seasonal desert stream that can range from a trickle to a torrent. We chose it because Arroyo is a stream processor, and it's very good at autoscaling.
I am not sure of the specifics on features, but I think the fundamental difference is that Arroyo is a stream processing engine, i.e. it doesn't have a database, whereas Tinybird has the statefulness afforded by ClickHouse as its primary data store. Arroyo would be more like Flink; Tinybird would be more like ClickHouse.
Yes, that's mostly true. Arroyo is a stream processor like Flink. In Arroyo, you pre-register the queries you are interested in, and they will be compiled into streaming dataflow jobs that continuously execute as events come in. This means that we're not storing all of the raw events that come in for later querying.
However, like a database we do have a serving layer (currently only in our cloud version due to its reliance on our distributed state backend: https://doc.arroyo.dev/connectors/state) so it is possible to query the results directly from Arroyo as well.
Generally you would want to use something like Arroyo when your data is too high-volume to reasonably store it all in a DB like ClickHouse, or when your queries are too expensive to run on demand, since Arroyo incrementally computes the results of the query as events come in.
There are also opportunities to use these systems together: Arroyo can pre-aggregate the high-volume raw data, which can then be inserted into a ClickHouse-based system for final processing along different dimensions.
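The pre-aggregation step described above can be sketched as follows (a conceptual illustration, not either product's API): roll raw events up into per-bucket counts so the downstream store sees one row per (bucket, key) instead of one row per event.

```python
from collections import Counter

def preaggregate(raw_events, bucket_s):
    """Roll high-volume raw events up into per-time-bucket counts.

    raw_events: iterable of (event_time_seconds, key) pairs.
    Returns rows of (bucket_start, key, count), ready for a bulk insert
    into an analytical store.
    """
    counts = Counter()
    for ts, key in raw_events:
        bucket = (ts // bucket_s) * bucket_s  # align to bucket boundary
        counts[(bucket, key)] += 1
    return [(bucket, key, n) for (bucket, key), n in sorted(counts.items())]

raw = [(0, "click"), (1, "click"), (2, "view"), (61, "click")]
print(preaggregate(raw, bucket_s=60))
# [(0, 'click', 2), (0, 'view', 1), (60, 'click', 1)]
```

At high event rates this is where the cost savings come from: the streaming engine absorbs the raw firehose, and the database only ingests the compacted rows.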
We're big fans of Nomad! It's what we use for scheduling in our cloud platform, due to its simplicity and scheduling speed. We also have great support for Kubernetes, though, as that's what most folks will be running.
Yes, Arroyo is an alternative to KSQL, although more in the design-space of Flink.
KSQL is pretty simple and easy to run if you already have Kafka, but it will be much more expensive and harder to scale due to its reliance on Kafka Streams for persistence and for shuffling data through the processing DAG.
https://doc.arroyo.dev/concepts#watermarks
See feldera.com for more info.
Really happy to see all the activity at the new repo. For a while there I thought the project was dying...
Would be interesting to somehow make Arroyo run the Nexmark benchmark so we can clearly compare to Flink and DBSP: https://liveandletlearn.net/post/vmware-take-3-experience-wi...
(I'm the co-creator of Arroyo, for context)
I'll try to get something set up to compare performance of the two on the same machine.
"Arroyo" is a Spanish word meaning creek, or stream
I live in a California town called Arroyo Grande ("big creek")
https://www.tinybird.co/
Disclaimer: I work for Tinybird.
I wish more products would support (or at least document how to run on) Nomad.
And with Confluent's embrace of Flink in the past year (https://www.confluent.io/blog/cloud-kafka-meets-cloud-flink-...) it's not clear that KSQL has much of a future.