ArkFlow – High-performance Rust stream processing engine

(github.com)

107 points | by chenquan 110 days ago

9 comments

bbminner 110 days ago
I work in one of large tech companies, and I can attest that while the idea seems very neat in theory (esp if your schemas are typed), and even if you define an api for defining new building blocks, sooner or later people realize that they need to dynamically adjust parts of the pipeline, and they write components to dynamically set and resolve these, and then other components on top of these components, and then complements for composing components - and now you forced yourself into implementing a weird and hard to debug functional programming language in yaml which is not a place someone wants to find themselves in :'(
one lesson I learned from this: any bit of logic that defines a computation should prefer explicit imperative code (eg python) over configuration, because you are likely to eventually implement an imperative language in that configuration language anyway
[-]
- bob1029 110 days ago
  > sooner or later people realize that they need to dynamically adjust parts of the pipeline
  The customer is the hard part in all of this, but there is respite if you are patient and careful with the tech.
  If you are in a situation where you need to go from one SQL database to another SQL database, the # of additional tools required should be zero. Using a merge statement & recursive CTEs per target table, you can transform any schema into any other. Most or all of the actual business logic can reside in the command text - how we filter & project data into the target system.
  If we accept the SQL-to-SQL case has a good general solution, I would then ask if it is possible to refactor all problems such that they wind up with this shape in the middle. All of that nasty systems code could then be focused more on loading and extracting data into and out of this regime where it can be trivially sliced & diced. Once you have something in Postgres or SQL Server, you are at the top of the hill. Everything adapts to you at that point. Talking to another instance of yourself - or something that looks & talks like you - is trivial.
  The other advantage with this path is that refactoring SQL scripts is something the customer (B2B) can directly manage in many situations. The entire pipeline can live in a single text file that you throw around an email chain. You don't have to teach them things like python, yaml or source control.
  [-]
  - bbminner 107 days ago
    In fact, I also converge to sql as a universal data transformation language. External analogs include things like duck db. Unfortunately, even with pipe syntax sql lacks expressiveness causing me to revert to c-style macros in sql (eg making table name dynamic), which in the long run makes things far less maintainable if anything.
- lucyjojo 110 days ago
  yeah, most projects when you spot a config file, its complexity will tend to scale with the increasing complexity of the domain you capture.
  so either it's very small/mature and you don't have to worry too much, or in the active development case your config files are pretty much the instruction set of some kind of logical foggy vm... and eventually a whole environment of tools etc. will "compile down" to your config files and you get a pain knot to endlessly massage...
- chenquan 110 days ago
  Thank you for your valuable experience, I will seriously think about what you said.
- NeutralForest 110 days ago
  Pretty much my take any time I see all the convoluted Bicep and YAML we have since there's a bunch of conditional logic and more in our pipelines.
- narad_muni 104 days ago
  Have been building same thing in rust, The part for processing data is quite complex, and I agree that coding in python is a better way
- _ink_ 110 days ago
  So far, this was exactly my experience as well. Well said.
simgt 110 days ago
I worked on something very similar for inference on video streams. To avoid the limitations of the config files mentioned in a sibling comment, I added a tool to convert a config to plain Rust. Your primary focus has to be the quality of the Rust API, and the config files are syntactic sugar for the beginning or simpler projects.
[-]
- chenquan 109 days ago
  Hi, friend. How did you do it specifically?
  [-]
  - narad_muni 104 days ago
    You can read a config file at runtime, but this will need recompilation with every config change, which is kind of useless.
abound 110 days ago
Very cool! Seems like a Rust version of something like Bento? [1] Have you done any benchmarking against similar stream processing tools?
[1] https://github.com/warpstreamlabs/bento
[-]
- regecks 110 days ago
  I haven’t benchmarked this, but I have recently benchmarked Spark Streaming vs self-rolled Go vs Bento vs RisingWave (which is also in Rust) and RW matched/exceeded self-rolled, and absolutely demolished Bento and Spark. Not even in the same ballpark.
  Highly recommend checking RisingWave out if you have real time streaming transformation use cases. It’s open source too.
  The benchmark was some high throughput low latency JSON transformations.
  [-]
  - chenquan 110 days ago
    Thanks for your recommendation.
- chenquan 110 days ago
  Yes, they are similar. ArkFlow is mainly based on DataFusion. Bento actually comes from Benthos. Currently, the ArkFlow project is in the early stages and no performance comparison test has been conducted, but I believe that ArkFlow will outperform them in the long run.
  Benthos: https://github.com/redpanda-data/benthos
  DataFusion: https://github.com/apache/datafusion
  [-]
  - agallego 110 days ago
    What we found with RPCN (redpanda connect)/old benthos is that most systems are very slow and only cpu intensive things require manual CPU instruction optimizations like the snowflake connector we wrote (https://docs.redpanda.com/redpanda-connect/components/output...). The bulk of it is just about completeness. Go feels like the Perl of the 2020s. Cool little libs for just about everything.
    [-]
    - chenquan 110 days ago
      Yes, RPCN (redpanda connect)/old benthos is very cool and can solve most of the scenes. Let me tell you quietly that I am using it too.
  - sakesun 110 days ago
    Arroyo is another one based on DataFusion
    [-]
    - chenquan 110 days ago
      Yes, Arroyo is entirely based on DataFusion, but ArkFlow is not exactly. In the future, ArkFlow will establish a plug-in ecosystem, allowing anyone to process data through plug-ins, not limited to DataFusion.
      [-]
      - necubi 110 days ago
        This isn't quite correct (I'm the creator of Arroyo). We use DataFusion to implement parts of our SQL support (in particular the planner and the expression interpreter) but we have our own dataflow and operators. By contrast Synnada[0] is directly built on DF.
        A contrast between Arroyo and systems like Benthos and from what I can tell ArkFlow, is that Arroyo is a "stateful" stream processing engine, which means that we can support things like windows, aggregates, and joins, with exactly-once semantics and fault tolerance, at the cost of significant additional complexity[1].
        [0] https://www.synnada.ai/ [1] https://www.arroyo.dev/blog/stateful-stream-processing
        [-]
        chenquan 110 days ago
        Arroyo has been designed with more comprehensive consideration.
        chenquan 110 days ago
        Sorry, please forgive me for not knowing Arroyo completely.
        [-]
        necubi 110 days ago
        No worries! We definitely rely heavily on DF (it’s an incredible project!). Part of what makes it so great is its modularity—it’s a toolkit for building sql systems, which is extremely cool.
        [-]
        chenquan 110 days ago
        Yes, whether it is DataFusion, Arroyo, or Bentos, these open source products have made me profit a lot.
        [-]
        esafak 109 days ago
        That made me chuckle. May you profit in all senses.
chenquan 110 days ago
High Performance: Built on Rust and Tokio async runtime, offering excellent performance and low latency Multiple Data Sources: Support for Kafka, MQTT, HTTP, files, and other input/output sources Powerful Processing Capabilities: Built-in SQL queries, JSON processing, Protobuf encoding/decoding, batch processing, and other processors Extensible: Modular design, easy to extend with new input, output, and processor components
[-]
- esafak 110 days ago
  Is there a product motivation; a deficiency you seek to rectify in existing solutions?
  [-]
  - chenquan 110 days ago
    I think a stream processing engine written in rust will have better performance, lower latency, more stable services, lower memory footprint, and cost savings. At the same time, ArkFlow is based on DataFusion implementation, which will put ArkFlow on a strong open source community.
    [-]
    - winwang 110 days ago
      Are there benchmarks you can share? Not discounting Rust, just wondering if you're already seeing some obvious numbers.
      [-]
      - chenquan 110 days ago
        Sorry, not yet, but this is the direction ArkFlow is working hard. Rust's own potential will also guide ArkFlow in this direction.
        [-]
        menaerus 110 days ago
        Rust is rather heavy on its copy/clone imposed semantics making it potentially less suitable for low-latency or large data volume processing workloads. Picking Rust for its performance potential only means that you're going to have a harder time beating other native performance-oriented stream processing engines written in either C or C++, if that is your goal of course.
        This logic
        > written in rust will have better performance, lower latency, ..., lower memory footprint
        is flawed and is cargo-cult programming unless you say what are you objectively comparing it against and how you intend to achieve those goals. Picking the right™ language just for the sake of these goals won't get you too far.
        [-]
        Jweb_Guru 110 days ago
        > Rust is rather heavy on its copy/clone imposed semantics making it potentially less suitable for low-latency or large data volume processing workloads. Picking Rust for its performance potential only means that you're going to have a harder time beating other native performance-oriented stream processing engines written in either C or C++, if that is your goal of course.
        There is absolutely nothing in Rust's semantics preventing you from writing high-performance data processing workloads in it, and in fact it's one of the best languages for that purpose. Beyond that, the usual barrier to entry for working on a product like this written in C++ is incredibly high in part because stability and safety are so critical for these products--which is one of the reasons that in practice they are often written in memory safe languages, where C++ is not even an option. Have you worked on any nontrivial Rust data processing product where "copy/clone imposed semantics" somehow prevented you from getting big performance wins? I'd be very curious to hear about this if so.
        [-]
        menaerus 109 days ago
        Stability and safety are the least of the concerns in data processing and database workloads. That's totally not the reason why we saw an increase of these systems during the 90s and early 00s written in Java or similar alternative languages. It was ease of use, low-entry bar into the ecosystem and generally developer pool accessibility. Otherwise, the cost is the main driver in infrastructure type of software and the reason why we see many of these rewritten exactly in C++. Rust is just another contender here, and it's usually because of the performance and a lot of hype recently, which is fair.
        [-]
        zozbot234 109 days ago
        > Stability and safety are the least of the concerns in data processing and database workloads. That's totally not the reason why we saw an increase of these systems during the 90s and early 00s written in Java or similar alternative languages.
        not_sure_if_serious.jpg
        To be extra clear about it (and to avoid pure snark, that's frowned upon here at HN): that's the kind of software (alongside a lot of general enterprise code) that got rewritten from C++ to Java, not the other way around. The increased safety of Java was absolutely a consideration. Java was the 'Rust' of the mid-to-late 1990s and 2000s, only a whole lot slower and clunkier than the actual Rust of today.
        [-]
        menaerus 109 days ago
        I am serious. C is a simple language but rather complicated to wrap your head around it since it requires the familiarity with low-level machine concepts. C++ ditto but with a difference that it is a rather complicated language with rather advanced programming language concepts - something that did not really exist at that time. So the net result was a very high entry barrier and this was the main reason, and not "safety" as you say, why many people were running away from C and C++ to Java/C# because those were the only alternatives we had at that time. I don't remember "safety" being mentioned at all during the past 20 years or so up until Rust came out. "Segfaults" were the 90s and 00s "safety" vocabulary but, as I said, it was a skill issue.
        Frenzy around the "safety" IMO is way too overhyped and when you and OP say that "safety" plays a huge role in data processing and database kernel source development, no - it is literally not even a 1% of time that a developer in that domain spends his time on. C and C++ are still used in those domains full on.
        > that's the kind of software (alongside a lot of general enterprise code) that got rewritten from C++ to Java, not the other way around
        Which C or C++ engines exactly got rewritten to Java? We can start from this list: https://db-engines.com/en/ranking
        [-]
        zozbot234 108 days ago
        So you agree that many people were absolutely "running away from C and C++ to Java/C#" but somehow this didn't involve any data processing code, even though arguably the main thing that internally-developed enterprise code does is data processing of some kind? OK, I guess.
        > Which C or C++ engines exactly got rewritten to Java?
        It's difficult to give names precisely because private enterprise development was involved. But essentially every non-trivial Java project starting from the mid-1990s or so, would've been written in C++ if it had been around in the late 1980s or earlier in the 1990s. It's just not very sensible to suppose that "data processing" as a broad area was somehow exempted from this. And if writing segfault-free code in C/C++ could be dismissed as a mere "skill issue" we wouldn't need Rust either. It's a wrong take today and it was just as wrong back then.
        (And yes, Java took significant steps forward in safety, including adding a GC - which means no wild pointers or double-free issues - and converting "null pointer" dereferences into a properly managed failure, with backtraces and all that. Just because the "safety" vocabulary wasn't around back then except for programming-theory experts, doesn't imply that people wouldn't care just as much about a guarantee of code being free from the old segfault errors.)
        chenquan 109 days ago
        > Stability and safety are the least of the concerns in data processing and database workloads.
        I'm curious how you came to this conclusion?
        [-]
        menaerus 109 days ago
        I have professional working experience in this domain.
        [-]
        chenquan 108 days ago
        Hi, then I'm curious what do you think people care most about in data processing and database workloads?
        [-]
        menaerus 108 days ago
        You're a servant to the business needs so whatever the business needs are at that moment. It's a vague answer probably not appealing to many engineers but that's what it really is. You're solving problems for your business stakeholder and for your business stakeholder clients.
        In another words, programming language is usually not at the very focus of daily development, given that there's always much bigger fish to fry in this domain, but if Rust provides such an undisputed benefit to your business model, while keeping the cost and risk of it viable for the business, then it's going to be a no-brainer. Chances are that this is going to be the case is very very low.
        So, my advice would rather be use the language whichever you prefer but don't dwell over it - rather put your focus on innovating workload-specific optimizations that are solving real-world issues that are palpable and easily proven/demonstrated. Study the challenges of storage or data processing engines or vectorized query execution algorithms. Depending on the domain problem you're trying to solve, make sure that your language of choice does not step in your way.
        [-]
        chenquan 107 days ago
        I’m grateful for your assistance.
        chenquan 110 days ago
        Why do you have to beat a native performance-oriented streaming engine written in C or C++?
        Currently, most of the mainstream stream processing engines are written in Java. Sorry, I may not add qualifiers to make you misunderstandings.
        Software does not have silver bullets, so does programming languages, and each has its own strengths. I also like to use go and Java to develop software.
        [-]
        menaerus 110 days ago
        So if you don't want to beat native engines in performance what is it that you're trying to solve but Java-based engines don't have? I think it's pretty important to set a vision upfront otherwise you're going to set yourself a trap for a quick failure.
        [-]
        chenquan 109 days ago
        Hi! Brother, I think I will seriously consider what you said and I am honored to communicate with you.
      - chenquan 110 days ago
        Welcome to follow the latest news from ArkFlow at any time and even participate.
heyheyyouyouqq 110 days ago
Reminds me of Pathway https://pathway.com/
[-]
- chenquan 110 days ago
  Good job, this is a rich reference.
- Keyframe 109 days ago
  yeah, without opentelemetry spying hopefully.
yu3zhou4 110 days ago
Good job brother! What do you think you need to implement before it is production-ready?
[-]
- chenquan 110 days ago
  Hi,brother! I'm still thinking, but it's certainly not now.
m00dy 110 days ago
Rust is increasingly becoming the default language for building infrastructure.
[-]
- chenquan 110 days ago
  I think stability, reliability and high performance are the foundation of infrastructure.
tzm 109 days ago
I love the simplicity of this design
[-]
- chenquan 109 days ago
  Welcome to follow anytime.
pstoll 109 days ago
Please tell me you are at least aware that tremor exists and that you rebuilt it on purpose?
https://www.tremor.rs/
[-]
- esafak 109 days ago
  I had not heard of it either! They have not been mentioned much here.
- chenquan 109 days ago
  I don't know it exists.
  [-]
  - Licenser 109 days ago
    Hello friend,
    I'm one of the maintainers of tremor, happy to get together and talk about rust event processing if you ever want to :)
    [-]
    - chenquan 109 days ago
      Yeah, I'm looking forward to having further conversations once I get to know tremor.