Neat! Having literally everything backed by object storage is The Dream, so this makes a lot of sense. To compare this to the options that are available (that aren't Kafka or Redis streams): I can imagine you could take the items you're writing to a stream, batch them, and write them into some sort of S3-backed data lake, something like Delta Lake, and then query them with DuckDB or whatever your OLAP SQL thing is. Or you could develop your own S3 schema that just saves these items to batched objects as they come in. So part of what S2 is saving you from is having to write your own acknowledgement system/protocol for batching these items, and the corresponding read ("consume") queries? Cool!
Yes, that is a reasonable way to think about it! And as s2-lite is designed as a single-node system, there is a natural source of truth on what the latest records are for consuming in real-time.
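To make the flow concrete, here is a rough client-side sketch of the append -> ack -> read loop that S2 handles for you. The endpoint paths, port, and JSON shapes below are placeholders I made up for illustration, not the actual API - check the docs for the real ones.

    use serde_json::json;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let client = reqwest::Client::new();
        let base = "http://localhost:8080"; // assumed address of a local s2-lite

        // Append a batch of records; the HTTP response is the durability ack,
        // i.e. the part you would otherwise hand-roll on top of a data lake.
        let ack = client
            .post(format!("{base}/streams/my-stream/records")) // hypothetical path
            .json(&json!({ "records": [{ "body": "hello" }, { "body": "world" }] }))
            .send()
            .await?
            .error_for_status()?;
        println!("append acked: {}", ack.text().await?);

        // Read ("consume") from an arbitrary position; no ack/offset
        // bookkeeping of your own required.
        let records = client
            .get(format!("{base}/streams/my-stream/records?seq_num=0")) // hypothetical query
            .send()
            .await?
            .text()
            .await?;
        println!("read: {records}");
        Ok(())
    }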
This is kind of what I've been working on: building tenancy on top of SQLite CDC to make a simple replayable SQLite for Marmot (https://github.com/maxpert/marmot). I personally think we have a synergy here; I'll drop by your discord.
> What value does S2 provide that simple TCP sockets do not?

This is a fair question. A stream here == a log. Every write with S2 implementations is durable before it is acknowledged, and it can be consumed in real-time or replayed from any position by multiple readers. The stream is at the granularity of discrete records, rather than a byte stream (although you can certainly layer either over the other).
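Put differently, the contract is roughly the following (toy types, made up purely to contrast with a socket):

    // A discrete-record log, as opposed to a TCP connection's byte stream.
    struct Record {
        seq_num: u64,  // position in the log; any reader can start or replay from here
        body: Vec<u8>, // one record, with boundaries preserved
    }

    trait RecordLog {
        // Returns only after the batch is durable; the sequence range is the ack.
        fn append(&mut self, batch: Vec<Vec<u8>>) -> std::io::Result<std::ops::Range<u64>>;

        // Many independent readers, each tracking its own position; replay is always possible.
        fn read(&self, from_seq_num: u64, limit: usize) -> std::io::Result<Vec<Record>>;
    }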
Your documentation needs improvement. It proudly mentions the alphabet soup of technologies you use, but it leaves me completely baffled about what s2 does, what problem s2 is trying to solve, or who the intended audience of s2 is.
Shoutout to CodesInChaos for suggesting that instead of a mere emulator, we should have an actually durable open source implementation – that is what we ended up building with s2-lite! https://news.ycombinator.com/item?id=42487592
Love this. Elegant and powerful. Stateful streams are surprisingly difficult to DIY, and as everything becomes a stream of tokens, this is a super useful tool to have in the toolbox.
Can this be used as an embedded lib instead of a separate binary as an API?

And am I understanding correctly that if I pointed 2 running instances of s2-lite at the same place in s3 there would be problems since slatedb is single writer?
Did not architect explicitly for that, but it should be viable. You could use the `Backend` directly, which is what the REST handlers call: https://docs.rs/s2-lite/latest/s2_lite/backend/struct.Backen...
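For anyone curious what the embedded route might look like, a made-up sketch follows - the constructor and method names here are hypothetical, the real signatures are whatever the docs.rs page above says:

    use s2_lite::backend::Backend;

    async fn embedded_append() -> Result<(), Box<dyn std::error::Error>> {
        // Hypothetical constructor: point it at in-memory storage or an S3 bucket.
        let backend = Backend::new(Default::default()).await?;     // hypothetical signature
        backend.create_stream("events").await?;                    // hypothetical method
        backend.append("events", vec![b"hello".to_vec()]).await?;  // hypothetical method
        Ok(())
    }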
Happy to accept contributions that make this more ergonomic.
> And am I understanding correctly that if I pointed 2 running instances of s2-lite at the same place in s3 there would be problems since slatedb is single writer?
SlateDB will fence the older writer, thanks to S3 conditional writes.
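The fencing idea, sketched with stand-in types (this is not SlateDB's actual code, just the shape of it): every manifest update is a conditional PUT keyed on the ETag the writer last saw, so a writer that has fallen behind gets a 412 and stops.

    enum PutOutcome {
        Ok { new_etag: String },
        PreconditionFailed, // S3 answers 412 when the If-Match ETag is no longer current
    }

    // Stand-in for an S3 PutObject carrying an `If-Match: <etag>` header.
    fn put_if_match(_key: &str, _body: &[u8], _if_match_etag: &str) -> PutOutcome {
        PutOutcome::PreconditionFailed // pretend a newer writer already bumped the manifest
    }

    fn try_commit(last_seen_etag: &str, manifest: &[u8]) -> Result<String, &'static str> {
        match put_if_match("manifest", manifest, last_seen_etag) {
            PutOutcome::Ok { new_etag } => Ok(new_etag), // still the single writer
            PutOutcome::PreconditionFailed => Err("fenced: a newer writer owns this log"),
        }
    }

    fn main() {
        // The older instance finds out on its next write and simply stops.
        match try_commit("etag-123", b"new manifest bytes") {
            Ok(etag) => println!("committed, new etag {etag}"),
            Err(msg) => eprintln!("{msg}"),
        }
    }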
Adding a database, multiple components, and Kubernetes to the equation seems like massive overengineering.
What value does S2 provide that simple TCP sockets do not?
Is this for like "making your own Twitch" or something, where streams have to scale to thousands-to-millions of consumers?
EDIT: no k8s required for s2-lite; it is just a single binary. That was an architectural note about our cloud service.
So you frame the data into records, save the frame somehow (maybe with fsync if you're doing it locally, or maybe you outsource it to S3 or S3-compatible storage?), then ack and start sending it to clients. Therefore every frame that's acked or sent to clients has already been saved.
Personally I'd add an application level hash to protect the integrity of the records but that's just me.
At first glance I wondered if a hash chain or Merkle tree might be useful but I think it's overkill. What exactly is the trust model? I get the sense this is a traditional client-server protocol (i.e., not p2p). Does it stream the streams over HTTP / HTTPS, or some custom protocol? Are s2 clients expected to be end-user web browsers, other instances of s2 or something else?
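For the application-level integrity idea, something like this on the producer side (sha2 here is just one choice; how the digest travels with the record - header, prefix, whatever - is up to the application):

    use sha2::{Digest, Sha256};

    struct SealedRecord {
        body: Vec<u8>,
        sha256_hex: String, // carried alongside the body, checked again on read
    }

    fn hex(bytes: &[u8]) -> String {
        bytes.iter().map(|b| format!("{b:02x}")).collect()
    }

    fn seal(body: Vec<u8>) -> SealedRecord {
        let sha256_hex = hex(&Sha256::digest(&body));
        SealedRecord { body, sha256_hex }
    }

    fn verify(rec: &SealedRecord) -> bool {
        hex(&Sha256::digest(&rec.body)) == rec.sha256_hex
    }

    fn main() {
        let rec = seal(b"some record body".to_vec());
        assert!(verify(&rec)); // corruption anywhere along the way flips this
    }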
Yes, this can be a good building block for broadcasting data streams.
s2-lite is single node, so to scale to that level, you'd need to add some CDN-ing on top.
s2.dev is the elastic cloud service, and it supports high fanout reads using Cachey (https://www.reddit.com/r/databasedevelopment/comments/1nh1go...)
And it has the durability of object storage rather than just local. SlateDB actually lets you also use the local FS; we will experiment with plumbing up the full range of options - right now it's just in-memory or an S3-compatible bucket.
> So I'd try to share as much of the frontend code (e.g. the GRPC and REST handlers) as possible between these.
Right on, this is indeed the case. The OpenAPI spec is also now generated off the REST handlers from s2-lite. We are getting rid of gRPC; s2-lite only supports the REST API (+ a gRPC-like session protocol over HTTP/2: https://s2.dev/docs/api/records/overview#s2s-spec)
I'm curious why and what challenges you had with gRPC. s2-lite looks cool!
We wanted S2 to be one API. We started out with gRPC and added REST - then realized REST is what is absolutely essential and what most folks care about. gRPC did give us bi-directional streaming for append/read sessions, so we added that as an optional enhancement to the corresponding POST/GET data plane endpoints (the S2S "S2-Session" spec I linked to above). A nice side win is that the stream resource is known from the requested URL, rather than having to wait for the first gRPC message.

The gRPC ecosystem is also not very uniform despite its popularity, comes with bloat, and is a bit of a mess in Python. I'm hoping QUIC enables a viable gRPC alternative to emerge.
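From the client side, a read session over plain HTTP can be as simple as holding one long-lived response open - roughly like this (the URL and framing are placeholders; the real wire format is whatever the S2S spec linked above defines):

    use futures_util::StreamExt;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        // reqwest with the `stream` feature enabled
        let resp = reqwest::Client::new()
            .get("http://localhost:8080/streams/my-stream/records?seq_num=0") // hypothetical URL
            .send()
            .await?
            .error_for_status()?;

        // The response body stays open while the session does; each chunk carries
        // framed records per the session protocol.
        let mut chunks = resp.bytes_stream();
        while let Some(chunk) = chunks.next().await {
            println!("received {} bytes", chunk?.len());
        }
        Ok(())
    }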
Will look into how to enable that option from s2-lite