After years of maintaining and using an application suite that relies on multicast for internal communication, I would hesitate to use "reliable" and "multicast" in the same sentence. Multicast is great in theory, but comes with so many pitfalls and grievances in practice, mostly due to unreliable multicast handling in switches, routers, network adapters and operating systems' TCP/IP stacks.
Just to mention a few headaches I've dealt with over the years: multicast sockets that join the wrong network adapter interface (due to adapter priorities), lost multicast membership after resume from sleep/hibernate, switches/routers just dropping multicast membership after a while (especially when running in VMs and on "enterprise" systems like SUSE Linux and Windows Server), all kinds of socket reuse problems, etc.
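For what it's worth, the first headache can be worked around by naming the interface explicitly instead of trusting adapter priorities. A minimal Java sketch (group address, port and NIC name are made-up values; adjust per host):

    import java.net.*;

    public class PinnedMulticastJoin {
        public static void main(String[] args) throws Exception {
            // Illustrative values: pick your own group, port and interface name.
            InetSocketAddress group = new InetSocketAddress(InetAddress.getByName("239.1.2.3"), 4446);
            NetworkInterface nic = NetworkInterface.getByName("eth0");

            try (MulticastSocket socket = new MulticastSocket(4446)) {
                // Join on an explicit interface so the OS can't pick the "wrong" adapter.
                socket.joinGroup(group, nic);

                byte[] buf = new byte[1500];
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet); // blocks until a datagram for the group arrives
                System.out.println("received " + packet.getLength() + " bytes");

                socket.leaveGroup(group, nic);
            }
        }
    }

It doesn't help with the sleep/hibernate or switch-side membership drops, of course.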
I don't even dare to think about how many hours I have wasted on the issues listed above. I would never rely on multicast again when developing a new system.
But that said, the application suite, a mission control system for satellites, works great most of the time (typically on small, controlled subnets, using physical installations instead of VMs) and has served us well.
I recently finished eight years at a place where everyone used multicast every day. It consistently worked very well (except for the time the networks team decided one of my multicast groups was against policy and firewalled it without warning).
But this was because the IT people put effort into making it work well. They knew we needed multicast, so they made sure multicast worked. I have no idea what that involved, but presumably it meant buying switches that handle multicast reliably, configuring them properly, and doing whatever host-level hardware selection and configuration was required.
In a previous job, we tried to use multicast without having done any groundwork. Just opened sockets and started sending. It did not go so well - fine at first, but then packets started to go missing, and we spent days debugging and finding the obscure errors in our firewall config. In the end we did get it working, but I wouldn't do it again. Multicast is a commitment, and we weren't ready to make it.
https://web.mit.edu/Saltzer/www/publications/endtoend/endtoe...
Yep - the main issue is that multicast is so sparsely used that you can go through most of a networking career with minimal exposure to it beyond a particular peer link. Once you scale support to multi-hop, institutional knowledge becomes critical because individual knowledge is so spotty.
Aeron is very popular in large financial trading systems. Maybe because multicast is already commonplace there (that's how most exchanges distribute market data).
Printers seem to be a solved problem, and they mostly use zeroconf, which relies on mDNS (multicast DNS). I have done a bit of work in that area and didn't run into the problems you mentioned.
However, I had only semi-strict control of my network, though I used plenty of random routers for testing.
Link-local multicast like mDNS can be a bit simpler to wrangle than routed multicast. In the link-local case, a lot of the interop failure cases with network equipment just devolve into "and it turned into a broadcast" instead of "and it wasn't forwarded". You can still run into some multiple-interface issues, though.
One of the greatest things about Aeron is the simple fact that it exists. If you go to, e.g., StackOverflow or another place with the same patronizing attitude of "experts", they will tell you no one needs UDP, even in the same DC on a reliable network, and especially that no one needs multicast. Any sane person should use TCP with a loop over multiple destinations, they would say, and one should measure before optimizing, they would say. But they themselves probably never had a chance to measure. Yet the Aeron guys, who are real experts in low-latency systems, just delivered an ultra-fast thing that is quite simple in design.
Aeron latency histograms vs TCP are quite nice in the same DC on enterprise-grade networking hardware. But it really only makes sense to use if a single-digit or low-double-digit microsecond latency improvement at P50 is worth the effort, or if the long tail with TCP is a dealbreaker, as Aeron has much nicer P99+ regardless of how well optimized a TCP setup is. Also, if one can leverage multicast that's nice, but it's disabled in more places than just clouds, and Aeron works fine with unicast to N.
However, there are gotchas with threading and configuration overall. A cross-DC setup may surprise you in a bad way if buffers are not configured to account for the bandwidth-delay product. Any packet loss on a high-latency network leads to a nasty NAK storm that is slow to recover under load. It's better to set the highest QoS and ensure the network never drops packets, e.g. by calculating the real peak instantaneous load vs hardware capacity. Relative latency savings cross-DC become less interesting the longer the distance, so there's nothing wrong with TCP there. Another note: ZMQ, for example, is slow not because of TCP but because of its internals - almost 2x slower for small packets than raw, well-tuned TCP sockets, which are not that bad vs Aeron. Also, Aeron is not for sending big blobs around; it's best used with small payloads.
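To make the buffer sizing concrete, here is the back-of-envelope math; the link speed and RTT below are assumptions for illustration, and the property names are from the Aeron configuration docs (verify against your version):

    // Bandwidth-delay product sizing for a cross-DC channel (sketch, assumed numbers).
    public class CrossDcBufferSizing {
        public static void main(String[] args) {
            double linkBitsPerSec = 10e9; // assume a 10 Gbps cross-DC link
            double rttSeconds = 0.04;     // assume a 40 ms round trip
            double bdpBytes = (linkBitsPerSec / 8) * rttSeconds;
            System.out.printf("BDP = %.0f MB in flight%n", bdpBytes / 1e6); // ~50 MB
            // The receiver window and term buffers must cover that much in-flight
            // data (e.g. aeron.rcv.initial.window.length, aeron.term.buffer.length),
            // otherwise the sender stalls and any loss is slow to repair.
        }
    }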
Aeron is designed with mechanical sympathy in mind by the guys who coined the term and have been evangelizing it for years, and it shows. There's lots to learn from the design & implementation (tons of resources on the web) even without using it in prod.
One time I was setting up a JBoss cluster on VMware boxes - two right next to each other in the rack. JBoss used (uses?) multicast discovery to form the cluster, and the VMs on different boxes just couldn't find each other.
Another time I had a backup job using uftp (a multicast file transfer tool) and it was a similar story. Systems literally sitting one rack over couldn't talk.
We involved all of our CC*-certified guys, wasted a week, and eventually just used the explicit command line switches to configure the cluster.
The hardware is not up to the task, physical or virtual, as far as I can tell.
https://archive.fosdem.org/2023/schedule/event/om_virt/attac...
"...Cut through mode reduces switch latency at the risk of decreased reliability. Packet transmissions can begin immediately after the destination address is processed. Corrupted frames may be forwarded because packet transmissions begin before CRC bytes are received..."
> Also, if one can leverage multicast that's nice, but it's disabled in more places than just clouds, and Aeron works fine with unicast to N.
How did that happen? It seems multicast is already built in - just use that for massive broadcast. Is TCP used just so we can get an ACK that the data was received? Multicast and UDP shouldn't be a problem if we just want masses of people to listen in, but if we also want to track those people, that is another story.
From a user perspective, use UDP/multicast all the way. Let the client request something if it is dropped or missing; otherwise just multicast everything.
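That is essentially NAK-based recovery, which is what reliable-UDP transports like Aeron do. A toy sketch of the receive side, using a made-up wire format (an 8-byte sequence-number prefix) and placeholder addresses:

    import java.net.*;
    import java.nio.ByteBuffer;

    public class NakReceiver {
        public static void main(String[] args) throws Exception {
            try (DatagramSocket socket = new DatagramSocket(9000)) { // placeholder data port
                SocketAddress sender = new InetSocketAddress("10.0.0.1", 9001); // placeholder control address
                long expected = 0;
                byte[] buf = new byte[1500];
                while (true) {
                    DatagramPacket p = new DatagramPacket(buf, buf.length);
                    socket.receive(p);
                    long seq = ByteBuffer.wrap(p.getData(), p.getOffset(), 8).getLong();
                    // Gap detected: ask the sender to retransmit each missing sequence number.
                    for (long missing = expected; missing < seq; missing++) {
                        byte[] nak = ByteBuffer.allocate(8).putLong(missing).array();
                        socket.send(new DatagramPacket(nak, nak.length, sender));
                    }
                    expected = Math.max(expected, seq + 1);
                    // ... hand the payload after the 8-byte header to the application ...
                }
            }
        }
    }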
> Relative latency savings cross-DC become less interesting the longer the distance, so there's nothing wrong with TCP there.
A long fat pipe sees dramatic throughput drops with TCP and relatively small packet loss. Possibly we were holding it wrong; I would love to know if there is some definitive guide to doing it right. We had good success with UDT.
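The usual first step for plain TCP is sizing socket buffers to the bandwidth-delay product (and raising the OS caps, e.g. net.core.rmem_max, to match). A sketch with assumed numbers:

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class LongFatPipeSocket {
        public static Socket open(String host, int port) throws Exception {
            // Assume a 10 Gbps path with 80 ms RTT: BDP = 10e9/8 * 0.08 = 100 MB.
            int bdpBytes = 100 * 1024 * 1024;
            Socket s = new Socket();
            // Set before connect() so the kernel can negotiate a large enough window scale.
            s.setReceiveBufferSize(bdpBytes);
            s.setSendBufferSize(bdpBytes);
            s.connect(new InetSocketAddress(host, port));
            return s;
        }
    }

Beyond buffer sizing, the congestion control algorithm matters: loss-based ones collapse on long fat pipes, which is the problem BBR-style algorithms target.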
I would not recommend using Aeron on long fat pipes with a chance of packet loss for high throughput. It has been several years since I stress tested this; maybe there have been improvements - I saw some work on that in release notes afterwards. But that was the worst case, as recovery was slow.
I would think of UDP with redundant encoding / FEC, to avoid retransmits.
[0] https://en.m.wikipedia.org/wiki/TCP_congestion_control#TCP_B...
Looking at their transport protocol benchmarks on AWS [1][2], they average ~3 million 288-byte messages per second on c5.9xlarge (36 vCPU) instances. When increasing to their MTU limit of 1,344 bytes per message that drops to 700 thousand messages per second [2] or ~1 GB/s (~7.5 Gbps) over 36 cores. That is just ~200 Mbps per core assuming it is significantly parallel.
Looking at their transport protocol benchmarks on GCP [3], they average ~4.7 million 288 byte messages per second on C3 (unspecified type) instances. Assuming it scales proportionally to the AWS test, as they do not provide a max message size throughput number for GCP, that would be ~1 million messages per second or ~1.5 GB/s (~12 Gbps).
TCP stacks can routinely average 10 Gbps per individual core even without aggressive tuning, but Aeron appears to struggle to achieve parity with 36x as many cores. That is not to say that there might not be other advantages to Aeron such as latency, multicast support, or whatever their higher levels are doing, but 36x worse performance than basic off-the-shelf protocols does not sound like "high performance".
[1] https://hub.aeron.io/hubfs/Aeron-Assets/Aeron_AWS_Performanc... (page 13)
[2] https://aws.amazon.com/blogs/industries/aeron-performance-en... (search "Test Results")
[3] https://aeron.io/other/aeron-google-cloud-performance-testin...
Aeron benchmarks are open source and can be found here [1]. I was running both AWS and GCP tests and wrote the benchmarks.
The particular transport benchmark mentioned here is an echo test where a message is sent between two machines and echoed back to the sender. This is a single threaded test using a single stream (flow) between publisher and subscriber. On each box there is one application thread that sends and receives data and a standalone media driver component running in a DEDICATED mode (i.e. with 3 separate threads: conductor/sender/receiver).
AWS limits single-flow traffic [2]. This test was using the cluster placement group policy, which has a nominal limit of 10 Gbps. However, this is true only for TCP. For UDP the actual limit is 8 Gbps when a CPG is used (this is not documented anywhere).
Aeron adds a 32 byte header to each message so 288 bytes payload becomes 320 bytes on the network. At a 3M msgs/sec rate Aeron was sending data at 7.68 Gbps (which is 96% of 8 Gbps limit) on a single CPU core. At that rate it was still achieving p99 < 1ms latency target.
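For anyone checking the arithmetic:

    public class RateCheck {
        public static void main(String[] args) {
            long msgsPerSec = 3_000_000L;
            int payloadBytes = 288, headerBytes = 32;              // Aeron data header
            double gbps = msgsPerSec * (payloadBytes + headerBytes) * 8 / 1e9;
            System.out.println(gbps + " Gbps");                    // 7.68, i.e. 96% of the 8 Gbps UDP cap
        }
    }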
We chose the `c5n.9xlarge` instance for this test because it reserves an entire CPU socket for a single VM. This was done to avoid interference from other VMs, i.e. the noisy-neighbour problem.
The GCP test was done on the `c3-highcpu-88` instance type. Again, choosing an instance with so many cores was done to avoid sharing a CPU socket with other VMs.
Aeron can easily saturate a 10 GbE NIC even without kernel bypass (given proper configuration). However, this is not a very useful test. The much harder problem is sending small/medium-sized messages at high rates and handling bursts of data.
Aeron transport was designed to achieve both low and predictable latency and high throughput at the same time. The two are not at odds with each other.
[1] https://github.com/aeron-io/benchmarks
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
A c5.9xlarge instance has 12 gigabits/s of bandwidth available [1]. A fair comparison to TCP would need to look at end to end message delivery, including framing / parsing of the messages.
On the face of it, the ability to use the majority of the bandwidth of the instance with small messages is impressive.
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...
It's not. The Aeron media driver has 1 RX thread and 1 TX thread in the most heavily threaded configuration (+1 admin thread).
It’s easy to focus on the reliable UDP protocol and the multicast support, but what’s important about Aeron is its system architecture. As noted elsewhere, it all combines together in “mechanical sympathy” and once you have that you can interconnect with high performance transports [1].
So you set up an Aeron Server on your machine. That handles all external network communication at the message layer (NAKs, etc.). Every "Aeron Client" process communicates with that Server to stand up shared memory pipes. The messaging client solely deals with those pipes (cache efficient, etc.). They subscribe to channels at host:port/topic, but it is not direct network delivery to them (the client is transport agnostic, besides the subscription). The network service directs data to the client's shared memory queue.
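In code, the moving parts look roughly like this (based on the public Aeron samples; the channel and stream id are arbitrary, and API details vary by version):

    import io.aeron.Aeron;
    import io.aeron.Publication;
    import io.aeron.Subscription;
    import io.aeron.driver.MediaDriver;
    import org.agrona.concurrent.UnsafeBuffer;
    import java.nio.charset.StandardCharsets;

    public class AeronHelloWorld {
        public static void main(String[] args) throws Exception {
            try (MediaDriver driver = MediaDriver.launch();   // the standalone "server" (media driver)
                 Aeron aeron = Aeron.connect();               // client talks to the driver via shared memory
                 Subscription sub = aeron.addSubscription("aeron:udp?endpoint=localhost:40123", 10);
                 Publication pub = aeron.addPublication("aeron:udp?endpoint=localhost:40123", 10)) {
                byte[] msg = "hello".getBytes(StandardCharsets.UTF_8);
                UnsafeBuffer buffer = new UnsafeBuffer(new byte[64]);
                buffer.putBytes(0, msg);
                while (pub.offer(buffer, 0, msg.length) < 0) Thread.yield(); // retry until accepted
                while (sub.poll((buf, offset, length, header) ->
                        System.out.println(buf.getStringWithoutLengthUtf8(offset, length)), 1) == 0) {
                    Thread.yield();
                }
            }
        }
    }

For multicast fan-out the channel URI changes to a multicast group; the API stays the same.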
Once you have that base networked-IPC setup, you can maximize performance with it using reliable UDP and fan out common data (e.g. market data) using multicast.
Then Aeron further adds Archiving (to persist) and Clustering (to scale late join / state replication) components. This stuff works well with kernel bypass :)
[1] It could work with RDMA as well, but the ticket regarding that was closed 8 years ago. Maybe if there was AI/GPU workload interest.
A whole different tamale, but Apache Iggy feels similar-ish, as a persistent message streaming system. It also uses UDP, this time with QUIC, which is a pretty future-forward protocol that I love to see targeted (it also has a REST API and its own binary protocol). https://iggy.apache.org/
At least in terms of open-source, I often say that the Aeron code base is one of the best projects to study in terms of software quality (especially Java). The real-logic (now Adaptive) guys are a skilled and knowledgeable bunch.
It’s a superb codebase for sure, I’ve spent dozens or hundreds of hours reading it. But I would caution against using it as an example of Java the language because of how non-idiomatic it can be to achieve the absurd performance it does. Most Java shouldn’t use a lot of the techniques they use (and the authors would be the first to admit that).
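To give a flavor of what "non-idiomatic" means: rather than deserializing bytes into objects, the codebase leans on zero-allocation flyweights that read fields straight out of buffers. A toy example with a hypothetical message layout:

    import org.agrona.DirectBuffer;

    // One reusable instance reads every message: no per-message allocation, no GC pressure.
    public final class PriceFlyweight {
        private DirectBuffer buffer;
        private int offset;

        public PriceFlyweight wrap(DirectBuffer buffer, int offset) {
            this.buffer = buffer;
            this.offset = offset;
            return this;
        }

        public long instrumentId() { return buffer.getLong(offset); }     // bytes 0-7
        public long priceTicks()   { return buffer.getLong(offset + 8); } // bytes 8-15
    }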
In general most people don't do networking right, especially within real-time systems.
As people discover the problems with their approach, they rewrite it.
In my experience, it's generally better not to have transparent reliability for lost data either. The correct handling should be application-specific.
With Aeron, I would think the focus should be on 'efficient', not reliable. Within a datacenter it is orders of magnitude faster than plain UDP. You get this crazy efficiency at the cost of reduced flexibility in how you send messages.
I haven't seen the makers of Aeron (or anyone else) claim it's "orders of magnitude faster than plain UDP." Do you have a link to something about this? It doesn't pass the smell test for me unless you're talking specifically about using Aeron within a single machine (where it uses shared memory instead of the network)...but you said "Within a datacenter" not "Within a computer."
It has been a while since I saw their presentation. What I remember is that Aeron has an insanely low delay even at the high percentiles - orders of magnitude better. Throughput for a large stream of data is probably similar to plain UDP.
Please correct me if I remember wrong.
It does have great tail latency. But it's not a silver bullet - it's careful engineering. And you pay for the latency with spinning threads. It's the architecture that makes it stand out. In the end, it's the same old UDP sockets, not even io_uring, at least in the free public version. But one can use LD_PRELOAD (e.g. a kernel-bypass sockets library) if the hardware supports that trick - though again, that's not specific to Aeron.
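The spinning-threads trade-off looks like this in code (a sketch using Agrona's idle strategies):

    import io.aeron.Subscription;
    import org.agrona.concurrent.BusySpinIdleStrategy;
    import org.agrona.concurrent.IdleStrategy;

    public class SpinningPoller {
        // Pegs a core at 100% to shave wake-up latency off every message; swap in a
        // yielding or backoff strategy to trade latency back for CPU.
        public static void pollLoop(Subscription subscription) {
            IdleStrategy idle = new BusySpinIdleStrategy();
            while (!Thread.currentThread().isInterrupted()) {
                int fragments = subscription.poll(
                        (buffer, offset, length, header) -> { /* handle message */ }, 10);
                idle.idle(fragments); // spins when no fragments arrived
            }
        }
    }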