It's Always TCP_NODELAY

(brooker.co.za)

143 points | by eieio 5 hours ago

17 comments

  • anonymousiam 3 hours ago
    The Nagle algorithm was created back in the day of multi-point networking. Multiple hosts were all tied to the same communications (Ethernet) channel, so they would use CSMA (https://en.wikipedia.org/wiki/Carrier-sense_multiple_access_...) to avoid collisions. CSMA is no longer necessary on Ethernet today because all modern connections are point-to-point with only two "hosts" per channel. (Each host can have any number of "users.") In fact, most modern (copper) (Gigabit+) Ethernet connections have both ends both transmitting and receiving AT THE SAME TIME ON THE SAME WIRES. A hybrid is used on the PHY at each end to subtract what is being transmitted from what is being received. Older (10/100 Base-T) can do the same thing because each end has dedicated TX/RX pairs. Fiber optic Ethernet can use either the same fiber with different wavelengths, or separate TX/RX fibers. I haven't seen a 10Base-2 Ethernet/DECnet interface for more than 25 years. If any are still operating somewhere, they are still using CSMA. CSMA is also still used for digital radio systems (WiFi and others). CSMA includes a "random exponential backoff timer" which does the (poor) job of managing congestion. (More modern congestion control methods exist today.) Back in the day, disabling the random backoff timer was somewhat equivalent to setting TCP_NODELAY.

    Dumping the Nagle algorithm (by setting TCP_NODELAY) almost always makes sense and should be enabled by default.

    • gerdesj 11 minutes ago
      I think you are confusing network layers and their functionality.

      "CSMA is no longer necessary on Ethernet today because all modern connections are point-to-point with only two "hosts" per channel."

      Ethernet really isn't ptp. You will have a switch at home (perhaps in your router) with more than two ports on it. At layer 1 or 2 how do you mediate your traffic, without CSMA? Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - its how the actual electrical signals are mediated?

      "Ethernet connections have both ends both transmitting and receiving AT THE SAME TIME ON THE SAME WIRES."

      That's full duplex as opposed to half duplex.

      Nagle's algo has nothing to do with all that messy layer 1/2 stuff but is at the TCP layer and is an attempt to batch small packets into fewer larger ones for a small gain in efficiency. It is one of many optimisations at the TCP layer, such as Jumbo Frames and mini Jumbo Frames and much more.

      • anonymousiam 9 minutes ago
        It's P2P as far as the physical layer (L1) is concerned.

        Usually, full duplex requires two separate channels. The introduction of a hybrid on each end allows the use of the same channel at the same time.

        Some progress has been made in doing the same thing with radio links, but it's harder.

        Nagle's algorithm is somewhat intertwined with the backoff timer in the sense that it prevents transmitting a packet until some condition is met. IIRC, setting the TCP_NODELAY flag will also disable the backoff timer, at least this is true in the case of TCP/IP over AX25.

    • immibis 1 hour ago
      Nagle is quite sensible when your application isn't taking any care to create sensibly-sized packets, and isn't so sensitive to latency. It avoids creating stupidly small packets unless your network is fast enough to handle them.
      • silisili 41 minutes ago
        At this point, this is an application level problem and not something the kernel should be silently doing for you IMO. An option for legacy systems or known problematic hosts fine, but off by default and probably not a per SOCKOPT.

        Every modern language has buffers in their stdlib. Anyone writing character at a time to the wire lazily or unintentionally should fix their application.

    • Hikikomori 1 hour ago
      Just to add, ethernet uses csma/cd, WiFi uses csma/ca.

      Upgraded our DC switches to new ones around 2014 and needed to keep a few old ones because the new ones didn't support 10Mbit half duplex.

      • anonymousiam 53 minutes ago
        Thanks for the clarification. They're so close to being the same thing that I always call it CSMA/CD. Avoiding a collision is far more preferable than just detecting one.

        Yeah, many enterprise switches don't even support 100Base-T or 10Base-T anymore. I've had to daisy chain an old switch that supports 100Base-T onto a modern one a few times myself. If you drop 10/100 support, you can also drop HD (simplex) support. In my junk drawer, I still have a few old 10/100 hubs (not switches), which are by definition always HD.

  • eieio 4 hours ago
    I found this article while debugging some networking delays for a game that I'm working on.

    It turns out that in my case it wasn't TCP_NODELAY - my backend is written in go, and go sets TCP_NODELAY by default!

    But I still found the article - and in particular Nagle's acknowledgement of the issues! - to be interesting.

    There's a discussion from two years ago here: https://news.ycombinator.com/item?id=40310896 - but I figured it'd been long enough that others might be interested in giving this a read too.

    • miduil 3 hours ago
      There is also a good write-up [0] by Julia Evans. We ran into this with DICOM storescp, which is a chatty protocol and TCP_NODELAY=1 makes the throughput significantly better. Since DICOM is often used in a LAN, that default just makes it unnecessarily worse.

      [0]: https://jvns.ca/blog/2015/11/21/why-you-should-understand-a-...

      [1]: https://news.ycombinator.com/item?id=10607422

      • eieio 56 minutes ago
        Oh! Thank you for this! I love Julia’s writing but haven’t read this post.
    • sail0rm00n 2 hours ago
      Any details on the game you’ve been working on? I’ve been really enjoying Ebitengine and Golang for game dev so would love to read about what you’ve been up to!
  • buybackoff 17 minutes ago
    Then at a lower level and smaller latencies it's often interrupt moderation that must be disabled. Conceptually similar idea to the Nagle algo - coalesce overheads by waiting, but on the receiving end in hardware.
  • x2rj 2 hours ago
    I've always thought a problem with Nagel's algorithm is, that the socket API does not (really) have a function to flush the buffers and send everything out instantly, so you can use that after messages that require a timely answer.

    For stuff where no answer is required, Nagel's algorithm works very well for me, but many TCP channels are mixed use these days. They send messages that expect a fast answer and other that are more asynchronous (from a users point of view, not a programmers).

    Wouldn't it be nice if all operating systems, (home-)routers, firewalls and programming languages would have high quality implementations of something like SCTP...

    • lysace 12 minutes ago
      Agreed. Something like sync(2)/syncfs(2) for filesystems.

      Seems like there's been a disconnect between users and kernel developers here?

  • dllthomas 2 hours ago
    Wildly, the Polish word "nagle" (pronounced differently) means "suddenly" or "all at once", which is just astonishingly apropos for what I'm almost certain is pure coincidence.
  • otterley 2 hours ago
    (2024) - previously discussed at https://news.ycombinator.com/item?id=40310896
  • Veserv 42 minutes ago
    The problem is actually that nobody uses the generic solution to these classes of problems and then everybody complains that the special-case for one set of parameters works poorly for a different set of parameters.

    Nagle’s algorithm is just a special case solution of the generic problem of choosing when and how long to batch. We want to batch because batching usually allows for more efficient batched algorithms, locality, less overhead etc. You do not want to batch because that increases latency, both when collecting enough data to batch and because you need to process the whole batch.

    One class of solution is “Work or Time”. You batch up to a certain amount of work or up to a certain amount of time, whichever comes first. You choose your amount of time as your desired worst case latency. You choose your amount of work as your efficient batch size (it should be less than max throughput * latency, otherwise you will always hit your timer first).

    Nagle’s algorithm is “Work” being one packet (~1.5 KB) with “Time” being the time until all data gets a ack (you might already see how this degree of dynamism in your timeout might pose a problem already) which results in the fallback timer of 500 ms when delayed ack is on. It should be obvious that is a terrible set of parameters for modern connections. The problem is that Nagle’s algorithm only deals with the “Work” component, but punts on the “Time” component allowing for nonsense like delayed ack helpfully “configuring” your effective “Time” component to a eternity resulting in “stuck” buffers which is what the timeout is supposed to avoid. I will decline to discuss the other aspect which is choosing when to buffer and how much of which Nagle’s algorithm is again a special case.

    Delayed ack is, funnily enough, basically the exact same problem but done on the receive side. So both sides set timeouts based on the other side going first which is obviously a recipe for disaster. They both set fixed “Work”, but no fixed “Time” resulting in the situation where both drivers are too polite to go first.

    What should be done is use the generic solutions that are parameterized by your system and channel properties which holistically solve these problems which would take too long to describe in depth here.

  • kazinator 3 hours ago
    > The bigger problem is that TCP_QUICKACK doesn’t fix the fundamental problem of the kernel hanging on to data longer than my program wants it to.

    Well, of course not; it tries to reduce the problem of your kernel hanging on to an ack (or genearting an ack) longer than you would like. That pertains to received data. If the remote end is sending you data, and is paused due to filling its buffers due to not getting an ack from you, it behooves you to send an ack ASAP.

    The original Berkeley Unix implementation of TCP/IP, I seem to recall, had a single global 500 ms timer for sending out acks. So when your TCP connection received new data eligible for acking, it could be as long as 500 ms before the ack was sent. If we reframe that in modern realities, we can imagine every other delay is negligible, and data is coming at the line rate of a multi gigabit connection, 500 ms represents a lot of unacknowledged bits.

    Delayed acks are similar to Nagle in spirit in that they promote coalescing at the possible cost of performance. Under the assumption that the TCP connection is bidirectional and "chatty" (so that even when the bulk of the data transfer is happening in one direction, there are application-level messages in the other direction) the delayed ack creates opportunities for the TCP ACK to be piggy backed on a data transfer. A TCP segment carrying no data, only an ACK, is prevented.

    As far as portability of TCP_QUICKACK goes, in C code it is as simple as #ifdef TCP_QUICKACK. If the constant exists, use it. Otherwise out of luck. If you're in another language, you have to to through some hoops depending on whether the network-related run time exposes nonportable options in a way you can test, or whether you are on your own.

  • martingxx 3 hours ago
    I've always thought that Nagle's algorithm is putting policy in the kernel where it doesn't really belong.

    If userspace applications want to make latency/throughput tradeoffs they can already do that with full awareness and control using their own buffers, which will also often mean fewer syscalls too.

    • klempner 2 hours ago
      The actual algorithm (which is pretty sensible in the absence of delayed ack) is fundamentally a feature of the TCP stack, which in most cases lives in the kernel. To implement the direct equivalent in userspace against the sockets API would require an API to find out about unacked data and would be clumsy at best.

      With that said, I'm pretty sure it is a feature of the TCP stack only because the TCP stack is the layer they were trying to solve this problem at, and it isn't clear at all that "unacked data" is particularly better than a timer -- and of course if you actually do want to implement application layer Nagle directly, delayed acks mean that application level acking is a lot less likely to require an extra packet.

      • j16sdiz 39 minutes ago
        If your application need that level of control, you probably want to use UDP and have something like QUIC over it.

        BTW, Hardware based TCP offloads engine exists... Don't think they are widely used nowadays though

        • nly 12 minutes ago
          Hardware TCP offloads usually deal with the happy fast path - no gaps or out of order inbound packets - and fallback to software when shit gets messy.

          Widely used in low latency fields like trading

    • ghshephard 2 hours ago
      It's kind of in User Space though - right? When an application opens a socket - it decides whether to open it with TCP_NODELAY or not. There isn't any kernel/os setting - it's done on a socket by socket basis, no?
      • naught00 2 hours ago
        TCP_NODELAY is implemented within the kernel. A socket can decide whether to use it or not.
    • kvemkon 2 hours ago
      The tradeoff on one program can influence the other program needing perhaps the opposite decision of such tradeoff. Thus we need the arbiter in the kernel to be able to control what is more important for the whole system. So my guess.
  • hathawsh 1 hour ago
    Ha ha, well that's a relief. I thought the article was going to say that enabling TCP_NODELAY is causing problems in distributed systems. I am one of those people who just turn on TCP_NODELAY and never look back because it solves problems instantly and the downsides seem minimal. Fortunately, the article is on my side. Just enable TCP_NODELAY if you think it's a good idea. It apparently doesn't break anything in general.
  • saghm 2 hours ago
    I first ran into this years ago after working on a database client library as an intern. Having not heard of this option beforehand, I didn't think to enable it in the connections the library opened, and in practice that often led to messages in the wire protocol being entirely ready for sending without actually getting sent immediately. I only found out about it later when someone using it investigated why the latency was much higher than they expected, and I guess either they had run into this before or were able to figure out that it might be the culprit, and it turned out that pretty much all of the existing clients in other languages set NODELAY unconditionally.
  • vsgherzi 3 hours ago
    https://oxide-and-friends.transistor.fm/episodes/mr-nagles-w...

    oxide and friends episode on it! It's quite good

  • foltik 1 hour ago
    Why doesn’t linux just add a kconfig that enables TCP_NODELAY system wide? It could be enabled by default on modern distros.
    • indigodaddy 39 minutes ago
      Looks like there is a sysctl option for BSD/MacOS but Linux it must be done at application level?
  • hsn915 1 hour ago
    Wouldn't distributed systems benefit from using UDP instead of TCP?
    • 0xbadcafebee 57 minutes ago
      Only if you're sending data you don't mind losing and getting out of order
  • rowanG077 2 hours ago
    I fondly remember a simple simulation project we had to do with a group of 5 students in a second year class which had a simulation and some kind of scheduler which communicated via TCP. I was appalled at the perfomance we were getting. Even on the same machine it was way too slow for what it was doing. After hours of debugging in turned out it was indeed Nagle's algorithm causing the slowness, which I never heard about at the time. Fixed instantly with TCP_NODELAY. It was one of the first times it was made abundantly clear to me the teachers at that institution didn't know what they were teaching. Apparently we were the only group that had noticed the slow performance, and the teachers had never even heard of TCP_NODELAY.
  • jonstewart 2 hours ago
    <waits for animats to show up>
    • Animats 13 minutes ago
      OK, I suppose I should say something. I've already written on this before, and that was linked above.

      You never want TCP_NODELAY off at the sending end, and delayed ACKs on at the receiving end. But there's no way to set that from one end. Hence the problem.

      Is TCP_NODELAY off still necessary? Try sending one-byte TCP sends in a tight loop and see what it does to other traffic on the same path, for, say, a cellular link. Today's links may be able to tolerate the 40x extra traffic. It was originally put in as a protection device against badly behaved senders.

      A delayed ACK should be thought of as a bet on the behavior of the listening application. If the listening application usually responds fast, within the ACK delay interval, the delayed ACK is coalesced into the reply and you save a packet. If the listening application does not respond immediately, a delayed ACK has to actually be sent, and nothing was gained by delaying it. It would be useful for TCP implementations to tally, for each socket, the number of delayed ACKs actually sent vs. the number coalesced. If many delayed ACKs are being sent, ACK delay should be turned off, rather than repeating a losing bet.

      This should have been fixed forty years ago. But I was out of networking by the time this conflict appeared. I worked for an aerospace company, and they wanted to move all networking work from Palo Alto to Colorado Springs, Colorado. Colorado Springs was building a router based on the Zilog Z8000, purely for military applications. That turned out to be a dead end. The other people in networking in Palo Alto went off to form a startup to make a "PC LAN" (a forgotten 1980s concept), and for about six months, they led that industry. I ended up leaving and doing things for Autodesk, which worked out well.

    • e40 2 hours ago
      Came here thinking the same thing...
    • armitron 2 hours ago
      Assuming he shows up, he'll still probably be trying to defend the indefensible..

      Disabling Nagle's algorithm should be done as a matter of principle, there's simply no modern network configuration where it's beneficial.