I/O Multiplexing (select vs. poll vs. epoll/kqueue)

(nima101.github.io)

65 points | by pykello 3 days ago

10 comments

  • thasso 2 minutes ago
    This part is bewildering to me:

      > Now, if you try to watch file descriptor 2000, select will loop over fds from 0 to 1999 and will read garbage. The bigger issue is when it tries to set results for a file descriptor past 1024 and tries to set that bit field in say readfds, writefds or errorfds field. At this point it will write something random on the stack eventually crashing the process and making it very hard to debug what happened since your stack is randomized.
    
    I'm not too literate on the Linux kernel code, but I checked, and it looks like the author is right [1].

    It would have been so easy to introduce a size check on the array to make sure this can't happen. The man page reads like FD_SETSIZE differs between platforms. It states that FD_SETSIZE is 1024 in glibc, but no upper limit is imposed by the Linux kernel. My guess is that the Linux kernel doesn't want to assume a value of FD_SETSIZE so they leave it unbounded.

    It's hard to imagine how anyone came up with this thinking it's a good design. Maybe 1024 FDs was so many at the time this was designed that nobody considered what would happen if the limit were reached? Or they were working on a system where 1024 was the maximum number of FDs that a process could open?

    [1]: The core_sys_select function checks the nfds argument passed to select(2) and modifies the fd_set structures that were passed to the system call. The function ensures that n <= max_fds (as the author of the post stated), but it doesn't compare n to the size of the fd_set structures. The set_fd_set function, which modifies the user-side fd_set structures, calls right into __copy_to_user without additional bounds checks. This means page faults will be caught and return -EFAULT, but out-of-bounds accesses that corrupt the user stack are possible.

    EDIT: Fixed formatting
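
The arithmetic behind the quoted overflow can be sketched with a toy model. This is not kernel or glibc code: it assumes FD_SETSIZE is 1024 (the glibc value mentioned above) and models an fd_set as a flat 128-byte bitmap, whereas real implementations use word-sized units. The helper names `byte_index` and `checked_fd_set` are hypothetical.

```python
FD_SETSIZE = 1024                 # assumed: the glibc value cited in the comment above
FD_SET_BYTES = FD_SETSIZE // 8    # an fd_set modelled as a 128-byte bitmap


def byte_index(fd: int) -> int:
    """Return which byte of the bitmap would hold the bit for this descriptor."""
    return fd // 8


def checked_fd_set(fd: int, fdset: bytearray) -> None:
    """Set the bit for fd, refusing descriptors the bitmap cannot hold."""
    if not 0 <= fd < FD_SETSIZE:
        raise ValueError(f"fd {fd} does not fit in an fd_set of {FD_SETSIZE} bits")
    fdset[byte_index(fd)] |= 1 << (fd % 8)


fdset = bytearray(FD_SET_BYTES)
checked_fd_set(5, fdset)          # fine: bit 5 of byte 0
overflow_byte = byte_index(2000)  # lands far past the end of the 128-byte bitmap
```

In this model, descriptor 2000 maps to a byte offset well beyond the bitmap, which is the kind of write the bounds check in `checked_fd_set` rejects.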

  • eqvinox 25 minutes ago
    > epoll/kqueue are replacements for their deprecated counterparts poll and select.

    Neither poll nor select is deprecated. They're just not good fits for particular use patterns. But even select() is fine if you just need to watch 2 FDs in a CLI tool.

    In fact, due to its footguns, I'd highly advise against epoll (particularly edge triggering) unless you really need it.
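
As a rough illustration of the "watch 2 FDs" case, here is a minimal sketch using Python's standard `select.select`. The two socket pairs are stand-ins chosen for the example (a real tool might watch stdin and one socket), and the one-second timeout is arbitrary.

```python
import select
import socket

# Two connected socket pairs stand in for the two descriptors a small tool might watch.
a_recv, a_send = socket.socketpair()
b_recv, b_send = socket.socketpair()
names = {a_recv: "a", b_recv: "b"}

a_send.sendall(b"ping")  # only the first pair has data pending

# Block for at most one second until either descriptor becomes readable.
readable, _, _ = select.select([a_recv, b_recv], [], [], 1.0)

# Read from whichever descriptors select() reported as ready.
messages = {names[sock]: sock.recv(1024) for sock in readable}

for sock in (a_recv, a_send, b_recv, b_send):
    sock.close()
```

With only a couple of low-numbered descriptors, neither the linear scan nor the FD_SETSIZE limit comes into play.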

  • mort96 4 hours ago
    I wish the UNIXes had gone together and standardized a modern alternative to poll, maybe as part of POSIX. It sucks that any time I want to listen to IO events, I have to choose between old, low performance, cross-platform APIs and the new, higher-performance but Linux-only epoll.
    • ninjin 2 hours ago
      Which is why there is libevent [1]?

      [1]: https://libevent.org

      Unless I am mistaken, OpenBSD base even explicitly codes against the older libevent API internally and ships it with each release, despite at the very least supporting kqueue, and thus gains better portability for a number of their tools this way.

      Personally, I just go with Posix select for small programs where performance is not critical anyway.

      • eqvinox 20 minutes ago
        There are a whole bunch of these — libevent, libev, glib's main loop, Qt's main loop, Apache's modular event loop, …

        …which is why there is libverto, a 2nd order abstraction.

        It'd be funny if it weren't also sad.

    • usrnm 3 hours ago
      Aren't there enough wrapper libraries for all programming languages that take care of this under the hood? You don't have to rely on libc only.
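
One example of such a wrapper is Python's standard `selectors` module, whose `DefaultSelector` is documented as an alias for the most efficient implementation available on the current platform. The sketch below is only an illustration of that idea; the socket pair and the `"reader"` label are made up for the example.

```python
import selectors
import socket

# DefaultSelector resolves to the best mechanism the platform offers
# (epoll, kqueue, devpoll, poll, or select).
sel = selectors.DefaultSelector()

reader, writer = socket.socketpair()
sel.register(reader, selectors.EVENT_READ, data="reader")

writer.sendall(b"hello")

received = []
for key, events in sel.select(timeout=1.0):
    if events & selectors.EVENT_READ:
        received.append((key.data, key.fileobj.recv(1024)))

# Name of the backend class that was chosen; it varies by platform.
backend = type(sel).__name__

sel.unregister(reader)
sel.close()
reader.close()
writer.close()
```

The calling code stays the same regardless of which backend is selected, which is the portability the wrapper buys.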
      • mort96 3 hours ago
        Sure, there are wrapper libraries. But then I'm met with the question: do I add some big heavy-handed IO wrapper library, or ... do I just call poll?
        • Galanwe 2 hours ago
          I wouldn't count uv/ev/etc as "big heavy IO wrapper library".
          • mort96 2 hours ago
            I would, especially when nothing else in the program uses it and you just introduce it for one small thing in place of calling poll(). It's over 40 000 loc, over 70 000 including tests.
    • ahartmetz 3 hours ago
      For sure. Though every platform does have its own high-performance alternative, with only kqueue shared by some less popular ones.
  • Luker88 19 minutes ago
    I have vague memories from many years ago of OSX kqueue not supporting all the use cases that FreeBSD kqueue does.

    Have they reached feature parity?

    • nesarkvechnep 15 minutes ago
      I doubt it, because applications written for OSX that use kqueue can’t easily be ported to FreeBSD. ghostty is one such app.
  • tarruda 2 hours ago
    I have implemented a simple asyncio compatible micro event loop library in python.

    The goal was to understand the underlying mechanisms behind python's async/await and to help coworkers understand how event loops work under the hood.

    The end result is somewhat interesting, as unlike traditional event loop libraries, it doesn't use callbacks as the scheduling primitive: https://gist.github.com/tarruda/5b8c19779c8ff4e8100f0b37eb59...
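
For readers wondering what a loop without callbacks can look like, here is a minimal sketch. It is not the code from the linked gist; it assumes one possible design in which each task is a generator that yields the socket it wants to wait on, and the loop resumes the generator once that socket is readable. The names `run`, `consumer`, and `producer` are hypothetical.

```python
import selectors
import socket
from collections import deque


def run(tasks):
    """Drive generator-based tasks until all of them finish.

    A task yields a socket when it wants to wait for that socket to become
    readable; the loop resumes the task once the selector reports it ready.
    """
    sel = selectors.DefaultSelector()
    ready = deque(tasks)   # tasks that can run right now
    waiting = 0            # number of tasks parked in the selector
    while ready or waiting:
        while ready:
            task = ready.popleft()
            try:
                sock = next(task)          # run the task until its next wait point
            except StopIteration:
                continue                   # task finished
            sel.register(sock, selectors.EVENT_READ, data=task)
            waiting += 1
        if waiting:
            for key, _ in sel.select():
                sel.unregister(key.fileobj)
                waiting -= 1
                ready.append(key.data)     # resume the parked task on the next pass
    sel.close()


log = []
left, right = socket.socketpair()


def consumer():
    yield left                  # park until `left` is readable
    log.append(left.recv(1024))


def producer():
    right.sendall(b"event")
    log.append(b"sent")
    yield from ()               # no wait points; present only to make this a generator


run([consumer(), producer()])
left.close()
right.close()
```

The scheduling primitive here is the suspended generator itself: the loop stores it next to the descriptor and calls `next()` on it later, so no callback functions are registered anywhere.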

  • khaledh 6 minutes ago
    Needs "(2020)" in the title.
  • sureglymop 6 hours ago
    Good read but I wish it included io_uring as well.
    • marginalia_nu 1 hour ago
      It's probably hard to include io_uring in something like this, without the article turning into an article mostly about io_uring. It's a cool API that can be incredibly fast, but it also comes with a very long list of caveats.
  • quibono 1 hour ago
    I'm assuming epoll is covered implicitly by the section on kqueue. Are there any differences between the two besides the name?
  • lynx97 4 hours ago
    There is no mention of epoll in this other than the heading.
    • lstodd 2 hours ago
      It's because epoll === kqueue mostly.

      Besides, kqueue grew from FreeBSD, not OSX. Such ignorance saddens me much more.

  • commandersaki 2 hours ago
    Nice article, though it has a few spelling mistakes, which I thought were there to distinguish it from AI slop, only to realise this was written a few years before the AI/GPT craze.