Lord of the Io_uring (2020)

(unixism.net)

173 points | by LAC-Tech 175 days ago

10 comments

accelbred 175 days ago
I'd like to use io_uring, but as long as it bypasses seccomp it should be disabled whenever seccomp is in use. As such, I use epoll, and find it annoying when kernel APIs like ublk require io_uring. The places I'd want to use ublk are inside sandboxes using seccomp. Given that container runtimes, hardened kernels, chromeos, etc., disable io_uring, using it means needing an epoll fallback anyways, so might as well just use epoll and not maintain two async backends for your application.
[-]
- JoshTriplett 175 days ago
  ublk, specifically, is something I'd expect to be primarily used in privileged contexts anyway, because the primary use of the resulting block device is to mount it, which requires privileges for most interesting filesystems. If you want an unprivileged mechanism, you may be interested in the upcoming uring-accelerated FUSE support.
  For other uses, uring has a "restriction" mechanism that does part of what you want. See REGISTER_RESTRICTIONS in the documentation. Any process that's setting up its own seccomp restrictions can also set up a uring with restrictions, limiting the opcodes it can use.
  That said, that mechanism would benefit from a way to apply such restrictions to a process that isn't doing the setup itself, such as when setting up seccomp restrictions on a container or daemon. For instance, a way to set restrictions on all rings created by child processes, or a way for seccomp to enforce that any uring created has restrictions applied to it.
  [-]
  - accelbred 175 days ago
    The main problem I have with fuse is inotify not working. If inotify just worked for fuse, I'd just use it. Ideally I could just run the software in a mount namespace with a fuse fs, but I need inotify.
    I mainly was trying to use ublk to implement a sort of fuse like thing with the kernel handling the fs and thus having inotify support.
    [-]
    - kragen 174 days ago
      Interesting, I didn't realize inotify didn't work with FUSE. Is this a flaw in the FUSE interface, or is it just a deficiency in certain FUSE filesystems?
      [-]
      - jefftk 174 days ago
        I think the key problem is that mapping from FUSE requests to inotify events requires information that only the FUSE daemon has. For example, lets say you open a file with O_CREAT. Whether this should trigger IN_CREATE depends on whether the file already exists. The kernel doesn't know this, and so couldn't be responsible for generating the IN_CREATE event.
        Now, the FUSE daemon could generate the event, but correctly generating events (especially handling edge cases) is difficult.
        [-]
        kragen 173 days ago
        I was thinking about cases where a filesystem change event doesn't stem from a system call at all, for example, because some other machine wrote to a remote fileserver the daemon provides access to. Is that a problem?
  - haberman 175 days ago
    > you may be interested in the upcoming uring-accelerated FUSE support.
    Do you have a reference for this? What is the anticipated timeframe?
    [-]
    - JoshTriplett 175 days ago
      https://lore.kernel.org/io-uring/20241209-fuse-uring-for-6-1...
      I don't know when it'll be merged, but it seems like it's getting close to ready.
  - quotemstr 175 days ago
    > For instance, a way to set restrictions on all rings created by child processes, or a way for seccomp to enforce that any uring created has restrictions applied to it.
    SELinux or your favorite MAC is there to solve this exact problem.
- samlightfoot 175 days ago
  https://github.com/containerd/containerd/issues/9048
- fulafel 175 days ago
  Does this mean you shouldn't use it in containers?
  edit: it does seem it is disabled there now: https://github.com/containerd/containerd/pull/9320 (thanks to sibling comment for an adjancent link)
  [-]
  - cmrdporcupine 174 days ago
    Yeah I had code at one point in my hobby project that used io_uring and it stopped working in docker without overriding security restrictions.
    Unfortunately decided it's not worth it.
- quotemstr 175 days ago
  > find it annoying when kernel APIs like ublk require io_uring
  Good. That's a forcing function for making io_uring work in your environment.
  > bypasses seccomp
  Seccomp sucks.
  We shouldn't be enforcing security by filtering system calls, the set of which will grow forever, but instead by describing access control rules on objects, e.g. with SELinux. If your security policy is that your sandbox should be able to read from some file but not write to it, you should do that with real MAC, which applies to all operations , il_uring included. You shouldn't just filter read(2) and write(2) in particular.
  We shouldn't hold back evolution in systems interfaces because some people are stuck on bad ways of doing things and won't move.
  [-]
  - fulafel 174 days ago
    SELinux is a dx/ux hostile nightmare that we definitely shouldn't be springing on everybody.
  - accelbred 175 days ago
    Since when can you use a MAC as an unprivileged user on an arbitrary distro?
    [-]
    - burch45 174 days ago
      Parent is referring to https://en.m.wikipedia.org/wiki/Mandatory_access_control As opposed to https://en.m.wikipedia.org/wiki/Medium_access_control
      [-]
      - fulafel 174 days ago
        Not necessarily, how do I use SELinux as a unprivileged user on eg Debian?
  - immibis 174 days ago
    seccomp is a mitigation. Once you have already been exploited, if further escalation is prevented by seccomp, or ASLR, or NX stack, or ....... then you got lucky.
- poincaredisk 175 days ago
  Is there a specific io_uring opcode you would like disabled in your sandboxes? It's not like io_uring is a complete seccomp bypass, just another syscall that provides an alternative way to do many things. I doubt you block "read" or "accept" in docker, for example. You can't execute a sysctl or mount a filesystem using io_uring, which are things that are actually blocked in Docker by default.
  edit: on the other hand, a good reason to disable uring in containers is that it's infested with vulnerabilities. It's new, complex, and does a whole lot of things - all of which make serious security bugs there quite common right now.
  [-]
  - JoshTriplett 175 days ago
    > infested with vulnerabilities
    Current io_uring is not particularly prone to vulnerabilities. The original version of it had a design that often led to them (a kernel thread doing operations on behalf of the process and not always remembering to set the appropriate privileges), but it no longer uses that design, and the current design is much more resilient. Unfortunately, the original design led to a reputation that it's still trying to shake.
    [-]
    - quotemstr 175 days ago
      > Current io_uring is not particularly prone to vulnerabilities
      The tech industry: launch early! Develop in public! Many eyes make all bugs shallow!
      Also the tech industry: we will never forgive you for that one segfault you had ten years ago.
      [-]
      - poincaredisk 174 days ago
        Excuse me? Io_uring is by far the most often exploited syscall on modern day Linux. Most often exploited subsystem even. https://www.phoronix.com/news/Google-Restricting-IO_uring
        [-]
        JoshTriplett 174 days ago
        That's a lot like saying "the syscall interface is the most exploited interface to the kernel". io_uring is an entire syscall interface itself; the right point of comparison would be "every other syscall".
        How do the exploits for io_uring compare to the exploits for the rest of the kernel?
        https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=%22linux%20...
      - dietr1ch 174 days ago
        Remember that 10yo crash? Well, I'm going to use a 12yo kernel and complain about it.
        [-]
    - poincaredisk 174 days ago
      https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring
      20 CVEs in 2024. Yes, some of them are not (exploitable) vulnerabilities, probably, because Linux CNA is being difficult. But many of them are, just ctrl+f privilege.
  - ibotty 175 days ago
    It's not only potentially infested with vulnerabilities. It's also not possible to filter io_uring using seccomp at all. So if you allow io_uring, you allow all that is possible with it.
  - accelbred 175 days ago
    Out of current ones, at a quick glance: connect, openat, openat2, renameat, mkdirat, and bind. More importantly, I'd like to block any opcode I haven't whitelisted, even when my software runs on future kernels with more opcodes available.
    Now that I think about it, how does io_uring interact with landlock?
hinkley 175 days ago
But they were all of them deceived?
[-]
- Cthulhu_ 175 days ago
  And nine, nine async I/O programming APIs were gifted to the race of Linux users, who above all else desire power.
  [-]
  - api 175 days ago
    But this next API, we'll get it right. Let's call it io_uring2!
  - hinkley 175 days ago
    I’m now imagining Torvalds riding a hell-hawk.
- friend_Fernando 175 days ago
  One io_uring to root them all.
- LAC-Tech 175 days ago
  This made me laugh a lot. I can spot Tolkien's language from a mile off!
  [-]
  - ratherbefuddled 175 days ago
    Technically Peter Jackson's :)
samsquire 175 days ago
This document helped me learn the io_uring API.
You can use io_uring with epoll to monitor eventfd to wake up your sleeping with io_uring wait for completions.
I have implemented a barrier and thread safe techniques that I am trying to turn into a command line tool
My goal is that thread safe performant servers are easy to write.
I am using bloom filters for fast set intersection. I intend to use Simd instructions with the bloom hashes.
t00 175 days ago
There are examples of cat and cp using io_uring. What are the chances of having io_uring utilised by standard commands to improve overall Linux performance? I presume GNU utils are not Linux specific hence such commands are programmed for a generic *nix.
Another one is I could not find a benchmark with io_uring - this would confirm the benefit of going from epoll.
[-]
- mahkoh 175 days ago
  >Another one is I could not find a benchmark with io_uring - this would confirm the benefit of going from epoll.
  One of the advantages of io_uring, unrelated to performance, is that it supports non-blocking operations on blocking file descriptors.
  Using io_uring is the only method I recall to bypass https://gitlab.freedesktop.org/wayland/wayland/-/issues/296. This issue deals with having to operate on untrusted file descriptors where the blocking/non-blocking state of the file descriptions might be manipulated by an adversary at any time.
  [-]
  - kragen 174 days ago
    So does the FIONREAD ioctl, but it's not a general solution. (According to https://news.ycombinator.com/item?id=42617719, neither is io_uring yet.) Thanks for the link to the horrifying security problem!
  - o11c 174 days ago
    I thought for sure this was wrong, but when I actually checked the docs, it turns out that `RWF_NOWAIT` is only valid for `preadv2` not `pwritev2`. This should probably be fixed.
    For sockets, `MSG_DONTWAIT` works with both `recv` and `send`.
    For pipes you should be able to do this with `SPLICE_F_NONBLOCK` and the `splice` family, but there are weird restrictions for those.
  - lukeh 175 days ago
    Also useful for things like SPI with only blocking user space API.
- fweimer 175 days ago
  GNU coreutils already has tons of Linux-specific code. But it would be a bit of a kernel fail if io_uring were faster or other preferable to copy_file_range for cp (at least for files that do not have holes).
  [-]
  - Sesse__ 175 days ago
    Not at all; with io_uring, you can copy multiple files in parallel (and in fewer syscalls), which is a huge win for small files.
    [-]
    - kragen 174 days ago
      On a hard disk, copying multiple files in parallel is likely to make the copy run slower because it spends more time seeking back and forth between the files (except for small files). Perhaps that isn't a problem with SSDs? It seems like you'd still end up with the data from the different files interleaved in the erase blocks currently being written instead of contiguous, which seems like it would slow down all subsequent reads of those files (unless they're less than a page in size).
      [-]
      - Sesse__ 174 days ago
        > On a hard disk, copying multiple files in parallel is likely to make the copy run slower because it spends more time seeking back and forth between the files (except for small files).
        Certainly not; it's likely to make it run faster, since you can use the elevator algorithm more efficiently instead of seeking back and forth between the files. You can easily measure this yourself by using comparing wcp, which uses io_uring, and GNU cp (remember to empty the cache between each run).
        [-]
        kragen 174 days ago
        Hmm, that's interesting! I don't have a hard disk handy right now, unfortunately.
mgaunard 175 days ago
a lot of the functionality was significantly improved in 6 and isn't reflected there.
In practice io_uring can be used in many different ways, and it can be challenging to find the most efficient one.
[-]
- LAC-Tech 175 days ago
  What are the big changes in 6? links welcome.
  [-]
  - alecco 175 days ago
    https://kernelnewbies.org/Linux_6.0#io_uring_features but only mentions zero copy and https://lwn.net/Articles/879724/
    also https://www.phoronix.com/news/Linux-6.0-IO-Block-IO_uring
    [-]
    - mgaunard 174 days ago
      there were other changes to do with how register buffers work etc.
saghm 174 days ago
I'm realizing from the title of this that the intended pronunciation of "uring" is probably "yoo-ring"; for some reason I had mentally been reading it as "yurr-ring" all this time, and I guess I never heard anyone say it out loud before. In retrospect, I probably could have guessed that I might be missing something given that I had no clue what "uring" was supposed to mean.
dvektor 174 days ago
I was actually glued to that page for a few days recently, it's a great write-up.
io_uring is such a tremendous improvement over epoll, in both speed and user experience. With sqpoll, vectored ops and proper batching you can get some crazyy speed. I am definitely looking forward to seeing some of these seccomp and privilege issues getting fixed and getting container support in the future.
Thaxll 175 days ago
Someone can comment on the security implications of sharing a buffer between user space and kernel space?
[-]
- alexgartrell 175 days ago
  Sharing a queue itself is not new https://www.kernel.org/doc/html/v5.8/networking/packet_mmap.... and https://docs.kernel.org/next/userspace-api/perf_ring_buffer.... are two examples.
  Issues with io_uring security mostly stemmed from an old architecture and just the fact that there's a ton of surface area.
- fragmede 175 days ago
  As you suspect, it's not awesome.
  https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=io_uring
- quotemstr 175 days ago
  binder shares a buffer between kernel and user space on billions of Android devices, and Android is by far the most secure Linux distribution.
  There's nothing wrong with the general concept.
jauntywundrkind 175 days ago
Definitely one of the best pieces of documentation out there for io_uring. But I'm not sure how much if at all it's been updated since 2020 & Linux 5.5. https://web.archive.org/web/20200527021134/https://unixism.n...
[-]
- alecco 175 days ago
  Yeah, it should have (2020)
  Previous discussion https://news.ycombinator.com/item?id=23132549
twen_ty 175 days ago
[flagged]
[-]
- JoshTriplett 175 days ago
  Windows has copied io_uring from Linux.
  [-]
  - cobbzilla 175 days ago
    This is a troll, but NT did indeed support async IO via WaitForMultipleObjects in the late 90s, long before Linux had a good async IO story.
    [-]
    - p_ing 175 days ago
      I don't think it's a troll (though not a particularly useful comment); Linux has had no true async story thus far. poll, epoll, et. al. are all synchronous behind-the-scenes.
      What Linux still lacks is an OVERLAPPED data structure.
      NT has supported async I/O since it's inception. It was a design principle of the kernel -- all I/O operations in the kernel are async'ed.
      [-]
      - JoshTriplett 175 days ago
        I wasn't referring to async I/O. I'm talking about the ability to make system calls using a user/kernel shared memory buffer, without having to enter and exit the kernel. (This is particularly important with all the security mitigations that make kernel entry/exit more expensive.)
        https://windows-internals.com/ioring-vs-io_uring-a-compariso...
        [-]
        p_ing 175 days ago
        Technically Windows has Registered I/O in Windows 8/Server 2012 for networking which provided this functionality. I/O Rings in Windows extended this to other types of I/O, but of course with a separate API.
        https://serverframework.com/asynchronousevents/2011/10/windo...
        So if we're talking about concepts... NT first, again ;-)