OpenAI's WebRTC problem

(moq.dev)

125 points | by atgctg 1 day ago

11 comments

  • Sean-Der 53 minutes ago
    Responding to some technical points first, but then after that I do see a future that isn't WebRTC. I don't think it matches where WebTransport+WebCodecs etc is going though.

    > …but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate

    This is the opposite of the feedback I get. Users want instant responses. If you have delay in generating responses/interruptions, it kills the magic. You also don't want to send faster than real-time. If the user interrupts the model, you just wasted a bunch of bandwidth sending 3 minutes of audio (but only played 10 seconds).
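
To put rough numbers on the bandwidth point above, here is a minimal sketch. It assumes a 32 kbit/s audio stream, which is an illustrative bitrate and not a figure from the thread, and the function name `wasted_bytes` is hypothetical.

```python
def wasted_bytes(sent_seconds: float, played_seconds: float, bitrate_bps: int) -> int:
    """Bytes sent ahead of playback that are thrown away when the user interrupts."""
    unplayed = max(sent_seconds - played_seconds, 0.0)
    return int(unplayed * bitrate_bps / 8)

# Assumed values: 32 kbit/s stream; 3 minutes sent, only 10 seconds played.
wasted = wasted_bytes(sent_seconds=180, played_seconds=10, bitrate_bps=32_000)
print(wasted)
```

Under those assumptions, the 170 unplayed seconds are the whole cost; pacing output closer to real time shrinks that term toward zero.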

    > TTS is faster than real-time

    https://research.nvidia.com/labs/adlr/personaplex/ The latest/aspirational voice AI is moving away from what the author describes. Audio is trickled in/out in 20ms increments.
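
As a sketch of what trickling audio at 20ms can look like, the snippet below cuts a PCM buffer into 20 ms frames. The 48 kHz mono 16-bit format and the helper name `frames_20ms` are assumptions for illustration; they are not taken from the linked page.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 48_000   # assumed sample rate
FRAME_MS = 20             # frame duration mentioned in the comment
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono PCM

# 960 samples per frame at 48 kHz, i.e. 1920 bytes
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE

def frames_20ms(pcm: bytes) -> Iterator[bytes]:
    """Yield successive 20 ms frames; a trailing partial frame is dropped."""
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[start:start + FRAME_BYTES]

one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)  # one second of silence
frames = list(frames_20ms(one_second))
print(len(frames), FRAME_BYTES)
```

A sender that emits one such frame every 20 ms never runs far ahead of playback, which is the property the comment is pointing at.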

    > We really hope the user’s source IP/port never changes, because we broke that functionality.

    That is supported. When a new IP comes in for a known ufrag, it's supported.

    > It takes a minimum of 8* round trips (RTT)

    That's wrong. https://datatracker.ietf.org/doc/draft-hancke-webrtc-sped/

    > I’d just stream audio over WebSockets

    You lose stuff like AEC. You also push complexity on clients. The simplicity of WebRTC (createOffer -> setRemoteDescription) is what lets people onboard easily. Lots of developers struggled with Realtime API + web sockets (lots of code and having to do stuff by hand)

    ----

    I think if I had my choice I would pick the Offer/Answer model and then do QUIC instead of DTLS+SCTP. Maybe do RTP over QUIC? I personally don't feel strongly about the protocol itself. I don't know how to ship code to multiple clients (and customers' clients) with a much larger code footprint.

  • awkii 1 hour ago
    This poor soul. There are few protocols I hate implementing more than WebRTC. Getting a simple client going means you need to quickly acclimate to SDP, TURN/STUN, ice-candidates, offers, peer-to-peer protocols, and the complex handshake that is implemented from scratch each time. I can't imagine re-writing the whole trenchcoat of protocols and unintended "best-practices".
    • Sean-Der 1 hour ago
      What platforms were you targeting that you found it painful? Sorry it was frustrating.

      I hope it’s getting better with education/more libraries. It’s also amazing how easy Codex etc… can burn through it now

    • jgalt212 1 hour ago
      Have you attempted to use the Microsoft Graph API to interact with email?
      • tempaccount5050 4 minutes ago
        It's way better than the old powershell modules imo. What don't you like?
      • edoceo 55 minutes ago
        Ugh. Who decided to Graph all the things?
    • moomoo11 56 minutes ago
      i like livekit for this reason and their ceo is cool
  • fidotron 1 hour ago
    > WebRTC is designed to degrade and drop my prompt during poor network conditions

    You want real time that's what you are going to deal with. If you don't want real time and instead imagine everything as STT -> Prompt -> TTS then maybe you shouldn't even be sending audio on the wire at all.

    • cowsandmilk 46 minutes ago
      > You want real time

      Isn’t the point that OpenAI’s use case does not require realtime?

      When OpenAI responds, it has most of the audio in advance of when the user needs to hear it. It produces audio faster than real time, so a real time protocol is a bad fit.

    • telman17 1 hour ago
      Yep. Maybe there's some additional configuration I'm missing to mitigate the delay but clients don't seem to want to deal with the delay with STT -> Prompt -> TTS. They'll happily suffer occasional quality issues if the conversation feels "real".
  • r2vcap 1 hour ago
    This is frustratingly one-sided writing. Yeah, WebRTC has limitations, but relying on a standard buys you a lot of correctness and reduces long-term engineering cost. The fact that WebRTC is complicated does not mean it is wrong; it means real-time media over the public internet is complicated.

    Also, networking is inherently stateful. NAT traversal, jitter buffers, congestion control, packet loss, codec state, encryption, and session routing do not disappear because you put audio over TCP or WebSocket. Pretending otherwise is not architectural clarity. It is just moving the complexity somewhere less visible.

    • tekacs 1 hour ago
      You might have noticed that the author started the blog post explaining themselves:

        Like 6 years ago I wrote a WebRTC SFU at Twitch.
        Originally we used Pion (Go) just like OpenAI,
        but forked after benchmarking revealed that it was too slow.
        I ended up rewriting every protocol, because of course I did!
      
        Just a year ago, I was at Discord and I rewrote the WebRTC SFU in Rust.
        Because of course I did! You’re probably noticing a trend.
      
        Fun Fact: WebRTC consists of ~45 RFCs dating back to the early 2000s.
        And some de-facto standards that are technically drafts (ex. TWCC, REMB).
        Not a fun fact when you have to implement them all.
      
        You should consider me a Certified WebRTC Expert.
        Which is why I never, never want to use WebRTC again.
      
      I think that they've done more than enough of 'trying the normal way' to be warranted in having an opinion the other way, don't you think?
    • Waterluvian 1 hour ago
      “How hard can it be?” the strawman asked.

      It’s 2026 and teleconferencing is still such a shit show. There’s billions of dollars to be had and Zoom is at best mediocre, and it can be as bad as Microsoft Whatchamacallit. I’ve never not seen teleconferencing be a ham handed mess.

      • fragmede 15 minutes ago
        Facetime does alright in the consumer segment.
    • charcircuit 1 hour ago
      QUIC is also a standard.
  • sam1r 23 minutes ago
    >> ... I say hi to <strike>Scarlett Johansson</strike>

    Had a nice chuckle.

  • lpln3452 58 minutes ago
    I haven't really experienced disconnections while using ChatGPT. Gemini is the frustrating part. Simply backgrounding the app (and the web version too) and resuming it causes the response or the conversation with an assigned ID to disappear. Haha.
    • Sean-Der 56 minutes ago
      I believe Gemini is Websockets? I have the same experience with heavy/custom applications that try to roll their own media stuff.

      You run into issues around AudioContext and resumption etc... it's a PITA to have to handle all those corner cases :(

  • keizo 14 minutes ago
    interesting read, albeit over my head, but i spent half of yesterday comparing Gemini Live (websockets) vs gpt-realtime-2, and while gpt is super good and seemingly more robust, Gemini connects faster.
  • spongebobstoes 55 minutes ago
    this misses a few key things but hits on many others

    webrtc is a bad protocol, without a doubt. I do like websockets as an easy alternative, but you do need to reinvent decent portions of webrtc as a result

    I like the idea of MoQ but it's not widely used. probably worth experimenting with, especially as video enters the chat

    > and then a GPU pretends to talk to you via text-to-speech

    OpenAI is speech-to-speech, there is no TTS in voice mode

    > It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection

    signalling can be done long ahead of time, though I don't see this mentioned in the OpenAI blog. I also saw some new webrtc extensions that should reduce setup time further

    ultimately though, it comes down to

    > It’s not like LLMs are particularly responsive anyway

    I expect to see a shift in how S2S models work to be lower latency like the new voice API models that OpenAI announced

    to be fair, the new models were released the day after this MoQ blog was published

  • giancarlostoro 1 hour ago
    Probably because WebTransport is the lesser known alternative to WebRTC.
    • est 1 hour ago
      WebTransport requires some specific server setup.

      Cloudflare doesn't support WebTransport well.

  • Giefo6ah 1 hour ago
    Yet another victim of IPv4, and you still find countless detractors of IPv6 on every thread where it's mentioned.
    • spongebobstoes 53 minutes ago
      IPv4 support is necessary, but IPv6 isn't
    • whattheheckheck 1 hour ago
      How would IPv6 handle it?
      • tardedmeme 49 minutes ago
        You just send packets to the other party's address and they send packets back to yours. Both parties know their address and you don't need a relay in the middle.