Off topic: I hate so much what Krisp (the desktop app) has become. It was the perfect background noise cancelling tool, and now it asks for permission to record your whole screen and audio, for AI meeting companion features I couldn't care about one bit. And they make it such a frustrating pain to opt out, with constant modals tricking you into enabling all those AI features.
Does anyone know an alternative that achieves similar level of background noise cancelling?
I've dropped Krisp after they transitioned from their 'old' app to the 'new' app in the most confusing way that I've ever seen. We were paying for a business subscription and suddenly the 'new' app only seemed to work with a new type of subscription, the 'old' app did not get updates to support the newest macOS versions for months, ...
Since then we've been relying solely on Zoom's noise cancelling features and haven't been missing anything. They really improved massively over the years.
While it's not a drop-in solution if you routinely need to join calls with other conferencing software, I haven't missed Krisp after 'switching' 2 years ago.
If you're on Windows, NVIDIA Broadcast's AI Noise Removal has performed similarly well for me while gaming.
It _was_ both the first and the most useful ML powered app I'd ever used and I couldn't get my credit card out fast enough after trying it. Now it won't leave me alone about new features I don't give a fuck about.
Yes. They switched from the best desktop noise cancelling software to a horrible mess of AI features supposed to help you with your meetings.
The app now really wants you to use their task tracking, transcripts, summaries, etc. But those features have pretty frustrating UX and require invasive permissions. Every other day you get nagged into enabling features you already said no to 20 times.
That's nice, but the main problem with current voice turn-taking is different. It's that these systems don't know when it is their turn to speak.
When a human speaks to another, the second person will listen and interpret and guess when the first person is finished talking. For voice agents it doesn't work that way at all.
The speech-to-text system just seems to have a hardcoded "pause" detector, e.g. 2 seconds, and if 2 seconds of silence are ever detected, the "end of message" token is sent and the LLM will start talking. Even if you were just collecting your thoughts and weren't finished at all.
So the semantic content of what you are saying is completely ignored for turn-taking and no analysis takes place which would determine whether the user is likely to have said everything they wanted to say.
Instead of the rigid pause detector, it would actually make more sense for the end-of-message token to be sent when you explicitly say a specific phrase, like literally "over". Which was of course common in half-duplex radio where only one person could transmit. LLMs are half-duplex too: they can't listen and talk at the same time.
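For anyone curious, that rigid pause detector amounts to little more than a silence timer. Here is a minimal sketch, assuming 16-bit PCM frames and hypothetical threshold values; real pipelines typically use a trained VAD model rather than raw energy, but the end-of-turn logic is essentially this:

```python
import numpy as np

FRAME_MS = 30            # length of each audio frame fed to the detector
SILENCE_RMS = 500        # RMS energy below this counts as "silence" (hypothetical value)
END_OF_TURN_MS = 2000    # hardcoded pause length that ends the user's turn

def is_silence(frame: np.ndarray) -> bool:
    """Crude energy-based voice activity check on 16-bit PCM samples."""
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return rms < SILENCE_RMS

def end_of_turn_signals(frames):
    """Yield True once enough consecutive silent frames have accumulated."""
    silent_ms = 0
    for frame in frames:
        silent_ms = silent_ms + FRAME_MS if is_silence(frame) else 0
        yield silent_ms >= END_OF_TURN_MS  # "end of message": hand the turn to the LLM
```

Nothing in that loop looks at what was actually said, only at how long the microphone has been quiet, which is exactly the problem described above.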
> Instead of the rigid pause detector, it would actually make more sense for the end-of-message token to be sent when you explicitly say a specific phrase, like literally "over".
That doesn’t sound very conversational at all. Instead one could train the network to recognise the appropriate turn-taking points.
The simple way to do that is to make the model output a “listen a bit more” token when it is not yet its turn to talk. You can use real-life recorded conversations to build up the initial training set, and then add more data where clashes happen (where the AI and the speaker talk over each other).
More complicated would be a system where the model is periodically fed the audio so far, predicts what the speaker is likely going to say and, based on that, when it is appropriate to respond and with what. Then a smaller, faster, local model can be used to verify whether what was actually said matches the prediction, and if so it outputs the generated response. If there is a mismatch it engages the more expensive model to come up with a new prediction.
If you engineer this right you can reuse the state vector from save points and save a bit of compute that way.
Asking the user to say “over” at the end of their turn is the most heavy handed solution. Recognising the flow of a conversation is just pattern recognition. That is what machine learning is good at.
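To make the “listen a bit more” idea concrete, here is a hypothetical sketch; `turn_model` is a stand-in for a classifier trained on recorded conversations (and later fine-tuned on logged clashes), not any real API:

```python
from enum import Enum

class TurnToken(Enum):
    LISTEN = "listen_a_bit_more"   # not the model's turn yet, keep buffering audio
    RESPOND = "respond"            # speaker has likely finished, start talking

def turn_model(audio_so_far: bytes) -> TurnToken:
    """Placeholder for a learned turn-taking classifier over the audio so far."""
    raise NotImplementedError

def conversation_loop(audio_chunks, respond):
    """Buffer audio until the model decides the user has finished their turn."""
    buffered = b""
    for chunk in audio_chunks:
        buffered += chunk
        if turn_model(buffered) is TurnToken.RESPOND:
            respond(buffered)      # hand the completed turn to the LLM + TTS
            buffered = b""
```

A plain pause detector could still sit underneath this as a cheap fallback, which is roughly the combination suggested in the reply below.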
The "listen a bit more" token sounds interesting, but I'm not sure whether it would actually work better than the current solution which just waits for a sufficiently long pause. Maybe both could be combined.
Great! For me "turn-taking" has been one of the big downfalls of the voice agents. It always seems to either break in, let the silence drag on when I'm done talking, or pause when it hears the slightest cough or car noise.
If using ChatGPT's Advanced Voice Mode, the recently upgraded version seems better. Plus, if you're on an iPhone, you can turn on Voice Isolation in the Control Center, which filters out almost all sounds from the phone microphone except your speaking voice; that made ChatGPT behave as one would hope. I believe the setting is specific to the app currently using the microphone.
I guess overall latency will be higher, since audio has to go to their server and back to our server, then from our server to the STT provider and back to us, then to the LLM provider and back to us, and for the last part to the TTS provider and back to the user.
It's so weird that e.g. OpenAI doesn't provide a way to run a very simple STT + LLM + TTS voice pipeline entirely on their servers; this would reduce latency significantly.
The pipeline with this server-side audio processing would currently have to look like this:
user phone -> our server -> krisp server -> our server -> OpenAI STT -> our server -> OpenAI LLM -> our server -> OpenAI TTS -> our server -> back to user.
Then you have to hope that the user and all servers are hosted in the same region.
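To illustrate why the hop count matters, a back-of-the-envelope sketch; every number below is an assumed placeholder, not a measurement, and the point is only that each extra server adds its own round trip on top of model inference time:

```python
# Hypothetical round-trip latencies per hop, in milliseconds (assumed, not measured).
hops = {
    "user phone <-> our server": 80,
    "our server <-> Krisp server": 30,
    "our server <-> OpenAI STT": 60,
    "our server <-> OpenAI LLM": 60,
    "our server <-> OpenAI TTS": 60,
}

# Network overhead alone, before any STT/LLM/TTS processing time is counted.
print(sum(hops.values()), "ms of pure network latency per conversational turn")
```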
It depends on what you're optimizing against, I think. I'd guess OpenAI prioritizes retention, so they don't give up the recent massive user base growth. If you retain users, you get them to continue using the product (data flywheel), and if you can iterate fast enough, you will eventually be successful.
How much better is the turn-taking relative to two humans when, for example, ordering a pizza? Human-to-human interaction has a non-zero turn-taking false positive rate in my experience.
Latency (such as you get when communicating over the phone) makes turn-taking much more difficult. Even in person there's still a non-zero false positive rate, though.
Exactly. When you are speaking face to face you have the additional visual cue of the other person's lips and facial expression in real time, so you know whether they have stopped speaking or are just taking some time to gather their thoughts or to find the proper word in their head (e.g. a non-native speaker).
Does it even have control over that? Isn't ChatGPT's voice mode just speech-to-text and text-to-speech wrapped around a text model? Unless it specifically has access to pragmas like "stay silent for 4 seconds" which get communicated to the text-to-speech part, it's hard to imagine that it'd even have the ability to stay silent for that amount of time.
How likely is it for a big generalist model to make this obsolete? I'm sure it has enough capacity to filter out the noise on its own, if trained on a good dataset; moreover, it's pretty well suited for that.
In spite of your assiduously numerous re-readings of the fine article, it appears that the following passage has eluded your attention:
> As a result, the VAD mistakenly interprets noise or background voices as active user speech, triggering unintended interruptions. These false triggers negatively impact turn-taking, a core component of natural, human-like conversational interactions.
Is there some way to do a simple fingerprint or something so that the AI recognizes when it was the one speaking? Or do you really just have to use WebRTC? I spoke with someone yesterday who told me WebRTC fixed this, so just curious.
I wrote a "simple" (ugly) Acoustic Echo Cancellation module that kind of worked, but I'm wondering if anyone has a solution to make it work over the WebSockets Realtime API.
What you're looking for is speaker embeddings. It's an embedding calculated from an audio snippet. As the other commenter mentioned, it should be combined with a robust voice isolation system.
My own system automatically detects new speakers and tries to pick up on cues to identify the speaker, and once they are identified by name, the corresponding average embedding is inserted into a vector database so that the agent can later use the embedding for simple authentication, ignoring chatter in noisy public spaces, RAG context loading, etc. It works pretty well!
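For anyone wanting to try a similar recipe, here is a minimal sketch of the embed-and-compare step, assuming SpeechBrain's pretrained ECAPA VoxCeleb speaker model (any speaker-embedding model with an encode step works the same way); the similarity threshold is a made-up placeholder you would need to tune:

```python
# Minimal speaker-identification sketch: embed a snippet, then compare against
# stored per-speaker average embeddings by cosine similarity.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(path: str) -> torch.Tensor:
    """Return a single speaker embedding for an audio file (ideally 16 kHz mono)."""
    waveform, _sr = torchaudio.load(path)
    return encoder.encode_batch(waveform).squeeze()

def identify(path: str, known_speakers: dict[str, torch.Tensor], threshold: float = 0.25):
    """Return the best-matching known speaker, or None if nobody is close enough."""
    query = embed(path)
    scores = {
        name: torch.nn.functional.cosine_similarity(query, ref, dim=0).item()
        for name, ref in known_speakers.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

In practice, as noted above, you would run voice isolation (and diarization for multi-speaker audio) first, so the snippet you embed contains only one speaker.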
Does this work well for multi-user scenarios? I also wanted to tag and label people as a side effect, but I'm not really used to the audio setting. I just found "Speaker Verification with xvector embeddings on Voxceleb", which seems interesting and useful.
Within constraints, yes, it does, but I think there are many improvements I could still make. Speaker diarization and identification are ongoing subjects of research and right now there's not a good end-to-end model, so if your constraints are local inference only or low latency, it can be harder to get amazing results with current hardware and off-the-shelf models. It's still a lot better than nothing.
The hard part is to separate background voices (e.g. TV, chatter, etc) from the primary speaker's voice. Basically do voice isolation.
Voice fingerprinting would help only in this context.
You can use krisp.ai during recording to already remove a lot, if not all, of the noise. Not sure about cleaning it up after the fact, but I suspect there are a bunch of those tools available as open source as well.
Awesome! We've been waiting for Krisp's voice isolation on the server side in the Voice AI community! This is an important advancement for the community! Congrats, Krisp team!
I will look again at Zoom; last time I tried their noise cancelling it would cut off my sentences.
https://support.apple.com/guide/mac-help/use-mic-modes-on-yo...
https://codeberg.org/khip/khip
https://platform.openai.com/docs/guides/realtime-transcripti...
I'm looking forward to more UX improvements in voice pipelines now that the big players have stabilized their frameworks.
> Instead of the rigid pause detector, it would actually make more sense for the end-of-message token to be sent when you explicitly say a specific phrase, like literally "over".

Background noise will rarely produce a false positive for that word.