Off topic: I hate so much what Krisp (the desktop app) has become. It was the perfect background noise cancelling tool, and now it asks for permission to record your whole screen and audio, for AI meeting companion features I couldn't care about one bit. And they make it such a frustrating pain to opt out, with constant modals tricking you into enabling all those AI features.
Does anyone know an alternative that achieves similar level of background noise cancelling?
I've dropped Krisp after they transitioned from their 'old' app to the 'new' app in the most confusing way that I've ever seen. We were paying for a business subscription and suddenly the 'new' app only seemed to work with a new type of subscription, the 'old' app did not get updates to support the newest macOS versions for months, ...
Since then we've been relying solely on Zoom's noise cancelling features and haven't been missing anything. They really improved massively over the years.
While it's not a drop-in solution if you routinely need to join calls with other conferencing software, I haven't missed Krisp after 'switching' 2 years ago.
If you're on Windows, NVIDIA Broadcast's AI Noise Removal has performed similarly well for me while gaming.
It _was_ both the first and the most useful ML powered app I'd ever used and I couldn't get my credit card out fast enough after trying it. Now it won't leave me alone about new features I don't give a fuck about.
Yes. They switched from the best desktop noise cancelling software to a horrible mess of AI features supposed to help you with your meetings.
The app now really wants you to use their task tracking, transcripts, summaries, etc. But those features have pretty frustrating UX and require invasive permissions. Every other day you get nagged into enabling features you already said no to 20 times.
That's nice, but the main problem with current voice turn-taking is different. It's that these systems don't know when it is their turn to speak.
When a human speaks to another, the second person will listen and interpret and guess when the first person is finished talking. For voice agents it doesn't work that way at all.
The speech-to-text system just seems to have a hardcoded "pause" detector, e.g. 2 seconds, and if 2 seconds of silence are ever detected, the "end of message" token is sent and the LLM will start talking. Even if you were just collecting your thoughts and weren't finished at all.
So the semantic content of what you are saying is completely ignored for turn-taking and no analysis takes place which would determine whether the user is likely to have said everything they wanted to say.
Instead of the rigid pause detector, it would actually make more sense for the end-of-message token to be sent when you explicitly say a specific phrase, like literally "over". Which was of course common in half-duplex radio where only one person could transmit. LLMs are half-duplex too: they can't listen and talk at the same time.
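For anyone curious, that rigid pause detector amounts to little more than a silence timer. Here is a minimal sketch, assuming 16-bit PCM frames and hypothetical threshold values; real pipelines typically use a trained VAD model rather than raw energy, but the end-of-turn logic is essentially this:

```python
import numpy as np

FRAME_MS = 30            # length of each audio frame fed to the detector
SILENCE_RMS = 500        # RMS energy below this counts as "silence" (hypothetical value)
END_OF_TURN_MS = 2000    # hardcoded pause length that ends the user's turn

def is_silence(frame: np.ndarray) -> bool:
    """Crude energy-based voice activity check on 16-bit PCM samples."""
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return rms < SILENCE_RMS

def end_of_turn_signals(frames):
    """Yield True once enough consecutive silent frames have accumulated."""
    silent_ms = 0
    for frame in frames:
        silent_ms = silent_ms + FRAME_MS if is_silence(frame) else 0
        yield silent_ms >= END_OF_TURN_MS  # "end of message": hand the turn to the LLM
```

Nothing in that loop looks at what was actually said, only at how long the microphone has been quiet, which is exactly the problem described above.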
> Instead of the rigid pause detector, it would actually make more sense for the end-of-message token to be sent when you explicitly say a specific phrase, like literally "over".
That doesn’t sound very conversational at all. Instead one could train the network to recognise the appropriate turn-taking points.
The simple way to do that is to make the model output a “listen a bit more” token when it is not yet its turn to talk. You can use real-life recorded conversations to build up the initial training set, and then add more data where clashes happen (where the AI and the speaker talk over each other).
More complicated would be a system where the model is periodically fed the audio so far, predicts what the speaker is likely going to say and, based on that, when it is appropriate to respond and with what. Then a smaller, faster, local model can be used to verify whether what was actually said matches the prediction, and if so it outputs the generated response. If there is a mismatch it engages the more expensive model to come up with a new prediction.
If you engineer this right you can reuse the state vector from save points and save a bit of compute that way.
Asking the user to say “over” at the end of their turn is the most heavy handed solution. Recognising the flow of a conversation is just pattern recognition. That is what machine learning is good at.
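To make the “listen a bit more” idea concrete, here is a hypothetical sketch; `turn_model` is a stand-in for a classifier trained on recorded conversations (and later fine-tuned on logged clashes), not any real API:

```python
from enum import Enum

class TurnToken(Enum):
    LISTEN = "listen_a_bit_more"   # not the model's turn yet, keep buffering audio
    RESPOND = "respond"            # speaker has likely finished, start talking

def turn_model(audio_so_far: bytes) -> TurnToken:
    """Placeholder for a learned turn-taking classifier over the audio so far."""
    raise NotImplementedError

def conversation_loop(audio_chunks, respond):
    """Buffer audio until the model decides the user has finished their turn."""
    buffered = b""
    for chunk in audio_chunks:
        buffered += chunk
        if turn_model(buffered) is TurnToken.RESPOND:
            respond(buffered)      # hand the completed turn to the LLM + TTS
            buffered = b""
```

A plain pause detector could still sit underneath this as a cheap fallback, which is roughly the combination suggested in the reply below.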
The "listen a bit more" token sounds interesting, but I'm not sure whether it would actually work better than the current solution which just waits for a sufficiently long pause. Maybe both could be combined.
Great! For me "turn-taking" has been one of the big downfalls of the voice agents. It always seems to either break in, let the silence drag on when I'm done talking, or pause when it hears the slightest cough or car noise.
If using ChatGPT's Advanced Voice Mode, the recently upgraded version seems better. Plus, if you're on an iPhone, you can turn on Voice Isolation in the Control Center, which filters out almost all sounds from the phone microphone except your speaking voice; that made ChatGPT behave as one would hope. I believe the setting is specific to the app currently using the microphone.
I guess overall latency will be higher, since audio has to go to their server and back to our server, then from our server to the STT provider and back to us, then to the LLM provider and back to us, and for the last part to the TTS provider and back to the user.
It's so weird that e.g. OpenAI doesn't provide a way to run a very simple STT + LLM + TTS voice pipeline entirely on their servers; this would reduce latency significantly.
The pipeline with this server-side audio processing would currently have to look like this:
user phone -> our server -> krisp server -> our server -> OpenAI STT -> our server -> OpenAI LLM -> our server -> OpenAI TTS -> our server -> back to user.
Then you have to hope that the user and all servers are hosted in the same region.
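To illustrate why the hop count matters, a back-of-the-envelope sketch; every number below is an assumed placeholder, not a measurement, and the point is only that each extra server adds its own round trip on top of model inference time:

```python
# Hypothetical round-trip latencies per hop, in milliseconds (assumed, not measured).
hops = {
    "user phone <-> our server": 80,
    "our server <-> Krisp server": 30,
    "our server <-> OpenAI STT": 60,
    "our server <-> OpenAI LLM": 60,
    "our server <-> OpenAI TTS": 60,
}

# Network overhead alone, before any STT/LLM/TTS processing time is counted.
print(sum(hops.values()), "ms of pure network latency per conversational turn")
```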
It depends on what you're optimizing against, I think. I'd guess OpenAI prioritizes retention, so they don't give up the recent massive user base growth. If you retain users, you get them to continue using the product (data flywheel), and if you can iterate fast enough, you will eventually be successful.
How much better is the turn-taking relative to two humans when, for example, ordering a pizza? Human-to-human interaction has a non-zero turn-taking false positive rate in my experience.
Latency (such as you get when communicating over the phone) makes turn-taking much more difficult. Even in person there's still a non-zero false positive rate, though.
Exactly. When you are speaking face to face you have the additional visual cue of the other person's lips and facial expression in real time, so you know whether they have stopped speaking or are just taking some time to gather their thoughts or to find the proper word in their head (e.g. a non-native speaker).
Does it even have control over that? Isn't ChatGPT's voice mode just speech-to-text and text-to-speech wrapped around a text model? Unless it specifically has access to pragmas like "stay silent for 4 seconds" which get communicated to the text-to-speech part, it's hard to imagine that it'd even have the ability to stay silent for that amount of time.
How likely is it for a big generalist model to make this obsolete? I'm sure it has enough capacity to filter out the noise on its own, if trained on a good dataset; moreover, it's pretty well suited for that.
In spite of your assiduously numerous re-readings of the fine article, it appears that the following passage has eluded your attention:
> As a result, the VAD mistakenly interprets noise or background voices as active user speech, triggering unintended interruptions. These false triggers negatively impact turn-taking, a core component of natural, human-like conversational interactions.
Is there some way to do a simple fingerprint or something so that the AI recognizes when it was the one speaking? Or do you really just have to use WebRTC? I spoke with someone yesterday who told me WebRTC fixed this, so just curious.
I wrote a "simple" (ugly) Acoustic Echo Cancellation module that kind of worked, but I'm wondering if anyone has a solution to make it work over the WebSockets Realtime API.
What you're looking for is speaker embeddings. It's an embedding calculated from an audio snippet. As the other commenter mentioned, it should be combined with a robust voice isolation system.
My own system automatically detects new speakers and tries to pick up on cues to identify the speaker, and once they are identified by name, the corresponding average embedding is inserted into a vector database so that the agent can later use the embedding for simple authentication, ignoring chatter in noisy public spaces, RAG context loading, etc. It works pretty well!
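For anyone wanting to try a similar recipe, here is a minimal sketch of the embed-and-compare step, assuming SpeechBrain's pretrained ECAPA VoxCeleb speaker model (any speaker-embedding model with an encode step works the same way); the similarity threshold is a made-up placeholder you would need to tune:

```python
# Minimal speaker-identification sketch: embed a snippet, then compare against
# stored per-speaker average embeddings by cosine similarity.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(path: str) -> torch.Tensor:
    """Return a single speaker embedding for an audio file (ideally 16 kHz mono)."""
    waveform, _sr = torchaudio.load(path)
    return encoder.encode_batch(waveform).squeeze()

def identify(path: str, known_speakers: dict[str, torch.Tensor], threshold: float = 0.25):
    """Return the best-matching known speaker, or None if nobody is close enough."""
    query = embed(path)
    scores = {
        name: torch.nn.functional.cosine_similarity(query, ref, dim=0).item()
        for name, ref in known_speakers.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

In practice, as noted above, you would run voice isolation (and diarization for multi-speaker audio) first, so the snippet you embed contains only one speaker.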
Does this work well for multi-user scenarios? I also wanted to tag and label people as a side effect, but I'm not really used to the audio setting. I just found "Speaker Verification with xvector embeddings on Voxceleb", which seems interesting and useful.
Within constraints, yes, it does, but I think there are many improvements I could still make. Speaker diarization and identification are ongoing subjects of research and right now there's not a good end-to-end model, so if your constraints are local inference only or low latency, it can be harder to get amazing results with current hardware and off-the-shelf models. It's still a lot better than nothing.
The hard part is to separate background voices (e.g. TV, chatter, etc) from the primary speaker's voice. Basically do voice isolation.
Voice fingerprinting would help only in this context.
You can use krisp.ai during recording to already remove a lot, if not all, of the noise. Not sure about cleaning it up after the fact, but I suspect there are a bunch of those tools available as open source as well.
Awesome! We've been waiting for Krisp's voice isolation on the server side in the Voice AI community! This is an important advancement for the community! Congrats, Krisp team!
I will look again at Zoom; last time I tried their noise cancelling it would cut off my sentences.
https://support.apple.com/guide/mac-help/use-mic-modes-on-yo...
https://codeberg.org/khip/khip
https://platform.openai.com/docs/guides/realtime-transcripti...
I'm looking forward to more UX improvements in voice pipelines now that the big players have stabilized their frameworks.
> Instead of the rigid pause detector, it would actually make more sense for the end-of-message token to be sent when you explicitly say a specific phrase, like literally "over".

Background noise will rarely produce a false positive for that word.