* https://github.com/ictnlp/StreamSpeech
* https://github.com/k2-fsa/sherpa-onnx
* https://github.com/openai/whisper
I'm looking for a simple app that can listen for English, translate into Korean (and other languages), then perform speech synthesis on the translation. Basically, a Babelfish that doesn't stick in the ear. Although real-time would be great, a max 5-second delay is manageable.
RTranslator is awkward (couldn't get it to perform speech-to-speech using a single phone). 3PO sprouts errors like dandelions and requires an online connection.
Any suggestions?
¹ https://github.com/huggingface/speech-to-speech
https://neuml.hashnode.dev/speech-to-speech-rag
https://www.youtube.com/watch?v=tH8QWwkVMKA
One would just need to remove the RAG piece and use a Translation pipeline (https://neuml.github.io/txtai/pipeline/text/translation/). They'd also need to use a Korean TTS model.
Both this and the Hugging Face speech-to-speech projects are Python though.
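For illustration, here is a minimal sketch of that chain using txtai pipelines (the Korean TTS model id is a placeholder, and the exact TextToSpeech output format may differ between txtai versions):

    # Rough sketch: English speech -> English text -> Korean text -> Korean speech
    from txtai.pipeline import Transcription, Translation, TextToSpeech

    transcribe = Transcription()            # speech-to-text (a Whisper model id can be passed in)
    translate = Translation()               # Hugging Face translation models

    text = transcribe("input.wav")          # English audio -> English text
    korean = translate(text, "ko")          # English text -> Korean text

    tts = TextToSpeech("korean-tts-model")  # placeholder id; a real Korean TTS model is needed
    audio = tts(korean)                     # Korean text -> raw audio; return format may vary by version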
Code from txtai just feels like exactly the right way to express what I am usually trying to do in NLP.
My highest commendations. If you ever have time, please share your experience and what led you to take this path with txtai. For example, I see you started in earnest around August 2020 (maybe before) - at that time, I would love to know whether you imagined LLMs becoming as prominent as they are now and instruction-tuning working as well as it does. I know that at the time, many PhD students (and profs) I knew in NLP felt LLMs were far too unreliable and would not reach, e.g., consistent scores on MMLU/HellaSwag.
It's been quite a ride from 2020. When I started txtai, the first use case was RAG in a way. Except instead of an LLM, it used an extractive QA model. But it was really the same idea, get a relevant context then find the useful information in it. LLMs just made it much more "creative".
Right before ChatGPT, I was working on semantic graphs. ChatGPT took the wind out of those sails for a while, until GraphRAG came along. Adding the LLM framework into txtai during 2023 was definitely a detour.
The next release will be a major release (8.0) with agent support (https://github.com/neuml/txtai/issues/804). I've been hesitant to buy into the "agentic" hype as it seems quite convoluted and complicated at this point. But I believe there are some wins available.
In 2024, it's hard to get noticed. There are tons of RAG and Agent frameworks. Sometimes you see something trend and surge past txtai in terms of stars in a matter of days. txtai has 10% of the stars of LangChain but I feel it competes with it quite well.
Nonetheless I keep chugging along because I believe in the project and that it can solve real-world use cases better than many other options.
> offline
> real-time
> speech-to-speech translation app
> on under-powered devices
I genuinely don't think the technology is there.
I can't even find a half-good real-time "speech to second language text" tool, not even with "paid/online/on powerful device" options.
Definitely true for OP's case, especially for non-trivial language pairs. For the best case scenario, e.g. English<>German, we can probably get close.
> I can't even find a half-good real-time "speech to second language text" tool, not even with "paid/online/on powerful device" options.
As in "you speak and it streams the translated text"? translate.google.com with voice input and a more mobile-friendly UI?
For Japanese to English, the transcription alone is already pretty inaccurate (usable if you know some Japanese; but then again you already know Japanese!)
As long as you're expressive enough in English, and you reverse the translation direction every now and again to double-check the output, it works fine.
Machine translation for instructional or work-related texts has been "usable" for years, well before LLMs emerged.
LLM-based translation has certainly made significant progress in these scenarios—GPT-4, for example, is fully capable IMHO. However, it's still not quite fast enough for real-time use, and the smaller models that can run offline still don't deliver the needed quality.
Anyway, the current state of affairs floats somewhere comfortably above "broken clock" and unfortunately below "Babelfish achieved", so opinions may vary.
I can't say much about the quality of English -> Japanese translation, except that people were generally able to understand whatever came out of it.
But don't expect to be able to use it to read actual literature or, back to the topic, subtitling a TV series or a YouTube video without misunderstanding.
iOS's built-in Translate tool? I haven't tried it for other languages, but a quick English <> Thai test seemed to handle things fine (even with my horrible Thai pronunciation and grammar), and even in Airplane mode (i.e. guaranteed on-device) with the language pack pre-downloaded.
We had to be careful not to talk over each other or the model, and the interpreting didn’t work well in a noisy environment. But once we got things set up and had practiced a bit, the conversations went smoothly. The accuracy of the translations was very good.
Such interpreting should get even better once the models have live visual input so that they can “see” the speakers’ gestures and facial expressions. Hosting on local devices, for less latency, will help as well.
In business and government contexts, professional human interpreters are usually provided with background information in advance so that they understand what people are talking about and know how to translate specialized vocabulary. LLMs will need similar preparation for interpreting in serious contexts.
I've given some high-profile keynote speeches where top-notch (UN-level) real-time simultaneous interpreters were used. Even though my content wasn't technical or even highly specialized (more general business), spending an hour with them previewing my speech and answering their questions led to dramatically better audience response. I was often the only speaker on a given day who made the effort to show up for the prep session. The interpreters told me the typical improvement they could achieve from even basic advance prep was usually >50%.
It gave me a deep appreciation for just how uniquely challenging and specialized the ability to do this at the highest level is. These folks get big bucks and from my experience, they're worth it. AI is super impressive but I suspect getting AI to replicate the last 10% of quality top humans can achieve is going to be >100% more work.
An even more specialized and challenging use case is the linguists who spend literally years translating one high-profile book of literature (like Tolstoy). They painstakingly craft every phrase to balance conveying meaning, pace and flavor etc. If you read expert reviews of literature translation quality you get the sense it's a set of trade-offs for which no optimal solution exists, thus how these trade-offs are balanced is almost as artistic and expressive an endeavor as the original authorship.
Current AI translators can do a fine job translating things like directions and restaurant menus in real time, even on a low-end mobile device, but the upper bound of quality achieved by top humans on the hardest translation tasks is really high. It may be quite a while before AIs can reach these levels technically - and perhaps even longer practically, because the hardest use cases are so challenging and the market at the high end is far too small to justify investing the resources to attain these levels.
A few more comments from a long-time professional translator:
The unmet demand for communication across language barriers is immense, and, as AI quality continues to improve, translators and interpreters working in more faceless and generic fields will gradually be replaced by the much cheaper AI. That is already happening in translation: some experienced patent translators I know have seen their work dry up in the past couple of years and are trying to find new careers.
This past August, I gave a talk on AI developments to a group of professional interpreters in Tokyo. While they hadn't heard of any interpreters losing work to AI yet, they thought that some types of work—online interpreting for call centers, for example—would soon be replaced by AI. The safest type of human interpreting work, I suspect, is in-person interpreting at business meetings and the like. The interpreters' physical presence and human identity should encourage the participants to value and trust them more than they would AI.
Real-time simultaneous translators are an interesting case. On the one hand, they are the rarest and most highly skilled, and national governments and international organizations have a long track record of relying on and trusting them. On the other hand, they usually work in soundproof booths, barely visible (if at all) to the participants. The work is very demanding, so interpreters usually work in twenty- to thirty-minute shifts; the voice heard by meeting participants therefore changes periodically. The result is less awareness that the interpreting is being done by real people, so users might feel less hesitation about replacing them with AI.
When I've organized or participated in conferences that had simultaneous human interpreting, the results were mixed. While sometimes it worked well, people often reported having a hard time following the interpretation on headphones. Question-and-answer sessions were often disjointed, with an audience member asking about some point they seemed to have misunderstood in the interpretation, and then their interpreted question not being understood by the speaker. The interpreters were well-paid professionals, though not perhaps UN-level.
It'll work at first for a sentence or two, but then the other party asks something and, instead of translating the question, it attempts to answer it. Even if you remind it of its task, it quickly forgets again.
But it does feel like we are close to getting a proper babelfish type solution. The models are good enough now. Especially the bigger ones. It's all about UX and packaging it up now.
Thanks for the efforts! Still many fixes to go, though: I used the version from two days ago, which had numerous issues. Also, 3PO isn't offline, so I won't be pursuing it.
https://github.com/usefulsensors/moonshine
https://www.reddit.com/r/language/comments/1elpv37/why_is_sa...
I'll be downloading it and giving it a try today!!
So, I ventured into building 3PO. https://3po.evergreen-labs.org
Would love to hear everyone's feedback here.
Humans can't even do this in immediate real time, so what makes you think a computer can? Some of the best real-time translators, who work at the UN or for governments, still have a short delay so they can correctly interpret and translate for accuracy and context. Doing so in real time actually impedes the translator from working correctly - especially across languages that have different grammatical structures. Even in languages that are effectively congruent (think Latin derivatives), this is hard, if not outright impossible, to do in real time.
I worked in the field of language education and computer science. The tech you're hoping would be free and able to run on older devices is easily a decade away at the very best. As for it being offline, yeah, no. Not going to happen, because accurate real-time translation of even a database of the 20 most common languages on earth is probably a few terabytes at the very least.
AFAIK, humans who do simultaneous interpretation are provided with at least an outline, if not full script, of what the speaker intends to say, so they can predict what's coming next.
They are usually provided with one, but it is by no means necessary. SI is never truly simultaneous and will have a delay, and the interpreter will also predict based on the context. That makes certain languages a bit more difficult to work with, e.g. Japanese, whose sentences I believe often have the predicate after the object rather than the usual subject-predicate-object order, making the "prediction" part harder.
I meant a five-second delay after the speaker finishes talking or the user taps a button to start the translation, not necessarily a five-second rolling window.
Are people still using DTW + HMMs?
Well, almost, anyway - last I checked, they feed a Mel spectrogram into the model rather than raw audio samples.
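For context, here's a rough sketch of that kind of log-Mel front end using librosa; the parameters mirror Whisper's published defaults (80 Mel bins, 25 ms windows, 10 ms hops at 16 kHz), but the exact scaling/normalization is only approximated:

    # Approximate a Whisper-style log-Mel spectrogram front end
    import librosa
    import numpy as np

    audio, sr = librosa.load("speech.wav", sr=16000)  # resample to 16 kHz mono

    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr,
        n_fft=400,       # 25 ms analysis window at 16 kHz
        hop_length=160,  # 10 ms hop
        n_mels=80,       # 80 Mel bins
    )
    log_mel = np.log10(np.maximum(mel, 1e-10))  # log compression (normalization omitted)

    print(log_mel.shape)  # (80, num_frames) - this, not raw samples, goes into the model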
Decades doesn't sound right. Around 2019, the Jasper model was SOTA among end-to-end models but was still slightly behind a non-end-to-end model with an HMM component: https://arxiv.org/pdf/1904.03288
IMHO it's unfortunate that everyone jumps to "use AI!" as the default now, when very competitive approaches developed over the past few decades could provide decent results at a fraction of the computing resources, i.e. with much higher efficiency.
Why online? Why would I want some third-party to (a) listen to my conversations; (b) receive a copy of my voice that hackers could download; (c) analyze my private conversations for marketing purposes; (d) hobble my ability to translate when their system goes down, or permanently offline; or (e) require me to pay for a software service that's feasible to run locally on a smart phone?
Why would I want to have my ability to translate tied to internet connectivity? Routers can fail. Rural areas can be spotty. Cell towers can be downed by hurricanes. Hikes can take people out of cell tower range. People are not always inside of a city.
Also, there are many environments (especially when you travel) where your phone is not readily connected.
I would be very concerned about any LLM model being used for "transcription", since it may inject things that nobody said, as in this recent item:
https://news.ycombinator.com/item?id=41968191
I saw mediocre results from the biggest model even when I gave it a video of Tom Scott speaking at the Royal Institution where I could be extremely confident about the quality of the recording.
My other gripe with these tools is that if there is background noise, they are pretty useless. You can't use them in a crowded room.