A new AI-powered headphone system identifies individual voices in a group, translates them in real time, and mimics each speaker's tone, offering a more natural and fluid experience across languages.
It may soon be possible to communicate with people who speak other languages without learning them. That's the goal of a new AI-powered headset system.
Called Spatial Speech Translation, it translates speech from multiple people in real time, based on the direction of each voice and the unique characteristics of each speaker.
Technology to break down language barriers
The project was developed by researchers at the University of Washington, in the United States.
The idea came from personal experience, as Professor Shyam Gollakota explains. "My mom has amazing ideas when she speaks in Telugu, but it's hard for her to communicate with people in the US when she visits us," he says. "We believe this system can transform the lives of people like her."
Unlike other solutions that focus on just one speaker, the new system recognizes and translates multiple voices at the same time.
It also avoids the artificial sound common in other machine translations. It works with noise-canceling headphones and regular microphones, connected to a laptop with Apple’s M2 chip, the same one used in Vision Pro.
The project was presented this month at the ACM CHI Conference on Human Factors in Computing Systems in Yokohama, Japan.
How the system works
Spatial Speech Translation uses two artificial intelligence models. The first divides the space around the user into small regions and locates sound sources using neural networks.
The second model translates speech from languages such as French, German and Spanish into English, while also simulating the tone and voice style of each speaker.
This allows the translated audio to appear to come from the same direction as the original speaker, and in a voice very similar to the speaker's own rather than a generic machine voice. The technology uses public datasets to perform the translations and voice simulations.
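The two-stage flow described above can be sketched very loosely in code. This is an illustrative toy, not the actual system: the real pipeline uses neural networks for both stages, while here a simple energy threshold stands in for the localization model and a phrasebook lookup stands in for the translation model. All function names, the threshold, and the data are assumptions for illustration.

```python
# Toy sketch of the two-stage pipeline the article describes.
# Stage 1: locate active speakers by direction. Stage 2: translate each
# speaker's speech into English and tag it with its source direction,
# so playback can be spatialized.

def locate_speakers(energy_by_region, threshold=0.5):
    """Stand-in for the localization model: the space around the user is
    divided into angular regions (keyed by center angle in degrees); any
    region whose signal energy exceeds the threshold counts as a speaker."""
    return [angle for angle, energy in energy_by_region.items()
            if energy > threshold]

def translate_with_direction(text, speaker_angle, phrasebook):
    """Stand-in for the translation model: map the utterance to English
    (toy lookup in place of a neural translator) and keep the direction
    so the translated voice can be rendered from the same position."""
    english = phrasebook.get(text, text)
    return {"text": english, "direction_deg": speaker_angle}

# Example: four regions scanned, two contain active voices (45° and 270°).
regions = {0: 0.1, 45: 0.9, 90: 0.2, 270: 0.8}
phrasebook = {"bonjour": "hello", "danke": "thank you"}

speakers = locate_speakers(regions)                      # [45, 270]
outputs = [translate_with_direction("bonjour", a, phrasebook)
           for a in speakers]
```

The key design point the sketch captures is that direction information is carried through the whole pipeline, so each translated voice can be played back from where its speaker actually stands.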
Samuele Cornell, a researcher at Carnegie Mellon University, highlights the complexity of the task. "Separating human voices is already difficult for AI systems. Doing it in real time and with low latency is impressive," he says. Although he did not participate in the project, he considers the first results quite promising.
Challenges still persist
Even with the advances, the system still faces challenges. The main one is the response time between speech and translation. Currently, there is a slight delay, and Gollakota’s team wants to reduce that time to less than a second.
"The goal is to maintain the fluidity of conversation between people speaking different languages.”, explains the researcher. However, this reduction in time can affect the accuracy of the translation, according to experts.
This is because the more context the AI has, the better the translation. Less time can mean lower quality.
The speed also varies depending on the language. Translation from French to English is the fastest. Spanish comes next, and German is the slowest of the three. This is due to the structure of the sentences. In German, for example, the verb usually comes at the end, which slows down the interpretation of the message.
A promising application
For Alina Karakanta, a professor at Leiden University in the Netherlands and an expert in computational linguistics, the system has great potential. She was not involved in the study, but believes it could have a positive impact. "It is a useful application. It can help people," she says.
Real-time translation is still an evolving field. More advanced language models have improved the results significantly in recent years.
In applications like Google Translate or tools like ChatGPT, languages with a lot of available data are already translated with excellent quality. However, it is still not something completely instantaneous.
The system presented now goes one step further. It combines spatial localization, voice identification and simultaneous translation. All this with a more natural and personalized sound.
The future of barrier-free communication
The project shows a promising path for the use of artificial intelligence in human interactions. The ability to understand multiple people speaking different languages at the same time could transform international meetings, family gatherings and everyday situations into multilingual environments.
But, as researcher Claudio Fantinuoli of Johannes Gutenberg University in Germany points out, there are still technical limitations to overcome. "You need to balance speed and accuracy. Waiting longer provides more context but reduces fluidity," he explains.
The team continues to work on improving the system. If they can reduce response time while maintaining translation quality, Spatial Speech Translation could become an essential tool for breaking down language barriers around the world.