Meet Spatial Speech Translation, an Innovative System That Lets Users Understand Multiple People Speaking Different Languages at the Same Time, with Accuracy and Sonic Realism
The new technology, presented in Japan, transforms conversations between speakers of different languages. MIT Technology Review revealed details about a model that combines artificial intelligence with spatial sound capture.
A development recently presented at the ACM CHI conference in Yokohama, Japan, promises to radically transform how people interact in multilingual environments. Following its unveiling, MIT Technology Review shared further details.
It is called Spatial Speech Translation: a simultaneous-translation system based on artificial intelligence that lets headphone users identify and understand what multiple people are saying at the same time, even if they speak different languages.
Designed to work with conventional noise-canceling headphones, the system not only translates but also reproduces the translated voice with a pitch and spatial direction that mimic those of the original speaker, creating a more natural and contextualized conversation experience.
A System Against the Language Barrier in Groups
The goal of Spatial Speech Translation is to tackle one of the most complex challenges for automatic translation systems: the overlap of voices in group conversations.
With this technology, artificial intelligence is used to track both the spatial origin of the sound and the individual characteristics of each voice, allowing the user to accurately identify who is speaking and what is being said.
The proposal goes beyond simple simultaneous translation. According to the technical description, the model divides the user's acoustic environment into small regions and analyzes each one to detect potential speakers.
This recognition allows for generating a translated version of each voice that preserves essential elements such as sound direction, emotional tone, and original timbre—resulting in a more realistic auditory experience.
The Personal Dimension Behind the Project
The initiative has a deeply personal motivation for one of its creators, Professor Shyam Gollakota, a researcher at the University of Washington. In statements shared with MIT Technology Review, Gollakota explained: “We believe this system can be transformative.”
From a humanistic perspective, it is argued that technology should not only facilitate communication but also promote greater social inclusion for people facing language barriers.
More than solving specific cases, the project aims to reduce the anxiety and isolation that many people feel when they cannot fully participate in a conversation because they do not speak the language.

Artificial Intelligence at Two Levels: How It Works
The system consists of two interdependent models. The first analyzes the sound space with a neural network that divides the environment into small zones; from this segmentation, it pinpoints the direction each voice comes from.
The second model processes the detected voices, translates them into English from three languages—French, German, and Spanish—and reconstructs a version of the original voice, replicating elements such as tone, volume, and emotional cadence.
The innovative aspect is that this “cloned voice” maintains a high degree of naturalness. Instead of a robotic translation, the person using the headphones hears a synthesized version that emulates the original speaker’s voice, with a latency of just a few seconds. This feature enables a more fluid conversational dynamic than conventional systems.
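The two-stage design described above can be illustrated with a minimal toy sketch. Everything here is a placeholder of my own invention, not the actual research code: the function names, the `LocalizedVoice` structure, and the tiny phrase table merely stand in for the neural networks that perform localization, separation, and speech-to-speech translation in the real system.

```python
# Toy illustration of the two-stage pipeline (all names are hypothetical).
from dataclasses import dataclass

@dataclass
class LocalizedVoice:
    direction_deg: float  # estimated direction of arrival
    text: str             # recognized speech (stands in for audio here)

def stage1_locate(acoustic_scene):
    """Stage 1 (placeholder): scan small angular zones of the scene and
    emit one LocalizedVoice per detected speaker."""
    return [LocalizedVoice(direction_deg=d, text=t)
            for d, t in acoustic_scene.items()]

# Tiny phrase table standing in for the French/German/Spanish-to-English model.
PHRASEBOOK = {"bonjour": "hello", "hallo": "hello", "hola": "hello"}

def stage2_translate(voice):
    """Stage 2 (placeholder): translate the speech while preserving the
    spatial direction, so playback can seem to come from the speaker."""
    translated = " ".join(PHRASEBOOK.get(w, w) for w in voice.text.split())
    return LocalizedVoice(direction_deg=voice.direction_deg, text=translated)

# Two speakers at different angles, speaking different languages.
scene = {-30.0: "bonjour", 45.0: "hola"}
output = [stage2_translate(v) for v in stage1_locate(scene)]
for v in output:
    print(f"{v.direction_deg:+.0f} deg: {v.text}")
```

The key design point the sketch preserves is that the spatial metadata from stage 1 travels with each voice through stage 2, which is what lets the final playback render each translation from its speaker's original direction.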
Differences from Existing Technologies
Unlike other devices with automatic translation—such as Meta’s smart glasses—Spatial Speech Translation was developed to process multiple voices simultaneously. While most current systems focus on a single speaker, this proposal seeks to address the real issue of group conversations, with overlapping voices and languages.
Additionally, the technology uses accessible hardware: headphones with built-in microphones and laptops with Apple M2 chips, which enable the execution of the neural network models. This compatibility with commercially available technologies favors potential large-scale adoption.
Challenges and Next Steps
One of the main challenges faced by the team is to reduce the latency between speech and its translation. Currently, the delay is a few seconds, which affects the fluidity of the conversation. “We want to significantly reduce this latency to under one second, in order to keep the pace of the conversation,” Gollakota explained.
This goal presents complex technical challenges, as the syntactic structure of each language influences the speed of translation. For example, the system is faster at translating from French to English, followed by Spanish and then German.
According to researcher Claudio Fantinuoli from Johannes Gutenberg University Mainz, this is because German tends to place verbs—and, therefore, much of the meaning—at the end of sentences.
Several experts who did not participate in the development praised the advance. Samuele Cornell, a researcher at Carnegie Mellon University's Language Technologies Institute, highlighted that the project is technically impressive but warned that mass application will require further training with real data and recordings in noisy environments.

