According to an article on BBC News, folks are working on a so-called ‘Tower of Babel’ translation device. Unlike other translators, it doesn’t use audio input. Instead, “Electrodes are attached to the neck and face to detect the movements that occur as the person silently mouths words and phrases. Using this data, a computer can work out the sounds being formed and then build these sounds up into words.”
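The article's description — electrode signals, inferred sounds, sounds built up into words — can be caricatured as a classify-then-look-up pipeline. To be clear, this is purely my own sketch with invented feature values, phoneme templates, and vocabulary; the article gives no technical detail:

```python
import math

# Toy per-phoneme "average EMG feature" templates (completely made up):
# a real system would learn these from training data, per user.
PHONEME_TEMPLATES = {
    "h": (0.2, 0.1, 0.0),
    "e": (0.5, 0.7, 0.1),
    "l": (0.3, 0.4, 0.8),
    "o": (0.6, 0.2, 0.3),
}

# A closed vocabulary, like the 100-200 word systems the article mentions.
VOCABULARY = {("h", "e", "l", "o"): "hello", ("l", "o"): "low"}

def classify_frame(frame):
    """Nearest-template classification of one EMG feature frame."""
    return min(PHONEME_TEMPLATES,
               key=lambda p: math.dist(frame, PHONEME_TEMPLATES[p]))

def decode(frames):
    """Map a sequence of feature frames to a vocabulary word, if any."""
    phonemes = tuple(classify_frame(f) for f in frames)
    return VOCABULARY.get(phonemes)  # None if outside the vocabulary

# Frames lying near the /h/, /e/, /l/, /o/ templates decode to "hello".
frames = [(0.21, 0.10, 0.02), (0.50, 0.68, 0.10),
          (0.30, 0.42, 0.79), (0.60, 0.20, 0.31)]
print(decode(frames))  # → hello
```

Note how brittle even the toy version is: anything outside the closed vocabulary simply fails, which is exactly the "building blocks" assumption I complain about below.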
There isn’t enough technical description in the article to really evaluate this well, but I wonder how it could possibly work. First, if no audio input is used, how will the device distinguish between nasals and non-nasals with the same place of articulation (e.g., /b/ and /m/, which differ only in whether the velum is lowered — a movement you can’t see from outside)? Second, how can it deal with individual variation in the muscle movements involved in pronunciation, much less cross-user variation? Third, it seems to make a lot of assumptions about how words are made, i.e., that they’re just building blocks assembled together. I hope the researchers themselves have at least taken co-articulation into account.
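That first worry can be made concrete. If the device only sees articulator movement, then segments that differ only in nasality or voicing collapse together, because the velum and the larynx are hidden. A toy illustration, using a deliberately simplified letter-to-place mapping of my own invention:

```python
# Simplified assumption: without audio, voicing and nasality are
# invisible, so /p/, /b/, /m/ all present as a bilabial closure and
# /t/, /d/, /n/ as an alveolar closure. Vowel letters are kept as-is.
PLACE = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "t": "alveolar", "d": "alveolar", "n": "alveolar",
}

def silent_signature(word):
    """What a word 'looks like' to an articulation-only observer."""
    return tuple(PLACE.get(c, c) for c in word)

for w in ("bat", "mat", "pat", "pad", "ban"):
    print(w, silent_signature(w))
# All five words share the signature (bilabial, a, alveolar):
# an articulation-only system literally cannot tell them apart.
```

Five distinct English words, one signature. And that’s before individual variation and co-articulation muddy the signal further.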
Even if the technical bits work, I’m skeptical on other grounds as well. One, the article says it’s about 80% accurate when the vocabulary is limited to 100-200 words, with accuracy decreasing dramatically as the vocabulary size increases. 100-200 words? Very limited usefulness! Two, how on earth would it differentiate homonyms? Or tones? Since one of the languages included is Chinese (Mandarin? Cantonese? Wu? they don’t specify), it needs to be able to handle tones. Maybe my articulatory phonetics knowledge is a bit rusty, but tones are pitch patterns, and silently mouthed speech has no pitch — how are they going to detect tones via silent mouth movements? Three, can it handle lects (dialects, sociolects, idiolects, the whole shebang)?
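The reported accuracy falloff actually makes sense on back-of-the-envelope grounds: if silent articulation can only distinguish a limited number of “shapes,” a growing vocabulary crowds that shape space and more and more words become ambiguous. A toy simulation, with all the numbers invented:

```python
import itertools
import random

random.seed(0)

# Pretend silent articulation can distinguish only 125 "shapes"
# (5 possibilities x 3 positions) -- an invented, illustrative number.
SIGNATURES = list(itertools.product("abcde", repeat=3))

def lone_signature_fraction(vocab_size):
    """Fraction of words that are the sole owner of their signature,
    i.e., the only words an articulation-only system could get right."""
    vocab = [random.choice(SIGNATURES) for _ in range(vocab_size)]
    counts = {}
    for s in vocab:
        counts[s] = counts.get(s, 0) + 1
    unique = sum(1 for s in vocab if counts[s] == 1)
    return unique / vocab_size

for n in (50, 100, 200, 400):
    print(n, round(lone_signature_fraction(n), 2))
# The fraction of unambiguous words drops steadily as vocabulary grows.
```

This is just the birthday problem in disguise, but it suggests the 100-200 word ceiling isn’t an engineering accident — it falls straight out of how few contrasts silent articulation can carry.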
Somehow, I just don’t think this is going to revolutionize communication quite the way the authors think.