Why don't text-to-'speech' programs process IPA or other such phonemic, phonetic or phonological scripts?
University of Fukui, Japan
What language would such a program put out? Text-to-speech/spoken-language programs basically work on the principle of recognizing whole words in a language (based on their relatively unique spellings and the spaces between them) and matching them up with recordings of pronounced whole words.
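To make the whole-word principle concrete, here is a minimal sketch of that kind of lookup in Python. The word list and clip filenames are hypothetical placeholders, not any real TTS system's data; the point is only that the unit of matching is the whole spelled word, found via the spaces between words.

```python
# Toy sketch of the whole-word lookup principle described above.
# LEXICON and its clip paths are hypothetical placeholders.

def tokenize(text):
    """Split text into words using the spaces between them."""
    return text.lower().split()

# Hypothetical lexicon: each whole word maps to one pre-recorded clip.
LEXICON = {
    "hello": "clips/hello.wav",
    "world": "clips/world.wav",
}

def synthesize(text):
    """Return the sequence of clips to play back, one per whole word."""
    clips = []
    for word in tokenize(text):
        # Out-of-vocabulary words get no recording at all: the system
        # knows whole spellings, not the sounds inside them.
        clips.append(LEXICON.get(word))
    return clips
```

Note that an unlisted word simply yields no audio, which is exactly the limitation of treating spellings as opaque wholes.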
The misconception here is that, when we speak, we somehow 'generate' phonemes or allophones, and that these are strung together into something that adds up to 'real speech'. There is no phonemic or allophonic model of a language that can do that--which is why speech recognition programs work only if you train them and bark simple words and phrases at them. Even the most algorithmically powerful ones depend on you speaking not in the sense or breath groups of fluent speech but in a greatly slowed-down form of 'clear speech', with extra pauses added.
Phonemes and their allophones are really only written, descriptive models of language and, as such, simplified idealizations. They can't be found in articulation, they can't be found in the acoustic stream, and no one has shown convincing evidence that they form some sort of bottom-up phonological unit in language comprehension (for one thing, such models slow the process down far too much to account for comprehension at the speed humans actually achieve).
I suppose an IPA program could be created by assigning an IPA-based spelling to each word in a given language's lexicon. Since EFL learners' dictionaries use such notation to show a 'phonemic', canonical pronunciation of each word, whole words written in IPA characters could be used just as English spelling conventions are. I guess no one so far has seen such a step as useful or necessary. Do learners of EFL need to learn the English lexicon as spelled in IPA too? Most would balk, as would a lot of EFL teachers.
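A sketch of that idea: give each word an IPA-based spelling and then treat those IPA strings exactly as conventional spellings are treated, word by word. The three entries below are illustrative transcriptions I supplied for the example, not drawn from any real pronouncing dictionary.

```python
# Toy sketch: an IPA-keyed lexicon used as an alternative spelling
# system. The entries are illustrative, not a real dictionary.

IPA_LEXICON = {
    "cat": "kæt",
    "ship": "ʃɪp",
    "thought": "θɔːt",
}

def to_ipa(text):
    """Re-spell a phrase word by word using the IPA lexicon."""
    out = []
    for word in text.lower().split():
        # Fall back to the ordinary spelling for unlisted words.
        out.append(IPA_LEXICON.get(word, word))
    return " ".join(out)
```

The lookup is still whole-word, per the argument above: the IPA strings function as spellings, not as instructions for assembling phonemes.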
I did see a pronunciation-training program back in the Win 98/NT era of computing that seemed to be based not on whole words but on syllable types and morphemes. You typed a word, and the program produced a pronunciation for it. The animation seemed to be a sequencing of photo stills of a woman's face showing 'visemes'--the visual equivalent of a phoneme (e.g., an open round mouth for the English sound [ou]). By playing around with the input, I tried to 'reverse-engineer' the program in my head to see how they had analyzed the language. The analysis seemed to be:
1. Prominent phonemes
2. Their visual equivalents in oral gestures--visemes
3. Pronunciations of syllables (syllable types)
4. Pronunciation of whole words
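The four-level analysis above can be sketched roughly as a lookup pipeline: word to syllables, syllables to prominent phonemes, phonemes to viseme stills, with the stills then sequenced as the animation. Every table below is a hypothetical stand-in of my own, not the original program's data.

```python
# Rough sketch of the inferred pipeline. All mappings and filenames
# are hypothetical stand-ins for illustration only.

SYLLABLES = {"hello": ["he", "llo"]}                # word -> syllables
PHONEMES = {"he": ["h", "e"], "llo": ["l", "ou"]}   # syllable -> prominent phonemes
VISEMES = {                                         # phoneme -> photo still
    "h": "still_open.png",
    "e": "still_spread.png",
    "l": "still_tongue.png",
    "ou": "still_round.png",
}

def viseme_sequence(word):
    """Return the ordered list of photo stills for a typed word."""
    stills = []
    for syllable in SYLLABLES.get(word, []):
        for phoneme in PHONEMES.get(syllable, []):
            stills.append(VISEMES[phoneme])
    return stills
```

Sequencing stills this way explains why the gestures looked simpler than real speech: each phoneme gets one frozen face, with none of the coarticulation between neighboring sounds.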
It seemed to be quite a clever bit of programming to make the sequence of stills match up and produce connected speech that wasn't just a sequence of whole words. However, the visual oral gestures were much simpler than those of real speech (think of the tricks animators use to give the visual illusion of a face speaking, which our own brains fill in and match to the soundtrack we are listening to). It also didn't sound like very natural speech for phrases.
I think audio-visual files of whole words and phrases should be compiled into an audio-visual lexicon--for example, Ogden's Basic English plus the 3,800 most frequent words and lexical phrases of spoken English. Today's technology makes it possible; the state of ELT and its publishing leave it a mostly undesired niche.