Why don't text-to-'speech' programs process IPA or other such phonemic, phonetic or phonological scripts?
University of Fukui, Japan
What language would such a program put out? Text-to-speech/spoken-language programs basically work on the principle of recognizing whole words in a language (based on their relatively unique spellings and the spaces between them) and matching them up with recordings of pronounced whole words.
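To make the whole-word principle concrete, here is a minimal sketch of that kind of lookup in Python. The word list and clip filenames are hypothetical placeholders, not any real TTS system's data; the point is only that the unit of matching is the whole spelled word, found via the spaces between words.

```python
# Toy sketch of the whole-word lookup principle described above.
# LEXICON and its clip paths are hypothetical placeholders.

def tokenize(text):
    """Split text into words using the spaces between them."""
    return text.lower().split()

# Hypothetical lexicon: each whole word maps to one pre-recorded clip.
LEXICON = {
    "hello": "clips/hello.wav",
    "world": "clips/world.wav",
}

def synthesize(text):
    """Return the sequence of clips to play back, one per whole word."""
    clips = []
    for word in tokenize(text):
        # Out-of-vocabulary words get no recording at all: the system
        # knows whole spellings, not the sounds inside them.
        clips.append(LEXICON.get(word))
    return clips
```

Note that an unlisted word simply yields no audio, which is exactly the limitation of treating spellings as opaque wholes.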
The misconception here is that, when we speak, we somehow 'generate' phonemes or allophones, and that these are strung together into something that adds up to 'real speech'. There is no phonemic or allophonic model of a language that can do that--which is why speech recognition programs work only if you train them and bark simple words and phrases at them. Even the most algorithmically powerful ones depend on you speaking not in the sense or breath groups of fluent speech but in a greatly slowed-down form of 'clear speech', with extra pauses added.
Phonemes and their allophones are really only written, descriptive models of language and, as such, simplified idealizations. They can't be found in articulation, they can't be found in the acoustic stream, and no one has shown convincing evidence that they form some sort of bottom-up phonological unit in language comprehension (for one thing, such models slow the process down far too much to account for comprehension at the speed humans actually achieve).
I suppose an IPA program could be created by assigning an IPA-based spelling to each word in a given language's lexicon. Since EFL learners' dictionaries use such notation to show a 'phonemic', canonical pronunciation of each word, whole words written in IPA characters could be used just as English spelling conventions are. I guess no one so far has seen such a step as useful or necessary. Do learners of EFL need to learn the English lexicon as spelled in IPA too? Most would balk, as would a lot of EFL teachers.
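A sketch of that idea: give each word an IPA-based spelling and then treat those IPA strings exactly as conventional spellings are treated, word by word. The three entries below are illustrative transcriptions I supplied for the example, not drawn from any real pronouncing dictionary.

```python
# Toy sketch: an IPA-keyed lexicon used as an alternative spelling
# system. The entries are illustrative, not a real dictionary.

IPA_LEXICON = {
    "cat": "kæt",
    "ship": "ʃɪp",
    "thought": "θɔːt",
}

def to_ipa(text):
    """Re-spell a phrase word by word using the IPA lexicon."""
    out = []
    for word in text.lower().split():
        # Fall back to the ordinary spelling for unlisted words.
        out.append(IPA_LEXICON.get(word, word))
    return " ".join(out)
```

The lookup is still whole-word, per the argument above: the IPA strings function as spellings, not as instructions for assembling phonemes.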
I did see a pronunciation-training program back in the Win 98/NT era of computing that seemed to be based not on whole words but on syllable types and morphemes. You typed a word, and the program produced a pronunciation for it. The animation seemed to be a sequencing of photo stills of a woman's face showing 'visemes'--the visual equivalent of a phoneme (e.g., an open round mouth for the English sound [ou]). By playing around with the input, I tried to 'reverse-engineer' the program in my head to see how they had analyzed the language. The analysis seemed to be:
1. Prominent phonemes
2. Their visual equivalents in oral gestures--visemes
3. Pronunciations of syllables (syllable types)
4. Pronunciation of whole words
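The four-level analysis above can be sketched roughly as a lookup pipeline: word to syllables, syllables to prominent phonemes, phonemes to viseme stills, with the stills then sequenced as the animation. Every table below is a hypothetical stand-in of my own, not the original program's data.

```python
# Rough sketch of the inferred pipeline. All mappings and filenames
# are hypothetical stand-ins for illustration only.

SYLLABLES = {"hello": ["he", "llo"]}                # word -> syllables
PHONEMES = {"he": ["h", "e"], "llo": ["l", "ou"]}   # syllable -> prominent phonemes
VISEMES = {                                         # phoneme -> photo still
    "h": "still_open.png",
    "e": "still_spread.png",
    "l": "still_tongue.png",
    "ou": "still_round.png",
}

def viseme_sequence(word):
    """Return the ordered list of photo stills for a typed word."""
    stills = []
    for syllable in SYLLABLES.get(word, []):
        for phoneme in PHONEMES.get(syllable, []):
            stills.append(VISEMES[phoneme])
    return stills
```

Sequencing stills this way explains why the gestures looked simpler than real speech: each phoneme gets one frozen face, with none of the coarticulation between neighboring sounds.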
It seemed to be quite a clever bit of programming to make the sequence of stills match up and produce connected speech that wasn't just a sequence of whole words. However, the visual oral gestures were much simpler than those of real speech (think of the tricks animators use to give the visual illusion of a face speaking, which our own brains fill in and match to the soundtrack we are listening to). It also didn't sound like very natural speech for phrases.
I think audio-visual files of whole words and phrases should be compiled into an audio-visual lexicon--for example, Ogden's Basic English plus the 3,800 most frequent words and lexical phrases of spoken English. Today's technology makes it possible; the state of ELT and its publishing leave it a mostly undesired niche.