Speech Technology and Research:
Retrospect and Prospect
By Gunnar Fant
(Contribution to Panel on the History of Speech Technology,
Session 35D1b; Janet Baker, Chair
Lisbon, September 6, 2005)
This is a very brief overview of developments in a historical frame focusing on periods of innovative character and new trends in methodology and applications. It is personal and far from exhaustive. Major aspects have been covered in earlier publications. The most recent one was a contribution to a conference at MIT last year: "From Sound to Sense: 50+ Years of Discoveries in Speech Communication". The conference proceedings contain résumés from several specialists in various areas. You may also find overviews in my recent book, Speech Acoustics and Phonetics.
I have considered four periods of 20 years each:
1. 1925-1945 An early period
2. 1945-1965 A pioneering era
3. 1965-1985 Basic knowledge and technology build-up
4. 1985-2005 Statistics and large-scale databases
Here are some comments:
There was an early electronic period around 1930-1945. A vowel synthesiser with four resonance circuits in parallel was described by the German scientist Karl Willy Wagner in 1936, that is, almost 15 years before similar developments at MIT and KTH. Furthermore, we may note the Grützmacher pitch analyser from 1937 and the Voder and Vocoder of Homer Dudley (1939), who introduced the analysis-synthesis concept in speech processing.
The major foundation of our field derives from a pioneering era around 1945-1965. The sound spectrograph, developed at Bell Laboratories in the early part of this period, opened our eyes to the time-varying nature of speech (Potter, Kopp and Green, 1947). I consider this to be the most important instrumental innovation ever made in speech research. The first major publication on acoustic phonetics was that of Martin Joos (1948). In these years, 1945-1960, interdisciplinary trends expanded and with them the foundations in acoustics, speech processing, information theory, psychology, physiology, linguistics and phonetics.
Information theory was developed in the early part of this period (Shannon and Weaver, 1949) and gained considerable interest. How do we define the information content in speech? The concept of redundancy not only provided a basis for efficient coding of speech for telephony. It also penetrated linguistic theory in search of minimal-redundancy descriptive systems. This has remained a major issue in phonology. My cooperation with Roman Jakobson and Morris Halle started in December 1949 when I came to MIT after four years of research at the Ericsson Telephone Company in Stockholm. It resulted in our Preliminaries to Speech Analysis (1952), introducing the concept of distinctive features and their correlates. Phonetic universals were treated in a minimum-redundancy frame of binary categories, with outlooks on production, acoustics and perception.
The later Chomsky and Halle system (1968) was articulatory oriented. It preserved a uniformity of terminology, which has had some applications in text-to-speech systems. However, it lacked phonetic realism inherent in the binary coding of place features. Roman and Morris also helped me to collect X-ray data from a Russian immigrant, which became a foundation for my book, "Acoustic Theory of Speech Production" in 1960. At MIT, related work on speech acoustics was pursued in co-operation with Ken Stevens. He has devoted a lifetime to studies in speech acoustics, which found their way into his monumental book from 1997. I also established co-operation with Jim Flanagan. His book Speech Analysis, Synthesis and Perception from 1965 has served as a major reference in speech acoustics and technology. It also conveys historical notes of interest. Another influential publication from this time is that of Peter Ladefoged (1967).
In this early period, system function theory gained ground and introduced the concept of poles and zeroes. The vowel synthesizer POVO at MIT and the OVE I at KTH in Stockholm (1953) employed a number of resonances, i.e. pole circuits. They had important applications in perceptual studies with speech-like stimuli. Transmission line analogs of the vocal tract were built for experimental purposes at KTH and MIT (Fant, 1960; Stevens, Kasowsky and Fant, 1953). At that time, before computers and transistors, speech hardware was housed in large racks containing vacuum tubes, coils and condensers. The first attempts at parametric synthesis of connected speech derive from this early period. The manually controlled OVE I from 1953 was capable of producing simple sentences of voiced sounds by continuous variations of F1, F2 and F0, with pre-set F3 and F4. A device, PAT, incorporating an optically scanned function generator, was introduced by Walter Lawrence (1953). A more complete parametric system, the OVE II, was developed in Stockholm (Fant and Martony, 1962). It had a nasal branch and a fricative branch in parallel with the vowel branch. Input parameters were derived from formant and pitch tracks together with voice and noise source contours. The overall configuration is retained in a present formant synthesis system at KTH.
Copy synthesis was well established at that time. Pioneering studies by John Holmes (1961) demonstrated almost perfect re-synthesis. Some of you may have heard his sentence: "I enjoy the simple life". From there on it took almost 20 years for rule-based text-to-speech synthesis to provide systems of reasonable quality. Ambitions were set at an early stage, but progress was limited by insufficient knowledge of the complex relations between linguistic structure and acoustical patterns. This is still a problem and even more so in automatic speech recognition. It reflects our lack of insight into the speech code, in older terminology the language of visible speech. I shall return to this topic.
An important instrumental tool from the pioneering era was the Pattern Play Back system developed at Haskins Laboratories in New York at the end of the 1940s. It enabled the direct play back of hand-painted, stylised formant patterns. Accumulated results from such analysis-by-synthesis experiments provided data on vowel and consonant patterns, and also introduced the locus concept of formant transitions (Liberman, Delattre and Cooper, 1952). At that time, for lack of direct speech analysis data, results gained from Pattern Play Back studies were considered to represent the language of Visible Speech.
I shall not spend much time on the intermediate period from 1965 to 1985. The impact of computer technology was enormous and widened the scope of research and technology. We gained deeper insights into speech perception and into models of the voice source. Work on text-to-speech synthesis was pursued at Bell Laboratories and other places, e.g. Haskins Laboratories, MIT and KTH. Text-to-speech for the blind and for the speech-handicapped came into use. The quality was far from perfect, but it served the purpose. It should be acknowledged that the same person or persons responsible for the development also devoted much time to supporting speech analysis. Examples are Dennis Klatt at MIT (1980) and Rolf Carlson and Björn Granström at KTH (1975) in Sweden. At that time speech recognition had an early start.
In the fourth period, from 1985 to 2005, we have encountered a rapid growth of speech technology, especially in computer departments, with much work devoted to speech recognition. Now I feel that there is a risk that progress will be limited by insufficient attention to the potentialities of speech and language research. The symbiosis between technology and basic research that made possible the advance in earlier periods now shows a tendency to turn into polarization. Speech technology is highly dependent on statistical tools and large databases, whilst phonetics tends to become fractionalized by narrowly defined problems or by abstract issues, with little or no relevance for the overall code of spoken language. Now, for access to the speech code, we need an integrated basic knowledge of speech production, acoustics, perception and cognitive processes and of the encoding of linguistically defined units in the speech chain. This task is plagued by the enormous variability of speech patterns with respect to language, dialect, speaker, age, speaking style, attitudes, emotions and overall context. The present approach is to resort to very large data banks of spoken language, which are submitted to rather primitive phonetic transcription, lacking proper co-variation of segmental, prosodic and voice features.
An example of the limitation of the now popular investments in large data banks is text-to-speech synthesis from concatenation of units of arbitrary size. These so-called Unit Selection systems work fine for limited vocabulary applications, but they lack rules for proper prosodic realisation of an arbitrary text. Even an advanced labelling system of a single very large text is insufficient for access to generative rules. We need a top-down search for acoustic realisation of linguistically defined categories within varying contexts.
So here is our dilemma:
On the one hand there is the statistical pattern-matching approach dominating present developments in speech synthesis and recognition, and on the other hand a knowledge approach in search of the speech code with its roots in general phonetics. Advanced applications require a substantial increase of basic knowledge. Much attention has been devoted to human performance in man-machine conversation, but we lack insights into the phonetic code. This is an issue that I have pursued since EUROSPEECH 89 in Paris (Fant, 1990). It was illustrated by the following picture of Mr Speech Technology, boldly surfing along, ignoring the reefs of the knowledge barriers, where he may get stranded. I have made frequent use of this illustration. It was drawn by my colleague Rolf Carlson.
But how can we improve our insight into the speech code? How do we cope with the enormous variability? This is a messy business of searching for rules of realisations within a given context, i.e. "ceteris paribus", to quote Roman Jakobson. It is a real challenge, but we cannot turn to phonetic textbooks for an answer. Phonetics of today is oriented towards rather narrow problems and questions of principle, but we lack exhaustive acoustic-phonetic descriptions of any language. Bits and pieces of data, when available, usually pertain to a limited or ill-defined context. The closest we can come in search of representative data is in manuals for text-to-speech synthesis, if available, but this is a source of limited scope and value.
I am now calling for an ambition to study acoustic phonetics with the same breadth as in studies of a foreign language. The object is the language of Visible Speech, to break the speech code. My recommendation is to collect acoustic data in a top-down linguistic frame of segmental and prosodic categories, with co-references to vocal tract dimensions and models of articulatory gestures. The overall aim is to derive rules for all possible contextual variations including extra-linguistic categories. Do we have to wait another half century?
[Figure: Mr. Speech Technology]
Carlson, R. and Granström, B. (1975). A text-to-speech system based on a phonetically oriented programming language, Speech Transmission Laboratory Quarterly Progress and Status Report, KTH, 1/1975, 1-4.
Chomsky, N. and Halle, M. (1968). The Sound Pattern of English, Harper and Row, New York.
Dudley, H. (1939). Remaking Speech, Journal of the Acoustical Society of America, 11, 169-177.
Fant, G. (1959). Acoustic analysis and synthesis of speech with applications to Swedish, Ericsson Technics 1, 1-106.
Fant, G. (1960). Acoustic Theory of Speech Production, Mouton, The Hague, Netherlands. 2nd edition 1970. (Translated into Russian, Nauka, Moskva, 1964.)
Fant, G. (1990). Speech research in perspective, Speech Communication 9, 171-176.
Fant, G. (2004). Speech research in a historical perspective. In J. Slifka, S. Manuel and M. Matthies (Eds.), From Sound to Sense: 50+ Years of Discoveries in Speech Communication, Research Laboratory of Electronics MIT, June 11-13, 2004, pp. 20-40.
Fant, G. (2004). Speech Acoustics and Phonetics, Selected Writings. Kluwer Academic Publishers - Springer, 2004.
Fant, G. and Martony, J. (1962). Speech synthesis instrumentation for parametric synthesis (OVE II), Speech Transmission Laboratory Quarterly Progress and Status Report, KTH, 2/1962, 18-24.
Flanagan, J. L. (1965). Speech Analysis Synthesis and Perception, Springer Verlag.
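Grützmacher, M. and Lottermoser, W. (1937). Über ein Verfahren zur trägheitsfreien Aufzeichnung von Melodiekurven, Akustische Zeitschrift 2, 242-248.
Holmes, J. (1961). Notes on synthesis work, Speech Transmission Laboratory Quarterly Progress and Status Report, KTH, 1/1961, 10-12.
Jakobson, R., Fant, G. and Halle, M. (1952). Preliminaries to speech analysis. The distinctive features and their correlates. Acoustics Laboratory, Massachusetts Inst. of Technology, Technical Report No. 13 (58 pages). Published by MIT Press, seventh edition, 1967.
Joos, M. (1948). Acoustic Phonetics, Language 24, 1-136.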
Klatt, D. H. (1980). Software for a cascade/parallel formant synthesizer, Journal of the Acoustical Society of America, 67, 971-995.
Klatt, D. (1987). Review of text to speech conversion for English, Journal of the Acoustical Society of America, 82, 737-793.
Ladefoged, P. (1967). Three areas of experimental phonetics, Oxford University Press.
Lawrence, W. (1953). The synthesis of speech from signals which have a low information rate, Communication Theory, Editor W. Jackson, London.
Liberman, A. M., Delattre, P. C. and Cooper, F. S. (1952). The role of selected stimulus variables in the perception of the unvoiced consonants, American Journal of Psychology, 65, 497-516.
Potter, R. K., Kopp, A. G. and Green, H. C. (1947). Visible Speech, New York.
Shannon, C. E. and Weaver, W. (1949). The Mathematical Theory of Communication, Urbana.
Stevens, K. N., Kasowsky, S. and Fant, G. (1953). An electrical analog of the vocal tract, Journal of the Acoustical Society of America, 25, 734-742.
Wagner, K. W. (1936). Ein neues elektrisches Sprechgerät zur Nachbildung der menschlichen Vokale, Preuss. Akad. Wiss. Berlin, Abh. 2.