John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
For details of our global editorial offices, for customer services and for information about how to apply for
permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with the
Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by
the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be
available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names
and product names used in this book are trade names, service marks, trademarks or registered trademarks of their
respective owners. The publisher is not associated with any product or vendor mentioned in this book. This
publication is designed to provide accurate and authoritative information in regard to the subject matter covered.
It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional
advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data
Wölfel, Matthias.
Distant speech recognition / Matthias Wölfel, John McDonough.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-51704-8 (cloth)
1. Automatic speech recognition. I. McDonough, John (John W.) II. Title.
TK7882.S65W64 2009
006.4/54–dc22
2008052791
A catalogue record for this book is available from the British Library
ISBN 978-0-470-51704-8 (H/B)
Typeset in 10/12 Times by Laserwords Private Limited, Chennai, India
Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire
Contents

Foreword
Preface

1 Introduction
1.1 Research and Applications in Academia and Industry
1.1.1 Intelligent Home and Office Environments
1.1.2 Humanoid Robots
1.1.3 Automobiles
1.1.4 Speech-to-Speech Translation
1.2 Challenges in Distant Speech Recognition
1.3 System Evaluation
1.4 Fields of Speech Recognition
1.5 Robust Perception
1.5.1 A Priori Knowledge
1.5.2 Phonemic Restoration and Reliability
1.5.3 Binaural Masking Level Difference
1.5.4 Multi-Microphone Processing
1.5.5 Multiple Sources by Different Modalities
1.6 Organizations, Conferences and Journals
1.7 Useful Tools, Data Resources and Evaluation Campaigns
1.8 Organization of this Book
1.9 Principal Symbols used Throughout the Book
1.10 Units used Throughout the Book
2 Acoustics
2.1 Physical Aspect of Sound
2.1.1 Propagation of Sound in Air
2.1.2 The Speed of Sound
2.1.3 Wave Equation and Velocity Potential
2.1.4 Sound Intensity and Acoustic Power
2.1.5 Reflections of Plane Waves
2.1.6 Reflections of Spherical Waves
2.2 Speech Signals
2.2.1 Production of Speech Signals
2.2.2 Units of Speech Signals
2.2.3 Categories of Speech Signals
2.2.4 Statistics of Speech Signals
2.3 Human Perception of Sound
2.3.1 Phase Insensitivity
2.3.2 Frequency Range and Spectral Resolution
2.3.3 Hearing Level and Speech Intensity
2.3.4 Masking
2.3.5 Binaural Hearing
2.3.6 Weighting Curves
2.3.7 Virtual Pitch
2.4 The Acoustic Environment
2.4.1 Ambient Noise
2.4.2 Echo and Reverberation
2.4.3 Signal-to-Noise and Signal-to-Reverberation Ratio
2.4.4 An Illustrative Comparison between Close and Distant Recordings
2.4.5 The Influence of the Acoustic Environment on Speech Production
2.4.6 Coloration
2.4.7 Head Orientation and Sound Radiation
2.4.8 Expected Distances between the Speaker and the Microphone
2.5 Recording Techniques and Sensor Configuration
2.5.1 Mechanical Classification of Microphones
2.5.2 Electrical Classification of Microphones
2.5.3 Characteristics of Microphones
2.5.4 Microphone Placement
2.5.5 Microphone Amplification
2.6 Summary and Further Reading
2.7 Principal Symbols
3 Signal Processing and Filtering Techniques
3.1 Linear Time-Invariant Systems
3.1.1 Time Domain Analysis
3.1.2 Frequency Domain Analysis
3.1.3 z-Transform Analysis
3.1.4 Sampling Continuous-Time Signals
3.2 The Discrete Fourier Transform
3.2.1 Realizing LTI Systems with the DFT
3.2.2 Overlap-Add Method
3.2.3 Overlap-Save Method
3.3 Short-Time Fourier Transform
3.4 Summary and Further Reading
3.5 Principal Symbols
4 Bayesian Filters
4.1 Sequential Bayesian Estimation
4.2 Wiener Filter
4.2.1 Time Domain Solution
4.2.2 Frequency Domain Solution
4.3 Kalman Filter and Variations
4.3.1 Kalman Filter
4.3.2 Extended Kalman Filter
4.3.3 Iterated Extended Kalman Filter
4.3.4 Numerical Stability
4.3.5 Probabilistic Data Association Filter
4.3.6 Joint Probabilistic Data Association Filter
4.4 Particle Filters
4.4.1 Approximation of Probabilistic Expectations
4.4.2 Sequential Monte Carlo Methods
4.5 Summary and Further Reading
4.6 Principal Symbols
5 Speech Feature Extraction
5.1 Short-Time Spectral Analysis
5.1.1 Speech Windowing and Segmentation
5.1.2 The Spectrogram
5.2 Perceptually Motivated Representation
5.2.1 Spectral Shaping
5.2.2 Bark and Mel Filter Banks
5.2.3 Warping by Bilinear Transform – Time vs Frequency Domain
5.3 Spectral Estimation and Analysis
5.3.1 Power Spectrum
5.3.2 Spectral Envelopes
5.3.3 LP Envelope
5.3.4 MVDR Envelope
5.3.5 Perceptual LP Envelope
5.3.6 Warped LP Envelope
5.3.7 Warped MVDR Envelope
5.3.8 Warped-Twice MVDR Envelope
5.3.9 Comparison of Spectral Estimates
5.3.10 Scaling of Envelopes
5.4 Cepstral Processing
5.4.1 Definition and Characteristics of Cepstral Sequences
5.4.2 Homomorphic Deconvolution
5.4.3 Calculating Cepstral Coefficients
5.5 Comparison between Mel Frequency, Perceptual LP and Warped MVDR Cepstral Coefficient Front-Ends
5.6 Feature Augmentation
5.6.1 Static and Dynamic Parameter Augmentation
5.6.2 Feature Augmentation by Temporal Patterns
5.7 Feature Reduction
5.7.1 Class Separability Measures
5.7.2 Linear Discriminant Analysis
5.7.3 Heteroscedastic Linear Discriminant Analysis
5.8 Feature-Space Minimum Phone Error
5.9 Summary and Further Reading
5.10 Principal Symbols
6 Speech Feature Enhancement
6.1 Noise and Reverberation in Various Domains
6.1.1 Frequency Domain
6.1.2 Power Spectral Domain
6.1.3 Logarithmic Spectral Domain
6.1.4 Cepstral Domain
6.2 Two Principal Approaches
6.3 Direct Speech Feature Enhancement
6.3.1 Wiener Filter
6.3.2 Gaussian and Super-Gaussian MMSE Estimation
6.3.3 RASTA Processing
6.3.4 Stereo-Based Piecewise Linear Compensation for Environments
6.4 Schematics of Indirect Speech Feature Enhancement
14.7 Speaker-Tracking Performance vs Word Error Rate
14.8 Single-Speaker Beamforming Experiments
14.9 Speech Separation Experiments
14.10 Filter Bank Experiments
14.11 Summary and Further Reading
Appendices
A List of Abbreviations
B Useful Background
B.1 Discrete Cosine Transform
B.2 Matrix Inversion Lemma
B.3 Cholesky Decomposition
B.4 Distance Measures
B.5 Super-Gaussian Probability Density Functions
B.5.1 Generalized Gaussian pdf
B.5.2 Super-Gaussian pdfs with the Meijer G-function
B.6 Entropy
B.7 Relative Entropy
B.8 Transformation Law of Probabilities
B.9 Cascade of Warping Stages
B.10 Taylor Series
B.11 Correlation and Covariance
B.12 Bessel Functions
B.13 Proof of the Nyquist–Shannon Sampling Theorem
B.14 Proof of Equations (11.31–11.32)
B.15 Givens Rotations
B.16 Derivatives with Respect to Complex Vectors
B.17 Perpendicular Projection Operators
Bibliography
Index
Foreword
As the authors of Distant Speech Recognition note, automatic speech recognition is the
key enabling technology that will permit natural interaction between humans and intelligent machines. Core speech recognition technology has developed over the past decade
in domains such as office dictation and interactive voice response systems to the point
that it is now commonplace for customers to encounter automated speech-based intelligent
agents that handle at least the initial part of a user query for airline flight information, technical support, ticketing services, etc. While these limited-domain applications have been
reasonably successful in reducing the costs associated with handling telephone inquiries,
their fragility with respect to acoustical variability is illustrated by the difficulties that
are experienced when users interact with the systems using speakerphone input. As time
goes by, we will come to expect the range of natural human-machine dialog to grow to
include seamless and productive interactions in contexts such as humanoid robotic butlers
in our living rooms, information kiosks in large and reverberant public spaces, as well
as intelligent agents in automobiles while traveling at highway speeds in the presence of
multiple sources of noise. Nevertheless, this vision cannot be fulfilled until we are able
to overcome the shortcomings of present speech recognition technology that are observed
when speech is recorded at a distance from the speaker.
While we have made great progress over the past two decades in core speech recognition
technologies, the failure to develop techniques that overcome the effects of acoustical
variability in homes, classrooms, and public spaces is the major reason why automated
speech technologies are not generally available for use in these venues. Consequently,
much of the current research in speech processing is directed toward improving robustness
to acoustical variability of all types. Two of the major forms of environmental degradation
are produced by additive noise of various forms and the effects of linear convolution.
Research directed toward compensating for these problems has been in progress for more
than three decades, beginning with the pioneering work in the late 1970s of Steven Boll
in noise cancellation and Thomas Stockham in homomorphic deconvolution.
Additive noise arises naturally from sound sources that are present in the environment
in addition to the desired speech source. As the speech-to-noise ratio (SNR) decreases, it is
to be expected that speech recognition will become more difficult. In addition, the impact
of noise on speech recognition accuracy depends as much on the type of noise source as on
the SNR. While a number of statistical techniques are known to be reasonably effective in
dealing with the effects of quasi-stationary broadband additive noise of arbitrary spectral
coloration, compensation becomes much more difficult when the noise is highly transient
in nature, as is the case with many types of impulsive machine noise on factory floors and
gunshots in military environments. Interference by sources such as background music or
background speech is especially difficult to handle, as it is both highly transient in nature
and easily confused with the desired speech signal.
Reverberation is also a natural part of virtually all acoustical environments indoors, and
it is a factor in many outdoor settings with reflective surfaces as well. The presence of
even a relatively small amount of reverberation destroys the temporal structure of speech
waveforms. This has a very adverse impact on the recognition accuracy that is obtained
from speech systems that are deployed in public spaces, homes, and offices for virtually
any application in which the user does not use a head-mounted microphone. It is presently
more difficult to ameliorate the effects of common room reverberation than it has been
to render speech systems robust to the effects of additive noise, even at fairly low SNRs.
Researchers have begun to make progress on this problem only recently, and the results
of work from groups around the world have not yet congealed into a clear picture of how
to cope with the problem of reverberation effectively and efficiently.
Distant Speech Recognition by Matthias Wölfel and John McDonough provides an
extraordinarily comprehensive exposition of the most up-to-date techniques that enable
robust distant speech recognition, along with very useful and detailed explanations of
the underlying science and technology upon which these techniques are based. The
book includes substantial discussions of the major sources of difficulties along with
approaches that are taken toward their resolution, summarizing scholarly work and practical experience around the world that has accumulated over decades. Considering both
single-microphone and multiple-microphone techniques, the authors address a broad array
of approaches at all levels of the system, including methods that enhance the waveforms
that are input to the system, methods that increase the effectiveness of features that are
input to speech recognition systems, as well as methods that render the internal models
that are used to characterize speech sounds more robust to environmental variability.
This book will be of great interest to several types of readers. First (and most obviously), readers who are unfamiliar with the field of distant speech recognition can learn in
this volume all of the technical background needed to construct and integrate a complete
distant speech recognition system. In addition, the discussions in this volume are presented
in self-contained chapters that enable technically literate readers in all fields to acquire a
deep level of knowledge about relevant disciplines that are complementary to their own
primary fields of expertise. Computer scientists can profit from the discussions on signal
processing that begin with elementary signal representation and transformation and lead
to advanced topics such as optimal Bayesian filtering, multirate digital signal processing,
blind source separation, and speaker tracking. Classically-trained engineers will benefit
from the detailed discussion of the theory and implementation of computer speech recognition systems including the extraction and enhancement of features representing speech
sounds, statistical modeling of speech and language, along with the optimal search for the
best available match between the incoming utterance and the internally-stored statistical
representations of speech. Both of these groups will benefit from the treatments of physical acoustics, speech production, and auditory perception that are too frequently omitted
from books of this type. Finally, the detailed contemporary exposition will serve to bring
experienced practitioners who have been in the field for some time up to date on the most
current approaches to robust recognition for language spoken from a distance.
Doctors Wölfel and McDonough have provided a resource to scientists and engineers
that will serve as a valuable tutorial exposition and practical reference for all aspects
associated with robust speech recognition in practical environments as well as for speech
recognition in general. I am very pleased that this information is now available so easily
and conveniently in one location. I fully expect that the publication of Distant Speech Recognition will serve as a significant accelerant to future work in the field, bringing
us closer to the day in which transparent speech-based human-machine interfaces will
become a practical reality in our daily lives everywhere.
Richard M. Stern
Pittsburgh, PA, USA
Preface
Our primary purpose in writing this book has been to cover a broad body of techniques
and diverse disciplines required to enable reliable and natural verbal interaction between
humans and computers. In the early nineties, many claimed that automatic speech recognition (ASR) was a “solved problem” as the word error rate (WER) had dropped below the
5% level for professionally trained speakers such as in the Wall Street Journal (WSJ) corpus. This perception changed, however, when the Switchboard Corpus, the first corpus of
spontaneous speech recorded over a telephone channel, became available. In 1993, the first
reported error rates on Switchboard, obtained largely with ASR systems trained on WSJ
data, were over 60%, which represented a twelve-fold degradation in accuracy. Today the
ASR field stands at the threshold of another radical change. WERs on telephony speech
corpora such as the Switchboard Corpus have dropped below 10%, prompting many to
once more claim that ASR is a solved problem. But such a claim is credible only if one
ignores the fact that such WERs are obtained with close-talking microphones, such as
those in telephones, and when only a single person is speaking. One of the primary hindrances to the widespread acceptance of ASR as the man-machine interface of first choice
is the necessity of wearing a head-mounted microphone. This necessity is dictated by the
fact that, under the current state of the art, WERs with microphones located a meter or
more away from the speaker’s mouth can catastrophically increase, making most applications impractical. The interest in developing techniques for overcoming such practical
limitations is growing rapidly within the research community. This change, like so many
others in the past, is being driven by the availability of new corpora, namely, speech
corpora recorded with far-field sensors. Examples of such include the meeting corpora
which have been recorded at various sites including the International Computer Science
Institute in Berkeley, California, Carnegie Mellon University in Pittsburgh, Pennsylvania
and the National Institute of Standards and Technologies (NIST) near Washington, D.C.,
USA. In 2005, conversational speech corpora that had been collected with microphone arrays became available for the first time, after being released by the European Union projects Computers in the Human Interaction Loop (CHIL) and Augmented Multiparty Interaction (AMI). Data collected by both projects was subsequently shared with NIST
for use in the semi-annual Rich Transcription evaluations it sponsors. In 2006 Mike Lincoln at Edinburgh University in Scotland collected the first corpus of overlapping speech
captured with microphone arrays. This data collection effort involved real speakers who
read sentences from the 5,000 word WSJ task.
In the view of the current authors, groundbreaking progress in the field of distant speech
recognition can only be achieved if the mainstream ASR community adopts methodologies and techniques that have heretofore been confined to the fringes. Such technologies
include speaker tracking for determining a speaker’s position in a room, beamforming for
combining the signals from an array of microphones so as to concentrate on a desired
speaker’s speech and suppress noise and reverberation, and source separation for effective
recognition of overlapping speech. Terms like filter bank, generalized sidelobe canceller,
and diffuse noise field must become household words within the ASR community. At
the same time researchers in the fields of acoustic array processing and source separation
must become more knowledgeable about the current state of the art in the ASR field.
This community must learn to speak the language of word lattices, semi-tied covariance
matrices, and weighted finite-state transducers. For too long, the two research communities have been content to effectively ignore one another. With a few notable exceptions,
the ASR community has behaved as if a speech signal does not exist before it has been
converted to cepstral coefficients. The array processing community, on the other hand,
continues to publish experimental results obtained on artificial data, with ASR systems
that are nowhere near the state of the art, and on tasks that have long since ceased to
be of any research interest in the mainstream ASR world. It is only if each community
adopts the best practices of the other that they can together meet the challenge posed by
distant speech recognition. We hope with our book to make a step in this direction.
Acknowledgments
We wish to thank the many colleagues who have reviewed parts of this book and provided
very useful feedback for improving its quality and correctness. In particular we would
like to thank the following people: Elisa Barney Smith, Friedrich Faubel, Sadaoki Furui,
Reinhold Häb-Umbach, Kenichi Kumatani, Armin Sehr, Antske Fokkens, Richard Stern,
Piergiorgio Svaizer, Helmut Wölfel, Najib Hadir, Hassan El-soumsoumani, and Barbara
Rauch. Furthermore we would like to thank Tiina Ruonamaa, Sarah Hinton, Anna Smart,
Sarah Tilley, and Brett Wells at Wiley who have supported us in writing this book and
provided useful insights into the process of producing a book, not to mention having
demonstrated the patience of saints through many delays and deadline extensions. We
would also like to thank the university library at Universität Karlsruhe (TH) for providing
us with a great deal of scholarly material, either online or in books.
We would also like to thank the people who have supported us during our careers in
speech recognition. First of all thanks is due to our Ph.D. supervisors Alex Waibel, Bill
Byrne, and Frederick Jelinek who have fostered our interest in the field of automatic
speech recognition. Satoshi Nakamura, Mari Ostendorf, Dietrich Klakow, Mike Savic,
Gerasimos (Makis) Potamianos, and Richard Stern always proved more than willing to
listen to our ideas and scientific interests, for which we are grateful. We would furthermore
like to thank IEEE and ISCA for providing platforms for exchange, publications and for
hosting various conferences. We are indebted to Jim Flanagan and Harry Van Trees, who
were among the great pioneers in the array processing field. We are also much obliged to
the tireless employees at NIST, including Vince Stanford, Jon Fiscus and John Garofolo,
for providing us with our first real microphone array, the Mark III, and hosting the
annual evaluation campaigns which have provided a tremendous impetus for advancing
the entire field. Thanks is due also to Cedrick Rochét for having built the Mark III while at NIST, and having improved it while at Universität Karlsruhe (TH). In the latter
effort, Maurizio Omologo and his coworkers at ITC-irst in Trento, Italy were particularly
helpful. We would also like to thank Kristian Kroschel at Universität Karlsruhe (TH) for
having fostered our initial interest in microphone arrays and agreeing to collaborate in
teaching a course on the subject. Thanks is due also to Mike Riley and Mehryar Mohri
for inspiring our interest in weighted finite-state transducers. Emilian Stoimenov was an
important contributor to many of the finite-state transducer techniques described here.
And of course, the list of those to whom we are indebted would not be complete if we
failed to mention the undergraduates and graduate students at Universität Karlsruhe (TH)
who helped us to build an instrumented seminar room for the CHIL project, and thereafter
collect the audio and video data used for many of the experiments described in the final
chapter of this work. These include Tobias Gehrig, Uwe Mayer, Fabian Jakobs, Keni
Bernardin, Kai Nickel, Hazim Kemal Ekenel, Florian Kraft, and Sebastian Stüker. We
are also naturally grateful to the funding agencies who made the research described in
this book possible: the European Commission, the American Defense Advanced Research
Projects Agency, and the Deutsche Forschungsgemeinschaft.
Most important of all, our thanks goes to our families. In particular, we would like
to thank Matthias' wife Irina Wölfel, without whose support during the many evenings,
holidays and weekends devoted to writing this book, we would have had to survive
only on cold pizza and Diet Coke. Thanks is also due to Helmut and Doris Wölfel, John
McDonough, Sr. and Christopher McDonough, without whose support through life’s many
trials, this book would not have been possible. Finally, we fondly remember Kathleen
McDonough.
Matthias Wölfel
Karlsruhe, Germany
John McDonough
Saarbrücken, Germany
1 Introduction
For humans, speech is the quickest and most natural form of communication. Beginning
in the late 19th century, verbal communication has been systematically extended through
technologies such as radio broadcast, telephony, TV, CD and MP3 players, mobile phones
and the Internet by voice over IP. In addition to these examples of one and two way verbal
human–human interaction, in the last decades, a great deal of research has been devoted to
extending our capacity of verbal communication with computers through automatic speech recognition (ASR) and speech synthesis. The goal of this research effort has been and remains to enable simple and natural human-computer interaction (HCI). Achieving this
goal is of paramount importance, as verbal communication is not only fast and convenient,
but also the only feasible means of HCI in a broad variety of circumstances. For example,
while driving, it is much safer to simply ask a car navigation system for directions, and
to receive them verbally, than to use a keyboard for tactile input and a screen for visual
feedback. Moreover, hands-free computing is also accessible for disabled users.
1.1 Research and Applications in Academia and Industry
Hands-free computing, much like hands-free speech processing, refers to computer interface configurations which allow an interaction between the human user and computer
without the use of the hands. Specifically, this implies that no close-talking microphone
is required. Hands-free computing is important because it is useful in a broad variety
of applications where the use of other common interface devices, such as a mouse or
keyboard, are impractical or impossible. Examples of some currently available hands-free
computing devices are camera-based head location and orientation-tracking systems, as
well as gesture-tracking systems. Of the various hands-free input modalities, however,
distant speech recognition (DSR) systems provide by far the most flexibility. When used
in combination with other hands-free modalities, they provide for a broad variety of HCI
possibilities. For example, in combination with a pointing gesture system it would become
possible to turn on a particular light in the room by pointing at it while saying, “Turn on
this light.”
The remainder of this section describes a variety of applications where speech recognition technology is currently under development or already available commercially. The
application areas include intelligent home and office environments, humanoid robots,
automobiles, and speech-to-speech translation.
1.1.1 Intelligent Home and Office Environments
A great deal of research effort is directed towards equipping household and office
devices – such as appliances, entertainment centers, personal digital assistants and
computers, phones or lights – with more user friendly interfaces. These devices should
be unobtrusive and should not require any special attention from the user. Ideally such
devices should know the mental state of the user and act accordingly, gradually relieving
household inhabitants and office workers from the chore of manual control of the
environment. This is possible only through the application of sophisticated algorithms
such as speech and speaker recognition applied to data captured with far-field sensors.
In addition to applications centered on HCI, computers are gradually gaining the capacity of acting as mediators for human-human interaction. The goal of the research in this
area is to build a computer that will serve human users in their interactions with other
human users; instead of requiring that users concentrate on their interactions with the
machine itself, the machine will provide ancillary services enabling users to attend exclusively to their interactions with other people. Based on a detailed understanding of human
perceptual context, intelligent rooms will be able to provide active assistance without any
explicit request from the users, thereby requiring a minimum of attention from and creating no interruptions for their human users. In addition to speech recognition, such services
need qualitative human analysis and human factors, natural scene analysis, multimodal
structure and content analysis, and HCI. All of these capabilities must also be integrated
into a single system.
Such interaction scenarios have been addressed by the recent projects Computers in the Human Interaction Loop (CHIL), Augmented Multi-party Interaction (AMI), as well
as the successor of the latter Augmented Multi-party Interaction with Distance Access
(AMIDA), all of which were sponsored by the European Union. To provide such services
requires technology that models human users, their activities, and intentions. Automatically recognizing and understanding human speech plays a fundamental role in developing
such technology. Therefore, all of the projects mentioned above have sought to develop
technology for automatic transcription using speech data captured with distant microphones, determining who spoke when and where, and providing other useful services
such as the summarizations of verbal dialogues. Similarly, the Cognitive Assistant that
Learns and Organizes (CALO) project sponsored by the US Defense Advanced Research
Projects Agency (DARPA), takes as its goal the extraction of information from audio data
captured during group interactions.
A typical meeting scenario as addressed by the AMIDA project is shown in Figure 1.1.
Note the three microphone arrays placed at various locations on the table, which are
intended to capture far-field speech for speaker tracking, beamforming, and DSR experiments. Although not shown in the photograph, the meeting participants typically also
wear close-talking microphones to provide the best possible sound capture as a reference
against which to judge the performance of the DSR system.
1.1.2 Humanoid Robots
If humanoid robots are ever to be accepted as full 'partners' by their human users, they
must eventually develop perceptual capabilities similar to those possessed by humans, as
well as the capacity of performing a diverse collection of tasks, including learning, reasoning, communicating and forming goals through interaction with both users and instructors.
To provide for such capabilities, ASR is essential, because, as mentioned previously, spoken communication is the most common and flexible form of communication between
people. To provide a natural interaction between a human and a humanoid robot requires
not only the development of speech recognition systems capable of functioning reliably
on data captured with far-field sensors, but also natural language capabilities including a
sense of social interrelations and hierarchies.
In recent years, humanoid robots, albeit with very limited capabilities, have become
commonplace. They are, for example, deployed as entertainment or information systems.
Figure 1.2 shows an example of such a robot, namely, the humanoid tour guide robot TPR-Robina (ROBINA: ROBot as INtelligent Assistant) developed by Toyota. The robot is able to escort visitors around the Toyota Kaikan Exhibition Hall and to interact with them through a combination of verbal communication and gestures.
While humanoid robots programmed for a limited range of tasks are already in
widespread use, such systems lack the capability of learning and adapting to new
environments. The development of such a capacity is essential for humanoid robots to
become helpful in everyday life. The Cognitive Systems for Cognitive Assistants (COSY) project, financed by the European Union, has the objective of developing two kinds of robots with such advanced capabilities. The first robot will find its way around a complex building, showing others where to go and answering questions about routes and locations. The second will be able to manipulate structured objects on a table top. A photograph of the second COSY robot during an interaction session is shown in Figure 1.3.

Figure 1.2 The humanoid tour guide robot TPR-Robina by Toyota, which escorts visitors around the Toyota Kaikan Exhibition Hall
1.1.3 Automobiles
There is a growing trend in the automotive industry towards increasing both the number
and the complexity of the features available in high end models. Such features include
entertainment, navigation, and telematics systems, all of which compete for the driver’s
visual and auditory attention, and can increase his cognitive load. ASR in such automobile
environments would promote the “Eyes on the road, hands on the wheel” philosophy. This
would not only provide more convenience for the driver, but would in addition actually
enhance automotive safety. The enhanced safety is provided by hands-free operation of
everything but the car itself and thus would leave the driver free to concentrate on the
road and the traffic. Most luxury cars already have some sort of voice-control system
which is, for example, able to provide:
• Voice-activated, hands-free calling: allows anyone in the contact list of the driver's mobile phone to be called by voice command.
• Voice-activated music: enables browsing through music using voice commands.
• Audible information and text messages: makes it possible to synthesize information and text messages and have them read out loud through speech synthesis.
This and other voice-controlled functionality will become available in the mass market
in the near future. An example of a voice-controlled car navigation system is shown in
Figure 1.4.
While high-end consumer automobiles have ever more features available, all of which
represent potential distractions from the task of driving the car, a police automobile has far
more devices that place demands on the driver's attention. The goal of Project54 is to measure the cognitive load of New Hampshire state policemen, who use speech-based interfaces in their cars, during the course of their duties. Shown in Figure 1.5 is the
car simulator used by Project54 to measure the response times of police officers when
confronted with the task of driving a police cruiser as well as manipulating the several
devices contained therein through a speech interface.
1.1.4 Speech-to-Speech Translation
Speech-to-speech translation systems provide a platform enabling communication with
others without the requirement of speaking or understanding a common language. Given
the nearly 6,000 different languages presently spoken somewhere on the Earth, and the
ever-increasing rate of globalization and frequency of travel, this is a capacity that will
in future be ever more in demand.
Even though speech-to-speech translation remains a very challenging task, commercial
products are already available that enable meaningful interactions in several scenarios. One
such system from Nippon Telegraph and Telephone (NTT) DoCoMo of Japan works on a
common cell phone, as shown in Figure 1.6, providing voice-activated Japanese–English
and Japanese–Chinese translation. In a typical interaction, the user speaks short Japanese
phrases or sentences into the mobile phone. As the mobile phone does not provide
enough computational power for complete speech-to-text translation, the speech signal
is transformed into enhanced speech features which are transmitted to a server. The
server, operated by ATR-Trek, recognizes the speech and provides statistical translations,
which are then displayed on the screen of the cell-phone. The current system works
for both Japanese–English and Japanese–Chinese language pairs, offering translation in
both directions. For the future, however, preparation is underway to include support for
additional languages.
As the translations appear on the screen of the cell phone in the DoCoMo system, there
is a natural desire by users to hold the phone so that the screen is visible instead of next
to the ear. This would imply that the microphone is no longer only a few centimeters
from the mouth; i.e., we would have once more a distant speech recognition scenario.
Indeed, there is a similar trend in all hand-held devices supporting speech input.
Accurate translation of unrestricted speech is well beyond the capability of today’s
state-of-the-art research systems. Therefore, advances are needed to improve the
technologies for both speech recognition and speech translation. The development of
such technologies are the goals of the Technology and Corpora for Speech-to-Speech
Translation (TC-Star) project, financially supported by the European Union, as well as the
Global Autonomous Language Exploitation (GALE) project sponsored by the DARPA.
These projects respectively aim to develop the capability for unconstrained conversational
speech-to-speech translation of English speeches given in the European Parliament, and
of broadcast news in Chinese or Arabic.
1.2 Challenges in Distant Speech Recognition
To guarantee high-quality sound capture, the microphones used in an ASR system should
be located at a fixed position, very close to the sound source, namely, the mouth of
the speaker. Thus body-mounted microphones, such as headsets or lapel microphones,
provide the highest sound quality. Such microphones are not practical in a broad variety
of situations, however, as they must be connected by a wire or radio link to a computer
and attached to the speaker’s body before the HCI can begin. As mentioned previously,
this makes HCI impractical in many situations where it would be most helpful; e.g., when
communicating with humanoid robots, or in intelligent room environments.
Although ASR is already used in several commercially available products, there are still
obstacles to be overcome in making DSR commercially viable. The two major sources
of degradation in DSR are distortions, such as additive noise and reverberation, and a
mismatch between training and test data, such as that introduced by speaking style
or accent. In DSR scenarios, the quality of the speech provided to the recognizer has a
decisive impact on system performance. This implies that speech enhancement techniques
are typically required to achieve the best possible signal quality.
In the last decades, many methods have been proposed to enable ASR systems to
compensate or adapt to mismatch due to interspeaker differences, articulation effects and
microphone characteristics. Today, those systems work well for different users on a broad
variety of applications, but only as long as the speech captured by the microphones is
free of other distortions. This explains the severe performance degradation encountered
in current ASR systems as soon as the microphone is moved away from the speaker’s
mouth. Such situations are known as distant, far-field or hands-free² speech recognition.
This dramatic drop in performance occurs mainly due to three different types of distortion:
• The first is noise, also known as background noise,³ which is any sound other than the desired speech, such as that from air conditioners, printers, machines in a factory, or speech from other speakers.
• The second distortion is echo and reverberation, which are reflections of the sound
source arriving some time after the signal on the direct path.
• Other types of distortions are introduced by environmental factors such as room modes, the orientation of the speaker's head, or the Lombard effect.
To limit the degradation in system performance introduced by these distortions, a great
deal of current research is devoted to exploiting several aspects of speech captured with
far-field sensors. In DSR applications, procedures already known from conventional ASR
can be adopted. For instance, confusion network combination is typically used with data
captured with a close-talking microphone to fuse word hypotheses obtained by using
various speech feature extraction schemes or even completely different ASR systems.
For DSR with multiple microphone conditions, confusion network combination can be
used to fuse word hypotheses from different microphones. Speech recognition with distant
sensors also introduces the possibility, however, of making use of techniques that were
either developed in other areas of signal processing, or that are entirely novel. It has
become common in the recent past, for example, to place a microphone array in the
speaker’s vicinity, enabling the speaker’s position to be determined and tracked with
time. Through beamforming techniques, a microphone array can also act as a spatial
filter to emphasize the speech of the desired speaker while suppressing ambient noise
or simultaneous speech from other speakers. Moreover, human speech has temporal,
spectral, and statistical characteristics that are very different from those possessed by
other signals for which conventional beamforming techniques have been used in the past.
Recent research has revealed that these characteristics can be exploited to perform more
effective beamforming for speech enhancement and recognition.
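As a first concrete impression of how an array acts as a spatial filter, the following sketch implements the simplest beamformer, delay-and-sum: each channel is advanced so that the desired speaker's wavefront lines up across microphones before the channels are averaged. This is only a toy time-domain version with integer-sample delays and invented signal values, not one of the beamformer designs developed later in the book.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Toy delay-and-sum beamformer with non-negative integer steering delays.
    Each channel is advanced by its delay (in samples) so that the desired
    speaker's wavefront lines up across microphones; the aligned channels
    are then averaged."""
    length = len(channels[0])
    out = np.zeros(length)
    for x, d in zip(channels, delays):
        advanced = np.zeros(length)
        advanced[:length - d] = x[d:]          # advance the channel by d samples
        out += advanced
    return out / len(channels)

# Illustration: the same tone reaches three microphones with different delays
# and independent noise; steering toward the source gives an output closer to
# the clean signal than any single channel.
rng = np.random.default_rng(1)
fs = 16000
clean = np.sin(2 * np.pi * 200.0 * np.arange(8000) / fs)
delays = [0, 3, 7]
channels = []
for d in delays:
    x = np.zeros(8000)
    x[d:] = clean[:8000 - d]                   # propagation delay to this microphone
    channels.append(x + 0.5 * rng.standard_normal(8000))
enhanced = delay_and_sum(channels, delays)
print(np.mean((channels[0] - clean) ** 2))     # single noisy channel, about 0.25
print(np.mean((enhanced - clean) ** 2))        # beamformed output, roughly a third of that
```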
² The latter term is misleading, inasmuch as close-talking microphones are usually not held in the hand, but are mounted to the head or body of the speaker.
³ This term is also misleading, in that the “background” could well be closer to the microphone than the “foreground” signal of interest.
1.3 System Evaluation
Quantitative measures of the quality or performance of a system are essential for making
fundamental advances in the state-of-the-art. This fact is embodied in the often repeated
statement, “You improve what you measure.” In order to assess system performance, it is
essential to have error metrics or objective functions at hand which are well-suited to the
problem under investigation. Unfortunately, good objective functions do not exist for a
broad variety of problems, on the one hand, or else cannot be directly or automatically
evaluated, on the other.
Since the early 1980s, word error rate (WER) has emerged as the measure of first choice
for determining the quality of automatically-derived speech transcriptions. As typically
defined, an error in a speech transcription is of one of three types, all of which we will
now describe. A deletion occurs when the recognizer fails to hypothesize a word that
was spoken. An insertion occurs when the recognizer hypothesizes a word that was not
spoken. A substitution occurs when the recognizer misrecognizes a word. These three
errors are illustrated in the following partial hypothesis, where they are labeled with D,
I, and S, respectively:
Hyp:  BUT  ***  WILL SELL THE CHAIN ... FOR EACH STORE SEPARATELY
Utt:  ***  IT   WILL SELL THE CHAIN ... OR  EACH STORE SEPARATELY
       I    D                            S
A more thorough discussion of word error rate is given in Section 14.1.
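To make these definitions concrete, the short sketch below aligns a hypothesis against a reference transcription with a minimum edit distance and counts the three error types; the WER is then their sum divided by the number of reference words. The word lists are invented for illustration and the routine is not taken from the book.

```python
def wer_counts(ref, hyp):
    """Align hyp against ref by minimum edit distance and return
    (substitutions, deletions, insertions, word error rate)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]   # d[i][j]: cost of ref[:i] vs hyp[:j]
    for i in range(1, n + 1):
        d[i][0] = i                             # delete all reference words
    for j in range(1, m + 1):
        d[0][j] = j                             # insert all hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    subs = dels = ins = 0                       # backtrack to count error types
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins, float(subs + dels + ins) / n

ref = "it will sell the chain".split()
hyp = "but it will sell the chains".split()
print(wer_counts(ref, hyp))    # (1, 0, 1, 0.4): one substitution, one insertion
```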
Even though widely accepted and used, word error rate is not without flaws. It has
been argued that the equal weighting of words should be replaced by a context sensitive
weighting, whereby, for example, information-bearing keywords should be assigned a
higher weight than functional words or articles. Additionally, it has been asserted that word
similarities should be considered. Such approaches, however, have never been widely
adopted as they are more difficult to evaluate and involve subjective judgment. Moreover,
these measures would raise new questions, such as how to measure the distance between
words or which words are important.
Naively it could be assumed that WER would be sufficient in ASR as an objective
measure. While this may be true for the user of an ASR system, it does not hold for the
engineer. In fact a broad variety of additional objective or cost functions are required.
These include:
• The Mahalanobis distance, which is used to evaluate the acoustic model.
• Perplexity, which is used to evaluate the language model as described in Section 7.3.1 (a small numerical sketch follows this list).
• Class separability, which is used to evaluate the feature extraction component or front-end.
• Maximum mutual information or minimum phone error, which are used during discriminative estimation of the parameters in a hidden Markov model.
• Maximum likelihood, which is the metric of first choice for the estimation of all system parameters.
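As a small numerical illustration of the second item above, the sketch below computes the perplexity of a word sequence under a bigram language model as the inverse geometric mean of the per-word probabilities. The toy model and its probabilities are invented purely for illustration.

```python
import math

def perplexity(sentence, bigram_prob):
    """Perplexity of a word sequence under a bigram language model;
    bigram_prob maps (previous_word, word) -> P(word | previous_word)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(words[:-1], words[1:]):
        log_prob += math.log2(bigram_prob[(prev, cur)])
    n = len(words) - 1                        # number of predicted tokens
    return 2.0 ** (-log_prob / n)

toy_model = {
    ("<s>", "turn"): 0.5, ("turn", "on"): 0.8, ("on", "the"): 0.6,
    ("the", "light"): 0.1, ("light", "</s>"): 0.9,
}
print(perplexity("turn on the light", toy_model))   # about 2.15
```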
A DSR system requires additional objective functions to cope with problems not encoun-
tered in data captured with close-talking microphones. Among these are:
• Cross-correlation, which is used to estimate time delays of arrival between microphone pairs as described in Section 10.1 (a short sketch follows this list).
• Signal-to-noise ratio, which can be used for channel selection in a multiple-microphone
data capture scenario.
• Negentropy, which can be used for combining the signals captured by all sensors of a
microphone array.
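To illustrate the first of these additional objective functions, the sketch below estimates the time delay of arrival between two microphone channels from the location of the cross-correlation peak. It is illustrative only, with a simulated delay; the more refined correlation-based estimators actually used for speaker localization are described in Section 10.1.

```python
import numpy as np

def estimate_tdoa(x1, x2, sample_rate):
    """Estimate the delay (in seconds) with which the signal in x1 reappears
    in x2, taken from the peak of their cross-correlation."""
    cc = np.correlate(x2, x1, mode="full")     # lags from -(len(x1)-1) to len(x2)-1
    lag = np.argmax(cc) - (len(x1) - 1)
    return lag / float(sample_rate)

# Simulated example: the same waveform arrives 25 samples later at microphone 2.
fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(2048)
x1 = s
x2 = np.concatenate((np.zeros(25), s[:-25]))
print(estimate_tdoa(x1, x2, fs) * 1000.0)      # about 1.56 ms (25 samples at 16 kHz)
```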
Most of the objective functions mentioned above are useful because they show a significant correlation with WER. The performance of a system is optimized by minimizing or
maximizing a suitable objective function. The way in which this optimization is conducted
depends both on the objective function and the nature of the underlying model. In the best
case, a closed-form solution is available, such as in the optimization of the beamforming
weights as discussed in Section 13.3. In other cases, an iterative solution can be adopted,
such as when optimizing the parameters of a hidden Markov model (HMM) as discussed
in Chapter 8. In still other cases, numerical optimization algorithms must be used, such as when optimizing the parameters of an all-pass transform for speaker adaptation, as discussed in Section 9.2.2.
To choose the appropriate objective function, a number of decisions must be made (Hänsler and Schmidt 2004, sect. 4):
• What kind of information is available?
• How should the available information be used?
• How should the error be weighted by the objective function?
• Should the objective function be deterministic or stochastic?
Throughout the balance of this text, we will strive to answer these questions whenever
introducing an objective function for a particular application or in a particular context.
When a given objective function is better suited than another for a particular purpose, we
will indicate why. As mentioned above, the reasoning typically centers around the fact
that the better suited objective function is more closely correlated with word error rate.
1.4 Fields of Speech Recognition
Figure 1.7 presents several subtopics of speech recognition in general which can be
associated with three different fields: automatic, robust and distant speech recognition.
While some topics such as multilingual speech recognition and language modeling can
be clearly assigned to one group (i.e., automatic), other topics such as feature extraction
or adaptation cannot be uniquely assigned to a single group. A second classification of
topics shown in Figure 1.7 depends on the number and type of sensors. Whereas one
microphone is traditionally used for recognition, in distant recognition the traditional
sensor configuration can be augmented by an entire array of microphones with known or
unknown geometry. For specific tasks such as lipreading or speaker localization, additional
sensor types such as video cameras can be used.
Figure 1.7 Illustration of the different fields of speech recognition: automatic, robust and distant (topics are further grouped by sensor configuration: single microphone, multi-microphone, multi-sensor)

Undoubtedly, the construction of optimal DSR systems must draw on concepts from several fields, including acoustics, signal processing, pattern recognition, speaker tracking and beamforming. As has been shown in the past, all components can be optimized separately to construct a DSR system. Such an independent treatment, however, does not allow for optimal performance. Moreover, new techniques have recently emerged
exploiting the complementary effects of the several components of a DSR system. These
include:
• More closely coupling the feature extraction and acoustic models; e.g., by propagating
the uncertainty of the feature extraction into the HMM.
• Feeding the word hypotheses produced by the DSR system back to components located earlier in the processing chain, e.g., by feature enhancement with particle filters using models for different phoneme classes.
• Replacing traditional objective functions such as signal-to-noise ratio by objective
functions taking into account the acoustic model of the speech recognition system,
as in maximum likelihood beamforming, or considering the particular characteristics of
human speech, as in maximum negentropy beamforming.
1.5 Robust Perception
In contrast to automatic pattern recognition, human perception is very robust in the
presence of distortions such as noise and reverberation. Therefore, knowledge of the
mechanisms of human perception, in particular with regard to robustness, may also be
useful in the development of automatic systems that must operate in difficult acoustic
environments. It is interesting to note that the cognitive load for humans increases while
listening in noisy environments, even when the speech remains intelligible (Kjellberg
et al. 2007). This section presents some illustrative examples of human perceptual
phenomena and robustness. We also present several technical solutions based on these
phenomena which are known to improve robustness in automatic recognition.
1.5.1 A Priori Knowledge
When confronted with an ambiguous stimulus requiring a single interpretation, the human
brain must rely on a priori knowledge and expectations. What is likely to be one of the
most amazing findings about the robustness and flexibility of human perception and the
use of a priori information is illustrated by the following sentence, which was circulated on the Internet in September 2003:
Aoccdrnig to rscheearch at Cmabrigde uinervtisy, it deosn’t mttaer waht oredr the
ltteers in a wrod are, the olny ipromoetnt tihng is taht the frist and lsat ltteres are
at the rghit pclae. The rset can be a tatol mses and you can sitll raed it wouthit a
porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a
wlohe.
The text is easy to read for a human inasmuch as, through reordering, the brain maps
the erroneously presented characters into correct English words.
A priori knowledge is also widely used in automatic speech processing. Obvious
examples are
• the statistics of speech,
• the limited number of possible phoneme combinations constrained by known words
which might be further constrained by the domain,
• the fact that word sequences follow a particular structure, which can be represented as a context-free grammar, or the knowledge of likely successive words, represented as an N-gram (a small sketch of such a model follows this list).
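As a minimal sketch of the last item, the following code estimates bigram probabilities from a toy corpus; the sentences and resulting probabilities are invented purely for illustration. A recognizer can use exactly this kind of prior knowledge to prefer word sequences that are plausible continuations of what has already been recognized.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Maximum-likelihood bigram probabilities P(word | previous word)
    estimated from a list of example sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        for prev, cur in zip(words[:-1], words[1:]):
            counts[prev][cur] += 1
    return {prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for prev, ctr in counts.items()}

corpus = [
    "turn on the light",
    "turn on the radio",
    "turn off the light",
]
model = train_bigrams(corpus)
print(model["the"])    # {'light': 0.666..., 'radio': 0.333...}: 'light' is the
                       # more plausible continuation after 'the' in this toy domain
```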
1.5.2 Phonemic Restoration and Reliability
Most signals of interest, including human speech, are highly redundant. This redundancy
provides for correct recognition or classification even in the event that the signal is partially