Open Access Research Article

Detection and Separation of Speech Event Using Audio and Video Information Fusion and Its Application to Robust Speech Interface

Futoshi Asano1*, Kiyoshi Yamamoto2, Isao Hara1, Jun Ogata1, Takashi Yoshimura1, Yoichi Motomura1, Naoyuki Ichimura1 and Hideki Asoh1

Author Affiliations

1 Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology, Tsukuba 305-8568, Japan

2 Department of Computer Science, Tsukuba University, Tsukuba 305-8573, Japan

EURASIP Journal on Advances in Signal Processing 2004, 2004:324028  doi:10.1155/S1110865704402303


The electronic version of this article is the complete one and can be found online at: http://asp.eurasipjournals.com/content/2004/11/324028


Received: 11 November 2003
Revisions received: 3 February 2004
Published: 18 September 2004

© 2004 Asano et al.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A method of detecting speech events in a multiple-sound-source condition using audio and video information is proposed. To detect speech events, sound localization using a microphone array and human tracking by stereo vision are combined by a Bayesian network. The inference results of the Bayesian network provide the time and location of each speech event. This information is then used in a robust speech interface. A maximum likelihood adaptive beamformer is employed as a preprocessor of the speech recognizer to separate the speech signal from environmental noise. The coefficients of the beamformer are updated continually based on the detected speech events. The speech recognizer also uses the speech event information to extract the speech segment.

Keywords:
information fusion; sound localization; human tracking; adaptive beamformer; speech recognition
