Keywords: lipreading, audio-visual speech recognition, audio-visual data corpus, multimodal data corpus, high speed camera, dual view recording, frontal view, side view, active appearance models, optical flow and lipreading, emotional speech.
Abstract: Our software demo package consists of an implementation of an automatic human emotion recognition system. The system is bi-modal: it fuses data on facial expressions with emotion cues extracted from the speech signal. We have integrated the Viola-Jones face detector (OpenCV), an Active Appearance Model (AAM-API) for extracting the face shape, and Support Vector Machines (LibSVM) for the classification of emotion patterns. We use an optical flow algorithm to compute the features needed for the classification of facial expressions. Besides integrating all processing components, the software system accommodates our implementation of the data fusion algorithm. Our C++ implementation runs at roughly 5 fps.
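The abstract names the concrete libraries but not the wiring. As a minimal sketch of how the detection stage of such a pipeline can be set up, the following C++ fragment runs OpenCV's Viola-Jones CascadeClassifier on a camera stream; the cascade file name and camera index are assumptions for illustration, not the demo's actual configuration:

```cpp
// Sketch: Viola-Jones face detection with OpenCV (assumed setup, not the
// authors' exact code). Any frontal-face Haar cascade file will do.
#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/videoio.hpp>
#include <opencv2/highgui.hpp>
#include <vector>

int main() {
    cv::CascadeClassifier face_cascade;
    if (!face_cascade.load("haarcascade_frontalface_default.xml"))
        return 1;                      // cascade file not found

    cv::VideoCapture cap(0);           // hypothetical default camera
    cv::Mat frame, gray;
    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        cv::equalizeHist(gray, gray);  // improve contrast for the detector

        std::vector<cv::Rect> faces;
        face_cascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(80, 80));

        for (const cv::Rect& r : faces)
            cv::rectangle(frame, r, cv::Scalar(0, 255, 0), 2);

        cv::imshow("faces", frame);
        if (cv::waitKey(1) == 27) break;  // Esc to quit
    }
    return 0;
}
```

The detected face region would then be passed to the AAM fit and the optical flow stage described in the abstract.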
Abstract: This article describes a method to develop an advanced navigation capability for the standard platform of the IMAV indoor competition: the Parrot AR.Drone. Our development is partly based on simulation, which requires both a realistic sensor model and a realistic motion model. This article describes how a visual map of the indoor environment can be made, including the effect of sensor noise. In addition, validation results for the motion model are presented. On this basis, it should be possible to learn elevation maps and optimal paths on this visual map, and to autonomously avoid obstacles based on optical flow.
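To illustrate the kind of optical flow cue the abstract alludes to for obstacle avoidance, here is a sketch under our own assumptions (not the authors' code): dense Farneback flow computed with OpenCV, with the mean flow magnitude used as a crude proximity signal, since nearby obstacles produce large flow during forward flight. The video file name is hypothetical:

```cpp
// Sketch: mean dense-flow magnitude as an obstacle-proximity cue.
#include <opencv2/imgproc.hpp>
#include <opencv2/video.hpp>     // cv::calcOpticalFlowFarneback
#include <opencv2/videoio.hpp>
#include <cstdio>

int main() {
    cv::VideoCapture cap("flight.mp4");   // hypothetical onboard video
    cv::Mat prev, gray, flow, frame;
    if (!cap.read(prev)) return 1;
    cv::cvtColor(prev, prev, cv::COLOR_BGR2GRAY);

    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        cv::calcOpticalFlowFarneback(prev, gray, flow,
                                     0.5, 3, 15, 3, 5, 1.2, 0);
        // Split the 2-channel flow field and report its mean magnitude;
        // a rising value suggests an approaching obstacle.
        cv::Mat xy[2], mag;
        cv::split(flow, xy);
        cv::magnitude(xy[0], xy[1], mag);
        std::printf("mean flow magnitude: %.3f\n", cv::mean(mag)[0]);
        gray.copyTo(prev);
    }
    return 0;
}
```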
Abstract: Current audio-only speech recognition still lacks the expected robustness when the Signal-to-Noise Ratio (SNR) decreases. The video information is not affected by acoustic noise, which makes it an ideal candidate for data fusion to the benefit of speech recognition. In previous work, the authors have shown that most of the techniques used for the extraction of static visual features result in equivalent features, or at least that the most informative features exhibit this property. We argue that one of the main problems of existing methods is that the resulting features contain no information about the motion of the speaker's lips. Therefore, in this paper we analyze the importance of motion detection for speech recognition. To this end, we first present the Lip Geometry Estimation (LGE) method for static feature extraction, which combines an appearance-based approach with a statistics-based approach to extract the shape of the mouth; the method was introduced and explored in detail in our earlier work. Furthermore, we introduce a second method, based on a novel approach, that captures the motion information relevant to speech recognition by performing optical flow analysis on the contour of the speaker's mouth. For completeness, a middle-way approach is also analyzed: this third method recovers motion information by computing the first derivatives of the static visual features. All methods were tested and compared on a continuous speech recognizer for Dutch, under different noise conditions. We show that audio-video recognition based on the true motion features, namely those obtained by optical flow analysis, outperforms the other settings in low-SNR conditions.
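A minimal sketch of the optical flow step on the mouth contour, assuming the contour landmarks from the shape model are already available as a point set; the function name mouthMotion is ours, and pyramidal Lucas-Kanade stands in for whatever flow variant the paper actually used:

```cpp
// Sketch: per-landmark motion features from sparse optical flow.
#include <opencv2/imgproc.hpp>
#include <opencv2/video.hpp>   // cv::calcOpticalFlowPyrLK
#include <vector>

// prevGray/currGray: consecutive grayscale frames; mouthPts: contour points
// placed by the shape model on the previous frame (assumed to be given).
std::vector<cv::Point2f> mouthMotion(const cv::Mat& prevGray,
                                     const cv::Mat& currGray,
                                     const std::vector<cv::Point2f>& mouthPts) {
    std::vector<cv::Point2f> next;
    std::vector<unsigned char> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(prevGray, currGray, mouthPts, next, status, err);

    // Displacement of each tracked landmark; points the tracker lost
    // contribute zero motion.
    std::vector<cv::Point2f> motion;
    for (size_t i = 0; i < mouthPts.size(); ++i)
        motion.push_back(status[i] ? next[i] - mouthPts[i]
                                   : cv::Point2f(0.f, 0.f));
    return motion;
}
```

The resulting displacement vectors would form the motion feature stream fed to the recognizer, alongside or instead of the static LGE features.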
Abstract: That adding visual information improves the recognition of speech is an established fact. However, since video is sampled at a much lower rate than speech, and since the increase in performance, while promising, is still not at the expected level, the question arises whether it is useful to use high speed recordings. In this paper we propose an analysis of the gain in information from the point of view of lipreading accuracy, based on the Root Mean Square Deviation (RMSD) measure. We analyze both real data, recorded using high speed technology, and synthetic data obtained using the well-known software tools CUAnimate and CSLU Toolkit. The analysis is performed on two different and widely used types of features: mouth width and height, and optical flow. Taking the rate of speech into account, our preliminary findings show two different situations. At a low speech rate, typical of spelling, uttering connected digits (e.g. telephone numbers, account numbers) or uttering separate words, the information gained by using high speed recordings is not significant enough to justify the tremendous amount of resources needed to work with them. In the case of continuous speech, however, when the speech rate tends to increase, a recording rate in the range of 24 to 30 frames per second is definitely insufficient; interpolation is then needed, which decreases recognition performance.
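To make the RMSD-based comparison concrete, here is a small self-contained sketch under our own assumptions about the setup: a high-rate feature track (e.g. mouth height per frame) is downsampled to simulate a low frame-rate recording, linearly interpolated back to the original rate, and the deviation from the true track is measured with RMSD:

```cpp
// Sketch: RMSD between a high-speed feature track and its interpolated
// low-rate version. Signal and rates are hypothetical.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Root Mean Square Deviation between two equally long feature tracks.
double rmsd(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return std::sqrt(s / a.size());
}

// Keep every k-th sample, then linearly interpolate back to full rate,
// mimicking a low frame-rate recording of the same utterance.
std::vector<double> downUp(const std::vector<double>& x, size_t k) {
    std::vector<double> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        size_t lo = (i / k) * k;
        size_t hi = std::min(lo + k, x.size() - 1);
        double t = (hi == lo) ? 0.0 : double(i - lo) / double(hi - lo);
        y[i] = (1.0 - t) * x[lo] + t * x[hi];
    }
    return y;
}

int main() {
    // Hypothetical 100 fps mouth-height track, compared against a
    // simulated 25 fps recording (keep every 4th frame).
    std::vector<double> track;
    for (int i = 0; i < 200; ++i)
        track.push_back(10.0 + 5.0 * std::sin(0.3 * i));
    std::printf("RMSD at simulated 25 fps: %.4f\n",
                rmsd(track, downUp(track, 4)));
    return 0;
}
```

At low speech rates the feature track varies slowly, so the interpolated version stays close to the original and the RMSD is small; at high speech rates the track changes between kept samples, and the RMSD grows, which is the effect the abstract reports.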