Abstract: Current audio-only speech recognition still lacks the expected robustness when the Signal-to-Noise Ratio (SNR) decreases. Video information is not affected by acoustic noise, which makes it an ideal candidate for data fusion to the benefit of speech recognition. In the paper , the authors have shown that most of the techniques used for extracting static visual features result in equivalent features, or at least that the most informative features exhibit this property. We argue that one of the main problems of existing methods is that the resulting features contain no information about the motion of the speaker's lips. Therefore, in this paper we analyze the importance of motion detection for speech recognition. To this end, we first present the Lip Geometry Estimation (LGE) method for static feature extraction. This method combines an appearance-based approach with a statistics-based approach for extracting the shape of the mouth. The method was introduced in  and explored in detail in . Furthermore, we introduce a second method, based on a novel approach, that captures the motion information relevant to speech recognition by performing optical flow analysis on the contour of the speaker's mouth. For completeness, a middle-way approach is also analyzed: this third method recovers the motion information by computing the first derivatives of the static visual features. All methods were tested and compared on a continuous speech recognizer for Dutch, with the evaluation carried out under different noise conditions. We show that audio-video recognition based on the true motion features, namely those obtained by optical flow analysis, outperforms the other settings in low-SNR conditions.
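As a minimal illustration (ours, not the authors' implementation) of the difference between the derivative-based motion features and the optical-flow-based ones, the sketch below uses NumPy and OpenCV; the function names and the mouth_mask input are assumptions, and in practice the mask would come from the lip localization step.

```python
import cv2
import numpy as np

def delta_features(static_feats, dt=1.0):
    """First temporal derivative of static visual features.

    static_feats: array of shape (T, D), one feature vector per video frame.
    Returns an array of shape (T, D) with the frame-to-frame derivative
    (central differences inside the track, one-sided at the boundaries).
    """
    return np.gradient(static_feats, dt, axis=0)

def mouth_flow(prev_gray, curr_gray, mouth_mask):
    """Dense optical flow restricted to the mouth region (assumed mask).

    prev_gray, curr_gray: consecutive grayscale frames of shape (H, W).
    mouth_mask: boolean array (H, W) marking the mouth / lip-contour area.
    Returns the mean horizontal and vertical flow inside the mask as a
    simple two-dimensional motion feature.
    """
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    fx = flow[..., 0][mouth_mask]
    fy = flow[..., 1][mouth_mask]
    return np.array([fx.mean(), fy.mean()])
```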
Abstract: The paper describes a system that provides a Web interface for fully automatic audio-video human emotion recognition. The analysis focuses on the set of six basic emotions plus the neutral state. Different classifiers are involved in the process of face detection (AdaBoost), facial expression recognition (SVM and other models) and emotion recognition from speech (GentleBoost). The Active Appearance Model (AAM) is used to obtain information about the shapes of the faces to be analyzed. The facial expression recognition is frame-based; no temporal patterns of emotions are modeled. Emotion recognition from movies is performed separately on the sound and on the video frames, and the algorithm does not handle the dependencies between audio and video during the analysis. The data-processing methodologies are explained, and specific performance measures for the emotion recognition are presented.
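The abstract gives no implementation details; a loose sketch of such a frame-based pipeline, using OpenCV's AdaBoost-trained Haar cascade for face detection and a generic SVM in place of the paper's expression models, could look as follows. The names classify_frame and feature_fn are hypothetical, the feature extractor stands in for the AAM step, and the classifier is assumed to have been trained beforehand.

```python
import cv2
from sklearn.svm import SVC

# AdaBoost-based Haar cascade for face detection (ships with OpenCV).
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

# Placeholder per-frame expression classifier over shape/appearance features;
# in the described system the features come from an Active Appearance Model.
expression_clf = SVC(kernel="rbf", probability=True)  # assumed already fitted

def classify_frame(frame_bgr, feature_fn):
    """Detect the largest face in one frame and classify its expression.

    feature_fn is assumed to map a face crop to a fixed-length feature
    vector (e.g. AAM shape parameters); it is not implemented here.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face found -> no label for this frame
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    features = feature_fn(gray[y:y + h, x:x + w])
    return expression_clf.predict([features])[0]
```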
Abstract: Lipreading receives increasing attention from the scientific community. However, many aspects related to lipreading are still unknown or poorly understood. In the current paper we present the entire data engineering process used for building a lipreading system. Firstly, we provide detailed information on compiling an advanced multimodal data corpus for audio-visual speech recognition, lipreading and related domains. This corpus contains synchronized dual views acquired with a high-speed camera. We paid careful attention to the language content of the corpus and to the affective state of the speaker. Secondly, we introduce several methods for extracting features from both views and detail the problem of combining the information from the two views. While the frontal-view processing largely follows the state of the art, we also bring valuable new information and analysis for the profile view.
Abstract: That adding visual information improves the recognition of speech is an established fact. However, since video is sampled at a much lower rate than speech, and since the increase in performance, while promising, is still not at the expected level, the question arises of whether high-speed recordings are useful. In this paper we propose an analysis of the gain in information from the point of view of lipreading accuracy, based on the Root Mean Square Deviation (RMSD) measure. We analyze both real data, recorded using high-speed technology, and synthetic data obtained using the well-known software tools CUAnimate and the CSLU Toolkit. The analysis is performed on two widely used types of features: the mouth width and height, and optical flow. Taking the rate of speech into account, our preliminary findings show two different situations. In the case of a low speech rate, typical of spelling, connected digits (e.g. telephone numbers, account numbers) or isolated words, the information gained from high-speed recordings is not significant enough to justify the tremendous amount of resources they require. However, in the case of continuous speech, when the speech rate tends to increase, a recording rate in the range of 24 to 30 frames per second is definitely insufficient; interpolation is then needed, which decreases the recognition performance.
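As an illustration of the kind of comparison described (a sketch under our own assumptions, not the paper's code), the RMSD between a high-speed feature track and a low-frame-rate track upsampled by linear interpolation can be computed as follows; rmsd and upsample are hypothetical helper names.

```python
import numpy as np

def rmsd(reference, estimate):
    """Root Mean Square Deviation between two equally sampled feature tracks."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    return np.sqrt(np.mean((reference - estimate) ** 2))

def upsample(track, src_fps, dst_fps):
    """Linearly interpolate a 1-D feature track (e.g. mouth height per frame)
    from src_fps to dst_fps so it can be compared against a high-speed track."""
    track = np.asarray(track, dtype=float)
    t_src = np.arange(len(track)) / src_fps
    t_dst = np.arange(0.0, t_src[-1] + 1e-9, 1.0 / dst_fps)
    return np.interp(t_dst, t_src, track)

# Example: compare a 25 fps mouth-height track against a 100 fps reference.
# up = upsample(low_fps_track, 25, 100)
# error = rmsd(high_fps_track[:len(up)], up)
```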
Abstract: The recognition of the internal emotional state of a person plays an important role in several human-related fields. Among them, human-computer interaction has recently received special attention. The current research is aimed at the analysis of segmentation methods and of the performance of the GentleBoost classifier on emotion recognition from speech. The data set used for emotion analysis is Berlin, a database of German emotional speech. A second data set, DES (Danish Emotional Speech), is used for comparison purposes. Our contribution to the research community consists of a novel, extensive study on the efficiency of using distinct numbers of frames per speech utterance for emotion recognition. Finally, a set of GentleBoost 'committees' with optimal classification rates is determined, based on an exhaustive study of the generated classifiers and of the different types of segmentation.
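To make concrete what using distinct numbers of frames per utterance can mean (a sketch under our own assumptions, not the paper's exact segmentation), each utterance can be split into a fixed number of equal segments, with one feature vector per segment concatenated into the input of a boosting classifier; segment_utterance, utterance_features and frame_feature_fn are hypothetical names.

```python
import numpy as np

def segment_utterance(samples, n_frames):
    """Split a speech utterance into n_frames roughly equal-length segments.

    samples: 1-D array of audio samples for one utterance.
    Returns a list of n_frames arrays (the remainder is spread over them).
    """
    return np.array_split(np.asarray(samples, dtype=float), n_frames)

def utterance_features(samples, n_frames, frame_feature_fn):
    """One fixed-length feature vector per utterance.

    frame_feature_fn is a hypothetical per-segment extractor (e.g. energy,
    pitch or MFCC statistics); the concatenated vector is what a boosting
    classifier such as GentleBoost would be trained on.
    """
    segments = segment_utterance(samples, n_frames)
    return np.concatenate([frame_feature_fn(seg) for seg in segments])

# Example with a trivial per-segment feature (log energy), assuming `audio`
# holds the samples of one utterance:
# feats = utterance_features(
#     audio, n_frames=10,
#     frame_feature_fn=lambda s: np.array([np.log(np.sum(s ** 2) + 1e-12)]))
```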