lipreading, audio-visual speech recognition, audio-visual data corpus, multimodal data corpus, high speed camera, dual view recording, frontal view, side view, active appearance models, optical flow and lipreading, emotional speech.
Abstract: In recent years, we have developed a framework of human-computer interaction that offers recognition of various communication modalities including speech, lip movement, facial expression, handwriting and drawing, body gesture, text and visual symbols. The framework allows the rapid construction of a multimodal, multi-device, and multi-user communication system for crisis management. This paper reports on the multimodal information presentation module, which combines language, speech, visual language and graphics and which can be used in isolation but also as part of the framework. It provides a communication channel between the system and users with different communication devices. The module is able to specify and produce context-sensitive and user-tailored output. By employing an ontology, it receives the system's view of the world and dialogue actions from a dialogue manager and generates appropriate multimodal responses.
Abstract: Our software demo package consists of an implementation of an automatic human emotion recognition system. The system is bi-modal and is based on fusing data on facial expressions with emotion cues extracted from the speech signal. We have integrated the Viola-Jones face detector (OpenCV), the Active Appearance Model (AAM-API) for extracting the face shape, and Support Vector Machines (LibSVM) for the classification of emotion patterns. We have used an optical flow algorithm for computing the features needed for the classification of facial expressions. Besides the integration of all processing components, the software system accommodates our implementation of the data fusion algorithm. Our C++ implementation has a working frame rate of about 5 fps.
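The abstract above outlines a concrete processing chain. A minimal sketch of the first two stages (face detection and optical flow in the face region) using the modern OpenCV C++ API is given below; the original work used the OpenCV 1.x interface together with AAM-API and LibSVM, so this is only an illustration, and the cascade file name and camera index are placeholders.

    // Sketch: Viola-Jones face detection followed by dense optical flow inside
    // the detected face region. Cascade path and camera index are placeholders.
    #include <opencv2/objdetect.hpp>
    #include <opencv2/video.hpp>
    #include <opencv2/videoio.hpp>
    #include <opencv2/imgproc.hpp>
    #include <iostream>
    #include <vector>

    int main() {
        cv::CascadeClassifier face("haarcascade_frontalface_default.xml");
        cv::VideoCapture cap(0);
        if (face.empty() || !cap.isOpened()) return 1;
        cv::Mat frame, gray, prevGray;
        while (cap.read(frame)) {
            cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
            std::vector<cv::Rect> faces;
            face.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(80, 80));
            if (!faces.empty() && !prevGray.empty()) {
                cv::Mat flow;
                cv::calcOpticalFlowFarneback(prevGray(faces[0]), gray(faces[0]),
                                             flow, 0.5, 3, 15, 3, 5, 1.2, 0);
                cv::Scalar meanFlow = cv::mean(flow);  // crude per-frame motion descriptor
                std::cout << "mean flow: " << meanFlow[0] << ", " << meanFlow[1] << "\n";
            }
            prevGray = gray.clone();
        }
        return 0;
    }

In the described system, descriptors of this kind would then be passed to the SVM classifier and to the data fusion stage.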
Abstract: Data corpora are an important part of any audio-visual research. However, the time and effort needed to build a good dataset are very large. Therefore, we argue that researchers should follow some general guidelines when building a corpus, which guarantees that the resulting datasets have common properties. This makes it possible to compare the results of different approaches from different research groups even without sharing the same data corpus. In this paper we formulate the set of guidelines that should always be taken into account when developing an audio-visual data corpus for bi-modal speech recognition. During this process we compare samples from different existing datasets and give solutions to the drawbacks these datasets suffer from. In the end we give a complete list of the properties of some of the best-known data corpora.
Abstract: Current audio-only speech recognition still lacks the expected robustness when the Signal to Noise Ratio (SNR) decreases. The video information is not affected by acoustic noise, which makes it an ideal candidate for data fusion for the benefit of speech recognition. In the paper  the authors have shown that most of the techniques used for the extraction of static visual features result in equivalent features, or at least that the most informative features exhibit this property. We argue that one of the main problems of the existing methods is that the resulting features contain no information about the motion of the speaker's lips. Therefore, in this paper we analyze the importance of motion detection for speech recognition. For this we first present the Lip Geometry Estimation (LGE) method for static feature extraction. This method combines an appearance-based approach with a statistical approach for extracting the shape of the mouth. The method was introduced in  and explored in detail in . Furthermore, we introduce a second method based on a novel approach that captures the motion information relevant to speech recognition by performing optical flow analysis on the contour of the speaker's mouth. For completeness, a middle-way approach is also analyzed. This third method recovers the motion information by computing the first derivatives of the static visual features. All methods were tested and compared on a continuous speech recognizer for Dutch and evaluated under different noise conditions. We show that audio-video recognition based on the true motion features, namely those obtained by performing optical flow analysis, outperforms the other settings in low SNR conditions.
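As an illustration of the "middle-way" approach mentioned above, the sketch below computes first temporal derivatives (deltas) of per-frame static visual features; the feature layout (e.g. mouth width and height) is an assumption and not the paper's exact feature set.

    // Sketch: motion features as frame-to-frame differences (deltas) of static features.
    #include <vector>

    // One static feature vector per video frame, e.g. {mouthWidth, mouthHeight, ...}.
    std::vector<std::vector<double>> deltaFeatures(
            const std::vector<std::vector<double>>& statics) {
        std::vector<std::vector<double>> deltas;
        for (size_t t = 1; t < statics.size(); ++t) {
            std::vector<double> d(statics[t].size());
            for (size_t i = 0; i < d.size(); ++i)
                d[i] = statics[t][i] - statics[t - 1][i];  // change since previous frame
            deltas.push_back(d);
        }
        return deltas;
    }

Unlike optical flow computed on the mouth contour, these deltas only approximate the true motion, which is consistent with the paper's finding that the optical-flow-based features perform best in low SNR conditions.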
Abstract: In recent years, we have developed a framework of human-computer interaction that offers recognition of various communication modalities including speech, lip movement, facial expression, handwriting and drawing, body gesture, text and visual symbols. The framework allows the rapid construction of a multimodal, multi-device, and multi-user communication system for crisis management. This paper reports on the approaches used in the multi-user information integration and multimodal presentation modules, which can be used in isolation but also as part of the framework. The latter is able to specify and produce context-sensitive and user-tailored output combining language, speech, visual language and graphics. These modules provide a communication channel between the system and users with different communication devices. By employing an ontology, the system's view of the world is constructed from multi-user observations and appropriate multimodal responses are generated.
Abstract: Multimodal speech recognition is receiving increasing attention from the scientific community. Merging information arriving over different communication channels, while taking the context into account, seems the right thing to do. However, many aspects related to lipreading and to what influences speech are still unknown or poorly understood. In the current paper we present detailed information on compiling an advanced multimodal data corpus for audio-visual speech recognition, lipreading and related domains. This data corpus contains synchronized dual-view recordings acquired using a high speed camera. We paid careful attention to the language content of the corpus and to the speaking style used. For the recordings we implemented prompter-like software that controlled the recording devices and instructed the speaker, in order to obtain uniform recordings.
Abstract: The system described in this paper provides a Web interface for fully automatic audio-video human emotion recognition. The analysis is focused on the set of six basic emotions plus the neutral type. Different classifiers are involved in the process of face detection (AdaBoost), facial expression recognition (SVM and other models) and emotion recognition from speech (GentleBoost). The Active Appearance Model (AAM) is used to obtain the information related to the shapes of the faces to be analyzed. The facial expression recognition is frame based and no temporal patterns of emotion are handled. The emotion recognition from movies is done separately on sound and video frames; the algorithm does not handle dependencies between audio and video during the analysis. The methodologies for data processing are explained and specific performance measures for the emotion recognition are presented.
Abstract: Lipreading is receiving increasing attention from the scientific community. However, many aspects related to lipreading are still unknown or poorly understood. In the current paper we present the entire process used for engineering the data for building a lipreading system. Firstly, we provide detailed information on compiling an advanced multimodal data corpus for audio-visual speech recognition, lipreading and related domains. This data corpus contains synchronized dual-view recordings acquired using a high speed camera. We paid careful attention to the language content of the corpus and to the affective state of the speaker. Secondly, we introduce several methods for extracting features from both views and detail the problem of combining the information from the two views. While the processing of the frontal view is closer to the state of the art, we also bring valuable new information and analysis for the profile view.
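A minimal sketch of one possible way to combine the two views is given below: per synchronized frame, the frontal and profile feature vectors are concatenated into a single observation. This is only an illustrative feature-level fusion under the assumption that the streams are frame-synchronized; it is not necessarily the combination strategy adopted in the paper.

    // Sketch: feature-level fusion of synchronized frontal and profile view features.
    #include <vector>

    std::vector<double> fuseViews(const std::vector<double>& frontal,
                                  const std::vector<double>& profile) {
        std::vector<double> fused(frontal);
        fused.insert(fused.end(), profile.begin(), profile.end());  // concatenate per frame
        return fused;
    }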
Abstract: The paper describes a novel technique for the recognition of emotions from multimodal data. We focus on the recognition of the six prototypic emotions. The results from facial expression recognition and from emotion recognition from speech are combined using a bi-modal semantic data fusion model that determines the most probable emotion of the subject. Two types of models based on geometric face features are used for facial expression recognition, depending on the presence or absence of speech. In our approach we define an algorithm that is robust to the changes of face shape that occur during regular speech. The influence of phoneme generation on the face shape during speech is removed by using features that are related only to the eyes and the eyebrows. The paper includes results from testing the presented models.
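The sketch below illustrates one simple way such a bi-modal decision could be made: the per-class scores of the face and speech classifiers are combined by a weighted sum and the highest-scoring emotion is selected. The weight, the label ordering and the score format are assumptions for illustration, not the paper's exact semantic fusion model.

    // Sketch: weighted-sum fusion of per-class emotion scores from two modalities.
    #include <array>
    #include <string>

    static const std::array<std::string, 6> kEmotions = {
        "anger", "disgust", "fear", "happiness", "sadness", "surprise"};

    std::string fuseEmotion(const std::array<double, 6>& faceScore,
                            const std::array<double, 6>& speechScore,
                            double alpha = 0.5) {  // alpha weights the face modality
        int best = 0;
        double bestScore = -1.0;
        for (int i = 0; i < 6; ++i) {
            double score = alpha * faceScore[i] + (1.0 - alpha) * speechScore[i];
            if (score > bestScore) { bestScore = score; best = i; }
        }
        return kEmotions[best];
    }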
Abstract: That adding visual information improves the recognition of speech is an established fact. However, since video is sampled at a much lower rate than speech, and since the increase in performance, while promising, is still not at the expected level, the question arises whether it is useful to use high speed recordings. In this paper we propose an analysis of the gain in information from the point of view of lipreading accuracy, based on the Root Mean Square Deviation (RMSD) measure. We analyze both real data, recorded using high speed technology, and synthetic data obtained using the well-known software tools CUAnimate and CSLU Toolkit. The analysis is performed on two different and widely used types of features: the mouth width and height, and optical flow. Taking the rate of speech into account, our preliminary findings show two different situations. In the case of a low speech rate, typical for spelling, uttering connected digits (e.g. telephone numbers, account numbers) or uttering separate words, the information gained by using high speed recordings is not significant enough to justify the tremendous amount of resources needed when working with high speed recordings. However, in the case of continuous speech, when the speech rate tends to increase, a recording rate in the range of 24 to 30 frames per second is definitely insufficient; interpolation is then needed, which decreases the recognition performance.
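For illustration, the sketch below shows the kind of comparison the RMSD measure supports: a feature trace sampled at a high frame rate is downsampled by an integer factor, rebuilt by linear interpolation, and compared against the original. The downsampling factor and the feature itself are placeholders, not the paper's exact experimental setup.

    // Sketch: RMSD between a high-speed feature trace and its interpolated low-rate version.
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    double rmsd(const std::vector<double>& a, const std::vector<double>& b) {
        double sum = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return std::sqrt(sum / a.size());
    }

    // Keep every factor-th sample, then rebuild at the original rate by linear interpolation.
    std::vector<double> downUp(const std::vector<double>& highRate, std::size_t factor) {
        std::vector<double> rebuilt(highRate.size());
        for (std::size_t i = 0; i < highRate.size(); ++i) {
            std::size_t lo = (i / factor) * factor;
            std::size_t hi = std::min(lo + factor, highRate.size() - 1);
            double t = (hi == lo) ? 0.0 : double(i - lo) / double(hi - lo);
            rebuilt[i] = (1.0 - t) * highRate[lo] + t * highRate[hi];
        }
        return rebuilt;
    }

A small RMSD between the original trace and downUp(trace, factor) would indicate that little information is gained by the higher frame rate for that feature.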
Abstract: The recognition of the internal emotional state of a person plays an important role in several human-related fields. Among them, human-computer interaction has recently received special attention. The current research is aimed at the analysis of segmentation methods and of the performance of the GentleBoost classifier on emotion recognition from speech. The data set used for the emotion analysis is Berlin, a database of German emotional speech. A second data set, DES (Danish Emotional Speech), is used for comparison purposes. Our contribution to the research community consists of a novel, extensive study on the efficiency of using distinct numbers of frames per speech utterance for emotion recognition. Finally, a set of GentleBoost 'committees' with optimal classification rates is determined, based on an exhaustive study of the generated classifiers and of the different types of segmentation.
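The sketch below illustrates the kind of segmentation studied: an utterance is split into a fixed number of equal-length segments and one value is computed per segment. The mean-energy feature is only a placeholder for the actual acoustic features fed to the GentleBoost classifiers.

    // Sketch: split an utterance into numFrames equal segments, one feature per segment.
    #include <cstddef>
    #include <vector>

    std::vector<double> segmentEnergies(const std::vector<double>& samples,
                                        std::size_t numFrames) {
        std::vector<double> energies(numFrames, 0.0);
        std::size_t frameLen = samples.size() / numFrames;
        if (frameLen == 0) return energies;  // utterance shorter than requested segmentation
        for (std::size_t f = 0; f < numFrames; ++f) {
            for (std::size_t i = f * frameLen; i < (f + 1) * frameLen; ++i)
                energies[f] += samples[i] * samples[i];
            energies[f] /= frameLen;  // mean energy of the segment
        }
        return energies;
    }

Varying numFrames corresponds to the different numbers of frames per utterance whose effect on recognition rates is studied in the paper.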
Abstract: In the past, a crisis event was reported by local witnesses who made phone calls to the emergency services. They reported by speech, based on their observations at the crisis site. Recent improvements in the area of human-computer interfaces make possible the development of context-aware systems for crisis management that support people in escaping a crisis even before external help is available on site. Apart from collecting people's reports on the crisis, these systems are assumed to automatically extract useful clues during typical human-computer interaction sessions. The novelty of the current research resides in the attempt to involve computer vision techniques in performing an automatic evaluation of facial expressions during human-computer interaction sessions with a crisis management system. The current paper details an approach for an automatic facial expression recognition module that may be included in crisis-oriented applications. The algorithm uses an Active Appearance Model for facial shape extraction and an SVM classifier for Action Unit detection and facial expression recognition.
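For illustration only, the sketch below maps a set of detected Action Units to a prototypic expression using common FACS-based prototype combinations; the paper itself relies on SVM classifiers for this step, so the rule table here is an assumption rather than the described method.

    // Sketch: rule-based mapping from detected Action Units to a prototypic expression
    // (loosely based on common FACS prototypes; illustrative only).
    #include <set>
    #include <string>

    std::string expressionFromAUs(const std::set<int>& aus) {
        auto has = [&aus](int au) { return aus.count(au) > 0; };
        if (has(6) && has(12)) return "happiness";
        if (has(1) && has(2) && has(5) && has(26)) return "surprise";
        if (has(1) && has(4) && has(15)) return "sadness";
        if (has(4) && has(5) && has(7) && has(23)) return "anger";
        if (has(9) && has(15)) return "disgust";
        if (has(1) && has(2) && has(4) && has(20)) return "fear";
        return "neutral";
    }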