Abstract: In recent years, we have developed a framework of human-computer interaction that offers recognition of various communication modalities including speech, lip movement, facial expression, handwriting and drawing, body gesture, text and visual symbols. The framework allows the rapid construction of a multimodal, multi-device, and multi-user communication system for crisis management. This paper reports on the multimodal information presentation module, which combines language, speech, visual language and graphics, and which can be used in isolation but also as part of the framework. It provides a communication channel between the system and users with different communication devices. The module is able to specify and produce context-sensitive and user-tailored output. Through the use of an ontology, it receives the system's view of the world and dialogue actions from a dialogue manager and generates appropriate multimodal responses.
Abstract: The current paper addresses aspects related to the development of an automatic probabilistic recognition system for facial expressions in video streams. The face analysis component integrates an eye-tracking mechanism based on a Kalman filter. The visual feature detection includes PCA-oriented recognition for ranking the activity in certain facial areas. The description of the facial expressions is given according to sets of atomic Action Units (AUs) from the Facial Action Coding System (FACS). The expression recognition engine is based on a Bayesian belief network (BBN) model that also handles the temporal behavior of the visual features.
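As an illustration of the eye-tracking component, the following is a minimal sketch of a constant-velocity Kalman filter for 2-D eye positions. The state layout, noise settings and the example measurement are assumptions for illustration, not the parameters of the system described above.

```python
import numpy as np

# Minimal constant-velocity Kalman filter for 2-D eye-position tracking.
# State is [x, y, vx, vy]; measurements are noisy (x, y) eye locations.
dt = 1.0                                    # time step between frames (assumed)
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)  # state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # only position is observed
Q = np.eye(4) * 1e-2                        # process noise (assumed)
R = np.eye(2) * 1.0                         # measurement noise (assumed)

x = np.zeros(4)                             # initial state
P = np.eye(4) * 10.0                        # initial uncertainty

def kalman_step(x, P, z):
    """One predict/update cycle given a measured eye position z = (x, y)."""
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

x, P = kalman_step(x, P, np.array([120.0, 80.0]))   # one measured eye position
```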
Abstract: Our work addresses the problem of autonomous concept formation from a design point of view, providing an initial answer to the question: What are the design features of an architecture supporting the acquisition of different types of concepts by an autonomous agent?
Autonomous agents, that is, systems capable of interacting independently with their environment in the pursuit of their own goals, will provide the framework in which we study the problem of autonomous concept formation. Humans and most animals may in this sense also be regarded as autonomous agents, but our concern will be with artificial autonomous agents. A detailed survey and discussion of the many issues surrounding the notion of ‘artificial agency’ is beyond the scope of this thesis; a good overview can be found in [Wooldridge and Jennings, 1995]. Instead we will focus on how artificial agents could be endowed with representational and modelling capabilities.
The ability to form concepts is an important and recognised cognitive ability, thought to play an essential role in related abilities such as categorisation, language understanding, object identification and recognition, and reasoning, all of which can be seen as different aspects of intelligence. Concepts and categories are studied within cognitive science, where scientists are concerned with human conceptual abilities and mental representations of categories, but they have also been addressed in the rather different domain of machine learning and classificatory data analysis, where the focus is on the development of algorithms for clustering and induction problems [Mechelen et al., 1993]. The two fields are quite distinct and have only recently started to interact. Even though the importance of concepts has been recognised, the nature of concepts is controversial, in the sense that there is no commonly agreed theory of concepts, and it is still far from obvious which representational means are most suited to capture the many cognitive functions that concepts are involved in.
Among the goals of this thesis is the attempt to bring together different lines of argumentation that have emerged within philosophy, cognitive science and AI, in order to establish a solid foundation for further research into the representation and acquisition of concepts by autonomous agents. Thus, our results and conclusions will often be stated in terms of new insights and ideas, rather than resulting in new algorithms or formal methods.
Our focus will be on affordance concepts — discussed in detail in Chapter 4 — and our main contributions will be:
* An argument showing that concepts should be thought of as belonging to different kinds, where the differences among these kinds are to be captured in terms of the architectural features supporting their acquisition.
* A description (and partial implementation) of a minimal architecture (the Innate Adaptive Behaviour architecture – IAB architecture for short) supporting the acquisition of affordance concepts; the IAB architecture is actually a proposal for a sustaining mechanism, in the sense of [Margolis, 1999], for affordances, and makes clear the necessity of a minimal structure for the representation of affordances.
When addressing concept formation in AI, what can be called the ‘system level’ is often overlooked, which means that concepts and categories are rarely studied from the point of view of a system, autonomous and complete, that might need such constructs and can acquire them only by means of interactions with its environment, under the constraints of its cognitive architecture. Also within psychology, the focus is usually on structural aspects of concepts rather than on developmental issues [Smith and Medin, 1981]. Our approach – an architecture-based approach – is an attempt (i) to show that a system-level perspective on concept formation is indeed possible and worth exploring, and (ii) to provide an initial, perhaps simple, but concrete example of the insights that can be gained from such an approach. Since the methodology that we propose for studying concept formation is a general one, and can be applied also to other types of concepts, we decided to refer broadly to ‘autonomous concept formation’ rather than ‘autonomous affordance-concept formation’ in the title of the thesis.
Abstract: Multimodal emotion recognition is receiving increasing attention from the scientific community. Fusing information coming from different channels of communication, while taking the context into account, seems the right approach. During social interaction the affective load of the interlocutors plays a major role. In the current paper we present a detailed analysis of the process of building an advanced multimodal data corpus for affective state recognition and related domains. This data corpus contains synchronized dual views acquired using a high-speed camera and high-quality audio devices. We paid careful attention to the emotional content of the corpus in all aspects, such as language content and facial expressions. For the recordings we implemented TV-prompter-like software that controlled the recording devices and instructed the actors, ensuring the uniformity of the recordings. In this way we obtained a high-quality, controlled emotional data corpus.
Abstract: We present a discriminative approach to human action recognition. At the heart of our approach is the use of common spatial patterns (CSP), a spatial filter technique that transforms temporal feature data by using differences in variance between two classes. Such a transformation focuses on differences between classes, rather than on modeling each class individually. As a result, to distinguish between two classes, we can use simple distance metrics in the low-dimensional transformed space. The most likely class is found by pairwise evaluation of all discriminant functions, which can be done in real-time. Our image representations are silhouette boundary gradients, spatially binned into cells. We achieve scores of approximately 96% on the Weizmann human action dataset, and show that reasonable results can be obtained when training on only a single subject. We further compare our results with a recent exemplar-based approach. Future work is aimed at combining our approach with automatic human detection.
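The CSP step can be sketched as follows. This is a generic formulation that treats each feature dimension as a "channel" over time and picks filters from both ends of the generalised eigenvalue spectrum; the covariance averaging, number of components and log-variance features are common choices assumed here, not necessarily those of the paper.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_components=3):
    """
    Common spatial patterns for two classes.
    trials_*: lists of (channels x time) arrays.
    Returns a (2 * n_components x channels) filter matrix whose rows
    maximise the variance of one class relative to the other.
    """
    def mean_cov(trials):
        return np.mean([np.cov(t) for t in trials], axis=0)

    ca, cb = mean_cov(trials_a), mean_cov(trials_b)
    # Generalised eigenvalue problem: ca w = lambda (ca + cb) w
    eigvals, eigvecs = eigh(ca, ca + cb)
    order = np.argsort(eigvals)
    # Filters at both ends of the spectrum discriminate best between the classes.
    picks = np.concatenate([order[:n_components], order[-n_components:]])
    return eigvecs[:, picks].T

def csp_features(trial, W):
    """Log-variance of the CSP-projected trial, a common low-dimensional feature."""
    projected = W @ trial
    var = projected.var(axis=1)
    return np.log(var / var.sum())
```

A simple classifier can then compare the log-variance feature vector of a new sequence against class templates with a Euclidean distance, one discriminant per class pair.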
Abstract: Current audio-only speech recognition still lacks the expected robustness when the signal-to-noise ratio (SNR) decreases. The video information is not affected by acoustic noise, which makes it an ideal candidate for data fusion to the benefit of speech recognition. In the paper , the authors have shown that most of the techniques used for the extraction of static visual features result in equivalent features, or at least that the most informative features exhibit this property. We argue that one of the main problems of existing methods is that the resulting features contain no information about the motion of the speaker's lips. Therefore, in this paper we analyze the importance of motion detection for speech recognition. For this we first present the Lip Geometry Estimation (LGE) method for static feature extraction. This method combines an appearance-based approach with a statistical approach for extracting the shape of the mouth. The method was introduced in  and explored in detail in . Furthermore, we introduce a second method, based on a novel approach, that captures the motion information relevant to speech recognition by performing optical flow analysis on the contour of the speaker's mouth. For completeness, an intermediate approach is also analyzed. This third method recovers the motion information by computing the first derivatives of the static visual features. All methods were tested and compared on a continuous speech recognizer for Dutch. The evaluation of these methods is done under different noise conditions. We show that audio-video recognition based on the true motion features, namely those obtained by optical flow analysis, outperforms the other settings in low-SNR conditions.
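The intermediate setting (first derivatives of the static features) can be sketched as below. The feature dimensionality and the central-difference approximation are assumptions for illustration, not the exact configuration used in the experiments.

```python
import numpy as np

def delta_features(static):
    """
    First temporal derivatives of static visual features (frames x features),
    approximated with a central-difference gradient along the time axis,
    concatenated with the static features themselves.
    """
    deltas = np.gradient(static, axis=0)
    return np.hstack([static, deltas])

# Example: 100 frames of hypothetical LGE-style static lip features (8 per frame).
static = np.random.default_rng(0).random((100, 8))
augmented = delta_features(static)
print(augmented.shape)   # (100, 16): static features plus their derivatives
```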
Abstract: In recent years, we have developed a framework of human-computer interaction that offers recognition of various communication modalities including speech, lip movement, facial expression, handwriting and drawing, body gesture, text and visual symbols. The framework allows the rapid construction of a multimodal, multi-device, and multi-user communication system for crisis management. This paper reports on the approaches used in the multi-user information integration and multimodal presentation modules, which can be used in isolation but also as part of the framework. The latter is able to specify and produce context-sensitive and user-tailored output combining language, speech, visual language and graphics. These modules provide a communication channel between the system and users with different communication devices. Through the use of an ontology, the system's view of the world is constructed from multi-user observations and appropriate multimodal responses are generated.
Abstract: We present a discriminative approach to human action recognition. At the heart of our approach is the use of common spatial patterns (CSP), a spatial filter technique that transforms temporal feature data by using differences in variance between two classes. Such a transformation focuses on differences between classes, rather than on modeling each class individually. As a result, to distinguish between two classes, we can use simple distance metrics in the low-dimensional transformed space. The most likely class is found by pairwise evaluation of all discriminant functions. Our image representations are silhouette boundary gradients, spatially binned into cells. We achieve scores of approximately 96% on a standard action dataset, and show that reasonable results can be obtained when training on only a single subject. Future work is aimed at combining our approach with automatic human detection.
Abstract: Lipreading is receiving increasing attention from the scientific community. However, many aspects related to lipreading are still unknown or poorly understood. In the current paper we present the entire process used for engineering the data for building a lipreading recognizer. Firstly, we provide detailed information on compiling an advanced multimodal data corpus for audio-visual speech recognition, lipreading and related domains. This data corpus contains synchronized dual views acquired using a high-speed camera. We paid careful attention to the language content of the corpus and the affective state of the speaker. Secondly, we introduce several methods for extracting features from both views and detail the problem of combining the information from the two views. While the frontal-view processing is closer to the state of the art, we also bring valuable new information and analysis for the profile view.
Abstract: The study of human facial expressions is one of the most challenging domains in the pattern recognition community. Each facial expression is generated by non-rigid object deformations, and these deformations are person-dependent. Automatic recognition of facial expressions is a process primarily based on the analysis of permanent and transient features of the face, which can only be assessed with some degree of error. The expression recognition model is based on the specification of the Facial Action Coding System (FACS) of Ekman and Friesen [Ekman, Friesen 1978]. The hard constraints on scene processing and recording conditions limit the robustness of the analysis. In order to manage the uncertainty and lack of information, we set up a probabilistic framework. The goal of the project was to design and implement a system for automatic recognition of human facial expressions in video streams. The results of the project are of great importance for a broad area of applications relating to both research and applied topics.
Abstract: The recognition of a person's internal emotional state plays an important role in several human-related fields. Among them, human-computer interaction has recently received special attention. The current research is aimed at the analysis of segmentation methods and of the performance of the GentleBoost classifier on emotion recognition from speech. The data set used for emotion analysis is Berlin, a database of German emotional speech. A second data set, DES (Danish Emotional Speech), is used for comparison purposes. Our contribution to the research community consists of a novel, extensive study on the efficiency of using distinct numbers of frames per speech utterance for emotion recognition. Eventually, a set of GentleBoost 'committees' with optimal classification rates is determined based on an exhaustive study of the generated classifiers and of different types of segmentation.
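For reference, a minimal GentleBoost learner with regression stumps looks roughly as follows. The stump search and the frame-level feature matrix are illustrative assumptions, not the authors' implementation; labels are assumed to be in {-1, +1}.

```python
import numpy as np

def fit_stump(X, y, w):
    """Fit a regression stump (feature, threshold, two outputs) minimising weighted squared error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            mask = X[:, j] > t
            wr, wl = w[mask].sum(), w[~mask].sum()
            a = (w[mask] @ y[mask]) / wr if wr > 0 else 0.0
            b = (w[~mask] @ y[~mask]) / wl if wl > 0 else 0.0
            err = w @ (y - np.where(mask, a, b)) ** 2
            if best is None or err < best[0]:
                best = (err, j, t, a, b)
    return best[1:]

def gentleboost(X, y, n_rounds=50):
    """GentleBoost: add a weighted-least-squares stump each round, reweight with exp(-y * f)."""
    w = np.full(len(y), 1.0 / len(y))
    stumps = []
    for _ in range(n_rounds):
        j, t, a, b = fit_stump(X, y, w)
        f = np.where(X[:, j] > t, a, b)
        w *= np.exp(-y * f)
        w /= w.sum()
        stumps.append((j, t, a, b))
    return stumps

def predict(stumps, X):
    """Sign of the additive model built from the selected stumps."""
    F = np.zeros(len(X))
    for j, t, a, b in stumps:
        F += np.where(X[:, j] > t, a, b)
    return np.sign(F)
```

In a frame-based setup such as the one studied here, each row of X would hold the acoustic features of one frame (or of one fixed-length segment of an utterance), and utterance-level decisions could be obtained by pooling frame-level predictions.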
Abstract: In the past, a crisis event was reported by local witnesses who would make phone calls to the emergency services. They reported verbally, according to their observations at the crisis site. Recent improvements in the area of human-computer interfaces make possible the development of context-aware systems for crisis management that support people in escaping a crisis even before external help is available on site. Apart from collecting people's reports on the crisis, these systems are expected to automatically extract useful clues during typical human-computer interaction sessions. The novelty of the current research resides in the attempt to use computer vision techniques for the automatic evaluation of facial expressions during human-computer interaction sessions with a crisis management system. The current paper details an approach for an automatic facial expression recognition module that may be included in crisis-oriented applications. The algorithm uses an Active Appearance Model for facial shape extraction and an SVM classifier for Action Unit detection and facial expression recognition.
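A minimal sketch of the Action Unit detection step, assuming the AAM shape parameters are already available as fixed-length feature vectors. The data, labels and SVM settings below are placeholders for illustration, not those of the described module.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: one row of AAM shape parameters per frame (hypothetical),
# with binary labels for a single Action Unit (e.g. AU12 present / absent).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))       # placeholder AAM shape vectors
y = rng.integers(0, 2, size=200)     # placeholder AU labels

# One binary SVM per Action Unit; expression recognition can then be built
# on top of the detected AU pattern.
au_detector = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
au_detector.fit(X, y)
print(au_detector.predict(X[:5]))
```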
Abstract: At TUDelft there is a project aiming at the realization of a fully automatic emotion recognition system on the basis of facial analysis. The approach splits the system into four components: face detection, facial characteristic point extraction, tracking, and classification. The focus in this paper is only on the first two components. Face detection is performed by boosting simple rectangular Haar-like features that give a decent representation of the face. These features also allow the differentiation between a face and a non-face. The boosting algorithm is combined with an Evolutionary Search to speed up the overall search time. Facial characteristic points (FCPs) are extracted from the detected faces; the same technique applied to faces is used for this purpose. Additionally, FCP extraction using corner detection methods and brightness distribution has also been considered. Finally, after retrieving the required FCPs, the emotion of the facial expression can be determined. The classification of the Haar-like features is done by a Relevance Vector Machine (RVM).
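For reference, rectangular Haar-like features of the kind used in such boosted detectors can be evaluated in constant time from an integral image. The sketch below shows a horizontal two-rectangle feature; the window size and coordinates are chosen only for illustration.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a leading zero row/column, shape (H+1, W+1)."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left (x, y), width w, height h."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect(ii, x, y, w, h):
    """Horizontal two-rectangle Haar-like feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

# Example: evaluate one feature on a 24x24 face window (stand-in data).
window = np.random.default_rng(0).random((24, 24))
ii = integral_image(window)
print(haar_two_rect(ii, x=4, y=6, w=12, h=8))
```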