Real-time, multimodal and continuous affect prediction in the arousal-valence space

Date of Publication


Document Type

Master's Thesis

Degree Name

Master of Science in Computer Science


College

College of Computer Studies

Department

Computer Science

Thesis Adviser

Jocelynn Cu


Since most automatic emotion recognition (AER) systems employ pre-segmented data that contains only one emotion, the continuous nature of affect cannot be modelled accurately. Moreover, most systems tend to quantize the continuous labels into discrete labels, which defeats the purpose of dimensional affect modelling. Thus, in order to model the gradation of affect, real-time continuous affect prediction in the arousal-valence space is essential. Furthermore, multiple modalities should be utilized in order to compensate for the failure of any one modality. Accordingly, this research investigates the problems associated with continuous affect prediction: establishing a reliable ground truth, finding a segmentation technique suitable for both modalities, choosing fast and robust features that can handle real-world data, and building a classification model that can predict affect effectively with as little delay as possible. This will be useful for the advancement of human-computer interaction, since applications that utilize this framework will be able to respond and adapt immediately to the needs of the users.
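The ground-truth problem mentioned above is commonly handled by checking how well the annotators' continuous rating traces agree before accepting a clip's labels. A minimal sketch of this idea (not the thesis code; the data here is synthetic, and the 0.5 threshold follows the criterion stated in the abstract) using mean pairwise Pearson correlation:

```python
# Illustrative sketch: accept a clip's annotations only if the annotators'
# continuous traces agree, measured as mean pairwise Pearson correlation.
# The annotator traces below are synthetic stand-ins.
import numpy as np

def mean_pairwise_correlation(traces):
    """traces: (n_annotators, n_frames) array of continuous ratings."""
    n = traces.shape[0]
    corrs = [np.corrcoef(traces[i], traces[j])[0, 1]
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(corrs))

rng = np.random.default_rng(1)
base = np.sin(np.linspace(0, 6, 100))          # shared underlying affect curve
annotators = np.stack([base + 0.2 * rng.normal(size=100) for _ in range(4)])

reliable = mean_pairwise_correlation(annotators) >= 0.5  # abstract's criterion
print(reliable)
```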

Specifically, prosodic features were used to predict affect from the voice, while temporal templates called Motion History Images (MHI) were used to predict affect from the face. Using a dataset whose annotations have an average correlation of 0.5 or higher, ε-Support Vector Regression (ε-SVR), decision-level fusion, and 10-fold cross validation, the system yielded a root mean square error (RMSE) of 0.3200 for arousal and 0.3205 for valence. Individually, voice yielded an RMSE of 0.3523 and 0.3485 for arousal and valence respectively. On the other hand, face yielded better results with an RMSE of 0.3375 for both dimensions. In addition, testing with another database, FilMED2, yielded fusion results of 0.3437 and 0.4350 for arousal and valence respectively. Furthermore, testing with newly recorded data yielded fusion results of 0.3740 and 0.5075 for arousal and valence respectively.
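The pipeline described above can be sketched in miniature: one SVR regressor per modality, with decision-level fusion combining the per-modality predictions, scored by RMSE. This is an illustrative sketch only (not the thesis code): the feature arrays are synthetic stand-ins for the prosodic and MHI features, fusion is shown as a simple average, and scikit-learn's `SVR` is assumed.

```python
# Illustrative sketch: decision-level fusion of two epsilon-SVR regressors
# for continuous arousal prediction, evaluated with RMSE.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 200
voice_X = rng.normal(size=(n, 12))  # stand-in for prosodic features
face_X = rng.normal(size=(n, 8))    # stand-in for MHI-derived features
y = np.tanh(voice_X[:, 0] + face_X[:, 0])  # synthetic arousal labels in [-1, 1]

split = 150  # simple holdout split for the sketch (thesis used 10-fold CV)
voice_model = SVR(kernel="rbf", epsilon=0.1).fit(voice_X[:split], y[:split])
face_model = SVR(kernel="rbf", epsilon=0.1).fit(face_X[:split], y[:split])

# Decision-level fusion: combine the per-modality predictions
# (here an unweighted average).
fused = 0.5 * (voice_model.predict(voice_X[split:])
               + face_model.predict(face_X[split:]))
rmse = float(np.sqrt(np.mean((fused - y[split:]) ** 2)))
print(f"fused RMSE: {rmse:.4f}")
```

In practice the fusion weights would be tuned, and evaluation would use cross validation rather than a single holdout split.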

Abstract Format






Accession Number


Shelf Location

Archives, The Learning Commons, 12F Henry Sy Sr. Hall

Physical Description

1 v. (various foliations) ; 28 cm.


Keywords

Emotion recognition; Human-computer interaction
