Affiliations: [a] Department of Information Engineering (DINFO), University of Florence, Italy
[b] Department of Information Engineering and Mathematics (DIISM), University of Siena, Italy
Correspondence: [*] Corresponding author: Lisa Graziani, Department of Information Engineering (DINFO), University of Florence, Italy. E-mail: [email protected].
Abstract: This paper investigates the role of coherence constraints in recognizing facial expressions from images and video sequences. A set of constraints is introduced to bridge a pool of Convolutional Neural Networks (CNNs) during their training stage. The constraints are inspired by practical considerations on the regularity of the temporal evolution of the predictions, and by the idea of connecting the information extracted from multiple representations. We study CNNs with the aim of building a versatile recognizer of expressions in static images that can be further applied to video sequences. First, the importance of different face parts in the recognition task is studied, considering appearance-related and shape-related features. Then we focus on the semi-supervised learning setting, exploiting video data in which only a few frames are supervised. The unsupervised portion of the training data is used to enforce three types of coherence: temporal coherence, coherence among the predictions on the face parts, and coherence between appearance-based and shape-based representations. Our experimental analysis shows that coherence constraints improve the quality of the expression recognizer, thus offering a suitable basis to profitably exploit unsupervised video sequences, even when some portions of the input face are not visible.
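To make the notion of a coherence constraint concrete, the following is a minimal sketch of how such penalties could be computed on unsupervised frames. It is an illustration under stated assumptions, not the formulation used in the paper: the squared Euclidean penalty, the averaging, and the function names `temporal_coherence` and `cross_coherence` are all hypothetical choices. Each prediction is assumed to be a probability vector over expression classes.

```python
from typing import Sequence

# Assumption: one prediction is a probability vector over expression classes.
Probs = Sequence[float]


def sq_dist(p: Probs, q: Probs) -> float:
    """Squared Euclidean distance between two prediction vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def temporal_coherence(frames: Sequence[Probs]) -> float:
    """Mean penalty on prediction changes between consecutive frames.

    Hypothetical form: the penalty is zero when the predictions stay
    constant along the sequence and grows as they fluctuate.
    """
    if len(frames) < 2:
        return 0.0
    pairs = zip(frames[:-1], frames[1:])
    return sum(sq_dist(p, q) for p, q in pairs) / (len(frames) - 1)


def cross_coherence(preds_a: Sequence[Probs], preds_b: Sequence[Probs]) -> float:
    """Mean penalty on disagreement between two predictors on the same frames.

    Hypothetical form: the two predictors could be networks fed with
    different face parts, or with appearance-based and shape-based inputs.
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("both predictors must cover the same frames")
    if not preds_a:
        return 0.0
    total = sum(sq_dist(p, q) for p, q in zip(preds_a, preds_b))
    return total / len(preds_a)
```

In a semi-supervised setting, terms of this kind would typically be added, with suitable weights, to the supervised loss computed on the few labeled frames. Neither penalty requires labels, which is what would allow unlabeled video frames to contribute to training.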