MICC - Media Integration and Communication Center

This multimodal dataset is designed for Facial Action Unit (AU) classification and integrates both RGB and event-based visual data. It comprises temporally synchronized video recordings captured using a commercial USB RGB camera and a Prophesee Evaluation Kit 4 (EVK4) equipped with the IMX646 neuromorphic sensor. The two modalities provide complementary information: the RGB stream offers frame-based appearance cues at a resolution of 640 × 480, while the event camera captures high-temporal-resolution motion dynamics at 1280 × 720. This dual-modality setup enables the exploration of hybrid approaches that combine conventional and neuromorphic sensing for improved robustness in facial expression analysis. The dataset is intended to support research in multimodal learning, event-based vision, and affective computing, and to facilitate the development and benchmarking of AU classification models.
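As an illustration of how temporally synchronized streams might be paired, the sketch below accumulates the events that fall between consecutive RGB frame timestamps into one sparse polarity map per frame. It is not the dataset's loader: the event tuple layout `(x, y, polarity, timestamp_us)`, the 30 fps RGB timestamps, and the function names `events_in_window` and `accumulate` are all assumptions made for the example. Only the 1280 × 720 event resolution comes from the description above.

```python
from bisect import bisect_left

# Event sensor resolution stated in the dataset description.
EVENT_WIDTH, EVENT_HEIGHT = 1280, 720


def events_in_window(timestamps, t_start, t_end):
    """Return the index range [lo, hi) of events with t_start <= t < t_end.

    `timestamps` must be sorted in ascending order (microseconds here).
    """
    lo = bisect_left(timestamps, t_start)
    hi = bisect_left(timestamps, t_end)
    return lo, hi


def accumulate(events, lo, hi, width=EVENT_WIDTH, height=EVENT_HEIGHT):
    """Sum signed polarities per pixel into a sparse {(x, y): count} map."""
    frame = {}
    for x, y, polarity, _t in events[lo:hi]:
        if 0 <= x < width and 0 <= y < height:
            frame[(x, y)] = frame.get((x, y), 0) + (1 if polarity else -1)
    return frame


# Hypothetical toy data: (x, y, polarity, timestamp_us), sorted by time.
events = [
    (10, 20, 1, 1_000),
    (10, 20, 1, 15_000),
    (11, 20, 0, 30_000),
    (500, 300, 1, 40_000),
    (10, 20, 0, 70_000),
]
timestamps = [e[3] for e in events]

# Hypothetical RGB frame timestamps, assuming 30 fps (about 33,333 us apart).
FRAME_PERIOD_US = 33_333
rgb_timestamps = [0, 33_333, 66_666]

frames = []
for i, t_start in enumerate(rgb_timestamps):
    # The last frame has no successor, so its window is one frame period long.
    if i + 1 < len(rgb_timestamps):
        t_end = rgb_timestamps[i + 1]
    else:
        t_end = t_start + FRAME_PERIOD_US
    lo, hi = events_in_window(timestamps, t_start, t_end)
    frames.append(accumulate(events, lo, hi))
```

Each entry of `frames` then corresponds to one RGB frame. A sparse dictionary is used here only to keep the example dependency-free; a dense 720 × 1280 array would be the more usual choice in practice.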

If you use the dataset, please cite our paper as follows:

@InProceedings{10.1007/978-3-031-92460-6_13,
author="Becattini, Federico
and Cultrera, Luca
and Berlincioni, Lorenzo
and Ferrari, Claudio
and Leonardo, Andrea
and Del Bimbo, Alberto",
editor="Del Bue, Alessio
and Canton, Cristian
and Pont-Tuset, Jordi
and Tommasi, Tatiana",
title="Neuromorphic Facial Analysis with Cross-Modal Supervision",
booktitle="Computer Vision -- ECCV 2024 Workshops",
year="2025",
publisher="Springer Nature Switzerland",
address="Cham",
pages="205--223",
abstract="Traditional approaches for analyzing RGB frames are capable of providing a fine-grained understanding of a face from different angles by inferring emotions, poses, shapes, landmarks. However, when it comes to subtle movements standard RGB cameras might fall behind due to their latency, making it hard to detect micro-movements that carry highly informative cues to infer the true emotions of a subject. To address this issue, the usage of event cameras to analyze faces is gaining increasing interest. Nonetheless, all the expertise matured for RGB processing is not directly transferrable to neuromorphic data due to a strong domain shift and intrinsic differences in how data is represented. The lack of labeled data can be considered one of the main causes of this gap, yet gathering data is harder in the event domain since it cannot be crawled from the web and labeling frames should take into account event aggregation rates and the fact that static parts might not be visible in certain frames. In this paper, we first present FACEMORPHIC, a multimodal temporally synchronized face dataset comprising both RGB videos and event streams. The data is labeled at a video level with facial Action Units and also contains streams collected with a variety of applications in mind, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization can allow effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision bridging the domain gap by representing face shapes in a 3D space.",
isbn="978-3-031-92460-6"
}

Facemorphic Neuromorphic face dataset
