The project requires both the creation of a sentiment-attention joint dataset and the investigation of computer vision architectures for face detection and sentiment recognition that are small enough to run in a browser at an acceptable speed.
The first step is to build a sufficiently large dataset of users watching different kinds of videos. Since we want to provide users with personalized content based on their live reactions, the dataset must support the following tasks: face detection, emotion recognition, age and gender estimation, and attention detection. To this end we developed a web annotation tool that shows users a set of videos and records their emotional reactions through the device camera. After watching the videos, each user is asked to annotate their emotions and their interest for each video.
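A per-session record produced by such a tool might look as follows. This is a minimal sketch of a plausible schema; all field names and value ranges (e.g. the 1-5 interest score) are illustrative assumptions, not the tool's actual data format.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class VideoAnnotation:
    """Hypothetical self-annotation for one watched video."""
    video_id: str
    self_reported_emotion: str  # e.g. "joy", "neutral", "disgust"
    interest: int               # assumed 1 (low) .. 5 (high) Likert score
    webcam_clip_path: str       # recorded reaction footage for later labeling

@dataclass
class Session:
    """Hypothetical per-user session with demographic ground truth."""
    user_id: str
    age: int                    # ground truth for age estimation
    gender: str                 # ground truth for gender estimation
    annotations: list = field(default_factory=list)

# One user watches one video and annotates their reaction.
session = Session(user_id="u042", age=31, gender="F")
session.annotations.append(
    VideoAnnotation(video_id="vid_001",
                    self_reported_emotion="joy",
                    interest=4,
                    webcam_clip_path="reactions/u042/vid_001.webm"))
record = asdict(session)  # serializable dict, ready to store as JSON
```

Keeping the self-reported labels alongside the recorded webcam clip is what makes the dataset jointly usable for the emotion, demographic, and attention tasks listed above.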
The final system shows videos to the users and observes their reactions through the device camera. The computer vision pipeline must perform the following actions: i) detect faces in the camera stream; ii) estimate emotions and demographic data from each detected face; iii) detect the user's attention. All these steps must run in the browser at acceptable speed on mobile devices. To this end we studied mobile-friendly network architectures such as tiny Xception networks, which run at more than 10 fps on mid-range mobile devices with a memory footprint of roughly 100-200 kilobytes. These networks will be trained on the collected dataset to also produce information about emotions and attention.
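The small footprint of Xception-style networks comes from replacing standard convolutions with depthwise-separable ones. The arithmetic below is a sketch of why this keeps such models in the 100-200 KB range; the layer sizes are illustrative examples, not the actual architecture used.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def sep_conv_params(k, c_in, c_out):
    """Weight count of a depthwise-separable convolution:
    one k x k depthwise filter per input channel, followed by
    a 1 x 1 pointwise convolution that mixes channels."""
    return k * k * c_in + c_in * c_out

# Example layer: 3x3 kernel, 64 -> 128 channels.
standard = conv_params(3, 64, 128)       # 73,728 weights
separable = sep_conv_params(3, 64, 128)  # 8,768 weights, ~8x fewer

# A tiny Xception-style model built from such blocks can stay in the
# tens of thousands of parameters; at 4 bytes per float32 weight,
# 50,000 parameters occupy about 195 KB.
memory_kb = 50_000 * 4 / 1024
```

This parameter budget is what makes it feasible to ship the weights to a browser and run inference at interactive frame rates on mobile hardware.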