InTime: A System for Video Recommendation using Visual Saliency, Crowdsourced and Automatic Annotations

InTime is a system for content-based video recommendation that exploits visual saliency along with crowdsourced and automatic annotations to better represent video features and content. Visual saliency is used to select relevant frames, which are presented in a web-based interface where users of a social network can tag and annotate them; it is also employed to summarize video content, creating a more effective video representation for the recommender system. The system combines automatic annotations, produced by CNN-based classifiers on salient frames, with user-generated annotations.
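As a concrete illustration of the saliency-based frame selection, the following is a minimal sketch that uses OpenCV's spectral-residual saliency detector (from opencv-contrib) as a stand-in for the system's saliency model; the function name, sampling stride and top-k policy are illustrative assumptions rather than the actual pipeline.

```python
# Sketch: score sampled frames by mean saliency and keep the most salient ones.
# Requires opencv-contrib-python; the saliency model here is a stand-in.
import cv2
import numpy as np

def select_salient_frames(video_path, top_k=5, stride=30):
    """Score every `stride`-th frame by mean saliency, return the top_k frames."""
    detector = cv2.saliency.StaticSaliencySpectralResidual_create()
    cap = cv2.VideoCapture(video_path)
    scored, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            success, sal_map = detector.computeSaliency(frame)
            if success:
                scored.append((float(np.mean(sal_map)), idx, frame))
        idx += 1
    cap.release()
    scored.sort(key=lambda t: t[0], reverse=True)  # most salient first
    # These frames are the candidates shown for tagging and used for summarisation.
    return [(i, f) for _, i, f in scored[:top_k]]
```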

The architecture of InTime has been implemented in a beta version of a social network. The idea behind the social network is to exploit user profiling techniques to propose to the user targeted recommendations of videos, topics of interest and similar users in the network. This is achieved initially by analysing data collected from other online profiles (e.g. Facebook) and then by tracking the user's activities on the social network, such as the number of video views, click-through data and video ratings. Experiments have been conducted on a dataset collected from the social network. Users can share and annotate videos at frame level using concepts derived from Wikipedia, following a procedure reminiscent of sharing and tagging photos on Facebook.
All the concepts added with this procedure are clustered into 54 categories using Fuzzy K-Means within a two-level taxonomy of interests (12 macro- and 42 micro-categories, such as Music and Jazz music, inspired by the taxonomy of Vimeo) and classified using a semantic distance with a nearest-neighbour approach. Each item of the taxonomy is associated with a corresponding Wikipedia article, which is used to compute the semantic distance. All the categorised resources attached to a video are then used to build a vector describing its content, which is exploited by the recommender.
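To make the categorisation step concrete, here is a minimal sketch of the nearest-neighbour assignment; TF-IDF cosine similarity over Wikipedia article text stands in for the semantic distance actually used, and the taxonomy contents and function names are illustrative.

```python
# Sketch: assign a Wikipedia-derived concept to its closest taxonomy category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def categorise(concept_article: str, taxonomy: dict) -> str:
    """taxonomy maps a category name (e.g. 'Jazz music') to its article text."""
    names = list(taxonomy)
    vectorizer = TfidfVectorizer(stop_words="english")
    mat = vectorizer.fit_transform([concept_article] + [taxonomy[n] for n in names])
    sims = cosine_similarity(mat[0], mat[1:]).ravel()
    return names[sims.argmax()]  # nearest neighbour under the relatedness proxy
```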

The network also exposes a module for profile curation of resources of interest, extracted from comments across the users' network. The module is based on the well-grounded hypothesis that self-esteem can be exploited to engage the user in the annotation process: social network users in fact tend to give a socially desirable representation of themselves, as demonstrated by the authors with a Facebook experiment. We take advantage of this by providing users with tools to obtain this representation and by engaging them in other-focused activities (e.g. comments and annotations on resources of other profiles), with the goal of easing the collection of crowdsourced annotations.

Video analysis is performed to improve the interaction of users with the system and to obtain a more objective content-based representation of videos, in order to compute the recommendation. Automatic video annotations are extracted using a classifier based on convolutional neural networks (CNNs) applied to the most salient frames. Annotations are categorised using the same semantic relatedness measure, and their relevance is weighted according to the confidence returned by the classifier.
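A minimal sketch of this annotation step follows, using a pretrained torchvision ImageNet classifier as a stand-in for the system's CNN; the returned (label, confidence) pairs would then be categorised and weighted as described above.

```python
# Sketch: annotate a salient frame with labels weighted by classifier confidence.
import torch
from torchvision import models
from torchvision.models import ResNet50_Weights

weights = ResNet50_Weights.DEFAULT          # stand-in CNN, not the system's model
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

def annotate_frame(pil_image, top_k=5):
    """Return (label, confidence) pairs for one frame."""
    with torch.no_grad():
        logits = model(preprocess(pil_image).unsqueeze(0))
        probs = torch.softmax(logits, dim=1)[0]
    conf, idx = probs.topk(top_k)
    labels = [weights.meta["categories"][i] for i in idx.tolist()]
    return list(zip(labels, conf.tolist()))   # confidence weights each annotation
```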

The recommender implements item-based collaborative filtering, building an item-item matrix that determines similarity relationships between pairs of items. The recommendation step then uses the items most similar to the user's already-rated items to generate a list of recommendations. Videos are represented using a feature vector that concatenates the histogram of the categories of the manual comments and the bag-of-words (BoW) description obtained by applying the CNN classifier to the most salient frames.
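The sketch below illustrates the feature construction and the item-item similarity step under simple assumptions (cosine similarity over L2-normalised vectors, mean affinity to the user's rated items); it is not the system's exact scoring scheme.

```python
# Sketch: concatenated feature vectors and item-based collaborative filtering.
import numpy as np

def item_vector(category_hist: np.ndarray, cnn_bow: np.ndarray) -> np.ndarray:
    """Concatenate the comment-category histogram with the CNN BoW descriptor."""
    v = np.concatenate([category_hist, cnn_bow])
    return v / (np.linalg.norm(v) + 1e-9)      # L2-normalise for cosine similarity

def recommend(item_matrix: np.ndarray, rated_ids: list, top_n: int = 10) -> list:
    """item_matrix holds one normalised feature vector per video (rows)."""
    sim = item_matrix @ item_matrix.T          # item-item similarity matrix
    scores = sim[rated_ids].mean(axis=0)       # affinity to already-rated items
    scores[rated_ids] = -np.inf                # never re-recommend rated videos
    return np.argsort(scores)[::-1][:top_n].tolist()
```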

A dataset has been collected by hiring 812 workers from the Microworkers website and asking them to use the system to upload their favorite videos, annotate and comment on the shots they found most interesting, and rate some videos. The dataset is composed of 632 videos, of which 468 were annotated with 1956 comments and 1802 annotations. 613 videos were rated by 950 of the 1059 network users. Comparing the performance of the system, in terms of Root Mean Square Error (RMSE), with a standard item-based recommender implemented in Apache Mahout shows an improvement of ∼26%.
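For reference, the RMSE used in the comparison is computed over predicted versus actual ratings, as in this minimal sketch.

```python
# Sketch: Root Mean Square Error between predicted and actual ratings.
import numpy as np

def rmse(predicted: np.ndarray, actual: np.ndarray) -> float:
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))
```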
