I received a Laurea degree (MSc) in computer engineering from the University of Florence in 2008, with a thesis on human action recognition. I obtained my PhD in 2012 at the Media Integration and Communication Center of the University of Florence, under the supervision of Prof. Alberto Del Bimbo, with a thesis on “Supervised and Semi-supervised Event Detection with Local Spatio-Temporal Features”. From February 2013 to August 2013 I was a visiting scholar in Silvio Savarese’s laboratory at the University of Michigan (now at Stanford).
I am currently a PostDoc at the Visual Information and Media Lab of the Media Integration and Communication Center, University of Florence.
My research interests focus on the application of pattern recognition and machine learning to computer vision, specifically in the field of human activity recognition.
My profile on Google Scholar and Scopus.
Object detection is one of the most important tasks in computer vision. It is usually performed by evaluating the locations of an image that are most likely to contain the object of interest. The interplay between detectors and proposal algorithms has not been studied so far. We propose to connect detectors and object proposal generators in a closed loop, exploiting the ordered and continuous nature of video sequences. We obtain state-of-the-art mAP with a detection time lower than that of Faster R-CNN.
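The closed-loop idea can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: `propose_static` and `detect` are hypothetical stand-ins for a generic proposal algorithm and a trained detector, and the feedback rule (propagating high-confidence boxes to the next frame) is a simplification.

```python
# Hedged sketch of a detector/proposal closed loop on video.
# `propose_static` and `detect` are illustrative placeholders.

def propose_static(frame_size, n=300):
    """Stand-in for a generic proposal algorithm: a naive grid of boxes."""
    h, w = frame_size
    return [(x, y, x + w // 4, y + h // 4)
            for x in range(0, w - w // 4, w // 8)
            for y in range(0, h - h // 4, h // 8)][:n]

def detect(frame_size, proposals):
    """Stand-in detector: scores each proposal, returns (box, score) pairs."""
    return [(box, 0.5) for box in proposals]  # placeholder confidence scores

def closed_loop_detect(frames, keep=0.3):
    """Feed high-confidence detections from frame t back as proposals for
    frame t+1, exploiting the temporal continuity of the video."""
    fed_back = []
    all_detections = []
    for frame in frames:
        # proposals = generic proposals + boxes propagated from previous frame
        proposals = propose_static(frame) + fed_back
        detections = detect(frame, proposals)
        fed_back = [box for box, score in detections if score >= keep]
        all_detections.append(detections)
    return all_detections
```

In a real system the fed-back boxes would also be refined (e.g. jittered or tracked) before reuse; here they are passed through unchanged for brevity.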
Our paper is available on the TIP website.
Our representation allows partial matching between videos, yielding a robust similarity measure. This approach is extremely useful in sport videos, where multiple entities are involved in the activities. Many existing works perform person detection and tracking, and often require camera calibration, in order to extract the motion and appearance of every player and object in the scene. In this work we overcome these limitations and propose an approach that exploits the spatio-temporal structure of a video, grouping local spatio-temporal features in an unsupervised manner. Our robust representation allows us to measure video similarity by establishing correspondences among arbitrary patterns. We show how our clusters can be used to generate frame-wise action proposals, and we exploit these proposals to further improve our representation for localization and recognition. We test our method on sport-specific and generic activity datasets, reporting results above the existing state of the art.
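The partial-matching intuition can be sketched as below. This is a simplified illustration, not the paper's formulation: each video is reduced to a set of cluster descriptors, and the similarity is an average of best correspondences, so structure present in only one video does not penalize the score.

```python
# Hedged sketch of partial matching between two videos, each represented as a
# set of cluster descriptor vectors. The matching rule is illustrative only.
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def partial_match_similarity(clusters_a, clusters_b):
    """For each cluster of video A, find its best correspondence in video B
    and average the best scores. Unmatched clusters in B are ignored, so two
    videos sharing only a subset of activities can still score highly."""
    if not clusters_a or not clusters_b:
        return 0.0
    best = [max(cosine(a, b) for b in clusters_b) for a in clusters_a]
    return sum(best) / len(best)
```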
The paper is available on the Elsevier platform:
TechCrunch covered our system while it was presented as a live demo at ACM MM 2016 in Amsterdam.
Our smart audio guide is backed by a computer vision system capable of working in real time on a mobile device, coupled with audio and motion sensors. We propose the use of a compact Convolutional Neural Network (CNN) that performs object classification and localization. Using the same CNN features computed for these tasks, we also perform robust artwork recognition. To improve recognition accuracy, we apply additional video processing: shape-based filtering, artwork tracking, and temporal filtering. The system has been deployed on an NVIDIA Jetson TK1 and an NVIDIA Shield Tablet K1, and tested in a real-world environment (the Bargello Museum in Florence).
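One of the steps above, temporal filtering, can be sketched as a sliding-window majority vote over the per-frame artwork predictions. This is a minimal illustration under assumed simplifications, not the deployed system's exact filter: isolated misclassifications are suppressed while stable recognitions survive.

```python
# Hedged sketch of temporal filtering for per-frame recognition labels.
from collections import Counter, deque

def temporally_filtered_predictions(frame_predictions, window=15):
    """Smooth a stream of per-frame labels with a sliding-window majority
    vote, so a single-frame misclassification cannot flip the output."""
    recent = deque(maxlen=window)   # only the last `window` labels are kept
    smoothed = []
    for label in frame_predictions:
        recent.append(label)
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed
```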
Our paper “Adaptive Structured Pooling for Action Recognition” has been accepted for publication and will be presented at the British Machine Vision Conference 2014.
This is joint work with Shugao Ma and Prof. Stan Sclaroff from Boston University, and Dr. Svebor Karaman and Prof. Alberto Del Bimbo from the University of Florence.
In this paper, we propose an adaptive structured pooling strategy for action recognition in videos. Our method identifies several spatio-temporal pooling regions, each corresponding to a consistent spatial and temporal subset of the video. Each subset yields a pooling weight map and is represented as a Fisher vector computed from the soft-weighted contributions of all dense trajectories evolving within it. We further represent each video through a graph structure defined over multiple granularities of spatio-temporal subsets. The graph structures extracted from all videos are finally compared with an efficient graph matching kernel. Our approach does not rely on a fixed partitioning of the video. Moreover, the graph structure captures both spatial and temporal relationships between the spatio-temporal subsets. Experiments on the UCF Sports and HighFive datasets show performance above the state of the art.
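The soft-weighted pooling step can be sketched as follows. This is a simplified illustration only: a weighted mean stands in for the Fisher vector encoding used in the paper, and the per-trajectory weights are assumed to come from the region's pooling weight map.

```python
# Hedged sketch of soft-weighted pooling for one spatio-temporal region.
# Each dense trajectory contributes its feature vector, scaled by the soft
# weight the region's pooling weight map assigns to it.

def pool_region(trajectory_features, weights):
    """Weighted average of trajectory feature vectors for one pooling region
    (a stand-in for the Fisher vector encoding)."""
    total = sum(weights)
    if total == 0 or not trajectory_features:
        return [0.0] * (len(trajectory_features[0]) if trajectory_features else 0)
    dim = len(trajectory_features[0])
    pooled = [0.0] * dim
    for feat, w in zip(trajectory_features, weights):
        for i in range(dim):
            pooled[i] += w * feat[i]
    return [v / total for v in pooled]
```

Computing such a descriptor per region, at several granularities, gives the node attributes of the video's graph representation.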
Here’s the camera-ready version of our paper!