I received a Laurea degree (MSc) in computer engineering from the University of Florence in 2008, with a thesis on human action recognition. I obtained my PhD degree in 2012 at the Media Integration and Communication Center of the University of Florence, under the supervision of Prof. Alberto Del Bimbo, with a thesis on “Supervised and Semi-supervised Event Detection with Local Spatio-Temporal Features”. From February 2013 to August 2013 I was a visiting scholar in Silvio Savarese's laboratory at the University of Michigan (his lab is now at Stanford).
I’m currently a PostDoc at the Visual Information and Media Lab of the Media Integration and Communication Center, University of Florence.
My research interests focus on the application of pattern recognition and machine learning to computer vision, specifically in the field of human activity recognition.
My profile on Google Scholar and Scopus.
Our paper on Deep Generative Adversarial Compression Artifact Removal has been accepted for publication at ICCV 2017. The following figure shows how our GAN recovers details in a compressed image (left). Note how textures and edges look better, and how blocking, ringing, and color quantization artifacts are removed.
We have shown that compression artifacts can be removed by transforming images with deep convolutional residual networks. We trained a generative network using an SSIM loss, obtaining state-of-the-art results according to standard image similarity metrics. Nonetheless, images reconstructed this way appear blurry and miss high-frequency details. These missing details make images look less similar to the originals for human viewers, and harder to understand for object detectors. We therefore propose a conditional Generative Adversarial framework which we train by alternating full-size patch generation with sub-patch discrimination. Human evaluation and quantitative experiments in object detection show that our GAN generates images with finer, more consistent details, and that these details make a difference both for machines and for humans.
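As a rough illustration of the SSIM objective mentioned above, here is a minimal single-window NumPy sketch. The actual training uses a differentiable, windowed implementation inside the network framework; the function names and constants here are ours, not the paper's code:

```python
import numpy as np

def ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM between two grayscale patches (float arrays in [0, 255])."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def ssim_loss(restored, original):
    # Training the generator to maximize SSIM amounts to minimizing 1 - SSIM.
    return 1.0 - ssim(restored, original)
```

A perfectly reconstructed patch gives SSIM = 1 (loss 0), while blur or residual artifacts lower the covariance term and push the loss up.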
Our work on Label Transfer in the Semantic Space has been accepted for publication in Pattern Recognition. In this work we show how to learn a semantic space using KCCA, in which the correlation of visual and textual features is preserved in a semantic embedding. Interestingly, our method works both when the training set is carefully annotated by experts and when it is noisy, as in the case of user-generated tags in social media. Extensive testing with modern features and image labeling algorithms shows the benefit on several benchmarks. At training time, we leverage the set of tags and the visual features to learn an embedding Φ(v;t) in a semantic space.
Once learned, our embedding is independent of the textual features and can then be computed for any image that has to be tagged. Our method reorganizes the feature space so that image semantics are preserved, as shown in this t-SNE plot, where colors represent image labels.
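To make the two-view embedding idea concrete, here is a minimal NumPy sketch of *linear* CCA, a simplified stand-in for the KCCA actually used in the paper; the function name `fit_semantic_space`, the regularization, and all details are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def fit_semantic_space(V, T, reg=1e-3):
    """Linear CCA between visual features V (n x dv) and tag features T (n x dt).
    Returns projections (Wv, Wt), the view means, and the canonical correlations,
    such that (V - mv) @ Wv and (T - mt) @ Wt are maximally correlated."""
    mv, mt = V.mean(0), T.mean(0)
    Vc, Tc = V - mv, T - mt
    n = len(V)
    # Regularized within-view and cross-view covariances.
    Cvv = Vc.T @ Vc / n + reg * np.eye(V.shape[1])
    Ctt = Tc.T @ Tc / n + reg * np.eye(T.shape[1])
    Cvt = Vc.T @ Tc / n
    # Whiten each view via Cholesky, then SVD of the whitened cross-covariance.
    Lv, Lt = np.linalg.cholesky(Cvv), np.linalg.cholesky(Ctt)
    M = np.linalg.solve(Lv, Cvt) @ np.linalg.inv(Lt).T
    U, s, VT = np.linalg.svd(M, full_matrices=False)
    Wv = np.linalg.solve(Lv.T, U)
    Wt = np.linalg.solve(Lt.T, VT.T)
    return Wv, Wt, mv, mt, s
```

At test time only the visual projection is needed: a new image is embedded as `(v - mv) @ Wv` and tagged, for instance, by nearest neighbors among the embedded training images, which mirrors the "independent of the textual features" property above.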
Read the full paper for further details!
Ensembles of Exemplar-SVMs were introduced as a framework for object detection but have rapidly attracted interest in a wide variety of computer vision applications, such as mid-level feature learning, tracking, and segmentation. To guarantee effectiveness, though, a large collection of classifiers has to be used, which has a prohibitive cost. To overcome this issue we organize Exemplar-SVMs into a taxonomy, exploiting the joint distribution of exemplar scores. This permits indexing the classifiers at logarithmic cost, while leaving the label transfer capabilities of the method almost unaffected. We propose different formulations of the taxonomy in order to maximize the speed gain. In particular, we propose a highly efficient Vector Quantized Rejecting Taxonomy that discards unpromising image regions during evaluation, performing computations in a quantized domain. This allows us to obtain remarkable speed gains, up to more than two orders of magnitude. To verify the robustness of our indexing data structure with respect to a standard Exemplar-SVM ensemble, we experiment with the Pascal VOC 2007 benchmark on the object detection competition and on a simple segmentation task.
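The rejecting idea can be sketched as follows. In this toy version each subtree of exemplar weight vectors is summarized by their mean, so a whole branch is pruned with a single dot product when its aggregate score falls below a threshold; the mean aggregation and the arbitrary binary split are our simplifications, not the paper's exact taxonomy construction:

```python
import numpy as np

class Node:
    """A taxonomy node over a list of exemplar (linear SVM) weight vectors."""
    def __init__(self, exemplars):
        # Aggregate classifier for the whole subtree: the mean weight vector.
        self.w = np.mean(exemplars, axis=0)
        if len(exemplars) == 1:
            self.children, self.leaf = [], exemplars[0]
        else:
            mid = len(exemplars) // 2
            self.children = [Node(exemplars[:mid]), Node(exemplars[mid:])]
            self.leaf = None

def evaluate(node, x, thresh):
    """Scores of exemplars whose ancestors all pass `thresh`;
    unpromising subtrees are rejected with one dot product each."""
    if node.w @ x < thresh:
        return []  # reject this branch and every exemplar below it
    if node.leaf is not None:
        return [node.leaf @ x]
    return (evaluate(node.children[0], x, thresh)
            + evaluate(node.children[1], x, thresh))
```

With a balanced tree, an image region that matches few exemplars is rejected after a logarithmic number of score evaluations instead of one per exemplar, which is where the speed gain comes from.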
Our paper is available as a preprint!
Object detection is one of the most important tasks in computer vision. It is usually performed by evaluating the locations of an image that are most likely to contain the object of interest. The interplay between detectors and proposal algorithms has not been studied until now. We propose to connect detectors and object proposal generator functions in a closed loop, exploiting the ordered and continuous nature of video sequences. We obtain state-of-the-art mAP with a detection time lower than Faster R-CNN's.
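A minimal sketch of the closed-loop intuition, with hypothetical names: boxes detected in frame t, slightly enlarged to absorb motion, seed the proposals for frame t+1 (the system described in the paper is of course more elaborate than this):

```python
def next_frame_proposals(prev_detections, expand=0.2, extra=None):
    """Closed-loop proposal generation for video: each detection from the
    previous frame (x, y, w, h, score) becomes an enlarged proposal box
    for the next frame, exploiting temporal continuity."""
    proposals = []
    for (x, y, w, h, score) in prev_detections:
        dx, dy = w * expand / 2, h * expand / 2
        proposals.append((x - dx, y - dy, w + 2 * dx, h + 2 * dy))
    # Optionally mix in proposals from a generic generator to catch new objects.
    return proposals + (extra or [])
```

Because most objects move little between consecutive frames, a handful of such recycled boxes can replace thousands of generic proposals, which is what drives the detection time down.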
Our paper is available on TIP website.
Our representation allows performing partial matching between videos, yielding a robust similarity measure. This approach is extremely useful in sports videos, where multiple entities are involved in the activities. Many existing works perform person detection and tracking, and often require camera calibration, in order to extract the motion and imagery of every player and object in the scene. In this work we overcome these limitations and propose an approach that exploits the spatio-temporal structure of a video, grouping local spatio-temporal features in an unsupervised manner. Our robust representation allows measuring video similarity by making correspondences among arbitrary patterns. We show how our clusters can be used to generate frame-wise action proposals. We exploit these proposals to further improve our representation for localization and recognition. We test our method on sport-specific and generic activity datasets, reporting results above the existing state-of-the-art.
The paper is available on the Elsevier platform:
TechCrunch covered our system while it was presented as a live demo at ACM MM 2016 in Amsterdam.
Our smart audio guide is backed by a computer vision system capable of working in real time on a mobile device, coupled with audio and motion sensors. We propose the use of a compact Convolutional Neural Network (CNN) that performs object classification and localization. Using the same CNN features computed for these tasks, we also perform robust artwork recognition. To improve recognition accuracy we perform additional video processing using shape-based filtering, artwork tracking, and temporal filtering. The system has been deployed on an NVIDIA Jetson TK1 and an NVIDIA Shield Tablet K1, and tested in a real-world environment (the Bargello Museum of Florence).
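The temporal filtering step can be realized, for example, as a sliding-window majority vote over per-frame predictions; this sketch (the class name and window size are ours, not the deployed system's) shows how a single misclassified frame gets suppressed:

```python
from collections import Counter, deque

class TemporalFilter:
    """Smooths per-frame artwork predictions with a sliding-window
    majority vote, so one-off misclassifications do not reach the user."""
    def __init__(self, size=15):
        self.window = deque(maxlen=size)

    def update(self, label):
        self.window.append(label)
        # Return the most frequent label in the recent window.
        return Counter(self.window).most_common(1)[0][0]
```

On a live camera feed this keeps the spoken audio description stable while the user walks around an artwork, at the cost of a short lag (half the window) when switching to a new one.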
Our paper “Adaptive Structured Pooling for Action Recognition” has been accepted for publication and will be presented at British Machine Vision Conference 2014.
This is a joint work with Shugao Ma and Prof. Stan Sclaroff from Boston University, Dr. Svebor Karaman and Prof. Alberto Del Bimbo from University of Florence.
In this paper, we propose an adaptive structured pooling strategy to solve the action recognition problem in videos. Our method aims at identifying several spatio-temporal pooling regions, each corresponding to a consistent spatial and temporal subset of the video. Each subset of the video yields a pooling weights map and is represented as a Fisher vector computed from the soft-weighted contributions of all dense trajectories evolving within it. We further represent each video through a graph structure defined over multiple granularities of spatio-temporal subsets. The graph structures extracted from all videos are finally compared with an efficient graph matching kernel. Our approach does not rely on a fixed partitioning of the video. Moreover, the graph structure depicts both spatial and temporal relationships between the spatio-temporal subsets. Experiments on the UCF Sports and the HighFive datasets show performance above the state-of-the-art.
Here’s the camera ready version of our paper!