I received a Laurea degree (MSc) in computer engineering from the University of Florence in 2008, with a thesis on human action recognition. I obtained my PhD in 2012, working at the Media Integration and Communication Center of the University of Florence under the supervision of Prof. Alberto Del Bimbo, with a thesis on “Supervised and Semi-supervised Event Detection with Local Spatio-Temporal Features”. I was a visiting scholar at the Silvio Savarese Laboratory at the University of Michigan (now at Stanford) from February 2013 to August 2013. I was a PostDoc at the Visual Information and Media Lab at the Media Integration and Communication Center of the University of Florence until January 2018.
Since March 2018 I have been an Assistant Professor at the Department of Information Engineering of the University of Florence.
My research interests focus on the application of pattern recognition and machine learning to computer vision, specifically in the field of human activity recognition.
Our work on adaptive compression for object detection algorithms has been accepted for publication at ICPR 2018. Video compression algorithms have been designed to please human viewers, and are driven by video quality metrics that account for the capabilities of the human visual system. However, thanks to advances in computer vision systems, more and more videos are going to be watched by algorithms, e.g. in video surveillance systems or for automatic video tagging. This paper describes an adaptive video coding approach for computer vision-based systems. We show how to control the quality of video compression so that automatic object detectors can still process the resulting video, improving their detection performance by preserving the elements of the scene that are more likely to contain meaningful content. Our approach is based on the computation of saliency maps exploiting a fast objectness measure.
The computational efficiency of this approach makes it usable in a real-time video coding pipeline. Experiments show that our technique outperforms standard H.265 in speed and coding efficiency, and can be applied to different types of video domains, from surveillance to web videos.
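As a rough illustration of the idea of steering compression quality with a saliency map, the sketch below maps a per-block saliency value to a quantization parameter (QP), so that salient blocks are quantized more finely. This is not the implementation from the paper: the function name `qp_map_from_saliency`, the linear mapping, and the `qp_min`/`qp_max` values are all assumptions made for the example.

```python
def qp_map_from_saliency(saliency, qp_min=22, qp_max=40):
    """Map per-block saliency in [0, 1] to a quantization parameter.

    Higher saliency gives a lower QP, i.e. finer quantization.
    qp_min and qp_max are illustrative values (assumption), chosen
    inside the 0-51 QP range used by H.265.
    """
    qp_rows = []
    for row in saliency:
        qp_row = []
        for s in row:
            s = min(1.0, max(0.0, s))  # clamp out-of-range saliency
            qp_row.append(round(qp_max - s * (qp_max - qp_min)))
        qp_rows.append(qp_row)
    return qp_rows


if __name__ == "__main__":
    # A 2x2 grid of blocks: background, salient, intermediate, over-range.
    print(qp_map_from_saliency([[0.0, 1.0], [0.5, 2.0]]))
```

A real encoder would consume such a map through its own rate-control interface; how that is done depends on the encoder and is not covered here.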
Our paper on Deep Generative Adversarial Compression Artifact Removal has been accepted for publication at ICCV 2017. In the following figure we can see how our GAN can recover details in a compressed image (left). Note how texture and edges look better, and how blocking, ringing and color quantization artifacts are removed.
We have shown that it is possible to remove compression artifacts by transforming images with deep convolutional residual networks. We have trained a generative network using an SSIM loss, obtaining state-of-the-art results according to standard image similarity metrics. Nonetheless, images reconstructed in this way appear blurry and lack details at higher frequencies. The lack of these details makes images look less similar to the original ones for human viewers and harder to understand for object detectors. We therefore propose a conditional Generative Adversarial framework, which we train by alternating full-size patch generation with sub-patch discrimination. Human evaluation and quantitative experiments in object detection show that our GAN generates images with finer, consistent details, and that these details make a difference both for machines and humans.
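To make the SSIM loss mentioned above more concrete, here is a minimal sketch of the standard SSIM formula computed over a single patch, with the loss taken as 1 - SSIM. It is only an illustration: practical implementations average SSIM over local (often Gaussian-weighted) windows, and the exact form of the loss used in the paper is not stated here. The function names and the assumption that pixel values lie in [0, 1] are ours.

```python
def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM between two equally sized flat patches.

    Pixel values are assumed to lie in [0, 1], so the usual
    stabilizing constants are (0.01)^2 and (0.03)^2.
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


def ssim_loss(x, y):
    """Loss that is 0 for identical patches and grows as they differ."""
    return 1.0 - ssim(x, y)


if __name__ == "__main__":
    textured = [0.0, 1.0, 0.0, 1.0]
    flat = [0.5, 0.5, 0.5, 0.5]  # same mean, all texture lost
    print(ssim_loss(textured, textured), ssim_loss(textured, flat))
```

In the toy example the flat patch has the same mean as the textured one but no structure, so its loss is close to 1.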
Our work on Label Transfer in the Semantic Space has been accepted for publication in Pattern Recognition. In this work we show how we can learn a semantic space using KCCA, where the correlation of visual and textual features is well preserved in a semantic embedding. Interestingly, our method works both when the training set is well annotated by experts and when it is noisy, as in the case of user-generated tags in social media. Extensive testing with modern features and image labeling algorithms shows the benefit on several benchmarks. At training time, we leverage the set of tags and the visual features to learn an embedding Φ(v;t) in a semantic space.
Once learned, our embedding is independent of the textual features and can then be computed for any image that has to be tagged. Our method is able to reorganize the feature space to preserve image semantics, as shown in this t-SNE plot, where colors represent image labels.
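One common way to transfer labels once images live in a shared embedding space is nearest-neighbor voting. The sketch below assumes that the embeddings have already been computed (the KCCA step itself is not shown) and that tags are transferred by a simple majority vote among the k most similar training images; the function names, the cosine similarity, and the toy data are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def transfer_tags(query, train_embeddings, train_tags, k=3, n_tags=2):
    """Vote for tags among the k training images closest to the query."""
    ranked = sorted(
        range(len(train_embeddings)),
        key=lambda i: cosine(query, train_embeddings[i]),
        reverse=True,
    )
    votes = Counter()
    for i in ranked[:k]:
        votes.update(set(train_tags[i]))
    # Most voted first; ties are broken alphabetically for determinism.
    ordered = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ordered[:n_tags]]


if __name__ == "__main__":
    embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    tags = [["beach", "sea"], ["sea", "sky"], ["city"], ["city", "night"]]
    print(transfer_tags([1.0, 0.05], embeddings, tags, k=2))
```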
Ensembles of Exemplar-SVMs were introduced as a framework for object detection, but have rapidly attracted interest in a wide variety of computer vision applications such as mid-level feature learning, tracking and segmentation. To guarantee effectiveness, though, a large collection of classifiers has to be used, which has a prohibitive cost. To overcome this issue we organize Exemplar-SVMs into a taxonomy, exploiting the joint distribution of Exemplar scores. This makes it possible to index the classifiers at a logarithmic cost, while leaving the label transfer capabilities of the method almost unaffected. We propose different formulations of the taxonomy in order to maximize the speed gain. In particular, we propose a highly efficient Vector Quantized Rejecting Taxonomy that discards unpromising image regions during evaluation, performing computations in a quantized domain. This allows us to obtain remarkable speed gains, with an improvement of more than two orders of magnitude. To verify the robustness of our indexing data structure with reference to a standard Exemplar-SVM ensemble, we experiment with the Pascal VOC 2007 benchmark on the Object Detection competition and on a simple segmentation task.
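The logarithmic cost comes from evaluating only one root-to-leaf path instead of every classifier. The sketch below shows this with a hypothetical binary tree of linear classifiers and an optional rejection threshold; it is a simplified picture and does not reproduce how the taxonomy is built from the joint distribution of Exemplar scores, nor the vector-quantized computation. The `Node` class, `descend` function, and `reject_below` parameter are names introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    weights: List[float]
    bias: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    exemplar_id: Optional[int] = None  # set on leaves only


def score(node, x):
    """Linear classifier response of a node on feature vector x."""
    return sum(w * v for w, v in zip(node.weights, x)) + node.bias


def descend(root, x, reject_below=float("-inf")):
    """Follow the best-scoring child at each level.

    Returns (exemplar_id, evaluations); exemplar_id is None when the
    region is rejected because the best child scores below the threshold.
    """
    node, evaluations = root, 0
    while node.exemplar_id is None:
        best, best_score = None, float("-inf")
        for child in (node.left, node.right):
            if child is None:
                continue
            s = score(child, x)
            evaluations += 1
            if s > best_score:
                best, best_score = child, s
        if best is None or best_score < reject_below:
            return None, evaluations
        node = best
    return node.exemplar_id, evaluations


if __name__ == "__main__":
    a = Node([1.0, 0.0],
             left=Node([1.0, 0.5], exemplar_id=0),
             right=Node([1.0, -0.5], exemplar_id=1))
    b = Node([0.0, 1.0],
             left=Node([0.5, 1.0], exemplar_id=2),
             right=Node([-0.5, 1.0], exemplar_id=3))
    root = Node([0.0, 0.0], left=a, right=b)
    print(descend(root, [1.0, 0.2]))
    print(descend(root, [1.0, 0.2], reject_below=2.0))
```

With four exemplars the descent evaluates four nodes, so there is no saving at this size; the gap with respect to scoring every exemplar only opens up as the ensemble grows.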
Object detection is one of the most important tasks of computer vision. It is usually performed by evaluating the locations of an image that are more likely to contain the object of interest. The interplay of detectors and proposal algorithms has not been studied up to now. We propose to connect detectors and object proposal generators in a closed loop, exploiting the ordered and continuous nature of video sequences. We obtain state-of-the-art mAP and a detection time that is lower than Faster R-CNN.
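One simple way to picture feedback from the detector to the proposal stage in a video is to reuse confident detections from one frame as candidate regions for the next one, slightly enlarged to tolerate motion. The sketch below only illustrates that intuition and is not the method from the paper; the function name `propagate_proposals`, the `margin` and `min_score` parameters, and the box format `(x1, y1, x2, y2)` are assumptions.

```python
def propagate_proposals(prev_detections, frame_size, margin=0.25, min_score=0.5):
    """Turn confident detections from frame t into proposals for frame t+1.

    prev_detections: list of ((x1, y1, x2, y2), score) pairs.
    frame_size: (width, height), used to clip the enlarged boxes.
    """
    width, height = frame_size
    proposals = []
    for (x1, y1, x2, y2), score in prev_detections:
        if score < min_score:
            continue  # weak detections are not fed back
        dx = margin * (x2 - x1)
        dy = margin * (y2 - y1)
        proposals.append((
            max(0.0, x1 - dx),
            max(0.0, y1 - dy),
            min(float(width), x2 + dx),
            min(float(height), y2 + dy),
        ))
    return proposals


if __name__ == "__main__":
    detections = [((40, 40, 60, 80), 0.9), ((0, 0, 20, 20), 0.2)]
    print(propagate_proposals(detections, (100, 100), margin=0.5))
```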
Our representation makes it possible to perform partial matching between videos, obtaining a robust similarity measure. This approach is extremely useful in sport videos, where multiple entities are involved in the activities. Many existing works perform person detection and tracking, and often require camera calibration in order to extract motion and imagery of every player and object in the scene. In this work we overcome these limitations and propose an approach that exploits the spatio-temporal structure of a video, grouping local spatio-temporal features in an unsupervised manner. Our robust representation makes it possible to measure video similarity by establishing correspondences among arbitrary patterns. We show how our clusters can be used to generate frame-wise action proposals. We exploit proposals to further improve our representation for localization and recognition. We test our method on sport-specific and generic activity datasets, reporting results above the existing state of the art.
TechCrunch covered our system when it was presented as a live demo at ACM MM 2016 in Amsterdam.
Our smart audio guide is backed by a computer vision system capable of working in real time on a mobile device, coupled with audio and motion sensors. We propose the use of a compact Convolutional Neural Network (CNN) that performs object classification and localization. Using the same CNN features computed for these tasks, we also perform robust artwork recognition. To improve the recognition accuracy we perform additional video processing using shape-based filtering, artwork tracking and temporal filtering. The system has been deployed on an NVIDIA Jetson TK1 and an NVIDIA Shield Tablet K1, and tested in a real-world environment (Bargello Museum of Florence).
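Temporal filtering of per-frame recognition results can be as simple as a sliding-window majority vote, which suppresses isolated misclassifications. The sketch below is a generic example of that idea, not the filter used in the deployed system; the function name `temporal_filter`, the window length, the vote threshold, and the labels "A" and "B" are assumptions.

```python
from collections import Counter, deque


def temporal_filter(labels, window=5, min_votes=3):
    """Smooth per-frame labels with a sliding-window majority vote.

    A label is emitted only if it appears at least min_votes times in
    the last `window` frames; otherwise None is emitted. With
    min_votes > window / 2 at most one label can qualify.
    """
    recent = deque(maxlen=window)
    smoothed = []
    for label in labels:
        recent.append(label)
        top, count = Counter(recent).most_common(1)[0]
        smoothed.append(top if count >= min_votes else None)
    return smoothed


if __name__ == "__main__":
    # "A" and "B" stand for two hypothetical artworks.
    frames = ["A", "A", "B", "A", "A", "B", "B", "B"]
    print(temporal_filter(frames))
```

In the toy sequence the single "B" at the third frame is ignored, and the output switches to "B" only once it dominates the window.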
In this paper, we propose an adaptive structured pooling strategy to solve the action recognition problem in videos. Our method aims at identifying several spatio-temporal pooling regions, each corresponding to a consistent spatial and temporal subset of the video. Each subset of the video gives a pooling weights map and is represented as a Fisher vector computed from the soft weighted contributions of all dense trajectories evolving in it. We further represent each video through a graph structure, defined over multiple granularities of spatio-temporal subsets. The graph structures extracted from all videos are finally compared with an efficient graph matching kernel. Our approach does not rely on a fixed partitioning of the video. Moreover, the graph structure depicts both spatial and temporal relationships between the spatio-temporal subsets. Experiments on the UCF Sports and the HighFive datasets show performance above the state of the art.