About me

I received a Laurea degree (MSc) in computer engineering from the University of Florence in 2008, with a thesis on human action recognition. I obtained my PhD in 2012, working at the Media Integration and Communication Center of the University of Florence under the supervision of Prof. Alberto Del Bimbo, with a thesis on “Supervised and Semi-supervised Event Detection with Local Spatio-Temporal Features”. I was a visiting scholar in Silvio Savarese’s laboratory at the University of Michigan (now at Stanford) from February 2013 to August 2013. I was then a PostDoc at the Visual Information and Media Lab of the Media Integration and Communication Center of the University of Florence until January 2018.

Since March 2018 I have been an Assistant Professor at the Department of Information Engineering of the University of Florence.

My research interests focus on the application of pattern recognition and machine learning to computer vision, specifically in the field of human activity recognition.

My profiles on Google Scholar and Scopus.


ICPR 2018 Paper Accepted

Our work on adaptive compression for object detection algorithms has been accepted for publication at ICPR 2018. Video compression algorithms have been designed to please human viewers, and are driven by video quality metrics that account for the capabilities of the human visual system. However, thanks to advances in computer vision, more and more videos are going to be watched by algorithms, e.g. implementing video surveillance systems or performing automatic video tagging. This paper describes an adaptive video coding approach for computer vision-based systems. We show how to control the quality of video compression so that automatic object detectors can still process the resulting video, improving their detection performance by preserving the elements of the scene that are most likely to contain meaningful content. Our approach is based on the computation of saliency maps exploiting a fast objectness measure.
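As a rough sketch of the idea (not the paper's actual coder: the linear QP mapping, block size, and parameter names are all illustrative), per-block quantization parameters can be derived from a saliency map, lowering QP (i.e. raising quality) where content is salient and raising it in the background:

```python
import numpy as np

def qp_map_from_saliency(saliency, base_qp=32, qp_range=8, block=16):
    """Map a per-pixel saliency map in [0, 1] to per-block QP values.

    Salient blocks get a lower QP (higher quality), background blocks a
    higher QP, so bits are spent on regions a detector is likely to need.
    """
    h, w = saliency.shape
    bh, bw = h // block, w // block
    qp = np.empty((bh, bw), dtype=int)
    for i in range(bh):
        for j in range(bw):
            s = saliency[i * block:(i + 1) * block,
                         j * block:(j + 1) * block].mean()
            # Linear mapping: s=1 -> base_qp - qp_range, s=0 -> base_qp + qp_range.
            qp[i, j] = int(round(base_qp + qp_range * (1 - 2 * s)))
    return qp
```

In a real pipeline these per-block offsets would be passed to the encoder's rate control (H.265 supports per-CTU delta QP), which is what makes the approach usable for real-time coding.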

Our system pipeline. Binary saliency maps are predicted by fusing multiple saliency maps with our learned model. The final binary map is shown rightmost.

The computational efficiency of this approach makes it usable in a real-time video coding pipeline. Experiments show that our technique outperforms standard H.265 in speed and coding efficiency, and can be applied to different types of video domains, from surveillance to web videos.

Average Log Miss Rate on Towncenter dataset for ACF pedestrian detector for different bitrates.



ICCV ’17 Paper Accepted!

Our paper on Deep Generative Adversarial Compression Artifact Removal has been accepted for publication at ICCV 2017. In the following figure we can see how our GAN can recover details in a compressed image (left). Note how textures and edges look better, and how blocking, ringing and color quantization artifacts are removed.

We have shown that it is possible to remove compression artifacts by transforming images with deep convolutional residual networks. We have trained a generative network using an SSIM loss, obtaining state-of-the-art results according to standard image similarity metrics. Nonetheless, images reconstructed this way appear blurry and lack details at higher frequencies. These details make images look less similar to the originals for human viewers, and harder to understand for object detectors. We therefore propose a conditional Generative Adversarial framework, which we train by alternating full-size patch generation with sub-patch discrimination. Human evaluation and quantitative experiments in object detection show that our GAN generates images with finer, consistent details, and that these details make a difference both for machines and humans.
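A minimal sketch of the objective (a simplified single-window SSIM and a hypothetical combined generator loss; the paper's network architecture and training details are not reproduced here):

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    """Simplified single-window SSIM between two images in [0, 1].

    Real SSIM is computed over local windows and averaged; this global
    version only illustrates the structure of the metric.
    """
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def generator_loss(restored, original, d_fake, adv_weight=1e-3):
    """Hypothetical combined loss: SSIM reconstruction term plus an
    adversarial term, where d_fake is the discriminator's probability
    that the restored patch is real. Pushing d_fake toward 1 is what
    encourages the high-frequency detail that SSIM alone leaves blurry."""
    return (1.0 - ssim_global(restored, original)) \
        - adv_weight * np.log(d_fake + 1e-12)
```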



Automatic Image Annotation via Label Transfer in the Semantic Space

Our work on Label Transfer in the Semantic Space has been accepted for publication in Pattern Recognition. In this work we show how to learn a semantic space using KCCA, where correlations between visual and textual features are preserved in a semantic embedding. Interestingly, our method works both when the training set is well annotated by experts and when it is noisy, as in the case of user-generated tags in social media. Extensive testing with modern features and image labeling algorithms shows the benefit on several benchmarks. At training time, we leverage the set of tags and the visual features to learn an embedding Φ(v;t) in a semantic space.

Once learned, our embedding is independent of the textual features and can then be computed for any image that has to be tagged. Our method is able to reorganize the feature space to preserve image semantics, as shown in this t-SNE plot, where colors represent image labels.
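As an illustrative simplification (a linear least-squares map stands in for the KCCA embedding, and all names are hypothetical), label transfer in a learned semantic space might look like:

```python
import numpy as np

def learn_embedding(V, T):
    """Least-squares map from visual features V (n x dv) to tag
    vectors T (n x dt) -- a linear stand-in for the KCCA embedding."""
    W, *_ = np.linalg.lstsq(V, T, rcond=None)
    return W

def transfer_tags(W, V_train, tags, v_query, k=3):
    """Embed the query image, find its k nearest training images in the
    semantic space, and transfer their most frequent tag."""
    z_train = V_train @ W
    z_query = v_query @ W
    d = np.linalg.norm(z_train - z_query, axis=1)
    votes = {}
    for i in np.argsort(d)[:k]:
        for t in tags[i]:
            votes[t] = votes.get(t, 0) + 1
    return max(votes, key=votes.get)
```

Note that, as in the paper, only visual features are needed at query time: the textual side enters only through the learned map.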

Read the full paper for further details!




Indexing Quantized Ensembles of Exemplar-SVMs with Rejecting Taxonomies accepted in Multimedia Tools and Applications

Ensembles of Exemplar-SVMs were introduced as a framework for object detection, but have rapidly attracted interest in a wide variety of computer vision applications such as mid-level feature learning, tracking and segmentation. To guarantee effectiveness, though, a large collection of classifiers has to be used, which has a prohibitive cost. To overcome this issue we organize Exemplar-SVMs into a taxonomy, exploiting the joint distribution of exemplar scores. This permits indexing the classifiers at logarithmic cost, while leaving the label transfer capabilities of the method almost unaffected. We propose different formulations of the taxonomy in order to maximize the speed gain. In particular, we propose a highly efficient Vector Quantized Rejecting Taxonomy to discard unpromising image regions during evaluation, performing computations in a quantized domain. This allows us to obtain remarkable speed gains, with improvements of up to more than two orders of magnitude. To verify the robustness of our indexing data structure with respect to a standard Exemplar-SVM ensemble, we experiment with the Pascal VOC 2007 benchmark on the object detection competition and on a simple segmentation task.
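A toy sketch of the rejecting-taxonomy idea (without vector quantization; the tree construction, node statistic, and threshold below are illustrative, not the paper's exact formulation):

```python
import numpy as np

def build_taxonomy(weights, idx=None):
    """Recursively pair exemplar weight vectors into a binary tree.

    Each internal node stores the mean of its subtree's exemplars,
    used as a cheap test before descending into the subtree."""
    if idx is None:
        idx = list(range(len(weights)))
    if len(idx) == 1:
        return {"leaf": idx[0], "w": weights[idx[0]]}
    mid = len(idx) // 2
    left = build_taxonomy(weights, idx[:mid])
    right = build_taxonomy(weights, idx[mid:])
    return {"leaf": None, "w": (left["w"] + right["w"]) / 2,
            "left": left, "right": right}

def evaluate(node, x, thresh, stats):
    """Score x down the tree, rejecting any subtree whose node score
    falls below thresh. Returns {exemplar index: score} for survivors;
    stats counts the dot products actually performed."""
    stats["dots"] += 1
    s = float(node["w"] @ x)
    if s < thresh:
        return {}
    if node["leaf"] is not None:
        return {node["leaf"]: s}
    out = evaluate(node["left"], x, thresh, stats)
    out.update(evaluate(node["right"], x, thresh, stats))
    return out
```

With large ensembles, whole subtrees of unpromising exemplars are discarded with a single dot product, which is where the logarithmic indexing cost comes from.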


Our paper is available as a preprint!


Spatio-Temporal Closed-Loop approach published in IEEE Transactions on Image Processing

Object detection is one of the most important tasks in computer vision. It is usually performed by evaluating the locations of an image that are most likely to contain the object of interest. The interplay of detectors and proposal algorithms has not been studied so far. We propose to connect, in a closed loop, detectors and object proposal generator functions, exploiting the ordered and continuous nature of video sequences. We obtain state-of-the-art mAP and a detection time lower than that of Faster R-CNN.
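As a hedged illustration of the closed-loop idea (the actual proposal generator is learned; this only shows temporal propagation of the previous frame's detections, with illustrative parameters):

```python
def propagate_proposals(prev_detections, expand=0.2, frame_w=1920, frame_h=1080):
    """Turn the last frame's detections (x1, y1, x2, y2) into proposals
    for the current frame by expanding each box, exploiting the temporal
    continuity of video: objects rarely move far between frames."""
    proposals = []
    for (x1, y1, x2, y2) in prev_detections:
        dw, dh = (x2 - x1) * expand, (y2 - y1) * expand
        proposals.append((max(0, x1 - dw), max(0, y1 - dh),
                          min(frame_w, x2 + dw), min(frame_h, y2 + dh)))
    return proposals
```

Feeding detections back as proposals is what closes the loop: far fewer candidate regions need scoring per frame, which is where the speed-up over a full proposal stage comes from.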

Closed-Loop Detection

Our paper is available on TIP website.


Action Localization based on Clustered Trajectories published in Computer Vision and Image Understanding

Our representation allows partial matching between videos, yielding a robust similarity measure. This approach is extremely useful in sport videos, where multiple entities are involved in the activities. Many existing works perform person detection and tracking, and often require camera calibration in order to extract motion and imagery of every player and object in the scene. In this work we overcome these limitations and propose an approach that exploits the spatio-temporal structure of a video, grouping local spatio-temporal features in an unsupervised manner. Our robust representation allows measuring video similarity by making correspondences among arbitrary patterns. We show how our clusters can be used to generate frame-wise action proposals. We exploit these proposals to further improve our representation for localization and recognition. We test our method on sport-specific and generic activity datasets, reporting results above the existing state of the art.
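A minimal sketch of turning clustered trajectory points into frame-wise proposals (the unsupervised clustering itself is assumed already done; names and data layout are illustrative):

```python
def frame_proposals(points, labels, frame):
    """Given trajectory points as (t, x, y) tuples and their cluster
    labels, return one bounding box (x1, y1, x2, y2) per cluster that
    is active in the given frame."""
    boxes = {}
    for (t, x, y), c in zip(points, labels):
        if t != frame:
            continue
        x1, y1, x2, y2 = boxes.get(c, (x, y, x, y))
        boxes[c] = (min(x1, x), min(y1, y), max(x2, x), max(y2, y))
    return boxes
```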

The paper is available on the Elsevier platform.




Wearable Smart Audio Guide featured on TechCrunch

TechCrunch covered our system while it was presented as a live demo at ACM MM 2016 in Amsterdam.

Our smart audio guide is backed by a computer vision system capable of working in real time on a mobile device, coupled with audio and motion sensors. We propose the use of a compact Convolutional Neural Network (CNN) that performs object classification and localization. Using the same CNN features computed for these tasks, we also perform robust artwork recognition. To improve the recognition accuracy we perform additional video processing using shape-based filtering, artwork tracking and temporal filtering. The system has been deployed on an NVIDIA Jetson TK1 and an NVIDIA Shield Tablet K1, and tested in a real-world environment (the Bargello Museum in Florence).
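The temporal filtering step can be sketched as a sliding-window majority vote over per-frame predictions (an illustrative simplification of the system's filtering; the window size and class names are made up):

```python
from collections import Counter, deque

class TemporalFilter:
    """Majority-vote smoothing of per-frame artwork predictions over a
    sliding window, to suppress spurious single-frame recognitions
    before triggering an audio description."""
    def __init__(self, window=15):
        self.buf = deque(maxlen=window)

    def update(self, label):
        self.buf.append(label)
        winner, count = Counter(self.buf).most_common(1)[0]
        # Require a strict majority of the buffered frames before
        # committing to a label; otherwise report nothing.
        return winner if count > len(self.buf) // 2 else None
```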





Adaptive Structured Pooling for Action Recognition

Our paper “Adaptive Structured Pooling for Action Recognition” has been accepted for publication and will be presented at the British Machine Vision Conference 2014.

This is a joint work with Shugao Ma and Prof. Stan Sclaroff from Boston University, Dr. Svebor Karaman and Prof. Alberto Del Bimbo from University of Florence.

In this paper, we propose an adaptive structured pooling strategy to solve the action recognition problem in videos. Our method aims at identifying several spatio-temporal pooling regions, each corresponding to a consistent spatial and temporal subset of the video. Each subset of the video gives a pooling weights map and is represented as a Fisher vector computed from the soft-weighted contributions of all dense trajectories evolving in it. We further represent each video through a graph structure, defined over multiple granularities of spatio-temporal subsets. The graph structures extracted from all videos are finally compared with an efficient graph matching kernel. Our approach does not rely on a fixed partitioning of the video. Moreover, the graph structure depicts both spatial and temporal relationships between the spatio-temporal subsets. Experiments on the UCF Sports and the HighFive datasets show performance above the state-of-the-art.
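The soft-weighted pooling can be illustrated in miniature (plain weighted averaging of local descriptors stands in for the full Fisher-vector computation; names are hypothetical):

```python
import numpy as np

def soft_pooled_descriptor(descriptors, weights):
    """Weighted average of local trajectory descriptors, where each
    weight is the trajectory's soft membership in a spatio-temporal
    pooling region (a simplified stand-in for Fisher-vector pooling)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return (np.asarray(descriptors) * w[:, None]).sum(axis=0)
```

One such descriptor per region, over multiple granularities, gives the node attributes of the video's graph representation.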

Here’s the camera-ready version of our paper!
