Object detection is a widely studied problem in the computer vision literature. While humans recognize a multitude of objects in images with little effort, machines are only now starting to recognize objects in images with good accuracy. This is due to the recent breakthrough of models based on Convolutional Neural Networks (CNNs), which has made recognition of hundreds of object classes feasible. These models can be trained to recognize the presence or absence of pre-specified categories. In fact, companies such as Google, Facebook and Flickr are developing systems for object detection in user photos. CNNs can also be used to perform scene classification, i.e. determining the type of scene or location shown in an image. However, CNNs are computationally expensive, require very large labeled datasets for training, and the learned models can have a huge memory footprint. This makes such models hard to execute in mobile environments, where battery and computational resources are constrained.
In collaboration with Cynny, a Florence and Silicon Valley based startup whose mission is to bring a new user experience by exploiting user emotion and video content, we tested existing state-of-the-art architectures on mobile CPUs. After assessing the limitations of current state-of-the-art neural architectures in terms of their high computational demand, we developed more advanced models that require less computational power (and thus battery) while retaining the same accuracy. We developed increasingly efficient and smaller neural networks for the task of object recognition. We also developed an extremely fast object proposal method that further reduces the burden of CPU-based object detection. Some modifications follow recent achievements in CNN architecture design, where layers that do not contribute to accuracy are removed in order to reduce the memory footprint and execution time.
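The impact of removing or shrinking layers on the memory footprint can be made concrete by counting learnable parameters. The following is a toy sketch (the layer shapes are illustrative only, not those of the architectures discussed here): a single large fully connected layer can hold tens of millions of weights, dwarfing a typical convolutional layer.

```python
def conv_params(in_ch, out_ch, k, bias=True):
    """Learnable weights of a k x k convolution:
    one k*k*in_ch filter (plus optional bias) per output channel."""
    return out_ch * (k * k * in_ch + (1 if bias else 0))

def fc_params(in_units, out_units, bias=True):
    """Learnable weights of a fully connected layer."""
    return out_units * (in_units + (1 if bias else 0))

# Illustrative comparison: a 4096-unit fully connected layer applied to a
# 6x6x256 feature map versus a 3x3 convolution with 256 input/output channels.
fc = fc_params(6 * 6 * 256, 4096)   # 37,752,832 parameters
conv = conv_params(256, 256, 3)     # 590,080 parameters
print(fc, conv)
```

Counts like these explain why pruning a few heavy layers, when they contribute little to accuracy, can shrink a model by an order of magnitude.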
Considering that reducing the number of learnable weights affects the representation capability of the network, we exploited novel learning algorithms to transfer knowledge from large, slow pre-trained networks to smaller and more efficient architectures.
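One well-known instance of such teacher-to-student knowledge transfer is knowledge distillation (Hinton et al.), where the small network is trained to match the softened output distribution of the large one. A minimal sketch of the softened cross-entropy objective follows; the temperature value is an assumption for illustration, not a parameter taken from this work.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T yields a softer distribution."""
    z = logits / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the teacher's softened outputs (targets) and
    the student's softened outputs, scaled by T^2 as in Hinton et al."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -float(np.sum(p_teacher * np.log(p_student + 1e-12))) * T * T

# Toy example: the teacher is confident about class 0, the student less so.
teacher = np.array([6.0, 1.0, 0.5])
student = np.array([2.0, 1.5, 1.0])
loss = distillation_loss(student, teacher)
```

In practice this soft-target term is combined with the usual hard-label cross-entropy, so the student benefits from the inter-class similarity structure encoded in the teacher's outputs.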
To verify the research strategy outlined above, we used publicly available datasets and object categories for training and testing. On the one hand, this allowed us to focus on the assessment of the architectural solutions; on the other hand, it avoided dataset bias and imbalance. The networks were trained from scratch on ILSVRC2012 (1.4 million images with 1000 categories). Fine-tuning and detection experiments were conducted on PASCAL VOC 2007.