Category Archives: Research projects

Media Integration and Communication Centre research projects

From re-identification to identity inference

Person re-identification is a standard component of multi-camera surveillance systems. Particularly in scenarios in which the long-term behaviour of persons must be characterized, accurate re-identification is essential. In realistic, wide-area surveillance scenarios such as airports, metro and train stations, re-identification systems should be capable of robustly associating a unique identity with hundreds, if not thousands, of individual observations collected from a large distributed network of sensors.

Traditionally, re-identification scenarios are defined in terms of a set of gallery images of a number of known individuals and a set of test images to be re-identified. For each test image or group of test images of an unknown person, the goal of re-identification is to return a ranked list of individuals from the gallery.
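
The ranked-list formulation can be illustrated with a minimal sketch. It assumes each image is already summarised by a numeric descriptor vector and that identities are compared by plain Euclidean distance; the function names and toy values are hypothetical and are not taken from the project.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_gallery(probe, gallery):
    """Return gallery identities sorted from best to worst match.

    gallery maps an identity label to a list of descriptors (one per image).
    An identity's score is the distance from the probe to its closest image.
    """
    scores = {
        identity: min(euclidean(probe, d) for d in descriptors)
        for identity, descriptors in gallery.items()
    }
    return sorted(scores, key=scores.get)

# Toy 2-D descriptors (hypothetical values, for illustration only).
gallery = {
    "person_a": [[0.0, 0.0], [0.1, 0.2]],
    "person_b": [[1.0, 1.0]],
    "person_c": [[3.0, 0.5]],
}
ranking = rank_gallery([0.9, 1.1], gallery)
print(ranking)
```

A multi-shot probe could be handled in the same way by taking the minimum over all of its images.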

Configurations of the re-identification problem are generally classified according to how much group structure is available in the gallery and test image sets. In a single-shot image set there is no grouping information available. Though there might be multiple images of an individual, there is no knowledge of which images correspond to that person. In a multi-shot image set, on the other hand, there is explicit grouping information available. That is, it is known which images correspond to the same individual.

While such characterizations of re-identification scenarios are useful for establishing benchmarks and standardized datasets for experimentation on the discriminative power of descriptors for person re-identification, they are not particularly realistic with respect to many real-world application scenarios. In video surveillance scenarios, it is more common to have many unlabelled test images to re-identify and only a few gallery images available.

Another unrealistic aspect of traditional person re-identification is its formulation as a retrieval problem. In most video surveillance applications, the accuracy of re-identification at Rank-1 is the most critical metric and higher ranks are of much less interest.
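
To make the role of Rank-1 concrete, here is a small sketch of how rank-k accuracy (the Cumulative Matching Characteristic) is commonly computed from ranked lists. The function name and the toy rankings are assumptions made for illustration.

```python
def cmc_curve(rankings, true_ids, max_rank):
    """cmc[k-1] is the fraction of probes whose true identity
    appears within the top k entries of its ranked list."""
    hits = [0] * max_rank
    for ranking, true_id in zip(rankings, true_ids):
        if true_id in ranking[:max_rank]:
            first_hit = ranking.index(true_id)
            # A hit at position p counts for every rank >= p.
            for k in range(first_hit, max_rank):
                hits[k] += 1
    return [h / len(true_ids) for h in hits]

# Four probes, each with a ranked list of three gallery identities.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "a", "b"], ["a", "c", "b"]]
true_ids = ["a", "a", "b", "a"]
cmc = cmc_curve(rankings, true_ids, max_rank=3)
rank1_accuracy = cmc[0]
print(cmc, rank1_accuracy)
```

In a surveillance setting only `rank1_accuracy` would typically matter, which is what motivates moving away from the retrieval formulation.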

Based on these observations, we have developed a generalization of person re-identification which we call identity inference. The identity inference formulation is expressive enough to represent existing single- and multi-shot scenarios, while at the same time also modelling a larger class of problems not discussed in the literature.

In particular, we demonstrate how identity inference models problems where only a few labelled examples are available, but where identities must be inferred for very many unlabelled images. In addition to describing identity inference problems, our formalism is also useful for precisely specifying the various multi- and single-shot re-identification modalities in the literature.

We show how a Conditional Random Field (CRF) can then be used to efficiently and accurately solve a broad range of identity inference problems, including existing person re-identification scenarios as well as more difficult tasks involving very many test images. The key aspect of our approach is to constrain the identity labelling process through local similarity constraints between all available images.
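
The idea of constraining labels through local similarity can be sketched with a toy CRF-style energy. This is not the authors' model: the unary cost (distance to the nearest labelled image), the pairwise term (a similarity-weighted penalty on disagreeing neighbours in a k-nearest-neighbour graph) and the use of Iterated Conditional Modes (ICM) for approximate inference are all assumptions chosen for brevity.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def infer_identities(features, known, k=2, weight=1.0, sweeps=10):
    """Label every image by approximately minimising a CRF-style energy with ICM.

    features: list of descriptor vectors, one per image.
    known:    dict {image index: identity} for the few labelled images.
    """
    n = len(features)
    labels = sorted(set(known.values()))
    # Unary cost: distance to the closest labelled image of each identity.
    unary = [
        {l: min(dist(features[i], features[j])
                for j, lj in known.items() if lj == l)
         for l in labels}
        for i in range(n)
    ]
    # Pairwise structure: symmetric k-nearest-neighbour graph over all images.
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: dist(features[i], features[j]))[:k]
        for j in nearest:
            neighbours[i].add(j)
            neighbours[j].add(i)
    # Initialise from the unary term; labelled images stay clamped.
    y = [known.get(i, min(unary[i], key=unary[i].get)) for i in range(n)]
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            if i in known:
                continue

            def local_energy(l):
                # Similar neighbours that disagree with label l are penalised.
                disagreement = sum(
                    math.exp(-dist(features[i], features[j]))
                    for j in neighbours[i] if y[j] != l)
                return unary[i][l] + weight * disagreement

            best = min(labels, key=local_energy)
            if best != y[i]:
                y[i], changed = best, True
        if not changed:
            break
    return y

# Two labelled images (indices 0 and 5) and four unlabelled ones.
features = [[0.0], [0.4], [0.8], [3.0], [3.4], [3.8]]
result = infer_identities(features, known={0: "A", 5: "B"})
print(result)
```

The sketch covers the "few labelled, many unlabelled" case directly: only two images carry labels, and the remaining identities are inferred jointly.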

An Evaluation of Nearest-Neighbor Methods for Tag Refinement

The success of media sharing and social networks has led to the availability of extremely large quantities of images that are tagged by users. The need for methods that manage the combination of media and metadata efficiently and effectively poses significant challenges. In particular, automatic annotation of social images has become an important research topic for the multimedia community.

Detected tags in an image using Nearest-Neighbor Methods for Tag Refinement

We propose and thoroughly evaluate the use of nearest-neighbor methods for tag refinement. We performed an extensive and rigorous evaluation on two standard large-scale datasets to show that the performance of these methods is comparable with that of more complex and computationally intensive approaches. Unlike these latter approaches, nearest-neighbor methods can be applied to ‘web-scale’ data.
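
As a rough illustration of the nearest-neighbor idea, the sketch below scores candidate tags by counting votes from an image's visual neighbours and subtracting the votes expected from each tag's overall frequency. It is a generic neighbour-voting scheme, not necessarily one of the methods evaluated in the paper; the function name, the precomputed neighbour list and the toy data are assumptions.

```python
from collections import Counter

def refine_tags(neighbour_ids, tags_of, tag_prior, top_n=3):
    """Rank candidate tags for a query image.

    neighbour_ids: ids of the k visually nearest images (assumed precomputed,
                   e.g. from a similarity matrix).
    tags_of:       dict image id -> set of user tags.
    tag_prior:     dict tag -> fraction of the collection carrying that tag.
    """
    k = len(neighbour_ids)
    votes = Counter(t for n in neighbour_ids for t in tags_of[n])
    # Subtract the votes a tag would get by chance, so frequent but
    # uninformative tags are pushed down.
    scores = {t: votes[t] - k * tag_prior.get(t, 0.0) for t in votes}
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

tags_of = {
    1: {"beach", "sea", "2010"},
    2: {"sea", "sky"},
    3: {"beach", "sea"},
}
tag_prior = {"2010": 0.5, "sea": 0.1, "beach": 0.05, "sky": 0.1}
refined = refine_tags([1, 2, 3], tags_of, tag_prior)
print(refined)
```

Because the only per-query work is a neighbour lookup and a vote count, this kind of method stays cheap as the collection grows.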

Here we make available the code and the metadata for NUS-WIDE-240K.

  • ICME13 Code (~8.5 GB, code + similarity matrices)
  • NUS-WIDE-240K dataset metadata (JSON format, about 25 MB). A subset of 238,251 images from NUS-WIDE-270K that we retrieved from Flickr together with user data. Note that NUS is now releasing the full image set subject to an agreement and disclaimer form.

If you use this data, please cite the paper as follows:

  author       = "Uricchio, Tiberio and Ballan, Lamberto and Bertini, 
                  Marco and Del Bimbo, Alberto",
  title        = "An evaluation of nearest-neighbor methods for tag refinement",
  booktitle    = "Proc. of IEEE International Conference on Multimedia \& Expo (ICME)",
  month        = "jul",
  year         = "2013",
  address      = "San Jose, CA, USA",
  url          = ""

2D/3D Face Recognition

In this project, started in collaboration with the IRIS Computer Vision lab, University of Southern California, we address the problem of 2D/3D face recognition with a gallery containing 3D models of enrolled subjects and a probe set composed only of 2D imagery with pose variations. The gallery holds a raw 3D model for each person; each model comprises the facial shape as a 3D mesh and a 2D component as a texture registered with the shape. The probe set, on the other hand, is assumed to contain only 2D images.

2D/3D face recognition dataset

Facial shape as a 3D mesh and a 2D component as a texture registered with the shape

Defined this way, the scenario is an ill-posed problem, given the gap between the kind of information present in the gallery and that available in the probe set.

In the experiments we evaluate both the 3D shape estimated from multiple 2D images and the face recognition pipeline, considering a range of facial poses in the probe set of up to ±45 degrees.

A future direction is to investigate a method that fuses 3D face modeling with the face recognition technique developed to account for pose variations.

Recognition results

Results: baseline vs. our approach

This work was conducted by Iacopo Masi during his internship in 2012/2013 at the IRIS Computer Vision lab, University of Southern California.

USC University of Southern California

FaceHugger: The ALIEN Tracker Applied to Faces

The ALIEN visual tracker is a generic visual object tracker achieving state-of-the-art performance. The object is selected at run-time by drawing a bounding box around it; its appearance is then learned and tracked as time progresses.

The ALIEN tracker has been shown to outperform other competitive trackers, especially in the case of long-term tracking, large amounts of camera blur, low frame rate videos and severe occlusions, including full object disappearance.

FaceHugger: alien vs. predator

The scientific paper introducing the technology behind the tracker will appear at the 12th European Conference on Computer Vision 2012 (ECCV 2012) under the following title: FaceHugger: The ALIEN Tracker Applied to Faces. In Proceedings of the European Conference on Computer Vision (ECCV), DEMO Session, 2012, Florence, Italy.

A real-time demo of the released application will also be given during the conference.

Application Demo: here we are releasing the real-time demo software that will be presented and demonstrated at the conference. Currently the software only works under 64-bit Microsoft Windows. The demo has been developed using OpenCV and Matlab and deployed as a self-installing package. The self-installer installs the MCR (Matlab Compiler Runtime) and copies some OpenCV .dll files and the application executable.

Note: There is no need to install OpenCV or Matlab; the self-installing package provides all the files necessary to run the tracker as a standalone application.

  1. Double-click the file AlienTracker_pkg.exe. A command window will appear and the package will extract its files into the directory where you downloaded AlienTracker_pkg.exe. The MCR (Matlab Compiler Runtime) installation wizard will then start with the language window.
  2. Once the MCR installation is complete, double-click AlienTracker.exe. It might take some time (about 4-5 seconds) before execution actually starts.
  3. Using the mouse, select the area of the object to be tracked and then press Enter.

How to get the best performance: try to avoid including object background inside the selected bounding box:

FaceHugger: how to get the best performance step 1

It is not important to include the whole object; some parts may be left out of the bounding box:

FaceHugger: how to get the best performance step 2

Provide a reasonably sized bounding box. Small bounding boxes do not provide the visual information necessary for good tracking:

FaceHugger: how to get the best performance step 3

Current release limits:

  • Only Windows 7 64-bit platforms are supported.
  • The application only supports the first installed webcam device.
  • Images are resized to 320×240.
  • Videos cannot be processed.
  • The tracked trajectory data cannot be exported.
  • The application interface is very basic.
  • Only SIFT features are currently available. More recent and faster features may be used (SURF, BRIEF, BRISK etc.).

Future releases will correct these limitations. Feel free to provide feedback or ask any question by email or social media.

Continuous Recovery for real time PTZ localization and mapping

We propose a method for real-time recovery from tracking failure in monocular localization and mapping with a Pan Tilt Zoom (PTZ) camera. The method automatically detects and seamlessly recovers from tracking failure while preserving map integrity.

By extending recent advances in PTZ localization and mapping, the system can quickly and continuously recover from tracking failures by determining the best way to task two different localization modalities.

Continuous Recovery for Real Time Pan Tilt Zoom Localization and Mapping demo

The trade-off involved when choosing between the two modalities is captured by maximizing the information expected to be extracted from the scene map.

This is especially helpful in four main viewing conditions: blurred frames, weakly textured scenes, an out-of-date map, and occlusions due to sensor quantization or moving objects. Extensive tests show that the resulting system is able to recover from several different failures while zooming in on weakly textured scenes, all in real time.

Dataset: we provide the four sequences (Festival, Exhibition, Lab, Backyard) used for testing the recovery module in our AVSS 2011 publication, including the map, the nearest-neighbour (NN) keyframes of the map, the calibration results (focal length and image-to-world homography) and a total of 2,376 annotated frames. The annotations are ground-truth feet positions and head locations, used to decide whether the calibration is correct. Annotations are provided as MATLAB workspace files. Data was recorded using an Axis Q6032-E PTZ camera and a Sony SNC-RZ30 at a resolution of 320×240 pixels and a frame rate of about 10 FPS. Dataset download.


  • NN keyframes are described in a txt file where the first number is the id of the frame and the following string is the id (the filename of the image in the map dir) of the corresponding NN keyframe, written as #frame = keyframe id. Note that the file stores only the frame numbers at which a keyframe switch occurs.
  • Calibration is provided as a CSV file using the notation [#frames, h11,h12,h13,…., h31,h32,h33, focal length], where hij is the entry in the i-th row and j-th column of the homography.
    • A MATLAB script is provided to superimpose the ground plane on the current image (plotGrid.m).
    • The homography h11..h33 is the image-to-world homography, which maps pixels into meters.
  • The ground truth is stored in “ground-truth.mat” and consists of a cell array in which each item holds a feet position and a head position.
  • Each sequence contains a main MATLAB script, plotGrid.m, that plots the ground-truth annotations and superimposes the ground plane on the image. ScaleView.m is the script that exploits the calibration to predict the head location.
  • Note that we have obfuscated most of the faces to keep anonymity.
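
For readers not using MATLAB, the calibration rows can be read with a few lines of Python. This is a sketch under stated assumptions: comma-separated values with no header row, the nine homography entries stored row by row (h11, h12, h13, h21, ...), and a matrix that is applied to pixel coordinates to obtain ground-plane meters. The sample row is made up and does not come from the dataset.

```python
import csv
import io

def parse_calibration(csv_text):
    """Parse rows of [frame, h11..h33, focal length] into {frame: (H, focal)}."""
    calib = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        values = [float(v) for v in row]
        h = values[1:10]  # nine entries, assumed row-major
        H = [h[0:3], h[3:6], h[6:9]]
        calib[int(values[0])] = (H, values[10])
    return calib

def image_to_world(H, u, v):
    """Apply a 3x3 homography to pixel (u, v) using homogeneous coordinates."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w

# Hypothetical row: frame 1, a scale-and-shift homography, focal length 500.
sample = "1,0.5,0,-80,0,0.5,-60,0,0,1,500\n"
H, focal = parse_calibration(sample)[1]
print(image_to_world(H, 200, 100), focal)
```

If the files turn out to use a different delimiter or entry order, only `parse_calibration` needs to change.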

TANGerINE Tales. Multi-role digital storymaking natural interface

TANGerINE Tales is a solution for multi-role digital storymaking based on the TANGerINE platform. The goal is to create a digital interactive system for children that stimulates collaboration between users. The results concern educational psychology in terms of respect for roles and the development of literacy and narrative skills.

Tangerine Tales

Testing Tangerine Tales

TANGerINE Tales lets children create and tell stories combining landscapes and characters chosen by themselves. Initially, children select the elements that will be part of the game and explore the environment within which they will create their own story. After that they have the chance to record their voice and the dynamics of the game. Finally, they are able to replay the self-made story on the interactive table.

The interaction between the system and users is performed through the tangible interface TANGerINE, consisting of two smart cubes (one for each child) and an interactive table. Users interact with the system through the manipulation of cubes that send data to the computer via a Bluetooth connection.

The main assumption is that the interaction takes place through the collaboration between two children who have different roles: one of them will actively interact to control the actions of the main character of the story, while the other will control the environmental events in response to the movements and actions of the character.

The target users of TANGerINE Tales are 7-8 year olds attending the third year of elementary school. This choice followed research studies on psychological methods for collaborative learning, on Human Computer Interaction and on tangible interfaces; we exploited guidelines for learning supported by technological tools (computers, cell phones, tablet PCs, etc.) and guidelines drawn from storytelling projects for children.

You can see pictures of the interface on the MICC Flickr account!

Scale Invariant 3D Multi-Person Tracking with a PTZ camera

This research aims to realize a video surveillance system for real-time 3D tracking of multiple people moving over an extended area, as seen from a rotating and zooming camera. The proposed method exploits multi-view image matching techniques to dynamically calibrate the camera and track many ground targets simultaneously, by slewing the video sensor from target to target and zooming in and out as necessary.

The image-to-world relation obtained with dynamic calibration is further exploited to infer scale from the focal length, and to achieve robust tracking with scale-invariant template matching and joint data-association techniques. We achieve an almost constant standard deviation error of less than 0.3 meters in recovering the 3D trajectories of multiple moving targets in an area of 70×15 meters.
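
The link between focal length and target scale can be sketched with the basic pinhole relation: an object of height H at distance Z appears roughly f·H/Z pixels tall when the focal length f is expressed in pixels. This is only a simplified approximation and is not claimed to be the project's actual scale inference; the function names and numbers are illustrative assumptions.

```python
def apparent_height_px(focal_px, height_m, distance_m):
    """Approximate image height (pixels) of an upright target, pinhole model."""
    return focal_px * height_m / distance_m

def template_scale(focal_now_px, dist_now_m, focal_ref_px, dist_ref_m):
    """Factor by which a template captured at the reference zoom and distance
    should be resized to match the current zoom and distance."""
    return (focal_now_px / dist_now_m) / (focal_ref_px / dist_ref_m)

# A 1.8 m person at 30 m, before and after zooming in (hypothetical values).
wide = apparent_height_px(1000.0, 1.8, 30.0)
zoomed = apparent_height_px(2500.0, 1.8, 30.0)
scale = template_scale(2500.0, 30.0, 1000.0, 30.0)
print(wide, zoomed, scale)
```

Knowing the expected scale in advance means the template matcher does not have to search over scales when the camera zooms.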

This general framework will support the future development of a sensor resource manager component that schedules camera pan, tilt and zoom and supports kinematic tracking, association of multiple target tracks, scene context modeling, confirmatory identification and collateral damage avoidance, and more generally will enhance multiple target tracking in PTZ camera networks.

Optimal face detection and tracking

The project’s goal is to develop a reliable face detector and tracker for indoor video surveillance. The problem we have been asked to address is to provide good quality face images of people entering restricted areas. Those images are used for face recognition, and the face recognition system provides feedback stating whether the person has been recognized or not. The nature of the problem makes it very important to keep tracking the person for as long as he is visible on the image plane, even if he has already been recognized. This is needed to prevent the system from raising repeated, multiple alarms for the same person.

In other words, what we aim to obtain is:

  • a reliable detector that can be used to start the tracker: the detector must be sensitive enough to start the tracker as soon as possible when an intruder enters the supervised environment;
  • an efficient and robust tracker able to track the intruder without losing him until he leaves the supervised environment: as stated before, it is important to avoid repeated, multiple alarms being generated from the same track, both to reduce computational cost and to reduce false positives;
  • a fast and reliable face detector to extract face images from the tracked person: the face detector must be reliable in order to provide ‘good’ face images of the target; what ‘good’ stands for depends on the face recognition system, but usually it means that the image has to be at the highest achievable resolution and well focused, and that the face has to be as frontal as possible;
  • a method to assess whether the tracker has lost the target or is tracking well (a ‘stop criterion’): it is important to detect situations in which the tracker has lost the target, because some special action may then be required.

At this time, we use a face detector based on the Viola-Jones algorithm to initialize a particle filter-based tracker that uses a histogram-based appearance model. The accuracy of the particle filter is greatly improved by the strong measurements provided by the face detector.

To provide a reasonably small number of face images to the face recognition system, a method to evaluate the quality of the captured images is needed. We take image resolution and symmetry into account in order to store, for each detected person, only those images that offer increasing quality.
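
A minimal sketch of such a quality filter follows. The exact measure used in the project is not given here, so the score below (an equal-weight mix of pixel count and left-right mirror similarity on a grayscale patch) and all names are assumptions.

```python
def symmetry_score(face):
    """1.0 for a perfectly left-right symmetric grayscale patch, lower otherwise.

    face is a list of rows of 0-255 intensities.
    """
    height, width = len(face), len(face[0])
    pairs = height * (width // 2)
    if pairs == 0:
        return 0.0
    diff = sum(abs(row[x] - row[width - 1 - x])
               for row in face for x in range(width // 2))
    return 1.0 - diff / (255.0 * pairs)

def quality(face, max_pixels=100 * 100):
    """Score in [0, 1] mixing resolution and symmetry (equal weights assumed)."""
    resolution = min(1.0, len(face) * len(face[0]) / max_pixels)
    return 0.5 * resolution + 0.5 * symmetry_score(face)

def keep_improving(faces):
    """Keep only faces whose quality exceeds that of every earlier face."""
    kept, best = [], -1.0
    for face in faces:
        q = quality(face)
        if q > best:
            kept.append(face)
            best = q
    return kept

faces = [
    [[0, 255], [0, 255]],              # tiny and asymmetric
    [[10, 10], [20, 20]],              # tiny but symmetric
    [[0, 100], [0, 100]],              # worse than the previous one
    [[5, 9, 9, 5]] * 4,                # larger and symmetric
]
kept = keep_improving(faces)
print(len(kept))
```

Storing only strictly improving images keeps the per-person log short while still ending with the best face seen so far.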

Below are a few sample videos with the face sequences grabbed from each of them. The faces are ordered by the system according to their quality (increasing from left to right).

Once faces are tracked, it is easy to build a face obfuscation application, though its requirements may differ slightly from those of face logging. The following video shows an example:

Particle filter-based visual tracking

The project’s goal is to develop a computationally efficient, robust, real-time particle filter-based visual tracker. In particular, we aim to increase the robustness of the tracker when it is used in conjunction with weak (but computationally efficient) appearance models, such as color histograms. To achieve this goal, we have proposed an adaptive parameter estimation method that estimates the statistical parameters of the particle filter on-line, so that the uncertainty in the filter can be increased or reduced depending on a measure of its performance (tracking quality).
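
The adaptation idea can be illustrated on a one-dimensional toy problem. This is not the proposed estimation method: the multiplicative update rule, the thresholds and the Gaussian likelihood (standing in for a histogram similarity) are assumptions chosen to keep the sketch short.

```python
import math
import random

def adapt_noise(sigma, quality, low=0.3, high=0.7, sigma_min=1.0, sigma_max=20.0):
    """Widen the motion noise when tracking quality is poor, narrow it when good."""
    if quality < low:
        sigma *= 1.5
    elif quality > high:
        sigma *= 0.8
    return max(sigma_min, min(sigma_max, sigma))

def track(measurements, n=200, sigma=2.0, meas_sigma=3.0, seed=0):
    """Bootstrap particle filter on scalar positions with adaptive motion noise."""
    rng = random.Random(seed)
    particles = [measurements[0] + rng.gauss(0, sigma) for _ in range(n)]
    estimates, sigmas = [], []
    for z in measurements:
        # Predict: random-walk motion model with the current noise level.
        particles = [p + rng.gauss(0, sigma) for p in particles]
        # Update: weight each particle by how well it explains the measurement.
        weights = [math.exp(-0.5 * ((z - p) / meas_sigma) ** 2) for p in particles]
        total = sum(weights)
        if total == 0.0:  # every particle is far from the target
            weights, total = [1.0] * n, float(n)
        estimate = sum(w * p for w, p in zip(weights, particles)) / total
        # Tracking quality in [0, 1]: likelihood of the estimate itself.
        quality = math.exp(-0.5 * ((z - estimate) / meas_sigma) ** 2)
        sigma = adapt_noise(sigma, quality)
        # Resample in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n)
        estimates.append(estimate)
        sigmas.append(sigma)
    return estimates, sigmas

# The target sits at 0, then jumps abruptly to 30 (erratic motion).
measurements = [0.0] * 10 + [30.0] * 15
estimates, sigmas = track(measurements)
print(round(estimates[-1], 1), max(sigmas))
```

While the target is steady the noise shrinks toward its minimum; after the jump the quality collapses, the noise grows, and the particle cloud spreads until it covers the target again.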

The method has proved effective in dramatically increasing the robustness of a particle filter-based tracker in situations that are usually critical for visual tracking, such as in the presence of occlusions and highly erratic motion.

The data set we used is now available for download, with ground truth data, so that others can test their trackers on it and compare performance.

It is made of 10 video sequences showing a remote controlled toy car (Ferrari F40) filmed from two different points of view: ground floor or ceiling. The sequences are provided in MJPEG format, together with text files (one per sequence) containing ground truth data (position and size of the target’s bounding box) for each frame. Below you can see an example of the ground truth provided with our data set (sequence #10):

We have tested the performance of the resulting tracker on the sequences of our data set comparing the segmentation provided by the tracker with the ground truth data. Quantitative measures of this performance are reported in the literature. Below we show a few videos that demonstrate the tracker capabilities.
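
When comparing a tracker against this kind of bounding-box ground truth, one common choice is the intersection-over-union overlap per frame. The sketch below assumes boxes given as (x, y, width, height) with (x, y) the top-left corner; both the box layout and the 0.5 threshold are assumptions, and this is not necessarily the measure used in our evaluation.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rate(tracked, truth, threshold=0.5):
    """Fraction of frames whose overlap with the ground truth reaches the threshold."""
    overlaps = [iou(t, g) for t, g in zip(tracked, truth)]
    return sum(o >= threshold for o in overlaps) / len(overlaps)

# Three frames: perfect overlap, partial overlap, target lost.
truth = [(10, 10, 20, 20), (12, 10, 20, 20), (14, 10, 20, 20)]
tracked = [(10, 10, 20, 20), (22, 10, 20, 20), (40, 40, 20, 20)]
print([round(iou(t, g), 3) for t, g in zip(tracked, truth)])
print(success_rate(tracked, truth))
```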

This is an example of tracking on sequence #9 of the data set:

An example of tracking humans outdoors with a PTZ camera. In this video (not in the data set) the camera was steered by the tracker. It is thus an example of active tracking, and it shows that the method can be applied to PTZ cameras, since it does not use any background modeling technique:

IM3I: immersive multimedia interfaces

The IM3I project addresses the needs of a new generation of media and communication industry that has to confront itself not only with changing technologies, but also with the radical change in media consumption behaviour. IM3I will enable new ways of accessing and presenting media content to users, and new ways for users to interact with services, offering a natural and transparent way to deal with the complexities of interaction, while hiding them from the user.

Daphnis: IM3I multimedia content based retrieval interface

Daphnis: IM3I multimedia content based retrieval interface

With the explosion in the volume of digital content being generated, there is a pressing need for highly customisable interfaces tailored according to both user profiles and specific types of search. IM3I aims to provide the creative media sector with new ways of searching, summarising and visualising large multimedia archives. IM3I will provide a service-oriented architecture that allows multiple viewpoints on the multimedia data available in a repository, and provides better ways to interact with and share rich media. This paves the way for a multimedia information management platform which is more flexible, adaptable and customisable than current repository software. This in turn enables new opportunities for content owners to exploit their digital assets.

Andromeda demo at ACM Multimedia 2010 International Conference, Florence, Italy, October 25-29, 2010

Above all, being designed according to an SOA paradigm, IM3I will also define an enabling technology capable of integrating into existing networks, which will support organisations and users in developing their content-related services.

Project website: