
Media Integration and Communication Centre projects

LIT: Lexicon of the Italian Television

LIT (Lexicon of the Italian Television) is a project conceived by the Accademia della Crusca, the leading research institution on the Italian language, in collaboration with CLIEO (Center for theoretical and historical Linguistics: Italian, European and Oriental languages), with the aim of studying the frequency of the Italian lexicon used in television broadcasts; it targets the specific sector of web applications for linguistic research. The corpus of transcriptions consists of approximately 170 hours of randomly sampled television recordings transmitted by the national broadcaster RAI (Italian Radio Television) during the year 2006.

LIT: Lexicon of the Italian Television

The principal outcome of the project is the design and implementation of an interactive system which combines a web-based video transcription and annotation tool, a full-featured search engine, and a web application, integrated with video streaming, for data visualization and text-video syncing.

The project presents two different interfaces: a search engine, based on classical textual input forms, and a multimedia interface, used both for data visualization and annotation. Annotation functionalities are activated after user authentication. The system relies on a web application backend that handles the transcriptions and provides the necessary indexing and search functions.

The browsing interface shows the video collection present in the model. Users can select a video, play it immediately, and read the associated metadata and speech transcription in sync. Each record in the list of videos provides a link to the raw annotation in XML-TEI format, a standard developed by the TEI (Text Encoding Initiative) Consortium. The annotation can be opened directly inside the browser and saved to the local system. Subtitles are displayed at the bottom of the video, while segments in the transcription area are automatically highlighted during playback and the metadata are updated accordingly. Once the text-speech alignment has been completed through the annotation activities, users can select a unit of text inside the transcription area and the video cue point is aligned accordingly; conversely, moving the playhead over an annotated video segment highlights the corresponding segment of text.
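
By way of illustration, the two-way lookup behind this synchronization can be sketched in a few lines, assuming each transcription segment carries explicit start and end times in seconds (the data layout below is purely illustrative, not the actual LIT/XML-TEI schema):

    import bisect

    # Illustrative segment list: (start_s, end_s, text); times and layout are
    # assumptions, not the actual LIT/XML-TEI schema.
    segments = [
        (0.0, 4.2, "Buonasera a tutti."),
        (4.2, 9.8, "Iniziamo con le notizie del giorno."),
        (9.8, 15.0, "Passiamo ora allo sport."),
    ]
    starts = [s for s, _, _ in segments]

    def segment_at(t):
        """Return the transcription segment active at playback time t (seconds)."""
        i = bisect.bisect_right(starts, t) - 1
        if i >= 0 and segments[i][0] <= t < segments[i][1]:
            return segments[i]
        return None

    def cue_point_of(segment):
        """Return the video cue point (start time) of a selected text segment."""
        return segment[0]

    # Video -> text: highlight the segment under the playhead.
    print(segment_at(5.0))            # (4.2, 9.8, 'Iniziamo con le notizie del giorno.')
    # Text -> video: seek the player to the segment's start.
    print(cue_point_of(segments[2]))  # 9.8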

The annotation interface is accessed by transcriptionists after authentication and allows them to associate the transcription with the corresponding video sequences. Annotators can set the cue points of speech on the video sequences using the tools provided by the graphical user interface and assign them an annotation without prior knowledge of the underlying format. The tool provides functionalities for the definition of metadata at different levels, or multiple “layers”: features can be assigned to the document as a whole, to individual transmissions, to speakers in the transmissions and to each single segment of the transcription.

The search interface is based on standard text input fields. It provides a JSP frontend to the search functions defined for the Java engine, mapping the HTML form fields onto the Lucene query syntax. The interface recalls a common ‘advanced search’ form, providing all the Boolean combinations usually present in search engines and, for this reason, making users comfortable with the basic features. Notably, some less common features appear among the fields (a sketch of the corresponding query syntax follows the list below), such as:

  • the ‘free sequence’ field, with options for defining it as exact, ordered or unordered;
  • the ‘distance’ parameter, where free sequences can appear within specified ranges inside a single utterance;
  • the ‘date range’ parameter.
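
For illustration, the Lucene query syntax mentioned above expresses these fields quite directly; the snippets below use hypothetical field names rather than the actual LIT index schema:

    # Hypothetical field names; shown only to illustrate the Lucene query syntax
    # behind the form fields described above.

    # 'free sequence', exact and ordered: a phrase query.
    q_exact = 'text:"buona sera a tutti"'

    # 'free sequence' with a 'distance' parameter: a proximity ('slop') query,
    # requiring the terms to appear within 5 positions of each other.
    q_distance = 'text:"governo fiducia"~5'

    # Boolean combination, as in a classic 'advanced search' form.
    q_boolean = 'text:(calcio AND nazionale NOT pallavolo)'

    # 'date range' parameter over the broadcast date.
    q_dates = 'date:[20060101 TO 20061231]'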

Advanced search features are shown inside dedicated panels which can be expanded if necessary. These panels give all the options for specifying the constraints of a query, as defined for the XML-TEI custom fields used in LIT. The extended parameters, whose translation into query clauses is sketched after this list, allow users to:

  • set the case sensitiveness of a query;
  • perform a word-root expansion of the wildcard characters present in the query;
  • set the constraint for specific categories defined in the taxonomy;
  • select specific parameters for utterances, such as type of speech (improvisation, programmed, executed), speech technique (on scene, voice-over), type of communication (monologue, dialogue), speaker gender and speaker type (professional, non-professional).
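
A possible way to translate such extended parameters into additional fielded clauses of the Lucene query is sketched below; field and value names are again hypothetical:

    def build_query(base, filters):
        """Join a base Lucene query with fielded constraint clauses using AND.
        Field names are illustrative placeholders, not the LIT schema."""
        clauses = [base] + [f'{field}:{value}' for field, value in filters.items()]
        return ' AND '.join(clauses)

    q = build_query(
        'text:"buona sera"',
        {
            'speech_type': 'improvisation',   # improvisation / programmed / executed
            'technique': 'voice-over',        # on scene / voice-over
            'communication': 'dialogue',      # monologue / dialogue
            'speaker_gender': 'F',
            'speaker_type': 'professional',
        },
    )
    print(q)
    # text:"buona sera" AND speech_type:improvisation AND technique:voice-over AND ...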

The system contains 168 hours of RAI (Italian Radio Television) broadcasts aired during the year 2006. The annotation work was done by researchers of the Accademia della Crusca while LIT was under development, in late 2009. The database stores approximately 20,000 utterances, and using Lucene for search and retrieval raises no performance issues at this scale.

The system is currently being deployed as a module of the larger nationally funded research project FIRB 2009 VIVIT (Fondo di Investimento per la Ricerca di Base, Vivi l’Italiano), which will integrate the tools and the obtained annotations within a semantic web infrastructure.

Scale Invariant 3D Multi-Person Tracking with a PTZ camera

This research aims to realize a video-surveillance system for real-time 3D tracking of multiple people moving over an extended area, as seen from a rotating and zooming camera. The proposed method exploits multi-view image matching techniques to obtain dynamic calibration of the camera and to track many ground targets simultaneously, slewing the video sensor from target to target and zooming in and out as necessary.

Scale Invariant 3D Multi-Person Tracking with a PTZ camera

The image-to-world relation obtained with dynamic calibration is further exploited to perform scale inference from the focal length value, and to achieve robust tracking with scale-invariant template matching and joint data-association techniques. We achieve an almost constant standard deviation error of less than 0.3 meters in recovering the 3D trajectories of multiple moving targets over an area of 70×15 meters.
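
The scale inference step can be illustrated with a simple pinhole-camera sketch, in which the expected image size of a target is proportional to the focal length and inversely proportional to its distance; the numbers and function names below are illustrative only:

    def expected_pixel_height(focal_px, target_height_m, distance_m):
        """Pinhole-camera approximation: image height (pixels) of a target of
        known physical height at a given distance from the camera."""
        return focal_px * target_height_m / distance_m

    def template_scale(focal_px, distance_m, ref_focal_px, ref_distance_m):
        """Scale factor to apply to a reference template acquired at
        (ref_focal_px, ref_distance_m) so that it matches the expected
        apparent size at the current zoom and target distance."""
        return (focal_px / ref_focal_px) * (ref_distance_m / distance_m)

    # Example: zooming from a 1000 px to a 2500 px focal length while the target
    # moves from 20 m to 35 m away from the camera.
    print(expected_pixel_height(2500, 1.75, 35.0))   # ~125 px
    print(template_scale(2500, 35.0, 1000, 20.0))    # ~1.43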

This general framework will serve as the basis for the future development of a sensor resource manager component that schedules camera pan, tilt, and zoom to support kinematic tracking, multiple-target track association, scene context modeling, confirmatory identification, and collateral damage avoidance, and, more generally, to enhance multiple-target tracking in PTZ camera networks.

Optimal face detection and tracking

The project’s goal is to develop a reliable face detector and tracker for indoor video surveillance. The problem we have been asked to deal with is to provide good-quality face images of people entering restricted areas. These images are used for face recognition, and the face recognition system provides feedback stating whether or not the person has been recognized. The nature of the problem makes it very important to keep tracking the person as long as he is visible on the image plane, even if he has already been recognized. This is needed to prevent the system from raising repeated alarms for the same person.

Optimal face detection and tracking

In other words, what we aim to obtain is:

  • a reliable detector that can be used to start the tracker: the detector must be sensitive enough to start the tracker as soon as possible when an intruder enters the supervised environment;
  • an efficient and robust tracker able to follow the intruder without losing him until he leaves the supervised environment: as stated before, it is important to prevent repeated alarms from being generated for the same track, both to reduce computational cost and to reduce false positives;
  • a fast and reliable face detector to extract face images from the tracked person: the face detector must be reliable in order to provide ‘good’ face images of the target; what “good” means depends on the face recognition system, but usually the image has to be at the highest achievable resolution and well focused, and the face has to be as frontal as possible;
  • a method to assess whether the tracker has lost the target or is still tracking it correctly (a ‘stop criterion’): it is important to detect situations in which the tracker has lost the target, because in such situations some special action may be required.

At this time, we use a face detector based on the Viola-Jones algorithm to initialize a particle filter-based tracker that uses a histogram-based appearance model. The accuracy of the particle filter is greatly improved thanks to the strong measurements provided by the face detector.
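
A minimal OpenCV sketch of this initialization step is given below, using the stock Haar cascade shipped with OpenCV; the detector parameters and histogram configuration actually used in the project may differ:

    import cv2

    # Viola-Jones detector (stock OpenCV Haar cascade; illustrative parameters).
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def detect_faces(frame_bgr):
        """Each detected (x, y, w, h) box can be used to start a new track."""
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    def hue_histogram(frame_bgr, box):
        """Histogram-based appearance model: normalized hue histogram of the
        detected face region, used as the particle filter's reference model."""
        x, y, w, h = box
        hsv = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0], None, [32], [0, 180])
        cv2.normalize(hist, hist, alpha=1.0, norm_type=cv2.NORM_L1)
        return hist

    def appearance_distance(model_hist, candidate_hist):
        """Bhattacharyya distance (lower = more similar); 1 - distance can serve
        as the likelihood used to weight the tracker's particles."""
        return cv2.compareHist(model_hist, candidate_hist, cv2.HISTCMP_BHATTACHARYYA)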

To provide a reasonably small number of face images to the face recognition system, a method to evaluate the quality of the captured images is needed. We take into account image resolution and symmetry, storing only those images that improve the quality score for each detected person.
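
A toy version of such a quality measure, combining resolution with left-right symmetry, could look as follows (the actual scoring function of the system is not reproduced here):

    import numpy as np

    def face_quality(face_gray):
        """Hypothetical quality score: pixel count weighted by how symmetric the
        face crop is (1.0 when the mirrored halves are identical)."""
        h, w = face_gray.shape
        half = w // 2
        left = face_gray[:, :half].astype(np.float32)
        right = np.fliplr(face_gray[:, w - half:]).astype(np.float32)
        symmetry = 1.0 - np.mean(np.abs(left - right)) / 255.0
        return float(h * w) * symmetry

    # Keep a face image only if it beats the best score seen so far for its track.
    best = {}
    def maybe_store(track_id, face_gray, store):
        q = face_quality(face_gray)
        if q > best.get(track_id, 0.0):
            best[track_id] = q
            store(track_id, face_gray)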

A few sample videos are reported below, together with the face sequences grabbed from each of them. The faces are ordered by the system according to their quality (increasing from left to right).

On top of face tracking, it is straightforward to build a face obfuscation application, although its requirements may conflict slightly with those of face logging. The following video shows an example:

Particle filter-based visual tracking

The project’s goal is to develop a computationally efficient, robust, real-time particle filter-based visual tracker. In particular, we aim to increase the robustness of the tracker when it is used in conjunction with weak (but computationally efficient) appearance models, such as color histograms. To achieve this goal, we have proposed an adaptive parameter estimation method that estimates the statistical parameters of the particle filter on-line, so that the uncertainty in the filter can be increased or reduced depending on a measure of its performance (tracking quality).
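
The core idea can be sketched as follows, assuming a simple constant-velocity model and an appearance-based tracking-quality measure in [0, 1]; the actual estimator used in the project is more elaborate:

    import numpy as np

    def propagate(particles, velocity, sigma):
        """Diffuse particles with process noise whose spread is adapted on-line."""
        return particles + velocity + np.random.normal(0.0, sigma, particles.shape)

    def adapt_sigma(sigma, quality, sigma_min=2.0, sigma_max=30.0, gain=0.5):
        """Tracking quality in [0, 1] (e.g. 1 - Bhattacharyya distance of the
        estimated state): low quality inflates the process noise so the filter
        can re-acquire the target, high quality shrinks it for precision."""
        target = sigma_min + (1.0 - quality) * (sigma_max - sigma_min)
        return sigma + gain * (target - sigma)

    # One tracking step (weighting and resampling omitted for brevity):
    particles = np.random.normal(100.0, 5.0, size=(200, 2))  # (x, y) hypotheses
    sigma = 5.0
    quality = 0.35                       # e.g. after a partial occlusion
    sigma = adapt_sigma(sigma, quality)  # noise grows -> wider search region
    particles = propagate(particles, velocity=np.array([2.0, 0.0]), sigma=sigma)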

Particle filter based visual tracking

The method has proved to be effective in dramatically increasing the robustness of a particle filter-based tracker in situations that are usually critical for visual tracking, such as in the presence of occlusions and highly erratic motion.

The data set we used is now available for download, together with ground truth data, so that others can test their trackers on it and compare performance.

It is made of 10 video sequences showing a remote-controlled toy car (a Ferrari F40) filmed from two different points of view: ground level or the ceiling. The sequences are provided in MJPEG format, together with text files (one per sequence) containing ground truth data (position and size of the target’s bounding box) for each frame. Below you can see an example of the ground truth provided with our data set (sequence #10):

We have tested the performance of the resulting tracker on the sequences of our data set by comparing the segmentation provided by the tracker with the ground truth data. Quantitative measures of this performance are reported in the literature. Below we show a few videos that demonstrate the tracker’s capabilities.
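
A minimal evaluation sketch is shown below; it assumes one whitespace-separated line per frame with the bounding box given as x y w h, which may differ from the exact layout of the distributed ground truth files, and scores each frame with the standard intersection-over-union overlap:

    def load_ground_truth(path):
        """Parse a ground-truth file, assuming one line per frame: x y w h."""
        boxes = []
        with open(path) as f:
            for line in f:
                x, y, w, h = map(float, line.split()[:4])
                boxes.append((x, y, w, h))
        return boxes

    def iou(a, b):
        """Intersection over union of two (x, y, w, h) boxes."""
        ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
        bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
        iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    def mean_overlap(tracker_boxes, gt_boxes):
        """Average per-frame overlap between tracker output and ground truth."""
        return sum(iou(t, g) for t, g in zip(tracker_boxes, gt_boxes)) / len(gt_boxes)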

This is an example of tracking on sequence #9 of the data set:

An example of tracking humans outdoors with a PTZ camera. In this video (not part of the data set) the camera was steered by the tracker. This is thus active tracking, and it shows that the method can be applied to PTZ cameras, since it does not use any background modeling technique:

IM3I: immersive multimedia interfaces

The IM3I project addresses the needs of a new generation of the media and communication industry, which has to deal not only with changing technologies but also with radical changes in media consumption behaviour. IM3I will enable new ways of accessing and presenting media content to users, and new ways for users to interact with services, offering a natural and transparent way to deal with the complexities of interaction while hiding them from the user.

Daphnis: IM3I multimedia content based retrieval interface

With the explosion in the volume of digital content being generated, there is a pressing need for highly customisable interfaces tailored to both user profiles and specific types of search. IM3I aims to provide the creative media sector with new ways of searching, summarising and visualising large multimedia archives. IM3I will provide a service-oriented architecture that allows multiple viewpoints on the multimedia data available in a repository and provides better ways to interact with and share rich media. This paves the way for a multimedia information management platform which is more flexible, adaptable and customisable than current repository software, which in turn opens new opportunities for content owners to exploit their digital assets.

Andromeda demo at ACM Multimedia 2010 International Conference, Florence, Italy, October 25-29, 2010

Above all, being designed according to a SOA paradigm, IM3I will also define an enabling technology capable of integrating into existing networks, supporting organisations and users in developing their content-related services.

Project website: http://www.im3i.eu/

Vidivideo: improving accessibility of videos

The VidiVideo project takes on the challenge of creating substantially enhanced semantic access to video, implemented in a search engine. The outcome of the project is an audio-visual search engine composed of two parts: an automatic annotation part, which runs off-line and uses detectors for more than 1,000 semantic concepts, collected in a thesaurus, to process and automatically annotate the video; and an interactive part that provides a video search engine for both technical and non-technical users.

Andromeda - Vidivideo graph based video browsing

Video plays a key role in the news, cultural heritage documentaries and surveillance, and it is a natural form of communication for the Internet and mobile devices. The massive increase in digital audio-visual information poses high demands on advanced storage and search engines for consumers and professional archives.

Video search engines are the product of progress in many technologies: visual and audio analysis, machine learning techniques, as well as visualization and interaction. At present, state-of-the-art systems are able to automatically annotate only a limited set of semantic concepts, and retrieval is only possible through a keyword-based approach built on a lexicon.

The automatic annotation part of the system performs audio and video segmentation, speech recognition, speaker clustering and semantic concept detection.

The VidiVideo system has achieved the highest performance in the most important object and concept recognition international contests (PASCAL VOC and TRECVID).

The interactive part provides two applications: a desktop-based and a web-based search engine. The system permits different query modalities (free text, natural language, graphical composition of concepts using Boolean and temporal relations, and query by visual example) and different visualizations, resulting in an advanced tool for the retrieval and exploration of video archives for both technical and non-technical users in different application fields. In addition, the use of ontologies (instead of simple keywords) makes it possible to exploit semantic relations between concepts through reasoning, extending the user queries.
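
As a toy illustration of the kind of query expansion that ontological reasoning enables, a concept can be expanded to its (transitive) subclasses before being matched against the annotations; the hierarchy below is hypothetical and much simpler than the ontology actually used:

    # Hypothetical fragment of a concept hierarchy; the real ontology is much
    # larger and supports richer relations than subclass-of.
    SUBCLASSES = {
        "vehicle": ["car", "truck", "motorbike"],
        "car": ["police_car"],
    }

    def expand(concept):
        """Expand a query concept to itself plus all its transitive subclasses."""
        result = {concept}
        for child in SUBCLASSES.get(concept, []):
            result |= expand(child)
        return result

    print(expand("vehicle"))
    # {'vehicle', 'car', 'police_car', 'truck', 'motorbike'}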

The off-line annotation part has been implemented in C++ on the Linux platform, and takes advantage of the low-cost processing power provided by GPUs on consumer graphics cards.

The web-based system is based on the Rich Internet Application paradigm, using a client-side Flash virtual machine. RIAs can avoid the usual slow and synchronous loop for user interactions. This allows the implementation of a visual querying mechanism that exhibits a look and feel approaching that of a desktop environment, with the fast response that users expect. The search results are returned in RSS 2.0 XML format, while videos are streamed using the RTMP protocol.
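
Because the results are plain RSS 2.0, clients other than the Flash front end can consume them with any standard XML parser; a minimal sketch, using only the standard RSS item fields and a hypothetical endpoint, is given below:

    import urllib.request
    import xml.etree.ElementTree as ET

    def fetch_results(url):
        """Fetch an RSS 2.0 result feed and return (title, link, description)
        for each item; only the standard RSS fields are assumed here."""
        with urllib.request.urlopen(url) as resp:
            root = ET.fromstring(resp.read())
        items = []
        for item in root.findall("./channel/item"):
            items.append((
                item.findtext("title", default=""),
                item.findtext("link", default=""),
                item.findtext("description", default=""),
            ))
        return items

    # Example (hypothetical endpoint):
    # for title, link, desc in fetch_results("http://example.org/search?q=airplane"):
    #     print(title, link)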

Accurate Evaluation of HER-2 Amplification in FISH Images

Fluorescence in situ hybridization (FISH) is a cytogenetic technique used to detect and localize the presence or absence of specific DNA sequences on chromosomes. FISH uses fluorescent probes, each tagged with a different fluorophore, that bind to specific parts of the chromosome. Multi-band fluorescence microscopy then reveals the positions where the probes have bound to the chromosomes, so that information of clinical relevance can be derived from their presence and position.

Accurate Evaluation of HER-2 Amplification in FISH Images

A sample application of this technique is the measurement of the amplification of the HER-2 gene within the chromosomes, which constitutes a valuable indicator of invasive breast carcinomas. This requires applying to a tumor tissue sample fluorescent probes that attach themselves to the HER-2 genes in a process called hybridization. These probes carry a marker that emits light when they bind to the HER-2 genes, making them visible as green spots under a fluorescent microscope. Similarly, a different probe, carrying a marker that makes it visible as an orange spot under a fluorescent microscope, is used to target centromere 17 (CEP-17). Measuring the ratio of HER-2 to CEP-17 dots within each nucleus and then averaging this ratio over a representative number of cells allows the HER-2 amplification to be estimated.

In this research we present a system that supports accurate estimation of the ratio of HER-2 to CEP-17 dots in FISH images of breast tissue samples. Compared to previous work, the system incorporates a model that associates with each segmented nucleus a reliability score estimating the confidence of the measured HER-2/CEP-17 ratio within that nucleus. This enables the ratio to be computed using only nuclei with high reliability scores, yielding a measure of HER-2 amplification that conforms better to the pathologist’s evaluation than the ratio averaged over all available nuclei.
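
The final amplification measure therefore reduces to a reliability-filtered average of per-nucleus ratios, which can be sketched as follows (field names and the reliability threshold are illustrative):

    def her2_amplification(nuclei, reliability_threshold=0.8):
        """Each nucleus is a dict with counts of HER-2 (green) and CEP-17 (orange)
        spots plus a reliability score in [0, 1]; only reliable nuclei contribute
        to the mean HER-2/CEP-17 ratio."""
        ratios = [
            n["her2"] / n["cep17"]
            for n in nuclei
            if n["reliability"] >= reliability_threshold and n["cep17"] > 0
        ]
        return sum(ratios) / len(ratios) if ratios else None

    nuclei = [
        {"her2": 12, "cep17": 2, "reliability": 0.93},
        {"her2": 4,  "cep17": 2, "reliability": 0.55},   # discarded: low reliability
        {"her2": 10, "cep17": 2, "reliability": 0.88},
    ]
    print(her2_amplification(nuclei))  # 5.5; a ratio of 2 or more is commonly read as amplified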

Mobile Robot Path Tracking with uncalibrated cameras

This technology-transfer project addresses the motion control of a wheeled mobile robot (WMR) observed by uncalibrated ceiling cameras. We develop a method that localizes the robot in real time and drives it along a path in a large environment with a pure pursuit controller, achieving a cross-track error of less than 5 pixels. Experiments are reported for Ambrogio, a two-wheel differentially-driven mobile robot provided by Zucchetti Centro Sistemi.
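
For reference, a single pure pursuit step can be sketched as follows, working directly in the pixel coordinates of the (uncalibrated) ceiling camera; the look-ahead distance, speed and data layout are illustrative, not the parameters of the deployed controller:

    import math

    def pure_pursuit(pose, path, lookahead=40.0, v=1.0):
        """One pure pursuit step.
        pose: (x, y, heading) of the robot; path: list of (x, y) waypoints;
        lookahead and coordinates in pixels of the image plane.
        Returns the commanded (linear velocity, angular velocity)."""
        x, y, theta = pose
        # Pick the first waypoint at least `lookahead` pixels away (or the last one).
        goal = path[-1]
        for px, py in path:
            if math.hypot(px - x, py - y) >= lookahead:
                goal = (px, py)
                break
        # Express the goal in the robot frame (lateral offset only is needed).
        dx, dy = goal[0] - x, goal[1] - y
        y_r = -math.sin(theta) * dx + math.cos(theta) * dy
        # Pure pursuit curvature and resulting angular velocity.
        curvature = 2.0 * y_r / (lookahead ** 2)
        return v, v * curvature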

Wheeled Mobile Robot path follower in uncalibrated multiple camera environment

The video below shows the improvements in the motion control of a wheeled mobile robot (WMR) with a controller that uses an osculating circle:

Automatic trademark detection and recognition in sports videos

The availability of measures of the appearance of trademarks and logos in a video is important in the fields of marketing and sponsorship. These statistics can, in fact, be used by sponsors to estimate the number of TV viewers who noticed them and thus to evaluate the effects of the sponsorship. The goal of this project is to create a semi-automatic system for the detection, tracking and recognition of pre-defined brands and trademarks in broadcast television. The number of appearances of a logo, its position, size and duration are recorded to derive indexes and statistics that can be used for marketing analysis.

Automatic trademark detection and recognition in sports videos

To obtain a technique that is sufficiently robust to partial occlusions and deformations, we use local neighborhood descriptors of salient points (SIFT features) as a compact representation of the important aspects and local texture of trademarks. By combining the results of local point-based matching we are able to detect and recognize entire trademarks. Whether a video frame contains a reference trademark is determined by thresholding the normalized match score (the fraction of the trademark’s SIFT points that have been matched to the frame). Finally, we compute a robust estimate of the matched point cloud in order to localize the trademark and approximate its area.
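
A compact OpenCV sketch of this matching and scoring scheme is given below; the ratio-test and detection thresholds are illustrative, not the values used in the project:

    import cv2

    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher(cv2.NORM_L2)

    def trademark_match_score(logo_gray, frame_gray, ratio=0.75):
        """Fraction of the trademark's SIFT keypoints matched in the frame,
        after Lowe's ratio test; also returns the surviving matches."""
        kp_logo, des_logo = sift.detectAndCompute(logo_gray, None)
        kp_frame, des_frame = sift.detectAndCompute(frame_gray, None)
        if des_logo is None or des_frame is None or len(kp_frame) < 2:
            return 0.0, []
        matches = matcher.knnMatch(des_logo, des_frame, k=2)
        good = [m[0] for m in matches
                if len(m) == 2 and m[0].distance < ratio * m[1].distance]
        return len(good) / len(kp_logo), good

    # A frame is declared to contain the trademark when the normalized match score
    # exceeds a threshold; the matched keypoints (the point cloud) can then be used
    # to localize the logo and approximate its area.
    # score, good = trademark_match_score(logo, frame)
    # detected = score > 0.1   # illustrative threshold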

Video event classification using bag-of-words and string kernels

The recognition of events in videos is a relevant and challenging task in automatic semantic video analysis. At present, one of the most successful frameworks used for object recognition tasks is the bag-of-words (BoW) approach; however, it does not model the temporal information of the video stream. We are working on a novel method to introduce temporal information within the BoW approach by modeling a video clip as a sequence of histograms of visual features, computed from each frame using the traditional BoW model.

Video event classification using bag-of-words and string kernels

The sequences are treated as strings where each histogram is considered as a character. Classification of these sequences, whose length varies with the duration of the video clip, is performed using SVM classifiers with a string kernel (e.g. using the Needleman-Wunsch edit distance). Experimental results on two domains, soccer video and TRECVID 2005, demonstrate the validity of the proposed approach.
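
The alignment idea behind such a string kernel can be sketched as a Needleman-Wunsch score in which the ‘characters’ are frame histograms and their similarity is the histogram intersection; this only illustrates the principle, not the exact kernel used in the experiments:

    import numpy as np

    def frame_sim(h1, h2):
        """Histogram intersection between two L1-normalized frame histograms."""
        return float(np.minimum(h1, h2).sum())

    def nw_score(seq_a, seq_b, gap=-0.5):
        """Needleman-Wunsch global alignment score between two clips, each given
        as a sequence of BoW histograms ('strings' whose characters are histograms)."""
        n, m = len(seq_a), len(seq_b)
        dp = np.zeros((n + 1, m + 1))
        dp[:, 0] = gap * np.arange(n + 1)
        dp[0, :] = gap * np.arange(m + 1)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                dp[i, j] = max(
                    dp[i - 1, j - 1] + frame_sim(seq_a[i - 1], seq_b[j - 1]),
                    dp[i - 1, j] + gap,
                    dp[i, j - 1] + gap,
                )
        return dp[n, m]

    # The pairwise scores can be assembled into an SVM kernel matrix, e.g. after
    # normalization K(a, b) / sqrt(K(a, a) * K(b, b)).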