Media Integration and Communication Centre research projects

VidiVideo: improving accessibility of videos

The VidiVideo project takes on the challenge of creating substantially enhanced semantic access to video, implemented in a search engine. The outcome of the project is an audio-visual search engine composed of two parts: an automatic annotation part that runs off-line, in which detectors for more than 1000 semantic concepts, collected in a thesaurus, process and automatically annotate the video, and an interactive part that provides a video search engine for both technical and non-technical users.

Andromeda - VidiVideo graph-based video browsing

Video plays a key role in the news, cultural heritage documentaries and surveillance, and it is a natural form of communication for the Internet and mobile devices. The massive increase in digital audio-visual information poses high demands on advanced storage and search engines for consumers and professional archives.

Video search engines are the product of progress in many technologies: visual and audio analysis, machine learning techniques, as well as visualization and interaction. At present, state-of-the-art systems are able to automatically annotate only a limited set of semantic concepts, and retrieval is possible only through a keyword-based approach built on a lexicon.

The VidiVideo project takes on the challenge of creating substantially enhanced semantic access to video, implemented in a search engine.

The outcome of the project is an audio-visual search engine composed of two parts: an automatic annotation part that runs off-line, in which detectors for more than 1000 semantic concepts, collected in a thesaurus, process and automatically annotate the video, and an interactive part that provides a video search engine for both technical and non-technical users.

The automatic annotation part of the system performs audio and video segmentation, speech recognition, speaker clustering and semantic concept detection.

The VidiVideo system has achieved the highest performance in the most important international object and concept recognition contests (PASCAL VOC and TRECVID).

The interactive part provides two applications: a desktop-based and a web-based search engine. The system permits different query modalities (free text, natural language, graphical composition of concepts using boolean and temporal relations, and query by visual example) and different visualizations, resulting in an advanced tool for the retrieval and exploration of video archives for both technical and non-technical users in different application fields. In addition, the use of ontologies (instead of simple keywords) makes it possible to exploit semantic relations between concepts through reasoning, extending the user's queries.
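
As an illustration of this kind of ontology-driven expansion, the sketch below widens a query concept to all of its sub-concepts with rdflib; the ontology file and the concept URI are hypothetical placeholders, not part of the actual VidiVideo thesaurus.

```python
# Hedged sketch of ontology-based query expansion; the file name and the
# URI are illustrative assumptions, not the real VidiVideo resources.
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

g = Graph()
g.parse("concepts.rdf")  # load the concept ontology (hypothetical file)

query_concept = URIRef("http://example.org/concepts#Vehicle")

# Collect all transitive sub-concepts, so a query for "Vehicle" also
# retrieves shots annotated with more specific concepts such as "Car".
expanded = set(g.transitive_subjects(RDFS.subClassOf, query_concept))
expanded.add(query_concept)

for concept in sorted(expanded):
    print(concept)
```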

The off-line annotation part has been implemented in C++ on the Linux platform, and takes advantage of the low-cost processing power provided by GPUs on consumer graphics cards.

The web-based system follows the Rich Internet Application paradigm, using a client-side Flash virtual machine. RIAs can avoid the usual slow, synchronous loop of user interactions; this makes it possible to implement a visual querying mechanism whose look and feel approaches that of a desktop environment, with the fast response users expect. The search results are returned in RSS 2.0 XML format, while videos are streamed using the RTMP protocol.
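
A minimal sketch of how a client might consume such RSS 2.0 results follows; the endpoint URL and query parameter are assumptions for illustration, since the service interface itself is not documented here.

```python
# Minimal sketch of consuming RSS 2.0 search results; the endpoint and its
# query parameter are hypothetical.
import urllib.request
import xml.etree.ElementTree as ET

url = "http://example.org/vidivideo/search?q=airplane"  # illustrative only
with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# In RSS 2.0, results are <item> elements inside <channel>
for item in tree.getroot().findall("./channel/item"):
    title = item.findtext("title")
    link = item.findtext("link")  # e.g. an rtmp:// stream locator
    print(title, link)
```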

Automatic trademark detection and recognition in sports videos

The availability of measures of the appearance of trademarks and logos in a video is important in the fields of marketing and sponsorship. These statistics can, in fact, be used by sponsors to estimate the number of TV viewers that noticed them, and thus to evaluate the effects of the sponsorship. The goal of this project is to create a semi-automatic system for the detection, tracking and recognition of pre-defined brands and trademarks in broadcast television. The number of appearances of a logo, its position, size and duration will be recorded to derive indexes and statistics that can be used for marketing analysis.

To obtain a technique that is sufficiently robust to partial occlusions and deformations, we use local neighborhood descriptors of salient points (SIFT features) as a compact representation of the important aspects and local texture of trademarks. By combining the results of local point-based matching we are able to detect and recognize entire trademarks. Whether a video frame contains a reference trademark is determined by thresholding the normalized match score, i.e. the ratio of SIFT points of the trademark that have been matched to the frame. Finally, we compute a robust estimate of the matched point cloud in order to localize the trademark and approximate its area.
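
A minimal sketch of this frame-level test, using OpenCV's SIFT implementation, is given below; the ratio-test and score thresholds are illustrative values, not the ones tuned for the project.

```python
import cv2

def trademark_in_frame(logo_gray, frame_gray, ratio=0.75, score_thresh=0.2):
    """Return (decision, score): score is the fraction of the trademark's
    SIFT points that find a good match in the frame."""
    sift = cv2.SIFT_create()
    kp_logo, des_logo = sift.detectAndCompute(logo_gray, None)
    kp_frame, des_frame = sift.detectAndCompute(frame_gray, None)
    if des_logo is None or des_frame is None:
        return False, 0.0
    matches = cv2.BFMatcher().knnMatch(des_logo, des_frame, k=2)
    # Lowe's ratio test: keep a match only if it is clearly better than
    # the second-best candidate
    good = [p for p in matches
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    score = len(good) / len(kp_logo)  # normalized match score
    return score > score_thresh, score
```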

Video event classification using bag-of-words and string kernels

The recognition of events in videos is a relevant and challenging task in automatic semantic video analysis. At present, one of the most successful frameworks, also used for object recognition tasks, is the bag-of-words (BoW) approach. However, it does not model the temporal information of the video stream. We are working on a novel method to introduce temporal information within the BoW approach by modeling a video clip as a sequence of histograms of visual features, computed from each frame using the traditional BoW model.

The sequences are treated as strings in which each histogram is considered a character. Event classification of these sequences, whose size varies with the length of the video clip, is performed using SVM classifiers with a string kernel (e.g. one based on the Needleman-Wunsch edit distance). Experimental results on two domains, soccer video and TRECVID 2005, demonstrate the validity of the proposed approach.
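
The following sketch illustrates the idea under stated assumptions: each clip is a list of per-frame BoW histograms, the substitution cost between two "characters" is a chi-square distance, and the gap penalty is an arbitrary choice; the published method may differ in these details.

```python
import numpy as np

def nw_distance(seq_a, seq_b, gap=1.0):
    """Needleman-Wunsch alignment cost between two clips, each represented
    as a list of per-frame BoW histograms (one 'character' per frame)."""
    def sub_cost(h1, h2):
        # chi-square distance between two normalized histograms
        return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-10))
    n, m = len(seq_a), len(seq_b)
    D = np.zeros((n + 1, m + 1))
    D[:, 0] = gap * np.arange(n + 1)
    D[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = min(D[i - 1, j - 1] + sub_cost(seq_a[i - 1], seq_b[j - 1]),
                          D[i - 1, j] + gap,
                          D[i, j - 1] + gap)
    return D[n, m]

def string_kernel(seq_a, seq_b, gamma=0.1):
    # Exponentiated negative edit distance; note such kernels are not
    # guaranteed to be positive semi-definite, a known caveat for SVMs.
    return np.exp(-gamma * nw_distance(seq_a, seq_b))
```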

Image forensics using SIFT features

In many application scenarios digital images play a basic role, and it is often important to assess whether their content is realistic or has been manipulated to mislead the viewer. Image forensics tools provide answers to such questions. We are working on a novel method that focuses in particular on detecting whether an image has been forged by cloning an area of the image onto another zone, either to duplicate something or to conceal something unwanted.

The proposed approach is based on SIFT features and makes it possible both to establish whether a copy-move attack has occurred and which image points are involved, and, furthermore, to recover the geometric transformation applied to perform the cloning, by computing its parameters. In fact, when a copy-move attack takes place, an affine transformation is usually applied to the selected image patch so that it fits the target position and context. Our experimental results confirm that the technique is able to precisely locate the attack and that the transformation parameter estimation is highly reliable.
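
A sketch of this pipeline with OpenCV follows: SIFT keypoints are matched against the same image, near-identical locations are discarded, and a robust affine transformation is estimated from the surviving matches. The thresholds are illustrative assumptions.

```python
import cv2
import numpy as np

def detect_copy_move(img_gray, ratio=0.6, min_dist=40, min_matches=4):
    sift = cv2.SIFT_create()
    kp, des = sift.detectAndCompute(img_gray, None)
    # match the image against itself; the best neighbour of each point is
    # the point itself, so ask for 3 and use the 2nd and 3rd
    matches = cv2.BFMatcher().knnMatch(des, des, k=3)
    src, dst = [], []
    for triple in matches:
        if len(triple) < 3:
            continue
        m, n = triple[1], triple[2]
        p1 = np.array(kp[m.queryIdx].pt)
        p2 = np.array(kp[m.trainIdx].pt)
        # ratio test + reject matches between nearly coincident keypoints
        if m.distance < ratio * n.distance and np.linalg.norm(p1 - p2) > min_dist:
            src.append(p1)
            dst.append(p2)
    if len(src) < min_matches:
        return None  # no evidence of cloning
    # recover the affine transformation mapping the source patch onto its copy
    M, inliers = cv2.estimateAffine2D(np.float32(src), np.float32(dst),
                                      method=cv2.RANSAC)
    return M  # 2x3 affine matrix (None if estimation fails)
```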

Human action categorization in unconstrained videos

Building a general human activity recognition and classification system is a challenging problem because of the variations in environments, people and actions. Environment variation can be caused by a cluttered or moving background, camera motion, or illumination changes; people may differ in size, shape and posture. Recently, interest-point-based models have been successfully applied to the human action classification problem because they overcome some limitations of holistic models, such as the need to perform background subtraction and tracking. We are working on a novel method based on the visual bag-of-words model and on a new spatio-temporal descriptor.

First, we define a new 3D gradient descriptor that, combined with optic flow, outperforms the state of the art without requiring fine parameter tuning. Second, we show that for spatio-temporal features the popular k-means algorithm is insufficient, because cluster centers are attracted by the denser regions of the sample distribution, providing a non-uniform description of the feature space and thus failing to code other informative regions. We therefore apply a radius-based clustering method and a soft assignment that considers the information of two or more relevant candidates. This approach generates a more effective codebook, resulting in a further improvement in classification performance. We extensively test our approach on the standard KTH and Weizmann action datasets, showing its validity and outperforming other recent approaches.
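
A minimal sketch of the two ingredients, under assumed parameter values, is shown below: a greedy radius-based clustering that caps how densely centers can pack, and a soft assignment that spreads each feature's vote over its nearest candidates.

```python
import numpy as np

def radius_clustering(features, radius):
    """Greedy radius-based clustering: every feature lies within `radius`
    of some center, so dense regions cannot monopolize the codebook."""
    centers = []
    for f in features:
        if all(np.linalg.norm(f - c) > radius for c in centers):
            centers.append(f)
    return np.array(centers)

def soft_assign(feature, centers, n_candidates=2, sigma=1.0):
    """Distribute a feature's vote over its closest visual words instead
    of hard-assigning it to a single one."""
    d = np.linalg.norm(centers - feature, axis=1)
    idx = np.argsort(d)[:n_candidates]
    w = np.exp(-d[idx] ** 2 / (2 * sigma ** 2))
    hist = np.zeros(len(centers))
    hist[idx] = w / w.sum()
    return hist
```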

3D Mesh Partitioning

In this research, a model is proposed for the decomposition of 3D objects based on Reeb-graphs. The model is motivated by perceptual principles and supports the identification of salient object protrusions. Experimental results have demonstrated the effectiveness of the proposed approach with respect to different solutions that have appeared in the literature, and with reference to ground-truth data obtained by manually decomposing 3D objects.

Our solution falls into the semantic-oriented category and is motivated by the need to overcome the limitations of geometry-based solutions, which mainly rely on curvature information alone to perform mesh decomposition. In particular, we propose the use of the Reeb-graph to extract structural and topological information from the mesh surface and to drive the decomposition process. Curvature information is used to refine the boundaries between object parts in accordance with the minima rule.

Thus, object decomposition is achieved by a two-step approach comprising Reeb-graph construction and refinement. In the construction step, topological as well as metric properties of the object surface are used to build the Reeb-graph. Since the Reeb-graph is built on a metric property of the object surface (the average geodesic distance, AGD), its structure captures the object protrusions. In the refinement step, the Reeb-graph undergoes an editing process in which deep concavity and adjacency are used to support the fine localization of part boundaries.
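
As a rough illustration of the metric ingredient, the sketch below approximates the AGD with Dijkstra shortest paths over the mesh edge graph; the input format and the all-pairs computation (practical only for small meshes) are simplifying assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def average_geodesic_distance(vertices, edges):
    """vertices: (n, 3) array; edges: list of (i, j) vertex-index pairs."""
    n = len(vertices)
    i, j = np.array(edges).T
    w = np.linalg.norm(vertices[i] - vertices[j], axis=1)  # edge lengths
    graph = csr_matrix((np.concatenate([w, w]),
                        (np.concatenate([i, j]), np.concatenate([j, i]))),
                       shape=(n, n))
    geo = dijkstra(graph, directed=False)  # all-pairs geodesic approximation
    # AGD is high at protrusion tips and low near the object center; its
    # quantized level sets yield the nodes of the Reeb-graph
    return geo.mean(axis=1)
```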

In doing so, the main goal of our contribution is to provide and experiment with a model that supports the perceptually consistent decomposition of 3D objects, enabling the reuse and retrieval of parts of 3D models archived in large model repositories.

3D Face Recognition

In this research, we present a novel approach to 3D face matching that shows high effectiveness in distinguishing facial differences between distinct individuals from differences induced by non-neutral expressions within the same individual. We present an extensive comparative evaluation of performance on the FRGC v2.0 and SHREC08 datasets.

The approach takes into account the geometrical information of the 3D face and encodes the relevant information into a compact graph-based representation. Nodes of the graph represent equal-width iso-geodesic facial stripes. Arcs between pairs of nodes are labeled with descriptors, referred to as 3D Weighted Walkthroughs (3DWWs), that capture the mutual relative spatial displacement between all the pairs of points of the corresponding stripes. Face partitioning into iso-geodesic stripes and 3DWWs together provide an approximate representation of the local morphology of faces that exhibits smooth variations under changes induced by facial expressions. The graph-based representation permits very efficient matching for face recognition and is also suitable for face identification in very large datasets with the support of appropriate index structures. The method obtained the best ranking at the SHREC 2008 contest for 3D face recognition.
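
A sketch of the stripe partitioning step under simplifying assumptions follows: geodesic distances from a given nose-tip vertex are approximated with Dijkstra on the mesh edge graph and quantized into equal-width stripes; the number of stripes is an illustrative parameter.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def iso_geodesic_stripes(vertices, edges, nose_tip_idx, n_stripes=9):
    """vertices: (n, 3) array; edges: (i, j) pairs; returns a per-vertex
    stripe label, i.e. the nodes of the face graph."""
    n = len(vertices)
    i, j = np.array(edges).T
    w = np.linalg.norm(vertices[i] - vertices[j], axis=1)
    graph = csr_matrix((np.concatenate([w, w]),
                        (np.concatenate([i, j]), np.concatenate([j, i]))),
                       shape=(n, n))
    # geodesic distance of every vertex from the nose tip
    geo = dijkstra(graph, directed=False, indices=nose_tip_idx)
    finite = np.isfinite(geo)
    norm = np.where(finite, geo / geo[finite].max(), 1.0)
    # equal-width quantization of the normalized geodesic distance
    return np.clip((norm * n_stripes).astype(int), 0, n_stripes - 1)
```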

SIFTPose: local pose estimation from a single scale invariant keypoint

The aim of this project is to develop a new method of estimating the poses of imaged scene surfaces, provided that they can be locally approximated by their tangent planes. Our approach performs an accurate direct estimation by exploiting the robustness of the scale-invariant feature transform (SIFT). The results are representative of the state of the art for this challenging task.

Retrieving the poses of keypoints, in addition to matching them, is an essential task in many computer-vision applications, as it transforms unconstrained problems into constrained ones. This project proposes a new method of estimating the poses of regions around keypoints, provided that they can be considered locally planar. While this has previously been attempted by adapting iterative algorithms developed for template matching, no explicit, accurate direct estimation has been introduced before. Our approach simultaneously learns the "nuisance residual" structure present in the detection and description steps of the SIFT algorithm, allowing the local perspective properties of distinctive features to be recovered through a homography. The system is trained using synthetic images generated from a single reference view of the surface.
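
A sketch of such synthetic training-set generation is given below: random homographies, obtained by perturbing the reference image corners, simulate new poses of the locally planar surface. The perturbation range and number of views are illustrative assumptions.

```python
import cv2
import numpy as np

def random_homography(w, h, max_tilt=0.4, rng=np.random):
    # perturb the four corners of the reference image to simulate a new pose
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + rng.uniform(-max_tilt, max_tilt, (4, 2)) * [w, h]
    return cv2.getPerspectiveTransform(src, np.float32(dst))

def synthetic_views(reference, n_views=500, rng=np.random):
    h, w = reference.shape[:2]
    views = []
    for _ in range(n_views):
        H = random_homography(w, h, rng=rng)
        warped = cv2.warpPerspective(reference, H, (w, h))
        views.append((warped, H))  # training pair: image + ground-truth pose
    return views
```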

The method produces an accurate, detailed and fine-grained set of local poses and can also be applied to non-rigid surfaces. In particular, the accuracy and robustness of the method are representative of the state of the art for this challenging task. At present, we are investigating the application of the estimated homographies to building a pose-invariant descriptor for 3D face recognition.

TANGerINE Grape

TANGerINE Grape is a collaborative knowledge-sharing system that can be used through natural and tangible interfaces. The final goal is to enable users to enrich their knowledge by obtaining information both from digital libraries and from the knowledge shared by other users involved in the same interaction session.

TANGerINE Grape is a collaborative, tangible, multi-user interface that allows users to perform semantic-based content retrieval. Multimedia content is organized through knowledge-management structures (i.e. ontologies), and the interface supports multi-user interaction with it through different input devices, in both co-located and remote settings.

TANGerINE Grape enables users to enrich their knowledge by obtaining information both from an automatic informative system and from the knowledge shared by the other users involved: compared to a web-based interface, our system enables collaborative face-to-face interaction alongside standard remote collaboration. Users can interact with the system through different kinds of input devices, whether co-located or remote. In this way users enrich their knowledge even through comparison with the other users involved in the same interaction session: they can share choices, results and comments. Face-to-face collaboration also has a 'social' value: co-located people involved in similar tasks improve their reciprocal personal and professional knowledge in terms of skills, culture, background, interests and so on.

As a use case we initially exploited the VIDI-Video project; then, to provide faster response times and more advanced search possibilities, we adopted the IM3I project, enhancing access to video content through its semantic search engine.

This project has been an important case study in applying natural and tangible interaction research to the access of video content organized in semantic-based structures.

Multi-user environment for semantic search of multimedia contents

This research project exploits new technologies (a multi-touch table and the iPhone) to develop a multi-user, multi-role and multi-modal system for multimedia content search, annotation and organization. As a use case we considered the field of broadcast journalism, where editors and archivists work together to create a film report from archive footage.

The idea behind this work-in-progress project is to create a multi-touch system that allows one or more users to search multimedia content, especially video, exploiting an ontology-based structure for knowledge management. The system supports collaborative multi-role, multi-user and multi-modal interaction between two users performing different tasks within the application.

The first user plays the role of an archivist: by entering a keyword through the iPhone, he is able to search and select data through an ontology-structured interface designed ad hoc for the multi-touch table. At this stage the user can organize the results in folders and subfolders; the iPhone is therefore used as a device for text input and folder storage.

The other user plays the role of an editor: he receives the results of the search carried out by the archivist through the system or the iPhone. This user examines the videos returned by the search and selects those that are most suitable for the final result, estimating how appropriate each video is for his purposes (an assessment for the current work session) and giving an opinion on the general quality of the video (a subjective assessment that can also influence future searches). In addition, the user also plays the role of an annotator: he can add further tags to a video if he considers them necessary to retrieve that content in future searches.