The success of media sharing and social networks has made extremely large quantities of user-tagged images available. Managing the combination of media and metadata efficiently and effectively poses significant challenges. In particular, automatic annotation of social images has become an important research topic for the multimedia community.
Detected tags in an image using Nearest-Neighbor Methods for Tag Refinement
We propose and thoroughly evaluate the use of nearest-neighbor methods for tag refinement. An extensive evaluation on two standard large-scale datasets shows that the performance of these methods is comparable with that of more complex and computationally intensive approaches. Unlike those approaches, nearest-neighbor methods can be applied to ‘web-scale’ data.
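As an illustration of the general idea only, the following minimal sketch refines the tags of an image by letting its visually nearest neighbors vote. It is not the method evaluated in the paper: the Euclidean distance, the toy two-dimensional features, and the names `refine_tags`, `k` and `top_n` are assumptions made for this example.

```python
from collections import Counter
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def refine_tags(query_feature, dataset, k=3, top_n=2):
    """Return the top_n tags voted by the k nearest neighbors of the query.

    dataset is a list of (feature_vector, tags) pairs (hypothetical layout).
    """
    neighbors = sorted(dataset, key=lambda item: euclidean(query_feature, item[0]))[:k]
    votes = Counter()
    for _, tags in neighbors:
        votes.update(set(tags))  # each neighbor votes at most once per tag
    # Sort by descending vote count, then alphabetically for a stable result.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top_n]]


# Toy data: three visually similar "beach" images and one outlier.
dataset = [
    ([0.0, 0.0], ["beach", "sea"]),
    ([0.1, 0.0], ["sea", "sky"]),
    ([0.0, 0.2], ["beach", "sea", "sand"]),
    ([5.0, 5.0], ["city", "night"]),
]
refined = refine_tags([0.05, 0.05], dataset, k=3, top_n=2)
print(refined)
```

Because the only expensive step is the neighbor search, a scheme of this kind can in principle rely on approximate nearest-neighbor indexing when the collection grows.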
Here we make available the code and the metadata for NUS-WIDE-240K.
NUS-WIDE-240K dataset metadata (JSON format, about 25 MB): a subset of 238,251 images from NUS-WIDE-270K that we retrieved from Flickr together with user data. Note that NUS is now releasing the full image set subject to an agreement and disclaimer form.
If you use this data, please cite the paper as follows:
author = "Uricchio, Tiberio and Ballan, Lamberto and Bertini,
Marco and Del Bimbo, Alberto",
title = "An evaluation of nearest-neighbor methods for tag refinement",
booktitle = "Proc. of IEEE International Conference on Multimedia \& Expo (ICME)",
month = "jul",
year = "2013",
address = "San Jose, CA, USA",
url = "http://www.micc.unifi.it/publications/2013/UBBD13"
Interactive Video Search and Browsing Systems: MediaPick
Daniele will present two interactive systems for video search and browsing: a rich internet application designed to obtain the levels of responsiveness and interactivity typical of a desktop application, and a system that exploits multi-touch devices to implement a multi-user collaborative application. Both systems use the same ontology-based video search engine, which can expand user queries through ontology reasoning. It lets users search for specific video segments that contain a semantic concept, or browse the content of video collections when it is too difficult to express a specific query.
The explosion of digital data in recent times, in its varied forms and formats (MPEG4 image, Flash video, WAV audio, etc.), has necessitated the creation of effective tools to organise, manage and link digital assets, in order to maximise accessibility and reduce cost issues for everyone concerned, from content managers to online content consumers.
euTV video annotation and transcription web component
On a larger scale, isolated information repositories developed by content owners and technology providers can be connected, unleashing opportunities for innovative user services and creating new business models, in the vein of on-demand, online, or mobile TV ventures.
The euTV project stems from the conditions and opportunities described above. It aims to connect publicly available multimedia information streams under a unifying framework, which additionally allows publishers of audio-visual content to monetise their products and services. The backbone of euTV is a scalable audio-visual analysis and indexing system that allows detection and tracking of vast amounts of multimedia content based on Topics of Interest (TOI) corresponding to a user’s profile and employed search terms. The front-end is a portal that displays syndicated content, allowing users to perform searches, refine queries, and produce faceted presentations of results.
The three main content domains will be (a) news, (b) sports, and (c) documentaries. In the existing market of media monitoring and clipping, euTV distinguishes itself by simultaneously analysing multiple information streams (text, speech, audio, image, video) instead of a single one and tracking TOI in real time. This provides the user with a more robust identification of their TOI and greater insights into how the information is spread.
LIT (Lexicon of the Italian Television) is a project conceived by the Accademia della Crusca, the leading research institution on the Italian language, in collaboration with CLIEO (Center for theoretical and historical Linguistics: Italian, European and Oriental languages). Its aim is to study the frequencies of the Italian lexicon used in television content, and it targets the specific sector of web applications for linguistic research. The corpus of transcriptions consists of approximately 170 hours of random television recordings broadcast by the national broadcaster RAI (Italian Radio Television) during the year 2006.
LIT: Lexicon of the Italian Television
The principal outcome of the project is the design and implementation of an interactive system which combines a web-based video transcription and annotation tool, a full featured search engine, and a web application, integrated with video streaming, for data visualization and text-video syncing.
The project presents two different interfaces: a search engine, based on classical textual input forms, and a multimedia interface, used both for data visualization and annotation. Annotation functionalities are activated after user authentication. The system relies on a web application backend which handles the transcriptions and provides the necessary indexing and search functions.
The browsing interface shows the video collection present in the model. Users can select a video, play it immediately, and read the associated metadata and speech transcription in sync. Each record in the list of videos provides a link to the raw annotation in XML-TEI format, a standard developed by the TEI (Text Encoding Initiative) Consortium. The annotation can be opened directly inside the browser and saved on the local system. Subtitles are displayed at the bottom of the video, while segments in the transcription area are automatically highlighted during playback and metadata are updated accordingly. Once the text-to-speech alignment has been completed through annotation activities, users can select a unit of text inside the transcription area and the video cue point is aligned accordingly; conversely, scrolling the trigger over an annotated video segment highlights the corresponding segment of text.
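The two-way synchronization between video and transcription can be sketched as a lookup over sorted cue points. This is a minimal illustration under stated assumptions, not the LIT implementation: the `segments` list, its sample utterances, and the function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical aligned transcription: (start time in seconds, utterance text),
# sorted by start time.
segments = [
    (0.0, "Buonasera."),
    (2.5, "Ecco i titoli."),
    (6.0, "Passiamo allo sport."),
]
starts = [start for start, _ in segments]


def segment_at(time_s):
    """Index of the segment to highlight at a given playback time."""
    index = bisect_right(starts, time_s) - 1
    return max(index, 0)  # before the first cue point, fall back to segment 0


def cue_point_for(index):
    """Playback time to seek to when the user selects a segment of text."""
    return segments[index][0]


print(segments[segment_at(3.0)][1])  # text highlighted while playing at 3.0 s
print(cue_point_for(2))              # seek target when the third segment is clicked
```

A binary search keeps the lookup cheap even when it runs on every playback time update.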
The annotation interface is accessed by transcriptionists after authentication, and allows them to associate the transcription with the corresponding video sequences. Annotators can set the cue points of speech on the video sequences using the tools provided by the graphical user interface and assign them an annotation without prior knowledge of the format used. The tool supports the definition of metadata at different levels, or multiple “layers”: features can be assigned to the document as a whole, to individual transmissions, to speakers in the transmissions and to each single segment of the transcription.
The search interface is based on standard text input fields. It provides a JSP frontend to the search functions defined for the Java engine and uses the Lucene query syntax for the identification of HTML elements. The interface resembles a common ‘advanced search’ form, providing all the boolean combinations usually present in search engines, so users are immediately comfortable with the basic features. Notably, some less common features appear among the other fields, such as:
the ‘free sequence’ field, with option for defining it exact, ordered or unordered;
the ‘distance’ parameter, where free sequences can appear within specified ranges inside a single utterance;
the ‘date range’ parameter.
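To make the fields listed above concrete, the sketch below assembles a query string in the classic Lucene query syntax, where a quoted phrase is an exact sequence, a `~N` suffix on a phrase is a proximity (slop) constraint, and `[a TO b]` is an inclusive range. The field names `text` and `date`, the date encoding, and the `build_query` helper are assumptions for this example; the actual LIT index fields are not documented here. Strictly ordered matching within a distance is not expressible with this phrase syntax alone, so it is left out.

```python
def build_query(words, distance=None, date_from=None, date_to=None):
    """Build a Lucene classic-syntax query string (illustrative only).

    words     -- the 'free sequence' terms
    distance  -- optional maximum slop between the terms
    date_from, date_to -- optional inclusive date range, e.g. "20060101"
    """
    # Escape backslashes and double quotes so the phrase stays well-formed.
    escaped = [w.replace("\\", "\\\\").replace('"', '\\"') for w in words]
    clause = 'text:"' + " ".join(escaped) + '"'
    if distance is not None:
        clause += "~" + str(distance)
    clauses = [clause]
    if date_from and date_to:
        clauses.append("date:[" + date_from + " TO " + date_to + "]")
    return " AND ".join(clauses)


print(build_query(["buona", "sera"]))
print(build_query(["buona", "sera"], distance=5,
                  date_from="20060101", date_to="20061231"))
```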
Advanced search features are shown inside dedicated panels which can be expanded if necessary. These panels give all the options for specifying the constraints of a query, as defined for the XML-TEI custom fields used in LIT. The extended parameters allow users to:
set the case sensitiveness of a query;
perform a word root expansion of wildcard characters present in the query;
set the constraint for specific categories defined in the taxonomy;
select specific parameters for utterances, such as type of speech (improvisation, programmed, executed), speech technique (on scene, voice-over), type of communication (monologue, dialogue), speaker gender and type (professional, non-professional).
The system contains 168 hours of RAI (Italian Radio Television) broadcasts, aired during the year 2006. The annotation work was done by researchers of the Accademia della Crusca while LIT was under development, in late 2009. The database stores approximately 20,000 utterances, and using Lucene for search and retrieval does not raise any performance issues.
The system is currently being deployed as a module of the larger nationally funded research project FIRB 2009 VIVIT (Fondo di Investimento per la Ricerca di Base, Vivi l’Italiano), which will integrate the tools and the resulting annotations within a semantic web infrastructure.
The VidiVideo project takes on the challenge of creating substantially enhanced semantic access to video, implemented in a search engine. The outcome of the project is an audio-visual search engine composed of two parts: an automatic annotation part that runs off-line, where detectors for more than 1000 semantic concepts are collected in a thesaurus to process and automatically annotate the video, and an interactive part that provides a video search engine for both technical and non-technical users.
Andromeda - Vidivideo graph based video browsing
Video plays a key role in the news, cultural heritage documentaries and surveillance, and it is a natural form of communication for the Internet and mobile devices. The massive increase in digital audio-visual information poses high demands on advanced storage and search engines for consumers and professional archives.
Video search engines are the product of progress in many technologies: visual and audio analysis, machine learning techniques, as well as visualization and interaction. At present, state-of-the-art systems are able to automatically annotate only a limited set of semantic concepts, and retrieval is possible only through a keyword-based approach built on a lexicon.
The automatic annotation part of the system performs audio and video segmentation, speech recognition, speaker clustering and semantic concept detection.
The VidiVideo system has achieved the highest performance in the most important international object and concept recognition contests (PASCAL VOC and TRECVID).
The interactive part provides two applications: a desktop-based and a web-based search engine. The system supports different query modalities (free text, natural language, graphical composition of concepts using boolean and temporal relations, and query by visual example) and visualizations, resulting in an advanced tool for retrieval and exploration of video archives for both technical and non-technical users in different application fields. In addition, the use of ontologies (instead of simple keywords) makes it possible to exploit semantic relations between concepts through reasoning, extending the user queries.
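As a rough illustration of how reasoning over an ontology can extend a query, the sketch below expands a concept to all of its transitive subconcepts, so that a search for a general concept also retrieves shots annotated with more specific ones. The toy `ontology` dictionary and the `expand_concept` function are hypothetical; the actual VidiVideo ontology and reasoner are not reproduced here.

```python
def expand_concept(concept, subclasses):
    """Return the concept together with all of its transitive subconcepts.

    subclasses maps a concept name to the list of its direct subconcepts.
    """
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current not in seen:  # guards against cycles in the hierarchy
            seen.add(current)
            stack.extend(subclasses.get(current, []))
    return seen


# Toy hierarchy (assumed for this example).
ontology = {
    "vehicle": ["car", "airplane"],
    "car": ["police_car"],
}
expanded = expand_concept("vehicle", ontology)
print(sorted(expanded))
```

With plain keywords, a query for "vehicle" would miss shots annotated only as "police_car"; the expansion makes that relation usable at query time.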
The off-line annotation part has been implemented in C++ on the Linux platform, and takes advantage of the low-cost processing power provided by GPUs on consumer graphics cards.
The web-based system follows the Rich Internet Application (RIA) paradigm, using a client-side Flash virtual machine. RIAs can avoid the usual slow and synchronous loop for user interactions. This makes it possible to implement a visual querying mechanism with a look and feel approaching that of a desktop environment, and with the fast response that users expect. The search results are in RSS 2.0 XML format, while videos are streamed using the RTMP protocol.
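A client consuming such results could read them as in the following minimal sketch. Only the standard RSS 2.0 elements (`channel`, `item`, `title`, `link`) are relied upon; the sample feed, the example.org URLs, and the `parse_results` function are invented for illustration and do not reproduce the actual VidiVideo output.

```python
import xml.etree.ElementTree as ET

# Invented sample feed using only standard RSS 2.0 elements.
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Search results</title>
    <item><title>Shot 12</title><link>rtmp://example.org/videos/clip12</link></item>
    <item><title>Shot 47</title><link>rtmp://example.org/videos/clip47</link></item>
  </channel>
</rss>"""


def parse_results(xml_text):
    """Return (title, link) pairs for every item in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


results = parse_results(SAMPLE)
for title, link in results:
    print(title, link)
```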
The recognition of events in videos is a relevant and challenging task of automatic semantic video analysis. At present, one of the most successful frameworks used for object recognition tasks is the bag-of-words (BoW) approach. However, it does not model the temporal information of the video stream. We are working on a novel method to introduce temporal information within the BoW approach by modeling a video clip as a sequence of histograms of visual features, computed from each frame using the traditional BoW model.
Video event classification using bag-of-words and string kernels
The sequences are treated as strings where each histogram is considered as a character. Event classification of these sequences, whose size varies with the length of the video clip, is performed using SVM classifiers with a string kernel (e.g., using the Needleman-Wunsch edit distance). Experimental results on two domains, soccer video and TRECVID 2005, demonstrate the validity of the proposed approach.
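The idea can be sketched as follows, under several assumptions that are not taken from the original work: two histograms count as the same "character" when their chi-square distance is below a threshold, gaps and mismatches both cost 1, and the distance is turned into a similarity with an exponential. All names and parameter values (`chi2`, `nw_distance`, `string_kernel`, `gap`, `threshold`, `gamma`) are illustrative.

```python
import math


def chi2(h1, h2):
    """Chi-square distance between two histograms of equal length."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def nw_distance(seq_a, seq_b, gap=1.0, threshold=0.1):
    """Needleman-Wunsch style edit distance between two histogram sequences."""
    n, m = len(seq_a), len(seq_b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap
    for j in range(1, m + 1):
        d[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Histograms closer than the threshold match at zero cost.
            sub = 0.0 if chi2(seq_a[i - 1], seq_b[j - 1]) <= threshold else 1.0
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match or substitution
                          d[i - 1][j] + gap,      # gap in seq_b
                          d[i][j - 1] + gap)      # gap in seq_a
    return d[n][m]


def string_kernel(seq_a, seq_b, gamma=0.5):
    """Similarity derived from the edit distance (assumed exponential form)."""
    return math.exp(-gamma * nw_distance(seq_a, seq_b))


# Two toy clips of different length, one histogram per frame.
clip_a = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
clip_b = [[1.0, 0.0], [0.0, 1.0]]
print(nw_distance(clip_a, clip_b), string_kernel(clip_a, clip_b))
```

Evaluating `string_kernel` on every pair of training clips yields a Gram matrix that an SVM accepting precomputed kernels can use directly. A kernel built from an edit distance in this way is not guaranteed to be positive semi-definite, so this should be read as a sketch of the mechanism rather than a validated classifier.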