VIVIT. Vivi l’Italiano web portal

VIVIT is a three-year project led by the Media Integration and Communication Center (MICC) and the Accademia della Crusca, funded through the government's FIRB program. As part of this project, MICC has developed the VIVIT web portal to give visibility to culture-related content that may appeal to second- and third-generation Italians living abroad.

Vivit web portal

The main aim of the VIVIT web portal is to provide people of Italian origin with quality content on the history of the nation and of its language, together with learning materials for self-assessment and for improving the viewer’s language proficiency.

Development of the VIVIT web portal officially started in 2010, when the information architecture and content organization were first discussed. The VIVIT project required that the web portal give users and potential teachers ways to interact with each other and to produce and reorganize content to be shown online to language and culture learners. Given these premises, it was decided to use a CMS (Content Management System), since user role definition and user interaction are built into such systems.

VIVIT is being developed on Drupal. Drupal is free and open-source software written in PHP; its feature set has grown considerably in recent years, and it is now regarded as one of the leading CMSs, alongside the well-known WordPress and Joomla. A large number of user-contributed plugins (modules, in Drupal terms) and layout themes are available, since the development process itself is relatively simple and widely documented.

At this time, the architecture of the VIVIT portal is mostly complete. Users may browse content, comment on it, bookmark pages and reorganize them from inside the platform. Users with the teacher role may also share these self-created content units with other users, creating their own learning paths through the contents of the web portal. Audio and video resources are available, as are interactive learning materials built on a custom jQuery plugin developed internally at MICC.

Users with sufficient rights can also semantically process and annotate texts inside the portal (that is, assign resources that describe the content) using Homer, a named-entity and topic extraction servlet also developed at MICC. Tagging is provided by Drupal core modules, while the text analysis feature combines the contributed tagging module with a custom module written specifically for the VIVIT portal. The Homer servlet is a Java application based on GATE, a toolkit for a broad range of NLP (Natural Language Processing) tasks.

LIT. Lexicon of Italian Television search engine

The VIVIT web portal gives access to additional resources related to the same cultural field, in particular LIT (Lexicon of Italian Television) and LIR (Lexicon of Italian Radio). The former, LIT, is a Java search engine that uses Lucene to index about 160 video excerpts from Italian TV programs, of about 30 minutes each, chosen from the RAI video archive. LIT also offers a backend where users can stream the video sequences, synchronize the transcriptions with the audio-video sources, annotate the materials using customized taxonomies, and add specific metadata. The latter, LIR, is a similar system that relies on an audio archive composed of radio segments from several Italian sources. Linguists are currently using LIT and LIR for research in computational linguistics.

LIR. Lexicon of Italian Radio backend

LIT: Lexicon of the Italian Television

LIT (Lexicon of the Italian Television) is a project conceived by the Accademia della Crusca, the leading research institution on the Italian language, in collaboration with CLIEO (Center for theoretical and historical Linguistics: Italian, European and Oriental languages). Its aim is to study the frequencies of the Italian lexicon used in television content, and it targets the specific sector of web applications for linguistic research. The corpus of transcriptions consists of approximately 170 hours of randomly chosen television recordings broadcast by the national broadcaster RAI (Italian Radio Television) during 2006.

The principal outcome of the project is the design and implementation of an interactive system that combines a web-based video transcription and annotation tool, a full-featured search engine, and a web application, integrated with video streaming, for data visualization and text-video syncing.

The project presents two different interfaces: a search engine based on classical text input forms, and a multimedia interface used for both data visualization and annotation. Annotation functions are enabled after the user authenticates. The system relies on a web application backend that handles the transcriptions and provides the necessary indexing and search functions.

The browsing interface shows the video collection present in the model. Users can select a video, play it immediately, and read the associated metadata and speech transcription in sync. Each record in the list of videos provides a link to the raw annotation in XML-TEI format, a standard developed by the Text Encoding Initiative (TEI) Consortium. The annotation can be opened directly inside the browser and saved on the local system. Subtitles are displayed at the bottom of the video, while segments in the transcription area are automatically highlighted during playback and metadata are updated accordingly. Once the text-to-speech alignment has been completed through annotation, users can select a unit of text inside the transcription area and the video jumps to the matching cue point; conversely, dragging the playback slider over an annotated video segment highlights the corresponding segment of text.
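The two-way link between playback time and transcription segments can be illustrated with a small lookup over sorted cue points. This is only a sketch under stated assumptions, not the LIT implementation: the cue data, the `segment_at` and `cue_for` names, and the use of plain start times in seconds are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical cue points: (start time in seconds, transcription segment).
# Segments are assumed to be sorted by start time, each one lasting
# until the next cue begins.
CUES = [
    (0.0, "Buonasera a tutti."),
    (2.5, "Benvenuti alla puntata di oggi."),
    (6.0, "Cominciamo con le notizie."),
]
STARTS = [start for start, _ in CUES]


def segment_at(time_s):
    """Return the index of the segment playing at time_s.

    Returns None when time_s falls before the first cue point.
    """
    index = bisect_right(STARTS, time_s) - 1
    return index if index >= 0 else None


def cue_for(index):
    """Return the video time to seek to when segment `index` is selected."""
    return STARTS[index]
```

During playback, a player would call `segment_at` with the current time to decide which segment to highlight; when the user clicks a segment, `cue_for` gives the position to seek to.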

The annotation interface is accessed by transcriptionists after authentication, and lets them associate the transcription with the corresponding video sequences. Annotators can set the cue points of speech on the video sequences using the tools provided by the graphical user interface and assign them an annotation without prior knowledge of the underlying format. The tool supports the definition of metadata at different levels, or multiple “layers”: features can be assigned to the document as a whole, to individual transmissions, to speakers in the transmissions and to each single segment of the transcription.
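One way to picture the four metadata layers is as nested dictionaries that are merged when a single segment is displayed. The sketch below is an assumption made for illustration: the feature names, the example values and the rule that a more specific layer overrides a more general one are not taken from LIT.

```python
from collections import ChainMap

# Hypothetical feature values for each layer, from most general to most specific.
document = {"broadcaster": "RAI", "year": 2006}
transmission = {"title": "Evening news", "genre": "information"}
speaker = {"gender": "female", "speaker_type": "professional"}
segment = {"speech_type": "programmed", "communication": "monologue"}


def effective_metadata(segment, speaker, transmission, document):
    """Merge the four layers into one dictionary.

    ChainMap looks keys up from left to right, so the most specific
    layer (the segment) wins when the same key appears in several layers.
    """
    return dict(ChainMap(segment, speaker, transmission, document))


merged = effective_metadata(segment, speaker, transmission, document)
```

Here `merged` carries the document-level, transmission-level, speaker-level and segment-level features together, which is roughly what a viewer needs in order to update the metadata panel as playback moves from one segment to the next.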

The search interface is based on standard text input fields. It provides a JSP frontend to the search functions defined for the Java engine and uses the Lucene query syntax for the identification of HTML elements. The interface resembles a common ‘advanced search’ form, offering all the boolean combinations usually found in search engines, which makes users comfortable with the basic features. Notably, some less common features appear among the other fields, such as:

  • the ‘free sequence’ field, with an option to define it as exact, ordered or unordered;
  • the ‘distance’ parameter, which lets free sequences match within a specified range inside a single utterance;
  • the ‘date range’ parameter.
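The fields above map naturally onto constructs of the classic Lucene query syntax: quoted phrases, proximity searches written as `"..."~N`, and inclusive ranges written as `field:[a TO b]`. The helper below is a minimal sketch of how a form could assemble such a query string. The `build_query` function and the `text` and `date` field names are hypothetical, words are assumed to contain no characters that need escaping, and the ‘ordered’ option is left out because a sloppy phrase alone does not enforce word order.

```python
def build_query(words, mode="phrase", distance=None, date_range=None,
                text_field="text", date_field="date"):
    """Build a classic Lucene query string from form values.

    mode="phrase"    -> exact quoted phrase, or a proximity search
                        when `distance` is given
    mode="unordered" -> all words required, in any order
    date_range       -> optional (start, end) pair, inclusive
    """
    if mode == "phrase":
        clause = '{}:"{}"'.format(text_field, " ".join(words))
        if distance is not None:
            clause += "~{}".format(distance)
    elif mode == "unordered":
        clause = " AND ".join("{}:{}".format(text_field, w) for w in words)
        clause = "(" + clause + ")"
    else:
        raise ValueError("unsupported mode: " + mode)

    if date_range is not None:
        start, end = date_range
        clause += " AND {}:[{} TO {}]".format(date_field, start, end)
    return clause
```

For example, `build_query(["buona", "sera"], distance=3)` produces `text:"buona sera"~3`, which asks for the two words within three positions of each other.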

Advanced search features are shown inside dedicated panels which can be expanded if necessary. These panels give all the options for specifying the constraints of a query, as defined for the XML-TEI custom fields used in LIT. The extended parameters allow users to:

  • set the case sensitivity of a query;
  • perform a word root expansion of wildcard characters present in the query;
  • set the constraint for specific categories defined in the taxonomy;
  • select specific parameters for utterances, such as type of speech (improvisation, programmed, executed), speech technique (on scene, voice-over), type of communication (monologue, dialogue), speaker gender and type (professional, non-professional).
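The first two options can be sketched as the expansion of a wildcard pattern against the list of indexed terms, with or without case folding. This is an illustration only: the vocabulary and the `expand` helper are hypothetical, and shell-style matching from Python's `fnmatch` module stands in for whatever the Java engine actually does.

```python
from fnmatch import fnmatchcase

# Hypothetical list of indexed terms.
VOCABULARY = ["Parlare", "parlare", "parlato", "parola", "partita"]


def expand(pattern, vocabulary, case_sensitive=False):
    """Return the sorted terms that match a wildcard pattern.

    `*` matches any run of characters and `?` a single character.
    When case_sensitive is False, both pattern and terms are lowercased
    and duplicates that differ only by case are collapsed.
    """
    if case_sensitive:
        return sorted(t for t in vocabulary if fnmatchcase(t, pattern))
    folded = pattern.lower()
    return sorted({t.lower() for t in vocabulary
                   if fnmatchcase(t.lower(), folded)})
```

With this vocabulary, a case-sensitive search for `Parl*` matches only `Parlare`, while the case-insensitive version matches `parlare` and `parlato`.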

The system contains 168 hours of RAI (Italian Radio Television) broadcasts aired during 2006. Researchers of the Accademia della Crusca carried out the annotation work in late 2009, while LIT was under development. The database stores approximately 20,000 utterances, and using Lucene for search and retrieval raises no performance issues.

The system is currently being deployed as a module of the larger VIVIT project (Vivi l’Italiano), funded by the national research program FIRB 2009 (Fondo di Investimento per la Ricerca di Base), which will integrate the tools and the resulting annotations within a semantic web infrastructure.