Content-based retrieval is typically performed by comparing a document used as a query (for instance, an image) with the documents contained in a possibly large database. Relevant documents are retrieved by searching the database for similarity.
Content-based retrieval faces two issues. On the one hand, the visual information extracted from images must be expressive enough to allow effective retrieval of relevant documents. On the other hand, similarity search algorithms must scale to huge (web-scale) datasets. The talk will briefly introduce feature extraction issues and will mainly focus on scalable similarity search techniques.
Similarity search is a difficult task because traditional techniques for processing database or text queries cannot be applied here. Visual documents are generally compared using distance (or dissimilarity) measures defined on visual features.
Various indexing strategies and search algorithms based on distance functions have been defined over the last decade. An important research direction has been tree-based access methods, which allow search algorithms to inspect just a small portion of the dataset. Limitations of tree-based approaches were addressed by defining techniques for approximate similarity search, where a significant performance boost is obtained at the expense of some minor imprecision in the search results. Techniques that will be discussed include Locality-Sensitive Hashing (LSH) methods and permutation-based methods, where documents are represented as permutations of a set of reference objects, and similarity between documents is approximated by comparing permutations.
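The permutation-based idea can be sketched as follows: each document is represented by the ordering of a fixed set of reference objects ("pivots") by increasing distance, and two documents are compared through a distance between their permutations, such as the Spearman footrule. The sketch below is a minimal illustration under the assumption that documents and pivots are plain 2-D feature vectors with Euclidean distance; all names and parameters here are illustrative, not part of any specific system discussed in the talk.

```python
import random

def dist(a, b):
    """Euclidean distance between two feature vectors (assumed representation)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def permutation(doc, pivots):
    """Represent a document as the ranks of the pivots ordered by distance."""
    order = sorted(range(len(pivots)), key=lambda i: dist(doc, pivots[i]))
    # position[i] = rank of pivot i when pivots are sorted by closeness to doc
    position = [0] * len(pivots)
    for rank, i in enumerate(order):
        position[i] = rank
    return position

def spearman_footrule(p, q):
    """Distance between two permutations: total displacement of ranks."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Toy dataset: random 2-D points standing in for extracted visual features.
random.seed(0)
pivots = [(random.random(), random.random()) for _ in range(8)]
db = [(random.random(), random.random()) for _ in range(100)]
perms = [permutation(d, pivots) for d in db]  # precomputed index

query = (0.5, 0.5)
qperm = permutation(query, pivots)
# Approximate nearest neighbours: rank database items by permutation distance
# instead of computing the (possibly expensive) original distance to every item.
candidates = sorted(range(len(db)),
                    key=lambda j: spearman_footrule(qperm, perms[j]))[:5]
```

In a real system the candidate set retrieved by comparing permutations would typically be refined by computing the original distance only on those few candidates, which is what makes the approach scale.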
These indexes allow similarity retrieval to be executed very efficiently, on datasets containing hundreds of millions of images, with limited computing and storage resources.