MICC - Media Integration and Communication Center

We propose a method to match boxes that exhibit temporal consistency across consecutive frames of a video, yielding a set of tracks. A track is defined as a succession of bounding boxes for which the Intersection over Union (IoU) between two boxes (belonging to frames i and i+1) is above a defined threshold.
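As a minimal sketch of the matching criterion described above, the IoU between two boxes can be computed as follows. The `(x1, y1, x2, y2)` corner format, the function names and the default threshold of 0.5 are assumptions made for illustration; the text does not specify them.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) corner format (assumed)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(box_a, box_b, threshold=0.5):
    """True when two boxes from consecutive frames overlap enough (threshold is hypothetical)."""
    return iou(box_a, box_b) > threshold
```

For example, two 2x2 boxes offset by one unit in each direction share an area of 1 out of a union of 7, so they would not be matched at a threshold of 0.5.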

Entropy-based evaluation for object proposals

Starting from the first frame, each time a match is found, the corresponding bounding box is added to the end of the track and becomes the reference box for the following frame. If no match is found, the last box of the track is compared with the following frames until a good match is obtained. When one or more consecutive matches are not found, tracks become fragmented, i.e. there are frames for which a track is active but has no bounding box. This is usually due to a lack of good bounding boxes for that frame, occlusion, or appearance changes of the object. It is thus necessary to avoid matching boxes in frames that are too far apart and therefore unlikely to represent the same content, but at the same time we want to tolerate some missing boxes without prematurely terminating the track. To this end we introduce a Time to Live (TTL) counter for each track. We define TTL as the number of frames, at frame i, that the method can still wait before considering the track terminated. TTL starts from an initial value; each time a box cannot be matched in a consecutive frame the TTL is decremented, otherwise it is incremented (up to the initial value).
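The track-building procedure with the TTL counter could be sketched roughly as below. This is an illustrative reading of the description, not the authors' implementation: the greedy best-IoU matching, starting a new track from every unmatched box, terminating a track when its TTL reaches zero, and the names `build_tracks`, `iou_threshold` and `initial_ttl` are all assumptions.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format (assumed format)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def build_tracks(frames, iou_threshold=0.5, initial_ttl=3):
    """frames: one list of boxes per frame. Returns tracks as lists of (frame_index, box)."""
    active, finished = [], []
    for i, boxes in enumerate(frames):
        unused = list(boxes)
        still_active = []
        for track in active:
            ref = track["boxes"][-1][1]  # last matched box is the reference
            best = max(unused, key=lambda b: iou(ref, b), default=None)
            if best is not None and iou(ref, best) > iou_threshold:
                unused.remove(best)
                track["boxes"].append((i, best))
                # A successful match restores TTL, capped at the initial value.
                track["ttl"] = min(track["ttl"] + 1, initial_ttl)
            else:
                track["ttl"] -= 1  # missing box: the track is fragmented at frame i
            # Assumption: a track is terminated once its TTL reaches zero.
            (still_active if track["ttl"] > 0 else finished).append(track)
        # Assumption: every unmatched box starts a new track.
        for box in unused:
            still_active.append({"boxes": [(i, box)], "ttl": initial_ttl})
        active = still_active
    finished.extend(active)
    return [t["boxes"] for t in finished]
```

With `initial_ttl=2`, a track survives a single frame without a match but is closed after two consecutive misses, so a box reappearing much later starts a new track instead of being attached to the old one.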

Object proposals are usually evaluated by measuring how well objects are covered by the generated boxes. This kind of evaluation does not take into account unannotated objects, and therefore provides a benchmark that does not reflect the real capabilities of the proposal method. The method is a general framework for discovering salient spatio-temporal tracks in videos, built upon a generic bounding box oracle. To evaluate it, we introduce a novel method to establish the effectiveness of a generic video proposal, which is also dataset-independent since it does not rely on annotations. We evaluate whether a proposal effectively represents an instance of some object, since the goal of an object proposal is to locate good candidates and not to produce the candidate of a given class (i.e. the one of the ground truth). To this end we propose an entropy-based evaluation which indicates how likely the proposal is to be recognized as an object. Given a classifier capable of providing, for an image, a probability distribution X = {x1,…, xN} over N classes, we compute the Shannon entropy of the probability vector X.
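For reference, the Shannon entropy of the probability vector X takes the standard form below. The base of the logarithm is not stated in the text; the natural logarithm is assumed here, and terms with x_i = 0 are taken to contribute zero by convention.

```latex
H(X) = -\sum_{i=1}^{N} x_i \log x_i,
\qquad \text{with } \sum_{i=1}^{N} x_i = 1 \text{ and } 0 \log 0 := 0 .
```

Under this definition H(X) ranges from 0, when all the probability mass is on a single class, to log N, when the distribution is uniform over the N classes.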

The rationale behind this choice is that, given a good classifier, for a known object the output probability will be high for the corresponding class and near zero for the others, thus producing a small entropy. On the contrary, for inputs that the classifier is unsure of, e.g. background patches, the output probability will be spread more uniformly among all the possible classes, resulting in a higher entropy, as shown in Figure 1. Therefore, if the classifier is able to cover effectively a sufficiently large number of classes, then the entropy can be interpreted as a measure of objectness for the given proposal. A comparison of high and low entropy proposals is shown in Figure 2.
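The contrast between confident and uncertain classifier outputs can be illustrated with a small sketch. The two probability vectors are made-up toy values over four classes, not outputs of any real classifier, and the function name `shannon_entropy` is chosen for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (natural log) of a probability vector; zero entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy example: a confident prediction (object-like proposal) ...
peaked = [0.97, 0.01, 0.01, 0.01]
# ... versus an uncertain one (e.g. a background patch).
uniform = [0.25, 0.25, 0.25, 0.25]

low = shannon_entropy(peaked)    # small entropy: mass concentrated on one class
high = shannon_entropy(uniform)  # maximal entropy for four classes, equal to log(4)
```

Ranking proposals by increasing entropy would then place the most object-like candidates first, under the assumption stated above that the classifier covers enough classes.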

Unsupervised Object Discovery Unsupervised spatio-temporal proposal tracks for salient object discovery

