Infoscience

Thesis

Efficient image duplicate detection based on image analysis

This thesis is about the detection of duplicated images. More precisely, the developed system is able to discriminate possibly modified copies of original images from other unrelated images. The proposed method is referred to as content-based since it relies only on content analysis techniques rather than using image tagging as done in watermarking. The proposed content-based duplicate detection system classifies a test image by associating it with a label that corresponds to one of the original known images. The classification is performed in four steps. In the first step, the test image is described by using global statistics about its content. In the second step, the most likely original images are efficiently selected using a spatial indexing technique called R-Tree. The third step consists in using binary detectors to estimate the probability that the test image is a duplicate of the original images selected in the second step. Indeed, each original image known to the system is associated with an adapted binary detector, based on a support vector classifier, that estimates the probability that a test image is one of its duplicate. Finally, the fourth and last step consists in choosing the most probable original by picking that with the highest estimated probability. Comparative experiments have shown that the proposed content-based image duplicate detector greatly outperforms detectors using the same image description but based on a simpler distance functions rather than using a classification algorithm. Additional experiments are carried out so as to compare the proposed system with existing state of the art methods. Accordingly, it also outperforms the perceptual distance function method, which uses similar statistics to describe the image. While the proposed method is slightly outperformed by the key points method, it is five to ten times less complex in terms of computational requirements. Finally, note that the nature of this thesis is essentially exploratory since it is one of the first attempts to apply machine learning techniques to the relatively recent field of content-based image duplicate detection.

Related material