conference paper

Finding near-duplicate web pages: A large-scale evaluation of algorithms

Henzinger, Monika R.  
2006
29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Broder et al.'s [3] shingling algorithm and Charikar's [4] random-projection-based approach are considered "state-of-the-art" algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al.'s algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms. Copyright 2006 ACM.
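For readers unfamiliar with the two families of techniques named in the abstract, the following is a minimal illustrative sketch, not the implementation evaluated in the paper. It assumes whitespace tokenization, 3-token shingles compared by Jaccard similarity, and a 64-bit fingerprint in which each token's hash votes on every bit position; the function names, the choice of SHA-256, and all parameter values are assumptions made for illustration only.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-token shingles (contiguous token windows)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        # Short texts yield a single shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fingerprint(text: str, bits: int = 64) -> int:
    """Bit fingerprint in the spirit of random projections.

    Each token is hashed to `bits` bits; a set bit votes +1 and a clear
    bit votes -1 for that position. The sign of each total gives one
    output bit. Assumes `bits` is a multiple of 8 and at most 256.
    """
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(x: int, y: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox jumped over the lazy dog"
    print("Jaccard on shingles:", jaccard(shingles(a), shingles(b)))
    print("Hamming distance:", hamming(fingerprint(a), fingerprint(b)))
```

Under this sketch, two pages would be flagged as near-duplicates when the shingle similarity is above some threshold or the Hamming distance between fingerprints is below one; the thresholds themselves are tuning choices not given in this record.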

Type: conference paper
DOI: 10.1145/1148170.1148222
Scopus ID: 2-s2.0-33750296887
Author(s): Henzinger, Monika R.
Date Issued: 2006
Publisher: ACM Press
Published in: 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Start page: 284
End page: 291
Subjects: Content duplication • Near-duplicate documents • Web pages
Editorial or Peer reviewed: REVIEWED
Written at: EPFL
EPFL units: LTAA
Event place:
Available on Infoscience: January 18, 2007
Use this identifier to reference this record: https://infoscience.epfl.ch/handle/20.500.14299/239643