Repository logo

Infoscience

  • English
  • French
Log In
Logo EPFL, École polytechnique fédérale de Lausanne

Infoscience

  • English
  • French
Log In
  1. Home
  2. Academic and Research Output
  3. Conferences, Workshops, Symposiums, and Seminars
  4. Web page language identification based on URLs
 
conference paper

Web page language identification based on URLs

Baykan, Eda
•
Henzinger, Monika  
•
Weber, Ingmar
2008
Proceedings of the VLDB Endowment
In 34th International Conference on Very Large Data Bases (VLDB)

Given only the URL of a web page, can we identify its language? This is the question that we examine in this paper. Such a language classifier is, for example, useful for crawlers of web search engines, which frequently try to satisfy certain language quotas. To determine the language of uncrawled web pages, they have to download the page, which might be wasteful, if the page is not in the desired language. With URL-based language classifiers these redundant downloads can be avoided. We apply a variety of machine learning algorithms to the language identification task and evaluate their performance in extensive experiments for five languages: English, French, German, Spanish and Italian. Our best methods achieve an F-measure, averaged over all languages, of around .90 for both a random sample of 1,260 web page from a large web crawl and for 25k pages from the ODP directory. For 5k pages of web search engine results we even achieve an Fmeasure of .96. The achieved recall for these collections is .93, .88 and .95 respectively. Two independent human evaluators performed considerably worse on the task, with an F-measure of .75 and a typical recall of a mere .67. Using only country-code top-level domains, such as .de or .fr yields a good precision, but a typical recall of below .60 and an F- measure of around .68.

  • Files
  • Details
  • Metrics
Loading...
Thumbnail Image
Name

url-languages.pdf

Access type

openaccess

Size

177.76 KB

Format

Adobe PDF

Checksum (MD5)

00dad1a144361abfbab5d6e671dfe899

Logo EPFL, École polytechnique fédérale de Lausanne
  • Contact
  • infoscience@epfl.ch

  • Follow us on Facebook
  • Follow us on Instagram
  • Follow us on LinkedIn
  • Follow us on X
  • Follow us on Youtube
AccessibilityLegal noticePrivacy policyCookie settingsEnd User AgreementGet helpFeedback

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, tous droits réservés