research article

A Scalable Approach to Harvest Modern Weblogs

Banos, Vangelis • Blanvillain, Olivier • Kasioumis, Nikos • Manolopoulos, Yannis
2015
International Journal On Artificial Intelligence Tools

Blogs are one of the most prominent means of communication on the web. Their content, interconnections and influence constitute a unique socio-technical artefact of our times which needs to be preserved. The BlogForever project has established best practices and developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents the latest developments of the blog crawler which is a key component of the BlogForever platform. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple yet robust and scalable algorithm to generate extraction rules based on string matching using the blog's web feed in conjunction with blog hypertext. Furthermore, we present a system architecture which is characterised by efficiency, modularity, scalability and interoperability with third-party systems. Finally, we conduct thorough evaluations of the performance and accuracy of our system.
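The rule-generation idea mentioned in the abstract (matching values from the blog's web feed against the post's HTML) can be illustrated with a minimal sketch. This record does not describe the paper's actual algorithm, so everything below is an assumption made for illustration: the similarity measure (Python's `difflib.SequenceMatcher`), the rule format (a tag name plus `class` attribute), and the names `generate_rule` and `apply_rule` are all hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TextCollector(HTMLParser):
    """Collects (tag, class, text) for every element that directly contains text."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements as (tag, class attribute)
        self.candidates = []  # (tag, class, text) triples found so far

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerates unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            tag, cls = self.stack[-1]
            self.candidates.append((tag, cls, text))


def collect(page_html):
    """Parse the page and return all (tag, class, text) candidates."""
    collector = TextCollector()
    collector.feed(page_html)
    return collector.candidates


def generate_rule(feed_value, page_html):
    """Hypothetical rule generator: return the (tag, class) pair whose text
    is most similar to the value taken from the web feed, or None."""
    candidates = collect(page_html)
    if not candidates:
        return None
    best = max(
        candidates,
        key=lambda c: SequenceMatcher(None, feed_value, c[2]).ratio(),
    )
    return best[0], best[1]


def apply_rule(rule, page_html):
    """Extract the text of every element matching a previously generated rule."""
    return [text for tag, cls, text in collect(page_html) if (tag, cls) == rule]


# The feed supplies a known title; the rule learned from one post is reused on another.
first_post = (
    '<html><body><h1 class="site">My Blog</h1>'
    '<h2 class="post-title">Hello World</h2>'
    '<div class="content">Some text here.</div></body></html>'
)
second_post = (
    '<html><body><h1 class="site">My Blog</h1>'
    '<h2 class="post-title">Second Post</h2></body></html>'
)
rule = generate_rule("Hello World", first_post)
titles = apply_rule(rule, second_post)
```

In this sketch the feed acts as a source of known-good values: once a rule is learned from a post that appears in the feed, it can be applied to older posts on the same blog that the feed no longer lists.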

Type: research article
DOI: 10.1142/S0218213015400059
Web of Science ID: WOS:000352909400002
Author(s): Banos, Vangelis; Blanvillain, Olivier; Kasioumis, Nikos; Manolopoulos, Yannis
Date Issued: 2015
Publisher: World Scientific Publ Co Pte Ltd
Published in: International Journal On Artificial Intelligence Tools
Volume: 24
Issue: 2
Article Number: 1540005
Subjects: Blog crawler • web data extraction • wrapper generation • interoperability
Editorial or Peer reviewed: REVIEWED
Written at: EPFL
EPFL units: IC
Available on Infoscience: May 29, 2015
Use this identifier to reference this record: https://infoscience.epfl.ch/handle/20.500.14299/114342