A Scalable Approach to Harvest Modern Weblogs

Manolopoulos, Yannis

doi:10.1142/S0218213015400059

research article

Banos, Vangelis • Blanvillain, Olivier • Kasioumis, Nikos

2015

International Journal On Artificial Intelligence Tools

Blogs are one of the most prominent means of communication on the web. Their content, interconnections and influence constitute a unique socio-technical artefact of our times which needs to be preserved. The BlogForever project has established best practices and developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents the latest developments of the blog crawler which is a key component of the BlogForever platform. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple yet robust and scalable algorithm to generate extraction rules based on string matching using the blog's web feed in conjunction with blog hypertext. Furthermore, we present a system architecture which is characterised by efficiency, modularity, scalability and interoperability with third-party systems. Finally, we conduct thorough evaluations of the performance and accuracy of our system.
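
The rule-generation idea in the abstract (match values from the web feed against the post's HTML, then reuse the matching location as an extraction rule) can be illustrated with a minimal sketch. It is not the BlogForever implementation: the path notation, the helper names `extract_texts`, `generate_rule` and `apply_rule`, and the use of `difflib.SequenceMatcher` as the string similarity measure are all assumptions made for illustration, since the abstract does not specify them.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped so the stack stays balanced.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TextByPath(HTMLParser):
    """Collect the text found under each element path, e.g. 'html/body/h1.title'."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, outermost first
        self.texts = {}  # element path -> concatenated text below it

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Credit the text to every open ancestor, so containers hold their full text.
        for depth in range(1, len(self.stack) + 1):
            path = "/".join(self.stack[:depth])
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def extract_texts(html):
    """Map each element path in the page to the text it contains."""
    parser = TextByPath()
    parser.feed(html)
    parser.close()
    return parser.texts


def generate_rule(feed_value, html):
    """Return the path whose text is most similar to a value taken from the feed."""
    texts = extract_texts(html)
    return max(
        texts,
        key=lambda path: SequenceMatcher(None, feed_value, texts[path]).ratio(),
    )


def apply_rule(rule, html):
    """Extract the text at a previously generated path, or None if it is absent."""
    return extract_texts(html).get(rule)
```

In this sketch, a rule learned from one post whose title is known from the feed (`generate_rule(feed_title, post_html)`) can then be applied with `apply_rule` to other posts of the same blog that share the template, including posts that no longer appear in the feed. Paths here are made-up strings rather than real XPath expressions, and sibling elements with the same tag and class share one path.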

Type

research article

DOI

10.1142/S0218213015400059

Web of Science ID

WOS:000352909400002

Author(s)

Banos, Vangelis

Blanvillain, Olivier

Kasioumis, Nikos

Manolopoulos, Yannis

Date Issued

2015

Publisher

World Scientific Publ Co Pte Ltd

Published in

International Journal On Artificial Intelligence Tools

Volume

24

Issue

2

Article Number

1540005

Subjects

Blog crawler • web data extraction • wrapper generation • interoperability

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units

IC

Available on Infoscience

May 29, 2015

Use this identifier to reference this record

https://infoscience.epfl.ch/handle/20.500.14299/114342