Abstract

The role of social media as a source of timely and massive information has become more apparent since the era of Web 2.0.Multiple studies illustrated the use of information in social media to discover biomedical and health-related knowledge.Most methods proposed in the literature employ traditional document classification techniques that represent a document as a bag of words.These techniques work well when documents are rich in text and conform to standard English; however, they are not optimal for social media data where sparsity and noise are norms.This paper aims to address the limitations posed by the traditional bag-of-word based methods and propose to use heterogeneous features in combination with ensemble machine learning techniques to discover health-related information, which could prove to be useful to multiple biomedical applications, especially those needing to discover health-related knowledge in large scale social media data.Furthermore, the proposed methodology could be generalized to discover different types of information in various kinds of textual data.

Details