The gist of everything new: personalized top-k processing over web 2.0 streams
Web 2.0 portals have made content generation easier than ever with millions of users contributing news stories in form of posts in weblogs or short textual snippets as in Twitter. Efficient and effective filtering solutions are key to allow users stay tuned to this ever-growing ocean of information, releasing only relevant trickles of personal interest. In classical information filtering systems, user interests are formulated using standard IR techniques and data from all available information sources is filtered based on a predefined absolute quality-based threshold. In contrast to this restrictive approach which may still overwhelm the user with the returned stream of data, we envision a system which continuously keeps the user updated with only the top-k relevant new information. Freshness of data is guaranteed by considering it valid for a particular time interval, controlled by a sliding window. Considering relevance as relative to the existing pool of new information creates a highly dynamic setting. We present POL-filter which together with our maintenance module constitute an efficient solution to this kind of problem. We show by comprehensive performance evaluations using real world data, obtained from a weblog crawl, that our approach brings performance gains compared to state-of-the-art.