Combine and Conquer: Mining Social Systems for Prediction

In this thesis, we explore the application of data mining and machine learning techniques to several practical problems rooted in fields such as social science, economics, and political science. We show that computer science techniques can contribute significantly to solving them. Moreover, we show that combining several models or datasets related to a given problem is key to the quality of the solution.

The first application we consider is human mobility prediction. We describe our winning contribution to the Nokia Mobile Data Challenge, in which we predict the next location a user will visit based on their history and the current context. We first highlight some characteristics of the data that make the task difficult, such as sparsity and non-stationarity. We then present three families of models and observe that, even though their average accuracies are similar, their performance varies significantly across users. To take advantage of this diversity, we introduce several strategies for combining models, and show that the combinations outperform any individual predictor.

The second application we examine is predicting the success of crowdfunding campaigns. We collected data on Kickstarter, one of the most popular crowdfunding platforms, in order to predict whether a campaign will reach its funding goal. We show that using only information about money already yields good performance, but that combining this information with social features extracted from Kickstarter's social graph and from Twitter improves early predictions. In particular, predictions made a few hours after the beginning of a campaign are improved by 4%, to reach an accuracy of 76%.

We then move to the realm of politics, and first investigate the ideologies of politicians. Using their opinions on several aspects of politics, gathered through a voting advice application (VAA), we show that the themes that divide politicians the most are the ones usually associated with the left-wing/right-wing and liberal/conservative axes, thus validating the simplified two-dimensional view of the political system that many people use. We draw attention to potentially malicious uses of VAAs by creating a fake candidate profile that gathers twice as many voting recommendations as any other candidate. To counter this, we demonstrate that we can monitor politicians after they are elected, and potentially detect changes of opinion, by combining the data extracted from the VAA with the votes they cast in Parliament.

Finally, we study the outcome of issue votes. We first show that considering vote results at a fine geographical level is sufficient to highlight characteristic geographical voting patterns across a country, as well as their evolution over time. It also enables us to find representative regions that are crucial in determining the national outcome of a vote. We then demonstrate that predicting the actual result of a vote in every region (as opposed to the binary national outcome) is a much harder task, one that requires combining data about the regions with data about the votes themselves to obtain good performance. We compare Bayesian and non-Bayesian models that combine matrix factorization and regression. We show that, here too, combining appropriate models and datasets improves the quality of the predictions, and that Bayesian methods give better estimates of the model's hyperparameters.
