Efficient Distributed Decision Trees for Robust Regression
The availability of massive volumes of data and recent advances in data collection and processing platforms have motivated the development of distributed machine learning algorithms. In numerous real-world applications, large datasets are inevitably noisy and contain outliers, which can dramatically degrade the performance of standard machine learning approaches such as regression trees. To this end, we present a novel distributed regression tree approach that handles large and noisy data by using robust regression statistics, i.e., statistics that are less sensitive to outliers. We propose to integrate error criteria based on robust statistics into the regression tree. A data summarization method is developed and used to improve the efficiency of learning regression trees in the distributed setting. We implemented the proposed approach and baselines on Apache Spark, a popular distributed data processing platform. Extensive experiments on both synthetic and real datasets verify the effectiveness and efficiency of our approach.
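To illustrate the core idea of a robust error criterion, the sketch below scores a candidate split by the sample-weighted median absolute deviation (MAD) of the child nodes instead of the usual variance/MSE. This is a minimal illustration only: the function names are hypothetical, MAD is just one example of a robust spread statistic, and the paper's actual criteria and distributed summarization are not shown.

```python
import numpy as np

def mad(y):
    """Median absolute deviation: a robust spread measure (vs. variance)."""
    med = np.median(y)
    return np.median(np.abs(y - med))

def robust_split_score(y, left_mask):
    """Score a candidate split by the weighted MAD of the two children.

    Lower is better. Because a few extreme targets barely move the median,
    outliers do not dominate the split choice as they would with MSE.
    (Hypothetical helper for illustration, not the paper's exact criterion.)
    """
    left, right = y[left_mask], y[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return np.inf  # degenerate split: no reduction possible
    n = len(y)
    return (len(left) / n) * mad(left) + (len(right) / n) * mad(right)

# Targets forming two clear groups, plus one gross outlier in the second group.
y = np.array([1.0, 1.1, 0.9, 10.0, 10.2, 9.8, 1000.0])
x = np.array([0, 0, 0, 1, 1, 1, 1])

good = robust_split_score(y, x == 0)          # split along the true group boundary
bad = robust_split_score(y, np.arange(7) < 1)  # arbitrary one-point split
```

With a variance-based criterion, the outlier value 1000.0 would dominate both child impurities; under MAD, the split at the true group boundary still scores clearly better than an arbitrary one.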