Efficient Distributed Decision Trees for Robust Regression
The availability of massive volumes of data and recent advances in data collection and processing platforms have motivated the development of distributed machine learning algorithms. In numerous real-world applications, large datasets are inevitably noisy and contain outliers, which can dramatically degrade the performance of standard machine learning approaches such as regression trees. To this end, we present a novel distributed regression tree approach that handles large and noisy data by using robust regression statistics, i.e., statistics that are less sensitive to outliers. We propose to integrate error criteria based on robust statistics into the regression tree. A data summarization method is developed and used to improve the efficiency of learning regression trees in the distributed setting. We implemented the proposed approach and baselines on Apache Spark, a popular distributed data processing platform. Extensive experiments on both synthetic and real datasets verify the effectiveness and efficiency of our approach.
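To illustrate the core idea of a robust error criterion, the sketch below scores a candidate split by the sample-weighted median absolute deviation (MAD) of the child nodes instead of the usual variance/MSE. This is a minimal illustration only: the function names are hypothetical, MAD is just one example of a robust spread statistic, and the paper's actual criteria and distributed summarization are not shown.

```python
import numpy as np

def mad(y):
    """Median absolute deviation: a robust spread measure (vs. variance)."""
    med = np.median(y)
    return np.median(np.abs(y - med))

def robust_split_score(y, left_mask):
    """Score a candidate split by the weighted MAD of the two children.

    Lower is better. Because a few extreme targets barely move the median,
    outliers do not dominate the split choice as they would with MSE.
    (Hypothetical helper for illustration, not the paper's exact criterion.)
    """
    left, right = y[left_mask], y[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return np.inf  # degenerate split: no reduction possible
    n = len(y)
    return (len(left) / n) * mad(left) + (len(right) / n) * mad(right)

# Targets forming two clear groups, plus one gross outlier in the second group.
y = np.array([1.0, 1.1, 0.9, 10.0, 10.2, 9.8, 1000.0])
x = np.array([0, 0, 0, 1, 1, 1, 1])

good = robust_split_score(y, x == 0)          # split along the true group boundary
bad = robust_split_score(y, np.arange(7) < 1)  # arbitrary one-point split
```

With a variance-based criterion, the outlier value 1000.0 would dominate both child impurities; under MAD, the split at the true group boundary still scores clearly better than an arbitrary one.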