Patching large missing gaps of dissolved oxygen data in an intermittent stream: comparison of interpolation techniques, including machine learning models
Eco-hydrological models are useful tools for water quality management, but their implementation may require high-resolution boundary condition data, which are often patchy in time because of monitoring costs. In this report, we compare the performance of a gradient boosting machine (GBM) and linear models (LMs) in interpolating missing water temperature (Tw) and dissolved oxygen concentration (O2) data. Over a year and a month of measurements, Tw and O2 data were missing 11% and 87% of the time, respectively. The performance of the models was compared by computing the root mean square error (RMSE) using cross-validation and test sets. The GBM errors on the test sets were 2 to 5 times larger than the cross-validation error, despite a very high accuracy on the training sets (down to 0.044 ± 0.001 [°C] for Tw and 0.007 ± 0.001 [mg/L] for O2). From these results, which are consistent with previous studies, we infer that the GBM is generally able to tackle the task at hand but needs more input data covering all hydrological and climatic conditions. Conversely, the LMs exhibit a more consistent error across the different sets, but their training set error is much higher (down to 0.68 ± 0.016 [°C] for Tw and 0.31 ± 0.040 [mg/L] for O2). A qualitative analysis of the LMs revealed unaccountable behavior when the models were used to interpolate missing data. Overall, this work shows the limits of data-driven models for predicting environmental variables, highlights the importance of test set selection, and suggests that the quantity and relevance of the data should balance the complexity of the system studied.
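The evaluation scheme described in the abstract (RMSE on the training set, under cross-validation, and on a held-out test set) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the series are synthetic, the single predictor (air temperature) and the helper names `rmse`, `fit_line`, `predict`, and `cv_rmse` are hypothetical, and a one-variable least-squares line stands in for the linear models of the report. It is not the author's code or data.

```python
import math
import random


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def predict(model, x):
    """Apply a fitted (intercept, slope) pair to a list of predictor values."""
    a, b = model
    return [a + b * xi for xi in x]


def cv_rmse(x, y, k=5):
    """Mean RMSE over k contiguous folds, each held out once."""
    n = len(x)
    bounds = [round(i * n / k) for i in range(k + 1)]
    scores = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        x_fit, y_fit = x[:lo] + x[hi:], y[:lo] + y[hi:]
        fold_model = fit_line(x_fit, y_fit)
        scores.append(rmse(y[lo:hi], predict(fold_model, x[lo:hi])))
    return sum(scores) / k


# Synthetic, made-up series (NOT the study's data): a seasonal air
# temperature as predictor and a noisy water temperature as target.
rng = random.Random(0)
air = [10 + 8 * math.sin(2 * math.pi * d / 365) + rng.gauss(0, 1) for d in range(400)]
water = [2.0 + 0.7 * a + rng.gauss(0, 0.5) for a in air]

# The last 20% of the record is set aside as a test set and never used for fitting.
split = int(0.8 * len(air))
x_train, y_train = air[:split], water[:split]
x_test, y_test = air[split:], water[split:]

model = fit_line(x_train, y_train)
train_error = rmse(y_train, predict(model, x_train))
cv_error = cv_rmse(x_train, y_train, k=5)
test_error = rmse(y_test, predict(model, x_test))

print(f"train RMSE: {train_error:.3f}")
print(f"CV RMSE:    {cv_error:.3f}")
print(f"test RMSE:  {test_error:.3f}")
```

Comparing the three printed values is the kind of check the abstract relies on: a test error far above the cross-validation error suggests that the test period contains conditions the training data did not cover.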
ARBELLAY_PDM PRINTEMPS 2019.pdf