Infoscience

master thesis

Patching large missing gaps of dissolved oxygen data in an intermittent stream: comparison of interpolation techniques, including machine learning models

Arbellay, Pascal
October 5, 2019

Eco-hydrological models are useful tools for water quality management, but their implementation may require high-resolution boundary condition data, which are often patchy in time because of monitoring costs. In this report, we compare the performance of a gradient boosting machine (GBM) and linear models (LMs) for interpolating missing water temperature (Tw) and dissolved oxygen concentration (O2) data. Over a measurement period of one year and one month, Tw and O2 data were missing 11% and 87% of the time, respectively. Model performance was compared by computing the root mean square error (RMSE) using cross-validation and test sets. The GBM errors on the test sets were 2 to 5 times larger than the cross-validation error, in spite of a very high accuracy on the training sets (down to 0.044 ± 0.001 [°C] for Tw and 0.007 ± 0.001 [mg/L] for O2). From these results, which are consistent with previous studies, we infer that the GBM is generally able to tackle the task at hand but needs more input data covering all hydrological and climatic conditions. Conversely, the LMs exhibit a more consistent error across the different sets, but their training set error is much higher (down to 0.68 ± 0.016 [°C] for Tw and 0.31 ± 0.040 [mg/L] for O2). A qualitative analysis of the LMs revealed unaccountable behavior when the model was used to interpolate missing data. Overall, this work shows the limits of data-driven models for the prediction of environmental variables, highlights the importance of test set selection, and suggests that the quantity and relevance of the data should balance the complexity of the system studied.
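The gap between cross-validation error and test set error described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic (not the thesis measurements), the model is a one-predictor least-squares line rather than the thesis's GBM or LMs, and the names `rmse`, `fit_line`, and `kfold_rmse` are hypothetical helpers. It only shows how a model can score well under shuffled k-fold cross-validation yet do much worse on a held-out block whose conditions were never observed during training.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def kfold_rmse(x, y, k=5, seed=0):
    """Mean RMSE over k shuffled folds (each fold is held out once)."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    scores = []
    for i in range(k):
        fold = idx[i::k]
        held_out = set(fold)
        train = [j for j in idx if j not in held_out]
        a, b = fit_line([x[j] for j in train], [y[j] for j in train])
        preds = [a + b * x[j] for j in fold]
        scores.append(rmse([y[j] for j in fold], preds))
    return sum(scores) / k


# Synthetic, purely illustrative data (NOT from the thesis).
rng = random.Random(42)
# "Observed" period: predictor between 0 and 15, linear response plus noise.
x_obs = [rng.uniform(0.0, 15.0) for _ in range(200)]
y_obs = [2.0 + 0.8 * xi + rng.gauss(0.0, 0.3) for xi in x_obs]
# "Gap" period: warmer conditions never seen in training, flatter response.
x_gap = [rng.uniform(15.0, 25.0) for _ in range(100)]
y_gap = [14.0 + 0.2 * (xi - 15.0) + rng.gauss(0.0, 0.3) for xi in x_gap]

cv_error = kfold_rmse(x_obs, y_obs)
a, b = fit_line(x_obs, y_obs)
test_error = rmse(y_gap, [a + b * xi for xi in x_gap])
print(f"cross-validation RMSE: {cv_error:.2f}")
print(f"held-out block RMSE:   {test_error:.2f}")
```

In this sketch the shuffled folds share the training conditions, so cross-validation mostly reflects the noise level, while the held-out block measures extrapolation to unseen conditions. This is one way to read the abstract's point that test set selection matters.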

Name: ARBELLAY_PDM PRINTEMPS 2019.pdf
Type: Publisher's Version
Version: http://purl.org/coar/version/c_970fb48d4fbd8a85
Access type: restricted
License Condition: Copyright
Size: 9.57 MB
Format: Adobe PDF
Checksum (MD5): 3950d37f16c25d2ad1c7a00d8f8cc7d3


Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved