G-RUN ENSEMBLE: A Multi-Forcing Observation-Based Global Runoff Reanalysis

River discharge is an Essential Climate Variable (ECV) and is one of the best monitored components of the terrestrial water cycle. Nonetheless, gauging stations are distributed unevenly around the world, leaving many white spaces on global freshwater resources maps. Here, we use a machine learning algorithm and historical weather data to upscale sparse in situ river discharge measurements.

. In recent years, a number of studies have further investigated the potential of optimizing global hydrology models for producing high-resolution discharge estimates, often with an operational focus Harrigan et al., 2020;Lin et al., 2019). This study aims to further constrain previous data-driven estimates of monthly runoff rates derived from the G-RUN data set (Ghiggi et al., 2019). G-RUN is an observation-based runoff reconstruction which employed a machine learning (ML) algorithm to estimate global runoff rates on an observational basis. One drawback of G-RUN is that it is based on a single atmospheric forcing data set (GSWP3; Kim et al., 2017). As a result, the uncertainty related to the forcing data is not accounted for, even though it can be significant, in particular for long-term trends . In this study we aim to account for this uncertainty by using an ensemble of 21 gridded atmospheric forcing data sets, including a set of atmospheric reanalysis, post-processed reanalysis and interpolated-stations data. In the following, we reiterate the methodology detailed in Ghiggi et al. (2019) to produce this ensemble of runoff reconstructions referred to as Global-RUNoff ENSEMBLE (G-RUN ENSEMBLE). The G-RUN ENSEMBLE is evaluated using a database of river discharge observations from large river basins and benchmarked against a comprehensive collection of publicly available global monthly runoff reconstructions spanning the period 1981-2010. The manuscript concludes with examples of new applications enabled by the G-RUN ENSEMBLE.

Monthly River Discharge Data
The Global Streamflow Indices and Metadata Archive (GSIM) Gudmundsson et al., 2018) provides a publicly available collection of streamflow indices at more than 35,000 stations that were obtained by merging existing international and national databases. Prior to production of the data product presented in this study, monthly river discharge data have been screened to remove timeseries with inhomogeneous behavior, unphysical values (i.e., negative values), mislabeled missing values, and uncertain catchment area (see Ghiggi et al., 2019 for details on the full procedure). Stations with catchment area between 10 and 2,500 km 2 have been selected to retrieve the grid cell runoff rates used as observations for model training, while stations of large river basins with catchment area larger than 10,000 km 2 have been used to benchmark the G-RUN ENSEMBLE reconstructions against the GHM runoff simulations.

Monthly Gridded Precipitation and Temperature Data
Gridded observations of precipitation and 2-m air temperature are obtained from 21 global data sets (Figure 1). The choice has been restricted to data sets with spatial resolution equal or higher than 0.5° and spanning a period of at least 40 years. All data were aggregated to monthly resolution and spatially resampled to a common 0.5-degree grid using conservative remapping (Jones, 1999). The considered atmospheric data are classified according to their primary mode of production into interpolated station observations (Harris et al., 2020), atmospheric reanalysis (Dee et al., 2011;Gelaro et al., 2017;Hersbach et al., 2020) and post-processed atmospheric reanalysis (Balsamo et al., 2015;Boogaard et al., 2020;Cucchi et al., 2020;Kim et al., 2017;Lange, 2019;Mengel et al., 2020;Muñoz-Sabater et al., 2021;Reichle et al., 2017;Sheffield, Goteti, & Wood, 2006;Weedon et al., 2014;Weedon et al., 2011). Atmospheric reanalyses assimilate ground and satellite observations to adjust the variables states within numerical weather prediction models. In this study, post-processed atmospheric reanalysis refers to data from an atmospheric reanalysis which have been further adjusted to better match selected observations, e.g., through means of bias correction against an observational reference such GPCC (Becker et al., 2013). Note that the considered atmospheric data sets have different temporal coverage because of design decisions that are often related to a tradeoff between availability and quality of observations. In order to include an additional precipitation data set, MSWEP v2.2 precipitation  has been combined with 2 m temperature from ERA5 (denoted as MSWEP v.2.2*) and ERA5-Land (denoted as MSWEP v.2.2**).

Other Runoff Estimates Used for Benchmarking
The accuracy of the G-RUN ENSEMBLE is benchmarked against a comprehensive set of runoff simulations from global hydrological models (GHM) covering the period 1981-2010. This includes the multi-model ensemble median of the GHMs intercomparison projects ISIMIP2a "nosoc" (Warszawski et al., 2014), eartH2Observe Tier-1 WRR1 , eartH2Observe Tier-2 WRR2 ; the LORA v1.0 product (Hobeichi et al., 2019) resulting from the statistical post-processing of the eartH2Observe WRR1 and WRR2 GHM runoff simulations; the runoff reconstruction used in the Global Drought and Flood Catalog (GDFC) (He et al., 2020); as well as runoff simulations from ERA-Interim (Dee et al., 2011), ERA-Interim/Land (Balsamo et al., 2015), ERA5 (Hersbach et al., 2020) and ERA5-Land reanalysis (Muñoz-Sabater et al., 2021). All runoff simulations are aggregated to the monthly resolution and interpolated to a common 0.5-degree grid using conservative remapping.

Model Setup
Runoff is defined here as the amount of water that is draining from a given land unit (i.e., grid cell) eventually entering the river system, including surface and sub-surface runoff as well as snowmelt. Runoff is difficult to measure over an extended area, but at a monthly timescale, the monthly river discharge measured at the outlet of small catchments divided by the catchment's area can be used as a proxy of the average catchment runoff, provided storage of river water (e.g., in dams, reservoirs) and/or river water losses (e.g., river channel and lake evaporation, irrigation) are minimal. Here observational grid cell runoff is estimated using the procedure described in Ghiggi et al. (2019). To this end, river flow observations from small catchments with an area between 10 and 2,500 km 2 are first assigned to 0.5° grid cells. Following Seneviratne (2015, 2016), a random forest (RF) algorithm (Breiman, 2001) is then used to learn the runoff generation process without the explicit description of the involved hydrological processes. The monthly runoff rate (R) is modeled as a function of antecedent monthly precipitation (P) and monthly near-surface temperature (T) such that where f corresponds to the RF model (RFM) characterized here by 300 trees with a maximum depth of 60 splits and no minimum leaf size; s represents the identifier of the grid cell, t is the time step, and  conditions of the past n months (here n = 6) allowing the RFM to approximate memory effects that influence the runoff generation process. The decision to only consider precipitation and temperature as explanatory variables is motivated by Gudmundsson and Seneviratne (2015), who found that the inclusion of other atmospheric variables as well as selected land parameters (topography and soil texture) did not significantly improve the overall accuracy of the estimate. Furthermore, reducing the number of predictor variables also helped to reduce computational costs significantly. While a more extensive screening of other land parameters is beyond the scope of this study, this could be the subject of future research.
Gridded precipitation and temperature data are then used to reconstruct runoff rates globally, also at ungauged grid cells. Since machine learning algorithm predictions are conditioned by the data used for model training, the following strategy is adopted to characterize the sampling uncertainty. For each forcing data set, 25 runoff reconstructions are generated using a Monte Carlo approach in which each RFM is trained using a random subset of only 60% of all grid cells containing runoff observations across the whole time period spanned by the forcing data set. Each of these 25 reconstructions thus represents the result of an RFM calibrated with slightly different observational data, thus providing a proxy for the effects of sampling uncertainty on the runoff estimate.
The G-RUN ENSEMBLE data repository (https://doi.org/10.6084/m9.figshare.12794075) provides individual ensemble members (25 × 21 = 525 realizations), as well as the ensemble mean of these realizations for each forcing data set. The multi-model median of the G-RUN ENSEMBLE members (combining the estimates obtained with all atmospheric forcing data sets) is referred to as G-RUN ENSEMBLE MMM.

Model Evaluation
A selection of 1,205 river discharge observations from large basins (with areas bigger than 10,000 km 2 ) that are included in GSIM is used to evaluate the accuracy of the G-RUN ENSEMBLE reconstructions and for benchmarking against GHM simulations. Since the RFM is calibrated using only runoff observations from small catchments (with areas smaller than 2,500 km 2 ), the verification of the G-RUN ENSEMBLE can be considered as "out-of-sample," although it must be noticed that some large river basins used for model validation include small sub-watersheds used for model training. Selection criteria of these stations follow the methodology described in Ghiggi et al. (2019). To obtain a common temporal coverage across all forcing data sets and the GHM simulations (except for G-RUN based on WFD which stops in 2001), only river discharge observations of the period 1981-2010 have been selected. Note that in contrast to the G-RUN ENSEMBLE, GHM simulations are rarely provided alongside detailed information on model calibration and tuning. Consequently, it cannot be excluded that observations from large river basins that are used here for model evaluation might have been also used for the GHMs calibration (Alcamo et al., 2003;Alfieri et al., 2020;Döll et al., 2003;Hirpa et al., 2018;Hobeichi et al., 2019;Hunger & Döll, 2008;Nijssen et al., 2001).
In order to compare gridded runoff estimates to observed river discharge from large basins we adopt the approach of Seneviratne (2015, 2016). To this end, river discharge is estimated by spatially averaging the grid cell runoff times series within the basin and multiplying it by the drainage area. At a monthly timescale, the effect of river routing is considered negligible, except for a few very large basins (Allen et al., 2018). As in Ghiggi et al. (2019), six performance metrics have been used to assess the accuracy of the reconstructions in reproducing different aspects of the river discharge time series. The terms p t and o t refer to the predicted and observed time series respectively.
The relative bias (relBIAS) has an optimal value of zero and allows to investigate the presence of systematic errors. A positive (negative) value indicates a general overestimation (underestimation). It is defined as: The ratio of standard deviations (rSD) has an optimal value of one. Values lower than one indicate underestimation, while values higher than one indicate overestimation of the observed variability. It is defined as: The squared correlation coefficient, R2, ranges between zero and one. It measures the degree of the linear association between the predicted time series and the observed one. It is insensitive to the bias. The optimal value is one.
The Nash-Sutcliffe efficiency (NSE), also called model efficiency (Nash & Sutcliffe, 1970), is a measure of the overall predictive skill of the model relative to the long-term mean of the time series. An NSE value of one corresponds to a perfect match between predicted and observed data, while a value lower than zero indicates that model predictions are on average less accurate than those obtained by using the long-term mean of the observed time series (mean(o t )) as predictor. It is defined as: The squared correlation coefficient between the observed and predicted monthly standardized anomalies (i.e., monthly time series with the monthly climatology removed, divided by the long-term standard deviation of each month) is R2 anom . It ranges from zero to one (best value).
The squared correlation coefficient between the observed and predicted monthly climatology is R2 clim . It ranges between zero and one (best value).
The skill metrics of G-RUN ENSEMBLE members at each GSIM station are available in Data Set S1 as basis for future model benchmarking. Figure 2a shows the distribution of the skill metrics of the G-RUN ENSEMBLE members and GHM simulations. A selected subset of observed river discharge time series is compared to the G-RUN ENSEMBLE and the GHM reconstructions in Figure 2b (the full set is provided in Data Set S3). Most runoff estimates tend to overestimate monthly river discharge rates in large basins, but in general the relative bias does not exceed 20%. A similar behavior is observed for the reproduction of the magnitude of monthly variability as illustrated by the rSD metric, even though the G-RUN ENSEMBLE members tend to overestimate the monthly variability less than the GHM simulations. To complement the analysis of the time series magnitude and amplitude, Figure S1 illustrates the average difference in annual runoff between GHM runoff simulations and the G-RUN ENSEMBLE MMM. The GHMs show a complex spatial pattern of lower and higher runoff estimates. Overall, the G-RUN ENSEMBLE MMM does not seems to show a systematic bias with respect to the full ensemble of the GHM simulations ( Figure S1). Additionally, the spatial pattern of average difference in annual runoff between the G-RUN ENSEMBLE members and the G-RUN ENSEMBLE MMM ( Figure S2) display much less variability compared to that of the GHMs. This may be explained by the fact that RFM is by construction less sensitive to bias in precipitation and temperature compared to the GHM parametrizations.

Accuracy of the Monthly Reconstructions
Overall, monthly river discharge dynamics (R2) tends to be better captured by the G-RUN ENSEMBLE members than by the GHMs. The large variations in R2 skill observed for the GHMs suggests a higher sensitivity of the GHM parametrizations to the input meteorological forcing. To highlight regions in which the GHMs struggle in reproducing runoff monthly oscillations, Figure S3 illustrates  G-RUN ENSEMBLE MMM dynamics ( Figure S4), except for the G-RUN ENSEMBLE members forced with MERRA2 and ERA-Interim which disagree in a general manner in the Southern Hemisphere.
The general impression of higher accuracy (Figure 2a) of the G-RUN ENSEMBLE members is confirmed by the NSE skill metrics. Among the GHM reconstructions, LORA v1.0 shows the highest skill, highlighting the benefit of post-processing the GHM runoff simulations. The importance of accurate atmospheric forcing for reproducing the monthly dynamics is highlighted by comparing the best GHM simulations and the best G-RUN ENSEMBLE members: reconstructions forced with MSWEP v2.2, ERA5-Land and ERA5 rank at the top of their respective model category. When considering post-processed atmospheric reanalysis, it also appears that bias correction of precipitation with data from the Global Precipitation Climatology Center (GPCC) (WFDEI-GPCC and WFDE5-GPCC) produces higher skill than bias correcting toward CRU TS (WFDEI-CRU and WFDE5-CRU). In general, the G-RUN ENSEMBLE members show higher accuracy than the GHMs (except the multi-model median of WRR2) in reproducing monthly discharge anomalies (R2 anom ). The GHMs tend to disagree quite strongly from anomalies of the G-RUN ENSEMBLE MMM ( Figures S5), while several G-RUN ENSEMBLE members show disagreement from the G-RUN ENSEM-BLE MMM in Africa ( Figure S6). Concerning the ability to reproduce the seasonal cycle of river discharge (R2 clim ), the performance of the G-RUN ENSEMBLE reconstructions stands out compared to the GHMs. Previous studies already showed that some GHMs struggle in reproducing the seasonality of runoff (Ghiggi et al., 2019;Gudmundsson, Boulange, et al., 2012;Gudmundsson & Seneviratne, 2015).
To disentangle the impact of the meteorological forcing from model uncertainty, Ghiggi et al. (2019) considered a large set of GHMs and RFM estimates that were forced with the same meteorological data. The results highlighted that the lower accuracy of GHMs compared to RFM-based estimates is likely related to issues with GHM structure and parameters. We also note that extensive RFM cross-validation experiments performed in Ghiggi et al. (2019) highlighted that even when removing all observations from large sub-continental regions during training, the RFM predictions remained on average more accurate than the runoff simulations from a large set of GHMs forced with the same meteorological data. Furthermore, it was shown that at the grid cell level, the accuracy of G-RUN is even higher than for large basins. This is related to the nature of the employed approach because RFM is trained at the grid cell level.  spread among the GHMs is much higher than among the G-RUN ENSEMBLE reconstructions (see Figures 4b and S9b for the spatial distribution of uncertainty across the G-RUN ENSEMBLE and the GHMs). The G-RUN ENSEMBLE lies within the GHMs bounds, except in the northern tropics and subtropics. Especially in Southeast Asia, the GHMs simulate higher runoff rates compared to the G-RUN ENSEMBLE (see Figures S1 and S2). Concurrently, in these regions, the G-RUN ENSEMBLE also shows a wider spread in the LTM estimates (see Figure 4b for the LTM spatial uncertainty across the G-RUN ENSEMBLE members), which can partially be related to the uncertainty of precipitation (see Figure S9). The above considerations apply also for the interpretation of the latitudinal profile of the runoff interannual variability (IAV) (Figure 3c). The GHMs tend to have higher annual runoff rates in the northern tropics and subtropics, possibly resulting in higher interannual variability.

Global Runoff Characteristics at Annual Time Scales
GHIGGI ET AL.
10.1029/2020WR028787 8 of 13 Figures S7 and S8 provide a more detailed view on individual G-RUN ENSEMBLE and GHM annual runoff characteristics. As supplementary information, latitudinal profiles, and spatial uncertainty in LTM and IAV of the atmospheric forcing data are reported in Figures S9, S10, S12, and S13.

Limitations of the G-RUN ENSEMBLE
River discharge observations used for model training have been carefully screened to remove time series presenting unphysical values and possible inhomogeneous behaviors introduced by anthropogenic activities. However, we do not exclude that some river discharge observations that are impacted by human activities might have passed these selection steps, for example, if the magnitude of water abstraction/returns did not alter the monthly hydrograph sufficiently to identify a major changing point or if the time series was not long enough to cover past periods of near-natural streamflow. Because the RFM is solely forced with precipitation and temperature, the G-RUN ENSEMBLE does not explicitly account for effects of human water management such as river flow regulation, water withdrawals, or return flows from groundwater abstraction (Arheimer, Donnelly, & Lindström, 2017;Jaramillo & Destouni, 2015;Nazemi & Wheater, 2015a, 2015bVeldkamp et al., 2017;Wada et al., 2017;Wada et al., 2010). Therefore, it is our evaluation that the estimates of the G-RUN ENSEMBLE are relatively close to near-natural conditions and would clearly differ with observations in basins heavily impacted by human activities. Additionally, the G-RUN ENSEMBLE is unlikely to provide reliable long-term reconstructions in mountainous areas where an important portion of total monthly runoff comes from glacier melting, since no information on glacier runoff contribution has been fed to the RFM. We also note that the uncertainty of runoff rates in many mountainous regions is likely underestimated due to the large uncertainty in precipitation (see Figure S12) and the fact that the resolution of the meteorological forcings does not capture the sub-grid variability of precipitation and temperature (with consequences for snowmelt volume and timing).
Furthermore, we also note that the design goal of the G-RUN ENSEMBLE differs from the ones of the GHMs. While the G-RUN ENSEMBLE aims at producing the best possible monthly runoff estimates given the available data, the GHMs simultaneously resolve many hydrological fluxes (i.e., evaporation, soil moisture) at a finer hourly/daily temporal resolution, while maintaining balance of water and energy fluxes. Because of the epistemic uncertainty involved in representing all the relevant hydrological processes and the accumulation of input and model errors over time, it appears reasonable that GHM simulations might be characterized by a reduced predictive skill (and a larger ensemble spread) compared to the G-RUN ENSEMBLE.

G-RUN ENSEMBLE Applications
Figure 4 provides example applications of the G-RUN ENSEMBLE, offering a global view of freshwater resources. Each analysis reports the multi-model median and is complemented with an uncertainty quantification metric made possible through the multi-forcing nature of the G-RUN ENSEMBLE. A similar analysis for the available GHM simulations is provided in Figure S11. Figure 4a displays the multi-model median of the long-term mean annual runoff estimates over the period 1981-2010. Runoff rates vary by 3 orders of magnitude across the Earth, with the highest rates in the tropics and large mountain ranges and lowest rates in the subtropics and major world deserts such as the Sahara. The uncertainty in long-term runoff rates (Figure 4b) is estimated by computing the robust coefficient of variation (defined as the interquartile range divided by the median) across the G-RUN ENSEMBLE members statistics. The G-RUN EN-SEMBLE long-term runoff estimates agree well over North America, Europe, Russia, Australia, Southern Africa, and Eastern South America. Disagreement between the ensemble members occurs along the Andes Mountain Ranges, the mountains of Western United States, Central and East Asia, and in correspondence with uncertainty in the precipitation and temperature forcing data (see Figure S12 and S13). The same analysis performed on the GHMs ( Figure S11b) reveals the much higher uncertainty of the long-term mean of the GHMs compared to the G-RUN ENSEMBLE. Figure 4c highlights locations with large fluctuations in freshwater availability across the period 1981-2010, as indicated by values higher than one of the coefficient of interannual variability (defined as the standard deviation of yearly runoff, divided by the long-term mean). Interannual variability is particularly high in regions that have experienced long-lasting droughts in the last decades: for example, the Fertile Crescent (Trigo, Gouveia, & Barriopedro, 2010), Australia (van Dijk et al., 2013), California (Seager et al., 2015;Swain et al., 2014) and South Africa (Blamey et al., 2018). Figure 4e reveals long-term trends in annual freshwater availability for the period 1981-2010. Following Stahl et al. (2012) trends are computed using the Sen's slope (Sen, 1968) and expressed in percentage change over the 29-year period. Figure 4d shows the percentage agreement on the sign of trends among the G-RUN ENSEMBLE reconstructions. The pattern of change obtained from the GHMs ( Figure S11e) is very similar. Note that since the proper choice of trend tests for runoff time series remains the subject of a scientific debate (Chen & Grasby, 2009;Cohn & Lins, 2005;Radziejewski & Kundzewicz, 2004) and because a pixel-wise application of statistical tests in large data sets can yield spurious results (Wilks, 2016) we follow here the best practice for grid cell scale runoff estimation (Marx et al., 2017;Stahl et al., 2010;Stahl et al., 2012;Thober et al., 2018) and refrain from reporting p-values. For a comprehensive assessment of runoff trends the reader is referred to Gudmundsson et al. (2019) and Gudmundsson et al. (2021). Note also that observed river flow trends can be subject to significant decadal variability and that magnitude and sign of trends can depend on the considered period .

Conclusions
This study builds on the established methodology presented by Gudmundsson and Seneviratne (2015) and Ghiggi et al. (2019) and derives monthly runoff estimates from an ensemble of atmospheric forcing data.
To this end, a machine learning algorithm is trained with runoff observations from a global collection of in situ streamflow observations separately for each atmospheric forcing data set. The resulting multi-forcing ensemble of runoff reconstructions, termed G-RUN ENSEMBLE, allows us to quantify the uncertainty associated to model input data. This publicly available data set is provided on a 0.5° × 0.5° World Geodetic System 1984 (WGS84) grid, and the reconstructions span a period from 1902 to 2019. The G-RUN ENSEM-BLE reconstructions were benchmarked against a comprehensive set of global-scale hydrological model (GHM) simulations, using a large database of river discharge observations as a reference, which can serve as basis also for future global hydrological model intercomparison studies. Overall, the G-RUN ENSEMBLE shows higher accuracy than most GHMs evaluated in this study, especially with respect to the reproduction of the dynamics and seasonality of monthly runoff rates. The analysis also revealed that the accuracy of the reconstruction is dependent on the quality of the forcing data. However, we found that the spread imposed by the atmospheric forcing in the G-RUN ENSEMBLE is small compared to the spread found within a comprehensive ensemble of GHM simulations driven with a smaller subset of possible forcing data. Possible explanations for this behavior include a higher sensitivity of GHMs to biases in the meteorological forcing and a possible accumulation of small errors throughout integration of the governing equations. In summary, the multi-forcing nature of the G-RUN ENSEMBLE allows to quantify the uncertainty associated with the currently available atmospheric forcing data, thereby paving the way for more robust and reliable water resources assessments, climate change attribution studies, hydro-climatic process investigations as well as evaluation, calibration and refinement of the GHMs. We conclude by highlighting that the production of the G-RUN ENSEMBLE would not have been possible without the mobilization of national and international hydrological archives. We call for a continuation of the international efforts to reduce political and technical barriers for the exchange of hydrometeorological data across the scientific community.

Conflict of Interest
The authors declare no conflicts of interest relevant to this study.

Data Availability Statement
The G-RUN ENSEMBLE reconstructions and their associated realizations are publicly available at https:// doi.org/10.6084/m9.figshare.12794075 under Creative Commons Attribution 4.0 International License (CC_BY_4.0). Preliminary data repository available at: https://figshare.com/s/ad6d5cdfbba945d93ad2, DOI has been reserved but will be activated once the manuscript will be published. Full data (>400 GB) will be uploaded before final publication when it will be possible to add the full manuscript reference and DOI to the netCDF-4 attributes.