Introduction

The hydrological performance of GloFAS v4.0 was first evaluated in the calibration context (see GloFAS v4 calibration hydrological model performance). That evaluation used only the 1995 stations involved in the calibration and was based on the full long-term run produced within the calibration exercise, scored with the KGE and its three component scores.

On this page, the model performance is analysed over the final v4.0 reanalysis time series (https://cds.climate.copernicus.eu/cdsapp#!/dataset/cems-glofas-historical?tab=overview), which is not expected to be noticeably different from the one used in the calibration evaluation. This v4.0 simulation is then compared to the previous v3.1 simulation. A major change in v4.0 is the higher resolution: the river network is at 3 arcmin instead of the 6 arcmin used in v3.1. On the plots below, the two GloFAS model versions are denoted as GloFAS-6m-Lisflood (v3.1) and GloFAS-3m-Lisflood (v4.0).

Details on the station selection methodology and other aspects of the verification, including the metrics used, are available on the verification methodology page (placeholder: GloFAS hydrological performance verification methodology).

Executive summary

The general hydrological model performance verification was based on a network of 2293 catchments with at least 1 year of observations. Two subsets were also analysed, one with the fully calibrated stations and one with the non-calibrated stations, both excluding stations with larger reservoir or lake influence. Although this network provides relatively good global coverage, some areas of the world are still poorly observed, with no or very few stations, or with observation records that are too short. The verification analysis can be summarised in a few main conclusions:

  • The new higher resolution v4.0 simulation shows a large skill improvement when all possible stations are considered: the KGE increases by about 0.22 on average, with the median increasing from 0.31 to 0.65. (KGE ranges from minus infinity to 1; typical values are 0.4-0.9, although values well below 0 are possible for very poorly simulated catchments.) The detailed analysis highlights that when only the stations calibrated in both model versions are considered, the improvement is a more moderate 0.08, while for the stations not calibrated in either model it is a much higher 0.82, although most likely measured against an occasionally very low KGE reference in v3.1. This suggests that the improvements in v4.0 do not come only from the better model with higher resolution and better underlying state maps (which is shown by the calibration comparison), or from the much larger number of calibration stations (1995 vs 1226), but also from improvements over non-calibrated catchments, thanks to the new parameter regionalisation approach.
  • GloFAS v4.0 greatly reduced the bias errors in the reanalysis simulation, by about 0.22 on average (bias ranges from -1 to infinity, with typical values not far from 0), with even larger improvements in the non-calibrated station network. However, very large bias errors still remain in v4.0 in many areas of the tropical and subtropical band. In these catchments the simulated mean can be at least double the observed mean. The large bias error differences seem to be the dominant factor behind the changes (mostly improvements) in KGE.
  • The variability errors in the simulation were also clearly reduced in v4.0, by about 0.07 on average (variability has the same value range as bias). Some larger variability errors remain, mainly in tropical areas, in a pattern relatively similar to the bias but with less extreme error levels.
  • The correlation differences between v4.0 and v3.1 show a lot of geographical variability, but on average correlation does not seem to improve in v4.0. There is basically no noticeable change in the median correlation of the verified stations.
  • The timing error also shows large geographical variability, similar to the correlation, and the error statistics highlight no or only very little timing improvement in v4.0.

Observation availability

This general analysis considers all stations that have at least 1 year of good enough quality observation data in the 1979-2021 period (for the calibration the requirement was at least 4 years). It is supplemented with separate station networks that exclude stations with a larger noticeable reservoir or lake impact. In total, 2293 stations were considered for the general v4.0 verification with all stations (Figure 1, left column), 996 for the v4.0 vs v3.1 model comparison with stations used in both calibrations (Figure 1, middle column), and 233 for a third set with stations that were not used in either calibration (Figure 1, right column). The latter two station networks do not include stations with larger reservoir or lake impact.


Figure 1. Number of years of available river discharge observations in the 1979-2021 reanalysis period with the full station list, the calibration station list and the non-calibration station list.

General v4.0 performance

KGE

The general GloFAS v4.0 model performance is measured by the modified Kling-Gupta efficiency (KGE) in Figure 2. High skill (above 0.7) is shown over much of the higher latitudes and also in some southeast Asian and central South American areas. The lowest KGE values, including some catchments with no skill at all (below -0.41), are mainly spread across some tropical areas, often in the central and southern USA and Mexico and in some areas of Africa, often in drier climates.
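As a minimal sketch of how such a score can be computed, the snippet below implements the modified KGE in its commonly published form, assuming the bias ratio is the ratio of the means and the variability ratio is the ratio of the coefficients of variation. The function names and the sample series are illustrative only and are not taken from the GloFAS code; the exact definitions used here are on the verification methodology page. The example also illustrates where a no-skill threshold of about -0.41 can come from: a simulation that is constant at the observed mean scores 1 - sqrt(2).

```python
import math
from statistics import fmean, pstdev


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def modified_kge(sim, obs):
    """Return (kge, r, beta, gamma) for paired simulated/observed discharge."""
    r = pearson(sim, obs)                        # correlation component
    beta = fmean(sim) / fmean(obs)               # bias ratio (optimal 1)
    cv_sim = pstdev(sim) / fmean(sim)            # coefficient of variation
    cv_obs = pstdev(obs) / fmean(obs)
    gamma = cv_sim / cv_obs                      # variability ratio (optimal 1)
    kge = 1 - math.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)
    return kge, r, beta, gamma


# Hypothetical daily discharge values, for illustration only.
obs = [10.0, 20.0, 40.0, 30.0, 15.0]

perfect = modified_kge(obs, obs)                         # KGE close to 1
mean_flow = modified_kge([fmean(obs)] * len(obs), obs)   # KGE = 1 - sqrt(2)

print(round(perfect[0], 3), round(mean_flow[0], 3))
```

Treating the correlation of a constant series as 0 is a convention chosen for this sketch; a production implementation may handle that case differently.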


Figure 2. KGE of the GloFAS v4.0 simulation.

Bias, variability and correlation

The KGE component scores (Figures 3, 4 and 5) highlight that much of the lower KGE skill comes from the often high and mainly positive bias, and also from larger variability errors. The bias ratio error is over 1 for many catchments in the tropical belt (please note that in this version of the bias 0 is the optimal value), which means that the simulation average is more than double the observation average (i.e. twice as high as it should be). On the other hand, the variability errors tend to be negative: many tropical catchments show too low variability in the simulations, often one third to one half less than in the observations (-0.33 to -0.5), or even at least 50% less than in the observations (darkest red).

The correlation is more homogeneous, even though many of the low KGE areas also show low correlation, with exceptions, such as the upstream part of the Niger river basin, or some catchments in the Nile basin, which show high correlation but at the same time really high positive bias and some larger variability errors. 
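To make the value ranges above concrete, here is a small worked sketch. It assumes that the 0-centred bias and variability scores are simply the corresponding KGE ratios minus 1; this is an assumption for illustration (the exact definitions are on the verification methodology page), and the helper name `centred_error` is hypothetical.

```python
def centred_error(ratio):
    """Convert a KGE ratio (optimal value 1) to a 0-centred error (optimal 0)."""
    return ratio - 1.0


# Simulated mean is 2.5 times the observed mean: bias error above 1,
# i.e. the simulation average is more than double the observed average.
bias = centred_error(250.0 / 100.0)

# Simulated variability is one third lower than observed.
var_third = centred_error(2.0 / 3.0)

# Simulated variability is half of the observed variability.
var_half = centred_error(0.5)

print(bias, round(var_third, 2), var_half)
```

Under this assumption the scores are bounded below by -1 (a ratio of 0) and unbounded above, which matches the value range quoted for the bias and variability in the executive summary.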


Figure 3. Bias ratio error of the GloFAS v4.0 simulation.


Figure 4. Variability ratio error of the GloFAS v4.0 simulation.


Figure 5. Pearson correlation of the GloFAS v4.0 simulation.

Timing

The timing error shows quite a lot of areal variability (Figure 6). Some of this probably comes from the potentially short sample period, which makes the verification scores less robust. In addition, some of the larger errors in the highly variable areas can come from catchments with lower quality simulations and a less clear signal, i.e. no clear peak/trough structure, where time-lagging the simulation changes the correlation only a little or not at all.

Still, some patterns emerge. The errors are generally more negative than positive, i.e. the GloFAS v4.0 river discharge simulation is too early, with peaks happening earlier than in the observations. This is the case in many catchments in the higher latitudes, in Amazonia and in Australia. In terms of magnitude, the larger errors correspond to timing problems of 5-10 days or even more than 10 days.
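The idea of a lag-based timing error can be sketched as follows. This is a minimal illustration under stated assumptions, not the GloFAS implementation: it assumes the timing error is the shift (in days) of the simulation that maximises its correlation with the observations, searched within a capped window, with negative values meaning that the simulation is too early. The function names and the sample hydrograph are hypothetical.

```python
import math
from statistics import fmean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def timing_error(sim, obs, max_lag=60):
    """Lag (days) pairing sim[t + lag] with obs[t] that maximises correlation.

    Negative result: the simulated signal arrives before the observed one.
    """
    n = len(obs)
    best_lag, best_r = 0, pearson(sim, obs)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            s, o = sim[lag:], obs[:n - lag]
        else:
            s, o = sim[:n + lag], obs[-lag:]
        if len(s) < 3:          # too little overlap to correlate
            continue
        r = pearson(s, o)
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag


# Hypothetical single-peak hydrograph, for illustration only.
obs = [1.0] * 10 + [2.0, 5.0, 9.0, 6.0, 4.0, 3.0, 2.0] + [1.0] * 10
early = obs[3:] + [1.0] * 3      # same signal, three days too early
late = [1.0] * 2 + obs[:-2]      # same signal, two days too late

print(timing_error(early, obs, max_lag=10), timing_error(late, obs, max_lag=10))
```

For a catchment without a clear peak/trough structure, the correlation changes little across lags, so the selected lag becomes sensitive to noise, which is consistent with the reduced robustness described above.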


Figure 6. Timing error of the GloFAS v4.0 simulation.

General v4.0 vs v3.1 performance comparison

The v4.0 performance is compared with the previous v3.1 in three flavours: one that uses all possible stations, regardless of lake and reservoir impact, and two that include only points with at most a small reservoir or lake influence. One of these two is the calibration comparison, i.e. with points used in both the v4 and v3 calibrations, while the other includes only points that were used in neither calibration.

KGE

The new higher resolution GloFAS v4.0 outperforms the earlier v3.1 almost everywhere (Figure 7). Exceptions are mainly in the eastern USA, Amazonia and western Europe. In other areas, apart from the odd catchment, v4.0 is better or much better. In many of the tropical catchments and also in central and southern North America, the KGE improvement is larger than 0.5 over a very large area. The cumulative KGE distributions highlight that, with all stations included, the median improves from about 0.31 to 0.65, with +0.22 as the median of the KGE differences. Moreover, while about 25% of catchments in v3.1 had KGE below -1, in v4.0 this has decreased to only 7%.
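Note that the change in the median KGE and the median of the per-station KGE differences are two different statistics, which is why they need not agree. The sketch below illustrates this with made-up KGE values for five hypothetical stations (not GloFAS data), and also shows how a share of stations below a threshold can be computed.

```python
from statistics import median


def share_below(values, threshold):
    """Fraction of stations with a score below the given threshold."""
    return sum(1 for v in values if v < threshold) / len(values)


# Hypothetical paired KGE values for the same five stations.
kge_v3 = [-2.0, -0.5, 0.3, 0.6, 0.8]
kge_v4 = [0.7, 0.4, 0.5, 0.65, 0.9]

# Per-station differences (v4.0 minus v3.1); positive means improvement.
diffs = [new - old for old, new in zip(kge_v3, kge_v4)]

change_of_median = median(kge_v4) - median(kge_v3)   # about 0.35
median_of_change = median(diffs)                     # about 0.2

print(round(change_of_median, 2), round(median_of_change, 2))
print(share_below(kge_v3, -1.0), share_below(kge_v4, -1.0))
```

The median of the differences describes the typical change at an individual station, while the two medians describe how the overall distributions shift.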

When only stations used in both the v4 and v3 calibrations are considered, also excluding the stations with larger reservoir or lake influence (2nd column in Figure 7), the geographical distribution of the KGE differences is similar to the full picture in the 1st column of Figure 7, but the differences look more modest. Here, differences can only come from better calibration methodologies and better general model quality in v4, such as the higher resolution, the better river network and other improved features such as better soil maps. The median KGE improves from 0.68 to 0.77, with +0.08 as the median of the KGE differences, which is smaller than for the full station list but still very noticeable.

Another aspect of the v4 vs v3 comparison is the non-calibrated catchments, which were used in neither of the model calibrations. For these areas, v4 brought a major improvement by transferring the calibrated parameters to non-calibrated catchments with a regionalisation method. Indeed, v4.0 shows much higher KGE in general over these non-calibrated catchments, with only very few exceptions. The median of the 233 catchments in this category improves from -1.02 to +0.125, with +0.82 as the median of the KGE differences.

Clearly, the general hydrological improvement is noticeable for the common calibration stations but much larger for the non-calibration stations, quite possibly highlighting the impact of the regionalisation.


Figure 7. KGE error difference maps between GloFAS v4.0 and v3.1 simulations (top row) and cumulative distributions of KGE for both v4.0 and v3.1 (bottom row). Using all points (1st column), using only calibration points for both models without larger reservoir or lake influence (2nd column) and non-calibration points for both models without larger reservoir or lake influence (3rd column).

Bias

The bias, measured by the 0-centred version of the KGE's bias ratio component (bias), very clearly contributes to the improved KGE through drastically reduced bias errors in v4.0 (Figure 8). The first row in Figure 8 shows the difference in absbias, the absolute value of bias, i.e. the difference in bias error magnitude between v4.0 and v3.1. The large impact of the bias is broadly the same for all station selections: the full list (Figure 8, 1st column), the calibrated network (Figure 8, 2nd column) and the non-calibrated network (Figure 8, 3rd column). The geographical distribution of the differences is very similar to the KGE picture in Figure 7, with the tropics in general showing very large bias improvement, where v4.0 often more than halves the bias ratio error of v3.1.

The cumulative distributions of the bias highlight that the bias error is generally lower in v4.0, seemingly everywhere. In fact, the distribution of the actual bias difference values (not shown here) highlights that about 85% of the catchments have lower bias ratio error in v4.0 than in v3.1. Figure 8 (2nd row, 1st graph) also highlights that the high median bias of 0.39 in v3.1 decreased to only 0.05 in v4.0, with -0.22 as the median of the absbias differences (graph not shown here). This confirms that the new v4 model delivers an almost optimal bias in a global average sense, and that the improvement in bias error magnitude (measured by absbias) is a very large -0.22 on the basis of all stations that could be verified. For the calibration stations the median bias changes from 0.14 to 0.02, with -0.09 as the median of the absbias differences, while for the non-calibrated stations it changes from 1.92 to 0.40, with -0.88 as the median of the absbias differences. This confirms the picture seen for the KGE, with the calibrated stations showing much smaller improvement in bias than the non-calibrated stations.
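The distinction between the signed bias and its magnitude (absbias) can be sketched as below. The values are made up for four hypothetical stations and the helper name is illustrative; the sketch only assumes that the absbias difference is the absolute bias of v4.0 minus the absolute bias of v3.1, so that negative values mean improvement.

```python
from statistics import median


def abs_error_difference(new, old):
    """Per-station change in error magnitude; negative means improvement."""
    return [abs(n) - abs(o) for n, o in zip(new, old)]


# Hypothetical 0-centred bias values for the same four stations.
bias_v3 = [1.2, -0.4, 0.3, 2.0]
bias_v4 = [0.5, -0.1, -0.35, 0.4]

absbias_diff = abs_error_difference(bias_v4, bias_v3)

# Share of stations where the bias magnitude decreased in v4.0.
improved = sum(1 for d in absbias_diff if d < 0) / len(absbias_diff)

print([round(d, 2) for d in absbias_diff], improved, round(median(absbias_diff), 2))
```

The third station shows why the magnitude is compared rather than the signed value: its bias moves from 0.3 to -0.35, which is a lower signed value but a slightly larger error. The same construction applies to absvar and abstiming.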


Figure 8. Absbias error difference maps between GloFAS v4.0 and v3.1 simulations (top row) and cumulative distributions of bias for both v4.0 and v3.1 (bottom row). Using all points (1st column), using only calibration points for both models without larger reservoir or lake influence (2nd column) and non-calibration points for both models without larger reservoir or lake influence (3rd column).

Variability

The variability, measured by the 0-centred version of the KGE's variability ratio component (var), shows a quite homogeneous geographical distribution globally (Figure 9, top row). Improvement in v4.0, i.e. a negative absvar difference, is the overwhelming picture, except for the non-calibrated stations, which look more mixed. No area really emerges with a clear cluster of better variability in v3.1 (i.e. blue dots). It is also clear that the variability improvement is smaller than the bias improvement seen in Figure 8: there are far fewer dark red stations in Figure 9 than in Figure 8.

The cumulative distributions of var confirm these conclusions. The purple curve (v4.0) is very clearly more centred on the optimal variability line at 0 (centre of the graphs), a little less so with the calibrated stations only and more so with all the stations. The non-calibrated stations behave differently, with little difference between the two versions, reflecting the rather mixed picture seen in the absvar difference map in Figure 9.

The median var changes from -0.10 in v3.1 to -0.03 in v4.0 for the all-station case, with -0.07 as the median of the absvar differences. For the calibration stations the improvement is from -0.06 to -0.02, with -0.04 as the median of the absvar differences, while for the non-calibrated stations it is from -0.24 to -0.15, with -0.05 as the median of the absvar differences. These numbers also confirm that the variability error improved in v4.0, but less than the bias error did (Figure 8). Moreover, the difference between calibrated and non-calibrated catchments is less pronounced than it was for the bias.


Figure 9. Absvar error difference maps between GloFAS v4.0 and v3.1 simulations (top row) and cumulative distributions of var for both v4.0 and v3.1 (bottom row). Using all points (1st column), using only calibration points for both models without larger reservoir or lake influence (2nd column) and non-calibration points for both models without larger reservoir or lake influence (3rd column).

Correlation

The correlation shows a very mixed picture globally, with slightly more catchments improving than deteriorating (Figure 10, top row). The most prominent cluster of correlation improvement is in central North America. The mixed picture is similar for all three station selections (the three columns).

The cumulative distributions confirm that v4.0 provides only marginal improvement over v3.1 in correlation. For the high correlations v3.1 even seems to be very slightly better, while v4.0 is noticeably better for low to medium correlations. For the calibrated stations the difference is even smaller, while for the non-calibrated stations v3.1 actually seems to be better. It seems that the ups and downs of the simulations could not be improved very noticeably by the v4 model.

Regarding the actual correlation values, the median changes from 0.748 in v3.1 to 0.759 in v4.0 for the all-station case, with 0.000 as the median of the correlation differences, i.e. no change at all on average. For the calibration stations, the median changes from 0.817 to 0.816 (so actually a very slight decrease), with -0.002 as the median of the correlation differences, while for the non-calibrated stations it changes from 0.672 to 0.629, with -0.006 as the median of the correlation differences. These numbers also confirm that the correlation aspect of the river discharge simulation improved only marginally in v4.0 when measured using all stations, while the calibration station comparison shows practically no change and the non-calibration comparison shows a small deterioration.


Figure 10. Correlation error difference maps between GloFAS v4.0 and v3.1 simulations (top row) and cumulative distributions of correlation for both v4.0 and v3.1 (bottom row). Using all points (1st column), using only calibration points for both models without larger reservoir or lake influence (2nd column) and non-calibration points for both models without larger reservoir or lake influence (3rd column).

Timing

The timing error shows a lot of geographical variability (Figure 11), especially for the full station list, which includes a larger number of stations with shorter observation periods and also stations with large reservoir or lake influence. Both can make the timing error less robust and more difficult to interpret. The calibration selection (middle column) is therefore a safer basis to look for patterns. Indeed, one can see that many catchments in Amazonia got worse in v4.0, as did quite a few in Russia and in southeast Asia. In contrast, eastern South America and the central USA are areas where the typical change is better timing in v4.0.

The cumulative distributions of the timing error itself highlight some interesting behaviour. Because the timing error changes only in increments of 1 day, the median is less meaningful. Still, it is obvious that the purple v4.0 curve is consistently to the left of the orange v3.1 curve, meaning that v4.0 has made the simulation noticeably earlier. In this regard, v4.0 is perhaps slightly worse, as the proportion of 1-day errors increased slightly, while the proportion of 0-day errors decreased slightly compared with v3.1.

In terms of numbers, it is better to consider the mean, which is more meaningful for the timing error because the error is capped at 60 days and so cannot become extremely large. This is unlike, for example, KGE (which can be very negative) or bias (which can be very positive), where extreme values make the mean non-representative. For the full station list, the average timing error in v3.1 is -1.68 and the average abstiming is 5.40, which change to -2.26 and 5.04 in v4.0. For the calibration stations these numbers are -0.86 and 3.03 in v3.1 and -2.70 and 3.26 in v4.0, while for the non-calibration stations they are -1.29 and 6.92 in v3.1 and 0.40 and 7.85 in v4.0. These numbers confirm that, for the full and the calibration station lists, v4.0 has slightly more negative timing errors than v3.1. Given that the calibration group shows a small deterioration in the magnitude of the errors while the full group shows a small improvement, they also suggest that there is probably no significant difference between v3.1 and v4.0, or that a difference cannot be demonstrated with the timing metric used and, more importantly, with the available observation time series.
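The reasoning for preferring the mean here can be illustrated with a short sketch using made-up timing errors (in days) for ten hypothetical stations. It assumes that abstiming is the absolute value of the timing error and that errors are capped at 60 days in magnitude; the helper name `cap` is illustrative.

```python
from statistics import fmean, median


def cap(error_days, limit=60):
    """Clip a timing error to the range [-limit, limit] days."""
    return max(-limit, min(limit, error_days))


# Hypothetical integer timing errors; negative means too early.
raw = [-75, -6, -2, -1, 0, 0, 0, 1, 4, 10]
timing = [cap(t) for t in raw]

median_timing = median(timing)                       # coarse for integer data
mean_timing = fmean(timing)                          # signed tendency
mean_abstiming = fmean([abs(t) for t in timing])     # typical error magnitude

print(median_timing, mean_timing, mean_abstiming)
```

In this example the median is 0 even though the errors lean early, while the signed mean captures the tendency and the mean abstiming captures the magnitude. Because the values are capped, a single outlier cannot dominate either mean.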


Figure 11. Abstiming error difference maps between GloFAS v4.0 and v3.1 simulations (top row) and cumulative distributions of timing error for both v4.0 and v3.1 (bottom row). Using all points (1st column), using only calibration points for both models without larger reservoir or lake influence (2nd column) and non-calibration points for both models without larger reservoir or lake influence (3rd column).