Using Verification Metrics alongside the Forecast
It is useful to have some measures of the current performance and biases of the IFS. Users can assess from Reliability and ROC diagrams whether the forecast model is:
- effective in capturing an event (e.g. upper tercile rainfall),
- tending to over- or under-forecast with different probabilities of an event,
- tending to forecast events that actually happen while minimising false alarms (forecasts of events that do not occur).
It is vital that users understand, from the outset, the general characteristics of the model forecasts relative to the subsequent verifying observations (e.g. whether or not the model typically over-forecasts or under-forecasts certain types of outcome). Users should then interpret any forecast signals accordingly; usually this means being wary of over-stating the significance of signals that have historically been unreliable and/or unskilful. Such a strategy should be applied at all lead times, but it is particularly important for longer-lead forecasts (such as monthly and seasonal).
ECMWF provides a number of verification metrics to use in this way, such as anomaly correlation coefficients, reliability diagrams and ROC curves, which have all been computed using the re-forecasts.
Brier Score
Brier Score (BS) is a measure, over a large sample, of the correspondence between each forecast probability and the frequency of occurrence of the verifying observations (e.g. on average, when rain is forecast with probability p, it should occur with that same frequency p). It is computed as the mean squared difference between the forecast probabilities and the observed outcomes (1 where the event occurred, 0 where it did not); values lie between 0 (perfect) and 1 (consistently wrong). Observed frequency can be plotted against forecast probability as a graph (the Reliability diagram, described below); perfect correspondence means the graph lies along the diagonal, and departures from the diagonal contribute to the reliability component of the Brier Score.
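As an illustration of the calculation, a minimal sketch in Python using NumPy; the arrays of forecast probabilities and binary observed outcomes here are hypothetical, not taken from IFS output:

```python
import numpy as np

def brier_score(forecast_prob, observed):
    """Brier Score: mean squared difference between forecast probabilities and
    binary outcomes (1 = event occurred, 0 = it did not); 0 is perfect."""
    forecast_prob = np.asarray(forecast_prob, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return np.mean((forecast_prob - observed) ** 2)

# Hypothetical sample: probabilities of upper-tercile rainfall and whether it occurred
p = np.array([0.9, 0.7, 0.2, 0.1, 0.6])
o = np.array([1, 1, 0, 0, 1])
print(brier_score(p, o))  # 0.062 for this small sample
```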
Distribution of forecast probabilities
The distribution of forecast probabilities gives an indication of how decisively the forecast discriminates between occurrence and non-occurrence of an event. The probabilities are plotted as a histogram to give an indication of the confidence that can be placed in model performance:
- A U-shaped distribution (i.e. a higher proportion of forecast probabilities occur at each end of the histogram) implies a clearer decision on whether an event will or won't occur and gives a higher confidence in model performance.
- A peaked distribution (i.e. a higher proportion of forecast probabilities occur in the centre of the histogram) implies a more equivocal decision on whether an event will or won't occur and gives much less confidence in model performance.
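For illustration, a minimal sketch of how such a histogram can be built when the forecast probability is taken as the proportion of ensemble members predicting the event (Python with NumPy; the ensemble data here are randomly generated and purely hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble: 500 forecast cases, 51 members each, with case-to-case variation
case_mean = rng.normal(loc=18.0, scale=5.0, size=(500, 1))
t2m_members = case_mean + rng.normal(loc=0.0, scale=2.0, size=(500, 51))

# Forecast probability of the event "T2m > 20 C" = fraction of members exceeding the threshold
event_prob = (t2m_members > 20.0).mean(axis=1)

# Histogram of forecast probabilities in 10% bins; with the case-to-case spread larger
# than the member spread, the histogram tends towards the U-shape described above
counts, bin_edges = np.histogram(event_prob, bins=np.linspace(0.0, 1.0, 11))
for lo, hi, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{lo:.1f}-{hi:.1f}: {n}")
```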
Note that where there are only a few entries for a given probability on the histogram, confidence in the Reliability diagram is reduced for that probability. Thus in Fig8.4-1 the predominance of probabilities below 0.2 and above 0.9 suggests there can be some confidence that, when predicting lower tercile climatological temperatures at 2m, the IFS tends to be over-confident that the event will occur and under-confident that it won't. However, there are few probabilities on the histogram between 0.2 and 0.9, which suggests that it would be unsafe to draw similar deductions from the Reliability diagram within this probability range. Conversely, in Fig8.4-2 the majority of probabilities lie between 0.2 and 0.5; reliability within this range appears fairly good, while there is much less confidence in model performance for over- or under-forecasting an event. This is as expected as the forecast range becomes longer.
The Reliability diagram
The reliability diagram gives a measure of the model's tendency to over- or under-forecast an event.
The diagram shows the relationship between the forecast probability of an event and the observed frequency of that event (e.g. the probability, measured as the proportion of ensemble members, that 2m temperature will be greater than 20°C, plotted against the frequency with which that event was subsequently observed, assessed over the re-forecast period). The plotted points lie to the right of the diagonal where there is over-forecasting (e.g. where rain is forecast with 100% probability but rain has actually been observed on only 80% of occasions) and to the left where there is under-forecasting. Ideally points should lie on the diagonal. The size of the departure from the diagonal indicates the magnitude of the over- or under-forecasting error.
A common feature of reliability diagrams is that the profile of the forecasts (red line in Fig8.4-1) has a shallower slope than the diagonal, but crosses it somewhere near the climatological value (blue line intersection). This means that the forecast has a tendency to be over-confident. Users should adjust forecast probabilities, even if departures from the diagonal are only small, to offset the tendency to over-forecast frequently observed events and to under-forecast rather more rare events.
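A minimal sketch of how the points of a reliability diagram can be derived from paired forecast probabilities and binary outcomes (Python with NumPy; the bin width and variable names are illustrative, not the exact convention used for the ECMWF plots):

```python
import numpy as np

def reliability_points(forecast_prob, observed, n_bins=10):
    """For each forecast-probability bin, return the mean forecast probability,
    the observed frequency of the event, and the number of cases in the bin."""
    forecast_prob = np.asarray(forecast_prob, dtype=float)
    observed = np.asarray(observed, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # np.digitize assigns each probability to a bin 1..n_bins (clip the p = 1.0 cases)
    which_bin = np.clip(np.digitize(forecast_prob, edges), 1, n_bins)
    points = []
    for b in range(1, n_bins + 1):
        in_bin = which_bin == b
        if in_bin.any():
            points.append((forecast_prob[in_bin].mean(),   # x: mean forecast probability
                           observed[in_bin].mean(),        # y: observed frequency
                           int(in_bin.sum())))             # sample size in the bin
    return points
```

Points whose observed frequency falls below the mean forecast probability of the bin indicate over-forecasting at that probability, and the sample size per bin indicates how much weight each point deserves, as discussed above.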
Current Reliability Diagrams (which include the distribution of forecast probabilities) are available on Opencharts (days 4, 6, and 10 only).
The ROC diagram
The ROC diagram gives a measure of the capacity to discriminate when events are more likely to happen. It shows the effectiveness of the IFS in forecasting an event that actually happens (Probability of Detection or Hit Rate) while balancing this against the undesirable cases of predicting an event that fails to occur (False Alarm Rate). The effectiveness is also known as the 'resolution' of the forecasting system (not to be confused with spatial and temporal resolution).
A system which always forecasts climatological probabilities, for example, would have no discrimination ability (i.e. zero resolution). The resolution can be investigated using the Relative Operating Characteristic (ROC) diagram, which plots Hit Rate on the y-axis against False Alarm Rate on the x-axis. Ideally the Hit Rate should be high and the False Alarm Rate low (i.e. the graph should lie well towards the top left corner), and generally the Hit Rate should be better than the False Alarm Rate (i.e. values should lie above the diagonal).
Where a ROC graph:
- arches towards the top left of the diagram, then the model is effective at forecasting events that occur while giving few warnings of events that don't.
- follows the diagonal, then the Hit Rate and False Alarm Rate are similar and the model has no ability to discriminate events that occur from those that don't.
- lies below the diagonal, then the False Alarm Rate exceeds the Hit Rate and the model mostly warns of events that don't occur while capturing few that do.
The ROC score is the area beneath the graph on the ROC diagram and lies between 1 (perfect capture of events) and 0 (consistently warning of events that don't happen); a value of 0.5 indicates no discrimination ability. Fig8.4-1 shows high effectiveness in forecasting events (ROC score 0.859) while Fig8.4-2 shows reduced effectiveness (ROC score 0.593). This is as expected as the forecast range becomes longer.
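A minimal sketch of how a ROC curve and its area can be computed by sweeping the probability threshold at which a warning is issued (Python with NumPy; the threshold set and variable names are illustrative, and the operational ECMWF computation may differ in detail):

```python
import numpy as np

def roc_curve(forecast_prob, observed, n_thresholds=21):
    """Hit Rate and False Alarm Rate as the warning threshold is varied,
    plus the area under the curve (0.5 = no discrimination)."""
    forecast_prob = np.asarray(forecast_prob, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    n_events = max(observed.sum(), 1)
    n_non_events = max((~observed).sum(), 1)
    hit_rate, false_alarm_rate = [], []
    # Sweep the threshold from just above 1 (never warn) down to 0 (always warn)
    for t in np.linspace(1.0 + 1e-9, 0.0, n_thresholds):
        warning = forecast_prob >= t
        hit_rate.append(np.sum(warning & observed) / n_events)
        false_alarm_rate.append(np.sum(warning & ~observed) / n_non_events)
    hr, fa = np.array(hit_rate), np.array(false_alarm_rate)
    area = np.sum((fa[1:] - fa[:-1]) * (hr[1:] + hr[:-1]) / 2.0)  # trapezoidal rule
    return hr, fa, area
```

An area close to 1 corresponds to the curve arching towards the top left, as described above; an area close to 0.5 corresponds to the diagonal.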
Current ROC Diagrams are available on Opencharts (from day 5 onwards).
Fig8.4-1: Reliability Diagram (left) and ROC diagram (right) regarding lower tercile for T2m in Europe area for week1 (day5-11), DT:20 Jun 2019.
Fig8.4-2: Reliability Diagram (left) and ROC diagram (right) regarding lower tercile for T2m in Europe area for week5 (day19-32), DT:20 Jun 2019.
In the above diagrams:
- BrSc = Brier Score (BS), LCBrSkSc = Brier Skill Score (BSS).
- BS_REL = Forecast reliability and BS_RSL = Forecast resolution, with respect to observations.
- BSS_RSL = Forecast resolution and BSS_REL = Forecast reliability, with respect to climatology.
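These components follow the standard (Murphy) decomposition of the Brier Score into reliability, resolution and uncertainty, BS = REL - RES + UNC, with a skill score obtainable as (RES - REL)/UNC. A minimal sketch of the calculation from binned forecast probabilities (Python with NumPy; the bin choices and normalisation conventions used in the ECMWF diagnostics may differ):

```python
import numpy as np

def brier_decomposition(forecast_prob, observed, n_bins=10):
    """Murphy decomposition: BS = reliability - resolution + uncertainty.
    Reliability (cf. BS_REL) should be small, resolution (cf. BS_RSL) large."""
    p = np.asarray(forecast_prob, dtype=float)
    o = np.asarray(observed, dtype=float)
    n = len(p)
    base_rate = o.mean()                      # climatological frequency of the event
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    which_bin = np.clip(np.digitize(p, edges), 1, n_bins)
    reliability = resolution = 0.0
    for b in range(1, n_bins + 1):
        in_bin = which_bin == b
        n_k = in_bin.sum()
        if n_k:
            p_k = p[in_bin].mean()            # mean forecast probability in the bin
            o_k = o[in_bin].mean()            # observed frequency in the bin
            reliability += n_k * (p_k - o_k) ** 2 / n
            resolution += n_k * (o_k - base_rate) ** 2 / n
    uncertainty = base_rate * (1.0 - base_rate)
    skill_score = (resolution - reliability) / uncertainty if uncertainty else np.nan
    return reliability, resolution, uncertainty, skill_score
```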
Fig8.4-3: Example of Reliability Diagrams from Opencharts. Total 24hr precipitation at Day 6, assessed from ensemble probability forecasts during a three-month period and compared with climatology from the same period. The traces show the comparison of forecast probabilities against observed occurrences for 24h precipitation totals of >1mm, >5mm, >10mm, >20mm. Ideally the traces should lie along the dashed blue line (i.e. the ensemble probability forecast should agree with the observed frequency). The diagram shows:
- reasonably good forecasting at low ensemble probabilities
- e.g. ensemble 20% probability occurred about 20% of the time for each threshold
- over-forecasting at higher ensemble probabilities:
- e.g. ensemble 90% probability of >1mm/24h actually occurred only 60% of the time - the wide distribution of forecast probabilities suggests some confidence in the Reliability trace.
- e.g. ensemble 90% probability of >20mm/24h actually occurred 80% of the time - but the very few forecasts of high probabilities suggest very low confidence in the corresponding implied reliabilities.
Fig8.4-4: Example of Reliability Diagrams from Opencharts. Temperature anomaly at Day 4, assessed from ensemble probability forecasts during a three-month period and compared with climatology from the same period. The traces show the comparison of forecast probabilities of anomalies against observed occurrences of anomalies for 2-metre temperature of >8°C below, >4°C below, >4°C above, >8°C above climatology. Ideally the traces should lie along the dashed blue line (i.e. the ensemble forecast probability should agree with the observed frequency). The diagram shows:
- under-forecasting at low probabilities
- e.g. for >8°C above climatology, >4°C above, and >8°C below climatology, ensemble 20% probability actually occurred about 35% of the time.
- e.g. for >4°C below climatology, ensemble 20% probability actually occurred about 25% of the time - fairly good correspondence.
- over-forecasting at higher ensemble probabilities e.g.:
- for >4°C below climatology, ensemble 90% probability actually occurred only 70% of the time - the wide distribution of forecast probabilities suggests some moderate confidence in the implied reliability.
- for >8°C above climatology ensemble 90% probability actually occurred only 65% of the time - but the very few forecasts of high probabilities suggest very low confidence in the implied reliability.
However:
- for >4°C above climatology, ensemble 90% probability actually occurred 85% of the time - the wide distribution of forecast probabilities suggests some moderate confidence in the implied reliability.
- for >8°C below climatology, ensemble 90% probability actually occurred 85% of the time - but the very few forecasts of high probabilities suggest very low confidence in the implied reliability.
Fig8.4-5: Reliability diagrams for 2m temperature based on July starts of the long-range forecasts for months 4-6.
- left for the tropics - a slight tendency towards over-confidence, especially when forecasting that the event (warm anomalies) will happen.
- right for Europe - a tendency towards over-confidence, though the sample size for high confidence forecasts is small, making the plot noisy.
Fig8.4-6: Reliability diagrams for rain based on July starts of the long-range forecasts for months 4-6:
- left for the tropics - a tendency towards over-confidence.
- right for Europe - forecast not reliable at all (so should not be used, unless there are exceptional circumstances that warrant an expectation of skill that is ordinarily not there).
Fig8.4-7: ROC diagrams for Europe based on July starts of the long-range forecasts for months 4-6:
- left for temperatures in the upper tercile - the Hit Rate is slightly better than the False Alarm Rate indicating that the forecast system has some limited ability to discriminate occasions when warm events are likely from occasions when they are not.
- right for precipitation in the upper tercile - the Hit Rate and False Alarm Rate are similar throughout indicating that the seasonal forecast system has no ability to distinguish occasions when it will be wet from occasions when it will not.
Anomaly Correlation
Anomaly Correlation Coefficient (ACC) charts give an assessment of the skill of the forecast. They show the correlation at all geographical locations in map form.
At ECMWF the anomaly correlation coefficient (ACC) scores represent the spatial correlation between:
- the anomalies of a forecast product from a reference model climate and
- the anomalies of observations or reanalysis from the same reference model climate.
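A minimal sketch of the calculation for a single pair of fields (Python with NumPy; the centred form of the correlation is used here, the anomalies could equally be values across grid points for a spatial correlation or across re-forecast years at one location, and area weighting is omitted for simplicity):

```python
import numpy as np

def anomaly_correlation(forecast, verification, reference_climate):
    """Correlation between forecast anomalies and verifying anomalies, both taken
    relative to the same reference (model) climate; ranges from -1 to +1."""
    f_anom = np.asarray(forecast, dtype=float) - np.asarray(reference_climate, dtype=float)
    v_anom = np.asarray(verification, dtype=float) - np.asarray(reference_climate, dtype=float)
    f_anom = f_anom - f_anom.mean()   # centred form: remove the mean anomaly
    v_anom = v_anom - v_anom.mean()
    denom = np.sqrt(np.sum(f_anom ** 2) * np.sum(v_anom ** 2))
    return np.sum(f_anom * v_anom) / denom if denom else np.nan
```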
Seasonal products are available in chart form and the correlation is evaluated between:
- the anomaly of the product measured relative to a model climatology based on the ERA-interim re-analysis (based on the period 1993-2016) and
- the anomaly of the verifying observations or reanalysis relative to the seasonal model climate (S-M-Climate).
The seasonal model climate (S-M-climate) is based on re-forecasts spanning the last 20 years, which used the ERA-interim re-analysis for their initialisation.
Anomaly correlation coefficient (ACC) charts are produced for several parameters. Each chart shows the skill of the forecast at each location for the given month and lead-time.
- Positive ACC implies a correlation between forecast anomalies and the verifying observed anomalies. Higher values imply a stronger correlation.
- Zero ACC implies the forecasts are no better than climatology.
- Negative ACC implies the forecasts have a tendency to predict the opposite of what subsequently happened (though often this is a sampling issue which should just be interpreted as "no skill").
Locations with correlation significantly (95% confidence level) different from zero are highlighted by dots.
Fig8.4-8: Anomaly Correlation Coefficient for 2m temperature for months 2-4 based on November runs. High ACC (red) over the eastern Pacific suggests the seasonal model captures the variability of the 2m temperature quite well; grey over Siberia suggests the model is no better than climatology (i.e. doesn't capture the variability); and cyan near Newfoundland suggests the seasonal model can be rather unreliable and misleading in this area.