Introduction
Some statistical concepts are outlined to facilitate the use and interpretation of deterministic medium-range forecasts. An NWP system can be evaluated in at least two ways:
- Validation measures the realism of the model with respect to its ability to simulate the behaviour of the atmosphere.
- Verification measures the ability of the system to predict atmospheric states.
Only the most commonly used validation and verification methods are discussed here, mainly for 2 m temperature, 10 m wind and upper-air variables. Verification of binary forecasts is discussed in relation to utility.
For a full presentation see Nurmi (2003), Jolliffe and Stephenson (2003) and Wilks (2006).
Forecast validation
A forecast system that perfectly simulates the behaviour of the atmosphere has the same degree of variability as the atmosphere and no systematic errors.
The Mean Error (ME)
The mean error (ME) of forecasts (f) relative to analyses (a) can be defined as
ME = \overline{(f - a)}
where the over-bar denotes an average over a large sample in time and space. A perfect score, ME=0, does not exclude very large errors of opposite signs which cancel each other out. If the mean errors are independent of the forecast and vary around a fixed value, this constitutes an "unconditional bias". If the ME is flow dependent (i.e. if the errors are dependent on the forecast itself or some other parameter), then we are dealing with systematic errors of "conditional bias" type; in this case, variations in the ME from one month to another might not necessarily reflect changes in the model but in the large-scale flow patterns (see Fig12.A.1).
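As an illustration (not part of the original text), a minimal NumPy sketch of the ME and a rough check for conditional bias; the arrays f and a are placeholder samples of matched forecasts and analyses:

```python
import numpy as np

def mean_error(f, a):
    """Mean error (bias) of forecasts f against verifying analyses/observations a."""
    f, a = np.asarray(f, dtype=float), np.asarray(a, dtype=float)
    return np.mean(f - a)

# Placeholder sample of matched forecasts and analyses (illustrative values only).
f = np.array([2.1, -0.5, 3.4, 1.0, -2.2])
a = np.array([1.8, -0.1, 3.0, 1.5, -1.9])

me = mean_error(f, a)

# A rough check for conditional bias: correlate the error with the forecast itself.
# A clearly non-zero correlation suggests flow-dependent (conditional) errors.
r = np.corrcoef(f, f - a)[0, 1]
print(f"ME = {me:+.2f}, corr(forecast, error) = {r:+.2f}")
```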
Fig12.A.1: A convenient way to differentiate between "unconditional" and "conditional" biases is to plot scatter diagrams, with forecasts vs. forecast errors or observations (analyses) vs. forecast errors. From the slope and direction of the scatter in these diagrams it is also possible to find out if the forecasts are over- or under-variable. In this case the colder the forecast the larger the positive error, the warmer the forecast the larger the negative error. This implies that cold anomalies are not cold enough and warm anomalies not warm enough, i.e. the forecasts are under-variable.
Forecast Variability
The ability of an NWP model to forecast extremes with the same frequency as they occur in the atmosphere is crucial for any ensemble approach, whether lagged, multi-model or EPS. If the model has a tendency to over- or under-forecast certain weather elements, their probabilities will, of course, also be biased.
More generally, the forecast variability over time and space should at least equal the analysed variability and, ideally, the observed variability. There are different variance measures to monitor this variability:
- Variability around the climatological average, which measures the ability of a model to span the full climatological range
- The averaged analysed and forecast spatial variance over a specified area at a specific time e.g. a day; it may be presented as a time series
- The averaged temporal variability over a specified area, calculated over a sufficiently long time period. The variance can be computed for every grid point or as the change over 12 or 24 hours. It may be presented as geographical distributions (see Fig12.A.2).
For all three methods the level of variability averaged over many forecasts in the medium range should be the same as for the initial analysis or a short-range forecast.
Fig12.A.2: 500 hPa geopotential variability, October 2010 - March 2011. The standard deviation over the period is calculated for every grid point. The analysis (left) shows maximum variability between Greenland and Canada and in the North Pacific. This is well captured by the D+5 forecast (centre) and D+10 forecast (right) but with slightly decreasing values.
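As an illustration (not part of the original text), a sketch of the third measure, assuming the analysis and forecast fields are held as (time, lat, lon) NumPy arrays; the data here are random placeholders:

```python
import numpy as np

# Hypothetical gridded fields with shape (time, lat, lon), e.g. 500 hPa
# geopotential for one winter; random placeholders stand in for real data.
rng = np.random.default_rng(0)
analysis = rng.standard_normal((180, 61, 120))
forecast_d5 = analysis + 0.3 * rng.standard_normal((180, 61, 120))

# Temporal variability at every grid point (as in Fig12.A.2): the standard
# deviation over the period, computed along the time axis.
std_analysed = analysis.std(axis=0)
std_forecast = forecast_d5.std(axis=0)

# For a realistic model the two maps should have comparable magnitude.
print(std_analysed.mean(), std_forecast.mean())
```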
A perfect model has, by definition, no systematic errors; a stable model might have systematic errors, but ones which do not change their characteristics during the forecast range. Most state-of-the-art NWP models are fairly stable in the medium range but start to display some model drift, such as gradual cooling or warming, moistening or drying, in the extended ranges.
False Systematic Errors
One of the complexities of interpreting the ME is that apparent systematic errors might, in fact, have a non-systematic origin. If this is the case, a perfect model appears to have systematic errors; a stable model appears to suffer from model drift. This is a reflection of a general statistical artefact, the "regression to the mean" effect. (The "regression to the mean" effect was first discussed by Francis Galton (1822-1911), who found that tall (short) fathers tended to have tall (short) sons, but on average slightly shorter (taller) than themselves).
The fact that a perfect model forecasts anomalies with the same intensity and frequencies as observed does not mean that they will be correct in time and place. Due to decreasing predictive skill it will, with increasing lead time, have less success in getting the forecast anomalies right in intensity, time and place. If the forecast is wrong for a specific forecast anomaly, in particular for a strong anomaly, the verifying truth might be more anomalous but in most cases will be less anomalous. Even if the forecast anomaly has the right intensity, phase errors will tend to displace it rather towards less anomalous patterns than towards even more anomalous configurations (see Fig12.A.3).
Fig12.A.3: A schematic picture of a medium-range forecast (black) and the verifying analysis. The forecast anomalies have about the same magnitudes as the verifying anomalies, but they are out of phase. This will yield a tendency for positive anomalies to verify against less positive or even negative anomalies, and for negative anomalies to verify against less negative or even positive anomalies.
Anomalies will therefore appear as if they have been systematically exaggerated, increasingly so as skill decreases with increasing lead time. Plotted in a scatter diagram, these non-systematic forecast errors therefore give a misleading impression that positive anomalies are systematically over-forecast and negative anomalies systematically under-forecast (see Fig12.A.4).
Fig12.A.4: A scatter diagram of forecasts versus forecast error. Warm forecasts appear too warm, cold forecasts appear too cold. If the forecasts are short range, it is reasonable to infer that the system is over-active, overdeveloping warm and cold anomalies. If, on the other hand, the forecasts are well into the medium range, this might not be the case. Due to decreased forecast skill, predicted anomalies tend to verify against less anomalous observed states.
False Model Climate Drift
This "regression to the mean" effect gives rise to another type of false systematic error. Forecasts produced and verified over a period characterized by on average anomalous weather will display a false impression of a model climate drift. A perfect model will produce natural-looking anomalies, independent of lead time, but since the initial state is already anomalous, the forecasts are, with decreasing skill, more likely to be less anomalous than even more anomalous. At a range where there is no longer any predictive skill, the mean error will be equal to the observed mean anomaly with the opposite sign (see Fig12.A.5).
Fig12.A.5: A sequence of consecutive NWP forecasts (thin black lines), their mean (thick black line) and the observations (red line). Forecasts starting in an anomalous state are less likely to forecast even more extreme conditions. With increasing lead time and decreasing skill the forecasts will tend to cluster increasingly around the climate average and give an impression of an increasing ME. The mean error will therefore give the false impression of a drift in the model climate.
The ME can be trusted to reflect the performance properties of a model only during periods with no or small average anomalies.
Forecast Verification
Objective weather forecast verification can be performed from at least three different perspectives: accuracy (the difference between forecast and verification), skill (comparison with some reference method, such as persistence, climate or an alternative forecast system) and utility (the economic value or political consequences of the forecast). They are all "objective" in the sense that the numerical results are independent of who calculated them, but not necessarily objective with respect to what is considered "good" or "bad". The skill measure depends on a subjective choice of reference and the utility measure depends on the preferences of the end-user. Only the first approach, the accuracy measure, can be said to be fully "objective", but, as seen in 4.3.4, in particular Figure 31 and Figure 32, the purpose of the forecast might influence what is deemed "good" or "bad".
Measures of Accuracy
Root Mean Square Error (RMSE)
RMSE = \sqrt{ \overline{(f - a)^2} }
where f = forecast value; a = observed value.
This is the most common accuracy measure. It measures the distance between the forecast and the verifying analysis or observation. The RMSE is negatively orientated (i.e. increasing numerical values indicate increasing "failure").
Mean Absolute Error (MAE)
MAE = \overline{ |f - a| }
where f = forecast value; a = observed value.
This is also negatively orientated. Due to its quadratic nature, the RMSE penalizes large errors more than the non-quadratic MAE and thus takes higher numerical values. This might be one reason why MAE is sometimes preferred, although the practical consequences of forecast errors are probably better represented by the RMSE.
Mean Square Error (MSE)
MSE = \overline{(f - a)^2}
where f = forecast value; a = observed value.
We shall concentrate on the RMSE, or rather its squared version, the mean square error (MSE), which is more convenient to analyse mathematically.
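As an illustration (not part of the original text), a minimal NumPy sketch of these accuracy measures; the arrays f and a are placeholders for matched forecast and verification samples:

```python
import numpy as np

def mae(f, a):
    return np.mean(np.abs(np.asarray(f) - np.asarray(a)))

def mse(f, a):
    return np.mean((np.asarray(f) - np.asarray(a)) ** 2)

def rmse(f, a):
    return np.sqrt(mse(f, a))

# Because of its quadratic nature, RMSE >= MAE for any sample:
f = np.array([1.0, 2.0, 5.0, 0.0])
a = np.array([0.5, 2.5, 2.0, 0.5])
print(mae(f, a), rmse(f, a))
```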
The Effect of Mean Analysis and Observation Errors on the RMSE
If the forecasts have a mean error (ME), so that f = f_o + ME, and (f_o - a) is uncorrelated with ME, then the MSE is their quadratic sum:
MSE = \overline{(f_o - a)^2} + ME^2
where f_o is a forecast with no systematic errors; a = observed value.
If the analysis or observation errors (ERR) have to be taken into account and are uncorrelated with (f_o - a), the MSE is again their quadratic sum:
MSE = \overline{(f_o - a)^2} + \overline{ERR^2}
where f_o is a forecast with no systematic errors; a = observed value.
Systematic forecast errors, as well as analysis and observational errors, have their highest impact in the short range, when the non-systematic error level is still relatively low (see Fig12.A.6).
Fig12.A.6: Forecasts verified against analyses often display a "kink" for the first forecast interval. This is because the error curve starts from the origin, where the forecast at t=0 is identical to the analysis. However, the true forecast error (forecasts vs. correct observations) at initial time (t=0) represents the analysis error and is rarely zero. The true error curve with respect to the correct observations lies at a slightly higher level than the error curve with respect to the analysis, in particular initially.
Any improvement of the NWP output must, therefore, with increasing forecast range, increasingly address the non-systematic errors (e.g. Model Output Statistics MOS).
The Decomposition of MSE
The MSE can be decomposed around c, the climate of the verifying day:
MSE = \overline{(f - a)^2} = \overline{((f - c) - (a - c))^2}
which can be written:
MSE = A_f^2 + A_a^2 - 2 \, cov(f - c, a - c)
where A_a^2 = \overline{(a - c)^2} and A_f^2 = \overline{(f - c)^2} are the atmospheric and model variability respectively around the climate, and cov(f - c, a - c) = \overline{(f - c)(a - c)} refers to the covariance term.
Hence the level of forecast accuracy is determined not only by the predictive skill, as reflected in the covariance term, but also by the general variability of the atmosphere, expressed by Aa, and by how well the model simulates this, expressed by Af.
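A small numerical check of this identity (illustrative synthetic data; not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(1)
c = 0.0                                  # climate of the verifying day
a = c + rng.standard_normal(10000)       # verifying anomalies around climate
f = c + 0.6 * (a - c) + 0.8 * rng.standard_normal(10000)   # imperfect forecast

mse = np.mean((f - a) ** 2)
Af2 = np.mean((f - c) ** 2)              # squared model variability
Aa2 = np.mean((a - c) ** 2)              # squared atmospheric variability
cov = np.mean((f - c) * (a - c))         # covariance term

# The identity MSE = Af^2 + Aa^2 - 2*cov holds for the sample as well:
print(mse, Af2 + Aa2 - 2 * cov)
```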
Forecast Error Baseline
When a climatological average c replaces the forecast (i.e. f = c), the model variability Af and the covariance term become zero and
MSE_climate = A_a^2, i.e. E_climate = A_a
which is the accuracy of climatological weather information used as a forecast. Climatological averages can be found in tourist brochures; to add value for an end-user, any deterministic medium-range forecast must therefore be more accurate than these published climatological averages. A comparison of manual and computer-generated deterministic forecasts shows that, even with a good NWP model, at some forecast range the differences between forecast values and observed values will exceed the differences between climatological values and observed values. However, the errors of a user-orientated forecast produced by a forecaster using model forecast data should not exceed the differences between climatological values and observed values.
i.e. at some forecast range:
E_NWP > E_climate but E_human ≤ E_climate
where E_NWP = (f_NWP - a) is the error of the forecast based on NWP, E_human = (f_human - a) is the error of the forecast based on a human forecaster using NWP, E_climate = (c - a) is the error of a forecast based on climatology, a is the observed value and c the climatological value.
Error Saturation Level (ESL)
Forecast errors do not grow indefinitely but asymptotically approach a maximum, the "Error Saturation Level" (ESL).
Fig12.A.7: The error growth in a state-of-the-art NWP forecast system will at some stage display larger errors than a climatological average used as forecast and will, as do the errors of persistence forecasts and guesses, asymptotically approach an error level 41% above that of a forecast based on a climatological average.
For extended forecast ranges, with decreasing correspondence between forecast and observed anomalies, the covariance term approaches zero, so the MSE approaches the sum Af² + Aa². For Af = Aa this yields an ESL at
RMSE_ESL = Aa√2
which is 41% larger than Eclimate, the error when a climatological average is used as a forecast (see Fig12.A.7). The value Aa√2 is also the ESL for persistence forecasts or for guesses based on climatological distributions.
Measure of Skill - the Anomaly Correlation Coefficient (ACC)
Another way to measure the quality of a forecast system is to calculate the correlation between forecasts and observations. However, correlating forecasts directly with observations or analyses may give misleadingly high values because of the seasonal variations. It is therefore established practice to subtract the climate average from both the forecast and the verification and to verify the forecast and observed anomalies according to the Anomaly Correlation Coefficient (ACC). In its simplest form the ACC can be written:
ACC = \overline{(f - c)(a - c)} / \sqrt{ \overline{(f - c)^2} \cdot \overline{(a - c)^2} }
where f is the forecast value, a is the observed value and c is the climate value. Thus (f - c) is the anomaly of the forecast value relative to the climate value and (a - c) is the anomaly of the observed value relative to the climate value. \overline{(f - c)^2} and \overline{(a - c)^2} are the mean squared forecast and analysis anomalies with respect to the climate; they are measures of the "activity" in the forecast and in the analysis. The overbar indicates a regional or global average.
The WMO definition also takes any mean error into account by first removing the mean anomalies:
ACC = \overline{(f' - \overline{f'})(a' - \overline{a'})} / \sqrt{ \overline{(f' - \overline{f'})^2} \cdot \overline{(a' - \overline{a'})^2} }
where f' = f - c and a' = a - c.
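For concreteness, a minimal NumPy sketch of both forms (illustrative, not from the original guide); f, a and c are placeholder arrays of forecasts, verifications and climate values:

```python
import numpy as np

def acc(f, a, c):
    """ACC in its simple (uncentred) form."""
    fa = np.asarray(f) - np.asarray(c)
    aa = np.asarray(a) - np.asarray(c)
    return np.mean(fa * aa) / np.sqrt(np.mean(fa ** 2) * np.mean(aa ** 2))

def acc_centred(f, a, c):
    """ACC with the mean anomalies removed, as in the WMO-style definition."""
    fa = np.asarray(f) - np.asarray(c)
    aa = np.asarray(a) - np.asarray(c)
    fa, aa = fa - fa.mean(), aa - aa.mean()
    return np.mean(fa * aa) / np.sqrt(np.mean(fa ** 2) * np.mean(aa ** 2))
```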
The Anomaly Correlation Coefficient (ACC) can be regarded as a skill score relative to the climate. Increasing numerical values indicate increasing "success". It has been found empirically that:
- ACC ~0.8 corresponds to a range where there is synoptic skill in large-scale synoptic patterns.
- ACC=0.6 corresponds to the range down to which there is synoptic skill for the largest scale weather patterns.
- ACC=0.5 corresponds to forecasts for which the error is the same as for a forecast based on a climatological average (i.e. RMSE = Aa, the accuracy of climatological weather information used as forecasts).
Interpretation of Verification Statistics
The mathematics of statistics can be relatively simple but the results are often quite difficult to interpret, due to their counter-intuitive nature: what looks "good" might be "bad", what looks "bad" might be "good". As we have seen in A-1.3, seemingly systematic errors can have a non-systematic origin and forecasts verified against analyses can yield results different from those verified against observations. As we will see below, different verification scores can give divergent impressions of forecast quality and, perhaps most paradoxically, improving the realism of an NWP model might give rise to increasing errors.
Interpretation of RMSE and ACC
Both Af and Aa and, consequently, the RMSE vary with geographical area and season. In the mid-latitudes they display a maximum in winter, when the atmospheric flow is dominated by larger scales and stronger amplitudes, and a minimum in summer, when the scales are smaller and the amplitudes weaker.
For a forecast system that realistically reflects atmospheric synoptic-dynamic activity, Af = Aa. If Af < Aa the forecasting system underestimates atmospheric variability, which will contribute to a decrease in the RMSE. This underestimation is "bad" if we are dealing with an NWP model but "good" if we are dealing with post-processed deterministic forecasts to end-users. On the other hand, if Af > Aa the model overestimates synoptic-dynamic activity, which will contribute to increasing the RMSE. This is normally "bad" for all applications.
Comparing RMSE verifications of different models or of different versions of the same model is most straightforward when Af =Aa and the models have the same general variability as the atmosphere.
Effect of Flow Dependency
Both RMSE and ACC are flow dependent, sometimes in a contradictory way. In non-anomalous conditions (e.g. zonal flow) the ACC can easily take low ("bad") values, while in anomalous regimes (e.g. blocking flow) it can take quite high ("good") values. The opposite is true for RMSE, which can easily take high ("bad") values in meridional or blocked flow regimes and low ("good") values in zonal regimes. Conflicting indications are yet another example of "what looks bad is good", as they reflect different virtues of the forecasts and thereby provide the basis for a more nuanced overall assessment.
The "Double Penalty Effect"
A special case of the flow dependence of the RMSE and ACC is the "double penalty effect", where a bad forecast is "penalised" twice: first for not having a system where there is one and second for having a system where there is none. It can be shown that, if a wave is forecast with a phase error of half a wave length or more, it will score worse in RMSE and ACC than if the wave had not been forecast at all (see Fig12.A.8).
Fig12.A.8: When the phase error ΔΦ is larger than half a wave length, the scores will be worse than if there was no wave forecast at all.
The double penalty effect often appears in the late medium range, where phase errors become increasingly common. At this range they will strongly contribute to false systematic errors (see A-1.3).
Subjective Evaluations
Considering the many pitfalls in interpreting objective verification results, purely subjective verifications should not be dismissed. They might serve as a good balance and check on the interpretation of the objective verifications. This applies in particular to the verification of extreme events, where the low number of cases makes any statistical verification very difficult or even impossible.
Graphical Representation
The interpretation of RMSE and ACC outlined above may be aided by a graphical vector notation, based on elementary trigonometry. The equation for the decomposition of the MSE is mathematically identical to the "cosine law". From this it follows that the cosine of the angle β between the vectors (f-c) and (a-c) corresponds to the ACC (see Fig12.A.9).
Fig12.A.9: The relationship between the cosine theorem and the decomposition of the RMSE.
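To make the correspondence explicit, a brief restatement using the quantities already defined above (a sketch, not from the original text):

```latex
% Cosine law applied to the anomaly vectors (f - c) and (a - c):
\overline{(f-a)^2} \;=\; A_f^2 + A_a^2 - 2\,A_f A_a \cos\beta ,
% which coincides with the MSE decomposition above when
\cos\beta \;=\; \frac{\overline{(f-c)(a-c)}}{A_f\,A_a} \;=\; \mathrm{ACC} \quad (\text{uncentred form}).
```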
Forecast Errors
When the predicted and observed anomalies are uncorrelated (i.e. there is no skill in the forecast), they are in a geometrical sense orthogonal and the angle β between vectors (a-c) and (f-c) is 90° and the error is on average √2 times the atmospheric variability around climate.
Fig12.A.10: When the forecast and observed anomalies are orthogonal (i.e. uncorrelated) β=90° and the forecasts (f-c) have on average errors equal to √2 times Aa (the atmospheric variability) or the error of a climatological average.
From Fig12.A.10 it can also be seen that the climate average (c) is more accurate than the forecast at extended or infinite range. From vector-geometrical arguments it is easy to understand why ACC=50% when the RMSE=Eclimate and RMSE < Eclimate for higher ACC, for example 60%, which is the empirically determined limit for useful predictions (see Fig12.A.11).
Fig12.A.11: When the ACC=50% (i.e. the angle between the anomalies (a-c) and (f-c) =60°), the RMSE=Aa, the atmospheric variability (left). When ACC>50% the RMSE is smaller. An ACC=60% is agreed to indicate the limit of useful synoptic forecast skill.
Flow Dependence
The flow dependence of RMSE and ACC is illustrated in Fig12.A.12, for (left) a case of, on average, large anomaly, when a large RMS error is associated with a large ACC (small angle β), and (right) a less anomalous case, when a smaller RMS error is associated with a small ACC (large angle β).
Fig12.A.12: In situations with large variability around the climate average (left) relatively large RMSE can be associated with relatively large ACC (small β) and in situations with relatively small variability around the climate average (right) relatively small RMSE can be associated with small ACC (larger β). For these cases the RMSE and ACC will give conflicting signals.
If the RMSE is used as the norm, it would in principle be possible, at an extended range, to pick out those "Members of the Day" that are better than the average, just by selecting those members which are less anomalous. If, however, the ACC is used as the norm, the "Members of the Day" may turn out to be those members which are more anomalous.
Damping of Forecast Anomalies
On average, damping the variability (or jumpiness) of the forecasts reduces the forecast error. It can be shown (see Fig12.A.13) that optimal damping is achieved when the forecast variability is reduced by a proportion equal to cos β, i.e. to the ACC.
Fig12.A.13: Damping the forecast variability Af will minimize the RMSE when the remaining error becomes orthogonal to the forecast anomaly. This happens when Af = ACC · Aa and the forecast vector (f - c) varies along a semi-circle with a radius equal to half Aa.
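A short derivation of the damping result (a sketch using the decomposition above; k denotes the damping factor applied to the forecast anomaly):

```latex
% Error of the damped forecast c + k(f - c):
\overline{\bigl(k(f-c)-(a-c)\bigr)^2} = k^2 A_f^2 + A_a^2 - 2\,k\,\mathrm{ACC}\,A_f A_a .
% Setting the derivative with respect to k to zero gives
k = \mathrm{ACC}\,\frac{A_a}{A_f},
% so the optimally damped forecast variability is  k\,A_f = \mathrm{ACC}\cdot A_a .
```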
Forecast Error Correlation
At an extended forecast range, when there is low skill in the forecast anomalies and weak correlation between them, there is still a fairly high correlation between the forecast errors. This is because the forecasts are compared with the same analysis. Consider (see Fig12.A.14) two consecutive forecasts f and g, from the same model or two different models, with errors (f - a) and (g - a). Although the angles between (f - c), (g - c) and (a - c) at an infinite range are 90° and thus the correlations zero, the angle between the errors (f - a) and (g - a) is 60°, which yields a correlation of 50%. For shorter ranges this error correlation increases, as the forecast anomalies become more correlated and the angle between the errors falls below 60°. The perturbations in the analyses are constructed to be uncorrelated.
Fig12.A.14: A 3-dimensional vector figure to clarify the relation between forecast jumpiness and error. Two forecasts, f and g, are shown at a range where there is no correlation between the forecast and observed anomalies (f - c), (g - c) and (a - c). The angles between the three vectors are 90°. The angles in the triangle a-f-g are all 60°, which means that there is a 50% correlation between the "jumpiness" (g - f) and the errors (f - a) and (g - a). The same is true for the correlation between (f - a) and (g - a).
Forecast Jumpiness and Forecast Skill
From the same Fig12.A.14 it follows that since the angle between the forecast "jumpiness" (f - g) and the error (f - a) is 60°, the correlation at an infinite range between "jumpiness" and error is 50%. For shorter forecast ranges the correlations decrease because the forecast anomalies become more correlated, with the angle between them <60°.
Combining Forecasts
Combining different forecasts into a "consensus" forecast, either from different models ("the multi-model ensemble") or from the same model ("the lagged average forecast"), normally yields higher forecast accuracy (lower RMSE). The forecasts should be weighted together with respect not only to their average errors but also to the correlation between these errors (see Fig12.A.15).
Fig12.A.15: Forecasts f and g, either from two different NWP systems or from the same system but with different lead times, verifying at the same time. The errors Ef = (f - a) and Eg = (g - a) correlate as cos β. Weighted together they yield a forecast mfg with an RMSE lower than the errors Ef and Eg of the two individual forecasts. The smaller the angle β, the larger the error correlation and the smaller the error reduction that mfg can yield.
However, when Ef and Eg correlate more strongly than the ratio Eg/Ef of the smaller error to the larger, combining the forecasts does not yield a reduction in errors (see Fig12.A.16).
Fig12.A.16: For certain relations between forecast accuracy and forecast error correlation no combination will be able to reduce the RMSE.
The discussion can be extended to any number of participating forecasts. In ensemble systems the forecast errors are initially uncorrelated but slowly increase in correlation over the integration period, though never exceeding 50%.
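A minimal sketch of the two-forecast combination under these assumptions (not from the original text; the error standard deviations sf, sg and the error correlation rho are illustrative inputs):

```python
import numpy as np

def optimal_weight(sf, sg, rho):
    """Weight w on forecast f (and 1 - w on g) that minimizes the RMSE of the
    combination w*f + (1-w)*g, given error std devs sf, sg and error correlation rho."""
    return (sg**2 - rho * sf * sg) / (sf**2 + sg**2 - 2 * rho * sf * sg)

def combined_rmse(sf, sg, rho, w):
    var = w**2 * sf**2 + (1 - w)**2 * sg**2 + 2 * w * (1 - w) * rho * sf * sg
    return np.sqrt(var)

sf, sg, rho = 1.0, 1.2, 0.5              # illustrative error levels and correlation
w = optimal_weight(sf, sg, rho)
print(w, combined_rmse(sf, sg, rho, w))  # lower than both sf and sg

# If rho exceeds the ratio of the smaller to the larger error, the optimal
# weight falls outside [0, 1]: no simple combination beats the better forecast.
print(optimal_weight(1.0, 2.0, 0.8))     # > 1, i.e. keep forecast f alone
```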
Usefulness of Statistical Know-how
Statistical verification is normally associated with forecast product control. Statistical know-how is not only able to assure a correct interpretation but also helps to add value to the medium-range NWP output. The interventions and modifications performed by experienced forecasters are to some extent statistical in nature. Modifying or adjusting a short-range NWP forecast in the light of later observations is qualitatively similar to "optimal interpolation" in data assimilation. Correcting for systematic errors is similar to linear regression analysis, and advising end-users in their decision-making involves an understanding of cost-loss analysis. Weather forecasters are not always aware that they make use of Bayesian principles in their daily tasks, even if the mathematics is not formally applied in practice (Doswell, 2004).
Investigations have shown that forecasters who have a statistical education and training do considerably better than those who do not have such understanding (Doswell, 2004). Forecasters should therefore keep themselves informed about recent statistical validations and verifications of NWP performance.
Usefulness of the Forecast - A Cost/Benefit Approach
The ultimate verification of a forecast service is the value of the decisions that end-users make based on its forecasts, provided that it is possible to quantify the usefulness of the forecasts; this brings a subjective element into weather forecast verification.
The Contingency Table
For evaluating the utility aspect of forecasts it is often convenient to present the verification in a contingency table with the corresponding hits (H), false alarms (F), misses (M) and correct no-forecasts (Z). If N is the total number of cases then N=H+F+M+Z. The sample climatological probability of an event occurring is then Pclim = (H + M) / N.
A wide range of verification scores can be computed from this table, but here we will only mention:
Hit Rate HR = H / (H + M), i.e. the proportion of hits given the event was observed.
False Alarm Rate FR = F / (Z + F), i.e. the proportion of false alarms given the event was not observed; together with the HR, it is one of the main parameters in ROC diagrams.
False Alarm Ratio FAR = F / (H + F), i.e. the proportion of false alarms given the event was forecast.
Correct nulls = Z, i.e. the number of occasions when the event was correctly forecast not to happen and did not happen.
Note:
The False Alarm Rate (FR) should not be confused with the False Alarm Ratio (FAR).
The terminology here may be different from that used in other books. We refer to the definitions given by Nurmi (2003) and the recommendations from the WWRP/WGNE working group on verification.
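A minimal sketch (illustrative, not from the original text) of these scores computed from the four cell counts:

```python
def contingency_scores(H, F, M, Z):
    """Basic scores from a 2x2 contingency table of hits (H), false alarms (F),
    misses (M) and correct no-forecasts (Z)."""
    N = H + F + M + Z
    return {
        "base rate Pclim": (H + M) / N,
        "hit rate HR": H / (H + M),
        "false alarm rate FR": F / (Z + F),
        "false alarm ratio FAR": F / (H + F),
    }

# Example counts (placeholders); a real table would come from a verification sample.
print(contingency_scores(H=2, F=1, M=1, Z=6))
```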
The "Expected Expenses" (EE)
The Expected Expenses are defined as the sum of the costs due to protective actions and the losses endured:
EE = c · (H + F) + L · M
where c is the cost of protective action, when warnings have been issued, and L is the loss, if the event occurs without protection. Always protecting makes EE = c · N and never protecting EE = L · (M + H). The break-even point, when protecting and not protecting are equally costly, occurs when c · N = L · (H + M), which yields c / L = (H + M) / N = Pclim. It is advantageous to protect whenever the "cost-loss ratio" c / L < Pclim, if Pclim is the only information available.
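A minimal sketch of the expected-expenses arithmetic (not from the original text; the table counts and cost/loss values are placeholders):

```python
def expected_expenses(H, F, M, c, L):
    """EE when acting on the warnings: protect on every warning (hits + false
    alarms) at cost c, and pay the loss L for every miss."""
    return c * (H + F) + L * M

# Illustrative table over N = 10 occasions (placeholder numbers).
H, F, M, Z = 2, 1, 1, 6
N = H + F + M + Z
c, L = 20, 100

print(expected_expenses(H, F, M, c, L))  # acting on the warnings
print(c * N)                             # always protecting
print(L * (H + M))                       # never protecting; break-even at c/L = (H+M)/N
```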
Practical Examples
The following set of examples is inspired by real events in California in the 1930s (Lewis, 1994, p.73-74).
A Situation with No Weather Forecast Service
Imagine a location where, on average, it rains 3 days out of 10. Two enterprises, X and Y, each lose €100 if rain occurs and they have not taken protective action. X has to invest €20 for protection, whereas Y has to pay €60.
Thanks to his low protection cost, X protects every day, which costs on average €20 per day over a longer period. Y, on the other hand, chooses never to protect because of the high cost and, owing to the three rain events, suffers an average loss of €30 per day over a typical 10-day period (see Fig12.A.17).
Fig12.A.17: The triangle defined by the expected daily expenses for different costs (c), when the loss (L) is €100. End-users who always protect increase their expenses (yellow), end-users who never protect lose on average €30 per day. Even if perfect forecasts were supplied, protection costs could not be avoided (blue line). The triangle defines the area within which weather forecasts can reduce the expected expenses. Note the baseline is not a lack of expenses but the cost of the protection necessary if perfect knowledge about the future weather is available, in X's case €6 and in Y's €18 per day.
The Benefit of a Local Weather Service
The local weather forecast office A issues deterministic forecasts. They are meteorologically realistic in that rain is forecast with the same frequency as it is observed. The overall forecast performance is reflected in a contingency table (Table 3).
Relying on these forecasts over a typical 10-day period, both X and Y protect three times and are caught out unprotected only once. X is able to lower his expenses from €20 to €16 per day, and Y from €30 to €28 (see Fig12.A.18).
Fig12.A.18: The same as Fig12.A.17, but with the expected expenses for end-users served by forecast service A. The red area indicates the added benefits for X and Y from basing their decisions on deterministic weather forecasts from service A.
Note that end-users with very low or very high protection costs do not benefit from A's forecast service.
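As a hedged check of the arithmetic (Table 3 itself is not reproduced here): the text implies, over ten days, roughly H=2, F=1, M=1 and Z=6, which reproduces the quoted daily expenses:

```python
# Contingency table over 10 days as implied by the text ("protect three times,
# caught out unprotected once", with rain on 3 of 10 days): H=2, F=1, M=1, Z=6.
H, F, M, Z, L, N = 2, 1, 1, 6, 100, 10

for name, c in (("X", 20), ("Y", 60)):
    per_day = (c * (H + F) + L * M) / N
    print(name, per_day)   # X: 16.0, Y: 28.0 euros per day, as quoted in the text
```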
Effect of Introducing further Weather Services
Two new weather agencies, B and C, start to provide forecasts to X and Y. The newcomers B and C have forecast performances in terms of H, F, M and Z:
Agency B heavily under-forecasts rain and agency C heavily over-forecasts. Both give a distorted image of atmospheric behaviour - but what might seem "bad" is actually "good".
By following B's forecasts, which heavily under-forecast rain, end-user Y, who has high protection costs, reduces his expenses from €28 to €26.
By following C's forecasts, which heavily over-forecast rain, end-user X, who has low protection costs, reduces his expenses from €16 to €12 (see Fig12.A.19).
Fig12.A.19: The cost-loss diagram with the expected expenses according to forecasts from agencies B and C for different end-users, defined by their cost-loss ratios. Weather service A is able to provide only a section of the potential end-users, the ones with c/L ratios between 33% and 50%, with more useful forecasts than B and C. The green and yellow areas indicate where X and Y benefit from the forecasts from agencies B and C respectively.
C has also managed to provide a useful weather service to those with very low protection costs, B to those with very high protection costs. In general, any end-user with protection costs <€33 benefits from C's services, and any end-user with protection costs >€50 benefits from B's services. Only end-users with costs between €33 and €50 benefit from A's services more than they do from B's and C's.
There seem to be only two ways in which weather service A can compete with B and C:
- It can improve the deterministic forecast skill; this would involve NWP model development, which takes time and is costly.
- It can "tweak" the forecasts in the same way as B and C do, thus violating its policy of issuing well-tuned forecasts.
There is, however, a third way, which will enable weather service A to quickly outperform B and C at no extra cost and without compromising its policy of well-tuned forecasts.
An Introduction to Probabilistic Weather Forecasting
The late American physicist and Nobel Laureate Richard Feynman (1919-88) held the view that it is better not to know than to be told something that is wrong or misleading. This has recently been re-formulated thus: it is better to know that we do not know than to believe that we know when actually we do not know.
Uncertainty - how to turn a Disadvantage into an Advantage
Local forecast office A, in its competitive battle with B and C, starts to make use of this insight. It offers a surprising change of routine service: it issues a categorical rain or no-rain forecast only when the forecast is absolutely certain. If not, a "don't know" forecast is issued. If such a "don't know" forecast is issued about four times during a typical ten-day period, the contingency table might look like this (assuming "don't know" equates to "50-50" or 50%):
This does not look very impressive, rather the opposite, but, paradoxically, both X and Y benefit highly from this special service. This is because they are now free to interpret the forecasts in their own way.
User X, with low protection costs, can afford to interpret the "don't know" forecast as if it could rain and therefore takes protective action. By doing so, X drastically lowers his expenses to €10 per day, cheaper than the €12 per day achieved by following C's forecasts.
User Y, on the other hand, having expensive protection, prefers to interpret "don't know" as if there will be no rain and decides not to protect. By doing so, Y lowers his expenses to €26 per day, on a par with following service B's forecasts (see Fig12.A.20).
Fig12.A.20: The expected daily expenses when the end-users are free to interpret the "don't know" forecast either as "rain", if they have a low c/L ratio, or as "no rain", if their c/L ratio is high.
So what might appear as "cowardly" forecasts prove to be more valuable for the end-users! If forecasters are uncertain, they should say so and thereby gain respect and authority in the longer term.
Making More Use of Uncertainty - Probabilities
However, service A can go further and quantify how uncertain the rain is. This is best done by expressing the uncertainty of rain in probabilistic terms. If "don't know" is equal to 50%, then 60% and 80% indicate less uncertainty, 40% and 20% larger uncertainty. Over a 10-day period the contingency table might, on average, look like this, where the four cases of uncertain forecasts have been grouped according to the degree of uncertainty or certainty:
Note: A "don't know" forecast does not necessarily mean "50-50"; it could mean the climatological probability. In fact, unless the climatological rain frequency is indeed 50%, a "50-50" statement actually provides the non-trivial information that the risk is higher or lower than normal.
The use of probabilities allows other end-users, with protection costs different from X's and Y's, to benefit from A's forecast service. They should take protective action if the forecast probability exceeds their cost/loss ratio (p > c / L). Assuming possible losses of €100, someone with a protection cost of €30 should take action when the forecast probability of rain exceeds 30%, and someone with a protection cost of €75 should take action when it exceeds 75% (see Fig12.A.21).
Fig12.A.21: The same figures but with the expected expenses indicated for cases where different end-users take action after receiving probability forecasts. The general performance (diagonal thick blue line) is now closer to the performance for perfect forecasts.
X lowers his expenses to €10 and Y lowers his expenses to €24.
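A minimal sketch of the decision rule just described (p > c/L); the probability values are illustrative and not from the original text:

```python
def should_protect(p, c, L):
    """Protect when the forecast probability exceeds the cost/loss ratio."""
    return p > c / L

# Users X (c=20) and Y (c=60) with L=100, for a range of forecast probabilities.
for p in (0.2, 0.4, 0.6, 0.8):
    print(p, should_protect(p, 20, 100), should_protect(p, 60, 100))
```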
Towards more Useful Weather Forecasts
What looks "bad" has indeed been "good". Using vague phrasing or expressing probabilities instead of giving a clear forecast is often regarded by the public as a sign of professional incompetence.
"Unfortunately, a segment of the public tends to look upon probability forecasting as a means of escape for the forecaster" (Lorenz, 1970).
Instead, it has been shown that what looks like "cowardly" forecast practice is, in reality, more beneficial to the public and end-users than perceived "brave" forecast practice.
"What the critics of probability forecasting fail to recognize or else are reluctant to acknowledge is that a forecaster is paid not for exhibiting his skill but for providing information to the public, and that a probability forecast conveys more information, as opposed to guesswork, than a simple [deterministic] forecast of rain or no rain." (Lorenz, 1970)
Although the ultimate rationale of probability weather forecasts is their usefulness, which varies from end-user to end-user, forecasters and developers also need verification and validation measures which are objective, in the sense that they do not reflect the subjective needs of different end-user groups.
Quality of Probabilistic Forecasts
The forecast performance in Table 7 exemplifies skilful probability forecasting. In contrast to categorical forecasts, probability forecasts are never "right" or "wrong" (except when 0% or 100% has been forecast). They can therefore not be verified and validated in the same way as categorical forecasts. This is further explained in Appendix B.
When probabilities are not required
If an end-user does not appreciate forecasts in probabilistic terms and, instead, asks for categorical "rain" or "no rain" statements, the forecaster must make the decisions for him. Unless the relevant cost-loss ratio is known, this restriction puts forecasters in a difficult position.
If, on the other hand, they have a fair understanding of the end-user's needs, forecasters can simply convert their probabilistic forecast into a categorical one, depending on whether the end-user's particular probability threshold is exceeded or not. The forecasters are, in other words, doing what the end-user should have done.
So, for example, for an end-user with a 40% threshold, weather service A would issue categorical forecasts which during a 100 day period would verify like Table 8:
For this particular end-user Table 8 shows the rain has been over-forecast: There are 40 forecasts of rain (28 correct, 12 incorrect) against only 30 occurrences.
However, for an end-user with a 60% threshold, weather service A would issue categorical forecasts which during a 100 day period would verify like Table 9:
For this particular end-user Table 9 shows the rain has been under-forecast: There are only 20 forecasts of rain (18 correct, 2 incorrect) against 30 occurrences.
Generally, categorical forecasts have to be biased, either positively (i.e. over-forecasting the event, for end-users with low cost-loss ratios) or negatively (i.e. under-forecasting, for end-users with high cost-loss ratios). A good NWP model should neither over-forecast nor under-forecast at any forecast range. This is another example of how computer-based forecasts differ from forecaster-interpreted, customer-orientated forecasts.
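An illustrative sketch (function names and probability values are invented, not from the original text) of how a single set of probability forecasts yields differently biased categorical forecasts for different user thresholds:

```python
import numpy as np

def categorical_from_probability(p, threshold):
    """Issue 'rain' whenever the forecast probability exceeds the end-user's
    probability threshold (their cost/loss ratio)."""
    return np.asarray(p) > threshold

# Placeholder probability forecasts for 100 days with a roughly 30% rain climate.
rng = np.random.default_rng(2)
p = rng.beta(1.5, 3.5, size=100)

# A low threshold over-forecasts rain, a high threshold under-forecasts it:
print(categorical_from_probability(p, 0.4).sum(),
      categorical_from_probability(p, 0.6).sum())
```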
An Extension of the Contingency Table - the "SEEPS" score
The SEEPS (Stable Equitable Error in Probability Space) score has been developed to address the task of verifying deterministic precipitation forecasts. In contrast to traditional deterministic precipitation verification, it makes use of three categories: "dry", "light precipitation" and "heavy precipitation". "Dry" is defined according to WMO guidelines as ≤0.2 mm per 24 hours. The "light" and "heavy" categories are defined by the local climatology, so that "light" precipitation occurs twice as often as "heavy" precipitation. In Europe the threshold between "light" and "heavy" precipitation is generally between 3 mm and 15 mm per 24 hours.
Additional Sources of Information
Read further information on the verification of categorical predictands.