Title: Verification - Evaluating the Quality of Seasonal Forecasts
1 Verification - Evaluating the Quality of Seasonal Forecasts
2 Acknowledgments
- IRI online course material
- http://iri.columbia.edu
- Robert Fawcett, Andrew Watkins, Phillip Reid,
David Jones (BoM)
3 Verification - Questions
- How do we decide whether a forecast was correct?
- How do we decide whether a set of forecasts is correct often enough to be considered good?
- How do we decide if forecasts are valuable?
- How can we answer any of these questions when forecasts are expressed probabilistically?
4 Was a Forecast Correct?
Deterministic forecast: it will rain tomorrow. Verification: did it rain? Yes or no.
Probabilistic forecast: there is a 50% chance of rain tomorrow. Verification: the forecast is correct irrespective of whether it rains!
5 Verifying Climate (probabilistic) Forecasts
How do we decide whether a forecast was correct? Unless the probability is 0% or 100%, a probability forecast is always correct: all possible outcomes are forecast (e.g., a 90% chance of rain implies a 10% chance of no rain). However, the forecast may be over-confident or under-confident; this is termed reliability. Whenever a forecaster says there is a high probability of rain tomorrow, it should rain more frequently than when the forecaster says there is a low probability of rain.
6 Terminology
- Validation: assessment of hindcast skill
- assessment of skill by scoring (cross-validated) hindcasts
- essential for assessing new models and the expected future performance of current models
- (+) large sample size, immediate results; the likely skill of forecasts is known before issuing to the public
- (-) possibly gives inflated (or deflated) skill measures; the past may not be a good guide to the future
- tells us how well we would have done in the past
7 Terminology
- Verification: assessment of forecast skill
- assessment of skill by scoring independent real-time forecasts
- used for assessing how forecasts have performed
- undertaken for accountability reasons
- can be applied across multiple forecast models, but interpretation may be problematic
- forecasts accumulate too slowly to allow verification to be used as the basis for model selection
- (+) measures the skill of what we provide to the public; an accurate and accountable measure of performance
- (-) takes many forecasts (years) to obtain reliable statistics
8 Climate Prediction 101
Climate prediction is the process of estimating the PDF (probability distribution function) of a climate variable, conditional on an external forcing (e.g., the Southern Oscillation, SSTs, greenhouse gases, etc.).
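As a minimal illustration of this definition, a forecast can be expressed as a PDF shifted by the conditioning forcing. The Gaussian distributions and numbers below are assumptions for illustration, not real climatology or a real forecast model:

```python
# Illustrative sketch only: the distributions and numbers are assumptions,
# not real climatology or a real forecast model.
from scipy.stats import norm

climatology = norm(loc=50.0, scale=15.0)  # unconditional spring rainfall PDF (mm)
conditional = norm(loc=42.0, scale=12.0)  # PDF conditional on, say, warm SSTs

median = climatology.median()  # climatological median rainfall
print(1.0 - climatology.cdf(median))  # 0.50: above-median chance, unconditional
print(1.0 - conditional.cdf(median))  # < 0.50: the forcing shifts the odds drier
```

The forecast is then simply the probability of each outcome under the conditional PDF, rather than under climatology.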
9 The Basis for Climate Prediction
- This model points to three distinct verification issues:
- How consistent is the shift in the PDF (probability distribution function) with observations?
  - Are the probabilities reliable?
- How large are the shifts in the PDF?
  - Are the forecasts emphatic?
- How tight is the shifted PDF?
  - Are the forecasts sharp?
- No single skill measure can describe all three of these aspects; this is why there are so many skill measures.
10 Desired Characteristics of Forecasts
- Probabilities should be reliable.
  - Reliability is a function of forecast accuracy.
- Probabilities should be sharp.
  - Assuming the forecasts are reliable, sharpness is a function of predictability or forecast signal, and relates to skill.
11 Reliability Diagrams
For all forecasts of a given confidence, identify how often the event occurs. If the proportion of times that the event occurs is the same as the forecast probability, the probabilities are reliable (well calibrated, i.e., accurate). A plot of relative frequency of occurrence against forecast probability will be a diagonal line if the forecasts are reliable. Problem: a large number of forecasts is required, and reliability cannot be mapped spatially or temporally.
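A minimal sketch of how the points on a reliability diagram can be computed, assuming binary outcomes (1 = event occurred) and equal-width probability bins:

```python
import numpy as np

def reliability_points(probs, outcomes, n_bins=10):
    """For each forecast-probability bin, return (mean forecast probability,
    observed relative frequency, sample count). Reliable forecasts plot on
    the diagonal: observed frequency equals forecast probability."""
    probs = np.asarray(probs, float)
    outcomes = np.asarray(outcomes, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    points = []
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            points.append((probs[in_bin].mean(), outcomes[in_bin].mean(),
                           int(in_bin.sum())))
    return points
```

Plotting observed frequency against mean forecast probability for each returned point gives the reliability diagram; the sample counts give the accompanying histogram of forecast probabilities.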
12 Rainfall Above/Below Median
- Reliability data (all Aust. grid points)
- 34 verified forecasts
- 1 percent probability bins
- Evidence that wet conditions are easier to forecast
[Figure: reliability diagram; histogram of forecast probabilities]
13 Brier Score
Measures the mean-squared error of probability forecasts: effectively a root-mean-square error measure, framed in a probabilistic context. If an event was forecast with a probability of 60% and the event occurred, the probability error is 0.6 - 1.0 = -0.4, and the Brier score averages the squared errors, so this forecast contributes (-0.4)^2 = 0.16.
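A minimal sketch of the calculation, reproducing the worked example above (probabilities expressed as fractions rather than percentages):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error of probability forecasts: outcomes are 1 if the
    event occurred and 0 otherwise. 0 is a perfect score; 1 is the worst."""
    probs = np.asarray(probs, float)
    outcomes = np.asarray(outcomes, float)
    return np.mean((probs - outcomes) ** 2)

# A 60% forecast followed by the event occurring: error 0.6 - 1.0 = -0.4
print(brier_score([0.6], [1]))  # 0.16
```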
14 Relative Operating Characteristics (ROC)
Convert probabilistic forecasts to deterministic forecasts by issuing a warning if the probability exceeds a minimum threshold. By raising the threshold, fewer warnings are likely to be issued, reducing the potential of issuing a false alarm but increasing the potential of a miss. By lowering the threshold, more warnings are likely to be issued, reducing the potential of a miss but increasing the potential of a false alarm. The ROC curve measures the trade-off between a correct warning and a false alarm, i.e. between hits and false alarms, across a range of decision thresholds.
15 Relative Operating Characteristics (ROC)

                     OBSERVED: YES        OBSERVED: NO
  Warning issued     Hits                 False Alarms
  No warning         Misses               Correct Rejections

HIT RATE = Hits / (Hits + Misses) = probability that an event is forewarned
FALSE ALARM RATE = False Alarms / (False Alarms + Correct Rejections) = probability that a warning is made for a non-event
16 Relative Operating Characteristics (ROC) - EXAMPLE
A warning is issued when the forecast probability of above-median rainfall exceeds 50%. Number of forecasts: 32. Threshold: 50%.

                     OBSERVED: YES        OBSERVED: NO
  Warning issued     7 (hits)             2 (false alarms)
  No warning         10 (misses)          13 (correct rejections)

HIT RATE = Hits / (Hits + Misses) = 7 / (7 + 10) = 0.41
FALSE ALARM RATE = FA / (FA + CR) = 2 / (2 + 13) = 0.13
17 Relative Operating Characteristics (ROC) - EXAMPLE
A warning is issued when the forecast probability of above-median rainfall exceeds 10%. Number of forecasts: 32. Threshold: 10%.

                     OBSERVED: YES        OBSERVED: NO
  Warning issued     21 (hits)            7 (false alarms)
  No warning         0 (misses)           4 (correct rejections)

HIT RATE = Hits / (Hits + Misses) = 21 / (21 + 0) = 1.0
FALSE ALARM RATE = FA / (FA + CR) = 7 / (7 + 4) = 0.64
18 Relative Operating Characteristics (ROC) - EXAMPLE
A warning is issued when the forecast probability of above-median rainfall exceeds 80%. Number of forecasts: 32. Threshold: 80%.

                     OBSERVED: YES        OBSERVED: NO
  Warning issued     1 (hit)              1 (false alarm)
  No warning         13 (misses)          17 (correct rejections)

HIT RATE = Hits / (Hits + Misses) = 1 / (1 + 13) = 0.07
FALSE ALARM RATE = FA / (FA + CR) = 1 / (1 + 17) = 0.06
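The three worked examples can be reproduced with a short helper; the contingency counts are taken straight from the tables above:

```python
def roc_point(hits, misses, false_alarms, correct_rejections):
    """One point on the ROC curve for a given warning threshold."""
    hit_rate = hits / (hits + misses)
    false_alarm_rate = false_alarms / (false_alarms + correct_rejections)
    return hit_rate, false_alarm_rate

# (hits, misses, false alarms, correct rejections) at each threshold
examples = {0.50: (7, 10, 2, 13), 0.10: (21, 0, 7, 4), 0.80: (1, 13, 1, 17)}
for threshold, counts in examples.items():
    hr, far = roc_point(*counts)
    print(f"threshold {threshold:.0%}: hit rate {hr:.2f}, false alarm rate {far:.2f}")
```

Sweeping the threshold from low to high traces the ROC curve from the top-right corner (warn on everything) towards the bottom-left (warn on nothing).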
19 Relative Operating Characteristics (ROC)
[Figure: ROC curve, with warning thresholds from 0% to 80% marked along the curve]
20 Relative Operating Characteristics (ROC)
21 Relative Operating Characteristics
- Advantages
  - Skill can be mapped in space and time
  - Weights forecasts equally
  - Relatively simple to calculate
- Disadvantages
  - Somewhat complex to understand
  - Categorical, not probabilistic
  - The ROC score is not intuitive
22 Percent Consistent (Correct Forecast Rate)
For a particular category (e.g., above median):
PC = (Hits + Correct Rejections) / (total no. of outlooks)
Simply: how often did the outlook favor the eventual outcome?
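A minimal sketch of the calculation, reusing the contingency counts from the 50% ROC example above purely for illustration:

```python
def percent_consistent(hits, correct_rejections, total_outlooks):
    """How often the outlook favored the eventual outcome, as a percentage."""
    return 100.0 * (hits + correct_rejections) / total_outlooks

print(percent_consistent(7, 13, 32))  # 62.5
```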
23 Max Temp Above/Below Median
- Correct forecast rates
- 37 verified forecasts
- better than guessing across most of the country
- correct 2/3 of the time across most of eastern Australia
[Figure: map of correct forecast rates; inset shows validation comparison]
24 Percent Correct
- Advantages
  - Very simple to calculate
  - Simple to understand
  - Able to be mapped
- Disadvantages
  - May be misinterpreted: PC does not measure accuracy.
  - Categorical, not probabilistic, thereby encouraging categorical decision making.
25 Linear Error in Probability Space (LEPS)
[Figure: climatological cumulative probability curve for rainfall. The forecast value (23 mm) maps to cumulative probability Pf = 0.20; the observed value (31 mm) maps to Po = 0.48.]
LEPS = 100 × (1 - |Pf - Po|)
For the values in the figure, LEPS = 100 × (1 - |0.20 - 0.48|) = 72.
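A minimal sketch of the score, using an empirical CDF as a stand-in for the fitted climatological curve in the figure (the climatology sample passed in is an assumption for illustration):

```python
import numpy as np

def leps(forecast_value, observed_value, climatology):
    """Basic LEPS = 100 * (1 - |Pf - Po|), where Pf and Po are the positions
    of the forecast and observed values in climatological probability space."""
    clim = np.sort(np.asarray(climatology, float))
    def prob_space(x):
        # empirical cumulative probability of x within the climatology
        return np.searchsorted(clim, x, side="right") / clim.size
    return 100.0 * (1.0 - abs(prob_space(forecast_value)
                              - prob_space(observed_value)))

# With the figure's values Pf = 0.20 and Po = 0.48: 100 * (1 - 0.28) = 72
```

Measuring error in probability space rather than in mm means the score is penalized more for errors in the well-populated middle of the climatology than in the extreme tails.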
26 Rainfall Above/Below Median
- LEPS skill scores
- 34 verified forecasts (from JJA 2000)
- positive skill across most of the country
[Figure: map of LEPS skill scores]
27 Max/Min Temp Above/Below Median
- Australian average LEPS2 scores
- positive averages through most of the 2002/03 El Niño
- periods of low skill generally correspond to periods of low forecast signal
[Figure: time series of Australian average LEPS2 scores; max temp solid, min temp dotted]
28 Linear Error in Probability Space (LEPS)
- Advantages
  - Rewards emphatic forecasts
  - Valid across all categories
  - Can be mapped
- Disadvantages
  - Complex to understand / not intuitive
  - Rather difficult to calculate
  - Penalizes forecast systems which give near-climatological probabilities
29 The Value of Forecasts
- Just because a forecast is skilful doesn't mean that it is valuable. A forecast only has value if it leads to a changed decision.
An Idealized Example - Informed Use of a Probability Forecast
Consider a farmer with 100 sheep. It is the end of winter, and she/he has the option of buying supplementary feed for summer at $10 a head.
30 Informed Use of a Probability Forecast
If the farm receives 100 mm of spring rainfall, there will be sufficient pasture and no need for the extra feed (which will rot). If 100 mm of rain does not fall, however, there will not be sufficient pasture, and hay will have to be bought over summer at a cost of $20 a head. The climate forecast is that there is only a 30% chance of receiving at least 100 mm in spring. What should the farmer do?
31 Informed Use of a Probability Forecast
Definitions: C is the cost of preventative action ($1000); L is the loss from not taking preventative action if the adverse climate outcome occurs ($2000). In general, if P > C/L, the user should take preventative action. That is, protection is optimal when the C/L ratio is less than the probability of the adverse climate outcome. Buying feed costs $1000; not buying has an expected cost of $1400 (0.7 × $2000), so the farmer should buy feed now. The forecast is valuable as it motivates a decision. Imagine, however, an accurate model which never predicts P > 0.5: for this decision, its forecasts would never be valuable! Not all forecasts are relevant to all decisions.
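A minimal sketch of the cost/loss decision rule, with the farmer's numbers from the slide:

```python
def should_protect(p_adverse, cost, loss):
    """Protect when the probability of the adverse outcome exceeds the
    cost/loss ratio, i.e. when the expected loss exceeds the fixed cost."""
    return p_adverse > cost / loss

# Farmer's decision: C = $1000 (feed), L = $2000 (hay), P(adverse) = 0.7
print(should_protect(0.7, 1000, 2000))  # True -> buy feed now
print(0.7 * 2000)                       # expected cost of inaction: $1400
```

Note that the decision threshold here, C/L = 0.5, is specific to this farmer; a different user with a different cost/loss ratio would act on different probabilities, which is why the same forecast can be valuable to one user and worthless to another.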
32 Conclusions
33 Conclusions
- Even for deterministic forecasts, there is no single measure that gives a comprehensive summary of forecast quality:
  - accuracy
  - skill
  - uncertainty
- Probabilistic forecasts address the two fundamental questions:
  - What is going to happen?
  - How confident can we be that it is going to happen?
- These aspects require verification.
34 Conclusions
A probability forecast is always correct. Further, any scoring technique which converts a probability to a categorical outcome runs the risk of encouraging inappropriate decision making. No single skill score can describe all aspects of the verification problem. Fortunately, most skill scores in most situations will tell a similar story. Forecast reliability ≠ forecast skill ≠ forecast value.
35 Further Information
- Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, San Diego. Chapter 7, Forecast verification, pp. 233-283.
- Wilks, D. S., 2001: A skill score based on economic value for probability forecasts. Meteor. Appl., 8, 209-219.
- Hartmann, H. C., et al., 2002: Confidence builders: Evaluating seasonal climate forecasts from user perspectives. Bull. Amer. Meteor. Soc., 83, 683-698.
- d.jones@bom.gov.au or a.watkins@bom.gov.au