Title: METHODS TO EVALUATE PROBABILISTIC AND ENSEMBLE FORECASTS
1 METHODS TO EVALUATE PROBABILISTIC AND ENSEMBLE FORECASTS
- Yuejian Zhu and Zoltan Toth
- SAIC at Environmental Modeling Center
- National Centers for Environmental Prediction
- EMC Seminar Presentation
- November 5, 2002
- Camp Springs, MD
2 Acknowledgements
- Richard Wobus, EMC/NCEP
- Jun Du, EMC/NCEP
- Mozheng Wei, EMC/NCEP
- Steve Tracton, ONR/NAVY
- Olivier Talagrand, LMD/FR
- David Richardson, ECMWF
- Kenneth Mylne, UKMET
- Hua-Lu Pan, EMC/NCEP
- Steve Lord, EMC/NCEP
3 References
- 1. Toth, Talagrand, Candille and Zhu, 2002: "Probability and ensemble forecasts", book chapter, in print.
- 2. Zhu, Iyengar, Toth, Tracton and Marchok, 1996: "Objective evaluation of the NCEP global ensemble forecasting system", AMS conference proceedings.
- 3. Toth, Zhu and Marchok, 2001: "The use of ensembles to identify forecasts with small and large uncertainty", Weather and Forecasting.
- 4. Zhu, Toth, Wobus, Richardson and Mylne, 2002: "The economic value of ensemble-based weather forecasts", BAMS.
- More related articles.
4 CONTENTS
- Introduction
- Ensemble forecasts
- Current status of NCEP model evaluation
- Probabilistic evaluation: simple measurements
- Probabilistic evaluation: multi-categories
- Probabilistic evaluation: cost-loss analysis
- Probabilistic evaluation: useful tools
- Conclusion and discussion
- Plans and more information
5 Introduction
- 1. Why do we need verification/evaluation/diagnostics?
  - a. Diagnose errors introduced at each step of the forecast process:
    Observation --(Data assimilation)--> Analysis
    Initial condition --(Numerical model)--> Forecast
  - b. Assess forecast uncertainty: include uncertainty as part of the forecast.
- 2. What do we use as a proxy for truth?
  - a. Analysis (at grid points)
  - b. Observations (at report stations)
- 3. Various aspects of forecast verification:
  - a. Skill --> compare errors against a reference system (climatology)
  - b. Compare different NWP model forecasts
  - c. Spatial/temporal variations in skill
6 Ensemble Forecasts
- 1. Why do we need ensemble forecasts?
- Look at the following schematic diagrams.
7 Ensemble Forecasts (continued)
- (Schematic: deterministic forecast, initial uncertainty, forecast probability, verified analysis)
8 Ensemble Forecasts (continued)
- 2. Ensemble forecast methods, for example:
  - NCEP and ECMWF: both add perturbations to the initial unperturbed analysis; the perturbations are generated by
    - breeding vectors (BVs --> NCEP): fast and cheaper
    - singular vectors (SVs --> ECMWF): solve a matrix problem
  - CMC/MSC: Monte Carlo approach, using 2 models and different physical parameterizations, designed to simulate
    - observation errors (random perturbations)
    - imperfect boundary conditions
    - model errors
9 Ensemble Forecasts (NCEP's Configuration)
- Last implementation --- T12Z, Jan. 9th, 2001
- 24-hour breeding cycle
- Initial times --- T00Z and T12Z
- Total of 23 runs --- 12 runs at T00Z, 11 runs at T12Z
- Integration --- up to 384 hours (16 days)
- 3 different model resolutions:
  T170L42 --- 75 km
  T126L28 --- 105 km
  T62L28 --- 210 km
- GFS T254/L64 implementation (Oct. 29th, 2002)
- Next implementation --- 40 members (T126 up to 168 hrs), Feb. 2003 (new IBM machine)
- Next implementation --- 6-hr breeding cycle
10 Current Status of NCEP Model Verification
1. Current status of NCEP global model objective verification:
   >> For the single model forecast (deterministic forecast):
      Fcst vs. Analysis (500 hPa, 1000 hPa, 850 hPa, etc.)
      Fcst vs. Obs (sigma/pressure vs. Obs, 3-dimensional)
   >> We are objectively evaluating the following models, too:
      ECMWF: T12Z, up to 168 hours (7 days)
      UKMET: T00Z and T12Z, up to 144 hours (6 days)
      CMC: T00Z and T12Z, up to 144 hours (6 days)
      NOGAPS: T00Z and T12Z, up to 120 hours (5 days)
      ---> multi-model ensemble from the above products, if we receive longer lead-time forecasts
2. Global ensemble evaluation/verification:
      NCEP ensemble: T00Z and T12Z, up to 16 days
      ECMWF ensemble: T00Z and T12Z, up to 10 days
      CMC ensemble: T00Z and T12Z, up to 10 days
      ---> We will have a seminar for the comparison
11 Probabilistic Evaluation
- Introduce two characteristics: reliability and resolution.
- Reliability --- forecast <-> observation: the property of statistical consistency between predicted probabilities and observed frequencies of occurrence of the event under consideration.
- Resolution --- forecast/observation <-> climatology: the ability of a forecast system to discern sub-sample forecast periods with different relative frequencies of the event (the difference between sub-sample and overall-sample climatology).
- Reliability and resolution together determine the usefulness of a probabilistic forecast system.
12 Prob. Evaluation (simple measurements)
- 1. Talagrand distribution (rank histogram):
  - Sort the ensemble forecasts in order and check where the analysis falls.
  - A reliability measurement; system biases are detected.
  - Positive/negative bias of the forecasting model; the example forecasts --> cold bias.
  - Assumes the analysis is bias-free (perfect).
- (Histograms: common "U" shape vs. average/flat distribution)
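
As an illustration of the rank-histogram computation, a minimal Python sketch follows; it assumes `ens` holds the ensemble forecasts with shape (n_members, n_cases) and `anl` the verifying analysis with shape (n_cases,). The array names and layout are illustrative, not the actual NCEP verification code.

    import numpy as np

    def talagrand_histogram(ens, anl):
        """Relative frequency of the analysis falling in each of the n+1 rank bins."""
        n_members, n_cases = ens.shape
        ens_sorted = np.sort(ens, axis=0)          # order the members at each case
        ranks = (ens_sorted < anl).sum(axis=0)     # members lying below the analysis
        counts = np.bincount(ranks, minlength=n_members + 1)
        return counts / n_cases

    # A flat histogram (each bin near 1/(n+1)) indicates a statistically reliable
    # ensemble; a "U" shape indicates under-dispersion, a one-sided slope a bias.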
13 Prob. Evaluation (simple measurements)
- 1. Talagrand distribution (continued):
  - Evolution of the outliers with lead time.
  - Add up the two outlier bins, then subtract the expected average; even ideal forecasts will have some outliers. See the sketch below.
  - (Figure annotation: due to the inability of the ensemble to capture model-related errors?)
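
A minimal sketch of the outlier statistic described above, taking the rank-histogram frequencies from the previous sketch as input; the function name and the 2/(n+1) reference value for a statistically consistent ensemble are the assumptions here.

    def excess_outliers(rank_freq):
        """rank_freq: the (n_members + 1) relative frequencies of the rank histogram."""
        n_members = len(rank_freq) - 1
        observed = rank_freq[0] + rank_freq[-1]   # analysis below all / above all members
        expected = 2.0 / (n_members + 1)          # even a perfect ensemble has some outliers
        return observed - expected                # > 0: excess outliers (under-dispersion)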
14 Prob. Evaluation (simple measurements)
- Outlier --> diagnostics:
  - Forecasts vs. the next forecasts (f+24 hrs, valid at the same time).
  - Assume the forecasting model is perfect and use the f+24 forecast as truth.
  - A perfect forecast system would then be expected to have zero outliers.
- (Figure annotation: detecting model initial uncertainty?)
15 Prob. Evaluation (simple measurements)
- 2. Outlier maps (2-dimensional):
  - Flow-dependent systematic model errors.
  - 40 members at 4 different lead times.
  - Measured by the normalized distance:
    d = (AVGe - ANL) / sqrt( (1/(n-1)) * SPRDe )
  - The model error evaluation (outliers) is part of the model bias estimation (see map). The "normalized distance" is defined as the difference between AVGe and ANL divided by the square root of the normalized ensemble spread. Therefore, a positive distance means a positive bias of the ensemble forecasts, and a negative distance means a negative bias. A minimal sketch of the calculation follows this list.
  - ---> Show example maps.
  - Areas where consecutive ensembles fail (miss):
    -- initial errors? / errors in boundary forcing?
    -- model development: identify problem areas.
  - Weather systems where consecutive ensembles fail (miss):
    -- inability to capture model-related errors?
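
A minimal sketch of the normalized distance, under the assumption that SPRDe is the sum of squared member deviations from the ensemble mean (so the denominator reduces to the ensemble standard deviation); the field names and shapes are illustrative.

    import numpy as np

    def normalized_distance(ens, anl):
        """ens: (n_members, ny, nx) ensemble fields; anl: (ny, nx) analysis field."""
        n = ens.shape[0]
        avg_e = ens.mean(axis=0)                         # AVGe
        sprd_e = ((ens - avg_e) ** 2).sum(axis=0)        # SPRDe: sum of squared deviations
        return (avg_e - anl) / np.sqrt(sprd_e / (n - 1))

    # Positive values flag a positive ensemble bias at a grid point, negative
    # values a negative bias, as in the outlier maps described above.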
16 Prob. Evaluation (multi-categories)
- Based on climatologically equally likely bins (for example, 5 bins).
- For verifying multi-category probability forecasts.
- Measures both reliability and resolution.
- 1. Ranked (ordered) probability score (RPS) and RPSS (see the sketch below):
  RPSS = ( RPSf - RPSc ) / ( 1 - RPSc )
17 Prob. Evaluation (multi-categories)
- 2. Brier Score (BS, non-ranked) and Brier Skill Score (BSS):
  - From two categories to multi-category/probabilistic forecasts.
  - Measures both reliability and resolution (see the sketch below).
- (Figure: Brier Skill Score; skill line, reference is climatology)
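
A minimal sketch of the Brier score and Brier skill score for a single event category, assuming `p` is the forecast event probability, `o` the binary observation (1 = event occurred), and sample climatology the reference.

    import numpy as np

    def brier_score(p, o):
        """Mean squared difference between forecast probability and 0/1 outcome."""
        return np.mean((p - o) ** 2)

    def brier_skill_score(p, o):
        """Skill relative to the constant climatological probability forecast."""
        o_bar = np.mean(o)
        bs_clim = brier_score(np.full_like(p, o_bar), o)
        return 1.0 - brier_score(p, o) / bs_clim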
18 Prob. Evaluation (multi-categories)
- 3. Decomposition of the Brier Score:
  - Considers sub-samples and the overall sample.
  - Reliability, resolution and uncertainty.
  - For reliability, 0 is perfectly reliable.
  - For resolution, 0 is no resolution (climatology).
  - When resolution = reliability --> no skill (a sketch of the decomposition follows).
  - Example from the global ensemble.
- (Figure: reliability and resolution curves by lead time; no skill beyond the point where they cross)
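
A minimal sketch of the decomposition (BS = reliability - resolution + uncertainty), assuming the forecast probabilities can be pooled into a fixed set of probability bins, as with an ensemble of fixed size. The bin count and names are illustrative.

    import numpy as np

    def brier_decomposition(p, o, n_bins=11):
        """Return (reliability, resolution, uncertainty); BS = rel - res + unc."""
        o_bar = o.mean()
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        which = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
        rel = res = 0.0
        for k in range(n_bins):
            mask = which == k
            n_k = mask.sum()
            if n_k == 0:
                continue
            p_k = p[mask].mean()                 # mean forecast probability in the bin
            o_k = o[mask].mean()                 # observed frequency in the bin
            rel += n_k * (p_k - o_k) ** 2
            res += n_k * (o_k - o_bar) ** 2
        n = len(p)
        return rel / n, res / n, o_bar * (1.0 - o_bar)

    # Skill relative to climatology exists only while resolution exceeds
    # reliability, consistent with the "no skill" point noted above.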
19 Prob. Evaluation (multi-categories)
- 4. Reliability and possible calibration (remove bias):
  - For period precipitation evaluation.
- (Reliability diagram: calibrated forecast vs. raw forecast; skill line, resolution line, climatological probability)
20 Prob. Evaluation (multi-categories)
- 5. Reliability and possible probabilistic calibration:
  - Re-label the forecast probability with the observed frequency associated with that forecast probability (a sketch follows).
- (Reliability diagram: calibrated vs. un-calibrated)
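
A minimal sketch of the re-labelling calibration: a training sample is used to build a lookup table from forecast-probability bins to observed frequencies, which is then applied to new forecasts. The bin count and function names are illustrative assumptions.

    import numpy as np

    def build_calibration_table(p_train, o_train, n_bins=11):
        """Observed frequency for each forecast-probability bin in a training sample."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        which = np.clip(np.digitize(p_train, edges) - 1, 0, n_bins - 1)
        table = np.empty(n_bins)
        for k in range(n_bins):
            mask = which == k
            # fall back to the bin centre when the bin is empty in the training data
            table[k] = o_train[mask].mean() if mask.any() else 0.5 * (edges[k] + edges[k + 1])
        return edges, table

    def calibrate(p, edges, table):
        """Re-label each forecast probability with its observed frequency."""
        which = np.clip(np.digitize(p, edges) - 1, 0, len(table) - 1)
        return table[which]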
21 Prob. Evaluation (cost-loss analysis)
- Based on hit rate (HR) and false alarm (FA) rate.
- 1. Relative Operating Characteristics (ROC) area (a sketch follows):
  - Application of signal detection theory for measuring discrimination between two alternative outcomes.
  - ROC area = integrated area x 2 ( 0-1 normalized )
  - Contingency table:
      o\f     y(f)   n(f)
      y(o)     h      m
      n(o)     f      c
- (ROC diagram: y-axis h/(h+m), x-axis f/(h+f))
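
A minimal sketch of a ROC calculation for a probabilistic forecast: each probability threshold yields a 2 x 2 table and hence one point on the curve, and the area is obtained with the trapezoidal rule. The sketch uses the conventional hit rate h/(h+m) and false alarm rate f/(f+c) axes; the threshold set and names are illustrative.

    import numpy as np

    def roc_curve(p, o, thresholds=np.linspace(0.0, 1.0, 11)):
        """Hit rate and false alarm rate at each probability threshold."""
        hit, fal = [], []
        for t in thresholds:
            yes = p >= t
            h = np.sum(yes & (o == 1))            # hits
            m = np.sum(~yes & (o == 1))           # misses
            f = np.sum(yes & (o == 0))            # false alarms
            c = np.sum(~yes & (o == 0))           # correct rejections
            hit.append(h / max(h + m, 1))
            fal.append(f / max(f + c, 1))
        return np.array(fal), np.array(hit)

    def roc_area(p, o):
        """Integrate the ROC curve with the trapezoidal rule."""
        fal, hit = roc_curve(p, o)
        order = np.argsort(fal)
        return np.trapz(hit[order], fal[order])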
22 Prob. Evaluation (cost-loss analysis)
- Based on hit rate (HR) and false alarm (FA) rate.
- 2. Relative Operating Characteristics (ROC) distance:
  - For the control forecast and the ensemble forecasts.
  - D = distance from the point (control) to the ensemble polygon.
    - Positive --> control is better than the ensemble.
    - Negative --> ensemble is better than the control.
- (Figures: high-resolution control forecast, low-resolution control forecast)
23 Prob. Evaluation (cost-loss analysis)
- 3. Economic Value (EV) of forecasts:
  - Given a particular forecast, a user either does or does not take protective action (see the sketch below).
- (Figure: value curves; highest value (110), value line; ensemble forecast vs. deterministic forecast)
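
A minimal sketch of the cost-loss economic value, following the standard formulation (as in reference 4): the user's expense with the forecast is compared to the climatological and perfect-forecast expenses, all normalized by the loss L. H, F, o_bar and alpha = C/L denote the hit rate, false alarm rate, climatological event frequency and cost-loss ratio; the function name is illustrative.

    def economic_value(H, F, o_bar, alpha):
        """Value of a yes/no forecast for a user with cost/loss ratio alpha = C/L."""
        e_clim = min(alpha, o_bar)                 # better of "always" / "never" protect
        e_perfect = o_bar * alpha                  # protect only when the event occurs
        e_fcst = F * (1.0 - o_bar) * alpha + H * o_bar * alpha + (1.0 - H) * o_bar
        denom = e_clim - e_perfect
        return (e_clim - e_fcst) / denom if denom != 0.0 else 0.0

    # For an ensemble, evaluating the value at each probability threshold and
    # taking the upper envelope over thresholds gives the value curve as a
    # function of alpha, to compare against the single deterministic forecast.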
24 Prob. Evaluation (cost-loss analysis)
- Based on hit rate (HR) and false alarm (FA) analysis.
- Economic Value (EV) of forecasts (continued).
- (Figure: ensemble forecast vs. deterministic forecast; average 2-day advantage for the ensemble)
25 Prob. Evaluation (cost-loss analysis)
- Based on the 2 x 2 contingency table for precipitation forecasts.
- 4. ETS, TSS and FBI (see the sketch below):
  - ETS -- Equitable Threat Score:
    ETS = ( h - R(h) ) / ( h + f + m - R(h) )
    where R(h) = (h+f)(h+m) / (h+f+m+c)
  - TSS -- True Skill Statistic:
    combines the probability of detection (hit rate) and the false alarm rate.
    TSS = ( h*c - f*m ) / ( (h+m)*(f+c) )
  - FBI -- Frequency Bias Index:
    FBI = ( h + f ) / ( h + m )
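
A minimal sketch of the three scores from the 2 x 2 table entries (h = hits, m = misses, f = false alarms, c = correct rejections), directly transcribing the formulas above.

    def ets(h, m, f, c):
        """Equitable Threat Score: hits corrected for those expected by chance."""
        r_h = (h + f) * (h + m) / (h + f + m + c)
        return (h - r_h) / (h + f + m - r_h)

    def tss(h, m, f, c):
        """True Skill Statistic: hit rate minus false alarm rate."""
        return (h * c - f * m) / ((h + m) * (f + c))

    def fbi(h, m, f, c):
        """Frequency bias: forecast event frequency over observed event frequency."""
        return (h + f) / (h + m)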
26 Prob. Evaluation (Useful Tools)
- 1. Small and large uncertainty:
  - Products: spaghetti diagrams and RMOP (relative measure of predictability).
  - For an example of the evaluation and the statistical results, see the next diagram -->
27 Prob. Evaluation (useful tools)
- ... Small and large uncertainty (continued).
- (Maps: 1-day forecast with large uncertainty; 4-day control; 10-13 day forecasts with small uncertainty)
28 Prob. Evaluation (useful tools)
- Example of large vs. small uncertainty.
- (Figure: deterministic forecast, small uncertainty, large uncertainty, information content)
29 Prob. Evaluation (useful tools)
- 2. Information content:
  - Statistics show that a 7.5-day fully probabilistic forecast, or a 6-day categorical forecast, has as much information content as a 5-day control forecast. Or: the fully probabilistic forecast has more than twice as much information content at day 5.
  - The ensemble mode is taken as the most frequent forecast.
30 Prob. Evaluation (useful tools)
- 3. Bimodality -- possibly another seminar by Zoltan Toth.
- Work in progress.
31 Conclusion and Discussion
- 1. Two probabilistic attributes:
  - Reliability --> must be statistically consistent with the observations.
  - Resolution --> more information with respect to climatology.
- 2. Scores for the probabilistic attributes:
  - Reliability: Talagrand, outliers, RPS, BS.
  - Resolution: ROC, EV, IC, RPS, BS.
- 3. Calibration could improve reliability (resolution).
- 4. Ensemble forecasts vs. single-value forecasts: the potential abilities of ensemble forecasts.
- 5. Probabilistic forecasts and related probabilistic skill?
32 Plans and Information
- 1. Apply the methods to different scales of forecasts:
  - short, medium and seasonal ranges
  - regional/global
- 2. Using observations instead of analyses will be another set of evaluations (thanks to Mark Iredell for making OBS possible).
- Visit the NCEP global ensemble web page:
  http://wwwt.emc.ncep.noaa.gov/gmb/ens
  click "Verification"
- Or visit Yuejian Zhu's research home page:
  http://wwwt.emc.ncep.noaa.gov/gmb/yzhu
  click "Ensemble evaluation"