Title: Verification Introduction
1 Verification Introduction
Holly C. Hartmann
Department of Hydrology and Water Resources, University of Arizona
hollyoregon_at_juno.com
RFC Verification Workshop, 08/14/2007
2 Goals
- General concepts of verification
- Think about how to apply them to your operations
- Be able to respond to and influence the NWS verification program
- Be prepared as new tools become available
- Be able to do some of your own verification
- Be able to work with researchers on verification projects
- Contribute to development of verification tools (e.g., look at various options)
- Avoid some typical mistakes
3 Agenda
1. Introduction to Verification
   - Applications, Rationale, Basic Concepts
   - Data Visualization and Exploration
   - Deterministic Scalar Measures
2. Categorical Measures (Kevin Werner)
   - Deterministic Forecasts
   - Ensemble Forecasts
3. Diagnostic Verification
   - Reliability
   - Discrimination
   - Conditioning/Structuring Analyses
4. Lab Session/Group Exercise
   - Developing Verification Strategies
   - Connecting to Forecast Operations and Users
4 Why Do Verification? It depends.
- Administrative: logistics, selected quantitative criteria
- Operations: inputs, model states, outputs - quick!
- Research: sources of error, targeting research
- Users: making decisions, exploiting skill, avoiding mistakes
Concerns about verification?
5 Need for Verification Measures
- Verification statistics identify
  - accuracy of forecasts
  - sources of skill in forecasts
  - sources of uncertainty in forecasts
  - conditions where and when forecasts are skillful or not skillful, and why
- Verification statistics then can inform
  - improvements in terms of forecast skill and decision making with alternate forecast sources (e.g., climatology, persistence, new forecast systems)
Adapted from Regonda, Demargne, and Seo, 2006
6 Skill versus Value
Assess the quality of a forecast system, i.e., determine the skill and value of the forecast.
Credit: Hagedorn (2006) and Julie Demargne
7 Stakeholder Use of HydroClimate Info and Forecasts
Common across all groups:
- Uninformed or mistaken about forecast interpretation
- Use of forecasts limited by lack of demonstrated forecast skill
- Have difficulty specifying required accuracy
Common across many, but not all, stakeholders:
- Have difficulty distinguishing between good and bad products
- Have difficulty placing forecasts in historical context
Unique among stakeholders:
- Relevant forecast variables, regions (location and scale), seasons, lead times, performance characteristics
- Technical sophistication: base probabilities, distributions, math
- Role of forecasts in decision making
8 What is a Perfect Forecast?
Forecast evaluation concepts
"All happy families are alike; each unhappy family is unhappy in its own way." -- Leo Tolstoy (1876)
"All perfect forecasts are alike; each imperfect forecast is imperfect in its own way." -- Holly Hartmann (2002)
9 Different Forecasts, Information, Evaluation
Deterministic - Categorical - Probabilistic
"Today's high will be 76 degrees, and it will be partly cloudy, with a 30% chance of rain."
10 Different Forecasts, Information, Evaluation
Deterministic - Categorical - Probabilistic
"Today's high will be 76 degrees, and it will be partly cloudy, with a 30% chance of rain."
[Figure: the same statement split into its deterministic (76 degrees), probabilistic (30% chance of rain), and categorical (rain / no rain) components]
How would you evaluate each of these?
11 Different Forecasts, Information, Evaluation
Deterministic - Categorical - Probabilistic
"Today's high will be 76 degrees, and it will be partly cloudy, with a 30% chance of rain."
[Figure: the deterministic component (76 degrees) highlighted]
12 ESP Forecasts: User preferences influence verification
From California-Nevada River Forecast Center
13 ESP Forecasts: User preferences influence verification
From California-Nevada River Forecast Center
14 ESP Forecasts: User preferences influence verification
From California-Nevada River Forecast Center
15 ESP Forecasts: User preferences influence verification
From California-Nevada River Forecast Center
16 So Many Evaluation Criteria!
Categorical:
- Hit Rate, Surprise Rate, Threat Score, Gerrity Score, Success Ratio, Post-agreement, Percent Correct, Pierce Skill Score, Gilbert Skill Score, Heidke Skill Score, Critical Success Index, Percent N-class Errors, Modified Heidke Skill Score, Hanssen and Kuipers Score, Gandin and Murphy Skill Scores
Deterministic:
- Bias, Correlation, RMSE, Standardized RMSE, Nash-Sutcliffe, Linear Error in Probability Space
Probabilistic:
- Brier Score, Ranked Probability Score
Distributions-oriented Measures:
- Reliability, Discrimination, Sharpness
17 RFC Verification System Metrics
1. Categorical (predefined threshold, range of values)
   - Deterministic: Probability Of Detection (POD), False Alarm Ratio (FAR), Probability Of False Detection (POFD), Lead Time of Detection (LTD), Critical Success Index (CSI), Pierce Skill Score (PSS), Gerrity Score (GS)
   - Probabilistic: Brier Score (BS), Rank Probability Score (RPS)
2. Error (accuracy)
   - Deterministic: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Error (ME), Bias, Linear Error in Probability Space (LEPS)
   - Probabilistic: Continuous RPS
3. Correlation
   - Pearson correlation coefficient, ranked correlation coefficient, scatter plots
4. Distribution Properties
   - Deterministic: mean, variance, higher moments for observations and forecasts
   - Probabilistic: Wilcoxon rank sum test, variance of forecasts, variance of observations, ensemble spread, Talagrand Diagram (or Rank Histogram)
Source: Verification Group, courtesy J. Demargne
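As a concrete illustration of the categorical metrics in row 1, here is a minimal sketch (not the RFC/IVP implementation) that computes POD, FAR, POFD, CSI, and PSS from a 2x2 contingency table; the function name and the example counts are hypothetical.

```python
# Minimal sketch, assuming hypothetical counts of hits, misses,
# false alarms, and correct negatives from a 2x2 contingency table.

def categorical_metrics(hits, misses, false_alarms, correct_negatives):
    pod = hits / (hits + misses)                               # Probability Of Detection
    far = false_alarms / (hits + false_alarms)                 # False Alarm Ratio
    pofd = false_alarms / (false_alarms + correct_negatives)   # Probability Of False Detection
    csi = hits / (hits + misses + false_alarms)                # Critical Success Index
    pss = pod - pofd                                           # Pierce Skill Score
    return {"POD": pod, "FAR": far, "POFD": pofd, "CSI": csi, "PSS": pss}

# Hypothetical counts: 42 hits, 8 misses, 13 false alarms, 137 correct negatives
print(categorical_metrics(42, 8, 13, 137))
```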
18 RFC Verification System Metrics
5. Skill Scores (relative accuracy over a reference forecast)
   - Deterministic: Root Mean Squared Error Skill Score (SS-RMSE) (with reference to persistence, climatology, lagged persistence), Wilson Score (WS), Linear Error in Probability Space Skill Score (SS-LEPS)
   - Probabilistic: Rank Probability Skill Score, Brier Skill Score (with reference to persistence, climatology, lagged persistence)
6. Conditional Statistics (based on occurrence of specific events)
   - Deterministic: Relative Operating Characteristic (ROC), reliability measures, discrimination diagram, other discrimination measures
   - Probabilistic: ROC and ROC Area, other resolution measures, reliability diagram, discrimination diagram, other discrimination measures
7. Confidence (metric uncertainty)
   - Deterministic: sample size, Confidence Interval (CI)
   - Probabilistic: ensemble size, sample size, Confidence Interval (CI)
Source: Verification Group, courtesy J. Demargne
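To make the probabilistic scores in rows 1 and 5 concrete, here is a minimal sketch of the Brier Score and a Brier Skill Score referenced to climatology; the function names and example data are hypothetical, and the RFC system may use different reference forecasts.

```python
import numpy as np

def brier_score(probs, occurred):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    p = np.asarray(probs, dtype=float)
    o = np.asarray(occurred, dtype=float)
    return float(np.mean((p - o) ** 2))

def brier_skill_score(probs, occurred):
    """Skill relative to climatology (forecasting the observed event frequency)."""
    o = np.asarray(occurred, dtype=float)
    bs = brier_score(probs, o)
    bs_clim = brier_score(np.full(o.shape, o.mean()), o)
    return 1.0 - bs / bs_clim

# Hypothetical event-probability forecasts and observed outcomes (1 = event occurred)
probs = [0.9, 0.7, 0.2, 0.1, 0.6]
obs = [1, 1, 0, 0, 1]
print(brier_score(probs, obs), brier_skill_score(probs, obs))
```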
19 Possible Performance Criteria
Accuracy - overall correspondence between forecasts and observations
Bias - difference between average forecast and average observation
Consistency - forecasts don't waffle around
[Figure: example of good consistency]
20 Possible Performance Criteria
Accuracy - overall correspondence between forecasts and observations
Bias - difference between average forecast and average observation
Consistency - forecasts don't waffle around
Sharpness/Refinement - ability to make bullish forecast statements
[Figure: example forecast distributions, one not sharp and one sharp]
21 What makes a forecast good?
- Accuracy: forecasts should agree with observations, with few large errors
- Bias: forecast mean should agree with observed mean
- Association: linear relationship between forecasts and observations
- Skill: forecast should be more accurate than low-skill reference forecasts (e.g., random chance, persistence, or climatology)
Adapted from Ebert (2003)
22 What makes a forecast good?
- Reliability: binned forecast values should agree with binned observations (agreement between categories); see the sketch after this list
- Resolution: forecast can discriminate between events and non-events
- Sharpness: forecast can predict with strong probabilities (i.e., 100% for event, 0% for non-event)
- Spread (Variability): forecast represents the associated uncertainty
Adapted from Ebert (2003)
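The reliability and sharpness bullets above can be made operational with a simple binning of forecast probabilities; the sketch below is a hypothetical illustration (function name and binning choices are assumptions, not part of any existing tool): reliability is the agreement between the last two columns, and sharpness shows up in how the counts concentrate near 0 and 1.

```python
import numpy as np

def reliability_table(probs, occurred, n_bins=10):
    """For each probability bin: count, mean forecast probability, and
    observed relative frequency of the event."""
    p = np.asarray(probs, dtype=float)
    o = np.asarray(occurred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & ((p < hi) if hi < 1.0 else (p <= hi))
        if mask.any():
            rows.append((lo, hi, int(mask.sum()),
                         float(p[mask].mean()), float(o[mask].mean())))
    return rows  # (bin_lo, bin_hi, count, mean forecast prob, observed frequency)
```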
23 Forecasting Tradeoffs
Forecast performance is multi-faceted.
False Alarms: a warning without an event (reflected in the False Alarm Ratio).
Surprises: an event without a warning (reflected in the Probability of Detection).
[Figure: fire-warning example, including the "no fire" false-alarm case]
A forecaster's fundamental challenge is balancing these two. Which is more important? It depends on the specific decision context.
24 How Good? Compared to What?
Skill Score = (S_forecast - S_baseline) / (S_perfect - S_baseline)
Example: Skill Score = (0.50 - 0.54) / (1.00 - 0.54) ≈ -8.7%, i.e., worse than guessing.
What is the appropriate baseline?
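A minimal sketch of the generic skill score above, reproducing the slide's worked example; the function name is hypothetical and the baseline score must come from whatever reference forecast (guessing, persistence, climatology) is appropriate.

```python
def skill_score(s_forecast, s_baseline, s_perfect=1.0):
    """Fraction of the possible improvement over the baseline that the
    forecast actually achieves (negative means worse than the baseline)."""
    return (s_forecast - s_baseline) / (s_perfect - s_baseline)

# The slide's example: forecast score 0.50, baseline (guessing) 0.54,
# perfect score 1.00 -> about -0.087, roughly 8.7% worse than guessing.
print(skill_score(0.50, 0.54))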
25 Graphical Forecast Evaluation
26 Basic Data Display
Historical seasonal water supply outlooks, Colorado River Basin
Morrill, Hartmann, and Bales, 2007
27 Scatter Plots
Historical seasonal water supply outlooks, Colorado River Basin
Morrill, Hartmann, and Bales, 2007
28 Histograms
Historical seasonal water supply outlooks, Colorado River Basin
Morrill, Hartmann, and Bales, 2007
29 IVP Scatterplot Example
Source: H. Herr
30 Cumulative Distribution Function (CDF) in IVP
Cat 1: no observed precipitation; Cat 2: observed precipitation (> 0.001)
Empirical distribution of forecast probabilities for the different observation categories
Goal: widely separated CDFs
Source: H. Herr, IVP Charting Examples, 2007
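A minimal sketch of the same idea outside IVP: build the empirical CDF of forecast probabilities separately for each observation category and look for separation between the curves. The function names and category labels are hypothetical; this is not the IVP implementation.

```python
import numpy as np

def conditional_cdfs(probs, event_observed):
    """Empirical CDFs of forecast probabilities, conditioned on whether the
    event occurred (Cat 2) or not (Cat 1). Widely separated curves mean the
    forecasts discriminate well between the two categories."""
    p = np.asarray(probs, dtype=float)
    e = np.asarray(event_observed, dtype=bool)

    def ecdf(values):
        x = np.sort(values)
        y = np.arange(1, x.size + 1) / x.size
        return x, y

    return {"Cat 1 (no precip)": ecdf(p[~e]), "Cat 2 (precip)": ecdf(p[e])}
```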
31 Probability Density Function (PDF) in IVP
Cat 1: no observed precipitation; Cat 2: observed precipitation (> 0.001)
Empirical distribution over 10 bins in the IVP GUI
Goal: widely separated PDFs
Source: H. Herr, IVP Charting Examples, 2007
32 Box-plots: Quantiles and Extremes
Based on summarizing the CDF computation and plot
Goal: widely separated box-plots
Cat 1: no observed precipitation; Cat 2: observed precipitation (> 0.001)
Source: H. Herr, IVP Charting Examples, 2007
33 Scalar Forecast Evaluation
34 Standard Scalar Measures
Bias: mean forecast minus mean observed.
Correlation coefficient: variance shared between forecast and observed (r^2); says nothing about bias or whether the forecast variance matches the observed variance. The Pearson correlation coefficient assumes a normal distribution; rank correlation does not require normality.
Root Mean Squared Error: distance between forecast and observation values; better than correlation alone, but poor when errors are heteroscedastic, and it emphasizes performance for high flows. Alternative: Mean Absolute Error (MAE).
[Scatter plot of forecast vs. observed values]
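A minimal sketch of these scalar measures for a paired forecast/observation series; the function name and example volumes are hypothetical.

```python
import numpy as np

def scalar_measures(forecast, observed):
    """Bias, Pearson correlation, RMSE, and MAE for paired series."""
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    return {
        "bias": float(f.mean() - o.mean()),             # mean forecast - mean observed
        "corr": float(np.corrcoef(f, o)[0, 1]),         # Pearson r (r**2 = shared variance)
        "rmse": float(np.sqrt(np.mean((f - o) ** 2))),  # penalizes large errors most
        "mae": float(np.mean(np.abs(f - o))),           # less sensitive to outliers
    }

# Hypothetical seasonal volumes (1000s of ac-ft)
print(scalar_measures([520, 610, 480, 700], [500, 650, 430, 720]))
```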
35 Standard Scalar Measures (with Scatterplots)
1943-99 April 1 forecasts of Apr-Sept streamflow, Stehekin R at Stehekin, WA: Bias 22, Corr 0.92, RMSE 74.4
1954-97 January 1 forecasts of Jan-May streamflow, Verde R below Tangle Crk, AZ: Bias -87.5, Corr 0.58, RMSE 228.3
[Scatter plots of forecast vs. observed, in 1000s of ac-ft]
36 IVP Deterministic Scalar Measures
- ME: smallest, since positive and negative errors cancel
- MAE vs. RMSE: RMSE is influenced by large errors for large events
- MAXERR: the largest error
- Sample size: small samples carry large uncertainty
Source: H. Herr, IVP Charting Examples, 2007
37 IVP RMSE Skill Scores
Skill compared to a persistence forecast; see the sketch below.
Source: H. Herr, IVP Charting Examples, 2007
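A minimal sketch of an SS-RMSE with a persistence baseline, under the assumption that forecast and observed series are aligned in time; the function names, lag choice, and example flows are hypothetical and not the IVP implementation. Because a perfect RMSE is 0, the general skill-score formula reduces to 1 - RMSE_forecast / RMSE_persistence.

```python
import numpy as np

def rmse(forecast, observed):
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean((f - o) ** 2)))

def ss_rmse_vs_persistence(forecast, observed, lag=1):
    """SS-RMSE where the baseline 'forecast' for each time step is simply the
    observation 'lag' steps earlier (persistence)."""
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    rmse_fcst = rmse(f[lag:], o[lag:])
    rmse_pers = rmse(o[:-lag], o[lag:])
    return 1.0 - rmse_fcst / rmse_pers

# Hypothetical daily flows: observed and forecast series aligned in time
obs = [10, 12, 15, 14, 18, 22, 20]
fcst = [11, 12, 14, 15, 17, 21, 21]
print(ss_rmse_vs_persistence(fcst, obs))
```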