Robin Hogan, Ewan O

Objective assessment of the skill of cloud forecasts: Towards an NWP-testbed. Robin Hogan, Ewan O'Connor, Andrew Barrett, University of Reading, UK

Slides: 44

Transcript and Presenter's Notes

1
Objective assessment of the skill of cloud
forecasts: Towards an NWP-testbed

Robin Hogan, Ewan O'Connor, Andrew Barrett
University of Reading, UK
Maureen Dunn, Karen Johnson
Brookhaven National Laboratory

2
Overview

Cloud schemes in NWP models are basically the same as in climate models, but easier to evaluate using ARM because:
  NWP models are trying to simulate the actual weather observed
  They are run every day
  In Europe at least, NWP modelers are more interested in comparisons with ARM-like data than climate modelers (not true in US?)
But can we use these comparisons to improve the physics?
  Can compare different models which have different parameterizations
  But each model uses a different data assimilation system
  Cleaner test if the setup is identical except for one aspect of physics
  SCM-testbed is the crucial addition to the NWP-testbed
How do we set such a system up?
  Start by interfacing Cloudnet processing with ARM products
  Metrics test both bias and skill (can only test bias of a climate model)
  Diurnal compositing to evaluate boundary-layer physics

3
Level 1b

Minimum instrument requirements at each site
Cloud radar, lidar, microwave radiometer, rain
gauge, model or sondes

Radar
Lidar

4
Level 1c

Instrument Synergy product
Example of target classification and data quality
fields

Ice
Liquid
Rain
Aerosol
5
Level 2a/2b

Cloud products on (L2a) observational and (L2b)
model grid
Water content and cloud fraction

Figure panels: L2a, IWC on radar/lidar grid; L2b, cloud fraction on model grid
6
Cloud fraction
Figure panels: Chilbolton observations; Met Office Mesoscale Model; ECMWF Global Model; Meteo-France ARPEGE Model; KNMI RACMO Model; Swedish RCA Model
7
Cloud fraction in 7 models

Mean PDF for 2004 for Chilbolton, Paris and
Cabauw

0-7 km
Illingworth et al. (BAMS 2007)
8
ARM-Cloudnet interface

First step: interface ARM products to Cloudnet processing
Now done at Reading; need to implement at Brookhaven
Is this a long-term solution?
Extra products and verification metrics still
desirable

9
Skill and bias

If directly evaluating a climate model, can only evaluate bias
  Zero bias can often be because of compensating errors
In NWP- and SCM-testbed, can also measure skill
  Answers the question: was cloud forecast at the right time?
  This checks whether the cloud responds to the correct forcing
  Easiest to do for binary events, e.g. threshold exceedance
Metrics of skill should be:
  Equitable (random and constant forecasts score zero)
  Robust for rare events (many scores tend to 0 or 1)
A metric with good properties is the Symmetric Extreme Dependency Score (SEDS), Hogan et al. (2009)
  Awards score of 1 to perfect forecast and 0 for random
We have tested 3 models over SGP in 2004...
  Apply with cloud-fraction threshold of 0.1
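
The thresholding step on this slide can be sketched as follows. This is a minimal illustration rather than the Cloudnet processing itself: the function name contingency_table, the choice of ">=" to define threshold exceedance, and the sample cloud fractions are all assumptions made for the example.

```python
def contingency_table(obs_cf, model_cf, threshold=0.1):
    """Count hits (a), false alarms (b), misses (c) and clear-sky hits (d).

    An "event" is cloud fraction >= threshold; the >= convention is an
    assumption made for this sketch.
    """
    a = b = c = d = 0
    for obs, mod in zip(obs_cf, model_cf):
        obs_event = obs >= threshold
        mod_event = mod >= threshold
        if mod_event and obs_event:
            a += 1      # cloud forecast and observed
        elif mod_event:
            b += 1      # cloud forecast but not observed
        elif obs_event:
            c += 1      # cloud observed but not forecast
        else:
            d += 1      # clear sky forecast and observed
    return a, b, c, d


# Made-up hourly cloud fractions for one model grid box
observed = [0.0, 0.5, 0.05, 0.9, 0.0, 0.2]
forecast = [0.0, 0.3, 0.40, 0.0, 0.05, 0.15]
print(contingency_table(observed, forecast))  # (2, 1, 1, 2)
```

The four counts are the only input needed by the skill scores discussed later in the presentation.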

10
Southern Great Plains 2004
ECMWF NCEP UK Met Office (Hadley Centre
? Met Office)
11
Winter 2004
12
Summer 2004
13
Microbase IWC vs. ECMWF

Maureen Dunn

14
Different mixing schemes
Longwave cooling
Height (z)
Virtual potential temp. ($\theta_v$)
15
Different mixing schemes
Non-local mixing scheme (e.g. Met Office, ECMWF,
RACMO)
Longwave cooling

Use a test parcel to locate the unstable
regions of the atmosphere
Eddy diffusivity is positive over this region
with a strength determined by the cloud-top
cooling rate (Lock 1998)

Height (z)
Virtual potential temp. ($\theta_v$)
Eddy diffusivity (Km) (strength of the mixing)
16
Different mixing schemes
Prognostic turbulent kinetic energy (TKE) scheme
(e.g. SMHI-RCA)
Longwave cooling

Model carries an explicit variable for TKE
Eddy diffusivity parameterized as $K_m \sim \mathrm{TKE}^{1/2}\,\lambda$,
where $\lambda$ is a typical eddy size

TKE generated
Height (z)
TKE transported downwards by turbulence itself
$d\theta_v/dz < 0$
$d\theta_v/dz > 0$
TKE destroyed
Virtual potential temp. ($\theta_v$)
17
Diurnal cycle composite of clouds
Radar and lidar provide cloud boundaries and
cloud properties above site
Most models have a non-local mixing scheme in
unstable conditions and an explicit formulation
for entrainment at cloud top: good performance
over the diurnal cycle

Barrett, Hogan and O'Connor (GRL 2009)
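
Diurnal compositing amounts to averaging a quantity over all samples that share the same hour of day. The sketch below shows the idea under simple assumptions: times are given as whole hours since a midnight start, the function name diurnal_composite is hypothetical, and the sample values are made up.

```python
from collections import defaultdict


def diurnal_composite(hours_since_start, values):
    """Average `values` by hour of day (0-23).

    Assumes `hours_since_start` are integer hours counted from midnight.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, value in zip(hours_since_start, values):
        hour_of_day = hour % 24
        sums[hour_of_day] += value
        counts[hour_of_day] += 1
    return {h: sums[h] / counts[h] for h in sorted(sums)}


# Two made-up days of cloud fraction sampled at 00 and 12 local time
composite = diurnal_composite([0, 12, 24, 36], [0.2, 0.8, 0.4, 0.6])
print(composite)
```

Applying the same function to observed and model cloud fraction at each height gives two composites that can be compared directly.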

18
Summary and future work

One year's evaluation over SGP
  All models underestimate mid- and low-level cloud
  Skill may be robustly quantified using SEDS: less skill in summer
Infrastructure to interface ARM and Cloudnet data has been tested on 1 year of data with cloud fraction and IWC
  So far Met Office, NCEP, ECMWF and Meteo-France can be processed
  Next: implement code at BNL, with other ARM products and models
  Then run on many years of ARM data from multiple sites
  Question: have cloud forecasts improved in 10 years?
Next: apply to SCM-testbed
  Comparisons already demonstrate strong difference in performance of different boundary-layer parameterizations: non-local mixing with explicit entrainment is clearly best
  We have the tools to quantify objectively improvements in both bias and skill with changed parameterizations in SCMs
  Other metrics of performance or compositing methods required?
  Could also forward-model the observations and evaluate in obs space?

19
(No Transcript)
20
Joint PDFs of cloud fraction

Raw (1 hr) resolution
1 year from Murgtal
DWD COSMO model

21
Contingency tables
DWD model, Murgtal

                   Observed cloud          Observed clear-sky
Model cloud        a = 7194 (cloud hit)    b = 4098 (false alarm)
Model clear-sky    c = 4502 (miss)         d = 41062 (clear-sky hit)

For a given set of observed events, there are only 2 degrees of freedom in all possible forecasts (e.g. a and b), because 2 quantities are fixed:
 - Number of events that occurred: n = a + b + c + d
 - Base rate (observed frequency of occurrence): p = (a + c)/n
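
As a worked example, the quantities on this slide can be computed from the Murgtal counts. The function names are hypothetical, and frequency bias is included only as a common companion quantity that the slide itself does not mention.

```python
def table_summary(a, b, c, d):
    """Basic quantities derived from a 2x2 contingency table."""
    n = a + b + c + d
    return {
        "n": n,
        "base_rate": (a + c) / n,            # p = (a + c) / n
        "hit_rate": a / (a + c),
        "false_alarm_rate": b / (b + d),
        "frequency_bias": (a + b) / (a + c),  # not on the slide; added here
    }


def complete_table(a, b, n, n_events):
    """Recover c and d from a and b once n and the event count are fixed."""
    c = n_events - a
    d = n - a - b - c
    return a, b, c, d


murgtal = (7194, 4098, 4502, 41062)
summary = table_summary(*murgtal)
print(summary["n"], round(summary["base_rate"], 3))  # 56856 0.206
print(complete_table(7194, 4098, n=56856, n_events=7194 + 4502))
```

The second function illustrates the two degrees of freedom: with n and the number of observed events held fixed, choosing a and b determines the whole table.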
22
Skill-Bias diagrams
Figure: reality (n = 16, p = 1/4) and forecast grids
23
5 desirable properties of verification measures

Equitable: all random forecasts receive expected score zero
  Constant forecasts of occurrence or non-occurrence also score zero
  Note that forecasting the right cloud climatology versus height but with no other skill should also score zero
Difficult to hedge
  Some measures reward under- or over-prediction
Useful for rare events
  Almost all measures are degenerate in that they asymptote to 0 or 1 for vanishingly rare events
Dependence on full joint PDF, not just 2x2 contingency table
  Difference between cloud fraction of 0.9 and 1 is as important for radiation as a difference between 0 and 0.1
  Difficult to achieve with other desirable properties: won't be studied much today...
Linear: so that can fit an inverse exponential for half-life
  Some measures (e.g. Odds Ratio Skill Score) are very non-linear

24
Hedging: issuing a forecast that differs from
your true belief in order to improve your score
(e.g. Jolliffe 2008)

Hit rate H = a/(a + c)
Fraction of events correctly forecast
Easily hedged by randomly changing some forecasts
of non-occurrence to occurrence
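
The effect of this kind of hedging can be illustrated numerically. The sketch below uses the Murgtal counts from the contingency-table slide and works with the expected table after a random fraction of "no" forecasts is switched to "yes". The helper hedge and the 50% fraction are assumptions for illustration. The Peirce Skill Score, defined later in the presentation as hit rate minus false alarm rate, is shown for contrast.

```python
def hit_rate(a, b, c, d):
    return a / (a + c)


def peirce_skill_score(a, b, c, d):
    # Hit rate minus false alarm rate
    return a / (a + c) - b / (b + d)


def hedge(a, b, c, d, fraction):
    """Expected table after switching a random `fraction` of 'no' forecasts to 'yes'."""
    return (a + fraction * c, b + fraction * d,
            (1 - fraction) * c, (1 - fraction) * d)


original = (7194, 4098, 4502, 41062)
hedged = hedge(*original, fraction=0.5)

print(hit_rate(*original), hit_rate(*hedged))                      # hit rate goes up
print(peirce_skill_score(*original), peirce_skill_score(*hedged))  # PSS goes down
```

In this construction the hit rate rises towards 1 as the fraction grows, while the Peirce Skill Score is scaled by (1 - fraction), so a measure of that kind does not reward the hedge.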

25
Equitability

Defined by Gandin and Murphy (1992)
Requirement 1: An equitable verification measure awards all random forecasting systems, including those that always forecast the same value, the same expected score
  Inequitable measures rank some random forecasts above skillful ones
Requirement 2: An equitable verification measure S must be expressible as the linear weighted sum of the elements of the contingency table, i.e. $S = (S_a a + S_b b + S_c c + S_d d)/n$
  This can safely be discarded: it is incompatible with other desirable properties, e.g. usefulness for rare events
Gandin and Murphy reported that only the Peirce Skill Score and linear transforms of it are equitable by their requirements
  PSS = Hit Rate minus False Alarm Rate = $a/(a+c) - b/(b+d)$
What about all the other measures reported to be equitable?

26
Some reportedly equitable measures
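$\mathrm{HSS} = [x - E(x)] / [n - E(x)]$, with $x = a + d$
$\mathrm{ETS} = [a - E(a)] / [a + b + c - E(a)]$
$E(a) = (a+b)(a+c)/n$ is the expected value of a for an unbiased random forecasting system
$\mathrm{LOR} = \ln[ad/(bc)]$
$\mathrm{ORSS} = [ad/(bc) - 1] / [ad/(bc) + 1]$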
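
The four measures on this slide translate directly into code. This is a minimal sketch with hypothetical function names. E(x) is taken to be E(a) + E(d), with E(d) = (c+d)(b+d)/n by analogy with E(a), which the slide does not spell out. No guard is included for tables with b = 0 or c = 0, where LOR and ORSS are undefined.

```python
from math import log


def expected_hits(a, b, c, d):
    n = a + b + c + d
    return (a + b) * (a + c) / n


def hss(a, b, c, d):
    """Heidke Skill Score, with x = a + d the number of correct forecasts."""
    n = a + b + c + d
    x = a + d
    e_x = ((a + b) * (a + c) + (c + d) * (b + d)) / n  # E(a) + E(d)
    return (x - e_x) / (n - e_x)


def ets(a, b, c, d):
    """Equitable Threat Score."""
    e_a = expected_hits(a, b, c, d)
    return (a - e_a) / (a + b + c - e_a)


def lor(a, b, c, d):
    """Log of odds ratio."""
    return log(a * d / (b * c))


def orss(a, b, c, d):
    """Odds Ratio Skill Score."""
    odds_ratio = a * d / (b * c)
    return (odds_ratio - 1) / (odds_ratio + 1)


murgtal = (7194, 4098, 4502, 41062)
for score in (hss, ets, lor, orss):
    print(score.__name__, score(*murgtal))
```

A table whose cells equal their random expectation, such as (4, 12, 12, 36), scores zero on all four measures.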
27
Skill versus cloud-fraction threshold

Consider 7 models evaluated over 3 European sites
in 2003-2004

LOR
HSS
28
Extreme dependency score

Stephenson et al. (2008) explained this behavior
  Almost all scores have a meaningless limit as base rate $p \to 0$
  HSS tends to zero and LOR tends to infinity
They proposed the Extreme Dependency Score
  $\mathrm{EDS} = 2\ln[(a+c)/n] / \ln(a/n) - 1$, where $n = a + b + c + d$
It can be shown that this score tends to a meaningful limit
  Rewrite in terms of hit rate $H = a/(a+c)$ and base rate $p = (a+c)/n$
  Then assume a power-law dependence of H on p as $p \to 0$: $H \propto p^{\delta}$
  In the limit $p \to 0$ we find $\mathrm{EDS} \to (1-\delta)/(1+\delta)$
This is useful because random forecasts have hit rate converging to zero at the same rate as base rate: $\delta = 1$ so EDS = 0
Perfect forecasts have constant hit rate with base rate: $\delta = 0$ so EDS = 1
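
The limiting argument can be written out step by step. The derivation below assumes the published definition of EDS from Stephenson et al. (2008) and writes the power law as $H = k\,p^{\delta}$, where the constant $k$ is introduced here only for illustration.

```latex
\begin{align*}
\mathrm{EDS} &= \frac{2\ln\left[(a+c)/n\right]}{\ln(a/n)} - 1
  = \frac{2\ln p}{\ln(Hp)} - 1,
  \qquad \text{since } a/n = Hp, \\
H = k\,p^{\delta} \;&\Rightarrow\; \ln(Hp) = (1+\delta)\ln p + \ln k, \\
\lim_{p \to 0} \mathrm{EDS} &= \frac{2}{1+\delta} - 1 = \frac{1-\delta}{1+\delta}.
\end{align*}
```

Setting $\delta = 1$ gives 0 and $\delta = 0$ gives 1, matching the random and perfect cases stated on the slide.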

29
Symmetric extreme dependency score

EDS problems
  Easy to hedge (unless calibrated)
  Not equitable
Solved by defining a symmetric version
  $\mathrm{SEDS} = \{\ln[(a+b)/n] + \ln[(a+c)/n]\} / \ln(a/n) - 1$
All the benefits of EDS, none of the drawbacks!

Hogan, O'Connor and Illingworth (2009 QJRMS)
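
The contrast between the two scores can be seen with a few small tables. The sketch below assumes the published definitions, EDS = 2 ln[(a+c)/n] / ln(a/n) - 1 and SEDS = {ln[(a+b)/n] + ln[(a+c)/n]} / ln(a/n) - 1. The example tables are made up, with n = 64 and base rate 1/4.

```python
from math import log


def eds(a, b, c, d):
    """Extreme Dependency Score (undefined if a == 0)."""
    n = a + b + c + d
    return 2 * log((a + c) / n) / log(a / n) - 1


def seds(a, b, c, d):
    """Symmetric Extreme Dependency Score (undefined if a == 0)."""
    n = a + b + c + d
    return (log((a + b) / n) + log((a + c) / n)) / log(a / n) - 1


perfect = (16, 0, 0, 48)
random_unbiased = (4, 12, 12, 36)   # every cell equals its random expectation
always_yes = (16, 48, 0, 0)         # hedged: forecast occurrence every time

for name, table in [("perfect", perfect),
                    ("random", random_unbiased),
                    ("always yes", always_yes)]:
    print(name, eds(*table), seds(*table))
```

Always forecasting occurrence earns a perfect EDS but a SEDS of zero, which is the hedging weakness that the symmetric form removes.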
30
Skill versus cloud-fraction threshold
SEDS
LOR
HSS
SEDS has much flatter behaviour for all models
(except for Met Office which underestimates high
cloud occurrence significantly)
31
Skill versus height

Most scores not reliable near the tropopause
because cloud fraction tends to zero

LBSS
EDS
LOR
HSS
32
A surprise?

Is mid-level cloud well forecast???
Frequency of occurrence of these clouds is
commonly too low (e.g. from Cloudnet: Illingworth
et al. 2007)
Specification of cloud phase cited as a problem
Higher skill could be because large-scale ascent
has largest amplitude here, so cloud response to
large-scale dynamics most clear at mid levels
Higher skill for Met Office models (global and
mesoscale) because they have the arguably most
sophisticated microphysics, with separate liquid
and ice water content (Wilson and Ballard 1999)?
Low skill for boundary-layer cloud is not a
surprise!
Well known problem for forecasting (Martin et al.
2000)
Occurrence and height a subtle function of
subsidence rate, stability, free-troposphere
humidity, surface fluxes, entrainment rate...

33
Key properties for estimating half-life

We wish to model the score S versus forecast lead time t as $S(t) = S_0\,2^{-t/t_{1/2}}$,
where $S_0$ is the initial score and $t_{1/2}$ is the forecast half-life
We need linearity
  Some measures saturate at the high-skill end (e.g. Yule's Q / ORSS)
  Leads to misleadingly long half-life
...and equitability
  The formula above assumes that score tends to zero for very long forecasts, which will only occur if the measure is equitable

34
Which measures are equitable?

Expected values of a to d for a random forecasting system may score zero
  $S[E(a), E(b), E(c), E(d)] = 0$
But expected score may not be zero!
  $E[S(a,b,c,d)] = \sum P(a,b,c,d)\,S(a,b,c,d)$
Width of random probability distribution decreases for larger sample size n
A measure is only equitable if positive and negative scores cancel
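
The difference between the score of the expected table and the expected score can be checked by brute force for small n. The sketch below enumerates every possible table for an unbiased random forecast, in which forecast and observation are independent and each occurs with probability p, and sums P(a,b,c,d) S(a,b,c,d). ETS is used as the example measure. Scoring undefined (0/0) tables as zero is an assumption of this sketch.

```python
from math import comb


def ets(a, b, c, d):
    n = a + b + c + d
    e_a = (a + b) * (a + c) / n
    denominator = a + b + c - e_a
    # Assumption: tables where ETS is undefined (0/0) are scored as zero
    return 0.0 if denominator == 0 else (a - e_a) / denominator


def expected_score(score, n, p):
    """Sum of P(a,b,c,d) * S(a,b,c,d) over all tables of size n."""
    q = 1.0 - p
    total = 0.0
    for a in range(n + 1):
        for b in range(n + 1 - a):
            for c in range(n + 1 - a - b):
                d = n - a - b - c
                # Multinomial probability with cell probabilities p*p, p*q, q*p, q*q
                ways = comb(n, a) * comb(n - a, b) * comb(n - a - b, c)
                prob = ways * (p * p) ** a * (p * q) ** (b + c) * (q * q) ** d
                total += prob * score(a, b, c, d)
    return total


print(ets(4, 12, 12, 36))               # score of the expected table: 0.0
for n in (2, 10, 40):
    print(n, expected_score(ets, n, 0.5))
```

For n = 2 and p = 1/2 the sum can be done by hand and gives 1/12, so the expected score is not zero even though the expected table scores zero.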

35
Asymptotic equitability

Consider first unbiased forecasts of events that
occur with probability p = ½

36
What about rarer events?

Equitable Threat Score still virtually
equitable for n > 30
ORSS, EDS and SEDS approach zero much more slowly
with n
For events that occur 2% of the time (e.g.
Finley's tornado forecasts), need n > 25,000
before magnitude of expected score is less than
0.01
But these measures are supposed to be useful for
rare events!

37
Possible solutions

Ensure n is large enough that E(a) > 10
Inequitable scores can be scaled to make them
equitable
This opens the way to a new class of non-linear
equitable measures

38
What is the origin of the term ETS?

First use of Equitable Threat Score: Mesinger and Black (1992)
  A modification of the Threat Score, $a/(a+b+c)$
  They cited Gandin and Murphy's equitability requirement that constant forecasts score zero (which ETS does), although it doesn't satisfy the requirement that non-constant random forecasts have expected score 0
  ETS is now one of the most widely used verification measures in meteorology
An example of rediscovery
  Gilbert (1884) discussed $a/(a+b+c)$ as a possible verification measure in the context of Finley's (1884) tornado forecasts
  Gilbert noted deficiencies of this and also proposed exactly the same formula as ETS, 108 years before!
Suggest that ETS is referred to as the Gilbert Skill Score (GSS)
  Or use the Heidke Skill Score, which is unconditionally equitable and is uniquely related to ETS: $\mathrm{ETS} = \mathrm{HSS} / (2 - \mathrm{HSS})$
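
The one-to-one relation between the two scores is easy to confirm numerically. The check below uses the Murgtal counts from the contingency-table slide. The function names are hypothetical, and the formulas are those given on the "Some reportedly equitable measures" slide, with E(x) taken as E(a) + E(d).

```python
def hss(a, b, c, d):
    """Heidke Skill Score."""
    n = a + b + c + d
    e_x = ((a + b) * (a + c) + (c + d) * (b + d)) / n
    return (a + d - e_x) / (n - e_x)


def gss(a, b, c, d):
    """Gilbert Skill Score (the Equitable Threat Score)."""
    n = a + b + c + d
    e_a = (a + b) * (a + c) / n
    return (a - e_a) / (a + b + c - e_a)


table = (7194, 4098, 4502, 41062)
h = hss(*table)
print(gss(*table), h / (2 - h))  # the two values agree to rounding error
```

Because the mapping HSS / (2 - HSS) is monotonic, the two scores always rank a set of forecasts in the same order.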

Hogan, Ferro, Jolliffe and Stephenson (WAF, in
press)
39
Properties of various measures
Measure Equitable Useful for rare events Linear
Peirce Skill Score, PSS Heidke Skill Score, HSS Y N Y
Equitably Transformed SEDS Y Y
Symmetric Extreme Dependency Score, SEDS Y
Log of Odds Ratio, LOR
Odds Ratio Skill Score, ORSS (also known as Yule's Q) N
Gilbert Skill Score, GSS (formerly ETS) N N
Extreme Dependency Score, EDS N Y
Hit rate, H False alarm rate, FAR N N Y
Critical Success Index, CSI N N N

Truly equitable
Asymptotically equitable
Not equitable

40
Skill versus lead time
2004
2007

Only possible for UK Met Office 12-km model and
German DWD 7-km model
Steady decrease of skill with lead time
Both models appear to improve between 2004 and
2007
Generally, UK model best over UK, German best
over Germany
An exception is Murgtal in 2007 (Met Office model
wins)

41
Forecast half life
2004
2007

Fit an inverse-exponential: $S(t) = S_0\,2^{-t/t_{1/2}}$
$S_0$ is the initial score and $t_{1/2}$ is the half-life
Noticeably longer half-life fitted after 36 hours
Same thing found for Met Office rainfall forecast
(Roberts 2008)
First timescale due to data assimilation and
convective events
Second due to more predictable large-scale
weather systems
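
A half-life fit of this kind can be sketched as a straight-line fit in log space. The example assumes the form S(t) = S0 2^(-t/t_half) and uses synthetic scores generated from made-up values of S0 and t_half. It is not the fitting procedure used for the figures, and it requires all scores to be positive.

```python
from math import log2


def fit_half_life(lead_times, scores):
    """Least-squares fit of log2(S) = log2(S0) - t / t_half; returns (S0, t_half)."""
    ys = [log2(s) for s in scores]
    n = len(lead_times)
    mean_t = sum(lead_times) / n
    mean_y = sum(ys) / n
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in zip(lead_times, ys))
             / sum((t - mean_t) ** 2 for t in lead_times))
    intercept = mean_y - slope * mean_t
    return 2.0 ** intercept, -1.0 / slope


# Synthetic scores generated with S0 = 0.6 and a 48-hour half-life (made-up values)
lead_times = [0, 6, 12, 18, 24, 30, 36]
scores = [0.6 * 2.0 ** (-t / 48.0) for t in lead_times]

s0, t_half = fit_half_life(lead_times, scores)
print(round(s0, 3), round(t_half, 1))  # 0.6 48.0
```

To reproduce the two-timescale behavior described on the slide, the same function could be applied separately to lead times before and after 36 hours.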

42
Why is half-life less for clouds than pressure?

Different spatial scales? Convection?
Average temporally before calculating skill
scores
Absolute score and half-life increase with number
of hours averaged

43
Statistics from AMF

Murgtal, Germany, 2007
140-day comparison with Met Office 12-km model
Dataset released to the COPS community
Includes German DWD model at multiple resolutions
and forecast lead times

44
Alternative approach

How valid is it to estimate 3D cloud fraction
from 2D slice?
Henderson and Pincus (2009) imply that it is
reasonable, although presumably not in convective
conditions
Alternative: treat cloud fraction as a
probability forecast
Each time the model forecasts a particular cloud
fraction, calculate the fraction of time that
cloud was observed instantaneously over the site
Leads to a Reliability Diagram
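
The points of such a reliability diagram can be computed by binning the forecasts. The sketch below assumes equal-width bins of forecast cloud fraction and a 0/1 flag for whether cloud was observed over the site at each time. The function name, the number of bins and the sample data are illustrative only.

```python
def reliability_curve(forecast_cf, observed_cloud, n_bins=5):
    """Return (bin centre, observed frequency) for each non-empty forecast bin."""
    counts = [0] * n_bins
    hits = [0] * n_bins
    for forecast, observed in zip(forecast_cf, observed_cloud):
        k = min(int(forecast * n_bins), n_bins - 1)  # a forecast of 1.0 goes in the top bin
        counts[k] += 1
        hits[k] += 1 if observed else 0
    return [((k + 0.5) / n_bins, hits[k] / counts[k])
            for k in range(n_bins) if counts[k] > 0]


# Made-up forecasts of cloud fraction and instantaneous observations (1 = cloud seen)
forecasts = [0.05, 0.10, 0.50, 0.55, 0.95, 0.90]
observed = [0, 0, 1, 0, 1, 1]
for centre, frequency in reliability_curve(forecasts, observed):
    print(centre, frequency)
```

Points lying on the 1:1 line correspond to the "Perfect" case in the diagram; a flat line at the climatological frequency corresponds to "No resolution".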

Perfect
No resolution
No skill
Jakob et al. (2004)
