Loss Functions for Detecting Outliers in Panel Data

1
Loss Functions for Detecting Outliers in Panel
Data

Charles D. Coleman
Thomas Bryan
Jason E. Devine
U.S. Census Bureau

Prepared for the Spring 2000 meetings of the
Federal-State Cooperative Program for Population
Estimates, Los Angeles, CA, March, 2000
2
Panel Data

Also known as longitudinal data.
Observations are written $x_{it}$.
$i$ indexes cross-sectional units that retain their identities over time. Examples: geographic areas, persons, households, companies, autos.
$t$ indexes time, which can be chronological or nominal.
Chronological time measures the time elapsed between two dates.
Nominal time indexes different sets of estimates; it can also index true values.

3
Notation

$B_i$ is the base value for unit $i$.
$F_i$ is the future value for unit $i$.
$F_{it}$ is the future value for unit $i$ at time $t$.
$B_i, F_i, F_{it} > 0$.
$\epsilon_i = |F_i - B_i|$ is the absolute difference for unit $i$.
Subscripts will be dropped when not needed.

4
What is an Outlier?

An outlier is an observation which deviates so
much from other observations as to arouse
suspicions that it was generated by a different
mechanism.
D.M. Hawkins, Identification of Outliers, 1980,
p. 1.

5
Meaning of an Outlier

Either
Indication of a problem with the data generation
process.
Or
A true, but unusual, statement about reality.

6
Loss Functions

Motivations: The $\epsilon_i$ come from unknown distributions. We want to compare multiple size classes on the same basis.
$L(F_i; B_i) \equiv \ell(\epsilon_i, B_i)$ is the loss function for observation $i$.
Loss functions measure badness.
Loss functions produce rankings of observations
to be examined.
Loss functions are empirically based, except for
one special case in nominal time.

7
Assumption 1
Loss is symmetric in error: $L(B + \epsilon; B) = L(B - \epsilon; B)$.
8
Assumption 2
Loss increases in difference: $\partial \ell / \partial \epsilon > 0$.
9
Assumption 3
Loss decreases in base value: $\partial \ell / \partial B < 0$.
10
Property 1
Loss associated with a given absolute percentage difference ($\epsilon / B$) increases in $B$.
11
Simplest Loss Function
$L(F; B) = |F - B| B^q$ (1a)
or
$\ell(\epsilon, B) = \epsilon B^q$ (1b)
with $0 > q > -1$.
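A minimal Python sketch of this loss function, assuming it reads $L(F; B) = |F - B| B^q$ with $-1 < q < 0$. The function name `loss` and the default `q = -0.5` are illustrative choices, not part of the presentation.

```python
def loss(future, base, q=-0.5):
    """Loss |F - B| * B**q for one observation (hypothetical helper)."""
    if base <= 0 or future <= 0:
        raise ValueError("base and future values must be positive")
    if not -1 < q < 0:
        raise ValueError("q must lie strictly between -1 and 0")
    return abs(future - base) * base ** q


# Same 10% difference at two base sizes: the larger base gets the larger loss.
small = loss(110, 100)
large = loss(11000, 10000)

# Same absolute difference of 10 at two base sizes: the larger base gets the smaller loss.
near = loss(110, 100)
far = loss(1010, 1000)
```

With these inputs, `large` exceeds `small` (Property 1) and `far` is below `near` (Assumption 3), which is the behavior the restriction on $q$ is meant to secure.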
12
Loss as Weighted Combination of Absolute
Difference and Absolute Percentage Difference

This generates the loss function with $q = -s/(r + s)$.
An infinite number of pairs $(r, s)$ correspond to any given $q$.
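The combination formula on this slide did not survive in the transcript. As a hedged reconstruction, one form consistent with the stated exponent is a weighted geometric combination of the absolute difference and the absolute percentage difference, with weights $r$ and $s$ assumed positive (an assumption, not confirmed by the source):

```latex
\begin{align*}
\epsilon^{r} \left(\frac{\epsilon}{B}\right)^{s}
  &= \epsilon^{r+s} B^{-s}, \\
\left(\epsilon^{r+s} B^{-s}\right)^{1/(r+s)}
  &= \epsilon\, B^{-s/(r+s)} = \epsilon B^{q},
  \qquad q = -\frac{s}{r+s}.
\end{align*}
```

Raising to the power $1/(r+s)$ is a monotone transformation, so it leaves the ranking of observations unchanged. With $r, s > 0$, $q$ lies strictly between $-1$ and $0$, and scaling both weights by the same positive constant gives the same $q$, which is why many pairs $(r, s)$ map to one $q$.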

13
Outlier Criterion

An outlier is declared whenever
$L(F; B) \equiv \ell(\epsilon, B) > C$.
$C$ is the critical value.
$C$ can be determined in advance, or as a function of the data (e.g., a quantile or a multiple of a scale measure).
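The criterion can be sketched in Python as follows, assuming the loss $|F - B| B^q$. The function name `flag_outliers`, the `(base, future)` pair layout, and the simple empirical quantile rule are all illustrative assumptions; the presentation does not prescribe a particular quantile method.

```python
def flag_outliers(pairs, q=-0.5, critical=None, quantile=0.9):
    """Return (losses, critical, flags) for a list of (base, future) pairs.

    If `critical` is None, it is taken from the data as a simple
    empirical quantile of the losses (an illustrative choice).
    """
    losses = [abs(f - b) * b ** q for b, f in pairs]
    if critical is None:
        ranked = sorted(losses)
        # Position floor(quantile * (n - 1)) in the sorted losses.
        critical = ranked[int(quantile * (len(ranked) - 1))]
    flags = [value > critical for value in losses]
    return losses, critical, flags


# Made-up data for illustration only.
data = [(100, 101), (100, 102), (100, 103), (100, 150), (10000, 10100)]
losses, critical, flags = flag_outliers(data, critical=2.0)
```

Sorting the observations by `losses` in descending order gives the ranking of cases to examine first.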

14
Loss Function Variants

Time-Invariant Loss Function
Signed Loss Function
Nominal Time

15
Time-Invariant Loss Function

Idea: Compare multiple dates of data on the same basis.
Time need not be a round number.
$L(F_{it}; B_i, t) = |F_{it} - B_i| B_i^{tq}$
Property 1 is satisfied as long as $t < -1/q$.
Thus, the useful horizon is limited.
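A small Python sketch of the time-invariant variant, assuming the loss reads $|F_{it} - B_i| B_i^{tq}$ and the horizon condition is $t < -1/q$. The names `max_horizon` and `time_invariant_loss` are hypothetical, and raising an error beyond the horizon is a design choice made here, not something the slide specifies.

```python
def max_horizon(q):
    """Largest t (exclusive) for which Property 1 still holds: -1/q."""
    if not -1 < q < 0:
        raise ValueError("q must lie strictly between -1 and 0")
    return -1 / q


def time_invariant_loss(future, base, t, q=-0.5):
    """Loss |F_it - B_i| * B_i**(t*q) for elapsed time t (fractions allowed)."""
    if t <= 0:
        raise ValueError("t must be positive")
    if t >= max_horizon(q):
        raise ValueError("t is beyond the useful horizon -1/q")
    return abs(future - base) * base ** (t * q)


one_year = time_invariant_loss(110, 100, t=1)
eighteen_months = time_invariant_loss(110, 100, t=1.5)
```

With $q = -0.5$ the horizon is 2, so $t = 1.5$ is accepted while $t = 2$ is rejected. For a base above 1, the same difference accumulated over a longer period receives a smaller loss.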

16
Signed Loss Function

Idea: Account for direction and magnitude of loss.
$S(F; B) = (F - B) B^q$
Can use asymmetric critical values and $q$s.
Declare outliers whenever
$S(F; B) = (F - B) B^{q_+} > C_+$
or
$S(F; B) = (F - B) B^{q_-} < C_-$
where $C_+$ and $C_-$ need not be symmetric and $q_+$ need not equal $q_-$.
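The two-sided rule can be sketched in Python as below, assuming the signed loss $(F - B) B^q$ with one exponent and one critical value for each direction. The names `signed_loss` and `classify`, the return labels, and the numeric thresholds are illustrative assumptions.

```python
def signed_loss(future, base, q):
    """Signed loss (F - B) * B**q; positive for growth, negative for decline."""
    return (future - base) * base ** q


def classify(future, base, q_pos, c_pos, q_neg, c_neg):
    """Return 'high', 'low', or None using separate upper and lower rules."""
    if signed_loss(future, base, q_pos) > c_pos:
        return "high"
    if signed_loss(future, base, q_neg) < c_neg:
        return "low"
    return None


# Illustrative thresholds: declines are flagged sooner than increases.
rules = dict(q_pos=-0.5, c_pos=2.0, q_neg=-0.5, c_neg=-1.0)
up = classify(115, 100, **rules)
down = classify(85, 100, **rules)
```

Here an increase of 15 on a base of 100 is not flagged, while a decrease of the same size is, which is the asymmetry the separate critical values allow.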

17
Nominal Time

Compare two sets of estimates; one set can be the actual values, $A_i$.
Assumptions:
Unbiased: $E[B_i] = E[F_i] = A_i$.
Proportionate variance: $\mathrm{Var}(B_i) = \mathrm{Var}(F_i) = \sigma^2 A_i$.
$q = -1/2$.
Either set of estimates can be used for $B_i$, $F_i$.
Exception: $A_i$ can only be substituted for $B_i$.
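A short sketch of why these assumptions point to $q = -1/2$. It additionally assumes that the two sets of estimates are independent, which the slide does not state.

```latex
\begin{align*}
\operatorname{Var}(F_i - B_i)
  &= \operatorname{Var}(F_i) + \operatorname{Var}(B_i) = 2\sigma^2 A_i, \\
\operatorname{Var}\!\left(\frac{F_i - B_i}{\sqrt{A_i}}\right)
  &= 2\sigma^2 .
\end{align*}
```

Under that assumption the scaled difference has the same variance for every unit, so units of different sizes can be compared on one scale. Replacing the unknown $A_i$ with $B_i$, which is unbiased for $A_i$, gives $|F_i - B_i| B_i^{-1/2}$, that is, the simplest loss function with $q = -1/2$.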

18
How to Use No Preexisting Outlier Criteria

Start with $q = -0.5$.
Adjust by increments of 0.1 to get a good distribution of outliers.
Alternative: Start with $q = \log(\text{range})/25 - 1$, where range is the range of the data (Bryan, 1999).
Can adjust.
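A rough Python sketch of both starting rules, assuming the alternative reads $q = \log(\text{range})/25 - 1$ and that "range" means maximum minus minimum. The slide does not state the base of the logarithm, so the hypothetical `starting_q` takes it as a required argument instead of guessing. `candidate_qs` is likewise an illustrative helper for the 0.1 adjustment step.

```python
import math


def starting_q(values, log_base):
    """Starting exponent log(range)/25 - 1; the log base must be supplied."""
    data_range = max(values) - min(values)
    if data_range <= 0:
        raise ValueError("data must have a positive range")
    return math.log(data_range, log_base) / 25 - 1


def candidate_qs(start, step=0.1):
    """Exponents one step either side of `start`, kept inside (-1, 0)."""
    candidates = [round(start + k * step, 10) for k in (-1, 0, 1)]
    return [q for q in candidates if -1 < q < 0]


neighbours = candidate_qs(-0.5)
```

An analyst would compute the losses under each candidate exponent, look at which observations get flagged, and keep the exponent that gives a sensible spread of outliers across size classes.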

19
How to Use Preexisting Discrete Outlier Criteria

Start with a schedule of critical pairs $(\epsilon_j, B_j)$.
These pairs (approximately) satisfy the equation $\epsilon B^q = C$ for some $q$ and $C$. They are the cutoffs between outliers and nonoutliers.
Run the regression
$\log \epsilon_j = -q \log B_j + K$
Then, $C = e^K$.
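The regression step can be sketched in plain Python, assuming the model $\log \epsilon_j = -q \log B_j + K$ with natural logarithms so that $C = e^K$. The function name `fit_q_and_c` is hypothetical, and the schedule below is made-up data generated from $q = -0.5$ and $C = 100$ purely to show that the fit recovers them.

```python
import math


def fit_q_and_c(pairs):
    """Fit q and C from critical pairs (epsilon_j, B_j) by ordinary least squares."""
    if len(pairs) < 2:
        raise ValueError("need at least two critical pairs")
    xs = [math.log(b) for _, b in pairs]
    ys = [math.log(e) for e, _ in pairs]
    n = len(pairs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("base values must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # estimates -q
    intercept = y_bar - slope * x_bar  # estimates K
    return -slope, math.exp(intercept)


# Illustrative schedule: epsilon = 100 * B**0.5, i.e. q = -0.5 and C = 100.
schedule = [(100 * b ** 0.5, b) for b in (100, 1000, 10000, 100000)]
q_hat, c_hat = fit_q_and_c(schedule)
```

With a real schedule the pairs will only approximately follow the power law, so the fitted $q$ and $C$ should be checked against the original cutoffs before use.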

20
Loss Functions and GIS

Loss functions can be used with GIS to focus analysts' attention on problem areas.
Maps compare tax method county population estimates to unconstrained housing unit method estimates.
$q = -0.5$ in the loss function map.

21
Absolute Differences between the Population
Estimates
22
Percent Absolute Differences between the
Population Estimates
23
Loss Function Values
24
Outliers Classified by Another Variable