Title: Loss Functions for Detecting Outliers in Panel Data
1. Loss Functions for Detecting Outliers in Panel Data
- Charles D. Coleman
- Thomas Bryan
- Jason E. Devine
- U.S. Census Bureau
Prepared for the Spring 2000 meetings of the
Federal-State Cooperative Program for Population
Estimates, Los Angeles, CA, March, 2000
2. Panel Data
- A.k.a. longitudinal data.
- $x_{it}$
- i indexes cross-sectional units that retain their identities over time. Examples: geographic areas, persons, households, companies, autos.
- t indexes time.
- Time can be chronological or nominal.
- Chronological time measures the time elapsed between two dates.
- Nominal time indexes different sets of estimates; it can also index true values.
3. Notation
- $B_i$ is the base value for unit i.
- $F_i$ is the future value for unit i.
- $F_{it}$ is the future value for unit i at time t.
- $B_i, F_i, F_{it} > 0$.
- $\epsilon_i = |F_i - B_i|$ is the absolute difference for unit i.
- Subscripts will be dropped when not needed.
4. What is an Outlier?
- "An outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism."
- D.M. Hawkins, Identification of Outliers, 1980, p. 1.
5. Meaning of an Outlier
- Either
- Indication of a problem with the data generation process.
- Or
- A true, but unusual, statement about reality.
6. Loss Functions
- Motivations: The absolute differences $\epsilon_i = |F_i - B_i|$ come from unknown distributions. Want to compare multiple size classes on the same basis.
- $L(F_i; B_i) \equiv \ell(\epsilon_i, B_i)$ is the loss function for observation i.
- Loss functions measure badness.
- Loss functions produce rankings of observations to be examined.
- Loss functions are empirically based, except for one special case in nominal time.
7. Assumption 1
Loss is symmetric in error: $L(B + \epsilon; B) = L(B - \epsilon; B)$.
8. Assumption 2
Loss increases in difference: $\partial\ell/\partial\epsilon > 0$.
9. Assumption 3
Loss decreases in base value: $\partial\ell/\partial B < 0$.
10. Property 1
Loss associated with a given absolute percentage difference ($\epsilon / B$) increases in B.
11. Simplest Loss Function
$L(F; B) = |F - B| B^q$ (1a) or $\ell(\epsilon, B) = \epsilon B^q$ (1b), with $0 > q > -1$.
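As a minimal sketch of equation (1a), the Python function below (the name `loss` and all input values are made up for illustration) checks numerically how a negative q between -1 and 0 behaves: equal absolute differences count for less at a larger base, while equal percentage differences count for more.

```python
def loss(future, base, q=-0.5):
    """Simplest loss function (1a): |F - B| * B**q, with -1 < q < 0."""
    if base <= 0 or future <= 0:
        raise ValueError("base and future values must be positive")
    if not -1.0 < q < 0.0:
        raise ValueError("q must lie strictly between -1 and 0")
    return abs(future - base) * base ** q


# Same absolute difference (100) at a small and a large base:
# the loss is smaller at the larger base (Assumption 3).
abs_small_base = loss(1_100, 1_000)
abs_large_base = loss(100_100, 100_000)

# Same percentage difference (10%) at a small and a large base:
# the loss is larger at the larger base (Property 1).
pct_small_base = loss(1_100, 1_000)
pct_large_base = loss(110_000, 100_000)
```

Errors of the same size above and below the base give the same loss, because only the absolute difference enters the formula (Assumption 1).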
12. Loss as Weighted Combination of Absolute Difference and Absolute Percentage Difference
- This generates a loss function with $q = -s/(r + s)$.
- An infinite number of pairs (r, s) correspond to any given q.
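The combining formula itself did not survive in this text. One form consistent with the stated result, offered here as an assumption rather than as the authors' definition, is a geometric weighting with exponent r on the absolute difference and exponent s on the absolute percentage difference:

```latex
\[
  \epsilon^{r} \left( \frac{\epsilon}{B} \right)^{s}
  = \epsilon^{r+s} B^{-s}
  = \left( \epsilon \, B^{-s/(r+s)} \right)^{r+s}.
\]
```

Under that assumption, for r + s > 0 the outer power does not change the ranking of observations, so the combination ranks them exactly as $\epsilon B^q$ does with $q = -s/(r+s)$. Multiplying r and s by the same positive constant leaves q unchanged, which is why infinitely many pairs give the same q.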
13. Outlier Criterion
- Outlier declared whenever
- $L(F; B) \equiv \ell(\epsilon, B) > C$
- C is the critical value.
- C can be determined in advance, or as a function of the data (e.g., a quantile or a multiple of a scale measure).
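The criterion can be sketched in Python as follows. The function names, the data, the fixed critical value of 10, and the multiplier of 5 on the median loss are all hypothetical choices made for illustration, not values from the presentation.

```python
import statistics


def loss(future, base, q=-0.5):
    """Loss |F - B| * B**q for positive base values."""
    return abs(future - base) * base ** q


def flag_outliers(pairs, critical_value, q=-0.5):
    """Return indices of (base, future) pairs whose loss exceeds C."""
    return [i for i, (b, f) in enumerate(pairs)
            if loss(f, b, q) > critical_value]


# Made-up (base, future) pairs for illustration only.
pairs = [(1_000, 1_020), (5_000, 5_100), (20_000, 20_300),
         (50_000, 50_400), (80_000, 95_000)]

# Option 1: critical value fixed in advance.
fixed = flag_outliers(pairs, critical_value=10.0)

# Option 2: critical value taken from the data, here a multiple of the
# median loss, used as a simple scale measure.
losses = [loss(f, b) for b, f in pairs]
data_driven = flag_outliers(pairs, critical_value=5 * statistics.median(losses))
```

With these numbers both rules flag only the last pair, whose change of 15,000 on a base of 80,000 dominates the others.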
14. Loss Function Variants
- Time-Invariant Loss Function
- Signed Loss Function
- Nominal Time
15. Time-Invariant Loss Function
- Idea: Compare multiple dates of data on the same basis.
- Time need not be a round number.
- $L(F_{it}; B_i, t) = |F_{it} - B_i| B_i^{tq}$
- Property 1 is satisfied as long as $t < -1/q$.
- Thus, the useful horizon is limited.
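A minimal Python sketch of the time-invariant form, assuming the exponent on the base is t times q; the function names are hypothetical. It also makes the horizon limit explicit: with q = -0.5, t must stay below 2.

```python
def max_horizon(q):
    """Upper bound on t implied by Property 1: t < -1/q."""
    return -1.0 / q


def time_invariant_loss(future, base, t, q=-0.5):
    """Loss |F_it - B_i| * B_i**(t*q), rejected once t reaches -1/q."""
    if not -1.0 < q < 0.0:
        raise ValueError("q must lie strictly between -1 and 0")
    if t >= max_horizon(q):
        raise ValueError("t must be below -1/q for Property 1 to hold")
    return abs(future - base) * base ** (t * q)


# Made-up example: a 10% difference after 1.5 time units.
small_base = time_invariant_loss(1_100, 1_000, t=1.5)
large_base = time_invariant_loss(110_000, 100_000, t=1.5)
```

Because t * q stays above -1 here, the same percentage difference still produces a larger loss at the larger base.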
16. Signed Loss Function
- Idea: Account for direction and magnitude of loss.
- $S(F; B) = (F - B) B^q$
- Can use asymmetric critical values and q's.
- Declare outliers whenever
- $S(F; B) = (F - B) B^{q_+} > C_+$
- or
- $S(F; B) = (F - B) B^{q_-} < C_-$
- with $C_+ \neq C_-$, $q_+ \neq q_-$.
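The two-sided rule can be sketched in Python as below. The function names and the four default settings are hypothetical, chosen only to show an asymmetric setup in which decreases must be larger than increases before they are flagged.

```python
def signed_loss(future, base, q):
    """Signed loss (F - B) * B**q; the sign gives the direction."""
    return (future - base) * base ** q


def is_outlier(future, base, q_pos=-0.5, c_pos=10.0, q_neg=-0.4, c_neg=-15.0):
    """Flag an observation if either one-sided criterion is met.

    The upper test uses (q_pos, c_pos); the lower test uses (q_neg, c_neg).
    All four defaults are illustrative assumptions.
    """
    too_high = signed_loss(future, base, q_pos) > c_pos
    too_low = signed_loss(future, base, q_neg) < c_neg
    return too_high or too_low
```

With a positive upper critical value and a negative lower one, at most one of the two tests can fire for any observation, so the direction of a flagged outlier is unambiguous.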
17. Nominal Time
- Compare 2 sets of estimates; one set can be the actual values, $A_i$.
- Assumptions:
- Unbiased: $E[B_i] = E[F_i] = A_i$.
- Proportionate variance: $\mathrm{Var}(B_i) = \mathrm{Var}(F_i) = \sigma^2 A_i$.
- $q = -1/2$.
- Either set of estimates can be used for $B_i$, $F_i$.
- Exception: $A_i$ can only be substituted for $B_i$.
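A short derivation suggests why these assumptions lead to an exponent of one half in magnitude. It adds one assumption not stated on the slide, namely that the two sets of estimates are independent:

```latex
\[
  \mathrm{Var}(F_i - B_i)
  = \mathrm{Var}(F_i) + \mathrm{Var}(B_i)
  = 2\sigma^2 A_i ,
  \qquad\text{so}\qquad
  \mathrm{Var}\!\left( (F_i - B_i)\, A_i^{-1/2} \right) = 2\sigma^2 .
\]
```

Under that added assumption, scaling the difference by $A_i^{-1/2}$ gives every unit the same variance regardless of size, and the loss function approximates this by using $B_i$ in place of the usually unknown $A_i$. When the actual values are available and used as the base, the same scaling gives a constant variance of $\sigma^2$.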
18. How to Use: No Preexisting Outlier Criteria
- Start with $q = -0.5$.
- Adjust by increments of 0.1 to get a good distribution of outliers.
- Alternative: Start with
- $q = \log(\text{range})/25 - 1$, where range is the range of the data. (Bryan, 1999)
- Can adjust.
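The alternative starting value can be sketched in Python as below. The base of the logarithm is not recoverable from this text, so the hypothetical function `bryan_start_q` requires the caller to state it rather than assuming one.

```python
import math


def bryan_start_q(data_range, log_base):
    """Starting value q = log(range)/25 - 1.

    The base of the logarithm is an explicit argument because the
    source text does not state it.
    """
    if data_range <= 0:
        raise ValueError("range of the data must be positive")
    return math.log(data_range, log_base) / 25 - 1


# Made-up example: data spanning a range of one million.
q_base10 = bryan_start_q(1_000_000, log_base=10)
q_natural = bryan_start_q(1_000_000, log_base=math.e)
```

For this range both choices of base give a starting value inside the interval from -1 to 0, though the two values differ noticeably, so the choice of base matters in practice.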
19. How to Use: Preexisting Discrete Outlier Criteria
- Start with a schedule of critical pairs $(\epsilon_j, B_j)$.
- These pairs (approximately) satisfy the equation $\epsilon B^q = C$ for some q and C. They are the cutoffs between outliers and nonoutliers.
- Run the regression
- $\log \epsilon_j = -q \log B_j + K$
- Then, $C = e^K$.
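The regression step can be sketched in plain Python with ordinary least squares on the logged pairs. The function name and the schedule are hypothetical; the schedule is constructed to lie exactly on a known curve so the recovered values can be checked.

```python
import math


def fit_q_and_c(critical_pairs):
    """Fit log(eps) = -q*log(B) + K by ordinary least squares.

    critical_pairs: list of (eps_j, B_j) with positive entries.
    Returns (q, C), where C = exp(K).
    """
    xs = [math.log(b) for _, b in critical_pairs]
    ys = [math.log(e) for e, _ in critical_pairs]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    # The regression slope estimates -q; the intercept estimates K.
    return -slope, math.exp(intercept)


# Hypothetical schedule built to satisfy eps * B**(-0.5) = 3 exactly.
schedule = [(3 * b ** 0.5, b) for b in (100, 1_000, 10_000, 100_000)]
q_hat, c_hat = fit_q_and_c(schedule)
```

On this constructed schedule the fit recovers q near -0.5 and C near 3. With a real schedule the pairs satisfy the equation only approximately, so the fitted curve is a smoothed version of the discrete cutoffs.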
20. Loss Functions and GIS
- Loss functions can be used with GIS to focus analysts' attention on problem areas.
- Maps compare tax method county population estimates to unconstrained housing unit method estimates.
- $q = -0.5$ in the loss function map.
21. Absolute Differences between the Population Estimates
22. Percent Absolute Differences between the Population Estimates
23. Loss Function Values
24. Outliers Classified by Another Variable
- $D_i$ is a function of 2 successive observations.
- $R_i$ is the reference variable, used to classify outliers.
- Start with a schedule of critical pairs $(D_j, R_j)$.
- Run the regression
- $\log D_j = a - b \log R_j$
- Then, $L(D, R) = D R^b$ and $C = e^a$.
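25. What to Do with Negative Data
- From Coleman and Bryan (2000):
- $L(F, B) = |F - B| (|F| + |B|)^q$ if $B \neq 0$ or $F \neq 0$,
- $L(F, B) = 0$ if $B = F = 0$.
- $S(F, B) = (F - B)(|F| + |B|)^q$ if $B \neq 0$ or $F \neq 0$,
- $S(F, B) = 0$ if $B = F = 0$.
- $0 > q > -1$. Suggest $q \approx -0.5$.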
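A minimal Python sketch of these two functions, assuming the scale term is the sum of the absolute values of F and B; the function names and inputs are hypothetical.

```python
def loss_any_sign(future, base, q=-0.5):
    """|F - B| * (|F| + |B|)**q, defined as 0 when F = B = 0."""
    scale = abs(future) + abs(base)
    if scale == 0:
        return 0.0
    return abs(future - base) * scale ** q


def signed_loss_any_sign(future, base, q=-0.5):
    """(F - B) * (|F| + |B|)**q, defined as 0 when F = B = 0."""
    scale = abs(future) + abs(base)
    if scale == 0:
        return 0.0
    return (future - base) * scale ** q


# Made-up example: a value that moves from +50 to -50.
unsigned = loss_any_sign(-50, 50)
signed = signed_loss_any_sign(-50, 50)
```

Because the scale term treats F and B alike, the unsigned loss is the same whichever of the two values is taken as the base, and a zero or negative base no longer causes a problem.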
26. Summary
- Defined panel data.
- Defined outliers.
- Created several types of loss functions to detect outliers in panel data.
- Loss functions are empirical (except for nominal time).
- Showed several applications, including GIS.
27. URL for Presentation
- http://chuckcoleman.home.dhs.org/fscpela.ppt