Loss Functions for Detecting Outliers in Panel Data

1
Loss Functions for Detecting Outliers in Panel
Data
  • Charles D. Coleman
  • Thomas Bryan
  • Jason E. Devine
  • U.S. Census Bureau

Prepared for the Spring 2000 meetings of the
Federal-State Cooperative Program for Population
Estimates, Los Angeles, CA, March, 2000
2
Panel Data
  • A.k.a. longitudinal data.
  • $x_{it}$
  • i indexes cross-sectional units that retain their
    identities over time. Examples: geographic areas,
    persons, households, companies, autos.
  • t indexes time.
  • Chronological or nominal.
  • Chronological time measures time elapsed between
    two dates.
  • Nominal time indexes different sets of estimates,
    can also index true values.

3
Notation
  • $B_i$ is the base value for unit i.
  • $F_i$ is the future value for unit i.
  • $F_{it}$ is the future value for unit i at time t.
  • $B_i, F_i, F_{it} > 0$.
  • $\epsilon_i = |F_i - B_i|$ is the absolute difference for unit i.
  • Subscripts will be dropped when not needed.

4
What is an Outlier?
  • An outlier is an observation which deviates so
    much from other observations as to arouse
    suspicions that it was generated by a different
    mechanism.
  • D.M. Hawkins, Identification of Outliers, 1980,
    p. 1.

5
Meaning of an Outlier
  • Either
  • Indication of a problem with the data generation
    process.
  • Or
  • A true, but unusual, statement about reality.

6
Loss Functions
  • Motivations: The $\epsilon_i$ come from unknown
    distributions. We want to compare multiple size
    classes on the same basis.
  • $L(F_i; B_i) \equiv \ell(\epsilon_i, B_i)$ is the loss function for
    observation i.
  • Loss functions measure badness.
  • Loss functions produce rankings of observations
    to be examined.
  • Loss functions are empirically based, except for
    one special case in nominal time.

7
Assumption 1
Loss is symmetric in error: $L(B + \epsilon; B) = L(B - \epsilon; B)$
8
Assumption 2
Loss increases in difference: $\partial \ell / \partial \epsilon > 0$
9
Assumption 3
Loss decreases in base value: $\partial \ell / \partial B < 0$
10
Property 1
Loss associated with a given absolute percentage
difference ($\epsilon / B$) increases in B.
11
Simplest Loss Function
$L(F; B) = |F - B| B^q$ (1a) or $\ell(\epsilon, B) = \epsilon B^q$ (1b),
with $0 > q > -1$.
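
A minimal Python sketch of equation (1a), assuming q lies strictly between -1 and 0 (the range that Assumption 3 and Property 1 jointly require). The function name `loss` and the sample values are illustrative, not taken from the presentation.

```python
def loss(future, base, q=-0.5):
    """Loss |F - B| * B**q for one observation; the base must be positive."""
    if base <= 0:
        raise ValueError("base value must be positive")
    return abs(future - base) * base ** q

# Assumption 3: the same absolute difference (100) costs less on a larger base.
same_diff_small_base = loss(1_100, 1_000)
same_diff_large_base = loss(100_100, 100_000)

# Property 1: the same percentage difference (10%) costs more on a larger base.
same_pct_small_base = loss(1_100, 1_000)
same_pct_large_base = loss(110_000, 100_000)
```

With q = -0.5 the loss is the absolute difference divided by the square root of the base, which sits between ranking by absolute difference (q = 0) and ranking by percentage difference (q = -1).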
12
Loss as Weighted Combination of Absolute
Difference and Absolute Percentage Difference
  • This generates a loss function with $q = -s/(r + s)$.
  • An infinite number of pairs (r, s) corresponds to
    any given q.
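
The equation on this slide did not survive in the transcript. One reading that reproduces the stated exponent, assuming positive weights r and s applied as exponents to the absolute difference and the absolute percentage difference (the geometric form is an assumption), is:

```latex
\[
\epsilon^{r}\left(\frac{\epsilon}{B}\right)^{s}
  = \epsilon^{r+s} B^{-s},
\qquad
\left(\epsilon^{r+s} B^{-s}\right)^{1/(r+s)}
  = \epsilon\, B^{-s/(r+s)}
  = \epsilon\, B^{q},
\qquad
q = -\frac{s}{r+s}.
\]
```

Taking the (r + s)-th root is a monotone transformation, so it leaves the ranking of observations unchanged. For example, (r, s) = (1, 1) and (r, s) = (2, 2) both give q = -1/2.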

13
Outlier Criterion
  • Outlier declared whenever
  • $L(F; B) \equiv \ell(\epsilon, B) > C$
  • C is critical value.
  • C can be determined in advance, or as function of
    data (e.g., quantile or multiple of scale
    measure).
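
The criterion can be sketched as follows, assuming the critical value is taken as a multiple of the median loss (one possible "multiple of scale measure"). The function name, the multiplier `k`, and the data are hypothetical.

```python
from statistics import median

def flag_outliers(future, base, q=-0.5, k=5.0):
    """Return (losses, critical value, flags) with C = k * median loss."""
    losses = [abs(f - b) * b ** q for f, b in zip(future, base)]
    critical = k * median(losses)
    flags = [value > critical for value in losses]
    return losses, critical, flags

# Hypothetical base and future values for five areas.
base = [1_000, 5_000, 20_000, 100_000, 500_000]
future = [1_050, 5_100, 26_000, 101_000, 503_000]
losses, critical, is_outlier = flag_outliers(future, base)
```

Here only the third area, whose value moved from 20,000 to 26,000, exceeds the critical value. A fixed C chosen in advance would replace the `critical` line.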

14
Loss Function Variants
  • Time-Invariant Loss Function
  • Signed Loss Function
  • Nominal Time

15
Time-Invariant Loss Function
  • Idea: Compare multiple dates of data on the same
    basis.
  • Time need not be a round number.
  • $L(F_{it}; B_i, t) = |F_{it} - B_i| B_i^{tq}$
  • Property 1 satisfied as long as $t < -1/q$.
  • Thus, useful horizon is limited.
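
A short sketch of the horizon limit, assuming the time-invariant loss takes the form |F - B| * B**(t*q) with q between -1 and 0, so that Property 1 holds only while t < -1/q. Function names and numbers are illustrative.

```python
def time_invariant_loss(future, base, t, q=-0.25):
    """Loss |F_it - B_i| * B_i**(t*q); the base must be positive."""
    if base <= 0:
        raise ValueError("base value must be positive")
    return abs(future - base) * base ** (t * q)

def max_horizon(q):
    """Largest t (exclusive) for which Property 1 still holds: -1/q."""
    return -1.0 / q

horizon = max_horizon(-0.25)  # 4.0 time units for q = -0.25

# A 10% difference on a small and a large base, inside the horizon (t = 2) ...
inside_small = time_invariant_loss(1_100, 1_000, t=2)
inside_large = time_invariant_loss(110_000, 100_000, t=2)
# ... and beyond it (t = 5), where the ordering reverses.
beyond_small = time_invariant_loss(1_100, 1_000, t=5)
beyond_large = time_invariant_loss(110_000, 100_000, t=5)
```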

16
Signed Loss Function
  • Idea: Account for direction and magnitude of
    loss.
  • $S(F; B) = (F - B) B^q$
  • Can use asymmetric critical values and values of q.
  • Declare outliers whenever
  • $S(F; B) = (F - B) B^{q_+} > C_+$
  • or
  • $S(F; B) = (F - B) B^{q_-} < C_-$
  • with $C_+ \neq -C_-$, $q_+ \neq q_-$.
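
A minimal sketch of the asymmetric test, assuming a positive upper critical value and a negative lower one, each with its own exponent. The parameter names `c_pos`, `c_neg`, `q_pos`, and `q_neg` are hypothetical labels for the two critical values and the two values of q.

```python
def signed_loss(future, base, q=-0.5):
    """Signed loss (F - B) * B**q; positive for growth, negative for decline."""
    return (future - base) * base ** q

def is_signed_outlier(future, base, c_pos, c_neg, q_pos=-0.5, q_neg=-0.5):
    """Flag increases above c_pos (> 0) or decreases below c_neg (< 0)."""
    if future >= base:
        return signed_loss(future, base, q_pos) > c_pos
    return signed_loss(future, base, q_neg) < c_neg

# Tolerate more growth than decline: c_pos = 5, c_neg = -3.
big_gain = is_signed_outlier(1_200, 1_000, c_pos=5, c_neg=-3)
small_gain = is_signed_outlier(1_100, 1_000, c_pos=5, c_neg=-3)
small_drop = is_signed_outlier(900, 1_000, c_pos=5, c_neg=-3)
```

With these thresholds a gain of 100 on a base of 1,000 passes, while a drop of the same size is flagged.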

17
Nominal Time
  • Compare 2 sets of estimates; one set can be
    actual values, $A_i$.
  • Assumptions:
  • Unbiased: $E[B_i] = E[F_i] = A_i$.
  • Proportionate variance: $\mathrm{Var}(B_i) = \mathrm{Var}(F_i) = \sigma^2 A_i$.
  • $q = -1/2$.
  • Either set of estimates can be used for $B_i$, $F_i$.
  • Exception: $A_i$ can only be substituted for $B_i$.
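
One way to see why these assumptions lead to an exponent of -1/2, under the added assumption (not stated on the slide) that the two sets of estimates are independent:

```latex
\[
\mathrm{Var}(F_i - B_i) = \mathrm{Var}(F_i) + \mathrm{Var}(B_i) = 2\sigma^2 A_i
\quad\Longrightarrow\quad
\mathrm{Var}\!\left(\frac{F_i - B_i}{\sqrt{A_i}}\right) = 2\sigma^2 .
\]
```

The scaled difference has the same variance for every unit regardless of size. Replacing the unknown $A_i$ with $B_i$ gives $|F_i - B_i| B_i^{-1/2}$, which is the loss function with $q = -1/2$.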

18
How to Use: No Preexisting Outlier Criteria
  • Start with $q = -0.5$.
  • Adjust by increments of 0.1 to get a good
    distribution of outliers.
  • Alternative: Start with
  • $q = \log(\text{range})/25 - 1$, where range is the range of
    the data. (Bryan, 1999)
  • Can adjust.

19
How to Use: Preexisting Discrete Outlier Criteria
  • Start with a schedule of critical pairs $(\epsilon_j, B_j)$.
  • These pairs (approximately) satisfy the equation
    $\epsilon B^q = C$ for some q and C. They are the cutoffs
    between outliers and nonoutliers.
  • Run regression
  • $\log \epsilon_j = -q \log B_j + K$
  • Then, $C = e^K$.
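
The regression step can be sketched with ordinary least squares on the logged pairs. The schedule below is hypothetical, built so that the cutoffs follow a loss function with q = -0.5 and C = 3 exactly; a real schedule would fit only approximately.

```python
import math

def fit_loss_parameters(pairs):
    """Fit log(eps) = -q * log(B) + K by least squares; return (q, C)."""
    xs = [math.log(base) for _, base in pairs]
    ys = [math.log(eps) for eps, _ in pairs]
    n = len(pairs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # estimates -q
    intercept = y_bar - slope * x_bar  # estimates K = log C
    return -slope, math.exp(intercept)

# Hypothetical cutoffs (eps_j, B_j): eps_j = 3 * sqrt(B_j).
schedule = [(3.0 * b ** 0.5, b) for b in (1_000, 10_000, 100_000, 1_000_000)]
q_hat, c_hat = fit_loss_parameters(schedule)
```

The slope of the fitted line is the negative of q, so the sign flip in the return value matters.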

20
Loss Functions and GIS
  • Loss functions can be used with GIS to focus
    analysts' attention on problem areas.
  • Maps compare tax method county population
    estimates to unconstrained housing unit method
    estimates.
  • $q = -0.5$ in loss function map.

21
Absolute Differences between the Population
Estimates
22
Percent Absolute Differences between the
Population Estimates
23
Loss Function Values
24
Outliers Classified by Another Variable
  • $D_i$ is a function of 2 successive observations.
  • $R_i$ is the reference variable, used to classify
    outliers.
  • Start with a schedule of critical pairs $(D_j, R_j)$.
  • Run regression
  • $\log D_j = a - b \log R_j$
  • Then, $L(D, R) = D R^b$ and $C = e^a$.

25
What to Do with Negative Data
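  • From Coleman and Bryan (2000):
  • $L(F, B) = |F - B| (|F| + |B|)^q$ if $B \neq 0$ or $F \neq 0$,
  • $L(F, B) = 0$ if $B = F = 0$.
  • $S(F, B) = (F - B)(|F| + |B|)^q$ if $B \neq 0$ or $F \neq 0$,
  • $S(F, B) = 0$ if $B = F = 0$.
  • $0 > q > -1$. Suggest $q \approx -0.5$.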
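
A sketch of the two functions for data that may be zero or negative, assuming the scale term is the sum of the absolute values |F| + |B| (a reading of the garbled formula, not confirmed by the transcript). The function names are illustrative.

```python
def loss_any_sign(future, base, q=-0.5):
    """|F - B| * (|F| + |B|)**q, defined as 0 when both values are 0."""
    scale = abs(future) + abs(base)
    if scale == 0:
        return 0.0
    return abs(future - base) * scale ** q

def signed_loss_any_sign(future, base, q=-0.5):
    """(F - B) * (|F| + |B|)**q, defined as 0 when both values are 0."""
    scale = abs(future) + abs(base)
    if scale == 0:
        return 0.0
    return (future - base) * scale ** q

both_zero = loss_any_sign(0, 0)
sign_flip = loss_any_sign(-50, 50)           # value changes sign
sign_flip_signed = signed_loss_any_sign(-50, 50)
```

Unlike equation (1a), this form treats F and B symmetrically, so swapping them leaves the unsigned loss unchanged.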

26
Summary
  • Defined panel data.
  • Defined outliers.
  • Created several types of loss functions to detect
    outliers in panel data.
  • Loss functions are empirical (except for nominal
    time).
  • Showed several applications, including GIS.

27
URL for Presentation
  • http://chuckcoleman.home.dhs.org/fscpela.ppt