Learn more at: https://www.eia.gov

1
Statistical Frame Development
  • Michael Griffey
  • Energy Information Administration
  • January 2000

2
(No Transcript)
3
Petroleum Marketing Surveys
  • 2 weekly retail price surveys.
  • 1 semi-monthly heating fuels survey.
  • 6 monthly crude oil and petroleum product price
    and supply surveys.
  • 1 annual fuel oil sales survey.
  • 1 quadrennial frame survey.

4
Supplier Frame - Examples
  • 260 refineries/blending plants.
  • 220 importers per month
  • 320 bulk terminals
  • 7,000 home heating oil dealers
  • 60,000 petroleum dealers
  • 180,000 gasoline retailers

5
Consumption Surveys
  • Collect data from the end-user/consumer of the
    product.
  • Have large frames/sample sizes. MECS: 200,000
    manufacturers, sample of 18,000 establishments.
  • Provide detailed information about consumer
    habits.
  • Conducted every 4 years.

6
Supplier Surveys
  • Collect information from suppliers, e.g.
    refiners, importers, retailers, etc.
  • Have smaller frames, sample sizes.
  • Provide information on volumes supplied/sold.
  • Conducted weekly, monthly, and annually.

7
WHAT IS A FRAME?
  • A listing of all possible sampling units along
    with basic information about the units

8
Types of Frames
  • List Frame - a listing of all possible sampling
    units or businesses along with basic information
    about the unit.
  • Area Frame - a set of geographic areas from which
    a sample of areas is selected.

9
DEFINITIONS
  • UNIT OF ANALYSIS
  • Unit for which we wish to obtain statistical data

10
DEFINITIONS
  • TARGET POPULATION
  • All units of analysis whose characteristics are
    to be estimated

11
DEFINITIONS
  • SAMPLING UNIT
  • A unit selected from the sampling frame

12
DEFINITIONS
  • SAMPLING FRAME
  • Totality of the sampling units from which the
    sample is selected

13
DESIRABLE FEATURES (1)
  • Complete, current, and non-duplicated coverage
  • Standardized concepts, definitions, and
    classifications
  • Ability to update from other sources
  • Standard statistical units
  • Linkage across units/over time

14
DESIRABLE FEATURES (2)
  • Software to extract samples/mailing lists
  • Accessible, useable, and updatable for operations
    personnel
  • Frame for several surveys
  • Data integration across surveys
  • Reasonable operating costs
  • Flexibility for future expansion

15
Frame Examples
  • Postal Service - mail exit points
  • Bureau of Labor Statistics - unemployment
    insurance files
  • EIA - product oriented - petroleum companies and
    sites importing, refining, storing, or selling
    petroleum products.

16
CONSIDERATIONS
  • Data requirements
  • Appropriate collection unit
  • Timeliness requirements
  • Multiple surveys
  • Double sampling

17
Standard Industrial Classification
  • Concept - allocate the economic activities of
    every company or establishment to an industry
    chosen from a set of mutually exclusive
    industries that account for all activities in the
    economy.

18
Establishment Sampling Unit
  • Establishment defined as a production or sales
    unit at a single location.
  • More timely data collection.
  • Larger frame, more difficult to maintain.

19
Company Sampling Unit
  • Company defined as the organizational unit that
    has the ultimate authority with respect to
    financial decision-making and to allocate
    resources.

20
FRAME DATA
  • Identification data
  • Classification data
  • Contact data
  • Maintenance and linkage data

21
EIA'S PROCESS FOR CREATING A MAILING LIST
  • Existing Surveys -- Master Frame File
  • Contact sources: State Petroleum Marketing
    Associations from Oil Industry Directory and
    National Petroleum News Factbook, State
    Government Offices, Dun and Bradstreet (select
    SIC codes), Survey Sampling Inc (propane
    marketers)

22
SOURCE CONSIDERATIONS (1)
  • Cost - purchase and formatting
  • Coverage
  • Frequency of update and time available

23
SOURCE CONSIDERATIONS (2)
  • Definitions used
  • Future availability
  • Quality of data

24
OBJECTIVES OF MATCHING
  • Match name and address within one file (internal
    match) or between two files (external match) to
    eliminate duplicate units
  • Keep units that are unique (births)
  • Remove records for units that ceased or are no
    longer in scope (deaths)
  • Update information for known units

25
THREE TYPES OF RESULTS
  • Match (M) = duplicate unit / don't add
  • Possible match (P) = more follow-up necessary /
    manual review
  • No match (N) = new unit / add

26
ERROR TYPES
  • Error types:
    -- Type I: N is really M
    -- Type II: M is really N
  • Objective: control type I and type II errors
    while minimizing the size of P
  • Considerations:
    -- Cost of type I and type II errors
       (duplicates vs. undercoverage)
    -- Resources for P
    -- Expectations regarding births, deaths, and
       updates
    -- Size of files

27
FIELDS
  • Attributes of the record such as the name or
    street address
  • The user specifies the fields to be matched
  • Some fields provide stronger matches than others

28
BLOCKING (1)
  • Used when the number of total pairs between two
    files, N(A) x N(B), is large
  • If a reliable field is available, use blocking:
    -- reduces number of pairs compared
    -- forces pairs not in same block into N
  • Files partitioned into mutually exclusive and
    exhaustive blocks
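
A minimal sketch of the blocking idea above, assuming records are dictionaries and the reliable field is the first three digits of the zip code. The record layout, the sample records, and the helper name block_pairs are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict


def block_pairs(file_a, file_b, key):
    """Return candidate pairs (a, b) whose blocking key agrees."""
    blocks = defaultdict(list)
    for rec in file_b:
        blocks[key(rec)].append(rec)
    pairs = []
    for a in file_a:
        # Pairs outside the block are never compared: they are forced into N.
        for b in blocks.get(key(a), []):
            pairs.append((a, b))
    return pairs


# Hypothetical records; the blocking key is the 3-digit zip prefix.
file_a = [
    {"name": "SMITH OIL CO", "zip": "20585"},
    {"name": "ACME FUEL", "zip": "10001"},
]
file_b = [
    {"name": "SMITHS OIL", "zip": "20590"},
    {"name": "ACME FUELS INC", "zip": "10001"},
    {"name": "JONES PROPANE", "zip": "30301"},
]


def zip3(rec):
    return rec["zip"][:3]


candidates = block_pairs(file_a, file_b, zip3)
all_pairs = len(file_a) * len(file_b)  # N(A) x N(B) without blocking
print(all_pairs, len(candidates))
```

With these sample files, blocking cuts the comparisons from six pairs to two. The price is that a typographical error in the zip code would push a true duplicate into a different block, where it is never compared.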

29
BLOCKING (2)
  • Examples:
    -- Geographic code in address: consider
       typographical errors and mobility
    -- Classification codes: consider primary vs.
       secondary
  • Matches only occur within blocks/pockets
  • Unique identifiers must be accurate

30
RULES
  • Rules produce outcomes of agreement,
    disagreement, or partial agreement of one or more
    input file fields
  • Rules are independent of one another:
    P(R) = P(r1) x P(r2) x P(r3)
  • Weights assigned to each rule: common values get
    low weights, rare values get high weights

31
WEIGHTS
  • Weights derived iteratively:
    -> examine M, N, and P
    -> adjust rules, thresholds, weights
  • Logarithms, especially base two, often used
  • Weights for all rules are summed to produce a
    total weight
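
One common way to realize base-two log weights that sum to a total is the agreement/disagreement weighting used in Fellegi-Sunter style matching. The sketch below assumes that form. The field names and the m and u probabilities are invented for illustration and do not come from the slides.

```python
import math

# Hypothetical per-field probabilities (assumptions, not from the slides):
#   m = P(field agrees | pair is a true match)
#   u = P(field agrees | pair is a true non-match)
FIELD_PROBS = {
    "name": (0.95, 0.01),
    "zip": (0.90, 0.05),
    "phone": (0.80, 0.001),
}


def field_weight(field, agrees):
    """Base-2 log weight for one rule outcome (agree or disagree)."""
    m, u = FIELD_PROBS[field]
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def total_weight(agreement):
    """Sum the weights over all compared fields.

    Summing logs corresponds to multiplying the per-rule
    probabilities, which relies on the rules being independent.
    """
    return sum(field_weight(f, a) for f, a in agreement.items())


w = total_weight({"name": True, "zip": True, "phone": False})
print(round(w, 2))
```

Agreement on a field that rarely agrees by chance (phone here) earns a larger weight than agreement on a common one (zip), matching the rule that rare values get high weights.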

32
THRESHOLDS (1)
  • Lower threshold: total weight < TL
    => (a,b) is a non-match
  • Upper threshold: total weight > TU
    => (a,b) is a match
  • TL < total weight < TU
    => (a,b) is a potential match;
    needs manual review or follow-up
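
The threshold rule can be sketched as a small function. The threshold values below are hypothetical. Treating a weight exactly equal to TL or TU as a potential match is an assumption, since the slide does not say how ties are handled.

```python
def classify(total_weight, t_lower, t_upper):
    """Map a total weight to M (match), N (non-match), or P (potential)."""
    if total_weight > t_upper:
        return "M"
    if total_weight < t_lower:
        return "N"
    # Everything in between, including the boundaries, goes to manual review.
    return "P"


T_LOWER, T_UPPER = 0.0, 8.0  # hypothetical thresholds

for weight in (10.0, 4.0, -3.0):
    print(weight, classify(weight, T_LOWER, T_UPPER))
```

Moving the two thresholds closer together shrinks P and the manual review workload, at the cost of more automatic decisions that may be wrong.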

33
THRESHOLDS (2)
  • Estimate TL, TU, P(r(a,b) | (a,b) is M), and
    P(r(a,b) | (a,b) is N)
  • Use prior information or derive iteratively using
    current information

34
METHODOLOGY
  • Preprocess and Standardize
  • Standardized files
  • Know the data - word frequencies, quality,
    commonality, anomalies
  • Fix format - create extra fields
  • Most time-consuming aspect

35
EIA's Matching Programs (1)
  • Zip code edit program: identifies incorrect city,
    state, and zip code combinations on the input
    file and outputs errors (zip vs. state, zip vs.
    post office list, city vs. zip, or city vs. post
    office list)
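
A minimal sketch of one of these checks (zip vs. state), assuming a lookup from the 3-digit zip prefix to a state. The three-entry table, the record layout, and the function name are illustrative assumptions. A real edit would presumably load the full post office list.

```python
# Tiny illustrative lookup; not a complete zip-prefix table.
ZIP3_TO_STATE = {"100": "NY", "205": "DC", "303": "GA"}


def zip_state_errors(records):
    """Return (record_id, message) for every zip/state conflict found."""
    errors = []
    for rec in records:
        expected = ZIP3_TO_STATE.get(rec["zip"][:3])
        if expected is None:
            errors.append((rec["id"], "zip prefix not in lookup"))
        elif expected != rec["state"]:
            message = f"zip implies {expected}, record says {rec['state']}"
            errors.append((rec["id"], message))
    return errors


# Hypothetical input file records.
records = [
    {"id": 1, "city": "WASHINGTON", "state": "DC", "zip": "20585"},
    {"id": 2, "city": "ATLANTA", "state": "NY", "zip": "30301"},
]
errors = zip_state_errors(records)
print(errors)
```

The city vs. zip and post office list checks would follow the same pattern with different lookup tables.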

36
EIA's Matching Programs (2)
  • Standardization programs
  • Removes punctuation and multiple blanks
  • Standardizes representations of post office box
    and street address and box placement
  • Spellings and abbreviations of common words
    changed to a common, abbreviated form

37
STANDARDIZATION PROGRAMS
  • Character Elimination
    A.B. Smith Co -> A B Smith Co
    Smith's Oil -> Smiths Oil
  • Address Standardization
    P.O. Box 10 -> PO Box 10
    Box 101 -> PO Box 101
  • Abbreviations/Spellings Standardized
    Company -> Co
    Street -> ST
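
The three standardization steps can be sketched with regular expressions. This is an illustration of the examples on this slide, not EIA's actual program. Converting everything to upper case and the two-entry abbreviation table are assumptions made here.

```python
import re

# Hypothetical abbreviation table; a real one would be much longer.
ABBREVIATIONS = {"COMPANY": "CO", "STREET": "ST"}


def standardize(text):
    """Apply character elimination, address and abbreviation standardization."""
    text = text.upper()                       # assumption: case is folded
    text = text.replace("'", "")              # SMITH'S -> SMITHS
    text = re.sub(r"[^\w\s]", " ", text)      # other punctuation -> blank
    text = re.sub(r"\s+", " ", text).strip()  # collapse multiple blanks
    # "P O BOX n", "PO BOX n", and bare "BOX n" all become "PO BOX n".
    text = re.sub(r"\b(?:P ?O ?)?BOX (\d+)", r"PO BOX \1", text)
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


for raw in ("A.B. Smith Co", "Smith's Oil", "P.O. Box 10", "Box 101"):
    print(raw, "->", standardize(raw))
```

Running both files through the same function before matching means that differences in punctuation or abbreviation no longer count as disagreements.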

38
MATCH TYPES
  • Exact Match - Deterministic - unique
    identifiers, stringent criteria
  • Possible Match - Probabilistic - based on
    statistical probability of a match
  • Internal/External

39
EXACT MATCH PROGRAM (1)
  • Exact matches are eliminated automatically. No
    manual review required.
  • Blocking criteria used:
    1) 3 digits of zip, 4 characters of name
    2) 5 digits of zip, 6 characters of address 1
    3) 10 digits of telephone
    4) sorted name field, then use 1)

40
EXACT MATCH PROGRAM (2)
  • Sort keys defined, same sort key is a block,
    examine within block, Fellegi-Sunter model for
    weight with upper and lower cutoffs
  • Nonmatches pass on to potential match
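
A sketch of how the four blocking criteria might be turned into sort keys, assuming already standardized records with name, address1, zip, and phone fields. The field names and sample records are hypothetical. Reading "sorted name field" as sorting the words of the name alphabetically is an interpretation, not something the slides confirm.

```python
import re


def digits(text):
    """Keep only the digits of a string."""
    return re.sub(r"\D", "", text)


def sort_keys(rec):
    """Build one sort key per blocking criterion for a record."""
    zip_code = digits(rec["zip"])
    # Assumed reading of criterion 4: sort the words of the name first.
    sorted_name = " ".join(sorted(rec["name"].split()))
    return {
        1: zip_code[:3] + rec["name"][:4],
        2: zip_code[:5] + rec["address1"][:6],
        3: digits(rec["phone"])[-10:],  # last 10 digits drop any leading 1
        4: zip_code[:3] + sorted_name[:4],
    }


# Hypothetical records describing the same company in two ways.
a = {"name": "SMITH OIL CO", "address1": "100 MAIN ST",
     "zip": "20585", "phone": "(202) 555-0100"}
b = {"name": "OIL SMITH CO", "address1": "PO BOX 10",
     "zip": "20585-1234", "phone": "202-555-0100"}

keys_a, keys_b = sort_keys(a), sort_keys(b)
shared = [k for k in keys_a if keys_a[k] == keys_b[k]]
print(shared)
```

Here the two records fall in the same block only under criteria 3 and 4, which suggests why several passes with different keys are run rather than a single one.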

41
POTENTIAL MATCH PROGRAM (1)
  • Different approach to assignment of weights.
    Each word and number of the name and address from
    the add file is separated and held in a table.
  • Searches base file of 70,000 companies for
    records containing these words
  • Each word in the database has a weight based on
    frequency

42
POTENTIAL MATCH PROGRAM (2)
  • Words and numbers that match are assigned weight
    5 - log(frequency)
  • Only count once in a target area; each target
    area has a weight you vary by importance.
    Example: name field weight at 2, other at 1
  • Total weight is sum of weights assigned to
    matched words
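
A minimal sketch of this word-frequency weighting. Several details are assumptions: frequency is taken as the number of base-file records containing the word, the logarithm is base 10, and the area weight multiplies the word weight. The sample base file and field names are invented.

```python
import math
import re
from collections import Counter

AREA_WEIGHTS = {"name": 2, "address": 1}  # name counts double, as in the example


def tokens(text):
    """Split a field into upper-case words and numbers."""
    return re.findall(r"[A-Z0-9]+", text.upper())


def word_frequencies(base_file):
    """Count how many base-file records contain each word."""
    freq = Counter()
    for rec in base_file:
        words = set()
        for area in AREA_WEIGHTS:
            words.update(tokens(rec[area]))
        freq.update(words)
    return freq


def match_weight(add_rec, base_rec, freq):
    """Sum area_weight * (5 - log10(frequency)) over matched words."""
    total = 0.0
    for area, area_weight in AREA_WEIGHTS.items():
        # Sets ensure a word counts only once per target area.
        shared = set(tokens(add_rec[area])) & set(tokens(base_rec[area]))
        for word in shared:
            total += area_weight * (5 - math.log10(max(1, freq[word])))
    return total


base_file = [
    {"name": "SMITH OIL CO", "address": "PO BOX 10"},
    {"name": "JONES OIL CO", "address": "100 MAIN ST"},
    {"name": "ACME PROPANE CO", "address": "PO BOX 22"},
]
freq = word_frequencies(base_file)
add_rec = {"name": "SMITH OIL COMPANY", "address": "PO BOX 10"}

# Keep the highest-weighted base records, as the online match does (up to 9).
ranked = sorted(base_file, key=lambda b: match_weight(add_rec, b, freq),
                reverse=True)[:9]
print(ranked[0]["name"])
```

A rare word such as SMITH contributes more than a common one such as OIL, so the first base record ranks well above the other two.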

43
DECISION CHART AND ONLINE MATCH
  • Decision chart was created to provide direction
    and consistency in the manual review; review
    sometimes required calling the company to resolve
    a conflict or confirm a match
  • Online match provided for matching one record vs.
    base file, 9 highest weights

44
(No Transcript)
45
ONLINE MATCH (1)
46
ONLINE MATCH (2)
47
ADDING CLASSIFICATION DATA
  • Determine what
  • Determine how
  • Existing sources vs. survey
  • Industry burden/cost

48
FRAME SURVEY
  • Field test
  • Survey Considerations
  • Survey nonresponse
  • Editing

49
(No Transcript)
50
(No Transcript)
51
FRAME ACCURACY
  • Coverage errors
  • Classification errors
  • Incomplete or missing data
  • Impact of error on data collection
  • Impact of error on survey estimates
  • Age/industry dynamics

52
MASTER FRAME FILE FUNCTIONS
  • Standardize IDs across surveys
  • Identify births/deaths
  • Communicate changes
  • Identify and eliminate duplicates
  • Assist in frame development
  • Provide historical linkages
  • Cross survey analysis
  • Generate reports to assist managers

53
(No Transcript)
54
OTHER ISSUES
  • Preparation for sample selection
  • Query System
  • Confidentiality vs. Public Use
  • Systems and Platforms

55
TARGET QUERY
56
COST ESTIMATES
  • Development costs
  • Survey costs
  • Maintenance costs
  • Other cost considerations

57
THANK YOU