Title: Statistical Frame Development
1Statistical Frame Development
- Michael Griffey
- Energy Information Administration
- January 2000
3Petroleum Marketing Surveys
- 2 weekly retail price surveys.
- 1 semi-monthly heating fuels survey.
- 6 monthly crude oil and petroleum product price and supply surveys.
- 1 annual fuel oil sales survey.
- 1 quadrennial frame survey.
4Supplier Frame - Examples
- 260 refineries/blending plants.
- 220 importers per month
- 320 bulk terminals
- 7,000 home heating oil dealers
- 60,000 petroleum dealers
- 180,000 gasoline retailers
5Consumption Surveys
- Collect data from the end-user/consumer of the product.
- Have large frames/sample sizes. MECS: 200,000 manufacturers, sample 18,000 establishments.
- Provide detailed information about consumer habits.
- Conducted every 4 years.
6Supplier Surveys
- Collect information from suppliers, e.g. refiners, importers, retailers, etc.
- Have smaller frames, sample sizes.
- Provide information on volumes supplied/sold.
- Conducted weekly, monthly, and annually.
7WHAT IS A FRAME?
- A listing of all possible sampling units along
with basic information about the units
8Types of Frames
- List Frame - a listing of all possible sampling units or businesses along with basic information about the unit.
- Area Frame - a set of geographic areas from which a sample of areas is selected.
9DEFINITIONS
- UNIT OF ANALYSIS
- Unit for which we wish to obtain statistical data
10DEFINITIONS
- TARGET POPULATION
- All units of analysis whose characteristics are
to be estimated
11DEFINITIONS
- SAMPLING UNIT
- A unit selected from the sampling frame
12DEFINITIONS
- SAMPLING FRAME
- Totality of the sampling units from which the
sample is selected
13DESIRABLE FEATURES (1)
- Complete, current, and non-duplicated coverage
- Standardized concepts, definitions, and classifications
- Ability to update from other sources
- Standard statistical units
- Linkage across units/over time
14DESIRABLE FEATURES (2)
- Software to extract samples/mailing lists
- Accessible, usable, and updatable for operations personnel
- Frame for several surveys
- Data integration across surveys
- Reasonable operating costs
- Flexibility for future expansion
15Frame Examples
- Postal Service - mail exit points
- Bureau of Labor Statistics - unemployment insurance files
- EIA - product oriented - petroleum companies and sites importing, refining, storing, or selling petroleum products.
16CONSIDERATIONS
- Data requirements
- Appropriate collection unit
- Timeliness requirements
- Multiple surveys
- Double sampling
17Standard Industrial Classification
- Concept - allocate the economic activities of
every company or establishment to an industry
chosen from a set of mutually exclusive
industries that account for all activities in the
economy.
18Establishment Sampling Unit
- Establishment defined as a production or sales unit at a single location.
- More timely data collection.
- Larger frame, more difficult to maintain.
19Company Sampling Unit
- Company defined as the organizational unit that
has the ultimate authority with respect to
financial decision-making and to allocate
resources.
20FRAME DATA
- Identification data
- Classification data
- Contact data
- Maintenance and linkage data
21EIA'S PROCESS FOR CREATING A MAILING LIST
- Existing Surveys -- Master Frame File
- Contact sources--State Petroleum Marketing
Associations from Oil Industry Directory and
National Petroleum News Factbook, State
Government Offices, Dun and Bradstreet (select
SIC codes), Survey Sampling Inc (propane
marketers)
22SOURCE CONSIDERATIONS (1)
- Cost - purchase and formatting
- Coverage
- Frequency of update and time available
23SOURCE CONSIDERATIONS (2)
- Definitions used
- Future availability
- Quality of data
24OBJECTIVES OF MATCHING
- Match name and address within one file (internal match) or between two files (external match) to eliminate duplicate units
- Keep units that are unique (births)
- Remove records for units that ceased or are no longer in scope (deaths)
- Update information for known units
25THREE TYPES OF RESULTS
- Match (M): duplicate unit / don't add
- Possible match (P): more follow-up necessary / manual review
- No match (N): new unit / add
26ERROR TYPES
- Error types: Type I (N is really M); Type II (M is really N)
- Objective: control type I and type II errors while minimizing the size of P
- Considerations: cost of type I and type II errors (duplicates vs. undercoverage); resources for P; expectations regarding births, deaths, and updates; size of files
27FIELDS
- Attributes of the record such as the name or street address
- The user specifies the fields to be matched
- Some fields provide stronger matches than others
28BLOCKING (1)
- Used when the total number of pairs between two files, N(A) x N(B), is large
- If a reliable field is available, block on it: reduces the number of pairs compared; forces pairs not in the same block into N
- Files partitioned into mutually exclusive and exhaustive blocks
29BLOCKING (2)
- Examples: geographic code in address (consider typographical errors and mobility); classification codes (consider primary vs. secondary)
- Matches only occur within blocks/pockets
- Unique identifiers must be accurate
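
As a rough illustration of blocking, the sketch below compares only pairs that share a blocking key instead of all N(A) x N(B) pairs. The key (first three ZIP digits), the record layout, and the sample records are assumptions made for this example; they are not taken from EIA's programs.

```python
from collections import defaultdict

def block_key(record):
    # Hypothetical blocking key: first 3 digits of the ZIP code.
    return record["zip"][:3]

def candidate_pairs(file_a, file_b, key=block_key):
    """Yield only the (a, b) pairs whose records fall in the same block."""
    blocks_b = defaultdict(list)
    for b in file_b:
        blocks_b[key(b)].append(b)
    for a in file_a:
        # Pairs in different blocks are never compared (forced into N).
        for b in blocks_b.get(key(a), []):
            yield a, b

# Made-up records for illustration only.
file_a = [
    {"name": "SMITH OIL CO", "zip": "20585"},
    {"name": "ACME PETROLEUM", "zip": "77002"},
]
file_b = [
    {"name": "SMITHS OIL CO", "zip": "20590"},
    {"name": "ACME PETROLEUM INC", "zip": "77002"},
    {"name": "JONES FUEL", "zip": "10001"},
]

all_pairs = len(file_a) * len(file_b)            # N(A) x N(B)
blocked = list(candidate_pairs(file_a, file_b))  # pairs actually compared
print(all_pairs, len(blocked))
```

A typo in the blocking field would place a true match in a different block, which is why the slide stresses that the field must be reliable.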
30 RULES
- Rules produce outcomes of agreement, disagreement, or partial agreement of one or more input file fields
- Rules are independent of one another: P(R) = P(r1) x P(r2) x P(r3)
- Weights assigned to each rule: common values get low weights, rare values get high weights
31WEIGHTS
- Weights derived iteratively: examine M, N, and P, then adjust rules, thresholds, weights
- Logarithms, especially base two, often used
- Weights for all rules are summed to produce a total weight
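
One common way to obtain base-two weights that can be summed is the Fellegi-Sunter log-likelihood ratio. The sketch below assumes that form. The m and u probabilities and the rule names are invented for illustration, and partial agreement is left out.

```python
import math

# Hypothetical per-rule probabilities (illustrative values only):
#   m = P(rule agrees | pair is a true match)
#   u = P(rule agrees | pair is a true non-match)
RULES = {
    "name":  {"m": 0.95, "u": 0.01},
    "zip":   {"m": 0.90, "u": 0.05},
    "phone": {"m": 0.80, "u": 0.001},
}

def rule_weight(m, u, agrees):
    """Base-2 log-likelihood ratio for one rule outcome."""
    if agrees:
        return math.log2(m / u)            # positive when m > u
    return math.log2((1 - m) / (1 - u))    # negative when m > u

def total_weight(outcomes, rules=RULES):
    """Sum the rule weights. Under independence the sum of logs equals
    the log of the product P(r1) x P(r2) x P(r3)."""
    return sum(
        rule_weight(rules[name]["m"], rules[name]["u"], agrees)
        for name, agrees in outcomes.items()
    )

pair_outcomes = {"name": True, "zip": True, "phone": False}
print(round(total_weight(pair_outcomes), 2))
```

Agreement on a field that rarely agrees by chance (small u, such as a telephone number) earns a larger weight than agreement on a common value.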
32THRESHOLDS (1)
- Lower threshold: total weight < TL; (a,b) is a non-match
- Upper threshold: total weight > TU; (a,b) is a match
- TL < total weight < TU: (a,b) is a potential match; needs manual review or follow-up
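
The threshold rule can be sketched as a small function. The numeric thresholds below are placeholders, not EIA's values. Sending a weight exactly equal to TL or TU to manual review is an assumption, since the slide only covers the strict inequalities.

```python
def classify(total_weight, lower, upper):
    """Return 'N' (non-match), 'M' (match), or 'P' (potential match)."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if total_weight < lower:
        return "N"
    if total_weight > upper:
        return "M"
    return "P"  # between the thresholds: manual review or follow-up

# Placeholder thresholds for illustration only.
TL, TU = 0.0, 10.0
for w in (-3.2, 4.5, 12.1):
    print(w, classify(w, TL, TU))
```

Widening the gap between TL and TU lowers the risk of both error types but enlarges P, and with it the manual review workload.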
33THRESHOLDS (2)
- Estimate TL, TU, P(r(a,b) | (a,b) is M), and P(r(a,b) | (a,b) is N)
- Use prior information or derive iteratively using current information
34METHODOLOGY
- Preprocess and Standardize
- Standardized files
- Know the data - word frequencies, quality, commonality, anomalies
- Fix format - create extra fields
- Most time consuming aspect
35EIA's Matching Programs (1)
- Zip code edit program--identifies
incorrect city, state and zip code combinations
on input file and outputs errors (zip vs state,
zip vs post office list, city vs zip or city vs
post office list)
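
A minimal sketch of this kind of edit check follows, assuming a lookup table keyed by ZIP code. The two-entry table, the field names, and the error labels are hypothetical stand-ins for a full post office list.

```python
# Hypothetical excerpt of a post office list: ZIP -> (city, state).
ZIP_TABLE = {
    "20585": ("WASHINGTON", "DC"),
    "77002": ("HOUSTON", "TX"),
}

def zip_edit(record, table=ZIP_TABLE):
    """Return a list of error labels for one record (empty if consistent)."""
    expected = table.get(record["zip"])
    if expected is None:
        return ["zip vs post office list"]
    errors = []
    city, state = expected
    if record["state"].strip().upper() != state:
        errors.append("zip vs state")
    if record["city"].strip().upper() != city:
        errors.append("city vs zip")
    return errors

print(zip_edit({"city": "Houston", "state": "TX", "zip": "77002"}))
print(zip_edit({"city": "Houston", "state": "DC", "zip": "77002"}))
print(zip_edit({"city": "Houston", "state": "TX", "zip": "99999"}))
```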
36EIA's Matching Programs (2)
- Standardization programs
- Removes punctuation and multiple blanks
- Standardizes representations of post office box and street address and box placement
- Spellings and abbreviations of common words changed to a common, abbreviated form
37STANDARDIZATION PROGRAMS
- Character Elimination
  A.B. Smith Co -> A B Smith Co
  Smith's Oil -> Smiths Oil
- Address Standardization
  P.O. Box 10 -> PO Box 10
  Box 101 -> PO Box 101
- Abbreviations/Spellings Standardized
  Company -> Co
  Street -> ST
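
The three standardization steps can be sketched with regular expressions. This is an illustration only, not EIA's program. Folding everything to upper case is an added assumption, and the abbreviation table holds just the two pairs shown on the slide.

```python
import re

# Only the two pairs from the slide; a real table would be much longer.
ABBREVIATIONS = {"COMPANY": "CO", "STREET": "ST"}

def standardize(text):
    s = text.upper()                      # assumption: fold case before matching
    s = s.replace("'", "")                # Smith's -> Smiths
    s = re.sub(r"[^\w\s]", " ", s)        # other punctuation becomes a blank
    s = re.sub(r"\s+", " ", s).strip()    # collapse multiple blanks
    # "P O BOX 10", "PO BOX 10", and "BOX 10" all become "PO BOX 10".
    s = re.sub(r"^(?:P ?O )?BOX (\d+)", r"PO BOX \1", s)
    return " ".join(ABBREVIATIONS.get(word, word) for word in s.split(" "))

for raw in ["A.B. Smith Co", "Smith's Oil", "P.O. Box 10", "Box 101",
            "Acme Company, 12 Main Street"]:
    print(raw, "->", standardize(raw))
```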
38MATCH TYPES
- Exact Match - Deterministic - unique identifiers, stringent criteria
- Possible Match - Probabilistic - based on statistical probability of a match
- Internal/External
39EXACT MATCH PROGRAM (1)
- Exact matches are eliminated automatically. No manual review required.
- Blocking criteria used: 1) 3 digits zip, 4 characters of name; 2) 5 digits zip, 6 characters of address 1; 3) 10 digits of telephone; 4) sorted name field, then use 1
40EXACT MATCH PROGRAM (2)
- Sort keys defined, same sort key is a block, examine within block, Fellegi-Sunter model for weight with upper and lower cutoffs
- Nonmatches pass on to potential match
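
The sort keys behind the first three blocking criteria might be built as sketched below. The field names (name, address1, zip, phone) and the sample records are assumptions, and the fourth criterion is left out because its exact definition is not given on the slide.

```python
from collections import defaultdict

def digits(text):
    return "".join(ch for ch in text if ch.isdigit())

# One key builder per blocking criterion (field names are hypothetical).
KEY_BUILDERS = {
    "zip3_name4": lambda r: r["zip"][:3] + r["name"][:4],
    "zip5_addr6": lambda r: r["zip"][:5] + r["address1"][:6],
    "phone10":    lambda r: digits(r["phone"])[:10],
}

def blocks_for(records, build_key):
    """Group records by sort key; keep only blocks with 2+ records."""
    blocks = defaultdict(list)
    for record in records:
        blocks[build_key(record)].append(record)
    return {key: recs for key, recs in blocks.items() if len(recs) > 1}

# Made-up, already standardized records.
records = [
    {"name": "SMITH OIL CO", "address1": "12 MAIN ST",
     "zip": "20585", "phone": "(202) 555-0100"},
    {"name": "SMITH OIL COMPANY", "address1": "12 MAIN STREET",
     "zip": "20585", "phone": "202-555-0100"},
    {"name": "JONES FUEL", "address1": "9 ELM ST",
     "zip": "10001", "phone": "212-555-0199"},
]

for criterion, build_key in KEY_BUILDERS.items():
    print(criterion, list(blocks_for(records, build_key)))
```

Records inside each block would then be scored and compared against the upper and lower cutoffs.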
41POTENTIAL MATCH PROGRAM (1)
- Different approach to assignment of weights. Each word and number of the name and address from the add file is separated and held in a table.
- Searches base file of 70,000 companies for records containing these words
- Each word in the database has a weight based on frequency
42POTENTIAL MATCH PROGRAM (2)
- Words and numbers that match are assigned weight 5 - log(frequency)
- Only count once in a target area; each target area has a weight you vary by importance. Example: name field weight at 2, other at 1
- Total weight is sum of weights assigned to matched words
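
The word-frequency weighting can be sketched as follows. The slide does not state the base of the logarithm or how the target-area weight is combined with the word weight. This example assumes base 10 and a simple multiplication, and the frequencies are invented.

```python
import math

def word_weight(frequency):
    """Weight from the slide: 5 - log(frequency). Base 10 is an assumption."""
    return 5 - math.log10(frequency)

def total_weight(add_record, base_record, word_freq, area_weights):
    """Sum the weights of matched words, counting each word once per area."""
    total = 0.0
    for area, add_words in add_record.items():
        matched = set(add_words) & set(base_record.get(area, []))
        area_sum = sum(word_weight(word_freq[w]) for w in matched)
        total += area_weights.get(area, 1) * area_sum  # assumed combination
    return total

# Invented word frequencies in the base file.
WORD_FREQ = {"SMITH": 100, "OIL": 10000, "CO": 10000, "INC": 10000,
             "12": 1000, "MAIN": 1000, "ST": 10000}
AREA_WEIGHTS = {"name": 2, "address": 1}  # example weights from the slide

add_record = {"name": ["SMITH", "OIL", "CO"], "address": ["12", "MAIN", "ST"]}
base_record = {"name": ["SMITH", "OIL", "CO", "INC"],
               "address": ["12", "MAIN", "ST"]}

score = total_weight(add_record, base_record, WORD_FREQ, AREA_WEIGHTS)
print(round(score, 2))
```

Rare words such as a surname contribute more than common words such as "OIL" or "CO", which matches the idea that rare values carry high weights.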
43DECISION CHART AND ONLINE MATCH
- Decision chart was created to provide direction and consistency in the manual review; resolving a conflict or confirming a match sometimes required calling the company
- Online match provided for matching one record vs base file, 9 highest weights
45ONLINE MATCH (1)
46ONLINE MATCH (2)
47ADDING CLASSIFICATION DATA
- Determine what
- Determine how
- Existing sources vs. survey
- Industry burden/cost
48FRAME SURVEY
- Field test
- Survey Considerations
- Survey nonresponse
- Editing
51FRAME ACCURACY
- Coverage errors
- Classification errors
- Incomplete or missing data
- Impact of error on data collection
- Impact of error on survey estimates
- Age/industry dynamics
52MASTER FRAME FILE FUNCTIONS
- Standardize IDs across surveys
- Identify births/deaths
- Communicate changes
- Identify and eliminate duplicates
- Assist in frame development
- Provide historical linkages
- Cross survey analysis
- Generate reports to assist managers
54OTHER ISSUES
- Preparation for sample selection
- Query System
- Confidentiality vs. Public Use
- Systems and Platforms
55TARGET QUERY
56COST ESTIMATES
- Development costs
- Survey costs
- Maintenance costs
- Other cost considerations
57THANK YOU