Title: Working with the ECLS-K Datasets: Weights and Other Issues
1. Working with the ECLS-K Datasets: Weights and Other Issues
Information is courtesy of the Institute of Education Sciences, National Center for Education Statistics, and is used in their training seminars.
2. Sampling Weights
- What are sampling weights and why are they important?
- How are weights used?
- What weights are on the ECLS-K data files and when should they be used?
3. What Is a Weight?
- A weight is used to indicate the relative strength of an observation.
- In the simplest case, each observation is counted equally.
- For example, if we have five observations and wish to calculate the mean, we just add up the values and divide by 5.
4. How Are Weights Used?
- Dataset with 5 cases.
- Value: 4 2 1 5 2
- Weight: 1 2 4 1 2
- Sample mean = (4 + 2 + 1 + 5 + 2)/5 = 2.8
- Weighted mean = [(4×1) + (2×2) + (1×4) + (5×1) + (2×2)]/(sum of weights) = (4 + 4 + 4 + 5 + 4)/10 = 2.1
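The arithmetic on this slide can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original training material; the function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Return sum(value * weight) / sum(weight)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# The five cases from the slide.
values = [4, 2, 1, 5, 2]
weights = [1, 2, 4, 1, 2]

# Unweighted: every case counts once.
sample_mean = sum(values) / len(values)
# Weighted: each case counts in proportion to its weight.
weighted = weighted_mean(values, weights)

print(sample_mean)  # 2.8
print(weighted)     # 2.1
```

The case with value 1 carries a weight of 4, so it pulls the weighted mean below the unweighted mean.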
5. What Is the Difference Between Weighted and Unweighted Data?
- With unweighted data, each case is counted equally.
- Unweighted data represent only those in the sample who provide data.
- With weighted data, each case is counted relative to its representation in the population.
- Weights allow analyses that represent the target population.
6. ECLS-K and Weights
- The ECLS-K is a sample, i.e., the entire population was not surveyed.
- The ECLS-K is not a simple random sample (SRS). That is, not all schools, teachers, and children had an equal probability of selection.
- Not all schools, teachers, and children participated.
7. Why Use Weights in the ECLS-K?
- The ECLS-K weights allow you to make statements about the population of U.S. children who were in kindergarten in 1998-99 or in first grade in 1999-2000. Without weights, estimates are not nationally representative.
- Weights adjust for differential selection probabilities and reduce bias associated with nonresponse by adjusting for differential nonresponse.
8. Examples of Weighted vs. Unweighted Data
9. Examples of Weighted vs. Unweighted Data
10. Types of Weights on the ECLS-K
- Weights vary according to:
- Level of analysis: child, teacher, or school (only child-level after base year).
- Round(s) of data: cross-sectional or longitudinal.
- Source(s) of data: child assessment, parent interview, and/or teacher questionnaires.
11. Level of Analysis: Base Year
The first element in a weight variable name indicates the level of analysis.
- Weights for school-level analyses begin with S.
- Weights for teacher-level analyses begin with B.
- Weights for child-level analyses begin with C (cross-sectional).
- Weights for child-level analyses begin with BY (longitudinal).
12. Level of Analysis: 1st, 3rd and 5th Grades
- Weights for child-level analyses (cross-sectional and longitudinal) begin with C.
- One exception: weight Y2COMW0 is for child-level analyses of assessment data from rounds 1, 2 and 4 and parent and/or teacher data from spring of first grade, and one or more base year rounds of parent and/or teacher data.
13. Data Round(s)
The second element in a weight variable name indicates the round(s) of data.
- Weights for cross-sectional analyses have a single round number: 1, 2, 3, 4, 5 or 6.
- Weights for longitudinal analyses have 2 or more numbers, for example:
- 45 for rounds 4 and 5.
- 124 for rounds 1, 2 and 4 (exception in Y2COMW0).
- 1_4 for rounds 1, 2, 3 and 4.
- 1_6F for rounds 1, 2, 4, 5, 6 (F = full sample).
- 1_5S for rounds 1, 2, 3, 4, 5 (S = subsample).
14. Source of the Data
The third element in a weight variable name indicates the source(s) of data. Weights for analyses using data from:
- Child assessments (alone or in conjunction with any combination of a limited set of child characteristics, e.g., age, sex, race/ethnicity) have a C.
- Parent interview (with or without child data) have a P.
- Child AND parent AND teacher have a CPT.
- In 5th grade, the CPT is followed by either R, M or S for reading, math or science teacher.
15. Sources of the Data: Two Exceptions
- BYCOMW0: Child assessment data from fall AND spring kindergarten in conjunction with one or more rounds of parent and/or teacher base year data.
- Y2COMW0: Child direct assessment data from fall AND spring kindergarten AND spring first grade, in conjunction with parent and/or teacher data from spring first grade, AND one or more base year rounds of parent and/or teacher data.
16. Source of the Data
Sources that do not affect choice of weight:
- School administrator questionnaire
- Facilities checklist
- Teacher questionnaire C
- Special education questionnaires
- Student record abstract data
- Head Start data
- Salary and benefits data
17. Example: C23PW0
- C for child level analysis.
- 23 for analysis of data from rounds 2 and 3.
- P for analysis of parent interview data.
18. Example: C6CPTM0
- C for child level analysis.
- 6 for analysis of data from round 6.
- CPTM for analysis of child, parent, and math
teacher.
19. Cross-sectional Examples
- C1PW0 -- Child-level analyses from round 1, parent interview data (with or without child assessment data).
- B1TW0 -- Teacher-level analyses (teacher data) from round 1.
- S2SAQW0 -- School-level analysis (SAQ data) from round 2.
- C6CW0 -- Child assessment data from round 6.
- C5CPTW0 -- Child-level analyses from round 5 with child, parent AND teacher data.
20. Longitudinal Examples
All longitudinal weights are for child-level analyses.
- BYPW0 -- Round 1 and 2 parent interview data.
- BYCOMW0 -- Round 1 and 2 assessment data and some other parent and teacher data.
- C24PW0 -- Round 2 and 4 parent interview data.
- C245CW0 -- Round 2, 4 and 5 assessment data.
- C1_6FC0 -- Round 1, 2, 4, 5 and 6 assessment data.
21. Third- and Fifth-Grade Weights
- Unlike the first grade sample, the ECLS-K sample was not freshened in third and fifth grade.
- The ECLS-K sample does not represent all third graders in 2001-02 or fifth graders in 2003-04. These samples represent all children who began kindergarten in 1998 or began first grade in 1999.
22. How to Use Weights
- In SAS, use the WEIGHT statement.
- In SPSS, use the WEIGHT BY command.
- Key fact: All ECLS-K weights sum to population totals.
23. Weights in SAS
- SAS uses the WEIGHT statement in various PROCedures, for example:
    /* Weighted frequency tables for Age, Gender, and Score */
    PROC FREQ DATA=test;
      TABLES Age Gender Score;
      WEIGHT weightvar;
    RUN;
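24. Weights in SPSS
    * List the cases, then run unweighted frequencies.
    LIST VARIABLES=age TO weightvar.
    FREQUENCIES VARIABLES=age score /STATISTICS=DEFAULT.
    * Turn weighting on and rerun the same frequencies.
    WEIGHT BY weightvar.
    FREQUENCIES VARIABLES=age score /STATISTICS=DEFAULT.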
25. Weights in STATA
    clear
    use "c:\temp\test1.dta"
    * tabulate does not accept pweights, so declare the weight with svyset
    svyset [pweight=weightvar]
    svy: tabulate score age
    svy: tabulate gender
26. Weights for HLM Users
- ECLS-K weights are adjusted for nonresponse.
- ECLS-K weights are not normalized (they sum to the population N rather than the sample n).
- A within-school child-level weight can be approximated by dividing a regular child-level weight by the school-level weight.
- If the analysis includes children who stayed in the same school at each round of the analysis, the school weight (S2SAQW0) can be used as a school-level weight.
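The within-school approximation described above is a single division per child. The following Python sketch illustrates it under stated assumptions: the child IDs and weight values are invented for illustration and do not come from the ECLS-K files, and the function name is hypothetical.

```python
def within_school_weight(child_weight, school_weight):
    """Approximate a within-school child weight as child weight / school weight."""
    if school_weight <= 0:
        raise ValueError("school weight must be positive")
    return child_weight / school_weight


# Hypothetical records: (child ID, child-level weight, school-level weight).
children = [
    ("child_a", 210.0, 42.0),
    ("child_b", 180.0, 36.0),
    ("child_c", 95.0, 38.0),
]

# Level-1 (within-school) weights for a two-level model.
level1_weights = {
    child_id: within_school_weight(child_w, school_w)
    for child_id, child_w, school_w in children
}
print(level1_weights)
```

In a two-level model, the result would serve as the child-level weight and the school weight as the school-level weight.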
27. Other Frequently Asked Questions
- When selecting a weight, do I have to subset my dataset?
- What happens to cases where there is no positive weight?
- What weights do I use if analyzing a subsample of cases?
- What if I'm running a regression: what weights do I use?
28. Summary about Weights
- Weights should be used when analyzing data from the ECLS-K.
- The appropriate weight should be selected based on level of analysis, round(s) of data, and source(s) of data.
- There may not be a perfect weight for some analyses. The best weight can be determined with some descriptive analysis.
29. Variance: Calculating Standard Errors
- Why are standard errors important?
- Why not use standard errors that assume a simple random sample (SRS)?
- How to use exact methods for estimating standard errors.
- How to use approximation methods for estimating standard errors.
30. Why Are Standard Errors Important?
- Standard errors are produced for estimates from sample surveys. They are a measure of the variance in the estimates associated with the selected sample being one of many possible samples.
- Standard errors are used to test hypotheses and to study group differences when making inferences to a population.
- Using inaccurate standard errors can lead to identification of statistically significant results where none are present and vice versa.
31. Important Considerations
- All weights on the ECLS-K data files sum to population totals and not sample totals.
- The ECLS-K has a complex sample design and is not a simple random sample.
32. The ECLS-K Sample Design: Oversampling
- The ECLS-K includes oversamples of private schools and private school children.
- The ECLS-K also oversamples Asian and Pacific Islander children.
33. The ECLS-K Sample Design: Clustering
- Sample children were clustered within primary sampling units (PSUs) to reduce field costs.
- Children were in closer geographical proximity than would occur in a simple random sample.
- Children in a clustered sample tend to be more alike than those in a simple random sample.
34. Complex Samples and Standard Errors
- The usual standard error formula assumes a simple random sample.
- Standard errors for estimates from a complex sample must account for the within-cluster/across-cluster variation.
- Special software can make the adjustment, or this adjustment can be approximated using the design effect.
35. Options
- Exact methods, such as the Taylor series and replication techniques.
- Approximation method.
36. Exact Methods
- Taylor series
- Extract PSU and strata IDs from data file.
- Software available: SUDAAN, STATA (using SVY commands), and SAS (using PROC SURVEY commands).
37. Exact Methods: Replication Techniques
- Extract replication weights (90 of them).
- ECLS-K replication weights use jackknife 2 (JK2) methods.
- Software: WESVAR replication series (JK2), AM (JK2), and SAS-callable SUDAAN.
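To make the replication idea concrete, the following Python sketch estimates the standard error of a weighted mean from replicate weights. It assumes the JK2 form in which the variance is the sum, over replicates, of the squared difference between the replicate estimate and the full-sample estimate. The toy data and the two replicate weight vectors are invented for illustration (the ECLS-K files carry 90 replicate weights). In practice the software listed above does this work.

```python
import math


def weighted_mean(values, weights):
    """Weighted mean: sum(value * weight) / sum(weight)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jk2_standard_error(values, full_weights, replicate_weights):
    """Standard error of a weighted mean from replicate weights.

    Assumes variance = sum over replicates of
    (replicate estimate - full-sample estimate) ** 2.
    """
    full_estimate = weighted_mean(values, full_weights)
    variance = sum(
        (weighted_mean(values, rep) - full_estimate) ** 2
        for rep in replicate_weights
    )
    return math.sqrt(variance)


# Toy data: 4 cases, one full-sample weight, 2 replicate weights.
scores = [1.0, 2.0, 3.0, 4.0]
full = [1.0, 1.0, 1.0, 1.0]
replicates = [
    [2.0, 0.0, 1.0, 1.0],  # one case dropped, its pair doubled
    [1.0, 1.0, 2.0, 0.0],
]

se = jk2_standard_error(scores, full, replicates)
print(round(se, 4))
```

Each replicate re-estimates the statistic with one member of a pair dropped and the other doubled. The spread of those re-estimates around the full-sample estimate is what reflects the sample design.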
38. Approximation Method
- Two stages:
- First, normalize weights so the standard error is based on actual sample size rather than population size.
- Then, use the design effect (DEFF) to account for the complex sampling design.
39. 1) Normalizing Weights
- Weights on the ECLS-K sum to the population totals.
- Calculate a new weight that sums to the sample size.
- Normalized weight = (ECLS-K weight) × (sample n / population N).
- SAS users do not need this step since estimates are produced based on the actual sample size.
40. Example: Normalizing Weights
- Weight to be normalized: C2PW0
- Sum of weights: 3,865,946
- Total number of cases with a positive weight: 18,950
- Normalized weight = C2PW0 × (18,950 / 3,865,946)
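The normalization above amounts to multiplying every weight by one constant. The following Python sketch computes that constant for the C2PW0 figures on this slide and then applies the same rule to a small invented weight vector. The function name and the toy weights are illustrative assumptions.

```python
def normalize_weights(weights):
    """Rescale weights so that they sum to the number of positive-weight cases."""
    positive = [w for w in weights if w > 0]
    sample_n = len(positive)
    population_total = sum(positive)
    factor = sample_n / population_total
    return [w * factor for w in weights]


# Constant for the slide's example: sample n / sum of weights.
factor_c2pw0 = 18950 / 3865946
print(round(factor_c2pw0, 6))

# Toy weight vector (invented); the zero-weight case stays at zero.
raw = [250.0, 150.0, 0.0, 400.0, 200.0]
normalized = normalize_weights(raw)
print(normalized)
print(sum(normalized))  # approximately 4, the number of positive-weight cases
```

Cases with a zero weight stay at zero and do not count toward the sample n, matching the slide's "cases with a positive weight".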
41. 2) Adjusting for Complex Design
- The ECLS-K has a complex sample design; it is not a simple random sample.
- Software packages designed for simple random samples tend to underestimate the standard errors for complex sample designs.
- Special methods are required for complex designs.
42. Using Design Effects (DEFF)
- What is a design effect (DEFF)?
- It's the ratio of the variance found in the actual (complex) sample design to the variance expected in a simple random sample of the same sample size.
43. Using Design Effects (DEFF)
- DEFT = the square root of DEFF (design standard error / simple random sample standard error).
- Example for fall-kindergarten reading scores:
- SE (SRS) = 0.063
- SE (Design) = 0.156
- DEFF = 0.156^2 / 0.063^2 = 6.15
- DEFT = 0.156 / 0.063 = square root of 6.15 = 2.48
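The two ratios can be reproduced directly from the standard errors on this slide. The Python sketch below is illustrative and its function names are assumptions. Computed from the rounded standard errors shown here, DEFF comes out near 6.13 rather than 6.15. The slide's figure was presumably computed from unrounded standard errors, which is an assumption rather than something the source states.

```python
def design_effect(se_design, se_srs):
    """DEFF: design-based variance divided by simple-random-sample variance."""
    return (se_design / se_srs) ** 2


def design_factor(se_design, se_srs):
    """DEFT: design-based standard error divided by SRS standard error."""
    return se_design / se_srs


se_srs = 0.063      # SE assuming a simple random sample
se_design = 0.156   # SE accounting for the complex design

deff_value = design_effect(se_design, se_srs)
deft_value = design_factor(se_design, se_srs)

print(round(deff_value, 2))  # 6.13 from the rounded inputs
print(round(deft_value, 2))  # 2.48
```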
44. 3 Ways of Using the DEFF
- Multiply the SRS (simple random sample) standard error produced by statistical software (when using normalized weights) by the square root of the DEFF (DEFT).
- Or
- Adjust the t-statistic by dividing it by the square root of the design effect (DEFT), or adjust the F-statistic by dividing it by the DEFF.
- Or
- Adjust the weight such that an adjusted standard error is produced.
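The first two options are one-line adjustments to software output. The Python sketch below uses the DEFF of 6.15 and the SRS standard error of 0.063 from the fall-kindergarten reading example. The t and F values are invented for illustration, and the function names are assumptions.

```python
import math


def adjust_standard_error(se_srs, deff):
    """Option 1: inflate the SRS standard error by DEFT = sqrt(DEFF)."""
    return se_srs * math.sqrt(deff)


def adjust_t_statistic(t_srs, deff):
    """Option 2a: deflate the t-statistic by DEFT = sqrt(DEFF)."""
    return t_srs / math.sqrt(deff)


def adjust_f_statistic(f_srs, deff):
    """Option 2b: deflate the F-statistic by DEFF."""
    return f_srs / deff


deff = 6.15

adjusted_se = adjust_standard_error(0.063, deff)  # close to the design SE of 0.156
adjusted_t = adjust_t_statistic(5.0, deff)        # hypothetical t of 5.0
adjusted_f = adjust_f_statistic(12.3, deff)       # hypothetical F of 12.3

print(round(adjusted_se, 3))
print(round(adjusted_t, 2))
print(round(adjusted_f, 2))
```

With these inputs, a t-statistic that looks large under SRS assumptions shrinks to roughly 2 once the design effect is applied.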
45. Using a DEFF-Adjusted Weight
- First, create a weight that sums to the sample size (normalized weight).
- Second, divide this normalized weight by the DEFF.
- Third, use this weight for analyses. The standard errors produced will approximate the standard errors obtained using exact methods.
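The three steps above can be sketched in Python as follows. The weight values and the DEFF of 2.0 are invented for illustration, and the function name is an assumption. Actual DEFFs should be taken from the ECLS-K documentation.

```python
def deff_adjusted_weights(weights, deff):
    """Normalize weights to sum to the sample size, then divide by DEFF."""
    if deff <= 0:
        raise ValueError("DEFF must be positive")
    positive = [w for w in weights if w > 0]
    sample_n = len(positive)
    population_total = sum(positive)
    # Step 1: normalized weight = raw weight * (sample n / population N).
    # Step 2: divide the normalized weight by the design effect.
    return [w * sample_n / population_total / deff for w in weights]


raw = [250.0, 150.0, 400.0, 200.0]   # invented weights summing to 1,000
adjusted = deff_adjusted_weights(raw, deff=2.0)

print(adjusted)
print(sum(adjusted))  # approximately sample n / DEFF = 4 / 2
```

The adjusted weights sum to the sample size divided by the DEFF. Software that assumes simple random sampling then behaves as if the sample were that much smaller, which widens its standard errors.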
46. Where to Find ECLS-K DEFFs
- Training material: ECLS-K Specifications for Computing Standard Errors
- ECLS-K user's manuals:
- Base Year (Kindergarten): Table 4.12
- First Grade: Tables 4.13 and 9.4
- Third Grade: Tables 4.14 and 9.2
- Fifth Grade: Tables 4.19 and 9.2
47. For SAS Users
- SAS base procedures such as PROC REG, PROC FREQ, and PROC MEANS do account for the actual sample size but not for complex sampling.
- SAS procedures such as PROC SURVEYMEANS and PROC SURVEYREG (and other procedures that begin with SURVEY) use the Taylor series method to account for complex sampling and provide exact estimates of the standard errors.
48. PROC SURVEYREG Example
- Example using ECLS-K data, spring kindergarten and spring first grade variables.
    /* Regression with Taylor series standard errors */
    proc surveyreg data=fscores;
      model c4r3mscl = c2r3mscl lowkread t4learn;
      cluster c24cpsu;   /* PSU identifier */
      strata c24cstr;    /* stratum identifier */
      weight c24cw0;
      where lowkmath = 0;
    run;
49. PROC SURVEYLOGISTIC Example
- Example using ECLS-K data, spring kindergarten and spring first grade variables.
    /* Logistic regression; (desc) models the higher value of lowkread */
    proc surveylogistic data=fscores;
      model lowkread (desc) = c2r3mscl t4learn;
      cluster c24cpsu;   /* PSU identifier */
      strata c24cstr;    /* stratum identifier */
      weight c24cw0;
      where lowkmath = 0;
    run;
50. PROC SURVEYFREQ Example
- Example using ECLS-K data, spring kindergarten and spring first grade variables.
    /* Weighted frequency tables with design-based standard errors */
    proc surveyfreq data=fscores;
      tables lowkread c2r3mscl t4learn;
      cluster c24cpsu;   /* PSU identifier */
      strata c24cstr;    /* stratum identifier */
      weight c24cw0;
    run;
51. STATA Code for Complex Design
- Logistic Regression Example, 3rd Grade Data
    * Declare the PSU, weight, and strata once, then prefix commands with svy
    svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR)
    svy, subpop(male): logit highbmi white
52. STATA Code for Complex Design
- Regression Example, 3rd Grade Data
    svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR)
    svy, subpop(male): reg highbmi white
53. STATA Code for Complex Design
- Means Example, 3rd Grade Data
    svyset C5CWPSU [pweight=C5CW0], strata(C5TCWSTR)
    svy, subpop(male): mean highbmi female
54. SPSS for Complex Sample Design
- Use the SPSS add-on called SPSS Complex Samples.
- Complex Samples Logistic Regression (CSLOGISTIC): Performs binary logistic regression analysis, as well as multinomial logistic regression (MLR) analysis, for samples drawn by complex sampling methods. The procedure estimates variances by taking into account the sample design used to select the sample, including equal probability and PPS methods, and WR and WOR sampling procedures. Optionally, CSLOGISTIC performs analyses for subpopulations.
- Courtesy of SPSS
55. Regression Analysis
- Use appropriate software such as AM, WESVAR, SUDAAN or SAS (SURVEYREG procedure).
- For SAS (PROC REG procedure), use DEFF-adjusted weights.
- For SPSS, use normalized, DEFF-adjusted weights.
56. Summary
All statistical tests should be based on standard errors that are calculated to account for the complex sample design of the ECLS-K.
- Preferred: Use software that incorporates JK2 replication methods, or
- Use software that incorporates the Taylor series method, or
- Last resort: Make approximate adjustments based on design effects.
57. ECLS-K Data Availability
- Base Year (Kindergarten) through 5th Grade restricted-use and public-use datasets have been released.
- The 8th Grade restricted-use dataset should be released in the winter of 2008, and the public-use datasets should be released in March 2009.
58. Differences in Restricted-Use and Public-Use ECLS-K Datasets
- Here's a short explanation from the NCES: http://nces.ed.gov/ecls/kinderfaq.asp?faq=1
- Chapter 7 in the ECLS-K 5th Grade User's Guide has Tables 7-15 and 7-16 that describe the differences in the public and restricted datasets. The User's Guide can be found online at http://sodapop.pop.psu.edu/codebooks/ecls/k5userpart2.pdf