Action Research Data Manipulation and Crosstabs - PowerPoint PPT Presentation

About This Presentation
Title:

Action Research Data Manipulation and Crosstabs

Description:

Pearson's chi-square χ² (used for nominal or ordinal scale data) ... Is a nonparametric test, a.k.a. the Goodness of Fit statistic ...

Slides: 58
Provided by: gle9

Transcript and Presenter's Notes

Title: Action Research Data Manipulation and Crosstabs


1
Action Research: Data Manipulation and Crosstabs
  • INFO 515
  • Glenn Booker

2
Parametric vs. Nonparametric
  • Statistical tests fall into two broad categories:
    parametric and nonparametric
  • Parametric methods
  • Require data at higher levels of measurement -
    interval and/or ratio scales
  • Are more mathematically powerful than
    nonparametric statistics
  • But often require more assumptions about the
    data, such as having a normal distribution, or
    equal variances

3
Parametric vs. Nonparametric
  • Nonparametric methods
  • Use nominal or ordinal scale data
  • Still allow us to test for a relationship, and
    its strength and direction (direction only if
    ordinal)
  • Often have easier prerequisites for being tested
    (e.g. no distribution limits)
  • Ratio or interval scale data may be recoded to
    become nominal or ordinal data, and hence be used
    with nonparametric tests

4
Significance and Association
  • Both are useful for inferring population values
    from samples (inferential statistics)
  • Significance establishes whether chance can be
    ruled out as the most likely explanation of
    differences
  • Association shows the nature, strength, and/or
    direction of the relationship between two (or
    among three or more) variables
  • Need to show significance before association is
    meaningful

5
Common Tests of Significance
  • We've been introduced to three common tests of
    significance
  • z test (large samples of ratio or interval data)
  • t test (small samples of ratio or interval data)
  • F test (ANOVA)
  • Shortly we'll explore a fourth one
  • Pearson's chi-square χ² (used for nominal or
    ordinal scale data)

χ is the Greek letter chi, pronounced "kye",
rhymes with "rye"
6
Common Measures of Association
  • Association measures often range in value from -1
    to 1 (but not always!)
  • Absence of association between variables
    generally means a result of 0
  • Examples
  • Pearson's r (for interval or ratio scale data)
  • Yule's Q (ordinal data in a 2x2 table)
  • Gamma (ordinal more than 2x2 table)

A 2x2 table has 2 rows and 2 columns of data.
7
Common Measures of Association
  • Notice these are all for nominal scale data
  • Phi (φ, pronounced "fee") (nominal data in a 2x2
    table)
  • Contingency Coefficient (nominal table larger
    than 2x2)
  • Cramér's V (nominal - larger than 2x2)
  • Lambda (λ) - nominal data
  • Eta (η) - nominal data

8
Significance and Association
  • Tests of significance and measures of association
    are often used together
  • But you can have statistical significance without
    having association

9
Significance and Association Examples
  • Ratio data: You might use F to determine if there
    is a significant relationship, then use r from
    a regression to measure its strength
  • Ordinal data: You might run a chi-square to
    determine statistical significance in the
    frequencies of two variables, and then run a
    Yule's Q to show the relationship between the
    variables

10
Crosstabs
  • Brief digression to introduce crosstabs before
    discussing non-parametric methods
  • Crosstabs are a table, often used to display
    data, sorted by two nominal or ordinal variables
    at once, to study the relationship between
    variables that have a small number of possible
    answers each
  • Generally contains basic descriptive statistics,
    such as frequency counts and percentages

11
Crosstabs
  • Used to check the distribution of data, and as a
    foundation for more complex tests
  • Look for gaps or sparse data (little or no
    contribution to the data set)
  • Rule of thumb - put independent variable in the
    columns and dependent variable in the rows

12
Percentages
  • Can show both column and row percentages in
    crosstabs, rather than just frequency counts (or
    show both counts and percentages)
  • Make sure percentages add to 100%!
  • Raw frequency counts of variables don't always
    provide an accurate picture
  • Unequal numbers of subjects in groups (N) might
    make the numbers appear skewed

13
Crosstabs Example
  • Open data set GSS91 political.sav
  • Use Analyze / Descriptive Statistics /
    Crosstabs...
  • Set the Row(s) as region, and the Column(s) as
    relig
  • Note the default scope of an SPSS crosstab is to
    show frequency Counts, with row and column totals

14
Crosstabs Example
15
Crosstabs Example
  • Repeat the same example with percentages selected
    under the Cells button to get detailed data in
    each cell
  • Percent within that region (Row)
  • Percent within that religious preference (Column)
  • Percent of total data set (divide by Total N)
  • Gets a bit messy to show this much!
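
As a rough illustration of the three percentages listed above, here is a minimal Python sketch. The counts are made up for illustration (they are not taken from GSS91 political.sav), and the function name crosstab_percentages is hypothetical.

```python
def crosstab_percentages(counts):
    """Return (row %, column %, total %) tables for a crosstab of counts."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)  # Total N
    # Percent within each row (e.g. within that region)
    row_pct = [[100.0 * c / rt for c in row]
               for row, rt in zip(counts, row_totals)]
    # Percent within each column (e.g. within that religious preference)
    col_pct = [[100.0 * c / ct for c, ct in zip(row, col_totals)]
               for row in counts]
    # Percent of the total data set (divide by Total N)
    tot_pct = [[100.0 * c / n for c in row] for row in counts]
    return row_pct, col_pct, tot_pct


# Hypothetical 2x3 table of counts (2 rows, 3 columns)
counts = [[30, 10, 10],
          [20, 20, 10]]
row_pct, col_pct, tot_pct = crosstab_percentages(counts)
for label, table in (("Row %", row_pct), ("Column %", col_pct),
                     ("Total %", tot_pct)):
    print(label, [[round(v, 1) for v in row] for row in table])
```

Each row of the row-percent table, each column of the column-percent table, and the whole total-percent table should add to 100.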

16
Crosstabs Example
17
Recoding
  • An interval or ratio scaled variable, like age or
    salary, may have too many distinct values to use
    in a crosstab
  • Recoding lets you combine values into a single
    new variable -- also called collapsing the codes
  • Also helpful for creating histogram variables
    (e.g. ranges of age or income)

18
Recoding Example
  • Use Transform / Recode / Into Different
    Variables
  • Move age from the dropdown list for the Numeric
    Variable
  • Define the new Output Variable to have Name
    agegroup and Label Age Group
  • Click Change button to use agegroup
  • Click on Old and New Values button

19
Recoding Example
  • For the Old Value, enter Range of 18 to 30
  • Assign this to a New Value of 1
  • Click on Add
  • Repeat to define ages 31-50 as agegroup New Value
    2, 51-75 as 3, and 76-200 as 4
  • Click Continue and now a new variable exists as
    defined
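
The same recode can be sketched outside SPSS. This is only an illustration in Python using the ranges from this slide; the ages in the list are invented, and returning None for ages outside every range is an assumption meant to mimic a missing value.

```python
# (low, high, new value) ranges taken from the slide
AGE_GROUPS = [(18, 30, 1), (31, 50, 2), (51, 75, 3), (76, 200, 4)]


def recode_age(age):
    """Collapse an age in years into the agegroup code 1-4."""
    for low, high, code in AGE_GROUPS:
        if low <= age <= high:
            return code
    return None  # assumption: ages outside all ranges are treated as missing


ages = [18, 25, 31, 50, 51, 76, 99, 17]  # hypothetical ages
agegroup = [recode_age(a) for a in ages]
print(agegroup)
```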

20
Recoding Example
21
Recoding Example
  • Now generate a crosstab with agegroup as
    columns, and region as the rows

22
Second Recoding Example
  • Prof. Yonker had a previous INFO515 class
    surveyed for their height (in inches) and desired
    salaries ($/yr)
  • Rather than analyze ratio data with few
    frequencies larger than one, she recoded
  • Heights into Dwarves for people below average
    height, and Giants for those above
  • Desired salaries were recoded into Cheap and
    Expensive, again below and above average

23
Second Recoding Example
  • The resulting crosstab was like this

24
Pearson Chi Square Test
  • The Chi Square test measures how much observed
    (actual) frequencies (fo) differ from expected
    frequencies (fe)
  • Is a nonparametric test, a.k.a. the Goodness of
    Fit statistic
  • Does not require assumptions about the shape of
    the population distribution
  • Does not require variables be measured on an
    interval or ratio scale

25
Chi Square Concept
  • Chi Square test is like the ANOVA test
  • ANOVA proved whether there was a difference among
    several means; it proved that the means are
    different from each other in some way
  • Chi square is trying to prove whether the
    frequency distribution is different from a random
    one: is there a significant difference among
    frequencies?
  • Allows us to test for a relationship (but not the
    strength or direction if there is one)

26
Chi Square Null Hypothesis
  • Null hypothesis is that the frequencies in cells
    are independent of each other (there is no
    relationship among them)
  • Each case is independent of every other case;
    that is, the value of the variable for one
    individual does not influence the value for
    another individual
  • Chi Square works better for small sample sizes (<
    hundreds of samples)
  • WARNING: Almost any really large table will have
    a significant chi square

27
Assumptions for Chi Square
  • A random sample is the expected basis for
    comparison
  • Each case can fall into only one cell
  • No zero values are allowed for the observed
    frequency, fo
  • And no expected frequencies, fe, less than one
  • At least 80% of expected frequencies, fe, should
    be greater than or equal to five (5)

28
Expected Frequency
  • The expected frequency for a cell is based on the
    fraction of things which would fall into it
    randomly, given the same general row and column
    count proportions as the actual data set
  • fe = (row total) × (column total) / N
  • So if 90 people live in New England, and 335 are
    in Age Group 1 from a total sample of 1500, then
    we would expect fe = 90 × 335 / 1500 = 20.1 people
    in that cell

See slide 21
29
Expected Frequency
  • So the general formula for the expected frequency
    of a given cell is fe = (actual row total) ×
    (actual column total) / N
  • Notice that this is NOT using the average
    expected frequency for every cell, fe = N /
    ((# of rows) × (# of columns))
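
A small Python sketch of the expected-frequency formula follows. The single-cell figures (90, 335, 1500) are the ones quoted on slide 28; the 2x2 table of counts and the function name expected_frequencies are hypothetical.

```python
def expected_frequencies(counts):
    """fe for every cell: (row total) * (column total) / N."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


# Single cell from slide 28: row total 90, column total 335, N = 1500
fe_cell = 90 * 335 / 1500
print(round(fe_cell, 1))

# Hypothetical 2x2 table of observed counts
counts = [[10, 20],
          [30, 40]]
expected = expected_frequencies(counts)
print(expected)
```

Note that the expected frequencies keep the same row and column totals as the observed table, which is why they differ from cell to cell.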

30
Calculating Chi Square
  • The Chi square value for each cell is the
    observed frequency minus the expected one,
    squared, divided by the expected frequency:
    Chi square per cell = (fo - fe)² / fe
  • Sum this for all cells in the crosstab
  • For the cell on slide 28, the actual frequency
    was 25, so Chi square for that cell is
    (25 - 20.1)² / 20.1 = 1.195
  • Note: Chi square is always positive
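
The per-cell calculation and the sum over all cells can be sketched in Python as below. The cell values 25 and 20.1 come from the slides; the 2x2 table is hypothetical and the function names are illustrative only.

```python
def cell_chi_square(fo, fe):
    """Contribution of one cell: (fo - fe)^2 / fe."""
    return (fo - fe) ** 2 / fe


def chi_square_total(counts):
    """Sum the per-cell chi square over the whole crosstab."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(counts):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n  # expected frequency
            total += cell_chi_square(fo, fe)
    return total


# Cell from slide 28: observed 25, expected 20.1
slide_cell = cell_chi_square(25, 20.1)
print(round(slide_cell, 3))

# Hypothetical 2x2 table of observed counts
total = chi_square_total([[10, 20], [30, 40]])
print(round(total, 4))
```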

31
Calculating Chi Square
  • Page 36/37 of the Action Research handout has an
    example of chi square calculation, where fo is
    the observed (actual) frequency and fe is the
    expected frequency
  • E.g. fe for the first cell is 20 × 30 / 60 = 10.0
  • Chi square for each cell is (fo - fe)² / fe
  • Sum chi square for all cells in the table

No comments about fe fi fo fum! Is that clear?!?!
32
Interpreting Chi Square
  • When the total Chi square is larger than the
    critical value, reject the null hypothesis
  • See Action Research handout page 42/43 for
    critical Chi square (χ²) values
  • Look up critical value using the df value,
    which is based on the number of rows and columns
    in the crosstab: df = (# rows - 1) × (# columns -
    1)
  • For the example on slide 21, df = (9 - 1) × (4 - 1)
    = 8 × 3 = 24

33
Interpreting Chi Square
  • Or you can be lazy and use the old standby
  • If the significance is less than 0.050, reject
    the null hypothesis

34
Chi Square Example
  • Open data set GSS91 political.sav
  • Use Analyze / Descriptive Statistics /
    Crosstabs...
  • Set the Row(s) as region, and the Column(s) as
    agegroup
  • Click on Statistics and select the
    Chi-square test

Notice we're still using the Crosstab command!
35
Chi Square Example
36
Chi Square Example
  • Note that we correctly predicted the df value
    of 24
  • SPSS is ready to warn you if too many cells
    expected a count below five, or had expected
    counts below one
  • The significance is below 0.050, indicating we
    reject the null hypothesis
  • The total Chi square for all cells is 43.260

37
Chi Square Example
  • The critical Chi square value can be looked up on
    page 42/43 of Yonker
  • For df = 24, and significance level 0.050, we get
    a critical Chi square of 36.415
  • Since the actual Chi square (43.260) is greater
    than the critical value (36.415), reject the null
    hypothesis
  • Chi square often shows significance falsely for
    large sample sizes (hence the earlier warning)
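
The table-lookup decision rule from the last few slides can be sketched as follows. The numbers (9 rows, 4 columns, critical value 36.415, total Chi square 43.260) are the ones quoted in the slides; the function names are hypothetical, and the critical value is simply typed in rather than computed.

```python
def degrees_of_freedom(n_rows, n_cols):
    """df for a crosstab: (rows - 1) * (columns - 1)."""
    return (n_rows - 1) * (n_cols - 1)


def reject_null(chi_square_total, critical_value):
    """Reject the null hypothesis when Chi square exceeds the critical value."""
    return chi_square_total > critical_value


df = degrees_of_freedom(9, 4)   # region (9 rows) by agegroup (4 columns)
CRITICAL_DF24_P05 = 36.415      # looked up by hand for df = 24, 0.050 level
decision = reject_null(43.260, CRITICAL_DF24_P05)
print(df, decision)
```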

38
Chi Square Example
  • What are the other tests? They don't apply
    here...
  • The Likelihood Ratio test is specifically for
    log-linear models
  • The Linear-by-Linear Association test is a
    function of Pearson's r, so it only applies to
    interval or ratio scale variables
  • Notice that SPSS doesn't realize those tests
    don't apply, and blindly presents results for
    them

39
One-variable Chi square Test
  • To check only one variable's distribution, there
    is another way to run Chi square
  • Null hypothesis is that the variable is evenly
    distributed across all of its categories
  • Hence all expected frequencies are equal for each
    category, unless you specify otherwise
  • Expected range can also be specified

40
Other Chi square Example
  • Use Analyze / Nonparametric Tests / Chi-square
  • NOT using the Crosstab command here
  • Add region to the Test Variable List
  • Now df is the number of categories in the
    variable, minus one
  • df = (# categories) - 1
  • Significance is interpreted the same
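
A minimal Python sketch of the one-variable test is shown below. It assumes equal expected frequencies unless a list of expected values is supplied, as described on slide 39; the observed counts are invented and the function name is hypothetical.

```python
def one_variable_chi_square(observed, expected=None):
    """Chi square and df for a single categorical variable.

    If expected is None, every category gets the same expected
    frequency N / (# categories), i.e. an even distribution.
    """
    n = sum(observed)
    k = len(observed)
    if expected is None:
        expected = [n / k] * k
    chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
    return chi2, k - 1  # df = (# categories) - 1


# Hypothetical counts for a variable with three categories
chi2, df = one_variable_chi_square([30, 20, 10])
print(chi2, df)
```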

41
Other Chi square Example
42
Other Chi square Example
  • So in this case, the region variable has nine
    categories, for a df of 9 - 1 = 8
  • Critical Chi square for df = 8 is 15.507, so the
    actual value of 290 shows these data are not
    evenly distributed across regions
  • Significance below 0.050 still, in keeping with
    our fine long established tradition, rejects the
    null hypothesis

43
Whodunit?
  • The chi-square value by itself doesn't tell us
    which of the cells are major contributors to the
    statistical significance
  • We compute the standardized residual to address
    that issue
  • This hints at which cells contribute a lot to the
    total chi square

44
Residuals
  • The Residual is the Observed value minus the
    Estimated value for some data point
  • Residual = fo - fe
  • If this variable is evenly distributed, the
    Residuals should have a normal distribution
  • Plots of residuals are sometimes used to check
    data normalcy (i.e. how normal is this data's
    distribution?)

45
Standardized Residual
  • The Standardized Residual is the Residual divided
    by the standard deviation of the residuals
  • When the absolute value of the Standardized
    Residual for a cell is greater than 2, you may
    conclude that it is a major contributor to the
    overall chi-square value
  • Analogous to the original t test, looking for
    |t| > 2
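
Here is a hedged Python sketch of the "greater than 2" rule. It assumes the standardized residual of a crosstab cell is (fo - fe) / sqrt(fe), which is the usual Pearson form; the slides do not spell the formula out, so treat that as an assumption. The 2x2 counts are invented and the function names are hypothetical.

```python
import math


def standardized_residuals(counts):
    """Assumed form: (fo - fe) / sqrt(fe) for every cell of the crosstab."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            out_row.append((fo - fe) / math.sqrt(fe))
        result.append(out_row)
    return result


def major_contributors(counts, threshold=2.0):
    """(row, column) positions whose |standardized residual| > threshold."""
    resid = standardized_residuals(counts)
    return [(i, j)
            for i, row in enumerate(resid)
            for j, value in enumerate(row)
            if abs(value) > threshold]


counts = [[40, 10],
          [20, 30]]  # hypothetical observed counts
flagged = major_contributors(counts)
print(flagged)
```

Squaring each standardized residual under this assumption gives back that cell's contribution to the total chi square, which is why large absolute values point at the major contributors.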

46
Standardized Residual
  • Extreme values of Standardized Residual (e.g.
    minimum, maximum) can also help identify extreme
    data points
  • The meaning of residual is the same for
    regression analysis, BTW, where residuals are an
    optional output

47
Standardized Residual Example
  • In the crosstab region-agegroup example
  • Click Cells and select Standardized Residuals
  • In this case, the worst cell is the combination
    W. Nor. Central region - Age Group 4, which
    produced a standardized residual of 2.1

48
Standardized Residual Example
49
Crosstab Statistics for 2x2 Table
  • 2x2 tables appear so often that many tests have
    been developed specifically for them
  • Equality of proportions
  • McNemar Chi-square
  • Yates Correction
  • Fisher Exact Test

50
Crosstab Statistics for 2x2 Table
  • Equality of proportions tests prove whether the
    proportion of one variable is the same for two
    different values of another variable
  • e.g. Do homeowners vote as often as renters?
  • McNemar Chi-square tests for frequencies in a 2x2
    table where samples are dependent (such as
    pre-test and post-test results)

51
Crosstab Statistics for 2x2 Table
  • Yates Correction for Continuity: chi-square is
    refined for small observed frequencies
  • Chi square = sum of (|fo - fe| - 0.5)² / fe
  • Corrections are too conservative; don't use!
  • Fisher Exact Test assumes row/column
    frequencies remain fixed, and computes all
    possible tables; gives significance value like
    Chi square
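
Both ideas on this slide can be sketched in Python for a 2x2 table. The Yates function assumes the usual form, sum of (|fo - fe| - 0.5)² / fe. The Fisher function enumerates every table with the same row and column totals and adds up the probabilities of tables no more likely than the observed one; that two-sided convention is an assumption, since the slide does not say which one SPSS reports. All counts and function names are hypothetical.

```python
from math import comb


def yates_chi_square(table):
    """Continuity-corrected chi square for a 2x2 table [[a, b], [c, d]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            total += (abs(fo - fe) - 0.5) ** 2 / fe
    return total


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact significance with row/column totals fixed."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Probability of the table whose top-left cell equals x
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    lowest, highest = max(0, c1 - r2), min(r1, c1)
    p_observed = prob(a)
    # Sum over all possible tables that are no more likely than the observed
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


corrected = yates_chi_square([[40, 10], [20, 30]])
p_value = fisher_exact_two_sided([[3, 1], [1, 3]])
print(round(corrected, 3), round(p_value, 4))
```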

52
Nominal Measures of Association
  • Are used to test if each measure is zero (null
    hypothesis) using different scales
  • Phi
  • Cramér's V
  • Contingency Coefficient
  • All three are zero iff Chi square is zero
  • "iff" is mathspeak for "if and only if"

53
Nominal Measures of Association
  • The usual Significance criterion is used for all
    three
  • If significance < 0.050, reject the null
    hypothesis, hence the association is significant
  • Notice that direction is meaningless for nominal
    variables, so only the strength of an
    association can be determined

54
Phi
  • For a 2x2 table, Phi and Cramér's V are equal to
    Pearson's r
  • Phi (φ) can be > 1, making it an unusual measure
    of association
  • Phi = sqrt(Chi square / N)
  • Phi 0 means no association
  • Phi near or over 1 means strong association
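
As a quick check of the 2x2 claim above, this Python sketch computes Phi from Chi square and compares it with Pearson's r for the same table. The counts are invented; the 2x2 shortcut formulas for Chi square and r are standard results but are not taken from the slides, and the function names are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi square for the 2x2 table [[a, b], [c, d]] (shortcut form)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def phi_coefficient(chi_square, n):
    """Phi = sqrt(Chi square / N)."""
    return math.sqrt(chi_square / n)


a, b, c, d = 40, 10, 20, 30  # hypothetical counts
n = a + b + c + d
phi = phi_coefficient(chi_square_2x2(a, b, c, d), n)
# Pearson's r between the two 0/1-coded variables of the same table
r = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(phi, 4), round(abs(r), 4))
```

Phi matches the absolute value of r here, since Phi computed from Chi square carries no sign.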

55
Cramér's V
  • Cramér's V ≤ 1
  • V = sqrt(Chi square / (N × (k - 1))), where k is
    the smaller of the number of columns or rows
  • Is a better measure for tables larger than 2x2
    instead of the Contingency Coefficient
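
A minimal Python sketch of the V formula follows. The Chi square (43.260) and table shape (9 rows by 4 columns) are from the earlier region by agegroup example; N = 1500 is borrowed from the slide 28 wording and is an assumption, since the valid N in the actual SPSS output may differ. The function name is hypothetical.

```python
import math


def cramers_v(chi_square, n, n_rows, n_cols):
    """V = sqrt(Chi square / (N * (k - 1))), k = min(# rows, # columns)."""
    k = min(n_rows, n_cols)
    return math.sqrt(chi_square / (n * (k - 1)))


# Assumed N = 1500; Chi square and table shape as quoted in the slides
v = cramers_v(43.260, 1500, n_rows=9, n_cols=4)
print(round(v, 3))
```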

56
Contingency Coefficient
  • a.k.a. C or Pearson's C or Pearson's Contingency
    Coefficient
  • Most widely used measure based on chi-square
  • Requires only nominal data
  • C has a value of 0 when there is no association

57
Contingency Coefficient
  • The max possible value of C is the square root of
    (the number of columns minus 1, divided by the
    number of columns):
    Cmax = sqrt((# columns - 1) / # columns)
  • C = sqrt(Chi square / (Chi square + N))
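
Both formulas can be sketched in Python as below. The Chi square (43.260) and the 4 columns come from the region by agegroup example; N = 1500 is an assumption borrowed from the slide 28 wording, and the function names are hypothetical.

```python
import math


def contingency_coefficient(chi_square, n):
    """C = sqrt(Chi square / (Chi square + N))."""
    return math.sqrt(chi_square / (chi_square + n))


def max_contingency_coefficient(n_cols):
    """Cmax = sqrt((# columns - 1) / # columns), as given on the slide."""
    return math.sqrt((n_cols - 1) / n_cols)


c_value = contingency_coefficient(43.260, 1500)  # assumed N = 1500
c_max = max_contingency_coefficient(4)
print(round(c_value, 3), round(c_max, 3))
```

Because Chi square + N is always larger than Chi square, C stays below 1, which is why comparing it with Cmax is useful.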