Aggregate Data and Statistics - PowerPoint PPT Presentation

1 / 103
About This Presentation
Title:

Aggregate Data and Statistics

Description:

Building on this example, let's explore aggregation. ... Aggregation involves tabulating a summary statistic across all of the categories ... – PowerPoint PPT presentation

Number of Views:349
Avg rating:3.0/5.0
Slides: 104
Provided by: dmi8
Category:

less

Transcript and Presenter's Notes

Title: Aggregate Data and Statistics


1
Aggregate Data and Statistics
Wendy Watkins Carleton University
Chuck Humphrey University of Alberta
Statistics Canada Data Liberation Initiative
2
Outline
  • What are aggregate data?
  • Why aggregate?
  • How to aggregate?
  • Computing exercise

3
What are aggregate data?
  • Lets start with the relationship between
    statistics and data.

4
Statistics and Data
  • Data
  • numeric files created and organized for analysis
  • requires processing
  • not ready for display
  • Statistics
  • numeric facts/figures
  • created from data, i.e, already processed
  • presentation-ready

5
Statistics and Data
6
(No Transcript)
7
Statistics and Data
8
Statistics and Data
  • In short, statistics are created from data and
    represent summaries of the detail observed in the
    data.

9
What is aggregation?
  • Building on this example, lets explore
    aggregation.
  • We see a table with the number of smokers
    summarized over categories for age, education,
    sex, geography, and different time points.

10
Age and Education are in the background and
display totals.
11
What is aggregation?
  • Aggregation involves tabulating a summary
    statistic across all of the categories or levels
    of a set of variables.

12
The summary statistic
  • The summary statistic in this example is the
    total number of smokers.

13
Variables and categories
  • The variables and their categories are
  • Region (11) Canada and the ten provinces
  • Age (5) Total, 15-19, 20-44, 45-64, 65
  • Sex (3) Total, Female, Male
  • Education (4) Total, Some secondary or less,
    Secondary graduate or more, Not stated
  • Periods (5) 1985, 1989, 1991, 1994-95, 1996-97

14
Variables and categories
  • The tabulation consists of determining the
    combinations of all categories across variables
    and then counting the number of smokers within
    each of these combinations.
  • 11 x 5 x 3 x 4 x 5 3300 category

combinations
15
Tabulating or aggregating
  • One might be wondering if there is a difference
    between tabulating and aggregating.
  • Usually, they are the same thing.

16
Tabulating aggregating
  • In creating tables from data, the variables are
    arranged in various combinations along the
    columns and the rows.

17
Tabulating aggregating
  • Placing multiple variables along the columns or
    rows is called nesting.
  • Tables may have variables nested on both the
    columns and rows.

18
Categories of Sex nested within Periods
19
Categories of Education nested within Sex
Categories of Sex nested within Region
20
A quick summary
  • Up to this point, we have noted that
  • statistics are created from data
  • aggregations consist of tabulating statistics
    within the categories of selected variables
  • variables may be nested within columns and rows
    to display these tabulations

21
What are aggregate data?
  • Q What is the difference between an aggregation
    or tabulation and aggregate data?
  • A The display of the aggregation (that is, the
    structure of the tabulated output).

22
Statistical data structure
  • A statistical data structure is a fixed,
    two-dimensional matrix with the variables in the
    columns and cases in the rows.

V1
V2
V3
V4
V5
V6
V7
Case 1
Case 2
Case 3
Case 4
Case 5
Case 6
Case 7
23
Statistical data structure
  • Aggregate data require the same type of
    statistical data structure.
  • With aggregate data, the variables are the cells
    of a tabulation while the cases are the
    categories of one or more of the tables
    dimensions.

24
What are aggregate data?
  • Here is an example of a tabulation of three
    variables. One variable consists of five levels
    and will be used to represent the cases in an
    aggregate data file. The other two variables
    make up a two-way table and the cells in this
    table will be the variables in the aggregate data
    file.

25
From tabulation to data
Dimension 1
5
4
Dimension 3
3
2
1
Dimension 2
26
From tabulation to data
  • Start with dimension 3 and convert this dimension
    into the rows or cases of an aggregate data file.

27
From tabulation to data
Dimension 1
5
4
Dimension 3
3
2
1
Dimension 2
28
From tabulation to data
Dimension 1
Dimension 2
29
From tabulation to data
Dimension 3
30
From tabulation to data
  • Working with the six cells from the tabulation
    for level 1 of dimension 3, locate these six
    cells as six variables in the new data structure.

31
From tabulation to data
R1C1
Dimension 1
Dimension 2
32
From tabulation to data
R1C1
Dimension 3
33
From tabulation to data
R2C1
Dimension 1
Dimension 2
34
From tabulation to data
R1C1
R2C1
Dimension 3
35
From tabulation to data
Dimension 1
R3C1
Dimension 2
36
From tabulation to data
R1C1
R2C1
R3C1
Dimension 3
37
From tabulation to data
R1C2
Dimension 1
Dimension 2
38
From tabulation to data
R1C2
R1C1
R2C1
R3C1
Dimension 3
39
From tabulation to data
R1C2
Dimension 1
Dimension 2
40
From tabulation to data
R1C2
R2C2
R1C1
R2C1
R3C1
Dimension 3
41
From tabulation to data
Dimension 1
R3C2
Dimension 2
42
From tabulation to data
R1C2
R2C2
R3C2
R1C1
R2C1
R3C1
Dimension 3
43
From tabulation to data
Dimension 1
Dimension 2
44
From tabulation to data
  • Repeat this for level 2 of dimension 3.

45
From tabulation to data
Dimension 1
Dimension 2
46
From tabulation to data
Dimension 3
47
From tabulation to data
Dimension 1
Dimension 2
48
From tabulation to data
  • Now for level 3 of dimension 3, etc.

49
From tabulation to data
Dimension 1
Dimension 2
50
From tabulation to data
Dimension 3
51
From tabulation to data
Dimension 1
Dimension 2
52
From tabulation to data
Dimension 3
53
From tabulation to data
Dimension 1
Dimension 2
54
From tabulation to data
Dimension 3
55
Aggregate data structure
  • The data structure just shown is the one used in
    the aggregate data files for 1991 and earlier
    censuses.

56
Aggregate data structure
  • Another data structure for aggregate files
    consists of records built using nested variables
    along the rows of a table. The categories of
    these nested variables serve as index values.

57
Aggregate data structure
  • This is the data structure for aggregate data
    from the 1996 Census.

58
(11)
(5)
(3)
(4)
(5)
59
Aggregate data structure
  • age by sex by education by periods
  • 5 x 3 x 4 x 5 300 cells
  • For each level of geography (region), there are
    300 cells.

60
Aggregate data structure
  • With the previous data structure, there would be
    one case for each level of region with 300
    variables on each line.

61
Aggregate data structure
  • With a fully indexed aggregate data structure,
    there are 300 lines for each level of region with
    each line representing a combination of the
    values of the remaining nested or index
    variables, i.e., age, sex, education, period.

62
Aggregate data structure
  • In Beyond 20/20, the smokers statistics are
    arranged as the column dimension while region,
    age, sex, education and period nested along the
    rows.

63
(11)
(5)
(3)
(4)
(5)
64
Aggregate data structure
  • Because the original table was not created using
    numeric codes for the categories of the index
    variables, writing the data to a file produces
    the following format. Notice that the categories
    are in string or text format.

65
(No Transcript)
66
Aggregate data structure
  • A recode of these string values to numeric
    codes displays the index structure for this type
    of aggregate data file.

67
(No Transcript)
68
Aggregate Data Structure
  • This is a fully indexed data structure. One row
    in the file represents one cell from the table
    while the index values define the combination of
    categories that define the cell.

69
Aggregate Data Structure
  • A partial index data structure is used by
    Statistics Canada in distributing the 1996
    aggregate data. This kind of structure keeps one
    variable in the column of the table.

70
(No Transcript)
71
(No Transcript)
72
(No Transcript)
73
Another example
  • Lets look at a table that consists of the
    average length of stay in a hospital by sex, age,
    diagnostic chapter, region, and time period.

74
(No Transcript)
75
Variables and categories
  • diagnostic chapter 19 levels
  • sex 3 levels
  • age 6 levels
  • region 13 levels
  • period 28 levels

76
Aggregate data structure
  • The number of category combinations is equal to
  • 13 x 28 x 3 x 6 x 19 124,488 category

combinations
77
Aggregate data structure
  • Lets convert this into a partially indexed
    aggregate data file structure with the categories
    of region making up the cases the column
    variable consisting of the three latest time
    periods and the summary statistic representing
    the average number of days in the hospital.

78
(No Transcript)
79
Aggregate average length of hospital stay in days
  • The aggregate structure consists of three index
    variables (sex, age, and chapter) with 342
    indexed levels or records per category of region.
  • 3 x 6 x 19 342

80
(No Transcript)
81
What are aggregate data?
  • Definition Statistical summaries over
    categorical variables representing social
    phenomena, geography, and time that are organized
    in a statistical data structure.

82
Time series aggregate data
  • When the data structure of the summaries is
    organized around time, these aggregate statistics
    are called a time series.

83
Time Series aggregate data structure
84
Annual Time Series
85
Geo-spatial aggregate data
  • When the data structure of the summaries is
    organized around geography, we recognize these
    aggregations as geo-referenced data.

86
Geo-spatial aggregate data structure
87
Province
Census Divisions
Census Sub-divisions
88
Why aggregate?
  • Statistics Canada creates aggregate statistics
    from its major surveys, including the Census, to
    publish select findings.

89
Why aggregate?
  • The release of aggregate statistics instead of
    the raw data is a way of protecting the
    confidentiality of the respondents and
    safeguarding against disclosure.

90
Why aggregate?
  • Furthermore, the geographic distribution of
    statistics in Canada is important. As a result,
    aggregate statistics are released by Statistics
    Canada for different levels of geography from
    the nation to small areas.

91
Why aggregate?
  • Statistics organized into time series is another
    way in which Statistics Canada publishes a large
    amount of statistical information. These time
    series reflect summaries of data that are
    repeatedly collected over time and permit studies
    about trends and change.

92
Why aggregate?
  • To publish findings
  • To safeguard against disclosure
  • To provide geographic distributions of statistics
  • To present statistics over time

93
Why aggregate?
  • DLI reasons to aggregate
  • To modify geo-referenced statistics for GIS
    applications
  • for example, finding postal codes within their
    corresponding EA and then aggregating data from
    the postal code level up to the EA level

94
Why aggregate?
  • Other reasons to aggregate
  • To change the unit of analysis
  • for the purposes of a specific research question
  • to create a common, higher-level unit of analysis
    that can be used in merging files

95
How does one aggregate?
  • 1. Identify the grouping structure that
    represents all of the variables and their
    categories over which the aggregation is to be
    conducted.
  • This group structure defines a new unit of
    analysis and
  • with an indexed structure, the levels of indexing.

96
How does one aggregate?
  • 2. Establish the sort-order for the grouping
    variables, i.e., decide which variable increments
    the slowest, the next slowest, until you reach
    the variable that changes the fastest.

97
(No Transcript)
98
How does one aggregate?
  • 3. Select the summary statistics, such as sums,
    averages, minimums, maximums, etc.

99
How does one aggregate?
  • As an example, here are the summary statistics
    available in the SPSS Aggregate procedure.

100
How does one aggregate?
  • The actual aggregation is performed using
    statistical software such as SAS or SPSS.
  • SAS offers the Data step and a few procedures
    that can be used to aggregate data, including
    Proc Summary, Proc Tabulate, and Proc Means.

101
How does one aggregate?
  • SPSS, as already shown, provides the Aggregate
    procedure.

102
Time for an exercise
103
(No Transcript)
Write a Comment
User Comments (0)
About PowerShow.com