Title: Aggregate Data and Statistics
1Aggregate Data and Statistics
Wendy Watkins Carleton University
Chuck Humphrey University of Alberta
Statistics Canada Data Liberation Initiative
2Outline
- What are aggregate data?
- Why aggregate?
- How to aggregate?
- Computing exercise
3What are aggregate data?
- Let's start with the relationship between
statistics and data.
4Statistics and Data
- Data
- numeric files created and organized for analysis
- requires processing
- not ready for display
- Statistics
- numeric facts/figures
- created from data, i.e., already processed
- presentation-ready
5Statistics and Data
6(No Transcript)
7Statistics and Data
8Statistics and Data
- In short, statistics are created from data and
represent summaries of the detail observed in the
data.
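As a minimal sketch of this relationship (the records and field names below are invented for illustration and do not come from any Statistics Canada file), a few case-level data records can be reduced to a presentation-ready statistic:

```python
# Hypothetical case-level data: one record per respondent (invented values).
data = [
    {"id": 1, "sex": "Female", "smoker": True},
    {"id": 2, "sex": "Male", "smoker": False},
    {"id": 3, "sex": "Male", "smoker": True},
    {"id": 4, "sex": "Female", "smoker": False},
    {"id": 5, "sex": "Female", "smoker": True},
]

# A statistic summarizes the detail observed in the data: here, a count
# and a percentage.
number_of_smokers = sum(1 for record in data if record["smoker"])
percent_smokers = 100 * number_of_smokers / len(data)

print(f"Smokers: {number_of_smokers} of {len(data)} ({percent_smokers:.0f}%)")
```

The data need processing before they say anything; the printed line is the kind of figure that is ready for display.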
9What is aggregation?
- Building on this example, let's explore
aggregation.
- We see a table with the number of smokers
summarized over categories for age, education,
sex, geography, and different time points.
10Age and Education are in the background and
display totals.
11What is aggregation?
- Aggregation involves tabulating a summary
statistic across all of the categories or levels
of a set of variables.
12The summary statistic
- The summary statistic in this example is the
total number of smokers.
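A small sketch may make the idea concrete. It assumes a hypothetical list of respondent records with only two variables (sex and age) and invented values; the summary statistic, the number of smokers, is tabulated within every combination of categories, including the "Total" categories:

```python
from collections import Counter

# Hypothetical respondent records (invented values, two variables only).
records = [
    {"sex": "Female", "age": "20-44", "smoker": True},
    {"sex": "Female", "age": "20-44", "smoker": True},
    {"sex": "Male", "age": "20-44", "smoker": True},
    {"sex": "Male", "age": "45-64", "smoker": False},
    {"sex": "Female", "age": "45-64", "smoker": True},
]

# Count smokers within each (sex, age) combination, and also within the
# "Total" category of each variable.
smokers = Counter()
for r in records:
    if r["smoker"]:
        smokers[(r["sex"], r["age"])] += 1
        smokers[("Total", r["age"])] += 1
        smokers[(r["sex"], "Total")] += 1
        smokers[("Total", "Total")] += 1

for cell, count in sorted(smokers.items()):
    print(cell, count)
```

Each key of `smokers` is one cell of the tabulation, and the value is the summary statistic for that cell.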
13Variables and categories
- The variables and their categories are:
- Region (11): Canada and the ten provinces
- Age (5): Total, 15-19, 20-44, 45-64, 65+
- Sex (3): Total, Female, Male
- Education (4): Total, Some secondary or less,
Secondary graduate or more, Not stated
- Periods (5): 1985, 1989, 1991, 1994-95, 1996-97
14Variables and categories
- The tabulation consists of determining the
combinations of all categories across variables
and then counting the number of smokers within
each of these combinations.
- 11 x 5 x 3 x 4 x 5 = 3,300 category
combinations
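The count of category combinations can be checked with a short sketch. The category counts come from the slide on variables and categories; the dictionary and variable names are only illustrative:

```python
from itertools import product
from math import prod

# Number of categories for each variable in the smokers table.
categories = {"region": 11, "age": 5, "sex": 3, "education": 4, "period": 5}

# The number of combinations is the product of the category counts.
n_combinations = prod(categories.values())

# The same result, obtained by enumerating every combination explicitly.
cells = list(product(*(range(1, n + 1) for n in categories.values())))

print(n_combinations, len(cells))
```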
15Tabulating or aggregating
- One might be wondering if there is a difference
between tabulating and aggregating.
- Usually, they are the same thing.
16Tabulating aggregating
- In creating tables from data, the variables are
arranged in various combinations along the
columns and the rows.
17Tabulating aggregating
- Placing multiple variables along the columns or
rows is called nesting.
- Tables may have variables nested on both the
columns and rows.
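One way to picture nesting is as the list of column headings it produces. This sketch assumes the Sex and Periods categories from the smokers table and nests Sex within Periods; the variable names are illustrative:

```python
from itertools import product

periods = ["1985", "1989", "1991", "1994-95", "1996-97"]
sexes = ["Total", "Female", "Male"]

# Sex nested within Periods: the outer variable (period) changes slowest,
# and the full set of Sex categories repeats under each period.
nested_columns = list(product(periods, sexes))

for period, sex in nested_columns[:6]:
    print(period, sex)
```

Swapping the two arguments of `product` would nest Periods within Sex instead.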
18Categories of Sex nested within Periods
19Categories of Education nested within Sex
Categories of Sex nested within Region
20A quick summary
- Up to this point, we have noted that:
- statistics are created from data
- aggregations consist of tabulating statistics
within the categories of selected variables
- variables may be nested within columns and rows
to display these tabulations
21What are aggregate data?
- Q: What is the difference between an aggregation
or tabulation and aggregate data?
- A: The display of the aggregation (that is, the
structure of the tabulated output).
22Statistical data structure
- A statistical data structure is a fixed,
two-dimensional matrix with the variables in the
columns and cases in the rows.
         V1  V2  V3  V4  V5  V6  V7
Case 1
Case 2
Case 3
Case 4
Case 5
Case 6
Case 7
23Statistical data structure
- Aggregate data require the same type of
statistical data structure.
- With aggregate data, the variables are the cells
of a tabulation while the cases are the
categories of one or more of the table's
dimensions.
24What are aggregate data?
- Here is an example of a tabulation of three
variables. One variable consists of five levels
and will be used to represent the cases in an
aggregate data file. The other two variables
make up a two-way table and the cells in this
table will be the variables in the aggregate data
file.
25From tabulation to data
(Figure: a three-dimensional tabulation. Dimension 1 and Dimension 2 form a two-way table; Dimension 3 has five levels, numbered 1 to 5.)
26From tabulation to data
- Start with dimension 3 and convert this dimension
into the rows or cases of an aggregate data file.
27From tabulation to data
(Figures, slides 27-29: the same tabulation, then the Dimension 1 by Dimension 2 table shown separately from Dimension 3.)
30From tabulation to data
- Working with the six cells from the tabulation
for level 1 of dimension 3, locate these six
cells as six variables in the new data structure.
31From tabulation to data
(Figure sequence, slides 31-43: the six cells of the Dimension 1 by Dimension 2 table for level 1 of Dimension 3, labeled R1C1, R2C1, R3C1, R1C2, R2C2, and R3C2, are placed one at a time as variables in the new data structure.)
44From tabulation to data
- Repeat this for level 2 of dimension 3.
45From tabulation to data
(Figures, slides 45-47: the Dimension 1 by Dimension 2 table and Dimension 3 for this step.)
48From tabulation to data
- Now for level 3 of dimension 3, etc.
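The conversion walked through on these slides can be sketched as follows. The cell values are invented, and the choice of three rows by two columns for the two-way table is inferred from the six cell labels R1C1 to R3C2; the slides do not say which dimension supplies the rows:

```python
# Hypothetical tabulation with invented values, keyed by
# (row, column, level of dimension 3).
n_rows, n_cols, n_levels = 3, 2, 5
tabulation = {
    (r, c, k): 100 * k + 10 * r + c
    for k in range(1, n_levels + 1)
    for r in range(1, n_rows + 1)
    for c in range(1, n_cols + 1)
}

# The six cells become six variables: R1C1, R2C1, R3C1, R1C2, R2C2, R3C2.
variable_names = [
    f"R{r}C{c}" for c in range(1, n_cols + 1) for r in range(1, n_rows + 1)
]

# Each level of dimension 3 becomes one case (row) of the aggregate file.
aggregate_file = []
for k in range(1, n_levels + 1):
    case = {"level": k}
    for c in range(1, n_cols + 1):
        for r in range(1, n_rows + 1):
            case[f"R{r}C{c}"] = tabulation[(r, c, k)]
    aggregate_file.append(case)

for case in aggregate_file:
    print(case)
```

The result is a fixed matrix of five cases by six variables, the same shape as any other statistical data structure.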
49From tabulation to data
(Figures, slides 49-54: the Dimension 1 by Dimension 2 table and Dimension 3, repeated for the remaining levels.)
55Aggregate data structure
- The data structure just shown is the one used in
the aggregate data files for 1991 and earlier
censuses.
56Aggregate data structure
- Another data structure for aggregate files
consists of records built using nested variables
along the rows of a table. The categories of
these nested variables serve as index values.
57Aggregate data structure
- This is the data structure for aggregate data
from the 1996 Census.
58(Figure: category counts marked on the table's variables: Region (11), Age (5), Sex (3), Education (4), Periods (5).)
59Aggregate data structure
- age by sex by education by periods
- 5 x 3 x 4 x 5 = 300 cells
- For each level of geography (region), there are
300 cells.
60Aggregate data structure
- With the previous data structure, there would be
one case for each level of region with 300
variables on each line.
61Aggregate data structure
- With a fully indexed aggregate data structure,
there are 300 lines for each level of region with
each line representing a combination of the
values of the remaining nested or index
variables, i.e., age, sex, education, period.
62Aggregate data structure
- In Beyond 20/20, the smokers statistics are
arranged as the column dimension, while region,
age, sex, education, and period are nested along the
rows.
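63(Figure: category counts marked on the variables nested along the rows: Region (11), Age (5), Sex (3), Education (4), Periods (5).)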
64Aggregate data structure
- Because the original table was not created using
numeric codes for the categories of the index
variables, writing the data to a file produces
the following format. Notice that the categories
are in string or text format.
65(No Transcript)
66Aggregate data structure
- A recode of these string values to numeric
codes displays the index structure for this type
of aggregate data file.
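A recode of this kind might look like the sketch below. The numeric codes are an assumption (categories numbered from 1 in the order they are listed for the smokers table); the codes used in an actual file may differ:

```python
def make_codebook(labels):
    """Assign numeric codes 1, 2, 3, ... to category labels in listed order."""
    return {label: code for code, label in enumerate(labels, start=1)}

# Category labels for two of the index variables in the smokers table.
sex_codes = make_codebook(["Total", "Female", "Male"])
period_codes = make_codebook(["1985", "1989", "1991", "1994-95", "1996-97"])

# Hypothetical index values as written to the file in string format.
string_rows = [("Female", "1985"), ("Male", "1996-97")]

# The same rows with numeric index values.
numeric_rows = [(sex_codes[sex], period_codes[period]) for sex, period in string_rows]

print(numeric_rows)
```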
67(No Transcript)
68Aggregate Data Structure
- This is a fully indexed data structure. One row
in the file represents one cell from the table
while the index values define the combination of
categories that define the cell.
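As a sketch of a fully indexed file, the rows below are generated from the category counts of the smokers table. The numeric index codes and the empty `smokers` value are placeholders, and the ordering (region slowest, period fastest) is an assumption for illustration:

```python
from itertools import product

# Category counts for the index variables of the smokers table.
n_region, n_age, n_sex, n_education, n_period = 11, 5, 3, 4, 5

# One row per table cell; the index values identify the cell.
rows = [
    {
        "region": region,
        "age": age,
        "sex": sex,
        "education": education,
        "period": period,
        "smokers": None,  # placeholder for the cell's summary statistic
    }
    for region, age, sex, education, period in product(
        range(1, n_region + 1),
        range(1, n_age + 1),
        range(1, n_sex + 1),
        range(1, n_education + 1),
        range(1, n_period + 1),
    )
]

rows_for_region_1 = [row for row in rows if row["region"] == 1]
print(len(rows), len(rows_for_region_1))
```

The file has 3,300 lines in total, 300 for each level of region.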
69Aggregate Data Structure
- A partial index data structure is used by
Statistics Canada in distributing the 1996
aggregate data. This kind of structure keeps one
variable in the column of the table.
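The difference from a fully indexed file can be sketched by moving one index variable into the columns. The values are invented, and keeping period in the columns is only an assumption for illustration; the slide does not say which variable Statistics Canada keeps there:

```python
from collections import defaultdict

# Hypothetical fully indexed rows: (sex code, period code, value).
long_rows = [
    (1, 1, 50), (1, 2, 48),
    (2, 1, 24), (2, 2, 23),
    (3, 1, 26), (3, 2, 25),
]

# Partially indexed: sex stays as an index value on each record, while
# the categories of period become columns of that record.
partial = defaultdict(dict)
for sex, period, value in long_rows:
    partial[sex][f"period_{period}"] = value

for sex, columns in sorted(partial.items()):
    print(sex, columns)
```

Six fully indexed rows become three records, each carrying two value columns.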
70(No Transcript)
71(No Transcript)
72(No Transcript)
73Another example
- Let's look at a table that consists of the
average length of stay in a hospital by sex, age,
diagnostic chapter, region, and time period.
74(No Transcript)
75Variables and categories
- diagnostic chapter: 19 levels
- sex: 3 levels
- age: 6 levels
- region: 13 levels
- period: 28 levels
76Aggregate data structure
- The number of category combinations is:
- 13 x 28 x 3 x 6 x 19 = 124,488 category
combinations
77Aggregate data structure
- Let's convert this into a partially indexed
aggregate data file structure, with the categories
of region making up the cases, the column
variable consisting of the three latest time
periods, and the summary statistic representing
the average number of days in the hospital.
78(No Transcript)
79Aggregate average length of hospital stay in days
- The aggregate structure consists of three index
variables (sex, age, and chapter) with 342
indexed levels or records per category of region.
- 3 x 6 x 19 = 342
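The size of this partially indexed file can be worked out in a short sketch. It assumes one record for each combination of region and index values, with the three time periods as value columns on every record; the names are illustrative:

```python
from math import prod

# Levels of the three index variables.
index_levels = {"sex": 3, "age": 6, "chapter": 19}
n_regions = 13
n_period_columns = 3  # the three latest time periods, kept as columns

records_per_region = prod(index_levels.values())
total_records = n_regions * records_per_region
total_values = total_records * n_period_columns

print(records_per_region, total_records, total_values)
```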
80(No Transcript)
81What are aggregate data?
- Definition: Statistical summaries over
categorical variables representing social
phenomena, geography, and time that are organized
in a statistical data structure.
82Time series aggregate data
- When the data structure of the summaries is
organized around time, these aggregate statistics
are called a time series.
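As a minimal sketch with invented annual figures, organizing summaries around time means ordering them by period, after which change from one period to the next can be read off directly:

```python
# Hypothetical annual summary statistics (invented values), keyed by year.
annual = {1994: 120, 1992: 110, 1993: 115}

# Organize the summaries around time: one observation per year, in order.
series = sorted(annual.items())

# Year-over-year change, a simple use of a time series.
changes = [(y2, v2 - v1) for (y1, v1), (y2, v2) in zip(series, series[1:])]

print(series)
print(changes)
```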
83Time Series aggregate data structure
84Annual Time Series
85Geo-spatial aggregate data
- When the data structure of the summaries is
organized around geography, we recognize these
aggregations as geo-referenced data.
86Geo-spatial aggregate data structure
87Province
Census Divisions
Census Sub-divisions
88Why aggregate?
- Statistics Canada creates aggregate statistics
from its major surveys, including the Census, to
publish select findings.
89Why aggregate?
- The release of aggregate statistics instead of
the raw data is a way of protecting the
confidentiality of the respondents and
safeguarding against disclosure.
90Why aggregate?
- Furthermore, the geographic distribution of
statistics in Canada is important. As a result,
aggregate statistics are released by Statistics
Canada for different levels of geography from
the nation to small areas.
91Why aggregate?
- Statistics organized into time series is another
way in which Statistics Canada publishes a large
amount of statistical information. These time
series reflect summaries of data that are
repeatedly collected over time and permit studies
about trends and change.
92Why aggregate?
- To publish findings
- To safeguard against disclosure
- To provide geographic distributions of statistics
- To present statistics over time
93Why aggregate?
- DLI reasons to aggregate
- To modify geo-referenced statistics for GIS
applications
- for example, finding postal codes within their
corresponding EA (enumeration area) and then aggregating data from
the postal code level up to the EA level
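The postal code example could be sketched as follows. The postal codes, EA identifiers, and counts are all invented, and the lookup table stands in for whatever correspondence file is actually used:

```python
from collections import defaultdict

# Hypothetical lookup: which EA each postal code falls within.
postal_to_ea = {"A1A 1A1": "EA-001", "A1A 1A2": "EA-001", "B2B 2B2": "EA-002"}

# Hypothetical statistic available at the postal code level.
postal_counts = {"A1A 1A1": 12, "A1A 1A2": 8, "B2B 2B2": 20}

# Aggregate from the postal code level up to the EA level.
ea_totals = defaultdict(int)
for postal_code, count in postal_counts.items():
    ea_totals[postal_to_ea[postal_code]] += count

print(dict(ea_totals))
```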
94Why aggregate?
- Other reasons to aggregate
- To change the unit of analysis
- for the purposes of a specific research question
- to create a common, higher-level unit of analysis
that can be used in merging files
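A sketch of both points, using invented person and household records: person-level data are aggregated to the household level, which then serves as the common unit for merging with a household file.

```python
from collections import defaultdict

# Hypothetical person-level file (unit of analysis: the person).
persons = [
    {"household": 1, "income": 30000},
    {"household": 1, "income": 25000},
    {"household": 2, "income": 40000},
]

# Aggregate to a higher-level unit of analysis: the household.
household_income = defaultdict(int)
for person in persons:
    household_income[person["household"]] += person["income"]

# Hypothetical household-level file to merge with.
households = [
    {"household": 1, "region": "Ontario"},
    {"household": 2, "region": "Alberta"},
]

# Merge on the common unit of analysis.
merged = [
    {**h, "total_income": household_income[h["household"]]} for h in households
]

print(merged)
```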
95How does one aggregate?
- 1. Identify the grouping structure that
represents all of the variables and their
categories over which the aggregation is to be
conducted.
- This group structure defines a new unit of
analysis and
- with an indexed structure, the levels of indexing.
96How does one aggregate?
- 2. Establish the sort-order for the grouping
variables, i.e., decide which variable increments
the slowest, the next slowest, until you reach
the variable that changes the fastest.
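The sort order can be expressed as a list of grouping variables from slowest to fastest. In this sketch the rows and codes are invented, and region is assumed to increment slowest and sex fastest:

```python
# Hypothetical rows identified by numeric codes for three grouping variables.
rows = [
    {"region": 2, "age": 1, "sex": 1},
    {"region": 1, "age": 2, "sex": 1},
    {"region": 1, "age": 1, "sex": 2},
    {"region": 1, "age": 1, "sex": 1},
]

# Sort order, listed from the slowest-changing to the fastest-changing variable.
sort_order = ["region", "age", "sex"]

rows_sorted = sorted(rows, key=lambda row: tuple(row[name] for name in sort_order))

for row in rows_sorted:
    print(row)
```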
97(No Transcript)
98How does one aggregate?
- 3. Select the summary statistics, such as sums,
averages, minimums, maximums, etc.
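As an illustration with invented lengths of stay for two hypothetical groups, several summary statistics can be computed for each group at once:

```python
from statistics import mean

# Hypothetical lengths of hospital stay in days, grouped by region.
stays = {"Region A": [3, 5, 4], "Region B": [10, 2]}

# Compute several summary statistics within each group.
summary = {
    group: {
        "n": len(values),
        "sum": sum(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }
    for group, values in stays.items()
}

for group, stats in summary.items():
    print(group, stats)
```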
99How does one aggregate?
- As an example, here are the summary statistics
available in the SPSS Aggregate procedure.
100How does one aggregate?
- The actual aggregation is performed using
statistical software such as SAS or SPSS.
- SAS offers the Data step and a few procedures
that can be used to aggregate data, including
Proc Summary, Proc Tabulate, and Proc Means.
101How does one aggregate?
- SPSS, as already shown, provides the Aggregate
procedure.
102Time for an exercise
103(No Transcript)