Methods for Describing Sets of Data

2003 Pearson Prentice Hall, 2004, 2005 Paul Resnick.

1
Chapter 2
  • Methods for Describing Sets of Data

2
Review
  • Descriptive vs. Inferential Statistics
  • Vocabulary
  • Population
  • (Random, representative) sample
  • Parameter
  • Statistic
  • Data types
  • Data sources

3
Learning Objectives
  • Perform basic data manipulation with Stata
  • Create & Interpret Graphical Displays
  • Analyze Numerical Data Using Summary Measures

4
Data Manipulation with Stata
  • Getting started
  • Data Preparation
  • Data Analysis

5
Getting Started with Stata
  • Install Stata via NAL
  • Start menu > Run
  • nal
  • Double-click on Stata
  • Wait a while
  • Run Stata

6
Running Commands from the command line
  • log using my.log, replace text
  • Tells Stata to start logging the rest of the
    session to a log file
  • set memory 100m
  • Tells Stata to allocate memory for data
  • log close
  • Open the file my.log and see what it says
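
Put together, a logged session might look like the sketch below. The `display` line is only a placeholder I added so the log has something in it, and `type` is used in place of opening the file in an editor. `set memory` is left out here because, as far as I know, recent Stata releases manage memory automatically.

```stata
* Start writing everything that follows to a plain-text log file
log using my.log, replace text

* Placeholder command so the log has some content
display "2 + 2 = " 2 + 2

* Stop logging, then print the log file in the Results window
log close
type my.log
```

Stata commands are case-sensitive, so the closing command has to be typed as `log close`, in lower case.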

7
Running Commands from a do-file
  • Window menu > Do-file Editor
  • File menu > Open
  • T\Fall-2005\544\Public\Stata\SlashDotPrep.do
  • Highlight one or more lines
  • Tools menu > Do Selection

8
About This Dataset
  • From readership logs of the website slashdot.org
  • For each page view
  • User id (recoded to protect privacy)
  • Date/time
  • URL
  • Viewing threshold
  • Display mode

9
Data Preparation
  • Read it in
  • Define labels
  • Recode variables
  • Keep only what you want
  • Save it in Stata format
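
The five steps above can be sketched in a short do-file. This is not the course's SlashDotPrep.do: the three toy rows, the variable names (`uid`, `mode`, `pm`), the label text, and the output file name are all invented for illustration. With a real log file, the `input` block would be replaced by a read command such as `import delimited` (or `insheet` in older releases).

```stata
* 1. Read it in (toy rows typed inline stand in for the real log file)
clear
input uid hour minute second mode
1 10  5 30 1
1 10 20  0 1
2 11  0 15 2
end

* 2. Define labels (the label text is invented for this sketch)
label variable mode "Display mode"
label define modelbl 1 "nested" 2 "flat"
label values mode modelbl

* 3. Recode variables: turn the hour into morning (0) / afternoon (1)
recode hour (0/11 = 0) (12/23 = 1), generate(pm)

* 4. Keep only what you want
keep uid hour minute second pm mode

* 5. Save it in Stata format
save slashdot_toy.dta, replace
```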

10
Data Presentation
11
Presenting Qualitative Data
12
Data Presentation
13
Student Specializations
        sispec |      Freq.     Percent        Cum.
  -------------+-----------------------------------
           HCI |          7       46.67       46.67
          IEMP |          3       20.00       66.67
      tailored |          5       33.33      100.00
  -------------+-----------------------------------
         Total |         15      100.00
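
A table like this comes from `tabulate`. The sketch below rebuilds the 15 responses from the frequencies in the table (the row order and the helper variable `id` are my own choices) and then draws the two chart types used for qualitative data.

```stata
clear
set obs 15

* Rows 1-7 HCI, rows 8-10 IEMP, rows 11-15 tailored (counts from the table)
generate str8 sispec = "HCI"
replace sispec = "IEMP" in 8/10
replace sispec = "tailored" in 11/15

* One-way frequency table: Freq., Percent, Cum.
tabulate sispec

* Pie chart with one slice per category
graph pie, over(sispec)

* Bar chart of category counts (id is just something to count)
generate id = _n
graph bar (count) id, over(sispec)
```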

14
Student Specializations
15
Knowledge
                     summnot |      Freq.     Percent        Cum.
  ---------------------------+-----------------------------------
           Never taught this |          3       20.00       20.00
     Never really learned it |          1        6.67       26.67
        It's been many years |          4       26.67       53.33
                 I know this |          3       20.00       73.33
    Can teach this to others |          4       26.67      100.00
  ---------------------------+-----------------------------------
                       Total |         15      100.00

                       deriv |      Freq.     Percent        Cum.
  ---------------------------+-----------------------------------
           Never taught this |          1        6.67        6.67
     Never really learned it |          2       13.33       20.00
        It's been many years |         10       66.67       86.67
    Can teach this to others |          2       13.33      100.00
  ---------------------------+-----------------------------------
                       Total |         15      100.00

16
Knowledge Cont.
                     meanmed |      Freq.     Percent        Cum.
  ---------------------------+-----------------------------------
     Never really learned it |          1        6.67        6.67
        It's been many years |          3       20.00       26.67
                 I know this |          4       26.67       53.33
    Can teach this to others |          7       46.67      100.00
  ---------------------------+-----------------------------------
                       Total |         15      100.00

                       stdev |      Freq.     Percent        Cum.
  ---------------------------+-----------------------------------
           Never taught this |          3       20.00       20.00
     Never really learned it |          2       13.33       33.33
        It's been many years |          3       20.00       53.33
                 I know this |          4       26.67       80.00
    Can teach this to others |          3       20.00      100.00
  ---------------------------+-----------------------------------

17
Knowledge Cont.
                      cenlim |      Freq.     Percent        Cum.
  ---------------------------+-----------------------------------
           Never taught this |          6       40.00       40.00
     Never really learned it |          2       13.33       53.33
        It's been many years |          6       40.00       93.33
                 I know this |          1        6.67      100.00
  ---------------------------+-----------------------------------
                       Total |         15      100.00

  -> tabulation of reg

                         reg |      Freq.     Percent        Cum.
  ---------------------------+-----------------------------------
           Never taught this |          7       46.67       46.67
     Never really learned it |          5       33.33       80.00
        It's been many years |          1        6.67       86.67
                 I know this |          2       13.33      100.00
  ---------------------------+-----------------------------------
                       Total |         15      100.00

18
Exercises
  • 2.4
  • 2.5
  • 2.15: Which chart type is best for CEO degree
    categories?

19
Stata data analysis
  • File menu > Open
  • T\Fall-2005\544\Public\Stata\SlashDotAnalysis.do
  • Counts
  • Summary tables
  • Bar and pie charts

21
Sort, Generate
  • Sort data
  • sort hour minute second
  • Generate new variable
  • generate totalduration = 60*60*(hour[_N] -
    hour[1]) + 60*(minute[_N] - minute[1]) +
    (second[_N] - second[1])
  • [_N] means the last row
  • [1] means the first row
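
A minimal runnable version of this slide, using three made-up page-view times (10:05:30, 10:20:00 and 11:00:15) entered out of order. The `///` line continuation works in a do-file but not when typing at the command line.

```stata
clear
input hour minute second
10 20  0
11  0 15
10  5 30
end

sort hour minute second

* hour[1] is the first row after sorting, hour[_N] is the last row
generate totalduration = 60*60*(hour[_N] - hour[1]) ///
    + 60*(minute[_N] - minute[1]) ///
    + (second[_N] - second[1])

* Every row gets the same value: 3285 seconds (10:05:30 to 11:00:15)
list
```

Individual terms can be negative (here the minute and second differences are -5 and -15); only the total has to make sense.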

22
Generate Within Groups
  • Group rows by userid
  • Generate within each group
  • sort uid hour minute second
  • by uid: generate duration = 60*60*(hour[_N] -
    hour[1]) + 60*(minute[_N] - minute[1]) +
    (second[_N] - second[1])
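
The same calculation per user can be tried on a toy dataset. The five rows below are invented; user 1 is active from 10:05:30 to 10:45:30 and user 2 from 11:00:15 to 11:30:15.

```stata
clear
input uid hour minute second
2 11  0 15
1 10  5 30
2 11 30 15
1 10 20  0
1 10 45 30
end

sort uid hour minute second

* Inside a by-group, [1] and [_N] mean the first and last row of that group
by uid: generate duration = 60*60*(hour[_N] - hour[1]) ///
    + 60*(minute[_N] - minute[1]) ///
    + (second[_N] - second[1])

* duration is 2400 for every row of user 1 and 1800 for every row of user 2
list, sepby(uid)
```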

23
Egen
  • egen: many useful functions for calculations
    (within groups)
  • Sum, count
  • Count-so-far with the rank function
  • See documentation via help egen
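
A short sketch of the three uses listed above, on invented data where `t` is a time stamp in seconds. The new variable names (`views`, `tsum`, `sofar`) are mine.

```stata
clear
input uid t
1 100
1 250
1 400
2 130
2 900
end

* Count: number of page views per user, repeated on every row of the group
bysort uid: egen views = count(t)

* Sum within group (older releases spell this sum() instead of total())
bysort uid: egen tsum = total(t)

* Count-so-far: rank of each row's time stamp within its user
bysort uid: egen sofar = rank(t)

list, sepby(uid)
```

Unlike `collapse`, `egen` keeps every original row and attaches the group result to each of them.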

24
Collapse
  • Creates one row per group
  • Options specify how to combine multiple rows for
    a group
  • Min
  • Max
  • Count
  • Mean
  • Etc.
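
For illustration, `collapse` applied to a small invented dataset (`t` is a time stamp in seconds; the names on the left of each `=` are new variables I chose):

```stata
clear
input uid t
1 100
1 250
1 400
2 130
2 900
end

* One row per uid; each (stat) says how to combine that group's rows
collapse (min) first = t (max) last = t (count) views = t (mean) avgt = t, by(uid)

* Two rows remain: uid 1 (100, 400, 3, 250) and uid 2 (130, 900, 2, 515)
list
```

`collapse` replaces the dataset in memory, so save the row-level data first if it is still needed.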

25
Presenting Numerical Data
26
Data Presentation
27
Stem in Stata
28
Histogram in Stata
29
Student Age (Reported) Data
  . stem age

  Stem-and-leaf plot for age

    2 | 2345567
    3 | 0122356
    4 |
    5 |
    6 |
    7 |
    8 | 4
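
To reproduce this display, the 15 ages can be read back off the stems and leaves and typed in directly. The histogram options below (bins 10 years wide, counts on the vertical axis) are my choice, not necessarily what the slides used.

```stata
clear
input age
22
23
24
25
25
26
27
30
31
32
32
33
35
36
84
end

* Stem-and-leaf display in the Results window
stem age

* Histogram of the same variable
histogram age, width(10) frequency
```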

30
Histogram
31
Starting Salaries (in K)
  • Fall 04 Class
       3 | 8
       4 | 000025
       5 | 0000
       6 | 0000005
       7 | 5
       8 | 0
  • Fall 05 Class
       4 | 5
       5 | 0000
       6 | 2355
       7 | 5
       8 | 5
       9 |
      10 | 05

32
Summation Notation
  • Exercise 2.43
  • Observations: 3, 8, 4, 5, 3, 4, 6
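
The individual parts of the exercise are not in the transcript. As an illustration of the notation, here are three quantities such exercises typically ask for, computed for the seven observations above with $x_1 = 3, \dots, x_7 = 6$:

```latex
\begin{aligned}
\sum_{i=1}^{7} x_i &= 3 + 8 + 4 + 5 + 3 + 4 + 6 = 33 \\
\sum_{i=1}^{7} x_i^2 &= 9 + 64 + 16 + 25 + 9 + 16 + 36 = 175 \\
\left(\sum_{i=1}^{7} x_i\right)^2 &= 33^2 = 1089
\end{aligned}
```

The sum of squares (175) and the square of the sum (1089) are different quantities, which matters in the shortcut formulas for variance.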

37
Summation with Indexing
39
Numerical Data Properties
40
Thinking Challenge
  • ... employees cite low pay -- most workers earn
    only 20,000. ... President claims average pay is
    70,000!
  • Salaries shown: 400,000; 70,000; 50,000; 30,000; 20,000

41
Standard Notation
  Measure        Sample        Population
  Mean           $\bar{x}$     $\mu$
  Stand. Dev.    $s$           $\sigma$
  Variance       $s^2$         $\sigma^2$
  Size           $n$           $N$

42
Numerical Data Properties
  • Central Tendency (Location)
  • Variation (Dispersion)
  • Shape

43
Numerical Data Properties & Measures
  • Central Tendency: Mean, Median, Mode
  • Variation: Range, Interquartile Range, Variance, Standard Deviation
  • Shape: Skew

44
Central Tendency
45
Numerical Data Properties & Measures
  • Central Tendency: Mean, Median, Mode
  • Variation: Range, Interquartile Range, Variance, Standard Deviation
  • Shape: Skew

46
What's wrong with this median calculation?
  • Measurements: 1, 4, 2, 9, 8
  • Middle measurement is 2, so that's the median
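
The flaw is that the measurements were never put in order. A quick check in Stata (the variable name `x` is arbitrary):

```stata
clear
input x
1
4
2
9
8
end

* The median is the middle value of the ORDERED data: 1 2 4 8 9
sort x
list

* summarize, detail stores the median in r(p50)
summarize x, detail
display "median = " r(p50)
```

The ordered data are 1, 2, 4, 8, 9, so the median is 4, not 2.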

47
Mean
48
Exercise 2.53
  • 18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11
  • Calculate mode, median, mean

49
What if?
  • Replace one of the 18s with 1,118?
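
One way to explore both questions at once, as a sketch: `x` holds the data from Exercise 2.53 and `y` is a copy in which the first 18 has been replaced by 1,118. The variable names are mine.

```stata
clear
input x
18
10
15
13
17
15
12
15
18
16
11
end

* Copy of the data with one of the 18s (row 1) replaced by 1118
generate y = x
replace y = 1118 in 1

* Original data: mode, median, mean
egen xmode = mode(x)
summarize x, detail
display "x: mean = " r(mean) "  median = " r(p50) "  mode = " xmode[1]

* With the outlier: the mean moves a lot, the median does not move
summarize y, detail
display "y: mean = " r(mean) "  median = " r(p50)
```

For the original data the mean is 160/11 (about 14.5) and the median and mode are both 15. With the outlier the mean jumps to 1260/11 (about 114.5) while the median stays at 15.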

50
Exercise 2.55a
  • N = 10,
  • What's the mean?
  • What's the median?

51
  • ... employees cite low pay -- most workers earn
    only 20,000. ... President claims average pay is
    70,000!
  • Salaries shown: 400,000; 70,000; 50,000; 30,000; 20,000

52
Ages
  • Mean = 33
  • Median = 30
      2 | 2345567
      3 | 0122356
      4 |
      5 |
      6 |
      7 |
      8 | 4

53
Summary of Central Tendency Measures
  Measure    Equation                Description
  Mean       $\sum X_i / n$          Balance Point
  Median     $(n + 1)/2$ Position    Middle Value When Ordered
  Mode       none                    Most Frequent

54
Shape
55
Numerical Data Properties & Measures
  • Central Tendency: Mean, Median, Mode
  • Variation: Range, Interquartile Range, Variance, Standard Deviation
  • Shape: Skew

56
Shape
  • 1. Describes How Data Are Distributed
  • 2. Measures of Shape
  • Skew = Symmetry
  • Left-Skewed: Mean < Median < Mode
  • Symmetric: Mean = Median = Mode
  • Right-Skewed: Mode < Median < Mean

57
Exercise 2.62
  • Asked to submit 3 letters
  • Observed mean = 2.28, median = 3, mode = 3
  • Interpret

58
Exercise 2.64 a-d
59
Variation
60
Numerical Data Properties & Measures
  • Central Tendency: Mean, Median, Mode
  • Variation: Range, Interquartile Range, Variance, Standard Deviation
  • Shape: Skew

61
Quartiles
  • 1. Measure of Noncentral Tendency
  • 2. Split Ordered Data into 4 Quarters
  • 3. Position of i-th Quartile

      25%  |  25%  |  25%  |  25%
           Q1      Q2      Q3

  Positioning point of $Q_i = \dfrac{i \cdot (n + 1)}{4}$

62
Ages
  • Range
  • Quartiles
      2 | 2345567
      3 | 0122356
      4 |
      5 |
      6 |
      7 |
      8 | 4

63
Box Plots
64
Age and Salary
  • Age
  • Quartiles: 25, 30, 33
  • Inner fences: 13, 45
  • Outer fences: 1, 57
  • Salary
  • Quartiles: 50K, 63K, 75K
  • Inner fences: ??
  • Outer fences: ??
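
The age fences on this slide are consistent with the usual box-plot convention: inner fences at 1.5 x IQR beyond the quartiles, outer fences at 3 x IQR. Assuming that convention, the arithmetic can be checked, and the salary fences worked out the same way, with a few `display` commands (the scalar names are mine):

```stata
* Age: Q1 = 25, Q3 = 33, so IQR = 8
scalar iqr_age = 33 - 25
display "age inner fences: " 25 - 1.5*iqr_age ", " 33 + 1.5*iqr_age
display "age outer fences: " 25 - 3*iqr_age ", " 33 + 3*iqr_age

* Salary (in K): Q1 = 50, Q3 = 75, so IQR = 25
scalar iqr_sal = 75 - 50
display "salary inner fences: " 50 - 1.5*iqr_sal ", " 75 + 1.5*iqr_sal
display "salary outer fences: " 50 - 3*iqr_sal ", " 75 + 3*iqr_sal
```

The first two lines reproduce 13, 45 and 1, 57 from the slide. A negative lower fence for salary simply means no salary can fall below it.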

65
Box Plots in Stata
  • graph box <varname>

66
Variance & Standard Deviation
  • 1. Measures of Dispersion
  • 2. Most Common Measures
  • 3. Consider How Data Are Distributed
  • 4. Show Variation About Mean ($\bar{X}$ or $\mu$)

  [Figure: number line marked 4, 6, 8, 10, 12 with $\bar{X} = 8.3$]

67
Sample Variance Formula

  $S^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \dfrac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}$

  n - 1 in denominator! (Use N if Population Variance)

68
Equivalent Formula
69
Another Equivalent Formula
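
The formulas on the two "Equivalent Formula" slides were not captured in the transcript. As an assumption about what they showed, two standard rearrangements of the sample variance are given below; both follow from expanding the square and using $\sum X_i = n\bar{X}$.

```latex
S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}
    = \frac{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n - 1}
    = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1}
```

For example, for the measurements 2, 1, 1, 0, 6: $\sum X_i^2 = 42$ and $(\sum X_i)^2 / n = 100/5 = 20$, so $S^2 = (42 - 20)/4 = 5.5$.
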
70
Exercise 2.70
  • What is the primary disadvantage of using the
    range to compare the variability of data sets?

71
Exercise 2.74a
  • Calculate variance and standard deviation

73
Exercise 2.77: Same mean, different variances
  • Using only integers in [0, 10], construct two
    datasets with at least 10 observations each
  • Same mean
  • Different variances

74
Exercise 2.78: Same range, different means
75
Exercise 2.79 (simplified): adding a constant
  • Measurements: 2, 1, 1, 0, 6
  • Mean = 10/5 = 2
  • Variance = (0 + 1 + 1 + 4 + 16)/4 = 22/4 = 5.5
  • Add 3 to each measurement
  • Mean = ?
  • Variance = ?

76
Exercise 2.79 (simplified): adding a constant
  • Measurements: 2, 1, 1, 0, 6
  • Mean = 10/5 = 2
  • Variance = (0 + 1 + 1 + 4 + 16)/4 = 22/4 = 5.5
  • Add 3 to each measurement
  • Mean = 25/5 = 5
  • Variance = ??

77
Why doesn't adding a constant affect variance?
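
A short derivation answers the question. Here $c$ is the added constant and $y_i = x_i + c$ are the shifted measurements; the subscripts on $s_x^2$ and $s_y^2$ are my notation for the two sample variances.

```latex
\begin{aligned}
\bar{y} &= \frac{1}{n}\sum_{i=1}^{n} (x_i + c) = \bar{x} + c \\
y_i - \bar{y} &= (x_i + c) - (\bar{x} + c) = x_i - \bar{x} \\
s_y^2 &= \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}
       = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = s_x^2
\end{aligned}
```

The constant shifts every measurement and the mean by the same amount, so each deviation from the mean is unchanged. For 2, 1, 1, 0, 6 the deviations are 0, -1, -1, -2, 4 both before and after adding 3.
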
78
Stata Measures of Central Tendency
  . summ accesses, detail

                         (max) accesses
  -------------------------------------------------------------
        Percentiles      Smallest
   1%            1              1
   5%            1              1
  10%            1              1       Obs               22049
  25%            1              1       Sum of Wgt.       22049

  50%            2                      Mean           7.827158
                          Largest       Std. Dev.      559.3611
  75%            4            211
  90%            8            219       Variance       312884.9
  95%           12           1799       Skewness        148.346
  99%           26          83038       Kurtosis        22020.4

79
Chebyshev and Empirical Rules: Intuitions
  • Can't all be above average
  • Can't have too many values very far from the mean
  • Or can you?
  • What if half the values are 1000, half are -1000?
  • Can't have too many values very far from the mean
  • "Very far" is measured in standard deviations

80
Chebyshev's Rule Preliminaries
  • Lemma: For any positive variable Y, and any
    constant a,
  • Proof of Lemma
  • For values of Y > a, define Z = a
  • For values of Y < a, define Z = 0
  • Clearly the mean of Y is at least as big as the mean of Z
  • But the mean of Z is just
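
The lemma's inequality and the last step of the proof were images and are missing from the transcript. Assuming the lemma is the usual Markov-type bound, which is what the proof outline builds, it can be written as follows. Here $\bar{Y}$ is the mean of the $n$ values of $Y$, $a > 0$, and $p$ is the fraction of values with $Y \ge a$; these symbols are my notation, not necessarily the slide's.

```latex
\begin{aligned}
&\text{Lemma: } p = \frac{\#\{i : Y_i \ge a\}}{n} \le \frac{\bar{Y}}{a} \\[4pt]
&Z_i = \begin{cases} a & \text{if } Y_i \ge a \\ 0 & \text{if } Y_i < a \end{cases}
  \quad\Rightarrow\quad Z_i \le Y_i \text{ for every } i \\[4pt]
&\bar{Y} \ge \bar{Z} = a \cdot p \quad\Rightarrow\quad p \le \frac{\bar{Y}}{a}
\end{aligned}
```

The step $Z_i \le Y_i$ is where positivity of $Y$ is needed: when $Y_i < a$, $Z_i = 0$ is still no larger than $Y_i$.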

81
Chebyshev's Rule
  • Claim
  • Proof: Let

(From lemma)
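
The claim and the substitution after "Let" are missing from the transcript. The reconstruction below assumes the slide states the standard rule: for any $k > 1$, at most $1/k^2$ of the measurements lie $k$ or more standard deviations from the mean. The choice of $Y$ and $a$ is the usual one and is assumed, not recovered from the slide.

```latex
\begin{aligned}
&\text{Claim: for } k > 1,\quad
  \frac{\#\{i : |x_i - \bar{x}| \ge k s\}}{n} \le \frac{1}{k^2} \\[4pt]
&\text{Let } Y_i = (x_i - \bar{x})^2, \quad a = k^2 s^2,
  \quad\text{so that}\quad |x_i - \bar{x}| \ge k s \iff Y_i \ge a \\[4pt]
&\frac{\#\{i : Y_i \ge a\}}{n} \le \frac{\bar{Y}}{a}
  = \frac{\frac{n-1}{n}\, s^2}{k^2 s^2} \le \frac{1}{k^2}
  \qquad \text{(from lemma)}
\end{aligned}
```

Equivalently, at least $1 - 1/k^2$ of the measurements lie within $k$ standard deviations of the mean: at least 75% for $k = 2$ and at least 8/9 (about 88.9%) for $k = 3$.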
82
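Empirical Rule
  • If x has a symmetric, mound-shaped distribution:
  • About 68% of measurements lie within 1 standard deviation of the mean
  • About 95% lie within 2 standard deviations
  • About 99.7% lie within 3 standard deviations
  • Justification: Known properties of the normal
    distribution, to be studied later in the course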

83
Example
  • Data set has nine 0 values, and one 100
  • Mean = 10, Range = 100
  • s^2 = (9*100 + 1*8100)/9 = 1000, s = 31.62
  • The value 100 lies 90/31.62 = 2.85 s from the mean, so 10% of the values are at a distance > 2s (but not > 3s = 94.9)
  • Chebyshev's rule holds: 10% < 1/4 = 25%
  • Empirical rule violated: 10% > 5%
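
The numbers in this example are easy to check in Stata. The sketch below builds the ten values and reports the sample variance, the standard deviation, and how many standard deviations the value 100 lies from the mean.

```stata
clear
set obs 10

* Nine zeros and a single 100
generate x = 0
replace x = 100 in 10

* summarize stores the mean, variance, and standard deviation in r()
summarize x
display "s^2 = " r(Var) "   s = " r(sd)
display "z-score of 100 = " (100 - r(mean)) / r(sd)
```

The z-score comes out at about 2.85, so the value 100 is more than 2 but slightly less than 3 standard deviations from the mean. Chebyshev's bound for k = 2 (at most 25% beyond 2s) is satisfied by the observed 10%.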

84
Preview of Statistical Inference
  • You observe one data point
  • Make a hypothesis about the mean and standard
    deviation of the distribution from which it was drawn
  • Chebyshev's Rule or the Empirical Rule tells you how
    (un)likely the data point is
  • If very unlikely, you are suspicious of the
    hypothesis about mean and standard deviation, and
    reject it

85
Exercise 2.87
  • N = 200, mean = 1500, s = 300
  • How many measurements in (900, 2100)?
  • How many measurements in (600, 2400)?
  • How many measurements in (1200, 1800)?
  • How many measurements in (1500, 2100)?
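
Each interval is the mean plus or minus a whole number of standard deviations, so both rules can be applied with simple arithmetic. The transcript does not say which rule the exercise intends, so this sketch shows both: Chebyshev's lower bounds, which need no assumption about shape, and empirical-rule approximations, which assume a symmetric, mound-shaped distribution. The scalar name `n_meas` is mine.

```stata
scalar n_meas = 200

* Chebyshev: at least n*(1 - 1/k^2) measurements within k standard deviations
display "(900, 2100),  k = 2: at least " n_meas*(1 - 1/2^2)
display "(600, 2400),  k = 3: at least " n_meas*(1 - 1/3^2)

* Empirical rule (mound-shaped, symmetric data only)
display "(1200, 1800): about " 0.68*n_meas
display "(900, 2100):  about " 0.95*n_meas
display "(600, 2400):  about " 0.997*n_meas

* Half of the 2-standard-deviation interval, using symmetry
display "(1500, 2100): about " 0.95*n_meas/2
```

Chebyshev's rule says nothing useful for k = 1, so without a shape assumption there is no guaranteed minimum count for (1200, 1800).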

86
Summary of Variation Measures
  Measure                            Equation                                           Description
  Range                              $X_{largest} - X_{smallest}$                       Total Spread
  Interquartile Range                $Q_3 - Q_1$                                        Spread of Middle 50%
  Standard Deviation (Sample)        $\sqrt{\sum (X_i - \bar{X})^2 / (n - 1)}$          Dispersion about Sample Mean
  Standard Deviation (Population)    $\sqrt{\sum (X_i - \mu)^2 / N}$                    Dispersion about Population Mean
  Variance (Sample)                  $\sum (X_i - \bar{X})^2 / (n - 1)$                 Squared Dispersion about Sample Mean

87
Z-scores
  • Number of standard deviations from the mean
  • Chebyshev and empirical rules apply
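
As an illustration, z-scores for the reported student ages can be computed with `egen`'s `std()` function, which subtracts the mean and divides by the sample standard deviation. The ages are read off the stem-and-leaf display from earlier in the chapter; the cutoff of 2 is just a convenient threshold for "far from the mean".

```stata
clear
input age
22
23
24
25
25
26
27
30
31
32
32
33
35
36
84
end

* z = (age - mean) / standard deviation
egen z = std(age)

* Show only the observations more than 2 standard deviations from the mean
list age z if abs(z) > 2
```

Only the age of 84 is listed; every other age is within one standard deviation of the mean.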

88
Exercise 2.117c, page 85
89
Scatterplots
90
Misleading With Statistics
  • Bar graphs
  • Stretch the vertical axis
  • Scale break

92
Misleading With Statistics
  • Bar graphs
  • Stretch the vertical axis
  • Scale break
  • Reporting central tendency
  • Medians vs. means
  • Not reporting variance
  • Small samples with reports of relative frequency

93
End of Chapter
Any blank slides that follow are blank
intentionally.