Title: STATISTICS
1STATISTICS
- Summarizing, Visualizing and Understanding Data
2I. Populations, Variables, and Data
3Populations and Samples
- To a statistician, the population is the set or
collection under investigation. Individual
members of the population are not usually of
interest. Rather, investigators try to infer
with some degree of confidence the general
features of the population.
4Examples
- Students currently enrolled at a certain university.
- Registered voters in a certain Congressional district.
- The population of large-mouthed bass in a certain lake.
- The population of all decay times of a radioactive isotope.
5Statistical Inference
- Drawing conclusions about a population from observations on a smaller subset of the population, and quantifying the reliability of those conclusions.
- Sample: the subset observed.
6Variables and Data
- A population variable is a descriptive number or label associated with each member of a population.
- The values of a population variable are the various numbers (or labels) that occur as we consider all the members of the population.
- Values of variables that have been recorded for a population or a sample from a population constitute data.
7Types of Data
- Nominal variables are variables whose values are labels.
- Ordinal variables are variables whose values have a natural order.
- Interval variables have values represented by numbers referring to a scale of measurement.
- Ratio variables have values that are positive numbers on a scale with a unit of measurement and a natural zero point.
8Guess the Type
- Age
- Questionnaire responses: 1 = strongly agree, 2 = agree, ..., 5 = strongly disagree
- Letter grades
- Reading comprehension scores
- Gender
- Zip codes
- Molecular velocities
9II. Summarizing Data
10Location Measures (Measures of Central Tendency)
- A location measure or measure of central
tendency for a variable is a single value or
number that is taken as representing all the
values of the variable. Different location
measures are appropriate for different types of
data.
11The Mean
- For interval or ratio variables x
- N = number of individuals in the sample or population
- $x_i$ = value of x for the ith individual
- Mean of x: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
The mean of a population variable is denoted by $\mu$ (the Greek letter mu).
12The Mean with Repeated Values
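- Distinct values of x: $x'_1, x'_2, \ldots, x'_k$
- $n_j$ = frequency of occurrence of $x'_j$
13The Mean with Repeated Values
- $\bar{x} = \frac{1}{N}\sum_{j=1}^{k} n_j x'_j$, where $N = n_1 + n_2 + \cdots + n_k$
14Example
x'j  -2   1   3   4   6
nj    2   1   3   5   3
Mean = (2(-2) + 1(1) + 3(3) + 5(4) + 3(6)) / 14 = 44/14 ≈ 3.14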
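The frequency form of the mean can be checked numerically. The following is a minimal Python sketch, assuming the two rows of the example are the distinct values and their frequencies; the variable names are illustrative and not part of the slides.

```python
# Assumption: first row = distinct values x'_j, second row = frequencies n_j.
values = [-2, 1, 3, 4, 6]
freqs = [2, 1, 3, 5, 3]

N = sum(freqs)  # total number of observations
# Weighted sum of the distinct values, divided by N.
mean = sum(n * x for x, n in zip(values, freqs)) / N

print(N, round(mean, 4))
```

Expanding the table into 14 individual values and averaging them directly gives the same number, which is the point of the frequency formula.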
15The Median
- Informally, the "middle value" when all the values are arranged in order
- A number m is a median of x if at least half the individuals i in the population have $x_i \le m$ and at least half of them have $x_i \ge m$
16The Median Example 1
- x = 2.0, 1.5, 2.2, 3.1, 5.7 (no repetitions)
- median(x) = 2.2
17The Median Example 2
- x = -2.0, 1.5, 3.1, 3.1, 3.1
- median(x) = 3.1
18The Median Example 3
- x = -2.0, 1.5, 3.1, 5.7, 5.9, 7.1
- median(x) = any number in [3.1, 5.7]
- By convention, for an even number of individuals choose the midpoint between the smallest and largest medians, e.g., (3.1 + 5.7)/2 = 4.4
19Example
- Change 7.1 to 71. What happens to the mean and the median?
- The mean changes from 3.55 to 14.2
- No change in the median
- The median is much less sensitive to outliers (which may be mistakes in recording data)
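The effect of the outlier can be reproduced with Python's standard `statistics` module. This is only a sketch using the six values from Median Example 3; the list names are made up for illustration.

```python
import statistics

# Data from Median Example 3, and the same data with 7.1 mistyped as 71.
data = [-2.0, 1.5, 3.1, 5.7, 5.9, 7.1]
with_outlier = [-2.0, 1.5, 3.1, 5.7, 5.9, 71.0]

for label, values in (("original", data), ("with outlier", with_outlier)):
    # statistics.median uses the midpoint convention for an even count.
    print(label,
          "mean =", round(statistics.mean(values), 2),
          "median =", round(statistics.median(values), 2))
```

Only the largest value changed, so the two middle values (3.1 and 5.7) and therefore the median are unaffected, while the mean is pulled toward the outlier.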
20The Median for Ordered Categories
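Grade      A   A-  B+  B   B-  C+  C   C-  D+  D   D-  F
Frequency  8   5   10  18  18  15  14  6   4   1   1   0
N = 100. The median grade is B-: counting down from A, the cumulative frequency first reaches 50 at B- (8 + 5 + 10 + 18 + 18 = 59).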
21The Mode
- The data value with the greatest frequency
- Not useful for interval or ordinal data if recorded with precision
- The only useful location measure for strictly nominal data
22Example
Grade      A   A-  B+  B   B-  C+  C   C-  D+  D   D-  F
Frequency  8   5   10  18  18  15  14  6   4   1   1   0
The modes are B and B- (frequency 18 each).
23Cumulative Frequencies and Percentiles
- x is an interval or ratio variable.
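- Ordered distinct values: $x'_1 < x'_2 < \cdots < x'_k$
- Relative frequencies: $f_j = n_j / N$
24Cumulative Frequencies and Percentiles
- Cumulative frequencies: $N_j = n_1 + n_2 + \cdots + n_j$
- Cumulative relative frequencies: $F_j = f_1 + f_2 + \cdots + f_j = N_j / N$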
25The Weather Person's Prediction Errors x
x'j   -2      1      3      4      6
nj     2      1      3      5      3
Nj     2      3      6     11     14
fj   .1429  .0714  .2143  .3571  .2143
Fj   .1429  .2143  .4286  .7857  1.000
26Exercise
- From the table above, what fraction of the data
is less than 1? What fraction is greater than 3?
What fraction is greater than or equal to 3?
27Percentiles
- x is an interval or ratio variable.
- A number a is a pth percentile of x if at least p% of the values of x are less than or equal to a and at least (100-p)% of the values of x are greater than or equal to a.
- The 25th percentile is called the first quartile of x and the 75th percentile is the third quartile of x.
- The 50th percentile is the second quartile or median.
28Example
- For the weather person's errors, the 25th percentile is 3. The 50th percentile and third quartile are both 4.
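The percentile definition can be applied mechanically. The sketch below is one possible Python rendering, assuming the weather person's 14 errors are expanded from the frequency table; the function name `percentiles` is hypothetical, and for simplicity it only tests observed data values as candidates (the definition also allows values between two observations).

```python
def percentiles(data, p):
    """Return the observed values that qualify as a pth percentile of data."""
    n = len(data)
    result = []
    for a in sorted(set(data)):
        at_most = sum(1 for x in data if x <= a)   # values <= a
        at_least = sum(1 for x in data if x >= a)  # values >= a
        # At least p% are <= a and at least (100 - p)% are >= a.
        if 100 * at_most >= p * n and 100 * at_least >= (100 - p) * n:
            result.append(a)
    return result

# Errors expanded from the table: value repeated by its frequency.
errors = [-2] * 2 + [1] * 1 + [3] * 3 + [4] * 5 + [6] * 3

for p in (25, 50, 75):
    print(p, percentiles(errors, p))
```

For these data the function returns `[3]`, `[4]` and `[4]` for p = 25, 50 and 75, matching the example.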
29Measures of Variability
- Statisticians are not only interested in
describing the values of a variable by a single
measure of location. They also want to describe
how much the values of the variable are dispersed
about that location.
30Population Variance and Standard Deviation
- x is an interval or ratio variable.
- N = number of individuals in the population.
- Variance of x: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
- Standard deviation of x: $\sigma = \sqrt{\sigma^2}$
31Sample Variance and Standard Deviation
- n = the number of individuals in a sample from a population
- Sample variance: $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
- Sample standard deviation: $S = \sqrt{S^2}$
32Alternative Formulas for the Variance
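- Using frequencies: $\sigma^2 = \frac{1}{N}\sum_{j=1}^{k} n_j (x'_j - \mu)^2$
- Using relative frequencies: $\sigma^2 = \sum_{j=1}^{k} f_j (x'_j - \mu)^2$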
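To see that the alternative formulas agree with the basic definition, the variance can be computed three ways. This Python sketch treats the weather person's 14 errors as a whole population purely for illustration (an assumption, not something the slides state), so the divisor is N rather than N - 1.

```python
# Distinct values and frequencies from the weather person's errors table.
values = [-2, 1, 3, 4, 6]
freqs = [2, 1, 3, 5, 3]
N = sum(freqs)

# Expand to individual observations and compute the mean.
data = [x for x, n in zip(values, freqs) for _ in range(n)]
mu = sum(data) / N

# 1. Definition: average squared deviation over all individuals.
var_direct = sum((x - mu) ** 2 for x in data) / N
# 2. Using frequencies n_j.
var_freq = sum(n * (x - mu) ** 2 for x, n in zip(values, freqs)) / N
# 3. Using relative frequencies f_j = n_j / N.
var_rel = sum((n / N) * (x - mu) ** 2 for x, n in zip(values, freqs))

print(round(var_direct, 4), round(var_freq, 4), round(var_rel, 4))
```

All three expressions give the same value up to floating-point rounding; the frequency forms simply group equal terms.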
33The Interquartile Range
- Q1, Q3 = 1st and 3rd quartiles, respectively
- Interquartile range: IQR = Q3 - Q1
- Not influenced by a few extremely large or small observations (outliers)
34The Range
- The difference between the largest data value and the smallest
- Range of sample values is not a reliable indicator of the range of a population variable
35III. Graphical Methods
36Pie Charts (Circle Graphs)
Sources: AT&T (1961), The World's Telephones; R: A language and environment for statistical computing, the R Core Development Team.
37Bar Charts (Bar Graphs)
38Pros and Cons
- Bar chart has a scale of measurement: more precise information
- Pie chart gives a more vivid impression of relative proportions, e.g., it is obvious at a glance that N. America had more than half the telephones in the world.
39Stemplots (Stem and Leaf Diagrams)
Stem  Leaves                 Cumulative Frequency
4     7                       1
5     448889                  7
6     34789                  12
7     012234455666888889999  33
8     0022234457799          46
9     0457                   50
Grades of 50 students on a test
40Find the Median
Stem  Leaves                 Cumulative Frequency
4     7                       1
5     448889                  7
6     34789                  12
7     012234455666888889999  33
8     0022234457799          46
9     0457                   50
The 25th and 26th leaves are circled (both are 8 on stem 7). Median = 78.
41Exercise
Stem  Leaves                 Cumulative Frequency
4     7                       1
5     448889                  7
6     34789                  12
7     012234455666888889999  33
8     0022234457799          46
9     0457                   50
The 1st quartile is 70 and the 3rd quartile is 82.
42Boxplots (Box and Whisker Diagrams)
43Elements of a Boxplot
(Figure: boxplot with labeled elements: largest, outlier, box, whisker, quartiles, median.)
44Boxplot Shows Distribution Skewed to the Left
45Histograms
- For interval or ratio data
- Data is grouped into class intervals
- Superficially like a bar chart
46Frequency Histogram
(Figure: frequency histogram. Height of bar = bin frequency; horizontal axis = class interval (bin).)
Source: R: A language and environment for statistical computing, the R Core Development Team.
47Probability Histogram
Area of bar = relative bin frequency. E.g., .011 × 25 = .275
48Ogives (Cumulative Frequency Polygons)
- Related to probability histograms
- Examples of cumulative distribution functions
- Probability histograms are examples of density
functions
49Example Ogive
50Relationship Between Probability Histogram and
Ogive
- The height of the ogive is the cumulative area
under the histogram
51Estimating Percentiles from Ogives
- Horizontal line has height .75
- Vertical line intersects horizontal axis at 60
- Estimated 3rd quartile is 60
- True 3rd quartile is 62
52Scatterplots (Scatter Diagrams)
- Used for jointly observed interval or ratio variables
- Example: Heights and weights of individuals
- Example: State per capita spending on secondary education and state crime rate
- Example: Wind speed and ozone concentration
53Example Scatterplot
(Figure: scatterplot with the centroid marked.)
54Fitting a Line
- Relationship between variables x and y is approximately linear.
- Approximately, y = a + bx.
- Find a and b so that the data come closest to satisfying the equation.
- Least squares: a formal mathematical technique to be shown later.
55Line Fitted by Least Squares
56IV. Sampling
57Why Sample?
- Because the population is too large to observe all its members.
- The population may be partly inaccessible.
- The population may even be hypothetical.
58Statistical Inference
- Drawing conclusions about the population based on observations of a sample.
- Reliability of inferences must be quantifiable.
- Random sampling allows probability statements about the accuracy of inferences.
59Sampling With Replacement
- Population has N members.
- n population members chosen sequentially.
- Once chosen, a member of the population may be chosen again.
- At each stage, all members of the population are equally likely to be chosen.
- Random experiment with $N^n$ possible equally likely outcomes.
60Sampling With Replacement (continued)
- x is a population variable.
- $X_1$ = value of x for the 1st sampled individual, $X_2$ = value of x for the 2nd sampled individual, etc.
- Each $X_i$ is a random variable. The random variables are independent.
- The sequence $X_1, X_2, \ldots, X_n$ is a random sample of values of x, or a random sample from the distribution of x.
61Sampling Without Replacement
- Population has N individuals.
- n members chosen sequentially.
- Once chosen, an individual may not be chosen again.
- At each stage, all of the remaining members are equally likely to be chosen next.
- Random experiment with $N(N-1)\cdots(N-n+1) = \frac{N!}{(N-n)!}$ possible equally likely outcomes.
62Sampling Without Replacement (continued)
- Sample without replacement.
- Ignore the order of the sequence of individuals in the sample.
- Random experiment whose outcomes are subsets of size n.
- Experiment has $\binom{N}{n} = \frac{N!}{n!\,(N-n)!}$ possible equally likely outcomes.
- Common meaning of "random sample of size n"
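The number of equally likely outcomes under each sampling scheme follows the standard counting rules: $N^n$ ordered samples with replacement, $N!/(N-n)!$ ordered samples without replacement, and $\binom{N}{n}$ unordered samples without replacement. A short Python sketch, with N = 30 and n = 10 chosen only as illustrative values:

```python
import math

N, n = 30, 10  # illustrative population and sample sizes (assumed)

with_replacement = N ** n               # ordered, repeats allowed
ordered_without = math.perm(N, n)       # ordered, no repeats: N!/(N-n)!
unordered_without = math.comb(N, n)     # subsets of size n: N choose n

print(with_replacement)
print(ordered_without)
print(unordered_without)
```

Each subset of size n corresponds to n! different orderings, which is why the last two counts differ by exactly that factor.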
63Random Number Generators
- Calculators and spreadsheet programs can generate pseudorandom sequences.
- Press the random number key of your calculator several times.
- Simulates a random sample with replacement from the set of numbers between 0 and 1 (to high precision).
64Generating a Sample with Replacement
- Number the individuals from 1 to N.
- Generate a pseudorandom number R.
- Include individual i in the sample if $i - 1 \le NR < i$.
- Repeat n times. Individuals may be included more than once.
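The procedure can be sketched in Python as follows. It assumes the pseudorandom number R lies in [0, 1) and that individual i is selected when N·R falls in the ith unit interval, i.e. i = floor(N·R) + 1; the function name `sample_with_replacement` is made up for this illustration.

```python
import math
import random

def sample_with_replacement(N, n, rng=random):
    """Draw n labels from 1..N, each equally likely at every draw."""
    sample = []
    for _ in range(n):
        R = rng.random()            # pseudorandom number in [0, 1)
        i = math.floor(N * R) + 1   # the i with i - 1 <= N*R < i
        sample.append(i)
    return sample

random.seed(0)  # fixed seed only so that reruns give the same sample
print(sample_with_replacement(30, 10))
```

For sampling without replacement, one simple variation is to repeat the draw whenever the selected individual is already in the sample.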
65Exercise
- Suppose you have 30 students in your class.
Use the procedure just described to obtain a
sample of size 10 (a) with replacement, (b)
without replacement.
66V. Estimation
67The Sample Mean and Standard Deviation
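- $X_1, X_2, \ldots, X_n$ is a random sample from the distribution of a population variable x.
- The sample mean is $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
- The sample variance is $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$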
68The Sample Mean and Standard Deviation (continued)
- The sample standard deviation is $S = \sqrt{S^2}$
- The sample mean, variance and standard
deviation are all random variables because they
depend on the outcome of the random sampling
experiment.
69Estimators
- The sample mean, variance, and standard deviation have distributions derived from the distribution of values of the population variable x.
- They are estimators of the population mean $\mu$, the population variance $\sigma^2$, and the population standard deviation $\sigma$ of x.
70Unbiased Estimators
- The theoretical expected values of the sample mean and sample variance are equal to their population counterparts, i.e., $E(\bar{X}) = \mu$ and $E(S^2) = \sigma^2$.
- $\bar{X}$ and $S^2$ are said to be unbiased estimators of $\mu$ and $\sigma^2$, respectively.
- S is biased.
71The Distribution of the Random Variable $\bar{X}$
- The mean of $\bar{X}$ is $\mu$, the same as the mean of the population variable x.
- The standard deviation of $\bar{X}$ is $\sigma/\sqrt{n}$.
- These are the theoretical mean and standard deviation.
72Density Functions
- A density function is a nonnegative function such that the total area between the graph of the function and the horizontal axis is 1.
- A probability histogram is a density function.
- Other density functions are limits of histograms as the number of data elements grows without bound.
73The Standard Normal Density Function
74Percentiles of the Standard Normal Distribution
$z_\alpha$ is the 100(1-$\alpha$)th percentile of the distribution
75Symmetry About the Vertical Axis
76Probabilities Related to the Standard Normal
Distribution
77Other Normal Distributions
- Let Z be a random variable with the standard normal distribution.
- The mean of Z is 0 and the standard deviation of Z is 1.
- Let $\mu$ and $\sigma$ be any numbers, $\sigma > 0$.
- Let $Y = \sigma Z + \mu$.
- Y has the normal distribution with mean $\mu$ and standard deviation $\sigma$.
78Other Normal Distributions Example
$\mu = 1$ and $\sigma = 1.5$
79Standardizing: The Inverse Operation
- Let Y be normally distributed with mean $\mu$ and standard deviation $\sigma$.
- Let $Z = \frac{Y - \mu}{\sigma}$. This is the z-score of Y.
- Then Z has the standard normal distribution and $Y = \sigma Z + \mu$.
80The Central Limit Theorem
- Let $\bar{X}$ be the sample average of a random sample of n values of a population variable x.
- The population variable x has mean $\mu$ and standard deviation $\sigma$.
- Standardize $\bar{X}$ by subtracting its mean and dividing by its standard deviation: $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
81The Central Limit Theorem (continued)
- Get Ready for the Central Limit Theorem!
82The Central Limit Theorem(continued)
- The Central Limit Theorem
- As the sample size n grows without bound, the
distribution of Z approaches the standard
normal distribution. This is true no matter what
the distribution of values of the population
variable x.
83Another Statement of the CLT
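- For sufficiently large sample sizes n and for all numbers a and b with a < b, $P(a < Z < b)$ is approximately the area under the standard normal density between a and b.
- In almost all applications, n = 50 is large enough.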
84The CLT in Action
- Sample n = 30 from the population variable COUNTS, whose distribution is tabulated below. Calculate the sample average. Repeat this 500 times and construct a histogram of the z-scores of the 500 sample averages. Note: The distribution of COUNTS is very far from normal.
x'j   0    1    2    3    4    5    6
fj   .36  .33  .19  .08  .02  .01  .01
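The experiment described above can be simulated directly. The sketch below is one way to do it in Python with the standard library only; the seed, the variable names, and the crude text histogram (one bin per unit of z) are choices made for this illustration and are not taken from the slides.

```python
import math
import random
import statistics

# Tabulated distribution of COUNTS: values and relative frequencies.
values = [0, 1, 2, 3, 4, 5, 6]
rel_freqs = [0.36, 0.33, 0.19, 0.08, 0.02, 0.01, 0.01]

# Population mean and standard deviation from the relative frequencies.
mu = sum(f * x for x, f in zip(values, rel_freqs))
sigma = math.sqrt(sum(f * (x - mu) ** 2 for x, f in zip(values, rel_freqs)))

random.seed(1)  # arbitrary seed, for reproducibility only
n, repetitions = 30, 500
z_scores = []
for _ in range(repetitions):
    # Sampling with replacement according to the relative frequencies.
    sample = random.choices(values, weights=rel_freqs, k=n)
    x_bar = statistics.mean(sample)
    z_scores.append((x_bar - mu) / (sigma / math.sqrt(n)))

# Crude text histogram of the 500 z-scores, one '#' per 5 samples.
bins = {}
for z in z_scores:
    b = math.floor(z)
    bins[b] = bins.get(b, 0) + 1
for b in sorted(bins):
    print(f"[{b:>2}, {b + 1:>2}) {'#' * (bins[b] // 5)}")
```

Although COUNTS itself is strongly skewed, the printed histogram should look roughly bell-shaped and centered near 0, as the Central Limit Theorem suggests.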
85Distribution of COUNTS
86Result-500 Averages of 30 Samples from COUNTS
87Estimating a Population Mean
- The sample mean $\bar{X}$ is an unbiased estimator of the population mean $\mu$.
- For large sample sizes n, $\bar{X}$ has approximately a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.
- For large n, the sample mean is an accurate estimator of the population mean with high probability.
88Example
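- Suppose $\sigma = 2$ and we want to estimate $\mu$ with an error no greater than 0.05.
- Assume $\bar{X}$ is exactly normally distributed. Standardize: $P(|\bar{X} - \mu| \le 0.05) = P\left(|Z| \le \frac{0.05\sqrt{n}}{2}\right)$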
89Probabilities of 1-Place Accuracy, $\sigma = 2$
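Under the stated assumption that the sample mean is exactly normal with standard deviation σ/√n, the probability that the error is at most 0.05 equals the standard normal probability of |Z| ≤ 0.05√n/σ. A Python sketch that tabulates this for a few sample sizes; the sample sizes and the helper name `prob_within` are illustrative choices, not values from the slides.

```python
import math
from statistics import NormalDist

sigma = 2.0        # population standard deviation from the example
max_error = 0.05   # "1-place accuracy": error no greater than 0.05
std_normal = NormalDist()  # standard normal: mean 0, sd 1

def prob_within(n):
    """P(|sample mean - population mean| <= max_error) for sample size n."""
    z = max_error * math.sqrt(n) / sigma
    return std_normal.cdf(z) - std_normal.cdf(-z)

for n in (100, 1600, 6400, 25600):  # illustrative sample sizes
    print(n, round(prob_within(n), 4))
```

With σ = 2, the z bound is 1 at n = 1600 and 2 at n = 6400, so quite large samples are needed before one-decimal-place accuracy becomes likely.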
90Confidence Intervals for the Population Mean
Review of
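91 100(1-α)% Confidence Interval
- By the CLT, $P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) \approx 1 - \alpha$
- Rearranging the inequalities, $P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \approx 1 - \alpha$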
92A Difficulty
- $\sigma$ is probably unknown, so the confidence interval $\bar{X} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ can't be used. What to do?
93Enhanced Central Limit Theorem
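- Define the modified z-score for $\bar{X}$ as $Z = \frac{\bar{X} - \mu}{S/\sqrt{n}}$, with the sample standard deviation S in place of $\sigma$.
- As n grows without bound, the distribution of Z approaches the standard normal distribution.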
94A More Useful Confidence Interval
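- By the enhanced CLT, $P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{S/\sqrt{n}} < z_{\alpha/2}\right) \approx 1 - \alpha$
- An approximate 100(1-α)% confidence interval is $\bar{X} \pm z_{\alpha/2}\,\frac{S}{\sqrt{n}}$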
95Example
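- n = 50 from COUNTS ($\mu = 1.14$)
- $\bar{X} = 1.32$
- S = 1.39
- $1 - \alpha = .95$
- $1.32 \pm 1.96(1.39)/\sqrt{50} = 1.32 \pm 0.39$
- 95% confidence interval: (0.93, 1.71)
- Don't say $.95 = P(0.93 < \mu < 1.71)$. Here $\mu$ is a fixed number; it is the interval that is random.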
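The interval in this example can be reproduced from the summary numbers. A minimal Python sketch, assuming the interval has the form sample mean ± z·S/√n with z the upper 2.5% point of the standard normal distribution; the variable names are illustrative.

```python
import math
from statistics import NormalDist

n, x_bar, s = 50, 1.32, 1.39   # summary statistics from the example
confidence = 0.95
alpha = 1 - confidence

# Upper alpha/2 point of the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf(1 - alpha / 2)
margin = z * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

print(round(margin, 2), (round(lower, 2), round(upper, 2)))
```

Rounded to two places this gives a margin of 0.39 and the interval (0.93, 1.71), which in this instance happens to contain the true mean 1.14.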
96Confidence Intervals for Proportions
- x is a population variable with only two values, 0 and 1.
- Numerical code for two mutually exclusive categories, e.g., "male" and "female", or "approves" and "disapproves".
- p = relative frequency of x = 1.
- $\mu = p$, $\sigma^2 = p(1-p)$
97Confidence Intervals for Proportions (continued)
- Sample n values of x, with replacement. Result is a sequence of 1's and 0's.
- Sample mean is the relative frequency in the sample of 1's, e.g., the relative frequency of females in the sample of individuals.
- Denote the sample mean by $\hat{p}$ since it is an estimator of p.
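98Confidence Intervals for Proportions (continued)
- By the enhanced CLT, $\frac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}}$ is approximately standard normal.
- An approximate 100(1-α)% confidence interval is $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$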
99Example
- A public opinion research organization polled 1000 randomly selected state residents. Of these, 413 said they would vote for a 1% sales tax increase dedicated to funding higher education. Find a 90% confidence interval for the proportion of all voters who would vote for such a proposal.
100Solution
- n = 1000
- $\hat{p} = 413/1000 = 0.413$
- $1 - \alpha = .90$
- $0.413 \pm 1.645\sqrt{0.413(0.587)/1000} = 0.413 \pm 0.026$
- (0.387, 0.439)
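The arithmetic of this solution can be checked with a few lines of Python. The sketch assumes the usual large-sample interval, sample proportion ± z·√(p̂(1 − p̂)/n), with z the upper 5% point of the standard normal distribution for 90% confidence; names such as `p_hat` are illustrative.

```python
import math
from statistics import NormalDist

n, successes = 1000, 413
p_hat = successes / n          # sample proportion, 0.413
confidence = 0.90
alpha = 1 - confidence

# Upper alpha/2 point of the standard normal distribution (about 1.645).
z = NormalDist().inv_cdf(1 - alpha / 2)
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

print(round(margin, 3), (round(lower, 3), round(upper, 3)))
```

Evaluating the formula gives a margin of about 0.026 and an interval of roughly (0.387, 0.439).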
101Linear Regression and Correlation
- x and y are jointly observed numeric variables, i.e., defined for the same population or arising from the same experiment.
- Have observations for n individuals or outcomes.
- Data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
102Examples
- (An observational study) Let x be the height and y the weight of individuals from a human population.
- (A designed experiment) Let x be the amount of fertilizer applied to a plot of cotton seedlings and let y be the weight of raw cotton harvested at maturity.
103Data on Fertilizer and Cotton Yield
x 2 2 2 4 4 4 6 6 6 8 8 8
y 2.3 2.2 2.2 2.5 2.9 2.7 3.4 2.7 3.4 3.5 3.4 3.3
104Scatterplot of Fertilizer vs. Yield
105Assumptions of Linear Regression
- There is a population or distribution of values of y for any particular value of x.
- There are unknown constants a and b so that for any particular value of x, the mean of all the corresponding values of y is a + bx.
- The standard deviation $\sigma$ of the values of y corresponding to a value of x is the same for all values of x.
106The Method of Least Squares
- Estimate a and b by choosing them to minimize the sum of squared differences between the observed values $y_i$ and their putative expected values $a + bx_i$.
- In symbols, minimize $\sum_{i=1}^{n}(y_i - a - bx_i)^2$
107The Least Squares Estimates
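- Let $\bar{x}$ and $\bar{y}$ be the means of the observed x's and y's. Let $S_x^2$ be the sample variance of the x's.
- The covariance between the x's and the y's is $S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
- The least squares estimate of the slope is $\hat{b} = S_{xy}/S_x^2$
- The least squares estimate of the intercept is $\hat{a} = \bar{y} - \hat{b}\,\bar{x}$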
108Least Squares Line for Cotton Yield
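As a rough check on the fitted line, the least squares estimates for the fertilizer and cotton yield data can be computed from the usual formulas: slope = covariance of x and y divided by the sample variance of x, intercept = mean of y minus slope times mean of x. The Python below is a sketch with illustrative variable names, not the author's own computation.

```python
# Fertilizer (x) and cotton yield (y) data from the table above.
x = [2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8, 8]
y = [2.3, 2.2, 2.2, 2.5, 2.9, 2.7, 3.4, 2.7, 3.4, 3.5, 3.4, 3.3]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample variance of x and sample covariance of x and y (divisor n - 1).
var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

b_hat = cov_xy / var_x           # least squares slope
a_hat = y_bar - b_hat * x_bar    # least squares intercept

print(round(a_hat, 3), round(b_hat, 3))
```

For these data the fitted line is approximately y = 1.883 + 0.198x, so each additional unit of fertilizer is associated with about 0.2 more units of yield over the range observed.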
109Correlation
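- The correlation between the x's and y's is $r = \frac{S_{xy}}{S_x S_y}$, where $S_{xy}$ is the covariance and $S_x$, $S_y$ are the sample standard deviations of the x's and y's.
- r is related to the slope $\hat{b}$ of the least squares regression line by $\hat{b} = r\,S_y/S_x$.
- r is always between -1 and 1. r measures how nearly linear the relationship between x and y is. If r = 0, then x and y are uncorrelated.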
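The correlation for the fertilizer and cotton yield data can be computed in the same way as the slope. This Python sketch assumes the standard definition, r = covariance divided by the product of the two sample standard deviations, and also checks the relation slope = r·(sd of y)/(sd of x); all names are illustrative.

```python
import math

# Fertilizer (x) and cotton yield (y) data from the earlier table.
x = [2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8, 8]
y = [2.3, 2.2, 2.2, 2.5, 2.9, 2.7, 3.4, 2.7, 3.4, 3.5, 3.4, 3.3]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample standard deviations and covariance (divisor n - 1 throughout).
sd_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
sd_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

r = cov_xy / (sd_x * sd_y)   # correlation coefficient
slope = cov_xy / sd_x ** 2   # least squares slope, for comparison

print(round(r, 3), round(slope, 3), round(r * sd_y / sd_x, 3))
```

Here r is about 0.91, indicating a strong, nearly linear increasing relationship, and the last two printed numbers agree because the slope equals r times the ratio of standard deviations.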
110Examples