STATISTICS - PowerPoint PPT Presentation

Slides: 111
Provided by: char9

1
STATISTICS
  • Summarizing, Visualizing and Understanding Data

2
I. Populations, Variables, and Data
3
Populations and Samples
  • To a statistician, the population is the set or
    collection under investigation. Individual
    members of the population are not usually of
    interest. Rather, investigators try to infer
    with some degree of confidence the general
    features of the population.

4
Examples
  • Students currently enrolled at a certain
    university.
  • Registered voters in a certain Congressional
    district.
  • The population of large-mouthed bass in a certain
    lake.
  • The population of all decay times of a
    radioactive isotope.

5
Statistical Inference
  • Drawing and quantifying the reliability of
    conclusions about a population from observations
    on a smaller subset of the population.
  • Sample: the subset observed.

6
Variables and Data
  • A population variable is a descriptive number or
    label associated with each member of a
    population.
  • The values of a population variable are the
    various numbers (or labels) that occur as we
    consider all the members of the population.
  • Values of variables that have been recorded for a
    population or a sample from a population
    constitute data.

7
Types of Data
  • Nominal variables are variables whose values are
    labels.
  • Ordinal variables are variables whose values have
    a natural order.
  • Interval variables have values represented by
    numbers referring to a scale of measurement.
  • Ratio variables have values that are positive
    numbers on a scale with a unit of measurement and
    a natural zero point.

8
Guess the Type
  • Age
  • Questionnaire responses: 1 = strongly agree,
    2 = agree, ..., 5 = strongly disagree
  • Letter grades
  • Reading comprehension scores
  • Gender
  • Zip codes
  • Molecular velocities

9
II. Summarizing Data
10
Location Measures (Measures of Central Tendency)
  • A location measure or measure of central
    tendency for a variable is a single value or
    number that is taken as representing all the
    values of the variable. Different location
    measures are appropriate for different types of
    data.

11
The Mean
  • For interval or ratio variables x
  • N = number of individuals in the sample or
    population
  • $x_i$ = value of x for the ith individual
  • Mean of x: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

The mean of a population variable is denoted by $\mu$
(the Greek letter mu).
12
The Mean with Repeated Values
  • Distinct values of x: $x'_1, x'_2, \ldots, x'_k$
  • $n_j$ = frequency of occurrence of $x'_j$
  • $\mu = \frac{1}{N}\sum_{j=1}^{k} n_j x'_j$

13
The Mean with Repeated Values
  • Relative frequencies: $f_j = n_j / N$
  • $\mu = \sum_{j=1}^{k} f_j x'_j$

14
Example
$x'_j$:  -2   1   3   4   6
$n_j$:    2   1   3   5   3
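The example can be checked with a short sketch. It assumes the first row of the table holds the distinct values and the second row their frequencies (the same numbers as the prediction-error table later in the deck); the variable names are illustrative only.

```python
from fractions import Fraction

# Distinct values x'_j and their frequencies n_j from the example table.
values = [-2, 1, 3, 4, 6]
freqs = [2, 1, 3, 5, 3]

N = sum(freqs)  # total number of observations

# Mean via frequencies: (1/N) * sum of n_j * x'_j
mean_freq = Fraction(sum(n * x for n, x in zip(freqs, values)), N)

# Mean via relative frequencies f_j = n_j / N: sum of f_j * x'_j
rel_freqs = [Fraction(n, N) for n in freqs]
mean_rel = sum(f * x for f, x in zip(rel_freqs, values))

print(N, mean_freq, float(mean_freq))
```

Both formulas give the same exact fraction, about 3.14.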
15
The Median
  • Informally, the middle value when all the
    values are arranged in order
  • A number m is a median of x if at least half the
    individuals i in the population have $x_i \le m$
    and at least half of them have $x_i \ge m$

16
The Median Example 1
  • x: 2.0, 1.5, 2.2, 3.1, 5.7 (no repetitions)
  • median(x) = 2.2

17
The Median Example 2
  • x: -2.0, 1.5, 3.1, 3.1, 3.1
  • median(x) = 3.1

18
The Median Example 3
  • x: -2.0, 1.5, 3.1, 5.7, 5.9, 7.1
  • median(x) = any number in [3.1, 5.7]
  • By convention, for an even number of individuals
    choose the midpoint between the smallest and
    largest medians, e.g., median(x) = (3.1 + 5.7)/2 = 4.4

19
Example
  • Change 7.1 to 71. What happens to the mean and
    the median?
  • The mean changes from 3.55 to 14.2
  • No change in the median
  • The median is much less sensitive to outliers
    (which may be mistakes in recording data)
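A minimal sketch of this comparison, using Python's standard `statistics` module and the six values from Median Example 3:

```python
from statistics import mean, median

# Data from Median Example 3, and the same data with 7.1 mistyped as 71.
data = [-2.0, 1.5, 3.1, 5.7, 5.9, 7.1]
with_outlier = [-2.0, 1.5, 3.1, 5.7, 5.9, 71.0]

# The mean moves a lot; the median (midpoint of the two middle values)
# does not move at all.
print(mean(data), mean(with_outlier))
print(median(data), median(with_outlier))
```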

20
The Median for Ordered Categories
Grade:  A   A-  B+  B   B-  C+  C   C-  D+  D   D-  F
Count:  8   5   10  18  18  15  14  6   4   1   1   0
N = 100. The median grade is B-.
21
The Mode
  • The data value with the greatest frequency
  • Not useful for interval or ordinal data if
    recorded with precision
  • The only useful location measure for strictly
    nominal data

22
Example
Grade:  A   A-  B+  B   B-  C+  C   C-  D+  D   D-  F
Count:  8   5   10  18  18  15  14  6   4   1   1   0
The modes are B and B-.
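The median grade and the modes can both be read off the frequency table with a few lines of code. This is only a sketch: it assumes the repeated labels in the table are the plus grades (B+, C+, D+), and the helper names `median_category` and `modes` are made up for illustration.

```python
# Grade labels from highest to lowest. Assumption: the repeated labels in
# the slide's table are the plus grades (B+, C+, D+).
grades = ["A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "F"]
counts = [8, 5, 10, 18, 18, 15, 14, 6, 4, 1, 1, 0]

def median_category(labels, freqs):
    """Return the first category whose cumulative count reaches half the total."""
    total = sum(freqs)
    running = 0
    for label, freq in zip(labels, freqs):
        running += freq
        if 2 * running >= total:
            return label

def modes(labels, freqs):
    """Return every category that attains the greatest frequency."""
    top = max(freqs)
    return [label for label, freq in zip(labels, freqs) if freq == top]

print(sum(counts), median_category(grades, counts), modes(grades, counts))
```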
23
Cumulative Frequencies and Percentiles
  • x is an interval or ratio variable.
  • Ordered distinct values: $x'_1 < x'_2 < \cdots < x'_k$
  • Relative frequencies: $f_j = n_j / N$

24
Cumulative Frequencies and Percentiles
  • Cumulative Frequencies: $N_j = n_1 + n_2 + \cdots + n_j$
  • Cumulative Relative Frequencies:
    $F_j = N_j / N = f_1 + f_2 + \cdots + f_j$

25
The Weather Person's Prediction Errors x
$x'_j$   -2      1      3      4      6
$n_j$     2      1      3      5      3
$N_j$     2      3      6     11     14
$f_j$   .1429  .0714  .2143  .3571  .2143
$F_j$   .1429  .2143  .4286  .7857  1.000
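The rows of this table can be rebuilt from the values and frequencies alone. A minimal sketch using `itertools.accumulate` for the running totals:

```python
from itertools import accumulate

values = [-2, 1, 3, 4, 6]   # ordered distinct values x'_j
n = [2, 1, 3, 5, 3]         # frequencies n_j

N = sum(n)
cum = list(accumulate(n))        # cumulative frequencies N_j
f = [nj / N for nj in n]         # relative frequencies f_j
F = [Nj / N for Nj in cum]       # cumulative relative frequencies F_j

for row in zip(values, n, cum, f, F):
    print("{:>3} {:>3} {:>3} {:.4f} {:.4f}".format(*row))
```

Each printed row corresponds to one column of the slide's table.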
26
Exercise
  • From the table above, what fraction of the data
    is less than 1? What fraction is greater than 3?
    What fraction is greater than or equal to 3?

27
Percentiles
  • x: an interval or ratio variable
  • A number a is a pth percentile of x if at least
    p% of the values of x are less than or equal to a
    and at least (100-p)% of the values of x are
    greater than or equal to a.
  • The 25th percentile is called the first quartile
    of x and the 75th percentile is the third
    quartile of x.
  • The 50th percentile is the second quartile or
    median.

28
Example
  • For the weather person's errors, the 25th
    percentile is 3. The 50th percentile and third
    quartile are both 4.
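These values can be verified by applying the percentile definition directly. The sketch below checks only the observed data values as candidates (for some data sets a whole interval of numbers qualifies); the function name `percentiles` is illustrative.

```python
def percentiles(data, p):
    """Data values a with at least p% of the data <= a
    and at least (100 - p)% of the data >= a."""
    n = len(data)
    result = []
    for a in sorted(set(data)):
        at_most = sum(1 for x in data if x <= a)
        at_least = sum(1 for x in data if x >= a)
        # Compare counts with integer arithmetic to avoid rounding issues.
        if 100 * at_most >= p * n and 100 * at_least >= (100 - p) * n:
            result.append(a)
    return result

# The 14 prediction errors, expanded from the frequency table.
errors = [-2] * 2 + [1] * 1 + [3] * 3 + [4] * 5 + [6] * 3

for p in (25, 50, 75):
    print(p, percentiles(errors, p))
```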

29
Measures of Variability
  • Statisticians are not only interested in
    describing the values of a variable by a single
    measure of location. They also want to describe
    how much the values of the variable are dispersed
    about that location.

30
Population Variance and Standard Deviation
  • x: an interval or ratio variable.
  • N = number of individuals in population.
  • Variance of x:
    $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
  • Standard deviation of x: $\sigma = \sqrt{\sigma^2}$

31
Sample Variance and Standard Deviation
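  • n = the number of individuals in a sample from a
    population
  • Sample variance:
    $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$,
    where $\bar{x}$ is the sample mean
  • Sample standard deviation: $S = \sqrt{S^2}$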

32
Alternative Formulas for the Variance
  • Using frequencies:
    $\sigma^2 = \frac{1}{N}\sum_{j} n_j (x'_j - \mu)^2$
  • Using relative frequencies:
    $\sigma^2 = \sum_{j} f_j (x'_j - \mu)^2$
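As a rough check of the variance formulas, the sketch below computes the variance of the 14 prediction errors three ways (from the raw data, from frequencies, and from relative frequencies) and compares the divisor-N and divisor-(n-1) versions with Python's `statistics` module. Treating the same 14 values as both a "population" and a "sample" is only for illustration.

```python
from statistics import pvariance, variance

values = [-2, 1, 3, 4, 6]
freqs = [2, 1, 3, 5, 3]

# Expand the frequency table into the raw list of observations.
data = [x for x, n in zip(values, freqs) for _ in range(n)]
N = len(data)
mu = sum(data) / N

# Population variance: divide the sum of squared deviations by N.
var_def = sum((x - mu) ** 2 for x in data) / N
var_freq = sum(n * (x - mu) ** 2 for x, n in zip(values, freqs)) / N
var_rel = sum((n / N) * (x - mu) ** 2 for x, n in zip(values, freqs))

# Sample variance: same sum of squares, divided by N - 1 instead.
sample_var = sum((x - mu) ** 2 for x in data) / (N - 1)

print(var_def, var_freq, var_rel, sample_var)
```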

33
The Interquartile Range
  • Q1, Q3: 1st and 3rd quartiles, respectively
  • Interquartile range: IQR = Q3 - Q1
  • Not influenced by a few extremely large or small
    observations (outliers)

34
The Range
  • The difference between the largest data value and
    the smallest
  • Range of sample values is not a reliable
    indicator of the range of a population variable

35
III. Graphical Methods
36
Pie Charts (Circle Graphs)
Sources: AT&T (1961), The World's Telephones; R: A
language and environment for statistical
computing, the R core development team.
37
Bar Charts (Bar Graphs)
38
Pros and Cons
  • Bar chart has a scale of measurement: more
    precise information
  • Pie chart gives more vivid impression of relative
    proportions, e.g., obvious at a glance that N.
    America had more than half the telephones in the
    world.

39
Stemplots (Stem and Leaf Diagrams)
Stem | Leaves                | Cumulative Frequency
4    | 7                     | 1
5    | 448889                | 7
6    | 34789                 | 12
7    | 012234455666888889999 | 33
8    | 0022234457799         | 46
9    | 0457                  | 50

Grades of 50 students on a test
40
Find the Median
Stem | Leaves                | Cumulative Frequency
4    | 7                     | 1
5    | 448889                | 7
6    | 34789                 | 12
7    | 012234455666888889999 | 33
8    | 0022234457799         | 46
9    | 0457                  | 50

25th and 26th leaves circled. Median = 78
41
Exercise
Stem | Leaves                | Cumulative Frequency
4    | 7                     | 1
5    | 448889                | 7
6    | 34789                 | 12
7    | 012234455666888889999 | 33
8    | 0022234457799         | 46
9    | 0457                  | 50

The 1st quartile is 70 and the 3rd quartile is 82.
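The median and quartiles can be confirmed by rebuilding the 50 grades from the stems and leaves. This is a sketch: `smallest_percentile` is an illustrative helper that returns the smallest value satisfying the percentile definition used in this deck, which need not match the interpolation rules used by other software.

```python
import math
from statistics import median

# Stem -> string of leaf digits, copied from the stemplot.
stemplot = {
    4: "7",
    5: "448889",
    6: "34789",
    7: "012234455666888889999",
    8: "0022234457799",
    9: "0457",
}

# Rebuild the sorted list of grades: grade = 10 * stem + leaf.
grades = sorted(10 * stem + int(leaf)
                for stem, leaves in stemplot.items()
                for leaf in leaves)

def smallest_percentile(sorted_data, p):
    """Smallest value a such that at least p% of the data are <= a."""
    k = math.ceil(p * len(sorted_data) / 100)
    return sorted_data[k - 1]

q1 = smallest_percentile(grades, 25)
q3 = smallest_percentile(grades, 75)
print(len(grades), median(grades), q1, q3, q3 - q1)
```

The last printed number is the interquartile range, Q3 - Q1.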
42
Boxplots (Box and Whisker Diagrams)
43
Elements of a Boxplot
[Figure: boxplot with its elements labeled: largest, outlier, box, whisker, quartiles, median]
44
Boxplot Shows Distribution Skewed to the Left
45
Histograms
  • For interval or ratio data
  • Data is grouped into class intervals
  • Superficially like a bar chart

46
Frequency Histogram
Height = bin frequency
Class interval (bin)
Source: R: A language and environment for
statistical computing, the R core development
team.
47
Probability Histogram
Area of bar = relative bin frequency. E.g.,
.011 × 25 = .275
48
Ogives (Cumulative Frequency Polygons)
  • Related to probability histograms
  • Examples of cumulative distribution functions
  • Probability histograms are examples of density
    functions

49
Example Ogive
50
Relationship Between Probability Histogram and
Ogive
  • The height of the ogive is the cumulative area
    under the histogram

51
Estimating Percentiles from Ogives
  • Horizontal line has height .75
  • Vertical line intersects horizontal axis at 60
  • Estimated 3rd quartile is 60
  • True 3rd quartile is 62

52
Scatterplots (Scatter Diagrams)
  • Used for jointly observed interval or ratio
    variables
  • Example: Heights and weights of individuals
  • Example: State per capita spending on secondary
    education and state crime rate
  • Example: Wind speed and ozone concentration

53
Example Scatterplot
centroid
54
Fitting a Line
  • Relationship between variables x and y is
    approximately linear.
  • Approximately, y = a + bx.
  • Find a and b so that data comes closest to
    satisfying the equation.
  • Least squares: a formal mathematical technique
    to be shown later.

55
Line Fitted by Least Squares
56
IV. Sampling
57
Why Sample?
  • Because the population is too large to observe
    all its members.
  • The population may be partly inaccessible.
  • The population may even be hypothetical.

58
Statistical Inference
  • Drawing conclusions about the population based on
    observations of a sample.
  • Reliability of inferences must be quantifiable.
  • Random sampling allows probability statements
    about the accuracy of inferences.

59
Sampling With Replacement
  • Population has N members.
  • n population members chosen sequentially.
  • Once chosen, a member of the population may be
    chosen again.
  • At each stage, all members of the population are
    equally likely to be chosen.
  • Random experiment with $N^n$ possible equally
    likely outcomes.

60
Sampling With Replacement (continued)
  • x is a population variable.
  • $X_1$ = value of x for 1st sampled individual,
    $X_2$ = value of x for 2nd sampled individual, etc.
  • Each $X_i$ is a random variable. The random
    variables are independent.
  • The sequence $X_1, X_2, \ldots, X_n$ is a random
    sample of values of x, or a random sample from
    the distribution of x.

61
Sampling Without Replacement
  • Population has N individuals.
  • n members chosen sequentially.
  • Once chosen, an individual may not be chosen
    again.
  • At each stage, all of the remaining members are
    equally likely to be chosen next.
  • Random experiment with $N(N-1)\cdots(N-n+1)$
    possible equally likely outcomes.

62
Sampling Without Replacement (continued)
  • Sample without replacement.
  • Ignore the order of the sequence of individuals
    in the sample.
  • Random experiment whose outcomes are subsets of
    size n.
  • Experiment has $\binom{N}{n}$ possible
    equally likely outcomes.
  • Common meaning of "random sample of size n"
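The three sampling schemes differ in how many equally likely outcomes they have. The sketch below enumerates them for a small hypothetical population (N = 5, n = 2, chosen only to keep the lists short) and compares the counts with the standard counting formulas $N^n$, $N(N-1)\cdots(N-n+1)$, and $\binom{N}{n}$.

```python
from itertools import combinations, permutations, product
from math import comb, perm

N, n = 5, 2                      # hypothetical small population and sample size
population = range(1, N + 1)     # individuals numbered 1..N

# Ordered samples with replacement: repeats allowed.
with_replacement = list(product(population, repeat=n))

# Ordered samples without replacement: no individual chosen twice.
ordered_without = list(permutations(population, n))

# Unordered samples without replacement: subsets of size n.
subsets = list(combinations(population, n))

print(len(with_replacement), N ** n)
print(len(ordered_without), perm(N, n))
print(len(subsets), comb(N, n))
```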

63
Random Number Generators
  • Calculators and spreadsheet programs can generate
    pseudorandom sequences.
  • Press the random number key of your calculator
    several times.
  • Simulates a random sample with replacement from
    the set of numbers between 0 and 1 (to high
    precision).

64
Generating a Sample with Replacement
  • Number the individuals from 1 to N.
  • Generate a pseudorandom number R.
  • Include individual i in the sample if
    $(i-1)/N \le R < i/N$
  • Repeat n times. Individuals may be included more
    than once.

65
Exercise
  • Suppose you have 30 students in your class.
    Use the procedure just described to obtain a
    sample of size 10 (a) with replacement, (b)
    without replacement.
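One way to carry out this exercise in code is sketched below. Two things are assumptions rather than the slides' stated method: the rule mapping a pseudorandom number R in [0, 1) to individual floor(N·R) + 1, and the handling of part (b) by simply discarding repeated draws. The function names are illustrative.

```python
import math
import random

def sample_with_replacement(N, n, rng):
    """Draw n individuals from 1..N; repeats are allowed."""
    # Assumed rule: R in [0, 1) selects individual floor(N * R) + 1.
    return [math.floor(N * rng.random()) + 1 for _ in range(n)]

def sample_without_replacement(N, n, rng):
    """Draw n distinct individuals from 1..N (requires n <= N)."""
    chosen = []
    while len(chosen) < n:
        i = math.floor(N * rng.random()) + 1
        if i not in chosen:          # discard individuals already chosen
            chosen.append(i)
    return chosen

rng = random.Random(0)               # fixed seed so the run is repeatable
with_repl = sample_with_replacement(30, 10, rng)
without_repl = sample_without_replacement(30, 10, rng)
print(with_repl)
print(without_repl)
```

In practice, `random.choices` and `random.sample` from the standard library do the same two jobs directly.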

66
V. Estimation
67
The Sample Mean and Standard Deviation
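  • $X_1, X_2, \ldots, X_n$ is a random sample from the
    distribution of a population variable x.
  • The sample mean is
    $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
  • The sample variance is
    $S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$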

68
The Sample Mean and Standard Deviation (continued)
  • The sample standard deviation is $S = \sqrt{S^2}$
  • The sample mean, variance and standard
    deviation are all random variables because they
    depend on the outcome of the random sampling
    experiment.

69
Estimators
  • The sample mean, variance, and standard deviation
    have distributions derived from the distribution
    of values of the population variable x.
  • They are estimators of the population mean $\mu$, the
    population variance $\sigma^2$, and the population
    standard deviation $\sigma$ of x.

70
Unbiased Estimators
  • The theoretical expected values of the sample
    mean and sample variance are equal to their
    population counterparts, i.e.,
    $E(\bar{X}) = \mu$ and $E(S^2) = \sigma^2$
  • $\bar{X}$ and $S^2$ are said to be unbiased estimators
    of $\mu$ and $\sigma^2$, respectively
  • S is biased.
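Unbiasedness can be checked exactly for a tiny case by listing every equally likely sample. The sketch below uses a hypothetical four-member population and samples of size 2 drawn with replacement; it averages the sample mean, sample variance, and sample standard deviation over all 16 samples and compares them with the population values.

```python
from itertools import product
from statistics import mean, pstdev, pvariance, stdev, variance

population = [0, 1, 2, 6]        # hypothetical population values of x
n = 2                            # sample size

# All N**n equally likely ordered samples drawn with replacement.
samples = list(product(population, repeat=n))

mu = mean(population)
sigma2 = pvariance(population)   # population variance (divisor N)
sigma = pstdev(population)

# Expected value of each statistic = plain average over all samples.
exp_mean = mean(mean(s) for s in samples)
exp_var = mean(variance(s) for s in samples)   # divisor n - 1
exp_sd = mean(stdev(s) for s in samples)

print(mu, exp_mean)        # equal: the sample mean is unbiased
print(sigma2, exp_var)     # equal: the sample variance is unbiased
print(sigma, exp_sd)       # exp_sd is smaller: S is biased
```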
71
The Distribution of the Random Variable $\bar{X}$
  • The mean of $\bar{X}$ is $\mu$, the same as the mean of
    the population variable x.
  • The standard deviation of $\bar{X}$ is $\sigma/\sqrt{n}$
  • These are the theoretical mean and standard
    deviation.

72
Density Functions
  • A density function is a nonnegative function
    such that the total area between the graph of the
    function and the horizontal axis is 1.
  • A probability histogram is a density function.
  • Other density functions are limits of
    histograms as the number of data elements grows
    without bound.

73
The Standard Normal Density Function
74
Percentiles of the Standard Normal Distribution
$z_\alpha$ is the $100(1-\alpha)$th percentile of the distribution
75
Symmetry About the Vertical Axis
76
Probabilities Related to the Standard Normal
Distribution
77
Other Normal Distributions
  • Let Z be a random variable with the standard
    normal distribution.
  • The mean of Z is 0 and the standard deviation of
    Z is 1.
  • Let $\mu$ and $\sigma$ be any numbers, $\sigma > 0$.
  • Let $Y = \sigma Z + \mu$
  • Y has the normal distribution with mean $\mu$ and
    standard deviation $\sigma$.

78
Other Normal Distributions Example
$\mu = 1$ and $\sigma = 1.5$
79
Standardizing: The Inverse Operation
  • Let Y be normally distributed with mean $\mu$ and
    standard deviation $\sigma$.
  • Let $Z = (Y - \mu)/\sigma$. This is the z-score of
    Y.
  • Then Z has the standard normal distribution.
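A small sketch of why standardizing is useful: probabilities for any normal variable Y can be computed from the standard normal distribution. It uses `statistics.NormalDist` with the mean 1 and standard deviation 1.5 from the earlier example; the interval endpoints a and b are arbitrary illustrative values.

```python
from statistics import NormalDist

mu, sigma = 1.0, 1.5             # parameters from the earlier example
Z = NormalDist(0.0, 1.0)         # standard normal
Y = NormalDist(mu, sigma)        # Y = sigma * Z + mu

a, b = 0.0, 2.5                  # hypothetical interval endpoints

# P(a < Y < b) computed directly and via z-scores of the endpoints.
direct = Y.cdf(b) - Y.cdf(a)
via_z = Z.cdf((b - mu) / sigma) - Z.cdf((a - mu) / sigma)

# z_alpha is the 100(1 - alpha)th percentile of the standard normal.
alpha = 0.05
z_alpha = Z.inv_cdf(1 - alpha)

print(direct, via_z, z_alpha)
```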

80
The Central Limit Theorem
  • Let $\bar{X}$ be the sample average of a random sample
    of n values of a population variable x.
  • The population variable x has mean $\mu$ and standard
    deviation $\sigma$.
  • Standardize $\bar{X}$ by subtracting its mean and
    dividing by its standard deviation:
    $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

81
The Central Limit Theorem (continued)
  • Get Ready for the Central Limit Theorem!

82
The Central Limit Theorem (continued)
  • The Central Limit Theorem
  • As the sample size n grows without bound, the
    distribution of Z approaches the standard
    normal distribution. This is true no matter what
    the distribution of values of the population
    variable x.

83
Another Statement of the CLT
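  • For sufficiently large sample sizes n and for
    all numbers a and b,
    $P\left(a < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < b\right) \approx \Phi(b) - \Phi(a)$,
    where $\Phi$ is the standard normal cumulative
    distribution function
  • In almost all applications, n = 50 is large
    enough.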

84
The CLT in Action
  • Sample n = 30 from the population variable
    COUNTS whose distribution is tabulated.
    Calculate the sample average. Repeat this 500
    times and construct a histogram of the z-scores
    of the 500 sample averages. Note: The
    distribution of COUNTS is very far from normal.

$x'_j$   0    1    2    3    4    5    6
$f_j$   .36  .33  .19  .08  .02  .01  .01
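The experiment can be imitated with a short simulation. This is a sketch, not the code behind the slides (which credit R for their figures): it draws from the tabulated COUNTS distribution with `random.choices`, uses an arbitrary seed, and summarizes the 500 z-scores numerically instead of drawing a histogram.

```python
import random
from math import sqrt
from statistics import mean, pstdev

values = [0, 1, 2, 3, 4, 5, 6]
probs = [.36, .33, .19, .08, .02, .01, .01]

# Population mean and standard deviation of COUNTS from the table.
mu = sum(p * x for p, x in zip(probs, values))
sigma = sqrt(sum(p * (x - mu) ** 2 for p, x in zip(probs, values)))

rng = random.Random(1)           # arbitrary seed, for repeatability
n, reps = 30, 500

z_scores = []
for _ in range(reps):
    sample = rng.choices(values, weights=probs, k=n)   # with replacement
    z_scores.append((mean(sample) - mu) / (sigma / sqrt(n)))

# For a standard normal, about 95% of values fall within +/- 1.96.
inside = sum(1 for z in z_scores if abs(z) < 1.96) / reps
print(round(mu, 2), round(sigma, 3))
print(round(mean(z_scores), 3), round(pstdev(z_scores), 3), inside)
```

If the theorem is doing its job, the z-scores should have mean near 0 and standard deviation near 1.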
85
Distribution of COUNTS
86
Result: 500 Averages of Samples of Size 30 from COUNTS
87
Estimating a Population Mean
  • The sample mean $\bar{X}$ is an unbiased estimator of
    the population mean $\mu$.
  • For large sample sizes n, $\bar{X}$ has
    approximately a normal distribution with mean $\mu$
    and standard deviation $\sigma/\sqrt{n}$
  • For large n, the sample mean is an accurate
    estimator of the population mean with high
    probability.

88
Example
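  • Suppose $\sigma = 2$ and we want to estimate $\mu$
    with an error no greater than 0.05.
  • Assume $\bar{X}$ is exactly normally distributed.
    Standardize:
    $P(|\bar{X} - \mu| \le 0.05) = P\left(|Z| \le \frac{0.05\sqrt{n}}{2}\right)$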

89
Probabilities of 1-place Accuracy, $\sigma = 2$
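Probabilities of this kind can be tabulated with a few lines of code. The sketch assumes, as the example does, that the sample mean is exactly normal with standard deviation 2/√n; the sample sizes listed are illustrative and are not taken from the slide's figure.

```python
from math import sqrt
from statistics import NormalDist

sigma = 2.0      # population standard deviation from the example
err = 0.05       # largest acceptable estimation error

def prob_within(n):
    """P(|sample mean - mu| <= err) when the sample mean is normal."""
    z = err * sqrt(n) / sigma            # standardized error bound
    return 2 * NormalDist().cdf(z) - 1   # P(-z <= Z <= z)

for n in (100, 1000, 10000, 100000):     # illustrative sample sizes
    print(n, round(prob_within(n), 4))
```

The probability rises slowly with n because the standard deviation of the sample mean shrinks only like 1/√n.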
90
Confidence Intervals for the Population Mean
Review of
91
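$100(1-\alpha)\%$ Confidence Interval
  • By the CLT,
    $P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) \approx 1 - \alpha$
  • Rearranging the inequalities,
    $P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \approx 1 - \alpha$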

92
A Difficulty
  • $\sigma$ is probably unknown, so the confidence
    interval $\bar{X} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
    can't be used. What to do?

93
Enhanced Central Limit Theorem
  • Define the modified z-score for $\bar{X}$ as
    $Z = \frac{\bar{X} - \mu}{S/\sqrt{n}}$
  • As n grows without bound, the distribution of Z
    approaches the standard normal distribution.

94
A More Useful Confidence Interval
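  • By the enhanced CLT,
    $P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{S/\sqrt{n}} < z_{\alpha/2}\right) \approx 1 - \alpha$
  • An approximate $100(1-\alpha)\%$ confidence interval is
    $\bar{X} \pm z_{\alpha/2}\, S/\sqrt{n}$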

95
Example
  • n = 50 from COUNTS ($\mu = 1.14$)
  • $\bar{X} = 1.32$
  • S = 1.39
  • $1-\alpha = .95$
  • $1.32 \pm 0.39$
  • 95% confidence interval: (0.93, 1.71)
  • Don't say $.95 = P(0.93 < \mu < 1.71)$
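The interval in this example can be reproduced from the three summary numbers. A minimal sketch, with the normal percentile taken from `statistics.NormalDist`:

```python
from math import sqrt
from statistics import NormalDist

n, xbar, S = 50, 1.32, 1.39      # sample size, sample mean, sample sd
conf = 0.95
alpha = 1 - conf

# z_{alpha/2}: the 100(1 - alpha/2)th percentile of the standard normal.
z = NormalDist().inv_cdf(1 - alpha / 2)

half_width = z * S / sqrt(n)
lower, upper = xbar - half_width, xbar + half_width
print(round(z, 3), round(half_width, 2), round(lower, 2), round(upper, 2))
```

Here the interval happens to contain the true mean 1.14, which is known only because COUNTS is a constructed example.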

96
Confidence Intervals for Proportions
  • x is a population variable with only two values,
    0 and 1.
  • Numerical code for two mutually exclusive
    categories, e.g., male and female, or
    approves and disapproves.
  • p = relative frequency of x = 1.
  • $\mu = p$, $\sigma^2 = p(1-p)$

97
Confidence Intervals for Proportions (continued)
  • Sample n values of x, with replacement. Result is
    a sequence of 1s and 0s.
  • Sample mean is the relative frequency in the
    sample of 1s, e.g., the relative frequency of
    females in the sample of individuals.
  • Denote the sample mean by $\hat{p}$ since it is an
    estimator of p.

98
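Confidence Intervals for Proportions (continued)
  • By the enhanced CLT,
    $\frac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}}$ is
    approximately standard normal.
  • An approximate $100(1-\alpha)\%$ confidence interval is
    $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$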

99
Example
  • A public opinion research organization polled
    1000 randomly selected state residents. Of
    these, 413 said they would vote for a 1% sales
    tax increase dedicated to funding higher
    education. Find a 90% confidence interval for
    the proportion of all voters who would vote for
    such a proposal.

100
Solution
  • n = 1000
  • $1-\alpha = .90$
  • $0.413 \pm 1.645\sqrt{0.413(1-0.413)/1000} = 0.413 \pm 0.026$
  • (0.387, 0.439)
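The arithmetic for the poll can be checked with a short sketch. It uses the usual large-sample interval for a proportion, with the standard error estimated by √(p̂(1-p̂)/n); the endpoints come out near 0.387 and 0.439.

```python
from math import sqrt
from statistics import NormalDist

n = 1000
p_hat = 413 / n                  # sample proportion in favor
conf = 0.90
alpha = 1 - conf

z = NormalDist().inv_cdf(1 - alpha / 2)        # about 1.645
std_err = sqrt(p_hat * (1 - p_hat) / n)        # estimated standard error

lower = p_hat - z * std_err
upper = p_hat + z * std_err
print(round(z, 3), round(lower, 3), round(upper, 3))
```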

101
Linear Regression and Correlation
  • x and y are jointly observed numeric variables,
    i.e., defined for the same population or arising
    from the same experiment.
  • Have observations for n individuals or outcomes.
  • Data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

102
Examples
  • (An observational study) Let x be the height and
    y the weight of individuals from a human
    population.
  • (A designed experiment) Let x be the amount of
    fertilizer applied to a plot of cotton seedlings
    and let y be the weight of raw cotton harvested
    at maturity.

103
Data on Fertilizer and Cotton Yield
x 2 2 2 4 4 4 6 6 6 8 8 8
y 2.3 2.2 2.2 2.5 2.9 2.7 3.4 2.7 3.4 3.5 3.4 3.3
104
Scatterplot of Fertilizer vs. Yield
105
Assumptions of Linear Regression
  • There is a population or distribution of values
    of y for any particular value of x.
  • There are unknown constants a and b so that for
    any particular value of x, the mean of all the
    corresponding values of y is $a + bx$
  • The standard deviation $\sigma$ of the values of y
    corresponding to a value of x is the same for all
    values of x.

106
The Method of Least Squares
  • Estimate a and b by choosing them to minimize the
    sum of squared differences between the observed
    values $y_i$ and their putative expected values
    $a + bx_i$
  • In symbols, minimize
    $\sum_{i=1}^{n} (y_i - a - bx_i)^2$

107
The Least Squares Estimates
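  • Let $\bar{x}$ and $\bar{y}$ be the means of the observed
    x's and y's. Let $s_x^2$ be the sample variance of
    the x's.
  • The covariance between the x's and the y's is
    $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
  • The least squares estimate of the slope is
    $\hat{b} = s_{xy}/s_x^2$
  • The least squares estimate of the intercept is
    $\hat{a} = \bar{y} - \hat{b}\,\bar{x}$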

108
Least Squares Line for Cotton Yield
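The fitted line for the cotton data can be computed from the textbook formulas slope = covariance / variance of x and intercept = ȳ - slope · x̄. This is a sketch of that calculation on the tabulated data; the variable names are illustrative, and the plotted line on the slide is not reproduced here.

```python
from statistics import mean

# Fertilizer (x) and cotton yield (y) from the data table.
x = [2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8, 8]
y = [2.3, 2.2, 2.2, 2.5, 2.9, 2.7, 3.4, 2.7, 3.4, 3.5, 3.4, 3.3]

n = len(x)
xbar, ybar = mean(x), mean(y)

# Sample variance of x and sample covariance of x and y (divisor n - 1).
s_xx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

b_hat = s_xy / s_xx              # least squares slope
a_hat = ybar - b_hat * xbar      # least squares intercept

print(round(a_hat, 3), round(b_hat, 3))
```

The fitted line always passes through the centroid (x̄, ȳ), here (5, 2.875).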
109
Correlation
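  • The correlation between the x's and y's is
    $r = \frac{s_{xy}}{s_x s_y}$, where $s_{xy}$ is the
    covariance and $s_x$, $s_y$ are the sample standard
    deviations of the x's and y's
  • r is related to the slope $\hat{b}$ of the least squares
    regression line by $r = \hat{b}\, s_x / s_y$
  • r is always between -1 and 1. r measures how
    nearly linear the relationship between x and y
    is. If r = 0, then x and y are uncorrelated.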
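As an illustration, the sketch below computes the correlation for the fertilizer and cotton yield data from the usual definition (covariance divided by the product of the standard deviations) and checks the relation between r and the least squares slope numerically.

```python
from math import sqrt
from statistics import mean

x = [2, 2, 2, 4, 4, 4, 6, 6, 6, 8, 8, 8]
y = [2.3, 2.2, 2.2, 2.5, 2.9, 2.7, 3.4, 2.7, 3.4, 3.5, 3.4, 3.3]

n = len(x)
xbar, ybar = mean(x), mean(y)

# Sample standard deviations and covariance (divisor n - 1).
s_x = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

r = s_xy / (s_x * s_y)           # correlation
slope = s_xy / s_x ** 2          # least squares slope

print(round(r, 4), round(slope * s_x / s_y, 4))
```

The two printed numbers agree, and r close to 1 indicates a nearly linear increasing relationship.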

110
Examples