Title: Chapters 2 and 3
1Chapters 2 and 3
2- After collecting our data, we want to get a better understanding of its various aspects.
- Data can be described numerically or graphically.
3Numerical Descriptive Methods
- Numerical Data: sample mean, sample median, sample standard deviation, range, etc.
- Categorical Data: sample counts or sample proportions
4Graphical Descriptive Methods
- Numerical Data: histogram, boxplot, dotplot, stem plot, etc.
- Categorical Data: bar chart, pie chart, frequency tables
5Describing Numerical Data
- The center of the data can be described by the sample mean (x̄), the sample median, or the sample mode.
- The sample mean is the usual average, and the sample median is the middle number after sorting.
- The mode is the number that occurs most often.
6- Suppose our data is
- 4 1 9 2 5
- Sample mean = (4 + 1 + 9 + 2 + 5)/5 = 21/5 = 4.2
- Sample median = the middle number after sorting (1 2 4 5 9) = 4.
- If the sample size is even, the median is the average of the 2 middle numbers.
7- Suppose that in the previous dataset, 9 was misreported as 99. Then the sample median remains at 4 but the sample mean is now 22.2.
- The sample mean is more sensitive to unusual observations, known as outliers.
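The effect of the misreported value can be checked with a short Python sketch. It uses the standard library's statistics module; the variable names are illustrative only.

```python
from statistics import mean, median

original = [4, 1, 9, 2, 5]
misreported = [4, 1, 99, 2, 5]  # the 9 recorded as 99

# The mean shifts a lot, the median does not move
print(mean(original), median(original))        # 4.2 4
print(mean(misreported), median(misreported))  # 22.2 4
```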
8One of the marks is the sample mean and the other
is the sample median. Which one corresponds to
the green mark?
9The mode
- Ex: 2 1 5 4 5. The mode is 5.
- Ex: 2 1 5 1 5. The modes are 1 and 5.
- Ex: 2 1 3 8 4. There is no mode.
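The three cases above can be reproduced with a small helper. This is a sketch only: the function name modes is hypothetical, and it assumes the convention used in these examples that a sample in which no value repeats has no mode.

```python
from collections import Counter

def modes(data):
    """Return the most frequent values, or [] when no value repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:  # every value occurs once: treat as "no mode"
        return []
    return sorted(value for value, count in counts.items() if count == top)

print(modes([2, 1, 5, 4, 5]))  # [5]
print(modes([2, 1, 5, 1, 5]))  # [1, 5]
print(modes([2, 1, 3, 8, 4]))  # []
```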
10- The sample standard deviation, s, is a measure of how spread out the data is.
- The sample variance is s².
- We could also use the range as a measure of the variability.
- Range = Max - Min
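As a rough illustration, the sketch below computes these measures of spread for the earlier data 4 1 9 2 5. It assumes the usual definition of the sample variance, with n - 1 in the denominator, which is also what Python's statistics.variance uses.

```python
from statistics import mean, stdev, variance

data = [4, 1, 9, 2, 5]
xbar = mean(data)

# Sample variance by hand: squared deviations from the mean, divided by n - 1
s2_by_hand = sum((x - xbar) ** 2 for x in data) / (len(data) - 1)

s2 = variance(data)              # same quantity from the standard library
s = stdev(data)                  # square root of the sample variance
data_range = max(data) - min(data)

print(round(s2, 4), round(s, 4), data_range)  # 9.7 3.1145 8
```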
11As the points move away from x̄ (the mark in the center), the standard deviation increases.
Note: The ranges of the last 3 are about the same. The range can stay the same while the variance increases.
12Sample Final Problem
13Sample Final Problem
14Sample Final Problem
15Describing Categorical Data
- To describe categorical data, there are only 2 statistics of interest: sample counts and sample proportions.
- Ex: Suppose 1 out of 20 people has gum disease. The sample count is 1 and the sample proportion is 1/20 = 0.05.
16Statistic vs. Parameter
- A statistic is a quantity associated with the sample, and a parameter is a quantity associated with the population.
17(No Transcript)
18(No Transcript)
19Example
- A company manufactures bricks. They are interested in their mean breaking strength.
- Can they determine the average breaking strength of 10 bricks?
- Can they determine the mean breaking strength of all bricks produced?
20Example
- Walgreens records the price of prescriptions bought at their stores.
- Can Walgreens determine the mean cost of all prescriptions bought at Walgreens?
- Can Walgreens determine the mean cost of all prescriptions bought in the US?
21Example 1
- Use the calculator to find the following statistics for the data below:
- 10 15 5 22 38 51
- sample mean, sample variance, sample median, range
22Example 2
- Find the sample mean and sample standard deviation for the following data:
- 2 2 2 2 2 5 5 5 7 7 7 7 8 8 8 8 8 8
- To simplify putting the data into the calculator, the next table will be useful.
23- Frequency refers to the number of times each value occurs in the sample.
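A minimal sketch of the frequency-table idea, using the Example 2 data: each distinct value is weighted by its frequency when computing the sample mean and sample standard deviation. The variable names are illustrative, and the n - 1 divisor for the sample variance is assumed.

```python
from collections import Counter
from math import sqrt

data = [2, 2, 2, 2, 2, 5, 5, 5, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8]

# Frequency table: value -> number of times it occurs
freq = Counter(data)  # 2 occurs 5 times, 5 occurs 3 times, 7 occurs 4 times, 8 occurs 6 times

n = sum(freq.values())
xbar = sum(value * count for value, count in freq.items()) / n
s2 = sum(count * (value - xbar) ** 2 for value, count in freq.items()) / (n - 1)
s = sqrt(s2)

print(dict(freq))
print(n, round(xbar, 3), round(s, 3))
```

Working from the table gives the same result as entering all 18 observations one by one.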
24Example 3
- Instead of knowing the actual observations, we only know the intervals and the number of observations in each.
- Again, obtain the sample mean and sample sd.
25Graphically Describing Numerical Data
- A histogram splits the data into intervals called
bins or classes. The number (frequency) or
percentage (relative frequency) of observations
in each interval is recorded. This is the height
of each bin.
26Create a frequency histogram
- Data: 1.2 1.8 3.1 0.4 0.2 4.8 1.5 2.1 2.9 3.7
27Create a relative frequency histogram
- Data: 1.2 1.8 3.1 0.4 0.2 4.8 1.5 2.1 2.9 3.7
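The bin heights for both histograms can be tabulated with a short sketch. The slides do not state which bins to use, so the intervals [0,1), [1,2), ..., [4,5) below are an assumption; a different choice of bins would give different heights.

```python
from math import floor

data = [1.2, 1.8, 3.1, 0.4, 0.2, 4.8, 1.5, 2.1, 2.9, 3.7]

# Assumed bins (not given on the slides): [0,1), [1,2), [2,3), [3,4), [4,5)
low, width, n_bins = 0.0, 1.0, 5

freq = [0] * n_bins
for x in data:
    freq[floor((x - low) / width)] += 1  # index of the bin containing x

# Relative frequency = frequency divided by the sample size
rel_freq = [count / len(data) for count in freq]

for i, (count, rel) in enumerate(zip(freq, rel_freq)):
    left = low + i * width
    print(f"[{left:.0f}, {left + width:.0f}): frequency {count}, relative frequency {rel:.1f}")
```

The two histograms have the same shape; only the scale of the vertical axis differs.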
28The height of this bin is at approximately 18
which means there are 18 observations between 140
and 160.
These numbers on the vertical axis are all counts
which makes this a frequency histogram.
This bin ranges from 140 to 160.
29Heights of Volcanoes
30- How many volcanoes are in the sample?
- How many volcanoes are more than 8000 feet tall?
- What percentage of the volcanoes are less than 4000 feet tall?
- How many volcanoes are between 4000 and 6000 feet tall?
31Boxplots
- The histogram on the right has been split into 4 pieces so that each consists of 25% of the data.
- The marks where the pieces are split are used to create the boxplot.
32(No Transcript)
33The minimum (min) is approximately 11.
The first quartile (Q1) is approximately 17.
The second quartile (Q2) is approximately 20.
The third quartile (Q3) is approximately 27.
The maximum (max) is approximately 35.
These numbers are called the 5 number summary for a boxplot.
34Outliers
- Outliers show up as circles. In this case, it is now the max.
- This is the largest observation that is NOT an outlier.
35Find the following
Note: The interquartile range (IQR) is Q3 - Q1.
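As a small worked sketch, the IQR can be computed from the approximate five-number summary read off the earlier boxplot (min 11, Q1 17, Q2 20, Q3 27, max 35). The 1.5 × IQR fences are a common convention for flagging outliers and are an assumption here; the slides do not say which rule the plotting software used.

```python
# Approximate five-number summary read off the earlier boxplot
five_number = {"min": 11, "Q1": 17, "Q2": 20, "Q3": 27, "max": 35}

iqr = five_number["Q3"] - five_number["Q1"]

# Assumption: the common 1.5 * IQR rule for flagging outliers
lower_fence = five_number["Q1"] - 1.5 * iqr
upper_fence = five_number["Q3"] + 1.5 * iqr

print(iqr, lower_fence, upper_fence)  # 10 2.0 42.0
```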
36Shapes
- The shape of the distribution of the data can be classified in 3 ways:
- Skewed Left
- Skewed Right
- Symmetric
37Skewed Right
- Most of the data (perhaps 50% or so) is on the left, and as you move to the right, the observations become more and more sparse.
38Skewed Left
- This is basically the opposite of skewed right data. Most of the data is on the right and becomes more and more sparse as we move to the left.
39Symmetric
- For symmetric data, we expect the histogram and boxplot to be symmetric.
- For the boxplot, we should see these distances being approximately equal.
40Dot Plots
- A dot plot places a dot for each observation.
- For the dotplot above, approximately what is
- the sample size?
- the sample median?
- the range?
41Stem Plots
Stems Leaves
- For the stem plot on the left, what is
- the sample size?
- the range?
- the sample median?
42Bar Chart
- Approximately how many Toyotas are in the sample?
- Can we all agree the shape is skewed left?
43Pie Chart
- If this is based on a sample of 250, approximately how many say they are somewhat interested in professional soccer?
44Z-scores
- A z-score for an observation x is defined as z = (x - mean)/(standard deviation).
- You can use either the population or sample quantities here. That is, z = (x - µ)/σ or z = (x - x̄)/s.
45The z-score for 180 is (180 - 173.59)/19.46 = 0.329 and the z-score for 110 is (110 - 173.59)/19.46 = -3.27.
110 is more standard deviations from the mean than 180 is, even though its z-score is negative.
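The two z-scores can be verified with a one-line function. This is a sketch; the name z_score and its parameter names are illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

print(round(z_score(180, 173.59, 19.46), 3))  # 0.329
print(round(z_score(110, 173.59, 19.46), 3))  # -3.268
```

Comparing absolute values, |-3.268| > |0.329|, which is why 110 is farther from the mean in standard-deviation units.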
46Example
- A data set has a mean of 200 and a standard deviation of 30. For a data value of 245, what is the z-score?
47Percentiles
60% of the distribution is shaded, which means 40% remains unshaded.
This value is the 60th percentile, P60.
In general, the rth percentile is the value with r% of the data or distribution below it.
48Finding the rth percentile
- Example: Find the 70th percentile of the sample below.
- 29 29 30 31 31 32 32 32 32 32
- 32 33 33 33 33 34 34 34 34 36
- 36 37 38 38 38 39 39 43
- If the data is not already sorted, as it is above, do that first.
49- There are n = 28 observations.
- The 70th percentile is found by
- n(0.7) = 28(0.7) = 19.6
- Since 19.6 is not a whole number, go up to the next integer, 20. The 70th percentile is the 20th number from the bottom, 36.
50- The 25th percentile is found by
- n(0.25) = 28(0.25) = 7
- Since this is a whole number, the 25th percentile is found by averaging the values in the 7th and 8th positions. That is, the 25th percentile is (32 + 32)/2 = 32.
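The position rule used in these two examples can be sketched as a short function. The name percentile is illustrative, r is assumed to be a whole-number percentage strictly between 0 and 100, and integer arithmetic is used so that "is n × r/100 a whole number?" is not affected by floating-point rounding. Other textbooks and software use slightly different percentile rules, so results may differ from library functions.

```python
def percentile(data, r):
    """r-th percentile (integer r, 0 < r < 100) using the position rule above."""
    values = sorted(data)
    n = len(values)
    k, remainder = divmod(n * r, 100)  # n * r / 100 without floating point
    if remainder:                       # not a whole number: round up
        return values[k]                # position k + 1, counting from 1
    return (values[k - 1] + values[k]) / 2  # average positions k and k + 1

sample = [29, 29, 30, 31, 31, 32, 32, 32, 32, 32,
          32, 33, 33, 33, 33, 34, 34, 34, 34, 36,
          36, 37, 38, 38, 38, 39, 39, 43]

print(percentile(sample, 70))  # 36
print(percentile(sample, 25))  # 32.0
```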
51For the sample below
- n = 40
- 32 33 38 39 40 41
- 42 43 44 44 45 46
- 46 47 48 48 49 53
- 53 54 55 55 55 56
- 58 58 59 59 60 61
- 61 62 63 64 67 68
- 68 69 72 74
- Find the following percentiles
- P13
- P35
52Normal Distribution
This distribution has mean 10 and standard
deviation 1.9.
This distribution has mean 2 and standard
deviation 3.
The mean is denoted by µ and the standard deviation by σ.
53Empirical Rule
- For a data set having a distribution that is approximately bell-shaped, the following 3 properties apply:
- About 68% of the data fall within 1 standard deviation of the mean.
- About 95% of the data fall within 2 standard deviations of the mean.
- About 99.7% of the data fall within 3 standard deviations of the mean.
54Approximate Percentages
55Since this data looks normal, we can use the Empirical Rule to conclude that approximately 95% of the observations are between 173.59 - 2(19.46) = 134.67 and 173.59 + 2(19.46) = 212.51.
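The same arithmetic can be written as a small helper that returns the interval k standard deviations on either side of the mean. This is a sketch: the function name within_k_sd is hypothetical, and the percentages are the approximate Empirical Rule values, which apply only to roughly bell-shaped data.

```python
def within_k_sd(mean, sd, k):
    """Interval from k standard deviations below to k above the mean."""
    return mean - k * sd, mean + k * sd

# Approximate coverage for bell-shaped data, per the Empirical Rule
coverage = {1: "68%", 2: "95%", 3: "99.7%"}

low, high = within_k_sd(173.59, 19.46, 2)
print(f"About {coverage[2]} of observations lie between {low:.2f} and {high:.2f}")
# About 95% of observations lie between 134.67 and 212.51
```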
56- Consider and .
-
- The z-score for 10.63 is _____.
-
- The z-score for 8.222 is ______.
57- What then are the z-scores for the following?
- The z-score for is _____.
- The z-score for is _____.
- The z-score for is _____.
- The z-score for is _____.
- The z-score for is _____.
- The z-score for is _____.
58Example
- Birth weights are approximately bell-shaped with mean 3410 g and sd 520 g.
- Approximately what percentage of the birth weights fall between 2370 and 4450 grams?
- Between what 2 values will approximately 68% of the birth weights fall?
59Example
- The length of time car owners keep their cars is bell-shaped with mean 7.513 years and standard deviation 2.47 years.
- Approximately what percentage of car owners keep their cars between 5.043 and 9.983 years?
- Between what 2 values (in years) do approximately 99.7% of car owners keep their cars?
60Match the symbol to the word.
- Average
- Sample Size
- Population Mean
- Sample Mean
- Sample Variance
- Sample Std. Dev.
- Population Variance
- Mean
61- What remains are other types of graphs you can obtain. I will let you read about these on your own.
- Histogram for discrete data
- Frequency Polygon
- Ogive Curve
- Pareto Chart
62Discrete Data
- The only observations in the sample are 1, 2, 3, 4, 5, 6 and no others.
- Notice that the numbers are in the middle of the intervals.
63Frequency Polygon
- Rather than having rectangles, there's a single point that represents the height at which the frequency occurs.
- And then you draw lines from one height to the next.
64Ogive (Pronounced oh-jive)
Approximately 12 of the numbers in the sample are
less than or equal to 2.
You could make rectangles as in a histogram if
you wanted to.
65Pareto Chart
- Put simply, a Pareto chart is nothing more than a special bar chart.
- It's for categorical data.
- The bars are sorted in order of frequencies.