Title: Chapter 1: Looking at Data Distributions
1Chapter 1 Looking at Data Distributions
21. Introduction
- Individuals: the objects described by a set of data (people, animals, or things).
- Variable: a characteristic of an individual. It can take on different values for different individuals.
- Examples: age, height, gender, favorite class, speed, etc.
3Types of Variables
- Categorical Variable: an individual is placed into one of several groups or categories. These groups or categories are usually not numerical.
4Types of Variables
- Quantitative Variable (numeric): takes numeric values that can be added, subtracted, averaged, etc.
- Discrete: takes on values that are separated from one another. That is, between two adjacent values of a discrete variable there is no other possible value.
- Continuous: can take any number in a given interval. That is, between any two values of a continuous variable there is always another possible value.
5Types of Variables
- Examples: classify each variable as numeric (discrete or continuous) or categorical
- Length
- Hours Enrolled
- Major
- Zip Code
6Types of Variables
- Examples: each variable classified as numeric (discrete or continuous) or categorical
- Length: numeric, continuous
- Hours Enrolled: numeric, discrete
- Major: categorical
- Zip Code: categorical
7Distribution of a Variable
- The distribution of a variable tells us the
possible values for the variable and the
probability that the variable takes these values.
82. Describing Categorical Variable Distributions
- Suppose we poll 50 students on an issue ("Statistics is interesting!"). How can we exhibit their responses?
- Frequency Tables
- give counts (31 agree), proportions (31/50 = 0.62 agree), and percents (62% agree)
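The counts, proportions, and percents of a frequency table can be sketched in a few lines of Python. The slide reports only that 31 of the 50 students agree; labelling the remaining 19 responses as "other" is an assumption made purely for illustration.

```python
from collections import Counter

# Hypothetical raw responses: only "31 of 50 agree" comes from the slide;
# the label "other" for the remaining 19 is an illustrative assumption.
responses = ["agree"] * 31 + ["other"] * 19

counts = Counter(responses)
n = len(responses)

# Frequency table: count, proportion, and percent for each category.
table = {
    category: {
        "count": count,
        "proportion": count / n,
        "percent": 100 * count / n,
    }
    for category, count in counts.items()
}

for category, row in table.items():
    print(f"{category:>6}: count={row['count']}, "
          f"proportion={row['proportion']:.2f}, percent={row['percent']:.0f}%")
```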
92. Describing Categorical Variable Distributions
- Suppose we poll 50 students on an issue ("Statistics is interesting!"). How can we exhibit their responses?
- Bar Chart
- can have counts, percents, or proportions on the vertical axis
102. Describing Categorical Variable Distributions
- Suppose we poll 50 students on an issue ("Statistics is interesting!"). How can we exhibit their responses?
- Pie Chart
113. Describing Numeric Variable Distributions
- To describe a distribution we need 3 items
- Shape: modes, symmetry, skewness, outliers
- Center: mean, median, mode
- Spread: range, standard deviation, IQR
123. Describing Numeric Variable Distributions
- Shape
- Modes: major peaks in the distribution
- Symmetric: the values smaller and larger than the midpoint are mirror images of each other
- Skewed to the right: the right tail is much longer than the left tail
- Skewed to the left: the left tail is much longer than the right tail
133. Describing Numeric Variable Distributions
- Center
- Mean: the arithmetic average. Add up the numbers and divide by the number of observations. If the n observations are $x_1, x_2, \ldots, x_n$, their sample mean is $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
- Median: list the data from smallest to largest. If there is an odd number of data values, the median is the middle one in the list. If there is an even number of data values, average the middle two in the list.
143. Describing Numeric Variable Distributions
- Spread
- Range: the difference between the largest and smallest values.
- Standard Deviation: measures spread by looking at how far observations are from their mean. The formula for the sample standard deviation is $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, where $\bar{x}$ is the sample mean.
- Variance: the square of the standard deviation, $s^2$.
153. Describing Numeric Variable Distributions
- Spread
- Interquartile Range (IQR): IQR = Q3 - Q1
- The distance between the first quartile (Q1) and the third quartile (Q3).
- Q1: 25% of the observations are less than Q1 and 75% are greater than Q1. Equivalently, Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.
- Q3: 75% of the observations are less than Q3 and 25% are greater than Q3. Equivalently, Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.
- Note: to compute Q1 and Q3 for an odd number of observations, the center value (the overall median) is excluded from both halves.
163. Describing Numeric Variable Distributions
- Example: Suppose the ages of five students are 20, 18, 22, 20, 23.
- Mean = ? Median = ? Q1 = ? Q3 = ?
- Range = ? Std. dev. = ? Var. = ? IQR = ?
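One way to check the answers to this exercise is the short Python sketch below. The helper `quartiles` is a hypothetical name introduced here; it follows the slides' convention that Q1 and Q3 are the medians of the lower and upper halves of the ordered data, with the overall median excluded when the number of observations is odd. Library quartile routines often use other conventions and can give slightly different values.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves.

    With an odd number of observations the overall median is left out
    of both halves, as described on the slides. Needs at least 2 values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]      # values to the left of the median position
    upper = data[-half:]     # values to the right of the median position
    return statistics.median(lower), statistics.median(upper)

ages = [20, 18, 22, 20, 23]

mean = statistics.mean(ages)
median = statistics.median(ages)
q1, q3 = quartiles(ages)
data_range = max(ages) - min(ages)
variance = statistics.variance(ages)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(ages)      # sample standard deviation
iqr = q3 - q1

print(f"mean={mean}, median={median}, Q1={q1}, Q3={q3}")
print(f"range={data_range}, std dev={std_dev:.3f}, variance={variance:.1f}, IQR={iqr}")
```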
173. Describing Numeric Variable Distributions
- Examples for Q1 and Q3
- Even number of observations
- The highway mileages of the 18 cars, arranged in increasing order, are
- 13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30
- Find Q1 and Q3.
- Odd number of observations
- The highway mileages of the 11 trucks, arranged in increasing order, are
- 22 22 23 24 24 25 27 28 28 30 31
- Find Q1 and Q3.
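The two cases can be worked through with a small Python sketch. The function name `q1_q3` is an assumption introduced for illustration; it applies the median-of-each-half rule from the slides, which differs from the interpolation rules many software packages use.

```python
import statistics

def q1_q3(sorted_values):
    """Medians of the lower and upper halves of already-sorted data.

    For an odd count, integer division drops the middle value from
    both halves, matching the slides' rule.
    """
    half = len(sorted_values) // 2
    q1 = statistics.median(sorted_values[:half])
    q3 = statistics.median(sorted_values[-half:])
    return q1, q3

# Even number of observations: 18 cars (two halves of 9 values each).
cars = [13, 13, 16, 19, 21, 21, 23, 23, 24,
        26, 26, 27, 27, 27, 28, 28, 30, 30]

# Odd number of observations: 11 trucks (the median, 25, is excluded).
trucks = [22, 22, 23, 24, 24, 25, 27, 28, 28, 30, 31]

cars_q1, cars_q3 = q1_q3(cars)
trucks_q1, trucks_q3 = q1_q3(trucks)

print("cars:   Q1 =", cars_q1, " Q3 =", cars_q3)
print("trucks: Q1 =", trucks_q1, " Q3 =", trucks_q3)
```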
183. Describing Numeric Variable Distributions
- Another example in the book shows how much 50
consecutive shoppers spent in a store. The data
appear as follows
193. Describing Numeric Variable Distributions
- How can we describe the distribution of these 50
numbers?
203. Describing Numeric Variable Distributions
- How can we describe the distribution of these 50 numbers?
- The 50th percentile is also called the median: the middle data value when the data are ordered from smallest to largest.
- The 25th and 75th percentiles are also called Q1 and Q3, respectively: the middle data value of each half.
213. Describing Numeric Variable Distributions
- How can we describe the distribution of these 50 numbers?
- Stemplot (discard decimals)
- 1. Separate each observation into a stem and a leaf. Stems may have as many digits as needed, but each leaf contains only a single digit.
- 2. Write the stems in a vertical column.
- 3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
223. Describing Numeric Variable Distributions
- How can we describe the distribution of these 50 numbers?
- Histogram
- Breaks the range of values of a variable into intervals and displays only the count or percent of the observations in each interval.
233. Describing Numeric Variable Distributions
- How can we describe the distribution of these 50 numbers?
- Box Plot (made up of the min., Q1, median, Q3, and max.)
- These five numbers are called the five-number summary.
243. Describing Numeric Variable Distributions
- Outliers: observations that are unusually far from the bulk of the data.
- What are some possible explanations for outliers?
- The data point was recorded wrong.
- The data point wasn't actually a member of the population we were trying to sample.
- We just happened to get an extreme value in our sample.
- The 1.5 x IQR Criterion for Outliers: designate an observation a suspected outlier if it falls more than 1.5 x IQR below the first quartile or above the third quartile.
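The 1.5 x IQR criterion amounts to computing two cutoffs ("fences") and flagging anything outside them. The sketch below reuses the 18 car mileages from the earlier quartile example, with Q1 = 21 and Q3 = 27 obtained by the median-of-each-half rule; the function names and the extra test readings of 40 and 11 are hypothetical and only illustrate the check.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs of the 1.5 x IQR criterion."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_suspected_outlier(value, q1, q3):
    """True if value is more than 1.5 x IQR below Q1 or above Q3."""
    low, high = outlier_fences(q1, q3)
    return value < low or value > high

cars = [13, 13, 16, 19, 21, 21, 23, 23, 24,
        26, 26, 27, 27, 27, 28, 28, 30, 30]
Q1, Q3 = 21, 27  # from the median-of-each-half rule for these 18 values

low, high = outlier_fences(Q1, Q3)
flagged = [x for x in cars if is_suspected_outlier(x, Q1, Q3)]

print("fences:", low, high)
print("suspected outliers among the cars:", flagged)
# Hypothetical readings, compared against the same fences:
print("40 flagged?", is_suspected_outlier(40, Q1, Q3))
print("11 flagged?", is_suspected_outlier(11, Q1, Q3))
```

Even the smallest mileage (13) sits inside the lower fence, so this data set has no suspected outliers under the criterion.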
253. Describing Numeric Variable Distributions
- Now, we examine the appearance of other data
- This example is bimodal: it has two modes.
263. Describing Numeric Variable Distributions
- Now, we examine the appearance of other data
- This example has only one mode: it is unimodal.
273. Describing Numeric Variable Distributions
- Now, we examine the appearance of other data
- This example is called right skewed since the distribution has a long right tail.
283. Describing Numeric Variable Distributions
- Now, we examine the appearance of other data
- This is an example of a boxplot that is skewed to the right.
313. Describing Numeric Variable Distributions
- Mean versus Median for Different Distributions
- Left skewed: mean < median
- Symmetric: mean = median
- Right skewed: mean > median
323. Describing Numeric Variable Distributions
- Example
- Calculate the mean, median, std. dev., and IQR of these two sets of observations
- 3, 3, 5, 6, 7
- 3, 3, 5, 6, 7, 80
- Conclusion
- The mean and std. dev. cannot resist the influence of outliers. That is, the mean and std. dev. are not resistant.
- The median and IQR are better than the mean and std. dev. for describing a skewed distribution or a distribution with strong outliers. Use the mean and std. dev. only for reasonably symmetric distributions that are free of outliers.
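The effect of the single value 80 can be seen by computing the four statistics for both sets of observations. This is a minimal sketch; `summarize` is a hypothetical helper name, and the IQR uses the slides' median-of-each-half rule for quartiles.

```python
import statistics

def summarize(values):
    """Mean, median, sample std. dev., and IQR (median-of-halves quartiles)."""
    data = sorted(values)
    half = len(data) // 2
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[-half:])
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "std_dev": statistics.stdev(data),
        "iqr": q3 - q1,
    }

without_outlier = summarize([3, 3, 5, 6, 7])
with_outlier = summarize([3, 3, 5, 6, 7, 80])

for name in ("mean", "median", "std_dev", "iqr"):
    print(f"{name:>8}: {without_outlier[name]:8.3f} -> {with_outlier[name]:8.3f}")
```

Adding 80 moves the mean from 4.8 to about 17.3 and the standard deviation from about 1.8 to about 30.7, while the median only moves from 5 to 5.5 and the IQR from 3.5 to 4.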
333. Describing Numeric Variable Distributions
- Measures of Center and Spread
- To describe distributions use
- Left skewed: median and IQR
- Symmetric: mean and standard deviation
- Right skewed: median and IQR
343. Describing Numeric Variable Distributions
- Summary
- Shape (usually Histogram, Boxplot, and Stemplot)
- Mode: unimodal, bimodal?
- Symmetric? Left skewed? Right skewed?
- Center (Descriptives)
- Spread (Descriptives)
- Outlier (Boxplot)
354. The Normal Distribution
- Sometimes the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve.
- A density curve is a curve that is always on or above the horizontal axis and has area exactly 1 underneath it.
- A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values is the relative frequency of all observations that fall in that range.
364. The Normal Distribution
- Normal curves are density curves that are symmetric, unimodal, and bell-shaped. They describe normal distributions.
- A normal curve is specified by giving its mean $\mu$ and its standard deviation $\sigma$.
- We often write that a variable (call it X) has a normal distribution with mean $\mu$ and variance $\sigma^2$ in the following way: $X \sim N(\mu, \sigma^2)$.
374. The Normal Distribution
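- Here are some examples of normal distributions
- [Figure: three normal curves, $N(0, 1^2)$ with $\mu = 0$ and $\sigma = 1$, $N(3, 2^2)$ with $\mu = 3$ and $\sigma = 2$, and $N(-2, 0.5^2)$ with $\mu = -2$ and $\sigma = 0.5$]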
384. The Normal Distribution
- Empirical Rule (The 68-95-99.7 Rule): if the distribution is normal, then
- Approximately 68% of the data fall within one standard deviation of the mean
- Approximately 95% of the data fall within two standard deviations of the mean
- Approximately 99.7% of the data fall within three standard deviations of the mean
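The three percentages in the rule are rounded values of areas under the normal curve. As a quick check, the sketch below uses `statistics.NormalDist` from the Python standard library (available from Python 3.8) to compute the area within k standard deviations of the mean for k = 1, 2, 3; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Area between -k and +k standard deviations for k = 1, 2, 3.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, area in within.items():
    print(f"within {k} standard deviation(s): {100 * area:.2f}%")
```

The exact areas are roughly 68.27%, 95.45%, and 99.73%, which is why the rule is stated with "approximately".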
394. The Normal Distribution: Empirical Rule for $N(0, 1^2)$
404. The Normal Distribution
- If x is an observation from a distribution that has mean $\mu$ and standard deviation $\sigma$, the standardized value of x is $z = \frac{x - \mu}{\sigma}$.
- This is known as a z-score. A z-score is literally how many standard deviations an observation is from its mean. Z-scores are measures of relative standing.
- The standard normal distribution is the normal distribution with mean 0 and standard deviation 1: $Z \sim N(0, 1^2)$.
414. The Normal Distribution
- If a variable X has any normal distribution $N(\mu, \sigma^2)$, then the standardized variable $Z = \frac{X - \mu}{\sigma}$ has the standard normal distribution.
424. The Normal Distribution
- For $N(0, 1^2)$ we can find approximate probabilities associated with different values of Z using the Empirical Rule.
434. The Normal Distribution
- We can find the approximate probability that Z is to the left of any integer using the Empirical Rule.
- P(Z < -4.00) ≈ 0%, P(Z < -3.00) ≈ 0.15%, P(Z < -2.00) ≈ 2.5%, P(Z < -1.00) ≈ 16%
- P(Z < 0.00) ≈ 50%
- P(Z < 1.00) ≈ 84%, P(Z < 2.00) ≈ 97.5%, P(Z < 3.00) ≈ 99.85%, P(Z < 4.00) ≈ 100%
444. The Normal Distribution
- We can find the approximate probability that Z is to the right of any integer using symmetry.
- P(Z > z) = P(Z < -z)
- For example, P(Z > 3.00) = P(Z < -3.00)
- We can find the approximate probability that Z is between any two integers using the Empirical Rule.
- P(a < Z < b) = P(Z < b) - P(Z < a)
- Examples
- P(Z > 3.00) ≈ 0.15%, P(-2.00 < Z < 1.00) ≈ 81.5%
- P(Z > -1.00) ≈ 84%, P(2.00 < Z < 3.00) ≈ 2.35%
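The symmetry and subtraction rules are simple arithmetic on the left-tail percentages from the Empirical Rule. The sketch below stores those percentages (0, 0.15, 2.5, 16, 50, 84, 97.5, 99.85, 100 for z = -4 to 4) in a lookup table; the names `LEFT_TAIL`, `p_less`, `p_greater`, and `p_between` are hypothetical and used only for illustration.

```python
# Approximate left-tail percentages P(Z < z) for integer z,
# taken from the Empirical Rule values listed on the slides.
LEFT_TAIL = {
    -4: 0.0, -3: 0.15, -2: 2.5, -1: 16.0, 0: 50.0,
    1: 84.0, 2: 97.5, 3: 99.85, 4: 100.0,
}

def p_less(z):
    """Approximate P(Z < z), in percent, for an integer z from -4 to 4."""
    return LEFT_TAIL[z]

def p_greater(z):
    """Approximate P(Z > z) using symmetry: P(Z > z) = P(Z < -z)."""
    return LEFT_TAIL[-z]

def p_between(a, b):
    """Approximate P(a < Z < b) = P(Z < b) - P(Z < a)."""
    return LEFT_TAIL[b] - LEFT_TAIL[a]

print("P(Z > 3)      ~", p_greater(3), "%")
print("P(-2 < Z < 1) ~", p_between(-2, 1), "%")
print("P(Z > -1)     ~", p_greater(-1), "%")
print("P(2 < Z < 3)  ~", round(p_between(2, 3), 2), "%")
```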
454. The Normal Distribution: example
- P(Z < 1.25) = ?
- A. 0.4840 B. 0.8944 C. 0.9989 D. 0.1736
- P(Z > 0.25) = ?
- A. 0.7040 B. 0.0217 C. 0.8485 D. 0.4013
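For z-values that are not integers the Empirical Rule no longer applies directly, and a normal table or software is needed. As one possible check, the sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) in place of a printed table.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_left = Z.cdf(1.25)        # P(Z < 1.25): area to the left of 1.25
p_right = 1 - Z.cdf(0.25)   # P(Z > 0.25): area to the right of 0.25

print(f"P(Z < 1.25) = {p_left:.4f}")
print(f"P(Z > 0.25) = {p_right:.4f}")
```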
464. The Normal Distribution
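- Now suppose we know $X \sim N(\mu, \sigma^2)$ and we want to know P(X < x), P(X > x), and P(x1 < X < x2).
- We can first convert X to Z and then use the probabilities from the Empirical Rule.
- Recall that if $X \sim N(\mu, \sigma^2)$, then $Z = \frac{X - \mu}{\sigma} \sim N(0, 1^2)$
- so we have $P(X < x) = P\left(Z < \frac{x - \mu}{\sigma}\right)$, $P(X > x) = P\left(Z > \frac{x - \mu}{\sigma}\right)$, and $P(x_1 < X < x_2) = P\left(\frac{x_1 - \mu}{\sigma} < Z < \frac{x_2 - \mu}{\sigma}\right)$.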
474. The Normal Distribution: examples
- Suppose $X \sim N(3, 2^2)$. Find the probability that X is less than 5.
- Suppose $X \sim N(-1, 5^2)$. Find the probability that X is greater than 11.
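Both examples follow the same two steps: standardize the value, then look up a standard normal probability. The sketch below assumes `statistics.NormalDist` (Python 3.8 or later) as a stand-in for a normal table; `z_score` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

Z = NormalDist()  # standard normal

# X ~ N(3, 2^2): P(X < 5) = P(Z < (5 - 3) / 2)
z1 = z_score(5, mu=3, sigma=2)
p1 = Z.cdf(z1)

# X ~ N(-1, 5^2): P(X > 11) = P(Z > (11 - (-1)) / 5)
z2 = z_score(11, mu=-1, sigma=5)
p2 = 1 - Z.cdf(z2)

print(f"z = {z1}: P(X < 5)  = {p1:.4f}")
print(f"z = {z2}: P(X > 11) = {p2:.4f}")
```

The first z-score is exactly 1, so the Empirical Rule answer of about 84% agrees with the computed 0.8413. The second z-score is 2.4, which is not an integer, so it falls between the Empirical Rule values for z = 2 and z = 3.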
484. The Normal Distribution
- We will look at some more difficult examples
- Suppose $X \sim N(2, 3^2)$.
- Given a value z, find the corresponding x that it came from.
- Say z = 5; x = ?
- How many standard deviations is x from $\mu$?
- Say x = 10.
- Find Pr(X < -4 or X > 8).
- Find Pr(-4 < X < 8).
- Find the x such that Pr(X < x) ≈ 0.84
- Find the x such that Pr(X > x) ≈ 0.84
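These questions can be checked numerically. The sketch below is one way to do it, assuming `statistics.NormalDist` from the Python standard library (3.8 or later); it returns exact normal probabilities, so the results differ slightly from the rounded Empirical Rule answers.

```python
from statistics import NormalDist

X = NormalDist(mu=2, sigma=3)  # X ~ N(2, 3^2)

# From z back to x: x = mu + z * sigma.
x_from_z = X.mean + 5 * X.stdev

# From x to z: how many standard deviations is x = 10 from the mean?
z_of_10 = (10 - X.mean) / X.stdev

# -4 and 8 are two standard deviations below and above the mean.
p_inside = X.cdf(8) - X.cdf(-4)     # Pr(-4 < X < 8)
p_outside = 1 - p_inside            # Pr(X < -4 or X > 8)

# Inverse problems: find x from a given left-tail probability.
x_left_84 = X.inv_cdf(0.84)         # Pr(X < x) = 0.84
x_right_84 = X.inv_cdf(1 - 0.84)    # Pr(X > x) = 0.84

print(f"x for z = 5: {x_from_z}")
print(f"z for x = 10: {z_of_10:.2f}")
print(f"Pr(-4 < X < 8) = {p_inside:.4f}, Pr(X < -4 or X > 8) = {p_outside:.4f}")
print(f"x with Pr(X < x) = 0.84: {x_left_84:.2f}")
print(f"x with Pr(X > x) = 0.84: {x_right_84:.2f}")
```

With the Empirical Rule, 0.84 corresponds to z = 1, giving x = 5 and x = -1 for the last two questions; the computed values are within a few hundredths of those.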
49Go back to 3. Describing Numeric Variable Distributions
- How can we describe the distribution of these 50 numbers?
- Normal Quantile Plot (this compares the distribution of the sample to the Normal Distribution)
- The straight line represents a normal distribution; compare the dots to the line.
50Go back to 3. Describing Numeric Variable Distributions
- Summary
- Shape (Histogram, Boxplot, Stemplot, and Normal Quantile Plot)
- Mode: unimodal, bimodal?
- Symmetric? Left skewed? Right skewed?
- Are the data normally distributed?
- Center (Descriptives)
- Spread (Descriptives)
- Outlier (Boxplot)
515. Distribution Properties
- Shift Changes: adding or subtracting a number from each of the values. If c > 0, adding c moves the mean to mean + c, and subtracting c moves the mean to mean - c.
525. Distribution Properties
- The mean, median, Q1, Q3, maximum, and minimum all shift when there is a shift change. The shift change, say c, is added to or subtracted from each of the statistics accordingly.
- The measures of spread (standard deviation, variance, IQR, and range) do not change when there is a shift change.
535. Distribution Properties
- Scale Changes: multiplying or dividing each of the values by a number. If c > 1, multiplying by c moves the mean to mean x c, and dividing by c moves the mean to mean / c.
545. Distribution Properties
- The mean, median, Q1, Q3, maximum, and minimum all change when there is a scale change unless they are zero. Each is multiplied or divided by the scale change c.
- The measures of spread (standard deviation, variance, IQR, and range) always change when there is a scale change. The standard deviation, IQR, and range are multiplied or divided by the scale change c. The variance is multiplied or divided by $c^2$.
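The shift and scale properties can be verified on a small data set. The sketch below reuses the five student ages from the earlier example; the shift of 10 and the scale factor of 2 are arbitrary values chosen for illustration.

```python
import math
import statistics

data = [20, 18, 22, 20, 23]
shift = 10   # arbitrary shift change, c = 10
scale = 2    # arbitrary scale change, c = 2

shifted = [x + shift for x in data]
scaled = [x * scale for x in data]

def describe(values):
    """Mean, range, sample std. dev., and sample variance of the values."""
    return {
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "std_dev": statistics.stdev(values),
        "variance": statistics.variance(values),
    }

original = describe(data)
after_shift = describe(shifted)
after_scale = describe(scaled)

# Shift: the center moves by c, the measures of spread are unchanged.
print("shift:", original["mean"], "->", after_shift["mean"],
      "| std dev unchanged:",
      math.isclose(original["std_dev"], after_shift["std_dev"]))

# Scale: center, range, and std. dev. are multiplied by c; variance by c^2.
print("scale:", original["mean"], "->", after_scale["mean"],
      "| variance ratio:", after_scale["variance"] / original["variance"])
```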
555. Distribution Properties
- Suppose we measure the weight of everyone on a football team and obtain the following statistics for a team report
- Mean = 230 lbs., Median = 240 lbs.
- Std. Dev. = 50 lbs., Q1 = 200 lbs., Q3 = 280 lbs.
- Variance = 2500 lbs.², IQR = 80 lbs.
- Min. = 170 lbs., Range = 180 lbs.
- Max. = 350 lbs.
565. Distribution Properties
- Now suppose we found out the scale was reading 10 lbs. too low, so we need to add 10 lbs. to every weight. What would happen to each of the following statistics?
Original | After Shift Change
Mean = 230 lbs. | Mean = ________
Median = 240 lbs. | Median = ________
s = 50 lbs. | s = ________
Q1 = 200 lbs. | Q1 = ________
Q3 = 280 lbs. | Q3 = ________
575. Distribution Properties
- Now suppose we found out the scale was reading 10 lbs. too low, so we need to add 10 lbs. to every weight. What would happen to each of the following statistics?
Original | After Shift Change
Variance = 2500 lbs.² | Variance = ________
IQR = 80 lbs. | IQR = ________
Min = 170 lbs. | Min = ________
Max = 350 lbs. | Max = ________
Range = 180 lbs. | Range = ________
585. Distribution Properties
- Further, suppose we found out that we are supposed to report the weights and statistics in kilograms, not lbs. (Remember, 1 lb ≈ 0.45 kilograms.) What would happen to each of the following statistics?
After Shift Change | After Shift and Scale Change
Mean = 240 lbs. | Mean = ______________
Median = 250 lbs. | Median = ______________
s = 50 lbs. | s = ______________
Q1 = 210 lbs. | Q1 = ______________
Q3 = 290 lbs. | Q3 = ______________
595. Distribution Properties
- Further, suppose we found out that we are supposed to report the weights and statistics in kilograms, not lbs. (Remember, 1 lb ≈ 0.45 kilograms.) What would happen to each of the following statistics?
After Shift Change | After Shift and Scale Change
Variance = 2500 lbs.² | Variance = ______________
IQR = 80 lbs. | IQR = ______________
Min = 180 lbs. | Min = ______________
Max = 360 lbs. | Max = ______________
Range = 180 lbs. | Range = ______________
60Linear Transformations
- If you are given a mean, $\bar{x}$ (or $\mu$), and a standard deviation, s (or $\sigma$), and want to convert your data so you have a new mean, $\bar{x}_{new}$ (or $\mu_{new}$), and a new standard deviation, $s_{new}$ (or $\sigma_{new}$), all you need is to remember what shift and scale changes affect.
- In our linear transformation formula, $x_{new} = a + b x$,
- a is the shift change
- b is the scale change
- Standard deviations are affected only by scale changes, but means are affected by both shift and scale changes.
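Because only the scale change affects the standard deviation, b can be found first from the two standard deviations, and a then follows from the means. The sketch below assumes the transformation has the form new value = a + b x (old value) with b > 0; the function name `linear_transformation` is hypothetical. It uses the football team's mean of 230 lbs. and standard deviation of 50 lbs., with a target mean of 0 and standard deviation of 1, which is the z-score standardization.

```python
import math

def linear_transformation(mean, sd, new_mean, new_sd):
    """Return (a, b) so that a + b * x has the requested mean and std. dev.

    Assumes b > 0. The scale change b is fixed by the standard deviations
    alone; the shift change a then corrects the mean.
    """
    b = new_sd / sd              # new_sd = b * sd
    a = new_mean - b * mean      # new_mean = a + b * mean
    return a, b

# Football team statistics from the slides: mean 230 lbs., std. dev. 50 lbs.
# Target: mean 0 and std. dev. 1 (standardizing to z-scores).
a, b = linear_transformation(mean=230, sd=50, new_mean=0, new_sd=1)

print(f"a = {a:.2f}, b = {b:.2f}")
print("new mean:", round(a + b * 230, 10))
print("new std. dev.:", b * 50)

# A 280 lb. player is one standard deviation above the mean.
print("z-score for 280 lbs.:", round(a + b * 280, 10))
```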