Understanding Variability - PowerPoint PPT Presentation

About This Presentation
Title:

Understanding Variability

Description:

Title: PowerPoint Presentation Author: Ron Kenett Last modified by: Maskit Rubinstein Created Date: 9/15/2000 6:51:24 AM Document presentation format – PowerPoint PPT presentation

Slides: 56
Provided by: RonK166

Transcript and Presenter's Notes

Title: Understanding Variability


1
Understanding Variability
Instructor: Ron S. Kenett
Email: ron_at_kpa.co.il
Course website: www.kpa.co.il/biostat
Course textbook: MODERN INDUSTRIAL STATISTICS,
Kenett and Zacks, Duxbury Press, 1998
2
Course Syllabus
  • Understanding Variability
  • Variability in Several Dimensions
  • Basic Models of Probability
  • Sampling for Estimation of Population Quantities
  • Parametric Statistical Inference
  • Computer Intensive Techniques
  • Multiple Linear Regression
  • Statistical Process Control
  • Design of Experiments

3
Discrete Data
A set of data is said to be discrete if the
values / observations belonging to it are distinct
and separate. That is, they can be counted
(1, 2, 3, ...). For example: the number of kittens
in a litter; the number of patients in a doctor's
surgery; the number of flaws in one metre of
cloth; gender (male, female); blood group
(O, A, B, AB).
4
Continuous Data
A set of data is said to be continuous if the
values / observations belonging to it may take on
any value within a finite or infinite interval.
You can count, order and measure continuous data.
For example: height; weight; temperature; the
amount of sugar in an orange; the time required
to run a mile.
5
Types of Variables
  • Qualitative Variables
    • Attributes, categories
    • Examples: male/female, registered to vote/not,
      ethnicity, eye color
  • Quantitative Variables
    • Discrete - usually take on integer values but
      can take on fractions when the variable allows -
      counts, how many
    • Continuous - can take on any value at any point
      along an interval - measurements, how much

6
Self Assessment Test
For each of the following, indicate whether
the appropriate variable would be qualitative or
quantitative. If the variable is quantitative,
indicate whether it would be discrete or
continuous.
7
Self Assessment Test
  • a) Whether you own an RCA Colortrak television
    set
    • Qualitative Variable
    • two levels: yes/no
    • no measurement
  • b) Your status as a full-time or a part-time
    student
    • Qualitative Variable
    • two levels: full/part
    • no measurement
  • c) Number of people who attended your school's
    graduation last year
    • Quantitative, Discrete Variable
    • a countable number
    • only whole numbers

8
Self Assessment Test
  • d) The price of your most recent haircut
    • Quantitative, Discrete Variable
    • a countable number
    • only whole numbers
  • e) Sam's travel time from his dorm to the Student
    Union
    • Quantitative, Continuous Variable
    • any number
    • time is measured
    • can take on any value greater than zero

9
Self Assessment Test
  • f) The number of students on campus who belong to
    a social fraternity or sorority
    • Quantitative, Discrete Variable
    • a countable number
    • only whole numbers

10
Scales of Measurement
  • Nominal Scale - Labels represent various levels
    of a categorical variable.
  • Ordinal Scale - Labels represent an order that
    indicates either preference or ranking.
  • Interval Scale - Numerical labels indicate order
    and distance between elements. There is no
    absolute zero and multiples of measures are not
    meaningful.
  • Ratio Scale - Numerical labels indicate order and
    distance between elements. There is an absolute
    zero and multiples of measures are meaningful.

11
Self Assessment Test
Bill scored 1200 on the Scholastic Aptitude Test
and entered college as a physics major. As a
freshman, he changed to business because he
thought it was more interesting. Because he made
the dean's list last semester, his parents gave
him $30 to buy a new Casio calculator. Identify
at least one piece of information in the
12
Self Assessment Test
  • a) nominal scale of measurement.
  • 1. Bill is going to college.
  • 2. Bill will buy a Casio calculator.
  • 3. Bill was a physics major.
  • 4. Bill is a business major.
  • 5. Bill was on the dean's list.

13
Self Assessment Test
  • b) ordinal scale of measurement
    • Bill is a freshman.
  • c) interval scale of measurement
    • Bill earned a 1200 on the SAT.
  • d) ratio scale of measurement
    • Bill's parents gave him $30.

15
Histogram
A histogram is a way of summarising data that are
measured on an interval scale (either discrete or
continuous). It is often used in exploratory data
analysis to illustrate the major features of the
distribution of the data in a convenient form. It
divides up the range of possible values in a data
set into classes or groups. For each group, a
rectangle is constructed with a base length equal
to the range of values in that specific group, and
an area proportional to the number of observations
falling into that group. This means that the
rectangles might be drawn of non-uniform height.
16
Key Terms
  • Data array
  • An orderly presentation of data in either
    ascending or descending numerical order.
  • Frequency Distribution
  • A table that represents the data in classes and
    that shows the number of observations in each
    class.

17
Key Terms
  • Frequency Distribution
  • Class - The category
  • Frequency - Number in each class
  • Class limits - Boundaries for each class
  • Class interval - Width of each class
  • Class mark - Midpoint of each class

18
Sturges' Rule
  • How to set the approximate number of classes to
    begin constructing a frequency distribution:
  • $k = 1 + 3.322 \log_{10} n$
  • where k = approximate number of classes to use
    and
  • n = the number of observations in the data set.
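
A minimal sketch of Sturges' rule in Python, assuming the common form k = 1 + 3.322 log10(n); the function name `sturges_classes` is an illustrative choice, not something from the course material.

```python
import math


def sturges_classes(n: int) -> float:
    """Approximate number of classes suggested by Sturges' rule.

    Assumes the common form k = 1 + 3.322 * log10(n).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 + 3.322 * math.log10(n)


k = sturges_classes(50)
print(round(k, 2))   # about 6.64
print(math.ceil(k))  # rounds up to 7 classes
```

The result is only a starting point; the number of classes is usually adjusted afterwards so that the class interval comes out as a convenient round value.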

19
Frequency Distributions
1. Number of classes: Choose an approximate
   number of classes for your data. Sturges' rule
   can help.
2. Estimate the class interval: Divide the
   approximate number of classes (from Step 1) into
   the range of your data to find the approximate
   class interval, where the range is defined as
   the largest data value minus the smallest data
   value.
3. Determine the class interval: Round the
   estimate (from Step 2) to a convenient value.
20
Frequency Distributions
4. Lower Class Limit: Determine the lower class
   limit for the first class by selecting a
   convenient number that is smaller than the
   lowest data value.
5. Class Limits: Determine the other class limits
   by repeatedly adding the class width (from
   Step 3) to the prior class limit, starting with
   the lower class limit (from Step 4).
6. Define the classes: Use the sequence of class
   limits to define the classes.
21
Relative Frequency Distributions
1. Retain the same classes defined in the
   frequency distribution.
2. Sum the total number of observations across all
   classes of the frequency distribution.
3. Divide the frequency for each class by the
   total number of observations, forming the
   proportion (or, multiplied by 100, the
   percentage) of data values in each class.
22
Cumulative Relative Frequency Distributions
1. List the number of observations in the lowest
   class.
2. Add the frequency of the lowest class to the
   frequency of the second class. Record that
   cumulative sum for the second class.
3. Continue to add the prior cumulative sum to the
   frequency for each subsequent class, so that the
   cumulative sum for the final class is the total
   number of observations in the data set.
23
Cumulative Relative Frequency Distributions
  • 4. Divide the accumulated frequencies for each
    class by the total number of observations,
    giving you the percent of all observations that
    occurred up to and including that class.
  • An alternative: accrue the relative frequencies
    for each class instead of the raw frequencies.
    Then you don't have to divide by the total to
    get percentages.

24
Example
  • The average daily cost to community hospitals for
    patient stays during 1993 for each of the 50 U.S.
    states was given in the next table.
  • a) Arrange these into a data array.
  • b) Construct a stem-and-leaf display.
  • c) Approximately how many classes would be
    appropriate for these data?
  • d) Construct a frequency distribution. State
    interval width and class mark.
  • e) Construct a histogram, a relative frequency
    distribution, and a cumulative relative frequency
    distribution.

25
Example Data List
AL    775   HI   823   MA 1,036   NM 1,046   SD   506
AK  1,136   ID   659   MI   902   NY   784   TN   859
AZ  1,091   IL   917   MN   652   NC   763   TX 1,010
AR    678   IN   898   MS   555   ND   507   UT 1,081
CA  1,221   IA   612   MO   863   OH   940   VT   676
CO    961   KS   666   MT   482   OK   797   VA   830
CT  1,058   KY   703   NE   626   OR 1,052   WA 1,143
DE  1,024   LA   875   NV   900   PA   861   WV   701
FL    960   ME   738   NH   976   RI   885   WI   744
GA    775   MD   889   NJ   829   SC   838   WY   537
26
Example Data Array
CA  1,221   TX 1,010   RI   885   NY   784   KS   666
WA  1,143   NH   976   LA   875   AL   775   ID   659
AK  1,136   CO   961   MO   863   GA   775   MN   652
AZ  1,091   FL   960   PA   861   NC   763   NE   626
UT  1,081   OH   940   TN   859   WI   744   IA   612
CT  1,058   IL   917   SC   838   ME   738   MS   555
OR  1,052   MI   902   VA   830   KY   703   WY   537
NM  1,046   NV   900   NJ   829   WV   701   ND   507
MA  1,036   IN   898   HI   823   AR   678   SD   506
DE  1,024   MD   889   OK   797   VT   676   MT   482
27
Example Stem and Leaf Display
Stem-and-Leaf Display   N = 50   Leaf Unit = 100
   1   12   21
   2   11   43, 36
   8   10   91, 81, 58, 52, 46, 36, 24, 10
   7    9   76, 61, 60, 40, 17, 02, 00
 (11)   8   98, 89, 85, 75, 63, 61, 59, 38, 30, 29, 23
   9    7   97, 84, 75, 75, 63, 44, 38, 03, 01
   7    6   78, 76, 66, 59, 52, 26, 12
   4    5   55, 37, 07, 06
   1    4   82
Range: 482 - 1,221
28
Example Frequency Distribution
  • To approximate the number of classes we should
    use in creating the frequency distribution, use
    Sturges' Rule with n = 50:
  • $k = 1 + 3.322 \log_{10} 50 = 1 + 3.322(1.699) \approx 6.64$
  • Sturges' rule suggests we use approximately 7
    classes.

29
Example Frequency Distribution
  • Step 1. Number of classes
  • Sturges' Rule: approximately 7 classes.
  • The range is 1,221 - 482 = 739
  • 739/7 ≈ 106 and 739/8 ≈ 92
  • Steps 2 & 3. The Class Interval
  • So, if we use 8 classes, we can make each class
    100 wide.

31
Example Frequency Distribution
  • Step 4. The Lower Class Limit
  • If we start at 450, we can cover the range in 8
    classes, each class 100 in width.
  • The first class: 450 up to 550
  • Steps 5 & 6. Setting Class Limits
  • 450 up to 550
  • 550 up to 650
  • 650 up to 750
  • 750 up to 850
  • 850 up to 950
  • 950 up to 1,050
  • 1,050 up to 1,150
  • 1,150 up to 1,250

32
Example Frequency Distribution
Average daily cost      Number   Mark
  450 under   550          4      500
  550 under   650          3      600
  650 under   750          9      700
  750 under   850          9      800
  850 under   950         11      900
  950 under 1,050          7    1,000
1,050 under 1,150          6    1,100
1,150 under 1,250          1    1,200
Interval width: 100
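
The class counts above can be reproduced with a short Python sketch. The data values are the 50 state figures from the example data list; the variable names and the integer-division binning are illustrative choices, and the binning assumes classes of equal width that start at 450.

```python
# Average daily costs for the 50 states, from the example data list.
costs = [775, 823, 1036, 1046, 506, 1136, 659, 902, 784, 859,
         1091, 917, 652, 763, 1010, 678, 898, 555, 507, 1081,
         1221, 612, 863, 940, 676, 961, 666, 482, 797, 830,
         1058, 703, 626, 1052, 1143, 1024, 875, 900, 861, 701,
         960, 738, 976, 885, 744, 775, 889, 829, 838, 537]

lower, width, num_classes = 450, 100, 8

# Each value falls in class (x - lower) // width, so a class covers
# "lower limit up to, but not including, the next lower limit".
counts = [0] * num_classes
for x in costs:
    counts[(x - lower) // width] += 1

for i, count in enumerate(counts):
    lo = lower + i * width
    mark = lo + width // 2  # class mark = midpoint of the class
    print(f"{lo:>5} under {lo + width:>5}  {count:>3}  {mark:>5}")
```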
33
Example Histogram
34
Example Relative Frequency Distribution
Average daily cost      Number   Rel. Freq.
  450 under   550          4      4/50 = .08
  550 under   650          3      3/50 = .06
  650 under   750          9      9/50 = .18
  750 under   850          9      9/50 = .18
  850 under   950         11     11/50 = .22
  950 under 1,050          7      7/50 = .14
1,050 under 1,150          6      6/50 = .12
1,150 under 1,250          1      1/50 = .02
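
As a quick check of the relative frequencies, here is a minimal Python sketch that divides each class count by the total; the list `counts` simply restates the class frequencies from the example.

```python
# Class frequencies from the example frequency distribution.
counts = [4, 3, 9, 9, 11, 7, 6, 1]

total = sum(counts)                      # 50 observations
rel_freq = [c / total for c in counts]   # proportion in each class

print([round(r, 2) for r in rel_freq])
# [0.08, 0.06, 0.18, 0.18, 0.22, 0.14, 0.12, 0.02]
```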
35
Example Polygon
36
Example Cumulative Frequency Distribution
Average daily cost      Number   Cum. Freq.
  450 under   550          4        4
  550 under   650          3        7
  650 under   750          9       16
  750 under   850          9       25
  850 under   950         11       36
  950 under 1,050          7       43
1,050 under 1,150          6       49
1,150 under 1,250          1       50
37
Example Cumulative Relative Frequency Distribution
Average daily cost      Cum. Freq.   Cum. Rel. Freq.
  450 under   550          4          4/50 = .08
  550 under   650          7          7/50 = .14
  650 under   750         16         16/50 = .32
  750 under   850         25         25/50 = .50
  850 under   950         36         36/50 = .72
  950 under 1,050         43         43/50 = .86
1,050 under 1,150         49         49/50 = .98
1,150 under 1,250         50         50/50 = 1.00
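
The running totals can be sketched in Python with `itertools.accumulate` from the standard library; the class frequencies are restated from the example, and the variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies from the example frequency distribution.
counts = [4, 3, 9, 9, 11, 7, 6, 1]

cum_freq = list(accumulate(counts))        # running totals per class
total = cum_freq[-1]                       # last total = all observations
cum_rel_freq = [c / total for c in cum_freq]

print(cum_freq)
# [4, 7, 16, 25, 36, 43, 49, 50]
print([round(r, 2) for r in cum_rel_freq])
# [0.08, 0.14, 0.32, 0.5, 0.72, 0.86, 0.98, 1.0]
```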
38
Example Percentage Ogive
39
Statistical Description of Data
40
Key Terms
  • Measures of Central Tendency,
  • The Center
  • Mean
  • µ, population; $\bar{x}$, sample
  • Weighted Mean
  • Median
  • Mode

41
Key Terms
  • Measures of Dispersion,
  • The Spread
  • Range
  • Mean absolute deviation
  • Variance
  • Standard deviation
  • Interquartile range
  • Interquartile deviation
  • Coefficient of variation

42
Key Terms
  • Measures of Relative Position
  • Quantiles
  • Quartiles
  • Deciles
  • Percentiles
  • Residuals
  • Standardized values

43
The Mean
  • Mean
  • Arithmetic average = (sum of all values)/(number
    of values)
  • Population: $\mu = (\sum x_i)/N$
  • Sample: $\bar{x} = (\sum x_i)/n$
  • Problem: Calculate the average number of truck
    shipments from the United States to five Canadian
    cities for the following data given in thousands
    of bags:
  • Montreal, 64.0; Ottawa, 15.0; Toronto, 285.0;
  • Vancouver, 228.0; Winnipeg, 45.0
  • (Ans: 127.4)
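
The answer to the problem can be checked with a few lines of Python; the dictionary layout and variable names are illustrative.

```python
# Shipments in thousands of bags, as given in the problem.
shipments = {
    "Montreal": 64.0,
    "Ottawa": 15.0,
    "Toronto": 285.0,
    "Vancouver": 228.0,
    "Winnipeg": 45.0,
}

# Arithmetic mean: sum of all values divided by the number of values.
mean_shipments = sum(shipments.values()) / len(shipments)
print(f"{mean_shipments:.1f}")  # 127.4
```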

44
The Weighted Mean
  • When what you have is grouped data, compute the
    mean using $\mu = (\sum w_i x_i)/\sum w_i$
  • Problem: Calculate the average profit from truck
    shipments, United States to Canada, for the
    following data given in thousands of bags and
    profits per thousand bags:
  • Montreal: 64.0 at 15.00; Ottawa: 15.0 at 13.50;
    Toronto: 285.0 at 15.50
  • Vancouver: 228.0 at 12.00; Winnipeg: 45.0 at
    14.00
  • (Ans: 14.04 per thousand bags)
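
A minimal Python sketch of the weighted mean for this problem, using the shipment volumes as weights and the profit per thousand bags as the values being averaged; the variable names are illustrative.

```python
# Weights: shipments in thousands of bags
# (Montreal, Ottawa, Toronto, Vancouver, Winnipeg).
bags = [64.0, 15.0, 285.0, 228.0, 45.0]
# Values: profit per thousand bags for the same cities.
profit = [15.00, 13.50, 15.50, 12.00, 14.00]

# Weighted mean = sum(w_i * x_i) / sum(w_i)
weighted_mean = sum(w * x for w, x in zip(bags, profit)) / sum(bags)
print(f"{weighted_mean:.2f}")  # 14.04
```

An unweighted average of the five profit figures would give 14.00, which ignores how much is shipped to each city.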

45
The Median
  • To find the median
  • 1. Put the data in an array.
  • 2A. If the data set has an ODD number of numbers,
    the median is the middle value.
  • 2B. If the data set has an EVEN number of
    numbers, the median is the AVERAGE of the middle
    two values.
  • (Note that the median of an even set of data
    values is not necessarily a member of the set of
    values.)
  • The median is particularly useful if there are
    outliers in the data set, which otherwise tend to
    sway the value of an arithmetic mean.
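
The two cases (odd and even number of values) can be sketched in Python as follows. The function below is an illustrative implementation, and the five shipment figures from the mean problem are reused only as sample input.

```python
def median(values):
    """Median of a list: middle value, or average of the two middle values."""
    data = sorted(values)          # step 1: put the data in an array
    n = len(data)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                 # step 2A: odd number of values
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2   # step 2B: even number


shipments = [64.0, 15.0, 285.0, 228.0, 45.0]
print(median(shipments))              # 64.0
print(median(shipments + [1000.0]))   # 146.0, not a member of the data set
```

For the five shipment values the median (64.0) is well below the mean (127.4), because the two large values pull the mean upward.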

46
The Mode
  • The mode is the most frequent value.
  • While there is just one value for the mean and
    one value for the median, there may be more than
    one value for the mode of a data set.
  • The mode tends to be less frequently used than
    the mean or the median.

47
Comparing Measures of Central Tendency
  • If mean = median = mode, the shape of the
    distribution is symmetric.
  • If mode < median < mean or if mean > median >
    mode, the shape of the distribution trails to
    the right; it is positively skewed.
  • If mean < median < mode or if mode > median >
    mean, the shape of the distribution trails to
    the left; it is negatively skewed.

48
The Range
  • The range is the distance between the smallest
    and the largest data value in the set.
  • Range = largest value - smallest value
  • Sometimes range is reported as an interval,
    anchored between the smallest and largest data
    value, rather than the actual width of that
    interval.

49
Residuals
  • Residuals are the differences between each data
    value in the set and the group mean
  • for a population, $x_i - \mu$
  • for a sample, $x_i - \bar{x}$

50
The MAD
  • The mean absolute deviation is found by summing
    the absolute values of all residuals and dividing
    by the number of values in the set
  • for a population, $MAD = (\sum |x_i - \mu|)/N$
  • for a sample, $MAD = (\sum |x_i - \bar{x}|)/n$
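
A minimal Python sketch of residuals and the mean absolute deviation. The function names are illustrative, and the five shipment figures from the mean problem are reused only as sample input.

```python
def residuals(values):
    """Differences between each data value and the group mean."""
    center = sum(values) / len(values)
    return [x - center for x in values]


def mean_absolute_deviation(values):
    """Sum of the absolute residuals divided by the number of values."""
    return sum(abs(r) for r in residuals(values)) / len(values)


shipments = [64.0, 15.0, 285.0, 228.0, 45.0]
print([round(r, 1) for r in residuals(shipments)])
# [-63.4, -112.4, 157.6, 100.6, -82.4]
print(round(mean_absolute_deviation(shipments), 2))  # 103.28
```

The residuals always sum to zero, which is why the absolute values are taken before averaging.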

51
The Variance
  • Variance is one of the most frequently used
    measures of spread,
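  • for population,
    $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{\sum x_i^2 - N\mu^2}{N}$
  • for sample,
    $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1}$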
  • The right side of each equation is often used as
    a computational shortcut.

52
The Standard Deviation
  • Since variance is given in squared units, we
    often find uses for the standard deviation, which
    is the square root of variance
  • for a population,
    $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
  • for a sample,
    $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
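
A sketch of variance and standard deviation in Python, assuming the standard definitions (divisor N for a population, n - 1 for a sample) and the usual sum-of-squares computational shortcut. Function names are illustrative, and the shipment figures from the mean problem serve only as sample input.

```python
import math


def population_variance(values):
    """Average squared residual, dividing by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared residuals divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Same quantity via (sum of x^2 - n * xbar^2) / (n - 1)."""
    n = len(values)
    xbar = sum(values) / n
    return (sum(x * x for x in values) - n * xbar ** 2) / (n - 1)


shipments = [64.0, 15.0, 285.0, 228.0, 45.0]
print(round(population_variance(shipments), 2))             # 11680.24
print(round(sample_variance(shipments), 2))                 # 14600.3
print(round(math.sqrt(population_variance(shipments)), 2))  # population SD
print(round(math.sqrt(sample_variance(shipments)), 2))      # sample SD
```

The shortcut form avoids computing each residual, but it can lose precision in floating point when the values are large relative to their spread.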

53
Quartiles
  • One of the most frequently used quantiles is the
    quartile.
  • Quartiles divide the values of a data set into
    four subsets of equal size, each comprising 25%
    of the observations.
  • To find the first, second, and third quartiles:
  • 1. Arrange the N data values into an array.
  • 2. First quartile, Q1 = data value at position
    (N + 1)/4
  • 3. Second quartile, Q2 = data value at position
    2(N + 1)/4
  • 4. Third quartile, Q3 = data value at position
    3(N + 1)/4
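
A Python sketch of the position rule. The slide does not say what to do when a position such as (N + 1)/4 is not a whole number; linear interpolation between the two neighbouring values is assumed here, and the function names are illustrative. The data are sorted in ascending order so that Q1 is the lowest quartile.

```python
def value_at_position(sorted_values, pos):
    """Value at 1-based position pos in an ascending array.

    Fractional positions are handled by linear interpolation between
    the neighbouring values (an assumption, not stated in the slides).
    """
    whole = int(pos)
    frac = pos - whole
    if whole < 1:
        return sorted_values[0]
    if whole >= len(sorted_values):
        return sorted_values[-1]
    below = sorted_values[whole - 1]
    above = sorted_values[whole]
    return below + frac * (above - below)


def quartiles(values):
    """Q1, Q2, Q3 at positions (N + 1)/4, 2(N + 1)/4, 3(N + 1)/4."""
    data = sorted(values)
    n = len(data)
    return tuple(value_at_position(data, q * (n + 1) / 4) for q in (1, 2, 3))


print(quartiles([1, 2, 3, 4, 5, 6, 7]))             # (2, 4, 6)
print(quartiles([64.0, 15.0, 285.0, 228.0, 45.0]))  # (30.0, 64.0, 256.5)
```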

54
Quartiles
55
Standardized Values
  • How far above or below the individual value is
    compared to the population mean, in units of
    standard deviation
  • "How far above or below" = (data value - mean),
    which is the residual
  • "In units of standard deviation" = divided by σ
  • Standardized individual value: $z = (x - \mu)/\sigma$
  • A negative z means the data value falls below
    the mean.
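
A minimal Python sketch of standardized values, assuming the data set is treated as a whole population (so the standard deviation divides by N). The function name is illustrative, and the shipment figures from the mean problem are reused only as sample input.

```python
import math


def z_scores(values):
    """Standardize each value: residual divided by the population SD."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    if sigma == 0:
        raise ValueError("all values are equal; z-scores are undefined")
    return [(x - mu) / sigma for x in values]


cities = ["Montreal", "Ottawa", "Toronto", "Vancouver", "Winnipeg"]
shipments = [64.0, 15.0, 285.0, 228.0, 45.0]

for city, z in zip(cities, z_scores(shipments)):
    position = "below" if z < 0 else "above"
    print(f"{city}: z = {z:+.2f} ({position} the mean)")
```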