Title: Course Lecture: Analytic Techniques
1 Course Lecture Analytic Techniques
Descriptive Statistics
2 Terminology
- Population: the complete collection of elements to be
studied
- Sample: a sub-collection of elements drawn from the
population
- Parameters: population characteristics,
- e.g. mean, median, variance
- Estimates: sample statistics that estimate the
population parameters, e.g. sample mean, sample
median, sample variance.
3 Terminology: Levels of measurement
- Nominal
- names, labels or categories only
- E.g. marital status, gender, ethnic groups
- Ordinal
- can be arranged in some order
- (but the difference between data values is
meaningless)
- E.g. level of education; qualitative evaluation:
good, very good, excellent
- Interval
- like ordinal but the difference between values
is meaningful
- (however there is no natural zero starting point)
- E.g. dates
- Ratio
- like interval but there is a natural zero
starting point
- (meaning that, say, doubling the number is
meaningful)
- E.g. distance, speed, income
4 MEASURES OF LOCATION
- A measure of central tendency is probably the
most common number used to describe a data set.
It gives us some idea of what the average,
middle or most frequently occurring number in the
data set is.
- There are many different measures; in this unit
we will look at only the three most important
ones.
5 MEASURES OF CENTRAL TENDENCY
1. Mean: $\mu$ (population) or $\bar{x}$ (sample)
2. Mode: the most common value
3. Median: the middle value, i.e. the $\frac{N+1}{2}$th value
when the data are arranged in order
They can be calculated even when the data have been grouped.
Note: the sample mean and the population mean are
different quantities.
6 The arithmetic mean
- Usually referred to as simply the mean
- in a sample is called x bar, symbol $\bar{x}$
- in a population is called mu (pronounced "mew", as
in "dew"), symbol $\mu$
7 The arithmetic mean (Cont.)
- The mean is the sum of all the values, divided by
the number of values, i.e. for a sample with n
observations, the mean would be
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
8 The arithmetic mean (Cont.)
- The sample mean $\bar{x}$ is often used as an estimate of
the population mean $\mu$
- Because the mean is calculated by summing every
observation, it is greatly affected by any
extreme values, and as such can present a
distorted representation of the data.
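The sensitivity of the mean to extreme values can be seen in a minimal Python sketch. The income figures and the helper name `mean` are made up for illustration; they are not data from the lecture.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

# Hypothetical incomes in thousands (illustrative only, not lecture data).
typical = [30, 32, 35, 38, 40]
with_outlier = typical + [400]       # the same data plus one extreme value

mean_typical = mean(typical)         # 175 / 5
mean_outlier = mean(with_outlier)    # 575 / 6

print(mean_typical, round(mean_outlier, 2))
```

With the single extreme value included, the mean ends up larger than every one of the five typical observations, which is the distortion described above.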
9 The Median
- When the data are not unimodal and symmetrical
(i.e. skewed), the median is preferred
- The median is the middle value when the data are
arranged in order, i.e. there are an equal number
of observations above and below the median
10 The Median (Cont.)
- If there is an odd number of values, the median
is the value of the middle observation.
Otherwise it lies between the two middle
values, and is generally calculated as the average
of these two numbers
11 The Median (Cont.)
- arrange the data in order (decreasing or
increasing)
- locate the middle value using the formula
- middle value = $\frac{N+1}{2}$th value
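The two steps (order the data, then locate the middle position) can be sketched in Python as follows. The function name `median_by_position` and the two small data sets are hypothetical, and the sketch assumes the middle position is (N + 1) / 2 counted from 1.

```python
def median_by_position(values):
    """Median via the (N + 1) / 2 position rule (positions counted from 1)."""
    ordered = sorted(values)               # step 1: arrange the data in order
    n = len(ordered)
    position = (n + 1) / 2                 # step 2: locate the middle position
    if position.is_integer():              # odd N: a single middle observation
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]     # even N: the position ends in .5,
    upper = ordered[int(position)]         # so average the two neighbours
    return (lower + upper) / 2

odd_result = median_by_position([7, 3, 9, 1, 5])   # ordered: 1 3 5 7 9
even_result = median_by_position([7, 3, 9, 1])     # ordered: 1 3 7 9

print(odd_result, even_result)
```

For the five-value list the position is 3, giving the single middle value; for the four-value list the position is 2.5, so the 2nd and 3rd ordered values are averaged.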
12 The Mode
- The mode is the value(s) that occurs most often
- Useful on nominal scale data, where it is not
possible to calculate the mean or median
- A distribution can have more than one mode (e.g.
two modes: bimodal)
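A short Python sketch shows both points: the mode works on nominal data, and there can be more than one. The helper `modes` and both data sets are invented for illustration.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency, sorted."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

# Hypothetical nominal data: no mean or median exists, but the mode does.
marital = ["single", "married", "married", "divorced", "single", "married"]

# Hypothetical numeric data with two equally frequent values (bimodal).
bimodal = [2, 3, 3, 5, 7, 7, 9]

nominal_modes = modes(marital)
numeric_modes = modes(bimodal)
print(nominal_modes, numeric_modes)
```

Returning a list rather than a single value is a design choice here, so that a bimodal data set reports both modes.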
13 Measures of Central tendency Example
- Consider the following data
- 12 34 56 34 21 23 1 19 17 12 34 53
- Calculate the mean, median and mode
14 Measures of Central tendency Example (Cont.)
- The mean:
- (12 + 34 + 56 + 34 + 21 + 23 + 1 + 19 + 17 + 12 + 34 + 53) / 12
- = 316 / 12 = 26.33
15 Measures of Central tendency Example (Cont.)
- The Median
- middle value = $\frac{N+1}{2}$th value
- = $\frac{12+1}{2}$th = 6.5th value
- = average of 6th and 7th values
- in order:
- 1 12 12 17 19 21 23 34 34 34 53 56
- Median = (21 + 23)/2 = 22
16 Measures of Central tendency Example (Cont.)
- The mode
- number     1  12  17  19  21  23  34  53  56
- frequency  1   2   1   1   1   1   3   1   1
- Therefore, the mode is 34
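The three results in this example can be checked with Python's standard `statistics` module. This is only a verification sketch on the twelve values given in the example; the variable names are arbitrary.

```python
import statistics

# The twelve observations from the worked example.
data = [12, 34, 56, 34, 21, 23, 1, 19, 17, 12, 34, 53]

mean_value = statistics.mean(data)       # 316 / 12
median_value = statistics.median(data)   # average of the 6th and 7th ordered values
mode_value = statistics.mode(data)       # most frequent value

print(round(mean_value, 2), median_value, mode_value)
```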
17 Mean, Median and Mode
- If the distribution is exactly symmetric and
unimodal, the mean, the median and the mode are
exactly the same.
- If the distribution is skewed, the three measures
differ.
18 Which one to use?
- Different by definition
- Mean and median are unique, and only for
quantitative variables.
- Mode is not unique.
- Mode is defined for categorical variables also.
- The choice depends on the shape of the
distribution, the type of data and the purpose of
your study
- Skewed
- Categorical
- Total quantity
19 Measures of Dispersion
- A second important property of a distribution is
its dispersion, i.e. how variable the
data are
- The four most commonly used measures are the
range, variance, standard deviation and
coefficient of variation
- We will also look at the Inter Quartile Range
20 The Range
- The range is simply the difference between the
highest and lowest values in a data set
Range = $x_{max} - x_{min}$
- The range, however, gives no indication of the
dispersion of values between these two extreme
values, e.g. there may be a lot of values clumped
at either end of the distribution
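This limitation can be illustrated with a small Python sketch. Both data sets are invented: they share the same range even though their values are distributed very differently.

```python
def value_range(values):
    """Range: the highest value minus the lowest value."""
    return max(values) - min(values)

# Two hypothetical data sets with identical extremes (illustrative only).
spread_out = [1, 3, 5, 7, 9, 11]      # values spaced evenly between 1 and 11
clumped = [1, 1, 1, 11, 11, 11]       # values clumped at both ends

range_spread = value_range(spread_out)
range_clumped = value_range(clumped)
print(range_spread, range_clumped)
```

Because only the two extreme values enter the calculation, the range cannot tell these two data sets apart.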
21 The Variance
- The two most commonly used measures which take
into account all the data values are the variance
and the standard deviation
- A data set that is more variable will have a
larger variance than one that is
relatively homogeneous
- The variance is the sum of the squared deviations
divided by the number of observations
22 The variance (Cont.)
- Consider these data: 5, 17, 12, 10
- The mean of the data is
- (5 + 17 + 12 + 10)/4 = 11
- A deviation is the signed distance of each observation
from the mean
23 The variance (Cont.)
Deviations
- For these data, the deviations are
- 5 - 11 = -6
- 10 - 11 = -1
- 12 - 11 = 1
- 17 - 11 = 6
24 The variance (Cont.)
- We are interested in the squared deviations, so
the deviations are squared
- Number   Deviation   Squared deviation
- 5        -6          36
- 10       -1          1
- 12       1           1
- 17       6           36
25 The variance (Cont.)
- The squared deviations are then summed and
divided by the number of observations to give the
variance
- Variance = (36 + 1 + 1 + 36) / 4
- = 18.5
- The variance is hence the average squared
deviation of the data
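The same calculation, step by step, as a Python sketch on the four values used above (variable names are arbitrary):

```python
data = [5, 17, 12, 10]

mean_value = sum(data) / len(data)             # 44 / 4 = 11.0
deviations = [x - mean_value for x in data]    # distance of each value from the mean
squared = [d ** 2 for d in deviations]         # squared deviations
variance = sum(squared) / len(data)            # average squared deviation

print(deviations)   # in the original data order: 5, 17, 12, 10
print(squared)
print(variance)
```

The deviations always sum to zero, which is why they are squared before averaging.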
26 The variance (Cont.)
- For a population, the variance is denoted by $\sigma^2$ and
given by the formula
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
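27 The variance (Cont.)
- For a sample, the variance is denoted by $s^2$ and
given by the formula
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
- Note the subtle difference between these two formulas:
the sample variance divides by n - 1 rather than by
the number of observations.
Your calculator can calculate both these numbers
in a matter of seconds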
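The difference between the two versions can be seen numerically with Python's `statistics` module. The sketch assumes the usual definitions, dividing by N for a population and by n - 1 for a sample, and reuses the four values 5, 17, 12, 10 from the earlier slides.

```python
import statistics

data = [5, 17, 12, 10]

# Population variance: sum of squared deviations divided by N (here 74 / 4).
population_variance = statistics.pvariance(data)

# Sample variance: sum of squared deviations divided by n - 1 (here 74 / 3).
sample_variance = statistics.variance(data)

print(population_variance, round(sample_variance, 2))
```

The sample version is always the larger of the two, and the gap shrinks as the number of observations grows.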
28 The Standard Deviation
- The standard deviation is simply the positive square
root of the variance. Hence for a population the
standard deviation is $\sigma$ and for a sample, $s$,
i.e. $\sigma = \sqrt{\sigma^2}$ and $s = \sqrt{s^2}$
- The standard deviation is in the same units as
the mean.
29 The coefficient of variation
- The coefficient of variation (CV) is a relative
measure of variability which has no units and is
generally expressed as a percentage
- It is used for comparing data that are not
measured using the same units, or when comparing
data with quite different means
- It is simply the standard deviation divided by
the mean
30 The coefficient of variation (Cont.)
- The CV can only be calculated on data collected
at the ratio level
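A minimal Python sketch of the CV follows, comparing two variables measured in different units. The heights and weights are invented, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the slides do not say which version is intended.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical ratio-level measurements in different units (illustrative only).
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

print(round(cv_heights, 2), round(cv_weights, 2))
```

Both variables have the same standard deviation in their own units, yet the weights are relatively more variable because their mean is smaller.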
31 The Quartiles
- We can improve the description by also looking at
the middle half of the data
- Recall that the median is the middle value of the
data set, i.e. the value that 50% of observations
are greater than and 50% of observations are less
than
- The quartiles are calculated in a similar fashion
32 The Quartiles (Cont.)
- The first quartile lies one quarter of the way
through the data, i.e. one quarter of the data
values are less than the first quartile
- The third quartile lies three quarters of the way
through the data, i.e. three quarters of the data
values are less than the third quartile
33 The Quartiles (Cont.)
- E.g. consider the following data (ordered)
- 2 3 5 9 12 17 23 29 31 32 35
- There are 11 values, so the median is the 6th
value, in this case 17. The first quartile is
the middle value of the observations below the
median,
- 2 3 5 9 12, i.e. Q1 = 5
34 The Quartiles (Cont.)
- The third quartile is the middle value of the
observations above the median,
- 23 29 31 32 35, i.e. Q3 = 31
- So, the data with Q1, M and Q3 marked are
- 2 3 5 (Q1) 9 12 17 (M) 23 29 31 (Q3) 32 35
- And the middle 50% of the data lie between Q1 and Q3,
in this case between 5 and 31
35 The Quartiles (Cont.)
- The difference between the 1st and 3rd quartiles
is called the Inter Quartile Range
Inter Quartile Range = Q3 - Q1
= 31 - 5 = 26
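The quartile method used in this example (the median of the observations below and above the overall median) can be sketched in Python. The function name `quartiles` is hypothetical, and the handling of an even number of observations is an assumption, since the slides only show the odd case.

```python
import statistics

def quartiles(values):
    """Q1, median and Q3 using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                  # observations below the median
    # For odd n the median itself is excluded from both halves;
    # for even n (assumption) the data are simply split in two.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

data = [2, 3, 5, 9, 12, 17, 23, 29, 31, 32, 35]   # the 11 values from the example
q1, med, q3 = quartiles(data)
iqr = q3 - q1

print(q1, med, q3, iqr)
```

Other quartile conventions exist and can give slightly different values on the same data, so results from a calculator or software package may not match this method exactly.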
36 Approximate statistics for grouped data
- When the data are given in a frequency
distribution table, we cannot calculate the exact
mean and standard deviation
- We can however calculate approximate values
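37 Statistics for grouped data
- For the mean: $\bar{x} \approx \frac{\sum f_i m_i}{\sum f_i}$, where $f_i$ is the
frequency and $m_i$ the midpoint of class i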
- and for the variance
38 Statistics for grouped data e.g.
- Consider the following frequency table
- Class   Interval   Frequency
- 1       2 - 5      3
- 2       5 - 8      6
- 3       8 - 11     8
- 4       11 - 14    7
- 5       14 - 17    4
- 6       17 - 20    2
39 Grouped data example (Cont.)
- Class   Interval   Frequency   Midpoint   f_i m_i   f_i m_i^2
- 1       2 - 5      3           3.5        10.5      36.75
- 2       5 - 8      6           6.5        39        253.5
- 3       8 - 11     8           9.5        76        722
- 4       11 - 14    7           12.5       87.5      1093.75
- 5       14 - 17    4           15.5       62        961
- 6       17 - 20    2           18.5       37        684.5
- totals             30                     312       3751.5
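The column totals in this table, and the approximate statistics they lead to, can be reproduced with a short Python sketch. The grouped-variance formulas did not survive in the slide text, so both the divide-by-n and the divide-by-(n - 1) versions are computed here as an assumption about what was intended.

```python
# Class intervals and frequencies from the frequency table.
intervals = [(2, 5), (5, 8), (8, 11), (11, 14), (14, 17), (17, 20)]
frequencies = [3, 6, 8, 7, 4, 2]

# Each class is represented by its midpoint.
midpoints = [(low + high) / 2 for low, high in intervals]

n = sum(frequencies)                                                # total frequency
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))         # sum of f_i * m_i
sum_fm2 = sum(f * m ** 2 for f, m in zip(frequencies, midpoints))   # sum of f_i * m_i^2

grouped_mean = sum_fm / n

# Assumed variance formulas (not shown in the source):
variance_population = sum_fm2 / n - grouped_mean ** 2        # divides by n
variance_sample = (sum_fm2 - sum_fm ** 2 / n) / (n - 1)      # divides by n - 1

print(n, sum_fm, sum_fm2)
print(round(grouped_mean, 2), round(variance_population, 2), round(variance_sample, 2))
```

These are approximations because every observation in a class is treated as if it sat exactly at the class midpoint.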
40 Grouped data example (Cont.)