Reminder of 1st lecture - PowerPoint PPT Presentation

Provided by: moll2
1
Reminder of 1st lecture
  • Data types
  • Summarising qualitative data: counts
  • Summarising quantitative data
  • Location: mean, median
  • Spread: standard deviation, interquartile
    range
  • Graphical presentation of data

2
Qualitative data
  • Dichotomous (Binary): This variable has only 2
    possible categories (mutually exclusive)
  • Nominal: This variable has more than 2
    categories, mutually exclusive and unordered
  • Ordinal: This variable has more than 2
    categories, mutually exclusive and ordered
3
Quantitative data
  • Continuous: This is used for something measured
    on a scale. The variable can take any value
    within a range of values, e.g. height in cm,
    weight in kg.
  • Discrete: This variable often represents counts
    (integer values), e.g. age to nearest year,
    number of children.
4
Measures of central tendency
  • Mean - arithmetic average (μ, mu)
  • add up all observations and divide by the number
    of observations
  • Median - central value of the distribution
  • rank observations and the median is the
    observation below which 50% of all values fall
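As a small illustration of the two definitions above, here is a sketch using Python's standard library. The ages are hypothetical, and the last one is a deliberate outlier to show that the mean is pulled towards extreme values while the median is not.

```python
from statistics import median

# Hypothetical ages in years; the last value is a deliberate outlier.
ages = [22, 25, 28, 30, 95]

# Mean: add up all observations and divide by the number of observations.
mean_age = sum(ages) / len(ages)

# Median: rank the observations and take the central value.
median_age = median(ages)

print(mean_age, median_age)
```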

5
Histogram
  • Assessment of normality

6
Appropriate measures of location and spread
7
Reminder of 2nd lecture
  • The probability for an event or outcome indicates
    how likely it is to happen.
  • The probability of an event has to be between 0
    and 1.
  • A probability of 0 means that the event never
    happens.
  • A probability of 1 means that an event always
    happens.
  • Events that cannot happen together are mutually
    exclusive and their probabilities can be added
    together. P(A or B) = P(A) + P(B)
  • Events that are independent do not affect each
    other and their probabilities can be multiplied
    together. P(A and B) = P(A) x P(B)

8
Properties of Normal curve
P = 0.68, P = 0.95, P = 0.999
9
Chi-Squared
  • Another important distribution related to the
    normal is the Chi-squared distribution.
  • It is used when investigating categorical
    data.

10
Crosstab example
  • If eye and hair colour are not associated then,
    for example, the expected number with blue eyes
    and blond hair would be (total with blue eyes x
    total with blond hair) / overall total

11
  • So the chi-squared is found by looking at a
    function of the discrepancy between observed and
    expected counts in each cell
  • summed over all combinations of hair and eye
    colour.
  • If this is large and in the tail of the
    distribution, then it may be said that the
    observed is not as expected!
  • More of this later.
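The calculation described above can be sketched as follows. The 2 x 2 table of counts is hypothetical, and the discrepancy function used, (observed - expected)^2 / expected, is the usual Pearson form, which is assumed here because the slide does not spell it out.

```python
# Hypothetical observed counts (rows: eye colour, columns: hair colour).
observed = [
    [30, 20],  # blue eyes: blond, not blond
    [10, 40],  # other eyes: blond, not blond
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count if there is no association:
# row total x column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Sum the discrepancy (O - E)^2 / E over every cell of the table.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected, round(chi_squared, 2))
```

A large value of `chi_squared`, far out in the tail of the chi-squared distribution, would suggest that the observed counts are not what independence predicts.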

12
Summary
  • Probabilities are integral to all things around
    us.
  • We can derive and understand probabilities.
  • We have seen that probabilities build together to
    form probability distributions.
  • Some are theoretical distributions that are well
    understood, the most important being the Normal.
  • Using these theoretical distributions we can
    begin to make inferences about the population on
    the basis of samples.

13
Sampling, sampling distributions and statistical
inference
  • Gordon Prescott

14
Wednesday, 11 October 2006: Scots bar staff health
'improved'
  • The health of Scotland's bar staff has improved
    dramatically since the introduction of a smoking
    ban, a medical study has found.
  • Researchers at Dundee University found
    significant health improvements in the first two
    months after the March ban.
  • The results have led to calls for the UK
    Government to speed the introduction of a similar
    ban south of the border.
  • But smokers' rights group Forest said the link
    between passive smoking and ill health had not
    been proven.
  • The team from the university's asthma and allergy
    research group began testing bar workers in and
    around Dundee in February, a month before the ban
    came into force.
  • Using a series of indicators, they established
    symptoms attributable to passive smoking,
    measuring lung function and inflammation in the
    bloodstream.
  • "This study provides compelling evidence that
    making workplaces smoke free can have a
    significant and speedy impact on people's
    health" (Peter Hollins, British Heart
    Foundation)

15
Study of bar workers in Dundee before and after
the smoking ban
  • JAMA. 2006;296:1742-1748
  • To investigate the association of smoke-free
    legislation with symptoms, pulmonary function,
    and markers of inflammation of bar workers
    (n = 77)
  • At 1 month: the percentage of bar workers with
    respiratory and sensory symptoms decreased from
    79.2% (n = 61) before the smoke-free policy to
    53.2% (n = 41)
  • (total change 26%; 95% confidence interval [CI],
    13.8% to 38.1%; P < 0.001)
  • Forced expiratory volume in the first second
    increased from 96.6% predicted to 104.8%
  • (change 8.2%; 95% CI, 3.9% to 12.4%; P < 0.001)
  • Serum cotinine levels decreased from 5.15 to 3.22
    ng/mL
  • (change -1.93 ng/mL; 95% CI, -2.83 to -1.03
    ng/mL; P < 0.001)

16
Study of bar workers in Dundee before and after
the smoking ban
  • JAMA. 2006;296:1742-1748
  • At 2 months: total white blood cell count was
    reduced from 7610 to 6980 cells/µL
  • (-630 cells/µL; 95% CI, -1010 to -260 cells/µL;
    P = 0.002)
  • Neutrophil count was reduced from 4440 to 4030
    cells/µL
  • (-410 cells/µL; 95% CI, -740 to -90 cells/µL;
    P = 0.03)
  • Smoke-free legislation was associated with
    significant early improvements in symptoms,
    spirometry measurements, and systemic
    inflammation of bar workers

17
Sampling theory
  • A population is the totality of observations
    obtainable from all subjects possessing some
    common specified characteristic
  • male diabetics
  • height of all females in the UK
  • A sample is a set of observations which
    constitutes part of a population

18
Random and biased samples
  • Random sampling is a sampling technique where
    each member in the population is chosen entirely
    by chance, with a known chance of being included
    in the sample. Representative of the population.
  • The most common random sample is obtained when
    any one individual or measurement in the
    population is as likely to be selected as any
    other.
  • A sample can also be biased, where some
    individuals or measurements have a greater chance
    of being included than others.
  • With a non-random sampling technique not all
    members have a known chance of being selected, or
    some members have a zero chance of being
    selected. Unrepresentative of the population.

19
Selection of sample
  • Probability sampling
  • Each item/person in the population has a
    calculable non-zero chance of being selected for
    the sample.
  • Sampling error (degree to which the sample
    differs from the population) can be calculated.
  • Convenience sampling
  • The items/people are selected by the researcher
    from the population in a non-random manner.
  • The sampling error is unknown and cannot be
    calculated.

20
  • Random sampling:
  • Simple random sampling
  • Systematic random sampling
  • Stratified random sampling
  • Cluster random sampling
  • Multi-stage sampling
  • Biased sampling:
  • Quota sampling

21
Probability sampling
  • Simple random sample
  • Each item in the population has an equal chance
    of being selected for the sample
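A minimal sketch of a simple random sample, using Python's standard `random` module. The sampling frame of 100 numbered items and the sample size of 10 are hypothetical; the seed is fixed only so the sketch is repeatable.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: items numbered 1 to 100.
population = list(range(1, 101))

# Every item has the same chance of selection; no item is picked twice.
sample = random.sample(population, k=10)

print(sorted(sample))
```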

22
Random number table
  • 84 42 56 53 87 75
  • 78 87 77 03 57 09
  • 85 86 48 86 12 39
  • 65 37 93 76 46 11
  • 09 49 41 73 76 49
  • 64 06 71 99 37 06
  • 46 69 31 24 33 52
  • 67 85 07 75 56 96

23
Systematic random sampling
  • Choose one element at random from the population
    of size N
  • For a sample of size n, choose every (N/n)th
    element thereafter
  • Advantages - It is simpler and can be more
    representative than a simple random sample
  • Disadvantages - possibility of implicit
    clustering, not a simple random sample

24
Example systematic random sampling
  • A survey of GPs in Scotland to assess counselling
    services provided to patients is to be conducted.
  • 1 in 3 GPs are to be included in the sample.
  • List of GPs available is ordered by practice
    within Health Board.
  • Systematic sample will ensure representative
    spread across health boards and practices.
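The 1-in-3 selection above could be sketched like this. The list of 90 GP identifiers is hypothetical, and the function uses a common variant of the method in which the random start is taken within the first sampling interval, so that exactly n elements are obtained.

```python
import random

def systematic_sample(population, n, rng=random):
    """Take every k-th element (k = N // n) after a random start in the first interval."""
    k = len(population) // n   # sampling interval N/n
    start = rng.randrange(k)   # random starting position in 0..k-1
    return population[start::k][:n]

# Hypothetical list of 90 GPs, already ordered by practice within Health Board.
gps = [f"GP{i:03d}" for i in range(1, 91)]

sample = systematic_sample(gps, n=30)  # 1 in 3 GPs
print(sample[:5])
```

Because the list is ordered by practice within Health Board, stepping through it at a fixed interval spreads the sample across boards and practices.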

25
Stratified random sampling
  • The strata are subgroups of the population which
    are chosen to minimize differences between
    members of the same stratum and maximize the
    differences between members of different strata.
  • Main advantages
  • Increases the representativeness of the sample
  • Increases the precision of the resulting
    estimates
  • Allows comparison between strata

26
Example Stratified random sampling
  • It is likely that size of practice, site of
    practice (rural/urban) may influence whether the
    practice employs a counsellor
  • The list of GPs could be stratified by number
    of partners in the practice (< 4, > 4) and area
    (rural/urban)
  • A simple random sample from each of the four
    lists would constitute the sample
  • The sample would ensure that each stratum (or
    combination of strata) is represented in the
    sample
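A sketch of the stratified design above. The GP list, the stratum labels and the 25% sampling fraction are all hypothetical; the point is only that a simple random sample is drawn separately within each of the four strata.

```python
import random
from collections import defaultdict

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical GP list: (id, partners stratum, area stratum).
gps = [
    (i, random.choice(["<4 partners", ">4 partners"]), random.choice(["rural", "urban"]))
    for i in range(200)
]

# Split the list into the four strata (partners x area).
strata = defaultdict(list)
for gp in gps:
    strata[(gp[1], gp[2])].append(gp)

# Draw a simple random sample of the same fraction from every stratum.
fraction = 0.25  # assumed sampling fraction
sample = []
for members in strata.values():
    k = max(1, round(len(members) * fraction))
    sample.extend(random.sample(members, k))

print({key: len(members) for key, members in strata.items()}, len(sample))
```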

27
Cluster random sampling
  • The clusters are subgroups of the population
    which are chosen to maximize differences between
    members of the same cluster and therefore
    minimize the differences between members of
    different clusters
  • Advantages - Cheaper and faster than a simple
    random sample
  • Disadvantages - Less representative than a simple
    random sample and there is a danger of
    contamination between respondents

28
Example Cluster random sampling
  • For a nutritional survey to be carried out, 20
    schools in Scotland will be randomly selected.
  • All secondary year 4 children from those schools
    selected will be interviewed.
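The school survey above might be sketched as follows. The frame of 100 schools and the pupil numbers are hypothetical; what matters is that whole clusters are selected at random and then every pupil in a selected cluster is included.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical frame: 100 schools, each with 20 to 40 year-4 pupils.
schools = {
    f"school_{s}": [f"s{s}_p{p}" for p in range(random.randint(20, 40))]
    for s in range(100)
}

# Randomly select 20 whole clusters (schools).
chosen_schools = random.sample(sorted(schools), k=20)

# Interview every pupil in the selected schools.
sample = [pupil for school in chosen_schools for pupil in schools[school]]

print(len(chosen_schools), len(sample))
```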

29
Multi-stage sampling
  • Different sampling units are sampled at different
    stages
  • Example
  • Geographical areas of the UK would randomly be
    selected, from which hospitals would be randomly
    selected from which wards/patients would then be
    randomly selected.

30
Convenience sampling
  • Quota sampling
  • In this type of sampling an individual is chosen
    by an interviewer.
  • To avoid undue bias the quota is sub-divided into
    various categories e.g. male/female, old/young
    and so on.
  • The interviewer is given quotas for each category
    and uses discretion to select the interviewees.

31
Statistical Inference
POPULATION
SAMPLE
INFERENCE
32
  • Which sampling approach will lead to unbiased
    estimates?

33
Statistical Inference
POPULATION
SAMPLE
INFERENCE
34
Population parameters and sample statistics
  • A population parameter is a measurable
    characteristic of the population
  • Values obtained from a sample are estimates of
    the population parameters

35
Estimation I
  • A parameter is a numerical descriptive measure of
    a population. It is calculated from the
    observations in the population
  • mean - μ
  • standard deviation - σ
  • A sample statistic is a numerical descriptive
    measure of a sample. It is calculated from the
    observations in the sample
  • Sample mean - x̄
  • Sample standard deviation - s

36
Estimation II
  • Population parameters are fixed so long as the
    population itself does not change
  • Sample statistics will vary from sample to
    sample, even though samples may be random and the
    population does not change

37
Sampling distribution
  • In theory we could select all possible random
    samples from a population and gain an estimate of
    the population parameter from each of the
    individual samples.
  • If a histogram of the sample estimates for each
    individual sample was plotted, this would form
    the sampling distribution of the population
    parameter (probability distribution).

38
Sampling Distribution
39
Sampling distribution
  • The sampling distribution of a sample statistic,
    calculated from a sample of n measurements, is
    the probability distribution of the statistic

40
Example
  • Imagine that a random sample of 100 individuals
    is to be selected from a population
  • Their height in cm is measured
  • The mean height is computed
  • Another random sample of 100 individuals from the
    same population is taken
  • Their height in cm is measured
  • Their mean height is computed
  • This is repeated until 20 random samples have
    been taken

41
20 samples of size 100
  • The first sample of heights of 100 people gives a
    mean of 172.03 cm and a standard deviation (SD)
    of 6.03 cm.
  • The second sample gives a mean of 173.50 cm and
    an SD of 6.74 cm.
  • These figures represent the mean height (cm) for
    each of the 20 random samples
  • 172.03 173.50 171.89 171.95 170.59
  • 172.63 172.72 171.99 172.50 171.71
  • 172.55 172.86 171.58 172.83 172.55
  • 171.28 172.62 171.41 171.38 172.26

42
Histogram of means of 20 samples
43
Histogram of means of 100 samples
44
Sampling error
  • Each random sample may have a different estimate
    of the population parameter due to sampling
    variation
  • Knowledge of the sampling distribution allows us
    to assess how close the estimate obtained from
    one individual sample is to the true population
    parameter. This is known as precision.

45
Precision
  • The larger the size of the sample the greater the
    reduction in sampling error
  • Taking a larger sample will result in reducing
    the sampling variation from the true
    population value that we are trying to estimate.
  • This implies that our estimate would be more
    precise.

46
Standard Error
  • The standard deviation of the sampling
    distribution of the mean is known as the standard
    error of the mean.
  • The standard error provides a measure of how far
    from the true population value the estimate is
    likely to be (the precision)

47
Standard deviation/standard error
  • The standard deviation, s, is a measure of the
    variability of individuals in a sample
  • The standard error is a measure of the
    uncertainty in the sample statistic (e.g. mean,
    proportion)

48
What does the standard error indicate?
  • Consider again a random sample of 100
    individuals selected from a population
  • Their height in cm is measured
  • The mean height is computed
  • Another random sample of 100 individuals from the
    same population is taken
  • Their height in cm is measured
  • Their mean height is computed
  • This is repeated until 20 random samples have
    been taken

49
20 samples of size 100
  • These figures represent the mean height (cm) for
    each of the 20 random samples
  • 172.03 173.50 171.89 171.95 170.59
  • 172.63 172.72 171.99 172.50 171.71
  • 172.55 172.86 171.58 172.83 172.55
  • 171.28 172.62 171.41 171.38 172.26
  • Mean of the 20 sample means = 172.14 cm
  • SD of the 20 sample means = 0.689 cm ≈ SE (mean)
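The two summary figures above can be checked directly from the 20 sample means. This is only a sketch using Python's `statistics` module; `stdev` is the sample standard deviation (n - 1 divisor), which is assumed to be what the slide used.

```python
from statistics import mean, stdev

# Mean height (cm) from each of the 20 random samples of size 100.
sample_means = [
    172.03, 173.50, 171.89, 171.95, 170.59,
    172.63, 172.72, 171.99, 172.50, 171.71,
    172.55, 172.86, 171.58, 172.83, 172.55,
    171.28, 172.62, 171.41, 171.38, 172.26,
]

mean_of_means = mean(sample_means)  # centre of the sampling distribution
sd_of_means = stdev(sample_means)   # empirical standard error of the mean

print(round(mean_of_means, 2), round(sd_of_means, 3))
```

The spread of the sample means (under 1 cm) is far smaller than the spread of individual heights within any one sample (an SD of about 6 cm).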

50
Explanation of sampling error
  • 5 students in population
  • Ages are 22, 25, 28, 30, 35
  • Sample of 3 students randomly selected to
    estimate age in population (5 students)

51
22, 25, 28, 30, 35 (μ = 28)
  • Sample 1: 22, 30, 35; mean 29
  • Sample 2: 25, 28, 35; mean 29.3
  • Sample 3: 25, 28, 30; mean 27.7
  • Sample 4: 25, 30, 35; mean 30
  • Sample 5: 28, 30, 35; mean 31

NB: Variation in age: 22 to 35. Variation in mean
age: 27.7 to 31.
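Because the population here has only 5 students, every possible sample of 3 can be listed. The sketch below enumerates all 10 of them (the slide shows only a few) and confirms that the sample means vary less than the individual ages and average out to the population mean.

```python
from itertools import combinations
from statistics import mean

ages = [22, 25, 28, 30, 35]
population_mean = mean(ages)

# Every possible sample of 3 students drawn without replacement.
samples = list(combinations(ages, 3))
sample_means = [mean(s) for s in samples]

for s, m in zip(samples, sample_means):
    print(s, round(m, 1))

# The sample means scatter around the population mean and average out to it.
mean_of_sample_means = mean(sample_means)
print(population_mean, round(mean_of_sample_means, 2))
```

Across all 10 possible samples the mean age ranges from 25 to 31, a narrower range than the individual ages (22 to 35).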
52
Relationship between sample size and precision of
sample estimate
  • Heights
  • Sample size   Mean (cm)   SD (cm)   Standard error (cm)
  • 100           172.03      6.03      0.60
  • 200           172.77      6.42      0.45
  • 500           171.99      6.85      0.31
  • 1000          172.15      6.84      0.22
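The standard error column can be reproduced from the sample size and SD columns, assuming the usual formula SE = SD / √n. A short sketch:

```python
from math import sqrt

# (sample size, sample SD in cm) taken from the table above.
rows = [(100, 6.03), (200, 6.42), (500, 6.85), (1000, 6.84)]

# Standard error of the mean = SD / sqrt(n).
standard_errors = {n: sd / sqrt(n) for n, sd in rows}

for n, se in standard_errors.items():
    print(n, round(se, 2))
```

The SD stays at roughly 6 to 7 cm whatever the sample size, while the standard error shrinks as n grows.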

53
Properties of the sampling distribution of the
mean
  • The mean of the sampling distribution = mean of
    the population distribution
  • Standard deviation of the sampling distribution
    = σ / √n = standard error of the mean
  • The sampling distribution of the mean is
    approximately Normal for large sample sizes

54
Central Limit Theorem
  • If a random sample of n observations is selected
    from a population, then when n is sufficiently
    large, the sampling distribution of x̄ (the
    mean) will be approximately a Normal
    distribution.
  • The larger the sample size, the better the
    approximation to the Normal distribution will be.
  • A sample size of at least 30 will usually be
    enough.
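The theorem can be explored by simulation. The sketch below is illustrative only: it assumes a deliberately skewed population (exponential with mean 1 and SD 1), samples of size 30, and 2000 repeated samples. It checks that the sample means centre on the population mean with a spread close to σ / √n.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

n = 30            # observations per sample
n_samples = 2000  # number of repeated samples (arbitrary choice)

# Skewed, clearly non-Normal population: exponential with mean 1 and SD 1.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

centre = mean(sample_means)    # should be close to the population mean, 1
spread = stdev(sample_means)   # should be close to sigma / sqrt(n)
theoretical_se = 1 / sqrt(n)

print(round(centre, 3), round(spread, 3), round(theoretical_se, 3))
```

Plotting a histogram of `sample_means` would show a roughly symmetric, bell-shaped distribution even though the population itself is strongly skewed.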

55
Statistical Inference
Representativeness Size
POPULATION
SAMPLE
INFERENCE
56
Statistical Inference
  • Concerned with how we draw conclusions from
    sample data, about the larger population from
    which the sample is selected.
  • There are two types of inference
  • Confidence Intervals (Estimation)
  • Hypothesis Testing (Significance Testing)

57
Confidence intervals
  • When we collect data on a sample of individuals
    we would not expect the results from our sample
    to be exactly the same as those we would get if
    we had data on the whole population.
  • Using the variability in the sample data we can
    calculate a range of values in which the
    population value is likely to lie.
  • We can vary the width of this range depending on
    how confident we want to be that we will have
    included the true population value (usually set
    at 95% confidence).

58
Calculation of confidence intervals
  • Confidence intervals can be calculated for most
    sample estimates using the following notation
  • Sample estimate ± critical value x standard
    error(sample estimate)

59
Calculation of confidence interval for a
population mean
  • Sample estimate:
  • sample mean
  • Standard error of mean:
  • sample standard deviation/√n
  • For large samples, 5% critical value = 1.96
  • Point estimate 172.03 cm, SE 0.603 cm
  • 95% CI (170.85 to 173.21) cm

60
Small samples
  • For a small sample (n < 30) we use the
    t-distribution with (n-1) degrees of freedom
  • 95% CI for small samples: sample mean ±
    t n-1 (5%) x standard error
  • When n is large the t-distribution approximates
    to the Normal distribution
  • For n-1 = 5, 10, 30, 60, 120 degrees of freedom,
  • t n-1 (5%) = 2.57, 2.23, 2.04, 2.00, 1.98

61
Interpretation of CIs
  • Consider an infinite number of random samples of
    size n from a population
  • A mean and 95% CI could be computed for each of
    the samples
  • For 95% of the samples from the population, the
    95% CI will include the population value, whilst
    in 5% of samples the 95% CI will not include the
    population value.
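This interpretation can be illustrated by simulation. The sketch below assumes a Normal population with mean 172 cm and SD 6.8 cm, draws 1000 samples of 100 (rather than an infinite number), and counts how often the 95% CI contains the true mean.

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(4)  # fixed seed so the sketch is repeatable

true_mean, true_sd = 172.0, 6.8  # assumed population values
n, n_samples = 100, 1000

covered = 0
for _ in range(n_samples):
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    se = stdev(sample) / sqrt(n)
    lower = fmean(sample) - 1.96 * se
    upper = fmean(sample) + 1.96 * se
    covered += lower <= true_mean <= upper  # True counts as 1

coverage = covered / n_samples
print(coverage)
```

The proportion printed should be close to, but not exactly, 0.95, because only a finite number of samples is simulated.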

62
Interpretation of confidence intervals
[Figure: confidence intervals for Sample 1, Sample 2, ..., Sample n, compared with the true population value (e.g. the population mean)]
63
Example
  • Suppose I want to know the average height of UK
    people. I have a random sample of 100 people with
    a mean of 172.03 cm and SD of 6.03 cm.
  • The standard error of the mean is
    6.03/√100 = 0.603
  • The estimate of the population mean is 172.03
  • The 95% confidence interval (CI) for the
    population mean is 172.03 - 1.96 x 0.603, 172.03
    + 1.96 x 0.603
  • 95% CI is 170.85, 173.21
  • I am 95% confident that the average height of UK
    people is between 170.85 and 173.21
  • A different 100 people would have a slightly
    different sample mean and confidence interval
  • 95% of random samples of 100 would give a 95%
    confidence interval containing the true
    population mean
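The arithmetic in this example can be reproduced in a few lines. This is a sketch of the large-sample calculation only, using the figures quoted above and the 1.96 multiplier.

```python
from math import sqrt

n, sample_mean, sample_sd = 100, 172.03, 6.03
z = 1.96  # 5% critical value for large samples

se = sample_sd / sqrt(n)       # standard error of the mean
lower = sample_mean - z * se   # lower 95% confidence limit
upper = sample_mean + z * se   # upper 95% confidence limit

print(round(se, 3), round(lower, 2), round(upper, 2))
```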

64
100 samples of 100 people from a population of
10000 with known mean 172 cm and SD 6.8 cm
65
Change in sample size
  • If the sample size is larger, the standard error
    is smaller and therefore the CI is also narrower
  • For my first sample of 100 the standard error of
    the mean was 0.603 cm and 95% CI was 170.85,
    173.21 cm
  • When I took a sample of 200 the mean was 172.77
    cm and the SD was 6.42 cm.
  • The standard error was 6.42/√200 = 0.45 (smaller)
  • 95% CI was 172.77 - 1.96 x 0.45, 172.77 + 1.96 x
    0.45
  • = 171.88, 173.65 cm
  • For a sample of 1000 the 95% CI was 171.73,
    172.58 cm

66
Confidence intervals
  • Raising the confidence level from 95% to 99%
    increases the assurance that the confidence
    interval contains the population mean, but it
    makes the estimate less precise, i.e. the CI is
    wider
  • Multiplier changes from 1.96 to 2.58.
  • For height with 100 in the sample
  • 95% CI (170.85, 173.21) cm
  • 99% CI (170.47, 173.58) cm
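The multipliers come from the standard Normal distribution. The sketch below derives them with Python's `statistics.NormalDist` and compares the widths of the two intervals. It uses unrounded multipliers, so the limits it prints may differ in the last decimal place from those quoted above.

```python
from statistics import NormalDist

sample_mean, se = 172.03, 0.603  # height example, sample of 100

def confidence_interval(level):
    """Normal-based CI for the mean at the given confidence level, e.g. 0.95."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # two-sided critical value
    return z, sample_mean - z * se, sample_mean + z * se

z95, lo95, hi95 = confidence_interval(0.95)
z99, lo99, hi99 = confidence_interval(0.99)

print(round(z95, 2), round(lo95, 2), round(hi95, 2))
print(round(z99, 2), round(lo99, 2), round(hi99, 2))
```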

67
Sample size
  • It is possible to determine what sample size
    should be taken, if we wish to achieve a given
    level of precision
  • This is because precision can be increased by
    reducing the size of the standard error
  • The size of the standard error is based on the
    size of the sample
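One way to turn this into a calculation is to fix the margin of error wanted (the half-width of the 95% CI) and solve z x SD / √n ≤ margin for n. The function name, the margin of 0.5 cm and the use of 6.03 cm as a planning value for the SD are assumptions made for illustration.

```python
from math import ceil, sqrt

def required_sample_size(sd, margin, z=1.96):
    """Smallest n for which z * sd / sqrt(n) is no larger than the margin."""
    return ceil((z * sd / margin) ** 2)

# Assumed planning values: SD of 6.03 cm, desired margin of +/- 0.5 cm.
n_needed = required_sample_size(sd=6.03, margin=0.5)

# Half-width of the 95% CI actually achieved with that sample size.
achieved_margin = 1.96 * 6.03 / sqrt(n_needed)

print(n_needed, round(achieved_margin, 3))
```

Halving the margin roughly quadruples the required sample size, because the standard error falls only with the square root of n.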

68
Practicals
  • There are no new computer practicals this week.
  • Instead there is the opportunity to complete the
    data description practical from last week and
    to ask questions in the Friday time slot.
  • The worked example for the data description
    practical will be on the web early next week.