Short Course in Statistics

Provided by: sta163

1
Short Course in Statistics
  • Learning statistics with a computer
  • Note: some slides require Microsoft Chinese
    Windows

2
Random Sampling
  • To obtain information through sampling
  • Population and Sample
  • Parameter and Statistic

3
Population versus Sample
  • Population
  • The entire group of individuals about which we
    want information
  • Sample
  • A part of the population from which we actually
    collect information, used to draw conclusions
    about the whole population.

4
Example
  • Population: the measurements of weights of all
    children under 18
  • Sample: the measurements of weights of students
    in 20 secondary and primary schools

5
Parameter versus Statistic
  • Parameter
  • A number that describes the population.
  • Statistic
  • A number that describes a sample.

6
Drawing balls from a box
  • A box contains 10 balls: 5 red, 5 black
  • Population: 10 balls
  • Parameter: proportion of red balls
  • Draw a random sample of size 3
  • Statistic: proportion of red balls in the sample
  • e.g. 2/3
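
The box example can be tried directly. The sketch below is only an illustration: it assumes the three balls are drawn without replacement (the slide does not say), and the seed is an arbitrary choice made for repeatability.

```python
import random

# The box from the slide: 5 red and 5 black balls (the population).
box = ["red"] * 5 + ["black"] * 5

# Parameter: proportion of red balls in the whole box.
parameter = box.count("red") / len(box)


def draw_statistic(rng, size=3):
    """Draw `size` balls without replacement (an assumption) and
    return the proportion of red balls in that sample."""
    sample = rng.sample(box, size)
    return sample.count("red") / size


rng = random.Random(2103)  # arbitrary seed, only for repeatability
statistics_seen = [draw_statistic(rng) for _ in range(5)]

print("parameter:", parameter)
print("statistics from 5 samples:", statistics_seen)
```

The parameter stays fixed at 0.5, while the statistic can only take the values 0, 1/3, 2/3 or 1 and changes from sample to sample.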

7
Statistical Science
  • Statistics provides methodology to estimate the
    parameter through the (random) sample

8
How to draw a random sample
  • Construct a sampling frame: give a number (name)
    to each individual in the population
  • Use a random number table to draw a random sample
    of the prescribed size

9
Random Number Table
  • Imagine a box containing 10 identical balls
    numbered 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9.
  • Each time, draw a ball, record its number, and
    return it to the box before drawing the next
    ball. The resulting list (record) is the random
    number table
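
The ball-drawing description corresponds to drawing digits independently with equal probability. A minimal sketch, assuming Python's pseudo-random generator as a stand-in for the physical box; the function name, the layout of four groups of five digits, and the seed are illustrative choices, not part of the course material.

```python
import random


def random_number_table(rows, groups_per_row=4, group_size=5, seed=None):
    """Build lines of random digits, grouped for readability.

    Each digit plays the role of one ball drawn with replacement
    from a box holding the digits 0 to 9.
    """
    rng = random.Random(seed)
    table = []
    for _ in range(rows):
        groups = [
            "".join(str(rng.randrange(10)) for _ in range(group_size))
            for _ in range(groups_per_row)
        ]
        table.append(" ".join(groups))
    return table


for line in random_number_table(rows=3, seed=130):
    print(line)
```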

10
Example
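  • Objective: draw a sample of size 5 from a class
    of 30 students
  • Sampling frame: label each student with the
    numbers 00, 01, ..., 29.
  • Read the random number table at line 130:
    69051 64817 87174 09517
  • Split the digits into two-digit pairs:
    69 05 16 48 17 87 17 40 95 17
  • Pairs outside 00 to 29 and repeated pairs are
    skipped, so this line selects students 05, 16
    and 17; keep reading the table until 5 students
    are chosen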

11
Multiple Label
  • 00 = 30 = 60, 01 = 31 = 61, 02 = 32 = 62, etc.
  • Notice 01 will correspond to the second
    individual
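
The pair-reading procedure, with and without multiple labels, can be sketched as follows. The digits are those quoted from line 130 of the table; the function name and the rule of discarding pairs 90 to 99 (so that every student keeps exactly three labels) are assumptions made for this illustration.

```python
def sample_from_table(digits, population_size, sample_size,
                      multiple_labels=False):
    """Read two-digit pairs from `digits` and return the chosen labels."""
    digits = digits.replace(" ", "")
    pairs = [int(digits[i:i + 2]) for i in range(0, len(digits) - 1, 2)]

    # Single labels: only 00..29 are usable for 30 students.
    # Multiple labels: 00..89 are usable, so each student has the same
    # number of labels (00 = 30 = 60, 01 = 31 = 61, ...).
    if multiple_labels:
        limit = (100 // population_size) * population_size
    else:
        limit = population_size

    chosen = []
    for pair in pairs:
        if pair >= limit:
            continue  # no student carries this label
        label = pair % population_size
        if label not in chosen:  # skip repeats
            chosen.append(label)
        if len(chosen) == sample_size:
            break
    return chosen


line_130 = "69051 64817 87174 09517"
print(sample_from_table(line_130, 30, 5))
print(sample_from_table(line_130, 30, 5, multiple_labels=True))
```

With single labels this line of the table yields only three students (05, 16, 17), so more digits would be needed. With multiple labels the first five pairs already give a full sample, which is the practical reason for assigning several labels to each individual.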

12
Measurements in the Laboratory
  • Each measurement in the physics lab or chemistry
    lab can be regarded as an element in a random
    sample

13
  • http://www.cuhk.edu.hk/webct
  • User ID and password: STA2103(Surname)(Initials)
  • Go to the above website and learn sample survey,
    design of experiment and regression

14
  • Henry, Chau, STA2103chauh
  • Ka Ho Enoch, Chan, STA2103chankhe
  • Jane, Tang, STA2103tangj
  • Vincent, Pong, STA2103pongv
  • Clara, Yip, STA2103yipc

15
Why Random Sampling
  • To be representative
  • Some laws govern the statistic (its sampling
    distribution), which lets us compute
    probabilities
  • Probability, the chance of the occurrence of an
    event in n independent samplings, can be
    computed

16
Not representative
  • Call-in polls
  • Voluntary response on the Web
  • Telephone survey asking the respondents to
    respond with the number keys
  • Readers' letters to the newspaper

17
Sampling Distribution
  • Random sampling → the statistic changes as
    the sample varies
  • That is, the conclusion might change from one
    sample to another
  • But, if the samples are randomly drawn, we can
    predict the result with high probability

18
Example
  • Population: Hong Kong adult residents
  • Sample (random): 600 persons
  • Parameter: proportion of the population
    supporting one more public holiday
  • Statistic: proportion in the sample

19
Consequence of Random Sampling
  • If we draw 1000 samples (each of size 600) and
    compute the statistic for each sample, the
    histogram of these 1000 sample proportions is
    approximately a bell-shaped curve: the normal
    density

20
Normal and Probability
  • Normal density has 2 parameters
  • Mean: the true proportion (p)
  • Variance: var = p(1-p)/n
  • Standard deviation: std = sqrt(var)
  • The proportion from the one sample we draw falls
    in the interval (p - 1.96 std, p + 1.96 std)
    with probability 0.95
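
These claims can be checked by simulation. The sketch below assumes a true proportion p = 0.5 purely for illustration (the real value for the holiday question is unknown) and uses an arbitrary seed; it draws 1000 samples of size 600 and compares the results with the normal approximation.

```python
import math
import random


def simulate_sample_proportions(p, n, repetitions, seed=0):
    """Return `repetitions` sample proportions, each from n independent draws."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(repetitions)
    ]


p, n = 0.5, 600  # p is an assumed value, not taken from any survey
props = simulate_sample_proportions(p, n, repetitions=1000)

std = math.sqrt(p * (1 - p) / n)  # theoretical standard deviation
mean_of_props = sum(props) / len(props)
inside = sum(
    p - 1.96 * std <= x <= p + 1.96 * std for x in props
) / len(props)

print("theoretical std:", round(std, 4))
print("mean of the 1000 sample proportions:", round(mean_of_props, 4))
print("share inside p +/- 1.96 std:", inside)
```

The mean of the simulated proportions should sit close to p, and roughly 95% of them should land inside the interval.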

21
Mean of normal = true parameter
  • If you draw a sample 1000 times, you have 1000
    sample proportions.
  • The average of these 1000 sample proportions
    would be approximately the true proportion:
    the sample proportion is an unbiased estimate of
    the population proportion

22
Variance = p(1-p)/n
  • If the sampling is truly random, we can compute
    the variance of these 1000 sample proportions
    using p (the parameter) only.
  • If I have only one sample with an accurate
    estimate of p, then the variance of the 1000
    sample proportions can be computed without
    using the 1000 sample proportions

23
Intuition behind the formula p(1-p)/n
  • Symmetric about p = 1/2
  • It is maximized at p = 1/2 (very uncertain)
  • When p is closer to 0 or 1, i.e., things are more
    definite, the variance gets smaller
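
Both properties follow from completing the square. This short derivation is added for illustration and uses only the symbols p and n already defined:

```latex
\[
  \frac{p(1-p)}{n}
  = \frac{1}{n}\left[\frac{1}{4} - \left(p - \frac{1}{2}\right)^{2}\right]
  \le \frac{1}{4n}.
\]
```

The squared term depends only on the distance of p from 1/2, which gives the symmetry, and it vanishes only at p = 1/2, which gives the maximum 1/(4n).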

24
Confidence Interval
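  • Conversely, p will be covered by the interval
    (p̂ - 1.96 std, p̂ + 1.96 std), where p̂ is the
    sample proportion, about 95 times out of 100
    such experiments.
  • Note: std = sqrt(p̂(1-p̂)/n), using the sample
    proportion in place of the unknown p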

25
95% Confidence Interval
  • Using the formula for 100 surveys, we obtain 100
    different interval estimates
  • About 95 out of these 100 intervals would contain
    the true p
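
A minimal sketch of the interval computation for a single survey. The function name is hypothetical, and the count of 330 supporters out of 600 is a made-up example, not a survey result.

```python
import math


def proportion_ci_95(successes, n):
    """Approximate 95% confidence interval for a population proportion,
    based on the normal approximation to the sample proportion."""
    p_hat = successes / n
    std = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - 1.96 * std, p_hat + 1.96 * std


# Hypothetical survey: 330 of 600 respondents support the proposal.
low, high = proportion_ci_95(330, 600)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```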

26
Opinion Polls
  • People may not give the true response ---
    response error
  • People may not answer the questions ---
    nonresponse error
  • Unit nonresponse (the person does not respond at
    all)
  • Item nonresponse (the person does not respond to
    some questions)

27
Response rate
  • If the response rate is less than 80%, we would
    doubt the validity of the inference

28
Election Polls
  • The respondents may not be voters
  • A respondent may not vote even if he/she has
    registered
  • A respondent may lie (response error)

29
Questionnaire
  • The way questions are worded affects the
    response (well known)

30
Other Data Collection Methods
  • Experimental Design
  • Observational Data (e.g. registry Data)

31
How to know the effect of vaccine in preventing
polio
  • We cannot apply the vaccine to all children and
    compare the results with those of the past
  • We need two groups: a control group (no real
    treatment) and a treatment group (given the
    vaccine)

32
We should compare the two groups under equal
conditions
  • People are different from each other
  • By random assignment of participants to the two
    groups, we can make the two groups have almost
    identical conditions, e.g., around the same on
    average

33
Design of an Experiment
  • To compare one treatment (A) with another
    treatment (B), we need to randomize the patients
    into groups, each receiving one of the
    treatments
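
Random assignment can be sketched in a few lines. The patient identifiers, the equal group sizes and the seed below are hypothetical choices for illustration only.

```python
import random


def randomize(participants, seed=None):
    """Shuffle the participants and split them into two groups (A and B)."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy, so the input is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical list of 20 patients.
patients = [f"patient_{i:02d}" for i in range(1, 21)]
group_a, group_b = randomize(patients, seed=33)

print("Treatment A:", sorted(group_a))
print("Treatment B:", sorted(group_b))
```

Because chance alone decides who goes where, characteristics such as age or severity tend to balance out between the groups on average.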

34
Some possible mistakes
  • Data: from hospital records
  • Death rates of surgical patients are different
    for operations with different anesthetics
  • Halothane (1.7%), Pentothal (1.7%), Cyclopropane
    (3.4%), Ether (1.9%)
  • Can we say that cyclopropane is more dangerous
    than the other anesthetics?

35
Answer
  • No! The patients in the worst condition were
    the ones receiving cyclopropane.

36
The vaccine can prevent Polio
  • 1956---USA---over two million children involved
  • Should they all receive vaccine?
  • Should the males receive the vaccine while the
    females receive a placebo?

37
Placebo
  • In this case, the placebo is another kind of
    liquid, similar to the vaccine in appearance,
    injected into the children.
  • It is used so that all children appear to
    receive the same treatment, and the difference
    in the results cannot be explained as a
    psychological effect

38
Data

39
Analysis
  • The proportion of the control group having polio
    after ½ year: a/(a+b) = 0.00057
  • The proportion of the treatment group having
    polio after ½ year: c/(c+d) = 0.00016
  • The effect of treatment:
  • RD (risk difference) = c/(c+d) - a/(a+b)
    = -0.00041 (0.00041 in absolute value)

40
Formulation of the Hypotheses
  • Null Hypothesis: no difference in the proportions
  • Alternative Hypothesis: the two proportions are
    different

41
Analysis
  • We need to compare RD with its variation
  • That is, if we have different experiments, the
    results are different. The variation of these
    results can be measured by its variance.
  • But we have only one experiment

42
Estimate the variation
  • If there is no effect of the vaccine, the true
    risk (probability) of getting polio is
    pr = (a+c)/(a+b+c+d) = 0.00037
  • Under the above hypothesis, the variance of RD
    is given by
  • pr(1-pr) × (1/(a+b) + 1/(c+d))
  • The standard deviation is 0.000061.

43
Contd.
  • Thus the ratio 0.00041/0.000061 = 6.76 measures
    the effect of the vaccine.
  • Does 6.76 indicate a large, small, or no effect?
  • We need a yardstick.
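
The calculation can be sketched as a function of the four counts. The data slide was not transcribed, so the counts used below are hypothetical: they were chosen only so that the two proportions match the reported 0.00057 and 0.00016, and the resulting ratio therefore differs slightly from the 6.76 on the slide. The reading of a, b as control with and without polio, and c, d as treatment with and without polio, is inferred from the formulas.

```python
import math


def risk_difference_ratio(a, b, c, d):
    """Return (RD, std under the null hypothesis, RD / std).

    a, b: control group with / without polio (inferred meaning)
    c, d: treatment group with / without polio (inferred meaning)
    """
    risk_control = a / (a + b)
    risk_treatment = c / (c + d)
    rd = risk_treatment - risk_control

    # Pooled risk if the vaccine has no effect.
    pr = (a + c) / (a + b + c + d)
    std = math.sqrt(pr * (1 - pr) * (1 / (a + b) + 1 / (c + d)))
    return rd, std, rd / std


# Hypothetical counts, NOT the trial data.
rd, std, ratio = risk_difference_ratio(114, 199886, 32, 199968)
print(f"RD = {rd:.5f}, std = {std:.6f}, ratio = {ratio:.2f}")
```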

44
Intuition
  • Thus the ratio (RD/std) measures the effect of
    the vaccine.
  • That is, if it is large in absolute value, the
    effect of vaccine is significant
  • How large is large?

45
Random assignment of patients to treatments
  • If we do the experiment 1000 times and each time
    we calculate the ratio
  • We also assume that the effect of the vaccine
    is zero.
  • Then we plot the histogram of the 1000 ratios.
    We find the histogram is close to a bell-shaped
    curve: the normal density curve.

46
Normality
  • We know that the ratio is approximately normal,
    and we have obtained 6.76.
  • We can compute the area to the right of 6.76,
    i.e., the probability that the ratio is larger
    than 6.76 under the hypothesis of no effect. We
    find the area is very small (6.9 × 10^-12)
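
The tail area can be reproduced with the standard library. This sketch assumes a standard normal distribution for the ratio and uses the complementary error function, which stays accurate for such small probabilities.

```python
import math


def normal_upper_tail(z):
    """Area under the standard normal density to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


p_value = normal_upper_tail(6.76)
print(f"area to the right of 6.76: {p_value:.2e}")
```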

47
P value
  • The area corresponds to the probability of
    events more extreme than the observed value
  • The usual rule: if the p-value < 0.05, reject the
    null hypothesis
  • 0.05 can be interpreted as 5 wrong conclusions
    among 100 experiments in which the null
    hypothesis is actually true

48
Chi Square Test-Another approach
  • We can apply the chi square test to the same data
    set.
  • The chi square test is used to test whether the
    proportion of getting polio is the same for the
    two groups (homogeneity). Equivalently, whether
    the occurrence of polio is independent of the
    treatment (group)

49
Analysis
  • The chi square test statistic is given by
    N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d))
  • N = a+b+c+d
  • When the statistic is large, the hypothesis is
    likely to be wrong

50
Statistical Reasoning
  • The above statistic can be expressed as the
    summation of the quantities
  • (observed counts - expected counts)^2
  • divided by the expected counts
  • Here expected counts means the average counts
    under the hypothesis that the two groups are the
    same

51
Chi Square distribution
  • Chi square distribution with one degree of
    freedom
  • Significance level (p-value threshold) = 0.05
  • Cutoff point: 3.84, i.e., reject if the chi
    square statistic is larger than 3.84. Otherwise,
    do not reject the null hypothesis.
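
Both forms of the chi square statistic can be computed and compared. As before, the counts are hypothetical (the data slide was not transcribed) and were chosen only to match the reported proportions; the function names are illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Shortcut formula N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


def chi_square_from_counts(a, b, c, d):
    """Sum of (observed - expected)^2 / expected over the four cells."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total


# Hypothetical counts, NOT the trial data.
counts = (114, 199886, 32, 199968)
shortcut = chi_square_2x2(*counts)
summed = chi_square_from_counts(*counts)

print(f"shortcut formula: {shortcut:.2f}")
print(f"sum over cells:   {summed:.2f}")
print("reject at the 0.05 level:", shortcut > 3.84)
```

For a 2 x 2 table the statistic equals the square of the ratio RD/std, so the two approaches lead to the same conclusion.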

52
T-test (Two-Sample unpaired)
  • Randomize female rats into two groups
    (high-protein or low-protein diets)
  • Response variable: gain in weight between the
    28th and 84th days of age

53
Data
  • High protein: 134 146 104 119 124 161 107 83 113
    129 97 123
  • Mean = 120
  • Variance = 457.5
  • Low protein: 70 118 101 85 107 132 94
  • Mean = 101
  • Variance = 425.3
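
The summary figures can be reproduced from the listed weight gains. The sketch assumes the variances are sample variances (dividing by n - 1), which is what matches the values on the slide.

```python
import statistics

high_protein = [134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123]
low_protein = [70, 118, 101, 85, 107, 132, 94]

for name, gains in [("high", high_protein), ("low", low_protein)]:
    mean = statistics.mean(gains)
    var = statistics.variance(gains)  # sample variance, divides by n - 1
    print(f"{name} protein: n = {len(gains)}, mean = {mean}, variance = {var:.1f}")
```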

54
Hypotheses
  • Null hypothesis: no difference in the two means
  • Alternative hypothesis: the means are different

55
Analysis
  • The difference of the two means = 120 - 101 = 19
  • 19 measures the difference in weight gains
    between the two groups
  • Is it large or small? Could it arise by chance?
  • We need to compare it with its standard deviation
56
Variance and standard deviation
  • Standard deviation = square root of variance

57
(No Transcript)

58
Indicator
  • This is a better indicator of the difference
    between the two groups

59
Statistical reasoning
  • Indicator and yardstick
  • If we repeat the experiment 1000 times and
    compute 1000 t statistics
  • Plot the histogram for these 1000 t statistics
  • The histogram is similar to normal but with
    heavier tails

60
Analysis
  • We call it a t distribution
  • There are many t distributions for different
    sample sizes
  • The number (the sum of the two group sizes - 2)
    is called the degrees of freedom of the t
    distribution
  • (e.g. 12 + 7 - 2 = 17)

61
DF ≥ 30
  • When the degrees of freedom are larger than or
    equal to 30, the t distribution is very close
    to a normal distribution

62
Statistical Reasoning
  • Given the degree of freedom, we can find the area
    (probability)
  • If there is no difference between the two
    groups, the t distribution would be symmetric
    about zero.
  • If the data really arise from two treatments
    with the same results, the t statistic should be
    small

63
Statistical Reasoning
  • If the t statistic is small, the area
    (probability) of observing the actual statistic
    or larger must be large.
  • Conversely, if the area is small, the data tells
    us that the hypothesis is likely to be wrong

64
Statistical Reasoning
  • In this case, t = 1.89
  • The two-sided area for |t| beyond 1.89 (with 17
    degrees of freedom) is 0.076.
  • This area is called the p-value
  • Usually, when the p-value is less than 0.05, we
    reject the hypothesis
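
The reported t = 1.89 can be reproduced from the rat data. The slide with the formula was not transcribed, so this sketch assumes the usual pooled-variance two-sample t statistic; the function name is illustrative.

```python
import math
import statistics

high_protein = [134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123]
low_protein = [70, 118, 101, 85, 107, 132, 94]


def pooled_t(x, y):
    """Two-sample (unpaired) t statistic with a pooled variance estimate.

    Returns the statistic and its degrees of freedom.
    """
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = (
        (nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)
    ) / df
    std_of_difference = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / std_of_difference, df


t_stat, df = pooled_t(high_protein, low_protein)
print(f"t = {t_stat:.2f} with {df} degrees of freedom")
```

The p-value itself requires the t distribution, which the Python standard library does not provide; a statistics table or a statistics package would be needed for that step.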

65
  • Interactive Statistical Pages
  • Try the t-test (go to the procedure) and the chi
    square test (2 x 2 table for sample comparison)
    here.

66
Regression
  • Finding the mean of y for each x
  • To see whether x and y are associated

67
Data
  • Columns: name (not recoverable from the
    transcript), alcohol, death rate
  • 2.5 211
  • 3.9 167
  • 2.9 131
  • 2.4 191
  • 2.9 220
  • 0.8 297
  • 9.1 71
  • 0.8 211
  • 0.7 300
  • 7.9 107
  • 1.8 167
  • 1.9 266
  • 0.8 227
  • 6.5 86
  • 5.8 115
  • 1.6 207
  • 1.3 285
  • 1.2 199
  • 2.7 172

68
(Slide text not recoverable from the transcript; it ends with a reference to ecologic bias.)
[Figure: horizontal axis ticks from 1 to 9]

69
[Figure: plot with regression line; vertical axis ticks from 50 to 300]

70
[Figure: plot with regression line; vertical axis ticks from 50 to 300]

71
Analysis
  • Y (death rate) = 260.56 - 22.97 × X (alcohol)
  • The negative sign indicates that Y and X move in
    opposite directions.
  • More alcohol, lower heart disease death rate?
  • The result cannot be extended to the individual
    level: ecologic bias

72
Analysis
  • The variance of the error is 1434.79
  • If we compute the variance of Y, we find that it
    is 4678.05.
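
The fitted line and both variances can be reproduced from the numeric columns of the data slide. This is a sketch under stated assumptions: ordinary least squares, an error variance that divides the residual sum of squares by n - 2, and a variance of Y that divides by n - 1. The function name is illustrative.

```python
alcohol = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
           1.8, 1.9, 0.8, 6.5, 5.8, 1.6, 1.3, 1.2, 2.7]
death_rate = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
              167, 266, 227, 86, 115, 207, 285, 199, 172]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(alcohol, death_rate)
n = len(alcohol)
mean_y = sum(death_rate) / n

residuals = [y - (intercept + slope * x) for x, y in zip(alcohol, death_rate)]
error_variance = sum(r ** 2 for r in residuals) / (n - 2)
variance_y = sum((y - mean_y) ** 2 for y in death_rate) / (n - 1)

print(f"Y = {intercept:.2f} + ({slope:.2f}) X")
print(f"variance of the error: {error_variance:.2f}")
print(f"variance of Y:         {variance_y:.2f}")
```

The error variance is much smaller than the variance of Y, which is one way to see that the line accounts for a large share of the variation in death rates across the listed units.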

73
Questions
  • Email address: tslau_at_sparc2.sta.cuhk.edu.hk
  • Telephone: 2609-7927

74
Exercises
  • 1. (Sample survey)
  • Population: adults in Hong Kong. Sample: random
    sample, telephone survey
  • Parameter: proportion supporting the government
    in handling the protest
  • Statistic