Title: Short Course in Statistics
1Short Course in Statistics
- Learning Statistics through Computer
- Note: Microsoft Chinese Windows is needed to display some slides
2Random Sampling
- To obtain information through sampling
- Population and Sample
- Parameter and Statistic
3Population versus Sample
- Population
- The entire group of individuals about which we
want information
- Sample
- A part of the population from which we actually
collect information, used to draw conclusions
about the whole population.
4Example
- Population: the weight measurements of all children under 18
- Sample: the weight measurements of students in 20 secondary and primary schools
5Parameter versus Statistic
- Parameter
- A number that describes the population.
- Statistic
- A number that describes a sample.
6Drawing balls from a box
- A box contains 10 balls: 5 red, 5 black
- Population: the 10 balls
- Parameter: proportion of red balls
- Draw a random sample of size 3
- Statistic: proportion of red balls in the sample, e.g. 2/3
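The box example can be tried on a computer. The following is a minimal sketch, assuming the sample is drawn without replacement; the seed value is arbitrary and only makes the run repeatable.

```python
import random

# Population from the slide: 10 balls, 5 red and 5 black.
box = ["red"] * 5 + ["black"] * 5

# Parameter: proportion of red balls in the whole box.
parameter = box.count("red") / len(box)

# Arbitrary seed (an assumption) so that the run is repeatable.
rng = random.Random(1)

# Random sample of size 3, drawn without replacement.
sample = rng.sample(box, 3)

# Statistic: proportion of red balls in the sample.
statistic = sample.count("red") / len(sample)

print("parameter:", parameter)
print("sample:", sample, "statistic:", statistic)
```

Changing the seed gives a different sample and possibly a different statistic, while the parameter stays at 0.5.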
7Statistical Science
- Statistics provides methodology to estimate the
parameter through the (random) sample
8How to draw a random sample
- Construct a sampling frame: give a number (name) to each individual in the population
- Use a random number table to draw a random sample of the prescribed size
9Random Number Table
- Imagine a box containing 10 identical balls numbered 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9.
- Each time, you draw a ball, record its number, and return it to the box before drawing the next ball. The resulting list of recorded numbers is the random number table.
10Example
- Objective: draw a sample of size 5 from a class of 30 students
- Sampling frame: label each student with one of the numbers 00, 01, ..., 29.
- Read the random number table at line 130: 69051 64817 87174 09517
- Regroup the digits in pairs: 69 05 16 48 17 87 17 40 95 17
11Multiple Label
- Give each student several labels: 00, 30, 60 for the first; 01, 31, 61 for the second; 02, 32, 62 for the third; etc.
- Notice that 01 corresponds to the second individual
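The class example with multiple labels can be sketched in a few lines. This is only an illustration: it assumes that the pairs 90 to 99 are skipped (so that every student has exactly three labels) and that a student already chosen is not chosen again; the slides do not state either rule.

```python
# Digits copied from line 130 of the random number table in the example.
digits = "69051 64817 87174 09517".replace(" ", "")

# Read the digits two at a time: 69, 05, 16, 48, ...
pairs = [int(digits[i:i + 2]) for i in range(0, len(digits) - 1, 2)]

class_size = 30    # students labelled 00 to 29
sample_size = 5

sample = []
for pair in pairs:
    # Pairs 90 to 99 are skipped (assumption) so that every student
    # keeps exactly three labels, e.g. 00, 30 and 60.
    if pair >= 3 * class_size:
        continue
    student = pair % class_size     # 30 -> 00, 61 -> 01, and so on
    if student not in sample:       # assumption: no student drawn twice
        sample.append(student)
    if len(sample) == sample_size:
        break

print(pairs)    # [69, 5, 16, 48, 17, 87, 17, 40, 95, 17]
print(sample)   # [9, 5, 16, 18, 17]
```

Without multiple labels only 05, 16 and 17 would be usable among these ten pairs, so more of the table would have to be read.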
12Measurements in the Laboratory
- Each measurement in the physics lab or chemistry
lab can be regarded as an element in a random
sample
13 - http://www.cuhk.edu.hk/webct
- User ID / Password: STA2103(Surname)(Initials)
- Go to the above website and learn sample survey,
design of experiment and regression
14 - Henry, Chau: STA2103chauh
- Ka Ho Enoch, Chan: STA2103chankhe
- Jane, Tang: STA2103tangj
- Vincent, Pong: STA2103pongv
- Clara, Yip: STA2103yipc
15Why Random Sampling
- To be representative
- The statistic obeys known laws, described by its sampling distribution
- Probability, the chance that an event occurs in n independent samplings, can be computed
16Not representative
- Call-in polls
- Voluntary response on the Web
- Telephone surveys asking the respondents to respond with the number keys
- Readers' letters to the newspaper
17Sampling Distribution
- Under random sampling, the statistic changes as the sample varies
- That is, the conclusion might change from one sample to another
- But if the samples are randomly drawn, we can predict the result with high probability
18Example
- Population: Hong Kong adult residents
- Sample (random): 600 persons
- Parameter: proportion of the population supporting one more public holiday
- Statistic: proportion in the sample
19Consequence of Random Sampling
- If we draw 1000 samples (each of size 600) and compute the statistic for each sample, the histogram of these 1000 sample proportions is approximately a bell-shaped curve: the normal density
20Normal and Probability
- Normal density has 2 parameters
- Mean: the true proportion (p)
- Variance: var = p(1-p)/n
- Standard deviation: std = sqrt(var)
- The proportion from the one sample we draw falls in the interval (p - 1.96 std, p + 1.96 std) with probability 0.95
21Mean of normal = true parameter
- If you draw a sample 1000 times, you have 1000 sample proportions.
- The average of these 1000 sample proportions would be approximately the true proportion: the sample proportion is an unbiased estimate of the population proportion
22Variance = p(1-p)/n
- If the sampling is truly random, the variance of these 1000 sample proportions can be computed from p (the parameter) and the sample size n alone.
- So if I have only one sample that gives an accurate estimate of p, the variance of the 1000 sample proportions can be computed without actually drawing the 1000 samples
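The claims about the mean and the variance of the sample proportion can be checked by simulation. The sketch below assumes a hypothetical true proportion p = 0.5 (the slides give no value) and treats each of the 600 answers as an independent draw; the seed is arbitrary.

```python
import random
import statistics

p = 0.5             # hypothetical true proportion, not from the slides
n = 600             # sample size from the opinion poll example
num_samples = 1000  # number of repeated samples

rng = random.Random(0)  # arbitrary seed for a repeatable run

def sample_proportion():
    # Each respondent supports the proposal with probability p.
    supporters = sum(rng.random() < p for _ in range(n))
    return supporters / n

proportions = [sample_proportion() for _ in range(num_samples)]

mean_of_proportions = statistics.mean(proportions)
variance_of_proportions = statistics.variance(proportions)
theoretical_variance = p * (1 - p) / n

print("mean of the 1000 proportions:", round(mean_of_proportions, 4))
print("variance of the 1000 proportions:", variance_of_proportions)
print("p(1-p)/n:", theoretical_variance)
```

The mean should come out close to p, and the observed variance close to p(1-p)/n, which is what the two slides assert.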
23Intuition behind the formula p(1-p)/n
- Symmetric about ½
- It is maximized at p = 1/2 (very uncertain)
- When p is closer to 0 or 1, i.e., things are more definite, the variance gets smaller
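A small table makes the shape of p(1-p)/n visible. This sketch assumes n = 600, the sample size of the opinion poll example; the function name is only illustrative.

```python
n = 600  # sample size borrowed from the opinion poll example

def variance_of_proportion(p, n):
    """Variance of a sample proportion under random sampling."""
    return p * (1 - p) / n

# Evaluate the formula for p = 0.1, 0.2, ..., 0.9.
ps = [i / 10 for i in range(1, 10)]
variances = {p: variance_of_proportion(p, n) for p in ps}

for p, var in variances.items():
    print(f"p = {p:.1f}   var = {var:.6f}")

# The value of p with the largest variance.
largest = max(variances, key=variances.get)
print("largest variance at p =", largest)
```

The printed values rise toward p = 0.5 and fall again, and p = 0.2 gives the same variance as p = 0.8.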
24Confidence Interval
- Conversely, the interval (sample proportion - 1.96 std, sample proportion + 1.96 std) will cover p about 95 times out of 100 such experiments.
- Notice std = sqrt(p(1-p)/n), with p replaced by the sample proportion in practice
25 95% Confidence Interval
- Applying the formula to 100 surveys, we obtain 100 different interval estimates
- About 95 of these 100 intervals would contain the true p
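The coverage statement can be checked by simulation. The sketch below assumes a hypothetical true proportion p = 0.4, uses 1000 surveys instead of 100 so that the coverage rate is more stable, and plugs the sample proportion into the formula for std.

```python
import math
import random

p = 0.4             # hypothetical true proportion, not from the slides
n = 600             # persons per survey
num_surveys = 1000  # more than the 100 on the slide, for stability

rng = random.Random(0)  # arbitrary seed for a repeatable run

covered = 0
for _ in range(num_surveys):
    # One survey: the sample proportion of supporters.
    p_hat = sum(rng.random() < p for _ in range(n)) / n
    # std estimated by replacing p with the sample proportion.
    std = math.sqrt(p_hat * (1 - p_hat) / n)
    lower, upper = p_hat - 1.96 * std, p_hat + 1.96 * std
    if lower <= p <= upper:
        covered += 1

coverage = covered / num_surveys
print("share of intervals containing the true p:", coverage)
```

The printed share should be close to 0.95, although any single run will differ a little.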
26Opinion Polls
- People may not give the true response: response error
- People may not answer the questions: nonresponse error
- Unit nonresponse (the person does not respond at all)
- Item nonresponse (the person does not respond to some questions)
27Response rate
- If the response rate is less than 80%, we would doubt the validity of the inference
28Election Polls
- The respondents may not be voters
- A respondent may not vote even if he/she has registered
- A respondent may lie (response error)
29Questionnaire
- The way the questions are worded affects the responses (a well-known effect)
30Other Data Collection Methods
- Experimental Design
- Observational Data (e.g. registry Data)
31How to know the effect of vaccine in preventing
polio
- We cannot give the vaccine to all children and compare the results with those of the past
- We need two groups: a control group (no real treatment) and a treatment group (given the vaccine)
32We should compare the two groups under equal
conditions
- People are different from each other
- By randomly assigning participants to the two groups, we can make the two groups almost identical in their conditions, e.g., about the same on average
33Design of an Experiment
- To compare one treatment (A) with another treatment (B), we need to randomize the patients into two groups, each receiving one of the treatments
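Random assignment is easy to carry out on a computer. The following is a minimal sketch with 20 hypothetical patient labels; splitting into two equal halves is an assumption, not something the slides require.

```python
import random

# Hypothetical patient labels, invented for this illustration.
patients = [f"patient_{i:02d}" for i in range(20)]

rng = random.Random(0)  # arbitrary seed for a repeatable run

# Shuffle a copy of the list, then split it into two halves.
shuffled = patients[:]
rng.shuffle(shuffled)

half = len(shuffled) // 2
group_a = shuffled[:half]   # receives treatment A
group_b = shuffled[half:]   # receives treatment B

print("A:", sorted(group_a))
print("B:", sorted(group_b))
```

Because chance alone decides who goes where, neither the doctor nor the patient can steer the sicker patients toward one treatment.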
34Some possible mistakes
- Data: from hospital records
- Death rates of surgical patients differ for operations with different anesthetics
- Halothane (1.7%), Pentothal (1.7%), Cyclopropane (3.4%), Ether (1.9%)
- Can we say that cyclopropane is more dangerous than the other anesthetics?
35Answer
- No! The patients in the worst condition were the ones receiving cyclopropane.
36The vaccine can prevent Polio
- 1956, USA: over two million children involved
- Should they all receive the vaccine?
- Should the males receive the vaccine while the females receive a placebo?
37Placebo
- In this case, the placebo is another liquid, similar to the vaccine in appearance, injected into the children.
- It is used so that all children receive the same kind of treatment, so that a difference in the results cannot be explained as a psychological effect
38Data
39Analysis
- The proportion of the control group having polio after ½ year: a/(a+b) = 0.00057
- The proportion of the treatment group having polio after ½ year: c/(c+d) = 0.00016
- The effect of treatment:
- RD (risk difference) = a/(a+b) - c/(c+d) = 0.00041
40Formulation of the Hypotheses
- Null Hypothesis: no difference in the proportions
- Alternative Hypothesis: the two proportions are different
41Analysis
- We need to compare RD with its variation
- That is, if we ran different experiments, the results would differ. The variation of these results can be measured by the variance.
- But we have only one experiment
42Estimate the variation
- If the vaccine has no effect, the true risk (probability) of getting polio is the same in both groups and is estimated by pr = (a+c)/(a+b+c+d) = 0.00037
- Under this hypothesis, the variance of RD is given by
- pr(1-pr) x (1/(a+b) + 1/(c+d))
- The standard deviation is 0.000061.
43Contd.
- Thus the ratio 0.00041/0.000061 = 6.76 measures the effect of the vaccine.
- Does 6.76 indicate a large, a small or no effect?
- We need a yardstick.
44Intuition
- Thus the ratio (RD/std) measures the effect of the vaccine.
- That is, if it is large in absolute value, the effect of the vaccine is significant
- How large is large?
45Random assignment of patients to treatments
- Suppose we do the experiment 1000 times and calculate the ratio each time
- We also assume that the effect of the vaccine is zero.
- Then we plot the histogram of the 1000 ratios. We find the histogram is close to a bell-shaped curve: the normal density curve.
46Normality
- We know that the ratio is approximately normal, and we have obtained 6.76.
- We can compute the area to the right of 6.76: the probability that the ratio is larger than 6.76 under the hypothesis of no effect. We find the area is very small (6.9 x 10^-12)
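The tail area can be reproduced with the standard normal distribution. The sketch below uses only the rounded figures quoted on the slides; with those rounded inputs the ratio comes out near 6.72, so the 6.76 on the slides was presumably computed from unrounded values (an assumption).

```python
import math

# Rounded figures quoted on the slides.
rd = 0.00041      # risk difference
std = 0.000061    # standard deviation under the no-effect hypothesis

# Ratio recomputed from the rounded figures (about 6.72).
ratio_from_rounded = rd / std

# Ratio as reported on the slides.
ratio = 6.76

# Area to the right of the ratio under the standard normal curve,
# written with the complementary error function.
p_value = 0.5 * math.erfc(ratio / math.sqrt(2))

print("ratio from rounded figures:", round(ratio_from_rounded, 2))
print("area to the right of 6.76:", p_value)
```

The printed area agrees with the 6.9 x 10^-12 quoted on the slide.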
47P value
- The area corresponds to the probability of an event at least as extreme as the observed value
- The usual rule: if the p-value < 0.05, reject the null hypothesis
- 0.05 can be interpreted as about 5 wrong rejections among 100 experiments in which the null hypothesis is true
48Chi Square Test-Another approach
- We can apply the chi square test to the same data set.
- The chi square test is used to test whether the proportion of getting polio is the same for the two groups (homogeneity). Equivalently, whether the occurrence of polio is independent of the treatment (group)
49Analysis
- The chi square test statistic is given by N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d))
- N = a+b+c+d
- When the statistic is large, the hypothesis is likely to be wrong
50Statistical Reasoning
- The above statistic can be expressed as the summation, over the four cells of the table, of the quantities
- (observed count - expected count)^2
- divided by the expected count
- Here the expected count means the average count under the hypothesis that the two groups are the same
51Chi Square distribution
- Chi square distribution with one degree of freedom
- P-value = 0.05
- Cutoff point: 3.84, i.e., reject if the chi square statistic is larger than 3.84. Otherwise, do not reject the null hypothesis.
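The two ways of writing the chi square statistic can be compared numerically. The counts below are hypothetical, chosen only for illustration; they are not the polio data, which this transcript does not include. The function names are also illustrative.

```python
def chi_square_shortcut(a, b, c, d):
    """N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)) for a 2 x 2 table."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

def chi_square_from_counts(a, b, c, d):
    """Sum over the four cells of (observed - expected)^2 / expected."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count when the two groups share the same proportion.
            expected = row_totals[i] * col_totals[j] / n
            total += (observed[i][j] - expected) ** 2 / expected
    return total

# Hypothetical 2 x 2 table (NOT the polio data).
a, b, c, d = 30, 70, 15, 85

shortcut = chi_square_shortcut(a, b, c, d)
summed = chi_square_from_counts(a, b, c, d)
reject = shortcut > 3.84   # cutoff for one degree of freedom at 0.05

print(round(shortcut, 4), round(summed, 4), reject)
```

Both functions give the same value for this table, about 6.45, which exceeds 3.84, so the hypothesis of equal proportions would be rejected for these made-up counts.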
52T-test (Two-Sample unpaired)
- Randomize female rats into two groups (high-protein and low-protein diets)
- Response variable: gain in weight between the 28th and 84th days of age
53Data
- High protein: 134 146 104 119 124 161 107 83 113 129 97 123
- Mean = 120
- Variance = 457.5
- Low protein: 70 118 101 85 107 132 94
- Mean = 101
- Variance = 425.3
54Hypotheses
- Null hypothesis: no difference in the two means
- Alternative hypothesis: the means are different
55Analysis
- The difference of the two means = 120 - 101 = 19
- 19 measures the difference in weight gains between the two groups
- Is it large or small? Could it arise by chance?
- We need to compare it with its standard deviation
56Variance and standard deviation
- Standard deviation = square root of variance
57(No Transcript)
58Indicator
- This is a better indicator of the difference
between the two groups
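The slide that defines the indicator is missing from this transcript. A plausible reading, assumed here, is the pooled two-sample t statistic: the difference of the two means divided by its estimated standard deviation. Applied to the rat data, this sketch reproduces the means, variances, degrees of freedom and t value quoted on the other slides.

```python
import math
import statistics

# Weight gains copied from the data slide.
high = [134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123]
low = [70, 118, 101, 85, 107, 132, 94]

n1, n2 = len(high), len(low)
mean1, mean2 = statistics.mean(high), statistics.mean(low)
var1, var2 = statistics.variance(high), statistics.variance(low)

# Degrees of freedom: sum of the two group sizes minus 2.
df = n1 + n2 - 2

# Pooled variance (assumption: both groups share one variance).
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df

# Standard deviation of the difference of the two means.
std_of_difference = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t = (mean1 - mean2) / std_of_difference

print("means:", mean1, mean2)
print("variances:", round(var1, 1), round(var2, 1))
print("df:", df, "t:", round(t, 2))
```

Dividing by the standard deviation puts the raw difference of 19 on a scale that does not depend on the units of measurement.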
59Statistical reasoning
- Indicator and yardstick
- If we repeat the experiment 1000 times and compute 1000 t statistics
- Plot the histogram for these 1000 t statistics
- The histogram is similar to the normal curve but with heavier tails
60Analysis
- We call it a t distribution
- There are many t distributions, one for each sample size
- The number (the sum of the two group sizes minus 2) is called the degrees of freedom of the t distribution
- (e.g. 12 + 7 - 2 = 17)
61DF >= 30
- When the degrees of freedom are larger than or equal to 30, the t distribution is very close to a normal distribution
62Statistical Reasoning
- Given the degrees of freedom, we can find the area (probability)
- If there is no difference between the two groups, the t distribution would be symmetric about zero.
- If the data really arise from two treatments with the same results, the t statistic should be small
63Statistical Reasoning
- If the t statistic is small, the area (probability) of observing the actual statistic or larger must be large.
- Conversely, if the area is small, the data tell us that the hypothesis is likely to be wrong
64Statistical Reasoning
- In this case, t = 1.89
- The area in the two tails beyond 1.89 in absolute value (when degrees of freedom = 17) is 0.076.
- This area is called the p-value
- Usually, when the p-value is less than 0.05, we reject the hypothesis
65 - Interactive Statistical Pages
- Try the t-test (go to the procedure) and the chi square test (2 x 2 table for sample comparison) here.
66Regression
- Finding the mean of y for each x
- To see whether x and y are associated
67Data
- Alcohol (x), heart disease death rate (Y)
- 2.5, 211
- 3.9, 167
- 2.9, 131
- 2.4, 191
- 2.9, 220
- 0.8, 297
- 9.1, 71
- 0.8, 211
- 0.7, 300
- 7.9, 107
- 1.8, 167
- 1.9, 266
- 0.8, 227
- 6.5, 86
- 5.8, 115
- 1.6, 207
- 1.3, 285
- 1.2, 199
- 2.7, 172
68[Slide text not recoverable; it closes with a note on ecologic bias. Figure: horizontal axis marked 1 to 9.]
69[Figure: scatter plot with the regression line; vertical axis marked 50 to 300]
71Analysis
- Y (death rate) = 260.56 - 22.97 x (alcohol)
- The negative sign indicates that Y and x go in opposite directions.
- More alcohol, lower heart disease death rate?
- The result cannot be extended to the individual level: ecologic bias
72Analysis
- The variance of the error is given by 1434.79
- If we compute the variance of Y, we find that the
variance is given by 4678.05.
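The regression line and the two variances can be recomputed from the 19 data pairs. The sketch below uses ordinary least squares and divides the residual sum of squares by n - 2 to get the error variance; that divisor is an assumption, although it reproduces the figure on the slide.

```python
# (alcohol x, death rate Y) pairs copied from the data slide.
data = [
    (2.5, 211), (3.9, 167), (2.9, 131), (2.4, 191), (2.9, 220),
    (0.8, 297), (9.1, 71), (0.8, 211), (0.7, 300), (7.9, 107),
    (1.8, 167), (1.9, 266), (0.8, 227), (6.5, 86), (5.8, 115),
    (1.6, 207), (1.3, 285), (1.2, 199), (2.7, 172),
]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Sums of squares and cross products about the means.
sxx = sum((x - mean_x) ** 2 for x, _ in data)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
syy = sum((y - mean_y) ** 2 for _, y in data)

# Least squares slope and intercept.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Variance of the error (residuals) and variance of Y itself.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in data)
error_variance = sse / (n - 2)
variance_y = syy / (n - 1)

print("Y =", round(intercept, 2), "+", round(slope, 2), "x")
print("error variance:", round(error_variance, 2))
print("variance of Y:", round(variance_y, 2))
```

The error variance is much smaller than the variance of Y, which means that a large part of the variation in death rates across these data points goes along with alcohol consumption.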
73Questions
- Email address: tslau_at_sparc2.sta.cuhk.edu.hk
- Telephone: 2609-7927
74Exercises
- 1. (Sample survey)
- Population (adults in Hong Kong), Sample (random sample, telephone survey)
- Parameter: proportion supporting the government's handling of the protest
- Statistic