Title: Central Limit Theorem
Lecture 5 slides on:
- Central Limit Theorem
- Stratified Sampling
- How to acquire a random sample
Prepared by Amrita Tamrakar
Central Limit Theorem
- Assume a given population of numbers
- $P = \{x_1, x_2, \dots\}$, possibly infinite
- xi xj
- Let $x_p$ = average of P, $\sigma_p^2$ = variance of P,
- k = number of tuples in the sample, $\mu_s$ = average of the sample.
- Does $\mu_s$ remain fixed?
- The standard error formula says $E(\mu_s) = x_p$.
- If $\sigma_s^2$ is the variance of the sample average, then
  $E(\mu_s) = x_p$ and $\sigma_s^2 = \sigma_p^2 / k$.
Interesting phenomenon: if we plot $\mu_s$ over many samples, the plot is not
skewed but forms a bell curve, even though the actual population may follow
any distribution.
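The two formulas, $E(\mu_s) = x_p$ and $\sigma_s^2 = \sigma_p^2 / k$, can be checked exactly on a toy case. The sketch below is only an illustration: the four-value population and k = 2 are made up, and it assumes samples are drawn with replacement (the setting in which the $\sigma_p^2 / k$ formula holds exactly). It enumerates every possible sample and compares both sides using exact fractions.

```python
from fractions import Fraction
from itertools import product


def mean(values):
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


def variance(values):
    """Population variance (divide by n, not n - 1)."""
    values = list(values)
    m = mean(values)
    return sum(((v - m) ** 2 for v in values), Fraction(0)) / len(values)


# Hypothetical toy population; the values are chosen only for illustration.
population = [Fraction(v) for v in (1, 2, 3, 10)]
k = 2  # sample size

# Every possible ordered sample of size k, drawn with replacement.
sample_means = [mean(sample) for sample in product(population, repeat=k)]

x_p = mean(population)        # population average
var_p = variance(population)  # population variance

print(mean(sample_means) == x_p)            # checks E(mu_s) = x_p
print(variance(sample_means) == var_p / k)  # checks var_s = var_p / k
```

Exact fractions are used so that the equality checks are not disturbed by floating-point rounding.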
The Central Limit Theorem says: as we repeatedly draw random samples from a
distribution, the shape of the original distribution disappears from the
sample averages, which form a bell-shaped curve that gets tighter as the
sample size grows.
[Figure: skewed distribution of salary (axis marks 0k, 40k, 200k; x = exact average) and the corresponding plot of µ]
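The bell-curve effect can be seen in a small simulation. This is a sketch under assumptions that are not in the slides: the skewed "salary" population is modeled as an exponential distribution with mean 40, and skewness is measured as the third central moment divided by the cubed standard deviation (0 for a symmetric curve). The names `skewness` and `sample_means` are illustrative helpers.

```python
import random
import statistics


def skewness(values):
    """Third central moment over the cubed standard deviation."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)


def sample_means(rng, k, repetitions):
    """Average of k draws from the skewed population, repeated many times."""
    return [
        statistics.fmean(rng.expovariate(1 / 40.0) for _ in range(k))
        for _ in range(repetitions)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
population_draws = [rng.expovariate(1 / 40.0) for _ in range(20000)]
means_k10 = sample_means(rng, 10, 2000)
means_k100 = sample_means(rng, 100, 2000)

print("skew of raw values:  ", round(skewness(population_draws), 2))
print("skew of means, k=10: ", round(skewness(means_k10), 2))
print("skew of means, k=100:", round(skewness(means_k100), 2))
print("spread of means, k=10 vs k=100:",
      round(statistics.pstdev(means_k10), 2),
      round(statistics.pstdev(means_k100), 2))
```

The raw values stay strongly skewed, while the sample averages should move toward zero skew and a smaller spread as k grows.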
Our main objective is not to reduce the error but to give an exact error
interval. Hence we need to find the variance. There are two options for
finding the population variance $\sigma_p^2$:
1) Use a materialized view with an extra column, e.g. 0 for females, 1 for
males.
2) Calculate the sample variance many times to get an unbiased estimate of
the original variance, i.e. use the sample variance as a surrogate for the
original variance.
Which one will be better?
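Option 2 can be illustrated with a short simulation. This is a sketch, not the lecture's method: the population is a hypothetical set of 20,000 exponentially distributed "salaries", and `statistics.variance` is used because it divides by k - 1, which is what makes the sample variance an unbiased estimate.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population of salaries (in thousands), skewed to the right.
population = [rng.expovariate(1 / 40.0) for _ in range(20_000)]
true_variance = statistics.pvariance(population)

# Draw many samples of size k and compute the sample variance of each.
k = 50
estimates = [statistics.variance(rng.sample(population, k)) for _ in range(1000)]
average_estimate = statistics.fmean(estimates)

print("population variance:        ", round(true_variance, 1))
print("average of sample variances:", round(average_estimate, 1))
```

A single sample variance can be far off, but the average over many samples should land close to the population variance.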
- Error Interval with Confidence Level
- To give the error interval with 95% confidence:
- Find a point d such that the area under the curve between x - d and x + d
is 0.95; then x ± d is the error interval with 95% confidence.
[Figure: bell curve centered at x, with area = 0.95 between x - d and x + d]
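Alternatively, to find d we can calculate $d = 1.96 \cdot sd$, where the
standard deviation of the sample average is $sd = \sigma_p / \sqrt{k}$
($\sigma_p$ is the population standard deviation and k is the sample size).
http://www.math.duke.edu/wka/math135/confidence.pdf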
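The 1.96 rule can be sketched in a few lines. The function name `error_interval_95` is illustrative, and the sketch assumes the sample standard deviation is used in place of the unknown population standard deviation (the surrogate idea from the previous slide). The sample data is a made-up exponential "salary" sample.

```python
import math
import random
import statistics


def error_interval_95(sample):
    """Return (mean, d) so that mean - d .. mean + d is an approximate 95% interval."""
    k = len(sample)
    mean = statistics.fmean(sample)
    # The sample standard deviation stands in for the unknown population one.
    sd_of_mean = statistics.stdev(sample) / math.sqrt(k)
    return mean, 1.96 * sd_of_mean


rng = random.Random(2)  # fixed seed so the run is repeatable
sample = [rng.expovariate(1 / 40.0) for _ in range(400)]  # hypothetical salaries
mean, d = error_interval_95(sample)
print(f"average = {mean:.1f}, 95% interval = [{mean - d:.1f}, {mean + d:.1f}]")
```

Because d shrinks with the square root of k, quadrupling the sample size roughly halves the width of the interval.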
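Stratified Sampling
Will stratification of salary give more accurate results?
Population P is broken into r strata ($P_1, \dots, P_r$):

Stratum   Sample mean   Sample size
$P_1$     $s_1$         $k_1$
$P_2$     $s_2$         $k_2$
...
$P_r$     $s_r$         $k_r$

[Figure: salary axis from 0k to 200k (marks at 50k and 100k) split into strata labeled N1, N2, ..., Nr]
The technique for stratifying is to minimize the variance within each
stratum.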
Total sample: $k_1 + k_2 + \dots + k_r$. Mean of sample: $\mu_s$.
- Challenges
- Stratification: how to break the population into strata
- Allocation: how many samples to take from the 1st group, the 2nd group,
and so on, i.e. how to allocate samples
[Figure: salary distribution with axis marks at 0k, 30k, 40k, 70k]
In this graph, can we say "get more samples from the 30-70k range"
(allocation strategy)?
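Stratification and allocation can be tried out on made-up data. This is a sketch under assumptions not stated on the slides: the overall mean is estimated by weighting each stratum's sample mean by its share of the population (the usual textbook estimator), allocation is proportional to stratum size, and the strata boundaries 50k and 100k are taken from the salary figure. The names `stratified_mean` and `proportional_allocation` are hypothetical.

```python
import random
import statistics


def stratified_mean(strata, sample_sizes, rng):
    """Estimate the population mean from one sample per stratum.

    Each stratum's sample mean is weighted by N_i / N, its population share.
    """
    total = sum(len(stratum) for stratum in strata)
    estimate = 0.0
    for stratum, k_i in zip(strata, sample_sizes):
        sample = rng.sample(stratum, k_i)
        estimate += (len(stratum) / total) * statistics.fmean(sample)
    return estimate


def proportional_allocation(strata, k):
    """Split about k samples across strata in proportion to size (at least 1 each)."""
    total = sum(len(stratum) for stratum in strata)
    return [max(1, round(k * len(stratum) / total)) for stratum in strata]


rng = random.Random(3)  # fixed seed so the run is repeatable
# Hypothetical skewed salaries in thousands, capped just below 200.
salaries = [min(rng.expovariate(1 / 40.0), 199.9) for _ in range(10_000)]
bounds = [(0, 50), (50, 100), (100, 200)]
strata = [[s for s in salaries if lo <= s < hi] for lo, hi in bounds]
sizes = proportional_allocation(strata, 60)

stratified = [stratified_mean(strata, sizes, rng) for _ in range(500)]
simple = [statistics.fmean(rng.sample(salaries, sum(sizes))) for _ in range(500)]

print("allocation per stratum:     ", sizes)
print("spread of stratified means: ", round(statistics.pstdev(stratified), 2))
print("spread of plain sample means:", round(statistics.pstdev(simple), 2))
```

Because of rounding, the allocated sizes may not sum to exactly k, so the plain random sample uses `sum(sizes)` to keep the comparison fair. Other allocation strategies, such as taking more samples from high-variance strata, would replace `proportional_allocation`.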
- How is data organized in a database?
- In disk blocks
- To read a single record, the entire disk block must be read
- Clustered indexes and B-trees are some of the indexing techniques
- Two approaches for sampling
- Online sampling
- Offline sampling, also called pre-computed sampling
- Effects
- Online sampling is costly in terms of response time.
- Offline sampling can be done during pre-processing time.
- The sample can be reused.
- How to get sample data
- Generate a random number between 0 and $10^6$ and pull out the record with
that record id.
- OR
- Bernoulli sampling
- Go to each record
- Toss a coin
- If heads, pull out the record; else leave it.
- Note: may not get the exact sample size
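Both ways of pulling a sample can be sketched as follows. The function names are illustrative, the "table" is just a range of one million record ids, and the first method is written to draw distinct ids (sampling without replacement), which is an assumption the slide does not spell out.

```python
import random


def sample_by_random_ids(records, k, rng):
    """Pick k distinct record ids uniformly at random and fetch those records."""
    ids = rng.sample(range(len(records)), k)
    return [records[i] for i in ids]


def bernoulli_sample(records, p, rng):
    """Visit every record and keep it with probability p (one coin toss each)."""
    return [record for record in records if rng.random() < p]


rng = random.Random(4)    # fixed seed so the run is repeatable
records = range(10 ** 6)  # stand-in for a table with record ids 0 .. 10^6 - 1

by_ids = sample_by_random_ids(records, 1000, rng)
by_coin = bernoulli_sample(records, 0.001, rng)

print("random-id sample size:", len(by_ids))   # always exactly k
print("Bernoulli sample size:", len(by_coin))  # close to 1000, but varies
```

The random-id method touches only k records but needs direct access by id; the coin-toss method scans every record once and returns a sample whose size is only approximately p times the table size.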
- How to maintain the freshness of data in a random sample obtained via the
offline method?
- It doesn't matter much, as the samples are taken over historical data.
- What if the original query changes? Maybe it was directed towards a
particular field only.
- Generate the random sample again; it doesn't matter much for performance
since it is pre-processed.
- E.g. generate once every 3 months.
- Oracle and SQL Server have added random sampling functionality in their
newer versions.