Title: Statistical Genetics 6 GWAS Data QC
1Statistical Genetics 6 GWAS Data QC
- Graduate School of Medicine
- Kyoto University
- 2008/09/17-25
- IMS-UT
- Ryo Yamada
2Single marker-single phenotype test
Genotype data are an attribute shared by subjects
and markers, and through them the relation between
a phenotype and a marker is tested. This test depends
on the subjects being random except for the phenotype.
3GWA study
Each individual test assumes that subjects are random
except for the phenotype. A statistic based on this
assumption is corrected by the difference between
its random distribution under the unbiased condition
and its random distribution under the biased condition.
Tests (markers) are not independent of each
other. A statistic affected by this dependency is
corrected by relying on the evenness of the dependency
throughout the tests.
4GWA study
- Biased samples
- Population structure
- Structure in sampling population overall
- Biased sampling from structured population
- Dependency among tests
- Allelic association
- Linkage disequilibrium
- Allelic association due to population structure
- Dependency among tests that share markers and/or
subjects.
5The assumption is corrected by the difference between
the random distribution under the unbiased condition and
the random distribution under the biased condition.
Deviated records are important for this correction.
Do not throw away data records unless a specific
causative mistake is reasonably certain.
When discarding is appropriate, throw records away
so that the discard won't disturb the beauty of the
distribution.
6Steps
- Check study design
- Check input data
- Run analyzing applications
- Interpret the outputs
- How to survive with endless requests along with
our own research projects?
7Design
- List subjects
- List phenotypes
- List markers
- Is the study design simple or not?
- Dependency among tests that share markers and/or
subjects.
8- Phenotype data for population genetics: sex,
self-identified ethnicity, birth place...
- Phenotype data of interest: disease
- Location data of markers and genes
- Methods
- Input conditions
- Annotation
- Data processing
- Markers
- Subjects
- Marker-specific materials for assays
- DNA samples
- Other materials, machines and assay conditions
- Assay experiments
- Genotype data
- Descriptive statistics
- Test
9Check input data with genotype data sets
- Targets of data-check are
- NOT data records themselves
- BUT items used when they were produced
- Condition of data recording of phenotypes
- Method of annotation
- DNAs, marker-specific and non-specific reagents
and other assay conditions
10Check input data with genotype data sets: Check
WITHOUT genetic knowledge
- Examples
- Markers with a successful call rate far lower than
the marker population (w.c.r.l.p.) should have a
marker-specific cause.
- Samples w.c.r.l.p. should have a sample-specific
cause.
- An assay batch w.c.r.l.p. should have a
batch-specific cause.
- "Far lower" has to be judged in the multiple
testing context.
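One way to judge "far lower" in the multiple testing context can be sketched as follows. This is a minimal illustration, not the lecture's procedure: it assumes genotypes stored as lists with `None` for a failed call, a binomial model with the overall call rate as the reference, and a Bonferroni cut-off across the units tested. The function names and the toy data are hypothetical.

```python
from math import comb

def binom_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def flag_low_call_rate(calls_by_unit, alpha=0.05):
    """Indices of units (markers, samples or batches) whose call rate is
    far lower than the overall rate, after Bonferroni correction."""
    called = [sum(g is not None for g in unit) for unit in calls_by_unit]
    attempted = [len(unit) for unit in calls_by_unit]
    p0 = sum(called) / sum(attempted)      # reference: overall call rate
    cutoff = alpha / len(calls_by_unit)    # one test per unit
    return [i for i, (k, n) in enumerate(zip(called, attempted))
            if binom_lower_tail(k, n, p0) < cutoff]

# Toy data (hypothetical): rows are samples, columns are markers.
genotypes = [
    [0, 1, 2, None, 1],
    [1, 1, 2, None, 0],
    [None, None, None, None, 1],
    [2, 0, 1, None, 1],
    [0, 1, 1, 2, 2],
    [1, 2, 0, None, 0],
]
markers = [list(col) for col in zip(*genotypes)]

low_markers = flag_low_call_rate(markers)
low_samples = flag_low_call_rate(genotypes)
print(low_markers, low_samples)
```

In this toy set the fourth marker is flagged, while the third sample, although poorly called, does not pass the corrected cut-off with only five markers; the same function could be applied to assay batches.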
11Check input data with genotype data sets: Check
WITH genetic knowledge
- Examples
- In a data set of markers annotated in the regular X
region and samples with male phenotype, unlikely
genotypes identify mislabeling of the sex phenotype
and/or of the marker annotation.
- A sample pair with far more similar genotypes
than the pair population identifies DNA contamination
(or genetic kinship, depending on the
resemblance).
- "Unlikeliness" and "far more similar" have to be
judged in the multiple testing context.
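The two examples above can be sketched with toy data. This is only an illustration under stated assumptions: genotypes are coded 0/1/2 with 1 meaning heterozygous, the heterozygosity cut-off of 0.1 is an arbitrary placeholder, and the sample names, data and function names are hypothetical. A real check would judge both quantities against their population distributions with multiple testing in mind.

```python
from itertools import combinations

def x_heterozygosity(x_genotypes):
    """Fraction of heterozygous calls among called X-region genotypes."""
    called = [g for g in x_genotypes if g is not None]
    return sum(g == 1 for g in called) / len(called)

def concordance(a, b):
    """Fraction of identical genotype calls between two samples."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return sum(x == y for x, y in pairs) / len(pairs)

# Hypothetical X-region genotypes with recorded sex.
x_data = {
    "S1": ("male", [0, 2, 0, 2, 2, 0]),
    "S2": ("male", [0, 1, 1, 2, 1, 0]),
    "S3": ("female", [1, 1, 0, 2, 1, 0]),
}
# Males should be (almost) never heterozygous in the regular X region.
suspect_males = [name for name, (sex, g) in x_data.items()
                 if sex == "male" and x_heterozygosity(g) > 0.1]

# Hypothetical autosomal genotypes for the same samples.
autosomal = {
    "S1": [0, 1, 2, 1, 0, 2, 1, 1],
    "S2": [0, 1, 2, 1, 0, 2, 1, 1],
    "S3": [2, 1, 0, 0, 1, 2, 0, 1],
}
pair_scores = {pair: concordance(autosomal[pair[0]], autosomal[pair[1]])
               for pair in combinations(autosomal, 2)}
most_similar = max(pair_scores, key=pair_scores.get)
print(suspect_males, most_similar, pair_scores[most_similar])
```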
12The assumption is corrected by the difference between
the random distribution under the unbiased condition and
the random distribution under the biased condition.
Deviated records are important for this correction.
Do not throw away data records unless a specific
causative mistake is reasonably certain.
When discarding is appropriate, throw records away
so that the discard won't disturb the beauty of the
distribution.
13Test
- Multivariate study, but...
- Multiple univariate tests with appropriate (or
appropriately attempted) corrections
- Or partially multivariate-ize?
14Single marker-single phenotype test
Shuffle these connections under the assumption of
independence
(1) Independence test between phenotype (P1, P2) and allele (A1, A2)
(2) Test of the difference in frequency of A1 between P1 and P2
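Shuffling the connections under independence amounts to a permutation test. The sketch below is one simple way to do it, assuming each subject is recorded as a count of A1 alleles (0, 1 or 2) and a phenotype label "P1" or "P2"; the function names, the number of permutations and the toy data are hypothetical choices.

```python
import random

def a1_freq_diff(a1_counts, phenotypes):
    """Absolute difference in A1 allele frequency between P1 and P2."""
    def freq(label):
        counts = [c for c, p in zip(a1_counts, phenotypes) if p == label]
        return sum(counts) / (2 * len(counts))  # two alleles per subject
    return abs(freq("P1") - freq("P2"))

def permutation_p(a1_counts, phenotypes, n_perm=2000, seed=0):
    """Permutation P-value: shuffle the phenotype labels over subjects."""
    rng = random.Random(seed)
    observed = a1_freq_diff(a1_counts, phenotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if a1_freq_diff(a1_counts, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # count the observed data once

labels = ["P1"] * 10 + ["P2"] * 10
p_assoc = permutation_p([2] * 10 + [0] * 10, labels)  # strong difference
p_null = permutation_p([1] * 20, labels)              # no difference at all
print(p_assoc, p_null)
```

Only the phenotype labels are shuffled, so each subject keeps its own two alleles together; this is the individual-level shuffle rather than a shuffle of single alleles.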
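15Haplotype and Diplotype
[Diagram: columns labeled Chromosome, Allele (haplotype), Individual, Genotype (diplotype), Inheritance mode, and Phenotype or population. Nodes: chromosomes c1-1 to c5-2, individuals S1 to S5, alleles A1 and A2, genotypes G1 to G3, inheritance-mode nodes D1, D2, R1, R2, and phenotypes P1 and P2. One set of connections is marked "Shuffle here!".]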
16Dominant trait test
2x3 contingency table test
Allele test
Recessive trait test
17Four types of test based on the 2x3 table
18Methods to test
- Contingency table test
- Asymptotic distribution tests
- Pearson's chi-square
- Chi-square statistics for trend
- incl. Cochran-Armitage trend statistics
- Exact probability tests corresponding to the above
asymptotic tests
- Individual data-based test
- Logistic regression test
- Likelihood ratio test
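The asymptotic tests listed above can be computed directly from a 2x3 genotype table. The following is a minimal sketch in plain Python, assuming genotype columns ordered A1A1, A1A2, A2A2, additive weights 0, 1, 2 for the trend test, and closed-form chi-square tail probabilities for 1 and 2 degrees of freedom; the counts are invented for illustration.

```python
from math import erfc, exp, sqrt

def pearson_chi2(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    return sum((obs - row_tot[i] * col_tot[j] / total) ** 2
               / (row_tot[i] * col_tot[j] / total)
               for i, row in enumerate(table) for j, obs in enumerate(row))

def trend_chi2(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 df)."""
    n = [r + s for r, s in zip(cases, controls)]
    N, R = sum(n), sum(cases)
    S = N - R
    A = sum(w * r for w, r in zip(weights, cases))
    W = sum(w * m for w, m in zip(weights, n))
    W2 = sum(w * w * m for w, m in zip(weights, n))
    return N * (N * A - R * W) ** 2 / (R * S * (N * W2 - W * W))

def p_chi2(x, df):
    """Upper-tail probability of chi-square for df = 1 or df = 2 only."""
    return erfc(sqrt(x / 2)) if df == 1 else exp(-x / 2)

cases, controls = [30, 50, 20], [20, 50, 30]   # hypothetical counts

genotype = pearson_chi2([cases, controls])                       # 2 df
trend = trend_chi2(cases, controls)                              # 1 df
dominant = pearson_chi2([[cases[0] + cases[1], cases[2]],
                         [controls[0] + controls[1], controls[2]]])
recessive = pearson_chi2([[cases[0], cases[1] + cases[2]],
                          [controls[0], controls[1] + controls[2]]])
print(genotype, p_chi2(genotype, 2))
print(trend, p_chi2(trend, 1))
print(dominant, recessive)
```

The dominant and recessive statistics come from collapsing two genotype columns of the same 2x3 table, so all four tests start from what was actually observed: genotypes of individuals.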
19What was observed and which table should be used?
Use tests that are based on the 2x3 table or on
individual records.
2x2 tables arithmetically calculated from the 2x3
table are not exact.
20- Population structure
- Multiple testing: test independence
21Sampling from a structured population
Even sampling
Biased sampling
22Genomic control method
- Variance of statistics inflates with population
structure.
23P-value
[Plot: P-value against markers]
Many significant results when samples are biased
with population structure.
24$\chi^2_{GC}$: $\chi^2$ rescaled by a factor $\lambda$ to fit the concave line onto the $y = x$ line
GC corrects the inflation but does not
incorporate structure information to increase
power. $\lambda$ is a good index to describe the degree of
structure.
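The genomic control rescaling can be sketched as follows. This assumes the commonly used convention in which the inflation factor is the ratio of the observed median statistic to the median of a 1-df chi-square distribution (about 0.455), and each statistic is divided by that factor; the input values are made up so that the factor comes out at 1.5.

```python
from statistics import median

MEDIAN_CHI2_DF1 = 0.4549364  # median of the chi-square distribution, 1 df

def genomic_control(chi2_values):
    """Return (inflation factor, corrected statistics)."""
    lam = median(chi2_values) / MEDIAN_CHI2_DF1
    lam = max(lam, 1.0)  # common practice: never inflate the statistics
    return lam, [x / lam for x in chi2_values]

# Hypothetical 1-df statistics whose median is 1.5 times the expected one.
observed = [0.15, 0.45, 1.5 * MEDIAN_CHI2_DF1, 3.0, 7.5]
lam, corrected = genomic_control(observed)
print(lam, corrected)
```

Every marker is divided by the same factor, which is why the method removes inflation without using any information about which samples belong to which subpopulation.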
25Eigenstrat
- Identify eigenvectors that represent the SNPs and
discriminate samples, and utilize the eigenvectors
to test association between phenotype and markers
28Up to here: routines for GWA. From here: would
be optional for GWA, particularly for 1st-stage
screening
29List of optional analyses for now
- Multiple testing correction
- Evaluation of coverage of the genome (HapMap SNPs)
with the genotyped panel
- Phenotype-segment association with
haplotype-based association tests
- Staged design and power-definition of GWA
- Epistatic investigation
- Multi-phenotype stratification
30Multiple testing: test independence
Fraction(P1 < 0.1 or P2 < 0.1)
[Plots of P1 against P2; fractions shown: 137/1000, 190/1000, 78/1000]
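How the fraction of pairs with P1 < 0.1 or P2 < 0.1 depends on the relation between two tests can be explored by simulation. The sketch below is not the simulation behind the figures above; it assumes a deliberately crude dependence model, labeled `rho`, in which the second P-value copies the first with probability `rho` and is drawn independently otherwise.

```python
import random

def fraction_any_below(n_pairs, rho, cut=0.1, seed=1):
    """Fraction of (P1, P2) pairs with P1 < cut or P2 < cut under the null.

    rho = 0 gives independent tests, rho = 1 gives identical tests.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        p1 = rng.random()
        p2 = p1 if rng.random() < rho else rng.random()
        if p1 < cut or p2 < cut:
            hits += 1
    return hits / n_pairs

independent = fraction_any_below(100_000, rho=0.0)  # theory: 1 - 0.9**2 = 0.19
dependent = fraction_any_below(100_000, rho=1.0)    # theory: 0.10
print(independent, dependent)
```

Under independence the expected fraction is 1 - 0.9 x 0.9 = 0.19, and it shrinks toward 0.10 as the two tests become more strongly positively dependent.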
31Example: cumulative probability of the minimal
P value in Monte Carlo permutation in a GWA
[Plot on a log scale]
32Coverage of genome
- Commercial scan panels select tagging SNPs based
on HapMap data so that every SNP is represented by
a surrogate SNP with LD above a threshold.
- Based on the observed scan genotype data, the real
coverage can be re-calculated.
- Less covered regions might be tested with
SNP combinations (haplotypes) to even out the
density.
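Re-calculating coverage can be illustrated with a small sketch. It assumes phased haplotypes coded 0/1 per SNP (real scan data are unphased genotypes, so this is a simplification), uses r-squared as the LD measure, and takes 0.8 as an illustrative threshold; the function names and toy haplotypes are hypothetical.

```python
def r_squared(haps, i, j):
    """LD r^2 between SNP i and SNP j from phased 0/1 haplotypes."""
    n = len(haps)
    p_i = sum(h[i] for h in haps) / n
    p_j = sum(h[j] for h in haps) / n
    p_ij = sum(h[i] * h[j] for h in haps) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # monomorphic SNP: r^2 is undefined, treated as 0 here
    d = p_ij - p_i * p_j
    return d * d / denom

def coverage(haps, panel, threshold=0.8):
    """Fraction of SNPs that are on the panel or tagged by a panel SNP."""
    n_snps = len(haps[0])
    covered = 0
    for s in range(n_snps):
        best = max(1.0 if s == t else r_squared(haps, s, t) for t in panel)
        if best >= threshold:
            covered += 1
    return covered / n_snps

# Toy haplotypes over 4 SNPs: SNP 1 copies SNP 0, SNP 3 is its complement,
# and SNP 2 is independent of SNP 0.
haps = [(0, 0, 0, 1), (0, 0, 1, 1), (1, 1, 0, 0), (1, 1, 1, 0)]
print(coverage(haps, panel=[0]), coverage(haps, panel=[0, 2]))
```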
33Haplotype-based association tests / Epistatic
evaluation
- I would say there is no gold standard for these in
the context of a GWA scan.
34Multi-phenotype stratification
35- Many tools are publicly available and useful.
- They might not do everything we want to do, but
they do at least a part of it.
- Let them do what they can do.
- Example
- plink, Haploview, Eigenstrat
36Single marker-single phenotype test: Haplotype
and Diplotype
[Diagram: chromosomes c1-1 to c5-2, individuals S1 to S5, alleles A1 and A2, genotypes G1 to G3, inheritance-mode nodes D1, D2, R1, R2, and phenotypes P1 and P2, with their connections.]
37Multiple testing: test dependence
Fraction(P1 < 0.1 or P2 < 0.1)
- FWER correction: Bonferroni
- Test dependence: allelic association
[Plots of P1 against P2; fractions shown: 137/1000, 173/1000, 78/1000]
38Input data (1)
- Check study design
- Check input data
- Run analyzing applications
- Interpret the outputs
- What to do next?
- How to survive with endless requests along with
our own research projects?
- Subjects
- Random samples except for phenotypes of
interest? -> NO!
- Genetic randomness and population structure are
to be checked.
- Self-identified ethnicity and sampling location
can be used to interpret the genetic structure.
39Input data (2)
- Subjects
- Random samples except for phenotypes of
interest? -> NO!
- Genetic randomness is to be checked.
40- http://func-gen.hgc.jp/lecture/menu.htm
42- False Discovery Rate (FDR)
44P-values under the null hypothesis
- P < 0.01 occurs with probability 0.01
- P < 0.05 occurs with probability 0.05
- P < 0.5 occurs with probability 0.5
- P < 0.05 and 0.05 < P < 0.1 each occur with probability 0.05
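45When 100 independent tests are performed....
P-P plot of p values: observed p against expected p
When the p values are sorted in ascending order, the expected value of the i-th smallest p is $i/(100+1)$.
The expected minimal P is $1/101$.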
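The expected positions on the P-P plot for 100 independent tests, i/(100+1) for the i-th smallest p value, can be checked against simulated null p values. This is a small sketch with uniform random numbers standing in for null p values; the replicate count and seed are arbitrary choices.

```python
import random

def expected_sorted_p(n_tests):
    """Expected value of the i-th smallest of n uniform p values."""
    return [i / (n_tests + 1) for i in range(1, n_tests + 1)]

def simulated_sorted_p(n_tests, n_rep=2000, seed=1):
    """Average of the sorted p values over n_rep simulated null studies."""
    rng = random.Random(seed)
    sums = [0.0] * n_tests
    for _ in range(n_rep):
        ps = sorted(rng.random() for _ in range(n_tests))
        for i, p in enumerate(ps):
            sums[i] += p
    return [s / n_rep for s in sums]

expected = expected_sorted_p(100)
simulated = simulated_sorted_p(100)
print(expected[0], simulated[0])    # minimal p: expectation is 1/101
print(expected[49], simulated[49])  # 50th smallest p: expectation is 50/101
```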
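46Multiple testing correction
- Bonferroni correction
- When k (independent) tests are performed
- $p_c = p_n \times k$
- $p_c$: corrected p
- $p_n$: nominal p
- Family-wise error rate
- When k (independent) tests are performed, the probability that at least one nominal p falls below q is
- $1-(1-q)^k \approx qk$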
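The family-wise error rate 1-(1-q)^k and the correction pc = pn x k can be written out in a few lines. This is a minimal sketch; capping the corrected value at 1 is a common convention added here, not something stated on the slide.

```python
def fwer(q, k):
    """Probability that at least one of k independent null tests has p < q."""
    return 1 - (1 - q) ** k

def bonferroni(p_nominal, k):
    """Corrected p value p_c = p_n x k, capped at 1 (cap is an added convention)."""
    return min(1.0, p_nominal * k)

two_tests = fwer(0.05, 2)         # 1 - 0.95 * 0.95
hundred_tests = fwer(0.05, 100)   # nearly certain to see one "hit"
scan = fwer(1e-7, 500_000)        # close to q * k = 0.05
print(two_tests, hundred_tests, scan)
print(bonferroni(1e-7, 500_000), bonferroni(0.02, 100))
```

The approximation by qk is good only while qk is small: for two tests at 0.05 it gives 0.1 against the exact 0.0975, while for 100 tests at 0.05 it would give 5, which is not a probability.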
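47P values of two independent tests
[Diagram: unit square with the P value of test 1 and of test 2 on the axes, each divided at 0.05, giving regions A, B, C, D]
- D: $0.05 \times 0.05 = 0.0025$
- B, C: $0.05 - D = 0.0475$ each
- A: $1-B-C-D = 0.95 \times 0.95 = 1-0.0975 = 0.9025$
- Probability that at least one test gives P < 0.05: $B+C+D = 0.0975$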
48FWER correction for 100 tests
$p_c = p_n \times k$
49- 7 diseases
- 2000 cases x 7 diseases, 3000 controls
- 500,000 SNPs
- 2x3 table
- Mantel-Haenszel
- $5 \times 10^{-7}$: 24
52[Plot with axes x and y]
2-df test (DF2) and 1-df test (DF1; Armitage, trend chi-square)
53- 2x3 table
54SNPs: Allelic association
55[Plot with axes x (H1), y (H2), z (H3)]
56Number of haplotypes: $N_h = 2^{N_s}$
60- D1 vs. Cont; D2 vs. Cont; (D1+D2) vs. Cont
- D1 vs. Cont1; D2 vs. Cont2; (D1+D2) vs. (Cont1+Cont2)
- (Mantel-Haenszel, DerSimonian-Laird)