Statistical Genetics 6 GWAS Data QC - PowerPoint PPT Presentation

Provided by: genomeMe

1
Statistical Genetics 6: GWAS Data QC
  • Graduate School of Medicine
  • Kyoto University
  • 2008/09/17-25
  • IMS-UT
  • Ryo Yamada

2
Single marker-single phenotype test
The relation between a phenotype and a marker is tested through genotype
data, which are an attribute shared by both subjects and markers. This test
depends on the randomness of subjects except for the phenotype.
3
GWA study
Each individual test assumes that subjects are random except for the
phenotype. Statistics based on this assumption are corrected using the
difference between the random distribution under the unbiased condition and
the random distribution under the biased condition.
Tests (markers) are not independent of each other. Statistics affected by
this dependency are corrected by assuming that the dependency is even
throughout the tests.
4
GWA study
  • Biased samples
      • Population structure
          • Structure in the sampling population overall
          • Biased sampling from a structured population
  • Dependency among tests
      • Allelic association
          • Linkage disequilibrium
          • Allelic association due to population structure
      • Dependency among tests that share markers and/or
        subjects

5
The assumption is corrected using the difference between the random
distribution under the unbiased condition and the random distribution under
the biased condition.
Deviated records are important for this correction.
Do not throw away data records unless a specific causative mistake is
reasonably certain. When discarding is appropriate, discard records so that
the discard won't disturb the beauty of the distribution.
6
Steps
  • Check study design
  • Check input data
  • Run analyzing applications
  • Interpret the outputs
  • How to survive with endless requests along with
    our own research projects?

7
Design
  • List subjects
  • List phenotypes
  • List markers
  • Is the study design simple or not?
  • Dependency among tests that share markers and/or
    subjects.

8

[Flow diagram with the labels: phenotype data for population genetics (sex,
self-identified ethnicity, birth place, ...); phenotype data of interest
(disease); location data of markers and genes; methods; input conditions;
annotation; data processing; markers; subjects; marker-specific materials for
assays; DNA samples; other materials, machines and assay conditions; assay
experiments; genotype data; descriptive statistics; test.]
9
Check input data with genotype data sets
  • Targets of the data check are
      • NOT the data records themselves
      • BUT the items used when they were produced:
          • Conditions of phenotype data recording
          • Method of annotation
          • DNAs, marker-specific and non-specific reagents,
            and other assay conditions

10
Check input data with genotype data sets: check
WITHOUT genetic knowledge
  • Examples
      • Markers with a successful call rate far lower than that of the
        marker population should have a marker-specific cause.
      • Samples with a call rate far lower than that of the sample
        population should have a sample-specific cause.
      • An assay batch with a call rate far lower than that of the batch
        population should have a batch-specific cause.
  • "Far lower" has to be judged in the multiple-testing context.
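A minimal sketch of how "far lower" could be judged with a multiple-testing correction. It assumes a one-sided exact binomial test of each unit (marker, sample, or batch) against the pooled call rate, with a Bonferroni threshold over all units checked. The function names, the data, and the choice of test are illustrative assumptions, not something the slides specify.

```python
from math import comb

def binom_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def flag_low_call_rate(called, attempted, alpha=0.05):
    """Indices of units whose number of successful calls is far lower
    than the pooled call rate predicts (Bonferroni-corrected)."""
    pooled = sum(called) / sum(attempted)
    threshold = alpha / len(called)  # Bonferroni over all units checked
    return [i for i, (k, n) in enumerate(zip(called, attempted))
            if binom_lower_tail(k, n, pooled) < threshold]

# Hypothetical data: successful calls out of 200 samples for 6 markers.
called = [198, 199, 197, 200, 150, 198]
attempted = [200] * 6
low = flag_low_call_rate(called, attempted)
print(low)  # [4]
```

The same function can be applied per sample or per assay batch by passing the corresponding counts.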

11
Check input data with genotype data sets: check
WITH genetic knowledge
  • Examples
      • In a data set of markers annotated to the regular X region and
        samples with male phenotype, unlikely genotypes identify
        mislabeling of the sex phenotype and/or of the marker annotation.
      • A sample pair with genotypes far more similar than those of the
        pair population identifies DNA contamination (or genetic kinship,
        depending on the degree of resemblance).
  • "Unlikeliness" and "far more similar" have to be judged in the
    multiple-testing context.
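As an illustration of the X-region example, the sketch below flags samples reported as male that show heterozygous genotypes at X-linked markers. The 0/1/2 genotype coding, the sample IDs, and the cutoff `max_het` are hypothetical; in practice the cutoff would be chosen from the observed distribution and the number of samples checked.

```python
def x_het_rate(genotypes):
    """Fraction of called genotypes that are heterozygous.
    Coding: 0, 1, 2 = copies of one allele; None = no call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    return sum(1 for g in called if g == 1) / len(called)

def suspicious_males(samples, max_het=0.05):
    """IDs of samples reported male whose X heterozygosity exceeds max_het."""
    return [sid for sid, (sex, geno) in samples.items()
            if sex == "M" and x_het_rate(geno) > max_het]

# Hypothetical X-linked genotypes for three samples.
samples = {
    "s1": ("M", [0, 2, 2, 0, None, 2, 0, 0]),
    "s2": ("F", [0, 1, 2, 1, 1, 2, 0, 1]),
    "s3": ("M", [0, 1, 2, 1, 0, 1, 1, 2]),
}
flagged = suspicious_males(samples)
print(flagged)  # ['s3']
```

A flagged sample points to a mislabeled sex phenotype; a marker that is heterozygous in many males points instead to a wrong marker annotation.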

13
Test
  • Multivariate study, but...
  • Multiple univariate tests with appropriate (or
    appropriately attempted) corrections
  • Or partially "multivariate-ize"?

14
Single marker-single phenotype test
Shuffle these connections under the assumption of independence.
(1) Independence test: {P1, P2} vs. {A1, A2}
(2) Test of the difference in frequency of A1 between P1 and P2
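The shuffling idea can be sketched as a permutation test: phenotype labels are shuffled across subjects while the genotypes stay fixed, and the observed difference in A1 frequency between P1 and P2 is compared with the shuffled differences. The function names, the toy data, and the number of permutations are assumptions made for illustration.

```python
import random

def allele_freq_diff(phenotypes, a1_counts):
    """Difference in A1 allele frequency between groups P1 and P2.
    a1_counts holds the number of A1 alleles (0, 1, 2) per subject."""
    def freq(group):
        counts = [a for p, a in zip(phenotypes, a1_counts) if p == group]
        return sum(counts) / (2 * len(counts))
    return freq("P1") - freq("P2")

def permutation_p(phenotypes, a1_counts, n_perm=2000, seed=0):
    """Two-sided permutation p-value under independence of
    phenotype and genotype."""
    rng = random.Random(seed)
    observed = abs(allele_freq_diff(phenotypes, a1_counts))
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the phenotype-genotype connection
        if abs(allele_freq_diff(shuffled, a1_counts)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: 20 subjects per phenotype, perfectly separated.
phenotypes = ["P1"] * 20 + ["P2"] * 20
a1_counts = [2] * 20 + [0] * 20
p_value = permutation_p(phenotypes, a1_counts)
print(p_value)
```

Adding 1 to both numerator and denominator keeps the estimated p-value away from exactly zero, which is a common convention for Monte Carlo p-values.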
15
Haplotype and Diplotype
[Diagram. Labels: Phenotype or population; Individual; Chromosome; Allele
(haplotype); Genotype (diplotype); Inheritance mode. Nodes: P1, P2; S1 to S5;
c1-1 to c5-2; A1, A2; G1 to G3; D1, D2, R1, R2. Annotation: "Shuffle here!"]
16
Dominant trait test
2x3 contingency table test
Allele test
Recessive trait test
17
Four types of test based on the 2x3 table
18
Methods to test
  • Contingency table tests
      • Asymptotic distribution tests
          • Pearson's chi-square
          • Chi-square statistics for trend,
            incl. the Cochran-Armitage trend statistic
      • Exact probability tests corresponding to the above
        asymptotic tests
  • Individual data-based tests
      • Logistic regression test
      • Likelihood ratio test
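Two of the asymptotic tests listed above can be sketched for a 2x3 table (rows: cases and controls; columns: the three genotypes). This is a minimal illustration using only the standard library. The additive scores (0, 1, 2) for the trend test and the example counts are assumptions.

```python
from math import erfc, exp, sqrt

def pearson_chi2_2x3(table):
    """Pearson's chi-square (2 df) for a 2x3 table; returns (stat, p)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    stat = sum((table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
               for i in range(2) for j in range(3))
    return stat, exp(-stat / 2)  # chi-square survival function for 2 df

def trend_chi2_2x3(table, scores=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 df); returns (stat, p)."""
    cases, controls = table
    col = [a + b for a, b in zip(cases, controls)]
    n, r = sum(col), sum(cases)
    sa = sum(s * a for s, a in zip(scores, cases))
    sc = sum(s * c for s, c in zip(scores, col))
    s2c = sum(s * s * c for s, c in zip(scores, col))
    stat = n * (n * sa - r * sc) ** 2 / (r * (n - r) * (n * s2c - sc ** 2))
    return stat, erfc(sqrt(stat / 2))  # chi-square survival function for 1 df

# Hypothetical genotype counts: cases in the first row, controls in the second.
table = [[10, 20, 30], [30, 20, 10]]
print(pearson_chi2_2x3(table))
print(trend_chi2_2x3(table))
```

In this example the deviation from independence is perfectly linear in the genotype score, so both statistics are equal; the trend test spends only 1 degree of freedom on it and therefore gives the smaller p-value.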

19
What was observed, and which table should be used?
Use tests that are based on the 2x3 table or on individual records.
2x2 tables arithmetically calculated from the 2x3 table are not exact.
20
  • Population structure
  • Multiple testing: test independence

21
Sampling from a structured population
Even sampling
Biased sampling
22
Genomic control method
  • Variance of statistics inflates with population
    structure.

23
[Plot: P-value for each marker.]
Many significant results appear when samples are biased with population
structure.
24
χ²(GC) = χ²/λ, so that the curved line fits the line y = x.
GC corrects the inflation but does not incorporate structure information to
increase power. λ is a good index to describe the degree of structure.
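A minimal sketch of the genomic control calculation for 1-df chi-square statistics. It assumes the usual median-based estimate of the inflation factor λ and the common convention of not letting λ fall below 1; the constant name and the example statistics are illustrative.

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-square distribution with 1 df

def genomic_control(chi2_stats):
    """Return (lambda, corrected statistics) for 1-df chi-square statistics."""
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # convention assumed here: never inflate the statistics
    return lam, [x / lam for x in chi2_stats]

# Hypothetical statistics whose median is twice the expected median.
stats = [0.1, 2 * CHI2_1DF_MEDIAN, 3.0]
lam, corrected = genomic_control(stats)
print(lam, corrected)
```

In a real scan the list would hold one statistic per marker, and λ close to 1 would indicate little inflation.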
25
Eigenstrat
  • Identify eigenvectors that represent the SNPs and discriminate the
    samples, and use those eigenvectors when testing association between
    phenotype and markers.

28
Up to here: routines for GWA.
From here: would be optional for GWA, particularly for 1st-stage screening.
29
List of optional analyses for now
  • Multiple testing correction
  • Evaluation of coverage of the genome (HapMap SNPs)
    with the genotyped panel
  • Phenotype-segment association with
    haplotype-based association tests
  • Staged design and power-definition of GWA
  • Epistatic investigation
  • Multi-phenotype stratification

30
Multiple testing: test independence
Fraction(P1 < 0.1 or P2 < 0.1)
[Three panels plotting P1 against P2; fractions shown: 137/1000, 190/1000,
78/1000.]
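For two independent tests, the fraction with P1 < 0.1 or P2 < 0.1 is expected to be 1 - (1 - 0.1)^2 = 0.19, which corresponds to the 190/1000 shown on this slide. The short sketch below computes this family-wise error rate for k independent tests, together with a Bonferroni-corrected p-value; the function names are illustrative.

```python
def fwer_independent(q, k):
    """Probability that at least one of k independent tests has p < q
    when all null hypotheses are true."""
    return 1 - (1 - q) ** k

def bonferroni(p_nominal, k):
    """Bonferroni-corrected p-value for k tests, capped at 1."""
    return min(1.0, p_nominal * k)

print(round(fwer_independent(0.1, 2), 4))   # 0.19
print(round(fwer_independent(0.05, 2), 4))  # 0.0975
print(bonferroni(0.01, 100))                # 1.0
```

When tests are dependent, for example through allelic association, the observed fraction departs from this value.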
31
Example: cumulative probability distribution of the minimal P-value in
Monte Carlo permutation in a GWA (log scale)
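A minimal sketch of how the null distribution of the minimal P-value could be obtained by Monte Carlo permutation. It assumes 0/1/2 genotype coding, a 1-df trend test per marker, and small simulated data; these choices and all names are illustrative assumptions rather than the procedure used for the slide.

```python
import random
from math import erfc, sqrt

def trend_p(case_flags, genotypes):
    """Asymptotic 1-df trend-test p-value for one marker (0/1/2 coding)."""
    n, r = len(genotypes), sum(case_flags)
    sa = sum(g for g, c in zip(genotypes, case_flags) if c)
    sc = sum(genotypes)
    s2c = sum(g * g for g in genotypes)
    denom = r * (n - r) * (n * s2c - sc * sc)
    if denom == 0:
        return 1.0  # monomorphic marker or a single phenotype group
    stat = n * (n * sa - r * sc) ** 2 / denom
    return erfc(sqrt(stat / 2))

def min_p_null(case_flags, markers, n_perm=500, seed=0):
    """Sorted minimal p-values over all markers, one per permutation."""
    rng = random.Random(seed)
    flags = list(case_flags)
    null = []
    for _ in range(n_perm):
        rng.shuffle(flags)  # permute phenotypes, keep genotypes intact
        null.append(min(trend_p(flags, m) for m in markers))
    return sorted(null)

def adjusted_p(p_observed, null_min_p):
    """Fraction of permutations whose minimal p is <= the observed p."""
    hits = sum(1 for x in null_min_p if x <= p_observed)
    return (hits + 1) / (len(null_min_p) + 1)

# Hypothetical data: 40 subjects (20 cases), 5 simulated markers.
rng = random.Random(1)
case_flags = [1] * 20 + [0] * 20
markers = [[rng.choice([0, 1, 2]) for _ in range(40)] for _ in range(5)]
null = min_p_null(case_flags, markers)
print(adjusted_p(0.01, null))
```

Because whole phenotype vectors are permuted, the dependency among markers is preserved in the null distribution, which is the reason for using permutation instead of assuming independent tests.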
32
Coverage of genome
  • Commercial scan panels select tagging SNPs based on HapMap data so
    that every SNP is tagged by a panel SNP with LD above a threshold.
  • Based on the observed scan genotype data, the real
    coverage can be recalculated.
  • Less-covered regions might be tested with
    SNP combinations (haplotypes) to even out the coverage density.

33
Haplotype-based association tests / Epistatic evaluation
  • I would say there is no gold standard for these in the
    context of a GWA scan.

34
Multi-phenotype stratification
  • Mantel-Haenszel test
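A minimal sketch of the Mantel-Haenszel test for a set of 2x2 tables, one per stratum (for example, one per level of a second phenotype). It returns the chi-square statistic without continuity correction, its asymptotic 1-df p-value, and the common odds ratio estimate. The stratum layout and the example counts are assumptions.

```python
from math import erfc, sqrt

def mantel_haenszel(strata):
    """strata: list of 2x2 tables ((a, b), (c, d)), one per stratum.
    Returns (chi-square, p-value, common odds ratio)."""
    diff = var = num = den = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        diff += a - (a + b) * (a + c) / n              # observed - expected
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
        num += a * d / n
        den += b * c / n
    stat = diff * diff / var
    return stat, erfc(sqrt(stat / 2)), num / den

# Hypothetical counts: rows are cases/controls, columns are allele A1/A2.
strata = [((20, 10), (10, 20)), ((20, 10), (10, 20))]
stat, p, odds_ratio = mantel_haenszel(strata)
print(stat, p, odds_ratio)
```

Each stratum contributes its own expected count, so differences in case/control ratio between strata do not create a spurious association.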

35
  • Many tools are publicly available and useful.
  • They might not do everything we want, but they do at
    least part of it.
  • Let them do what they can do.
  • Examples
      • PLINK, Haploview, Eigenstrat

36
Single marker-single phenotype test: Haplotype and Diplotype
[Diagram with nodes P1, P2; S1 to S5; c1-1 to c5-2; A1, A2; G1 to G3;
D1, D2, R1, R2.]
37
Multiple testing: test dependence
Fraction(P1 < 0.1 or P2 < 0.1)
FWER correction: Bonferroni
Test dependence: allelic association
[Panels plotting P1 against P2; fractions shown: 137/1000, 173/1000,
78/1000.]
38
Input data (1)
  • Subjects
      • Random samples except for the phenotypes of
        interest? -> NO!
      • Genetic randomness and population structure are
        to be checked.
      • Self-identified ethnicity and sampling location
        can be used to interpret the genetic structure.

39
Input data (2)
  • Subjects
      • Random samples except for the phenotypes of
        interest? -> NO!
      • Genetic randomness is to be checked.

40
  • http://func-gen.hgc.jp/lecture/menu.htm


42
  • False Discovery Rate (FDR)


44
  • P < 0.01 occurs with probability 0.01
  • P < 0.05 occurs with probability 0.05
  • P < 0.5 occurs with probability 0.5
  • P < 0.05 and 0.05 < P < 0.1 each occur with probability 0.05

45
When 100 independent tests are performed....
P-P plot of p value
The expected value of the i-th smallest p is i/(100+1).
The expected minimal P is 1/101.
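For n independent tests under the null hypothesis, the i-th smallest p-value has expectation i/(n+1). The sketch below pairs sorted observed p-values with these expected values, which are the points a P-P plot would display; the function names are illustrative.

```python
def expected_p(n):
    """Expected i-th smallest p-value among n independent null tests."""
    return [i / (n + 1) for i in range(1, n + 1)]

def pp_points(p_values):
    """(expected, observed) pairs for a P-P plot."""
    observed = sorted(p_values)
    return list(zip(expected_p(len(observed)), observed))

exp100 = expected_p(100)
print(exp100[0], exp100[-1])  # 1/101 and 100/101
```

Points falling well below the diagonal at the small-p end indicate more small p-values than independence under the null would produce.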
46
  • Bonferroni correction
      • When k tests are performed:
      • pc = pn x k
      • pc: corrected p
      • pn: nominal p
  • Family-wise error rate
      • When k independent tests are performed with nominal p threshold q:
      • 1 - (1 - q)^k ≈ qk

47
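P-values of two tests
[Diagram: unit square with test 1 and test 2 on the axes, divided at 0.05
into regions A, B, C and D.]
  • D = 0.05 x 0.05 = 0.0025
  • B = C = 0.05 - D = 0.0475
  • A = 1 - B - C - D = 0.95 x 0.95 = 1 - 0.0975 = 0.9025
  • Probability that at least one test gives P < 0.05: B + C + D = 0.0975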
48
  • FWER: 1 - (1 - q)^k
  • pc = pn x k
49
  • [...] 7 [...]
  • 2000 [...] x 7 [...], 3000 [...]
  • 500,000 SNPs
  • 2x3 [...]
  • Mantel-Haenszel [...]
  • [...] 12 [...]
  • 5 x 10^(-7) [...] 24 [...]


52
[Plot with axes x and y. Labels: 2 degrees of freedom (DF = 2); 1 degree of
freedom (DF = 1); (Armitage, trend chi-square).]

54
Allelic association (SNPs)

55
[Plot axes: x (H1), y (H2), z (H3).]
59
  • [...] 7 [...]
  • 2000 [...] x 7 [...], 3000 [...]
  • 500,000 SNPs
  • 2x3 [...]
  • Mantel-Haenszel [...] DerSimonian-Laird [...]
  • [...] 12 [...]
  • 5 x 10^(-7) [...] 24 [...]

60
D1 vs. Cont; D2 vs. Cont; (D1 + D2) vs. Cont
D1 vs. Cont1; D2 vs. Cont2; (D1 + D2) vs. (Cont1 + Cont2)
(Mantel-Haenszel, DerSimonian-Laird)