Title: Linkage Disequilibrium Based SNP Genotype Calling from Short Sequencing Reads
1. Linkage Disequilibrium Based SNP Genotype Calling from Short Sequencing Reads
Ion Mandoiu
Computer Science and Engineering Department
University of Connecticut
Joint work with S. Dinakar, J. Duitama, Y. Hernández, J. Kennedy, and Y. Wu
2. Ultra-High Throughput Sequencing
- Recent massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to classic Sanger sequencing
- Roche/454 FLX Titanium: 400bp reads, 400Mb/10h run
- ABI SOLiD 2.0: 25-35bp reads, 3-4Gb/6 day run
- Helicos HeliScope: 25-55bp reads, >1Gb/day
- Illumina Genome Analyzer II: 35-50bp reads, 1.5Gb/2.5 day run
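To put the per-run figures in perspective, the following rough sketch converts them into instrument days needed for a given average coverage. The 3 Gb genome size, the use of the low end of each quoted range, and the function name are assumptions made for illustration; the estimate ignores unmappable reads, failed runs, and whole-run rounding.

```python
# Rough estimate of instrument time needed to reach a target average coverage.
GENOME_SIZE_GB = 3.0  # assumption: approximate haploid human genome size

# (platform, gigabases per run, days per run), low end of the quoted ranges
PLATFORMS = [
    ("Roche/454 FLX Titanium", 0.4, 10 / 24),
    ("ABI SOLiD 2.0", 3.0, 6.0),
    ("Helicos HeliScope", 1.0, 1.0),
    ("Illumina Genome Analyzer II", 1.5, 2.5),
]

def days_for_coverage(gb_per_run, days_per_run, coverage, genome_gb=GENOME_SIZE_GB):
    """Instrument days needed to generate coverage * genome_gb gigabases."""
    runs_needed = coverage * genome_gb / gb_per_run  # fractional runs allowed
    return runs_needed * days_per_run

for name, gb_per_run, days_per_run in PLATFORMS:
    days = days_for_coverage(gb_per_run, days_per_run, coverage=7.5)
    print(f"{name}: about {days:.1f} instrument days for 7.5x")
```

The numbers are only as good as the assumptions: raw throughput is an upper bound on usable coverage.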
3. Personal Genomes: The Future is Now!
4. Challenges for Genomic Medicine at Single-Base Resolution
- Medical sequencing focuses on genetic variation (SNPs, CNVs, genome rearrangements)
- Requires accurate determination of both alleles at variable loci
- This is limited by coverage depth due to the random nature of shotgun sequencing
- For the Venter and Watson genomes (both sequenced at 7.5x average coverage), comparison with SNP genotyping chips has shown only 75% accuracy for sequencing-based calls of heterozygous SNPs [Levy et al. 07, Wheeler et al. 08]
- [Wendl & Wilson 08] predict that 21x coverage is required for sequencing of normal tissue samples, based on idealized theory that neglects any heuristic inputs
- What heuristic inputs help?
- How much can we gain from improved data analysis?
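The effect of coverage depth on heterozygous calls can be illustrated with a simple Poisson model. This is a minimal sketch under stated assumptions, not the Wendl and Wilson theory: depth at a locus is taken to be Poisson with mean equal to the average coverage, each read comes from either chromosome with probability 1/2, and sequencing errors are ignored. The function names and the `min_reads` threshold are hypothetical.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(k + 1))

def prob_both_alleles_seen(coverage, min_reads=1):
    """P(each allele of a heterozygous site is covered by >= min_reads reads).

    With Poisson(coverage) depth and reads split evenly between the two
    chromosomes, the per-allele depths are independent Poisson(coverage / 2).
    """
    p_one_allele = 1.0 - poisson_cdf(min_reads - 1, coverage / 2.0)
    return p_one_allele ** 2

for c in (5, 7.5, 10, 21):
    p = prob_both_alleles_seen(c, min_reads=2)
    print(f"coverage {c}x: P(both alleles seen at least twice) = {p:.3f}")
```

Even in this idealized setting, a noticeable fraction of heterozygous sites lack enough reads from one allele at low coverage, which is the gap that additional information sources are meant to close.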
5. Linkage Disequilibrium: Sources & Modeling
HMM model of haplotype frequencies
- F_i = founder haplotype at locus i, H_i = observed allele at locus i
- P(F_i), P(F_i | F_{i-1}) and P(H_i | F_i) estimated from a reference panel such as Hapmap
- For a given haplotype h with n SNPs, P(H = h | M) can be computed in O(nK^2) using the forward algorithm, where K = #founders
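The forward computation of P(H = h | M) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the array layout (`init`, `trans`, `emit`), the 0-based locus indexing, and the randomly generated toy model are assumptions, and a real implementation would rescale the forward variables or work in log space to avoid underflow for large n.

```python
import itertools
import random

def haplotype_probability(h, init, trans, emit):
    """P(H = h | M) via the forward algorithm, in O(n K^2) time.

    h     : sequence of n observed alleles, each 0 or 1
    init  : init[k]        = P(F_0 = k)
    trans : trans[i][j][k] = P(F_{i+1} = k | F_i = j), for i = 0 .. n-2
    emit  : emit[i][k][a]  = P(H_i = a | F_i = k)
    """
    K = len(init)
    # Forward variables at the first locus
    alpha = [init[k] * emit[0][k][h[0]] for k in range(K)]
    for i in range(1, len(h)):
        # K target founders, each summing over K source founders: O(K^2) per locus
        alpha = [
            sum(alpha[j] * trans[i - 1][j][k] for j in range(K)) * emit[i][k][h[i]]
            for k in range(K)
        ]
    return sum(alpha)

def random_distribution(size, rng):
    """Hypothetical helper: a random probability vector of the given size."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

# Toy model with made-up parameters (n = 4 SNPs, K = 3 founders)
rng = random.Random(0)
n, K = 4, 3
init = random_distribution(K, rng)
trans = [[random_distribution(K, rng) for _ in range(K)] for _ in range(n - 1)]
emit = [[random_distribution(2, rng) for _ in range(K)] for _ in range(n)]

# Sanity check: the probabilities of all 2^n haplotypes should sum to 1
total = sum(
    haplotype_probability(h, init, trans, emit)
    for h in itertools.product((0, 1), repeat=n)
)
print(f"sum over all haplotypes = {total:.6f}")
```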
6. Pipeline for LD-Based Genotype Calling
Read sequences
Reference genome sequence
>gnl|ti|1779718824 name:EI1W3PE02ILQXT
TCAGTGAGGGT
TTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAAT
TTTGCTCTT GTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCAC
TGCAACCTCTGCCTCCAGGTTCAAGCAATT CTCTGCCTCAGCCTCCCAA
GTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTG
T TAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCC
TGACCTCAAATGAC
>gnl|ti|1779718825 name:EI1W3PE02GTXK0
TCAGAATACCTGTTGCCCATTTTTATATGT
TCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTT TAATATGTT
TATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCA
CAACACCCGGC AGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAA
GAGAGAGAATAAGCACTTAAAAGGCGGGTCCA GGGGGCCCGAGCATCGG
AGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTT
ACA
>gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assembly
GAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCC
ATAAATACCATCAGAAATAACAAAGGGAG CTTTGAAGTATTCTGAGACT
TGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTG
GATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTA
ATTGAAATTCCTGATGCAAGT AATACAGATGGATTCAGGAGAGGTACTT
CCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCC CTCCTAA
TTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAG
CGAAGAGGAATAT TTCTGAGATAATAAATAGGACTGTCCCATATTGGAG
GCCTTTTTGAACAGTTGTTGTATGGTGACCCTGA AATGTACTTTCTCAG
ATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAG
CTAAG TCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAG
CTAAACTCCCTAGTCAACTGGTTTGA ATCTACTTCTCCAGCAGCTGGGG
GAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC
Quality scores
>gnl|ti|1779718824 name:EI1W3PE02ILQXT
28 28 28
28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21
7 27 42 35 21 6 28 43 36 22 10 27 42 35 20 6 28
43 36 22 9 28 43 36 22 9 28 44 36 24 14 4 28
28 28 27 28 26 26 35 26 40 34 18 3 28 28 28 27
33 24 26 28 28 28 40 33 14 28 36 27 26 26 37 29
28 28 28 28 27 28 28 28 37 28 27 27 28 36 28
37 28 28 28 27 28 28 28 24 28 28 27 28 28 37 29
36 27 27 28 27 28 33 23 28 33 23 28 36 27 33 23
28 35 25 28 28 36 27 36 27 28 28 28 24 28 37 29
28 19 28 26 37 29 26 39 33 13 37 28 28 28 21 24
28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15
SNP genotype calls
Hapmap genotypes
rs12095710 T T 9.988139e-01
rs12127179 C T 9.986735e-01
rs11800791 G G 9.977713e-01
rs11578310 G G 9.980062e-01
rs1287622 G G 8.644588e-01
rs11804808 C C 9.977779e-01
rs17471528 A G 5.236099e-01
rs11804835 C C 9.977759e-01
rs11804836 C C 9.977925e-01
rs1287623 G G 9.646510e-01
rs13374307 G G 9.989084e-01
rs12122008 G G 5.121655e-01
rs17431341 A C 5.290652e-01
rs881635 G G 9.978737e-01
rs9700130 A A 9.989940e-01
rs11121600 A A 6.160199e-01
rs12121542 A A 5.555713e-01
rs11121605 T T 8.387705e-01
rs12563779 G G 9.982776e-01
rs11121607 C G 5.639239e-01
rs11121608 G T 5.452936e-01
rs12029742 G G 9.973527e-01
rs562118 C C 9.738776e-01
rs12133533 A C 9.956655e-01
rs11121648 G G 9.077355e-01
rs9662691 C C 9.988648e-01
rs11805141 C C 9.928786e-01
rs1287635 C C 6.113270e-01
90 209342
16 F 0 0 2110001?0100210010011002122201210211?1221220212000
18 F 15 16 21100012010021001001100?100201?10111110111?0212000
15 M 0 0 21120010012001201001120010110101011111011110212000
7 M 0 0 2110001001000200122110001111011100111?121210222000
8 F 0 0 011202100120022012211200101101210211122111?0120000
12 F 9 10 21100010010002001221100010110111001112121210220000
9 M 0 0 011?001?012002201221120010?10121021112211110120000
11 M 7 8 21100210010002001221100012110111001112121210222000
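As a small illustration of how the read and quality-score inputs above can feed a genotype call at a single SNP, the sketch below converts Phred quality scores to error probabilities (Q = -10 log10 p) and computes genotype posteriors from the bases covering one locus. This is a generic single-locus model under stated assumptions, not the pipeline's actual method: errors are assumed uniform over the three other bases, the prior is uniform unless supplied, and the function names are hypothetical. In an LD-based approach the uniform prior would be replaced by information from the haplotype model.

```python
import math
from itertools import combinations_with_replacement

def phred_to_error(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def genotype_posteriors(reads, alleles=("A", "C"), prior=None):
    """Posterior over unordered genotypes at one SNP.

    reads   : list of (base, phred_quality) pairs covering the locus
    alleles : the two alleles of the SNP
    prior   : optional dict genotype -> prior probability (uniform if omitted)
    """
    genotypes = list(combinations_with_replacement(alleles, 2))
    if prior is None:
        prior = {g: 1.0 / len(genotypes) for g in genotypes}
    log_post = {}
    for g in genotypes:
        total = math.log(prior[g])
        for base, q in reads:
            # Cap the error rate at 0.75 (a random guess) so likelihoods stay positive
            e = min(phred_to_error(q), 0.75)
            # The read comes from either chromosome with probability 1/2
            p = sum(0.5 * ((1.0 - e) if base == a else e / 3.0) for a in g)
            total += math.log(p)
        log_post[g] = total
    # Normalize in log space for numerical stability
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {g: math.exp(v - top) / norm for g, v in log_post.items()}

# Hypothetical reads at one locus: three A calls and one C call
example_reads = [("A", 28), ("A", 28), ("C", 34), ("A", 14)]
posteriors = genotype_posteriors(example_reads)
for genotype, p in posteriors.items():
    print(genotype, f"{p:.4f}")
```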
7. Genotype Calling Accuracy vs. Coverage
[Plots: Watson/454 reads; NA18507/Illumina reads]
8. Conclusions & Ongoing Work
- Exploiting LD information yields significant improvements in genotype calling accuracy and/or cost reduction
- The accuracy achieved by the previously proposed binomial test is matched by the HMM-based posterior decoding algorithm using less than 1/4 of the reads
- Ongoing work
  - Modeling ambiguities in read mapping
  - Haplotype inference
  - Extension to population sequencing data (removing need for reference panels)
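The slides do not spell out the binomial test used as the baseline. For orientation only, here is one generic form such a test can take, which may differ from the one used in the comparison: a site is called heterozygous when the number of minor-allele reads is too large to be explained by sequencing errors alone. The error rate, significance level, and function names are illustrative assumptions.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

def call_genotype(major_count, minor_count, error_rate=0.01, alpha=0.01):
    """Single-locus call from allele read counts.

    Null hypothesis: the site is homozygous for the major allele and every
    minor-allele read is a sequencing error occurring at `error_rate`.
    """
    n = major_count + minor_count
    if n == 0:
        return "no call"
    p_value = binomial_tail(minor_count, n, error_rate)
    return "het" if p_value < alpha else "hom"

print(call_genotype(7, 1))  # one discordant read out of eight
print(call_genotype(5, 3))  # three discordant reads out of eight
```

A test of this kind looks at each locus in isolation, so its power depends entirely on local read depth; that is the limitation LD information is meant to offset.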
ACKNOWLEDGEMENTS
This work was supported in part by NSF under awards IIS-0546457 and DBI-0543365 to IM and IIS-0803440 to YW. SD and YH performed this research as part of the Summer REU program "Bio-Grid Initiatives for Interdisciplinary Research and Education" funded by NSF under award CCF-0755373.