Title: Canadian Bioinformatics Workshops
1Canadian Bioinformatics Workshops
3Module 3: Genomic Variation Discovery
- MICHAEL STRÖMBERG
- Informatics on High Throughput Sequencing Data
- July 2009
4Topics
- Introduction
- Interpreting raw data
- Read alignment (MOSAIK)
- SNP Discovery (GigaBayes)
- Visualization (Gambit)
- 1000 Genomes Project
5Genetic Variations Why?
Phenotypic differences
Inherited diseases
Ancestral history
6Genetic Variations SNPs INDELs
7Structural Variations
Paul Medvedev, review in prep, July 2009
8Epigenetic Variations ChIPSeq
Anjali Shah (AB), Nature Methods, April 2009
9Interpreting raw data
10Basecalling Intro
- How do we translate the machine readouts to base calls?
- How do we estimate and represent sequencing errors (base quality values)?
11What is a base quality?
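The slide text does not spell out a formula. Base qualities are conventionally reported on the Phred scale, Q = -10 log10(P_error), and the sketch below assumes that convention. The function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled base quality Q into an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) into a Phred-scaled quality."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# Each step of 10 in Q corresponds to a tenfold drop in the error probability.
for q in (10, 20, 30, 40):
    print(f"Q{q}: P(error) = {phred_to_error_prob(q):g}")
```

Under this convention a Q20 base is expected to be wrong once in 100 calls, and a Q30 base once in 1000.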
12Calculating Error Rates
Figure: reads aligned to a reference. Labels: ref, read; substitution, insertion, deletion; paralogous alignment, local misalignment, polymorphic test data
atggatagtataacgtcaggctaaactgtagtatatggataaaatgaccatacgattaca
atggattagtataacgtcaggctaaactgtagtatatggataaaatgaccatacgattaca
tggattagtataacgtcagc
tatatggctaaaatgaccata
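The error categories named on this slide (substitution, insertion, deletion) can be tallied from a gapped pairwise alignment. The sketch below is a minimal illustration, not the presenter's method: it assumes the reference and read are already aligned and padded with "-" to equal length, and it defines the error rate as errors per alignment column. The example sequences are made up.

```python
def count_errors(ref_aln: str, read_aln: str) -> dict:
    """Tally matches and the three error types in a gapped pairwise alignment."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned sequences must have equal length")
    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-" and read_base == "-":
            continue  # column with no information
        if ref_base == "-":
            counts["insertion"] += 1      # extra base in the read
        elif read_base == "-":
            counts["deletion"] += 1       # reference base missing from the read
        elif ref_base == read_base:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts


def error_rate(counts: dict) -> float:
    """Errors per alignment column (an assumed definition)."""
    errors = counts["substitution"] + counts["insertion"] + counts["deletion"]
    total = errors + counts["match"]
    return errors / total if total else 0.0


# Hypothetical alignment: one insertion, one substitution, one deletion.
example = count_errors("atggat-agtataacg", "atggattagtctaa-g")
print(example, error_rate(example))
```

Note that the other labels on the slide (paralogous alignment, local misalignment, polymorphic test data) describe differences that a tally like this would count as sequencing errors even though they are not.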
13Error Profile Roche 454
- error rate is low (< 0.5%)
- most errors are INDELs
Figure labels: substitutions, deletions, insertions
14Error Profile Illumina
15Error Profile Illumina (36 bp)
16Error Profile Illumina Variability
17Logistic Regression
Original
Recalibrated
Mark DePristo, Broad Institute, June 2009
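The slide names logistic regression as the recalibration technique but gives no details. As a rough sketch, the code below fits P(error) as a logistic function of the reported quality alone and converts the fitted probability back to a Phred-scaled value. The single covariate, the toy counts, and the plain gradient-ascent fit are all assumptions for illustration, not the method shown in the talk.

```python
import math

# Hypothetical observations: (reported quality, bases seen, mismatches seen).
TOY_BINS = [(10, 100, 20), (20, 100, 5), (30, 100, 1)]


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(bins, lr: float = 0.1, n_iter: int = 5000):
    """Fit P(error) = sigmoid(a + b * Q / 10) by gradient ascent on the log-likelihood."""
    a = b = 0.0
    total = sum(n for _, n, _ in bins)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for q, n, errors in bins:
            x = q / 10.0
            residual = errors - n * _sigmoid(a + b * x)  # observed minus expected errors
            grad_a += residual
            grad_b += residual * x
        a += lr * grad_a / total
        b += lr * grad_b / total
    return a, b


def recalibrated_quality(q: float, a: float, b: float) -> float:
    """Phred-scaled quality implied by the fitted error probability."""
    return -10 * math.log10(_sigmoid(a + b * q / 10.0))


a, b = fit_logistic(TOY_BINS)
for q, _, _ in TOY_BINS:
    print(f"reported Q{q} -> recalibrated Q{recalibrated_quality(q, a, b):.1f}")
```

A real recalibration would use more covariates (for example machine cycle and sequence context) and far more data.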
18Read Alignment
19Crash Course Reference-guided Assembly
20Crash Course Reference-guided Assembly
21Crash Course Reference-guided Assembly
22Sequencing Technologies
future
24Pipeline Snapshot
25How Does It Work?
26How Does It Work?
28Functionality Platform Specific
29Enabling INDEL Discovery
30Combining Read Technologies
Capillary
454 FLX
454 GS20
Illumina
31Paired-End Reads
Jarvie & Harkins (454), Nature Methods, May 2008
32Resolving Paired-End Reads
33MosaikCoverage
34MosaikText
- 1 12 807910 807945 O_5_1_907_1935 1 36 + 0
- ACCCTTGAAAAATGTTCGTTGACTCTAAATGAAATA
- ACCCTTGAAAAATGTTCGTTGACTCTAAATGAAATA
- 2 7 1019133 1019168 O_5_1_853_1522 1 36 - 0
- ATCGAAAGCCCGCATCATTTTGATCTGCATCCTCAC
- ATCGAAAGCCCGCATCATTTTGATCTGCATCCTCAC
- 3 4 952257 952292 O_5_1_742_688 1 36 - 0
- TGGATCTCTCTTGAATACGTACAATGATACTGTTAT
- TGGATCTCTCTTGAATACGTACAATGATACTGTTAT
- 4 8 176976 177011 O_5_1_1892_1827 1 36 - 0
- TGTGCGTTCTTTGCGGATATGGAAAATCTTGATATC
- TGTGCGTTCTTTGCGGATATGGAAAATCTTGATATC
- 5 5 516470 516505 O_5_1_753_575 1 36 + 0
- GAATGACACAATATCATTAGTGGTCCCTCAGTTATA
- GAATGACACAATATCATTAGTGGTCCCTCAGTTATA
- 42 O_5_1_1132_922 GAAATCTCATCTCAAGGAGAAGGAAACAGCAGATCC U0
- 43 O_5_1_499_472 GCAATTATTATAGCTTTGTCCGATTGTTCTCTCCCT U1
- 44 O_5_1_1161_922 GTTTATGATTTATCTGGTACAAGTCAGGCTGTTGTC U0
- 45 O_5_1_848_673 ACTAATTCATTCGTTTACGTCTCAAATGATTAATAA U0
- 46 O_5_1_887_756 AATATAACGGCCAGGTATATCATTGGATCTCCTTCA U0
- 47 O_5_1_987_943 GATATATACAGTGTTCTTGCCGACATAACGGCTTAG U0
- 48 O_5_1_1131_902 GTGATAAAAGAATGTAGGATTATTTATAAGTCTGTA U0
- 49 O_5_1_785_360 ATATGGATGAATATAAATACAAGGACAAAAAACGTG U0
- 50 O_5_1_908_742 AAATATATATCAGAATTCACATTAGACAGGGCACTG U0
- 51 O_5_1_813_721 ATCTTCGATAATAGCAGCCTCAATTTCAGCGGTAGA U0
- 52 O_5_1_1671_688 GGTTTTCAAAGGCAATTTTTGAGCAATATGGGTTTC U0
- 53 O_5_1_912_527 GATGGAGAAAGCTGCCTATAACTTTATGGTAAGGAG U0
- 54 O_5_1_224_721 GAAGTACAAAATGTTTTCAGCATGTTCTTTCATAAC U0
- 55 O_5_1_847_99 AATAGATGCGCCATCTCCGAGAAAAAGTCTAGACAA U1
Current formats: bed, eland, axt, sam/bam (more added upon request)
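The first listing on the MosaikText slide looks like axt-style output: a header line followed by the reference bases and the read bases. A minimal parser, assuming the standard axt field order (record number, reference name, reference start and end, read name, read start and end, strand, score), might look like the following. `AxtRecord` and `parse_axt` are hypothetical names, not part of MOSAIK.

```python
from typing import Iterable, List, NamedTuple


class AxtRecord(NamedTuple):
    index: int
    ref_name: str
    ref_start: int
    ref_end: int
    read_name: str
    read_start: int
    read_end: int
    strand: str
    score: int
    ref_bases: str
    read_bases: str


def parse_axt(lines: Iterable[str]) -> List[AxtRecord]:
    """Parse three-line axt-style records, skipping blank lines."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected three lines per record")
    records = []
    for i in range(0, len(rows), 3):
        f = rows[i].split()
        if len(f) != 9:
            raise ValueError(f"malformed header line: {rows[i]!r}")
        records.append(AxtRecord(int(f[0]), f[1], int(f[2]), int(f[3]), f[4],
                                 int(f[5]), int(f[6]), f[7], int(f[8]),
                                 rows[i + 1], rows[i + 2]))
    return records


# Two records copied from the slide (list markers removed).
EXAMPLE = """\
2 7 1019133 1019168 O_5_1_853_1522 1 36 - 0
ATCGAAAGCCCGCATCATTTTGATCTGCATCCTCAC
ATCGAAAGCCCGCATCATTTTGATCTGCATCCTCAC
3 4 952257 952292 O_5_1_742_688 1 36 - 0
TGGATCTCTCTTGAATACGTACAATGATACTGTTAT
TGGATCTCTCTTGAATACGTACAATGATACTGTTAT
"""

records = parse_axt(EXAMPLE.splitlines())
for rec in records:
    print(rec.read_name, rec.ref_name, rec.ref_start, rec.ref_end, rec.strand)
```

Both example reads are 36 bp long and span 36 reference positions, which is consistent with ungapped alignments.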
35The Love of Ambiguity
36Aligners Feature Set
37Accuracy Classification
38Accuracy Unique Read Alignment
39Alignment Qualities
actual alignment quality
information content (bits)
mismatch BQs / total BQs
40SNP Discovery
41Genetic Variations SNPs INDELs
42SNP Discovery Goal
SNP
sequencing errors
43SNP Discovery Base Qualities
High quality
Low quality
44SNPs Bayesian Statistics
base quality
# of individuals
allele call in read
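To make the ingredients on this slide concrete (allele calls in reads, base qualities), here is a minimal single-sample sketch of a Bayesian genotype posterior at one biallelic site. It is not GigaBayes's implementation: the error model (a miscalled base is equally likely to be any of the three other bases), the prior values, and the function names are assumptions, and the multi-individual aspect is left out.

```python
from typing import Dict, List, Tuple


def genotype_posteriors(observations: List[Tuple[str, int]], ref: str, alt: str,
                        het_prior: float = 0.001) -> Dict[str, float]:
    """Posterior over diploid genotypes given (base call, Phred quality) pairs."""
    genotypes = {"ref/ref": (ref, ref), "ref/alt": (ref, alt), "alt/alt": (alt, alt)}
    # Assumed priors: heterozygotes at het_prior, alt homozygotes at half that.
    priors = {"ref/ref": 1 - 1.5 * het_prior, "ref/alt": het_prior,
              "alt/alt": het_prior / 2}
    unnormalised = {}
    for name, (a1, a2) in genotypes.items():
        likelihood = 1.0
        for base, qual in observations:
            err = 10 ** (-qual / 10)                 # Phred quality to error probability
            p1 = 1 - err if base == a1 else err / 3  # read drawn from the first allele
            p2 = 1 - err if base == a2 else err / 3  # read drawn from the second allele
            likelihood *= 0.5 * p1 + 0.5 * p2
        unnormalised[name] = priors[name] * likelihood
    total = sum(unnormalised.values())
    return {name: value / total for name, value in unnormalised.items()}


def snp_probability(posteriors: Dict[str, float]) -> float:
    """P(SNP) taken as the probability that the genotype is not ref/ref."""
    return 1 - posteriors["ref/ref"]


# Four reads supporting A and four supporting C, all at Q30.
mixed = genotype_posteriors([("A", 30)] * 4 + [("C", 30)] * 4, ref="A", alt="C")
print(max(mixed, key=mixed.get), snp_probability(mixed))
```

For deep coverage the product of likelihoods should be accumulated in log space to avoid underflow; the plain product is kept here for readability.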
45SNP Discovery
haploid
diploid
AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
strain 1
individual 1
AACGTTCGCATA AACGTTCGCATA
strain 2
AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
individual 2
strain 3
AACGTTAGCATA AACGTTAGCATA
individual 3
46Genotyping Consensus Generation
haploid
diploid
AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
strain 1 A
individual 1 A/C
strain 2 C
AACGTTCGCATA AACGTTCGCATA
AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
individual 2 C/C
AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
strain 3 A
individual 3 A/A
AACGTTAGCATA AACGTTAGCATA
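The consensus calls on this slide (for example A for a haploid strain, A/C for a diploid individual) can be reproduced with a naive count-based caller. The sketch below is only an illustration: it ignores base qualities, and the `min_fraction` cutoff and the function name are assumptions.

```python
from collections import Counter
from typing import List, Optional


def call_consensus(reads: List[str], pos: int, ploidy: int,
                   min_fraction: float = 0.2) -> Optional[str]:
    """Call the allele (haploid) or genotype (diploid) at a 0-based column."""
    counts = Counter(read[pos] for read in reads if pos < len(read))
    if not counts:
        return None
    if ploidy == 1:
        return counts.most_common(1)[0][0]  # single most frequent base
    total = sum(counts.values())
    # Keep up to two alleles that each reach the minimum read fraction.
    alleles = sorted(base for base, n in counts.most_common(2)
                     if n / total >= min_fraction)
    if not alleles:
        return None
    if len(alleles) == 1:
        alleles *= 2                        # homozygous call
    return "/".join(alleles)


# Reads as on the slide; the variable site is column 6 (0-based).
individual_1 = ["AACGTTAGCATA"] * 2 + ["AACGTTCGCATA"] * 2
individual_2 = ["AACGTTCGCATA"] * 4
strain_1 = ["AACGTTAGCATA"] * 3

print(call_consensus(individual_1, 6, ploidy=2))
print(call_consensus(individual_2, 6, ploidy=2))
print(call_consensus(strain_1, 6, ploidy=1))
```

A quality-aware caller would weight each read by its base quality instead of counting reads equally.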
47Handling Trios
- Take advantage of duplicate data
- De novo mutation rate
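One way to use the redundancy in a trio is to check each child genotype for Mendelian consistency with the parents: an inconsistent site is either a candidate de novo mutation or a genotyping error. The sketch below assumes unphased diploid genotypes written as "A/C"; the function name is hypothetical.

```python
from itertools import product


def is_mendelian_consistent(child: str, mother: str, father: str) -> bool:
    """True if the child could inherit one allele from each parent."""
    child_alleles = sorted(child.split("/"))
    return any(sorted((m, f)) == child_alleles
               for m, f in product(mother.split("/"), father.split("/")))


# A/A x C/C parents can only produce an A/C child.
print(is_mendelian_consistent("A/C", "A/A", "C/C"))
print(is_mendelian_consistent("C/C", "A/A", "A/C"))
```

The fraction of inconsistent sites across a trio gives a rough upper bound on the combined de novo mutation and genotyping error rate.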
48QC Coverage
Auton & Hernandez, Cornell University, June 2009
49QC Inter-SNP Distance
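The slide shows only the metric's name. As a rough illustration, inter-SNP distances can be computed from the sorted SNP positions on one chromosome, and SNPs that sit closer together than a chosen cutoff can be flagged for inspection. The cutoff of 10 bp and the function names are assumptions.

```python
from typing import List


def inter_snp_distances(positions: List[int]) -> List[int]:
    """Distances between consecutive SNP positions on one chromosome."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def flag_clustered(positions: List[int], min_distance: int = 10) -> List[int]:
    """Positions that lie within min_distance of a neighbouring SNP."""
    ordered = sorted(positions)
    flagged = set()
    for a, b in zip(ordered, ordered[1:]):
        if b - a < min_distance:
            flagged.update((a, b))
    return sorted(flagged)


snps = [400, 100, 105]
print(inter_snp_distances(snps))
print(flag_clustered(snps))
```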
50QC Hardy-Weinberg Violations
Auton & Hernandez, Cornell University, June 2009
HapMap sites in red, other sites in blue. CEU, P(seg) > 0.5, coverage 2-5x
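A common way to detect Hardy-Weinberg violations at a biallelic site is a chi-square test of the observed genotype counts against the counts expected from the allele frequencies. The sketch below assumes that test; the slide does not say which statistic was used, and the genotype counts are made up.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic (1 degree of freedom) for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Hypothetical counts: the first site matches expectation, the second has no heterozygotes.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(50, 0, 50))
```

With one degree of freedom, a statistic above about 3.84 corresponds to p < 0.05. At 2-5x coverage heterozygotes are easily miscalled as homozygotes, so a heterozygote deficit is not necessarily biological.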
51QC Other metrics
- P(SNP)
- Determining the optimal P(SNP) threshold
- Transitions:transversions
- Adjusting filters so that the ratio approaches 2
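The transition:transversion ratio of a call set can be computed directly from the reference and alternate alleles. Transitions are purine-to-purine (A/G) or pyrimidine-to-pyrimidine (C/T) changes; everything else is a transversion. The function names below are illustrative.

```python
from typing import Iterable, Tuple

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    snps = list(snps)
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = sum(1 for ref, alt in snps
             if ref.upper() != alt.upper() and not is_transition(ref, alt))
    return ti / tv if tv else float("inf")


# Two transitions and one transversion.
calls = [("A", "G"), ("C", "T"), ("A", "C")]
print(ti_tv_ratio(calls))
```

Recomputing the ratio after each filter change shows whether the filtered call set is moving toward the target of about 2 mentioned above.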
52Visualisation
53Visualization Consed
54Visualization Gambit
- Data validation
- Hypothesis generation
- Software development aid
- BAM support
- Firefox-like plugins
551000 Genomes Project
561000G Goals
- Discover genetic variations
- 1% minor allele frequencies across genome
- 0.1-0.5% MAF across gene regions
- Variant alleles
- Estimate frequencies
- Identify haplotype background
- Characterize linkage disequilibrium
571000G Pilot Projects
- Pilot 1
- Low coverage
- 180 samples
- 70 samples @ 4X
- 110 samples @ 2X
- 2.7 Tbp total
- 202 Gbp 454
- 1.8 Tbp Illumina
- 640 Gbp AB SOLiD
- Pilot 2
- Deep trios (CEU, YRI)
- 6 samples
- 1.1 Tbp total
- 87 Gbp 454
- 773 Gbp Illumina
- 270 Gbp AB SOLiD
- Pilot 3
- Exon capture
- 607 samples
- 2.2 Mbp of targets
- 8800 targets
- 10-20x coverage
58Pilot 2 Chr1 SNP Concordance
59Pilot 2 Chr1 SNP Concordance
60Pilot 2 INDEL Validation
1.0 = 100% in one category; 4.0 = 100% in all categories
61What have we learned?
- Garbage In, Garbage Out
- SNP calls depend on the alignments
- Alignments depend on the base calls
- Base calls depend on accurate interpretation of machine readouts
- Choose the right tools
- Population genetics seems to be the ultimate
quality control for SNP calls
62The Usual Suspects
L to R: Jiantao, Tony, Michele, Chip, Amit, Wen Fung, Deniz, Michael, Maddy, Gábor
Derek