Provided by: michaels53
Title: Canadian Bioinformatics Workshops


1
Canadian Bioinformatics Workshops
  • www.bioinformatics.ca

2
(No Transcript)
3
Module 3: Genomic Variation Discovery
  • MICHAEL STRÖMBERG
  • Informatics on High Throughput Sequencing Data
  • July 2009

4
Topics
  • Introduction
  • Interpreting raw data
  • Read alignment (MOSAIK)
  • SNP Discovery (GigaBayes)
  • Visualization (Gambit)
  • 1000 Genomes Project

5
Genetic Variations: Why?
Phenotypic differences
Inherited diseases
Ancestral history
6
Genetic Variations: SNPs & INDELs
7
Structural Variations
Paul Medvedev, review in prep, July 2009
8
Epigenetic Variations: ChIP-Seq
Anjali Shah (AB), Nature Methods, April 2009
9
Interpreting raw data
10
Basecalling: Intro
  • How do we translate the machine readouts into base calls?
  • How do we estimate and represent sequencing errors (base quality values)?

11
What is a base quality?
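A minimal sketch of how a base quality relates to an error probability, assuming the conventional Phred scale (Q = -10 log10 P). The slide transcript does not spell out the scale, so treat this as an assumption; the function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call is wrong, given its Phred-scaled quality."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred-scaled quality for an error probability p, with 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

On this scale every 10 quality units correspond to a tenfold drop in the estimated error probability.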
12
Calculating Error Rates
(Figure: reads aligned against a reference sequence, labeled ref and read, with substitution, insertion, and deletion errors marked. Additional labels: paralogous alignment, local misalignment, polymorphic test data.)
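The error categories on this slide can be tallied from a gapped pairwise alignment. The sketch below is a hypothetical illustration, not the workshop's method: it assumes the reference and read are already aligned, equal in length, and use "-" for gaps. The example strings are made up for the demonstration.

```python
def count_errors(ref_aln: str, read_aln: str) -> dict[str, int]:
    """Tally matches, substitutions, insertions, and deletions column by column."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned sequences must have equal length")
    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-" and read_base == "-":
            continue  # an empty column carries no information
        if ref_base == "-":
            counts["insertion"] += 1   # base present only in the read
        elif read_base == "-":
            counts["deletion"] += 1    # base present only in the reference
        elif ref_base == read_base:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts


def error_rate(counts: dict[str, int]) -> float:
    """Errors per aligned column (one of several possible denominators)."""
    total = sum(counts.values())
    errors = total - counts["match"]
    return errors / total if total else 0.0


if __name__ == "__main__":
    ref_aln = "atggat-agtataacg"
    read_aln = "atggattagta-aacc"
    counts = count_errors(ref_aln, read_aln)
    print(counts, error_rate(counts))
```

A tally like this attributes every difference to sequencing error. Paralogous alignments, local misalignments, and polymorphic test data would all inflate it, which is presumably why the slide lists them.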
13
Error Profile: Roche 454
  • error rate is low (< 0.5%)
  • most errors are INDELs

(Chart categories: substitutions, insertions, deletions)
14
Error Profile: Illumina
15
Error Profile: Illumina (36 bp)
16
Error Profile: Illumina Variability
17
Logistic Regression
Original
Recalibrated
Mark DePristo, Broad Institute, June 2009
18
Read Alignment
19
Crash Course: Reference-guided Assembly
20
Crash Course: Reference-guided Assembly
21
Crash Course: Reference-guided Assembly
22
Sequencing Technologies
future
23
(No Transcript)
24
Pipeline Snapshot
25
How Does It Work?
26
How Does It Work?
27
(No Transcript)
28
Functionality: Platform Specific
29
Enabling INDEL Discovery
30
Combining Read Technologies
Capillary
454 FLX
454 GS20
Illumina
31
Paired-End Reads
Jarvie & Harkins (454), Nature Methods, May 2008
32
Resolving Paired-End Reads
33
MosaikCoverage
34
MosaikText
  • 1 12 807910 807945 O_5_1_907_1935 1 36 + 0
  • ACCCTTGAAAAATGTTCGTTGACTCTAAATGAAATA
  • ACCCTTGAAAAATGTTCGTTGACTCTAAATGAAATA
  • 2 7 1019133 1019168 O_5_1_853_1522 1 36 - 0
  • ATCGAAAGCCCGCATCATTTTGATCTGCATCCTCAC
  • ATCGAAAGCCCGCATCATTTTGATCTGCATCCTCAC
  • 3 4 952257 952292 O_5_1_742_688 1 36 - 0
  • TGGATCTCTCTTGAATACGTACAATGATACTGTTAT
  • TGGATCTCTCTTGAATACGTACAATGATACTGTTAT
  • 4 8 176976 177011 O_5_1_1892_1827 1 36 - 0
  • TGTGCGTTCTTTGCGGATATGGAAAATCTTGATATC
  • TGTGCGTTCTTTGCGGATATGGAAAATCTTGATATC
  • 5 5 516470 516505 O_5_1_753_575 1 36 + 0
  • GAATGACACAATATCATTAGTGGTCCCTCAGTTATA
  • GAATGACACAATATCATTAGTGGTCCCTCAGTTATA

  • 42 O_5_1_1132_922 GAAATCTCATCTCAAGGAGAAGGAAACAGCAGATCC U0
  • 43 O_5_1_499_472 GCAATTATTATAGCTTTGTCCGATTGTTCTCTCCCT U1
  • 44 O_5_1_1161_922 GTTTATGATTTATCTGGTACAAGTCAGGCTGTTGTC U0
  • 45 O_5_1_848_673 ACTAATTCATTCGTTTACGTCTCAAATGATTAATAA U0
  • 46 O_5_1_887_756 AATATAACGGCCAGGTATATCATTGGATCTCCTTCA U0
  • 47 O_5_1_987_943 GATATATACAGTGTTCTTGCCGACATAACGGCTTAG U0
  • 48 O_5_1_1131_902 GTGATAAAAGAATGTAGGATTATTTATAAGTCTGTA U0
  • 49 O_5_1_785_360 ATATGGATGAATATAAATACAAGGACAAAAAACGTG U0
  • 50 O_5_1_908_742 AAATATATATCAGAATTCACATTAGACAGGGCACTG U0
  • 51 O_5_1_813_721 ATCTTCGATAATAGCAGCCTCAATTTCAGCGGTAGA U0
  • 52 O_5_1_1671_688 GGTTTTCAAAGGCAATTTTTGAGCAATATGGGTTTC U0
  • 53 O_5_1_912_527 GATGGAGAAAGCTGCCTATAACTTTATGGTAAGGAG U0
  • 54 O_5_1_224_721 GAAGTACAAAATGTTTTCAGCATGTTCTTTCATAAC U0
  • 55 O_5_1_847_99 AATAGATGCGCCATCTCCGAGAAAAAGTCTAGACAA U1

Current formats: bed, eland, axt, sam/bam (more added upon request)
35
The Love of Ambiguity
36
Aligners: Feature Set
37
Accuracy: Classification
38
Accuracy: Unique Read Alignment
39
Alignment Qualities
actual alignment quality
information content (bits)
mismatch BQs / total BQs
40
SNP Discovery
41
Genetic Variations: SNPs & INDELs
42
SNP Discovery: Goal
SNP
sequencing errors
43
SNP Discovery: Base Qualities
High quality
Low quality
44
SNPs: Bayesian Statistics
base quality
# of individuals
allele call in read
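To illustrate how base qualities and allele calls can feed a Bayesian genotype call, here is a minimal single-individual sketch. It is not GigaBayes' actual model: the biallelic diploid setup, the uniform miscall model (an erroneous call is equally likely to be any of the other three bases), and the `het_prior` value are all assumptions made for the example, and the number of individuals is ignored.

```python
import math


def genotype_posteriors(
    observations: list[tuple[str, float]],
    ref: str,
    alt: str,
    het_prior: float = 0.001,  # hypothetical prior; tune for real data
) -> dict[str, float]:
    """Posterior over diploid genotypes from (base call, Phred quality) pairs."""
    priors = {
        ref + ref: 1 - 1.5 * het_prior,
        ref + alt: het_prior,
        alt + alt: het_prior / 2,
    }
    log_post = {}
    for genotype, prior in priors.items():
        total = math.log(prior)
        for base, quality in observations:
            err = 10 ** (-quality / 10)
            # Each read samples one of the two alleles with probability 1/2.
            p = sum(0.5 * ((1 - err) if base == allele else err / 3)
                    for allele in genotype)
            total += math.log(p)
        log_post[genotype] = total
    # Normalize in log space to avoid underflow with many reads.
    peak = max(log_post.values())
    weights = {g: math.exp(v - peak) for g, v in log_post.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}


if __name__ == "__main__":
    reads = [("A", 30), ("A", 30), ("C", 30), ("C", 30)]
    print(genotype_posteriors(reads, ref="A", alt="C"))
```

With two A calls and two C calls at Q30, the heterozygous genotype dominates the posterior. With only A calls, the homozygous reference genotype does.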
45
SNP Discovery
haploid
strain 1: AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
strain 2: AACGTTCGCATA AACGTTCGCATA
strain 3: AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
diploid
individual 1: AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
individual 2: AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
individual 3: AACGTTAGCATA AACGTTAGCATA
46
Genotyping: Consensus Generation
haploid
strain 1 (A): AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
strain 2 (C): AACGTTCGCATA AACGTTCGCATA
strain 3 (A): AACGTTAGCATA AACGTTAGCATA AACGTTAGCATA
diploid
individual 1 (A/C): AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
individual 2 (C/C): AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
individual 3 (A/A): AACGTTAGCATA AACGTTAGCATA
47
Handling Trios
  • Take advantage of duplicate data
  • De novo mutation rate

48
QC: Coverage
Auton & Hernandez, Cornell University, June 2009
49
QC: Inter-SNP Distance
50
QC: Hardy-Weinberg Violations
Auton & Hernandez, Cornell University, June 2009
HapMap sites in red, other sites in blue. CEU, P(seg) > 0.5, coverage 2-5x
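One common way to flag Hardy-Weinberg violations at a biallelic site is a chi-square goodness-of-fit test on genotype counts. The sketch below assumes that approach; the transcript does not say which test was used for the plot, and the genotype counts are invented for illustration.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic comparing observed genotype counts with
    Hardy-Weinberg expectations (1 degree of freedom for a biallelic site)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotyped individual is required")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p                         # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Critical value of the chi-square distribution, 1 d.f., alpha = 0.05.
CHI2_CRITICAL_1DF_05 = 3.841

if __name__ == "__main__":
    for counts in [(25, 50, 25), (50, 0, 50)]:
        stat = hwe_chi_square(*counts)
        verdict = "violates" if stat > CHI2_CRITICAL_1DF_05 else "consistent with"
        print(counts, f"chi2={stat:.2f}", verdict, "HWE")
```

A site where every individual is called homozygous for one allele or the other, with no heterozygotes, produces a very large statistic. That pattern is more often a sign of alignment or calling artifacts than of real population structure.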
51
QC: Other metrics
  • P(SNP)
  • Determining the optimal P(SNP) threshold
  • Transitions:transversions
  • Adjusting filters so that the ratio approaches 2
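The transition:transversion ratio mentioned above can be computed directly from a list of SNP calls. This is a small sketch with made-up input; the (ref, alt) tuple representation is an assumption for the example, not a format used by the workshop tools.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def titv_ratio(snps: list[tuple[str, str]]) -> float:
    """Ratio of transitions to transversions among (ref, alt) SNP calls."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        return float("inf")
    return transitions / transversions


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
    print(titv_ratio(calls))
```

Random miscalls would yield about 0.5, since each base has one possible transition and two possible transversions. A call set whose ratio sits well below 2 is therefore likely contaminated with false positives.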

52
Visualization
53
Visualization: Consed
54
Visualization: Gambit
  • Data validation
  • Hypothesis generation
  • Software development aid
  • BAM support
  • Firefox-like plugins

55
1000 Genomes Project
56
1000G: Goals
  • Discover genetic variations
  • 1% minor allele frequencies across genome
  • 0.1-0.5% MAF across gene regions
  • Variant alleles
  • Estimate frequencies
  • Identify haplotype background
  • Characterize linkage disequilibrium

57
1000G: Pilot Projects
  • Pilot 1
  • Low coverage
  • 180 samples
  • 70 samples @ 4X
  • 110 samples @ 2X
  • 2.7 Tbp total
  • 202 Gbp 454
  • 1.8 Tbp Illumina
  • 640 Gbp AB SOLiD

  • Pilot 2
  • Deep trios (CEU, YRI)
  • 6 samples
  • 1.1 Tbp total
  • 87 Gbp 454
  • 773 Gbp Illumina
  • 270 Gbp AB SOLiD

  • Pilot 3
  • Exon capture
  • 607 samples
  • 2.2 Mbp of targets
  • 8800 targets
  • 10-20x coverage
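As a quick arithmetic check on the pilot figures, the per-platform yields can be summed and compared with the stated totals. The numbers below are copied from the slide; the helper names are illustrative.

```python
# Sequencing yields in Gbp, as listed on the slide.
PILOT_YIELDS_GBP = {
    "Pilot 1": {"454": 202, "Illumina": 1800, "AB SOLiD": 640},
    "Pilot 2": {"454": 87, "Illumina": 773, "AB SOLiD": 270},
}


def total_tbp(yields_gbp: dict[str, float]) -> float:
    """Sum per-platform yields (Gbp) and convert to Tbp."""
    return sum(yields_gbp.values()) / 1000


def mean_target_length_bp(total_target_bp: float, n_targets: int) -> float:
    """Average length of a capture target in base pairs."""
    return total_target_bp / n_targets


if __name__ == "__main__":
    for name, yields in PILOT_YIELDS_GBP.items():
        print(f"{name}: {total_tbp(yields):.3f} Tbp")
    print(f"Pilot 3 mean target: {mean_target_length_bp(2.2e6, 8800):.0f} bp")
```

The Pilot 1 platforms sum to 2.642 Tbp, which matches the stated 2.7 Tbp only within the rounding of the 1.8 Tbp Illumina figure. Pilot 2 sums to 1.13 Tbp, in line with the stated 1.1 Tbp. The Pilot 3 targets average 250 bp each.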
58
Pilot 2: Chr1 SNP Concordance
59
Pilot 2: Chr1 SNP Concordance
60
Pilot 2: INDEL Validation
1.0 = 100% in one category; 4.0 = 100% in all categories
61
What have we learned?
  • Garbage In, Garbage Out
  • SNP calls depend on the alignments
  • Alignments depend on the base calls
  • Base calls depend on accurate interpretation of
    machine readouts
  • Choose the right tools
  • Population genetics seems to be the ultimate
    quality control for SNP calls

62
The Usual Suspects
L to R: Jiantao, Tony, Michele, Chip, Amit, Wen Fung, Deniz, Michael, Maddy, Gábor
Derek