Genome Sequencing and Assembly High throughput Sequencing - PowerPoint PPT Presentation

1
Genome Sequencing and Assembly: High-Throughput Sequencing
  • Xiaole Shirley Liu
  • Jun Liu
  • STAT115, STAT215

2
Outline
  • Genome sequencing strategy
  • Clone-by-clone sequencing
  • Whole-genome shotgun sequencing
  • Hybrid method
  • Next generation sequencing technologies

3
Competing Sequencing Strategies
  • Clone-by-clone and whole-genome shotgun

4
Clone-by-Clone Shotgun Sequencing
  • E.g. Human genome project
  • Map construction
  • Clone selection
  • Subclone library construction
  • Random shotgun phase
  • Directed finishing phase and sequence
    authentication

5
Map Construction
  • Clone genomic DNA in YACs (1 Mb) or BACs (200 kb)
  • Map the relative location of clones
  • Sequence-tagged site (STS, e.g. EST) mapping
  • PCR or probe hybridization to screen STS
  • Restriction site fingerprint
  • Most time-consuming step
  • 1990-98 to generate physical maps for human
    http://www.ncbi.nlm.nih.gov/genemap99/

6
Resolve Clone Relative Location
  • Find a column permutation of the binary
    hybridization matrix such that all ones in each
    row are located in one contiguous block

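[Figure: STS 1-5 along the DNA, with clones 1-4 drawn as overlapping intervals]

Hybridization matrix as observed (rows: clones, columns: STS):

  Clone \ STS   3  5  1  4  2
      1         1  0  0  1  0
      2         0  0  1  0  1
      3         1  0  0  1  1
      4         1  1  0  1  0

After column permutation (rows: clones, columns: STS):

  Clone \ STS   1  2  3  4  5
      1         0  0  1  1  0
      2         1  1  0  0  0
      3         0  1  1  1  0
      4         0  0  1  1  1
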
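The column-permutation search on this slide can be sketched by brute force. This is only an illustration for the slide's 4-clone by 5-STS matrix: the function names are hypothetical, and trying every permutation is feasible only for a handful of STSs (larger maps need a dedicated consecutive-ones algorithm such as PQ-trees).

```python
from itertools import permutations

def ones_are_contiguous(row):
    """True if all 1s in the row sit in a single unbroken block."""
    idx = [i for i, v in enumerate(row) if v]
    return not idx or idx[-1] - idx[0] + 1 == len(idx)

def find_sts_order(matrix):
    """Return one column order that makes every row's 1s contiguous, or None.

    Brute force over all column permutations (n! orders), so this is
    only practical for very small matrices.
    """
    n_cols = len(matrix[0])
    for order in permutations(range(n_cols)):
        if all(ones_are_contiguous([row[j] for j in order]) for row in matrix):
            return order
    return None

# Slide example: rows are clones 1-4, columns are STSs in the observed order.
sts_labels = [3, 5, 1, 4, 2]
hybridization = [
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0],
]

order = find_sts_order(hybridization)
if order is not None:
    print("One valid STS order:", [sts_labels[j] for j in order])
```

The solution is not unique: reversing a valid order, or swapping STSs that hit exactly the same clones, gives another valid order, so the hybridization data alone cannot distinguish them.
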
7
Clone Selection
  • Based on the clone map, select authentic clones to
    generate a minimum tiling path
  • Most important criterion: authenticity
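Choosing a minimum tiling path can be viewed as an interval-cover problem once each clone has map coordinates. The sketch below uses the standard greedy strategy (always extend with the clone that reaches furthest) on made-up coordinates; the function name and the clone data are hypothetical, and it does not model the authenticity check, which would have to filter the clone list beforehand.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Pick a smallest set of clones whose intervals cover the region.

    clones: list of (name, start, end) tuples in map coordinates.
    Greedy rule: among clones starting at or before the position covered
    so far, take the one that extends furthest to the right.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone map at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path

# Hypothetical BAC coordinates in kb, for illustration only.
example_clones = [
    ("A", 0, 200),
    ("B", 50, 180),
    ("C", 150, 400),
    ("D", 190, 350),
    ("E", 380, 600),
]
print(minimum_tiling_path(example_clones, 0, 600))
```
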

8
Subclone Library Construction
  • DNA fragmented by sonication or RE cut
  • Fragment size: 2-5 kb

9
Random Shotgun Phase
  • Dideoxy termination reaction
  • Informatics programs
  • Coverage and contigs

10
Dideoxy Termination
  • Method invented by Fred Sanger
  • Automated sequencing developed by Leroy Hood
    (Caltech) and Michael Hunkapiller (ABI)

11
Bioinformatics Programs
  • Developed at the University of Washington
  • Phred: base calling (Brent Ewing, Phil Green)
  • Phrap: assembly (Phil Green)
  • Consed: viewing and editing (David Gordon)

12
Coverage and contigs
  • Coverage = sequenced bp / fragment size
  • E.g. 200 kb BAC, sequenced 1000 x 500 bp subclones:
    coverage = 1000 x 500 bp / 200 kb = 2.5X
  • Lander-Waterman curve
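The coverage arithmetic on this slide, together with the basic Lander-Waterman expectations, can be checked with a few lines. This is a sketch under the usual model assumptions (reads placed uniformly at random, equal read length); `theta`, the fraction of a read that must overlap for two reads to be joined, is a parameter introduced here for illustration and defaults to 0.

```python
import math

def coverage(n_reads, read_len_bp, target_len_bp):
    """Fold coverage: total sequenced bases divided by target size."""
    return n_reads * read_len_bp / target_len_bp

def fraction_covered(c):
    """Expected fraction of the target covered by at least one read."""
    return 1.0 - math.exp(-c)

def expected_contigs(n_reads, c, theta=0.0):
    """Lander-Waterman expected number of contigs, N * exp(-c * (1 - theta))."""
    return n_reads * math.exp(-c * (1.0 - theta))

# Slide example: 200 kb BAC, 1000 subclone reads of 500 bp each.
c = coverage(1000, 500, 200_000)
print(f"coverage = {c:.1f}X")
print(f"expected fraction covered = {fraction_covered(c):.3f}")
print(f"expected contigs = {expected_contigs(1000, c):.0f}")
```

At 2.5X the model predicts that roughly 92% of the BAC is covered and that the reads fall into about 82 contigs, which is why the random phase is followed by deeper sequencing and a directed finishing phase.
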

13
Directed Finishing Phase
  • Autofinish (David Gordon)
  • Design primers at the two ends of each gap, PCR
    amplify, and sequence from both ends until the
    reads meet
  • Sequence authentication: verify STS and RE sites
  • Finished: < 1 error (or ambiguity) in 10,000 bp,
    in the right order and orientation along a
    chromosome, almost no gaps.

14
Whole-Genome Shotgun Sequencing
  • Celera: human and Drosophila genomes
  • No physical map
  • Jigsaw puzzle assembly
  • Coverage: 7-10X

15
Shotgun Assembly
  • Screener: identify low-quality reads,
    contamination, and repeats
  • Overlapper: > 40 bp overlap with < 6% mismatches
  • Unitigger: combine the easy (unique assembly)
    subset first
  • Scaffolder: repeat resolution
  • Generate clone libraries of different sizes, and
    sequence only the clone ends (read pairs)
  • Use physical map information if available
  • Consensus
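The overlapper step can be illustrated with a naive suffix-prefix comparison. This is a sketch, not the Celera implementation: it reads the slide's thresholds as "more than 40 bp of overlap with under 6% mismatches" (the percent interpretation is an assumption), ignores indels and reverse complements, and the function name is hypothetical.

```python
import random

def best_overlap(a, b, min_overlap=40, max_mismatch_frac=0.06):
    """Length of the longest suffix of a that matches a prefix of b.

    A candidate overlap of length n is accepted when n > min_overlap and
    the number of mismatching positions is below max_mismatch_frac * n.
    Returns 0 if no overlap qualifies. Runs in O(len(a) * len(b)) time.
    """
    for n in range(min(len(a), len(b)), min_overlap, -1):
        mismatches = sum(x != y for x, y in zip(a[-n:], b[:n]))
        if mismatches < max_mismatch_frac * n:
            return n
    return 0

# Two simulated reads taken from a random 150 bp sequence, sharing 50 bp.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(150))
read1, read2 = genome[:100], genome[50:]
print(best_overlap(read1, read2))
```

Real assemblers avoid the all-against-all comparison with k-mer indexing and use banded alignment so that insertions and deletions are tolerated as well as substitutions.
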

16
Hybrid Method
17
Hybrid Method
  • Optimal mixture of clone-by-clone vs whole-genome
    shotgun not established
  • Still need 8-10X overall coverage
  • Bacterial genomes can be sequenced by WGS alone
  • Higher eukaryotes need more clone-by-clone
  • Comparative genomics can reduce the need for
    physical mapping (clone-by-clone)
  • Sequencing cost is decreasing quickly
  • Goal: $1000 / genome

18
First Generation Sequencing
19
Second Generation Sequencing
20
2nd Gen Sequencing Tech
  • Traditional sequencing machine: 384 reads of
    1 kb / 3 hours
  • 454 (Roche): 1M reads of 400 bp / 5 hours
  • Solexa (Illumina): 100M-1B reads of 30-100 bp /
    3-8 days, 8-16 samples
  • SOLiD (Applied Biosystems): 1.4B reads of
    35-50 bp / 5-8 days, 16 samples
  • Helicos (single molecule sequencing): 500M reads
    of 30 bp / week, 50 samples
  • Moving targets

21
Illumina (Solexa) Workflow
22
Illumina HiSeq2000
  • Throughput
  • $1100 / lane
  • 35-100 bp / read
  • 16 lanes (2 flow cells) / run
  • 60-80 million reads / lane
  • Sequencing a human genome: $10,000, 1 week
  • Bioinfo challenges
  • Very large files
  • CPU and RAM hungry
  • Sequence quality filtering
  • Mapping and downstream analysis
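The throughput figures on this slide translate into per-run yield with simple arithmetic. The sketch below uses the slide's 16 lanes, 60-80 million reads per lane, and the upper read length of 100 bp; the human genome size of about 3.1 Gb is an assumption added here, not a number from the slide.

```python
HUMAN_GENOME_BP = 3_100_000_000  # assumed approximate genome size, not from the slide

def run_yield_bp(reads_per_lane, read_len_bp, lanes=16):
    """Total bases produced by one run."""
    return reads_per_lane * read_len_bp * lanes

def fold_coverage(total_bp, genome_bp=HUMAN_GENOME_BP):
    """Average fold coverage if all bases came from a single genome."""
    return total_bp / genome_bp

low = run_yield_bp(60_000_000, 100)
high = run_yield_bp(80_000_000, 100)
print(f"yield per run: {low / 1e9:.0f}-{high / 1e9:.0f} Gb")
print(f"human coverage: {fold_coverage(low):.0f}-{fold_coverage(high):.0f}X")
```

These are raw yields before quality filtering and mapping, so usable coverage would be lower.
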

23
Seq Files
FASTQ:
@HWI-EAS3051119910/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS3051119910/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
@HWI-EAS3051112010/1
AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
+HWI-EAS3051112010/1
PXXXTXYXTTWYYYXXWWWTMTVXWBBB

SAM:
HWUSI-EAS366_0112611298188280/1  16  chr9  98116600  255  38M  *  0  0  TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG  Y\bcdab\_UULbTUT\ccLbbYaYcWLYW  XA:i:1  MD:Z:3C30T3  NM:i:2
HWUSI-EAS366_0112611257188190/1  4  *  0  0  *  *  0  0  AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG  ecedddT\cTcaccdK\c__Yb\_cKS_W\  XM:i:1
HWUSI-EAS366_0112611315195290/1  16  chr9  102610263  255  38M  *  0  0  GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC  c_Yc\LcbbbYdTa\dd\ddacdd\Y\dddcT  XA:i:0  MD:Z:38  NM:i:0

BED:
chr1  123450  123500  +
chr5  28374615  28374615  -

  • Raw FASTQ
  • Sequence ID, sequence
  • Quality ID, quality score
  • Mapped SAM
  • Flag: 0 mapped OK, 4 unmapped, 16 mapped to the
    reverse strand
  • MD: mismatch info
  • NM: number of mismatches
  • Mapped BED
  • Chr, start, end, strand
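The file formats on this slide are simple enough to read with plain Python. The sketch below parses four-line FASTQ records, converts a quality string to Phred scores, and interprets the three SAM flag values listed above. The function names are hypothetical, and the default quality offset of 64 is an assumption based on the older Illumina pipeline that produced reads like these; current files normally use an offset of 33.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines."""
    clean = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(clean) % 4 != 0:
        raise ValueError("FASTQ records must have exactly 4 lines each")
    for i in range(0, len(clean), 4):
        header, seq, plus, qual = clean[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual

def phred_scores(qual, offset=64):
    """Convert an ASCII quality string to integer Phred scores.

    offset=64 is assumed for old Illumina output; use 33 for Sanger-style files.
    """
    return [ord(ch) - offset for ch in qual]

def describe_flag(flag):
    """Interpret the SAM flag bits for unmapped (4) and reverse strand (16)."""
    if flag & 4:
        return "unmapped"
    return "mapped, reverse strand" if flag & 16 else "mapped, forward strand"

def sam_tags(fields):
    """Collect optional TAG:TYPE:VALUE fields (columns 12 onward) into a dict."""
    tags = {}
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags
```

For example, `describe_flag(16)` reports a reverse-strand alignment, and `sam_tags` applied to a split SAM line exposes the `MD` and `NM` values as strings keyed by tag name.
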

24
(Potential) Applications
  • Metagenomics and infectious disease
  • Ancient DNA, recreate extinct species
  • Comparative genomics (between species) and
    personal genomes (within species)
  • Genetic tests and forensics
  • Circulating nucleic acids
  • Risk, diagnosis, and prognosis prediction
  • Transcriptome and transcriptional regulation
  • More later in the semester

25
Third Generation Sequencing
  • Single molecule sequencing (no amplification
    needed)
  • Some can read very long sequences
  • In 2-3 years, the cost of sequencing a human
    genome will drop below $1000
  • Personal genome sequencing will be a key
    component of public health in every developed
    country
  • The cost of sequencing will be lower than the
    cost of storing the sequences
  • Bioinformatics will be key to convert data into
    knowledge

26
HW Q for Graduate Students
  • Write the first page (Specific Aims) of an NIH
    research proposal with a $1 million budget that
    uses high-throughput sequencing and
    bioinformatics analysis to solve an interesting
    biomedical problem

27
How to Write a Specific Aims Page
  • Grant title and your name
  • Introductory (1-2) paragraphs (1/4-1/3 page)
  • A is very important in bio/medicine/disease
  • Recent development in A has made some really
    significant findings or improvements
  • However, something is still lacking or not known
    about A
  • The long term goal of our research is
  • The focus of this proposal is
  • The central hypothesis of this proposal is
  • Therefore, we plan to investigate / develop

28
How to Write a Specific Aims Page
  • Specific Aims (1/3 to ½ page)
  • Specifically, we plan to
  • Aim 1: profile the genome-wide xxx of xxx
  • 1.1 establish xxx
  • 1.2
  • Aim 2: develop a computer algorithm or knowledge
    base
  • 2.1 model xxx
  • Aim 3: identify the mechanism of xxx

29
Specific Aims
  • Sound like you can definitely do it
  • Do not use words that sound like a fishing
    expedition such as try, explore (find) or words
    that sound too trivial such as download (collect)
  • Try not to let the success of one aim depend on
    another, but keep the aims intellectually
    coherent (not a collection of unrelated topics)
  • Propose as many aims as years for the grant

30
How to Write a Specific Aims Page
  • Last paragraph (< ¼ page)
  • What's novel about our approach
  • Deliverables (what will the scientific community
    see at the end)
  • Software, database, map, mechanism, resource
  • What's the (potential) significance of our
    proposal
  • Best possible outcome

31
Summary
  • Genome sequencing and assembly
  • Clone-by-clone: HGP
  • Map big clones, find path, shotgun sequence
    subclones, assemble and finish
  • Sequencing: dideoxy termination
  • Whole-genome shotgun: Celera
  • Massively Parallel Sequencing
  • 454, Solexa, SOLiD, Helicos
  • Many opportunities and many challenges
  • Project proposal

32
Acknowledgement
  • Fritz Roth
  • Dannie Durand
  • Larry Hunter
  • Richard Davis
  • Wei Li
  • Jarek Meller
  • Stefan Bekiranov
  • Stuart M. Brown
  • Rob Mitra