Large-scale genome projects

Provided by: renataa

1
Large-scale genome projects

Sequencing DNA molecules in the Mb size range
All strategies employ the same underlying
principles
Random Shotgun sequencing

2
Genomic DNA
Shearing/Sonication
Subclone and Sequence
Shotgun reads
Assembly
Contigs
Finishing read
Finishing
Complete sequence
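
The shotgun step in the workflow above can be illustrated with a small simulation. This is a minimal sketch, assuming fixed-length reads drawn uniformly at random from a single strand; real shearing gives variable fragment sizes, and the names `shotgun_reads` and `covered_fraction` are illustrative only.

```python
import random


def shotgun_reads(genome, n_reads, read_len, seed=0):
    """Sample (start, sequence) reads of fixed length from random positions."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(0, len(genome) - read_len + 1)
        reads.append((start, genome[start:start + read_len]))
    return reads


def covered_fraction(genome_len, reads):
    """Fraction of genome positions covered by at least one read."""
    covered = [False] * genome_len
    for start, seq in reads:
        for i in range(start, start + len(seq)):
            covered[i] = True
    return sum(covered) / genome_len


if __name__ == "__main__":
    genome = "".join(random.Random(1).choice("ACGT") for _ in range(5000))
    reads = shotgun_reads(genome, n_reads=80, read_len=500)
    print(f"covered: {covered_fraction(len(genome), reads):.3f}")
```

Positions that no read happens to cover are the gaps that the finishing stage has to close.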
3
Nucleotide Database Growth
4
EMBL breakdown by organism
5
EMBL Release 65
6
Progress on Large Sequencing Projects
7
Strategies for sequencing

How big can you go?
Large-insert clones
cosmids 30-40 kb
BACs/PACs 50 - 100 kb
Whole chromosomes
Whole genomes

8
Genome size and sequencing strategies
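[Figure: genomes placed along a log-scale axis of genome size (log Mb, 0 to 4), with the sequencing strategies used across that range]
Genomes shown:
H. sapiens (3000 Mb)
D. melanogaster (170 Mb)
C. elegans (100 Mb)
P. falciparum (30 Mb)
S. cerevisiae (14 Mb)
E. coli (4 Mb)
Strategies shown:
Whole Genome Shotgun (WGS)
Clone-by-clone
Whole Chromosome Shotgun (WCS)
Whole Genome Shotgun (WGS) with clone skims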
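
To see where each genome sits on the log-scale axis, the sizes quoted on this slide can be converted directly. A minimal sketch; the dictionary simply restates the slide's figures, and `log_size` is an illustrative name.

```python
import math

# Genome sizes in Mb, as quoted on the slide.
GENOME_SIZES_MB = {
    "H. sapiens": 3000,
    "D. melanogaster": 170,
    "C. elegans": 100,
    "P. falciparum": 30,
    "S. cerevisiae": 14,
    "E. coli": 4,
}


def log_size(size_mb):
    """Position on the log10(Mb) axis."""
    return math.log10(size_mb)


if __name__ == "__main__":
    for name, size in GENOME_SIZES_MB.items():
        print(f"{name:16s} {size:5d} Mb  log10 = {log_size(size):.2f}")
```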
9
Genomic DNA
Shearing/Sonication
Subclone and Sequence
Shotgun reads
Assembly
Contigs
Finishing read
Finishing
Complete sequence
10
Strategies for sequencing

Size and GC composition of genome
Volume of data
Ease of cloning
Ease of sequencing
Genome complexity
dispersed repetitive sequence
telomeres and centromeres
Politics/Funding

11
Strategies: Clone by clone

Simple (0.5 - 2 K reads)
Few problems with repeats
Relatively simple informatics
Scalability
Quality of physical map
Fingerprint / STS maps
End sequencing

12
Strategies: Whole Chromosome Shotgun (WCS)

Requires chromosome isolation
Moderate complexity (tens of thousands of reads)
Problems with repeats
Complex informatics
Inefficient in isolation
Quality of physical map
Skims of mapped clones

13
Strategies: Whole Genome Shotgun (WGS)

Moderate to high complexity (tens to hundreds of thousands of reads)
Problems with repeats
Complex informatics
Quality of physical map
Fingerprint map
STS markers
End-sequences
Skims of mapped clones
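
The read counts quoted for the three strategies follow from simple coverage arithmetic: reads needed = coverage x target size / read length. The sketch below assumes 8-fold coverage and 500 bp reads, which are illustrative values and not figures from the slides, and uses the standard Poisson approximation that a fraction e^(-c) of bases is left uncovered at coverage c.

```python
import math


def reads_needed(target_bp, coverage=8, read_len=500):
    """Reads required to reach the given average coverage (assumed defaults)."""
    return math.ceil(coverage * target_bp / read_len)


def expected_uncovered_fraction(coverage):
    """Poisson approximation: fraction of bases hit by no read."""
    return math.exp(-coverage)


if __name__ == "__main__":
    targets = {
        "cosmid (40 kb)": 40_000,
        "BAC/PAC (100 kb)": 100_000,
        "P. falciparum genome (30 Mb)": 30_000_000,
    }
    for name, size in targets.items():
        print(f"{name:30s} {reads_needed(size):>8d} reads")
    print(f"uncovered at 8x: {expected_uncovered_fraction(8):.5f}")
```

Under these assumptions a cosmid needs 640 reads and a 100 kb BAC 1600, in line with the 0.5 - 2 K range given for clone-by-clone projects.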

14
Sequencing my genome
Politics
Production
Finishing
Annotation
TIME
MONEY
15
What do you get?
DATA!! DATA!! And more DATA!!

Sequence
incomplete vs. complete
First-pass annotation
Gene discovery
Full annotation
A starting point for research

16
Genome annotation is central to functional
genomics
19
Sequencing

Library construction
Colony picking
DNA preparation
Sequencing reactions
Electrophoresis
Tracking/Base calling

20
Libraries

Essentially Sub-cloning
Generation of small insert libraries in a well
characterised vector.
Ease of propagation
Ease of DNA purification
e.g. pUC18, M13

21
Libraries - testing

Simple concepts
Insert/Vector ratio
Real data
Insert size
Sequence
Simple analysis

22
Sequence generation

Pick colonies
Template preparation
Sequence reactions
Standard terminator chemistry
pUC libraries sequenced with forward and reverse
primers

23
Sequence generation

Electrophoresis of products
Old style - slab gels, 32, 64 or 96 lanes
New style - capillary gels, 96 lanes
Transfer of gel image to UNIX
Sequencing machines use a slave Mac/PC
Move data to centralised storage area for
processing

24
Gel image processing

Light-to-Dye estimation
Lane tracking
Lane editing
Trace extraction
Trace standardisation
Mobility correction
Background subtraction

25
Pre-processing

Base calling using Phred
modifies SCF file
Quality clipping
Vector clipping
Sequencing vector
Cloning vector
Screen for contaminants
Feature mark up (repeats/transposons)
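
Quality clipping relies on the Phred quality scale, where a score Q corresponds to an error probability of 10^(-Q/10). The sketch below keeps the stretch of a read that maximises the sum of (Q - threshold); this is a generic maximum-scoring-segment approach with an assumed threshold of Q20, not necessarily the clipping rule used in the pipeline described here.

```python
def error_probability(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def quality_clip(quals, min_q=20):
    """Return (start, end), end exclusive, of the best-scoring stretch.

    Each base scores q - min_q; the stretch with the highest total is kept.
    Returns (0, 0) if no base exceeds the threshold.
    """
    best_sum, best = 0, (0, 0)
    cur_sum, cur_start = 0, 0
    for i, q in enumerate(quals):
        if cur_sum <= 0:
            # Restart the candidate stretch at this base.
            cur_sum, cur_start = 0, i
        cur_sum += q - min_q
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


if __name__ == "__main__":
    quals = [5, 8, 30, 35, 40, 38, 10, 6]
    start, end = quality_clip(quals)
    print(f"keep bases {start}..{end - 1}")
```

For the example scores, the poor-quality bases at both ends are clipped and only the four central bases are kept.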

27
Finishing

Assembly: the process of turning raw single-pass
reads into contiguous consensus sequence
Closure: the process of ordering and merging
consensus sequences into a single contiguous
sequence
"Finished" is defined as sequenced on both strands
using multiple clones. In the absence of multiple
clones, the clone must be sequenced with multiple
chemistries. The overall error rate is estimated
at less than 1 error per 10 kb
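
The finished-sequence error rate can be put on the same scale as per-base quality scores: 1 error per 10 kb is an error probability of 10^-4, which corresponds to Phred 40. A minimal sketch of that arithmetic; the function names are illustrative.

```python
import math


def phred_from_error_rate(p):
    """Phred-scaled quality for a per-base error probability p."""
    return -10 * math.log10(p)


def max_expected_errors(length_bp, errors_per_bp=1 / 10_000):
    """Upper bound on expected errors at the finished-sequence standard."""
    return length_bp * errors_per_bp


if __name__ == "__main__":
    print(f"1 error per 10 kb = Q{phred_from_error_rate(1 / 10_000):.0f}")
    print(f"40 kb cosmid: fewer than {max_expected_errors(40_000):.0f} expected errors")
```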

28
Genome Assembly

Pre-assembly
Assembly
Automated appraisal
Manual review

29
Pre-Assembly

Convert to CAF format
flatfile text format
choice of assembler
choice of post-assembly modules
choice of assembly editor

www.sanger.ac.uk/Software/CAF
30
Assembly

Assemble using Phrap
Read FASTA and quality scores from CAF file
Merge existing Phrap .ace file as necessary
Adjust clipping
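
The idea of assembly, merging overlapping reads into contigs, can be shown with a toy greedy merger. This is a teaching sketch only: it is not how Phrap works, it ignores quality scores, sequencing errors, reverse-complement reads and repeats, and the minimum overlap of 3 bases is an arbitrary assumption.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates
    # Drop reads wholly contained in another read.
    contigs = [r for r in contigs
               if not any(r != o and r in o for o in contigs)]
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
```

With the three example reads the result is a single contig, `ATGCGTACGTTAGC`. Reads that share no sufficient overlap stay as separate contigs, which is the situation gap closure has to resolve.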

31
Assembly appraisal

auto-edit
removes 70% of read discrepancies
Remove cloning vector
Mark up sequence features
finish
Identify low-quality regions
Cover using re-runs and long-runs
Compare with current databases
plate contamination

32
Manual Assembly appraisal

Use a sequence editor (GAP/consed)
Tools to identify internal joins
Tools to identify and import data from
overlapping projects
Tools to check failed or mis-assembled reads for
inclusion in project

33
Manual editing

Sanger uses a 100% edit strategy
Where additional data is required
Check clipping
Additional sequencing
Template / Primer / Chemistry
Assemble new data into project
GAP4 Auto-assemble
Repeat whole process

34
Manual Quality Checks

Force annotation tag consistency
All unedited data is re-assembled using Phrap
All high-quality discrepancies are reviewed
Confirm restriction digest (clones)
Check for inverted repeats
Manually check
Areas of high-density edits
Areas with no supporting unedited data
Areas of low read coverage
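
Areas of low read coverage can be located automatically before the manual check. A minimal sketch, assuming each read is given as a (start, end) placement on the consensus with an exclusive end, and that "low" means fewer than 2 reads; both the input layout and the threshold are assumptions for illustration.

```python
def coverage_depth(length, placements):
    """Per-base read depth along a consensus of the given length."""
    depth = [0] * length
    for start, end in placements:
        for i in range(max(0, start), min(length, end)):
            depth[i] += 1
    return depth


def low_coverage_regions(depth, min_depth=2):
    """Return (start, end) intervals, end exclusive, with depth < min_depth."""
    regions, start = [], None
    for i, d in enumerate(depth):
        if d < min_depth and start is None:
            start = i
        elif d >= min_depth and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


if __name__ == "__main__":
    depth = coverage_depth(10, [(0, 6), (2, 8), (4, 10)])
    print(depth)
    print(low_coverage_regions(depth))
```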

35
Gap closure

Read pairs
PCR reactions (long-range / combinatorial)
Small-insert libraries
Transposon-insertion libraries

36
Gap closure - contig ordering

Read pair consistency
STS mapping
Physical mapping
Genetic mapping
Optical mapping
Large-insert clone
skims
end-sequencing
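
Read-pair consistency can be used to order contigs: when the forward and reverse reads of one clone land in different contigs, those contigs are probably neighbours. The sketch below only counts such links; the input layout (a read-to-contig mapping plus a list of read pairs) and all names are hypothetical, and orientation and insert-size checks are left out.

```python
from collections import Counter


def contig_links(read_to_contig, read_pairs):
    """Count read pairs whose two ends fall in different contigs."""
    links = Counter()
    for fwd, rev in read_pairs:
        a = read_to_contig.get(fwd)
        b = read_to_contig.get(rev)
        if a is not None and b is not None and a != b:
            links[tuple(sorted((a, b)))] += 1
    return links


if __name__ == "__main__":
    read_to_contig = {
        "c1.f": "contigA", "c1.r": "contigB",
        "c2.f": "contigA", "c2.r": "contigB",
        "c3.f": "contigB", "c3.r": "contigC",
        "c4.f": "contigA", "c4.r": "contigA",
    }
    pairs = [("c1.f", "c1.r"), ("c2.f", "c2.r"),
             ("c3.f", "c3.r"), ("c4.f", "c4.r")]
    for (a, b), n in contig_links(read_to_contig, pairs).most_common():
        print(f"{a} - {b}: {n} linking pair(s)")
```

Contig pairs supported by several independent clones are better candidates for gap-closing PCR than pairs supported by a single link.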

38
Annotation

DNA features (repeats/similarities)
Gene finding
Peptide features
Initial role assignment
Others, e.g. regulatory regions
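
The simplest form of gene finding is a scan for open reading frames. The sketch below searches the three forward frames for ATG ... stop stretches; it is a toy, assuming intron-free genes, the standard stop codons and an arbitrary minimum length, and real gene finders (especially for eukaryotes) are far more elaborate.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) of forward-strand ORFs, end exclusive.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon (stop included) and must contain at least min_codons codons
    before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    seq = "CCATGAAATTTGGGTAACC"
    for start, end in find_orfs(seq):
        print(start, end, seq[start:end])
```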

39
Annotation of eukaryotic genomes
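[Figure: pathway from genomic DNA to protein function, annotated with gene-prediction approaches]
Pathway: Genomic DNA, transcription, Unprocessed RNA, RNA processing, Mature mRNA (labelled Gm3 ... AAAAAAA), translation, Nascent polypeptide, folding, Active enzyme, Function (Reactant A to Product B)
Annotation approaches: ab initio gene prediction, Comparative gene prediction, Functional identification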
40
Genome analysis overview: C. elegans
41
DNA features