Title: Paracel TranscriptAssembler
1- Paracel TranscriptAssembler
- Paracel GenomeAssembler
- Pavel Morozov
- 9/17/2002
2Paracel TranscriptAssembler (PTA 2.6.2) Overview
- Key features and advantages
- Architecture
- Using PTA
- Scientific problems for PTA
3Paracel TranscriptAssembler (PTA 2.6.2) features
- High-capacity solution for EST-based transcript reconstruction in a multi-processor environment
- Complete pipeline for sequence cleaning, clustering, and assembly
- Removal of chimeric sequences
- Detection, alignment, and visualization of alternative splice forms
- Visualization through intuitive graphical interfaces
4PTA advantages
- EST-specific assembly engine
- Reduces the number of falsely joined huge clusters through data cleaning and chimera detection
- Specifically tuned for detection and alignment of alternative splice forms
- Multi-processor capabilities
- Integrated XML format
5TranscriptAssembler Architecture
[Architecture diagram. Labels: Input ESTs, Consensus Sequences, Convert, Clean up, Clustering, Assembly, Seeded Clustering, Assembled Clusters, TranscriptView]
6Conversion of formats
- Directories or files containing sequences in these formats:
- phd files
- FASTA
- GenBank
- EMBL
- CAML (XML)
- Obtains information about clone name, orientation, tissue type, etc.
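A conversion step has to recognize each input format before it can parse it. The sketch below guesses the format from the first non-blank line, using the standard leading markers of each format (">" for FASTA, "BEGIN_SEQUENCE" for phd, "LOCUS" for GenBank, "ID" for EMBL, "<" for XML). The function name is hypothetical, and nothing here is taken from PTA itself; how PTA detects formats is not stated in these slides.

```python
def sniff_sequence_format(text: str) -> str:
    """Guess a sequence file format from its first non-blank line.

    Illustrative only: relies on the usual leading markers of each format,
    not on any PTA-specific logic.
    """
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("BEGIN_SEQUENCE"):
            return "phd"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("ID "):
            return "embl"
        if line.lstrip().startswith("<"):
            return "xml"  # CAML is XML-based, so it would land here
        return "unknown"
    return "unknown"


print(sniff_sequence_format(">est_0001 clone=abc\nACGTACGT\n"))
```

A real converter would go on to parse the record and pull out the annotations listed above (clone name, orientation, tissue type); where those fields live differs by format and is not covered by this sketch.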
7Clean up
- Quality of the clustering and assembly results depends on the quality of filtering and masking
- Removes contaminants, e.g. E. coli, vector
- Removes low-quality sequences
- Non-destructive masking of repeats, polyA tails, and low-complexity regions, so that such regions are kept in the final assembly
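Non-destructive masking means a region is flagged without its bases being thrown away. One common way to do that is soft masking, where masked bases are written in lowercase. The sketch below applies this to a polyA tail. The function name, the lowercase convention, and the default minimum tail length are assumptions made for illustration; the slides do not say how PTA represents masked regions.

```python
import re


def soft_mask_polya_tail(seq: str, min_len: int = 10) -> str:
    """Lowercase a trailing run of at least min_len 'A' bases.

    Assumes seq is uppercase. The bases are kept, only their case changes,
    so the masked region can still appear in a later assembly.
    """
    match = re.search(r"A{%d,}$" % min_len, seq)
    if match is None:
        return seq  # no tail long enough to mask
    return seq[:match.start()] + seq[match.start():].lower()


read = "GCTTAGCCGT" + "A" * 12
masked = soft_mask_polya_tail(read)
print(masked)
# The original read is fully recoverable from the masked one.
print(masked.upper() == read)
```

A destructive alternative would replace the tail with N characters or trim it, which loses the information the slide says should be kept.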
8Seed clustering
- Optional tool if full-length mRNA is available
- ESTs with similarity above a threshold are clustered with a seed
- Three strategies
- Assemble only novel genes (default)
- Division of labor
- Assemble only the genes of interest
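The seeded step can be pictured as: score each EST against every seed, attach it to the best seed if the score is above the threshold, and otherwise leave it for later stages. The sketch below uses the fraction of shared k-mers as a stand-in similarity measure. That measure, the function names, and the default values are all assumptions for illustration; PTA's actual comparison method and thresholds are not described here.

```python
def kmers(seq, k=8):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign_to_seeds(ests, seeds, threshold=0.5, k=8):
    """Attach each EST to its best-matching seed if the score exceeds threshold.

    Score = fraction of the EST's k-mers that also occur in the seed
    (a toy similarity, not PTA's). Returns (clusters, unassigned).
    """
    seed_kmers = {name: kmers(seq, k) for name, seq in seeds.items()}
    clusters = {name: [] for name in seeds}
    unassigned = []
    for est_name, est_seq in ests.items():
        est_k = kmers(est_seq, k)
        best_name, best_score = None, 0.0
        for name, sk in seed_kmers.items():
            score = len(est_k & sk) / len(est_k) if est_k else 0.0
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None and best_score > threshold:
            clusters[best_name].append(est_name)
        else:
            unassigned.append(est_name)
    return clusters, unassigned


seeds = {"mrna1": "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"}
ests = {
    "est1": seeds["mrna1"][5:30],             # fragment of the seed
    "est2": "TTTTTTTTTTTTTTTTTTTTCCCCCCCC",   # unrelated sequence
}
clusters, unassigned = assign_to_seeds(ests, seeds)
print(clusters)
print(unassigned)
```

Under this picture, one plausible reading of the three strategies is that they differ in which of the two outputs goes on to assembly: the unassigned ESTs, the seeded clusters, or both. That reading is an interpretation, not something the slides spell out.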
9Clustering
- All-versus-all comparison
- Utilizes Haste for fast BLAST/FASTA-like comparison
- Uses loose clustering philosophy
- Chimera detection algorithm
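A loose clustering philosophy is often implemented as single-linkage clustering: any pairwise hit joins two sequences, and membership is transitive. The sketch below shows that idea with a union-find structure over a list of pairwise hits. It assumes that "loose" means single-linkage, which the slides do not confirm, and the function name is hypothetical.

```python
def single_linkage_clusters(names, hits):
    """Group sequences so that any chain of pairwise hits shares one cluster.

    names: iterable of sequence names; hits: iterable of (name_a, name_b).
    Returns (clusters, singlets), both sorted for stable output.
    """
    names = list(names)
    parent = {n: n for n in names}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    clusters = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    singlets = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, singlets


clusters, singlets = single_linkage_clusters(
    ["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c")]
)
print(clusters)
print(singlets)
```

The sketch also shows why chimera detection matters in this setting: a single chimeric read with hits to two unrelated genes would be enough to merge both clusters into one.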
10Assembly in PTA
- Highly optimized version for EST assembly
- Utilizes quality values (supports mixed input)
- Splice variants divided into individual contigs/singlets
- Detects chimeric sequences and bad ends
11Splice Variant detection
- Alternate transcript constructs generated during assembly are exhaustively compared
- Determination of multiple genes per cluster
- Relative exon layout displayed in SpliceView
12Splice Variant Alignment View
13Output Files
- Cleanup statistics: PROJECT.DATA/.scylla.stat
- Clustering statistics: PROJECT.clusters.stat
- Singlets from pairwise clustering
- PROJECT.singlets.caml
- Clusters that sequences went into
- PROJECT.clusters.info
- FASTA contigs and singlets from assembly
- PROJECT.clusters.contigs
- PROJECT.clusters.singlets
14Output Files cont.
- Assembled CAML files are stored in PROJECT.CL/
- Numbered subdirectories (100 clusters each)
- .ace files for use with consed
- .assem.caml.gz files for use with AssemblyView
- View assemblies with AssemblyView
- Cross-platform Java application
- View gene transcripts and alternate splice forms
- Edit bases
- View chromatograms
- Export consensus sequences
15- Clustering statistics
----------------------------------------------------------
--- Total no. of sequences from prev. run            0
--- Total no. of current input sequences             3955807
--- Total no. of sequences compared                  907431
--- Total no. of singlets after pairwise compared    324332
--- Total no. of problem sequences                   8203
--- Total no. sequences in clusters                  574896
--- Total no. of seed clusters                       20278
--- Total no. of clusters                            89402

02-08-2002 054414 PTA End.
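The counters in this report can be cross-checked: the singlets, problem sequences, and clustered sequences add up to the number of sequences compared (324332 + 8203 + 574896 = 907431). The sketch below parses such a report and runs that check. The one-counter-per-line layout and the function name are assumptions for illustration; the actual layout of PROJECT.clusters.stat may differ.

```python
import re

# Assumed layout: "--- <label> <count>", one counter per line.
STATS_TEXT = """\
--- Total no. of sequences compared                  907431
--- Total no. of singlets after pairwise compared    324332
--- Total no. of problem sequences                   8203
--- Total no. sequences in clusters                  574896
"""


def parse_stats(text):
    """Return {label: count} for every '--- label count' line in text."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"^-{3}\s+(.*?):?\s+(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


stats = parse_stats(STATS_TEXT)
accounted_for = (
    stats["Total no. of singlets after pairwise compared"]
    + stats["Total no. of problem sequences"]
    + stats["Total no. sequences in clusters"]
)
# Every compared sequence ends up as a singlet, a problem sequence,
# or a cluster member.
print(accounted_for == stats["Total no. of sequences compared"])  # prints True
```

A check like this is a cheap way to confirm that a run finished cleanly before looking at individual clusters.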
16Splice Variant Alignment View
17Contig View
18Transcript View
19Improving Results
- Improve cleanup parameters
- Avoid over-clustering due to false positives
- Trim appropriate sequencing artifacts
- Choose appropriate search algorithm, thresholds, and matrix stringency
- Rerun clusters that timed out
- Recluster clusters that are too large with a higher threshold
- Perform multiple iterations
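Reclustering oversized clusters with a higher threshold, over several iterations, can be sketched as a loop: keep only the hits that pass the current threshold, split into connected components, and push any component that is still too large back into the queue with a stricter threshold. The scores, step size, size limit, and function names below are hypothetical; this is not PTA's procedure, only an illustration of the idea.

```python
def connected_components(nodes, edges):
    """Return sorted connected components of an undirected graph."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(component))
    return components


def recluster(nodes, scored_hits, threshold, max_size, step):
    """Split clusters larger than max_size by raising the threshold by step.

    scored_hits: list of (name_a, name_b, score). Requires step > 0 and
    max_size >= 1 so that the loop always terminates.
    """
    finished = []
    pending = [(sorted(nodes), threshold)]
    while pending:
        members, current = pending.pop()
        member_set = set(members)
        edges = [
            (a, b) for a, b, score in scored_hits
            if score >= current and a in member_set and b in member_set
        ]
        for component in connected_components(members, edges):
            if len(component) > max_size:
                pending.append((component, current + step))
            else:
                finished.append(component)
    return sorted(finished)


hits = [("a", "b", 0.95), ("b", "c", 0.6), ("c", "d", 0.9)]
print(recluster(["a", "b", "c", "d"], hits, threshold=0.5, max_size=2, step=0.2))
```

In the example, the weak b-c hit holds all four sequences together at the initial threshold; one stricter pass drops it and leaves two clusters of two.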
20Scientific problems for PTA
- Proteomics
- Gene discovery
- Verify gene predictions for genome assembly
- Detecting splice variants
- Patterns of expression, tissue specificity
- SNP detection
- Combinations of all the above...
21Paracel GenomeAssembler
- All-in-one pipeline
- Base calling with TraceTuner
- Filtering and masking
- Assembly
- Easy-to-use graphical interface and editing tools
- Genomic assemblies
- Constraints
- Troubleshooting
- Finishing
22PGA Advantages
- Genome-specific assembly engine
- Longer assemblies
- Fewer misassemblies
- Most accurate consensus calls
- Use of clone pair constraints
- Resolve repeats
- Produce scaffolds
- Parallel processing for SMP systems
- User-friendly graphical interface
23PGA architecture.
[Architecture diagram. Labels: Input Files, Vector and E. coli Databases, Parameters, Base Calling w. Quality Values, Vector and Contaminant Screening, Read-pair Constraints Generation, Assembly, Screening results, PFPView, ScaffoldView, ContigView, Output Files. Legend: User interaction, Processing step]
24Basecalling in PGA - Optional
- Using Paracel's TraceTuner technology
- Advanced peak processing (important at the ends of reads)
- More accurate quality value assignments
- Dye blob processing
- Heterozygote detection
- Calibration for sequencers
- ABI 377 DNA sequencer (DP, DT)
- ABI 3100 genetic analyzer (Pop6-BD1,2)
- ABI 3700 DNA analyzer (Pop5-BD1,2, Pop6-BD1,4)
- Standard calibrations used for other major sequencers
25TraceTuner Heterozygote detection.
- Calibrations for
- ABI 377 DNA sequencer (DP, DT)
- ABI 3100 genetic analyzer (Pop6-BD1,2)
- ABI 3700 DNA analyzer (Pop5-BD1,2, Pop6-BD1,4)
- Standard calibrations used for other major sequencers
26TraceTuner Dye Blob processing
27- Clean up Stage
- Annotates low-complexity regions with dust
- Masks vector contaminants
- Filters sequences with E. coli contamination
- Constraint Generation
- Clone pair information
- Minimum clone length boundaries
- mindist
- Maximum clone length boundaries
- maxdist
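A clone-pair constraint says that the two reads from one clone should end up a bounded distance apart. The sketch below checks a placed read pair against mindist and maxdist. The record layout, field names other than mindist and maxdist, and the coordinate convention are assumptions for illustration, not PGA's constraint file format.

```python
from dataclasses import dataclass


@dataclass
class ClonePair:
    """A forward/reverse read pair from one clone, placed in one contig.

    Hypothetical record: positions are contig coordinates of the outer
    ends of the two reads.
    """
    forward_start: int  # leftmost base of the forward read
    reverse_end: int    # rightmost base of the reverse read
    mindist: int        # minimum clone length allowed
    maxdist: int        # maximum clone length allowed


def satisfies_constraint(pair: ClonePair) -> bool:
    """True if the implied clone length lies within [mindist, maxdist]."""
    span = pair.reverse_end - pair.forward_start
    return pair.mindist <= span <= pair.maxdist


good = ClonePair(forward_start=1000, reverse_end=3500, mindist=2000, maxdist=3000)
bad = ClonePair(forward_start=1000, reverse_end=9000, mindist=2000, maxdist=3000)
print(satisfies_constraint(good))
print(satisfies_constraint(bad))
```

A pair that violates its bounds is a hint that one of the two reads has been placed in the wrong location, for example in the wrong copy of a repeat.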
28Double-ended reads resolve repeats
- A key feature in PGA: forward-reverse clone constraints
- Models double-ended sequencing reads
- One read is anchored in unique sequence
- Distinguishes which repeat instance the other read lies in
[Figure: Repeat, Repeat]
- Assembly WITH forward-reverse constraints
[Figure: Repeat]
- Assembly WITHOUT forward-reverse constraints
[Figure: misassembled fragment leaves a singleton]
29PGA generates scaffolds
Contigs
Scaffold of ordered, oriented contigs
- PGA's forward-reverse constraints help generate a scaffold
- Contigs are oriented and ordered
- Gaps are bounded and may be relatively small (or closed)
- Much more informative result than unordered contigs; useful in finishing and for low-pass sequencing
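The statement that gaps are bounded follows from simple arithmetic. If a clone spans two contigs, its length is the part covered in the left contig, plus the gap, plus the part covered in the right contig, and that length must lie between mindist and maxdist. The sketch below derives gap bounds from one pair and intersects the bounds from several pairs. Function and parameter names are hypothetical, and overlapping contigs (negative gaps) are ignored for simplicity.

```python
def gap_bounds(left_overhang, right_overhang, mindist, maxdist):
    """Bounds on the gap between two contigs implied by one clone pair.

    left_overhang: bases from the forward read's start to the end of the
    left contig. right_overhang: bases from the start of the right contig
    to the reverse read's end. Returns (low, high), or None if the pair
    cannot span the two contigs within maxdist.
    """
    covered = left_overhang + right_overhang
    if covered > maxdist:
        return None
    return max(0, mindist - covered), maxdist - covered


def intersect_bounds(bounds):
    """Combine bounds from several pairs; None if they are inconsistent."""
    low = max(b[0] for b in bounds)
    high = min(b[1] for b in bounds)
    return (low, high) if low <= high else None


first = gap_bounds(800, 700, mindist=2000, maxdist=3000)
second = gap_bounds(1200, 900, mindist=2000, maxdist=3000)
print(first, second)
print(intersect_bounds([first, second]))
```

Each additional spanning pair can only narrow the interval, which is why a scaffold built from many pairs says much more about contig spacing than an unordered contig list.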
30Running PGA
- PGA Launcher
- Interactive Mode
- Command Line
31PGA Launcher
32Interactive Mode
- Enter the command pga
- Displays introductory message and instructions
- Answer questions
- Project name
- Data location
- Constraint information
- Filtering
- Processors
33Command line
- Advanced
- Recommend running default first
- pga <input data> <options>
- Command line options: pga h (Chapter 4)
- Param file: prm <file name> (Chapter 9)
- pga v gives version
34Paracel GenomeAssembler produces
- FASTA file of contig and singlet sequences
- Statistical files: contig sizes, number, and identity
- Statistical information for cleanup stage
- Linking information between contigs used to generate scaffolds
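Contig size statistics like those listed above can also be recomputed directly from a FASTA file of contigs. The sketch below reads sequence lengths from FASTA text and reports the contig count, total length, and N50, a widely used summary of assembly contiguity. N50 is added here for illustration and is not claimed to be one of PGA's own statistics; the function names are hypothetical.

```python
def read_fasta_lengths(text):
    """Return {record_name: sequence_length} from FASTA-formatted text."""
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Smallest contig length such that contigs of at least that length
    together contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


FASTA_TEXT = """\
>contig1 example
ACG
TA
>contig2
ACGT
>contig3
ACG
"""

sizes = read_fasta_lengths(FASTA_TEXT)
print(len(sizes), sum(sizes.values()), n50(sizes.values()))
```

For the three toy contigs (5, 4, and 3 bases) the two longest together cover 9 of 12 bases, so the N50 is 4.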
35End of PGA log file
36PGA Launcher View
37Constraints View
38Contig View
39Scaffold View