Paracel TranscriptAssembler - PowerPoint PPT Presentation

1
  • Paracel TranscriptAssembler
  • Paracel GenomeAssembler
  • Pavel Morozov
  • 9/17/2002

2
Paracel TranscriptAssembler (PTA 2.6.2) Overview
  • Key features and advantages
  • Architecture
  • Using PTA
  • Scientific problems for PTA

3
Paracel TranscriptAssembler (PTA 2.6.2) features
  • High-capacity solution for EST-based transcript
    reconstruction in a multi-processor environment
  • Complete pipeline for sequence cleaning,
    clustering, and assembly
  • Removal of chimeric sequences
  • Detection, alignment and visualization of
    alternative splice forms.
  • Visualization through intuitive graphical
    interfaces

4
PTA advantages
  • EST specific assembly engine
  • Reduces the number of falsely joined huge clusters
    through data cleaning and chimera detection
  • Specifically tuned for detecting and aligning
    alternative splice forms
  • Multi-processor capabilities
  • Integrated XML format

5
TranscriptAssembler Architecture
Input ESTs
Consensus Sequences
Convert
Clean up
Clustering
Assembly
Seeded Clustering
Assembled Clusters
TranscriptView
6
Conversion of formats
  • Directories or files containing sequences in
    these formats
  • phd files
  • Fasta
  • Genbank
  • EMBL
  • CAML (XML)
  • Obtains information about clone name,
    orientation, tissue type, etc.

7
Clean up
  • Quality of the clustering and assembly results
    depends on the quality of filtering and masking
  • Removes contaminants, e.g. E. coli, vector
  • Removes low-quality sequences
  • Non-destructive masking of repeats, polyA tails
    and low-complexity regions, so that such regions
    are kept in the final assembly
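
The idea of non-destructive masking can be illustrated with a minimal Python sketch. It assumes lowercase "soft masking" of a trailing polyA run, which is a common convention and not necessarily what PTA does internally; the function names and the `min_run` parameter are hypothetical.

```python
def soft_mask_polya_tail(seq: str, min_run: int = 10) -> str:
    """Lowercase a trailing run of A's instead of trimming it.

    The masked bases stay in the read, so they can still appear in the
    final assembly; a comparison step can simply skip lowercase bases.
    min_run is an illustrative threshold, not a PTA parameter.
    """
    stripped = seq.rstrip("Aa")
    tail_len = len(seq) - len(stripped)
    if tail_len < min_run:
        return seq  # tail too short to be treated as a polyA tail
    return stripped + seq[len(stripped):].lower()


def unmasked_length(seq: str) -> int:
    """Count the bases that remain visible (uppercase) after masking."""
    return sum(1 for base in seq if base.isupper())


read = "ACGTGCTAGCTAGGCT" + "A" * 12
masked = soft_mask_polya_tail(read)
print(masked)
print(len(masked) == len(read), unmasked_length(masked))
```

The read keeps its full length after masking, which is the difference from destructive trimming: only the 16 leading bases would take part in a comparison, but all 28 are still available to the assembler.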

8
Seeded clustering
  • Optional tool if full-length mRNA is available
  • ESTs with similarity above a threshold are
    clustered with a seed
  • Three strategies
  • Assemble only novel genes (default)
  • Division of labor
  • Assemble only the genes of interest
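
A toy Python sketch of threshold-based seeded clustering follows. The similarity measure (fraction of shared k-mers), the threshold value, and all names are assumptions made for illustration; the slides do not say how PTA scores an EST against a seed.

```python
def kmers(seq, k=8):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(est, seed, k=8):
    """Fraction of the EST's k-mers that also occur in the seed."""
    est_kmers = kmers(est, k)
    if not est_kmers:
        return 0.0
    return len(est_kmers & kmers(seed, k)) / len(est_kmers)


def seeded_clustering(ests, seeds, threshold=0.8, k=8):
    """Attach each EST to its best-matching seed if it passes the threshold.

    Returns (clusters, unassigned). ESTs that match no seed are returned
    separately; they are the candidates for novel genes.
    """
    clusters = {name: [] for name in seeds}
    unassigned = []
    for est_name, est_seq in ests.items():
        best_seed, best_score = None, 0.0
        for seed_name, seed_seq in seeds.items():
            score = similarity(est_seq, seed_seq, k)
            if score > best_score:
                best_seed, best_score = seed_name, score
        if best_seed is not None and best_score >= threshold:
            clusters[best_seed].append(est_name)
        else:
            unassigned.append(est_name)
    return clusters, unassigned


seed_seq = "ATGGCCAAGTTGACCGATCCGTTAGCATCGGATTACGCTA"
seeds = {"mRNA1": seed_seq}
ests = {
    "est1": seed_seq[5:30],                    # fragment of the seed
    "est2": "TTTTTTTTCCCCCCCCGGGGGGGGTTTT",    # unrelated sequence
}
clusters, unassigned = seeded_clustering(ests, seeds)
print(clusters, unassigned)
```

Under one reading of the strategies listed above, the `unassigned` list is what the default "novel genes" strategy would pass on to assembly, while the seed clusters are what a "genes of interest" run would keep.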

9
Clustering
  • All-versus-all comparison
  • Utilizes Haste for fast BLAST/FASTA-like
    comparison
  • Uses loose clustering philosophy
  • Chimera detection algorithm
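
One way to picture all-versus-all comparison with loose clustering is the Python sketch below. It assumes "loose" means single-linkage (one sufficient pairwise match is enough to join two groups) and uses a toy shared-k-mer score in place of Haste; both are assumptions, and every name here is hypothetical.

```python
from itertools import combinations


def shared_kmer_fraction(a, b, k=8):
    """Toy similarity: shared k-mers relative to the smaller k-mer set."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def loose_clusters(seqs, threshold=0.3, k=8):
    """Single-linkage clustering over all pairs, using union-find."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):  # all-versus-all comparison
        if shared_kmer_fraction(seqs[a], seqs[b], k) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


transcript = "ATGGCTAGCTTGACCGATCCGTTAGCATCGGATTACGCTAGGCTTAACCGGTATCGATGC"
seqs = {
    "est_a": transcript[0:30],
    "est_b": transcript[15:45],   # overlaps est_a and est_c
    "est_c": transcript[30:60],
    "est_d": "CACACACACACACACACACACACACACACA",
}
result = loose_clusters(seqs)
print(result)
```

Here `est_a` and `est_c` do not overlap directly but end up in the same cluster through `est_b`. The same transitivity is why a single chimeric read can falsely join two unrelated clusters, which is the motivation for the chimera detection step.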

10
Assembly in PTA
  • Highly optimized version for EST assembly
  • Utilizes quality values (supports mixed input)
  • Splice variants divided into individual
    contigs/singlets
  • Detects chimeric sequences and bad ends

11
Splice Variant detection
  • Alternate transcript constructs generated during
    assembly are exhaustively compared
  • Determination of multiple genes per cluster
  • Relative exon layout displayed in SpliceView

12
Splice Variant Alignment View
13
Output Files
  • Cleanup statistics: PROJECT.DATA/.scylla.stat
  • Clustering statistics: PROJECT.clusters.stat
  • Singlets from pairwise clustering:
    PROJECT.singlets.caml
  • Clusters that sequences went into:
    PROJECT.clusters.info
  • FASTA contigs and singlets from Assembly:
    PROJECT.clusters.contigs, PROJECT.clusters.singlets

14
Output Files cont.
  • Assembled CAML files are stored in PROJECT.CL/
  • Numbered subdirectories (100 clusters each)
  • .ace files for use with consed
  • .assem.caml.gz files for use with AssemblyView
  • View assemblies with AssemblyView
  • Cross platform Java application
  • View gene transcripts and alternate splice forms
  • Edit bases
  • View chromatograms
  • Export consensus sequences

15
Clustering statistics
  • ----------------------------------------------------------
  • --- Total no. of sequences from prev. run   0
  • --- Total no. of current input sequences   3955807
  • --- Total no. of sequences compared   907431
  • --- Total no. of singlets after pairwise compared   324332
  • --- Total no. of problem sequences   8203
  • --- Total no. sequences in clusters   574896
  • --- Total no. of seed clusters   20278
  • --- Total no. of clusters   89402
  • 02-08-2002 054414 PTA End.
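
The figures in this report can be sanity-checked with a short Python sketch. The exact layout of the statistics file is not recoverable from the slide, so the "label, whitespace, number" format parsed here is an assumption, and `parse_stats` is a hypothetical helper.

```python
import re

# Counts copied from the slide; the column layout is assumed.
STAT_LINES = """\
--- Total no. of sequences from prev. run          0
--- Total no. of current input sequences           3955807
--- Total no. of sequences compared                907431
--- Total no. of singlets after pairwise compared  324332
--- Total no. of problem sequences                 8203
--- Total no. sequences in clusters                574896
--- Total no. of seed clusters                     20278
--- Total no. of clusters                          89402
"""


def parse_stats(text):
    """Map each '--- label   number' line to an integer count."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"^---\s+(.*?)\s+(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


stats = parse_stats(STAT_LINES)
accounted_for = (
    stats["Total no. of singlets after pairwise compared"]
    + stats["Total no. of problem sequences"]
    + stats["Total no. sequences in clusters"]
)
print(accounted_for == stats["Total no. of sequences compared"])
```

For these numbers the singlets, problem sequences, and clustered sequences add up exactly to the 907431 sequences compared. That is an observation about this one report, not a documented guarantee of the tool.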

16
Splice Variant Alignment View
17
Contig View
18
Transcript View
19
Improving Results
  • Improve Cleanup parameters
  • Avoid over-clustering due to false positives
  • Trim appropriate sequencing artifacts
  • Choose appropriate search algorithm, thresholds,
    and matrix stringency
  • Rerun clusters that timed out
  • Recluster clusters that are too large with a higher
    threshold; perform multiple iterations

20
Scientific problems for PTA
  • Proteomics
  • Gene discovery
  • Verify gene predictions for genome assembly
  • Detecting splice variants
  • Patterns of expression, tissue specificity
  • SNP detection
  • Combinations of all the above...

21
Paracel GenomeAssembler
  • All-in-one pipeline
  • Base calling with TraceTuner
  • Filtering and masking
  • Assembly
  • Easy to use graphical interface and editing
    tools
  • Genomic Assemblies
  • Constraints
  • Troubleshooting
  • Finishing

22
PGA Advantages
  • Genome specific assembly engine
  • Longer assemblies
  • Fewer misassemblies
  • Most accurate consensus calls
  • Use of clone pair constraints
  • Resolve repeats
  • Produce scaffolds
  • Parallel processing for SMP systems
  • User friendly graphic interface

23
PGA architecture.
Input Files
Vector and E. coli Databases
Parameters
Base Calling w. Quality Values
Vector and Contaminant Screening
Read-pair Constraints Generation
Assembly
User interaction Processing step
Screening results PFPView
ScaffoldView ContigView
Output Files
24
Basecalling in PGA - Optional
  • Using Paracel's TraceTuner technology
  • Advanced peak processing (important at the ends
    of reads)
  • More accurate quality value assignments
  • Dye blob processing
  • Heterozygote detection
  • Calibration for sequencers
  • ABI 377 DNA sequencer (DP, DT)
  • ABI 3100 genetic analyzer
    (Pop6-BD1,2)
  • ABI 3700 DNA analyzer (Pop5-BD1,2,
    Pop6-BD1,4)
  • Standard calibrations used for other
    major sequencers

25
TraceTuner Heterozygote detection.
  • Calibrations for
  • ABI 377 DNA sequencer (DP, DT)
  • ABI 3100 genetic analyzer (Pop6-BD1,2)
  • ABI 3700 DNA analyzer (Pop5-BD1,2, Pop6-BD1,4)
  • Standard calibrations used for other major
    sequencers

26
TraceTuner Dye Blob processing
27
  • Clean up Stage
  • Annotates low-complexity regions with DUST
  • Masks vector contaminants
  • Filters sequences with E. coli contamination
  • Constraint Generation
  • Clone pair information
  • Minimum clone length boundary: mindist
  • Maximum clone length boundary: maxdist
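
How a clone-pair constraint might be checked against a candidate layout is sketched below in Python. The names `mindist` and `maxdist` come from the slide; their exact semantics in PGA are assumed here (bounds on the distance between the outer ends of the two reads), and `ReadPlacement` is a hypothetical type.

```python
from dataclasses import dataclass


@dataclass
class ReadPlacement:
    start: int      # leftmost coordinate of the read in the contig
    end: int        # rightmost coordinate (exclusive)
    forward: bool   # True if the read lies on the forward strand


def pair_satisfies_constraint(fwd: ReadPlacement, rev: ReadPlacement,
                              mindist: int, maxdist: int) -> bool:
    """Check orientation and spacing of a forward-reverse read pair.

    The two reads must face each other, with the forward read to the
    left, and the implied clone length must lie in [mindist, maxdist].
    """
    if not fwd.forward or rev.forward:
        return False                  # wrong orientation
    if fwd.start > rev.start:
        return False                  # reads point away from each other
    clone_length = rev.end - fwd.start
    return mindist <= clone_length <= maxdist


fwd = ReadPlacement(start=100, end=600, forward=True)
good_rev = ReadPlacement(start=2400, end=2900, forward=False)
far_rev = ReadPlacement(start=5400, end=5900, forward=False)

ok = pair_satisfies_constraint(fwd, good_rev, mindist=2000, maxdist=4000)
too_far = pair_satisfies_constraint(fwd, far_rev, mindist=2000, maxdist=4000)
print(ok, too_far)
```

The first pair implies a clone of 2800 bases and is accepted; the second implies 5800 bases and is rejected, which is the kind of signal an assembler can use to reject a layout.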

28
Double-ended reads resolve repeats
  • A key feature in PGA: forward-reverse clone
    constraints
  • Models double-ended sequencing reads
  • One read is anchored in unique sequence
  • Distinguishes which repeat instance the other
    read lies in

  • Assembly WITH forward-reverse constraints (figure
    labels: Repeat, Repeat)
  • Assembly WITHOUT forward-reverse constraints (figure
    labels: Repeat; misassembled fragment leaves a
    singleton)
29
PGA generates scaffolds
Contigs
Scaffold of ordered, oriented contigs
  • PGA's forward-reverse constraints help generate a
    scaffold
  • Contigs are oriented and ordered
  • Gaps are bounded and may be relatively small (or
    closed)
  • Much more informative result than unordered
    contigs, useful in finishing and for low-pass
    sequencing
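
The statement that gaps are bounded can be made concrete with a small Python sketch. It assumes a clone pair whose forward read lies in one contig and whose reverse read lies in the next, with the clone length known only to lie between `mindist` and `maxdist`; the function names and the coordinate convention are hypothetical, not PGA's.

```python
def gap_bounds(contig_a_len, fwd_start_in_a, rev_end_in_b, mindist, maxdist):
    """Bound the gap between contig A and contig B from one clone pair.

    clone length = (bases from the forward read's start to the end of A)
                 + gap
                 + (bases from the start of B to the reverse read's end)
    A negative bound would suggest the contigs overlap.
    """
    covered = (contig_a_len - fwd_start_in_a) + rev_end_in_b
    return mindist - covered, maxdist - covered


def combine_bounds(bounds):
    """Intersect the gap intervals implied by several clone pairs."""
    lows, highs = zip(*bounds)
    return max(lows), min(highs)


pair1 = gap_bounds(10000, 9200, 700, mindist=2000, maxdist=4000)
pair2 = gap_bounds(10000, 9500, 1200, mindist=2000, maxdist=4000)
combined = combine_bounds([pair1, pair2])
print(pair1, pair2, combined)
```

Each additional spanning pair can only narrow the interval, so contigs linked by several pairs end up with a tighter gap estimate than the clone size range alone would give.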

30
Running PGA
  • PGA Launcher
  • Interactive Mode
  • Command Line

31
PGA Launcher
32
Interactive Mode
  • Enter the command pga
  • Displays introductory message and instructions
  • Answer questions
  • Project name
  • Data location
  • Constraint information
  • Filtering
  • Processors

33
Command line
  • Advanced
  • Recommend running default first
  • pga <input data> <options>
  • Command line options: pga h (Chapter 4)
  • Param file: prm <file name> (Chapter 9)
  • pga v gives version

34
Paracel GenomeAssembler produces
  • FASTA file of contig and singlet sequences
  • Statistical files: contig sizes, number and
    identity
  • Statistical information for cleanup stage
  • Linking information between contigs used to
    generate scaffolds

35
End of PGA log file
36
PGA Launcher View
37
Constraints View
38
Contig View
39
Scaffold View