Novel Peptide Identification using ESTs and Sequence Database Compression

Transcript and Presenter's Notes

1
Novel Peptide Identification using ESTs and
Sequence Database Compression
  • Nathan Edwards
  • Center for Bioinformatics and Computational
    Biology
  • University of Maryland, College Park

2
What is missing from protein sequence databases?
  • Known coding SNPs
  • Novel coding mutations
  • Alternative splicing isoforms
  • Alternative translation start-sites
  • Microexons
  • Alternative translation frames

3
Why don't we see more novel peptides?
  • Tandem mass spectrometry doesn't discriminate
    against novel peptides... but protein
    sequence databases do!
  • Searching traditional protein sequence databases
    biases the results towards well-understood
    protein isoforms!

4
Novel Splice Isoform
5
Novel Splice Isoform
6
Novel Mutation
Ala2→Pro associated with familial amyloid
polyneuropathy
7
Novel Mutation
8
Searching ESTs
  • Proposed long ago
      • Yates, Eng, and McCormack, Anal Chem, '95.
  • Now
      • Protein sequences are sufficient for protein
        identification
      • Computationally expensive/infeasible
      • Difficult to interpret
  • Make EST searching feasible for routine searching
    to discover novel peptides.

9
Searching Expressed Sequence Tags (ESTs)
  • Pros
      • No introns!
      • Primary splicing evidence for annotation
        pipelines
      • Evidence for dbSNP
      • Often derived from clinical cancer samples
  • Cons
      • No frame
      • Large (8Gb)
      • Untrusted by annotation pipelines
      • Highly redundant
      • Nucleotide error rate ~1%

10
Other Search Strategies
  • Genome Corrected ESTs
      • Large (2Gb)
      • Controls for nucleotide error rate
      • Polymorphism lost, potential errors introduced
  • Genome Clustered ESTs
      • Small, Gene model
      • Convergence to well-understood isoforms
      • Controls nucleotide error rate
  • Full-Length mRNAs
      • Incomplete gene coverage, most are already in
        IPI

11
Other Search Strategies
  • Genome
      • Large (6Gb), lots of non-coding DNA
      • Find novel ORFs, no sampling bias
      • Miss spliced peptide sequences.
  • Genscan Exons
      • Small, find novel ORFs.
      • Miss spliced peptide sequences.
  • How should we interpret peptide identifications
    with no mRNA evidence?

12
Compressed EST Peptide Sequence Database
  • For all ESTs mapped to a UniGene gene
      • Six-frame translation
      • Eliminate ORFs < 30 amino-acids
      • Eliminate amino-acid 30-mers observed once
      • Compress to C2 FASTA database
          • Complete, Correct for amino-acid 30-mers
  • Gene-centric peptide sequence database
      • Size: < 3% of naïve enumeration, 20774 FASTA
        entries
      • Running time: ~1% of naïve enumeration search
      • E-values: ~2% of naïve enumeration search results
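
The first three steps above (six-frame translation, dropping short ORFs, dropping 30-mers seen only once) can be sketched in Python. This is a minimal illustration, not the pipeline used in the talk. It assumes the standard genetic code, treats an "ORF" as any stop-free stretch of a translated frame, and uses hypothetical function names (`translate`, `six_frame_orfs`, `repeated_kmers`).

```python
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order (assumption: NCBI table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + m]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for m, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; codons with unknown bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna, min_len=30):
    """Yield stop-free stretches of at least min_len residues from all six frames."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse_complement):
        for frame in range(3):
            for orf in translate(strand[frame:]).split("*"):
                if len(orf) >= min_len:
                    yield orf


def repeated_kmers(ests, k=30, min_count=2):
    """Return the amino-acid k-mers observed at least min_count times."""
    counts = Counter()
    for est in ests:
        for orf in six_frame_orfs(est, min_len=k):
            counts.update(orf[i:i + k] for i in range(len(orf) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= min_count}
```

Calling `repeated_kmers(ests)` with the defaults keeps only 30-mers supported by at least two observations, which is the set the C2 compression step would then have to represent.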

14
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
15
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
16
Sequence Databases & CSBH-graphs
  • Original sequences correspond to paths

ACDEFGI, ACDEFACG, DEFGEFGI
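
To illustrate how sequences correspond to paths, the sketch below builds an uncompressed k-mer graph for the three example sequences and checks whether a sequence can be traced through it. It assumes the usual de Bruijn-style construction (nodes are (k-1)-mers, edges are k-mers) and k = 4; the slides do not state k, and the function names are hypothetical.

```python
def kmer_edges(seq, k):
    """Return the walk of seq as (from_node, to_node) pairs of (k-1)-mers."""
    return [(seq[i:i + k - 1], seq[i + 1:i + k])
            for i in range(len(seq) - k + 1)]


def build_graph(seqs, k):
    """Map each node to the set of nodes reachable by one k-mer edge."""
    graph = {}
    for seq in seqs:
        for u, v in kmer_edges(seq, k):
            graph.setdefault(u, set()).add(v)
    return graph


def is_path(graph, seq, k):
    """True if every k-mer of seq is an edge of the graph."""
    return all(v in graph.get(u, ()) for u, v in kmer_edges(seq, k))


seqs = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
graph = build_graph(seqs, 4)
```

Every original sequence is a path in `graph`, but so are other strings such as ACDEFGEFGI, which is what makes a more compact enumeration of the same k-mers possible.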
17
Sequence Databases & CSBH-graphs
  • All k-mers represented by an edge have the same
    count

[Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2]
18
CSBH-graphs
  • Quickly determine which k-mers occur at least
    twice

[Figure: CSBH-graph with edge counts 2, 2, 1, 2]
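
The edge counts can be reproduced with a small sketch that merges chains of trivial nodes (in-degree 1 and out-degree 1) into single edges and records the k-mer counts along each merged edge. The choice k = 4, the definition of "trivial", and the name `compressed_edges` are assumptions made for illustration. Cycles made up entirely of trivial nodes are not handled.

```python
from collections import Counter, defaultdict


def compressed_edges(seqs, k):
    """Return {edge_label: sorted distinct k-mer counts along that edge}."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    succ, indeg = defaultdict(list), Counter()
    for kmer in sorted(counts):
        succ[kmer[:-1]].append(kmer[1:])
        indeg[kmer[1:]] += 1

    def trivial(node):
        # A trivial node has exactly one way in and one way out.
        return indeg[node] == 1 and len(succ.get(node, ())) == 1

    edges = {}
    for u in sorted(set(succ) | set(indeg)):
        if trivial(u):
            continue
        for v in succ.get(u, ()):
            label = u + v[-1]
            while trivial(v):           # extend through the unbranched chain
                v = succ[v][0]
                label += v[-1]
            kmers = [label[i:i + k] for i in range(len(label) - k + 1)]
            edges[label] = sorted({counts[x] for x in kmers})
    return edges


edges = compressed_edges(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4)
repeated = sorted(label for label, c in edges.items() if min(c) >= 2)
```

For the example sequences each edge carries a single count, so the k-mers occurring at least twice can be read off per edge rather than per k-mer. With this simple definition of trivial nodes that is not guaranteed in general; sequence ends may also need to be treated as non-trivial.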
19
de Bruijn Sequences
  • de Bruijn sequences represent all words of length
    k from some alphabet A.
  • A = {0,1}, k = 3: s = 0001110100
  • A = {0,1}, k = 4: s = 0000111101011001000
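
The two example strings can be checked directly. The sketch below tests the linear form of the property, assuming it means that every word of length k over the alphabet occurs exactly once as a substring; `is_de_bruijn` is a hypothetical helper name.

```python
from itertools import product


def is_de_bruijn(s, alphabet, k):
    """True if every length-k word over alphabet occurs exactly once in s."""
    words = [s[i:i + k] for i in range(len(s) - k + 1)]
    expected = {"".join(p) for p in product(alphabet, repeat=k)}
    return len(words) == len(expected) and set(words) == expected


ok_k3 = is_de_bruijn("0001110100", "01", 3)
ok_k4 = is_de_bruijn("0000111101011001000", "01", 4)
```

Both strings pass: they have length |A|^k + k - 1 (10 and 19), the shortest possible for a single string containing every k-mer.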

20
de Bruijn Graph: A = {0,1}, k = 4
[Figure: de Bruijn graph with 16 edges, each labeled 0 or 1]
21
Correct, Complete, Compact (C3) Enumeration
  • Set of paths that use each edge exactly once

ACDEFGEFGI, DEFACG
22
Correct, Complete (C2) Enumeration
  • Set of paths that use each edge at least once

ACDEFGEFGI, DEFACG
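
A sketch of how the three properties might be checked for a candidate enumeration. It works at the level of k-mers rather than compressed edges and reads "compact" as "no k-mer appears more than once"; both are assumptions, as are k = 4 and the function names.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every k-mer occurrence across a list of sequences."""
    return Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))


def check_enumeration(original, enumerated, k):
    """Compare the k-mer content of an enumeration against its source."""
    src, out = kmer_counts(original, k), kmer_counts(enumerated, k)
    return {
        "complete": set(src) <= set(out),   # no source k-mer is lost
        "correct": set(out) <= set(src),    # no new k-mer is introduced
        "compact": all(n == 1 for n in out.values()),
    }


original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
result = check_enumeration(original, ["ACDEFGEFGI", "DEFACG"], 4)
```

Under these assumptions the enumeration ACDEFGEFGI, DEFACG satisfies all three properties while using 16 residues instead of the 23 in the original sequences.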
23
Patching the CSBH-graph
  • Use artificial edges to fix unbalanced nodes
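
Which nodes are unbalanced can be computed from in-degree minus out-degree. The sketch below does this on the uncompressed k-mer graph of the running example, assuming k = 4; `node_imbalance` is a hypothetical name.

```python
from collections import Counter


def node_imbalance(seqs, k):
    """Return {node: in_degree - out_degree} for every unbalanced node."""
    kmers = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    balance = Counter()
    for kmer in kmers:
        balance[kmer[1:]] += 1     # edge enters its (k-1)-suffix
        balance[kmer[:-1]] -= 1    # edge leaves its (k-1)-prefix
    return {node: b for node, b in balance.items() if b != 0}


imbalance = node_imbalance(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4)
needed = sum(b for b in imbalance.values() if b > 0)
```

Here ACD and DEF each have one surplus outgoing edge and FGI and ACG each have one surplus incoming edge, so two artificial edges (equivalently, two paths) are needed.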

24
Patching the CSBH-graph
  • Use matching-style formulations to choose
    artificial edges
  • Optimal C2/C3 enumeration in polynomial time.
  • Chinese Postman Problem
      • Edmonds and Johnson, '73
  • l-tuple DNA sequencing
      • Pevzner, '89
  • Shortest (Common) Superstring
      • MAX-SNP-hard, 2.5 approx algorithm

25
C3 Enumeration
[Figure: nodes labeled in-out; artificial edge labeled "Cost k"]
26
C3 Enumeration
[Figure: nodes labeled in-out; artificial edges labeled "Cost 0", "Cost 0", "Cost k"]
27
Reusing Edges
  • ACDEHAC, ACDFHAC, ACDGHACD

28
Reusing Edges
  • C3: ACDEHACDFHAC, ACDGHACD

29
Reusing Edges
  • C2: ACDEHACDFHACDGHAC

30
C2 Enumeration
[Figure: nodes labeled in-out; shortcut paths; edge labels 4, 10, 7]
31
Implementation
  • CSBH-graph construction
      • Determine non-trivial nodes directly
      • Consecutive non-trivial nodes determine edges
  • C3/C2 enumeration
      • C3: Trivial assignment of artificial edges
      • C2: Depth-first search, Goldberg's CS2
        min cost flow code
      • Eulerian path algorithm
  • Can be applied to entire EST database
      • Condor grid and PBS cluster for CSBH-graph
        construction
      • Large memory machine for C3/C2 enumeration
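
The C3 steps listed above (artificial edges, then an Eulerian path algorithm) can be sketched end to end on a small example. This is an illustration under stated assumptions, not the implementation described in the talk: it works on the uncompressed k-mer graph, pairs unbalanced nodes arbitrarily, uses Hierholzer's algorithm for the Eulerian circuit, and the name `c3_paths` is hypothetical.

```python
from collections import defaultdict

ARTIFICIAL = object()  # label for edges added only to balance the graph


def c3_paths(seqs, k):
    """Return paths that together use every distinct k-mer exactly once."""
    kmers = sorted({s[i:i + k] for s in seqs for i in range(len(s) - k + 1)})
    out_edges = defaultdict(list)   # node -> [(next_node, label)]
    balance = defaultdict(int)      # in-degree minus out-degree
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].append((v, kmer[-1]))
        balance[u] -= 1
        balance[v] += 1
    # Arbitrary pairing: surplus-in nodes get an artificial edge to surplus-out nodes.
    tails = [n for n in sorted(balance) for _ in range(max(balance[n], 0))]
    heads = [n for n in sorted(balance) for _ in range(max(-balance[n], 0))]
    for u, v in zip(tails, heads):
        out_edges[u].append((v, ARTIFICIAL))

    paths = []
    for start in sorted(out_edges):
        if not out_edges[start]:
            continue                # this component was already consumed
        # Hierholzer's algorithm; each stack entry remembers the edge that led to it.
        stack, circuit = [(start, None)], []
        while stack:
            node, edge = stack[-1]
            if out_edges[node]:
                nxt, label = out_edges[node].pop()
                stack.append((nxt, (node, label)))
            else:
                stack.pop()
                if edge is not None:
                    circuit.append(edge)
        circuit.reverse()
        # Rotate so the circuit ends on an artificial edge, then cut at each one.
        cuts = [i for i, (_, lab) in enumerate(circuit) if lab is ARTIFICIAL]
        if cuts:
            circuit = circuit[cuts[0] + 1:] + circuit[:cuts[0] + 1]
        current = ""
        for node, label in circuit:
            if label is ARTIFICIAL:
                if current:
                    paths.append(current)
                current = ""
            else:
                current = (current or node) + label
        if current:
            paths.append(current)   # component with no artificial edges
    return paths


paths = c3_paths(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], 4)
```

On the running example with k = 4 this produces two paths covering the ten distinct 4-mers once each, 16 residues in total. A C2 enumeration would additionally allow edges to be reused where that shortens the output, which needs the min cost flow step and is not attempted here.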

32
Conclusions
  • Peptides identify more than just proteins
  • Compressed peptide sequence databases make
    routine EST searching feasible
      • Currently available for download
      • Can include other sources of peptide sequence at
        little additional cost.
  • CSBH-graph + edge counts + C2/C3 enumeration
    algorithms
      • Minimal FASTA representation of k-mer sets

33
Acknowledgements
  • Chau-Wen Tseng, Xue Wu
      • UMCP Computer Science
  • Catherine Fenselau, Crystal Harvey
      • UMCP Biochemistry
  • Calibrant Biosystems
  • PeptideAtlas, HUPO PPP, X!Tandem
  • Funding: National Cancer Institute