Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari
1
Parallel EST Clustering
by Kalyanaraman, Aluru, and Kothari
  • Nargess Memarsadeghi
  • CMSC 838 Presentation

2
Talk Overview
  • Overview of talk
  • Motivation
  • Background
  • Techniques
  • Evaluation
  • Related work
  • Observations

3
Motivation: EST Clustering
  • Problem: EST clustering
  • Cluster fragments of cDNA
  • Related to the fragment assembly problem
  • Requires detecting overlapping fragments
  • Overlaps can be computed exactly with a pairwise
    alignment algorithm (dynamic programming)
  • Alternative
  • Approximate overlap detection algorithms, with
    dynamic programming on the candidate pairs
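The dynamic programming just mentioned can be illustrated with a basic global-alignment scorer in the Needleman-Wunsch style; this is a generic sketch with illustrative scoring parameters, not the paper's implementation:

```python
def align_score(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score via dynamic programming:
    dp[i][j] = best score for aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap          # align s[:i] against all gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap          # align t[:j] against all gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

print(align_score("ACGT", "ACGT"))  # identical sequences: 4 matches -> 4
print(align_score("ACGT", "ACTT"))  # one substitution -> 2
```

The O(nm) table is exactly why later slides call pairwise alignment the expensive step, in both time and memory.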

4
Motivation
  • Common tools
  • Take too long
  • Days for 100,000 ESTs
  • Run out of memory
  • This paper
  • PaCE: Parallel Clustering of ESTs
  • Efficient parallel EST clustering
  • Space-efficient algorithm
  • Reduces total work
  • Reduces run-time

5
Background: EST Clustering Tools
  • Three traditional software packages
  • Originally designed for fragment assembly
  • TIGR Assembler
  • Phrap
  • CAP3
  • One parallel software package
  • UICLUSTER, which assumes ESTs are from the 3' end

6
EST Clustering Tools
  • Basic approach
  • Find pairs of similar sequences
  • Align similar pairs
  • Dynamic programming
  • Quality of EST clustering
  • Phrap: fastest
  • Avoids dynamic programming
  • Relies on approximation; lower quality
  • CAP3: fewest erroneous clusters

7
EST Clustering Tools: Performance
  • With 50,000 maize ESTs
  • Using a PC with dual 450 MHz Pentiums and 512 MB RAM
  • TIGR Assembler ran out of memory
  • Phrap: 40 min
  • CAP3: > 24 hours
  • With 100,000 maize ESTs
  • All ran out of memory
  • CAP3 would require 4 days

8
Goal
  • Space-efficient algorithm
  • Space requirement linear in the size of the input
    data set
  • Reduce total work
  • Without sacrificing quality of clustering
  • Reduce run-time and facilitate the clustering of
    large data sets
  • Through parallel processing
  • Scale memory with the number of processors

9
Approach
  • Expense
  • Pairwise alignment (time and memory)
  • Promising pairs
  • Share a common substring s of length ≥ w
  • Cost: if a common substring has length l > w, the
    pair is generated l - w + 1 times
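A small illustration of this cost (the function name is hypothetical): two sequences sharing a substring of length l contain l - w + 1 matching windows of length w, so naive window-based generation reports the same pair that many times.

```python
def shared_window_count(s, t, w):
    """Count pairs of length-w windows (one from s, one from t)
    that match exactly -- each match would regenerate the pair."""
    count = 0
    for i in range(len(s) - w + 1):
        for j in range(len(t) - w + 1):
            if s[i:i + w] == t[j:j + w]:
                count += 1
    return count

# Both sequences share the substring "ACGGTCA" (l = 7); with w = 5
# the pair is generated l - w + 1 = 3 times.
print(shared_window_count("TTACGGTCATT", "GGACGGTCAGG", 5))
```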

10
Approach (cont.)
  • Approach
  • Use a trie structure
  • Identify promising pairs
  • Merge clusters with strong overlaps
  • Avoid storing/testing all similar pairs
  • Parallel EST clustering software
  • Generalized Suffix Tree (GST)
  • Multiple processors
  • One maintains and updates EST clusters
  • Others generate batches of promising pairs and
    perform alignments

11
Approach (cont.)
12
Tries
  1. An index for each character
  2. N leaves
  3. Height N

13
Suffix Tries (cont.)
  1. TRIM suffix trie

14
Suffix Tries (cont.)
  1. Indices
  2. Storage O(n), though with a high constant
  3. Common string
  4. Longest common substring

15
Suffix Tries (cont.)
[Figure: a suffix trie with edges labeled a and b and
numbered leaves 1-5]
Given a pattern P = ab, we traverse the tree
according to the pattern.
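The traversal on this slide can be sketched with a minimal suffix trie; the class and function names are illustrative, not from the paper:

```python
class TrieNode:
    """A suffix-trie node: children keyed by character, plus the
    start positions of all suffixes passing through this node."""
    def __init__(self):
        self.children = {}
        self.suffixes = []

def build_suffix_trie(text):
    """Insert every suffix of `text`, one character per edge."""
    root = TrieNode()
    for start in range(len(text)):
        node = root
        node.suffixes.append(start)
        for ch in text[start:]:
            node = node.children.setdefault(ch, TrieNode())
            node.suffixes.append(start)
    return root

def find(root, pattern):
    """Walk the trie along `pattern`; return the start positions of
    suffixes beginning with the pattern, or [] if we fall off."""
    node = root
    for ch in pattern:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return node.suffixes

# Occurrences of "ab" in "abaab" begin at positions 0 and 3.
print(sorted(find(build_suffix_trie("abaab"), "ab")))
```

This also shows the high constant noted on the previous slide: every suffix stores one node per character, which is what the compacted (GST) representation later avoids.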
16
Parallel Generation of GST
  • GST: Generalized Suffix Tree
  • A compacted trie
  • Longest common prefix found in constant time
  • Used for on-demand pair generation
  • Sequential: O(nl)
  • Parallel: O(nl/p)

17
Parallel Generation of GST (cont.)
  • Previous implementations
  • CRCW/CREW PRAM model
  • Work-optimal
  • Involve alphabetical ordering of characters
  • Unrealistic assumptions
  • Synchronous operation of processors
  • Infinite network bandwidth
  • No memory contention
  • Not practically efficient

18
Parallel Generation of GST (cont.)
  • Paper's approach
  • ESTs equally distributed among processors
  • Each processor
  • Partitions the suffixes of its ESTs into buckets
  • Distributes the buckets to the processors
  • All suffixes in a bucket allocated to the same
    processor
  • Total number of suffixes allocated to a processor:
    O(nl/p)

19
Parallel Generation of GST (cont.)
  • Each bucket's processor
  • Computes a compacted trie of all its suffixes
  • Cannot use sequential construction
  • The suffixes of a string are not all in the same
    bucket
  • Each bucket
  • Forms a subtree of the GST
  • Nodes
  • Depth-first search traversal of the trie
  • Pointer to the rightmost child

20
On-demand Pair Generation
  • A pair should be generated if the two ESTs
  • Share a substring of length ≥ threshold
  • Maximal
  • Leaves under a common node
  • Share a substring of length ≥ the depth of the node
  • Parallel algorithm
  • Each processor works with its own trie if
  • The depth of its root in the GST < threshold

21
On-demand Pair Generation
  • To process
  • Sort internal nodes in decreasing order of depth
  • Lists of a node
  • Generated after the node is processed
  • Removed after its parent is processed
  • Limits space to O(nl)
  • Run time: number of pairs generated + cost of
    sorting
  • Rejected pairs increase run-time by a factor of 2
  • Eliminating duplicates reduces run-time
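A minimal sketch of this processing order (the data layout and names are my assumptions, not PaCE's actual code): visit internal nodes deepest-first and pair leaves drawn from different children, so each pair is emitted at its lowest common ancestor, with the longest shared-substring evidence, and duplicates at shallower ancestors are skipped.

```python
from itertools import combinations

def generate_promising_pairs(nodes, w):
    """nodes: (depth, leaf_lists) per internal node of a generalized
    suffix trie, where leaf_lists holds one list of EST ids per child.
    Pairs are generated at nodes of depth >= w, deepest node first."""
    pairs, seen = [], set()
    for depth, leaf_lists in sorted(nodes, key=lambda n: n[0], reverse=True):
        if depth < w:        # shared substring shorter than the threshold
            continue
        for left, right in combinations(leaf_lists, 2):
            for a in left:
                for b in right:
                    key = (min(a, b), max(a, b))
                    if key not in seen:   # eliminate duplicates
                        seen.add(key)
                        pairs.append(key)
    return pairs

# Toy GST skeleton: ESTs 1 and 2 meet at depth 6, {1, 2} meet 3 at
# depth 3, and the depth-1 node falls below the threshold w = 3.
nodes = [(1, [[1, 2, 3], [4]]), (6, [[1], [2]]), (3, [[1, 2], [3]])]
print(generate_promising_pairs(nodes, w=3))
```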

22
Parallel Clustering
  • Master-slave paradigm
  • Master processor
  • Maintains and updates clusters
  • Using a union-find data structure
  • Receives messages from slave processors
  • A batch of the next promising pairs generated by a
    slave
  • Results of the pairwise alignments
  • Determines which pairs to explore
  • Determines if merging should occur
  • Slave processors
  • Generate pairs on demand
  • Perform pairwise alignments of pairs dispatched
    by the master processor
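The master's cluster bookkeeping can be sketched with a standard union-find structure (a generic implementation, not PaCE's own code):

```python
class UnionFind:
    """Union-find over EST ids: clusters are disjoint sets, and
    merging two clusters after a successful alignment is one union."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Path compression: point nodes toward the root as we walk up.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

    def same_cluster(self, a, b):
        return self.find(a) == self.find(b)

uf = UnionFind(5)
uf.union(0, 1)   # alignment of ESTs 0 and 1 succeeded: merge clusters
uf.union(1, 2)
print(uf.same_cluster(0, 2), uf.same_cluster(0, 3))
```

A pair whose ESTs already share a cluster can be discarded without alignment, which is how the master "determines which pairs to explore" and reduces total work.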

23
Parallel Clustering (cont.)
Organization of the parallel clustering software
[Figure: master processor exchanging messages with
slave processors — each slave sends a batch of newly
generated promising pairs and the results of pairwise
alignment; the master dispatches batchsize or fewer
pairs for alignment]
24
Parallel Clustering (cont.)
  • To start
  • A slave processor generates 3 batchsize-sized
    batches of pairs
  • Sends the 3rd batch to the master processor
  • Starts alignment on the 1st batch
  • Sends the results of the 1st batch plus a newly
    generated batch
  • While waiting to receive results from the master
    processor, aligns the 2nd batch
  • A processor always has the next batch to work on
    between
  • Submitting the results of the previous batch
  • Receiving another set of pairs

25
Parallel Clustering (cont.)
  • Improve and control quality
  • Parameters
  • Match and mismatch scores
  • Gap penalties
  • Post-processing
  • Detection of alternative splicing
  • Consulting protein databases
  • Organism-specific

26
Experimental Environment
  • Used C and MPI
  • Tested
  • Quality of software
  • Arabidopsis thaliana (due to availability of its
    genome)
  • Run-time behavior
  • 50,000 maize ESTs on a 32-processor IBM SP
  • Number of processors
  • Data size
  • Number of promising pairs vs. data size
  • Batchsize vs. number of processors
  • Number of clusters
  • Master processor's time

27
Quality Assessment
  • To assess quality, need
  • A data set and its correct clustering
  • ESTs from the plant Arabidopsis thaliana
  • Splice program
  • Aligns ESTs to the genome
  • Discard ESTs that
  • Don't align
  • Align in multiple spots

28
Quality Assessment (cont.)
  • False negatives
  • A pair in the correct clustering is not paired in
    the output
  • 5%
  • False positives
  • A pair not in the correct clustering appears in
    the results
  • Negligible (< 0.04%)
  • Due to the conservative nature of the algorithm

29
Quality Assessment
Cluster results   Number of singleton clusters   Number of non-singleton clusters
Benchmark         10,803                         18,727
CAP3              17,930                         17,556
PaCE              14,802                         19,536

Distribution of the number of singleton and
non-singleton clusters for the benchmark set of
168,200 Arabidopsis ESTs.
30
Quality Assessment (cont.)
31
Run-time Assessment
  • Experiment with 50,000 maize ESTs
  • 32-processor IBM SP-2
  • 16 minutes

32
Run-time Assessment (cont.)

p    Preprocessing   Clustering   Total
4    273             102          375
8    119              50          169
16    61              26           87
32    38              15           53
64    29              10           39

Run-time (in seconds) spent in various components
of PaCE for 20,000 ESTs; p = number of processors.
33
Run-time Assessment (cont.)
  • Run-time as a function of batchsize
  • Small batchsize
  • Increase in communication overhead
  • Large batchsize
  • Slaves less responsive to the need to generate
    pairs
  • A slave does not use the latest clustering results
  • Optimal batchsize
  • Determined by experiment
  • Master processor's time
  • Fixed batchsize, increasing number of processors
  • Gradual increase in the master processor's time
  • With 32 processors, increase < 1
  • Using 1 master processor is not a bottleneck

34
Results
  • Space: linear in the size of the input data set
  • Reduced total work without sacrificing quality
  • Reduced run-time
  • Parallel processing
  • Eliminating duplicate pairs
  • Facilitates clustering of large data sets
  • Scales memory with the number of processors

35
Observations
  • PaCE approaches the EST clustering problem
    directly
  • Better than
  • CAP3
  • Phrap
  • TIGR Assembler
  • Compare time/quality with
  • TGICL (TIGR Gene Indices clustering tool)
  • Support for PVM
  • MegaBlast
  • STACK
  • Large data sets
  • Lots of processors
  • Can the clustering time be improved?
  • Clustering algorithm

36
References
  • http://www.cs.berkeley.edu/kubitron/courses/cs258
    -S02/lectures/eval10-logp.pdf
  • A. Apostolico, C. Iliopoulos, G. M. Landau, B.
    Schieber, and U. Vishkin. Parallel construction
    of a suffix tree with applications. Algorithmica,
    3:347-365, 1988.