Title: Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari
1 Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari
- Nargess Memarsadeghi
- CMSC 838 Presentation
2 Talk Overview
- Overview of talk
- Motivation
- Background
- Techniques
- Evaluation
- Related work
- Observations
3 Motivation: EST Clustering
- Problem: EST clustering
- Cluster fragments of cDNA
- Related to fragment assembly problem
- Detecting overlapping fragments
- Overlaps can be computed
- Pairwise alignment algorithm
- Dynamic programming
- Alternative
- Approximate overlap detection algorithms
- Dynamic programming
4 Motivation
- Common tools
- Take too long
- Days for 100,000 ESTs
- Run out of memory
- This paper
- PaCE
- Parallel Clustering of ESTs
- Efficient parallel EST Clustering
- Space efficient algorithm
- Reduce total work
- Reduce run-time
5 Background: EST Clustering Tools
- Three traditional software tools
- Originally designed for fragment assembly
- TIGR Assembler
- Phrap
- CAP3
- One parallel software tool
- UICLUSTER: assumes ESTs are from the 3' end
6 EST Clustering Tools
- Basic approach
- Find pairs of similar sequences
- Align similar pairs
- Dynamic programming
- Quality of EST clustering
- Phrap: fastest
- Avoids dynamic programming
- Relies on approximation, lower quality
- CAP: fewest erroneous clusters
7 EST Clustering Tools Performance
- With 50,000 maize ESTs
- Using a PC with dual 450 MHz Pentium processors and 512 MB RAM
- TIGR Assembler: ran out of memory
- Phrap: 40 min
- CAP: > 24 hours
- With 100,000 maize ESTs
- All ran out of memory
- CAP would require 4 days
8 Goal
- Space efficient algorithm
- Space requirement linear in the size of the input data set
- Reduce total work
- Without sacrificing quality of clustering
- Reduce run-time and facilitate the clustering of large data sets
- Through parallel processing
- Scale memory with number of processors
9 Approach
- Expense
- Pairwise alignment (time and memory)
- Promising pairs
- Share a common substring s of length at least w
- Cost: if the common substring has length l > w, the pair is generated l - w + 1 times
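The l - w + 1 factor can be checked by enumerating every length-w window inside a shared substring of length l. The sketch below is only an illustration: the function names `count_windows` and `windows` and the sample string are assumptions, not part of PaCE.

```python
def count_windows(common: str, w: int) -> int:
    """Number of length-w substrings (windows) contained in `common`."""
    if w <= 0:
        raise ValueError("w must be positive")
    # A window can start at positions 0 .. len(common) - w.
    return max(0, len(common) - w + 1)


def windows(common: str, w: int) -> list[str]:
    """List the length-w windows; each one would report the same pair again."""
    return [common[i:i + w] for i in range(len(common) - w + 1)]


if __name__ == "__main__":
    shared = "ACGTACGTAC"          # hypothetical shared substring, l = 10
    print(count_windows(shared, 4))
    print(windows(shared, 4))
```

Every window is a separate reason to report the same pair of ESTs, which is why a scheme keyed on fixed-length matches generates the pair repeatedly unless it restricts itself to maximal common substrings.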
10 Approach (Cont.)
- Approach
- Use trie structure
- Identify promising pairs
- Merge clusters with strong overlaps
- Avoid storing/testing all similar pairs
- Parallel EST Clustering Software
- Generalized Suffix Tree (GST)
- Multiple processors
- One maintains and updates EST clusters
- Others generate batches of promising pairs and perform alignment
11 Approach (Cont.)
12 Tries
- Index for each char
- N leaves
- Height N
13 Suffix Tries (Cont.)
- TRIM suffix trie
14 Suffix Tries (Cont.)
- Indices
- Storage: O(n), though the constant is high
- Common string
- Longest common substring
15 Suffix Tries (Cont.)
[Figure: suffix trie with edges labeled a and b and leaves numbered 1 to 5]
Given a pattern P = ab, we traverse the tree according to the pattern.
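The traversal can be sketched with a naive, uncompacted suffix trie. This is a minimal illustration, not the structure used in the paper: the sample text "abaab", the "$" terminator, the "leaf" key, and both function names are assumptions made for the example.

```python
def build_suffix_trie(text: str) -> dict:
    """Naive O(n^2) suffix trie: one root-to-leaf path per suffix."""
    root: dict = {}
    terminated = text + "$"            # "$" marks the end of every suffix
    for start in range(len(text)):
        node = root
        for ch in terminated[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1       # 1-based number of the suffix
    return root


def find_occurrences(root: dict, pattern: str) -> list[int]:
    """Follow the pattern from the root, then collect the leaves below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []                  # fell off the trie: no occurrence
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)    # suffix number stored at the leaf
            else:
                stack.append(child)
    return sorted(found)


if __name__ == "__main__":
    trie = build_suffix_trie("abaab")
    print(find_occurrences(trie, "ab"))   # suffixes that start with "ab"
```

Each leaf number found below the node reached by P is a position where P occurs in the text, so the search cost depends on the pattern length plus the number of occurrences rather than on the text length.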
16 Parallel Generation of GST
- GST: Generalized Suffix Tree
- Compacted trie
- Longest common prefix found in constant time
- Used for on-demand pair generation
- Sequential: O(nl)
- Parallel: O(nl/p)
17 Parallel Generation of GST (Cont.)
- Previous implementations
- CRCW/CREW PRAM model
- Work-optimal
- Involves alphabetical ordering of characters
- Unrealistic assumptions
- synchronous operation of processors
- infinite network bandwidth
- no memory contention
- Not practically efficient
18 Parallel Generation of GST (Cont.)
- Paper's approach
- ESTs equally distributed among processors
- Each processor
- Partitions suffixes of ESTs into buckets
- Distribute buckets to the processors
- All suffixes in a bucket allocated to the same processor
- Total number of suffixes allocated to a processor: O(nl/p)
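The bucketing and distribution steps above can be sketched as follows. The slide does not say how suffixes are assigned to buckets or how buckets are balanced, so two assumptions are made here and should be read as such: a suffix's bucket is its first w characters, and buckets are handed out largest-first to the least-loaded processor. The function names are hypothetical.

```python
from collections import defaultdict
import heapq


def bucket_suffixes(ests: list[str], w: int) -> dict[str, list[tuple[int, int]]]:
    """Group suffixes by their first w characters (assumed bucket key).

    Each suffix is recorded as (EST index, start position). Suffixes
    shorter than w are skipped in this sketch.
    """
    buckets = defaultdict(list)
    for est_id, seq in enumerate(ests):
        for pos in range(len(seq) - w + 1):
            buckets[seq[pos:pos + w]].append((est_id, pos))
    return dict(buckets)


def assign_buckets(buckets: dict, p: int) -> dict[str, int]:
    """Greedy balancing (assumption): largest bucket to least-loaded processor."""
    heap = [(0, proc) for proc in range(p)]      # (current load, processor id)
    heapq.heapify(heap)
    owner = {}
    for key in sorted(buckets, key=lambda k: (-len(buckets[k]), k)):
        load, proc = heapq.heappop(heap)
        owner[key] = proc                        # whole bucket goes to one processor
        heapq.heappush(heap, (load + len(buckets[key]), proc))
    return owner


if __name__ == "__main__":
    ests = ["ACGTAC", "CGTACG"]                  # toy input, not real ESTs
    buckets = bucket_suffixes(ests, 2)
    print(assign_buckets(buckets, 2))
```

Because a bucket is never split, each processor can later build the subtree for its buckets without communicating with the others.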
19 Parallel Generation of GST (Cont.)
- Each bucket's processor
- Computes the compacted trie of all its suffixes
- Cannot use sequential construction
- Suffixes of a string
- Not all in the same bucket
- Each bucket
- Subtree in the GST
- Nodes
- Depth first search traversal of the trie
- Pointer to the right most child
20 On-demand Pair Generation
- A pair should be generated if
- Share a substring of length >= threshold
- Maximal
- Leaves in a common node
- Share a substring of length equal to the depth of the node
- Parallel algorithm
- Each processor works with its trie if
- Depth of its root in GST < threshold
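The pair criterion above can be illustrated without a suffix tree. The sketch below is a simple stand-in, not the GST-based method: it indexes every substring of length `threshold` in a dictionary and reports each pair of ESTs that shares at least one, which is equivalent to sharing a maximal common substring of length at least `threshold`. The function name is hypothetical, and reverse complements are ignored.

```python
from collections import defaultdict
from itertools import combinations


def promising_pairs(ests: list[str], threshold: int) -> list[tuple[int, int]]:
    """Pairs of EST indices that share a substring of length >= threshold."""
    index = defaultdict(set)                 # substring -> ESTs containing it
    for est_id, seq in enumerate(ests):
        for pos in range(len(seq) - threshold + 1):
            index[seq[pos:pos + threshold]].add(est_id)
    pairs = set()                            # a set removes duplicate reports
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


if __name__ == "__main__":
    ests = ["AACCGGTT", "CCGGTTAA", "TTTTTTTT"]   # toy sequences
    print(promising_pairs(ests, 4))
    print(promising_pairs(ests, 2))
```

Unlike this stand-in, which materializes all pairs at once, the approach on the slide produces pairs from tree nodes only when they are requested.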
21 On-demand Pair Generation
- To process
- Sort internal nodes
- Decreasing order of depth
- Lists of a node
- Generated after the node is processed
- Removed after parent is processed
- Limits space to O(nl)
- Run time: number of pairs generated + cost of sorting
- Rejected pairs increase run-time by a factor of 2
- Eliminating duplicates reduces run-time
22 Parallel Clustering
- Master-Slave paradigm
- Master processor
- Maintains and updates clusters
- Using union-find data structure
- Receives messages from slave processors
- A batch of next promising pairs generated by slave
- Results of the pairwise alignment
- Determines which ones to explore
- Determines if merging should occur
- Slave processors
- Generate pairs on demand
- Perform pairwise alignments of pairs dispatched by the master processor
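The master's cluster bookkeeping can be sketched with a standard union-find structure. This is a generic textbook version with path compression and union by size, not code from PaCE; the helper `worth_aligning` reflects one plausible reading of "determines which ones to explore" (skip pairs already in the same cluster) and is an assumption.

```python
class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))   # each EST starts as its own cluster
        self.size = [1] * n

    def find(self, x: int) -> int:
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # second pass: compress the path
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a: int, b: int) -> bool:
        """Merge the clusters of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra           # attach the smaller tree to the larger
        self.size[ra] += self.size[rb]
        return True


def worth_aligning(clusters: UnionFind, a: int, b: int) -> bool:
    """Assumed policy: only align pairs that are not yet co-clustered."""
    return clusters.find(a) != clusters.find(b)


if __name__ == "__main__":
    clusters = UnionFind(5)
    clusters.union(0, 1)               # alignment of (0, 1) was accepted
    clusters.union(1, 2)
    print(worth_aligning(clusters, 0, 2), worth_aligning(clusters, 0, 3))
```

Both operations run in near-constant amortized time, which keeps the per-pair work on the master small.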
23 Parallel Clustering (Cont.)
Organization of Parallel Clustering Software
[Figure: one master processor (Master P) exchanging messages with several slave processors (Slave P)]
- Batch of promising pairs generated, plus results of pairwise alignment
- Batchsize or fewer pairs; results of pairwise alignment on each pair
24 Parallel Clustering (Cont.)
- To start
- Slave P starts with 3 × batchsize pairs
- Sends the 3rd batch to Master P
- Starts alignment on 1st batch
- Sends results on 1st batch and a newly generated batch
- While waiting to receive results from Master P, aligns 2nd batch
- Processor always has the next batch to work on between
- Submitting the results of previous batch
- Receiving another set of pairs
25 Parallel Clustering (Cont.)
- Improve and control quality
- Parameters
- Match and mismatch scores
- Gap penalties
- Post processing
- Detection of alternative splicing
- Consulting protein databases
- Organism specific
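To show how the match score, mismatch score, and gap penalty shape an alignment result, here is a generic local-alignment (Smith-Waterman style) score with a linear gap penalty. It is a textbook sketch, not the alignment scheme used in PaCE, and the default parameter values are arbitrary assumptions.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                     # start a new local alignment
                prev[j - 1] + pair,    # align a[i-1] with b[j-1]
                prev[j] + gap,         # gap in b
                curr[j - 1] + gap,     # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("ACGT", "AGGT"))                # lenient mismatch
    print(local_alignment_score("ACGT", "AGGT", mismatch=-3))   # stricter mismatch
```

Making mismatches or gaps more expensive lowers the score of imperfect overlaps, so fewer pairs clear a fixed acceptance cutoff and the clustering becomes more conservative.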
26 Experimental environment
- Used C and MPI
- Tested
- Quality of software
- Arabidopsis thaliana (due to availability of its genome)
- Run-time behavior
- 50,000 maize ESTs with 32-processor IBM SP
- Number of processors
- Data size
- (Number of promising pairs) vs data size
- Batchsize vs (number of processors)
- Number of clusters
- Master processor's time
27 Quality Assessment
- To assess quality
- A data set and its correct clustering
- ESTs from plant Arabidopsis thaliana
- Splice program
- Align ESTs to the genome
- Discard ESTs that
- Don't align
- Align in multiple spots
28 Quality Assessment (Cont.)
- False negative
- A pair in correct clustering is not paired in the output
- 5
- False positive
- A pair not in correct clustering appears in results
- Negligible (< 0.04)
- Due to conservative nature of algorithm
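The two definitions above translate directly into set differences over co-clustered pairs. The sketch below counts false negatives and false positives for toy clusterings; the function names and EST labels are hypothetical, and only raw counts are returned because the slide does not state how the counts are normalized.

```python
from itertools import combinations


def co_clustered_pairs(clusters: list[set[str]]) -> set[tuple[str, str]]:
    """All unordered pairs of ESTs that appear in the same cluster."""
    pairs = set()
    for members in clusters:
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pair_errors(correct: list[set[str]], output: list[set[str]]) -> tuple[int, int]:
    """Return (false negatives, false positives) as pair counts."""
    truth = co_clustered_pairs(correct)
    predicted = co_clustered_pairs(output)
    false_negatives = len(truth - predicted)   # paired in truth, missed in output
    false_positives = len(predicted - truth)   # paired in output, wrong in truth
    return false_negatives, false_positives


if __name__ == "__main__":
    correct = [{"e1", "e2", "e3"}, {"e4", "e5"}]
    split = [{"e1", "e2"}, {"e3"}, {"e4", "e5"}]     # over-split output
    merged = [{"e1", "e2", "e3", "e4"}, {"e5"}]      # over-merged output
    print(pair_errors(correct, split))
    print(pair_errors(correct, merged))
```

An over-split output produces only false negatives, which is the error profile expected from a conservative clustering algorithm.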
29 Quality Assessment (Cont.)
Cluster results   Number of singleton clusters   Number of non-singleton clusters
Benchmark         10,803                         18,727
CAP3              17,930                         17,556
PaCE              14,802                         19,536
Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.
30 Quality Assessment (Cont.)
31 Run-time Assessment
- Experiment with 50,000 maize ESTs
- 32-processor IBM SP-2
- 16 minutes
32 Run-time Assessment (Cont.)
p    Preprocessing   Clustering   Total
4    273             102          375
8    119             50           169
16   61              26           87
32   38              15           53
64   29              10           39
Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs; p is the number of processors.
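The scaling in this table is easier to read as speedup and efficiency relative to the smallest run. The sketch below uses the Total column as given; taking p = 4 as the baseline is a choice made here because no single-processor time is reported, and the function names are illustrative.

```python
# Total run-time in seconds from the table, keyed by number of processors.
TOTAL_SECONDS = {4: 375, 8: 169, 16: 87, 32: 53, 64: 39}


def relative_speedup(times: dict[int, float], base_p: int) -> dict[int, float]:
    """Speedup of each run relative to the run on base_p processors."""
    base = times[base_p]
    return {p: base / t for p, t in sorted(times.items())}


def relative_efficiency(times: dict[int, float], base_p: int) -> dict[int, float]:
    """Speedup divided by the increase in processor count over base_p."""
    speedups = relative_speedup(times, base_p)
    return {p: s * base_p / p for p, s in speedups.items()}


if __name__ == "__main__":
    for p, s in relative_speedup(TOTAL_SECONDS, 4).items():
        e = relative_efficiency(TOTAL_SECONDS, 4)[p]
        print(f"p={p:2d}  speedup={s:5.2f}  efficiency={e:4.2f}")
```

On these numbers, going from 4 to 8 processors more than halves the total time, while the step from 32 to 64 processors saves only 14 seconds, so efficiency relative to p = 4 drops to about 0.6 at 64 processors.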
33 Run-time Assessment (Cont.)
- Run-time as a function of batchsize
- Small batchsize
- Increase in communication overhead
- Large batchsize
- Slaves less responsive to the need of generating pairs
- Slave does not use latest clustering results
- Optimal batchsize
- Determined by experiment
- Master processor's time
- Fixed batchsize, increase in number of processors
- Gradual increase in Master P's time
- With 32 processors, increase < 1
- Using 1 master processor is not a bottleneck
34 Results
- Space: linear in size of the input data set
- Reduced total work without sacrificing quality
- Reduced run-time
- Parallel processors
- Eliminating pairs
- Facilitate clustering
- Scale memory with number of processors
35 Observations
- PaCE: approaches the EST clustering problem directly
- Better than
- CAP3
- Phrap
- TIGR Assembler
- Compare time/quality
- TIGICL (TIGR Indices Clustering Tool)
- Support for PVM
- MegaBlast
- STACK
- Large data sets
- Lots of Processors
- Can improve clustering time?
- Clustering algorithm