Title: Whole Genome Alignment using Multithreaded Parallel Implementation
1. Whole Genome Alignment using Multithreaded Parallel Implementation
- Hyma S Murthy
- CMSC 838 Presentation
2. Talk Overview
- Organization of the paper
- Motivation
- Technique
  - Pairwise Sequence Comparison using Dynamic Programming
  - EARTH Execution Model
- Evaluation
- Result Graphs
- Conclusions
- Related Work (MUMmer)
3. Motivation
- Importance of Genome Alignment
  - Identify important matched and mismatched regions
    - matches represent homolog pairs, conserved regions or long repeats
    - mismatches represent foreign fragments inserted by transposition, sequence reversal or lateral transfer
  - Detect functional differences between pathogenic and non-pathogenic strains, evolutionary distance, mutations leading to disease, phenotypes, etc.
- Problems
  - Requires large computational power, memory and execution time
  - Existing algorithms apply dynamic programming only to subsequences
  - Computationally intensive to apply to whole sequences (O(n^2))
  - Thus applicable only to closely related genomes
4. Solution
- Multithreaded parallel implementation of a sequence alignment algorithm to align whole genomes
- Parallel implementation of the dynamic programming technique
  - Uses the collective memory of several nodes
  - Uses multithreading to overlap computation and communication
- Applicable to closely related as well as less similar genomes
- Reliable output in reasonable time
5. Pairwise Sequence Comparison using Dynamic Programming
- Basic Idea
  - Quantify the similarity between pairs of symbols of the target sequences
  - Associate a score with each possible arrangement
  - Similarity is given by the highest score
- Example
  - sequence x: A T A A G T
  - sequence y: A T G C A G T
  - SCORE: 1 1 -1 -1 -1 -1 -1, TOTAL: -3
  - sequence x: A T A - A G T
  - sequence y: A T G C A G T
  - SCORE: 1 1 -1 -2 1 1 1, TOTAL: 2
- Model mutation by gaps (gaps indicate evolution of one sequence into another)
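The scoring of a fixed arrangement can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` is hypothetical, and the values match = +1, mismatch = -1, gap = -2 are assumptions chosen because they reproduce the total of 2 for the gapped arrangement on this slide.

```python
def score_alignment(x_aligned, y_aligned, match=1, mismatch=-1, gap=-2):
    """Sum the column scores of two already-aligned sequences.

    A '-' in either sequence marks a gap. The scoring values are
    assumptions for illustration, not taken from the paper.
    """
    if len(x_aligned) != len(y_aligned):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(x_aligned, y_aligned):
        if a == "-" or b == "-":
            total += gap        # one symbol aligned against a gap
        elif a == b:
            total += match      # identical symbols
        else:
            total += mismatch   # substitution
    return total


# Gapped arrangement from the slide: 1 + 1 - 1 - 2 + 1 + 1 + 1
print(score_alignment("ATA-AGT", "ATGCAGT"))  # 2
```

Dynamic programming, introduced on the next slide, searches over all such arrangements instead of scoring one chosen by hand.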
6. Dynamic Programming
- Smith and Waterman approach
  - Aligns subsequences of the given sequences
  - Involves (a) calculation of scores indicating similarity and (b) identification of alignment(s) corresponding to the score
- Build the solution using previous solutions for smaller subsequences
- Construct a two-dimensional array, the Similarity Matrix (SM), to store scores corresponding to partial results
- The matrix represents all possible alignments of the input sequences
- Recurrence equation (gp: gap penalty, ss: substitution score)

$$
SM_{i,j} = \max
\begin{cases}
SM_{i,j-1} + gp \\
SM_{i-1,j-1} + ss \\
SM_{i-1,j} + gp \\
0
\end{cases}
$$
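7. Contd.
- Each element of the matrix is the maximum of the following four values
  - left element + gap penalty
  - upper-left element + score of replacing the vertical symbol with the horizontal symbol
  - upper element + gap penalty
  - 0
- Consider the following example
  - horizontal sequence: T G A T G G A G G T
  - vertical sequence: G A T A G G
  - sample cell: 2 = max{0 + (-2), 1 + (+1), 0 + (-2), 0}
  - [Figure: similarity matrix for the two sequences; cell values not preserved]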
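The fill step described on this slide can be sketched as follows. The function name `similarity_matrix` is hypothetical. The gap penalty of -2 and match score of +1 are read off the sample cell above; the mismatch score of -1 is an assumption carried over from the earlier scoring example.

```python
def similarity_matrix(vertical, horizontal, match=1, mismatch=-1, gap=-2):
    """Fill the Smith-Waterman similarity matrix row by row.

    Row 0 and column 0 stay at zero. Every other cell is the maximum of
    left + gap, upper-left + substitution score, upper + gap, and 0.
    """
    rows, cols = len(vertical) + 1, len(horizontal) + 1
    sm = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            ss = match if vertical[i - 1] == horizontal[j - 1] else mismatch
            sm[i][j] = max(
                sm[i][j - 1] + gap,      # left element + gap
                sm[i - 1][j - 1] + ss,   # upper-left element + substitution
                sm[i - 1][j] + gap,      # upper element + gap
                0,
            )
    return sm


sm = similarity_matrix("GATAGG", "TGATGGAGGT")
for symbol, row in zip("-GATAGG", sm):
    print(symbol, row)
print(sm[2][3])  # 2, i.e. max{0 + (-2), 1 + (+1), 0 + (-2), 0}
```

The cell in row A, column A (third column of the horizontal sequence) reproduces the sample calculation: its upper-left neighbour holds 1 from the G/G match.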
8. Identifying alignments
- Alignments with a score above a given threshold are reported
- Start at the end of the alignment and move backwards to the beginning
- [Figure: traceback through the similarity matrix for horizontal sequence T G A T G G A G G T and vertical sequence G A T A G G; matrix values not preserved]
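A minimal traceback sketch, under the same assumed scores (match +1, mismatch -1, gap -2). It is simplified in one respect: instead of reporting every alignment above a threshold, it follows only the first highest-scoring cell back until a zero is reached. The name `best_local_alignment` is hypothetical.

```python
def best_local_alignment(x, y, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    rows, cols = len(x) + 1, len(y) + 1
    sm = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            ss = match if x[i - 1] == y[j - 1] else mismatch
            sm[i][j] = max(sm[i][j - 1] + gap, sm[i - 1][j - 1] + ss,
                           sm[i - 1][j] + gap, 0)
            if sm[i][j] > best:          # keep the first maximum found
                best, best_pos = sm[i][j], (i, j)

    # Walk backwards from the end of the alignment to its beginning.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and sm[i][j] > 0:
        ss = match if x[i - 1] == y[j - 1] else mismatch
        if sm[i][j] == sm[i - 1][j - 1] + ss:    # came from upper-left
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif sm[i][j] == sm[i - 1][j] + gap:     # came from above: gap in y
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:                                    # came from the left: gap in x
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(best_local_alignment("GATAGG", "TGATGGAGGT"))  # (3, 'GAT', 'GAT')
```

Several cells tie at the maximum score of 3 for these sequences, so a threshold-based report as described on the slide would list more than this one alignment.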
9. EARTH Execution Model
- Program is viewed as a collection of threads
  - execution order is determined by data and control dependencies
- Threads are further divided into fibers
  - fibers are non-preemptive
  - all data is ready before their execution
- Each node in EARTH has
  - an execution unit (EU)
  - a synchronization unit (SU)
  - queues linking the two (RQ and EQ)
  - local memory
  - an interface to the interconnection network
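The firing rule for fibers can be illustrated with a toy, single-node simulation. This is not the EARTH runtime or its API: the classes `Fiber` and `Node` are hypothetical, RQ and EQ are read here as a ready queue and an event queue, and the SU and EU are modelled as two phases of one loop rather than as separate units.

```python
from collections import deque


class Fiber:
    """A non-preemptive unit of work that fires once all its inputs arrived."""

    def __init__(self, name, sync_count, body):
        self.name = name
        self.pending = sync_count  # number of sync signals still missing
        self.body = body           # callable run to completion by the EU


class Node:
    """Toy node: the SU drains the event queue, the EU drains the ready queue."""

    def __init__(self):
        self.ready_queue = deque()  # assumed meaning of RQ
        self.event_queue = deque()  # assumed meaning of EQ
        self.log = []

    def signal(self, fiber):
        """Post a sync signal for `fiber`; the SU handles it later."""
        self.event_queue.append(fiber)

    def run(self):
        while self.event_queue or self.ready_queue:
            # SU phase: count down sync slots, enable fibers that reach zero.
            while self.event_queue:
                fiber = self.event_queue.popleft()
                fiber.pending -= 1
                if fiber.pending == 0:
                    self.ready_queue.append(fiber)
            # EU phase: run one ready fiber to completion (no preemption).
            if self.ready_queue:
                fiber = self.ready_queue.popleft()
                self.log.append(fiber.name)
                fiber.body(self)


node = Node()
c = Fiber("c", 2, lambda n: None)          # needs data from both a and b
a = Fiber("a", 1, lambda n: n.signal(c))
b = Fiber("b", 1, lambda n: n.signal(c))
node.signal(a)
node.signal(b)
node.run()
print(node.log)  # ['a', 'b', 'c']
```

Fiber `c` runs only after both of its inputs have been signalled, which is the "all data is ready before execution" property listed above.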
10. EARTH Architecture
- [Figure: EARTH architecture. Nodes are connected by an interconnection network; PEs sit on a memory bus. Within a node: EU, SU, local memory, EQ and RQ, with arrows labelled "To EQ" and "From RQ".]
11. Multithreaded parallel implementation
- Divide the scoring matrix as follows
  - into horizontal strips (each covering a piece of input sequence X)
  - strips into rectangular blocks
- Blocks are calculated by two fibers within a thread
  - only one fiber is active at any given time
- Each thread is assigned to one horizontal strip
  - the computation is done by even/odd fibers within the thread
- Initialization delay of reading sequences from the server is minimized
  - each thread needs only the piece of input sequence it grabs, not the whole of sequence X
- After computing a block, a fiber sends a piece of sequence Y, among other information, to the fiber beneath
- The computation of the anti-diagonal elements of the matrix is shown on the next slide
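The order in which blocks become computable can be sketched as a wavefront over anti-diagonals. This is a scheduling illustration only, under stated assumptions: every block takes one time step, a block needs the block to its left in the same strip and the block above it in the previous strip, and even-numbered blocks go to the even fiber "E" and odd-numbered blocks to the odd fiber "O". The function name `wavefront_schedule` is hypothetical.

```python
def wavefront_schedule(num_strips, blocks_per_strip):
    """List, per time step, the (strip, block, fiber) triples that can run.

    Block (s, b) depends on (s, b - 1) and (s - 1, b), so all blocks with
    s + b == t are independent and form one anti-diagonal wavefront.
    """
    steps = []
    for t in range(num_strips + blocks_per_strip - 1):
        step = []
        for strip in range(num_strips):
            block = t - strip
            if 0 <= block < blocks_per_strip:
                fiber = "E" if block % 2 == 0 else "O"  # assumed assignment
                step.append((strip, block, fiber))
        steps.append(step)
    return steps


for t, step in enumerate(wavefront_schedule(2, 3)):
    print(t, step)
# 0 [(0, 0, 'E')]
# 1 [(0, 1, 'O'), (1, 0, 'E')]
# 2 [(0, 2, 'E'), (1, 1, 'O')]
# 3 [(1, 2, 'E')]
```

Within a strip the E and O fibers never appear in the same step, matching the point that only one fiber of a thread is active at a time.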
12. Computation of similarity matrix on EARTH
- [Figure: threads A and B, each with an even (E) and an odd (O) fiber, one active and one inactive, computing blocks on processors P1 to P4; arrows mark Data, Sync and Ack signals]
13. Evaluation
- Experimental environment
  - Beowulf implementation of EARTH
  - Uses a Beowulf machine consisting of 64 nodes, each containing two 200MHz Pentium Pro processors (a total of 128 processors and 128MB of memory)
- Sequences of lengths ranging from 30K to 900K were tested
- Execution times for the sequential and parallel implementations of the Smith and Waterman algorithm are given below
- [Table: execution times; not preserved]
14. Evaluation
- The multithreaded parallel implementation is named ATGC (Another Tool for Genomic Comparison)
- The experiment aligns
  - human and mouse mitochondrial genomes
  - human and Drosophila mitochondrial genomes
- Reason for selection
  - human and mouse are closely related, and the other pair is less similar
- The results were confirmed with MUMmer, another whole genome alignment tool
- Result graphs show that ATGC is more accurate than MUMmer
  - (verified using NCBI BLAST)
15. Result Graphs
16. Contd.
17. Conclusions
- Comparison of whole genomes requires high computation and memory
- Made practical by using a multithreaded parallel implementation of dynamic programming on a cluster of PCs
- Accurate results obtained in a reasonable amount of time
- Aligns closely related as well as less similar genomes
- Slower, but plays an important role where high accuracy is needed
  - (as seen in the comparison with MUMmer for the human and Drosophila mitochondrial genomes)
18. Related work: MUMmer (Maximal Unique Match)
- given genomes A and B
  - find all maximal unique matching subsequences (MUMs)
  - extract the longest possible set of matches that occur in the same order in both genomes
  - close the gaps
  - output the alignment
- maximal unique match (MUM)
  - occurs exactly once in each of genomes A and B
  - not contained in any longer MUM
- key idea in identifying MUMs is to build a suffix tree for genomes A and B
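The MUM definition can be made concrete with a brute-force sketch. This deliberately swaps the suffix tree for a naive substring scan, so it is only suitable for short toy strings and is not how MUMmer works internally. The names `occurrences`, `naive_mums` and `min_len` are hypothetical.

```python
def occurrences(text, pattern):
    """Start positions of all (possibly overlapping) occurrences of pattern."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def naive_mums(a, b, min_len=2):
    """Return (pos_in_a, pos_in_b, length) for every maximal unique match.

    A match qualifies if it occurs exactly once in a and once in b and
    cannot be extended by one symbol to the left or to the right. Any
    extension of a unique match is itself unique, so "cannot be extended"
    is the same as "not contained in a longer MUM".
    """
    mums = []
    for i in range(len(a)):
        for end in range(i + min_len, len(a) + 1):
            s = a[i:end]
            in_b = occurrences(b, s)
            if not in_b:
                break  # longer substrings starting at i cannot occur in b either
            if len(in_b) != 1 or len(occurrences(a, s)) != 1:
                continue  # not unique in one of the genomes
            j = in_b[0]
            left_ext = i > 0 and j > 0 and a[i - 1] == b[j - 1]
            right_ext = (end < len(a) and j + len(s) < len(b)
                         and a[end] == b[j + len(s)])
            if not left_ext and not right_ext:
                mums.append((i, j, len(s)))
    return mums


print(naive_mums("ACGTA", "TACGT"))  # [(0, 1, 4), (3, 0, 2)]
```

Here "ACGT" and "TA" are both MUMs, but they appear in opposite order in the two strings, so the second step listed above (the longest set of matches in the same order in both genomes) would keep only one of them.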