Whole Genome Alignment using Multithreaded Parallel Implementation

1
Whole Genome Alignment using Multithreaded
Parallel Implementation
  • Hyma S Murthy
  • CMSC 838 Presentation

2
Talk Overview
  • Organization of the paper
  • Motivation
  • Technique
  • Pairwise Sequence Comparison using Dynamic
    Programming
  • EARTH Execution Model
  • Evaluation
  • Result Graphs
  • Conclusions
  • Related Work (MUMmer)

3
Motivation
  • Importance of Genome Alignment
  • Identify important matched and mismatched regions
  • matches represent homolog pairs, conserved
    regions or long repeats
  • mismatches represent foreign fragments inserted
    by transposition, sequence reversal or lateral
    transfer
  • Detect functional differences between pathogenic/
    non-pathogenic strains, evolutionary distance,
    mutations leading to disease, phenotypes, etc.
  • Problems
  • Large computational power, memory and execution
    time
  • Existing algorithms apply dynamic programming
    only to subsequences
  • Computationally intensive to apply to whole
    sequences (O(n^2))
  • Thus applicable only to closely related genomes

4
Solution
  • Multithreaded parallel implementation of sequence
    alignment algorithm to align whole genomes
  • Parallel implementation of dynamic programming
    technique
  • Uses collective memory of several nodes
  • Uses multithreading to overlap computation and
    communication
  • Applicable to closely related as well as less
    similar genomes
  • Reliable output in reasonable time

5
Pairwise Sequence Comparison using Dynamic
Programming
  • Basic Idea
  • Quantify the similarity between pairs of symbols
    of target sequences
  • Associate score for each possible arrangement
  • Similarity is given by the highest score
  • Example
  • sequence x   A  T  A  A  G  T
  • sequence y   A  T  G  C  A  G  T
  • SCORE        1  1 -1 -1 -1 -1 -1
    TOTAL -3
  • sequence x   A  T  A  -  A  G  T
  • sequence y   A  T  G  C  A  G  T
  • SCORE        1  1 -1 -2  1  1  1
    TOTAL 2
  • Model mutation by gaps (gaps indicate evolution
    of one sequence into another)
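
The scoring in the example can be reproduced with a few lines of Python. This is a minimal sketch, not code from the presentation: the function name `score_alignment` is hypothetical, and the values match = +1, mismatch = -1 and gap = -2 are assumptions read off the gapped alignment above.

```python
def score_alignment(x, y, match=1, mismatch=-1, gap=-2):
    """Sum the per-column scores of two aligned strings ('-' marks a gap)."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(x, y):
        if a == "-" or b == "-":
            total += gap        # assumed gap penalty
        elif a == b:
            total += match      # assumed match score
        else:
            total += mismatch   # assumed mismatch score
    return total


# The gapped alignment from the example slide.
gapped_score = score_alignment("ATA-AGT", "ATGCAGT")
print(gapped_score)
```

Inserting the single gap lets the trailing A, G and T line up, which is why the gapped arrangement scores higher than the ungapped one.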



6
Dynamic Programming
  • Smith and Waterman approach
  • Aligns subsequences of given sequences
  • Involves (a) calculation of scores indicating
    similarity
  • (b) identification of alignment(s)
    corresponding to the score
  • Build solution using previous solutions for
    smaller subsequences
  • Construct a two-dimensional array Similarity
    Matrix to store scores corresponding to partial
    results
  • Matrix represents all possible alignments of the
    input sequences
  • Recurrence equation

$$SM_{i,j} = \max\{\, SM_{i,j-1} + gp,\; SM_{i-1,j-1} + ss,\; SM_{i-1,j} + gp,\; 0 \,\}$$
7
Contd.
  • Each element of the matrix is the maximum of the
    following four values
  • left element + gap, upper-left element + score of
    replacing the vertical symbol with the horizontal
    symbol, upper element + gap, and 0
  • Consider the following example
  • horizontal sequence T G A T G G A G G T,
    vertical sequence G A T A G G
  • sample cell: 2 = max{0 + (-2), 1 + (1), 0 + (-2), 0}
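
The matrix fill described on this slide can be sketched as follows. The function name `similarity_matrix` is hypothetical, and the scores (match +1, mismatch -1, gap -2) are assumptions taken from the sample cell calculation, not values confirmed elsewhere in the deck.

```python
def similarity_matrix(horizontal, vertical, match=1, mismatch=-1, gap=-2):
    """Fill the similarity matrix for local alignment.

    Row 0 and column 0 stay at 0; every other cell is the maximum of
    left + gap, upper-left + substitution score, upper + gap, and 0.
    """
    rows, cols = len(vertical) + 1, len(horizontal) + 1
    sm = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            # Score of replacing the vertical symbol with the horizontal one.
            ss = match if vertical[i - 1] == horizontal[j - 1] else mismatch
            sm[i][j] = max(
                sm[i][j - 1] + gap,      # left element + gap
                sm[i - 1][j - 1] + ss,   # upper-left element + substitution
                sm[i - 1][j] + gap,      # upper element + gap
                0,
            )
    return sm


sm = similarity_matrix("TGATGGAGGT", "GATAGG")
for symbol, row in zip(" GATAGG", sm):
    print(symbol, row)
```

Under these assumed scores, the cell for the second vertical symbol (A) and the third horizontal symbol (A) evaluates to max{0 - 2, 1 + 1, 0 - 2, 0} = 2, the same arithmetic as the sample cell on the slide.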
8
Identifying alignments
  • Alignments with score above a given threshold are
    reported
  • Start at end of the alignment and move backwards
    to the beginning
  • [Figure: similarity matrix with horizontal sequence
    T G A T G G A G G T and vertical sequence
    G A T A G G]
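
A traceback along these lines might look like the sketch below. It is an illustration under assumptions, not the presenter's code: the helper names `fill`, `traceback` and `report` are hypothetical, the scores (+1, -1, -2) are assumed, and every cell at or above the threshold is treated as the end of a reported alignment.

```python
def fill(horizontal, vertical, match, mismatch, gap):
    """Build the similarity matrix (row 0 and column 0 are zero)."""
    sm = [[0] * (len(horizontal) + 1) for _ in range(len(vertical) + 1)]
    for i in range(1, len(vertical) + 1):
        for j in range(1, len(horizontal) + 1):
            ss = match if vertical[i - 1] == horizontal[j - 1] else mismatch
            sm[i][j] = max(sm[i][j - 1] + gap, sm[i - 1][j - 1] + ss,
                           sm[i - 1][j] + gap, 0)
    return sm


def traceback(sm, horizontal, vertical, i, j, match, mismatch, gap):
    """Walk backwards from cell (i, j) until a zero cell is reached."""
    top, bottom = [], []
    while i > 0 and j > 0 and sm[i][j] > 0:
        ss = match if vertical[i - 1] == horizontal[j - 1] else mismatch
        if sm[i][j] == sm[i - 1][j - 1] + ss:      # came from upper-left
            top.append(horizontal[j - 1])
            bottom.append(vertical[i - 1])
            i, j = i - 1, j - 1
        elif sm[i][j] == sm[i - 1][j] + gap:       # came from above
            top.append("-")
            bottom.append(vertical[i - 1])
            i -= 1
        else:                                      # came from the left
            top.append(horizontal[j - 1])
            bottom.append("-")
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def report(horizontal, vertical, threshold, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned horizontal, aligned vertical) per qualifying cell."""
    sm = fill(horizontal, vertical, match, mismatch, gap)
    found = []
    for i in range(1, len(vertical) + 1):
        for j in range(1, len(horizontal) + 1):
            if sm[i][j] >= threshold:
                h, v = traceback(sm, horizontal, vertical, i, j,
                                 match, mismatch, gap)
                found.append((sm[i][j], h, v))
    return found


alignments = report("TGATGGAGGT", "GATAGG", threshold=3)
for score, h, v in alignments:
    print(score, h, v)
```

Reporting every qualifying cell yields overlapping alignments; a production tool would typically filter these, which this sketch does not attempt.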
9
EARTH Execution Model
  • Program is viewed as a collection of threads
  • execution order determined by data and control
    dependencies
  • Threads further divided into fibers
  • fibers are non-preemptive and
  • all data is ready before their execution
  • Each node in EARTH has
  • an execution unit
  • synchronization unit
  • queues linking the two (RQ and EQ)
  • local memory
  • interface to interconnection network
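
The idea that a fiber runs only once all of its data is ready, and then runs to completion, can be illustrated with a toy scheduler. This is not the EARTH runtime or its API: the `Fiber` class, the `run` function and the fiber names are hypothetical, and the mapping of the two deques onto RQ and EQ is an assumption made for illustration.

```python
from collections import deque


class Fiber:
    def __init__(self, name, sync_count, action):
        self.name = name
        self.pending = sync_count  # signals still needed before it may run
        self.action = action       # callable(signal); runs without preemption


def run(fibers, initial_signals):
    """Run fibers to completion; return the order in which they executed."""
    ready = deque()                    # fibers whose data is ready (RQ role, assumed)
    events = deque(initial_signals)    # pending sync signals (EQ role, assumed)
    order = []

    def signal(target):
        events.append(target)

    while events or ready:
        # Synchronization-unit role: count down sync slots.
        while events:
            fiber = fibers[events.popleft()]
            fiber.pending -= 1
            if fiber.pending == 0:
                ready.append(fiber)
        # Execution-unit role: run one ready fiber to completion.
        if ready:
            fiber = ready.popleft()
            order.append(fiber.name)
            fiber.action(signal)
    return order


fibers = {
    "load_x": Fiber("load_x", 1, lambda signal: signal("compute")),
    "load_y": Fiber("load_y", 1, lambda signal: signal("compute")),
    "compute": Fiber("compute", 2, lambda signal: signal("report")),
    "report": Fiber("report", 1, lambda signal: None),
}
order = run(fibers, ["load_x", "load_y"])
print(order)
```

Here `compute` waits for two signals, so it cannot start until both loading fibers have finished, which is the data-driven ordering the slide describes.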

10
EARTH Architecture
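[Figure: EARTH architecture. Nodes containing processing elements (PE) on a memory bus are connected through an interconnection network; each node has an EU and an SU linked by the EQ and RQ queues, plus local memory.]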
11
Multithreaded parallel implementation
  • Divide scoring matrix as follows
  • horizontal strips (each element of input sequence
    X)
  • strips into rectangular blocks
  • Blocks are calculated by two fibers within a
    thread
  • only one fiber is active at any given time
  • Each thread is assigned to one horizontal strip
  • the computation is done by even/ odd fibers
    within the thread
  • Initialization delay of reading sequences from
    server is minimized
  • Each thread needs only the piece of input
    sequence it grabs and not the whole of sequence X
  • After computing a block, fiber sends to fiber
    beneath a piece of sequence Y
  • among other information
  • The computation of the anti-diagonal elements of
    the matrix is as shown
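
The strip and block decomposition can be sketched sequentially to show why blocks on the same anti-diagonal are independent. This is a single-process illustration under assumptions, not the EARTH implementation: `wavefront_schedule` and `blocked_fill` are hypothetical names, the scores (+1, -1, -2) are assumed, and no threads, fibers or messages are modelled.

```python
def wavefront_schedule(n_strips, n_blocks):
    """Group (strip, block) pairs by anti-diagonal.

    Block (s, b) needs block (s-1, b) above it and block (s, b-1) to its
    left, so all blocks with the same s + b can be computed concurrently.
    """
    steps = []
    for d in range(n_strips + n_blocks - 1):
        steps.append([(s, d - s) for s in range(n_strips)
                      if 0 <= d - s < n_blocks])
    return steps


def blocked_fill(horizontal, vertical, strip_height, block_width,
                 match=1, mismatch=-1, gap=-2):
    """Fill the similarity matrix block by block in wavefront order."""
    rows, cols = len(vertical), len(horizontal)
    sm = [[0] * (cols + 1) for _ in range(rows + 1)]
    n_strips = -(-rows // strip_height)   # ceiling division
    n_blocks = -(-cols // block_width)
    for step in wavefront_schedule(n_strips, n_blocks):
        for s, b in step:  # blocks within one step do not depend on each other
            i_end = min((s + 1) * strip_height, rows)
            j_end = min((b + 1) * block_width, cols)
            for i in range(s * strip_height + 1, i_end + 1):
                for j in range(b * block_width + 1, j_end + 1):
                    ss = match if vertical[i - 1] == horizontal[j - 1] else mismatch
                    sm[i][j] = max(sm[i][j - 1] + gap, sm[i - 1][j - 1] + ss,
                                   sm[i - 1][j] + gap, 0)
    return sm


schedule = wavefront_schedule(2, 3)
blocked = blocked_fill("TGATGGAGGT", "GATAGG", strip_height=2, block_width=3)
print(schedule)
```

In the scheme described on the slide, each strip would belong to one thread and only the bottom row of a finished block (plus the matching piece of sequence Y) would need to travel to the strip beneath it; the sketch keeps the whole matrix in one list purely for simplicity.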

12
Computation of similarity matrix on EARTH
[Figure: computation of the similarity matrix on EARTH. Labels: Thread A and Thread B, each with even (E) and odd (O) fibers; active and inactive fibers; Data, Sync and Ack signals; blocks P1 to P4.]
13
Evaluation
  • Experimental environment
  • Beowulf implementation of EARTH
  • Uses Beowulf machine consisting of 64 nodes, each
    containing two 200 MHz Pentium Pro processors (a
    total of 128 processors and 128 MB of memory)
  • Sequences of lengths ranging from 30K to 900K
    were tested
  • Execution times for sequential and parallel
    implementations of the Smith and Waterman algorithm
    are given below

14
Evaluation
  • The multithreaded parallel implementation is
    named ATGC (Another Tool for Genomic Comparison)
  • Experiment aligns
  • human and mouse mitochondrial genomes
  • human and Drosophila mitochondrial genomes
  • Reason for selection
  • human and mouse are closely related and the other
    pair is less similar
  • The results were confirmed with MUMmer, another
    whole genome alignment tool
  • Result graphs show that ATGC is more accurate
    than MUMmer
  • (verified by using NCBI BLAST)

15
Result Graphs
16
Contd.
17
Conclusions
  • Comparison of whole genomes requires high
    computation and memory
  • Made convenient by using a multithreaded parallel
    implementation of dynamic programming on a
    cluster of PCs
  • Accurate results obtained in reasonable amount of
    time
  • Aligns closely related as well as less similar
    genomes
  • Slower, but plays important role where high
    accuracy is needed
  • (as seen in comparison with MUMmer for human
    and Drosophila mitochondrial genomes)

18
Related work: MUMmer (Maximal Unique Match)
  • given genomes A and B
  • find all maximal, unique, matching subsequences
    (MUMs)
  • extract the longest possible set of matches that
    occur in the same order in both genomes
  • close the gaps
  • output the alignment
  • maximal unique match (MUM)
  • occurs exactly once in both genomes A and B
  • not contained in any longer MUM
  • key idea in identifying MUMs is to build a suffix
    tree for genomes A and B
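
The step that extracts the longest set of matches occurring in the same order in both genomes is a longest-increasing-subsequence problem. The sketch below uses a simple quadratic dynamic program, which is a swapped-in technique chosen for readability and not necessarily what MUMmer does; the function name and the match positions are invented for illustration.

```python
def longest_ordered_chain(matches):
    """Pick the longest subset of matches that is increasing in both genomes.

    `matches` is a list of (position_in_A, position_in_B) start positions.
    """
    ordered = sorted(matches)          # sort by position in genome A
    n = len(ordered)
    if n == 0:
        return []
    best_len = [1] * n                 # longest chain ending at each match
    prev = [-1] * n                    # back-pointer for chain recovery
    for k in range(n):
        for p in range(k):
            in_order = (ordered[p][0] < ordered[k][0]
                        and ordered[p][1] < ordered[k][1])
            if in_order and best_len[p] + 1 > best_len[k]:
                best_len[k] = best_len[p] + 1
                prev[k] = p
    k = max(range(n), key=best_len.__getitem__)
    chain = []
    while k != -1:
        chain.append(ordered[k])
        k = prev[k]
    return chain[::-1]


# Hypothetical match positions, made up for this example.
matches = [(0, 40), (10, 5), (25, 20), (50, 12), (60, 33)]
chain = longest_ordered_chain(matches)
print(chain)
```

Matches left out of the chain, such as (0, 40) here, are the ones that appear in a different relative order in the two genomes; the regions between consecutive chained matches are the gaps that the later "close the gaps" step would fill.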