Parallel Computation in Biological Sequence Analysis

1
Parallel Computation in Biological Sequence
Analysis
  • Xue Wu
  • CMSC 838 Presentation

2
Motivation
  • Scanning and analyzing biological sequences are
    common, repeated tasks in molecular biology
  • Homologous sequence searching
      • Based on pairwise alignment
      • Task is to find similarities between a particular
        query sequence and all the sequences of a
        biosequence databank
  • Multiple sequence alignment
      • Simultaneous alignment of three or more
        nucleotide or amino acid sequences
  • Problems with sequential solutions
      • With the exponential growth of the biosequence
        banks, homologous sequence searching becomes
        time-consuming
      • The automatic generation of an accurate multiple
        alignment is computationally expensive
  • Parallel solutions can reduce computation time and
    provide more accurate results

3
Talk Overview
  • Overview of talk
  • Motivation
  • Techniques and Evaluations
      • Similarity Sequence Searching
      • Multiple Sequence Alignment
  • Observations

4
Techniques: similarity sequence searching
  • Two main parallel methods to search a sequence
    database
      • Fine-grain approach for SIMD parallel computers
          • Parallelize the comparison algorithm itself
          • All processors cooperate to determine the
            similarity score
      • Coarser-grain approach for MIMD parallel
        computers
          • Parallelize the database searching
          • Each processor performs a selected number of
            comparisons
  • Method used in the paper
      • Parallelize similarity searching with the
        coarser-grain approach
      • Workload balancing is the key to better
        parallelism
      • Partition the database, run a sequential search
        on each piece, and combine the results; load
        balance requires equal-sized pieces of the
        database
      • Percentage of Load Imbalance (PLIB) is the metric
        for load imbalance
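
The slide names PLIB but does not give its formula. The sketch below assumes one plausible definition, the gap between the most and least loaded processor as a percentage of the largest load; both the formula and the function name `plib` are assumptions for illustration, not taken from the slides.

```python
def plib(loads):
    """Percentage of load imbalance for a list of per-processor loads.

    ASSUMPTION: PLIB = (largest - smallest) / largest * 100.
    The slides name the metric but do not define it.
    """
    largest = max(loads)
    if largest == 0:
        return 0.0  # no work anywhere, so nothing is imbalanced
    return (largest - min(loads)) / largest * 100.0
```

Under this definition a perfectly even split scores 0, and a split where one processor has half the load of another scores 50.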

5
Techniques: similarity sequence searching
  • Splitting up the database
  • Unsorted portion method: the first load balancing
    technique
      • Partition the database into a number of portions
      • Portion_size = database_size / processors_number
      • If assigning a sequence makes the sum of sequence
        lengths in portion P exceed the ideal size by
        more than X percent, reassign the sequence to
        portion P+1
      • Low communication overhead, but possibly high
        PLIB
  • Sorted portion method: master-worker method
      • Sequences are sorted in decreasing length order
        to minimize PLIB
      • The master processor distributes the sequences to
        the worker processors dynamically
      • Low PLIB, but high communication overhead
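
The unsorted portion rule above can be sketched as follows. This is a minimal illustration under stated assumptions: sequences are represented only by their lengths, the function name `portion_partition` and its parameters are hypothetical, and two edge cases the slide does not cover are resolved by choice (an empty portion always accepts a sequence, and the last portion absorbs whatever remains).

```python
def portion_partition(lengths, n_procs, tolerance_pct):
    """Split sequence lengths, in database order, into n_procs portions."""
    ideal = sum(lengths) / n_procs            # Portion_size on the slide
    limit = ideal * (1 + tolerance_pct / 100)  # ideal size plus X percent
    portions = [[] for _ in range(n_procs)]
    loads = [0] * n_procs
    p = 0
    for length in lengths:
        overflows = loads[p] + length > limit
        # ASSUMPTION: never leave a portion empty, never move past the last one.
        if overflows and portions[p] and p < n_procs - 1:
            p += 1
        portions[p].append(length)
        loads[p] += length
    return portions, loads
```

For example, `portion_partition([9, 9, 9, 9, 1, 1, 1, 1], 2, 10)` yields loads of 18 and 22: the split is static and cheap, but the portions are not equal, which is the "possibly high PLIB" noted on the slide.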

6
Techniques: similarity sequence searching
  • Proposed bucket method
      • Statically applies the sorted portion method
  • Algorithm
      • Sequences in the database are sorted in
        decreasing length order
      • Starting from the longest sequence, place the
        sequences in N buckets. For each sequence:
          • Find the sum of the sequence lengths in each
            bucket
          • Find the bucket with the smallest sum
          • Place the sequence in that bucket
          • In the case of a tie, the lowest-numbered
            bucket is selected
      • Each of the N processors performs the sequence
        search in its own bucket
      • If only N/n processors are used, each processor
        searches n buckets
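
The bucket algorithm above can be sketched in a few lines. Sequences are again represented only by their lengths, and the name `bucket_partition` is hypothetical. A heap of `(sum, bucket_number)` pairs is an implementation choice, not something the slide specifies: it finds the smallest sum without rescanning every bucket, and tuple ordering breaks ties in favor of the lowest-numbered bucket.

```python
import heapq


def bucket_partition(lengths, n_buckets):
    """Greedily place sequences, longest first, into the lightest bucket."""
    # (current sum, bucket number); a sorted list is already a valid heap.
    heap = [(0, b) for b in range(n_buckets)]
    buckets = [[] for _ in range(n_buckets)]
    for length in sorted(lengths, reverse=True):
        total, b = heapq.heappop(heap)   # smallest sum, lowest number on ties
        buckets[b].append(length)
        heapq.heappush(heap, (total + length, b))
    return buckets
```

With lengths 7, 5, 4, 3, 1 and two buckets, the result is `[[7, 3], [5, 4, 1]]`, so both buckets hold a total length of 10.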

7
Techniques: similarity sequence searching
  • Evaluation and comparison
  • Comparison of the Bucket and Portion methods
  • Comparison of the Bucket and Master-worker methods
  • Algorithms are implemented on the Intel iPSC/860
  • Preprocessing is performed on a SPARCstation 2
  • Data source is GenBank (release 86.0)
  • Preprocessing overhead is added for the Bucket
    method

8
Techniques: similarity sequence searching
  • Evaluation and comparison, continued
  • Conclusions
      • In all tested cases, the proposed Bucket method
        has
          • Lower PLIB than the Portion method
          • Higher speedup than the master-worker method
      • The Bucket method has an obvious advantage when
          • Sequence lengths are relatively small
          • Processing with a large number of processors

9
Techniques: multiple sequence alignment
  • Sequential Berger-Munson algorithm

10
Techniques: multiple sequence alignment
  • Sequential Berger-Munson algorithm
      • Applies randomized techniques with optimization
        to iteratively improve the multiple sequence
        alignment
  • Description
      • Randomly partition the input sequences into two
        groups
      • Align the two groups of sequences, rather than
        individual sequences, and calculate an alignment
        score [scoring formula not preserved in the
        transcript]
      • If the new alignment score is higher than the
        previous one, the alignment is accepted and the
        gaps are inserted into the sequences accordingly
      • The modified or unmodified alignment is used as
        the input for the next iteration. The process
        stops after q consecutive rejected iterations.
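
The accept/reject loop described above can be sketched as follows. This is only the control flow, not the Berger-Munson implementation: `realign` and `score` are hypothetical callbacks standing in for the group-to-group alignment and the scoring formula, neither of which is given in the slides, and all function names are illustrative.

```python
import random


def random_partition(n_sequences, rng):
    """Split indices 0..n-1 into two non-empty random groups (needs n >= 2)."""
    indices = list(range(n_sequences))
    rng.shuffle(indices)
    cut = rng.randint(1, n_sequences - 1)  # both groups get at least one
    return sorted(indices[:cut]), sorted(indices[cut:])


def iterate_until_stable(alignment, realign, score, q, rng):
    """Keep proposing alignments; stop after q consecutive rejections."""
    best = score(alignment)
    rejected = 0
    while rejected < q:
        candidate = realign(alignment, rng)  # hypothetical: partition + align
        candidate_score = score(candidate)
        if candidate_score > best:           # accept only strict improvements
            alignment, best, rejected = candidate, candidate_score, 0
        else:                                # reject: input stays unmodified
            rejected += 1
    return alignment, best
```

In a real implementation `realign` would call something like `random_partition` and then align the two groups; here any function that returns a candidate alignment will drive the loop.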

11
Techniques: multiple sequence alignment
  • Parallel Berger-Munson algorithm with speculative
    computation
      • A run of consecutive rejected iterations leaves
        the alignment unchanged, so those iterations do
        not depend on each other and can be done in
        parallel
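
One way to picture speculative computation is the sketch below. It is an illustration under assumptions, not the algorithm from the paper: each round evaluates `workers` candidate iterations against the same current alignment, then scans them in order, keeps the first improvement, and discards the speculative work after it. The names `run_sequential`, `run_speculative`, `propose`, and `score` are hypothetical, `propose(alignment, i)` is assumed to be deterministic for a given iteration number `i`, and Python threads stand in for real processors.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequential(alignment, propose, score, q):
    """Reference loop: one iteration at a time, stop after q rejections."""
    best, rejected, i = score(alignment), 0, 0
    while rejected < q:
        candidate = propose(alignment, i)
        s = score(candidate)
        if s > best:
            alignment, best, rejected = candidate, s, 0
        else:
            rejected += 1
        i += 1
    return alignment, best, i


def run_speculative(alignment, propose, score, q, workers):
    """Evaluate `workers` iterations per round, assuming all will be rejected."""
    best, rejected, i = score(alignment), 0, 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while rejected < q:
            current = alignment

            def evaluate(k):
                candidate = propose(current, k)
                return candidate, score(candidate)

            results = list(pool.map(evaluate, range(i, i + workers)))
            for candidate, s in results:      # scan in iteration order
                i += 1
                if s > best:                  # speculation fails here:
                    alignment, best, rejected = candidate, s, 0
                    break                     # later results are discarded
                rejected += 1
                if rejected >= q:
                    break
    return alignment, best, i
```

Because rejected iterations leave the alignment untouched, the speculative version consumes the same iterations and reaches the same result as the sequential one; the parallel work only pays off during runs of rejections.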

12
Techniques: multiple sequence alignment
  • Evaluation
      • Method: improve the alignments generated by
        experts and by another program (CLUSTALV)
      • Data source
          • Three different groups of immunoglobulin
            sequences from the Kabat Database (Beta
            Release 5.0)
          • The average sequence lengths of the three
            groups are similar
          • The number of sequences differs: each group
            has twice as many as the previous group

13
Techniques: multiple sequence alignment
  • Evaluation, continued
  • Alignment score comparison
  • The Berger-Munson method appears to provide more
    accurate alignments, at a cost in computation time
    that is not negligible

14
Techniques: multiple sequence alignment
  • Parallel algorithm: speedup factor
  • Conclusion
      • The original iterative method is a good tool for
        improving alignment results
      • With the parallel speculative computation
        technique, it can
          • Increase the alignment score
          • Reduce the computation time
      • Can achieve higher speedup factors when
          • Processing large-sized sequence groups
          • Processing sequences with high alignment
            scores
      • Cannot be compared with the previous algorithm by
        Ishikawa et al.

15
Observations
  • Similarity Sequence Searching
      • With the increasing size of biosequence databases
        and the growth of computation power, coarse-grain
        parallelism for sequence searching is simpler
        and more effective
      • The time required to process any given sequence
        depends not only on the length of the sequence,
        but also on
          • The composition of the sequence
          • CPU power and CPU availability
      • So dynamic load balancing is still necessary
      • To minimize communication and scheduling
        overhead,
          • Distribute sequences in fixed- or
            variable-size blocks
          • Apply a buffering strategy to reduce data
            starvation and mask scheduling latency
  • Multiple Sequence Alignment
      • With increasing computation power, parallelizing
        a single multiple sequence alignment is not
        necessary. However, using parallelism to
        increase the alignment accuracy is still
        attractive.
      • Trading computation time for alignment accuracy
16
Thank you!