Reference-Based Alignment in Large Sequence Databases

1
Reference-Based Alignment in Large Sequence
Databases
Speaker: Panagiotis Papapetrou, Boston University

Joint work with:
Vassilis Athitsos (UTA)
George Kollios (BU)
Dimitrios Gunopulos (UOA and UCR)

2
General Problem

Given:
S: a collection of strings.
Q: a query string.
D: a similarity measure.
Find the substring in S that is most similar to Q under the similarity measure D.

3
Motivation

Spell-checking: given some input text, the spell-checker consults its dictionary to find words of high similarity to the text, so as to identify potential typos.
Data cleaning: data obtained from different sources might contain inconsistencies, which can be eliminated by looking for similar entities (strings) in the data.
Near homology search in biological sequences: given different genomes, we want to find regions of high similarity that were the result of a mutation, etc.

4
Motivation

Our focus:
Near homology search in DNA sequences.
Two major requirements:
Retrieve near-exact matches of long query sequences efficiently.

Q: TCTAGGGCA
Database sequence: ACTTAGCTGTAGTCGTTCTATGGCATATGCATGCTGATCTCGTGCGTCATG
5
Motivation

Large query sizes:
Locate genes in large genomes.
Find chromosome similarities across different organisms.
Chromosomes can be relatively large (e.g. Human Chromosome 1 is approx. 272 million bases).
Near homology search:
Meaningful for DNA similarity search.
Genomes evolve over time due to small mutations.
Genomes from different organisms might have high similarity.

6
Problem Statement

Given:
S: a collection of DNA sequences.
Q: a DNA query sequence.
D: a similarity measure.
Find the most similar subsequence in S, with a deviation of at most d·|Q| edit operations.
d is at most 15% (near homology search).

7
The Edit Distance [Levenshtein, 1966]

Measures how dissimilar two strings are.
ED(A, B) = minimum number of operations needed to transform A into B.
Operations: insertion, deletion, substitution.
Example:
A = ATC and B = ACTG

A: A - T C
B: A C T G
ED(A, B) = 2
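
The definition above can be checked with the textbook dynamic programming recurrence. This is a minimal sketch, not code from the presentation; the function name `edit_distance` is chosen here for illustration, and all three operations are assumed to cost 1.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("ATC", "ACTG"))
```

For the slide's example the function returns 2: one insertion (C) and one substitution (C to G).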
15
Smith-Waterman [Smith and Waterman, 1981]

Similarity measure used for local alignment.
The match can be a subsequence of the query sequence.
Define three scoring parameters: match, mismatch, gap.
Scoring parameters are defined by the user.
Example:
A = ATC and B = TATTCG
match = 2, mismatch = -1, gap = -1.
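
A minimal sketch of the local alignment score for this example, assuming a linear gap penalty (each gap position costs the same). The function name `smith_waterman_score` is illustrative, and only the best score is computed, not the alignment itself.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("ATC", "TATTCG"))
```

With the slide's parameters the best score is 5, obtained by aligning AT-C against ATTC (three matches and one gap).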

22
Strategy: Identify Candidate Endpoints
[Figure: the query Q is looked up in an indexing structure, which returns candidate endpoints in the database sequence X.]

Use dynamic programming only to evaluate the
candidates.
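
For reference, the dynamic programming that a brute-force search would run at every database position can be sketched as below. This is an illustrative baseline under unit edit costs, not the presentation's implementation; the name `endpoint_distances` is hypothetical. For each endpoint t it returns the smallest edit distance between the query and any substring of the text ending at t.

```python
def endpoint_distances(query: str, text: str) -> list:
    """dists[t] = min edit distance between query and a substring of text ending at t.

    Positions are 1-based; dists[0] corresponds to the empty prefix of text.
    """
    m = len(query)
    prev = list(range(m + 1))  # before any text is read, only deletions are possible
    dists = [prev[m]]
    for ch in text:
        curr = [0]  # a match may start at any position, so the empty prefix costs 0
        for i in range(1, m + 1):
            cost = 0 if query[i - 1] == ch else 1
            curr.append(min(prev[i] + 1,          # text character left unmatched
                            curr[i - 1] + 1,      # query character left unmatched
                            prev[i - 1] + cost))  # match or substitute
        dists.append(curr[m])
        prev = curr
    return dists


Q = "TCTAGGGCA"
X = "ACTTAGCTGTAGTCGTTCTATGGCATATGCATGCTGATCTCGTGCGTCATG"
dists = endpoint_distances(Q, X)
best = min(dists)
print(best, [t for t, d in enumerate(dists) if d == best])
```

On the query and database sequence shown earlier in the deck, the smallest distance is 1, reached at position 25, where the substring TCTATGGCA ends (one substitution away from Q). A filter that returns only a few candidate endpoints would let this computation run near those endpoints instead of over the whole sequence.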

23
RBSA

Decompose subsequence matching into two distinct problems:
Fixed query length:
Assumes all queries have the same length.
Variable query length:
Uses the solution to the fixed query length problem.
Achieves efficient retrieval for queries of arbitrary length.

24
Fixed query length

Q: query.
(X, t): database position t.
Q and (X, t) are each mapped into a number, using:
D: the Edit Distance.
R: a reference sequence.
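
The slide's exact mapping is not preserved in the transcript. As an assumption, the sketch below maps a whole string to its edit distance from R and uses the triangle inequality of the edit distance: |ED(Q, R) - ED(X, R)| can never exceed ED(Q, X), so a large gap between the two numbers rules X out without comparing Q and X directly. It works on whole strings rather than database positions (X, t), and the names `embed` and `lower_bound` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def embed(s: str, ref: str) -> int:
    """Assumed mapping: a string becomes its distance to the reference sequence."""
    return edit_distance(s, ref)


def lower_bound(q_emb: int, x_emb: int) -> int:
    """Triangle inequality: this value never exceeds the true distance."""
    return abs(q_emb - x_emb)


ref = "ACGTACGT"
query = "ACGTTCGT"
database = ["ACGTACGA", "TTTTGGGGTTTTGGGG", "ACGTTCGA"]
threshold = 2

q_emb = embed(query, ref)                      # computed once per query
db_embs = [embed(x, ref) for x in database]    # would be precomputed offline
survivors = [x for x, e in zip(database, db_embs)
             if lower_bound(q_emb, e) <= threshold]
print(survivors)
```

Only the survivors need the full dynamic programming comparison against the query; anything filtered out is guaranteed to be farther than the threshold.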

26
Database Embedding
[Figure: a database sequence drawn as elements X1, X2, ..., X15.]
37
Refine step

Refine only those database positions that were
not pruned by filtering.
For refinement we can use either the Edit
Distance or the Smith-Waterman dynamic
programming algorithms.
For Smith-Waterman, an upper bound can be applied:

SW(Q, X, t) ≤ 2|Q| - LB_ED(Q, X, t)
38
Offline selection of reference sequences

Goal: represent each database position (X, t) using a set of reference sequences Rt.
Given:
Qsample: a set of random queries of size q.
R: a set of random reference sequences of size q.
For each (X, t):
Choose Rt that prunes (X, t) for the largest number of queries in Qsample.
Greedy selection.
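
One plausible reading of the greedy step, sketched under assumptions: for a fixed database position, suppose we already know which sample queries each candidate reference prunes. The sketch then picks references one at a time, each time taking the one that prunes the most queries not yet covered. The budget `k`, the function name `greedy_select`, and the toy data are hypothetical and do not come from the presentation.

```python
def greedy_select(prunes: dict, k: int):
    """Greedily pick up to k references.

    prunes maps a reference id to the set of sample-query ids for which
    that reference prunes the database position under consideration.
    """
    chosen = []
    covered = set()
    for _ in range(k):
        remaining = [r for r in prunes if r not in chosen]
        if not remaining:
            break
        # Take the reference that prunes the most queries not covered so far.
        best = max(remaining, key=lambda r: len(prunes[r] - covered))
        if not prunes[best] - covered:
            break  # no remaining reference adds any pruning power
        chosen.append(best)
        covered |= prunes[best]
    return chosen, covered


# Toy pruning table for one database position (illustrative values only).
prunes_at_position = {
    "R1": {0, 1, 2, 6},
    "R2": {2, 3, 5},
    "R3": {0, 1},
    "R4": {4},
}
chosen, covered = greedy_select(prunes_at_position, k=2)
print(chosen, sorted(covered))
```

In the toy table, R1 is taken first because it prunes four queries, and R2 second because it adds two new ones, while R3 adds none.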

39
Alphabet Reduction

Improve filtering power of RBSA by applying alphabet reduction.
Σ = {A, C, G, T}.
Use four letter-collapsing schemes:
Scheme 0: no collapsing.
Scheme 1: A, C -> X and G, T -> Y.
Scheme 2: A, G -> X and C, T -> Y.
Scheme 3: A, T -> X and C, G -> Y.
The number of possible reference sequences decreases with the alphabet size:
4^q = (2^q)^2 vs. 2^q
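
The four schemes translate directly into lookup tables. The sketch below is illustrative only; the names `SCHEMES` and `collapse` are chosen here, and the last lines check the counting argument for a sample length q = 10.

```python
# Letter-collapsing schemes from the slide; scheme 0 leaves the sequence unchanged.
SCHEMES = {
    0: {},
    1: {"A": "X", "C": "X", "G": "Y", "T": "Y"},
    2: {"A": "X", "G": "X", "C": "Y", "T": "Y"},
    3: {"A": "X", "T": "X", "C": "Y", "G": "Y"},
}


def collapse(seq: str, scheme: int) -> str:
    """Rewrite a DNA sequence over the reduced alphabet of the given scheme."""
    table = SCHEMES[scheme]
    return "".join(table.get(ch, ch) for ch in seq)


for scheme in SCHEMES:
    print(scheme, collapse("ACGT", scheme))

q = 10
# Number of distinct sequences of length q over 4 letters versus 2 letters.
print(4 ** q, 2 ** q)
```

Collapsing never increases the edit distance between two sequences, since any sequence of edits on the originals is still valid after both are rewritten.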

40
Variable Query Length

So far we assumed that |Qi| = q, for every Qi.
Q can have arbitrary size.
For simplicity, assume that |Q| = a·q.
At query time:
Break Q into non-overlapping segments of size q.
Two versions of RBSA:
Exact and approximate.
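
The segmentation step is simple enough to show directly. This sketch follows the slide's simplifying assumption that the query length is an exact multiple of q; the function name `split_query` is illustrative.

```python
def split_query(query: str, q: int) -> list:
    """Break the query into non-overlapping segments of length q."""
    if q <= 0 or len(query) % q != 0:
        raise ValueError("query length must be a positive multiple of q")
    return [query[i:i + q] for i in range(0, len(query), q)]


print(split_query("TCTAGGGCA", 3))
```

Each segment then has the fixed length q, so the fixed-length machinery can be applied to it unchanged.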

41
Exact version

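Let Xst be a subsequence match for Q, within d·|Q| edit operations.
Then at least one segment Qi has, within Xst, a subsequence match within edit distance d·q.
(If every segment needed more than d·q operations, the whole match would need more than a·d·q = d·|Q|.)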

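[Figure: Q is split into a = 3 segments Q1, Q2, Q3, each of length q; its match Xst spans positions s to t of the database sequence ACTTAGCTGTAGTCGTTCTATGGCATATGCATGCTGATCTCGTGCGTCATG.]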
42
Exact version

Filter and refine.
Break Q into a non-overlapping segments Q1, Q2, ..., Qa.

[Figure: Q split into segments Q1, Q2, Q3, each of length q.]

For some Qi:
ED(Qi, Xst) ≤ d·q
Take the union of all candidates from all Qi's.
Perform the refinement step.

43
Approximate version

Question:
Use only one segment Qi of Q.
What is the probability P(Qi) that the subsequence match of Q is included in the candidates of Qi?
Proposition:
P(Qi) ≥ 50%.
Using [Hamza et al., 1995].

44
Approximate version

By the previous proposition:
If a single Qi is chosen and all candidate endpoints are generated, there is at least a 50% probability of finding the correct endpoint of the optimal subsequence match.

45
Approximate version

By the previous proposition:
Assume that the optimal match was not found under Qi.
1 - P(Qj): probability of not finding the optimal match under Qj, with 1 - P(Qj) ≤ 0.5, for j = 1, ..., a.
If we use p segments Q1, Q2, ..., Qp, the probability that none of them finds the optimal match is at most (0.5)^p.
Thus, the probability of retrieving the optimal match is at least
1 - (0.5)^p
For p = 10, this probability is at least 99.9%.
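
The arithmetic behind the 99.9% figure can be checked in a few lines. The sketch assumes, as the slide's product does, that the segments fail independently, each with probability at most 0.5; the function name `success_probability` is illustrative.

```python
def success_probability(p: int, miss: float = 0.5) -> float:
    """Lower bound on finding the optimal match when p segments are used.

    Assumes each segment misses independently with probability at most `miss`.
    """
    return 1.0 - miss ** p


for p in (1, 5, 9, 10):
    print(p, success_probability(p))
```

For p = 10 the bound is 1 - 1/1024, about 0.99902, while p = 9 gives about 0.99805, so 10 is the smallest number of segments that reaches 99.9% under this assumption.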

46
Experimental Setup

Datasets
Database:
Human Chromosome 22 (35,059,634 bases).
Queries:
Mouse genome (random chromosomes).
Variable size: 40, ..., 10K bases.
Similarity to DB varied within 5%, 10% and 15%.
Each dataset contains 200 queries.

47
Performance Measures

Accuracy:
Percentage of queries giving correct results.
Efficiency:
DP cell cost: cost of dynamic programming, as a percentage of brute-force search cost.
Retrieval runtime cost: CPU time per query, as a percentage of brute-force CPU time.
Brute force:
Full dynamic programming algorithm (Edit Distance or Smith-Waterman).

48
Competitors