Title: Species Identification through DNA String Analysis
1 Species Identification through DNA String Analysis
- Mark Vorster
- Supervisor: Prof Philip Machanick
2 Research Overview
Bioinformatics
String Matching
Discussion
Research Overview
Questions
- Goal
  - Aid bioinformaticians in research by providing a tool which can identify similar DNA sequences, in a timely manner, in order to infer homogeneity.
- Reason for problems
  - Large data sets
  - Days of processing
  - No existing specific tools
3 Bioinformatics
- "Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualise such data."
  - Biomedical Information Science and Technology Initiatives Definition Committee, Dr Huerta
- "The branch of science concerned with information and information flow in biological systems, esp. the use of computational methods in genetics and genomics."
  - Oxford English Dictionary
4 History of Bioinformatics and Genetics
- 1953: Watson, Crick, Wilkins and Franklin.
- Discrete abstraction
  - Adenine (A) pairs with Thymine (T)
  - Guanine (G) pairs with Cytosine (C)
[Figure: DNA double helix. Labels: one helical turn = 3.4 nm; sugar-phosphate backbone; base; hydrogen bonds. Source: http://www.accessexcellence.org/RC/VL/GG/images/structure.gif]
5 Sequence Analysis and Sequence Alignment
- Sequence Alignment
  - Global alignment is expensive
  - Assumption: sequences are already globally aligned
- Alignment differences (reference: TGAGCACCT)
  - Insertion: TGACGCACCT
  - Deletion: TGA_CACCT
  - Replacement: TGATCACCT
- Phylogenetic inference
6 FASTA File Format
- Leading >
- Sequence Identifier
- Description or comment
- A number of lines of genetic code
- Other Symbols
>SequenceName description or comment
CCGGAATACCTAGGAC
GCCTTCATCCCCCGCC
GGTCTGTGATGTCCCA
ATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGC
TAGTCGGGATGATAAC
CAAGAATTTGTGTCTG
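The layout above can be read with a few lines of Python. This is a minimal sketch, not the tool's actual reader: the function name `parse_fasta` and the choice to return a dictionary are assumptions made for illustration, and the "other symbols" mentioned above are passed through unchanged.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each sequence identifier to (description, sequence).

    Hypothetical helper: a header line starts with '>', and every
    following line up to the next header is part of the sequence.
    """
    records: Dict[str, Tuple[str, str]] = {}
    identifier: Optional[str] = None
    description = ""
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                records[identifier] = (description, "".join(chunks))
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            # Sequence lines are concatenated; embedded spaces are dropped.
            chunks.append(line.replace(" ", ""))
    if identifier is not None:
        records[identifier] = (description, "".join(chunks))
    return records


example = """>SequenceName description or comment
CCGGAATACCTAGGAC
GCCTTCATCCCCCGCC
>NextSequence description or comment
ACGCCTGATTACCTGC
"""
records = parse_fasta(example.splitlines())
print(records["SequenceName"][1])
```

Each record keeps its identifier and description separate, so a later comparison stage only needs the sequence string.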
7 Approximate String Matching Algorithm
- Nested loops are inefficient
- Dynamic Programming
  - Takes into account all previous information
  - Improved to $O(n^2)$, where $n$ is the number of bases in the shorter sequence
- Goal: find the closest match between two strings
  - Or the minimum number of differences
8 Approximate String Matching Algorithm
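- $D_{i,j}$ is the minimum of
  - MatchCost: $D_{i-1,j-1}$, if $p_i = t_j$
  - ReviseCost: $D_{i-1,j-1} + 1$, if $p_i \neq t_j$
  - InsertCost: $D_{i-1,j} + 1$
  - DeleteCost: $D_{i,j-1} + 1$
- $D_{0,j} = 0$ and $D_{i,0} = i$

Cells used to compute $D_{i,j}$:
$D_{i-1,j-1}$   $D_{i-1,j}$
$D_{i,j-1}$     $D_{i,j}$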
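The recurrence above can be sketched in Python as follows. The function name `approximate_match_table` is hypothetical, and the comparison is assumed to be case-sensitive (so "h" and "H" differ); the pattern "happy" and the text "Have a hsppy day" are the ones used in the worked tables of this talk.

```python
from typing import List


def approximate_match_table(pattern: str, text: str) -> List[List[int]]:
    """Fill D with D[0][j] = 0 and D[i][0] = i, then apply the recurrence."""
    rows, cols = len(pattern) + 1, len(text) + 1
    d = [[0] * cols for _ in range(rows)]  # row 0 stays all zeros
    for i in range(1, rows):
        d[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            if pattern[i - 1] == text[j - 1]:
                diagonal = d[i - 1][j - 1]      # MatchCost
            else:
                diagonal = d[i - 1][j - 1] + 1  # ReviseCost
            insert_cost = d[i - 1][j] + 1       # InsertCost
            delete_cost = d[i][j - 1] + 1       # DeleteCost
            d[i][j] = min(diagonal, insert_cost, delete_cost)
    return d


table = approximate_match_table("happy", "Have a hsppy day")
best = min(table[-1])        # fewest differences over all end positions
end = table[-1].index(best)  # first text position where that match ends
print(best, end)
```

Because the top row is all zeros, a match may start anywhere in the text; the smallest value in the last row gives the number of differences in the closest match.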
9 Approximate String Matching Algorithm
        H  a  v  e  ␣  a  ␣  h  s  p  p  y  ␣  d  a  y
NULL 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h    1
a    2
p    3
p    4
y    5
(␣ marks a space in the text)
10 Approximate String Matching Algorithm
        H  a  v  e  ␣  a  ␣  h  s  p  p  y  ␣  d  a  y
NULL 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h    1  1  1  1  1  1  1
a    2  2  1  2  2  2  1
p    3  3  2  2  3  3
p    4  4  3  3  3  4
y    5  5  4  4  4  4

Next cell: $D_{i,j}$ with $i = 3$ ($p_i$ = p) and $j = 6$ ($t_j$ = a)
- MatchCost: $D_{i-1,j-1}$, if $p_i = t_j$: N/A
- ReviseCost: $D_{i-1,j-1} + 1$, if $p_i \neq t_j$: 3
- InsertCost: $D_{i-1,j} + 1$: 2
- DeleteCost: $D_{i,j-1} + 1$: 4
- Min = 2
11 Approximate String Matching Algorithm
        H  a  v  e  ␣  a  ␣  h  s  p  p  y  ␣  d  a  y
NULL 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h    1  1  1  1  1  1  1  1  0  1  1  1  1
a    2  2  1  2  2  2  1  2  1  1  2  2  2
p    3  3  2  2  3  3  2  2  2  2  1  2  3
p    4  4  3  3  3  4  3  3  3  3  2  1  2
y    5  5  4  4  4  4  4  4  4  4  3  2  1
12 Approximate String Matching Algorithm
- Changes
  - $D_{i,0} = i$, if $p_i = t_0$
  - $D_{i,0} = i + 1$, if $p_i \neq t_0$
  - $D_{0,j} = j$, if $p_0 = t_j$
  - $D_{0,j} = j + 1$, if $p_0 \neq t_j$
  - Additional stop case for mismatch
13 Approximate String Matching Algorithm
    T   A   C   G   G   A   C   G   G   T
T   0   2   3   4   5   6   7   8   9   9
A   2   0   1   2   3   4   5
C   3   1   0   1   2   3   4
G   4   2   1   0   1   2   3
A   5   3   2   1   1   1   2
A   6   4   3   2   2   1   2
G   7   5   4   3   2   2   2
G   8   6   5   4   3   3   3
G   9   7   6   5   4   4   4
A  10   8   7   6   5   4   5
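The changed boundary rules can be sketched by filling the first row and column from the character comparisons and leaving the inner recurrence as before. This is an illustrative sketch only: the function name `aligned_distance_table` is hypothetical, and the "additional stop case for mismatch" is not implemented because the slides do not specify it, so the sketch fills every cell rather than stopping early.

```python
from typing import List


def aligned_distance_table(pattern: str, text: str) -> List[List[int]]:
    """Table for sequences assumed to begin at the same place.

    Indices are 0-based with no NULL row or column: the first row and
    column are i (or j) on a match with the other string's first
    character, and i + 1 (or j + 1) on a mismatch.
    """
    if not pattern or not text:
        raise ValueError("both sequences must be non-empty")
    rows, cols = len(pattern), len(text)
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i if pattern[i] == text[0] else i + 1
    for j in range(cols):
        d[0][j] = j if pattern[0] == text[j] else j + 1
    for i in range(1, rows):
        for j in range(1, cols):
            if pattern[i] == text[j]:
                diagonal = d[i - 1][j - 1]      # MatchCost
            else:
                diagonal = d[i - 1][j - 1] + 1  # ReviseCost
            d[i][j] = min(diagonal, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d


table = aligned_distance_table("TACGAAGGGA", "TACGGACGGT")
for base, row in zip("TACGAAGGGA", table):
    print(base, row)
```

The first seven columns of each printed row can be compared with the table above; the remaining columns are filled here only because no early stop is applied.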
14 Discussion
- Grouping Algorithm
- Scale of the problem
  - 400 to 800 bases per sequence
  - Tens of thousands of sequences
- Assumptions
  - Sequences globally aligned
  - Sequences begin at the same place
15 Example Grouping
Seq336 HK2QS7R01AXRJ6 Seq218 Seq38 Seq235 Seq89
Seq382 HK2QS7R01BR4Q9 Seq173
Seq180 HK2QS7R01ABFDP Seq339 Seq289 Seq491 Seq319
Seq269 HK2QS7R01AZHD7 Seq402 Seq112 Seq203 Seq137
Seq210 HK2QS7R01BMNQ4 Seq364
Seq270 HK2QS7R01AZFOG Seq388 Seq441
Seq442 HK2QS7R01ADASO Seq426 Seq233 Seq374 Seq416
16 Results
Comparisons for $n$ sequences: $(n-1)n/2$
$O(n^2)$, where $n$ is the number of sequences.
1600 comparisons per second.
10,000 sequences: 8.6 hours (from 10 days).
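As a quick check of these figures, the arithmetic can be worked through in a few lines. The helper name `pairwise_comparisons` is hypothetical, and the rate is assumed to stay constant at the quoted 1600 comparisons per second.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n sequences: (n - 1) * n / 2."""
    return n * (n - 1) // 2


sequences = 10_000
rate_per_second = 1600  # figure quoted above, assumed constant

comparisons = pairwise_comparisons(sequences)
hours = comparisons / rate_per_second / 3600
print(comparisons, round(hours, 2))
```

Under that assumption, 10,000 sequences need just under 50 million comparisons, which comes to a little under 8.7 hours and agrees with the roughly 8.6 hours quoted.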
17 Questions
?