Species Identification through DNA String Analysis

Mark Vorster. Supervisor: Prof Philip Machanick

Transcript and Presenter's Notes

1
Species Identification through DNA String Analysis

Mark Vorster
Supervisor: Prof Philip Machanick

2
Research Overview
Bioinformatics
String Matching
Discussion
Research Overview
Questions

Goal:
Aid bioinformaticians in their research by providing a tool that
identifies similar DNA sequences in a timely manner, so that
homogeneity can be inferred.
Reasons for the problem:
Large data sets
Days of processing
No existing tools specific to this task

3
Bioinformatics

"Research, development, or application of computational tools and
approaches for expanding the use of biological, medical, behavioural or
health data, including those to acquire, store, organize, archive,
analyse, or visualise such data."
- Biomedical Information Science and Technology Initiatives Definition Committee - Dr Huerta

"The branch of science concerned with information and information flow
in biological systems, esp. the use of computational methods in genetics
and genomics."
- Oxford English Dictionary

4
History of Bioinformatics and Genetics

1953: Watson, Crick, Wilkins and Franklin
Discrete abstraction of the base pairs:
Adenine (A) - Thymine (T)
Guanine (G) - Cytosine (C)

[Figure: DNA double helix. Labels: one helical turn = 3.4 nm; sugar-phosphate backbone; base; hydrogen bonds. Source: http://www.accessexcellence.org/RC/VL/GG/images/structure.gif]
5
Sequence Analysis and Sequence Alignment

Sequence alignment
Global alignment is expensive
Assumption: sequences are already globally aligned
Alignment differences:
Original     TGAGCACCT
Insertion    TGACGCACCT
Deletion     TGA_CACCT
Replacement  TGATCACCT
Phylogenetic inference

6
FASTA File Format

Leading ">"
Sequence identifier
Description or comment
A number of lines of genetic code
Other symbols

>SequenceName description or comment
CCGGAATACCTAGGAC
GCCTTCATCCCCCGCC
GGTCTGTGATGTCCCA
ATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGC
TAGTCGGGATGATAAC
CAAGAATTTGTGTCTG
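
The layout described on this slide can be read with a few lines of Python. This is only a minimal sketch, not the tool presented in the talk: the function name `parse_fasta` is hypothetical, the identifier is assumed to be the first whitespace-delimited word after the leading ">", and the "other symbols" mentioned on the slide are not validated.

```python
def parse_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record.

    Assumption: the identifier is the first word after the leading '>',
    and everything else on that line is the description or comment.
    """
    identifier = None
    description = ""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            header = line[1:].split(maxsplit=1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            # Sequence lines are concatenated into one string.
            chunks.append(line)
    if identifier is not None:
        yield identifier, description, "".join(chunks)


# The example from the slide.
example = """>SequenceName description or comment
CCGGAATACCTAGGAC
GCCTTCATCCCCCGCC
GGTCTGTGATGTCCCA
ATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGC
TAGTCGGGATGATAAC
CAAGAATTTGTGTCTG
"""

records = list(parse_fasta(example.splitlines()))
for name, comment, sequence in records:
    print(name, len(sequence))
```

Joining the sequence lines into a single string keeps the later string-matching step independent of how the file happened to be wrapped.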
7
Approximate String Matching Algorithm

Nested loops are inefficient
Dynamic programming
Takes all previously computed information into account
Improves the cost to O(n^2), where n is the number of bases in the
shorter sequence
Goal: find the closest match between two strings,
i.e. the minimum number of differences

8
Approximate String Matching Algorithm

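$D_{i,j}$ is the minimum of:
MatchCost = $D_{i-1,j-1}$, if $p_i = t_j$
ReviseCost = $D_{i-1,j-1} + 1$, if $p_i \neq t_j$
InsertCost = $D_{i-1,j} + 1$
DeleteCost = $D_{i,j-1} + 1$
Boundary conditions: $D_{0,j} = 0$ and $D_{i,0} = i$

Neighbouring cells used for $D_{i,j}$:
$D_{i-1,j-1}$   $D_{i-1,j}$
$D_{i,j-1}$   $D_{i,j}$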
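
The recurrence on this slide can be written out as a short Python sketch. It is an illustration under stated assumptions rather than the presenter's implementation: the function names `approx_match_table` and `best_match` are hypothetical, the comparison is case-sensitive, and the pattern "happy" and text "Have a hsppy day" are borrowed from the worked example on the following slides.

```python
def approx_match_table(pattern, text):
    """Fill the table D for approximate matching of pattern inside text.

    D[i][j] is the fewest differences between the first i characters of
    the pattern and some stretch of the text that ends at position j.
    """
    m, n = len(pattern), len(text)
    # Boundary conditions from the slide: D[0][j] = 0 and D[i][0] = i.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pattern[i - 1] == text[j - 1]:
                diagonal = D[i - 1][j - 1]        # MatchCost
            else:
                diagonal = D[i - 1][j - 1] + 1    # ReviseCost
            insert_cost = D[i - 1][j] + 1         # InsertCost
            delete_cost = D[i][j - 1] + 1         # DeleteCost
            D[i][j] = min(diagonal, insert_cost, delete_cost)
    return D


def best_match(pattern, text):
    """Return (fewest differences, column where the closest match ends)."""
    last_row = approx_match_table(pattern, text)[-1]
    cost = min(last_row)
    return cost, last_row.index(cost)


table = approx_match_table("happy", "Have a hsppy day")
print(best_match("happy", "Have a hsppy day"))
```

Because the first row is all zeros, a match may start anywhere in the text, so the answer is read from the smallest value in the last row rather than from the bottom-right corner.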
9
Approximate String Matching Algorithm
Text: "Have a hsppy day"; pattern: "happy" ("_" marks a space in the text)
         H  a  v  e  _  a  _  h  s  p  p  y  _  d  a  y
NULL  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h     1
a     2
p     3
p     4
y     5
10
Approximate String Matching Algorithm
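The index j runs across the text columns ($t_j$) and the index i runs down the pattern rows ($p_i$). Each cell $D_{i,j}$ is computed from its neighbours $D_{i-1,j-1}$, $D_{i-1,j}$ and $D_{i,j-1}$. ("_" marks a space in the text.)
         H  a  v  e  _  a  _  h  s  p  p  y  _  d  a  y
NULL  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h     1  1  1  1  1  1  1
a     2  2  1  2  2  2  1
p     3  3  2  2  3  3
p     4  4  3  3  3  4
y     5  5  4  4  4  4

MatchCost = $D_{i-1,j-1}$, if $p_i = t_j$
ReviseCost = $D_{i-1,j-1} + 1$, if $p_i \neq t_j$
InsertCost = $D_{i-1,j} + 1$
DeleteCost = $D_{i,j-1} + 1$

For the next empty cell (first "p" row, second "a" column):
MatchCost: N/A
ReviseCost = 3
InsertCost = 2
DeleteCost = 4
-> Min = 2
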

11
Approximate String Matching Algorithm
("_" marks a space in the text)
         H  a  v  e  _  a  _  h  s  p  p  y  _  d  a  y
NULL  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h     1  1  1  1  1  1  1  1  0  1  1  1  1
a     2  2  1  2  2  2  1  2  1  1  2  2  2
p     3  3  2  2  3  3  2  2  2  2  1  2  3
p     4  4  3  3  3  4  3  3  3  3  2  1  2
y     5  5  4  4  4  4  4  4  4  4  3  2  1
12
Approximate String Matching Algorithm

Changes:
$D_{i,0} = i$, if $p_i = t_0$
$D_{i,0} = i + 1$, if $p_i \neq t_0$
$D_{0,j} = j$, if $p_0 = t_j$
$D_{0,j} = j + 1$, if $p_0 \neq t_j$
Additional stop case for mismatch

13
Approximate String Matching Algorithm
    T  A  C  G  G  A  C  G  G  T
T   0  2  3  4  5  6  7  8  9  9
A   2  0  1  2  3  4  5
C   3  1  0  1  2  3  4
G   4  2  1  0  1  2  3
A   5  3  2  1  1  1  2
A   6  4  3  2  2  1  2
G   7  5  4  3  2  2  2
G   8  6  5  4  3  3  3
G   9  7  6  5  4  4  4
A  10  8  7  6  5  4  5
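
The changed boundary conditions can be tried out with a minimal Python sketch. It rests on several assumptions: indices are 0-based (so $p_0$ and $t_0$ are the first characters), both strings are non-empty, the interior cells use the same recurrence as the earlier slides, and the additional stop case is left out because the slide does not spell it out. The function name `anchored_table` is hypothetical, and the two sequences are taken from the row and column labels of the table above.

```python
def anchored_table(pattern, text):
    """Distance table in which both strings are compared from their start.

    Assumption: 0-based indices and non-empty strings; the first row and
    first column follow the changed boundary conditions from the slide.
    """
    m, n = len(pattern), len(text)
    D = [[0] * n for _ in range(m)]
    # First column: D[i][0] = i on a match with t_0, otherwise i + 1.
    for i in range(m):
        D[i][0] = i if pattern[i] == text[0] else i + 1
    # First row: D[0][j] = j on a match with p_0, otherwise j + 1.
    for j in range(n):
        D[0][j] = j if pattern[0] == text[j] else j + 1
    # Interior cells: minimum of match/revise, insert and delete costs.
    for i in range(1, m):
        for j in range(1, n):
            mismatch = 0 if pattern[i] == text[j] else 1
            D[i][j] = min(
                D[i - 1][j - 1] + mismatch,  # MatchCost or ReviseCost
                D[i - 1][j] + 1,             # InsertCost
                D[i][j - 1] + 1,             # DeleteCost
            )
    return D


dna_table = anchored_table("TACGAAGGGA", "TACGGACGGT")
for label, row in zip("TACGAAGGGA", dna_table):
    print(label, row)
```

Unlike the earlier version, the first row is no longer all zeros, so a match cannot start for free in the middle of the other sequence, which fits the assumption that the sequences begin at the same place.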
14
Discussion

Grouping algorithm
Scale of the problem:
400-800 bases per sequence
Tens of thousands of sequences
Assumptions:
Sequences are globally aligned
Sequences begin at the same place

15
Example Grouping
Seq336 HK2QS7R01AXRJ6 Seq218 Seq38 Seq235 Seq89
Seq382 HK2QS7R01BR4Q9 Seq173
Seq180 HK2QS7R01ABFDP Seq339 Seq289 Seq491 Seq319
Seq269 HK2QS7R01AZHD7 Seq402 Seq112 Seq203 Seq137
Seq210 HK2QS7R01BMNQ4 Seq364
Seq270 HK2QS7R01AZFOG Seq388 Seq441
Seq442 HK2QS7R01ADASO Seq426 Seq233 Seq374 Seq416

16
Results
Comparisons for n sequences: n(n-1)/2
O(n^2), where n is the number of sequences
1600 comparisons per second
10,000 sequences: 8.6 hours (down from 10 days)
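
The figures on this slide can be checked with a few lines of Python. The sketch assumes the quoted rate of 1600 comparisons per second holds for the whole run; the function name `pairwise_comparisons` is hypothetical.

```python
def pairwise_comparisons(n):
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


RATE_PER_SECOND = 1600   # comparisons per second, as quoted on the slide
n_sequences = 10_000

comparisons = pairwise_comparisons(n_sequences)
hours = comparisons / RATE_PER_SECOND / 3600
print(comparisons, round(hours, 2))
```

For 10,000 sequences this gives just under 50 million comparisons and a running time between 8.6 and 8.7 hours, in line with the figure quoted on the slide.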
17
Questions?