Species Identification through DNA String Analysis - PowerPoint PPT Presentation

About This Presentation
Title:

Species Identification through DNA String Analysis

Description:

Mark Vorster. Supervisor: Prof Philip Machanick


Transcript and Presenter's Notes


1
Species Identification through DNA String Analysis
  • Mark Vorster
  • Supervisor: Prof Philip Machanick

2
Research Overview
-
-
-
-
Bioinformatics
String Matching
Discussion
Research Overview
Questions
  • Goal
    • Aid bioinformaticians in research by providing a
      tool that identifies similar DNA sequences in a
      timely manner, in order to infer homogeneity.
  • Reasons for problems
    • Large data sets
    • Days of processing
    • No existing specific tools

3
Bioinformatics
  • "Research, development, or application of
    computational tools and approaches for expanding
    the use of biological, medical, behavioural or
    health data, including those to acquire, store,
    organize, archive, analyse, or visualise such
    data."
    • Biomedical Information Science and Technology
      Initiatives Definition Committee - Dr Huerta
  • "The branch of science concerned with information
    and information flow in biological systems, esp.
    the use of computational methods in genetics and
    genomics."
    • Oxford English Dictionary

4
History of Bioinformatics and Genetics
  • 1953: Watson, Crick, Wilkins and Franklin.
  • Discrete abstraction
    • Adenine - Thymine
    • Guanine - Cytosine

[Figure: DNA double helix, labelled with one helical turn = 3.4 nm, sugar-phosphate backbone, base, and hydrogen bonds. Source: http://www.accessexcellence.org/RC/VL/GG/images/structure.gif]
5
Sequence Analysis and Sequence Alignment
  • Sequence alignment
    • Global alignment is expensive
    • Assumption: sequences are already globally aligned
  • Alignment differences (reference: TGAGCACCT)
    • Insertion: TGACGCACCT
    • Deletion: TGA_CACCT
    • Replacement: TGATCACCT
  • Phylogenetic inference
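
The three kinds of alignment difference listed above can be illustrated with a small sketch. It uses Python's standard `difflib.SequenceMatcher`, which is not the method presented in these slides; the helper name `edit_kinds` is an assumption made for this example, and the deletion variant is written without the `_` gap marker.

```python
from difflib import SequenceMatcher

REFERENCE = "TGAGCACCT"  # reference sequence from the slide


def edit_kinds(reference, variant):
    """Return the difflib opcode tags that are not plain matches."""
    matcher = SequenceMatcher(None, reference, variant, autojunk=False)
    return [tag for tag, *_ in matcher.get_opcodes() if tag != "equal"]


for label, variant in [
    ("Insertion", "TGACGCACCT"),
    ("Deletion", "TGACACCT"),      # the slide's TGA_CACCT with the gap removed
    ("Replacement", "TGATCACCT"),
]:
    print(label, edit_kinds(REFERENCE, variant))
```

Each variant differs from the reference by a single edit, which is why one tag per variant is reported.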

6
FASTA File Format
  • Leading >
  • Sequence identifier
  • Description or comment
  • A number of lines of genetic code
  • Other symbols

>SequenceName description or comment
CCGGAATACCTAGGAC
GCCTTCATCCCCCGCC
GGTCTGTGATGTCCCA
ATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGC
TAGTCGGGATGATAAC
CAAGAATTTGTGTCTG
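
As a rough illustration of the layout described above, the following minimal sketch reads FASTA text into records. The function names `parse_fasta` and `_make_record` are assumptions made for this example, not part of the presented tool, and the sketch ignores the "other symbols" the slide mentions.

```python
EXAMPLE = """\
>SequenceName description or comment
CCGGAATACCTAGGAC
GCCTTCATCCCCCGCC
GGTCTGTGATGTCCCA
ATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGC
TAGTCGGGATGATAAC
CAAGAATTTGTGTCTG
"""


def _make_record(header, chunks):
    """Split a header into identifier and description; join sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description.strip(), "".join(chunks)


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A leading '>' starts a new record; store the previous one first.
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence line: drop any spaces used for visual grouping.
            chunks.append(line.replace(" ", ""))
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


records = parse_fasta(EXAMPLE)
for identifier, description, sequence in records:
    print(identifier, len(sequence), description)
```

Joining the sequence lines into one string per record gives the string matching step a single sequence to compare.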
7
Approximate String Matching Algorithm
  • Nested loops are inefficient
  • Dynamic programming
    • Takes all previous information into account
    • Improves to O(n^2), where n is the number of bases in
      the shorter sequence
  • Goal: find the closest match between two strings
    • Or the minimum number of differences

8
Approximate String Matching Algorithm
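  • $D_{i,j}$ is the minimum of
    • MatchCost: $D_{i-1,j-1}$, if $p_i = t_j$
    • ReviseCost: $D_{i-1,j-1} + 1$, if $p_i \neq t_j$
    • InsertCost: $D_{i-1,j} + 1$
    • DeleteCost: $D_{i,j-1} + 1$
  • $D_{0,j} = 0$ and $D_{i,0} = i$

Neighbouring cells used to compute $D_{i,j}$:
$D_{i-1,j-1}$   $D_{i-1,j}$
$D_{i,j-1}$     $D_{i,j}$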
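
The recurrence above can be sketched in Python as follows. This is a minimal illustration rather than the presenter's implementation: the function name `approximate_match_table` is an assumption, and the pattern "happy" and text "Have a hsppy day" are taken from the example tables on the following slides.

```python
def approximate_match_table(pattern, text):
    """Fill the table D for approximate matching of pattern inside text.

    Row 0 is all zeros (a match may start anywhere in the text) and
    column 0 holds i (the first i pattern characters are unmatched).
    """
    rows, cols = len(pattern) + 1, len(text) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        D[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            if pattern[i - 1] == text[j - 1]:
                diagonal = D[i - 1][j - 1]        # MatchCost
            else:
                diagonal = D[i - 1][j - 1] + 1    # ReviseCost
            insert_cost = D[i - 1][j] + 1         # InsertCost
            delete_cost = D[i][j - 1] + 1         # DeleteCost
            D[i][j] = min(diagonal, insert_cost, delete_cost)
    return D


table = approximate_match_table("happy", "Have a hsppy day")
best = min(table[-1])          # fewest differences over all end positions
end = table[-1].index(best)    # text position where the best match ends
print(best, end)
```

The smallest value in the last row is the minimum number of differences between the pattern and any substring of the text, and its column marks where that closest match ends.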
9
Approximate String Matching Algorithm
(_ marks a space in the text "Have a hsppy day")
        H a v e _ a _ h s p p y _ d a y
NULL  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h     1
a     2
p     3
p     4
y     5
10
Approximate String Matching Algorithm
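(_ marks a space in the text "Have a hsppy day")
        H a v e _ a _ h s p p y _ d a y
NULL  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h     1 1 1 1 1 1 1
a     2 2 1 2 2 2 1
p     3 3 2 2 3 3
p     4 4 3 3 3 4
y     5 5 4 4 4 4

[Diagram labels: i, j, $p_i$, $t_j$ and the neighbouring cells $D_{i-1,j-1}$, $D_{i-1,j}$, $D_{i,j-1}$ mark the cell being computed.]
  • MatchCost: $D_{i-1,j-1}$, if $p_i = t_j$
  • ReviseCost: $D_{i-1,j-1} + 1$, if $p_i \neq t_j$
  • InsertCost: $D_{i-1,j} + 1$
  • DeleteCost: $D_{i,j-1} + 1$
  • MatchCost: N/A
  • ReviseCost: 3
  • InsertCost: 2
  • DeleteCost: 4
  • -> Min: 2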
11
Approximate String Matching Algorithm
(_ marks a space in the text "Have a hsppy day")
        H a v e _ a _ h s p p y _ d a y
NULL  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h     1 1 1 1 1 1 1 1 0 1 1 1 1
a     2 2 1 2 2 2 1 2 1 1 2 2 2
p     3 3 2 2 3 3 2 2 2 2 1 2 3
p     4 4 3 3 3 4 3 3 3 3 2 1 2
y     5 5 4 4 4 4 4 4 4 4 3 2 1
12
Approximate String Matching Algorithm
  • Changes
    • $D_{i,0} = i$, if $p_i = t_0$
    • $D_{i,0} = i + 1$, if $p_i \neq t_0$
    • $D_{0,j} = j$, if $p_0 = t_j$
    • $D_{0,j} = j + 1$, if $p_0 \neq t_j$
  • Additional stop case for mismatch

13
Approximate String Matching Algorithm
    T  A  C  G  G  A  C  G  G  T
T   0  2  3  4  5  6  7  8  9  9
A   2  0  1  2  3  4  5
C   3  1  0  1  2  3  4
G   4  2  1  0  1  2  3
A   5  3  2  1  1  1  2
A   6  4  3  2  2  1  2
G   7  5  4  3  2  2  2
G   8  6  5  4  3  3  3
G   9  7  6  5  4  4  4
A  10  8  7  6  5  4  5
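
The table above appears to combine the changed boundary conditions with the same minimum-of-four-costs rule for the interior cells. A minimal sketch under that assumption is shown below. The function name `aligned_distance_table` is hypothetical, the "additional stop case for mismatch" is not modelled because the slides do not specify it, and the sketch fills every column rather than stopping where the slide's table does.

```python
def aligned_distance_table(pattern, text):
    """Fill D for two sequences assumed to begin at the same place.

    Indices are zero-based: D[i][j] compares pattern[:i + 1] with
    text[:j + 1]. Both inputs must be non-empty.
    """
    rows, cols = len(pattern), len(text)
    D = [[0] * cols for _ in range(rows)]
    # Changed boundary conditions: first column and first row.
    for i in range(rows):
        D[i][0] = i if pattern[i] == text[0] else i + 1
    for j in range(cols):
        D[0][j] = j if pattern[0] == text[j] else j + 1
    # Interior cells: minimum of match/revise, insert and delete costs.
    for i in range(1, rows):
        for j in range(1, cols):
            if pattern[i] == text[j]:
                diagonal = D[i - 1][j - 1]        # MatchCost
            else:
                diagonal = D[i - 1][j - 1] + 1    # ReviseCost
            D[i][j] = min(diagonal, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D


table = aligned_distance_table("TACGAAGGGA", "TACGGACGGT")
for label, row in zip("TACGAAGGGA", table):
    print(label, row)
```

Because the first row and column now grow with position, differences are counted from the start of both sequences, which matches the assumption that the sequences begin at the same place.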
14
Discussion
  • Grouping algorithm
  • Scale of the problem
    • 400-800 bases per sequence
    • Tens of thousands of sequences
  • Assumptions
    • Sequences are globally aligned
    • Sequences begin at the same place
15
Example Grouping
Seq336 HK2QS7R01AXRJ6 Seq218 Seq38 Seq235 Seq89
Seq382 HK2QS7R01BR4Q9 Seq173
Seq180 HK2QS7R01ABFDP Seq339 Seq289 Seq491 Seq319
Seq269 HK2QS7R01AZHD7 Seq402 Seq112 Seq203 Seq137
Seq210 HK2QS7R01BMNQ4 Seq364
Seq270 HK2QS7R01AZFOG Seq388 Seq441
Seq442 HK2QS7R01ADASO Seq426 Seq233 Seq374 Seq416

16
Results
  • Comparisons for n sequences: (n-1)n/2
  • O(n^2), where n is the number of sequences
  • 1600 comparisons per second
  • 10 000 sequences: 8.6 hours (down from 10 days)
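
The figures above can be checked with a short calculation. This is a sketch only: it assumes the quoted rate of 1600 comparisons per second holds for the whole run, and the names `pairwise_comparisons` and `COMPARISONS_PER_SECOND` are introduced here for illustration.

```python
def pairwise_comparisons(n):
    """Number of distinct pairs among n sequences: (n - 1) * n / 2."""
    return n * (n - 1) // 2


COMPARISONS_PER_SECOND = 1600  # rate quoted on the slide

n = 10_000
comparisons = pairwise_comparisons(n)
hours = comparisons / COMPARISONS_PER_SECOND / 3600
print(f"{comparisons} comparisons, about {hours:.2f} hours")
```

For 10 000 sequences this gives 49 995 000 comparisons and roughly 8.7 hours, in line with the 8.6 hours reported.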
17
Questions?