Title: Genome
1Genome
- The chromosomes are the volumes of an encyclopedia called Genome
2Chromosome
>human chromosome
TACGTATACTGCATCGATGCTATACGACGAT
CGTAGCTACGTACGATCGTACGACGTACGTTACGTACGATCGTACGGTAC
ACCGCGCACGATCACACGATGCGACGATGCGACGATCGTACGACTGCTAC
GATGCGACGATGCGACGATCGTACGACTGCTAGCTACGCATGCCTGCATC
GATGCTATACGACGATCGTAGCTACGTACGATCGTACGACGTACGTTACG
TTGCATCGATGCTATACGACGATCGTAGCTACGTACGATCGCGATGCGAC
GATGCGACGATCGTACGACTGCTAGCTACGCATGCCTGCATCGATGCTAT
ACGACGATCGTAGCTACGTACGATCGTACGACGTACGTTACGTTGCATCG
ATGCTATACGACGATCGTAGCTACGTACGATCGTACGACGTACGTTACGT
ACGATCGTACGGTACACCGCGCACGATCACACGATGCGACGATGCGACGA
TCGTACGACTGCTAGCTACGCATGCCTACGTACGTATCCTACGTACGATC
GTGCAGCATCGATGCTACGTACGACGATCGATATTAATGCAATCATGCAG
CTGCATGCTAGCGATGCTACGACGATCGTACGGTACACCGCGCACGATCA
CACGATGCGACGATGCGACGATCGTACGATGCTGCATCGATGCTATACGA
CGATCGTAGCTACGTACGATCGTACGACGTACGTTACGTACGATCGTACG
GTACACCGCGCACGATCACACGATGCGACGATGCGACGATCGTACGACTG
CTAGCTACGCATGCCTACGTACGTATCCTACGTACGATCGTGCAGCATCG
ATGCTACGTACGACGATCGATATTAATGCAATCATGCCGATGCGACGATG
CGACGATCGTACGACTGCTAGCTACGCATGCCTGCATCGATGCTATACGA
CGATCGTAGCTACGTACGATCGTACGACGTACGTTACGTTGCATCGATGC
TATACGACGATCGTAGCTACGTACGATCGTACGACGTACGTTACGTACGA
TCGTACGGTACACCGCGCACGATCACACGATGCGACGATGCGACGATCGT
ACGACTGCTAGCTACGCATGCCTACGTACGTATCCTACGTACGATCGTGC
AGCATCGATGCTACGTACGACGATCGATATTAATGCAATCATGCAGCTGC
ATGCTAGCGATGCTACGACGATCGTACGGTACACCGCGCACGATCACACG
ATGCGACGATGCGACGATCGTACGATGCTGCATCGATGCTATACGACGAT
CGTAGCTACGTACGATCGTACGACGTACGTTACGTACGATCGTACGGTAC
ACCGCGCACGATCACACGATGCGACGATGCGACGATCGTACGACTGCTAG
CTACGCATGCCTACGTACGTATCCTACGTACGATCGTGCAGCATCGATGC
TACGTACGACGATCGATATTAATGCAATCATGCAGCTGCATGCTAGCGAT
GCTACGATCGATGCTATACGACGATCGTAGCTAGCTGCATGCTAGCGATG
CTACGATCGATGCTATACGACGATCGTAGCTTACGACGTACGTTACGTAC
GATCGTACGGTACACCGCGCACGATCACACGATGCGACGATGCGACGATC
GTACGACTGCTAGCTACGCATGCCTACGTACGTATCCTACGTACGATCGT
CGATGCGACGATGCGACGATCGTACGACTGCTAGCTACGCATGCCTGCAT
CGATGCTATACGACGATCGTAGCTACGTACGATCGTACGACGTACGTTAC
GTTGCATCGATGCTATACGACGATCGTAGCTACGTACGATCGTACGACGT
ACGTTACGTACGATCGTACGGTACACCGCGCACGATCACACGATGCGACG
ATGCGACGATCGTACGACTGCTAGCTACGCATGCCTACGTACGTATCCTA
CGTACGATCGTGCAGCATCGATGCTACGTACGACGATCGATATTAATGCA
ATCATGCAGCTGCATGCTAGCGATGCTACGACGATCGTACGGTACACCGC
GCACGATCACACGATGCGACGATGCGACGATCGTACGATGCTGCATCGAT
GCTATACGACGATCGTAGCTACGTACGATCGTACGACGTACGTTACGTAC
GATCGTACGGTACACCGCGCACGATCACACGATGCGACGATGCGACGATC
GTACGACTGCTAGCTACGCATGCCTACGTACGTATCCTACGTACGATCGT
GCAGCATCGATGCTACGTACGACGATCGATATTAATGCAATCATGCAGCT
GCATGCTAGCGATGCTACGATCGATGCTATACGACGATCGTAGCTGCAGC
ATCGATGCTACGTACGACGATCGATATTAATGCAATCATGCAGCTGCATG
CTAGCGATGCTACGACGATCGTACGGTACACCGCGCACGATCACACGATG
CGACGATGCGACGATCGTACGATGCTGCATCGATGCTATACGACGATCGT
AGCTACGTACGATCGTACGACGTACGTTACGTACGATCGTACGGTACACC
GCGCACGATCACACGATGCGACGATGCGACGATCGTACGACTGCTAGCTA
CGCATGCCTACGTACGTATCCTACGTACGATCGTGCAGCATCGATGCTAC
GTACGACGATCGATATTAATGCAATCATGCAGCTGCATGCTAGCGATGCT
ACGATCGATGCTATACGACGATCGTAGCTGCTACGCATGCCTACGTACGT
ATCCTACGTACGATCGTGCAGCATCGATGCTACGTACGACGATCGATATT
AATGCAATCATGCAGCTGCATGCTAGCGATGCTACGGTACGATCGTCGAT
CGTCAGCTCGATACGTTACGATCTACGATTACGATCATCTATACTATACT
ATACGATATATCTAGATATCGATCTA.ACTCCATTCTTTAAACCGTACTA
CACACACTACTGATCGACGATTACGACGACGAAAGGGCCATATCGGCTAA
CTACATCATAGACAACATCACGGATCGTCTAAGGCCGAGTTAGGTACGAT
TAACGTACGACTACCTATCGTATATACATCACGGATATAACCTATCTACT
ACGATTAACACGATCTATCGTACGGCATATGCATCGTATAGCATCGATTA
GAATACGTATACGTACGATCGTGCATCGATGCTATACGACGATCGTAGCT
ACGTACGATCGTACGACGTACGTTACGTACGATCGTACGGTACACCGCGC
ACGATCACACGATGCGACGATGCGACGATCGTACGACTGCTAGCTACGCA
TGCCTACGTACGTATCCTACGTACGATCGTGCAGCATCGATGCTACGTTG
CATCGATGCTATACGACGATCGTAGCTACGTACGATCGTACGACGTACGT
TACGTACGATCGTACGGTACACCGCGCACGATCACACGATGCGACGATGC
GTGCATCGATGCTATACGACGATCGTAGCTACGTACGATCGTACGACGTA
CGTTACGTACGATCGTACGGTACACCGCGCACGATCACACGATGCGACGA
TGCGACGATCGTACGACTGCTAGCTACGCATGCCTGCATCGATGCTATAC
GACGATCGTAGCTACGTACGATCGTACGACGTACGTTACGTTGCATCGAT
GCTATACGACGATCGTAGCTACGTACGATCGTACGACGTACGTTACGTAC
GATCGTACGGTACACCGCGCACGATCACACGATGCGACGATGCGACGATC
GTACGACTGCTAGCTACGCATGCCTACGTACGTATCCTACGTACGATCGT
GCAGCATCGATGCTACGTACGACGATCGATATTAATGCAATCATGCAGCT
GCATGCTAGCGATGCTACGACGATCGTACGGTACACCGCGCACGATCACA
CGATGCGACGATGCGACGATCGTACGATGCTGCATCGATGCTATACGACG
ATCGTAGCTACGTACGATCGTACGACGTACGTTACGTACGATCGTACGGT
ACACCGCGCACGATCACACGATGCGACGATGCGACGATCGTACGACTGCT
AGCTACGCATGCCTACGTACGTATCCTACGTACGATCGTGCAGCATCGAT
GCTACGTACGACGATCGATATTAATGCAATCATGCAGCTGCATGCTAGCG
ATGCTACGATCGCGATGCGACGATGCGACGATCGTACGACTGCTAGCTAC
GCATGCCTGCATCGATGCTATACGACGATCGTAGCTACGTACGATCGTAC
GACGTACGTTACGTTGCATCGATGCTATACGACGATCGTAGCTACGTACG
ATCGTACGACGTACGTTACGTACGATCGTACGGTACACCGCGCACGATCA
CACGATGCGACGATGCGACGATCGTACGACTGCTAGCTACGCATGCCTAC
GTACGTATCCTACGTACGATCGTGCAGCATCGATGCTACGTACGACGATC
GATATTAATGCAATCATGCAGCTGCATGCTAGCGATGCTACGACGATCGT
ACGGTACACCGCGCACGATCACACGATGCGACGATGCGACGATCGTACGA
TGCTGCATCGATGCTATACGACGATCGTAGCTACGTACGATCGTACGACG
TACGTTACGTACGATCGTACGGTACACCGCGCACGATCACACGATGCGAC
GATGCGACGATCGTACGACTGCTAGCTACGCATGCCTACGTACGTATCCT
ACGTACGATCGTGCAGCATCGATGCTACGTACGACGATCGATATTAATGC
AATCATGCAGCTGCATGCTAGCGATGCTACGATCGATGCTATACGACGAT
CGTAGCTATGCTATACGACGATCGTAGCTACGTACGATCGTACGACGTAC
GTTACGTACGATCGTGCATCGATGCTATACGACGATCGTAGCTACGTACG
ATCGTACGACGTACGTTACGTACGATCGTACGGTACACCGCGCACGATCA
CACGATGCGACGATGCGACGATCGTACGACTGCTAGCTACGCATGCCTAC
TGCATCGATGCTATACGACGATCGTAGCTACGTACGATCGTACGACGTAC
GTTACGTACGATCGTACGGTACACCGCGCACGATCACACGATGCGACGAT
GCGACGATCGTACGACTGCTAGCTACGCATGCCTACGTACGTATCCTACG
TACGATCGTGCAGCATCGATGCTACGTACGACGATCGATATTAATGCAAT
CATGCAGCTGCATGCTAGCGATGCTACGGTACGTATCCTACGTACGATCG
TGCAGCATCGATGCTACGTACGACGATCGATATTAATGCAATCATGCAGC
TGCATGCTAGCGATGCTACGTACGGTACACCGCGCACGATCACACGATGC
GACGATGCGACGATCGTACGACTGCTAGCTACGCATGCCTACGTACGTAT
CCTACGTACGATCGTGCAGCATCGATGCTACGTACGACGATCGATATTAA
TGCAATCATGCAGCTGCATGCTAGCGATGCTACGCTGCTAGCTACGCATG
CCTACGTACGTATCCTACGTACGATCGTGCAGCATCGATGCTACGTACGA
TGCATGCTAGCGATGCTACGACGATCGTACGGTACACCGCGCACGATCAC
ACGATGCGACGATGCGACGATCGTACGATGCTGCATCGATGCTATACGAC
GATCGTAGCTACGTACGATCGTACGACGTACGTTACGTACGATCGTACGG
TACACCGCGCACGATCACACGATGCGACGATGCGACGATCGTACGACTGC
TAGCTACGCATGCCTACGTACGTATCCTACGTACGATCGTGCAGCATCGA
TGCTACGTACGACGATCGATATTAATGCAATCATGCAGCTGCATGCTAGC
GATGCTACGATCGATGCTATACGACGATCGTAGCTACGTACGATCGTACG
ACGTACGTTACGTACGATCGTGCATCGATGCTATACGACGATCGTAGCTA
CGTACGATCGTACGACGTACGTTACGTACGATCGTACGGTACACCGCGCA
CGATCACACGATGCGACGATGCGACGATCGTACGACTGCTAGCTACGCAT
GCCTACTGCATCGATGCTATACGACGATCGTAGCTACGTACGATCGTACG
ACGTACGTTACGTACGATCGTACGGTACACCGCGCACGATCACACGATGC
GACGATGCGACGATCGTACGACTGCTAGCTACGCATGCCTACGTACGTAT
CCTACGTACGATCGTGCAGCATCGATGCTACGTACGACGATCGATATTAA
TGCAATCATGCAGCTGCATGCTAGCGATGCTGTCACGTAGCATGCTGACG
TACGATCGATTCGATCGATCGTACGATCGTAGCTAGCTAGTCGTAGCGAC
GTAGGATTCACGTAGCGATGCGTAGCGTAGCATGCTGACGATGCATCGAT
CGATGCATCATGCTAGCGTAGCTAGCTAGCATGACTGATCGATTAACGGT
ACGTATCCTACGTACGATCGTGCAGCATCGATGCTACGTACGACGATCGA
TATTAATGCAATCATGCAGCTGCATGCTAGCGATGCTACGTACGGTACAC
CGCGCACGATCACACGATGCGACGATGCGACGATCGTACGACTGCTAGCT
ACGCATGCCTACGTACGTATCCTACGTACGATCGTGCAGCATCGATGCTA
CGTACGACGATCGATATTAATGCAATCATGCAGCTGCATGCTAGCGATGC
TACGCTGCTAGCTACGCATGCCTACGTACGTATCCTACGTACGATCGTGC
AGCGATCGATATTAATGCAATCATGCAGCTGCATGCTAGCGATGCTACGT
ACGTACGTATCCTACGTACGATCGTGCAGCATCGATGCTACGTACGACGA
TCGATATTAATGCAATCATGCAGCTGCATGCTAGCGATGCTACGACGATC
GTACGACTGCTAGCTACGCATGCCTACGTACGTATCCTACGTACGATCGT
GCAGCATCGATGCTACGTACGACGATCGATATTAATGCAATCATGCAGCT
GCATGCTAGCGATGCTACGACGACGATCGATATTAATGCAATCATGCAGC
TGCATGCTAGCGATGCTACGTACGATCGTATGCTAGCTAGCATGCATGCA
TGCATGCAT ..
3Information retrieval
- Bioinformatics. Sequence and genome analysis
- David W. Mount
- Flexible Pattern Matching in Strings (2002)
- Gonzalo Navarro and Mathieu Raffinot
- Algorithms on strings (2001)
- M. Crochemore, C. Hancart and T. Lecroq
- http://www-igm.univ-mlv.fr/lecroq/string/index.html
4String Matching
String matching: definition of the problem (text, pattern)
The approach depends on what we have, the text or the patterns:
- The patterns ---> Data structures for the patterns
- 1 pattern ---> The algorithm depends on |p| and |Σ| (pattern length and alphabet size)
- k patterns ---> The algorithm depends on k, |p| and |Σ|
- The text ---> Data structure for the text (suffix tree, ...)
- Sequence alignment (pairwise and multiple)
- Sequence assembly: hash algorithm
Hidden Markov Models
5Exact string matching: one pattern
How do string matching algorithms perform the search?
For instance, given the sequence
CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
search for the pattern ACTGA
and for the pattern TACTACGGTATGACTAA.
6Exact string matching Brute force algorithm
Example
Given the pattern ATGTA, the search is
G T A C T A G A G G A C G T A T G T A C T G ...
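The brute-force search on this example can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function name `brute_force_search` is an assumption, and the text is the example sequence above with the spaces removed.

```python
def brute_force_search(text, pattern):
    """Return the 0-based start positions of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    hits = []
    # Try every window position; the window always moves by exactly one.
    for pos in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while j < m and text[pos + j] == pattern[j]:
            j += 1
        if j == m:
            hits.append(pos)
    return hits


if __name__ == "__main__":
    text = "GTACTAGAGGACGTATGTACTG"
    print(brute_force_search(text, "ATGTA"))
```

In the worst case this performs on the order of n*m character comparisons, which is what the algorithms on the following slides try to avoid by shifting the window further.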
7Exact string matching Brute force algorithm
[Figure: the Text with the Pattern aligned below it, shown twice]
8Exact string matching one pattern
How do the matching algorithms perform the search?
A window slides along the text, and the pattern is compared against it.
At each step the comparison is made and the window is shifted to the right.
What differentiates the algorithms?
- How the comparison is made.
- The length of the shift.
9Exact string matching one pattern (text on-line)
Experimental efficiency (Navarro & Raffinot)
BNDM: Backward Nondeterministic Dawg Matching
BOM: Backward Oracle Matching
[Figure: map of the regions where Horspool, BOM and BNDM are each the most efficient; horizontal axis: pattern length, 2 to 256, with w marked; vertical axis: 2 to 64]
10Horspool algorithm
We need a preprocessing phase to construct the
shift table.
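A minimal Python sketch of the preprocessing and the search is given here for illustration. The function names `horspool_shift_table` and `horspool_search` are assumptions, not names from the slides. The table stores, for each character of the pattern except the last one, the distance from its last occurrence to the end of the pattern; any other character shifts the window by the full pattern length.

```python
def horspool_shift_table(pattern):
    """Shift for each character occurring in pattern[:-1]; later occurrences override."""
    m = len(pattern)
    return {c: m - 1 - i for i, c in enumerate(pattern[:-1])}


def horspool_search(text, pattern):
    """Return the 0-based start positions of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    shift = horspool_shift_table(pattern)
    hits = []
    pos = 0
    while pos <= n - m:
        # Compare the window with the pattern from right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # The shift depends only on the last character of the window.
        pos += shift.get(text[pos + m - 1], m)
    return hits


if __name__ == "__main__":
    print(horspool_shift_table("ATGTA"))
    print(horspool_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))
```

Characters missing from the dictionary (here C) take the default shift m through `shift.get(..., m)`, so the table only needs entries for characters of the pattern.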
11Horspool algorithm example
12Horspool algorithm example
13Horspool algorithm example
14Horspool algorithm example
15Horspool algorithm example
16Horspool algorithm example
17Horspool algorithm example
18Questions about the Horspool algorithm
Given the pattern ATGTA, the shift table is
A: 4, C: 5, G: 2, T: 1
Given a random text over an equally likely probability distribution (EPD):
1.- Determine the expected shift of the window. What if the probability distribution is not equally likely?
2.- Determine the expected number of shifts, assuming a text of length n.
3.- Determine the expected number of comparisons in the suffix search phase.
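For question 1, the expected shift is the shift of each character weighted by its probability, assuming the text characters are independent, so that the last character of each window follows the given distribution. The sketch below computes it exactly with fractions; the function name `expected_horspool_shift` and the non-uniform probabilities are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def expected_horspool_shift(pattern, probs):
    """Expected window shift when the last window character follows probs."""
    m = len(pattern)
    # Horspool shift table: characters absent from pattern[:-1] shift by m.
    table = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    return sum(p * table.get(c, m) for c, p in probs.items())


if __name__ == "__main__":
    uniform = {c: Fraction(1, 4) for c in "ACGT"}
    print(expected_horspool_shift("ATGTA", uniform))

    # Hypothetical non-uniform distribution over the DNA alphabet.
    skewed = {"A": Fraction(3, 10), "C": Fraction(2, 10),
              "G": Fraction(2, 10), "T": Fraction(3, 10)}
    print(expected_horspool_shift("ATGTA", skewed))
```

For the pattern ATGTA and the uniform distribution this is (4 + 5 + 2 + 1) / 4 = 3.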
19Exact string matching one pattern (text on-line)
Experimental efficiency (Navarro & Raffinot)
[Figure: the efficiency map of slide 9 shown again, with regions for Horspool, BOM and BNDM by pattern length]
20BNDM algorithm
21BNDM algorithm example
Given the pattern ATGTA
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
D1 = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
22BNDM algorithm example
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 ) & ( ) = ( 0 0 0 0 0 )
Found!
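The bit-vector steps above can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the slides' own code: the function name `bndm_search` is hypothetical, the text is the example sequence GTACTAGAGGACGTATGTACTG, and the leftmost bit of each vector is taken to correspond to the first pattern position. Each window is read from right to left; D is shifted left and ANDed with the mask of the next character, and the window ends when D becomes zero.

```python
def bndm_search(text, pattern):
    """Return the 0-based start positions of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # masks[c]: bit (m - 1 - i) is set when pattern[i] == c,
    # so the highest bit stands for the first pattern position.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))
    full = (1 << m) - 1
    high = 1 << (m - 1)
    hits = []
    pos = 0
    while pos <= n - m:
        j, last, d = m, m, full
        while d != 0:
            d &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if d & high:
                if j > 0:
                    last = j          # a pattern prefix starts at window offset j
                else:
                    hits.append(pos)  # the whole window matched
            d = (d << 1) & full
        pos += last
    return hits


if __name__ == "__main__":
    print(bndm_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))
```

Python integers are unbounded, so the sketch works for any pattern length; in a language with fixed-width integers the pattern would have to fit in one machine word of w bits.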
23BNDM algorithm example
Given the pattern ATGTA
How is the shift determined?
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D3 = ( 0 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 )
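One way to see how the shift is determined is to process a single window and record the last position at which the leftmost bit of D was set, that is, the last time the characters read so far formed a prefix of the pattern. The sketch below does this for one window. The function name `bndm_window_shift` and the windows GTACT and CGAAT are hypothetical; they were chosen so that the characters read from the right give the same bit vectors as the two cases above.

```python
def bndm_window_shift(window, pattern):
    """Return (shift, matched) for one window of the same length as the pattern."""
    m = len(pattern)
    if len(window) != m:
        raise ValueError("window and pattern must have the same length")
    # masks[c]: highest bit stands for the first pattern position.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))
    full, high = (1 << m) - 1, 1 << (m - 1)
    d, j, last, matched = full, m, m, False
    while d != 0:
        d &= masks.get(window[j - 1], 0)   # read the window right to left
        j -= 1
        if d & high:
            if j > 0:
                last = j                   # remember the longest prefix seen so far
            else:
                matched = True
        d = (d << 1) & full
    return last, matched


if __name__ == "__main__":
    print(bndm_window_shift("GTACT", "ATGTA"))  # no prefix seen: shift by the full length
    print(bndm_window_shift("CGAAT", "ATGTA"))  # prefix AT seen at offset 3
```

When no prefix is recognized the window moves by the whole pattern length; otherwise it moves so that the recognized prefix lines up with the start of the next window.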