Title: Bioinformatics: overview
 1Bioinformatics overview
- Handling a computer 
- Opening and saving of files 
- Starting programs 
- Navigating the WWW 
- FTP 
- Browsing 
- Sequence data 
- Primary data 
- Sequence formats 
2Bioinformatics overview (2)
- Databases 
- Entrez 
- SRS 
- Manipulation of DNA sequences 
- Restriction analysis 
- in silico cloning 
- Translation of nt sequence into protein 
- PCR 
- Primer design
3Bioinformatics overview (3)
- Comparison of two sequences 
- Dot matrix 
- Pairwise alignment 
- Multiple alignments 
- Database searches for similar sequences 
- FASTA 
- BLAST
4Bioinformatics overview (4)
- Sequence annotation 
- Intron/Exon prediction 
- Identification of conserved motifs 
- Identification of regulatory sequences 
- Organismal databases 
- D. melanogaster 
- A. thaliana 
- Expression profiling (chips)
5Copy and Paste
To transfer text/sequence files into programs use 
copy/paste
- Ctrl+C for copy, Ctrl+V for paste
6Start a program
- Double click on the program you wish to open 
- Microsoft word can be found under 
- Start -> Programme -> Microsoft Word
7Create a Folder
- Start - Programme - Windows Explorer - click 
- Desktop - click 
- Datei - Neu - Ordner - click 
- the new folder will appear on the screen 
- rename Neuer Ordner to EDV 
- click once on the icon and a second time on the 
 text field to activate the editor and write EDV
- Save all your future files into this folder 
8Navigating the Internet
- File transfer protocol (FTP) 
- Allows a person or computer to retrieve and send 
 files from/to another computer. Only copies are
 moved; the original file remains untouched.
- The network terminal protocol (TELNET) 
- allows a user to log in on any other computer on 
 the network, turning the local computer into a
 terminal.
9Navigating the Internet
- Every computer on the internet has its own unique 
 IP address
- e.g. 193.171.103.86 
- Because these numbers are not intuitive they are 
 often converted into a name
- i122server.vu-wien.ac.at addresses the same 
 computer
- Subdirectories on this computer can be specified 
- i122server.vu-wien.ac.at/edv/start.html is a 
 page in the folder edv, which has been prepared for this course
10Navigating the Internet
- For internet access you need either a modem or a 
 direct line
- Once you are connected with one server you can 
 access the full internet
- For most purposes an internet browser is 
 sufficient
- Internet Explorer 
- Netscape Navigator 
- Omniweb
11Getting started
- Open your web browser 
- Type in the address http://i122server.vu-wien.ac.at/edv/start.html
- Press return 
- You could make a bookmark of this page
12Links
- Rather than typing a new address each time, it is 
 possible to click on specially marked text or
 symbols
- After a single click you will connect to the 
 address
- To view the address before connecting, simply 
 move your mouse above the link
13Assignment 1
- Open Netscape Communicator and go to 
 http://i122server.vu-wien.ac.at/edv/start.html
- Create a new folder on the desktop 
- Open Microsoft Word and type in a random DNA 
 sequence, save this sequence as text only file
 into your folder.
- Software Windows-Explorer and Word
14Sequence data
- Automated DNA sequencing heavily relies on the 
 support of computer algorithms
- Data collection
15Sequence data
- The use of 4 different dyes requires intensive 
 computer calculations to extract sequence
 information
- Electropherograms
16Sequence formats
- While electropherograms are useful during the 
 sequencing project, sequences are stored as text
 after its completion.
- Plain text contains only the sequence 
 information
- Fasta 
17Other sequence formats
  18Manipulation of DNA Sequences I
Restriction endonucleases
- sticky ends: XhoI (c/tcgag), PstI (ctgca/g), ...
- blunt ends: SmaI (ccc/ggg), DraI (ttt/aaa), ...
- Rare cutters: large recognition sites
- Frequent cutters: small recognition sites
- Multiple cloning site
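A restriction analysis on a text sequence can be sketched in a few lines of Python. The recognition sites are the four from the slide, with "/" marking the cut position; the function name find_sites and the example sequence are made up for illustration. All four sites are palindromic, so searching one strand is enough here.

```python
def find_sites(dna, site):
    """Return 0-based cut positions for a site written like 'c/tcgag'."""
    offset = site.index("/")                 # cut position inside the site
    pattern = site.replace("/", "").lower()  # plain recognition sequence
    dna = dna.lower()
    cuts = []
    start = dna.find(pattern)
    while start != -1:
        cuts.append(start + offset)
        start = dna.find(pattern, start + 1)  # allow overlapping sites
    return cuts


# Recognition sites as listed on the slide
ENZYMES = {"XhoI": "c/tcgag", "PstI": "ctgca/g", "SmaI": "ccc/ggg", "DraI": "ttt/aaa"}

# Made-up example sequence, not taken from the course files
example = "aactcgagttcccgggaatttaaacc"

for name, site in ENZYMES.items():
    print(name, find_sites(example, site))
```

An enzyme with an empty list does not cut the sequence, which is the property needed when choosing enzymes for in silico cloning.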
 19Manipulation of DNA Sequences II 
 20Assignment 2
1. Open the file pSKII.doc and try to find the sequences for the sequencing primers M13-forward (5' gtaaaacgacggccagt 3') and M13-reverse (5' ggaaacagctatgaccatg 3') as well as the RNA polymerase promoters T3 (5' aattaaccctcactaaaggg 3') and T7 (5' gtaatacgactcactatagggc 3').
2. Which primers are homologous to the single-stranded SKII sequence and which are complementary?
- Software: Word, JaMBW (Reverse, Complement, Inverse)
- Download: pSKII.doc
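The strand question in this assignment can also be checked programmatically. The following is a minimal sketch, assuming the vector sequence has been saved as plain text; the names reverse_complement and locate and the short template are hypothetical and only serve as illustration.

```python
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def locate(primer, template):
    """Return ('W', pos) if the primer is found as written,
    ('C', pos) if its reverse complement is found, else None."""
    template = template.lower()
    primer = primer.lower()
    pos = template.find(primer)
    if pos != -1:
        return ("W", pos)
    pos = template.find(reverse_complement(primer))
    if pos != -1:
        return ("C", pos)
    return None


M13_FORWARD = "gtaaaacgacggccagt"
M13_REVERSE = "ggaaacagctatgaccatg"

# Made-up template: forward primer as written, reverse primer on the other strand
template = "ccc" + M13_FORWARD + "ttt" + reverse_complement(M13_REVERSE) + "aa"
print(locate(M13_FORWARD, template))
print(locate(M13_REVERSE, template))
```

A result of 'W' means the primer is identical to the given strand, 'C' means it is complementary to it.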
 21Characteristics of cloning vectors I
- multiple cloning site
- region for universal (M13 +/-) sequencing primers
- RNA polymerase (T7, T3, SP6) promoters
- genes for selection
 22Watson-Crick DNA strands
The upper strand of the dsDNA is called W (Watson), for forward, and the lower strand C (Crick), for reverse.
- The C strand is the reverse complement of the W strand
- C is antisense to W
23DNA and RNA Polymerases
- DNA polymerases need short primers to start DNA synthesis
- RNA polymerases need short promoters
- Polymerases synthesize DNA/RNA only in the 5' -> 3' direction
- If an open reading frame (ORF) is encoded by the W strand, the C strand codes for the antisense gene
- The C strand can also code for ORFs; then the W strand codes for the antisense gene
 27Assignment 3
You received a cDNA clone and the sequence of the insert (prc1edvkurs.doc) from your colleague. He told you that the start codon is the "atg" at position 79. For synthesis of an antisense RNA used as a Northern blot probe you have to subclone the insert into another vector. The vector you have in the lab is Bluescript (pSKII.doc). Bluescript contains a multiple cloning site flanked by sequences for the sequencing primers M13-forward and M13-reverse and the RNA polymerase promoters T3 and T7.
a. Find the multiple cloning site in the vector.
b. Find the best cloning strategy using only one restriction enzyme.
c. Use directed cloning to ensure that all clones can be used to produce an antisense probe with the RNA polymerase T3.
d. Define a strategy to modify the clone of 3c for the use of T7 RNA polymerase. Which enzymes would you use?
- Software: Word and Webcutter
- Download: prc1edvkurs.doc
 28Manipulation of DNA Sequences III
 Polymerase chain reaction (PCR): http://bibiserv.techfak.uni-bielefeld.de/sadr/pcrtutor.html
 29Manipulation of DNA Sequences IV
Primer design
- size: between 19 and 25 bases
- melting temperature: between 48 °C and 60 °C
- Tm(forward) ≈ Tm(reverse)
- Tm = 2 x (A + T) + 4 x (G + C)
- minimum of G/Cs: 9 - 11 (40 - 50 %)
- distance between primer pairs: 10 bp - 40 kb
- annealing sites: unique - 3' end
- avoid: mispriming, primer-primer interaction, hairpin structures
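The melting temperature rule from this slide, Tm = 2 x (A + T) + 4 x (G + C), is easy to apply by script. The sketch below uses hypothetical function names and, as an example, the M13-forward primer quoted in Assignment 2.

```python
def wallace_tm(primer):
    """Melting temperature in degrees C: Tm = 2*(A+T) + 4*(G+C)."""
    p = primer.lower()
    at = p.count("a") + p.count("t")
    gc = p.count("g") + p.count("c")
    return 2 * at + 4 * gc


def gc_percent(primer):
    """Share of G and C bases in percent."""
    p = primer.lower()
    return 100.0 * (p.count("g") + p.count("c")) / len(p)


m13_forward = "gtaaaacgacggccagt"
print(wallace_tm(m13_forward), round(gc_percent(m13_forward), 1))
```

This simple rule is only a rough estimate for short primers; a tool such as Primer3 uses more detailed models.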
 32Assignment 4
- Microsatellites are highly polymorphic markers, 
 which are extensively used for paternity testing,
 genome walking, provenance studies and analysis
 of population structures.
- They consist of tandemly repeated simple sequences of di-, tri- and tetranucleotides such as (AT)n, (CT)n, (CA)n, (GA)n, (GT)n or (CCT)n, ...
- Their length variation results from DNA slippage, a mechanism that increases and decreases their repeat number. The repeats are flanked by unique sequences, which allow the design of specific primers for the amplification of the microsatellite.
- Please design primer pairs for the amplification of a microsatellite using the following criteria:
-  product length 100 - 300 bp 
-  annealing temperature higher than 55 °C 
-  primer length 20 - 24 bp 
- Software Word and Primer3 
- Download microsatellite.doc 
33The importance of centralized databanks 
 34EMBL Databank 
 35EMBL SRS 
 36Entrez
- is a search and retrieval system that integrates 
 information from databases at NCBI
 38PopSet-prealigned multiple data sets 
 39Taxonomy Browser 
 40Online Mendelian Inheritance in Man
- This database is a catalog of human genes and 
 genetic disorders. The database contains textual
 information and references. It also contains
 copious links to MEDLINE and sequence records in
 Entrez
41Objectives 
- What is the function of this gene? 
- Do other genes have this functional motif? 
- Can I predict the higher order structure of this 
 protein?
- Is this gene a member of a known gene family? 
- Do other organisms have this gene?
42General Database Search Issues
- Search using amino acid sequence if possible 
- Why? Protein evolution is slower than DNA 
 sequence evolution
- Ask the program to translate your query sequence 
 in all 6 possible reading frames.
- Statistical theory is based on unrealistic 
 assumptions; consider searches as exploratory
 analyses.
43Similarity Search Jargon
A similarity search of a database is performed by aligning a query sequence to each sequence in the database. If good matches are found, the search returns a list of HSPs (High-scoring Segment Pairs).
 44Alignment Jargon
Evolutionarily related sequences differ from one another because of several processes:
- Substitutions 
- Insertions 
- Deletions
(Figure: an ancestor sequence diverging into the observed sequences)
 45Alignment Jargon
GCG  ACG
Substitution
A -> G
  46Alignment Jargon
ATCG   A-CG
Insertion
-> T
-  0 mismatches 
-  3 matches 
-  1 gap
47Alignment Jargon
Deletion
ATCG   A-CG
-  0 mismatches 
-  3 matches 
-  1 gap
48Alignment Jargon
Results of insertion and deletion events can be indistinguishable.
Indel: INsertion or DELetion
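The match, mismatch and gap counts shown on the preceding slides can be reproduced with a small helper. This is only an illustration; the function name column_counts is made up.

```python
def column_counts(top, bottom):
    """Count matches, mismatches and gap columns of a pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    counts = {"matches": 0, "mismatches": 0, "gaps": 0}
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            counts["gaps"] += 1        # indel column
        elif x == y:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1  # substitution column
    return counts


print(column_counts("ATCG", "A-CG"))   # the indel example from the slides
print(column_counts("GCG", "ACG"))     # the substitution example
```

The function cannot tell whether a gap column arose by insertion or by deletion, which is exactly why the term indel is used.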
 49Sequence Alignment
-  Sequence alignment is simply the optimal 
 assignment of substitution and indel events to a
 pair of sequences.
- Global alignment: align entire sequences 
- Local alignment: find best matching regions of 
 sequences
50Alignment of pairs of sequences
- Dot matrix analysis 
- Dynamic programming 
- Word (k-tuple) methods
51Dot matrix
- Sequence A is compared against B 
- Matching bases are marked on an A x B grid
53Dot matrix
- The background could be adjusted by changing the 
 window size
54Dot matrix
- Example: phage lambda and P22 repressor proteins, 
 compared at settings 1/1, 7/11 and 15/23
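A dot matrix with an adjustable window can be sketched as follows. It is assumed here that a setting such as 7/11 means "at least 7 matches within a window of 11"; the slide does not spell this out, and the function names are hypothetical.

```python
def dot_matrix(a, b, window=1, min_matches=1):
    """Return (i, j) positions where the windows starting at a[i] and b[j]
    share at least min_matches identical positions."""
    dots = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


def render(a, b, dots):
    """Draw the grid as text: rows follow sequence a, columns sequence b."""
    grid = [[" "] * len(b) for _ in a]
    for i, j in dots:
        grid[i][j] = "*"
    return "\n".join("".join(row) for row in grid)


seq = "ACGTACGT"   # made-up example
print(render(seq, seq, dot_matrix(seq, seq, window=1, min_matches=1)))
print()
print(render(seq, seq, dot_matrix(seq, seq, window=3, min_matches=3)))
```

With window 1 every single matching base produces a dot; the larger window removes isolated dots and leaves the diagonals, which is the background reduction described on the slide.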
55Dot matrix
- Search for conserved regions and domains 
- Identify repeated nucleic acid and protein 
 domains
- Determine introns and exons 
- Find inverted repeats and stem-loop structures 
- regions of low complexity 
- frameshifts
 57Assignments 5 and 6
- You isolated a cDNA clone (PlecDNA.doc) and you 
 would like to know how many introns are in the
 gene. Fortunately you are working with a fully
 sequenced organism thus it is easy to retrieve
 the full genomic region (Plegenomic.doc).
-  a) How many introns does the gene contain? 
-  b) What are the sequences (10 bp) around 
 introns 2 and 3 and the corresponding exon borders?
- Software Word and Dotlet 
- Download PlecDNA.doc, Plegenomic.doc 
58Assignment 6
- The previous analysis showed that the dot matrix program allows some useful interpretations of DNA sequences. You have recently isolated a genomic fragment (Test.doc) and, encouraged by the former results, you analyze it with the dot matrix program.
- How can you explain the pattern you see in the dot matrix?
- Delete an internal portion of the sequence and compare the full versus the deleted sequence.
- What is the pattern on the dot matrix? 
- Software Word and Dotlet 
- Download Test.doc 
59Dynamic programming
- The dynamic programming algorithm provides a 
 reliable computation method for aligning
 sequences
- The method has been proven mathematically to 
 yield the optimal alignment (note there may be
 more than a single optimal alignment)
- Both local and global alignments can be produced 
60Problem of alignment
- Roughly n x m comparisons need to be made for two 
 sequences of length n and m.
- If the alignment is to include gaps of any length 
 at any position in either sequence, the number of
 comparisons that must be made becomes
 astronomical
- Dynamic programming is a method of sequence 
 alignment that can take gaps into account but
 requires only a moderate number of comparisons
61The algorithm
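Scoring matrix (columns: sequence V D S C Y; rows: sequence V D S L C Y)
      V   D   S   C   Y
  V   4  -3  -2  -1  -1
  D  -3   6   0  -3  -3
  S  -2   0   4  -1  -2
  L   1  -4   0  -2  -1
  C  -1  -3  -1   9  -2
  Y  -1  -3  -2  -2   7
Alignment (gap penalty -11)
  V   D   S   L   C   Y
  V   D   S   -   C   Y
  4   6   4  -11  9   7
Total score: 4 + 6 + 4 - 11 + 9 + 7 = 19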
 62A sub-optimal alignment
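Scoring matrix (columns: sequence V D S C Y; rows: sequence V D S L C Y)
      V   D   S   C   Y
  V   4  -3  -2  -1  -1
  D  -3   6   0  -3  -3
  S  -2   0   4  -1  -2
  L   1  -4   0  -2  -1
  C  -1  -3  -1   9  -2
  Y  -1  -3  -2  -2   7
Alignment (gap penalty -11)
  V   D   S   L   C   Y
  V   D   S   C   Y   -
  4   6   4  -2  -2  -11
Total score: 4 + 6 + 4 - 2 - 2 - 11 = -1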
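The two alignments above can be compared with a small dynamic programming sketch. It uses the scoring values from the slide and assumes a linear gap penalty of -11 per gap position; the function name global_align is made up, and the score table only covers the residues of this example.

```python
# Scores from the slide: outer key = residue of VDSLCY, inner key = residue of VDSCY
SCORES = {
    "V": {"V": 4, "D": -3, "S": -2, "C": -1, "Y": -1},
    "D": {"V": -3, "D": 6, "S": 0, "C": -3, "Y": -3},
    "S": {"V": -2, "D": 0, "S": 4, "C": -1, "Y": -2},
    "L": {"V": 1, "D": -4, "S": 0, "C": -2, "Y": -1},
    "C": {"V": -1, "D": -3, "S": -1, "C": 9, "Y": -2},
    "Y": {"V": -1, "D": -3, "S": -2, "C": -2, "Y": 7},
}
GAP = -11  # assumed linear penalty per gap position


def global_align(a, b, scores=SCORES, gap=GAP):
    """Global alignment by dynamic programming; returns (score, top, bottom)."""
    n, m = len(a), len(b)
    f = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        f[i][0] = i * gap
    for j in range(1, m + 1):
        f[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            f[i][j] = max(
                f[i - 1][j - 1] + scores[a[i - 1]][b[j - 1]],  # align two residues
                f[i - 1][j] + gap,                             # gap in b
                f[i][j - 1] + gap,                             # gap in a
            )
    # Trace back one optimal path from the lower right corner
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and f[i][j] == f[i - 1][j - 1] + scores[a[i - 1]][b[j - 1]]:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and f[i][j] == f[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return f[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = global_align("VDSLCY", "VDSCY")
print(score)
print(top)
print(bottom)
```

The table f has (n + 1) x (m + 1) cells, so the work grows with n x m rather than with the number of possible gap placements.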
 63Measuring Alignment Quality(subjective criteria)
- Good alignments should have  
-  many exact matches 
-  few mismatches 
- many of the mismatches should be similar 
 residues
-  few gaps
64Measuring Alignment Quality(objective criteria)
- What is the expected number of HSPs with a score of at least S?
- E = K m n e^(-λS)
- K: constant dependent on the frequency of nucleotides
- m, n: length of sequences 
- λ = ln(1/p), p: probability of a match of identical bases (1/4 for equal base frequencies)
65Measuring Alignment Quality(objective criteria)
Bit scores: Raw scores have little meaning without detailed knowledge of the scoring system used. By normalizing a raw score S using S' = (λS - ln K) / ln 2, one obtains a bit score S'.
 67Measuring Alignment Quality(objective criteria)
Bit scores: The E value corresponding to a given bit score S' is E = m n 2^(-S'). Bit scores subsume the statistical essence of the scoring system; hence, to calculate significance one needs to know only the size of the search space.
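The relation between raw score, bit score and E value can be checked numerically. The sketch below assumes the standard formulas E = K m n e^(-λS), S' = (λS - ln K) / ln 2 and E = m n 2^(-S'); the values chosen for λ, K, m and n are illustrative only and not taken from the slides.

```python
import math


def expected_hsps(raw_score, m, n, lam, k):
    """E value from a raw score: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def evalue_from_bits(bits, m, n):
    """E value from a bit score: E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative parameters (assumptions for this example)
lam, k = 0.318, 0.13
m, n = 300, 1_000_000
raw = 80

print(expected_hsps(raw, m, n, lam, k))
print(bit_score(raw, lam, k))
print(evalue_from_bits(bit_score(raw, lam, k), m, n))
```

Both routes give the same E value, which shows that the bit score already contains λ and K and only the search space m x n is still needed.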
 68Measuring Alignment Quality(objective criteria)
- Significance of an HSP score 
- P(S > x) = 1 - exp(-K m n e^(-λx)) 
- P(S > x) = 1 - exp(-E) 
- m, n: effective length of query and databank 
 sequence
- E: number of expected HSPs with score at least S 
69Measuring Alignment Quality(objective criteria)
- Significance 
- Some programs provide E-values rather than P-values, as E is easier to understand; e.g. E-values of 5 and 10 correspond to P-values of 0.993 and 0.99995
- P-value is associated with E-value: e.g. if one expects to find 3 HSPs with score > S, the probability of finding at least one is 0.95
- When E < 0.01, P-values and E-values are nearly identical
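The numbers quoted above follow directly from P = 1 - exp(-E). A short check, with a made-up function name:

```python
import math


def p_from_e(e_value):
    """Probability of at least one HSP, given the expected number E."""
    return -math.expm1(-e_value)   # same as 1 - exp(-E), numerically stable


for e in (10, 5, 3, 0.01, 0.001):
    print(e, p_from_e(e))
```

For small E the exponential is nearly linear, so P and E almost coincide.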
70Scoring matrices
- Rationale 
- certain aa replacements occur often in a protein. 
 Because proteins are functioning despite these
 changes the substituted aa are compatible with
 structure and function. Yet other substitutions
 are rare.
- A scoring matrix accounts for these 
 differences
71Scoring matrices
- Dayhoff, 1978 
-  PAM (point accepted mutation) matrices 
- Henikoff & Henikoff, 1992 
-  BLOSUM (blocks amino acid substitution 
 matrices)
72PAM matrices
- This family of matrices lists the likelihood of 
 change from one aa to another in homologous
 proteins during evolution
- Each matrix gives the changes expected for a 
 given period of evolutionary time
- Assumption 
- Each change in the current aa is independent of 
 previous mutation events at that site.
- aa changes observed in short evolutionary times 
 can be extrapolated to longer periods
73PAM matrices
- aa substitutions that occur in a group of 
 evolving proteins were estimated.
- Because these changes are observed in closely related proteins, they represent aa substitutions that do not change the function of the protein -> accepted mutations
- 1572 changes in 71 groups of protein sequences were observed
- The number of changes at each aa was counted on a phylogenetic tree
- and divided by the exposure to mutation (aa frequency x number of aa in that group) -> PAM1
- Asn, Ser, Asp, Glu (highly mutable); Cys, Trp (least mutable)
74PAM matrices
- The PAM1 matrix gives the probability of a single change
- To obtain PAM matrices for N mutations, the PAM1 matrix is multiplied by itself N times
- PAM250 represents a level of 250 % change (corresponds to 20 % similarity)
- Computer simulations have shown that PAM250 provides a better-scoring alignment than lower-numbered PAMs for distantly related (14-27 % similarity) proteins.
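The repeated multiplication can be illustrated with a toy example. The 3 x 3 matrix below is invented for this sketch (a real PAM1 matrix is 20 x 20 and is derived from the data described above); row i holds the probabilities of residue i changing into each residue, which is a convention chosen here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(matrix, n):
    """Multiply the matrix by itself n times, starting from the identity."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Toy "PAM1" for a three-letter alphabet: 1 % chance of change per step
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(x, 3) for x in row])
```

After 250 steps the diagonal values are far below 0.99, yet each row still sums to 1, so the matrix remains a table of probabilities.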
75PAM log odds score
- PAM matrices are usually converted into log odds matrices
- The odds score is the ratio of the likelihood of the hypothesis that the change represents an authentic evolutionary variation to that of the hypothesis that the change occurred because of random sequence variation (no biol. significance)
- Example: Phe -> Tyr
- Phe -> Tyr score in PAM250: 0.15 
- Frequency of Phe in data: 0.04 
- Log odds score: 10 x log10(0.15/0.04) = 5.7
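The log odds value from this slide can be verified with a one-line calculation; the function name is made up for this sketch.

```python
import math


def log_odds(change_probability, background_frequency, scale=10):
    """Scaled base-10 log odds: scale * log10(probability / frequency)."""
    return scale * math.log10(change_probability / background_frequency)


# Values from the slide: Phe -> Tyr in PAM250, frequency of Phe
print(round(log_odds(0.15, 0.04), 1))
```

A positive value means the change is seen more often than expected by chance, a negative value means it is seen less often.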
76PAM250 
 77BLOSUM matrices
- 500 families of related proteins 
- Search for ungapped aa blocks that were present 
78Gap scores
- The cost of introducing a gap must be higher than 
 the cost for extending it
-  Wx = g + rx 
- g: gap opening penalty 
- x: length of the gap 
- r: gap extension penalty 
 80Assignment 7
- You have obtained a peptide sequence (ASFPCLNGGTCNDQVNGYVCVCAQDTSVSTCET) and would like to find its position in the full-length protein.
- Software Word and Blast2 Sequences 
- Download UEGF1.doc 
81Multiple alignments
- Problem 
- Alignment of 
- two sequences (length N): N^2 comparisons 
-  300 aa: 9 x 10^4 
- three sequences (length N): N^3 comparisons 
-  300 aa: 2.7 x 10^7 
- -> exact multiple alignments are not feasible for 
 most data sets; heuristic methods are required
82Progressive methods for multiple alignment
- PILEUP 
- Part of the GCG package 
- CLUSTALW 
- Available as local programs (Mac, PC, Unix) 
- Could be also run on remote computers 
83Progressive alignment algorithm
- Produce a global pairwise alignment for all pairs 
 of sequences
- Full dynamic programming 
- K-tuple approach, similar to FASTA 
- Calculate the pairwise alignment scores 
- Build a tree based on the genetic distances 
 derived from the alignment scores (NJ)
- Align the sequences sequentially, guided by the 
 phylogenetic relationships indicated by the tree
84Progressive alignment weighting
- Problem: similar sequences will produce a bias in the alignment
- Solution: weighting of sequences based on alignment scores
- Example tree (branch lengths: A 0.2, B 0.1, shared branch of A and B 0.3, C 0.5)
-  A: 0.2 + 0.3/2 = 0.35
-  B: 0.1 + 0.3/2 = 0.25
-  C: 0.5
 85Progressive alignment problems
- Dependence on the initial pairwise alignments 
- No problem for closely related sequences 
- The more diverged the sequences are, the more 
 problematic is the alignment
- Choice of suitable scoring matrices and gap 
 penalties that apply to the entire set of
 sequences
- -> Bayesian methods such as hidden Markov models 
 (HMMs) may be preferable for distantly related
 sequences
86Single sequence queries
- Rationale: a single sequence should be searched 
 against a database to identify those sequences
 which are most similar
- Identification of a related gene in another 
 organism
- Identification of a related gene in the same 
 organism
- Similarity may provide clues about function 
87Data banks
- Genomic sequences 
- Complete genomes 
- cDNA/proteins 
- ESTs (expressed sequence tags) 
88FASTA & BLAST rationale
- Main idea: Good alignments are expected to share several aa. Hence, consecutive shared aa (words, k-tuples) could serve as an indicator of quality.
- Observation: HSPs of interest are usually longer than a single word, so look for multiple hits on the same diagonal, separated by a short distance
89FASTA
- FASTA3 is the latest version with increased 
 ability to detect distantly related sequences
- Input 
- k size of matching sequence patterns or words, 
 called k-tuples
- Similarity matrix 
- Compares query sequence pairwise with each 
 sequence in the database
90FASTA hashing algorithm
- Search for k consecutive matches 
- Use a precompiled table that lists where in the 
 database each possible word occurs
- Generation of the table is of the order L (size 
 of the databank)
- Use of the table is of the order N (size of the query sequence) 
91FASTA hashing algorithm
  92FASTA algorithm
- Hashing: build a library of k consecutive residues and search the database represented by such a library
- Note: not the database itself is searched, but the library
- DNA: k = 4-6; protein: k = 1-2
- Longer words result in a faster, but less sensitive search
- Joining: matches within a certain distance of each other are joined, along with the region between them, into a longer matching region without gaps.
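The hashing step can be sketched with a dictionary that maps every k-tuple to its positions. This is a simplified illustration, not the actual FASTA code; the function names and the example sequences are made up.

```python
from collections import Counter, defaultdict


def build_index(sequence, k):
    """Lookup table: every k-tuple of the sequence -> list of start positions."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, subject_index, k):
    """Count k-tuple hits per diagonal (subject position minus query position)."""
    hits = Counter()
    for q in range(len(query) - k + 1):
        for s in subject_index.get(query[q:q + k], []):
            hits[s - q] += 1
    return hits


subject = "TTACGTACGG"   # stands in for a databank sequence
query = "ACGTAC"
index = build_index(subject, 4)
print(diagonal_hits(query, index, 4))
```

Many hits on the same diagonal indicate a region that can be joined into a longer ungapped match; the table is built once per databank sequence and then only looked up for each query word.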
93FASTA algorithm
- Filtering: the 10 best matching regions are rescored using a scoring matrix (BLOSUM or PAM)
- Ends of the regions are trimmed to remove residues not contributing to the score
- The best scoring region is reported as INIT1
- Joining: regions that are near enough are joined. The score of this larger region, including penalties for gaps needed to join the initial regions, is reported as INITN.
- Distance for proteins: k = 1: 32; k = 2: 16
94FASTA algorithm
- Later versions of FASTA include an optimization 
 step
- When INITN reaches a certain threshold, the score 
 of the region is recalculated to produce an OPT
 score by performing a full local alignment using
 dynamic programming.
- This procedure increases sensitivity but 
 decreases selectivity
95Limitations of FASTA
- FASTA can miss significant similarity since 
- For proteins, similar sequences do not have to 
 share identical residues
- Asp-Lys-Val is quite similar to Glu-Arg-Ile yet 
 it is missed even with k-tuple size of 1 since no
 amino acid matches
- For nucleic acids, due to codon wobble, DNA 
 sequences may look like XXyXXyXXy where Xs are
 conserved and ys are not
96BLAST (1)Basic Local Alignment Search Tool
- Filter: low-complexity regions are removed
- Divide the query sequence into words (sliding by 1 position)
- Include imperfection: based on a scoring matrix, similar words which produce a score higher than T are assembled into a list
- This step is included to permit imperfect matches between subject and query sequence
- Usually about 50 entries per word (rather than 20 x 20 x 20 = 8000)
97BLAST (2)Basic Local Alignment Search Tool
- Approach: find segment pairs by first finding word pairs that score above a threshold, i.e., find word pairs of fixed length w with a score of at least T
- Key concept: Seems similar to FASTA, but we are searching for words which score above T rather than words that match exactly
98BLAST (3)Basic Local Alignment Search Tool
- Each database entry is scanned for a match to one 
 of the list entries
- Use the short matched regions (x) lying on the 
 same diagonal and within distance A as starting
 points for a longer ungapped alignment between
 words
99BLAST (4)Basic Local Alignment Search Tool
- Extension of the alignment from the matching words in each direction along the sequences. Extension continues as long as the score increases. The extension is stopped when the accumulated score stops increasing and has just begun to fall a small amount below the best score found for a shorter extension.
- The obtained segment is called a high-scoring segment pair (HSP)
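The extension step can be sketched for one direction as follows. This is a simplified illustration and not the BLAST implementation: it assumes plain match/mismatch scores of +1/-1 instead of a substitution matrix, and the function name and the drop_off parameter are made up.

```python
def extend_right(query, subject, q_start, s_start, word,
                 match=1, mismatch=-1, drop_off=2):
    """Extend an exact word hit to the right without gaps.

    Returns (best_score, best_length). Extension stops once the running
    score has fallen more than drop_off below the best score seen so far.
    """
    score = best = word * match
    length = best_len = word
    q, s = q_start + word, s_start + word
    while q < len(query) and s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > drop_off:
            break
        q += 1
        s += 1
    return best, best_len


# Made-up sequences that agree in the first seven positions
print(extend_right("ACGTACGTTTTT", "ACGTACGAAAAA", 0, 0, word=3))
```

Extension to the left works the same way with decreasing indices; the segment with the best score from both directions is the HSP.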
100BLAST (5)Basic Local Alignment Search Tool
- Determine whether the HSP has a score larger than 
 a cutoff score S
- S is determined by examining the range of scores 
 found by comparing random sequences and by
 choosing a value that is significantly greater
- Determine significance of each HSP score 
- P(S > x) = 1 - exp(-K m n e^(-λx)) 
- P(S > x) = 1 - exp(-E) 
- m, n: effective length of query and databank 
 sequence
- E: number of expected HSPs with score at least S 
101BLAST (6)Basic Local Alignment Search Tool
- Significance 
- BLAST provides E-values rather than P-values, as E is easier to understand; e.g. E-values of 5 and 10 correspond to P-values of 0.993 and 0.99995
- P-value is associated with E-value: e.g. if one expects to find 3 HSPs with score > S, the probability of finding at least one is 0.95
- When E < 0.01, P-values and E-values are nearly identical
102Selecting the BLAST program 
 103FASTA-BLAST comparison 
 104Significance of database searches (1)
- All previous theory referred to the comparison of 
 two sequences- how should one consider the entire
 set of sequences?
- 1. Significance is independent of the length of a sequence -> multiply pairwise significance with the number of sequence entries (FASTA)
- 2. Significance depends on length, as long sequences are composed of multiple distinct domains -> treat the entire database as a single sequence for calculation of significance
105Significance of database searches (2)
- Until now, only ungapped sequences were 
 considered.
- Computational experiments and analytical results 
 suggest that the same theory could be applied to
 gapped alignments
- For ungapped alignments the statistical 
 parameters (λ, K) can be calculated using analytic
 formulas
- For gapped alignments these parameters must be 
 estimated from a large-scale comparison of
 random sequences
106Significance of database searches (3)
- gapped alignments 
- FASTA: local alignment scores are produced for the comparison of the query and every databank sequence. Most of these scores involve unrelated sequences; they can therefore be used to estimate λ and K. Problem: scores from pairs of related sequences should be excluded
- BLAST: λ and K are estimated for a selected set of substitution matrices and gap costs. The estimation could be done with real sequences, but has instead relied on random sequences
107Hidden Markov Model (HMM)
- HMMs offer a more systematic approach to 
 estimating model parameters
- HMMs could be compared to a kind of dynamic 
 statistical profile
- Like an ordinary profile, it is built by 
 analyzing the distribution of aa in a training
 set of related proteins
- The topology of a HMM can be visualized as a 
 finite state machine
108Hidden Markov Model (HMM)
(Figure: profile HMM with Begin and End states and rows of Delete, Insert and Match states; the example emits A, C, C, G)
Movement from stage n to n+1 with a certain transition probability 
 109Hidden Markov Model (HMM)
- More than one path leads to the same result
 110Hidden Markov Model (HMM)
- The log probability of a given sequence along a path is obtained as the sum of ln(transition probabilities)
- It is called a hidden Markov model because the path is hidden 
- Transition probabilities are obtained by training 
 on a set of sequences
- Initialization by estimated transition 
 probabilities
- All possible paths generating a given sequence 
 are visited proportional to the estimated
 transition probabilities
- Counting the number of times a given transition 
 was visited during the above step provides
 improved transition probabilities
- The Viterbi algorithm is used on a trained HMM to 
 determine the best path
- The Viterbi algorithm is similar to dynamic 
 programming
111Hidden Markov Model (HMM)
- HMM is a general technique that can be applied to 
 many different questions
- Multiple sequence alignment 
- Identification of conserved domains 
- Gene prediction 
- Protein secondary structure prediction
112Single aa sequence query programs
- Sequence similarity with query sequence 
- FASTA, BLAST 
- Alignment search with profile (scoring matrix 
 with gap penalties)
- PROFILESEARCH 
- Search with position specific scoring matrix 
 (PSSM) representing ungapped sequence alignment
 (BLOCK)
- MAST 
- Iterative alignment search for similar sequences 
 that starts with query sequence, builds a gapped
 multiple alignment, and then uses this to augment
 the search
- PSI-BLAST 
- Search query sequence for patterns representative 
 of protein families
- PROSITE, INTERPRO, PFAM, CDD/IMPALA
 117Comparison of EMBL  NCBI 
 119Assignments 8 to 10
- You have isolated a number of proteins by their interaction with a protein known to interact with RING finger proteins. By sequencing the protein you got
- from human cell lines: msvdmnsqgsdsneedydpnceeeeeeeeddpgdie
- from C. elegans: mnsddeiymegsasseddmddeclsd and mddedmsctsgddyagygdedyyneadv
- from Drosophila melanogaster: mdsdndndfcdnvdsgnvssgddgdddfg and mdsdiemdmesdndgeydddydyyntgedcd
- from Saccharomyces cerevisiae: mssgtendqfysfdesdsssielyeshntseftihglv
- from Arabidopsis thaliana: mdnnsvigsevdaeadesyvnaaledgqtgkks and mddyfsaeeeacyyssdqdsldgidneeselqpl
- a. Find the complete protein sequences for every given peptide and align the sequences to find out about their overall homology.
- b. Are there RING finger motifs in your proteins and, if yes, how many and where?
- c. RING finger proteins share a common protein motif of C-X2-C-X9-29-C-X1-3-H-X2-3-C/H-X2-C-X4-48-C-X2-C.
- d. Are there other remarkable protein motifs? 
- Software: Word, BLAST, FastA and ClustalW 
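The RING finger motif given in part c can be written as a regular expression, which makes part b easy to check once the full protein sequences are retrieved. The translation below is a sketch: X stands for any residue, Xa-b for a to b residues, and C/H for either cysteine or histidine. The function name and the synthetic test sequence are made up.

```python
import re

# C-X2-C-X9-29-C-X1-3-H-X2-3-C/H-X2-C-X4-48-C-X2-C
RING = re.compile(r"C.{2}C.{9,29}C.{1,3}H.{2,3}[CH].{2}C.{4,48}C.{2}C")


def find_ring(protein):
    """Return (start, end) positions of non-overlapping motif matches."""
    return [(m.start(), m.end()) for m in RING.finditer(protein.upper())]


# Synthetic sequence built only to contain one minimal motif
toy = ("MK" + "CAAC" + "A" * 9 + "C" + "A" + "H" + "AA" + "C"
       + "AA" + "C" + "A" * 4 + "CAAC" + "GG")
print(find_ring(toy))
```

The short N-terminal peptides listed in the assignment are not expected to contain the motif themselves; the search is meant for the complete sequences found in part a.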
120Assignment 9
- You received a manuscript submitted for publication. The authors claim that they have discovered a gene involved in abnormal muscle growth in salmon (hs = heavy salmon). You should decide if the paper should be published.
-  b. What gene is it? Is it really a novel gene? 
-  c. Do you support the authors' claim that this is a salmon gene?
-  d. Could the authors' claim be true? 
- Software Word, FastA, BLAST, Pubmed 
- Download hs_gene.doc 
121Assignment 10
- Inspired by the manuscript you reviewed, you decide to look for the gene in whales.
-  a. Make a sequence alignment to design primers for cross-species amplification
-  b. Design primers that have a fair chance to amplify the gene from whales
-  c. You know that human contamination is a problem in your lab. What would you do to minimize the risk of human contamination?
- Software Word, BLAST, FastA, ClustalW 
122Organismal databases 
 123Arabidopsis thaliana 
 124Drosophila BDGP (1) 
 125Drosophila BDGP (2) 
 126Drosophila Flybase 
 127Drosophila NCBI 
 128Assignments 11 to 13
- In Drosophila microsatellites are very short. Try 
 to find the longest dinucleotide microsatellite
 in D. melanogaster
- Software FLYBASE, BDGP, BLAST, 
129Assignment 12
- ITS sequences are widely employed to reconstruct 
 the phylogeny of closely related species. The
 major advantage of ITS sequences is that you
 could use primers (located in the 18S and 28S
 rDNA) which are conserved across many species.
 You have used these conserved primers to amplify
 the complete ITS region from oaks. The PCR
 products were cloned and sequenced. In the folder
 oaks you find the results of your experiment.
-  Figure 1. Organization of the rDNA 
-  a. Make a contig of your sequences 
-  b. Define the boundaries of the genes with the 
 spacers
-  c. Verify that your sequences originate from 
 oaks.
- Software Word, JaMBW, ClustalW, BLAST,FastA 
- Download oak1, oak2, oak3, oak4, oak5 
130Assignment 13
- You received one pair of microsatellite primers, 
 made PCR and found a highly interesting pattern
 in one population (no variability). Inspired by
 this result, you are interested to know more
 about the locus. Unfortunately, you found only
 the sequence of one of the primers
 (ttttgtcgttttcgttatg) and your friend has gone
 for a 6 months holiday. Fortunately, you are
 working with one of the best studied organisms
 Drosophila melanogaster so you have all
 possibilities to investigate!
-  a. What is the repeat motif of your 
 microsatellite?
-  b. Which gene is in close proximity to the 
 microsatellite?
-  c. On which chromosome is the gene located? 
-  d. Determine the number of available transposon 
 insertions in the gene
-  e. Where in the gene are the transposons 
 inserted?
-  f. What would you do to obtain a fly stock 
 having the gene deleted?
- Software FastA, BLAST, FLYBASE, BDGP 
131Gene prediction 
 132Gene prediction
- Goal: identify those regions that code for proteins
- Direct approach: Look for stretches that can be interpreted as protein using the genetic code
- Statistical approaches: Use other knowledge about likely coding regions
(Figure: gene structure with 5' UTR, exons, introns and 3' UTR)
 133Gene prediction direct approach
- Genetic code 
- The standard ("universal") genetic code is shared 
 by nearly all organisms
- Prokaryotes, mitochondria and chloroplasts often 
 use slightly different genetic codes
- More than one tRNA may be present for a given 
 codon, allowing more than one possible
 translation product
- Differences between genetic codes mostly affect 
 start and stop codons
- Alternate initiation codons: codons that encode 
 amino acids but can also be used to start
 translation (GUG, UUG, AUA, UUA, CUG)
- Suppressor tRNA codons: codons that normally stop 
 translation but are translated as amino acids
 (UAG, UGA, UAA)
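The idea of reading a sequence with one genetic code or another can be sketched in a few lines of Python. This is only an illustration: the function and variable names are made up for this example, the table layout follows the NCBI translation tables 1 (standard) and 2 (vertebrate mitochondrial), and unknown codons are simply shown as X.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD_CODE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Vertebrate mitochondrial code (NCBI table 2): four codons are read differently.
VERTEBRATE_MITO_CODE = dict(STANDARD_CODE, AGA="*", AGG="*", ATA="M", TGA="W")


def translate(dna, code=STANDARD_CODE):
    """Translate a coding strand codon by codon; '*' marks a stop codon."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(code.get(codon, "X") for codon in codons)
```

With these tables, translate("ATGATATGA") gives MI* under the standard code, while translate("ATGATATGA", VERTEBRATE_MITO_CODE) gives MMW, because ATA and TGA are read differently in vertebrate mitochondria.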
134Gene prediction direct approach
- Reading Frames 
- Since nucleotide sequences are read three bases 
 at a time, there are three possible frames in
 which a given nucleotide sequence can be read
 (in the forward direction)
- Taking the complement of the sequence and reading 
 in the reverse direction gives a total of six
 reading frames
- Open reading frames (ORFs) are defined by a run 
 of consecutive codons not interrupted by a stop
 codon
- Note: not all ORFs are actually used 
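The six reading frames can be scanned with a short Python sketch. It follows the definition given here (a run of codons without a stop codon) and does not require a start codon, which real ORF finders usually do. The function names and the min_codons cutoff are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base and read the result in the reverse direction."""
    return dna.translate(COMPLEMENT)[::-1]


def open_reading_frames(dna, min_codons=2):
    """Yield (strand, frame, start, end) for every stop-free run of codons.

    start and end are 0-based offsets on the strand that was read; for
    strand '-' they refer to the reverse complement of the input.
    """
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            run_start = frame
            for pos in range(frame, len(seq) - 2, 3):
                if seq[pos:pos + 3] in STOP_CODONS:
                    if (pos - run_start) // 3 >= min_codons:
                        yield strand, frame, run_start, pos
                    run_start = pos + 3  # next run begins after the stop
            # The last run ends at the final complete codon of this frame.
            end = frame + 3 * ((len(seq) - frame) // 3)
            if (end - run_start) // 3 >= min_codons:
                yield strand, frame, run_start, end
```

For the toy sequence ATGAAATAGCCC the first result is ("+", 0, 0, 6): in forward frame 0 the codons ATG and AAA are followed by the stop codon TAG.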
135Gene prediction direct approach
- Statistical support by Fickett's statistic: 
 codon usage bias
- Observation: every third base tends to be the 
 same one much more often than expected by chance.
 
- The reason for this is codon usage bias 
- Different levels of expression of different tRNAs 
 for a given amino acid lead to pressure on coding
 regions to conform to the preferred codon usage
- Non-coding regions, on the other hand, are under 
 no such selective pressure and can drift
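The period-three observation can be made concrete by counting each base separately at positions 0, 1 and 2 of successive triplets. The Python sketch below is a simplified illustration in the spirit of a position asymmetry measure; it is not Fickett's full statistic, and the function names and the max/(min + 1) ratio are assumptions chosen for this example.

```python
from collections import Counter


def position_counts(dna):
    """Count each base separately at sequence positions 0, 1, 2 modulo 3."""
    counts = [Counter(), Counter(), Counter()]
    for i, base in enumerate(dna.upper()):
        counts[i % 3][base] += 1
    return counts


def position_asymmetry(dna):
    """For each base, return max / (min + 1) of its counts over the three positions.

    Values well above 1 mean the base prefers one position in the triplet,
    as expected in coding regions with biased codon usage.
    """
    counts = position_counts(dna)
    result = {}
    for base in "ACGT":
        per_position = [c[base] for c in counts]
        result[base] = max(per_position) / (min(per_position) + 1)
    return result
```

A stretch of alanine codons such as GCAGCTGCCGCG gives an asymmetry of 4.0 for G and C, because G always sits at the first and C at the second codon position, whereas a sequence without triplet periodicity such as ACGTACGTACGT gives 0.5 for every base.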
136Gene prediction direct approach
- Statistical support by Fickett's statistic: 
 codon usage bias
- Example: glycine codon frequencies 
137Gene prediction direct approach
exon 
 138Gene prediction direct approach
- Problem: the direct approach works well for 
 prokaryotes but not for eukaryotes
- Codon usage bias is not constant across genes 
- Eukaryotic genes contain introns
139Gene prediction statistical approach
- To discriminate between different regions of a 
 gene, typical sequence elements are used as
 clues
- Content sensor: a region of residues with similar 
 properties (introns, exons)
- Signal sensor: a specific signal sequence (may be 
 a consensus)
- Figure: gene structure with 5' UTR, exons, 
 introns and 3' UTR 
 140Pre-mRNA splicing 
 141Gene Finding Software
- GENSCAN (HMM) 
- HMMGENE (HMM) 
- GENMARK (HMM) 
- GRAIL (neural network) 
 142Evaluation of gene predictions
- One has to discriminate between 
- True positives (TP) 
- False positives (FP) 
- False negatives (FN) 
- Sensitivity = TP / (TP + FN) 
- Specificity = TP / (TP + FP) 
- GRAIL was evaluated on different human data sets 
- Sensitivity 0.48-0.65, specificity 0.61-0.72
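As a worked example, the two measures can be computed at the nucleotide level from a set of predicted coding positions and a set of annotated coding positions. The Python sketch below reads the two ratios on this slide as TP / (TP + FN) and TP / (TP + FP); the function name and the toy coordinates are invented for illustration.

```python
def evaluate_prediction(predicted, annotated):
    """Return (sensitivity, specificity) for two collections of coding positions."""
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # predicted and truly coding
    fp = len(predicted - annotated)   # predicted but not coding
    fn = len(annotated - predicted)   # coding but missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    # Specificity as defined on this slide; outside gene prediction the
    # same ratio is usually called precision.
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


# Toy example: annotated exon at 100-199, predicted exon at 150-229.
sens, spec = evaluate_prediction(range(150, 230), range(100, 200))
```

Here 50 positions are true positives, 30 are false positives and 50 are false negatives, so the sensitivity is 0.5 and the specificity is 0.625.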
143Promoter prediction
- Similar to gene prediction, known regulatory 
 signals could be used to make predictions
- Algorithms 
- Neural networks 
- HMMs 
 147Analyzing Gene Expression (Microarray) Data 
 148Assignments 14 and 15
- You have transformed an Arabidopsis thaliana 
 mutant with a genomic sequence
 (Annotierungssequenz.doc), and the putative gene
 is sufficient to restore the function of the
 mutant gene.
- a. Find the coding sequence 
- b. Find the polyA signal 
- c. Where is the TATA box motif located? 
- d. Locate the gene on the A. thaliana map 
- e. Are cDNA clones available for this gene? 
- f. Where is the gene expressed? 
- g. Predict the protein sequence 
- h. Does this protein share homology with other 
 proteins?
- i. Are there any related proteins in other 
 plants/animals?
- j. Do these homologies indicate a possible 
 function?
- k. Does the protein have any interesting domains? 
- l. Is there a transmembrane domain? 
- m. Predict the subcellular localization 
- Software: Arabidopsis database (TAIR), GENSCAN, 
 Genfinder, MCB search, ExPASy, PLACE
- Download: Annotierungssequenz.doc 
149Assignment 15
- Based on sequence polymorphism data, your friend 
 concluded that a given sequence has been the
 target of selection. He asked you for advice
 about the identified sequence. Characterize the
 sequence as thoroughly as possible, without
 relying on a single source of information.
- Download: Unknown.doc 
150Microarray Data
- A snapshot of the amount of a particular gene 
 being transcribed in a tissue
- Measured for tens of thousands of genes 
- Use of multiple tissues on a single array allows 
 for direct comparisons between tissues
151Objectives of Microarray Studies
- Gene discovery: Which genes are affected when 
 exposed to a treatment?
- Hit it with a stick and see what happens 
- Disease diagnosis: Given a profile of expression 
 levels for many genes, can the unknown
 treatment be predicted?
- Tumor or disease classification 
- Time course experiments allow the study of 
 co-regulation of genes and the
 reconstruction of regulatory networks
- Pharmacogenomics 
- The goal of pharmacogenomics is to find 
 correlations between therapeutic responses to
 drugs and the genetic profiles of patients.
152Many computational and statistical problems
- Image analysis (spot identification, background, 
 etc.)
- Data management and pipelining 
- Normalization of data 
- Clustering co-regulated genes 
- Classifying tissue types 
- Regulatory network inference 
- Promoter identification (when combined with 
 genomic sequence data)
153Microarray Technology
- Spotted arrays 
- Attach entire sequence of genes to the array 
- Create cDNA from a tissue (expressed genes) 
- Wash the pool of cDNAs over the array 
- Complementary sequences bind 
- Oligonucleotide arrays (Affymetrix chips) 
- Attach short (25 bp) oligos instead of entire genes
154GTTCGA.... The gene
CAAGCT.... cDNA
Via reverse transcription
GUUCGA.... mRNA 
 155Spotted arrays are usually treated with samples 
from two different tissues, each labeled with a 
different color of dye (Red and Green)
Highly expressed in tissue A
Highly expressed in tissue B 
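A common way to summarize one spot on a two-color array is the base-2 logarithm of the ratio of the two dye intensities. The Python sketch below assumes, purely for illustration, that tissue A carries the red dye and tissue B the green dye; the function name and the pseudocount are also assumptions.

```python
import math


def log_ratio(red, green, pseudocount=1.0):
    """Return log2(red / green) for one spot.

    The pseudocount avoids division by zero for spots without signal.
    Positive values: higher in the red-labeled sample (here tissue A);
    negative values: higher in the green-labeled sample (here tissue B).
    """
    return math.log2((red + pseudocount) / (green + pseudocount))
```

With these assumptions, a spot with intensities red = 799 and green = 99 has a log ratio of 3.0 (eightfold higher in tissue A), and equal intensities give 0.0.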
 157The Data 
 158Goal: cluster genes that share a profile
Experiment 
 159The approach is formally similar to 
distance-based phylogenetic inference
- Compute a matrix of pairwise profile similarity 
 scores between genes
- Use these scores in something like UPGMA 
- Eisen et al. 1998. Cluster analysis and display 
 of genome-wide expression patterns. PNAS
 95:14863-14868
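The two steps above can be sketched in plain Python. The sketch assumes Pearson correlation as the similarity score, converts it to a distance as 1 - r, and joins clusters by the average of all pairwise distances between their members, which is the UPGMA rule. All names are invented for this example, and profiles with zero variance are not handled.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(profiles):
    """Cluster genes bottom-up; return the merges as (cluster_a, cluster_b, distance)."""
    names = list(profiles)
    # Step 1: matrix of pairwise distances (1 - correlation) between genes.
    dist = {frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
            for a, b in combinations(names, 2)}
    clusters = [(name,) for name in names]
    merges = []

    def cluster_distance(c1, c2):
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    # Step 2: repeatedly join the two closest clusters (UPGMA rule).
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
    return merges


# Toy data: g1 and g2 rise together, g3 and g4 fall together.
toy = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8],
       "g3": [4, 3, 2, 1], "g4": [8, 6, 4, 3]}
merges = average_linkage(toy)
```

On the toy data g1 is joined with g2 first, then g3 with g4, and the two pairs are joined last at a distance close to 2, reflecting their opposite profiles.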
 161Clustering Techniques
- Bottom-up techniques 
- Each gene starts in its own cluster, and genes 
 are sequentially clustered in a hierarchical
 manner
- Top-down techniques 
- Begin with an initial number of clusters and 
 initial positions for the cluster centers (e.g.,
 averages). Genes are added to the clusters
 according to an optimality criterion.
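One widely used method that fits this description of a top-down technique is k-means; naming it here is an assumption, since the slide does not name a specific algorithm. The Python sketch below uses squared Euclidean distance as the optimality criterion and takes the initial cluster centers from the caller; all names are invented for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(profiles, centers, iterations=10):
    """Alternate between assigning genes to centers and updating the centers.

    profiles: list of expression profiles (lists of numbers)
    centers:  initial positions of the cluster centers
    Returns (assignment, centers), where assignment[i] is the cluster index
    of profiles[i].
    """
    centers = [list(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Assign every gene to the nearest cluster center.
        assignment = [
            min(range(len(centers)),
                key=lambda k: squared_distance(p, centers[k]))
            for p in profiles
        ]
        # Move each center to the average profile of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(profiles, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Toy data: two low and two high profiles, two starting centers.
assignment, final_centers = k_means(
    [[1, 1], [1, 2], [9, 9], [10, 9]], centers=[[0, 0], [10, 10]])
```

For the toy data the first two profiles end up in cluster 0 and the last two in cluster 1, with centers at [1.0, 1.5] and [9.5, 9.0]. The result depends on the number and starting positions of the centers, which is the main practical difference from the bottom-up approach.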
162Clustering Techniques
- Principal component techniques 
- Identify groups of genes that are highly 
 correlated with some underlying factor
 (principal component).
- Self-organizing maps 
- Similar to Top-down clustering, with restrictions 
 placed on dimensionality of the final result.