Introduction to Bioinformatics: Lecture II. From Molecular Processes to String Matching

1
Introduction to Bioinformatics, Lecture II: From
Molecular Processes to String Matching
  • Jarek Meller
  • Division of Biomedical Informatics,
  • Children's Hospital Research Foundation
  • Department of Biomedical Engineering, UC

2
Outline of the lecture
  • Sequence approximation in computational molecular
    biology: the premise and the limits
  • Getting ready for analysis of exact string
    matching and sequence alignment algorithms: some
    definitions and interplay with biology
  • The notion of string/sequence similarity
  • Substitution matrices for sequence alignment

3
Before we start: literature watch
A draft of the rat genome has been published!
RGSPC, Nature 428. What are the first conclusions
from the comparison with other mammalian
genomes? What approaches and tools have been
used to perform this comparative analysis?
4
Biological Polymers and Central Dogma
Bio-Polymer (alphabet)    Process (algorithm)
DNA (A, T, G, C)          replication, transcription
mRNA (U, A, C, G)         splicing, translation
Proteins (20 a.a.)        folding, interactions
Lipids, polysaccharides, membranes, signal
transduction, environmental signals etc.
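The "process as algorithm" view of the table can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the input is taken to be the coding strand of an already spliced gene (splicing itself is not modelled), the standard genetic code is used, and the function names `transcribe` and `translate` are hypothetical, not part of the lecture.

```python
# Standard genetic code, with codons enumerated in the base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA string for a coding-strand DNA string (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        residue = CODON_TABLE[codon]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


mrna = transcribe("ATGGTGCTGTAA")
print(mrna, translate(mrna))
```

Both steps are plain string rewriting over small alphabets, which is what makes the sequence approximation attractive in the first place.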
5
Complexity of DNA computing
http://www.genecrc.org/site/lc/lc2d.htm
6
Get the relevant sequences to compare them:
conservation and differences
Problem → Algorithms → Programs
Sequencing → Fragment assembly problem → The Shortest Superstring Problem → Phrap (Green, 1994)
Gene finding → Hidden Markov Models, pattern recognition methods → GenScan (Burge & Karlin, 1997)
Sequence comparison → pairwise and multiple sequence alignments → dynamic programming, heuristic methods → BLAST (Altschul et al., 1990)
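To make the fragment assembly row concrete, here is a minimal sketch of the classic greedy heuristic for the Shortest Superstring Problem: repeatedly merge the two fragments with the largest suffix-prefix overlap. It is not how Phrap works; it assumes error-free fragments from a single strand, and the names `overlap` and `greedy_superstring` are made up for this example.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments: list[str]) -> str:
    """Greedy heuristic: merge the best-overlapping pair until one string is left."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = sorted(
        f for f in set(fragments)
        if not any(f != g and f in g for g in fragments)
    )
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags[0] if frags else ""


reads = ["ATGGC", "GGCTA", "CTAAT"]
print(greedy_superstring(reads))
```

The greedy result always contains every fragment, but it is not guaranteed to be the shortest possible superstring; the exact problem is NP-hard, which is why practical assemblers rely on heuristics.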
7
Redundancy in biological systems
An example: sperm whale vs. human myoglobin
Query   1  MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE  60
           M LS+GEWQLVL+VW KVEAD+ GHGQ++LIRLFK HPETLEKFD+FKHLK+E EMKASE
Sbjct   1  MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE  60

Query  61  DLKKHGVTVLTALGAILKKKGHHEAELKPFAQSHATKHKIPIKYLEFISEAIIHVLHSRH  120
           DLKKHG TVLTALG ILKKKGHHEAE+KP AQSHATKHKIP+KYLEFISE II VL S+H
Sbjct  61  DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH  120

Query 121  PGNFGADAQGAMNKALELFRKDIAAKYKELGYQG  154
           PG+FGADAQGAMNKALELFRKD+A+ YKELG+QG
Sbjct 121  PGDFGADAQGAMNKALELFRKDMASNYKELGFQG  154
Ex.: Find the sequence of 1mba in the PDB and
BLAST it against nr using NCBI
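A simple way to quantify the conservation visible in such an alignment is the percent identity over aligned positions. The sketch below assumes two already aligned strings of equal length, with "-" marking gaps; the function name `percent_identity` is hypothetical. The two strings are the first 60 residues of the Query and Sbjct rows shown on the slide.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues among gap-free aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# First alignment block from the slide (residues 1-60).
query = "MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASE"
sbjct = "MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE"
print(f"{percent_identity(query, sbjct):.1f}% identity")
```

Percent identity treats every mismatch alike; the "+" marks in the alignment's middle row hint at the finer distinction that substitution matrices make between conservative and non-conservative replacements.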
8
Limits of the sequence approximation
  • All the information and various fingerprints of
    information processing at the molecular level
    (via interactions etc.), including adjustment to
    physiologically relevant external signals, seem to
    be included in nucleotide and protein sequences.
  • However, there are limits to this simple
    approximation: actual understanding of molecular
    processes requires structure, chemistry, kinetics
    and thermodynamics.
  • On the other hand, a deeper understanding of the
    nature of biological objects and processes
    greatly facilitates sequence-based studies by
    suggesting critical features, similarity
    measures etc.

9
Strings, sequences and string operations
String vs. sequence duality will be important for
exact vs. inexact string matching
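The slide body is not preserved in this transcript, so the following sketch assumes the usual textbook distinction: a substring is a contiguous block of characters, while a subsequence only has to preserve order. Exact matching looks for substrings; inexact matching and alignment are closer in spirit to subsequences. The function names are hypothetical.

```python
def find_occurrences(pattern: str, text: str) -> list[int]:
    """Naive exact matching: 0-based start positions of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


def is_subsequence(pattern: str, text: str) -> bool:
    """True if pattern's characters appear in text in order, gaps allowed."""
    remaining = iter(text)
    # 'ch in remaining' consumes the iterator up to and including the match.
    return all(ch in remaining for ch in pattern)


text = "GATATATA"
print(find_occurrences("ATA", text))   # overlapping occurrences are reported
print(is_subsequence("GTT", text), find_occurrences("GTT", text))
```

The naive matcher takes time proportional to the product of the two lengths in the worst case; faster exact matching algorithms improve on that bound.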
10
Beyond the letters: how to find better models
(e.g. GC content for gene finding)
http://www.imb-jena.de/IMAGE_BPDIR.html
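As a small illustration of a model that goes beyond individual letters, GC content can be computed over a sliding window; windows with unusually high GC content are one of the crude signals used in gene finding. The window size, step, and function names below are arbitrary choices for this sketch.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C characters in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq: str, window: int, step: int = 1) -> list[tuple[int, float]]:
    """GC content of each window, as (start position, fraction) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


for start, gc in gc_windows("GGCCAATTGCGC", window=4, step=4):
    print(start, gc)
```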
11
Another example: active sites, functional motifs
and multiple alignment
12
Distance and similarity measures
13
Edit distance vs. substitution score
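The contrast in the slide title can be sketched with two dynamic programming routines that share the same recurrence shape. The first minimizes the number of edits (unit-cost insertions, deletions and substitutions); the second maximizes a similarity score. The match, mismatch and gap values are arbitrary assumptions for illustration, and a real protein alignment would take the substitution term from a PAM or BLOSUM matrix instead.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def alignment_score(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score under a simple linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * gap]
        for j, cb in enumerate(b, 1):
            sub = match if ca == cb else mismatch
            curr.append(max(prev[j] + gap,        # gap in b
                            curr[j - 1] + gap,    # gap in a
                            prev[j - 1] + sub))   # aligned pair
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"), alignment_score("ACGT", "AGT"))
```

Only two rows of the table are kept at a time, so memory is linear in the length of the second string; recovering the alignment itself would require the full table or a traceback scheme.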
14
Substitution matrices for protein sequence
alignment: learning and extrapolating from
examples
  • PAM matrices (Dayhoff et al.): extrapolating to
    longer evolutionary times from data for very
    similar proteins with more than 85% sequence
    identity (short evolutionary time),
  • $s(a,b \mid t) = \log \frac{P(b \mid a,t)}{q_b}$,
    e.g. $P(b \mid a,2) = \sum_c P(b \mid c,1)\,P(c \mid a,1)$
  • BLOSUM matrices (Henikoff & Henikoff): multiple
    alignments of more distantly related proteins
    (e.g. BLOSUM50 with 50% sequence identity),
  • $s(a,b) = \log \frac{p_{ab}}{q_a q_b}$, where
    $p_{ab} = F_{ab} / \sum_{cd} F_{cd}$
  • Expected score: $\sum_{ab} q_a q_b\, s(a,b) = -\sum_{ab}
    q_a q_b \log \frac{q_a q_b}{p_{ab}} = -H(q^2 \,\|\, p)$
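The log-odds formulas above can be checked numerically on a toy example. The sketch assumes a two-letter alphabet with made-up symmetric counts of ordered aligned pairs, takes the background frequencies as the marginals of the pair frequencies, and uses base-2 logarithms. It omits the sequence clustering and rounding that real BLOSUM construction involves, and all names are hypothetical.

```python
import math


def log_odds_scores(pair_counts: dict) -> tuple[dict, dict]:
    """Scores log2(p_ab / (q_a q_b)) from counts F_ab of ordered aligned pairs."""
    total = sum(pair_counts.values())
    p = {pair: count / total for pair, count in pair_counts.items()}
    q: dict = {}
    for (a, _b), p_ab in p.items():
        q[a] = q.get(a, 0.0) + p_ab  # background frequency as a marginal
    scores = {(a, b): math.log2(p_ab / (q[a] * q[b]))
              for (a, b), p_ab in p.items()}
    return scores, q


def expected_score(scores: dict, q: dict) -> float:
    """Expected score of a random pair drawn from the background model."""
    return sum(q[a] * q[b] * s for (a, b), s in scores.items())


def two_step(cond: dict, alphabet: str) -> dict:
    """PAM-style extrapolation: P(b|a,2) = sum_c P(b|c,1) P(c|a,1).

    cond[(a, b)] holds the one-step probability P(b|a,1).
    """
    return {(a, b): sum(cond[(c, b)] * cond[(a, c)] for c in alphabet)
            for a in alphabet for b in alphabet}


# Hypothetical counts: identical pairs are four times as common as mismatches.
counts = {("A", "A"): 40, ("A", "B"): 10, ("B", "A"): 10, ("B", "B"): 40}
scores, q = log_odds_scores(counts)
print(scores[("A", "A")], scores[("A", "B")], expected_score(scores, q))

one_step = {("A", "A"): 0.9, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.9}
print(two_step(one_step, "AB")[("A", "A")])
```

Pairs seen more often than chance get positive scores and rarer pairs get negative ones, while the expected score under the background model comes out negative, as the relative-entropy identity requires. That negative expectation is what keeps alignments of unrelated sequences from accumulating high scores.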

15
Summary
16
Web resources and materials for the course
  • Protein Modeling Lab
  • Remote access to PML and the Citrix software
  • All lectures and other materials available
    electronically from the PML servers
  • Electronic tests and homework, web submission
    interfaces
  • The web site for the Introduction to
    Bioinformatics course
  • Updates

http://folding.chmcc.org
http://folding.chmcc.org/protlab/protlab.html
http://folding.chmcc.org/intro2bioinfo/intro2bioinfo.html