Title: BIOINFORMATICS AND GENE DISCOVERY
1 BIOINFORMATICS AND GENE DISCOVERY
UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL
Bioinformatics Tutorials
1998
4 From genes to proteins
DNA (PROMOTER ELEMENTS)
  -> TRANSCRIPTION ->
RNA (SPLICE SITES)
  -> SPLICING ->
mRNA (START CODON, STOP CODON)
  -> TRANSLATION ->
PROTEIN
6 Comparative Sequence Sizes (bases)
- Yeast chromosome 3: 350,000
- Escherichia coli (bacterium) genome: 4,600,000
- Largest yeast chromosome now mapped: 5,800,000
- Entire yeast genome: 15,000,000
- Smallest human chromosome (Y): 50,000,000
- Largest human chromosome (1): 250,000,000
- Entire human genome: 3,000,000,000
7 Low-resolution physical map of chromosome 19
8 Chromosome 19 gene map
9 Computational Gene Prediction
- Where are genes unlikely to be located?
- How do transcription factors know where to bind a region of DNA?
- Where are the transcription, splicing, and translation start and stop signals?
- What does a coding region do that non-coding regions do not?
- Can we learn from examples?
- Does this sequence look familiar?
10 Artificial Intelligence in Biosciences
- Neural Networks (NN)
- Genetic Algorithms (GA)
- Hidden Markov Models (HMM)
- Stochastic context-free grammars (CFG)
11 Information Theory
[Figure: states 0 and 1, labeled "1 bit"]
12 Information Theory
[Figure: states 00, 01, 11, 10, with two "1 bit" labels]
13 Information Theory
[Figure: two "1 bit" labels]
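The figures for these slides are lost, but the surviving labels are consistent with the usual starting point of information theory: choosing between two equally likely states takes 1 bit, and each additional binary choice adds one more bit. A minimal sketch of that arithmetic follows; the function names and the uniform four-letter DNA example are illustrative assumptions, not taken from the slides.

```python
import math

def bits_for_equally_likely(n_states):
    """Bits needed to pick one of n equally likely states: log2(n)."""
    return math.log2(n_states)

def shannon_entropy(probabilities):
    """Average information, in bits per symbol, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(bits_for_equally_likely(2))   # states 0, 1            -> 1.0
print(bits_for_equally_likely(4))   # states 00, 01, 11, 10  -> 2.0
# Assumed example: A, C, G, T all equally likely.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # -> 2.0
```

A skewed distribution carries less than 2 bits per nucleotide, which is one reason biased base composition can serve as a signal in sequence analysis.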
14 Scientific Models
Physical models -- Mathematical models
15 Neural Networks
- an interconnected assembly of simple processing elements (units or nodes)
- each node's functionality is similar to that of the animal neuron
- processing ability is stored in the inter-unit connection strengths (weights)
- weights are obtained by a process of adaptation to, or learning from, a set of training patterns
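To make "weights are obtained by learning from training patterns" concrete, here is a minimal sketch of a single-node network (a perceptron) trained with the classic perceptron update rule. The slides do not say which network or learning rule they have in mind, so the rule, the learning rate, and the logical-AND training set are all assumptions chosen for illustration.

```python
def predict(weights, bias, inputs):
    """One node: weighted sum of the inputs followed by a threshold."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(patterns, epochs=20, learning_rate=0.5):
    """Adapt the weights to a list of (inputs, target) training patterns."""
    weights = [0.0] * len(patterns[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in patterns:
            error = target - predict(weights, bias, inputs)
            # Nudge each weight in the direction that reduces the error.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Assumed toy training set: the logical AND of two inputs.
AND_PATTERNS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_PATTERNS)
print([predict(weights, bias, inputs) for inputs, _ in AND_PATTERNS])
```

Everything the trained node "knows" sits in `weights` and `bias`; the code that evaluates the node never changes.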
16 Genetic Algorithms
Search or optimization methods using simulated evolution. Population of potential solutions is subjected to natural selection, crossover, and mutation.
choose initial population
evaluate each individual's fitness
repeat
    select individuals to reproduce
    mate pairs at random
    apply crossover operator
    apply mutation operator
    evaluate each individual's fitness
until terminating condition
17 Crossover
Mutation
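The genetic-algorithm loop and its two operators can be sketched in a few lines. This is an illustration under stated assumptions rather than the method of any particular gene finder: individuals are bit strings, selection is fitness-proportional, crossover is single-point, mutation flips bits, the terminating condition is a fixed number of generations, and the fitness function (count of 1 bits) is a toy stand-in.

```python
import random

def run_ga(fitness, genome_length, population_size=30, generations=60,
           mutation_rate=0.02, seed=0):
    """Evolve bit-string individuals; population_size is assumed even."""
    rng = random.Random(seed)
    # Choose initial population.
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):  # until terminating condition
        # Evaluate fitness; +1 keeps every selection weight positive.
        weights = [fitness(individual) + 1 for individual in population]
        # Select individuals to reproduce; random draws also pair them at random.
        parents = rng.choices(population, weights=weights, k=population_size)
        next_population = []
        for i in range(0, population_size - 1, 2):
            mother, father = parents[i], parents[i + 1]
            # Crossover: swap the tails of the two parents at a random point.
            point = rng.randint(1, genome_length - 1)
            children = (mother[:point] + father[point:],
                        father[:point] + mother[point:])
            for child in children:
                # Mutation: flip each bit with a small probability.
                next_population.append(
                    [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child])
        population = next_population
    return max(population, key=fitness)

# Toy fitness (an assumption): the number of 1 bits in the individual.
best = run_ga(fitness=sum, genome_length=20)
print(best, sum(best))
```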
18 Markov Model (or Markov Chain)
[Figure: example sequence drawn with the characters A, G, A, T, C, T]
The probability of each character is based only on several preceding characters in the sequence.
Number of preceding characters = order of the Markov model.
Probability of a sequence: P(s) = P_A × P_{A,T} × P_{A,T,C} × P_{T,C,T} × P_{C,T,A} × P_{T,A,G}
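The product on this slide has the shape of a second-order Markov model scoring the sequence ATCTAG: each factor is the probability of one character given up to two characters before it. The sketch below estimates such probabilities from example sequences and multiplies them (in log space) along a sequence. The training sequences, the add-one smoothing, and the function names are assumptions made for the example.

```python
import math
from collections import defaultdict

def train_markov(sequences, order=2, alphabet="ACGT"):
    """Estimate P(char | up to `order` preceding chars) with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, char in enumerate(seq):
            context = seq[max(0, i - order):i]  # shorter at the sequence start
            counts[context][char] += 1

    def prob(context, char):
        total = sum(counts[context].values())
        return (counts[context][char] + 1) / (total + len(alphabet))

    return prob

def sequence_log_prob(seq, prob, order=2):
    """log P(s): sum of the log of each per-character factor."""
    return sum(math.log(prob(seq[max(0, i - order):i], char))
               for i, char in enumerate(seq))

# Hypothetical training data, for illustration only.
prob = train_markov(["ATCTAG", "ATCTAA", "ATGTAG"])
print(sequence_log_prob("ATCTAG", prob))
print(sequence_log_prob("GGGGGG", prob))  # unseen pattern, scores lower
```

Working with log probabilities avoids numerical underflow when the sequence is thousands of bases long.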
19 Hidden Markov Models
States -- well-defined conditions
Edges -- transitions between the states
ATGAC, ATTAC, ACGAC, ACTAC
Each transition is assigned a probability.
Probability of the sequence:
- single path with the highest probability -- Viterbi path
- sum of the probabilities over all paths -- forward algorithm (used within Baum-Welch training)
20 Hidden Markov Model of Biased Coin Tosses
- States (Si): two biased coins, C1 and C2
- Outputs (Oj): two possible outputs, H and T
- p(Outputs Oij): p(C1, H), p(C1, T), p(C2, H), p(C2, T)
- Transitions from state X to Y: A11, A22, A12, A21
- p(Initial Si): p(I, C1), p(I, C2)
- p(End Si): p(C1, E), p(C2, E)
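The two-coin model can be written down directly and used to compute both sequence probabilities mentioned for hidden Markov models: the best single path (Viterbi) and the sum over all paths (forward recursion). All numeric values below are invented for illustration, and the end-state probabilities p(C1, E) and p(C2, E) are left out to keep the sketch short.

```python
STATES = ("C1", "C2")
# Assumed parameter values; the slide gives only the symbols.
INITIAL = {"C1": 0.5, "C2": 0.5}                    # p(I, C1), p(I, C2)
TRANSITION = {"C1": {"C1": 0.9, "C2": 0.1},         # A11, A12
              "C2": {"C1": 0.2, "C2": 0.8}}         # A21, A22
EMISSION = {"C1": {"H": 0.8, "T": 0.2},             # p(C1, H), p(C1, T)
            "C2": {"H": 0.3, "T": 0.7}}             # p(C2, H), p(C2, T)

def forward_probability(outputs):
    """Probability of the outputs summed over all state paths."""
    alpha = {s: INITIAL[s] * EMISSION[s][outputs[0]] for s in STATES}
    for symbol in outputs[1:]:
        alpha = {s: EMISSION[s][symbol]
                    * sum(alpha[r] * TRANSITION[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())

def viterbi(outputs):
    """Return (probability, path) of the single most probable state path."""
    best = {s: (INITIAL[s] * EMISSION[s][outputs[0]], [s]) for s in STATES}
    for symbol in outputs[1:]:
        new_best = {}
        for s in STATES:
            # Best way to arrive in state s from any previous state r.
            prob, path = max(
                ((best[r][0] * TRANSITION[r][s], best[r][1]) for r in STATES),
                key=lambda item: item[0])
            new_best[s] = (prob * EMISSION[s][symbol], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

tosses = "HHHTTTTH"
print(forward_probability(tosses))
print(viterbi(tosses))
```

The Viterbi path is the model's guess at which coin produced each toss; in a gene finder the same recursion labels each base as exon, intron, and so on.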
21 Hidden Markov Model for Exon and Stop Codon (VEIL Algorithm)
22 GRAIL gene identification program
23 Suboptimal Solutions for the Human Growth Hormone Gene (GeneParser)
24 Measures of Prediction Accuracy
Nucleotide Level
25 Measures of Prediction Accuracy
Exon Level
[Figure: REALITY and PREDICTION tracks, marking a WRONG EXON, a CORRECT EXON, and a MISSING EXON]
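The formulas on these two slides did not survive, so the sketch below uses definitions that are common in the gene-prediction literature and should be read as assumptions: at the nucleotide level, sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP); at the exon level, a correct exon matches a real exon exactly, a missing exon is a real exon that no prediction overlaps, and a wrong exon is a predicted exon that overlaps no real exon. The coordinates and labels are made up.

```python
def nucleotide_accuracy(real, predicted):
    """real, predicted: equal-length strings of 'C' (coding) / 'N' (non-coding)."""
    pairs = list(zip(real, predicted))
    tp = sum(r == "C" and p == "C" for r, p in pairs)
    fn = sum(r == "C" and p == "N" for r, p in pairs)
    fp = sum(r == "N" and p == "C" for r, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity

def overlaps(a, b):
    """True if two inclusive (start, end) intervals share a position."""
    return a[0] <= b[1] and b[0] <= a[1]

def exon_accuracy(real_exons, predicted_exons):
    """Count correct, missing, and wrong exons from (start, end) tuples."""
    correct = [e for e in predicted_exons if e in real_exons]
    missing = [e for e in real_exons
               if not any(overlaps(e, p) for p in predicted_exons)]
    wrong = [p for p in predicted_exons
             if not any(overlaps(p, e) for e in real_exons)]
    return {"correct": len(correct), "missing": len(missing),
            "wrong": len(wrong)}

print(nucleotide_accuracy("NNCCCCNNNN", "NNNCCCCCNN"))  # (0.75, 0.6)
reality = [(10, 50), (100, 150), (200, 260)]
prediction = [(10, 50), (105, 150), (300, 340)]
print(exon_accuracy(reality, prediction))
```

In the second example the exon predicted at (105, 150) overlaps a real exon without matching it exactly, so it counts as neither correct, missing, nor wrong.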
26 GeneMark Accuracy Evaluation
27 Bibliography
http://linkage.rockefeller.edu/wli/gene/list.html and http://www-hto.usc.edu/software/procrustes/fans_ref/
Gene Discovery Exercise
http://metalab.unc.edu/pharmacy/Bioinfo/Gene