Title: Tale 1: To Identify an Unknown Gene
5 Frog's-eye view of the jungle (time frozen)
6 Frog's-eye view of the jungle (time moving)
7 Frog's-eye view of the jungle (through movement filter)
9 Filters: Information reducers (Movement filter)
10 Filters: Information reducers (Sequence filter)
TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG
AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT
TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC
TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC
GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC
CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA
TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA
AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG
AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA
TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG
CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC
GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG
GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT
CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA
ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA
TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT
CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA
CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG
AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC
CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA
TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT
How organism is made How organism works
CTCCGTAAAC CTCTAAC...
11 From Sequence to Organism: How does Nature do it?
12 From Sequence to Organism: How does Nature do it?
Genetic code
Rules of folding
13 From Sequence to Organism: How does Nature do it?
Genetic code
Gives us
14 From Sequence to Organism: How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA
Gives us
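What the genetic code "gives us" for the sequence on this slide can be sketched in a few lines of Python. This is an illustration only: it assumes the standard genetic code and reading frame 1, and the names `CODON_TABLE` and `translate` are made up for this example.

```python
# Standard genetic code in compact form: codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna):
    """Translate codon by codon from position 0; a trailing partial codon is ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

print(translate("ATGACTTATGATCAACGCACAGGGCTA"))  # MTYDQRTGL
```

Translation alone says nothing about where transcription starts or stops, or which pieces are spliced out, which is why further rules are needed.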
15 From Sequence to Organism: How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA
- Transcriptional initiation
- Transcriptional termination / polyA tailing
- Splicing
- Translational initiation
Rules of transcriptional and post-transcriptional control
16 From Sequence to Organism: How does Nature do it?
DNA
- Natural filters/transformations
  - Selective transcription
  - Selective processing
  - Translation
  - Folding
Functional protein
17 From Sequence to Organism: How does Nature do it?
DNA
Natural filters/transformations
Functional protein
18 From Sequence to Organism: How can WE do it?
Simulation of Nature
Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune...
We must give our military every tool and weapon it needs to prevail...
???
19 From Sequence to Organism: How can WE do it?
Surrogate Processes
Utterance of Wm. Shakespeare: Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune...
Utterance of George W. Bush: We must give our military every tool and weapon it needs to prevail...
Words/sentence, choice of words, sentence structure
23 From Sequence to Organism: How can WE do it?
- Surrogate filters
  - Gene finders
  - Similarity finders
  - Feature finders
  - Pattern finders
- Natural filters/transformations
  - Selective transcription
  - Selective processing
  - Translation
  - Folding
24 Surrogate Filters
You do it
26 Surrogate Filters: Gene finders
Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
Look for stop codons (TAA, TAG, TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAA
TGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATT
TGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG
27 Surrogate Filters: Gene finders
Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
Pro: Quick, simple
Con: Useless for eukaryotic genomic sequences (introns)
     Inaccurate (start codon problem)
     Inaccurate (doubtful short open reading frames)
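A Class 1 search can be sketched as a scan of all six reading frames for ATG...stop stretches. This is a minimal illustration, not the algorithm used by Map, Frames, or OrfFinder; the function names and the `min_codons` cutoff are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def find_orfs(dna, min_codons=10):
    """Return (strand, frame, start, end) for every ATG...stop stretch.

    start and end are 0-based, end-exclusive offsets on the strand that was
    scanned; min_codons counts codons from the ATG up to (not including) the stop.
    """
    orfs = []
    forward = dna.upper()
    for strand, seq in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ATG in this frame
    return orfs

print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=1))
```

The sketch takes the first ATG after a stop as the start, which is exactly the "start codon problem" listed above: nothing in the sequence scan says which in-frame ATG is really used.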
28 Surrogate Filters: Gene finders
Class 2: Codon bias recognition (TestCode)
Are codons equally used?
The code is degenerate
29 Surrogate Filters: Gene finders
Class 2: Codon bias recognition (TestCode)
Most frequently used codons
Is codon bias universal?
Codon usage is biased
30 Surrogate Filters: Gene finders
Class 2: Codon bias recognition (TestCode)
Pro: Quick, simple, available through GCG
     Better than Class 1 in excluding false open reading frames
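Whether codons are equally used can be checked by counting. The sketch below tallies codons in a coding sequence and reports the share of each synonymous codon for one amino acid. It is not TestCode's statistic, and the short sequence is a made-up toy, not real data.

```python
from collections import Counter

def codon_usage(coding_seq):
    """Count codons in a coding sequence read in frame from position 0."""
    seq = coding_seq.upper()
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

def relative_usage(counts, synonymous_codons):
    """Fraction of each codon among the codons that encode one amino acid."""
    total = sum(counts[c] for c in synonymous_codons)
    return {c: (counts[c] / total if total else 0.0) for c in synonymous_codons}

# The six leucine codons of the standard genetic code.
LEUCINE = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]

counts = codon_usage("ATGCTGCTGTTACTGCTCTAA")   # hypothetical toy gene
print(relative_usage(counts, LEUCINE))
```

In a true coding region the shares tend to follow the organism's preferred codons; in a false open reading frame they tend to look more like chance.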
31 Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Principle
Step 1: Create model through extensive training set
        Training set: proven or suspected genes
        Organism-specific
Step 2: Assess candidate genes through filter of model
32 Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 1: Create model through extensive training set
AAA AAC AAG AAT ACA . . . TTG TTT
Training set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATC
AGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTAC
CATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTC
TGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCT
CCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACT
TCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAAC
AAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACT
TTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTA
GTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTG
AGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTA
ATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGT
GGAGATGATTCGGTAGCTTT
36 Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
3rd order Markov model
        A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene: AAAGCAA
0.12 x 0.15
37 Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
3rd order Markov model
        A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene: AAAGCTA
0.12 x 0.15 . . .
So far, not a good candidate!
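The arithmetic on this slide can be reproduced with a short sketch. Only the three table rows needed for the first two transitions are copied from the slide; the rest of the table is elided there, so scoring stops at the first context that is missing. The comparison with a uniform background of 0.25 per base is an assumption made for illustration.

```python
import math

# P(next base | previous three bases), rows copied from the slide's table.
MODEL = {
    "AAA": {"A": 0.33, "C": 0.25, "G": 0.12, "T": 0.30},
    "AAC": {"A": 0.30, "C": 0.20, "G": 0.15, "T": 0.35},
    "AAG": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
}

def score_prefix(candidate, model):
    """Multiply transition probabilities until a context is missing from the model.

    Returns (probability, number_of_transitions_scored).
    """
    prob, n = 1.0, 0
    for i in range(3, len(candidate)):
        context, base = candidate[i - 3:i], candidate[i]
        if context not in model:
            break
        prob *= model[context][base]
        n += 1
    return prob, n

# AAA -> G (0.12), then AAG -> C (0.15): the two factors shown on the slide.
prob, n = score_prefix("AAAGC", MODEL)
log_odds = math.log2(prob / 0.25 ** n)   # below 0 means worse than the uniform background
print(prob, n, log_odds)
```

A negative log-odds score after two transitions matches the slide's verdict, although a real gene finder would score the whole candidate before deciding.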
38 Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Pro: Nearly the most accurate method known
Con: Needs big training set
     May miss genes of foreign origin
     Will miss very small genes
40 Surrogate Filters, Scenario I: Case of the Hidden Heterocyst
41 Case of the Hidden Heterocyst
NH3
N2
O2
Matveyev and Elhai (unpublished)
45 Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome
Do it
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATC
AGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTAC
CATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTC
TGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCT
CCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACT
TCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAAC
AAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACT
TTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTA
GTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTG
AGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTA
ATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGT
GGA
1. Use transposon mutagenesis to find a mutant defective in heterocyst differentiation
2. Sequence out from transposon
3. Find gene boundaries
4. Identify gene
46 Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
1. Go to http://www.vcu.edu/elhaij/BioInf
2. Open second browser (Ctrl-N in Netscape). Go to same site (copy and paste URL)
3. In 1st browser, go to Program List, click on Gene Finders, open GeneMark
4. In 2nd browser, open Nostoc sequence
48 Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Mission successful:
>Translation 397..639 (direct), 81 amino acids
VLGSKIEEGPKHIILDLSQIDFIDS
SGLGALVQLAKQAQTAEGTLQIVTNARVTQTVKLVRLEKFLSLQKSVEEA
LENVK
...or was it?
Check predicted protein against databases
49 Surrogate Filters: Similarity finders
- Blast
  - BlastP: Protein sequence to search protein database
  - BlastN: Nucleotide sequence to search nucleotide database
  - BlastX: Nucleotide sequence (translated) to search protein database
  - TBlastN: Protein sequence to search (translated) nucleotide database
  - Blast2Seq: Compare two sequences you specify
Do it
- Pfam (Protein motif families): Finds conserved motifs similar to protein sequence
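The choice among the Blast variants listed on this slide depends only on the type of query and the type of database, which can be written down as a lookup. The function name `choose_blast` and the type labels are made up for this sketch; it does not call any Blast program.

```python
# (query type, database type) -> Blast variant, as listed on the slide.
BLAST_PROGRAMS = {
    ("protein", "protein"): "BlastP",
    ("nucleotide", "nucleotide"): "BlastN",
    ("nucleotide", "protein"): "BlastX",      # query is translated
    ("protein", "nucleotide"): "TBlastN",     # database is translated
}

def choose_blast(query_type, database_type):
    """Return the Blast variant that matches the query and database types."""
    return BLAST_PROGRAMS[(query_type, database_type)]

# A predicted protein checked against a protein database:
print(choose_blast("protein", "protein"))  # BlastP
```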
51 Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Mission successful:
>Translation 397..639 (direct), 81 amino acids
VLGSKIEEGPKHIILDLSQIDFIDS
SGLGALVQLAKQAQTAEGTLQIVTNARVTQTVKLVRLEKFLSLQKSVEEA
LENVK
Why?
- GeneMark correct: Conservation of noncoding regions
- GeneMark wrong: Fooled by weird aa sequence or start codon
52 Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Moral: Automated gene finders are wonderful, but common sense is better
Don't trust automated annotation
53 Surrogate Filters: Feature finders
- Hidden Markov model-based methods
  - Good for contiguous features (e.g. signal sequences)
  - Not good with features with gaps (e.g. promoters)
- Ad hoc methods
  - Feature-specific rules (e.g. tandem repeats, terminators)
Position-dependent frequency tables
Position-specific scoring matrix (PSSM)
Weight table
54 Surrogate Filters: Feature finders
Position-dependent frequency tables
Consensus: TATAAA
55 Surrogate Filters: Feature finders
Position-dependent frequency tables
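A position-dependent frequency table is simply the per-column base composition of a set of aligned sites, and the consensus is the most frequent base in each column. The sketch below shows both. The five sites are hypothetical, chosen only so that the consensus comes out as TATAAA.

```python
from collections import Counter

def frequency_table(aligned_sites):
    """One dict per column: base -> fraction of sites carrying that base."""
    n = len(aligned_sites)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*aligned_sites)
    ]

def consensus(table):
    """Most frequent base at each position."""
    return "".join(max(column, key=column.get) for column in table)

# Hypothetical aligned sites, not taken from any database.
sites = ["TATAAA", "TATATA", "TATAAT", "CATAAA", "TTTAAA"]
table = frequency_table(sites)
print(consensus(table))  # TATAAA
```

The table keeps information the consensus throws away: position 1 is T in four of five sites, while positions 3 and 4 never vary.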
56 Surrogate Filters: Feature finders
Position-Specific Scoring Matrix in action
aceB  ACTATGGAGCATCTGCACATGAAAACC
atpI  ACCTCGAAGGGAGCAGGAGTGAAAAAC
bioB  ACGTTTTGGAGAAGCCCCATGGCTCAC
glnA  ATCCAGGAGAGTTAAAGTATGTCCGCT
glnH  TAGAAAAAAGGAAATGCTATGAAGTCT
lacZ  TTCACACAGGAAACAGCTATGACCATG
rpsJ  AATTGGAGCTCTGGTCTCATGCAGAAC
serC  GCAACGTGGTGAGGGGAAATGGCTCAA
sucA  GATGCTTAAGGGATCACGATGCAGAAC
trpE  CAAAATTAGAGAATAACAATGCAAACA
unknown
Experimentally proven start sites
59 Surrogate Filters: Feature finders
Position-Specific Scoring Matrix in action
aceB  ACCACATAACTATGGAGCATCT.GCACATGAAAACC
atpI  ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB  ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA  ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH  TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ  TTCACACAGGAAACAG....CTATGACCATG
rpsJ  AATTGGAGCTCTGGTCTCATGCAGAAC
serC  GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA  GATGCTTAAGGGATCA....CGATGCAGAAC
trpE  CAAAATTAGAGAATA...ACAATGCAAACA
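How a position-specific scoring matrix is applied to an unknown sequence can be sketched as follows: turn the aligned sites into log-odds scores, then slide the matrix along the sequence and keep the best window. The five aligned core sites are hypothetical stand-ins for the purine-rich block upstream of the start codons, and the pseudocount and uniform background are assumptions. Only the scanned sequence (lacZ) comes from the slide.

```python
import math

def build_pssm(aligned_sites, pseudocount=1.0, background=0.25):
    """Log-odds score for each base at each position of the aligned sites."""
    n = len(aligned_sites)
    pssm = []
    for column in zip(*aligned_sites):
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (n + 4 * pseudocount)
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def best_window(sequence, pssm):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pssm)
    return max(
        (sum(pssm[j][sequence[i + j]] for j in range(width)), i)
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical aligned core sites (not the slide's alignment).
core_sites = ["AGGAG", "GGGAG", "AGGAA", "AGGAG", "TGGAG"]
pssm = build_pssm(core_sites)

# Upstream region of lacZ as shown on the slide.
score, start = best_window("TTCACACAGGAAACAGCTATGACCATG", pssm)
print(start, round(score, 2))
```

The pseudocount keeps a base that was never observed at a position from receiving a score of minus infinity, so one mismatch lowers a window's score without ruling it out.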
60 Surrogate Filters: Pattern finders
Specified patterns (FindPatterns, PatScan)
  e.g. Find instances of restriction sites
New pattern discovery (Meme, Gibbs sampler)
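Searching for a specified pattern such as a restriction site needs nothing more than a string or regular-expression search. The sketch below is not how FindPatterns or PatScan are implemented; the function name `find_sites` is made up, and the three enzymes are common examples with palindromic recognition sites, so scanning one strand is enough.

```python
import re

# Recognition sequences of three common restriction enzymes.
RESTRICTION_SITES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "BamHI": "GGATCC"}

def find_sites(dna, sites=RESTRICTION_SITES):
    """Return {enzyme: [0-based start positions]}, overlapping hits included."""
    dna = dna.upper()
    return {
        name: [m.start() for m in re.finditer(f"(?={pattern})", dna)]
        for name, pattern in sites.items()
    }

# First line of the Nostoc sequence used earlier in the deck.
print(find_sites("AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATC"))
```

The lookahead `(?=...)` makes the search report overlapping occurrences, which a plain `re.finditer` on the pattern would skip.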
63 Surrogate Filters: Pattern finders
How do pattern finders work?
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
Step 5. If probability score high, remember pattern and score
Step 6. Repeat Steps 1 - 5
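The six steps can be turned into a small working sketch. It is deliberately simplified and is not the Meme or Gibbs sampler algorithm: instead of choosing candidate patterns arbitrarily, it tries every window of the first sequence in turn, and it scores matches by summed log-odds against a uniform background. The function names, the pseudocount, and the motif width are assumptions made for this example.

```python
import math
from collections import Counter

def best_match(pattern, sequence):
    """Step 2: the window of `sequence` with the fewest mismatches to `pattern`."""
    k = len(pattern)
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return min(windows, key=lambda w: sum(a != b for a, b in zip(w, pattern)))

def score_matches(matches, background=0.25, pseudocount=0.5):
    """Steps 3-4: build the frequency table and sum the log-odds of every match."""
    n = len(matches)
    total = 0.0
    for column in zip(*matches):
        counts = Counter(column)
        for base in column:
            freq = (counts[base] + pseudocount) / (n + 4 * pseudocount)
            total += math.log2(freq / background)
    return total

def find_motif(sequences, width):
    """Steps 1, 5, 6: try each seed from the first sequence, keep the best score."""
    best = (float("-inf"), None, None)
    first = sequences[0]
    for i in range(len(first) - width + 1):
        seed = first[i:i + width]
        matches = [best_match(seed, s) for s in sequences]
        score = score_matches(matches)
        if score > best[0]:
            best = (score, seed, matches)
    return best

# The five promoter regions shown on the slide.
PROMOTERS = [
    "AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC",
    "GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT",
    "CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG",
    "GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT",
    "CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT",
]
score, seed, matches = find_motif(PROMOTERS, width=6)
print(seed, matches, round(score, 2))
```

Real pattern finders refine the frequency table iteratively and correct for base composition, so the result of this sketch should be read as a starting point, not as a validated motif.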
64 Surrogate Filters, Scenario II: Case of the Masked Motif
- You've found a gene related to Purple Tongue Syndrome
- BlastP: Encoded protein related to cAMP-binding proteins
- Are the similarities trivial? Related to cAMP binding?
- Does your protein contain cAMP-binding site?
- What IS a cAMP-binding site?
- Task
  - Determine what is a cAMP-binding site
  - Determine if your protein has one
65 Surrogate Filters, Scenario II: Case of the Masked Motif
Strategy
- Collect sequences of known cAMP-binding proteins
- Run Meme, a pattern-finding program. Ask it to find any significant motifs
Do it
- Rerun Meme. Demand that every protein has identified motifs
- Run Pfam over known sequence to check
69 Surrogate Filters, Scenario III: Case of the Mortal Mitochondrion
- Progressive External Ophthalmoplegia (PEO)
  - Slow paralysis of voluntary eye muscles
  - Many other symptoms (e.g., frequent deafness)
  - Loss of mitochondrial DNA
- Inheritance
  - Mendelian
  - Autosomal dominant
  - Linked to chromosome 4q34
- Your task
  - Examine sequence of 4q34 region
  - Assess likelihood that a gene in the area could cause disease symptoms
70 Surrogate Filters, Scenario III: Case of the Mortal Mitochondrion
Examining Sequence of 4q34 Region
tctacttatattcaatccacagggctacacctagttcttggtacacagta
catgctcagcaagagtctgttgaatgaacacatacatggtttatctgttt
gtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcc
tgccccactcttagataaacgcatgccctctgtggccctggaaccttagt
gacttctgctataccaaagtctccacgcccagggtgacacgcagctgcag
ctccgtaaacctctaacatgatgtcagcaaatattaaaaaaaaaaagttt
ataaaaacaatgaataaactttgttaaaggtacaaatgaaaattagcaaa
catgggaagataattgagtaaagagtttaaagttaaaaacgaattgcagt
cattctaggggaaggaacagttgtatttgaaaacctgtatggttacatga
actgcctaaaaaacaagctaaggaaaattaaagctcagatttatatattt
taagaaattaattgcaattaatttcctgggattaaatagcatttcctcaa
ccccagctgtcattaaaaagaggcaaatacagccaaggactggatcttct
ccggaaggctgacagcactgaccctcaagaaggcaccggctgacagacag
aacattctgccctaatatgtgctgaaattccgctgagagcagagtggtac
attgaaccctttaggggcttacaaaagaagtgtcctgtgttttagagtca
cagagttttgcagaaacaagtatgaattcacctagtggccccctgcacca
ggtctttcctgtgggcactgagtgcagacacatcaatatgtaatagcaga
atgaatgactgaacgaacgattgaatgaaaagaaatgagaggcagcaggt
tgtcagattctatgaggcaatcacagcatcaggtgaccttagtatctatt
tgagaggactgccatttattctcgggagcgcacggctctaaagaggccca
tatccaggcagtgagctctggtggggggcgcctttagatgcaagaaggag
gaaacagctcgaaatccctgggcctgagcgcggcccgtgcaggccggagg
gtcaagaactctccaccggcggcagcggcccggtgtctgccccggcttcg
ccccggcctaaggctgcctgtgctataaatacgcggcccacatgccgcgg
tgacacggtgttccctgggctcggcgggacagataacatgaatgtgccct
ttaaacgtcccaagttgcagggacagcccccggcccagcctcgctcccgg
aagcgccttcgcccccgatgccctctgcagctgggaggagggggcgcccc
gcacctgcccagccaatgcgcggcgcgagcgccggccgcgacccgcctcc
tctcgcgagagcccggcggggatataagggggagctgcgggccaggcggc
ggccccctagcgtcgcgcagggtcggggactgcgcgcggtgccaggccgg
gcgtgggcgagagcacgaacgggctgcctgcgggctgagagcgtcgagct
gtcaccatgggtgatcacgcttggagcttcctaaaggacttcctggccgg
gggcgtcgccgctgccgtctccaagaccgcggtcgcccccatcgagaggg
tcaaactgctgctgcaggtgaggaccgcgcggtgcaagaggcgggcgcgg
gcgcggcgggccgggcggggcgcgcgatgcggcgcgagctgcagggcgcg
gggcgccgcggaaaatctgcgccaggccacaggcccgggcgcccgcccgc
ccgcgggggaagaaggtgccctctgcgtagagacaggtccagcgtcagtc
gcagattcctggtgtcgggtggcgcccggcgttcgggtgtctatatatgg
aaacccacccggagccggtttacgtgtgccagatcctgcgcccgtgacag
cacgggcgtgcactcaggcccggaggcacctagtgattgccagtattttt
ggcaccgtcttatgcgcacgcacctttacaataaaaacatcaaaataatc
atcacccaagaattcccttatcgtatctcatgcacaatgctgtatgtagg
ctgacgccttcatctttatgtaacctctgtgagagagttattcttctcca
ttttacagatgaagctgaggttttgaaatattaagaaacaattttcggaa
taaactcagatcatcctgtctccaaatcttttcctcccctacctggtcgc
tgaatggtttatcatcctctcgtgttttcctccacctgcccaaaaggtca
gggcccctcaatgaggaagagcccaatttgggagtcagaattactaacaa
caaaacccccacaaattgctcacaacggcagcaaacccttaataattgat
tacttggattatctgcttgaaaactttggaggcctaatgtttagtggatt
tattctccttcctctattagagcatctagtagagatcctcatctccaggg
tgatcagagtgacactgagaaattgtcattttttggccatcatgtctatt
aaatccaaagccctttgaagcagggagtgttactcatttctgtcccccag
taagcccctcatacagttctcaaacctagggaaagtgaaataaataaatg
gctatagctttatataattcaatcaccttttcagtttatttggggcaata
cctttccctcaaataccctaataattgaagcaacattggattattttggc
ttgttatccagtaactaacatggataacagtatccatttacacgtcctcg
tatccatttgatttcctcatcctttttttcttcaaaaaaaaaatctagga
agtgcaaaccttttttttttctcctgtcctcttcccttctctctaccctg
cctgtcctctgtcacccaccctcccctccaccaggtccagcatgccagca
aacagatcagtgctgagaagcagtacaaagggatcattgattgtgtggtg
agaatccctaaggagcagggcttcctctccttctggaggggtaacctggc
caacgtgatccgttacttccccacccaagctctcaacttcgccttcaagg
acaagtacaagcagctcttcttagggggtgtggatcggcataagcagttc
tggcgctactttgctggtaacctggcgtccggtggggccgctggggccac
ctccctttgctttgtctacccgctggactttgctaggaccaggttggctg
ctgatgtgggcaagggcgccgcccagcgtgagttccatggtctgggcgac
tgtatcatcaagatcttcaagtctgatggcctgagggggctctaccaggg
tttcaacgtctctgtccaaggcatcattatctatagagctgcctacttcg
gagtctatgatactgccaagggtgagagaggggcatcggggagaaggagg
gtggtgtggaaagaggatcctatgggatctataactcacaaaggacctga
tatatattgatcttgttttttctagtctctgggataattgaggcttctga
atgaggaggtgatgtgcataagttaatagctgaagcgttccttgtgtcct
ctactgaaataaactctggcctttagttattcagagaggaggagggggga
gcctgtctccctctagacacagccatagcagttactgagtttaacttgaa
gccacttccaatgccctgtatacaagctgagcactgcccctccggggtcc
ggagagggcagcagccacctttgctgtctgcctggtcatatgtgaagcac
ctgcacaggggcaggttccccgcaaggtcagagcatggagctggaggtgc
agtggcctctctccctccacctgctttctgctgagaacaggcacttcata
gccgttcggcttctgggctctgtccacagggatgctgcctgaccccaaga
acgtgcacatttttgtgagctggatgattgcccagagtgtgacggcagtc
gcagggctggtgtcctacccctttgacactgttcgtcgtagaatgatgat
gcagtccggccggaaagggggtaagcttgtgctctactcatctaaacttg
tttggttttgcccgaggagaacattttacagggctcctttcagtcttcct
tactggaaattaattttcaaaattatttgataaggacttagggaagaaag
atggtattaattccccctaacgttctcaactatcctattagggaaaagta
ttttccattttattagagatgataagaacatgaatagtaagacatttaga
tgtgaatttaactaggtatccagcattatagagaccctaggccctcttcc
cttagagcctgggtgcaaaagctagggaaaagaagtagttagctacttct
tacaaagaactcttgcttccctcctagttacaggtgttagtgggatgggg
tgtttagctgggtagagatggcctgaagcaatctgttgtgccagagaaag
ttttggcttctataggttgaaccatatgaaattgccactttaaaagtcaa
aaacagtccaatgttagcagtttcgtatgtttcaacgaatagttacagcc
ttttatttagactgcataacctcgtgcaggatcatctgaggctcagcctc
agttcggtcctccataaaaaaaggtaaccgcgtagcataatactcctgct
ccactgcgcccttcttgtttcgcagttgggcagtccatgaattacttggt
taattgccccagttcttcactgaccttgaactaatggagtaggaatgaca
ggagacccagcctgccagtgaagcaaggaaggagatgtccagtgggatgt
tgcatggagctgggactccatgcccagatgaccctgattttataaaactg
gtaacagtgtgtacagatatgtttcaggggaaaagtctctttcctccagc
gttacggagccctcaccagcatttgtttccacagccgatattatgtacac
ggggacagttgactgctggaggaagattgcaaaagacgaaggagccaagg
ccttcttcaaaggtgcctggtccaatgtgctgagaggcatgggcggtgct
tttgtattggtgttgtatgatgagatcaaaaaatatgtctaatgtaatta
aaacacaagttcacagatttacatgaacttgatctacaagttcacagatc
cattgtgtggtttaatagactattcctaggggaagtaaaaagatctggga
taaaaccagactgaaggaatacctcagaagagatgcttcattgagtgttc
attaaaccacacatgtattttgtatttattttacatttaaattcccacag
caaatagaaaataatttatcatacttgtacaattaactgaagaattgata
ataactgaatgtgaaacatcaataaagaccacttaatgcacgctttctat
tttattgaactcttattaactgtaaaatgcatttttaaaagatcaaaaat
gcatattttctagcatgattcatgtatcagtcagcagccaagcttctaaa
tgccagatattatattgagaatgtattatatgagaacgtacaatgcttaa
agttccggttttcaaacttaggcaggtcatattctatctatcttatccag
cgttactgtaggctagaaagtgataatggctttcataatcctgccttgtc
ttaggcactttcctgcag
71 Surrogate Filters, Scenario III: Case of the Mortal Mitochondrion
Strategy
- Assume that encoded protein is in mitochondria
- Protein has function associated with mitochondrial location?
  - Use Gene finder to identify protein sequence(s)
  - Use Similarity finder to identify possible function
- Protein has structure associated with mitochondrial location?
  - Use Feature finders to identify pertinent regions
  - (What ARE pertinent regions?)
72 Surrogate Filters, Scenario III: Case of the Mortal Mitochondrion
Run 4q34 region through FGene
Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat
gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg
fgene Wed Feb 27 16:55:29 GMT 2002
>PEO-related_gene?
length of sequence - 5768
number of predicted exons - 5
positions of predicted exons:
  1607 - 1717   w 17.84   ORF 1607 - 1717
  2985 - 3231   w  9.13   ORF 2985 - 3230
  3421 - 3471   w  6.08   ORF 3423 - 3470
  3980 - 4120   w 12.62   ORF 3982 - 4119
  5035 - 5192   w  1.93   ORF 5037 - 5192
Length of Coding region - 708bp
Amino acid sequence - 235aa
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR
IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG
IIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMMQSG
RKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV
73 Surrogate Filters, Scenario III: Case of the Mortal Mitochondrion
Run 4q34 region through FGeneSH
Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat
gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg
Fgenesh Wed Feb 27 16:59:14 GMT 2002
FGENESH 1.0 Prediction of potential genes in Human genomic DNA
Time: Wed Feb 27 16:59:14 2002
Seq name: PEO-related_gene?
Length of sequence: 5768   GC content: 48   Zone: 2
Positions of predicted genes and exons:
  G  Str  Feature   Start     End    Score    ORF           Len
  1       TSS       1216             -2.70
  1   1   CDSf      1607 -   1717    18.01    1607 - 1717   111
  1   2   CDSi      2985 -   3471    52.41    2985 - 3470   486
  1   3   CDSi      3980 -   4120    20.99    3982 - 4119   138
  1   4   CDSl      5035 -   5192     2.32    5037 - 5192   156
  1       PolA      5471              0.92
Predicted protein(s):
>FGENESH 1 4 exon (s) 1607 - 5192 298 aa, chain
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR
IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG
GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS
VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM
QSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV
FGENE output:
  1607 - 1717   w 17.84
  2985 - 3231   w  9.13
  3421 - 3471   w  6.08
  3980 - 4120   w 12.62
  5035 - 5192   w  1.93
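The two predictions can be compared by simple arithmetic on the exon coordinates reported above. The sketch below sums exon lengths and converts them to protein length, assuming 1-based inclusive coordinates and a stop codon at the end of the last exon. The function names are made up for this example.

```python
# Exon coordinates as reported by the two programs (1-based, inclusive).
FGENE_EXONS = [(1607, 1717), (2985, 3231), (3421, 3471), (3980, 4120), (5035, 5192)]
FGENESH_EXONS = [(1607, 1717), (2985, 3471), (3980, 4120), (5035, 5192)]

def coding_length(exons):
    """Total number of bases covered by the exons."""
    return sum(end - start + 1 for start, end in exons)

def protein_length(exons):
    """Number of codons in the coding region, minus the stop codon."""
    return coding_length(exons) // 3 - 1

for name, exons in (("FGene", FGENE_EXONS), ("FGeneSH", FGENESH_EXONS)):
    print(name, coding_length(exons), "bp,", protein_length(exons), "aa")
```

The totals agree with the reported 708 bp / 235 aa and 298 aa. The whole difference comes from positions 3232 to 3420 (189 bases, 63 amino acids), which FGene treats as an intron and FGeneSH as part of one long exon.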
74 How to decide where exons are?
Strategy
- Compare sequence of 4q34 region to sequence of mRNA
  - Sequence of mRNA may be in cDNA library
  - Expressed Sequence Tag (EST) library
Problems
- Library may not exist
- Expression of gene may be low
75 Surrogate Filters, Scenario III: Case of the Mortal Mitochondrion
Run 4q34 region through BlastN (vs. human ESTs)
MORAL: Trust, but verify.
76 Surrogate Filters, Scenario III: Case of the Mortal Mitochondrion
Strategy
- Assume that encoded protein is in mitochondria
- Protein has function associated with mitochondrial location?
?
  - Use Gene finder to identify protein sequence(s)
  - Use Similarity finder to identify possible function
- Protein has structure associated with mitochondrial location?
  - Use Feature finders to identify pertinent structures
  - (What ARE pertinent structures?)
77 Surrogate Filters, Scenario III: Case of the Mortal Mitochondrion
Run 4q34 region through BlastP
78 Surrogate Filters, Scenario III: Case of the Mortal Mitochondrion
Run 4q34 region through BlastP
- Summary
  - One protein in region
  - Contains mitochondrial carrier motifs
  - Similar to ATP/ADP transporter
  - Mitochondrial signal sequence?
Reasonable candidate for PEO-related protein
79 Complex gene discovery
Your turn: Repeat and extend characterization of PEO-related gene
1. Take same sequence (FastA format) e-mailed to you
2. Get better estimate of promoter and polyA site (e.g. by TSSW and PolyASH)
   (Is there a TATA box upstream from the predicted promoter?)
3. Find encoded protein sequence by suitable method (e.g. FGeneSH(GC) or comparison with cDNA)
4. Continue characterization of protein
   Contains signal sequence?
   Contains transmembrane domains?
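One simple feature finder for the transmembrane question is a hydropathy profile: average a hydrophobicity scale over a sliding window and look for strongly hydrophobic stretches. The sketch below uses the Kyte-Doolittle scale. The window of 19 residues and the threshold of 1.6 are the commonly quoted rule of thumb, used here as assumptions, not a validated predictor. Only the first 60 residues of the FGeneSH protein are scanned as an example.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window of the given length."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def candidate_tm_windows(protein, window=19, threshold=1.6):
    """0-based start positions of windows at or above the threshold."""
    profile = hydropathy_profile(protein, window)
    return [i for i, h in enumerate(profile) if h >= threshold]

# First 60 residues of the protein predicted by FGeneSH.
N_TERMINUS = "MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR"
print(candidate_tm_windows(N_TERMINUS))
```

A list of window positions is only a hint. Whether the protein really spans the membrane, or carries a mitochondrial signal sequence, needs a dedicated predictor and a look at related proteins.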
81 Filter limitation: Inevitable... but whose filter?
82 Filters controlled by outside programmers
83 Filters controlled by you