Tale 1: To Identify an Unknown Gene - PowerPoint PPT Presentation

Slides: 86
Provided by: peopl2

Transcript and Presenter's Notes

Title: Tale 1: To Identify an Unknown Gene


1
(No Transcript)
2
(No Transcript)
3
(No Transcript)
4
(No Transcript)
5
Frog's eye view of the jungle (time frozen)
6
Frog's eye view of the jungle (time frozen)
Frog's eye view of the jungle (time moving)
7
Frog's eye view of the jungle (through movement filter)
9
Filters: Information reducers
Movement filter
10
Filters: Information reducers
Sequence filter
TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG
AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT
TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC
TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC
GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC
CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA
TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA
AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG
AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA
TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG
CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC
GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG
GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT
CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA
ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA
TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT
CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA
CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG
AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC
CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA
TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT
How organism is made
How organism works
CTCCGTAAAC CTCTAAC...
11
From Sequence to Organism: How does Nature do it?
12
From Sequence to Organism: How does Nature do it?
Genetic code
Rules of folding
13
From Sequence to Organism: How does Nature do it?
Genetic code
  • Custom antibiotics

Gives us
14
From Sequence to Organism: How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA
  • Custom antibiotics
  • Custom antibodies

Gives us
  • Custom enzymes
  • New materials
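
The "genetic code" step named on these slides can be sketched in a few lines. This is only an illustration, assuming the standard genetic code and a reading frame that starts at the first base; the name `translate` is ours and does not come from the presentation.

```python
from itertools import product

# Standard genetic code with codons enumerated in T, C, A, G order
# (assumption: no organism-specific or mitochondrial variant).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA string codon by codon, starting at the first base."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # "X" marks any codon that is not a plain A/C/G/T triplet.
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


# The 27-base sequence shown on the slide.
print(translate("ATGACTTATGATCAACGCACAGGGCTA"))  # MTYDQRTGL
```

The folding rules mentioned alongside the genetic code have no comparably short sketch, which is part of the point the slides go on to make.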

15
From Sequence to Organism: How does Nature do it?
ATGACTTATGATCAACGCACAGGGCTA
  • Transcriptional initiation
  • Transcriptional termination / polyA tailing
  • Splicing
  • Translational initiation

Rules of transcriptional and post-transcriptional
control
16
From Sequence to Organism: How does Nature do it?
  • Natural filters/transformations
  • Selective transcription
  • Selective processing
  • Translation
  • Folding

Functional protein
DNA
17
From Sequence to Organism: How does Nature do it?
Natural filters/transformations
Functional protein
DNA
18
From Sequence to Organism: How can WE do it?
Simulation of Nature
Whether 'tis nobler in the mind to suffer the
slings and arrows of outrageous fortune...
We must give our military every tool and weapon
it needs to prevail...
???
19
From Sequence to Organism: How can WE do it?
Surrogate Processes
Utterance of Wm Shakespeare:
Whether 'tis nobler in the mind to suffer the
slings and arrows of outrageous fortune...
Utterance of George W Bush:
We must give our military every tool and weapon
it needs to prevail...
Surrogate measures: Words/sentence, Choice of words, Sentence structure
20
From Sequence to Organism: How can WE do it?
Surrogate filters
  • Natural filters/transformations
  • Selective transcription
  • Selective processing
  • Translation
  • Folding
  • Gene finders

21
From Sequence to Organism: How can WE do it?
  • Surrogate filters
  • Gene finders
  • Natural filters/transformations
  • Selective transcription
  • Selective processing
  • Translation
  • Folding
  • Similarity finders

22
From Sequence to Organism: How can WE do it?
  • Surrogate filters
  • Gene finders
  • Similarity finders
  • Feature finders
  • Natural filters/transformations
  • Selective transcription
  • Selective processing
  • Translation
  • Folding

23
From Sequence to Organism: How can WE do it?
  • Surrogate filters
  • Gene finders
  • Similarity finders
  • Feature finders
  • Pattern finders
  • Natural filters/transformations
  • Selective transcription
  • Selective processing
  • Translation
  • Folding

24
Surrogate Filters
You do it
25
Surrogate Filters: Gene finders
Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
Look for stop codons (TAA, TAG, TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAA
TGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
26
Surrogate Filters: Gene finders
Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
Look for stop codons (TAA, TAG, TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAA
TGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATT
TGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG
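
A Class 1 search of this kind can be sketched as follows: scan each of the three frames on both strands for TAA, TAG and TGA, and report the longest stretch without a stop. This is a minimal illustration, not the algorithm of Map, Frames or OrfFinder; all function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def stop_positions(dna, frame):
    """0-based start positions of stop codons in one forward frame (0, 1 or 2)."""
    dna = dna.upper()
    return [i for i in range(frame, len(dna) - 2, 3) if dna[i:i + 3] in STOP_CODONS]


def longest_open_stretch(dna, frame):
    """Length in codons of the longest run without a stop codon in this frame."""
    dna = dna.upper()
    best = run = 0
    for i in range(frame, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best


def six_frame_report(dna):
    """Map (strand, frame) to (stop positions, longest open stretch)."""
    report = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            report[(strand, frame)] = (stop_positions(seq, frame),
                                       longest_open_stretch(seq, frame))
    return report


# Top-strand sequence from the slide, with its two lines joined.
SLIDE_SEQ = ("CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAA"
             "TGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA")
for key, (stops, longest) in six_frame_report(SLIDE_SEQ).items():
    print(key, "stops at", stops, "longest open stretch:", longest, "codons")
```

The sketch ignores start codons entirely, which is one reason searches of this class are considered inaccurate.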
27
Surrogate Filters: Gene finders
Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
Pro: Quick, simple
Con: Useless for eukaryotic genomic sequences (introns)
     Inaccurate (start codon problem)
     Inaccurate (doubtful short open reading frames)
28
Surrogate Filters: Gene finders
Class 2: Codon bias recognition (TestCode)
Are codons equally used?
The code is degenerate
29
Surrogate Filters: Gene finders
Class 2: Codon bias recognition (TestCode)
Most frequently used codons
Codon bias universal?
Codon usage is biased
30
Surrogate Filters: Gene finders
Class 2: Codon bias recognition (TestCode)
Pro: Quick, simple, available through GCG
     Better than Class 1 in excluding false open reading frames
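
The idea behind codon bias recognition can be illustrated with a small sketch: tabulate codon usage from known coding sequence, then ask which reading frame of a candidate uses the favored codons. This is not the statistic TestCode itself computes; the function names, the log-average score and the `floor` value are assumptions made for illustration.

```python
import math
from collections import Counter


def codon_usage(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def frame_score(seq, usage, frame, floor=1e-3):
    """Mean log frequency of the codons read in one frame.

    Codons absent from the reference table get the (hypothetical) floor
    frequency, so a frame full of unused codons scores very low.
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if not codons:
        return float("-inf")
    return sum(math.log(max(usage.get(c, 0.0), floor)) for c in codons) / len(codons)


# Toy reference: a made-up coding sequence that strongly prefers CTG for leucine.
reference = codon_usage("CTGCTGCTGCTA")
for frame in range(3):
    print(frame, round(frame_score("CTGCTGCTG", reference, frame), 3))
```

In practice the reference table would come from many known genes of the same organism, since codon preferences differ between organisms.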
31
Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Principle
Step 1: Create model through extensive training set
        Training set: proven or suspected genes
        Organism-specific
Step 2: Assess candidate genes through filter of model
32
Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 1: Create model through extensive training set
AAA AAC AAG AAT ACA . . . TTG TTT
Training set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATC
AGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTAC
CATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTC
TGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCT
CCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACT
TCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAAC
AAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACT
TTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTA
GTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTG
AGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTA
ATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGT
GGAGATGATTCGGTAGCTTT
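
Step 1 can be sketched as simple counting: for every three-base context in the training set, tally which base follows it and convert the tallies to probabilities. This is a simplified illustration only; real gene finders of this class typically use frame-dependent models, and the function name and pseudocount are assumptions.

```python
from collections import defaultdict


def train_markov(training_seqs, order=3, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", 0))
    for seq in training_seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, nxt = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous bases such as N.
            if set(context + nxt) <= set("ACGT"):
                counts[context][nxt] += 1
    model = {}
    for context, row in counts.items():
        total = sum(row.values()) + 4 * pseudocount
        model[context] = {b: (row[b] + pseudocount) / total for b in "ACGT"}
    return model


# First line of the training sequence shown on the slide.
model = train_markov(["AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATC"])
print(model["AAA"])
```

A single short sequence gives very rough estimates, which is why the slides stress an extensive, organism-specific training set.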
35
Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
3rd order Markov model
        A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene
AAAGCAA
36
Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
3rd order Markov model
        A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene
AAAGCAA
0.12 x 0.15
37
Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
3rd order Markov model
        A     C     G     T
AAA   0.33  0.25  0.12  0.30
AAC   0.30  0.20  0.15  0.35
AAG   0.35  0.15  0.20  0.30
AAT   0.30  0.15  0.20  0.25
ACA   0.25  0.20  0.15  0.35
. . .
TTG   0.25  0.30  0.15  0.30
TTT   0.30  0.25  0.10  0.35
Candidate gene
AAAGCTA
0.12 x 0.15 . . .
So far, not a good candidate!
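
The first two steps of this assessment can be reproduced directly from the table. The sketch below multiplies P(next base | previous three bases) along the candidate and stops when it reaches a context that the slide elides; comparing against a uniform 0.25 per base is our assumption for what "not a good candidate" means, and the function name is hypothetical.

```python
# Rows copied from the slide's table; contexts the slide elides are absent.
MODEL = {
    "AAA": {"A": 0.33, "C": 0.25, "G": 0.12, "T": 0.30},
    "AAC": {"A": 0.30, "C": 0.20, "G": 0.15, "T": 0.35},
    "AAG": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
    "AAT": {"A": 0.30, "C": 0.15, "G": 0.20, "T": 0.25},
    "ACA": {"A": 0.25, "C": 0.20, "G": 0.15, "T": 0.35},
    "TTG": {"A": 0.25, "C": 0.30, "G": 0.15, "T": 0.30},
    "TTT": {"A": 0.30, "C": 0.25, "G": 0.10, "T": 0.35},
}


def score_prefix(candidate, model, order=3):
    """Multiply transition probabilities until a context is missing from the table.

    Returns the running probability and the number of transitions scored.
    """
    prob, steps = 1.0, 0
    for i in range(order, len(candidate)):
        context = candidate[i - order:i]
        if context not in model:
            break
        prob *= model[context][candidate[i]]
        steps += 1
    return prob, steps


prob, steps = score_prefix("AAAGCAA", MODEL)
uniform = 0.25 ** steps  # assumption: baseline of equally likely bases
print(f"{steps} transitions scored: {prob:.3f} versus {uniform:.4f} for random sequence")
```

After AAA→G (0.12) and AAG→C (0.15) the product is 0.018, well below the 0.0625 expected if all four bases were equally likely at each step.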
38
Surrogate Filters: Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Pro: Nearly the most accurate method known
Con: Needs big training set
     May miss genes of foreign origin
     Will miss very small genes
40
Surrogate Filters
Scenario I: Case of the Hidden Heterocyst
41
Case of the Hidden Heterocyst
NH3
N2
O2
Matveyev and Elhai (unpublished)
42
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
1. Use transposon mutagenesis
43
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome
Transposon
1. Use transposon mutagenesis
to find a mutant defective in heterocyst
differentiation
44
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATC
AGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTAC
CATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTC
TGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCT
CCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACT
TCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAAC
AAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACT
TTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTA
GTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTG
AGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTA
ATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGT
GGA
1. Use transposon mutagenesis
to find a mutant defective in heterocyst
differentiation
2. Sequence out from transposon
45
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome
Do it
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATC
AGACAGAGAATCATCGTGCTGTCAGTAAAACCTCTGATTTCGATCTTTAC
CATAATTGTTATGTTGTAATGACTAACCAGACTATCTTTTACAGAGCTTC
TGGTTAACACTTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCTATGAGACGCT
CCGCCAACGAGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTTAACT
TCAGAAATTCACGGCGGAAATCCATAGTTATTATTACTTATGACTAAAAC
AAAATTACTATGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTATATTTCGACT
TTAAAACTTATAGTAGATGGCTTAATTCTCAAATAACAAACTCATTTTTA
GTAGATATTTCATGCAAACTGAGGTTTTTAGTGATATTTTCCCCTTATTG
AGTACAGCCACTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGATGCCTGGGGTA
ATGCAGTTTATTTCGTTGTATCTGGATGGGTAAAAGTTCGGCGCACCTGT
GGA
1. Use transposon mutagenesis
to find a mutant defective in heterocyst
differentiation
2. Sequence out from transposon
3. Find gene boundaries
4. Identify gene
46
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
1. Go to http://www.vcu.edu/elhaij/BioInf
2. Open second browser (Ctrl-N in Netscape).
Go to same site (copy and paste URL)
3. In 1st browser, go to Program List. Click
on Gene Finders. Open GeneMark
4. In 2nd browser, open Nostoc sequence
47
(No Transcript)
48
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Mission successful
>Translation 397..639 (direct), 81 amino acids
VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNARVTQTVKLVRLEKFLSLQKSVEEA
LENVK
... or was it?
Check predicted protein against databases
49
Surrogate Filters: Similarity finders
  • Blast
  • BlastP: Protein sequence to search protein
    database
  • BlastN: Nucleotide sequence to search nucleotide
    database
  • BlastX: Nucleotide sequence (translated) to
    search protein database
  • TBlastN: Protein sequence to search (translated)
    nucleotide database
  • Blast2Seq: Compare two sequences you specify
  • FastA
  • (Various flavors)

Do it
Pfam (Protein motif families): Finds conserved
motifs similar to protein sequence
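
What "similarity" means to these programs can be illustrated with an exact local alignment score. Blast and FastA are fast heuristics; the sketch below instead uses the Smith-Waterman dynamic-programming method, which they approximate. The scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration, not settings of any of the listed programs.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of two sequences (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two made-up sequences that share only the core ACGT.
print(local_alignment_score("GGACGTCC", "TTACGTAA"))  # 8
```

Database search programs must additionally judge whether a score is statistically significant, which this sketch does not attempt.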
50
(No Transcript)
51
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Mission successful
>Translation 397..639 (direct), 81 amino acids
VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNARVTQTVKLVRLEKFLSLQKSVEEA
LENVK
Why?
  • GeneMark correct: Conservation of noncoding
    regions
  • GeneMark wrong: Fooled by weird aa sequence or
    start codon

52
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Moral: Automated gene finders are wonderful, but
common sense is better
Don't trust automated annotation
53
Surrogate Filters: Feature finders
  • Hidden Markov model-based methods
  • Good for contiguous features (e.g. signal
    sequences)
  • Not good with features with gaps (e.g. promoters)
  • Ad hoc methods
  • Feature-specific rules (e.g. tandem repeats,
    terminators)

Position-dependent frequency tables
Position-specific scoring matrix (PSSM)
Weight table
54
Surrogate Filters: Feature finders
Position-dependent frequency tables
Consensus: TATAAA
55
Surrogate Filters: Feature finders
Position-dependent frequency tables
56
Surrogate Filters: Feature finders
Position-Specific Scoring Matrix in action
aceB  ACTATGGAGCATCTGCACATGAAAACC
atpI  ACCTCGAAGGGAGCAGGAGTGAAAAAC
bioB  ACGTTTTGGAGAAGCCCCATGGCTCAC
glnA  ATCCAGGAGAGTTAAAGTATGTCCGCT
glnH  TAGAAAAAAGGAAATGCTATGAAGTCT
lacZ  TTCACACAGGAAACAGCTATGACCATG
rpsJ  AATTGGAGCTCTGGTCTCATGCAGAAC
serC  GCAACGTGGTGAGGGGAAATGGCTCAA
sucA  GATGCTTAAGGGATCACGATGCAGAAC
trpE  CAAAATTAGAGAATAACAATGCAAACA
unknown
Experimentally proven start sites
58
Surrogate Filters: Feature finders
Position-Specific Scoring Matrix in action
aceB  ACCACATAACTATGGAGCATCTGCACATGAAAACC
atpI  ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB  ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA  ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH  TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ  TTCACACAGGAAACAG....CTATGACCATG
rpsJ  AATTGGAGCTCTGGTCTCATGCAGAAC
serC  GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA  GATGCTTAAGGGATCA....CGATGCAGAAC
trpE  CAAAATTAGAGAATA...ACAATGCAAACA
59
Surrogate Filters: Feature finders
Position-Specific Scoring Matrix in action
aceB  ACCACATAACTATGGAGCATCT.GCACATGAAAACC
atpI  ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB  ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA  ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH  TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ  TTCACACAGGAAACAG....CTATGACCATG
rpsJ  AATTGGAGCTCTGGTCTCATGCAGAAC
serC  GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA  GATGCTTAAGGGATCA....CGATGCAGAAC
trpE  CAAAATTAGAGAATA...ACAATGCAAACA
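
A position-dependent frequency table and the scoring matrix derived from it can be sketched with the ten proven start sites. To avoid the gap placement question, the sketch uses only the last nine bases of each sequence (the start codon plus six downstream bases), where all ten already line up. The pseudocount, the uniform 0.25 background and the function names are assumptions for illustration.

```python
import math

# Last nine bases of each experimentally proven start site from the slides.
SITES = [
    "ATGAAAACC",  # aceB
    "GTGAAAAAC",  # atpI
    "ATGGCTCAC",  # bioB
    "ATGTCCGCT",  # glnA
    "ATGAAGTCT",  # glnH
    "ATGACCATG",  # lacZ
    "ATGCAGAAC",  # rpsJ
    "ATGGCTCAA",  # serC
    "ATGCAGAAC",  # sucA
    "ATGCAAACA",  # trpE
]


def frequency_table(sites):
    """Base frequencies at each position of equal-length aligned sites."""
    table = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        table.append({b: column.count(b) / len(sites) for b in "ACGT"})
    return table


def pssm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds matrix: observed frequency (with pseudocount) over background."""
    n = len(sites)
    matrix = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        matrix.append({
            b: math.log2((column.count(b) + pseudocount) / (n + 4 * pseudocount) / background)
            for b in "ACGT"
        })
    return matrix


def score(window, matrix):
    """Sum the matrix entries selected by each base of the window."""
    return sum(matrix[i][base] for i, base in enumerate(window))


matrix = pssm(SITES)
print(frequency_table(SITES)[0])  # first position: A in 9 of 10 sites, G in 1
print(round(score("ATGAAAACC", matrix), 2), round(score("CCCCCCCCC", matrix), 2))
```

Sliding `score` along an unknown sequence and keeping the high-scoring windows is how such a matrix would be used to propose start sites; extending the matrix upstream would require settling the gapped alignment first.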
60
Surrogate Filters: Pattern finders
Specified patterns (FindPatterns, PatScan)
e.g. Find instances of restriction sites
New pattern discovery (Meme, Gibbs sampler)
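
Searching for a specified pattern is the simpler of the two tasks. The sketch below finds exact occurrences of three well-known restriction sites with Python's `re` module; it does not reproduce the pattern syntax of FindPatterns or PatScan, and the function name is hypothetical.

```python
import re

# Recognition sequences of three common restriction enzymes.
RESTRICTION_SITES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT", "BamHI": "GGATCC"}


def find_sites(dna, sites=RESTRICTION_SITES):
    """Return {enzyme: [1-based start positions]} for exact matches."""
    dna = dna.upper()
    hits = {}
    for enzyme, pattern in sites.items():
        # A lookahead reports overlapping occurrences as well.
        hits[enzyme] = [m.start() + 1 for m in re.finditer(f"(?={pattern})", dna)]
    return hits


# First line of the Nostoc sequence used earlier in the presentation.
print(find_sites("AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATCAATGACTATC"))
```

All three sites are palindromic, so searching one strand is enough for them; a non-palindromic pattern would also need a search of the reverse complement.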
61
Surrogate Filters: Pattern finders
How do pattern finders work?
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from
a sequence
Step 2. Find best matches to pattern in all
sequences
Step 3. Construct position-dependent frequency
table based on matches
Step 4. Calculate relative probability of matches
from frequency table
62
Surrogate Filters: Pattern finders
How do pattern finders work?
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from
a sequence
Step 2. Find best matches to pattern in all
sequences
Step 3. Construct position-dependent frequency
table based on matches
Step 4. Calculate relative probability of matches
from frequency table
Step 5. If probability score high, remember
pattern and score
63
Surrogate Filters: Pattern finders
How do pattern finders work?
snRNA U1 (pU1-6)  AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t       GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14            CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1               GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1      CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from
a sequence
Step 2. Find best matches to pattern in all
sequences
Step 3. Construct position-dependent frequency
table based on matches
Step 4. Calculate relative probability of matches
from frequency table
Step 5. If probability score high, remember
pattern and score
Step 6. Repeat Steps 1 - 5
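
Steps 1 to 6 can be sketched as a small greedy search. This is a deliberately simplified illustration and not the algorithm of Meme or the Gibbs sampler: seeds are taken systematically from the first sequence rather than arbitrarily, "best match" means most identical positions, and the score is a log-odds sum against a uniform background. All names and parameter values are assumptions.

```python
import math


def best_match(pattern, seq):
    """Window of seq with the most identities to pattern (first one on ties)."""
    w = len(pattern)
    windows = [seq[i:i + w] for i in range(len(seq) - w + 1)]
    return max(windows, key=lambda win: sum(a == b for a, b in zip(pattern, win)))


def match_score(matches, pseudocount=0.5, background=0.25):
    """Log2-odds of all matched bases under the matches' own frequency table."""
    n = len(matches)
    total = 0.0
    for pos in range(len(matches[0])):
        column = [m[pos] for m in matches]          # Step 3: frequency table column
        for base in column:
            freq = (column.count(base) + pseudocount) / (n + 4 * pseudocount)
            total += math.log2(freq / background)   # Step 4: relative probability
    return total


def discover(seqs, width):
    """Return (score, seed, matches) for the best-scoring seed window."""
    best = (float("-inf"), None, None)
    first = seqs[0]
    for i in range(len(first) - width + 1):            # Steps 1 and 6: try each seed
        seed = first[i:i + width]
        matches = [best_match(seed, s) for s in seqs]  # Step 2: best match per sequence
        s = match_score(matches)
        if s > best[0]:                                # Step 5: remember the best
            best = (s, seed, matches)
    return best


# The five promoter regions from the slide, with wrapped lines joined.
PROMOTERS = [
    "AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC",  # snRNA U1 (pU1-6)
    "GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT",  # histone H1t
    "CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG",  # HMG-14
    "GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT",  # TP1
    "CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT",  # protamine P1
]
score, seed, matches = discover(PROMOTERS, width=6)
print(round(score, 2), seed, matches)
```

Because every seed comes from the first sequence, a motif absent from that sequence cannot be found; the real programs avoid this kind of dependence on a starting point.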
64
Surrogate Filters
Scenario II: Case of the Masked Motif
  • You've found a gene related to Purple Tongue
    Syndrome
  • BlastP: Encoded protein related to cAMP-binding
    proteins
  • Are the similarities trivial? Related to cAMP
    binding?
  • Does your protein contain a cAMP-binding site?
  • What IS a cAMP-binding site?
  • Task
  • Determine what is a cAMP-binding site
  • Determine if your protein has one

65
Surrogate Filters
Scenario II: Case of the Masked Motif
Strategy
  • Collect sequences of known cAMP-binding proteins
  • Run Meme, a pattern-finding program. Ask it to
    find any significant motifs
Do it
  • Rerun Meme. Demand that every protein has
    identified motifs
  • Run Pfam over known sequence to check

66
(No Transcript)
67
Surrogate Filters
Scenario III: Case of the Mortal Mitochondrion
  • Progressive External Ophthalmoplegia (PEO)
  • Slow paralysis of voluntary eye muscles
  • Many other symptoms (e.g., frequent deafness)
  • Loss of mitochondrial DNA

68
Surrogate Filters
Scenario III: Case of the Mortal Mitochondrion
  • Progressive External Ophthalmoplegia (PEO)
  • Slow paralysis of voluntary eye muscles
  • Many other symptoms (e.g., frequent deafness)
  • Loss of mitochondrial DNA
  • Inheritance
  • Mendelian
  • Autosomal dominant
  • Linked to chromosome 4q34

69
Surrogate Filters
Scenario III: Case of the Mortal Mitochondrion
  • Progressive External Ophthalmoplegia (PEO)
  • Slow paralysis of voluntary eye muscles
  • Many other symptoms (e.g., frequent deafness)
  • Loss of mitochondrial DNA
  • Inheritance
  • Mendelian
  • Autosomal dominant
  • Linked to chromosome 4q34
  • Your task
  • Examine sequence of 4q34 region
  • Assess likelihood that a gene in the area could
    cause disease symptoms

70
Surrogate Filters
Scenario III: Case of the Mortal Mitochondrion
Examining Sequence of 4q34 Region
tctacttatattcaatccacagggctacacctagttcttggtacacagta
catgctcagcaagagtctgttgaatgaacacatacatggtttatctgttt
gtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcc
tgccccactcttagataaacgcatgccctctgtggccctggaaccttagt
gacttctgctataccaaagtctccacgcccagggtgacacgcagctgcag
ctccgtaaacctctaacatgatgtcagcaaatattaaaaaaaaaaagttt
ataaaaacaatgaataaactttgttaaaggtacaaatgaaaattagcaaa
catgggaagataattgagtaaagagtttaaagttaaaaacgaattgcagt
cattctaggggaaggaacagttgtatttgaaaacctgtatggttacatga
actgcctaaaaaacaagctaaggaaaattaaagctcagatttatatattt
taagaaattaattgcaattaatttcctgggattaaatagcatttcctcaa
ccccagctgtcattaaaaagaggcaaatacagccaaggactggatcttct
ccggaaggctgacagcactgaccctcaagaaggcaccggctgacagacag
aacattctgccctaatatgtgctgaaattccgctgagagcagagtggtac
attgaaccctttaggggcttacaaaagaagtgtcctgtgttttagagtca
cagagttttgcagaaacaagtatgaattcacctagtggccccctgcacca
ggtctttcctgtgggcactgagtgcagacacatcaatatgtaatagcaga
atgaatgactgaacgaacgattgaatgaaaagaaatgagaggcagcaggt
tgtcagattctatgaggcaatcacagcatcaggtgaccttagtatctatt
tgagaggactgccatttattctcgggagcgcacggctctaaagaggccca
tatccaggcagtgagctctggtggggggcgcctttagatgcaagaaggag
gaaacagctcgaaatccctgggcctgagcgcggcccgtgcaggccggagg
gtcaagaactctccaccggcggcagcggcccggtgtctgccccggcttcg
ccccggcctaaggctgcctgtgctataaatacgcggcccacatgccgcgg
tgacacggtgttccctgggctcggcgggacagataacatgaatgtgccct
ttaaacgtcccaagttgcagggacagcccccggcccagcctcgctcccgg
aagcgccttcgcccccgatgccctctgcagctgggaggagggggcgcccc
gcacctgcccagccaatgcgcggcgcgagcgccggccgcgacccgcctcc
tctcgcgagagcccggcggggatataagggggagctgcgggccaggcggc
ggccccctagcgtcgcgcagggtcggggactgcgcgcggtgccaggccgg
gcgtgggcgagagcacgaacgggctgcctgcgggctgagagcgtcgagct
gtcaccatgggtgatcacgcttggagcttcctaaaggacttcctggccgg
gggcgtcgccgctgccgtctccaagaccgcggtcgcccccatcgagaggg
tcaaactgctgctgcaggtgaggaccgcgcggtgcaagaggcgggcgcgg
gcgcggcgggccgggcggggcgcgcgatgcggcgcgagctgcagggcgcg
gggcgccgcggaaaatctgcgccaggccacaggcccgggcgcccgcccgc
ccgcgggggaagaaggtgccctctgcgtagagacaggtccagcgtcagtc
gcagattcctggtgtcgggtggcgcccggcgttcgggtgtctatatatgg
aaacccacccggagccggtttacgtgtgccagatcctgcgcccgtgacag
cacgggcgtgcactcaggcccggaggcacctagtgattgccagtattttt
ggcaccgtcttatgcgcacgcacctttacaataaaaacatcaaaataatc
atcacccaagaattcccttatcgtatctcatgcacaatgctgtatgtagg
ctgacgccttcatctttatgtaacctctgtgagagagttattcttctcca
ttttacagatgaagctgaggttttgaaatattaagaaacaattttcggaa
taaactcagatcatcctgtctccaaatcttttcctcccctacctggtcgc
tgaatggtttatcatcctctcgtgttttcctccacctgcccaaaaggtca
gggcccctcaatgaggaagagcccaatttgggagtcagaattactaacaa
caaaacccccacaaattgctcacaacggcagcaaacccttaataattgat
tacttggattatctgcttgaaaactttggaggcctaatgtttagtggatt
tattctccttcctctattagagcatctagtagagatcctcatctccaggg
tgatcagagtgacactgagaaattgtcattttttggccatcatgtctatt
aaatccaaagccctttgaagcagggagtgttactcatttctgtcccccag
taagcccctcatacagttctcaaacctagggaaagtgaaataaataaatg
gctatagctttatataattcaatcaccttttcagtttatttggggcaata
cctttccctcaaataccctaataattgaagcaacattggattattttggc
ttgttatccagtaactaacatggataacagtatccatttacacgtcctcg
tatccatttgatttcctcatcctttttttcttcaaaaaaaaaatctagga
agtgcaaaccttttttttttctcctgtcctcttcccttctctctaccctg
cctgtcctctgtcacccaccctcccctccaccaggtccagcatgccagca
aacagatcagtgctgagaagcagtacaaagggatcattgattgtgtggtg
agaatccctaaggagcagggcttcctctccttctggaggggtaacctggc
caacgtgatccgttacttccccacccaagctctcaacttcgccttcaagg
acaagtacaagcagctcttcttagggggtgtggatcggcataagcagttc
tggcgctactttgctggtaacctggcgtccggtggggccgctggggccac
ctccctttgctttgtctacccgctggactttgctaggaccaggttggctg
ctgatgtgggcaagggcgccgcccagcgtgagttccatggtctgggcgac
tgtatcatcaagatcttcaagtctgatggcctgagggggctctaccaggg
tttcaacgtctctgtccaaggcatcattatctatagagctgcctacttcg
gagtctatgatactgccaagggtgagagaggggcatcggggagaaggagg
gtggtgtggaaagaggatcctatgggatctataactcacaaaggacctga
tatatattgatcttgttttttctagtctctgggataattgaggcttctga
atgaggaggtgatgtgcataagttaatagctgaagcgttccttgtgtcct
ctactgaaataaactctggcctttagttattcagagaggaggagggggga
gcctgtctccctctagacacagccatagcagttactgagtttaacttgaa
gccacttccaatgccctgtatacaagctgagcactgcccctccggggtcc
ggagagggcagcagccacctttgctgtctgcctggtcatatgtgaagcac
ctgcacaggggcaggttccccgcaaggtcagagcatggagctggaggtgc
agtggcctctctccctccacctgctttctgctgagaacaggcacttcata
gccgttcggcttctgggctctgtccacagggatgctgcctgaccccaaga
acgtgcacatttttgtgagctggatgattgcccagagtgtgacggcagtc
gcagggctggtgtcctacccctttgacactgttcgtcgtagaatgatgat
gcagtccggccggaaagggggtaagcttgtgctctactcatctaaacttg
tttggttttgcccgaggagaacattttacagggctcctttcagtcttcct
tactggaaattaattttcaaaattatttgataaggacttagggaagaaag
atggtattaattccccctaacgttctcaactatcctattagggaaaagta
ttttccattttattagagatgataagaacatgaatagtaagacatttaga
tgtgaatttaactaggtatccagcattatagagaccctaggccctcttcc
cttagagcctgggtgcaaaagctagggaaaagaagtagttagctacttct
tacaaagaactcttgcttccctcctagttacaggtgttagtgggatgggg
tgtttagctgggtagagatggcctgaagcaatctgttgtgccagagaaag
ttttggcttctataggttgaaccatatgaaattgccactttaaaagtcaa
aaacagtccaatgttagcagtttcgtatgtttcaacgaatagttacagcc
ttttatttagactgcataacctcgtgcaggatcatctgaggctcagcctc
agttcggtcctccataaaaaaaggtaaccgcgtagcataatactcctgct
ccactgcgcccttcttgtttcgcagttgggcagtccatgaattacttggt
taattgccccagttcttcactgaccttgaactaatggagtaggaatgaca
ggagacccagcctgccagtgaagcaaggaaggagatgtccagtgggatgt
tgcatggagctgggactccatgcccagatgaccctgattttataaaactg
gtaacagtgtgtacagatatgtttcaggggaaaagtctctttcctccagc
gttacggagccctcaccagcatttgtttccacagccgatattatgtacac
ggggacagttgactgctggaggaagattgcaaaagacgaaggagccaagg
ccttcttcaaaggtgcctggtccaatgtgctgagaggcatgggcggtgct
tttgtattggtgttgtatgatgagatcaaaaaatatgtctaatgtaatta
aaacacaagttcacagatttacatgaacttgatctacaagttcacagatc
cattgtgtggtttaatagactattcctaggggaagtaaaaagatctggga
taaaaccagactgaaggaatacctcagaagagatgcttcattgagtgttc
attaaaccacacatgtattttgtatttattttacatttaaattcccacag
caaatagaaaataatttatcatacttgtacaattaactgaagaattgata
ataactgaatgtgaaacatcaataaagaccacttaatgcacgctttctat
tttattgaactcttattaactgtaaaatgcatttttaaaagatcaaaaat
gcatattttctagcatgattcatgtatcagtcagcagccaagcttctaaa
tgccagatattatattgagaatgtattatatgagaacgtacaatgcttaa
agttccggttttcaaacttaggcaggtcatattctatctatcttatccag
cgttactgtaggctagaaagtgataatggctttcataatcctgccttgtc
ttaggcactttcctgcag
71
Surrogate Filters
Scenario III: Case of the Mortal Mitochondrion
Strategy
  • Assume that encoded protein is in mitochondria
  • Protein has function associated with
    mitochondrial location?
  • Use Gene finder to identify protein sequence(s)
  • Use Similarity finder to identify possible
    function
  • Protein has structure associated with
    mitochondrial location?
  • Use Feature finders to identify pertinent
    regions
  • (What ARE pertinent regions?)

72
Surrogate Filters
Scenario III: Case of the Mortal Mitochondrion
Run 4q34 region through FGene
Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat
gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg
fgene Wed Feb 27 16:55:29 GMT 2002
>PEO-related_gene?
length of sequence - 5768
number of predicted exons - 5
positions of predicted exons:
  Exon           w       ORF
  1607 - 1717    17.84   1607 - 1717
  2985 - 3231     9.13   2985 - 3230
  3421 - 3471     6.08   3423 - 3470
  3980 - 4120    12.62   3982 - 4119
  5035 - 5192     1.93   5037 - 5192
Length of Coding region - 708bp
Amino acid sequence - 235aa
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR
IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG
IIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMMQSG
RKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV
73
Surrogate Filters
Scenario III: Case of the Mortal Mitochondrion
Run 4q34 region through FGeneSH
Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat
gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg
Fgenesh Wed Feb 27 16:59:14 GMT 2002
FGENESH 1.0 Prediction of potential genes in Human genomic DNA
Time: Wed Feb 27 16:59:14 2002
Seq name: PEO-related_gene?
Length of sequence: 5768  GC content: 48  Zone: 2
Positions of predicted genes and exons:
  G  Str  Feature  Start - End     Score   ORF            Len
  1       TSS      1216            -2.70
  1   1   CDSf     1607 - 1717     18.01   1607 - 1717    111
  1   2   CDSi     2985 - 3471     52.41   2985 - 3470    486
  1   3   CDSi     3980 - 4120     20.99   3982 - 4119    138
  1   4   CDSl     5035 - 5192      2.32   5037 - 5192    156
  1       PolA     5471             0.92
Predicted protein(s):
>FGENESH 1 4 exon (s) 1607 - 5192 298 aa, chain
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR
IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG
GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS
VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM
QSGRKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV
FGENE output (for comparison):
  1607 - 1717    w 17.84
  2985 - 3231    w  9.13
  3421 - 3471    w  6.08
  3980 - 4120    w 12.62
  5035 - 5192    w  1.93
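
The two predictions can be checked against each other with simple arithmetic on the exon coordinates reported above. The sketch below is only a bookkeeping aid (the function name is ours); it assumes inclusive coordinates and that the final codon of the coding region is a stop codon.

```python
def coding_length(exons):
    """Total coding length in bp for inclusive (start, end) exon coordinates."""
    return sum(end - start + 1 for start, end in exons)


# Exon coordinates as reported by the two programs.
FGENE = [(1607, 1717), (2985, 3231), (3421, 3471), (3980, 4120), (5035, 5192)]
FGENESH = [(1607, 1717), (2985, 3471), (3980, 4120), (5035, 5192)]

for name, exons in (("FGENE", FGENE), ("FGENESH", FGENESH)):
    bp = coding_length(exons)
    # One codon is the stop codon, so the protein is one residue shorter.
    print(name, bp, "bp,", bp // 3 - 1, "aa")
# FGENE 708 bp, 235 aa
# FGENESH 897 bp, 298 aa
```

The whole difference comes from positions 3232 to 3420: FGENE treats these 189 bases as an intron, while FGENESH reads straight through them as part of one longer exon, adding 63 amino acids.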
74
How to decide where exons are?
Strategy
  • Compare sequence of 4q34 region to sequence of
    mRNA
  • Sequence of mRNA may be in cDNA library
  • Expressed Sequence Tag (EST) library

Problems
  • Library may not exist
  • Expression of gene may be low

75
Surrogate Filters
Scenario III: Case of the Mortal Mitochondrion
Run 4q34 region through BlastN (x human ESTs)
MORAL: Trust, but verify.
76
Surrogate Filters
Scenario III: Case of the Mortal Mitochondrion
Strategy
  • Assume that encoded protein is in mitochondria
  • Protein has function associated with
    mitochondrial location?

?
  • Use Gene finder to identify protein sequence(s)
  • Use Similarity finder to identify possible
    function
  • Protein has structure associated with
    mitochondrial location?
  • Use Feature finders to identify pertinent
    structures
  • (What ARE pertinent structures?)

77
Surrogate Filters
Scenario III: Case of the Mortal Mitochondrion
Run 4q34 region through BlastP
78
Surrogate Filters
Scenario III: Case of the Mortal Mitochondrion
Run 4q34 region through BlastP
  • Summary
  • One protein in region
  • Contains mitochondrial carrier motifs
  • Similar to ATP/ADP transporter
  • Mitochondrial signal sequence?

Reasonable candidate for PEO-related protein
79
Complex gene discovery
Your turn: Repeat and extend characterization of
PEO-related gene
1. Take same sequence (FastA format) e-mailed to you
2. Get better estimate of promoter and polyA site
(e.g. by TSSW and PolyASH)
(Is there a TATA box upstream from the predicted promoter?)
3. Find encoded protein sequence by suitable method
(e.g. FGeneSH(GC) or comparison with cDNA)
4. Continue characterization of protein
Contains signal sequence?
Contains transmembrane domains?
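
One simple way to start on the transmembrane question is a hydropathy profile. The sketch below averages Kyte-Doolittle hydropathy values over a sliding window and flags strongly hydrophobic windows. It is a rough first pass, not one of the programs named in the exercise; the 19-residue window and the 1.6 cutoff are commonly quoted defaults that we adopt here as assumptions, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window of the given length."""
    values = [KD[aa] for aa in protein.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """1-based start positions of windows at or above the hydropathy threshold."""
    return [i + 1 for i, h in enumerate(hydropathy_profile(protein, window))
            if h >= threshold]


# First 60 residues of the FGENESH-predicted protein from the earlier slide.
PREFIX = "MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR"
print(candidate_tm_windows(PREFIX))
```

Any windows flagged this way are only candidates; a dedicated predictor would be needed to confirm membrane-spanning segments or a signal sequence.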

80
(No Transcript)
81
Filter limitation: Inevitable, but whose filter?
82
Filters controlled by outside programmers
83
Filters controlled by you
84
(No Transcript)
85
(No Transcript)