Title: Identifying Novel Proteins
1Identifying Novel Proteins
So you've found a gene you're interested in, you've BLASTed it against the biggest protein database you can find, and you have no real clues as to what its function might be. What do you do next? Apart from making sure you really have a gene on your hands, there are two ways forward:
1. If there are believable BLASTx matches, but they are all predicted genes with no functional annotation, it might still be possible to use them as stepping stones to other, more informative matches which would not show up as similar to the original sequence. Think of this as traversing the phylogenetic tree.
2. Accumulate as much partial data about the sequence as you can, in the hope that it sheds light on the function. This will include functional protein domains, expression data, genomic alignment and secondary structure. It's unlikely that you will become casually involved with higher order structures, as solving or comparing these is a complex and specialised task.
2Orphan Genes
Your lack of knowledge about protein function, having compared your sequence with all known proteins in the database, will manifest itself in two rather different ways.
1. There are good BLASTx matches with phylogenetically close organisms, but all the reasonably close hits are things like "Theoretical ..", "Predicted", "Riken .." or "ORF285, chromosome 9". We find plenty of evidence for orthologous genes, but these names are just different ways of saying "we know nothing about their function either".
2. There are no close BLASTx matches. This may be a sign that the protein only exists in your organism or its close relatives. These are known as orphan genes, and the phenomenon is quite well documented (see reference). Obviously these are going to be quite tough to work on, as nothing like them has been seen before.
Special case: there are good BLASTx matches with phylogenetically DISTANT organisms. Check for contamination!
Reference: Domazet-Loso T, Tautz D. An Evolutionary Analysis of Orphan Genes in Drosophila. Genome Res. 2003 Oct;13(10):2213-2219.
3Phylogenetic Stepping Stones
Consider a gene which has the same function across many phyla, and suppose we draw a phylogenetic tree based on sequence similarity. It's possible that the sequence of the gene in your species is sufficiently similar to its orthologs in species B and C that these will show up in a BLAST search, but that the orthologs in species D and E will not. But the sequence of the gene in species C is more similar to those in D and E. So once you get to C and BLAST from there, you might get to E, which happens to have been researched and whose function is known. This could be done manually, but it has been formalised in PSI BLAST, which uses iterative rounds of BLAST searching to build a more generalised model of the gene sequence, and uses this evolving model to gradually traverse the tree. If it is not used carefully, though, it can go horribly wrong.
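The manual version of the stepping-stone idea can be sketched as a breadth-first walk over "which genes show up in whose BLAST results". The hit lists below are invented to mirror the species A to E example (with A standing in for your species), and the function and variable names are illustrative only; in practice each list would come from a real BLAST run.

```python
from collections import deque

# Hypothetical BLAST hit lists: each key "finds" the listed orthologs.
# These are made-up values mirroring the species A-E example in the text.
HITS = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D", "E"],
    "D": ["C", "E"],
    "E": ["C", "D"],
}
ANNOTATED = {"E"}  # assumption: only the species E gene has a known function


def stepping_stones(start, hits, annotated):
    """Return the shortest chain of searches from start to an annotated gene."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] in annotated:
            return path
        for neighbour in hits.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # no annotated gene is reachable


path = stepping_stones("A", HITS, ANNOTATED)
print(" -> ".join(path))
```

Each extra step in the chain is another chance to drift away from a true ortholog, so a long path deserves more suspicion than a short one.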
4PSI BLAST
(Position-Specific Iterated, since you asked)
Initial Query:
    SREFTHYQWERLIKKTYFARFHNCMLISFSWER
Matches from database:
    SREKLSYQAERLIIWERFARFHICMLIPQSWER
    SREKDSYQUERLIPWTYFARFHLCMLIPKSWER
New Composite Query:
    SREFTHYQWERLIKKTYFARFHNCMLISFSWER
       KLS  A    IWER     I    PQ
        D   U    P TY     L     K
2nd Round Matches from database:
    SREKLSYQAERLIIWERFARFHICMLIPQSWER
    SREKDSYQUERLIPWTYFARFHLCMLIPKSWER
    TUEKDSYPASAASPWERQREAFLHKLAPQSIEY
And so on
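As a rough sketch of how the composite query is assembled, the code below collects, column by column, the residues seen in the query and in the first-round matches from the slide. It assumes the sequences are already aligned without gaps, and it only lists alternatives; the real PSI BLAST builds a weighted position-specific scoring matrix, which this does not attempt. The function name is illustrative.

```python
# Sequences copied from the slide above (all 33 residues, ungapped).
QUERY = "SREFTHYQWERLIKKTYFARFHNCMLISFSWER"
ROUND1_MATCHES = [
    "SREKLSYQAERLIIWERFARFHICMLIPQSWER",
    "SREKDSYQUERLIPWTYFARFHLCMLIPKSWER",
]


def composite(query, matches):
    """For each column, list the query residue followed by any new alternatives."""
    columns = []
    for i, residue in enumerate(query):
        seen = [residue]
        for match in matches:
            if match[i] not in seen:
                seen.append(match[i])
        columns.append(seen)
    return columns


profile = composite(QUERY, ROUND1_MATCHES)
# Print only the columns where the matches introduced an alternative residue.
for position, residues in enumerate(profile, start=1):
    if len(residues) > 1:
        print(position, "/".join(residues))
```

The second-round search would then accept any of the listed residues at each position, which is how sequences too distant from the original query start to appear.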
5PSI BLAST
Round 1 results
6PSI BLAST
Round 2 results
7PSI BLAST
Round 3 results
8PSI BLAST
Round 4 results
9Finally some function!
10Functional Domain Analysis
Proteins are considered to have functional domains within them: specific regions of the protein which have specific tasks. These domains are recognisably conserved between different proteins, even though the overall similarity of the proteins may be quite low.
[Figure: typical diagram of functional domains on a protein]
11Functional Domain
If you can find functional domains, you may know something about the general behaviour of your protein, even if you don't know exactly what its function is. But, as usual, be aware that non-significant matches are quite likely to be displayed by any analysis website, so at least look for a confidence score or some other measure of significance, and treat everything with a degree of caution.
The main specialised sites for this type of analysis are SMART and Pfam, which have considerable overlapping functionality, and InterProScan, which attempts to integrate all the available tools.
The search methods are rather different from BLAST, and rely primarily on building up a model of the functional domain from known examples. The model is then a generalised pattern for a given domain, and your unknown sequences are searched against the models using rather more advanced methods, typically involving Hidden Markov models.
12Functional Domains and Hidden Markov Models
Once a functional domain has been identified in a number of sequences, we can build a model of it, by which we just mean a summation of our understanding of the linear sequence variants.

position  1   2   3   4   5   6   7   8   9   10
          Y   S   C   M   V   G   H   E   A   L
          F   S   C   V   V   G   H   E   A   L
          Y   T   C   K   V   D   H   E   T   L
          F   T   C   Q   V   T   H   E   G   D
          Y   S   C   R   V   K   H   V   T   L
          Y   T   C   V   V   G   H   E   A   L
model     YF  ST  C   ?   V   ?   H   E   ?   L
score     5   5   10  -   10  -   10  8   -   8

The scores may be arbitrary, but they constitute (in a much simplified form) the Hidden Markov Model by which we evaluate other proteins to see if they contain this domain. As you accumulate more examples the model gets more refined, and hopefully more accurate. The higher the score of your test protein sequence against the model, the more likely it is presumed to contain the domain. The model will also allow for the possibility of (expensive) gaps if the spacing of your real sequence doesn't fit the model. Known variable regions can be modelled as cheaper gaps.
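To make the scoring idea concrete, here is a minimal sketch that slides the toy model from the slide along a test sequence and adds up the per-position scores. It is position-by-position scoring only: there are no gaps and no probabilities, so it is a long way short of a real Hidden Markov Model. The function names and the test sequence are invented for illustration.

```python
# Toy model and scores taken from the slide; "?" marks a variable column.
MODEL = ["YF", "ST", "C", "?", "V", "?", "H", "E", "?", "L"]
SCORES = [5, 5, 10, 0, 10, 0, 10, 8, 0, 8]


def score_window(window):
    """Sum the scores of the positions where the residue fits the model."""
    return sum(
        score
        for residue, allowed, score in zip(window, MODEL, SCORES)
        if allowed != "?" and residue in allowed
    )


def best_hit(sequence):
    """Slide the model along the sequence; return (score, 0-based offset)."""
    width = len(MODEL)
    hits = [
        (score_window(sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits, key=lambda hit: hit[0]) if hits else None


# Hypothetical test protein with one of the example domains embedded in it.
print(best_hit("MKKYTCVVGHEALPPP"))
```

The maximum possible score with these numbers is 56, so a window's score is only meaningful relative to that ceiling and to what random sequences achieve.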
13Problems with Models by Example
There are two conceptual problems with building models from examples. Firstly, the behaviour of the protein domain is likely to be related to the three dimensional shape of the molecule and the nature of its interactions with other molecules, and as we are not taking these into account at all, we cannot expect our model to be very realistic. Secondly, the model is (by its nature) highly biased towards the examples already found, and further examples found with the help of the model will tend to reinforce any initial bias. So our model may tend to grow away from the actual consensus across all possible proteins, and lock us out of whole subsets of data.
Incidentally, this problem of bias is very similar to what can happen with PSI BLAST: if the proteins you choose to include in your growing model diverge too much from your original sequence, they can quickly take you off into strange territory.
15Using SMART
16Exercise 1 Using Pfam and SMART
Online Scratch Pad: For the following exercises, you may find a scratch pad useful for keeping information from previous stages of a search. If you open up the file scratch-pad.html you'll find you can keep text data in the outlined box. You cannot save the data, and it'll vanish if you close the window or refresh it!
Go to the example-sequences.html file and the Protein Domain Searches section, and copy the sequence for gtigf4D. Then go to the SMART web site, paste your sequence, tick at least the signal peptides box, and run the search.
While that's running, go to the Pfam site (in a new browser window) and search the same sequence there. Compare the two result sets. Is there any difference? Should we expect any?
Now go to the NCBI BLAST page and do a protein-protein BLASTp search; this may be a useful way of getting to the same data. What could you have learned about the function of this gene?
If you are ahead of the rest of the group, check out the results for the much longer gttitin sequence.
17Using SMART
18Exercise 2 Random Sequences Again
We recall that random DNA sequences gave us alignments against real proteins when using BLASTx, and that E-values gave us a good idea whether alignments are biologically meaningful or not. This becomes even more important when searching for subtler matches: generally shorter sequences with considerable variation allowed at most positions.
Go to the file random-protein-sequences.html and copy the sequence assigned to you. Go to whichever of the Pfam or SMART web sites you preferred, and run the search on your sequence. Did you find any domain hits? Were they significant? Was it possible to tell? Look at the actual alignments, if you can find out how to, and also see if you can find the model that the domain is based on. Repeat with a second sequence if you have time.
19Functional Motifs in Proteins
We may be more familiar with functional motifs in DNA sequences, e.g. transcription factor binding sites. Here for example is the (Xenopus) TBox motif: TCGACGACCGT
But short motifs are also present in protein sequences, e.g. FHA domain interaction motif 1: T..ILA
The Forkhead-associated (FHA) domain binds phosphothreonine or phosphoserine containing peptides.
The general problem with motifs is the number of false positives, as they are ge
The ELM server (http://elm.eu.org/): ELM is a resource for predicting functional sites in eukaryotic proteins. Putative functional sites are identified by patterns (regular expressions). To improve the predictive power, context-based rules and logical filters are applied to reduce the number of false positives.
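The false-positive problem with pattern-based motifs can be seen with a small regular expression scan. The pattern below (T, any two residues, then I, L or A) is only an assumed reading of the FHA motif shown above, not ELM's actual definition, and the function names are made up. The estimate assumes all 20 amino acids are equally likely at every position.

```python
import random
import re

# Illustrative pattern only: T, any two residues, then I, L or A.
# This is an assumed reading of the motif above, not ELM's own definition.
# The lookahead lets overlapping matches be counted as well.
MOTIF = re.compile(r"(?=(T..[ILA]))")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def find_motifs(sequence):
    """Return (1-based start, matched text) for every match, overlaps included."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(sequence)]


def expected_random_hits(length, motif_len=4, p_match=(1 / 20) * (3 / 20)):
    """Expected number of matches in a uniform random sequence of this length."""
    return max(length - motif_len + 1, 0) * p_match


random.seed(0)  # fixed seed so the run is repeatable
random_protein = "".join(random.choice(AMINO_ACIDS) for _ in range(1000))
print(len(find_motifs(random_protein)), "hits in 1000 random residues")
print("expected about", round(expected_random_hits(1000), 1))
```

A pattern that fixes only two of its four positions is expected to match several times per thousand random residues, which is why context rules and filters are needed before a hit means anything.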
20Functional Motifs Reported by ELM in a Random
Amino Acid Sequence
21Secondary Structure Analysis
The weak neighbour-neighbour interactions between amino acids in a protein molecule give rise to a small number of basic structural arrangements. The two main forms are linear helical structures (alpha-helices) and sheets of strands lying side by side (beta-sheets); in both, hydrogen bonds within the molecule stabilise the structure. We may consider that the larger scale structure of the whole protein is built from these smaller scale structures, and as such they may give us some insight into the role of the protein even in the absence of much functional data.
The 3-dimensional protein structures that you see pictures of are often composed of alpha-helices and beta-sheets linked by less well structured sections of the protein.
http://www.chemsoc.org/exemplarchem/entries/2004/durham_mcdowall/prot-3.html
There are a large number of web pages devoted to analysing proteins for secondary structure, and even some which attempt to aggregate the results of several different methods (at PBIL).
22Is it Really a Gene?
If you are really getting nowhere with your functional analysis, it may be worth checking whether you have got a gene at all. There are several circumstances in which this might arise.
If you are using a physical reagent like a cDNA clone, it's possible that it contains an incomplete mRNA sequence, and you are just looking at a plausible but unreal ORF in the 3' UTR. Or it could contain an unspliced immature transcript. Or it could even be contamination from some other, very different species, e.g. bacteria. You may learn a lot by aligning your sequence with the organism's genome, to check that it's there and that it appears to have exons (if you would expect them).
Or if you found the gene by some sort of mapping/positional analysis, and you are analysing sequences from gene models shown on the genome, check that there is real (e.g. EST) evidence for this gene; it may be purely theoretical, and entirely bogus.
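One quick sanity check on a cDNA sequence is to list every ORF on the forward strand and see whether the one you have been analysing is really the longest, or just one of several short candidates. The sketch below assumes the standard stop codons, looks only at the three forward frames, and uses an invented function name and test sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons


def forward_orfs(dna, min_codons=2):
    """Return (start, end, frame) for ATG...stop ORFs, longest first.

    start and end are 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens an ORF in this frame
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ATG after this stop
    return sorted(orfs, key=lambda orf: orf[1] - orf[0], reverse=True)


# Hypothetical toy sequence: ATG AAA TTT TAG embedded in flanking bases.
print(forward_orfs("CCATGAAATTTTAGGG"))
```

An ORF that is short, sits downstream of a longer one, or lacks any protein-level match is a candidate for being an accident of the 3' UTR rather than a real coding region.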
23Genomic Analysis
It is possible that analysing the position of
your gene on the genome can tell you something
about its possible function. Genes sometimes
function in expression cassettes, where
neighbouring genes are either co-expressed, or
under closely related (temporal or spatial)
regulation. So if nearby genes are well
characterised it would be worth considering this
as a possibility. Equally, if there are obvious
orthologs of this gene in other species, check
out the genomic context there too. You should
also be able to find out if your gene is a member
of a gene family, or whether it shares small
regions of coding sequence with other genes. Is
there a way of doing tBLASTn or tBLASTx against
the genome in your preferred browser?
24Expression Data
Genes that are co-expressed may well be involved in the same pathways; the more intricate the pattern of co-expression, the greater the likelihood. You may find genes of known function that yours is associated with. If you found the gene originally in an expression array experiment this may be an easy way in. Alternatively there is a growing amount of expression data out there in databases, although at the moment it's pretty difficult to mine it systematically. Various efforts are underway to facilitate this (FlyMine, ArrayExpress), though it's not clear how effective these are yet. It may also be difficult to track your gene down in the data sets.
If your gene is from an EST or cDNA sequence, see if the ESTs are clustered and check out which libraries they come from. This may tell you whether your gene is expressed in specific stages/tissues, or whether it is more ubiquitous.
25Exercise 3 Genuine Unknowns
- The sequence file identification-example-sequences.html contains 12 gene sequences from Xenopus tropicalis which superficially look hard to identify. The full cDNA sequence is given along with the amino acid sequence translated from the presumed ORF.
- Pick one of the first six sequences, and start to accumulate data about it.
- Check BLASTx: new sequences are arriving on the database all the time
- Consider whether PSI BLAST might be useful
- Check against the genome
- Look for functional protein domains
- Look for secondary structure
- If you find anything that looks useful keep a note of it.
But bear in mind that, in the real world, you may
soon be thinking about going back to the
laboratory for further experimental work!
26Exercise 3 Results
gtu-one (Xt6.1-CAAL21151.3): Dpy30, SCOP domains; PSI 2 rounds -> chloroplast enolase? ADP-ribosylation factor-like
gtu-two (Xt6.1-CABJ8169.5): sipP, RUN, PDZ, PTB domains; PSI 2 rounds -> rap2 interacting protein x
gtu-three (TEgg047e16): clear orphan, no domains, no results with PSI BLAST, Egg/Ova/Gas EST expression
gtu-four (IMAGE7016814): Globin domains, odd organisms, no hit on genome - worm contamination, adult whole body lib.
gtu-five (IMAGE5384335): signal peptide, seven transmembrane regions (!)
gtu-six (TEgg044i21): signal peptide, coiled coils domain - PSI 2 rounds -> yeast-tht1
gtu-seven (Xt6.1-CAAO3979.3): coiled coils domain - PSI 2 rounds, meaningless name -> myosin (?)
gtu-eight (TEgg001m03): single exon ORF, 5 RRM_1 domains, 5th(!) mouse hit, Rbm12
gtu-nine (CABE11813): long protein, no domains, no more additions after 2 rounds of PSI BLAST, all_predicted
gtu-ten (TGas024h08): long protein, no domains, sort-of-name, PSI 2 rounds -> chloroplast RNA processing 1 1e-05...