Title: Bioinformatics Workshop 2 Recap
1Bioinformatics Workshop 2Recap Warm-Up Exercise
Determine whether there is an available Xenopus
clone (laevis or tropicalis) for Claudin-2.
Start by retrieving the amino acid sequence for
human claudin-2 from Entrez Gene (see list of
useful websites). Then, by appropriate use of the
various BLAST flavours, search parameters and
notions of orthology, see if you can get to an
answer. Use scratch-pad.html (first item in the
list of useful websites) to keep notes, accession
numbers, sequences, etc. as you go along.
2Answer to Recap Warm-Up Exercise
1. Get the FASTA protein sequence:
>gi|9966781|ref|NP_065117.1| claudin 2 [Homo sapiens]
MASLGLQLVGYILGLLGLLGTLVAMLLPSWKTSSYVGASIVTAVGFSKGL
WMECATHSTGITQCDIYSTLLGLPADIQAAQAMMVTSSAISSLACIISV
VGMRCTVFCQESRAKDRVAVAGGVFFILGGLLGFIPVAWNL
HGILRDFYSPLVPDSMKFEIGEALYLGIISSLFSLIAGIILCFSCSSQRN
RSNYYDAYQAQPLATRSSPRPGQPPKVKSEFNSYSLTGYV
2a. tBLASTn against the est_others database, Xenopus laevis.
2b. tBLASTn against the est_others database, Xenopus tropicalis.
(This gives us the ESTs in each species which best match our human protein.)
3. Get the top EST sequence for each species, and search each in turn
against the human proteins: BLASTx against nr, Homo sapiens.
(This is a check for orthologs.)
The best laevis EST (CF520733.1) gave these top 2 human hits (score, E-value):
gi|6912314|ref|NP_036262.1| claudin 14 [Homo sapiens] >gi|215...   274   1e-73
gi|9966781|ref|NP_065117.1| claudin 2 [Homo sapiens] >gi|1568...   197   2e-50
So this EST was probably Xl-claudin 14.
The best tropicalis EST (DT398005.1) gave these top 2 human hits (score, E-value):
gi|4502875|ref|NP_001297.1| claudin 3 [Homo sapiens] >gi|1635...   340   3e-93
gi|4502877|ref|NP_001296.1| claudin 4 [Homo sapiens] >gi|1265...   317   2e-86
So this EST was probably Xt-claudin 3.
In neither case is claudin 2 the best human hit, so these searches do
not identify a Xenopus claudin-2 EST.
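The ortholog check in step 3 can be sketched in a few lines of Python. This is only an illustration of the reasoning, assuming the hits have already been copied by hand from the BLASTx report; the function names and the (accession, score) tuple layout are made up for this example and are not part of any BLAST tool.

```python
from typing import List, Tuple

# One hit from the BLASTx report: (human protein accession, score).
Hit = Tuple[str, float]


def best_hit(hits: List[Hit]) -> str:
    """Return the accession of the hit with the highest score."""
    return max(hits, key=lambda hit: hit[1])[0]


def is_reciprocal_best_hit(query_accession: str, back_hits: List[Hit]) -> bool:
    """True if searching the EST back against human returns our original protein first."""
    return best_hit(back_hits) == query_accession


HUMAN_CLAUDIN_2 = "NP_065117.1"

# Top 2 human hits for each EST, as listed on the slide.
laevis_back_hits = [("NP_036262.1", 274.0), ("NP_065117.1", 197.0)]      # CF520733.1
tropicalis_back_hits = [("NP_001297.1", 340.0), ("NP_001296.1", 317.0)]  # DT398005.1

for est, hits in (("CF520733.1", laevis_back_hits), ("DT398005.1", tropicalis_back_hits)):
    verdict = "ortholog candidate" if is_reciprocal_best_hit(HUMAN_CLAUDIN_2, hits) else "not claudin 2"
    print(est, "best human hit:", best_hit(hits), "->", verdict)
```

Both ESTs fail the check, which matches the conclusion on the slide: each is more likely to be a different member of the claudin family.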
3BLAST Parameters Exercises
5. E-Value maximum for reporting
Open the file example-sequences.html. Copy the sequence
>sumo-binding-motif and go to the NCBI BLAST Home
Page. Go to the PROTEIN BLAST section, BLASTp,
and paste the sequence. Run the search with the
default values. Now re-run the search three more
times, setting the maximum E-value in the box to
-> 100
-> 1000
-> 10000
What difference does this make? Have you found related
proteins in your results?
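To see why raising the maximum E-value lets more (and weaker) hits through, it helps to recall the standard relation between a normalised bit score S' and the E-value, E = m * n * 2^(-S'), where m is the query length and n the database length. The sketch below uses that relation; the query and database sizes are invented round numbers for illustration, not the values BLAST would use for the workshop sequence.

```python
import math


def expected_chance_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E-value for a given bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def min_bit_score(e_threshold: float, query_len: int, db_len: int) -> float:
    """Lowest bit score that is still reported at a given E-value threshold."""
    return math.log2(query_len * db_len / e_threshold)


# Hypothetical sizes: a short motif-like query and a large protein database.
QUERY_LEN = 10
DB_LEN = 1_000_000_000

for threshold in (10, 100, 1000, 10000):
    score = min_bit_score(threshold, QUERY_LEN, DB_LEN)
    print(f"max E-value {threshold:>6}: hits reported down to {score:.1f} bits")
```

Each tenfold increase in the threshold lowers the score needed for reporting by about 3.3 bits, so the extra hits are, by construction, the ones most likely to be chance matches.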
4Bioinformatics Workshop 2Identifying Unknown
Genes
- Open a web browser and type in the URL
  informatics.gurdon.cam.ac.uk/online/workshops
- Bookmark this page
- Click on the link to the file useful-websites.html
- Bookmark this page too; it also contains links to
  the example sequence files used in the workshop,
  and the presentations themselves
5Part 1Genome Browsers
Now that most model organisms have had their
genomes sequenced, we can get a lot more
information about how a gene works than by
just doing a BLAST search against the protein
databases. Even if your favourite genome is
still just in scaffolds and not yet assembled
into chromosomes, we can still add a lot of
value. The main task carried out on a genome
before releasing it to the user community is to
annotate it. In practice this means adding gene
models, based on known expressed sequences, both
in the same organism and in other fairly closely
related ones, and possibly also purely predicted
ones based on sequence composition analysis and
features like start and stop codons and splice
sites. Then come known mapping markers, SNPs, etc.
With 3,000,000,000 nucleotides in the
genome sequence (human), this presents a
considerable challenge to display on a web
browser page, which is of course the preferred
option. Most genome browsers (software designed
to display genome-based data in a web browser)
have taken roughly the same approach, which we'll
take a quick look at.
6Gene model
[Diagram labels: gene model, genome, aligned cDNA, aligned ESTs]
7Schematic Genome Browser
[Schematic: Mus musculus, chromosome 12. The genome is shown with
TRACKS for your sequence, genes, ESTs, and conservation (human, fish)]
9How to Use UCSC Browser
10Displaying your own data
You can also use the UCSC browser to display your
own data, not just your BLASTed sequence. Simply
create a text file in one of several specified
formats, e.g.
------------------------------------------------------------
browser position chr1:1,000,000-1,050,000
track name=track1 visibility=1 description="My display data" itemRgb="On" priority=1
chr1 1006500 1008500 1006500 0 + 1006500 1008500 0,0,255
chr1 1011500 1012750 1011500 0 + 1011500 1012750 0,100,150
chr1 1015250 1016500 1015250 0 + 1015250 1016500 0,100,150
chr1 1018000 1021000 1018000 0 + 1018000 1021000 0,170,80
chr1 1024500 1028000 1024500 0 + 1024500 1028000 80,170,0
------------------------------------------------------------
and load it via the Genomes / manage custom tracks
facility. These mechanisms are well documented
on the UCSC site.
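If you have more than a handful of features, it is easier to generate the track file than to type it. The following is a minimal sketch that writes the same five features in nine-column BED form (chrom, start, end, name, score, strand, thickStart, thickEnd, itemRgb). The helper names are invented for this example, and the "+" strand is an assumption; check the UCSC custom track documentation before relying on the details.

```python
from typing import List, Tuple

# (chromosome, start, end, "r,g,b" colour) for each feature, as on the slide.
Feature = Tuple[str, int, int, str]

FEATURES: List[Feature] = [
    ("chr1", 1006500, 1008500, "0,0,255"),
    ("chr1", 1011500, 1012750, "0,100,150"),
    ("chr1", 1015250, 1016500, "0,100,150"),
    ("chr1", 1018000, 1021000, "0,170,80"),
    ("chr1", 1024500, 1028000, "80,170,0"),
]


def bed9_line(chrom: str, start: int, end: int, rgb: str, strand: str = "+") -> str:
    """Format one feature; the name is the start coordinate and the thick region is the whole feature."""
    fields = [chrom, str(start), str(end), str(start), "0", strand, str(start), str(end), rgb]
    return "\t".join(fields)


def build_custom_track(features: List[Feature]) -> str:
    """Return the full text of a custom track file: two header lines, then one line per feature."""
    lines = [
        "browser position chr1:1,000,000-1,050,000",
        'track name=track1 visibility=1 description="My display data" itemRgb="On" priority=1',
    ]
    lines.extend(bed9_line(*feature) for feature in features)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_custom_track(FEATURES), end="")
```

Redirect the output to a text file and upload that file through the custom tracks page.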
11Exercises
1. Find the web site for the Santa Cruz Genome
Browser (sometimes called the Golden Path), and
investigate the three genes for which you have
the full length cDNA sequence, or the protein
sequence, in the file example-sequences.html.
>TNeu084i05 (Xenopus)
How many exons does the gene appear to have? Has
it been mapped already? Are there any likely
upstream regulatory elements (look for
conservation across species)? Are there other
genes nearby?
>TGas122d03 (Xenopus)
Is this a relatively unique gene, or a member of
a gene family? What can we learn from the
comparison with human genes? Are there any
differences between the gene model predicted from
your cDNA, and the existing predictions?
>hsp70-5 (human)
Starts with the protein sequence. How might this
be better?
12Exercise 1. Results >TNeu084i05
13Exercises
2. Now go to the two other main genome browsers,
Ensembl and NCBI. Find the Xenopus genome (at
the moment you won't find it at NCBI, so use the
mouse genome instead), and see if you get the
same sort of functionality from them. Use the
same two sequences. Are there different
features? Are they easier or harder to use?
14Part 2 Identifying Novel Proteins
sequence to analyse
what is its function?
FUNCTIONAL ANNOTATION
Gravin-like
15Different Possible Outcomes
Suppose you have a cDNA sequence and you run BLASTx.
1 - genes of identifiably same function in several different species (KNOWN)
    4e-014 - polyunsaturated fatty acid elongase [Xenopus laevis]
    7e-140 - fatty acid elongase 2 [Rattus norvegicus]
    1e-140 - ELOVL6 protein [Homo sapiens]
2 - genes of unknown function in several different species (NOVEL)
    2e-103 - unnamed protein product [Tetraodon nigroviridis]
    3e-115 - 2310009N05Rik protein [Mus musculus]
    5e-117 - hypothetical protein FLJ22378 [Homo sapiens]
3 - genes with no significant BLASTx hits in other species (ORPHAN)
    7.3 - 1-deoxy-D-xylulose 5-phosphate synthase [Chlamydophila abortus]
    4.7 - PREDICTED similar to tweety 2 isoform 1 [Bos taurus]
4 - significant BLASTx hits in phylogenetically distant species (OUCH..!)
    2e-200 - coat maintenance protein [Escherichia coli]
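The four outcomes can be expressed as a small decision rule. The sketch below is only one way to encode the reasoning on this slide: the E-value cutoff, the list of uninformative description phrases and the hand-set "distant" flag are all assumptions chosen for illustration, not established thresholds.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    evalue: float
    description: str
    distant: bool  # set by hand: is the species phylogenetically distant from yours?


SIGNIFICANT_E = 1e-5  # assumed cutoff for a believable hit
UNINFORMATIVE = ("unnamed protein", "hypothetical protein", "rik protein")  # assumed phrases


def classify(hits: List[Hit]) -> str:
    """Assign one of the four slide categories to a set of BLASTx hits."""
    significant = [hit for hit in hits if hit.evalue <= SIGNIFICANT_E]
    if not significant:
        return "ORPHAN"
    if all(hit.distant for hit in significant):
        return "OUCH"  # only distant species match: check for contamination
    if all(any(phrase in hit.description.lower() for phrase in UNINFORMATIVE) for hit in significant):
        return "NOVEL"
    return "KNOWN"


known = [Hit(4e-14, "polyunsaturated fatty acid elongase", False),
         Hit(7e-140, "fatty acid elongase 2", False),
         Hit(1e-140, "ELOVL6 protein", False)]
novel = [Hit(2e-103, "unnamed protein product", False),
         Hit(3e-115, "2310009N05Rik protein", False),
         Hit(5e-117, "hypothetical protein FLJ22378", False)]
orphan = [Hit(7.3, "1-deoxy-D-xylulose 5-phosphate synthase", True),
          Hit(4.7, "PREDICTED similar to tweety 2 isoform 1", False)]
ouch = [Hit(2e-200, "coat maintenance protein", True)]

for name, hits in (("case 1", known), ("case 2", novel), ("case 3", orphan), ("case 4", ouch)):
    print(name, "->", classify(hits))
```

In practice the judgement about what counts as "distant" or "uninformative" needs a human eye; the code only makes the order of the questions explicit.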
16Different Ways not to Know Anything
Your lack of knowledge about protein function,
having directly compared your sequence with all
known proteins in the database, will manifest
itself in two rather different ways.
1. It looks like a NOVEL gene: we find plenty of
evidence for orthologous genes, but their
annotations are just different ways of saying
"we know nothing about their function either".
2. It looks like an ORPHAN gene: this is a sign
that this protein may only exist in your
organism. The phenomenon is quite well documented
(see reference). Obviously these are going to be
quite tough to work on, as nothing like them has
been seen before.
Special case: there are good BLASTx matches with
phylogenetically DISTANT organisms. Check for
contamination!
Reference: An Evolutionary Analysis of Orphan Genes in
Drosophila. Domazet-Loso T, Tautz D. Genome Res.
2003 Oct;13(10):2213-2219.
17Indirect Functional Identification
So you've found a gene you're interested in,
you've BLASTed it against the biggest protein
database you can find, and you still have no real
clues as to what its function might be. What do
you do next? (First make sure you really have a
gene on your hands.)
1. LOOK FOR MORE DISTANTLY RELATED GENES WITH ANNOTATION
If there are believable BLASTx matches, but they
are all predicted genes with no functional
annotation, it might still be possible to use
them as stepping stones to other, more
informative, BLASTx matches which would not show
up as similar to the original sequence. Think of
this as traversing the phylogenetic tree.
2. FIND PARTIAL OR INDIRECT DATA: DOMAINS, EXPRESSION, ETC.
Accumulate as much partial data about the
sequence as you can, in the hope that it sheds
light on the function. This will include
functional protein domains, expression data,
genomic alignment and secondary structure. It's
unlikely that you will become casually involved
with higher order structures, as solving or
comparing these is a complex and specialised task.
18Phylogenetic Stepping Stones
Consider a gene which has the same function
across many phyla, and suppose we consider a
phylogenetic tree based on sequence similarity.
It's possible that the sequence of the gene in
your species is sufficiently similar to its
orthologs in species B and C that these will show
up in a BLAST search, but not those in species D
or E. But the sequence of the gene in species C
is more similar to those in D or E. So once you
get to C and BLAST from there, you might get to
E, which happens to have been researched and has
a known function. This could be done manually,
but it has been formalised in PSI BLAST, which
uses iterative rounds of BLAST searching to build
a more generalised model of the gene sequence,
and uses this evolving model to gradually
traverse the tree. If not used carefully, though,
it can go horribly wrong.
19PSI BLAST
(Position Specific Iterated, since you asked)
Initial Query
SREFTHYQWERLIKKTYFARFHNCMLISFSWER
Matches from database
SREKTSYQAERLIIWERFARFHICMLIPQSWER
SREKDSYQUERLIPWTYFARFHNCMLIPKSWER
New Composite Query
SREFTHYQWERLIKKTYFARFHNCMLISFSWER
   K S  A    IWER     I    PQ
    D   U    P T            K
2nd Round Matches from database
SREKTSYQAERLIIWERFARFHICMLIPQSWER
SREKDSYQUERLIPWTYFARFHNCMLIPKSWER
PRAKDTRQIQRLSYWTTFLLFVITSLQRKITER
PRAKDTRQIQRLSYWTTFLLFVITSLQRKITER
And so on
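The "composite query" idea can be illustrated by collecting, for each position, every residue seen in the query and its matches. This is a deliberate simplification: real PSI-BLAST builds a weighted position-specific scoring matrix rather than a plain list of alternatives, and the function names here are invented for the example. The sequences are the toy ones from the slide and are assumed to be already aligned without gaps.

```python
from typing import List


def composite_profile(sequences: List[str]) -> List[List[str]]:
    """For each column, list the distinct residues in order of first appearance."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for column in zip(*sequences):
        seen: List[str] = []
        for residue in column:
            if residue not in seen:
                seen.append(residue)
        profile.append(seen)
    return profile


def to_pattern(profile: List[List[str]]) -> str:
    """Write the profile compactly, with alternatives in square brackets."""
    return "".join(col[0] if len(col) == 1 else "[" + "".join(col) + "]" for col in profile)


QUERY = "SREFTHYQWERLIKKTYFARFHNCMLISFSWER"
ROUND_1_MATCHES = [
    "SREKTSYQAERLIIWERFARFHICMLIPQSWER",
    "SREKDSYQUERLIPWTYFARFHNCMLIPKSWER",
]

profile = composite_profile([QUERY] + ROUND_1_MATCHES)
print(to_pattern(profile))
```

The more alternatives each position accepts, the further the next round can reach, which is also why a badly chosen match can drag the search off course.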
20PSI BLAST
Round 1 results
21PSI BLAST
Round 2 results
22PSI BLAST
Round 3 results
23PSI BLAST
Round 4 results
24Finally some function!
25Functional Domain Analysis
Proteins are considered to have functional
domains within them: specific regions of the
protein which have specific tasks. These
domains are recognisably conserved between
different proteins, even though the overall
similarity of the proteins may be quite low.
[Figure: typical diagram of functional domains on a protein]
26Functional Domain
If you can find functional domains, you may know
something about the general behaviour of your
protein, even if you don't know exactly what its
function is. But, as usual, be aware that
non-significant matches are quite likely to be
displayed by any analysis website, so at least
look for some confidence score or other measure
of significance, and treat everything with a
degree of caution. The main specialised sites for
this type of analysis are SMART and Pfam, which
have considerable overlapping functionality.
There is also InterProScan, which attempts to
integrate all the available tools. The search
methods are rather different from BLAST, and rely
primarily on building up a model of the
functional domain from known examples. The model
is then a generalised pattern for a given domain,
and your unknown sequences are searched against
the models using rather more advanced methods,
typically involving Hidden Markov Models.
27Functional Domains and Hidden Markov Models
Once a functional domain has been identified in a
number of sequences, we can build a model of it,
by which we just mean a summation of our
understanding of the linear sequence variants.
Known examples:
1234567890
YSCMVGHEAL
FSCVVGHEAL
YTCKVDHETL
FTCQVTHEGD
YSCRVKHVTL
YTCVVGHEAL
Model built from them:
position  1   2   3   4   5   6   7   8   9   0
model     YF  ST  C   ?   V   ?   H   E   ?   L
score     5   5   10  -   10  -   10  8   -   8
The scores may be arbitrary, but they constitute
the Hidden Markov Model by which we evaluate
other proteins to see if they contain this
domain. As you accumulate more examples the model
gets more refined, and hopefully more accurate.
The higher the score of your test protein
sequence against the model, the more likely it is
presumed to contain the domain. The model will
also allow for the possibility of (expensive)
gaps if the spacing of your real sequence doesn't
fit the model. Known variable regions can be
modelled as cheaper gaps.
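Scoring a test sequence against the ten-position model from this slide can be sketched as follows. This is a plain position-by-position score using the slide's own (arbitrary) numbers; it is much simpler than a real profile Hidden Markov Model, which works with emission and transition probabilities and handles gaps. The variable and function names are invented for the example.

```python
from typing import List, Optional, Tuple

# One entry per position: (allowed residues, score if matched).
# None marks a variable position ("?") that contributes nothing.
MODEL: List[Tuple[Optional[str], int]] = [
    ("YF", 5), ("ST", 5), ("C", 10), (None, 0), ("V", 10),
    (None, 0), ("H", 10), ("E", 8), (None, 0), ("L", 8),
]


def score(sequence: str) -> int:
    """Sum the scores of all positions where the residue is allowed by the model (no gaps)."""
    if len(sequence) != len(MODEL):
        raise ValueError("sequence must be exactly as long as the model")
    total = 0
    for residue, (allowed, points) in zip(sequence, MODEL):
        if allowed is not None and residue in allowed:
            total += points
    return total


for candidate in ("YSCMVGHEAL", "FTCQVTHEGD", "YSCRVKHVTL", "AAAAAAAAAA"):
    print(candidate, score(candidate))
```

Even two of the training examples fall short of the maximum score, because each has one position that disagrees with the majority; that is the sense in which the model is a consensus rather than a list of the examples.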
28Problems with Models by Example
There are two conceptual problems with building
models from examples. Firstly, the likelihood is
that the behaviour of the protein domain is
related to the three dimensional shape of the
molecule, and the nature of its interactions with
other molecules, and as we are not taking these
into account at all, we cannot expect our model
to be very realistic. Secondly, the model is (by
its nature) highly biased towards the examples
already found, and further examples found with
the help of the model will tend to reinforce any
initial bias. So our model may tend to grow away
from the actual consensus across all possible
proteins, and lock us out of whole subsets of
data. Incidentally, this problem of bias is
very similar to what can happen with PSI BLAST:
if the proteins you choose to include in your
growing model diverge too much from your original
sequence, they can quickly take you off into
strange territory.
30Using SMART
31Exercise 1 Using Pfam and SMART
Online Scratch Pad: For the following exercises,
you may find a scratch pad useful for keeping
information from previous stages of a search. If
you open up the file scratch-pad.html you'll
find you can keep text data in the outlined box.
You cannot save the data, and it'll vanish if you
close the window, or refresh it!
Go to the example-sequences.html file and the
Protein Domain Searches section, and copy the
sequence for >igf4D. Then go to the SMART web
site, paste your sequence, tick at least the
signal peptides box, and then run the
search. While that's running, go to the Pfam site
(in a new browser window) and search the same
sequence there. Compare the two result sets. Is
there any difference? Should we expect any? Now
go to the NCBI BLAST page, and do a
protein-protein BLASTp; this may be a useful way
of getting to the same data. What could you have
learned about the function of this gene? If you
are ahead of the rest of the group, check out the
results for the much longer >titin sequence.
32Using SMART
33Exercise 2 Random Sequences Again
We recall that random DNA sequences gave us
alignments against real proteins when using
BLASTx, and that E-values can give us a good idea
whether alignments are biologically meaningful or
not. This becomes even more important when
searching for subtler matches: generally shorter
sequences with considerable variation allowed at
most positions. Go to the file
random-protein-sequences.html and copy the
sequence assigned to you. Go to whichever of the
Pfam or SMART web sites you preferred, and run the
search on your sequence. Did you find any domain
hits? Were they significant? Was it possible to
tell? Look at the actual alignments, if you can
find out how to, and also see if you can find the
model that the domain is based on. Repeat with a
second sequence if you have time.
34Functional Motifs in Proteins
You may be more familiar with functional motifs
in DNA sequences, e.g. transcription factor
binding sites. Here, for example, is the (Xenopus)
TBox motif: TCGACGACCGT
But short motifs are also present in protein
sequences, e.g. FHA domain interaction motif 1:
T..[ILA]
(The Forkhead-associated (FHA) domain binds
phosphothreonine or phosphoserine containing
peptides.) The general problem with motifs is
the number of false positives, as they are
generally pretty short. For the above example we
can easily see that (approx) every 20th amino
acid will be a T, and about 1 in 7 of these will
have I, L or A in the third position following. So
this motif should appear about every 140 amino
acids in a random sequence. This implies a
pretty high rate of (probably) false positives,
and the almost certain need for confirmatory
biology!
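The back-of-envelope estimate is easy to check. The sketch below assumes all 20 amino acids are equally likely and independent, which real proteins are not, so treat the numbers as a rough guide only. It computes the expected spacing of T..[ILA] and then counts matches in a random sequence; the function names and the random seed are arbitrary choices for this example.

```python
import random
import re
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Lookahead so that overlapping occurrences are all counted.
MOTIF = re.compile(r"(?=T..[ILA])")


def motif_probability() -> float:
    """Chance that a given position starts the motif: T (1/20) then I, L or A (3/20) three later."""
    return (1 / 20) * (3 / 20)


def find_motif_positions(sequence: str) -> List[int]:
    """Return the 0-based start positions of every T..[ILA] match."""
    return [match.start() for match in MOTIF.finditer(sequence)]


def random_protein(length: int, seed: int = 0) -> str:
    """Generate a random sequence with uniform amino acid frequencies."""
    rng = random.Random(seed)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


if __name__ == "__main__":
    p = motif_probability()
    print(f"expected: one match about every {1 / p:.0f} residues")
    sequence = random_protein(10_000)
    hits = find_motif_positions(sequence)
    print(f"observed: {len(hits)} matches in {len(sequence)} random residues")
```

With exactly 3 in 20 instead of "about 1 in 7", the expected spacing comes out at roughly 133 residues, in line with the slide's figure of about 140.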
35The ELM Server
Eukaryotic Linear Motif: the ELM server
(http://elm.eu.org/). ELM is a resource for
predicting functional sites in eukaryotic
proteins. Putative functional sites are
identified by patterns (regular expressions). To
improve the predictive power, context-based rules
and logical filters are applied to reduce the
number of false positives. We can judge the
problem of interpreting these searches if we use
a randomly generated sequence and send it to the
ELM server.
36Functional Motifs Reported by ELM in a Random
Amino Acid Sequence
37Secondary Structure Analysis
The weak neighbour-neighbour interactions between
amino acids in a protein molecule give rise to a
small number of basic structural arrangements.
The two main forms are linear helical structures
(alpha-helices) and sheets of parallel chains
(beta-sheets); hydrogen bonds between the backbone
groups stabilise these structures. We may
consider that the larger scale structure of the
whole protein is built from these smaller scale
structures, and as such they may give us some
insight into the role of the protein even in the
absence of much functional data.
The 3-dimensional protein structures that you see
pictures of are often composed of alpha-helices
and beta-sheets linked by less well structured
sections of the protein.
http://www.chemsoc.org/exemplarchem/entries/2004/durham_mcdowall/prot-3.html
There are a large number of web pages devoted to
analysing proteins for secondary structure, and
even some which attempt to aggregate the results
of several different methods (at PBIL).
38Is it Really a Gene?
If you are really getting nowhere with your
functional analysis, it may be worth checking
whether you have got a gene at all. There are
several circumstances in which this might
arise. If you are using a physical reagent like
a cDNA clone, it's possible that it contains an
incomplete mRNA sequence, and you are just
looking at a plausible but unreal ORF in the 3'
UTR. Or it could contain an unspliced immature
transcript. Or it could even be a contaminant
from some other, very different species, e.g.
bacteria. You may learn a lot by aligning your
sequence with the organism's genome, to check
that it is there and that it appears to have
exons (if you would expect them). Or if you
found the gene by some sort of mapping/positional
analysis, and you are analysing sequences from
gene models shown on the genome, check that there
is real (e.g. EST) evidence for this gene. It
may be purely theoretical, and entirely bogus.
39Genomic Analysis
It is possible that analysing the position of
your gene on the genome can tell you something
about its possible function. Genes sometimes
function in expression cassettes, where
neighbouring genes are either co-expressed, or
under closely related (temporal or spatial)
regulation. So if nearby genes are well
characterised it would be worth considering this
as a possibility. Equally, if there are obvious
orthologs of this gene in other species, check
out the genomic context there too. You should
also be able to find out if your gene is a member
of a gene family, or whether it shares small
regions of coding sequence with other genes. Is
there a way of doing tBLASTn or tBLASTx against
the genome in your preferred browser?
40Expression Data
Genes that are co-expressed may well be involved
in the same pathways: the more intricate the
pattern of co-expression, the greater the
likelihood. You may find genes of known function
that yours is associated with. If you found the
gene originally in an expression array experiment
this may be an easy way in. Alternatively, there
is a growing amount of expression data out there
in databases, although at the moment it's pretty
difficult to mine it systematically. Various
efforts are underway to facilitate this (FlyMine,
ArrayExpress), though it's not clear how effective
these are yet. It may also be difficult to track
your gene down in the data sets. If your gene
is from an EST or cDNA sequence, see if the ESTs
are clustered and check out which libraries they
come from. This may tell you whether your gene is
expressed in specific stages/tissues, or whether
it is more ubiquitous.
41Exercise 3 Genuine Unknowns
- The sequence file identification-example-sequences.html
  contains 12 gene sequences from Xenopus
  tropicalis which superficially look hard to
  identify. The full cDNA sequence is given, along
  with the amino acid sequence translated from the
  presumed ORF.
- Start with the first sequence, and accumulate
  data about it, then work your way on down the
  list.
- Consider doing the following searches:
  - Check BLASTx/p: new sequences are arriving on
    the database all the time
  - Consider whether PSI BLAST might be useful
  - Check against the genome
  - Look for functional protein domains
  - Look for secondary structure
- If you find anything that looks useful, keep a
  note of it.
But bear in mind that, in the real world, you may
soon be thinking about going back to the
laboratory for further experimental work!
42Exercise 3 Results
>u-one Xt6.1-CAAL21151.3 Dpy30, SCOP domains, PSI 2 rounds -> chloroplast enolase? ADP-ribosylation factor-like
>u-two Xt6.1-CABJ8169.5 sipP, RUN, PDZ, PTB domains, PSI 2 rounds -> rap2 interacting protein x
>u-three TEgg047e16 clear orphan, no domains, no results with PSI BLAST, Egg/Ova/Gas EST expression
>u-four IMAGE7016814 Globin domains, odd organisms, no hit on genome - worm contamination, adult whole body lib.
>u-five IMAGE5384335 signal peptide, seven transmembrane regions (!)
>u-six TEgg044i21 signal peptide, coiled coils domain - PSI 2 rounds -> yeast-tht1
>u-nine CABE11813 long protein, no domains, no more additions after 2 rounds of PSI BLAST, all_predicted
>u-ten TGas024h08 long protein, no domains, sort-of-name, PSI 2 rounds -> chloroplast RNA processing 1 1e-05...