Title: Homology and Homologs
1Homology and Homologs
Homology just means sequence similarity by virtue
of a common evolutionary ancestor.
>gi|24640218|ref|NP_572350.2| CG3126-PA, isoform A [Drosophila melanogaster]
Length = 1571
Score = 427 bits (1098), Expect = 6e-118
Identities = 223/415 (53%), Positives = 297/415 (71%), Gaps = 19/415 (4%)
Frame = +2

Query 1901  SLVDHNEIMAKLTLKQEGDDGPDVRGGSGDILLVHATETDRKDLVLYFEAFLTTYRTFIT  2080
Sbjct 1151  NMLEEVNITRYLILKKREEDGPEVKGGYIDALIVHASRVQKVADNAFCEAFITTFRTFIQ  1210

Query 2081  PEELIQKLQYRYERF-CHFQDTFKQRVSKNTFFVLVRVVDELCLVEMTDEILKLLMELVF  2257
Sbjct 1211  PIDVIEKLTHRYTYFFCQVQDN-KQKAAKETFALLVRVVNDLTSTDLTSQLLSLLVEFVY  1269

Query 2258  RLVCKGELSLARILRKNILEKV---ENKRMLHHANS-ALKPLAARGVAARPG-------  2401
Sbjct 1270  QLVCSGQLYLAKLLRNKFVEKVTLYKEPKVYGFVGELGGAGSVGGAGIAGSGGCSGTAGG  1329

Query 2402  ----TLHDFHSLEIAEQLTLLDAELFYKIEIPEVLLWAKEQNEEKSPNLTQFTEHFNNMS  2569
Sbjct 1330  GNQPSLLDLKSLEIAEQMTLLDAELFTKIEIPEVLLFAKDQCEEKSPNLNKFTEHFNKMS  1389

Query 2570  YWVRSIIMLQEKAQDRERLLLKFIKIMKHLRKLNNFNSYLAILSALDSAPIRRLEWQKQT  2749
Sbjct 1390  YWARSKILRLQDAKEREKHVNKFIKIMKHLRKMNNYNSYLALLSALDSGPIRRLEWQKGI  1449

Query 2750  SEGLAEYCTLIDSSSSFRAYRAALAEVEPPCIPYLGLILQDLTFVHLGNPDHID-GKVNF  2926
Sbjct 1450  TEEVRSFCALIDSSSSFRAYRQALAETNPPCIPYIGLILQDLTFVHVGNQDYLSKGVINF  1509

Query 2927  SKRWQQFNILDSMRRFQQVHYEIRRNDEIISFFNDFSDHLAEEALWELSLKIKPR  3091
Sbjct 1510  SKRWQQYNIIDNMKRFKKCAYPFRRNERIIRFFDNFKDFMGEEEMWQISEKIKPR  1564
These two sequences, my Xenopus query sequence
and the matching Drosophila sequence, show strong
(though variable) homology, but even if we knew the
function of the Drosophila gene, it might not tell
us much about the function of the Xenopus gene.
2Genes and Evolution - I
Gene duplication through speciation
The two copies of Gene A will now evolve
independently, but will continue to have the same
function
They are ORTHOLOGS
3Genes and Evolution - II
The two copies of Gene A will now evolve
independently, but will probably not continue to
have exactly the same function
Gene duplication through internal genome
duplication
They are PARALOGS
4Homologs, orthologs and paralogs
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
5Mutation and Evolution
Translated part of mRNA sequence
Ancestral sequence
ATGAAGGCTGCCTACGACTGCCGTGCCAGAATGCTGAGG -> MKAAYDCRARMLR
In species A
ATGAAGGCTGCCTATGACTGCCGTGCCAGAATGCTGAGG -> MKAAYDCRARMLR
ATGAATGCTGCCTATGACTGCCGTGCCAGAATGCTGAGG -> MNAAYDCRARMLR
ATGAATGCTGCCTATGACTGCCGTGCCAGAATGCTAAGG -> MNAAYDCRARMLR
ATGAATGCTGCCTATGACTGCCGTG---GAATGCTAAGG -> MNAAYDCR-GMLR
ATGAATGCAGCCTATGACTGCCGTG---GAATGCTAAGG -> MNAAYDCR-GMLR
ATGAATGCAGCCTATGATTGCCGTG---GAATGCTAAGG -> MNAAYDCR-GMLR
ATGAATGCAGCCTATGATTGCCGAG---GAATGCTAAGG -> MNAAYDCR-GMLR
In species B
ATGAAGGCTGCCTACGACTGCCGTGCCATAATGCTGAGG -> MKAAYDCRAIMLR
ATGAAGGCCGCCTACGACTGCCGTGCCATAATGCTGAGG -> MKAAYDCRAIMLR
ATGAAGGCCGCCTACGACTGTCGTGCCATAATGCTGAGG -> MKAAYDCRAIMLR
ATGAAGGCCGCCTACGACTGTCGTGCCATAATGCTGAGA -> MKAAYDCRAIMLR
ATGAAGGCCGCCTACGACTGTCGTGCCATAATCCTGAGA -> MKAAYDCRAIILR
ATGAAGGCCGCATACGACTGTCGTGCCATAATCCTGAGA -> MKAAYDCRAIILR
ATGAATGCAGCCTATGATTGCCGAG---GAATGCTAAGG  MNAAYDCR-GMLR
ATGAAGGCCGCATACGACTGTCGTGCCATAATCCTGAGA  MKAAYDCRAIILR
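The steps in the species A lineage can be checked with a few lines of Python. This is only a minimal sketch: it assumes the standard genetic code and reads each sequence in frame from its first base. The names `translate` and `is_silent` are made up for this illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate coding DNA in frame from the first base (gap characters removed)."""
    dna = dna.replace("-", "").upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def is_silent(before, after):
    """True if the DNA changed but the encoded protein did not."""
    return translate(before) == translate(after)


ancestral = "ATGAAGGCTGCCTACGACTGCCGTGCCAGAATGCTGAGG"

# The species A lineage from the slide, one sequence per mutation step.
lineage_a = [
    "ATGAAGGCTGCCTATGACTGCCGTGCCAGAATGCTGAGG",
    "ATGAATGCTGCCTATGACTGCCGTGCCAGAATGCTGAGG",
    "ATGAATGCTGCCTATGACTGCCGTGCCAGAATGCTAAGG",
    "ATGAATGCTGCCTATGACTGCCGTGGAATGCTAAGG",  # three-base deletion
    "ATGAATGCAGCCTATGACTGCCGTGGAATGCTAAGG",
    "ATGAATGCAGCCTATGATTGCCGTGGAATGCTAAGG",
    "ATGAATGCAGCCTATGATTGCCGAGGAATGCTAAGG",
]

silent_flags = []
previous = ancestral
for current in lineage_a:
    silent_flags.append(is_silent(previous, current))
    kind = "silent" if silent_flags[-1] else "changes protein"
    print(current, translate(current), kind)
    previous = current
```

Most of the steps change the DNA without changing the protein.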
6Searching for Similarity
amino acid comparison
DNA comparison
ATGAATGCAGCCTATGATTGCCGAG---GAATGCTAAGG  MNAAYDCR-GMLR
ATGAAGGCCGCATACGACTGTCGTGCCATAATCCTGAGA  MKAAYDCRAIILR
The DNA sequence can change while the amino acid
sequence stays the same, so always look for
similarities by comparing amino acid
sequences. Evolution changes sequences by
substitution, insertion or deletion, but not
usually by small-scale re-ordering. So we need a
tool that finds the alignment between two
sequences showing the greatest degree of
similarity while introducing as few gaps as
possible.
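To put numbers on the DNA comparison and the amino acid comparison, one can count identical columns in each aligned pair. This is a minimal sketch using the two alignments shown on this slide. The helper name `identity` is made up here, and treating a gap column as non-identical is an assumption.

```python
def identity(row_a, row_b):
    """Return (identical columns, total columns) for two aligned rows.

    A column containing a gap character never counts as identical.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return matches, len(row_a)


dna_a = "ATGAATGCAGCCTATGATTGCCGAG---GAATGCTAAGG"
dna_b = "ATGAAGGCCGCATACGACTGTCGTGCCATAATCCTGAGA"
prot_a = "MNAAYDCR-GMLR"
prot_b = "MKAAYDCRAIILR"

for label, a, b in (("DNA", dna_a, dna_b), ("protein", prot_a, prot_b)):
    same, total = identity(a, b)
    print(f"{label}: {same}/{total} identical ({100 * same / total:.0f}%)")
```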
7The Downside of Gaps
Take two random sequences, with no real
similarity:
GACACTAGGTCGATGCGTGGTGGCGAGA
ACGCATCCGGATGTGCACCGTGGAACTG
And allow cost-free gaps:
GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG
Clearly, although the alignment has no mismatches,
it is not biologically meaningful! The
introduction of gaps into alignments should ideally
reflect biological possibilities, but this is
rather difficult. So the tendency is to make gaps
expensive, so that a gap is introduced only when it
creates more long-range matching than it costs in
unmatched positions, e.g.
TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA
TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA
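The effect of gap cost can be tried out with a small global aligner. The sketch below uses the Needleman-Wunsch dynamic programming method with a linear gap penalty. That choice, and the scores of +1 for a match, -1 for a mismatch and -2 for a gap, are assumptions made for illustration and are not taken from any particular tool. The function names are likewise made up.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


def count_mismatches(row_a, row_b):
    """Columns where two different residues face each other (gaps excluded)."""
    return sum(1 for x, y in zip(row_a, row_b) if "-" not in (x, y) and x != y)


# The two random sequences from the slide, aligned with cost-free gaps.
rand_1 = "GACACTAGGTCGATGCGTGGTGGCGAGA"
rand_2 = "ACGCATCCGGATGTGCACCGTGGAACTG"
free_score, free_1, free_2 = global_align(rand_1, rand_2, gap=0)
print(free_1, free_2, sep="\n")
print("mismatches with free gaps:", count_mismatches(free_1, free_2))

# The two similar sequences from the slide, aligned with expensive gaps.
seq_x = ("TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCG"
         "CCCCAAAATCAAGCTCACCCCGTCCAAGAA")
seq_y = ("TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACT"
         "CCCCCAAAATCAAGCGCACCCCGTCCCAGAA")
cost_score, cost_x, cost_y = global_align(seq_x, seq_y)
print(cost_x, cost_y, sep="\n")
print("score:", cost_score, "gaps:", cost_x.count("-") + cost_y.count("-"))
```

With free gaps the aligner can always avoid a mismatch by opening two gaps instead, which is why the random pair aligns with no mismatches at all. With expensive gaps, a gap appears only where it pays for itself.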
8The Essential Task
Basically, what we are trying to do is to see
whether we can work out the function of an
unknown gene by comparing its sequence with those
of genes in other species where we already know
the function. We can do this because the
sequence of most genes is conserved to some
extent during the evolution of different
species. The problem is that while a gene's
function is probably related to both the overall
three-dimensional structure of its protein and
small regions of specific linear sequence, our
only serious tool for discerning similarity
between proteins is based firmly on long-range
linear sequence similarity. And there is no
obvious requirement on genes to conserve sequence
in order to conserve function; it is just easier
that way. But it seems clear that we can only
expect this to be effective if we are looking at
true ORTHOLOGS.
9Finding Orthologs
So how do we find orthologs, and can we know when
we have? The simplest approach is Reciprocal Best
BLAST, but it implicitly relies on having all the
protein sequences of your own organism and of the
one you wish to find an ortholog in.
(Diagram: frog protein x, database of human proteins, best match human protein, database of frog proteins.)
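The reciprocal best hit test itself is simple once the search scores are in hand. The sketch below assumes the searches have already been run in both directions and the scores collected into dictionaries. All protein names and scores in it are made up for illustration, as are the function names.

```python
def best_hit(hits):
    """Return the subject with the highest score, or None if there were no hits."""
    return max(hits, key=hits.get) if hits else None


def reciprocal_best_hits(forward, backward):
    """Pairs (query, subject) where each is the other's best hit.

    forward:  {query: {subject: score}} for searches against the other proteome
    backward: {subject: {query: score}} for the searches run in reverse
    """
    pairs = []
    for query, hits in forward.items():
        subject = best_hit(hits)
        if subject is not None and best_hit(backward.get(subject, {})) == query:
            pairs.append((query, subject))
    return pairs


# Hypothetical scores: frog proteins searched against the human proteins ...
frog_vs_human = {
    "frog_x": {"human_1": 427.0, "human_2": 210.0},
    "frog_y": {"human_1": 300.0, "human_3": 120.0},
}
# ... and human proteins searched back against the frog proteins.
human_vs_frog = {
    "human_1": {"frog_x": 430.0, "frog_y": 290.0},
    "human_2": {"frog_x": 205.0},
    "human_3": {"frog_y": 118.0},
}

orthologs = reciprocal_best_hits(frog_vs_human, human_vs_frog)
print(orthologs)
```

Here frog_y also finds human_1 as its best match, but human_1 finds frog_x on the way back, so only frog_x and human_1 are reported as a candidate ortholog pair.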