Title: The Challenge of Predicting Gene Function
1The Challenge of Predicting Gene Function
- Ross D. King
- Department of Computer Science
- University of Wales, Aberystwyth
2Gene Function Prediction
- The most important revelation from the sequenced
genomes is that the functions of typically only
60-70% of the predicted genes are known
with any confidence.
- The new science of functional genomics is
dedicated to determining the function of the
genes of unassigned function, and to further
detailing the function of genes with purported
function.
3Data Mining Prediction
- We have developed a method for predicting the
functional class of gene products based on
ILP/Relational data mining.
- The idea is to learn a reliable predictive
function on the examples of genes with products
of known function.
- Then apply this function to genes where the
functional class is unknown.
- We call this approach Data Mining Prediction
(DMP).
4Predicting Gene Function in Yeast
- We will demonstrate our approach using ORFs in
yeast (Saccharomyces cerevisiae).
- Using the MIPS functional classification scheme
- For those ORFs whose function is currently
unknown
- Using 5 types of data
  - Sequence statistics
  - Homology (sequence similarity)
  - Predicted Secondary Structure
  - Expression (microarray)
  - Phenotype
5We want to map from sequence to function class
6Classification Schemes 1
1,0,0,0   "METABOLISM"
2,0,0,0   "ENERGY"
3,0,0,0   "CELL CYCLE AND DNA PROCESSING"
4,0,0,0   "TRANSCRIPTION"
5,0,0,0   "PROTEIN SYNTHESIS"
6,0,0,0   "PROTEIN FATE (folding, modification, destination)"
8,0,0,0   "CELLULAR TRANSPORT AND TRANSPORT MECHANISMS"
10,0,0,0  "CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM"
11,0,0,0  "CELL RESCUE, DEFENSE AND VIRULENCE"
13,0,0,0  "REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT"
14,0,0,0  "CELL FATE"
29,0,0,0  "TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS"
30,0,0,0  "CONTROL OF CELLULAR ORGANIZATION"
40,0,0,0  "SUBCELLULAR LOCALISATION"
62,0,0,0  "PROTEIN ACTIVITY REGULATION"
63,0,0,0  "PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT"
67,0,0,0  "TRANSPORT FACILITATION"
98,0,0,0  "CLASSIFICATION NOT YET CLEAR-CUT"
99,0,0,0  "UNCLASSIFIED PROTEINS"
7Classification Schemes 2
Hierarchy of classes
1,0,0,0   "METABOLISM"
1,1,0,0   "amino acid metabolism"
1,2,0,0   "nitrogen and sulfur metabolism"
1,3,0,0   "nucleotide metabolism"
1,4,0,0   "phosphate metabolism"
1,5,0,0   "C-compound and carbohydrate metabolism"
1,6,0,0   "lipid, fatty-acid and isoprenoid metabolism"
1,7,0,0   "metabolism of vitamins, cofactors, and prosthetic groups"
1,20,0,0  "secondary metabolism"
8Classification schemes 3
Hierarchy of classes
1,0,0,0   "METABOLISM"
1,1,0,0   "amino acid metabolism"
1,1,1,0   "amino acid biosynthesis"
1,1,4,0   "regulation of amino acid metabolism"
1,1,7,0   "amino acid transport"
1,1,10,0  "amino acid degradation (catabolism)"
1,1,99,0  "other amino acid metabolism activities"
1,2,0,0   "nitrogen and sulfur metabolism"
1,3,0,0   "nucleotide metabolism"
1,4,0,0   "phosphate metabolism"
1,5,0,0   "C-compound and carbohydrate metabolism"
1,6,0,0   "lipid, fatty-acid and isoprenoid metabolism"
1,7,0,0   "metabolism of vitamins, cofactors, and prosthetic groups"
1,20,0,0  "secondary metabolism"
... and ORFs may have multiple functions too!
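The class codes above encode the hierarchy directly: each comma-separated position is one level, and trailing zeros mean "no deeper level". A minimal Python sketch of how such codes could be handled follows; the function names and the example label assignment are illustrative assumptions, not part of the MIPS scheme or of DMP.

```python
def parse_class(code):
    """Turn '1,1,4,0' into (1, 1, 4) by dropping the trailing zero padding."""
    parts = [int(p) for p in code.split(",")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def ancestors(code):
    """Return every class on the path from level 1 down to `code` itself."""
    parts = parse_class(code)
    width = len(code.split(","))
    path = []
    for depth in range(1, len(parts) + 1):
        padded = list(parts[:depth]) + [0] * (width - depth)
        path.append(",".join(str(p) for p in padded))
    return path


def expand_labels(codes):
    """An ORF may carry several labels; each label implies all its ancestors."""
    implied = set()
    for code in codes:
        implied.update(ancestors(code))
    return implied


# Hypothetical ORF annotated with two functions at different levels.
print(sorted(expand_labels(["1,1,4,0", "4,0,0,0"])))
```

Expanding labels this way makes the multi-label and hierarchical aspects explicit: a prediction can be scored separately at level 1, level 2, and so on.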
9Sequence Data
field            description                                     type
aa_rat_X         % of amino acid X in the protein                real
seq_len          length of the protein sequence                  int
aa_rat_pair_X_Y  % of the amino acids X and Y consecutively      real
mol_wt           molecular weight of the protein                 int
theo_pI          theoretical pI (isoelectric point)              real
atomic_comp_X    atomic composition of X (C,H,N,O,S)             real
aliphatic_index  aliphatic index                                 real
hydro            grand average of hydropathy                     real
strand           the DNA strand                                  'w' or 'c'
position         the number of exons (no. of start positions)    int
cai              codon adaptation index                          real
motifs           number of PROSITE motifs                        int
tmSpans          number of transmembrane spans                   int
chromosome       chromosome number                               1..16,mit
478 attributes in total
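A few of the simpler attributes in this table can be computed directly from the protein sequence. The sketch below covers only seq_len, aa_rat_X and aa_rat_pair_X_Y. It assumes the ratios are percentages and that pairs are counted over overlapping adjacent positions; both are assumptions, as is the function name.

```python
from collections import Counter


def sequence_attributes(seq):
    """Compute seq_len, aa_rat_X and aa_rat_pair_X_Y for one protein sequence."""
    seq = seq.lower()
    n = len(seq)
    attrs = {"seq_len": n}
    # Percentage of each single amino acid (assumed to be a percentage).
    for aa, count in Counter(seq).items():
        attrs["aa_rat_" + aa] = 100.0 * count / n
    # Percentage of each consecutive pair, over the n - 1 overlapping positions.
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    for pair, count in pairs.items():
        attrs["aa_rat_pair_%s_%s" % (pair[0], pair[1])] = 100.0 * count / (n - 1)
    return attrs


# Start of the YAL001C sequence shown on the homology slide (truncated there).
attrs = sequence_attributes("mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdls")
print(attrs["seq_len"], round(attrs["aa_rat_k"], 1))
```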
10Homology data
YAL001C mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk....
sfc3 keyword(membrane) length(358) dbref(prosite) dbref(embl)
We look up the associated information from
SwissProt
11Predicted Secondary Structure Data
mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdkkvk...
cbbbbccaaaaaaaaaaaacccccbbbbaaaaaacccbbccccccb...
We record length and relative positions of the
secondary structure elements. This is relational
data.
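As a rough illustration of why this is relational data, the sketch below turns a predicted-structure string into segments (type, start, length) and then into neighbour facts between consecutive segments. The tuple layout is an assumption made for illustration; it only loosely mirrors the struc/neighbour predicates used in the PolyFARM patterns, and the input is a shortened prefix of the string shown above.

```python
from itertools import groupby


def segments(ss):
    """Split a structure string into (type, start, length) runs."""
    out, start = [], 0
    for kind, run in groupby(ss):
        length = len(list(run))
        out.append((kind, start, length))
        start += length
    return out


def neighbour_facts(segs):
    """Relate each segment to the next: (start1, start2, type of segment 2)."""
    return [(a[1], b[1], b[0]) for a, b in zip(segs, segs[1:])]


segs = segments("cbbbbccaaaa")  # a = alpha-helix, b = beta-strand, c = coil
print(segs)
print(neighbour_facts(segs))
```

The number of segments differs from protein to protein, so these facts do not fit a single fixed-width table without further processing.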
12Expression Data
- Microarray experiments to measure expression
changes in yeast under a variety of conditions,
including cell cycle, heat shock, diauxic shift.
- Short time series data, numerical-valued
Spellman et al (1998), Roth et al (1998), DeRisi
et al (1997), Eisen et al (1998), Gasch et al
(2000, 2001), Chu et al (1998)
13Phenotype Data
- Data from knockout gene growth experiments
- Many missing data
- 69 attributes x 1461 ORFs of known function
- 991 genes of unknown function
- Data taken from 3 sources (TRIPLES, MIPS, EUROFAN)
Rows: growth medium; columns: deleted ORF
growth medium      YAL001C  YAL019W  YAL021C  YAL029C
calcofluor white   w        n        n        n
sorbitol           n        s        n        w
benomyl            n        w        n        w
...
H2O2               w        w        n        r
Key: s = sensitive (less growth); w = wild-type (no
observable effect); r = resistant (more growth);
n = no data
14What are the Machine Learning Issues?
- Large volume of data
- Missing data
- Accurate results required
- Intelligible results required
- Class hierarchy
- Multiple labels
- Relational data
15Relational vs Propositional
Propositional: single table, fixed number of
columns/attributes
Relational: multiple tables, multiple values
16Data Mining Prediction (DMP)
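[Figure: DMP workflow. The entire database is split into test data (1/3) and training data (2/3). The training data is split again into validation data (1/3) and data for rule creation (2/3). Rule generation (PolyFARM, C4.5) on the rule-creation data yields all rules; the best rules are selected using the validation data; rule accuracy is then measured on the test data to give the results.]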
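The two-stage split in the workflow can be sketched as follows. This is a minimal illustration that assumes a simple random split with the fractions read off the figure (2/3 training, 1/3 test; then 2/3 rule creation, 1/3 validation); the ORF identifiers and the helper name are hypothetical, and the actual DMP partitioning procedure may differ.

```python
import random


def split(items, fraction, rng):
    """Shuffle a copy of `items` and cut it at the given fraction."""
    items = list(items)
    rng.shuffle(items)
    cut = round(len(items) * fraction)
    return items[:cut], items[cut:]


rng = random.Random(0)                      # fixed seed for repeatability
orfs = ["orf%03d" % i for i in range(90)]   # hypothetical ORF identifiers

training, test = split(orfs, 2 / 3, rng)
rule_creation, validation = split(training, 2 / 3, rng)

print(len(rule_creation), len(validation), len(test))
```

Keeping the test data out of both rule generation and rule selection is what allows the final accuracy figures to be read as estimates on unseen ORFs.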
17Warmr
- Warmr is an ILP algorithm developed by Dehaspe
et al.
- It is an ILP version of the well-known Apriori
data mining algorithm.
- Designed to find frequent patterns in a datalog
database.
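For readers unfamiliar with Apriori, the level-wise idea it is built on can be sketched in a few lines. The code below is the plain propositional version over sets of items, not Warmr itself: Warmr searches first-order queries over a datalog database, which this sketch does not attempt. The toy transactions are made up.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    level = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while level:
        kept = {c: s for c in level for s in [support(c)] if s >= min_support}
        frequent.update(kept)
        k += 1
        # Join step: unions of frequent sets that contain exactly k items.
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))}
    return frequent


toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
for itemset, count in sorted(apriori(toy, 2).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

The prune step is what keeps the search tractable: a pattern is only counted if all of its more general sub-patterns were already found to be frequent.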
18PolyFARM
- First-order association rule mining
- Finding all frequent first-order patterns in the
data
- Distributed on a Beowulf cluster
- 47,034 homology patterns, f > 5
- 19,628 structure patterns, f > 2
- Clare & King, PADL 2003
hom(SPID, close), sq_len(SPID, short),
classification(SPID, ecoli)
A close homology to a short protein in E. coli
struc(Pos1, a), neighbour(Pos1, Pos2, c),
neighbour(Pos2, Pos3, a), coil_dist(high)
Contains alpha-coil-alpha with a high overall
coil distribution
19Propositionalisation
Transforming relational data into boolean
attributes
         patt1  patt2  patt3  patt4  ...  patt47034
YAL001C  0      1      0      0      ...  1
YAL002W  0      1      1      0      ...  1
YAL003W  1      0      0      1      ...  0
YAL004W  1      1      0      0      ...  1
YAL005C  0      0      0      0      ...  1
...
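The construction of such a table can be sketched as below. To stay short, the sketch makes a strong simplifying assumption: each pattern is a set of ground facts and "matches" an ORF when all of its facts are present. Real first-order patterns contain variables and need a query engine to test. The ORF names and facts are hypothetical.

```python
def propositionalise(facts_by_orf, patterns):
    """Build one row of 0/1 attributes per ORF, one column per pattern."""
    table = {}
    for orf, facts in facts_by_orf.items():
        table[orf] = [1 if pattern <= facts else 0 for pattern in patterns]
    return table


# Hypothetical relational facts, simplified to ground tuples per ORF.
facts_by_orf = {
    "ORF_A": {("hom", "close"), ("sq_len", "short"), ("coil_dist", "high")},
    "ORF_B": {("hom", "close"), ("sq_len", "long")},
}
patterns = [
    {("hom", "close"), ("sq_len", "short")},   # patt1
    {("coil_dist", "high")},                   # patt2
    {("hom", "close")},                        # patt3
]

for orf, row in propositionalise(facts_by_orf, patterns).items():
    print(orf, row)
```

The resulting fixed-width boolean rows are what a propositional learner such as C4.5 can consume.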
20Dichotomic Search 1
- As an alternative to the WARMR data-mining
approach, we developed a frequent pattern finding
method based on dichotomic search.
- This approach uses domain-specific logics as
intermediates between propositional logic and
predicate logic.
21Dichotomic Search 2
- Most existing algorithms traverse the search
space in either a top-down or a bottom-up
fashion. We propose a new approach based on
dichotomic search, which explores the search space
in both directions, allowing larger steps.
- Dichotomic search combines completeness (w.r.t.
concepts), non-redundancy, and flexibility.
- Ferré, S. & King, R.D. (2005). Fundamenta
Informaticae
23C4.5
- Open source decision tree algorithm
- propositional learning
- commonly used
- produces interpretable rules
- reliable
- fast
- accurate
- Made modifications for
  - multiple labels
  - hierarchical labels
- Clare & King, Bioinformatics 2002
[Figure: example decision tree. Tests shown: aa_ratio_pair_p_y (> 0.232 / < 0.232), strand (w / c), aa_rat_a (> 6.4 / < 6.4). Leaves shown: metabolism, transcription, transport, cell fate.]
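Decision tree learners of this family choose each test by how much it reduces class entropy. The sketch below shows the standard single-label entropy and information gain for a numeric threshold. It is textbook material, not the modified multi-label, hierarchical version described on this slide, and C4.5 itself further normalises the gain into a gain ratio. The attribute values and labels are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


# Hypothetical values of one numeric attribute for four ORFs.
values = [0.10, 0.20, 0.30, 0.40]
labels = ["metabolism", "metabolism", "transcription", "transcription"]
print(information_gain(values, labels, 0.232))
```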
25Results
- Many rules from each data type
- Rules at each level of hierarchy
- Some classes are much easier to predict than
others (for example "protein synthesis" at
71-93%, "energy" at 20-47%)
- Good levels of accuracy on held-out test data
- Many predictions for ORFs of unknown function
(some function at some level is predicted for 96%
of the ORFs of unknown function)
- Some rules explainable by biology -> scientific
knowledge discovery
- Clare & King (2003) Bioinformatics, suppl. 2,
42-49
26Accuracy Table
27Expression Data Rule
If in the micro-array experiment (sorbitol
incubation) the ORF expression is > -0.25, and in
the micro-array experiment (nitrogen depletion)
the ORF expression is < -1.29, and in the
micro-array experiment (YPD stationary phase) the
ORF expression is > -1.06, then the function of
this ORF is "pheromone response, mating type
determination, sex-specific proteins".
Accuracy on the training data: 11/12 (92%)
Accuracy on the test data: 3/4 (75%)
21 predictions made
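Written as code, the rule is a conjunction of three threshold tests. The sketch below uses the thresholds from the slide; the dictionary keys, function names and example expression values are hypothetical.

```python
def pheromone_rule(expr):
    """True when all three expression conditions of the rule hold."""
    return (expr["sorbitol incubation"] > -0.25
            and expr["nitrogen depletion"] < -1.29
            and expr["YPD stationary phase"] > -1.06)


def accuracy(correct, total):
    """Accuracy as a percentage."""
    return 100.0 * correct / total


# Hypothetical expression values for one ORF.
orf = {"sorbitol incubation": 0.10,
       "nitrogen depletion": -1.50,
       "YPD stationary phase": -0.40}
print(pheromone_rule(orf))

print(round(accuracy(11, 12)), round(accuracy(3, 4)))  # training, test
```

With only 12 training and 4 test examples covered, the percentages are coarse: a single ORF changes the test accuracy by 25 points.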
28Structure Rule
- 80% accurate on test data
- Most matching ORFs belong to the Mitochondrial
Carrier Family
- These have 6 long transmembrane alpha-helices of
about 20-30 amino acids
- Why do we notice alpha-helices of length 10-14?
29Alignment
YJL133W -------NEYNPLIHCLC----GSISGSTCAAITTPLDCIKTVLQIRG------------ 251
YKR052C -------NSYNPLIHCLC----GGISGATCAALTTPLDCIKTVLQVRG------------ 241
YIL006W ----NNTNSINLQRLIMA----SSVSKMIASAVTYPHEILRTRMQLKS------------ 310
YBR104W ----LTRNEIPPWKLCLF----GAFSGTMLWLTVYPLDVVKSIIQNDD------------ 271
YGR096W ----KTTAAHKKWELATLNHSAGTIGGVIAKIITFPLETIRRRMQFMNSKHLEK------ 250
YJR095W -----QMDVLPSWETSCI----GLISGAIGPFSNAPLDTIKTRLQKDK------------ 246
YKL120W -----LMKDGPALHLTAS-----TISGLGVAVVMNPWDVILTRIYNQK------------ 261
YLR348C -----FDASKNYTHLTAS-----LLAGLVATTVCSPADVMKTRIMNGS------------ 239
YMR166C ----DGRDGELSIPNEILT---GACAGGLAGIITTPMDVVKTRVQTQQPPSQSNKSYSVT 300
YDL198C ------DYSQATWSQNFIS---SIVGACSSLIVSAPLDVIKTRIQNRN------------ 242
YGR257C ----RFASKDANWVHFINSFASGCISGMIAAICTHPFDVGKTRWQISMMN---------- 302
YDL119C FIHYNPEGGFTTYTSTTVNTTSAVLSASLATTVTAPFDTIKTRMQLEP------------ 255

YJL133W -SQTVSLEIMRKADTFSKAASAIYQVYGWKGFWRGWKPRIVANMPATAISWTAYECAKHF 310
YKR052C -SETVSIEIMKDANTFGRASRAILEVHGWKGFWRGLKPRIVANIPATAISWTAYECAKHF 300
YIL006W -DIPDSIQRR-----LFPLIKATYAQEGLKGFYSGFTTNLVRTIPASAITLVSFEYFRNR 364
YBR104W -LRKPKYKNS-----ISYVAKTIYAKEGIRAFFKGFGPTMVRSAPVNGATFLTFELVMRF 325
YGR096W FSRHSSVYGSYKGYGFARIGLQILKQEGVSSLYRGILVALSKTIPTTFVSFWGYETAIHY 310
YJR095W ---SISLEKQSGMKKIITIGAQLLKEEGFRALYKGITPRVMRVAPGQAVTFTVYEYVREH 303
YKL120W ----GDLYKG-----PIDCLVKTVRIEGVTALYKGFAAQVFRIAPHTIMCLTFMEQTMKL 312
YLR348C ----GDHQP------ALKILADAVRKEGPSFMFRGWLPSFTRLGPFTMLIFFAIEQLKKH 289
YMR166C HPHVTNGRPAALSNSISLSLRTVYQSEGVLGFFSGVGPRFVWTSVQSSIMLLLYQMTLRG 360
YDL198C ---FDNPESG------LRIVKNTLKNEGVTAFFKGLTPKLLTTGPKLVFSFALAQSLIPR 293
YGR257C ---NSDPKGGNRSRNMFKFLETIWRTEGLAALYTGLAARVIKIRPSCAIMISSYEISKKV 359
YDL119C ----SKFTNS------FNTFTSIVKNENVLKLFSGLSMRLARKAFSAGIAWGIYEELVKR 305
30Alignment
YJL133W -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 251
YKR052C -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 241
YIL006W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 310
YBR104W ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------ 271
YGR096W ----cccccccccccccbaaaaaaaaaaaaaaacccaaaaaaaaaacccccccc------ 250
YJR095W -----cccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaccc------------ 246
YKL120W -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 261
YLR348C -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------ 239
YMR166C ----cccccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacccccccccccccc 300
YDL198C ------cccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacc------------ 242
YGR257C ----ccccccccccccaaaaaaaaaaaaaaaaacccaaaaaaaaaacccc---------- 302
YDL119C ccccccccccccccaaaaaaaaaaaaaaaaaaacccaaaaaaaaaacc------------ 255

YJL133W -ccccccccccccccaaaaaaaaaaaccccaaaaccaaaaaaacaaaaaaaaaaaaaaaa 310
YKR052C -ccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 300
YIL006W -ccccccccc-----aaaaaaaaaaaccccaaacccaaaaaaaccaaaaaaaaaaaaaaa 364
YBR104W -ccccccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 325
YGR096W cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 310
YJR095W ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 303
YKL120W ----cccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 312
YLR348C ----ccccc------aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 289
YMR166C cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 360
YDL198C ---cccccca------aaaaaaaaaacccaaaaacccaaaaaaaaaaaaaaaaaaaaaaa 293
YGR257C ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa 359
YDL119C ----ccccca------aaaaaaaaaacccaaaaacccaaaaaaccaaaaaaaaaaaaaaa 305
31Homology rule
- This rule is 100% accurate on test data
- Almost all matching ORFs are from the 20S
proteasome subunit for degradation of proteins
- These subunits exist in archaea and eukaryotes,
but only in one specific branch of bacteria
(actinomycetes).
33Application of DMP to Bacterial Genomes
- Successful for both M. tuberculosis and E. coli.
- Of the ORFs with no assigned function, > 40% were
predicted to have a function at one or more
levels of the class hierarchy.
- It was found that many of the predictive rules
were more general than possible using sequence
homology.
- References
  - King et al. (2000) KDD 2000
  - King et al. (2000) Yeast (Comparative and
  Functional Genomics)
  - King et al. (2001) Bioinformatics
34Example Rule (level 2 E. coli)
If the ORF is not predicted to have a b-strand of
length ? 3, and a homologous protein from class
Chytridiomycetes was found, then its functional
class is "Cell processes, Transport/binding
proteins".
12/13 (86%) correct on the test set; the
probability of this result occurring by chance is
estimated at 4 x 10^-7.
24 ORFs of unknown function are predicted by the rule.
16 ORFs now with putative or confirmed function:
93.8% accurate predictions.
35Experimental Confirmation
- The original bacterial ORF predictions were made
over three years ago.
- In the intervening time many more ORFs have been
sequenced, making traditional homologous
prediction methods more accurate and sensitive,
and the function of some ORFs has been
determined by wet biology.
- The E. coli genome has been re-annotated by
Monica Riley's group.
36Wet Biology Confirmation
- A number of predictions have been confirmed or
falsified by new wet experimental data.
- These new data are biased towards hard classes.
Despite this, the results are still good:
  - Level 2: 23 predictions, 47.8% accuracy
  - Level 3: 23 predictions, 43.4% accuracy
This is very much better than random, as there are
many classes.
37Confirmation of Wet Predictions
38Extension to Arabidopsis Genome
- Collaborative project with the Institute of
Grassland and Environmental Research and the
University of Nottingham.
- Large increase in data: 6,000 (yeast) -> 25,000
ORFs.
- Large amount of micro-array data from the
Nottingham Arabidopsis Stock Centre.
- The increase in data is a challenge to our
machine learning algorithms: 100s of MB.
- Clare, A., Karwath, A., Ougham, H. and
King, R.D. (2006) Functional Bioinformatics for
Arabidopsis thaliana. Bioinformatics 22,
1130-1136.
39Results
- Accuracy comparable to yeast and bacteria
- Large fraction of genes of currently unknown
function are predicted.
- Some rules could be interpreted in terms of known
biology
- Clare, A., Karwath, A., Ougham, H. and King, R.D.
(2006) Functional Bioinformatics for Arabidopsis
thaliana. Bioinformatics 22, 1130-1136.
40Gibberellin Biosynthesis Prediction
- Gibberellin is an important plant hormone.
- Chosen because of interesting phenotypes, often
extreme size.
- Insertion of a promoter to overproduce gene
product.
- Result
  - 2 days earlier flowering
  - Average leaf number and weight increased at 21
  days.
- This phenotype is consistent with prediction.
41(No Transcript)
42Leaf number increases more rapidly in the mutant
(yellow bars) than in wildtype Landsberg erecta
(blue bars)
43Paclobutrazol (P) (inhibitor of gibberellin)
abolishes the difference between mutant (M) and
wildtype (L); C = control
44Availability
All predictions available at
http://www.genepredictions.org
All rules and data available at
http://www.aber.ac.uk/compsci/Research/bio/dss/
45ILP 2005 Challenge 1
- Yeast function prediction data used as a
community challenge: http://www.protein-logic.com/
- The intention of the challenge was to provide a
real-world data set to test how far we have
progressed in the field of ILP and
multi-relational data mining. The questions we
wanted to answer were: Are the tools up to the
job? Do they scale? Do they handle noisy, sparse
and complex data?
46ILP 2005 Challenge 2
- A. J. Knobbe, E. K. Y. Ho, R. Malik: ILP
Challenge 2005: The Safarii MRDM environment.
- C. Perlich: Approaching the ILP 2005 challenge:
Class-Conditional Bayesian Propositionalization
for Genetic Classification.
- J. Struyf, C. Vens, T. Croonenborghs, S. Dzeroski,
H. Blockeel: Applying Predictive Clustering Trees
to the Inductive Logic Programming 2005 Challenge
Data.
- F. Riguzzi: A Simple Approach to a Multi-Label
Classification Problem.
47Propositional Approach
- Zafer Barutcuoglu, Robert E. Schapire and Olga G.
Troyanskaya. Hierarchical multi-label prediction
of gene function. Bioinformatics (in press)
- Hierarchy of SVMs.
- Uses a Bayesian net to combine predictions.
48Conclusions
- Data mining and machine learning are powerful
tools for functional genomics.
- The DMP method can be successfully applied to
different genomes (bacterial, yeast, Arabidopsis)
to predict gene functional class.
- Micro-array data is a useful component in DMP.
- Biological insight can be extracted from DMP
rules.
- The structure of gene prediction problems makes
them an exciting test bed for machine learning
methods.
49Acknowledgements
- Amanda Clare Aberystwyth
- Andreas Karwath Freiburg (Aberystwyth)
- Luc Dehaspe PharmaDM
- Helen Ougham IGER
- BBSRC
50The Need for Logic to Represent Scientific
Knowledge
- Logic is the best understood way to represent
knowledge.
- Traditional statistics, machine learning, and
data mining are based on propositional logic.
- For some problems we require a richer description
language, i.e. first-order predicate calculus.
- Using logic programming (predicate calculus) we
can incorporate deduction, abduction, and
induction.
51Inductive Logic Programming
- Inductive Logic Programming (ILP) uses logic
programs (first-order predicate calculus) to
describe examples, theories, and
background knowledge, and to learn from them.
- For certain types of problem ILP is a powerful
data analysis technique, giving more accurate and more
comprehensible results than conventional methods.
- Has been successfully applied to a number of
biological/chemical problems.
52ILP for Science
- The key advantage of ILP for scientific
applications is that it allows the application of
compact relational representations that are
natural for scientists to use. This allows
domain-understandable rules to be automatically
formed.
- This advantage comes at a computational cost.
However, non-technical reasons are probably the
greatest barrier to adoption of ILP. For
example, it is very difficult to explain the
benefits of ILP to domain experts.
53Prediction of Lethality
- Instead of using microarray data to predict
the functional class of a gene, we have been using
the same approach to predict whether a gene
knockout will be lethal (grown in a rich medium).
Example rule:
If it is false that the function of the ORF is cell
cycle, and true that the function of the ORF is rRNA
transcription, and in the micro-array experiment
(cell cycle) the ORF expression is > -0.79, then
the knockout is lethal.
Test accuracy 82% (default 21%).
54Summary Results
- Using voting (2 or more rules agree on a
prediction)
  - Level 2: 128 ORFs predicted, 87.5% accuracy
  - Level 3: 23 ORFs predicted, 91.3% accuracy
- All predictions
  - Level 2: 335 ORFs predicted, 64.5% accuracy
  - Level 3: 204 ORFs predicted, 44.6% accuracy
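The voting scheme can be sketched as a simple count over the classes proposed by the rules that fire for an ORF. This is an illustration under assumptions: the function name, the default threshold argument and the example predictions are hypothetical, and ties or conflicting classes are simply all returned.

```python
from collections import Counter


def vote(predictions, min_votes=2):
    """Keep the classes predicted by at least `min_votes` rules."""
    counts = Counter(predictions)
    return sorted(cls for cls, n in counts.items() if n >= min_votes)


# Hypothetical classes proposed by the rules that fired for one ORF.
fired = ["energy", "metabolism", "metabolism", "transcription"]
print(vote(fired))                # classes backed by two or more rules
print(vote(fired, min_votes=1))   # every predicted class ("all predictions")
```

Requiring agreement trades coverage for accuracy, which matches the pattern in the figures above: fewer ORFs receive a prediction under voting, but those predictions are more often correct.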