The Challenge of Predicting Gene Function

1
The Challenge of Predicting Gene Function

Ross D. King
Department of Computer Science
University of Wales, Aberystwyth

2
Gene Function Prediction

The most important revelation from the sequenced
genomes is that the functions of typically only
between 60-70% of the predicted genes are known
with any confidence.
The new science of functional genomics is
dedicated to determining the function of the
genes of unassigned function, and to further
detailing the function of genes with purported
function.

3
Data Mining Prediction

We have developed a method for predicting the
functional class of gene products based on
ILP/Relational data mining.
The idea is to learn a reliable predictive
function on the examples of genes with products
of known function.
Then apply this function to genes where the
functional class is unknown.
We call this approach Data Mining Prediction
(DMP).

4
Predicting Gene Function in Yeast

We will demonstrate our approach using ORFs in
yeast
(Saccharomyces cerevisiae).
Using the MIPS functional classification scheme
For those ORFs whose function is currently
unknown
Using 5 types of data

Sequence statistics
Homology (sequence similarity)
Predicted Secondary Structure
Expression (microarray)
Phenotype

5
We want to map from sequence to function class
6
Classification Schemes 1

MIPS/GeneOntology

1,0,0,0 "METABOLISM"
2,0,0,0 "ENERGY"
3,0,0,0 "CELL CYCLE AND DNA PROCESSING"
4,0,0,0 "TRANSCRIPTION"
5,0,0,0 "PROTEIN SYNTHESIS"
6,0,0,0 "PROTEIN FATE (folding, modification, destination)"
8,0,0,0 "CELLULAR TRANSPORT AND TRANSPORT MECHANISMS"
10,0,0,0 "CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION MECHANISM"
11,0,0,0 "CELL RESCUE, DEFENSE AND VIRULENCE"
13,0,0,0 "REGULATION OF / INTERACTION WITH CELLULAR ENVIRONMENT"
14,0,0,0 "CELL FATE"
29,0,0,0 "TRANSPOSABLE ELEMENTS, VIRAL AND PLASMID PROTEINS"
30,0,0,0 "CONTROL OF CELLULAR ORGANIZATION"
40,0,0,0 "SUBCELLULAR LOCALISATION"
62,0,0,0 "PROTEIN ACTIVITY REGULATION"
63,0,0,0 "PROTEIN WITH BINDING FUNCTION OR COFACTOR REQUIREMENT"
67,0,0,0 "TRANSPORT FACILITATION"
98,0,0,0 "CLASSIFICATION NOT YET CLEAR-CUT"
99,0,0,0 "UNCLASSIFIED PROTEINS"
7
Classification Schemes 2
Hierarchy of classes
1,0,0,0 "METABOLISM"
  1,1,0,0 "amino acid metabolism"
  1,2,0,0 "nitrogen and sulfur metabolism"
  1,3,0,0 "nucleotide metabolism"
  1,4,0,0 "phosphate metabolism"
  1,5,0,0 "C-compound and carbohydrate metabolism"
  1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
  1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
  1,20,0,0 "secondary metabolism"
8
Classification schemes 3
Hierarchy of classes
1,0,0,0 "METABOLISM"
  1,1,0,0 "amino acid metabolism"
    1,1,1,0 "amino acid biosynthesis"
    1,1,4,0 "regulation of amino acid metabolism"
    1,1,7,0 "amino acid transport"
    1,1,10,0 "amino acid degradation (catabolism)"
    1,1,99,0 "other amino acid metabolism activities"
  1,2,0,0 "nitrogen and sulfur metabolism"
  1,3,0,0 "nucleotide metabolism"
  1,4,0,0 "phosphate metabolism"
  1,5,0,0 "C-compound and carbohydrate metabolism"
  1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
  1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
  1,20,0,0 "secondary metabolism"
... and ORFs may have multiple functions too!
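A minimal Python sketch of how the hierarchical class codes could be handled. It assumes each class is a zero-padded 4-tuple such as (1,1,4,0), where a 0 means "no further refinement"; the function names are hypothetical and are not taken from the talk.

```python
from typing import Iterable, List, Tuple

ClassCode = Tuple[int, int, int, int]  # assumed zero-padded MIPS-style code


def level(code: ClassCode) -> int:
    """Depth in the hierarchy: the number of leading non-zero components."""
    depth = 0
    for part in code:
        if part == 0:
            break
        depth += 1
    return depth


def ancestors(code: ClassCode) -> List[ClassCode]:
    """All strictly more general classes, most general first."""
    n = level(code)
    return [code[:k] + (0,) * (len(code) - k) for k in range(1, n)]


def labels_with_ancestors(codes: Iterable[ClassCode]) -> List[ClassCode]:
    """Expand an ORF's (possibly multiple) labels to include every parent class."""
    expanded = set()
    for code in codes:
        expanded.add(code)
        expanded.update(ancestors(code))
    return sorted(expanded)


if __name__ == "__main__":
    # An ORF annotated with two functions at different levels.
    print(labels_with_ancestors([(1, 1, 4, 0), (4, 0, 0, 0)]))
```

Expanding labels this way lets a learner be trained separately at each level of the hierarchy, which is one plausible reading of "rules at each level".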
9
Sequence Data
field             description                                     type
aa_rat_X          % of amino acid X in the protein                real
seq_len           length of the protein sequence                  int
aa_rat_pair_X_Y   % of the amino acids X and Y consecutively      real
mol_wt            molecular weight of the protein                 int
theo_pI           theoretical pI (isoelectric point)              real
atomic_comp_X     atomic composition of X (C,H,N,O,S)             real
aliphatic_index   aliphatic index                                 real
hydro             grand average of hydropathy                     real
strand            the DNA strand                                  'w' or 'c'
position          the number of exons (no. of start positions)    int
cai               codon adaptation index                          real
motifs            number of PROSITE motifs                        int
tmSpans           number of transmembrane spans                   int
chromosome        chromosome number                               1..16,mit
478 attributes in total
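A small Python sketch of how a few of these sequence attributes could be computed. It assumes the aa_rat_* fields are percentages of the sequence (the slide does not state the exact normalisation), and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict

AMINO_ACIDS = "acdefghiklmnpqrstvwy"  # the 20 standard residues, lower case


def sequence_features(seq: str) -> Dict[str, float]:
    """Compute seq_len, aa_rat_X and aa_rat_pair_X_Y for one protein sequence.

    Assumption: ratios are expressed as percentages of residues (or of
    adjacent residue pairs) in the sequence.
    """
    seq = seq.lower()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")

    features: Dict[str, float] = {"seq_len": n}

    # Single-residue composition.
    counts = Counter(seq)
    for aa in AMINO_ACIDS:
        features[f"aa_rat_{aa}"] = 100.0 * counts[aa] / n

    # Composition of consecutive residue pairs (only pairs that occur).
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    for pair, count in pair_counts.items():
        features[f"aa_rat_pair_{pair[0]}_{pair[1]}"] = 100.0 * count / (n - 1)

    return features


if __name__ == "__main__":
    f = sequence_features("mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk")
    print(f["seq_len"], round(f["aa_rat_k"], 2))
```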
10
Homology data
YAL001C mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdls
dk....
sfc3 keyword(membrane) length(358) dbref(prosite)
dbref(embl)
We look up the associated information from
SwissProt
11
Predicted Secondary Structure Data
mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdkkvk...
cbbbbccaaaaaaaaaaaacccccbbbbaaaaaacccbbccccccb...
We record length and relative positions of the
secondary structure elements. This is relational
data.
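Recording "length and relative positions" can be sketched as a run-length encoding of the predicted structure string. This is an illustration only: it assumes a, b and c stand for alpha-helix, beta-strand and coil, and the function names are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

Element = Tuple[str, int, int]  # (type, start position, length)


def structure_elements(ss: str) -> List[Element]:
    """Collapse a per-residue structure string into consecutive elements."""
    elements: List[Element] = []
    start = 0
    for kind, run in groupby(ss):
        length = len(list(run))
        elements.append((kind, start, length))
        start += length
    return elements


def neighbours(elements: List[Element]) -> List[Tuple[str, str]]:
    """Pairs of adjacent element types, i.e. the 'relative position' relation."""
    return [(a[0], b[0]) for a, b in zip(elements, elements[1:])]


if __name__ == "__main__":
    elems = structure_elements("cbbbbccaaaaaaaaaaaaccccc")
    print(elems)
    print(neighbours(elems))
```

Because each ORF yields a variable number of elements, the result does not fit a fixed-width table, which is why the slide calls it relational data.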
12
Expression Data

Microarray experiments to measure expression
changes in yeast under a variety of conditions,
including cell cycle, heat shock, diauxic shift.
Short time series data, numerical-valued

Spellman et al (1998), Roth et al (1998), DeRisi
et al (1997), Eisen et al (1998), Gasch et al
(2000, 2001), Chu et al (1998)
13
Phenotype Data

Data from knockout gene growth experiments
Many missing data
69 attributes x 1461 ORFs of known function
991 genes of unknown function
Data taken from 3 sources (TRIPLES, MIPS, EUROFAN)

growth medium       deleted ORF
                    YAL001C  YAL019W  YAL021C  YAL029C
calcofluor white    w        n        n        n
sorbitol            n        s        n        w
benomyl             n        w        n        w
...
H2O2                w        w        n        r

s = sensitive (less growth)
w = wild-type (no observable effect)
r = resistant (more growth)
n = no data
14
What are the Machine Learning Issues?

Large volume of data
Missing data
Accurate results required
Intelligible results required
Class hierarchy
Multiple labels
Relational data

15
Relational vs Propositional
Propositional: single table, fixed number of
columns/attributes
Relational: multiple tables, multiple values
16
Data Mining Prediction (DMP)
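[Flow diagram. The entire database is split into
test data (1/3) and training data (2/3); the
training data is split into data for rule
creation (2/3) and validation data (1/3).
Stages shown: PolyFARM; rule generation (C4.5)
-> all rules; select best rules -> best rules;
measure rule accuracy -> results.]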
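The data splits named in the diagram can be sketched in Python as follows. This is a sketch under assumptions: it reads the 1/3 and 2/3 labels as "hold out 1/3 for testing, then split the remaining 2/3 into 2/3 for rule creation and 1/3 for validation", and it uses a simple random shuffle. The function names and the seed parameter are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_two_thirds(items: Sequence[str], rng: random.Random) -> Tuple[List[str], List[str]]:
    """Shuffle and return (first 2/3, remaining 1/3)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


def dmp_split(orfs: Sequence[str], seed: int = 0) -> Dict[str, List[str]]:
    """Assumed DMP layout: test set, then rule-creation and validation sets."""
    rng = random.Random(seed)
    training, test = split_two_thirds(orfs, rng)
    rule_creation, validation = split_two_thirds(training, rng)
    return {
        "rule_creation": rule_creation,  # rules are generated here
        "validation": validation,        # used to select the best rules
        "test": test,                    # used only to measure accuracy
    }


if __name__ == "__main__":
    parts = dmp_split([f"ORF{i}" for i in range(9)])
    print({name: len(members) for name, members in parts.items()})
```

Keeping the test set untouched until the final step is what makes the reported accuracies an estimate on held-out data.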
17
Warmr

Warmr is an ILP algorithm developed by Dehaspe
et al.
It is an ILP version of the well-known Apriori
data mining algorithm.
It is designed to find frequent patterns in a
Datalog database.

18
PolyFARM

First-order association rule mining
Finding all frequent first order patterns in the
data
Distributed on a Beowulf cluster
47,034 homology patterns, f > 5
19,628 structure patterns, f > 2
Clare & King, PADL 2003

hom(SPID, close), sq_len(SPID, short),
classification(SPID, ecoli)
A close homology to a short protein in E. coli
struc(Pos1, a), neighbour(Pos1, Pos2, c),
neighbour(Pos2, Pos3, a), coil_dist(high)
Contains alpha-coil-alpha with a high overall
coil distribution
19
Propositionalisation
Transforming relational data into boolean
attributes
         patt1  patt2  patt3  patt4  ...  patt47034
YAL001C  0      1      0      0      ...  1
YAL002W  0      1      1      0      ...  1
YAL003W  1      0      0      1      ...  0
YAL004W  1      1      0      0      ...  1
YAL005C  0      0      0      0      ...  1
...
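A simplified Python sketch of propositionalisation. It treats every pattern as a set of ground facts and marks an ORF with 1 when it has all of them. Real first-order patterns contain variables and need a logical matching step, so this is a deliberate simplification, and the fact strings and function name are invented for illustration.

```python
from typing import Dict, FrozenSet, List

Facts = FrozenSet[str]


def propositionalise(orf_facts: Dict[str, Facts],
                     patterns: List[Facts]) -> Dict[str, List[int]]:
    """Turn per-ORF fact sets into one boolean (0/1) attribute per pattern."""
    table: Dict[str, List[int]] = {}
    for orf, facts in orf_facts.items():
        # A pattern "matches" when all of its facts hold for the ORF.
        table[orf] = [1 if pattern <= facts else 0 for pattern in patterns]
    return table


if __name__ == "__main__":
    # Hypothetical facts; not taken from the real database.
    orf_facts = {
        "YAL001C": frozenset({"hom_close_short_ecoli", "coil_dist_high"}),
        "YAL002W": frozenset({"coil_dist_high"}),
    }
    patterns = [
        frozenset({"hom_close_short_ecoli"}),
        frozenset({"coil_dist_high"}),
        frozenset({"hom_close_short_ecoli", "coil_dist_high"}),
    ]
    for orf, row in propositionalise(orf_facts, patterns).items():
        print(orf, row)
```

The resulting fixed-width 0/1 table is what allows a propositional learner such as a decision tree to be applied to data that started out relational.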
20
Dichotomic Search 1

As an alternative to the WARMR data-mining
approach, we developed a frequent pattern finding
method based on dichotomic search.
This approach uses domain-specific logics as
intermediates between propositional logic and
predicate logic.

21
Dichotomic Search 2

Most existing algorithms traverse the search
space in either a top-down or a bottom-up
fashion. We propose a new approach based on
dichotomic search, which explores the search space
in both directions, allowing larger steps.
Dichotomic search combines completeness (w.r.t.
concepts), non-redundancy, and flexibility.
Ferre, S. & King, R.D. (2005). Fundamenta
Informaticae

23
C4.5

Open source decision tree algorithm
propositional learning
commonly used
produces interpretable rules
reliable
fast
accurate
Made modifications for
multiple labels
hierarchical labels
Clare & King, Bioinformatics 2002

[Figure: example decision tree. Tests shown:
aa_ratio_pair_p_y (> 0.232 / < 0.232), strand
(w / c), aa_rat_a (> 6.4 / < 6.4). Leaf classes:
metabolism, transcription, transport, cell fate.]
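The slide does not say how C4.5 was modified for multiple labels. One possible adaptation, shown here purely as an assumption, is to replace the usual class entropy with a sum of per-class binary entropies, so that an example may belong to several classes at once. The function name is hypothetical.

```python
import math
from typing import Iterable, List, Set


def multilabel_entropy(label_sets: List[Set[str]], classes: Iterable[str]) -> float:
    """Sum over classes of the binary entropy of 'example has this class'.

    Assumption: this is one plausible multi-label impurity measure,
    not necessarily the one used in the talk.
    """
    n = len(label_sets)
    if n == 0:
        raise ValueError("no examples")
    total = 0.0
    for c in classes:
        p = sum(1 for labels in label_sets if c in labels) / n
        for prob in (p, 1.0 - p):
            if prob > 0.0:  # 0 * log(0) is taken as 0
                total -= prob * math.log2(prob)
    return total


if __name__ == "__main__":
    examples = [{"metabolism"}, {"metabolism", "transport"}, {"transport"}, set()]
    print(multilabel_entropy(examples, ["metabolism", "transport"]))
```

A tree learner would pick the split that most reduces this quantity, in the same way standard C4.5 uses information gain.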
25
Results

Many rules from each data type
Rules at each level of hierarchy
Some classes are much easier to predict than
others (for example "protein synthesis" at
71-93%, "energy" at 20-47%)
Good levels of accuracy on held out test data
Many predictions for ORFs of unknown function
(some function at some level is predicted for 96%
of the ORFs of unknown function)
Some rules explainable by biology -> scientific
knowledge discovery
Clare & King (2003) Bioinformatics suppl. 2,
42-49

26
Accuracy Table
27
Expression Data Rule
If in the micro-array experiment (sorbitol
incubation) the ORF expression is > -0.25 and in
the micro-array experiment (nitrogen depletion)
the ORF expression is < -1.29 and in the
micro-array experiment (YPD stationary phase) the
ORF expression is > -1.06 then the function of
this ORF is "pheromone response, mating type
determination, sex-specific proteins".
Accuracy on training data: 11/12 (92%). Accuracy
on the test data: 3/4 (75%). 21 predictions made.
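The rule above can be written as a small predicate, together with a helper that measures its accuracy on labelled ORFs. The dictionary keys and function names are hypothetical; only the three thresholds come from the slide.

```python
from typing import Callable, Dict, List, Optional, Tuple

Expression = Dict[str, float]  # experiment name -> expression value


def pheromone_rule(expr: Expression) -> bool:
    """True when the ORF satisfies all three conditions of the slide's rule."""
    return (
        expr["sorbitol incubation"] > -0.25
        and expr["nitrogen depletion"] < -1.29
        and expr["YPD stationary phase"] > -1.06
    )


def rule_accuracy(rule: Callable[[Expression], bool],
                  examples: List[Tuple[Expression, bool]]) -> Optional[float]:
    """Fraction of ORFs matched by the rule that really have the class.

    Returns None when the rule matches no example.
    """
    matched = [has_class for expr, has_class in examples if rule(expr)]
    if not matched:
        return None
    return sum(matched) / len(matched)


if __name__ == "__main__":
    orf = {"sorbitol incubation": 0.1,
           "nitrogen depletion": -1.5,
           "YPD stationary phase": -0.5}
    print(pheromone_rule(orf))
```

Accuracy here is counted only over the ORFs the rule fires on, which is how figures such as 11/12 and 3/4 arise.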
28
Structure Rule

80% accurate on test data
Most matching ORFs belong to the Mitochondrial
Carrier Family
These have 6 long transmembrane alpha-helices of
about 20-30 amino acids
Why do we notice alpha-helices of length 10-14?

29
Alignment
YJL133W  -------NEYNPLIHCLC----GSISGSTCAAITTPLDCIKTVLQIRG------------  251
YKR052C  -------NSYNPLIHCLC----GGISGATCAALTTPLDCIKTVLQVRG------------  241
YIL006W  ----NNTNSINLQRLIMA----SSVSKMIASAVTYPHEILRTRMQLKS------------  310
YBR104W  ----LTRNEIPPWKLCLF----GAFSGTMLWLTVYPLDVVKSIIQNDD------------  271
YGR096W  ----KTTAAHKKWELATLNHSAGTIGGVIAKIITFPLETIRRRMQFMNSKHLEK------  250
YJR095W  -----QMDVLPSWETSCI----GLISGAIGPFSNAPLDTIKTRLQKDK------------  246
YKL120W  -----LMKDGPALHLTAS-----TISGLGVAVVMNPWDVILTRIYNQK------------  261
YLR348C  -----FDASKNYTHLTAS-----LLAGLVATTVCSPADVMKTRIMNGS------------  239
YMR166C  ----DGRDGELSIPNEILT---GACAGGLAGIITTPMDVVKTRVQTQQPPSQSNKSYSVT  300
YDL198C  ------DYSQATWSQNFIS---SIVGACSSLIVSAPLDVIKTRIQNRN------------  242
YGR257C  ----RFASKDANWVHFINSFASGCISGMIAAICTHPFDVGKTRWQISMMN----------  302
YDL119C  FIHYNPEGGFTTYTSTTVNTTSAVLSASLATTVTAPFDTIKTRMQLEP------------  255

YJL133W  -SQTVSLEIMRKADTFSKAASAIYQVYGWKGFWRGWKPRIVANMPATAISWTAYECAKHF  310
YKR052C  -SETVSIEIMKDANTFGRASRAILEVHGWKGFWRGLKPRIVANIPATAISWTAYECAKHF  300
YIL006W  -DIPDSIQRR-----LFPLIKATYAQEGLKGFYSGFTTNLVRTIPASAITLVSFEYFRNR  364
YBR104W  -LRKPKYKNS-----ISYVAKTIYAKEGIRAFFKGFGPTMVRSAPVNGATFLTFELVMRF  325
YGR096W  FSRHSSVYGSYKGYGFARIGLQILKQEGVSSLYRGILVALSKTIPTTFVSFWGYETAIHY  310
YJR095W  ---SISLEKQSGMKKIITIGAQLLKEEGFRALYKGITPRVMRVAPGQAVTFTVYEYVREH  303
YKL120W  ----GDLYKG-----PIDCLVKTVRIEGVTALYKGFAAQVFRIAPHTIMCLTFMEQTMKL  312
YLR348C  ----GDHQP------ALKILADAVRKEGPSFMFRGWLPSFTRLGPFTMLIFFAIEQLKKH  289
YMR166C  HPHVTNGRPAALSNSISLSLRTVYQSEGVLGFFSGVGPRFVWTSVQSSIMLLLYQMTLRG  360
YDL198C  ---FDNPESG------LRIVKNTLKNEGVTAFFKGLTPKLLTTGPKLVFSFALAQSLIPR  293
YGR257C  ---NSDPKGGNRSRNMFKFLETIWRTEGLAALYTGLAARVIKIRPSCAIMISSYEISKKV  359
YDL119C  ----SKFTNS------FNTFTSIVKNENVLKLFSGLSMRLARKAFSAGIAWGIYEELVKR  305
30
Alignment
YJL133W  -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------  251
YKR052C  -------cccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------  241
YIL006W  ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------  310
YBR104W  ----ccccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaacc------------  271
YGR096W  ----cccccccccccccbaaaaaaaaaaaaaaacccaaaaaaaaaacccccccc------  250
YJR095W  -----cccccccaaaaaa----aaaaaaaaaaacccaaaaaaaaaccc------------  246
YKL120W  -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------  261
YLR348C  -----ccccccaaaaaaa-----aaaaaaaaaacccaaaaaaaaaacc------------  239
YMR166C  ----cccccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacccccccccccccc  300
YDL198C  ------cccccccaaaaaa---aaaaaaaaaaacccaaaaaaaaaacc------------  242
YGR257C  ----ccccccccccccaaaaaaaaaaaaaaaaacccaaaaaaaaaacccc----------  302
YDL119C  ccccccccccccccaaaaaaaaaaaaaaaaaaacccaaaaaaaaaacc------------  255

YJL133W  -ccccccccccccccaaaaaaaaaaaccccaaaaccaaaaaaacaaaaaaaaaaaaaaaa  310
YKR052C  -ccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa  300
YIL006W  -ccccccccc-----aaaaaaaaaaaccccaaacccaaaaaaaccaaaaaaaaaaaaaaa  364
YBR104W  -ccccccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa  325
YGR096W  cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa  310
YJR095W  ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa  303
YKL120W  ----cccccc-----aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa  312
YLR348C  ----ccccc------aaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa  289
YMR166C  cccccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa  360
YDL198C  ---cccccca------aaaaaaaaaacccaaaaacccaaaaaaaaaaaaaaaaaaaaaaa  293
YGR257C  ---ccccccccccccaaaaaaaaaaacccaaaaaccaaaaaaaccaaaaaaaaaaaaaaa  359
YDL119C  ----ccccca------aaaaaaaaaacccaaaaacccaaaaaaccaaaaaaaaaaaaaaa  305
31
Homology rule

This rule is 100% accurate on test data
Almost all matching ORFs are from the 20S
proteasome subunit for degradation of proteins
These subunits exist in archaea and eukaryotes,
but only in one specific branch of bacteria
(actinomycetes).

33
Application of DMP to Bacterial Genomes

Successful for both M. tuberculosis and E. coli.
Of the ORFs with no assigned function, >40% were
predicted to have a function at one or more
levels of the class hierarchy.
It was found that many of the predictive rules
were more general than is possible using sequence
homology.
References
King et al. (2000) KDD 2000
King et al. (2000) Yeast (Comparative and
Functional Genomics)
King et al. (2001) Bioinformatics

34
Example Rule (level 2 E. coli)
If the ORF is not predicted to have a β-strand of
length ? 3 and a homologous protein from class
Chytridiomycetes was found, then its functional
class is "Cell processes, Transport/binding
proteins". 12/13 (86%) correct on the test set;
the probability of this result occurring by chance
is estimated at 4 x 10^-7. 24 ORFs of unknown
function are predicted by the rule.
16 ORFs now with putative or confirmed function:
93.8% accurate predictions.
35
Experimental Confirmation

The original bacterial ORF predictions were made
over three years ago.
In the intervening time many more ORFs have been
sequenced, making traditional homologous
prediction methods more accurate and sensitive,
and the function of some ORFs has been
determined by wet biology.
The E. coli genome has been re-annotated by
Monica Riley's group.

36
Wet Biology Confirmation

A number of predictions have been confirmed or
falsified by new wet experimental data.
This new data is biased towards hard classes.
Despite this the results are still good:
Level 2: 23 predictions, 47.8% accuracy
Level 3: 23 predictions, 43.4% accuracy

This is very much better than random as there are
many classes.
37
Confirmation of Wet Predictions
38
Extension to Arabidopsis Genome

Collaborative project with the Institute of
Grassland and Environmental Research and the
University of Nottingham.
Large increase in data: 6,000 (yeast) -> 25,000
ORFs.
Large amount of micro-array data from the
Nottingham Arabidopsis Stock Centre.
The increase in data is a challenge to our
machine learning algorithms (100s of MB).
Clare, A., Karwath, A., Ougham, H. and
King, R.D. (2006) Functional Bioinformatics for
Arabidopsis thaliana. Bioinformatics 22,
1130-1136.

39
Results

Accuracy comparable to yeast and bacteria
Large fraction of genes of currently unknown
function are predicted.
Some rules could be interpreted in terms of known
biology
Clare, A., Karwath, A., Ougham, H. and King, R.D.
(2006) Functional Bioinformatics for Arabidopsis
thaliana. Bioinformatics 22, 1130-1136.

40
Gibberellin Biosynthesis Prediction

Gibberellin is an important plant hormone.
Chosen because of interesting phenotypes, often
extreme size.
Insertion of a promoter to overproduce gene
product.
Result
2 days earlier flowering
Average leaf number and weight increased at 21
days.
This phenotype is consistent with prediction.

42
Leaf number increases more rapidly in the mutant
(yellow bars) than in wildtype Landsberg erecta
(blue bars)
43
Paclobutrazol (P) (inhibitor of gibberellin)
abolishes the difference between mutant (M) and
wildtype (L). C: control
44
Availability
All predictions available at
http://www.genepredictions.org
All rules and data available at
http://www.aber.ac.uk/compsci/Research/bio/dss/
45
ILP 2005 Challenge 1

Yeast function prediction data used as a
community challenge: http://www.protein-logic.com/
The intention of the challenge was to provide a
real-world data set to test how far we have
progressed in the field of ILP and
multi-relational data mining. The questions we
wanted to answer were: Are the tools up to the
job? Do they scale? Do they handle noisy, sparse
and complex data?

46
ILP 2005 Challenge 2

A. J. Knobbe, E. K. Y. Ho, R. Malik. ILP
Challenge 2005: The Safarii MRDM environment.
C. Perlich. Approaching the ILP 2005 challenge:
Class-Conditional Bayesian Propositionalization
for Genetic Classification.
J. Struyf, C. Vens, T. Croonenborghs, S. Dzeroski,
H. Blockeel. Applying Predictive Clustering Trees
to the Inductive Logic Programming 2005 Challenge
Data.
F. Riguzzi. A Simple Approach to a Multi-Label
Classification Problem.

47
Propositional Approach

Zafer Barutcuoglu, Robert E. Schapire and Olga G.
Troyanskaya. Hierarchical multi-label prediction
of gene function. Bioinformatics (in press)
Hierarchy of SVMs.
Uses a Bayesian net to combine predictions.

48
Conclusions

Data mining and machine learning are powerful
tools for functional genomics.
The DMP method can be successfully applied to
different genomes (bacterial, yeast, Arabidopsis)
to predict gene functional class.
Micro-array data is a useful component in DMP.
Biological insight can be extracted from DMP
rules.
The structure of gene prediction problems makes
them an exciting test bed for machine learning
methods.

49
Acknowledgements

Amanda Clare Aberystwyth
Andreas Karwath Freiburg (Aberystwyth)
Luc Dehaspe PharmaDM
Helen Ougham IGER
BBSRC

50
The Need for Logic to Represent Scientific
Knowledge

Logic is the best understood way to represent
knowledge.
Traditional statistics, machine learning, and
data mining are based on propositional logic.
For some problems we require a richer description
language, i.e. first-order predicate calculus.
Using logic programming (predicate calculus) we
can incorporate deduction, abduction, and
induction.

51
Inductive Logic Programming

Inductive Logic Programming (ILP) learns using
logic programs (first-order predicate calculus)
to describe examples, theories, and
background knowledge.
For certain types of problem ILP is a powerful
data analysis technique, giving more accurate and
more comprehensible results than conventional
methods.
It has been successfully applied to a number of
biological/chemical problems.

52
ILP for Science

The key advantage of ILP for scientific
applications is that it allows the application of
compact relational representations that are
natural for scientists to use. This allows
domain understandable rules to be automatically
formed.
This advantage comes at a computational cost.
However, non-technical reasons are probably the
greatest barrier to adoption of ILP. For
example, it is very difficult to explain the
benefits of ILP to domain experts.

53
Prediction of Lethality

Instead of using microarray data to predict
the functional class of a gene, we have been using
the same approach to predict whether a gene
knock-out will be lethal (grown in a rich medium).

If "the function of the ORF is cell cycle" is
false and "the function of the ORF is rRNA
transcription" is true and in the micro-array
experiment (cell cycle) the ORF expression is
> -0.79 then the knockout is lethal.
Example rule. Test accuracy: 82% (default: 21%).
54
Summary Results