Biological Data Mining - PowerPoint PPT Presentation

1 / 17

About This Presentation

Title:

Biological Data Mining

Description:

to develop and validate techniques for extracting explicit ... Dr Shuang Cang Mar - Sept 2000. Dr Abul Azad Jan 2001 - Research Fellows. Collaborators ... – PowerPoint PPT presentation

Number of Views:185

Avg rating:3.0/5.0

Slides: 18

Provided by: Salt86

Category:

more less

Transcript and Presenter's Notes

Title: Biological Data Mining

1
Biological Data Mining

A comparison of Neural Network and Symbolic
Techniques
http//www.cmd.port.ac.uk/biomine/

2
1. Objectives

The project aims
to develop and validate techniques for extracting
explicit information from bioinformatic data
to express this information as logical rules and
decision trees
to apply these new procedures to a range of
scientific problems related to bioinformatics and
cheminformatics

3
2. Extracting information

Artificial neural networks can be trained to
reproduce the non-linear relationships underlying
bioinformatic data with good predictive accuracy
but it is often hard to comprehend those
relationships from the internal structure of the
network
with the result that networks are often regarded
as black boxes.
Decision trees using symbolic rules are easier to
interpret
leading to a greater likelihood of understanding
the relationships in the data
allowing the behaviour of individual cases to be
explained.

4
3. Extracting Decision Trees

The Trepan procedure (Craven,1996) extracts
decision trees from a neural network and a set of
training cases by recursively partitioning the
input space.
The decision tree is built in a best-first
manner, expanding the tree at nodes where there
is greatest potential for increasing the fidelity
of the tree to the network.

5
4. Splitting Tests

The splitting tests at the nodes are m-of-n
expressions, e.g. 2-of-x1, x2, x3, where the
xi are Boolean conditions.
Start with a set of candidate tests
binary tests on each value for nominal features
binary tests on thresholds for real-valued
features
Use a beam search with a beam width of two.
Initialize the beam with the candidate test that
maximizes the information gain.

6
5. Splitting Tests (II)

To each m-of-n test in the beam and each
candidate test, apply two operators
m-of-n1 e.g. 2-of-x1, x2 gt 2-of-x1, x2, x3
m1-of-n1 e.g. 2-of-x1, x2 gt 3-of-x1, x2,
x3
Admit new tests to the beam if they increase the
information gain and are significantly different
(chi-squared) from existing tests.

7
6. Example Substance P Binding to NK1 Receptors

Substance P is a neuropeptide with the sequence
H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2
Wang et al. used the multipin technique to
synthesize 512 29 stereoisomers generated by
systematic replacement of L- by D-amino acids at
9 positions
The aim was to measure binding potencies to NK1
receptors identify the positions at which
stereo-chemistry affects binding strength.

8
7. Application of Trepan

A series of networks with 991 architectures
were trained using 90 of the data as a training
set.
For each network a decision tree was grown using
Trepan.
The trees showed high fidelity with the networks
on a 10 test set.

9
8. Results

Binding activity was determined by five
positions, viz.
H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2
The positions identified agree with the FIRM
(Formal Inference-based Recursive Modelling)
analysis of Young and Hawkins
Young S Hawkins D.M. (2000) Analysis of a
large, high-throughput screening data using
recursive partitioning. Molecular Modelling
Prediction of Bioactivity (ed. Gundertofte
JØrgensen).

10
9. A Typical Trepan Tree
11
10. Test set confusion matrix tree versus
network
12
11. Test set confusion matrix tree versus
observed
13
12. Future Work

Complete the implementation of the Trepan
algorithm.
model the distribution of the input data and
generate a set of query instances to be
classified by the network used as additional
training cases during tree extraction.
Extend the algorithm to enable the extraction of
regression trees.
Provide a Bayesian formulation for the decision
tree extraction algorithm.

14
13. Future Applications

Apply Trepan to ligand-receptor binding problems.
compare the performance of these algorithms with
existing symbolic data mining techniques
(ID3/C5).

15
14. References

Wang J-X et al. (1993) Study of
stereo-requirements of substance P binding to NK1
receptors using analogues with systematic D-amino
acid replacements. Biorganic Medicinal
Chemistry Letters, 3, 451-456.
Young S Hawkins D.M. (2000) Analysis of a
large, high-throughput screening data using
recursive partitioning. Molecular Modelling
Prediction of Bioactivity (ed. Gundertofte
JØrgensen).

16
Grantholder

Professor Martyn Ford
Centre for Molecular Design
University of Portsmouth
martyn.ford_at_port.ac.uk

Research Fellows
Dr Shuang Cang Mar - Sept 2000 Dr Abul Azad
Jan 2001 -
17
Collaborators

Dr Antony Browne
School of Computing, Information
Systems and Mathematics, London Guildhall
University. abrowne_at_lgu.ac.uk
Professor Philip Picton
School of Technology and
Design, University College Northampton.
phil.picton_at_northampton.ac.uk
Dr David Whitley
Centre for Molecular Design,
University of Portsmouth.
david.whitley_at_port.ac.uk