Biological Data Mining - PowerPoint PPT Presentation


Title: Biological Data Mining


1
Biological Data Mining
  • A comparison of Neural Network and Symbolic
    Techniques
  • http://www.cmd.port.ac.uk/biomine/

2
1. Objectives
  • The project aims:
  • to develop and validate techniques for extracting
    explicit information from bioinformatic data
  • to express this information as logical rules and
    decision trees
  • to apply these new procedures to a range of
    scientific problems related to bioinformatics and
    cheminformatics

3
2. Extracting information
  • Artificial neural networks can be trained to
    reproduce the non-linear relationships underlying
    bioinformatic data with good predictive accuracy
  • but it is often hard to comprehend those
    relationships from the internal structure of the
    network
  • with the result that networks are often regarded
    as black boxes.
  • Decision trees using symbolic rules are easier to
    interpret
  • leading to a greater likelihood of understanding
    the relationships in the data
  • allowing the behaviour of individual cases to be
    explained.

4
3. Extracting Decision Trees
  • The Trepan procedure (Craven, 1996) extracts
    decision trees from a neural network and a set of
    training cases by recursively partitioning the
    input space.
  • The decision tree is built in a best-first
    manner, expanding the tree at nodes where there
    is greatest potential for increasing the fidelity
    of the tree to the network.
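
The best-first ordering can be sketched as a priority queue over the current leaves. This is a minimal illustration, not the project's implementation: the score reach × (1 − fidelity) is an assumed measure of "potential for increasing fidelity", and the leaf names and numbers are hypothetical.

```python
import heapq

def node_priority(reach, fidelity):
    """Assumed score: fraction of instances reaching the node times the
    fraction of those on which the tree currently disagrees with the network."""
    return reach * (1.0 - fidelity)

def best_first_order(leaves):
    """Return leaf names in the order a best-first search would expand them.

    `leaves` maps a hypothetical leaf name to a (reach, fidelity) pair.
    """
    # heapq is a min-heap, so store the negated priority.
    heap = [(-node_priority(r, f), name) for name, (r, f) in leaves.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order

# Hypothetical leaves: (reach, fidelity to the network)
example_leaves = {"A": (0.5, 0.9), "B": (0.3, 0.6), "C": (0.2, 1.0)}
expansion_order = best_first_order(example_leaves)
```

In a full extraction loop the children created by each expansion would be pushed back onto the same queue; that step is omitted here.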

5
4. Splitting Tests
  • The splitting tests at the nodes are m-of-n
    expressions, e.g. 2-of-{x1, x2, x3}, where the
    xi are Boolean conditions; a test is satisfied
    when at least m of its n conditions hold.
  • Start with a set of candidate tests:
  • binary tests on each value for nominal features
  • binary tests on thresholds for real-valued
    features
  • Use a beam search with a beam width of two.
  • Initialize the beam with the candidate test that
    maximizes the information gain.
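
As a rough illustration of how an m-of-n test is evaluated and scored, the sketch below computes the information gain of a binary split. The function names and the toy data (all eight settings of three Boolean features) are assumptions made for the example, not part of the presentation.

```python
from itertools import product
from math import log2

def m_of_n(m, conditions, x):
    """True if at least m of the Boolean conditions hold for instance x."""
    return sum(1 for cond in conditions if cond(x)) >= m

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * log2(p)
    return result

def information_gain(instances, labels, test):
    """Reduction in label entropy obtained by splitting on a binary test."""
    passed = [y for x, y in zip(instances, labels) if test(x)]
    failed = [y for x, y in zip(instances, labels) if not test(x)]
    n = len(labels)
    remainder = (len(passed) / n) * entropy(passed) + (len(failed) / n) * entropy(failed)
    return entropy(labels) - remainder

# Toy data: the class is 1 exactly when at least two features are set.
instances = list(product([0, 1], repeat=3))
labels = [int(sum(x) >= 2) for x in instances]
conditions = [lambda x, j=j: x[j] == 1 for j in range(3)]

gain_single = information_gain(instances, labels, conditions[0])
gain_2_of_3 = information_gain(instances, labels, lambda x: m_of_n(2, conditions, x))
```

On this toy data the single-feature test gives only a partial gain, while 2-of-{x1, x2, x3} separates the classes completely.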

6
5. Splitting Tests (II)
  • To each m-of-n test in the beam and each
    candidate test, apply two operators:
  • m-of-n+1, e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3}
  • m+1-of-n+1, e.g. 2-of-{x1, x2} => 3-of-{x1, x2,
    x3}
  • Admit new tests to the beam if they increase the
    information gain and are significantly different
    (chi-squared) from existing tests.
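
The two operators can be sketched as follows. The representation of a test as a pair (m, set of condition names) is an assumption made for illustration; the admission checks (information gain and the chi-squared comparison) are not shown.

```python
def add_condition(test, cond):
    """m-of-n+1 operator: add a condition, keep the threshold m."""
    m, conds = test
    return (m, conds | {cond})

def add_condition_and_raise(test, cond):
    """m+1-of-n+1 operator: add a condition and raise the threshold by one."""
    m, conds = test
    return (m + 1, conds | {cond})

def expand(test, candidates):
    """Apply both operators with every candidate not already in the test."""
    _, conds = test
    new_tests = []
    for cond in candidates:
        if cond in conds:
            continue  # adding an existing condition would not change n
        new_tests.append(add_condition(test, cond))
        new_tests.append(add_condition_and_raise(test, cond))
    return new_tests

# 2-of-{x1, x2} expanded with the candidate condition x3
start = (2, frozenset({"x1", "x2"}))
expanded = expand(start, ["x1", "x3"])
```

Here `expanded` holds 2-of-{x1, x2, x3} and 3-of-{x1, x2, x3}, matching the two examples on the slide.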

7
6. Example: Substance P Binding to NK1 Receptors
  • Substance P is a neuropeptide with the sequence:
  • H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2
  • Wang et al. used the multipin technique to
    synthesize 512 (= 2^9) stereoisomers generated by
    systematic replacement of L- by D-amino acids at
    9 positions.
  • The aim was to measure binding potencies to NK1
    receptors and identify the positions at which
    stereochemistry affects binding strength.
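
The size of the data set follows from two choices (L or D) at each of 9 positions. A minimal sketch of the enumeration, with a 0/1 encoding that is assumed here as one possible network input format:

```python
from itertools import product

N_POSITIONS = 9  # positions at which L- may be replaced by D-amino acids

# Every combination of L/D over the 9 positions
stereoisomers = list(product("LD", repeat=N_POSITIONS))

def encode(config):
    """Assumed encoding: 0 for an L-residue, 1 for a D-residue."""
    return tuple(int(c == "D") for c in config)

encoded = [encode(s) for s in stereoisomers]
```

The first entry is the all-L configuration and the last is the all-D one.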

8
7. Application of Trepan
  • A series of networks with 9:9:1 architectures
    were trained using 90% of the data as a training
    set.
  • For each network a decision tree was grown using
    Trepan.
  • The trees showed high fidelity with the networks
    on a 10% test set.
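
Fidelity here means agreement between the tree's and the network's predictions on the held-out cases. The sketch below shows a 90/10 split and the fidelity calculation; the shuffling, the seed and the example predictions are assumptions for illustration only.

```python
import random

def split_90_10(cases, seed=0):
    """Shuffle the cases and return (training set, test set) in a 90/10 ratio."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    cut = round(0.9 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def fidelity(tree_predictions, network_predictions):
    """Fraction of cases on which the tree agrees with the network."""
    agree = sum(t == n for t, n in zip(tree_predictions, network_predictions))
    return agree / len(tree_predictions)

train, test = split_90_10(range(512))

# Hypothetical predictions on four test cases
example_fidelity = fidelity([1, 0, 1, 1], [1, 0, 0, 1])
```

Note that fidelity compares the tree with the network, not with the observed activities; accuracy against the observations is a separate measure.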

9
8. Results
  • Binding activity was determined by five
    positions, viz.
  • H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2
  • The positions identified agree with the FIRM
    (Formal Inference-based Recursive Modelling)
    analysis of Young and Hawkins
  • Young S. & Hawkins D.M. (2000) Analysis of a
    large, high-throughput screening data using
    recursive partitioning. Molecular Modelling &
    Prediction of Bioactivity (eds. Gundertofte &
    Jørgensen).

10
9. A Typical Trepan Tree
11
10. Test set confusion matrix: tree versus network
12
11. Test set confusion matrix: tree versus observed
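
Both comparisons (tree versus network, tree versus observed) tabulate pairs of class labels. A minimal sketch of such a tabulation, using hypothetical class names and predictions rather than the project's results:

```python
from collections import Counter

def confusion_matrix(row_labels, col_labels, classes):
    """Count label pairs; entry [i][j] is the number of cases with
    row class classes[i] and column class classes[j]."""
    counts = Counter(zip(row_labels, col_labels))
    return [[counts[(r, c)] for c in classes] for r in classes]

# Hypothetical predictions for four test cases
tree_pred = ["high", "low", "high", "low"]
net_pred = ["high", "low", "low", "low"]
matrix = confusion_matrix(tree_pred, net_pred, ["high", "low"])
```

Replacing the network predictions with the observed classes gives the second kind of matrix.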
13
12. Future Work
  • Complete the implementation of the Trepan
    algorithm.
  • Model the distribution of the input data and
    generate a set of query instances to be
    classified by the network and used as additional
    training cases during tree extraction.
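
One simple way to generate query instances can be sketched as below. It assumes binary features modelled by independent marginal frequencies, which is a simplification chosen for the example; the `oracle` argument stands in for the trained network and is hypothetical.

```python
import random

def fit_marginals(instances):
    """Estimate, for each binary feature, the fraction of instances where it is 1."""
    n = len(instances)
    n_features = len(instances[0])
    return [sum(x[j] for x in instances) / n for j in range(n_features)]

def generate_queries(marginals, oracle, count, seed=0):
    """Sample instances feature by feature and label each with the oracle."""
    rng = random.Random(seed)
    queries = []
    for _ in range(count):
        x = tuple(int(rng.random() < p) for p in marginals)
        queries.append((x, oracle(x)))
    return queries

# Hypothetical training inputs and a stand-in for the network
training_inputs = [(1, 0), (1, 1)]
marginals = fit_marginals(training_inputs)

def toy_oracle(x):
    return int(x[0] == 1 and x[1] == 1)

queries = generate_queries(marginals, toy_oracle, count=20)
```

The labelled queries can then be pooled with the real training cases that reach a node, so that split selection is based on more data.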
  • Extend the algorithm to enable the extraction of
    regression trees.
  • Provide a Bayesian formulation for the decision
    tree extraction algorithm.

14
13. Future Applications
  • Apply Trepan to ligand-receptor binding problems.
  • Compare the performance of these algorithms with
    existing symbolic data mining techniques
    (ID3/C5).

15
14. References
  • Wang J-X et al. (1993) Study of
    stereo-requirements of substance P binding to NK1
    receptors using analogues with systematic D-amino
    acid replacements. Bioorganic & Medicinal
    Chemistry Letters, 3, 451-456.
  • Young S. & Hawkins D.M. (2000) Analysis of a
    large, high-throughput screening data using
    recursive partitioning. Molecular Modelling &
    Prediction of Bioactivity (eds. Gundertofte &
    Jørgensen).

16
Grantholder
  • Professor Martyn Ford
  • Centre for Molecular Design
  • University of Portsmouth
  • martyn.ford_at_port.ac.uk

Research Fellows
  • Dr Shuang Cang, Mar - Sept 2000
  • Dr Abul Azad, Jan 2001 -
17
Collaborators
  • Dr Antony Browne
    School of Computing, Information
    Systems and Mathematics, London Guildhall
    University. abrowne_at_lgu.ac.uk
  • Professor Philip Picton
    School of Technology and
    Design, University College Northampton.
    phil.picton_at_northampton.ac.uk
  • Dr David Whitley
    Centre for Molecular Design,
    University of Portsmouth.
    david.whitley_at_port.ac.uk