Biological Data Mining - PowerPoint PPT Presentation

1
Biological Data Mining
  • A comparison of Neural Network and Symbolic
    Techniques
  • http://www.cmd.port.ac.uk/biomine/

2
Grantholder
  • Professor Martyn Ford
  • Centre for Molecular Design
  • University of Portsmouth
  • martyn.ford_at_port.ac.uk

3
Collaborators
  • Dr Anthony Browne
    School of Computing, Information
    Systems and Mathematics, London Guildhall
    University. abrowne_at_lgu.ac.uk
  • Professor Philip Picton
    School of Technology and
    Design, University College Northampton.
    phil.picton_at_northampton.ac.uk
  • Dr David Whitley
    Centre for Molecular Design,
    University of Portsmouth.
    david.whitley_at_port.ac.uk

4
Objectives
  • The project aims:
  • to develop and validate techniques for extracting
    explicit information from bioinformatic data
  • to express this information as logical rules and
    decision trees
  • to apply these new procedures to a range of
    scientific problems related to bioinformatics and
    cheminformatics

5
Extracting information
  • Artificial neural networks (ANNs) can be used to
    identify the non-linear relationships that
    underlie bioinformatic data, but . . .
  • trained ANNs do not lead to a concise and
    explicit model
  • specifying the underlying structure is therefore
    difficult
  • as a result, ANNs are often regarded as black
    boxes

6
Data Mining and Neural Networks
  • Standard data mining algorithms exist (such as
    ID3 or C5), so why use an ANN? It would be
    advantageous if the rules extracted from an ANN:
  • give a better fit to the data with the same
    number of rules (i.e. explain the data more
    accurately),
  • give the same fit to the data with fewer rules
    (i.e. explain the data more comprehensibly), or
  • give both a better fit to the data and use fewer
    rules (i.e. explain the data more comprehensibly
    and more accurately).

7
Extracting Decision Trees
  • The TREPAN procedure (Craven, 1996):
  • extracts decision trees from ANNs
  • performs better than the symbolic learning
    algorithms ID3 and C5
  • the current implementation is restricted to a
    particular network architecture, but
  • the underlying algorithm is independent of
    network architecture

8
Trepan
  • Builds a decision tree representing the function
    the ANN has learnt by recursively partitioning
    the input space.
  • Draws query instances by taking into account the
    distribution of instances in the problem domain.
  • For real-valued features, uses kernel density
    estimates to generate a model of the underlying
    data that is used to select instances for
    presentation to the network.
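
The query-drawing step above can be sketched roughly as follows. This is a minimal illustration, not the TREPAN implementation: the function names, the rule-of-thumb bandwidth, and the choice to model each feature independently are all assumptions made for the example, and it ignores the constraints that a real tree node would place on the sampled instances. The `oracle` argument stands in for the trained network.

```python
import random
import statistics


def sample_from_kde(values, n_samples, bandwidth=None, rng=None):
    """Draw samples from a Gaussian kernel density estimate of `values`."""
    rng = rng or random.Random()
    if bandwidth is None:
        # Rule-of-thumb bandwidth (an assumption, not taken from the slides).
        bandwidth = 1.06 * statistics.stdev(values) * len(values) ** (-1 / 5)
    # Sampling from a Gaussian KDE: pick a data point, then add kernel noise.
    return [rng.gauss(rng.choice(values), bandwidth) for _ in range(n_samples)]


def draw_query_instances(training_rows, n_samples, oracle, rng=None):
    """Generate synthetic instances and label them by querying the network.

    `oracle` is a hypothetical callable wrapping the trained ANN.
    Each feature is modelled independently (a simplifying assumption).
    """
    rng = rng or random.Random()
    n_features = len(training_rows[0])
    columns = [[row[j] for row in training_rows] for j in range(n_features)]
    sampled = [sample_from_kde(col, n_samples, rng=rng) for col in columns]
    instances = [tuple(col[i] for col in sampled) for i in range(n_samples)]
    return [(x, oracle(x)) for x in instances]
```

The labelled pairs returned by `draw_query_instances` are what the tree-building step would consume in place of (or alongside) the original training data.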

9
Trepan
  • Builds the decision tree in a best-first manner
  • as each node is added, the fidelity of the
    decision tree to the ANN is maximised
  • this is done by examining the significance of the
    distributions at consecutive levels of the tree
    (Kolmogorov-Smirnov test for real-valued
    features, chi-squared for discrete ones)
  • Allows the user to control the size of the final
    tree by selecting appropriate stopping criteria.
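
A rough sketch of best-first growth driven by fidelity is given below. It is an illustration under stated assumptions rather than the TREPAN code: it uses simple single-feature threshold splits chosen by Gini impurity, ranks leaves by (fraction of instances reaching the leaf) × (1 − local fidelity), and stops at a hypothetical `max_internal_nodes` limit standing in for the user-selected stopping criterion. The significance tests mentioned above are omitted. `data` holds (features, network_label) pairs, so agreement with the labels is fidelity to the network.

```python
import heapq
from collections import Counter
from itertools import count


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def errors(labels):
    """Instances a majority-label leaf would label differently from the ANN."""
    return len(labels) - Counter(labels).most_common(1)[0][1]


def gini_mass(labels):
    """len(labels) * Gini impurity; lower means purer."""
    n = len(labels)
    return n - sum(c * c for c in Counter(labels).values()) / n


def best_split(data):
    """Return the (feature, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini_mass([y for _, y in data])
    for j in range(len(data[0][0])):
        values = sorted({x[j] for x, _ in data})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for x, y in data if x[j] <= t]
            right = [y for x, y in data if x[j] > t]
            score = gini_mass(left) + gini_mass(right)
            if score < best_score - 1e-12:
                best, best_score = (j, t), score
    return best


def extract_tree(data, max_internal_nodes):
    """Grow a tree best-first from (features, network_label) pairs."""
    total = len(data)
    tie = count()  # tie-breaker so the heap never compares node dicts
    root = {"data": data, "label": majority([y for _, y in data])}
    # Priority = reach * (1 - local fidelity) = errors / total (assumed form).
    heap = [(-errors([y for _, y in data]) / total, next(tie), root)]
    internal = 0
    while heap and internal < max_internal_nodes:
        neg_priority, _, node = heapq.heappop(heap)
        if neg_priority == 0:
            break  # every remaining leaf already agrees with the network
        split = best_split(node["data"])
        if split is None:
            continue
        j, t = split
        node["split"], node["children"] = split, []
        for keep in (lambda v: v <= t, lambda v: v > t):
            subset = [(x, y) for x, y in node["data"] if keep(x[j])]
            labels = [y for _, y in subset]
            child = {"data": subset, "label": majority(labels)}
            node["children"].append(child)
            heapq.heappush(heap, (-errors(labels) / total, next(tie), child))
        internal += 1
    return root


def predict(node, x):
    while "children" in node:
        j, t = node["split"]
        node = node["children"][0 if x[j] <= t else 1]
    return node["label"]


def fidelity(tree, data):
    """Fraction of instances on which the tree agrees with the network."""
    return sum(predict(tree, x) == y for x, y in data) / len(data)
```

Raising `max_internal_nodes` trades a larger tree for higher fidelity, which mirrors the size control described on the slide.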

10
Aims
  • Implement the TREPAN algorithm in a portable
    format, independent of network architecture.
  • Extend the algorithm to enable the extraction of
    regression trees.
  • Provide a Bayesian formulation for the decision
    tree extraction algorithm.
  • Compare the performance of these algorithms with
    existing symbolic data mining techniques (ID3/C5).

11
Aims
  • Apply the extracted decision trees
    • to searches of bioinformatic databases
      • protein databases
      • genomic databases
    • to searches of cheminformatic databases
      • chemical libraries
      • natural product databases
    • to investigate ligand/receptor binding
    • to quantify molecular similarity/diversity
    • to identify new leads and optimise properties

12
Case study: ligand interaction with GPCRs
  • 28 GPCRs
  • a number of putative interaction sites
  • 3 principal properties of amino acids (AAs)
  • MLR results for 2 ligands