Data Mining in Bioinformatics - PowerPoint PPT Presentation

Provided by: lisag5

1
Data Mining in Bioinformatics
2
Outline
  • Introduction
  • Interdisciplinary Problem Statement
  • Microarray Problem Overview
  • Microarray Data Processing
  • Image Analysis and Data Mining
  • Prior Knowledge
  • Data Mining Methods
  • Database and Optimization Techniques
  • Visualization
  • Validation
  • Summary

3
Introduction: Recommended Literature
  • 1. Bioinformatics: The Machine Learning Approach
    by P. Baldi and S. Brunak, 2nd edition, The MIT
    Press, 2001
  • 2. Data Mining: Concepts and Techniques by J.
    Han and M. Kamber, Morgan Kaufmann Publishers, 2001
  • 3. Pattern Classification by R. Duda, P. Hart and
    D. Stork, 2nd edition, John Wiley & Sons, 2001

4
Bioinformatics, Computational Biology, Data Mining
  • Bioinformatics is an interdisciplinary field
    about the information processing problems in
    computational biology and a unified treatment of
    the data mining methods for solving these
    problems.
  • Computational Biology is about modeling real data
    and simulating unknown data of biological
    entities, e.g.
  • Genomes (viruses, bacteria, fungi, plants,
    insects, ...)
  • Proteins and Proteomes
  • Biological Sequences
  • Molecular Function and Structure
  • Data Mining is searching for knowledge in data
  • Knowledge mining from databases
  • Knowledge extraction
  • Data/pattern analysis
  • Data dredging
  • Knowledge Discovery in Databases (KDD)

5
Introduction: Problems in Bioinformatics Domain
  • Data production at the levels of molecules,
    cells, organs, organisms, populations
  • Integration of structure and function data, gene
    expression data, pathway data, phenotypic and
    clinical data, ...
  • Prediction of Molecular Function and Structure
  • Computational biology: synthesis (simulations)
    and analysis (machine learning)

6
  • MICROARRAY PROBLEM

7
Microarray Problem: Major Objective
  • Major Objective: Discover a comprehensive theory
    of life's organization at the molecular level
  • The major actors of molecular biology: the
    nucleic acids, deoxyribonucleic acid (DNA) and
    ribonucleic acid (RNA)
  • The central dogma of molecular biology

Proteins are very complex molecules built from 20
different amino acids.
8
Input and Output of Microarray Data Analysis
  • Input: Laser image scans (data) and underlying
    experiment hypotheses or experiment designs
    (prior knowledge)
  • Output:
  • Conclusions about the input hypotheses or
    knowledge about statistical behavior of
    measurements
  • The theory of biological systems learnt
    automatically from data (machine learning
    perspective)
  • Model fitting, Inference process

9
Overview of Microarray Problem
[Diagram labels: Biology Application Domain, Experiment Design and Hypothesis, Microarray Experiment, Image Analysis, Data Analysis, Data Mining, Data Warehouse, Validation; communities: Statistics, Artificial Intelligence (AI), Knowledge discovery in databases (KDD)]
10
Statistics Community
  • Random Variables
  • Statistical Measures
  • Probability and Probability Distribution
  • Confidence Interval Estimations
  • Test of Hypotheses
  • Goodness of Fit
  • Regression and Correlation Analysis

11
Artificial Intelligence (AI) Community
  • Issues
  • Prior knowledge (e.g., invariance)
  • Model deviation from true model
  • Sampling distributions
  • Computational complexity
  • Model complexity (overfitting)

Design Cycle of Predictive Modeling: Collect Data -> Choose Features -> Choose Model -> Train Classifier -> Evaluate Classifier
12
Knowledge Discovery in Databases (KDD) Community
Database
13
Microarray Data Mining and Image Analysis Steps
  • Image Analysis
  • Normalization
  • Grid Alignment
  • Spot Quality Assurance Control
  • Feature construction (selection and extraction)
  • Data Mining
  • Prior knowledge
  • Statistics
  • Machine learning
  • Pattern recognition
  • Database techniques
  • Optimization techniques
  • Visualization
  • Validation
  • Issues
  • Cross validation techniques

14
  • MICROARRAY IMAGE ANALYSIS

15
Microarray Image Analysis
16
  • DATA MINING OF MICROARRAY DATA

17
Why Data Mining? Sequence Example
  • Biology Language and Goals
  • A gene can be defined as a region of DNA.
  • A genome is one haploid set of chromosomes with
    the genes they contain.
  • Perform competent comparison of gene sequences
    across species and account for inherently noisy
    biological sequences due to random variability
    amplified by evolution
  • Assumption: if a gene has high similarity to
    another gene, then they perform the same function
  • Analysis Language and Goals
  • A feature is an extractable attribute or
    measurement (e.g., gene expression, location)
  • Pattern recognition tries to characterize
    data patterns (e.g., similar gene expressions,
    equidistant gene locations).
  • Data mining is about uncovering patterns,
    anomalies and statistically significant
    structures in data (e.g., find two similar gene
    expressions with confidence > x)

18
Types of Expected Data Mining and Analysis Results
  • Hypothetical Examples
  • Binary answers using tests of hypotheses
  • Drug treatment is successful with a confidence
    level x.
  • Statistical behavior (probability distribution
    functions)
  • A class of genes with functionality X follows
    Poisson distribution.
  • Expected events
  • As the amount of treatment increases, the gene
    expression level decreases.
  • Relationships
  • Expression level of gene A is correlated with
    expression level of gene B under varying
    treatment conditions (gene A and B are part of
    the same pathway).
  • Decision trees
  • Classification of a new gene sequence by a
    domain expert.

19
  • PRIOR KNOWLEDGE

20
Prior Knowledge Experiment Design
  • Microarray sources of systematic and random
    errors
  • Feature selection and variability
  • Expectations and Hypotheses
  • Data cleaning and transformations
  • Data mining method selection
  • Interpretation

[Diagram labels: Collect Data, Data Cleaning and Transformations, Choose Features, Choose Model and Data Mining Method, Prior Knowledge]
21
Prior Knowledge from Experiment Design
  • Complexity Levels of Microarray Experiments
  • Compare a single gene in a control situation versus
    a treatment situation
  • Example: Is the level of expression (up-regulated
    or down-regulated) significantly different in the
    two situations? (drug design application)
  • Methods: t-test, Bayesian approach
  • Find multiple genes that share common
    functionalities
  • Example: Which related genes are dependent on
    one another?
  • Methods: Clustering (hierarchical, k-means,
    self-organizing maps, neural network, support
    vector machines)
  • Infer the underlying gene and protein networks
    that are responsible for the patterns and
    functional pathways observed
  • Example: What is the gene regulation at system
    level?
  • Directions: mining regulatory regions, modeling
    regulatory networks on a global scale
  • Goal of Future Experiment Designs: Understand
    biology at the system level, e.g., gene networks,
    protein networks, signaling networks, metabolic
    networks, immune system and neuronal networks.

22
Data Mining Techniques
Visualization
23
  • STATISTICS

24
Statistics
  • Descriptive Statistics: describe data
  • Inductive Statistics: make forecasts and inferences
  • Example question: Are two sample sets identically
    distributed?
25
Statistical t-test
  • Gene Expression Level in Control and Treatment
    situations
  • Is the behavior of a single gene different in
    Control situation than in Treatment situation ?

Normalized distance t follows a Student
distribution with f degrees of freedom.
If |t| > thresh then the control and treatment data
populations are considered to be different.
  • m: sample mean
  • s: sample variance
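
A minimal sketch of the test above. The slide's own formula is not preserved in this transcript, so the sketch assumes the unequal-variance (Welch) form t = (m_c - m_t) / sqrt(s_c/n_c + s_t/n_t), with s the sample variance as defined on the slide, and the Welch-Satterthwaite approximation for f. The function name `welch_t`, the expression values, and the threshold are illustrative assumptions.

```python
import math
from statistics import mean, variance


def welch_t(control, treatment):
    """Return (t, f): normalized distance and approximate degrees of freedom."""
    n_c, n_t = len(control), len(treatment)
    m_c, m_t = mean(control), mean(treatment)
    s_c, s_t = variance(control), variance(treatment)  # sample variances
    se2 = s_c / n_c + s_t / n_t
    t = (m_c - m_t) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom (assumed form)
    f = se2 ** 2 / ((s_c / n_c) ** 2 / (n_c - 1) + (s_t / n_t) ** 2 / (n_t - 1))
    return t, f


# Hypothetical expression levels of one gene (not from the slides)
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treatment = [6.0, 7.0, 8.0, 9.0, 10.0]
t, f = welch_t(control, treatment)

# Assumed threshold: roughly the two-sided 5% critical value for 8 degrees of freedom
thresh = 2.306
different = abs(t) > thresh
print(t, f, different)
```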

26
  • MACHINE LEARNING AND PATTERN RECOGNITION

27
Machine Learning
  • Supervised (learning from examples)
  • Unsupervised (natural groupings)
  • Reinforcement
28
Pattern Recognition
  • k-nearest neighbors, support vectors
  • Locally Weighted Learning
  • Statistical Models
  • Linear Correlation and Regression
  • Decision Trees
  • Neural Networks
  • NN representation and gradient based optimization
  • NN representation and genetic algorithm based
    optimization
29
Unsupervised Learning and Clustering
  • A cluster is a collection of data objects that
    are similar to one another within the same
    cluster and are dissimilar to the objects in
    other clusters.
  • Examples of data objects
  • gene expression levels, sets of co-regulated
    genes (pathways), protein structures
  • Categories of Clustering Methods
  • Partitioning Methods
  • Hierarchical Methods
  • Density-Based Methods

Natural groupings
30
Unsupervised Clustering: Partitioning Methods
Example: Centroid-Based Technique
  • K-means Algorithm partitions a set of n objects
    into k clusters so that the resulting
    intra-cluster similarity is high but the
    inter-cluster similarity is low.
  • Input: number of desired clusters k
  • Output: k labels assigned to n objects
  • Steps
  • 1. Select k initial cluster centers
  • 2. Compute dissimilarity as a distance between an
    object and each cluster center
  • 3. Assign a label to an object based on the minimum
    distance
  • 4. Repeat for all objects
  • 5. Re-compute the cluster centers as the mean of
    all objects assigned to a given cluster
  • 6. Repeat from Step 2 until objects do not change
    their labels.
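
The steps above can be sketched as follows. This is a minimal illustration, not the presenter's implementation: it assumes Euclidean distance, takes the first k objects as the initial centers, and uses made-up 2D points; the function name `kmeans` is hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Label each point with the index of its nearest cluster center."""
    centers = [points[i] for i in range(k)]  # assumption: first k objects
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assign every object to the center at minimum distance
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # no object changed its label
            break
        labels = new_labels
        # Re-compute each center as the mean of its assigned objects
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Hypothetical data: two well-separated groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, k=2)
print(labels)
```

The result depends on the initial centers; in practice several initializations are usually tried.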

31
Unsupervised Clustering: Partitioning Methods
Example: Representative-Based Technique
  • K-medoids Algorithm partitions a set of n objects
    into k clusters so that it minimizes the sum of
    the dissimilarities of all the objects to their
    nearest medoid.
  • Input: number of desired clusters k
  • Output: k labels assigned to n objects
  • Steps
  • 1. Select k initial objects as the initial medoids
  • 2. Compute dissimilarity as a distance between an
    object and each cluster medoid
  • 3. Assign a label to an object based on the minimum
    distance
  • 4. Repeat for all objects
  • 5. Randomly select a non-medoid object and swap it
    with the current medoid if the swap would decrease
    the intra-cluster squared error
  • 6. Repeat from Step 2 until objects do not change
    their labels.
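
A minimal sketch of the medoid-swapping idea, under stated assumptions: it uses Euclidean distance, starts from the first k objects, scores a configuration by the sum of distances to the nearest medoid, and tries every possible swap in a fixed order instead of selecting a non-medoid at random, so that the result is reproducible. The names `total_cost` and `kmedoids` are hypothetical.

```python
import math
from itertools import product


def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def kmedoids(points, k):
    medoids = list(range(k))  # assumption: first k objects as initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        # Deterministic variant: try swapping each medoid with each non-medoid
        for i, cand in product(range(k), range(len(points))):
            if cand in medoids:
                continue
            trial = medoids[:i] + [cand] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best:  # keep the swap only if it lowers the cost
                medoids, best, improved = trial, cost, True
    labels = [
        min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
        for p in points
    ]
    return labels, medoids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, medoids = kmedoids(points, k=2)
print(sorted(medoids), labels)
```

Unlike k-means, the cluster representatives here are always actual data objects, which makes the method less sensitive to outliers.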

32
Unsupervised Clustering Hierarchical Clustering
  • Hierarchical Clustering partitions a set of n
    objects into a tree of clusters
  • Types of Hierarchical Clustering
  • Agglomerative hierarchical clustering
  • Bottom-up strategy of building clusters
  • Divisive hierarchical clustering
  • Top-down strategy of building clusters

33
Unsupervised Agglomerative Hierarchical Clustering
  • Agglomerative Hierarchical Clustering partitions
    a set of n objects into a tree of clusters with a
    bottom-up strategy.
  • Steps
  • 1. Assign a unique label to each data object and
    form n clusters
  • 2. Find the nearest clusters and merge them
  • 3. Repeat Step 2 until the number of remaining
    clusters equals the desired number of clusters.
  • Types of Agglomerative Hierarchical Clustering
  • The nearest neighbor algorithms (minimum or
    single-linkage algorithm, minimal spanning tree)
  • The farthest neighbor algorithms (maximum or
    complete-linkage algorithm)
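
A minimal sketch of the bottom-up merging loop, assuming Euclidean distance and hypothetical one-dimensional data. The `link` parameter is an illustrative device: passing `min` gives the nearest neighbor (single-linkage) rule and passing `max` gives the farthest neighbor (complete-linkage) rule.

```python
import math


def agglomerate(points, n_clusters, link=min):
    """Merge the two nearest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # one cluster per object

    def cluster_dist(a, b):
        # link=min: single linkage, link=max: complete linkage
        return link(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        a, b = min(pairs, key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters


# Hypothetical 1D measurements
points = [(0.0,), (1.0,), (1.5,), (10.0,), (10.2,)]
single = agglomerate(points, n_clusters=2, link=min)
complete = agglomerate(points, n_clusters=2, link=max)
print(single, complete)
```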

34
Unsupervised Clustering: Density-Based Clustering
  • Density-Based Spatial Clustering of Applications
    with Noise (DBSCAN) aggregates objects into
    clusters if the objects are density connected.
  • Density connected objects
  • Simplified explanation: P and Q are density
    connected if there is an object O such that both
    P and Q are density reachable from O.
  • Aggregate P and Q if they are density connected
    with respect to the R-radius neighborhood and
    Minimum Object criteria
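
A minimal sketch of density-based aggregation in the spirit of DBSCAN, assuming Euclidean distance and hypothetical 2D points. The parameter names `radius` and `min_objects` stand for the R-radius neighborhood and Minimum Object criteria; an object counts as part of its own neighborhood here, and objects that belong to no cluster are labeled -1 (noise).

```python
import math

NOISE = -1


def dbscan(points, radius, min_objects):
    labels = [None] * len(points)

    def neighbors(i):
        # All objects within the radius of object i (including i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= radius]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_objects:
            labels[i] = NOISE  # may later become a border object
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border object reached from a core object
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_objects:  # j is a core object: keep expanding
                queue.extend(nbrs)
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (5.0, 5.0)]
labels = dbscan(points, radius=1.5, min_objects=3)
print(labels)
```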

35
Supervised Learning or Classification
  • Classification is a two-step process consisting
    of learning classification rules followed by
    assignment of classification label.

36
Supervised Learning Decision Tree
  • Decision tree algorithm constructs a tree
    structure in a top-down recursive
    divide-and-conquer manner

Car Insurance: Risk Assessment

Attributes (Age, Car Type) and Answers (Risk):

Age   Car Type   Risk
23    family     High
17    sports     High
43    sports     High
68    family     Low
32    truck      Low
20    family     High

Decision tree:

Age < 25 ?
  yes: Risk High
  no:  Sports car ?
         yes: Risk High
         no:  Risk Low

[Figure: Visualization of Decision Boundaries]
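
The car insurance tree can be written directly as nested conditions. The sketch below encodes the two tests (Age < 25, sports car) and checks them against the six training rows from the slide; the function name `risk` is an illustrative choice, and the code applies a tree that is already given rather than learning one.

```python
def risk(age, car_type):
    """Apply the decision tree from the slide to one applicant."""
    if age < 25:  # root test: Age < 25 ?
        return "High"
    if car_type == "sports":  # second test: Sports car ?
        return "High"
    return "Low"


# Training rows from the slide: (age, car type, risk)
training = [
    (23, "family", "High"),
    (17, "sports", "High"),
    (43, "sports", "High"),
    (68, "family", "Low"),
    (32, "truck", "Low"),
    (20, "family", "High"),
]

consistent = all(risk(age, car) == label for age, car, label in training)
print(consistent)
```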
37
Supervised Learning: Bayesian Classification
  • Bayesian Classification is based on Bayes' theorem
    and it can predict class membership
    probabilities.
  • Bayes' Theorem (X: data sample, H: hypothesis of
    data label): P(H|X) = P(X|H) P(H) / P(X)
  • P(H|X): posterior probability
  • P(H): prior probability
  • Classification: choose the maximum a posteriori
    (MAP) hypothesis
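
A small worked example of Bayes' theorem and the maximum a posteriori choice. The priors and likelihoods are made-up numbers for illustration only, and the function name `posteriors` is hypothetical.

```python
def posteriors(priors, likelihoods):
    """Return P(H|X) for each hypothesis H, given P(H) and P(X|H)."""
    evidence = sum(priors[h] * likelihoods[h] for h in priors)  # P(X)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


# Hypothetical values: P(H) and P(X|H) for two class labels
priors = {"High": 0.3, "Low": 0.7}
likelihoods = {"High": 0.6, "Low": 0.2}

post = posteriors(priors, likelihoods)
map_label = max(post, key=post.get)  # maximum a posteriori hypothesis
print(post, map_label)
```

With these numbers P(X) = 0.3 * 0.6 + 0.7 * 0.2 = 0.32, so P(High|X) = 0.18 / 0.32 = 0.5625 and the MAP label is High even though its prior is lower.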

38
Statistical Models Linear Discriminant
  • Linear Discriminant Functions form boundaries
    between data classes.
  • Finding Linear Discriminant Functions is achieved
    by minimizing a criterion error function.

  • Linear discriminant function
  • Quadratic discriminant function
  • Finding the w coefficients: gradient descent
    procedures, Newton's algorithm
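
A minimal sketch of finding the w coefficients by gradient descent. The slide does not preserve the criterion function, so this example assumes the perceptron criterion with the single-sample fixed-increment update, and a linear discriminant of the form g(x) = w_0 + w·x. The data, the function names, and the learning rate are illustrative assumptions.

```python
def g(w, x):
    """Linear discriminant g(x) = w_0 + w . x, with w[0] as the bias."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def train_linear_discriminant(samples, labels, rate=1.0, max_epochs=1000):
    """Perceptron-criterion descent; labels must be -1 or +1."""
    w = [0.0] * (len(samples[0]) + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            if y * g(w, x) <= 0:  # misclassified: step along the gradient
                w[0] += rate * y
                for i, xi in enumerate(x):
                    w[i + 1] += rate * y * xi
                errors += 1
        if errors == 0:  # every sample is on the correct side
            break
    return w


# Hypothetical, linearly separable two-class data
samples = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
labels = [-1, -1, -1, 1, 1, 1]
w = train_linear_discriminant(samples, labels)
print(w)
```

This update only terminates with zero errors when the classes are linearly separable; other criteria (for example squared error) behave differently on overlapping classes.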
39
Neural Networks
  • Neural network is a set of connected input/output
    units where each connection has a weight
    associated with it.
  • Phase I (learning): adjust weights such that the
    network accurately predicts the class labels of
    the input samples
  • Phase II (classification): assign labels by
    passing an unknown sample through the network
  • Steps
  • Initialize weights from [-1, 1]
  • Propagate the inputs forward
  • Backpropagate the error
  • Terminate learning (training) if (a) delta w <
    thresh, or (b) the percentage of misclassified
    samples < thresh, or (c) the max number of
    iterations has been exceeded
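
A minimal sketch of the steps above for a network with one hidden layer and a single output. Sigmoid units, the squared-error gradient, the layer size, the learning rate, and the toy data are all assumptions made for illustration; only termination conditions (a) and (c) are implemented.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w_h, w_o, x):
    """Propagate one input forward; the last weight of each row is the bias."""
    h = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in w_h]
    y = sigmoid(sum(w * v for w, v in zip(w_o, h)) + w_o[-1])
    return h, y


def train(samples, targets, hidden=2, rate=0.5, max_iter=2000, thresh=1e-4, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0])
    # Initial weights drawn from [-1, 1]
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    for _ in range(max_iter):  # condition (c): max number of iterations
        max_delta = 0.0
        for x, t in zip(samples, targets):
            h, y = forward(w_h, w_o, x)
            # Backpropagate the error (computed before any weight changes)
            err_o = (t - y) * y * (1 - y)
            err_h = [err_o * w_o[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            for j in range(hidden):
                delta = rate * err_o * h[j]
                w_o[j] += delta
                max_delta = max(max_delta, abs(delta))
                for i in range(n_in):
                    delta = rate * err_h[j] * x[i]
                    w_h[j][i] += delta
                    max_delta = max(max_delta, abs(delta))
                w_h[j][-1] += rate * err_h[j]
            w_o[-1] += rate * err_o
        if max_delta < thresh:  # condition (a): delta w < thresh
            break
    return w_h, w_o


# Hypothetical training samples with 0/1 class labels
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0, 0, 0, 1]
w_h, w_o = train(samples, targets)
_, output = forward(w_h, w_o, (1.0, 1.0))  # Phase II: classify a sample
print(output)
```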

Interpretation
40
Support Vector Machines (SVM)
  • SVM algorithm finds a separating hyperplane with
    the largest margin and uses it for classification
    of new samples
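
To illustrate what "largest margin" means, the sketch below computes the geometric margin of two candidate separating hyperplanes on hypothetical data. It is not an SVM trainer: an SVM searches for the hyperplane that maximizes this quantity, whereas here the two hyperplanes are simply given. The function name `geometric_margin` and all numbers are assumptions.

```python
import math


def geometric_margin(w, b, samples, labels):
    """Smallest signed distance of any sample to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(samples, labels)
    )


samples = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
labels = [-1, -1, -1, 1, 1, 1]

# Two hyperplanes that both separate the classes
margin_diag = geometric_margin((1.0, 1.0), -5.5, samples, labels)
margin_vert = geometric_margin((1.0, 0.0), -3.0, samples, labels)
print(margin_diag, margin_vert)
```

Both hyperplanes classify the training samples correctly, but the diagonal one leaves a wider gap to the nearest samples, so it is the one a maximum-margin criterion would prefer.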

41
  • DATABASE TECHNIQUES AND OPTIMIZATION TECHNIQUES

42
Database Techniques
  • Database Design and Modeling (tables, procedures,
    functions, constraints)
  • Database Interface to Data Mining System
  • Efficient Import and Export of Data
  • Database Data Visualization
  • Database Clustering for Access Efficiency
  • Database Performance Tuning (memory usage, query
    encoding)
  • Database Parallel Processing (multiple servers
    and CPUs)
  • Distributed Information Repositories (data
    warehouse)

43
Optimization Techniques
  • Highly nonlinear search space (global versus
    local maxima)
  • Gradient based optimization
  • Genetic algorithm based optimization
  • Optimization with sampling
  • Large search space
  • Example: A genome with N genes can encode 2^N
    states (active or inactive states; regulated is
    not considered). Human genome: 2^30,000 patterns;
    nematode genome: 2^20,000 patterns.
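
To get a feel for the size of this search space, the number of decimal digits of 2^N can be computed from N * log10(2). The function name below is an illustrative choice; the gene counts are the ones quoted on the slide.

```python
import math


def decimal_digits_of_pow2(n):
    """Number of decimal digits of 2**n, computed without building the integer."""
    return math.floor(n * math.log10(2)) + 1


human = decimal_digits_of_pow2(30_000)     # gene count quoted on the slide
nematode = decimal_digits_of_pow2(20_000)  # gene count quoted on the slide
print(human, nematode)
```

Under this binary active/inactive model the human genome has a state count with more than nine thousand decimal digits, which is why exhaustive search is out of the question and sampling or heuristic optimization is used instead.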

44
  • VISUALIZATION

45
Visualization
  • Data: 3D cubes, distribution charts, curves,
    surfaces, link graphs, image frames and movies,
    parallel coordinates
  • Results: pie charts, scatter plots, box plots,
    association rules, parallel coordinates,
    dendrograms, temporal evolution

Parallel coordinates
Pie chart
Temporal evolution
46
Novel Visualization of Features
Feature Selection and Visualization
Feature Selection
Mean Feature Image
47
Novel Visualization of Clustering Results
Class Labeling and Visualization
Isodata (K-means) Clustering
Mean Feature Image
Label Image
48
  • VALIDATION

49
Why Validation?
  • Validation type
  • Within the existing data
  • With newly collected data
  • Errors and uncertainties
  • Systematic or random errors
  • Unknown variables - number of classes
  • Noise level - statistical confidence due to
    noise
  • Model validity: error measure, model over-fit or
    under-fit
  • Number of data points: measurement replicas
  • Other issues
  • Experimental support of general theories
  • Exhaustive sampling is not feasible

50
Error Detection Example of Spot Screening
Mask Image Location and Size Screening
Mask Image No Screening
Mask Image SNR Screening
51
Cross Validation Example
  • One-tier cross validation
  • Train on different data than test data
  • Two-tier cross validation
  • The score from one-tier cross validation is used
    by the bias optimizer to select the best learning
    algorithm parameters (# of control points). The
    more you optimize, the more you over-fit. The
    second tier measures the level of over-fit
    (unbiased measure of accuracy).
  • Useful for comparing learning algorithms with
    control parameters that are optimized.
  • Number of folds is not optimized.
  • Computational complexity
  • # folds of top tier × # folds of bottom tier ×
    # control points × CPU time of algorithm
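
A minimal sketch of two-tier (nested) cross validation under stated assumptions: the learner is a toy k-nearest-neighbor classifier on one-dimensional data, the control points are candidate values of k, and folds are formed by interleaving indices. All names and data are hypothetical. The counter `n_fits` makes the computational cost visible: top-tier folds × bottom-tier folds × control points, plus one refit per top-tier fold.

```python
def folds(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def fit(train, k):
    return train, k  # a k-NN model simply stores its training data


def accuracy(model, test):
    train, k = model
    hits = 0
    for x, label in test:
        nearest = sorted(train, key=lambda d: abs(d[0] - x))[:k]
        vote = 1 if 2 * sum(lab for _, lab in nearest) > len(nearest) else 0
        hits += vote == label
    return hits / len(test)


def two_tier_cv(data, params, outer=3, inner=2):
    n_fits, outer_scores = 0, []
    for test_idx in folds(len(data), outer):  # top tier
        train = [d for i, d in enumerate(data) if i not in test_idx]
        test = [data[i] for i in test_idx]
        best_param, best_score = None, float("-inf")
        for p in params:  # control points
            score = 0.0
            for val_idx in folds(len(train), inner):  # bottom tier
                sub = [d for i, d in enumerate(train) if i not in val_idx]
                val = [train[i] for i in val_idx]
                score += accuracy(fit(sub, p), val)
                n_fits += 1
            if score > best_score:
                best_param, best_score = p, score
        # Refit with the selected parameter; score on data never used for tuning
        outer_scores.append(accuracy(fit(train, best_param), test))
        n_fits += 1
    return sum(outer_scores) / len(outer_scores), n_fits


# Hypothetical data: label is 1 when x >= 6
data = [(float(x), int(x >= 6)) for x in range(12)]
score, n_fits = two_tier_cv(data, params=[1, 3])
print(score, n_fits)
```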

52
Summary
  • Bioinformatics and Microarray problem
  • Interdisciplinary Challenges: Terminology
  • Understanding Biology and Computer Science
  • Data mining and image analysis steps
  • Image Analysis
  • Experiment Design as Prior Knowledge
  • Expected Results of Data Mining
  • Which Data Mining Technique to Use?
  • Data Mining Challenges: Complexity, Data Size,
    Search Space
  • Validation
  • Confidence in Obtained Results?
  • Error Screening
  • Cross validation techniques

53
Backup