Title: Data Mining in Bioinformatics
Outline
- Introduction
- Interdisciplinary Problem Statement
- Microarray Problem Overview
- Microarray Data Processing
- Image Analysis and Data Mining
- Prior Knowledge
- Data Mining Methods
- Database and Optimization Techniques
- Visualization
- Validation
- Summary
Introduction: Recommended Literature
- 1. Bioinformatics: The Machine Learning Approach by P. Baldi and S. Brunak, 2nd edition, The MIT Press, 2001
- 2. Data Mining: Concepts and Techniques by J. Han and M. Kamber, Morgan Kaufmann Publishers, 2001
- 3. Pattern Classification by R. Duda, P. Hart, and D. Stork, 2nd edition, John Wiley & Sons, 2001
Bioinformatics, Computational Biology, Data Mining
- Bioinformatics is an interdisciplinary field about the information processing problems in computational biology and a unified treatment of the data mining methods for solving these problems.
- Computational Biology is about modeling real data and simulating unknown data of biological entities, e.g.
- Genomes (viruses, bacteria, fungi, plants, insects, ...)
- Proteins and Proteomes
- Biological Sequences
- Molecular Function and Structure
- Data Mining is searching for knowledge in data
- Knowledge mining from databases
- Knowledge extraction
- Data/pattern analysis
- Data dredging
- Knowledge Discovery in Databases (KDD)
Introduction: Problems in the Bioinformatics Domain
- Data production at the levels of molecules, cells, organs, organisms, and populations
- Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data
- Prediction of molecular function and structure
- Computational biology: synthesis (simulations) and analysis (machine learning)
Microarray Problem: Major Objective
- Major objective: discover a comprehensive theory of life's organization at the molecular level
- The major actors of molecular biology: the nucleic acids, DeoxyriboNucleic Acid (DNA) and RiboNucleic Acid (RNA)
- The central dogma of molecular biology
- Proteins are very complicated molecules built from 20 different amino acids.
Input and Output of Microarray Data Analysis
- Input: laser image scans (data) and underlying experiment hypotheses or experiment designs (prior knowledge)
- Output:
- Conclusions about the input hypotheses, or knowledge about the statistical behavior of measurements
- The theory of biological systems learnt automatically from data (machine learning perspective)
- Model fitting, inference process
Overview of Microarray Problem
[Diagram: experiment design and hypothesis drive the microarray experiment, followed by image analysis, data analysis, data mining, and validation in the biology application domain, supported by a data warehouse and by the statistics, artificial intelligence (AI), and knowledge discovery in databases (KDD) communities]
Statistics Community
- Random Variables
- Statistical Measures
- Probability and Probability Distribution
- Confidence Interval Estimations
- Test of Hypotheses
- Goodness of Fit
- Regression and Correlation Analysis
Artificial Intelligence (AI) Community
- Issues
- Prior knowledge (e.g., invariance)
- Model deviation from true model
- Sampling distributions
- Computational complexity
- Model complexity (overfitting)
[Diagram: design cycle of predictive modeling - collect data, choose features, choose model, train classifier, evaluate classifier]
Knowledge Discovery in Databases (KDD) Community
[Diagram: the KDD process built around a database]
Microarray Data Mining and Image Analysis Steps
- Image Analysis
- Normalization
- Grid Alignment
- Spot Quality Assurance Control
- Feature construction (selection and extraction)
- Data Mining
- Prior knowledge
- Statistics
- Machine learning
- Pattern recognition
- Database techniques
- Optimization techniques
- Visualization
- Validation
- Issues
- Cross validation techniques
MICROARRAY IMAGE ANALYSIS
Microarray Image Analysis
DATA MINING OF MICROARRAY DATA
Why Data Mining? A Sequence Example
- Biology language and goals
- A gene can be defined as a region of DNA.
- A genome is one haploid set of chromosomes with the genes they contain.
- Perform competent comparison of gene sequences across species and account for inherently noisy biological sequences due to random variability amplified by evolution (see the similarity sketch after this list)
- Assumption: if a gene has high similarity to another gene, then they perform the same function
- Analysis language and goals
- A feature is an extractable attribute or measurement (e.g., gene expression, location)
- Pattern recognition tries to characterize data patterns (e.g., similar gene expressions, equidistant gene locations)
- Data mining is about uncovering patterns, anomalies, and statistically significant structures in data (e.g., find two similar gene expressions with confidence > x)
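As a toy illustration of the sequence-comparison assumption above, the following Python sketch scores the similarity of two hypothetical gene fragments with a plain edit distance; the sequences and the similarity measure are illustrative only, and real studies would use alignment tools such as BLAST.

```python
def edit_distance(a, b):
    """Dynamic-programming (Levenshtein) distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution (no cost if equal)
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Crude similarity in [0, 1]; 1.0 means identical sequences."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# Hypothetical gene fragments
gene_a = "ATGGCGTACGTTAGC"
gene_b = "ATGGCGTTCGTAAGC"
print(round(similarity(gene_a, gene_b), 2))  # ~0.87 for these toy strings
```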
Types of Expected Data Mining and Analysis Results
- Hypothetical examples
- Binary answers using tests of hypotheses
- Drug treatment is successful with a confidence level x.
- Statistical behavior (probability distribution functions)
- A class of genes with functionality X follows a Poisson distribution.
- Expected events
- As the amount of treatment increases, the gene expression level will decrease.
- Relationships
- Expression level of gene A is correlated with the expression level of gene B under varying treatment conditions (genes A and B are part of the same pathway).
- Decision trees
- Classification of a new gene sequence by a domain expert.
Prior Knowledge: Experiment Design
- Microarray sources of systematic and random errors
- Feature selection and variability
- Expectations and hypotheses
- Data cleaning and transformations
- Data mining method selection
- Interpretation
[Diagram: prior knowledge informs each stage - collect data, data cleaning and transformations, choose features, choose model and data mining method]
Prior Knowledge from Experiment Design
- Complexity levels of microarray experiments
- Compare a single gene in a control situation versus a treatment situation
- Example: Is the level of expression (up-regulated or down-regulated) significantly different in the two situations? (drug design application)
- Methods: t-test, Bayesian approach
- Find multiple genes that share common functionalities
- Example: Find related genes that are dependent
- Methods: clustering (hierarchical, k-means, self-organizing maps, neural networks, support vector machines)
- Infer the underlying gene and protein networks that are responsible for the observed patterns and functional pathways
- Example: What is the gene regulation at the system level?
- Directions: mining regulatory regions, modeling regulatory networks on a global scale
- Goal of future experiment designs: understand biology at the system level, e.g., gene networks, protein networks, signaling networks, metabolic networks, immune system and neuronal networks.
Data Mining Techniques
[Diagram: taxonomy of data mining techniques, including visualization]
Statistics
- Descriptive statistics: describe data
- Inductive statistics: make forecasts and inferences (e.g., are two sample sets identically distributed?)
Statistical t-test
- Gene expression level in Control and Treatment situations
- Is the behavior of a single gene different in the Control situation than in the Treatment situation?
- Normalized distance (unequal-variance form): $t = \dfrac{\bar{x}_C - \bar{x}_T}{\sqrt{s_C^2/n_C + s_T^2/n_T}}$, where $\bar{x}$, $s^2$, and $n$ are the sample mean, variance, and size of the Control (C) and Treatment (T) measurements.
- The normalized distance t follows a Student distribution with f degrees of freedom.
- If t > threshold, then the control and treatment data populations are considered to be different.
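A minimal SciPy sketch of this control-versus-treatment comparison for a single gene; the replicate values below are made up, and the Welch (unequal-variance) form of the test is assumed.

```python
import numpy as np
from scipy import stats

# Made-up expression levels of one gene across replicate arrays
control   = np.array([1.10, 0.95, 1.02, 1.08, 0.99, 1.05])
treatment = np.array([1.45, 1.38, 1.52, 1.41, 1.60, 1.36])

# Welch's two-sample t-test (does not assume equal variances)
t, p = stats.ttest_ind(control, treatment, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
# The gene is called differentially expressed when p falls below the chosen
# significance level (e.g., 0.05), i.e., when |t| exceeds the threshold.
```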
MACHINE LEARNING AND PATTERN RECOGNITION
Machine Learning
- Supervised learning (learning from labeled examples)
- Unsupervised learning (finding natural groupings)
- Reinforcement learning
Pattern Recognition
- k-nearest neighbors, support vectors
- Locally weighted learning
- Statistical models
- Linear correlation and regression
- Decision trees
- Neural networks (NN representation with gradient-based or genetic-algorithm-based optimization)
Unsupervised Learning and Clustering
- A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.
- Examples of data objects: gene expression levels, sets of co-regulated genes (pathways), protein structures
- Categories of clustering methods:
- Partitioning methods
- Hierarchical methods
- Density-based methods
Unsupervised Clustering: Partitioning Methods
Example: Centroid-Based Technique
- The K-means algorithm partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low.
- Input: number of desired clusters k
- Output: k labels assigned to n objects
- Steps (see the sketch after this list):
- 1. Select k initial cluster centers
- 2. Compute similarity as a distance between an object and each cluster center
- 3. Assign a label to an object based on the minimum distance
- 4. Repeat for all objects
- 5. Re-compute the cluster centers as the mean of all objects assigned to a given cluster
- 6. Repeat from Step 2 until objects do not change their labels.
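A minimal NumPy sketch of these K-means steps; the `expression` matrix of gene-expression profiles and its two-group structure are fabricated for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()   # Step 1: initial centers
    labels = None
    for _ in range(n_iter):
        # Steps 2-4: Euclidean distance of every object to every center,
        # then label each object by its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                                    # Step 6: labels stopped changing
        labels = new_labels
        # Step 5: recompute each center as the mean of its assigned objects
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Fabricated expression profiles: rows = genes, columns = conditions
rng = np.random.default_rng(1)
expression = np.vstack([rng.normal(0.0, 0.2, (20, 5)),
                        rng.normal(2.0, 0.2, (20, 5))])
labels, centers = kmeans(expression, k=2)
print(np.bincount(labels))   # roughly two groups of 20 genes
```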
Unsupervised Clustering: Partitioning Methods
Example: Representative-Based Technique
- The K-medoids algorithm partitions a set of n objects into k clusters so that it minimizes the sum of the dissimilarities of all the objects to their nearest medoid.
- Input: number of desired clusters k
- Output: k labels assigned to n objects
- Steps (see the sketch after this list):
- 1. Select k initial objects as the initial medoids
- 2. Compute similarity as a distance between an object and each cluster medoid
- 3. Assign a label to an object based on the minimum distance
- 4. Repeat for all objects
- 5. Randomly select a non-medoid object and swap it with the current medoid if the swap decreases the intra-cluster squared error
- 6. Repeat from Step 2 until objects do not change their labels.
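A rough NumPy sketch of the medoid-swap idea (PAM-style); for brevity the swap is accepted when it lowers the total distance to the medoids rather than the squared error, and the synthetic data mirror the K-means example.

```python
import numpy as np

def kmedoids(X, k, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise object distances
    medoids = rng.choice(len(X), size=k, replace=False)         # Step 1: initial medoids
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)                   # Steps 2-4: nearest medoid
        cost = D[np.arange(len(X)), medoids[labels]].sum()
        # Step 5: try swapping one medoid with a random non-medoid object
        trial = medoids.copy()
        trial[rng.integers(k)] = rng.choice(np.setdiff1d(np.arange(len(X)), medoids))
        trial_labels = D[:, trial].argmin(axis=1)
        if D[np.arange(len(X)), trial[trial_labels]].sum() < cost:
            medoids = trial                                      # keep the swap only if it lowers the cost
    return D[:, medoids].argmin(axis=1), medoids

rng_data = np.random.default_rng(1)
expression = np.vstack([rng_data.normal(0.0, 0.2, (20, 5)),
                        rng_data.normal(2.0, 0.2, (20, 5))])
labels, medoids = kmedoids(expression, k=2)
print(np.bincount(labels))
```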
Unsupervised Clustering: Hierarchical Clustering
- Hierarchical clustering partitions a set of n objects into a tree of clusters.
- Types of hierarchical clustering:
- Agglomerative hierarchical clustering: bottom-up strategy of building clusters
- Divisive hierarchical clustering: top-down strategy of building clusters
Unsupervised Agglomerative Hierarchical Clustering
- Agglomerative hierarchical clustering partitions a set of n objects into a tree of clusters with a bottom-up strategy.
- Steps (see the sketch after this list):
- 1. Assign a unique label to each data object and form n clusters
- 2. Find the nearest clusters and merge them
- 3. Repeat Step 2 until the desired number of clusters is reached.
- Types of agglomerative hierarchical clustering:
- The nearest-neighbor algorithms (minimum or single-linkage algorithm, minimal spanning tree)
- The farthest-neighbor algorithms (maximum or complete-linkage algorithm)
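A short SciPy sketch of the single-linkage (nearest-neighbor) variant; `profiles` is fabricated expression data, and switching `method` to `'complete'` gives the farthest-neighbor variant.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
profiles = np.vstack([rng.normal(0.0, 0.3, (10, 4)),
                      rng.normal(3.0, 0.3, (10, 4))])

Z = linkage(profiles, method='single')            # bottom-up merging of nearest clusters
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into the 2 desired clusters
print(labels)
```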
Unsupervised Clustering: Density-Based Clustering
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN) aggregates objects into clusters if the objects are density-connected.
- Density-connected objects:
- Simplified explanation: P and Q are density-connected if there is an object O such that both P and Q are density-reachable from O.
- Aggregate P and Q if they are density-connected with respect to an R-radius neighborhood and a minimum-number-of-objects criterion (see the sketch below).
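A minimal scikit-learn DBSCAN sketch on fabricated 2-D points; `eps` plays the role of the R-radius neighborhood and `min_samples` the minimum-object criterion.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (30, 2)),     # dense group 1
               rng.normal(3.0, 0.2, (30, 2)),     # dense group 2
               rng.uniform(-2.0, 5.0, (5, 2))])   # a few scattered points

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(sorted(set(labels)))   # cluster ids; -1 marks objects rejected as noise
```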
Supervised Learning or Classification
- Classification is a two-step process consisting of learning classification rules followed by assignment of a classification label.
Supervised Learning: Decision Tree
- The decision tree algorithm constructs a tree structure in a top-down, recursive, divide-and-conquer manner.
- Example: car insurance risk assessment (training data and learned tree below, with a code sketch after them)

  Age   Car Type   Risk
  23    family     High
  17    sports     High
  43    sports     High
  68    family     Low
  32    truck      Low
  20    family     High

- Learned tree: Age < 25? yes: Risk = High; no: Sports car? yes: Risk = High; no: Risk = Low
[Figure: visualization of decision boundaries]
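A scikit-learn sketch that fits a tree to the small table above; the numeric encoding of car type (family=0, sports=1, truck=2) is an illustrative choice, not part of the original slides.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

ages      = [23, 17, 43, 68, 32, 20]
car_types = [0, 1, 1, 0, 2, 0]        # family=0, sports=1, truck=2
risk      = ["High", "High", "High", "Low", "Low", "High"]

X = list(zip(ages, car_types))
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, risk)
print(export_text(clf, feature_names=["age", "car_type"]))   # the learned splits
print(clf.predict([[30, 1]]))         # a 30-year-old driving a sports car
```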
Supervised Learning: Bayesian Classification
- Bayesian classification is based on Bayes' theorem and can predict class membership probabilities.
- Bayes' theorem (X: data sample, H: hypothesis of the data label): $P(H \mid X) = \dfrac{P(X \mid H)\,P(H)}{P(X)}$
- $P(H \mid X)$: posterior probability
- $P(H)$: prior probability
- Classification: choose the maximum a posteriori hypothesis
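A tiny worked example of the posterior computation; the event names and all probabilities are made up for illustration (H: the gene is up-regulated, X: the spot intensity exceeds a threshold).

```python
p_h = 0.10              # prior P(H)
p_x_given_h = 0.80      # likelihood P(X | H)
p_x_given_not_h = 0.05  # likelihood P(X | not H)

# Total probability of the evidence, P(X)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1.0 - p_h)

# Posterior P(H | X) via Bayes' theorem
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 3))   # 0.64: pick the hypothesis with the larger posterior
```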
Statistical Models: Linear Discriminant
- Linear discriminant functions form boundaries between data classes.
- Finding linear discriminant functions is achieved by minimizing a criterion (error) function.
- Linear discriminant function: $g(x) = w^T x + w_0$
- Quadratic discriminant function: $g(x) = x^T W x + w^T x + w_0$
- Finding the w coefficients: gradient descent procedures, Newton's algorithm
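A short gradient-descent sketch for a two-class linear discriminant minimizing a squared-error criterion; the two Gaussian classes and the learning-rate and iteration settings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (50, 2)),   # class -1
               rng.normal(+1.0, 0.5, (50, 2))])  # class +1
y = np.array([-1] * 50 + [+1] * 50)

Xa = np.hstack([X, np.ones((100, 1))])           # augment with a bias column
w = np.zeros(3)
lr = 0.05
for _ in range(1000):
    grad = Xa.T @ (Xa @ w - y) / len(y)          # gradient of the mean squared error
    w -= lr * grad                               # gradient descent step

pred = np.sign(Xa @ w)                           # g(x) > 0 decides the class
print((pred == y).mean())                        # training accuracy on this toy set
```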
Neural Networks
- A neural network is a set of connected input/output units where each connection has a weight associated with it.
- Phase I, learning: adjust the weights so that the network accurately predicts the class labels of the input samples
- Phase II, classification: assign labels by passing an unknown sample through the network
- Steps (see the sketch after this list):
- 1. Initialize weights from [-1, 1]
- 2. Propagate the inputs forward
- 3. Backpropagate the error
- 4. Terminate learning (training) if (a) the weight changes fall below a threshold, (b) the percentage of misclassified samples falls below a threshold, or (c) the maximum number of iterations has been exceeded
- Interpretation
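A compact NumPy backpropagation sketch following those steps on the toy XOR problem; the network size, learning rate, and data are made up, and convergence depends on the random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
add_bias = lambda A: np.hstack([A, np.ones((A.shape[0], 1))])

# Toy XOR data: 2 inputs -> 1 binary label
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.uniform(-1, 1, size=(3, 4))   # Step 1: weights from [-1, 1] (input+bias -> 4 hidden units)
W2 = rng.uniform(-1, 1, size=(5, 1))   #         (hidden+bias -> output)
lr = 0.5

for it in range(20000):
    Xb = add_bias(X)
    h = sigmoid(Xb @ W1)                                # Step 2: propagate the inputs forward
    out = sigmoid(add_bias(h) @ W2)
    delta_out = (out - y) * out * (1 - out)             # Step 3: backpropagate the error
    delta_hid = (delta_out @ W2[:-1].T) * h * (1 - h)
    dW2 = add_bias(h).T @ delta_out
    W2 -= lr * dW2
    W1 -= lr * Xb.T @ delta_hid
    if np.abs(lr * dW2).max() < 1e-6:                   # Step 4a: weight changes below threshold
        break

print((out > 0.5).astype(int).ravel())  # ideally [0 1 1 0] after training
```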
Support Vector Machines (SVM)
- The SVM algorithm finds a separating hyperplane with the largest margin and uses it for classification of new samples (see the sketch below).
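A scikit-learn sketch of a maximum-margin (linear-kernel) SVM on fabricated two-class data; in a microarray setting the rows would be expression profiles.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.4, (40, 5)),
               rng.normal(+1.0, 0.4, (40, 5))])
y = np.array([0] * 40 + [1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(len(clf.support_))                          # support vectors that define the margin
print(clf.predict(rng.normal(1.0, 0.4, (3, 5))))  # classify three new samples
```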
DATABASE TECHNIQUES AND OPTIMIZATION TECHNIQUES
Database Techniques
- Database design and modeling (tables, procedures, functions, constraints)
- Database interface to the data mining system
- Efficient import and export of data
- Database data visualization
- Database clustering for access efficiency
- Database performance tuning (memory usage, query encoding)
- Database parallel processing (multiple servers and CPUs)
- Distributed information repositories (data warehouse)
Optimization Techniques
- Highly nonlinear search space (global versus local maxima)
- Gradient-based optimization
- Genetic-algorithm-based optimization
- Optimization with sampling
- Large search space
- Example: a genome with N genes can encode 2^N states (active or inactive; regulation levels are not considered). Human genome: 2^30,000 patterns; nematode genome: 2^20,000 patterns.
Visualization
- Data: 3D cubes, distribution charts, curves, surfaces, link graphs, image frames and movies, parallel coordinates
- Results: pie charts, scatter plots, box plots, association rules, parallel coordinates, dendrograms, temporal evolution (a plotting sketch follows below)
[Figures: parallel coordinates, pie chart, temporal evolution]
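A minimal pandas/matplotlib sketch of one of these result views (parallel coordinates); the expression values and cluster labels in the data frame are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "cond_1": [1.0, 0.9, 2.1, 2.3],
    "cond_2": [1.1, 1.0, 1.8, 2.0],
    "cond_3": [0.8, 1.2, 2.4, 2.2],
    "cluster": ["A", "A", "B", "B"],   # class column used to color the profile lines
})
parallel_coordinates(df, "cluster")
plt.title("Expression profiles in parallel coordinates")
plt.show()
```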
Novel Visualization of Features
[Figures: feature selection and visualization - feature selection and mean feature image]
Novel Visualization of Clustering Results
[Figures: class labeling and visualization with Isodata (K-means) clustering - mean feature image and label image]
Why Validation?
- Validation type
- Within the existing data
- With newly collected data
- Errors and uncertainties
- Systematic or random errors
- Unknown variables: number of classes
- Noise level: statistical confidence due to noise
- Model validity: error measure, model over-fit or under-fit
- Number of data points: measurement replicas
- Other issues
- Experimental support of general theories
- Exhaustive sampling is not feasible
Error Detection: Example of Spot Screening
[Figures: mask images with no screening, with location and size screening, and with SNR screening]
Cross-Validation Example
- One-tier cross-validation
- Train on different data than the test data
- Two-tier cross-validation (see the sketch after this list)
- The score from one-tier cross-validation is used by the bias optimizer to select the best learning algorithm parameters (number of control points). The more you optimize, the more you over-fit. The second tier measures the level of over-fit (an unbiased measure of accuracy).
- Useful for comparing learning algorithms whose control parameters are optimized.
- The number of folds is not optimized.
- Computational complexity: (folds of top tier) x (folds of bottom tier) x (control points) x (CPU cost of the algorithm)
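A scikit-learn sketch of the two-tier scheme: an inner grid search tunes a control parameter (here the SVM's C), while the outer folds report an unbiased accuracy; the synthetic data and parameter grid are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (60, 10)),
               rng.normal(+1.0, 1.0, (60, 10))])
y = np.array([0] * 60 + [1] * 60)

inner = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10]}, cv=3)  # bottom tier: tune C
scores = cross_val_score(inner, X, y, cv=5)                                  # top tier: unbiased score
print(scores.mean())
```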
Summary
- Bioinformatics and the Microarray Problem
- Interdisciplinary Challenges: Terminology
- Understanding Biology and Computer Science
- Data Mining and Image Analysis Steps
- Image Analysis
- Experiment Design as Prior Knowledge
- Expected Results of Data Mining
- Which Data Mining Technique to Use?
- Data Mining Challenges: Complexity, Data Size, Search Space
- Validation
- Confidence in Obtained Results?
- Error Screening
- Cross-Validation Techniques
Backup