Title: Data Mining in Bioinformatics
Outline
- Introduction
- Interdisciplinary Problem Statement
- Microarray Problem Overview
- Microarray Data Processing
- Image Analysis and Data Mining
- Prior Knowledge
- Data Mining Methods
- Database and Optimization Techniques
- Visualization
- Validation
- Summary
Introduction: Recommended Literature
- 1. Bioinformatics: The Machine Learning Approach by P. Baldi and S. Brunak, 2nd edition, The MIT Press, 2001
- 2. Data Mining: Concepts and Techniques by J. Han and M. Kamber, Morgan Kaufmann Publishers, 2001
- 3. Pattern Classification by R. Duda, P. Hart, and D. Stork, 2nd edition, John Wiley & Sons, 2001
Bioinformatics, Computational Biology, Data Mining
- Bioinformatics is an interdisciplinary field about the information processing problems in computational biology and a unified treatment of the data mining methods for solving these problems.
- Computational Biology is about modeling real data and simulating unknown data of biological entities, e.g.
- Genomes (viruses, bacteria, fungi, plants, insects, ...)
- Proteins and Proteomes
- Biological Sequences
- Molecular Function and Structure
- Data Mining is searching for knowledge in data
- Knowledge mining from databases
- Knowledge extraction
- Data/pattern analysis
- Data dredging
- Knowledge Discovery in Databases (KDD)
Introduction: Problems in the Bioinformatics Domain
- Data production at the levels of molecules, cells, organs, organisms, and populations
- Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data
- Prediction of molecular function and structure
- Computational biology: synthesis (simulations) and analysis (machine learning)
Microarray Problem: Major Objective
- Major objective: discover a comprehensive theory of life's organization at the molecular level
- The major actors of molecular biology: the nucleic acids, DeoxyriboNucleic Acid (DNA) and RiboNucleic Acid (RNA)
- The central dogma of molecular biology
- Proteins are very complicated molecules built from 20 different amino acids.
Input and Output of Microarray Data Analysis
- Input: laser image scans (data) and underlying experiment hypotheses or experiment designs (prior knowledge)
- Output:
- Conclusions about the input hypotheses, or knowledge about the statistical behavior of measurements
- The theory of biological systems learnt automatically from data (machine learning perspective)
- Model fitting, inference process
Overview of Microarray Problem
[Diagram: experiment design and hypothesis drive the microarray experiment, followed by image analysis, data analysis, data mining, and validation in the biology application domain, supported by a data warehouse and by the statistics, artificial intelligence (AI), and knowledge discovery in databases (KDD) communities]
Statistics Community
- Random Variables
- Statistical Measures
- Probability and Probability Distribution
- Confidence Interval Estimations
- Test of Hypotheses
- Goodness of Fit
- Regression and Correlation Analysis
Artificial Intelligence (AI) Community
- Issues
- Prior knowledge (e.g., invariance)
- Model deviation from true model
- Sampling distributions
- Computational complexity
- Model complexity (overfitting)
[Diagram: design cycle of predictive modeling - collect data, choose features, choose model, train classifier, evaluate classifier]
Knowledge Discovery in Databases (KDD) Community
[Diagram: the KDD process built around a database]
Microarray Data Mining and Image Analysis Steps
- Image Analysis
- Normalization
- Grid Alignment
- Spot Quality Assurance Control
- Feature construction (selection and extraction)
- Data Mining
- Prior knowledge
- Statistics
- Machine learning
- Pattern recognition
- Database techniques
- Optimization techniques
- Visualization
- Validation
- Issues
- Cross validation techniques
MICROARRAY IMAGE ANALYSIS
Microarray Image Analysis
DATA MINING OF MICROARRAY DATA
Why Data Mining? A Sequence Example
- Biology language and goals
- A gene can be defined as a region of DNA.
- A genome is one haploid set of chromosomes with the genes they contain.
- Perform competent comparison of gene sequences across species and account for inherently noisy biological sequences due to random variability amplified by evolution (see the similarity sketch after this list)
- Assumption: if a gene has high similarity to another gene, then they perform the same function
- Analysis language and goals
- A feature is an extractable attribute or measurement (e.g., gene expression, location)
- Pattern recognition tries to characterize data patterns (e.g., similar gene expressions, equidistant gene locations)
- Data mining is about uncovering patterns, anomalies, and statistically significant structures in data (e.g., find two similar gene expressions with confidence > x)
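As a toy illustration of the sequence-comparison assumption above, the following Python sketch scores the similarity of two hypothetical gene fragments with a plain edit distance; the sequences and the similarity measure are illustrative only, and real studies would use alignment tools such as BLAST.

```python
def edit_distance(a, b):
    """Dynamic-programming (Levenshtein) distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution (no cost if equal)
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Crude similarity in [0, 1]; 1.0 means identical sequences."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# Hypothetical gene fragments
gene_a = "ATGGCGTACGTTAGC"
gene_b = "ATGGCGTTCGTAAGC"
print(round(similarity(gene_a, gene_b), 2))  # ~0.87 for these toy strings
```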
Types of Expected Data Mining and Analysis Results
- Hypothetical examples
- Binary answers using tests of hypotheses
- Drug treatment is successful with a confidence level x.
- Statistical behavior (probability distribution functions)
- A class of genes with functionality X follows a Poisson distribution.
- Expected events
- As the amount of treatment increases, the gene expression level will decrease.
- Relationships
- Expression level of gene A is correlated with the expression level of gene B under varying treatment conditions (genes A and B are part of the same pathway).
- Decision trees
- Classification of a new gene sequence by a domain expert.
Prior Knowledge: Experiment Design
- Microarray sources of systematic and random errors
- Feature selection and variability
- Expectations and hypotheses
- Data cleaning and transformations
- Data mining method selection
- Interpretation
[Diagram: prior knowledge informs each stage - collect data, data cleaning and transformations, choose features, choose model and data mining method]
Prior Knowledge from Experiment Design
- Complexity levels of microarray experiments
- Compare a single gene in a control situation versus a treatment situation
- Example: Is the level of expression (up-regulated or down-regulated) significantly different in the two situations? (drug design application)
- Methods: t-test, Bayesian approach
- Find multiple genes that share common functionalities
- Example: Find related genes that are dependent
- Methods: clustering (hierarchical, k-means, self-organizing maps, neural networks, support vector machines)
- Infer the underlying gene and protein networks that are responsible for the observed patterns and functional pathways
- Example: What is the gene regulation at the system level?
- Directions: mining regulatory regions, modeling regulatory networks on a global scale
- Goal of future experiment designs: understand biology at the system level, e.g., gene networks, protein networks, signaling networks, metabolic networks, immune system and neuronal networks.
Data Mining Techniques
[Diagram: taxonomy of data mining techniques, including visualization]
Statistics
- Descriptive statistics: describe data
- Inductive statistics: make forecasts and inferences (e.g., are two sample sets identically distributed?)
Statistical t-test
- Gene expression level in Control and Treatment situations
- Is the behavior of a single gene different in the Control situation than in the Treatment situation?
- Normalized distance (unequal-variance form): $t = \dfrac{\bar{x}_C - \bar{x}_T}{\sqrt{s_C^2/n_C + s_T^2/n_T}}$, where $\bar{x}$, $s^2$, and $n$ are the sample mean, variance, and size of the Control (C) and Treatment (T) measurements.
- The normalized distance t follows a Student distribution with f degrees of freedom.
- If t > threshold, then the control and treatment data populations are considered to be different.
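A minimal SciPy sketch of this control-versus-treatment comparison for a single gene; the replicate values below are made up, and the Welch (unequal-variance) form of the test is assumed.

```python
import numpy as np
from scipy import stats

# Made-up expression levels of one gene across replicate arrays
control   = np.array([1.10, 0.95, 1.02, 1.08, 0.99, 1.05])
treatment = np.array([1.45, 1.38, 1.52, 1.41, 1.60, 1.36])

# Welch's two-sample t-test (does not assume equal variances)
t, p = stats.ttest_ind(control, treatment, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
# The gene is called differentially expressed when p falls below the chosen
# significance level (e.g., 0.05), i.e., when |t| exceeds the threshold.
```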
MACHINE LEARNING AND PATTERN RECOGNITION
Machine Learning
- Supervised learning (learning from labeled examples)
- Unsupervised learning (finding natural groupings)
- Reinforcement learning
Pattern Recognition
- k-nearest neighbors, support vectors
- Locally weighted learning
- Statistical models
- Linear correlation and regression
- Decision trees
- Neural networks (NN representation with gradient-based or genetic-algorithm-based optimization)
Unsupervised Learning and Clustering
- A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.
- Examples of data objects: gene expression levels, sets of co-regulated genes (pathways), protein structures
- Categories of clustering methods:
- Partitioning methods
- Hierarchical methods
- Density-based methods
Unsupervised Clustering: Partitioning Methods
Example: Centroid-Based Technique
- The K-means algorithm partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low.
- Input: number of desired clusters k
- Output: k labels assigned to n objects
- Steps (see the sketch after this list):
- 1. Select k initial cluster centers
- 2. Compute similarity as a distance between an object and each cluster center
- 3. Assign a label to an object based on the minimum distance
- 4. Repeat for all objects
- 5. Re-compute the cluster centers as the mean of all objects assigned to a given cluster
- 6. Repeat from Step 2 until objects do not change their labels.
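A minimal NumPy sketch of these K-means steps; the `expression` matrix of gene-expression profiles and its two-group structure are fabricated for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()   # Step 1: initial centers
    labels = None
    for _ in range(n_iter):
        # Steps 2-4: Euclidean distance of every object to every center,
        # then label each object by its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                                    # Step 6: labels stopped changing
        labels = new_labels
        # Step 5: recompute each center as the mean of its assigned objects
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Fabricated expression profiles: rows = genes, columns = conditions
rng = np.random.default_rng(1)
expression = np.vstack([rng.normal(0.0, 0.2, (20, 5)),
                        rng.normal(2.0, 0.2, (20, 5))])
labels, centers = kmeans(expression, k=2)
print(np.bincount(labels))   # roughly two groups of 20 genes
```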
Unsupervised Clustering: Partitioning Methods
Example: Representative-Based Technique
- The K-medoids algorithm partitions a set of n objects into k clusters so that it minimizes the sum of the dissimilarities of all the objects to their nearest medoid.
- Input: number of desired clusters k
- Output: k labels assigned to n objects
- Steps (see the sketch after this list):
- 1. Select k initial objects as the initial medoids
- 2. Compute similarity as a distance between an object and each cluster medoid
- 3. Assign a label to an object based on the minimum distance
- 4. Repeat for all objects
- 5. Randomly select a non-medoid object and swap it with the current medoid if the swap decreases the intra-cluster squared error
- 6. Repeat from Step 2 until objects do not change their labels.
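A rough NumPy sketch of the medoid-swap idea (PAM-style); for brevity the swap is accepted when it lowers the total distance to the medoids rather than the squared error, and the synthetic data mirror the K-means example.

```python
import numpy as np

def kmedoids(X, k, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise object distances
    medoids = rng.choice(len(X), size=k, replace=False)         # Step 1: initial medoids
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)                   # Steps 2-4: nearest medoid
        cost = D[np.arange(len(X)), medoids[labels]].sum()
        # Step 5: try swapping one medoid with a random non-medoid object
        trial = medoids.copy()
        trial[rng.integers(k)] = rng.choice(np.setdiff1d(np.arange(len(X)), medoids))
        trial_labels = D[:, trial].argmin(axis=1)
        if D[np.arange(len(X)), trial[trial_labels]].sum() < cost:
            medoids = trial                                      # keep the swap only if it lowers the cost
    return D[:, medoids].argmin(axis=1), medoids

rng_data = np.random.default_rng(1)
expression = np.vstack([rng_data.normal(0.0, 0.2, (20, 5)),
                        rng_data.normal(2.0, 0.2, (20, 5))])
labels, medoids = kmedoids(expression, k=2)
print(np.bincount(labels))
```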
Unsupervised Clustering: Hierarchical Clustering
- Hierarchical clustering partitions a set of n objects into a tree of clusters.
- Types of hierarchical clustering:
- Agglomerative hierarchical clustering: bottom-up strategy of building clusters
- Divisive hierarchical clustering: top-down strategy of building clusters
Unsupervised Agglomerative Hierarchical Clustering
- Agglomerative hierarchical clustering partitions a set of n objects into a tree of clusters with a bottom-up strategy.
- Steps (see the sketch after this list):
- 1. Assign a unique label to each data object and form n clusters
- 2. Find the nearest clusters and merge them
- 3. Repeat Step 2 until the desired number of clusters is reached.
- Types of agglomerative hierarchical clustering:
- The nearest-neighbor algorithms (minimum or single-linkage algorithm, minimal spanning tree)
- The farthest-neighbor algorithms (maximum or complete-linkage algorithm)
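A short SciPy sketch of the single-linkage (nearest-neighbor) variant; `profiles` is fabricated expression data, and switching `method` to `'complete'` gives the farthest-neighbor variant.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
profiles = np.vstack([rng.normal(0.0, 0.3, (10, 4)),
                      rng.normal(3.0, 0.3, (10, 4))])

Z = linkage(profiles, method='single')            # bottom-up merging of nearest clusters
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into the 2 desired clusters
print(labels)
```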
Unsupervised Clustering: Density-Based Clustering
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN) aggregates objects into clusters if the objects are density-connected.
- Density-connected objects:
- Simplified explanation: P and Q are density-connected if there is an object O such that both P and Q are density-reachable from O.
- Aggregate P and Q if they are density-connected with respect to an R-radius neighborhood and a minimum-number-of-objects criterion (see the sketch below).
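A minimal scikit-learn DBSCAN sketch on fabricated 2-D points; `eps` plays the role of the R-radius neighborhood and `min_samples` the minimum-object criterion.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, (30, 2)),     # dense group 1
               rng.normal(3.0, 0.2, (30, 2)),     # dense group 2
               rng.uniform(-2.0, 5.0, (5, 2))])   # a few scattered points

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(sorted(set(labels)))   # cluster ids; -1 marks objects rejected as noise
```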
Supervised Learning or Classification
- Classification is a two-step process consisting of learning classification rules followed by assignment of a classification label.
Supervised Learning: Decision Tree
- The decision tree algorithm constructs a tree structure in a top-down, recursive, divide-and-conquer manner.
- Example: car insurance risk assessment (training data and learned tree below, with a code sketch after them)

  Age   Car Type   Risk
  23    family     High
  17    sports     High
  43    sports     High
  68    family     Low
  32    truck      Low
  20    family     High

- Learned tree: Age < 25? yes: Risk = High; no: Sports car? yes: Risk = High; no: Risk = Low
[Figure: visualization of decision boundaries]
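A scikit-learn sketch that fits a tree to the small table above; the numeric encoding of car type (family=0, sports=1, truck=2) is an illustrative choice, not part of the original slides.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

ages      = [23, 17, 43, 68, 32, 20]
car_types = [0, 1, 1, 0, 2, 0]        # family=0, sports=1, truck=2
risk      = ["High", "High", "High", "Low", "Low", "High"]

X = list(zip(ages, car_types))
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, risk)
print(export_text(clf, feature_names=["age", "car_type"]))   # the learned splits
print(clf.predict([[30, 1]]))         # a 30-year-old driving a sports car
```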
Supervised Learning: Bayesian Classification
- Bayesian classification is based on Bayes' theorem and can predict class membership probabilities.
- Bayes' theorem (X: data sample, H: hypothesis of the data label): $P(H \mid X) = \dfrac{P(X \mid H)\,P(H)}{P(X)}$
- $P(H \mid X)$: posterior probability
- $P(H)$: prior probability
- Classification: choose the maximum a posteriori hypothesis
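A tiny worked example of the posterior computation; the event names and all probabilities are made up for illustration (H: the gene is up-regulated, X: the spot intensity exceeds a threshold).

```python
p_h = 0.10              # prior P(H)
p_x_given_h = 0.80      # likelihood P(X | H)
p_x_given_not_h = 0.05  # likelihood P(X | not H)

# Total probability of the evidence, P(X)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1.0 - p_h)

# Posterior P(H | X) via Bayes' theorem
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 3))   # 0.64: pick the hypothesis with the larger posterior
```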
Statistical Models: Linear Discriminant
- Linear discriminant functions form boundaries between data classes.
- Finding linear discriminant functions is achieved by minimizing a criterion (error) function.
- Linear discriminant function: $g(x) = w^T x + w_0$
- Quadratic discriminant function: $g(x) = x^T W x + w^T x + w_0$
- Finding the w coefficients: gradient descent procedures, Newton's algorithm
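A short gradient-descent sketch for a two-class linear discriminant minimizing a squared-error criterion; the two Gaussian classes and the learning-rate and iteration settings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (50, 2)),   # class -1
               rng.normal(+1.0, 0.5, (50, 2))])  # class +1
y = np.array([-1] * 50 + [+1] * 50)

Xa = np.hstack([X, np.ones((100, 1))])           # augment with a bias column
w = np.zeros(3)
lr = 0.05
for _ in range(1000):
    grad = Xa.T @ (Xa @ w - y) / len(y)          # gradient of the mean squared error
    w -= lr * grad                               # gradient descent step

pred = np.sign(Xa @ w)                           # g(x) > 0 decides the class
print((pred == y).mean())                        # training accuracy on this toy set
```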
Neural Networks
- A neural network is a set of connected input/output units where each connection has a weight associated with it.
- Phase I, learning: adjust the weights so that the network accurately predicts the class labels of the input samples
- Phase II, classification: assign labels by passing an unknown sample through the network
- Steps (see the sketch after this list):
- 1. Initialize weights from [-1, 1]
- 2. Propagate the inputs forward
- 3. Backpropagate the error
- 4. Terminate learning (training) if (a) the weight changes fall below a threshold, (b) the percentage of misclassified samples falls below a threshold, or (c) the maximum number of iterations has been exceeded
- Interpretation
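A compact NumPy backpropagation sketch following those steps on the toy XOR problem; the network size, learning rate, and data are made up, and convergence depends on the random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
add_bias = lambda A: np.hstack([A, np.ones((A.shape[0], 1))])

# Toy XOR data: 2 inputs -> 1 binary label
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.uniform(-1, 1, size=(3, 4))   # Step 1: weights from [-1, 1] (input+bias -> 4 hidden units)
W2 = rng.uniform(-1, 1, size=(5, 1))   #         (hidden+bias -> output)
lr = 0.5

for it in range(20000):
    Xb = add_bias(X)
    h = sigmoid(Xb @ W1)                                # Step 2: propagate the inputs forward
    out = sigmoid(add_bias(h) @ W2)
    delta_out = (out - y) * out * (1 - out)             # Step 3: backpropagate the error
    delta_hid = (delta_out @ W2[:-1].T) * h * (1 - h)
    dW2 = add_bias(h).T @ delta_out
    W2 -= lr * dW2
    W1 -= lr * Xb.T @ delta_hid
    if np.abs(lr * dW2).max() < 1e-6:                   # Step 4a: weight changes below threshold
        break

print((out > 0.5).astype(int).ravel())  # ideally [0 1 1 0] after training
```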
Support Vector Machines (SVM)
- The SVM algorithm finds a separating hyperplane with the largest margin and uses it for classification of new samples (see the sketch below).
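A scikit-learn sketch of a maximum-margin (linear-kernel) SVM on fabricated two-class data; in a microarray setting the rows would be expression profiles.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.4, (40, 5)),
               rng.normal(+1.0, 0.4, (40, 5))])
y = np.array([0] * 40 + [1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(len(clf.support_))                          # support vectors that define the margin
print(clf.predict(rng.normal(1.0, 0.4, (3, 5))))  # classify three new samples
```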
DATABASE TECHNIQUES AND OPTIMIZATION TECHNIQUES
Database Techniques
- Database design and modeling (tables, procedures, functions, constraints)
- Database interface to the data mining system
- Efficient import and export of data
- Database data visualization
- Database clustering for access efficiency
- Database performance tuning (memory usage, query encoding)
- Database parallel processing (multiple servers and CPUs)
- Distributed information repositories (data warehouse)
Optimization Techniques
- Highly nonlinear search space (global versus local maxima)
- Gradient-based optimization
- Genetic-algorithm-based optimization
- Optimization with sampling
- Large search space
- Example: a genome with N genes can encode 2^N states (active or inactive; regulation levels are not considered). Human genome: 2^30,000 patterns; nematode genome: 2^20,000 patterns.
Visualization
- Data: 3D cubes, distribution charts, curves, surfaces, link graphs, image frames and movies, parallel coordinates
- Results: pie charts, scatter plots, box plots, association rules, parallel coordinates, dendrograms, temporal evolution (a plotting sketch follows below)
[Figures: parallel coordinates, pie chart, temporal evolution]
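A minimal pandas/matplotlib sketch of one of these result views (parallel coordinates); the expression values and cluster labels in the data frame are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "cond_1": [1.0, 0.9, 2.1, 2.3],
    "cond_2": [1.1, 1.0, 1.8, 2.0],
    "cond_3": [0.8, 1.2, 2.4, 2.2],
    "cluster": ["A", "A", "B", "B"],   # class column used to color the profile lines
})
parallel_coordinates(df, "cluster")
plt.title("Expression profiles in parallel coordinates")
plt.show()
```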
Novel Visualization of Features
[Figures: feature selection and visualization - feature selection and mean feature image]
Novel Visualization of Clustering Results
[Figures: class labeling and visualization with Isodata (K-means) clustering - mean feature image and label image]
Why Validation?
- Validation type
- Within the existing data
- With newly collected data
- Errors and uncertainties
- Systematic or random errors
- Unknown variables: number of classes
- Noise level: statistical confidence due to noise
- Model validity: error measure, model over-fit or under-fit
- Number of data points: measurement replicas
- Other issues
- Experimental support of general theories
- Exhaustive sampling is not feasible
Error Detection: Example of Spot Screening
[Figures: mask images with no screening, with location and size screening, and with SNR screening]
Cross-Validation Example
- One-tier cross-validation
- Train on different data than the test data
- Two-tier cross-validation (see the sketch after this list)
- The score from one-tier cross-validation is used by the bias optimizer to select the best learning algorithm parameters (number of control points). The more you optimize, the more you over-fit. The second tier measures the level of over-fit (an unbiased measure of accuracy).
- Useful for comparing learning algorithms whose control parameters are optimized.
- The number of folds is not optimized.
- Computational complexity: (folds of top tier) x (folds of bottom tier) x (control points) x (CPU cost of the algorithm)
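A scikit-learn sketch of the two-tier scheme: an inner grid search tunes a control parameter (here the SVM's C), while the outer folds report an unbiased accuracy; the synthetic data and parameter grid are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (60, 10)),
               rng.normal(+1.0, 1.0, (60, 10))])
y = np.array([0] * 60 + [1] * 60)

inner = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10]}, cv=3)  # bottom tier: tune C
scores = cross_val_score(inner, X, y, cv=5)                                  # top tier: unbiased score
print(scores.mean())
```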
Summary
- Bioinformatics and the Microarray Problem
- Interdisciplinary Challenges: Terminology
- Understanding Biology and Computer Science
- Data Mining and Image Analysis Steps
- Image Analysis
- Experiment Design as Prior Knowledge
- Expected Results of Data Mining
- Which Data Mining Technique to Use?
- Data Mining Challenges: Complexity, Data Size, Search Space
- Validation
- Confidence in Obtained Results?
- Error Screening
- Cross-Validation Techniques
Backup