Data Mining in Microarray Analysis

Slides: 11
Provided by: mmg7

1
Data Mining in Microarray Analysis
  • Classification (Supervised Learning)
  • Finding models (functions) that describe and
    distinguish classes or concepts for future
    prediction
  • E.g., predict disease based on gene expression
    profiles
  • Similar to prediction: predict an unknown or
    missing categorical value rather than a numerical
    value
  • Presentation: decision tree, classification rules,
    neural network
  • Cluster analysis (Unsupervised Learning)
  • Class label is unknown: group data to form new
    classes, e.g., cluster genes to find distribution
    patterns
  • Clustering is based on the principle of maximizing
    the intra-class similarity and minimizing the
    inter-class similarity
  • E.g., group genes based on their gene expression
    profiles
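
The clustering principle above can be illustrated with a minimal sketch. The expression profiles and the two-cluster grouping are hypothetical values chosen for illustration, and Euclidean distance is an assumed measure of (dis)similarity; the slides do not name one.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical expression profiles: three measurements per gene.
profiles = {
    "geneA": (1.0, 1.2, 0.9),
    "geneB": (1.1, 1.0, 1.0),
    "geneC": (5.0, 5.2, 4.9),
    "geneD": (5.1, 4.8, 5.0),
}
# Hypothetical grouping of the genes into two clusters.
clusters = [["geneA", "geneB"], ["geneC", "geneD"]]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def mean_intra_distance(clusters, profiles):
    """Average distance between pairs of genes in the same cluster."""
    return mean(
        dist(profiles[a], profiles[b])
        for members in clusters
        for a, b in combinations(members, 2)
    )


def mean_inter_distance(clusters, profiles):
    """Average distance between pairs of genes in different clusters."""
    return mean(
        dist(profiles[a], profiles[b])
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )


intra = mean_intra_distance(clusters, profiles)
inter = mean_inter_distance(clusters, profiles)
# Small intra-cluster distance means high similarity within a cluster;
# large inter-cluster distance means low similarity between clusters.
print(f"intra={intra:.3f} inter={inter:.3f}")
```

A grouping that follows the principle yields an intra-cluster distance well below the inter-cluster distance, as it does for these toy values.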

2
Supervised vs Unsupervised Learning
Unsupervised (Clustering)
  • unknown number of classes
  • no prior knowledge
  • used to understand (explore) data
Supervised (Classification)
  • known number of classes
  • based on a training set
  • used to classify future observations

3
Supervised vs. Unsupervised Learning
[Figure: two panels, Supervised Learning and Unsupervised Learning; axes: debt and income]
4
Classification
[Diagram: Training Set (data with known classes) → Classification Technique → Classifier;
Data with unknown classes → Classifier → Class Assignment]
5
Types of Classifiers
Linear Classifier
Non Linear Classifier
[Figure: linear and non-linear classifier panels; data points marked "o"; axes: debt and income]
Linear decision rule: a*income + b*debt < t => No loan!
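
The linear rule on this slide appears to read a*income + b*debt < t => No loan. A minimal sketch of such a linear classifier follows; the coefficient values, the threshold, and the "Loan" label for the other side of the boundary are assumptions made for illustration and do not come from the slides.

```python
def loan_decision(income, debt, a, b, t):
    """Linear rule: a*income + b*debt < t means 'No loan'."""
    score = a * income + b * debt
    return "No loan" if score < t else "Loan"


# Hypothetical coefficients and threshold: income raises the score,
# debt lowers it.
A, B, T = 1.0, -2.0, 10.0

# score = 50 - 20 = 30, which is not below the threshold
print(loan_decision(income=50.0, debt=10.0, a=A, b=B, t=T))
# score = 20 - 30 = -10, which is below the threshold
print(loan_decision(income=20.0, debt=15.0, a=A, b=B, t=T))
```

The decision boundary is the straight line a*income + b*debt = t in the debt/income plane; a non-linear classifier would replace the score with a non-linear function of the same inputs.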
6
Predictive Modelling
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
  • Predict categorical class labels
  • Classify data (construct a model) based on the
    training set and the values (class labels) in
    a classifying attribute
  • Use the model to classify new data
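
The two steps above (construct a model from the training set, then use it on new data) can be sketched with a one-rule (1R) baseline. 1R is a swapped-in technique chosen for brevity; the slides do not name it. The training rows are the 14 days from the table on this slide, and the new example at the end is hypothetical.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
# Training set: four attribute values, then the class label (Play Tennis).
TRAINING_SET = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def train_one_rule(rows):
    """Pick the attribute whose value -> majority-class rule makes the
    fewest errors on the training set (first attribute wins ties)."""
    best = None
    for index, name in enumerate(ATTRIBUTES):
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[index]][row[-1]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[0]:
            best = (errors, name, index, rule)
    return best


def classify(model, example):
    """Apply the learned rule to a new, unlabeled example."""
    _, _, index, rule = model
    return rule[example[index]]


model = train_one_rule(TRAINING_SET)
print(model[1], model[3])
# Hypothetical new day, not part of the training set.
print(classify(model, ("Overcast", "Cool", "High", "Strong")))
```

This sketch raises a KeyError for an attribute value that never occurs in the training set; a fuller model would need a default class for that case.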

7
Classification
  • Task: determine which of a fixed set of classes
    an example belongs to
  • Input: training set of examples annotated with
    class values
  • Output: induced hypotheses (model/concept
    description/classifiers)

Learning: induce classifiers from training data

[Diagram: Training Data → Inductive Learning System → Classifiers (Derived Hypotheses)]

Prediction: use the hypothesis to classify any example described in the
same manner

[Diagram: Data to be classified → Classifier → Decision on class assignment]
8
Decision Tree Example
Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
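
The transcript does not say how the decision tree for this table is built. As a sketch under the assumption of an ID3-style procedure, the root attribute can be chosen by information gain: the reduction in class-label entropy obtained by splitting on that attribute.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
# The 14 days from the table: attribute values, then Play Tennis.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, index):
    """Entropy of the labels minus the weighted entropy after splitting
    the rows on the attribute at position `index`."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[index] for row in rows}:
        subset = [row[-1] for row in rows if row[index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
root = max(gains, key=gains.get)
for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name:12s} {gain:.3f}")
print("root attribute:", root)
```

On this table Outlook has the largest gain, so it would become the root; the same computation would then be repeated on each branch's subset of rows.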

9
Classification: Relevant Gene Identification
  • Goal: identify a subset of genes that distinguish
    between treatments, tissues, etc.
  • Method:
  • Collect several samples grouped by treatment
    (e.g., diseased vs. healthy)
  • Use genes as features
  • Build a classifier to distinguish treatments
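
The method above can be sketched with a nearest-centroid classifier that uses genes as features. The classifier choice and all expression values are assumptions made for illustration; the slides do not prescribe a technique at this point.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical samples: expression of three genes, grouped by treatment.
samples = {
    "Diseased": [(8.1, 2.0, 5.0), (7.9, 2.2, 5.1), (8.4, 1.9, 4.9)],
    "Healthy": [(3.0, 2.1, 5.0), (3.2, 2.0, 5.2), (2.9, 2.2, 4.8)],
}


def centroid(vectors):
    """Per-gene mean expression over a group of samples."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def train(samples):
    """The model is one centroid per treatment group."""
    return {label: centroid(vectors) for label, vectors in samples.items()}


def predict(model, profile):
    """Assign a profile to the group with the closest centroid."""
    return min(model, key=lambda label: dist(model[label], profile))


model = train(samples)
print(predict(model, (7.5, 2.1, 5.0)))

# Per-gene gap between the two centroids: genes with a large gap are the
# ones that distinguish the treatments in this toy data set.
gaps = [abs(d - h) for d, h in zip(model["Diseased"], model["Healthy"])]
print(gaps)
```

In this toy data only the first gene differs noticeably between the groups, which is the kind of gene the goal on this slide is after.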

10
Gene Expression Example
ID  G1     G2     G3    G4     Cancer
1   11.12  1.34   1.97  11.0   No
2   12.34  2.01   1.22  11.1   No
3   13.11  1.34   1.34  2.0    Yes
4   13.34  11.11  1.38  2.23   Yes
5   14.11  13.10  1.06  2.44   Yes
6   11.34  14.21  1.07  1.23   No
7   21.01  12.32  1.97  1.34   Yes
8   66.11  33.3   1.97  1.34   Yes
9   33.11  44.1   1.96  11.23  Yes
10  11.54  11.1   1.97  10.01  Yes
11  12.00  15.1   1.98  9.01   Yes
12  15.23  1.11   1.89  12.48  No
13  31.22  2.0    1.99  13.51  Yes
14  11.33  11.1   1.01  11.01  No
15  ..     ..     ..    ..
Problem: with a large number of genes (10000), feature
selection/reduction techniques are needed
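
One simple form of feature selection is to score each gene by how well it separates the two classes and keep the top-ranked genes. The sketch below uses rows 1 to 14 as read from the table on this slide. The score (difference of class means divided by the sum of class standard deviations) and the cut-off TOP_K are assumptions for illustration; the slides do not name a specific technique.

```python
from statistics import mean, stdev

GENES = ["G1", "G2", "G3", "G4"]
# Rows 1-14 as read from the slide's table: G1..G4, then the Cancer label.
SAMPLES = [
    (11.12, 1.34, 1.97, 11.0, "No"),
    (12.34, 2.01, 1.22, 11.1, "No"),
    (13.11, 1.34, 1.34, 2.0, "Yes"),
    (13.34, 11.11, 1.38, 2.23, "Yes"),
    (14.11, 13.10, 1.06, 2.44, "Yes"),
    (11.34, 14.21, 1.07, 1.23, "No"),
    (21.01, 12.32, 1.97, 1.34, "Yes"),
    (66.11, 33.3, 1.97, 1.34, "Yes"),
    (33.11, 44.1, 1.96, 11.23, "Yes"),
    (11.54, 11.1, 1.97, 10.01, "Yes"),
    (12.00, 15.1, 1.98, 9.01, "Yes"),
    (15.23, 1.11, 1.89, 12.48, "No"),
    (31.22, 2.0, 1.99, 13.51, "Yes"),
    (11.33, 11.1, 1.01, 11.01, "No"),
]


def separation_score(samples, index):
    """|mean(Yes) - mean(No)| / (stdev(Yes) + stdev(No)) for one gene."""
    yes = [row[index] for row in samples if row[-1] == "Yes"]
    no = [row[index] for row in samples if row[-1] == "No"]
    return abs(mean(yes) - mean(no)) / (stdev(yes) + stdev(no))


scores = {gene: separation_score(SAMPLES, i) for i, gene in enumerate(GENES)}
ranking = sorted(GENES, key=scores.get, reverse=True)

TOP_K = 2  # hypothetical cut-off; with thousands of genes this would be tuned
selected = ranking[:TOP_K]

for gene in ranking:
    print(f"{gene}: {scores[gene]:.3f}")
print("selected:", selected)
```

A per-gene filter like this is cheap enough to run over thousands of genes, but it scores each gene on its own and so ignores interactions between genes.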