Title: Supervised Learning
1. Lecture 6
2. Unsupervised Learning
- Learning From Unlabeled Data
- Clustering, Correlation, PCA
- Identify relationships between the features
3. Supervised Learning
- Learning From Labeled Data
- Neural Networks, Support Vector Machines, Decision Trees
- Identify relationships between the features and the categories
4. Supervised Learning
- Given
- Examples whose feature values are known and
- Whose categories are known
- Do
- Predict the categories of examples whose feature values are known but whose categories are not
5. 4 ways of representing gene-chip data
From Molla et al., AI Magazine 25, 2004
6. Typical Methodology
- N-fold cross-validation (sketched below)
- Split labeled data into N (usually 10) folds
- Train on all but one (N-1) of the folds
- Test on the left-out fold (ignoring the category label)
- Repeat until all N folds have been tested
- Note: this methodology can be (and is) used to compare different predictive statistical models as well.
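A minimal sketch of the N-fold procedure above, in Python; `train` and `predict` are hypothetical placeholders for whichever learner is being evaluated, and the shuffling and accuracy metric are assumptions for illustration.

```python
import random

def n_fold_cross_validation(examples, train, predict, n_folds=10, seed=0):
    """examples: list of (features, label) pairs; returns mean held-out accuracy."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)            # shuffle before splitting
    folds = [examples[i::n_folds] for i in range(n_folds)]

    accuracies = []
    for i in range(n_folds):
        test_fold = folds[i]
        train_folds = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(train_folds)                   # train on the other N-1 folds
        correct = sum(predict(model, x) == y for x, y in test_fold)
        accuracies.append(correct / len(test_fold))

    return sum(accuracies) / n_folds                 # average over the N held-out folds
```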
7. Example
8. Oligonucleotide Microarrays
- Specific probes synthesized at known spots on the chip's surface
- Probes complementary to RNA of genes to be measured
- Typical gene (1 kb) MUCH longer than typical probe (24 bases)
9. Probes: Good vs. Bad
[Figure: probes (blue) hybridizing to sample RNA (red), contrasting a good probe with a bad probe]
10. Probe-Picking Method Needed
- Hybridization characteristics differ between probes
- A probe set represents a very small subset of the gene
- Accurate measurement of expression requires a good probe set
11. Related Work
- Use known hybridization characteristics
- Lockhart et al. 1996
- Melting point (Tm) predictions
- Kurata and Suyama 1999
- Li and Stormo 2001
- Stable secondary structure
- Kurata and Suyama 1999
12. Our Approach
- Apply established machine-learning algorithms
- Train on categorized examples
- Test on examples with category hidden
- Choose features to represent probes
- Categorize probes as good or bad
13. The Features
14. The Data
- Tilings of 8 genes (from E. coli and B. subtilis); tiling sketched below
- Every possible probe (10,000 probes)
- Genes known to be expressed in sample
- Gene sequence: GTAGCTAGCATTAGCATGGCCAGTCATG
- Complement:    CATCGATCGTAATCGTACCGGTCAGTAC
- Probe 1: CATCGATCGTAATCGTACCGGTCA
- Probe 2: ATCGATCGTAATCGTACCGGTCAG
- Probe 3: TCGATCGTAATCGTACCGGTCAGT
- ...
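A minimal sketch of how such a tiling is produced: complement the gene sequence and slide a 24-base window along it one position at a time (using the example sequence from this slide; function names are illustrative).

```python
PROBE_LENGTH = 24
BASE_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(seq):
    return "".join(BASE_COMPLEMENT[base] for base in seq)

def tile_probes(gene_seq, probe_length=PROBE_LENGTH):
    comp = complement(gene_seq)
    return [comp[i:i + probe_length] for i in range(len(comp) - probe_length + 1)]

gene = "GTAGCTAGCATTAGCATGGCCAGTCATG"
for i, probe in enumerate(tile_probes(gene), start=1):
    print(f"Probe {i}: {probe}")   # Probe 1: CATCGATCGTAATCGTACCGGTCA, ...
```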
15. Our Microarray
16. Defining our Categories
- Low-intensity probes: BAD (45)
- Mid-intensity probes: not used in the training set (23)
- High-intensity probes: GOOD (32)
[Histogram: frequency vs. normalized probe intensity, with marks at 0, 0.05, 0.15, and 1.0]
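A minimal sketch of the category assignment implied above, assuming the cut-offs fall at the 0.05 and 0.15 marks on the intensity axis; those thresholds are an assumption read from the slide, not values stated in the text.

```python
def categorize(normalized_intensity, low_cutoff=0.05, high_cutoff=0.15):
    """Assign a training category from a probe's normalized intensity (assumed cut-offs)."""
    if normalized_intensity <= low_cutoff:
        return "BAD"      # low-intensity probe
    if normalized_intensity >= high_cutoff:
        return "GOOD"     # high-intensity probe
    return None           # mid-intensity: excluded from the training set
```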
17. The Machine Learning Techniques
- Naïve Bayes (Mitchell 1997)
- Neural Networks (Rumelhart et al. 1995)
- Decision Trees (Quinlan 1996)
- Can interpret predictions of each learner probabilistically
18. Naïve Bayes
- Assumes conditional independence between features
- Makes judgments about test-set examples based on conditional probability estimates made on the training set
19. Naïve Bayes
For each example in the test set, evaluate the following:
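A minimal sketch of the standard naïve Bayes evaluation (Mitchell 1997), assuming the usual rule of scoring each class c ∈ {GOOD, BAD} by P(c) · Π_i P(f_i | c) with probabilities estimated by counting on the training set; the dictionary-based feature representation and the Laplace smoothing constant are assumptions for illustration.

```python
import math
from collections import defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_dict, label) pairs with labels 'GOOD' or 'BAD'."""
    class_counts = defaultdict(float)
    feature_counts = defaultdict(lambda: defaultdict(float))
    for features, label in examples:
        class_counts[label] += 1
        for name, value in features.items():
            feature_counts[label][(name, value)] += 1
    return class_counts, feature_counts

def classify(model, features, smoothing=1.0):
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)            # log P(c); log space avoids underflow
        for name, value in features.items():
            numerator = feature_counts[label][(name, value)] + smoothing
            denominator = count + 2 * smoothing    # assumes roughly binary-valued features
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)             # argmax over GOOD / BAD
```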
20. Neural Network (1-of-n encoding with probe length 3)
[Network diagram: each probe position encoded as four input units (A1 C1 G1 T1, A2 C2 G2 T2, A3 C3 G3 T3); example probe sequence CAG; weighted connections to an output unit labeling the probe Good or Bad; activation flows forward and error propagates backward]
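A minimal sketch of the 1-of-n (one-hot) input encoding pictured above; the network layers and training loop are not sketched here, and the function name is illustrative.

```python
BASES = "ACGT"

def one_of_n_encode(probe):
    """Encode each base of the probe as four binary inputs, one per nucleotide."""
    encoding = []
    for base in probe:
        encoding.extend(1 if base == b else 0 for b in BASES)
    return encoding

# The slide's example: the 3-base probe "CAG" becomes 12 input units.
print(one_of_n_encode("CAG"))
# [0, 1, 0, 0,  1, 0, 0, 0,  0, 0, 1, 0]  ->  C, then A, then G
```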
21. Decision Tree
- Automatically builds a tree of rules
[Tree diagram: internal nodes test composition features (fracC, fracT, fracG, fracTC, fracAC) with High/Low branches, plus a base-position test (n14) branching on A/C/G/T; leaves label the probe as a Good Probe or a Bad Probe]
22. Decision Tree
The information gain of a feature, F, is:
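Assuming the standard Quinlan-style definition, Gain(S, F) = Entropy(S) − Σ_{v ∈ Values(F)} (|S_v| / |S|) · Entropy(S_v), here is a minimal sketch of computing it over discretized feature values (the High/Low discretization is an assumption here).

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_values, labels):
    """feature_values: one value per probe (e.g. 'High'/'Low'); labels: 'GOOD'/'BAD'."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(feature_values):
        subset = [lab for val, lab in zip(feature_values, labels) if val == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain
```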
23. Information Gain per Feature
[Bar chart: normalized information gain for probe-composition features, base-position features (by base position), and dimer-position features]
24. Cross-Validation
- Leave-one-out testing
- For each gene (of the 8)
- Train on all but this gene
- Test on this gene
- Record result
- Forget what was learned
- Average results across 8 test genes
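A minimal sketch of this leave-one-gene-out loop; `train` and `evaluate` are hypothetical placeholders for whichever learner (naïve Bayes, neural network, or decision tree) is being tested.

```python
def leave_one_gene_out(probes_by_gene, train, evaluate):
    """probes_by_gene: dict mapping gene name -> list of labeled probe examples."""
    results = {}
    for held_out_gene, test_probes in probes_by_gene.items():
        training_probes = [p for gene, probes in probes_by_gene.items()
                           if gene != held_out_gene
                           for p in probes]
        model = train(training_probes)        # retrain from scratch: "forget what was learned"
        results[held_out_gene] = evaluate(model, test_probes)
    return sum(results.values()) / len(results)   # average across the 8 held-out genes
```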
25. Typical Probe-Intensity Prediction Across Short Region
[Plot: actual normalized probe intensity vs. starting nucleotide position of the 24-mer probe]
26. Typical Probe-Intensity Prediction Across Short Region
[Plot: normalized probe intensity vs. starting nucleotide position of the 24-mer probe; actual intensity compared with neural network, naïve Bayes, and decision tree predictions]
27. Probe-Picking Results
[Plot: number of selected probes with intensity above the 90th percentile vs. number of probes selected; perfect selector shown for reference]
28. Probe-Picking Results
[Plot: number of selected probes with intensity above the 90th percentile vs. number of probes selected, comparing the perfect selector, neural network, naïve Bayes, decision tree, and primer melting point]
29. A couple of final notes for the class
30. What is Principal Component Analysis (PCA)? In 1 slide
31. PCA Tutorial
- http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
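A minimal "PCA in a few lines" sketch to go with the one-slide summary: center the data, take the eigenvectors of the covariance matrix, and project onto the top components (numpy only; names are illustrative).

```python
import numpy as np

def pca(data, n_components=2):
    """data: (n_samples, n_features) array. Returns (projected data, components)."""
    centered = data - data.mean(axis=0)              # center each feature
    cov = np.cov(centered, rowvar=False)             # feature covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: covariance is symmetric
    order = np.argsort(eigenvalues)[::-1]            # sort by decreasing variance
    components = eigenvectors[:, order[:n_components]]
    return centered @ components, components
```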
32. Vocal Tract → Slide Whistle