Boosting and predictive modeling

Transcript and Presenter's Notes
1
Boosting and predictive modeling
  • Yoav Freund
  • Columbia University

2
What is data mining?
  • Lots of data - complex models
  • Classifying customers using transaction logs.
  • Classifying events in high-energy physics
    experiments.
  • Object detection in computer vision.
  • Predicting gene regulatory networks.
  • Predicting stock prices and portfolio management.

3
Leo Breiman, "Statistical Modeling: The Two
Cultures", Statistical Science, 2001
  • The data modeling culture (generative modeling):
  • Assume a stochastic model (5-50 parameters).
  • Estimate model parameters.
  • Interpret the model and make predictions.
  • Estimated population: 98% of statisticians.
  • The algorithmic modeling culture (predictive
    modeling):
  • Assume the relationship between predictor
    variables and response variables has a functional
    form (10^2 - 10^6 parameters).
  • Search (efficiently) for the best prediction
    function.
  • Make predictions.
  • Interpretation / causation - mostly an
    afterthought.
  • Estimated population: 2% of statisticians (many
    in other fields).

4
Toy Example
  • Computer receives telephone call
  • Measures Pitch of voice
  • Decides gender of caller

5
Generative modeling
[Figure: fitted class distributions over voice pitch]
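The original figure is not preserved; as a sketch of
the generative route (assuming one Gaussian per
class, which the figure may or may not have used):
fit $p(\text{pitch} \mid \text{gender})$ for each
gender and predict with Bayes rule,

$$\hat{y}(x) = \arg\max_{y \in \{\text{male},\,\text{female}\}} \hat{P}(y)\, \mathcal{N}\!\left(x;\, \hat\mu_y, \hat\sigma_y^2\right).$$

The discriminative approach on the next slide
instead fits the decision boundary (here, a single
pitch threshold) directly.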
6
Discriminative approach
[Figure: discriminative decision rule over voice pitch]
7
Ill-behaved data
[Figure: ill-behaved voice-pitch data]
8
Plan of talk
  • Boosting
  • Alternating Decision Trees
  • Data-mining AT&T transaction logs.
  • The I/O bottleneck in data-mining.
  • Resistance of boosting to over-fitting.
  • Confidence-rated prediction.
  • Confidence-rating for object recognition.
  • Gene regulation modeling.
  • Summary

9
Plan of talk
  • Boosting: combining weak classifiers.
  • Alternating Decision Trees
  • Data-mining AT&T transaction logs.
  • The I/O bottleneck in data-mining.
  • Resistance of boosting to over-fitting.
  • Confidence-rated prediction.
  • Confidence-rating for object recognition.
  • Gene regulation modeling.
  • Summary

10
Batch learning for binary classification
11
A weighted training set
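The slide's picture is missing from the transcript;
in standard boosting notation (my reconstruction,
not verbatim), a weighted training set is

$$(x_1, y_1, w_1), \ldots, (x_m, y_m, w_m), \qquad y_i \in \{-1, +1\}, \quad w_i \ge 0, \quad \sum_{i=1}^{m} w_i = 1,$$

where the weight $w_i$ says how much the learner
should care about getting example $i$ right.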
12
A weak learner
[Diagram: weighted training set -> weak learner -> weak rule h]
13
The boosting process

14
AdaBoost
Freund & Schapire, 1997
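The update equations on this slide were an image.
Below is a minimal Python sketch of AdaBoost as
described in Freund & Schapire (1997);
`weak_learner` is a hypothetical callback (any
learner that accepts example weights, e.g. a
decision stump) and is my naming, not the slide's.

```python
import numpy as np

def adaboost(X, y, T, weak_learner):
    """AdaBoost sketch. X: (m, d) array, y: labels in {-1, +1}.
    weak_learner(X, y, w) must return a callable h with h(X) in {-1, +1}."""
    m = len(y)
    w = np.full(m, 1.0 / m)                   # start with uniform weights
    rules, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)             # fit the weighted training set
        pred = h(X)
        eps = max(w[pred != y].sum(), 1e-12)  # weighted error (clamped away from 0)
        if eps >= 0.5:                        # no advantage over random guessing
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        w *= np.exp(-alpha * y * pred)        # up-weight the mistakes
        w /= w.sum()                          # renormalize to a distribution
        rules.append(h)
        alphas.append(alpha)

    def final_rule(Xq):
        # weighted-majority vote of the collected weak rules
        return np.sign(sum(a * h(Xq) for a, h in zip(alphas, rules)))
    return final_rule
```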
15
Main property of AdaBoost
  • If the advantages of the weak rules over random
    guessing are γ_1, γ_2, ..., γ_T, then the
    training error of the final rule is at most
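(The formula was an image in the original;
reconstructing it from Freund & Schapire (1997),
where round t's weighted error is
$\varepsilon_t = 1/2 - \gamma_t$:)

$$\prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \;\le\; \exp\!\left(-2 \sum_{t=1}^{T} \gamma_t^2\right).$$

So any persistent advantage over random guessing
drives the training error down exponentially fast.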

16
Boosting block diagram
17
AdaBoost as gradient descent
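This slide's derivation was graphical; the standard
reading (my summary, not verbatim) is that AdaBoost
performs coordinate-wise gradient descent on the
exponential loss of the combined rule
$F(x) = \sum_t \alpha_t h_t(x)$:

$$L(F) = \frac{1}{m} \sum_{i=1}^{m} \exp\big(-y_i F(x_i)\big),$$

each round adding the weak rule $h_t$ and step size
$\alpha_t$ that most decrease $L$. The example
weights are, up to normalization, exactly the
per-example losses $e^{-y_i F(x_i)}$, which is why
hard examples get heavier.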
18
Plan of talk
  • Boosting
  • Alternating Decision Trees: a hybrid of boosting
    and decision trees.
  • Data-mining AT&T transaction logs.
  • The I/O bottleneck in data-mining.
  • Resistance of boosting to over-fitting.
  • Confidence-rated prediction.
  • Confidence-rating for object recognition.
  • Gene regulation modeling.
  • Summary

19
Decision Trees
[Figure: a decision tree splitting on X > 3 and Y > 5, with leaf predictions +1 / -1]
20
A decision tree as a sum of weak rules.
[Figure: the same tree expressed as a sum of weak rules, with real-valued scores (e.g. -0.3 ... 0.2) attached to the rules]
21
An alternating decision tree
Freund, Mason 1997
[Figure: an alternating decision tree with real-valued prediction nodes (e.g. -0.2, 0.7)]
22
Example: Medical Diagnostics
  • Cleve dataset from the UC Irvine repository.
  • Heart disease diagnostics (+1 = healthy,
    -1 = sick).
  • 13 features from tests (real-valued and
    discrete).
  • 303 instances.

23
AD-tree for heart-disease diagnostics
Score > 0: healthy; score < 0: sick
24
Plan of talk
  • Boosting
  • Alternating Decision Trees
  • Data-mining AT&T transaction logs.
  • The I/O bottleneck in data-mining.
  • Resistance of boosting to over-fitting.
  • Confidence-rated prediction.
  • Confidence-rating for object recognition.
  • Gene regulation modeling.
  • Summary

25
AT&T "buisosity" problem
Freund, Mason, Rogers, Pregibon, Cortes 2000
  • Distinguish business/residence customers from
    call detail information (time of day, length of
    call, ...).
  • 230M telephone numbers, label unknown for 30%.
  • 260M calls / day.
  • Required computer resources:
  • Huge: counting log entries to produce statistics
    -- use specialized I/O-efficient sorting
    algorithms (Hancock).
  • Significant: calculating the classification for
    70M customers.
  • Negligible: learning (2 hours on 10K training
    examples on an off-line computer).
26
AD-tree for buisosity
27
AD-tree (Detail)
28
Quantifiable results
  • At 94% accuracy, increased coverage from 44% to
    56%.
  • Saved AT&T $15M in the year 2000 in operations
    costs and missed opportunities.

29
Plan of talk
  • Boosting
  • Alternating Decision Trees
  • Data-mining AT&T transaction logs.
  • The I/O bottleneck in data-mining.
  • Resistance of boosting to over-fitting.
  • Confidence-rated prediction.
  • Confidence-rating for object recognition.
  • Gene regulation modeling.
  • Summary

30
The database bottleneck
  • Physical limit: a disk seek takes 0.01 sec.
  • Same time to read/write 10^5 bytes.
  • Same time to perform 10^7 CPU operations.
  • Commercial DBMSs are optimized for varying queries
    and transactions.
  • Statistical analysis requires evaluation of fixed
    queries on massive data streams.
  • Keeping disk I/O sequential is key.
  • Data compression improves I/O speed but restricts
    random access (see the back-of-envelope estimate
    below).
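A back-of-envelope estimate (my arithmetic from the
numbers above, not on the original slide): touching
each of the 260M daily call records with one random
0.01 s seek would take 260e6 x 0.01 s ≈ 2.6e6 s,
about 30 days. Reading the same records sequentially
at the implied ~10 MB/s (10^5 bytes per seek time)
takes well under an hour if records are on the order
of 100 bytes, so sequential layout wins by roughly
three orders of magnitude.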

31
CS theory regarding very large data-sets
  • Massive datasets: you pay $1 per disk block you
    read/write and ε per CPU operation; internal
    memory can store N disk blocks.
  • Example problem: given a stream of line segments
    (in the plane), identify all segment pairs that
    intersect.
  • Vitter, Motwani, Indyk, ...
  • Property testing: you can only look at a small
    fraction of the data.
  • Example problem: decide whether a given graph is
    bipartite by testing only a small fraction of
    the edges.
  • Rubinfeld, Ron, Sudan, Goldreich, Goldwasser, ...

32
Plan of talk
  • Boosting
  • Alternating Decision Trees
  • Data-mining AT&T transaction logs.
  • The I/O bottleneck in data-mining.
  • Resistance of boosting to over-fitting.
  • Confidence-rated prediction.
  • Confidence-rating for object recognition.
  • Gene regulation modeling.
  • Summary

33
A very curious phenomenon
Boosting decision trees
Using <10,000 training examples we fit >2,000,000
parameters
34
Large margins
Thesis: large margins => reliable predictions.
Very similar to SVMs.
35
Experimental Evidence
36
Theorem
Schapire, Freund, Bartlett & Lee, Annals of
Statistics, 1998
H: a set of binary functions with VC-dimension d.
No dependence on the number of combined functions!
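The bound itself was an image; reconstructed from
Schapire et al. (1998), up to constants and log
factors: for any convex combination $f$ of functions
from $H$, any margin threshold $\theta > 0$, and a
sample $S$ of $m$ examples, with high probability

$$P_{\mathcal{D}}\big[\,y f(x) \le 0\,\big] \;\le\; P_{S}\big[\,y f(x) \le \theta\,\big] + \tilde{O}\!\left(\sqrt{\frac{d}{m\,\theta^{2}}}\right).$$

The generalization error is controlled by the
empirical margin distribution and the VC-dimension
$d$ of the base class alone, however many functions
are combined.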
37
Idea of Proof
38
Plan of talk
  • Boosting
  • Alternating Decision Trees
  • Data-mining AT&T transaction logs.
  • The I/O bottleneck in data-mining.
  • Resistance of boosting to over-fitting.
  • Confidence-rated prediction.
  • Confidence-rating for object recognition.
  • Gene regulation modeling.
  • Summary

39
A motivating example
[Figure: example points of uncertain label, marked "?"]
40
The algorithm
Freund, Mansour, Schapire, Annals of Statistics,
August 2004
41
Suggested tuning
Yields
42
Confidence Rating block diagram
43
Summary of Confidence-Rated Classifiers
  • Frequentist explanation for the benefits of
    model averaging.
  • Separates inherent uncertainty from uncertainty
    due to the finite training set.
  • Computational hardness: unknown other than in a
    few special cases.
  • Margins from boosting or SVMs can be used as an
    approximation.
  • Many practical applications!

44
Plan of talk
  • Boosting
  • Alternating Decision Trees
  • Data-mining AT&T transaction logs.
  • The I/O bottleneck in data-mining.
  • Resistance of boosting to over-fitting.
  • Confidence-rated prediction.
  • Confidence-rating for object recognition.
  • Gene regulation modeling.
  • Summary

45
Face Detection - Using confidence to save time
Viola Jones 1999
  • Paul Viola and Mike Jones developed a face
    detector that can work in real time (15 frames
    per second).

46
Image Features
Rectangle filters: similar to Haar wavelets
(Papageorgiou et al.); a sketch of how they are
evaluated follows.
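The filter illustrations are missing from the
transcript. As a hedged sketch (mine, not the
slide's) of how such rectangle features are
evaluated cheaply, using the integral-image trick
from the Viola-Jones detector:

```python
import numpy as np

def integral_image(img):
    """Cumulative 2-D sum, so any rectangle sum costs 4 lookups."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1], computed from the integral image ii."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]      # remove the strip above
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]      # remove the strip to the left
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]      # add back the doubly removed corner
    return total

def two_rect_feature(ii, r, c, h, w):
    """A Haar-like feature: left half minus right half of a 2w-wide box."""
    left = rect_sum(ii, r, c, r + h, c + w)
    right = rect_sum(ii, r, c + w, r + h, c + 2 * w)
    return left - right
```

Once the integral image is built (one pass over the
image), every feature costs a constant number of
array lookups regardless of its size, which is what
makes evaluating thousands of candidate features
per window feasible.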
47
Example Classifier for Face Detection
A classifier with 200 rectangle features was
learned using AdaBoost: 95% correct detection on
the test set with 1 false positive in 14,084. Not
quite competitive...
[Figure: ROC curve for the 200-feature classifier]
48
Employing a cascade to minimize average detection
time
The accurate detector combines 6000 simple
features using AdaBoost.
In most boxes, only 8-9 features are calculated
(sketched below).
[Figure: cascade stages; the first stage uses features 1-3, the next features 4-10, ...]
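A minimal sketch (my illustration, not the
Viola-Jones code) of why the cascade saves time:
each stage is a boosted classifier with its own
threshold, and a window is rejected as soon as any
stage says no, so most windows only pay for the
first few features.

```python
def cascade_detect(window, stages):
    """stages: list of (classifier, threshold) pairs, cheapest first.
    classifier(window) returns a real-valued confidence score."""
    for classifier, threshold in stages:
        if classifier(window) < threshold:
            return False   # rejected early -- most windows stop here
    return True            # survived every stage: report a face
```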
49
Using confidence to avoid labeling
Levin, Viola, Freund 2003
50
Image 1
51
Image 1 - diff from time average
52
Image 2
53
Image 2 - diff from time average
54
Co-training
Blum and Mitchell 98
[Diagram: highway images -> raw B/W image -> partially trained B/W-based classifier; -> difference image -> partially trained diff-based classifier; each classifier labels data for the other]
55
(No Transcript)
56
Co-Training Results
57
Plan of talk
  • Boosting
  • Alternating Decision Trees
  • Data-mining AT&T transaction logs.
  • The I/O bottleneck in data-mining.
  • Resistance of boosting to over-fitting.
  • Confidence-rated prediction.
  • Confidence-rating for object recognition.
  • Gene regulation modeling.
  • Summary

58
Gene Regulation
  • Regulatory proteins bind to the non-coding
    regulatory sequence of a gene to control its
    rate of transcription.

59
From mRNA to Protein
[Figure: from mRNA to protein; labeled feature: nucleus wall]
60
Protein Transcription Factors
61
Genome-wide Expression Data
  • Microarrays measure mRNA transcript expression
    levels for all of the 6000 yeast genes at once.
  • Very noisy data.
  • Rough time slice over all compartments of many
    cells.
  • Protein expression is not observed.

62
Partial Parts List for Yeast
  • Many known and putative:
  • Transcription factors
  • Signaling molecules that activate transcription
    factors
  • Known and putative binding-site motifs
  • In yeast, the regulatory sequence is the 500 bp
    upstream region.

63
GeneClass: Problem Formulation
M. Middendorf, A. Kundaje, C. Wiggins,
Y. Freund, C. Leslie. Predicting Genetic
Regulatory Response Using Classification. ISMB
2004.
  • Predict target gene regulatory response from
    regulator activity and binding site data

64
Role of quantization
By quantizing expression into three classes we
reduce noise but maintain most of the signal.
Weighting +1/-1 examples linearly with expression
level performs slightly better.
65
Problem setup
  • Data point: (target gene) x (microarray)
  • Input features:
  • Parent state in {-1, 0, 1}
  • Motif presence in {0, 1}
  • Predicted output:
  • Target gene label in {-1, +1}

66
Boosting with Alternating Decision Trees (ADTs)
  • Use boosting to build a single ADT, a
    margin-based generalization of a decision tree.

Splitter node: "Is motif MIG1 present AND parent
XBP1 up?"
Prediction node: F(x) is given by the sum of the
prediction nodes along all paths consistent with x
(see the sketch below).
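A minimal sketch (assumed structure, not the
GeneClass code) of how an ADT scores an example:
each node carries a real-valued contribution, and a
node's subtree is only opened when its predicate
holds, so F(x) sums every prediction node reachable
along predicates consistent with x.

```python
class ADTNode:
    def __init__(self, value, predicate=None, children=()):
        self.value = value          # real-valued prediction contribution
        self.predicate = predicate  # None for the root; else x -> bool
        self.children = children    # child ADTNodes

def adt_score(node, x):
    """Sum prediction values along all paths consistent with x."""
    if node.predicate is not None and not node.predicate(x):
        return 0.0                  # predicate fails: skip the whole subtree
    return node.value + sum(adt_score(c, x) for c in node.children)

# Classification is the sign of the total score;
# its magnitude |score| serves as a confidence (margin).
```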
67
Statistical Validation
  • 10-fold cross-validation experiments; 50,000
    (gene, microarray) training examples.
  • Significant correlation between prediction score
    and true log expression ratio on held-out data.
  • Prediction accuracy on +1/-1 labels: 88.5%.

68
Biological Interpretation: From Correlation to
Causation
  • Good prediction only implies correlation.
  • To infer causation we need to integrate
    additional knowledge.
  • Comparative case studies: train on similar
    conditions (stresses), test on related
    experiments.
  • Extract significant features from the learned
    model:
  • Iteration score (IS): the boosting iteration at
    which a feature first appears. Identifies
    significant motifs and motif-parent pairs.
  • Abundance score (AS): the number of nodes in the
    ADT containing a feature. Identifies important
    regulators.
  • In silico knock-outs: remove a significant
    regulator and retrain.

69
Case Study: Heat Shock and Osmolarity
  • Training set: heat shock, osmolarity, amino acid
    starvation.
  • Test set: stationary phase, simultaneous heat
    shock + osmolarity.
  • Results:
  • Test error: 9.3%.
  • Supports the Gasch hypothesis: the heat shock and
    osmolarity pathways are independent and additive.
  • High-scoring parents (AS): USV1 (stationary phase
    and heat shock), PPT1 (osmolarity response), GAC1
    (response to heat).

70
Case Study: Heat Shock and Osmolarity (continued)
  • Results:
  • High-scoring binding sites (IS):
  • MSN2/MSN4 STRE element
  • Heat shock related: HSF1 and RAP1 binding sites
  • Osmolarity/glycerol pathways: CAT8, MIG1, GCN4
  • Amino acid starvation: GCN4, CHA4, MET31
  • High-scoring motif-parent pair (IS):
  • TPK1-STRE pair (kinase that regulates MSN2 via
    cellular localization) -- an indirect effect.

[Figure legend: direct binding, indirect effect, co-occurrence]
71
Case Study: In Silico Knockout
  • Training and test sets: same as the heat shock
    and osmolarity case study.
  • Knockout: remove USV1 from the regulator list and
    retrain.
  • Results:
  • Test error: 12% (up from 9.3%).
  • Identify putative downstream targets of USV1:
    target genes that change from a correct to an
    incorrect label.
  • GO annotation analysis reveals putative
    functions: nucleoside transport, cell-wall
    organization and biogenesis, heat-shock protein
    activity.
  • Putative functions match those identified in a
    wet-lab USV1 knockout (Segal et al., 2003).

72
Conclusions: Gene Regulation
  • A new predictive model for the study of gene
    regulation.
  • The first gene regulation model to make
    quantitative predictions.
  • Uses actual expression levels - no clustering.
  • Strong prediction accuracy on held-out
    experiments.
  • Interpretable hypotheses: significant regulators,
    binding motifs, regulator-motif pairs.
  • New methodology for biological analysis:
    comparative training/test studies, in silico
    knockouts.

73
Plan of talk
  • Boosting
  • Alternating Decision Trees
  • Data-mining AT&T transaction logs.
  • The I/O bottleneck in data-mining.
  • Resistance of boosting to over-fitting.
  • Confidence-rated prediction.
  • Confidence-rating for object recognition.
  • Gene regulation modeling.
  • Summary

74
Summary
  • Moving from density estimation to classification
    can make hard problems tractable.
  • Boosting is an efficient and flexible method for
    constructing complex and accurate classifiers.
  • I/O is the main bottleneck to data-mining;
    sampling, data localization, and parallelization
    help.
  • Correlation -> causation is still a hard problem;
    it requires domain-specific expertise and
    integration of data sources.

75
Future work
  • New applications
  • Bio-informatics.
  • Vision / Speech and signal processing.
  • Information Retrieval and Information Extraction.
  • Theory
  • Improving the robustness of learning algorithms.
  • Utilization of unlabeled examples in
    confidence-rated classification.
  • Sequential experimental design.
  • Relationships between learning algorithms and
    stochastic differential equations.

76
Extra
77
Plan of talk
  • Boosting
  • Alternating Decision Trees
  • Data-mining AT&T transaction logs.
  • The I/O bottleneck in data-mining.
  • High-energy physics.
  • Resistance of boosting to over-fitting.
  • Confidence-rated prediction.
  • Confidence-rating for object recognition.
  • Gene regulation modeling.
  • Summary

78
Analysis for the MiniBooNE experiment
  • Goal: to test for neutrino mass by searching for
    neutrino oscillations.
  • Important because it may lead us to physics
    beyond the Standard Model.
  • The BooNE project began in 1997.
  • The first beam-induced neutrino events were
    detected in September 2002.

MiniBooNE detector (Fermi Lab)
79
MiniBooNE Classification Task
Ion Stancu, UC Riverside
80
(No Transcript)
81
Results
82
Using confidence to reduce labeling
[Diagram: unlabeled data -> partially trained classifier, which queries labels only where confidence is low]
Query-by-committee: Seung, Opper & Sompolinsky;
Freund, Seung, Shamir & Tishby
83
Discriminative approach
[Figure: discriminative decision rule over voice pitch]
84
Results from Yotam Abramson.
85
(No Transcript)