From Feature Construction, to Simple but Effective Modeling, to Domain Transfer

1
From Feature Construction, to Simple but
Effective Modeling, to Domain Transfer
  • Wei Fan
  • IBM T.J.Watson
  • www.cs.columbia.edu/wfan
  • www.weifan.info
  • weifan@us.ibm.com, wei.fan@gmail.com

2
Feature Vector
  • Most data mining and machine learning models
    assume the following structured data:
  • (x1, x2, ..., xk) -> y
  • where the xi's are independent variables
  • and y is the dependent variable
  • y drawn from a discrete set: classification
  • y drawn from a continuous range: regression

3
Frequent Pattern-Based Feature Construction
  • Data not in pre-defined feature vectors
  • Transactions
  • Biological sequences
  • Graph databases

Frequent patterns are good candidates for
discriminative features. So, how do we mine them?
4
FP Sub-graph
(example borrowed from George Karypis's
presentation)
5
Computational Issues
  • A pattern is measured by its frequency or support.
  • E.g., frequent subgraphs with sup >= 10%
  • Cannot enumerate patterns at sup = 10% without first
    enumerating all patterns with sup > 10%.
  • Random sampling does not work, since it is not
    exhaustive.
  • NP-hard problem

6
Conventional Procedure
Two-Step Batch Method
  1. Mine frequent patterns (support > min_sup)
  2. Select the most discriminative patterns
  3. Represent data in the feature space using such
     patterns
  4. Build classification models.

Feature Construction followed by Selection (a minimal sketch follows below)
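A minimal sketch of this two-step batch pipeline, assuming transactions are given as Python sets of items; the naive pattern miner, the information-gain selector, and all function names here are illustrative assumptions, not the authors' implementation:

from itertools import combinations
from math import log2

def mine_frequent_itemsets(transactions, min_sup, max_len=3):
    """Step 1: enumerate itemsets whose support >= min_sup (naive, for illustration only)."""
    items = sorted({i for t in transactions for i in t})
    frequent = []
    for k in range(1, max_len + 1):
        for cand in combinations(items, k):
            sup = sum(1 for t in transactions if set(cand) <= t) / len(transactions)
            if sup >= min_sup:
                frequent.append(frozenset(cand))
    return frequent

def info_gain(transactions, labels, pattern):
    """Step 2 helper: information gain of the binary feature 'pattern occurs in transaction'."""
    def entropy(ys):
        n = len(ys)
        return -sum((ys.count(c) / n) * log2(ys.count(c) / n) for c in set(ys)) if n else 0.0
    hit = [y for t, y in zip(transactions, labels) if pattern <= t]
    miss = [y for t, y in zip(transactions, labels) if not pattern <= t]
    n = len(labels)
    return entropy(labels) - (len(hit) / n) * entropy(hit) - (len(miss) / n) * entropy(miss)

def batch_feature_construction(transactions, labels, min_sup=0.1, top_k=5):
    patterns = mine_frequent_itemsets(transactions, min_sup)          # 1. mine
    patterns.sort(key=lambda p: info_gain(transactions, labels, p),   # 2. select
                  reverse=True)
    selected = patterns[:top_k]
    X = [[int(p <= t) for p in selected] for t in transactions]       # 3. represent
    return X, selected                                                 # 4. feed X to any classifier

Any standard classifier can then be trained on the binary matrix X.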
7
Two Problems
  • Mine step
  1. combinatorial (exponential) explosion
  2. patterns not considered if min-support isn't
     small enough
8
Two Problems
  • Select step
  • issue of discriminative power
  3. InfoGain is computed against the complete dataset, NOT on
     subsets of examples
  4. Correlation is not directly evaluated on the patterns'
     joint predictability
9
Direct Mining and Selection via Model-based Search
Tree
  • Basic Flow: the feature miner and the classifier are built
    together by divide-and-conquer based frequent pattern mining
  • Output: a compact set of highly discriminative patterns
  • Global support of a pattern mined deep in the tree can be
    extremely small, e.g., 10 * 20 / 10000 = 0.02
(Diagram of the model-based search tree; a sketch of the recursion follows below.)
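A minimal sketch of the divide-and-conquer recursion, reusing the illustrative mine_frequent_itemsets and info_gain helpers from the earlier sketch; the stopping rules and the local support threshold are assumptions, not the paper's exact settings:

def model_based_tree(transactions, labels, node_min_sup=0.5, min_size=4,
                     depth=0, max_depth=5):
    """Recursively mine one discriminative pattern per node, on that node's examples only."""
    if len(transactions) < min_size or depth >= max_depth or len(set(labels)) == 1:
        return []                                   # leaf: nothing more to mine
    # Support is measured on this node's (possibly tiny) subset, so the
    # corresponding global support of a deep pattern can be extremely small.
    candidates = mine_frequent_itemsets(transactions, node_min_sup)
    if not candidates:
        return []
    best = max(candidates, key=lambda p: info_gain(transactions, labels, p))
    left = [(t, y) for t, y in zip(transactions, labels) if best <= t]
    right = [(t, y) for t, y in zip(transactions, labels) if not best <= t]
    mined = [best]
    for part in (left, right):                      # divide and conquer on both branches
        if part:
            ts, ys = zip(*part)
            mined += model_based_tree(list(ts), list(ys), node_min_sup,
                                      min_size, depth + 1, max_depth)
    return mined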
10
Analyses (I)
  • Scalability of pattern enumeration
  • Upper bound (Theorem 1)
  • Scale down ratio
  • Bound on number of returned features

11
Analyses (II)
  • Subspace pattern selection
  • Original set
  • Subset
  • Non-overfitting
  • Optimality under exhaustive search

12
Experimental Studies
Itemset Mining (I)
  • Scalability Comparison

Dataset   # Pat using MbT sup   Ratio (# MbT Pat / # Pat using MbT sup)
Adult     252809                0.41
Chess     8                     0
Hypo      423439                0.0035
Sick      4818391               0.00032
Sonar     95507                 0.00775
13
Experimental Studies
Itemset Mining (II)
  • Accuracy of Mined Itemsets

4 wins, 1 loss;
but with a much smaller number of patterns
14
Experimental Studies
Itemset Mining (III)
  • Convergence

15
Experimental Studies
Graph Mining (I)
  • 9 NCI anti-cancer screen datasets
  • The PubChem Project: pubchem.ncbi.nlm.nih.gov
  • Active (positive) class: around 1% - 8.3%
  • 2 AIDS anti-viral screen datasets
  • URL: http://dtp.nci.nih.gov
  • H1: CM+CA (3.5%)
  • H2: CA (1%)

16
Experimental Studies
Graph Mining (II)
  • Scalability

17
Experimental Studies
Graph Mining (III)
  • AUC and Accuracy

AUC: 11 wins
Accuracy: 10 wins, 1 loss
18
Experimental Studies
Graph Mining (IV)
  • AUC of MbT and DT-MbT vs. benchmarks

7 wins, 4 losses
19
Summary
  • Model-based Search Tree
  • Integrates feature mining and construction.
  • Dynamic support:
  • can mine patterns with extremely small support
  • Both a feature constructor and a classifier
  • Not limited to one type of frequent pattern:
    plug-and-play
  • Experimental Results
  • Itemset mining
  • Graph mining
  • New: found a DNA sequence not previously reported
    but which can be explained biologically.
  • Code and datasets available for download

20
How to train models?
  • Even though the true distribution is unknown, we still
    assume that the data is generated by some known
    function.
  • Estimate the parameters inside the function via
  • the training data
  • cross-validation (CV) on the training data
  • After the structure is fixed, learning becomes
    optimization to minimize errors, e.g. (see the formulas below)
  • quadratic loss
  • exponential loss
  • slack variables
  • There will probably always be mistakes unless
  • the chosen model indeed generates the
    distribution, and
  • the data is sufficient to estimate those parameters
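For reference, the loss functions named above can be written as follows (a generic textbook formulation, not tied to any particular method in this talk):

% quadratic loss (regression / probability estimation)
L_{sq}(y, f(x)) = (y - f(x))^2
% exponential loss, with y in {-1, +1} (boosting-style formulations)
L_{exp}(y, f(x)) = e^{-y f(x)}
% hinge loss with slack variables \xi_i >= 0 (SVM-style formulations)
\min_{f, \xi} \; \tfrac{1}{2}\|f\|^2 + C \sum_i \xi_i
  \quad \text{s.t.} \quad y_i f(x_i) \ge 1 - \xi_i, \;\; \xi_i \ge 0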

21
How to train models II
  • List of methods
  • Decision trees
  • RIPPER rule learner
  • CBA: association rules
  • clustering-based methods
  • Not quite sure about the exact function, so use a
    family of free-form functions given some
    preference criteria.
  • Preference criteria
  • The simplest hypothesis that fits the data is the
    best.
  • Heuristics
  • info gain, gini index, Kearns-Mansour, etc.
  • pruning: MDL pruning, reduced-error pruning,
    cost-based pruning
  • Truth: none of the purity-check functions guarantees
    accuracy on unseen test data; they only try to
    build a smaller model
  • There will probably always be mistakes unless
  • the training data is sufficiently large, and
  • the free-form function/criteria are appropriate.

22
Can Data Speak for Themselves?
  • Make no assumption about the true model, neither
    parametric form nor free form.
  • Encode the data in some rather neutral
    representation
  • Think of it like encoding numbers in a computer's
    binary representation.
  • Some numbers can never be represented exactly, but overall
    the encoding is accurate enough.
  • Main challenge
  • Avoid rote learning: do not remember all the
    details
  • Generalization
  • Evenly representing numbers corresponds to evenly
    encoding the data.

23
Potential Advantages
  • If the accuracy is quite good, then
  • the method is quite automatic and easy to use
  • No brainer: DM can be everybody's tool.

24
Encoding Data for Major Problems
  • Classification
  • Given a set of labeled data items, such as (amt,
    merchant category, outstanding balance,
    date/time, ...), where the label is whether it is a
    fraud or non-fraud.
  • Label: a set of discrete values
  • Classifier: predict if a transaction is a fraud
    or non-fraud.
  • Probability Estimation
  • Similar to the above setting: estimate the
    probability that a transaction is a fraud.
  • Difference: no truth is given, i.e., no true
    probability
  • Regression
  • Given a set of valued data items, such as
    (zipcode, capital gain, education, ...); the value of
    interest is annual gross income.
  • Target value: continuous
  • Several other on-going problems

25
Encoding Data in Decision Trees
  • Think of each tree as a way to encode the
    training data.
  • Why a tree? A decision tree records some common
    characteristics of the data, but not every piece
    of trivial detail.
  • Obviously, each tree encodes the data
    differently.
  • Subjective criteria that prefer some encodings
    over others are always ad hoc.
  • If we do not prefer anything, then just do it randomly.
  • Minimize the difference by using multiple encodings,
    and then averaging them.

26
Random Decision Tree to Encode Data
- classification, regression, probability
estimation
  • At each node, an unused feature is chosen
    randomly
  • A discrete feature is unused if it has never
    been chosen previously on the decision path
    from the root to the current node.
  • A continuous feature can be chosen multiple times
    on the same decision path, but each time a
    different threshold value is chosen.

27
Continued
  • We stop when one of the following happens
    (a sketch of the full procedure follows below):
  • A node becomes too small (< 3 examples).
  • The total height of the tree exceeds some
    limit, such as the total number of features.
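A minimal sketch of growing one random decision tree along these lines, assuming examples are (feature dict, label) pairs and a features map marks each feature as 'continuous' or 'discrete' (binary); the names and exact stopping constants are illustrative, not the authors' implementation:

import random

def grow_random_tree(examples, features, used_discrete=frozenset(),
                     depth=0, max_depth=None):
    """examples: list of (x, y) where x is a dict feature -> value;
       features: dict feature -> 'continuous' or 'discrete' (binary 0/1 here)."""
    if max_depth is None:
        max_depth = len(features)                  # e.g. limit height by number of features
    if len(examples) < 3 or depth >= max_depth:
        return {"leaf": [y for _, y in examples]}  # keep node statistics for later
    # choose an unused feature at random; continuous features stay eligible forever
    candidates = [f for f, kind in features.items()
                  if kind == "continuous" or f not in used_discrete]
    if not candidates:
        return {"leaf": [y for _, y in examples]}
    f = random.choice(candidates)
    if features[f] == "continuous":
        values = [x[f] for x, _ in examples]
        thr = random.uniform(min(values), max(values))    # a fresh random threshold each time
        node, used = {"feature": f, "threshold": thr}, used_discrete
        split = lambda x: x[f] <= thr
    else:
        node, used = {"feature": f}, used_discrete | {f}  # discrete feature is now used up
        split = lambda x: x[f] == 1
    left = [(x, y) for x, y in examples if split(x)]
    right = [(x, y) for x, y in examples if not split(x)]
    if not left or not right:                      # degenerate split: stop here
        return {"leaf": [y for _, y in examples]}
    node["left"] = grow_random_tree(left, features, used, depth + 1, max_depth)
    node["right"] = grow_random_tree(right, features, used, depth + 1, max_depth)
    return node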

28
Illustration of RDT
(Diagram of an RDT over features B1: {0,1}, B2: {0,1}, B3: continuous; B1, B2 and B3
are chosen randomly at different nodes, with random thresholds 0.3 and 0.6 for the
continuous feature B3.)
29
Classification
30
Regression
31
Prediction
  • Simply average over multiple trees (a sketch follows below)
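A minimal sketch of prediction with such an ensemble, following the tree structure from the growth sketch above (leaf label lists turned into class posteriors, then averaged across trees); this is an illustration, not the released code:

def tree_posterior(tree, x, classes):
    """Class posterior from the leaf that x falls into."""
    if "leaf" in tree:
        ys = tree["leaf"]
        return {c: ys.count(c) / len(ys) for c in classes}
    go_left = (x[tree["feature"]] <= tree["threshold"]) if "threshold" in tree \
              else (x[tree["feature"]] == 1)
    return tree_posterior(tree["left"] if go_left else tree["right"], x, classes)

def rdt_predict_proba(trees, x, classes):
    """Simple average of per-tree posteriors; argmax gives the class label,
       and the same averaging over leaf means gives regression estimates."""
    probs = [tree_posterior(t, x, classes) for t in trees]
    return {c: sum(p[c] for p in probs) / len(probs) for c in classes}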

32
Potential Advantage
  • Training can be very efficient, particularly
    for very large datasets.
  • No cross-validation-based estimation of
    parameters, as required by some parametric methods.
  • Natural multi-class probability.
  • Natural multi-label classification and
    probability estimation.
  • Imposes very little about the structure of the
    model.

33
Reasons
  • The true distribution P(y|X) is never known.
  • Is it an elephant?
  • Every random tree is not a random guess of this
    P(y|X).
  • Their structure is random, but not the node statistics.
  • Every random tree is consistent with the training
    data.
  • Each tree is quite strong, not weak.
  • In other words, if the distribution is the same,
    each random tree itself is a rather decent model.

34
Expected Error Reduction
  • It is proven that for quadratic loss, such as
  • probability estimation: ( P(y|X) - P(y|X, θ) )^2
  • regression problems: ( y - f(x) )^2
  • General theorem: the expected quadratic loss of
    RDT (and any other model averaging) is less than that of
    any combined model chosen at random (see the sketch below).
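A sketch of the standard averaging argument behind this kind of theorem (the bias-variance decomposition of the quadratic loss); the notation is illustrative, not the paper's exact statement:

% Let f_theta be a model drawn at random (e.g. one random tree) and
% \bar{f}(x) = E_\theta[ f_\theta(x) ] the averaged model. For every (x, y):
E_\theta\big[(y - f_\theta(x))^2\big]
  = (y - \bar{f}(x))^2 + E_\theta\big[(f_\theta(x) - \bar{f}(x))^2\big]
  \;\ge\; (y - \bar{f}(x))^2
% so the averaged model's expected quadratic loss never exceeds that of a
% single model chosen at random; taking expectation over (x, y) preserves this.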

35
Theorem Summary
36
Number of trees
  • Sampling theory
  • A random decision tree can be thought of as a
    sample from a large (infinite, when continuous
    features exist) population of trees.
  • Unless the data is highly skewed, 30 to 50 trees give a
    pretty good estimate with reasonably small
    variance. In most cases, 10 are usually enough.

37
Variance Reduction
38
Optimal Decision Boundary
from Tony Liu's thesis (supervised by Kai Ming
Ting)
39
(No Transcript)
40
Regression Decision Boundary (GUIDE)
  • Properties
  • Broken and discontinuous
  • Some points are far from the truth
  • Some wrong ups and downs

41
RDT Computed Function
  • Properties
  • Smooth and continuous
  • Close to the true function
  • All ups and downs captured

42
Hidden Variable
43
Hidden Variable
  • Limitations of GUIDE
  • Need to decide grouping variables and independent
    variables: a non-trivial task.
  • If all variables are categorical, GUIDE becomes a
    single CART regression tree.
  • Strong assumptions and greedy-based search can
    sometimes lead to very unexpected results.

44
It grows like
45
ICDM08 Cup Crown Winner
  • Nuclear ban monitoring
  • The RDT-based approach was the highest award winner.

46
Ozone Level Prediction (ICDM06 Best Application
Paper)
  • Daily summary maps of two datasets from Texas
    Commission on Environmental Quality (TCEQ)

47
SVM 1-hr criteria CV
48
AdaBoost 1-hr criteria CV
49
SVM 8-hr criteria CV
50
AdaBoost 8-hr criteria CV
51
Other Applications
  • Credit card fraud detection
  • Late and default payment prediction
  • Intrusion detection
  • Semiconductor process control
  • Trading anomaly detection

52
Conclusion
  • Imposing a particular form of model may not be a
    good idea for training highly accurate models for
    general-purpose DM.
  • It may not even be efficient for some forms of
    models.
  • RDT has been shown to solve all three major
    problems in data mining (classification,
    probability estimation and regression) simply,
    efficiently and accurately.
  • When the physical truth is unknown, RDT is highly
    recommended.
  • Code and datasets are available for download.

53
Standard Supervised Learning
training (labeled): New York Times
test (unlabeled): New York Times
Classifier accuracy: 85.5%
54
In Reality
training (labeled): Reuters (labeled New York Times data not available!)
test (unlabeled): New York Times
Classifier accuracy: 64.1%
55
Domain Difference -> Performance Drop
ideal setting: train on NYT (New York Times), test on NYT -> Classifier accuracy 85.5%
realistic setting: train on Reuters, test on NYT (New York Times) -> Classifier accuracy 64.1%
56
A Synthetic Example
Training (with conflicting concepts) vs. Test: the two domains are only partially overlapping.
57
Goal
(Diagram: multiple source domains feeding one target domain.)
  • To unify the knowledge that is consistent with the
    test domain from multiple source domains (models)

58
Summary
  • Transfer from one or multiple source domains
  • The target domain has no labeled examples
  • No re-training is needed
  • Rely on base models trained from each domain
  • The base models are not necessarily developed for
    transfer learning applications

59
Locally Weighted Ensemble
(Diagram: training sets 1, 2, ..., k each produce a base model M1, M2, ..., Mk;
their predictions on a test example x are combined.
x: feature values, y: class label.)
60
Modified Bayesian Model Averaging
(Diagram comparing standard Bayesian Model Averaging with the version modified for
transfer learning; both combine base models M1, ..., Mk on the test set.)
61
Global versus Local Weights
(Table of a toy example: test points x with true labels y, and for models M1 and M2
the global weights wg versus the per-example local weights wl.)
Training
  • Locally weighted scheme
  • The weight of each model is computed per example (a sketch follows below)
  • Weights are determined according to the models'
    performance on the test set, not the training set
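A minimal sketch of the per-example weighted combination, assuming each base model exposes a predict_proba(x) that returns a dict of class -> probability and some heuristic weight_fn(model, x) supplies the local weights (e.g. a wrapper around a graph-based heuristic like the one sketched later); all interface names are illustrative:

def lwe_predict(models, weight_fn, x, classes):
    """Locally weighted ensemble: mix model posteriors with per-example weights."""
    weights = [weight_fn(m, x) for m in models]
    total = sum(weights)
    if total == 0:                                   # no model is trusted at x
        weights, total = [1.0 / len(models)] * len(models), 1.0
    return {c: sum((w / total) * m.predict_proba(x)[c]
                   for w, m in zip(weights, models))
            for c in classes}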

62
Synthetic Example Revisited
(Diagram: models M1 and M2 each capture part of the training data; the training
set contains conflicting concepts and only partially overlaps the test domain.)
63
Optimal Local Weights
  • Example: at a test example x, classifier C1 predicts (0.9, 0.1) and C2
    predicts (0.4, 0.6), while the true conditional distribution is f = (0.8, 0.2).
  • Giving C1 the higher weight w1 = 0.8 and C2 the weight w2 = 0.2 recovers f
    exactly: 0.8 * (0.9, 0.1) + 0.2 * (0.4, 0.6) = (0.8, 0.2), i.e. H w = f with
    H = [0.9 0.4; 0.1 0.6] and w = (0.8, 0.2).
  • Optimal weights
  • Solution to a regression problem (one formulation is sketched below)
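One way to write that regression problem (my reading of the slide's notation; the paper's exact formulation may differ):

% Stack the per-model posteriors h_i(x) = P(y | M_i, x) as columns of H(x) and
% let f(x) = P(y | x) be the true conditional. The per-example weights solve
w^*(x) = \arg\min_{w} \; \| H(x)\, w - f(x) \|_2^2
  \quad \text{s.t.} \quad \sum_i w_i = 1, \;\; w_i \ge 0
% In the toy example above, H = [0.9, 0.4; 0.1, 0.6], f = (0.8, 0.2), w^* = (0.8, 0.2).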

64
Approximate Optimal Weights
  • Optimal weights
  • Impossible to get, since f is unknown!
  • How to approximate the optimal weights
  • M should be assigned a higher weight at x if
    P(y|M, x) is closer to the true P(y|x)
  • If there are some labeled examples in the target domain
  • use these examples to compute the weights
  • If none of the examples in the target domain are
    labeled
  • we need to make some assumptions about the
    relationship between feature values and class
    labels

65
Clustering-Manifold Assumption
Test examples that are closer in feature space
are more likely to share the same class label.
66
Graph-based Heuristics
  • Graph-based weights approximation
  • Map the structures of the models onto the test domain
(Diagram: the clustering structure of the test set and the neighborhood graphs of
M1 and M2 around x determine the weight on x.)
67
Graph-based Heuristics
  • Local weights calculation
  • The weight of a model is proportional to the
    similarity between its neighborhood graph and the
    clustering structure around x (a sketch follows below);
    the more similar model receives the higher weight.
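A minimal sketch of one such similarity, measuring how well a model's prediction-induced neighborhood of x agrees with the cluster that x falls into; the paper's actual graph similarity may be defined differently, so treat this as an assumption:

def graph_weight(x_index, model_labels, cluster_labels):
    """model_labels[i]: class predicted by one base model for test point i;
       cluster_labels[i]: cluster id of test point i from clustering the test set.
       Returns the overlap (Jaccard) between the model's neighborhood of x and x's cluster."""
    same_pred = {i for i, c in enumerate(model_labels) if c == model_labels[x_index]}
    same_cluster = {i for i, c in enumerate(cluster_labels) if c == cluster_labels[x_index]}
    union = same_pred | same_cluster
    return len(same_pred & same_cluster) / len(union) if union else 0.0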

68
Local Structure Based Adjustment
  • Why is adjustment needed?
  • It is possible that no model's structure is
    similar to the clustering structure at x
  • This simply means that the training information is
    conflicting with the true target distribution at x

(Diagram: both M1 and M2 disagree with the clustering structure around x; both are in error.)
69
Local Structure Based Adjustment
  • How to adjust?
  • Check whether the model-to-clustering similarity is below a
    threshold
  • If so, ignore the training information and propagate the
    labels of neighbors in the test set to x

(Diagram: clustering structure around x, with models M1 and M2 ignored.)
70
Verify the Assumption
  • Need to check the validity of this assumption
  • Still, P(y|x) is unknown
  • How to choose the appropriate clustering
    algorithm?
  • Findings from real data sets
  • This property is usually determined by the nature
    of the task
  • Positive cases: document categorization
  • Negative cases: sentiment classification
  • Could validate this assumption on the training
    set

71
Algorithm
  1. Check Assumption
  2. Neighborhood Graph Construction
  3. Model Weight Computation
  4. Weight Adjustment
72
Data Sets
  • Different applications
  • Synthetic data sets
  • Spam filtering: public email collection ->
    personal inboxes (u01, u02, u03) (ECML/PKDD 2006)
  • Text classification: same top-level
    classification problems, with different sub-fields
    in the training and test sets (Newsgroups,
    Reuters)
  • Intrusion detection data: different types of
    intrusions in the training and test sets.

73
Baseline Methods
  • Baseline methods
  • One source domain: single models
  • Winnow (WNN), Logistic Regression (LR), Support
    Vector Machine (SVM)
  • Transductive SVM (TSVM)
  • Multiple source domains
  • SVM on each of the domains
  • TSVM on each of the domains
  • Merge all source domains into one: ALL
  • SVM, TSVM
  • Simple averaging ensemble: SMA
  • Locally weighted ensemble without local-structure-based
    adjustment: pLWE
  • Locally weighted ensemble: LWE
  • Implementation packages
  • Classification: SNoW, BBR, LibSVM, SVMlight
  • Clustering: CLUTO package

74
Performance Measure
  • Prediction Accuracy
  • 0-1 loss: accuracy
  • Squared loss: mean squared error
  • Area Under the ROC Curve (AUC)
  • trade-off between the true positive
    rate and the false positive rate
  • should ideally be 1
  • (a small computation example follows below)
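A small illustration of the three measures using scikit-learn's standard metric functions (purely illustrative; not the scripts used for the reported experiments):

from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score

y_true = [0, 1, 1, 0, 1]             # true labels
y_pred = [0, 1, 0, 0, 1]             # hard predictions (0-1 loss -> accuracy)
y_prob = [0.2, 0.9, 0.4, 0.3, 0.8]   # predicted P(y = 1 | x) (squared loss, AUC)

print(accuracy_score(y_true, y_pred))       # prediction accuracy
print(mean_squared_error(y_true, y_prob))   # mean squared error
print(roc_auc_score(y_true, y_prob))        # area under the ROC curve (1.0 is ideal)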

75
A Synthetic Example
Training (with conflicting concepts) vs. Test: the two domains are only partially overlapping.
76
Experiments on Synthetic Data
77
Spam Filtering
  • Problems
  • Training set: public emails
  • Test set: personal emails from three users: U00,
    U01, U02

(Bar charts: Accuracy and MSE for WNN, LR, SVM, SMA, TSVM, pLWE, LWE.)
78
20 Newsgroup
  • Problems: C vs S, R vs T, R vs S, S vs T, C vs R, C vs T
79
(Bar charts: Accuracy and MSE for WNN, LR, SVM, SMA, TSVM, pLWE, LWE on each task.)
80
Reuters
  • Problems
  • Orgs vs People (O vs Pe)
  • Orgs vs Places (O vs Pl)
  • People vs Places (Pe vs Pl)

(Bar charts: Accuracy and MSE for WNN, LR, SVM, SMA, TSVM, pLWE, LWE.)
81
Intrusion Detection
  • Problems (Normal vs Intrusions)
  • Normal vs R2L (1)
  • Normal vs Probing (2)
  • Normal vs DOS (3)
  • Tasks (train on two intrusion types, test on the third)
  • 2 + 1 -> 3 (DOS)
  • 3 + 1 -> 2 (Probing)
  • 3 + 2 -> 1 (R2L)

82
Conclusions
  • Locally weighted ensemble framework
  • transfers useful knowledge from multiple source
    domains
  • Graph-based heuristics to compute weights
  • make the framework practical and effective
  • Code and dataset available for download

83
More information
  • www.weifan.info or
  • www.cs.columbia.edu/wfan
  • For code, dataset and papers