Title: From Feature Construction, to Simple but Effective Modeling, to Domain Transfer
1From Feature Construction, to Simple but Effective Modeling, to Domain Transfer
- Wei Fan
- IBM T.J.Watson
- www.cs.columbia.edu/wfan
- www.weifan.info
- weifan_at_us.ibm.com, wei.fan_at_gmail.com
2Feature Vector
- Most data mining and machine learning models
assume the following structured data: (x1, x2, ..., xk) -> y
- where the xi are independent variables
- y is the dependent variable
- y drawn from a discrete set: classification
- y drawn from a continuous range: regression
3Frequent Pattern-Based Feature Construction
- Data not in the pre-defined feature-vector form
- Transactions
- Biological sequences
- Graph databases
Frequent patterns are good candidates for
discriminative features. So, how to mine them?
4FP Sub-graph
(example borrowed from George Karypis' presentation)
5Computational Issues
- Measured by its frequency or support.
- E.g., frequent subgraphs with sup >= 10
- Cannot enumerate patterns with sup = 10 without first
enumerating all patterns with sup > 10.
- Random sampling does not work since it is not exhaustive.
- NP-hard problem
6Conventional Procedure
Two-Step Batch Method
- Mine frequent patterns (> sup)
- Select the most discriminative patterns
- Represent data in the feature space using such patterns
- Build classification models
Feature construction followed by selection (a minimal sketch follows below)
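To make the two-step procedure concrete, here is a minimal, self-contained sketch in Python on toy transaction data. The helper mine_frequent_itemsets, the min_sup value and the toy transactions are illustrative assumptions, not taken from the talk.

# A minimal sketch of the two-step batch method on toy transaction data.
from itertools import combinations
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

def mine_frequent_itemsets(transactions, min_sup):
    """Step 1: exhaustively enumerate itemsets with support >= min_sup."""
    items = sorted({i for t in transactions for i in t})
    frequent = []
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            sup = sum(set(cand) <= t for t in transactions) / len(transactions)
            if sup >= min_sup:
                frequent.append(frozenset(cand))
    return frequent

transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "d"}, {"c", "d"}]
y = np.array([1, 1, 0, 0, 1])

patterns = mine_frequent_itemsets(transactions, min_sup=0.4)               # mine (> sup)
X = np.array([[pat <= t for pat in patterns] for t in transactions], int)  # feature space
scores = mutual_info_classif(X, y, discrete_features=True)                 # select
top = np.argsort(scores)[::-1][:3]                                         # keep top patterns
model = DecisionTreeClassifier().fit(X[:, top], y)                         # build classifier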
7Two Problems
- Mine step
1. combinatorial (exponential) explosion of candidate patterns
2. patterns are not considered if min-support isn't small enough
8Two Problems
- Select step
- Issue of discriminative power
3. InfoGain is computed against the complete dataset, NOT on subsets of examples
4. Correlation is not directly evaluated on the patterns' joint predictability
9Direct Mining &amp; Selection via Model-based Search Tree
(Figure: a model-based search tree couples a feature miner with a classifier;
divide-and-conquer based frequent pattern mining produces a compact set of
highly discriminative patterns, mined at very small global support.)
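The following is a minimal, hypothetical sketch of the divide-and-conquer idea behind the model-based search tree: mine one discriminative pattern on the examples that reach a node, split on its presence, and recurse, so support is evaluated on ever smaller subsets ("dynamic support"). The helper mine_single_best_pattern, the pattern.matches interface and min_node_size are assumptions made for illustration; the published algorithm differs in its details.

def majority(labels):
    """Most frequent label at a node."""
    return max(set(labels), key=labels.count)

def build_mbt(examples, labels, mine_single_best_pattern, min_node_size=10):
    node = {"pattern": None, "children": None, "prediction": majority(labels)}
    if len(examples) < min_node_size or len(set(labels)) == 1:
        return node  # leaf: node too small, or already pure
    # Support is "dynamic": the pattern only has to be frequent and
    # discriminative among the examples that reached this node.
    pattern = mine_single_best_pattern(examples, labels)
    if pattern is None:
        return node
    has = [pattern.matches(e) for e in examples]      # assumed interface
    left = [i for i, h in enumerate(has) if h]        # examples containing the pattern
    right = [i for i, h in enumerate(has) if not h]   # examples without it
    if not left or not right:
        return node
    node["pattern"] = pattern
    node["children"] = (
        build_mbt([examples[i] for i in left], [labels[i] for i in left],
                  mine_single_best_pattern, min_node_size),
        build_mbt([examples[i] for i in right], [labels[i] for i in right],
                  mine_single_best_pattern, min_node_size),
    )
    return node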
10Analyses (I)
- Scalability of pattern enumeration
- Upper bound (Theorem 1)
- Scale-down ratio
- Bound on the number of returned features
11Analyses (II)
- Subspace pattern selection
- Original set
- Subset
- Non-overfitting
- Optimality under exhaustive search
12Experimental Studies
Itemset Mining (I)
Dataset   Patterns using MbT sup   Ratio (MbT patterns / patterns using MbT sup)
Adult          252809                  0.41
Chess               8                  0
Hypo           423439                  0.0035
Sick          4818391                  0.00032
Sonar           95507                  0.00775
13Experimental Studies
Itemset Mining (II)
- Accuracy of mined itemsets: 4 wins, 1 loss
- But with a much smaller number of patterns
14Experimental Studies
Itemset Mining (III)
15Experimental Studies
Graph Mining (I)
- 9 NCI anti-cancer screen datasets
- The PubChem Project. URL: pubchem.ncbi.nlm.nih.gov
- Active (positive) class: around 1% - 8.3%
- 2 AIDS anti-viral screen datasets
- URL: http://dtp.nci.nih.gov
- H1: CM+CA, about 3.5% positive
- H2: CA, about 1% positive
16Experimental Studies
Graph Mining (II)
17Experimental Studies
Graph Mining (III)
AUC
11 Wins
10 Wins 1 Loss
18Experimental Studies
Graph Mining (IV)
- AUC of MbT and DT-MbT vs. benchmarks: 7 wins, 4 losses
19Summary
- Model-based Search Tree
- Integrated feature mining and construction
- Dynamic support
- Can mine patterns with extremely small support
- Both a feature constructor and a classifier
- Not limited to one type of frequent pattern: plug-and-play
- Experimental results
- Itemset mining
- Graph mining
- New: found a DNA sequence not previously reported
but which can be explained in biology
- Code and dataset available for download
20How to train models?
- Even though the true distribution is unknown, still
assume that the data is generated by some known function.
- Estimate parameters inside the function via
- training data
- CV on the training data
- After the structure is fixed in advance, learning becomes
optimization to minimize errors (the losses are written out below):
- quadratic loss
- exponential loss
- slack variables
- There will probably always be mistakes unless
- the chosen model indeed generates the distribution
- data is sufficient to estimate those parameters
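For reference, the losses named above can be written out as follows (standard textbook forms, not taken from the slides), where y is the label and f(x) the model's output:

quadratic loss:    L(y, f(x)) = (y - f(x))^2
exponential loss:  L(y, f(x)) = exp(-y * f(x)),  with y in {-1, +1}
slack variables:   min over w, xi of (1/2) * ||w||^2 + C * sum_i xi_i
                   subject to y_i * (w . x_i) >= 1 - xi_i and xi_i >= 0
                   (the soft-margin SVM formulation)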
21How to train models II
- List of methods
- Decision trees
- RIPPER rule learner
- CBA: association rules
- Clustering-based methods
- ...
- Not quite sure of the exact function, so use a
family of free-form functions given some preference criteria.
- Preference criteria
- The simplest hypothesis that fits the data is the best.
- Heuristics
- info gain, gini index, Kearns-Mansour, etc.
(a small example of these purity measures follows after this slide)
- pruning: MDL pruning, reduced-error pruning, cost-based pruning
- Truth: none of the purity-check functions guarantees
accuracy on unseen test data; it only tries to build a smaller model
- There will probably always be mistakes unless
- the training data is sufficiently large
- the free-form function/criteria are appropriate
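As a small, self-contained illustration of the purity heuristics listed above, the snippet below computes information gain and the Gini index for a toy split; these are the textbook definitions, and the toy labels are mine.

import numpy as np

def entropy(y):
    p = np.bincount(y) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def gini(y):
    p = np.bincount(y) / len(y)
    return 1.0 - (p ** 2).sum()

def info_gain(y, left_mask):
    """Entropy reduction from splitting y into y[left_mask] and the rest."""
    frac_left = left_mask.mean()
    return entropy(y) - (frac_left * entropy(y[left_mask])
                         + (1 - frac_left) * entropy(y[~left_mask]))

y = np.array([1, 1, 1, 0, 0, 1, 0, 1])
split = np.array([True, True, True, True, False, False, False, False])
print(gini(y), info_gain(y, split))  # parent-node impurity, and the gain of this split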
22Can Data Speak for Themselves?
- Make no assumption about the true model, neither
parametric form nor free form.
- Encode the data in some rather neutral representation
- Think of it like encoding numbers in computers:
binary representation.
- Some numbers can never be represented exactly, but
overall it is accurate enough.
- Main challenge
- Avoid rote learning: do not remember all the details
- Generalization
- Evenly representing numbers &lt;-&gt; evenly encoding the data
23Potential Advantages
- If the accuracy is quite good, then
- the method is quite automatic and easy to use
- "no-brainer" DM can be everybody's tool.
24 Encoding Data for Major Problems
- Classification
- Given a set of labeled data items, such as (amt,
merchant category, outstanding balance, date/time, ...),
where the label is whether it is a fraud or non-fraud.
- Label: set of discrete values
- Classifier: predict if a transaction is a fraud or non-fraud.
- Probability Estimation
- Similar to the above setting: estimate the probability
that a transaction is a fraud.
- Difference: no truth is given, i.e., no true probability
- Regression
- Given a set of valued data items, such as (zipcode,
capital gain, education, ...); the value of interest is
annual gross income.
- Target value: continuous values
- Several other on-going problems
25Encoding Data in Decision Trees
- Think of each tree as a way to encode the training data.
- Why a tree? A decision tree records some common
characteristics of the data, but not every piece of
trivial detail.
- Obviously, each tree encodes the data differently.
- Subjective criteria that prefer some encodings over
others are always ad hoc.
- Do not prefer anything; just do it randomly.
- Minimize the difference by using multiple encodings,
and then average them.
26Random Decision Tree to Encode Data
- classification, regression, probability estimation
- At each node, an un-used feature is chosen randomly
- A discrete feature is un-used if it has never been
chosen previously on a given decision path starting from
the root to the current node.
- A continuous feature can be chosen multiple times on
the same decision path, but each time a different
threshold value is chosen.
27Continued
- We stop when one of the following happens:
- A node becomes too small (&lt; 3 examples).
- Or the total height of the tree exceeds some limit,
such as the total number of features.
(a minimal construction sketch follows below)
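A minimal sketch of the construction just described, assuming the data sits in a NumPy array with a boolean flag per feature marking whether it is continuous; the function and parameter names are illustrative, not from the talk.

import numpy as np
rng = np.random.default_rng(0)

def build_rdt(X, y, is_continuous, used, depth, max_depth, n_classes):
    counts = np.bincount(y, minlength=n_classes)   # node statistics come from the data
    node = {"counts": counts, "feature": None, "threshold": None}
    if len(y) < 3 or depth >= max_depth:           # stopping rules from the slide
        return node
    # A discrete feature may be used once per path; a continuous one may be reused.
    candidates = [f for f in range(X.shape[1]) if is_continuous[f] or f not in used]
    if not candidates:
        return node
    f = int(rng.choice(candidates))                # feature chosen randomly
    if is_continuous[f]:
        thr = rng.uniform(X[:, f].min(), X[:, f].max())   # fresh random threshold
        left = X[:, f] <= thr
    else:
        thr = None
        left = X[:, f] == 0
    if left.all() or (~left).all():
        return node
    node["feature"], node["threshold"] = f, thr
    child_used = used | ({f} if not is_continuous[f] else set())
    node["left"] = build_rdt(X[left], y[left], is_continuous, child_used,
                             depth + 1, max_depth, n_classes)
    node["right"] = build_rdt(X[~left], y[~left], is_continuous, child_used,
                              depth + 1, max_depth, n_classes)
    return node

# e.g. trees = [build_rdt(X, y, is_cont, set(), 0, X.shape[1], 2) for _ in range(30)]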
28Illustration of RDT
(Figure: illustration of RDT construction on features B1, B2 in {0,1} and a
continuous B3. Features are chosen randomly at each node, and the continuous
feature B3 can be picked more than once along a path, each time with a
different random threshold, e.g. 0.3 and then 0.6.)
29Classification
30Regression
31Prediction
- Simply Averaging over multiple trees
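In symbols (assuming each tree stores the class frequencies of the training examples that reached each leaf), the ensemble's output is the plain average of the per-tree posteriors:

P(y | x) ~ (1/N) * sum_{i=1..N} P_i(y | x)

where P_i(y | x) is the class-frequency estimate at the leaf of tree i into which x falls; for regression, the leaf means are averaged instead.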
32Potential Advantage
- Training can be very efficient, particularly for
very large datasets.
- No cross-validation based estimation of parameters,
as needed for some parametric methods.
- Natural multi-class probability.
- Natural multi-label classification and probability
estimation.
- Imposes very little about the structure of the model.
33Reasons
- The true distribution P(y|X) is never known.
- Is it an elephant?
- Every random tree is not a random guess of this P(y|X).
- Their structure is random, but not the node statistics.
- Every random tree is consistent with the training data.
- Each tree is quite strong, not weak.
- In other words, if the distribution is the same, each
random tree itself is a rather decent model.
34Expected Error Reduction
- Proven that for quadratic loss, such as
- for probability estimation: ( P(y|X) - P(y|X, theta) )^2
- for regression problems: ( y - f(x) )^2
- General theorem: the expected quadratic loss of RDT
(and any other model averaging) is less than that of any
combined model chosen at random.
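The claim follows from a standard convexity argument (notation mine): let f_1, ..., f_N be the individual trees and f_bar = (1/N) * sum_i f_i their average. For any example (x, y),

(y - f_bar(x))^2 = ( (1/N) * sum_i (y - f_i(x)) )^2
                <= (1/N) * sum_i (y - f_i(x))^2      (convexity of z -> z^2, Jensen)

Taking expectations over the data preserves the inequality, so the expected quadratic loss of the averaged model is at most the average loss of the individual models, i.e., at most the expected loss of a model chosen at random.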
35Theorem Summary
36Number of trees
- Sampling theory
- The random decision tree can be thought as
sampling from a large (infinite when continuous
features exist) population of trees. - Unless the data is highly skewed, 30 to 50 gives
pretty good estimate with reasonably small
variance. In most cases, 10 are usually enough.
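The rule of thumb follows from the usual sampling-theory bound, assuming the trees' predictions at a fixed x behave like roughly independent draws with variance sigma^2:

Var( (1/n) * sum_{i=1..n} f_i(x) ) = sigma^2 / n

so going from 10 to 30-50 trees shrinks the variance of the averaged prediction by a further factor of about 3 to 5.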
37Variance Reduction
38Optimal Decision Boundary
from Tony Liu's thesis (supervised by Kai Ming Ting)
40Regression Decision Boundary (GUIDE)
- Properties
- Broken and discontinuous
- Some points are far from the truth
- Some wrong ups and downs
41RDT Computed Function
- Properties
- Smooth and Continuous
- Close to true function
- All ups and downs caught
42Hidden Variable
43Hidden Variable
- Limitation of GUIDE
- Need to decide grouping variables and independent
variables, which is a non-trivial task.
- If all variables are categorical, GUIDE becomes a
single CART regression tree.
- Strong assumptions and greedy-based search can
sometimes lead to very unexpected results.
44It grows like
45ICDM'08 Cup Crown Winner
- Nuclear ban monitoring
- The RDT-based approach was the highest award winner.
46Ozone Level Prediction (ICDM'06 Best Application Paper)
- Daily summary maps of two datasets from the Texas
Commission on Environmental Quality (TCEQ)
47SVM 1-hr criteria CV
48AdaBoost 1-hr criteria CV
49SVM 8-hr criteria CV
50AdaBoost 8-hr criteria CV
51Other Applications
- Credit Card Fraud Detection
- Late and Default Payment Prediction
- Intrusion Detection
- Semiconductor Process Control
- Trading anomaly detection
52Conclusion
- Imposing a particular form of model may not be a good
idea for training highly accurate models for general-purpose DM.
- It may not even be efficient for some forms of models.
- RDT has been shown to solve all three major problems in
data mining (classification, probability estimation and
regression) simply, efficiently and accurately.
- When the physical truth is unknown, RDT is highly
recommended.
- Code and dataset are available for download.
53Standard Supervised Learning
(Diagram: a classifier trained on labeled New York Times data and applied to
unlabeled New York Times test data reaches 85.5% accuracy.)
54In Reality
(Diagram: labeled New York Times data is not available, so the classifier is
trained on labeled Reuters data and applied to unlabeled New York Times test
data, reaching only 64.1% accuracy.)
55Domain Difference -&gt; Performance Drop
(Diagram: ideal setting - train on New York Times, test on New York Times,
classifier accuracy 85.5%; realistic setting - train on Reuters, test on
New York Times, classifier accuracy 64.1%.)
56A Synthetic Example
Training (have conflicting concepts)
Test
Partially overlapping
57Goal
(Diagram: several source domains and one target domain.)
- To unify the knowledge from multiple source domains
(models) that is consistent with the test domain
58Summary
- Transfer from one or multiple source domains
- Target domain has no labeled examples
- Do not need to re-train
- Rely on base models trained from each domain
- The base models are not necessarily developed for
transfer learning applications
59Locally Weighted Ensemble
(Diagram: training sets 1..k each produce a base model M1..Mk; a test example
x, with x the feature values and y the class label, is predicted by the
locally weighted ensemble of M1..Mk.)
60Modified Bayesian Model Averaging
Bayesian Model Averaging vs. the version modified for transfer learning
(Diagram: models M1, M2, ..., Mk are combined over the test set; the modified
version weights each model per test example rather than globally.)
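In symbols (notation mine), the contrast between the two schemes is:

Bayesian model averaging:   P(y | x) = sum_i P(M_i | D) * P(y | x, M_i)
Modified for transfer:      P(y | x) = sum_i w_i(x) * P(y | x, M_i),  with sum_i w_i(x) = 1

In standard BMA the weights P(M_i | D) are fixed by the training data D; in the modified version the per-example weights w_i(x) are estimated from the structure of the unlabeled test domain around x.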
61Global versus Local Weights
(Table: six test examples x with true labels y, the outputs of M1 and M2, and
their weights; the global weights wg are constant across examples (0.3 and
0.7), while the local weights wl vary from example to example.)
- Locally weighted scheme
- The weight of each model is computed per example
- Weights are determined according to each model's
performance on the test set, not the training set
62Synthetic Example Revisited
(Figure: the synthetic example revisited, showing the regions where M1 and M2
each agree with the test data; the training sets have conflicting concepts and
only partially overlap with the test region.)
63Optimal Local Weights
(Figure: two classifiers C1 and C2 output class probabilities for test example
x, e.g. (0.9, 0.1) and (0.4, 0.6); combining them with weights w1 = 0.8 and
w2 = 0.2 reproduces the true distribution f = (0.8, 0.2), so C1, which is
closer to the truth at x, receives the higher weight.)
- Optimal weights
- Solution to a regression problem (one way to write it is shown below)
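One way to write that regression problem (notation mine, following the figure): stack the models' posteriors at x into a matrix H and the true conditional into a vector f, and solve for non-negative weights that sum to one:

w*(x) = argmin_{w >= 0, sum_i w_i = 1} || H w - f ||^2,
with columns H_i = P(y | M_i, x) and f = P(y | x).

In the toy figure, H = [[0.9, 0.4], [0.1, 0.6]] and f = (0.8, 0.2), and w = (0.8, 0.2) solves H w = f exactly.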
64Approximate Optimal Weights
- Optimal weights
- Impossible to get since f is unknown!
- How to approximate the optimal weights
- M should be assigned a higher weight at x if P(y|M, x)
is closer to the true P(y|x)
- If some labeled examples are available in the target
domain, use these examples to compute the weights
- If none of the examples in the target domain are
labeled, we need to make some assumptions about the
relationship between feature values and class labels
65Clustering-Manifold Assumption
Test examples that are closer in feature space
are more likely to share the same class label.
66Graph-based Heuristics
- Graph-based weight approximation
- Map the structures of the models onto the test domain
(Diagram: the weight on x is computed from the neighborhood graphs of M1 and
M2 and the clustering structure of the test data.)
67Graph-based Heuristics
- Local weight calculation
- The weight of a model at x is proportional to the
similarity between its neighborhood graph and the
clustering structure around x, so the model that better
matches the local structure gets the higher weight.
(a hypothetical sketch of this weighting follows below)
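A hypothetical sketch of this proportionality in Python: the similarity is taken here to be the (Jaccard) overlap between the neighbors of x that the model puts into x's predicted class and the neighbors of x that share x's cluster. The exact similarity measure used in the paper may differ, and all names below are illustrative.

import numpy as np

def local_weight(x_idx, neighbors, model_pred, cluster_id):
    """neighbors: indices of x's nearest test points; model_pred: the model's
    predicted label per test point; cluster_id: cluster assignment per test point."""
    same_pred = {n for n in neighbors if model_pred[n] == model_pred[x_idx]}
    same_clus = {n for n in neighbors if cluster_id[n] == cluster_id[x_idx]}
    union = same_pred | same_clus
    if not union:
        return 0.0
    return len(same_pred & same_clus) / len(union)   # Jaccard-style similarity

def normalized_weights(x_idx, neighbors, preds_per_model, cluster_id):
    """Per-example weights over all base models, normalized to sum to 1."""
    raw = np.array([local_weight(x_idx, neighbors, p, cluster_id)
                    for p in preds_per_model])
    total = raw.sum()
    return raw / total if total > 0 else np.full(len(raw), 1.0 / len(raw))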
68Local Structure Based Adjustment
- Why is adjustment needed?
- It is possible that no model's structure is similar to
the clustering structure at x
- This simply means that the training information
conflicts with the true target distribution at x
(Diagram: both M1 and M2 are in error at x when compared with the clustering
structure.)
69Local Structure Based Adjustment
- How to adjust?
- Check whether the similarity-based weight at x is below
a threshold
- If so, ignore the training information and propagate
the labels of the neighbors in the test set to x
(Diagram: labels are propagated to x from its neighbors in the clustering
structure, ignoring M1 and M2.)
70Verify the Assumption
- Need to check the validity of this assumption
- Still, P(y|x) is unknown
- How to choose the appropriate clustering algorithm?
- Findings from real data sets
- This property is usually determined by the nature of
the task
- Positive cases: document categorization
- Negative cases: sentiment classification
- Could validate this assumption on the training set
71Algorithm
Check Assumption -&gt; Neighborhood Graph Construction -&gt;
Model Weight Computation -&gt; Weight Adjustment
72Data Sets
- Different applications
- Synthetic data sets
- Spam filtering: public email collection -&gt; personal
inboxes (u01, u02, u03) (ECML/PKDD 2006)
- Text classification: same top-level classification
problems with different sub-fields in the training and
test sets (Newsgroup, Reuters)
- Intrusion detection data: different types of intrusions
in training and test sets.
73Baseline Methods
- Baseline Methods
- One source domain: single models
- Winnow (WNN), Logistic Regression (LR), Support Vector
Machine (SVM)
- Transductive SVM (TSVM)
- Multiple source domains
- SVM on each of the domains
- TSVM on each of the domains
- Merge all source domains into one: ALL
- SVM, TSVM
- Simple averaging ensemble: SMA
- Locally weighted ensemble without local structure based
adjustment: pLWE
- Locally weighted ensemble: LWE
- Implementation packages
- Classification: SNoW, BBR, LibSVM, SVMlight
- Clustering: CLUTO package
74Performance Measure
- Prediction Accuracy
- 0-1 loss: accuracy
- Squared loss: mean squared error
- Area Under ROC Curve (AUC)
- Tradeoff between true positive rate and false positive
rate
- Should ideally be 1
(a small example computing all three follows below)
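For reference, all three measures can be computed directly with scikit-learn (the toy numbers are purely illustrative):

from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score

y_true = [1, 0, 1, 1, 0]              # toy labels
y_prob = [0.9, 0.2, 0.6, 0.8, 0.4]    # predicted P(y = 1 | x)
y_pred = [int(p >= 0.5) for p in y_prob]

print("accuracy:", accuracy_score(y_true, y_pred))      # 0-1 loss -> accuracy
print("MSE:     ", mean_squared_error(y_true, y_prob))  # squared loss
print("AUC:     ", roc_auc_score(y_true, y_prob))       # area under ROC curve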
75A Synthetic Example
Training (have conflicting concepts)
Test
Partially overlapping
76Experiments on Synthetic Data
77Spam Filtering
- Problems
- Training set: public emails
- Test set: personal emails from three users U00, U01, U02
(Charts: accuracy and MSE of WNN, LR, SVM, SMA, TSVM, pLWE and LWE.)
7820 Newsgroup
- Problems: C vs S, R vs T, R vs S, S vs T, C vs R, C vs T
79Acc
(Charts: accuracy and MSE of WNN, LR, SVM, SMA, TSVM, pLWE and LWE on the
20 Newsgroup tasks.)
80Reuters
- Problems
- Orgs vs People (O vs Pe)
- Orgs vs Places (O vs Pl)
- People vs Places (Pe vs Pl)
(Charts: accuracy and MSE of WNN, LR, SVM, SMA, TSVM, pLWE and LWE.)
81Intrusion Detection
- Problems (Normal vs Intrusions)
- Normal vs R2L (1)
- Normal vs Probing (2)
- Normal vs DOS (3)
- Tasks (train on two, test on the third)
- 2 + 1 -&gt; 3 (DOS)
- 3 + 1 -&gt; 2 (Probing)
- 3 + 2 -&gt; 1 (R2L)
82Conclusions
- Locally weighted ensemble framework
- transfers useful knowledge from multiple source domains
- Graph-based heuristics to compute weights
- make the framework practical and effective
- Code and dataset available for download
83More information
- www.weifan.info or
- www.cs.columbia.edu/wfan
- For code, dataset and papers