Title: Simultaneous Partitioning and Learning for Prediction from Complex Data
1. Simultaneous Partitioning and Learning for Prediction from Complex Data
- Meghana Deodhar and Joydeep Ghosh
- Dept. of ECE, The University of Texas at Austin
2. Background
- Difficult classification/regression problems may involve a heterogeneous population
- Divide and conquer approach
  - Partition the population and model each partition separately
- Advantages
  - Learning models on more homogeneous data improves accuracy
  - Simpler, more interpretable models
3. Cash Flow Forecasting (L. Lokmic and K. Smith, 2000)
- Problem: forecast when issued checks will be cashed in
- Very large and heterogeneous training dataset
  - Check amounts span a very large range (0.02 to 32 million)
- Data partitioning
  - Partition checks based on amount: small, medium, large
  - SOM to cluster the small group into 3 (relatively) homogeneous partitions
- Learn a separate predictive model (regression or NN) in each partition
- Collection of predictive models more accurate than a single model
4. Cash Flow Forecasting (L. Lokmic and K. Smith, 2000)
- Experimental Results
- Check dataset: 11,000 checks over a 3 month period
- Target variable: duration in days after which an issued check will be cashed in
- Error = |(actual - forecast) / actual|
5. Short-Term Load Forecasting (M. Djukanovic et al., 1993)
- Problem: forecast the hourly electric load pattern for a day
- First level of division
  - Model working days, weekends and holidays separately
- For each day type, cluster input data into coherent groups and train a model in each group
  - Relation between input features and load profile is stronger within each group than over the entire population
- Classify a test point into a cluster and use the corresponding model to forecast load (a generic sketch of this partition-then-predict scheme follows below)
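A minimal illustrative sketch of this generic cluster-then-predict scheme (not the code of either study cited above; it assumes scikit-learn, a synthetic regression dataset, and k-means for the clustering step):

```python
# Minimal sketch of a-priori "partition, then model": cluster the training
# inputs, fit one regressor per cluster, and route each test point to the
# model of its assigned cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))          # synthetic features (assumed)
y_train = X_train @ rng.normal(size=4) + rng.normal(scale=0.1, size=500)
X_test = rng.normal(size=(50, 4))

k = 3                                        # number of partitions (assumed)
clusterer = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)

# One predictive model per partition, trained only on that partition's data.
models = {}
for c in range(k):
    mask = clusterer.labels_ == c
    models[c] = LinearRegression().fit(X_train[mask], y_train[mask])

# Classify each test point into a cluster and use the corresponding model.
test_clusters = clusterer.predict(X_test)
y_pred = np.array([models[c].predict(x.reshape(1, -1))[0]
                   for c, x in zip(test_clusters, X_test)])
```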
6. Motivation
- Traditionally partitioning is a priori
  - Domain knowledge
  - Clustering algorithm
- But a priori partitioning may be suboptimal
- Solution: interleave partitioning and construction of prediction models
- Hard vs. soft partitioning solutions
7. Agenda
- Hard partitioning of both objects and features
  - Homogeneous groups of dyadic data
  - Classification
  - Regression
- Soft partitioning of input space for regression
  - Mixture of Experts
  - Hierarchical versions
  - Fuzzy clustering and modeling
- Output space partitioning for modeling a large number of classes
8. Hard Partitioning of Both Objects and Features
- References
  - Deodhar and Ghosh, KDD 07
  - Agarwal and Merugu, KDD 07
- Motivation
  - Dyadic data is now pervasive; attributes are available for objects and features
  - Recommender systems: customers, products, ratings
  - Search advertising: queries, ads, click-through probabilities
  - Web search: queries, web pages, relevance scores
9. Applications in Bioinformatics
- Microarray data analysis (Liu et al. 05, Conde et al. 03)
- Clustering followed by supervised learning
  - Problem: classification of experiments
  - Pre-processing step: clustering genes
  - Each cluster represented by a cluster representative
- Dimensionality reduction step for classification
  - Reduces gene redundancies/noise
  - Parsimonious classification models
10. Applications in Marketing
- Simultaneous market segmentation and structure
  - Partition the consumers into homogeneous groups
  - Find groups of equivalent products
  - Dual problems
- Sequential clustering and analysis
  - Identify consumer segments
    - Maximum likelihood based latent class models (Grover and Srinivasan 87)
    - SOM based clustering of consumers (Reutterer 98)
  - Competing products: ones with high purchase probabilities in each consumer segment
11. Example: Recommender System
- Predict customer purchase decisions
[Figure: customer-product matrix, with product attributes (e.g. price, market share) and customer attributes (e.g. demographics)]
12. Possible Approaches
- Collaborative Filtering
- Classification
- Logistic regression
- Co-clustering or Bi-clustering
- Bregman Co-clustering
13. Collaborative Filtering
- Collaborative filtering: a technique for reducing information overload
- Improves access to relevant products and information
  - e.g. recommender systems that suggest books, films, music, etc.
- Predict how well a user will like an unrated item
  - Based on the preferences of a community
- Preference judgments can be explicit or implicit
  - Explicit: numerical ratings for each item
  - Implicit: extracted from purchase records or web logs
14. Collaborative Filtering
- Find a neighborhood of similar customers
  - Based on known choices
- Predict current purchase decision using the preferences of the neighborhood
- Ignores customer/product attributes
[Figure: customer-product matrix with a neighborhood of similar customers highlighted]
15. Single Classification Model
- Constructs a map from the feature vector of a customer-product combination to the choice
- May not be adequate to capture heterogeneity
- Does not use neighborhood information
  - Similarity of customers/products
Feature vector: customer and product attributes
Target variable: matrix entries
16. Co-clustering
- Simultaneously clusters along multiple axes
- Exploits the duality between the axes
- Improves upon single-sided clustering
- Applications
  - Microarray data analysis (genes and experiments)
  - Text data clustering (documents and words)
- Bregman co-clustering (Banerjee et al. 06)
  - Partitional: divides the matrix into a grid of rectangular blocks
  - Can deal with missing data
17. Co-clustering
- Identifies neighborhoods of similar customers and products
- Predicts an unknown choice using known entries within the co-cluster
- Ignores customer and product attributes
[Figure: matrix partitioned into a grid of customer clusters x product clusters]
18. Simultaneous Co-clustering and Classification
- Exploits both neighborhood information and attributes
- Iteratively clusters along both axes and fits a predictive model in each co-cluster
- Common framework for solving classification and regression problems
[Figure: co-clustered matrix with product attributes, customer attributes, and a classification model per co-cluster]
19. Problem Definition (Regression)
- Z: m x n matrix of customers and products
- Matrix entries are real numbers (e.g. ratings)
- Assumption: each matrix entry is a linear combination of customer and product attributes (Ci and Pj)
  - Model parameters: β^T = [β0, βc^T, βp^T]
  - Attribute vector: x_ij^T = [1, Ci^T, Pj^T]
- Aim: simultaneously cluster customers and products into a grid of co-clusters, such that the values within each co-cluster are predicted by the same regression model
20-23. Regression Example
[Figures: a toy customer-product matrix plotted against customer (c) and product (p) attributes; after rearranging rows and columns, co-clusters emerge, each fit by its own linear model (apparently z = c + 2p, z = 1 + c + p, z = 5c + p, and z = 2c + 3p)]
24. Reconstruction Errors
- Reconstructed with simultaneous co-clustering and regression: MSE = 7.9
- Reconstructed with a single linear model (z ≈ 1.2 + 3.6c + 1.5p): MSE = 21.8
25. Objective Function
- ρ: mapping from m rows to k row clusters
- γ: mapping from n columns to l column clusters
- Total of k·l regression models
- Weight w_uv associated with each matrix entry: 1 for a known entry, 0 for a missing one
- Find the co-clustering (ρ, γ) and models (β's) that minimize the total squared error
26. Objective Function Details
- Indicates how well the co-cluster models fit the given data
- Based on the prediction model, not cluster homogeneity!
- Elementwise squared error summed over all matrix entries
[Equation annotated with: sum over row clusters; sum over column clusters; all rows in the row cluster; all columns in the column cluster; predicted value using the co-cluster linear model — a written-out reconstruction follows below]
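A plausible written-out form of this objective, reconstructed from the annotations above and the definitions on the previous slide (ρ, γ: row/column cluster mappings; β_gh: the model of co-cluster (g, h); w_uv: 0/1 weights); see Deodhar and Ghosh (KDD 07) for the exact formulation:

```latex
\min_{\rho,\gamma,\{\beta_{gh}\}}\;
\underbrace{\sum_{g=1}^{k}}_{\text{row clusters}}
\underbrace{\sum_{h=1}^{l}}_{\text{column clusters}}
\underbrace{\sum_{u:\,\rho(u)=g}}_{\text{rows in cluster } g}
\underbrace{\sum_{v:\,\gamma(v)=h}}_{\text{columns in cluster } h}
w_{uv}\,\bigl(z_{uv} - \underbrace{\beta_{gh}^{T} x_{uv}}_{\text{co-cluster model prediction}}\bigr)^{2}
```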
27. Row and Column Cluster Updates
- Objective function is a sum of row/column errors
- Assign each row to the row cluster that minimizes the row error
- Row cluster assignment for row u
[Figure: row u evaluated against the models of each of 3 row clusters (β11, β12; β21, β22; β31, β32), yielding errors e1(u), e2(u), e3(u); u is assigned to the row cluster with the smallest error]
28. Meta-Algorithm
- Input: data Z, weights W, attributes C, P
- Output: co-clustering (ρ, γ), models β's
- Initialize ρ, γ
- Iterate until convergence
  - Re-estimate the model for each co-cluster
  - Re-estimate the co-clusters
    - Update row clusters: assign each row to the closest row cluster
    - Update column clusters: assign each column to the closest column cluster
- Return ρ, γ, β's
- Guaranteed to converge to a locally optimal solution (an illustrative sketch follows below)
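An illustrative sketch of the regression version of this meta-algorithm, as a hard-assignment alternating minimization; function and variable names are ours, not the paper's, and it assumes numpy with a weighted least-squares fit per co-cluster:

```python
# Sketch: alternate between fitting a least-squares model per co-cluster and
# reassigning rows/columns to the clusters whose models give the smallest
# weighted squared error. X[u, v] is the attribute vector x_uv = [1, C_u, P_v].
import numpy as np

def fit_coclusters(Z, W, X, k, l, n_iter=20, seed=0):
    # Z: (m, n) responses, W: (m, n) 0/1 weights, X: (m, n, d) attributes.
    m, n, d = X.shape
    rng = np.random.default_rng(seed)
    rho = rng.integers(k, size=m)            # row cluster assignments
    gamma = rng.integers(l, size=n)          # column cluster assignments
    beta = np.zeros((k, l, d))

    def cell_sse(u_idx, v_idx, b):
        pred = X[np.ix_(u_idx, v_idx)] @ b
        err = (Z[np.ix_(u_idx, v_idx)] - pred) ** 2
        return W[np.ix_(u_idx, v_idx)] * err

    for _ in range(n_iter):
        # 1. Re-estimate the model in each co-cluster (weighted least squares).
        for g in range(k):
            for h in range(l):
                rows, cols = np.where(rho == g)[0], np.where(gamma == h)[0]
                if len(rows) == 0 or len(cols) == 0:
                    continue
                Xgh = X[np.ix_(rows, cols)].reshape(-1, d)
                zgh = Z[np.ix_(rows, cols)].ravel()
                sw = np.sqrt(W[np.ix_(rows, cols)].ravel())
                beta[g, h], *_ = np.linalg.lstsq(Xgh * sw[:, None], zgh * sw,
                                                 rcond=None)
        # 2. Re-assign each row to the row cluster minimizing its error.
        for u in range(m):
            errs = [sum(cell_sse([u], np.where(gamma == h)[0], beta[g, h]).sum()
                        for h in range(l)) for g in range(k)]
            rho[u] = int(np.argmin(errs))
        # 3. Re-assign each column to the column cluster minimizing its error.
        for v in range(n):
            errs = [sum(cell_sse(np.where(rho == g)[0], [v], beta[g, h]).sum()
                        for g in range(k)) for h in range(l)]
            gamma[v] = int(np.argmin(errs))
    return rho, gamma, beta
```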
29. Simultaneous Co-clustering and Classification
- Elements of Z are class labels (2 class problem)
- Logistic regression model relating the attributes to the class label
  - Log odds modeled as a linear combination of the attributes (β^T x_ij)
- Find the co-clustering (ρ, γ) and models (β's) that minimize the total log loss (negative log likelihood)
- Log loss instead of squared error (a plausible written-out form follows below)
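A plausible written-out form of the classification objective, using the same notation as the regression objective above with z_uv ∈ {0, 1}; this is a reconstruction, not copied from the slide:

```latex
\log\frac{p_{uv}}{1-p_{uv}} = \beta_{gh}^{T} x_{uv},
\qquad
\min_{\rho,\gamma,\{\beta_{gh}\}}
\sum_{g,h}\;\sum_{\substack{u:\,\rho(u)=g\\ v:\,\gamma(v)=h}}
w_{uv}\Bigl[\log\bigl(1+e^{\beta_{gh}^{T}x_{uv}}\bigr) - z_{uv}\,\beta_{gh}^{T}x_{uv}\Bigr]
```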
30. Meta-Algorithm (Classification)
- Input: data Z, weights W, attributes C, P
- Output: co-clustering (ρ, γ), classification models β's
- Initialize ρ, γ
- Iterate until convergence
  - Re-estimate the classification model for each co-cluster
  - Re-estimate the co-clusters
    - Update row clusters: assign each row to the closest row cluster
    - Update column clusters: assign each column to the closest column cluster
    - (Closest in terms of least prediction error)
- Return ρ, γ, β's
- Guaranteed to converge to a locally optimal solution
31. Predicting Missing Values
- If missing matrix entry z_uv is assigned to row cluster g and column cluster h with model parameters β_gh, predict z_uv as
  - Classification
  - Regression
(reconstructions of the prediction equations follow below)
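The prediction equations themselves were shown graphically on the slide; reconstructions consistent with the models above:

```latex
\text{Regression: } \hat{z}_{uv} = \beta_{gh}^{T} x_{uv},
\qquad
\text{Classification: } \hat{p}(z_{uv}=1) = \frac{1}{1+e^{-\beta_{gh}^{T} x_{uv}}}
```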
32. Reduced Parameter Approach
- Simultaneous co-clustering and prediction
  - k·l independent models
  - (1 + |C| + |P|)·k·l parameters
  - May overfit when training data is limited
- Single model
  - (1 + |C| + |P|) parameters
  - May not be adequate
- Reduced parameter approach
  - k·l models, but smoothing achieved by sharing parameters
  - Customer (product) coefficients for all models in the same row (column) cluster are constrained to be identical
  - (1 + |C|)·k + (1 + |P|)·l parameters (a worked example follows below)
33. Model Update Step: update customer coefficients
34. Model Update Step: update product coefficients
35. Recommender System Application
- Predict unknown course choices of masters students
- 32 courses, 326 students
- Student attributes: career aspiration, undergraduate degree
- Course attributes: department, evaluation score
[Figure: classification error; Model CC and CC with k = 2, l = 2]
36. Results
[Figures: F-measure plot and precision-recall curve]
37. ERIM Marketing Dataset
- Household panel data collected by A.C. Nielsen
- 1714 customers, 121 products from 6 product categories (ketchup, sugar, etc.)
- Customer-product matrix cell values: units purchased
- Household attributes: income, residents, male head employed, female head employed, total visits, total expense
- Product attributes: market share, price, times the product was advertised
- Predict units purchased
38. Data Sample
39. Dataset Details
- Properties
  - Sparse: 74.86% of values are 0
  - Very skewed: 99.12% of values < 20, the rest very high (outliers)
- Standardization of product attributes and units purchased
- Linear least squares is very sensitive to outliers
  - Separate models for high and low valued matrix entries
  - Threshold of 20 units purchased
40. Results
- Model for low valued matrix entries
- Bulk of the data (99.12%)
41. Market Segmentation and Structure
[Figure: coefficients of the global model and sample co-cluster models; annotations include "low market share", "cheapest, most popular products", "high income, large visits"]
42. Lessons Learnt
- Interpretable and actionable segmentation and models
- Coefficients of co-cluster models differ from the global model
  - Multiple models required to capture heterogeneity
- Co-cluster models differ significantly
  - Different purchase factors important for different customer-product subsets
- Product attributes more indicative of preference
- Elimination of insignificant predictors to get sparse models
43. Extensions
- Can be extended to work with other prediction models
- Other applications
  - Microarray data clustering (incorporate gene and experiment annotations)
  - Recommending blogs to users (matrix of blogs vs. users)
  - Clustering web documents annotated by link and semantic information
  - Analyzing survey data: simultaneous market segmentation and structure
- Extensions to time-series data: each matrix entry is a vector of values over time (e.g. customer purchase behavior over time)
44. Predictive Discrete Latent Factor Models (D. Agarwal and S. Merugu 07)
- Similar motivation and problem setting
  - Prediction of missing matrix entries (dyadic response variables), given attributes (covariate information)
- Uses co-clustering to solve a prediction problem
- Response variable modeled as a sum of
  - A function of covariates (global structure)
  - A co-cluster specific constant (local structure)
- Exploits local structure
  - Co-cluster specific constant assumed as part of the noise model
  - Teased out of the global model residuals
45. PDLF Model
- Constrained mixture model
  - k·l components, π_IJ: mixture prior of the IJ-th component
- Each component is a generalized linear model
  - f: exponential family, g: link function
- Global trends x_ij^T β shared across the components
- Each co-cluster/latent factor has an additional offset δ_IJ (a sketch of the mixture density follows below)
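A sketch of the PDLF mixture density consistent with the description above (see Agarwal and Merugu 07 for the exact form):

```latex
p(z_{ij} \mid x_{ij}) \;=\; \sum_{I=1}^{k}\sum_{J=1}^{l}
\pi_{IJ}\; f\!\bigl(z_{ij};\, g^{-1}(x_{ij}^{T}\beta + \delta_{IJ})\bigr)
```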
46. Model Estimation
- Generalized EM algorithm
- Soft vs. hard assignment
- Main steps
  - Random initialization of row/column clusters and parameters
  - Repeat till convergence
    - Estimate global model coefficients β (Newton-Raphson method)
    - Estimate co-cluster offsets δ_IJ
    - Find the optimal row and column clustering
- Scalable: each iteration is linear in the number of observations (a simplified Gaussian-case sketch follows below)
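A simplified hard-assignment sketch for the Gaussian / identity-link special case (the actual algorithm is a generalized EM over arbitrary exponential families with soft assignments; names and simplifications here are ours):

```python
# Sketch of PDLF-style estimation for the Gaussian / identity-link case:
# alternate between (a) refitting the global coefficients beta on the
# offset-corrected responses, (b) recomputing co-cluster offsets delta as
# mean residuals, and (c) reassigning rows/columns.
import numpy as np

def fit_pdlf_gaussian(Z, W, X, k, l, n_iter=20, seed=0):
    m, n, d = X.shape
    rng = np.random.default_rng(seed)
    rho = rng.integers(k, size=m)
    gamma = rng.integers(l, size=n)
    beta = np.zeros(d)
    delta = np.zeros((k, l))

    for _ in range(n_iter):
        # (a) Global coefficients on responses with co-cluster offsets removed.
        offs = delta[rho][:, gamma]                  # (m, n) offset per cell
        sw = np.sqrt(W).ravel()
        A = X.reshape(-1, d) * sw[:, None]
        b = (Z - offs).ravel() * sw
        beta, *_ = np.linalg.lstsq(A, b, rcond=None)

        # (b) Co-cluster offsets: mean residual of the global fit per co-cluster.
        resid = Z - X @ beta
        for g in range(k):
            for h in range(l):
                mask = np.outer(rho == g, gamma == h) & (W > 0)
                if mask.any():
                    delta[g, h] = resid[mask].mean()

        # (c) Reassign rows and columns to minimize weighted squared error.
        for u in range(m):
            errs = [np.sum(W[u] * (resid[u] - delta[g, gamma]) ** 2)
                    for g in range(k)]
            rho[u] = int(np.argmin(errs))
        for v in range(n):
            errs = [np.sum(W[:, v] * (resid[:, v] - delta[rho, h]) ** 2)
                    for h in range(l)]
            gamma[v] = int(np.argmin(errs))
    return rho, gamma, beta, delta
```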
47. PDLF vs. Model CC
- PDLF
  - Single global model + co-cluster constants
  - Robust even when data is limited
- Model CC
  - k x l co-cluster models
  - Works well when a large amount of data is available
- Complementary approaches
48. Logistic Regression on MovieLens
- Ratings > 3 treated as positive; 23 covariates
[Figure: comparison of PDLF, co-clustering, and logistic regression]
49. Experiments on Click Count Data
- Dataset: 47,903 ip-domains, 585 web-sites
- Attributes: ip-location, routing type, etc.
- Predict click count observations
- PDLF model based on Poisson distributions
- Co-cluster interactions more interesting than in ordinary co-clustering
[Figure: prediction results]
50. Agenda
- Hard partitioning of both objects and features
  - Homogeneous groups of dyadic data
  - Classification
  - Regression
- Soft partitioning of input space for regression
  - Mixture of Experts
  - Hierarchical versions
  - Fuzzy clustering and modeling
- Output space partitioning for modeling a large number of classes
51. Soft Partitioning: Mixture of Experts Model (Jacobs and Jordan '91)
- Prediction model is composed of a collection of "experts" specializing in different regions of the input space
- Simultaneous partitioning of the input space and training of the experts
- Soft partitioning: multiple experts involved to different degrees in predicting an output
- Uses linear models to form non-linear maps
52. Modular Decomposition in Mixtures of Experts
[Figure: experts 1..K produce outputs y1..yK from input x; a gating network produces weights g1(x)..gK(x); the combined output is Y(x) = Σ_i g_i(x) y_i(x)]
53. Gating and Expert Models
- Simple models for local fits
- Generalization of CART
[Figure: expert network with input layer x1..xp and linear output layer producing y1..yk; gating network with input layer x1..xp and softmax output layer producing g1..gk; each network has its own weight parameters]
54. Underlying Probability Model
- Gating network output layer: softmax
  - Ensures the gating outputs are non-negative and sum to 1
- Probability model
  - p(y | x, j) modeled as a Gaussian with mean µj
  - Output of expert j: µj
- Maximum likelihood formulation
- Parameters determined using gradient ascent or EM (the standard form is sketched below)
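The standard form of the softmax gate and the resulting mixture likelihood, consistent with the description above (v_j: gating weights, µ_j(x): the output of expert j):

```latex
g_j(x) = \frac{e^{v_j^{T}x}}{\sum_{i=1}^{K} e^{v_i^{T}x}}
\;\;\Rightarrow\;\; g_j(x)\ge 0,\;\; \sum_{j=1}^{K} g_j(x)=1,
\qquad
p(y \mid x) = \sum_{j=1}^{K} g_j(x)\,\mathcal{N}\!\bigl(y;\, \mu_j(x),\, \sigma_j^{2}\bigr)
```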
55. Hierarchical MOEs (Jordan and Jacobs 94)
56. Soft Partitioning Approaches in Marketing
- Fuzzy clusterwise regression (Wedel and Steenkamp 89)
  - To solve the simultaneous market segmentation and structure problem
- Fuzzy co-clusters
  - Fractional memberships for customers and products
- Preferences modeled as a linear combination of product attributes
- Minimize total squared error between actual and predicted preferences
- Interpretation difficult; hard version gives block diagonal co-clusters
57. Soft Partitioning Approaches in Marketing
- Latent class modeling (Hofmann and Puzicha 99)
- Problem: predict individual choice based on history
  - No attribute information used
- Aspect model
  - Each customer-product combination assumed to be generated from 1 of K latent classes (clusters)
- Two-sided clustering model
  - Each customer belongs to 1 of k customer clusters
  - Each product belongs to 1 of l product clusters
  - Association parameter between customer and product clusters
- Models estimated using EM
58. Agenda
- Hard partitioning of both objects and features
  - Homogeneous groups of dyadic data
  - Classification
  - Regression
- Soft partitioning of input space for regression
  - Mixture of Experts
  - Hierarchical versions
  - Fuzzy clustering and modeling
- Output space partitioning for modeling a large number of classes
59. Addressing Multi-class Problems via Output Space Decomposition
- Approach: convert a C-class problem into multiple binary classification problems
- History
  - Committee machine (Nilsson 65): 1 class vs. all others, followed by voting/max
  - Pairwise classification (Hastie and Tibshirani 96): combine pairwise class probabilities using a KL divergence based criterion
  - Error correcting output coding (Dietterich and Bakiri 95)
- Pros: meta-classifiers can be less complex
- Cons: groupings may be forced
60. Are Classes Always Orthogonal?
- All previous approaches (one vs. all, pairwise, ECOC) do not exploit any relationships between classes
- But often, some classes are closer to one another than to others!
- Also, classes may be sub-classes/mixed classes
- Example: hyperspectral data
61. Bolivar Peninsula
62. Dataset
63. Class Hierarchy
[Figure: class hierarchy over the land-cover classes: Water, General Uplands, Pure Salicornia, Sand Flats, Bare Soil, High Proximal Marsh, Transition Zone, Low Proximal Marsh, Pasture, High Distal Marsh, Trees]
64. Hierarchical Grouping of Classes
- Desired: a general framework for natural grouping of classes
  - Hierarchically divide into related classes/subclasses
  - Custom features
- Top down partitioning: solve 3 coupled problems
  - Group the classes into two meta-classes
  - Design a feature extractor tailored for the 2 meta-classes
  - Design the 2-meta-class classifier
65. Binary Hierarchical Classifier (BHC) (S. Kumar et al. 02)
- N leaf nodes (classes)
- N-1 internal nodes
  - Partition into two meta-classes
- Node specific design flexibility
  - Fisher discriminant based features
  - Bayesian classifier
- Bottom up and top down implementations
- Both soft and hard versions
66. BHC: Partitioning a Set of Classes
- Three coupled problems
  - Finding the partition
  - Determining the discriminant function (Fisher discriminant)
  - Estimating the parameters of the Bayesian binary classifier
- Solved via deterministic annealing
  - Initialize partitions
  - For a given temperature T
    - Compute soft meta-class parameters
    - Compute the Fisher discriminant
    - Update associations
  - Reduce T, continue (an illustrative sketch of this loop follows below)
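An illustrative sketch of this annealing loop for splitting one node's set of classes into two meta-classes, assuming per-class means, covariances, and priors are given; regularization details, stopping criteria, and the final Bayesian classifier are omitted, and all names are ours:

```python
# Sketch of a deterministic-annealing split of classes into two meta-classes
# (the BHC node-splitting loop described above), using soft associations that
# harden as the temperature is lowered.
import numpy as np

def split_classes(class_means, class_covs, class_priors,
                  T0=10.0, T_min=0.01, cool=0.8, seed=0):
    C, d = class_means.shape
    rng = np.random.default_rng(seed)
    # Soft association of each class with the "right" meta-class, in (0, 1).
    assoc = 0.5 + 0.01 * rng.standard_normal(C)

    T = T0
    while T > T_min:
        # Soft meta-class parameters (weighted means and within-class scatter).
        w_r = class_priors * assoc
        w_l = class_priors * (1.0 - assoc)
        m_r = (w_r[:, None] * class_means).sum(0) / w_r.sum()
        m_l = (w_l[:, None] * class_means).sum(0) / w_l.sum()
        Sw = sum((w_r[c] + w_l[c]) * class_covs[c] for c in range(C))

        # Fisher discriminant direction for the two meta-classes.
        w = np.linalg.solve(Sw + 1e-6 * np.eye(d), m_r - m_l)

        # Update associations: classes whose projected mean is nearer the
        # right meta-class get a higher association (sigmoid at temperature T).
        d_r = (class_means @ w - m_r @ w) ** 2
        d_l = (class_means @ w - m_l @ w) ** 2
        assoc = 1.0 / (1.0 + np.exp(np.clip((d_r - d_l) / T, -50, 50)))

        T *= cool                        # cool down and continue
    return assoc                         # ~0/1 associations at low temperature
```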
67. Evolution of Class Partitioning
[Plot: association with the right child (y-axis) vs. iteration (x-axis)]
68. ECOC vs. BHC
- ECOC techniques
  - Error correcting ensembles compensate for flaws of weak base classifiers
  - Some decision boundaries may be too difficult to learn (natural class affinities not exploited)
- BHC
  - Groups classes by natural affinities → simpler two-(meta)class problems
  - Node specific feature extraction/selection techniques
69. Advantages of BHC over ECOC Methods
- Comparable performance even with a small percentage of training data; but going beyond accuracy, BHC has additional advantages
  - Eliminates the need to come up with an optimal code matrix
  - Yields an interpretable ensemble of classifiers
  - Can construct classifiers at varying levels of granularity
  - Reveals inherent class affinities and enables the use of adaptive feature selection techniques
  - Simpler decision boundaries
  - Facilitates knowledge transfer
70. Experimental Results (BHC vs. ECOC)
71. Sample BHC Trees
72. Overall Summary and Take-Aways
- Divide and conquer strategy to solve difficult problems
  - Divide: partition
  - Conquer: learn models
- Beneficial to interleave the divide and conquer phases and solve the two problems simultaneously
- Applicable to any domain involving complex problems
  - Marketing
  - Internet search/advertising
  - Recommender systems
  - Load forecasting
73. References
- L. Lokmic and K. A. Smith. Cash flow forecasting using supervised and unsupervised neural networks. IJCNN, 2000.
- M. Djukanovic, B. Babic, D. Sobajic and Y. Pao. Unsupervised/supervised learning concept for 24-hour load forecasting. IEE Proceedings - Generation, Transmission and Distribution, 1993.
- D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. KDD, 2007.
- M. Deodhar and J. Ghosh. A framework for simultaneous co-clustering and learning from complex data. KDD, 2007.
- S. Kumar, J. Ghosh and M. Crawford. Hierarchical fusion of multiple classifiers for hyperspectral data analysis. Pattern Analysis and Applications, 5(2):210-220, 2002.
- T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.
- T. Hastie and R. Tibshirani. Classification by pairwise coupling. Advances in Neural Information Processing Systems 10, 1998.
74. References
- A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. JMLR, 8:1919-1986, 2007.
- L. Conde, A. Mateos, J. Herrero and J. Dopazo. Improved class prediction in DNA microarray gene expression data by unsupervised reduction of the dimensionality followed by supervised learning with a perceptron. Journal of VLSI Signal Processing Systems, 35:245-253, 2003.
- X. Liu, A. Krishnan and A. Mondry. An entropy-based gene selection method for cancer classification using microarray data. BMC Bioinformatics, 2005.
- R. Grover and V. Srinivasan. A Simultaneous Approach to Market Segmentation and Market Structuring. JMR, 1987.
- M. Wedel and J. Steenkamp. A Clusterwise Regression Method for Simultaneous Fuzzy Market Structuring and Benefit Segmentation. JMR, 1991.
- W. Moe and P. Fader. Modeling Hedonic Portfolio Products: A Joint Segmentation Analysis of Music Compact Disc Sales. JMR, 2001.
75. References
- T. Reutterer. Competitive Market Structure and Segmentation Analysis with Self-Organizing Feature Maps. EMAC, 1998.
- M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 1994.
- R. Jacobs, M. Jordan, S. Nowlan and G. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79-87, 1991.
- T. Hofmann and J. Puzicha. Latent Class Models for Collaborative Filtering. Proceedings of the International Joint Conference on Artificial Intelligence, 1999.