Title: Simultaneous Partitioning and Learning for Prediction from Complex Data
1. Simultaneous Partitioning and Learning for Prediction from Complex Data
- Meghana Deodhar and Joydeep Ghosh
- Dept. of ECE, The University of Texas at Austin
2. Background
- Difficult classification/regression problems may involve a heterogeneous population
- Divide and conquer approach
  - Partition the population and model each partition separately
- Advantages
  - Learning models on more homogeneous data improves accuracy
  - Simpler, more interpretable models
3. Cash Flow Forecasting (L. Lokmic and K. Smith, 2000)
- Problem: forecast when issued checks will be cashed in
- Very large and heterogeneous training dataset
  - Check amounts span a very large range (0.02 to 32 million)
- Data partitioning
  - Partition checks based on amount: small, medium, large
  - SOM to cluster the small group into 3 (relatively) homogeneous partitions
- Learn a separate predictive model (regression or NN) in each partition
- Collection of predictive models more accurate than a single model
4. Cash Flow Forecasting (L. Lokmic and K. Smith, 2000)
- Experimental Results
- Check dataset: 11,000 checks over a 3 month period
- Target variable: duration in days after which an issued check will be cashed in
- Error = |(actual - forecast) / actual|
5. Short-Term Load Forecasting (M. Djukanovic et al., 1993)
- Problem: forecast the hourly electric load pattern for a day
- First level of division
  - Model working days, weekends and holidays separately
- For each day type, cluster input data into coherent groups and train a model in each group
  - Relation between input features and load profile is stronger within each group than over the entire population
- Classify a test point into a cluster and use the corresponding model to forecast load (a generic sketch of this partition-then-predict scheme follows below)
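A minimal illustrative sketch of this generic cluster-then-predict scheme (not the code of either study cited above; it assumes scikit-learn, a synthetic regression dataset, and k-means for the clustering step):

```python
# Minimal sketch of a-priori "partition, then model": cluster the training
# inputs, fit one regressor per cluster, and route each test point to the
# model of its assigned cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))          # synthetic features (assumed)
y_train = X_train @ rng.normal(size=4) + rng.normal(scale=0.1, size=500)
X_test = rng.normal(size=(50, 4))

k = 3                                        # number of partitions (assumed)
clusterer = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)

# One predictive model per partition, trained only on that partition's data.
models = {}
for c in range(k):
    mask = clusterer.labels_ == c
    models[c] = LinearRegression().fit(X_train[mask], y_train[mask])

# Classify each test point into a cluster and use the corresponding model.
test_clusters = clusterer.predict(X_test)
y_pred = np.array([models[c].predict(x.reshape(1, -1))[0]
                   for c, x in zip(test_clusters, X_test)])
```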
6. Motivation
- Traditionally partitioning is a priori
  - Domain knowledge
  - Clustering algorithm
- But a priori partitioning may be suboptimal
- Solution: interleave partitioning and construction of prediction models
- Hard vs. soft partitioning solutions
7. Agenda
- Hard partitioning of both objects and features
  - Homogeneous groups of dyadic data
  - Classification
  - Regression
- Soft partitioning of input space for regression
  - Mixture of Experts
  - Hierarchical versions
  - Fuzzy clustering and modeling
- Output space partitioning for modeling a large number of classes
8. Hard Partitioning of Both Objects and Features
- References
  - Deodhar and Ghosh, KDD 07
  - Agarwal and Merugu, KDD 07
- Motivation
  - Dyadic data is now pervasive; attributes are available for objects and features
  - Recommender systems: customers, products, ratings
  - Search advertising: queries, ads, click-through probabilities
  - Web search: queries, web pages, relevance scores
9. Applications in Bioinformatics
- Microarray data analysis (Liu et al. 05, Conde et al. 03)
- Clustering followed by supervised learning
  - Problem: classification of experiments
  - Pre-processing step: clustering genes
  - Each cluster represented by a cluster representative
- Dimensionality reduction step for classification
  - Reduces gene redundancies/noise
  - Parsimonious classification models
10. Applications in Marketing
- Simultaneous market segmentation and structure
  - Partition the consumers into homogeneous groups
  - Find groups of equivalent products
  - Dual problems
- Sequential clustering and analysis
  - Identify consumer segments
    - Maximum likelihood based latent class models (Grover and Srinivasan 87)
    - SOM based clustering of consumers (Reutterer 98)
  - Competing products: ones with high purchase probabilities in each consumer segment
11. Example: Recommender System
- Predict customer purchase decisions
[Figure: customer-product matrix, with product attributes (e.g. price, market share) and customer attributes (e.g. demographics)]
12. Possible Approaches
- Collaborative Filtering
- Classification
- Logistic regression
- Co-clustering or Bi-clustering
- Bregman Co-clustering
13. Collaborative Filtering
- Collaborative filtering: a technique for reducing information overload
- Improves access to relevant products and information
  - e.g. recommender systems that suggest books, films, music, etc.
- Predict how well a user will like an unrated item
  - Based on the preferences of a community
- Preference judgments can be explicit or implicit
  - Explicit: numerical ratings for each item
  - Implicit: extracted from purchase records or web logs
14. Collaborative Filtering
- Find a neighborhood of similar customers
  - Based on known choices
- Predict current purchase decision using the preferences of the neighborhood
- Ignores customer/product attributes
[Figure: customer-product matrix with a neighborhood of similar customers highlighted]
15. Single Classification Model
- Constructs a map from the feature vector of a customer-product combination to the choice
- May not be adequate to capture heterogeneity
- Does not use neighborhood information
  - Similarity of customers/products
Feature vector: customer and product attributes
Target variable: matrix entries
16. Co-clustering
- Simultaneously clusters along multiple axes
- Exploits the duality between the axes
- Improves upon single-sided clustering
- Applications
  - Microarray data analysis (genes and experiments)
  - Text data clustering (documents and words)
- Bregman co-clustering (Banerjee et al. 06)
  - Partitional: divides the matrix into a grid of rectangular blocks
  - Can deal with missing data
17. Co-clustering
- Identifies neighborhoods of similar customers and products
- Predicts an unknown choice using known entries within the co-cluster
- Ignores customer and product attributes
[Figure: matrix partitioned into a grid of customer clusters x product clusters]
18. Simultaneous Co-clustering and Classification
- Exploits both neighborhood information and attributes
- Iteratively clusters along both axes and fits a predictive model in each co-cluster
- Common framework for solving classification and regression problems
[Figure: co-clustered matrix with product attributes, customer attributes, and a classification model per co-cluster]
19. Problem Definition (Regression)
- Z: m x n matrix of customers and products
- Matrix entries are real numbers (e.g. ratings)
- Assumption: each matrix entry is a linear combination of customer and product attributes (Ci and Pj)
  - Model parameters: β^T = [β0, βc^T, βp^T]
  - Attribute vector: x_ij^T = [1, Ci^T, Pj^T]
- Aim: simultaneously cluster customers and products into a grid of co-clusters, such that the values within each co-cluster are predicted by the same regression model
20-23. Regression Example
[Figures: a toy customer-product matrix plotted against customer (c) and product (p) attributes; after rearranging rows and columns, co-clusters emerge, each fit by its own linear model (apparently z = c + 2p, z = 1 + c + p, z = 5c + p, and z = 2c + 3p)]
24. Reconstruction Errors
- Reconstructed with simultaneous co-clustering and regression: MSE = 7.9
- Reconstructed with a single linear model (z ≈ 1.2 + 3.6c + 1.5p): MSE = 21.8
25. Objective Function
- ρ: mapping from m rows to k row clusters
- γ: mapping from n columns to l column clusters
- Total of k·l regression models
- Weight w_uv associated with each matrix entry: 1 for a known entry, 0 for a missing one
- Find the co-clustering (ρ, γ) and models (β's) that minimize the total squared error
26. Objective Function Details
- Indicates how well the co-cluster models fit the given data
- Based on the prediction model, not cluster homogeneity!
- Elementwise squared error summed over all matrix entries
[Equation annotated with: sum over row clusters; sum over column clusters; all rows in the row cluster; all columns in the column cluster; predicted value using the co-cluster linear model — a written-out reconstruction follows below]
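A plausible written-out form of this objective, reconstructed from the annotations above and the definitions on the previous slide (ρ, γ: row/column cluster mappings; β_gh: the model of co-cluster (g, h); w_uv: 0/1 weights); see Deodhar and Ghosh (KDD 07) for the exact formulation:

```latex
\min_{\rho,\gamma,\{\beta_{gh}\}}\;
\underbrace{\sum_{g=1}^{k}}_{\text{row clusters}}
\underbrace{\sum_{h=1}^{l}}_{\text{column clusters}}
\underbrace{\sum_{u:\,\rho(u)=g}}_{\text{rows in cluster } g}
\underbrace{\sum_{v:\,\gamma(v)=h}}_{\text{columns in cluster } h}
w_{uv}\,\bigl(z_{uv} - \underbrace{\beta_{gh}^{T} x_{uv}}_{\text{co-cluster model prediction}}\bigr)^{2}
```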
27. Row and Column Cluster Updates
- Objective function is a sum of row/column errors
- Assign each row to the row cluster that minimizes the row error
- Row cluster assignment for row u
[Figure: row u evaluated against the models of each of 3 row clusters (β11, β12; β21, β22; β31, β32), yielding errors e1(u), e2(u), e3(u); u is assigned to the row cluster with the smallest error]
28. Meta-Algorithm
- Input: data Z, weights W, attributes C, P
- Output: co-clustering (ρ, γ), models β's
- Initialize ρ, γ
- Iterate until convergence
  - Re-estimate the model for each co-cluster
  - Re-estimate the co-clusters
    - Update row clusters: assign each row to the closest row cluster
    - Update column clusters: assign each column to the closest column cluster
- Return ρ, γ, β's
- Guaranteed to converge to a locally optimal solution (an illustrative sketch follows below)
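An illustrative sketch of the regression version of this meta-algorithm, as a hard-assignment alternating minimization; function and variable names are ours, not the paper's, and it assumes numpy with a weighted least-squares fit per co-cluster:

```python
# Sketch: alternate between fitting a least-squares model per co-cluster and
# reassigning rows/columns to the clusters whose models give the smallest
# weighted squared error. X[u, v] is the attribute vector x_uv = [1, C_u, P_v].
import numpy as np

def fit_coclusters(Z, W, X, k, l, n_iter=20, seed=0):
    # Z: (m, n) responses, W: (m, n) 0/1 weights, X: (m, n, d) attributes.
    m, n, d = X.shape
    rng = np.random.default_rng(seed)
    rho = rng.integers(k, size=m)            # row cluster assignments
    gamma = rng.integers(l, size=n)          # column cluster assignments
    beta = np.zeros((k, l, d))

    def cell_sse(u_idx, v_idx, b):
        pred = X[np.ix_(u_idx, v_idx)] @ b
        err = (Z[np.ix_(u_idx, v_idx)] - pred) ** 2
        return W[np.ix_(u_idx, v_idx)] * err

    for _ in range(n_iter):
        # 1. Re-estimate the model in each co-cluster (weighted least squares).
        for g in range(k):
            for h in range(l):
                rows, cols = np.where(rho == g)[0], np.where(gamma == h)[0]
                if len(rows) == 0 or len(cols) == 0:
                    continue
                Xgh = X[np.ix_(rows, cols)].reshape(-1, d)
                zgh = Z[np.ix_(rows, cols)].ravel()
                sw = np.sqrt(W[np.ix_(rows, cols)].ravel())
                beta[g, h], *_ = np.linalg.lstsq(Xgh * sw[:, None], zgh * sw,
                                                 rcond=None)
        # 2. Re-assign each row to the row cluster minimizing its error.
        for u in range(m):
            errs = [sum(cell_sse([u], np.where(gamma == h)[0], beta[g, h]).sum()
                        for h in range(l)) for g in range(k)]
            rho[u] = int(np.argmin(errs))
        # 3. Re-assign each column to the column cluster minimizing its error.
        for v in range(n):
            errs = [sum(cell_sse(np.where(rho == g)[0], [v], beta[g, h]).sum()
                        for g in range(k)) for h in range(l)]
            gamma[v] = int(np.argmin(errs))
    return rho, gamma, beta
```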
29. Simultaneous Co-clustering and Classification
- Elements of Z are class labels (2 class problem)
- Logistic regression model relating the attributes to the class label
  - Log odds modeled as a linear combination of the attributes (β^T x_ij)
- Find the co-clustering (ρ, γ) and models (β's) that minimize the total log loss (negative log likelihood)
- Log loss instead of squared error (a plausible written-out form follows below)
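A plausible written-out form of the classification objective, using the same notation as the regression objective above with z_uv ∈ {0, 1}; this is a reconstruction, not copied from the slide:

```latex
\log\frac{p_{uv}}{1-p_{uv}} = \beta_{gh}^{T} x_{uv},
\qquad
\min_{\rho,\gamma,\{\beta_{gh}\}}
\sum_{g,h}\;\sum_{\substack{u:\,\rho(u)=g\\ v:\,\gamma(v)=h}}
w_{uv}\Bigl[\log\bigl(1+e^{\beta_{gh}^{T}x_{uv}}\bigr) - z_{uv}\,\beta_{gh}^{T}x_{uv}\Bigr]
```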
30. Meta-Algorithm (Classification)
- Input: data Z, weights W, attributes C, P
- Output: co-clustering (ρ, γ), classification models β's
- Initialize ρ, γ
- Iterate until convergence
  - Re-estimate the classification model for each co-cluster
  - Re-estimate the co-clusters
    - Update row clusters: assign each row to the closest row cluster
    - Update column clusters: assign each column to the closest column cluster
    - (Closest in terms of least prediction error)
- Return ρ, γ, β's
- Guaranteed to converge to a locally optimal solution
31. Predicting Missing Values
- If missing matrix entry z_uv is assigned to row cluster g and column cluster h with model parameters β_gh, predict z_uv as
  - Classification
  - Regression
(reconstructions of the prediction equations follow below)
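The prediction equations themselves were shown graphically on the slide; reconstructions consistent with the models above:

```latex
\text{Regression: } \hat{z}_{uv} = \beta_{gh}^{T} x_{uv},
\qquad
\text{Classification: } \hat{p}(z_{uv}=1) = \frac{1}{1+e^{-\beta_{gh}^{T} x_{uv}}}
```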
32. Reduced Parameter Approach
- Simultaneous co-clustering and prediction
  - k·l independent models
  - (1 + |C| + |P|)·k·l parameters
  - May overfit when training data is limited
- Single model
  - (1 + |C| + |P|) parameters
  - May not be adequate
- Reduced parameter approach
  - k·l models, but smoothing achieved by sharing parameters
  - Customer (product) coefficients for all models in the same row (column) cluster are constrained to be identical
  - (1 + |C|)·k + (1 + |P|)·l parameters (a worked example follows below)
33. Model Update Step: update customer coefficients
34. Model Update Step: update product coefficients
35. Recommender System Application
- Predict unknown course choices of masters students
- 32 courses, 326 students
- Student attributes: career aspiration, undergraduate degree
- Course attributes: department, evaluation score
[Figure: classification error; Model CC and CC with k = 2, l = 2]
36. Results
[Figures: F-measure plot and precision-recall curve]
37. ERIM Marketing Dataset
- Household panel data collected by A.C. Nielsen
- 1714 customers, 121 products from 6 product categories (ketchup, sugar, etc.)
- Customer-product matrix cell values: units purchased
- Household attributes: income, residents, male head employed, female head employed, total visits, total expense
- Product attributes: market share, price, times the product was advertised
- Predict units purchased
38. Data Sample
39. Dataset Details
- Properties
  - Sparse: 74.86% of values are 0
  - Very skewed: 99.12% of values < 20, the rest very high (outliers)
- Standardization of product attributes and units purchased
- Linear least squares is very sensitive to outliers
  - Separate models for high and low valued matrix entries
  - Threshold of 20 units purchased
40. Results
- Model for low valued matrix entries
- Bulk of the data (99.12%)
41. Market Segmentation and Structure
[Figure: coefficients of the global model and sample co-cluster models; annotations include "low market share", "cheapest, most popular products", "high income, large visits"]
42. Lessons Learnt
- Interpretable and actionable segmentation and models
- Coefficients of co-cluster models differ from the global model
  - Multiple models required to capture heterogeneity
- Co-cluster models differ significantly
  - Different purchase factors important for different customer-product subsets
- Product attributes more indicative of preference
- Elimination of insignificant predictors to get sparse models
43. Extensions
- Can be extended to work with other prediction models
- Other applications
  - Microarray data clustering (incorporate gene and experiment annotations)
  - Recommending blogs to users (matrix of blogs vs. users)
  - Clustering web documents annotated by link and semantic information
  - Analyzing survey data: simultaneous market segmentation and structure
- Extensions to time-series data: each matrix entry is a vector of values over time (e.g. customer purchase behavior over time)
44. Predictive Discrete Latent Factor Models (D. Agarwal and S. Merugu 07)
- Similar motivation and problem setting
  - Prediction of missing matrix entries (dyadic response variables), given attributes (covariate information)
- Uses co-clustering to solve a prediction problem
- Response variable modeled as a sum of
  - A function of covariates (global structure)
  - A co-cluster specific constant (local structure)
- Exploits local structure
  - Co-cluster specific constant assumed as part of the noise model
  - Teased out of the global model residuals
45. PDLF Model
- Constrained mixture model
  - k·l components, π_IJ: mixture prior of the IJ-th component
- Each component is a generalized linear model
  - f: exponential family, g: link function
- Global trends x_ij^T β shared across the components
- Each co-cluster/latent factor has an additional offset δ_IJ (a sketch of the mixture density follows below)
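A sketch of the PDLF mixture density consistent with the description above (see Agarwal and Merugu 07 for the exact form):

```latex
p(z_{ij} \mid x_{ij}) \;=\; \sum_{I=1}^{k}\sum_{J=1}^{l}
\pi_{IJ}\; f\!\bigl(z_{ij};\, g^{-1}(x_{ij}^{T}\beta + \delta_{IJ})\bigr)
```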
46. Model Estimation
- Generalized EM algorithm
- Soft vs. hard assignment
- Main steps
  - Random initialization of row/column clusters and parameters
  - Repeat till convergence
    - Estimate global model coefficients β (Newton-Raphson method)
    - Estimate co-cluster offsets δ_IJ
    - Find the optimal row and column clustering
- Scalable: each iteration is linear in the number of observations (a simplified Gaussian-case sketch follows below)
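A simplified hard-assignment sketch for the Gaussian / identity-link special case (the actual algorithm is a generalized EM over arbitrary exponential families with soft assignments; names and simplifications here are ours):

```python
# Sketch of PDLF-style estimation for the Gaussian / identity-link case:
# alternate between (a) refitting the global coefficients beta on the
# offset-corrected responses, (b) recomputing co-cluster offsets delta as
# mean residuals, and (c) reassigning rows/columns.
import numpy as np

def fit_pdlf_gaussian(Z, W, X, k, l, n_iter=20, seed=0):
    m, n, d = X.shape
    rng = np.random.default_rng(seed)
    rho = rng.integers(k, size=m)
    gamma = rng.integers(l, size=n)
    beta = np.zeros(d)
    delta = np.zeros((k, l))

    for _ in range(n_iter):
        # (a) Global coefficients on responses with co-cluster offsets removed.
        offs = delta[rho][:, gamma]                  # (m, n) offset per cell
        sw = np.sqrt(W).ravel()
        A = X.reshape(-1, d) * sw[:, None]
        b = (Z - offs).ravel() * sw
        beta, *_ = np.linalg.lstsq(A, b, rcond=None)

        # (b) Co-cluster offsets: mean residual of the global fit per co-cluster.
        resid = Z - X @ beta
        for g in range(k):
            for h in range(l):
                mask = np.outer(rho == g, gamma == h) & (W > 0)
                if mask.any():
                    delta[g, h] = resid[mask].mean()

        # (c) Reassign rows and columns to minimize weighted squared error.
        for u in range(m):
            errs = [np.sum(W[u] * (resid[u] - delta[g, gamma]) ** 2)
                    for g in range(k)]
            rho[u] = int(np.argmin(errs))
        for v in range(n):
            errs = [np.sum(W[:, v] * (resid[:, v] - delta[rho, h]) ** 2)
                    for h in range(l)]
            gamma[v] = int(np.argmin(errs))
    return rho, gamma, beta, delta
```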
47. PDLF vs. Model CC
- PDLF
  - Single global model + co-cluster constants
  - Robust even when data is limited
- Model CC
  - k x l co-cluster models
  - Works well when a large amount of data is available
- Complementary approaches
48. Logistic Regression on MovieLens
- Ratings > 3 treated as positive; 23 covariates
[Figure: comparison of PDLF, co-clustering, and logistic regression]
49. Experiments on Click Count Data
- Dataset: 47,903 ip-domains, 585 web-sites
- Attributes: ip-location, routing type, etc.
- Predict click count observations
- PDLF model based on Poisson distributions
- Co-cluster interactions more interesting than in ordinary co-clustering
[Figure: prediction results]
50. Agenda
- Hard partitioning of both objects and features
  - Homogeneous groups of dyadic data
  - Classification
  - Regression
- Soft partitioning of input space for regression
  - Mixture of Experts
  - Hierarchical versions
  - Fuzzy clustering and modeling
- Output space partitioning for modeling a large number of classes
51. Soft Partitioning: Mixture of Experts Model (Jacobs and Jordan '91)
- Prediction model is composed of a collection of "experts" specializing in different regions of the input space
- Simultaneous partitioning of the input space and training of the experts
- Soft partitioning: multiple experts involved to different degrees in predicting an output
- Uses linear models to form non-linear maps
52. Modular Decomposition in Mixtures of Experts
[Figure: experts 1..K produce outputs y1..yK from input x; a gating network produces weights g1(x)..gK(x); the combined output is Y(x) = Σ_i g_i(x) y_i(x)]
53. Gating and Expert Models
- Simple models for local fits
- Generalization of CART
[Figure: expert network with input layer x1..xp and linear output layer producing y1..yk; gating network with input layer x1..xp and softmax output layer producing g1..gk; each network has its own weight parameters]
54. Underlying Probability Model
- Gating network output layer: softmax
  - Ensures the gating outputs are non-negative and sum to 1
- Probability model
  - p(y | x, j) modeled as a Gaussian with mean µj
  - Output of expert j: µj
- Maximum likelihood formulation
- Parameters determined using gradient ascent or EM (the standard form is sketched below)
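The standard form of the softmax gate and the resulting mixture likelihood, consistent with the description above (v_j: gating weights, µ_j(x): the output of expert j):

```latex
g_j(x) = \frac{e^{v_j^{T}x}}{\sum_{i=1}^{K} e^{v_i^{T}x}}
\;\;\Rightarrow\;\; g_j(x)\ge 0,\;\; \sum_{j=1}^{K} g_j(x)=1,
\qquad
p(y \mid x) = \sum_{j=1}^{K} g_j(x)\,\mathcal{N}\!\bigl(y;\, \mu_j(x),\, \sigma_j^{2}\bigr)
```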
55. Hierarchical MOEs (Jordan and Jacobs 94)
56. Soft Partitioning Approaches in Marketing
- Fuzzy clusterwise regression (Wedel and Steenkamp 89)
  - To solve the simultaneous market segmentation and structure problem
- Fuzzy co-clusters
  - Fractional memberships for customers and products
- Preferences modeled as a linear combination of product attributes
- Minimize total squared error between actual and predicted preferences
- Interpretation difficult; hard version gives block diagonal co-clusters
57. Soft Partitioning Approaches in Marketing
- Latent class modeling (Hofmann and Puzicha 99)
- Problem: predict individual choice based on history
  - No attribute information used
- Aspect model
  - Each customer-product combination assumed to be generated from 1 of K latent classes (clusters)
- Two-sided clustering model
  - Each customer belongs to 1 of k customer clusters
  - Each product belongs to 1 of l product clusters
  - Association parameter between customer and product clusters
- Models estimated using EM
58. Agenda
- Hard partitioning of both objects and features
  - Homogeneous groups of dyadic data
  - Classification
  - Regression
- Soft partitioning of input space for regression
  - Mixture of Experts
  - Hierarchical versions
  - Fuzzy clustering and modeling
- Output space partitioning for modeling a large number of classes
59. Addressing Multi-class Problems via Output Space Decomposition
- Approach: convert a C-class problem into multiple binary classification problems
- History
  - Committee machine (Nilsson 65): 1 class vs. all others, followed by voting/max
  - Pairwise classification (Hastie and Tibshirani 96): combine pairwise class probabilities using a KL divergence based criterion
  - Error correcting output coding (Dietterich and Bakiri 95)
- Pros: meta-classifiers can be less complex
- Cons: groupings may be forced
60. Are Classes Always Orthogonal?
- All previous approaches (one vs. all, pairwise, ECOC) do not exploit any relationships between classes
- But often, some classes are closer to one another than to others!
- Also, classes may be sub-classes/mixed classes
- Example: hyperspectral data
61. Bolivar Peninsula
62. Dataset
63. Class Hierarchy
[Figure: class hierarchy over the land-cover classes: Water, General Uplands, Pure Salicornia, Sand Flats, Bare Soil, High Proximal Marsh, Transition Zone, Low Proximal Marsh, Pasture, High Distal Marsh, Trees]
64. Hierarchical Grouping of Classes
- Desired: a general framework for natural grouping of classes
  - Hierarchically divide into related classes/subclasses
  - Custom features
- Top down partitioning: solve 3 coupled problems
  - Group the classes into two meta-classes
  - Design a feature extractor tailored for the 2 meta-classes
  - Design the 2-meta-class classifier
65. Binary Hierarchical Classifier (BHC) (S. Kumar et al. 02)
- N leaf nodes (classes)
- N-1 internal nodes
  - Partition into two meta-classes
- Node specific design flexibility
  - Fisher discriminant based features
  - Bayesian classifier
- Bottom up and top down implementations
- Both soft and hard versions
66. BHC: Partitioning a Set of Classes
- Three coupled problems
  - Finding the partition
  - Determining the discriminant function (Fisher discriminant)
  - Estimating the parameters of the Bayesian binary classifier
- Solved via deterministic annealing
  - Initialize partitions
  - For a given temperature T
    - Compute soft meta-class parameters
    - Compute the Fisher discriminant
    - Update associations
  - Reduce T, continue (an illustrative sketch of this loop follows below)
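An illustrative sketch of this annealing loop for splitting one node's set of classes into two meta-classes, assuming per-class means, covariances, and priors are given; regularization details, stopping criteria, and the final Bayesian classifier are omitted, and all names are ours:

```python
# Sketch of a deterministic-annealing split of classes into two meta-classes
# (the BHC node-splitting loop described above), using soft associations that
# harden as the temperature is lowered.
import numpy as np

def split_classes(class_means, class_covs, class_priors,
                  T0=10.0, T_min=0.01, cool=0.8, seed=0):
    C, d = class_means.shape
    rng = np.random.default_rng(seed)
    # Soft association of each class with the "right" meta-class, in (0, 1).
    assoc = 0.5 + 0.01 * rng.standard_normal(C)

    T = T0
    while T > T_min:
        # Soft meta-class parameters (weighted means and within-class scatter).
        w_r = class_priors * assoc
        w_l = class_priors * (1.0 - assoc)
        m_r = (w_r[:, None] * class_means).sum(0) / w_r.sum()
        m_l = (w_l[:, None] * class_means).sum(0) / w_l.sum()
        Sw = sum((w_r[c] + w_l[c]) * class_covs[c] for c in range(C))

        # Fisher discriminant direction for the two meta-classes.
        w = np.linalg.solve(Sw + 1e-6 * np.eye(d), m_r - m_l)

        # Update associations: classes whose projected mean is nearer the
        # right meta-class get a higher association (sigmoid at temperature T).
        d_r = (class_means @ w - m_r @ w) ** 2
        d_l = (class_means @ w - m_l @ w) ** 2
        assoc = 1.0 / (1.0 + np.exp(np.clip((d_r - d_l) / T, -50, 50)))

        T *= cool                        # cool down and continue
    return assoc                         # ~0/1 associations at low temperature
```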
67. Evolution of Class Partitioning
[Plot: association with the right child (y-axis) vs. iteration (x-axis)]
68. ECOC vs. BHC
- ECOC techniques
  - Error correcting ensembles compensate for flaws of weak base classifiers
  - Some decision boundaries may be too difficult to learn (natural class affinities not exploited)
- BHC
  - Groups classes by natural affinities → simpler two-(meta)class problems
  - Node specific feature extraction/selection techniques
69. Advantages of BHC over ECOC Methods
- Comparable performance even with a small percentage of training data; but going beyond accuracy, BHC has additional advantages
  - Eliminates the need to come up with an optimal code matrix
  - Yields an interpretable ensemble of classifiers
  - Can construct classifiers at varying levels of granularity
  - Reveals inherent class affinities and enables the use of adaptive feature selection techniques
  - Simpler decision boundaries
  - Facilitates knowledge transfer
70. Experimental Results (BHC vs. ECOC)
71. Sample BHC Trees
72. Overall Summary and Take-Aways
- Divide and conquer strategy to solve difficult problems
  - Divide: partition
  - Conquer: learn models
- Beneficial to interleave the divide and conquer phases and solve the two problems simultaneously
- Applicable to any domain involving complex problems
  - Marketing
  - Internet search/advertising
  - Recommender systems
  - Load forecasting
73. References
- L. Lokmic and K. A. Smith. Cash flow forecasting using supervised and unsupervised neural networks. IJCNN, 2000.
- M. Djukanovic, B. Babic, D. Sobajic and Y. Pao. Unsupervised/supervised learning concept for 24-hour load forecasting. IEE Proceedings - Generation, Transmission and Distribution, 1993.
- D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. KDD, 2007.
- M. Deodhar and J. Ghosh. A framework for simultaneous co-clustering and learning from complex data. KDD, 2007.
- S. Kumar, J. Ghosh and M. Crawford. Hierarchical fusion of multiple classifiers for hyperspectral data analysis. Pattern Analysis and Applications, 5(2):210-220, 2002.
- T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.
- T. Hastie and R. Tibshirani. Classification by pairwise coupling. Advances in Neural Information Processing Systems 10, 1998.
74. References
- A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. JMLR, 8:1919-1986, 2007.
- L. Conde, A. Mateos, J. Herrero and J. Dopazo. Improved class prediction in DNA microarray gene expression data by unsupervised reduction of the dimensionality followed by supervised learning with a perceptron. Journal of VLSI Signal Processing Systems, 35:245-253, 2003.
- X. Liu, A. Krishnan and A. Mondry. An entropy-based gene selection method for cancer classification using microarray data. BMC Bioinformatics, 2005.
- R. Grover and V. Srinivasan. A Simultaneous Approach to Market Segmentation and Market Structuring. JMR, 1987.
- M. Wedel and J. Steenkamp. A Clusterwise Regression Method for Simultaneous Fuzzy Market Structuring and Benefit Segmentation. JMR, 1991.
- W. Moe and P. Fader. Modeling Hedonic Portfolio Products: A Joint Segmentation Analysis of Music Compact Disc Sales. JMR, 2001.
75. References
- T. Reutterer. Competitive Market Structure and Segmentation Analysis with Self-Organizing Feature Maps. EMAC, 1998.
- M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 1994.
- R. Jacobs, M. Jordan, S. Nowlan and G. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79-87, 1991.
- T. Hofmann and J. Puzicha. Latent Class Models for Collaborative Filtering. Proceedings of the International Joint Conference on Artificial Intelligence, 1999.