Simultaneous Partitioning and Learning for Prediction from Complex Data

1
Simultaneous Partitioning and Learning for
Prediction from Complex Data
  • Meghana Deodhar and Joydeep Ghosh
  • Dept of ECE
  • The University of Texas at Austin

2
Background
  • Difficult classification/regression problems may
    involve a heterogeneous population
  • Divide and conquer approach:
  • Partition the population and model each partition
    separately
  • Advantages:
  • Learning models on more homogeneous data improves
    accuracy
  • Simpler, more interpretable models

3
Cash Flow Forecasting (L. Lokmic and K. Smith '00)
  • Problem: Forecast when issued checks will be
    cashed in
  • Very large and heterogeneous training dataset
  • Check amounts span a very large range (0.02 to
    32 million)
  • Data partitioning:
  • Partition checks based on amount: small, medium,
    large
  • SOM to cluster the small group into 3 (relatively)
    homogeneous partitions
  • Learn a separate predictive model (regression or
    NN) in each partition
  • Collection of predictive models is more accurate
    than a single model

4
Cash Flow Forecasting (L. Lokmic and K. Smith '00)
  • Experimental Results
  • Check dataset: 11,000 checks over a 3-month
    period
  • Target variable: duration in days after which an
    issued check will be cashed in
  • Error = |actual - forecast| / actual

5
Short-Term Load Forecasting (M. Djukanovic et al. '93)
  • Problem: Forecast the hourly electric load
    pattern for a day
  • First level of division:
  • Model working days, weekends and holidays
    separately
  • For each day type, cluster input data into
    coherent groups and train a model in each group
  • Relation between input features and load profile
    is stronger within each group vs. the entire
    population
  • Classify a test point into a cluster and use the
    corresponding model to forecast load

6
Motivation
  • Traditionally, partitioning is done a priori
  • Domain knowledge
  • Clustering algorithm
  • but a priori partitioning may be suboptimal
  • Solution: Interleave partitioning and
    construction of prediction models
  • Hard vs. soft partitioning solutions

7
Agenda
  • Hard partitioning of both objects and features
  • Homogeneous groups of dyadic data
  • Classification
  • Regression
  • Soft partitioning of input space for regression
  • Mixture of Experts
  • Hierarchical versions
  • Fuzzy clustering and modeling
  • Output Space Partitioning for modeling a large
    number of classes

8
Hard partitioning of both objects and features
  • References:
  • Deodhar and Ghosh, KDD '07
  • Agarwal and Merugu, KDD '07
  • Motivation:
  • Dyadic data is now pervasive; attributes are
    available for objects and features
  • Recommender system: customers, products, ratings
  • Search advertising: query, ads, click-through
    probabilities
  • Web search: query, web pages, relevance score

9
Applications in Bioinformatics
  • Microarray data analysis (Liu et al. '05, Conde
    et al. '03)
  • Clustering followed by supervised learning
  • Problem: Classification of experiments
  • Pre-processing step: clustering genes
  • Each cluster represented by a cluster
    representative
  • Dimensionality reduction step for classification
  • Reduces gene redundancies/noise
  • Parsimonious classification models

10
Applications in Marketing
  • Simultaneous market segmentation and structure
  • Partition the consumers into homogeneous groups
  • Find groups of equivalent products
  • Dual problems
  • Sequential clustering and analysis:
  • Identify consumer segments
  • Maximum likelihood based latent class models
    (Grover and Srinivasan '87)
  • SOM based clustering of consumers (Reutterer '98)
  • Competing products: ones with high purchase
    probabilities in each consumer segment

11
Example: Recommender System
  • Predict customer purchase decisions

[Figure: customer-product purchase matrix (rows: customers, columns: products); product attributes e.g. price, market share; customer attributes e.g. demographics]
12
Possible Approaches
  • Collaborative Filtering
  • Classification
  • Logistic regression
  • Co-clustering or Bi-clustering
  • Bregman Co-clustering

13
Collaborative Filtering
  • Collaborative Filtering: a technique for reducing
    information overload
  • Improves access to relevant products and
    information
  • e.g. Recommender systems that suggest books,
    films, music, etc.
  • Predict how well a user will like an unrated item
  • Based on preferences of a community
  • Preference judgments can be explicit or implicit
  • Explicit: numerical ratings for each item
  • Implicit: extracted from purchase records or web
    logs

14
Collaborative Filtering
  • Find a neighborhood of similar customers
  • Based on known choices
  • Predict current purchase decision using
    preference of neighborhood
  • Ignores customer/product attributes

[Figure: customer-product matrix with a neighborhood of similar customers]
15
Single Classification Model
  • Constructs a map from the feature vector of a
    customer-product combination to the choice
  • May not be adequate to capture heterogeneity
  • Does not use neighborhood information
  • Similarity of customers/products

Feature vector: customer and product attributes
Target variable: matrix entries
16
Co-clustering
  • Simultaneously clusters along multiple axes
  • Exploits the duality between the axes
  • Improves upon single-sided clustering
  • Applications:
  • Microarray data analysis (genes and experiments)
  • Text data clustering (documents and words)
  • Bregman Co-clustering (Banerjee et al. '06)
  • Partitional: divides the matrix into a grid of
    rectangular blocks
  • Can deal with missing data

17
Co-clustering
  • Identifies neighborhood of similar customers and
    products
  • Predicts unknown choice using known entries
    within the co-cluster
  • Ignores customer and product attributes

[Figure: matrix partitioned into a grid of product clusters and customer clusters]
18
Simultaneous Co-clustering and Classification
  • Exploits neighborhood information and attributes
  • Iteratively clusters along both axes and fits
    predictive model in each co-cluster
  • Common framework for solving classification and
    regression problems

[Figure: co-clustered matrix with a classification model per co-cluster, using product and customer attributes]
19
Problem Definition (Regression)
  • Z: m x n matrix of customers and products
  • Matrix entries are real numbers (e.g. ratings)
  • Assumption: a matrix entry is a linear combination
    of customer and product attributes (Ci and Pj)
  • Model parameters: β^T = [β_0, β_c^T, β_p^T]
  • Attribute vector: x_ij^T = [1, C_i^T, P_j^T]
  • Aim: Simultaneously cluster customers and
    products into a grid of co-clusters, such that
    the values within each co-cluster are predicted
    by the same regression model (see the model
    sketch below)
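
For concreteness, the per-co-cluster regression model implied by the bullets above can be written as follows (a reconstruction using the slide's notation; each co-cluster (g, h) has its own coefficient vector β_gh):

    z_ij ≈ β_gh^T x_ij = β_0 + β_c^T C_i + β_p^T P_j,   with x_ij^T = [1, C_i^T, P_j^T]

where β_0, β_c and β_p are the components of β_gh for that co-cluster.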

20
Regression Example
[Figure: toy data matrix indexed by a customer attribute c (rows) and a product attribute p (columns)]
21
Regression Example
  • After rearranging rows and cols.

[Figure: the same toy matrix after rearranging rows and columns]
22
Regression Example
[Figure: rearranged matrix; one co-cluster is fit by the model z = c + 2p]
23
Regression Example
[Figure: co-cluster models recovered from the toy matrix, e.g. z = c + 2p, z = 1 + c + p, z = 5c + p, z = 2c + 3p]
24
Reconstruction Errors
  • Reconstructed with simultaneous co-clustering
    and regression: MSE = 7.9
  • Reconstructed with a single linear model
    z = 1.2 + 3.6c + 1.5p: MSE = 21.8
25
Objective Function
  • ρ: mapping from m rows to k row clusters
  • γ: mapping from n columns to l column clusters
  • Total of k x l regression models
  • Weight w_uv associated with each matrix entry:
    1 - known entry, 0 - missing
  • Find the co-clustering (ρ, γ) and models (βs)
    that minimize the total squared error

26
Objective Function Details
  • Indicates how well the co-cluster models fit the
    given data
  • Based on the prediction model, not cluster
    homogeneity!
  • Element-wise squared error summed over all matrix
    entries

    min over (ρ, γ, {β_gh}):
      Σ_{g=1..k} Σ_{h=1..l} Σ_{u: ρ(u)=g} Σ_{v: γ(v)=h} w_uv (z_uv - β_gh^T x_uv)^2
    (sum over row clusters and column clusters, over all rows in each row
     cluster and all columns in each column cluster; the predicted value
     β_gh^T x_uv comes from the co-cluster's linear model)
27
Row and Column Cluster Updates
  • Objective function is a sum of row/column errors
  • Assign each row to the row cluster that minimizes
    the row error
  • Row cluster assignment for row u (update rule
    sketched below)
[Figure: row u scored against the co-cluster models β_11, β_12, ..., β_31, β_32; e_1(u), e_2(u), e_3(u) are the errors of row u under each candidate row cluster]
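
A plausible explicit form of the row update described above (with the column mapping γ held fixed; the column update is symmetric):

    e_g(u) = Σ_{v=1..n} w_uv (z_uv - β_{g,γ(v)}^T x_uv)^2
    ρ(u) = argmin_g e_g(u)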
28
Meta-Algorithm
  • Input: data Z, weights W, attributes C, P
  • Output: co-clustering (ρ, γ), models {β_gh}
  • Initialize ρ, γ
  • Iterate until convergence:
  • Re-estimate the model for each co-cluster
  • Re-estimate the co-clusters:
  • Update row clusters: assign each row to the
    closest row cluster
  • Update column clusters: assign each column to the
    closest column cluster
  • Return ρ, γ, {β_gh}
  • Guaranteed to converge to a locally optimal
    solution (a minimal sketch of this loop follows)
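
A minimal, illustrative implementation of this alternating loop for the regression case. The function name, the dense per-cell attribute tensor X, and the use of plain least squares per co-cluster are assumptions made for the sketch, not the authors' code.

import numpy as np

def cocluster_regress(Z, W, X, k, l, n_iters=20, seed=0):
    """Illustrative sketch of the meta-algorithm (regression case).

    Z : (m, n) response matrix (e.g. ratings)
    W : (m, n) 0/1 weights (1 = observed, 0 = missing)
    X : (m, n, d) attribute vector x_uv for every cell, with a leading 1
    Returns row labels rho, column labels gamma and one beta per co-cluster.
    """
    rng = np.random.default_rng(seed)
    m, n, d = X.shape
    rho = rng.integers(k, size=m)          # random initial row clustering
    gamma = rng.integers(l, size=n)        # random initial column clustering
    betas = np.zeros((k, l, d))

    for _ in range(n_iters):
        # --- Re-estimate the linear model of each co-cluster (least squares) ---
        for g in range(k):
            for h in range(l):
                mask = (W > 0) & (rho[:, None] == g) & (gamma[None, :] == h)
                if mask.sum() >= d:                       # enough observed cells
                    betas[g, h] = np.linalg.lstsq(X[mask], Z[mask], rcond=None)[0]

        # Squared error of every observed cell under every co-cluster model.
        pred = np.einsum('uvd,ghd->ghuv', X, betas)        # shape (k, l, m, n)
        err = W * (Z - pred) ** 2

        # --- Update row clusters: least total error given the current gamma ---
        err_given_gamma = err[:, gamma, :, np.arange(n)]   # shape (n, k, m)
        rho = err_given_gamma.sum(axis=0).argmin(axis=0)

        # --- Update column clusters: least total error given the current rho ---
        err_given_rho = err[rho, :, np.arange(m), :]       # shape (m, l, n)
        gamma = err_given_rho.sum(axis=0).argmin(axis=0)

    return rho, gamma, betas

Each iteration first refits the k x l linear models and then reassigns every row and every column to the position in the co-cluster grid with the least weighted squared error, which is the alternating structure described on the slide.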

29
Simultaneous Co-clustering and Classification
  • Elements of Z are class labels (2-class problem)
  • Logistic regression model relating attributes to
    the class label
  • Log-odds modeled as a linear combination of the
    attributes (β^T x_ij)
  • Find the co-clustering (ρ, γ) and models (βs) that
    minimize the total log loss (negative log likelihood)
  • Log loss instead of squared error

30
Meta Algorithm (Classification)
  • Input Data Z, Weights W, Attributes C, P
  • Output co-clustering (?, ?), classification
    models ßsß
  • Initialize ?, ?
  • Iterate until convergence
  • Re-estimate classification model for each
    co-cluster
  • Re-estimate the co-clusters
  • Update row clusters assign each row to closest
    row cluster
  • Update col clusters assign each col to closest
    col cluster
  • (Closest in terms of least prediction error)
  • Return ?, ?, ßsß
  • Guaranteed to converge to locally optimal solution

31
Predicting Missing Values
  • If a missing matrix entry z_uv is assigned to row
    cluster g and column cluster h with model
    parameters β_gh, predict z_uv as sketched below
  • Classification
  • Regression
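
The corresponding prediction rules follow directly from the co-cluster models defined earlier (a reconstruction, since the formulas themselves are not in the transcript):

    Regression:      ẑ_uv = β_gh^T x_uv
    Classification:  P(z_uv = 1) = 1 / (1 + exp(-β_gh^T x_uv)); predict the more probable class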

32
Reduced Parameter Approach
  • Simultaneous co-clustering and prediction:
  • k x l independent models
  • (1 + |C| + |P|) x k x l parameters
  • May overfit when training data is limited
  • Single model:
  • (1 + |C| + |P|) parameters
  • May not be adequate
  • Reduced Parameter Approach:
  • k x l models, but smoothing achieved by sharing
    parameters
  • Customer (product) coefficients for all models in
    the same row (column) cluster are constrained to
    be identical
  • (1 + |C|) x k + (1 + |P|) x l parameters (one
    possible reading is sketched below)
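
One plausible reading of the parameter sharing that matches the count above: row cluster g contributes an intercept and customer coefficients, column cluster h contributes an intercept and product coefficients, and the co-cluster (g, h) model is their sum,

    ẑ_uv = β_0^(g) + (β_c^(g))^T C_u + β_0^(h) + (β_p^(h))^T P_v

giving (1 + |C|)·k + (1 + |P|)·l free parameters in total (this particular decomposition is an assumption made for illustration).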

33
Model Update Step
Update customer coefficients
34
Model Update Step
Update product coefficients
35
Recommender System Application
  • Predict unknown course choices of masters
    students
  • 32 courses, 326 students
  • Student attributes: career aspiration,
    undergraduate degree
  • Course attributes: department, evaluation score

[Figure: classification error of Model CC vs. plain co-clustering (CC), with k = 2, l = 2]
36
Results
[Figures: F-measure plot and precision-recall curve]
37
ERIM Marketing Dataset
  • Household panel data collected by A.C. Nielsen
  • 1714 customers, 121 products from 6 product
    categories (ketchup, sugar, etc.)
  • Customer-product matrix cell values: units
    purchased
  • Household attributes: income, residents, male
    head employed, female head employed, total
    visits, total expense
  • Product attributes: market share, price, times
    product was advertised
  • Predict units purchased

38
Data Sample
39
Dataset Details
  • Properties:
  • Sparse: 74.86% of values are 0
  • Very skewed: 99.12% of values < 20, rest very high
    (outliers)
  • Standardization of product attributes and units
    purchased
  • Linear least squares is very sensitive to outliers
  • Separate models for high and low valued matrix
    entries (a small preprocessing sketch follows)
  • Threshold of 20 units purchased
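
A rough illustration of the preprocessing just described (pandas-based; the column names, the preprocess function and the exact split are assumptions for illustration, not the authors' code):

import pandas as pd

def preprocess(df, product_cols, target="units_purchased", threshold=20):
    """Standardize product attributes and the target, then split the
    observed cells into low- and high-valued subsets, which are modeled
    separately."""
    df = df.copy()
    # z-score standardization of product attributes and units purchased
    for col in list(product_cols) + [target]:
        df[col + "_std"] = (df[col] - df[col].mean()) / df[col].std()
    low = df[df[target] < threshold]     # bulk of the data (~99% of cells)
    high = df[df[target] >= threshold]   # heavy outliers, modeled separately
    return low, high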

40
Results
  • Model for low-valued matrix entries
  • Bulk of the data (99.12%)

41
Market Segmentation and Structure
[Table: coefficients of the global model and sample co-cluster models; annotations mark a co-cluster with low market share, the cheapest and most popular products, and customers with high income and many store visits]
42
Lessons Learnt
  • Interpretable and actionable segmentation and
    models
  • Coefficients of co-cluster models differ from
    global model
  • Multiple models required to capture heterogeneity
  • Co-cluster models differ significantly
  • Different purchase factors important for
    different customer-product subsets
  • Product attributes more indicative of preference
  • Elimination of insignificant predictors to get
    sparse models

43
Extensions
  • Can be extended to work with other prediction
    models
  • Other applications:
  • Microarray data clustering (incorporate gene,
    experiment annotations)
  • Recommending blogs to users (matrix of blogs vs.
    users)
  • Clustering web documents annotated by link and
    semantic information
  • Analyzing survey data: simultaneous market
    segmentation and structure
  • Extensions to time-series data: each matrix
    entry is a vector of values over time (e.g.
    customer purchase behavior over time)

44
Predictive Discrete Latent Factor Models (D. Agarwal and S. Merugu '07)
  • Similar motivation and problem setting
  • Prediction of missing matrix entries (dyadic
    response variables), given attributes (covariate
    information)
  • Uses co-clustering to solve a prediction problem
  • Response variable modeled as a sum of:
  • Function of covariates (global structure)
  • Co-cluster specific constant (local structure)
  • Exploits local structure:
  • Co-cluster specific constant assumed as part of
    the noise model
  • Teased out of the global model's residuals

45
PDLF Model
  • Constrained mixture model
  • k x l components, π_IJ: mixture prior of the
    IJ-th component
  • Each component is a generalized linear model
  • f: exponential family distribution, g: link function
  • Global trend x_ij^T β shared across the components
  • Each co-cluster/latent factor has an additional
    offset δ_IJ (a sketch of the resulting mixture
    is given below)
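
Putting these bullets together, the PDLF response model has roughly this mixture form (a sketch assembled from the description above; see Agarwal and Merugu '07 for the exact formulation):

    p(z_ij | x_ij) = Σ_{I=1..k} Σ_{J=1..l} π_IJ · f( z_ij ; g^{-1}( x_ij^T β + δ_IJ ) )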

46
Model Estimation
  • Generalized EM algorithm
  • Soft vs. hard assignment
  • Main steps:
  • Random initialization of row/column clusters and
    parameters
  • Repeat till convergence:
  • Estimate global model coefficients β
    (Newton-Raphson method)
  • Estimate co-cluster offsets δ_IJ
  • Find the optimal row and column clustering
  • Scalable: each iteration is linear in the number
    of observations

47
PDLF vs. Model CC
  • PDLF:
  • Single global model + co-cluster constants
  • Robust even when data is limited
  • Model CC:
  • k x l co-cluster models
  • Works well when a large amount of data is available
  • Complementary approaches

48
Logistic Regression on MovieLens
  • Rating > 3 is the positive class; 23 covariates

[Figure: results comparing PDLF, co-clustering and logistic regression]
49
Experiments on Click Count Data
  • Dataset: 47,903 IP domains, 585 web sites
  • Attributes: IP location, routing type, etc.
  • Predict click count observations
  • PDLF model based on Poisson distributions
  • Co-cluster interactions more interesting than in
    ordinary co-clustering

[Figure: prediction results]
50
Agenda
  • Hard partitioning of both objects and features
  • Homogeneous groups of dyadic data
  • Classification
  • Regression
  • Soft partitioning of input space for regression
  • Mixture of Experts
  • Hierarchical versions
  • Fuzzy clustering and modeling
  • Output Space Partitioning for modeling a large
    number of classes

51
Soft Partitioning: Mixture of Experts Model (Jacobs and Jordan '91)
  • Prediction model is composed of a collection of
    experts specializing in different regions of
    the input space
  • Simultaneous partitioning of the input space and
    training of experts
  • Soft partitioning: multiple experts involved to
    different degrees in predicting an output
  • Uses linear models to form non-linear maps

52
Modular Decomposition in Mixtures of Experts
  • Regression setting

[Figure: mixture-of-experts architecture; K expert networks produce outputs y_1(x), ..., y_K(x), a gating network produces weights g_1(x), ..., g_K(x), and the combined output is Y(x) = Σ_i g_i(x) y_i(x)]
53
Gating and Expert Models
  • Simple models for local fits
  • Generalization of CART

[Figure: network diagrams; the expert network maps inputs x_1, ..., x_p to outputs y_1, ..., y_k through a linear output layer, and the gating network maps the same inputs to gates g_1, ..., g_k through a softmax output layer]
54
Underlying Probability Model
  • Gating network output layer: softmax
  • Ensures the gating outputs are non-negative and
    sum to 1
  • Probability model:
  • p(y | x, j) modeled as a Gaussian with mean μ_j
  • Output of expert j is μ_j
  • Maximum likelihood formulation
  • Parameters determined using gradient ascent or EM
    (a small numerical sketch follows)
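
For concreteness, a tiny NumPy sketch of the soft-partitioned prediction with linear experts and softmax gating; the function and parameter names are illustrative, not from the original.

import numpy as np

def moe_predict(x, expert_W, expert_b, gate_V, gate_c):
    """Mixture-of-experts prediction for a single input x (shape (p,)).
    expert_W: (K, p), expert_b: (K,)  -- one linear expert per row
    gate_V:   (K, p), gate_c:  (K,)   -- gating network (softmax output)
    """
    y = expert_W @ x + expert_b              # expert outputs mu_j, shape (K,)
    scores = gate_V @ x + gate_c
    g = np.exp(scores - scores.max())
    g = g / g.sum()                          # softmax: non-negative, sums to 1
    return float(g @ y)                      # Y(x) = sum_j g_j(x) * y_j(x)

# example: 2 experts in a 3-dimensional input space
x = np.array([1.0, -0.5, 2.0])
W = np.array([[0.5, 0.0, 1.0], [-1.0, 2.0, 0.0]])
b = np.array([0.1, -0.2])
V = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
c = np.zeros(2)
print(moe_predict(x, W, b, V, c))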

55
Hierarchical MoEs (Jordan and Jacobs '94)
56
Soft Partitioning Approaches in Marketing
  • Fuzzy clusterwise regression (Wedel and Steenkamp
    '89)
  • To solve the simultaneous market segmentation and
    structuring problem
  • Fuzzy co-clusters:
  • Fractional memberships for customers and products
  • Preferences modeled as a linear combination of
    product attributes
  • Minimize total squared error between actual and
    predicted preferences
  • Interpretation is difficult; the hard version
    gives block-diagonal co-clusters

57
Soft Partitioning Approaches in Marketing
  • Latent class modeling (Hofmann and Puzicha '99)
  • Problem: predict individual choice based on
    history
  • No attribute information used
  • Aspect model (sketched below):
  • Each customer-product combination assumed to be
    generated from 1 of K latent classes (clusters)
  • Two-sided clustering model:
  • Each customer belongs to 1 of k customer clusters
  • Each product belongs to 1 of l product clusters
  • Association parameter between customer and
    product clusters
  • Models estimated using EM
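
As a rough guide to the aspect model mentioned above (a standard formulation of that model; this exact equation is not on the slide), a customer u and product v co-occur through a latent class z:

    P(u, v) = Σ_{z=1..K} P(z) · P(u | z) · P(v | z)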

58
Agenda
  • Hard partitioning of both objects and features
  • Homogeneous groups of dyadic data
  • Classification
  • Regression
  • Soft partitioning of input space for regression
  • Mixture of Experts
  • Hierarchical versions
  • Fuzzy clustering and modeling
  • Output Space Partitioning for modeling a large
    number of classes

59
Addressing Multi-class Problems via Output Space
Decomposition
  • Approach: Convert a C-class problem into multiple
    binary classification problems
  • History:
  • Committee machine (Neilson '65)
  • 1 class vs. all others, followed by voting/max
  • Pairwise classification (Hastie and Tibshirani '96)
  • Combine pairwise class probabilities using a KL
    divergence based criterion
  • Error-correcting output coding (Dietterich and
    Bakiri '95)
  • Pros: the number of meta-classifiers can be smaller
  • Cons: groupings may be forced

60
Are Classes Always Orthogonal?
  • All previous approaches (one vs. all, pairwise,
    ECOC) do not exploit any relationships between
    classes
  • But often, some classes are closer to one another
    than to others!
  • Also, classes may be sub-classes/mixed classes
  • Example: hyperspectral data


61
Bolivar Peninsula
62
Dataset
63
Class Hierarchy
General Uplands
Pure Salicornia
Sand Flats
Bare Soil
High Proximal Marsh
Transition Zone
Water
Low Proximal Marsh
Pasture
High Distal Marsh
Trees
64
Hierarchical Grouping of Classes
  • Desired: a general framework for natural grouping
    of classes
  • Hierarchically divide into related
    classes/subclasses
  • Custom features
  • Top-down partitioning: solve 3 coupled problems
  • group classes into two meta-classes
  • design a feature extractor tailored for the 2
    meta-classes
  • design the 2-meta-class classifier

65
Binary Hierarchical Classifier (BHC) (S. Kumar et
al. '02)
  • N leaf nodes (classes)
  • N-1 internal nodes
  • Partition into two meta-classes
  • Node-specific design flexibility:
  • Fisher discriminant based features
  • Bayesian classifier
  • Bottom-up and top-down implementations
  • Both soft and hard versions

66
BHC Partitioning a Set of Classes
  • Three coupled problems:
  • Finding the partition
  • Determining the discriminant function (Fisher
    discriminant)
  • Estimating parameters of the Bayesian binary
    classifier
  • Solved via deterministic annealing:
  • Initialize partitions
  • For a given temperature T:
  • Compute soft meta-class parameters
  • Compute the Fisher discriminant
  • Update associations
  • Reduce T and continue (a simplified sketch of
    this loop follows)
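
A deliberately simplified sketch of the annealing loop; it summarizes each class by its mean feature vector and omits the node-specific Fisher discriminant that the full BHC re-derives at every step, so it is an assumption-laden illustration rather than the published algorithm.

import numpy as np

def anneal_partition(class_means, T0=5.0, T_min=0.05, cooling=0.8, seed=0):
    """Split C classes into two meta-classes via deterministic annealing.
    class_means: (C, d) array of per-class mean feature vectors.
    Returns soft associations a[c] in [0, 1] with the 'right' child."""
    rng = np.random.default_rng(seed)
    C = len(class_means)
    assoc = 0.5 + 0.01 * rng.standard_normal(C)   # near-uniform start, tiny noise
    T = T0
    while T > T_min:
        # soft meta-class centroids from the current associations
        w = np.clip(assoc, 1e-6, 1 - 1e-6)
        right = (w[:, None] * class_means).sum(axis=0) / w.sum()
        left = ((1 - w)[:, None] * class_means).sum(axis=0) / (1 - w).sum()
        # update associations: the closer centroid gets a higher weight,
        # with softness controlled by the temperature T
        d_r = ((class_means - right) ** 2).sum(axis=1)
        d_l = ((class_means - left) ** 2).sum(axis=1)
        assoc = 1.0 / (1.0 + np.exp(np.clip((d_r - d_l) / T, -50, 50)))
        T *= cooling                               # reduce T and continue
    return assoc

# toy example: 5 classes with 2-D mean vectors
means = np.array([[0., 0.], [0.2, 0.1], [5., 5.], [5.2, 4.9], [4.8, 5.1]])
print(np.round(anneal_partition(means), 2))   # associations near 0 or 1 give the split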

67
Evolution of Class Partitioning
[Figure: association with the right child (y-axis) vs. annealing iteration (x-axis)]
68
ECOC vs. BHC
  • ECOC techniques:
  • Error-correcting ensembles compensate for flaws
    of weak base classifiers
  • Some decision boundaries may be too difficult to
    learn (natural class affinities not exploited)
  • BHC:
  • Groups classes by natural affinities, giving
    simpler two-(meta)class problems
  • Node-specific feature extraction/selection
    techniques

69
Advantages of BHC over ECOC methods
  • Comparable performance even with a small
    percentage of training data; going beyond
    accuracy, BHC has additional advantages:
  • Eliminates the need to come up with an optimal
    code matrix
  • Yields an interpretable ensemble of classifiers
  • Can construct classifiers at varying levels of
    granularity
  • Reveals inherent class affinities and enables the
    use of adaptive feature selection techniques
  • Simpler decision boundaries
  • Facilitates knowledge transfer

70
Experimental Results (BHC vs. ECOC)
71
Sample BHC Trees
72
Overall Summary and Take-Aways
  • Divide and conquer strategy to solve difficult
    problems
  • Divide: partition
  • Conquer: learn models
  • Beneficial to interleave the divide and conquer
    phases and solve the two problems simultaneously
  • Applicable to any domain involving complex
    problems:
  • Marketing
  • Internet search/advertising
  • Recommender systems
  • Load forecasting

73
References
  • L. Lokmic and K. A. Smith. Cash flow forecasting
    using supervised and unsupervised neural
    networks. IJCNN, 2000.
  • M. Djukanovic, B. Babic, D. Sobajic and Y. Pao.
    Unsupervised/supervised learning concept for
    24-hour load forecasting. IEE Proceedings -
    Generation, Transmission and Distribution, 1993.
  • D. Agarwal and S. Merugu. Predictive discrete
    latent factor models for large scale dyadic data.
    KDD 2007.
  • M. Deodhar and J. Ghosh. A framework for
    simultaneous co-clustering and learning from
    complex data. KDD 2007.
  • S. Kumar, J. Ghosh and M. Crawford. Hierarchical
    fusion of multiple classifiers for hyperspectral
    data analysis. Pattern Analysis and Applications,
    5(2):210-220, 2002.
  • T. G. Dietterich and G. Bakiri. Solving
    multiclass learning problems via error correcting
    output codes. Journal of Artificial Intelligence
    Research, 2:263-286, 1995.
  • T. Hastie and R. Tibshirani. Classification
    by pairwise coupling. Advances in Neural
    Information Processing Systems, 10:507-513, 1998.

74
References
  • A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu and
    D. Modha. A generalized maximum entropy approach
    to Bregman co-clustering and matrix
    approximation. JMLR, 8:1919-1986, 2007.
  • L. Conde, A. Mateos, J. Herrero and J. Dopazo.
    Improved class prediction in DNA microarray gene
    expression data by unsupervised reduction of the
    dimensionality followed by supervised learning
    with a perceptron. Journal of VLSI Signal
    Processing Systems, 35:245-253, 2003.
  • X. Liu, A. Krishnan and A. Mondry. An
    entropy-based gene selection method for cancer
    classification using microarray data. BMC
    Bioinformatics, 2005.
  • R. Grover and V. Srinivasan. A Simultaneous
    Approach to Market Segmentation and Market
    Structuring. JMR, 1987.
  • M. Wedel and J. Steenkamp. A Clusterwise
    Regression Method for Simultaneous Fuzzy Market
    Structuring and Benefit Segmentation. JMR, 1991.
  • W. Moe and P. Fader. Modeling Hedonic Portfolio
    Products: A Joint Segmentation Analysis of Music
    Compact Disc Sales. JMR, 2001.

75
References
  • T. Reutterer. Competitive Market Structure and
    Segmentation Analysis with Self-Organizing
    Feature Maps. EMAC, 1998.
  • M. Jordan and R. Jacobs. Hierarchical mixtures of
    experts and the EM algorithm. Neural Computation,
    1994.
  • R. Jacobs, M. Jordan, S. Nowlan and G. Hinton.
    Adaptive mixtures of local experts. Neural
    Computation, 3:79-87, 1991.
  • T. Hofmann and J. Puzicha. Latent Class Models
    for Collaborative Filtering. Proceedings of the
    International Joint Conference on Artificial
    Intelligence (IJCAI), 1999.