Title: Learning on the Test Data: Leveraging Unseen Features
1 Learning on the Test Data: Leveraging Unseen Features
- Ben Taskar, Ming Fai Wong, Daphne Koller
2 Introduction
- Most statistical learning models assume that data instances are IID samples from some fixed distribution.
- In many cases, however, the data are collected from different sources, at different times and locations, and under different circumstances.
- We usually build a statistical model of features under the assumption that future data will exhibit the same regularities as the training data.
- In many data sets, however, there are scope-limited features whose predictive power is only applicable to a certain subset of the data.
3 Examples
- 1. Classifying news articles chronologically
  - Suppose the task is to classify news articles chronologically. New events, people and places appear (and disappear) in bursts over time.
  - The training data might consist of articles taken over some time period; these are only somewhat representative of the future articles.
  - The test data may contain features that were never observed in the training data.
- 2. Classifying customers into categories
  - Our training data might be collected from one geographical region, which may not represent the distribution in other regions.
4 We could sidestep this difficulty by mixing all the examples and selecting the training and test sets randomly. But this homogeneity cannot be ensured in real-world tasks, where only the non-representative training data is actually available for training. The test data may contain many features that were never or only rarely observed in the training data, and these features may still be useful for classification. For example, in the news article task these local features might include the names of places or people currently in the news. In the customer example, these local features might include purchases of products that are specific to a region.
5 Scoped Learning
- Suppose we want to classify news articles chronologically. The phrase "XXX said today" might appear in many places in the data, for different values of XXX.
- Such features are called scope-limited features, or local features.
- Another example
  - Suppose there are two labels, grain and trade. Words like corn or wheat often appear in the phrase "tons of wheat". So we can learn that if a word appears in the context "tons of XXX", it is likely to be associated with the label grain. If we then find a phrase like "tons of rye" in the test data, we can infer that rye has some positive interaction with the label grain.
- Scoped learning is a probabilistic framework that combines the traditional IID (global) features with scope-limited (local) features.
6 The intuitive procedure for using the local features is to use the information from the global (IID) features to infer the rules that govern the local information for a particular subset of data. When data exhibit scope, the authors found significant gains in performance over traditional models that use only IID features. All the data instances within a particular scope exhibit some structural regularity, and we assume that all the future data will exhibit the same structural regularity.
7 General Framework
- Notion of scope
  - We assume that data instances are sampled from some set of scopes, each of which is associated with some data distribution.
  - Different distributions share a probabilistic model for some set of global features, but can contain a different probabilistic model for a scope-specific set of local features.
  - These local features may be rarely or never seen in the scopes comprising the training data.
8
- Let X denote the global features, Z the local features, and Y the class variable. For each global feature X_i there is a parameter θ_i. Additionally, for each scope S and each local feature Z_i there is a parameter λ_i^S.
- The distribution of Y given all the features and weights (for a binary label y ∈ {−1, +1}) is then
  P(Y = y | x, z, θ, λ^S) ∝ exp( Σ_i θ_i x_i y + Σ_i λ_i^S z_i y )
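As a quick sanity check of this conditional form, here is a minimal Python sketch (variable names and the toy numbers are illustrative, not from the paper) that evaluates the probability of a binary label under fixed global and local weights.

```python
import numpy as np

def p_label(x, z, theta, lam, y=1):
    """Conditional probability of a binary label y in {-1, +1} for one
    instance, given global features x, local features z, global weights
    theta and scope-specific local weights lam (the log-linear form on
    this slide)."""
    score = y * (np.dot(theta, x) + np.dot(lam, z))
    # Normalize over the two possible labels, y and -y.
    return np.exp(score) / (np.exp(score) + np.exp(-score))

# Toy example: two global features, three local features.
theta = np.array([1.2, -0.5])        # learned from training data
lam = np.array([0.0, 0.8, -0.3])     # scope-specific, normally hidden
x = np.array([1.0, 1.0])
z = np.array([0.0, 1.0, 0.0])
print(p_label(x, z, theta, lam))     # P(Y = +1 | x, z, theta, lam)
```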
9 Probabilistic model
- We assume that the global weights can be learned from the training data, so their values are fixed when we encounter a new scope. The local feature weights are unknown and are treated as hidden variables in the graphical model.
- Idea
  - We use the evidence from the global features about the labels of some of the instances to modify our beliefs about the role of the local features present in those instances, so that these beliefs are consistent with the labels. By learning about the roles of these features, we can then propagate this information to improve accuracy on instances that are harder to classify using global features alone.
10
- To implement this idea, we define a joint distribution over λ^S and y_1, . . . , y_m.
- Why use Markov random fields?
  - Here the associations between the variables are correlational rather than causal. Markov random fields are used to model such spatial or mutually interacting features.
11 Markov Network
- Let V = (V_d, V_c) denote a set of random variables, where V_d and V_c are the discrete and continuous variables, respectively.
- A Markov network over V defines a joint distribution over V, assigning a density over V_c for each possible assignment v_d to V_d.
- A Markov network M is an undirected graph whose nodes correspond to V.
- It is parameterized by a set of potential functions f_1(C_1), . . . , f_l(C_l) such that each C_k ⊆ V is a fully connected subgraph, or clique, in M, i.e., every pair V_i, V_j ∈ C_k is connected by an edge in M.
- Here we assume that each f(C) is a log-quadratic function.
- The Markov network then represents the distribution
  P(v) = (1/Z) ∏_k f_k(c_k), where Z is the normalizing partition function.
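A minimal brute-force sketch of this definition, restricted to small, fully discrete networks (the formal definition above also allows continuous variables): the joint is the normalized product of clique potentials, and the example pairwise potential is log-quadratic. All names and numbers are illustrative.

```python
import itertools
import numpy as np

def mrf_distribution(potentials, domains):
    """Brute-force the joint distribution defined by a Markov network as
    the normalized product of its clique potentials. `potentials` maps a
    tuple of variable names (a clique) to a function of those variables;
    `domains` maps each variable name to its finite set of values."""
    variables = list(domains)
    table = {}
    for values in itertools.product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        weight = 1.0
        for clique, f in potentials.items():
            weight *= f(*(assignment[v] for v in clique))
        table[values] = weight
    Z = sum(table.values())                  # partition function
    return {values: w / Z for values, w in table.items()}

# Two binary variables with a single log-quadratic pairwise potential.
dist = mrf_distribution(
    potentials={("A", "B"): lambda a, b: np.exp(0.7 * a * b)},
    domains={"A": [-1, +1], "B": [-1, +1]},
)
print(dist)
```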
12
- In our case, the log-quadratic model consists of three types of potentials:
- 1) f(θ_i, Y_j, X_ij) = exp(θ_i Y_j X_ij)
  - relates each global feature X_ij (global feature i of instance j) to its weight θ_i and the class variable Y_j of the corresponding instance.
- 2) f(λ_i^S, Y_j, Z_ij) = exp(λ_i^S Y_j Z_ij)
  - relates each local feature Z_ij to its weight λ_i^S and the label Y_j.
- 3) Finally, as the local feature weights are assumed to be hidden, we introduce a prior over their values, of the form
  f(λ_i^S) ∝ exp( −(λ_i^S − μ_i)^2 / (2σ^2) )
- Overall, our model specifies a joint distribution of the form
  P(λ^S, y | x, z) ∝ ∏_i f(λ_i^S) ∏_{i,j} f(θ_i, Y_j, X_ij) ∏_{i,j} f(λ_i^S, Y_j, Z_ij)
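The same three potentials written out as a short sketch: the unnormalized joint multiplies the priors with the global and local potentials over all instances and features. Function names and the toy numbers are illustrative.

```python
import numpy as np

def f_global(theta_i, y_j, x_ij):
    # Potential relating a global feature value to its weight and the label.
    return np.exp(theta_i * y_j * x_ij)

def f_local(lam_i, y_j, z_ij):
    # Potential relating a local feature value to its hidden weight and the label.
    return np.exp(lam_i * y_j * z_ij)

def f_prior(lam_i, mu_i, sigma=1.0):
    # Gaussian prior over a hidden local feature weight.
    return np.exp(-((lam_i - mu_i) ** 2) / (2 * sigma ** 2))

def unnormalized_joint(y, lam, X, Z, theta, mu):
    """Unnormalized joint over labels y and local weights lam for one scope:
    product of priors, global potentials and local potentials."""
    value = np.prod([f_prior(l, m) for l, m in zip(lam, mu)])
    for j, y_j in enumerate(y):                     # instances
        for i, th in enumerate(theta):              # global features
            value *= f_global(th, y_j, X[j, i])
        for i, l in enumerate(lam):                 # local features
            value *= f_local(l, y_j, Z[j, i])
    return value

# Two instances, two global features, three local features (toy values).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Z = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
print(unnormalized_joint(y=[+1, -1], lam=[0.5, -0.2, 0.1],
                         X=X, Z=Z, theta=[1.0, -0.4], mu=[0.0, 0.0, 0.0]))
```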
13 Markov network for two instances, two global features and three local features (figure)
14
- The graph can be simplified further when we account for variables whose values are fixed.
- The global feature weights are learned from the training data, so their values are fixed, and we also know all the feature values.
- The resulting Markov network is shown below, assuming that the instance (x1, z1, y1) contains the local features Z_1 and Z_2, and the instance (x2, z2, y2) contains the local features Z_2 and Z_3.
- (Figure: a Markov network connecting the labels Y_1 and Y_2 to the local feature weights λ_1, λ_2, λ_3.)
15
- This can be reduced further: when Z_ij = 0 there is no interaction between Y_j and the corresponding variable λ_i.
- In this case we can simply omit the edge between λ_i and Y_j.
- The resulting Markov network is shown below.
- (Figure: Y_1 connected to λ_1 and λ_2; Y_2 connected to λ_2 and λ_3.)
16
- In this model, the labels of all the instances are correlated with the local feature weights of the features they contain, and thereby with each other. Thus, for example, if we obtain evidence (from global features) about the label Y_1, it changes our posterior beliefs about the local feature weight λ_2, which in turn changes our beliefs about the label Y_2. By running probabilistic inference over this graphical model, we therefore obtain updated beliefs both about the local feature weights and about the instance labels.
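To make this information flow concrete, here is a brute-force sketch on a tiny version of the two-instance network from the figures above: the evidence from global features is folded into a per-instance score g, the hidden local weights are put on a coarse grid, and the marginal of Y_2 is computed by summation. All numbers are illustrative.

```python
import itertools
import numpy as np

# Instance 1 contains local features Z1, Z2; instance 2 contains Z2, Z3.
g = np.array([2.0, 0.0])          # strong global evidence for Y1 = +1, none for Y2
Z = np.array([[1.0, 1.0, 0.0],    # local feature values per instance
              [0.0, 1.0, 1.0]])
grid = np.linspace(-3, 3, 25)     # coarse grid standing in for the continuous lambdas

def log_joint(y, lam):
    prior = -0.5 * np.sum(np.square(lam))             # N(0, 1) priors on the lambdas
    global_part = np.dot(g, y)                        # evidence from global features
    local_part = sum(y[j] * np.dot(lam, Z[j]) for j in range(2))
    return prior + global_part + local_part

# Brute-force posterior marginal of Y2 by summing over labels and the grid.
post = {+1: 0.0, -1: 0.0}
for y1, y2 in itertools.product([-1, +1], repeat=2):
    for lam in itertools.product(grid, repeat=3):
        post[y2] += np.exp(log_joint(np.array([y1, y2]), np.array(lam)))
total = post[+1] + post[-1]
print("P(Y2 = +1) =", post[+1] / total)   # pulled above 0.5 by the shared feature Z2
```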
17 Learning the Model
- Learning global feature weights
  - We simply learn these parameters from the training data, using standard logistic regression. Maximum-likelihood (ML) estimation finds the weights θ that maximize the conditional likelihood of the labels given the global features.
- Learning local feature distributions
  - We can exploit patterns like the "tons of XXX" example by learning a model that predicts the prior of the local feature weights using meta-features (features of features). More precisely, we learn a model that predicts the prior mean μ_i for λ_i from some set of meta-features m_i. As our predictive model for the mean μ_i we use a linear regression model, setting μ_i = w · m_i.
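A minimal sketch of these two learning steps using scikit-learn. The data, the meta-features, and the choice of regression targets for the prior means are all made up for illustration; the paper's exact construction of those targets may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical training data (shapes and names are illustrative).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))                  # global feature vectors
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)

# Step 1: global feature weights theta by standard logistic regression
# (maximum conditional likelihood of the labels given the global features).
global_model = LogisticRegression().fit(X_train, y_train)
theta = global_model.coef_.ravel()

# Step 2: a linear model mu_i = w . m_i predicting the prior mean of a
# feature's weight from its meta-features m_i. Here the targets are simply
# the weights fitted on the training features, as a stand-in for whatever
# targets the paper actually uses.
M_train = rng.normal(size=(len(theta), 3))           # one meta-feature vector per feature
meta_model = LinearRegression().fit(M_train, theta)

# At test time, the meta-features of an unseen local feature yield its prior mean.
m_new = rng.normal(size=(1, 3))
print(theta, meta_model.predict(m_new)[0])
```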
18 Using the model
- Step 1
  - Given a training set, we first learn the model. In the training set, the local and global features are treated identically. When applying the model to the test set, however, our first decision is to determine the sets of local and global features.
- Step 2
  - Our next step is to generate the Markov network for the test set. Probabilistic inference over this model infers the effect of the local features.
- Step 3
  - We use Expectation Propagation for inference. It maintains approximate beliefs (marginals) over the nodes of the Markov network and iteratively adjusts them to achieve local consistency.
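The paper uses Expectation Propagation at this step; as a simpler stand-in, the sketch below runs a Gibbs sampler over the same model for one test scope (hidden local weights with Gaussian priors, plus the test labels), which shows the same pipeline: build the model for the test scope, then run approximate inference to get label marginals. Names and numbers are illustrative.

```python
import numpy as np

def gibbs_scope_inference(X, Z, theta, mu, sigma=1.0, iters=500, seed=0):
    """Approximate inference for a test scope: alternately resample the
    hidden local weights lam and the labels y (a Gibbs stand-in for the
    paper's Expectation Propagation)."""
    rng = np.random.default_rng(seed)
    m, _ = X.shape
    y = np.ones(m)
    y_sum = np.zeros(m)
    for _ in range(iters):
        # lam_i | y: the N(mu_i, sigma^2) prior tilted by exp(lam_i * sum_j y_j z_ij)
        # is again Gaussian, with mean mu_i + sigma^2 * sum_j y_j z_ij.
        lam = rng.normal(mu + sigma ** 2 * (Z.T @ y), sigma)
        # y_j | lam: independent logistic draw per instance.
        p_pos = 1.0 / (1.0 + np.exp(-2.0 * (X @ theta + Z @ lam)))
        y = np.where(rng.random(m) < p_pos, 1.0, -1.0)
        y_sum += (y + 1) / 2
    return y_sum / iters          # approximate P(Y_j = +1) for each test instance

# Toy test scope: two instances, two global and three local features.
X = np.array([[1.0, 0.0], [0.2, 0.1]])
Z = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
print(gibbs_scope_inference(X, Z, theta=np.array([2.0, 0.0]), mu=np.zeros(3)))
```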
19 Experimental Results
- Reuters
  - The Reuters news articles data set contains a substantial number of documents hand-labeled with the categories grain, crude, trade, and money-fx.
  - Using this data set, six experimental setups are created by using all possible pairings of the four chosen categories.
  - The resulting chronological sequence is divided into nine time segments with roughly the same number of documents in each segment.
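A small sketch of how such a setup might be constructed (the field names and dummy documents are illustrative, not the actual Reuters preprocessing): one binary task per category pair, and nine roughly equal-sized time segments over the chronologically sorted articles.

```python
import numpy as np
from itertools import combinations

categories = ["grain", "crude", "trade", "money-fx"]
pairwise_tasks = list(combinations(categories, 2))     # six binary setups

def time_segments(docs, n_segments=9):
    """Split chronologically sorted documents into segments with roughly
    the same number of documents in each."""
    docs = sorted(docs, key=lambda d: d["date"])
    return np.array_split(docs, n_segments)

# Example with dummy documents.
docs = [{"date": i, "text": f"doc {i}"} for i in range(100)]
print(pairwise_tasks)
print([len(seg) for seg in time_segments(docs)])
```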
20 (No transcript)
21
- WebKB2
  - This data set consists of hand-labeled web pages from the Computer Science department web sites of four schools (Berkeley, CMU, MIT and Stanford), categorized into faculty, student, course and organization.
  - Six experimental setups are created by using all possible pairings of categories from the four categories.