Title: Domain Adaptation for Natural Language Processing
1. Domain Adaptation for Natural Language Processing
- Amogh Asgekar (06329006)
- Jeevan Chalke (06329011)
- Vinay Deshpande (06305001)
- Jubin Chheda (06305003)
2. Outline
- NLP tasks
- Types of domain adaptation
- Sample selection bias
- Structural correspondence learning
- Adaptation by feature augmentation
- Conclusion
3. Tasks in the NLP Domain
- POS tagging
  - Assign POS tags to the words in a given test corpus.
- Parsing
  - Construct a syntactic structure from a given sentence.
- Word sense disambiguation
  - Select a particular meaning of a word from various possibilities.
- Named entity recognition
  - Identify named entities (names, addresses, etc.) in a given corpus.
4. Domain Adaptation
- The tasks above are performed by learning from a corpus and then applying that knowledge to classify test instances.
- If the training and test distributions differ, the classifier tends to perform erroneously.
- In such cases, the classifier needs to be domain-adapted to perform accurately on both domains.
5. A POS Tagging Task
- Consider the following example.
- The learner has access to:
  - a labeled sample S randomly drawn from the training distribution Ps;
  - an unlabeled sample T drawn from an unknown test distribution Pt.
- The task of the learner is to predict the labels of points generated and labeled according to Pt.
6. Types of Domain Adaptation
- Analyse the causes of domain divergence and model them in the learner:
  - Sample selection bias
- Discover the divergence of the distributions during training:
  - Structural Correspondence Learning
  - Feature Augmentation Model
7. Sample Selection Bias
- What is sample selection bias?
  - Samples (x, y, s) are drawn independently from a domain (X × Y × S) with distribution D. S is a binary space: if s = 1, that instance is selected.
- Four cases of dependence of (x, y) on s:
  1. s ⊥ x and s ⊥ y (selection is independent of both)
  2. s ⊥ y | x (selection depends only on x)
  3. s ⊥ x | y (selection depends only on y)
  4. s depends on both x and y
8. Sample Selection Bias Correction
- Consider the case s ⊥ y | x, i.e. the sub-domain selection depends only on the words and not on their POS tags.
- If D is the original distribution of the domain and D' is the distribution of the selected sub-domain, then we can convert from one to the other using the multiplier
  - β(x) = Pr(s = 1) / Pr(s = 1 | x)
- Thus, D(x, y, s) = β(x) · D'(x, y, s).
- The prior probabilities Pr(s = 1) and Pr(s = 1 | x) must be known.
- Pr(s = 1 | x) must be non-zero for each x, i.e. at least one instance of each word must be selected.
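The correction above can be illustrated with a minimal Python sketch; the helper name is hypothetical, and the multiplier β(x) = Pr(s = 1) / Pr(s = 1 | x) is assumed to be computable from known selection probabilities:

```python
# Hypothetical sketch of sample-selection-bias correction by
# importance weighting, assuming Pr(s=1) and Pr(s=1|x) are known.

def bias_correction_weights(p_selected, p_selected_given_x):
    """Return beta(x) = Pr(s=1) / Pr(s=1|x) for each instance x.

    Raises if any Pr(s=1|x) is zero: the correction only exists when
    at least one instance of each word is selected.
    """
    weights = []
    for p in p_selected_given_x:
        if p == 0:
            raise ValueError("Pr(s=1|x) must be non-zero for every x")
        weights.append(p_selected / p)
    return weights

# Words over-represented in the sub-domain get weight < 1;
# under-represented words get weight > 1.
weights = bias_correction_weights(0.5, [0.25, 0.5, 1.0])
```

Training on the selected sub-domain with these instance weights approximates training on the original distribution D.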
9. Sample Selection Bias in POS Tagging
- A medical corpus is generated from general English by some sample selection procedure.
- This induces a non-uniform distribution over the instances selected from the parent distribution.
10. Application to Domain Adaptation
11. Structural Correspondence Learning
- The central idea is to find a common mapping between the two domains and use it to train a classifier that fits both domains.
- The requirements of SCL for domain adaptation are:
  - a large sample of unlabeled data from both the training and test domains;
  - a small sample of labeled data from the training domain.
12. Structural Correspondence Learning
- SCL Process, Step 1: Define pivot features
  - Pivot features are those that behave with high correlation in both domains.
  - They should be frequent enough to define correlations with non-pivot features from the same domain, and at the same time diverse enough to separately identify the different features.
  - E.g. determiners are good pivot features, since they occur frequently in any domain of written English, but choosing only determiners will not help us discriminate between nouns and adjectives.
13. Structural Correspondence Learning
14. Structural Correspondence Learning
- SCL Process, Step 2: Learn pivot predictors
  - For every pivot feature k out of the m pivot features, learn a classifier on the joint (training + test) data.
  - The classification problem is binary, of the form "Does this instance contain pivot feature k?"
  - The weights wk learnt in this task encode the covariance of the various features with pivot feature k.
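Step 2 can be sketched as follows. This is a toy illustration, not the exact learner from the SCL paper: simple logistic-regression pivot predictors trained on a random binary feature matrix, with the pivot feature itself masked out of the input so the prediction task is non-trivial:

```python
import numpy as np

# Toy sketch of SCL step 2: for each pivot feature k, train a binary
# predictor "does this instance contain pivot feature k?" on pooled
# (source + target) unlabeled data. The learned weight vectors wk
# become the columns of the pivot-predictor matrix W.

rng = np.random.default_rng(0)

def train_pivot_predictor(X, k, epochs=50, lr=0.1):
    """Logistic-regression weights predicting presence of feature k
    from all other features (feature k is zeroed in the input)."""
    y = (X[:, k] > 0).astype(float)   # pivot label for each instance
    X_masked = X.copy()
    X_masked[:, k] = 0.0              # the predictor may not see k itself
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X_masked @ w))
        w += lr * X_masked.T @ (y - p) / len(y)
    return w

# Toy binary feature matrix: 6 instances, 4 features;
# treat the first two features as the m = 2 pivots.
X = rng.integers(0, 2, size=(6, 4)).astype(float)
W = np.column_stack([train_pivot_predictor(X, k) for k in range(2)])
```

Each column of W encodes how the remaining features co-vary with one pivot, which is exactly the structure the SVD in the next step compresses.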
15. Structural Correspondence Learning
- SCL Process, Step 3: Compute an SVD of the pivot-predictor space
  - The SVD gives a lower-dimensional approximation to the pivot-predictor space.
  - If W is the matrix of pivot predictors, then W = U Σ Vᵀ is the SVD of W.
  - Let θ be the matrix whose rows are the top h left singular vectors of W.
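A minimal NumPy sketch of Step 3, assuming the pivot-predictor weight vectors have already been stacked as the columns of W (the sizes below are placeholders):

```python
import numpy as np

# Sketch of SCL step 3: SVD of the pivot-predictor matrix W
# (num_features x num_pivots); theta holds the top-h left
# singular vectors as its rows.

rng = np.random.default_rng(1)
W = rng.standard_normal((100, 20))   # e.g. 100 features, 20 pivot predictors

U, S, Vt = np.linalg.svd(W, full_matrices=False)
h = 5
theta = U[:, :h].T                   # h x num_features projection matrix
```

Because the columns of U are orthonormal, θ is an orthonormal projection onto the h directions that best capture the variance of the pivot-predictor space.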
16. Structural Correspondence Learning
- SCL Process, Step 4: Map both domains to a low-dimensional shared space
  - The rows of θ capture the variance of the pivot-predictor space as well as possible in h dimensions.
  - Thus θ·x is the desired mapping into the low-dimensional space.
  - For every instance xt, add θ·xt as new shared features.
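Step 4 then reduces to appending the projection θ·x to each instance's original feature vector; a small sketch, with a random θ standing in for the matrix obtained from the SVD in Step 3:

```python
import numpy as np

# Sketch of SCL step 4: append theta.x as new shared features.
rng = np.random.default_rng(2)
theta = rng.standard_normal((5, 100))   # stand-in for the Step 3 projection

def augment(x, theta):
    """Original features plus their low-dimensional shared projection."""
    return np.concatenate([x, theta @ x])

x = rng.standard_normal(100)
x_aug = augment(x, theta)
```

A classifier trained on the augmented vectors can exploit the shared low-dimensional features, which behave consistently across the two domains.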
17. Structural Correspondence Learning
- Visualizing the common feature space (figure): Feature Space 1 and Feature Space 2 are both mapped into a common mapped feature space.
18. Feature Augmentation: Yet Another Approach to Adaptation
- We have access to a large, annotated corpus of data from a source domain. In addition, we spend a little money to annotate a small corpus in the target domain.
19. Feature Augmentation: Yet Another Approach to Adaptation
- Appropriate exactly when one has enough target data to do slightly better than using only source data.
- Feature augmentation turns Domain Adaptation into a Standard Supervised Learning Problem.
20. Problem Formulation
- Ds: samples from the source distribution
- Dt: samples from the target distribution
- |Ds| = N, |Dt| = M, N >> M
- F: number of features of the source and target data
- W: input space; Y: output space
- W = R^F
21. The Crux
- Define:
  - the augmented input space W' = R^(3F)
  - a mapping on source data, Φs: W → W'
    - Φs(w) = ⟨w, w, O⟩
  - a mapping on target data, Φt: W → W'
    - Φt(w) = ⟨w, O, w⟩
  - where O = ⟨0, 0, ..., 0⟩ ∈ R^F
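The two mappings can be sketched directly in Python. The example instance below uses a two-feature space (is the word "the", is the word "monitor"), anticipating the POS-tagging example on the next slide:

```python
import numpy as np

# Sketch of the augmented feature maps: a source instance maps to
# <w, w, O> (general + source-specific copies), a target instance to
# <w, O, w> (general + target-specific copies), tripling the space.

def phi_source(w):
    zeros = np.zeros_like(w)
    return np.concatenate([w, w, zeros])

def phi_target(w):
    zeros = np.zeros_like(w)
    return np.concatenate([w, zeros, w])

w = np.array([1.0, 0.0])   # e.g. (word is "the", word is "monitor")
src = phi_source(w)        # general + source copies active
tgt = phi_target(w)        # general + target copies active
```

Any standard supervised learner can then be trained on the union of the transformed source and target samples; the shared first block lets it generalize across domains while the domain-specific blocks absorb domain-specific behavior.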
22. Example: POS Tagging
- Source, S: Wall Street Journal (WSJ)
- Target, T: hardware reviews from AnandTech (AT)
- "monitor" appears more often as a noun in AT than in WSJ.
- Consider W = R^2:
  - w1: the word is "the"
  - w2: the word is "monitor"
23. Example
- In the augmented space W':
  - w1, w2 are the general versions;
  - w3, w4 are the source-specific versions;
  - w5, w6 are the target-specific versions.
- The transformed source data plus a small amount of labeled transformed target data is then used to train a classifier.
- The trained classifier can now be used to label unlabeled instances.
24. Results Reported
25. Toy Example: Demo in Weka
- Data set:
  - Source, S: WSJ; Target, T: AT
  - W = (w1, w2)
    - w1: "the"
    - w2: "monitor"
  - |Ds| = N = 2765
  - |Dt| = M = 172
  - Test data size, L = 29
- RBF network learning
26. Demo Results
- The inclusion of more advanced features would improve the accuracy even further.
27. Theory Involved
- Theorem
  - If εs is the minimum error of a classifier on Ds and εt is the minimum error on Dt, then the error on the target domain is bounded in terms of εs, εt, and the distance between the two distributions.
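A hedged reconstruction of the bound in LaTeX, following the Ben-David et al. (2007) analysis cited in the references; the divergence term and the form of λ are that paper's notation and an assumption here, not taken from the slide:

```latex
% Target error of a hypothesis h is bounded by its source error,
% a divergence term between the two domains' distributions, and a
% joint-error term lambda combining the minimum errors on each domain.
\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; d_{\mathcal{H}}(D_s, D_t) \;+\; \lambda,
\qquad
\lambda = \min_{h'} \big( \epsilon_S(h') + \epsilon_T(h') \big)
```

The slide's εs and εt correspond to the two terms inside λ, which is small exactly when a single classifier can do well on both domains.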
28. How This Technique Is Useful
- Simple and easy to implement as a pre-processing step
- Outperforms several state-of-the-art approaches
- Several theoretical guarantees
- Extensible to multi-domain adaptation
29. Conclusion
- Mapping to a common domain as a method of adaptation shows better results than derived domains.
- Adaptation across arbitrary domains is impossible; domain knowledge is key to performing domain adaptation.
- The upper bound on the target-domain error depends only on the distance between the two domains.
- Algorithms that directly minimize the empirical loss as a function of the distance between the two domains, instead of heuristic estimates, will perform better.
30. References
- R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
- Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 20, Cambridge, MA, 2007. MIT Press.
- J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2006.
- J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data.
- D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings of the 30th VLDB Conference, Toronto, Canada, pages 180–191, 2004.
- Xiao Li and Jeff Bilmes. A Bayesian divergence prior for classifier adaptation. In Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007), 2007.
- B. Zadrozny. Learning and evaluating classifiers under sample selection bias. In ACM International Conference Proceeding Series, 2004.
- Hal Daumé III. Frustratingly easy domain adaptation. In Conference of the Association for Computational Linguistics (ACL), Prague, Czech Republic, 2007.