Title: Domain Adaptation for Natural Language Processing
1. Domain Adaptation for Natural Language Processing
- Amogh Asgekar (06329006)
- Jeevan Chalke (06329011)
- Vinay Deshpande (06305001)
- Jubin Chheda (06305003)
2. Outline
- NLP tasks
- Types of domain adaptation
- Sample selection bias
- Structural correspondence learning
- Adaptation by feature augmentation
- Conclusion
3. Tasks in the NLP Domain
- POS tagging
  - Assign POS tags to the words in a given test corpus.
- Parsing
  - Construct a syntactic structure from a given sentence.
- Word sense disambiguation
  - Select a particular meaning of a word from various possibilities.
- Named entity recognition
  - Identify named entities (names, addresses, etc.) in a given corpus.
4. Domain Adaptation
- The tasks above are performed by learning from a corpus and then applying that knowledge to classify test instances.
- If the training and test distributions differ, the classifier tends to perform erroneously.
- In such cases, the classifier needs to be domain-adapted to perform accurately on both domains.
5. A POS Tagging Task
- Consider the following example.
- The learner has access to:
  - a labeled sample S randomly drawn from the training distribution Ps;
  - an unlabeled sample T drawn from an unknown test distribution Pt.
- The task of the learner is to predict the labels of points generated and labeled according to Pt.
6. Types of Domain Adaptation
- Analyse the causes of domain divergence and model them in the learner:
  - Sample selection bias
- Discover the divergence of the distributions during training:
  - Structural Correspondence Learning
  - Feature Augmentation Model
7. Sample Selection Bias
- What is sample selection bias?
  - Samples (x, y, s) are drawn independently from a domain (X × Y × S) with distribution D. S is a binary space: if s = 1, that instance is selected.
- Four cases of dependence of (x, y) on s:
  1. s ⊥ x and s ⊥ y (selection is independent of both)
  2. s ⊥ y | x (selection depends only on x)
  3. s ⊥ x | y (selection depends only on y)
  4. s depends on both x and y
8. Sample Selection Bias Correction
- Consider the case s ⊥ y | x, i.e. the sub-domain selection depends only on the words and not on their POS tags.
- If D is the original distribution of the domain and D' is the distribution of the selected sub-domain, then we can convert from one to the other using the multiplier
  - β(x) = Pr(s = 1) / Pr(s = 1 | x)
- Thus, D(x, y, s) = β(x) · D'(x, y, s).
- The prior probabilities Pr(s = 1) and Pr(s = 1 | x) must be known.
- Pr(s = 1 | x) must be non-zero for each x, i.e. at least one instance of each word must be selected.
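The correction above can be illustrated with a minimal Python sketch; the helper name is hypothetical, and the multiplier β(x) = Pr(s = 1) / Pr(s = 1 | x) is assumed to be computable from known selection probabilities:

```python
# Hypothetical sketch of sample-selection-bias correction by
# importance weighting, assuming Pr(s=1) and Pr(s=1|x) are known.

def bias_correction_weights(p_selected, p_selected_given_x):
    """Return beta(x) = Pr(s=1) / Pr(s=1|x) for each instance x.

    Raises if any Pr(s=1|x) is zero: the correction only exists when
    at least one instance of each word is selected.
    """
    weights = []
    for p in p_selected_given_x:
        if p == 0:
            raise ValueError("Pr(s=1|x) must be non-zero for every x")
        weights.append(p_selected / p)
    return weights

# Words over-represented in the sub-domain get weight < 1;
# under-represented words get weight > 1.
weights = bias_correction_weights(0.5, [0.25, 0.5, 1.0])
```

Training on the selected sub-domain with these instance weights approximates training on the original distribution D.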
9. Sample Selection Bias in POS Tagging
- A medical corpus is generated from general English by some sample selection procedure.
- This induces a non-uniform distribution over the instances selected from the parent distribution.
10. Application to Domain Adaptation
11. Structural Correspondence Learning
- The central idea is to find a common mapping between the two domains and use it to train a classifier that fits both domains.
- The requirements of SCL for domain adaptation are:
  - a large sample of unlabeled data from both the training and test domains;
  - a small sample of labeled data from the training domain.
12. Structural Correspondence Learning
- SCL Process, Step 1: Define pivot features
  - Pivot features are those that behave with high correlation in both domains.
  - They should be frequent enough to define correlations with non-pivot features from the same domain, and at the same time diverse enough to separately identify the different features.
  - E.g. determiners are good pivot features, since they occur frequently in any domain of written English, but choosing only determiners will not help us discriminate between nouns and adjectives.
13. Structural Correspondence Learning
14. Structural Correspondence Learning
- SCL Process, Step 2: Learn pivot predictors
  - For every pivot feature k out of the m pivot features, learn a classifier on the joint (training + test) data.
  - The classification problem is binary, of the form "Does this instance contain pivot feature k?"
  - The weights wk learnt in this task encode the covariance of the various features with pivot feature k.
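Step 2 can be sketched as follows. This is a toy illustration, not the exact learner from the SCL paper: simple logistic-regression pivot predictors trained on a random binary feature matrix, with the pivot feature itself masked out of the input so the prediction task is non-trivial:

```python
import numpy as np

# Toy sketch of SCL step 2: for each pivot feature k, train a binary
# predictor "does this instance contain pivot feature k?" on pooled
# (source + target) unlabeled data. The learned weight vectors wk
# become the columns of the pivot-predictor matrix W.

rng = np.random.default_rng(0)

def train_pivot_predictor(X, k, epochs=50, lr=0.1):
    """Logistic-regression weights predicting presence of feature k
    from all other features (feature k is zeroed in the input)."""
    y = (X[:, k] > 0).astype(float)   # pivot label for each instance
    X_masked = X.copy()
    X_masked[:, k] = 0.0              # the predictor may not see k itself
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X_masked @ w))
        w += lr * X_masked.T @ (y - p) / len(y)
    return w

# Toy binary feature matrix: 6 instances, 4 features;
# treat the first two features as the m = 2 pivots.
X = rng.integers(0, 2, size=(6, 4)).astype(float)
W = np.column_stack([train_pivot_predictor(X, k) for k in range(2)])
```

Each column of W encodes how the remaining features co-vary with one pivot, which is exactly the structure the SVD in the next step compresses.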
15. Structural Correspondence Learning
- SCL Process, Step 3: Compute an SVD of the pivot-predictor space
  - The SVD gives a lower-dimensional approximation to the pivot-predictor space.
  - If W is the matrix of pivot predictors, then W = U Σ Vᵀ is the SVD of W.
  - Let θ be the matrix whose rows are the top h left singular vectors of W.
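A minimal NumPy sketch of Step 3, assuming the pivot-predictor weight vectors have already been stacked as the columns of W (the sizes below are placeholders):

```python
import numpy as np

# Sketch of SCL step 3: SVD of the pivot-predictor matrix W
# (num_features x num_pivots); theta holds the top-h left
# singular vectors as its rows.

rng = np.random.default_rng(1)
W = rng.standard_normal((100, 20))   # e.g. 100 features, 20 pivot predictors

U, S, Vt = np.linalg.svd(W, full_matrices=False)
h = 5
theta = U[:, :h].T                   # h x num_features projection matrix
```

Because the columns of U are orthonormal, θ is an orthonormal projection onto the h directions that best capture the variance of the pivot-predictor space.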
16. Structural Correspondence Learning
- SCL Process, Step 4: Map both domains to a low-dimensional shared space
  - The rows of θ capture the variance of the pivot-predictor space as well as possible in h dimensions.
  - Thus θ·x is the desired mapping into the low-dimensional space.
  - For every instance xt, add θ·xt as new shared features.
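Step 4 then reduces to appending the projection θ·x to each instance's original feature vector; a small sketch, with a random θ standing in for the matrix obtained from the SVD in Step 3:

```python
import numpy as np

# Sketch of SCL step 4: append theta.x as new shared features.
rng = np.random.default_rng(2)
theta = rng.standard_normal((5, 100))   # stand-in for the Step 3 projection

def augment(x, theta):
    """Original features plus their low-dimensional shared projection."""
    return np.concatenate([x, theta @ x])

x = rng.standard_normal(100)
x_aug = augment(x, theta)
```

A classifier trained on the augmented vectors can exploit the shared low-dimensional features, which behave consistently across the two domains.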
17. Structural Correspondence Learning
- Visualizing the common feature space (figure): Feature Space 1 and Feature Space 2 are both mapped into a common mapped feature space.
18. Feature Augmentation: Yet Another Approach to Adaptation
- We have access to a large, annotated corpus of data from a source domain. In addition, we spend a little money to annotate a small corpus in the target domain.
19. Feature Augmentation: Yet Another Approach to Adaptation
- Appropriate exactly when one has enough target data to do slightly better than using only source data.
- Feature augmentation turns Domain Adaptation into a Standard Supervised Learning Problem.
20. Problem Formulation
- Ds: samples from the source distribution
- Dt: samples from the target distribution
- |Ds| = N, |Dt| = M, N >> M
- F: number of features of the source and target data
- W: input space; Y: output space
- W = R^F
21. The Crux
- Define:
  - the augmented input space W' = R^(3F)
  - a mapping on source data, Φs: W → W'
    - Φs(w) = ⟨w, w, O⟩
  - a mapping on target data, Φt: W → W'
    - Φt(w) = ⟨w, O, w⟩
  - where O = ⟨0, 0, ..., 0⟩ ∈ R^F
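The two mappings can be sketched directly in Python. The example instance below uses a two-feature space (is the word "the", is the word "monitor"), anticipating the POS-tagging example on the next slide:

```python
import numpy as np

# Sketch of the augmented feature maps: a source instance maps to
# <w, w, O> (general + source-specific copies), a target instance to
# <w, O, w> (general + target-specific copies), tripling the space.

def phi_source(w):
    zeros = np.zeros_like(w)
    return np.concatenate([w, w, zeros])

def phi_target(w):
    zeros = np.zeros_like(w)
    return np.concatenate([w, zeros, w])

w = np.array([1.0, 0.0])   # e.g. (word is "the", word is "monitor")
src = phi_source(w)        # general + source copies active
tgt = phi_target(w)        # general + target copies active
```

Any standard supervised learner can then be trained on the union of the transformed source and target samples; the shared first block lets it generalize across domains while the domain-specific blocks absorb domain-specific behavior.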
22. Example: POS Tagging
- Source, S: Wall Street Journal (WSJ)
- Target, T: hardware reviews from AnandTech (AT)
- "monitor" appears more often as a noun in AT than in WSJ.
- Consider W = R^2:
  - w1: the word is "the"
  - w2: the word is "monitor"
23. Example
- In the augmented space W':
  - w1, w2 are the general versions;
  - w3, w4 are the source-specific versions;
  - w5, w6 are the target-specific versions.
- The transformed source data plus a small amount of labeled transformed target data is then used to train a classifier.
- The trained classifier can now be used to label unlabeled instances.
24. Results Reported
25. Toy Example: Demo in Weka
- Data set:
  - Source, S: WSJ; Target, T: AT
  - W = (w1, w2)
    - w1: "the"
    - w2: "monitor"
  - |Ds| = N = 2765
  - |Dt| = M = 172
  - Test data size, L = 29
- RBF network learning
26. Demo Results
- The inclusion of more advanced features would improve the accuracy even further.
27. Theory Involved
- Theorem
  - If εs is the minimum error of a classifier on Ds and εt is the minimum error on Dt, then the error on the target domain is bounded in terms of εs, εt, and the distance between the two distributions.
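A hedged reconstruction of the bound in LaTeX, following the Ben-David et al. (2007) analysis cited in the references; the divergence term and the form of λ are that paper's notation and an assumption here, not taken from the slide:

```latex
% Target error of a hypothesis h is bounded by its source error,
% a divergence term between the two domains' distributions, and a
% joint-error term lambda combining the minimum errors on each domain.
\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; d_{\mathcal{H}}(D_s, D_t) \;+\; \lambda,
\qquad
\lambda = \min_{h'} \big( \epsilon_S(h') + \epsilon_T(h') \big)
```

The slide's εs and εt correspond to the two terms inside λ, which is small exactly when a single classifier can do well on both domains.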
28. How This Technique Is Useful
- Simple and easy to implement as a pre-processing step
- Outperforms several state-of-the-art approaches
- Several theoretical guarantees
- Extensible to multi-domain adaptation
29. Conclusion
- Mapping to a common domain as a method of adaptation shows better results than derived domains.
- Adaptation across arbitrary domains is impossible; domain knowledge is key to performing domain adaptation.
- The upper bound on the target-domain error depends only on the distance between the two domains.
- Algorithms that directly minimize the empirical loss as a function of the distance between the two domains, instead of heuristic estimates, will perform better.
30. References
- R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
- Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 20, Cambridge, MA, 2007. MIT Press.
- J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2006.
- J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data.
- D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings of the 30th VLDB Conference, Toronto, Canada, pages 180–191, 2004.
- Xiao Li and Jeff Bilmes. A Bayesian divergence prior for classifier adaptation. In Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007), 2007.
- B. Zadrozny. Learning and evaluating classifiers under sample selection bias. In ACM International Conference Proceeding Series, 2004.
- Hal Daumé III. Frustratingly easy domain adaptation. In Conference of the Association for Computational Linguistics (ACL), Prague, Czech Republic, 2007.