
Domain Adaptation for Natural Language Processing


  • Amogh Asgekar (06329006)
  • Jeevan Chalke (06329011)
  • Vinay Deshpande (06305001)
  • Jubin Chheda (06305003)

Outline
  • NLP tasks
  • Types of domain adaptation
  • Sample selection bias
  • Structural correspondence learning
  • Adaptation by feature augmentation
  • Conclusion

Tasks in NLP domain
  • POS tagging: assign POS tags to the words in a
    given text.
  • Parsing: construct a structure out of the given
    sentence.
  • Word sense disambiguation: select a particular
    meaning of a word from various possibilities.
  • Named entity recognition: identify named entities
    (names, addresses, etc.) in a given corpus.

Domain Adaptation
  • The tasks above are performed by learning from a
    corpus and then applying that knowledge to
    classify the test instances.
  • If the training and test distributions differ,
    the classifier tends to make more errors on the
    test data.
  • In such cases, the classifier needs to be domain
    adapted to perform accurately on both
    domains.

A POS tagging task
  • Consider the following setting:
  • The learner has access to
  • Labeled data S, randomly sampled from the training
    distribution $P_S$.
  • An unlabelled sample T, drawn from an unknown test
    distribution $P_T$.
  • The task of the learner is to predict the labels of
    points generated and labeled according to $P_T$.

Types of Domain Adaptation
  • Analyse the causes of domain divergence and
    model them in the learner:
  • Sample selection bias
  • Discover the divergence of the distributions
    during training:
  • Structural Correspondence Learning
  • Feature Augmentation Model

Sample Selection Bias
  • What is Sample Selection Bias?
  • Samples (x, y, s) are drawn independently from a
    domain $X \times Y \times S$ with distribution D. S is a
    binary space: if $s = 1$, the instance is selected.
  • Four cases of dependence of s on (x, y):
  • 1. $s \perp x$ and $s \perp y$ (s is independent of both)
  • 2. $s \perp y \mid x$ (given x, s is independent of y)
  • 3. $s \perp x \mid y$ (given y, s is independent of x)
  • 4. s depends on both x and y

Sample Selection Bias Correction
  • Case $s \perp y \mid x$, i.e. the sub-domain selection
    depends only on the words and not on their tags.
  • If D is the original distribution of the domain
    and D' is the distribution of the selected
    sub-domain, we can convert from one to the other
    using a multiplier $\beta(x)$.
  • Thus, $D'(x, y, s) = \beta(x)\, D(x, y, s)$.
  • The probabilities $\Pr(s = 1)$ and $\Pr(s = 1 \mid x)$
    must be known.
  • $\Pr(s = 1 \mid x)$ must be non-zero for each x, i.e. at
    least one instance of each word should be selected.
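
The correction above can be illustrated with a minimal sketch. It assumes that each selected instance is weighted by Pr(s=1) / Pr(s=1 | x), the ratio used in Zadrozny's correction for this case. The word probabilities and the function names (`selection_weights`, `selected_distribution`) are made up for illustration and do not come from the slides.

```python
def selection_weights(p_x, p_sel_given_x):
    """Return the weight Pr(s=1) / Pr(s=1 | x) for every x."""
    if any(p_sel_given_x[x] <= 0 for x in p_x):
        raise ValueError("Pr(s=1 | x) must be non-zero for every x")
    # Pr(s=1) = sum over x of Pr(x) * Pr(s=1 | x)
    p_sel = sum(p_x[x] * p_sel_given_x[x] for x in p_x)
    return {x: p_sel / p_sel_given_x[x] for x in p_x}


def selected_distribution(p_x, p_sel_given_x):
    """Distribution of x among the selected instances, Pr(x | s=1)."""
    p_sel = sum(p_x[x] * p_sel_given_x[x] for x in p_x)
    return {x: p_x[x] * p_sel_given_x[x] / p_sel for x in p_x}


# Hypothetical numbers: word frequencies in the original domain and the
# chance that an instance of each word ends up in the selected corpus.
p_x = {"the": 0.5, "monitor": 0.3, "heart": 0.2}
p_sel_given_x = {"the": 0.5, "monitor": 0.1, "heart": 0.9}

weights = selection_weights(p_x, p_sel_given_x)
biased = selected_distribution(p_x, p_sel_given_x)

# Re-weighting the biased distribution gives back the original one.
recovered = {x: weights[x] * biased[x] for x in p_x}
```

In this toy setting, `recovered` matches `p_x` up to floating-point error. The sketch also shows why Pr(s=1 | x) must be non-zero: the weight is undefined for a word that can never be selected.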

Sample Selection Bias in POS-Tagging
  • A medical corpus is generated from the English
    language by some sample selection procedure.
  • This makes the selection of instances from the
    parent domain non-uniform.

Application to Domain Adaptation

Structural Correspondence Learning
  • The central idea is to find a common mapping between
    the two domains and use it to train a classifier
    that fits both domains.
  • Requirements of SCL for domain adaptation:
  • A large sample of unlabelled data from the training
    and test domains.
  • A small sample of labelled data from the training
    domain.

Structural Correspondence Learning
  • SCL Process Step 1
  • Define pivot features
  • Pivot features are those that behave in the same
    way in both domains.
  • They should be frequent enough to define
    correlations with non-pivot features from the
    same domain, and at the same time diverse enough
    to separately identify the different features.
  • E.g. determiners are good pivot features, since
    they occur frequently in any domain of written
    English, but choosing only determiners will not
    help us to discriminate between nouns and
    adjectives.

Structural Correspondence Learning
  • SCL Process Step 2
  • Learn pivot predictors
  • For every pivot feature k out of m pivot
    features, learn a classifier on the joint data
    (training + test).
  • Each classification problem is binary, of the form
    "Does this instance contain pivot feature k?"
  • The weights $w_k$ learnt in this task encode
    the covariance of the various features with the
    pivot feature k.

Structural Correspondence Learning
  • SCL Process Step 3
  • Compute an SVD of the pivot predictor space.
  • An SVD gives a lower-dimensional
    approximation to the pivot predictor space.
  • If W is the matrix of pivot predictors, then
    $W = U D V^T$ is the SVD of W.
  • Consider $\theta = U^T_{[1:h,:]}$, the matrix whose rows are
    the top h left singular vectors of W.

Structural Correspondence Learning
  • SCL Process Step 4
  • Construct a mapping of both domains to a
    low-dimensional shared space.
  • The rows of $\theta$ capture the variance of the pivot
    predictor space as well as possible in h
    dimensions.
  • Thus we get $\theta X$ as the desired mapping into the
    low-dimensional space.
  • For every instance $x_t$, add $\theta x_t$ as new shared
    features.
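
Steps 2 to 4 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the pivot predictors are fitted with ridge regression as a simple stand-in for a linear classifier, the pivot columns are hidden from the predictors, and the data is random. The function names (`scl_projection`, `add_shared_features`) and the `ridge` parameter are hypothetical.

```python
import numpy as np


def scl_projection(X_unlabeled, pivot_idx, h, ridge=1.0):
    """Return theta (h x d) learnt from unlabeled data of both domains."""
    X = np.asarray(X_unlabeled, dtype=float)
    d = X.shape[1]

    # Assumption: hide the pivot columns so that each predictor has to
    # explain a pivot from the remaining (non-pivot) features.
    keep = np.ones(d)
    keep[pivot_idx] = 0.0
    X_masked = X * keep

    # Step 2: one binary problem per pivot,
    # "does this instance contain pivot feature k?" (+1 / -1 targets).
    Y = np.where(X[:, pivot_idx] > 0, 1.0, -1.0)

    # Ridge regression stands in for the linear classifier here.
    # Column k of W holds the weight vector w_k (W is d x m).
    A = X_masked.T @ X_masked + ridge * np.eye(d)
    W = np.linalg.solve(A, X_masked.T @ Y)

    # Step 3: SVD of the pivot predictor matrix, W = U D V^T.
    U, _, _ = np.linalg.svd(W, full_matrices=False)

    # Rows of theta are the top h left singular vectors of W.
    return U[:, :h].T


def add_shared_features(X, theta):
    """Step 4: append theta . x to every instance x."""
    X = np.asarray(X, dtype=float)
    return np.hstack([X, X @ theta.T])


# Random binary "bag of features" data standing in for both domains.
rng = np.random.default_rng(0)
X = (rng.random((40, 8)) < 0.3).astype(float)

theta = scl_projection(X, pivot_idx=[0, 1, 2], h=2)
X_aug = add_shared_features(X, theta)
```

Here `theta` has shape (2, 8) and `X_aug` has the original 8 columns plus 2 shared ones. A classifier trained on the augmented labeled source data can then be applied to target instances augmented in the same way.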

Structural Correspondence Learning
  • Visualizing the common feature space
    [Figure: Feature Space 1 and Feature Space 2, both
    mapped to a common mapped feature space]

Feature Augmentation: yet another approach to domain adaptation
  • We have access to a large, annotated corpus of data
    from a source domain. In addition, we spend a
    little money to annotate a small corpus in the
    target domain.

Feature Augmentation: yet another approach to domain adaptation
  • Appropriate exactly in the case when one has
    enough target data to do slightly better than
    just using only source data.
  • Domain Adaptation => Standard Supervised Learning

Problem formulation
  • $D^s$: samples from the source distribution
  • $D^t$: samples from the target distribution
  • $|D^s| = N$, $|D^t| = M$, $N \gg M$
  • F: number of features of the source and target data
  • W: input space, Y: output space
  • $W = \mathbb{R}^F$

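The crux
  • Define
  • Augmented input space $W' = \mathbb{R}^{3F}$
  • Mapping on source data $\Phi^s : W \to W'$
  • $\Phi^s(w) = \langle w, w, \mathbf{0} \rangle$
  • Mapping on target data $\Phi^t : W \to W'$
  • $\Phi^t(w) = \langle w, \mathbf{0}, w \rangle$
  • where $\mathbf{0} = \langle 0, 0, \ldots, 0 \rangle \in \mathbb{R}^F$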
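
The two mappings amount to copying each feature vector into three blocks (general, source-specific, target-specific) and zeroing the block that belongs to the other domain. A minimal sketch in plain Python follows; the function names and the sample vector are chosen here for illustration only.

```python
def augment_source(w):
    """Source mapping: <w, w, 0> (general, source-specific, zeros)."""
    w = list(w)
    return w + w + [0.0] * len(w)


def augment_target(w):
    """Target mapping: <w, 0, w> (general, zeros, target-specific)."""
    w = list(w)
    return w + [0.0] * len(w) + w


# A made-up feature vector with F = 3 features.
w = [1.0, 0.0, 2.0]
source_row = augment_source(w)  # length 3F = 9
target_row = augment_target(w)  # length 3F = 9
```

Because the mapping is only a pre-processing step, the augmented rows can be passed to any standard supervised learner without changing the learner itself.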

Example: POS tagging
  • Source, S: Wall Street Journal (WSJ)
  • Target, T: hardware reviews, AnandTech (AT)
  • "monitor" appears more often as a noun in AT than
    in WSJ.
  • Consider $W = \mathbb{R}^2$
  • $w_1$: the word is "the"
  • $w_2$: the word is "monitor"

  • In $W' = \mathbb{R}^6$
  • $w_1$ and $w_2$ will be general versions
  • $w_3$ and $w_4$ will be source-specific versions
  • $w_5$ and $w_6$ will be target-specific versions
  • The transformed source data plus a small amount of
    labeled, transformed target data are then used to
    train a classifier.
  • The trained classifier can now be used for
    labeling unlabeled instances.
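
As a worked instance of the example above, take a token whose word is "monitor", so that its original feature vector is $(w_1, w_2) = (0, 1)$. This assumes the block order general, source-specific, target-specific, and writes $\Phi^s$ and $\Phi^t$ for the source and target mappings:

```latex
\begin{align*}
\text{token from WSJ (source):} \quad
  \Phi^s(0, 1) &= (\underbrace{0, 1}_{w_1, w_2},\;
                   \underbrace{0, 1}_{w_3, w_4},\;
                   \underbrace{0, 0}_{w_5, w_6}) \\
\text{token from AT (target):} \quad
  \Phi^t(0, 1) &= (\underbrace{0, 1}_{w_1, w_2},\;
                   \underbrace{0, 0}_{w_3, w_4},\;
                   \underbrace{0, 1}_{w_5, w_6})
\end{align*}
```

A linear classifier trained on such vectors can put the evidence shared by both domains on $w_2$, while $w_4$ and $w_6$ are free to capture how "monitor" behaves in WSJ and in AT respectively.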

Results reported

Toy Example Demo (Weka)
  • Data set
  • Source, S: WSJ; Target, T: AT
  • $W = (w_1, w_2)$
  • $w_1$: "the"
  • $w_2$: "monitor"
  • $|D^s| = N = 2765$
  • $|D^t| = M = 172$
  • Test data size, L = 29
  • RBF network learning

Demo Results
The inclusion of advanced features would improve
the accuracy even more.

Theory Involved
  • Theorem
  • If $\epsilon_s$ is the minimum error of the classifier on
    $D^s$ and $\epsilon_t$ is the minimum error on $D^t$, then the error
    in D will be

How this technique is useful
  • Simple and easy to implement as a pre-processing step
  • Outperforms several state-of-the-art approaches
  • Several theoretical guarantees
  • Extensible to multi-domain adaptation

Conclusion
  • Mapping to a common domain as a method of
    adaptation shows better results than derived
  • Adaptation on arbitrary domains is impossible.
    Domain knowledge is key to performing domain
    adaptation.
  • The upper bound on the target domain error depends
    only on the distance between the two domains.
  • Algorithms that directly minimize the empirical
    loss as a function of the distance between the two
    domains, instead of heuristic estimates, will
    perform better.

References
  • R. K. Ando and T. Zhang. A framework for learning
    predictive structures from multiple tasks and
    unlabeled data. Journal of Machine Learning
    Research, 6:1817-1853, 2005.
  • S. Ben-David, J. Blitzer, K. Crammer, and
    F. Pereira. Analysis of representations for
    domain adaptation. In Advances in Neural
    Information Processing Systems 20, Cambridge, MA,
    2007. MIT Press.
  • J. Blitzer, R. McDonald, and F. Pereira. Domain
    adaptation with structural correspondence
    learning. In Proceedings of the Conference on
    Empirical Methods in Natural Language Processing
    (EMNLP), 2006.
  • J. Huang, A. Smola, A. Gretton, K. Borgwardt, and
    B. Schölkopf. Correcting sample selection bias by
    unlabeled data.
  • D. Kifer, S. Ben-David, and J. Gehrke. Detecting
    change in data streams. In Proceedings of the 30th
    VLDB Conference, Toronto, Canada, pages 180-191,
    2004.
  • X. Li and J. Bilmes. A Bayesian divergence
    prior for classifier adaptation. In Eleventh
    International Conference on Artificial
    Intelligence and Statistics (AISTATS-2007), 2007.
  • B. Zadrozny. Learning and evaluating classifiers
    under sample selection bias. ACM International
    Conference Proceeding Series, 2004.
  • H. Daumé III. Frustratingly easy domain
    adaptation. In Conference of the Association for
    Computational Linguistics (ACL), Prague, Czech
    Republic, 2007.