Importance of Semantic Representation: Dataless Classification



1
Importance of Semantic Representation: Dataless Classification
  • Ming-Wei Chang, Lev Ratinov, Dan Roth, Vivek Srikumar
  • University of Illinois, Urbana-Champaign

2
Text Categorization
  • Classify the following sentence
  • Syd Millar was the chairman of the International
    Rugby Board in 2003.
  • Pick a label
  • Class1 vs. Class2
  • Traditionally, we need annotated data to train a
    classifier

3
Text Categorization
  • Humans don't seem to need labeled data
  • Syd Millar was the chairman of the International
    Rugby Board in 2003.
  • Pick a label
  • Sports vs. Finance
  • Label names carry a lot of information!

4
Text Categorization
  • Do we really always need labeled data?

5
Contributions
  • We can often go quite far without annotated data
  • if we know the meaning of text
  • This works for text categorization
  • ...and is consistent across different domains

6
Outline
  • Semantic Representation
  • On-the-fly Classification
  • Datasets
  • Exploiting unlabeled data
  • Robustness to different domains

8
Semantic Representation
  • One common representation is the Bag of Words
    representation
  • All text is a vector in the space of words.
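
A minimal sketch of the bag-of-words idea, assuming a simple lowercase word tokenizer (the function name `bag_of_words` and the tokenization rule are illustrative choices, not taken from the talk). Each distinct word is one dimension and its count is the value.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map text to a sparse vector: one dimension per word, valued by its count."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

doc = bag_of_words(
    "Syd Millar was the chairman of the International Rugby Board in 2003."
)
label = bag_of_words("Sports")

# Words the document and the label name have in common (none in this case).
shared = set(doc) & set(label)
```

In this sketch the sentence and the label name "Sports" share no word, so their vectors do not overlap in word space.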

9
Semantic Representation
  • Explicit Semantic Analysis
  • Gabrilovich & Markovitch, 2006, 2007
  • Text is a vector in the space of concepts
  • Concepts are defined by Wikipedia articles
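
A rough sketch of a concept-space representation in the spirit of ESA, under loud assumptions: the three "articles" below are made-up keyword lists standing in for Wikipedia articles, and the TF-IDF weighting and the names `build_index` and `esa_vector` are illustrative choices rather than the authors' implementation. Each word gets a weight for every concept whose article contains it, and a text is the sum of its words' concept weights.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for Wikipedia articles (title -> keywords); not real article text.
TOY_ARTICLES = {
    "Rugby union": "rugby union team sport oval ball international board",
    "Ice hockey": "ice hockey team sport puck stick rink",
    "Stock market": "stock market shares investors trading money",
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(articles):
    """For each word, store its TF-IDF weight in every concept that contains it."""
    counts = {title: Counter(tokenize(body)) for title, body in articles.items()}
    n_articles = len(articles)
    doc_freq = Counter(word for c in counts.values() for word in c)
    index = {}
    for title, c in counts.items():
        for word, tf in c.items():
            idf = math.log(n_articles / doc_freq[word])
            index.setdefault(word, {})[title] = tf * idf
    return index

def esa_vector(text, index):
    """Represent text as a vector over concepts by summing its words' concept weights."""
    vec = Counter()
    for word in tokenize(text):
        for concept, weight in index.get(word, {}).items():
            vec[concept] += weight
    return vec

INDEX = build_index(TOY_ARTICLES)
doc_vec = esa_vector(
    "Syd Millar was the chairman of the International Rugby Board in 2003.", INDEX
)
top_concept = max(doc_vec, key=doc_vec.get)
```

With these toy articles the example sentence lands entirely on the "Rugby union" concept, even though it never mentions the word "sport".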

10
Explicit Semantic Analysis: Example
  • [Figure label: Wikipedia article titles]

11
Semantic Representation
  • Two semantic representations
  • Bag of words
  • ESA

12
Outline
  • Semantic Representation
  • On-the-fly Classification
  • Datasets
  • Exploiting unlabeled data
  • Robustness to different domains

13
Traditional Text Categorization
  • [Figure labels: Labeled corpus (Sports, Finance); Semantic space; A classifier]

14
Dataless Classification
  • [Figure labels: Labeled corpus; Labels (Sports, Finance)]
  • What can we do using just the labels?

15
But labels are text too!
16
Dataless Classification
  • [Figure labels: New unlabeled document; Labels (Sports, Finance); Semantic space]

17
What is Dataless Classification?
  • Humans don't need training for classification
  • Annotated training data not always needed
  • Look for the meaning of words

19
On-the-fly Classification
  • [Figure labels: New unlabeled document; Labels (Sports, Finance); Semantic space]

20
On-the-fly Classification
  • No training data needed
  • We know the meaning of label names
  • Pick the label that is closest in meaning to the
    document
  • Nearest neighbors
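
The bullets above can be sketched as a nearest-neighbor lookup: embed the document and each label name in the same semantic space, then pick the label whose vector is most similar. This is only an illustration. Cosine similarity is an assumed choice of similarity measure, the function names are hypothetical, and the concept weights are invented for the example rather than computed from Wikipedia.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify_on_the_fly(doc_vec, label_vecs):
    """Return the label name whose vector is closest in meaning to the document."""
    return max(label_vecs, key=lambda name: cosine(doc_vec, label_vecs[name]))

# Hypothetical concept-space vectors (concept -> weight), made up for illustration.
label_vecs = {
    "Sports": {"Sport": 0.9, "Rugby union": 0.4, "Ice hockey": 0.4},
    "Finance": {"Finance": 0.9, "Stock market": 0.5, "Bank": 0.4},
}
doc_vec = {"Rugby union": 0.8, "International Rugby Board": 0.7, "Chairman": 0.2}

predicted = classify_on_the_fly(doc_vec, label_vecs)

# The label set is only an argument, so it can be swapped at call time.
new_labels = {"Hockey": {"Ice hockey": 1.0}, "Baseball": {"Baseball": 1.0}}
predicted_new = classify_on_the_fly({"Ice hockey": 0.5, "Goaltender": 0.5}, new_labels)
```

Nothing is trained here: the only inputs are the document vector and the label-name vectors.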

21
On-the-fly Classification
  • [Figure labels: New unlabeled document; New labels (Hockey, Baseball); Semantic space]

22
On-the-fly Classification
  • No need to even know labels beforehand
  • Compare with traditional classification
  • Annotated training data for each label

23
Outline
  • Semantic Representation
  • On-the-fly Classification
  • Datasets
  • Exploiting unlabeled data
  • Robustness to different domains

24
Dataset 1: Twenty Newsgroups
  • Posts to newsgroups
  • Newsgroups have descriptive names
  • sci.electronics → Science Electronics
  • rec.motorbikes → Motorbikes

25
Dataset 2: Yahoo! Answers
  • Posts to Yahoo! Answers
  • Posts categorized into a two-level hierarchy
  • 20 top-level categories
  • 280 categories in total at the second level
  • Arts and Humanities: Theater & Acting
  • Sports: Rugby League

26
Experiments
  • 20 Newsgroups
  • 10 binary problems (from Raina et al., 2006)
  • Religion vs. Politics.guns
  • Motorcycles vs. MS Windows
  • Yahoo! Answers
  • 20 binary problems
  • Health: Diet & Fitness vs. Health: Allergies
  • Consumer Electronics: DVRs vs. Pets: Rodents

27
Results: On-the-fly classification
  Dataset     Supervised Baseline   Bag of Words   ESA
  Newsgroup   71.7                  65.7           85.3
  Yahoo!      84.3                  66.8           88.6
  • Supervised Baseline: Naïve Bayes classifier; uses annotated data, ignores labels
  • Bag of Words and ESA: nearest neighbors; uses labels, no annotated data

28
Outline
  • Semantic Representation
  • On-the-fly Classification
  • Datasets
  • Exploiting unlabeled data
  • Robustness to different domains

29
Using Unlabeled Data
  • Knowing the data collection helps
  • We can learn specific biases of the dataset
  • Potential for semi-supervised learning

30
Bootstrapping
  • Each label name is a labeled document
  • One example in word or concept space
  • Train initial classifier
  • Same as the on-the-fly classifier
  • Loop
  • Classify all documents with current classifier
  • Retrain classifier with highly confident
    predictions
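
One way to picture the loop above is the self-training sketch below. It makes several assumptions that are not from the talk: a nearest-centroid classifier with cosine similarity, a fixed number of rounds, and "highly confident" defined as a margin between the two best label scores. The names `bootstrap` and `predict` and the toy word-space documents are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Average a list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for k, w in vec.items():
            total[k] += w / len(vectors)
    return dict(total)

def predict(doc, centroids):
    return max(centroids, key=lambda name: cosine(doc, centroids[name]))

def bootstrap(label_vecs, docs, rounds=3, margin=0.2):
    """Start from the label names alone, then retrain on confident predictions."""
    centroids = dict(label_vecs)  # initial classifier: same as on-the-fly
    for _ in range(rounds):
        # Each label name always counts as one labeled example.
        train = {name: [vec] for name, vec in label_vecs.items()}
        for doc in docs:
            scores = sorted(
                ((cosine(doc, c), name) for name, c in centroids.items()),
                reverse=True,
            )
            (best, name), (second, _) = scores[0], scores[1]
            if best - second >= margin:  # keep only highly confident predictions
                train[name].append(doc)
        centroids = {name: centroid(vecs) for name, vecs in train.items()}
    return centroids

# Toy word-space data, invented for illustration.
label_vecs = {"hockey": {"hockey": 1.0}, "baseball": {"baseball": 1.0}}
docs = [
    {"hockey": 1.0, "puck": 1.0, "goalie": 1.0},
    {"baseball": 1.0, "pitcher": 1.0, "inning": 1.0},
    {"puck": 1.0, "goalie": 1.0},       # never mentions "hockey"
    {"pitcher": 1.0, "inning": 1.0},    # never mentions "baseball"
]
centroids = bootstrap(label_vecs, docs)
```

In this toy run the third document has zero similarity to either label name at the start, but after the classifier absorbs the first document it is assigned to "hockey" through the shared words "puck" and "goalie".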

31
Co-training
  • Words and concepts are two independent views
  • Each view is a teacher for the other
  • Blum & Mitchell, 1998

32
Co-training
  • Train initial classifiers in word space and
    concept space
  • Loop
  • Classify documents with current classifiers
  • Retrain with highly confident predictions of
    both classifiers
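
The co-training loop can be sketched along the following lines. As before, this is an illustration under stated assumptions rather than the authors' procedure: nearest-centroid classifiers with cosine similarity, a score margin as the confidence test, the first view winning when the two views disagree, and invented toy vectors. The names `co_train` and `confident_label` are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    total = defaultdict(float)
    for vec in vectors:
        for k, w in vec.items():
            total[k] += w / len(vectors)
    return dict(total)

def confident_label(vec, model, margin):
    """Return the best label if it beats the runner-up by `margin`, else None."""
    scores = sorted(((cosine(vec, c), name) for name, c in model.items()), reverse=True)
    (best, name), (second, _) = scores[0], scores[1]
    return name if best - second >= margin else None

def co_train(label_views, doc_views, rounds=3, margin=0.2):
    """label_views[view][label] and doc_views[view][i] are sparse vectors."""
    views = list(label_views)
    models = {v: dict(label_views[v]) for v in views}  # initial classifiers
    n_docs = len(doc_views[views[0]])
    for _ in range(rounds):
        pseudo = {}  # document index -> label, pooled over both views
        for v in views:
            for i in range(n_docs):
                name = confident_label(doc_views[v][i], models[v], margin)
                if name is not None:
                    pseudo.setdefault(i, name)  # assumption: first view wins conflicts
        # Retrain every view on the label name plus the pooled confident documents.
        models = {
            v: {
                name: centroid(
                    [label_views[v][name]]
                    + [doc_views[v][i] for i, lab in pseudo.items() if lab == name]
                )
                for name in label_views[v]
            }
            for v in views
        }
    return models

# Toy data in two views, invented for illustration.
label_views = {
    "words": {"hockey": {"hockey": 1.0}, "baseball": {"baseball": 1.0}},
    "concepts": {"hockey": {"Ice hockey": 1.0}, "baseball": {"Baseball": 1.0}},
}
doc_views = {
    "words": [
        {"puck": 1.0, "goalie": 1.0},
        {"pitcher": 1.0, "inning": 1.0},
        {"puck": 1.0, "rink": 1.0},
    ],
    "concepts": [
        {"Ice hockey": 1.0, "Goaltender": 1.0},
        {"Baseball": 1.0, "Pitcher": 1.0},
        {"Skating": 1.0},
    ],
}
models = co_train(label_views, doc_views)
```

Here the concept view labels the first document as hockey, which teaches the word view that "puck" signals hockey; the word view can then label the third document, which the concept view alone could not.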

33
Using unlabeled data
  • Three approaches
  • Bootstrapping with labels using Bag of Words
  • Bootstrapping with labels using ESA
  • Co-training

34
More Results
  • Co-training using just labels does as well as supervision with 100 examples
  • No annotated data

35
Outline
  • Semantic Representation
  • On-the-fly Classification
  • Datasets
  • Exploiting unlabeled data
  • Robustness to different domains

36
Domain Adaptation
  • Classifiers trained on one domain and tested on
    another
  • Performance usually decreases across domains

37
But the label names are the same
  • Label names don't depend on the domain
  • Label names are robust across domains
  • On-the-fly classifiers are domain independent

38
Example
  • Baseball vs. Hockey

39
Conclusion
  • Sometimes, label names tell us more about a
    class than annotated examples
  • Standard learning practice of treating labels as
    unique identifiers loses information
  • The right semantic representation helps
  • What is the right one?