Title: Importance of Semantic Representation: Dataless Classification
1Importance of Semantic Representation: Dataless Classification
- Ming-Wei Chang, Lev Ratinov, Dan Roth, Vivek Srikumar
- University of Illinois, Urbana-Champaign
2Text Categorization
- Classify the following sentence
- Syd Millar was the chairman of the International Rugby Board in 2003.
- Pick a label
- Class1 vs. Class2
- Traditionally, we need annotated data to train a
classifier
3Text Categorization
- Humans don't seem to need labeled data
- Syd Millar was the chairman of the International Rugby Board in 2003.
- Pick a label
- Sports vs. Finance
- Label names carry a lot of information!
4Text Categorization
- Do we really always need labeled data?
5Contributions
- We can often go quite far without annotated data
- if we know the meaning of text
- This works for text categorization
- ...and is consistent across different domains
6Outline
- Semantic Representation
- On-the-fly Classification
- Datasets
- Exploiting unlabeled data
- Robustness to different domains
8Semantic Representation
- One common representation is the Bag of Words representation
- All text is a vector in the space of words.
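As a minimal sketch of the bag-of-words view (the tokenizer and the helper names `bag_of_words` and `cosine` are illustrative assumptions, not taken from the talk), each text becomes a sparse word-count vector, and two texts can be compared by cosine similarity:

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Map a text to a sparse vector: word -> count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

sentence = "Syd Millar was the chairman of the International Rugby Board in 2003."
doc_vec = bag_of_words(sentence)
sports_vec = bag_of_words("Sports")
finance_vec = bag_of_words("Finance")

# The sentence shares no word with either label name.
print(cosine(doc_vec, sports_vec), cosine(doc_vec, finance_vec))
```

In this toy run the sentence has no word in common with either label name, so both similarities are zero. Word overlap alone cannot separate the two labels here.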
9Semantic Representation
- Explicit Semantic Analysis
- Gabrilovich & Markovitch, 2006, 2007
- Text is a vector in the space of concepts
- Concepts are defined by Wikipedia articles
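A rough sketch in the spirit of ESA, under loud assumptions: the four "articles" below are invented stand-ins for Wikipedia, and the TF-IDF weighting and the names `build_index` and `esa_vector` are simplifications chosen for illustration, not the published implementation. A text is mapped to a vector whose dimensions are article titles:

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for Wikipedia articles (title -> text).
ARTICLES = {
    "Rugby union": "rugby is a team sport played with an oval ball "
                   "governed by an international board",
    "Sport": "sport covers games such as rugby hockey and baseball played by teams",
    "Stock market": "a stock market is where shares are traded by investors and banks",
    "Bank": "a bank is a financial institution that handles money loans and finance",
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(articles):
    """Return word -> {title: tf-idf weight}, an inverted index over concepts."""
    term_counts = {title: Counter(tokenize(body)) for title, body in articles.items()}
    doc_freq = Counter(word for counts in term_counts.values() for word in counts)
    index = {}
    for title, counts in term_counts.items():
        for word, tf in counts.items():
            idf = math.log(len(articles) / doc_freq[word])
            if idf > 0:  # words that occur in every article carry no signal
                index.setdefault(word, {})[title] = tf * idf
    return index

def esa_vector(text, index):
    """Sum the concept vectors of the words in the text."""
    vec = Counter()
    for word in tokenize(text):
        for title, weight in index.get(word, {}).items():
            vec[title] += weight
    return dict(vec)

index = build_index(ARTICLES)
vec = esa_vector(
    "Syd Millar was the chairman of the International Rugby Board in 2003.", index
)
# Print concepts from strongest to weakest.
for title, weight in sorted(vec.items(), key=lambda item: -item[1]):
    print(title, round(weight, 3))
```

With this toy index the sentence activates only the "Rugby union" and "Sport" concepts, even though the word "sport" never appears in it.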
10Explicit Semantic Analysis Example
Wikipedia article titles
11Semantic Representation
- Two semantic representations
- Bag of words
- ESA
12Outline
- Semantic Representation
- On-the-fly Classification
- Datasets
- Exploiting unlabeled data
- Robustness to different domains
13Traditional Text Categorization
Labeled corpus
Sports
Finance
Semantic space
A classifier
14Dataless Classification
Labeled corpus
Labels
Sports
Finance
What can we do using just the labels?
15But labels are text too!
16Dataless Classification
New unlabeled document
Labels
Sports
Finance
Semantic space
17What is Dataless Classification?
- Humans don't need training for classification
- Annotated training data not always needed
- Look for the meaning of words
19On-the-fly Classification
New unlabeled document
Labels
Sports
Finance
Semantic space
20On-the-fly Classification
- No training data needed
- We know the meaning of label names
- Pick the label that is closest in meaning to the document
- Nearest neighbors
21On-the-fly Classification
New unlabeled document
New labels
Hockey
Baseball
Semantic space
22On-the-fly Classification
- No need to even know labels beforehand
- Compare with traditional classification
- Annotated training data for each label
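The nearest-neighbor rule, and the fact that labels can be supplied at call time, can be sketched as follows. The `WORD_TO_CONCEPTS` table is a hand-made, hypothetical stand-in for an ESA index, and `embed` and `classify` are illustrative names. Any semantic representation that maps text to a vector could be plugged in instead.

```python
import math
import re

# Hypothetical word -> concept weights, standing in for an ESA index.
WORD_TO_CONCEPTS = {
    "sports": {"Sport": 1.0, "Rugby union": 0.5, "Ice hockey": 0.5},
    "rugby": {"Rugby union": 1.0, "Sport": 0.6},
    "hockey": {"Ice hockey": 1.0, "Sport": 0.6},
    "baseball": {"Baseball": 1.0, "Sport": 0.6},
    "puck": {"Ice hockey": 0.9},
    "finance": {"Finance": 1.0, "Bank": 0.7},
    "bank": {"Bank": 1.0, "Finance": 0.6},
    "chairman": {"Bank": 0.2, "Rugby union": 0.1},
}

def embed(text):
    """Map a text (document or label name) into the concept space."""
    vec = {}
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        for concept, weight in WORD_TO_CONCEPTS.get(word, {}).items():
            vec[concept] = vec.get(concept, 0.0) + weight
    return vec

def cosine(u, v):
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(document, label_names):
    """Pick the label whose name is closest in meaning to the document."""
    doc_vec = embed(document)
    return max(label_names, key=lambda name: cosine(doc_vec, embed(name)))

sentence = "Syd Millar was the chairman of the International Rugby Board in 2003."
print(classify(sentence, ["Sports", "Finance"]))

# A different label set needs no retraining: the labels are just more text.
print(classify("He scored with the puck in overtime", ["Hockey", "Baseball"]))
```

Nothing is trained in this sketch. The label names are embedded in the same space as the document, so swapping in a new label set only changes the arguments of `classify`.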
23Outline
- Semantic Representation
- On-the-fly Classification
- Datasets
- Exploiting unlabeled data
- Robustness to different domains
24Dataset 1: Twenty Newsgroups
- Posts to newsgroups
- Newsgroups have descriptive names
- sci.electronics → Science Electronics
- rec.motorbikes → Motorbikes
25Dataset 2: Yahoo! Answers
- Posts to Yahoo! Answers
- Posts categorized into a two level hierarchy
- 20 top level categories
- 280 categories in total at the second level
- Arts and Humanities, Theater Acting
- Sports, Rugby League
26Experiments
- 20 Newsgroups
- 10 binary problems (from Raina et al., 2006)
- Religion vs. Politics.guns
- Motorcycles vs. MS Windows
- Yahoo! Answers
- 20 binary problems
- Health, Diet fitness vs. Health Allergies
- Consumer Electronics DVRs vs. Pets Rodents
27Results: On-the-fly classification
Dataset      Supervised Baseline   Bag of Words   ESA
Newsgroup    71.7                  65.7           85.3
Yahoo!       84.3                  66.8           88.6
- Supervised Baseline: Naïve Bayes classifier; uses annotated data, ignores labels
- Bag of Words and ESA: nearest neighbors; uses labels, no annotated data
28Outline
- Semantic Representation
- On-the-fly Classification
- Datasets
- Exploiting unlabeled data
- Robustness to different domains
29Using Unlabeled Data
- Knowing the data collection helps
- We can learn specific biases of the dataset
- Potential for semi-supervised learning
30Bootstrapping
- Each label name is a labeled document
- One example in word or concept space
- Train initial classifier
- Same as the on-the-fly classifier
- Loop
- Classify all documents with current classifier
- Retrain classifier with highly confident
predictions
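The loop above can be sketched as follows. This is a toy under stated assumptions: a nearest-centroid classifier in word space stands in for whatever learner is actually used, confidence is taken to be the gap between the best and second-best cosine score, and the four documents, the `margin` threshold, and the function names are invented for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(vec, centroids):
    """Return (best label, confidence), confidence = gap to the runner-up."""
    scored = sorted(((cosine(vec, c), name) for name, c in centroids.items()),
                    reverse=True)
    best_score, best_label = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    return best_label, best_score - runner_up

def bootstrap(documents, label_names, rounds=3, margin=0.1):
    doc_vecs = [bag_of_words(d) for d in documents]
    # Each label name is the single "labeled document" for its class.
    seeds = {name: bag_of_words(name) for name in label_names}
    centroids = dict(seeds)  # initial classifier = on-the-fly classifier
    for _ in range(rounds):
        # Classify all documents with the current classifier.
        predictions = [predict(vec, centroids) for vec in doc_vecs]
        # Retrain from the seeds plus the highly confident predictions.
        centroids = {name: Counter(seed) for name, seed in seeds.items()}
        for vec, (label, confidence) in zip(doc_vecs, predictions):
            if confidence >= margin:
                centroids[label].update(vec)
    return [predict(vec, centroids)[0] for vec in doc_vecs]

documents = [
    "the sports team won the rugby match",
    "rugby match ended in a draw",
    "finance news the bank raised rates",
    "bank rates and loans rise",
]
labels = bootstrap(documents, ["sports", "finance"])
print(labels)
```

In this toy run the second and fourth documents share no word with either label name, so the initial classifier cannot place them. They are labeled only after the centroids have absorbed "rugby", "match", "bank", and "rates" from the confidently labeled documents.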
31Co-training
- Words and concepts are two independent views
- Each view is a teacher for the other
- Blum & Mitchell, 1998
32Co-training
- Train initial classifiers in word space and concept space
- Loop
- Classify documents with current classifiers
- Retrain with highly confident predictions of
both classifiers
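A hedged sketch of this loop with two views follows. Everything specific here is an assumption made for illustration: the nearest-centroid learner, the hand-made `WORD_TO_CONCEPTS` table standing in for ESA, the rule that conflicting confident labels are skipped, and the `margin` threshold. Confident predictions from either view are pooled and used to retrain both.

```python
import math
import re
from collections import Counter

# Hypothetical word -> concept weights, standing in for an ESA index.
WORD_TO_CONCEPTS = {
    "sports": {"Sport": 1.0},
    "rugby": {"Sport": 0.8, "Rugby union": 1.0},
    "scrum": {"Rugby union": 1.0},
    "finance": {"Finance": 1.0},
    "bank": {"Finance": 0.8, "Bank": 1.0},
    "loans": {"Bank": 1.0},
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def word_view(text):
    return Counter(tokenize(text))

def concept_view(text):
    vec = Counter()
    for word in tokenize(text):
        vec.update(WORD_TO_CONCEPTS.get(word, {}))
    return vec

def cosine(u, v):
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(vec, centroids):
    scored = sorted(((cosine(vec, c), n) for n, c in centroids.items()), reverse=True)
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    return scored[0][1], scored[0][0] - runner_up

def co_train(documents, label_names, views, rounds=3, margin=0.1):
    doc_vecs = [[view(d) for d in documents] for view in views]
    seeds = [{name: view(name) for name in label_names} for view in views]
    centroids = [dict(seed) for seed in seeds]
    for _ in range(rounds):
        # Each view labels every document; keep only confident guesses.
        confident = {}
        for v in range(len(views)):
            for i, vec in enumerate(doc_vecs[v]):
                label, confidence = predict(vec, centroids[v])
                if confidence >= margin:
                    confident.setdefault(i, set()).add(label)
        # Retrain both views on the pooled labels, skipping conflicts.
        centroids = [{n: Counter(s) for n, s in seed.items()} for seed in seeds]
        for i, found in confident.items():
            if len(found) == 1:
                (label,) = found
                for v in range(len(views)):
                    centroids[v][label].update(doc_vecs[v][i])
    return centroids

def classify(document, label_names, views, centroids):
    def score(name):
        return sum(cosine(view(document), c[name]) for view, c in zip(views, centroids))
    return max(label_names, key=score)

documents = [
    "rugby scrum collapsed twice",
    "the referee reset the scrum",
    "bank loans are cheaper",
    "cheaper loans this year",
]
names = ["Sports", "Finance"]
views = [word_view, concept_view]
centroids = co_train(documents, names, views)
print(classify("the scrum was reset", names, views, centroids))
print(classify("cheaper loans", names, views, centroids))
```

In this toy data no document contains a label name, so the word-space classifier starts with nothing to go on. The concept view labels the first and third documents, and those labels then teach the word view terms such as "scrum" and "loans".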
33Using unlabeled data
- Three approaches
- Bootstrapping with labels using Bag of Words
- Bootstrapping with labels using ESA
- Co-training
34More Results
Co-training using just labels does as well as
supervision with 100 examples
No annotated data
35Outline
- Semantic Representation
- On-the-fly Classification
- Datasets
- Exploiting unlabeled data
- Robustness to different domains
36Domain Adaptation
- Classifiers trained on one domain and tested on another
- Performance usually decreases across domains
37But the label names are the same
- Label names don't depend on the domain
- Label names are robust across domains
- On-the-fly classifiers are domain independent
38Example
39Conclusion
- Sometimes, label names tell us more about a class than annotated examples
- Standard learning practice of treating labels as unique identifiers loses information
- The right semantic representation helps
- What is the right one?