Title: Text Representation
1. Text Representation & Text Classification for Intelligent Information Retrieval
- Ning Yu
- School of Library and Information Science
- Indiana University at Bloomington
2. Outline
- The big picture
- A specific problem: opinion detection
3. Intelligent information retrieval
- Characteristics
- Not restricted to keyword matching and Boolean search
- Deals with natural language queries and advanced search criteria
- Coarse-to-fine level of granularity
- Automatically organizes/evaluates/interprets the solution space
- User-centered, e.g., adapts to users' learning habits
- Etc.
4. Intelligent information retrieval
- System Preferences
- Various sources of evidence
- Natural language processing
- Semantic web technologies
- Automatic text classification
- Etc.
5. Intelligent IR system diagram
6. A Specific Question: Semi-Supervised Learning for Identifying Opinions in Web Content
7. Growing demand for online opinions
- Enormous body of user-generated content
- About anything, published anywhere and at any time
- Useful for literature review, decision making, market monitoring, etc.
8. Major approaches for opinion detection
9. What's Essential? Labeled Data! And lots of them!!!
- To acquire a broad and comprehensive collection of opinion-bearing features (e.g., bag-of-words, POS words, N-grams (n > 1), linguistic collocations, stylistic features, contextual features)
- To generate complex patterns (e.g., "good amount") that can approximate the context of words
- To generate and evaluate opinion detection systems
- To allow evaluation of opinion detection strategies with high confidence
10. Challenges for opinion detection
- Shortage of opinion-labeled data: manual annotation is tedious, error-prone, and difficult to scale up
- Domain transfer: strategies designed for opinion detection in one data domain generally do not perform well in another domain
11. Motivations & research question
- It is easy to collect unlabeled user-generated content that contains opinions
- Semi-Supervised Learning (SSL) requires only a limited amount of labeled data to automatically label unlabeled data, and has achieved promising results in NLP studies
- Is SSL effective in opinion detection both in sparse data situations and for domain adaptation?
12. Datasets & data split
Dataset (sentences)   Blog Posts   Movie Reviews   News Articles
Opinion               4,843        5,000           5,297
Non-opinion           4,843        5,000           5,174
13. Two major SSL methods: Self-training
- Assumption: Highly confident predictions made by an initial opinion classifier are reliable and can be added to the labeled set.
- Limitation: Auto-labeled data may be biased by the particular opinion classifier.
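
A minimal sketch of the self-training loop described on this slide. Assumptions, none of which come from the study itself: a whitespace tokenizer, a small hand-rolled Naive Bayes that scores only the features present in a sentence, a 0.8 confidence threshold, and toy sentences invented for illustration. All function names are hypothetical.

```python
import math
from collections import Counter

def featurize(sentence):
    """Binary unigram + bigram features (naive whitespace tokenizer)."""
    tokens = sentence.lower().split()
    return set(tokens) | {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}

def train_nb(labeled):
    """Count, per label, how many labeled sentences contain each feature."""
    doc_counts = Counter(label for _, label in labeled)
    feat_counts = {label: Counter() for label in doc_counts}
    for sentence, label in labeled:
        feat_counts[label].update(featurize(sentence))
    vocab = set().union(*feat_counts.values())
    return doc_counts, feat_counts, vocab

def predict(model, sentence):
    """Return (best_label, posterior of best_label), Laplace-smoothed."""
    doc_counts, feat_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, n_docs in doc_counts.items():
        denom = sum(feat_counts[label].values()) + len(vocab)
        score = math.log(n_docs / total_docs)
        for feat in featurize(sentence) & vocab:  # ignore unseen features
            score += math.log((feat_counts[label][feat] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    probs = {l: math.exp(s - top) for l, s in log_scores.items()}
    best = max(probs, key=probs.get)
    return best, probs[best] / sum(probs.values())

def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Repeatedly move confidently auto-labeled sentences into the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(labeled)
        confident, rest = [], []
        for sentence in pool:
            label, prob = predict(model, sentence)
            (confident if prob >= threshold else rest).append((sentence, label))
        if not confident:  # nothing new to learn from
            break
        labeled += confident
        pool = [s for s, _ in rest]
    return train_nb(labeled), labeled

# Toy data, invented for illustration only.
seed = [("i love this great movie", "opinion"),
        ("the movie opens on friday", "non-opinion")]
unlabeled = ["i love this great cast", "the theater opens on monday"]
model, augmented = self_train(seed, unlabeled)
```

The limitation noted on the slide is visible in the loop: every auto-assigned label comes from the same classifier, so its early mistakes are fed back into its own training data.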
14. Two major SSL methods: Co-training
- Assumption: Two opinion classifiers with different strengths and weaknesses can benefit from each other.
- Limitation: It is not always easy to create two different classifiers.
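
A rough sketch of co-training with two feature views, assuming a unigram view and a bigram view, a hand-rolled Naive Bayes for both classifiers, a 0.75 confidence threshold, and invented toy sentences. These choices and all function names are illustrative assumptions, not the study's actual configuration.

```python
import math
from collections import Counter

def unigrams(sentence):
    return set(sentence.lower().split())

def bigrams(sentence):
    tokens = sentence.lower().split()
    return {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}

def train_nb(labeled, view):
    """Train a small Naive Bayes on one feature view of the labeled set."""
    doc_counts = Counter(label for _, label in labeled)
    feat_counts = {label: Counter() for label in doc_counts}
    for sentence, label in labeled:
        feat_counts[label].update(view(sentence))
    vocab = set().union(*feat_counts.values())
    return doc_counts, feat_counts, vocab, view

def predict(model, sentence):
    """Return (best_label, posterior of best_label), Laplace-smoothed."""
    doc_counts, feat_counts, vocab, view = model
    total = sum(doc_counts.values())
    scores = {}
    for label, n in doc_counts.items():
        denom = sum(feat_counts[label].values()) + len(vocab)
        scores[label] = math.log(n / total) + sum(
            math.log((feat_counts[label][f] + 1) / denom)
            for f in view(sentence) & vocab)
    top = max(scores.values())
    probs = {l: math.exp(s - top) for l, s in scores.items()}
    best = max(probs, key=probs.get)
    return best, probs[best] / sum(probs.values())

def co_train(labeled, unlabeled, views=(unigrams, bigrams),
             threshold=0.75, max_rounds=5):
    """Each view's classifier adds its confident labels to a shared set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        models = [train_nb(labeled, v) for v in views]
        newly_labeled = {}
        for model in models:
            for sentence in pool:
                label, prob = predict(model, sentence)
                # If both views are confident, the first view's label is kept.
                if prob >= threshold and sentence not in newly_labeled:
                    newly_labeled[sentence] = label
        if not newly_labeled:
            break
        labeled += list(newly_labeled.items())
        pool = [s for s in pool if s not in newly_labeled]
    return [train_nb(labeled, v) for v in views], labeled

# Toy data, invented for illustration only.
seed = [("i love this movie", "opinion"),
        ("the movie opens friday", "non-opinion")]
unlabeled = ["love this great cast", "a great cast"]
models, augmented = co_train(seed, unlabeled)
```

In this toy run the bigram classifier is not confident about the first unlabeled sentence, but the unigram classifier is; once that sentence joins the shared labeled set, the bigram view gains new evidence it could not have produced alone.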
15. Experimental design
- General settings for SSL
- Naïve Bayes classifier for self-training
- Binary values for unigram and bigram features
- Co-training strategies
- Unigrams and bigrams (content vs. context)
- Two randomly split feature/training sets
- A character-based language model (CLM) and a
bag-of-words model (BOW)
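
To make "binary values for unigram and bigram features" concrete, here is a small sketch. It assumes a naive whitespace tokenizer; the function name and example sentence are hypothetical.

```python
def binary_ngram_features(sentence):
    """Return the set of unigrams and bigrams present in a sentence.

    A set encodes binary (presence/absence) values: how many times an
    n-gram occurs is deliberately ignored.
    """
    tokens = sentence.lower().split()  # naive whitespace tokenizer (assumption)
    unigrams = set(tokens)
    bigrams = {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}
    return unigrams | bigrams

features = binary_ngram_features("a good amount of good acting")
# "good" occurs twice in the sentence but contributes a single feature;
# the bigram "good amount" captures context that "good" alone does not.
```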
16. Results: Overall
- For movie reviews and news articles, co-training proved to be most robust
- For blog posts, SSL showed no benefits over SL due to the low initial accuracy
17. Results: Movie reviews
- Both self-training and co-training can improve opinion detection performance
- Co-training is more effective than self-training
18. Results: Movie reviews (cont.)
- The more different the two classifiers, the
better the performance
19. Results: Domain transfer (movie reviews -> blog posts)
- For a difficult domain (e.g., blog), simple self-training alone is promising for tackling the domain transfer problem.
20. Contributions
- Comprehensive research expands the spectrum of SSL application to opinion detection
- Investigation of SSL model that best fits the problem space extends understanding of opinion detection and provides a resource for knowledge-based representation
- Generation of guidelines and evaluation baselines advances later studies using SSL algorithms in opinion detection
- Research extensible to other data domains, non-English texts, and other text mining tasks
21. Thank you!
"If you want a second opinion, I'll ask my computer."
www.CartoonStock.com