Title: Text classification for political science research
1 Text classification for political science research
- Bei Yu
- Daniel Diermeier
- Stefan Kaufmann
- Northwestern University
- October 22, 2007
2 The boom in using text data for political science research
- Political text in digital format
- Formal discourse
- Congressional records
- Party manifestos
- Informal discourse
- E-gov public input on government policies
- Internet forum websites, newsgroups, blogs
3 Text classification for political science research
- Ideology and party classification
- Party manifestos (Laver, Benoit, and Garry 2003)
- Senatorial speeches (Diermeier et al. 2006)
- Newsgroup discussions (Mullen and Malouf 2006)
- Opinion classification
- Newsgroup discussions (Agrawal et al. 2003)
- Public comments (Kwon et al. 2006)
- Congressional floor debate (Thomas et al. 2006)
4 A data mining process
Graph taken from http://alg.ncsa.uiuc.edu/tools/docs/d2k/manual/dataMining.html
5 Political text classification
[Flow diagram: all political texts of interest → sampling method → text samples (X) with class labels (Y) → text representation model → doc vectors → training set → classification methods → classifier → generalization back to the whole collection]
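In modern terms, the diagrammed pipeline might look like the minimal sketch below (scikit-learn, which postdates this 2007 talk; the toy texts and labels are hypothetical stand-ins for sampled political documents and their class labels):

```python
# Minimal sketch of the diagrammed pipeline, assuming scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

texts = ["cut taxes and shrink government",        # sampled texts X
         "expand health care coverage for families"]
labels = ["R", "D"]                                # class labels Y

# Text representation model: bag of words -> doc vectors
vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(texts)

# Classification method: learn a classifier from the training set
classifier = LinearSVC().fit(doc_vectors, labels)

# Generalization: apply the classifier to unseen text from the
# collection of interest
unseen = vectorizer.transform(["lower taxes for small business"])
print(classifier.predict(unseen))
```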
6 Assumptions in classification methods
- Clear external criteria for class definitions
- Zero-noise class labels
- Independently and identically distributed (i.i.d.) data from a fixed distribution (X, Y) (formalized below)
- A model's generalizability is problematic when these assumptions are violated
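Stated in standard learning-theory notation (a conventional formalization, not taken from the slides): the training pairs are drawn i.i.d. from one fixed distribution, and held-out accuracy estimates the true risk only under that assumption.

```latex
(x_i, y_i) \overset{\text{i.i.d.}}{\sim} \mathcal{D}, \quad i = 1, \dots, n,
\qquad
R(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\mathbf{1}\{f(x) \neq y\}\big]
```

Test accuracy approximates 1 − R(f) only when the test data also come from the same D.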
7 Assumption violations in real data
- Subjective class definitions
- Unreliable class labels
- Errors in manually coded labels
- Mismatch between convenient labels and true labels
- Drifted distribution
- Non-i.i.d. data
- Debate as an interactive process
- Sample bias
- Small number of examples vs. large number of features
- Hidden classes in convenience data sets
8 Text classification for political scientists
- An analytical tool for hypothesis testing
- An example of a political science problem
- Argument: ideology is a belief system that governs a person's views on various issues
- Evidence: voting records
- Concern: many factors affect voting decisions
- More evidence?
- Can we predict someone's ideology from what they said instead of how they voted?
9 Text classification for political scientists
- Ideology classification
- Training data
- Senatorial speeches from the 101st-107th Congresses
- Test data
- Senatorial speeches from the 108th Congress
- Result interpretation
- High accuracy is taken as confidence that the target concept generalizes to the whole text collection of interest
10 Impact of assumption violations on generalization
- Possible violations
- Biased sample
- Changing distribution
- Non-i.i.d. data
- They are not rare in political text classification
- They are not easy to foresee
- How can we detect them?
- Black-box approach?
- Some classifiers are never used for prediction
- White-box approach?
- Feature weighting for linear text classifiers
- Checking for unexpected features (see the sketch below)
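A white-box check of this kind might look like the following sketch (scikit-learn assumed; the four toy speeches and labels are hypothetical stand-ins for the corpus):

```python
# White-box inspection: rank a linear classifier's feature weights and
# scan the extremes for unexpected cues such as personal names.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

texts = ["senator smith urges deep tax cuts",
         "senator smith opposes new spending",
         "senator jones backs health reform",
         "senator jones supports universal coverage"]
labels = [1, 1, 0, 0]  # hypothetical party labels

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = LinearSVC().fit(X, labels)

# For a binary linear model, coef_[0] holds one weight per vocabulary term.
weights = clf.coef_[0]
terms = vec.get_feature_names_out()
order = np.argsort(weights)

print("strongest class-0 terms:", [terms[i] for i in order[:3]])
print("strongest class-1 terms:", [terms[i] for i in order[-3:]])
# Names like "smith" or "jones" ranking at the top are a red flag:
# the classifier may be recognizing persons, not the target concept.
```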
11 Feature analysis for linear text classifier interpretation
"I am not sure what I'm looking for, but I am sure what I'm NOT looking for."
12 Case study: party classification
- Ideology and party
- Highly similar classes according to available labels
- Choose party as the label
- Experiment round 1 (design sketched below)
- Training data: 101st-107th Senators
- Test data: 108th Senators
- SVM and naïve Bayes algorithms
- Accuracy > 90%
- Success?
- Feature analysis
- Unexpected features: senator names and state names
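The round-1 design might be coded roughly as follows; the caller supplies the speech texts and party labels, since the data loading is not shown in the slides:

```python
# Sketch of experiment round 1, assuming scikit-learn: train on speeches
# from the 101st-107th Congresses, test on the 108th, with party
# affiliation as the label and the two algorithms named on the slide.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def run_round1(train_texts, train_labels, test_texts, test_labels):
    vec = CountVectorizer()
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)  # reuse the training vocabulary
    for clf in (MultinomialNB(), LinearSVC()):  # naïve Bayes and linear SVM
        clf.fit(X_train, train_labels)
        acc = accuracy_score(test_labels, clf.predict(X_test))
        print(type(clf).__name__, round(acc, 3))
```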
13 Problems in experiment round 1
- Possible coincidence
- A person classifier?
- The target concept should be person-independent
- Possible revisions to the experiment design
- Remove the names
- Use different senators for training and testing
14 [Image slide; no transcript available]
15 Experiment round 2
- Remove the names (see the stop-word sketch below)
- Accuracy > 90%
- Concern: the names were not completely removed
- Train and test on different senators
- Accuracy < 60%
- Failure? Maybe not
- Training senators: 101st-104th Congresses
- Test senators: 108th Congress
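One way to implement the name removal is to fold the names into the vectorizer's stop-word list; the tokens below are illustrative, not the study's actual list:

```python
# Drop senator and state names from the bag-of-words vocabulary by
# treating them as stop words. The name list here is illustrative.
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

name_tokens = {"kennedy", "frist", "massachusetts", "tennessee"}
vec = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS | name_tokens))
# Caveat from the slide: token-level filtering is never completely clean;
# nicknames, titles, and indirect references to colleagues survive.
```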
16 Problems in experiment round 2
- Cannot control for person coincidence
- The concept of party membership is possibly time-dependent
- Bag-of-words (BOW) representation
- Vocabulary change
17 Experiment round 3
- Control time, cross person
- Training set: 2005 House representatives
- Test set: 2005 Senators
- Accuracy: 80%
- Conclusion: the concept crosses persons
- Cross person, cross time (sketched below)
- Training set: 2005 House representatives
- Test sets: Senators from 2005 back to 1989
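The cross-time arm amounts to fixing the training set and sliding the test year backwards, roughly as in this sketch (data loading again left to the caller):

```python
# Sketch of experiment round 3's cross-time design: train once on 2005
# House speeches, then test on each Senate year from 2005 back to 1989.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def accuracy_over_time(house_texts, house_labels, senate_by_year):
    """senate_by_year maps a year to (texts, labels) for that Senate."""
    vec = CountVectorizer()
    clf = LinearSVC().fit(vec.fit_transform(house_texts), house_labels)
    curve = {}
    for year in sorted(senate_by_year, reverse=True):  # 2005 back to 1989
        texts, labels = senate_by_year[year]
        curve[year] = accuracy_score(labels, clf.predict(vec.transform(texts)))
    return curve  # plotted, this gives the accuracy curve on the next slide
```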
18 Accuracy curve over time
19 The accuracy change over time
- Why?
- Vocabulary change over time?
- Is the chamber more partisan than before?
- More experiments
- Same-Senate party classification
20 Same-Senate party classification
21 The lesson
- Multiple hidden factors can lead to high classification accuracy
- Personal characteristics in speech
- Vocabulary similarity during a period of time
- Topic similarity
- Our real goal
- Find evidence that ideology is a concept that crosses persons, issues, and time
22 Conclusion
- Scrutinize assumption violations when using text classification as an analytical tool for political science research
- Validate assumptions through a series of experiments and careful result interpretation
- One accuracy number is not enough
- Accuracy measure: black box
- Feature analysis: white box
- Unexpected relevant features need explanation