Transcript and Presenter's Notes

Title: Text classification for political science research


1
Text classification for political science research
  • Bei Yu
  • Daniel Diermeier
  • Stefan Kaufmann
  • Northwestern University
  • October 22, 2007

2
The boom in using text data for political
science research
  • Political text in digital format
  • Formal discourse
  • Congressional records
  • Party manifestos
  • Informal discourse
  • E-gov: public input on government policies
  • Internet forums: websites, newsgroups, blogs

3
Text classification for political science research
  • Ideology and party classification
  • Party manifestos (Laver, Benoit and Garry 2003)
  • Senatorial speeches (Diermeier et al. 2006)
  • Newsgroup discussions (Mullen and Malouf 2006)
  • Opinion classification
  • Newsgroup discussions (Agrawal et al. 2003)
  • Public comments (Kwon et al. 2006)
  • Congressional floor debate (Thomas et al. 2006)

4
A data mining process
Graph taken from http://alg.ncsa.uiuc.edu/tools/docs/d2k/manual/dataMining.html
5
Political text classification
[Flow diagram: all political texts of interest → sampling method → text samples → text representation model → doc vectors (X) + class labels (Y) → training set → classification methods → classifier → generalization back to all political texts of interest. A code sketch of these stages follows.]
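Below is a minimal sketch of the pipeline in the diagram, written with scikit-learn; the speeches and labels are invented for illustration and this is not the authors' code.

```python
# A minimal sketch of the stages in the diagram above; data is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Text samples (X) and class labels (Y) drawn from the collection of interest.
speeches = ["We must cut taxes and reduce spending.",
            "We need to expand health care coverage."]
labels = ["R", "D"]

# Text representation model: bag-of-words doc vectors.
vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(speeches)

# Classification method: fit a classifier on the (X, Y) training set.
classifier = MultinomialNB()
classifier.fit(doc_vectors, labels)

# Generalization: predict labels for unseen text samples.
new_vectors = vectorizer.transform(["Lower taxes will grow the economy."])
print(classifier.predict(new_vectors))
```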
6
Assumptions in classification methods
  • Clear external criteria for class definitions
  • Zero-noise class labels
  • Independently and identically distributed data
    from a fixed distribution (X, Y)
  • A model's generalizability is problematic when
    these assumptions are violated

7
Assumption violations in real data
  • Subjective class definitions
  • Unreliable class labels
  • Errors in manually coded labels
  • Mismatch between convenient labels and true
    labels
  • Drifted distribution
  • Non-i.i.d. data
  • Debate is an interactive process
  • Sample bias
  • Small number of examples vs. large number of
    features
  • Hidden classes in convenience data sets

8
Text classification for political scientists
  • An analytical tool for hypothesis testing
  • An example of a political science problem
  • Argument: ideology is a belief system that
    governs a person's views on various issues
  • Evidence: voting records
  • Concern: many factors affect voting decisions
  • More evidence?
  • Can we predict someone's ideology based on what
    they said instead of how they voted?

9
Text classification for political scientists
  • Ideology classification
  • Training data
  • 101st-107th Senatorial speeches
  • Test data
  • 108th Senatorial speeches
  • Result interpretation
  • High accuracy provides evidence that the target
    concept generalizes to the whole text collection
    of interest

10
Impact of assumption violation on generalization
  • Possible violations
  • Biased Sample
  • Changing distribution
  • Non-i.i.d. data
  • They are not rare in political text
    classification
  • It is not easy to foresee them
  • How can we detect them?
  • Black-box approach?
  • Some classifiers are never used for prediction
  • White-box approach?
  • Feature weighting for linear text classifiers
  • Checking for unexpected features (see the sketch
    below)
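A minimal sketch of the white-box idea, assuming a linear classifier whose per-term weights can be ranked and inspected for unexpected features; the speeches, labels, and names below are invented.

```python
# White-box sketch: rank a linear classifier's per-term weights and look for
# unexpected features such as senator or state names. Data is invented.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

speeches = ["Senator Smith of Texas supports the tax cut.",
            "Senator Jones of Vermont supports the health care bill."]
labels = ["R", "D"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(speeches)
model = LinearSVC().fit(X, labels)

# For a binary linear model, coef_[0] holds one weight per vocabulary term;
# large negative weights push toward classes_[0], large positive toward classes_[1].
terms = np.array(vectorizer.get_feature_names_out())
order = np.argsort(model.coef_[0])
print("terms pushing toward", model.classes_[0], ":", terms[order[:5]])
print("terms pushing toward", model.classes_[1], ":", terms[order[-5:]])
```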

11
Feature analysis for linear text classifier
interpretation
I am not sure what I'm looking for, but I am
sure what I'm NOT looking for.
12
Case study: party classification
  • Ideology and party
  • highly similar classes according to available
    labels
  • Choose party as the labels
  • Experiment round 1 (see the sketch below)
  • Training data: 101st-107th Senators
  • Testing data: 108th Senators
  • SVM and naïve Bayes algorithms
  • Accuracy > 90%
  • Success?
  • Feature analysis
  • Unexpected features: senator names and state
    names
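A sketch of how the round-1 setup could be reproduced, assuming a hypothetical list of (congress, speech, party) records; the snippets below are invented and far too small to yield meaningful accuracy.

```python
# Round-1 sketch: train on 101st-107th Congress speeches, test on the 108th,
# with both SVM and naive Bayes. The records below are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

records = [  # (congress, speech text, party label) - hypothetical examples
    (101, "cut taxes and shrink government", "R"),
    (103, "expand access to health care", "D"),
    (108, "lower taxes for small business", "R"),
    (108, "protect health care for working families", "D"),
]
train = [(text, y) for c, text, y in records if 101 <= c <= 107]
test = [(text, y) for c, text, y in records if c == 108]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform([text for text, _ in train])
X_test = vectorizer.transform([text for text, _ in test])
y_train = [y for _, y in train]
y_test = [y for _, y in test]

for model in (LinearSVC(), MultinomialNB()):
    accuracy = accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
    print(type(model).__name__, round(accuracy, 2))
```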

13
Problems in experiment round 1
  • Possible coincidence
  • Person classifier?
  • The target concept should be person-independent
  • Possible revisions in experiment design (see the
    sketch below)
  • Remove the names
  • Use different senators for training and testing
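A sketch of both revisions, assuming hypothetical senator and state name lists and speeches tagged with their speaker; all names, speeches, and the speaker split are invented.

```python
# Sketch of both revisions: (1) strip senator and state names from the text,
# (2) keep training and test speakers disjoint. All data is invented.
import re

senator_names = ["kennedy", "mccain"]        # hypothetical name list
state_names = ["massachusetts", "arizona"]   # hypothetical state list
blocklist = set(senator_names + state_names)

def remove_names(text):
    """Revision 1: drop blocklisted name tokens from a speech."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in blocklist)

# Revision 2: split by speaker so no senator appears in both sets.
speeches = [("kennedy", "Massachusetts families need health care."),
            ("mccain", "Arizona taxpayers deserve tax relief.")]
train_speakers = {"kennedy"}
train_texts = [remove_names(t) for who, t in speeches if who in train_speakers]
test_texts = [remove_names(t) for who, t in speeches if who not in train_speakers]
print(train_texts, test_texts)
```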

14
(No Transcript)
15
Experiment round 2
  • Remove the names
  • Accuracy > 90%
  • Concern: names not completely removed
  • Train and test on different senators
  • Accuracy < 60%
  • Failure? Maybe not
  • Training: senators from the 101st-104th Congresses
  • Test: senators from the 108th Congress

16
Problems in experiment round 2
  • Can't control for person coincidence
  • The concept of party membership is possibly
    time-dependent
  • Bag-of-words (BOW) representation
  • Vocabulary change

17
Experiment round 3
  • Control for time, cross person
  • Training set: 2005 House representatives
  • Test set: 2005 Senators
  • Accuracy: 80%
  • Conclusion: the concept crosses persons
  • Cross person, cross time (see the sketch below)
  • Training set: 2005 House representatives
  • Test set: Senators from 2005 back to 1989
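A sketch of the cross-time evaluation loop, assuming hypothetical per-year Senate speech collections: train once on 2005 House speeches, then score each Senate year separately to trace the accuracy curve on the next slide. All data below is invented.

```python
# Cross-time sketch: train once on 2005 House speeches, then score each
# Senate year separately to trace an accuracy curve. All data is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

house_2005 = (["cut taxes now", "expand health coverage"], ["R", "D"])
senate_by_year = {  # hypothetical per-year Senate speeches and labels
    2005: (["lower taxes", "health coverage for all"], ["R", "D"]),
    1989: (["reduce the deficit", "protect social programs"], ["R", "D"]),
}

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(house_2005[0])
model = LinearSVC().fit(X_train, house_2005[1])

for year in sorted(senate_by_year, reverse=True):
    texts, labels = senate_by_year[year]
    accuracy = accuracy_score(labels, model.predict(vectorizer.transform(texts)))
    print(year, round(accuracy, 2))
```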

18
Accuracy curve over time
19
The accuracy change over time
  • Why?
  • Vocabulary change over time?
  • The Chamber is more partisan than before?
  • More experiments
  • Same-senate party classification

20
Same-senate party classification
21
The lesson
  • Multiple hidden factors might lead to a high
    classification accuracy
  • Personal characteristics in speech
  • Vocabulary similarity during a period of time
  • Topic similarity
  • Our real goal
  • Find evidence that ideology is a concept that
    holds across persons, issues, and time

22
Conclusion
  • The importance of scrutinizing assumption
    violations when using text classification as an
    analytical tool for political science research
  • A series of experiments and careful result
    interpretation are needed to validate assumptions
  • One accuracy number is not enough
  • Accuracy measure: black box
  • Feature analysis: white box
  • Explanations are needed for unexpected relevant
    features