Title: Authorship Verification as a One-Class Classification Problem


1
Authorship Verification as a One-Class
Classification Problem
  • Moshe Koppel
  • Jonathan Schler

2
Introduction
  • Goal
  • Given examples of the writing of a single author,
    determine whether a given text was written by
    this author
  • Authorship attribution
  • Given examples of the writing of several authors,
    determine which author wrote a given anonymous
    text

3
Challenge
  • Negative samples are neither exhaustive nor
    representative
  • A single author may consciously vary his or her
    style from text to text

4
Authorship Verification
  • Naïve Approach
  • Given examples of the writing of author A
  • Concoct a mishmash of works by other authors
  • Learn a model for A vs. not-A
  • Learn A vs. X (a mystery work)
  • If it is easy to distinguish between A and X:
    different author
  • Otherwise: same author
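
The A vs. X test above can be sketched as a cross-validation experiment. This is a minimal illustration only: the slides do not say which learner is used, so a nearest-centroid classifier stands in for it, and the function names, the 5-fold split, and the 0.9 threshold are all hypothetical choices. Each chunk of text is assumed to be already converted to a numeric feature vector, with several chunks available for both A and X.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cv_accuracy(a_vecs, x_vecs, folds=5):
    """Cross-validated accuracy of a nearest-centroid model for A vs. X."""
    data = [(v, "A") for v in a_vecs] + [(v, "X") for v in x_vecs]
    correct = 0
    for f in range(folds):
        test = data[f::folds]  # every folds-th sample is held out
        train = [d for i, d in enumerate(data) if i % folds != f]
        cents = {lab: centroid([v for v, l in train if l == lab])
                 for lab in ("A", "X")}
        for v, lab in test:
            pred = min(cents, key=lambda c: sq_dist(v, cents[c]))
            correct += pred == lab
    return correct / len(data)

def naive_verdict(a_vecs, x_vecs, threshold=0.9):
    """Hypothetical rule: high accuracy means A and X are easy to tell apart."""
    if cv_accuracy(a_vecs, x_vecs) >= threshold:
        return "different author"
    return "same author"
```

A single accuracy number is a blunt instrument here, because a handful of topical words can separate two books by the same author.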

5
Authorship Verification
  • Unmasking: basic idea
  • A small number of features do most of the work
    in distinguishing between books
  • Iteratively remove the most useful features
  • Gauge the speed with which cross-validation
    accuracy degrades

6
Authorship Verification
Unmasking The House of the Seven Gables against Hawthorne
(actual author), Melville, and Cooper

7
Experiment

8
Experiment
  • Use One-class SVM as baseline
  • 6 of 20 same-author pairs are correctly
    classified
  • 143 of 189 different-author pairs are correctly
    classified

9
Experiment
  • Using Unmasking Approach
  • Choose a feature set of the 250 words with the
    highest average frequency in Ax and X
  • Build degradation curve
  • Use 10-fold cross-validation for A against X; for
    each fold
  • Do 10 iterations
  • Build a model for A against X
  • Evaluate accuracy results
  • Add accuracy number to degradation curve
  • Remove the 6 top contributing features from the data
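
The procedure above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the slides do not name the learner, so a centroid-difference linear model stands in for it, and "top contributing" is read here as largest absolute weight. The names `top_words`, `train_linear`, and `degradation_curve` are hypothetical, and chunks are assumed to be lists of word frequencies in a fixed feature order.

```python
from collections import Counter

def top_words(a_text, x_text, n=250):
    """The n words with the highest average relative frequency in the two texts."""
    def rel_freq(text):
        words = text.lower().split()
        return {w: c / len(words) for w, c in Counter(words).items()}
    fa, fx = rel_freq(a_text), rel_freq(x_text)
    avg = {w: (fa.get(w, 0.0) + fx.get(w, 0.0)) / 2 for w in set(fa) | set(fx)}
    return sorted(avg, key=lambda w: (-avg[w], w))[:n]

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def train_linear(a_vecs, x_vecs, active):
    """Centroid-difference linear model restricted to the active features."""
    ca, cx = centroid(a_vecs), centroid(x_vecs)
    w = {j: ca[j] - cx[j] for j in active}
    # The decision boundary passes through the midpoint of the two centroids.
    b = -sum(w[j] * (ca[j] + cx[j]) / 2 for j in active)
    return w, b

def predict(w, b, v):
    """Label a vector 'A' if its score is non-negative, else 'X'."""
    return "A" if sum(wj * v[j] for j, wj in w.items()) + b >= 0 else "X"

def degradation_curve(a_vecs, x_vecs, folds=10, iterations=10, k=6):
    """Accuracy per iteration as the k strongest features are removed each round."""
    data = [(v, "A") for v in a_vecs] + [(v, "X") for v in x_vecs]
    correct = [0] * iterations
    for f in range(folds):
        test = data[f::folds]
        train = [d for i, d in enumerate(data) if i % folds != f]
        tr_a = [v for v, lab in train if lab == "A"]
        tr_x = [v for v, lab in train if lab == "X"]
        active = set(range(len(data[0][0])))
        for it in range(iterations):
            w, b = train_linear(tr_a, tr_x, active)
            correct[it] += sum(predict(w, b, v) == lab for v, lab in test)
            # Assumption: "top contributing" means largest absolute weight.
            top = sorted(active, key=lambda j: abs(w[j]), reverse=True)[:k]
            active -= set(top)
    return [c / len(data) for c in correct]
```

The returned list has one accuracy value per iteration, so its first entry is the accuracy before any feature has been removed.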

10
Experiment
Unmasking An Ideal Husband against each of the
ten authors

11
Experiment
  • Distinguish same-author curves from
    different-author curves
  • Represent each degradation curve as a feature vector
  • Feature vector: a numerical vector capturing the
    curve's essential features
  • Accuracy after 6 elimination rounds < 89%
  • The second-highest accuracy drop over two
    iterations > 16%
  • Test degradation curve
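
The two curve features listed above can be sketched as follows. Several points here are assumptions rather than facts from the slides: that `curve[0]` is the accuracy before any elimination, that an "accuracy drop in two iterations" means `curve[i] - curve[i + 2]`, and that both conditions holding together indicates a same-author pair. The function names are hypothetical.

```python
def curve_features(curve):
    """Return (accuracy after 6 elimination rounds, second-highest two-step drop)."""
    acc_after_6 = curve[6]  # curve[0] is taken before any features are removed
    drops = sorted(
        (curve[i] - curve[i + 2] for i in range(len(curve) - 2)),
        reverse=True,
    )
    return acc_after_6, drops[1]

def same_author(curve, acc_threshold=0.89, drop_threshold=0.16):
    """Assumed rule: low late accuracy plus a steep drop suggests the same author."""
    acc6, drop2 = curve_features(curve)
    return acc6 < acc_threshold and drop2 > drop_threshold
```

A curve that collapses quickly, such as one falling from 1.0 to 0.55 by the sixth round, satisfies both conditions, while a curve that stays above 0.9 throughout does not.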

12
Experiment Result
  • 19 of 20 same-author pairs are correctly
    classified
  • 181 of 189 different-author pairs are correctly
    classified
  • Accuracy: 95.7%
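
The overall accuracy follows from pooling the same-author and different-author pairs. The short check below reproduces it from the counts on this slide and, for comparison, from the one-class SVM baseline counts reported earlier in the deck; the helper name `overall_accuracy` is made up for this sketch.

```python
def overall_accuracy(same_correct, same_total, diff_correct, diff_total):
    """Fraction of all pairs, same-author and different-author, classified correctly."""
    return (same_correct + diff_correct) / (same_total + diff_total)

unmasking = overall_accuracy(19, 20, 181, 189)  # 200 of 209 pairs
baseline = overall_accuracy(6, 20, 143, 189)    # 149 of 209 pairs

print(f"Unmasking: {unmasking:.1%}")  # Unmasking: 95.7%
print(f"Baseline:  {baseline:.1%}")   # Baseline:  71.3%
```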

13
Extension
  • Use negative examples to eliminate some false
    positives from the unmasking phase
  • In our case, the elimination method improved
    accuracy
  • 189 of 189 different-author pairs are correctly
    classified
  • Introduced a single new misclassification

14
Extension
  • Elimination
  • If alternative authors A1, ..., An exist then
  • build model M for classifying A vs. all other
    alternative authors
  • test each chunk of X with built model M
  • for each alternative author Ai
  • build model Mi for classifying Ai vs. A and all
    other alternative authors
  • test each chunk of X with built model Mi
  • If number of chunks assigned to Ai > number of
    chunks assigned to A then
  • return different-author
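
The elimination steps above can be sketched as runnable code. This is a minimal illustration, assuming each one-vs-rest model is a nearest-centroid classifier (the slides do not specify the learner) and that chunks are numeric feature vectors. The names `votes_for` and `elimination`, and the choice to return `None` when no alternative author wins, are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def votes_for(target, rest, x_chunks):
    """Number of chunks of X that lie closer to the target centroid than to the rest."""
    c_t, c_r = centroid(target), centroid(rest)
    return sum(sq_dist(v, c_t) < sq_dist(v, c_r) for v in x_chunks)

def elimination(a_vecs, alternatives, x_chunks):
    """alternatives maps an author name to that author's list of chunk vectors."""
    if not alternatives:
        return None  # no negative data available, nothing to eliminate
    pooled = [v for vecs in alternatives.values() for v in vecs]
    votes_a = votes_for(a_vecs, pooled, x_chunks)  # model M: A vs. the others
    for name, vecs in alternatives.items():
        # Model Mi: Ai vs. A together with all other alternative authors.
        rest = list(a_vecs)
        for other, other_vecs in alternatives.items():
            if other != name:
                rest.extend(other_vecs)
        if votes_for(vecs, rest, x_chunks) > votes_a:
            return "different-author"
    return None
```

In this sketch a `None` result means elimination found no better candidate than A, so the unmasking verdict would stand.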

15
Actual Literary Mystery
  • Two 19th-century collections of Hebrew-Aramaic
    documents
  • RP includes 509 documents (by Ben Ish Chai)
  • TL includes 524 documents (which Ben Ish Chai
    claims to have found in an archive)

16
Actual Literary Mystery
Unmasking TL against Ben Ish Chai and four
impostors

17
Conclusion
  • Unmasking completely ignores negative examples
  • High accuracy
  • Unmasking + Elimination (uses a little negative data)
  • Better accuracy
  • More experiments are needed to confirm that this
    method also works well for other languages