Patrick Juola - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Patrick Juola


1
Authorship Attribution and Stylometry (lecture 5)
  • Patrick Juola
  • Duquesne University
  • www.jgaap.com
  • juola_at_mathcs.duq.edu

2
Some Housekeeping
  • I'm having trouble with network connectivity to
    Duquesne
  • Watch www.mathcs.duq.edu/juola
  • Watch www.jgaap.com
  • Will be posting new developments as they occur
  • (Will also post NG corpus as requested.)

3
ESSLLI material
  • The Personae corpus is freely available
  • BUT the one we've developed is not
  • If you're willing to have your essays and
    information published, contact me
  • juola_at_mathcs.duq.edu
  • I will collate and publish via the web

4
JGAAP material
  • JGAAP is freeware: use and enjoy
  • New developments to JGAAP are always welcome,
    subject to licensure (i.e. GPL).
  • Wiki at www.jgaap.com is open for:
      • Feature requests
      • Bug reports
      • Comments
      • New developers

5
Interest in a volume?
  • Depending on public interest (i.e., yours),
    should we pursue the idea of an edited collection
    of JGAAP-related papers?
  • There are a lot of publishers at this summer
    school
  • Contact me if you're interested

6
So, now what?
  • JGAAP seems to work, but needs more development
  • More corpora (and more specialist corpora) are
    needed
  • But if you have an authorship problem to solve
    NOW...

7
Top/bottom methods
  • Sorry, still having network troubles 8-(
  • Best canonicizers: unify case, normalize
    whitespace
  • Strip punctuation: hinders
  • Best events: word bigrams
  • Worst events: word lengths
  • Best analysis: KL-distance, cosine distance
  • Worst analysis: LZW
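
The pipeline the slide ranks (canonicize, extract events, compare) can be sketched roughly as follows. This is not JGAAP code: the function names are hypothetical, the tokenizer is a plain whitespace split, and punctuation is deliberately left in place because the slide reports that stripping it hurts.

```python
import math
import re
from collections import Counter


def canonicize(text):
    """Unify case and normalize whitespace; punctuation is kept."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def word_bigrams(text):
    """Count adjacent word pairs in the canonicized text."""
    words = canonicize(text).split(" ")
    return Counter(zip(words, words[1:]))


def cosine_distance(a, b):
    """1 minus the cosine similarity of two event-count histograms."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty histogram shares nothing with anything (assumed convention).
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

An unknown document would then be attributed to the candidate author whose training text has the smallest distance to it.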

8
But....
  • (Show spreadsheet, stupid!)

9
Testing transference
  • 8 AAAC problems are English
  • 5 are foreign (French x2, Dutch, Latin,
    Serbian/Slavonic)
  • Does English score reflect foreign score?
  • If so, we have evidence that best practices in
    English are also best practices in a novel
    language.
  • N.b. evidence is not proof!
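
One way to pose the question above is to give each method one score on the English problems and one on the foreign problems, then correlate the two lists. The sketch below assumes that setup. The slides do not say how the scores were aggregated, and the numbers in the usage lines are invented purely to show the call.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-method accuracies (NOT the AAAC results).
english_scores = [0.2, 0.5, 0.7, 0.9]
foreign_scores = [0.1, 0.4, 0.8, 0.7]
r = pearson_r(english_scores, foreign_scores)
```

A value of r near 1 would mean that methods that do well on English also tend to do well on the other languages.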

10
2008/9 AAAC data
  • 281 different analyses, generally better than
    AAAC submissions.
  • Correlation: r = 0.6680 (cf. 0.594)
  • Significance: p < 0.0001 (cf. 0.05)
  • Coefficient of determination: r² ≈ 0.45
  • 45% of variation explained by algorithm
    performance alone (rather than other factors)
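
The 45 figure follows directly from squaring the reported correlation. A quick check, applying the same arithmetic to the comparison value 0.594 that the slide gives:

```python
r_new = 0.6680   # correlation reported on this slide
r_old = 0.594    # comparison value given on this slide ("cf.")

r2_new = r_new ** 2   # about 0.446, i.e. roughly 45% of variation
r2_old = r_old ** 2   # about 0.353, i.e. roughly 35% of variation

print(f"r^2 (new) = {r2_new:.3f}, r^2 (cf.) = {r2_old:.3f}")
```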

11
Transference
  • Best practices transfer: a best practice in one
    environment is likely to be a good practice in
    another
  • Turn it around: do we really expect something
    terrible in English to magically improve in
    Polish?
  • Caveat: no predictions about absolute error
    rates
  • Caveat (2): assumes language agnosticism

12
Some other findings
  • OCR errors do not materially impact accuracy
    (Noecker et al.)
  • Asymmetry is a significant factor in
    distance-based attribution methods (Ryan and
    Juola)
  • Algorithm performance dominates language or data
    size effects (Juola)
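
To see why asymmetry can matter for a distance-based method, consider KL divergence, where the distance from A to B generally differs from the distance from B to A. The sketch below is illustrative only. The additive smoothing and the max-based symmetrization are assumptions made here to keep the example well defined, not details taken from the cited work.

```python
import math
from collections import Counter


def distribution(events, vocab, alpha=1.0):
    """Additively smoothed relative frequencies over a fixed vocabulary."""
    counts = Counter(events)
    total = len(events) + alpha * len(vocab)
    return {e: (counts[e] + alpha) / total for e in vocab}


def kl_divergence(p, q):
    """KL(p || q) in nats; p and q must share the same keys."""
    return sum(p[e] * math.log(p[e] / q[e]) for e in p)


def kl_max(p, q):
    """One possible symmetrization: the larger of the two directions."""
    return max(kl_divergence(p, q), kl_divergence(q, p))


vocab = {"a", "b"}
p = distribution(list("aaab"), vocab)
q = distribution(list("ab"), vocab)

forward = kl_divergence(p, q)    # distance from p to q
backward = kl_divergence(q, p)   # distance from q to p; differs from forward
```

Which direction is used (unknown to known, or known to unknown) is therefore a real design choice for such methods.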

13
Other findings (2)
  • Cosine distance on large numbers of words
    outperforms higher-overhead methods on fewer
    words (Noecker & Juola)
  • Characters trump words for Chinese at current
    word-segmentation technology (Zhao & Juola)
  • Mosteller and Wallace's function words are
    overtuned (in preparation)

14
Best practices for now
  • Mixture of experts improves accuracy
  • Run multiple analyses, mixing event types
    (character and word n-grams)
  • Cosine distance and KL-distance work well on
    large event sets
  • SVM works well on small event sets
  • Current leader: KL-distance (max) on word
    bigrams
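
A minimal sketch of the mixture-of-experts idea, assuming each "expert" is a nearest-author classifier built on one event type and that the experts are combined by simple majority vote. The names, the cosine-only distance, and the alphabetical tie-break are all choices made for this illustration, not JGAAP's API or the lecture's exact procedure.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-gram counts after case and whitespace normalization."""
    t = " ".join(text.lower().split())
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))


def word_ngrams(text, n=2):
    """Word n-gram counts after case normalization."""
    w = text.lower().split()
    return Counter(tuple(w[i:i + n]) for i in range(len(w) - n + 1))


def cosine_distance(a, b):
    """1 minus the cosine similarity of two event-count histograms."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if na == 0 or nb == 0 else 1.0 - dot / (na * nb)


def nearest_author(unknown, known, extract):
    """Pick the author whose sample is closest under one event type."""
    target = extract(unknown)
    return min(sorted(known),
               key=lambda a: cosine_distance(target, extract(known[a])))


def mixture_of_experts(unknown, known, extractors):
    """Majority vote over experts; ties broken alphabetically (assumption)."""
    votes = Counter(nearest_author(unknown, known, ex) for ex in extractors)
    top = max(votes.values())
    return sorted(a for a, c in votes.items() if c == top)[0]


# Toy texts, invented for the example.
known = {
    "A": "the cat sat on the mat the cat sat",
    "B": "quantum flux drives plasma engines quantum flux",
}
guess = mixture_of_experts("the cat sat on the rug", known,
                           [char_ngrams, word_ngrams])
```

More experts (other n-gram sizes, other distances such as KL) can be added to the `extractors` list in the same way.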

15
Future extensions
  • AAAC corpus too small to distinguish among
    20,000 methods (testing continuing, though)
  • Add more methods to JGAAP, hopefully solicited
    from community
  • Continue to develop/publish best practices

16
  • Merci
  • Arigato
  • Danke
  • Gracias
  • Tesekkür ederim
  • Dank U

Tak!