Patrick Juola - PowerPoint PPT Presentation


1
Authorship Attribution and Stylometry
  • Patrick Juola
  • Duquesne University
  • www.jgaap.com
  • juola_at_mathcs.duq.edu

2
Whodunit?
  • Authorship Attribution (aka Stylometry, cf.
    Authorship Profiling): identifying an author
    from his/her writings
  • Did Shakespeare really write those plays?
  • Or was it the Earl of Oxford?
  • Or Francis Bacon?
  • Or Roger Bacon?
  • Or Kevin Bacon?
  • etc.

3
More technical definition
  • Authorship attribution: inferring the identity
    of the author of a document by examination.
  • Stylometry: inferring properties of the author
    by examination.
  • E.g. the author was a male native English speaker
    aged between 25 and 35 with no college education
    but with theater training

4
Important problem
  • Long history (Book of Judges, shibboleth)
  • Key to literature
  • and to history and journalism
  • and teaching (catching cheaters) and
    law/investigation (Unabomber)
  • and psychology (inferring personality from
    writing) and security and, and, ...

5
Computers are problematic
  • Handwriting is easy, anyone can do it.
  • Typewriting is still pretty easy if you know what
    you're looking for
  • But one 12pt Times Roman "A" looks identical to
    any other.
  • What cues to authorship exist?

6
Looking for clues
What is this object?
7
Looking for clues (2)
  • How far does light travel in 1/300,000 of a
    second?

8
Looking for clues (3)
  • Where is the dinner fork?

9
Finding clues
The object is a couch.
10
Finding clues (2)
  • How far does light travel in 1/300,000 of a
    second?
  • Approximately one kilometer.
  • Note that other answers are not wrong, just
    individual.
  • E.g. "kilometre" is a standard spelling
  • "km" is the standard abbreviation
  • "click" or "k" are commonly understood slang

11
Finding clues (3)
  • The dinner fork is to the left of the plate

12
Finding clues (3b)
  • The dinner fork is on the immediate left of the
    plate

13
Another example
  • The paradigmatic and systematic utilization of
    sesquipedalian lexical items can be an
    informative element of individual and
    idiosyncratic patterns of linguistic variation
  • Or, some people use big words

14
History
  • Judges 12:6: "Then said they unto him, Say now
    Shibboleth: and he said Sibboleth: for he could
    not frame to pronounce it right. Then they took
    him, and slew him at the passages of Jordan: and
    there fell at that time of the Ephraimites forty
    and two thousand."

15
The stylome
  • The underlying theoretical assumption is that
    language is not completely controllable (i.e.
    it's hard to lose your accent)
  • Obviously, some parts (e.g. lexicon) are more
    controllable than others (e.g. accent).
  • Van Halteren has coined the term "stylome" to
    describe these specific individual differences.
    Others use "fingerprint".

16
Some early candidates
  • Authorial vocabulary may be a stylome.
  • You can't use words you don't know.
  • Can we measure vocabulary size?
  • Similarly, average word length may be a stylome
    (first proposed by De Morgan)
  • But neither of these works especially well.
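
The two early candidates above are easy to compute, which is part of their appeal. A minimal Python sketch follows; the tokenizer and the use of type-token ratio as a stand-in for "vocabulary size" are assumptions made for illustration, not measures taken from the slides.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def average_word_length(text):
    """Mean number of characters per word token (0.0 for empty text)."""
    words = tokenize(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0

def type_token_ratio(text):
    """Distinct words divided by total words: a crude vocabulary-richness proxy."""
    words = tokenize(text)
    return len(set(words)) / len(words) if words else 0.0

sample = "Some people use big words and some people use small words"
avg_len = average_word_length(sample)   # 47 characters over 11 tokens
ttr = type_token_ratio(sample)          # 7 distinct words over 11 tokens
```

Both numbers shift with text length and topic as well as with author, which is one plausible reason such single-number measures discriminate poorly.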

17
Federalist Papers
  • Modern stylometry more or less starts with
    Mosteller and Wallace
  • Studied the Federalist Papers using multivariate
    statistics
  • Took frequencies of specific high-frequency
    function words
  • Classified disputed documents as Hamilton or
    Madison (H/M) based on Bayesian analysis
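
The general idea of classifying by function-word rates can be sketched as below. This is a simplified stand-in, not Mosteller and Wallace's actual model: it assumes each word count is Poisson-distributed, assumes equal priors for the two candidates, and uses invented per-1000-word rates for two hypothetical authors.

```python
import math
import re

# Small illustrative word list; a real study would use many more words.
FUNCTION_WORDS = ["upon", "while", "whilst", "on"]

# Hypothetical rates per 1000 words, invented for this sketch.
PROFILES = {
    "author_A": {"upon": 3.0, "while": 0.3, "whilst": 0.1, "on": 3.0},
    "author_B": {"upon": 0.2, "while": 0.1, "whilst": 0.5, "on": 8.0},
}

def tokens_of(text):
    """Lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def rates_per_thousand(text):
    """Observed rate of each function word per 1000 tokens."""
    tokens = tokens_of(text)
    total = len(tokens)
    return {w: (1000.0 * tokens.count(w) / total if total else 0.0)
            for w in FUNCTION_WORDS}

def poisson_log_likelihood(counts, n_tokens, author_rates):
    """Log-likelihood of the counts if each word is Poisson at the author's rate."""
    ll = 0.0
    for word, k in counts.items():
        lam = author_rates[word] * n_tokens / 1000.0  # expected count in this text
        ll += k * math.log(lam) - lam - math.lgamma(k + 1)
    return ll

def classify(text, profiles=PROFILES):
    """Pick the author whose rates make the observed counts most likely."""
    tokens = tokens_of(text)
    if not tokens:
        raise ValueError("cannot classify an empty document")
    counts = {w: tokens.count(w) for w in FUNCTION_WORDS}
    scores = {author: poisson_log_likelihood(counts, len(tokens), rates)
              for author, rates in profiles.items()}
    return max(scores, key=scores.get)

disputed = "upon reflection the court relied upon precedent and upon reason"
guess = classify(disputed)
```

With equal priors, comparing likelihoods is equivalent to comparing posterior probabilities; unequal priors would add a log-prior term to each score.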

18
Successes and Failures
  • Mosteller and Wallace's results generally
    confirmed accepted scholarship
  • But it's also a largely artificial problem!
  • Federalist Papers have become the standard test case
  • Other examples have produced noted failures
  • E.g. Foster's attribution of "A Funeral Elegy"

19
Lots of ways to study
  • Rudman has suggested that more than 1000
    different features have been proposed over the
    past 100 years.
  • Most "work" in the sense of better than chance.
  • But "better than chance" isn't very good in the
    real world.

20
The Ur-study
  • Find a document, with presumptive author
  • Collect uncontroversial corpus of author's
    writings
  • Collect set of distractor authors, with sample
    corpora for each author
  • Identify something found in author's writings and
    test document, but not in distractors
  • Publish

21
Textual considerations
  • First question: How confident are we that we
    have a valid text to study?
  • Issues include corruption, editorial changes,
    formatting (e.g. running heads), printer's errors
  • Second question: How confident are we of our
    "uncontroversial" stuff?
  • Third question: Do we have the right distractor
    authors?

22
Technical considerations
  • First question: How good is the technique we're
    using?
  • Second question: Are there representativeness
    issues involved?
  • Third question: Do we have enough data?
  • Fourth question: How do we interpret the results?

23
Search for best practices
  • First question: How good is the technique we're
    using?
  • Development of good techniques is an open
    research question.
  • Hence JGAAP

24
JGAAP
  • Single framework allows comparative testing under
    controlled conditions
  • Modular, object-oriented approach makes extension
    to new methods easy.
  • Simple GUI for ease of use
  • Simple 3-phase model under the hood

25
Under the hood
  • Canonicization: perform necessary conversions,
    strip out irrelevant and confusing differences
  • Event Set Generation: partition document into
    Events
  • Statistical Analysis: k-NN, LDA, SVM, Naïve
    Bayes, whatever you like.
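
The first phase can be pictured as a chain of small text-to-text functions. The Python sketch below is only an illustration of that idea: the function names and the particular canonicizers are assumptions for this example, not JGAAP's actual classes or API.

```python
import string

def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(text.split())

def unify_case(text):
    """Lowercase everything so 'The' and 'the' count as the same word."""
    return text.lower()

def strip_punctuation(text):
    """Remove ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))

def canonicize(text, steps):
    """Apply each canonicizer in order and return the cleaned text."""
    for step in steps:
        text = step(text)
    return text

raw = "The  dinner fork is\n on the LEFT of the plate."
clean = canonicize(raw, [strip_punctuation, unify_case, normalize_whitespace])
```

Which differences count as "irrelevant" depends on the question: stripping punctuation helps a word-frequency study but would destroy the evidence for a study of punctuation habits.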

26
Event Set Generation
  • Documents contain "events" (also called
    "features", but "events" stresses ordering).
  • E.g. words are events
  • Make bag-of-words, or bag of word-bigrams
  • Properties of words are also events
  • POS, word lengths, frequencies
  • Phase II: convert document to (ordered) Event
    Set.
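
A rough Python sketch of event set generation is given below, assuming whitespace-separated, already canonicized text. The function names are hypothetical and chosen for this example; part-of-speech events are omitted because they would need a tagger.

```python
from collections import Counter

def word_events(text):
    """Each word is one event; the list preserves document order."""
    return text.split()

def word_bigram_events(text):
    """Each pair of adjacent words is one event."""
    words = text.split()
    return list(zip(words, words[1:]))

def word_length_events(text):
    """A property of each word (its length) used as the event."""
    return [len(w) for w in text.split()]

doc = "the fork is to the left of the plate"
words = word_events(doc)
bigrams = word_bigram_events(doc)
lengths = word_length_events(doc)

# Discarding order turns an ordered event set into a bag (multiset).
bag_of_words = Counter(words)
```

Keeping the events as an ordered list leaves the choice open: an analysis that needs order can use it, and one that does not can reduce the list to a bag.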

27
Analysis
  • Classify Event Sets based on statistical
    properties.
  • Again, many different ways to do this

28
A simple example
  • Build histogram of Events based on (normalized)
    frequencies.
  • Convert histogram to vector space by enumerating
    over elements
  • Calculate distances between various histograms
    using distance formula
  • Assign authorship of unknown document to closest
    document of known authorship
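
The four steps above can be sketched in a few lines of Python. The slide does not say which distance formula is meant, so Euclidean distance is an assumption here, and the tiny training samples are invented purely to make the example run.

```python
import math
from collections import Counter

def histogram(events):
    """Map each event to its relative frequency (empty dict for no events)."""
    counts = Counter(events)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()} if total else {}

def distance(h1, h2):
    """Euclidean distance, treating events missing from a histogram as 0."""
    keys = set(h1) | set(h2)
    return math.sqrt(sum((h1.get(k, 0.0) - h2.get(k, 0.0)) ** 2 for k in keys))

def attribute(unknown_events, known_docs):
    """Return the author of the known document closest to the unknown one.

    known_docs is a list of (author, events) pairs.
    """
    target = histogram(unknown_events)
    author, _ = min(known_docs,
                    key=lambda doc: distance(target, histogram(doc[1])))
    return author

# Invented toy samples for two hypothetical authors.
known_docs = [
    ("author_A", "the the of the and".split()),
    ("author_B", "big words big lexical items".split()),
]
unknown = "the of the the".split()
guess = attribute(unknown, known_docs)
```

Enumerating over the union of events is what turns the two histograms into vectors in a shared space; swapping in another distance (Manhattan, cosine) only requires changing `distance`.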

29
Getting JGAAP
  • JGAAP is available at www.jgaap.com
  • Also available at
    http://www.mathcs.duq.edu/fa08rutenbar/jgaap.zip
  • Requires Java (JDK) 1.5 or better
  • Also requires ant
  • Freeware, so we can (and will) be developing
    during the course

30
Plans for rest of course
  • Details of JGAAP
  • Details of some of the models JGAAP includes
  • Developing new test corpora
  • Developing new models based on analysis of test
    corpora
  • Extension to profiling.