Title: Stylometry and authorship


1
Stylometry and authorship
  • D. Holmes, Authorship attribution, Computers and
    the Humanities 28 (1994), 87-106.
  • D. Holmes, The Evolution of Stylometry in
    Humanities Scholarship, Literary and Linguistic
    Computing 13 (1998), 111-117.
    http://llc.oxfordjournals.org/cgi/reprint/13/3/111.pdf
  • T. McEnery & M. Oates, Authorship identification
    and computational stylometry, in Dale et al. (eds),
    Handbook of Natural Language Processing, New York:
    Dekker (2000), chapter 23.1

2
Stylometry
  • Measurement of (aspects of) style
  • Especially using computational tools
  • Purposes:
    • Genre classification
    • Historical study of language change (diachronic
      linguistics)
    • Literary analysis
    • Authorship attribution
    • Forensic linguistics

3
Authorship attribution
  • Has been a topic of research since at least the
    mid-19th century (predates computers)
  • Interest in:
    • resolving issues of disputed authorship
    • identifying authorship of anonymous texts
    • may be useful in detecting plagiarism, and the
      authorship of computer viruses
    • used in forensic settings, e.g. to determine
      whether confessions are genuine

4
Authorship attribution
  • Five main approaches:
  • Physical evidence
    • e.g. carbon dating and handwriting analysis, as in
      the case of the Hitler Diaries. Not relevant to
      linguistics/stylistics
  • Historical evidence
    • e.g. did Marlowe or Shakespeare write Edward III?
      It was published in 1596, three years after
      Marlowe's death, but contains references to the
      defeat of the Armada (1588)
    • knowledge-intensive, not feasible for computers

5
Authorship attribution
  • Cipher-based decryption
    • idea that authors deliberately encode their names
      in the text
    • especially widespread in Bible studies, but also
      in the Shakespeare-Bacon debate
    • Penn (1987) used computer analysis and claimed to
      show that Bacon had written many of Shakespeare's
      plays
    • easily debunked: see
      http://shakespeareauthorship.com/5b
    • Ross showed that the same techniques "proved" that
      Bacon also wrote Spenser's Faerie Queene, the
      Bible, Caesar's Gallic Wars, Hiawatha, Moby Dick
      and The Federalist Papers (see later)

6
Authorship attribution
  • Manual analysis
    • Much used in forensic linguistics
    • Detailed analysis of an unlimited range of
      linguistic traits
    • Not suitable for computational analysis, but
      we'll look at some examples later
  • Computational stylometry
    • Involves counting things
    • So can only look at what is easily countable

7
Stylometry
  • Assumes that the essence of an author's individual
    style can be captured with reference to a
    number of quantitative criteria, called
    discriminators
  • Obviously, some (many) aspects of style are
    conscious and deliberate
    • as such they can be easily imitated, and indeed
      often are
    • many famous pastiches, either humorous or as a
      sort of homage
  • Computational stylometry focuses on subconscious
    elements of style, which are less easy to
    imitate or falsify

8
Stylometry is not foolproof
  • We should be aware of shortcomings:
    • Discriminators are mostly lexical, though some
      recent work has also looked at syntactic
      discriminators
    • Authors' styles change, either over time or
      deliberately, e.g. when writing in different
      literary genres
    • Many techniques rely on large quantities of data
  • Most of the following techniques are better at
    dealing with closed questions:
    • Who wrote this, A or B?
    • If A wrote these, did they also write this?
    • How likely is it that A wrote this?
  • but not the open question: Who wrote this?

9
Some classical examples
  • Did Homer write both the Iliad and the Odyssey?
    • both generally attributed to a single individual
      named Homer, but both are derived from a long
      oral tradition
  • Did Paul write all the NT Letters of St Paul?
    • In particular, the authorship of Hebrews has long
      been debated on theological grounds
  • Plato developed his philosophy in the form of
    dialogues, putting his own doctrines into the
    mouth of Socrates, his teacher
    • Ascertaining the correct chronological order of
      these dialogues would help us understand how
      Plato developed his philosophy
  • Did Shakespeare write all of his plays?
    • Various authors, including Bacon and Marlowe, are
      said to have written parts or all of several
      plays
    • Shakespeare may even be a nom de plume for a
      group of writers
    • two more plays, Edward III and Two Noble Kinsmen,
      may have been written partly by Shakespeare

10
Some modern examples
  • The Federalist Papers
    • a series of articles published in 1787-88 with
      the aim of promoting the ratification of the new
      US constitution
    • written by three authors, Jay, Hamilton and
      Madison, under the pseudonym Publius
    • Some are of known (and in some cases joint)
      authorship, but others are disputed
    • Pioneering stylometric methods were famously used
      by Mosteller and Wallace in the early 1960s to
      attempt to resolve the disputed authorship
    • The question is now considered settled
    • The Federalist Papers present a difficult but
      solvable test case, and are seen as a benchmark
      for testing new ideas

11
Some modern examples
  • Similarities with private letters helped to
    identify the style of the Unabomber's manifesto
    • The Unabomber, Theodore Kaczynski, perpetrated a
      number of bomb attacks on universities and
      airlines between 1978 and 1995
    • He promised to stop if his 35,000-word
      anti-industrialist manifesto was published in
      major newspapers
    • His distinctive writing style and turns of phrase
      enabled him to be identified
  • Authorship of Primary Colors, a work of fiction
    about preparations for the Democratic primaries
    that showed the Bill Clinton character in a bad
    light

12
Some modern examples
  • Derek Bentley and his disputed murder
    confession (1953)
    • Bentley (an illiterate man of low IQ) and another
      man were involved in an armed robbery in which a
      policeman was shot
    • Bentley was found guilty and hanged in January 1953
    • In 1971 the author Yallop looked closely at the case
    • As well as conflicting ballistic evidence and
      some procedural errors in the trial, Bentley's
      statement was found to have been doctored by
      police
    • The contested statement used "then" every 58 words
      on average and repeatedly used "I then"
    • BoE (the Bank of English corpus) uses "then" every
      500 words, and "then I" ten times more often than
      "I then". Importantly, witness statement
      frequencies overall are similar to BoE
    • Police statements of the time used "then" every
      78 words, and typically used the "I then" form
    • Bentley's conviction was quashed posthumously in
      1998; the appeal was assisted by a linguistics
      professor

13
Basic methodologies
  • Word or sentence length: too obvious and easy to
    manipulate
  • Frequencies of letter pairs: strangely successful,
    though limited
  • Distribution of words of a given length (in
    syllables), especially relative frequencies, i.e.
    the length of gaps between words of the same
    syllable length
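
As a rough illustration of the letter-pair discriminator, the sketch below counts adjacent letter pairs within words and converts them to relative frequencies. The function name and the choice to ignore pairs that span word boundaries are assumptions made for this example, not details taken from the slides.

```python
from collections import Counter

def letter_pair_frequencies(text):
    """Relative frequency of each adjacent letter pair inside words.

    Assumption for this sketch: pairs never span a word boundary,
    and non-alphabetic characters are dropped before pairing.
    """
    counts = Counter()
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        for a, b in zip(letters, letters[1:]):
            counts[a + b] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}

# Toy input: "th" and "he" each occur three times out of nine pairs.
freqs = letter_pair_frequencies("the then there")
```

Two texts can then be compared by looking at how far their frequency profiles differ for the most common pairs.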

14
Vocabulary richness
  • Based on the idea that an author's vocabulary is
    more or less constant
  • Various measures:
    • Type-token ratio
    • Simpson's index (the chance that two words
      arbitrarily chosen from the text will be the same)
    • Yule's K (assumes the occurrence of a given word
      is a chance event that can be modelled as a
      Poisson distribution)
    • Entropy (measure of uniformity)
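
The four measures can be sketched in a few lines. This is a minimal version using the usual textbook formulas (Simpson's index without replacement, Yule's K scaled by 10,000, Shannon entropy in bits). The function name and the whitespace tokenisation in the example are assumptions for illustration only.

```python
import math
from collections import Counter

def vocabulary_richness(tokens):
    """Return type-token ratio, Simpson's index, Yule's K and entropy.

    Assumes at least two tokens; tokenisation is left to the caller.
    """
    counts = Counter(tokens)
    n = len(tokens)   # number of tokens
    v = len(counts)   # number of types (distinct words)
    if n < 2:
        raise ValueError("need at least two tokens")
    # Chance that two tokens drawn without replacement are the same word
    simpson = sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    # Yule's K = 10^4 * (sum over types of count^2 - N) / N^2
    yule_k = 1e4 * (sum(c * c for c in counts.values()) - n) / (n * n)
    # Shannon entropy of the word distribution, in bits
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"ttr": v / n, "simpson": simpson,
            "yule_k": yule_k, "entropy": entropy}

measures = vocabulary_richness("a a b c".split())
```

Note that the type-token ratio falls as texts get longer, so it is only comparable across samples of similar length.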

15
The Federalist Papers
  • 85 papers arguing for the adoption of the US
    constitution
  • written by three authors (Jay, Hamilton, Madison):
    • 5 authored by Jay
    • 51 authored by Hamilton
    • 14 authored by Madison
    • 3 jointly by Hamilton and Madison
    • authorship of 12 of them disputed (Hamilton or
      Madison?)
  • Mosteller and Wallace (1964) employed function
    words such as prepositions, conjunctions and
    articles as discriminators
    • e.g. the word "upon" averaged 3.24 appearances per
      1,000 words in the known writings of Hamilton but
      only 0.23 in the writings of Madison
    • 30 marker words identified as discriminating
      between the two contested authors, including upon,
      whilst, there, on, while, vigor, by, consequently,
      would, voice
16
Bayesian probability
  • Bayes' theorem reconciles prior hypotheses (in
    this case based on historical observation) with
    conditional probabilities based on measurements
  • If the prior hypothesis (e.g. that there is a 1:3
    chance that Madison wrote the paper) is confirmed
    by the measurements (e.g. of features associated
    with Madison's style), the result will be neutral
  • If the prior hypothesis is contradicted by the
    measurements, the result will be much more striking
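
A hedged sketch of the Bayesian update for a single marker word: posterior odds equal prior odds times the likelihood ratio. The rates for "upon" (3.24 and 0.23 per 1,000 words) are the figures quoted for Hamilton and Madison. The simple Poisson model, the illustrative prior of 0.25 and the function names are assumptions made for this example and should not be read as Mosteller and Wallace's actual procedure.

```python
import math

def poisson_log_pmf(k, lam):
    """Log probability of seeing k events under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def posterior_madison(k, n_words, prior_madison,
                      rate_hamilton=3.24, rate_madison=0.23):
    """Posterior probability that Madison wrote a paper of n_words words
    containing k occurrences of "upon" (rates are per 1,000 words)."""
    lam_h = rate_hamilton * n_words / 1000
    lam_m = rate_madison * n_words / 1000
    # Likelihood ratio: P(k | Madison) / P(k | Hamilton)
    log_lr = poisson_log_pmf(k, lam_m) - poisson_log_pmf(k, lam_h)
    prior_odds = prior_madison / (1 - prior_madison)
    post_odds = prior_odds * math.exp(log_lr)
    return post_odds / (1 + post_odds)

# A 2,000-word paper with no "upon" at all, versus one with six.
p_none = posterior_madison(k=0, n_words=2000, prior_madison=0.25)
p_six = posterior_madison(k=6, n_words=2000, prior_madison=0.25)
```

With these assumed numbers, a prior that leans away from Madison is overturned by the absence of "upon", while six occurrences push the posterior strongly towards Hamilton. In practice the evidence from many marker words would be combined.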

17
Cumulative sum charts
  • Method
    • Assume authorial "fingerprints" such as the
      percentage of short words, or of words beginning
      with a vowel
    • Put two texts together and plot the number of
      items per sentence against the cumulative average
    • If the graph has a sharp divergence at the point
      where the texts are joined, this is taken to show
      that the authors differ
  • Highly controversial
    • Interpretation of graphs is very subjective
    • But much used in courts!
  • Weighted cusum
    • On a slightly sounder footing statistically;
      eliminates the need for subjective judgment
    • Still not very accurate compared to other measures
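
The quantities behind a cusum chart can be sketched as follows: for each sentence, take the sentence length and the count of a "habit" feature, then accumulate each series' deviations from its own mean. Treating "short words" as words of two or three letters, and the helper names used here, are assumptions for this illustration. The plotting and the (subjective) visual comparison are left out.

```python
import re

WORD = re.compile(r"[A-Za-z']+")

def sentence_lengths(sentences):
    """Number of words in each sentence."""
    return [len(WORD.findall(s)) for s in sentences]

def short_word_counts(sentences, min_len=2, max_len=3):
    """Count of 'short' words per sentence (2-3 letters is an assumption)."""
    return [sum(1 for w in WORD.findall(s) if min_len <= len(w) <= max_len)
            for s in sentences]

def cusum(values):
    """Running sum of deviations from the mean of the series."""
    mean = sum(values) / len(values)
    running, out = 0.0, []
    for v in values:
        running += v - mean
        out.append(running)
    return out

sentences = ["I am in an old inn",
             "The cat sat on the mat",
             "Considerable deliberation preceded every subsequent ratification"]
habit = short_word_counts(sentences)
length_curve = cusum(sentence_lengths(sentences))
habit_curve = cusum(habit)
```

The two curves are normally scaled and overlaid; a point where they pull apart is read as a possible change of author, which is exactly the step that makes the method subjective.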

18
Multivariate analysis
  • Thanks to computers it is now possible to collect
    large numbers of different measurements of a
    variety of features
  • Variants of multivariate analysis:
    • Cluster analysis
    • Correspondence analysis
    • Principal components analysis

19
Cluster analysis
  • Groups objects according to their similarity with
    respect to a given feature
  • Produces a tree diagram or dendrogram
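
As a minimal sketch of how such a tree is built, the code below performs agglomerative clustering with average linkage and Euclidean distance, returning the merge tree as nested tuples. The linkage and distance choices, the function names and the feature values (invented toy rates for two marker words) are all assumptions for illustration, not data from any study.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(profiles):
    """Average-linkage clustering; returns the tree as nested tuples.

    profiles: dict mapping a text label to its feature vector.
    """
    # Each cluster is (tree, list of member vectors)
    clusters = [(label, [vec]) for label, vec in profiles.items()]
    while len(clusters) > 1:
        def linkage(pair):
            a = clusters[pair[0]][1]
            b = clusters[pair[1]][1]
            return (sum(euclidean(x, y) for x in a for y in b)
                    / (len(a) * len(b)))
        # Merge the two closest clusters
        i, j = min(combinations(range(len(clusters)), 2), key=linkage)
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Invented toy profiles: rates of two marker words in four texts.
toy_profiles = {"H1": [3.1, 0.1], "H2": [3.4, 0.0],
                "M1": [0.2, 0.5], "M2": [0.3, 0.4]}
tree = agglomerate(toy_profiles)
```

Each nested pair in the result corresponds to one join in the dendrogram; texts joined early are the most similar.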

20
Correspondence analysis
  • Example of superlatives in Dickens' and
    Smollett's works
  • (Tabata 2007, http://www.digitalhumanities.org/dh2007/abstracts/xhtml.xq?id=259)
  • Count frequency of 242 superlatives in 30 texts
  • CA allows classification of associations between
    variables in a 2D matrix (rows x columns)
  • D1 distinguishes Dickens from Smollett
  • D2

21
Principal components analysis
  • Like cluster analysis, but can work with a much
    larger range of variables
  • PCA is a statistical method for arranging large
    arrays of data into interpretable patterns
    • principal components are computed by
      calculating the correlations between all the
      variables, then grouping them into sets that show
      the most correspondence
    • each set is a component, or dimension
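
To make the computation concrete, here is a small pure-Python sketch that standardises the variables, builds their correlation matrix and extracts the first principal component by power iteration. Real analyses would normally use a statistics package; the power-iteration shortcut, the function names and the feature values (invented toy data) are assumptions for this example.

```python
import math

def standardize(rows):
    """Centre each column and scale it to unit sample standard deviation.
    Assumes no column is constant."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - mu) ** 2 for x in c) / (n - 1))
           for c, mu in zip(cols, means)]
    return [[(x - mu) / sd for x, mu, sd in zip(row, means, sds)]
            for row in rows]

def correlation_matrix(z):
    """Correlations between all pairs of standardised variables."""
    n, m = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(m)]
            for i in range(m)]

def first_component(matrix, iterations=200):
    """Dominant eigenvector and eigenvalue by power iteration."""
    m = len(matrix)
    # Uneven start vector, to avoid an unlucky orthogonal starting point
    v = [float(i + 1) for i in range(m)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
    return v, sum(a * b for a, b in zip(v, mv))

# Invented toy data: rows are texts (H1, H2, M1, M2), columns are features.
toy = [[3.1, 0.1, 1.0], [3.4, 0.0, 1.2], [0.2, 0.5, 0.9], [0.3, 0.4, 1.1]]
z = standardize(toy)
loadings, eigenvalue = first_component(correlation_matrix(z))
scores = [sum(a * b for a, b in zip(row, loadings)) for row in z]
```

The `scores` place each text along the first dimension. In this toy data the first two texts fall on one side and the last two on the other, which is the kind of separation a PCA plot is meant to reveal.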

22
Final word
  • Many of these techniques are also used to
    identify different genres rather than different
    authors
    • especially PCA, where the dimensions can be
      characterised
    • (In fact, the cluster analysis and PCA
      illustrations were taken from such a study!)
  • An interesting question: how well do they work on
    pastiches?
    • If interested, see H. Somers & F. Tweedie,
      Authorship attribution and pastiche, Computers
      and the Humanities 37 (2003), 407-429.