Computational Stylometry - PowerPoint PPT Presentation

Provided by: osirisSun

Title: Computational Stylometry


1
Computational Stylometry
  • 5 Approaches to Authorship Attribution

2
1. Historical Research
  • Knowledge intensive, beyond current abilities of
    the computer. E.g. documents mentioning the
    Spanish Armada must have been written after 1588,
    and thus could not have been written by anyone
    who died before 1588.
  • See the web page by Tom Reedy and David Kathman,
    "How We Know That Shakespeare Wrote Shakespeare".
  • The name William Shakespeare appears on the
    plays and poems.
  • WS was an actor in the company that performed the
    works of WS.
  • In 1615 Edmund Howes published a list of "our
    moderne, and present excellent poets", which
    included WS.
  • Some time before 1623, a monument was erected to
    WS in Stratford, depicting him as a writer.

3
2. Cipher-based author identification
  • See shakespeareauthorship.com/bacpenl.html for
    details of Penn Leary's Baconian ciphers.
  • A→E B→F C→G D→H E→I F→K G→L H→M I→N K→O L→P M→Q
    N→R O→S P→T Q→V R→Y S→A T→B V→C Y→D
  • (J = I, U = V, W = VV, 0 to 9 = A to K)
  • e.g. subjects merits doth → acfnigba qiynba
    hsbm
  • that same ignorance → bmeb aeqi nlrsyergi
  • boatswain → fsebacenr (first word in The
    Tempest)
  • This technique supposedly proves Bacon wrote all
    of Shakespeare's works. Critics point out that it
    also proves he wrote Caesar's Gallic Wars and
    Moby Dick.
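
The substitution table on this slide amounts to a shift of four places in a 21-letter alphabet. A minimal Python sketch, assuming that reading of the table: the names ALPHABET, SHIFT, MERGES and encode are illustrative only, and the digit rule (0 to 9 = A to K) is left out because the slide does not say how it combines with the shift.

```python
# 21-letter alphabet implied by the slide's table (no J, U, W, X or Z).
ALPHABET = "ABCDEFGHIKLMNOPQRSTVY"
SHIFT = 4  # A -> E, B -> F, ..., Y -> D

# Letter merges listed on the slide: J = I, U = V, W = VV.
MERGES = {"J": "I", "U": "V", "W": "VV"}


def encode(text):
    """Apply the table; characters outside it pass through unchanged."""
    out = []
    for ch in text.upper():
        for letter in MERGES.get(ch, ch):
            idx = ALPHABET.find(letter)
            if idx == -1:
                # Spaces, punctuation, digits, X and Z are not in the table.
                out.append(letter)
            else:
                out.append(ALPHABET[(idx + SHIFT) % len(ALPHABET)])
    return "".join(out).lower()


print(encode("subjects merits doth"))
print(encode("that same ignorance"))
```

The two calls reproduce the first two examples on the slide.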

4
3. Physical Evidence
  • e.g. carbon dating, analysis of handwriting.
  • Evidence against Konrad Kujau's Hitler Diaries:
  • the handwriting did not resemble Hitler's
  • the paper, ink and glue of the diaries were
    manufactured post-WW2.

5
4. Forensic Linguistics
  • Forensic: need not be quantitative. See
    "Forensic Linguistics", a review by L J Hurst of
    Malcolm Coulthard's work.
  • Confessions of the Birmingham Six describe one
    suspect carrying "one white plastic carrier bag",
    and another carrying "two white plastic carrier
    bags". These words were those of the police (who
    were sure that the explosives were in white
    carrier bags): no one in ordinary speech would
    use and repeat those abnormally long phrases. No
    one spoke that way at the trial, nor could any
    examples be found in the University of Birmingham
    COBUILD database.
  • The Derek Bentley confession repeatedly contains
    the word "then", as in "Chris Craig and I then",
    "the policeman then". But outside police reports
    no one puts "then" after the subject of a
    sentence; it is the police whose reports are
    obsessed with sequence.

6
5. Computational Stylometry
  • Computational Stylometry
  • An attempt to capture the essence of the style of
    a particular author by reference to a variety of
    quantitative criteria, usually lexical, called
    discriminators.
  • Stylometry is the area in which computationally
    tractable approaches to author identification
    have been developed. However, general author
    identification software packages have not been
    produced. Many studies have analysed texts by
    hand or, using simple frequency counting
    programs, have extracted the relevant data and
    processed it using statistical software packages
    or manual statistical analysis.

7
Problems with the attribution of authorship style
  • Homogeneity of authorship. A clear underlying
    assumption in studies of authorship is that
    although authors may consciously influence their
    own style, there will always be a subconscious
    exercise of a consistent style throughout their
    work. However, there is no clear and indisputable
    evidence of such features. Some authors, such as
    Oliver Goldsmith, are very versatile, with
    variation between different speakers in their
    novels. Also, traits of an Anglo-Irish school of
    writers may mask any personal tendencies.
  • Heterogeneity of authorship over time. Chronology
    of texts, early vs. late styles (Yardi)
  • Authorship and genre (genre differences more
    pronounced than author differences).
  • Variation within a single author (e.g. Jane
    Austen)

8
Qualitative Studies
  • Qualitative studies have focused on the hapax
    legomena (words which appear only once in the
    entire text).
  • These reflect the background and experience of
    the author: obscure, out-of-date or technical
    terms. They convey delicate shades of meaning.
  • Hapax legomena form the largest group of words
    in the vocabulary of a text, e.g. in a 100,000
    word sample of the BNC, 9500 words come up just
    once, 1500 twice, 1000 three times, 600 four
    times, 300 five times and 200 six times. The
    remaining 2500 words come up 7 to 7447 times.
  • The problem with these as discriminators is that
    statistical tests, e.g. chi-squared, require 5
    occurrences (i.e. their individual low rate of
    occurrence makes them difficult to handle
    statistically).

9
Statistical Techniques (quantitative)
  • Require the study of frequently occurring
    features.
  • Traits must be possible to express numerically
    (e.g. Semitisms in the New Testament are
    difficult to classify).
  • Frequently used features are:
  • Word and sentence length (frequency spectra:
    modal word length 2 for J. S. Mill, 3 for Oliver
    Twist; Mendenhall).
  • Positions of words within sentences, e.g. gar as
    second or third word, correlation -0.47 with time
    (Morton's chronology of Isocrates).
  • Vocabulary studies: the choice and frequency of
    words, measures of vocabulary richness, e.g.
    Yule's K measure.
  • Syntactic analysis, e.g. phrase structure rewrite
    rule frequencies used for measures of vocabulary
    richness (crime fiction: Cave of Shadows).
  • The ideal situation for authorship studies is
    when there are
  • large amounts of undisputed text,
  • few contenders for the authorship of the disputed
    text(s).
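
The word-length frequency spectrum mentioned above is easy to compute. A minimal Python sketch, assuming a word is any run of ASCII letters: the function names and the sample sentence are illustrative only, not taken from Mendenhall's study.

```python
import re
from collections import Counter


def word_length_spectrum(text):
    """Map each word length to the number of words of that length."""
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(len(w) for w in words)


def modal_word_length(text):
    """Most frequent word length; ties go to the shorter length."""
    spectrum = word_length_spectrum(text)
    return max(spectrum, key=lambda n: (spectrum[n], -n))


sample = ("It is a truth universally acknowledged that a single man in "
          "possession of a good fortune must be in want of a wife")
print(sorted(word_length_spectrum(sample).items()))
print(modal_word_length(sample))
```

Sentence length can be treated the same way once the text has been split into sentences.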

10
Experimental Methodology
  • The experimental methodology is to build corpora
  • A - works definitely by author A
  • B - works definitely by author B
  • C - works of disputed authorship, but probably
    written by A or B.
  • Then select discriminants and associated
    measures.
  • When the technique has been shown to discriminate
    effectively between A and B, then try it on C.

11
Measures of vocabulary richness (1)
  • tokens: N = length of text in words
  • types: V = number of different words in the text
  • hapax legomena: V_1 = number of words occurring
    just once in the text
  • hapax dislegomena: V_2 = number of words occurring
    exactly twice in the text
  • V_i = number of words occurring exactly i times
  • The type / token ratio depends on the length of
    the text (less for longer texts), but is a useful
    measure when the comparison texts have been
    standardised for length.
  • Honoré's measure depends on the hapax legomena:
  • $R = 100 \log N / (1 - V_1 / V)$
  • Sichel's measure depends on the dislegomena,
    which reaches equilibrium with respect to N:
  • $S = V_2 / V$
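
The three measures on this slide can be computed from a list of tokens. A minimal Python sketch, assuming the natural logarithm in Honoré's R (the slide does not state the base) and a simple regular-expression tokeniser; both are assumptions and all names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def richness(tokens):
    """Return the type/token ratio, Honoré's R and Sichel's S."""
    freq = Counter(tokens)             # word -> number of occurrences
    spectrum = Counter(freq.values())  # i -> V_i
    n, v = len(tokens), len(freq)
    v1, v2 = spectrum[1], spectrum[2]
    # R is undefined when every type is a hapax (V_1 = V).
    honore = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")
    return {"ttr": v / n, "honore_r": honore, "sichel_s": v2 / v}


print(richness(tokenize("a a a b b c d")))
```

In the toy input N = 7, V = 4, V_1 = 2 and V_2 = 1, so the type / token ratio is 4/7 and S is 0.25.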

12
Measures of vocabulary richness (2)
  • Yule's characteristic K depends on words of all
    frequencies.
  • $K = 10^4 (M - N) / N^2$
  • where $M = \sum_i i^2 V(i, N)$
  • and V(i, N) = number of types which occur i times
    in a sample of N tokens.
  • Some results by Yule:
  • De Imitatione Christi: K = 84.2
  • Works definitely by Kempis: K = 59.7
  • Works definitely by Gerson: K = 35.9
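
Yule's K follows directly from the frequency spectrum V(i, N). A minimal Python sketch of the formula on this slide; the function name and the toy token list are illustrative only.

```python
from collections import Counter


def yules_k(tokens):
    """K = 10,000 (M - N) / N^2, where M is the sum of i^2 * V(i, N)."""
    n = len(tokens)
    freq = Counter(tokens)             # word -> number of occurrences
    spectrum = Counter(freq.values())  # i -> V(i, N)
    m = sum(i * i * v_i for i, v_i in spectrum.items())
    return 10_000 * (m - n) / (n * n)


# Toy sample: V(1) = 2, V(2) = 1, V(3) = 1, so M = 2 + 4 + 9 = 15 and N = 7.
print(yules_k("a a a b b c d".split()))
```

A text in which every word occurs once gives M = N and therefore K = 0; repetition raises K.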

13
A stylometric analysis of Mormon Scripture and
Related Texts
  • by D. I. Holmes
  • His results for Yule's K scores were as follows:
  • Joseph Smith, personal writings:
    Joseph Smith 1 (J1) 57.7
    Joseph Smith 2 (J2) 82.1
    Joseph Smith 3 (J3) 78.6
  • Nephi 1 (N1) 145.2
    Nephi 2 (N2) 155.2
    Nephi 3 (N3) 150.5
    Jacob (JB) 134.3
    Lehi (L1) 109.4
    Moroni 1 (R1) 131.5
    Moroni 2 (R2) 115.7
    Mormon 1 (M1) 183.8
    Mormon 2 (M2) 132.7
    Mormon 3 (M3) 119.2
    Mormon 4 (M4) 168.9
    Mormon 5 (M5) 125.5
    Alma 1 (A1) 149.0
    Alma 2 (A2) 150.6
    Doctrine 1 (D1) 126.9
    Doctrine 2 (D2) 91.6
    Doctrine 3 (D3) 98.9
    Book of Abraham (AB) 146.4
  • Isaiah, King James Bible:
    Isaiah 1 (I1) 81.3
    Isaiah 2 (I2) 114.2
    Isaiah 3 (I3) 90.9

14
Analysis of Mormon Scripture (2)
  • These K scores were used to produce a similarity
    matrix using the formula
  • $1 - ((X_r - X_s) / \mathrm{range})^2$
  • where X_r and X_s are the K scores of samples r
    and s. The similarity matrix was used to produce
    a dendrogram. The dendrogram showed
  • Joseph Smith, Isaiah and the other prophets in
    separate clusters
  • but the variation within a prophet's writings was
    greater than the variation between prophets.
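
The similarity formula can be applied to any set of K scores. A minimal Python sketch using six of the scores quoted on the previous slide, assuming that "range" means the largest score minus the smallest among the samples being compared; that reading, and the function name, are assumptions.

```python
# Six of the K scores quoted on the previous slide.
K_SCORES = {"J1": 57.7, "J2": 82.1, "J3": 78.6,
            "I1": 81.3, "I2": 114.2, "I3": 90.9}


def similarity_matrix(scores):
    """Nested dict with sim[r][s] = 1 - ((X_r - X_s) / range)^2."""
    spread = max(scores.values()) - min(scores.values())
    if spread == 0:
        # All scores equal: every pair is maximally similar.
        return {r: {s: 1.0 for s in scores} for r in scores}
    return {r: {s: 1 - ((scores[r] - scores[s]) / spread) ** 2
                for s in scores}
            for r in scores}


sim = similarity_matrix(K_SCORES)
for name, row in sim.items():
    print(name, " ".join(f"{value:.2f}" for value in row.values()))
```

Each sample has similarity 1 with itself, and the two most distant samples have similarity 0. A clustering routine would then build the dendrogram from this matrix.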

15
(No Transcript)
16
A Widow and Her Soldier (1)
  • Stylometry and the American Civil War (Holmes,
    Gordon and Wilson).
  • Fifty years after the American Civil War, General
    George Pickett's widow, LaSalle Corbell Pickett,
    published letters purportedly written by her
    husband, many of them from the field of battle
    itself. Historians are divided as to their
    authenticity, and this paper describes a
    stylometric investigation into the Pickett
    letters.
  • The Burrows technique essentially picks the N
    most common words in the corpus under
    investigation and computes the occurrence rate of
    these N words in each text, thus converting each
    text into an N-dimensional array of numbers.
    Multivariate statistical techniques are then
    applied to the data to look for patterns. The two
    techniques most commonly employed are principal
    components analysis (PCA) and cluster analysis.
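
The first step of the Burrows technique, turning each text into a vector of occurrence rates, can be sketched in a few lines of Python. The rate per 1,000 words, the tokeniser and all names are assumptions made for illustration; the two toy texts are invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def burrows_vectors(texts, n):
    """Return the n most common corpus words and, for each text,
    their occurrence rates per 1,000 words."""
    tokens = {name: tokenize(body) for name, body in texts.items()}
    corpus = Counter()
    for toks in tokens.values():
        corpus.update(toks)
    top = [word for word, _ in corpus.most_common(n)]
    vectors = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        vectors[name] = [1000 * counts[w] / len(toks) for w in top]
    return top, vectors


toy = {"text1": "the cat and the dog and the bird",
       "text2": "the sea and the sky"}
words, vectors = burrows_vectors(toy, 2)
print(words, vectors)
```

Each text becomes one row of an N-column table, which is the input to PCA or cluster analysis.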

17
A Widow and Her Soldier (2)
  • PCA (notes from Holmes and Forsyth): Principal
    Components Analysis (PCA) aims to transform the
    observed variables into a new set of variables
    which are uncorrelated and arranged in decreasing
    order of importance. These new variables (or
    components) are combinations of the original
    variables, and it is hoped that the first few
    components will account for most of the variation
    in the original data. If we can account for most
    of the variation using just two components, we
    can plot each text on a two-dimensional graph.
  • Typically the data are plotted in the space of
    just the first two components.
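
For small tables the first two components can be found without a statistics package. A minimal pure-Python sketch, assuming power iteration with deflation on the covariance matrix; this is one of several ways to compute PCA, not necessarily the one used in the studies cited, and the toy data are invented.

```python
import math


def pca(rows, k=2, iterations=500):
    """Return (scores, variances) for the first k principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    components, variances = [], []
    for _component in range(k):
        # Deterministic, lopsided start vector, normalised to unit length.
        vec = [1.0 / (j + 1) for j in range(d)]
        length = math.sqrt(sum(x * x for x in vec))
        vec = [x / length for x in vec]
        for _step in range(iterations):
            nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # no variance left to explain
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient: the variance along this component.
        value = sum(vec[i] * cov[i][j] * vec[j]
                    for i in range(d) for j in range(d))
        components.append(vec)
        variances.append(value)
        # Deflation: remove the variance already accounted for.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]

    scores = [[sum(r[j] * c[j] for j in range(d)) for c in components]
              for r in centered]
    return scores, variances


toy_rows = [[2, 0], [0, 1], [-2, 0], [0, -1]]
scores, variances = pca(toy_rows)
print(scores, variances)
```

Each row of scores gives the coordinates of one text on the two-dimensional plot.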

18
A Widow and Her Soldier (3)
  • Samples were taken from
  • LaSalle Pickett's autobiography
  • LaSalle's personal letters
  • George's personal pre-war and post-war letters
  • George's war reports
  • Inman papers, genuine handwritten letters by
    George
  • Walter Harrison's book Pickett's Men
    (possible source of material in the letters).
  • The investigation strongly suggests that LaSalle
    Pickett composed the published letters herself.

19
(No Transcript)
20
The Federalist Papers
  • Published under the pseudonym Publius in
    1787-1788 to persuade the people of New York to
    accept the new American Constitution.
  • Jay wrote 5 essays (undisputed)
  • Hamilton wrote 43 essays (undisputed)
  • Madison wrote 14 essays (undisputed)
  • 12 essays were disputed (Hamilton or Madison?)
  • The Bayesian Approach of Mosteller and Wallace.
  • The styles of Hamilton and Madison varied in the
    frequency of use of certain words, e.g. "enough"
    was found in 14 papers by Hamilton but none by
    Madison; "whilst" was found in no papers by
    Hamilton but in 13 by Madison. 28 such
    discriminating terms were found.
  • All 12 disputed papers were thought to be by
    Madison.
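
The idea behind a Bayesian treatment of marker words can be sketched as a sum of log likelihood ratios. A minimal Python sketch, assuming each word's count follows a Poisson distribution and that the words are independent; the rates in RATES are invented for illustration and are not Mosteller and Wallace's estimates, and their published analysis is considerably more elaborate.

```python
import math

# HYPOTHETICAL rates per 1,000 words, invented for illustration only.
RATES = {
    "upon":   {"Hamilton": 3.0,  "Madison": 0.2},
    "whilst": {"Hamilton": 0.05, "Madison": 0.5},
    "enough": {"Hamilton": 0.3,  "Madison": 0.05},
}


def poisson_log_pmf(k, lam):
    """Log probability of observing k events at Poisson mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def log_odds_hamilton(counts, length_in_words, prior_odds=1.0):
    """Log odds that Hamilton rather than Madison wrote the text."""
    scale = length_in_words / 1000
    total = math.log(prior_odds)
    for word, rate in RATES.items():
        k = counts.get(word, 0)
        total += (poisson_log_pmf(k, rate["Hamilton"] * scale)
                  - poisson_log_pmf(k, rate["Madison"] * scale))
    return total


print(log_odds_hamilton({"whilst": 1}, 2000))  # negative: favours Madison
print(log_odds_hamilton({"upon": 6}, 2000))    # positive: favours Hamilton
```

Positive values favour Hamilton and negative values favour Madison; the prior odds express any belief held before the text is examined.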

21
Neural Networks: Tweedie, Singh and Holmes
  • architecture: multilayer perceptron
  • output layer has two nodes, one corresponding to
    Hamilton, one corresponding to Madison
  • one hidden layer
  • input layer has eleven nodes, each corresponding
    to one of the following words: any, from, an,
    may, upon, can, every, his, do, there, on.
  • how were texts converted into inputs? (minimum
    occurrence in any document → 0, maximum → 1).
  • training on undisputed papers with feedback
    (supervised learning).
  • Weights updated by conjugate gradient method
    (zero weights show features with no
    discriminating power).
  • Again all 12 papers were judged in favour of
    Madison.
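
The input conversion described above (minimum → 0, maximum → 1) is ordinary min-max scaling applied to each word separately. A minimal Python sketch with invented rates; the function name and the handling of words whose rate never varies are assumptions.

```python
def min_max_scale(rows):
    """Scale each column so its minimum maps to 0 and its maximum to 1."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[(x - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0
             for x, lo, hi in zip(row, lows, highs)]
            for row in rows]


# Each row is one document; each column is the rate of one function word.
toy_rates = [[1.0, 10.0], [3.0, 10.0], [2.0, 20.0]]
print(min_max_scale(toy_rates))
```

Every input node then receives a value between 0 and 1, whatever the raw frequency of its word.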

22
Neural Network: Merriam and Matthews
  • 5 input nodes, one for each of
  • no / T10
  • (of × and) / of
  • so / T10
  • (the × and) / the
  • with / T10
  • T10 = Taylor's ten function words: but, by, for,
    no, not, so, that, the, to, with
  • and 2 output nodes (for Shakespeare vs. Marlowe).
  • Results found an anonymous play, Edward III, to
    be more likely to have been written by
    Shakespeare.

23
(No Transcript)
24
Neural Network: Kjell
  • See also Bradley Kjell
  • one input node for each of the 26 × 26 bigrams
  • 2 output nodes (for Madison vs. Hamilton).
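
A 26 × 26 bigram input vector can be built as follows. A minimal Python sketch, assuming letter bigrams counted within words only and expressed as relative frequencies; the slide does not specify either choice, and the names are illustrative.

```python
import re
import string
from collections import Counter

LETTERS = string.ascii_lowercase
BIGRAMS = [a + b for a in LETTERS for b in LETTERS]  # 676 letter pairs


def bigram_vector(text):
    """Relative frequency of each letter bigram, in BIGRAMS order."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    total = sum(counts.values())
    return [counts[b] / total if total else 0.0 for b in BIGRAMS]


vector = bigram_vector("the theme")
print(len(vector), vector[BIGRAMS.index("th")])
```

The resulting 676 numbers would feed the 676 input nodes, one per bigram.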

25
Latinate Words in Jane Austen
  • "The Density of Latinate Words in the Speeches of
    Jane Austen's Characters", by Mary DeForest and
    Eric Johnson.
  • English has two main sources for words: Germanic
    and Latin. Distinct from each other, they have
    polarised our language into high diction and low
    (diglossia).
  • Latinate words are indicators of status and
    education.
  • The proportion of Latinate words in a text can be
    used as a stylometric measure, perhaps similar to
    average word length or vocabulary richness.
  • DeForest and Johnson classified all 13,809 unique
    words which appear in Jane Austen's novels. Of
    these, 6151 were found to derive from Latin or
    Greek. They also excluded 945 stop words.
  • Their theory was that Latinate words, not merely
    long words, delineate character.
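
The proportion of Latinate words in a speech can be computed once a lexicon is available. A minimal Python sketch, assuming the density is the share of Latinate words among the non-stop words; LATINATE and STOP_WORDS are tiny hypothetical stand-ins for the lists built by DeForest and Johnson, and the exact definition of density used in their paper may differ.

```python
import re

# HYPOTHETICAL mini-lexicons, for illustration only.
LATINATE = {"felicity", "acquaintance", "amiable", "disposition"}
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in", "is"}


def latinate_density(text):
    """Share of Latinate words among the non-stop words of a text."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    if not words:
        return 0.0
    return sum(w in LATINATE for w in words) / len(words)


print(latinate_density("the felicity of a kind acquaintance"))
```

Computing this figure for each character's speeches gives the numbers compared on the next slide.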

26
Latinate Words in Jane Austen
  • The characters in the novels were found to vary
    in the proportion of Latinate words they used,
    with high proportions being indicative of
  • high social class
  • formality
  • insincerity and euphemism (Orwell, e.g.
    "terminate with extreme prejudice")
  • self-control (as opposed to emotion)
  • men's speech (education was the preserve of men
    in the 18th century)
  • stateliness vs. squalor (e.g. Mansfield Park).