Title: Computational Stylometry
1Computational Stylometry
- 5 Approaches to Authorship Attribution
21. Historical Research
- Knowledge-intensive, beyond the current abilities of
the computer. E.g. documents mentioning the
Spanish Armada must have been written after 1588,
and thus could not have been written by anyone
who died before 1588.
- See the web page by Tom Reedy and David Kathman, "How
we know that Shakespeare wrote Shakespeare":
- the name William Shakespeare appears on the
plays and poems
- WS was an actor in the company that performed the
works of WS
- In 1615 Edmund Howes published a list of "our
moderne, and present excellent poets", which
included WS.
- Some time before 1623, a monument was erected to
WS in Stratford, depicting him as a writer.
32. Cipher-based author identification
- Cipher-based author identification: see
shakespeareauthorship.com/bacpenl.html for
details of Penn Leary's Baconian ciphers.
- Substitution table (each plaintext letter is followed by
its cipher letter): AE BF CG DH EI FK GL HM IN KO LP MQ
NR OS PT QV RY SA TB VC YD
- (J = I, U = V, W = VV, 0 to 9 = A to K)
- e.g. subjects merits doth → acfnigba qiynba
hsbm
- that same ignorance → bmeb aeqi nlrsyergi
- boatswain → fsebacenr (first word in The
Tempest)
- This technique supposedly proves Bacon wrote all
of Shakespeare's works. Critics point out that it also
proves he wrote Caesar's Gallic Wars and Moby
Dick.
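The substitution table above amounts to a shift of four places in a 21-letter alphabet (no J, U, W, X or Z). The following is a minimal sketch of that reading, not Leary's own procedure: the names `ALPHABET`, `SHIFT`, `NORMALISE` and `encipher` are illustrative, digits are not handled, and W is expanded to VV as the slide states. Under that assumption "boatswain" comes out as "fsebaccenr", with a doubled c, so the example printed on the slide may have treated W differently.

```python
# 21-letter alphabet implied by the substitution table (no J, U, W, X, Z).
ALPHABET = "ABCDEFGHIKLMNOPQRSTVY"
SHIFT = 4  # A -> E, B -> F, ..., Y -> D

# Spelling conventions from the slide: J = I, U = V, W = VV (assumed reading).
NORMALISE = {"J": "I", "U": "V", "W": "VV"}


def encipher(text: str) -> str:
    """Apply the shift-by-four substitution; other characters pass through."""
    out = []
    for ch in text.upper():
        for letter in NORMALISE.get(ch, ch):
            idx = ALPHABET.find(letter)
            if idx == -1:
                out.append(letter)  # spaces, punctuation, digits, X and Z
            else:
                out.append(ALPHABET[(idx + SHIFT) % len(ALPHABET)])
    return "".join(out).lower()


print(encipher("subjects merits doth"))   # acfnigba qiynba hsbm
print(encipher("that same ignorance"))    # bmeb aeqi nlrsyergi
```

Because any text can be pushed through the same table, the critics' point follows directly: the method finds "messages" wherever it is applied.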
43. Physical Evidence
- e.g. carbon dating, analysis of handwriting.
- Evidence against Konrad Kujau's Hitler Diaries:
- the handwriting did not resemble Hitler's
- the paper, ink and glue of the diaries were
manufactured post-WW2.
54. Forensic Linguistics
- Forensic linguistics need not be quantitative. See
"Forensic Linguistics", a review by L J Hurst of
Malcolm Coulthard's work.
- Confessions of the Birmingham Six describe one
suspect carrying "one white plastic carrier bag",
and another carrying "two white plastic carrier
bags". These words were those of the police (who
were sure that the explosives were in white
carrier bags): no one in ordinary speech would
use and repeat those abnormally long phrases. No
one spoke that way at the trial, nor could any
examples be found in the University of Birmingham
COBUILD database.
- The Derek Bentley confession repeatedly contains
the word "then", as in "Chris Craig and I then",
"the policeman then". But outside police reports
no one puts "then" after the subject of a
sentence; it is the police whose reports are
obsessed with sequence.
65. Computational Stylometry
- Computational Stylometry
- An attempt to capture the essence of the style of
a particular author by reference to a variety of
quantitative criteria, usually lexical, called
discriminators.
- Stylometry is the area in which
computationally tractable approaches to author
identification have been developed. However,
general author identification software packages
have not been produced. Many studies have
analysed texts by hand or, using
simple frequency counting programs, have
extracted the relevant data and processed it
using statistical software packages or manual
statistical analysis.
7Problems with the attribution of authorship style
- Homogeneity of authorship. A clear underlying
assumption in studies of authorship is that
although authors may consciously influence their
own style, there will always be a subconscious
exercise of a consistent style throughout their
work. However, there is no clear and indisputable
evidence of such features. Some authors, such
as Oliver Goldsmith, are very versatile, with
variation between different speakers in their
novels. Also, traits of an Anglo-Irish school of
writers may mask any personal tendencies.
- Heterogeneity of authorship over time. Chronology
of texts, early vs. late styles (Yardi).
- Authorship and genre (genre differences are more
pronounced than author differences).
- Variation within a single author (e.g. Jane
Austen).
8Qualitative Studies
- Qualitative studies have focused on the hapax
legomena (words which appear only once in the
entire text).
- These reflect the background and experience of
the author: obscure, out-of-date or technical
terms, and delicate shades of meaning.
- They form the largest group of words in the vocabulary of a
text, e.g. in a 100,000 word sample of the BNC,
9500 words come up just once, 1500 twice,
1000 three times, 600 four times, 300 five
times and 200 six times. The remaining 2500 words
occur between 7 and 7447 times.
- The problem with these as discriminators is that
statistical tests such as chi-squared require at least 5
occurrences, i.e. their individual low rate of
occurrence makes them difficult to handle
statistically.
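The counts quoted above (how many words occur once, twice, three times and so on) form a frequency spectrum. A minimal sketch of computing one follows; the tokenizer is a deliberately crude assumption (lowercase runs of letters and apostrophes), and the sample sentence is invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_spectrum(tokens):
    """Map i -> number of distinct words that occur exactly i times."""
    type_counts = Counter(tokens)
    return Counter(type_counts.values())


tokens = tokenize("the cat sat on the mat and the dog sat too")
spectrum = frequency_spectrum(tokens)
hapaxes = sorted(w for w, c in Counter(tokens).items() if c == 1)

print(dict(spectrum))  # {3: 1, 1: 6, 2: 1}
print(hapaxes)         # ['and', 'cat', 'dog', 'mat', 'on', 'too']
```

Even in this toy sentence the hapax legomena are the largest class, which is the pattern the BNC figures show at scale.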
9Statistical Techniques (quantitative)
- Statistical techniques require the study of frequently occurring
features.
- Traits must be expressible numerically
(e.g. Semitisms in the New Testament are
difficult to classify).
- Frequently used features are:
- Word and sentence length (frequency spectra, mode
2 for J. S. Mill, 3 for Oliver Twist; Mendenhall).
- Positions of words within sentences, e.g. gar as
second or third word, correlation -0.47 with time
(Morton's chronology of Isocrates).
- Vocabulary studies: the choice and frequency of
words, and measures of vocabulary richness, e.g.
Yule's K measure.
- Syntactic analysis, e.g. phrase structure rewrite
rule frequencies used for measures of vocabulary
richness (crime fiction - cave of shadows).
- The ideal situation for authorship studies is
when there are:
- large amounts of undisputed text,
- few contenders for the authorship of the disputed
text(s).
10Experimental Methodology
- The experimental methodology is to build corpora
- A - works definitely by author A
- B - works definitely by author B
- C - works of disputed authorship, but probably
written by A or B.
- Then select discriminants and associated
measures.
- When the technique has been shown to discriminate
effectively between A and B, then try it on C.
11Measures of vocabulary richness (1)
- tokens: $N$ = length of the text in words
- types: $V$ = number of different words in the text
- hapax legomena: $V_1$ = number of words occurring
just once in the text
- hapax dislegomena: $V_2$ = number of words occurring
exactly twice in the text
- $V_i$ = number of words occurring exactly $i$ times
- The type / token ratio depends on the length of
the text (less for longer texts), but is a useful
measure when the comparison texts have been
standardised for length.
- Honoré's measure depends on the hapax legomena:
- $R = 100 \log N / (1 - V_1 / V)$
- Sichel's measure depends on the dislegomena, and
reaches equilibrium with respect to $N$:
- $S = V_2 / V$
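The three measures on this slide can be computed directly from word counts. The sketch below is illustrative only: the function name `richness` and its dictionary keys are made up here, the natural logarithm is assumed for Honoré's R (the slide does not state the base), and R is reported as infinite when every word is a hapax, since the formula then divides by zero.

```python
import math
from collections import Counter


def richness(tokens):
    """Type/token ratio, Honoré's R and Sichel's S for a list of word tokens."""
    n = len(tokens)                                   # N: tokens
    counts = Counter(tokens)
    v = len(counts)                                   # V: types
    v1 = sum(1 for c in counts.values() if c == 1)    # hapax legomena
    v2 = sum(1 for c in counts.values() if c == 2)    # hapax dislegomena
    # R is undefined when V1 == V; natural log is an assumption.
    honore = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")
    return {"ttr": v / n, "honore_r": honore, "sichel_s": v2 / v}


sample = "a b a c b a d e".split()   # N = 8, V = 5, V1 = 3, V2 = 1
measures = richness(sample)
print(measures["ttr"], measures["sichel_s"])   # 0.625 0.2
print(round(measures["honore_r"], 2))          # 519.86
```

Texts being compared should be cut to the same length first, for the reason given above for the type / token ratio.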
12Measures of vocabulary richness (2)
- Yule's characteristic K depends on words of all
frequencies.
- $K = 10^4 (M - N) / N^2$
- where $M = \sum_i i^2 V(i, N)$
- and $V(i, N)$ = number of types which occur $i$ times
in a sample of $N$ tokens.
- Some results by Yule:
- De Imitatione Christi: K = 84.2
- Works definitely by Kempis: K = 59.7
- Works definitely by Gerson: K = 35.9
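As a worked illustration of the formula for K, the sketch below builds the spectrum V(i, N) from a token list and applies the definition term by term. The function name `yules_k` and the toy sample are assumptions made for the example; real studies use samples of thousands of words.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's characteristic K = 10^4 * (M - N) / N^2, with M = sum_i i^2 * V(i, N)."""
    n = len(tokens)
    if n == 0:
        raise ValueError("K is undefined for an empty text")
    spectrum = Counter(Counter(tokens).values())      # i -> V(i, N)
    m = sum(i * i * v_i for i, v_i in spectrum.items())
    return 10_000 * (m - n) / (n * n)


# Counts: a = 3, b = 2, c = d = e = 1, so M = 9 + 4 + 3 = 16 and N = 8.
print(yules_k("a b a c b a d e".split()))   # 1250.0
print(yules_k("a b c d".split()))           # 0.0 (no word is repeated)
```

Higher K means more repetition, so a lower K indicates a richer vocabulary.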
13A stylometric analysis of Mormon Scripture and
Related Texts
- by D. I. Holmes
- His results for Yule's K scores were as follows:
- Joseph Smith, personal writings:
Joseph Smith 1 (J1) 57.7
Joseph Smith 2 (J2) 82.1
Joseph Smith 3 (J3) 78.6
- Mormon scripture:
Nephi 1 (N1) 145.2
Nephi 2 (N2) 155.2
Nephi 3 (N3) 150.5
Jacob (JB) 134.3
Lehi (L1) 109.4
Moroni 1 (R1) 131.5
Moroni 2 (R2) 115.7
Mormon 1 (M1) 183.8
Mormon 2 (M2) 132.7
Mormon 3 (M3) 119.2
Mormon 4 (M4) 168.9
Mormon 5 (M5) 125.5
Alma 1 (A1) 149.0
Alma 2 (A2) 150.6
Doctrine 1 (D1) 126.9
Doctrine 2 (D2) 91.6
Doctrine 3 (D3) 98.9
Book of Abraham (AB) 146.4
- Isaiah, King James Bible:
Isaiah 1 (I1) 81.3
Isaiah 2 (I2) 114.2
Isaiah 3 (I3) 90.9
14Analysis of Mormon Scripture (2)
- These K scores were used to produce a similarity
matrix using the formula
- $1 - ((X_r - X_s) / \text{range})^2$
- where $X_r$ and $X_s$ are the K scores of samples r and s,
- and the similarity matrix was used to produce a
dendrogram. The dendrogram showed
- Joseph Smith, Isaiah and other prophets in
separate clusters
- but variation within a prophet's writings greater
than the variation between prophets.
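The similarity formula can be applied to a handful of the K scores listed earlier. This is a sketch under one stated assumption: the range here is taken over only the scores passed in, whereas the study would have used the range over all of its samples. The function name `similarity_matrix` is illustrative.

```python
def similarity_matrix(scores):
    """Pairwise similarity 1 - ((X_r - X_s) / range)^2 for a dict of scores."""
    labels = list(scores)
    values = [scores[label] for label in labels]
    value_range = max(values) - min(values)
    if value_range == 0:
        raise ValueError("all scores are equal, so the range is zero")
    matrix = [[1 - ((xr - xs) / value_range) ** 2 for xs in values]
              for xr in values]
    return labels, matrix


# A few K scores from the previous slide (range computed on this subset only).
k_scores = {"J1": 57.7, "J2": 82.1, "J3": 78.6, "I1": 81.3, "N1": 145.2}
labels, sim = similarity_matrix(k_scores)

for label, row in zip(labels, sim):
    print(label, [round(x, 3) for x in row])
```

Each sample has similarity 1 with itself, and the two samples at the extremes of the range have similarity 0; a clustering routine would then build the dendrogram from this matrix.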
16A Widow and Her Soldier (1)
- Stylometry and the American Civil War (Holmes,
Gordon and Wilson).
- Fifty years after the American Civil War, General
George Pickett's widow, LaSalle Corbell Pickett,
published letters purportedly written by her
husband, many of them from the field of battle
itself. Historians are divided as to their
authenticity, and this paper describes a
stylometric investigation into the Pickett
letters.
- The Burrows technique essentially picks the N
most common words in the corpus under
investigation and computes the occurrence rate of
these N words in each text, thus converting each
text into an N-dimensional array of numbers.
Multivariate statistical techniques are then
applied to the data to look for patterns. The two
techniques most commonly employed are principal
components analysis (PCA) and cluster analysis.
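The first step of the Burrows technique, turning each text into an N-dimensional vector of word rates, can be sketched as follows. The tokenizer, the use of relative frequency as the "occurrence rate", the alphabetical tie-breaking and the function name `burrows_vectors` are all assumptions for illustration; the toy texts are invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def burrows_vectors(texts, n):
    """Return the n most common corpus words and each text's rate vector."""
    tokenised = {name: tokenize(body) for name, body in texts.items()}
    corpus = Counter(tok for toks in tokenised.values() for tok in toks)
    # Most frequent first; ties broken alphabetically so results are repeatable.
    ranked = sorted(corpus.items(), key=lambda kv: (-kv[1], kv[0]))
    top_words = [word for word, _ in ranked[:n]]
    vectors = {}
    for name, toks in tokenised.items():
        counts = Counter(toks)
        total = len(toks)
        vectors[name] = [counts[w] / total if total else 0.0 for w in top_words]
    return top_words, vectors


texts = {
    "a": "the cat and the dog",
    "b": "the end of the day and the night",
}
top_words, vectors = burrows_vectors(texts, 2)
print(top_words)      # ['the', 'and']
print(vectors["a"])   # [0.4, 0.2]
print(vectors["b"])   # [0.375, 0.125]
```

The resulting vectors are what PCA or cluster analysis would then take as input.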
17A Widow and Her Soldier (2)
- PCA notes from Holmes and Forsyth: Principal
Components Analysis (PCA) aims to transform the
observed variables into a new set of variables
which are uncorrelated and arranged in decreasing
order of importance. These new variables (or
components) are combinations of the original
variables, and it is hoped that the first few
components will account for most of the variation
in the original data. If we can account for most
of the variation using just two components, we
can plot each text on a two-dimensional graph.
- Typically the data are plotted in the space of
just the first two components.
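To make the description of PCA concrete, here is a small dependency-free sketch that finds the first two components by power iteration on the covariance matrix, with deflation between them. It is a teaching sketch, not the procedure used in the study: production work would use a linear algebra library, the function name `pca_2d` and the iteration count are arbitrary, and at least two rows with some variation are assumed.

```python
import math


def pca_2d(rows, iterations=500):
    """Scores on the first two principal components, plus explained variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]

    def leading_eigen(matrix):
        # Power iteration from an arbitrary, non-uniform starting vector.
        vec = [float(i + 1) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _ in range(iterations):
            nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:          # no variance left in this matrix
                return 0.0, vec
            vec = [x / norm for x in nxt]
        value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d))
                    for i in range(d))
        return value, vec

    components, variances = [], []
    work = [row[:] for row in cov]
    for _ in range(2):
        value, vec = leading_eigen(work)
        components.append(vec)
        variances.append(value)
        # Deflation: remove the component just found before looking for the next.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(d)]
                for i in range(d)]

    scores = [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
              for c in centred]
    total = sum(cov[i][i] for i in range(d))
    explained = [v / total for v in variances]
    return scores, explained


# Toy data lying on a straight line, so one component carries all the variation.
rows = [[1, 2, 0], [2, 4, 0], [3, 6, 0], [4, 8, 0]]
scores, explained = pca_2d(rows)
print([round(e, 6) for e in explained])
```

Each row of `scores` gives the coordinates of one text in the plane of the first two components, which is what gets plotted.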
18A Widow and Her Soldier (2)
- Samples were taken from
- LaSalle Pickett's autobiography
- LaSalle's personal letters
- George's personal pre-war and post-war letters
- George's war reports
- Inman papers, genuine handwritten letters by
George
- Walter Harrison's book Pickett's Men
(possible source of material in the letters).
- The investigation strongly suggests that LaSalle
Pickett composed the published letters herself.
20The Federalist Papers
- Published under the pseudonym Publius in 1787 -
1788 to persuade the people of New York to accept
the new American constitution.
- Jay wrote 5 essays (undisputed)
- Hamilton wrote 43 essays (undisputed)
- Madison wrote 14 essays (undisputed)
- 12 essays were disputed (Hamilton or Madison?)
- The Bayesian approach of Mosteller and Wallace:
- The styles of Hamilton and Madison varied in the
frequency of use of certain words, e.g. "enough"
was found in 14 papers by Hamilton but in none by
Madison; "whilst" was found in no papers by
Hamilton but in 13 by Madison. 28 such
discriminating terms were found.
- All 12 disputed papers were thought to be by Madison.
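The kind of count quoted above (a word found in 14 papers by one author and none by the other) can be sketched as below. This covers only the marker-word counting step, not Mosteller and Wallace's Bayesian model. The function names are illustrative, and the two tiny "corpora" are invented placeholder sentences, not text from the Federalist Papers.

```python
import re


def papers_containing(word, papers):
    """Number of papers in which the word occurs at least once."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(1 for paper in papers if pattern.search(paper))


def marker_table(words, corpora):
    """word -> {author: number of that author's papers containing the word}."""
    return {word: {author: papers_containing(word, papers)
                   for author, papers in corpora.items()}
            for word in words}


# Invented placeholder sentences standing in for undisputed papers.
corpora = {
    "Hamilton": ["There is enough evidence.", "It is enough while we wait."],
    "Madison": ["Whilst the states agree.", "Nothing more whilst we wait."],
}
table = marker_table(["enough", "whilst"], corpora)
print(table["enough"])   # {'Hamilton': 2, 'Madison': 0}
print(table["whilst"])   # {'Hamilton': 0, 'Madison': 2}
```

Words whose counts differ sharply between the two authors are candidates for discriminating terms.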
21Neural Networks Tweedie, Singh and Holmes
- architecture: multilayer perceptron
- output layer has two nodes, one corresponding to
Hamilton, one corresponding to Madison
- one hidden layer
- input layer has eleven nodes, each corresponding to
one of the following words: any, from, an, may,
upon, can, every, his, do, there, on.
- how were texts converted into inputs? (minimum
occurrence in any document → 0, maximum → 1).
- training on undisputed papers with feedback
(supervised learning).
- Weights updated by the conjugate gradient method
(zero weights show features with no
discriminating power).
- Again all 12 papers were judged in favour of
Madison.
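One plausible reading of the input conversion described above is per-word min-max scaling: for each word, the lowest rate in any document maps to 0 and the highest to 1. The sketch below assumes that reading; the function name `min_max_scale` and the sample rates are invented, and a word with the same rate everywhere is mapped to 0.

```python
def min_max_scale(rows):
    """Scale each column to [0, 1]: column minimum -> 0, column maximum -> 1."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0   # constant column -> 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Rows are documents, columns are (hypothetical) rates for two function words.
rates = [[2.0, 10.0], [4.0, 10.0], [6.0, 30.0]]
scaled = min_max_scale(rates)
print(scaled)   # [[0.0, 0.0], [0.5, 0.0], [1.0, 1.0]]
```

Scaling keeps every input in the same numeric range, so no single word dominates the network simply because it is more frequent.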
22Neural Network Merriam and Matthews
- 5 input nodes, one for each of
- no / T10
- (of x and)/of
- so / T10
- (the x and) / the
- with / T10
- T10 = Taylor's ten function words: but, by, for,
no, not, so, that, the, to, with
- and 2 output nodes (for Shakespeare vs. Marlowe).
- Results found an anonymous play, Edward III,
more likely to have been written by Shakespeare.
24Neural Network Kjell
- See also Bradley Kjell
- one input node for each of the 26 × 26 letter bigrams
- 2 output nodes (for Madison vs. Hamilton).
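A 26 × 26 bigram input layer implies a 676-element vector per text. The sketch below shows one way such a vector might be built. It rests on assumptions not stated on the slide: bigrams are counted within words only, not across spaces, and counts are turned into relative frequencies. The names `BIGRAMS`, `INDEX` and `bigram_vector` are illustrative.

```python
import re
from itertools import product
from string import ascii_lowercase

# All 26 x 26 = 676 letter pairs, in a fixed order: aa, ab, ..., zz.
BIGRAMS = ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]
INDEX = {bigram: i for i, bigram in enumerate(BIGRAMS)}


def bigram_vector(text):
    """Relative frequency of each letter bigram, counted within words."""
    counts = [0] * len(BIGRAMS)
    for word in re.findall(r"[a-z]+", text.lower()):
        for first, second in zip(word, word[1:]):
            counts[INDEX[first + second]] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


vec = bigram_vector("the then")    # bigrams: th, he, th, he, en
print(len(vec))                    # 676
print(vec[INDEX["th"]], vec[INDEX["en"]])   # 0.4 0.2
```

Each element of the vector would feed one input node of the network.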
25Latinate Words in Jane Austen
- The Density of Latinate Words in the Speeches of
Jane Austen's Characters, by Mary DeForest and Eric
Johnson.
- English has two main sources for words: Germanic
and Latin. Distinct from each other, they have
polarised our language into high diction and low
(diglossia).
- Latinate words are indicators of status and
education.
- The proportion of Latinate words in a text can be
used as a stylometric measure, perhaps similar to
average word length or vocabulary richness.
- DeForest and Johnson classified all 13,809 unique
words which appear in Jane Austen's novels. Of
these, 6151 were found to derive from Latin or
Greek. They also excluded 945 stop words.
- Their theory was that Latinate words, not merely
long words, delineate character.
26Latinate Words in Jane Austen
- The characters in the novels were found to vary
in the proportion of Latinate words they used,
with high proportions being indicative of
- high social class
- formality
- insincerity and euphemism (Orwell, e.g.
"terminate with extreme prejudice")
- self-control (as opposed to emotion)
- men's speech (education was the preserve of men
in the 18th century)
- stateliness vs. squalor (e.g. Mansfield Park).
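The Latinate proportion for a character's speech could be computed along the following lines. This is a sketch under loud assumptions: `LATINATE` and `STOP_WORDS` are tiny hypothetical stand-ins for the hand-classified word lists, the proportion is taken over word tokens after stop words are dropped, and the study's exact counting rules may differ.

```python
import re

# Hypothetical miniature lexicons; the real study classified 13,809 words by hand.
LATINATE = {"terminate", "extreme", "prejudice"}
STOP_WORDS = {"with", "the", "a", "of"}


def latinate_proportion(speech, latinate=LATINATE, stop_words=STOP_WORDS):
    """Share of non-stop-word tokens that appear in the Latinate word list."""
    tokens = [t for t in re.findall(r"[a-z']+", speech.lower())
              if t not in stop_words]
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in latinate) / len(tokens)


print(latinate_proportion("Terminate with extreme prejudice"))   # 1.0
print(latinate_proportion("Kill the man"))                        # 0.0
print(latinate_proportion("The extreme end"))                     # 0.5
```

Computing this figure per speaker gives one number per character, which can then be compared across the novels.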