Using Machine Learning Techniques in Stylometry Ramyaa, Congzhou He, Dr. Khaled Rasheed – PowerPoint PPT presentation

Learn more at: http://cobweb.cs.uga.edu

1
 Using Machine Learning Techniques in Stylometry
  • Ramyaa, Congzhou He, Dr. Khaled Rasheed

2
Introduction
  • Stylometry
  • Major problems facing stylometry
  • Decision trees
  • Artificial Neural Networks

3
Stylometry
  • The measure of style
  • Fundamental assumption: there is an unconscious
    aspect to an author's style that cannot be
    consciously manipulated but that possesses
    quantifiable and distinctive features.
  • Major applications today: clinical tools in
    disease detection, forensic tools in court
    trials, text categorization, and author attribution.

4
Major problems facing stylometry
  • No consensus as to which characteristic features
    to use
  • Which indicators to use: word length, sentence
    length, tests of position, the distribution of
    once-occurring words (hapax legomena), the
    frequencies of marker words, letter sequences,
    syllable length, or syntactic measures?
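
Two of the indicators named above, once-occurring words and word length, take only a few lines to compute. This is a minimal sketch, assuming a naive regex tokenizer (the slides do not say how tokens were defined). The function names and the sample sentence are illustrative only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens. A naive assumption, not the authors' tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def hapax_legomena(tokens):
    """Return the words that occur exactly once in the sample, sorted."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)

def mean_word_length(tokens):
    """Average number of characters per token."""
    return sum(len(token) for token in tokens) / len(tokens)

# Invented sample text, used only to exercise the functions.
sample = "the cat sat on the mat and the dog sat too"
tokens = tokenize(sample)
hapaxes = hapax_legomena(tokens)
print(hapaxes)
print(round(mean_word_length(tokens), 2))
```

Dividing the hapax count by the number of tokens gives a rate that can be compared across samples of different length.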

5
Major problems facing stylometry
  • No consensus as to which methodology or
    techniques to apply in standard research
  • Which techniques to use: statistical
    methods or automated pattern recognition
    methods?
  • Statistical methods: e.g. Bayesian analysis,
    cluster analysis, and the widely used
    Principal Components Analysis (PCA).
  • Automated pattern recognition methods: e.g.
    Artificial Neural Networks (ANN), Genetic
    Programming (GP).

6
Significant Features of our paper
  • Recognizing the works of five authors
  • Use of unconventional indicators such as
    punctuation marks as well as standard indicators
    such as function words
  • Only 21 indicators are used, which suggests that,
    contrary to common belief, high-performance
    classification does not require many features

7
Data Extraction
  • 78 samples, one chapter each, from five popular
    nineteenth-century authors
  • Jane Austen: Pride and Prejudice, Chapters 1-5;
    Mansfield Park, Chapters 1-5; Emma, Chapters 1-5;
    Sense and Sensibility, Chapters 1-5

8
  • Charles Dickens: David Copperfield, Chapters 1-5;
    Great Expectations, Chapters 1-5; Hard Times,
    Chapters 1-6; A Tale of Two Cities, Chapters 1-6
  • William Thackeray: Vanity Fair, Chapters 1-6;
    Men's Wives, Chapters 1-6
  • Emily Bronte: Wuthering Heights, Chapters 1-12
  • Charlotte Bronte: Jane Eyre, Chapters 1-12

9
21 attributes as input
  • type-token ratio
  • mean word length
  • mean sentence length
  • standard deviation of sentence length
  • mean paragraph length
  • chapter length
  • number of commas per thousand tokens
  • number of semicolons per thousand tokens
  • number of quotation marks per thousand tokens

10
  • number of exclamation marks per thousand tokens
  • number of hyphens per thousand tokens
  • occurrences of "and" per thousand tokens
  • occurrences of "but" per thousand tokens
  • occurrences of "however" per thousand tokens
  • occurrences of "if" per thousand tokens
  • occurrences of "that" per thousand tokens
  • occurrences of "more" per thousand tokens
  • occurrences of "must" per thousand tokens
  • occurrences of "might" per thousand tokens
  • occurrences of "this" per thousand tokens
  • occurrences of "very" per thousand tokens
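
The 21 attributes listed above can be gathered into one feature vector per chapter. The sketch below is one possible reading, not the authors' extraction code. The tokenizer, the sentence and paragraph splitting rules, the choice of tokens as the length unit, and counting only straight double quotes are all assumptions, and every name in it is illustrative.

```python
import re
import statistics

FUNCTION_WORDS = ["and", "but", "however", "if", "that",
                  "more", "must", "might", "this", "very"]
PUNCTUATION = {"comma": ",", "semicolon": ";", "quotation": '"',
               "exclamation": "!", "hyphen": "-"}

def word_tokens(text):
    """Lowercase word tokens (assumed definition of a token)."""
    return re.findall(r"[a-z']+", text.lower())

def extract_features(chapter):
    """Return a dict with the 21 attributes for one chapter of text."""
    tokens = word_tokens(chapter)
    n = len(tokens)
    if n == 0:
        raise ValueError("chapter contains no word tokens")
    per_thousand = 1000.0 / n
    # Sentence and paragraph lengths are measured in tokens (assumption).
    sentence_lengths = [len(word_tokens(s))
                        for s in re.split(r"[.!?]+", chapter) if word_tokens(s)]
    paragraph_lengths = [len(word_tokens(p))
                         for p in re.split(r"\n\s*\n", chapter) if word_tokens(p)]
    features = {
        "type_token_ratio": len(set(tokens)) / n,
        "mean_word_length": sum(len(t) for t in tokens) / n,
        "mean_sentence_length": statistics.mean(sentence_lengths),
        "sd_sentence_length": statistics.pstdev(sentence_lengths),
        "mean_paragraph_length": statistics.mean(paragraph_lengths),
        "chapter_length": n,
    }
    for name, mark in PUNCTUATION.items():
        features[name + "_per_1000"] = chapter.count(mark) * per_thousand
    for word in FUNCTION_WORDS:
        features[word + "_per_1000"] = tokens.count(word) * per_thousand
    return features

# Invented sample text, far shorter than a real chapter.
sample = "This is good, and that is bad. However, it must end!"
features = extract_features(sample)
print(len(features))
```

Normalizing counts per thousand tokens keeps chapters of different lengths comparable, which is presumably why the slides state the attributes that way.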

11
Decision Tree Learning
  • See5 package by Quinlan, based on the ID3 algorithm
  • Features of decision trees: results are easy to
    understand; focus on individual attributes
  • Uses fuzzy thresholds for continuous values
  • Either winnowing or boosting gives the best
    result: 82.4% accuracy, significantly above
    random guess (20%).
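
To illustrate how an ID3-style learner handles a continuous attribute, the sketch below picks the threshold that maximizes information gain. It uses plain information gain with a hard threshold; See5's own refinements (fuzzy thresholds, winnowing, boosting) are not reproduced. The data values are invented for illustration and are not taken from the study.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the best split of the form value <= threshold."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold fits between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = ((low + high) / 2, gain)
    return best

# Hypothetical semicolons-per-thousand-tokens values for six chapters.
semicolons = [4.0, 5.5, 6.0, 14.0, 15.5, 17.0]
authors = ["charles", "charles", "charles", "jane", "jane", "jane"]
threshold, gain = best_threshold(semicolons, authors)
print(threshold, gain)
```

A full tree learner would repeat this search over all 21 attributes at each node and recurse on the resulting subsets.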

12
Result from winnowing
  • Evaluation on test data (17 cases)
  • Decision tree: size 5, errors 3 (17.6%)
  • Confusion matrix, columns (a) to (e) = classified as;
    nonzero counts per actual class, in column order:
  • (a) class jane: 4, 1
  • (b) class charles: 5, 1
  • (c) class william: 2
  • (d) class emily: 1, 1
  • (e) class charlotte: 2
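
The reported figures can be checked with a few lines of arithmetic. The per-class counts below are read from the result above; the split into correct and incorrect cases is the only one consistent with the reported 3 errors. The chance level assumes a uniform guess over the five authors.

```python
# Per-class test counts: (number of test cases, number classified correctly).
results = {
    "jane": (5, 4),
    "charles": (6, 5),
    "william": (2, 2),
    "emily": (2, 1),
    "charlotte": (2, 2),
}

total = sum(cases for cases, _ in results.values())
correct = sum(hits for _, hits in results.values())
accuracy = correct / total
error_rate = 1 - accuracy
chance = 1 / len(results)  # uniform random guess over five authors

print(f"{correct}/{total} correct: accuracy {accuracy:.1%}, "
      f"errors {error_rate:.1%}, chance {chance:.0%}")
```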

13
Results from boosting
  • Evaluation on test data (17 cases)
  • Boost: errors 3 (17.6%)
  • Confusion matrix, columns (a) to (e) = classified as;
    nonzero counts per actual class, in column order:
  • (a) class jane: 4, 1
  • (b) class charles: 5, 1
  • (c) class william: 2
  • (d) class emily: 1, 1
  • (e) class charlotte: 2

14
Artificial Neural Network (ANN) Learning
  • practical and powerful method of pattern
    recognition
  • can invent new features that are not explicit in
    the input
  • all attributes taken into consideration
  • inductive rules not accessible to humans

15
  • Many architectures were tried, including Kohonen
    SOMs, probabilistic nets, and nets based on
    statistical models
  • Feed-forward nets trained with backpropagation
    gave the best results
  • The best network had 21 inputs and 10 outputs
  • The best architecture had 15 hidden nodes in the
    first hidden layer and 11 in the second
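
The layer sizes above (21 inputs, hidden layers of 15 and 11 nodes, 10 outputs) describe a small network. The sketch below shows only the forward pass through such a net with untrained random weights. The sigmoid activation, the weight range, and the zero biases are assumptions, since the slides do not give them, and backpropagation training is not shown.

```python
import math
import random

LAYER_SIZES = [21, 15, 11, 10]  # inputs, hidden 1, hidden 2, outputs

def init_network(sizes, seed=0):
    """Create one (weights, biases) pair per layer with small random weights."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, inputs):
    """Propagate one feature vector through every layer in turn."""
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

network = init_network(LAYER_SIZES)
n_parameters = sum(len(w) * len(w[0]) + len(b) for w, b in network)
outputs = forward(network, [0.0] * 21)  # placeholder input vector
print(len(outputs), n_parameters)
```

With these sizes the net has 626 trainable parameters, which is large relative to 78 samples, so held-out evaluation of the kind reported in the slides matters.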

16
Predictor analysis

17
Results from ANN
  • Confusion matrix, columns (a) to (e) = classified as;
    nonzero counts per actual class, in column order:
  • (a) class jane: 2
  • (b) class charles: 2
  • (c) class william: 2
  • (d) class emily: 2, 4
  • (e) class charlotte: 5

18
Misclassifications
  • No. 4, Pride and Prejudice Chapter 3, is
    misclassified as written by Charlotte Bronte.
  • Nos. 67 and 71, Tale of Two Cities Chapter 1 and
    Chapter 5, are misclassified as written by William
    Thackeray.
  • All the other samples are correctly classified
    (88.2% accuracy on the validation set).

19
Conclusion
  • Very good results were obtained in both
    experiments
  • Artificial Intelligence provides stylometry with
    excellent classifiers that require fewer input
    variables than traditional statistics
  • Future research: genetic algorithms and genetic
    programming (GA/GP); a general classifier
    applicable to all authors; different sets of
    features

20
Thank you
  • ?