1
STYLOMETRY IN IR SYSTEMS
  • Leyla BILGE
  • Büsra ÇELIKKAYA
  • Kardelen HATUN

2
Outline
  • Stylistics and Stylometry
  • Applications of stylometry
  • History of stylometric research
  • Stylistic features
  • Recent Studies
  • Our approach
  • Conclusion

3
STYLISTICS
  • The theoretical framework for stylistics combines:
  • Halliday's Language Theory
  • Sanders' Theories of Stylistics
  • Halliday says:
  • "A text is what is meant, selected from the
    total set of options that constitute what can be
    meant"
  • Sanders says:
  • "Style is the result of choices made by an
    author from a range of possibilities offered by
    the language system"

4
STYLISTICS
  • Stylistic variation depends on
  • Author preferences and competence
  • Familiarity
  • Genre
  • Communicative context
  • Expected characteristics of the intended audience
  • Modeling, representing and utilizing this
    variation is the business of stylistic analysis.

5
STYLOMETRY
  • The application of the study of linguistic style
  • Style refers to the linguistic choices of authors
    that persist over their works, independently of
    content
  • The aim is to describe a text from a rather
    formal perspective, using measures such as:
  • Number of words
  • Number of repetitions
  • Sentence length
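
The three measures listed above can be computed with very little code. The following is a minimal sketch, not a definitive implementation: the function name basic_measures, the regex tokenization, the naive sentence split on ".", "!" and "?", and the definition of a "repetition" as any occurrence of a word after its first are all assumptions made for illustration.

```python
import re
from collections import Counter

def basic_measures(text):
    """Return word count, repetition count and mean sentence length (in words)."""
    # Assumed tokenization: lowercase runs of letters (Unicode-aware).
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Naive sentence split; abbreviations such as "e.g." would be split wrongly.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    # Assumed definition: every occurrence of a word beyond its first is a repetition.
    repetitions = sum(c - 1 for c in counts.values())
    mean_sentence_length = len(words) / len(sentences) if sentences else 0.0
    return {
        "words": len(words),
        "repetitions": repetitions,
        "mean_sentence_length": mean_sentence_length,
    }

print(basic_measures("The cat sat. The cat ran away!"))
```

For the sample sentence this counts 7 words, 2 repetitions ("the" and "cat" each occur twice) and a mean sentence length of 3.5 words.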

6
APPLICATIONS OF STYLOMETRY
  • Authorship attribution
  • Forensic author identification
  • To find the author of an anonymous text
  • Observation of the characteristics of a
    particular author
  • Organization and retrieval of documents based on
    their writing style
  • Systems for genre-based information retrieval

7
HISTORY OF STYLOMETRY
  • Stylometry grew out of analyzing text for
    evidence of authenticity and authorial identity
  • Modern practice in the discipline assumes that
    distinctive patterns of language use can
    identify authors
  • With the development of computers and their
    capacities:
  • Large data sets can be analyzed
  • New methods can be generated and easily applied

8
HISTORY OF STYLOMETRY, CONT'D
  • Current research uses techniques based on term
    frequency counts
  • Frequency data are collected for common terms
  • These data are then analyzed using a range of
    fairly standard statistical techniques
  • However, these techniques cannot yet guarantee
    reliable output, e.g. Ulysses

9
Methodology
  • Use a subset of structural and stylometric
    features on a set of authors without
    consideration of author characteristics
  • Currently, authorship attribution studies are
    dominated by the use of lexical measures
  • Commonly used statistics:
  • Word length
  • Syllables per word
  • Sentence length
  • Sentence count
  • Text length in words
  • Use of punctuation marks
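
Some of these statistics need a working definition before they can be counted. The sketch below illustrates word length, syllables per word and punctuation use under stated assumptions: the function names are hypothetical, the syllable counter is a rough English-only heuristic (groups of consecutive vowels), and punctuation is limited to the ASCII set in Python's string.punctuation.

```python
import re
import string

def count_syllables(word):
    """Rough English-only heuristic: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def lexical_statistics(text):
    """Mean word length, syllables per word, and punctuation marks per word."""
    words = re.findall(r"[A-Za-z]+", text)
    punctuation = [ch for ch in text if ch in string.punctuation]
    n = len(words)
    if n == 0:
        return {"mean_word_length": 0.0, "syllables_per_word": 0.0,
                "punctuation_per_word": 0.0}
    return {
        "mean_word_length": sum(len(w) for w in words) / n,
        "syllables_per_word": sum(count_syllables(w) for w in words) / n,
        "punctuation_per_word": len(punctuation) / n,
    }

print(lexical_statistics("Hello, world!"))
```

The vowel-group heuristic miscounts words with silent vowels (for example a final "e"), so a real study would likely replace it with a pronunciation dictionary or a language-specific rule set.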

10
Stylistic Features
  • Lexically-Based Methods
  • Vocabulary richness of the author
  • Frequencies of occurrence of individual words
  • Vocabulary diversity
  • Type-token ratio: V/N
  • V: size of the vocabulary of the sample text
  • N: number of tokens in the sample text
  • Hapax legomena
  • Number of words that occur exactly once
  • Frequencies of occurrence
  • Function words
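
The type-token ratio and the hapax legomena count follow directly from a frequency table of the tokens. A minimal sketch, assuming the text has already been split into tokens (the function name vocabulary_richness is hypothetical):

```python
from collections import Counter

def vocabulary_richness(tokens):
    """Compute the type-token ratio V/N and the number of hapax legomena."""
    counts = Counter(tokens)
    n = len(tokens)   # N: number of tokens
    v = len(counts)   # V: size of the vocabulary (distinct tokens)
    hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "type_token_ratio": v / n if n else 0.0,
        "hapax_legomena": hapax,
    }

tokens = "to be or not to be that is the question".split()
print(vocabulary_richness(tokens))
```

Here N = 10 and V = 8, so the ratio is 0.8, and six words ("or", "not", "that", "is", "the", "question") occur exactly once.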

11
Stylistic Features
  • Problems
  • Text length dependent
  • Unstable for short texts
  • Function word set requires manual effort
  • Specific to the group of authors considered
  • Solution
  • Use the set of most frequent words
  • Both content-words and function words
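
The proposed solution can be sketched as follows: pick the k most frequent words over the whole corpus, with no manual function-word list, and describe each text by the relative frequency of those words. The function names, the value of k and the toy documents are assumptions for illustration only.

```python
from collections import Counter

def most_frequent_words(documents, k):
    """Pick the k most frequent words over all tokenized documents."""
    totals = Counter()
    for tokens in documents:
        totals.update(tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one tokenized document."""
    counts = Counter(tokens)
    n = len(tokens)
    return [counts[w] / n if n else 0.0 for w in vocabulary]

docs = ["the sea and the sky".split(), "the ship and the storm".split()]
vocab = most_frequent_words(docs, 2)
print(vocab, relative_frequencies(docs[0], vocab))
```

Using relative rather than raw frequencies reduces, but does not remove, the dependence on text length noted under Problems.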

12
Related Studies
  • Analysis of the text by a natural language
    processing tool
  • Use an existing NLP tool
  • Sentence and Chunk Boundaries Detector (SCBD)
  • Use sub-word units like character N-grams instead
    of word frequencies
  • Character sequences of length n
  • The most frequent n-grams provide information
    about an author's stylistic choices on the
    lexical, syntactical and structural levels
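
Character n-grams need no tokenizer at all, which is part of their appeal. The sketch below extracts every character sequence of length n and keeps the most frequent ones with their relative frequencies; the function name and the default values of n and top are assumptions, not values taken from the studies mentioned above.

```python
from collections import Counter

def char_ngram_profile(text, n=3, top=5):
    """Return the `top` most frequent character n-grams with relative frequencies."""
    # Slide a window of width n over the raw text, spaces and punctuation included.
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(ngrams)
    total = len(ngrams)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(gram, count / total) for gram, count in ranked[:top]]

print(char_ngram_profile("abababa", n=2, top=2))
```

Because spaces and punctuation are kept, the n-grams also capture word boundaries and punctuation habits, not only word-internal spelling.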

13
Word based features
  • Bag-of-words
  • Apply stemming and stopword list
  • Function words
  • Content-free
  • POS Annotation
  • Feature Selection
  • Semantic Disambiguation

14
Linguistic constituents
  • The structure of natural language sentences shows
    that word occurrences follow a specific order
  • Words are grouped into syntactic units called
    constituents
  • Use word relationships by extracting constituents
    for feature construction
  • Subdivide document into sentences
  • Construct a syntax tree for each sentence

15
Syntax tree
  • Use a syntax tree representation of different
    authors' sentences as features

16
Our Approach
  • Use stylometry to analyze the following:
  • Texts translated by the same translator but
    written by different authors
  • Texts translated by different translators but
    written by the same authors

17
Proposed Steps
  • Feature Extraction
  • Determine which features represent the style best
  • Training
  • Training the classifier with a training set
  • Many methods are available (e.g. SVM, Bayesian)
  • Recognition and Classification of texts
  • Analyzing the results of classification

18
1. Feature Extraction
  • The stylometric features of a text can be
  • Word length
  • Sentence length
  • Paragraph length
  • Character n-grams
  • Function words
  • Feature choices strongly affect classification
    results.
  • Then obtain a feature vector with n dimensions
  • V = (v1, v2, v3, ..., vn)
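
As an illustration of how such a vector might be assembled, the sketch below combines mean word length, mean sentence length, mean paragraph length and a few function-word frequencies. Everything specific here is an assumption: the function name, the five-word FUNCTION_WORDS list, the blank-line paragraph split and the order of the components. Character n-grams are left out to keep the example short.

```python
import re

# Hypothetical, deliberately tiny function-word list for illustration.
FUNCTION_WORDS = ["the", "of", "and", "to", "in"]

def feature_vector(text):
    """Build V = (v1, ..., vn) from simple stylometric measures."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text.lower())
    n = len(words)
    if n == 0:
        return [0.0] * (3 + len(FUNCTION_WORDS))
    vector = [
        sum(len(w) for w in words) / n,  # v1: mean word length in characters
        n / len(sentences),              # v2: mean sentence length in words
        n / len(paragraphs),             # v3: mean paragraph length in words
    ]
    # v4..vn: relative frequency of each function word.
    vector += [words.count(f) / n for f in FUNCTION_WORDS]
    return vector

print(feature_vector("The end of the road.\n\nTo be in the sea."))
```

The components are on very different scales (characters, words, proportions), so a distance-based classifier would normally need them normalized first.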

19
2. Training
  • Choose training data for every class
  • May be randomly selected texts
  • May be manually picked
  • Determine the parameters corresponding to each
    class
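
One simple reading of "parameters for each class" is the mean feature vector (centroid) of the class's training texts. The sketch below takes that reading as an assumption; it is not the SVM or Bayesian training mentioned earlier, and the function name and toy data are hypothetical.

```python
from collections import defaultdict

def train_centroids(samples):
    """samples: list of (label, feature_vector) pairs.

    Returns a dict mapping each label to the mean vector of its samples.
    """
    grouped = defaultdict(list)
    for label, vector in samples:
        grouped[label].append(vector)
    return {
        # zip(*vectors) walks the vectors column by column (one feature at a time).
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }

training = [("A", [1.0, 2.0]), ("A", [3.0, 4.0]), ("B", [10.0, 0.0])]
print(train_centroids(training))
```

With this toy data, class "A" gets the centroid [2.0, 3.0] and class "B" keeps its single vector [10.0, 0.0].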

20
3. Recognition and Classification
  • Use the parameters we obtained from training data
  • Compute the distance
  • Label the data
  • Classify the data
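
These steps can be sketched as a nearest-centroid rule: compute the distance from a text's feature vector to each class's parameters and assign the label of the closest class. The choice of Euclidean distance, the function name and the example centroids are assumptions; the slides do not fix a distance measure.

```python
import math

def classify(vector, centroids):
    """Return (label, distance) of the class centroid closest to `vector`."""
    # Euclidean distance from the text's vector to every class centroid.
    distances = {label: math.dist(vector, c) for label, c in centroids.items()}
    label = min(distances, key=distances.get)
    return label, distances[label]

# Hypothetical class parameters obtained from training data.
centroids = {"A": [2.0, 3.0], "B": [10.0, 0.0]}
print(classify([2.5, 3.0], centroids))
```

The example vector lies 0.5 away from the centroid of "A" and much farther from "B", so it is labeled "A".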

21
Results of the Classification
  • We will have two sets of results
  • The original texts classified by author
  • The translated texts classified without prior
    class information
  • These results will give us a clue about the two
    issues we stated at the beginning
  • Example: The Picture of Dorian Gray has been
    translated into Turkish by many translators
  • Check whether these translations are clustered
    in one class or in separate classes

22
Our Aim
  • With the right classification we will be able to
    identify
  • Whether stylometric analysis works in finding an
    author in two different languages
  • Whether translations carry more of their
    translator's style or still retain their
    author's style
  • Yet, to date, no stylometrist has managed to
    establish a methodology which is better able to
    capture the style of a text than that based on
    lexical items.

23
Conclusion
  • Today there are many useful applications of
    stylometry.
  • Authorship attribution, plagiarism detection,
    genre-based information retrieval
  • Which features are valuable for analysis is still
    an important question.
  • We aim to find the stylistic connection between a
    text and its translation.
