STYLISTIC VARIATION AS A BASIS FOR GENRE-BASED TEXT CLASSIFICATION


1
STYLISTIC VARIATION AS A BASIS FOR GENRE-BASED
TEXT CLASSIFICATION
  • S. Sameen Fatima
  • Dept. of Computer Science Engineering
  • Osmania University
  • Hyderabad
  • (sameenf_at_hotmail.com)

2
BACKGROUND
  • Q. What is classification (of text)?
  • A. Classification is an important IR task in
    which one or more category labels are assigned to
    a document.
  • Approaches to Classification (of text)
  • Earlier approaches to text classification
    assigned labels to documents based on CONTENTS
  • 1. Word-based techniques
  • - statistical (tf, idf)
  • - term/keyword searches
  • Advantage: Simple and can be automated
  • Disadvantage: Phrases cannot be extracted
  • 2. Phrase-based techniques
  • a) In-depth NLP: Here we aspire to represent all
    the information in a text using context.
  • - syntax
  • - semantics
  • - statistics
  • Advantage: General, task-independent
    representation
  • Disadvantage: Costly; not possible in
    polynomial time
  • b) Information Extraction: Here we delimit in
    advance, as part of the specification of a task,
    the semantic range of the output, the relations
    we will represent, and the allowable fillers in
    each slot.
  • Advantage: It works well for a specific corpus

3
Limitations of the Earlier Approaches to Text
Classification
  • Texts have, besides content, STYLE, which the
    earlier approaches have not accounted for.
  • The focus of this talk is to present STYLE as
    a new basis for text classification.

4
COMPUTATIONAL STYLISTICS
  • The study of style, in other words the
    detection of patterns common to a body of writing,
    is known as STYLISTICS.
  • If stylistic analysis uses computer-aided and
    statistical methods for the analysis of texts,
    the field of study is called COMPUTATIONAL
    STYLISTICS.

5
Related Work in Computational Stylistics
  • 1. Pre-WWW Era
  • - Author Attribution Studies
  • Popular: Mosteller and Wallace's study of
    anonymous essays published in THE FEDERALIST to
    identify the authors (Hamilton and Madison).
  • Stylistic parameters: sentence length, content
    words (nouns, adjectives, verbs), function
    words (prepositions, conjunctions), use of "by",
    "from", and "to", ...
  • The study found that content words were too
    subject-dependent to be good discriminators,
    while function words were good discriminators.
  • - Automatic Abstracting
  • Borko and Chatman advanced the view that it
    seems possible to make stylistic distinctions
    between informative abstracts (which discuss the
    research) and indicative abstracts (which discuss
    the article that describes the research), based
    on the form, voice, tense, and focus of the abstract.
  • - Teaching writing styles for different types of
    documents.
  • Writer's Workbench program on AT&T Unix.

6
Related Work in Computational Stylistics (contd.)
  • 2. WWW-Era (on-going)
  • - Stylistic variation between the different genres
    found in the Wall Street Journal. (Jussi
    Karlgren, Troy Straszheim)
  • Example: Articles, Business News with
    tables, Business News, Lists of briefs,
    Editorials, Letters, Briefs, What's New,
    Tables.
  • Uses simple stylistic parameters:
    characters/word, digits/keywords, words/sentence.
  • - Establishing a genre palette for internet
    material.
  • (Jussi Karlgren, Johan Dewe, Ivan Bretan)
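The simple stylistic parameters listed on this slide can be computed with very little machinery. The following is a minimal sketch, not the cited authors' implementation: the function name `simple_style_parameters`, the regex-based sentence and word splitting, and the use of digits per word as a stand-in for the slide's undefined "digits/keywords" ratio are all assumptions made for illustration.

```python
import re


def simple_style_parameters(text: str) -> dict:
    """Compute three surface-level stylistic parameters for a text.

    Assumptions (not from the slides): a sentence ends at '.', '!' or '?',
    and a word is a run of letters, digits or apostrophes.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    n_chars = sum(len(w) for w in words)
    n_digits = sum(ch.isdigit() for w in words for ch in w)
    return {
        "chars_per_word": n_chars / len(words),
        # Stand-in for "digits/keywords", which the slide does not define.
        "digits_per_word": n_digits / len(words),
        "words_per_sentence": len(words) / len(sentences),
    }


if __name__ == "__main__":
    print(simple_style_parameters("Stocks rose 3 points. Bonds fell."))
```

A real system would likely need a proper sentence splitter, since abbreviations such as "Dept." break the naive rule used here.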

7
  • Definition of a Genre/Functional Style
  • A genre is a set of documents with a perceived
    consistent tendency to make the same stylistic
    choices; specifically, if it has an established
    communicative function, it is a functional style.
  • Genres can have differing usefulness
  • Genres in my work (Corpus)
  • Editorials from The Hindu (H)
  • Editorials from Hindustan Times (HT)
  • Editorials from Times of India (TI)

8
  • Hypothesis
  • Editorials from each newspaper show a systematic
    and consistent difference in the choice of a
    presentation style, specifically to establish
    some intended communication function (aggressive,
    conservative, liberal)
  • Aim of the Experiment
  • To find a descriptive and predictive algorithm
    for classifying editorials from different
    newspapers based on stylistic features.

9
Mathematical Model
  • Two models were explored to find which was
    applicable.
  • 1. Vector Space Model - Used by Salton in the
    SMART system (IRS)
  • 2. Euclidean Space Model.
  • Euclidean Space Model
  • An n-dimensional Euclidean space, $E^n$, is defined
    as the set of all n-tuples of real numbers
    $(x_1, x_2, \ldots, x_n)$, where the Euclidean distance
    in $E^n$ between two points $x = (x_1, x_2, \ldots, x_n)$
    and $y = (y_1, y_2, \ldots, y_n)$ is defined by
  • $d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$
  • In our project Euclidean Space represents a
    Stylistic Space
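As a small illustration of the Euclidean distance between two points in a stylistic space, here is a sketch in Python. The function name `euclidean_distance` and the two example profiles are hypothetical; the feature values are made up and are not taken from the experiment.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


if __name__ == "__main__":
    # Hypothetical two-feature profiles, e.g. (words/sentence, % prepositions).
    profile_a = [21.0, 12.5]
    profile_b = [18.0, 8.5]
    print(euclidean_distance(profile_a, profile_b))  # sqrt(3^2 + 4^2)
```

Because each dimension contributes its raw squared difference, features on large scales (such as total word counts) can dominate features on small scales (such as percentages) unless the features are normalized first.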

10
  • In the Vector Space Model, the distance between two
    points x and y is related to the angle $\theta(x, y)$
    formed by the lines from each of the points to
    the origin, which is given by
  • $\cos\theta(x, y) = \dfrac{x \cdot y}{(x \cdot x)^{0.5} \, (y \cdot y)^{0.5}}$
  • This model failed in stylistic analysis.
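The cosine measure of the Vector Space Model can be sketched as follows. The slides do not say why this model failed for stylistic analysis, so the example only demonstrates one well-known property of the measure: it depends on direction alone, so two vectors that differ only in magnitude are treated as identical, while their Euclidean distance is non-zero. The function name `cosine_of_angle` and the sample vectors are hypothetical.

```python
import math


def cosine_of_angle(x, y):
    """cos(theta) between vectors x and y: (x . y) / (sqrt(x . x) * sqrt(y . y))."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    dot_xy = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot_xy / (norm_x * norm_y)


if __name__ == "__main__":
    x = [1.0, 2.0]
    y = [2.0, 4.0]  # same direction as x, twice the magnitude
    print(cosine_of_angle(x, y))  # approximately 1: the angle is zero
    print(math.dist(x, y))        # Euclidean distance is sqrt(5), not zero
```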

11
Stylistic Profiling
  • A method of identifying the stylistic features in
    the writing style of an individual or a group of
    people and presenting them in a systematic way.
  • 1. Lexical Features
  • Percentage of interrogative pronouns
  • Percentage of emphatic pronouns
  • Percentage of prepositions
  • Percentage of conjunctions
  • Percentage of articles
  • Percentage of action words
  • Percentage of unique words
  • 2. Structural Features
  • Average words/sentence
  • Maximum sentence length
  • Total no. of sentences
  • Total no. of words
  • Total no. of characters
  • 3. Affective Features
  • Percentage of passive sentences
  • Flesch Reading Ease
  • Coleman-Liau Grade Level
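A few of the features in this profile can be sketched in code. The Flesch Reading Ease and Coleman-Liau formulas below are the standard published ones; everything else is an assumption for illustration: the function names, the regex tokenization, the fixed article list, and especially the vowel-group syllable counter, which is only a rough heuristic. The slides do not say which tools were used to extract the features.

```python
import re

ARTICLES = {"a", "an", "the"}


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def profile_features(text: str) -> dict:
    """Compute a handful of lexical, structural and readability features."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    n_words = len(words)
    n_sentences = len(sentences)
    n_letters = sum(len(w) for w in words)
    n_syllables = sum(count_syllables(w) for w in words)
    lowered = [w.lower() for w in words]

    # Flesch Reading Ease: higher scores mean easier text.
    flesch = (206.835
              - 1.015 * (n_words / n_sentences)
              - 84.6 * (n_syllables / n_words))

    # Coleman-Liau: based on letters and sentences per 100 words.
    letters_per_100 = 100.0 * n_letters / n_words
    sentences_per_100 = 100.0 * n_sentences / n_words
    coleman_liau = 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

    return {
        "pct_articles": 100.0 * sum(w in ARTICLES for w in lowered) / n_words,
        "pct_unique_words": 100.0 * len(set(lowered)) / n_words,
        "avg_words_per_sentence": n_words / n_sentences,
        "max_sentence_length": max(
            len(re.findall(r"[A-Za-z]+", s)) for s in sentences
        ),
        "flesch_reading_ease": flesch,
        "coleman_liau_grade": coleman_liau,
    }


if __name__ == "__main__":
    for name, value in profile_features("The cat sat. The dog ran.").items():
        print(f"{name}: {value:.2f}")
```

Features that need part-of-speech or syntactic information, such as the percentage of passive sentences or of interrogative pronouns, would require a tagger or parser and are left out of this sketch.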

12
Classification Algorithm
  • 1. Training Phase
  • Training set consisting of 30 editorials each
    from H, HT, TI
  • Feature Extraction (Lexical, Structural,
    Affective), giving 90 SPs
  • Conduct ANOVA test and extract the SIGNIFICANT
    FEATURES, giving 90 FSPs
  • Compute the mean for each of the significant
    features for each newspaper, giving 3 Prototypes:
    P-H, P-HT, P-TI
  • 2. Classification Phase
  • New instance of editorial
  • Significant Feature Extraction, giving FSP, I
  • Compute the distance between I and each of the
    prototypes from the training phase
  • Least d(I,P-H): Classify as Hindu
  • Least d(I,P-HT): Classify as HT
  • Least d(I,P-TI): Classify as TI

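The two phases of the classification algorithm amount to a nearest-prototype classifier: each newspaper's prototype is the per-feature mean of its training profiles, and a new editorial is assigned to the newspaper whose prototype is closest in Euclidean distance. The sketch below is one way to write this down, assuming the feature vectors already contain only the significant features. The function names and all numeric values are hypothetical, not data from the experiment.

```python
import math


def build_prototypes(training_profiles):
    """Training phase: average the feature vectors of each class.

    training_profiles maps a label (e.g. "H") to a list of equal-length
    feature vectors, one per training editorial.
    """
    prototypes = {}
    for label, vectors in training_profiles.items():
        if not vectors:
            raise ValueError(f"no training vectors for label {label!r}")
        n = len(vectors)
        # zip(*vectors) groups the values of each feature across editorials.
        prototypes[label] = [sum(column) / n for column in zip(*vectors)]
    return prototypes


def classify(instance, prototypes):
    """Classification phase: return the label of the nearest prototype."""
    return min(prototypes, key=lambda label: math.dist(instance, prototypes[label]))


if __name__ == "__main__":
    # Made-up two-feature profiles, two editorials per newspaper.
    training = {
        "H": [[1.0, 2.0], [3.0, 4.0]],
        "HT": [[10.0, 10.0], [12.0, 14.0]],
        "TI": [[20.0, 0.0], [22.0, 2.0]],
    }
    prototypes = build_prototypes(training)
    print(prototypes)
    print(classify([2.5, 3.5], prototypes))
```

`math.dist` requires Python 3.8 or later; on older versions the distance can be computed with an explicit sum of squared differences.
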
13
Results
  • 1. Data Collection (SP)
  • 2. Results of identifying significant features in
    the training phase (FSP)
  • One-tailed ANOVA test was carried out
  • Null hypothesis: No difference between the
    means
  • Alternate hypothesis: Means are different
  • The ratio of the variance estimates is
    calculated, $F = S_b^2 / S_w^2$
  • $S_b^2 = S_w^2$ (Check for null hypothesis)
  • $S_b^2 > S_w^2$ (Check for alternate hypothesis)
  • If $F > F_{crit}$ for a particular significance level,
    then we say that the means of the feature are
    significantly different
  • 3. Results of the classification phase
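The F ratio used to select significant features can be sketched as a one-way ANOVA computed per feature, reading the variance estimates in the numerator and denominator as the between-group and within-group estimates respectively (the usual one-way ANOVA interpretation, assumed here). The function names and sample values are hypothetical, and the critical value is passed in as a parameter because it has to come from an F table or a statistics package for the chosen significance level and degrees of freedom.

```python
from statistics import mean


def anova_f_ratio(groups):
    """One-way ANOVA F ratio for one feature.

    groups is a list of samples, one list of feature values per newspaper.
    Returns between-group variance estimate / within-group variance estimate.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups and more values than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    var_between = ss_between / (k - 1)       # degrees of freedom: k - 1
    var_within = ss_within / (n_total - k)   # degrees of freedom: N - k
    if var_within == 0:
        return float("inf") if var_between > 0 else float("nan")
    return var_between / var_within


def is_significant(groups, f_crit):
    """True if the feature's group means differ significantly (F > F_crit)."""
    return anova_f_ratio(groups) > f_crit


if __name__ == "__main__":
    # Made-up values of one feature for three newspapers.
    feature_values = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
    print(anova_f_ratio(feature_values))
```

With 3 newspapers and 30 editorials each, the degrees of freedom for looking up the critical value would be 2 and 87.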

14
  • Performance Evaluation
  • The following measures were computed:
  • Precision = Number-classified-correctly /
    Number-total-classified
  • Recall = Number-classified-correctly /
    Number-relevant-for-classification
  • Conclusion
  • The results of the experiment were positive: it
    was possible to classify editorials with a good
    degree of recall and precision.
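The precision and recall measures defined on this slide can be computed per newspaper from the true and predicted labels. The sketch below assumes a per-class reading of the definitions: for a given newspaper, "total classified" is the number of editorials assigned to it, and "relevant" is the number that actually came from it. The function name and the label lists are hypothetical and are not results from the experiment.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one class (target), per the slide's definitions."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(1 for t, p in pairs if t == target and p == target)
    total_classified = sum(1 for _, p in pairs if p == target)
    total_relevant = sum(1 for t, _ in pairs if t == target)

    # Convention (assumption): report 0.0 when a denominator is zero.
    precision = correct / total_classified if total_classified else 0.0
    recall = correct / total_relevant if total_relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up labels for five test editorials.
    truth = ["H", "H", "HT", "TI", "TI"]
    predicted = ["H", "HT", "HT", "TI", "H"]
    for newspaper in ("H", "HT", "TI"):
        print(newspaper, precision_recall(truth, predicted, newspaper))
```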

15
  • Scope for further work
  • Currently, it is not clear whether topic and
    style are two independent dimensions of variation
    in text or whether they go hand in hand. This can
    be explored further by subclassifying editorials
    based on topic and then studying each subclass for
    stylistic variations.
  • Applications
  • - For classifying documents on the Internet based
    on GENRE
  • - Relating FSPs of editorials to the reader
    profiles for each newspaper so as to establish
    any interesting relationship.