STYLISTIC VARIATION AS A BASIS FOR GENRE-BASED TEXT CLASSIFICATION presentation

1
STYLISTIC VARIATION AS A BASIS FOR GENRE-BASED
TEXT CLASSIFICATION

S. Sameen Fatima
Dept. of Computer Science Engineering
Osmania University
Hyderabad
(sameenf_at_hotmail.com)

2
BACKGROUND

Q. What is classification (of text)?
A. Classification is an important IR task in
which one or more category labels are assigned to
a document.
Approaches to Classification (of text)
Earlier approaches to text classification
assigned labels to documents based on CONTENT.
1. Word-based techniques
- statistical (tf, idf)
- term/keyword searches
Advantage: Simple and can be automated
Disadvantage: Phrases cannot be extracted
2. Phrase-based techniques
a) In-depth NLP: Here we aspire to represent all
the information in a text using context.
- syntax
- semantics
- statistics
Advantage: General, task-independent
representation
Disadvantage: Costly; not possible in
polynomial time
b) Information Extraction: Here we delimit in
advance, as part of the specification of a task,
the semantic range of the output, the relations
we will represent, and the allowable fillers in
each slot.
Advantage: It works well for a specific corpus

3
Limitations of the Earlier Approaches to Text
Classification

Besides content, texts have STYLE, which the
earlier approaches did not account for.
The focus of this talk is to present STYLE as
a new basis for text classification.

4
COMPUTATIONAL STYLISTICS

The study of style, in other words the
detection of patterns common to a body of
writing, is known as STYLISTICS.
If stylistic analysis uses computer-aided and
statistical methods for the analysis of texts,
the field of study is called COMPUTATIONAL
STYLISTICS.

5
Related Work in Computational Stylistics

1. Pre-WWW Era
- Author Attribution Studies
Popular example: Mosteller and Wallace's study of
anonymous essays published in THE FEDERALIST to
identify the authors (Hamilton and Madison).
Stylistic parameters: sentence length, content
words (nouns, adjectives, verbs), function
words (prepositions, conjunctions), use of "by",
"from", and "to", ...
They came up with the interesting result that
content words were too subject-dependent and were
not good discriminators, while function words were
good discriminators.
- Automatic Abstracting
Borko and Chatman advanced the view that it
seems possible to make stylistic distinctions
between informative abstracts (which discuss the
research) and indicative abstracts (which discuss
the article that describes the research), based
on the form, voice, tense, and focus of the abstract.
- Teaching writing styles for different types of
documents.
Writer's Workbench program on AT&T Unix.

6
Related Work in Computational Stylistics (contd.)

2. WWW-Era (on-going)
- Stylistic variation between the different genres
found in the Wall Street Journal (Jussi
Karlgren, Troy Straszheim).
Example genres: Articles, Business News with
tables, Business News, Lists of briefs,
Editorials, Letters, Briefs, What's New,
Tables.
Uses simple stylistic parameters:
characters/word, digits/keywords, words/sentence.
- Establishing a genre palette for internet
material
(Jussi Karlgren, Johan Dewe, Ivan Bretan).

Definition of a Genre/Functional Style
A set of documents with a perceived consistent
tendency to make the same stylistic choices;
specifically, if the set has an established
communicative function, it is a functional style.
Genres can have differing usefulness.
Genres in my work (Corpus)
Editorials from The Hindu (H)
Editorials from Hindustan Times (HT)
Editorials from The Times of India (TI)

Hypothesis
Editorials from each newspaper show a systematic
and consistent difference in the choice of
presentation style, specifically to establish
some intended communicative function (aggressive,
conservative, liberal).
Aim of the Experiment
To find a descriptive and predictive algorithm
for classifying editorials from different
newspapers based on stylistic features.

9
Mathematical Model

Two models were explored to find which was
applicable.
1. Vector Space Model - Used by Salton in the
SMART system (IRS)
2. Euclidean Space Model.
Euclidean Space Model
An n-dimensional Euclidean space, $E^n$, is defined
as the set of all n-tuples of real numbers $(x_1,
x_2, \ldots, x_n)$, where the Euclidean distance in $E^n$
between two points $x = (x_1, x_2, \ldots, x_n)$ and
$y = (y_1, y_2, \ldots, y_n)$ is defined by
$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$$
In our project Euclidean Space represents a
Stylistic Space
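
As a minimal sketch of the distance defined on this slide, the function below computes d(x, y) for two feature vectors of equal length. The function name and the two-feature vectors are illustrative assumptions, not values from the experiment.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of features")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical two-feature points in a stylistic space.
a = [3.0, 4.0]
b = [0.0, 0.0]
print(euclidean_distance(a, b))  # 5.0
```

In a stylistic space, each coordinate would hold one stylistic feature, so a small distance means two texts make similar stylistic choices.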

In the Vector Space Model, the distance between two
points x and y is related to the angle $\theta(x, y)$
formed by the lines from each of the points to
the origin, which is given by
$$\cos\theta(x, y) = \frac{x \cdot y}{(x \cdot x)^{0.5} \, (y \cdot y)^{0.5}}$$
This model failed in stylistic analysis.
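
The angle-based measure above can be sketched as follows. The function name and the sample vectors are assumptions made for illustration. One general property of the measure is that it depends only on direction: a vector and any positive multiple of it have cosine 1. The slides do not say why the model failed, so this is offered only as an observation about the formula.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y, measured at the origin."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Hypothetical vectors: v is u scaled by 2, so the angle between them is zero.
u = [1.0, 2.0]
v = [2.0, 4.0]
print(math.isclose(cosine_similarity(u, v), 1.0))  # True
```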

11
Stylistic Profiling

A method of identifying the stylistic features in
the writing style of an individual or a group of
people and presenting them in a systematic way.
1. Lexical Features
Percentage of interrogative pronouns
Percentage of emphatic pronouns
Percentage of prepositions
Percentage of conjunctions
Percentage of articles
Percentage of action words
Percentage of unique words
2. Structural Features
Average words/sentence
Maximum sentence length
Total no. of sentences
Total no. of words
Total no. of characters
3. Affective Features
Percentage of passive sentences
Flesch Reading Ease
Coleman-Liau grade level

12
Classification Algorithm
1. Training Phase
- Training set consisting of 30 editorials each
from H, HT, TI
- Feature Extraction (Lexical, Structural,
Affective)
- 90 SPs
- Conduct ANOVA test to extract the SIGNIFICANT
FEATURES
- 90 FSPs
- Compute the mean for each of the significant
features for each newspaper
- 3 Prototypes: P-H, P-HT, P-TI
2. Classification Phase
- New instance of editorial
- Significant Feature Extraction
- FSP, I
- Compute the distance between I and each of the
prototypes from the training phase
- Least d(I,P-H): Classify as Hindu
- Least d(I,P-HT): Classify as HT
- Least d(I,P-TI): Classify as TI
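
The two phases on this slide amount to a nearest-prototype classifier: average the feature vectors of each newspaper, then assign a new instance to the closest mean. The sketch below assumes Euclidean distance and uses made-up two-feature vectors; the function names and all numbers are hypothetical.

```python
import math


def build_prototypes(training):
    """Map each label to the mean of its feature vectors (training phase)."""
    prototypes = {}
    for label, vectors in training.items():
        n = len(vectors)
        prototypes[label] = [sum(column) / n for column in zip(*vectors)]
    return prototypes


def classify(instance, prototypes):
    """Return the label whose prototype is nearest to the instance."""
    return min(prototypes, key=lambda label: math.dist(instance, prototypes[label]))


# Hypothetical profiles restricted to two significant features.
training = {
    "H": [[20.0, 5.0], [22.0, 7.0]],
    "HT": [[10.0, 15.0], [12.0, 17.0]],
    "TI": [[30.0, 25.0], [32.0, 27.0]],
}
prototypes = build_prototypes(training)
print(prototypes["H"])                     # [21.0, 6.0]
print(classify([20.0, 8.0], prototypes))   # H
```

Because features are on different scales (percentages, raw counts, grade levels), a real run would probably need some normalisation before distances are compared; the slides do not say whether this was done.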
13
Results

1. Data Collection (SP)
2. Results of identifying significant features in
the training phase (FSP)
A one-tailed ANOVA test was carried out
Null hypothesis: No difference between the
means
Alternate hypothesis: Means are different
The ratio of the variance estimates is
calculated: $F = S_b^2 / S_w^2$
$S_b^2 = S_w^2$ (Check for null hypothesis)
$S_b^2 > S_w^2$ (Check for alternate hypothesis)
If $F > F_{crit}$ for a particular significance level,
then we say that the means of the feature are
significantly different
3. Results of the classification phase
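
For the ANOVA test in item 2 above, the F ratio for a single feature can be sketched as below, reading the numerator as the between-group variance estimate and the denominator as the within-group variance estimate. The function name and the sample values are hypothetical; the critical value has to be looked up in an F table for the chosen significance level and degrees of freedom.

```python
def anova_f(groups):
    """One-way ANOVA F ratio: between-group over within-group variance."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    var_between = ss_between / (k - 1)       # k - 1 degrees of freedom
    var_within = ss_within / (n_total - k)   # n_total - k degrees of freedom
    if var_within == 0:
        raise ValueError("within-group variance is zero; F is undefined")
    return var_between / var_within


# Hypothetical values of one feature for three newspapers.
feature_by_paper = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = anova_f(feature_by_paper)

# 5.14 is the F(2, 6) critical value at the 0.05 level from a standard F table.
f_crit = 5.14
print(f_value > f_crit)  # True: this feature would count as significant
```

A feature is kept only when its F exceeds the critical value, which is how the full profile is reduced to the significant features.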

Performance Evaluation
The following measures were computed:
Precision = Number-classified-correctly / Number-total-classified
Recall = Number-classified-correctly / Number-relevant-for-classification
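
A minimal sketch of the two measures, assuming they are computed per newspaper: "total classified" is read as the number of editorials assigned to that newspaper, and "relevant" as the number that truly belong to it. The function name and the label lists are made up for illustration.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one target class."""
    pairs = list(zip(true_labels, predicted_labels))
    classified = sum(p == target for _, p in pairs)   # assigned to target
    relevant = sum(t == target for t, _ in pairs)     # truly in target
    correct = sum(t == target and p == target for t, p in pairs)
    precision = correct / classified if classified else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


# Hypothetical true and predicted labels for five editorials.
true_labels = ["H", "H", "HT", "TI", "TI"]
predicted = ["H", "HT", "HT", "TI", "H"]
print(precision_recall(true_labels, predicted, "H"))   # (0.5, 0.5)
print(precision_recall(true_labels, predicted, "HT"))  # (0.5, 1.0)
```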
Conclusion
The results of the experiment were positive. It
was possible to classify editorials with a good
degree of recall and precision

Scope for further work
Currently, it is not clear whether topic and
style are two independent dimensions of variation
in text or whether they go hand in hand. This can
be further explored by subclassifying editorials
based on topic and then studying each subclass for
stylistic variations.
Applications
- For classifying documents on the Internet based
on GENRE
- Relating FSPs of editorials to the reader
profiles for each newspaper so as to establish
any interesting relationship.
