Title: STYLISTIC VARIATION AS A BASIS FOR GENRE-BASED TEXT CLASSIFICATION
- S. Sameen Fatima
- Dept. of Computer Science Engineering
- Osmania University
- Hyderabad
- (sameenf_at_hotmail.com)
2BACKGROUND
- Q. What is classification (of text)?
- A. Classification is an important IR task in
which one or more category labels are assigned to
a document.
- Approaches to Classification (of text)
- Earlier approaches to text classification
assigned labels to documents based on CONTENTS
- 1. Word-based techniques
- - statistical (tf, idf)
- - term/keyword searches
- Advantage: Simple and can be automated
- Disadvantage: Phrases cannot be extracted
- 2. Phrase-based techniques
- a) In-depth NLP: Here we aspire to represent all
the information in a text using context.
- - syntax
- - semantics
- - statistics
- Advantage: General, task-independent
representation
- Disadvantage: Costly, not possible in
polynomial time
- b) Information Extraction: Here we delimit in
advance, as part of the specification of a task,
the semantic range of the output, the relations
we will represent and the allowable fillers in
each slot.
- Advantage: It works well for a specific corpus
3Limitations of the Earlier Approaches to Text
Classification
- Texts have, besides content, STYLE, which has not
been accounted for.
- The focus of this talk is to present STYLE as
a new basis for text classification.
4COMPUTATIONAL STYLISTICS
- The study of style, in other words the
detection of patterns common to a body of writing, is
known as STYLISTICS.
- When stylistic analysis uses computer-aided
and statistical methods to analyse texts,
the field of study is called COMPUTATIONAL
STYLISTICS.
5Related Work in Computational Stylistics
- 1. Pre-WWW Era
- - Author Attribution Studies
- Popular: Mosteller and Wallace's study of
anonymous essays published in THE FEDERALIST to
identify the authors (Hamilton and Madison).
- Stylistic parameters: sentence length, content
words (nouns, adjectives, verbs), function
words (prepositions, conjunctions), use of "by", "from",
and "to", ...
- Came up with the interesting result that content
words were too subject-dependent and were not
good discriminators, while function words were
good discriminators.
- - Automatic Abstracting
- Borko and Chatman advanced the view that it
seems possible to make stylistic distinctions
between informative abstracts (which discuss the research)
and indicative abstracts (which discuss the article
that describes the research), based
on the form, voice, tense, and focus of the abstract.
- - Teaching writing styles for different types of
documents.
- Writer's Workbench program on AT&T Unix.
6Related Work in Computational Stylistics (contd.)
- 2. WWW Era (ongoing)
- - Stylistic variation between the different genres
found in the Wall Street Journal (Jussi
Karlgren, Troy Straszheim).
- Examples: Articles, Business News with
tables, Business News, Lists of briefs,
Editorials, Letters, Briefs, What's New,
Tables.
- Uses simple stylistic parameters:
characters/word, digits/keywords, words/sentence.
- - Establishing a genre palette for internet
material (Jussi Karlgren, John Dewe, Ivan Bretan).
7- Definition of a Genre/Functional Style
- A genre is a set of documents with a perceived consistent
tendency to make the same stylistic choices;
specifically, if it has an established
communicative function, it is a functional style.
- Genres can have differing usefulness
- Genres in my work (Corpus)
- Editorials from The Hindu
- Editorials from the Hindustan Times
- Editorials from The Times of India
8- Hypothesis
- Editorials from each newspaper show a systematic
and consistent difference in the choice of
presentation style, specifically to establish
some intended communication function (aggressive,
conservative, liberal).
- Aim of the Experiment
- To find a descriptive and predictive algorithm
for classifying editorials from different
newspapers based on stylistic features.
9Mathematical Model
- Two models were explored to find which was
applicable.
- 1. Vector Space Model - used by Salton in the
SMART system (IRS)
- 2. Euclidean Space Model
- Euclidean Space Model
- An n-dimensional Euclidean space, $E^n$, is defined
as the set of all n-tuples of real numbers
$(x_1, x_2, \ldots, x_n)$, where the Euclidean distance in $E^n$
between two points $x = (x_1, x_2, \ldots, x_n)$ and
$y = (y_1, y_2, \ldots, y_n)$ is defined by
$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$$
- In our project, the Euclidean Space represents a
Stylistic Space
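As a minimal sketch of the distance defined above, the function below computes d(x, y) for two feature vectors. The function name and the two example profiles are hypothetical and invented for illustration only; they are not values from the experiment.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical two-feature stylistic profiles (invented values).
profile_a = [12.0, 20.0]
profile_b = [15.0, 24.0]
print(euclidean_distance(profile_a, profile_b))  # 5.0
```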
10- In the Vector Space Model, the distance between two
points x and y is measured by the angle $\theta(x, y)$
formed by the lines from each of the points to
the origin, which is given by
$$\cos \theta(x, y) = \frac{x \cdot y}{(x \cdot x)^{0.5} \, (y \cdot y)^{0.5}}$$
- This failed in stylistic analysis
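The angle measure of the Vector Space Model can be sketched in the same way. The function name and the example vectors are hypothetical. The example only illustrates a general property of the formula: the cosine depends on direction alone, so two vectors that differ only in scale have a cosine of 1 even though they are far apart in Euclidean terms.

```python
import math

def cosine_of_angle(x, y):
    """cos(theta) between two vectors, as (x . y) / (sqrt(x . x) * sqrt(y . y))."""
    dot_xy = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("the angle is undefined for a zero vector")
    return dot_xy / (norm_x * norm_y)

# Invented vectors: v is u scaled by 2, w is perpendicular to u.
u = [1.0, 2.0]
v = [2.0, 4.0]
w = [-2.0, 1.0]
print(cosine_of_angle(u, v))  # very close to 1: same direction
print(cosine_of_angle(u, w))  # 0.0: perpendicular
```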
11Stylistic Profiling
- A method of identifying the stylistic features in
the writing style of an individual or a group of
people and presenting them in a systematic way.
- 1. Lexical Features
- Percentage of interrogative pronouns
- Percentage of emphatic pronouns
- Percentage of prepositions
- Percentage of conjunctions
- Percentage of articles
- Percentage of action words
- Percentage of unique words
- 2. Structural Features
- Average words/sentence
- maximum sentence length
- Total no. of sentences
- Total no. of words
- Total no. of characters
- 3. Affective Features
- Percentage of passive sentences
- Flesch Reading Ease
- Coleman-Liau Grade Level
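A rough sketch of how a few of the features listed above might be computed. Everything here is an assumption made for illustration: the function name, the regular-expression tokenizer, the naive sentence splitter, and the three-word article list are not taken from the talk. Only a subset of the features is covered. The Coleman-Liau index uses its standard formula, 0.0588 L - 0.296 S - 15.8, with L the letters per 100 words and S the sentences per 100 words.

```python
import re

ARTICLES = {"a", "an", "the"}
WORD_PATTERN = r"[A-Za-z]+(?:'[A-Za-z]+)?"

def stylistic_profile(text):
    """Return a small dictionary of structural and lexical features."""
    # Naive sentence split on ., ! or ? (abbreviations will confuse it).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(WORD_PATTERN, text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    lowered = [w.lower() for w in words]
    sentence_lengths = [len(re.findall(WORD_PATTERN, s)) for s in sentences]
    letters = sum(len(w.replace("'", "")) for w in words)
    n_words, n_sentences = len(words), len(sentences)
    letters_per_100_words = 100.0 * letters / n_words
    sentences_per_100_words = 100.0 * n_sentences / n_words
    return {
        "total_sentences": n_sentences,
        "total_words": n_words,
        "total_characters": len(text),
        "avg_words_per_sentence": n_words / n_sentences,
        "max_sentence_length": max(sentence_lengths),
        "pct_unique_words": 100.0 * len(set(lowered)) / n_words,
        "pct_articles": 100.0 * sum(w in ARTICLES for w in lowered) / n_words,
        "coleman_liau_grade": (0.0588 * letters_per_100_words
                               - 0.296 * sentences_per_100_words - 15.8),
    }

profile = stylistic_profile("The cat sat. The dog ran fast!")
print(profile["total_words"], profile["avg_words_per_sentence"])  # 7 3.5
```

Features such as the percentage of passive sentences or of action words would need a part-of-speech tagger or parser, which this sketch leaves out.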
12Classification Algorithm
1. Training Phase
- Training set consisting of 30 editorials each
from H, HT, TI
- Feature Extraction (Lexical, Structural,
Affective) -> 90 SPs
- Conduct ANOVA test to extract the SIGNIFICANT
FEATURES -> 90 FSPs
- Compute the mean for each of the significant
features for each newspaper -> 3 Prototypes:
P-H, P-HT, P-TI
2. Classification Phase
- New instance of editorial
- Significant Feature Extraction -> FSP, I
- Compute the distance between I and each of the
prototypes from the training phase
- Least d(I,P-H): Classify as Hindu
- Least d(I,P-HT): Classify as HT
- Least d(I,P-TI): Classify as TI
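The two phases can be sketched as a nearest-prototype classifier. This is a minimal illustration under stated assumptions, not the implementation used in the experiment: the function names are hypothetical, the feature vectors are invented, and each paper gets two toy editorials with two features instead of 30 editorials with the ANOVA-selected features.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def build_prototypes(training_set):
    """Training phase: map each newspaper label to its mean feature vector."""
    return {label: mean_vector(vectors) for label, vectors in training_set.items()}

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def classify(instance, prototypes):
    """Classification phase: return the label of the nearest prototype."""
    return min(prototypes,
               key=lambda label: euclidean_distance(instance, prototypes[label]))

# Invented significant-feature vectors for H, HT and TI.
training_set = {
    "H": [[10.0, 1.0], [12.0, 3.0]],
    "HT": [[20.0, 5.0], [22.0, 7.0]],
    "TI": [[30.0, 9.0], [32.0, 11.0]],
}
prototypes = build_prototypes(training_set)
print(prototypes["H"])                     # [11.0, 2.0]
print(classify([12.0, 2.0], prototypes))   # H
```

Because features such as total characters and percentages live on very different scales, a real implementation would probably normalise each feature before measuring distance; the talk does not say whether this was done.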
13Results
- 1. Data Collection (SP)
- 2. Results of identifying significant features in
the training phase (FSP)
- One-tailed ANOVA test was carried out
- Null hypothesis: No difference between the
means
- Alternate hypothesis: Means are different
- The ratio of the variance estimates is
calculated, $F = S_b^2 / S_w^2$
- $S_b^2 = S_w^2$ (Check for null hypothesis)
- $S_b^2 > S_w^2$ (Check for alternate hypothesis)
- If $F > F_{crit}$ for a particular significance level,
then we say that the means of the feature are
significantly different
- 3. Results of the classification phase
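The F ratio used to select significant features can be sketched for a single feature as follows. This assumes the usual one-way ANOVA reading of the slide, with S_b^2 as the between-group variance estimate and S_w^2 as the within-group estimate. The function names and sample values are hypothetical, and the critical value is passed in by the caller (for instance from an F table) instead of being computed.

```python
def anova_f_ratio(groups):
    """One-way ANOVA F = S_b^2 / S_w^2 for a list of non-empty value lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more values than groups")
    group_means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares, with k - 1 degrees of freedom.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares, with n_total - k degrees of freedom.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    s_b2 = ss_between / (k - 1)
    s_w2 = ss_within / (n_total - k)
    if s_w2 == 0:
        raise ValueError("within-group variance is zero; F is undefined")
    return s_b2 / s_w2

def is_significant(groups, f_crit):
    """True when F exceeds the critical value supplied by the caller."""
    return anova_f_ratio(groups) > f_crit

# Invented values of one feature for three newspapers, three editorials each.
feature_by_paper = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(anova_f_ratio(feature_by_paper))  # approximately 13
# 5.14 is the tabulated critical value for (2, 6) degrees of freedom at 0.05.
print(is_significant(feature_by_paper, f_crit=5.14))  # True
```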
14- Performance Evaluation
- The following measures were computed:
- Precision = Number-classified-correctly / Number-total-classified
- Recall = Number-classified-correctly / Number-relevant-for-classification
- Conclusion
- The results of the experiment were positive. It
was possible to classify editorials with a good
degree of recall and precision.
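A small sketch of the two measures, computed per newspaper. The function name and the label lists are hypothetical and do not reproduce the experiment's results; "total classified" is read as the number of editorials assigned to the target paper, and "relevant" as the number that truly belong to it.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one target class."""
    pairs = list(zip(true_labels, predicted_labels))
    correct = sum(1 for t, p in pairs if t == target and p == target)
    classified = sum(1 for _, p in pairs if p == target)
    relevant = sum(1 for t, _ in pairs if t == target)
    # Report 0.0 when a denominator is empty (a convention chosen here).
    precision = correct / classified if classified else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall

# Invented labels for five test editorials.
true_labels = ["H", "H", "HT", "TI", "TI"]
predicted_labels = ["H", "HT", "HT", "TI", "H"]
print(precision_recall(true_labels, predicted_labels, "H"))   # (0.5, 0.5)
print(precision_recall(true_labels, predicted_labels, "HT"))  # (0.5, 1.0)
```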
15- Scope for further work
- Currently, it is not clear whether topic and
style are two independent dimensions of variation
in text, or whether they go hand in hand. This can be
further explored by subclassifying editorials
based on topic and then studying each subclass for
stylistic variations.
- Applications
- - For classifying documents on the Internet based
on GENRE
- - Relating FSPs of editorials to the reader
profiles for each newspaper so as to establish
any interesting relationship.