Title: Applicability of N-Grams to Data Classification
1. Applicability of N-Grams to Data Classification
- A review of 3 NLP-related papers
- Presented by Andrei Missine
- (CS 825, Fall 2003)
2. What are N-Grams?
- Sequences of words or tokens from a corpus.
- Used to predict the probability of a word W being the next word given up to (n - 1) preceding words (see the sketch below).
- Common n-grams: unigrams, bigrams, trigrams and four-grams.
- One of the simpler statistical models used in NLP.
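To make this concrete, here is a minimal Python sketch (my illustration, not from any of the reviewed papers) that extracts n-grams from a token list and estimates a next-word probability with plain maximum-likelihood counts:

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-token sequences, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus; a real model would be trained on far more text.
corpus = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(ngrams(corpus, 2))
unigram_counts = Counter(corpus)

def p_next(prev, word):
    # MLE bigram estimate: P(word | prev) = count(prev word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_next("the", "cat"))  # 2/3: "the" is followed by "cat" in 2 of its 3 uses

Real models also smooth these counts so unseen n-grams do not get probability zero, but the counting scheme above is the heart of the idea.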
3. N-Grams and Authorship Attribution
- Authorship Attribution is the process of determining who the author of a given text is.
- An approach suggested by the authors of this paper(1) is to parse a known document written by an author A1 on the byte level and to extract n-grams.
- The most frequent n-grams are then saved as the author profile for this author (A1).
- This process is repeated for all other authors (A2 ... An). We now have a collection of author profiles.
- Given a new text, it is compared against the existing profiles, and the profile with the smallest dissimilarity is chosen as the most likely author (see the sketch below).
(1) N-Gram-based Author Profiles for Authorship
Attribution
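As a minimal sketch of the profile scheme described above: byte n-grams are counted, the most frequent ones form each author's profile, and the least dissimilar profile wins. The n-gram length, profile size and file names are illustrative assumptions, and the simple L1 distance merely stands in for the paper's own dissimilarity measure:

from collections import Counter

def byte_ngram_profile(data, n=3, top_k=1000):
    # Relative frequencies of the top_k most frequent byte n-grams.
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(top_k)}

def dissimilarity(p, q):
    # Stand-in measure: L1 distance over the union of observed n-grams.
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))

def attribute(text, profiles):
    # Pick the author whose profile is least dissimilar to the new text.
    candidate = byte_ngram_profile(text)
    return min(profiles, key=lambda a: dissimilarity(candidate, profiles[a]))

# Hypothetical training files, one known document per author.
profiles = {author: byte_ngram_profile(open(path, "rb").read())
            for author, path in [("A1", "a1.txt"), ("A2", "a2.txt")]}
print(attribute(open("unknown.txt", "rb").read(), profiles))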
4. N-Grams on Byte Level?
- Instead of treating text as a collection of words, just look at the bytes.
- No modifications to the algorithm are required when switching between languages.
- The good side: the experiment performed with 100%(2) accuracy for English and 97%(2) accuracy for Greek data. This is much better than any of the previously attempted methods.
- The bad side: this approach did worse on Chinese data, performing with 89%(2) accuracy (the previously achieved accuracy is 94%).
- A likely reason for this is that many Asian languages use Unicode (2 bytes) to encode characters, so some n-grams might include only half of a character (illustrated below).
(2) Best achieved accuracy
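The half-character problem is easy to demonstrate. Assuming a fixed two-byte encoding (UTF-16 here, matching the slide's "2 bytes" premise), byte bigrams can straddle character boundaries:

text = "中文"                            # two Chinese characters
data = text.encode("utf-16-le")          # two bytes per character
bigrams = [data[i:i + 2] for i in range(len(data) - 1)]
# The middle bigram pairs the second byte of the first character with
# the first byte of the second, i.e. two half-characters.
print(bigrams)  # [b'-N', b'N\x87', b'\x87e']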
5. N-Grams and Sentiment Classification
- In this particular paper(3) the authors discuss how n-grams and machine learning can be applied to classifying movie reviews as positive or negative.
- The main reasons why movie reviews were chosen are their wide availability, the ease of programmatically determining whether a review is positive or negative (e.g. by the number of stars) and, finally, the large number of different reviewers.
- Some preliminary results: the chance of guessing the classification is 50%. When two computer science graduate students were asked to provide lists of positive and negative words, the results were 58% and 64% accurate. Finally, when a statistical method was applied to get such a list, the accuracy was 69%.
(3) Thumbs up? Sentiment Classification using
Machine Learning Techniques
6. N-Grams and Sentiment Classification (continued)
- So how well did machine learning do?
- Naïve Bayes classification has the best performance of 81.5% when unigrams and parts of speech(4) are used.
- Maximum Entropy classification has a slightly lower best performance of 81.0% when the top 2633 unigrams are chosen.
- Support Vector Machines have the best overall performance of the three, with the highest, 82.9%, achieved when 16153 unigrams were used.
- Notes
- The data was acquired from a corpus collected from IMDb.
- Interestingly, the presence of the n-grams appears to be more important than their frequency in this application (a sketch of such a presence-based setup follows below).
(4) As mentioned by the authors, a crude form of sense disambiguation
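As a present-day sketch of that last observation, the winning flavour of feature extraction (binary unigram presence feeding an SVM) can be reproduced with scikit-learn. The toy reviews below are invented stand-ins for the IMDb corpus, and this is not the authors' original tooling:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reviews = ["a brilliant, moving film", "dull plot and wooden acting",
           "great performances throughout", "a boring waste of time"]
labels = ["pos", "neg", "pos", "neg"]

# binary=True records only whether a unigram is present, mirroring the
# finding that presence matters more than frequency here.
model = make_pipeline(CountVectorizer(binary=True), LinearSVC())
model.fit(reviews, labels)
print(model.predict(["a moving and brilliant plot"]))  # toy prediction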
7. N-Grams and Sentiment Classification (continued)
- Problems
- Why is machine learning not doing so well on some articles?
- Sometimes considering just the n-grams is not enough: one needs to look at the broader context in which they are used.
- One of the examples provided by the authors is "thwarted expectations", where the reviewer goes on describing how great the movie should have been, and finishes with a quick comment on how bad it turned out. In this case there will be a large amount of positive information and only a small bit of negative, and the article might wrongfully get a positive rating.
- The converse of the above is also true: an article might wrongfully get a negative rating on a positive review such as "It was sick, disgusting and disturbing... It was great!"(5)
(5) Same idea as the Spice Girls review in the
paper
8. Affect Sensing on the Sentence Level
- The last approach(6) I examined is based on affect sensing: trying to apply well-known facts to a sentence and thus detect the overall mood.
- The source of common-sense information used was Open Mind Common Sense (OMCS), which has 500,000 sentences in its corpus.
- Some simple linguistic models were used in conjunction with a smoothing model, which is responsible for determining how the mood is carried over from one sentence to the next (see the sketch below).
- These were combined to produce an email client which attempts to react emotionally (via a simple drawing of a face) to the user's text.
- The approach used by the authors is different from n-grams.
(6) A Model of Textual Affect Sensing using
Real-World Knowledge
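The authors' smoothing machinery is richer than this, but the core idea (mood carrying over from one sentence into the next) can be sketched as a simple decay. The per-sentence scores and the carry factor below are purely illustrative assumptions, not the paper's model:

def smooth_moods(sentence_scores, carry=0.5):
    # Carry a fraction of the previous mood into each new sentence,
    # so the displayed face changes gradually rather than flipping.
    mood, smoothed = 0.0, []
    for score in sentence_scores:
        mood = carry * mood + (1 - carry) * score
        smoothed.append(round(mood, 3))
    return smoothed

# A cheerful opening, a neutral middle, then a gloomy sentence:
print(smooth_moods([0.8, 0.0, -0.6]))  # [0.4, 0.2, -0.2]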
9. Affect Sensing versus N-Grams
- Can be used to provide the user with a friendlier and more natural interface.
- The structure proposed by the authors can handle negations and slightly trickier linguistic structures than most simple n-gram-based approaches.
- Can use common sense to infer more information than n-grams.
- Comes at the price of much more complicated algorithms and a dependency on language-specific sources such as OMCS.
- Affect sensing is very young and has not been evaluated thoroughly, whereas n-grams have been around for some time and are well studied.
- Final note: neither can handle sarcasm. "Yeah, right."
10. References
- "N-gram-based Author Profiles for Authorship Attribution" by Vlado Keselj, Fuchun Peng, Nick Cercone and Calvin Thomas. In Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING'03, Dalhousie University, Halifax, Nova Scotia, Canada, August 2003. http://www.cs.dal.ca/~vlado/papers/pacling03-keselj-etc.html
- "Thumbs up? Sentiment Classification using Machine Learning Techniques" by Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2002. http://citeseer.nj.nec.com/pang02thumbs.html
- "A Model of Textual Affect Sensing using Real-World Knowledge" by Hugo Liu, Henry Lieberman and Ted Selker. International Conference on Intelligent User Interfaces (IUI 2003), Miami, Florida. http://citeseer.nj.nec.com/liu03model.html
- "Foundations of Statistical Natural Language Processing" by Christopher D. Manning and Hinrich Schütze.