Author Identification for LiveJournal

Slides: 6
Provided by: alyssa4
Learn more at: https://nlp.stanford.edu
Transcript and Presenter's Notes

1
Author Identification for LiveJournal
  • Alyssa Liang

2
The problem
  • LiveJournal: a blogging website
  • Given a document (an entry), identify the author
  • Hierarchical classification
      • first classify by gender
      • then classify the author within the predicted gender
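
The two-stage scheme above can be sketched as follows. This is only a minimal illustration, not the presenter's implementation: the function name make_hierarchical and the toy stand-in classifiers are assumptions introduced here, and any real gender or author classifier could be plugged in instead.

```python
from typing import Callable, Dict

# A classifier is any function that maps an entry's text to a label.
Classifier = Callable[[str], str]


def make_hierarchical(gender_clf: Classifier,
                      author_clfs: Dict[str, Classifier]) -> Classifier:
    """Build a two-stage classifier: gender first, then author within that gender."""
    def classify(entry: str) -> str:
        gender = gender_clf(entry)          # stage 1: predict gender
        return author_clfs[gender](entry)   # stage 2: author model for that gender
    return classify


# Toy stand-ins (hypothetical rules, for illustration only).
def toy_gender(entry: str) -> str:
    return "female" if "!" in entry else "male"


toy_authors = {
    "female": lambda e: "author_a" if len(e) > 20 else "author_b",
    "male": lambda e: "author_c",
}

classify = make_hierarchical(toy_gender, toy_authors)
```

One consequence of this design is that a wrong gender prediction in the first stage cannot be corrected in the second, since only that gender's authors are considered.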

3
Features
  • Unigrams and bigrams
  • Average sentence and word length
  • Number of words and distinct words
  • Number of sentences in paragraph
  • Number of UPPERCASE characters
  • Number of words not in the dictionary
  • Number of words with length < 4
  • Number of characters in italics, bold, and
    strikethrough
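
A rough sketch of how several of these features might be computed from raw text. The tokenization rules, the function name extract_features, and the dictionary argument are assumptions made for this example. The slides do not say how words or sentences were split, and the markup-based counts (italics, bold, strikethrough) are left out because they depend on the entry's HTML.

```python
import re


def extract_features(text, dictionary=frozenset()):
    """Compute a few stylometric features for one entry (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", text)                 # crude word tokenizer
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lower = [w.lower() for w in words]
    return {
        "num_words": len(words),
        "num_distinct_words": len(set(lower)),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "num_uppercase_chars": sum(c.isupper() for c in text),
        "num_short_words": sum(len(w) < 4 for w in words),  # length < 4
        "num_unknown_words": sum(w not in dictionary for w in lower),
        "bigrams": list(zip(lower, lower[1:])),             # adjacent word pairs
    }


features = extract_features("I LOVE cats. Cats love me!",
                            dictionary={"i", "love", "cats"})
```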

4
The 3 Classifiers
  • Naïve Bayes: generative model
      • Assumes features in a document are independent of
        each other given the class
      • Implemented the multivariate Bernoulli model
      • Represents only whether a feature appears in a
        document, not how many times it appears
  • Decision Trees
      • An internal node is a test of a feature, and
        each branch from the node represents one of the
        values it can take
      • A leaf node represents a classification
      • Build a smallish tree from the training data
        using minimum average entropy
  • Maximum Entropy: conditional model
      • Model all that is known and assume nothing about
        what is unknown
      • Tries to find the most uniform model that satisfies
        the constraints, i.e. maximizes the entropy
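
To make the multivariate Bernoulli idea concrete, here is a small sketch of a Naïve Bayes classifier that looks only at feature presence or absence. The function names, the add-one smoothing, and the toy data are assumptions for illustration. The slides do not describe the smoothing or data structures actually used.

```python
import math


def train_bernoulli_nb(docs, labels):
    """docs: list of sets of features present in each document."""
    vocab = sorted(set().union(*docs))
    model = {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        prior = len(class_docs) / len(docs)
        # P(feature present | class), with add-one smoothing (an assumption here)
        present = {f: (sum(f in d for d in class_docs) + 1) / (len(class_docs) + 2)
                   for f in vocab}
        model[c] = (prior, present)
    return model


def predict(model, doc):
    """Return the class with the highest log-posterior for a set of features."""
    def log_posterior(c):
        prior, present = model[c]
        score = math.log(prior)
        for f, p in present.items():
            # Absent features also contribute, via (1 - p): the Bernoulli part.
            score += math.log(p if f in doc else 1.0 - p)
        return score
    return max(model, key=log_posterior)


docs = [{"lol", "!!"}, {"lol"}, {"the", "of"}, {"the"}]
labels = ["author_a", "author_a", "author_b", "author_b"]
model = train_bernoulli_nb(docs, labels)
```

Because each document is a set, repeated occurrences of a feature are ignored, which matches the presence-only representation described above.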

5
Results
Hierarchical
  • Hierarchical classification has no benefits
      • Need to improve gender classification; could use
        different features
  • Feature Reduction (on gender classification)
      • Took the 512 most important features and reran maxent
        training, then took the 256 most important features,
        etc.
      • Proved to be very stable
      • Best features consisted mostly of bigrams (many
        of which contained punctuation)
      • Also chose features where there was a large
        difference between male and female (number of
        distinct words, UPPERCASE letters, short words)
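
The halving procedure described under Feature Reduction might look roughly like the loop below. The retraining step is passed in as a callback because the slides do not give the maxent implementation. The names reduce_features and toy_weights are hypothetical, and toy_weights is a deliberately simple stand-in (a difference in class frequencies), not a maximum entropy trainer.

```python
def reduce_features(docs, labels, features, train, start_size=512):
    """Retrain, keep the `size` highest-|weight| features, halve `size`, repeat."""
    current = set(features)
    size = start_size
    stages = []
    while size >= 1:
        weights = train(docs, labels, current)   # e.g. a maxent trainer
        ranked = sorted(current, key=lambda f: (-abs(weights.get(f, 0.0)), f))
        current = set(ranked[:size])
        stages.append((size, sorted(current)))
        size //= 2
    return stages


def toy_weights(docs, labels, features):
    """Stand-in scorer (NOT maxent): class-frequency difference per feature."""
    pos = [d for d, y in zip(docs, labels) if y == "female"]
    neg = [d for d, y in zip(docs, labels) if y != "female"]
    return {f: sum(f in d for d in pos) / len(pos)
               - sum(f in d for d in neg) / len(neg)
            for f in features}


docs = [{"!!", "so"}, {"!!", "the"}, {"the", "so"}, {"the"}]
labels = ["female", "female", "male", "male"]
stages = reduce_features(docs, labels, {"!!", "so", "the"},
                         toy_weights, start_size=2)
```

In a fuller version, accuracy would be measured on held-out data at each stage to check how stable the results remain as features are removed.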