Towards Separating Trigram-generated and Real Sentences with SVM (PowerPoint transcript)

1
Towards Separating Trigram-generated and Real Sentences with SVM
  • Jerry Zhu
  • CALD KDD Lab
  • 2001/4/20

2
Domain: Speech Recognition
  • A large portion of recognition errors is due to
    over-generation by trigram language models.
  • If we can detect trigram-generated sentences, we
    can improve accuracy.

<s> when the director of the many of your father </s>
<s> and so the the monster and here is obviously a very very profitable business </s>
<s> his views on today's seek level thanks very much for being with us </s>
3
A two-class classification problem
  • fake (trigram-generated) or real sentence?
  • Data: 100k fake and 100k real long (> 7 words)
    sentences.
  • Fake sentences don't look right (bad syntax) and
    don't make sense (bad semantics). It boils down
    to finding good features.
  • Semantic coherence has been explored (Eneva et
    al.), but not syntactic features or the
    combination of the two.
  • SVM margin for probabilities.
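For concreteness, fake examples like those on the previous slide can be produced by sampling from a count-based trigram model. A minimal sketch, not the generator actually used for this data; sampling proportional to raw counts is an assumption:

```python
import random
from collections import defaultdict

def train_trigram(corpus):
    """Collect trigram continuation lists from tokenized sentences."""
    counts = defaultdict(list)
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(len(toks) - 2):
            counts[(toks[i], toks[i + 1])].append(toks[i + 2])
    return dict(counts)

def generate(counts, max_len=30):
    """Sample a 'fake' sentence by walking the trigram model."""
    h, out = ("<s>", "<s>"), []
    while len(out) < max_len:
        options = counts.get(h)
        if not options:
            break
        w = random.choice(options)  # sampling by raw trigram counts
        if w == "</s>":
            break
        out.append(w)
        h = (h[1], w)
    return out
```

Sentences sampled this way are locally fluent but often globally incoherent, which is exactly the over-generation the classifier targets.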

4
Previous work: semantic features
  • Around 70 semantic features, most interestingly:
  • Content word co-occurrence statistics
  • Content word repetition
  • Decision tree + boosting: around 80% accuracy.
  • We hope the combination of syntactic features
    will significantly improve accuracy.
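A toy version of one such semantic feature, content-word repetition; the stopword list and the normalization below are illustrative assumptions, not the actual features of Eneva et al.:

```python
def repetition_rate(tokens, stopwords):
    """Fraction of content-word tokens that occur more than once
    in the sentence (content word = not in the stopword list)."""
    content = [w.lower() for w in tokens if w.lower() not in stopwords]
    if not content:
        return 0.0
    return sum(content.count(w) > 1 for w in content) / len(content)
```

On the fake example "and so the the monster and here is obviously a very very profitable business", the repeated "very" drives this rate up, hinting at the unnatural repetition patterns trigram output can show.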

5
Exploring syntactic features
  • Bag-of-words features (raw counts, frequency, or
    binary; linear or polynomial kernel): 57%
  • Tag with part-of-speech (39 POS)
  • when/WRB the/DT director/NN of/IN the/DT many/NN
    of/IN your/PRP father/NN
  • Bag-of-POS: 56%
  • Sparse sequences of POS: any k POS tags in that
    order, weighted by the span; 39^k possible
    features.
  • Example: ( WRB-IN-DT ), span 5, weight λ^5.

6
Exploring syntactic features (cont.)
  • Sparse sequences work on letters for text
    categorization, but on POS: 58% (k=3, max span 8)
  • Keep stopwords as words, replace the rest with
    POS tags:
  • WRB the NN of the many of your NN
  • Sparse sequences on stopwords+POS: 57%
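The stopwords+POS representation on the slide's example, sketched with an assumed stopword list:

```python
def stopword_pos(tagged, stopwords):
    """Keep stopwords verbatim; replace every other word by its POS tag."""
    return [word if word in stopwords else pos for word, pos in tagged]
```

With "the", "of", "many", "your" treated as stopwords, the tagged sentence from slide 5 maps to exactly the slide's mixed sequence.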

7
Exploring syntactic features (cont.)
  • Stopwords+POS 4-grams: novelty rate, count
    distribution (likelihood ratio, min, max, median,
    mean counts)
  • These combined with the semantic features: 75%
  • Semantic features alone: 77%
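A sketch of the count-distribution statistics over 4-grams; the likelihood-ratio feature is omitted, and the exact feature definitions here are assumptions:

```python
from statistics import mean, median

def four_gram_stats(seq, train_counts, n=4):
    """Novelty rate (share of n-grams unseen in training) plus
    summary statistics of the n-grams' training counts."""
    grams = [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    if not grams:
        return None
    counts = [train_counts.get(g, 0) for g in grams]
    return {
        "novelty": sum(c == 0 for c in counts) / len(counts),
        "min": min(counts), "max": max(counts),
        "median": median(counts), "mean": mean(counts),
    }
```

The intuition: real sentences should contain 4-grams that are rarer relative to the trigram training data than the 4-grams of sentences the trigram model itself generated.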

8
SVM margin
(figure: the margin distribution shows an "empirically good shape")
9
Summary
  • Now we know these features don't work
  • SVM wasn't a wise choice given the large amount
    of data and the high noise level

10
Future?
  • Parsing
  • Logistic regression instead of SVM