Title: Towards Separating Trigram-generated and Real sentences with SVM
1. Towards Separating Trigram-generated and Real sentences with SVM
- Jerry Zhu
- CALD KDD Lab
- 2001/4/20
2. Domain: Speech Recognition
- A large portion of errors are due to over-generation by trigram language models.
- If we can detect trigram-generated sentences, we can improve accuracy.
- <s> when the director of the many of your father </s>
- <s> and so the the monster and here is obviously a very very profitable business </s>
- <s> his views on today's seek level thanks very much for being with us </s>
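The over-generation above is easy to reproduce: a trigram model only conditions on the previous two words, so it stitches locally plausible word pairs into globally incoherent sentences. A minimal sketch (toy corpus and function names are illustrative, not the actual system):

```python
import random
from collections import defaultdict

def train_trigram(corpus):
    """Count trigram continuations: (w1, w2) -> list of observed next words."""
    model = defaultdict(list)
    for sentence in corpus:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(len(words) - 2):
            model[(words[i], words[i + 1])].append(words[i + 2])
    return model

def generate(model, max_len=20):
    """Sample left to right; each word depends only on the previous two,
    so long-range coherence is never enforced."""
    w1, w2, out = "<s>", "<s>", []
    while len(out) < max_len:
        nxt = random.choice(model[(w1, w2)])
        if nxt == "</s>":
            break
        out.append(nxt)
        w1, w2 = w2, nxt
    return " ".join(out)
```

Sampling repeatedly from such a model, trained on real transcripts, yields the "fake" class used below.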
3. A two-class classification problem
- Fake (trigram-generated) or real sentence?
- Data: 100k fake and 100k real long (> 7 words) sentences.
- Fake sentences don't look right (bad syntax) and don't make sense (bad semantics). The task boils down to finding good features.
- Semantic coherence has been explored (Eneva et al.), but not syntactic features or their combination.
- SVM margin for probabilities.
4. Previous work: semantic features
- Around 70 semantic features, most interestingly:
- Content word co-occurrence statistics
- Content word repetition
- Decision tree with boosting, around 80% accuracy.
- We hope that adding syntactic features will significantly improve accuracy.
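To make the "content word repetition" idea concrete, one simple version scores how often a sentence reuses its own content words; coherent real text tends to repeat topical words, while trigram output drifts. This is an illustrative reconstruction, not the exact feature set of Eneva et al. (the stopword list here is a tiny stand-in):

```python
STOPWORDS = {"the", "a", "of", "and", "is", "to", "in", "that", "it", "for"}

def content_word_repetition(sentence):
    """Fraction of content-word tokens that repeat an earlier content word
    in the same sentence (illustrative version of a repetition feature)."""
    content = [w for w in sentence.lower().split() if w not in STOPWORDS]
    if not content:
        return 0.0
    seen, repeats = set(), 0
    for w in content:
        if w in seen:
            repeats += 1
        seen.add(w)
    return repeats / len(content)
```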
5. Exploring syntactic features
- Bag-of-words features (raw counts, frequency, or binary; linear or polynomial kernel): 57%
- Tag with part-of-speech (39 POS tags):
- when/WRB the/DT director/NN of/IN the/DT many/NN of/IN your/PRP father/NN
- Bag-of-POS: 56%
- Sparse sequences of POS: any k POS tags in that order, weighted by the span. 39k features.
- ( WRB-IN-DT ): weight λ^5
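The sparse-sequence features can be sketched as follows: enumerate every ordered k-subsequence of the POS tags and weight each occurrence by λ raised to its span, as in string-subsequence kernels. The decay form and parameter values here are assumptions for illustration; e.g. WRB-IN-DT from the example above stretches over 5 positions, so it contributes λ^5:

```python
from itertools import combinations
from collections import defaultdict

def sparse_pos_features(tags, k=3, lam=0.5, max_span=8):
    """Weight each ordered k-subsequence of POS tags by lam**span, where
    span is the number of positions it stretches over (string-subsequence-
    kernel style decay; exact weighting is an assumption)."""
    feats = defaultdict(float)
    for idx in combinations(range(len(tags)), k):
        span = idx[-1] - idx[0] + 1
        if span > max_span:
            continue
        feats["-".join(tags[i] for i in idx)] += lam ** span
    return dict(feats)
```

On the tags `["WRB", "DT", "NN", "IN", "DT"]`, the contiguous subsequence WRB-DT-NN gets weight λ^3, while the gappy WRB-IN-DT gets the smaller weight λ^5.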
6. Exploring syntactic features (cont.)
- Sparse sequences work on letters for text categorization, but on POS: 58% (k=3, max span 8)
- Leave stopwords in, together with POS:
- WRB the NN of the many of your NN
- Sparse sequences on stopwords+POS: 57%
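The stopwords+POS representation keeps function words verbatim and abstracts content words to their POS tags. A minimal sketch, assuming input as (word, tag) pairs from a POS tagger; the stopword list is illustrative, chosen to reproduce the slide's example:

```python
STOPWORDS = {"the", "a", "of", "in", "and", "your", "many", "is"}

def stopword_pos_tokens(tagged):
    """Keep stopwords as-is, replace content words by their POS tag.
    `tagged` is a list of (word, pos) pairs."""
    return [w if w.lower() in STOPWORDS else pos for w, pos in tagged]
```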
7. Exploring syntactic features (cont.)
- Stopwords+POS 4-grams: novelty rate; count-distribution likelihood ratio; min, max, median, and mean counts
- These combined with semantic features: 75%
- Semantic features alone: 77%
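The 4-gram statistics above can be sketched as: slide a window of n tokens over the sentence, look each n-gram up in training counts, and summarize. The novelty rate is the fraction of n-grams never seen in training; the exact statistics used (likelihood ratio, etc.) are only named on the slide, so this shows just the count summaries:

```python
def ngram_stats(tokens, train_counts, n=4):
    """Novelty rate (fraction of the sentence's n-grams unseen in training)
    plus min/max/mean of the n-grams' training counts (illustrative)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = [train_counts.get(g, 0) for g in grams]
    return {
        "novelty": sum(c == 0 for c in counts) / len(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": sum(counts) / len(counts),
    }
```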
8. SVM margin
- Empirically, the margin distribution has a good shape
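SVM margins are uncalibrated scores, so turning them into probabilities needs a second step. One standard option (a sketch, not necessarily what was used here) is Platt scaling: fit a sigmoid on held-out (margin, label) pairs and map each margin through it:

```python
import math

def platt_probability(margin, a=-1.0, b=0.0):
    """Map an SVM margin to P(class | x) with a sigmoid (Platt scaling).
    In practice a and b are fit by logistic regression on a held-out set
    of (margin, label) pairs; the values here are illustrative."""
    return 1.0 / (1.0 + math.exp(a * margin + b))
```

A well-shaped margin distribution is exactly what makes such a sigmoid fit work: scores far from the boundary map to confident probabilities.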
9. Summary
- Now we know these features don't work
- SVM wasn't a wise choice given the large amount of data and the high level of noise
10. Future?
- Parsing
- Logistic regression instead of SVM