Title: Towards Separating Trigram-generated and Real sentences with SVM
1. Towards Separating Trigram-generated and Real sentences with SVM
- Jerry Zhu
- CALD KDD Lab
- 2001/4/20
2. Domain: Speech Recognition
- A large portion of errors are due to over-generation by trigram language models.
- If we can detect trigram-generated sentences, we can improve accuracy.
- <s> when the director of the many of your father </s>
- <s> and so the the monster and here is obviously a very very profitable business </s>
- <s> his views on today's seek level thanks very much for being with us </s>
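The over-generation above is easy to reproduce: a trigram model only conditions on the previous two words, so it stitches locally plausible word pairs into globally incoherent sentences. A minimal sketch (toy corpus and function names are illustrative, not the actual system):

```python
import random
from collections import defaultdict

def train_trigram(corpus):
    """Count trigram continuations: (w1, w2) -> list of observed next words."""
    model = defaultdict(list)
    for sentence in corpus:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(len(words) - 2):
            model[(words[i], words[i + 1])].append(words[i + 2])
    return model

def generate(model, max_len=20):
    """Sample left to right; each word depends only on the previous two,
    so long-range coherence is never enforced."""
    w1, w2, out = "<s>", "<s>", []
    while len(out) < max_len:
        nxt = random.choice(model[(w1, w2)])
        if nxt == "</s>":
            break
        out.append(nxt)
        w1, w2 = w2, nxt
    return " ".join(out)
```

Sampling repeatedly from such a model, trained on real transcripts, yields the "fake" class used below.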
3. A two-class classification problem
- Fake (trigram-generated) or real sentence?
- Data: 100k fake and 100k real long (> 7 words) sentences.
- Fake sentences don't look right (bad syntax) and don't make sense (bad semantics). The task boils down to finding good features.
- Semantic coherence has been explored (Eneva et al.), but not syntactic features or their combination.
- SVM margin for probabilities.
4. Previous work: semantic features
- Around 70 semantic features, most interestingly:
- Content word co-occurrence statistics
- Content word repetition
- Decision tree with boosting, around 80% accuracy.
- We hope that adding syntactic features will significantly improve accuracy.
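To make the "content word repetition" idea concrete, one simple version scores how often a sentence reuses its own content words; coherent real text tends to repeat topical words, while trigram output drifts. This is an illustrative reconstruction, not the exact feature set of Eneva et al. (the stopword list here is a tiny stand-in):

```python
STOPWORDS = {"the", "a", "of", "and", "is", "to", "in", "that", "it", "for"}

def content_word_repetition(sentence):
    """Fraction of content-word tokens that repeat an earlier content word
    in the same sentence (illustrative version of a repetition feature)."""
    content = [w for w in sentence.lower().split() if w not in STOPWORDS]
    if not content:
        return 0.0
    seen, repeats = set(), 0
    for w in content:
        if w in seen:
            repeats += 1
        seen.add(w)
    return repeats / len(content)
```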
5. Exploring syntactic features
- Bag-of-words features (raw counts, frequency, or binary; linear or polynomial kernel): 57%
- Tag with part-of-speech (39 POS tags):
- when/WRB the/DT director/NN of/IN the/DT many/NN of/IN your/PRP father/NN
- Bag-of-POS: 56%
- Sparse sequences of POS: any k POS tags in that order, weighted by the span. 39k features.
- ( WRB-IN-DT ): weight λ^5
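The sparse-sequence features can be sketched as follows: enumerate every ordered k-subsequence of the POS tags and weight each occurrence by λ raised to its span, as in string-subsequence kernels. The decay form and parameter values here are assumptions for illustration; e.g. WRB-IN-DT from the example above stretches over 5 positions, so it contributes λ^5:

```python
from itertools import combinations
from collections import defaultdict

def sparse_pos_features(tags, k=3, lam=0.5, max_span=8):
    """Weight each ordered k-subsequence of POS tags by lam**span, where
    span is the number of positions it stretches over (string-subsequence-
    kernel style decay; exact weighting is an assumption)."""
    feats = defaultdict(float)
    for idx in combinations(range(len(tags)), k):
        span = idx[-1] - idx[0] + 1
        if span > max_span:
            continue
        feats["-".join(tags[i] for i in idx)] += lam ** span
    return dict(feats)
```

On the tags `["WRB", "DT", "NN", "IN", "DT"]`, the contiguous subsequence WRB-DT-NN gets weight λ^3, while the gappy WRB-IN-DT gets the smaller weight λ^5.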
6. Exploring syntactic features (cont.)
- Sparse sequences work on letters for text categorization, but on POS: 58% (k=3, max span 8)
- Leave stopwords in, together with POS:
- WRB the NN of the many of your NN
- Sparse sequences on stopwords+POS: 57%
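The stopwords+POS representation keeps function words verbatim and abstracts content words to their POS tags. A minimal sketch, assuming input as (word, tag) pairs from a POS tagger; the stopword list is illustrative, chosen to reproduce the slide's example:

```python
STOPWORDS = {"the", "a", "of", "in", "and", "your", "many", "is"}

def stopword_pos_tokens(tagged):
    """Keep stopwords as-is, replace content words by their POS tag.
    `tagged` is a list of (word, pos) pairs."""
    return [w if w.lower() in STOPWORDS else pos for w, pos in tagged]
```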
7. Exploring syntactic features (cont.)
- Stopwords+POS 4-grams: novelty rate; count-distribution likelihood ratio; min, max, median, and mean counts
- These combined with semantic features: 75%
- Semantic features alone: 77%
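The 4-gram statistics above can be sketched as: slide a window of n tokens over the sentence, look each n-gram up in training counts, and summarize. The novelty rate is the fraction of n-grams never seen in training; the exact statistics used (likelihood ratio, etc.) are only named on the slide, so this shows just the count summaries:

```python
def ngram_stats(tokens, train_counts, n=4):
    """Novelty rate (fraction of the sentence's n-grams unseen in training)
    plus min/max/mean of the n-grams' training counts (illustrative)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = [train_counts.get(g, 0) for g in grams]
    return {
        "novelty": sum(c == 0 for c in counts) / len(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": sum(counts) / len(counts),
    }
```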
8. SVM margin
- Empirically, the margin distribution has a good shape
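SVM margins are uncalibrated scores, so turning them into probabilities needs a second step. One standard option (a sketch, not necessarily what was used here) is Platt scaling: fit a sigmoid on held-out (margin, label) pairs and map each margin through it:

```python
import math

def platt_probability(margin, a=-1.0, b=0.0):
    """Map an SVM margin to P(class | x) with a sigmoid (Platt scaling).
    In practice a and b are fit by logistic regression on a held-out set
    of (margin, label) pairs; the values here are illustrative."""
    return 1.0 / (1.0 + math.exp(a * margin + b))
```

A well-shaped margin distribution is exactly what makes such a sigmoid fit work: scores far from the boundary map to confident probabilities.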
9. Summary
- Now we know these features don't work
- SVM wasn't a wise choice given the large amount of data and the high level of noise
10. Future?
- Parsing
- Logistic regression instead of SVM