Title: Tokenization & POS-Tagging
1. Tokenization & POS-Tagging
- presented by Yajing Zhang
- Saarland University
- yazhang_at_coli.uni-sb.de
2. Outline
- Tokenization
- Importance
- Problems & solutions
- POS tagging
- HMM tagger
- TnT statistical tagger
3. Why Tokenization?
- Tokenization is the isolation of word-like units from a text.
- Tokens are the building blocks of other text processing.
- The accuracy of tokenization affects the results of higher-level processing, e.g. parsing.
4. Problems of Tokenization
- Definition of a token
- United States, AT&T, 3-year-old
- Ambiguity of punctuation as sentence boundary
- Prof. Dr. J.M.
- Ambiguity in numbers
- 123,456.78
5. Some Solutions
- Using regular expressions to match numbers and abbreviations (see the sketch below)
- ([0-9]+,)*[0-9]+(\.[0-9]+)?
- [A-Z][bcdfghj-np-tvxz]+\.
- Using a corpus as a filter to identify abbreviations
- Using a lexical list (the most important abbreviations are listed)
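A minimal sketch of this regex-based classification in Python; the two patterns are reconstructions of the slide's patterns (bracket characters were lost in extraction) and the token list is illustrative:

```python
import re

NUMBER = re.compile(r'([0-9]+,)*[0-9]+(\.[0-9]+)?')   # e.g. 123,456.78
ABBREV = re.compile(r'[A-Z][bcdfghj-np-tvxz]+\.')     # consonant abbreviations, e.g. Dr.

def classify(token):
    if NUMBER.fullmatch(token):
        return 'NUMBER'
    if ABBREV.fullmatch(token):
        return 'ABBREVIATION'  # the period stays attached to the token
    return 'WORD'

for t in ['123,456.78', 'Dr.', 'arrow']:
    print(t, '->', classify(t))
```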
6. POS Tagging
- Labeling each word in a sentence with its appropriate part of speech
- Information sources in tagging
- Tags of other words in the context
- The word itself
- Different approaches
- Rule-based Tagger
- Stochastic POS Tagger
- Simplest stochastic Tagger
- HMM Tagger
7. Simplest Stochastic Tagger
- Each word is assigned its most frequent tag (the tag it receives most frequently in the training set)
- Problem: this may generate a valid tag for each word, but an unacceptable tag sequence (see the sketch below)
- Time flies like an arrow
- NN VBZ VB DT NN
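A minimal sketch of this baseline, assuming training data as (word, tag) pairs; the tiny corpus below is made up to reproduce the slide's example:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_words):
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    # Keep only the most frequent tag for each word.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

training = [('time', 'NN'), ('time', 'NN'), ('time', 'VB'), ('flies', 'VBZ'),
            ('like', 'VB'), ('an', 'DT'), ('arrow', 'NN')]
tagger = train_unigram_tagger(training)
print([tagger.get(w, 'NN') for w in ['time', 'flies', 'like', 'an', 'arrow']])
# -> ['NN', 'VBZ', 'VB', 'DT', 'NN']: each tag is valid for its word in
#    isolation, but the tag sequence as a whole is unacceptable.
```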
8. Markov Models (MM)
- In a Markov chain, the next element of the sequence depends only on the current element, not on the past elements.
- Let X = (X_1, \ldots, X_T) be a sequence of random variables taking values in the state space S = {s_1, \ldots, s_N}. Then:
- Limited horizon: P(X_{t+1} = s_k \mid X_1, \ldots, X_t) = P(X_{t+1} = s_k \mid X_t)
- Time invariance: P(X_{t+1} = s_k \mid X_t) = P(X_2 = s_k \mid X_1)
9. Example of Markov Models (MM)
Cf. Manning & Schütze, 1999, page 319
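The example itself is a figure and is not reproduced here; in the same spirit, a toy Markov chain with made-up states and probabilities, showing that a sequence probability factors into single-step transitions:

```python
# Toy Markov chain: state names and probabilities are invented for illustration.
initial = {'cola': 0.5, 'iced_tea': 0.5}
trans = {('cola', 'cola'): 0.7, ('cola', 'iced_tea'): 0.3,
         ('iced_tea', 'cola'): 0.5, ('iced_tea', 'iced_tea'): 0.5}

def sequence_probability(states):
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[(prev, cur)]  # Markov property: only the current state matters
    return p

print(sequence_probability(['cola', 'iced_tea', 'cola']))  # 0.5 * 0.3 * 0.5 = 0.075
```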
10. Hidden Markov Model
- In a (visible) MM, we know which state sequence the model passes through, so the state sequence itself is regarded as the output.
- In an HMM, we don't know the state sequence, only some probabilistic function of it.
- Markov models can be used wherever one wants to model the probability of a linear sequence of events.
- HMMs can be trained from unannotated text.
11. HMM Tagger
- Assumption: a word's tag depends only on the previous tag, and this dependency does not change over time.
- An HMM tagger uses states to represent POS tags and outputs (symbol emissions) to represent the words.
- The tagging task is to find the most probable tag sequence for a sequence of words.
12. Finding the Most Probable Sequence
Cf. Erhard Hinrichs & Sandra Kübler
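The slide's derivation is a figure; as a stand-in, a minimal Viterbi sketch for finding the most probable tag sequence under such an HMM. The tagset, transition, and emission probabilities are toy values, not the figures from the original example:

```python
# trans[(t_prev, t)] = P(t | t_prev); emit[(t, w)] = P(w | t); toy numbers only.
def viterbi(words, tags, start, trans, emit):
    # best[t] = (probability, tag sequence) of the best path ending in tag t
    best = {t: (start.get(t, 0.0) * emit.get((t, words[0]), 0.0), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            # choose the predecessor that maximizes the path probability
            p, seq = max((best[tp][0] * trans.get((tp, t), 0.0), best[tp][1])
                         for tp in tags)
            new_best[t] = (p * emit.get((t, w), 0.0), seq + [t])
        best = new_best
    return max(best.values())

tags = ['NN', 'VBZ']
start = {'NN': 0.8, 'VBZ': 0.2}
trans = {('NN', 'NN'): 0.4, ('NN', 'VBZ'): 0.6,
         ('VBZ', 'NN'): 0.7, ('VBZ', 'VBZ'): 0.3}
emit = {('NN', 'time'): 0.1, ('VBZ', 'time'): 0.01,
        ('NN', 'flies'): 0.01, ('VBZ', 'flies'): 0.1}
print(viterbi(['time', 'flies'], tags, start, trans, emit))
# -> (0.0048, ['NN', 'VBZ'])
```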
13. HMM Tagging: an Example
Cf. Erhard Hinrichs & Sandra Kübler
14. HMM Tagging: an Example
Cf. Erhard Hinrichs & Sandra Kübler
15. Calculating the Most Likely Sequence
(Figure: trellis computation; green = transition probabilities, blue = emission probabilities.)
16. Dealing with Unknown Words
- The simplest models assume that unknown words can have any POS tag, or assign the most frequent tag in the tagset.
- In practice, morphological information such as the suffix is used as a hint (see the sketch below).
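A minimal sketch of a suffix heuristic; the suffix-to-tag rules are invented for illustration and are far cruder than TnT's suffix model:

```python
# Toy suffix rules for unknown English words; checked in order, longest first.
SUFFIX_HINTS = [('ing', 'VBG'), ('ed', 'VBD'), ('ly', 'RB'), ('s', 'NNS')]

def guess_tag(word, default='NN'):
    for suffix, tag in SUFFIX_HINTS:
        if word.endswith(suffix):
            return tag
    return default  # fall back to a frequent open-class tag

print(guess_tag('refactoring'))  # -> VBG
print(guess_tag('blorp'))        # -> NN
```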
17. TnT (Trigrams'n'Tags)
- A statistical tagger using Markov models: states represent tags and outputs represent words.
- Finding the best tags amounts to calculating:
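The formula did not survive extraction; per Brants (2000), TnT maximizes the product of trigram transition probabilities and lexical probabilities over the sentence:

```latex
% Objective maximized by TnT over candidate tag sequences t_1 ... t_T
% for words w_1 ... w_T (Brants 2000).
\operatorname*{argmax}_{t_1 \ldots t_T}
  \left[ \prod_{i=1}^{T} P(t_i \mid t_{i-1}, t_{i-2})\, P(w_i \mid t_i) \right]
  P(t_{T+1} \mid t_T)
```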
18. Transition and Emission Probabilities
- Transition and output probabilities are estimated from a tagged corpus as maximum likelihood estimates from n-gram frequencies f:
- Bigrams: \hat{P}(t_3 \mid t_2) = f(t_2, t_3) / f(t_2)
- Trigrams: \hat{P}(t_3 \mid t_1, t_2) = f(t_1, t_2, t_3) / f(t_1, t_2)
- Lexical: \hat{P}(w_3 \mid t_3) = f(w_3, t_3) / f(t_3)
19. Smoothing Technique
- Needed due to the sparse-data problem
- Most trigram frequencies are likely to be zero in a limited corpus
- Without smoothing, the complete sentence probability then becomes zero
- Smoothing by linear interpolation of unigram, bigram, and trigram estimates:
- P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2)
- where \lambda_1 + \lambda_2 + \lambda_3 = 1
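Brants (2000) sets the lambda values by deleted interpolation. A sketch of that algorithm, with assumed data structures (frequency counts over tag unigrams, bigrams, and trigrams, and the corpus size N):

```python
# uni[t], bi[(t1, t2)], tri[(t1, t2, t3)] are tag n-gram frequencies from a
# tagged corpus; N is the total number of tags. Names are ours, not the paper's.
def deleted_interpolation(uni, bi, tri, N):
    l1 = l2 = l3 = 0.0
    for (t1, t2, t3), f in tri.items():
        # Compare the three estimates with the current trigram removed.
        c3 = (tri[(t1, t2, t3)] - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0
        c2 = (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0
        c1 = (uni[t3] - 1) / (N - 1)
        # Credit this trigram's frequency to the best-performing estimate.
        best = max(c1, c2, c3)
        if best == c3:
            l3 += f
        elif best == c2:
            l2 += f
        else:
            l1 += f
    total = l1 + l2 + l3
    return l1 / total, l2 / total, l3 / total  # lambda_1, lambda_2, lambda_3
```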
20. Other Techniques
- Handling unknown words
- The longest suffix (the final sequence of characters of a word) is used as a strong predictor for word classes
- Calculate the probability of a tag t given the last m letters l_i of an n-letter word; m depends on the specific word
- Capitalization
- Works better for English than for German
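For reference, the suffix model in Brants (2000) smooths the suffix-conditioned tag probabilities recursively with successively shorter suffixes; the exact weighting below is our reading of the paper:

```latex
% Tag probability given the last i letters of a word, smoothed with the
% estimate for the next-shorter suffix (Brants 2000); theta_i are weights.
P(t \mid l_{n-i+1}, \ldots, l_n) =
  \frac{\hat{P}(t \mid l_{n-i+1}, \ldots, l_n)
        + \theta_i \, P(t \mid l_{n-i+2}, \ldots, l_n)}
       {1 + \theta_i}
```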
21. Evaluation
- Corpora:
- German NEGRA corpus: around 355,000 tokens
- WSJ (Wall Street Journal) portion of the Penn Treebank: around 1.2 million tokens
- 10-fold cross-validation
- The tagger assigns tags as well as probabilities to words
- → the probabilities can be used to rank different assignments
22. Results for German and English
23. POS Learning Curve for NEGRA
24. Learning Curve for the Penn Treebank
25. Conclusion
- Good results for both the German and the English corpus
- The average accuracy TnT achieves is between 96% and 97%
- The accuracy for known tokens is significantly higher than for unknown tokens
26. References
- What is a word, what is a sentence? (Grefenstette, 1994)
- POS-Tagging and Partial Parsing (Abney, 1996)
- TnT: A Statistical Part-of-Speech Tagger (Brants, 2000)
- Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999)