1
Tokenization & POS-Tagging
  • presented by
  • Yajing Zhang
  • Saarland University
  • yazhang@coli.uni-sb.de

2
Outline
  • Tokenization
  • Importance
  • Problems & solutions
  • POS tagging
  • HMM tagger
  • TnT statistical tagger

3
Why Tokenization?
  • Tokenization: the isolation of word-like units
    from a text.
  • Tokens are the building blocks of all other text
    processing.
  • The accuracy of tokenization affects the results
    of higher-level processing, e.g. parsing.

4
Problems of tokenization
  • Definition of a token
  • e.g. United States, AT&T, 3-year-old
  • Ambiguity of punctuation as a sentence boundary
  • e.g. Prof. Dr. J.M.
  • Ambiguity in numbers
  • e.g. 123,456.78

5
Some Solutions
  • Using regular expressions to match numbers and
    abbreviations (see the sketch below)
  • ([0-9]+,)*[0-9]+(\.[0-9]+)?
  • [A-Z][bcdfghj-np-tvxz]+\.
  • Using a corpus as a filter to identify
    abbreviations
  • Using a lexical list (the most important
    abbreviations are listed)
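A minimal sketch of the two patterns above in Python; the bracketed
character classes and quantifiers are assumptions reconstructed from
the slide's patterns:

```python
import re

# Number with optional thousands separators and an optional decimal
# part (assumed reconstruction of the number pattern above).
NUMBER = re.compile(r"([0-9]+,)*[0-9]+(\.[0-9]+)?")

# Abbreviation heuristic: a capital letter followed by one or more
# consonants and a closing period, e.g. "Dr.".
ABBREV = re.compile(r"[A-Z][bcdfghj-np-tvxz]+\.")

print(bool(NUMBER.fullmatch("123,456.78")))  # True
print(bool(ABBREV.fullmatch("Dr.")))         # True
print(bool(ABBREV.fullmatch("dog.")))        # False: starts lowercase
```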

6
POS Tagging
  • Labeling each word in a sentence with its
    appropriate part of speech
  • Information sources in tagging:
  • the tags of other words in the context
  • the word itself
  • Different approaches:
  • rule-based taggers
  • stochastic POS taggers, e.g. the simplest
    stochastic tagger and the HMM tagger

7
Simplest Stochastic Tagger
  • Each word is assigned its most frequent tag (the
    tag most often seen for that word in the training
    set); see the sketch below
  • Problem: this may assign a valid tag to every
    word yet produce an unacceptable tag sequence
  • Time  flies  like  an  arrow
  • NN    VBZ    VB    DT  NN
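A minimal sketch of this baseline, using an illustrative toy training
set (chosen so that "like" is most often tagged as a verb):

```python
from collections import Counter, defaultdict

# Toy training data: (word, tag) pairs. Illustrative only.
train = [("time", "NN"), ("time", "NN"), ("time", "VB"),
         ("flies", "VBZ"), ("flies", "NNS"), ("flies", "VBZ"),
         ("like", "VB"), ("like", "VB"), ("like", "IN"),
         ("an", "DT"), ("arrow", "NN")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def most_frequent_tag(word):
    # Pick the tag most frequently seen with this word in training.
    return counts[word].most_common(1)[0][0]

print([most_frequent_tag(w) for w in "time flies like an arrow".split()])
# ['NN', 'VBZ', 'VB', 'DT', 'NN'] -- each tag is plausible for its
# word, but "like" as VB makes the sequence as a whole unacceptable.
```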

8
Markov Models (MM)
  • In a Markov chain, the next element of the
    sequence depends only on the current element,
    not on the past elements
  • X = (X_1, ..., X_T) is a sequence of random
    variables taking values in the state space
    S = {s_1, ..., s_N}
  • Limited horizon:
    P(X_{t+1} = s_k | X_1, ..., X_t) = P(X_{t+1} = s_k | X_t)
  • and time invariance: this probability is the same
    for all t
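A minimal sketch of sampling from such a chain; the states and
transition probabilities below are illustrative, not from the slides:

```python
import random

# Illustrative transition table: P[s][s2] is the probability of moving
# from state s to state s2. Each row sums to 1.
P = {"DT": {"NN": 0.9, "VBZ": 0.1},
     "NN": {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
     "VBZ": {"DT": 0.7, "NN": 0.3}}

def sample_chain(start, length, seed=0):
    random.seed(seed)
    seq, state = [start], start
    for _ in range(length - 1):
        # The next state depends only on the current state.
        nxt = random.choices(list(P[state]), weights=P[state].values())[0]
        seq.append(nxt)
        state = nxt
    return seq

print(sample_chain("DT", 6))
```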

9
Example of Markov Models (MM)
Cf. Manning & Schütze, 1999, page 319
10
Hidden Markov Model
  • In a (visible) MM, we know the sequence of states
    the model passes through, so the state sequence
    itself is regarded as the output
  • In an HMM, we don't know the state sequence, but
    only some probabilistic function of it
  • Markov models can be used wherever one wants to
    model the probability of a linear sequence of
    events
  • An HMM can be trained from unannotated text

11
HMM Tagger
  • Assumption: a word's tag depends only on the
    previous tag, and this dependency does not change
    over time
  • An HMM tagger uses states to represent POS tags
    and outputs (symbol emissions) to represent words
  • The tagging task is to find the most probable tag
    sequence for a sequence of words

12
Finding the most probable sequence
Cf. Erhard Hinrichs & Sandra Kübler
13
HMM tagging an example
Cf. Erhard Hinrichs & Sandra Kübler
14
HMM tagging an example
Cf. Erhard Hinrichs & Sandra Kübler
15
Calculating the most likely sequence
(Figure: green marks transition probabilities, blue
marks emission probabilities)
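Slides 12-15 show the worked example as figures. Below is a minimal
Viterbi sketch of the same computation; the probability tables are
illustrative toy numbers, not the values from the figures:

```python
# States are tags; outputs are words.
tags = ["DT", "NN", "VBZ"]
start = {"DT": 0.6, "NN": 0.3, "VBZ": 0.1}  # initial tag probabilities
trans = {"DT": {"DT": 0.05, "NN": 0.90, "VBZ": 0.05},
         "NN": {"DT": 0.10, "NN": 0.30, "VBZ": 0.60},
         "VBZ": {"DT": 0.60, "NN": 0.30, "VBZ": 0.10}}
emit = {"DT": {"the": 0.90, "time": 0.05, "flies": 0.05},
        "NN": {"the": 0.05, "time": 0.70, "flies": 0.25},
        "VBZ": {"the": 0.05, "time": 0.15, "flies": 0.80}}

def viterbi(words):
    # best[i][t]: probability of the best tag path ending in t at word i
    best = [{t: start[t] * emit[t][words[0]] for t in tags}]
    back = [{}]
    for w in words[1:]:
        scores, ptrs = {}, {}
        for t in tags:
            # Best previous tag to transition from.
            p = max(tags, key=lambda q: best[-1][q] * trans[q][t])
            scores[t] = best[-1][p] * trans[p][t] * emit[t][w]
            ptrs[t] = p
        best.append(scores)
        back.append(ptrs)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for ptrs in reversed(back[1:]):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "time", "flies"]))  # ['DT', 'NN', 'VBZ']
```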
16
Dealing with unknown words
  • The simplest models assume that an unknown word
    can have any POS tag, or only the most frequent
    tag in the tagset
  • In practice, morphological information such as
    the suffix is used as a hint

17
TnT (Trigrams'n'Tags)
  • A statistical tagger using Markov models: states
    represent tags and outputs represent words
  • To tag a sentence w_1 ... w_T is to calculate
    argmax_{t_1 ... t_T} [ prod_{i=1..T} P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i) ] P(t_{T+1} | t_T)

18
Transition and emission probabilities
  • Transition and output probabilities are estimated
    from a tagged corpus as maximum likelihood
    estimates over frequency counts f
  • Bigrams: P(t_3 | t_2) = f(t_2, t_3) / f(t_2)
  • Trigrams: P(t_3 | t_1, t_2) = f(t_1, t_2, t_3) / f(t_1, t_2)
  • Lexical: P(w_3 | t_3) = f(w_3, t_3) / f(t_3)

19
Smoothing Technique
  • Needed due to the sparse-data problem
  • Many trigram frequencies are zero in a limited
    corpus
  • Without smoothing, a single zero trigram makes
    the probability of the complete sequence zero
  • Smoothing by linear interpolation (see the sketch
    below):
    P(t_3 | t_1, t_2) = λ_1 P(t_3) + λ_2 P(t_3 | t_2) + λ_3 P(t_3 | t_1, t_2)
  • where λ_1 + λ_2 + λ_3 = 1
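A minimal sketch of the interpolation; the λ weights below are
illustrative (TnT estimates them from the corpus by deleted
interpolation):

```python
def smoothed_trigram(p_uni, p_bi, p_tri, l1=0.2, l2=0.3, l3=0.5):
    # The weights must sum to 1 so the result remains a probability.
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Even when the trigram estimate is zero, the smoothed value is not:
print(smoothed_trigram(p_uni=0.01, p_bi=0.10, p_tri=0.0))  # 0.032
```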

20
Other techniques
  • Handling unknown words
  • The longest suffix (the final sequence of
    characters of a word) is a strong predictor of
    word class
  • Calculate the probability of a tag t given the
    last m letters l_i of an n-letter word; m depends
    on the specific word (see the sketch below)
  • Capitalization
  • Works better for English than for German
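A minimal sketch of suffix-based guessing; TnT interpolates tag
probabilities over suffixes of decreasing length, whereas this sketch
simply backs off to the longest suffix seen in training (the training
pairs are illustrative):

```python
from collections import Counter, defaultdict

suffix_tags = defaultdict(Counter)
for word, tag in [("running", "VBG"), ("eating", "VBG"),
                  ("quickly", "RB"), ("happily", "RB")]:
    # Record the tag under every suffix of the word up to length 5.
    for m in range(1, min(len(word), 5) + 1):
        suffix_tags[word[-m:]][tag] += 1

def guess_tag(word, max_m=5):
    # Try the longest suffix first; shorter suffixes are the fallback.
    for m in range(min(len(word), max_m), 0, -1):
        if word[-m:] in suffix_tags:
            return suffix_tags[word[-m:]].most_common(1)[0][0]
    return "NN"  # default guess for a completely unseen suffix

print(guess_tag("sleeping"))  # VBG, via the suffix "ing"
```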

21
Evaluation
  • Corpora:
  • German NEGRA corpus: around 355,000 tokens
  • WSJ (Wall Street Journal) part of the Penn
    Treebank: around 1.2 million tokens
  • 10-fold cross-validation
  • The tagger assigns tags as well as probabilities
    to words
  • → the probabilities can be used to rank different
    assignments

22
Results for German and English
23
POS Learning Curve for NEGRA
24
Learning Curve for Penn Treebank
25
Conclusion
  • Good results for both the German and the English
    corpus
  • The average accuracy TnT achieves is between 96%
    and 97%
  • The accuracy for known tokens is significantly
    higher than for unknown tokens

26
References
  • What is a word, what's a sentence?
    (Grefenstette, 1994)
  • POS-Tagging and Partial Parsing (Abney, 1996)
  • TnT - A Statistical Part-of-Speech Tagger
    (Brants, 2000)
  • Foundations of Statistical Natural Language
    Processing (Manning & Schütze, 1999)