1
Tokenization & POS-Tagging
  • presented by
  • Yajing Zhang
  • Saarland University
  • yazhang@coli.uni-sb.de

2
Outline
  • Tokenization
  • Importance
  • Problems & solutions
  • POS tagging
  • HMM tagger
  • TnT statistical tagger

3
Why Tokenization?
  • Tokenization: the isolation of word-like units
    from a text.
  • Tokens are the building blocks of all other text
    processing.
  • The accuracy of tokenization affects the results
    of higher-level processing, e.g. parsing.

4
Problems of tokenization
  • Definition of a token
  • e.g. United States, AT&T, 3-year-old
  • Ambiguity of punctuation as a sentence boundary
  • e.g. Prof. Dr. J.M.
  • Ambiguity in numbers
  • e.g. 123,456.78

5
Some Solutions
  • Using regular expressions to match numbers and
    abbreviations (see the sketch below)
  • ([0-9]+,)*[0-9]+(\.[0-9]+)?
  • [A-Z][bcdfghj-np-tvxz]+\.
  • Using a corpus as a filter to identify
    abbreviations
  • Using a lexical list (the most important
    abbreviations are listed)
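A minimal sketch of the two patterns above in Python; the bracketed
character classes and quantifiers are assumptions reconstructed from
the slide's patterns:

```python
import re

# Number with optional thousands separators and an optional decimal
# part (assumed reconstruction of the number pattern above).
NUMBER = re.compile(r"([0-9]+,)*[0-9]+(\.[0-9]+)?")

# Abbreviation heuristic: a capital letter followed by one or more
# consonants and a closing period, e.g. "Dr.".
ABBREV = re.compile(r"[A-Z][bcdfghj-np-tvxz]+\.")

print(bool(NUMBER.fullmatch("123,456.78")))  # True
print(bool(ABBREV.fullmatch("Dr.")))         # True
print(bool(ABBREV.fullmatch("dog.")))        # False: starts lowercase
```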

6
POS Tagging
  • Labeling each word in a sentence with its
    appropriate part of speech
  • Information sources in tagging:
  • the tags of other words in the context
  • the word itself
  • Different approaches:
  • rule-based taggers
  • stochastic POS taggers, e.g. the simplest
    stochastic tagger and the HMM tagger

7
Simplest Stochastic Tagger
  • Each word is assigned its most frequent tag (the
    tag most often seen for that word in the training
    set); see the sketch below
  • Problem: this may assign a valid tag to every
    word yet produce an unacceptable tag sequence
  • Time  flies  like  an  arrow
  • NN    VBZ    VB    DT  NN
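A minimal sketch of this baseline, using an illustrative toy training
set (chosen so that "like" is most often tagged as a verb):

```python
from collections import Counter, defaultdict

# Toy training data: (word, tag) pairs. Illustrative only.
train = [("time", "NN"), ("time", "NN"), ("time", "VB"),
         ("flies", "VBZ"), ("flies", "NNS"), ("flies", "VBZ"),
         ("like", "VB"), ("like", "VB"), ("like", "IN"),
         ("an", "DT"), ("arrow", "NN")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def most_frequent_tag(word):
    # Pick the tag most frequently seen with this word in training.
    return counts[word].most_common(1)[0][0]

print([most_frequent_tag(w) for w in "time flies like an arrow".split()])
# ['NN', 'VBZ', 'VB', 'DT', 'NN'] -- each tag is plausible for its
# word, but "like" as VB makes the sequence as a whole unacceptable.
```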

8
Markov Models (MM)
  • In a Markov chain, the next element of the
    sequence depends only on the current element,
    not on the past elements
  • X = (X_1, ..., X_T) is a sequence of random
    variables taking values in the state space
    S = {s_1, ..., s_N}
  • Limited horizon:
    P(X_{t+1} = s_k | X_1, ..., X_t) = P(X_{t+1} = s_k | X_t)
  • and time invariance: this probability is the same
    for all t
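A minimal sketch of sampling from such a chain; the states and
transition probabilities below are illustrative, not from the slides:

```python
import random

# Illustrative transition table: P[s][s2] is the probability of moving
# from state s to state s2. Each row sums to 1.
P = {"DT": {"NN": 0.9, "VBZ": 0.1},
     "NN": {"DT": 0.1, "NN": 0.3, "VBZ": 0.6},
     "VBZ": {"DT": 0.7, "NN": 0.3}}

def sample_chain(start, length, seed=0):
    random.seed(seed)
    seq, state = [start], start
    for _ in range(length - 1):
        # The next state depends only on the current state.
        nxt = random.choices(list(P[state]), weights=P[state].values())[0]
        seq.append(nxt)
        state = nxt
    return seq

print(sample_chain("DT", 6))
```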

9
Example of Markov Models (MM)
Cf. Manning & Schütze, 1999, page 319
10
Hidden Markov Model
  • In a (visible) MM, we know the sequence of states
    the model passes through, so the state sequence
    itself is regarded as the output
  • In an HMM, we don't know the state sequence, but
    only some probabilistic function of it
  • Markov models can be used wherever one wants to
    model the probability of a linear sequence of
    events
  • An HMM can be trained from unannotated text

11
HMM Tagger
  • Assumption: a word's tag depends only on the
    previous tag, and this dependency does not change
    over time
  • An HMM tagger uses states to represent POS tags
    and outputs (symbol emissions) to represent words
  • The tagging task is to find the most probable tag
    sequence for a sequence of words

12
Finding the most probable sequence
Cf. Erhard Hinrichs & Sandra Kübler
13
HMM tagging an example
Cf. Erhard Hinrichs & Sandra Kübler
14
HMM tagging an example
Cf. Erhard Hinrichs & Sandra Kübler
15
Calculating the most likely sequence
(Figure: green marks transition probabilities, blue
marks emission probabilities)
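Slides 12-15 show the worked example as figures. Below is a minimal
Viterbi sketch of the same computation; the probability tables are
illustrative toy numbers, not the values from the figures:

```python
# States are tags; outputs are words.
tags = ["DT", "NN", "VBZ"]
start = {"DT": 0.6, "NN": 0.3, "VBZ": 0.1}  # initial tag probabilities
trans = {"DT": {"DT": 0.05, "NN": 0.90, "VBZ": 0.05},
         "NN": {"DT": 0.10, "NN": 0.30, "VBZ": 0.60},
         "VBZ": {"DT": 0.60, "NN": 0.30, "VBZ": 0.10}}
emit = {"DT": {"the": 0.90, "time": 0.05, "flies": 0.05},
        "NN": {"the": 0.05, "time": 0.70, "flies": 0.25},
        "VBZ": {"the": 0.05, "time": 0.15, "flies": 0.80}}

def viterbi(words):
    # best[i][t]: probability of the best tag path ending in t at word i
    best = [{t: start[t] * emit[t][words[0]] for t in tags}]
    back = [{}]
    for w in words[1:]:
        scores, ptrs = {}, {}
        for t in tags:
            # Best previous tag to transition from.
            p = max(tags, key=lambda q: best[-1][q] * trans[q][t])
            scores[t] = best[-1][p] * trans[p][t] * emit[t][w]
            ptrs[t] = p
        best.append(scores)
        back.append(ptrs)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for ptrs in reversed(back[1:]):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "time", "flies"]))  # ['DT', 'NN', 'VBZ']
```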
16
Dealing with unknown words
  • The simplest models assume that an unknown word
    can have any POS tag, or only the most frequent
    tag in the tagset
  • In practice, morphological information such as
    the suffix is used as a hint

17
TnT (Trigrams'n'Tags)
  • A statistical tagger using Markov models: states
    represent tags and outputs represent words
  • To tag a sentence w_1 ... w_T is to calculate
    argmax_{t_1 ... t_T} [ prod_{i=1..T} P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i) ] P(t_{T+1} | t_T)

18
Transition and emission probabilities
  • Transition and output probabilities are estimated
    from a tagged corpus as maximum likelihood
    estimates over frequency counts f
  • Bigrams: P(t_3 | t_2) = f(t_2, t_3) / f(t_2)
  • Trigrams: P(t_3 | t_1, t_2) = f(t_1, t_2, t_3) / f(t_1, t_2)
  • Lexical: P(w_3 | t_3) = f(w_3, t_3) / f(t_3)

19
Smoothing Technique
  • Needed due to the sparse-data problem
  • Many trigram frequencies are zero in a limited
    corpus
  • Without smoothing, a single zero trigram makes
    the probability of the complete sequence zero
  • Smoothing by linear interpolation (see the sketch
    below):
    P(t_3 | t_1, t_2) = λ_1 P(t_3) + λ_2 P(t_3 | t_2) + λ_3 P(t_3 | t_1, t_2)
  • where λ_1 + λ_2 + λ_3 = 1
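A minimal sketch of the interpolation; the λ weights below are
illustrative (TnT estimates them from the corpus by deleted
interpolation):

```python
def smoothed_trigram(p_uni, p_bi, p_tri, l1=0.2, l2=0.3, l3=0.5):
    # The weights must sum to 1 so the result remains a probability.
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Even when the trigram estimate is zero, the smoothed value is not:
print(smoothed_trigram(p_uni=0.01, p_bi=0.10, p_tri=0.0))  # 0.032
```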

20
Other techniques
  • Handling unknown words
  • The longest suffix (the final sequence of
    characters of a word) is a strong predictor of
    word class
  • Calculate the probability of a tag t given the
    last m letters l_i of an n-letter word; m depends
    on the specific word (see the sketch below)
  • Capitalization
  • Works better for English than for German
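A minimal sketch of suffix-based guessing; TnT interpolates tag
probabilities over suffixes of decreasing length, whereas this sketch
simply backs off to the longest suffix seen in training (the training
pairs are illustrative):

```python
from collections import Counter, defaultdict

suffix_tags = defaultdict(Counter)
for word, tag in [("running", "VBG"), ("eating", "VBG"),
                  ("quickly", "RB"), ("happily", "RB")]:
    # Record the tag under every suffix of the word up to length 5.
    for m in range(1, min(len(word), 5) + 1):
        suffix_tags[word[-m:]][tag] += 1

def guess_tag(word, max_m=5):
    # Try the longest suffix first; shorter suffixes are the fallback.
    for m in range(min(len(word), max_m), 0, -1):
        if word[-m:] in suffix_tags:
            return suffix_tags[word[-m:]].most_common(1)[0][0]
    return "NN"  # default guess for a completely unseen suffix

print(guess_tag("sleeping"))  # VBG, via the suffix "ing"
```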

21
Evaluation
  • Corpora:
  • German NEGRA corpus: around 355,000 tokens
  • WSJ (Wall Street Journal) part of the Penn
    Treebank: around 1.2 million tokens
  • 10-fold cross-validation
  • The tagger assigns tags as well as probabilities
    to words
  • → the probabilities can be used to rank different
    assignments

22
Results for German and English
23
POS Learning Curve for NEGRA
24
Learning Curve for Penn Treebank
25
Conclusion
  • Good results for both the German and the English
    corpus
  • The average accuracy TnT achieves is between 96%
    and 97%
  • The accuracy for known tokens is significantly
    higher than for unknown tokens

26
References
  • What is a word, what's a sentence?
    (Grefenstette, 1994)
  • POS-Tagging and Partial Parsing (Abney, 1996)
  • TnT - A Statistical Part-of-Speech Tagger
    (Brants, 2000)
  • Foundations of Statistical Natural Language
    Processing (Manning & Schütze, 1999)