Title: Fall 2004
1EECS 595 / LING 541 / SI 661
Natural Language Processing
- Fall 2004
- Lecture Notes 2
2Course logistics
- Instructor: Prof. Dragomir Radev (radev_at_umich.edu)
- Class times: Tu 1:10-3:55 PM, in 412 WH
- Office hours: M 10-11, Tu 11-12, in 3080 WH
- Home page: http://www.si.umich.edu/radev/NLP-fall2004
3Regular Expressions and Automata
4Regular expressions
- Searching for woodchuck
- Searching for woodchucks with an optional final s
- Regular expressions
- Finite-state automata (singular: automaton)
5Regular expressions
- Basic regular expression patterns
- Perl-based syntax (slightly different from other notations for regular expressions)
- Disjunctions: [abc]
- Ranges: [A-Z]
- Negations: [^Ss]
- Optional characters: ? and *
- Wild cards: .
- Anchors: ^ and $, also \b and \B
- Disjunction, grouping, and precedence
6Writing correct expressions
- Exercise: write a Perl regular expression to match the English article "the"
/the/
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
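The progression of patterns in this exercise can be checked on a sample sentence. The sketch below uses Python's `re` module as a stand-in for Perl (the syntax of these five patterns is the same in both); the sample sentence and the helper `count_matches` are illustrative assumptions, not part of the original slides.

```python
import re

# The five candidate patterns from the exercise, in order of refinement.
PATTERNS = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
]

# Hypothetical test sentence with several tricky cases:
# sentence-initial "The", "the" inside "other" and "theology", and "the_25".
SAMPLE = "The other day the_25 theology student saw the cat."


def count_matches(pattern: str, text: str) -> int:
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))


for p in PATTERNS:
    print(f"{p!r}: {count_matches(p, SAMPLE)} match(es)")
```

On this sample, the `\b` version skips `the_25` because the underscore counts as a word character, and the fourth pattern skips the sentence-initial `The` because it requires a preceding character; the last pattern handles both cases.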
7A more complex example
- Exercise: Write a regular expression that will match any PC with more than 500 MHz and 32 Gb of disk space for less than $1000
/\$[0-9]+/
/\$[0-9]+\.[0-9][0-9]/
/\b\$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
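Regular expressions can locate the price, clock speed, and disk size, but the "more than" and "less than" comparisons still need arithmetic. A minimal sketch, assuming Python's `re` in place of Perl and a hypothetical helper `qualifies` that combines the three patterns:

```python
import re

# No leading \b before the dollar sign: \b immediately before "$" would
# only match when a word character precedes the dollar sign.
PRICE = re.compile(r"\$([0-9]+(?:\.[0-9][0-9])?)")
SPEED = re.compile(r"\b([0-9]+) *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
DISK = re.compile(r"\b([0-9]+(?:\.[0-9]+)?) *(?:Gb|[Gg]igabytes?)\b")


def qualifies(ad: str) -> bool:
    """True if the ad shows > 500 MHz, > 32 Gb of disk, and a price < $1000."""
    price, speed, disk = PRICE.search(ad), SPEED.search(ad), DISK.search(ad)
    if not (price and speed and disk):
        return False
    mhz = float(speed.group(1))
    if speed.group(2).lower() in ("ghz", "gigahertz"):
        mhz *= 1000  # convert GHz to MHz
    return (
        mhz > 500
        and float(disk.group(1)) > 32
        and float(price.group(1)) < 1000
    )


print(qualifies("Pentium 4, 600 MHz, 40 Gb disk, only $999.99"))
print(qualifies("1 GHz, 20 Gb, $899"))
```

The sketch only looks at the first price, speed, and disk figure in each ad, and it does not handle prices written with thousands separators.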
8Advanced operators
9Substitutions and memory
s/colour/color/
- Memory (\1, \2, etc. refer back to matches)
s/([0-9]+)/<\1>/
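Both substitutions can be tried out with Python's `re.sub`, which accepts the same `\1` back-reference in the replacement string. The wrapper `perl_s` below is a hypothetical helper that mimics Perl's `s///` (first match only) and `s///g` (all matches); it is a sketch, not a full emulation of Perl semantics.

```python
import re


def perl_s(pattern: str, replacement: str, text: str, global_: bool = False) -> str:
    """Rough analogue of Perl's s/pattern/replacement/ (or s///g if global_)."""
    # re.sub replaces every match when count is 0, and at most `count` otherwise.
    return re.sub(pattern, replacement, text, count=0 if global_ else 1)


print(perl_s(r"colour", "color", "colour and colour"))
print(perl_s(r"colour", "color", "colour and colour", global_=True))
# \1 in the replacement refers back to what the first parenthesized group matched.
print(perl_s(r"([0-9]+)", r"<\1>", "the 35 boxes"))
```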
10Eliza [Weizenbaum, 1966]
- User: Men are all alike
- ELIZA: IN WHAT WAY
- User: They're always bugging us about something or other
- ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
- User: Well, my boyfriend made me come here
- ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
- User: He says I'm depressed much of the time
- ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
11Eliza-style regular expressions
Step 1: replace first person references with second person references
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
- s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
- s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
- s/.* all .*/IN WHAT WAY/
- s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
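The three steps can be sketched in a few lines of Python. This is a toy illustration under several assumptions: the pronoun swaps in `SWAPS` are a small hypothetical list, rule order stands in for the scoring of Step 3, and the fallback reply `PLEASE GO ON` is invented for the example.

```python
import re

# Step 1: first-person references become second-person references.
SWAPS = [
    (re.compile(r"\bI'M\b"), "YOU ARE"),
    (re.compile(r"\bI AM\b"), "YOU ARE"),
    (re.compile(r"\bMY\b"), "YOUR"),
]

# Step 2: reply-generating rules. Step 3 (scoring) is approximated by
# list order: the first rule that matches wins.
RULES = [
    (re.compile(r".* YOU ARE (DEPRESSED|SAD) .*"), r"I AM SORRY TO HEAR YOU ARE \1"),
    (re.compile(r".* ALL .*"), "IN WHAT WAY"),
    (re.compile(r".* ALWAYS .*"), "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]


def respond(utterance: str) -> str:
    text = utterance.upper()
    for pattern, replacement in SWAPS:
        text = pattern.sub(replacement, text)
    # Pad with spaces so rules that expect surrounding spaces also
    # match keywords at the start or end of the utterance.
    text = f" {text} "
    for pattern, template in RULES:
        match = pattern.fullmatch(text)
        if match:
            return match.expand(template)
    return "PLEASE GO ON"  # hypothetical default reply


for line in [
    "Men are all alike",
    "They're always bugging us about something or other",
    "He says I'm depressed much of the time",
]:
    print(respond(line))
```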
12Finite-state automata
- Finite-state automata (FSA)
- Regular languages
- Regular expressions
13Finite-state automata (machines)
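Strings to accept: baa! baaa! baaaa! baaaaa! ...
[Figure: FSA with states q0-q4 (q4 is the final state). Transitions: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, plus a self-loop on a at q3. Labels: state, transition, final state.]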
14Input tape
[Figure: input tape with cells a b a ! b; the machine is in state q0 at the first cell.]
15Finite-state automata
- Q: a finite set of N states q0, q1, ..., qN
- Σ: a finite input alphabet of symbols
- q0: the start state
- F: the set of final states
- δ(q,i): transition function
16State-transition tables
        Input
State   b   a   !
0       1   ∅   ∅
1       ∅   2   ∅
2       ∅   3   ∅
3       ∅   3   4
4       ∅   ∅   ∅
(∅ = no transition; state 4 is the final state)
17The FSM toolkit and friends
- Developed at AT&T Research (Riley, Pereira, Mohri, Sproat)
- Download: http://www.research.att.com/sw/tools/fsm/tech.html and http://www.research.att.com/sw/tools/lextools/
- Tutorial available
- 4 useful parts: FSM, Lextools, GRM, Dot (separate)
- /clair3/tools/fsm-3.6/bin
- /clair3/tools/lextools/bin
- /clair3/tools/dot/bin
18D-RECOGNIZE
function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
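D-RECOGNIZE translates almost line for line into Python. In the sketch below, the transition table for the baa! machine is stored as a dictionary keyed by (state, symbol); a missing key plays the role of an empty table cell. The names `BAA_TABLE`, `BAA_FINAL`, and `d_recognize` are illustrative choices, not part of any toolkit.

```python
# Transition table for the baa! machine: (state, input symbol) -> next state.
# Pairs that are absent correspond to empty cells in the table.
BAA_TABLE = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
BAA_FINAL = {4}


def d_recognize(tape, table=BAA_TABLE, start=0, final=BAA_FINAL):
    """Return True (accept) or False (reject) for a deterministic FSA."""
    index = 0
    current_state = start
    while True:
        if index == len(tape):  # end of input reached
            return current_state in final
        key = (current_state, tape[index])
        if key not in table:  # empty cell: no legal transition
            return False
        current_state = table[key]
        index += 1


for s in ["baa!", "baaaa!", "ba!", "baa", "aba!b"]:
    print(s, "accept" if d_recognize(s) else "reject")
```

Because the machine is deterministic, the loop never backtracks: each input symbol is read exactly once.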
19Adding a failing state
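[Figure: the baa! FSA (states q0-q4) extended with a fail state qF; the transitions on b, a, and ! that are missing from the original machine all lead to qF.]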
20Languages and automata
- Formal languages: regular languages, non-regular languages
- Deterministic vs. non-deterministic FSAs
- Epsilon (ε) transitions
21Using NFSAs to accept strings
- Backup: add markers at choice points, then possibly revisit underexplored markers
- Look-ahead: look ahead in the input
- Parallelism: look at alternatives in parallel
22Using NFSAs
        Input
State   b   a     !   ε
0       1   ∅     ∅   ∅
1       ∅   2     ∅   ∅
2       ∅   2,3   ∅   ∅
3       ∅   ∅     4   ∅
4       ∅   ∅     ∅   ∅
(∅ = no transition; state 4 is the final state)
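With a non-deterministic table, a cell can hold several next states, so recognition becomes a search over (state, tape position) pairs. The sketch below is one possible agenda-based recognizer for the table above: a stack gives depth-first search (the "backup" strategy) and a queue gives breadth-first search. The names `NFSA_TABLE`, `EPSILON`, and `nd_recognize` are illustrative, and the `seen` set is an added safeguard against epsilon loops rather than something shown on the slides.

```python
from collections import deque

# (state, symbol) -> set of possible next states.
NFSA_TABLE = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},  # the non-deterministic choice point
    (3, "!"): {4},
}
EPSILON = ""  # key for epsilon moves; this particular table has none


def nd_recognize(tape, table=NFSA_TABLE, start=0, final=frozenset({4}),
                 breadth_first=False):
    """Search the space of (state, index) pairs for an accepting path."""
    agenda = deque([(start, 0)])
    seen = set()
    while agenda:
        state, index = agenda.popleft() if breadth_first else agenda.pop()
        if (state, index) in seen:
            continue
        seen.add((state, index))
        if index == len(tape) and state in final:
            return True
        # Epsilon moves change state without consuming input.
        for nxt in table.get((state, EPSILON), ()):
            agenda.append((nxt, index))
        # Ordinary moves consume one input symbol.
        if index < len(tape):
            for nxt in table.get((state, tape[index]), ()):
                agenda.append((nxt, index + 1))
    return False


print(nd_recognize("baaa!"), nd_recognize("ba!"))
```

Both search orders accept the same strings; they differ only in the order in which alternatives are explored.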
23More about FSAs
- Transducers
- Equivalence of DFSAs and NFSAs
- Recognition as search: depth-first, breadth-first
24Recognition using NFSAs
25Regular languages
- Operations on regular languages and FSAs: concatenation, closure, union
- Properties of regular languages: closed under concatenation, union (disjunction), intersection, difference, complementation, reversal, Kleene closure
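Closure under intersection can be made concrete with the product construction: run two deterministic machines in lock-step and accept only when both accept. The sketch below is a minimal illustration; the dictionary-based machine format and the example machines `BAA` (the baa! language) and `EVEN` (strings of even length over b, a, !) are assumptions made for this example.

```python
def run(dfa, text):
    """Simulate a deterministic FSA given as a dict with table/start/final."""
    state = dfa["start"]
    for symbol in text:
        state = dfa["table"].get((state, symbol))
        if state is None:  # no transition defined: reject
            return False
    return state in dfa["final"]


def intersection(a, b):
    """Product construction: states of the result are pairs of states."""
    table = {}
    for (p, x), p_next in a["table"].items():
        for (q, y), q_next in b["table"].items():
            if x == y:
                table[((p, q), x)] = (p_next, q_next)
    return {
        "table": table,
        "start": (a["start"], b["start"]),
        "final": {(f, g) for f in a["final"] for g in b["final"]},
    }


BAA = {
    "table": {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4},
    "start": 0,
    "final": {4},
}
# Two states that toggle on every symbol; state 0 means "even length so far".
EVEN = {
    "table": {(s, c): 1 - s for s in (0, 1) for c in "ba!"},
    "start": 0,
    "final": {0},
}

BOTH = intersection(BAA, EVEN)
for s in ["baa!", "baaa!", "baaaa!"]:
    print(s, run(BOTH, s))
```

The result is itself a finite-state machine, which is what closure under intersection means.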
26An exercise
- J&M 2.8: Write a regular expression for the language accepted by the NFSA in the figure.
27Morphology and Finite-State Transducers
28Morphemes
- Stems, affixes
- Affixes: prefixes, suffixes, infixes (hingi "borrow" → humingi "agent" in Tagalog), circumfixes (sagen → gesagt in German)
- Concatenative morphology
- Templatic morphology (Semitic languages)
- lmd (learn), lamad (he studied), limed (he taught), lumad (he was taught)
29Morphological analysis
30Inflectional morphology
- Tense, number, person, mood, aspect
- Five verb forms in English
- 40 forms in French
- Six cases in Russian: http://www.departments.bucknell.edu/russian/language/case.html
- Up to 40,000 forms in Turkish (you cause X to cause Y to do Z)
31Derivational morphology
- Nominalization: computerization, appointee, killer, fuzziness
- Formation of adjectives: computational, embraceable, clueless
32Finite-state morphological parsing
- Cats → cat +N +PL
- Cat → cat +N +SG
- Cities → city +N +PL
- Geese → goose +N +PL
- Ducks → (duck +N +PL) or (duck +V +3SG)
- Merging → merge +V +PRES-PART
- Caught → (catch +V +PAST-PART) or (catch +V +PAST)
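The input/output behavior of such a parser can be imitated with a lexicon, a list of irregular forms, and a few suffix rules. The sketch below is a toy lookup procedure, not a finite-state transducer; the lexicon contents, the rule list, and the function name `parse` are assumptions chosen to cover the examples above.

```python
# Stems and the categories they belong to (toy lexicon).
LEXICON = {"cat": {"N"}, "city": {"N"}, "duck": {"N", "V"}, "merge": {"V"}}

# Irregular forms are listed whole, with their analyses.
IRREGULAR = {
    "geese": ["goose +N +PL"],
    "caught": ["catch +V +PAST-PART", "catch +V +PAST"],
}

# (surface ending, what to restore on the stem, category, features)
SUFFIX_RULES = [
    ("", "", "N", "+N +SG"),
    ("s", "", "N", "+N +PL"),
    ("ies", "y", "N", "+N +PL"),        # cities -> city
    ("s", "", "V", "+V +3SG"),
    ("ing", "", "V", "+V +PRES-PART"),
    ("ing", "e", "V", "+V +PRES-PART"),  # merging -> merge
]


def parse(word):
    """Return every analysis of word licensed by the toy lexicon and rules."""
    word = word.lower()
    analyses = list(IRREGULAR.get(word, []))
    for ending, restore, category, features in SUFFIX_RULES:
        if word.endswith(ending):
            stem = word[: len(word) - len(ending)] + restore
            if category in LEXICON.get(stem, ()):
                analyses.append(f"{stem} {features}")
    return analyses


for w in ["Cats", "Cities", "Geese", "Ducks", "Merging", "Caught"]:
    print(w, "->", parse(w))
```

An ambiguous form such as "ducks" yields two analyses, which is why later stages (for example part-of-speech tagging) are needed to choose between them.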
33Principles of morphological parsing
- Lexicon
- Morphotactics (e.g., plural follows noun)
- Orthography (easy → easier)
- Irregular nouns: e.g., geese, sheep, mice
- Irregular verbs: e.g., caught, ate, eaten
34FSA for adjectives
- Big, bigger, biggest
- Cool, cooler, coolest, coolly
- Red, redder, reddest
- Clear, clearer, clearest, clearly, unclear, unclearly
- Happy, happier, happiest, happily
- Unhappy, unhappier, unhappiest, unhappily
- What about unbig, redly, and realest?
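A first-approximation FSA for these adjectives reads an optional un-, then a root, then an optional -er, -est, or -ly. The sketch below works on words that are already split into morphemes and ignores spelling changes; the state numbering and the names `ADJ_TABLE`, `classify`, and `accepts` are illustrative assumptions.

```python
ROOTS = {"big", "cool", "red", "clear", "happy", "real"}
SUFFIXES = {"er", "est", "ly"}


def classify(morpheme):
    """Map a morpheme to the arc label used in the transition table."""
    if morpheme == "un":
        return "un"
    if morpheme in ROOTS:
        return "root"
    if morpheme in SUFFIXES:
        return "suffix"
    return None


# (state, arc label) -> next state
ADJ_TABLE = {
    (0, "un"): 1,
    (0, "root"): 2,
    (1, "root"): 2,
    (2, "suffix"): 3,
}
ADJ_FINAL = {2, 3}


def accepts(morphemes):
    state = 0
    for m in morphemes:
        state = ADJ_TABLE.get((state, classify(m)))
        if state is None:
            return False
    return state in ADJ_FINAL


for word in [["un", "happy", "er"], ["big", "est"], ["un", "big"], ["red", "ly"]]:
    print("-".join(word), accepts(word))
```

The machine accepts un-big and red-ly along with the legitimate forms, which is the overgeneration problem the last bullet raises; fixing it requires splitting the roots into finer classes.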
35Using FSA for recognition
- Is a string a legitimate word or not?
- Two-level morphology: lexical level and surface level (Koskenniemi 83)
- Finite-state transducers (FST): used for regular relations
- Inversion and composition of FST
36Orthographic rules
- Beg/begging
- Make/making
- Watch/watches
- Try/tries
- Panic/panicked
37Combining FST lexicon and rules
- Cascades of transducers: the output of one becomes the input of another
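The idea of a cascade can be illustrated with the five spelling alternations listed earlier (beg/begging, make/making, watch/watches, try/tries, panic/panicked). In the sketch below, each rule is a regular-expression substitution standing in for a transducer, and the output of one rule is the input of the next. The `^` morpheme-boundary notation, the exact rule contexts, and the name `to_surface` are simplifying assumptions; real orthographic rules need more careful conditions.

```python
import re

# Each rule rewrites a lexical form such as "watch^s" toward its surface form.
RULES = [
    # e-insertion: watch^s -> watches
    (re.compile(r"(s|x|z|ch|sh)\^s$"), r"\1es"),
    # y-replacement: try^s -> tries
    (re.compile(r"([^aeiou])y\^s$"), r"\1ies"),
    # e-deletion: make^ing -> making
    (re.compile(r"e\^(ing|ed)$"), r"\1"),
    # k-insertion: panic^ed -> panicked
    (re.compile(r"c\^(ing|ed)$"), r"ck\1"),
    # consonant doubling for short stems: beg^ing -> begging
    (re.compile(r"^([^aeiou]*[aeiou])([bdgmnprt])\^(ing|ed|er|est)$"), r"\1\2\2\3"),
    # default: delete any remaining morpheme boundary
    (re.compile(r"\^"), ""),
]


def to_surface(lexical):
    """Apply the rules in order; each rule sees the previous rule's output."""
    form = lexical
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


for lex in ["beg^ing", "make^ing", "watch^s", "try^s", "panic^ed", "cat^s"]:
    print(lex, "->", to_surface(lex))
```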
38Weighted Automata
39Phonetic symbols
40Using WFST for language modeling
- Phonetic representation
- Part-of-speech tagging
41Word Classes and Part Of Speech Tagging
42Some POS statistics
- Preposition list from COBUILD
- Single-word particles
- Conjunctions
- Pronouns
- Modal verbs
43Tagsets for English
- Penn Treebank
- Other tagsets (see Week 1 slides)
44POS ambiguity
- Degrees of ambiguity (DeRose 1988)
- Rule-based POS tagging
- ENGTWOL (Voutilainen et al. )
- Sample rule
- Adverbial-That rule ("it isn't that odd")
    Given input: "that"
    if
      (+1 A/ADV/QUANT)
      (+2 SENT-LIM)
      (NOT -1 SVOC/A)   (not a verb like "consider")
    then eliminate non-ADV tags
    else eliminate ADV tag
45Evaluating POS taggers
- Percent correct
- What is the lower bound on a system's performance?
- What about the upper bound?
46 Kappa
- N: number of items (index i)
- n: number of categories (index j)
- k: number of annotators
- When kappa > .8, agreement is considered high
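The formula itself did not survive on this slide. Given the variables listed (N items, n categories, k annotators), one plausible reading is the multi-annotator kappa in the style of Fleiss; the sketch below implements that version and should be treated as an assumption about what the slide showed. The function name `fleiss_kappa` and the count-matrix input format are illustrative choices.

```python
def fleiss_kappa(counts):
    """Multi-annotator kappa.

    counts[i][j] is the number of annotators who assigned item i to
    category j. Every row must sum to the same number of annotators k.
    """
    N = len(counts)        # number of items
    n = len(counts[0])     # number of categories
    k = sum(counts[0])     # number of annotators

    # Observed agreement per item: fraction of annotator pairs that agree.
    per_item = [
        (sum(c * c for c in row) - k) / (k * (k - 1)) for row in counts
    ]
    observed = sum(per_item) / N

    # Expected agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (N * k) for j in range(n)
    ]
    expected = sum(p * p for p in proportions)

    # Undefined (division by zero) if every label falls in one category.
    return (observed - expected) / (1 - expected)


# Two annotators, two categories, four items, full agreement.
print(fleiss_kappa([[2, 0], [0, 2], [2, 0], [0, 2]]))
```

Kappa is 1 when annotators always agree, 0 when they agree only as often as chance predicts, and negative when they agree less often than chance.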
47Readings for next time
- J&M Chapters 5.9, 8, 9
- Lecture notes 2