Fall 2004 - PowerPoint PPT Presentation


Title: Fall 2004


1

EECS 595 / LING 541 / SI 661
Natural Language Processing
  • Fall 2004
  • Lecture Notes 2

2
Course logistics
  • Instructor: Prof. Dragomir Radev
    (radev_at_umich.edu)
  • Class times: Tu 1:10-3:55 PM, in 412 WH
  • Office hours: M 10-11, Tu 11-12, in 3080 WH

Home page:
http://www.si.umich.edu/radev/NLP-fall2004
3
Regular Expressions and Automata
4
Regular expressions
  • Searching for woodchuck
  • Searching for woodchucks with an optional final
    s
  • Regular expressions
  • Finite-state automata (singular automaton)

5
Regular expressions
  • Basic regular expression patterns
  • Perl-based syntax (slightly different from other
    notations for regular expressions)
  • Disjunctions: [abc]
  • Ranges: [A-Z]
  • Negations: [^Ss]
  • Optional characters: ? and *
  • Wildcards: .
  • Anchors: ^ and $, also \b and \B
  • Disjunction, grouping, and precedence: |, ( )

6
Writing correct expressions
  • Exercise: write a Perl regular expression to
    match the English article "the"

/the/
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
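The successive refinements in this exercise can be compared side by side. The following is a minimal sketch in Python's re syntax rather than Perl; the dictionary keys and the helper name matches are illustrative and not part of the slides.

```python
import re

# Refinements of the pattern for the article "the", in Python re syntax.
# The key names are hypothetical labels, not taken from the slides.
PATTERNS = {
    "plain": r"the",                                 # also fires inside "other"
    "case": r"[tT]he",                               # accepts a capital T
    "boundary": r"\b[tT]he\b",                       # word boundaries on both sides
    "non_letter": r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",   # explicit non-letter context
}

def matches(name, text):
    """Return True if the named pattern occurs anywhere in text."""
    return re.search(PATTERNS[name], text) is not None

if __name__ == "__main__":
    for sample in ["other", "The dog", "the_end", "see the"]:
        print(sample, {name: matches(name, sample) for name in PATTERNS})
```

Two details are worth noticing. \b treats the underscore as a word character, so the boundary version rejects "the_end" while the non-letter version accepts it. The last pattern still needs one character after "the", so it misses the article at the very end of a string.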
7
A more complex example
  • Exercise: write a regular expression that will
    match any PC with more than 500 MHz and 32 Gb of
    disk space for less than $1000

/$[0-9]+/
/$[0-9]+\.[0-9][0-9]/
/\b$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
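The price, speed, and disk patterns can be tried on a sample advertisement. This is a sketch in Python under a few assumptions: the dollar sign is escaped (Python would otherwise read it as an end-of-line anchor), the leading \b before the dollar sign is dropped because there is no word boundary between a space and "$", and the function name find_specs and the sample ad are made up for illustration.

```python
import re

# Python versions of the slide's patterns; "$" must be escaped to be literal.
PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
SPEED = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

def find_specs(ad):
    """Return the first price, speed, and disk mention found in ad (or None)."""
    found = {}
    for label, pattern in (("price", PRICE), ("speed", SPEED), ("disk", DISK)):
        match = pattern.search(ad)
        found[label] = match.group(0) if match else None
    return found

if __name__ == "__main__":
    print(find_specs("Pentium 600 MHz, 40 Gb disk, only $899.99"))
```

The patterns only locate the mentions. Checking "more than 500 MHz" or "less than $1000" still requires converting the matched digits to numbers and comparing them.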
8
Advanced operators
9
Substitutions and memory
  • Substitutions

s/colour/color/
  • Memory (\1, \2, etc. refer back to matches)

s/([0-9]+)/<\1>/
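The two substitutions translate directly to Python's re.sub. In this sketch count=1 is passed to mimic Perl's s/// without the /g flag, which replaces only the first match; the function names are illustrative.

```python
import re

def normalize_spelling(text):
    """Equivalent of s/colour/color/ (first occurrence only)."""
    return re.sub(r"colour", "color", text, count=1)

def bracket_number(text):
    """Equivalent of s/([0-9]+)/<\\1>/: \\1 refers back to the first group."""
    return re.sub(r"([0-9]+)", r"<\1>", text, count=1)

if __name__ == "__main__":
    print(normalize_spelling("colour and colour"))
    print(bracket_number("the 35 boxes"))
```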
10
ELIZA (Weizenbaum, 1966)
  • User: Men are all alike
  • ELIZA: IN WHAT WAY
  • User: They're always bugging us about something
    or other
  • ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
  • User: Well, my boyfriend made me come here
  • ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
  • User: He says I'm depressed much of the time
  • ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

11
Eliza-style regular expressions
Step 1: replace first person references with second person references
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
  • s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO
    HEAR YOU ARE \1/
  • s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK
    YOU ARE \1/
  • s/.* all .*/IN WHAT WAY/
  • s/.* always .*/CAN YOU THINK OF A SPECIFIC
    EXAMPLE/
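Steps 1 and 2 can be sketched in a few lines of Python. This is a toy under stated assumptions, not Weizenbaum's program: the first-person rewrite list, the fallback reply "PLEASE GO ON", and the name respond are invented for illustration, and step 3 (scoring) is replaced by simply taking the first rule that matches.

```python
import re

# Step 1: first person -> second person (a small, hypothetical rewrite list).
FIRST_TO_SECOND = [
    (re.compile(r"\bI'm\b", re.I), "YOU ARE"),
    (re.compile(r"\bI am\b", re.I), "YOU ARE"),
    (re.compile(r"\bmy\b", re.I), "YOUR"),
]

# Step 2: reply rules taken from the slide, tried in order.
RULES = [
    (re.compile(r".* YOU ARE (depressed|sad) .*", re.I),
     r"I AM SORRY TO HEAR YOU ARE \1"),
    (re.compile(r".* all .*", re.I), "IN WHAT WAY"),
    (re.compile(r".* always .*", re.I), "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(utterance):
    """Rewrite the utterance, then answer with the first rule that matches."""
    text = utterance
    for pattern, replacement in FIRST_TO_SECOND:
        text = pattern.sub(replacement, text)
    for pattern, template in RULES:
        match = pattern.fullmatch(text)
        if match:
            return match.expand(template).upper()
    return "PLEASE GO ON"  # assumed fallback, not from the slides

if __name__ == "__main__":
    for line in ["Men are all alike",
                 "They're always bugging us about something or other",
                 "He says I'm depressed much of the time"]:
        print(respond(line))
```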

12
Finite-state automata
  • Finite-state automata (FSA)
  • Regular languages
  • Regular expressions

13
Finite-state automata (machines)
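Sheep language: baa! baaa! baaaa! baaaaa! ...
[Figure: FSA accepting the sheep language, with states q0 to q4 (q4 is the final state) and transitions q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, plus a self-loop on a at q3. Labels in the figure: state, transition, final state.]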
14
Input tape
[Figure: input tape holding the symbols a b a ! b, with the machine in state q0 at the first cell.]
15
Finite-state automata
  • Q: a finite set of N states q0, q1, ..., qN
  • Σ: a finite input alphabet of symbols
  • q0: the start state
  • F: the set of final states
  • δ(q,i): transition function

16
State-transition tables
          Input
State   b   a   !
  0     1   ∅   ∅
  1     ∅   2   ∅
  2     ∅   3   ∅
  3     ∅   3   4
  4     ∅   ∅   ∅
(∅ = no transition; state 4 is the final state)
17
The FSM toolkit and friends
  • Developed at AT&T Research (Riley, Pereira,
    Mohri, Sproat)
  • Download: http://www.research.att.com/sw/tools/fsm/tech.html
    http://www.research.att.com/sw/tools/lextools/
  • Tutorial available
  • Four useful parts: FSM, Lextools, GRM, Dot
    (separate)
  • /clair3/tools/fsm-3.6/bin
  • /clair3/tools/lextools/bin
  • /clair3/tools/dot/bin

18
D-RECOGNIZE
function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
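The pseudocode above can be run against the sheep-language transition table from the earlier slide. This is a minimal Python sketch: the table is encoded as a dictionary of dictionaries, a missing entry stands for an empty cell, and the names SHEEP_TABLE, ACCEPT_STATES, and d_recognize are illustrative choices.

```python
# Transition table for the sheep language (b, then two or more a's, then !).
# A missing key means the table cell is empty.
SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
ACCEPT_STATES = {4}

def d_recognize(tape, table=SHEEP_TABLE, start=0, accept=ACCEPT_STATES):
    """Deterministic recognition: follow one transition per input symbol."""
    state = start
    for symbol in tape:
        next_state = table[state].get(symbol)
        if next_state is None:      # empty cell: reject immediately
            return False
        state = next_state
    return state in accept          # end of input: accept only in a final state

if __name__ == "__main__":
    for tape in ["baa!", "baaaa!", "ba!", "baa", "baa!b"]:
        print(tape, d_recognize(tape))
```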
19
Adding a failing state
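[Figure: the sheep-language FSA (states q0 to q4) with an added fail state qF; every transition on b, a, or ! that is undefined in the original machine leads to qF.]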
20
Languages and automata
  • Formal languages: regular languages, non-regular
    languages
  • Deterministic vs. non-deterministic FSAs
  • Epsilon (ε) transitions

21
Using NFSAs to accept strings
  • Backup: add markers at choice points, then
    possibly revisit unexplored markers
  • Look-ahead: look ahead in the input
  • Parallelism: look at alternatives in parallel

22
Using NFSAs
          Input
State   b   a     !   ε
  0     1   ∅     ∅   ∅
  1     ∅   2     ∅   ∅
  2     ∅   2,3   ∅   ∅
  3     ∅   ∅     4   ∅
  4     ∅   ∅     ∅   ∅
(∅ = no transition)
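The backup strategy can be sketched as an agenda of (state, position) pairs run over the non-deterministic table above. This is an illustrative Python version, not the textbook's pseudocode: the names NFSA_TABLE and nd_recognize, the use of the empty string as the epsilon label, and the breadth_first switch are assumptions made for the example.

```python
from collections import deque

# Non-deterministic table: state 2 can go to either 2 or 3 on "a".
# The key "" would hold epsilon transitions; this machine has none.
NFSA_TABLE = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},
    3: {"!": {4}},
    4: {},
}

def nd_recognize(tape, table=NFSA_TABLE, start=0, accept=frozenset({4}),
                 breadth_first=False):
    """Search over (state, index) pairs; depth-first unless breadth_first."""
    agenda = deque([(start, 0)])
    seen = set()                     # avoid revisiting the same search state
    while agenda:
        state, index = agenda.popleft() if breadth_first else agenda.pop()
        if (state, index) in seen:
            continue
        seen.add((state, index))
        if index == len(tape) and state in accept:
            return True
        arcs = table[state]
        for next_state in arcs.get("", ()):          # epsilon moves
            agenda.append((next_state, index))
        if index < len(tape):
            for next_state in arcs.get(tape[index], ()):
                agenda.append((next_state, index + 1))
    return False

if __name__ == "__main__":
    for tape in ["baa!", "baaa!", "ba!", "baa"]:
        print(tape, nd_recognize(tape), nd_recognize(tape, breadth_first=True))
```

Popping from the end of the agenda gives depth-first search and popping from the front gives breadth-first search; both return the same answer and differ only in the order in which alternatives are explored.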
23
More about FSAs
  • Transducers
  • Equivalence of DFSAs and NFSAs
  • Recognition as search: depth-first,
    breadth-first

24
Recognition using NFSAs
25
Regular languages
  • Operations on regular languages and FSAs:
    concatenation, closure, union
  • Properties of regular languages (closed under
    concatenation, union, disjunction, intersection,
    difference, complementation, reversal, Kleene
    closure)

26
An exercise
  • J&M 2.8: Write a regular expression for the
    language accepted by the NFSA in the figure.

27
Morphology and Finite-State Transducers
28
Morphemes
  • Stems, affixes
  • Affixes: prefixes, suffixes, infixes (hingi
    "borrow" → humingi, agent, in Tagalog),
    circumfixes (sagen → gesagt in German)
  • Concatenative morphology
  • Templatic morphology (Semitic languages)
  • lmd (learn), lamad (he studied), limed (he
    taught), lumad (he was taught)

29
Morphological analysis
  • rewrites
  • unbelievably

30
Inflectional morphology
  • Tense, number, person, mood, aspect
  • Five verb forms in English
  • 40 forms in French
  • Six cases in Russian:
    http://www.departments.bucknell.edu/russian/language/case.html
  • Up to 40,000 forms in Turkish (you cause X to
    cause Y to do Z)

31
Derivational morphology
  • Nominalization: computerization, appointee,
    killer, fuzziness
  • Formation of adjectives: computational,
    embraceable, clueless

32
Finite-state morphological parsing
  • Cats: cat N PL
  • Cat: cat N SG
  • Cities: city N PL
  • Geese: goose N PL
  • Ducks: (duck N PL) or (duck V 3SG)
  • Merging: merge V PRES-PART
  • Caught: (catch V PAST-PART) or (catch V PAST)
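The input/output behavior shown above can be imitated with a toy lookup-based analyzer. This sketch is not a finite-state transducer and does not represent how a real morphological parser is built; the tiny lexicon, the suffix handling, and the name parse are all assumptions chosen so that the slide's examples come out as listed.

```python
# Hypothetical miniature lexicon covering only the slide's examples.
NOUNS = {"cat", "city", "goose", "duck"}
VERBS = {"duck", "merge", "catch"}
IRREGULAR = {
    "geese": [("goose", "N", "PL")],
    "caught": [("catch", "V", "PAST-PART"), ("catch", "V", "PAST")],
}

def parse(word):
    """Return all (stem, category, feature) analyses for a surface form."""
    w = word.lower()
    analyses = list(IRREGULAR.get(w, []))
    if w in NOUNS:
        analyses.append((w, "N", "SG"))
    # Candidate stems for the -s suffix, undoing the y -> ie spelling change.
    candidates = []
    if w.endswith("ies"):
        candidates.append(w[:-3] + "y")
    if w.endswith("s"):
        candidates.append(w[:-1])
    for stem in candidates:
        if stem in NOUNS:
            analyses.append((stem, "N", "PL"))
        if stem in VERBS:
            analyses.append((stem, "V", "3SG"))
    # Candidate stems for -ing, restoring a deleted final e.
    if w.endswith("ing"):
        for stem in (w[:-3], w[:-3] + "e"):
            if stem in VERBS:
                analyses.append((stem, "V", "PRES-PART"))
    return analyses

if __name__ == "__main__":
    for word in ["Cats", "Cat", "Cities", "Geese", "Ducks", "Merging", "Caught"]:
        print(word, parse(word))
```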

33
Principles of morphological parsing
  • Lexicon
  • Morphotactics (e.g., plural follows noun)
  • Orthography (easy → easier)
  • Irregular nouns: e.g., geese, sheep, mice
  • Irregular verbs: e.g., caught, ate, eaten

34
FSA for adjectives
  • Big, bigger, biggest
  • Cool, cooler, coolest, coolly
  • Red, redder, reddest
  • Clear, clearer, clearest, clearly, unclear,
    unclearly
  • Happy, happier, happiest, happily
  • Unhappy, unhappier, unhappiest, unhappily
  • What about unbig, redly, and realest?

35
Using FSA for recognition
  • Is a string a legitimate word or not?
  • Two-level morphology: lexical level and surface
    level (Koskenniemi 1983)
  • Finite-state transducers (FST) used for regular
    relations
  • Inversion and composition of FST

36
Orthographic rules
  • Beg/begging
  • Make/making
  • Watch/watches
  • Try/tries
  • Panic/panicked

37
Combining FST lexicon and rules
  • Cascades of transducers: the output of one
    becomes the input of another

38
Weighted Automata
39
Phonetic symbols
  • IPA
  • Arpabet
  • Examples

40
Using WFST for language modeling
  • Phonetic representation
  • Part-of-speech tagging

41
Word Classes and Part-of-Speech Tagging
42
Some POS statistics
  • Preposition list from COBUILD
  • Single-word particles
  • Conjunctions
  • Pronouns
  • Modal verbs

43
Tagsets for English
  • Penn Treebank
  • Other tagsets (see Week 1 slides)

44
POS ambiguity
  • Degrees of ambiguity (DeRose 1988)
  • Rule-based POS tagging
  • ENGTWOL (Voutilainen et al.)
  • Sample rule:
  • Adverbial-That rule (e.g., "it isn't that odd")
    Given input: "that"
    if (+1 A/ADV/QUANT)
       (+2 SENT-LIM)
       (NOT -1 SVOC/A)   (not a verb like "consider")
    then eliminate non-ADV tags
    else eliminate ADV tag

45
Evaluating POS taggers
  • Percent correct
  • What is the lower bound on a system's
    performance?
  • What about the upper bound?

46
Kappa
  • N: number of items (index i)
  • n: number of categories (index j)
  • k: number of annotators
  • When κ > 0.8, agreement is considered high
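The slide lists the quantities but not the formula. One common multi-annotator formulation that uses exactly these quantities is Fleiss' kappa; the sketch below assumes that is the statistic meant, which the slide does not confirm. The input is a table where counts[i][j] is the number of annotators who assigned item i to category j, and the function name is illustrative.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[i][j] = annotators choosing category j for item i.

    Assumes every item was labeled by the same number of annotators (k >= 2).
    """
    num_items = len(counts)                 # N on the slide
    num_categories = len(counts[0])         # n on the slide
    num_annotators = sum(counts[0])         # k on the slide

    # Expected agreement: sum of squared overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (num_items * num_annotators)
        for j in range(num_categories)
    ]
    expected = sum(p * p for p in proportions)

    # Observed agreement: average pairwise agreement per item.
    pairs = num_annotators * (num_annotators - 1)
    observed = sum(
        (sum(c * c for c in row) - num_annotators) / pairs for row in counts
    ) / num_items

    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
    split = [[2, 1], [1, 2], [2, 1], [1, 2]]
    print(fleiss_kappa(perfect), fleiss_kappa(split))
```

The function divides by 1 minus the expected agreement, so it is undefined when every annotator uses a single category for all items.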

47
Readings for next time
  • J&M Chapters 5.9, 8, 9
  • Lecture notes 2