CPSC 503 Computational Linguistics

Description:

RegExps and Finite State Automata Lecture 2. Giuseppe Carenini. 9/29/09. CPSC503 Spring 2004 ... 'OR' /([Ff]rom|[Ss]ubject|[Dd]ate) ... – PowerPoint PPT presentation

Slides: 35
Provided by: gcare

Transcript and Presenter's Notes

1
CPSC 503 Computational Linguistics
  • RegExps and Finite State Automata Lecture 2
  • Giuseppe Carenini

2
Survey Results
By Student
By topic
3
Knowledge-Formalisms Map (including probabilistic
formalisms)
State Machines (and prob. versions) (Finite State
Automata, Finite State Transducers, Markov Models)
Morphology
Syntax
Rule systems (and prob. versions) (e.g., (Prob.)
Context-Free Grammars)
Semantics
  • Logical formalisms
  • (First-Order Logics)

Pragmatics Discourse and Dialogue
AI planners
4
Next Two Lectures
  • State Machines (no prob.)
  • Finite State Automata (and Regular Expressions)
  • Finite State Transducers

(English) Morphology
5
Today 1/16
  • Regular Expressions
  • Errors
  • Finite-state automata
  • Generation
  • Recognition
  • Non-determinism

6
Regular Expressions
  • Def. Notation to specify a set of strings
  • Simplest case: /CPSC503/
  • [ ] disjunction of characters, [^ ] negation
  • /CPSC50[34]/, /CPSC50[0-9]/, /CPSC50[^34]/
  • . Any character (to match a period: \.)
  • | OR: /([Ff]rom|[Ss]ubject|[Dd]ate)/
  • Anchors: ^ (start of line), $ (end of line),
    \b (word boundary)
  • /([Ff]rom\b|[Ss]ubject\b|[Dd]ate\b)/
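
A minimal sketch of the operators on this slide, using Python's `re` module. The choice of Python and all sample strings are assumptions for illustration; the lecture itself refers to Perl-style regular expressions.

```python
import re

# Character disjunction and negation (bracket expressions).
assert re.search(r"CPSC50[34]", "CPSC503")
assert re.search(r"CPSC50[0-9]", "CPSC507")
assert not re.search(r"CPSC50[^34]", "CPSC503")   # [^34] rejects 3 and 4

# "." matches any character; "\." matches only a literal period.
assert re.search(r"end.", "end!")
assert not re.search(r"end\.", "end!")

# "|" is OR between alternatives, "^" anchors at the start of the line,
# and "\b" requires a word boundary after the keyword.
header = re.compile(r"^([Ff]rom|[Ss]ubject|[Dd]ate)\b")
lines = ["From: alice", "subject: hi", "Update: none", "Datebook"]
matches = [bool(header.search(line)) for line in lines]
print(matches)   # [True, True, False, False]
```

"Update: none" fails because of the start-of-line anchor, and "Datebook" fails because no word boundary follows "Date".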

7
Regular Expressions (cont.)
  • ( ) Grouping: /happy|ier/ vs. /happ(y|ier)/
  • Operators applied to preceding item (character or
    exp.)
  • ? Optional: /colou?r/, /July? (fourth|4(th)?)/
  • Repetitions
  • + one or more
  • * any number including none
  • {num} num times

/[0-9]+(\.[0-9]+){3}/
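
A short sketch of grouping, optionality, and repetition in Python's `re` module. The test strings are made up, and the last pattern is one assumed reading of the slide's final example (four dot-separated runs of digits).

```python
import re

# Without grouping, "|" splits the whole pattern: "happy" OR "ier".
assert re.fullmatch(r"happy|ier", "ier")
assert not re.fullmatch(r"happy|ier", "happier")
# With grouping, both alternatives share the prefix "happ".
assert re.fullmatch(r"happ(y|ier)", "happier")

# "?" makes the preceding item (a character or a group) optional.
assert re.fullmatch(r"colou?r", "color") and re.fullmatch(r"colou?r", "colour")
date = r"July? (fourth|4(th)?)"
assert all(re.fullmatch(date, s) for s in ["July fourth", "July 4", "Jul 4th"])

# "+" one or more, "*" any number including none, "{n}" exactly n times.
assert re.fullmatch(r"ba+!", "baaa!") and not re.fullmatch(r"ba+!", "b!")
assert re.fullmatch(r"ba*!", "b!")
dotted = re.compile(r"[0-9]+(\.[0-9]+){3}")
results = [bool(dotted.fullmatch(s)) for s in ["192.168.0.1", "1.2.3", "10.0.0"]]
print(results)   # [True, False, False]
```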
8
Example of Usage Text Searching
  • Find me all instances of the determiner "the"
    in an English text.
  • To count them
  • To substitute them with something else
  • You try /the/

The other cop went to the bank but there were no
people there.
9
Errors
  • The process we just went through was based on
    fixing two kinds of errors
  • Matching strings that we should not have matched
    (there, other)
  • False positives
  • Not matching things that we should have matched
    (The)
  • False negatives
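
To make the two error types concrete, here is a small sketch that counts matches on the example sentence. Only /the/ appears in the transcript; the two refined patterns are an assumed reconstruction of the fixing process, not the lecturer's exact steps.

```python
import re

text = "The other cop went to the bank but there were no people there."

# Gold standard: tokens that really are the determiner (2 in this sentence).
gold = sum(1 for w in re.findall(r"[A-Za-z]+", text) if w.lower() == "the")

# /the/ is the slide's first attempt; the two refinements are assumed.
patterns = [r"the", r"[tT]he", r"\b[tT]he\b"]
counts = {p: len(re.findall(p, text)) for p in patterns}

for p, n in counts.items():
    print(f"/{p}/ matched {n} times, gold standard is {gold}")
```

On this sentence /the/ matches 4 times (three false positives in "other" and the two "there", plus one false negative, "The"), /[tT]he/ matches 5 times (the false negative is fixed, the false positives remain), and /\b[tT]he\b/ matches exactly the 2 determiners.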

10
Errors cont.
  • Reducing the error rate for an application often
    involves two antagonistic efforts
  • Increasing accuracy (minimizing false positives)
  • Increasing coverage (minimizing false negatives).

11
Finite State Automata
  • FSAs implement (generate and recognize) regular
    expressions
  • Regular expressions describe many linguistic
    phenomena
  • FSAs model many linguistic phenomena
12
FSAs as Graphs
  • Let's start with the sheep language from the
    text: /baa+!/

13
Verify
  • It can generate the same set of strings (language)
  • To generate a string
  • follow a path leading to an accept state
  • at each transition output corresponding symbol
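
Generation can be sketched as a walk over the machine that outputs one symbol per transition. The code below assumes the sheep FSA has the layout q0 -b-> q1 -a-> q2 -a-> q3, with an a-loop on q3 and q3 -!-> q4; the dictionary encoding and the function name `generate` are illustrative choices.

```python
from collections import deque

# Assumed sheep-language FSA: (state, symbol) -> next state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPT = {4}

def generate(max_len):
    """Return every string of length <= max_len the FSA generates, shortest first."""
    out = []
    queue = deque([(0, "")])          # (current state, symbols output so far)
    while queue:
        state, s = queue.popleft()
        if state in ACCEPT:           # the path reached an accept state
            out.append(s)
        if len(s) < max_len:
            for (src, sym), dst in TRANSITIONS.items():
                if src == state:      # follow the transition, output its symbol
                    queue.append((dst, s + sym))
    return out

print(generate(6))   # ['baa!', 'baaa!', 'baaaa!']
```

The length bound is needed only because the language is infinite; every string produced also matches /baa+!/.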

14
Sheep FSA
  • We can say the following things about this
    machine
  • It has 5 states
  • b, a, and ! are in its alphabet
  • q0 is the start state
  • q4 is an accept state
  • It has 5 transitions

15
Sheep FSA
  • We can say the following things about this
    machine
  • It has 5 states
  • At least b, a, and ! are in its alphabet
  • q0 is the start state
  • q4 is an accept state
  • It has 5 transitions

16
But note
  • There are other machines that correspond to this
    language
  • More on this one later

17
More Formally
  • You can specify an FSA by enumerating the
    following things.
  • The set of states Q
  • A finite alphabet Σ
  • A start state
  • A set of accept/final states
  • A transition function that maps Q × Σ to Q

18
Represented as a Table
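
The table on this slide did not survive in the transcript. As a hedged sketch, a state-transition table for the sheep FSA could look like the following, assuming the machine q0 -b-> q1 -a-> q2 -a-> q3, an a-loop on q3, and q3 -!-> q4. The names `TABLE` and `render` are illustrative.

```python
# Rows are states, columns are input symbols; None marks "no transition".
ALPHABET = ["b", "a", "!"]
TABLE = {
    0: {"b": 1, "a": None, "!": None},
    1: {"b": None, "a": 2, "!": None},
    2: {"b": None, "a": 3, "!": None},
    3: {"b": None, "a": 3, "!": 4},
    4: {"b": None, "a": None, "!": None},   # accept state
}

def render(table):
    """Format the table as plain text, one row per state."""
    lines = ["state  " + "  ".join(ALPHABET)]
    for state, row in table.items():
        cells = "  ".join("-" if row[s] is None else str(row[s]) for s in ALPHABET)
        lines.append(f"q{state}     {cells}")
    return "\n".join(lines)

print(render(TABLE))
```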
19
About Alphabets
  • Don't take that word too narrowly; it just means
    we need a finite set of symbols in the input.
  • These symbols can and will stand for bigger
    objects that can have internal structure.

20
Dollars and Cents
21
Recognition
  • Def. process of determining if a string is in the
    language we're defining with the machine
  • Or it's the process of determining if the
    equivalent regular expression matches a string

22
Recognition Pseudocode (slide)
  • Assume input on a tape
  • Start in the start state pointing at the
    beginning of the tape
  • Examine the current input symbol
  • Consult the table
  • (If a transition is allowed) Go to a new state
    and update the tape pointer (Else Fail).
  • Repeat this process, until you run out of tape
  • Now, if you are in an accept state accept the
    string otherwise Fail
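
The steps above can be sketched as a small table-driven function. This is one possible rendering under stated assumptions, not the slide's own pseudocode: the table is a Python dictionary keyed by (state, symbol), and the sheep machine is assumed to be q0 -b-> q1 -a-> q2 -a-> q3 with an a-loop on q3 and q3 -!-> q4.

```python
def d_recognize(tape, table, start, accept):
    """Table-driven recognition for a deterministic FSA."""
    state = start
    for symbol in tape:                       # advance the tape pointer one symbol at a time
        state = table.get((state, symbol))    # consult the table
        if state is None:                     # no transition allowed: fail
            return False
    return state in accept                    # out of tape: accept only in an accept state

# Assumed sheep-language table: (state, symbol) -> next state.
SHEEP = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}

tests = ["baa!", "baaaa!", "ba!", "baa", "baa!a"]
results = {s: d_recognize(s, SHEEP, 0, {4}) for s in tests}
print(results)
```

Only "baa!" and "baaaa!" are accepted: "ba!" has no transition for "!" out of q2, "baa" runs out of tape in a non-accept state, and "baa!a" has no transition out of q4.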

23
D-Recognize
24
Key Points
  • D-recognize is a simple table-driven interpreter
  • Matching strings with regular expressions (à la
    Perl) is a matter of
  • translating the expression into a machine (table)
    and
  • passing the table to an interpreter

25
FSA Generative Formalisms
  • FSAs can be viewed from two perspectives
  • Acceptors that can tell you if a string is in the
    language
  • Generators to produce all and only the strings in
    the language

26
Non-Determinism
27
Non-Determinism cont.
  • Yet another technique
  • Epsilon transitions
  • Key point these transitions do not examine or
    advance the tape during recognition
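
One way to see that epsilon transitions consume no input is to compute which states are reachable "for free" from a given state. The sketch below assumes a sheep machine in which an epsilon arc leads from q3 back to q2; the empty string as the epsilon marker and the name `epsilon_closure` are illustrative choices.

```python
EPS = ""   # assumed marker for an epsilon arc: the empty string

def epsilon_closure(states, arcs):
    """All states reachable from `states` using only epsilon arcs (no input consumed)."""
    closure = set(states)
    stack = list(states)
    while stack:
        q = stack.pop()
        for nxt in arcs.get((q, EPS), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

# Hypothetical machine: (state, symbol) -> set of next states.
ARCS = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {3}, (3, EPS): {2}, (3, "!"): {4}}

print(sorted(epsilon_closure({3}, ARCS)))   # [2, 3]
print(sorted(epsilon_closure({0}, ARCS)))   # [0]
```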


28
Non-Deterministic Recognition: Key ideas
  • An input can lead to multiple paths
  • The algorithm may need to explore all possible
    paths
  • Whenever there is a choice, one possibility is
    to explore the alternatives one at a time.
  • Save alternatives in an agenda

29
Non-Deterministic Recognition
  • Success occurs when a path is found through the
    machine that ends in an accept state
  • Failure occurs when none of the possible paths
    lead to an accept state
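
A minimal agenda-based sketch of non-deterministic recognition follows. It is an illustration under stated assumptions rather than the textbook's ND-recognize: the agenda is a Python list used as a stack (depth-first order), epsilon arcs are keyed by the empty string, and the example machine has two arcs labelled "a" leaving q2.

```python
def nd_recognize(tape, arcs, start, accept):
    """Agenda-based recognition for a non-deterministic FSA (depth-first order)."""
    agenda = [(start, 0)]                 # search states: (machine state, tape index)
    seen = set()
    while agenda:
        state, i = agenda.pop()           # explore one alternative at a time
        if (state, i) in seen:
            continue                      # already explored; also guards epsilon cycles
        seen.add((state, i))
        if i == len(tape) and state in accept:
            return True                   # some path ends in an accept state
        for nxt in arcs.get((state, ""), ()):           # epsilon arcs: tape index unchanged
            agenda.append((nxt, i))
        if i < len(tape):
            for nxt in arcs.get((state, tape[i]), ()):  # ordinary arcs: consume one symbol
                agenda.append((nxt, i + 1))
    return False                          # agenda exhausted: every path failed

# Assumed non-deterministic sheep machine: (state, symbol) -> set of next states.
NFA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}

results = {s: nd_recognize(s, NFA, 0, {4}) for s in ["baa!", "baaa!", "ba!", "baa"]}
print(results)
```

Popping from the end of the list gives depth-first search; taking items from the front instead would give breadth-first search over the same space of (state, tape position) pairs.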

30
Example (slide)
Input tape: b a a a !
31
Recognition as Search
32
Equivalence between D and ND
  • ND machines can always be converted to D ones
  • That means that ND machines are not more powerful
    than D ones
  • It also means that one way to do recognition with
    an ND machine is to turn it into a D one.
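
The conversion can be sketched with the standard subset construction, in which each deterministic state stands for a set of non-deterministic states. The version below assumes an ND machine without epsilon arcs to keep it short; the function names and the example machine are illustrative.

```python
def determinize(arcs, start, accept, alphabet):
    """Subset construction for an NFA without epsilon arcs (a simplifying assumption)."""
    start_set = frozenset([start])
    table, todo, seen = {}, [start_set], {start_set}
    while todo:
        current = todo.pop()
        for sym in alphabet:
            # Union of everything reachable on `sym` from any member state.
            target = frozenset(q for s in current for q in arcs.get((s, sym), ()))
            if not target:
                continue                  # no transition on this symbol
            table[(current, sym)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    d_accept = {S for S in seen if S & set(accept)}
    return table, start_set, d_accept

# Assumed non-deterministic sheep machine: q2 has two arcs labelled "a".
NFA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
table, d_start, d_accept = determinize(NFA, 0, {4}, "ba!")

def run(tape):
    """Deterministic recognition over the constructed table."""
    state = d_start
    for sym in tape:
        state = table.get((state, sym))
        if state is None:
            return False
    return state in d_accept

print(len(table), run("baaa!"), run("ba!"))   # 5 True False
```

For this machine the result has five states and five transitions, but in the worst case the construction can produce up to 2^n deterministic states for n non-deterministic ones.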

33
Why Bother?
  • Non-determinism doesn't get us more formal
    power, and it causes headaches, so why bother?
  • More natural solutions
  • Deterministic machines built by the conversion
    construction can be too big

34
Next Time
  • Read Chapter 1 (on-line) and Chapter 2 of
    textbook
  • Try to understand
  • ND-recognize algorithm
  • and why it is a state-space search algorithm