Regular Expressions and Automata

Provided by: Kathy9

1
Regular Expressions and Automata
Lecture 2-2
  • September 10
  • 2009

2
Finite State Automata
  • Regular Expressions (REs) can be viewed as a way
    to describe machines called Finite State Automata
    (FSA, also known as automata, finite automata).
  • FSAs and their close variants are a theoretical
    foundation of much of the field of NLP.

3
Finite State Automata
  • FSAs recognize the regular languages represented
    by regular expressions
  • SheepTalk: /baa+!/
  • Directed graph with labeled nodes and arc
    transitions
  • Five states (q0 the start state, q4 the final
    state) and 5 transitions

4
Formally
  • An FSA is a 5-tuple consisting of
  • Q: a set of states {q0, q1, q2, q3, q4}
  • Σ: a finite alphabet of symbols {a, b, !}
  • q0: a start state
  • F: a set of accept/final states, F ⊆ Q: {q4}
  • δ(q,i): a transition function mapping Q × Σ to Q

5
State Transition Table for SheepTalk
           Input
State    b    a    !
0        1    Ø    Ø
1        Ø    2    Ø
2        Ø    3    Ø
3        Ø    3    4
4        Ø    Ø    Ø
6
Recognition
  • Recognition (or acceptance) is the process of
    determining whether or not a given input should
    be accepted by a given machine.
  • Or it's the process of determining if a string
    is in the language we're defining with the
    machine.
  • In terms of REs, it's the process of determining
    whether or not a given input matches a particular
    regular expression.
  • Traditionally, recognition is viewed as
    processing an input written on a tape consisting
    of cells containing elements from the alphabet.

7
  • An FSA recognizes (accepts) the strings of a
    regular language
  • baa!
  • baaa!
  • baaaa!
  • Tape metaphor: a rejected input

[Figure: input tape whose cells hold a, b, and ! symbols, with the machine in state q0]
8
Recognition
  • Simply a process of starting in the start state
  • Examining the current input
  • Consulting the table
  • Going to a new state and updating the tape
    pointer.
  • Until you run out of tape.
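
The steps above can be sketched as a short table-driven loop. This is a minimal sketch, not the code from the D-Recognize slide: the names `SHEEP_TABLE`, `SHEEP_ACCEPT`, and `d_recognize` are illustrative assumptions, and a missing dictionary entry stands in for Ø.

```python
# Transition table for the SheepTalk machine: state -> {input symbol -> next state}.
# A missing entry plays the role of Ø (no legal transition).
SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
SHEEP_ACCEPT = {4}


def d_recognize(tape, table, accept_states, start=0):
    """Return True if the deterministic machine in `table` accepts `tape`."""
    state = start
    for symbol in tape:                      # advance the tape pointer cell by cell
        state = table[state].get(symbol)     # consult the table
        if state is None:                    # Ø entry: reject immediately
            return False
    return state in accept_states            # out of tape: accept only in a final state


print(d_recognize("baaa!", SHEEP_TABLE, SHEEP_ACCEPT))  # True
print(d_recognize("ba!", SHEEP_TABLE, SHEEP_ACCEPT))    # False
```

Swapping in a different table gives a recognizer for a different language without touching the loop.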

9
D-Recognize
10
Input tape: b a a a a !
           Input
State    b    a    !
0        1    Ø    Ø
1        Ø    2    Ø
2        Ø    3    Ø
3        Ø    3    4
4        Ø    Ø    Ø
11
Key Points
  • Deterministic means that at each point in
    processing there is always one unique thing to do
    (no choices).
  • D-recognize is a simple table-driven interpreter
  • The algorithm is universal for all unambiguous
    regular languages.
  • To change the machine, you change the table.

12
Key Points
  • Crudely, therefore, matching strings with regular
    expressions (à la Perl) is a matter of
  • translating the expression into a machine (table)
    and
  • passing the table to an interpreter
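
From the user's side, this translate-then-interpret pipeline is hidden behind a single call. The sketch below uses Python's standard `re` module in place of Perl; `matches_sheeptalk` is an illustrative name. Python's engine is a backtracking matcher, not literally a table-driven deterministic interpreter, so this shows only the compile-then-match workflow.

```python
import re

# /baa+!/ : a "b", at least two "a"s, then "!".
# re.compile translates the expression into an internal matcher once.
SHEEP_RE = re.compile(r"baa+!")


def matches_sheeptalk(text):
    """Return True if the whole string is in the SheepTalk language."""
    return SHEEP_RE.fullmatch(text) is not None


print(matches_sheeptalk("baaaa!"))  # True
print(matches_sheeptalk("ba!"))     # False
```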

13
Recognition as Search
  • You can view this algorithm as a degenerate kind
    of state-space search.
  • States are pairings of tape positions and state
    numbers.
  • Operators are compiled into the table
  • Goal state is a pairing with the end of tape
    position and a final accept state
  • It's degenerate because?

14
Formal Languages
  • Formal Languages are sets of strings composed of
    symbols from a finite set of symbols.
  • Finite-state automata define formal languages
    (without having to enumerate all the strings in
    the language)
  • Given a machine m (such as a particular FSA), L(m)
    means the formal language characterized by m.
  • L(SheepTalk FSA) = {baa!, baaa!, baaaa!, ...} (an
    infinite set)

15
Generative Formalisms
  • The term Generative is based on the view that you
    can run the machine as a generator to get strings
    from the language.
  • FSAs can be viewed from two perspectives
  • Acceptors that can tell you if a string is in the
    language
  • Generators to produce all and only the strings in
    the language
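
The generator view can be sketched by walking the same kind of transition table breadth-first and emitting a string whenever an accept state is reached. This is one possible sketch; `generate` and the table layout are assumptions for illustration. Because the language is infinite, the generator never finishes by itself, so the caller takes only as many strings as it wants.

```python
from collections import deque
from itertools import islice

SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
SHEEP_ACCEPT = {4}


def generate(table, accept_states, start=0):
    """Yield strings of the language, shortest first (breadth-first walk)."""
    queue = deque([(start, "")])             # (machine state, string produced so far)
    while queue:
        state, prefix = queue.popleft()
        if state in accept_states:
            yield prefix
        for symbol, nxt in sorted(table[state].items()):
            queue.append((nxt, prefix + symbol))


print(list(islice(generate(SHEEP_TABLE, SHEEP_ACCEPT), 3)))
# ['baa!', 'baaa!', 'baaaa!']
```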

16
Three Views
  • Three equivalent formal ways to look at what
    we're up to (not including tables, and we'll
    find more)
  • Regular Expressions
  • Finite State Automata
  • Regular Languages
17
Determinism
  • Let's take another look at what is going on with
    d-recognize.
  • In particular, let's look at what it means to be
    deterministic here and see if we can relax that
    notion.
  • How would our recognition algorithm change?
  • What would it mean for the accepted language?

18
Determinism and Non-Determinism
  • Deterministic: There is at most one transition
    that can be taken given a current state and input
    symbol.
  • Non-deterministic: There is a choice of several
    transitions that can be taken given a current
    state and input symbol. (The machine doesn't
    specify how to make the choice.)

19
Non-Deterministic FSAs for SheepTalk
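[Figure: two non-deterministic FSAs for SheepTalk, each with states q0 to q4. The first has arcs labeled b, a, a, a, !; the second has arcs labeled b, a, a, ! and ε.]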
20
FSAs as Grammars for Natural Language
[Figure: FSA with states q0 to q6 and arcs labeled the, rev, hon, dr, mr, ms, mrs, pat, l., robinson, plus two ε arcs]
Can you use a regular expression to capture this too?
21
Equivalence
  • Non-deterministic machines can be converted to
    deterministic ones with a fairly simple
    construction (essentially building "set states"
    that track all the states reachable by following
    all possible transitions in parallel)
  • That means that they have the same power:
    non-deterministic machines are not more powerful
    than deterministic ones
  • It also means that one way to do recognition with
    a non-deterministic machine is to turn it into a
    deterministic one.
  • Problems: translating gives us a not very
    intuitive machine, and this machine has LOTS of
    states
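
The "set states" idea can be sketched as the usual subset construction. The machine below is an assumed non-deterministic SheepTalk FSA in which state 2 has two arcs on a (a self-loop and an arc to state 3); ε arcs are left out to keep the sketch short, and `determinize` is an illustrative name.

```python
from collections import deque

# Non-deterministic table: state -> {symbol -> set of possible next states}.
NFA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},   # the choice point: stay in 2 or move on to 3
    3: {"!": {4}},
    4: {},
}
NFA_ACCEPT = {4}


def determinize(nfa, accept_states, start=0):
    """Build a deterministic table whose states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa = {}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:                   # this set state is already expanded
            continue
        dfa[current] = {}
        symbols = {sym for q in current for sym in nfa[q]}
        for sym in symbols:
            # Follow every possible transition on `sym` in parallel.
            target = frozenset(t for q in current for t in nfa[q].get(sym, ()))
            dfa[current][sym] = target
            queue.append(target)
    # A set state accepts if it contains at least one NFA accept state.
    dfa_accept = {s for s in dfa if s & accept_states}
    return dfa, start_set, dfa_accept


dfa, dfa_start, dfa_accept = determinize(NFA, NFA_ACCEPT)
print(len(dfa))  # 5
```

For this small machine the result stays at five states, but in general the number of set states can grow exponentially in the size of the original machine.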

22
Non-Deterministic Recognition
  • In an ND FSA, a string that is in the language
    defined by the machine directs at least one path
    through the machine that leads to an accept
    condition.
  • But not all paths directed through the machine by
    an accept string lead to an accept state. It is
    OK for some paths to lead to a reject condition.
  • In an ND FSA, no path directed through the machine
    by a string outside the language leads to an
    accept condition.

23
Non-Deterministic Recognition
  • So success in a non-deterministic recognition
    occurs when a path is found through the machine
    that ends in an accept.
  • However, being driven to a reject condition by an
    input does not imply it should be rejected.
  • Failure occurs only when none of the possible
    paths lead to an accept state.
  • This means that the problem of non-deterministic
    recognition can be thought of as a standard
    search problem.

24
The Problem of Choice
  • Choice in non-deterministic models comes up again
    and again in NLP.
  • Several standard solutions:
  • Backup (search, this chapter)
      • Save input/state of machine at choice points
      • If wrong choice, use this saved state to back up
        and try another choice
  • Lookahead
      • Look ahead in the input to help make a choice
  • Parallelism
      • Look at all choices in parallel
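
The parallelism option can be sketched by tracking the set of all states the machine could be in after each input symbol. The table below is an assumed non-deterministic SheepTalk machine without ε arcs, and `parallel_recognize` is an illustrative name.

```python
# Non-deterministic table: state -> {symbol -> set of possible next states}.
NFA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},
    3: {"!": {4}},
    4: {},
}
NFA_ACCEPT = {4}


def parallel_recognize(tape, nfa, accept_states, start=0):
    """Follow every choice at once by keeping a set of current states."""
    current = {start}
    for symbol in tape:
        current = {t for q in current for t in nfa[q].get(symbol, ())}
        if not current:                      # every path has died
            return False
    return bool(current & accept_states)     # accept if any path ended in a final state


print(parallel_recognize("baaa!", NFA, NFA_ACCEPT))  # True
print(parallel_recognize("ba!", NFA, NFA_ACCEPT))    # False
```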

25
Backup
  • After a wrong choice leads to a dead-end (either
    no input left in a non-accept state, or no legal
    transitions), return to a previous choice point
    to pursue another unexplored choice.
  • Thus, at each choice point, the search process
    needs to remember the (unexplored) choices.
  • Standard state-space search.
  • State = (FSA node or machine state, tape position)
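
Backup can be sketched with an agenda of unexplored (machine state, tape position) pairs. This is a minimal sketch, not the code from the ND-Recognize slide: the machine is an assumed non-deterministic SheepTalk FSA without ε arcs, the agenda is used as a stack (depth-first), and `nd_recognize` is an illustrative name.

```python
# Non-deterministic table: state -> {symbol -> set of possible next states}.
NFA = {
    0: {"b": {1}},
    1: {"a": {2}},
    2: {"a": {2, 3}},   # choice point
    3: {"!": {4}},
    4: {},
}
NFA_ACCEPT = {4}


def nd_recognize(tape, nfa, accept_states, start=0):
    """Search the space of (state, tape position) pairs, backing up on dead ends."""
    agenda = [(start, 0)]                    # unexplored search states
    while agenda:
        state, pos = agenda.pop()            # back up to the most recent saved choice
        if pos == len(tape):
            if state in accept_states:
                return True                  # one accepting path is enough
            continue                         # out of input in a non-accept state
        for nxt in nfa[state].get(tape[pos], ()):
            agenda.append((nxt, pos + 1))    # remember every alternative
    return False                             # agenda exhausted: no path accepts


print(nd_recognize("baaa!", NFA, NFA_ACCEPT))  # True
print(nd_recognize("baaa", NFA, NFA_ACCEPT))   # False
```

Popping from the front of the agenda instead of the back would turn the same loop into a breadth-first search.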

26
Example
[Figure: input tape b a a a ! with machine states q0, q1, q2, q3, q4]
27
ND-Recognize Code
28-41
Example
[Figures: step-by-step trace of the example, with the agenda shown on most slides]
42
Key Points
  • States in the search space are pairings of tape
    positions and states in the machine.
  • By keeping track of as yet unexplored states, a
    recognizer can systematically explore all the
    paths through the machine given an input.

43
Infinite Search
  • If you're not careful, such searches can go into
    an infinite loop.
  • How?

44
Why Bother?
  • Non-determinism doesn't get us more formal power
    and it causes headaches, so why bother?
  • More natural solutions
  • Machines based on construction are too big

45
Compositional Machines
  • Formal languages are just sets of strings
  • Therefore, we can talk about various set
    operations (intersection, union, concatenation)
  • This turns out to be a useful exercise

46
Union
  • Accept a string in either of two languages

47
Concatenation
  • Accept a string consisting of a string from
    language L1 followed by a string from language L2.

48
Negation
  • Construct a machine M2 to accept all strings not
    accepted by machine M1 and reject all the strings
    accepted by M1
  • Invert all the accept and non-accept states in M1
  • Does that work for non-deterministic machines?
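
For a deterministic machine, the inversion can be sketched as below. One detail matters: the table must first be made total by sending every Ø entry to an explicit dead state, so that strings on which M1 gets stuck end up in a state that can be marked accepting. The names `complete`, `complement`, `run`, and the `"dead"` state are illustrative assumptions, and the tape is assumed to contain only symbols from the alphabet.

```python
SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
SHEEP_ACCEPT = {4}
ALPHABET = {"a", "b", "!"}


def complete(table, alphabet, dead="dead"):
    """Replace every missing (Ø) entry with a transition to a dead state."""
    full = {q: {sym: row.get(sym, dead) for sym in alphabet}
            for q, row in table.items()}
    full[dead] = {sym: dead for sym in alphabet}   # the dead state loops forever
    return full


def complement(table, accept_states, alphabet):
    """Return (table, accept states) for M2, which accepts what M1 rejects."""
    full = complete(table, alphabet)
    return full, set(full) - set(accept_states)    # swap accept and non-accept


def run(tape, table, accept_states, start=0):
    """Run a total deterministic table over a tape drawn from its alphabet."""
    state = start
    for symbol in tape:
        state = table[state][symbol]
    return state in accept_states


NOT_SHEEP_TABLE, NOT_SHEEP_ACCEPT = complement(SHEEP_TABLE, SHEEP_ACCEPT, ALPHABET)
print(run("baa!", NOT_SHEEP_TABLE, NOT_SHEEP_ACCEPT))  # False
print(run("ba!", NOT_SHEEP_TABLE, NOT_SHEEP_ACCEPT))   # True
```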

49
Intersection
  • Accept a string that is in both of two specified
    languages
  • An indirect construction
  • A ∩ B = ¬(¬A ∪ ¬B)
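
As a point of comparison with the indirect construction, intersection of two deterministic machines can also be done directly by running both in lockstep and accepting only if both accept. The sketch below uses that direct, product-style simulation, not the De Morgan route listed on the slide. The second machine (an even number of a's) and all function names are assumptions for illustration.

```python
# Each machine is a (table, accept states, start state) triple.
SHEEP = (
    {0: {"b": 1}, 1: {"a": 2}, 2: {"a": 3}, 3: {"a": 3, "!": 4}, 4: {}},
    {4},
    0,
)
# Assumed second language: strings over {a, b, !} with an even number of a's.
EVEN_A = (
    {0: {"a": 1, "b": 0, "!": 0}, 1: {"a": 0, "b": 1, "!": 1}},
    {0},
    0,
)


def step(table, state, symbol):
    """One table lookup; None stands for Ø and stays None once reached."""
    return None if state is None else table[state].get(symbol)


def intersect_recognize(tape, machine_a, machine_b):
    """Accept only if both machines accept the same tape."""
    table_a, accept_a, state_a = machine_a
    table_b, accept_b, state_b = machine_b
    for symbol in tape:                      # the search state is the pair of states
        state_a = step(table_a, state_a, symbol)
        state_b = step(table_b, state_b, symbol)
    return state_a in accept_a and state_b in accept_b


print(intersect_recognize("baa!", SHEEP, EVEN_A))   # True
print(intersect_recognize("baaa!", SHEEP, EVEN_A))  # False
```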

50
Why Bother?
  • FSAs can be useful tools for recognizing and
    generating subsets of natural language
  • But they cannot represent all NL phenomena
    (center embedding: "The mouse the cat ... chased
    died.")

51
Summing Up
  • Regular expressions and FSAs can represent
    subsets of natural language as well as regular
    languages
  • Both representations may be impossible for humans
    to understand for any real subset of a language
  • But they are very easy to use for smaller subsets
  • Next time: Read Ch. 3
  • For fun: think of ways you might characterize
    features of your email using only regular
    expressions