Title: CSE 3813 Introduction to Formal Languages and Automata
1CSE 3813 Introduction to Formal Languages and
Automata
- Chapter 3
- Regular languages and regular grammars
- These class notes are based on material from our
textbook, An Introduction to Formal Languages and
Automata, 4th ed., by Peter Linz, published by
Jones and Bartlett Publishers, Inc., Sudbury, MA,
2006. They are intended for classroom use only
and are not a substitute for reading the textbook.
2Operations on formal languages
Let L1 = {10} and L2 = {011, 11}.
Union: L1 ∪ L2 = {10, 011, 11}
Concatenation: L1 L2 = {10011, 1011}
Kleene star: L1* = {λ, 10, 1010, 101010, …}
Other operations: intersection, complement, difference
3Definition Of Regular Languages
- A regular language over an alphabet Σ is one that
contains either a single string of length 0 or 1,
or strings that can be obtained by applying the
operations of union, concatenation, and Kleene star
to strings of length 0 or 1.
4Alternative definition of regular languages
The simplest possible regular languages are the
empty set and languages consisting of a single
string that is either the empty string or has
length one. For example, if Σ = {a, b}, the
simplest languages are ∅, {λ}, {a}, and {b}. A
regular language is a language that can be built
from these simple languages, by using the three
operations of union, concatenation, and Kleene
star.
5Regular Languages correspond to Regular
Expressions
- L = ∅: RE is ∅
- L = {λ}: RE is λ
- L = {a}: RE is a
- L = L1 ∪ L2: RE is (r1 + r2)
- L = L1 L2: RE is (r1 r2)
- L = L1*: RE is (r1*)
6Regular expressions
A useful shorthand for describing regular
languages. Compare to arithmetic expressions,
such as (x + 3)/2. An arithmetic expression is
constructed using arithmetic operators, such as
addition and division. A regular expression is
constructed using operations on languages, such
as concatenation, union, and Kleene star. The
value of an arithmetic expression is a number.
The value of a regular expression is a language.
7Recursive definition of a regular expression
∅ is a regular expression corresponding to the language ∅.
λ is a regular expression corresponding to the language {λ}.
For each symbol a ∈ Σ, a is a regular expression
corresponding to the language {a}.
For any regular expressions r and s, corresponding to
the regular languages L(r) and L(s), respectively,
each of the following is a regular expression:
(r + s) corresponds to the language L(r) ∪ L(s)
(r · s) or (rs) corresponds to the language L(r)L(s)
(r*) corresponds to the language (L(r))*
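The recursive definition can be turned almost line for line into a small evaluator. The sketch below is only an illustration: the nested-tuple encoding of expressions and the function name `language` are assumptions made here, not notation from the textbook. Because L(r) may be infinite, the function returns only the strings of length at most `max_len`.

```python
# Regular expressions as nested tuples (a hypothetical encoding chosen for this sketch):
#   ("empty",)  ("lambda",)  ("sym", "a")  ("union", r, s)  ("concat", r, s)  ("star", r)

def language(r, max_len):
    """Return the strings of L(r) whose length is at most max_len."""
    kind = r[0]
    if kind == "empty":                      # the expression for the empty set
        return set()
    if kind == "lambda":                     # the expression for {λ}
        return {""}
    if kind == "sym":                        # a single symbol a, language {a}
        return {r[1]} if max_len >= 1 else set()
    if kind == "union":                      # L(r) ∪ L(s)
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "concat":                     # L(r)L(s)
        left, right = language(r[1], max_len), language(r[2], max_len)
        return {u + v for u in left for v in right if len(u + v) <= max_len}
    if kind == "star":                       # (L(r))*
        base = language(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:                      # keep appending until nothing new fits
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node: {kind}")

# (a + b)* a : all strings over {a, b} that end in a
r = ("concat", ("star", ("union", ("sym", "a"), ("sym", "b"))), ("sym", "a"))
print(sorted(language(r, 2), key=lambda w: (len(w), w)))   # ['a', 'aa', 'ba']
```

The value returned is a set of strings, which mirrors the point that the value of a regular expression is a language.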
8Examples
a* + b = {λ, a, b, aa, aaa, aaaa, aaaaa, …}
a*ba* = {w ∈ Σ* : w has exactly one b}
(a + b)* = any string of a's and b's
(a + b)*aa(a + b)* = {w ∈ Σ* : w contains aa}
(a + b)*aa(a + b)* + (a + b)*bb(a + b)* = {w ∈ Σ* : w contains aa or bb}
(a + λ)b* = {ab^n : n ≥ 0} ∪ {b^n : n ≥ 0}
As with arithmetic expressions, there is an order of
precedence for operators, unless you change it
using parentheses. The order is star closure
first, then concatenation, then union.
9More examples
All strings containing no more than two a's:
(b + c)*(λ + a)(b + c)*(λ + a)(b + c)*
All strings containing no runs of a's of length greater than two:
(b + c)*(λ + a + aa)(b + c)*((b + c)(b + c)*(λ + a + aa)(b + c)*)*
All strings in which all runs of a's have lengths that are multiples of three:
(aaa + b + c)*
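One way to gain confidence in expressions like these is to test them against a direct description of the language on every short string. The sketch below assumes the alphabet {a, b, c} and uses Python's `re` module, whose syntax differs from the slides: `[bc]` stands for (b + c), `a?` for (λ + a), and `|` for union. The helper names are made up for this illustration.

```python
import re
from itertools import product

def all_strings(alphabet, max_len):
    """Yield every string over the alphabet of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

# The three expressions, transcribed into Python regex syntax
at_most_two_as = re.compile(r"[bc]*a?[bc]*a?[bc]*")
short_runs = re.compile(r"[bc]*(a|aa)?[bc]*([bc][bc]*(a|aa)?[bc]*)*")
runs_of_three = re.compile(r"(aaa|b|c)*")

def max_run_of_a(w):
    """Length of the longest block of consecutive a's in w."""
    return max((len(run) for run in re.findall(r"a+", w)), default=0)

def runs_are_multiples_of_three(w):
    return all(len(run) % 3 == 0 for run in re.findall(r"a+", w))

# Compare each expression with its plain-language description
for w in all_strings("abc", 6):
    assert bool(at_most_two_as.fullmatch(w)) == (w.count("a") <= 2)
    assert bool(short_runs.fullmatch(w)) == (max_run_of_a(w) <= 2)
    assert bool(runs_of_three.fullmatch(w)) == runs_are_multiples_of_three(w)
print("all three expressions agree with their descriptions up to length 6")
```

A check like this is not a proof, but a mismatch on a short string is usually the fastest way to find a mistake in a hand-written expression.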
10Hints for writing regular expressions
Assume Σ = {a, b, c}.
Zero or more a's: a*
One or more a's: aa*
Any string at all: (a + b + c)*
Any nonempty string: (a + b + c)(a + b + c)*
Any string that does not contain a: (b + c)*
Any string containing exactly one a: (b + c)*a(b + c)*
11Practice
- Let Σ = {a, b, c}. Give a regular expression for
the following languages:
- all strings containing exactly two a's
- all strings containing no more than three a's
12Practice
Let Σ = {a, b, c}. Give a regular expression for
the following languages:
(a) all strings containing exactly two a's:
(b + c)*a(b + c)*a(b + c)*
(b) all strings containing no more than three a's:
(b + c)*(λ + a)(b + c)*(λ + a)(b + c)*(λ + a)(b + c)*
13Practice
What languages correspond to the
following regular expressions? ab (aaa
bba) (ab)
14More practice
Give regular expressions for the following
languages, where the alphabet is Σ = {a, b, c}.
-- all strings ending in b
-- all strings containing no more than two a's
-- all strings of even length
15More practice
Give regular expressions for the following
languages, where the alphabet is Σ = {0, 1}.
-- all strings of one or more 0's followed by a 1
-- all strings of two or more symbols followed
by three or more 0's
-- all strings that do not end with 01
16Do these strings match the regular expression?
Regular expression String (01
1) 0101 (a
?)b
b (ab)a
? (a b)(ab) bb
17Big Question
Given a specific regular language L, is it safe
to assume that any subset of L is also regular?
NO! Remember that the more powerful a language
is, the more precisely it is able to discriminate
between strings that do and do not belong to the
language.
18Big Question
Suppose L is described by the regular expression
a*b*. Then ab, aabb, aaabbb, etc. are strings in
this language. However, a, b, aab, aaaaaaab,
etc. are also strings in this language. The
language consisting only of the strings of the
form a^n b^n is NOT a regular language, even though
it is a subset of L(a*b*).
19Accepting (review)
- Let M = (Q, Σ, q0, δ, A) be an FA.
- A string x ∈ Σ* is accepted by M if
- δ*(q0, x) ∈ A
- The language accepted (or recognized) by M is the
set L(M) = {x ∈ Σ* : x is accepted by M}
- A language L over the alphabet Σ is regular iff
there is a finite automaton that accepts L.
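The acceptance condition can be sketched in a few lines for the deterministic case. The dictionary encoding of δ and the two-state example automaton below are assumptions made for illustration; they do not come from the slides.

```python
def accepts(delta, start, accepting, x):
    """Apply the transition function one symbol at a time, then test membership in A."""
    state = start
    for symbol in x:
        state = delta[(state, symbol)]
    return state in accepting

# Hypothetical DFA over {a, b} that accepts strings with an even number of a's
delta = {
    ("even", "a"): "odd",  ("even", "b"): "even",
    ("odd", "a"): "even",  ("odd", "b"): "odd",
}
print(accepts(delta, "even", {"even"}, "abab"))   # True: two a's
print(accepts(delta, "even", {"even"}, "ab"))     # False: one a
```

The loop computes the extended transition function on x, so the final test corresponds to asking whether the state reached from q0 on x lies in A.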
20Kleene's theorem
1) For any regular expression r that represents
language L(r), there is a finite automaton that
accepts that same language. 2) For any finite
automaton M that accepts language L(M), there is
a regular expression that represents the same
language. Therefore, the class of languages that
can be represented by regular expressions is
equivalent to the class of languages accepted by
finite automata -- the regular languages.
21Kleene's theorem
[Diagram: regular expression to NFA (Kleene's theorem part 1); NFA to DFA (proved); DFA to regular expression (Kleene's theorem part 2)]
22Theorem 3.1
First half of Kleene's theorem: Let r be a regular
expression. Then there exists some
nondeterministic finite accepter that accepts
L(r). Consequently, L(r) is a regular
language. Proof strategy: for any regular
expression, we show how to construct an
equivalent NFA. Because regular expressions are
defined recursively, the proof is by induction.
23Base step: Give an NFA that accepts each of the
simple or base languages, ∅, {λ}, and {a} for
each a ∈ Σ.
[Figure: NFAs accepting the base languages]
24Inductive step: For each of the operations (union,
concatenation, and Kleene star) show how
to construct an accepting NFA.
Closure under union
[Figure: NFA for the union, built from M1 and M2 with four λ-transitions]
25Closure under concatenation
26Closure under Kleene Star
[Figure: NFA for the Kleene star, built from M1 with four λ-transitions]
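The base and inductive steps can be sketched as a recursive builder that returns one start state and one final state for each subexpression and glues the pieces together with λ-transitions. This is an illustrative sketch, not the textbook's own code: the tuple encoding of expressions, the function names, and the use of `""` as the λ label are all assumptions made here.

```python
from itertools import count

_ids = count()   # source of fresh state names

def build(r, edges):
    """Return (start, final) of an NFA for r; edges collects (src, label, dst), "" = λ."""
    s, f = next(_ids), next(_ids)
    kind = r[0]
    if kind == "lambda":
        edges.append((s, "", f))
    elif kind == "sym":
        edges.append((s, r[1], f))
    elif kind == "union":                    # new start and final around both machines
        for part in r[1:]:
            ps, pf = build(part, edges)
            edges += [(s, "", ps), (pf, "", f)]
    elif kind == "concat":                   # final of the first feeds start of the second
        s1, f1 = build(r[1], edges)
        s2, f2 = build(r[2], edges)
        edges += [(s, "", s1), (f1, "", s2), (f2, "", f)]
    elif kind == "star":                     # four λ-transitions around the machine
        s1, f1 = build(r[1], edges)
        edges += [(s, "", s1), (f1, "", f), (s, "", f), (f, "", s)]
    elif kind != "empty":                    # "empty": no edges, final is unreachable
        raise ValueError(f"unknown node: {kind}")
    return s, f

def closure(states, edges):
    """All states reachable from the given states using λ-transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label == "" and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def nfa_accepts(r, w):
    edges = []
    start, final = build(r, edges)
    current = closure({start}, edges)
    for ch in w:
        moved = {dst for src, label, dst in edges if src in current and label == ch}
        current = closure(moved, edges)
    return final in current

# (a + b)* a
r = ("concat", ("star", ("union", ("sym", "a"), ("sym", "b"))), ("sym", "a"))
print(nfa_accepts(r, "ba"), nfa_accepts(r, "ab"))   # True False
```

Keeping exactly one start and one final state per subexpression is what makes each inductive case a fixed pattern of λ-transitions.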
27Closure properties of Regular Languages
- The union or concatenation of two regular
languages, and the Kleene star of a regular
language, is again a regular language, since we
can write a regular expression for it.
- Intersection and difference (complement) of two
regular languages will also produce a regular
language.
- The class of regular languages is said to be
closed under these operations. (More in Ch. 4.)
28Exercise
Use the construction of the first half of
Kleene's theorem to construct an NFA that accepts
the language L(abaa + bbaab).
29Exercise
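Use the construction of the first half of
Kleene's theorem to construct an NFA that accepts
the language L(abaa + bbaab).
[Figure: start state q0 with λ-transitions to an FA accepting abaa and to an FA accepting bbaab; each FA has a λ-transition to the final state qf]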
30Homework
Construct an NFA that accepts the language
corresponding to the regular expression ((b(a
b)a) a)
31Theorem 3.2
Kleene's theorem part 2: Let L be a regular
language. Then there exists a regular expression
r such that L = L(r). Any language accepted
by a finite automaton can be represented by a
regular expression. The proof strategy: For any
DFA, we show how to create an equivalent regular
expression. In other words, we describe an
algorithm for converting any DFA to a regular
expression.
32Expression diagram
- A labeled directed graph (similar to a finite
state diagram) in which transitions are labeled
by regular expressions
- Has a single start state with no incoming
transitions
- Has a single accepting state with no outgoing
transitions
- Example
33Algorithm for converting a DFA into an equivalent
regular expression
Initial step: Change every transition labeled
a,b to (a + b). Add a single start state with an
outgoing λ-transition to the current start state,
and add a single final state with incoming
λ-transitions from every previous final state.
Main step: Until the expression diagram has
only two states (initial state and final state),
repeat the following:
-- pick some non-start, non-final state
-- remove it from the diagram and re-label
transitions with regular expressions so that
the same language is accepted
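The initial step and main step can be sketched as follows. This is a minimal illustration under several assumptions: labels are kept as strings in Python `re` syntax (so union is written `|` rather than +, and λ is the empty string), the names "S" and "F" for the added start and final states are made up and must not clash with DFA state names, and the example DFA is hypothetical.

```python
import re
from itertools import product

def union(r, s):
    if r is None:
        return s
    if s is None:
        return r
    return f"(?:{r}|{s})"

def star(r):
    return "" if r is None or r == "" else f"(?:{r})*"

def dfa_to_regex(states, delta, start, accepting):
    # Initial step: merge parallel transitions, add new start "S" and final "F"
    label = {}
    for (q, a), p in delta.items():
        label[(q, p)] = union(label.get((q, p)), re.escape(a))
    label[("S", start)] = ""                 # λ-transition into the old start state
    for q in accepting:
        label[(q, "F")] = ""                 # λ-transitions out of old final states
    # Main step: remove the original states one at a time
    remaining = list(states)
    while remaining:
        k = remaining.pop()
        loop = star(label.get((k, k)))
        sources = [p for p in remaining + ["S"] if (p, k) in label]
        targets = [q for q in remaining + ["F"] if (k, q) in label]
        for p, q in product(sources, targets):
            through = label[(p, k)] + loop + label[(k, q)]   # path p -> k -> q
            label[(p, q)] = union(label.get((p, q)), through)
        label = {(p, q): r for (p, q), r in label.items() if k not in (p, q)}
    return label.get(("S", "F"))             # None would mean the empty language

# Hypothetical DFA over {a, b}: accepts strings that end in b
states = ["q0", "q1"]
delta = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "a"): "q0", ("q1", "b"): "q1"}
regex = dfa_to_regex(states, delta, "q0", {"q1"})
print(regex)
print(bool(re.fullmatch(regex, "abb")), bool(re.fullmatch(regex, "aba")))   # True False
```

The expression produced depends on the order in which states are removed and is usually far from the shortest one; simplifying it by hand is a separate step.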
34The key step is removing states and
re-labeling transitions with regular expressions.
Here are some examples of how to do this.
[Figure: examples of removing a state and re-labeling the remaining transitions]
35Exercise
[Figure: a DFA and its expression diagram after the initial step; the solution continues on the next slide]
36Exercise
[Figure: the remaining state-removal steps for the exercise]
37Exercise
Find a regular expression that corresponds to the
language accepted by the following DFA.
38Exercise
[Figure: state-removal steps for the DFA of the previous slide]
39Homework
Find a regular expression that corresponds to the
language accepted by the following DFA.
[Figure: DFA with states q0, q1, and q2 and transitions labeled 0 and 1]
40Applications of regular expressions
- Validation
- checking that an input string is in valid format
- example 1: checking format of email address on
WWW entry form
- example 2: UNIX regex command
- Search and selection
- looking for strings that match a certain pattern
- example: UNIX grep command
- Tokenization
- converting sequence of characters (a string) into
sequence of tokens (e.g., keywords, identifiers)
- used in lexical analysis phase of compiler
41Grammar
- A grammar G = (V, T, S, P) is a quadruple
consisting of the following:
- a set V of variables (non-terminal symbols)
- a set T of terminals (same as an alphabet, Σ)
- a start symbol S ∈ V
- a set P of production rules
- Example:
- S → aS | A
- A → bA | λ
42Derivation
- Strings are derived from a grammar
- Example of a derivation:
- S ⇒ aS ⇒ aaS ⇒ aaA ⇒ aabA ⇒ aab
- At each step, a nonterminal is replaced by the
sentential form on the right-hand side of a rule
(a sentential form can contain nonterminals
and/or terminals)
- Automata recognize languages; grammars generate
languages
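A derivation is just repeated string rewriting, which the short sketch below replays for the example grammar S → aS | A, A → bA | λ. The dictionary layout and the function name `derive` are assumptions for illustration; λ is written as the empty string.

```python
# Productions of the example grammar; "" stands for λ
grammar = {"S": ["aS", "A"], "A": ["bA", ""]}

def derive(start, steps):
    """Apply each (variable, right side) step to the leftmost occurrence of the variable."""
    form = start
    forms = [form]
    for variable, rhs in steps:
        if rhs not in grammar.get(variable, []):
            raise ValueError(f"{variable} -> {rhs!r} is not a production")
        if variable not in form:
            raise ValueError(f"{variable} does not occur in {form!r}")
        form = form.replace(variable, rhs, 1)
        forms.append(form)
    return forms

steps = [("S", "aS"), ("S", "aS"), ("S", "A"), ("A", "bA"), ("A", "")]
print(" ⇒ ".join(f or "λ" for f in derive("S", steps)))
# S ⇒ aS ⇒ aaS ⇒ aaA ⇒ aabA ⇒ aab
```

Each intermediate string in the printed chain is a sentential form; only the last one consists entirely of terminals.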
43Context-free grammar
- A grammar is said to be context-free if every
rule has a single non-terminal on the left-hand
side
- This means you can apply the rule in any context.
More complicated languages (such as English)
have context-dependent rules.
- A language generated from a context-free grammar
is called a context-free language.
44The English language
- In fact, it may not be possible to fully specify
the syntax of the English language.
- The language grows all the time, and new words
and constructions are constantly being added.
- In addition, language exists to be used to convey
meaning; sometimes a particular meaning is better
conveyed by not using standard syntax.
45Consider this poem by e e cummings
- l (a
- le
- af
- fa
- ll
- s)
- one
- l
- iness
46e e cummings's poetry
- A Cummings poem is spare and precise, employing
a few key words eccentrically placed on the page.
Some of these words were invented by Cummings,
often by combining two common words into a new
synthesis. He also revised grammatical and
linguistic rules to suit his own purposes, using
such words as "if," "am," and "because" as nouns,
for example, or assigning his own private
meanings to words.
- http://www.poetryfoundation.org/archive/poet.html?id=81323
47 Buffalo Bill's defunct who used to ride a
watersmooth-silver stallion and break
onetwothreefourfive pigeonsjustlikethat
Jesus he was a handsome man and
what I want to know is how do you like your
blueeyed boy Mister Death
e e cummings
48Regular grammar
- A grammar is said to be right-linear if all
productions are of the form A → xB or A → x, where A
and B are variables and x is a string of
terminals. (This means that if there is a
variable on the right side of the production
rule, then it is the rightmost element in the
rule.)
- A grammar is said to be left-linear if all
productions are of the form A → Bx or A → x.
- A regular grammar is either right-linear or
left-linear.
49Linear grammar
- A grammar can be linear without being right- or
left-linear.
- A linear grammar is a grammar in which at most
one variable can occur on the right side of any
production rule, without any restriction on the
position of the variable.
- Example:
- S → aS | A
- A → Ab | λ
50Another formalism for regular languages
- Every regular grammar generates a regular
language, and every regular language can be
generated by a regular grammar.
- A regular grammar is a simpler, special case of a
context-free grammar.
- The regular languages are a proper subset of the
context-free languages.
51Exercises
- Find a regular grammar that generates the
language on Σ = {a, b} consisting of all strings
with no more than three a's.
52Exercises
- Find a regular grammar that generates the
language on Σ = {a, b} consisting of all strings
with no more than three a's.
- S → bS | aA | λ
- A → bA | aB | λ
- B → bB | aC | λ
- C → bC | λ
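A grammar like this one can be checked mechanically, because in a right-linear grammar the single variable at the end of the sentential form acts like a state. The sketch below is an illustration only: the tuple encoding of productions and the function name `generates` are assumptions made here, and the check covers strings up to length 8 rather than proving the claim.

```python
from itertools import product

# Productions from the solution; each right side is (terminal, next variable) or "" for λ
grammar = {
    "S": [("b", "S"), ("a", "A"), ""],
    "A": [("b", "A"), ("a", "B"), ""],
    "B": [("b", "B"), ("a", "C"), ""],
    "C": [("b", "C"), ""],
}

def generates(grammar, start, w):
    """Track every variable the derivation could end in after producing each symbol of w."""
    current = {start}
    for symbol in w:
        current = {rhs[1] for v in current for rhs in grammar[v]
                   if rhs != "" and rhs[0] == symbol}
    # w is generated if some reachable variable can be erased with a λ-production
    return any("" in grammar[v] for v in current)

# Compare with the description "no more than three a's" on all short strings
for n in range(9):
    for letters in product("ab", repeat=n):
        w = "".join(letters)
        assert generates(grammar, "S", w) == (w.count("a") <= 3)
print(generates(grammar, "S", "abab"), generates(grammar, "S", "aaaa"))   # True False
```

The variables S, A, B, and C count how many a's have been produced so far, which is why C has no production that adds another a.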
53Exercises
- Find a regular grammar that generates the
language consisting of even-length strings over
{a, b}.
54Exercises
- Find a regular grammar that generates the
language consisting of even-length strings over
{a, b}.
- S → aaS | abS | baS | bbS | λ
55Non-regular languages
- There are non-regular languages that can be
generated by context-free grammars.
- The language {a^n b^n : n ≥ 0} is generated by the
grammar S → aSb | λ
- The language L = {w : n_a(w) = n_b(w)} is generated
by the grammar S → SS | λ | aSb | bSa
56Exercise
- What language is generated by the following
context-free (but not regular) grammar?
- S → aSa | bSb | a | b | λ
57Exercise
- What language is generated by the following
context-free grammar?
- S → aSa | bSb | a | b | λ
- This is the odd/even palindrome language
- L = {w(a + b + λ)w^R}
58Programming languages
- Programming languages are context-free, but not
regular.
- Programming languages have the following features
that require infinite stack memory:
- matching parentheses in algebraic expressions
- nested if .. then .. else statements, and nested
loops
- block structure
59Exercise
- Given a grammar, you should be able to say what
language it generates.
- Use set notation to define the language generated
by the following grammars:
- 1) S → aaSB | λ
- B → bB | b
- 2) S → aSbb | A
- A → cA | c
60Exercise
S → aaSB | λ
B → bB | b
It helps to list some of the strings that can be formed:
S ⇒ aaSB ⇒ aaB ⇒ aab
S ⇒ aaSB ⇒ aaB ⇒ aabB ⇒ aabb
S ⇒ aaSB ⇒ aaB ⇒ aabB ⇒ aabbB ⇒ aabbb
S ⇒ aaSB ⇒ aaB ⇒ aabB ⇒ aabbB ⇒ aabbbB ⇒ aabbbb
S ⇒ aaSB ⇒ aaaaSBB ⇒ aaaaBB ⇒ aaaaBb ⇒ aaaabb
S ⇒ aaSB ⇒ aaaaSBB ⇒ aaaaBB ⇒ aaaaBbB ⇒ aaaaBbb ⇒ aaaabbb
What is the pattern? L = {(aa)^n b^n b* : n ≥ 1} ∪ {λ}
(The case n = 0 contributes only λ, because a B, and
hence a b, appears only together with a pair of a's.)
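Listing strings by hand can be backed up by exhaustive enumeration. The sketch below generates every terminal string of length at most 8 from S → aaSB | λ, B → bB | b and compares the result with the pattern "n pairs of a's followed by at least n b's, with b's only when n ≥ 1". The function names and the pruning rule are assumptions specific to this grammar, not a general-purpose method.

```python
import re
from itertools import product

productions = {"S": ["aaSB", ""], "B": ["bB", "b"]}   # "" stands for λ

def generated_strings(max_len):
    """All terminal strings of length <= max_len, found by exhaustive leftmost derivation."""
    done, seen, queue = set(), {"S"}, ["S"]
    while queue:
        form = queue.pop()
        match = re.search(r"[SB]", form)          # leftmost variable, if any
        if match is None:
            done.add(form)
            continue
        i = match.start()
        for rhs in productions[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            # S can vanish, but every other symbol yields at least one terminal
            if len(new.replace("S", "")) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return done

def in_pattern(w):
    """True if w = (aa)^n b^m with m >= n >= 1, or w is the empty string."""
    m = re.fullmatch(r"((?:aa)*)(b*)", w)
    if m is None:
        return False
    n, bs = len(m.group(1)) // 2, len(m.group(2))
    return (n == 0 and bs == 0) or (n >= 1 and bs >= n)

expected = {"".join(p) for k in range(9) for p in product("ab", repeat=k)
            if in_pattern("".join(p))}
print(generated_strings(8) == expected)
```

Agreement up to a fixed length is evidence for the pattern rather than a proof; the proof is the observation that each use of S → aaSB adds one pair of a's and one B.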
61Exercise
- Given a language, you should be able to give a
grammar that generates it.
- For example, give a regular (right-linear)
grammar for the language consisting of all
strings over {a, b, c} that begin with a, contain
exactly two b's, and end with cc.
62Exercise
- Give a regular (right-linear) grammar for the
language consisting of all strings over {a, b, c}
that begin with a, contain exactly two b's, and
end with cc.
- S → aA
- A → bB | aA | cA
- B → bC | aB | cB
- C → aC | cC | cD
- D → c
63Derivation
Given the grammar S → aaSB | λ, B → bB | b,
the string aab can be derived in two
different ways:
S ⇒ aaSB ⇒ aaB ⇒ aab
S ⇒ aaSB ⇒ aaSb ⇒ aab
64Parse tree
Both derivations on the previous slide correspond
to the following parse tree.
The tree structure shows the rule that is applied
to each nonterminal, without showing the order of
rule applications. Each internal node of the
tree corresponds to a nonterminal, and the leaves
of the derivation tree represent the string of
terminals.
65Exercise
Let G be the grammar S → abSc | A, A → cAd | cd.
1) Give a derivation of ababccddcc.
2) Build the parse tree for the derivation of that string.
3) Use set notation to define L(G).
66Leftmost (rightmost) derivation
In a leftmost derivation, the leftmost
nonterminal is replaced at each step. In a
rightmost derivation, the rightmost nonterminal
is replaced at each step. Many derivations are
neither leftmost nor rightmost. If there is a
single parse tree, there is also a single
leftmost derivation.
67Ambiguity
A grammar is ambiguous if it can generate a
string with two possible parse trees. (A string
has more than one parse tree if and only if it
has more than one leftmost derivation.) English
can be ambiguous. Example: "Disabled fly to see
Carter."
68Example
Given the following grammar S ? S S S S
1 0 The string 1 1 0 has two different
parse trees.
69Equivalent grammars
Here is a non-ambiguous grammar that generates
the same language. S ? S A 1 0 A ? A B
1 0 B ? 1 0 Two grammars that generate
the same language are said to be equivalent. To
make parsing easier, we prefer grammars that are
not ambiguous.
70Dangling else
x = 3
if x > 2 then if x > 4 then x = 1 else x = 5
So, what is x?
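The two readings of the statement give different answers, which the sketch below makes explicit. It assumes that the missing operators in the slide are assignments (x = 3, x = 1, x = 5); the function names are made up. Python is used only because its indentation forces each reading to be written unambiguously.

```python
def else_with_inner_if(x):
    # Reading 1: the else belongs to the nearest (inner) if
    if x > 2:
        if x > 4:
            x = 1
        else:
            x = 5
    return x

def else_with_outer_if(x):
    # Reading 2: the else belongs to the outer if
    if x > 2:
        if x > 4:
            x = 1
    else:
        x = 5
    return x

print(else_with_inner_if(3), else_with_outer_if(3))   # 5 3
```

A grammar that allows both parse trees for the same statement leaves the compiler free to pick either result, which is why the ambiguity has to be removed from the grammar itself.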
71Ambiguous vs. unambiguous
Ambiguous grammar:
<statement> ::= IF <expression> THEN <statement>
              | IF <expression> THEN <statement> ELSE <statement>
              | <otherstatement>
Unambiguous grammar:
<statement> ::= <st1> | <st2>
<st1> ::= IF <expression> THEN <st1> ELSE <st1>
        | <otherstatement>
<st2> ::= IF <expression> THEN <statement>
        | IF <expression> THEN <st1> ELSE <st2>
72Exercise
Show that the following grammar is ambiguous.
S → AB | aaB
A → a | Aa
B → b
Construct an equivalent grammar that is unambiguous.
73Parsing
- In practical applications, it is usually not
enough to decide whether a string belongs to a
language. It is also important to know how to
derive the string from the grammar.
- Parsing uncovers the syntactical structure of a
string, which is represented by a parse tree.
(The syntactical structure is important for
assigning semantics to the string, for example
if it is a program.)
74Parsing
Let G be a context-free grammar for C. Let the
string w be a C program. One thing a compiler
does -- in particular, the part of the compiler
called the parser -- is determine whether w
is a syntactically correct C program. It also
constructs a parse tree for the program that is
used in code generation. There are many
sophisticated and efficient algorithms for
parsing. You may study them in more
advanced classes (for example, on compilers) or
come across them on your own. We will not discuss
them in this class.
75Theorem 3.3
- Every language generated by a right-linear
grammar is regular.
- Proof:
- Specify a procedure for automatically
constructing an NFA that mimics the derivations
of a right-linear grammar.
76Theorem 3.3
- Justification:
- The sentential forms produced by a right-linear
grammar have exactly one variable, which occurs
as the rightmost symbol.
- Assume that our grammar has a production rule
D → dE and that, during the derivation of a string,
there is a step wcD ⇒ wcdE.
- We can construct an NFA which has states D and E,
and an arc labeled d from D to E.
- NFAs can be converted to DFAs.
- All languages accepted by DFAs are regular.
77Theorem 3.3
- Construction:
- For each variable Vi in the grammar there will be
a state in the automaton labeled Vi.
- The initial state of the automaton will be
labeled V0 and will correspond to the S variable
in the grammar.
- For each production rule Vi → a1a2…amVj the
automaton will have transitions such that
δ*(Vi, a1a2…am) = Vj
- For each production rule Vi → a1a2…am the
automaton will have transitions such that
δ*(Vi, a1a2…am) = Vfinal
78Theorem 3.3
Construct an NFA that accepts the language
generated by the grammar
S → aA          (convert to V0 → aV1)
A → abS | b     (convert to V1 → abV0 | b)
[Figure: NFA with states V0, V1, and Vf and transitions labeled a and b]
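The construction of Theorem 3.3 can be sketched for this example grammar. The code below is a minimal illustration under stated assumptions: productions are given as triples (variable, terminal string, next variable or None), every production contains at least one terminal (rules such as A → B or A → λ would need λ-transitions and are not handled), and the intermediate state names `_1`, `_2`, … are made up.

```python
def grammar_to_nfa(productions, start):
    """Build NFA edges from right-linear productions Vi -> a1...am Vj or Vi -> a1...am."""
    edges = []                                # (source state, symbol, target state)
    fresh = 0
    for var, terminals, nxt in productions:
        target = nxt if nxt is not None else "Vf"
        state = var
        for i, symbol in enumerate(terminals):
            if i == len(terminals) - 1:
                step_to = target              # last symbol reaches Vj (or Vf)
            else:
                fresh += 1
                step_to = f"_{fresh}"         # intermediate state inside a multi-symbol rule
            edges.append((state, symbol, step_to))
            state = step_to
    return edges, start, {"Vf"}

def nfa_accepts(nfa, w):
    edges, start, finals = nfa
    current = {start}
    for symbol in w:
        current = {dst for src, sym, dst in edges if src in current and sym == symbol}
    return bool(current & finals)

# V0 -> aV1,  V1 -> abV0 | b
productions = [("V0", "a", "V1"), ("V1", "ab", "V0"), ("V1", "b", None)]
nfa = grammar_to_nfa(productions, "V0")
print(nfa_accepts(nfa, "ab"), nfa_accepts(nfa, "aabab"), nfa_accepts(nfa, "aab"))
# True True False
```

The accepted strings are exactly those of the form (aab)^n ab, which matches the derivations S ⇒ aA ⇒ ab and S ⇒ aA ⇒ aabS ⇒ … of the grammar.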
79Theorem 3.4
- Every regular language can be generated by a
right-linear grammar.
- Proof:
- Generate a DFA for the language.
- Specify a procedure for automatically
constructing a right-linear grammar from the DFA.
80Theorem 3.4
- Given a regular language L, let M = (Q, Σ, δ, q0,
F) be a DFA that accepts L. Let Q = {q0, q1, …,
qn} and Σ = {a1, a2, …, am}.
- Construct the grammar G = (V, T, S, P) with
- V = {q0, q1, …, qn}
- T = {a1, a2, …, am}
- S = q0
- P = ∅ initially.
- P, the set of production rules, is constructed as
follows:
81Theorem 3.4
- For each transition
δ(qi, aj) = qk
in the transition table of M, add to P the
production qi → aj qk
- If qk is in F, then add to P the production
qk → λ
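The two rules translate directly into code. The sketch below is an illustration only: the dictionary encoding of δ, the function name, and the example DFA (strings over {a, b} that end in b) are assumptions, and λ is written as the empty string.

```python
def dfa_to_grammar(delta, accepting):
    """Return productions as (left side, right side) pairs; "" stands for λ."""
    productions = []
    for (qi, aj), qk in sorted(delta.items()):
        productions.append((qi, aj + qk))     # delta(qi, aj) = qk  gives  qi -> aj qk
    for q in sorted(accepting):
        productions.append((q, ""))           # q in F  gives  q -> λ
    return productions

# Hypothetical DFA: q1 is reached exactly when the input ends in b
delta = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "a"): "q0", ("q1", "b"): "q1"}
for left, right in dfa_to_grammar(delta, {"q1"}):
    print(left, "→", right or "λ")
```

The λ-production is added once for every accepting state, which also covers an accepting start state that no transition enters.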
82Example
- Construct a right-linear grammar for the language
L = L(aab*a)
- First, build an NFA for L
[Figure: NFA with transitions q0 to q1 on a, q1 to q2 on a, q2 to q2 on b, and q2 to qf on a]
83Example, cont.
[Figure: the same NFA with states q0, q1, q2, and qf]
P = ∅ initially. Add to P a rule for each
transition in the NFA:
q0 → aq1
q1 → aq2
q2 → bq2
q2 → aqf
Since qf is in F, add to P the production
qf → λ
84Example
[Figure: the same NFA with states q0, q1, q2, and qf]
Now P =          You can convert to normal grammar notation:
q0 → aq1         S → aA
q1 → aq2         A → aB
q2 → bq2         B → bB
q2 → aqf         B → aC
qf → λ           C → λ
85Theorem 3.5
A language L is regular if and only if there
exists a left-linear grammar G such that L =
L(G). Proof: The strategy here is a little
tricky. We describe an algorithm to construct a
right-linear grammar that generates the reverse
of all the strings generated by the left-linear
grammar.
86Theorem 3.5
Given any left-linear grammar G we can construct
from it a right-linear grammar G' by replacing
productions of the form A → Bv with A → v^R B
and A → v with A → v^R. Since L(G') is
generated by a right-linear grammar, it is
regular. It can be demonstrated that L(G') =
(L(G))^R. It can be proven that the reverse of
any regular language is also regular (see
exercise 12, section 2.3 in the Linz
text). Hence, L(G) is regular.
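The replacement of productions is a purely mechanical rewrite, sketched below. The triple encoding and the example grammar S → Sab | a are assumptions made for illustration and do not come from the slides.

```python
def reverse_left_linear(productions):
    """Turn left-linear rules into right-linear rules for the reversed language.

    Input triples (A, B, v) mean A -> Bv, or A -> v when B is None.
    Output triples (A, u, B) mean A -> uB, or A -> u when B is None, with u = v reversed.
    """
    return [(a, v[::-1], b) for a, b, v in productions]

# Hypothetical left-linear grammar: S -> Sab | a, which generates a, aab, aabab, ...
left_linear = [("S", "S", "ab"), ("S", None, "a")]
print(reverse_left_linear(left_linear))
# [('S', 'ba', 'S'), ('S', 'a', None)]
```

The resulting right-linear grammar S → baS | a generates a, baa, babaa, and so on, which are the reversals of the strings generated by the original grammar.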
87Theorem 3.6
A language L is regular if and only if there
exists a regular grammar G such that L = L(G).
Proof: Combine our definition of regular
grammars, which includes the statement "A
regular grammar is either right-linear or
left-linear," with theorems 3.4 and 3.5.
88Three ways of specifying regular languages
[Diagram: regular expressions describe regular languages; DFAs and NFAs accept regular languages; regular grammars generate regular languages]