Title: Regular Expressions and Languages
1 Regular Expressions and Languages
A regular expression is a notation for representing a language, i.e., a set of strings, where the set is either finite or contains strings that are generated using simple recursive rules. The languages represented by regular expressions are called regular languages. Some examples of regular languages are given next, before we see the precise definition.
Example. The set of integer constants used in a typical programming language: an integer constant is a sequence of one or more decimal digits, up to some system-dependent maximum length.
Example. The set of variable names: a variable name starts with a letter, followed by zero or more letters, digits, or certain punctuation symbols (such as the underscore _ or the dash -), up to a certain maximum length.
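As a rough illustration, the two examples above can be written as patterns for Python's `re` module. The names `INTEGER_CONSTANT`, `VARIABLE_NAME`, and the two helper functions are assumptions made for this sketch, and the system-dependent maximum length is deliberately not enforced.

```python
import re

# One or more decimal digits (no maximum length is enforced in this sketch).
INTEGER_CONSTANT = re.compile(r"[0-9]+")

# A letter, then zero or more letters, digits, underscores, or dashes.
VARIABLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")

def is_integer_constant(s: str) -> bool:
    """True if the whole string s is a sequence of decimal digits."""
    return INTEGER_CONSTANT.fullmatch(s) is not None

def is_variable_name(s: str) -> bool:
    """True if the whole string s has the shape of a variable name."""
    return VARIABLE_NAME.fullmatch(s) is not None
```

`fullmatch` is used so that the whole string, not just a prefix, must belong to the language.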
2 Regular languages
More precisely, let $A$ be an alphabet (i.e., a finite, non-empty set of symbols), and let $\varepsilon$ denote the empty string. The collection of regular languages over $A$ is defined by the following (recursive) rules:
(i. Base step) The empty set $\emptyset$, the set $\{\varepsilon\}$, and $\{a\}$ for every $a \in A$ are regular languages.
(ii. Recursive step) If $X$ and $Y$ are regular languages, then the sets $X \cup Y$ (union), $XY$ (concatenation), and $X^*$ (Kleene star) are regular languages.
(iii. Closure) No set is a regular language unless it results from applying the base step followed by zero or more recursive steps.
Note that the above rules define not just one regular language; they define an infinite collection of languages (over an alphabet) and call each of them a regular language over that alphabet.
3 Example. Let $A = \{a, b\}$ be an alphabet. The following are some examples of regular languages over $A$:
(a) $\emptyset$, $\{\varepsilon\}$, $\{a\}$, $\{b\}$ (Rule (i))
(b) $\{ab, aa, aba\} = \{ab\} \cup \{aa\} \cup \{aba\}$ (Rule (i), and Rule (ii) for concatenations and unions)
(c) $\{a^n \mid n \ge 0\} = \{a\}^*$ (Rule (i), and Rule (ii) for the Kleene star operation)
(d) $\{a^n b^2 \mid n \ge 0\} = \{a\}^*\{bb\}$ (Rule (i), and Rule (ii) for star and concatenation)
(e) $\{a^n b^2 \mid n \ge 0\} \cup \{b^m a^2 \mid m \ge 0\}$ (the result of (d), applied once more with $a$ and $b$ exchanged, and Rule (ii) for union).
Note that any finite set of strings is a regular language, as demonstrated by (a) and (b) above. Also note the use of parentheses when necessary, e.g., $(\{a\} \cup \{b\})^*$.
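The three operations of Rule (ii) can be tried out on small sets of strings. The sketch below is only an illustration: the function names and the `max_len` cut-off are assumptions introduced here, the cut-off being needed because $X^*$ is usually infinite.

```python
def union(X, Y):
    """X ∪ Y for two sets of strings."""
    return X | Y

def concat(X, Y):
    """XY: every string of X followed by every string of Y."""
    return {x + y for x in X for y in Y}

def star(X, max_len):
    """The strings of X* whose length is at most max_len."""
    result = {""}                      # X^0 contains only the empty string
    frontier = {""}
    while frontier:
        longer = {w for w in concat(frontier, X) if len(w) <= max_len}
        frontier = longer - result     # keep only strings not seen before
        result |= frontier
    return result

FINITE = union({"ab"}, union({"aa"}, {"aba"}))   # example (b)
A_STAR_BB = concat(star({"a"}, 3), {"bb"})        # a finite part of example (d)

print(sorted(FINITE))                  # ['aa', 'ab', 'aba']
print(sorted(A_STAR_BB, key=len))      # ['bb', 'abb', 'aabb', 'aaabb']
```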
4 Regular expressions
To simplify the notation for regular languages, and by analogy with the arithmetic expressions used in algebra, we can replace the union symbol $\cup$ with the plus sign $+$, drop the braces $\{$ and $\}$ for sets, and use parentheses ( and ) for grouping when necessary, as in the following example.
Example. The following are some regular expressions over the alphabet $A = \{a, b\}$, and the corresponding regular languages:
Expression          Language
$a + ab$            $\{a, ab\}$
$(a + b)bb$         $\{a, b\}\{bb\}$
$ba(a + b)ab$       $\{ba\}\{a, b\}\{ab\}$
$\varepsilon$       $\{\varepsilon\}$
5 More precisely, we can define regular expressions by the following rules, which give both the notation and the set each expression represents. Let $A$ be an alphabet.
(Basis) The constants $\varepsilon$, $\emptyset$, and $a$ are regular expressions, for each $a$ that belongs to the alphabet $A$. The languages that they represent are, respectively, $L(\varepsilon) = \{\varepsilon\}$, $L(\emptyset) = \emptyset$, and $L(a) = \{a\}$.
(Recursion) If $E$ and $F$ are regular expressions, then $E + F$, $EF$, and $E^*$ are regular expressions. The languages they represent are, respectively, $L(E + F) = L(E) \cup L(F)$, $L(EF) = L(E)L(F)$ (concatenation), and $L(E^*) = (L(E))^*$.
(Closure) Nothing is a regular expression unless it is constructed by applying the basis step followed by zero or more recursive steps.
Note that parentheses are used for grouping. Thus, $(01)^*$ denotes the set $\{\varepsilon, 01, 0101, \ldots\}$, but $01^*$ denotes the set $\{0, 01, 011, \ldots\}$. In general, $L((E)) = L(E)$.
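The recursive definition of $L(E)$ translates almost line by line into a recursive function. In the sketch below, the encoding of expressions as nested tuples, the function name `lang`, and the length bound `max_len` are all assumptions made for illustration; only the strings of $L(E)$ up to that length are computed.

```python
def lang(e, max_len):
    """Strings of L(e) of length <= max_len; e is a nested tuple (assumed encoding)."""
    kind = e[0]
    if kind == "empty":                     # L(∅) = ∅
        return set()
    if kind == "eps":                       # L(ε) = {ε}
        return {""}
    if kind == "sym":                       # L(a) = {a}
        return {e[1]} if max_len >= 1 else set()
    if kind == "+":                         # L(E + F) = L(E) ∪ L(F)
        return lang(e[1], max_len) | lang(e[2], max_len)
    if kind == ".":                         # L(EF) = L(E)L(F)
        left, right = lang(e[1], max_len), lang(e[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "*":                         # L(E*) = (L(E))*
        base, result, frontier = lang(e[1], max_len), {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node: {kind}")

ZERO, ONE = ("sym", "0"), ("sym", "1")
print(sorted(lang(("*", (".", ZERO, ONE)), 4)))   # (01)*  -> ['', '01', '0101']
print(sorted(lang((".", ZERO, ("*", ONE)), 3)))   # 01*    -> ['0', '01', '011']
```

The two printed sets show the effect of the parentheses: the star applies to the whole string 01 in the first expression and only to the symbol 1 in the second.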
6 Some examples of regular expressions are as follows:
(a) The language of integer constants as in C: $(0+1+2+3+4+5+6+7+8+9)(0+1+2+3+4+5+6+7+8+9)^*$, assuming we place no limit on the maximum length.
(b) The set of strings over $\{a, b\}$ that contain the substring $aa$: $(a + b)^* aa (a + b)^*$
(c) The set of strings over $\{a, b\}$ that contain exactly two occurrences of the symbol $a$: $b^* a b^* a b^*$
(d) The set of strings over $\{a, b\}$ that contain up to 2 symbols: $\varepsilon + a + b + aa + ab + ba + bb$
(e) The set of strings over $\{a, b\}$ that begin with $a$ and have an even number of $b$'s: $a(a^* b a^* b)^* a^*$
(f) The set of strings over $\{a, b\}$ that do not contain the substring $aa$: $b^*(abb^*)^*(\varepsilon + a)$
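One way to gain confidence in expressions such as (b) through (f) is to compare each with a direct description of its language on all short strings. The sketch below does this with Python's `re` module, which writes union as `|` where the text writes $+$; the names `EXAMPLES`, `all_strings`, and `mismatches`, and the length limit of 8, are assumptions made for this illustration. Agreement on short strings is evidence, not a proof.

```python
import re
from itertools import product

# Each entry pairs a Python pattern with a predicate describing the same language.
EXAMPLES = {
    "contains aa": (r"(a|b)*aa(a|b)*", lambda w: "aa" in w),
    "exactly two a's": (r"b*ab*ab*", lambda w: w.count("a") == 2),
    "up to 2 symbols": (r"|a|b|aa|ab|ba|bb", lambda w: len(w) <= 2),
    "begins with a, even number of b's": (
        r"a(a*ba*b)*a*",
        lambda w: w.startswith("a") and w.count("b") % 2 == 0,
    ),
    "no substring aa": (r"b*(abb*)*(|a)", lambda w: "aa" not in w),
}

def all_strings(alphabet, max_len):
    """Yield every string over the alphabet of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def mismatches(pattern, predicate, max_len=8):
    """Strings on which the pattern and the predicate disagree."""
    return [w for w in all_strings("ab", max_len)
            if (re.fullmatch(pattern, w) is not None) != predicate(w)]

for name, (pattern, predicate) in EXAMPLES.items():
    print(name, "->", "agrees" if not mismatches(pattern, predicate) else "differs")
```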
7 Laws and rules for manipulating regular expressions
Let $L$, $M$, and $N$ denote regular expressions.
(1) (Associative law) $L(MN) = (LM)N$. (This is true for any sets $L$, $M$, and $N$ of strings.)
(2) (Distributive laws of concatenation over union) $L(M + N) = LM + LN$ and $(M + N)L = ML + NL$.
(3) (Idempotent law) $L + L = L$. (This is the idempotent law for set union.)
(4) $(L^*)^* = L^*$. (Both sides contain all possible strings that are made up of strings of $L$.)
(5) $\emptyset^* = \varepsilon^* = \varepsilon$. (The Kleene star always contains the empty string $\varepsilon$.)
(6) $(L^*)L = L(L^*)$. (Both sides equal $L \cup L^2 \cup L^3 \cup \cdots$, which is denoted $L^+$.)
(7) $(L^* M^*)^* = (L + M)^*$.
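The laws can be spot-checked on concrete languages by comparing both sides on all strings up to a fixed length. The following sketch assumes languages are given as finite Python sets of strings; the helper names and the bound `n` are assumptions for illustration. A bounded check of this kind can expose a wrong law, but it does not prove a correct one.

```python
def concat(X, Y, n):
    """Strings of XY of length at most n."""
    return {x + y for x in X for y in Y if len(x + y) <= n}

def star(X, n):
    """Strings of X* of length at most n."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = concat(frontier, X, n) - result
        result |= frontier
    return result

def check_laws(L, M, N, n):
    """Compare both sides of each law, restricted to strings of length <= n."""
    return {
        "associative": concat(L, concat(M, N, n), n) == concat(concat(L, M, n), N, n),
        "left distributive": concat(L, M | N, n) == concat(L, M, n) | concat(L, N, n),
        "right distributive": concat(M | N, L, n) == concat(M, L, n) | concat(N, L, n),
        "idempotent": L | L == L,
        "(L*)* = L*": star(star(L, n), n) == star(L, n),
        "(L*)L = L(L*)": concat(star(L, n), L, n) == concat(L, star(L, n), n),
        "(L*M*)* = (L+M)*": star(concat(star(L, n), star(M, n), n), n) == star(L | M, n),
    }

for law, holds in check_laws({"a", "ab"}, {"b"}, {"", "ba"}, 6).items():
    print(law, holds)
```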
8 Finite Automata and Regular Languages
- Finite automata (DFA and NFA) and regular languages are equivalent in the sense that every regular language can be recognized (accepted) by a finite automaton and, conversely, for every finite automaton there is a regular language which is the language accepted by that automaton.
- Since DFAs and NFAs are equivalent, we can prove their equivalence to regular languages in two parts:
  - Let $M$ be a DFA, and let $L = L(M)$ be the language accepted by $M$. Prove there is a regular expression $R$ such that $L(R) = L$.
  - Let $L = L(R)$ be the regular language of an expression $R$. Prove there is an NFA $M$ such that $L(M) = L$, where $L(M)$ denotes the language accepted by $M$.
9 We will prove the second part first since it is easier. Specifically, we will use the following recursive rules to construct an NFA for each regular expression (for a fixed alphabet $A$). Since regular expressions are generated by recursive rules, it suffices to show that, for the base step and for each case of the recursive step, we can construct an NFA that accepts exactly the language of the expression being constructed.
(Base step) The following NFAs correspond to the regular expressions $\varepsilon$, $\emptyset$, and $a$, respectively, for each $a$ in $A$. It is easy to verify that these NFAs are correct. Note that in each case we construct an NFA with exactly one start state, one final (accepting) state, no arcs into the start state, and no arcs out of the final state; we call this the desirable property (for lack of a better term).
[Figure: NFAs for the base cases $\varepsilon$, $\emptyset$, and $a$.]
10 (Recursive step) Suppose $E$ and $F$ are regular expressions whose equivalent NFAs $M$ and $N$ have already been constructed, and both have the desirable property. The following diagrams show how to construct the NFAs corresponding to, respectively, the expressions $E + F$, $EF$, and $E^*$. Note that in each case we add one or more $\varepsilon$-transitions; the resulting NFAs also have the desirable property.
[Figures: NFA for $EF$; NFA for $E + F$; NFA for $E^*$. Each is built from $M$ (and $N$) using $\varepsilon$-transitions.]
11 Example. An NFA for the regular expression $(0+1)^* 1 (0+1)$. We follow (roughly) steps (a) through (e). Note that the NFA of every step satisfies the desirable property.
[Figures: (a) NFA for $0$; (b) NFA for $1$; (c) NFA for $0+1$; (d) NFA for $(0+1)^*$; (e) NFA for $(0+1)^* 1 (0+1)$.]
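The recursive construction can be sketched in code. The version below uses the usual $\varepsilon$-transition constructions, which may differ in detail from the diagrams; the class `NFA`, its methods, and the nested-tuple encoding of expressions are assumptions made for this illustration. Every fragment it builds has one start state with no incoming arcs and one final state with no outgoing arcs, i.e., the desirable property.

```python
from itertools import count

class NFA:
    """A pool of states and arcs in which fragments for sub-expressions are built."""

    def __init__(self):
        self.edges = {}          # (state, symbol) -> set of states; symbol None means ε
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def add(self, src, symbol, dst):
        self.edges.setdefault((src, symbol), set()).add(dst)

    def build(self, e):
        """Return (start, final) of a fragment for expression e (a nested tuple)."""
        start, final = self.new_state(), self.new_state()
        kind = e[0]
        if kind == "eps":
            self.add(start, None, final)
        elif kind == "sym":
            self.add(start, e[1], final)
        elif kind == "+":                    # E + F: branch into either fragment
            for sub in e[1:]:
                s, f = self.build(sub)
                self.add(start, None, s)
                self.add(f, None, final)
        elif kind == ".":                    # EF: run E's fragment, then F's
            s1, f1 = self.build(e[1])
            s2, f2 = self.build(e[2])
            self.add(start, None, s1)
            self.add(f1, None, s2)
            self.add(f2, None, final)
        elif kind == "*":                    # E*: skip, or repeat E's fragment
            s, f = self.build(e[1])
            self.add(start, None, s)
            self.add(start, None, final)
            self.add(f, None, s)
            self.add(f, None, final)
        elif kind != "empty":                # "empty" (∅): no path from start to final
            raise ValueError(f"unknown node: {kind}")
        return start, final

    def closure(self, states):
        """All states reachable from the given states by ε-transitions alone."""
        stack, seen = list(states), set(states)
        while stack:
            for t in self.edges.get((stack.pop(), None), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    def accepts(self, start, final, w):
        current = self.closure({start})
        for ch in w:
            moved = set()
            for s in current:
                moved |= self.edges.get((s, ch), set())
            current = self.closure(moved)
        return final in current

ZERO, ONE = ("sym", "0"), ("sym", "1")
ANY = ("+", ZERO, ONE)
nfa = NFA()
start, final = nfa.build((".", ("*", ANY), (".", ONE, ANY)))   # (0+1)*1(0+1)
print(nfa.accepts(start, final, "0110"), nfa.accepts(start, final, "100"))  # True False
```

The expression accepts exactly the strings whose second-to-last symbol is 1, which is why 0110 is accepted and 100 is not.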
12 Construction of a regular expression equivalent to a DFA
Let $A$ be a DFA with states labeled $1, 2, \ldots, n$. We assume state 1 is the (only) start state. The idea of the construction (or proof) is to demonstrate that for any two states $i$ and $j$, we can construct a regular expression $R_{ij}$ that represents exactly those strings that label paths from node $i$ to node $j$; thus, string $w$ belongs to $R_{ij}$ if $(i, w) \vdash^* (j, \varepsilon)$. We prove this assertion by recursive construction. First, define $R^{(k)}_{ij}$ as the set of strings that label paths from node $i$ to node $j$ passing through only states $\le k$ (i.e., each intermediate state of the path must have a label $\le k$).
[Figure: a path from state $i$ to state $j$, labeled $R^{(k)}_{ij}$, whose intermediate states all have labels $\le k$.]
13 We now show how to use regular expressions to represent such sets $R^{(k)}_{ij}$, by induction on $k$.
(Basis) When $k = 0$. That is, consider the labels of paths that connect state $i$ to state $j$ without intermediate states, i.e., a direct arc (edge) from $i$ to $j$, if one exists. There are two cases:
(i) When $i \ne j$: $R^{(0)}_{ij} = \emptyset$ if there is no arc (transition) from state $i$ to state $j$; otherwise $R^{(0)}_{ij} = a_1 + a_2 + \cdots + a_m$ if there are transitions from state $i$ to state $j$ labeled $a_1, a_2, \ldots, a_m$.
(ii) When $i = j$: $R^{(0)}_{ij} = \varepsilon$ if there is no arc (transition) from state $i$ to itself; otherwise $R^{(0)}_{ij} = \varepsilon + a_1 + a_2 + \cdots + a_m$ if there are transitions from state $i$ to itself labeled $a_1, a_2, \ldots, a_m$.
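[Figure: arcs labeled $a_1, \ldots, a_m$ from state $i$ to state $j$ (case $i \ne j$) and from state $i$ to itself (case $i = j$).]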
14 (Recursion) When $k \ge 1$. We can express $R^{(k)}_{ij}$ in terms of expressions with a smaller superscript. Specifically,
$$R^{(k)}_{ij} = R^{(k-1)}_{ij} + R^{(k-1)}_{ik} \left(R^{(k-1)}_{kk}\right)^* R^{(k-1)}_{kj}$$
This is true because each path in the set $R^{(k)}_{ij}$ either avoids state $k$ or passes through it one or more times. In the former case, those paths constitute the expression $R^{(k-1)}_{ij}$; in the latter case, the subpaths that lead to state $k$ for the first time are represented by $R^{(k-1)}_{ik}$, followed by paths from state $k$ to itself (zero or more times), represented by $(R^{(k-1)}_{kk})^*$, and finally continued with paths from state $k$ to state $j$, represented by $R^{(k-1)}_{kj}$. The following diagram illustrates the idea:
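[Figure: a path from $i$ to $j$ through $k$: a segment $R^{(k-1)}_{ik}$ into state $k$, loops $(R^{(k-1)}_{kk})^*$ from $k$ back to $k$, then a segment $R^{(k-1)}_{kj}$ to $j$.]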
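The basis and the recursion can be turned into a short program. The sketch below is one possible rendering, not a definitive implementation: it emits patterns in Python `re` syntax (`|` for $+$, the empty pattern for $\varepsilon$, and `None` for $\emptyset$), and the function names and the format of `delta` are assumptions made here. The final step takes the union of $R^{(n)}_{1j}$ over the accepting states $j$. No simplification beyond $\emptyset L = \emptyset$ and $\emptyset^* = \varepsilon^* = \varepsilon$ is attempted, so the output is correct but much longer than a hand-simplified expression.

```python
import re

def union(r, s):
    """r + s, where None stands for the empty set."""
    if r is None:
        return s
    if s is None:
        return r
    return r if r == s else f"(?:{r}|{s})"

def concat(r, s):
    """rs; anything concatenated with the empty set is the empty set."""
    return None if r is None or s is None else r + s

def star(r):
    """r*; the star of the empty set or of ε is ε (the empty pattern)."""
    return "" if not r else f"(?:{r})*"

def dfa_to_regex(n, delta, accepting):
    """States are 1..n with start state 1; delta maps (state, symbol) -> state."""
    # Basis (k = 0): ε on the diagonal, plus the labels of direct arcs.
    R = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        R[i][i] = ""
    for (i, a), j in sorted(delta.items()):
        R[i][j] = union(R[i][j], re.escape(a))
    # Recursion: additionally allow state k as an intermediate state, k = 1..n.
    for k in range(1, n + 1):
        prev = R
        R = [[None] * (n + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                via_k = concat(concat(prev[i][k], star(prev[k][k])), prev[k][j])
                R[i][j] = union(prev[i][j], via_k)
    # The language of the DFA: union of R^(n)_1j over accepting states j.
    result = None
    for j in sorted(accepting):
        result = union(result, R[1][j])
    return result          # None means the DFA accepts no string at all
```

For a DFA with two states in which state 1 loops on 1 and moves to state 2 on 0, and state 2 loops on both symbols, `dfa_to_regex(2, {(1, "1"): 1, (1, "0"): 2, (2, "0"): 2, (2, "1"): 2}, {2})` returns a pattern that can be passed to `re.fullmatch`.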
15 Notice that these $R^{(k)}_{ij}$ are all regular expressions, because the basis ($k = 0$) starts with regular expressions and each application of the recursive rule uses only regular expression operations (i.e., $+$, $^*$, and concatenation). To complete the proof that a DFA $M$ can be converted to an equivalent regular expression, we construct $R^{(n)}_{1j}$ for each accepting state $j$. (Recall that the states are labeled 1 through $n$, and state 1 is the start state.) Then $L(M)$ is the sum (i.e., the union) of these $R^{(n)}_{1j}$, where state $j$ ranges over all accepting states of $M$.
Example (p. 94 of the Text). Convert the following DFA to an equivalent regular expression.
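[Figure: DFA with start state 1 and accepting state 2. State 1 has a loop labeled 1 and an arc labeled 0 to state 2; state 2 has a loop labeled 0, 1.]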
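16 We first apply the basis step ($k = 0$) and construct the following expressions: $R^{(0)}_{11} = \varepsilon + 1$, $R^{(0)}_{12} = 0$, $R^{(0)}_{21} = \emptyset$, and $R^{(0)}_{22} = \varepsilon + 0 + 1$, each corresponding to the single-arc transitions from one state to another state. Using the recursive rule, we can now construct the $R^{(k)}_{ij}$ with $k = 1$:
                 Recursive rule                                                                Simplified
$R^{(1)}_{11}$   $\varepsilon + 1 + (\varepsilon + 1)(\varepsilon + 1)^*(\varepsilon + 1)$     $1^*$
$R^{(1)}_{12}$   $0 + (\varepsilon + 1)(\varepsilon + 1)^* 0$                                  $1^* 0$
$R^{(1)}_{21}$   $\emptyset + \emptyset(\varepsilon + 1)^*(\varepsilon + 1)$                   $\emptyset$
$R^{(1)}_{22}$   $\varepsilon + 0 + 1 + \emptyset(\varepsilon + 1)^* 0$                        $\varepsilon + 0 + 1$
Note that laws such as $\emptyset L = \emptyset$ and $(\varepsilon + L)^* = L^*$ are used during simplification. Since there is only one accepting state (state 2), we only need to construct $R^{(k)}_{12}$ for $k = 2$. Thus, the equivalent regular expression is
$$R^{(2)}_{12} = R^{(1)}_{12} + R^{(1)}_{12}\left(R^{(1)}_{22}\right)^* R^{(1)}_{22} = 1^* 0 + 1^* 0(\varepsilon + 0 + 1)^*(\varepsilon + 0 + 1) = 1^* 0 (0 + 1)^*$$
after simplification.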
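The simplified result can be cross-checked against the DFA itself. The sketch below encodes the example DFA as read off from the $R^{(0)}_{ij}$ values (state 1 loops on 1 and goes to state 2 on 0; state 2 loops on 0 and 1), simulates it directly, and compares it with $1^* 0 (0+1)^*$ written in Python `re` syntax. The names `DELTA`, `dfa_accepts`, and `disagreements` are assumptions made for this illustration.

```python
import re
from itertools import product

# Transition table of the example DFA: (state, symbol) -> next state.
DELTA = {(1, "1"): 1, (1, "0"): 2, (2, "0"): 2, (2, "1"): 2}
ACCEPTING = {2}

def dfa_accepts(w, start=1):
    """Run the DFA on w and report whether it ends in an accepting state."""
    state = start
    for ch in w:
        state = DELTA[(state, ch)]
    return state in ACCEPTING

SIMPLIFIED = re.compile(r"1*0(0|1)*")      # 1*0(0+1)* in Python syntax

def disagreements(max_len):
    """Binary strings up to max_len on which the DFA and the expression differ."""
    found = []
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            w = "".join(bits)
            if dfa_accepts(w) != (SIMPLIFIED.fullmatch(w) is not None):
                found.append(w)
    return found

print(disagreements(10))   # []
```

Both descriptions amount to "the string contains at least one 0", so the list of disagreements is empty.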