Title: LIS651 lecture 4 regular expressions
1 LIS651 lecture 4: regular expressions
- Thomas Krichel
- 2006-04-22
2 remember DOS?
- DOS had the * character as a wildcard. If you said
- DIR *.EXE
- It would list all the files ending with .EXE
- Thus the * wildcard would mean all characters except the dot
- Similarly, you could say
- DEL *.*
- to delete all your files
3 regular expression
- Is nothing but a fancy wildcard.
- There are various flavours of regular expressions.
- We will be using POSIX regular expressions here. They themselves come in two flavours
- old-style
- extended
- We study extended here, aka POSIX 1003.2.
- Perl regular expressions are more powerful and more widely used.
- POSIX regular expressions are accepted by both PHP and mySQL. Details are to follow.
4 pattern
- The regular expression describes a pattern of characters.
- Patterns are common in other circumstances.
- Query Krichel Thomas in Google
- Query "Thomas Krichel" in Google
- Dates are of the form yyyy-mm-dd.
5 pattern matching
- We say that a regular expression matches the string if an instance of the pattern described by the regular expression can be found in the string.
- Saying that it matches in the string may make this a little clearer.
- Sometimes people also say that the string matches the regular expression.
- I am confused.
6 metacharacters
- Instead of just giving the star * special meaning, in a regular expression all the following have special meaning
- \ ^ $ . | ( ) [ ] * + ? { }
- Collectively, these characters are known as metacharacters. They don't stand for themselves but they mean something else.
- For example DEL *.EXE does not mean delete the file "*.EXE". It means delete anything ending with .EXE.
7 metacharacters
- We are somehow already familiar with metacharacters.
- In XML < means start of an element. To use < literally, you have to use &lt;
- In PHP the "\n" does not mean backslash and then n. It means the newline character.
8 simple regular expressions
- Characters that are not metacharacters just simply mean themselves
- good does not match in Good Beer
- d B matches in Good Beer
- dB does not match in Good Beer
- 'Beer ' (with a trailing space) does not match in Good Beer
- If there are several matches, the pattern will match at the first occurrence
- o matches in Good Beer (at the first o)
9 the backslash \ quote
- If you want to match a metacharacter in the string, you have to quote it with the backslash
- a 6+ pack does not match in a 6+ pack
- a 6\+ pack does match in a 6+ pack
- \ does not match in a \ against boozing
- \\ does match in a \ against boozing
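These claims can be tried at a shell prompt with grep -E, which also uses POSIX extended regular expressions. The following is only a sketch: the helper name matches is made up for illustration and is not part of PHP or of the lecture.

```shell
# matches REGEX STRING prints "yes" if the extended regular expression
# is found somewhere in the string, and "no" otherwise.
# (hypothetical helper, defined here only for the demonstration)
matches() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo yes
  else
    echo no
  fi
}

matches 'good' 'Good Beer'         # no  (case matters)
matches 'd B' 'Good Beer'          # yes
matches 'dB' 'Good Beer'           # no
matches 'a 6+ pack' 'a 6+ pack'    # no  (+ is a metacharacter here)
matches 'a 6\+ pack' 'a 6+ pack'   # yes (\+ is a literal plus sign)
```

The single quotes keep the shell from interpreting the backslash itself, so grep receives the pattern exactly as typed.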
10 other characters to be quoted
- Certain non-metacharacters also need to be quoted. These include some of the usual suspects
- \n the newline
- \r the carriage return
- \t the tabulation character
- But this quoting occurs by virtue of PHP, it is not part of the regular expression.
- Remember Sandford's law.
11 anchor metacharacters ^ and $
- ^ matches at the beginning of the string.
- $ matches at the end of the string.
- keeper matches in beerkeeper
- keeper$ matches in beerkeeper
- ^keeper does not match in beerkeeper
- ^$ matches in the empty string
- Note that in a double-quoted string an expression starting with $ will be replaced by the variable's string value (or nothing if the variable has not been set).
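The anchors can be checked the same way with grep -E. This is a sketch under the assumption that a shell with grep is at hand; the helper matches is a made-up name. The shell shares PHP's habit of expanding $ inside double quotes, which is why the patterns are written in single quotes.

```shell
# hypothetical helper: prints yes or no depending on whether the
# extended regular expression $1 is found in the string $2
matches() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo yes
  else
    echo no
  fi
}

matches 'keeper$' 'beerkeeper'   # yes (keeper sits at the end)
matches '^keeper' 'beerkeeper'   # no  (the string starts with beer)
matches '^beer' 'beerkeeper'     # yes
matches '^$' ''                  # yes (the empty string)
```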
12 character classes
- We can define a character class by grouping a list of characters between [ and ]
- b[ie]er matches in beer
- b[ie]er matches in bier
- [Bb][ie]er matches in Bier
- Within a class, metacharacters need not be escaped. In the class only -, ] and ^ are metacharacters.
13 - in the character class
- Within a character class, the dash - becomes a metacharacter.
- You can use - to give a range, according to the sequence of characters in the character set you are using. It's usually alphabetic
- be[a-e]r matches in beer
- be[a-e]r matches in becr
- be[a-e]r does not match in befr
- If the dash - is the last character in the class, it is treated like an ordinary character.
14 ] in the character class
- ] gives you the end of the class. But if you put it first, it is treated like an ordinary character, because having it there otherwise would create an empty class, and that would make no sense.
- be[],]r matches in be]r
15 ^ in the character class
- If the caret ^ appears as the first element in the class, it negates the characters mentioned.
- be[^i]r matches in beer
- b[^ie]er does not match in bier
- be[^a-e]r does match in befr
- beer matches in beer
- beer[^6-9] matches beer0 to beer5
- Otherwise, it is an ordinary character.
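The class rules for -, ] and ^ can be tried with grep -E. A minimal sketch follows; the helper matches is a made-up name, and LC_ALL=C is set as an assumption so that ranges follow plain ASCII order.

```shell
# hypothetical helper: prints yes or no depending on whether the
# extended regular expression $1 is found in the string $2
matches() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo yes
  else
    echo no
  fi
}

matches 'b[ie]er' 'bier'       # yes
matches '[Bb][ie]er' 'Bier'    # yes
matches 'be[a-e]r' 'becr'      # yes (c lies in the range a to e)
matches 'be[a-e]r' 'befr'      # no  (f lies outside the range)
matches 'be[^a-e]r' 'befr'     # yes (the caret negates the class)
matches 'be[],]r' 'be]r'       # yes (] placed first is ordinary)
matches 'beer[^6-9]' 'beer7'   # no
```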
16 standard character classes
- The following predefined classes exist
- [:alnum:] any alphanumeric characters
- [:digit:] any digits
- [:punct:] any punctuation characters
- [:alpha:] any alphabetic characters (letters)
- [:graph:] any graphic characters
- [:space:] any space character (blank and \n, \r)
- [:blank:] any blank character (space and tab)
- [:lower:] any lowercase character
17 standard character classes
- [:upper:] any uppercase character
- [:cntrl:] any control character
- [:print:] any printable character
- [:xdigit:] any character for a hex number
- They are locale and operating system dependent.
- With this discussion we leave character classes.
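The predefined classes are written inside a class, so the digits class appears as [[:digit:]] in a pattern. A small sketch with grep -E, where the helper matches is a made-up name and LC_ALL=C is an assumption to keep the result independent of the locale:

```shell
# hypothetical helper: prints yes or no depending on whether the
# extended regular expression $1 is found in the string $2
matches() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo yes
  else
    echo no
  fi
}

matches '^[[:digit:]]+$' '2006'      # yes (digits only)
matches '^[[:alpha:]]+$' 'beer4'     # no  (4 is not a letter)
matches '[[:space:]]' 'Good Beer'    # yes (there is a blank)
matches '^[[:xdigit:]]+$' 'c0ffee'   # yes (all hex digits)
```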
18 The period . metacharacter
- The period matches any character bar the newline \n.
- The reason why the \n is not counted is historic. In olden days matching was done line by line, because the computer could not hold as much in memory.
- . does not match in the empty string
- . does not match in "\n"
- . matches in a
19 alternative operator |
- This acts like an or
- beer|wine matches in beer
- beer|wine matches in wine
- Alternatives are performed last, i.e. each alternative extends as far as it can.
20 grouping with ( )
- You can use ( ) to group
- (beer|wine) (glass) matches in beer glass
- (beer|wine) (glass) matches in wine glass
- (beer|wine) (glass) does not match in beer, because the blank and glass are required
- (beer|wine) (glass) does not match in wine
- (beer|wine) (glass(es)) matches in beer glasses
- Yes, groups can be nested.
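How far an alternative reaches, and how a group limits it, can be seen with grep -E. This is a sketch only; the helper matches is a made-up name.

```shell
# hypothetical helper: prints yes or no depending on whether the
# extended regular expression $1 is found in the string $2
matches() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo yes
  else
    echo no
  fi
}

matches 'beer|wine' 'wine'                   # yes
matches '(beer|wine) (glass)' 'wine glass'   # yes
matches '(beer|wine) (glass)' 'beer'         # no  (blank and glass missing)
matches '^beer|wine$' 'beer glass'           # yes (read as ^beer or wine$)
matches '^(beer|wine)$' 'beer glass'         # no  (the group limits the |)
```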
21 repetition operators
- * means zero or more times what precedes it.
- + means one or more times what precedes it.
- ? means zero or one time what precedes it.
- The shortest preceding expression is used, i.e. either a single character or a group.
- (beer )* matches in the empty string
- (beer )? matches in the empty string
- (beer )+ matches in beer beer beer
- be+r matches in beer
- be+r does not match in bebe
22 enumeration
- We can use {min,max} to give a minimum min and a maximum max. min and max are non-negative integers.
- be{1,3}r matches in ber
- be{1,3}r matches in beer
- be{1,3}r matches in beeer
- be{1,3}r does not match in beeeer
- ? is just a shorthand for {0,1}
- + is just a shorthand for {1,}
- * is just a shorthand for {0,}
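The repetition operators and the enumeration braces can be tried with grep -E. A minimal sketch; the helper matches is a made-up name, and the ^ and $ anchors are added so that the whole string has to fit the pattern.

```shell
# hypothetical helper: prints yes or no depending on whether the
# extended regular expression $1 is found in the string $2
matches() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo yes
  else
    echo no
  fi
}

matches '^(beer )*$' ''         # yes (zero repetitions are allowed)
matches 'be+r' 'beer'           # yes
matches 'be+r' 'bebe'           # no
matches '^be{1,3}r$' 'beeer'    # yes (three times e)
matches '^be{1,3}r$' 'beeeer'   # no  (four times e)
matches '^be?r$' 'br'           # yes
matches '^be{0,1}r$' 'br'       # yes (same as the line before)
```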
23 examples
- US zip code [0-9]{5}(-[0-9]{4})?
- something like a current date in ISO form
- (20[0-9]{2})-(0[1-9]|1[0-2])-([12][0-9]|3[01])
- Something like a Palmer School course code
- ((DIS[89])|(LIS[5-9]))[0-9]{2}
- Something like an XML tag </*[[:alpha:]]+ */*>
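Such patterns are easiest to trust after trying them on a few strings. The sketch below assumes the zip code and date patterns read as [0-9]{5}(-[0-9]{4})? and (20[0-9]{2})-(0[1-9]|1[0-2])-([12][0-9]|3[01]). The anchors ^ and $, the variable names and the helper matches are additions made for the demonstration.

```shell
# hypothetical helper: prints yes or no depending on whether the
# extended regular expression $1 is found in the string $2
matches() {
  if printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; then
    echo yes
  else
    echo no
  fi
}

# the anchors force the whole string to fit the pattern
zip_re='^[0-9]{5}(-[0-9]{4})?$'
date_re='^(20[0-9]{2})-(0[1-9]|1[0-2])-([12][0-9]|3[01])$'

matches "$zip_re" '11548'          # yes
matches "$zip_re" '11548-1300'     # yes
matches "$zip_re" '1154'           # no  (only four digits)
matches "$date_re" '2006-04-22'    # yes
matches "$date_re" '2006-13-22'    # no  (there is no month 13)
matches "$date_re" '2006-04-05'    # no  (this pattern has no branch for days 01 to 09)
```

The last line shows why the date pattern is only something like a date: days 01 to 09 would need a further alternative such as 0[1-9].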
24 not using POSIX regular expressions
- Do not use regular expressions when you want to accomplish a simple task for which there is a special PHP function already available.
- A special PHP function will usually do the specialized task more easily. Parsing and understanding the regular expression takes the machine time.
25 ereg()
- ereg(regex, string) searches for the pattern described in regex within the string string.
- It returns false if no match was found.
- If you call the function as ereg(regex, string, matches) the matches will be stored in the array matches. Thus matches will be a numeric array of the grouped parts (something in ()) of the regular expression as found in the string. The first group match will be matches[1].
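The same idea can be tried outside PHP, which is handy because ereg() was removed from PHP in version 7. The sketch below is a shell analogue and not the PHP function: grep -E -q plays the role of the true or false result, and sed -E prints what the first group matched, much like matches[1]. The variable names are made up.

```shell
order='glass of Bruch'

# true or false: is the pattern found in the string?
if printf '%s\n' "$order" | grep -E -q 'glass of (Karlsberg|Bruch)'; then
  found=yes
else
  found=no
fi

# \1 stands for what the first ( ) group matched
brand=$(printf '%s\n' "$order" |
  sed -E -n 's/.*glass of (Karlsberg|Bruch).*/\1/p')

echo "$found $brand"   # yes Bruch
```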
26 ereg_replace
- ereg_replace(regex, replacement, string) searches for the pattern described in regex within the string string and replaces occurrences with replacement. It returns the replaced string.
- If replacement contains expressions of the form \\number, where number is an integer between 1 and 9, the sub-expression with that number is used.
- $better_order=ereg_replace('glass of (Karlsberg|Bruch)', 'pitcher of \\1',$order);
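The replacement with a numbered sub-expression can be tried at the shell prompt with sed -E, which also takes POSIX extended regular expressions. This is an analogue of the PHP call and not the call itself; the sample order is made up.

```shell
order='one glass of Karlsberg and one glass of Bruch'

# \1 is replaced by whatever the first ( ) group matched;
# the trailing g replaces every occurrence, not only the first
better_order=$(printf '%s\n' "$order" |
  sed -E 's/glass of (Karlsberg|Bruch)/pitcher of \1/g')

echo "$better_order"
# one pitcher of Karlsberg and one pitcher of Bruch
```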
27 split()
- split(regex, string, max) splits the string string at the occurrences of the pattern described by the regular expression regex. It returns an array. The matched pattern is not included.
- If the optional argument max is given, it means the maximum number of elements in the returned array. The last element then contains the unsplit rest of the string string.
- Use explode() if you are not splitting at a regular expression pattern. It is faster.
28 case-insensitive function
- eregi() does the same as ereg() but works case-insensitively.
- eregi_replace() does the same as ereg_replace() but works case-insensitively.
- spliti() does the same as split() but works case-insensitively.
29 regular expressions in mySQL
- You can use POSIX regular expressions in mySQL in the SELECT command
- SELECT ... WHERE ... REGEXP regex
- where regex is a regular expression.
30 communication with wotan
- For file editing and manipulation, we use putty.
- For file transfer, we use winscp.
- Both are available on the web.
- The protocol is ssh, the secure shell, based on public-key cryptography.
31 installing putty
- Go to your favorite search engine to search for putty.
- If you have administrator rights install the installer version.
- Since you have already installed winscp, you should have no further problems.
32 putty options
- In the window/translation choose UTF-8, always.
- Find out what screen size works for the screen and the font that you are using, and save that in your session.
- For wotan, the port is 22, ssh.
- You can choose to disable the annoying bell.
33 issuing commands
- While you are logged in, you talk to the computer by issuing commands.
- Your commands are read by a command line interpreter.
- The command line interpreter is called a shell.
- You are using the Bourne Again Shell, bash.
34 bash features
- bash allows you to browse the command history with the up/down arrow keys
- bash allows you to edit commands with the left/right arrow keys
- exit is the command to leave the shell.
35 files, directories and links
- Files are continuous chunks of data on disks that are required for software applications.
- Directories are files that contain other files. Microsoft calls them folders.
- In UNIX, the directory separator is /
- The top directory is / on its own.
36 home directory
- When you first log in to wotan you are placed in your home directory /home/username
- cd is the command that gets you back to the home directory.
- The home directory is also abbreviated as ~
- cd ~user gets you to the home of user user.
- cd ~ does what?
37 ~/public_html
- Is your web directory. I created it with mkdir public_html in your home directory.
- The web server on wotan will map requests to http://wotan.liu.edu/~user to show the file ~user/public_html/index.html
- The web server will map requests to http://wotan.liu.edu/~user/file to show the file ~user/public_html/file
- The server will do this by virtue of a configuration option.
38 changing directory, listing files
- cd directory changes into the directory directory
- the current directory is .
- its parent directory is ..
- ls lists files
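A short session shows what . and .. do. It is only a sketch: it works in a scratch directory made with mktemp -d, so nothing in your home directory is touched, and the variable names are made up.

```shell
top=$(mktemp -d)            # scratch directory for the demonstration
mkdir "$top/public_html"
cd "$top/public_html"
cd .                        # . is the current directory, nothing changes
here=$(basename "$(pwd)")   # public_html
cd ..                       # .. is the parent directory
listing=$(ls)               # the parent contains public_html
echo "$here $listing"       # public_html public_html
```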
39 users and groups
- root is the user name of the superuser.
- The superuser has all privileges.
- There are other physical users, i.e. persons using the machine
- There are users that are virtual, usually created to run a daemon. For example, the web server is run by a user www-data.
- Arbitrary users can be put together in groups.
40 permission model
- Permissions on files are given
- to the owner of the file
- to the group of the file
- and to the rest of the world
- A group is a grouping of users. Unix allows you to define any number of groups and make users members of them.
- The rest of the world are all other users who have access to the system. That includes www-data!
41 listing files
- ls lists files
- ls -l makes a long listing. It contains
- elementary type and permissions (see next slide)
- owner
- group
- size
- date
- name
42 first element in ls -l
- Type indicator
- d means directory
- l means link
- - means ordinary file
- 3 letters for permission of owner
- 3 letters for permission of group
- 3 letters for permission of rest of the world
- r means read, w means write, x means execute
- Directories need to be executable for you to get into them
43 change permission chmod
- usage chmod permission file
- file is a file
- permission is three numbers, for owner, group and rest of the world.
- Each number is the sum of elementary numbers
- 4 is read
- 2 is write
- 1 is execute
- 0 means no permission.
- Example chmod 764 file
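The example can be checked against the first element of ls -l. A minimal sketch in a scratch directory; the variable names are made up, and cut -c1-10 keeps only the type indicator and the nine permission letters.

```shell
dir=$(mktemp -d)          # scratch directory for the demonstration
touch "$dir/file"

# 7 = 4+2+1 (read, write, execute) for the owner
# 6 = 4+2   (read, write)          for the group
# 4 = 4     (read only)            for the rest of the world
chmod 764 "$dir/file"

perms=$(ls -l "$dir/file" | cut -c1-10)
echo "$perms"             # -rwxrw-r--
```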
44 general structure of commands
- commandname -flag --option
- Where commandname is a name of a command
- flag can be a letter
- Several letters set several flags at the same time
- An option can also be expressed with -- and a word, this is more user-friendly than flags.
45 example command ls
- ls lists files
- ls -l makes a long listing
- ls -a lists all files, not only regular files but some hidden files as well
- all files that start with a dot are hidden
- ls -la lists all files in a long listing
- ls --all is the same as ls -a. --all is known as a long option.
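The effect of -a is easy to see in a scratch directory with one ordinary and one hidden file. This is a sketch; the file and variable names are made up.

```shell
dir=$(mktemp -d)                          # scratch directory
touch "$dir/index.html" "$dir/.htaccess"  # the second name starts with a dot

plain=$(ls "$dir")      # index.html only
all=$(ls -a "$dir")     # also lists ., .. and .htaccess

echo "$plain"
echo "$all"
```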
46 copying and removing files
- cp file copyfile copies file file to file copyfile. If copyfile is a directory, it copies into the directory.
- mv file movedfile moves file file to file movedfile. If movedfile is a directory, it moves into the directory.
- rm file removes file, there is no recycling bin!!
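The three commands can be tried safely in a scratch directory. A minimal sketch; the directory name backup is made up.

```shell
dir=$(mktemp -d)          # scratch directory for the demonstration
cd "$dir"
echo 'Good Beer' > file

cp file copyfile          # copyfile is a new file
mkdir backup
cp file backup            # backup is a directory: copies into it
mv copyfile movedfile     # copyfile is now called movedfile
rm file                   # gone for good, no recycling bin

ls                        # lists backup and movedfile
ls backup                 # lists file
```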
47 directories and files
- mkdir directory makes a directory
- rmdir directory removes an empty directory
- rm -r directory removes a directory and all its files
- more file
- Pages contents of file, no way back
- less file
- Pages contents of file, u to go back, q to quit
48 soft links
- A link is a file that contains the address of another file. Microsoft calls it a shortcut.
- A soft link can be created with the command
- ln -s file link_to_file where file is a file that is already there and link_to_file is the link.
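A soft link can be inspected once it is made. The sketch below works in a scratch directory; readlink prints the address stored in the link, and the variable names are made up.

```shell
dir=$(mktemp -d)          # scratch directory for the demonstration
cd "$dir"
echo 'Good Beer' > file

ln -s file link_to_file

target=$(readlink link_to_file)        # file (the stored address)
content=$(cat link_to_file)            # Good Beer (read through the link)
kind=$(ls -l link_to_file | cut -c1)   # l (the type indicator for a link)

echo "$target $content $kind"          # file Good Beer l
```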
49 file transfer
- You can use winscp to upload and download files to wotan.
- If uploaded files in the web directory remain invisible, that is most likely a problem with permission. Refer back to permissions.
- chmod 644 * will put it right for the files
- chmod 755 . (yes with a dot) will put it right for the current directory
- * is a wildcard for all files.
- rm -r * is a command to avoid.
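The repair can be rehearsed in a scratch directory before doing it for real. In this sketch the file names are made up, and the first two chmod calls only simulate an upload that arrived with permissions that are too tight.

```shell
dir=$(mktemp -d)                 # scratch stand-in for the home directory
mkdir "$dir/public_html"
cd "$dir/public_html"
touch index.html style.css

chmod 600 index.html style.css   # simulate the problem: owner only
chmod 700 .

chmod 644 *                      # files: owner reads and writes, others read
chmod 755 .                      # directory: others may read it and get into it

file_perms=$(ls -l index.html | cut -c1-10)
dir_perms=$(ls -ld . | cut -c1-10)
echo "$file_perms $dir_perms"    # -rw-r--r-- drwxr-xr-x
```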
50 editing
- There is a plethora of editors available.
- For the neophyte, nano works best.
- nano file edits the file file.
- nano -w switches off line wrapping.
- nano shows the commands available at the bottom of the screen. Note that ^letter, where letter is a letter, means pressing CONTROL and the letter letter at the same time.
51 emacs
- This is another editor that is incredibly featureful and complex.
- Written by Richard M. Stallman, of GNU and GPL fame.
- Get an emacs cheat sheet off the web before you start it. Or look at the next slide.
52 emacs commands
- (here ^ stands for the control character)
- ^x^s saves buffer
- ^x^c exits emacs
- ^g escapes out of a troublesome situation
- control-space sets the mark
- ^w removes until the mark (cut)
- ^y pastes
53 common emacs/bash commands
- ^k kills until the end of the line or removes empty line
- ^y yank what has been killed (paste)
- ^a get to the beginning of the line
- ^e get to the end of the line
54 emacs modes
- Just like people get into different moods, emacs gets into different modes.
- One mode that will split your pants is the PHP mode.
- emacs file.php to edit the file file.php in PHP mode.
- Then look how emacs checks for completion of parentheses, braces, brackets, and the ; and use the tab character to indent.
55 copy and paste
- Putty allows you to copy and paste text between windows and wotan.
- On the windows machine, it uses the windows approach to copy and paste
- On the wotan machine,
- you copy by highlighting with the left mouse button
- you paste using the middle button
- if you don't have a middle button, use left and right together
56 running mySQL
- You can run mySQL in command line mode on wotan. Type
- mysql -u user -p
- You will then be prompted for your password. The username and password are your mySQL user name and mySQL password, not your wotan user name and wotan password.
- Don't forget the semicolon after each command!
57 http://openlib.org/home/krichel
- Thank you for your attention!
- Please switch off machines b4 leaving!