Title: CC384 Natural Language Engineering
1. CC384 - Natural Language Engineering
2. The basic tasks in text processing
- TOKENIZATION: identify tokens in text
- WORD COUNTING: count words and their frequencies
- SEARCHING FOR WORDS
- NORMALIZATION
  - MASSIMO POESIO, massimo poesio, masimo peosio → Massimo Poesio
  - Oct 20, 20th of October, ... → 20/10/2009
- STEMMING
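The first, second and fourth tasks above can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the class and method names are hypothetical, tokens are taken to be runs of letters and digits, normalization is reduced to case folding (it does not repair misspellings such as "masimo peosio" or unify date formats), and the code uses Java 8 library features rather than Java 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class TextTasksSketch {

    // TOKENIZATION (simplified): split on runs of characters that are
    // neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // NORMALIZATION (simplified): case folding only, so that
    // "MASSIMO" and "massimo" count as the same word.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // WORD COUNTING: frequency of each normalized token.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokenize(text)) {
            counts.merge(normalize(token), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {massimo=2, poesio=2}
        System.out.println(countWords("MASSIMO POESIO, massimo poesio"));
    }
}
```

Note that the tokenizer itself is already written with a regular expression, which is one reason the next slides treat regular expressions as a central technology.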
3. Regular Expressions and Finite State Automata
- A central language technology
- Regular expressions: a way to express powerful SEARCH PATTERNS that can be implemented efficiently
- Implemented in Perl, Java 1.4, Emacs, search engines
- Finite state automata: the computational model underlying regular expressions
- The regular expressions model can be expanded to specify SUBSTITUTIONS as well, implementable as FINITE STATE TRANSDUCERS
- In fact, Finite State Transducers are powerful enough to be usable for PARSING
  - Simpler cases of parsing: tokenization, normalization
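To make the link between regular expressions and finite state automata concrete, here is a hand-written automaton for one small pattern: an "a" followed by any number of "b"s. It is a sketch, not how any particular library compiles patterns; the class name, state numbering and method names are all hypothetical.

```java
class AbStarDfa {
    // 0 = start, 1 = read "a" and then zero or more "b" (accepting),
    // 2 = dead state (no continuation can lead to acceptance).
    private static final int START = 0;
    private static final int ACCEPT = 1;
    private static final int DEAD = 2;

    // The transition function of the automaton.
    static int step(int state, char c) {
        switch (state) {
            case START:
                return c == 'a' ? ACCEPT : DEAD;
            case ACCEPT:
                return c == 'b' ? ACCEPT : DEAD;
            default:
                return DEAD;
        }
    }

    // Runs the automaton over the whole input, one symbol at a time.
    static boolean accepts(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            state = step(state, input.charAt(i));
        }
        return state == ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(accepts("abbb")); // prints true
        System.out.println(accepts("ba"));   // prints false
    }
}
```

The automaton reads each symbol once and keeps only the current state, which is why patterns of this kind can be matched efficiently.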
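4. Searching text for words
    char[] text;
    int leftMargin;

    // Returns true if word occurs in text starting at position leftMargin.
    boolean matchWord(String word) {
        boolean retval = true;
        for (int i = 0; i < word.length() && retval == true; i++) {
            // A mismatch, or running past the end of the text, ends the search.
            if (leftMargin + i >= text.length || word.charAt(i) != text[leftMargin + i])
                retval = false;
        }
        return (retval);
    }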
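The method above only tests one position. A full search has to try every left margin in turn; the following self-contained sketch does that with static methods. The names matchWordAt and search are hypothetical, and in practice String.indexOf does the same job.

```java
class WordSearchSketch {

    // Returns true if word occurs in text starting at position leftMargin.
    static boolean matchWordAt(char[] text, int leftMargin, String word) {
        if (leftMargin + word.length() > text.length) {
            return false; // the word would run past the end of the text
        }
        for (int i = 0; i < word.length(); i++) {
            if (word.charAt(i) != text[leftMargin + i]) {
                return false;
            }
        }
        return true;
    }

    // Tries every left margin in turn; returns the first match position, or -1.
    static int search(char[] text, String word) {
        for (int leftMargin = 0; leftMargin + word.length() <= text.length; leftMargin++) {
            if (matchWordAt(text, leftMargin, word)) {
                return leftMargin;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        char[] text = "how to stop tension".toCharArray();
        System.out.println(search(text, "top ten")); // prints 8
    }
}
```

This handles exactly one fixed string. Each new kind of pattern would need new hand-written code, which motivates the pattern languages that follow.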
5. Searching text for patterns
- Most common case: searching using Google or similar
- Simpler case: just looking for web pages containing a word (accommodation)
- More complex cases
  - Different spellings
    - accomodation OR accommodation
    - Centre OR Center Cognitive Science
  - Patterns only occurring in certain contexts
- But also to validate strings entered by the user
  - E.g., checking whether the string entered is a phone number
    - (+44)(0)20-12341234, 02012341234, +44 (0) 1234-1234
    - But not (44+)020-12341234, 12341234(+020)
  - A regular email address
    - asmith@mactec.com, foo12@foo.edu, bob.smith@foo.tv
    - But not asmith, @mactech.com, a@a
  - A post code
    - G1 1AA, EH10 2QQ, SW1 1ZZ
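As a taste of what such validation looks like in code, here is a sketch of a post code check using java.util.regex. The pattern is an assumption made for illustration only (one or two letters, one or two digits, a space, a digit, two letters); it covers the three examples above but is not the full UK post code grammar. The class and method names are hypothetical, and the pattern notation is the one introduced on the following slides.

```java
import java.util.regex.Pattern;

class PostcodeSketch {
    // Simplified assumption, not the official format: real post codes
    // have more variants than this pattern allows.
    private static final Pattern POSTCODE =
        Pattern.compile("[A-Z]{1,2}[0-9]{1,2} [0-9][A-Z]{2}");

    // True only if the whole string has the assumed post code shape.
    static boolean looksLikePostcode(String s) {
        return POSTCODE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikePostcode("EH10 2QQ")); // prints true
        System.out.println(looksLikePostcode("12345"));    // prints false
    }
}
```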
6. Regular Expressions: a formalism for expressing search patterns
- Because matching is a very common problem, over the years computer scientists have identified a set of patterns that
  - Are very common
  - Can be searched for efficiently
- The language of REGULAR EXPRESSIONS has been developed to characterize these patterns
- Many programming languages (Perl, Java 1.4, TCL, Python, ...), web search tools, and software systems (awk, sed, emacs) allow users to use regular expressions to specify what they are searching for; these REs are then compiled into efficient code
- You do not need to write the code yourselves!
7. Regular Expressions: the basic case
- The simplest form of regular expression: a SEQUENCE OF SYMBOLS
  - /can/
  - Matches any string which contains "can": can, canterbury, scanning
- Whitespace can be included: /top ten/
  - Also matches "how to stop tension"
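The two examples on this slide can be tried out directly with java.util.regex. The sketch below assumes a small hypothetical helper, contains, that reports whether a pattern matches anywhere inside a string (the behaviour of Matcher.find()).

```java
import java.util.regex.Pattern;

class BasicMatchSketch {
    // True if the regular expression matches somewhere inside the text.
    static boolean contains(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(contains("can", "canterbury"));              // prints true
        System.out.println(contains("top ten", "how to stop tension")); // prints true
        System.out.println(contains("can", "CAN"));                     // prints false
    }
}
```

The last line shows that matching is case-sensitive unless the pattern says otherwise.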
8. More complex types of regular expressions
- Disjunction
  - /centre|center/
  - /accomodation|accommodation/
- Also
  - /[Cc]entre/
  - /acco(m|mm)odation/
- Repetitions
  - + : any number of times greater than 0
    - /YES+!/
    - Matches YES!, YESS!, YESSS!
    - E.g., any binary number: [01]+
  - * : 0 or more times
    - /ab*/
    - Matches a, ab, abb, abbb
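The operators on this slide, written in java.util.regex syntax, are | for disjunction, [Cc] for a choice between single characters, + for one or more repetitions and * for zero or more. The sketch below checks them against whole strings with Pattern.matches; the helper name matchesWhole is hypothetical.

```java
import java.util.regex.Pattern;

class OperatorSketch {
    // True only if the regular expression matches the entire text.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("centre|center", "center")); // prints true
        System.out.println(matchesWhole("[Cc]entre", "Centre"));     // prints true
        System.out.println(matchesWhole("YES+!", "YESSS!"));         // prints true
        System.out.println(matchesWhole("YES+!", "YE!"));            // prints false
        System.out.println(matchesWhole("[01]+", "10110"));          // prints true
        System.out.println(matchesWhole("ab*", "a"));                // prints true
    }
}
```

Whole-string matching is used here so that the difference between + and * is visible: "YE!" fails because + requires at least one S, whereas "a" passes because * accepts zero b's.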
9. Software that includes an implementation of REs
- Pure REs: awk, egrep, lex
- Extended REs: perl, Java
10. Regular expressions in Java (from 1.4)
- Standard library: java.util.regex
- Tutorial (very good): http://java.sun.com/docs/books/tutorial/extra/regex/index.html
- Alternative
  - http://www.javaworld.com/javaworld/jw-07-2001/jw-0713-regex-p2.html
- Main classes
  - PATTERN (compiled form of a RE)
    - Pattern rePattern = Pattern.compile("ab*");
  - MATCHER (analyzes a string using a pattern)
    - Matcher pm = rePattern.matcher(string);
    - pm.find(): find the next substring that matches
    - pm.group(): the substring found by find()
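Put together, the two classes are typically used in a loop: compile once, then call find() repeatedly and read each match with group(). The sketch below collects all matches into a list; the class name, the method name findAll and the sample input are assumptions for illustration, and the generic collections require a newer Java than 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllSketch {
    // Returns every non-overlapping substring of input matched by regex,
    // in left-to-right order.
    static List<String> findAll(String regex, String input) {
        Pattern rePattern = Pattern.compile(regex); // compiled form of the RE
        Matcher pm = rePattern.matcher(input);      // matcher over this input
        List<String> found = new ArrayList<>();
        while (pm.find()) {          // advance to the next match, if any
            found.add(pm.group());   // the substring found by find()
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("ab*", "abb cab a")); // prints [abb, ab, a]
    }
}
```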
11. Grep in Java 1.4 (cc384/code/java)
    ...
    import java.io.File;        // needed by the excerpt below
    import java.nio.CharBuffer; // needed by the excerpt below
    import java.util.regex.*;

    public class Grep {
        ...
        // Pattern used to parse lines
        private static Pattern linePattern = Pattern.compile(".*\r?\n");

        // The input pattern that we're looking for
        private static Pattern pattern;

        // Compile the pattern from the command line
        private static void compile(String pat) {
            try {
                pattern = Pattern.compile(pat);
            } catch (PatternSyntaxException x) {
                System.err.println(x.getMessage());
                System.exit(1);
            }
        }

        // Use the linePattern to break the given CharBuffer into lines, applying
        // the input pattern to each line to see if we have a match
        private static void grep(File f, CharBuffer cb) {
            Matcher lm = linePattern.matcher(cb); // Line matcher
            Matcher pm = null;                    // Pattern matcher
            int lines = 0;
            while (lm.find()) {
                lines++;
                CharSequence cs = lm.group();     // The current line
                if (pm == null)
                    pm = pattern.matcher(cs);
                else
                    pm.reset(cs);
                if (pm.find())
                    System.out.print(f + ":" + lines + ":" + cs);
                if (lm.end() == cb.limit())
                    break;
            }
        }
        ...
    }
12. Regular expressions in Perl
- Example: print lines containing the string "can" (a simple version of the grep program)
    while (<STDIN>) {
        if (/can/) { print $_; }
    }
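For comparison, roughly the same line filter can be written in Java. This is a sketch with hypothetical names: it takes any Reader so that it can be tried on a string, and passing new InputStreamReader(System.in) instead would make it read standard input as the Perl loop does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SimpleGrepSketch {
    // Returns the lines of source that contain a match for regex.
    static List<String> linesMatching(Reader source, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (pattern.matcher(line).find()) {
                    hits.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        Reader sample = new StringReader("I can\nyou may\nscanning\n");
        // prints "I can" and "scanning", one per line
        for (String line : linesMatching(sample, "can")) {
            System.out.println(line);
        }
    }
}
```

Perl keeps the current line in the implicit variable $_, whereas in Java the line has to be named and passed to a Matcher explicitly.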
13. Even more complex cases and more metacharacters (Perl- and Java-specific)
- Other forms of disjunction
  - Range: /textfile0[2-4]/
  - Will match textfile02, textfile03, textfile04
- Metacharacters (in Perl / Java)
  - \d (any digit): /a\d+z/ matches a0z, a123z, a456z
  - \w (letter, digit, or underscore _)
  - \s (any whitespace)
- Any character: . (period)
  - /cyclo.*ane/ matches cyclodecane, cyclohexane, cyclones drive me insane
- Zero or one times: ?
  - /accomm?odation/ matches accomodation and accommodation
- Negation: [^abc]
  - /textfile[^0268]/ matches textfile1, textfile3, ...
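These metacharacters can be checked in Java as follows. In a Java string literal the backslash has to be doubled, so \d is written "\\d". The patterns below use + and * wherever more than one character has to be covered (a single \d or . stands for exactly one character). The class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

class MetacharSketch {
    // True if the regular expression matches somewhere inside the text.
    static boolean contains(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(contains("textfile0[2-4]", "textfile03"));           // prints true
        System.out.println(contains("a\\d+z", "a123z"));                        // prints true
        System.out.println(contains("a\\dz", "a123z"));                         // prints false
        System.out.println(contains("\\w+\\s\\w+", "top ten"));                 // prints true
        System.out.println(contains("cyclo.*ane", "cyclones drive me insane")); // prints true
        System.out.println(contains("accomm?odation", "accomodation"));         // prints true
        System.out.println(contains("textfile[^0268]", "textfile3"));           // prints true
        System.out.println(contains("textfile[^0268]", "textfile6"));           // prints false
    }
}
```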
14. Applications of more complex REs
- Web pages about Centres and Centers
  - /[Cc]entre|[Cc]enter/
- Regular expression to validate phone numbers
  - (+44)(0)20-12341234, 02012341234, +44 (0) 1234-1234
  - But not (44+)020-12341234, 12341234(+020)
  - ^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$
- Validating email addresses
  - asmith@mactec.com, foo12@foo.edu, bob.smith@foo.tv
  - But not asmith, @mactech.com, a@a
  - ^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
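The two validation patterns can be used from Java roughly as follows. The expressions are the ones on this slide with their brackets, quantifiers and anchors written out and backslashes doubled for Java string literals; the class and method names are hypothetical. Both patterns are quite permissive (the phone pattern even accepts the empty string), so this is a sketch of the technique rather than a production validator.

```java
import java.util.regex.Pattern;

class ValidationSketch {
    // Optional prefix such as "(+44)", then digits, spaces, hyphens,
    // underscores and parentheses.
    private static final Pattern PHONE = Pattern.compile(
        "^(\\(?\\+?[0-9]*\\)?)?[0-9_\\- \\(\\)]*$");

    // Local part, "@", then a bracketed IP prefix or dotted domain labels,
    // then a 2-4 letter suffix or a final number.
    private static final Pattern EMAIL = Pattern.compile(
        "^([a-zA-Z0-9_\\-\\.]+)@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)"
        + "|(([a-zA-Z0-9\\-]+\\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)$");

    static boolean isPhoneNumber(String s) {
        return PHONE.matcher(s).matches();
    }

    static boolean isEmailAddress(String s) {
        return EMAIL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPhoneNumber("(+44)(0)20-12341234")); // prints true
        System.out.println(isPhoneNumber("12341234(+020)"));      // prints false
        System.out.println(isEmailAddress("bob.smith@foo.tv"));   // prints true
        System.out.println(isEmailAddress("a@a"));                // prints false
    }
}
```

Using matches() rather than find() makes the check apply to the whole string, which is what validation requires.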
15. Notational Variants
- Different programming languages tend to use different notations for expressing REs.
- In FSA,
  - Sequence: [d,o,g]
  - Disjunction: {[c,a,t],[d,o,g]} (instead of cat|dog)
  - Range: a..z (instead of [a-z])
  - Any symbol whatsoever: ? (instead of .)
  - Optional character: E^ (instead of E?)
16. Notational variants: advanced search in Google
- CAPITALIZATION, etc
  - Google search is not case-sensitive
- OR search
  - vacation london OR paris
- NUMRANGE search
  - DVD player 250..350
- WILDCARD search
  - "Sony Vaio * laptop"
- For more tips: http://www.google.com/help/refinesearch.html
17. Readings
- Jurafsky and Martin, chapter 2
- The regular expressions library
  - http://www.regxlib.com/
- The Java tutorial at Sun, section on regular expressions
  - http://java.sun.com/docs/books/tutorial/extra/regex/index.html
- The sections of the Perl manual on regular expressions (perlre)
- Jeffrey Friedl, Understanding Regular Expressions, The Perl Journal
18. Acknowledgments
- Some material borrowed from Gosse Bouma