Title: Constructing a Spell Checker
1Constructing a Spell Checker
- Project 1
- 22c021 Data Structures
2Objective
- A spell checker is a program that checks and possibly corrects spelling errors in documents.
- In this project you will take a first step by building a somewhat primitive spell checker.
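3Spell Checking
Dictionary: A, AM, I, STUDENT, UNIVERSITY
- I am a university student. (every word is in the dictionary)
- I am a univrsity student. ("univrsity" is not in the dictionary)
- I am a university stdent. ("stdent" is not in the dictionary)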
4Project components (diagram)
- wordClass class: contains two strings
- wordList class: contains several methods
- buildDict.java: builds the dictionary from user input files
- spellCheck.java: checks and corrects spellings in user input files or words, and outputs the correct words
5What to write?
- two separate programs
- buildDict builds a small dictionary of words
- spellCheck uses this dictionary of words to determine which words in a given document are misspelled
6The Dictionary
- We will build a dictionary by extracting words from large online documents known to be error-free
- Three documents are given: alice.txt, carol.txt, and hyde.txt
7Building Dictionary
- The program buildDict should
- read each of these documents,
- extract words from them, and
- insert them into a dictionary
8Word
- A word is a contiguous sequence of 2 or more letters (lower case or upper case) immediately followed by a non-letter character
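The word definition above can be sketched as a single scan over the text. This is only an illustration: the class and method names (WordExtractor, extractWords, isAsciiLetter) are hypothetical, "letter" is assumed to mean a-z or A-Z, and the end of the input is assumed to terminate a word just like a non-letter character does.

```java
import java.util.ArrayList;
import java.util.List;

class WordExtractor {
    // Returns every maximal run of 2 or more letters found in text.
    // Assumption: reaching the end of the input also ends a word.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        int start = -1; // index where the current run of letters began, or -1
        for (int i = 0; i <= text.length(); i++) {
            boolean letter = i < text.length() && isAsciiLetter(text.charAt(i));
            if (letter) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= 2) { // single letters are not words
                    words.add(text.substring(start, i));
                }
                start = -1;
            }
        }
        return words;
    }

    // Assumption: only unaccented English letters count as letters.
    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        // prints [am, university, student]
        System.out.println(extractWords("I am a university student."));
    }
}
```

Note that "I" and "a" are skipped because they are only one letter long. The sketch does not change the case of the words; whether to normalize case before inserting into the dictionary is a separate decision.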
9Data Structure
- You will use a hash table with quadratic probing
- This table will be accessible to both buildDict and spellCheck
10wordList class
- Each of the extracted words needs to be inserted
into a data structure that can also be searched
quickly (the same word is not inserted twice)
To support this you should define and implement
a class called wordList
11wordList class
- You should use hashing with quadratic probing to
implement the wordList class.
12Hashing with Quadratic Probing
- A scheme for resolving collisions in hash tables (explore a sequence of locations until an empty one is found)
- A collision is resolved by putting the item in the next empty place given by a probe sequence
- The space between places in the sequence increases quadratically
13Illustration
- Table size is 11 (slots 0..10)
- Hash function: h(x) = x mod 11
- Insert keys:
- 20 mod 11 = 9
- 30 mod 11 = 8
- 2 mod 11 = 2
- 13 mod 11 = 2 -> 2 + 1^2 = 3
- 25 mod 11 = 3 -> 3 + 1^2 = 4
- 24 mod 11 = 2 -> 2 + 1^2, 2 + 2^2 = 6
- 10 mod 11 = 10
- 9 mod 11 = 9 -> 9 + 1^2, (9 + 2^2) mod 11, (9 + 3^2) mod 11 = 7
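The insertions in this illustration can be replayed with a few lines of Java. This is a minimal sketch with integer keys, not the project's wordList class; the names QuadraticProbeDemo and insert are hypothetical, and the duplicate check required by the project is left out to keep the probe sequence easy to follow.

```java
class QuadraticProbeDemo {
    // Inserts key using h(x) = x mod M and the probe sequence (h + i*i) mod M.
    // A null entry marks an empty slot. Returns the slot used, or -1 if no
    // empty slot was found within M probes. Assumes non-negative keys.
    static int insert(Integer[] table, int key) {
        int m = table.length;
        int home = key % m;
        for (int i = 0; i < m; i++) {
            int slot = (home + i * i) % m;
            if (table[slot] == null) {
                table[slot] = key;
                return slot;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[11];
        int[] keys = {20, 30, 2, 13, 25, 24, 10, 9};
        for (int key : keys) {
            // prints slots 9, 8, 2, 3, 4, 6, 10, 7 in that order
            System.out.println(key + " -> slot " + insert(table, key));
        }
    }
}
```

Searching follows exactly the same probe sequence: start at h(x) and stop either when the key is found or when an empty slot is reached.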
14
- Find specific values of the constants to use (check the Wiki link)
- For h(s), check chainingHashTable.java
- It is a good idea for M (the size of the hash table) to be a prime number
- Choose M to be about 2-5 times the number of items to insert
- Experiment to figure out a value for M
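The hash function in chainingHashTable.java is not reproduced in these slides, so the following is only one possible shape for h(s): a polynomial hash evaluated with Horner's rule and reduced modulo M at every step. The class name StringHash, the method name hash, and the multiplier 31 are all assumptions made for this sketch; the constants you settle on should come from your own experiments.

```java
class StringHash {
    // Assumed multiplier; try other values when experimenting.
    static final int BASE = 31;

    // Polynomial hash of s, always in the range 0..m-1.
    // Reducing modulo m at each step keeps the running value small,
    // and using long avoids overflow in the multiplication.
    static int hash(String s, int m) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h * BASE + s.charAt(i)) % m;
        }
        return (int) h;
    }

    public static void main(String[] args) {
        int m = 11; // a small prime table size, as in the illustration
        for (String w : new String[] {"alice", "carol", "hyde"}) {
            System.out.println(w + " -> " + hash(w, m));
        }
    }
}
```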
15Remember!
- Keep a check so that duplicate words are not entered
- There should be an option for resizing the array. Don't forget to rehash after that (M has changed!)
- Resizing is not just doubling. We have to find the next prime number.
- We also need quadratic probing to search for a word
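These reminders can be sketched together: a duplicate-aware insert, a search for the next prime, and a resize that rehashes every word. All names here (PrimeResize, isPrime, nextPrime, resize, insert) are hypothetical, the growth rule "smallest prime at least twice the old size" is one reasonable choice rather than a requirement, and String.hashCode() stands in for the project's own hash function.

```java
class PrimeResize {
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Smallest prime that is greater than or equal to n.
    static int nextPrime(int n) {
        int candidate = Math.max(n, 2);
        while (!isPrime(candidate)) {
            candidate++;
        }
        return candidate;
    }

    // Quadratic-probing insert. Returns false if the word is already
    // present (duplicates are not entered) or no empty slot was found.
    static boolean insert(String[] table, String word) {
        int m = table.length;
        int home = Math.floorMod(word.hashCode(), m); // stand-in for h(s)
        for (int i = 0; i < m; i++) {
            int slot = (int) ((home + (long) i * i) % m);
            if (table[slot] == null) {
                table[slot] = word;
                return true;
            }
            if (table[slot].equals(word)) {
                return false;
            }
        }
        return false;
    }

    // Builds a larger table and re-inserts every word, because each word's
    // slot depends on M and M has changed.
    static String[] resize(String[] old) {
        String[] bigger = new String[nextPrime(2 * old.length)];
        for (String word : old) {
            if (word != null) {
                insert(bigger, word);
            }
        }
        return bigger;
    }
}
```

For example, resizing a table of size 11 gives a table of size 23, the first prime at or above 22. Copying the old array into the new one without rehashing would leave words in slots where a later search cannot find them.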
16Java codes to look up
import java.io.BufferedWriter;
import java.io.FileWriter;

try {
    // Create file
    FileWriter fs = new FileWriter("out.txt");
    BufferedWriter out = new BufferedWriter(fs);
    out.write("word");
    // Close the output stream
    out.close();
} catch (Exception e) { // Catch exception if any
    System.err.println("Error: " + e.getMessage());
}
17Java codes to look up
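import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

try {
    InputStreamReader isr = new InputStreamReader(System.in);
    BufferedReader br = new BufferedReader(isr);
    String s = br.readLine(); // read one line from the console
} catch (IOException e) {
    numberFromConsole = 0; // variable declared elsewhere; default if reading fails
}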
18Java codes to look up
- Break a string into tokens
String s = "I am a university student";
StringTokenizer st = new StringTokenizer(s); // from java.util
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
Output (one token per line): I, am, a, university, student
19Java codes to look up
- Compare strings with .equals() and NOT with ==
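The reason matters when searching the hash table: the == operator on two String variables asks whether they refer to the same object, while .equals() asks whether they hold the same characters. A small sketch (the class name StringCompareDemo and the helper sameWord are hypothetical):

```java
class StringCompareDemo {
    // Content comparison: the right test when probing for a word.
    static boolean sameWord(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String a = "color";
        String b = new String("color"); // same characters, different object
        System.out.println(a == b);         // prints false
        System.out.println(sameWord(a, b)); // prints true
    }
}
```

Words read from a file are separate objects from the words already stored in the table, so a lookup that compares with == can miss words that are actually present.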
20Till now..
- Words are saved in a wordList object
- The words in this object are written to the file dictionary.dat
- The spellCheck program can use hashing to find a word efficiently
21SpellCheck program
- First step
- read the list of words in the file dictionary.dat and store these in a wordList object
- should also add valid single-letter words to the wordList object (A, I)
22SpellCheck Program
- prompt the user for the name of a document she wants spell-checked
- read words one-by-one and check which words appear in its dictionary
23Java code to look up
- Prompt the user and read from terminal
try {
    System.out.print(prompt);
    return new BufferedReader(new InputStreamReader(System.in)).readLine();
} catch (IOException e) {
    e.printStackTrace();
    return null; // nothing could be read
}
24Correcting documents
Input file (may have misspelled words) + dictionary -> spellCheck program -> Output file (no misspelled words)
25Checking words
- What happens when a word is encountered that is not present in the dictionary?
- Display the word at the terminal
- Produce the following prompt: Do you want to replace (R), replace all (P), ignore (I), ignore all (N), or exit (E)?
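One way to handle the reply to this prompt is to translate the letter into a named action before acting on it. The sketch below assumes that replies are accepted in either case and with surrounding spaces, and that anything else should cause the prompt to be repeated; the names ChoiceParser, Action and parse are hypothetical.

```java
import java.util.Locale;

class ChoiceParser {
    enum Action { REPLACE, REPLACE_ALL, IGNORE, IGNORE_ALL, EXIT }

    // Maps the user's reply to an action.
    // Returns null for an unrecognized reply so the caller can ask again.
    static Action parse(String reply) {
        if (reply == null) {
            return null;
        }
        switch (reply.trim().toUpperCase(Locale.ROOT)) {
            case "R": return Action.REPLACE;
            case "P": return Action.REPLACE_ALL;
            case "I": return Action.IGNORE;
            case "N": return Action.IGNORE_ALL;
            case "E": return Action.EXIT;
            default:  return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("p")); // prints REPLACE_ALL
        System.out.println(parse("x")); // prints null
    }
}
```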
26wordList object structure
- each item in a wordList object consists of two words
- (i) an incorrect word and
- (ii) a replacement word
27wordList object structure
Misspelled words (Primary words) -> Dictionary words (Replacement words)
colour -> color
grammer -> grammar
joyfull -> joyful
Each pair is an element in the wordList object
28Replace All
Misspelled words (Primary words) -> Dictionary words (Replacement words)
colour -> color
29Ignore All
Misspelled words (Primary words) -> Dictionary words (Replacement words)
colour -> colour
Both words are identical!
30Correcting documents
Input file (may have misspelled words) + Dictionary + Replacement Words -> spellCheck program -> Output file (no misspelled words)
31Hint!! (based on the previous slides)
- Make a class, e.g., wordClass, with two String fields, e.g., primaryWord and secondaryWord
- A wordClass object will be a wordList object element
- Let's see how it looks...
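A minimal sketch of such an element class is shown here. The class name and the two field names come from the hint; the constructor, the accessor names and the choice to make the fields final are assumptions made for illustration.

```java
class wordClass {
    private final String primaryWord;   // the key that is hashed and searched
    private final String secondaryWord; // blank, or the replacement word

    wordClass(String primaryWord, String secondaryWord) {
        this.primaryWord = primaryWord;
        this.secondaryWord = secondaryWord;
    }

    String getPrimaryWord() {
        return primaryWord;
    }

    String getSecondaryWord() {
        return secondaryWord;
    }

    public static void main(String[] args) {
        wordClass entry = new wordClass("colour", "color");
        // prints colour -> color
        System.out.println(entry.getPrimaryWord() + " -> " + entry.getSecondaryWord());
    }
}
```

With this layout the wordList table can store wordClass objects, hash on the primary word, and compare primary words with .equals() while probing.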
32Hint!!
In buildDict.java, the dictionary is a wordList object. Each element of the object contains two strings.
Primary word: the dictionary word (e.g., color)
Secondary word: blank
33Hint!!
In spellCheck.java, replacementWords is a wordList object. Each element of the object contains two strings.
Primary word: the misspelled word (e.g., colour)
Secondary word: the dictionary word (e.g., color)
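The two tables fit together roughly as follows when a word is read from the input file. To keep the sketch short, java.util.HashSet and java.util.HashMap stand in for the two wordList objects; in the project both would be your own quadratically probed tables. The names ReplacementDemo and resolve are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ReplacementDemo {
    // Returns the word to write to the output file,
    // or null when the user still has to be prompted.
    static String resolve(String word, Set<String> dictionary,
                          Map<String, String> replacementWords) {
        if (dictionary.contains(word)) {
            return word; // spelled correctly
        }
        // Covers an earlier "replace all" (secondary word differs) and an
        // earlier "ignore all" (secondary word equals the primary word).
        return replacementWords.get(word);
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>();
        dictionary.add("color");
        Map<String, String> replacementWords = new HashMap<>();
        replacementWords.put("colour", "color");    // replace all
        replacementWords.put("grammer", "grammer"); // ignore all
        for (String w : new String[] {"color", "colour", "grammer", "joyfull"}) {
            System.out.println(w + " -> " + resolve(w, dictionary, replacementWords));
        }
    }
}
```

In this run "joyfull" resolves to null, which is the case where the program displays the word and shows the replace/ignore prompt.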
34Testing
- Finding a good hashing function is important! Repeatedly experiment to find the right M and the constants.
- Experiment with different input files
- Try out different user inputs
- Misspell your words a lot!
- Try different options each time you encounter a misspelled word
35Testing
- Do vigorous testing! This is the only way to end up with an accurate and improved program
- You have 3 weeks to work on the project. Remember, you will need 2/3 of the time for testing!
36Documentation
- You should document your code extensively
- Consider using the Javadoc style of comments
37Questions?
- We are here to help you! Email us, or come to our
office hours, and ask questions in the class!