Constructing a Spell Checker

Provided by: kajarighos

1
Constructing a Spell Checker
  • Project 1
  • 22c021 Data Structures

2
Objective
  • A spell checker is a program that checks and
    possibly corrects spelling errors in documents.
  • In this project you will take a first step, by
    building a somewhat primitive spell checker.

3
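Spell Checking
  • Dictionary: A, AM, I, STUDENT, UNIVERSITY
  • I am a university student. (every word is in the dictionary)
  • I am a univrsity student. ("univrsity" is not in the dictionary)
  • I am a university stdent. ("stdent" is not in the dictionary)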
4
  • wordClass class: contains two strings
  • wordList class: contains several methods
  • buildDict.java: builds the dictionary from user input files
  • spellCheck.java: checks and corrects spellings in user input
    files or words, and outputs the correct words
5
What to write?
  • Two separate programs: buildDict and spellCheck
  • buildDict builds a small dictionary of words
  • spellCheck uses this dictionary of words to determine which
    words in a given document are misspelled
6
The Dictionary
  • We will build a dictionary by extracting words
    from large online documents known to be error
    free
  • Three documents are given: alice.txt, carol.txt, and
    hyde.txt

7
Building Dictionary
  • The program buildDict should
  • read each of these documents,
  • extract words from them, and
  • insert them into a dictionary

8
Word
  • A word is a contiguous sequence of 2 or more letters
    (lower case or upper case) immediately followed by a
    non-letter character
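
The word definition above can be sketched as a single pass over the text. This is only an illustration: the class and method names (WordExtractor, extractWords) are hypothetical, and "letter" is assumed to mean the 52 ASCII letters. Read strictly, the definition excludes a run of letters that reaches the end of the text, since nothing follows it.

```java
import java.util.ArrayList;
import java.util.List;

class WordExtractor {
    // Assumption: "letter" means a-z or A-Z only.
    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Returns every run of 2 or more letters that is immediately
    // followed by a non-letter character.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        int start = -1; // index where the current run of letters began, or -1
        for (int i = 0; i < text.length(); i++) {
            if (isLetter(text.charAt(i))) {
                if (start < 0) {
                    start = i;
                }
            } else {
                if (start >= 0 && i - start >= 2) {
                    words.add(text.substring(start, i));
                }
                start = -1;
            }
        }
        // A run that reaches the end of the text has no following
        // non-letter, so it is not reported.
        return words;
    }
}
```

For the sentence "I am a university student." this returns am, university, and student; the single-letter words I and a are skipped, which is why the spellCheck program has to add them separately.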

9
Data Structure
  • You will use a hash table with quadratic
    probing
  • This table will be accessible to both buildDict
    and spellCheck.

10
wordList class
  • Each of the extracted words needs to be inserted
    into a data structure that can also be searched
    quickly (the same word is not inserted twice)

To support this you should define and implement
a class called wordList
11
wordList class
  • You should use hashing with quadratic probing to
    implement the wordList class.

12
Hashing with Quadratic Probing
  • A scheme for resolving collisions in hash tables (explore a
    sequence of locations until an empty one is
    found)
  • A collision is resolved by putting the item in the
    next empty place given by a probe sequence
  • The space between places in the sequence
    increases quadratically

13
Illustration
  • Table size is 11 (slots 0..10)
  • Hash function: h(x) = x mod 11
  • Insert keys 20, 30, 2, 13, 25, 24, 10, 9:
  • 20 mod 11 = 9
  • 30 mod 11 = 8
  • 2 mod 11 = 2
  • 13 mod 11 = 2 → 2 + 1² = 3
  • 25 mod 11 = 3 → 3 + 1² = 4
  • 24 mod 11 = 2 → 2 + 1², 2 + 2² = 6
  • 10 mod 11 = 10
  • 9 mod 11 = 9 → 9 + 1², (9 + 2²) mod 11, (9 + 3²) mod 11 = 7
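
The illustration can be replayed in a few lines of Java. This is a minimal sketch that assumes the probe sequence h, h + 1², h + 2², ... (mod 11) shown on the slide; the class name QuadraticProbeDemo and the use of -1 as the empty marker are choices made for this example only.

```java
import java.util.Arrays;

class QuadraticProbeDemo {
    static final int M = 11;     // table size from the illustration
    static final int EMPTY = -1; // marks an unused slot (keys here are non-negative)

    static int[] newTable() {
        int[] table = new int[M];
        Arrays.fill(table, EMPTY);
        return table;
    }

    // Probes slots (h + i*i) mod M for i = 0, 1, 2, ... and stores the key
    // in the first empty one. Returns the slot used, or -1 if none was found.
    static int insert(int[] table, int key) {
        int h = key % M;
        for (int i = 0; i < M; i++) {
            int slot = (h + i * i) % M;
            if (table[slot] == EMPTY) {
                table[slot] = key;
                return slot;
            }
            if (table[slot] == key) {
                return slot; // already present, do not insert twice
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] table = newTable();
        for (int key : new int[] {20, 30, 2, 13, 25, 24, 10, 9}) {
            System.out.println(key + " -> slot " + insert(table, key));
        }
    }
}
```

Running main places the keys in slots 9, 8, 2, 3, 4, 6, 10, and 7, in that order, matching the illustration.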

14
  • Find specific values of constants to use (check
    the Wiki link)
  • For h(s), check chainingHashTable.java
  • It is a good idea for M (the size of the hash
    table) to be a prime number
  • Choose M to be about 2-5 times the number of
    items to insert
  • Experiment to figure out a value for M.
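
For experimenting with M, it helps to have some string hash at hand. The sketch below is NOT the function from chainingHashTable.java, which is not reproduced here; it is a common polynomial hash (Horner's rule with multiplier 31), and the names StringHash and hash are hypothetical.

```java
class StringHash {
    // Polynomial hash computed with Horner's rule. The value is reduced
    // mod m after every character, so the result is always in [0, m).
    // Assumption: m is below about 2^26, so 31 * h + c cannot overflow an int.
    static int hash(String s, int m) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (31 * h + s.charAt(i)) % m;
        }
        return h;
    }
}
```

Trying the same word list with several prime values of M and counting collisions is one way to carry out the experiment the slide asks for.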

15
Remember!
  • Keep a check so that duplicate words are not
    entered
  • There should be an option for resizing the array.
    Don't forget to rehash after that (M has now
    changed!)
  • Resizing is not just doubling. We have to find
    the next prime number.
  • Searching for a word also uses quadratic probing
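
The reminders above (no duplicates, resize to the next prime, rehash, search by the same probe sequence) fit together roughly as follows. This is a sketch under stated assumptions, not the required wordList design: the class name ProbedWordSet is hypothetical, String.hashCode() stands in for the course's h(s), and the table is grown whenever it would become more than half full.

```java
class ProbedWordSet {
    private String[] slots;
    private int count;

    ProbedWordSet(int initialSize) {
        slots = new String[nextPrime(initialSize)];
    }

    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Smallest prime that is >= n.
    static int nextPrime(int n) {
        while (!isPrime(n)) n++;
        return n;
    }

    // Follows the probe sequence (h + i*i) mod M and returns the slot that
    // holds the word, or the first empty slot if the word is absent.
    // With M prime and the table at most half full, an empty slot is reached.
    private int findSlot(String word) {
        int h = Math.floorMod(word.hashCode(), slots.length); // stand-in for h(s)
        for (int i = 0; ; i++) {
            int slot = (int) ((h + (long) i * i) % slots.length);
            if (slots[slot] == null || slots[slot].equals(word)) return slot;
        }
    }

    boolean contains(String word) {
        return slots[findSlot(word)] != null;
    }

    // Returns false if the word was already present.
    boolean add(String word) {
        if (contains(word)) return false;
        if ((count + 1) * 2 > slots.length) resize();
        slots[findSlot(word)] = word;
        count++;
        return true;
    }

    // Grows to the next prime after doubling, then rehashes every word,
    // because each word's slot depends on the new table size M.
    private void resize() {
        String[] old = slots;
        slots = new String[nextPrime(2 * old.length)];
        for (String w : old) {
            if (w != null) slots[findSlot(w)] = w;
        }
    }

    int size() { return count; }
    int capacity() { return slots.length; }
}
```

Starting from a table of size 5, adding a third distinct word triggers a resize to 11, the next prime after 10.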

16
Java code to look up
  • Writing to a file

import java.io.BufferedWriter;
import java.io.FileWriter;

try {
    // Create the file
    FileWriter fs = new FileWriter("out.txt");
    BufferedWriter out = new BufferedWriter(fs);
    out.write("word");
    // Close the output stream
    out.close();
} catch (Exception e) { // Catch exception, if any
    System.err.println("Error: " + e.getMessage());
}
17
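Java code to look up
  • Read from console

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

try {
    InputStreamReader isr = new InputStreamReader(System.in);
    BufferedReader br = new BufferedReader(isr);
    String s = br.readLine();
} catch (IOException e) {
    numberFromConsole = 0; // variable declared elsewhere in the program
}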
18
Java code to look up
  • Break a string into tokens

import java.util.StringTokenizer;

String s = "I am a university student";
StringTokenizer st = new StringTokenizer(s);
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
Output (one token per line): I, am, a, university, student
19
Java code to look up
  • Compare strings

Use .equals() and NOT ==

20
So far
  • Words are saved in a wordList object
  • The words in this object are written to the file
    dictionary.dat
  • The spellCheck program can use hashing to find
    a word efficiently

21
SpellCheck program
  • First step:
  • Read the list of words in the file dictionary.dat
    and store these in a wordList object
  • Also add valid single-letter words (A, I) to the
    wordList object
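
A minimal sketch of this first step follows. It assumes that dictionary.dat holds one word per line (the file format is not specified here) and uses java.util.HashSet as a stand-in for the wordList class; DictionaryLoader and load are hypothetical names.

```java
import java.io.BufferedReader;
import java.util.HashSet;
import java.util.Set;

class DictionaryLoader {
    // Reads one word per line (an assumed file format), skips blank lines,
    // and then adds the valid single-letter words A and I.
    static Set<String> load(BufferedReader in) {
        Set<String> words = new HashSet<>(); // stand-in for a wordList object
        in.lines()
          .map(String::trim)
          .filter(w -> !w.isEmpty())
          .forEach(words::add);
        words.add("A");
        words.add("I");
        return words;
    }
}
```

In the real program the reader would wrap the dictionary file, for example new BufferedReader(new FileReader("dictionary.dat")), and the same letter case must be used when words are stored and when they are looked up.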

22
SpellCheck Program
  • Prompt the user for the name of a document she
    wants spell-checked
  • Read words one by one and check which words
    appear in the dictionary

23
Java code to look up
  • Prompt the user and read from terminal

try {
    System.out.print(prompt);
    return new BufferedReader(new InputStreamReader(System.in)).readLine();
} catch (IOException e) {
    e.printStackTrace();
    return null; // no line could be read
}
24
Correcting documents
  • Input: input file (may have misspelled words) and the dictionary
  • Processing: spellCheck program
  • Output: output file (no misspelled words)
25
Checking words
  • What happens when a word is encountered that is
    not present in the dictionary?
  • Display the word at the terminal
  • Then show the following prompt: Do you want to
    replace (R), replace all (P), ignore (I), ignore
    all (N), or exit (E)?
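
One way to turn the user's reply to this prompt into a decision is sketched below. The enum and method names are hypothetical, and accepting lower-case replies or replies with extra characters is an assumption beyond what the prompt states.

```java
class PromptChoice {
    enum Action { REPLACE, REPLACE_ALL, IGNORE, IGNORE_ALL, EXIT }

    // Maps the first character of the reply to an action. Returns null
    // for an unrecognized reply so the caller can show the prompt again.
    static Action parse(String reply) {
        if (reply == null || reply.trim().isEmpty()) {
            return null;
        }
        switch (Character.toUpperCase(reply.trim().charAt(0))) {
            case 'R': return Action.REPLACE;
            case 'P': return Action.REPLACE_ALL;
            case 'I': return Action.IGNORE;
            case 'N': return Action.IGNORE_ALL;
            case 'E': return Action.EXIT;
            default:  return null;
        }
    }
}
```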

26
wordList object structure
  • Each item in a wordList object consists of two
    words:
  • (i) an incorrect word and
  • (ii) a replacement word

27
wordList object structure
Misspelled words (Primary words) → Dictionary words (Replacement words)
  • colour → color
  • grammer → grammar
  • joyfull → joyful
Each pair is an element in the wordList object
28
Replace All
Misspelled word (Primary word) → Dictionary word (Replacement word)
  • colour → color
29
Ignore All
Misspelled word (Primary word) → Dictionary word (Replacement word)
  • colour → colour
Both words are identical!
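
The Replace All and Ignore All cases can share one lookup: both store a pair, and Ignore All simply stores the word as its own replacement. The sketch below uses java.util.HashMap as a stand-in for the wordList object, and ReplacementTable with its method names is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class ReplacementTable {
    // Primary (misspelled) word -> replacement word.
    // A HashMap stands in for the wordList object here.
    private final Map<String, String> pairs = new HashMap<>();

    // Replace All: remember the chosen replacement for this word.
    void replaceAll(String misspelled, String replacement) {
        pairs.put(misspelled, replacement);
    }

    // Ignore All: store the word as its own replacement.
    void ignoreAll(String misspelled) {
        pairs.put(misspelled, misspelled);
    }

    // Returns the word to write to the output file, or null if no
    // decision has been stored yet and the user must be prompted.
    String lookup(String word) {
        return pairs.get(word);
    }
}
```

With this arrangement the spellCheck loop needs no special case for ignored words: whatever lookup returns is written to the output file.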
30
Correcting documents
  • Input: input file (may have misspelled words), the dictionary, and the replacement words
  • Processing: spellCheck program
  • Output: output file (no misspelled words)
31
Hint!! (based on the previous slides)
  • Make a class, e.g., wordClass, with two String
    fields, e.g., primaryWord and secondaryWord
  • A wordClass object will be a wordList object
    element
  • Let's see how it looks.

32
Hint!!
In buildDict.java, the dictionary is a wordList
object. Each element of the object contains two
strings.
  • Primary word is the dictionary word (e.g., color)
  • Secondary word is blank
33
Hint!!
In spellCheck.java, replacementWords is a
wordList object. Each element of the object
contains two strings.
  • Primary word is the misspelled word (e.g., colour)
  • Secondary word is the dictionary word (e.g., color)
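
The two hints suggest an element type along these lines. The class and field names come from the hints; the two factory methods and isIgnored are hypothetical helpers added for illustration, and representing "blank" as the empty string is an assumption.

```java
class wordClass {
    final String primaryWord;
    final String secondaryWord;

    wordClass(String primaryWord, String secondaryWord) {
        this.primaryWord = primaryWord;
        this.secondaryWord = secondaryWord;
    }

    // buildDict.java: the primary word is the dictionary word and the
    // secondary word is blank (assumed here to be the empty string).
    static wordClass dictionaryEntry(String word) {
        return new wordClass(word, "");
    }

    // spellCheck.java: the primary word is the misspelled word and the
    // secondary word is the dictionary word that replaces it.
    static wordClass replacementEntry(String misspelled, String replacement) {
        return new wordClass(misspelled, replacement);
    }

    // True for an "ignore all" entry, where both words are identical.
    boolean isIgnored() {
        return primaryWord.equals(secondaryWord);
    }
}
```

Since the wordList is searched by primary word, hashing and comparison would use primaryWord only.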
34
Testing
  • Finding a good hash function is important!
    Repeatedly experiment to find the right M, and
    the constants.
  • Experiment with different input files
  • Try out different user inputs
  • Misspell your words a lot!
  • Try different options each time you encounter a
    misspelled word

35
Testing
  • Do vigorous testing! This is the only way to have
    an accurate and improved program
  • You have 3 weeks to work on the project.
    Remember, you will need 2/3 of the time for
    testing!

36
Documentation
  • You should document your code extensively
  • Consider using the Javadoc style of comments

37
Questions?
  • We are here to help you! Email us, come to our
    office hours, or ask questions in class!