Title: BP1110: Close Enough Indexed Record Retrieval In Progress Using Soundalikes and Near Matches
1 BP1110: Close Enough Indexed Record Retrieval in Progress Using Sound-alikes and Near Matches
- Steve Southwell (ses_at_bravepointdallas.com)
- Senior Consultant
- BravePoint, Inc.
2 The Problem - User Perspective
- Users expect intuitive text searches.
- Google and other consumer-oriented web sites have raised the bar.
- Find what I'm looking for, not what I typed.
- It's not my problem if I'm a bad speller.
- Oh yeah... Put the most interesting results at the top of the list.
3 Types of Searches Where Close Counts
4 Target Smart Searching Example
User Can't Spell!
5 Amazon Smart Searching Example
User Can't Spell!
6 Types of Searches Where Close Counts
- Product Searches
- Searches for Proper Names
7 Yellow Pages Smart Searching
User Can't Spell!
8 Types of Searches Where Close Counts
- Product Searches
- Searches for Proper Names
- Full-text Searches
9 Google Smart Searching Example
User Can't Spell!
10 AltaVista Smart Searching Example
User Can't Spell!
11 The Problem - Developer Perspective
- Internal users need quick results. Time is money.
- If customers want to buy, I'll help them find it.
- If they can't spell it, we still sell it.
- A widget by any other name... It's still for sale.
- List the good stuff first.
12 Technical Issues
- How can Progress store what a word sounds like?
- How do I search for sound-alikes or similar words?
- How can I rank search results?
13 Determining What a Word Sounds Like
- Soundex
- Used to index US Census records from the 1880 census onward
- Intended to index surnames
- Only codes starting letter and 3 sounds
- Had to be simple enough to do by hand.
1  B, P, F, V                4  L
2  C, S, K, G, J, Q, X, Z    5  M, N
3  D, T                      6  R
14 Soundex Examples
1  B, P, F, V                4  L
2  C, S, K, G, J, Q, X, Z    5  M, N
3  D, T                      6  R
- Last Name: Southwell
- Soundex: S340
- First letter: S
- Next consonant: T = 3
- H and W are not represented.
- Next consonant: L = 4
- Next L is a double: skip
- Pad with 0
Other S340 Names: Seidl, Steele, Staley, Stahl, Stahley, Seidel, Settle, Shadle, Shotwell, Shuttle, Sidwell, Southall, Stall, Steel, Steely, Stell, Still, Stoll, Stowell, Stull, Sudlow, Suttle
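The hand-coding rules above can be sketched in a few lines. This is a minimal illustration written in Python rather than Progress 4GL so that it can be run anywhere; the function name `soundex` and the table layout are choices made for this sketch, not part of the presentation's source code. It assumes the common convention that H and W do not separate two letters with the same code, while vowels do.

```python
# Letter groups from the Soundex table on the slide.
GROUPS = {"1": "BPFV", "2": "CSKGJQXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
CODES = {ch: digit for digit, letters in GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """Return the starting letter plus three sound digits, padded with 0."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")  # code of the previously coded letter
    for ch in letters[1:]:
        code = CODES.get(ch, "")      # vowels, H, W and Y have no code
        if code and code != prev:     # skip doubles such as the second L
            result += code
        if ch not in "HW":            # assumption: H and W do not break a double
            prev = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")       # pad with 0


print(soundex("Southwell"))  # S340
print(soundex("Shotwell"))   # S340
```

Every name in the "Other S340 Names" list above produces S340 with this sketch, which shows both the strength of Soundex (forgiving) and its weakness (very coarse).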
15 src/samples/soundex.p
/* Returns the starting letter of name plus up to three sound digits. */
DEFINE INPUT  PARAMETER name AS CHARACTER NO-UNDO.
DEFINE OUTPUT PARAMETER code AS CHARACTER NO-UNDO.
DEFINE VARIABLE e AS INTEGER   NO-UNDO. /* letter position, A = 1 */
DEFINE VARIABLE i AS INTEGER   NO-UNDO. /* index into name */
DEFINE VARIABLE k AS CHARACTER NO-UNDO. /* digit for the current letter */
DEFINE VARIABLE l AS CHARACTER NO-UNDO. /* digit for the previous letter */
ASSIGN
  l    = ""
  name = CAPS(name)
  code = SUBSTRING(name,1,1).
DO i = 2 TO LENGTH(name):
  e = ASC(SUBSTRING(name,i,1)) - 64.
  IF e >= 1 AND e <= 26 THEN
    k = SUBSTRING("01230120022455012623010202",e,1).
  IF k <> l AND k <> "0" THEN code = code + k.
  l = k. /* not on the slide; assumed so that doubled sounds are skipped */
  IF LENGTH(code) > 3 THEN LEAVE.
END.
16 Soundey
- More sound codes
- Indexes vowel positions
- Codes the entire word
- Makes phonetic substitutions
0  aehiouwy    5  mn
1  bp          6  r
2  ckqx        7  fv
3  dt          8  gj
4  l           9  sz
17 Soundey Continued
- Soundeylib.i available free at www.FreeFrameWork.org
- More sophisticated than Soundex
0  aehiouwy    5  mn
1  bp          6  r
2  ckqx        7  fv
3  dt          8  gj
4  l           9  sz
18 Steps in Soundey Conversion
- Pre-token
- Mark word boundaries
- Anywhere translations
- Ends translations
- Begins translations
- Eliminate silent E
- Unmark word boundaries
- Translate characters to digits
- Eliminate double digits
19 Soundey Example
- Word: Telephone, Soundey: 3040705
- Replace 'ph' with 'f': telefone
- Eliminate silent 'e' on the end: telefon
- Translate characters to digits
- T = 3, E = 0, L = 4, E = 0, F = 7, O = 0, N = 5
- 3040705
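The Telephone walk-through can be reproduced with a small sketch. It is written in Python for easy experimentation and is not the Soundeylib.i implementation: the function name `to_soundey_sketch` is hypothetical, the only "anywhere" translation included is the 'ph' to 'f' rule from the slide, the silent-e rule is reduced to "drop a trailing e", and word-boundary marking is left out because the sketch handles one word at a time.

```python
# Digit table from the Soundey slides: the list index is the digit.
GROUPS = ["aehiouwy", "bp", "ckqx", "dt", "l", "mn", "r", "fv", "gj", "sz"]
DIGIT = {ch: str(d) for d, letters in enumerate(GROUPS) for ch in letters}


def to_soundey_sketch(word: str) -> str:
    """Simplified, single-word approximation of the Soundey steps."""
    w = word.lower()
    w = w.replace("ph", "f")            # "anywhere" translation (only rule shown)
    if len(w) > 2 and w.endswith("e"):  # eliminate silent e (simplified)
        w = w[:-1]
    digits = [DIGIT[c] for c in w if c in DIGIT]  # translate characters to digits
    out = []
    for d in digits:                    # eliminate double digits
        if not out or out[-1] != d:
            out.append(d)
    return "".join(out)


print(to_soundey_sketch("Telephone"))  # 3040705
print(to_soundey_sketch("telefon"))    # 3040705
```

Because vowels are coded as 0 instead of being dropped, the code keeps the position of each vowel sound, which is the main difference from Soundex.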
20 Technical Issues
- How can Progress store what a word sounds like?
- How do I search for sound-alikes or similar words?
- How can I rank search results?
21 Using Soundey
- Make necessary database modifications
22 Database Mods for Soundey
- Add extra fields: 2 per target field
- SoundeyCode: Straight Soundey translation
- SoundFragList: Allow matching on beginning or end of target word.
- Word-indexes on above fields
- Add soundey.df: data definitions needed for Soundeylib.i
- Load data for Soundey tables.
23 Using Soundey
- Make necessary database modifications
- Set up code to make search target fields Soundeyized
24 Populating Soundey Fields in DB
{lib/soundeylib.i}
...
FOR EACH ITEM EXCLUSIVE-LOCK:
  ASSIGN ITEM.soundeyCode = toSoundey(ITEM.ItemName + " " + ITEM.CatDescription).
END.
...
25 Using Soundey
- Make necessary database modifications
- Set up code to make search target fields Soundeyized
- Use Soundey in 4GL queries
26 Using Soundey in 4GL Queries
MySearch = toSoundey(MySearch).
FOR EACH ITEM
  WHERE ITEM.SoundeyCode CONTAINS MySearch NO-LOCK:
  ...
END.
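To show why this query finds misspelled input, the following Python sketch imitates a word index on the code field with an in-memory dictionary. Everything here is an assumption made for illustration: `code_word` is a cut-down stand-in for toSoundey (digit table and double removal only), `ItemIndex` is a hypothetical class, and a multi-word query is treated as requiring every coded word to be present.

```python
from collections import defaultdict

GROUPS = ["aehiouwy", "bp", "ckqx", "dt", "l", "mn", "r", "fv", "gj", "sz"]
DIGIT = {ch: str(d) for d, letters in enumerate(GROUPS) for ch in letters}


def code_word(word):
    """Cut-down stand-in for toSoundey: digit table plus double removal."""
    out = []
    for ch in word.lower():
        d = DIGIT.get(ch)
        if d is not None and (not out or out[-1] != d):
            out.append(d)
    return "".join(out)


def code_text(text):
    """Code each word separately so the result can be word-indexed."""
    return " ".join(code_word(w) for w in text.split())


class ItemIndex:
    """Hypothetical in-memory imitation of a word index on SoundeyCode."""

    def __init__(self):
        self.by_code = defaultdict(set)   # coded word -> item numbers

    def add(self, item_num, name):
        for code in code_text(name).split():
            self.by_code[code].add(item_num)

    def search(self, query):
        codes = code_text(query).split()
        if not codes:
            return []
        hits = set.intersection(*(self.by_code.get(c, set()) for c in codes))
        return sorted(hits)


index = ItemIndex()
index.add(1, "Tennis Racket")
index.add(2, "Golf Shoes")
print(index.search("tenis rackit"))  # [1]
print(index.search("golf shoos"))    # [2]
```

The stored names and the search string go through the same conversion, so the comparison happens entirely between codes and the original spelling no longer matters.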
27 Soundey Use in 4GL
- Demo of Sports2000 item search with Soundey: itemsearch1.w
28 General Soundey Query Tips
- Try regular contains search first.
- Convert search string to Soundey code, and do contains search on Soundey code field.
- Try Split and Rejoin
- Other alternatives
- Synonym and Related word searches
- Neural Networks with User Feedback
- Forced Ranking
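The first two tips describe a cascade: exact words first, sound-alike codes only when that finds nothing. A minimal Python sketch of that order follows. The function `search_with_fallback`, the `items` dictionary, and the simplified `code_word` coding are all hypothetical stand-ins, not the presentation's code; the split-and-rejoin step and the other alternatives are not covered.

```python
GROUPS = ["aehiouwy", "bp", "ckqx", "dt", "l", "mn", "r", "fv", "gj", "sz"]
DIGIT = {ch: str(d) for d, letters in enumerate(GROUPS) for ch in letters}


def code_word(word):
    """Cut-down stand-in for toSoundey: digit table plus double removal."""
    out = []
    for ch in word.lower():
        d = DIGIT.get(ch)
        if d is not None and (not out or out[-1] != d):
            out.append(d)
    return "".join(out)


def search_with_fallback(items, query):
    """Return (strategy, item numbers): exact words first, then coded words."""
    q_words = query.lower().split()
    if not q_words:
        return ("none", [])
    # Step 1: regular contains-style search on the plain words.
    exact = [num for num, name in items.items()
             if all(w in name.lower().split() for w in q_words)]
    if exact:
        return ("exact", sorted(exact))
    # Step 2: same test, but on the coded query against the coded names.
    q_codes = [code_word(w) for w in q_words]
    close = [num for num, name in items.items()
             if all(c in [code_word(n) for n in name.split()] for c in q_codes)]
    if close:
        return ("soundey", sorted(close))
    return ("none", [])


items = {1: "Tennis Racket", 2: "Golf Shoes"}
print(search_with_fallback(items, "golf"))   # ('exact', [2])
print(search_with_fallback(items, "tenis"))  # ('soundey', [1])
```

Running the exact search first keeps correctly spelled queries precise, and the looser match is only paid for when it is needed.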
29 Soundey Extensibility
- Can make it replace known words or fragments
- Anywhere
- Beginning of words
- Ending of words
- GUI demonstration
30 Other Search Issues
- Numbers and Ordinals
- 29 Palms / Twentynine Palms
- 5th Inning / Fifth Inning
- Abbreviations / Slang
- Ft. Worth, TX / Fort Worth, Texas
- Hyphens / Compound Words
- Word Synonyms
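One common way to handle the number, ordinal, and abbreviation cases is to normalize both the stored text and the search string before any sound coding. The Python sketch below assumes a small hand-made replacement table; the `REPLACEMENTS` entries and the `normalize` function are hypothetical examples built from the cases on the slide, not a complete list and not part of Soundeylib.i. Hyphens are simply treated as word separators, and compound words and synonyms are not handled.

```python
import re

# Hypothetical replacement table; a real one would be much larger.
REPLACEMENTS = {
    "29": "twentynine",
    "5th": "fifth",
    "ft": "fort",
    "tx": "texas",
}


def normalize(text):
    """Lower-case, drop punctuation, and expand known numbers and abbreviations."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(REPLACEMENTS.get(w, w) for w in words)


print(normalize("29 Palms"))       # twentynine palms
print(normalize("Ft. Worth, TX"))  # fort worth texas
```

After normalization, "Ft. Worth, TX" and "Fort Worth, Texas" become the same string, so either spelling can be indexed and searched with the same code.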
31 Technical Issues
- How can Progress store what a word sounds like?
- How do I search for sound-alikes or similar words?
- How can I rank search results?
32 Ranking Search Results
- Not an exact science
- Can use many criteria
- Number of word matches
- Similarity to key words
- Preferred results: upsells, recent additions, etc.
- Requires use of temp-table for results.
- All results must be analyzed, so keep set small. (MAX-ROWS?)
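The criteria above can be combined into a single score per result. The Python sketch below is one possible weighting, not the method used in the demonstration: the `score` and `rank` functions, the 0.5 boost for preferred items, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions. The list of rows plays the role of the temp-table, and `max_rows` caps the set that has to be analyzed.

```python
from difflib import SequenceMatcher


def score(query, name, preferred=False):
    """Word matches + best word-to-word similarity + optional preferred boost."""
    q_words = query.lower().split()
    n_words = name.lower().split()
    matches = sum(1 for w in q_words if w in n_words)
    similarity = max(
        (SequenceMatcher(None, q, n).ratio() for q in q_words for n in n_words),
        default=0.0,
    )
    return matches + similarity + (0.5 if preferred else 0.0)


def rank(query, candidates, max_rows=50):
    """Score at most max_rows candidates and return them best first."""
    rows = [(score(query, name, pref), num, name)
            for num, name, pref in candidates[:max_rows]]
    rows.sort(key=lambda r: (-r[0], r[1]))  # highest score first, then item number
    return [(num, name) for _, num, name in rows]


candidates = [
    (1, "Golf Balls", False),
    (2, "Golf Shoes", False),
    (3, "Golf Shoes Deluxe", True),  # a preferred result, such as an upsell
    (4, "Ski Boots", False),
]
for num, name in rank("golf shoes", candidates):
    print(num, name)
```

With these inputs the preferred item comes first, followed by the plain two-word match, the one-word match, and the unrelated item. Changing the weights changes the order, which is why ranking is not an exact science.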
33 Search Ranking Demonstration
34 Technical Issues
- How can Progress store what a word sounds like?
- How do I search for sound-alikes or similar words?
- How can I rank search results?
35 Source Code Availability
- All source code used in this presentation can be found at the FreeFrameWork website: http://www.freeframework.org
- Up-to-date copy of this presentation available with the source code at the FreeFrameWork site.