Title: Human Language Technology
1Human Language Technology
2Acknowledgements
- John Repici (2002), http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
- Porter, M.F., 1980, An algorithm for suffix
stripping, reprinted in Sparck Jones, Karen, and
Peter Willett, 1997, Readings in Information
Retrieval, San Francisco: Morgan Kaufmann, ISBN
1-55860-454-4. Vince has a copy of this.
- Jurafsky & Martin, appendix B, pp. 833-836.
3Conflation
COMPUT
COMPUTE
COMPUTER
COMPUTES
COMPUTING
COMPUTABILITY
COMPUTATION
4Types of Conflation Algorithm
- Stemming
- Process based - e.g. affix stripping
- Lemmatisation
- Attempt to map to same lemma
- POS dependent
- Morphological Analysis
- Includes morpho-syntactic information
5Word Conflation Algorithms
- Morphological analysis versus conflation
- Notion of word class used is application dependent
- Genealogy: phonetic similarity
- Information Retrieval: semantic similarity
- Based on written language (not phonetic transcription)
- Well known algorithms
- Soundex
- Porter
6Soundex: Problems with Names
- Names can be misspelt: Rossner
- Same name can be spelt in different ways: Kirkop, Chircop
- Same name appears differently in different cultures: Tchaikovsky, Chaicowski
- To solve this problem, we need phonetically
oriented algorithms which can find similar
sounding terms and names.
- Just such a family of algorithms exists; they are
called SoundExes, after the first patented
version.
7The Soundex Algorithm
- A Soundex algorithm takes a word as input and
produces a character string which identifies a
set of words that are (roughly) phonetically
alike.
- It is very handy for searching large databases.
- Originally developed in 1918 by Margaret K. Odell
and Robert C. Russell of the US Bureau of
Archives, to simplify census-taking.
8Soundex Algorithm 1
- The Soundex Algorithm uses the following
steps to encode a word:
- The first character of the word is retained as
the first character of the Soundex code.
- The following letters are discarded:
a, e, i, o, u, h, w, and y.
- Remaining consonants are given a code number.
- If consonants having the same code number appear
consecutively, the number will only be coded
once (e.g. "B233" becomes "B23").
9Code Numbers
b, p, f, and v 1
c, s, k, g, j, q, x, z 2
d, t 3
l 4
m,n 5
r 6
10Soundex Algorithm Example
- The Soundex Algorithm uses the following
steps to encode a word, here ROSNER:
- The first character of the word is retained as
the first character of the Soundex code: R
- The following letters are discarded:
a, e, i, o, u, h, w, and y. Result: RSNR
- Remaining consonants are given a code number: R256
- If consonants having the same code number appear
consecutively, the number will only be coded
once (e.g. "B233" becomes "B23"). Result: R256 (unchanged)
11Soundex Algorithm 2
- The resulting code is modified so that it becomes
exactly four characters long.
- If it is less than 4 characters, zeroes are added
to the end (e.g. "B2" becomes "B200").
- If it is more than 4 characters, the code is
truncated (e.g. "B2435" becomes "B243").
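The steps on the two Soundex slides can be sketched in Python as follows. This is a minimal illustration that follows the slides literally; the function name `soundex` and the table name `CODES` are chosen here for illustration, and published Soundex variants differ in details (for example in how h and w, or the code of the first letter, are treated).

```python
# Code numbers from the "Code Numbers" slide.
CODES = {}
for letters, digit in [("bpfv", "1"), ("cskgjqxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Encode a word following the steps on the slides (a sketch, not a reference implementation)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    # Step 1: the first character is retained as is.
    first = word[0].upper()
    # Steps 2 and 3: letters without a code (a, e, i, o, u, h, w, y) are
    # discarded; the remaining consonants are replaced by their code number.
    digits = [CODES[ch] for ch in word[1:] if ch in CODES]
    # Step 4: consecutive identical code numbers are coded only once.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Step 5: pad with zeroes or truncate to exactly four characters.
    return (first + "".join(collapsed) + "000")[:4]
```

With this sketch, `soundex("ROSNER")` and `soundex("Rossner")` both give `R256`, which matches the worked example on the slide.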
12Uses for the Soundex Code
- Airline reservations: the Soundex code for a
passenger's surname is often recorded to avoid
confusion when trying to pronounce it.
- U.S. Census: as is noted above, the U.S. Census
Department was a frequent user of the Soundex
algorithm while trying to compile a listing of
families around the turn of the century.
- Genealogy: in genealogy, the Soundex code is
most often used to avoid problems when dealing
with names that might have alternate spellings.
13Improvements
- Preprocessing before applying the basic
algorithm, e.g. identification of
- DG with G
- GH with H
- GN with N (not 'ng')
- KN with N
- PH with F
- Question: where to stop?
- Question: how to evaluate?
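The preprocessing idea can be sketched as a single left-to-right substitution pass applied before the basic Soundex encoding. The names `DIGRAPHS` and `preprocess` are hypothetical, and the sketch covers only the five identifications listed on the slide.

```python
import re

# Letter pairs from the slide and the single letter each is identified with.
DIGRAPHS = {"dg": "g", "gh": "h", "gn": "n", "kn": "n", "ph": "f"}
DIGRAPH_PATTERN = re.compile("|".join(DIGRAPHS))


def preprocess(word):
    """Replace each listed letter pair in one pass, so replacements do not cascade."""
    return DIGRAPH_PATTERN.sub(lambda m: DIGRAPHS[m.group()], word.lower())
```

For example, `preprocess("Philip")` gives `filip` and `preprocess("sign")` gives `sin`, while `sing` is left alone because the pair there is 'ng', not 'gn'.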
14IR Applications
- Information Retrieval: Query → Relevant Documents
- Bag of Terms document model
- What is a single term?
15Why Stemming is Necessary
- Frequently we get collections of words of the
following kind in the same document: compute,
computer, computing, computation, computability.
- Performance of an IR system will be improved if all
of these terms are conflated.
- Fewer terms to worry about
- More accurate statistics
16Issues
- Is a dictionary available?
- Stems
- Affixes
- Motivation: linguistic credibility or engineering
performance?
- When to remove an affix versus when to leave it
alone
- Porter (1980): W1 and W2 should be conflated if
there appears to be no difference between the
statements "this document is about W1" and
"this document is about W2", e.g. relate/relativity
vs. radioactive/radioactivity
17Consonants and Vowels
- A consonant is a letter other than a, e, i, o, u and
other than y preceded by a consonant. So the y in
sky is a vowel, whereas the y in toy is a consonant.
- If a letter is not a consonant it is a vowel.
- A sequence of consonants (cc..c) or vowels
(vv..v) will be represented by C or V
respectively.
- For example the word troubles maps to C V C V C
- Any word or part of a word therefore has one of
the following forms:
CVCV...C
CVCV...V
VCVC...C
VCVC...V
18Measure
- All the above patterns can be replaced by the
following regular expression: [C](VC)^m[V]
(square brackets mark optional parts; (VC)^m means VC repeated m times)
- m is called the measure of any word or word part.
- m=0: tr, ee, tree, y, by
- m=1: trouble, oats, trees, ivy
- m=2: troubles, private
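The consonant/vowel mapping and the measure can be sketched in Python as below. The function names `letter_classes`, `cv_form` and `measure` are illustrative, and the sketch assumes lower-case English input and the definition that y is a vowel only when preceded by a consonant.

```python
def letter_classes(word):
    """Label each letter 'c' (consonant) or 'v' (vowel)."""
    classes = ""
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou":
            classes += "v"
        elif ch == "y" and i > 0 and classes[i - 1] == "c":
            classes += "v"  # y preceded by a consonant counts as a vowel
        else:
            classes += "c"
    return classes


def cv_form(word):
    """Collapse runs of consonants or vowels into single C or V symbols."""
    form = ""
    for symbol in letter_classes(word).upper():
        if not form or form[-1] != symbol:
            form += symbol
    return form


def measure(word):
    """Return m, the number of VC pairs in the collapsed form [C](VC)^m[V]."""
    return cv_form(word).count("VC")
```

Here `cv_form("troubles")` gives `CVCVC`, and `measure` gives 0 for tree, 1 for trouble and 2 for troubles, in line with the examples on the slide.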
19Rules
- Rules for removing a suffix are given in the
form: (condition) S1 → S2
- i.e. if a word ends with suffix S1, and the stem
before S1 satisfies the condition, then it is
replaced with S2.
- Example rule: (m > 1) EMENT → ε (the empty string)
- Example: enlargement → enlarg
20Conditions
- S - stem ends with s
- Z - stem ends with z
- T - stem ends with t
- v - stem contains a vowel
- d - stem ends with a double consonant
- o - stem ends cvc, where second c is not w, x
or y, e.g. -wil, -hop
- In conditions, Boolean operators are possible,
e.g. (m > 1 and (S or T))
- Sets of rules applied in 7 steps. Within each
step, the rule matching the longest suffix applies.
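The conditions and the longest-suffix convention can be sketched as small Python predicates plus a rule applier. All names here (`letter_classes`, `measure`, `contains_vowel`, `ends_double_consonant`, `ends_cvc`, `apply_rules`, `EXAMPLE_RULES`) are illustrative, and the sketch assumes that when the longest matching suffix fails its condition, no shorter rule in the same step is tried.

```python
def letter_classes(stem):
    """Label each letter 'c' or 'v'; y is a vowel only after a consonant."""
    classes = ""
    for i, ch in enumerate(stem.lower()):
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and classes[i - 1] == "c")
        classes += "v" if is_vowel else "c"
    return classes


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    classes = letter_classes(stem)
    return sum(1 for a, b in zip(classes, classes[1:]) if a == "v" and b == "c")


def contains_vowel(stem):          # condition v
    return "v" in letter_classes(stem)


def ends_double_consonant(stem):   # condition d
    return len(stem) >= 2 and stem[-1] == stem[-2] and letter_classes(stem)[-1] == "c"


def ends_cvc(stem):                # condition o
    return letter_classes(stem).endswith("cvc") and stem[-1] not in "wxy"


def apply_rules(word, rules):
    """Apply one step; rules are (condition, S1, S2) triples."""
    for condition, s1, s2 in sorted(rules, key=lambda r: len(r[1]), reverse=True):
        if word.endswith(s1):
            stem = word[:len(word) - len(s1)]
            return stem + s2 if condition(stem) else word
    return word


# A few illustrative rules, including a Boolean combination of conditions.
EXAMPLE_RULES = [
    (lambda s: measure(s) > 1, "ement", ""),
    (lambda s: measure(s) > 1, "ment", ""),
    (lambda s: measure(s) > 1 and s.endswith(("s", "t")), "ion", ""),
]
```

With these rules, `apply_rules("enlargement", EXAMPLE_RULES)` gives `enlarg`, while `cement` is left unchanged because the stem `c` has measure 0.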
21Organisation
Step 1: Plurals and Third Person Singular Verbs (-s)
Step 2: Verbal Past Tense and Progressive (-ed, -ing)
Step 3: Y to I, Noun Inflections (fly/flies)
Steps 4 and 5: Derivational Morphology, Multiple Suffixes (visualisation → visualise)
Step 6: Derivational Morphology, Single Suffixes
Step 7: Cleanup
22Step 1: Plural Nouns and 3rd Person Singular Verbs
condition   rewrite       example
            SSES → SS     caresses → caress
            IES → I       ponies → poni
            SS → SS       caress → caress
            S → ε         cats → cat
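Step 1 has no conditions, so it can be sketched as a plain suffix table in which the longest matching suffix wins. The names `STEP_1_RULES` and `step_1` are illustrative.

```python
# (suffix, replacement) pairs from the Step 1 table.
STEP_1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step_1(word):
    """Apply the rule with the longest matching suffix, if any."""
    for suffix, replacement in sorted(STEP_1_RULES, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word
```

The SS → SS rule looks redundant, but it is what stops `caress` from losing its final s: because it matches a longer suffix than S, it takes priority.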
23Step 2a: Verbal Past Tense and Progressive Forms
condition   rewrite       example
(m > 0)     EED → EE      feed → feed, agreed → agree
(v)         ED → ε        plastered → plaster, bled → bled
(v)         ING → ε       killing → kill, sing → sing
24Step 2b: Cleanup (if the 2nd or 3rd rule of the last step succeeds)
condition                     rewrite           example
                              AT → ATE          generat → generate
                              BL → BLE          troubl → trouble
                              IZ → IZE          capsiz → capsize
d and not (L or S or Z)       → single letter   hopp → hop, hiss → hiss
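Steps 2a and 2b can be sketched together, since the cleanup only runs when ED or ING has actually been removed. The function names are illustrative, the EED rule uses the condition (m > 0) from Porter (1980), and only the four cleanup rules shown in the table are included.

```python
def letter_classes(stem):
    """Label each letter 'c' or 'v'; y is a vowel only after a consonant."""
    classes = ""
    for i, ch in enumerate(stem.lower()):
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and classes[i - 1] == "c")
        classes += "v" if is_vowel else "c"
    return classes


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    classes = letter_classes(stem)
    return sum(1 for a, b in zip(classes, classes[1:]) if a == "v" and b == "c")


def cleanup(stem):
    """Step 2b: repair the stem left after removing ED or ING."""
    for suffix in ("at", "bl", "iz"):
        if stem.endswith(suffix):
            return stem + "e"              # AT -> ATE, BL -> BLE, IZ -> IZE
    double = len(stem) >= 2 and stem[-1] == stem[-2] and letter_classes(stem)[-1] == "c"
    if double and stem[-1] not in "lsz":
        return stem[:-1]                   # double consonant -> single letter
    return stem


def step_2(word):
    """Step 2a, followed by Step 2b when the ED or ING rule succeeds."""
    if word.endswith("eed"):               # longest suffix is checked first
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if "v" not in letter_classes(stem):
                return word                # condition (v) fails: leave the word alone
            return cleanup(stem)
    return word
```

For instance, `step_2("hopping")` gives `hop` via the double-consonant rule, while `step_2("hissing")` gives `hiss` because a final double s is exempt.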
25Step 3: Y to I
condition   rewrite       example
(v)         Y → I         happy → happi, cry → cry
26Step 4: Derivational Morphology I, Multiple Suffixes (excerpt)
Condition   Rewrite           Example
(m > 0)     ATIONAL → ATE     relational → relate
(m > 0)     TIONAL → TION     conditional → condition
(m > 0)     ENCI → ENCE       valenci → valence
(m > 0)     ABLI → ABLE       comfortabli → comfortable
(m > 0)     OUSLI → OUS       analogousli → analogous
(m > 0)     IZATION → IZE     vietnamization → vietnamize
(m > 0)     ATION → ATE       generation → generate
(m > 0)     ATOR → ATE        operator → operate
(m > 0)     ALISM → AL        formalism → formal
(m > 0)     IVENESS → IVE     pensiveness → pensive
(m > 0)     FULNESS → FUL     hopefulness → hopeful
(m > 0)     OUSNESS → OUS     callousness → callous
(m > 0)     ALITI → AL        formaliti → formal
(m > 0)     BILITI → BLE      possibiliti → possible
27Step 6: Derivational Morphology III, Single Suffixes
Condition                 Rewrite      Example
(m > 1)                   AL → ε       revival → reviv
(m > 1)                   ANCE → ε     allowance → allow
(m > 1)                   ENCE → ε     inference → infer
(m > 1)                   ER → ε       airliner → airlin
(m > 1)                   IC → ε       gyroscopic → gyroscop
(m > 1)                   ABLE → ε     adjustable → adjust
(m > 1)                   ANT → ε      irritant → irrit
(m > 1)                   EMENT → ε    replacement → replac
(m > 1)                   MENT → ε     adjustment → adjust
(m > 1)                   ENT → ε      dependent → depend
(m > 1 and (S or T))      ION → ε      adoption → adopt
(m > 1)                   OU → ε       homologou → homolog
(m > 1)                   ISM → ε      formalism → formal
(m > 1)                   ATE → ε      activate → activ
(m > 1)                   ITI → ε
28Porter Example
- INPUT: in the first focus area, integrated
projects shall help develop, principally, common
open platforms for software and services
supporting a distributed information and decision
systems for risk and crisis management
29Porter Output
Original Word Stemmed Word
first first
focus focu
area area
integrated integr
projects project
help help
develop develop
principally princip
common common
open open
platforms platform
software softwar
services servic
supporting support
distributed distribut
information inform
decision decis
systems system
risk risk
crisis crisi
management manag
30Stemming Errors
- Under-stemming
- the error of taking off too small a suffix
- croulons → croulon
- since croulons is a form of the verb crouler
- Over-stemming
- the error of taking off too much
- example: croûtons → croût
- since croûtons is the plural of croûton
- Mis-stemming
- taking off what looks like an ending, but is
really part of the stem
- reply → rep
31Summary
- Conflation serves different purposes.
- Generally, motivation is to achieve an
engineering goal rather than linguistic fidelity.
- This can cause errors in the bag of words model.
- Soundex and Porter are very well established and
easily available.