Title: Using String Similarity Metrics for Terminology Recognition
1. Using String Similarity Metrics for Terminology Recognition
- Jonathan Butters
- March 2008
- LREC 2008 Marrakech, Morocco
2. Introduction - Terms
- Objects are discrete; the terms people use to describe objects usually are not!
- Different groups of people tend to use different terms to refer to identical objects - sublanguage (Harris, 1968)
- Terms can differ due to orthographical differences, abbreviations, acronyms and synonyms
(Slide figure: the same ball is given different terms by different groups - Icosahedron, Football, Foot-ball, Soccerball, Footy - with the groups labelled Design, Personnel, Maintenance and other countries.)
3. Introduction - Relating Terms
- There are many applications where the ability to relate the different terms would be useful
- String similarity metrics can be used to relate terms
- String similarity metrics inherently take into consideration aspects such as:
- Word order
- Acronyms
- Abbreviations
(Slide examples of such applications: predictive text suggestions, matching component concepts, reducing lists of options.)
4. An Example Application
- We have a list of locations
- Some are similar
- Most are dissimilar (irrelevant)
- How do we choose the most similar?
- Top 10? Top 100? Top 1000?
5. Introduction - Selecting Terms
- Background In Aerospace Engineering
- Specialising in Avionics
- Electronic noise is a problem
- But can be filtered!
- Can dissimilar string matches be identified as noise?
- Can this noise be removed?... Automatically!
6. String Similarity Metrics
7. Introduction - Similarity Metrics
- String metrics automatically calculate how similar (or dissimilar) two strings are
- Two strings are identical if they have the same characters in the same order
- Each similarity measure assigns a numeric value based upon the relative similarity between the two strings
- Vector based
- Cost based (a minimal sketch contrasting the two styles follows below)
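To make the two styles concrete, here is a minimal, self-contained Java sketch (not the SimMetrics implementation; lower-casing and whitespace tokenisation are assumptions) of one cost-based metric, Levenshtein edit distance, and one vector/set-based metric, Jaccard similarity over tokens, run on the query and "lrec 2008" strings used on the following slides:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the two metric styles: a cost-based metric counts the edit
// operations needed to turn one string into the other, while a vector/set-based
// metric compares the strings as collections of tokens.
public class MetricStyles {

    // Cost-based: classic dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Vector/set-based: Jaccard similarity over whitespace-separated tokens
    // (tokenisation details are assumptions of this sketch).
    static double jaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        String query = "language resources and evaluation conference 2008";
        String b = "lrec 2008";
        // Raw edit distance 40 and token Jaccard ~0.1429, matching the
        // "String B" figures in the tables on the next slides.
        System.out.println("Levenshtein distance: " + levenshtein(query, b));
        System.out.println("Jaccard similarity:   " + jaccard(query, b));
    }
}
```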
8. Metric Selection - Examples
- Query string: language resources and evaluation conference 2008
- String A: language resources and evaluation conference 2009
- String B: lrec 2008

Metric Name          String A score   String B score
Levenshtein          1.0              40.0
Monge Elkan          0.9583333        0.46666667
Jaro Winkler         0.99455786       0.1
Euclidean Distance   1.4142135        2.4494898
Jaccard Similarity   0.71428573       0.14285715
9. Metric Selection - SimMetrics
- SimMetrics: Java library of 23 string similarity metrics
- Developed at the University of Sheffield (Chapman, 2004)
- Outputs a normalised similarity score!

Metric Name          String A score   String B score
Levenshtein          0.97959185       0.18367344
Monge Elkan          0.9583333        0.46666667
Jaro Winkler         0.99455786       0.1
Euclidean Distance   0.8333333        0.61270165
Jaccard Similarity   0.71428573       0.14285715
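The Levenshtein row above can be reproduced by normalising the raw edit distance by the longer string's length (the exact formula SimMetrics uses internally is an assumption here, but it matches the figures shown):

```latex
\mathrm{sim}(a, b) = 1 - \frac{d_{\mathrm{Lev}}(a, b)}{\max(|a|, |b|)},
\qquad 1 - \tfrac{1}{49} \approx 0.9796 \ (\text{String A}),
\qquad 1 - \tfrac{40}{49} \approx 0.1837 \ (\text{String B})
```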
10. Metric Selection
11. Metric Selection - Investigation
- Investigation focused on aerospace domain terms
- Reduce the list of components presented to the user
- 298 automatically extracted sublanguage engine component terms
- 513 official component terms
- The similarity of each combination of the 298 terms was calculated: 298C2 = 44,253 comparisons (see the sketch below)
- Carried out for each of the 23 metrics in SimMetrics
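As a rough sketch of the all-pairs comparison step (the similarity() call below is a placeholder, not the metric actually evaluated), the nested loop visits each unordered pair exactly once, i.e. 298 x 297 / 2 = 44,253 times for 298 terms:

```java
// Sketch of the exhaustive pairwise comparison of the 298 extracted terms.
// similarity() stands in for whichever of the 23 SimMetrics metrics is under test.
public class PairwiseComparison {

    static double similarity(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;   // placeholder metric for the sketch
    }

    public static void main(String[] args) {
        String[] terms = new String[298];            // the 298 extracted component terms
        java.util.Arrays.fill(terms, "dummy term");  // dummy values for the sketch
        int comparisons = 0;
        for (int i = 0; i < terms.length; i++) {
            for (int j = i + 1; j < terms.length; j++) {
                similarity(terms[i], terms[j]);      // each score would be stored and ranked
                comparisons++;
            }
        }
        System.out.println(comparisons);             // 298 * 297 / 2 = 44253
    }
}
```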
12. Metric Selection - Investigation
- For each metric, each string pair (and score) was ordered by decreasing similarity
- Few string pairs scored high results - wide similarity band
- The vast majority scored low
- Bands of similarity score were made, and the number of string pairs scoring within each band was totalled
- Distribution graphs were Gaussian or Dirac, depending on the scoring mechanism of the similarity metric
13. Metric Selection - Results
14. Metric Selection - Levenshtein
- Levenshtein was chosen because:
- Jaro-Winkler gave consistently, relatively high scores to unrelated strings
- Levenshtein grouped dissimilar strings further towards the lower end of the scale
- More similar strings are spread over a wider range
15. Metric Selection - Example
(Slide examples of term pairs: "Starter Valve" vs "Tail Bearing Oil Feed Tube ASSY.", and "Air Oil Heat Exchanger" vs "Air/Oil Heat Exchanger".)
16. Noise Detection & Removal
- The peak is formed by the strings that are dissimilar
- If two random strings are compared, they will have a random similarity score
- As there are many randomly similar string pairs, their scores form a Gaussian noise pattern...
- Approximately 100% of a normally distributed variable falls below four standard deviations above the mean
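For a normally distributed score this cut-off excludes essentially all of the noise: the standard normal CDF gives

```latex
P(X \le \mu + 4\sigma) = \Phi(4) \approx 0.99997
```

so only about 0.003% of purely random matches would be expected to score above the cut-off.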
17. Noise Detection & Removal
- Strings that scored outside the randomly distributed scores were... by definition, not randomly distributed!
- Strings that were not randomly distributed tended to include terms that were relevant to one another!
- The noise peak can be located and isolated by disregarding all similarities below four standard deviations above the mean (sketched below)
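A minimal sketch of that thresholding step (holding the scores in a plain array and using the population standard deviation are assumptions of the sketch, not details taken from the paper):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the noise cut-off: keep only similarity scores that lie more than
// four standard deviations above the mean of all scores obtained for a query.
public class NoiseCutOff {

    static double[] keepAboveCutOff(double[] scores) {
        double mean = 0.0;
        for (double s : scores) mean += s;
        mean /= scores.length;

        double variance = 0.0;
        for (double s : scores) variance += (s - mean) * (s - mean);
        variance /= scores.length;                        // population variance (an assumption)
        double cutOff = mean + 4.0 * Math.sqrt(variance); // mean + four standard deviations

        List<Double> kept = new ArrayList<>();
        for (double s : scores) if (s > cutOff) kept.add(s);
        double[] out = new double[kept.size()];
        for (int i = 0; i < out.length; i++) out[i] = kept.get(i);
        return out;
    }

    public static void main(String[] args) {
        // Toy data: 1,000 random "noise" scores in [0.05, 0.15] plus one genuine match at 0.95.
        java.util.Random rnd = new java.util.Random(42);
        double[] scores = new double[1001];
        for (int i = 0; i < 1000; i++) scores[i] = 0.05 + 0.10 * rnd.nextDouble();
        scores[1000] = 0.95;
        for (double s : keepAboveCutOff(scores)) System.out.println(s);  // prints only 0.95
    }
}
```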
18. Noise Detection & Removal
(Slide figure: a standard Gaussian (normal) distribution.)
19. Shorter Terms
- Although the dataset used contained mostly long strings, the noise removal method remains effective for the shorter strings within the dataset
- A shorter term makes only small, random matches against the longer, more typical strings - the longer strings' scores are now the randomly distributed part
- The mean similarity tends to be lower, and hence the cut-off similarity automatically reduces; similar shorter strings now fall above the automatic cut-off
20. Noise Detection & Removal
- Advantages of this automatic method:
- Scales with source data size
- Selecting the top 10 may include or exclude relevant results!
- Can be used to pick out strings that are more similar than, or stand out from, the rest of the strings
21. Results
- The 298 extracted terms were compared against each of the 513 official terms
- After noise was automatically removed, in some cases more than one relevant result was suggested; in these cases the first n results were considered, as follows:

n   Recall at n (%)   Precision at n (%)
1   86.84             86.84
2   90.13             88.83
3   92.67             89.04
4   98.67             91.56
5   99.40             92.08
22. Example List Reduction
- List of 10,028 unique UK locations
- Query checked against list
- Noise removed

Query             Automatic cut-off   Results above cut-off (% of list)
Bradford          0.59434813          83 (0.745)
Huddersfield      0.49659976          112 (1.005)
Chipping Norton   0.5526771           6 (0.054)
23. Conclusions
- Dissimilar string matches can be modelled as a noise pattern
- The noise pattern can be removed!
- The methodology is applicable to any set of strings
- Not only for aerospace domain terms!
- The method is scalable
- Can be used to automatically remove obviously incorrect matches
- Provides users with fewer options - faster selection!
- Can be used to extract strings that are more similar than, or stand out from, the rest
24. Future Work
- Integrate approach into many apps
- Form Filling
- Improved similarity metrics
- Domain specific datasets (Aerospace)
- Stop words, mutually exclusive words
- Combine metrics to break ties
25. Thank you
26. References
- Butters, Jonathan (2007). A Terminology Recognizer for the Aerospace Domain. Masters Thesis, The University of Sheffield. http://www.dcs.shef.ac.uk/jdb/papers.html
- Harris, Z. (1968). Mathematical Structures of Language. John Wiley & Sons, New York.
- Chapman, S. (2004). SimMetrics. http://www.dcs.shef.ac.uk/sam/simmetrics.html