1
Using String Similarity Metrics for Terminology
Recognition
  • Jonathan Butters
  • March 2008
  • LREC 2008 Marrakech, Morocco

2
Introduction - Terms
  • Objects are discrete; the terms people use to
    describe them usually are not!
  • Different groups of people tend to use different
    terms to refer to identical objects: a
    sublanguage (Harris, 1968)
  • Terms can differ through orthographical
    differences, abbreviations, acronyms and synonyms

(Diagram: variant terms for one object: Icosahedron,
Football, Foot-ball, Soccerball, Footy; user groups:
Design, Personnel, Maintenance, other countries!)
3
Introduction - Relating Terms
  • There are many applications where the ability to
    relate the different terms would be useful
  • String similarity metrics can be used to relate
    terms
  • String similarity metrics inherently take into
    consideration aspects such as
  • Word Order
  • Acronyms
  • Abbreviations

(Example applications: predictive text suggestions,
matching component concepts, reducing lists of options)
4
An Example Application
  • We have a list of locations
  • Some are similar
  • Most are dissimilar (irrelevant)
  • How do we choose the most similar?
  • Top 10? Top 100? Top 1000? Top 10%?

5
Introduction - Selecting Terms
  • Background in Aerospace Engineering
  • Specialising in Avionics
  • Electronic noise is a problem
  • But it can be filtered!
  • Can dissimilar string matches be identified as
    noise?
  • Can this noise be removed... automatically?

6
String Similarity Metrics
7
Introduction - Similarity Metrics
  • String metrics automatically calculate how
    similar (or dissimilar) two strings are
  • Two strings are identical if they have the same
    characters in the same order
  • Each similarity measure assigns a numeric value
    based on the relative similarity between the two
    strings
  • Two broad families of metric:
  • Vector based
  • Cost based
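The two families can be sketched in Python: Jaccard compares token sets (vector based), while Levenshtein counts edit operations (cost based). A minimal sketch for illustration, not SimMetrics' actual implementation:

```python
def jaccard(a: str, b: str) -> float:
    # Vector/token based: ratio of shared word tokens to all tokens.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def levenshtein(a: str, b: str) -> int:
    # Cost based: minimum number of single-character edits
    # (insertions, deletions, substitutions) turning a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(jaccard("air oil heat exchanger", "air/oil heat exchanger"))
print(levenshtein("air oil heat exchanger", "air/oil heat exchanger"))
```

Note how the two families disagree on the same pair: the "/" merges two tokens, hurting the token-set score, while it costs only a single character edit.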

8
Metric Selection - Examples
  • Query string: language resources and evaluation
    conference 2008
  • String A: language resources and evaluation
    conference 2009
  • String B: lrec 2008

Metric Name        | String A score | String B score
Levenshtein        | 1.0            | 40.0
Monge Elkan        | 0.9583333      | 0.46666667
Jaro Winkler       | 0.99455786     | 0.1
Euclidean Distance | 1.4142135      | 2.4494898
Jaccard Similarity | 0.71428573     | 0.14285715
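The normalised Levenshtein scores on the next slide can be reproduced by dividing the raw edit distance by the longer string's length. That 1 - d/max(|s|, |t|) formula is an assumption about SimMetrics' internals, but it reproduces the reported figures:

```python
def levenshtein(a: str, b: str) -> int:
    # Raw edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def similarity(s: str, t: str) -> float:
    # Normalise to [0, 1]: 1 - distance / length of the longer string
    # (assumed normalisation; it matches the slide's values).
    return 1 - levenshtein(s, t) / max(len(s), len(t))

query = "language resources and evaluation conference 2008"
a = "language resources and evaluation conference 2009"
b = "lrec 2008"
print(similarity(query, a))  # ~0.97959185, as on the slide
print(similarity(query, b))  # ~0.18367344, as on the slide
```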
9
Metric Selection - SimMetrics
  • SimMetrics: a Java library of 23 string
    similarity metrics
  • Developed at the University of Sheffield
    (Chapman, 2004)
  • Outputs a normalised similarity score!

Metric Name        | String A score | String B score
Levenshtein        | 0.97959185     | 0.18367344
Monge Elkan        | 0.9583333      | 0.46666667
Jaro Winkler       | 0.99455786     | 0.1
Euclidean Distance | 0.8333333      | 0.61270165
Jaccard Similarity | 0.71428573     | 0.14285715
10
Metric Selection
11
Metric Selection - Investigation
  • Investigation focused on Aerospace domain terms
  • Reduce list of components presented to user
  • 298 automatically extracted sublanguage engine
    component terms
  • 513 official component terms
  • The similarity of each combination of 298 terms
    was calculated... 298C2 44253 comparisons
  • Carried out for each of the 23 metrics in
    SimMetrics

12
Metric Selection - Investigation
  • For each metric - each string pair (and score)
    was ordered by decreasing similarity
  • Few string pairs scored high results - wide
    similarity band
  • Vast majority scored low scores
  • Bands of similarity score were made, the number
    of strings that scored within those bands were
    totalled
  • Distribution graphs were Gaussian or Dirac
  • Depending on the scoring mechanism of the
    similarity metric
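The all-pairs scoring and banding can be sketched as follows. The short term list and the `difflib` stand-in metric are illustrative only, not the 298-term dataset or SimMetrics:

```python
from collections import Counter
from itertools import combinations
from difflib import SequenceMatcher  # stand-in for a SimMetrics metric
from math import comb

# The study compared 298 extracted component terms:
# C(298, 2) = 44,253 unordered pairs.
assert comb(298, 2) == 44253

# Hypothetical terms for illustration:
terms = ["starter valve", "starter vlv", "oil feed tube",
         "air oil heat exchanger", "air/oil heat exchanger"]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

# Score every unordered pair, then total the pairs falling in each
# 0.1-wide similarity band.
scores = [similarity(a, b) for a, b in combinations(terms, 2)]
bands = Counter(int(s * 10) / 10 for s in scores)
for band in sorted(bands):
    print(f"{band:.1f}-{band + 0.1:.1f}: {bands[band]} pair(s)")
```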

13
Metric Selection - Results
  • Dirac distributions
  • Gaussian distributions

14
Metric Selection - Levenshtein
  • Levenshtein was selected because
  • Jaro-Winkler gave consistently high scores even
    to unrelated strings
  • Levenshtein grouped dissimilar strings further
    towards the lower end of the scale, leaving the
    more similar strings spread over a wider range

15
Metric Selection - Example
  • Starter Valve
  • Tail Bearing Oil Feed Tube ASSY.
  • Air Oil Heat Exchanger vs Air/Oil Heat Exchanger
16
Noise Detection & Removal
  • The peak is formed by the strings that are
    dissimilar
  • If two random strings are compared, they will
    have a random similarity score
  • As there are many randomly similar string pairs,
    their scores form a Gaussian noise pattern
  • Approximately 100% of a randomly distributed
    variable falls below four standard deviations
    above the mean

17
Noise Detection & Removal
  • Strings that scored outside the randomly
    distributed scores were, by definition, not
    randomly distributed!
  • Strings that were not randomly distributed tended
    to be terms that were relevant to one another!
  • The noise peak can therefore be located and
    isolated by disregarding all similarities below
    four standard deviations above the mean
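The four-standard-deviation cut-off can be sketched directly; `above_noise` and its input data are hypothetical helpers for illustration, not code from the paper:

```python
from statistics import mean, stdev

def above_noise(scores):
    # Model the bulk of random-pair scores as Gaussian noise and keep
    # only scores more than four standard deviations above the mean.
    mu, sigma = mean(scores), stdev(scores)
    cutoff = mu + 4 * sigma
    return cutoff, [s for s in scores if s > cutoff]

# Mostly noise-level scores plus one genuinely similar pair:
cutoff, kept = above_noise([0.1] * 100 + [0.95])
print(cutoff, kept)
```

Because the cut-off is computed from the score distribution itself, it adapts automatically to the data rather than relying on a fixed top-n.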

18
Noise Detection & Removal
(Figure: a standard Gaussian (normal) distribution)
19
Shorter Terms
  • Although the dataset used contained mostly long
    strings, noise removal method remains effective
    for shorter strings within the dataset
  • Shorter terms constitute a small, random match of
    longer and more typical strings
  • longer strings are now randomly distributed!
  • The mean similarity tends to be lower, and hence,
    the cut-off similarity automatically reduces, now
    similar shorter strings fall above the automatic
    cut off

20
Noise Detection & Removal
  • Advantages of this automatic method
  • Scales with the size of the source data
  • A fixed top 10 may include irrelevant results or
    exclude relevant ones!
  • Can be used to pick out strings that are more
    similar than, or stand out from, the rest of the
    strings

21
Results
  • The 298 extracted terms were compared against
    each of the 513 official terms
  • After noise was automatically removed, more than
    one relevant result was sometimes suggested; in
    those cases the first n results were considered,
    as follows

n | Recall at n (%) | Precision at n (%)
1 | 86.84           | 86.84
2 | 90.13           | 88.83
3 | 92.67           | 89.04
4 | 98.67           | 91.56
5 | 99.40           | 92.08
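The slides do not spell out the exact definition of recall at n; a common reading, which the rising figures are consistent with, is the fraction of queries whose correct official term appears among the first n suggestions. A hedged sketch with hypothetical data:

```python
def recall_at_n(ranked_results, gold, n):
    # ranked_results: query -> suggestions ordered by decreasing
    #                 similarity (after noise removal)
    # gold:           query -> the single correct official term
    # (hypothetical data layout, chosen for illustration)
    hits = sum(1 for q, ranked in ranked_results.items()
               if gold[q] in ranked[:n])
    return hits / len(ranked_results)

results = {"hp turbine": ["hp turbine rotor", "hp compressor"],
           "oil tube":   ["fuel tube", "oil feed tube"]}
gold = {"hp turbine": "hp turbine rotor", "oil tube": "oil feed tube"}
print(recall_at_n(results, gold, 1))  # only the first query hits at n=1
print(recall_at_n(results, gold, 2))  # both hit within the top 2
```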
22
Example List Reduction
  • List of 10,028 unique UK locations
  • Query checked against the list
  • Noise removed

Query           | Automatic cut-off | Results above cut-off
Bradford        | 0.59434813        | 83 (0.745)
Huddersfield    | 0.49659976        | 112 (1.005)
Chipping Norton | 0.5526771         | 6 (0.054)
23
Conclusions
  • Dissimilar string matches can be modelled as a
    noise pattern
  • The noise pattern can be removed!
  • Methodology is applicable to any set of strings
  • Not only for Aerospace domain terms!
  • Method is scalable
  • Can be used to automatically remove obviously
    incorrect matches
  • Provides users with fewer options: faster
    selection!
  • Can be used to extract strings that are more
    similar than, or stand out from, the rest

24
Future Work
  • Integrate the approach into further applications
  • Form filling
  • Improved similarity metrics
  • Domain-specific datasets (Aerospace)
  • Stop words, mutually exclusive words
  • Combine metrics to break ties

25
Thank you
26
Refs
  • Butters, Jonathan (2007). A Terminology
    Recognizer for the Aerospace Domain. Masters
    Thesis, The University of Sheffield.
    http://www.dcs.shef.ac.uk/jdb/papers.html
  • Harris, Z. (1968). Mathematical Structures of
    Language. John Wiley & Sons, New York.
  • Chapman, Sam. SimMetrics.
    http://www.dcs.shef.ac.uk/sam/simmetrics.html