Title: Natural Language Processing and KOS as an Aid to Retrieval
1Natural Language Processing and KOS as an Aid to Retrieval
- Iolo Jones, Daniel Cunliffe and Doug Tudhope
- Hypermedia Research Unit, School of Computing, University of Glamorgan
- ijones6, djcunlif, dstudhope _at_glam.ac.uk
- ISKO 2004
2Natural Language Processing and KOS as an Aid to
Retrieval
- KOS and Natural Language
- Review of relevant literature in the field of NLP
- Development of algorithms and experimental model: use of Roget's Thesaurus and WordNet as linguistic tools
- Evaluation of results to date and directions for future work
3KOS and Natural Language
- Past 150 years in LS and IS: increased specificity in design of controlled/normalized vocabularies
- Why reintroduce NL into the equation?
- Need pointed out by Smeaton (1997), Hearst and
Karadi (1997), Voorhees (1994).
4Linguistic Information in KOS
- Especially in thesaurus design
- Homographs disambiguated by parenthetical qualifier, e.g. Mercury (Greek God), Mercury (Metal) and Mercury (Planet) (Svenonius, 2000)
- Relations of BT, NT, equivalence and related term
- Scope notes: free-text glosses, forming data for linguistic exploitation
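As a rough illustration of how parenthetical qualifiers expose homographs in a thesaurus, the sketch below splits entries of the form "Term (Qualifier)" and groups the qualifiers under their base term. The regular expression, the function names and the unqualified entry "Bronze" are assumptions made for this example only; they are not taken from the AAT or from the system described in this talk.

```python
import re
from collections import defaultdict

# Matches "Term (Qualifier)". The pattern is an illustrative assumption about
# how qualified entries are written, not a rule taken from any real KOS.
QUALIFIED_TERM = re.compile(r"^(?P<term>.+?)\s*\((?P<qualifier>[^()]+)\)$")


def split_qualifier(entry):
    """Return (term, qualifier); qualifier is None when the entry has none."""
    text = entry.strip()
    match = QUALIFIED_TERM.match(text)
    if match is None:
        return text, None
    return match.group("term"), match.group("qualifier")


def group_homographs(entries):
    """Map each base term to the qualifiers that distinguish its senses."""
    senses = defaultdict(list)
    for entry in entries:
        term, qualifier = split_qualifier(entry)
        if qualifier is not None:
            senses[term].append(qualifier)
    # A term counts as a homograph here only if it has more than one qualified sense.
    return {term: quals for term, quals in senses.items() if len(quals) > 1}


# The Mercury entries come from the slide; "Bronze" is a hypothetical unqualified term.
entries = ["Mercury (Greek God)", "Mercury (Metal)", "Mercury (Planet)", "Bronze"]
homographs = group_homographs(entries)
```

Each qualifier is itself a short piece of natural language, so it can be fed to the same linguistic tools as a user query.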
5General developments in NLP - I
- Development of general linguistic tools steadily maintained since late 1940s
- Käding, 1897; Boas, 1940: early use of corpora
- Masterman, 1947: machine translation using machine-implemented knowledge base (using Roget)
- Early 1960s: AI techniques (e.g. Wilks, preference semantics)
- Early enthusiasm and optimism killed by 1966: ALPAC report led to funding cuts
6General developments in NLP - II
- Lexical approach (Amsler, 1989; Lesk, 1986)
- Decade from 1990: statistical language processing (Charniak, 1997)
- Use of large corpora: BNC, Penn Treebank
- Notable advances in POS tagging (Brill, 1995)
- and in WSD (Hearst, 1991)
7Specific applications
- NLIDB (Natural Language Interfaces to Databases) (Copestake and Sparck Jones, 1990; Androutsopoulos et al., 1995)
- E-commerce (Winiwarter and Ibrahim, 2000)
- Knowledge-based search (Clark et al., 2000)
- NL IR in digital libraries (Strzalkowski et al., 1996)
- Query paraphrase of a NL question (Radev et al., 2001)
- General conclusion
- NL techniques quite successful in specific,
limited domains
8Preparing Roget's Thesaurus for use as a general linguistic resource
- Non-proprietary download from Project Gutenberg: http://www.gutenberg.net
- 1911 edition + 1000 new words
- Needed corrections (e.g. "sowing one's mild oats") and a word-to-head index (but see recent work by Old)
- Rationale: 1) abstract universal classification, 2) provision of synonym sets (Wilks, 1998)
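A word-to-head index is essentially an inversion of the thesaurus: the text lists words under numbered heads, while lookup needs the heads for a given word. The sketch below shows one minimal way to build such an index. The dictionary layout and function name are assumptions; the head numbers and word pairings are the four shown later in this talk for AAT homographs, used here only as a tiny sample.

```python
from collections import defaultdict

# Head number -> words listed under that head.
# A tiny illustrative sample; a real index would be built from the full text.
heads = {
    45: ["anchor", "fastener"],
    225: ["apron", "skirt"],
    367: ["annual", "plant"],
    593: ["annual", "publication"],
}


def build_word_to_head_index(heads):
    """Invert head -> words into word -> sorted list of head numbers."""
    index = defaultdict(set)
    for head_number, words in heads.items():
        for word in words:
            # Lower-case so that lookup does not depend on capitalisation.
            index[word.lower()].add(head_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


word_to_heads = build_word_to_head_index(heads)
```

A word that maps to more than one head ("annual" in this sample) is a candidate for disambiguation.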
9Word Sense Disambiguation
- The problem
- "I thought it was rum" (Hodges)
- "Owing to the licensing regulations, children will not be served at the bar" (Electricity Club, Cardiff)
- "Man jailed in tomato case" (South Wales Echo)
- "Flying planes can be dangerous" (Chomsky)
10Using Roget's in implementing the Semantic Overlap process
- Starting point: parsing of NL string into POS and constituent phrases
- LINK Parser (Sleator and Temperley, 1993)
- Focus on Noun Phrases (NPs)
- Disambiguation of homographs within the parsed string
- especially those marked as homographs or bound terms within the KOS (using AAT)
11The Semantic Overlap process - I
- Take a known ambiguous word in the text string and assign to it a list of candidate Roget's categories; try to match these candidate categories with a list of candidate categories for any other noun phrases in the text string.
- Apply same process to homographs inside the KOS.
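The matching step can be pictured as a set intersection: each candidate category of the ambiguous word is scored by how many other words in the string also fall under it. The sketch below is one possible reading of that idea, not the implementation used in this work. The function names and the category labels ("arms", "association" and so on) are invented for illustration and are not real Roget's heads; the query words come from the example "I'm looking for clubs or sticks used for fighting".

```python
def semantic_overlap(ambiguous_word, context_words, word_to_categories):
    """Score each candidate category of ambiguous_word by shared context words."""
    candidates = word_to_categories.get(ambiguous_word, set())
    scores = {category: 0 for category in candidates}
    for word in context_words:
        # Categories this context word has in common with the ambiguous word.
        shared = candidates & word_to_categories.get(word, set())
        for category in shared:
            scores[category] += 1
    return scores


def best_categories(scores):
    """Return the top-scoring categories, or [] if nothing overlapped."""
    if not scores:
        return []
    top = max(scores.values())
    if top == 0:
        return []
    return sorted(category for category, score in scores.items() if score == top)


# Hypothetical word -> category index with made-up category labels.
index = {
    "club": {"arms", "association", "amusement"},
    "stick": {"arms", "adhesion", "support"},
    "fighting": {"arms", "contention"},
}

scores = semantic_overlap("club", ["stick", "fighting"], index)
chosen = best_categories(scores)
```

With this toy index, "club" shares only the "arms" label with its neighbours, so that sense is chosen. An empty result signals that the context gave no evidence and the choice has to be left to the user.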
13The Semantic Overlap process - II
- "I'm looking for clubs or sticks used for fighting"
14Homograph Example - I
15Homograph Example - II
17Semantic Overlap process on KOS
- Assign homographs within the AAT to Roget's categories (here using equivalent terms)
- 45: anchor, fastener
- 367: annual, plant
- 593: annual, publication
- Bound term: apron pieces (position in thesaurus: 3285)
- 225: apron, skirt
18WordNet - I
- In an attempt to compare/improve upon performance, we use WordNet (Fellbaum, ed. 1998)
- Using JWordNet: jwn.sourceforge.com
- Lexical reference system whose design is inspired
by current psycholinguistic theories of human
lexical memory
19WordNet - II
- English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept
- Different relations link the synonym sets
- Hierarchically organized: hypernyms, hyponyms (is-a), unique beginners (25 for noun source files)
20WordNet - III
- 1. (1) pin -- (a piece of jewelry that is pinned onto the wearer's garment)
- 2. fall, pin -- (when a wrestler's shoulders are forced to the mat)
- 3. peg, pin -- (small markers inserted into a surface to mark scores or define locations etc.)
- 4. personal identification number, PIN, PIN number -- (a number you choose and use to gain access to various accounts)
- (11 senses in all)
21WordNet - IV
- Measure of fit given by
- (depth of match of ambiguous word in hierarchy) / (total depth of hierarchy)
- This value is minimised for best fit
22measure of fit: 7/8
23measure of fit: 2/8 = 1/4
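The ratio above can be computed exactly and the candidate sense with the smallest value selected. The sketch below assumes the two depths have already been read off the WordNet hierarchy; how they are measured is not spelled out here, so they are treated as given inputs. The function names and the sense labels "sense_a" and "sense_b" are hypothetical; the depth pairs are the two values shown on the slides (7/8 and 2/8).

```python
from fractions import Fraction


def measure_of_fit(depth_of_match, total_depth):
    """Depth of match divided by total depth of hierarchy, kept as an exact fraction."""
    if total_depth <= 0:
        raise ValueError("total depth of hierarchy must be positive")
    return Fraction(depth_of_match, total_depth)


def best_sense(candidates):
    """Pick the sense whose measure of fit is smallest.

    candidates maps a sense label to (depth_of_match, total_depth).
    """
    return min(candidates, key=lambda sense: measure_of_fit(*candidates[sense]))


# Hypothetical sense labels paired with the two depth ratios from the slides.
candidates = {
    "sense_a": (7, 8),
    "sense_b": (2, 8),
}

fit_b = measure_of_fit(*candidates["sense_b"])  # reduces to 1/4
chosen = best_sense(candidates)
```

Using Fraction rather than floating point keeps 2/8 and 1/4 identical, which avoids spurious ties or near-ties when many senses are compared.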
24WN Homograph Example - I
25WN Homograph Example - II
26Further Developments - I
- Automatic document indexing and classification: use all the modules developed so far working in the other direction, from the target document to the indexing language
- Seek to formalise a facet grammar, especially using citation order, rules (e.g. AAT rules for constructing strings)
27Conclusions
- Problem of disambiguation often left in the hands of the user (e.g., Google etc.)
- Aim to provide some semantic processing of query to sort results by meaning, aiding user in disambiguation process
- Powerful tools at our disposal: aim is to harness them successfully to the application at hand
- to exploit knowledge implicit in structure of KOS and in the structures of natural language
28Bibliography - I
- AAT: http://www.getty.edu/research/tools/vocabulary/aat/.
- Androutsopoulos, I., Ritchie, G. D. and Thanisch, P. (1995) Natural Language Interfaces to Databases -- an introduction, Journal of Natural Language Engineering, 1, 29-81.
- Brill, E. (1995) Transformation-Based Error-Driven Parsing, Computational Linguistics.
- Charniak, E. (1997) Statistical Techniques for Natural Language Parsing, AI Magazine, 18, 33-44.
- Clark, P., Thompson, J., Holmback, H. and Duncan, L. (2000) Exploiting a Thesaurus-Based Semantic Net for Knowledge-Based Search, Proc. 12th Conf. on Innovative Applications of AI (AAAI/IAAI '00), 988-995.
- Cunningham, H. (1999) A Definition and Short History of Language Engineering, Journal of Natural Language Engineering, 5, 1-16.
- Hearst, M. (1991) Noun Homograph Disambiguation Using Local Context in Large Text Corpora, Proc. 7th Annual Conference of the University of Waterloo Centre for the New OED and Text Research, Oxford.
- Hodge, G. (2000) Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files, Council on Library and Information Resources (CLIR), Washington, D.C.
- Ide, N. and Véronis, J. (1998) Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art, Computational Linguistics, 24, 140ff.
29Bibliography - II
- Jarmasz, M. and Szpakowicz, S. (2001) Roget's Thesaurus as an Electronic Lexical Database, In NIE BEZ ZNACZENIA. Prace ofiarowane Profesorowi Zygmuntowi Saloniemu z okazji 40-lecia pracy naukowej (Eds, Gruszczynski, W. and Kopcinska, D.) (to appear), Bialystok.
- Mason, O. (2000) Programming for Corpus Linguistics: How to do Text Analysis with Java, Edinburgh University Press, Edinburgh.
- McEnery, T. and Wilson, A. (1996) Corpus Linguistics, Edinburgh University Press, Edinburgh.
- Radev, D. R., Qi, H., Zheng, Z., Blair-Goldensohn, Zhang, Z., Fan, W. and Prager, J. (2001) Mining the web for answers to natural language questions, Proceedings of ACM CIKM, 143-150.
- Sleator, D. D. and Temperley, D. (1993) Parsing English with a Link Grammar, Third International Workshop on Parsing Technologies.
- Strzalkowski, T., Perez-Carballo, J. and Marinescu, M. (1996) Natural language information retrieval in digital libraries, Proceedings of the first ACM international conference on Digital libraries, 117-125.
- Svenonius, E. (2000) The Intellectual Foundation of Information Organization, MIT Press, Cambridge, MA.
- Wilks, Y. (1998) Language Processing and the Thesaurus, Proc. National Language Research Institute, Tokyo, Japan.
- Winiwarter, W. and Ibrahim, I. K. (2000) A Multilingual Natural Language Interface for E-Commerce Applications, Proc. of the 13th International Conference on Applications of Prolog, Tokyo, Japan.
- Yarowsky, D. (1992) Word-Sense Disambiguation using Statistical Models of Roget's Categories Trained on Large Corpora, Proceedings of COLING-92, 454-460.