Natural Language Processing and KOS as an Aid to Retrieval presentation

1
Natural Language Processing and KOS as an Aid to
Retrieval

Iolo Jones, Daniel Cunliffe and Doug Tudhope
Hypermedia Research Unit, School of Computing,
University of Glamorgan
{ijones6, djcunlif, dstudhope}@glam.ac.uk
ISKO 2004

2
Natural Language Processing and KOS as an Aid to
Retrieval

KOS and Natural Language
Review of relevant literature in the field of NLP
Development of algorithms and experimental model:
use of Roget's Thesaurus and WordNet as
linguistic tools
Evaluation of results to date and directions for
future work

3
KOS and Natural Language

Past 150 years in LS and IS: increased
specificity in design of controlled/normalized
vocabularies
Why reintroduce NL into the equation?
Need pointed out by Smeaton (1997), Hearst and
Karadi (1997), Voorhees (1994).

4
Linguistic Information in KOS

Especially in thesaurus design
Homographs disambiguated by parenthetical
qualifier, e.g. Mercury (Greek God), Mercury
(Metal) and Mercury (Planet) (Svenonius, 2000)
Relations of BT, NT, equivalence and related term
Scope notes: free-text glosses, forming data for
linguistic exploitation

5
General developments in NLP - I

Development of general linguistic tools steadily
maintained since late 1940s
Käding, 1897; Boas, 1940: early use of corpora
Masterman, 1947: machine translation using
machine-implemented knowledge base (using Roget)
Early 1960s: AI techniques (e.g. Wilks,
preference semantics)
Early enthusiasm and optimism killed by the 1966
ALPAC report, which led to funding cuts

6
General developments in NLP - II

Lexical approach (Amsler, 1989; Lesk, 1986)
Decade from 1990: statistical language
processing (Charniak, 1997)
Use of large corpora: BNC, Penn Treebank
Notable advances in POS tagging (Brill, 1995)
and in WSD (Hearst, 1991)

7
Specific applications

NLIDB (Natural Language Interfaces to Databases)
(Copestake and Sparck Jones, 1990; Androutsopoulos
et al., 1995)
E-commerce (Winiwarter and Ibrahim, 2000)
Knowledge-based search (Clark et al., 2000)
NL IR in digital libraries (Strzalkowski et al.,
1996)
Query paraphrase of a NL question (Radev et al.,
2001)
General conclusion: NL techniques quite
successful in specific, limited domains

8
Preparing Roget's Thesaurus for use as a general
linguistic resource

Non-proprietary download from Project Gutenberg
http://www.gutenberg.net
1911 edition plus 1000 new words
Needed corrections (e.g. "sowing one's mild oats")
and a word-to-head index (but see recent work by
Old)
Rationale: 1) abstract universal classification,
2) provision of synonym sets (Wilks, 1998)

9
Word Sense Disambiguation

The problem:
"I thought it was rum" (Hodges)
"Owing to the licensing regulations, children
will not be served at the bar" (Electricity Club,
Cardiff)
"Man jailed in tomato case" (South Wales Echo)
"Flying planes can be dangerous" (Chomsky)

10
Using Roget's in implementing the Semantic
Overlap process

Starting point: parsing of NL string into POS and
constituent phrases
LINK Parser (Sleator and Temperley, 1993)
Focus on Noun Phrases (NPs)
Disambiguation of homographs within the parsed
string,
especially those marked as homographs or bound
terms within the KOS (using AAT)

11
The Semantic Overlap process - I

Take a known ambiguous word in the text string
and assign to it a list of candidate Roget's
categories; then try to match these candidate
categories with the lists of candidate categories
for any other noun phrases in the text string.
Apply same process to homographs inside the KOS.
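
The matching step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the word-to-head index is a toy dictionary, the category labels (ARMS, PARTY and so on) are hypothetical stand-ins for Roget's category numbers, and the function names are invented for this example.

```python
def semantic_overlap(ambiguous_word, context_words, index):
    """Count, for each candidate category of the ambiguous word,
    how many context words also list that category."""
    scores = {}
    for category in index.get(ambiguous_word, set()):
        scores[category] = sum(
            1 for word in context_words if category in index.get(word, set())
        )
    return scores


def pick_category(scores):
    """Return the best-supported category, or None if nothing overlaps."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical word-to-head index: the labels stand in for Roget's
# category numbers and are NOT taken from the thesaurus itself.
TOY_INDEX = {
    "clubs": {"ARMS", "PARTY", "AMUSEMENT"},
    "sticks": {"ARMS", "SUPPORT"},
    "fighting": {"CONTENTION"},
}

scores = semantic_overlap("clubs", ["sticks", "fighting"], TOY_INDEX)
print(pick_category(scores))  # prints ARMS
```

In this toy index only the ARMS reading of "clubs" is shared with another noun in the string, so it is the category selected; a word with no overlapping category stays undisambiguated (None).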

13
The Semantic Overlap process - II

"I'm looking for clubs or sticks used for
fighting"

14
Homograph Example - I

pins (jewelry)

15
Homograph Example - II

pins (fasteners)

17
Semantic Overlap process on KOS

Assign homographs within the AAT to Roget's
categories (here using equivalent terms)

Roget's category   AAT term   Equivalent term
45                 anchor     fastener
367                annual     plant
593                annual     publication

Bound term "apron pieces" (position in thesaurus 3285):
225                apron      skirt
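
One way to picture this assignment is as an intersection of category sets: the categories listed for the homograph are narrowed to those it shares with its equivalent term. The sketch below is an assumption about how such a step could look, not the method used in the project. The category numbers 45, 367, 593 and 225 come from the slide; the index itself and the function name are hypothetical.

```python
# Toy word-to-head index, constructed to be consistent with the
# term/category pairs on the slide; a real index would be derived
# from the thesaurus text and would list many more categories per word.
TOY_INDEX = {
    "anchor": {45},
    "fastener": {45},
    "annual": {367, 593},
    "plant": {367},
    "publication": {593},
    "apron": {225},
    "skirt": {225},
}


def assign_categories(term, equivalent_term, index):
    """Keep only the categories the term shares with its equivalent term."""
    shared = index.get(term, set()) & index.get(equivalent_term, set())
    return sorted(shared)


for term, equivalent in [("annual", "plant"), ("annual", "publication")]:
    print(term, equivalent, assign_categories(term, equivalent, TOY_INDEX))
```

With this index the two senses of "annual" separate cleanly: the equivalent term "plant" selects category 367 and "publication" selects 593.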

18
WordNet - I

In an attempt to compare/improve upon
performance, we use WordNet (Fellbaum, ed. 1998)
Using JWordNet (jwn.sourceforge.com)
Lexical reference system whose design is inspired
by current psycholinguistic theories of human
lexical memory

19
WordNet - II

English nouns, verbs, adjectives and adverbs are
organized into synonym sets, each representing
one underlying lexical concept
Different relations link the synonym sets
Hierarchically organized: hypernyms, hyponyms
(is-a), unique beginners (25 for noun source
files)

20
WordNet - III

1. (1) pin -- (a piece of jewelry that is pinned
onto the wearer's garment)
2. fall, pin -- (when a wrestler's shoulders are
forced to the mat)
3. peg, pin -- (small markers inserted into a
surface to mark scores or define locations etc.)
4. personal identification number, PIN, PIN
number -- (a number you choose and use to gain
access to various accounts)
(11 senses in all)

21
WordNet - IV

Measure of fit given by
(depth of match of ambiguous word in hierarchy) /
(total depth of hierarchy)
This value is minimised for best fit
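
The slide does not spell out how "depth of match" is counted. The Python sketch below takes one possible reading, stated here as an assumption: the number of levels climbed from a sense of the ambiguous word before reaching a node shared with a context word, divided by the length of that sense's hypernym chain. The chains are simplified, hypothetical examples rather than actual WordNet output, and the function names are invented for illustration.

```python
from fractions import Fraction


def measure_of_fit(sense_chain, context_chain):
    """Levels climbed from the sense to the first node shared with the
    context word, divided by the length of the sense's hypernym chain.
    Returns None when the two chains share no node."""
    context_nodes = set(context_chain)
    for steps_up, node in enumerate(sense_chain):
        if node in context_nodes:
            return Fraction(steps_up, len(sense_chain))
    return None


def best_sense(senses, context_chain):
    """Pick the sense whose measure of fit is smallest."""
    scored = {}
    for name, chain in senses.items():
        fit = measure_of_fit(chain, context_chain)
        if fit is not None:
            scored[name] = fit
    return min(scored, key=scored.get), scored


# Hypothetical, simplified hypernym chains: sense first, root last.
PIN_SENSES = {
    "pin (jewelry)": ["pin", "jewelry", "adornment", "decoration",
                      "artifact", "object", "entity"],
    "pin (fastener)": ["pin", "fastener", "restraint", "device",
                       "artifact", "object", "entity"],
}
BROOCH = ["brooch", "jewelry", "adornment", "decoration",
          "artifact", "object", "entity"]

winner, fits = best_sense(PIN_SENSES, BROOCH)
print(winner, fits[winner])
```

Under this reading a sense that meets the context word close to the bottom of the hierarchy scores low and wins, which is in line with the statement that the value is minimised for best fit.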

22
measure of fit = 7/8
23
measure of fit = 2/8 = 1/4
24
WN Homograph Example - I

pins (jewelry)

25
WN Homograph Example - II

pins (fasteners)

26
Further Developments - I

Automatic document indexing and classification:
use all the modules developed so far, working in
the other direction, from the target document to
the indexing language
Seek to formalise a facet grammar, especially
using citation order, rules (e.g. AAT rules for
constructing strings)

27
Conclusions

Problem of disambiguation often left in the hands
of the user (e.g. Google)
Aim to provide some semantic processing of query
to sort results by meaning, aiding user in
disambiguation process
Powerful tools at our disposal: the aim is to
harness them successfully to the application at
hand,
to exploit knowledge implicit in the structure of
KOS and in the structures of natural language

28
Bibliography - I

AAT http://www.getty.edu/research/tools/vocabulary/aat/.
Androutsopoulos, I., Ritchie, G. D. and Thanisch,
P. (1995) Natural Language Interfaces to
Databases: an introduction, Journal of Natural
Language Engineering, 1, 29-81.
Brill, E. (1995) Transformation-Based
Error-Driven Parsing, Computational Linguistics.
Charniak, E. (1997) Statistical Techniques for
Natural Language Parsing, A I Magazine, 18,
33-44.
Clark, P., Thompson, J., Holmback, H. and Duncan,
L. (2000) Exploiting a Thesaurus-Based Semantic
Net for Knowledge-Based Search, Proc. 12th Conf.
on Innovative Applications of AI (AAAI/IAAI '00),
988-995.
Cunningham, H. (1999) A Definition and Short
History of Language Engineering, Journal of
Natural Language Engineering, 5, 1-16.
Hearst, M. (1991) Noun Homograph Disambiguation
Using Local Context in Large Text Corpora, Proc.
7th Annual Conference of the University of
Waterloo Centre for the New OED and Text
Research, Oxford.
Hodge, G. (2000) Systems of Knowledge
Organization for Digital Libraries: Beyond
Traditional Authority Files, Council on Library
and Information Resources (CLIR), Washington,
D.C.
Ide, N. and Véronis, J. (1998) Introduction to
the Special Issue on Word Sense Disambiguation:
The State of the Art, Computational Linguistics,
24, 140ff.

29
Bibliography - II

Jarmasz, M. and Szpakowicz, S. (2001) Roget's
Thesaurus as an Electronic Lexical Database, In
NIE BEZ ZNACZENIA. Prace ofiarowane Profesorowi
Zygmuntowi Saloniemu z okazji 40-lecia pracy
naukowej (Eds, Gruszczynski, W. and Kopcinska, D.)
(to appear), Bialystok.
Mason, O. (2000) Programming for Corpus
Linguistics: How to do Text Analysis with Java,
Edinburgh University Press, Edinburgh.
McEnery, T. and Wilson, A. (1996) Corpus
Linguistics, Edinburgh University Press,
Edinburgh.
Radev, D. R., Qi, H., Zheng, Z.,
Blair-Goldensohn, Zhang, Z., Fan, W. and Prager,
J. (2001) Mining the web for answers to natural
language questions, Proceedings of ACM CIKM,
143-150.
Sleator, D. D. and Temperley, D. (1993) Parsing
English with a Link Grammar, Third International
Workshop on Parsing Technologies.
Strzalkowski, T., Perez-Carballo, J. and
Marinescu, M. (1996) Natural language information
retrieval in digital libraries, Proceedings of
the first ACM international conference on Digital
libraries, 117-125.
Svenonius, E. (2000) The Intellectual Foundation
of Information Organization, MIT Press,
Cambridge, MA.
Wilks, Y. (1998) Language Processing and the
Thesaurus, Proc. National Language Research
Institute, Tokyo, Japan.
Winiwarter, W. and Ibrahim, I. K. (2000) A
Multilingual Natural Language Interface for
E-Commerce Applications, Proc. of the 13th
International Conference on Applications of
Prolog, Tokyo, Japan.
Yarowsky, D. (1992) Word-Sense Disambiguation
using Statistical Models of Roget's Categories
Trained on Large Corpora, Proceedings of
COLING-92, 454-460.
