Natural Language Processing and KOS as an Aid to Retrieval

Title: Natural Language Processing and KOS as an Aid to Retrieval


1
Natural Language Processing and KOS as an Aid to
Retrieval
  • Iolo Jones, Daniel Cunliffe and Doug Tudhope
  • Hypermedia Research Unit, School of Computing,
    University of Glamorgan
  • {ijones6, djcunlif, dstudhope}_at_glam.ac.uk
  • ISKO 2004

2
Natural Language Processing and KOS as an Aid to
Retrieval
  • KOS and Natural Language
  • Review of relevant literature in the field of NLP
  • Development of algorithms and experimental
    model: use of Roget's Thesaurus and WordNet as
    linguistic tools
  • Evaluation of results to date and directions for
    future work

3
KOS and Natural Language
  • Past 150 years in LS and IS: increased
    specificity in design of controlled/normalized
    vocabularies
  • Why reintroduce NL into the equation?
  • Need pointed out by Smeaton (1997), Hearst and
    Karadi (1997), Voorhees (1994).

4
Linguistic Information in KOS
  • Especially in thesaurus design
  • Homographs disambiguated by parenthetical
    qualifier, e.g. Mercury (Greek God), Mercury
    (Metal) and Mercury (Planet) (Svenonius, 2000)
  • Relations of BT, NT, equivalence and related term
  • Scope notes: free-text glosses, forming data for
    linguistic exploitation
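
The parenthetical-qualifier convention above lends itself to simple machine processing. The following is a minimal sketch, assuming labels always take the form "Term (Qualifier)"; the function names and the extra "Venus (Planet)" label are hypothetical and not taken from any particular KOS.

```python
import re
from collections import defaultdict

# Assumed label shape: "Term (Qualifier)", as in the Mercury example.
QUALIFIED_TERM = re.compile(r"^(?P<term>.+?)\s*\((?P<qualifier>[^()]+)\)$")

def split_qualifier(label):
    """Return (term, qualifier); qualifier is None when the label has none."""
    match = QUALIFIED_TERM.match(label.strip())
    if match is None:
        return label.strip(), None
    return match.group("term"), match.group("qualifier")

def group_homographs(labels):
    """Map each term that carries more than one qualifier to its qualifiers."""
    senses = defaultdict(list)
    for label in labels:
        term, qualifier = split_qualifier(label)
        if qualifier is not None:
            senses[term].append(qualifier)
    return {term: quals for term, quals in senses.items() if len(quals) > 1}

# "Venus (Planet)" is an invented label, added to show a non-homograph.
labels = ["Mercury (Greek God)", "Mercury (Metal)",
          "Mercury (Planet)", "Venus (Planet)"]
homographs = group_homographs(labels)
```

Only terms with two or more qualifiers are reported, so a qualified but unambiguous label is ignored.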

5
General developments in NLP - I
  • Development of general linguistic tools steadily
    maintained since late 1940s
  • Kading, 1897; Boas, 1940: early use of corpora
  • Masterman, 1947: machine translation using
    machine-implemented knowledge base (using Roget)
  • Early 1960s: AI techniques (e.g. Wilks,
    preference semantics)
  • Early enthusiasm and optimism killed by the 1966
    ALPAC report, which led to funding cuts

6
General developments in NLP - II
  • Lexical approach (Amsler, 1989; Lesk, 1986)
  • Decade from 1990: statistical language
    processing (Charniak, 1997)
  • Use of large corpora: BNC, Penn Treebank
  • Notable advances in POS tagging (Brill, 1995)
    and in WSD (Hearst, 1991)

7
Specific applications
  • NLIDB (Natural Language Interfaces to Databases)
    (Copestake and Sparck Jones, 1990;
    Androutsopoulos et al., 1995)
  • E-commerce (Winiwarter and Ibrahim, 2000)
  • Knowledge-based search (Clark et al., 2000)
  • NL IR in digital libraries (Strzalkowski et al.,
    1996)
  • Query paraphrase of a NL question (Radev et al.,
    2001)
  • General conclusion: NL techniques quite
    successful in specific, limited domains

8
Preparing Roget's Thesaurus for use as a general
linguistic resource
  • Non-proprietary download from Project Gutenberg:
    http://www.gutenberg.net
  • 1911 edition plus 1000 new words
  • Needed corrections (e.g. "sowing one's mild
    oats") and a word-to-head index (but see recent
    work by Old)
  • Rationale: 1) abstract universal classification,
    2) provision of synonym sets (Wilks, 1998)
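
A word-to-head index of the kind mentioned above is essentially an inverted index from each word to the numbered heads (categories) it appears under. The sketch below assumes the thesaurus has already been parsed into a mapping from head number to word list; the parsing step, the function name and the toy data (loosely based on the examples on slide 17) are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def build_word_to_head_index(heads):
    """Invert {head_number: [words]} into {word: {head_numbers}}."""
    index = defaultdict(set)
    for head_number, words in heads.items():
        for word in words:
            index[word.lower()].add(head_number)
    return dict(index)

# Toy data only: a handful of words under four head numbers.
toy_heads = {
    45: ["anchor", "fastener"],
    225: ["apron", "skirt"],
    367: ["annual", "plant"],
    593: ["annual", "publication"],
}

index = build_word_to_head_index(toy_heads)
```

A word that appears under more than one head, such as "annual" here, is a candidate for disambiguation.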

9
Word Sense Disambiguation
  • The problem:
  • "I thought it was rum" (Hodges)
  • "Owing to the licensing regulations, children
    will not be served at the bar" (Electricity Club,
    Cardiff)
  • "Man jailed in tomato case" (South Wales Echo)
  • "Flying planes can be dangerous" (Chomsky)

10
Using Roget's in implementing the Semantic
Overlap process
  • Starting point: parsing of NL string into POS and
    constituent phrases
  • LINK Parser (Sleator and Temperley, 1993)
  • Focus on Noun Phrases (NPs)
  • Disambiguation of homographs within the parsed
    string
  • especially those marked as homographs or bound
    terms within the KOS (using AAT)

11
The Semantic Overlap process - I
  • Take a known ambiguous word in the text string
    and assign to it a list of candidate Roget's
    categories; try to match these candidate
    categories with a list of candidate categories
    for any other noun phrases in the text string.
  • Apply the same process to homographs inside the
    KOS.
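
The matching step described above can be sketched as a set-overlap count. This is a minimal illustration, assuming a word-to-category index is already available; the scoring rule (one point per context word sharing a category), the function name and the toy index are assumptions made for the example, and the slides do not specify how matches are weighted.

```python
def semantic_overlap(ambiguous_word, context_words, word_to_heads):
    """Rank the ambiguous word's candidate categories by shared context.

    Returns (category, score) pairs, highest score first; ties are
    broken by category number so the order is deterministic.
    """
    candidates = word_to_heads.get(ambiguous_word, set())
    scores = {head: 0 for head in candidates}
    for word in context_words:
        # Count each context word once per category it shares.
        for head in word_to_heads.get(word, set()) & candidates:
            scores[head] += 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Toy index (hypothetical): category numbers as on slide 17.
toy_index = {
    "annual": {367, 593},
    "plant": {367},
    "publication": {593},
}

ranking = semantic_overlap("annual", ["plant", "garden"], toy_index)
```

Context words missing from the index, such as "garden" here, contribute nothing to any score.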

13
The Semantic Overlap process - II
  • "I'm looking for clubs or sticks used for
    fighting"

14
Homograph Example - I
  • pins (jewelry)

15
Homograph Example - II
  • pins (fasteners)

17
Semantic Overlap process on KOS
  • Assign homographs within the AAT to Roget's
    categories (here using equivalent terms)
  • Category 45: anchor, fastener
  • Category 367: annual, plant
  • Category 593: annual, publication
  • Bound term "apron pieces": position in thesaurus
    3285
  • Category 225: apron, skirt

18
WordNet - I
  • In an attempt to compare/improve upon
    performance, we use WordNet (Fellbaum, ed. 1998)
  • Using JWordNet (jwn.sourceforge.com)
  • Lexical reference system whose design is inspired
    by current psycholinguistic theories of human
    lexical memory

19
WordNet - II
  • English nouns, verbs, adjectives and adverbs are
    organized into synonym sets, each representing
    one underlying lexical concept
  • Different relations link the synonym sets
  • Hierarchically organized: hypernyms, hyponyms
    (is-a), unique beginners (25 for noun source
    files)

20
WordNet - III
  • 1. (1) pin -- (a piece of jewelry that is pinned
    onto the wearer's garment)
  • 2. fall, pin -- (when a wrestler's shoulders are
    forced to the mat)
  • 3. peg, pin -- (small markers inserted into a
    surface to mark scores or define locations etc.)
  • 4. personal identification number, PIN, PIN
    number -- (a number you choose and use to gain
    access to various accounts)
  • (11 senses in all)

21
WordNet - IV
  • Measure of fit given by
    (depth of match of ambiguous word in hierarchy) /
    (total depth of hierarchy)
  • This value is minimised for best fit
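
As a worked illustration of the ratio above, the sketch below computes the measure for the two values shown on the following slides (7/8 and 2/8) and picks the smaller one. The function name and the sense labels are hypothetical, and the depths are taken as given, since the slides do not spell out how the depth of a match is counted.

```python
from fractions import Fraction

def measure_of_fit(depth_of_match, total_depth):
    """Ratio of match depth to total hierarchy depth (lower is better)."""
    if total_depth <= 0:
        raise ValueError("total_depth must be positive")
    return Fraction(depth_of_match, total_depth)

# Hypothetical sense labels; the depth values are those on slides 22 and 23.
candidates = {
    "sense A": measure_of_fit(7, 8),
    "sense B": measure_of_fit(2, 8),
}

# The best fit is the candidate with the smallest measure.
best_sense = min(candidates, key=candidates.get)
```

Using exact fractions keeps 2/8 and 1/4 equal without floating-point rounding.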

22
measure of fit = 7/8
23
measure of fit = 2/8 = 1/4
24
WN Homograph Example - I
  • pins (jewelry)

25
WN Homograph Example - II
  • pins (fasteners)

26
Further Developments - I
  • Automatic document indexing and classification:
    use all the modules developed so far, working in
    the other direction, from the target document to
    the indexing language
  • Seek to formalise a facet grammar, especially
    using citation order and rules (e.g. AAT rules
    for constructing strings)

27
Conclusions
  • Problem of disambiguation often left in the hands
    of the user (e.g. Google)
  • Aim to provide some semantic processing of the
    query to sort results by meaning, aiding the user
    in the disambiguation process
  • Powerful tools at our disposal: the aim is to
    harness them successfully to the application at
    hand
  • and to exploit knowledge implicit in the
    structure of KOS and in the structures of natural
    language

28
Bibliography - I
  • AAT: http://www.getty.edu/research/tools/vocabulary/aat/.
  • Androutsopoulos, I., Ritchie, G. D. and Thanisch,
    P. (1995) Natural Language Interfaces to
    Databases: an introduction, Journal of Natural
    Language Engineering, 1, 29-81.
  • Brill, E. (1995) Transformation-Based
    Error-Driven Parsing, Computational Linguistics.
  • Charniak, E. (1997) Statistical Techniques for
    Natural Language Parsing, A I Magazine, 18,
    33-44.
  • Clark, P., Thompson, J., Holmback, H. and Duncan,
    L. (2000) Exploiting a Thesaurus-Based Semantic
    Net for Knowledge-Based Search, Proc. 12th Conf.
    on Innovative Applications of AI (AAAI/IAAI '00),
    988-995.
  • Cunningham, H. (1999) A Definition and Short
    History of Language Engineering, Journal of
    Natural Language Engineering, 5, 1-16.
  • Hearst, M. (1991) Noun Homograph Disambiguation
    Using Local Context in Large Text Corpora, Proc.
    7th Annual Conference of the University of
    Waterloo Centre for the New OED and Text
    Research, Oxford.
  • Hodge, G. (2000) Systems of Knowledge
    Organization for Digital Libraries: Beyond
    Traditional Authority Files, Council on Library
    and Information Resources (CLIR), Washington,
    D.C.
  • Ide, N. and Véronis, J. (1998) Introduction to
    the Special Issue on Word Sense Disambiguation:
    The State of the Art, Computational Linguistics,
    24, 140ff.

29
Bibliography - II
  • Jarmasz, M. and Szpakowicz, S. (2001) Roget's
    Thesaurus as an Electronic Lexical Database, In
    NIE BEZ ZNACZENIA. Prace ofiarowane Profesorowi
    Zygmuntowi Saloniemu z okazji 40-lecia pracy
    naukowej (Eds, Gruszczynski, W. and Kopcinska,
    D.) (to appear), Bialystok.
  • Mason, O. (2000) Programming for Corpus
    Linguistics: How to do Text Analysis with Java,
    Edinburgh University Press, Edinburgh.
  • McEnery, T. and Wilson, A. (1996) Corpus
    Linguistics, Edinburgh University Press,
    Edinburgh.
  • Radev, D. R., Qi, H., Zheng, Z.,
    Blair-Goldensohn, Zhang, Z., Fan, W. and Prager,
    J. (2001) Mining the web for answers to natural
    language questions, Proceedings of ACM CIKM,
    143-150.
  • Sleator, D. D. and Temperley, D. (1993) Parsing
    English with a Link Grammar, Third International
    Workshop on Parsing Technologies.
  • Strzalkowski, T., Perez-Carballo, J. and
    Marinescu, M. (1996) Natural language information
    retrieval in digital libraries, Proceedings of
    the first ACM international conference on Digital
    libraries, 117-125.
  • Svenonius, E. (2000) The Intellectual Foundation
    of Information Organization, MIT Press,
    Cambridge, MA.
  • Wilks, Y. (1998) Language Processing and the
    Thesaurus, Proc. National Language Research
    Institute, Tokyo, Japan.
  • Winiwarter, W. and Ibrahim, I. K. (2000) A
    Multilingual Natural Language Interface for
    E-Commerce Applications, Proc. of the 13th
    International Conference on Applications of
    Prolog, Tokyo, Japan.
  • Yarowsky, D. (1992) Word-Sense Disambiguation
    using Statistical Models of Roget's Categories
    Trained on Large Corpora, Proceedings of
    COLING-92, 454-460.