Provided by: richg8
Title: Controlled Vocabulary and Thesaurus Design


1
Session 1
  • Controlled Vocabulary and Thesaurus Design

2
Goals of Session 1
  • Get an overview of the basic definitions,
    concepts, background and rationale of controlled
    vocabularies and thesauri.
  • Understand the types of data that can be
    represented with controlled terms: subject,
    genre, object type and material, personal,
    corporate and geographic names, etc.
  • Understand how different types of collections and
    users also dictate the selection and construction
    of controlled vocabularies in general, and
    thesauri in particular.
  • Discuss the situational nature of natural vs.
    controlled language.

3
Thesaurus
  • Thesaurus: a tool for vocabulary control of a
    specific subject domain
  • It contains preferred terms, non-preferred
    terms, the semantic relations between terms,
    rules for use and other administrative
    information
  • It presupposes a particular collection of
    documents, and a particular group of users
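
The parts listed above can be pictured as a simple record structure. This is a minimal sketch only: the field names, the broader/narrower/related split of the semantic relations, and the sample terms are illustrative assumptions, not taken from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One thesaurus record; field names are illustrative assumptions."""
    preferred: str                                    # preferred term
    non_preferred: set = field(default_factory=set)   # non-preferred (entry) terms
    broader: set = field(default_factory=set)         # hierarchical relation, upward
    narrower: set = field(default_factory=set)        # hierarchical relation, downward
    related: set = field(default_factory=set)         # associative relation
    scope_note: str = ""                              # rule for use


def add_narrower(thesaurus, broad, narrow):
    """Record a broader/narrower relation in both directions."""
    thesaurus[broad].narrower.add(narrow)
    thesaurus[narrow].broader.add(broad)


# Hypothetical sample records, keyed by preferred term.
thesaurus = {
    "furniture": Concept("furniture"),
    "chairs": Concept("chairs", non_preferred={"seats"},
                      scope_note="Use for single-person seating."),
}
add_narrower(thesaurus, "furniture", "chairs")
```

Storing each relation in both directions keeps the hierarchy consistent whichever term a user starts from.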

4
Differences between thesauri and other
classification systems
  • Thesauri:
  • are used by both LIS professionals and searchers
  • have a narrower subject scope
  • contain specific terms
  • do not contain subdivisions
  • contain instructions for their use
  • explicitly define relationships among terms

5
Why control vocabulary?
  • Goal of information retrieval systems: efficient
    access to information
  • Bates: first 70% is easy, last 30% is hard
  • Core problem: different terms, same meaning
  • Other problems:
  • Scalability: almost all IR research is based on
    small test collections or user communities
  • User needs vary with time, subject domain
  • Basic disconnect between indexer and user
    experience

6
Vocabulary and scalability
  • Q. How many words does the average U.S. high
    school graduate know (not including proper names,
    numbers, foreign words, etc.)?

7
Vocabulary and scalability
  • Q. How many words does the average U.S. high
    school graduate know (not including proper names,
    numbers, foreign words, etc.)?
  • A. 45,000

8
Vocabulary and scalability
  • There is no avoiding these statistics. As a
    database grows, the number of words a human being
    knows does not grow correspondingly.
    Consequently, the average number of hits grows
    instead.
  • One solution: create a thesaurus with entry
    terms, where a user's known term can be
    associated with a relevant concept the user may
    not know.
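
The entry-term idea can be sketched as a lookup table that maps whatever term the user knows to the preferred term used in indexing. The function names and the sample vocabulary below are hypothetical, chosen only for illustration.

```python
def build_entry_index(records):
    """Map every preferred term and entry term to its preferred term."""
    index = {}
    for preferred, entry_terms in records.items():
        index[preferred.lower()] = preferred
        for term in entry_terms:
            index[term.lower()] = preferred
    return index


def resolve(index, user_term):
    """Return the preferred term for a user's term, or None if unknown."""
    return index.get(user_term.strip().lower())


# Hypothetical sample vocabulary: preferred term -> entry terms.
records = {
    "Myocardial infarction": ["heart attack", "cardiac infarction"],
    "Neoplasms": ["cancer", "tumors"],
}
index = build_entry_index(records)
print(resolve(index, "Heart attack"))  # Myocardial infarction
```

A term that is not in the index returns None, which a real system might treat as a cue to fall back on free-text search.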

9
Weaknesses of controlled vocabulary approach
  • Indexer consistency studies
  • Controlled vocabulary vs. free-text search
    studies
  • Folk classification
  • Symbols, icons

10
After decades of IR research
  • To represent a document with a few words, need
    the most content-rich words possible.
  • Content-bearing terms occur both relatively
    frequently in a document, and relatively
    infrequently in a collection.
  • Invest effort in identifying and extracting the
    most content-bearing terms from literature, items
    in collection, user queries, etc.
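
One common way to make "frequent in a document, infrequent in a collection" concrete is TF-IDF weighting. The slide does not name a specific formula, so the sketch below is an assumption: it uses one simple variant (relative term frequency times the log of inverse document frequency) on made-up sample documents.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs is a list of token lists. Score = (term count / document length)
    * log(number of documents / number of documents containing the term).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical three-document collection, already tokenized.
docs = [
    "the thesaurus maps entry terms to the preferred terms".split(),
    "the indexer assigns subject terms to the document".split(),
    "the user searches the collection".split(),
]
scores = tf_idf(docs)
```

In this sample, "the" appears in every document and scores zero, while "thesaurus" appears in only one document and outscores "terms", which appears in two.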

11
What to represent with controlled vocabulary?
  • Subject
  • Genre
  • Physical material
  • Personal names
  • Corporate/organization names
  • Events
  • Etc.

12
Exercise
  • List five words or phrases you think other people
    might use to describe you, after meeting you for
    the first time.
  • List five words or phrases you would use to
    describe yourself.

13
Discuss
  • Do these terms adequately represent the people in
    this room?
  • What don't they describe?
  • What do the differences say about:
  • The unspoken hierarchy of value of different
    types of knowledge (i.e. what matters and what
    doesn't)
  • The situational nature of natural vs. controlled
    vocabulary
  • Challenges of creating and maintaining thesauri
    that reflect this diversity
  • What types of information and evidence do we need
    to create an effective thesaurus?