Title: Prof. Ray Larson
1Lecture 11: Thesaurus Design
SIMS 202: Information Organization and Retrieval
- Prof. Ray Larson & Prof. Marc Davis
- UC Berkeley SIMS
- Tuesday and Thursday 10:30 am - 12:00 pm
- Fall 2002
- http://www.sims.berkeley.edu/academics/courses/is202/f02/
2Lecture Overview
- Review
- Name Authority Control
- Types of Controlled Vocabularies
- Thesaurus Design and Development
- Developing Controlled Vocabularies
- Thesaurus Design
- Steps In Thesaurus Development
- Indexing
4Types of Indexing Languages
- Uncontrolled keyword indexing
- Indexing languages: controlled, but not structured
- Thesauri: controlled and structured
- Classification systems: controlled, structured, and coded
- Faceted classification systems
5Uses of Controlled Vocabularies
- Library subject headings, classification, and authority files
- Commercial journal indexing services and databases
- Yahoo, and other web classification schemes
- Online and manual systems within organizations
- SunSolve
- MacArthur
6Indexing Languages
- An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents
- An indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms
7Classification Systems
- A classification system is an indexing language often based on a broad ordering of topical areas
- Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics
- Classification schemes commonly use a coded notation for representing a topic and its place in relation to other terms
8Automatic Indexing and Classification
- Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words
- More complex automatic indexing systems attempt to select controlled vocabulary terms based on terms in the document
- Automatic classification attempts to automatically group similar documents using either a fully automatic clustering method or an established classification scheme and a set of documents already indexed by that scheme
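The simplest case above, deriving keywords and providing access through every one of them, can be sketched as follows. This is only an illustration under stated assumptions: the stop list, the tokenizing rule, and the names `derive_keywords` and `build_index` are made up for the example and do not come from the lecture.

```python
import re
from collections import defaultdict

# Hypothetical stop list; a real system would use a much longer one.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is", "for"}

def derive_keywords(text):
    """Return the set of lowercase words in text, minus stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}

def build_index(docs):
    """Map each derived keyword to the sorted IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in derive_keywords(text):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Two toy documents, invented for the example.
docs = {
    "d1": "The design of a thesaurus",
    "d2": "Indexing and the thesaurus",
}
index = build_index(docs)
```

Every non-stop word becomes an access point, which is exactly what makes this approach uncontrolled: nothing maps variant words to a single preferred term.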
9Clustering
Agglomerative methods: polythetic, exclusive or overlapping, unordered; clusters are order-dependent
Rocchio's method (yes, the same Rocchio as in relevance feedback)
1. Select initial centers (i.e., seed the space)
2. Assign docs to highest-matching centers and compute centroids
3. Reassign all documents to centroid(s)
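The three steps can be sketched roughly as below. The slide does not say how a "match" is scored or how seeds are chosen, so cosine similarity over term-weight vectors, seeding with chosen documents, and exclusive (single-cluster) assignment are all assumptions of this example; the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def assign(docs, centers):
    """Index of the best-matching center for each document."""
    return [
        max(range(len(centers)), key=lambda k: cosine(doc, centers[k]))
        for doc in docs
    ]

def cluster(docs, seed_ids):
    # 1. Select initial centers (assumption: chosen documents seed the space).
    centers = [docs[i] for i in seed_ids]
    # 2. Assign docs to the highest-matching center and compute centroids.
    labels = assign(docs, centers)
    for k in range(len(centers)):
        members = [doc for doc, label in zip(docs, labels) if label == k]
        if members:  # keep the seed if nothing was assigned to it
            centers[k] = centroid(members)
    # 3. Reassign all documents to the centroids.
    return assign(docs, centers)

# Toy vectors over three terms, invented for the example.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]
labels = cluster(docs, seed_ids=[0, 2])
```

Because the result depends on which documents seed the space, a different choice of `seed_ids` can produce different clusters.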
10Automatic Class Assignment
Automatic class assignment: polythetic, exclusive or overlapping, usually ordered; clusters are order-independent and usually based on an intellectually derived scheme
[Figure: documents (Doc) and a Search Engine]
1. Create pseudo-documents representing intellectually derived classes
2. Search using document contents
3. Obtain ranked list
4. Assign document to the N categories ranked over a threshold, or assign to the top-ranked category
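A minimal sketch of these four steps follows. The slide does not name a ranking function, so cosine similarity over raw word counts stands in for the search engine here; the class names, pseudo-document texts, and function names are all hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Bag of lowercase words in text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count bags."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_classes(doc_text, pseudo_docs):
    """Steps 2 and 3: use the document as the query, return a ranked list."""
    query = tokens(doc_text)
    scores = [(name, cosine(query, tokens(text)))
              for name, text in pseudo_docs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

def assign_classes(doc_text, pseudo_docs, threshold=None, max_n=None):
    """Step 4: top-ranked class, or up to max_n classes scoring over threshold."""
    ranked = rank_classes(doc_text, pseudo_docs)
    if threshold is None:
        return [ranked[0][0]]
    over = [name for name, score in ranked if score > threshold]
    return over[:max_n]

# Step 1: one pseudo-document per intellectually derived class (invented here).
pseudo_docs = {
    "Medicine": "disease patient treatment drug clinical",
    "Computing": "software algorithm database retrieval index",
}
doc = "A database index speeds retrieval"
top = assign_classes(doc, pseudo_docs)
```

Passing a `threshold` switches from the single top-ranked category to every category scoring above it, which allows overlapping assignment.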
12Developing Controlled Vocabularies
- Origins and uses of controlled vocabularies for information retrieval
- Types of indexing languages, thesauri and classification systems
- Process of design and development of thesauri
13Origins
- Very early history of content representation
- Sumerian tokens and envelopes
- Alexandria - pinakes
- Indices
14Origins
- Biblical Indexes and Concordances
- Hugo de St. Caro, 1247 A.D., 500 monks -- KWOC
- Book indexes (Nuremberg Chronicle)
- Library Catalogs
- Journal Indexes
- Information Explosion following WWII
- Cranfield Studies of indexing languages and information retrieval
- Development of bibliographic databases
- Index Medicus -- production and Medlars searching
15Origins
- Communication theory revisited
- Problems with transmission of meaning
Noise
16Structure of an IR System
17What is a Controlled Vocabulary?
- "The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all." (W.H. Auden)
- Similarly, there are too many ways of expressing or explaining the topic of a document
- A controlled vocabulary is a set of rules for topic identification and indexing together with a thesaurus, which consists of a lead-in vocabulary and a limited and selective indexing language, sometimes with special coding or structures
19Thesauri
- A Thesaurus is a collection of selected
vocabulary (preferred terms or descriptors) with
links among synonymous, equivalent, broader,
narrower and other related terms
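One way to picture this definition is as a small data structure in which each descriptor carries its links. The sketch below is an assumption-laden illustration, not a format prescribed by any standard: the class and method names are hypothetical and the sample terms are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    """A preferred term and its links to other terms."""
    term: str
    used_for: set = field(default_factory=set)   # synonymous/equivalent terms
    broader: set = field(default_factory=set)
    narrower: set = field(default_factory=set)
    related: set = field(default_factory=set)

class Thesaurus:
    def __init__(self):
        self.descriptors = {}

    def descriptor(self, term):
        """Return the entry for term, creating it if needed."""
        return self.descriptors.setdefault(term, Descriptor(term))

    def add_equivalent(self, preferred, variant):
        # The variant is recorded on the preferred term, not as its own entry.
        self.descriptor(preferred).used_for.add(variant)

    def add_broader(self, term, broader_term):
        # Hierarchical links are recorded in both directions.
        self.descriptor(term).broader.add(broader_term)
        self.descriptor(broader_term).narrower.add(term)

    def add_related(self, term_a, term_b):
        # Associative links are symmetric.
        self.descriptor(term_a).related.add(term_b)
        self.descriptor(term_b).related.add(term_a)

# Invented sample terms.
t = Thesaurus()
t.add_equivalent("Automobiles", "Cars")
t.add_broader("Automobiles", "Motor vehicles")
t.add_related("Automobiles", "Roads")
```

Recording broader/narrower and related links reciprocally keeps the structure consistent no matter which term a user starts from.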
20Thesauri (cont.)
- National and International Standards for Thesauri
- ANSI/NISO Z39.19-1994: American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
- ANSI/NISO Draft Standard Z39.4-199x: American National Standard Guidelines for Indexes in Information Retrieval
- ISO 2788: Documentation -- Guidelines for the establishment and development of monolingual thesauri
- ISO 5964: Documentation -- Guidelines for the establishment and development of multilingual thesauri
21Thesauri (cont.)
- Examples
- The ERIC Thesaurus of Descriptors
- The Medical Subject Headings (MeSH) of the National Library of Medicine
- The Art and Architecture Thesaurus
22Why Develop a Thesaurus?
- To provide a conceptual structure or space for a body of information
- To make it possible to adequately describe the topical contents of informational objects at an appropriate level of generality or specificity
- To provide enhanced search capabilities and to improve the effectiveness of searching (i.e., to retrieve most of the relevant material without too much irrelevant material)
23Why Develop a Thesaurus?
- To provide vocabulary (or terminological) control
- When there are several possible terms designating
a single concept, the thesaurus should lead the
indexer or searcher to the appropriate concept,
regardless of the terms they start with
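Vocabulary control of this kind amounts to a many-to-one mapping from entry terms to a preferred term. The sketch below assumes a simple normalization (case and whitespace) and uses hypothetical names and invented sample terms.

```python
def normalize(term):
    """Fold case and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())

class LeadInVocabulary:
    """Maps every entry term (preferred or not) to its preferred term."""

    def __init__(self):
        self._preferred = {}

    def add_descriptor(self, descriptor, variants=()):
        # The descriptor leads to itself; each variant leads to the descriptor.
        for term in (descriptor, *variants):
            self._preferred[normalize(term)] = descriptor

    def lookup(self, term):
        """Return the preferred term, or None if the term is unknown."""
        return self._preferred.get(normalize(term))

# Invented sample entries.
vocab = LeadInVocabulary()
vocab.add_descriptor("Automobiles", variants=["Cars", "Motor cars"])
```

Whether an indexer starts from "cars" or a searcher from "Motor  Cars", `lookup` leads both to the same descriptor; a `None` result marks a term the vocabulary does not yet cover.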
25Preliminary Considerations
- What is used now?
- Continue using an existing thesaurus?
- Ad hoc modification of existing thesaurus?
- Develop a new well-structured thesaurus?
- What is the scope and complexity of the subject field?
- What kind of retrieval objects or data will be dealt with?
- How exhaustive and specific is the desired description of objects?
26Preliminary Considerations
- The scope and complexity of the field will provide some indication of the scope and complexity of the thesaurus
- It is better to plan for a larger and more comprehensive system than a smaller system that rapidly will become inadequate as the database grows
- Development of a good thesaurus requires a major intellectual effort as well as clerical operations like data entry and production of sorted lists
27Development of a Thesaurus
- Term selection
- Merging and development of concept classes
- Definition of broad subject fields and subfields
- Development of classificatory structure
- Review, testing, application, revision
28Flow of Work in Thesaurus Construction
291. Term Selection
- Select sources for the collection of terms
- Prearranged Sources
- Open-ended Sources
- Assign codes to each source
- Selection of terms
- For part of pre-arranged and for all open-ended sources
- Enter terms into database with all information
301.1 Kinds of Sources
- Prearranged Sources
- Existing descriptor lists, classification schemes, thesauri
- This includes universal schemes like DDC or LCSH
- Nomenclatures of single disciplines
- Treatises on the terminology of a field
- Encyclopedias, lexica, dictionaries, and glossaries
- Tables of contents of textbooks and handbooks
- Indexes of journals or abstracting journals
- Indexes of other publications in the field
311.1 Kinds of Sources
- Open-ended sources
- Lists of search requests or interest profiles
- Description of projects/activities to be served by the information retrieval system
- Discussion with specialists in the field
- Sample of documents in the field
- Ask users why and how these documents relate to the field
- Have documents indexed by experts in the field
- Lists of titles of documents in the field
- Abstracts and reviews of documents
- Your own knowledge
32Selection of Sources
- Prearranged sources require less effort in gathering the material, and may already indicate some relationships between terms and concepts and among terms
- Open-ended sources can reflect current terminology and may provide more complete coverage
- Choose a set of sources that are current, as complete as possible, and considered authoritative
33Selection of Sources
- Each selected source is assigned an ID for tracking its use in the development of the thesaurus
- Useful when making decisions about which terms to prefer
- Useful for backtracking when questions arise (where did this come from?)
34Selection of Terms
- Terms can be transferred directly from prearranged sources to the recording medium (cards or database)
- You have to decide which terms and references to include, or whether to take the whole source
35Selection of Terms
- In open-ended sources you read through the source and pick out terms (i.e., words and phrases) that might be useful in retrieval or as references to other terms
- Alternatively, use keyword and phrase extraction software to create lists of terms and select from those
- Transfer selected terms to the recording medium (cards or database)
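The software-assisted route could look something like the sketch below, which counts single words and adjacent word pairs across a set of texts and keeps the frequent ones as candidates for a human to select from. The stop list, the two-word phrase limit, the `min_count` cutoff, and the function name are assumptions for illustration only.

```python
import re
from collections import Counter

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "for", "on"}

def candidate_terms(texts, min_count=2):
    """Return (term, count) pairs for words and two-word phrases seen often."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        # Single words that are not stop words.
        counts.update(w for w in words if w not in STOP_WORDS)
        # Adjacent word pairs containing no stop word.
        counts.update(
            f"{a} {b}" for a, b in zip(words, words[1:])
            if a not in STOP_WORDS and b not in STOP_WORDS
        )
    return [(term, n) for term, n in counts.most_common() if n >= min_count]

# Invented sample titles.
titles = [
    "Information retrieval systems",
    "Evaluation of information retrieval",
    "Retrieval of documents",
]
candidates = candidate_terms(titles)
```

The output is only a candidate list; deciding which entries become terms in the working database remains an intellectual task.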
36Work Form
From Soergel, p. 399
372. Merging and Development of Concept Classes
- Sort Term DB into alphabetical order
- First Round
- Merge information for identical terms, possibly pulling info from additional sources
- Second Round
- Merge synonyms or terms in the same concept class
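The two rounds can be sketched as below, assuming each record in the term database is a `(term, source_id)` pair and that the synonym groups for the second round are supplied by the thesaurus editor. The record layout, source codes, and function names are hypothetical.

```python
from collections import defaultdict

def merge_identical(records):
    """First round: merge records for identical terms, in alphabetical order."""
    merged = defaultdict(set)
    for term, source_id in records:
        merged[term.strip().lower()].add(source_id)
    return dict(sorted(merged.items()))

def merge_concept_classes(merged, synonym_groups):
    """Second round: merge editor-supplied synonym groups into concept classes."""
    classes, seen = [], set()
    for group in synonym_groups:
        members = sorted(t for t in group if t in merged)
        if not members:
            continue
        sources = set().union(*(merged[t] for t in members))
        classes.append({"terms": members, "sources": sorted(sources)})
        seen.update(members)
    # Terms outside every group form single-term classes.
    for term, sources in merged.items():
        if term not in seen:
            classes.append({"terms": [term], "sources": sorted(sources)})
    return sorted(classes, key=lambda c: c["terms"][0])

# Invented records; S1..S3 stand for the codes assigned to each source.
records = [("Cars", "S1"), ("cars", "S2"), ("Automobiles", "S1"), ("Trucks", "S3")]
merged = merge_identical(records)
classes = merge_concept_classes(merged, [{"cars", "automobiles"}])
```

Carrying the source codes through both rounds preserves the ability to ask where a term came from when choosing a preferred term later.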
383. Definition of Broad Subject Fields and Subfields
- Define broad subject fields and sort terms into these broad fields
- Define subfields within each broad field and sort terms into these subfields
- Work out the detailed structure
- Select preferred terms
- Merge information for terms in the same concept class
- Repeat these steps
- For each subfield within a broad field
- And for each broad field
- Until all terms have been consolidated and preferred terms selected
394. Development of Classificatory Structure
- Produce preliminary version of classified index and update the working database
- Improve classificatory structure
- Reality check
- Produce and distribute a version of the classified index
- Distribute to users/experts
405. Final Stages
- Review
- Testing
- Application
- Revision
41Review
- Discuss classified index with users/experts
- Select descriptors and checklist descriptors
- Assign notational symbols
- Produce main thesaurus and indexes
42Review (cont.)
- Check cross references and insert where needed
- Produce test version
- Test by indexing
- Modify as needed
- Produce production version
43Testing a Thesaurus
- Assign descriptors to a sample set of NEW documents (use enough to get an idea of any gaps in the thesaurus)
- Test retrieval using sample questions and seeing how effectively the thesaurus maps to the appropriate descriptor
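Both tests can be given a rough quantitative form. The sketch below assumes the thesaurus is available as a dictionary from lowercase entry terms to descriptors, that candidate terms have already been picked out of the new documents, and that each sample question has been reduced to a query term with a known expected descriptor. All names and sample data are hypothetical.

```python
def find_gaps(candidate_terms, lead_in):
    """Terms from new documents that lead to no descriptor at all."""
    return sorted(t for t in candidate_terms if t.lower() not in lead_in)

def mapping_accuracy(test_cases, lead_in):
    """Fraction of (query term, expected descriptor) pairs mapped correctly."""
    hits = sum(
        1 for query, expected in test_cases
        if lead_in.get(query.lower()) == expected
    )
    return hits / len(test_cases)

# Invented lead-in vocabulary: lowercase entry term -> descriptor.
lead_in = {
    "cars": "Automobiles",
    "automobiles": "Automobiles",
    "trucks": "Trucks",
}
gaps = find_gaps(["Cars", "Scooters"], lead_in)
accuracy = mapping_accuracy(
    [("cars", "Automobiles"), ("lorries", "Trucks")], lead_in
)
```

Each gap or failed mapping is a candidate addition for the next revision: either a new descriptor or a new lead-in term pointing to an existing one.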
45The Indexing Process
- Concept identification
- Term selection (via thesaurus)
- Term assignment
46Application: The Indexing Process (Manual)
47Thesaurus Revision and Updates
- There will always be new concepts, products, or expressions that need to be added to the thesaurus
- Set a regular schedule of reviews and revisions
- Collect complaints, problems, etc. and fold into revision of the thesaurus
48References
- Soergel, D. Indexing Languages and Thesauri: Construction and Maintenance. Los Angeles: Melville Publishing Co., 1974.
- Foskett, A.C. The Subject Approach to Information. London: Clive Bingley, 1982.
- Standards
- ANSI/NISO Z39.19-1994: American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
- ANSI/NISO Draft Standard Z39.4-199x: American National Standard Guidelines for Indexes in Information Retrieval
- ISO 2788: Documentation -- Guidelines for the establishment and development of monolingual thesauri
- ISO 5964: Documentation -- Guidelines for the establishment and development of multilingual thesauri
49Next Time
- Metadata and Markup
- How can metadata be expressed and structured in documents and databases
- More on XML and its use in defining metadata systems