Title: Thesaurus Construction and Use
1Thesaurus Construction and Use
- University of California, Berkeley
- School of Information
- IS 245 Organization of Information in Collections
2Lecture Overview
- Review
- Facetted Classification
- Traditional vs. Facetted Classification
- Designing Facetted Classifications
- Today
- Thesaurus design
- Steps in Thesaurus development
- Indexing
3Hierarchical Classification
Slide author Marti Hearst
4Labeled Categories for Hierarchical Classification
- LITERATURE
- 100 English Literature
- 110 English Prose
- English Prose 16th Century
- English Prose 17th Century
- English Prose 18th Century
- ...
- 120 English Poetry
- 121 English Poetry 16th Century
- 122 English Poetry 17th Century
- ...
- 130 English Drama
- 131 English Drama 16th Century
- ...
- 200 French Literature
Slide author Marti Hearst
5Facetted Categories
- Mutually exclusive
- Non-overlapping, distinct categories
- Relational
- Relations between facets, subfacets, and foci
(elements) are not restricted to hierarchical
generalization-specialization relations
- Composable
- Combined using grammars of order and relation to
form compound descriptions
6Facetted Classification Along With Labeled
Categories
- A Language
- a English
- b French
- c Spanish
- B Genre
- a Prose
- b Poetry
- c Drama
- C Period
- a 16th Century
- b 17th Century
- c 18th Century
- d 19th Century
- Aa English Literature
- AaBa English Prose
- AaBaCa English Prose 16th Century
- AbBbCd French Poetry 19th Century
- BcCd Drama 19th Century
Slide author Marti Hearst
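The way compound notations are built from the facet tables on this slide can be sketched in a few lines of Python. This is only an illustration: the `FACETS` table, `decode`, and `label` are names made up for the example, and the citation order (Language, Genre, Period) is assumed from the slide's examples.

```python
import re

# Facet tables copied from the slide: facet letter -> (facet name, foci).
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}

def decode(notation):
    """Split a compound notation such as 'AaBaCa' into (facet, focus) pairs."""
    pairs = re.findall(r"([A-Z])([a-z])", notation)
    # Reject anything that is not a clean sequence of letter pairs.
    if "".join(f + c for f, c in pairs) != notation:
        raise ValueError(f"malformed notation: {notation!r}")
    return [(FACETS[f][0], FACETS[f][1][c]) for f, c in pairs]

def label(notation):
    """Join the focus labels in the order they appear in the notation."""
    return " ".join(focus for _, focus in decode(notation))

print(label("AaBaCa"))  # English Prose 16th Century
print(label("AbBbCd"))  # French Poetry 19th Century
```

Note that a facet can simply be left out of the notation, which is how the scheme describes a genre and period without naming a language.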
7Ranganathan
- PMEST Facets
- P(ersonality)
- WHO: The most important types or names of things
for the particular discipline
- M(atter)
- WHAT: Constituent materials
- E(nergy)
- HOW: Action or activity terms
- S(pace)
- WHERE: Where things occur
- T(ime)
- WHEN: When things occur
8Classical CRG/BC2 Facet Analysis
- Entity
- Kind
- Part
- Property
- Material
- Process
- Operation
- Patient
- Product
- By-Product
- Agent
- Space
- Time
9Classical Facet Analysis
- What is being done?
- Entity
- Kind
- Product
- By-Product
- What are its parts?
- Part
- What are its properties?
- Property
- Material
- How is this achieved?
- Process
- By what means?
- Operation
- By whom?
- Agent
- Patient
- Where?
- Space
- When?
- Time
10Classical Facet Analysis
- Nouns
- Entity
- Kind
- Part
- Patient
- Product
- By-Product
- Agent
- Adjectives
- Property
- Material
- Intransitive Verb
- Process
- Transitive Verb
- Operation
- Adverb
- Space
- Time
11Semantic and Syntactic Relationships
- Semantic relationships
- Is-A (thing/kind, genus/species)
- Mammals
- Primates
- Humans
- Has-Parts
- Human
- Head
- Eyes
- Syntactic relationships
- Compounds
- Wheat + harvesting = wheat harvesting
- Object + operation = operation on object
12Facetted Classification
- Clearly distinguishes between semantic
relationships and syntactic relationships
- Semantic relationships
- Within a facet
- Containment relations
- Syntactic relationships
- Across facets
- Combinatoric relations
- Have a syntax for syntactic combination of
semantic terms
13Power of Facet Combinations
- The syntactic relations of facetted
classifications enable a small controlled
vocabulary to produce:
- Many, many structured descriptions
- Complex, but formally structured descriptions
using nested compound descriptions
- Descriptions for things we do not have words for
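A quick count shows how much a small vocabulary can express. The sketch below reuses the Language, Genre, and Period facets from the earlier labeled-categories slide and assumes that each facet is optional and contributes at most one focus per description. Nested compounds are not modeled, so the real number would be larger.

```python
from itertools import product

languages = ["English", "French", "Spanish"]
genres = ["Prose", "Poetry", "Drama"]
periods = ["16th Century", "17th Century", "18th Century", "19th Century"]

# None means "facet not used", so every facet is optional.
options = [[None] + facet for facet in (languages, genres, periods)]

descriptions = [
    " ".join(focus for focus in combo if focus is not None)
    for combo in product(*options)
    if any(focus is not None for focus in combo)  # drop the empty description
]

vocabulary_size = len(languages) + len(genres) + len(periods)
print(vocabulary_size, len(descriptions))  # 10 79
```

Ten controlled terms yield 79 distinct descriptions, including ones such as "Spanish Drama 18th Century" that nobody had to enumerate in advance.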
14Today
- More on thesaurus standards and examples
15Types of Indexing Languages
- Uncontrolled keyword indexing
- Indexing languages
- Controlled, but not structured
- Thesauri
- Controlled and structured
- Classification systems
- Controlled, structured, and coded
- Facetted classification systems
16Thesauri
- A Thesaurus is a collection of selected
vocabulary (preferred terms or descriptors) with
links among synonymous, equivalent, broader,
narrower and other related terms
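One possible in-memory representation of such a collection is sketched below. The class and method names are illustrative assumptions, not part of any standard; the relationship labels in the comments (UF, BT, NT, RT) are the conventional thesaurus abbreviations. The synonym "People" is invented for the example; the hierarchy reuses the Is-A example from the earlier slide.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    term: str
    use_for: set = field(default_factory=set)   # UF: non-preferred synonyms
    broader: set = field(default_factory=set)   # BT: broader terms
    narrower: set = field(default_factory=set)  # NT: narrower terms
    related: set = field(default_factory=set)   # RT: other related terms

class Thesaurus:
    def __init__(self):
        self.entries = {}  # preferred term -> Entry
        self.use = {}      # non-preferred term -> preferred term

    def add(self, term):
        return self.entries.setdefault(term, Entry(term))

    def add_synonym(self, preferred, variant):
        self.add(preferred).use_for.add(variant)
        self.use[variant] = preferred

    def add_broader(self, narrow, broad):
        # BT and NT are reciprocal, so record both directions.
        self.add(narrow).broader.add(broad)
        self.add(broad).narrower.add(narrow)

    def add_related(self, a, b):
        # RT is symmetric.
        self.add(a).related.add(b)
        self.add(b).related.add(a)

    def ancestors(self, term):
        """All broader terms, following BT links transitively."""
        seen, stack = set(), list(self.entries[term].broader)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.entries[current].broader)
        return seen

t = Thesaurus()
t.add_broader("Primates", "Mammals")
t.add_broader("Humans", "Primates")
t.add_synonym("Humans", "People")  # hypothetical synonym
print(sorted(t.ancestors("Humans")))  # ['Mammals', 'Primates']
```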
17Thesaurus Standards
- National and International Standards for Thesauri
- ANSI/NISO Z39.19-1994 -- American National
Standard Guidelines for the Construction, Format
and Management of Monolingual Thesauri
- ANSI/NISO Draft Standard Z39.4-199x -- American
National Standard Guidelines for Indexes in
Information Retrieval
- ISO 2788 -- Documentation -- Guidelines for the
establishment and development of monolingual
thesauri
- ISO 5964 -- Documentation -- Guidelines for the
establishment and development of multilingual
thesauri
18Thesaurus Examples
- Examples
- Non-Facetted
- The ERIC Thesaurus of Descriptors
- Semi-Facetted
- The Medical Subject Headings (MESH) of the
National Library of Medicine
- Facetted
- The Art and Architecture Thesaurus
19ERIC Thesaurus Entry
20ERIC Thesaurus Alphabetic
21ERIC Thesaurus KWIC Index
22ERIC Thesaurus Hierarchies
23ERIC Thesaurus Groups
24ERIC Thesaurus Online
http://www.ericfacility.net/extra/pub/thessearch.cfm
25MESH Entry
26MESH Alphabetic
27MESH Tree Structures
28MESH KWOC Index
29MESH - Online
http://www.nlm.nih.gov/mesh/meshhome.html
30AAT Facets
31AAT Hierarchies (print)
32AAT Hierarchies (online)
http://www.getty.edu/research/tools/vocabulary/aat/
33AAT Entry (online)
34Lecture Overview
- Thesaurus Design and Development
- Controlled Vocabularies for topical description
- Thesaurus Design
- Steps In Thesaurus Development (intro)
35Why Develop a Thesaurus?
- To provide a conceptual structure or space for
a body of information
- To make it possible to adequately describe the
topical content of information resources at an
appropriate level of generality or specificity
- To provide enhanced search capabilities and to
improve the effectiveness of searching (i.e., to
retrieve most of the relevant material without
too much irrelevant material)
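Search effectiveness in this sense is commonly measured with recall (how much of the relevant material was retrieved) and precision (how much of the retrieved material is relevant). The slide does not name these measures, so treat the sketch below as one standard way to make the goal concrete; the document IDs are invented.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, three documents actually relevant, two in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p)            # 0.5
print(round(r, 2))  # 0.67
```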
36Why Develop a Thesaurus?
- To provide vocabulary (or terminological) control
- When there are several possible terms designating
a single concept, the thesaurus should lead the
indexer or searcher to the appropriate concept,
regardless of the terms they start with
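Vocabulary control can be pictured as a lookup from any entry term to the single preferred term for its concept. The following is a minimal sketch with an invented term list; `USE`, `normalize`, and `preferred` are hypothetical names chosen for the example.

```python
# Entry term -> preferred term. All terms here are invented for illustration.
USE = {
    "cars": "automobiles",
    "motor cars": "automobiles",
    "autos": "automobiles",
    "automobiles": "automobiles",
}

def normalize(term):
    """Lower-case and collapse whitespace so lookups ignore surface variation."""
    return " ".join(term.lower().split())

def preferred(term):
    """Return the preferred term, or None if the thesaurus has no entry."""
    return USE.get(normalize(term))

print(preferred("  Motor   Cars "))  # automobiles
print(preferred("lorries"))          # None
```

A lookup that returns None marks a term the thesaurus does not yet cover, which is useful input for later revision.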
37Preliminary Considerations
- What is used now?
- Continue using an existing thesaurus?
- Ad hoc modification of existing thesaurus?
- Develop a new well-structured thesaurus?
- What is the scope and complexity of the subject
field?
- What kind of retrieval objects or data will be
dealt with?
- How exhaustive and specific is the desired
description of objects?
38Preliminary Considerations
- The scope and complexity of the field will
provide some indication of the scope and
complexity of the thesaurus
- It is better to plan for a larger and more
comprehensive system than for a smaller system that
will rapidly become inadequate as the database
grows
- Development of a good thesaurus requires a major
intellectual effort as well as clerical
operations like data entry and production of
sorted lists
39Development of a Thesaurus
- Term Selection.
- Merging and Development of Concept Classes.
- Definition of Broad Subject Fields and Subfields.
- Development of Classificatory structure
- Review, Testing, Application, Revision.
40 1. Term Selection
- Select sources for the collection of terms.
- Prearranged Sources
- Open-ended Sources
- Assign codes to each source.
- Selection of terms
- For part of the prearranged sources and for all
open-ended sources
- Enter terms into database with all information.
41 1.1 Kinds of Sources
- Prearranged Sources
- Existing descriptor lists, classification schemes,
and thesauri. This includes universal schemes like
DDC or LCSH.
- Nomenclatures of single disciplines
- Treatises on the terminology of a field
- Encyclopedias, lexica, dictionaries and
glossaries.
- Tables of contents of textbooks and handbooks
- Indexes of journals or abstracting journals
- Indexes of other publications in the field
42 1.1 Kinds of Sources
- Open-ended sources
- Lists of search requests or interest profiles
- Description of projects/activities to be served
by the information retrieval system.
- Discussion with specialists in the field
- Sample of documents in the field
- Ask users why and how these documents relate to
the field.
- Have documents indexed by experts in the field
- Lists of titles of documents in the field
- Abstracts and reviews of documents
- Your own knowledge
43Selection of sources
- Prearranged sources require less effort in
gathering the material, and may already indicate
some relationships among terms and concepts.
- Open-ended sources can reflect current
terminology and may provide more complete
coverage.
- Choose a set of sources that are current, as
complete as possible, and considered
authoritative.
44Selection of Sources
- Each selected source is assigned an ID for
tracking its use in the development of the
thesaurus.
- Useful when making decisions about which terms to
prefer
- Useful for backtracking when questions arise
(where did this come from?)
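Source tracking of this kind can be as simple as a register of source codes plus a set of codes per term. The sketch below is an assumption about how one might record it; the codes, descriptions, and terms are all invented.

```python
from collections import defaultdict

# Hypothetical source register: code -> description of the source.
SOURCES = {
    "S1": "Existing thesaurus (prearranged)",
    "S2": "Search request log (open-ended)",
}

# term -> set of source codes in which the term was found
found_in = defaultdict(set)

def record(term, source_code):
    """Note that a term was found in a registered source."""
    if source_code not in SOURCES:
        raise KeyError(f"unknown source code: {source_code}")
    found_in[term].add(source_code)

def provenance(term):
    """Answer 'where did this come from?' for a term."""
    return sorted(SOURCES[code] for code in found_in.get(term, ()))

record("automobiles", "S1")
record("cars", "S2")
record("automobiles", "S2")

print(provenance("automobiles"))
```

A term attested in several sources, as "automobiles" is here, is often a stronger candidate for preferred-term status than one found in a single source.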
45Selection of Terms
- Terms can be transferred directly from
prearranged sources to the recording medium
(cards or database)
- You have to decide which terms and references to
include, or whether to take the whole source
46Selection of Terms
- In open-ended sources you read through the source
and pick out terms (i.e., words and phrases) that
might be useful in retrieval or as references to
other terms.
- Alternatively, use keyword and phrase extraction
software to create lists of terms and select from
those.
- Transfer selected terms to the recording medium
(cards or database).
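As a rough idea of what keyword and phrase extraction software does, the sketch below counts single words and two-word phrases, skipping any candidate that starts or ends with a stopword. It is a deliberately naive illustration (tiny stopword list, no stemming, phrases may span sentence boundaries), not a description of any particular tool; the sample text is invented.

```python
import re
from collections import Counter

# A tiny stopword list, just enough for the example.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "are", "on"}

def candidate_terms(text, max_words=2):
    """Count words and phrases of up to max_words words as candidate terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Skip candidates that start or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return counts

sample = ("Indexing languages support information retrieval. "
          "A thesaurus is an indexing language for information retrieval.")
counts = candidate_terms(sample)
print(counts["information retrieval"])  # 2
```

The resulting list is only raw material: a person still selects which candidates go into the term database.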
47 2. Merging and Development of Concept Classes
- Sort Term DB into alphabetical order.
- First round: merge information for identical
terms, possibly pulling in information from
additional sources.
- Second round: merge synonyms or terms in the same
concept class.
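The two merge rounds can be sketched as follows. The records, source codes, and the `concept_of` mapping are invented; in practice the second-round mapping is an editorial judgment, and the code only applies it.

```python
from collections import defaultdict

# (term, source code, note) records collected during term selection.
records = [
    ("Cars", "S2", ""),
    ("Automobiles", "S1", "preferred in source S1"),
    ("Bicycles", "S1", ""),
    ("automobiles", "S3", ""),
]

# Round 1: sort alphabetically and merge records for identical terms.
merged = defaultdict(lambda: {"sources": set(), "notes": []})
for term, source, note in sorted(records, key=lambda rec: rec[0].lower()):
    key = term.lower()
    merged[key]["sources"].add(source)
    if note:
        merged[key]["notes"].append(note)

# Round 2: apply the editor's decision about which terms share a concept class.
concept_of = {
    "automobiles": "automobiles",
    "cars": "automobiles",
    "bicycles": "bicycles",
}

classes = defaultdict(lambda: {"terms": set(), "sources": set()})
for term, info in merged.items():
    concept = classes[concept_of[term]]
    concept["terms"].add(term)
    concept["sources"] |= info["sources"]

print(sorted(classes["automobiles"]["terms"]))    # ['automobiles', 'cars']
print(sorted(classes["automobiles"]["sources"]))  # ['S1', 'S2', 'S3']
```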
48 3. Definition of Broad Subject Fields and
Subfields
- Define broad subject fields and sort terms into
these broad fields
- Define subfields within each broad field and sort
terms into these subfields.
- Work out the detailed structure
- Select Preferred Terms
- Merge information for terms in the same concept
class
- Repeat these steps
- for each subfield within a broad field
- and for each broad field
- Until all terms have been consolidated and
preferred terms selected
49 4. Development of Classificatory Structure
- Produce preliminary version of classified index
and update the working database.
- Improve classificatory structure
- Reality check: produce a version of the
classified index and distribute it to
users/experts.
50 5. Final Stages
- Review
- Testing
- Application
- Revision
51Review
- Discuss classified index with users/experts.
- Select descriptors and checklist descriptors.
- Assign Notational Symbols
- Produce Main Thesaurus Indexes
52Review (cont.)
- Check cross references and insert where needed
- Produce Test Version
- Test by Indexing
- Modify as needed
- Produce Production Version.
53Testing a Thesaurus
- Assign descriptors to a sample set of NEW
documents (use enough to get an idea of any gaps
in the thesaurus).
- Test retrieval using sample questions and see
how effectively the thesaurus maps them to the
appropriate descriptors
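A gap check of this kind can be partly automated once indexers have noted the concepts they wanted to express for each test document. The sketch below assumes a lookup table from entry terms to descriptors; the table, document IDs, and concepts are all invented for illustration.

```python
def find_gaps(doc_concepts, use):
    """Return {doc id: concepts with no entry in the thesaurus lookup table}."""
    gaps = {}
    for doc_id, concepts in doc_concepts.items():
        missing = [c for c in concepts if c.lower() not in use]
        if missing:
            gaps[doc_id] = missing
    return gaps

# Entry term -> descriptor (hypothetical).
use = {"cars": "automobiles", "automobiles": "automobiles", "bicycles": "bicycles"}

# Concepts indexers wanted to express for a sample of new documents.
sample = {
    "doc1": ["Cars", "Bicycles"],
    "doc2": ["Automobiles", "Electric scooters"],
}

gaps = find_gaps(sample, use)
total = sum(len(concepts) for concepts in sample.values())
gap_rate = sum(len(missing) for missing in gaps.values()) / total

print(gaps)      # {'doc2': ['Electric scooters']}
print(gap_rate)  # 0.25
```

The same function can be run over the terms in sample search questions to see how often searchers' wording fails to reach a descriptor.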
54Flow of Work in Thesaurus Construction
55The Indexing Process
- Concept identification
- term selection (via thesaurus)
- term assignment
56Application: The Indexing Process (Manual)
57Thesaurus Revision and Updates
- There will always be new concepts, products, or
expressions that need to be added to the
thesaurus.
- Set a regular schedule of reviews and revisions.
- Collect complaints, problems, etc. and fold into
revision of the thesaurus
58References
- Soergel, D. Indexing Languages and Thesauri:
Construction and Maintenance. Los Angeles:
Melville Publishing Co., 1974.
- Foskett, A.C. The Subject Approach to
Information. London: Clive Bingley, 1982.
- Standards
- ANSI/NISO Z39.19-1994 -- American National
Standard Guidelines for the Construction, Format
and Management of Monolingual Thesauri
- ANSI/NISO Draft Standard Z39.4-199x -- American
National Standard Guidelines for Indexes in
Information Retrieval
- ISO 2788 -- Documentation -- Guidelines for the
establishment and development of monolingual
thesauri
- ISO 5964 -- Documentation -- Guidelines for the
establishment and development of multilingual
thesauri