Title: Taxonomies and Indexing: A Technical Strategy
1Taxonomies and Indexing: A Technical Strategy
- Diane Vizine-Goetz
- Office of Research
- OCLC Online Computer Library Center, Inc.
2Context
- Techniques and approaches developed by and for libraries and other institutions responsible for preserving the human record
- Broad scope
- Long tradition of information organization
3Why organize information?
- For
- Search and retrieval
- Use
- Preservation & disposition
4Why Organize Information by Subject?
- Find information on a particular subject
- Only and all relevant information
- precision
- recall
- Find related information
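The two measures named on this slide can be made concrete with a small calculation. The sketch below assumes a hypothetical set of retrieved document ids and a hypothetical set of relevant ids: precision is the share of retrieved items that are relevant ("only relevant information"), and recall is the share of relevant items that were retrieved ("all relevant information").

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search result: 4 documents retrieved, 5 documents relevant, 2 in common
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5", "d6", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```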
5How?
- Subject analysis
- Conceptual analysis--Determining what an information object is about
- Translate concepts into knowledge organization (KO) scheme
- e.g., Subject indexes
- Thesauri
- Classification scheme
- Automated, Semi-automated, Human/Intellectual
6Automation Subject Analysis
7Automated Concept Identification
- Automated Indexing
- Ranges from simply identifying words in a document, to
- Sophisticated analyses that identify key names, words, and phrases
- WordSmith Project http://orc.rsch.oclc.org:5061/
- Automated Classification
- Automated assignment of documents to categories
or classes
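As a rough illustration of the simpler end of that range, the sketch below counts recurring word n-grams as candidate index phrases. It is not the WordSmith method, which these slides do not describe; the stopword list, the function name, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list)
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "by", "is", "was"}


def candidate_phrases(text, max_len=3, min_count=2):
    """Count word n-grams (1..max_len words) that neither start nor end
    with a stopword, and keep those seen at least min_count times."""
    # Punctuation is dropped, so n-grams may span sentence boundaries;
    # such accidental phrases rarely repeat and are filtered by min_count.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}


# Hypothetical news snippet
sample = (
    "The federal reserve board met on Tuesday. "
    "Analysts expect the federal reserve board to hold rates, "
    "and the federal reserve chairman spoke afterwards."
)
for phrase, count in sorted(candidate_phrases(sample).items()):
    print(count, phrase)
```

A frequency threshold like this is cheap to run over large quantities of text, but it leaves variant forms of the same concept unmerged.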
8Political News Concepts Extracted by WordSmith
- fair housing
- fair housing act
- family planning
- family planning programmes
- family planning programs
- family planning services
- federal government
- federal government deficit
- federal reserve
- federal reserve bank
- federal reserve board
- federal reserve chairman alan greenspan
- federal reserve system
13Advantages of automatic concept identification
- Inexpensive
- Suitable for indexing/categorizing large quantities of text
- Can identify popular and emerging concepts and terminology
14Why use knowledge organization schemes?
- Knowledge organization schemes such as subject heading lists, thesauri, and classification schemes are specialized languages designed for retrieving information
- Goal--to reduce ambiguities that cause precision & recall failures
15Free text vs. controlled subject retrieval language
- WordSmith
- family planning
- family planning programmes
- family planning programs
- family planning services
- Library of Congress Subject Headings (LCSH)
- Birth control clinics
- UF Family planning services
- Planned parenthood services
- BT Clinics
- 19860211
16MeSH Heading vs. LCSH
- Family Planning
- Note: Programs or services designed to assist the family in controlling reproduction by either improving or diminishing fertility.
- Entry Term
- Birth Control
- Planned Parenthood
- Basal Body Temperature Method
- Birth Limiting
- Births Averted
- Family Planning Surveys
- ...
-
- Birth control (19880919)
- UF Family planning
- Planned parenthood
- Population control
- Pregnancy--Prevention
- BT Hygiene, Sexual
- Sexual ethics
- RT Contraception
- Family size
- NT Abortion
- Birth Intervals
- Childlessness
- ...
17Characteristics of subject retrieval languages
- Terminology is often domain specific
- Medicine > MeSH; Engineering > INSPEC; Agriculture > Agrovoc
- Control vocabulary (synonyms & homonyms)
- Express relationships between terms
18Within a domain, terms are context independent
- Ei Thesaurus
- TM
- Bank protection
- UF
- Coastal engineering--Bank protection
- Inland waterways--Bank protection
- SN
- Protection of river banks and lake shores. For seacoasts, use SHORE PROTECTION
- DT January 1993
- BT
- Protection
- RT
- Banks (bodies of water)
- Coastal engineering
- Environmental engineering
- Erosion
- Inland waterways
- River control
- Shore protection
- Slope protection
- Soil conservation
- MC 407.2 407.3
- OC 914.1
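The record above can be modeled as a simple data structure. In this minimal sketch the fields mirror the labels on the slide (UF, SN, BT, RT), while the class name, field names, and the `entry_index` helper are assumptions introduced for illustration, not part of the Ei Thesaurus itself.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusRecord:
    term: str                                      # preferred term
    used_for: list = field(default_factory=list)   # UF: non-preferred entry terms
    scope_note: str = ""                           # SN: scope note
    broader: list = field(default_factory=list)    # BT: broader terms
    related: list = field(default_factory=list)    # RT: related terms


bank_protection = ThesaurusRecord(
    term="Bank protection",
    used_for=["Coastal engineering--Bank protection",
              "Inland waterways--Bank protection"],
    scope_note="Protection of river banks and lake shores. "
               "For seacoasts, use SHORE PROTECTION",
    broader=["Protection"],
    # Subset of the RT list shown on the slide
    related=["Banks (bodies of water)", "Coastal engineering",
             "Erosion", "Shore protection"],
)


def entry_index(records):
    """Map every preferred term and UF entry term (lowercased) to its preferred term."""
    index = {}
    for rec in records:
        index[rec.term.lower()] = rec.term
        for uf in rec.used_for:
            index[uf.lower()] = rec.term
    return index


index = entry_index([bank_protection])
print(index["inland waterways--bank protection"])  # Bank protection
```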
19Controlled Vocabulary
- Preferred way of expressing a concept
- e.g., Popular vs. technical
- Heart attack vs. Myocardial infarction
- Non-used vocabulary often included
- Synonyms
- Current/Outdated terms > Disabled/Handicapped
- Lexical variants
- Phrase/Inverted forms > Bilingual education/Education, Bilingual
- Quasi-Synonyms
- Synonyms/Antonyms > Literacy/Illiteracy
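One way to picture vocabulary control is a lookup that leads a searcher from a non-used term to the preferred one. The sketch below uses the term pairs from this slide; which member of each pair counts as "preferred" is an assumption made for illustration, and the `resolve` and `uninvert` helpers are hypothetical names.

```python
# Non-preferred -> preferred term. The pairs come from the slide; the choice of
# preferred side is an assumption for this example.
USE = {
    "heart attack": "Myocardial infarction",
    "handicapped": "Disabled",
}
PREFERRED = {"Myocardial infarction", "Disabled", "Bilingual education"}


def uninvert(term):
    """Turn an inverted heading such as 'Education, Bilingual' into 'Bilingual Education'."""
    head, sep, tail = term.partition(",")
    return f"{tail.strip()} {head.strip()}" if sep else term


def resolve(term):
    """Return the preferred term for a user's term, or None if it is unknown."""
    by_key = {p.lower(): p for p in PREFERRED}
    for candidate in (term, uninvert(term)):
        key = " ".join(candidate.lower().split())  # normalize case and spacing
        if key in by_key:
            return by_key[key]
        if key in USE:
            return USE[key]
    return None


for t in ["Heart attack", "Education, Bilingual", "handicapped", "Literacy"]:
    print(t, "->", resolve(t))
```

Terms outside the vocabulary, such as "Literacy" here, resolve to nothing, which is the point at which a free-text search would have to take over.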
20Relationships
- Equivalence
- Synonymous terms
- Hierarchy
- Generic relationship (kind)
- Whole-part relationship
- Instance relationship (example)
- Association
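These relationship types are what make query expansion possible. The sketch below is one hedged illustration: it uses the terms of the LCSH "Birth control" record shown earlier in the deck, stored in a dictionary layout that is an assumption of this example, and expands a search term with its narrower terms (hierarchy) and, optionally, its related terms (association).

```python
# Terms taken from the LCSH "Birth control" record earlier in the deck.
# The dictionary layout and the expand() helper are assumptions for illustration.
THESAURUS = {
    "Birth control": {
        "UF": ["Family planning", "Planned parenthood", "Population control"],
        "NT": ["Abortion", "Birth Intervals", "Childlessness"],
        "RT": ["Contraception", "Family size"],
    },
}


def expand(term, thesaurus, include_related=False):
    """Return the term plus all narrower terms (followed transitively);
    optionally append the related terms of the starting term."""
    seen = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        narrower = thesaurus.get(current, {}).get("NT", [])
        stack.extend(reversed(narrower))  # reversed so terms come out in listed order
    if include_related:
        for rt in thesaurus.get(term, {}).get("RT", []):
            if rt not in seen:
                seen.append(rt)
    return seen


print(expand("Birth control", THESAURUS))
print(expand("Birth control", THESAURUS, include_related=True))
```

Expanding with narrower terms tends to raise recall without much loss of precision; adding related terms widens the search further and is left optional here for that reason.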
21Subject Retrieval using a controlled vocabulary
22Related Terms in LCSH
23Classification / Categorization System
- A systematic arrangement of knowledge into useful categories
- General schemes & special schemes
- DDC, LCC, UDC; AGRIS, MSC
- Present a generalized view of knowledge at varying levels of depth
- May be enumerative or synthetic
24Some Advantages of Traditional Schemes
- Meaningful notation
- Well-developed hierarchies
- Well-defined categories
- Rich network of relationships
25Meaningful Notation (DDC)
- 005.1 Programming
- 005.1 Programmation
- 005.1 [caption in a non-Latin script, lost in extraction]
- 005.1 Programación
26DDC Notation Indicates Hierarchy
- 600 Technology
- 630 Agriculture
- 633 Field and plantation crops
- 633.1 Cereals
- 633.11 Wheat
- 633.12 Buckwheat
- 633.13 Oats
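Because each added digit narrows the class, the chain of broader classes can be derived from the number itself. The sketch below assumes that hierarchy is expressed purely by notation length, which holds for the example on this slide but should not be taken as true of every DDC number; the function name is hypothetical and the captions are the ones shown above.

```python
def ddc_ancestors(notation):
    """Return the chain of classes from the main class down to `notation`,
    derived only from the digits of the number."""
    digits = notation.replace(".", "")
    chain = []
    for length in range(1, len(digits) + 1):
        prefix = digits[:length]
        if length <= 3:
            number = prefix.ljust(3, "0")            # e.g. "6" -> "600", "63" -> "630"
        else:
            number = prefix[:3] + "." + prefix[3:]   # e.g. "6331" -> "633.1"
        if number not in chain:                      # "600" would otherwise repeat
            chain.append(number)
    return chain


# Captions from the slide
LABELS = {
    "600": "Technology",
    "630": "Agriculture",
    "633": "Field and plantation crops",
    "633.1": "Cereals",
    "633.11": "Wheat",
    "633.12": "Buckwheat",
    "633.13": "Oats",
}

for depth, number in enumerate(ddc_ancestors("633.13")):
    print("  " * depth + number, LABELS.get(number, ""))
```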
27Well-developed Hierarchies
28Hierarchies & Categories
- Hierarchical from general to specific
- Categories have superordinate, coordinate, subordinate relationships in hierarchy
- Subcategories must be mutually exclusive
29Hierarchies & Categories
- Top > Recreation > Automotive > Driving > Road Rage
- Social Problems > Public Safety > Traffic Hazards > Highways > Road Rage
30Hierarchies, Categories, Relationships
- 500 Science
- 510 Mathematics
- 512 Algebra, number theory
-
- 512.3 Fields
- Class here field theory, Galois theory
- Class linear algebra in 512.5; class number theory in 512.7
31Advantages of Category Schemes
- Facilitate retrieval based on concepts not simply keywords
- Provide context for search terms (disambiguates)
- Facilitate browsing & search refinement
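A short sketch of how category paths supply context: the two "Road Rage" paths shown earlier in the deck place the same search term in different settings, and grouping hits by the top of each path lets a searcher refine the result. The document ids and the `group_by_context` helper are hypothetical.

```python
from collections import defaultdict

# Category paths from the "Road Rage" slide; the document ids are hypothetical.
HITS = [
    ("doc-1", ("Top", "Recreation", "Automotive", "Driving", "Road Rage")),
    ("doc-2", ("Social Problems", "Public Safety", "Traffic Hazards", "Highways", "Road Rage")),
    ("doc-3", ("Social Problems", "Public Safety", "Traffic Hazards", "Highways", "Road Rage")),
]


def group_by_context(hits, depth=2):
    """Group document ids by the first `depth` levels of their category path."""
    groups = defaultdict(list)
    for doc_id, path in hits:
        groups[" > ".join(path[:depth])].append(doc_id)
    return dict(groups)


for context, docs in group_by_context(HITS).items():
    print(context, docs)
```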
36Advantages & Disadvantages of Formal KO Schemes
- Advantages
- Bring like items together
- Provide context & show relationships
- Support browsing
- May accommodate multilingual usage
- Disadvantages
- Reactive to emerging topics
- Terminology may not match users' vocabulary
- Not practical to apply to everything
37Advantages & Disadvantages of Free Text
- Advantages
- Latest terminology
- Application not an issue
- Disadvantages
- User must produce synonyms and relationships
- Limited browsing
- Little multilingual support
38Other Solutions
- Combine approaches
- Map among KO schemes
- Map free text terms to KO schemes
- Produce supplemental browsable indexes from free
text
39Resources
- ANSI/NISO Z39.19-1993 (Revision of ANSI Z39.19-1980): Guidelines for the Construction, Format, and Management of Monolingual Thesauri <http://www.niso.org/stantech.html#z3919>
- Controlled vocabularies, thesauri and classification systems available in the WWW. DC Subject <http://www.lub.lu.se/metadata/subject-help.html>
- The Intellectual Foundation of Information Organization, by Elaine Svenonius. MIT Press. ISBN 0262194333
- List of Web Subject Resources <http://www.loc.gov/catdir/pcc/saco/resources.html>
- The Organization of Information (Library and Information Science Text Series), by Arlene G. Taylor. Libraries Unlimited. ISBN 1563084988
- Resources for Indexers <http://www.asindexing.org/asires.shtml>