Title: Organizing Information: Metadata and Controlled Vocabularies
1Organizing Information: Metadata and Controlled
Vocabularies
- Ray R. Larson
- University of California, Berkeley
- School of Information Management and Systems
2Overview: Metadata and Controlled Vocabularies
- Definitions
- Origins and Uses of Controlled Vocabularies for
Information Retrieval
- Metadata
- Types of Indexing Languages, Thesauri and
Classification Systems
- Process of Design and Development of Thesauri
3Information Organization and Retrieval
- To organize is to (1) furnish with organs, make
organic, make into living tissue, become organic;
(2) form into an organic whole; give orderly
structure to; frame and put into working order;
make arrangements for.
- Knowledge is knowing, familiarity gained by
experience; a person's range of information; a
theoretical or practical understanding of; the
sum of what is known.
- To retrieve is to (1) recover by investigation or
effort of memory, restore to knowledge or recall
to mind; regain possession of; (2) rescue from a
bad state, revive, repair, set right.
- Information is (1) informing, telling; thing
told, knowledge, items of knowledge, news.
The Oxford English Dictionary, cf. Rowley
4Information Properties
- Information can be communicated electronically
- Broadcasting
- Networking
- Information can be easily duplicated and shared
- Problems of Ownership
- Problems of Control
Adapted from Silicon Dreams by Robert W. Lucky
5Information Hierarchy
- Data
- The raw material of information
- Information
- Data organized and presented by someone
- Knowledge
- Information read, heard or seen and understood
- Wisdom
- Distilled and integrated knowledge and
understanding
6Information Hierarchy
Wisdom
Knowledge
Information
Data
7Information Life Cycle
8Information Life Cycle
- Authoring/Modifying
- Organizing/Indexing
- Storing/Retrieving
- Distribution/Networking
- Accessing/Filtering
- Using/Creating
9Origins
- Very early history of content representation
- Sumerian tokens and envelopes
- Alexandria - pinakes
- Indices
10Origins
- Biblical Indexes and Concordances (Hugo de St.
Caro, 500 monks, 1247 -- KWIC)
- Journal Indexes
- Information Explosion following WWII
- Cranfield Studies of indexing languages and
information retrieval
- Development of bibliographic databases
- Index Medicus -- production and Medlars searching
11Origins
- Communication theory revisited
- Problems with transmission of meaning
Noise
12Structure of an IR System
[Figure: structure of an IR system; recoverable label: "Search Line".]
Adapted from Soergel, p. 19
13Metadata
- Data about data
- Information about Information
- Description of information structure and contents
for individual information items, or entire
collections of information
14Types of Metadata
- Element names.
- Element description.
- Element representation.
- Element coding.
- Element semantics.
- Element classification.
15Metadata Systems
- AACR II/MARC
- Dublin Core
- RDF (Resource Description Framework)
- SGML/XML
- DBMS Metadata
- Controlled vocabularies
16Goals of Descriptive Cataloging (AACR II/MARC)
- 1. To enable a person to find a document of which
- the author, or
- the title, or
- the subject is known
- 2. To show what a library has
- by a given author
- on a given subject (and related subjects)
- in a given kind (or form) of literature.
- 3. To assist in the choice of a document
- as to its edition (bibliographically)
- as to its character (literary or topical)
Charles A. Cutter, 1876
17Dublin Core Elements
- Title
- Creator
- Subject
- Description
- Publisher
- Other Contributors
- Date
- Resource Type
- Format
- Resource Identifier
- Source
- Language
- Relation
- Coverage
- Rights Management
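As a rough illustration of how the fifteen elements can be used, the sketch below stores a record as a dictionary keyed by the element labels on this slide and renders it as HTML meta tags. The `DC.` name prefix, the function name `to_meta_tags`, and every record value other than the title and creator (taken from the title slide) are assumptions made for the example, not part of the original slides.

```python
from html import escape

# The fifteen element labels exactly as listed on the slide.
DC_ELEMENTS = [
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Other Contributors", "Date", "Resource Type", "Format",
    "Resource Identifier", "Source", "Language", "Relation",
    "Coverage", "Rights Management",
]

def to_meta_tags(record):
    """Render a record (dict: element label -> value) as HTML <meta> tags.

    Labels that are not on the slide raise ValueError. Elements without
    a value are simply skipped, since Dublin Core elements are optional.
    """
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    lines = []
    for label in DC_ELEMENTS:  # keep the slide's element order
        if label in record:
            # Assumed naming convention: "DC." + label without spaces.
            name = "DC." + label.replace(" ", "")
            content = escape(record[label])
            lines.append(f'<meta name="{name}" content="{content}">')
    return "\n".join(lines)

# Illustrative record: title and creator come from the title slide;
# the other two values are hypothetical.
record = {
    "Title": "Organizing Information: Metadata and Controlled Vocabularies",
    "Creator": "Larson, Ray R.",
    "Resource Type": "Lecture slides",  # hypothetical value
    "Language": "en",                   # hypothetical value
}
print(to_meta_tags(record))
```

Keeping the element list separate from the record makes it easy to reject labels that are not part of the element set.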
18RDF (W3C)
- A model for representing named properties and
property values
- Resources (the things described)
- Properties (aspects, attributes, characteristics
of resources)
- Statements (Resource, Property, Value of Property
for the Resource)
- Expressed in XML
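The resource, property, value model can be sketched with plain tuples, without any RDF library. The resource identifier below is a hypothetical URL, the helper names `values_of` and `describe` are invented for this example, and the XML serialization mentioned on the slide is not shown.

```python
from collections import namedtuple

# A statement ties together a resource, a named property, and a value.
Statement = namedtuple("Statement", ["resource", "property", "value"])

# Hypothetical identifier for the thing being described.
LECTURE = "http://example.org/lectures/organizing-information"

statements = [
    Statement(LECTURE, "Title",
              "Organizing Information: Metadata and Controlled Vocabularies"),
    Statement(LECTURE, "Creator", "Ray R. Larson"),
    Statement(LECTURE, "Language", "en"),  # illustrative value
]

def values_of(statements, resource, prop):
    """Return every value stated for one property of one resource."""
    return [s.value for s in statements
            if s.resource == resource and s.property == prop]

def describe(statements, resource):
    """Collect all property/value pairs stated about a resource."""
    description = {}
    for s in statements:
        if s.resource == resource:
            description.setdefault(s.property, []).append(s.value)
    return description

print(values_of(statements, LECTURE, "Creator"))  # ['Ray R. Larson']
```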
19SGML XML
- What is SGML/XML?
- Document Type Definitions
- Document Markup
- Sources and Resources
20Databases Metadata
- Particularly in the Relational Model, metadata is
part of the database, providing information about
the structure and contents of the database
- What Relations (tables) are in the DB
- Relation (table) attributes (domains)
- Attribute representation and storage
- Other information (indexes, etc.)
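SQLite is one convenient place to see database metadata in practice, because its catalog can be queried like any other table. The sketch below uses Python's standard `sqlite3` module; the `documents` table and `idx_author` index are hypothetical and exist only so there is something to describe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and index, created only for the demonstration.
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, title TEXT NOT NULL, author TEXT)"
)
conn.execute("CREATE INDEX idx_author ON documents (author)")

# Which relations (tables) and indexes are in the database?
objects = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()
print(objects)

# Which attributes does a relation have, and how are they represented?
# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
columns = [
    (name, ctype)
    for _cid, name, ctype, *_rest in conn.execute("PRAGMA table_info(documents)")
]
print(columns)
```

Other database systems expose the same kind of information through their own catalogs, so the queries here are specific to SQLite.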
21Controlled Vocabularies
- Vocabulary control is the attempt to provide a
standardized and consistent set of terms (such as
subject headings, names, classifications, etc.)
with the intent of aiding the searcher in finding
information.
22Controlled Vocabularies
- Names and name authorities
- Design of controlled vocabularies for subject
access -- Thesaurus design
23Names
- Cutter's (1876) objectives of bibliographic
description
- To enable a person to find a document of which
the author is known.
- To show what the library has by a given author.
- First serves access.
- Second serves collocation.
24Problems with Names
- How many names should be associated with a
document?
- Which of these should be the main entry?
- What form should each of the names take?
- What references should be made from other
possible forms of names that haven't been used?
25The problem
- Proliferation of the forms of names
- Different names for the same person
- Different people with the same names
- Examples
- from Books in Print (semi-controlled but not
consistent)
- ERIC author index (not controlled)
26Rules for description
- AACR II and other sets of descriptive cataloging
rules provide guidelines for
- Determining the number of name entries
- Choosing a main entry
- Deciding on the form of name to be used
- Deciding when to make references
27Authority control
- Authority control is concerned with creation and
maintenance of a set of terms that have been
chosen as the standard representatives (also known
as established) based on some set of rules.
- If you have rules, why do you need to keep track
of all of the headings?
28Conditions of Authorship?
- Single person or single corporate entity
- Unknown or anonymous authors
- Shared responsibility
- Collections or editorially assembled works
- Works of mixed responsibility (e.g. translations)
- Related Works
29Added Entries
- Personal names
- Collaborators
- Editors, compilers, writers
- Translators (in some cases)
- Illustrators (in some cases)
- Other persons associated with the work (such as
the honoree in a Festschrift).
- Corporate Names
- Any prominently named corporate body that has
involvement in the work beyond publication,
distribution, etc.
30Choice of Name
- AACR II says that the predominant form of the
name used in a particular author's writings
should be chosen as the form of name.
- References should be made from the other forms of
the name.
31Form of the Name
- When names appear in multiple forms, one form
needs to be chosen. Criteria for choice are:
- Fullness (e.g. full names vs. initials only)
- Language of the name.
- Spelling (choose predominant form)
- Entry element
- John Smith or Smith, John?
- Mao Zedong or Zedong, Mao? (Mao Tse Tung?)
32Name Authority Files
IDNAFL8057230 STp ELn STHa MSc
UIPa TD19910821174242 KRCa NMUa
CRCc UPNa SBUa SBCa DIDn
DF05-14-80 RFEa CSC SRUb SRTn
SRNn TSS TGA? ROM? MOD VSTd
08-21-91 Other Versions earlier
040 DLC $c DLC $d DLC $d OCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret, $d 1908-1973
400 10 Cooper, Henry St. John, $d 1908-1973
400 00 Credo, $d 1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick, $d 1908-1973
400 10 Hope, Brian, $d 1908-1973
400 10 Hughes, Colin, $d 1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry, $d 1908-1973
400 10 Wilde, Jimmy
500 10 $w nnnc $a Ashe, Gordon, $d 1908-1973
Different names for the same person
33Name Authority Files
IDNAFO9114111 STp ELn STHa MSn
UIPa TD19910817053048 KRCa NMUa
CRCc UPNa SBUa SBCa DIDn
DF06-03-91 RFEa CSCc SRUb SRTn
SRNn TSS TGA? ROM? MOD VSTd
08-19-91
040 OCoLC $c OCoLC
100 10 Marric, J. J., $d 1908-1973
500 10 $w nnnc $a Creasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under $b Creasey, John
670 OCLC 13441825: His Gideon's day, 1955 $b (hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91 $b (hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987 $b (Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)
34Name authority files
IDNAFL8166762 STp ELn STHa MSc
UIPa TD19910604053124 KRCa NMUa
CRCc UPNa SBUa SBCa DIDn
DF08-20-81 RFEa CSC SRUb SRTn
SRNn TSS TGA? ROM? MOD VSTd
06-06-91 Other Versions earlier
040 DLC $c DLC $d DLC $d OCoLC
100 10 Butler, William Vivian, $d 1927-
400 10 Butler, W. V. $q (William Vivian), $d 1927-
400 10 Marric, J. J., $d 1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981 $b t.p. (W.V. Butler)
670 His Gideon's way, 1986 $b CIP t.p. (William Vivian Butler writing as J.J. Marric)
Different people writing with the same name
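The three authority records above can be read as a lookup structure: each established heading (field 100) carries see-from references (400) and see-also references (500). The sketch below is a simplified, hypothetical model of that idea using a few of the names from the records; the dictionary layout and the `lookup` function are assumptions for illustration, not how any real authority system is implemented.

```python
# Established heading -> see-from (4xx) and see-also (5xx) forms.
# Only a subset of the references from the records is included.
AUTHORITY = {
    "Creasey, John": {
        "see_from": ["Cooke, M. E.", "Gill, Patrick, 1908-1973", "Wilde, Jimmy"],
        "see_also": ["Ashe, Gordon, 1908-1973"],
    },
    "Marric, J. J., 1908-1973": {
        "see_from": [],
        "see_also": ["Creasey, John"],
    },
    "Butler, William Vivian, 1927-": {
        "see_from": ["Butler, W. V. (William Vivian), 1927-", "Marric, J. J., 1927-"],
        "see_also": [],
    },
}

def lookup(query):
    """Return the established headings that match a name as searched.

    A heading matches when it, or one of its see-from references,
    starts with the query (case-insensitive).
    """
    q = query.casefold()
    hits = []
    for heading, record in AUTHORITY.items():
        forms = [heading] + record["see_from"]
        if any(form.casefold().startswith(q) for form in forms):
            hits.append(heading)
    return hits

# Different names for the same person: a pseudonym leads to one heading.
print(lookup("Gill, Patrick"))  # ['Creasey, John']

# Different people with the same name: one name leads to two headings.
print(lookup("Marric, J. J."))
# ['Marric, J. J., 1908-1973', 'Butler, William Vivian, 1927-']
```

This is one answer to the earlier question of why headings must be tracked even when rules exist: the rules say how to form a heading, but only the file records which forms have already been established and which references point to them.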
35Controlled Vocabularies for Information Access
- "The greatest problem of today is how to teach
people to ignore the irrelevant, how to refuse to
know things, before they are suffocated. For too
many facts are as bad as none at all." (W.H.
Auden)
- Similarly, there are too many ways of expressing
or explaining the topic of a document.
- Controlled vocabularies are sets of Rules for
topic identification and indexing, and a
THESAURUS, which consists of lead-in vocabulary
and a limited and selective Indexing Language,
sometimes with special coding or structures.
36Structure of an IR System
[Figure: structure of an IR system; recoverable label: "Search Line".]
Adapted from Soergel, p. 19
37Uses of Controlled Vocabularies
- Library Subject Headings, Classification and
Authority Files.
- Commercial Journal Indexing Services and
databases
- Yahoo, and other Web classification schemes
- Online and Manual Systems within organizations
- SunSolve
- MacArthur
38Types of Indexing Languages
- Uncontrolled Keyword Indexing
- Indexing Languages
- Controlled, but not structured
- Thesauri
- Controlled and Structured
- Classification Systems
- Controlled, Structured, and Coded
- Faceted Classification Systems
39Indexing Languages
- An index is a systematic guide designed to
indicate topics or features of documents in order
to facilitate retrieval of documents or parts of
documents.
- An Indexing language is the set of terms used in
an index to represent topics or features of
documents, and the rules for combining or using
those terms.
40Indexing Languages
- Library of Congress Subject Headings
- Yellow Pages Topics
- Wilson Indexes (Readers' Guide)
41Thesauri
- A Thesaurus is a collection of selected
vocabulary (preferred terms or descriptors) with
links among Synonymous, Equivalent, Broader,
Narrower and other Related Terms
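A thesaurus of this kind can be sketched as a small data structure. In the example below, the class and method names are invented for illustration and the vocabulary (Vehicles, Automobiles, and so on) is hypothetical; the point is only that synonym links lead to a preferred term, and that broader/narrower and related links are kept reciprocal.

```python
from collections import defaultdict

class Thesaurus:
    """Preferred terms with synonym, broader/narrower and related links."""

    def __init__(self):
        self.use = {}                     # entry term -> preferred term
        self.broader = defaultdict(set)   # descriptor -> broader terms
        self.narrower = defaultdict(set)  # descriptor -> narrower terms
        self.related = defaultdict(set)   # descriptor -> related terms

    def add_synonym(self, entry_term, descriptor):
        self.use[entry_term] = descriptor

    def add_hierarchy(self, broader, narrower):
        # Broader and narrower links are stored in both directions.
        self.narrower[broader].add(narrower)
        self.broader[narrower].add(broader)

    def add_related(self, a, b):
        # Related-term links are symmetric.
        self.related[a].add(b)
        self.related[b].add(a)

    def preferred(self, term):
        """Map a synonym to its descriptor; descriptors map to themselves."""
        return self.use.get(term, term)

    def expand(self, term):
        """Return the descriptor plus all of its narrower terms."""
        start = self.preferred(term)
        seen, stack = {start}, [start]
        while stack:
            for nt in self.narrower.get(stack.pop(), ()):
                if nt not in seen:
                    seen.add(nt)
                    stack.append(nt)
        return seen

# Hypothetical vocabulary for the demonstration.
t = Thesaurus()
t.add_synonym("Cars", "Automobiles")
t.add_hierarchy("Vehicles", "Automobiles")
t.add_hierarchy("Automobiles", "Sports cars")
t.add_related("Automobiles", "Roads")

print(t.preferred("Cars"))       # Automobiles
print(sorted(t.expand("Cars")))  # ['Automobiles', 'Sports cars']
```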
42Thesauri (cont.)
- National and International Standards for Thesauri
- ANSI/NISO Z39.19-1994 -- American National
Standard Guidelines for the Construction, Format
and Management of Monolingual Thesauri
- ANSI/NISO Draft Standard Z39.4-199x -- American
National Standard Guidelines for Indexes in
Information Retrieval
- ISO 2788 -- Documentation -- Guidelines for the
establishment and development of monolingual
thesauri
- ISO 5964 -- Documentation -- Guidelines for the
establishment and development of multilingual
thesauri
43Thesauri (cont.)
- Examples
- The ERIC Thesaurus of Descriptors
- The Art and Architecture Thesaurus
- The Medical Subject Headings (MeSH) of the
National Library of Medicine
44Why develop a thesaurus?
- To provide a conceptual structure or space for
a body of information
- To make it possible to adequately describe the
topical contents of informational objects at an
appropriate level of generality or specificity
- To provide enhanced search capabilities and to
improve the effectiveness of searching (i.e., to
retrieve most of the relevant material without
too much irrelevant material).
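The slide does not name them, but "most of the relevant material" and "not too much irrelevant material" correspond to the two measures usually called recall and precision. A minimal sketch, with hypothetical document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: four documents retrieved, three relevant overall,
# two of the relevant ones found.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p)            # 0.5
print(round(r, 3))  # 0.667
```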
45Why develop a thesaurus?
- To provide vocabulary (or terminological)
control.
- When there are several possible terms designating
a single concept, the thesaurus should lead the
indexer or searcher to the appropriate concept,
regardless of the terms they start with.
46Preliminary considerations
- What is used now?
- Continue using an existing thesaurus?
- Ad hoc modification of existing thesaurus?
- Develop a new well-structured thesaurus?
- What is the scope and complexity of the subject
field?
- What kind of retrieval objects or data will be
dealt with?
- How exhaustive and specific is the desired
description of objects?
47Preliminary Considerations
- The scope and complexity of the field will
provide some indication of the scope and
complexity of the thesaurus.
- It is better to plan for a larger and more
comprehensive system than a smaller system that
rapidly will become inadequate as the database
grows.
- Development of a good thesaurus requires a major
intellectual effort as well as clerical
operations like data entry and production of
sorted lists.
48Development of a Thesaurus
- Term Selection.
- Merging and Development of Concept Classes.
- Definition of Broad Subject Fields and Subfields.
- Development of Classificatory structure
- Review, Testing, Application, Revision.
49Flow of Work in Thesaurus Construction
50The Indexing Process
- Concept identification
- term selection (via thesaurus)
- term assignment
51Application: The Indexing Process (Manual)
[Flowchart: manual indexing process. Recoverable steps: "Is term suitable?"; if NO, "Select alternative term to represent concept".]
Adapted from ISO 5963, p. 5
52Classification Systems
- A classification system is an indexing language
often based on a broad ordering of topical areas.
Thesauri and classification systems both use this
broad ordering and maintain a structure of
broader, narrower, and related topics.
Classification schemes commonly use a coded
notation for representing a topic and its place
in relation to other terms.
53Classification Systems (cont.)
- Examples
- The Library of Congress Classification System
- The Dewey Decimal Classification System
- The ACM Computing Reviews Categories
- The American Mathematical Society Classification
System
54Automatic Indexing and Classification
- Automatic indexing is typically the simple
deriving of keywords from a document and
providing access to all of those words.
- More complex Automatic Indexing Systems attempt
to select controlled vocabulary terms based on
terms in the document.
- Automatic classification attempts to
automatically group similar documents using
either
- A fully automatic clustering method.
- An established classification scheme and set of
documents already indexed by that scheme.
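The simplest form of automatic indexing described above, deriving keywords and providing access through them, can be sketched as an inverted index. The stop list, the sample documents and the function names are all assumptions for the example; real systems typically add stemming, weighting and much larger stop lists.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for"}

def keywords(text):
    """Derive keywords: lowercase words that are not on the stop list."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def build_index(docs):
    """Map every derived keyword to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in keywords(text):
            index[word].add(doc_id)
    return index

# Hypothetical documents.
docs = {
    "d1": "Design of a thesaurus for information retrieval",
    "d2": "Automatic indexing and classification of documents",
    "d3": "Indexing languages and the thesaurus",
}
index = build_index(docs)
print(sorted(index["thesaurus"]))  # ['d1', 'd3']
print(sorted(index["indexing"]))   # ['d2', 'd3']
```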
55Clustering
Agglomerative methods: Polythetic, Exclusive or
Overlapping, Unordered; clusters are
order-dependent.
[Figure: documents (Doc) grouped into clusters]
Rocchio's method
1. Select initial centers (i.e. seed the space)
2. Assign docs to highest matching centers and compute centroids
3. Reassign all documents to centroid(s)
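The three steps above can be sketched as follows. This is a simplified illustration rather than a faithful implementation of Rocchio's method: it assumes bag-of-words vectors, cosine similarity as the matching function, caller-chosen seed documents, and exclusive assignment (each document goes to one cluster only). The sample texts and all function names are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: term -> frequency."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity of two term vectors."""
    dot = sum(w * b[t] for t, w in a.items())  # Counter gives 0 if missing
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Average term vector of a non-empty list of vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: w / len(vectors) for t, w in total.items()})

def assign(vectors, centers):
    """Index of the highest matching center for each vector."""
    return [max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
            for v in vectors]

def cluster(texts, seed_indices):
    vectors = [vectorize(t) for t in texts]
    centers = [vectors[i] for i in seed_indices]  # 1. seed the space
    labels = assign(vectors, centers)             # 2. assign to centers...
    centroids = []
    for c in range(len(centers)):                 # ...and compute centroids
        members = [v for v, lab in zip(vectors, labels) if lab == c]
        centroids.append(centroid(members) if members else centers[c])
    return assign(vectors, centroids)             # 3. reassign to centroids

# Hypothetical documents; documents 0 and 2 are used as seeds.
texts = [
    "cat dog pet",
    "dog pet vet",
    "stock bond market",
    "bond market fund",
    "cat pet food",
]
print(cluster(texts, seed_indices=[0, 2]))  # [0, 0, 1, 1, 0]
```

Because the result depends on which documents are chosen as seeds, a different seed selection can produce different clusters.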
56Automatic Class Assignment
Automatic Class Assignment: Polythetic, Exclusive
or Overlapping, usually ordered; clusters are
order-independent, usually based on an
intellectually derived scheme
[Figure: documents (Doc) passed through a Search Engine]
1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category
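A minimal sketch of the four steps, assuming cosine similarity over bag-of-words vectors stands in for the search engine. The class labels, pseudo-document texts, threshold values and function names are all hypothetical; in practice the pseudo-documents would be built from an established scheme and from documents already indexed by it.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: term -> frequency."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity of two term vectors."""
    dot = sum(w * b[t] for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Step 1: one pseudo-document per class (hypothetical classes and text).
CLASS_PSEUDO_DOCS = {
    "Information retrieval": "search query retrieval index relevance ranking",
    "Cataloging": "catalog author title heading authority entry",
    "Databases": "relation table attribute database index storage",
}

def rank_classes(document):
    """Steps 2 and 3: use the document as a query, return a ranked list."""
    doc_vec = vectorize(document)
    scores = [(cosine(doc_vec, vectorize(pseudo)), label)
              for label, pseudo in CLASS_PSEUDO_DOCS.items()]
    return sorted(scores, reverse=True)

def assign_classes(document, threshold=0.2, max_classes=2):
    """Step 4, first option: up to N classes scoring over the threshold."""
    ranked = rank_classes(document)
    return [label for score, label in ranked if score > threshold][:max_classes]

def top_class(document):
    """Step 4, second option: the single top-ranked class."""
    return rank_classes(document)[0][1]

doc = "index of the database table"
print(top_class(doc))                      # Databases
print(assign_classes(doc, threshold=0.2))  # ['Databases']
print(assign_classes(doc, threshold=0.1))  # ['Databases', 'Information retrieval']
```

Lowering the threshold gives the overlapping case, where one document is placed in several classes; taking only the top-ranked class gives the exclusive case.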
57References
- Soergel, D. Indexing Languages and Thesauri:
Construction and Maintenance. Los Angeles:
Melville Publishing Co., 1974.
- Foskett, A.C. The Subject Approach to
Information. London: Clive Bingley, 1982.
- Standards
- ISO 2788 -- Documentation -- Guidelines for the
establishment and development of monolingual
thesauri
- ISO 5964 -- Documentation -- Guidelines for the
establishment and development of multilingual
thesauri
- ANSI/NISO Z39.19-1994 -- American National
Standard Guidelines for the Construction, Format
and Management of Monolingual Thesauri