Title: Prof. Ray Larson
1Lecture 22 Thesauri and Metadata
SIMS 202 Information Organization and Retrieval
- Prof. Ray Larson Prof. Marc Davis
- UC Berkeley SIMS
- Tuesday and Thursday 10:30 am - 12:00 pm
- Fall 2004
- http://www.sims.berkeley.edu/academics/courses/is202/f04/
2Lecture Overview
- Review (and expansion)
- Facetted Classification
- Thesaurus Design and Development
- Metadata And Markup
- XML As A Metadata Lingua Franca
- Dublin Core Revisited
- METS
- Other Metadata schemas and protocols in XML
- Discussion
4Indexing Languages
- An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents
- An indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms
5Controlled Vocabularies
- Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information
- That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata
6Hierarchical Classification
Slide author Marti Hearst
7Labeled Categories for Hierarchical Classification
- LITERATURE
- 100 English Literature
- 110 English Prose
- English Prose 16th Century
- English Prose 17th Century
- English Prose 18th Century
- ...
- 111 English Poetry
- 121 English Poetry 16th Century
- 122 English Poetry 17th Century
- ...
- 112 English Drama
- 130 English Drama 16th Century
-
- 200 French Literature
Slide author Marti Hearst
8Facetted Categories
- Mutually exclusive
- Non-overlapping, distinct categories
- Relational
- Relations between facets, subfacets, and foci (elements) are not restricted to hierarchical generalization-specialization relations
- Composable
- Combined using grammars of order and relation to form compound descriptions
9Facetted Classification Along With Labeled
Categories
- A Language
- a English
- b French
- c Spanish
- B Genre
- a Prose
- b Poetry
- c Drama
- C Period
- a 16th Century
- b 17th Century
- c 18th Century
- d 19th Century
- Aa English Literature
- AaBa English Prose
- AaBaCa English Prose 16th Century
- AbBbCd French Poetry 19th Century
- BcCd Drama 19th Century
Slide author Marti Hearst
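The composition of facet codes into labels can be sketched in a few lines of Python. The facet tables below are copied from the slide; the function name `decode` and the rule that a notation is a sequence of two-character (facet, focus) pairs are assumptions made for illustration, not part of any published scheme.

```python
# Facet tables taken from the slide; the decoding rule is an illustrative assumption.
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(notation: str) -> str:
    """Expand a notation such as 'AaBaCa' into a compound label."""
    if len(notation) % 2 != 0:
        raise ValueError(f"notation must be facet/focus pairs: {notation!r}")
    labels = []
    for i in range(0, len(notation), 2):
        facet, focus = notation[i], notation[i + 1]
        if facet not in FACETS or focus not in FACETS[facet][1]:
            raise ValueError(f"unknown facet/focus pair: {facet + focus!r}")
        labels.append(FACETS[facet][1][focus])
    return " ".join(labels)


print(decode("AaBaCa"))  # English Prose 16th Century
print(decode("AbBbCd"))  # French Poetry 19th Century
```

Because each pair is looked up independently, a facet can be left out of the notation without changing how the others are read.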
10Ranganathan
- PMEST Facets
- P(ersonality)
- WHO: Types of things
- M(atter)
- WHAT: Constituent materials
- E(nergy)
- HOW: Action or activity terms
- S(pace)
- WHERE: Where things occur
- T(ime)
- WHEN: When things occur
11Classical Facet Analysis
- What is being done?
- Entity
- Kind
- Product
- By-Product
- What are its parts?
- Part
- What are its properties?
- Property
- Material
- How is this achieved?
- Process
- By what means?
- Operation
- By whom?
- Agent
- Patient
- Where?
- Space
- When?
- Time
12Semantic and Syntactic Relationships
- Semantic relationships
- Is-A (thing/kind, genus/species)
- Mammals
- Primates
- Humans
- Has-Parts
- Human
- Head
- Eyes
- Syntactic relationships
- Compounds
- Wheat + harvesting = wheat harvesting
- Object + operation = operation on object
13Facetted Classification
- Clearly distinguishes between semantic relationships and syntactic relationships
- Semantic relationships
- Within a facet
- Containment relations
- Syntactic relationships
- Across facets
- Combinatoric relations
- Have a syntax for syntactic combination of semantic terms
14Power of Facet Combinations
- The syntactic relations of facetted classifications enable a small controlled vocabulary to produce
- Many, many structured descriptions
- Complex, but formally structured descriptions using nested compound descriptions
- Descriptions for things we do not have words for
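To make the size of the effect concrete, here is a rough count in Python using the Language, Genre, and Period facets from the earlier slide. It assumes each facet is optional and contributes at most one focus, which is a simplification chosen for this sketch; nested compound descriptions would yield even more.

```python
from itertools import product

# Ten controlled terms in three facets (taken from the Language/Genre/Period slide).
FOCI = {
    "Language": ["English", "French", "Spanish"],
    "Genre": ["Prose", "Poetry", "Drama"],
    "Period": ["16th Century", "17th Century", "18th Century", "19th Century"],
}


def all_descriptions(foci):
    """Yield every non-empty description using at most one focus per facet."""
    options = [[None] + terms for terms in foci.values()]
    for combo in product(*options):
        chosen = [term for term in combo if term is not None]
        if chosen:
            yield " ".join(chosen)


term_count = sum(len(terms) for terms in FOCI.values())
descriptions = list(all_descriptions(FOCI))
print(term_count, len(descriptions))  # 10 79
```

Ten terms give 79 distinct descriptions here, since (3 + 1) x (3 + 1) x (4 + 1) - 1 = 79; adding one more facet multiplies the count rather than adding to it.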
15Lecture Overview
- Review (and expansion)
- Facetted Classification
- Thesaurus Design and Development
- Metadata And Markup
- XML As A Metadata Lingua Franca
- Dublin Core Revisited
- METS
- Other Metadata schemas and protocols in XML
- Discussion
16Types of Indexing Languages
- Uncontrolled keyword indexing
- Indexing languages
- Controlled, but not structured
- Thesauri
- Controlled and structured
- Classification systems
- Controlled, structured, and coded
- Facetted classification systems
17Thesauri
- A Thesaurus is a collection of selected
vocabulary (preferred terms or descriptors) with
links among synonymous, equivalent, broader,
narrower and other related terms
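A minimal sketch of how the broader, narrower, and related links might be stored, assuming the conventional BT/NT/RT labels. The class and method names are hypothetical; the Mammals/Primates/Humans chain comes from the earlier Is-A slide, and "Primatology" is an invented related term used only for illustration.

```python
from collections import defaultdict


class Thesaurus:
    """Stores broader (BT), narrower (NT), and related (RT) links between terms."""

    def __init__(self):
        self.broader = defaultdict(set)
        self.narrower = defaultdict(set)
        self.related = defaultdict(set)

    def add_broader(self, term, broader_term):
        # BT and NT are reciprocal, so both directions are recorded together.
        self.broader[term].add(broader_term)
        self.narrower[broader_term].add(term)

    def add_related(self, term_a, term_b):
        # RT is symmetric.
        self.related[term_a].add(term_b)
        self.related[term_b].add(term_a)

    def ancestors(self, term):
        """Return every broader term reachable by following BT links upward."""
        seen, stack = set(), [term]
        while stack:
            current = stack.pop()
            for parent in self.broader.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


t = Thesaurus()
t.add_broader("Humans", "Primates")
t.add_broader("Primates", "Mammals")
t.add_related("Primates", "Primatology")  # hypothetical related term
print(sorted(t.ancestors("Humans")))  # ['Mammals', 'Primates']
```

Recording reciprocal links in one call keeps the two directions from drifting apart, which is one of the consistency checks thesaurus software typically has to enforce.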
18Thesaurus Standards
- National and International Standards for Thesauri
- ANSI/NISO Z39.19-1994: American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
- ANSI/NISO Draft Standard Z39.4-199x: American National Standard Guidelines for Indexes in Information Retrieval
- ISO 2788: Documentation, Guidelines for the establishment and development of monolingual thesauri
- ISO 5964: Documentation, Guidelines for the establishment and development of multilingual thesauri
19Thesaurus Examples
- Examples
- The ERIC Thesaurus of Descriptors
- The Medical Subject Headings (MESH) of the National Library of Medicine
- The Art and Architecture Thesaurus
20ERIC Thesaurus Entry
21ERIC Thesaurus Alphabetic
22ERIC Thesaurus KWIC Index
23ERIC Thesaurus Hierarchies
24ERIC Thesaurus Groups
25ERIC Thesaurus Online
http://www.ericfacility.net/extra/pub/thessearch.cfm
26MESH Entry
27MESH Alphabetic
28MESH Tree Structures
29MESH KWOC Index
30MESH - Online
http://www.nlm.nih.gov/mesh/meshhome.html
31AAT Facets
32AAT Hierarchies (print)
33AAT Hierarchies (online)
http://www.getty.edu/research/tools/vocabulary/aat/
34AAT Entry (online)
35Why Develop a Thesaurus?
- To provide a conceptual structure or space for a body of information
- To make it possible to adequately describe the topical content of information resources at an appropriate level of generality or specificity
- To provide enhanced search capabilities and to improve the effectiveness of searching (i.e., to retrieve most of the relevant material without too much irrelevant material)
36Why Develop a Thesaurus?
- To provide vocabulary (or terminological) control
- When there are several possible terms designating
a single concept, the thesaurus should lead the
indexer or searcher to the appropriate concept,
regardless of the terms they start with
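Vocabulary control of this kind can be sketched as a lookup from any entry term to the preferred term for its concept. The class name, the normalization rule (lowercase, collapsed whitespace), and the example terms are all assumptions made for illustration.

```python
def normalize(term: str) -> str:
    """Fold case and collapse whitespace so trivial variants match."""
    return " ".join(term.lower().split())


class EntryVocabulary:
    """Maps every known entry term to the preferred term for its concept."""

    def __init__(self):
        self._preferred = {}

    def add_concept(self, preferred, variants=()):
        # The preferred term is also an entry point to itself.
        for term in (preferred, *variants):
            self._preferred[normalize(term)] = preferred

    def lookup(self, term):
        """Return the preferred term, or None if the term is not in the vocabulary."""
        return self._preferred.get(normalize(term))


vocab = EntryVocabulary()
vocab.add_concept("Automobiles", variants=["Cars", "Motor cars"])  # hypothetical concept
print(vocab.lookup("motor  CARS"))  # Automobiles
print(vocab.lookup("Bicycles"))     # None
```

Returning `None` for unknown terms, rather than guessing, gives the indexer a visible signal that the thesaurus has a gap.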
37Preliminary Considerations
- What is used now?
- Continue using an existing thesaurus?
- Ad hoc modification of existing thesaurus?
- Develop a new well-structured thesaurus?
- What is the scope and complexity of the subject field?
- What kind of retrieval objects or data will be dealt with?
- How exhaustive and specific is the desired description of objects?
38Preliminary Considerations
- The scope and complexity of the field will provide some indication of the scope and complexity of the thesaurus
- It is better to plan for a larger and more comprehensive system than a smaller system that will rapidly become inadequate as the database grows
- Development of a good thesaurus requires a major intellectual effort as well as clerical operations like data entry and production of sorted lists
39Development of a Thesaurus
- Term selection
- Merging and development of concept classes
- Definition of broad subject fields and subfields
- Development of classificatory structure
- Review, testing, application, revision
40Flow of Work in Thesaurus Construction
411. Term Selection
- Select sources for the collection of terms
- Prearranged Sources
- Open-ended Sources
- Assign codes to each source
- Selection of terms
- For part of pre-arranged and for all open-ended sources
- Enter terms into database with all information
421.1 Kinds of Sources
- Prearranged Sources
- Existing descriptor lists, classification schemes, thesauri
- This includes universal schemes like DDC or LCSH
- Nomenclatures of single disciplines
- Treatises on the terminology of a field
- Encyclopedias, lexica, dictionaries and glossaries
- Tables of contents of textbooks and handbooks
- Indexes of journals or abstracting journals
- Indexes of other publications in the field
431.1 Kinds of Sources
- Open-ended sources
- Lists of search requests or interest profiles
- Description of projects/activities to be served by the information retrieval system
- Discussion with specialists in the field
- Sample of documents in the field
- Ask users why and how these documents relate to the field
- Have documents indexed by experts in the field
- Lists of titles of documents in the field
- Abstracts and reviews of documents
- Your own knowledge
44Selection of Sources
- Prearranged sources require less effort in gathering the material, and may already indicate some relationships among terms and concepts
- Open-ended sources can reflect current terminology and may provide more complete coverage
- Choose a set of sources that are current, as complete as possible, and considered authoritative
45Selection of Terms
- In open-ended sources you read through the source and pick out terms (i.e. words and phrases) that might be useful in retrieval or as references to other terms
- Alternatively, use keyword and phrase extraction software to create lists of terms and select from those
- Transfer selected terms to the recording medium (cards or database)
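As a rough idea of what keyword and phrase extraction software does, the sketch below counts single words and two-word phrases across a set of texts. The stopword list, the function name, and the sample sentences are placeholders chosen for illustration; real extraction tools use far richer linguistic processing.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def candidate_terms(texts, max_words=2):
    """Count words and short phrases that neither start nor end with a stopword."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts


sample = [
    "Information retrieval systems index documents.",
    "Evaluation of information retrieval",
]
counts = candidate_terms(sample)
print(counts["information retrieval"])  # 2
```

The resulting list is only a pool of candidates. Selection of the terms worth recording remains an intellectual task for the thesaurus builder.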
46Work Form: Still relevant?
From Soergel, p. 399
472. Merging and Development of Concept Classes
- Sort Term DB into alphabetical order
- First Round
- Merge information for identical terms, possibly pulling info from additional sources
- Second Round
- Merge synonyms or terms in the same concept class
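The two rounds can be sketched as follows, assuming each record in the term database is a (term, source code) pair and that synonym judgments arrive as pairs of terms supplied by a person. The function names, record layout, and sample data are hypothetical.

```python
from collections import defaultdict


def merge_identical(records):
    """First round: pool source codes for identical terms, sorted alphabetically."""
    merged = defaultdict(set)
    for term, source in records:
        merged[term.strip().lower()].add(source)
    return dict(sorted(merged.items()))


def merge_concept_classes(merged, synonym_pairs):
    """Second round: group terms judged synonymous into concept classes."""
    parent = {term: term for term in merged}

    def find(term):
        # Follow links to the representative term of the class.
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for a, b in synonym_pairs:
        if a in parent and b in parent:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the alphabetically first term as representative.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    classes = defaultdict(lambda: {"terms": [], "sources": set()})
    for term, sources in merged.items():
        root = find(term)
        classes[root]["terms"].append(term)
        classes[root]["sources"] |= sources
    return dict(classes)


records = [("Cars", "S1"), ("cars", "S2"), ("Automobiles", "S1"), ("Trucks", "S3")]
merged = merge_identical(records)
classes = merge_concept_classes(merged, [("cars", "automobiles")])
print(classes["automobiles"]["terms"])  # ['automobiles', 'cars']
```

The representative chosen here is just the alphabetically first term; choosing the preferred term for each class is a later, separate decision.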
483. Definition of Broad Subject Fields and
Subfields
- Define broad subject fields and sort terms into these broad fields
- Define subfields within each broad field and sort terms into these subfields
- Work out the detailed structure
- Select preferred terms
- Merge information for terms in the same concept class
- Repeat these steps
- For each subfield within a broad field
- And for each broad field
- Until all terms have been consolidated and preferred terms selected
494. Development of Classificatory Structure
- Produce preliminary version of classified index and update the working database
- Improve classificatory structure
- Reality check
- Produce and distribute a version of the classified index
- Distribute to users/experts
505. Final Stages
- Review
- Testing
- Application
- Revision
51Review
- Discuss classified index with users/experts
- Select descriptors and checklist descriptors
- Assign notational symbols
- Produce main thesaurus and indexes
52Testing a Thesaurus
- Assign descriptors to a sample set of NEW documents (use enough to get an idea of any gaps in the thesaurus)
- Test retrieval using sample questions and see how effectively the thesaurus maps them to the appropriate descriptors
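One simple way to quantify such a test, offered only as a sketch: take the terms that appear in sample questions and report how many of them the thesaurus can map to a descriptor. The dictionary layout (lowercased entry term to descriptor) and the sample data are assumptions for illustration.

```python
def mapping_report(entry_vocabulary, sample_terms):
    """Return (hit rate, unmapped terms) for a list of terms from sample questions."""
    misses = [t for t in sample_terms if t.lower() not in entry_vocabulary]
    hit_rate = 1 - len(misses) / len(sample_terms) if sample_terms else 0.0
    return hit_rate, misses


# Hypothetical entry vocabulary: lowercased entry term -> descriptor.
entry_vocabulary = {"cars": "Automobiles", "automobiles": "Automobiles"}
hit_rate, misses = mapping_report(
    entry_vocabulary, ["Cars", "lorries", "automobiles", "vans"]
)
print(hit_rate, misses)  # 0.5 ['lorries', 'vans']
```

The list of misses is usually more useful than the rate itself, because it points directly at the entry terms or concepts that need to be added.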
53Lecture Overview
- Review (and expansion)
- Facetted Classification
- Thesaurus Design and Development
- Metadata And Markup
- XML As A Metadata Lingua Franca
- Dublin Core Revisited
- METS
- Other Metadata schemas and protocols in XML
- Discussion
54XML as a common syntax
- XML (and SGML) provide a way of expressing the structure of documents that can be verified and validated by document processing systems
- Documents can be metadata structures
- Such as the description of a particular photograph in our Phone project
- XML thus provides a way of representing metadata descriptions as well as the content that they describe
55XML as a common syntax
- All XML documents follow some simple rules that make them interchangeable and usable across different systems
- All data and markup is in UNICODE
- All elements are marked by begin and end tags
- All markup is case-sensitive
- XML DTDs and/or Schemas define the valid structure (and sometimes content) of the documents
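The tag-matching and case-sensitivity rules can be checked with Python's standard `xml.etree.ElementTree` parser, as in the sketch below. The `photo` and `title` element names are invented for the example. Note that this parser checks well-formedness only; it does not validate a document against a DTD or Schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Matching begin and end tags: well-formed.
print(is_well_formed("<photo><title>Sunset</title></photo>"))  # True
# End tag differs only in case: markup is case-sensitive, so this fails.
print(is_well_formed("<photo><title>Sunset</Title></photo>"))  # False
# Missing end tag for <title>: fails.
print(is_well_formed("<photo><title>Sunset</photo>"))          # False
```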
56Dublin Core
- Review
- Simple metadata for describing internet resources
- For Document-Like Objects
- 15 Elements
57Dublin Core Elements
- Title
- Creator
- Subject
- Description
- Publisher
- Other Contributors
- Date
- Resource Type
- Format
- Resource Identifier
- Source
- Language
- Relation
- Coverage
- Rights Management
58DC XML DTD Implementation
- There have been various versions
- This one is the one recommended (required) by the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
- Uses XML namespaces
- Available at http://dublincore.org/documents/2001/09/20/dcmes-xml/
59DC Element and Attribute Definitions
<!-- The elements from DCMES 1.1 -->
<!-- The name given to the resource. -->
<!ELEMENT dc:title (#PCDATA)>
<!ATTLIST dc:title xml:lang CDATA #IMPLIED>
<!-- An entity primarily responsible for making the content of the resource. -->
<!ELEMENT dc:creator (#PCDATA)>
<!ATTLIST dc:creator xml:lang CDATA #IMPLIED>
<!-- The topic of the content of the resource. -->
<!ELEMENT dc:subject (#PCDATA)>
<!ATTLIST dc:subject xml:lang CDATA #IMPLIED>
<!-- An account of the content of the resource. -->
<!ELEMENT dc:description (#PCDATA)>
<!ATTLIST dc:description xml:lang CDATA #IMPLIED>
<!-- The entity responsible for making the resource available. -->
<!ELEMENT dc:publisher (#PCDATA)>
<!ATTLIST dc:publisher xml:lang CDATA #IMPLIED>
<!-- An entity responsible for making contributions to the content of the resource. -->
<!ELEMENT dc:contributor (#PCDATA)>
<!ATTLIST dc:contributor xml:lang CDATA #IMPLIED>
<!-- A date associated with an event in the life cycle of the resource. -->
<!ELEMENT dc:date (#PCDATA)>
<!ATTLIST dc:date xml:lang CDATA #IMPLIED>
60DC Element Definitions (cont.)
<!-- The nature or genre of the content of the resource. -->
<!ELEMENT dc:type (#PCDATA)>
<!ATTLIST dc:type xml:lang CDATA #IMPLIED>
<!-- The physical or digital manifestation of the resource. -->
<!ELEMENT dc:format (#PCDATA)>
<!ATTLIST dc:format xml:lang CDATA #IMPLIED>
<!-- An unambiguous reference to the resource within a given context. -->
<!ELEMENT dc:identifier (#PCDATA)>
<!ATTLIST dc:identifier xml:lang CDATA #IMPLIED>
<!ATTLIST dc:identifier rdf:resource CDATA #IMPLIED>
<!-- A Reference to a resource from which the present resource is derived. -->
<!ELEMENT dc:source (#PCDATA)>
<!ATTLIST dc:source xml:lang CDATA #IMPLIED>
<!ATTLIST dc:source rdf:resource CDATA #IMPLIED>
<!-- A language of the intellectual content of the resource. -->
<!ELEMENT dc:language (#PCDATA)>
<!ATTLIST dc:language xml:lang CDATA #IMPLIED>
<!-- A reference to a related resource. -->
<!ELEMENT dc:relation (#PCDATA)>
<!ATTLIST dc:relation xml:lang CDATA #IMPLIED>
<!ATTLIST dc:relation rdf:resource CDATA #IMPLIED>
<!-- The extent or scope of the content of the resource. -->
<!ELEMENT dc:coverage (#PCDATA)>
<!ATTLIST dc:coverage xml:lang CDATA #IMPLIED>
<!-- Information about rights held in and over the resource. -->
<!ELEMENT dc:rights (#PCDATA)>
<!ATTLIST dc:rights xml:lang CDATA #IMPLIED>
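For a sense of what a record using these elements looks like, the sketch below builds one with Python's standard `xml.etree.ElementTree`. The Dublin Core namespace URI is the standard one for the 1.1 element set; the `metadata` wrapper element and the `dc_record` helper are hypothetical, and the field values simply reuse details from this lecture.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("dc", DC_NS)  # serialize with the conventional dc: prefix


def dc_record(fields, lang="en"):
    """Build a record from (element name, value) pairs, tagging each with xml:lang."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields:
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.set(f"{{{XML_NS}}}lang", lang)
        element.text = value
    return root


record = dc_record([
    ("title", "Lecture 22 Thesauri and Metadata"),
    ("creator", "Ray Larson"),
    ("date", "2004"),
])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Elements can be repeated or omitted freely in this sketch, which matches the way simple Dublin Core treats all fifteen elements as optional and repeatable.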
61A More Complex SGML DTD
<!DOCTYPE USMARC [
<!-- USMARC DTD. UCB-SLIS v.0.08 -->
<!-- By Jerome P. McDonough, April 1, 1994 -->
<!ELEMENT USMARC - - (Leader, Directry, VarFlds)>
<!ATTLIST USMARC Material (BK|AM|CF|MP|MU|VM|SE) "BK"
                 id CDATA #IMPLIED>
<!-- Author's Note: the id attribute for the USMARC element is intended to hold a unique record number for each MARC record in the local database. That is to say, it is intended ONLY as an aid in maintaining the local database of MARC records -->
<!ELEMENT Leader - O (LRL, RecStat, RecType, BibLevel, UCP, IndCount, SFCount, BaseAddr, EncLevel, DscCatFm, LinkRec, EntryMap)>
<!ELEMENT Directry - O (#PCDATA)>
<!ELEMENT VarFlds - O (VarCFlds, VarDFlds)>
<!-- Component parts of Leader -->
<!-- Logical Record Length -->
<!ELEMENT LRL - O (#PCDATA)>
etc
62More Complex DTD (cont.)
<!-- Variable Data Fields -->
<!ELEMENT VarDFlds - O (NumbCode, MainEnty?, Titles, EdImprnt?, PhysDesc?, Series?, Notes?, SubjAccs?, AddEnty?, LinkEnty?, SAddEnty?, HoldAltG?, Fld9XX?)>
<!-- Component Parts of Variable Data Fields -->
<!-- Numbers Codes -->
<!ELEMENT NumbCode - O (Fld010?, Fld011?, Fld015?, Fld017, Fld018?, Fld019, Fld020, Fld022, Fld023, Fld024, Fld025, Fld027, Fld028, Fld029, Fld030, Fld032, Fld033, Fld034, Fld035, Fld036?, Fld037, Fld039, Fld040?, Fld041?, Fld042?, Fld043?, Fld044?, Fld045?, Fld046?, Fld047?, Fld048, Fld050, Fld051, Fld052, Fld055, Fld060, Fld061, Fld066?, Fld069, Fld070, Fld071, Fld072, Fld074, Fld080?, Fld082, Fld084, Fld086, Fld088, Fld090, Fld096)>
<!-- Main Entries -->
<!ELEMENT MainEnty - O (Fld100?, Fld110?, Fld111?, Fld130?)>
<!-- Titles -->
<!ELEMENT Titles - O (Fld210?, Fld211, Fld212, Fld214, Fld222, Fld240?, Fld242, Fld243?, Fld245, Fld246, Fld247)>
<!-- Edition, Imprint, etc. -->
<!ELEMENT EdImprnt - O (Fld250?, Fld254?, Fld255, Fld256?, Fld257?, Fld260?, Fld261?, Fld262?, Fld263?, Fld265?)>
<!-- Physical Description, etc. -->
<!ELEMENT PhysDesc - O (Fld300, Fld305, Fld306?, Fld310?, Fld315?, Fld321, Fld340, Fld350?, Fld351, Fld355, Fld357, Fld362)>
etc
63Complex DTD (cont.)
<!-- Title Statement -->
<!ELEMENT Fld245 - O (Six?, (a|b|c|f|g|h|k|n|p|s))>
<!ATTLIST Fld245 AddEnty (No|Yes|Blank) #IMPLIED
                 NFChars (0|1|2|3|4|5|6|7|8|9|Blnk) #IMPLIED>
etc
<!-- Subfield Element Declarations -->
<!ELEMENT a - O (#PCDATA)>
<!ELEMENT b - O (#PCDATA)>
<!ELEMENT c - O (#PCDATA)>
<!ELEMENT d - O (#PCDATA)>
<!ELEMENT e - O (#PCDATA)>
64Example METS
- METS, the Metadata Encoding and Transmission Standard, is a new schema intended to provide a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using the XML schema language of the World Wide Web Consortium
- METS can be used to wrap complex sets of data (the actual data, with rules for encoding binary forms), the metadata describing the parts of that data, and the sequence and conditions under which the data can or should be presented or displayed
65Other Protocols and Metadata Systems Using XML
- SOAP (Simple Object Access Protocol)
- SRW (Search and Retrieval for the Web)
- OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
- RDF (Resource Description Framework)
- MPEG-7 (more next time)
- METS
- ADL Gazetteer Protocol
- DAV/DASL (Distributed Authoring and Versioning)
- SDLIP (Simple Digital Library Interoperability Protocol)
- Also versions of MARC and other formats in XML
66Lecture Overview
- Review
- Types of Controlled Vocabularies
- Name Authority Control
- Thesaurus Design and Development
- Controlled Vocabularies for topical description
- Thesaurus Design
- Steps In Thesaurus Development
- Indexing
- Discussion (including some from last time)
67Discussion Questions
- Morgan Ames on Vickery
- Though facets are a powerful tool for organizing
information, they can be very time-consuming to
define. Vickery describes the creation of
facets, starting with the analysis of terms used
by a user group, then the sorting of the terms
into facets, the development of facets (depending
on how often they're used), the arrangement of
the facets, and finally, the establishment of a
notation for the facets. Could one automate some
or all of the process of defining facets for a
particular area - say, an online community? If
so, which parts could be automated, and how? If
not, why not - what are the limitations of
automation?
68Discussion Questions
- Lilia Manguy on Thesaurus Construction
- The reading mentions thesauri being constructed
for institutions. What are some examples of
institutions with specialized thesauri? Why were
they deemed necessary?
69Discussion Questions
- Lilia Manguy on Thesaurus Construction
- In our field, what are some scenarios in which a
thesaurus would need to be constructed? How would
you determine who would be your expert
consultants? Who would you choose?
70Discussion Questions
- Lilia Manguy on Thesaurus Construction
- Using the process outlined in the reading for
constructing a thesaurus, how would you qualify
whether your thesaurus is good or bad?
71Discussion Questions
- Sorry: we will come back to these in the section on Interfaces for IR
- Christine Jones on Card Sorting
- Carrie Burgener on Flamenco
72Discussion Questions
- Chitra Madhwacharyula on Org. of Info., Chap 3
- Associative indexing is the concept in which items are linked together and any item can lead to access of other related information (e.g. hypertext documents). Is it possible to have efficient and usable associative indexing without the use of computers and if so how?
- How does Google use the concept of associative indexing?
73Discussion Questions
- Chitra Madhwacharyula on Org. of Info., Chap 3
- In the 1930s Vannevar Bush developed the idea of
memex, "a device in which an individual stores
all his books, records, and communications, and
which is mechanized so that it may be consulted
with exceeding speed and flexibility". It was
based on the concept of associative indexing. How
similar/dissimilar is this device to the current
generation cataloging and/or retrieval systems?
74Discussion Questions
- Jaime Parada on Org. of Info., Chap 5
- The fierce competition between vendors in the OPAC and Online Index market may increase the development of new innovative technology and better systems, but it contributes to the lack of standardization in system design. How can the Z39.50 protocol help with this issue? Does an increase in standardization reduce the innovative nature of vendors and the creation of better systems?
- User-centered design may refer to "enhancing system performance to deliver better results, designing for particular users since one size does not fit all". How does user-centered design interfere with standardization?
75Announcements and Next
- Midterms Returned
- Extra Credit
- Next time
- Multimedia Information Organization and Retrieval
- Readings/Discussion
- Computational Media Aesthetics: Finding Meaning Beautiful
- The Holy Grail of Content-Based Media Analysis
- Editing Out Video Editing