Lecture 07: Controlled Vocabularies

Lecture 07: Controlled Vocabularies. SIMS 202: Information Organization and Retrieval. Prof. Ray Larson & Prof. Marc Davis, UC Berkeley SIMS. Tuesday and Thursday 10:30 ...

1
Lecture 07: Controlled Vocabularies
SIMS 202: Information Organization and Retrieval
  • Prof. Ray Larson & Prof. Marc Davis
  • UC Berkeley SIMS
  • Tuesday and Thursday 10:30 am - 12:00 pm
  • Fall 2003

Some slides in this lecture were developed by
Prof. Marti Hearst
2
Lecture Contents
  • Phone Project
  • Review
  • Metadata Systems
  • Dublin Core
  • Controlled Vocabularies
  • Name Authority Files
  • Other Types of Controlled Vocabularies
  • Faceted vs. Hierarchic Organization of
    Vocabularies
  • Discussion Questions

4
Assignments
  • Assignment 2 Due
  • Assignment 3: Photo Capture and Annotation
    • Assigned Sept 18
    • Due Sept 23

5
Phone Project Consent Forms
  • Collection of Data for the Phone Project
  • Informed Consent and Release Form
  • Informed Consent to Release Academic Information
  • You must sign these forms to receive a phone and
    participate in the Phone Project
  • Signing these consent forms is not a condition of
    your participation in this course, nor will it be
    used as a basis for grading your performance
    therein

6
Collection of Data for the Phone Project
  • Call logging
    • All phone calls made from the phones provided to
      you will be logged. The phone conversations
      themselves are not going to be recorded, but a
      record will be made of which numbers were called,
      when, and for how long.
  • Approximate location logging
    • Your approximate location may be logged whenever
      the phone is used either for phone calls or to
      take, upload, annotate, or retrieve photos.
  • Data correlation
    • The information from call logging and approximate
      location logging may be correlated with various
      other sources of information (e.g., raw location
      data may be correlated with map data to try to
      determine in which buildings the phone was used).
  • Sublicensing of data collected
    • Garage Cinema Research may sublicense portions of
      the collected data to other parties. This may
      include images of you or provided by you, as well
      as metadata about you or provided by you.
  • Privacy protections
    • Garage Cinema Research will not release your
      name, email address, or the complete phone
      numbers of the parties you called, except for
      their area codes and except for calls made
      between two Phone Project phones.

7
Informed Consent and Release Form
  • License to content
    • License to the content contributed by you to the
      system, including but not limited to images,
      annotations, and annotation frameworks, as well
      as any data that will be collected in accordance
      with the privacy protecting measures.
  • Identifying information and pseudonyms
    • Use of your name and email address by the system,
      understanding that they are not going to be
      released to third parties. Your name will be
      replaced with a pseudonym if the data is released
      to third parties.
  • Personal data collection
    • Applications built in the system will benefit
      from the use of personal information; however,
      you are not required to provide the system with
      any personal information about yourself or other
      people beyond the data that is being collected
      automatically.
  • Right of inspection/correction/deletion of photos
    • You have the right to inspect photos of you or
      information about you submitted by you and/or
      other users of the system and to have them
      corrected or removed.

8
Consent to Release Academic Information
  • Agreement to post work on IS202 web site
    • You agree to have your Phone Project course work
      posted, including your name, on the IS202 web
      site, which is accessible to the general public.
  • Understanding of course enrollment and authorship
    disclosure
    • You understand that this will publicly reveal
      that you are a student at the University of
      California at Berkeley, that you are taking this
      course, and that you are an author of this work.
  • Indefinite time period of posting
    • You understand that your name may be posted on this
      web site indefinitely, starting in September
      2003.
  • Optional email address posting
    • The posting of student email addresses on the
      IS202 web site Phone Project group pages, while
      kindly requested, is not required.

10
Metadata
  • Structures and languages for the description of
    information resources and their elements
    (components or features)
  • Metadata is information on the organization of
    the data, the various data domains, and the
    relationship between them (Baeza-Yates p. 142)

11
Metadata
  • Often two main types of metadata are
    distinguished
  • Descriptive metadata
    • Describes the information/data object and its
      properties
    • May use a variety of descriptive formats and
      rules
  • Topical metadata
    • Describes the topic or aboutness of an
      information/data object
    • May include a variety of vocabularies for
      describing subjects, topics, categories, etc.

12
Metadata Systems and Standards
  • Naming and ID systems: URLs, ISBNs
  • Bibliographic description: MARC, Dublin Core,
    TEI, etc.
  • Music: SMDL
  • Images and objects: CIMI, VRA Core Categories
  • Numeric data: DDI, SDSM
  • Geospatial data: FGDC
  • Collections: EAD

13
Dublin Core
  • Simple metadata for describing internet resources
  • For Document-Like Objects
  • 15 Elements (in base DC)

14
Dublin Core Elements
  • Title
  • Creator
  • Subject
  • Description
  • Publisher
  • Other Contributors
  • Date
  • Resource Type
  • Format
  • Resource Identifier
  • Source
  • Language
  • Relation
  • Coverage
  • Rights Management
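
As a rough illustration of how the 15 elements might be used, here is a minimal Python sketch that stores one record as a dictionary and checks its keys against the element list. The element names are the ones on the slide; the function name `unknown_elements` and the sample values are assumptions made for this example and are not part of any Dublin Core specification.

```python
# Element names exactly as listed on the slide above.
DC_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Other Contributors", "Date", "Resource Type", "Format",
    "Resource Identifier", "Source", "Language", "Relation",
    "Coverage", "Rights Management",
)


def unknown_elements(record):
    """Return the keys of `record` that are not Dublin Core elements."""
    return sorted(key for key in record if key not in DC_ELEMENTS)


# Illustrative values only; a record may omit elements or repeat them.
record = {
    "Title": "Lecture 07: Controlled Vocabularies",
    "Creator": ["Larson, Ray", "Davis, Marc"],
    "Subject": ["Controlled vocabularies", "Authority control"],
    "Date": "2003",
    "Language": "en",
}

print(unknown_elements(record))                 # every key is a DC element
print(unknown_elements({"Author": "Larson"}))   # "Author" is not an element
```

Note that the element list only controls the field names. The values placed in Creator or Subject are still free text unless a controlled vocabulary is applied to them.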

16
Controlled Vocabularies
  • Vocabulary control is the attempt to provide a
    standardized and consistent set of terms (such as
    subject headings, names, classifications, etc.)
    with the intent of aiding the searcher in finding
    information
  • That is, it is an attempt to provide a consistent
    set of descriptions for use in (or as) metadata

17
Controlled Vocabularies
  • Names and name authorities
  • Gazetteers (geographic names)
  • Code lists (e.g., LC language codes)
  • Subject heading lists
  • Classification schemes
  • Thesauri

18
Control of Names
  • Cutter's (1876) objectives of bibliographic
    description
    • To enable a person to find a document of which
      • The author, or
      • The title, or
      • The subject is known
    • To show what a library has
      • By a given author
      • On a given subject (and related subjects)
      • In a given kind (or form) of literature
  • The first objective serves access
  • The second serves collocation

19
Problems with Names
  • How many names should be associated with a
    document?
  • Which of these should be the main entry?
  • What form should each of the names take?
  • What references should be made from other
    possible forms of names that haven't been used?

20
The Problem
  • Proliferation of the forms of names
    • Different names for the same person
    • Different people with the same names
  • Examples
    • From Books in Print (semi-controlled but not
      consistent)
    • ERIC author index (not controlled)

21
Goethe
etc
22
John Muir
23
Pauline Cochrane nee Atherton
24
Pauline Cochrane nee Atherton
25
Rules for Description
  • AACR II and other sets of descriptive cataloging
    rules provide guidelines for
  • Determining the number of name entries
  • Choosing a main entry
  • Deciding on the form of name to be used
  • Deciding when to make references

26
Authority Control
  • Authority control is concerned with the creation and
    maintenance of a set of terms that have been
    chosen as the standard representatives (also known
    as established) based on some set of rules
  • If you have rules, why do you need to keep track
    of all of the headings? Can't you just infer the
    headings from the rules?

27
Conditions of Authorship?
  • Single person or single corporate entity
  • Unknown or anonymous authors
  • Fictitiously ascribed works
  • Shared responsibility
  • Collections or editorially assembled works
  • Works of mixed responsibility (e.g.,
    translations)
  • Related works

28
Added Entries
  • Personal names
    • Collaborators
    • Editors, compilers, writers
    • Translators (in some cases)
    • Illustrators (in some cases)
    • Other persons associated with the work (such as
      the honoree in a festschrift)
  • Corporate names
    • Any prominently named corporate body that has
      involvement in the work beyond publication,
      distribution, etc.

29
Choice of Name
  • AACR II says that the predominant form of the
    name used in a particular author's writings
    should be chosen as the form of name
  • References should be made from the other forms of
    the name

30
Form of the Name
  • When names appear in multiple forms, one form
    needs to be chosen
  • Criteria for choice are
    • Fullness (e.g., full names vs. initials only)
    • Language of the name
    • Spelling (choose predominant form)
  • Entry element
    • John Smith or Smith, John?
    • Mao Zedong or Zedong, Mao? (Mao Tse Tung?)
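
The entry-element question can be made concrete with a small Python sketch. The function `entry_form` and its `family_name_first` flag are hypothetical names invented for this illustration; real cataloging rules are far more detailed than this. The point is only that a purely mechanical rule gives the wrong answer unless someone records which part of the name is the family name.

```python
def entry_form(name, family_name_first=False):
    """Rewrite a name written in natural order as 'Family, Given'.

    `family_name_first` says whether the family name is the first word
    (as in Chinese names) or the last word (the naive default).
    """
    parts = name.split()
    if len(parts) < 2:
        return name  # single-word names have no entry element to choose
    if family_name_first:
        family, given = parts[0], " ".join(parts[1:])
    else:
        family, given = parts[-1], " ".join(parts[:-1])
    return f"{family}, {given}"


print(entry_form("John Smith"))                          # Smith, John
print(entry_form("Mao Zedong"))                          # Zedong, Mao (wrong entry element)
print(entry_form("Mao Zedong", family_name_first=True))  # Mao, Zedong
```

The flag has to come from knowledge about the person, not from the string itself, which is one reason established forms are recorded rather than computed.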

31
Name Authority Files
ID:NAFL8057230  ST:p  EL:n  STH:a  MS:c  UIP:a  TD:19910821174242
KRC:a  NMU:a  CRC:c  UPN:a  SBU:a  SBC:a  DID:n  DF:05-14-80
RFE:a  CSC:  SRU:b  SRT:n  SRN:n  TSS:  TGA:?  ROM:?  MOD:  VST:d 08-21-91
Other Versions: earlier
040     DLC $c DLC $d DLC $d OCoLC
053     PR6005.R517
100 10  Creasey, John
400 10  Cooke, M. E.
400 10  Cooke, Margaret, $d 1908-1973
400 10  Cooper, Henry St. John, $d 1908-1973
400 00  Credo, $d 1908-1973
400 10  Fecamps, Elise
400 10  Gill, Patrick, $d 1908-1973
400 10  Hope, Brian, $d 1908-1973
400 10  Hughes, Colin, $d 1908-1973
400 10  Marsden, James
400 10  Matheson, Rodney
400 10  Ranger, Ken
400 20  St. John, Henry, $d 1908-1973
400 10  Wilde, Jimmy
500 10  $w nnnc $a Ashe, Gordon, $d 1908-1973
Different names for the same person
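
To show what such a record buys a retrieval system, here is a minimal Python sketch of an authority lookup. It treats the 100 field as the established heading and the 400 fields as "see from" references, using the names in the record above with dates omitted. The dictionary layout and the function names `normalize`, `build_index`, and `established_heading` are assumptions made for this illustration, not a standard API.

```python
# Established heading -> variant forms that should lead to it.
AUTHORITY_FILE = {
    "Creasey, John": [
        "Cooke, M. E.", "Cooke, Margaret", "Cooper, Henry St. John",
        "Credo", "Fecamps, Elise", "Gill, Patrick", "Hope, Brian",
        "Hughes, Colin", "Marsden, James", "Matheson, Rodney",
        "Ranger, Ken", "St. John, Henry", "Wilde, Jimmy",
    ],
}


def normalize(name):
    """Ignore case and extra whitespace when matching names."""
    return " ".join(name.casefold().split())


def build_index(authority_file):
    """Map every known form of a name to its established heading."""
    index = {}
    for heading, variants in authority_file.items():
        for form in [heading, *variants]:
            index[normalize(form)] = heading
    return index


INDEX = build_index(AUTHORITY_FILE)


def established_heading(name):
    """Return the established heading for `name`, or None if unknown."""
    return INDEX.get(normalize(name))


print(established_heading("wilde,  jimmy"))   # Creasey, John
print(established_heading("Muir, John"))      # None
```

A one-to-one index like this cannot represent a name shared by different people; a fuller version would map each form to a list of candidate headings.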
32
Name Authority Files
ID:NAFO9114111  ST:p  EL:n  STH:a  MS:n  UIP:a  TD:19910817053048
KRC:a  NMU:a  CRC:c  UPN:a  SBU:a  SBC:a  DID:n  DF:06-03-91
RFE:a  CSC:c  SRU:b  SRT:n  SRN:n  TSS:  TGA:?  ROM:?  MOD:  VST:d 08-19-91
040     OCoLC $c OCoLC
100 10  Marric, J. J., $d 1908-1973
500 10  $w nnnc $a Creasey, John
663     Works by this author are entered under the name used in the item.
        For a listing of other names used by this author, search also
        under $b Creasey, John
670     OCLC 13441825: His Gideon's day, 1955 $b (hdg.: Creasey, John;
        usage: J.J. Marric)
670     LC data base, 6/10/91 $b (hdg.: Creasey, John; usage: J.J. Marric)
670     Pseuds. and nicknames dict., c1987 $b (Creasey, John, 1908-1973;
        British author; pseud.: Marric, J. J.)
33
Name Authority Files
ID:NAFL8166762  ST:p  EL:n  STH:a  MS:c  UIP:a  TD:19910604053124
KRC:a  NMU:a  CRC:c  UPN:a  SBU:a  SBC:a  DID:n  DF:08-20-81
RFE:a  CSC:  SRU:b  SRT:n  SRN:n  TSS:  TGA:?  ROM:?  MOD:  VST:d 06-06-91
Other Versions: earlier
040     DLC $c DLC $d DLC $d OCoLC
100 10  Butler, William Vivian, $d 1927-
400 10  Butler, W. V. $q (William Vivian), $d 1927-
400 10  Marric, J. J., $d 1927-
670     His The durable desperadoes, 1973.
670     His The young detective's handbook, c1981 $b t.p. (W.V. Butler)
670     His Gideon's way, 1986 $b CIP t.p. (William Vivian Butler
        writing as J.J. Marric)
Different people writing with the same name
34
The Haunting of Lauran Paine
1. Paine, Lauran. ALSO KNOWN AS:
Allen, Clay; Almonte, Rosa; Andrews, A. A.; Archer, Dennis;
Armour, John; Batchelor, Reg; Beck, Harry; Bedford, Kenneth;
Benton, Will; Bosworth, Frank; Bovee, Ruth; Bradford, Will;
Bradley, Concho; Brennan, Will; Carrel, Mark; Carter, Nevada;
Cassady, Claude; Cassidy, Claude; Clark, Badger; Custer, Clint;
Dana, Amber; Dana, Richard; Davis, Audrey; Drexler, J. F.;
Duchesne, Antoinette; Fisher, Margot; Fleck, Betty; Frost, Joni;
Glendenning, Donn; Gordon, Angela; Gorman, Beth; Hayden, Jay;
Houston, Will; Howard, Troy; Ingersol, Jared; Kelley, Ray;
Kelly, Ray; Ketchum, Jack; Kilgore, John; Liggett, Hunter;
Lucas, J. K.; Lyon, Buck; Martin, Tom; Morgan, Arlene;
Morgan, Valerie; O'Connor, Clint; St. George, Arthur; Sharp, Helen;
Slaughter, Jim; Standish, Buck; Thompson, Russ; Thorn, Barbara

35
Some Interesting Ones
36
Structure of an IR System
37
Uses of Controlled Vocabularies
  • Library subject headings, classification, and
    authority files
  • Commercial journal indexing services and
    databases
  • Yahoo and other web classification schemes
  • Online and manual systems within organizations
    • SunSolve
    • MacArthur

38
Types of Indexing Languages
  • Uncontrolled keyword indexing
  • Indexing languages
    • Controlled, but not structured
  • Thesauri
    • Controlled and structured
  • Classification systems
    • Controlled, structured, and coded
  • Faceted thesauri and classification systems

39
Indexing Languages
  • An index is a systematic guide designed to
    indicate topics or features of documents in order
    to facilitate retrieval of documents or parts of
    documents
  • An Indexing language is the set of terms used in
    an index to represent topics or features of
    documents, and the rules for combining or using
    those terms

40
Indexing Languages
  • Library of Congress Subject Headings
  • Yellow pages topics
  • Wilson indexes (Readers' Guide)

41
Thesauri
  • A thesaurus is a collection of selected
    vocabulary (preferred terms or descriptors) with
    links among
    • Synonymous
    • Equivalent
    • Broader
    • Narrower, and
    • Other related terms
  • National and international standards for thesauri
    (more next time)
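
The link types named above can be sketched as a small data structure. This is a minimal Python illustration under stated assumptions: the class `Thesaurus` and its methods are invented for this example, and the sample terms and relationships are made up for demonstration rather than taken from any published thesaurus.

```python
from collections import defaultdict


class Thesaurus:
    """Preferred terms with synonym, broader/narrower, and related links."""

    def __init__(self):
        self.use = {}                      # non-preferred term -> preferred term
        self.broader = defaultdict(set)    # term -> broader terms
        self.narrower = defaultdict(set)   # term -> narrower terms
        self.related = defaultdict(set)    # term -> related terms

    def add_synonym(self, variant, preferred):
        self.use[variant.casefold()] = preferred

    def add_broader(self, term, broader_term):
        # Broader and narrower are two views of the same link.
        self.broader[term].add(broader_term)
        self.narrower[broader_term].add(term)

    def add_related(self, a, b):
        # Related-term links are symmetric.
        self.related[a].add(b)
        self.related[b].add(a)

    def resolve(self, word):
        """Replace a non-preferred term with its preferred term."""
        return self.use.get(word.casefold(), word)

    def expand(self, word):
        """Preferred term plus its narrower terms, for a wider search."""
        term = self.resolve(word)
        return {term} | self.narrower.get(term, set())


# Hypothetical sample data.
t = Thesaurus()
t.add_synonym("Sportspeople", "Athletes")
t.add_broader("Athletic Equipment", "Athletics")
t.add_broader("Athletic Fields", "Athletics")
t.add_related("Athletes", "Athletic Coaches")

print(t.resolve("sportspeople"))      # Athletes
print(sorted(t.expand("Athletics")))
```

Storing the broader link and its inverse together keeps the two directions consistent, which is the kind of structural rule that separates a thesaurus from a flat list of terms.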

42
Classification Systems
  • A classification system is an indexing language
    often based on a broad ordering of topical areas
  • Thesauri and classification systems both use this
    broad ordering and maintain a structure of
    broader, narrower, and related topics
  • Classification schemes commonly use a coded
    notation for representing a topic and its place
    in relation to other terms

43
Classification Systems (Cont.)
  • Examples
  • The Library of Congress Classification System
  • The Dewey Decimal Classification System
  • The ACM Computing Reviews Categories
  • The American Mathematical Society Classification
    System

44
Using Controlled Vocabulary
  • Start with the text of the document
  • Attempt to control or regularize
    • The concepts expressed within
      • Mutually exclusive
      • Exhaustive
    • The language used to express those concepts
      • Limit the normal linguistic variations
      • Regulate word order and structure of phrases
      • Reduce the number of synonyms or near-synonyms
  • Also, provide cross-references between concepts
    and their expression

(These slides follow Bates 88)
Slide author Marti Hearst
45
Classification Schemes
  • Classify possible concepts.
  • Goals
    • Completely distinct conceptual categories
      (mutually exclusive)
    • Complete coverage of conceptual categories
      (exhaustive)

Slide author Marti Hearst
46
Assigning Headings vs. Descriptors
  • Descriptors
    • Mix and match
  • Subject headings
    • Assign one (or a few) complex heading(s) to the
      document

How would we describe recipes using each
technique?
Slide author Marti Hearst
47
Subject Heading vs. Descriptors
  • Wilsonline
    • Athletes
    • Athletes -- Health and hygiene
    • Athletes -- Nutrition
    • Athletes -- Physical Exams
    • Athletics
    • Athletics -- Administration
    • Athletics -- Equipment -- Catalogs
    • Sports -- Accidents and Injuries
    • Sports -- Accidents and Injuries -- Prevention
  • ERIC
    • Athletes
    • Athletic Coaches
    • Athletic Equipment
    • Athletic Fields
    • Athletics
    • Sports Psychology
    • Sportsmanship

Slide author Marti Hearst
48
Subject Headings vs. Descriptors
  • Subject headings
    • Describe the contents of an entire document
    • Designed to be looked up in an alphabetical index
    • Look up document under its heading
    • Few (1-5) headings per document
    • AKA precoordination
  • Descriptors
    • Describe one concept within a document
    • Designed to be used in Boolean searching
    • Combine to describe the desired document
    • Many (5-25) descriptors per document
    • AKA postcoordination

Slide author Marti Hearst
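
The contrast between precoordination and postcoordination can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the document IDs and their index terms are invented, and the function names `lookup_heading` and `search_all` are assumptions for this example. With subject headings the concepts are combined at indexing time and the searcher looks up one complete string; with descriptors the searcher combines single concepts at search time.

```python
from collections import defaultdict

# Precoordinated: each document filed under a complete heading string.
HEADINGS = {
    "Athletes -- Nutrition": {"doc1"},
    "Sports -- Accidents and Injuries -- Prevention": {"doc2"},
}

# Postcoordinated: each document carries several single-concept descriptors.
DESCRIPTORS = {
    "doc1": {"Athletes", "Sports Psychology"},
    "doc2": {"Athletes", "Athletic Equipment"},
    "doc3": {"Athletic Coaches", "Sports Psychology"},
}

# Inverted index: descriptor -> set of documents that carry it.
INDEX = defaultdict(set)
for doc_id, terms in DESCRIPTORS.items():
    for term in terms:
        INDEX[term].add(doc_id)


def lookup_heading(heading):
    """Precoordinated lookup: the whole heading must match."""
    return HEADINGS.get(heading, set())


def search_all(*terms):
    """Postcoordinated search: Boolean AND over descriptors."""
    if not terms:
        return set()
    hits = [INDEX.get(term, set()) for term in terms]
    return set.intersection(*hits)


print(lookup_heading("Athletes -- Nutrition"))       # {'doc1'}
print(search_all("Athletes", "Sports Psychology"))   # {'doc1'}
print(sorted(search_all("Athletes")))                # ['doc1', 'doc2']
```

The descriptor index answers combinations nobody anticipated when indexing, at the cost of losing the word order and context that a precoordinated heading preserves.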
50
Hierarchical Classification
  • Each category is successively broken down into
    smaller and smaller subdivisions
  • No item occurs in more than one subdivision
  • Each level is divided out by a character of
    division (also known as a feature)
  • Example
    • Distinguish literature based on
      • Language
      • Genre
      • Time period

Slide author Marti Hearst
51
Hierarchical Classification
Slide author Marti Hearst
52
Labeled Categories for Hierarchical Classification
  • LITERATURE
    • 100 English Literature
      • 110 English Prose
        • 111 English Prose 16th Century
        • 112 English Prose 17th Century
        • 113 English Prose 18th Century
        • ...
      • 120 English Poetry
        • 121 English Poetry 16th Century
        • 122 English Poetry 17th Century
        • ...
      • 130 English Drama
        • 131 English Drama 16th Century
    • 200 French Literature

Slide author Marti Hearst
53
Faceted Classification
  • Create a separate, free-standing list for each
    characteristic or division (feature)
  • Combine features to create a classification

Slide author Marti Hearst
54
Faceted Classification Along With Labeled
Categories
  • A Language
    • a English
    • b French
    • c Spanish
  • B Genre
    • a Prose
    • b Poetry
    • c Drama
  • C Period
    • a 16th Century
    • b 17th Century
    • c 18th Century
    • d 19th Century
  • Aa English Literature
  • AaBa English Prose
  • AaBaCa English Prose 16th Century
  • AbBbCd French Poetry 19th Century
  • BcCd Drama 19th Century

Slide author Marti Hearst
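
The faceted notation above is mechanical enough to sketch in Python. The facet letters and values are the ones on the slide; the table layout and the function names `decode` and `encode` are assumptions made for this illustration. Each pair of characters is one facet letter followed by one value letter, and a classification is built by concatenating whichever pairs apply.

```python
# Facet letter -> (facet name, {value letter -> value}).
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(code):
    """Split a code such as 'AaBaCa' into (facet name, value) pairs."""
    if len(code) % 2:
        raise ValueError("code must be a sequence of facet/value pairs")
    pairs = []
    for i in range(0, len(code), 2):
        facet_name, values = FACETS[code[i]]
        pairs.append((facet_name, values[code[i + 1]]))
    return pairs


def encode(**choices):
    """Build a code from facet choices, e.g. encode(Genre='Drama')."""
    code = ""
    for facet_letter, (facet_name, values) in FACETS.items():
        if facet_name in choices:
            wanted = choices[facet_name]
            value_letter = next(k for k, v in values.items() if v == wanted)
            code += facet_letter + value_letter
    return code


print(decode("AaBaCa"))
print(encode(Language="French", Genre="Poetry", Period="19th Century"))  # AbBbCd
print(encode(Genre="Drama", Period="19th Century"))                      # BcCd
```

Because every facet is optional, a code can leave out Language entirely, which a strict hierarchy that divides by language first cannot express without repeating the genre and period branches under each language.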
55
Questions
  • How (and when) to use both types of
    classification structures?
  • How to look through them?
  • How to use them in searching?

Slide author Marti Hearst
57
Sarah Ellinger on Svenonius
  • Many of the studies Svenonius cites seem to
    grapple with the same issue: how, or from whose
    perspective, do we measure the success of a
    database search? Should a successful search
    return all information, or distinguish by
    relevance? Can we always accept the searcher's
    view of relevant material? If an issue is under
    debate, should our search technologies provide
    the user with information from all sides, or only
    the side with which the searcher agrees?

58
Sarah Ellinger on Svenonius
  • With regard to discipline-specific search
    vocabularies, Svenonius asks, "Would it not make
    more sense to custom tailor a vocabulary-control
    tool to the vocabulary being tailored?" In a
    world where academic terms are prone to change,
    how do we avoid replicating obsoletisms like
    "Vietnamese conflict" in disciplinary
    vocabularies? Would such a vocabulary preserve
    outdated associations in the minds of searchers
    or lend credence to some theories over others?
    How can a controlled vocabulary reflect academic
    debate?

59
Matt Meiske on Bates
  • Bates's article was written over 17 years ago.
    Since then, online catalogues have changed (e.g.,
    web-based Melvyl), but not to the extent that
    Bates proposes. Why not?

60
Matt Meiske on Bates
  • In her proposal, Bates states that a good online
    catalogue design will provide some means of
    orientation, so that the user can get a feel
    for the system. Seventeen years later, the world
    is a far more computer-centric place. Are we
    becoming naturally oriented to systems of this
    sort? Is the issue of orientation / docking
    still relevant?

61
Paul Laskowski on Borgman
  • Borgman wrote in 1996, but certain passages
    already seem outdated to me (e.g., "a customer
    trying to operate a mouse as a foot pedal," p. 499).
    The spread of GUI interfaces, in particular, may
    solve some of the interface problems Borgman
    identifies. Is there still a problem with "user
    education," or is it now time to focus on how
    catalogues react to user queries? Is the real
    problem for users one of "technical skills," or
    should users be trained specifically to formulate
    queries "strategically"? Is this a skill that
    can be taught?

62
Paul Laskowski on Borgman
  • I opened up JSTOR (http://www.jstor.com) to try
    to compare Borgman's ideas to practice. JSTOR
    allows me to query the following fields: author,
    title, abstract, and full-text. Borgman does not
    seem to foresee the ability to search the full
    text. Does this ability make a subject query
    obsolete? In what scenario might I prefer to
    query a subject field? JSTOR allows me to
    constrain my search along multiple fields, using
    the operators AND, OR, and NEAR (10 words or 25
    words). Sure enough, no information seems to be
    given on the order of operations. In what cases
    might this foil my search attempt?
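
The order-of-operations question can be illustrated with a small Python sketch. The result sets below are invented for this example, and nothing here describes how JSTOR actually evaluates queries. The sketch only shows that a query of the form "X OR Y AND Z" has two readings that can return different documents.

```python
# Hypothetical document IDs matched by each single-field query.
author_hits = {1, 2}
title_hits = {2, 3}
abstract_hits = {3, 4}

# Reading 1: AND binds tighter than OR -> author OR (title AND abstract).
and_first = author_hits | (title_hits & abstract_hits)

# Reading 2: strict left-to-right -> (author OR title) AND abstract.
left_to_right = (author_hits | title_hits) & abstract_hits

print(sorted(and_first))       # [1, 2, 3]
print(sorted(left_to_right))   # [3]
```

A searcher who assumes one reading while the system applies the other gets either unexpected extra results or silently missing ones, and without documentation there is no way to tell which happened.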

63
Next Time
  • Thesaurus Design and Construction
  • Readings/Discussion
    • Chapter F: Flow of Work in the Construction of
      Indexing Languages and Thesauri (Soergel) - Simon
    • The House of Quality (Hauser and Clausing) - Sean
    • Designing the Organizational Framework (Sano) -
      Lisa
  • Phone Project
    • Phones!
    • Phone demo
  • Assignment 3: Photo Capture and Annotation

64
Discussion Questions Leaders
  • Soergel
  • Hauser and Clausing
  • Sano