Prof. Ray Larson - PowerPoint PPT Presentation

1
Lecture 11: Thesaurus Design
SIMS 202: Information Organization and Retrieval
  • Prof. Ray Larson & Prof. Marc Davis
  • UC Berkeley SIMS
  • Tuesday and Thursday 10:30 am - 12:00 pm
  • Fall 2002
  • http://www.sims.berkeley.edu/academics/courses/is202/f02/

2
Lecture Overview
  • Review
  • Name Authority Control
  • Types of Controlled Vocabularies
  • Thesaurus Design and Development
  • Developing Controlled Vocabularies
  • Thesaurus Design
  • Steps In Thesaurus Development
  • Indexing

4
Types of Indexing Languages
  • Uncontrolled keyword indexing
  • Indexing languages: controlled, but not structured
  • Thesauri: controlled and structured
  • Classification systems: controlled, structured,
    and coded
  • Faceted classification systems

5
Uses of Controlled Vocabularies
  • Library subject headings, classification and
    authority files
  • Commercial journal indexing services and
    databases
  • Yahoo and other web classification schemes
  • Online and manual systems within organizations
    (e.g., SunSolve, MacArthur)

6
Indexing Languages
  • An index is a systematic guide designed to
    indicate topics or features of documents in order
    to facilitate retrieval of documents or parts of
    documents
  • An indexing language is the set of terms used in
    an index to represent topics or features of
    documents, and the rules for combining or using
    those terms

7
Classification Systems
  • A classification system is an indexing language
    often based on a broad ordering of topical areas
  • Thesauri and classification systems both use this
    broad ordering and maintain a structure of
    broader, narrower, and related topics
  • Classification schemes commonly use a coded
    notation for representing a topic and its place
    in relation to other terms

8
Automatic Indexing and Classification
  • Automatic indexing is typically the simple
    deriving of keywords from a document and
    providing access to all of those words
  • More complex automatic indexing systems attempt
    to select controlled vocabulary terms based on
    terms in the document
  • Automatic classification attempts to
    automatically group similar documents using
    either a fully automatic clustering method, or
    an established classification scheme and a set
    of documents already indexed by that scheme
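A minimal sketch of the simplest case above: derive keywords from each document and provide access through all of them. The tokenizer, the short stopword list, and the sample documents are illustrative assumptions, not part of the lecture.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; a real system would use a much longer one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}

def derive_keywords(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def build_index(docs):
    """Map every derived keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in derive_keywords(text):
            index[word].add(doc_id)
    return index

# Made-up sample documents.
docs = {
    "d1": "The design of a thesaurus for retrieval",
    "d2": "Retrieval of documents in an index",
}
index = build_index(docs)
print(sorted(index["retrieval"]))  # documents reachable through "retrieval"
```

Every non-stopword becomes an access point, which is why this kind of indexing is uncontrolled: nothing maps "retrieval" and "searching" to the same entry.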

9
Clustering
  • Agglomerative methods: polythetic, exclusive or
    overlapping, unordered; clusters are
    order-dependent
  • Rocchio's method (yes, the same Rocchio as in
    relevance feedback)
  1. Select initial centers (i.e., seed the space)
  2. Assign docs to highest matching centers and
     compute centroids
  3. Reassign all documents to centroid(s)
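The three steps can be sketched roughly as follows. This is an illustration under stated assumptions, not Rocchio's original formulation: documents are sparse term-weight dictionaries, matching is cosine similarity, the seeds are chosen by hand, and each document goes to exactly one cluster (the exclusive case).

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Average a non-empty list of sparse vectors term by term."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {t: w / len(vectors) for t, w in total.items()}

def assign(docs, centers):
    """Put each document in the cluster of its highest-matching center."""
    clusters = [[] for _ in centers]
    for name, vec in docs.items():
        best = max(range(len(centers)), key=lambda i: cosine(vec, centers[i]))
        clusters[best].append(name)
    return clusters

def cluster(docs, seed_names):
    # 1. Select initial centers (seed the space).
    centers = [docs[name] for name in seed_names]
    # 2. Assign docs to their highest-matching center, then compute centroids.
    first_pass = assign(docs, centers)
    centroids = [
        centroid([docs[name] for name in members]) if members else centers[i]
        for i, members in enumerate(first_pass)
    ]
    # 3. Reassign all documents to the centroids.
    return assign(docs, centroids)

# Made-up term weights for four documents.
docs = {
    "d1": {"thesaurus": 2, "term": 1},
    "d2": {"thesaurus": 1, "term": 2},
    "d3": {"cluster": 2, "centroid": 1},
    "d4": {"cluster": 1, "centroid": 1},
}
print(cluster(docs, ["d1", "d3"]))
```

Different seeds can give different clusters, which is one way the result depends on choices made at the start.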
10
Automatic Class Assignment
  • Polythetic, exclusive or overlapping, usually
    ordered; clusters are order-independent and
    usually based on an intellectually derived scheme
  • [Figure: documents (Doc) and a Search Engine]
  1. Create pseudo-documents representing
     intellectually derived classes
  2. Search using document contents
  3. Obtain ranked list
  4. Assign document to the N categories ranked
     over a threshold, OR assign to the top-ranked
     category
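A rough sketch of this procedure, assuming a bag-of-words "search engine" that ranks by cosine similarity. The class labels, pseudo-document texts, N, and the threshold value are all made up for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Pseudo-documents representing intellectually derived classes (hypothetical).
PSEUDO_DOCS = {
    "Indexing": "index indexing descriptor term thesaurus",
    "Clustering": "cluster centroid similarity grouping",
    "Cataloging": "catalog library record author title",
}

def rank_classes(doc_text, pseudo_docs):
    # 2. Search using the document contents; 3. obtain a ranked list.
    doc_vec = vectorize(doc_text)
    scores = [(cosine(doc_vec, vectorize(text)), label)
              for label, text in pseudo_docs.items()]
    return sorted(scores, reverse=True)

def assign_classes(doc_text, pseudo_docs, n=2, threshold=0.2):
    # 4. Keep at most n categories whose score is over the threshold.
    ranked = rank_classes(doc_text, pseudo_docs)
    return [label for score, label in ranked[:n] if score > threshold]

doc = "thesaurus term selection for indexing"
print(assign_classes(doc, PSEUDO_DOCS))
print(rank_classes(doc, PSEUDO_DOCS)[0][1])  # alternative: top-ranked category only
```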

12
Developing Controlled Vocabularies
  • Origins and uses of controlled vocabularies for
    information retrieval
  • Types of indexing languages, thesauri and
    classification systems
  • Process of design and development of thesauri

13
Origins
  • Very early history of content representation
  • Sumerian tokens and envelopes
  • Alexandria - pinakes
  • Indices

14
Origins
  • Biblical Indexes and Concordances
  • Hugo de St. Caro, 1247 A.D.: 500 monks -- KWOC
    (keyword out of context)
  • Book indexes (Nuremberg Chronicle)
  • Library Catalogs
  • Journal Indexes
  • Information Explosion following WWII
  • Cranfield Studies of indexing languages and
    information retrieval
  • Development of bibliographic databases
  • Index Medicus -- production and MEDLARS searching

15
Origins
  • Communication theory revisited
  • Problems with transmission of meaning

[Figure: communication model diagram, including Noise]
16
Structure of an IR System
17
What is a Controlled Vocabulary?
  • "The greatest problem of today is how to teach
    people to ignore the irrelevant, how to refuse to
    know things, before they are suffocated. For too
    many facts are as bad as none at all." (W.H.
    Auden)
  • Similarly, there are too many ways of expressing
    or explaining the topic of a document
  • A controlled vocabulary consists of a set of rules
    for topic identification and indexing, and a
    THESAURUS, which consists of a lead-in vocabulary
    and a limited and selective indexing language,
    sometimes with special coding or structures

19
Thesauri
  • A Thesaurus is a collection of selected
    vocabulary (preferred terms or descriptors) with
    links among synonymous, equivalent, broader,
    narrower and other related terms
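One possible way to hold these links in code is sketched below. The class and method names are hypothetical, and the relationship labels in the comments (USE/UF, BT, NT, RT) follow common thesaurus notation rather than anything defined on the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    """A preferred term and its links to other terms."""
    name: str
    used_for: set = field(default_factory=set)   # UF: synonymous entry terms
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT

class Thesaurus:
    def __init__(self):
        self.descriptors = {}   # preferred term -> Descriptor
        self.use = {}           # non-preferred term -> preferred term (USE)

    def add(self, name):
        """Return the descriptor for name, creating it if necessary."""
        return self.descriptors.setdefault(name, Descriptor(name))

    def add_synonym(self, entry_term, preferred):
        self.add(preferred).used_for.add(entry_term)
        self.use[entry_term] = preferred

    def add_broader(self, narrow, broad):
        # Record both directions so BT and NT stay reciprocal.
        self.add(narrow).broader.add(broad)
        self.add(broad).narrower.add(narrow)

    def add_related(self, a, b):
        # Related-term links are symmetric.
        self.add(a).related.add(b)
        self.add(b).related.add(a)

# Made-up sample terms.
t = Thesaurus()
t.add_synonym("automobiles", "cars")
t.add_broader("cars", "vehicles")
t.add_related("cars", "roads")
print(t.use["automobiles"], sorted(t.descriptors["vehicles"].narrower))
```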

20
Thesauri (cont.)
  • National and International Standards for Thesauri
  • ANSI/NISO Z39.19-1994: American National
    Standard Guidelines for the Construction, Format,
    and Management of Monolingual Thesauri
  • ANSI/NISO Draft Standard Z39.4-199x: American
    National Standard Guidelines for Indexes in
    Information Retrieval
  • ISO 2788: Documentation -- Guidelines for the
    establishment and development of monolingual
    thesauri
  • ISO 5964: Documentation -- Guidelines for the
    establishment and development of multilingual
    thesauri

21
Thesauri (cont.)
  • Examples
  • The ERIC Thesaurus of Descriptors
  • The Medical Subject Headings (MeSH) of the
    National Library of Medicine
  • The Art and Architecture Thesaurus

22
Why Develop a Thesaurus?
  • To provide a conceptual structure or space for
    a body of information
  • To make it possible to adequately describe the
    topical contents of informational objects at an
    appropriate level of generality or specificity
  • To provide enhanced search capabilities and to
    improve the effectiveness of searching (i.e., to
    retrieve most of the relevant material without
    too much irrelevant material)
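The parenthetical in the last bullet corresponds to what the retrieval literature calls recall (how much of the relevant material was retrieved) and precision (how much of the retrieved material is relevant). A small worked sketch, with made-up document ids:

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one search."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

relevant = {"d1", "d2", "d3", "d4"}    # judged relevant to the question
retrieved = {"d1", "d2", "d3", "d9"}   # returned by the search
recall, precision = recall_precision(retrieved, relevant)
print(recall, precision)  # 3 of 4 relevant found; 3 of 4 retrieved are relevant
```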

23
Why Develop a Thesaurus?
  • To provide vocabulary (or terminological) control
  • When there are several possible terms designating
    a single concept, the thesaurus should lead the
    indexer or searcher to the appropriate concept,
    regardless of the terms they start with
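Vocabulary control can be sketched as a lookup from whatever term the indexer or searcher starts with to the single preferred term. The lead-in table and descriptor set below are hypothetical.

```python
# Hypothetical lead-in vocabulary: non-preferred term -> preferred term.
LEAD_IN = {
    "automobiles": "cars",
    "autos": "cars",
    "motor cars": "cars",
}
# Hypothetical set of preferred terms.
DESCRIPTORS = {"cars", "vehicles"}

def normalize(term):
    """Lowercase the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())

def to_descriptor(term):
    """Return the preferred term for any starting term, or None if unknown."""
    key = normalize(term)
    if key in DESCRIPTORS:
        return key
    return LEAD_IN.get(key)

for start in ["Autos", "  Motor   Cars ", "cars", "bicycles"]:
    print(repr(start), "->", to_descriptor(start))
```

A None result marks a gap: the term is in neither the lead-in vocabulary nor the indexing language.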

25
Preliminary Considerations
  • What is used now?
  • Continue using an existing thesaurus?
  • Ad hoc modification of existing thesaurus?
  • Develop a new well-structured thesaurus?
  • What is the scope and complexity of the subject
    field?
  • What kind of retrieval objects or data will be
    dealt with?
  • How exhaustive and specific is the desired
    description of objects?

26
Preliminary Considerations
  • The scope and complexity of the field will
    provide some indication of the scope and
    complexity of the thesaurus
  • It is better to plan for a larger and more
    comprehensive system than for a smaller system
    that will rapidly become inadequate as the
    database grows
  • Development of a good thesaurus requires a major
    intellectual effort as well as clerical
    operations like data entry and production of
    sorted lists

27
Development of a Thesaurus
  • Term selection
  • Merging and development of concept classes
  • Definition of broad subject fields and subfields
  • Development of classificatory structure
  • Review, testing, application, revision

28
Flow of Work in Thesaurus Construction
29
1. Term Selection
  • Select sources for the collection of terms
  • Prearranged Sources
  • Open-ended Sources
  • Assign codes to each source
  • Selection of terms
  • Required for some prearranged sources and for
    all open-ended sources
  • Enter terms into the database with all
    associated information

30
1.1 Kinds of Sources
  • Prearranged Sources
  • Existing descriptor lists, classification
    schemes, and thesauri
  • This includes universal schemes like DDC or LCSH
  • Nomenclatures of single disciplines
  • Treatises on the terminology of a field
  • Encyclopedias, lexica, dictionaries and
    glossaries
  • Tables of contents of textbooks and handbooks
  • Indexes of journals or abstracting journals
  • Indexes of other publications in the field

31
1.1 Kinds of Sources
  • Open-ended sources
  • Lists of search requests or interest profiles
  • Description of projects/activities to be served
    by the information retrieval system
  • Discussion with specialists in the field
  • Sample of documents in the field
  • Ask users why and how these documents relate to
    the field
  • Have documents indexed by experts in the field
  • Lists of titles of documents in the field
  • Abstracts and reviews of documents
  • Your own knowledge

32
Selection of Sources
  • Prearranged sources require less effort in
    gathering the material, and may already indicate
    some relationships between terms and concepts,
    and among terms
  • Open-ended sources can reflect current
    terminology and may provide more complete
    coverage
  • Choose a set of sources that are current, as
    complete as possible, and considered authoritative

33
Selection of Sources
  • Each selected source is assigned an ID for
    tracking its use in the development of the
    thesaurus
  • Useful when making decisions about which terms to
    prefer
  • Useful for backtracking when questions arise
    (where did this come from?)
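A small sketch of source tracking, assuming each candidate term is recorded together with the ID of the source it came from. The source IDs, descriptions, and terms are made up, and counting sources is only one possible input to a preference decision.

```python
from collections import defaultdict

# Hypothetical source register: ID -> description.
SOURCES = {
    "S1": "Existing descriptor list",
    "S2": "Textbook index",
    "S3": "Search request log",
}

term_sources = defaultdict(set)   # term -> IDs of the sources it was found in

def record(term, source_id):
    """Record that a term was collected from a registered source."""
    if source_id not in SOURCES:
        raise KeyError("unknown source: " + source_id)
    term_sources[term.lower()].add(source_id)

def where_from(term):
    """Backtracking: which sources did this term come from?"""
    return sorted(term_sources.get(term.lower(), ()))

def by_support():
    """Terms ordered by the number of sources that attest them."""
    return sorted(term_sources, key=lambda t: (-len(term_sources[t]), t))

record("Thesauri", "S1")
record("thesauri", "S2")
record("controlled vocabularies", "S1")
record("word lists", "S3")

print(where_from("thesauri"))
print(by_support())
```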

34
Selection of Terms
  • Terms can be transferred directly from
    prearranged sources to the recording medium
    (cards or database)
  • You have to decide which terms and references to
    include, or whether to take the whole source

35
Selection of Terms
  • In open-ended sources you read through the source
    and pick out terms (i.e. words and phrases) that
    might be useful in retrieval or as references to
    other terms
  • Alternatively, use keyword and phrase extraction
    software to create lists of terms and select from
    those
  • Transfer selected terms to the recording medium
    (cards or database)
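The software-assisted alternative might look roughly like the sketch below. It treats stopwords and punctuation as phrase boundaries and counts the remaining word runs. The stopword list, the three-word limit, and the sample sentences are assumptions; the output is a candidate list for a person to select from, not a finished vocabulary.

```python
import re
from collections import Counter

# Hypothetical stopword list; stopwords act as phrase boundaries.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "is", "in", "to"}

def candidate_phrases(text, max_words=3):
    """Split text at punctuation and stopwords; keep runs of up to max_words words."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return [p for p in phrases if len(p.split()) <= max_words]

def term_list(texts):
    """Count candidate phrases across texts, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(candidate_phrases(text))
    return counts.most_common()

texts = [
    "The thesaurus is a controlled vocabulary for information retrieval.",
    "Information retrieval and the design of a controlled vocabulary",
]
for phrase, count in term_list(texts):
    print(count, phrase)
```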

36
Work Form
From Soergel, p. 399
37
2. Merging and Development of Concept Classes
  • Sort Term DB into alphabetical order
  • First Round
  • Merge information for identical terms, possibly
    pulling info from additional sources
  • Second Round
  • Merge synonyms or terms in the same concept class
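The two rounds can be sketched as follows, assuming each record holds a term and the set of source IDs it came from. The synonym table stands in for decisions a person would make, and all sample data is made up.

```python
from itertools import groupby

# Made-up term records: a term plus the source IDs it was collected from.
records = [
    {"term": "Automobiles", "sources": {"S1"}},
    {"term": "cars", "sources": {"S2"}},
    {"term": "automobiles", "sources": {"S3"}},
    {"term": "roads", "sources": {"S1"}},
]

def merge_identical(records):
    """First round: sort alphabetically and merge records for identical terms."""
    key = lambda r: r["term"].lower()
    merged = {}
    for term, group in groupby(sorted(records, key=key), key=key):
        merged[term] = set().union(*(r["sources"] for r in group))
    return merged

# Hypothetical editorial decisions: term -> label of its concept class.
SAME_CONCEPT = {"automobiles": "cars"}

def merge_concept_classes(merged, same_concept):
    """Second round: pool synonyms into one concept class each."""
    classes = {}
    for term, sources in merged.items():
        label = same_concept.get(term, term)
        entry = classes.setdefault(label, {"terms": set(), "sources": set()})
        entry["terms"].add(term)
        entry["sources"] |= sources
    return classes

merged = merge_identical(records)
classes = merge_concept_classes(merged, SAME_CONCEPT)
print(list(merged))
print(sorted(classes["cars"]["terms"]))
```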

38
3. Definition of Broad Subject Fields and Subfields
  • Define broad subject fields and sort terms into
    these broad fields
  • Define subfields within each broad field and sort
    terms into these subfields
  • Work out the detailed structure
  • Select preferred terms
  • Merge information for terms in the same concept
    class
  • Repeat these steps
  • For each subfield within a broad field
  • And for each broad field
  • Until all terms have been consolidated and
    preferred terms selected

39
4. Development of Classificatory Structure
  • Produce preliminary version of classified index
    and update the working database
  • Improve classificatory structure
  • Reality check
  • Produce and distribute a version of the
    classified index
  • Distribute to users/experts
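A preliminary classified index could be produced from the working database along these lines. The narrower-term table and the indentation-based layout are assumptions for illustration; real thesauri use a variety of display formats.

```python
# Hypothetical working data: term -> narrower terms.
NARROWER = {
    "vehicles": ["cars", "bicycles"],
    "cars": ["sports cars"],
    "roads": [],
}

def classified_index(narrower, roots):
    """Return display lines with each term indented beneath its broader term."""
    lines = []

    def walk(term, depth):
        lines.append("  " * depth + term)
        for child in sorted(narrower.get(term, [])):
            walk(child, depth + 1)

    for root in sorted(roots):
        walk(root, 0)
    return lines

for line in classified_index(NARROWER, ["vehicles", "roads"]):
    print(line)
```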

40
5. Final Stages
  • Review
  • Testing
  • Application
  • Revision

41
Review
  • Discuss classified index with users/experts
  • Select descriptors and checklist descriptors
  • Assign notational symbols
  • Produce main thesaurus and indexes

42
Review (cont.)
  • Check cross references and insert where needed
  • Produce test version
  • Test by indexing
  • Modify as needed
  • Produce production version
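The cross-reference check can be partly mechanized. The sketch below assumes three tables of links (broader, narrower, related) and reports every reciprocal reference that is missing; the table layout and sample terms are hypothetical.

```python
def missing_reciprocals(broader, narrower, related):
    """List (term, link type, target) references that need to be inserted."""
    problems = []
    # Every BT link should have a matching NT link on the broader term.
    for term, parents in broader.items():
        for parent in parents:
            if term not in narrower.get(parent, set()):
                problems.append((parent, "NT", term))
    # Every NT link should have a matching BT link on the narrower term.
    for term, children in narrower.items():
        for child in children:
            if term not in broader.get(child, set()):
                problems.append((child, "BT", term))
    # RT links should be symmetric.
    for term, others in related.items():
        for other in others:
            if term not in related.get(other, set()):
                problems.append((other, "RT", term))
    return sorted(problems)

# Made-up link tables with two references missing.
broader = {"cars": {"vehicles"}, "bicycles": {"vehicles"}}
narrower = {"vehicles": {"cars"}}
related = {"cars": {"roads"}}

for term, link, target in missing_reciprocals(broader, narrower, related):
    print("insert:", term, link, target)
```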

43
Testing a Thesaurus
  • Assign descriptors to a sample set of NEW
    documents (use enough to get an idea of any gaps
    in the thesaurus)
  • Test retrieval using sample questions and see
    how effectively the thesaurus maps the terms in
    them to the appropriate descriptors
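One simple way to quantify such a test is sketched below: run the terms from sample documents or questions through the entry vocabulary and report the share that reach a descriptor, along with the gaps. The vocabulary and sample terms are made up.

```python
# Hypothetical entry vocabulary: any known term -> its descriptor.
ENTRY_VOCABULARY = {
    "cars": "cars",
    "automobiles": "cars",
    "autos": "cars",
    "bicycles": "bicycles",
}

def check_mapping(sample_terms, entry_vocabulary):
    """Return (coverage, gaps) for a list of sample terms."""
    mapped, gaps = {}, []
    for term in sample_terms:
        key = term.lower().strip()
        if key in entry_vocabulary:
            mapped[term] = entry_vocabulary[key]
        else:
            gaps.append(term)
    coverage = len(mapped) / len(sample_terms) if sample_terms else 0.0
    return coverage, gaps

coverage, gaps = check_mapping(["Autos", "bicycles", "scooters", "cars"],
                               ENTRY_VOCABULARY)
print(coverage, gaps)  # gaps are candidates for new terms or lead-in entries
```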

45
The Indexing Process
  • Concept identification
  • Term selection (via thesaurus)
  • Term assignment
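The three stages can be sketched end to end as below. In manual indexing, concept identification is a human judgment; here a crude scan of word n-grams stands in for it. The entry vocabulary, function names, and sample text are illustrative assumptions.

```python
import re

# Hypothetical entry vocabulary: term found in text -> descriptor.
ENTRY_VOCABULARY = {
    "controlled vocabulary": "controlled vocabularies",
    "controlled vocabularies": "controlled vocabularies",
    "thesaurus": "thesauri",
    "thesauri": "thesauri",
}

def identify_concepts(text, max_words=3):
    """Concept identification (stand-in): every word n-gram is a candidate."""
    words = re.findall(r"[a-z]+", text.lower())
    candidates = set()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            candidates.add(" ".join(words[i:i + n]))
    return candidates

def select_terms(candidates, vocabulary):
    """Term selection: keep candidates the thesaurus knows, as descriptors."""
    return {vocabulary[c] for c in candidates if c in vocabulary}

def index_document(doc_id, text, vocabulary, assignments):
    """Term assignment: store the chosen descriptors for the document."""
    assignments[doc_id] = sorted(select_terms(identify_concepts(text), vocabulary))
    return assignments[doc_id]

assignments = {}
print(index_document("d1", "A thesaurus is one kind of controlled vocabulary.",
                     ENTRY_VOCABULARY, assignments))
```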

46
Application: The Indexing Process (Manual)
47
Thesaurus Revision and Updates
  • There will always be new concepts, products, or
    expressions that need to be added to the
    thesaurus
  • Set a regular schedule of reviews and revisions
  • Collect complaints, problems, etc. and fold into
    revision of the thesaurus

48
References
  • Soergel, D. Indexing Languages and Thesauri:
    Construction and Maintenance. Los Angeles:
    Melville Publishing Co., 1974.
  • Foskett, A.C. The Subject Approach to
    Information. London: Clive Bingley, 1982.
  • Standards
  • ANSI/NISO Z39.19-1994: American National
    Standard Guidelines for the Construction, Format,
    and Management of Monolingual Thesauri
  • ANSI/NISO Draft Standard Z39.4-199x: American
    National Standard Guidelines for Indexes in
    Information Retrieval
  • ISO 2788: Documentation -- Guidelines for the
    establishment and development of monolingual
    thesauri
  • ISO 5964: Documentation -- Guidelines for the
    establishment and development of multilingual
    thesauri

49
Next Time
  • Metadata and Markup
  • How can metadata be expressed and structured in
    documents and databases
  • More on XML and its use in defining metadata
    systems