Title: CS/INFO 430 Information Retrieval
1 CS/INFO 430 Information Retrieval
Lecture 25: Metadata 4
2 Course Administration
Early examination: There will be an early
opportunity to take the final examination on
Wednesday, December 5, 10:00 to noon, in Upson
5126. If you wish to take the early examination,
you must inform Sarah Birns (sbirns_at_cs.cornell.edu)
by Friday, November 30.
3 Example: Geospatial Information
Example: Alexandria Digital Library at the
University of California, Santa Barbara. Funded
by the NSF Digital Libraries Initiative since
1994.
Collections include any data referenced by a
geographical footprint: terrestrial maps, aerial
and satellite photographs, astronomical maps,
databases, and related textual information.
Program of research with practical implementation
at the university's map library.
4 Alexandria: Computer Systems and User Interfaces
- Computer systems
  - Digitized maps and geospatial information -- large files
  - Wavelets provide multi-level decomposition of an image
    -> first level is a small, coarse image
    -> extra levels provide greater detail
- User interfaces
  - Small size of computer displays
  - Slow performance of the Internet in delivering large files
    -> retain state throughout a session
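The multi-level idea can be illustrated with a minimal sketch. It assumes the simplest wavelet (Haar) applied to one row of pixel values; the slides do not say which wavelet Alexandria used, and the function names are hypothetical. The coarse values can be sent first, and each extra detail layer sharpens the result.

```python
def haar_step(values):
    """Split a sequence into pairwise averages (coarse) and half-differences (detail)."""
    pairs = list(zip(values[0::2], values[1::2]))
    coarse = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return coarse, detail


def haar_decompose(values, levels):
    """Return the coarsest approximation and the detail layers, coarsest layer first.

    Assumes len(values) is divisible by 2 ** levels.
    """
    coarse = list(values)
    details = []
    for _ in range(levels):
        coarse, detail = haar_step(coarse)
        details.insert(0, detail)
    return coarse, details


def haar_reconstruct(coarse, details, use_layers):
    """Rebuild the sequence using only the first `use_layers` detail layers."""
    values = list(coarse)
    for i, detail in enumerate(details):
        if i >= use_layers:
            detail = [0.0] * len(detail)  # this layer has not been delivered yet
        rebuilt = []
        for avg, diff in zip(values, detail):
            rebuilt.extend([avg + diff, avg - diff])
        values = rebuilt
    return values


row = [9, 7, 3, 5, 6, 10, 2, 6]          # hypothetical pixel values
coarse, details = haar_decompose(row, 2)
print(haar_reconstruct(coarse, details, 0))  # coarse view only
print(haar_reconstruct(coarse, details, 2))  # full detail
```

With no detail layers the row is shown at its coarsest resolution; with all layers the original values are recovered exactly.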
5 Alexandria: User Interface
6 Alexandria: Information Discovery
Metadata for information discovery:
Coverage: geographical area covered, such as the city of
Santa Barbara or the Pacific Ocean.
Scope: varieties of information, such as topographical
features, political boundaries, or population density.
Latitude and longitude provide basic metadata for maps
and for geographical features.
7 Gazetteer
Gazetteer: database and a set of procedures that
translate between representations of geospatial
references: place names, geographic features,
coordinates, postal codes, census tracts.
Search engine tailored to the peculiarities of
searching for place names.
Research is making steady progress at feature
extraction, using automatic programs to identify
objects in aerial photographs or printed maps,
but this remains a topic for long-term research.
8 The Alexandria Gazetteer
The Alexandria Digital Library (ADL): the primary
attribute of an object is its location on Earth
(e.g., map, satellite photograph).
Geographic footprint: latitude and longitude values
that represent a point, a bounding box, a linear
feature, or a complete polygonal boundary.
Gazetteer: list of geographic names, with
geographic locations and other descriptive
information.
Geographic name: proper name for a geographic
place or feature (e.g., Santa Barbara County,
Mount Washington, St. Francis Hospital, and
Southern California).
9 Use of a Gazetteer
- Answers the "Where is" question; for example,
  "Where is Santa Barbara?"
- Translates between geographic names and locations.
  A user can find objects by matching the footprint
  of a geographic name to the footprints of the
  collection objects.
- Locates particular types of geographic features in
  a designated area. For example, a user can draw a
  box around an area on a map and find the schools,
  hospitals, lakes, or volcanoes in the area.
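A minimal sketch of these three uses, assuming every footprint is a bounding box and that matching means box overlap. The entries, coordinates, and function names are hypothetical illustrations, not data or interfaces from the Alexandria gazetteer, and boxes that cross the 180-degree meridian are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding-box footprint in decimal degrees (west longitudes are negative)."""
    south: float
    west: float
    north: float
    east: float

    def overlaps(self, other):
        return (self.south <= other.north and other.south <= self.north
                and self.west <= other.east and other.west <= self.east)


# Hypothetical gazetteer: name -> (feature type, footprint). Values are rough.
GAZETTEER = {
    "Santa Barbara": ("populated place", Box(34.3, -119.9, 34.5, -119.6)),
    "Example School": ("school", Box(34.42, -119.70, 34.42, -119.70)),
    "Example Lake": ("lake", Box(34.95, -119.30, 35.00, -119.20)),
}

# Hypothetical collection objects with their footprints.
COLLECTION = {
    "photo-001": Box(34.35, -119.80, 34.45, -119.65),
    "map-002": Box(36.0, -96.1, 36.3, -95.8),
}


def where_is(name):
    """Answer the 'Where is' question with the footprint of a geographic name."""
    return GAZETTEER[name][1]


def find_objects(name):
    """Match the footprint of a name against the footprints of collection objects."""
    footprint = where_is(name)
    return sorted(k for k, box in COLLECTION.items() if box.overlaps(footprint))


def features_in(area, feature_type):
    """Find features of one type inside a box drawn by the user."""
    return sorted(name for name, (ftype, box) in GAZETTEER.items()
                  if ftype == feature_type and box.overlaps(area))


print(find_objects("Santa Barbara"))
print(features_in(Box(34.0, -120.0, 34.6, -119.5), "school"))
```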
10 Alexandria Gazetteer: Example from a search on "Tulsa"
Feature name                             State  County  Type     Latitude  Longitude
Tulsa                                    OK     Tulsa   pop pl   360914N   0955933W
Tulsa Country Club                       OK     Osage   locale   360958N   0960012W
Tulsa County                             OK     Tulsa   civil    360600N   0955400W
Tulsa Helicopters Incorporated Heliport  OK     Tulsa   airport  360500N   0955205W
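The coordinates in this example appear to be packed degrees, minutes, and seconds followed by a hemisphere letter (DDMMSS for latitude, DDDMMSS for longitude). Under that assumption, which is inferred from the values and not stated on the slide, a minimal sketch of converting them to signed decimal degrees follows; `parse_dms` is a hypothetical helper name.

```python
def parse_dms(text):
    """Convert packed DMS such as '360914N' or '0955933W' to signed decimal degrees."""
    text = text.replace(" ", "")
    hemisphere = text[-1].upper()
    digits = text[:-1]
    if hemisphere not in ("N", "S", "E", "W") or not digits.isdigit():
        raise ValueError(f"not a packed DMS value: {text!r}")
    degree_digits = 2 if hemisphere in ("N", "S") else 3
    if len(digits) != degree_digits + 4:
        raise ValueError(f"unexpected length: {text!r}")
    degrees = int(digits[:degree_digits])
    minutes = int(digits[degree_digits:degree_digits + 2])
    seconds = int(digits[degree_digits + 2:])
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative by the usual convention.
    return -value if hemisphere in ("S", "W") else value


print(round(parse_dms("360914N"), 4), round(parse_dms("0955933W"), 4))
```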
11 Challenges for the Alexandria Gazetteer
Content standard: A standard conceptual schema
for gazetteer information.
Feature types: A type scheme to categorize individual
features that is rich in term variants and extensible.
Temporal aspects: Geographic names and attributes
change through time.
"Fuzzy" footprints: The extent of a geographic feature
is often approximate or ill-defined (e.g., Southern
California).
12 Challenges for the Alexandria Gazetteer (continued)
Quality aspects:
(a) Indicate the accuracy of latitude and longitude data.
(b) Ensure that the reported coordinates agree with the
other elements of the description.
Spatial extents:
(a) Points do not represent the extent of the geographic
locations and are therefore only minimally useful.
(b) Bounding boxes often include too much territory
(e.g., the bounding box for California also includes Nevada).
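The bounding-box problem can be shown with a small sketch. The L-shaped region below is a made-up stand-in for an irregular feature, not real coordinates; the polygon test is the standard ray-casting method, chosen here for illustration only.

```python
def bounding_box(polygon):
    """Smallest axis-aligned box containing every vertex: (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(point, box):
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(point, polygon):
    """Ray casting: count how many edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical L-shaped region; the notch at upper right belongs to a neighbor.
region = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 6), (0, 6)]
neighbor_point = (3, 4)

print(in_box(neighbor_point, bounding_box(region)))  # box says yes
print(in_polygon(neighbor_point, region))            # polygon says no
```

A point in the notch falls inside the bounding box but outside the polygon, which is the kind of false match the slide describes.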
13 Alexandria Gazetteer
Alexandria Digital Library: Linda L. Hill, James
Frew, and Qi Zheng, Geographic Names: The
Implementation of a Gazetteer in a Georeferenced
Digital Library. D-Lib Magazine, 5(1), January
1999. http://www.dlib.org/dlib/january99/hill/01hill.html
14 Classification and Categorization
[Diagram relating terms and documents to empirically-defined
classes and pre-defined classes; labels: thesaurus,
classification, text categorization, document clustering]
15 Text categorization
Text categorization: Problem is to decide, for each of
a fixed set of pre-determined categories, whether a
document belongs to it. Each document may belong to
many categories. Each category is taken as a separate
binary classification problem.
Classification: Problem is to assign each document
to exactly one pre-determined category.
16 Text categorization
Outline:
- Select a subject domain.
- Choose a corpus of documents that cover the domain.
- Obtain a training set of documents that have been
  assigned to a set of categories.
- Create a vocabulary by extracting terms,
  normalization, precoordination of phrases, etc.
- Use the terms in a document as a feature set
  that describes the document.
- Scale the terms using idf or a similar measure.
- Use machine learning methods (e.g., support vector
  machine) to train an automatic classifier.
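The outline can be sketched end to end in a few lines. This is an illustration under stated assumptions: the training documents and category names are made up, tokenization is a bare lowercase split, and a simple centroid comparison stands in for the support vector machine so that the example needs no external library. Each category is treated as its own binary decision, so a document may receive several categories.

```python
import math
from collections import Counter


def tokenize(text):
    return [w for w in (t.strip(".,;:!?").lower() for t in text.split()) if w]


def idf_table(texts):
    """idf(t) = log(N / df(t)), computed over the training texts."""
    df = Counter(t for text in texts for t in set(tokenize(text)))
    return {t: math.log(len(texts) / count) for t, count in df.items()}


def vectorize(text, idf):
    """tf * idf weights, scaled to unit length; terms outside the vocabulary are dropped."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: n * idf[t] for t, n in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def centroid(vectors):
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def train(training_set):
    """training_set: list of (text, set of categories). One binary model per category."""
    idf = idf_table([text for text, _ in training_set])
    vectors = [(vectorize(text, idf), cats) for text, cats in training_set]
    categories = set().union(*(cats for _, cats in training_set))
    model = {}
    for cat in categories:
        positive = [v for v, cats in vectors if cat in cats]
        negative = [v for v, cats in vectors if cat not in cats]
        model[cat] = (centroid(positive), centroid(negative))
    return idf, model


def categorize(text, idf, model):
    vec = vectorize(text, idf)
    return {cat for cat, (pos, neg) in model.items() if dot(vec, pos) > dot(vec, neg)}


# Hypothetical training set.
TRAINING = [
    ("river canal lock boats", {"water"}),
    ("canal drainage ditch irrigation", {"water"}),
    ("mountain peak summit trail", {"land"}),
    ("mountain lake trail hiking", {"land", "water"}),
]
idf, model = train(TRAINING)
print(categorize("canal boats on the river", idf, model))
print(categorize("lake trail", idf, model))
```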
17 Lexicon and Thesaurus
Lexicon: contains information about words, their
morphological variants, and their grammatical usage.
Thesaurus: relates words by meaning:
ship, vessel, sail; craft, navy, marine, fleet, flotilla
book, writing, work, volume, tome, tract, codex
search, discovery, detection, find, revelation
(From Roget's Thesaurus, 1911)
18 Thesaurus in Information Retrieval
Use of a thesaurus in indexing (precoordination)
A. Manual: A human indexer assigns standard terms
and associations.
computer-aided instruction
  see also education
  UF teaching machines
  BT educational computing
  TT computer applications
  RT education
  RT teaching
UF = used for, BT = broader term, TT = top term, RT = related term
From INSPEC Thesaurus
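One way to hold such an entry in a program is a record keyed by the preferred term, with an inverted index from each UF variant back to that term. This is a sketch of a possible data structure, not the INSPEC format; the dictionary layout and function names are assumptions.

```python
# The single entry from the slide, keyed by its preferred (standard) term.
THESAURUS = {
    "computer-aided instruction": {
        "see also": ["education"],
        "UF": ["teaching machines"],       # used for: non-preferred variants
        "BT": ["educational computing"],   # broader term
        "TT": ["computer applications"],   # top term
        "RT": ["education", "teaching"],   # related terms
    },
}


def build_use_index(thesaurus):
    """Map each non-preferred variant (UF) to its preferred term."""
    return {variant: preferred
            for preferred, record in thesaurus.items()
            for variant in record.get("UF", [])}


def preferred_term(term, thesaurus, use_index):
    """Return the standard term for an indexer's candidate term, or None if unknown."""
    term = term.lower()
    if term in thesaurus:
        return term
    return use_index.get(term)


def broader_terms(term, thesaurus, use_index):
    """Broader terms of the standard form of `term` (empty list if unknown)."""
    preferred = preferred_term(term, thesaurus, use_index)
    return thesaurus[preferred]["BT"] if preferred else []


use_index = build_use_index(THESAURUS)
print(preferred_term("Teaching Machines", THESAURUS, use_index))
print(broader_terms("teaching machines", THESAURUS, use_index))
```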
19 Thesaurus in Information Retrieval
Use of a thesaurus in indexing (precoordination)
B. Automatic: Divide terms into thesaurus classes.
Replace similar terms by a thesaurus class.
408 dislocation, junction, minority-carrier, n-p-n, p-n-p,
    point-contact, recombine, transition, unijunction
409 blast-cooled, heat-flow, heat-transfer
410 anneal, strain
From Salton and McGill
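Replacing terms by their thesaurus class can be sketched as a dictionary lookup. The class membership below follows one reading of the excerpt above and should be checked against Salton and McGill; the `class_NNN` token format is an assumption made for illustration.

```python
# Thesaurus classes as read from the excerpt (membership is an assumption).
CLASSES = {
    408: ["dislocation", "junction", "minority-carrier", "n-p-n", "p-n-p",
          "point-contact", "recombine", "transition", "unijunction"],
    409: ["blast-cooled", "heat-flow", "heat-transfer"],
    410: ["anneal", "strain"],
}

# Invert to term -> class number for fast lookup during indexing.
TERM_TO_CLASS = {term: number for number, terms in CLASSES.items() for term in terms}


def to_classes(tokens):
    """Replace each term that belongs to a thesaurus class by a class identifier."""
    return [f"class_{TERM_TO_CLASS[t]}" if t in TERM_TO_CLASS else t for t in tokens]


print(to_classes(["heat-flow", "in", "n-p-n", "junction"]))
```

After replacement, documents that use different terms from the same class share an index term, so they can match the same query.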
20 Desirable Properties for Information Retrieval
- Thesaurus is specific to a subject area.
- Contains only terms of interest for identification
  within that subject area.
- Ambiguous terms are coded only for the senses
  important for that field.
- Target is that each thesaurus class should include
  terms of moderate frequency.
- Ideally the classes should have similar frequency.
21 Alexandria Thesaurus: Example
canals: A feature type category for places such
as the Erie Canal.
Used for: The category canals is used instead of
any of the following:
canal bends, canalized streams, ditch mouths,
ditches, drainage canals, drainage ditches, ... more ...
Broader Terms: Canals is a sub-type of hydrographic
structures.
22 Alexandria Thesaurus: Example (continued)
canals (continued)
Related Terms: The following is a list of other
categories related to canals (non-hierarchical
relationships):
channels, locks, transportation features, tunnels
Scope Note: Manmade waterway used by watercraft or for
drainage, irrigation, mining, or water power.
Definition of canals.
23 Art and Architecture Thesaurus
- Controlled vocabulary for describing and retrieving
  information: fine art, architecture, decorative art,
  and material culture.
- Almost 128,000 terms for objects, textual materials,
  images, architecture, and culture from all periods and
  all cultures.
- Used by archives, museums, and libraries to describe
  items in their collections.
- Used as a database accessed via a Web site.
- Used by computer programs, for information retrieval,
  and natural language processing.
- A project of the J. Paul Getty Trust
http://www.getty.edu/research/conducting_research/vocabularies/aat/
24 Art and Architecture Thesaurus
Provides the terminology for objects, and the
vocabulary necessary to describe them, such as
style, period, shape, color, construction, or
use, and scholarly concepts, such as theories or
criticism.
Concept: a cluster of terms, one of which is
established as the preferred term, or descriptor.
Categories: associated concepts, physical attributes,
styles and periods, agents, activities, materials,
and objects.
25 Art and Architecture Thesaurus: Sample Record
Record ID: 198841
Descriptor: rhyta
Note: Refers to vessels from Ancient Greece, eastern
Europe, or the Middle East that typically have a closed
form with two openings, one at the top for filling and
one at the base so that liquid could stream out. They
are often in the shape of a horn or an animal's head,
and were typically used as a drinking cup or for pouring
wine into another vessel.
Hierarchy: Containers TQ
...<containers by function or context>
...........<culinary containers>
...................<containers for serving and consuming food>
26 Art and Architecture Thesaurus: Sample Record (continued)
Terms: rhyta, rhyton (alternate, singular),
protomai, protome, rhea, rheon, rheons
Related concepts: stirrup cups, sturzbechers,
drinking vessels, ceremonial vessels
27 Medical Subject Headings (MeSH)
National Library of Medicine's controlled
vocabulary thesaurus.
The library provides MeSH subject headings for each
article in the 4,800 journals that it indexes every
year and for every book, etc. acquired by the library.
- 23,000 primary headings.
- Additional thesaurus of about 151,000 chemical terms.
- Terms are organized in a hierarchy.
- 135,000 cross-references.
Requires experts who understand the field and are able
to formulate queries using MeSH terms and the MeSH
structures.
28 MeSH hierarchy
Biological Sciences                                     G
  Biological Sciences                                   G01
  Health Occupations                                    G02
  Environment and Public Health                         G03
  Biological Phenomena, Cell Phenomena, and Immunity    G04
  Genetics                                              G05
  Biochemical Phenomena, Metabolism, and Nutrition      G06
  Physiological Processes                               G07
  Reproductive and Urinary Physiology                   G08
  Circulatory and Respiratory Physiology                G09
  Digestive, Oral, and Skin Physiology                  G10
  Musculoskeletal, Neural, and Ocular Physiology        G11
  Chemical and Pharmacologic Phenomena                  G12
29 MeSH hierarchy (continued)
Physiological Processes                 G07
  Adaptation, Physiological             G07.062
  Aging                                 G07.168
  Body Constitution                     G07.265
  Body Temperature                      G07.315
    Body Temperature Regulation         G07.315.232
    Skin Temperature                    G07.315.753
  Chronobiology                         G07.450
  Electrophysiology                     G07.453
  Fluid Shifts                          G07.503
  Growth and Embryonic Development      G07.553
  Homeostasis                           G07.621
  Tensile Strength                      G07.900
  Tropism                               G07.950
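Because each tree number extends its parent's number with one more dotted component, the hierarchy can be navigated with plain string operations. A minimal sketch using a few of the headings listed above; the function names are hypothetical and this is not an NLM interface.

```python
# A few headings from the G07 subtree shown above: tree number -> heading.
TREE = {
    "G07": "Physiological Processes",
    "G07.265": "Body Constitution",
    "G07.315": "Body Temperature",
    "G07.315.232": "Body Temperature Regulation",
    "G07.315.753": "Skin Temperature",
    "G07.450": "Chronobiology",
}


def parent(tree_number):
    """Drop the last dotted component; top-level numbers have no parent."""
    head, _, _ = tree_number.rpartition(".")
    return head or None


def ancestors(tree_number):
    """Parent chain from the immediate parent up to the top level."""
    chain = []
    current = parent(tree_number)
    while current:
        chain.append(current)
        current = parent(current)
    return chain


def descendants(tree_number, tree):
    """Every tree number that lies below the given one."""
    prefix = tree_number + "."
    return sorted(n for n in tree if n.startswith(prefix))


print(ancestors("G07.315.232"))
print([TREE[n] for n in descendants("G07.315", TREE)])
```

Searching on a heading together with all of its descendants broadens a query to the narrower headings beneath it.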
30 MeSH hierarchy (continued)
MeSH Heading: Body Temperature
Tree Number: E01.370.600.120
Tree Number: G07.315
Entry Term: Organ Temperature
See Also: Fever
See Also: Thermography
See Also: Thermometers
Allowable Qualifiers: DE GE IM PH RE
Unique ID: D001831