Title: Metadata as Infrastructure for Information Retrieval and Text Mining
1Metadata as Infrastructure for Information
Retrieval and Text Mining
Prof. Ray R. Larson
University of California, Berkeley
School of Information
2Overview
- Metadata as Infrastructure
- What, Where, When and Who?
- What are Entry Vocabulary Indexes?
- Notion of an EVI
- How are EVIs Built
- Time Period Directories
- Mining Metadata for new metadata
3Metadata as Infrastructure
- The difference between memorization and
understanding lies in knowing the context and
relationships of whatever is of interest. When
setting out to learn about a new topic, a
well-tested practice is to follow the traditional
5Ws and the H: Who?, What?, When?, Where?,
Why?, and How?
4Metadata as Infrastructure
- The reference collections of paper-based
libraries provide a structured environment for
resources, with encyclopedias and subject
catalogs, gazetteers, chronologies, and
biographical dictionaries, offering direct
support for at least What, Where, When, and Who.
- The digital environment does not yet provide an
effective, and easily exploited, infrastructure
comparable to the traditional reference library.
5What?
- Searching texts by topic, e.g. Dewey, LCSH, any
subject index, or category scheme applied to
documents.
- Two kinds of mapping in every search:
  - Documents are assigned to topic categories, e.g.
  Dewey.
  - Queries have to map to topic categories, e.g.
  Dewey's Relativ Index from ordinary words/phrases
  to Decimal Classification numbers.
- Also mapping between topic systems, e.g. US
Patent classification and International Patent
Classification.
6What searches involve mapping to controlled
vocabularies
Thesaurus/ Ontology
Texts
7Start with a collection of documents.
8Classify and index with controlled vocabulary Or
use a pre-indexed collection.
9Problem: Controlled vocabularies can be difficult
for people to use.
pass mtr veh spark ign eng
10Solution: Entry Level Vocabulary Indexes.
Index
EVI
11What and Entry Vocabulary Indexes
- EVIs are a means of mapping from users'
vocabulary to the controlled vocabulary of a
collection of documents
12Building and Searching EVIs
13Technical Details
For noun phrases
Internet DB indexed with a controlled vocabulary.
Building an Entry Vocabulary Module (EVI)
14Association Measure
        C     ¬C
 t      a     b
¬t      c     d
Where t is the occurrence of a term and C is the
occurrence of a class in the training set
(¬ marks non-occurrence)
15Association Measure
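The formula on this slide did not survive transcription. One widely used association measure over a 2x2 term/class table is the log-likelihood ratio (G²). The sketch below assumes that measure and assumes the cell layout a = t with C, b = t without C, c = C without t, d = neither. It is an illustration, not a confirmed copy of the slide's formula, and the function name is hypothetical.

```python
import math


def log_likelihood_ratio(a, b, c, d):
    """G^2 for a 2x2 table: a=(t, C), b=(t, not C), c=(not t, C), d=(not t, not C)."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    row_t, row_not_t = a + b, c + d
    col_c, col_not_c = a + c, b + d
    observed = [a, b, c, d]
    # Expected counts if term and class occurred independently.
    expected = [
        row_t * col_c / n,
        row_t * col_not_c / n,
        row_not_t * col_c / n,
        row_not_t * col_not_c / n,
    ]
    total = 0.0
    for o, e in zip(observed, expected):
        if o > 0:  # empty cells contribute nothing to the sum
            total += o * math.log(o / e)
    return 2.0 * total
```

A score near zero means the term and class co-occur about as often as chance predicts. G² grows with any departure from independence, so a caller that only wants positive associations would also check that a exceeds its expected count.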
16Alternatively
- Because the evidence terms in EVIs can be
considered a document, you can also use IR
techniques and use the top-ranked classes for
classification or query expansion
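A minimal sketch of this alternative, assuming each class's evidence terms are stored as a plain list of words and that classes are ranked by TF-IDF cosine similarity to the query. The weighting scheme, the function name, and the sample data are illustrative assumptions, not details taken from the slides.

```python
import math
from collections import Counter


def rank_classes(query, class_evidence, top_k=3):
    """Rank classes by TF-IDF cosine similarity between query and evidence terms."""
    n_classes = len(class_evidence)
    # Document frequency: in how many classes' evidence does each term occur?
    df = Counter()
    for terms in class_evidence.values():
        df.update(set(terms))

    def weights(terms):
        tf = Counter(terms)
        # Smoothed IDF keeps weights positive even for terms found in every class.
        return {t: tf[t] * (math.log((1 + n_classes) / (1 + df[t])) + 1.0) for t in tf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q = weights(query.lower().split())
    scored = [(cosine(q, weights(terms)), label) for label, terms in class_evidence.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(label, score) for score, label in scored[:top_k] if score > 0]


# Hypothetical evidence terms; only the first class label appears in the slides.
EVIDENCE = {
    "pass mtr veh spark ign eng": ["automobile", "car", "engine", "gasoline", "automobile"],
    "railway rolling stock": ["train", "locomotive", "engine", "rail"],
    "aircraft": ["airplane", "jet", "engine", "wing"],
}

ranked = rank_classes("automobile engine", EVIDENCE)
```

The top-ranked labels can then be offered to the user as candidate index terms or added to the query as expansion terms.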
17Digital library resources
Statistical association
18EVI example
Index term: pass mtr veh spark ign eng
EVI 1
User query: Automobile
Index term: automobiles OR internal
combustion engines
EVI 2
19But why stop there?
Index
EVI
20Which EVI do I use?
Index
EVI
Index
EVI
Index
EVI
Index
21EVI to EVIs
Index
EVI
EVI2
Index
EVI
Index
EVI
Index
22Why not treat language the same way?
23It is also difficult to move between different
media forms
Thesaurus/ Ontology
Texts
EVI
Numeric datasets
24Searching across data types
- Different media can be linked indirectly via
metadata, but often (e.g. for socio-economic
numeric data series) you also need to specify
WHERE to get correct results
25But texts associated with numeric data can be
mapped as well
Thesaurus/ Ontology
Texts
EVI
EVI
captions
Numeric datasets
26EVI to Numeric Data example
27But there are also geographic dependencies
Thesaurus/ Ontology
Texts
EVI
EVI
captions
Maps/ Geo Data
Numeric datasets
28WHERE: Place names are problematic
- Variant forms: St. Petersburg, Санкт Петербург,
Saint-Pétersbourg, . . .
- Multiple names: Cluj, in Romania / Roumania /
Rumania, is also called Klausenburg and
Kolozsvar.
- Name changes: Bombay → Mumbai.
- Homographs: Vienna, VA, and Vienna, Austria
- 50 Springfields.
- Anachronisms: No Germany before 1870
- Vague, e.g. Midwest, Silicon Valley
- Unstable boundaries: 19th century Poland,
Balkans, USSR
- Use a gazetteer!
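To make the gazetteer idea concrete, here is a small sketch of a lookup that resolves variant and former names to one place record and returns every candidate for a homograph. The data structure, identifiers, and function name are hypothetical, and the coordinates are approximate values added only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    preferred_name: str
    country: str
    lat: float
    lon: float


# Hypothetical mini-gazetteer; coordinates are approximate.
PLACES = {
    "spb": Place("Saint Petersburg", "Russia", 59.94, 30.31),
    "cluj": Place("Cluj-Napoca", "Romania", 46.77, 23.59),
    "mumbai": Place("Mumbai", "India", 19.08, 72.88),
    "vienna_at": Place("Vienna", "Austria", 48.21, 16.37),
    "vienna_va": Place("Vienna", "United States (Virginia)", 38.90, -77.27),
}

# Every known name form points to one or more place identifiers.
VARIANTS = {
    "st. petersburg": ["spb"],
    "saint-pétersbourg": ["spb"],
    "санкт петербург": ["spb"],
    "cluj": ["cluj"],
    "klausenburg": ["cluj"],
    "kolozsvar": ["cluj"],
    "bombay": ["mumbai"],
    "mumbai": ["mumbai"],
    "vienna": ["vienna_at", "vienna_va"],  # homograph: two candidates
}


def lookup(name):
    """Return all places a name may refer to (an empty list if unknown)."""
    key = name.strip().casefold()
    return [PLACES[place_id] for place_id in VARIANTS.get(key, [])]
```

A homograph such as "Vienna" yields two candidates, so the caller still has to disambiguate, for example by country or by proximity to other places named in the same document.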
29WHERE: Geo-temporal search interface. Place
names found in documents. Gazetteer provided
lat./long. Places displayed on map.
Timebar?
30Zoom on map. Click on place for a list of
records. Click on record to display text.
31Catalogs and gazetteers should talk to each other!
Catalog search
Gazetteer search
Geographic sort / display of catalog search
result.
32So geographic search becomes part of the
infrastructure
Thesaurus/ Ontology
Texts
EVI
Gazetteers
captions
Maps/ Geo Data
Numeric datasets
33WHEN: Search by time is also weakly supported
- Calendars are the standard for time
- But people use the names of events to refer to
time periods
- Named time periods resemble place names in being
  - Unstable: European War, Great War, First World
  War
  - Multiple: Second World War, Great Patriotic War
  - Ambiguous: Civil war in different centuries in
  England, USA, Spain, etc.
- Places have temporal aspects; periods have
geographical aspects: when the Stone Age was
varies by region
34Similarity between place names and period names
- Suggests a similar solution: a gazetteer-like
Time Period Directory.
- Gazetteer
  - Place name, Type, Spatial markers (lat./long.),
  When
- Time Period Directory
  - Period name, Type, Time markers (calendar),
  Where
- Note the symmetry in the connections between
Where and When.
35Solution - Time Period Directories
- Initial development involved mining the Library
of Congress Subject Authority file for named time
periods
36LC MARC Authorities Records
<USMARC>
<Fld001>sh 00000613</Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
<Fld670><a>Work cat. 45053442 Besselmeier, S. Warhafftige history vnd beschreibung des Magdeburgischen Kriegs, 1552.</a></Fld670>
<Fld670><a>Cath. encyc.</a><b>(Magdeburg besieged (1550-51) by the Margrave Maurice of Saxony)</b></Fld670>
<Fld670><a>Ox. encyc. reformation</a><b>(Magdeburg ... during the 1550-1551 siege of Magdeburg ...)</b></Fld670>
</USMARC>
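A rough sketch of how a named time period could be mined from such a record, assuming the record is available as well-formed XML with elements named USMARC, Fld001 and Fld151, and that the chronological subdivision sits in subfield y as "name, begin-end". The function name, the output keys, and the regular expression are illustrative assumptions, not the project's actual extraction code.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the record above (heading field only).
RECORD = """<USMARC>
<Fld001>sh 00000613</Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
</USMARC>"""

YEAR_RANGE = re.compile(r"(\d{3,4})\s*-\s*(\d{2,4})")


def extract_period(record_xml):
    """Return place, period name and year range from a geographic heading, or None."""
    root = ET.fromstring(record_xml)
    heading = root.find("Fld151")
    if heading is None:
        return None
    chrono = heading.findtext("y")
    if chrono is None:
        return None
    match = YEAR_RANGE.search(chrono)
    if match is None:
        return None
    begin, end = match.group(1), match.group(2)
    # Expand an abbreviated end year such as "1550-51" to "1551".
    if len(end) < len(begin):
        end = begin[: len(begin) - len(end)] + end
    return {
        "record_id": (root.findtext("Fld001") or "").strip(),
        "place": heading.findtext("a"),
        "period_name": chrono[: match.start()].rstrip(" ,"),
        "begin": int(begin),
        "end": int(end),
    }


period = extract_period(RECORD)
```

Headings without a chronological subdivision, or with open-ended or textual dates, return None here. A real harvest would need extra patterns for those cases.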
37timePeriodEntry: Time Period Directory instance; contains the components described below
- periodID: Unique identifier
- periodName: Period name, can be repeated for alternative names
  - Information about language, script, transliteration scheme
  - Source information and notes (where the period name was mentioned)
- descriptiveNotes: Description of time period
- dates: Calendar and date format
  - Begin and end date (exact, earliest, latest, most-likely, advocated-by-source, ongoing)
  - Notes, sources
- periodClassification: Period type, e.g. Period of Conflict, Art movement
  - Can plug in different classification schemes
  - Can be repeated for several classifications
- location: Places associated with the time period
  - Contains both place name and entry to a gazetteer providing more specific place information like latitude / longitude coordinates
  - Can plug in different location indicators (e.g. ADL gazetteer, Getty Thesaurus of Geographic Names)
  - Recently added coordinates for direct use
- relatedPeriod: Related time periods
  - periodID of related periods
  - Information about relationship type (part-of, successor etc.)
  - Can plug in different relationship type schemes
- entryMetadata: Notes about creator / creation of instance
  - Entry date
  - Modification date
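The components above could be modelled along the following lines. This is a simplified sketch: the field names are Python-style renderings of the schema elements, dates are reduced to a begin and end year, and the identifiers and the `periods_overlapping` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TimePeriodEntry:
    period_id: str                      # periodID
    period_names: List[str]             # periodName, repeatable
    begin_year: int                     # dates, simplified to whole years
    end_year: int
    descriptive_notes: str = ""         # descriptiveNotes
    classifications: List[str] = field(default_factory=list)  # periodClassification
    locations: List[str] = field(default_factory=list)        # location (place names)
    # relatedPeriod as (relationship type, periodID of the related period)
    related_periods: List[Tuple[str, str]] = field(default_factory=list)


def periods_overlapping(entries, start, end, place=None):
    """Entries whose year range overlaps [start, end], optionally limited to one place."""
    hits = []
    for entry in entries:
        overlaps = entry.begin_year <= end and start <= entry.end_year
        place_ok = place is None or place in entry.locations
        if overlaps and place_ok:
            hits.append(entry)
    return hits


# Illustrative entries; the identifiers are made up.
DIRECTORY = [
    TimePeriodEntry("tp-0001", ["Siege of Magdeburg"], 1550, 1551,
                    classifications=["Period of Conflict"],
                    locations=["Magdeburg (Germany)"]),
    TimePeriodEntry("tp-0002", ["First World War", "Great War", "European War"],
                    1914, 1918, classifications=["Period of Conflict"]),
]

hits = periods_overlapping(DIRECTORY, 1500, 1600, place="Magdeburg (Germany)")
```

Keeping place names on each entry is what lets a single query combine When and Where, mirroring the way a gazetteer entry can carry a time range.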
39Time periods by named location
40Catalog Search Result
41Web Interface - Access by map
42Zoomable interface gives access to geographically
focused info
43Web Interface - Access by timeline
Link initiates search of the Library of Congress
catalog for all records relating to this time
period.
44WHEN and WHAT
- These named time periods are derived from Library
of Congress catalog subject headings and so can
be used for catalog searching which finds books
on topics important for that time period
45Time period directories link via the place (or
time)
Thesaurus/ Ontology
Texts
EVI
Gazetteers
captions
Maps/ Geo Data
Numeric datasets
Time Period Directory
Time lines, Chronologies
46WHEN, WHERE and WHO
- Catalog records found from a time period search
commonly include names of persons important at
that time. Their names can be forwarded to, e.g.,
biographies in the Wikipedia encyclopedia.
47Place and time are broadly important across
numerous tools and genres including, e.g.,
language atlases, library catalogs, biographical
dictionaries, bibliographies, archival finding
aids, museum records, etc.
Biographical dictionaries are heavy on place and time:
Emanuel Goldberg. Born Moscow 1881. PhD under
Wilhelm Ostwald, Univ. of Leipzig, 1906.
Director, Zeiss Ikon, Dresden, 1926-33. Moved to
Palestine 1937. Died Tel Aviv, 1970.
Life as a series of episodes involving Activity (WHAT),
WHERE, WHEN, and WHO else.
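The "life as a series of episodes" view can be sketched as a list of records, each carrying WHAT, WHERE, WHEN and WHO else. The episode data below comes from the Goldberg example above; the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Episode:
    what: str                       # activity
    where: str                      # place name, resolvable through a gazetteer
    begin: int                      # first year of the episode
    end: int                        # last year of the episode
    who_else: Tuple[str, ...] = ()  # other people involved


GOLDBERG = [
    Episode("Born", "Moscow", 1881, 1881),
    Episode("PhD, Univ. of Leipzig", "Leipzig", 1906, 1906, ("Wilhelm Ostwald",)),
    Episode("Director, Zeiss Ikon", "Dresden", 1926, 1933),
    Episode("Moved to Palestine", "Palestine", 1937, 1937),
    Episode("Died", "Tel Aviv", 1970, 1970),
]


def episodes_in(episodes, year):
    """Episodes that were under way during the given year."""
    return [e for e in episodes if e.begin <= year <= e.end]


def people_linked(episodes):
    """All other people mentioned, in order of first appearance."""
    seen = []
    for episode in episodes:
        for person in episode.who_else:
            if person not in seen:
                seen.append(person)
    return seen
```

Each field is a hook into one of the other resources: `where` into a gazetteer, `begin` and `end` into a time period directory, and `who_else` into further biographical entries.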
48A new form of biographical dictionary would link
to all
Biographical Dictionary
Thesaurus/ Ontology
Texts
EVI
Gazetteers
captions
Maps/ Geo Data
Numeric datasets
Time Period Directory
Time lines, Chronologies
49A Metadata Infrastructure
50Acknowledgements
- Electronic Cultural Atlas Initiative project
- This work was partially supported by the
Institute of Museum and Library Services through
a National Leadership Grant for Libraries, award
number LG-02-04-0041-04, Oct 2004 - Sept 2006,
entitled "Supporting the Learner: What, Where,
When and Who". See http://ecai.org/imls2004
- Michael Buckland, Fred Gey, Vivien Petras, Matt
Meiske, Kim Carl
- Contact: ray_at_sims.berkeley.edu