Metadata as Infrastructure for Information Retrieval and Text Mining - PowerPoint PPT Presentation

Slides: 51
Provided by: ait4
Learn more at: https://ecai.org

Transcript and Presenter's Notes

1
Metadata as Infrastructure for Information
Retrieval and Text Mining
Prof. Ray R. Larson, University of California,
Berkeley, School of Information
2
Overview
  • Metadata as Infrastructure
  • What, Where, When and Who?
  • What are Entry Vocabulary Indexes?
  • Notion of an EVI
  • How are EVIs Built
  • Time Period Directories
  • Mining Metadata for new metadata

3
Metadata as Infrastructure
  • The difference between memorization and
    understanding lies in knowing the context and
    relationships of whatever is of interest. When
    setting out to learn about a new topic, a
    well-tested practice is to follow the traditional
    5Ws and the H: Who?, What?, When?, Where?,
    Why?, and How?

4
Metadata as Infrastructure
  • The reference collections of paper-based
    libraries provide a structured environment for
    resources, with encyclopedias and subject
    catalogs, gazetteers, chronologies, and
    biographical dictionaries, offering direct
    support for at least What, Where, When, and Who.
  • The digital environment does not yet provide an
    effective, and easily exploited, infrastructure
    comparable to the traditional reference library.

5
What?
  • Searching texts by topic, e.g. Dewey, LCSH, any
    subject index, or category scheme applied to
    documents.
  • Two kinds of mapping in every search:
  • Documents are assigned to topic categories, e.g.
    Dewey
  • Queries have to map to topic categories, e.g.
    Dewey's Relativ Index maps from ordinary
    words/phrases to Decimal Classification numbers.
  • Also mapping between topic systems, e.g. US
    Patent classification and International Patent
    Classification.

6
"What" searches involve mapping to controlled
vocabularies
[Diagram: Thesaurus/Ontology, Texts]
7
Start with a collection of documents.
8
Classify and index with a controlled vocabulary, or
use a pre-indexed collection.
9
Problem: Controlled vocabularies can be difficult
for people to use.
Example index term: pass mtr veh spark ign eng
10
Solution: Entry Level Vocabulary Indexes.
[Diagram: Index, EVI]
11
What and Entry Vocabulary Indexes
  • EVIs are a means of mapping from the user's
    vocabulary to the controlled vocabulary of a
    collection of documents

12
Building and Searching EVIs
13
Technical Details
For noun phrases
Internet DB indexed with a controlled vocabulary.
Building an Entry Vocabulary Module (EVI)
14
Association Measure
         C     ¬C
  t      a     b
 ¬t      c     d
Where t is the occurrence of a term and C is the
occurrence of a class in the training set
(¬ marks non-occurrence)
15
Association Measure
  • Maximum Likelihood ratio
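
The formula on this slide did not survive in the transcript. As a hedged illustration only, the sketch below computes the log-likelihood ratio statistic (G²) for a 2x2 term/class contingency table, a common form of this kind of association measure. It may differ in detail from the measure used in the original work. The cell layout is an assumption: a = term and class occur together, b = term without the class, c = class without the term, d = neither. The function name is hypothetical.

```python
import math


def log_likelihood_ratio(a, b, c, d):
    """G^2 statistic for a 2x2 contingency table of counts.

    Assumed layout (not confirmed by the slides):
        a: term present, class present    b: term present, class absent
        c: term absent,  class present    d: term absent,  class absent
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("contingency table is empty")
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))

    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing (0 * ln 0 is taken as 0)
                expected = row_totals[i] * col_totals[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


if __name__ == "__main__":
    # Term and class are independent here, so the score is (numerically) zero.
    print(log_likelihood_ratio(10, 10, 10, 10))
    # Term and class always co-occur, so the score is large.
    print(log_likelihood_ratio(20, 0, 0, 20))
```

Higher scores indicate that the term is stronger evidence for the class; an EVI would keep the highest-scoring term/class pairs.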

16
Alternatively
  • Because the evidence terms in EVIs can be
    considered a document, you can also use IR
    techniques and use the top-ranked classes for
    classification or query expansion
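
A minimal sketch of this alternative, under stated assumptions: all text indexed under one class is pooled into a single pseudo-document, and a query is scored against each pseudo-document with a simple TF-IDF weight. The function names, the weighting, and the training pairs are hypothetical illustrations, not the method or data from the original work.

```python
import math
from collections import Counter


def build_class_documents(training):
    """Pool the words of every training text into one bag per class label."""
    class_docs = {}
    for text, label in training:
        class_docs.setdefault(label, Counter()).update(text.lower().split())
    return class_docs


def rank_classes(query, class_docs, top_k=3):
    """Return up to top_k class labels, best match first."""
    n_classes = len(class_docs)
    doc_freq = Counter()
    for terms in class_docs.values():
        doc_freq.update(terms.keys())  # number of classes containing each word

    scores = {}
    for label, terms in class_docs.items():
        length = sum(terms.values())
        score = 0.0
        for word in query.lower().split():
            tf = terms.get(word, 0)
            if tf:
                idf = math.log((n_classes + 1) / doc_freq[word])
                score += (tf / length) * idf
        if score > 0:
            scores[label] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical training pairs: (document text, controlled-vocabulary class).
TRAINING = [
    ("automobile engine repair manual", "pass mtr veh spark ign eng"),
    ("car engines and automobile design", "pass mtr veh spark ign eng"),
    ("diesel locomotive engine maintenance", "locomotives"),
    ("history of rail transport", "locomotives"),
]
CLASS_DOCS = build_class_documents(TRAINING)

if __name__ == "__main__":
    print(rank_classes("automobile", CLASS_DOCS))
```

The top-ranked labels can be used directly as a classification, or added to the user's query as expansion terms.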

17
[Diagram: Digital library resources, Statistical association]
18
EVI example
User query: Automobile
EVI 1 → Index term: pass mtr veh spark ign eng
EVI 2 → Index term: automobiles OR internal
combustion engines
19
But why stop there?
[Diagram: Index, EVI]
20
Which EVI do I use?
[Diagram: several Index and EVI pairs]
21
EVI to EVIs
[Diagram: EVI2 alongside several Index and EVI pairs]
22
Why not treat language the same way?
23
It is also difficult to move between different
media forms
[Diagram: Thesaurus/Ontology, Texts, EVI, Numeric datasets]
24
Searching across data types
  • Different media can be linked indirectly via
    metadata, but often (e.g. for socio-economic
    numeric data series) you also need to specify
    WHERE to get correct results

25
But texts associated with numeric data can be
mapped as well
[Diagram: Thesaurus/Ontology, Texts, EVI, EVI, captions, Numeric datasets]
26
EVI to Numeric Data example
27
But there are also geographic dependencies
[Diagram: Thesaurus/Ontology, Texts, EVI, EVI, captions, Maps/Geo Data, Numeric datasets]
28
WHERE: Place names are problematic
  • Variant forms: St. Petersburg, Санкт-Петербург,
    Saint-Pétersbourg, . . .
  • Multiple names: Cluj, in Romania / Roumania /
    Rumania, is also called Klausenburg and
    Kolozsvar.
  • Name changes: Bombay → Mumbai.
  • Homographs: Vienna, VA, and Vienna, Austria;
    50 Springfields.
  • Anachronisms: No Germany before 1870
  • Vague, e.g. Midwest, Silicon Valley
  • Unstable boundaries: 19th century Poland,
    Balkans, USSR
  • Use a gazetteer!
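
As a rough illustration of what a gazetteer lookup buys you, the sketch below resolves variant and former names to a single entry and returns every candidate for a homograph. The `Place` structure, the `lookup` function, and the tiny in-memory list are hypothetical; the coordinates are approximate and for illustration only. A real gazetteer would hold far richer entries.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Place:
    name: str                 # preferred name
    within: str               # containing region, used to tell homographs apart
    lat: float                # approximate latitude (illustrative)
    lon: float                # approximate longitude (illustrative)
    variants: Tuple[str, ...] = ()


# Hypothetical mini-gazetteer built from the examples on the slide.
GAZETTEER = [
    Place("St. Petersburg", "Russia", 59.94, 30.31,
          ("Санкт-Петербург", "Saint-Pétersbourg")),
    Place("Mumbai", "India", 19.08, 72.88, ("Bombay",)),
    Place("Vienna", "Austria", 48.21, 16.37, ("Wien",)),
    Place("Vienna", "Virginia, USA", 38.90, -77.27),
]


def lookup(name):
    """Return every gazetteer entry whose preferred or variant name matches."""
    key = name.casefold()
    return [
        place for place in GAZETTEER
        if key == place.name.casefold()
        or key in (variant.casefold() for variant in place.variants)
    ]


if __name__ == "__main__":
    for query in ("Bombay", "Vienna", "Atlantis"):
        print(query, "->", [(p.name, p.within) for p in lookup(query)])
```

A homograph such as "Vienna" returns more than one candidate, so the calling application still has to disambiguate, for example by asking the user or by using other places mentioned nearby.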

29
WHERE: Geo-temporal search interface. Place
names found in documents. Gazetteer provided
latitude and longitude. Places displayed on map.
Timebar?
30
Zoom on map. Click on place for a list of
records. Click on record to display text.
31
Catalogs and gazetteers should talk to each other!
Catalog search
Gazetteer search
Geographic sort / display of catalog search
result.
32
So geographic search becomes part of the
infrastructure
[Diagram: Thesaurus/Ontology, Texts, EVI, Gazetteers, captions, Maps/Geo Data, Numeric datasets]
33
WHEN: Search by time is also weakly supported
  • Calendars are the standard for time
  • But people use the names of events to refer to
    time periods
  • Named time periods resemble place names in being:
  • Unstable: European War, Great War, First World
    War
  • Multiple: Second World War, Great Patriotic War
  • Ambiguous: Civil war in different centuries in
    England, USA, Spain, etc.
  • Places have temporal aspects; periods have
    geographical aspects. When the Stone Age was
    varies by region.

34
Similarity between place names and period names
  • Suggests a similar solution: a gazetteer-like
    Time Period Directory.
  • Gazetteer:
    Place name | Type | Spatial markers (Lat, long) | When
  • Time Period Directory:
    Period name | Type | Time markers (Calendar) | Where
  • Note the symmetry in the connections between
    Where and When.
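
To make the gazetteer analogy concrete, here is a small sketch of a directory lookup in which an ambiguous period name resolves to several calendar ranges, each tied to a place. The `PeriodEntry` structure and both functions are hypothetical; the three entries are the civil wars of England, the USA, and Spain with their commonly cited dates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodEntry:
    name: str          # period name
    period_type: str   # type, e.g. "Period of conflict"
    begin: int         # time markers: first calendar year
    end: int           # time markers: last calendar year
    where: str         # associated place


# Hypothetical mini-directory for one ambiguous name.
DIRECTORY = [
    PeriodEntry("Civil War", "Period of conflict", 1642, 1651, "England"),
    PeriodEntry("Civil War", "Period of conflict", 1861, 1865, "USA"),
    PeriodEntry("Civil War", "Period of conflict", 1936, 1939, "Spain"),
]


def resolve(name, where=None):
    """Map a period name (optionally narrowed by place) to directory entries."""
    key = name.casefold()
    return [
        entry for entry in DIRECTORY
        if entry.name.casefold() == key
        and (where is None or entry.where.casefold() == where.casefold())
    ]


def periods_in_year(year):
    """Reverse lookup: which named periods cover a given calendar year?"""
    return [entry for entry in DIRECTORY if entry.begin <= year <= entry.end]


if __name__ == "__main__":
    print([(e.where, e.begin, e.end) for e in resolve("Civil War")])
    print([(e.where, e.begin, e.end) for e in resolve("Civil War", "Spain")])
    print([e.where for e in periods_in_year(1863)])
```

Supplying the place narrows the name to one range, and supplying a year finds the period, which is the Where/When symmetry noted above.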

35
Solution - Time Period Directories
  • Initial development involved mining the Library
    of Congress Subject Authority file for named time
    periods

36
LC MARC Authorities Records
<USMARC>
<Fld001>sh 00000613</Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
<Fld670><a>Work cat. 45053442 Besselmeier, S. Warhafftige history vnd beschreibung des Magdeburgischen Kriegs, 1552.</a></Fld670>
<Fld670><a>Cath. encyc.</a><b>(Magdeburg besieged (1550-51) by the Margrave Maurice of Saxony)</b></Fld670>
<Fld670><a>Ox. encyc. reformation</a><b>(Magdeburg ... during the 1550-1551 siege of Magdeburg ...)</b></Fld670>
</USMARC>
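
A hedged sketch of how a named time period might be mined from such a record, assuming the record is available as well-formed XML using the element names shown on the slide (USMARC, Fld001, Fld151 with subfields a and y). The function name, the regular expression, and the output dictionary are hypothetical, and real authority headings use many more date patterns than this one.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of the record on the slide, with angle brackets restored.
RECORD = """<USMARC>
<Fld001>sh 00000613</Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
</USMARC>"""

# Matches chronological subdivisions such as "Siege, 1550-1551" or "Siege, 1550".
PERIOD_RE = re.compile(r"^(?P<name>.+?),\s*(?P<begin>\d{4})(?:-(?P<end>\d{4}))?$")


def extract_period(record_xml):
    """Return a dict describing the named period, or None if none is found."""
    root = ET.fromstring(record_xml)
    heading = root.find("Fld151")
    if heading is None:
        return None
    chronological = heading.findtext("y")
    if chronological is None:
        return None
    match = PERIOD_RE.match(chronological.strip())
    if match is None:
        return None
    begin = int(match.group("begin"))
    end = int(match.group("end") or begin)  # a single year is its own end
    return {
        "record_id": root.findtext("Fld001", "").strip(),
        "place": heading.findtext("a"),
        "period_name": match.group("name"),
        "begin": begin,
        "end": end,
    }


if __name__ == "__main__":
    print(extract_period(RECORD))
```

The place in subfield a gives the Where, and the dates in subfield y give the When, so one heading yields both halves of a directory entry.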
37
timePeriodEntry: Time Period Directory instance. Contains the components described below.
- periodID: Unique identifier.
- periodName: Period name; can be repeated for alternative names. Information about language, script, transliteration scheme. Source information and notes (where the period name was mentioned).
- descriptiveNotes: Description of the time period.
- dates: Calendar and date format. Begin and end dates (exact, earliest, latest, most-likely, advocated-by-source, ongoing). Notes, sources.
- periodClassification: Period type, e.g. Period of Conflict, Art movement. Can plug in different classification schemes. Can be repeated for several classifications.
- location: Places associated with the time period. Contains both the place name and an entry to a gazetteer providing more specific place information, such as latitude / longitude coordinates. Can plug in different location indicators (e.g. ADL gazetteer, Getty Thesaurus of Geographic Names). Recently added: coordinates for direct use.
- relatedPeriod: Related time periods. periodID of related periods. Information about relationship type (part-of, successor, etc.). Can plug in different relationship type schemes.
- entryMetadata: Notes about creator / creation of the instance. Entry date. Modification date.
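
One way to picture the components listed above is as nested record types. The sketch below is an assumption-laden illustration, not the project's actual schema: the class and field names are Python renderings of the component names, the types are guesses, and the example values (including reusing the LC record number as a periodID) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PeriodName:
    name: str
    language: Optional[str] = None
    source: Optional[str] = None      # where the name was mentioned


@dataclass
class Dates:
    calendar: str                     # e.g. "Gregorian"
    begin: str
    end: str
    qualifier: str = "exact"          # exact, earliest, latest, most-likely, ...
    notes: Optional[str] = None


@dataclass
class Location:
    place_name: str
    gazetteer_id: Optional[str] = None   # pointer into an external gazetteer
    lat: Optional[float] = None          # optional coordinates for direct use
    lon: Optional[float] = None


@dataclass
class RelatedPeriod:
    period_id: str
    relationship: str                 # e.g. "part-of", "successor"


@dataclass
class TimePeriodEntry:
    period_id: str
    period_names: List[PeriodName]
    dates: Dates
    descriptive_notes: str = ""
    classifications: List[str] = field(default_factory=list)
    locations: List[Location] = field(default_factory=list)
    related_periods: List[RelatedPeriod] = field(default_factory=list)
    entry_metadata: Dict[str, str] = field(default_factory=dict)


# Hypothetical instance based on the Magdeburg authority record.
SIEGE = TimePeriodEntry(
    period_id="sh 00000613",
    period_names=[PeriodName("Siege of Magdeburg", language="en",
                             source="LC subject authority file")],
    dates=Dates(calendar="Gregorian", begin="1550", end="1551"),
    classifications=["Period of Conflict"],
    locations=[Location("Magdeburg (Germany)")],
)

if __name__ == "__main__":
    print(SIEGE.period_names[0].name, SIEGE.dates.begin, SIEGE.dates.end)
```

Repeatable components are modeled as lists, which is how alternative names, several classifications, and multiple associated places would fit in one entry.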
38
(No Transcript)
39
Time periods by named location
40
Catalog Search Result
41
Web Interface - Access by map
42
Zoomable interface gives access to geographically
focused info
43
Web Interface - Access by timeline
Link initiates search of the Library of Congress
catalog for all records relating to this time
period.
44
WHEN and WHAT
  • These named time periods are derived from Library
    of Congress catalog subject headings and so can
    be used for catalog searching, which finds books
    on topics important for that time period

45
Time period directories link via the place (or
time)
[Diagram: Thesaurus/Ontology, Texts, EVI, Gazetteers, captions, Maps/Geo Data, Numeric datasets, Time Period Directory, Time lines, Chronologies]
46
WHEN, WHERE and WHO
  • Catalog records found from a time period search
    commonly include names of persons important at
    that time. Their names can be forwarded to, e.g.,
    biographies in the Wikipedia encyclopedia.

47
Place and time are broadly important across
numerous tools and genres including, e.g.,
language atlases, library catalogs, biographical
dictionaries, bibliographies, archival finding
aids, museum records, etc.
Biographical dictionaries are heavy on place and time:
Emanuel Goldberg. Born Moscow 1881. PhD under
Wilhelm Ostwald, Univ. of Leipzig, 1906.
Director, Zeiss Ikon, Dresden, 1926-33. Moved to
Palestine 1937. Died Tel Aviv, 1970.
Life as a series of episodes involving Activity
(WHAT), WHERE, WHEN, and WHO else.
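
The "life as a series of episodes" idea can be sketched as a list of small records, each carrying the four facets. The `Episode` structure and helper functions are hypothetical; the data are only the facts stated on the slide, with single-year events stored as a one-year range.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Episode:
    what: str                         # activity
    where: str                        # place
    begin: int                        # when: first year
    end: int                          # when: last year
    who_else: Tuple[str, ...] = ()    # other people involved


# Episodes for Emanuel Goldberg, taken from the slide.
GOLDBERG = [
    Episode("Born", "Moscow", 1881, 1881),
    Episode("PhD, Univ. of Leipzig", "Leipzig", 1906, 1906, ("Wilhelm Ostwald",)),
    Episode("Director, Zeiss Ikon", "Dresden", 1926, 1933),
    Episode("Moved to Palestine", "Palestine", 1937, 1937),
    Episode("Died", "Tel Aviv", 1970, 1970),
]


def episodes_in_year(episodes, year):
    """WHEN query: episodes whose range covers the given year."""
    return [e for e in episodes if e.begin <= year <= e.end]


def episodes_at(episodes, place):
    """WHERE query: episodes that took place at the given place."""
    return [e for e in episodes if e.where.casefold() == place.casefold()]


if __name__ == "__main__":
    print([e.what for e in episodes_in_year(GOLDBERG, 1930)])
    print([e.who_else for e in episodes_at(GOLDBERG, "Leipzig")])
```

Each facet is then a natural link target: the place to a gazetteer, the years to a time period directory, and the other people to their own biographical entries.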
48
A new form of biographical dictionary would link
to all
[Diagram: Biographical Dictionary, Thesaurus/Ontology, Texts, EVI, Gazetteers, captions, Maps/Geo Data, Numeric datasets, Time Period Directory, Time lines, Chronologies]
49
A Metadata Infrastructure
50
Acknowledgements
  • Electronic Cultural Atlas Initiative project
  • This work was partially supported by the
    Institute of Museum and Library Services through
    a National Leadership Grant for Libraries, award
    number LG-02-04-0041-04, Oct 2004 - Sept 2006
    entitled "Supporting the Learner: What, Where,
    When and Who". See http://ecai.org/imls2004
  • Michael Buckland, Fred Gey, Vivien Petras, Matt
    Meiske, Kim Carl
  • Contact: ray_at_sims.berkeley.edu