Lecture 8: Clustering

1
Lecture 8: Clustering
Automatic Classification
  • University of California, Berkeley
  • School of Information
  • IS 245: Organization of Information in Collections

Some slides in this lecture were originally
created by Prof. Marti Hearst
2
Overview
  • Introduction to Automatic Classification and
    Clustering
  • Classification of Classification Methods
  • Classification Clusters and Information Retrieval
    in Cheshire II
  • The 4W project revisited

3
Classification
  • The grouping together of items (including
    documents or their representations) which are
    then treated as a unit. The groupings may be
    predefined or generated algorithmically. The
    process itself may be manual or automated.
  • In document classification the items are grouped
    together because they are likely to be wanted
    together
  • For example, items about the same topic.

4
Automatic Indexing and Classification
  • Automatic indexing is typically the simple
    deriving of keywords from a document and
    providing access to all of those words.
  • More complex Automatic Indexing Systems attempt
    to select controlled vocabulary terms based on
    terms in the document.
  • Automatic classification attempts to
    automatically group similar documents using
    either:
      • A fully automatic clustering method.
      • An established classification scheme and set of
        documents already indexed by that scheme.

5
Background and Origins
  • Early suggestion by Fairthorne
      • "The Mathematics of Classification"
  • Early experiments by Maron (1961) and Borko and
    Bernick (1963)
  • Work in Numerical Taxonomy and its application to
    information retrieval: Jardine, Sibson, van
    Rijsbergen, Salton (1970s).
  • Early IR clustering work more concerned with
    efficiency issues than semantic issues.

6
Cluster Hypothesis
  • The basic notion behind the use of classification
    and clustering methods:
  • "Closely associated documents tend to be relevant
    to the same requests."
      • C.J. van Rijsbergen

7
Classification of Classification Methods
  • Class Structure
      • Intellectually Formulated
          • Manual assignment (e.g. Library classification)
          • Automatic assignment (e.g. Cheshire
            Classification Mapping)
      • Automatically derived from collection of items
          • Hierarchic Clustering Methods (e.g. Single Link)
          • Agglomerative Clustering Methods (e.g. Dattola)
          • Hybrid Methods (e.g. Query Clustering)

8
Classification of Classification Methods
  • Relationship between properties and classes
      • monothetic
      • polythetic
  • Relation between objects and classes
      • exclusive
      • overlapping
  • Relation between classes and classes
      • ordered
      • unordered

Adapted from Sparck Jones
9
Properties and Classes
  • Monothetic
      • Class defined by a set of properties that are
        both necessary and sufficient for membership in
        the class
  • Polythetic
      • Class defined by a set of properties such that,
        to be a member of the class, an individual must
        have some number (usually large) of those
        properties, a large number of individuals in
        the class possess some of those properties, and
        no individual possesses all of the properties.

10
Monothetic vs. Polythetic
[Figure: examples of a polythetic class and a monothetic class. Adapted from van Rijsbergen, 1979]
11
Exclusive Vs. Overlapping
  • Exclusive: an item can belong to only a single
    class
  • Overlapping: items can belong to many classes,
    sometimes with a membership weight

12
Ordered Vs. Unordered
  • Ordered classes have some sort of structure
    imposed on them
  • Hierarchies are typical of ordered classes
  • Unordered classes have no imposed precedence or
    structure and each class is considered on the
    same level
  • Typical in agglomerative methods

13
Clustering Methods
  • Hierarchical
  • Agglomerative
  • Hybrid
  • Automatic Class Assignment

14
Coefficients of Association
  • Simple
  • Dice's coefficient
  • Jaccard's coefficient
  • Cosine coefficient
  • Overlap coefficient
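
As a rough illustration of the coefficients listed above, the sketch below uses the common set-based definitions, treating each document as a set of index terms. The slide does not give formulas, so these definitions, the function names, and the two example term sets are assumptions made for illustration. Both sets are assumed to be non-empty.

```python
import math

def simple(x: set, y: set) -> int:
    """Simple coefficient: number of shared terms."""
    return len(x & y)

def dice(x: set, y: set) -> float:
    """Dice's coefficient: overlap relative to the summed set sizes."""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x: set, y: set) -> float:
    """Jaccard's coefficient: overlap relative to the union."""
    return len(x & y) / len(x | y)

def cosine(x: set, y: set) -> float:
    """Cosine coefficient for binary term sets."""
    return len(x & y) / math.sqrt(len(x) * len(y))

def overlap(x: set, y: set) -> float:
    """Overlap coefficient: overlap relative to the smaller set."""
    return len(x & y) / min(len(x), len(y))

# Hypothetical term sets for two documents.
doc_a = {"cluster", "document", "retrieval", "index"}
doc_b = {"cluster", "document", "query"}

for name, fn in [("simple", simple), ("dice", dice), ("jaccard", jaccard),
                 ("cosine", cosine), ("overlap", overlap)]:
    print(f"{name:8s} {fn(doc_a, doc_b):.3f}")
```

All but the simple coefficient are normalized to the range 0 to 1, which is what makes them comparable across documents of different lengths.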

15
Hierarchical Methods
[Figure: single link dissimilarity matrix]
Hierarchical methods: polythetic, usually
exclusive, ordered. Clusters are order-independent.
16
Threshold .1
Single Link Dissimilarity Matrix
17
Threshold .2
18
Threshold .3
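
The threshold slides can be read as cutting a dissimilarity matrix at increasing levels: at each threshold, single link puts two items in the same cluster whenever a chain of pairs with dissimilarity at or below the threshold connects them. The sketch below is one minimal way to do that with a union-find structure. The matrix values are invented for illustration, since the matrix from the slides is not in the transcript.

```python
def single_link_clusters(dissim, threshold):
    """Group items connected by chains of pairs with dissimilarity <= threshold."""
    n = len(dissim)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dissim[i][j] <= threshold:
                parent[find(i)] = find(j)   # merge the two clusters

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical symmetric dissimilarity matrix for five items (0..4).
D = [
    [0.0, 0.1, 0.4, 0.5, 0.6],
    [0.1, 0.0, 0.2, 0.5, 0.6],
    [0.4, 0.2, 0.0, 0.3, 0.5],
    [0.5, 0.5, 0.3, 0.0, 0.4],
    [0.6, 0.6, 0.5, 0.4, 0.0],
]

for t in (0.1, 0.2, 0.3):
    print(t, single_link_clusters(D, t))
```

Because each higher threshold only adds links, the clusters at one level are always nested inside the clusters at the next, which is what gives the method its hierarchy.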
19
Clustering
Agglomerative methods: polythetic, exclusive or
overlapping, unordered. Clusters are
order-dependent.
[Figure: documents (Doc) grouped around cluster centers]
Rocchio's method (similar to current K-means
methods):
1. Select initial centers (i.e. seed the space).
2. Assign docs to highest matching centers and
   compute centroids.
3. Reassign all documents to centroid(s).
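
The three steps can be sketched in a K-means style as follows. This is an illustrative sketch and not Rocchio's original algorithm: the cosine match function, the repeat-until-stable loop, the exclusive (single-cluster) assignment, and the toy term-weight vectors are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def assign(docs, centers):
    """Index of the highest matching center for each document."""
    return [max(range(len(centers)), key=lambda c: cosine(d, centers[c]))
            for d in docs]

def cluster(docs, seeds, max_iter=10):
    centers = [list(s) for s in seeds]        # step 1: seed the space
    labels = assign(docs, centers)            # step 2: assign to best center
    for _ in range(max_iter):
        for c in range(len(centers)):         # step 2 (cont.): compute centroids
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
        new_labels = assign(docs, centers)    # step 3: reassign to centroids
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

# Hypothetical documents over a four-term vocabulary.
docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1], [1, 1, 0, 1]]
labels, centers = cluster(docs, seeds=[docs[0], docs[2]])
print(labels)
```

Choosing different seeds can produce different clusters, which is one way to see why the slide describes these methods as order-dependent.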
20
Automatic Class Assignment
Automatic class assignment: polythetic, exclusive
or overlapping, usually ordered. Clusters are
order-independent, usually based on an
intellectually derived scheme.
21
Automatic Categorization in Cheshire II
  • The Cheshire II system is intended to provide a
    bridge between the purely bibliographic realm of
    previous generations of online catalogs and the
    rapidly expanding realm of full-text and
    multimedia information resources. It is currently
    used in the UC Berkeley Digital Library Project
    and for a number of other sites and projects.

22
Overview of Cheshire II
  • It supports SGML as the primary database type.
  • It is a client/server application.
  • Uses the Z39.50 Information Retrieval Protocol.
  • Supports Boolean searching of all servers.
  • Supports probabilistic ranked retrieval in the
    Cheshire search engine.
  • Supports "nearest neighbor" searches, relevance
    feedback and Two-Stage Search.
  • GUI interface on X window displays (Tcl/Tk).
  • HTTP/CGI interface for the Web (Tcl scripting).

23
Cheshire II - Cluster Generation
  • Define basis for clustering records.
      • Select field to form the basis of the cluster.
      • Select evidence fields to use as contents of the
        pseudo-documents.
  • During indexing, cluster keys are generated with
    basis and evidence from each record.
  • Cluster keys are sorted and merged on basis, and
    pseudo-documents are created for each unique basis
    element containing all evidence fields.
  • Pseudo-documents (class clusters) are indexed on
    combined evidence fields.
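
The cluster-generation steps above can be sketched roughly as below. This is not Cheshire II code: the record layout, field names, function names, and sample records are hypothetical, and the "index" is just a term-to-class lookup table built for illustration.

```python
from collections import defaultdict

def build_pseudo_documents(records, basis_field, evidence_fields):
    """Generate (basis, evidence) cluster keys, then sort and merge them on basis."""
    keys = []
    for rec in records:
        basis = rec.get(basis_field)
        if not basis:
            continue
        evidence = []
        for name in evidence_fields:
            value = rec.get(name, [])
            evidence.extend([value] if isinstance(value, str) else value)
        keys.append((basis, evidence))

    keys.sort(key=lambda k: k[0])             # sort cluster keys on basis
    pseudo_docs = defaultdict(list)
    for basis, evidence in keys:              # merge: one pseudo-document per basis
        pseudo_docs[basis].extend(evidence)
    return dict(pseudo_docs)

def index_pseudo_documents(pseudo_docs):
    """Index each pseudo-document on the terms of its combined evidence."""
    index = defaultdict(set)
    for basis, evidence in pseudo_docs.items():
        for text in evidence:
            for term in text.lower().split():
                index[term].add(basis)
    return index

# Hypothetical catalog records; "lc_class" is the basis field.
records = [
    {"lc_class": "QA76", "title": "Introduction to algorithms",
     "subjects": ["Computer algorithms"]},
    {"lc_class": "Z699", "title": "Information retrieval systems",
     "subjects": ["Information storage and retrieval systems"]},
    {"lc_class": "QA76", "title": "Programming in Tcl",
     "subjects": ["Computer programming"]},
]

pseudo_docs = build_pseudo_documents(records, "lc_class", ["title", "subjects"])
index = index_pseudo_documents(pseudo_docs)
print(sorted(index["retrieval"]))
```

A query term can then be matched against the class pseudo-documents first, which is the idea the two-stage retrieval slide builds on.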

24
Cheshire II - Two-Stage Retrieval
  • Using the LC Classification System
  • Pseudo-Document created for each LC class
    containing terms derived from content-rich
    portions of documents in that class (subject
    headings, titles, etc.)
  • Permits searching by any term in the class
  • Ranked Probabilistic retrieval techniques attempt
    to present the Best Matches to a query first.
  • User selects classes to feed back for the second
    stage search of documents.
  • Can be used with any classified/Indexed
    collection.

25
Probabilistic Retrieval: Logistic Regression
  • Estimates for relevance based on log-linear model
    with various statistical measures of document
    content as independent variables.

Log odds of relevance is a linear function of
attributes.
Term contributions are summed.
Probability of relevance is recovered from the log
odds by the inverse (logistic) transformation.
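
The slide's equations are not in the transcript, so the following is only a generic sketch of the relationship described: a linear log-odds model whose output is converted back to a probability with the logistic function. The intercept, coefficients, and attribute values are made-up placeholders, not the fitted Cheshire II values.

```python
import math

def log_odds(attributes, coefficients, intercept):
    """Log odds of relevance as a linear function of the attributes."""
    return intercept + sum(c * x for c, x in zip(coefficients, attributes))

def probability_of_relevance(attributes, coefficients, intercept):
    """Invert the log odds with the logistic function to get a probability."""
    z = log_odds(attributes, coefficients, intercept)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical two-attribute model, for illustration only.
intercept = -1.0
coefficients = [0.5, -0.2]
attributes = [2.0, 1.0]

p = probability_of_relevance(attributes, coefficients, intercept)
print(round(p, 4))
```

Documents can be ranked directly by this probability, or equivalently by the log odds, since the logistic function is monotonic.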
26
Probabilistic Retrieval: Logistic Regression
In Cheshire II, probability of relevance is based
on logistic regression from a sample set of TREC
documents to determine values of the
coefficients. At retrieval, the probability
estimate is obtained by
[equation not included in transcript]
for 6 attributes, or "clues", about term usage in
the documents and the query.
27
Probabilistic Retrieval: Logistic Regression
attributes
  • Average Absolute Query Frequency
  • Query Length
  • Average Absolute Document Frequency
  • Document Length
  • Average Inverse Document Frequency
  • Inverse Document Frequency
  • Number of Terms in common between query and
    document (M) -- logged
28
Cheshire II Demo
  • Examples from the:
      • SciMentor (BioSearch) project
          • Journal of Biological Chemistry and MEDLINE data
      • CHESTER (EconLit)
          • Journal of Economic Literature subjects
      • Unfamiliar Metadata TIDES Projects
          • Basis for clusters is a normalized Library of
            Congress Class Number
          • Evidence is provided by terms from record titles
            (and subject headings for the All Languages set)
          • Five different training sets (Russian, German,
            French, Spanish, and All Languages)
          • Testing cross-language retrieval and
            classification
      • 4W Project Search

29
References
  • Christian Plaunt & Barbara Norgard, "An
    Association Based Method for Automatic Indexing
    with a Controlled Vocabulary." JASIS 49(10),
    1998.
      • Preprint available on class web site
  • Ray R. Larson, Jerome McDonough, Lucy Kuntz, Paul
    O'Leary & Ralph Moon, "Cheshire II: Designing a
    Next-Generation Online Catalog." JASIS 47(7),
    555-567, 1996.

30
Developing a Metadata Infrastructure for
Information Access: What, Where, When and Who?
  • Prof. Ray R. Larson
  • University of California, Berkeley
  • School of Information

31
Overview
  • Metadata as Infrastructure
  • What, Where, When and Who?
  • What are Entry Vocabulary Indexes?
  • Notion of an EVI
  • How are EVIs Built
  • Time Period Directories
  • Mining Metadata for new metadata

32
Metadata as Infrastructure
  • The difference between memorization and
    understanding lies in knowing the context and
    relationships of whatever is of interest. When
    setting out to learn about a new topic, a
    well-tested practice is to follow the traditional
    "5Ws and the H": Who?, What?, When?, Where?,
    Why?, and How?

33
Metadata as Infrastructure
  • The reference collections of paper-based
    libraries provide a structured environment for
    resources, with encyclopedias and subject
    catalogs, gazetteers, chronologies, and
    biographical dictionaries, offering direct
    support for at least What, Where, When, and Who.
  • The digital environment does not yet provide an
    effective, and easily exploited, infrastructure
    comparable to the traditional reference library.

34
What?
  • Searching texts by topic, e.g. Dewey, LCSH, any
    subject index, or category scheme applied to
    documents.
  • Two kinds of mapping in every search:
      • Documents are assigned to topic categories, e.g.
        Dewey
      • Queries have to map to topic categories, e.g.
        Dewey's Relativ Index from ordinary words/phrases
        to Decimal Classification numbers.
  • Also mapping between topic systems, e.g. US
    Patent classification and International Patent
    Classification.

35
WHAT searches involve mapping to controlled
vocabularies
[Diagram labels: Thesaurus/Ontology, Texts]
36
Start with a collection of documents.
37
Classify and index with controlled vocabulary Or
use a pre-indexed collection.
38
Problem: Controlled vocabularies can be difficult
for people to use.
Example index term: pass mtr veh spark ign eng
39
Solution: Entry Level Vocabulary Indexes.
[Diagram labels: Index, EVI]
40
What and Entry Vocabulary Indexes
  • EVIs are a means of mapping from the user's
    vocabulary to the controlled vocabulary of a
    collection of documents

41
Building and Searching EVIs
42
Technical Details
43
Association Measure
44
Association Measure
  • Maximum Likelihood ratio
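
The formula from the slide is not in the transcript. As a hedged illustration, the sketch below computes the standard log-likelihood ratio statistic (G²) for a 2x2 contingency table of term and class co-occurrence, which is one common way to score how strongly a word is associated with a controlled-vocabulary term. Whether this matches the exact measure on the slide is an assumption, and the counts are invented.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of document counts.

    k11: documents with the word and the class
    k12: documents with the word but not the class
    k21: documents with the class but not the word
    k22: documents with neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:                          # a zero cell contributes nothing
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2

# Hypothetical counts: word and class are independent vs. strongly associated.
independent = log_likelihood_ratio(10, 20, 30, 60)
associated = log_likelihood_ratio(30, 10, 10, 70)
print(independent, associated)
```

A score near zero means the word tells you nothing about the class. Larger scores indicate a stronger association in either direction, so the observed count should also be checked against the expected count before treating a high score as positive evidence for a class.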

45
Alternatively
  • Because the evidence terms in EVIs can be
    considered a document, you can also use IR
    techniques and use the top-ranked classes for
    classification or query expansion

46
(No Transcript)
47
EVI example
User query: Automobile
EVI 1 -> Index term: pass mtr veh spark ign eng
EVI 2 -> Index term: automobiles OR internal
combustible engines
48
But why stop there?
[Diagram labels: Index, EVI]
49
Which EVI do I use?
[Diagram labels: Index, EVI (repeated for several indexes)]
50
EVI to EVIs
[Diagram labels: Index, EVI (repeated for several indexes), EVI2]
51
Why not treat language the same way?
52
It is also difficult to move between different
media forms
[Diagram labels: Thesaurus/Ontology, Texts, EVI, Numeric datasets]
53
Searching across data types
  • Different media can be linked indirectly via
    metadata, but often (e.g. for socio-economic
    numeric data series) you also need to specify
    WHERE to get correct results

54
But texts associated with numeric data can be
mapped as well
[Diagram labels: Thesaurus/Ontology, Texts, EVI, EVI, captions, Numeric datasets]
55
EVI to Numeric Data example
56
But there are also geographic dependencies
[Diagram labels: Thesaurus/Ontology, Texts, EVI, EVI, captions, Maps/Geo Data, Numeric datasets]
57
WHERE: Place names are problematic
  • Variant forms: St. Petersburg, Санкт Петербург,
    Saint-Pétersbourg, . . .
  • Multiple names: Cluj, in Romania / Roumania /
    Rumania, is also called Klausenburg and
    Kolozsvar.
  • Name changes: Bombay → Mumbai.
  • Homographs: Vienna, VA, and Vienna, Austria
  • 50 Springfields.
  • Anachronisms: No Germany before 1870
  • Vague, e.g. Midwest, Silicon Valley
  • Unstable boundaries: 19th century Poland,
    Balkans, USSR
  • Use a gazetteer!
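
To make the gazetteer idea concrete, here is a toy lookup that maps variant and former names to gazetteer entries and returns every match for a homograph. The entry structure, function names, and normalization rules are assumptions for illustration, and the coordinates are approximate values included only to show the shape of the data.

```python
import unicodedata

def normalize(name: str) -> str:
    """Fold case, strip diacritics, and unify hyphens and spacing."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.casefold().replace("-", " ").split())

# Hypothetical mini-gazetteer; lat/lon are approximate.
ENTRIES = [
    {"preferred": "Cluj", "country": "Romania", "lat": 46.77, "lon": 23.60,
     "variants": ["Klausenburg", "Kolozsvár"]},
    {"preferred": "Mumbai", "country": "India", "lat": 19.08, "lon": 72.88,
     "variants": ["Bombay"]},
    {"preferred": "Vienna", "country": "Austria", "lat": 48.21, "lon": 16.37,
     "variants": ["Wien"]},
    {"preferred": "Vienna", "country": "USA (Virginia)", "lat": 38.90,
     "lon": -77.27, "variants": []},
]

def build_index(entries):
    """Map every normalized name (preferred or variant) to its entries."""
    index = {}
    for entry in entries:
        for name in [entry["preferred"], *entry["variants"]]:
            index.setdefault(normalize(name), []).append(entry)
    return index

INDEX = build_index(ENTRIES)

def lookup(name: str):
    """Return all gazetteer entries matching a place name (may be several)."""
    return INDEX.get(normalize(name), [])

print([e["preferred"] for e in lookup("Kolozsvar")])
print([e["country"] for e in lookup("Vienna")])
```

Returning a list rather than a single entry keeps homographs such as the two Viennas visible, so the caller (or the user, via a map) can disambiguate.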

58
WHERE: Geo-temporal search interface. Place
names found in documents. Gazetteer provided
lat./long. Places displayed on map.
Timebar?
59
Zoom on map. Click on place for a list of
records. Click on record to display text.
60
Catalogs and gazetteers should talk to each other!
Catalog search
Gazetteer search
Geographic sort / display of catalog search
result.
61
So geographic search becomes part of the
infrastructure
[Diagram labels: Thesaurus/Ontology, Texts, EVI, Gazetteers, captions, Maps/Geo Data, Numeric datasets]
62
WHEN: Search by time is also weakly supported
  • Calendars are the standard for time
  • But people use the names of events to refer to
    time periods
  • Named time periods resemble place names in being:
      • Unstable: European War, Great War, First World
        War
      • Multiple: Second World War, Great Patriotic War
      • Ambiguous: "Civil war" in different centuries in
        England, USA, Spain, etc.
  • Places have temporal aspects and periods have
    geographical aspects: when the Stone Age was
    varies by region

63
Similarity between place names and period names
  • Suggests a similar solution: a gazetteer-like
    Time Period Directory.
  • Gazetteer
      • Place name -- Type -- Spatial markers (lat. &
        long.) -- When
  • Time Period Directory
      • Period name -- Type -- Time markers (Calendar)
        -- Where
  • Note the symmetry in the connections between
    Where and When.

64
Solution - Time Period Directories
  • Initial development involved mining the Library
    of Congress Subject Authority file for named time
    periods

65
LC MARC Authorities Records
<USMARC>
<Fld001>sh 00000613 </Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
<Fld670><a>Work cat. 45053442 Besselmeier, S. Warhafftige history vnd beschreibung des Magdeburgischen Kriegs, 1552.</a></Fld670>
<Fld670><a>Cath. encyc.</a><b>(Magdeburg besieged (1550-51) by the Margrave Maurice of Saxony)</b></Fld670>
<Fld670><a>Ox. encyc. reformation</a><b>(Magdeburg ... during the 1550-1551 siege of Magdeburg ...)</b></Fld670>
</USMARC>
66
timePeriodEntry: Time Period Directory instance. Contains the components described below.
- periodID: Unique identifier
- periodName: Period name, can be repeated for alternative names
    - Information about language, script, transliteration scheme
    - Source information and notes (where was the period name mentioned)
- descriptiveNotes: Description of time period
- dates: Calendar and date format
    - Begin / end date (exact, earliest, latest, most-likely, advocated-by-source, ongoing)
    - Notes, sources
- periodClassification: Period type, e.g. Period of Conflict, Art movement
    - Can plug in different classification schemes
    - Can be repeated for several classifications
- location: Associated places with time period
    - Contains both place name and entry to a gazetteer providing more specific place information like latitude / longitude coordinates
    - Can plug in different location indicators (e.g. ADL gazetteer, Getty Thesaurus of Geographic Names)
    - Recently added coordinates for direct use
- relatedPeriod: Related time periods
    - periodID of related periods
    - Information about relationship type (part-of, successor etc.)
    - Can plug in different relationship type schemes
- entryMetadata: Notes about creator / creation of instance
    - Entry date
    - Modification date
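
One way to picture a Time Period Directory entry is as a small record type. The sketch below mirrors the components listed above using Python dataclasses. The class and attribute names, the simplified field types, and the identifier "tpd-0001" are assumptions for illustration and not the project's actual schema. The example values are taken from the Magdeburg authority record shown earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class PeriodDates:
    calendar: str                       # calendar and date format
    begin: str
    end: Optional[str] = None
    date_type: str = "exact"            # e.g. exact, earliest, latest, ongoing
    notes: str = ""

@dataclass
class Location:
    place_name: str
    gazetteer_id: Optional[str] = None  # entry in an external gazetteer
    latitude: Optional[float] = None
    longitude: Optional[float] = None

@dataclass
class RelatedPeriod:
    period_id: str
    relationship: str                   # e.g. part-of, successor

@dataclass
class TimePeriodEntry:
    period_id: str
    period_names: List[str]             # repeatable for alternative names
    descriptive_notes: str = ""
    dates: List[PeriodDates] = field(default_factory=list)
    classifications: List[str] = field(default_factory=list)
    locations: List[Location] = field(default_factory=list)
    related_periods: List[RelatedPeriod] = field(default_factory=list)
    entry_metadata: Dict[str, str] = field(default_factory=dict)

# Example built from the LC authority record for the siege of Magdeburg.
entry = TimePeriodEntry(
    period_id="tpd-0001",               # hypothetical identifier
    period_names=["Siege, 1550-1551"],
    descriptive_notes="Source: LC subject authority record sh 00000613",
    dates=[PeriodDates(calendar="unspecified", begin="1550", end="1551")],
    classifications=["Period of Conflict"],
    locations=[Location(place_name="Magdeburg (Germany)")],
)
print(entry.period_names[0], entry.locations[0].place_name)
```

Keeping locations and related periods as lists reflects the note that these components can be repeated and can point into different external schemes.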
67
(No Transcript)
68
Time periods by named location
69
Catalog Search Result
70
Web Interface - Access by map
71
Zoomable interface gives access to geographically
focused info
72
Web Interface - Access by timeline
Link initiates search of the Library of Congress
catalog for all records relating to this time
period.
73
WHEN and WHAT
  • These named time periods are derived from Library
    of Congress catalog subject headings and so can
    be used for catalog searching which finds books
    on topics important for that time period

74
Time period directories link via the place (or
time)
75
WHEN, WHERE and WHO
  • Catalog records found from a time period search
    commonly include names of persons important at
    that time. Their names can be forwarded to, e.g.,
    biographies in the Wikipedia encyclopedia.

76
Place and time are broadly important across
numerous tools and genres including, e.g.,
language atlases, library catalogs, biographical
dictionaries, bibliographies, archival finding
aids, museum records, etc.
Biographical dictionaries are heavy on place and
time:
Emanuel Goldberg. Born Moscow, 1881. PhD under
Wilhelm Ostwald, Univ. of Leipzig, 1906.
Director, Zeiss Ikon, Dresden, 1926-33. Moved to
Palestine, 1937. Died Tel Aviv, 1970.
Life as a series of episodes involving Activity
(WHAT), WHERE, WHEN, and WHO else.
77
A new form of biographical dictionary would link
to all
78
A Metadata Infrastructure
79
Acknowledgements
  • Electronic Cultural Atlas Initiative project
  • This work was partially supported by the
    Institute of Museum and Library Services through
    a National Leadership Grant for Libraries, award
    number LG-02-04-0041-04, Oct 2004 - Sept 2006,
    entitled "Supporting the Learner: What, Where,
    When and Who". See http://ecai.org/imls2004
  • Michael Buckland, Fred Gey, Vivien Petras, Matt
    Meiske, Kim Carl, Anya Kartavenko, Minakshi
    Mukherjee
  • Contact: ray_at_sims.berkeley.edu