1
meow06
Clustering, Classification, and Metadata Enhancement Techniques
July 24, 2006
Kat Hagedorn
David Newman
Bill Landis, ex officio
2
Clustering, Classification, and Metadata
Enhancement Techniques on OAI Records
  • Preprocessing and Topic Modeling
  • The Browser
  • Lessons Learned and Next Steps

3
Goals
  • Evaluate topical/subject-based metadata
    enhancement
  • Experiment on testbed of multiple OAI
    repositories
  • Discuss lessons learned and refine testing
  • Propose products and services

4
What We Did
[Diagram] Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
5
What We Did
[Diagram] Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
[Diagram] Classify: OAI records → preprocess (vocabulary) → topic model (classify) → 1. topics in records, 2. records in topics
6
What We Did
[Diagram] Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
Clustering is learning the topics.
[Diagram] Classify: OAI records → preprocess (vocabulary) → topic model (classify) → 1. topics in records, 2. records in topics
Classification is using the learned topics.
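The difference between clustering (learning the topics) and classification (using the learned topics) can be sketched in code. The slides do not name the specific topic model, so this sketch assumes an LDA-style model fitted by collapsed Gibbs sampling. The function name `gibbs`, the hyperparameters `alpha` and `beta`, and the toy documents are illustrative assumptions, not the presenters' implementation.

```python
import random

def gibbs(docs, n_topics, n_words, n_iter, word_topic=None,
          alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for an LDA-style topic model (assumed model).

    docs: list of documents, each a list of integer word ids.
    word_topic: if None, word-topic counts are learned (clustering);
    if given, they are held fixed and only the per-document topic
    counts are sampled (classification).
    Returns (word_topic, doc_topic) count tables.
    """
    rng = random.Random(seed)
    learn = word_topic is None
    if learn:
        word_topic = [[0] * n_topics for _ in range(n_words)]
    topic_total = [sum(word_topic[w][t] for w in range(n_words))
                   for t in range(n_topics)]
    doc_topic = [[0] * n_topics for _ in docs]
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(n_topics)
            z_d.append(t)
            doc_topic[d][t] += 1
            if learn:
                word_topic[w][t] += 1
                topic_total[t] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = assign[d][i]
                doc_topic[d][t] -= 1
                if learn:
                    word_topic[w][t] -= 1
                    topic_total[t] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (word_topic[w][k] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                if learn:
                    word_topic[w][t] += 1
                    topic_total[t] += 1
    return word_topic, doc_topic

# Cluster: learn topics from a toy sample (word ids 0..3).
sample = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1], [3, 2, 2]]
learned, _ = gibbs(sample, n_topics=2, n_words=4, n_iter=50)

# Classify: reuse the learned topics for a new record; `learned` is not modified.
_, topics_in_record = gibbs([[0, 1, 1, 0]], 2, 4, n_iter=20, word_topic=learned)
print(topics_in_record[0])
```

Holding the word-topic counts fixed is what would make classification cheaper than clustering in this sketch: only the per-record topic counts are resampled, and records can be processed independently.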
7
Repository Selection
  • Mix of cultural heritage repositories?
    • UMich, Library of Congress, CDL, State Lib of
      Victoria (Aust), ...
    • Average of 15 words per record (excl. stopwords)
    • Topics often specific to collection (e.g., State
      Lib of Victoria)
    • Experience with CDL's American West project
  • Mix of scientific/research repositories?
    • CiteSeer, arXiv, PubMed, ...
    • <description> is a reasonably reliable 200-word
      abstract
    • Average of 75 words per record
    • Topics more likely to span repositories
  • For purposes of evaluation, used (mostly)
    English-language repositories

8
Selected Repositories
Repositories harvested by UMich/OAIster, June 7, 2006.
9
Usage of Dublin Core Fields
  • Decided to use words from <title>, <description>,
    <subject> for clustering
  • Idiosyncrasies
    • CiteSeer repeats <author> and <title> in
      <subject>
    • CiteSeer puts citations to other IDs in
      <description>
    • arXiv puts, e.g., "Comment: 12 pages, PostScript"
      in <description>
    • RePEc: no <subject>, repeats ID in <description>
    • etc.
  • Approach: process all repositories identically,
    no special treatment
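As a minimal sketch of pulling the three chosen fields out of a harvested record, the following assumes records arrive as unqualified Dublin Core (`oai_dc`) XML. The function name `record_text` and the sample record are hypothetical; the same code is applied to every repository, in keeping with the "no special treatment" approach.

```python
import xml.etree.ElementTree as ET

# Namespace of the unqualified Dublin Core elements used by oai_dc.
DC = "{http://purl.org/dc/elements/1.1/}"

def record_text(xml_string):
    """Concatenate the text of all <title>, <description>, <subject> elements."""
    root = ET.fromstring(xml_string)
    parts = []
    for field in ("title", "description", "subject"):
        for element in root.iter(DC + field):
            if element.text:
                parts.append(element.text.strip())
    return " ".join(parts)

# Made-up sample record for illustration only.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Reinforcement Learning: A Survey</dc:title>
  <dc:creator>Leslie Pack Kaelbling</dc:creator>
  <dc:description>This paper surveys the field of reinforcement learning.</dc:description>
  <dc:subject>Machine learning</dc:subject>
</oai_dc:dc>"""

print(record_text(SAMPLE))
```

Other fields, such as `<creator>`, are ignored, although the CiteSeer idiosyncrasy noted above means author names can still enter through `<subject>`.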

10
Preprocessing Example
Original record:
<ID oai:CiteSeerPSU:44072> <title>Reinforcement
Learning: A Survey <description>This paper
surveys the field of reinforcement learning from
a computer-science perspective. It is written to
be accessible to researchers familiar with
machine learning. Both the historical basis of
the field and a broad selection of current work
are summarized. Reinforcement learning is the
problem faced by an agent that learns behavior
through trial-and-error interactions with a
dynamic environment. The work described here has
a resemblance to work in psychology, but differs
considerably in the details and in the use of the
word "reinforcement." <subject>Leslie Pack
Kaelbling, Michael Littman, Andrew Moore.
Reinforcement Learning: A Survey
↓ preprocess (vocabulary)
<ID oai:CiteSeerPSU:44072> reinforcement
learning survey survey field reinforcement
learning computer science perspective written
accessible researcher familiar machine learning
historical basis field broad selection current
summarized reinforcement learning faced agent
learn behavior trial error interaction dynamic
environment resemblance psychology differ
considerably detail word reinforcement leslie
pack kaelbling littman andrew moore reinforcement
learning survey
11
Stopwords and Stemming
  • Standard: and, the, ...
  • Research-related: research, paper, data, system,
    method, result, ...
  • Repository-specific: cern, citeseer, repec,
    Smith, ...
  • All tokens starting with a digit: 1996, 401k, ...
  • Produced stopword list of 500 words
  • Applied very simple stemming (cars → car)
  • Note: replacing collocations improves
    interpretability of topics, but not quality (los
    angeles → los_angeles)
  • Don't need to find and exclude all stopwords
    because the topic model will help find these (e.g.,
    des, les, une, ...); suppress them after the fact
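The rules above can be sketched as a small preprocessing function. This is an illustration under stated assumptions, not the presenters' code: the stopword set is a tiny sample of the 500-word list, the only stemming rule shown is stripping a trailing "s", and the names `STOPWORDS`, `COLLOCATIONS`, `simple_stem`, and `preprocess` are hypothetical.

```python
import re

# Tiny illustrative sample; the slides report a list of about 500 stopwords.
STOPWORDS = {"and", "the", "in", "from", "of", "a",
             "research", "paper", "data", "system", "method", "result",
             "cern", "citeseer", "repec"}

# Two-word collocations replaced by a single token.
COLLOCATIONS = {("los", "angeles"): "los_angeles"}

def simple_stem(token):
    """Very simple stemming: strip a plural 's' (cars -> car)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text):
    """Lowercase, tokenize, join collocations, drop digit tokens and stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COLLOCATIONS:
            kept.append(COLLOCATIONS[pair])
            i += 2
            continue
        token = tokens[i]
        i += 1
        if token[0].isdigit():      # e.g. 1996, 401k
            continue
        token = simple_stem(token)
        if token in STOPWORDS:      # checked after stemming, so "results" is dropped
            continue
        kept.append(token)
    return kept

print(preprocess("The cars in Los Angeles and 401k data from 1996"))
```

A crude plural-stripping rule like this one also turns "physics" into "physic", which resembles forms such as "astrophysic" and "mathematic" that appear in the topic word lists.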

12
Building Vocabulary
  • Preprocessed (sampled) repositories, excluded
    stopwords
  • Only kept words that occurred in more than 10
    records
  • Result: a final vocabulary with 90,000 words
  • Most frequent words: cell, high, energy, protein,
    function, algorithm, field, theory, physics, ...
  • Resulting discussion point: When do we need to
    re-create the vocabulary? (When classifying, new
    documents will be filtered through the existing
    vocabulary)
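A minimal sketch of the vocabulary step, assuming each record has already been preprocessed into a list of tokens. The function names `build_vocabulary` and `filter_through_vocabulary` and the toy records are hypothetical.

```python
from collections import Counter

def build_vocabulary(records, min_records=10):
    """Keep words that occur in more than `min_records` records."""
    record_frequency = Counter()
    for tokens in records:
        # Count each word once per record, however often it repeats.
        record_frequency.update(set(tokens))
    return {word for word, n in record_frequency.items() if n > min_records}

def filter_through_vocabulary(tokens, vocabulary):
    """At classification time, drop tokens missing from the existing vocabulary."""
    return [token for token in tokens if token in vocabulary]

# Toy data: "cell" is in 21 records, "protein" in 11, "rare" in only 10.
records = [["cell", "protein"]] * 11 + [["rare", "cell"]] * 10
vocabulary = build_vocabulary(records)
print(sorted(vocabulary))
print(filter_through_vocabulary(["rare", "cell", "unseen"], vocabulary))
```

Because new documents are filtered through the fixed vocabulary, words that only appear after the vocabulary was built are silently dropped, which is why the question of when to re-create the vocabulary arises.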

13
  • Average of 75 words per record
  • Bimodal because we used records with abstracts and
    records without abstracts
  • Topic model isn't adversely affected by very
    short records

14
Computation
  • Clustering (Learning)
    • D = 750,000 records
    • W = 90,000-word vocabulary
    • L = 75 words per record
    • T = 500 topics
    • iter = 500 iterations
    • memory ~ 3DL + T(D + W) ~ 3 GByte
    • time ~ D × L × T × iter ~ 3 days (3 GHz Xeon)
  • Classification
    • D = 3,000,000 records total
    • iter = 40 iterations
    • max memory = 2 GByte
    • max time = 5 hours (but repositories can run in
      parallel)

Decision point: How many topics?
Decision point: How many iterations?
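The clustering figures can be checked with back-of-the-envelope arithmetic. This sketch assumes the slide's scaling rules read as memory proportional to 3DL + T(D + W) and time proportional to D × L × T × iter (the operators did not survive in the transcript), and assumes 4 bytes per stored entry, which the slides do not state.

```python
# Clustering parameters from the slide.
D = 750_000      # records
W = 90_000       # vocabulary size
L = 75           # average words per record
T = 500          # topics
ITER = 500       # iterations

BYTES_PER_ENTRY = 4  # assumption: 32-bit integers

# Assumed reading of the memory rule: three values per token,
# plus topic-by-record and topic-by-word count tables.
token_entries = 3 * D * L
count_entries = T * (D + W)
memory_gb = (token_entries + count_entries) * BYTES_PER_ENTRY / 1e9

# Assumed reading of the time rule: one topic evaluation per token, topic, and iteration.
topic_updates = D * L * T * ITER
seconds_in_three_days = 3 * 24 * 3600
updates_per_second = topic_updates / seconds_in_three_days

print(f"entries: {token_entries + count_entries:,}")
print(f"memory:  {memory_gb:.2f} GB")
print(f"updates: {topic_updates:.3e} ({updates_per_second:.2e} per second over 3 days)")
```

Under these assumptions the estimate comes out somewhat below the quoted 3 GByte, which is consistent with the slide giving a rounded, order-of-magnitude figure. Both decision points (number of topics, number of iterations) enter the time estimate linearly.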
15
Broad Topical Categories
  • 500 topics is too many to look at
  • Need to organize topics under broad topical
    categories
  • Cluster the clusters (automatic)
  • Use pre-defined categories
  • Classify group of keywords (manual + automatic)
  • Create hierarchy by hand (manual)

16
Broad Topical Categories
[Diagram] Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
[Diagram] Cluster the clusters: topics → topic model (cluster/learn) → broad topical categories
17
Broad Topical Categories
[Diagram] Cluster: OAI records → preprocess (vocabulary) → topic model (cluster/learn) → topics
[Diagram] Cluster the clusters: topics → topic model (cluster/learn) → broad topical categories
[Diagram] Classify group of keywords: group of keywords → preprocess (vocabulary) → topic model (classify) → topics organized under broad topical categories
18
Clustering, Classification, and Metadata
Enhancement Techniques on OAI Records
  • Preprocessing and Topic Modeling
  • The Browser
  • Lessons Learned and Next Steps

19
The Browser
  • PHP/MySQL browser of 3 million OAI records
  • Preserving transparency for this audience
  • Browser not meant for end users
  • No search, no information architecture, etc.
  • http://yarra.calit2.uci.edu/meow/

Based on 750,000 sampled records from 9 repositories, 500 topics
20
The Browser: http://yarra.calit2.uci.edu/meow/
21
Selected Topics: Useful
  • t201   learning machine training learn
    algorithm task examples reinforcement inductive
    learned learner supervised unsupervised
  • t482   labor worker employment wage market
    labour job unemployment wages earning panel find
    evidence individual participation
  • t381   algebraic geometry mathematic
    conjecture varieties projective variety theory
    cohomology moduli curves prove genus rational
    give math
  • t097   dark matter universe astrophysic
    cosmological cosmic background density inflation
    spectrum power scale cmb halo cosmology
    gravitational
  • t027   hiv virus human immunodeficiency type
    envelope infection viral cd4 infected gag
    replication reverse aid tat gp120
  • t365   waste radioactive wastes tank nuclear
    facilities management hanford disposal fuel
    storage material processing facility site level
  • > show all 500 sub-topics (to see all 500 topics)

22
Selected Topics: Less Useful
  • t255   journal author chapter vol notes
    editor publication issue special bibliography
    reader references appendix literature submitted
    topic
  • t328   paul mark thank andrew scott stephen
    alan steven miller george martin obituaries
    thesis daniel prof ian
  • t384   supported part grant author foundation
    partially contract science national nsf support
    advanced ccr provided center agency
  • t112   look people difficult thing need want
    fact reason help understand think say alway try
    easy bad
  • t496   increase increased increases decrease
    increasing decreased decreases observed change
    decreasing significant caused decline
  • t012   des les dan une est par sur pour qui
    nous sont aux ces analyse pay cette
  • But junk topics alleviate the need to
    exhaustively find stopwords
  • Many useless words cluster as topics, which can be
    suppressed

The French-word topic (t012) is also very useful for filtering out French records
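Suppressing junk topics after the fact can be sketched as a filter over each record's topic proportions. This is an illustration only: it assumes each record carries a mapping from topic id to proportion, the 0.5 threshold is an arbitrary assumption, and the names `visible_topics` and `looks_french` are hypothetical. The topic ids are the "less useful" examples listed on this slide.

```python
# Topics judged "less useful" on this slide; suppressed in the display.
JUNK_TOPICS = {"t255", "t328", "t384", "t112", "t496", "t012"}
FRENCH_TOPIC = "t012"  # des les dan une est par sur pour ...

def visible_topics(record_topics):
    """Drop suppressed topics and renormalize the remaining proportions."""
    kept = {t: p for t, p in record_topics.items() if t not in JUNK_TOPICS}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {t: p / total for t, p in kept.items()}

def looks_french(record_topics, threshold=0.5):
    """Flag a record whose words fall mostly in the French-word topic."""
    return record_topics.get(FRENCH_TOPIC, 0.0) >= threshold

print(visible_topics({"t201": 0.6, "t255": 0.4}))
print(looks_french({"t012": 0.8, "t482": 0.2}))
```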
23
Broad Topical Categories (BTCs)
  • By clustering the clusters
    • worked well
    • e.g., mathematics, global energy resources, ...
    • can choose desired number of broad topical
      categories (e.g., 25) and thresholding
  • By classifying groups of keywords
    • worked well too
  • Then review and manually edit
    • include or exclude any subtopic

24
BTCs: Clustering the clusters
25
BTCs: Classifying group of keywords
  • >>> Aerospace Engineering
  • stars (15)
  • space (18)
  • aeronautics (20)
  • astronautics (20)
  • rocket (12)
  • shuttle (12)
  • exploration (15)
  • lander (3)
  • planets (7)
  • black holes (7)
  • quasars (7)
  • pulsars (7)
  • observatories (10)
  • air traffic (10)
  • aircraft (15)
  • aerospace (20)
  • airplanes (10)
  • airports (10)
  • heliports (10)
  • helicopters (10)
  • aviation (18)
  • FAA (7)
  • airlines (12)
  • flight (18)
  • comets (10)
  • meteorites (12)
  • spacecraft (15)
  • air force (7)
  • pilots (7)
  • jets (7)
  • air travel (15)
  • flying (18)

A domain expert specifies the list of relevant keywords, each with an (importance) weight
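To illustrate how a weighted keyword list could be matched against topics, the sketch below scores each topic by the total importance of the keywords found among its top words. This is a deliberately simpler stand-in, not the topic-model classification the presenters used; the function name `score_topics` and the shortened word lists are illustrative.

```python
def score_topics(keywords, topic_words):
    """Score topics against a weighted keyword list.

    keywords: {keyword: importance}
    topic_words: {topic_id: list of top words}
    Returns {topic_id: percent share of matched importance}, best first.
    """
    raw = {}
    for topic, words in topic_words.items():
        top = set(words)
        score = sum(weight for keyword, weight in keywords.items() if keyword in top)
        if score:
            raw[topic] = score
    total = sum(raw.values())
    ranked = sorted(raw.items(), key=lambda item: -item[1])
    return {topic: round(100 * score / total) for topic, score in ranked}

# A few keywords from the Aerospace Engineering list, with their importance.
keywords = {"flight": 18, "aircraft": 15, "space": 18, "planets": 7}

# Shortened top-word lists for three topics.
topic_words = {
    "t192": ["vehicle", "flight", "aircraft", "air"],
    "t191": ["space", "spaces", "hilbert"],
    "t482": ["labor", "worker"],
}

print(score_topics(keywords, topic_words))
```

Exact word matching has visible limits: "planets" would not match a stemmed topic word such as "planet" unless the keywords go through the same preprocessing, and an ambiguous keyword like "space" pulls in a mathematics topic, which is the kind of result a manual review step would catch.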
26
BTCs: Classifying group of keywords
  • >>> Aerospace Engineering
  • t192 (69) vehicle flight vehicles engine car
    road speed nasa aircraft air
  • t352 (13) star solar planet mass astrophysic
    binary dwarf orbital sun companion
  • t191 (8) space spaces hilbert subspace
    dimensional subspaces defined exploration linear
    point
  • >>> Dermatology
  • t388 (83) infection skin disease tract
    respiratory fever burgdorferi caused wound
    arthritis
  • t157 (8) cancer tumor p53 breast carcinoma
    survival human tumour malignant prostate
  • t071 (7) growth tuberculosis mycobacterium
    growing grow igf factor bcg avium
  • >>> Geology and Earth Sciences
  • t121 (73) geothermal rock seismic energy
    mountain drilling fluid survey spring yucca
  • t268 (12) sea atmospheric climate ice ocean
    atmosphere cloud global wind aerosol
  • >>> Molecular, Cellular and Developmental Biology
  • t276 (31) molecular biological sciences
    molecules biology molecule quantitative
    biochemistry basic
  • t417 (15) cell apoptosis cellular death
    cultured bcl lines hela transfected mediated
  • t355 (12) brain neuron neuronal cortex
    synaptic cortical rat nervous cerebral dopamine
  • t418 (9) genes genome gene repeat chromosome
    sequences dna genomic sequence region

in review, would delete this topic from this BTC
just found 1 topic relevant to transportation
27
Browse Records in a Topic
can navigate back to multiple BTCs
nice mix of repositories
28
Browse Records in a Topic: From one repository
display records just from Library of Congress
29
Sample Record
  • Murphy's Law in algebraic geometry: Badly-behaved
    deformation spaces
  • > preprocessed text
  • murphy law algebraic geometry badly behaved
    deformation spaces
  • consider question bad deformation space object
    answer priori reason deformation space bad moduli
    spaces
  • precisely singularity finite type smooth
    parameter hilbert scheme curves projective space
    moduli spaces
  • smooth projective type surfaces higher
    dimensional varieties plane curves nodes cusp
    stable sheaves
  • isolated threefold singularities object
    pathological fact nice curves smooth surfaces
    ample canonical bundle
  • stable sheaves torsion free rank singularities
    normal cohen macaulay justifies mumford
    philosophy moduli
  • spaces behaved object arbitrarily bad priori
    reason construct smooth curve projective space
    deformation
  • space component singularity type reduced behavior
    subschemes similarly give surface f_p lift course
    hold
  • holomorphic category difficult compute
    deformation spaces directly obstruction theories
    circumvent relating
  • tractable deformation spaces smooth morphism
    essential starting point mnev universality
    theorem
  • mathematic algebraic geometry mathematic complex
    variables
  • > top topics
  • t381 algebraic geometry mathematic conjecture
    varieties projective variety theory cohomology
    moduli curves prove genus rational give math
  • t191 space spaces

topics for this record
link to actual OAI record
30
Repository-specific Browsers
  • Library of Congress (http://yarra.calit2.uci.edu/oai/loc/)
  • University of Michigan (http://yarra.calit2.uci.edu/oai/umich/)
  • University of Washington (http://yarra.calit2.uci.edu/oai/uwash/)
  • African Journals Online (http://yarra.calit2.uci.edu/oai/africa/)
  • and many more

31
Clustering, Classification, and Metadata
Enhancement Techniques on OAI Records
  • Preprocessing and Topic Modeling
  • The Browser
  • Lessons Learned and Next Steps

32
Evaluation
  • Topic modeling worked well
    • Most topics were useful
    • Drain on computer resources was reasonable
    • Human effort was relatively small
    • All repositories processed identically, no
      special treatment
  • Strategy worked well
    • Clustering, then
    • Classification, and
    • Broad Topical Categories creation

33
Further Evaluation
  • Current processing only for
    • English-language repositories
    • Science/research based repositories
  • Need to test cultural heritage repositories and
    foreign-language records
    • Less consistent descriptive language and length
    • On-the-horse problem more prevalent
    • Greater need to individually process repositories
  • Also need usability testing to evaluate further
    • Depends on criteria: who are the users?
      • Librarians?
      • End-users?
    • Depends on products and services desired by users

34
Discussion Point: When to Re-cluster?
[Timeline diagram: a sequence of "classify" and "cluster" steps]
  • Need to re-cluster
    • when collection changes significantly
    • if there is a hole in topics
    • but NOT just because you have another repository
  • If you re-cluster
    • all topics will be different
    • have to discard hand-labeling
    • Broad Topical Categories might be different
  • However, classification is
    • cheap and easy
    • e.g., for OAIster, could re-classify every
      harvest until spring clean

35
Products and Services
  • Depending on users
  • What kind of service is useful?
  • What should interface to topics look/act like?
  • What kind of use should we envision?
  • We have some ideas

36
Archive of Topics
  • Are the topics we created useful to anyone else?
  • Scenario: librarian uses topics/classifier for
    local resources
  • To use locally you need
    • the preprocessor (i.e., the preprocessing rules)
    • the vocabulary (file of 90,000 words)
    • the topic model classifier

37
Subject Search/Browse for OAIster
  • Integrate topics into OAIster
    • add to records so can perform canned topic search
    • add to interface so can browse BTCs to records
  • Additionally, can allow users to find records
    similar to those retrieved
    • e.g., retrieved records on cosmology and can find
      similar records on astrophysics, relativity, ...
  • How to do this?
38
How To Reach Us
  • David Newman, University of California, Irvine
    <newman_at_uci.edu>
  • Kat Hagedorn, University of Michigan
    <khage_at_umich.edu>
  • Bill Landis, California Digital Library
    <bill.landis_at_ucop.edu>