Title: meow::06
1meow06
Kat Hagedorn
David Newman
Clustering, Classification, and Metadata
Enhancement Techniques
July 24, 2006
Bill Landis, ex officio
2Clustering, Classification, and Metadata
Enhancement Techniques on OAI Records
- Preprocessing and Topic Modeling
- The Browser
- Lessons Learned and Next Steps
3Goals
- Evaluate topical/subject-based metadata enhancement
- Experiment on testbed of multiple OAI repositories
- Discuss lessons learned and refine testing
- Propose products and services
4What We Did
Preprocessing and Topic Modeling >
[Diagram: Cluster. OAI records -> preprocess (with vocabulary) -> topic model (cluster/learn) -> topics]
5What We Did
[Diagram: Cluster. OAI records -> preprocess (with vocabulary) -> topic model (cluster/learn) -> topics]
[Diagram: Classify. OAI records -> preprocess (with vocabulary) -> topic model (classify) -> 1. topics in records, 2. records in topics]
6What We Did
[Diagram: Cluster. OAI records -> preprocess (with vocabulary) -> topic model (cluster/learn) -> topics]
clustering is learning the topics
[Diagram: Classify. OAI records -> preprocess (with vocabulary) -> topic model (classify) -> 1. topics in records, 2. records in topics]
classification is using the learned topics
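The distinction between learning topics and using learned topics can be sketched in code. The slides do not name the topic model, so this assumes an LDA-style model fitted by collapsed Gibbs sampling; the function name `gibbs`, the hyperparameters `alpha` and `beta`, and the toy word ids are all illustrative assumptions, not the presenters' implementation.

```python
import random

def gibbs(docs, T, W, iters, alpha=0.1, beta=0.01, fixed_nwt=None, seed=0):
    """Collapsed Gibbs sampling for an LDA-style topic model (a sketch).

    docs: list of documents, each a list of word ids in range(W).
    fixed_nwt: word-topic counts from an earlier clustering run. When given,
               they are held fixed and only document-topic counts are inferred.
    Returns (nwt, ndt): word-topic counts and document-topic counts.
    """
    rng = random.Random(seed)
    learn = fixed_nwt is None
    nwt = [[0] * T for _ in range(W)] if learn else fixed_nwt
    nt = [sum(nwt[w][t] for w in range(W)) for t in range(T)]
    ndt = [[0] * T for _ in docs]
    z = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(T)
            zd.append(t)
            ndt[d][t] += 1
            if learn:
                nwt[w][t] += 1
                nt[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndt[d][t] -= 1
                if learn:
                    nwt[w][t] -= 1
                    nt[t] -= 1
                # Probability of each topic for this token, up to a constant.
                weights = [(nwt[w][k] + beta) / (nt[k] + W * beta) * (ndt[d][k] + alpha)
                           for k in range(T)]
                t = rng.choices(range(T), weights=weights)[0]
                z[d][i] = t
                ndt[d][t] += 1
                if learn:
                    nwt[w][t] += 1
                    nt[t] += 1
    return nwt, ndt

# Clustering: learn topics from a (toy) sample of records.
train = [[0, 1, 0, 1, 1], [2, 3, 3, 2, 2], [0, 0, 1], [3, 2, 3]]
nwt, ndt_train = gibbs(train, T=2, W=4, iters=50)

# Classification: reuse the learned topics on new records, fewer iterations.
new_docs = [[0, 1, 1], [2, 3]]
_, ndt_new = gibbs(new_docs, T=2, W=4, iters=20, fixed_nwt=nwt)
```

In this sketch the classification pass never updates the word-topic counts, which is why it can run with far fewer iterations and on one repository at a time.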
7Repository Selection
- Mix of cultural heritage repositories?
- UMich, Library of Congress, CDL, State Lib of Victoria (Aust), ...
- Average of 15 words per record (excl. stopwords)
- Topics often specific to collection (e.g., State Lib of Victoria)
- Experience with CDL's American West project
- Mix of scientific/research repositories?
- CiteSeer, arXiv, PubMed, ...
- <description> is a reasonably reliable 200-word abstract
- Average of 75 words per record
- Topics more likely to span repositories
- For purposes of evaluation, used (mostly) English-language repositories
8Selected Repositories
Repositories harvested by UMich/OAIster, June 7,
2006.
9Usage of Dublin Core Fields
- Decided to use words from <title>, <description>, <subject> for clustering
- Idiosyncrasies
- CiteSeer repeats <author> and <title> in <subject>
- CiteSeer puts citations to other IDs in <description>
- arXiv puts, e.g., "Comment: 12 pages PostScript" in <description>
- RePEc: no <subject>, repeats ID in <description>
- etc.
- Approach: process all repositories identically, no special treatment
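As an illustration of pulling the three fields out of a harvested record, here is a minimal sketch using Python's standard XML parser. It assumes records arrive as unqualified Dublin Core in the usual oai_dc wrapper; the function name `record_text` and the abridged sample record are illustrative, not taken from the project.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set used inside oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"
FIELDS = ("title", "description", "subject")

def record_text(xml_string):
    """Concatenate the text of every title, description, and subject element."""
    root = ET.fromstring(xml_string)
    parts = []
    for field in FIELDS:
        for elem in root.iter(DC + field):
            if elem.text:
                parts.append(elem.text.strip())
    return " ".join(parts)

# Abridged, illustrative record; every repository is handled the same way.
sample = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Reinforcement Learning: A Survey</dc:title>
  <dc:creator>Leslie Pack Kaelbling</dc:creator>
  <dc:description>This paper surveys the field of reinforcement learning.</dc:description>
  <dc:subject>Reinforcement Learning: A Survey</dc:subject>
</oai_dc:dc>"""

text = record_text(sample)
```

Fields other than the three selected ones (here, the creator) are ignored, and repeated elements are simply concatenated.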
10Preprocessing Example
<ID oai:CiteSeerPSU:44072>
<title> Reinforcement Learning: A Survey
<description> This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement."
<subject> Leslie Pack Kaelbling, Michael Littman, Andrew Moore. Reinforcement Learning: A Survey

[preprocess, with vocabulary]

<ID oai:CiteSeerPSU:44072> reinforcement learning survey survey field reinforcement learning computer science perspective written accessible researcher familiar machine learning historical basis field broad selection current summarized reinforcement learning faced agent learn behavior trial error interaction dynamic environment resemblance psychology differ considerably detail word reinforcement leslie pack kaelbling littman andrew moore reinforcement learning survey
11Stopwords and Stemming
- Standard: and, the, ...
- Research related: research, paper, data, system, method, result, ...
- Repository specific: cern, citeseer, repec, Smith, ...
- All tokens starting with a digit: 1996, 401k, ...
- Produced stopword list of 500 words
- Applied very simple stemming (cars -> car)
- Note: replacing collocations improves interpretability of topics, but not quality (los angeles -> los_angeles)
- Don't need to find and exclude all stopwords because topic model will help find these (e.g., des, les, une, ...); suppress after the fact
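The rules above can be sketched as a small preprocessing function. The stopword list here is a tiny illustrative subset (the slides describe about 500 words), and the plural-stripping rule is only an assumption about what "very simple stemming" might look like.

```python
import re

# Tiny illustrative stopword list; the real list had about 500 entries.
STOPWORDS = {"and", "the", "of", "a", "is", "to", "for", "research", "paper",
             "data", "system", "method", "result", "citeseer"}

def simple_stem(word):
    """Crude plural stripping (cars -> car). An assumed rule, not the authors'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "is", "us")):
        return word[:-1]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop digit-initial tokens, stem, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = []
    for tok in tokens:
        if tok[0].isdigit():
            continue  # e.g. 1996, 401k
        tok = simple_stem(tok)
        if tok in STOPWORDS:
            continue
        kept.append(tok)
    return kept

tokens = preprocess("The 1996 paper surveys reinforcement learning methods for cars.")
```

Stemming before the stopword check lets a single entry such as "method" also remove "methods".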
12Building Vocabulary
- Preprocessed (sampled) repositories, excluded stopwords
- Only kept words that occurred in more than 10 records
- Result: a final vocabulary with 90,000 words
- Most frequent words: cell, high, energy, protein, function, algorithm, field, theory, physics, ...
- Resulting discussion point: When do we need to re-create the vocabulary? (When classifying, new documents will be filtered through existing vocabulary)
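A minimal sketch of the vocabulary step, assuming each record has already been preprocessed into a list of tokens. The function names and the toy records are illustrative; the threshold is lowered from 10 to 2 only so that the toy data produces a result.

```python
from collections import Counter

def build_vocabulary(records, min_records=10):
    """Keep words that occur in more than `min_records` records."""
    doc_freq = Counter()
    for tokens in records:
        doc_freq.update(set(tokens))  # count each word once per record
    return sorted(w for w, n in doc_freq.items() if n > min_records)

def filter_record(tokens, vocabulary):
    """Filter a new record through the existing vocabulary, as done when classifying."""
    vocab = set(vocabulary)
    return [t for t in tokens if t in vocab]

records = [["cell", "protein"], ["cell", "energy"], ["cell", "cell", "rareword"]]
vocab = build_vocabulary(records, min_records=2)
filtered = filter_record(["cell", "newword", "protein"], vocab)
```

Words that first appear after the vocabulary was built are silently dropped at classification time, which is what makes the question of when to re-create the vocabulary matter.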
13Preprocessing and Topic Modeling >
- Average of 75 words per record
- Bimodal because used records with abstracts and records without abstracts
- Topic model isn't adversely affected by very short records
14Computation
- Clustering (Learning)
- D = 750,000 records
- W = 90,000 word vocabulary
- L = 75 words per record
- T = 500 topics
- iter = 500 iterations
- memory ~ 3DL + T(D+W) ~ 3 GByte
- time ~ D x L x T x iter ~ 3 days (3 GHz Xeon)
- Classification
- D = 3,000,000 records total
- iter = 40 iterations
- max memory = 2 GByte
- max time = 5 hours (but repositories can run in parallel)
Decision point: How many topics?
Decision point: How many iterations?
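The clustering estimates can be checked with a few lines of arithmetic. This reads the memory line as 3DL + T(D+W) stored counts and assumes 4 bytes per count, which the slide does not state.

```python
D = 750_000      # records in the clustering sample
W = 90_000       # vocabulary size
L = 75           # average words per record
T = 500          # topics
ITER = 500       # iterations
BYTES_PER_COUNT = 4  # assumption: 32-bit integers

# Three integers per token (word, record, topic) plus the two count tables.
entries = 3 * D * L + T * (D + W)
gbytes = entries * BYTES_PER_COUNT / 1e9

# One topic-probability evaluation per token, per topic, per iteration.
updates = D * L * T * ITER
seconds = 3 * 24 * 3600          # the quoted 3 days
rate = updates / seconds         # evaluations per second implied by the slide

print(f"{entries:,} counts, about {gbytes:.2f} GB")
print(f"{updates:.3e} evaluations, about {rate:.2e} per second")
```

Under the 4-byte assumption the total is a little over 2 GB, the same order as the 3 GByte quoted. Because the time estimate scales linearly in both T and iter, the two decision points translate directly into run time.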
15Broad Topical Categories
- 500 topics: too many to look at
- Need to organize topics under broad topical categories
- Cluster the clusters (automatic)
- Use pre-defined categories
- Classify group of keywords (manual + automatic)
- Create hierarchy by hand (manual)
16Broad Topical Categories
[Diagram: Cluster. OAI records -> preprocess (with vocabulary) -> topic model (cluster/learn) -> topics]
[Diagram: Cluster the clusters. topics -> topic model (cluster/learn) -> broad topical categories]
17Broad Topical Categories
[Diagram: Cluster. OAI records -> preprocess (with vocabulary) -> topic model (cluster/learn) -> topics]
[Diagram: Cluster the clusters. topics -> topic model (cluster/learn) -> broad topical categories]
[Diagram: Classify group of keywords. group of keywords -> preprocess (with vocabulary) -> topic model (classify) -> topics organized under broad topical categories]
18Clustering, Classification, and Metadata
Enhancement Techniques on OAI Records
- Preprocessing and Topic Modeling
- The Browser
- Lessons Learned and Next Steps
19The Browser
The Browser >
- PHP/MySQL browser of 3 million OAI records
- Preserving transparency for this audience
- Browser not meant for end users
- No search, no information architecture, etc.
- http://yarra.calit2.uci.edu/meow/
Based on 750,000 sampled records from 9
repositories, 500 topics
20The Browser http://yarra.calit2.uci.edu/meow/
21Selected Topics: Useful
- t201 learning machine training learn algorithm task examples reinforcement inductive learned learner supervised unsupervised
- t482 labor worker employment wage market labour job unemployment wages earning panel find evidence individual participation
- t381 algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math
- t097 dark matter universe astrophysic cosmological cosmic background density inflation spectrum power scale cmb halo cosmology gravitational
- t027 hiv virus human immunodeficiency type envelope infection viral cd4 infected gag replication reverse aid tat gp120
- t365 waste radioactive wastes tank nuclear facilities management hanford disposal fuel storage material processing facility site level
- > show all 500 sub-topics (to see all 500 topics)
22Selected Topics: Less Useful
- t255 journal author chapter vol notes editor publication issue special bibliography reader references appendix literature submitted topic
- t328 paul mark thank andrew scott stephen alan steven miller george martin obituaries thesis daniel prof ian
- t384 supported part grant author foundation partially contract science national nsf support advanced ccr provided center agency
- t112 look people difficult thing need want fact reason help understand think say alway try easy bad
- t496 increase increased increases decrease increasing decreased decreases observed change decreasing significant caused decline
- t012 des les dan une est par sur pour qui nous sont aux ces analyse pay cette
- But junk topics alleviate the need to exhaustively find stopwords
- many useless words cluster as topics which can be suppressed
and very useful to filter out French records
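Suppressing junk topics after the fact could look like the following sketch. It assumes each record carries a topic mixture as a mapping from topic id to weight; the weights, the 0.5 threshold, and the function names are hypothetical, and only the topic ids come from the list above.

```python
# Topic ids flagged as less useful in the list above.
JUNK_TOPICS = {"t255", "t328", "t384", "t112", "t496", "t012"}
FRENCH_TOPIC = "t012"

def suppress_junk(topic_weights, junk=JUNK_TOPICS):
    """Drop junk topics from a record's topic mixture and renormalize the rest."""
    kept = {t: w for t, w in topic_weights.items() if t not in junk}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {t: w / total for t, w in kept.items()}

def looks_french(topic_weights, threshold=0.5):
    """Flag a record whose mixture is dominated by the French-word topic."""
    return topic_weights.get(FRENCH_TOPIC, 0.0) >= threshold

record = {"t381": 0.6, "t255": 0.25, "t191": 0.15}   # hypothetical weights
clean = suppress_junk(record)
french = looks_french({"t012": 0.8, "t097": 0.2})
```

Because the junk words are grouped into their own topics, they can be hidden per record without re-running preprocessing or clustering.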
23Broad Topical Categories (BTCs)
- By clustering the clusters
- worked well
- mathematics, global energy resources, ...
- can choose desired number of broad topical categories (e.g., 25) and thresholding
- By classifying groups of keywords
- worked well too
- Then review and manually edit
- include or exclude any subtopic
24BTCs: Clustering the clusters
25BTCs: Classifying group of keywords
- >>> Aerospace Engineering
- stars (15)
- space (18)
- aeronautics (20)
- astronautics (20)
- rocket (12)
- shuttle (12)
- exploration (15)
- lander (3)
- planets (7)
- black holes (7)
- quasars (7)
- pulsars (7)
- observatories (10)
- air traffic (10)
- aircraft (15)
- aerospace (20)
- airplanes (10)
- airports (10)
- heliports (10)
- helicopters (10)
- aviation (18)
- FAA (7)
- airlines (12)
- flight (18)
- comets (10)
- meteorites (12)
- spacecraft (15)
- air force (7)
- pilots (7)
- jets (7)
- air travel (15)
- flying (18)
domain expert specifies list of relevant keywords
and (importance)
26BTCs: Classifying group of keywords
- >>> Aerospace Engineering
- t192 (69) vehicle flight vehicles engine car road speed nasa aircraft air
- t352 (13) star solar planet mass astrophysic binary dwarf orbital sun companion
- t191 (8) space spaces hilbert subspace dimensional subspaces defined exploration linear point
- >>> Dermatology
- t388 (83) infection skin disease tract respiratory fever burgdorferi caused wound arthritis
- t157 (8) cancer tumor p53 breast carcinoma survival human tumour malignant prostate
- t071 (7) growth tuberculosis mycobacterium growing grow igf factor bcg avium
- >>> Geology and Earth Sciences
- t121 (73) geothermal rock seismic energy mountain drilling fluid survey spring yucca
- t268 (12) sea atmospheric climate ice ocean atmosphere cloud global wind aerosol
- >>> Molecular, Cellular and Developmental Biology
- t276 (31) molecular biological sciences molecules biology molecule quantitative biochemistry basic
- t417 (15) cell apoptosis cellular death cultured bcl lines hela transfected mediated
- t355 (12) brain neuron neuronal cortex synaptic cortical rat nervous cerebral dopamine
- t418 (9) genes genome gene repeat chromosome sequences dna genomic sequence region
in review, would delete this topic from this BTC
just found 1 topic relevant to transportation
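To show why a human review pass is still needed, here is a deliberately simplified stand-in for the keyword-group step. The project ran the keyword group through the topic model classifier, and the numbers in parentheses above presumably come from it; this sketch instead scores each topic by the summed importance of the keywords found among its top words. The keyword weights are a subset of the Aerospace Engineering list, and the scoring rule is an assumption.

```python
# Subset of the Aerospace Engineering keyword list, with its importance values.
KEYWORDS = {"stars": 15, "space": 18, "exploration": 15, "planets": 7,
            "aircraft": 15, "flight": 18}

# Top words of three topics as listed above.
TOPICS = {
    "t192": "vehicle flight vehicles engine car road speed nasa aircraft air",
    "t352": "star solar planet mass astrophysic binary dwarf orbital sun companion",
    "t191": "space spaces hilbert subspace dimensional subspaces defined exploration linear point",
}

def stem(word):
    """Crude plural stripping so that 'stars' matches 'star' (an assumed rule)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def score_topics(keywords, topics):
    """Score each topic by the total importance of keywords among its top words."""
    scores = {}
    for topic_id, words in topics.items():
        topic_words = {stem(w) for w in words.split()}
        scores[topic_id] = sum(weight for kw, weight in keywords.items()
                               if stem(kw) in topic_words)
    return scores

scores = score_topics(KEYWORDS, TOPICS)
```

Even this crude score ranks t191 highly because "space" and "exploration" appear in a topic that is really about Hilbert spaces, the same kind of false match the review note above flags for deletion.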
27Browse Records in a Topic
can navigate back to multiple BTCs
nice mix of repositories
28Browse Records in a Topic: From one repository
display records just from Library of Congress
29Sample Record
- Murphy's Law in algebraic geometry: Badly-behaved deformation spaces
- > preprocessed text
- murphy law algebraic geometry badly behaved deformation spaces
- consider question bad deformation space object answer priori reason deformation space bad moduli spaces
- precisely singularity finite type smooth parameter hilbert scheme curves projective space moduli spaces
- smooth projective type surfaces higher dimensional varieties plane curves nodes cusp stable sheaves
- isolated threefold singularities object pathological fact nice curves smooth surfaces ample canonical bundle
- stable sheaves torsion free rank singularities normal cohen macaulay justifies mumford philosophy moduli
- spaces behaved object arbitrarily bad priori reason construct smooth curve projective space deformation
- space component singularity type reduced behavior subschemes similarly give surface f_p lift course hold
- holomorphic category difficult compute deformation spaces directly obstruction theories circumvent relating
- tractable deformation spaces smooth morphism essential starting point mnev universality theorem
- mathematic algebraic geometry mathematic complex variables
- > top topics
- t381 algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math
- t191 space spaces
topics for this record
link to actual OAI record
30Repository-specific Browsers
- Library of Congress (http://yarra.calit2.uci.edu/oai/loc/)
- University of Michigan (http://yarra.calit2.uci.edu/oai/umich/)
- University of Washington (http://yarra.calit2.uci.edu/oai/uwash/)
- African Journals Online (http://yarra.calit2.uci.edu/oai/africa/)
- and many more
31Clustering, Classification, and Metadata
Enhancement Techniques on OAI Records
- Preprocessing and Topic Modeling
- The Browser
- Lessons Learned and Next Steps
32Evaluation
Lessons Learned and Next Steps >
- Topic modeling worked well
- Most topics were useful
- Drain on computer resources was reasonable
- Human effort was relatively small
- All repositories processed identically, no special treatment
- Strategy worked well
- Clustering, then
- Classification, and
- Broad Topical Categories creation
33Further Evaluation
- Current processing only for
- English-language repositories
- Science/research based repositories
- Need to test cultural heritage repositories and foreign-language records
- Less consistent descriptive language and length
- On-the-horse problem more prevalent
- Greater need to individually process repositories
- Also need usability testing to evaluate further
- Depends on criteria -- who are users?
- Librarians?
- End-users?
- Depends on products and services desired by users
34Discussion Point: When to Re-cluster?
[Diagram: sequence of cluster and classify runs over time]
- Need to re-cluster
- when collection changes significantly
- if there is a hole in topics
- but NOT just because you have another repository
- If you re-cluster
- all topics will be different
- have to discard hand-labeling
- Broad Topical Categories might be different
- However, classification is
- cheap and easy
- e.g., for OAIster, could re-classify every harvest ... until spring clean
35Products and Services
- Depending on users
- What kind of service is useful?
- What should interface to topics look/act like?
- What kind of use should we envision?
- We have some ideas
36Archive of Topics
- Are the topics we created useful to anyone else?
- Scenario: librarian uses topics/classifier for local resources
- To use locally you need
- the preprocessor (i.e. the preprocessing rules)
- the vocabulary (file of 90,000 words)
- the topic model classifier
37Subject Search/Browse for OAIster
- Integrate topics into OAIster
- add to records so can perform canned topic search
- add to interface so can browse BTCs to records
- Additionally, can allow users to find records similar to those retrieved
- e.g., retrieved records on cosmology and can find similar records on astrophysics, relativity, ...
- How to do this?
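One possible answer to the open question, offered only as a sketch: compare records by the cosine similarity of their topic mixtures. The slides do not propose a method; the record ids and weights below are hypothetical, and the topic ids are reused from the browser examples.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse topic mixtures (topic id -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(query, candidates, k=2):
    """Return the ids of the k candidate records closest to the query mixture."""
    ranked = sorted(candidates.items(),
                    key=lambda item: cosine(query, item[1]), reverse=True)
    return [rec_id for rec_id, _ in ranked[:k]]

# Hypothetical topic mixtures for three records.
records = {
    "rec_cosmology": {"t097": 0.9, "t381": 0.1},
    "rec_astro": {"t097": 0.7, "t352": 0.3},
    "rec_labor": {"t482": 1.0},
}
query = records["rec_cosmology"]
others = {k: v for k, v in records.items() if k != "rec_cosmology"}
similar = most_similar(query, others, k=1)
```

Because similarity is computed over topics rather than words, two records can match without sharing vocabulary, which is the behavior the cosmology and astrophysics example asks for.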
38How To Reach Us
- David Newman, University of California, Irvine
- <newman_at_uci.edu>
- Kat Hagedorn, University of Michigan
- <khage_at_umich.edu>
- Bill Landis, California Digital Library
- <bill.landis_at_ucop.edu>