Title: ICS 278: Data Mining Lecture 14: Document Clustering and Topic Extraction
Slide 1: ICS 278: Data Mining
Lecture 14: Document Clustering and Topic Extraction
- Padhraic Smyth
- Department of Information and Computer Science
- University of California, Irvine
Slide 2: Text Mining
- Information Retrieval
- Text Classification
- Text Clustering
- Information Extraction
Slide 3: Document Clustering
- Given: a set of documents D in term-vector form
  - no class labels this time
- Goal: group the documents into K groups, or into a taxonomy
  - each cluster hypothetically corresponds to a topic
- Methods
  - Any of the well-known clustering methods
  - K-means
    - e.g., spherical k-means: normalize documents to unit length so distances are cosine-based
  - Hierarchical clustering
  - Probabilistic model-based clustering methods
    - e.g., mixtures of multinomials
    - single-topic versus multiple-topic models
    - extensions to author-topic models
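The spherical k-means variant mentioned above can be sketched in a few lines. This is an illustrative reconstruction, not the lecture's code; the function name, deterministic initialization from the first K documents, and dense input matrix are our assumptions.

```python
import numpy as np

def spherical_kmeans(X, K, n_iters=20):
    """Cluster unit-normalized documents by cosine similarity.

    X is a dense n_docs x n_terms term-vector array; initializing
    centroids from the first K documents keeps this sketch deterministic.
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length documents
    centroids = X[:K].copy()
    for _ in range(n_iters):
        # cosine similarity = dot product of unit vectors
        labels = (X @ centroids.T).argmax(axis=1)
        for k in range(K):
            members = X[labels == k]
            if len(members):
                c = members.sum(axis=0)
                centroids[k] = c / np.linalg.norm(c)  # re-normalize centroid
    return labels, centroids
```

Normalizing both documents and centroids is what distinguishes this from ordinary k-means: only the direction of the term vector matters, not document length.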
Slides 4-6: Mixture Model Clustering
[Figures: mixture-model clustering illustrations.]
Conditional independence model for each component (often quite useful to first order)
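The conditional independence assumption makes the per-component document likelihood a simple sum of per-term contributions. A minimal sketch (our notation, not the lecture's code) for the log-likelihood of one document under a mixture of multinomials:

```python
import numpy as np

def doc_loglik(counts, weights, comp_word_probs):
    """Log-likelihood of one document's term counts under the mixture.

    comp_word_probs is a K x V array: each row is one component's word
    distribution. Conditional independence means log P(x | k) is just a
    sum of per-term contributions (up to the multinomial coefficient,
    which is constant across components and dropped here).
    """
    comp_ll = counts @ np.log(comp_word_probs).T   # shape (K,)
    a = np.log(weights) + comp_ll
    m = a.max()
    return m + np.log(np.exp(a - m).sum())         # stable log-sum-exp
```

The log-sum-exp trick matters in practice: per-document log-likelihoods for long documents are large negative numbers that underflow if exponentiated directly.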
Slide 7: Mixtures of Documents
[Figure: a binary term-document matrix. Rows are documents, columns are terms; the 1s form two blocks, one per mixture component, because documents generated by Component 1 and Component 2 use different subsets of terms.]

Slide 8: [Figure: the same term-document matrix without the component labels; only the 1s are observed.]
Slide 9: Treat as Missing
[Figure: the term-document matrix with a hidden component label (C1 or C2) attached to each document; the labels are treated as missing data.]
Slide 10: Treat as Missing
[Figure: each document now carries membership probabilities P(C1|x) and P(C2|x) in place of a hard label.]
E-Step: estimate component membership probabilities given current parameter estimates
Slide 11: Treat as Missing
[Figure: the same matrix and membership probabilities as on the previous slide.]
M-Step: use fractionally weighted data to get new estimates of the parameters
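The E-step/M-step loop described on these slides can be sketched for a mixture of multinomials as follows. This is a hedged reconstruction under our own naming; the Dirichlet initialization and the small smoothing constant (to avoid log(0)) are our assumptions, not the lecture's.

```python
import numpy as np

def em_mixture_of_multinomials(X, K, n_iters=50, seed=0, smooth=0.01):
    """EM clustering of term-count vectors X (n_docs x V) into K components."""
    rng = np.random.default_rng(seed)
    n, V = X.shape
    weights = np.full(K, 1.0 / K)
    theta = rng.dirichlet(np.ones(V), size=K)          # K x V word probabilities
    for _ in range(n_iters):
        # E-step: membership probabilities P(component | document)
        log_r = np.log(weights) + X @ np.log(theta).T  # n x K
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: fractionally weighted counts give new parameter estimates
        weights = r.mean(axis=0)
        counts = r.T @ X + smooth                      # smoothing avoids log(0)
        theta = counts / counts.sum(axis=1, keepdims=True)
    return weights, theta, r
```

The returned `r` matrix holds exactly the P(C1|x), P(C2|x), ... values pictured on the slides.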
Slide 12: A Document Cluster

Most likely terms in Component 5 (weight 0.08):

  TERM     p(t|k)
  write    0.571
  drive    0.465
  problem  0.369
  mail     0.364
  articl   0.332
  hard     0.323
  work     0.319
  system   0.303
  good     0.296
  time     0.273

Highest-lift terms in Component 5 (weight 0.08):

  TERM     LIFT  p(t|k)  p(t)
  scsi     7.7   0.13    0.02
  drive    5.7   0.47    0.08
  hard     4.9   0.32    0.07
  card     4.2   0.23    0.06
  format   4.0   0.12    0.03
  softwar  3.8   0.21    0.05
  memori   3.6   0.14    0.04
  install  3.6   0.14    0.04
  disk     3.5   0.12    0.03
  engin    3.3   0.21    0.06
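Lift here is the ratio of a term's probability inside the component to its corpus-wide probability, so high-lift terms characterize what is distinctive about the cluster rather than what is merely frequent. A one-line sketch (note the slide's lift values were computed from unrounded probabilities, so recomputing from the rounded two-decimal numbers gives slightly different values):

```python
def lift(p_term_in_component, p_term_overall):
    """lift(t, k) = p(t | k) / p(t): how over-represented term t is in component k."""
    return p_term_in_component / p_term_overall

# "scsi" from the table above, using the rounded values: p(t|k) = 0.13, p(t) = 0.02
print(round(lift(0.13, 0.02), 1))
```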
Slide 13: Another Document Cluster

Most likely terms in Component 1 (weight 0.11):

  TERM       p(t|k)
  articl     0.684
  good       0.368
  dai        0.363
  fact       0.322
  god        0.320
  claim      0.294
  apr        0.279
  fbi        0.256
  christian  0.256
  group      0.239

Highest-lift terms in Component 1 (weight 0.11):

  TERM       LIFT  p(t|k)  p(t)
  fbi        8.3   0.26    0.03
  jesu       5.5   0.16    0.03
  fire       5.2   0.20    0.04
  christian  4.9   0.26    0.05
  evid       4.8   0.24    0.05
  god        4.6   0.32    0.07
  gun        4.2   0.17    0.04
  faith      4.2   0.12    0.03
  kill       3.8   0.22    0.06
  bibl       3.7   0.11    0.03
Slide 14: A topic is represented as a (multinomial) distribution over words

Example topic 1:                    Example topic 2:

  SPEECH          .0691               WORDS           .0671
  RECOGNITION     .0412               WORD            .0557
  SPEAKER         .0288               USER            .0230
  PHONEME         .0224               DOCUMENTS       .0205
  CLASSIFICATION  .0154               TEXT            .0195
  SPEAKERS        .0140               RETRIEVAL       .0152
  FRAME           .0135               INFORMATION     .0144
  PHONETIC        .0119               DOCUMENT        .0144
  PERFORMANCE     .0111               LARGE           .0102
  ACOUSTIC        .0099               COLLECTION      .0098
  BASED           .0098               KNOWLEDGE       .0087
  PHONEMES        .0091               MACHINE         .0080
  UTTERANCES      .0091               RELEVANT        .0077
  SET             .0089               SEMANTIC        .0076
  LETTER          .0088               SIMILARITY      .0071
Slide 15: The basic model
[Graphical model: a single class variable C with the observed words X1, X2, ..., Xd as children.]
Slide 16: A better model
[Graphical model: hidden variables A, B, C, each connected to the observed words X1, X2, ..., Xd.]

Slide 17: A better model
History:
- latent class models in statistics
- Hofmann applied the idea to documents (SIGIR 99)
- recent extensions, e.g., Blei, Ng, Jordan (JMLR, 2003)
- variously known as factor/aspect/latent class models
Slide 18: A better model
[Graphical model: hidden variables A, B, C, each connected to the observed words X1, X2, ..., Xd.]
Inference can be intractable due to undirected loops!
Slide 19: A better model for documents
- Multi-topic model
  - a document is generated from multiple components
  - multiple components can be active at once
  - each component is a multinomial distribution over words
- Parameter estimation is tricky
- Very useful
  - parses documents into high-level semantic components
Slide 20: A generative model for documents

P(w | z = 1) = φ(1):                P(w | z = 2) = φ(2):

  HEART        0.2                    HEART        0.0
  LOVE         0.2                    LOVE         0.0
  SOUL         0.2                    SOUL         0.0
  TEARS        0.2                    TEARS        0.0
  JOY          0.2                    JOY          0.0
  SCIENTIFIC   0.0                    SCIENTIFIC   0.2
  KNOWLEDGE    0.0                    KNOWLEDGE    0.2
  WORK         0.0                    WORK         0.2
  RESEARCH     0.0                    RESEARCH     0.2
  MATHEMATICS  0.0                    MATHEMATICS  0.2

topic 1                             topic 2
Slide 21: Choose mixture weights for each document, generate bag of words
θ = (P(z = 1), P(z = 2)) ∈ {(0, 1), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1, 0)}

Sample documents, one per mixture weight, from all-topic-2 to all-topic-1 (document boundaries reconstructed from the slide layout):
- MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE
- MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART
- MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART
- WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL
- TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
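The two-step sampling above (topic per word, then word from that topic) can be sketched directly. A minimal illustration using the slide's two example topics; the function name and random seed are our choices:

```python
import numpy as np

# The two example topics from the previous slide, each a uniform
# multinomial over five words.
TOPICS = {
    1: (["HEART", "LOVE", "SOUL", "TEARS", "JOY"], [0.2] * 5),
    2: (["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"], [0.2] * 5),
}

def generate_doc(theta, n_words, seed=0):
    """theta = (P(z=1), P(z=2)); returns a bag of words for one document."""
    rng = np.random.default_rng(seed)
    words = []
    for _ in range(n_words):
        z = rng.choice([1, 2], p=theta)                       # topic for this word
        vocab, probs = TOPICS[z]
        words.append(vocab[rng.choice(len(vocab), p=probs)])  # word from that topic
    return words
```

With theta = (0.25, 0.75) roughly a quarter of the generated words come from the HEART/LOVE topic, mirroring the mixed documents on the slide.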
Slide 22: A visual example: bars
- sample each pixel from a mixture of topics
- pixel = word; image = document
Slides 23-24: [figures only, no transcript]
Slide 25: Interpretable decomposition
- SVD gives a basis for the data, but not an interpretable one
- the true basis is not orthogonal, so rotation does no good
Slide 26: (Dumais, Landauer)
[Figure: P(w) decomposition.]
Slide 27: History of multi-topic models
- Latent class models in statistics
- Hofmann 1999
  - original application to documents
- Blei, Ng, and Jordan (2001, 2003)
  - variational methods
- Griffiths and Steyvers (2003)
  - Gibbs sampling approach (very efficient)
Slide 28: A selection of topics
- STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE
- NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
- TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN
- MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
- HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
- FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
Slide 29: A selection of topics
- STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST
- MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN
- MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS
- CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES
- ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS
- PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS
- MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY
Slide 31: Topics 1-4 (CiteSeer)

Topic 1: GROUP 0.057185, MULTICAST 0.051620, INTERNET 0.049499, PROTOCOL 0.041615, RELIABLE 0.020877, GROUPS 0.019552, PROTOCOLS 0.019088, IP 0.014980, TRANSPORT 0.012529, DRAFT 0.009945
Topic 2: DYNAMIC 0.152141, STRUCTURE 0.137964, STRUCTURES 0.088040, STATIC 0.043452, PAPER 0.032706, DYNAMICALLY 0.023940, PRESENT 0.015328, META 0.015175, CALLED 0.011669, RECURSIVE 0.010145
Topic 3: DISTRIBUTED 0.192926, COMPUTING 0.044376, SYSTEMS 0.038601, SYSTEM 0.031797, HETEROGENEOUS 0.030996, ENVIRONMENT 0.023163, PAPER 0.017960, SUPPORT 0.016587, ARCHITECTURE 0.016416, ENVIRONMENTS 0.013271
Topic 4: RESEARCH 0.066798, SUPPORTED 0.043233, PART 0.035590, GRANT 0.034476, SCIENCE 0.023250, FOUNDATION 0.022653, FL 0.021220, WORK 0.021061, NATIONAL 0.019947, NSF 0.018116

Content components (Topics 1-3)
Boilerplate components (Topic 4: grant-acknowledgment terms)
Slide 32: Topics 5-12

Topic 5: DIMENSIONAL 0.038901, POINTS 0.037263, SURFACE 0.031438, GEOMETRIC 0.025006, SURFACES 0.020152, MESH 0.016875, PLANE 0.013902, POINT 0.013780, GEOMETRY 0.013780, PLANAR 0.012385
Topic 6: RULES 0.090569, CLASSIFICATION 0.062699, RULE 0.062174, ACCURACY 0.028926, ATTRIBUTES 0.023090, INDUCTION 0.021909, CLASSIFIER 0.019418, SET 0.018303, ATTRIBUTE 0.016204, CLASSIFIERS 0.015417
Topic 7: ORDER 0.192759, TERMS 0.048688, PARTIAL 0.044907, HIGHER 0.041284, REDUCTION 0.035061, PAPER 0.028602, TERM 0.018204, ORDERING 0.017652, SHOW 0.017022, MAGNITUDE 0.015526
Topic 8: GRAPH 0.095687, PATH 0.061784, GRAPHS 0.061217, PATHS 0.030151, EDGE 0.028590, NUMBER 0.022775, CONNECTED 0.016817, DIRECTED 0.014405, NODES 0.013625, VERTICES 0.013554
Topic 9: INFORMATION 0.281237, TEXT 0.048675, RETRIEVAL 0.044046, SOURCES 0.029548, DOCUMENT 0.029000, DOCUMENTS 0.026503, RELEVANT 0.018523, CONTENT 0.016574, AUTOMATICALLY 0.009326, DIGITAL 0.008777
Topic 10: SYSTEM 0.143873, FILE 0.054076, OPERATING 0.053963, STORAGE 0.039072, DISK 0.029957, SYSTEMS 0.029221, KERNEL 0.028655, ACCESS 0.018293, MANAGEMENT 0.017218, UNIX 0.016878
Topic 11: PAPER 0.077870, CONDITIONS 0.041187, CONCEPT 0.036268, CONCEPTS 0.033457, DISCUSSED 0.027414, DEFINITION 0.024673, ISSUES 0.024603, PROPERTIES 0.021511, IMPORTANT 0.021370, EXAMPLES 0.019754
Topic 12: LANGUAGE 0.158786, PROGRAMMING 0.097186, LANGUAGES 0.082410, FUNCTIONAL 0.032815, SEMANTICS 0.027003, SEMANTIC 0.024341, NATURAL 0.016410, CONSTRUCTS 0.014129, GRAMMAR 0.013640, LISP 0.010326
Slide 33: Topics 13-16

Topic 13: MODEL 0.429185, MODELS 0.201810, MODELING 0.066311, QUALITATIVE 0.018417, COMPLEX 0.009272, QUANTITATIVE 0.005662, CAPTURE 0.005301, MODELED 0.005301, ACCURATELY 0.004639, REALISTIC 0.004278
Topic 14: PAPER 0.050411, APPROACHES 0.045245, PROPOSED 0.043132, CHANGE 0.040393, BELIEF 0.025835, ALTERNATIVE 0.022470, APPROACH 0.020905, ORIGINAL 0.019026, SHOW 0.017852, PROPOSE 0.016991
Topic 15: TYPE 0.088650, SPECIFICATION 0.051469, TYPES 0.046571, FORMAL 0.036892, VERIFICATION 0.029987, SPECIFICATIONS 0.024439, CHECKING 0.024439, SYSTEM 0.023259, PROPERTIES 0.018242, ABSTRACT 0.016826
Topic 16: KNOWLEDGE 0.212603, SYSTEM 0.090852, SYSTEMS 0.051978, BASE 0.042277, EXPERT 0.020172, ACQUISITION 0.017816, DOMAIN 0.016638, INTELLIGENT 0.015737, BASES 0.015390, BASED 0.014004

Style components (e.g., Topic 14: paper, approaches, proposed, show, propose)
Slide 34: Recent Results on Author-Topic Models
Slide 35: [Figure: Authors → Words]
Can we model authors, given documents? (More generally: build statistical profiles of entities given sparse observed data.)
Slide 36: [Figure: Authors → Hidden Topics → Words]
Model: Author-Topic distributions and Topic-Word distributions; parameters learned via Bayesian learning.
Slides 37-42: [animation builds of the same Authors → Hidden Topics → Words figure]
Slide 43: [Figure: Hidden Topics → Words]
Topic model: a document can be generated from multiple topics - Hofmann (SIGIR 99); Blei, Ng, Jordan (JMLR, 2003)
Slide 44: [Figure: Authors → Hidden Topics → Words]
Model: Author-Topic distributions and Topic-Word distributions. NOTE: documents can be composed of multiple topics.
Slide 45: The Author-Topic Model: Assumptions of the Generative Model
- Each author is associated with a topic mixture
- Each document is a mixture of topics
- With multiple authors, the document is a mixture of the topic mixtures of the coauthors
- Each word in a text is generated from one topic and one author (potentially different for each word)
Slide 46: Generative Process
- Let's assume authors A1 and A2 collaborate and produce a paper
  - A1 has multinomial topic distribution θ1
  - A2 has multinomial topic distribution θ2
- For each word in the paper:
  - Sample an author x (uniformly) from {A1, A2}
  - Sample a topic z from θx
  - Sample a word w from the multinomial topic distribution φz
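The three sampling steps above translate directly into code. A minimal sketch with our own (hypothetical) function and variable names; the distributions passed in are illustrative:

```python
import numpy as np

def generate_paper(author_topic, topic_word, coauthors, n_words, seed=0):
    """Sketch of the author-topic generative process for one paper.

    author_topic: dict author -> theta_a (topic distribution for that author)
    topic_word:   list of phi_z arrays (word distribution per topic)
    """
    rng = np.random.default_rng(seed)
    words = []
    for _ in range(n_words):
        x = coauthors[rng.integers(len(coauthors))]          # author, uniform
        z = rng.choice(len(topic_word), p=author_topic[x])   # topic z ~ theta_x
        w = rng.choice(len(topic_word[z]), p=topic_word[z])  # word w ~ phi_z
        words.append(int(w))
    return words
```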
Slide 47: Graphical Model
From the set of co-authors:
1. Choose an author
2. Choose a topic
3. Choose a word
Slide 48: Model Estimation
- Estimate x and z by Gibbs sampling (assignments of each word to an author and topic)
- Estimation is efficient: linear in data size
- Infer:
  - Author-Topic distributions (Θ)
  - Topic-Word distributions (Φ)
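One sweep of the collapsed Gibbs sampler re-samples each word's (author, topic) pair given all other assignments. The sketch below is an illustrative reconstruction, not the authors' code: the count-array names, symmetric Dirichlet hyperparameters alpha and beta, and the in-place update scheme are our assumptions.

```python
import numpy as np

def gibbs_sweep(docs, doc_authors, x, z, AT, TW, Tsum, alpha, beta, rng):
    """One collapsed Gibbs sweep over every word's (author, topic) pair.

    docs[d]: word ids of doc d; doc_authors[d]: its co-author ids.
    x[d][i], z[d][i]: current author/topic assignment of word i in doc d.
    AT: author-topic counts; TW: topic-word counts; Tsum: words per topic.
    """
    n_topics, V = TW.shape
    for d, words in enumerate(docs):
        authors = doc_authors[d]
        for i, w in enumerate(words):
            a, t = x[d][i], z[d][i]
            AT[a, t] -= 1; TW[t, w] -= 1; Tsum[t] -= 1   # exclude this word
            # joint P(author, topic | rest): author-topic factor x topic-word factor
            p = ((AT[authors] + alpha)
                 / (AT[authors].sum(axis=1, keepdims=True) + n_topics * alpha)
                 * (TW[:, w] + beta) / (Tsum + V * beta))
            flat = (p / p.sum()).ravel()
            k = rng.choice(flat.size, p=flat)
            a, t = authors[k // n_topics], k % n_topics
            x[d][i], z[d][i] = a, t
            AT[a, t] += 1; TW[t, w] += 1; Tsum[t] += 1   # record new assignment
```

Θ and Φ are then read off by normalizing the smoothed AT and TW counts; each sweep is linear in the total number of word tokens, which is why estimation scales to corpora like CiteSeer.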
Slide 49: Data
- 1,700 proceedings papers from NIPS (2,000 authors)
  - (NIPS = Neural Information Processing Systems)
- 160,000 CiteSeer abstracts (85,000 authors)
- Removed stop words; word order is irrelevant, just use word counts
- Processing time
  - NIPS: 2,000 Gibbs iterations ≈ 12 hours on a PC workstation
  - CiteSeer: 700 Gibbs iterations ≈ 111 hours
Slide 50: Author Modeling Data Sets

  Source    Documents  Unique Authors  Unique Words  Total Word Count
  CiteSeer  163,389    85,465          30,799        11.7 million
  CORA      13,643     11,427          11,101        1.2 million
  NIPS      1,740      2,037           13,649        2.3 million
Slide 51: Four example topics from CiteSeer (T = 300)
Slide 52: Four more topics
Slide 53: Some likely topics per author (CiteSeer)
- Author: Andrew McCallum, U Mass
  - Topic 1: classification, training, generalization, decision, data, ...
  - Topic 2: learning, machine, examples, reinforcement, inductive, ...
  - Topic 3: retrieval, text, document, information, content, ...
- Author: Hector Garcia-Molina, Stanford
  - Topic 1: query, index, data, join, processing, aggregate, ...
  - Topic 2: transaction, concurrency, copy, permission, distributed, ...
  - Topic 3: source, separation, paper, heterogeneous, merging, ...
- Author: Paul Cohen, USC/ISI
  - Topic 1: agent, multi, coordination, autonomous, intelligent, ...
  - Topic 2: planning, action, goal, world, execution, situation, ...
  - Topic 3: human, interaction, people, cognitive, social, natural, ...
Slide 54: Four example topics from NIPS (T = 100)
Slide 55: Four more topics
Slide 56: Stability of Topics
- Topic numbering is arbitrary across runs of the model (e.g., "topic 1" is not the same topic across runs)
- However:
  - the majority of topics are stable over processing time
  - the majority of topics can be aligned across runs
  - topics appear to represent genuine structure in the data
Slide 57: Comparing NIPS topics from the same chain (t1 = 1000 and t2 = 2000)
[Figure: KL distance matrix between topics at t1 = 1000 and re-ordered topics at t2 = 2000; best match KL = 0.54, worst match KL = 4.78.]
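The topic alignment pictured here pairs each topic from one run with its nearest topic from another, under a KL-style distance between word distributions. A minimal sketch; the symmetrized form of KL and the greedy (rather than optimal) matching are our assumptions:

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL distance between two word distributions."""
    p = np.asarray(p, float) + eps   # eps guards against zero entries
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def align_topics(topics_a, topics_b):
    """Greedy re-ordering: nearest topic in run B for each topic in run A."""
    D = np.array([[sym_kl(p, q) for q in topics_b] for p in topics_a])
    return D.argmin(axis=1), D
```

Stable topics show up as small best-match distances (like the 0.54 above); topics that fail to align have uniformly large rows in D.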
Slide 58: Comparing NIPS topics and CiteSeer topics
[Figure: KL distance matrix between NIPS topics and re-ordered CiteSeer topics; example distances KL = 2.88, 4.48, 4.92, 5.0.]
Slide 59: Detecting Unusual Papers by Authors
- For any paper by an author, we can calculate how surprising its words are: some papers are on topics unusual for that author
- Papers ranked by unusualness (perplexity) for C. Faloutsos
- Papers ranked by unusualness (perplexity) for M. Jordan
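Per-word perplexity is the usual "surprise" score here: exponentiated negative average log-likelihood of the document's words under the author's fitted topic mixture. A minimal sketch with our own names; the inputs are illustrative:

```python
import numpy as np

def perplexity(counts, theta, topic_word):
    """Per-word perplexity of a document under a fitted topic mixture.

    counts: word-count vector; theta: topic mixture; topic_word: K x V.
    """
    counts = np.asarray(counts, float)
    p_words = np.asarray(theta) @ np.asarray(topic_word)  # P(w) = sum_z theta_z * phi_zw
    log_lik = counts @ np.log(p_words)
    return float(np.exp(-log_lik / counts.sum()))
```

A document drawn from the author's usual topics scores low; a paper on an atypical topic concentrates its counts on low-probability words and scores high, which is what drives the rankings for Faloutsos and Jordan.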
Slide 60: Author Separation
- Can the model attribute words to authors correctly within a document?
- Test of the model:
  1. artificially combine abstracts from different authors
  2. check whether each word is assigned to its correct original author
- (In the text below, the digit after a word marks the author the model assigned it to.)
- A method1 is described which like the kernel1
trick1 in support1 vector1 machines1 SVMs1 lets
us generalize distance1 based2 algorithms to
operate in feature1 spaces usually nonlinearly
related to the input1 space This is done by
identifying a class of kernels1 which can be
represented as norm1 based2 distances1 in Hilbert
spaces It turns1 out that common kernel1
algorithms such as SVMs1 and kernel1 PCA1 are
actually really distance1 based2 algorithms and
can be run2 with that class of kernels1 too As
well as providing1 a useful new insight1 into how
these algorithms work the present2 work can form
the basis1 for conceiving new algorithms - This paper presents2 a comprehensive approach for
model2 based2 diagnosis2 which includes proposals
for characterizing and computing2 preferred2
diagnoses2 assuming that the system2 description2
is augmented with a system2 structure2 a
directed2 graph2 explicating the interconnections
between system2 components2 Specifically we first
introduce the notion of a consequence2 which is a
syntactically2 unconstrained propositional2
sentence2 that characterizes all consistency2
based2 diagnoses2 and show2 that standard2
characterizations of diagnoses2 such as minimal
conflicts1 correspond to syntactic2 variations1
on a consequence2 Second we propose a new
syntactic2 variation on the consequence2 known as
negation2 normal form NNF and discuss its merits
compared to standard variations Third we
introduce a basic algorithm2 for computing
consequences in NNF given a structured system2
description We show that if the system2
structure2 does not contain cycles2 then there is
always a linear size2 consequence2 in NNF which
can be computed in linear time2 For arbitrary1
system2 structures2 we show a precise connection
between the complexity2 of computing2
consequences and the topology of the underlying
system2 structure2 Finally we present2 an
algorithm2 that enumerates2 the preferred2
diagnoses2 characterized by a consequence2 The
algorithm2 is shown1 to take linear time2 in the
size2 of the consequence2 if the preference
criterion1 satisfies some general conditions
Written by (1) Scholkopf_B
Written by (2) Darwiche_A
Slide 61: Applications of Author-Topic Models
- Expert finding
  - find researchers who are knowledgeable in cryptography and machine learning within 100 miles of Washington DC
  - find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest
- Prediction
  - given a document and some subset of known authors for the paper (k = 0, 1, 2), predict the other authors
  - predict how many papers in different topics will appear next year
- Change detection/monitoring
  - which authors are on the leading edge of new topics?
  - characterize the topic trajectory of an author over time
Slide 62: [figure only, no transcript]
Slide 63: Rise in Web, Mobile, Java [figure: "Web" trend]
Slide 64: Rise of Machine Learning
Slide 65: Bayes lives on
Slide 66: Decline in Languages, OS, ...
Slide 67: Decline in CS Theory, ...
Slide 68: Trends in Database Research
Slide 69: Trends in NLP and IR [figure: NLP and IR trends]
Slide 70: Security Research Reborn
Slide 71: (Not so) Hot Topics [figure: Neural Networks, GAs, Wavelets]
Slide 72: Decline in use of Greek letters?
Slide 73: Future Work
- Theory development
- Incorporate citation information, collaboration networks
- Other document types, e.g., email
  - handling subject lines, email threads, and "to" and "cc" fields
- New datasets
  - Enron email corpus
  - Web pages
  - PubMed abstracts (possibly)
Slide 74: New applications of author-topic models
- "Black box" summarization of a text document collection
  - automatically extract a summary of relevant topics and author patterns for a large data set such as the Enron email
- Expert finding
  - find researchers who are knowledgeable in cryptography and machine learning within 100 miles of Washington DC
  - find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest
- Change detection/monitoring
  - which authors are on the leading edge of new topics?
  - characterize the topic trajectory of an author over time
- Prediction (work in progress)
  - given a document and some subset of known authors for the paper (k = 0, 1, 2), predict the other authors
  - predict how many papers in different topics will appear next year
Slide 75: The Author-Topic Browser
(a) Querying on author Pazzani_M
(b) Querying on a topic relevant to the author
(c) Querying on a document written by the author
Slide 76: Scientific syntax and semantics
Factorization of language based on statistical dependency patterns:
- long-range, document-specific dependencies → semantics: probabilistic topics
- short-range dependencies, constant across all documents → syntax: probabilistic regular grammar
[Graphical model: topic mixture θ; per-word topic variables z and syntactic class variables x; observed words w.]
Slides 77-81: Generating from the composite model

Class x = 1 is the semantic state, which first picks a topic:
  z = 1 (prob 0.4): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2
  z = 2 (prob 0.6): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
Class x = 2 (function words): OF 0.6, FOR 0.3, BETWEEN 0.1
Class x = 3 (determiners): THE 0.6, A 0.3, MANY 0.1
(The classes are linked by transition probabilities, e.g., 0.8, 0.7, 0.9, 0.3, 0.2, 0.1 on the slide's arrows.)

Generated word by word across the animation: THE → THE LOVE → THE LOVE OF → THE LOVE OF RESEARCH
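The composite model walked through above, an HMM over syntactic classes where one class defers to the document's topics, can be sketched as follows. This is an illustration under our own naming; the particular transition matrix and seed in the usage are our assumptions, not the slide's exact values.

```python
import numpy as np

def generate(n_words, start, trans, emit, topic_mix, topic_word, seed=0):
    """Generate words from the composite syntax/semantics model.

    Class 0 is the semantic state: it draws a topic from topic_mix and
    then a word from that topic. Other classes (emit[c]) emit directly,
    like the function-word and determiner classes on the slides.
    """
    rng = np.random.default_rng(seed)
    state = rng.choice(len(start), p=start)
    words = []
    for _ in range(n_words):
        if state == 0:
            zz = rng.choice(len(topic_mix), p=topic_mix)   # pick a topic
            vocab, probs = topic_word[zz]
        else:
            vocab, probs = emit[state]
        words.append(vocab[rng.choice(len(vocab), p=probs)])
        state = rng.choice(len(trans), p=trans[state])     # class transition
    return words
```

Sequences like THE LOVE OF RESEARCH arise exactly this way: determiner class, semantic class (topic 1), function-word class, semantic class (topic 2).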
Slide 82: Semantic topics [figure only]
Slide 83: Syntactic classes

  Class 5     Class 8     Class 14     Class 25      Class 26        Class 30      Class 33
  IN          ARE         THE          SUGGEST       LEVELS          RESULTS       BEEN
  FOR         WERE        THIS         INDICATE      NUMBER          ANALYSIS      MAY
  ON          WAS         ITS          SUGGESTING    LEVEL           DATA          CAN
  BETWEEN     IS          THEIR        SUGGESTS      RATE            STUDIES       COULD
  DURING      WHEN        AN           SHOWED        TIME            STUDY         WELL
  AMONG       REMAIN      EACH         REVEALED      CONCENTRATIONS  FINDINGS      DID
  FROM        REMAINS     ONE          SHOW          VARIETY         EXPERIMENTS   DOES
  UNDER       REMAINED    ANY          DEMONSTRATE   RANGE           OBSERVATIONS  DO
  WITHIN      PREVIOUSLY  INCREASED    INDICATING    CONCENTRATION   HYPOTHESIS    MIGHT
  THROUGHOUT  BECOME      EXOGENOUS    PROVIDE       DOSE            ANALYSES      SHOULD
  THROUGH     BECAME      OUR          SUPPORT       FAMILY          ASSAYS        WILL
  TOWARD      BEING       RECOMBINANT  INDICATES     SET             POSSIBILITY   WOULD
  INTO        BUT         ENDOGENOUS   PROVIDES      FREQUENCY       MICROSCOPY    MUST
  AT          GIVE        TOTAL        INDICATED     SERIES          PAPER         CANNOT
  INVOLVING   MERE        PURIFIED     DEMONSTRATED  AMOUNTS         WORK          REMAINED
  AFTER       APPEARED    TILE         SHOWS         RATES           EVIDENCE      ALSO
  THEY        ACROSS      APPEAR       FULL          SO              CLASS         FINDING
  AGAINST     ALLOWED     CHRONIC      REVEAL        VALUES          MUTAGENESIS   BECOME
  WHEN        NORMALLY    ANOTHER      DEMONSTRATES  AMOUNT          OBSERVATION   MAG
  ALONG       EACH        EXCESS       SUGGESTED     SITES           MEASUREMENTS  LIKELY
Slide 84: Syntactic classes [repeat of the previous slide's table]