ICS 278: Data Mining
Lecture 14: Document Clustering and Topic Extraction

Transcript
1
ICS 278: Data Mining
Lecture 14: Document Clustering and Topic Extraction
  • Padhraic Smyth
  • Department of Information and Computer Science
  • University of California, Irvine

2
Text Mining
  • Information Retrieval
  • Text Classification
  • Text Clustering
  • Information Extraction

3
Document Clustering
  • Set of documents D in term-vector form
  • no class labels this time
  • want to group the documents into K groups or into
    a taxonomy
  • Each cluster hypothetically corresponds to a
    topic
  • Methods
  • Any of the well-known clustering methods
  • K-means
  • e.g., spherical k-means: normalize document vectors and compare
    them by cosine similarity (see the sketch after this list)
  • Hierarchical clustering
  • Probabilistic model-based clustering methods
  • e.g., mixtures of multinomials
  • Single-topic versus multiple-topic models
  • Extensions to author-topic models
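To make the k-means option above concrete, here is a minimal spherical k-means sketch in Python (numpy only; all function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def spherical_kmeans(X, K, n_iter=20, seed=0):
    """Cluster rows of X (documents x terms) by cosine similarity.

    X is assumed to be a dense TF-IDF-style matrix; rows are
    L2-normalized so that a dot product equals cosine similarity."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)      # unit-length documents
    centroids = X[rng.choice(len(X), K, replace=False)]   # init from data points
    for _ in range(n_iter):
        sims = X @ centroids.T                # cosine similarity to each centroid
        labels = sims.argmax(axis=1)          # assign each doc to nearest centroid
        for k in range(K):
            members = X[labels == k]
            if len(members):
                m = members.sum(axis=0)
                centroids[k] = m / np.linalg.norm(m)  # re-normalize the mean
    return labels, centroids
```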

4
Mixture Model Clustering

5
Mixture Model Clustering

6
Mixture Model Clustering

Conditional Independence model for each
component (often quite useful to first-order)
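Written out, the conditional-independence mixture named on this slide is the standard finite mixture; for a document vector x = (x1, ..., xd) with K components:

```latex
p(\mathbf{x}) \;=\; \sum_{k=1}^{K} p(c_k) \, p(\mathbf{x} \mid c_k)
            \;=\; \sum_{k=1}^{K} p(c_k) \prod_{j=1}^{d} p(x_j \mid c_k)
```

where p(c_k) are the component weights and, within a component, the terms are treated as conditionally independent.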
7
Mixtures of Documents
[Figure: a binary term-document matrix (rows = documents, columns = terms). Documents generated by Component 1 and documents generated by Component 2 form two blocks, each concentrating its 1s on a different subset of terms.]
8
[Figure: a binary term-document matrix (rows = documents, columns = terms) with no component labels shown.]
9
Treat as Missing
[Figure: the term-document matrix augmented with a cluster-label column; each document carries an unobserved label C1 or C2.]
10
Treat as Missing
[Figure: the unobserved labels are replaced by membership probabilities P(C1|x) and P(C2|x) for each document.]
E-Step: estimate component membership probabilities given current parameter estimates
11
Treat as Missing
[Figure: the same matrix; the membership probabilities P(C1|x), P(C2|x) now act as fractional weights on each document.]
M-Step: use "fractional" weighted data to get new estimates of the parameters
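The last two slides describe one EM iteration. As a hedged sketch (plain numpy/scipy; the variable names are mine, not the lecture's), the two steps for a mixture of multinomials look like this, where R holds the per-document membership probabilities P(Ck|x):

```python
import numpy as np
from scipy.special import logsumexp

def em_multinomial_mixture(X, K, n_iter=50, seed=0, smooth=1e-2):
    """EM for a K-component multinomial mixture over word counts.

    X: (n_docs, n_terms) array of counts. Returns mixing weights pi,
    term probabilities phi (K x n_terms), and responsibilities R."""
    rng = np.random.default_rng(seed)
    n, V = X.shape
    pi = np.full(K, 1.0 / K)
    phi = rng.dirichlet(np.ones(V), size=K)       # random init of p(term | component)
    for _ in range(n_iter):
        # E-step: component membership probabilities given current parameters
        log_r = np.log(pi) + X @ np.log(phi).T    # (n, K) unnormalized log-posteriors
        log_r -= logsumexp(log_r, axis=1, keepdims=True)
        R = np.exp(log_r)
        # M-step: "fractional" weighted data gives new parameter estimates
        pi = R.mean(axis=0)
        counts = R.T @ X + smooth                 # expected word counts per component
        phi = counts / counts.sum(axis=1, keepdims=True)
    return pi, phi, R
```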
12
A Document Cluster

Most likely terms in Component 5 (weight 0.08):

  TERM     p(t|k)
  write    0.571
  drive    0.465
  problem  0.369
  mail     0.364
  articl   0.332
  hard     0.323
  work     0.319
  system   0.303
  good     0.296
  time     0.273

Highest-lift terms in Component 5 (weight 0.08):

  TERM     LIFT  p(t|k)  p(t)
  scsi     7.7   0.13    0.02
  drive    5.7   0.47    0.08
  hard     4.9   0.32    0.07
  card     4.2   0.23    0.06
  format   4.0   0.12    0.03
  softwar  3.8   0.21    0.05
  memori   3.6   0.14    0.04
  install  3.6   0.14    0.04
  disk     3.5   0.12    0.03
  engin    3.3   0.21    0.06
13
Another Document Cluster

Most likely terms in Component 1 (weight 0.11):

  TERM       p(t|k)
  articl     0.684
  good       0.368
  dai        0.363
  fact       0.322
  god        0.320
  claim      0.294
  apr        0.279
  fbi        0.256
  christian  0.256
  group      0.239

Highest-lift terms in Component 1 (weight 0.11):

  TERM       LIFT  p(t|k)  p(t)
  fbi        8.3   0.26    0.03
  jesu       5.5   0.16    0.03
  fire       5.2   0.20    0.04
  christian  4.9   0.26    0.05
  evid       4.8   0.24    0.05
  god        4.6   0.32    0.07
  gun        4.2   0.17    0.04
  faith      4.2   0.12    0.03
  kill       3.8   0.22    0.06
  bibl       3.7   0.11    0.03
14
A topic is represented as a (multinomial) distribution over words.

  Example topic 1           Example topic 2
  SPEECH          .0691     WORDS        .0671
  RECOGNITION     .0412     WORD         .0557
  SPEAKER         .0288     USER         .0230
  PHONEME         .0224     DOCUMENTS    .0205
  CLASSIFICATION  .0154     TEXT         .0195
  SPEAKERS        .0140     RETRIEVAL    .0152
  FRAME           .0135     INFORMATION  .0144
  PHONETIC        .0119     DOCUMENT     .0144
  PERFORMANCE     .0111     LARGE        .0102
  ACOUSTIC        .0099     COLLECTION   .0098
  BASED           .0098     KNOWLEDGE    .0087
  PHONEMES        .0091     MACHINE      .0080
  UTTERANCES      .0091     RELEVANT     .0077
  SET             .0089     SEMANTIC     .0076
  LETTER          .0088     SIMILARITY   .0071


15
The basic model
[Graphical model: a single class variable C with conditionally independent children X1, X2, ..., Xd]
16
A better model
[Graphical model: several latent variables (A, B, C) shared across the observed words X1, X2, ..., Xd]
17
A better model
[Same graphical model as the previous slide]
History: latent class models in statistics; Hofmann applied them to documents (SIGIR 99); recent extensions, e.g., Blei, Ng, and Jordan (JMLR, 2003); variously known as factor/aspect/latent class models
18
A better model
[Same graphical model as the previous slide]
Inference can be intractable due to undirected loops!
19
A better model for documents.
  • Multi-topic model
  • A document is generated from multiple components
  • Multiple components can be active at once
  • Each component is a multinomial distribution
  • Parameter estimation is tricky
  • Very useful:
  • parses documents into high-level semantic components

20
A generative model for documents

topic 1: P(w|z=1) = φ(1)
  HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2,
  SCIENTIFIC 0.0, KNOWLEDGE 0.0, WORK 0.0, RESEARCH 0.0, MATHEMATICS 0.0

topic 2: P(w|z=2) = φ(2)
  HEART 0.0, LOVE 0.0, SOUL 0.0, TEARS 0.0, JOY 0.0,
  SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
21
Choose mixture weights for each document, then generate its bag of words.

θ = (P(z=1), P(z=2)), e.g. (0, 1), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1, 0)

Example documents, ordered by increasing weight on topic 1:

  MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC

  HEART LOVE TEARS KNOWLEDGE HEART MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART

  WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL

  TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
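A small Python sketch of this two-topic generative recipe (topic tables copied from the slide above; everything else, including the document length, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

topics = {
    1: {"HEART": .2, "LOVE": .2, "SOUL": .2, "TEARS": .2, "JOY": .2},
    2: {"SCIENTIFIC": .2, "KNOWLEDGE": .2, "WORK": .2,
        "RESEARCH": .2, "MATHEMATICS": .2},
}

def generate_doc(theta, n_words=10):
    """theta = (P(z=1), P(z=2)); returns a bag of words."""
    words = []
    for _ in range(n_words):
        z = 1 if rng.random() < theta[0] else 2    # choose a topic for this word
        vocab = list(topics[z])
        probs = list(topics[z].values())
        words.append(rng.choice(vocab, p=probs))   # choose a word from that topic
    return words

for theta in [(0, 1), (0.25, 0.75), (0.5, 0.5), (0.75, 0.25), (1, 0)]:
    print(theta, " ".join(generate_doc(theta)))
```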
22
A visual example: "bars"
Sample each pixel from a mixture of topics.
pixel = word; image = document
23
(No Transcript)
24
(No Transcript)
25
Interpretable decomposition
  • SVD gives a basis for the data, but not an
    interpretable one
  • The true basis is not orthogonal, so rotation
    does no good

26
[Figure: word distribution P(w) (Dumais, Landauer)]
27
History of multi-topic models
  • Latent class models in statistics
  • Hofmann 1999
  • Original application to documents
  • Blei, Ng, and Jordan (2001, 2003)
  • Variational methods
  • Griffiths and Steyvers (2003)
  • Gibbs sampling approach (very efficient)

28
A selection of topics
STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE SURFACE

NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER RAT NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC

TUMOR CANCER TUMORS HUMAN CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE

HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL

FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
29
A selection of topics
STUDIES PREVIOUS SHOWN RESULTS RECENT PRESENT STUDY DEMONSTRATED INDICATE WORK SUGGEST SUGGESTED USING FINDINGS DEMONSTRATE REPORT INDICATED CONSISTENT REPORTS CONTRAST

MECHANISM MECHANISMS UNDERSTOOD POORLY ACTION UNKNOWN REMAIN UNDERLYING MOLECULAR PS REMAINS SHOW RESPONSIBLE PROCESS SUGGEST UNCLEAR REPORT LEADING LARGELY KNOWN

MODEL MODELS EXPERIMENTAL BASED PROPOSED DATA SIMPLE DYNAMICS PREDICTED EXPLAIN BEHAVIOR THEORETICAL ACCOUNT THEORY PREDICTS COMPUTER QUANTITATIVE PREDICTIONS CONSISTENT PARAMETERS

CHROMOSOME REGION CHROMOSOMES KB MAP MAPPING CHROMOSOMAL HYBRIDIZATION ARTIFICIAL MAPPED PHYSICAL MAPS GENOMIC DNA LOCUS GENOME GENE HUMAN SITU CLONES

ADULT DEVELOPMENT FETAL DAY DEVELOPMENTAL POSTNATAL EARLY DAYS NEONATAL LIFE DEVELOPING EMBRYONIC BIRTH NEWBORN MATERNAL PRESENT PERIOD ANIMALS NEUROGENESIS ADULTS

PARASITE PARASITES FALCIPARUM MALARIA HOST PLASMODIUM ERYTHROCYTES ERYTHROCYTE MAJOR LEISHMANIA INFECTED BLOOD INFECTION MOSQUITO INVASION TRYPANOSOMA CRUZI BRUCEI HUMAN HOSTS

MALE FEMALE MALES FEMALES SEX SEXUAL BEHAVIOR OFFSPRING REPRODUCTIVE MATING SOCIAL SPECIES REPRODUCTION FERTILITY TESTIS MATE GENETIC GERM CHOICE SRY
30
A selection of topics
(repeat of the previous slide)
31
Topic 1: GROUP 0.057185, MULTICAST 0.051620, INTERNET 0.049499, PROTOCOL 0.041615, RELIABLE 0.020877, GROUPS 0.019552, PROTOCOLS 0.019088, IP 0.014980, TRANSPORT 0.012529, DRAFT 0.009945

Topic 2: DYNAMIC 0.152141, STRUCTURE 0.137964, STRUCTURES 0.088040, STATIC 0.043452, PAPER 0.032706, DYNAMICALLY 0.023940, PRESENT 0.015328, META 0.015175, CALLED 0.011669, RECURSIVE 0.010145

Topic 3: DISTRIBUTED 0.192926, COMPUTING 0.044376, SYSTEMS 0.038601, SYSTEM 0.031797, HETEROGENEOUS 0.030996, ENVIRONMENT 0.023163, PAPER 0.017960, SUPPORT 0.016587, ARCHITECTURE 0.016416, ENVIRONMENTS 0.013271

Topic 4: RESEARCH 0.066798, SUPPORTED 0.043233, PART 0.035590, GRANT 0.034476, SCIENCE 0.023250, FOUNDATION 0.022653, FL 0.021220, WORK 0.021061, NATIONAL 0.019947, NSF 0.018116

Content components (topics 1-3); boilerplate components (topic 4)
32
Topic 5: DIMENSIONAL 0.038901, POINTS 0.037263, SURFACE 0.031438, GEOMETRIC 0.025006, SURFACES 0.020152, MESH 0.016875, PLANE 0.013902, POINT 0.013780, GEOMETRY 0.013780, PLANAR 0.012385

Topic 6: RULES 0.090569, CLASSIFICATION 0.062699, RULE 0.062174, ACCURACY 0.028926, ATTRIBUTES 0.023090, INDUCTION 0.021909, CLASSIFIER 0.019418, SET 0.018303, ATTRIBUTE 0.016204, CLASSIFIERS 0.015417

Topic 7: ORDER 0.192759, TERMS 0.048688, PARTIAL 0.044907, HIGHER 0.041284, REDUCTION 0.035061, PAPER 0.028602, TERM 0.018204, ORDERING 0.017652, SHOW 0.017022, MAGNITUDE 0.015526

Topic 8: GRAPH 0.095687, PATH 0.061784, GRAPHS 0.061217, PATHS 0.030151, EDGE 0.028590, NUMBER 0.022775, CONNECTED 0.016817, DIRECTED 0.014405, NODES 0.013625, VERTICES 0.013554

Topic 9: INFORMATION 0.281237, TEXT 0.048675, RETRIEVAL 0.044046, SOURCES 0.029548, DOCUMENT 0.029000, DOCUMENTS 0.026503, RELEVANT 0.018523, CONTENT 0.016574, AUTOMATICALLY 0.009326, DIGITAL 0.008777

Topic 10: SYSTEM 0.143873, FILE 0.054076, OPERATING 0.053963, STORAGE 0.039072, DISK 0.029957, SYSTEMS 0.029221, KERNEL 0.028655, ACCESS 0.018293, MANAGEMENT 0.017218, UNIX 0.016878

Topic 11: PAPER 0.077870, CONDITIONS 0.041187, CONCEPT 0.036268, CONCEPTS 0.033457, DISCUSSED 0.027414, DEFINITION 0.024673, ISSUES 0.024603, PROPERTIES 0.021511, IMPORTANT 0.021370, EXAMPLES 0.019754

Topic 12: LANGUAGE 0.158786, PROGRAMMING 0.097186, LANGUAGES 0.082410, FUNCTIONAL 0.032815, SEMANTICS 0.027003, SEMANTIC 0.024341, NATURAL 0.016410, CONSTRUCTS 0.014129, GRAMMAR 0.013640, LISP 0.010326
33
Topic 13: MODEL 0.429185, MODELS 0.201810, MODELING 0.066311, QUALITATIVE 0.018417, COMPLEX 0.009272, QUANTITATIVE 0.005662, CAPTURE 0.005301, MODELED 0.005301, ACCURATELY 0.004639, REALISTIC 0.004278

Topic 14: PAPER 0.050411, APPROACHES 0.045245, PROPOSED 0.043132, CHANGE 0.040393, BELIEF 0.025835, ALTERNATIVE 0.022470, APPROACH 0.020905, ORIGINAL 0.019026, SHOW 0.017852, PROPOSE 0.016991

Topic 15: TYPE 0.088650, SPECIFICATION 0.051469, TYPES 0.046571, FORMAL 0.036892, VERIFICATION 0.029987, SPECIFICATIONS 0.024439, CHECKING 0.024439, SYSTEM 0.023259, PROPERTIES 0.018242, ABSTRACT 0.016826

Topic 16: KNOWLEDGE 0.212603, SYSTEM 0.090852, SYSTEMS 0.051978, BASE 0.042277, EXPERT 0.020172, ACQUISITION 0.017816, DOMAIN 0.016638, INTELLIGENT 0.015737, BASES 0.015390, BASED 0.014004

Style components
34
Recent Results on Author-Topic Models
35
[Figure: authors and the words of their documents]
Can we model authors, given documents? (more
generally, build statistical profiles of
entities given sparse observed data)
36
[Figure: Authors → Hidden Topics → Words]
Model: Author-Topic distributions and Topic-Word distributions; parameters learned via Bayesian learning
37-42
[Animation: the Authors → Hidden Topics → Words graphical model built up step by step]
43
[Figure: Hidden Topics → Words]
Topic model: a document can be generated from multiple topics. Hofmann (SIGIR 99); Blei, Ng, Jordan (JMLR, 2003)
44
[Figure: Authors → Hidden Topics → Words]
Model: Author-Topic distributions and Topic-Word distributions. NOTE: documents can be composed of multiple topics
45
The Author-Topic Model: Assumptions of the Generative Model
  • Each author is associated with a topic mixture
  • Each document is a mixture of topics
  • With multiple authors, the document is a mixture of the
    coauthors' topic mixtures
  • Each word in a text is generated from one topic and one author
    (potentially different for each word)

46
Generative Process
  • Let's assume authors A1 and A2 collaborate to produce a paper
  • A1 has multinomial topic distribution θ1
  • A2 has multinomial topic distribution θ2
  • For each word in the paper (see the sketch below):
  • Sample an author x uniformly from {A1, A2}
  • Sample a topic z from θx
  • Sample a word w from the multinomial topic distribution φz
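A Python sketch of those four sampling steps (the Dirichlet draws below stand in for learned θ and φ and are purely illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics = 5
vocab = ["word%d" % i for i in range(20)]
theta = {"A1": rng.dirichlet(np.ones(n_topics)),         # author-topic distributions
         "A2": rng.dirichlet(np.ones(n_topics))}         # (placeholders, not learned)
phi = rng.dirichlet(np.ones(len(vocab)), size=n_topics)  # topic-word distributions

def generate_word(coauthors):
    x = rng.choice(coauthors)                # sample an author uniformly
    z = rng.choice(n_topics, p=theta[x])     # sample a topic from theta_x
    w = rng.choice(vocab, p=phi[z])          # sample a word from phi_z
    return x, z, w

paper = [generate_word(["A1", "A2"]) for _ in range(8)]
```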

47
Graphical Model
From the set of co-authors
1. Choose an author
2. Choose a topic
3. Choose a word
48
Model Estimation
  • Estimate x and z by Gibbs sampling (assignments of each word to
    an author and topic)
  • Estimation is efficient: linear in data size
  • Infer:
  • Author-Topic distributions (Θ)
  • Topic-Word distributions (Φ)
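For reference, the collapsed Gibbs update used in the author-topic literature (Rosen-Zvi et al.) takes the following form, quoted from that literature rather than from these slides. Here a ranges over the document's coauthors, C^WT and C^AT are word-topic and author-topic count matrices (excluding the word being resampled), V is the vocabulary size, T the number of topics, and α, β Dirichlet hyperparameters:

```latex
P(x_i = a,\; z_i = t \mid w_i = w, \mathbf{z}_{-i}, \mathbf{x}_{-i})
\;\propto\;
\frac{C^{WT}_{wt} + \beta}{\sum_{w'} C^{WT}_{w't} + V\beta}
\cdot
\frac{C^{AT}_{at} + \alpha}{\sum_{t'} C^{AT}_{at'} + T\alpha}
```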

49
Data
  • 1,700 proceedings papers from NIPS (2,000 authors)
  • (NIPS = Neural Information Processing Systems)
  • 160,000 CiteSeer abstracts (85,000 authors)
  • Removed stop words
  • Word order is irrelevant; just use word counts
  • Processing time:
  • NIPS: 2,000 Gibbs iterations took about 12 hours on a PC workstation
  • CiteSeer: 700 Gibbs iterations took about 111 hours

50
Author Modeling Data Sets
  Source     Documents   Unique Authors   Unique Words   Total Word Count
  CiteSeer   163,389     85,465           30,799         11.7 million
  CORA       13,643      11,427           11,101         1.2 million
  NIPS       1,740       2,037            13,649         2.3 million
51
Four example topics from CiteSeer (T = 300)
52
Four more topics
53
Some likely topics per author (CiteSeer)
  • Author: Andrew McCallum, UMass
  • Topic 1: classification, training, generalization, decision, data, ...
  • Topic 2: learning, machine, examples, reinforcement, inductive, ...
  • Topic 3: retrieval, text, document, information, content, ...
  • Author: Hector Garcia-Molina, Stanford
  • Topic 1: query, index, data, join, processing, aggregate, ...
  • Topic 2: transaction, concurrency, copy, permission, distributed, ...
  • Topic 3: source, separation, paper, heterogeneous, merging, ...
  • Author: Paul Cohen, USC/ISI
  • Topic 1: agent, multi, coordination, autonomous, intelligent, ...
  • Topic 2: planning, action, goal, world, execution, situation, ...
  • Topic 3: human, interaction, people, cognitive, social, natural, ...

54
Four example topics from NIPS (T = 100)
55
Four more topics
56
Stability of Topics
  • Which content a given topic number carries is arbitrary across
    runs of the model (e.g., topic 1 is not the same topic across runs)
  • However:
  • The majority of topics are stable over processing time
  • The majority of topics can be aligned across runs
  • Topics appear to represent genuine structure in the data

57
Comparing NIPS topics from the same chain (t1 = 1000 and t2 = 2000)
[Figure: KL distance matrix between topics at t1 = 1000 (rows) and re-ordered topics at t2 = 2000 (columns)]
Best KL = 0.54; worst KL = 4.78
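One hedged way to implement such an alignment (illustrative, not necessarily what was done here): compute pairwise symmetrized KL distances between the two runs' topic-word distributions, then greedily match topics:

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL distance between two discrete distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def align_topics(phi_a, phi_b):
    """Greedily match each topic in phi_a to its nearest topic in phi_b.

    phi_a, phi_b: (T, V) arrays of topic-word probabilities.
    Returns (topic_a, topic_b, kl) triples, best matches first."""
    D = np.array([[sym_kl(pa, pb) for pb in phi_b] for pa in phi_a])
    pairs, used_a, used_b = [], set(), set()
    for _ in range(len(phi_a)):
        masked = D.copy()
        masked[list(used_a), :] = np.inf    # exclude already-matched rows
        masked[:, list(used_b)] = np.inf    # and columns
        i, j = np.unravel_index(masked.argmin(), masked.shape)
        pairs.append((int(i), int(j), D[i, j]))
        used_a.add(i); used_b.add(j)
    return pairs
```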
58
Comparing NIPS topics and CiteSeer topics
[Figure: KL distance matrix between NIPS topics (rows) and re-ordered CiteSeer topics (columns); marked values range from KL = 2.88 to KL = 5.0 (e.g., 4.48, 4.92)]
59
Detecting Unusual Papers by Authors
  • For any paper by an author, we can calculate how surprising its
    words are; some papers turn out to be on topics unusual for that
    author

Papers ranked by unusualness (perplexity) for C.
Faloutsos
Papers ranked by unusualness (perplexity) for M.
Jordan
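A sketch of a perplexity score that could drive such a ranking, assuming an author-specific topic mixture θ and topic-word matrix φ (the exact scoring behind these rankings may differ):

```python
import numpy as np

def perplexity(word_ids, theta_author, phi):
    """Perplexity of a document's words under an author's topic mixture.

    theta_author: (T,) author-topic distribution; phi: (T, V) topic-word.
    Higher perplexity = more surprising for this author."""
    p_words = theta_author @ phi        # (V,) mixture word distribution
    log_p = np.log(p_words[word_ids])   # log-likelihood of each observed word
    return float(np.exp(-log_p.mean()))
```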
60
Author Separation
  • Can the model attribute words to the correct author within a
    document?
  • Test of the model:
  • 1) artificially combine abstracts from different authors
  • 2) check whether each word is assigned to its correct original
    author
  • (in the two abstracts below, the superscript digit on each word
    marks the author the model assigned it to)
  • A method1 is described which like the kernel1
    trick1 in support1 vector1 machines1 SVMs1 lets
    us generalize distance1 based2 algorithms to
    operate in feature1 spaces usually nonlinearly
    related to the input1 space This is done by
    identifying a class of kernels1 which can be
    represented as norm1 based2 distances1 in Hilbert
    spaces It turns1 out that common kernel1
    algorithms such as SVMs1 and kernel1 PCA1 are
    actually really distance1 based2 algorithms and
    can be run2 with that class of kernels1 too As
    well as providing1 a useful new insight1 into how
    these algorithms work the present2 work can form
    the basis1 for conceiving new algorithms
  • This paper presents2 a comprehensive approach for
    model2 based2 diagnosis2 which includes proposals
    for characterizing and computing2 preferred2
    diagnoses2 assuming that the system2 description2
    is augmented with a system2 structure2 a
    directed2 graph2 explicating the interconnections
    between system2 components2 Specifically we first
    introduce the notion of a consequence2 which is a
    syntactically2 unconstrained propositional2
    sentence2 that characterizes all consistency2
    based2 diagnoses2 and show2 that standard2
    characterizations of diagnoses2 such as minimal
    conflicts1 correspond to syntactic2 variations1
    on a consequence2 Second we propose a new
    syntactic2 variation on the consequence2 known as
    negation2 normal form NNF and discuss its merits
    compared to standard variations Third we
    introduce a basic algorithm2 for computing
    consequences in NNF given a structured system2
    description We show that if the system2
    structure2 does not contain cycles2 then there is
    always a linear size2 consequence2 in NNF which
    can be computed in linear time2 For arbitrary1
    system2 structures2 we show a precise connection
    between the complexity2 of computing2
    consequences and the topology of the underlying
    system2 structure2 Finally we present2 an
    algorithm2 that enumerates2 the preferred2
    diagnoses2 characterized by a consequence2 The
    algorithm2 is shown1 to take linear time2 in the
    size2 of the consequence2 if the preference
    criterion1 satisfies some general conditions

Written by (1) Scholkopf_B
Written by (2) Darwiche_A
61
Applications of Author-Topic Models
  • Expert Finder
  • Find researchers who are knowledgeable in
    cryptography and machine learning within 100
    miles of Washington DC
  • Find reviewers for this set of NSF proposals who
    are active in relevant topics and have no
    conflicts of interest
  • Prediction
  • Given a document and some subset of known authors for the paper
    (k = 0, 1, 2), predict the other authors
  • Predict how many papers in different topics will
    appear next year
  • Change Detection/Monitoring
  • Which authors are on the leading edge of new
    topics?
  • Characterize the topic trajectory of this
    author over time

62
(No Transcript)
63
Rise in Web, Mobile, JAVA
[Figure: trend curves for Web, Mobile, Java]
64
Rise of Machine Learning
65
Bayes lives on.
66
Decline in Languages, OS, ...
67
Decline in CS Theory, ...
68
Trends in Database Research
69
Trends in NLP and IR
[Figure: trend curves for NLP and IR]
70
Security Research Reborn
71
(Not so) Hot Topics
[Figure: trend curves for Neural Networks, GAs, Wavelets]
72
Decline in use of Greek Letters?
73
Future Work
  • Theory development
  • Incorporate citation information, collaboration
    networks
  • Other document types, e.g., email
  • handling subject lines, email threads, and "to" and "cc" fields
  • New datasets
  • Enron email corpus
  • Web pages
  • PubMed abstracts (possibly)

74
New applications of author-topic models
  • Black box for text document collection
    summarization
  • Automatically extract a summary of relevant
    topics and author patterns for a large data set
    such as Enron email
  • Expert Finder
  • Find researchers who are knowledgeable in
    cryptography and machine learning within 100
    miles of Washington DC
  • Find reviewers for this set of NSF proposals who
    are active in relevant topics and have no
    conflicts of interest
  • Change Detection/Monitoring
  • Which authors are on the leading edge of new
    topics?
  • Characterize the topic trajectory of this
    author over time
  • Prediction (work in progress)
  • Given a document and some subset of known authors for the paper
    (k = 0, 1, 2), predict the other authors
  • Predict how many papers in different topics will
    appear next year

75
The Author-Topic Browser
(a) querying on author Pazzani_M
(b) querying on a topic relevant to the author
(c) querying on a document written by the author
76
Scientific syntax and semantics
Factorization of language based on statistical dependency patterns:
  • long-range, document-specific dependencies → semantics (probabilistic topics)
  • short-range dependencies, constant across all documents → syntax (probabilistic regular grammar)
[Figure: graphical model with per-document topic weights θ, topic assignments z generating words w, and a Markov chain of syntactic classes x]
77
[Figure: a three-class HMM.
Class x = 1 defers to the topic model: z = 1 (prob 0.4) or z = 2 (prob 0.6);
  topic 1: HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2;
  topic 2: SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2.
Class x = 2 emits: OF 0.6, FOR 0.3, BETWEEN 0.1.
Class x = 3 emits: THE 0.6, A 0.3, MANY 0.1.
Arrows between classes carry transition probabilities 0.8, 0.7, 0.9, 0.1, 0.3, 0.2.]
78
[Same HMM figure as the previous slide]
Generated so far: THE
79
[Same HMM figure as the previous slide]
Generated so far: THE LOVE
80
[Same HMM figure as the previous slide]
Generated so far: THE LOVE OF
81
[Same HMM figure as the previous slide]
Generated so far: THE LOVE OF RESEARCH
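A toy Python sketch of this composite HMM-plus-topics generator (the emission tables are taken from the slides; the transition matrix below is invented for illustration, since the figure's transition structure is ambiguous in this transcript):

```python
import numpy as np

rng = np.random.default_rng(7)

# Class emissions (from the slides); class 1 defers to the topic model.
class2 = {"OF": .6, "FOR": .3, "BETWEEN": .1}     # function words
class3 = {"THE": .6, "A": .3, "MANY": .1}         # determiners
topics = {1: {"HEART": .2, "LOVE": .2, "SOUL": .2, "TEARS": .2, "JOY": .2},
          2: {"SCIENTIFIC": .2, "KNOWLEDGE": .2, "WORK": .2,
              "RESEARCH": .2, "MATHEMATICS": .2}}
topic_weights = [.4, .6]                          # P(z = 1), P(z = 2) from the slide

# Illustrative (invented) transition matrix over classes 1..3.
trans = {1: [.1, .5, .4], 2: [.8, .1, .1], 3: [.9, .05, .05]}

def emit(x):
    if x == 1:                                    # semantic class: pick topic, then word
        z = rng.choice([1, 2], p=topic_weights)
        table = topics[z]
    else:
        table = class2 if x == 2 else class3
    return rng.choice(list(table), p=list(table.values()))

x, words = 3, []                                  # start in the determiner class
for _ in range(4):
    words.append(emit(x))
    x = rng.choice([1, 2, 3], p=trans[x])
print(" ".join(words))                            # e.g. "THE LOVE OF RESEARCH"
```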
82
Semantic topics
83
Syntactic classes
Class 5:  IN FOR ON BETWEEN DURING AMONG FROM UNDER WITHIN THROUGHOUT THROUGH TOWARD INTO AT INVOLVING AFTER ACROSS AGAINST WHEN ALONG
Class 8:  ARE WERE WAS IS WHEN REMAIN REMAINS REMAINED PREVIOUSLY BECOME BECAME BEING BUT GIVE MERE APPEARED APPEAR ALLOWED NORMALLY EACH
Class 14: THE THIS ITS THEIR AN EACH ONE ANY INCREASED EXOGENOUS OUR RECOMBINANT ENDOGENOUS TOTAL PURIFIED TILE FULL CHRONIC ANOTHER EXCESS
Class 25: SUGGEST INDICATE SUGGESTING SUGGESTS SHOWED REVEALED SHOW DEMONSTRATE INDICATING PROVIDE SUPPORT INDICATES PROVIDES INDICATED DEMONSTRATED SHOWS SO REVEAL DEMONSTRATES SUGGESTED
Class 26: LEVELS NUMBER LEVEL RATE TIME CONCENTRATIONS VARIETY RANGE CONCENTRATION DOSE FAMILY SET FREQUENCY SERIES AMOUNTS RATES CLASS VALUES AMOUNT SITES
Class 30: RESULTS ANALYSIS DATA STUDIES STUDY FINDINGS EXPERIMENTS OBSERVATIONS HYPOTHESIS ANALYSES ASSAYS POSSIBILITY MICROSCOPY PAPER WORK EVIDENCE FINDING MUTAGENESIS OBSERVATION MEASUREMENTS
Class 33: BEEN MAY CAN COULD WELL DID DOES DO MIGHT SHOULD WILL WOULD MUST CANNOT REMAINED ALSO THEY BECOME MAG LIKELY
84
Syntactic classes
(repeat of the previous slide)