Title: Predictive Profiling from Massive Transactional Data Sets
1 Statistical Modeling of Large Text Collections
Padhraic Smyth
Department of Computer Science, University of California, Irvine
MURI Project Kick-off Meeting, November 18th, 2008
3 The Text Revolution
- Widespread availability of text in digital form is driving many new applications based on automated text analysis:
  - Categorization/classification
  - Automated summarization
  - Machine translation
  - Information extraction
  - and so on
- Most of this work is happening in computing, but many of the underlying techniques are statistical
4 Motivation
- Pennsylvania Gazette: 80,000 articles, 1728-1800
- MEDLINE: 16 million articles
- New York Times: 1.5 million articles
6 Problems of Interest
- What topics do these documents span?
- Which documents are about a particular topic?
- How have topics changed over time?
- What does author X write about?
- and so on
- Key ideas:
  - Learn a probabilistic model over words and documents
  - Treat query-answering as computation of appropriate conditional probabilities
7 Topic Models for Documents
- P(word | document) = ?
- P(word | document) = Σ_topics P(word | topic) P(topic | document)
- Each topic is a probability distribution over words
- The mixing coefficients P(topic | document) vary from document to document
- Both are automatically learned from the text corpus
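A minimal numerical sketch of this decomposition, assuming NumPy and toy values (the names phi and theta for the topic-word and document-topic parameters are illustrative, not taken from the slides):

    import numpy as np

    # phi[t, v]  = P(word v | topic t): one multinomial over the vocabulary per topic
    # theta[t]   = P(topic t | document): mixing coefficients for one document
    phi = np.array([[0.5, 0.3, 0.2],
                    [0.1, 0.2, 0.7]])
    theta = np.array([0.6, 0.4])

    # P(word | document) = sum over topics of P(word | topic) * P(topic | document)
    p_word_given_doc = theta @ phi
    print(p_word_given_doc)   # [0.34 0.26 0.4], one probability per vocabulary word, sums to 1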
8 Topics = Multinomials over Words
9 Topics = Multinomials over Words
10 Basic Concepts
- Topics = distributions over words
  - Unknown a priori, learned from data
- Documents are represented as mixtures of topics
- Learning algorithm:
  - Gibbs sampling (stochastic search)
  - Linear time per iteration
- Provides a full probabilistic model over words, documents, and topics
- Query answering = computation of conditional probabilities
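For instance, the query "which documents are about topic t?" reduces to ranking documents by the conditional probability P(topic t | document); a toy three-document sketch (hypothetical values, assuming NumPy):

    import numpy as np

    # theta[d, t] = P(topic t | document d), as estimated by the model
    theta = np.array([[0.70, 0.20, 0.10],
                      [0.10, 0.80, 0.10],
                      [0.30, 0.30, 0.40]])

    topic = 1
    ranking = np.argsort(-theta[:, topic])   # documents sorted by P(topic 1 | doc), descending
    print(ranking)                           # [1 2 0]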
11 Enron Email Data
- 250,000 emails, 28,000 individuals, 1999-2002
12 Enron email: business topics
13 Enron: non-work topics
14 Enron: public-interest topics ...
15 Examples of Topics from the New York Times
- Stock Market: WEEK DOW_JONES POINTS 10_YR_TREASURY_YIELD PERCENT CLOSE NASDAQ_COMPOSITE STANDARD_POOR CHANGE FRIDAY DOW_INDUSTRIALS GRAPH_TRACKS EXPECTED BILLION NASDAQ_COMPOSITE_INDEX EST_02 PHOTO_YESTERDAY YEN 10 500_STOCK_INDEX
- Wall Street Firms: WALL_STREET ANALYSTS INVESTORS FIRM GOLDMAN_SACHS FIRMS INVESTMENT MERRILL_LYNCH COMPANIES SECURITIES RESEARCH STOCK BUSINESS ANALYST WALL_STREET_FIRMS SALOMON_SMITH_BARNEY CLIENTS INVESTMENT_BANKING INVESTMENT_BANKERS INVESTMENT_BANKS
- Terrorism: SEPT_11 WAR SECURITY IRAQ TERRORISM NATION KILLED AFGHANISTAN ATTACKS OSAMA_BIN_LADEN AMERICAN ATTACK NEW_YORK_REGION NEW MILITARY NEW_YORK WORLD NATIONAL QAEDA TERRORIST_ATTACKS
- Bankruptcy: BANKRUPTCY CREDITORS BANKRUPTCY_PROTECTION ASSETS COMPANY FILED BANKRUPTCY_FILING ENRON BANKRUPTCY_COURT KMART CHAPTER_11 FILING COOPER BILLIONS COMPANIES BANKRUPTCY_PROCEEDINGS DEBTS RESTRUCTURING CASE GROUP
16 Topic Trends from the New York Times
- 330,000 articles, 2000-2002
- Tour-de-France: TOUR RIDER LANCE_ARMSTRONG TEAM BIKE RACE FRANCE
- Quarterly Earnings: COMPANY QUARTER PERCENT ANALYST SHARE SALES EARNING
- Anthrax: ANTHRAX LETTER MAIL WORKER OFFICE SPORES POSTAL BUILDING
20 What does an author write about?
- Author: Jerry Friedman, Stanford
  - Topic 1: regression, estimate, variance, data, series, ...
  - Topic 2: classification, training, accuracy, decision, data, ...
  - Topic 3: distance, metric, similarity, measure, nearest, ...
- Author: Rakesh Agrawal, IBM
  - Topic 1: index, data, update, join, efficient, ...
  - Topic 2: query, database, relational, optimization, answer, ...
  - Topic 3: data, mining, association, discovery, attributes, ...
21 Examples of Data Sets Modeled
- 1,200 Bible chapters (KJV)
- 4,000 blog entries
- 20,000 PNAS abstracts
- 80,000 Pennsylvania Gazette articles
- 250,000 Enron emails
- 300,000 North Carolina vehicle accident police reports
- 500,000 New York Times articles
- 650,000 CiteSeer abstracts
- 8 million MEDLINE abstracts
- Books by Austen, Dickens, and Melville
- ...
- Exactly the same algorithm was used in all cases, and in all cases interpretable topics were produced automatically
22 Related Work
- Statistical origins:
  - Latent class models in statistics (late 1960s)
  - Admixture models in genetics
- LDA model: Blei, Ng, and Jordan (2003)
  - Variational EM
- Topic model: Griffiths and Steyvers (2004)
  - Collapsed Gibbs sampler
- Alternative approaches:
  - Latent semantic indexing (LSI/LSA): less interpretable, not appropriate for count data
  - Document clustering: simpler but less powerful
25 Clusters v. Topics
- Example abstract: "Hidden Markov Models in Molecular Biology: New Algorithms and Applications" (Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A. McClure). Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs that, unlike the classical Baum-Welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual Viterbi most-likely-path approximation. Left-right HMMs with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. In all cases, the models derived capture all the important statistical properties of the families and can be used efficiently in a number of important tasks such as multiple alignment, motif detection, and classification.
- One cluster (cluster 88): model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov
- Multiple topics:
  - Topic 10: state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling
  - Topic 37: genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes
26 Extensions
- Author-topic models: authors = mixtures over topics (Steyvers, Smyth, Rosen-Zvi, Griffiths, 2004)
- Special-words model: documents = mixtures of topics plus idiosyncratic words (Chemudugunta, Smyth, Steyvers, 2006)
- Entity-topic models: topic models that can reason about entities (Newman, Chemudugunta, Smyth, Steyvers, 2006)
- See also work by McCallum, Blei, Buntine, Welling, Fienberg, Xing, etc.
- The probabilistic basis allows for a wide range of generalizations
27 Combining Models for Networks and Text
31 Technical Approach and Challenges
- Develop flexible probabilistic network models that can incorporate textual information
  - e.g., ERGMs with text as node or edge covariates
  - e.g., latent space models with text-based covariates
  - e.g., dynamic relational models with text as edge covariates
- Research challenges:
  - Computational scalability: ERGMs are not directly applicable to large text data sets
  - What text representation to use: high-dimensional bag of words? low-dimensional latent topics?
  - Utility of text: does incorporating textual information produce more accurate models or predictions, and how can this be quantified?
32 Graphical Model
(Diagram: a group variable z generating Word 1, Word 2, ..., Word n)
33 Graphical Model
(Diagram: the same model in plate notation — group variable z and word w, with a plate over the n words)
34 Graphical Model
(Diagram: as above, with an outer plate over the D documents)
35 Mixture Model for Documents
(Diagram: group probabilities α, group variable z, group-word distributions φ, word w; plates over the n words and D documents)
36 Clustering with a Mixture Model
(Diagram: the same structure relabeled — cluster probabilities α, cluster variable z, cluster-word distributions φ, word w; plates over the n words and D documents)
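In this clustering view, each document is effectively assigned a single cluster, so the probability of a document factors as a standard finite mixture (an identity implied by, but not written on, the slide):

P(document) = Σ_clusters P(cluster) Π_{words in document} P(word | cluster)

The topic model on the next slide instead draws a separate topic for every word position, which is what allows one document to mix several topics (cf. the "Clusters v. Topics" example on slide 25).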
37 Graphical Model for Topics
(Diagram: document-topic distributions θ, topic z, topic-word distributions φ, word w; plates over the n words and D documents)
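A minimal sketch of the generative process this plate diagram encodes, for a single document (standard LDA-style sampling; the symmetric Dirichlet priors and all variable names are assumptions for illustration):

    import numpy as np

    T, V, n_words = 3, 10, 20                  # topics, vocabulary size, words in the document
    rng = np.random.default_rng(0)

    phi = rng.dirichlet(np.ones(V), size=T)    # topic-word distributions, one per topic
    theta = rng.dirichlet(np.ones(T))          # document-topic distribution for this document

    doc = []
    for _ in range(n_words):
        z = rng.choice(T, p=theta)             # draw a topic for this word position
        w = rng.choice(V, p=phi[z])            # draw a word from that topic's distribution
        doc.append(w)
    print(doc)                                 # 20 vocabulary ids sampled from the mixed topics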
38 Learning via Gibbs Sampling
- Gibbs sampler to estimate z for each word occurrence, marginalizing over the other parameters
(Diagram: the same plate model — document-topic distributions θ, topic z, topic-word distributions φ, word w, plates over n and D)
39 More Details on Learning
- Gibbs sampling for word-topic assignments (z)
  - 1 iteration = a full pass through all words in all documents
  - Typically run a few hundred Gibbs iterations
- Estimating θ and φ
  - Use the z samples to get point estimates
  - Non-informative Dirichlet priors for θ and φ
- Computational efficiency
  - Learning is linear in the number of word tokens
  - Can still take on the order of a day on 100k or more documents
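As a concrete illustration, a minimal sketch of one collapsed Gibbs sweep over the word-topic assignments z (a standard LDA-style counting update; the array layout, names, and hyperparameters alpha and beta are assumptions, not taken from the slides):

    import numpy as np

    def gibbs_sweep(words, docs, z, ndt, ntw, nt, alpha, beta, rng):
        # One full pass through all word tokens: linear in the number of tokens.
        # words[i], docs[i] : vocabulary id and document id of token i
        # z[i]              : current topic assignment of token i
        # ndt[d, t], ntw[t, w], nt[t] : document-topic, topic-word, and topic counts
        T, V = ntw.shape
        for i in range(len(words)):
            d, w, t_old = docs[i], words[i], z[i]
            # remove token i from the counts
            ndt[d, t_old] -= 1; ntw[t_old, w] -= 1; nt[t_old] -= 1
            # P(z_i = t | all other assignments), with theta and phi integrated out
            p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + V * beta)
            t_new = rng.choice(T, p=p / p.sum())
            # add token i back under its new topic
            z[i] = t_new
            ndt[d, t_new] += 1; ntw[t_new, w] += 1; nt[t_new] += 1
        return z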
40 Gibbs Sampler Stability