Title: Monitoring Message Streams
1 Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages
Fred Roberts, Rutgers University
2 THE MONITORING MESSAGE STREAMS PROJECT TEAM
- Endre Boros, Rutgers Operations Research
- Paul Kantor, Rutgers Information and Library Studies
- Dave Lewis, Consultant
- Ilya Muchnik, Rutgers DIMACS/CS
- S. Muthukrishnan, Rutgers CS
- David Madigan, Rutgers Statistics
- Rafail Ostrovsky, Telcordia Technologies
- Fred Roberts, Rutgers DIMACS/Math
- Martin Strauss, AT&T Labs
- Wen-Hua Ju, Avaya Labs (collaborator)
- Andrei Anghelescu, graduate student
- Dmitry Fradkin, graduate student
- Alex Genkin, programmer
- Vladimir Menkov, programmer
3 OBJECTIVE
- Monitor huge streams of textualized communication to automatically detect pattern changes and "significant" events
- Motivation: monitoring email traffic
4 TECHNICAL PROBLEM
- Given a stream of text in any language.
- Decide whether "new events" are present in the flow of messages.
- Event: a new topic, or a topic with an unusual level of activity.
- Initial problem: retrospective or supervised event identification, i.e., classification into pre-existing classes.
5 TECHNICAL PROBLEM
- Batch filtering: relevant documents are given up front.
- Adaptive filtering: pay for information about relevance as the process moves along.
6 MORE COMPLEX PROBLEM: PROSPECTIVE DETECTION OR UNSUPERVISED LEARNING
- Classes change: new classes appear, or existing classes change meaning
- A difficult problem in statistics
- Recent new C.S. approaches
- Semi-supervised learning
  - The algorithm suggests a possible new class
  - A human analyst labels it and determines its significance
7 COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
- (1) Compression of text -- to meet storage and processing limitations
- (2) Representation of text -- put it in a form amenable to computation and statistical analysis
- (3) Matching scheme -- computing similarity between documents
- (4) Learning method -- build on judged examples to determine the characteristics of a document cluster (event)
- (5) Fusion scheme -- combine methods (scores) to yield improved detection/clustering
8 COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - II
- These distinctions are somewhat arbitrary.
- Many approaches to message processing overlap several of these components.
- Project premise: existing methods don't exploit the full power of the five components, synergies among them, and/or an understanding of how to apply them to text data.
9 COMPRESSION
- Reduce the dimension before statistical analysis.
- We often have just one shot at the data as it comes streaming by.
10 COMPRESSION II
- Recent results: one pass through the data can reduce volume significantly without degrading performance significantly.
We believe that sophisticated dimension-reduction methods in a preprocessing stage, followed by sophisticated statistical tools in a detection/filtering stage, can be a very powerful approach. Our methods so far give us some confidence that we are right.
11 REPRESENTATIONS
- Term representation: binary, term frequency (TF), log TF, expTF
- Term weighting: IDF (inverse document frequency), IDF2, statistical
  - IDF(x) = log(1 + N/(i(x) + 1)), where N = number of documents and i(x) = number of documents in which term x appears
- Document normalization: L1 or L2
  - L1 = sum_i |x_i|, L2 = sqrt(sum_i x_i^2)
(A code sketch of these representation steps follows this slide.)
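As a concrete illustration of these representation choices, here is a minimal NumPy sketch that builds a logTF x IDF representation with L2 document normalization, using the IDF form given above, and finishes with an inner-product match. The toy term-frequency matrix and all values are illustrative, not project data.

```python
import numpy as np

# Toy document-term matrix: rows = documents, columns = terms (raw term frequencies).
# The corpus and all values are illustrative, not project data.
tf = np.array([[3, 0, 1],
               [0, 2, 0],
               [1, 1, 4]], dtype=float)
N = tf.shape[0]                       # number of documents

# Term representation: log TF (log(1 + tf) keeps zero counts at zero).
log_tf = np.log1p(tf)

# Term weighting: IDF(x) = log(1 + N / (i(x) + 1)),
# where i(x) = number of documents in which term x appears.
doc_freq = (tf > 0).sum(axis=0)
idf = np.log(1.0 + N / (doc_freq + 1.0))

weighted = log_tf * idf               # logTF x IDF representation

# Document normalization: L2 (divide each document vector by its Euclidean length).
norms = np.linalg.norm(weighted, axis=1, keepdims=True)
docs = weighted / np.where(norms == 0, 1.0, norms)

# Matching: inner product between two normalized documents (cosine-style similarity).
print(float(docs[0] @ docs[2]))
```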
12 MATCHING SCHEMES
- Inner product: the usual inner product between vectors
- Euclidean distance
13 LEARNING
- Rocchio (a linear classifier based on the center of gravity of the known relevant (and non-relevant) documents)
- Centroid (essentially a quadratic classifier based on the ratio of distances from the centroids of the positive and negative judged examples)
- aiSVM-b0
- SVMLight
- SVM and regression
- 1-NN, k-NN
- Q_up (Centroid or Rocchio in an online setting)
- Bayesian binary regression with Normal or Laplace prior
  - Dense logit, sparse logit, dense probit, sparse probit
14 FUSION METHODS
- Combining scores based on ranks, linear functions, or nonparametric schemes
15 SAMPLE METHODS STUDIED
[Table of sample methods omitted in this transcript; see the full table.]
ILH = inverted list heuristics (keep few terms)
16 DATA SETS USED
- No readily available data set has all the characteristics of the data on which we expect our methods to be used.
- However, many of our methods depend essentially only on a matrix of term frequencies by documents.
- Thus, many available data sets can be used for experimentation.
17 DATA SETS USED II
- TREC (Text Retrieval Conference) data: time-stamped subsets of the data (order 10^5 to 10^6 messages)
- Reuters Corpus Vol. 1 (8 x 10^5 messages)
- Medline abstracts (order 10^7, with human indexing)
18 COMPARING ALGORITHMS
- Effectiveness
  - F1: harmonic mean of precision and recall
    - Precision = (Selected and Relevant) / (All Selected)
    - Recall = (Selected and Relevant) / (All Relevant)
    - F1 = 2 / (1/P + 1/R), the harmonic mean
  - Area under ROC (Receiver Operating Characteristic) curves
    - Compares the true positive rate (detection rate) vs. the false positive rate (false alarm rate)
    - ROC curve: plot of true positive rate vs. false positive rate
    - The closer the area is to 1, the more accurate the classifier
(A code sketch of these measures follows this slide.)
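A minimal sketch of these effectiveness measures, computed directly from the definitions above; the scores, relevance judgments, and the 0.5 selection threshold are hypothetical.

```python
import numpy as np

def precision_recall_f1(selected, relevant):
    """selected, relevant: boolean arrays over the document stream."""
    both = np.logical_and(selected, relevant).sum()
    precision = both / selected.sum() if selected.sum() else 0.0
    recall = both / relevant.sum() if relevant.sum() else 0.0
    f1 = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    return precision, recall, f1

def roc_auc(scores, relevant):
    """Area under the ROC curve: the probability that a random relevant document
    outscores a random irrelevant one (ties count half)."""
    pos, neg = scores[relevant], scores[~relevant]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores and relevance judgments for six documents.
scores = np.array([0.9, 0.8, 0.4, 0.35, 0.2, 0.1])
relevant = np.array([True, True, False, True, False, False])
selected = scores >= 0.5           # illustrative selection threshold

print(precision_recall_f1(selected, relevant))
print(roc_auc(scores, relevant))
```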
19 COMPARING ALGORITHMS
- Space
  - Size of the data structure used
  - Dimensionality reduction relative to the dimension of the original space
  - Index storage cost
20 COMPARING ALGORITHMS
- Time
- CPU time to train, to classify
- Queries/minute
- Operation count
21 COMPARING ALGORITHMS
- Insight
- Gains in understanding
- No concrete measures
There is a tradeoff among criteria. A modest
decrease in effectiveness might be acceptable if
there is a significant savings in time or space.
22 Class of Methods I: Classical
- These models are based on counts of term frequencies (TF, bag of words), combined in an inner product using a weight that reflects the prevalence of each term in the collection (IDF). They use the center of mass of the known relevant documents, and of the known non-relevant documents, to define a classifier.
- Good for baseline comparisons to other methods
23 Classical Models II
- Rocchio model
  - Developed in the 1970s
  - The vector representing a topic is modified by adding a multiple of each vector representing a known relevant document and subtracting a multiple of each vector representing a known irrelevant document
  - The scoring function is the inner product of this vector with the vector representing an incoming document
  - There is an "ideal direction" in the space of documents, and farther in that direction is always better
24 Classical Models III
- Centroid model
  - Based on the idea that there is a desirable location
  - Anything closer to that location (in a certain ellipsoid metric) is deemed better
(A code sketch of both classical classifiers follows this slide.)
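A minimal NumPy sketch of how the Rocchio direction score and the Centroid distance-ratio score described above can be computed. The judged example vectors, the alpha/beta weights, and the function names are illustrative assumptions, not the project's implementation.

```python
import numpy as np

def rocchio_direction(pos, neg, alpha=1.0, beta=0.5):
    """Rocchio topic vector: alpha * mean(relevant docs) - beta * mean(non-relevant docs).
    alpha and beta are tuning weights (values here are illustrative)."""
    return alpha * pos.mean(axis=0) - beta * neg.mean(axis=0)

def rocchio_score(doc, topic_vector):
    # Inner-product score: farther along the "ideal direction" is always better.
    return float(doc @ topic_vector)

def centroid_score(doc, pos, neg):
    """Ratio of distances to the negative and positive centroids;
    a ratio above 1 means the document is closer to the relevant centroid."""
    d_pos = np.linalg.norm(doc - pos.mean(axis=0))
    d_neg = np.linalg.norm(doc - neg.mean(axis=0))
    return d_neg / d_pos

# Hypothetical judged examples in a tiny 3-term space.
pos = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]])
neg = np.array([[0.1, 0.8, 0.3], [0.0, 0.9, 0.5]])
doc = np.array([0.7, 0.3, 0.1])

print(rocchio_score(doc, rocchio_direction(pos, neg)))
print(centroid_score(doc, pos, neg))
```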
25 Geometrically
[Figure omitted: geometric comparison of the Rocchio and Centroid classifiers.]
The Centroid method expects a localized concentration of relevant documents (this suggests the importance of neighborhoods). Rocchio expects relevance to increase in certain preferred directions.
26 Classic Methods: Components
- Rocchio
  - Representation: term rep TF or logTF; term weight IDF or IDF2; document normalization none or L2
  - Compression: none
  - Matching: Euclidean (Q_up if online) or inner product
  - Learning: none, Rocchio, or Q_up
- Centroid
  - Representation: term rep TF or logTF; term weight IDF or IDF2; document normalization none or L2
  - Compression: none or Bayesian (2 methods)
  - Matching: Euclidean (Q_up if online) or inner product
  - Learning: none, Centroid, or Q_up (adaptive)
27 Classic Methods: Some Results
- Rocchio
  - Effectiveness of our methods is comparable to the industry standard
  - With very careful tuning, a Rocchio method produced the best adaptive results at TREC-11
- Centroid
  - Among the methods we have tested, Centroid methods are among the top performers in terms of effectiveness
28 Class of Methods II: Nearest Neighbor (kNN) Classifiers for Text Filtering
- Route a message by
  - Finding the k most similar training messages (neighbors)
  - Assigning it to the classes most common among the neighbors (optionally weighting by distance)
- kNN has been studied since 1958, and for text since the early 90s
- Moderately effective for text, but it has been considered inefficient
- However, finding neighbors only needs to be done once, no matter how many classes there are
- So for a large number of topics, kNN may be more efficient than one-classifier-per-topic approaches
29 kNN Components
- Representation: term representation TF; term weight IDF2; document normalization L2
- Compression: none, ILH-1, ILH-2, ILH-3, or RPV
- Matching: Euclidean or L2 inner product
- Learning: none for batch; otherwise traditional k-NN
30 Speeding up kNN
- Worked on a fast implementation (a sketch follows this slide)
- Store text and classes sparsely (Representation)
- Store class labels sparsely
- Arrange computations to do work proportional only to the number of class labels among the neighbors, not the total number of classes
- Search-engine heuristics: use the in-memory inverted file (Matching)
  - Use an inverted file
  - Retain only high-impact terms within each document, or within each inverted list
  - Classification-time selection of inverted lists
31 kNN Results
- Large reduction in the size of the inverted index and large gains in classification speed
- Slight additional cost in effectiveness
- Effectiveness slightly below our best methods (Bayesian probit and logistic classifiers)
- Compressed index is 90% smaller than the original index, with only a 7-12% loss in effectiveness
- Approximate matching is 90-95% faster, with only a 2-10% loss in effectiveness
- Ours are the first large-scale experiments on search-engine heuristics for neighbor lookup in kNN
- A partnership between theoreticians and practitioners
32 Random Projections Method (k-NN)
First large-scale implementation of theoretical random projection methods. Shows promise for large-scale datasets.
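A minimal sketch of the random projection idea in the Johnson-Lindenstrauss style: multiply the document matrix once by a random Gaussian matrix, then do subsequent distance computations (e.g., for kNN) in the much smaller projected space. The dimensions, density, and data below are illustrative assumptions, not the project's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sparse document matrix: 200 documents x 20,000 terms (values illustrative).
n_docs, n_terms, k = 200, 20_000, 300
docs = rng.random((n_docs, n_terms)) * (rng.random((n_docs, n_terms)) < 0.001)

# Random projection: a random Gaussian matrix scaled by 1/sqrt(k) approximately
# preserves pairwise distances, so kNN can run in k dimensions instead of n_terms.
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(n_terms, k))
projected = docs @ R                  # single pass over the data; now 200 x 300

# Distances in the projected space approximate distances in the original space.
print(np.linalg.norm(docs[0] - docs[1]), np.linalg.norm(projected[0] - projected[1]))
```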
33 Class of Methods III: Bayesian Methods
- Bayesian statistical methods place prior probability distributions on all unknowns, and then compute the posterior distribution of the unknowns conditional on the knowns.
[Portrait: Thomas Bayes]
34 Bayesian Methods
- Zhang and Oles (2001): first use of large-scale Bayesian logistic regression (10,000 dimensions)
- The Bayesian approach explicitly incorporates prior knowledge about model complexity (regularization)
- Bayesian logistic regression with Laplace priors: we extend the Bayesian approach to incorporate a prior requirement for sparsity, in applications with up to 100,000 dimensions
- Logistic regression has one parameter per dimension; our sparse model sets many of these to zero
- The resulting sparse models produce outstanding accuracy and ultra-fast predictions with no ad hoc feature selection
(A sketch of the Laplace-prior idea follows this slide.)
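As an illustration of the Laplace-prior idea: the MAP estimate under a Laplace prior coincides with L1-penalized logistic regression, which drives most coefficients to exactly zero. The sketch below uses scikit-learn's off-the-shelf L1 logistic regression as a stand-in; it is not the project's Bayesian Binary Regression software, and the synthetic data and regularization strength are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical term-count features for 500 documents over 2,000 terms;
# only the first 10 terms actually carry signal about the topic.
X = rng.poisson(0.05, size=(500, 2000)).astype(float)
y = (X[:, :10].sum(axis=1) + rng.normal(0, 0.5, 500) > 0.5).astype(int)

# MAP estimation under a Laplace prior on the coefficients is equivalent to
# L1-penalized logistic regression, which sets most coefficients to exactly zero.
# LogisticRegression here is an off-the-shelf stand-in, not the project's BBR code.
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
model.fit(X, y)

print(f"{np.sum(model.coef_ != 0)} of {X.shape[1]} coefficients are nonzero")
```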
35 Bayesian Methods: Components
- Representation: term representation logTF; term weighting IDF; document normalization none or L2
- Compression: none or N-best (3 methods)
- Matching: inner product
- Learning: Bayesian binary regression with Laplace prior or Normal prior
36 Bayesian Methods: Sample Results
[Results table omitted: rows 1-3 are Reuters Corpus Vol. 1 (RCV-1) data, rows 4-6 are Medline (OHSUMED) data; SVM is included in the table for comparison's sake. NumFeat = dimensions in the document representation; MedianUsed = words used to make predictions (the role of sparsity).]
37 Bayesian Methods: Sample Results
- Our implementation of Bayesian logistic regression with Laplace priors is highly efficient for very large-scale problems.
- Other efficient variants are implemented as well (Bayesian probit).
- Compared to Zhang & Oles, our implementation is
  - 200 to 2,000 times faster
  - 200 to 2,000 times smaller in required space
  - As accurate as the best results ever published
- What we did is unique: very large scale (122,000 features) with no ad hoc feature selection
- In sum, we have a sparseness-inducing Laplace prior that produces dramatically simpler models with essentially no loss in accuracy
38 Data Fusion
- Data fusion combines several systems for representing, compressing, matching, or learning into a compound system that is generally able to perform better
- Each system assigns a score to a new document
- We have explored score-combining schemes for data fusion
- Work has emphasized computational and visualization tools to facilitate the study of relationships among various combinations of methods
39 Fusion: Key Findings
- The global fusion approach asks: is there a rule for combining the results of two or more systems without regard to the specific pattern or topic being sought? It does not work terribly well.
- Local fusion: when specialized to the specific topic, simple linear methods can produce substantial improvements (up to 10% or 20%) as judged by standard measures of retrieval performance (a sketch follows this slide)
- Some positive results in the case where the fusion rule is selected on the basis of one sample of topics and applied to another
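A minimal sketch of local (topic-specific) linear score fusion: choose the mixing weight that maximizes F1 on judged documents for a topic, then apply that weight to new documents. The two score vectors, the judgments, the threshold, and the weight grid are hypothetical.

```python
import numpy as np

# Hypothetical normalized scores from two systems (say, Rocchio and kNN) on five documents.
scores_a = np.array([0.91, 0.40, 0.75, 0.20, 0.55])
scores_b = np.array([0.85, 0.52, 0.60, 0.05, 0.70])
relevant = np.array([True, False, True, False, True])

def f1_at_threshold(scores, relevant, threshold=0.5):
    selected = scores >= threshold
    both = np.logical_and(selected, relevant).sum()
    p = both / selected.sum() if selected.sum() else 0.0
    r = both / relevant.sum() if relevant.sum() else 0.0
    return 2 * p * r / (p + r) if p and r else 0.0

# Local fusion: pick the linear mixing weight that works best on judged documents
# for this topic, then apply it to new documents.
best_w = max(np.linspace(0.0, 1.0, 11),
             key=lambda w: f1_at_threshold(w * scores_a + (1 - w) * scores_b, relevant))
print("best weight on system A:", best_w)
```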
40 Some Comparisons of Effectiveness
These are just a few examples of results; we have done over 5,000 complete experiments.
[Comparison charts omitted: one for the full RCV1 data set, one for a subset of 4,060 items.]
41 Compression Methods: Summary
- Applied old heuristics and new theory-based algorithms
  - Random projections to real subspaces
  - Random projections to Hamming cubes
  - Combinatorial PCA
  - Streaming algorithms for finding deviant cases in massive data
  - Classic search-engine inverted-file heuristics
42 Compression Methods: Combinatorial PCA (cPCA)
- PCA (principal components analysis): a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. Goals: reduce dimensionality, discover meaningful variables. (A sketch of standard PCA follows this slide.)
- New: a combinatorial analog of PCA (novel method)
  - Finds the most frequent and most strongly correlated features in the data matrix
- Application: feature selection for a learning algorithm (SVM in our experiments)
- cPCA is a compression method for any type of data matrix
- cPCA can be used with any classifier learning method
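For context only, here is a minimal sketch of the standard PCA step the slide describes, using an SVD of a centered document-term matrix. cPCA itself is the project's novel combinatorial analog (it selects frequent, strongly correlated features rather than forming linear combinations) and is not reproduced here; all data and dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical document-term matrix: 200 documents x 1,000 terms (illustrative counts).
X = rng.poisson(0.1, size=(200, 1000)).astype(float)

# Standard PCA via SVD: project onto the top-k principal components.
# cPCA, in contrast, selects frequent, strongly correlated features instead of
# forming linear combinations; that novel method is not shown here.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 50
X_reduced = Xc @ Vt[:k].T          # 200 x 50 input for a downstream classifier (e.g., SVM)

print(X_reduced.shape)
```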
43 cPCA: Experimental Results
- Experiments with 2 data sets, 16 representations, and 10 classifiers
- Dimensionality reduction by a factor of 30 (47,000 terms down to 1,500)
- Time: 3 minutes for 23,000 documents x 47,000 words
- cPCA has computational complexity lower by an order of magnitude compared to PCA
44 STREAMING DATA ANALYSIS
- Motivated by the need to make decisions about data during an initial scan, as the data stream by
- Recent development of theoretical CS algorithms
- Algorithms motivated by intrusion detection, transaction applications, and time-series transactions
45 Streaming Text Analyses
- A_j = number of texts seen in time period j
- A_j = number of documents that contain word j
- A_{i,j} = number of emails or bytes sent from address i to address j
- A_{i,j} = number of occurrences of word j in document i
Using these different data representations, we have developed rapid methods for a number of monitoring applications, such as finding changing trends, outliers and deviants, unique items, rare events, heavy hitters, etc. (A heavy-hitters sketch follows this slide.)
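One standard example of this style of one-pass, bounded-space analysis is the Misra-Gries frequent-items (heavy hitters) summary sketched below. It is shown only to illustrate the flavor of streaming summaries and is not necessarily the project's specific algorithm; the word stream is hypothetical.

```python
def misra_gries(stream, k):
    """One-pass frequent-items (heavy hitters) summary using at most k-1 counters.
    A classical streaming algorithm, shown only to illustrate bounded-space summaries."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter and drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical word stream in which "attack" is the heavy hitter.
stream = ["attack"] * 40 + ["plan", "budget", "report", "memo"] * 15
print(misra_gries(stream, k=4))
```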
46 Decision-Theoretic Adaptive Filtering (A Formal Framework)
- Goal: pick out interesting documents in a stream and present them to an oracle for a reward
- The oracle gives feedback only for presented documents
- Complex rewards and penalties (e.g., if we never present any documents, we never improve)
- The formal model chooses an optimal strategy balancing instant rewards and penalties against long-term payoff (a simplified sketch follows this slide)
- Initial implementation and experimentation show considerable promise
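A deliberately simplified caricature of the kind of decision rule such a framework produces, assuming a fixed reward and penalty and a flat bonus for the value of the feedback gained by presenting; the actual formal model optimizes long-term payoff, and all numbers here are invented for illustration.

```python
def present_document(p_relevant, reward=3.0, penalty=-1.0, feedback_bonus=0.2):
    """Present a document when its expected immediate payoff plus a flat bonus for the
    feedback we would receive is positive. All parameter values are invented for
    illustration; the actual formal model optimizes long-term payoff."""
    expected_immediate = p_relevant * reward + (1 - p_relevant) * penalty
    return expected_immediate + feedback_bonus > 0

# A borderline document is presented because the feedback itself has value ...
print(present_document(p_relevant=0.25))   # True: immediate payoff is 0, the bonus tips it
# ... while a clearly uninteresting one is not.
print(present_document(p_relevant=0.05))   # False
```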
47 METHODS READY FOR INTEGRATION AND TESTING BY THE PROTOTYPE DEVELOPERS
- Classic method: Rocchio
- Classic method: Centroid
- kNN with IFH (inverted file heuristic)
- Sparse Bayesian (Bayesian with Laplace priors)
- cPCA
- Lots of code for components
48 OUR EXPECTATION: IMPACT AFTER 12 MONTHS
- We will have developed innovative methods for classifying accumulated documents in relation to known tasks/targets/themes and for building profiles to track future relevant messages.
- We are optimistic that, through end-to-end experimentation, we will continue to discover new uses, relevant to each of the component tasks, for recently developed mathematical and statistical methods. We expect to achieve significant improvements in performance on accepted measures that could not be achieved by piecemeal study of one or two component tasks or by simply concentrating on pushing existing methods further.
49 ACCOMPLISHMENTS: IMPACT AFTER 12 MONTHS
- General observation: on classic measures such as precision, recall, and F1, the existing methods do quite well. We have sought progress on other measures.
- For example, if we can reduce the space required by 90% with only a 10% loss in F1, we can cast a wider net in the search for important messages. (Our kNN methods do this.)
- Similarly, if we can reduce the time required significantly while only reducing effectiveness a little, we can have a similar impact. (Our kNN and Bayesian methods do this.)
50 FUTURE WORK: OVERALL GOALS FOR THE NEXT TWO YEARS
- We will have extended our analysis to semi-supervised discovery of potentially interesting clusters of documents, provided we can interact with appropriate judges who can provide realistic feedback.
- This should allow us to identify potentially threatening events in time for cognizant agencies to prevent them from occurring.
51 Future Work: kNN
- Continue efforts to improve efficiency by moving to the online setting
- Improve efficiency by demonstrating provable-quality approximate matching guarantees
- Randomized matrix multiplication methods
- Draw on lessons from the random projection work
52 Future Work: kNN
- Now that we have fast kNN, work on the learning component to improve effectiveness
  - Local logistic regression
  - Online learning of thresholds
- Big issue: only some neighbors are labeled
  - Apply more sophisticated learning algorithms to the neighbors once they are found: local learning (used in robotics, not much in text classification)
  - Might have no neighbor judged for a topic
  - Use reverse kNN learning?
53 Future Work: Random Projections
- For random projection to be effective, we need huge spaces, so we will look at pairs or triples of terms
- Develop random projection methods for online processing of text messages when items are being added to or deleted from the database, using derandomization tricks
- Use for clustering methods for semi-supervised learning
54 Future Work: Bayesian Methods
- Online Bayesian algorithms
  - Status: small experiments with two competing algorithms
  - Plan: comprehensive large-scale implementation and experiments
- Develop Bayesian multi-label models (k of n categories). All our work so far is on binary classification; real-world documents can simultaneously belong to many stochastically dependent categories
  - Status: designed three competing approaches
  - Plan: comprehensive large-scale implementation and experiments
55 Future Work: Bayesian Methods II
- Our target applications feature a paucity of labeled training data. Highly informative Bayesian priors, drawing on information in category descriptions, have the potential to provide dramatic improvements in predictive accuracy with tiny training datasets
  - Status: simple experiments with initial ideas
  - Plan: comprehensive large-scale implementation and experiments
56 Future Work: Bayesian Methods III
- Merging the formal decision-theoretic model for adaptive filtering with online Bayesian algorithms
  - Our decision-theoretic framework for adaptive filtering shows considerable promise
  - We need to marry our online sparse Bayesian model with the decision-theoretic framework
- High-dimensional Bayesian topic detection and tracking via sparse hidden Markov models
  - Important for semi-supervised learning
  - Build on recent advances in variational approaches that provide dramatic speedups
57 Future Work: Feature Selection Methods, cPCA
- Develop an online feature selection method that could work with any online classification algorithm
- Develop a feature selection method to detect rare events in massive message streams
- Tailor cPCA to the specifics of SVM classifiers
58 Future Work: Fusion
- Fusing with scores means giving up most of the information that diverse methods contain. This presents a strong challenge for us; new tools are needed.
- Fusion will be much more effective when it is applied earlier in the chain of components, to combine methods of representation, matching, compression, and learning.
59 Future Work: Streaming Text Data, Historic Data Analysis
- The accumulation of text messages over time is massive
- Most streaming research is focused on ongoing or current analyses
- It is a great challenge to use only summarized historic data to see whether a currently emerging phenomenon had precursors in the past
- We propose an exciting and novel architecture for historic and posterior analyses via small summaries
60 Architecture for Historic Text Data Analysis
[Architecture diagram, conceptually:]
- Conceptual representation: multidimensional data with attributes (To, From user IDs, Time sent), metadata (Reply? Cc? Fwd? Attachments? Language?), and text labels (words, phrases, etc.)
- The data stream feeds a materialized summary (updated on the stream): big changes, deviants, unusual trends, decision trees, clustering
- The summary supports approximate ad hoc queries and analyses
61 Future Work: Streaming Text Data, Tracking Term Statistics
- Text filtering systems use word statistics
  - Frequency of words in all documents (IDF) or in classes of documents (Naive Bayes)
- But new words (and phrases) never stop coming
- How do we estimate the frequencies of an unbounded number of terms from bounded-space statistics? (One standard bounded-space structure is sketched after this slide.)
- Approaches we will study
  - Randomized streaming-data algorithms
  - Reinforcement learning
  - Backoff models on term structure
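One standard bounded-space structure for this kind of term-frequency tracking is a Count-Min sketch, shown below as an illustration. It is not necessarily the method the project will adopt, and the width, depth, and hashing scheme are illustrative.

```python
import numpy as np

class CountMinSketch:
    """Bounded-space frequency estimator for an unbounded vocabulary.
    Width, depth, and the hashing scheme here are illustrative."""

    def __init__(self, width=2048, depth=4, seed=0):
        self.width = width
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.seeds = np.random.default_rng(seed).integers(1, 2**31, size=depth)

    def _columns(self, term):
        return [hash((int(s), term)) % self.width for s in self.seeds]

    def add(self, term, count=1):
        for row, col in enumerate(self._columns(term)):
            self.table[row, col] += count

    def estimate(self, term):
        # Never underestimates; taking the minimum over rows bounds hash-collision error.
        return int(min(self.table[row, col] for row, col in enumerate(self._columns(term))))

cms = CountMinSketch()
for word in ["attack"] * 50 + ["plan"] * 5 + ["memo"]:
    cms.add(word)
print(cms.estimate("attack"), cms.estimate("plan"), cms.estimate("unseen"))
```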
62 Future Work: Streaming Data Analysis, Tracking Topics and Parties (new direction of research)
- The number of participants in information exchange is vast and growing rapidly. How do we track topics as they emerge?
- A key problem in semi-supervised online learning
- Topic tracking: good relevance scores are linear combinations of words (frequencies or occurrences)
- Idea: keep a sketch of parties for each word separately
  - Linear combinations of sketches make sense, so we can trace topics
63 Future Work: Streaming Data Analysis, Tracking Topics and Parties
- This design is topic-independent
  - No need to know topics in advance
  - No limit on the number of topics
- This design is efficient
  - Sketches help search many combinations of words potentially related to a topic very quickly (a toy illustration of this linearity follows)
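A toy illustration of the linearity property these slides rely on, under the assumption that each word's sketch is a vector of counts hashed by sending party: the sketch of a weighted combination of words is the same weighted combination of the per-word sketches. The bucket count, hash, parties, weights, and function name are all hypothetical.

```python
import numpy as np

BUCKETS = 64          # number of hashed party buckets (illustrative)

def party_sketch(activity):
    """Per-word sketch: counts of the word's occurrences, hashed by sending party.
    A simplified stand-in for the per-word sketches discussed on these slides."""
    sketch = np.zeros(BUCKETS)
    for party, count in activity:
        sketch[hash(party) % BUCKETS] += count
    return sketch

# Hypothetical per-word activity as (party, number of occurrences) pairs.
sketch_attack = party_sketch([("alice", 5), ("bob", 2)])
sketch_plan   = party_sketch([("alice", 3), ("carol", 7)])

# Linearity: the sketch of a weighted word combination (a topic's relevance score)
# is the same weighted combination of the per-word sketches.
topic_sketch = 0.7 * sketch_attack + 0.3 * sketch_plan
print(topic_sketch[hash("alice") % BUCKETS])   # alice's estimated activity on this topic
```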