Title: Monitoring Message Streams

Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages
Fred Roberts, Rutgers University

MMS Goal
Monitor huge communication streams, in particular streams of textualized communication, to automatically detect pattern changes and "significant" events.
Motivation: monitoring email traffic, news, communiques, faxes, voice intercepts (with speech recognition).

MMS Overall Objectives
- Synergistic improvements in:
  - Performance in terms of space, time, effectiveness, and/or insight
  - Understanding the tradeoffs among these types of improvements
- Compression for efficient resource use
- Representation that aids fitting models
- Efficient matching of text to text and model to text
- Learning models from data and prior knowledge
- Reduction in the need for large amounts of training data or labor-intensive input
- Fusion of complementary filtering approaches

MMS Approaches
- Emphasis on supervised filtering
  - Given example documents, textbook descriptions, etc., find documents on this topic in the incoming stream or in past data
- Less emphasis on unsupervised event identification
  - Detect emergent characteristics, anomalous patterns, etc., in the incoming stream of text or in historical statistics on the stream

MMS Approaches: Supervised Filtering
- Batch filtering: all training texts are processed before any texts of active interest to the user
- Adaptive filtering: the user trains the system during use
  - The value of examples for both information and training must be considered

MMS Approaches: Dealing with Massive Data
- Creating summary statistics on massive data streams
  - Detect outliers, heavy hitters (most frequent items), etc.
  - Allow us to return to the past without keeping the raw data
- Reducing the need for labeled training examples in supervised classification
  - Bayesian priors from domain knowledge
  - Tuning on unlabeled data

Accomplishments, Phase II (Jan 04 - Sep 04)
- Bayesian logistic regression
  - Using sparseness-favoring priors, our methods have produced outstanding accuracy and fast predictions with no ad hoc feature selection (see the sketch after this list)
  - State-of-the-art text classification effectiveness
    - Recently: highest score on the TREC 2004 triage task
  - Public release of our Bayesian Binary Regression (BBR) software (500 downloads)
[Portrait: Thomas Bayes]
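
BBR itself is not reproduced here, but the key idea has a compact stand-in: a Laplace prior on logistic regression coefficients corresponds to L1 (lasso) regularization, which drives most weights to exactly zero. A minimal sketch, assuming scikit-learn and toy data of our own choosing:

```python
# Sparse logistic regression via a Laplace-style (L1) penalty; a stand-in
# illustration, not the project's BBR implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["oil prices rise in gulf states",        # toy on-topic examples
        "pipeline exports resume after talks",
        "new embassy opens in the capital",      # toy off-topic examples
        "ambassador meets foreign minister"]
labels = [1, 1, 0, 0]

X = TfidfVectorizer().fit_transform(docs)
# penalty="l1" gives the MAP estimate under a Laplace prior: many coefficients
# become exactly zero, so no separate ad hoc feature selection is needed.
clf = LogisticRegression(penalty="l1", solver="liblinear").fit(X, labels)
print(clf.predict(X))
```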

Accomplishments, Phase II (Jan 04 - Sep 04)
- Bayesian logistic regression (cont'd)
  - The ability to use domain knowledge to set prior distributions led to large improvements in effectiveness when little training data is available
  - New online algorithms: online updating of Bayesian models as new data become available (sketched below)
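
The released online algorithms are not shown here; one simple stand-in is a single gradient step on the regularized logistic loss each time a labeled document arrives, using a Gaussian (L2) prior for differentiability. All names and the update rule below are our assumptions:

```python
import numpy as np

def online_update(w, x, y, lr=0.1, lam=0.01):
    """One online step for logistic regression with a Gaussian prior.
    w: current weights; x: document feature vector; y: label in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-w @ x))    # model's current P(relevant)
    grad = (p - y) * x + lam * w        # loss gradient plus prior pull
    return w - lr * grad                # updated model after one document

w = np.zeros(4)
w = online_update(w, np.array([1.0, 0.0, 1.0, 0.0]), y=1)
```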

Accomplishments, Phase II (Jan 04 - Sep 04)
- Streaming algorithms
  - New sketch-based algorithms for detecting word frequency changes and other patterns in massive text streams
  - Rapid methods for finding changing trends, outliers and deviants, rare events, and heavy hitters
  - Initial results using summarized data to search for meaningful answers to queries about the past
  - Initial work on textual and structural patterns in informal communication networks

Accomplishments, Phase II (Jan 04 - Sep 04)
- Nearest neighbor classification: fast implementation
  - Continued development of heuristics for approximate neighbor finding with an in-memory inverted index (see the sketch after this list)
  - Our results have reduced memory by 90% and time by 90-99% with minimal impact on effectiveness
  - Packaged and delivered the kNN software
- Developing algorithms for speeding up a slow but potentially highly effective local learning approach
  - Based on training a separate logistic regression on the neighbors of each test document
  - Slow, but with many avenues to large speedups
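
The project's heuristics are not reproduced here; the core inverted-index idea can be sketched as follows (data structures and names are our assumptions): score each training document only through the terms it shares with the query, so documents with no terms in common are never touched.

```python
from collections import defaultdict
from heapq import nlargest

def build_index(train_docs):
    """train_docs: {doc_id: {term: weight}}; returns an in-memory inverted
    index {term: [(doc_id, weight), ...]}."""
    index = defaultdict(list)
    for doc_id, vec in train_docs.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index

def approx_knn(index, query_vec, k=3):
    """Accumulate similarity scores only along postings for the query's terms;
    documents sharing no terms with the query are never visited."""
    scores = defaultdict(float)
    for term, qw in query_vec.items():
        for doc_id, w in index.get(term, ()):
            scores[doc_id] += qw * w
    return nlargest(k, scores.items(), key=lambda kv: kv[1])
```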

Accomplishments, Phase II (Jan 04 - Sep 04)
- Adaptive filtering
  - Models to aid in learning when to act greedily (exploit: submit documents we believe are relevant) and when to take risks (explore: submit documents that may be irrelevant); a toy sketch follows this list
  - Seek approximate solutions to the intractable optimal exploration/exploitation tradeoff
  - Experiments show slight improvements in filtering effectiveness compared to the greedy (exploit-only) approach
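
As a toy illustration of the exploit/explore decision only (an epsilon-greedy rule of our own choosing; the project pursues principled approximations, not this heuristic):

```python
import random

def should_submit(score, threshold, epsilon=0.05):
    """Exploit: submit documents scoring above the threshold.
    Explore: occasionally submit a below-threshold document, accepting the
    risk of irrelevance in exchange for an informative user judgment."""
    if score >= threshold:
        return True                    # exploit: we believe it is relevant
    return random.random() < epsilon   # explore: take a calculated risk
```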

Some MMS Work in Depth: Bayesian Priors from Domain Knowledge
- Bayesian methods assume prior beliefs about parameters before data is seen
- Project Phase I: generic, vague priors
- Project Phase II: reference materials or intuitions about words may help predict class; use these to set priors (material very unlike the training examples)
- Goal: reduce the need for training examples
  - Replace thousands of randomly sampled examples with a few, possibly biased, examples

Knowledge-Driven Priors: Issues
- Reference texts have some non-topical words
  - Use words that discriminate among topics (using inverse document frequency (IDF) weighting within the reference collection); a sketch follows this list
- Small training sets increase problems with thresholding and text representation
  - Use unlabeled data to aid thresholding and to learn IDF weights
  - Use a separate prior for the intercept term of the model
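
One hypothetical way to turn reference-collection IDF into priors: give rare, discriminating terms a larger prior variance (weaker shrinkage toward zero) and ubiquitous, non-topical terms a smaller one. The mapping below is our assumption, not the project's formula:

```python
import math

def idf_prior_variances(reference_docs, base_var=1.0):
    """reference_docs: list of token lists (e.g., one per World Factbook
    entry). Returns {term: prior variance} for the coefficient priors."""
    n = len(reference_docs)
    df = {}
    for doc in reference_docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    # log(1 + n/df): large for rare terms, near log 2 for ubiquitous ones.
    return {t: base_var * math.log(1 + n / d) for t, d in df.items()}
```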

Knowledge-Driven Priors: Results
- Topics: 27 Reuters Region categories
- Knowledge: CIA World Factbook (WFB) entries
- Examples: 10 per topic
- Baseline results (F1 measure, defined below):
  - WFB: 0.234; no WFB: 0.052
- Better small training sets, improved algorithms:
  - WFB: 0.591; no WFB: 0.395
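
For reference, the F1 measure used above is the harmonic mean of precision and recall:

```python
def f1(true_pos, false_pos, false_neg):
    """F1 = 2PR / (P + R), the harmonic mean of precision P and recall R."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```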

Knowledge-Driven Priors: Summary
- Reference materials of a text type very different from the documents to be classified can aid supervised filtering
- In combination with tuning on unlabeled data, this technique can provide immediate practical benefits
- Current methods are crude and ad hoc; substantial improvements should be possible

Some MMS Work in Depth: Streaming Analysis
- Problem: monitor fast, massive text streams and support both online tracking and historic analysis of events
  - Multidimensional data: source, destination, time sent or received, metadata (reply, language), text
  - Labels (words, phrases), links
- Goal: use highly compact summaries that are computed at stream speed and support accurate analyses

Streaming Analysis Tool: CM Sketch
- Theoretical: we have developed the CM Sketch, which uses O((1/ε) log(1/δ)) space to approximate a data distribution with error at most ε and probability of success at least 1-δ
  - All other previously known sampling or sketch methods use space at least Ω(1/ε²)
  - The CM Sketch is an order of magnitude better
- Practical: a few tens of KBs give an accurate summary of large data; create summaries that allow historic queries to find:
  - Heavy hitters (most frequent items)
  - Quantiles of a distribution (median, percentiles, etc.)
  - Items with large changes
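
The released C library is not reproduced here, but the structure is small enough to sketch. A minimal Count-Min sketch in Python (the parameter defaults and the per-row hashing are our simplifications):

```python
import math
import random

class CountMinSketch:
    """Width O(1/eps) x depth O(log 1/delta) table of counters; point queries
    overestimate the true count by at most eps*N with probability >= 1-delta,
    where N is the total count inserted."""
    def __init__(self, eps=0.001, delta=0.01, seed=0):
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(self.depth)]

    def _col(self, row, item):
        # Simplified per-row hash; the paper uses pairwise-independent hashes.
        return hash((self.salts[row], item)) % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._col(row, item)] += count

    def estimate(self, item):
        # Minimum across rows trims the overcounting from hash collisions.
        return min(self.table[row][self._col(row, item)]
                   for row in range(self.depth))
```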

Streaming Analysis: Using Web Logs
- Web logs (blogs), i.e., regularly updated online journals, provide informal, opinionated, candid data that is more like email than the web at large
- We have begun to automatically collect blogs, stripping formatting, tags, ads, etc., and feeding the corresponding "bag of words" into streaming algorithms for analysis and archiving, at the 10s-to-100 GB scale (a toy version of this pipeline follows)
- 3000:1 compression using CM Sketch methods
- Allows accurate analysis of popular words, newly emergent words, etc., including multilingual occurrences
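
A hypothetical toy version of this pipeline, built on the CountMinSketch sketch above (the sample posts and whitespace tokenizer are stand-ins for the real scraping and cleaning):

```python
sketch = CountMinSketch(eps=0.001, delta=0.01)

posts = ["oil prices rise again", "prices of oil may fall"]  # stand-in stream
for post in posts:
    for word in post.split():   # the real pipeline strips tags, ads, markup
        sketch.update(word)

print(sketch.estimate("oil"))   # approximate historic frequency of a word
```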

Deliverables, Phase I
- Classic method: Rocchio
- Classic method: Centroid
- kNN with IFH (inverted file heuristic)
- Sparse Bayesian (Bayesian with Laplace priors)
- Combinatorial PCA
- Homotopic linking of widely varying Rocchio methods
- aiSVM
- Fusion

Deliverables, Phase II
- Revised and extended version of the kNN code, including scripts for running local learning experiments
- Substantially extended version of BBR, including use of domain knowledge to set priors
- CM Sketch (C library for count-min sketching)
- Code to use the CM Sketch to find heavy hitters, quantiles, and large changes in streams

MMS Future Directions
- Bayesian
  - Expand the types of domain knowledge that can be used
    - For instance, making use of the taxonomies available in many subject areas
  - Improve self-tuning of the BBR software
    - Make it more effective for novice users
    - Surprisingly subtle questions: cross-validation, calibration, scaling (e.g., when there are multiple features)
  - Incorporate our previous work on online Bayesian methods into BBR

MMS Future Directions
- Streaming
  - Systematically explore summarization methods such as sampling, bitmaps, and sketches
  - Develop warehousing techniques for large-scale sketch-based historical analyses
  - The massiveness of the data makes even linear algorithms too inefficient; seek sublinear methods
  - Develop sketch-based methods for link analysis in temporally changing multigraphs
    - From and To addresses in email, links between blogs, etc.
  - Add a modeling component to the sketch-based analysis: exploit knowledge of the distribution of the data

MMS Future Directions
- kNN
  - kNN with a small training sample for each of a massive number of topics
    - Maybe only 5 to 10 known relevant/irrelevant documents per topic
    - Since small samples have little overlap, extend the kNN approach to deal with partially labeled datasets
  - Bayesian kNN
    - Incorporate methods developed in our Bayesian work for dealing with small training sets (e.g., tuning thresholds on unlabeled data)
    - More fundamental combinations of Bayesian and kNN methods (e.g., tunable distance metrics)

MMS Future Directions
- Greedy Round Robin feature selection
  - In Phase I work, we explored a greedy heuristic to choose a subset of the original set of terms as features
    - It did extremely well in the TREC 2002 topic intersection tasks
  - We will develop a Greedy Round Robin (GRR) method (a sketch follows this list)
    - Applies when features fall into two or more conceptually distinct sets (e.g., metadata such as source/destination, genre, or medium of the message)
    - Each list of features is consulted in turn
  - Plan experimental analysis of GRR
  - Plan theoretical analysis of GRR using simulation
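
Since GRR is future work, the following is only a hypothetical reading of the slide: the feature lists are visited round-robin, and each visit greedily takes that list's best remaining candidate. score_fn and the selection budget are our assumptions:

```python
def greedy_round_robin(feature_lists, score_fn, budget):
    """feature_lists: conceptually distinct candidate sets (e.g., text terms,
    source/destination metadata); score_fn(feature, selected) scores a
    candidate given the features chosen so far."""
    selected = []
    pending = [list(fl) for fl in feature_lists]   # copy; leave input intact
    i = 0
    while len(selected) < budget and any(pending):
        fl = pending[i % len(pending)]             # consult each list in turn
        if fl:
            best = max(fl, key=lambda f: score_fn(f, selected))
            fl.remove(best)                        # greedy pick from this list
            selected.append(best)
        i += 1
    return selected
```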

MMS Future Directions
- Adaptive filtering
  - Experiment with new adaptive thresholding methods (synergy with the Bayesian thresholding work); a sketch follows this list
    - The scoring threshold is adjusted downward if judging too many irrelevant documents, upward if judging too few relevant documents
  - Aim for an algorithm with state-of-the-art effectiveness and provable theoretical properties
  - Compare the rate of convergence of various algorithms on real data
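
A hypothetical additive update rule illustrating this kind of adaptive thresholding (the step size, the precision target, and the direction of each correction are our assumptions, not the project's method):

```python
def update_threshold(threshold, submitted_relevant, submitted_irrelevant,
                     step=0.01, target_precision=0.5):
    """Nudge the scoring threshold after a batch of user judgments."""
    total = submitted_relevant + submitted_irrelevant
    if total == 0:
        return threshold - step        # submitting nothing: loosen the filter
    precision = submitted_relevant / total
    if precision < target_precision:
        return threshold + step        # too many irrelevant ones: tighten
    return threshold - step            # precision is fine: loosen for recall
```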

MMS Project Team
- Paul Kantor, Rutgers Communication, Information and Library Studies
- Dave Lewis, consultant
- Michael Littman, Rutgers CS
- David Madigan, Rutgers Statistics
- S. Muthukrishnan, Rutgers CS
- Rafail Ostrovsky, Telcordia/UCLA
- Fred Roberts, Rutgers DIMACS/Math
- Martin Strauss, AT&T Labs/U. Michigan
- Wen-Hua Ju, Avaya Labs (collaborator)
- Andrei Anghelescu, graduate student
- Suhrid Balakrishnan, graduate student
- Aynur Dayanik, graduate student
- Dmitry Fradkin, graduate student
- Peng Song, graduate student
- Graham Cormode, postdoc
- Alex Genkin, software developer
- Vladimir Menkov, software developer