Title: Monitoring Message Streams

Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages
Fred Roberts, Rutgers University

MMS Goal
Monitor huge communication streams, in particular streams of textualized communication, to automatically detect pattern changes and "significant" events.
Motivation: monitoring email traffic, news, communiques, faxes, voice intercepts (with speech recognition).

MMS Overall Objectives
- Synergistic improvements in:
  - Performance in terms of space, time, effectiveness, and/or insight
  - Understanding the tradeoffs among these types of improvements
- Compression for efficient resource use
- Representation that aids fitting models
- Efficient matching of text to text and model to text
- Learning models from data and prior knowledge
- Reduction in the need for large amounts of training data or labor-intensive input
- Fusion of complementary filtering approaches

MMS Approaches
- Emphasis on supervised filtering
  - Given example documents, textbook descriptions, etc., find documents on this topic in the incoming stream or in past data
- Less emphasis on unsupervised event identification
  - Detect emergent characteristics, anomalous patterns, etc., in the incoming stream of text or in historical statistics on the stream

MMS Approaches: Supervised Filtering
- Batch filtering: all training texts are processed before any texts of active interest to the user
- Adaptive filtering: the user trains the system during use
  - The value of examples for both information and training must be considered

MMS Approaches: Dealing with Massive Data
- Creating summary statistics on massive data streams
  - Detect outliers, heavy hitters (most frequent items), etc.
  - Allow us to return to the past without keeping the raw data
- Reducing the need for labeled training examples in supervised classification
  - Bayesian priors from domain knowledge
  - Tuning on unlabeled data

Accomplishments, Phase II (Jan 04 - Sep 04)
- Bayesian logistic regression
  - Using sparseness-favoring priors, our methods have produced outstanding accuracy and fast predictions with no ad hoc feature selection (see the sketch after this list)
  - State-of-the-art text classification effectiveness
    - Recently: highest score on the TREC 2004 triage task
  - Public release of our Bayesian Binary Regression (BBR) software (500 downloads)
[Portrait: Thomas Bayes]
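
BBR itself is not reproduced here, but the key idea has a compact stand-in: a Laplace prior on logistic regression coefficients corresponds to L1 (lasso) regularization, which drives most weights to exactly zero. A minimal sketch, assuming scikit-learn and toy data of our own choosing:

```python
# Sparse logistic regression via a Laplace-style (L1) penalty; a stand-in
# illustration, not the project's BBR implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["oil prices rise in gulf states",        # toy on-topic examples
        "pipeline exports resume after talks",
        "new embassy opens in the capital",      # toy off-topic examples
        "ambassador meets foreign minister"]
labels = [1, 1, 0, 0]

X = TfidfVectorizer().fit_transform(docs)
# penalty="l1" gives the MAP estimate under a Laplace prior: many coefficients
# become exactly zero, so no separate ad hoc feature selection is needed.
clf = LogisticRegression(penalty="l1", solver="liblinear").fit(X, labels)
print(clf.predict(X))
```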

Accomplishments, Phase II (Jan 04 - Sep 04)
- Bayesian logistic regression (cont'd)
  - The ability to use domain knowledge to set prior distributions led to large improvements in effectiveness when little training data is available
  - New online algorithms: online updating of Bayesian models as new data become available (sketched below)
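
The released online algorithms are not shown here; one simple stand-in is a single gradient step on the regularized logistic loss each time a labeled document arrives, using a Gaussian (L2) prior for differentiability. All names and the update rule below are our assumptions:

```python
import numpy as np

def online_update(w, x, y, lr=0.1, lam=0.01):
    """One online step for logistic regression with a Gaussian prior.
    w: current weights; x: document feature vector; y: label in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-w @ x))    # model's current P(relevant)
    grad = (p - y) * x + lam * w        # loss gradient plus prior pull
    return w - lr * grad                # updated model after one document

w = np.zeros(4)
w = online_update(w, np.array([1.0, 0.0, 1.0, 0.0]), y=1)
```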

Accomplishments, Phase II (Jan 04 - Sep 04)
- Streaming algorithms
  - New sketch-based algorithms for detecting word frequency changes and other patterns in massive text streams
  - Rapid methods for finding changing trends, outliers and deviants, rare events, and heavy hitters
  - Initial results using summarized data to search for meaningful answers to queries about the past
  - Initial work on textual and structural patterns in informal communication networks

Accomplishments, Phase II (Jan 04 - Sep 04)
- Nearest neighbor classification: fast implementation
  - Continued development of heuristics for approximate neighbor finding with an in-memory inverted index (see the sketch after this list)
  - Our results have reduced memory by 90% and time by 90-99% with minimal impact on effectiveness
  - Packaged and delivered the kNN software
- Developing algorithms for speeding up a slow but potentially highly effective local learning approach
  - Based on training a separate logistic regression on the neighbors of each test document
  - Slow, but with many avenues to large speedups
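
The project's heuristics are not reproduced here; the core inverted-index idea can be sketched as follows (data structures and names are our assumptions): score each training document only through the terms it shares with the query, so documents with no terms in common are never touched.

```python
from collections import defaultdict
from heapq import nlargest

def build_index(train_docs):
    """train_docs: {doc_id: {term: weight}}; returns an in-memory inverted
    index {term: [(doc_id, weight), ...]}."""
    index = defaultdict(list)
    for doc_id, vec in train_docs.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index

def approx_knn(index, query_vec, k=3):
    """Accumulate similarity scores only along postings for the query's terms;
    documents sharing no terms with the query are never visited."""
    scores = defaultdict(float)
    for term, qw in query_vec.items():
        for doc_id, w in index.get(term, ()):
            scores[doc_id] += qw * w
    return nlargest(k, scores.items(), key=lambda kv: kv[1])
```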

Accomplishments, Phase II (Jan 04 - Sep 04)
- Adaptive filtering
  - Models to aid in learning when to act greedily (exploit: submit documents we believe are relevant) and when to take risks (explore: submit documents that may be irrelevant); a toy sketch follows this list
  - Seek approximate solutions to the intractable optimal exploration/exploitation tradeoff
  - Experiments show slight improvements in filtering effectiveness compared to the greedy (exploit-only) approach
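
As a toy illustration of the exploit/explore decision only (an epsilon-greedy rule of our own choosing; the project pursues principled approximations, not this heuristic):

```python
import random

def should_submit(score, threshold, epsilon=0.05):
    """Exploit: submit documents scoring above the threshold.
    Explore: occasionally submit a below-threshold document, accepting the
    risk of irrelevance in exchange for an informative user judgment."""
    if score >= threshold:
        return True                    # exploit: we believe it is relevant
    return random.random() < epsilon   # explore: take a calculated risk
```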

Some MMS Work in Depth: Bayesian Priors from Domain Knowledge
- Bayesian methods assume prior beliefs about parameters before data is seen
- Project Phase I: generic, vague priors
- Project Phase II: reference materials or intuitions about words may help predict class; use these to set priors (material very unlike the training examples)
- Goal: reduce the need for training examples
  - Replace thousands of randomly sampled examples with a few, possibly biased, examples

Knowledge-Driven Priors: Issues
- Reference texts have some non-topical words
  - Use words that discriminate among topics (using inverse document frequency (IDF) weighting within the reference collection); a sketch follows this list
- Small training sets increase problems with thresholding and text representation
  - Use unlabeled data to aid thresholding and to learn IDF weights
  - Use a separate prior for the intercept term of the model
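
One hypothetical way to turn reference-collection IDF into priors: give rare, discriminating terms a larger prior variance (weaker shrinkage toward zero) and ubiquitous, non-topical terms a smaller one. The mapping below is our assumption, not the project's formula:

```python
import math

def idf_prior_variances(reference_docs, base_var=1.0):
    """reference_docs: list of token lists (e.g., one per World Factbook
    entry). Returns {term: prior variance} for the coefficient priors."""
    n = len(reference_docs)
    df = {}
    for doc in reference_docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    # log(1 + n/df): large for rare terms, near log 2 for ubiquitous ones.
    return {t: base_var * math.log(1 + n / d) for t, d in df.items()}
```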

Knowledge-Driven Priors: Results
- Topics: 27 Reuters Region categories
- Knowledge: CIA World Factbook (WFB) entries
- Examples: 10 per topic
- Baseline results (F1 measure, defined below):
  - WFB: 0.234; no WFB: 0.052
- Better small training sets, improved algorithms:
  - WFB: 0.591; no WFB: 0.395
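
For reference, the F1 measure used above is the harmonic mean of precision and recall:

```python
def f1(true_pos, false_pos, false_neg):
    """F1 = 2PR / (P + R), the harmonic mean of precision P and recall R."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```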

Knowledge-Driven Priors: Summary
- Reference materials of a text type very different from the documents to be classified can aid supervised filtering
- In combination with tuning on unlabeled data, this technique can provide immediate practical benefits
- Current methods are crude and ad hoc; substantial improvements should be possible

Some MMS Work in Depth: Streaming Analysis
- Problem: monitor fast, massive text streams and support both online tracking and historic analysis of events
  - Multidimensional data: source, destination, time sent or received, metadata (reply, language), text
  - Labels (words, phrases), links
- Goal: use highly compact summaries that are computed at stream speed and support accurate analyses

Streaming Analysis Tool: CM Sketch
- Theoretical: we have developed the CM Sketch, which uses O((1/ε) log(1/δ)) space to approximate a data distribution with error at most ε and probability of success at least 1-δ
  - All other previously known sampling or sketch methods use space at least Ω(1/ε²)
  - The CM Sketch is an order of magnitude better
- Practical: a few tens of KBs give an accurate summary of large data; create summaries that allow historic queries to find:
  - Heavy hitters (most frequent items)
  - Quantiles of a distribution (median, percentiles, etc.)
  - Items with large changes
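
The released C library is not reproduced here, but the structure is small enough to sketch. A minimal Count-Min sketch in Python (the parameter defaults and the per-row hashing are our simplifications):

```python
import math
import random

class CountMinSketch:
    """Width O(1/eps) x depth O(log 1/delta) table of counters; point queries
    overestimate the true count by at most eps*N with probability >= 1-delta,
    where N is the total count inserted."""
    def __init__(self, eps=0.001, delta=0.01, seed=0):
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(self.depth)]

    def _col(self, row, item):
        # Simplified per-row hash; the paper uses pairwise-independent hashes.
        return hash((self.salts[row], item)) % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._col(row, item)] += count

    def estimate(self, item):
        # Minimum across rows trims the overcounting from hash collisions.
        return min(self.table[row][self._col(row, item)]
                   for row in range(self.depth))
```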

Streaming Analysis: Using Web Logs
- Web logs (blogs), i.e., regularly updated online journals, provide informal, opinionated, candid data that is more like email than the web at large
- We have begun to automatically collect blogs, stripping formatting, tags, ads, etc., and feeding the corresponding "bag of words" into streaming algorithms for analysis and archiving, at the 10s-to-100 GB scale (a toy version of this pipeline follows)
- 3000:1 compression using CM Sketch methods
- Allows accurate analysis of popular words, newly emergent words, etc., including multilingual occurrences
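
A hypothetical toy version of this pipeline, built on the CountMinSketch sketch above (the sample posts and whitespace tokenizer are stand-ins for the real scraping and cleaning):

```python
sketch = CountMinSketch(eps=0.001, delta=0.01)

posts = ["oil prices rise again", "prices of oil may fall"]  # stand-in stream
for post in posts:
    for word in post.split():   # the real pipeline strips tags, ads, markup
        sketch.update(word)

print(sketch.estimate("oil"))   # approximate historic frequency of a word
```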

Deliverables, Phase I
- Classic method: Rocchio
- Classic method: Centroid
- kNN with IFH (inverted file heuristic)
- Sparse Bayesian (Bayesian with Laplace priors)
- Combinatorial PCA
- Homotopic linking of widely varying Rocchio methods
- aiSVM
- Fusion

Deliverables, Phase II
- Revised and extended version of the kNN code, including scripts for running local learning experiments
- Substantially extended version of BBR, including use of domain knowledge to set priors
- CM Sketch (C library for count-min sketching)
- Code to use the CM Sketch to find heavy hitters, quantiles, and large changes in streams

MMS Future Directions
- Bayesian
  - Expand the types of domain knowledge that can be used
    - For instance, making use of the taxonomies available in many subject areas
  - Improve self-tuning of the BBR software
    - Make it more effective for novice users
    - Surprisingly subtle questions: cross-validation, calibration, scaling (e.g., when there are multiple features)
  - Incorporate our previous work on online Bayesian methods into BBR

MMS Future Directions
- Streaming
  - Systematically explore summarization methods such as sampling, bitmaps, and sketches
  - Develop warehousing techniques for large-scale sketch-based historical analyses
  - The massiveness of the data makes even linear algorithms too inefficient; seek sublinear methods
  - Develop sketch-based methods for link analysis in temporally changing multigraphs
    - From and To addresses in email, links between blogs, etc.
  - Add a modeling component to the sketch-based analysis: exploit knowledge of the distribution of the data

MMS Future Directions
- kNN
  - kNN with a small training sample for each of a massive number of topics
    - Maybe only 5 to 10 known relevant/irrelevant documents per topic
    - Since small samples have little overlap, extend the kNN approach to deal with partially labeled datasets
  - Bayesian kNN
    - Incorporate methods developed in our Bayesian work for dealing with small training sets (e.g., tuning thresholds on unlabeled data)
    - More fundamental combinations of Bayesian and kNN methods (e.g., tunable distance metrics)

MMS Future Directions
- Greedy Round Robin feature selection
  - In Phase I work, we explored a greedy heuristic to choose a subset of the original set of terms as features
    - It did extremely well in the TREC 2002 topic intersection tasks
  - We will develop a Greedy Round Robin (GRR) method (a sketch follows this list)
    - Applies when features fall into two or more conceptually distinct sets (e.g., metadata such as source/destination, genre, or medium of the message)
    - Each list of features is consulted in turn
  - Plan experimental analysis of GRR
  - Plan theoretical analysis of GRR using simulation
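
Since GRR is future work, the following is only a hypothetical reading of the slide: the feature lists are visited round-robin, and each visit greedily takes that list's best remaining candidate. score_fn and the selection budget are our assumptions:

```python
def greedy_round_robin(feature_lists, score_fn, budget):
    """feature_lists: conceptually distinct candidate sets (e.g., text terms,
    source/destination metadata); score_fn(feature, selected) scores a
    candidate given the features chosen so far."""
    selected = []
    pending = [list(fl) for fl in feature_lists]   # copy; leave input intact
    i = 0
    while len(selected) < budget and any(pending):
        fl = pending[i % len(pending)]             # consult each list in turn
        if fl:
            best = max(fl, key=lambda f: score_fn(f, selected))
            fl.remove(best)                        # greedy pick from this list
            selected.append(best)
        i += 1
    return selected
```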

MMS Future Directions
- Adaptive filtering
  - Experiment with new adaptive thresholding methods (synergy with the Bayesian thresholding work); a sketch follows this list
    - The scoring threshold is adjusted downward if judging too many irrelevant documents, upward if judging too few relevant documents
  - Aim for an algorithm with state-of-the-art effectiveness and provable theoretical properties
  - Compare the rate of convergence of various algorithms on real data
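
A hypothetical additive update rule illustrating this kind of adaptive thresholding (the step size, the precision target, and the direction of each correction are our assumptions, not the project's method):

```python
def update_threshold(threshold, submitted_relevant, submitted_irrelevant,
                     step=0.01, target_precision=0.5):
    """Nudge the scoring threshold after a batch of user judgments."""
    total = submitted_relevant + submitted_irrelevant
    if total == 0:
        return threshold - step        # submitting nothing: loosen the filter
    precision = submitted_relevant / total
    if precision < target_precision:
        return threshold + step        # too many irrelevant ones: tighten
    return threshold - step            # precision is fine: loosen for recall
```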

MMS Project Team
- Paul Kantor, Rutgers Communication, Information and Library Studies
- Dave Lewis, consultant
- Michael Littman, Rutgers CS
- David Madigan, Rutgers Statistics
- S. Muthukrishnan, Rutgers CS
- Rafail Ostrovsky, Telcordia/UCLA
- Fred Roberts, Rutgers DIMACS/Math
- Martin Strauss, AT&T Labs/U. Michigan
- Wen-Hua Ju, Avaya Labs (collaborator)
- Andrei Anghelescu, graduate student
- Suhrid Balakrishnan, graduate student
- Aynur Dayanik, graduate student
- Dmitry Fradkin, graduate student
- Peng Song, graduate student
- Graham Cormode, postdoc
- Alex Genkin, software developer
- Vladimir Menkov, software developer