Title: Monitoring Message Streams
1 Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages
Fred Roberts, Rutgers University
2 THE MONITORING MESSAGE STREAMS PROJECT TEAM
- Endre Boros, Rutgers Operations Research
- Paul Kantor, Rutgers Information and Library Studies
- Dave Lewis, Consultant
- Ilya Muchnik, Rutgers DIMACS/CS
- S. Muthukrishnan, Rutgers CS
- David Madigan, Rutgers Statistics
- Rafail Ostrovsky, Telcordia Technologies
- Fred Roberts, Rutgers DIMACS/Math
- Martin Strauss, AT&T Labs
- Wen-Hua Ju, Avaya Labs (collaborator)
- Andrei Anghelescu, graduate student
- Dmitry Fradkin, graduate student
- Alex Genkin, programmer
- Vladimir Menkov, programmer
3 OBJECTIVE
- Monitor huge streams of textualized communication to automatically detect pattern changes and "significant" events
- Motivation: monitoring email traffic
4 TECHNICAL PROBLEM
- Given a stream of text in any language.
- Decide whether "new events" are present in the flow of messages.
- Event: a new topic, or a topic with an unusual level of activity.
- Initial problem: retrospective or supervised event identification, i.e., classification into pre-existing classes.
5 TECHNICAL PROBLEM
- Batch filtering: relevant documents are given up front.
- Adaptive filtering: pay for information about relevance as the process moves along.
6 MORE COMPLEX PROBLEM: PROSPECTIVE DETECTION OR UNSUPERVISED LEARNING
- Classes change: new classes appear, or existing classes change meaning
- A difficult problem in statistics
- Recent new C.S. approaches
- Semi-supervised learning
  - The algorithm suggests a possible new class
  - A human analyst labels it and determines its significance
7 COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
- (1) Compression of text -- to meet storage and processing limitations
- (2) Representation of text -- put it in a form amenable to computation and statistical analysis
- (3) Matching scheme -- computing similarity between documents
- (4) Learning method -- build on judged examples to determine the characteristics of a document cluster (event)
- (5) Fusion scheme -- combine methods (scores) to yield improved detection/clustering
8 COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - II
- These distinctions are somewhat arbitrary.
- Many approaches to message processing overlap several of these components.
- Project premise: existing methods don't exploit the full power of the five components, synergies among them, and/or an understanding of how to apply them to text data.
9 COMPRESSION
- Reduce the dimension before statistical analysis.
- We often have just one shot at the data as it comes streaming by.
10 COMPRESSION II
- Recent results: one pass through the data can reduce volume significantly without degrading performance significantly.
We believe that sophisticated dimension-reduction methods in a preprocessing stage, followed by sophisticated statistical tools in a detection/filtering stage, can be a very powerful approach. Our methods so far give us some confidence that we are right.
11 REPRESENTATIONS
- Term representation: binary, term frequency (TF), log TF, expTF
- Term weighting: IDF (inverse document frequency), IDF2, statistical
  - IDF(x) = log(1 + N/(i(x) + 1)), where N = number of documents and i(x) = number of documents in which term x appears
- Document normalization: L1 or L2
  - L1 = sum_i |x_i|, L2 = sqrt(sum_i x_i^2)
(A code sketch of these representation steps follows this slide.)
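As a concrete illustration of these representation choices, here is a minimal NumPy sketch that builds a logTF x IDF representation with L2 document normalization, using the IDF form given above, and finishes with an inner-product match. The toy term-frequency matrix and all values are illustrative, not project data.

```python
import numpy as np

# Toy document-term matrix: rows = documents, columns = terms (raw term frequencies).
# The corpus and all values are illustrative, not project data.
tf = np.array([[3, 0, 1],
               [0, 2, 0],
               [1, 1, 4]], dtype=float)
N = tf.shape[0]                       # number of documents

# Term representation: log TF (log(1 + tf) keeps zero counts at zero).
log_tf = np.log1p(tf)

# Term weighting: IDF(x) = log(1 + N / (i(x) + 1)),
# where i(x) = number of documents in which term x appears.
doc_freq = (tf > 0).sum(axis=0)
idf = np.log(1.0 + N / (doc_freq + 1.0))

weighted = log_tf * idf               # logTF x IDF representation

# Document normalization: L2 (divide each document vector by its Euclidean length).
norms = np.linalg.norm(weighted, axis=1, keepdims=True)
docs = weighted / np.where(norms == 0, 1.0, norms)

# Matching: inner product between two normalized documents (cosine-style similarity).
print(float(docs[0] @ docs[2]))
```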
12 MATCHING SCHEMES
- Inner product: the usual inner product between vectors
- Euclidean distance
13 LEARNING
- Rocchio (a linear classifier based on the center of gravity of the known relevant (and non-relevant) documents)
- Centroid (essentially a quadratic classifier based on the ratio of distances from the centroids of the positive and negative judged examples)
- aiSVM-b0
- SVMLight
- SVM and regression
- 1-NN, k-NN
- Q_up (Centroid or Rocchio in an online setting)
- Bayesian binary regression with Normal or Laplace prior
  - Dense logit, sparse logit, dense probit, sparse probit
14 FUSION METHODS
- Combining scores based on ranks, linear functions, or nonparametric schemes
15 SAMPLE METHODS STUDIED
[Table of sample methods omitted in this transcript; see the full table.]
ILH = inverted list heuristics (keep few terms)
16 DATA SETS USED
- No readily available data set has all the characteristics of the data on which we expect our methods to be used.
- However, many of our methods depend essentially only on a matrix of term frequencies by documents.
- Thus, many available data sets can be used for experimentation.
17 DATA SETS USED II
- TREC (Text Retrieval Conference) data: time-stamped subsets of the data (order 10^5 to 10^6 messages)
- Reuters Corpus Vol. 1 (8 x 10^5 messages)
- Medline abstracts (order 10^7, with human indexing)
18 COMPARING ALGORITHMS
- Effectiveness
  - F1: harmonic mean of precision and recall
    - Precision = (Selected and Relevant) / (All Selected)
    - Recall = (Selected and Relevant) / (All Relevant)
    - F1 = 2 / (1/P + 1/R), the harmonic mean
  - Area under ROC (Receiver Operating Characteristic) curves
    - Compares the true positive rate (detection rate) vs. the false positive rate (false alarm rate)
    - ROC curve: plot of true positive rate vs. false positive rate
    - The closer the area is to 1, the more accurate the classifier
(A code sketch of these measures follows this slide.)
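A minimal sketch of these effectiveness measures, computed directly from the definitions above; the scores, relevance judgments, and the 0.5 selection threshold are hypothetical.

```python
import numpy as np

def precision_recall_f1(selected, relevant):
    """selected, relevant: boolean arrays over the document stream."""
    both = np.logical_and(selected, relevant).sum()
    precision = both / selected.sum() if selected.sum() else 0.0
    recall = both / relevant.sum() if relevant.sum() else 0.0
    f1 = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    return precision, recall, f1

def roc_auc(scores, relevant):
    """Area under the ROC curve: the probability that a random relevant document
    outscores a random irrelevant one (ties count half)."""
    pos, neg = scores[relevant], scores[~relevant]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores and relevance judgments for six documents.
scores = np.array([0.9, 0.8, 0.4, 0.35, 0.2, 0.1])
relevant = np.array([True, True, False, True, False, False])
selected = scores >= 0.5           # illustrative selection threshold

print(precision_recall_f1(selected, relevant))
print(roc_auc(scores, relevant))
```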
19 COMPARING ALGORITHMS
- Space
  - Size of the data structure used
  - Dimensionality reduction relative to the dimension of the original space
  - Index storage cost
20 COMPARING ALGORITHMS
- Time
- CPU time to train, to classify
- Queries/minute
- Operation count
21 COMPARING ALGORITHMS
- Insight
- Gains in understanding
- No concrete measures
There is a tradeoff among criteria. A modest
decrease in effectiveness might be acceptable if
there is a significant savings in time or space.
22 Class of Methods I: Classical
- These models are based on counts of term frequencies (TF, bag of words), combined in an inner product using a weight that reflects the prevalence of each term in the collection (IDF). They use the center of mass of the known relevant documents, and of the known non-relevant documents, to define a classifier.
- Good for baseline comparisons to other methods
23 Classical Models II
- Rocchio model
  - Developed in the 1970s
  - The vector representing a topic is modified by adding a multiple of each vector representing a known relevant document and subtracting a multiple of each vector representing a known irrelevant document
  - The scoring function is the inner product of this vector with the vector representing an incoming document
  - There is an "ideal direction" in the space of documents, and farther in that direction is always better
24 Classical Models III
- Centroid model
  - Based on the idea that there is a desirable location
  - Anything closer to that location (in a certain ellipsoid metric) is deemed better
(A code sketch of both classical classifiers follows this slide.)
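A minimal NumPy sketch of how the Rocchio direction score and the Centroid distance-ratio score described above can be computed. The judged example vectors, the alpha/beta weights, and the function names are illustrative assumptions, not the project's implementation.

```python
import numpy as np

def rocchio_direction(pos, neg, alpha=1.0, beta=0.5):
    """Rocchio topic vector: alpha * mean(relevant docs) - beta * mean(non-relevant docs).
    alpha and beta are tuning weights (values here are illustrative)."""
    return alpha * pos.mean(axis=0) - beta * neg.mean(axis=0)

def rocchio_score(doc, topic_vector):
    # Inner-product score: farther along the "ideal direction" is always better.
    return float(doc @ topic_vector)

def centroid_score(doc, pos, neg):
    """Ratio of distances to the negative and positive centroids;
    a ratio above 1 means the document is closer to the relevant centroid."""
    d_pos = np.linalg.norm(doc - pos.mean(axis=0))
    d_neg = np.linalg.norm(doc - neg.mean(axis=0))
    return d_neg / d_pos

# Hypothetical judged examples in a tiny 3-term space.
pos = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]])
neg = np.array([[0.1, 0.8, 0.3], [0.0, 0.9, 0.5]])
doc = np.array([0.7, 0.3, 0.1])

print(rocchio_score(doc, rocchio_direction(pos, neg)))
print(centroid_score(doc, pos, neg))
```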
25 Geometrically
[Figure omitted: geometric comparison of the Rocchio and Centroid classifiers.]
The Centroid method expects a localized concentration of relevant documents (this suggests the importance of neighborhoods). Rocchio expects relevance to increase in certain preferred directions.
26 Classic Methods: Components
- Rocchio
  - Representation: term rep TF or logTF; term weight IDF or IDF2; document normalization none or L2
  - Compression: none
  - Matching: Euclidean (Q_up if online) or inner product
  - Learning: none, Rocchio, or Q_up
- Centroid
  - Representation: term rep TF or logTF; term weight IDF or IDF2; document normalization none or L2
  - Compression: none or Bayesian (2 methods)
  - Matching: Euclidean (Q_up if online) or inner product
  - Learning: none, Centroid, or Q_up (adaptive)
27 Classic Methods: Some Results
- Rocchio
  - Effectiveness of our methods is comparable to the industry standard
  - With very careful tuning, a Rocchio method produced the best adaptive results at TREC-11
- Centroid
  - Among the methods we have tested, Centroid methods are among the top performers in terms of effectiveness
28 Class of Methods II: Nearest Neighbor (kNN) Classifiers for Text Filtering
- Route a message by
  - Finding the k most similar training messages (neighbors)
  - Assigning it to the classes most common among the neighbors (optionally weighting by distance)
- kNN has been studied since 1958, and for text since the early 90s
- Moderately effective for text, but it has been considered inefficient
- However, finding neighbors only needs to be done once, no matter how many classes there are
- So for a large number of topics, kNN may be more efficient than one-classifier-per-topic approaches
29 kNN Components
- Representation: term representation TF; term weight IDF2; document normalization L2
- Compression: none, ILH-1, ILH-2, ILH-3, or RPV
- Matching: Euclidean or L2 inner product
- Learning: none for batch; otherwise traditional k-NN
30 Speeding up kNN
- Worked on a fast implementation (a sketch follows this slide)
- Store text and classes sparsely (Representation)
- Store class labels sparsely
- Arrange computations to do work proportional only to the number of class labels among the neighbors, not the total number of classes
- Search-engine heuristics: use the in-memory inverted file (Matching)
  - Use an inverted file
  - Retain only high-impact terms within each document, or within each inverted list
  - Classification-time selection of inverted lists
31 kNN Results
- Large reduction in the size of the inverted index and large gains in classification speed
- Slight additional cost in effectiveness
- Effectiveness slightly below our best methods (Bayesian probit and logistic classifiers)
- Compressed index is 90% smaller than the original index, with only a 7-12% loss in effectiveness
- Approximate matching is 90-95% faster, with only a 2-10% loss in effectiveness
- Ours are the first large-scale experiments on search-engine heuristics for neighbor lookup in kNN
- A partnership between theoreticians and practitioners
32 Random Projections Method (k-NN)
First large-scale implementation of theoretical random projection methods. Shows promise for large-scale datasets.
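A minimal sketch of the random projection idea in the Johnson-Lindenstrauss style: multiply the document matrix once by a random Gaussian matrix, then do subsequent distance computations (e.g., for kNN) in the much smaller projected space. The dimensions, density, and data below are illustrative assumptions, not the project's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sparse document matrix: 200 documents x 20,000 terms (values illustrative).
n_docs, n_terms, k = 200, 20_000, 300
docs = rng.random((n_docs, n_terms)) * (rng.random((n_docs, n_terms)) < 0.001)

# Random projection: a random Gaussian matrix scaled by 1/sqrt(k) approximately
# preserves pairwise distances, so kNN can run in k dimensions instead of n_terms.
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(n_terms, k))
projected = docs @ R                  # single pass over the data; now 200 x 300

# Distances in the projected space approximate distances in the original space.
print(np.linalg.norm(docs[0] - docs[1]), np.linalg.norm(projected[0] - projected[1]))
```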
33 Class of Methods III: Bayesian Methods
- Bayesian statistical methods place prior probability distributions on all unknowns, and then compute the posterior distribution of the unknowns conditional on the knowns.
[Portrait: Thomas Bayes]
34 Bayesian Methods
- Zhang and Oles (2001): first use of large-scale Bayesian logistic regression (10,000 dimensions)
- The Bayesian approach explicitly incorporates prior knowledge about model complexity (regularization)
- Bayesian logistic regression with Laplace priors: we extend the Bayesian approach to incorporate a prior requirement for sparsity, in applications with up to 100,000 dimensions
- Logistic regression has one parameter per dimension; our sparse model sets many of these to zero
- The resulting sparse models produce outstanding accuracy and ultra-fast predictions with no ad hoc feature selection
(A sketch of the Laplace-prior idea follows this slide.)
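As an illustration of the Laplace-prior idea: the MAP estimate under a Laplace prior coincides with L1-penalized logistic regression, which drives most coefficients to exactly zero. The sketch below uses scikit-learn's off-the-shelf L1 logistic regression as a stand-in; it is not the project's Bayesian Binary Regression software, and the synthetic data and regularization strength are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical term-count features for 500 documents over 2,000 terms;
# only the first 10 terms actually carry signal about the topic.
X = rng.poisson(0.05, size=(500, 2000)).astype(float)
y = (X[:, :10].sum(axis=1) + rng.normal(0, 0.5, 500) > 0.5).astype(int)

# MAP estimation under a Laplace prior on the coefficients is equivalent to
# L1-penalized logistic regression, which sets most coefficients to exactly zero.
# LogisticRegression here is an off-the-shelf stand-in, not the project's BBR code.
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
model.fit(X, y)

print(f"{np.sum(model.coef_ != 0)} of {X.shape[1]} coefficients are nonzero")
```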
35 Bayesian Methods: Components
- Representation: term representation logTF; term weighting IDF; document normalization none or L2
- Compression: none or N-best (3 methods)
- Matching: inner product
- Learning: Bayesian binary regression with Laplace prior or Normal prior
36 Bayesian Methods: Sample Results
[Results table omitted: rows 1-3 are Reuters Corpus Vol. 1 (RCV-1) data, rows 4-6 are Medline (OHSUMED) data; SVM is included in the table for comparison's sake. NumFeat = dimensions in the document representation; MedianUsed = words used to make predictions (the role of sparsity).]
37 Bayesian Methods: Sample Results
- Our implementation of Bayesian logistic regression with Laplace priors is highly efficient for very large-scale problems.
- Other efficient variants are implemented as well (Bayesian probit).
- Compared to Zhang & Oles, our implementation is
  - 200 to 2,000 times faster
  - 200 to 2,000 times smaller in required space
  - As accurate as the best results ever published
- What we did is unique: very large scale (122,000 features) with no ad hoc feature selection
- In sum, we have a sparseness-inducing Laplace prior that produces dramatically simpler models with essentially no loss in accuracy
38 Data Fusion
- Data fusion combines several systems for representing, compressing, matching, or learning into a compound system that is generally able to perform better
- Each system assigns a score to a new document
- We have explored score-combining schemes for data fusion
- Work has emphasized computational and visualization tools to facilitate the study of relationships among various combinations of methods
39 Fusion: Key Findings
- The global fusion approach asks: is there a rule for combining the results of two or more systems without regard to the specific pattern or topic being sought? It does not work terribly well.
- Local fusion: when specialized to the specific topic, simple linear methods can produce substantial improvements (up to 10% or 20%) as judged by standard measures of retrieval performance (a sketch follows this slide)
- Some positive results in the case where the fusion rule is selected on the basis of one sample of topics and applied to another
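A minimal sketch of local (topic-specific) linear score fusion: choose the mixing weight that maximizes F1 on judged documents for a topic, then apply that weight to new documents. The two score vectors, the judgments, the threshold, and the weight grid are hypothetical.

```python
import numpy as np

# Hypothetical normalized scores from two systems (say, Rocchio and kNN) on five documents.
scores_a = np.array([0.91, 0.40, 0.75, 0.20, 0.55])
scores_b = np.array([0.85, 0.52, 0.60, 0.05, 0.70])
relevant = np.array([True, False, True, False, True])

def f1_at_threshold(scores, relevant, threshold=0.5):
    selected = scores >= threshold
    both = np.logical_and(selected, relevant).sum()
    p = both / selected.sum() if selected.sum() else 0.0
    r = both / relevant.sum() if relevant.sum() else 0.0
    return 2 * p * r / (p + r) if p and r else 0.0

# Local fusion: pick the linear mixing weight that works best on judged documents
# for this topic, then apply it to new documents.
best_w = max(np.linspace(0.0, 1.0, 11),
             key=lambda w: f1_at_threshold(w * scores_a + (1 - w) * scores_b, relevant))
print("best weight on system A:", best_w)
```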
40 Some Comparisons of Effectiveness
These are just a few examples of results; we have done over 5,000 complete experiments.
[Comparison charts omitted: one for the full RCV1 data set, one for a subset of 4,060 items.]
41 Compression Methods: Summary
- Applied old heuristics and new theory-based algorithms
  - Random projections to real subspaces
  - Random projections to Hamming cubes
  - Combinatorial PCA
  - Streaming algorithms for finding deviant cases in massive data
  - Classic search-engine inverted-file heuristics
42 Compression Methods: Combinatorial PCA (cPCA)
- PCA (principal components analysis): a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. Goals: reduce dimensionality, discover meaningful variables. (A sketch of standard PCA follows this slide.)
- New: a combinatorial analog of PCA (novel method)
  - Finds the most frequent and most strongly correlated features in the data matrix
- Application: feature selection for a learning algorithm (SVM in our experiments)
- cPCA is a compression method for any type of data matrix
- cPCA can be used with any classifier learning method
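For context only, here is a minimal sketch of the standard PCA step the slide describes, using an SVD of a centered document-term matrix. cPCA itself is the project's novel combinatorial analog (it selects frequent, strongly correlated features rather than forming linear combinations) and is not reproduced here; all data and dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical document-term matrix: 200 documents x 1,000 terms (illustrative counts).
X = rng.poisson(0.1, size=(200, 1000)).astype(float)

# Standard PCA via SVD: project onto the top-k principal components.
# cPCA, in contrast, selects frequent, strongly correlated features instead of
# forming linear combinations; that novel method is not shown here.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 50
X_reduced = Xc @ Vt[:k].T          # 200 x 50 input for a downstream classifier (e.g., SVM)

print(X_reduced.shape)
```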
43 cPCA: Experimental Results
- Experiments with 2 data sets, 16 representations, and 10 classifiers
- Dimensionality reduction by a factor of 30 (47,000 terms down to 1,500)
- Time: 3 minutes for 23,000 documents x 47,000 words
- cPCA has computational complexity lower by an order of magnitude compared to PCA
44 STREAMING DATA ANALYSIS
- Motivated by the need to make decisions about data during an initial scan, as the data stream by
- Recent development of theoretical CS algorithms
- Algorithms motivated by intrusion detection, transaction applications, and time-series transactions
45 Streaming Text Analyses
- A_j = number of texts seen in time period j
- A_j = number of documents that contain word j
- A_{i,j} = number of emails or bytes sent from address i to address j
- A_{i,j} = number of occurrences of word j in document i
Using these different data representations, we have developed rapid methods for a number of monitoring applications, such as finding changing trends, outliers and deviants, unique items, rare events, heavy hitters, etc. (A heavy-hitters sketch follows this slide.)
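One standard example of this style of one-pass, bounded-space analysis is the Misra-Gries frequent-items (heavy hitters) summary sketched below. It is shown only to illustrate the flavor of streaming summaries and is not necessarily the project's specific algorithm; the word stream is hypothetical.

```python
def misra_gries(stream, k):
    """One-pass frequent-items (heavy hitters) summary using at most k-1 counters.
    A classical streaming algorithm, shown only to illustrate bounded-space summaries."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter and drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical word stream in which "attack" is the heavy hitter.
stream = ["attack"] * 40 + ["plan", "budget", "report", "memo"] * 15
print(misra_gries(stream, k=4))
```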
46 Decision-Theoretic Adaptive Filtering (A Formal Framework)
- Goal: pick out interesting documents in a stream and present them to an oracle for a reward
- The oracle gives feedback only for presented documents
- Complex rewards and penalties (e.g., if we never present any documents, we never improve)
- The formal model chooses an optimal strategy balancing instant rewards and penalties against long-term payoff (a simplified sketch follows this slide)
- Initial implementation and experimentation show considerable promise
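A deliberately simplified caricature of the kind of decision rule such a framework produces, assuming a fixed reward and penalty and a flat bonus for the value of the feedback gained by presenting; the actual formal model optimizes long-term payoff, and all numbers here are invented for illustration.

```python
def present_document(p_relevant, reward=3.0, penalty=-1.0, feedback_bonus=0.2):
    """Present a document when its expected immediate payoff plus a flat bonus for the
    feedback we would receive is positive. All parameter values are invented for
    illustration; the actual formal model optimizes long-term payoff."""
    expected_immediate = p_relevant * reward + (1 - p_relevant) * penalty
    return expected_immediate + feedback_bonus > 0

# A borderline document is presented because the feedback itself has value ...
print(present_document(p_relevant=0.25))   # True: immediate payoff is 0, the bonus tips it
# ... while a clearly uninteresting one is not.
print(present_document(p_relevant=0.05))   # False
```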
47 METHODS READY FOR INTEGRATION AND TESTING BY THE PROTOTYPE DEVELOPERS
- Classic method: Rocchio
- Classic method: Centroid
- kNN with IFH (inverted file heuristic)
- Sparse Bayesian (Bayesian with Laplace priors)
- cPCA
- Lots of code for components
48 OUR EXPECTATION: IMPACT AFTER 12 MONTHS
- We will have developed innovative methods for classifying accumulated documents in relation to known tasks/targets/themes and for building profiles to track future relevant messages.
- We are optimistic that, through end-to-end experimentation, we will continue to discover new uses, relevant to each of the component tasks, for recently developed mathematical and statistical methods. We expect to achieve significant improvements in performance on accepted measures that could not be achieved by piecemeal study of one or two component tasks or by simply concentrating on pushing existing methods further.
49 ACCOMPLISHMENTS: IMPACT AFTER 12 MONTHS
- General observation: on classic measures such as precision, recall, and F1, the existing methods do quite well. We have sought progress on other measures.
- For example, if we can reduce the space required by 90% with only a 10% loss in F1, we can cast a wider net in the search for important messages. (Our kNN methods do this.)
- Similarly, if we can reduce the time required significantly while only reducing effectiveness a little, we can have a similar impact. (Our kNN and Bayesian methods do this.)
50 FUTURE WORK: OVERALL GOALS FOR THE NEXT TWO YEARS
- We will have extended our analysis to semi-supervised discovery of potentially interesting clusters of documents, provided we can interact with appropriate judges who can provide realistic feedback.
- This should allow us to identify potentially threatening events in time for cognizant agencies to prevent them from occurring.
51 Future Work: kNN
- Continue efforts to improve efficiency by moving to the online setting
- Improve efficiency by demonstrating provable-quality approximate matching guarantees
- Randomized matrix multiplication methods
- Draw on lessons from the random projection work
52 Future Work: kNN
- Now that we have fast kNN, work on the learning component to improve effectiveness
  - Local logistic regression
  - Online learning of thresholds
- Big issue: only some neighbors are labeled
  - Apply more sophisticated learning algorithms to the neighbors once they are found: local learning (used in robotics, not much in text classification)
  - Might have no neighbor judged for a topic
  - Use reverse kNN learning?
53 Future Work: Random Projections
- For random projection to be effective, we need huge spaces, so we will look at pairs or triples of terms
- Develop random projection methods for online processing of text messages when items are being added to or deleted from the database, using derandomization tricks
- Use for clustering methods for semi-supervised learning
54 Future Work: Bayesian Methods
- Online Bayesian algorithms
  - Status: small experiments with two competing algorithms
  - Plan: comprehensive large-scale implementation and experiments
- Develop Bayesian multi-label models (k of n categories). All our work so far is on binary classification; real-world documents can simultaneously belong to many stochastically dependent categories
  - Status: designed three competing approaches
  - Plan: comprehensive large-scale implementation and experiments
55 Future Work: Bayesian Methods II
- Our target applications feature a paucity of labeled training data. Highly informative Bayesian priors, drawing on information in category descriptions, have the potential to provide dramatic improvements in predictive accuracy with tiny training datasets
  - Status: simple experiments with initial ideas
  - Plan: comprehensive large-scale implementation and experiments
56 Future Work: Bayesian Methods III
- Merging the formal decision-theoretic model for adaptive filtering with online Bayesian algorithms
  - Our decision-theoretic framework for adaptive filtering shows considerable promise
  - We need to marry our online sparse Bayesian model with the decision-theoretic framework
- High-dimensional Bayesian topic detection and tracking via sparse hidden Markov models
  - Important for semi-supervised learning
  - Build on recent advances in variational approaches that provide dramatic speedups
57 Future Work: Feature Selection Methods, cPCA
- Develop an online feature selection method that could work with any online classification algorithm
- Develop a feature selection method to detect rare events in massive message streams
- Tailor cPCA to the specifics of SVM classifiers
58 Future Work: Fusion
- Fusing with scores means giving up most of the information that diverse methods contain. This presents a strong challenge for us; new tools are needed.
- Fusion will be much more effective when it is applied earlier in the chain of components, to combine methods of representation, matching, compression, and learning.
59 Future Work: Streaming Text Data, Historic Data Analysis
- The accumulation of text messages over time is massive
- Most streaming research is focused on ongoing or current analyses
- It is a great challenge to use only summarized historic data to see whether a currently emerging phenomenon had precursors in the past
- We propose an exciting and novel architecture for historic and posterior analyses via small summaries
60 Architecture for Historic Text Data Analysis
[Architecture diagram, conceptually:]
- Conceptual representation: multidimensional data with attributes (To, From user IDs, Time sent), metadata (Reply? Cc? Fwd? Attachments? Language?), and text labels (words, phrases, etc.)
- The data stream feeds a materialized summary (updated on the stream): big changes, deviants, unusual trends, decision trees, clustering
- The summary supports approximate ad hoc queries and analyses
61 Future Work: Streaming Text Data, Tracking Term Statistics
- Text filtering systems use word statistics
  - Frequency of words in all documents (IDF) or in classes of documents (Naive Bayes)
- But new words (and phrases) never stop coming
- How do we estimate the frequencies of an unbounded number of terms from bounded-space statistics? (One standard bounded-space structure is sketched after this slide.)
- Approaches we will study
  - Randomized streaming-data algorithms
  - Reinforcement learning
  - Backoff models on term structure
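One standard bounded-space structure for this kind of term-frequency tracking is a Count-Min sketch, shown below as an illustration. It is not necessarily the method the project will adopt, and the width, depth, and hashing scheme are illustrative.

```python
import numpy as np

class CountMinSketch:
    """Bounded-space frequency estimator for an unbounded vocabulary.
    Width, depth, and the hashing scheme here are illustrative."""

    def __init__(self, width=2048, depth=4, seed=0):
        self.width = width
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.seeds = np.random.default_rng(seed).integers(1, 2**31, size=depth)

    def _columns(self, term):
        return [hash((int(s), term)) % self.width for s in self.seeds]

    def add(self, term, count=1):
        for row, col in enumerate(self._columns(term)):
            self.table[row, col] += count

    def estimate(self, term):
        # Never underestimates; taking the minimum over rows bounds hash-collision error.
        return int(min(self.table[row, col] for row, col in enumerate(self._columns(term))))

cms = CountMinSketch()
for word in ["attack"] * 50 + ["plan"] * 5 + ["memo"]:
    cms.add(word)
print(cms.estimate("attack"), cms.estimate("plan"), cms.estimate("unseen"))
```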
62 Future Work: Streaming Data Analysis, Tracking Topics and Parties (new direction of research)
- The number of participants in information exchange is vast and growing rapidly. How do we track topics as they emerge?
- A key problem in semi-supervised online learning
- Topic tracking: good relevance scores are linear combinations of words (frequencies or occurrences)
- Idea: keep a sketch of parties for each word separately
  - Linear combinations of sketches make sense, so we can trace topics
63 Future Work: Streaming Data Analysis, Tracking Topics and Parties
- This design is topic-independent
  - No need to know topics in advance
  - No limit on the number of topics
- This design is efficient
  - Sketches help search many combinations of words potentially related to a topic very quickly (a toy illustration of this linearity follows)
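A toy illustration of the linearity property these slides rely on, under the assumption that each word's sketch is a vector of counts hashed by sending party: the sketch of a weighted combination of words is the same weighted combination of the per-word sketches. The bucket count, hash, parties, weights, and function name are all hypothetical.

```python
import numpy as np

BUCKETS = 64          # number of hashed party buckets (illustrative)

def party_sketch(activity):
    """Per-word sketch: counts of the word's occurrences, hashed by sending party.
    A simplified stand-in for the per-word sketches discussed on these slides."""
    sketch = np.zeros(BUCKETS)
    for party, count in activity:
        sketch[hash(party) % BUCKETS] += count
    return sketch

# Hypothetical per-word activity as (party, number of occurrences) pairs.
sketch_attack = party_sketch([("alice", 5), ("bob", 2)])
sketch_plan   = party_sketch([("alice", 3), ("carol", 7)])

# Linearity: the sketch of a weighted word combination (a topic's relevance score)
# is the same weighted combination of the per-word sketches.
topic_sketch = 0.7 * sketch_attack + 0.3 * sketch_plan
print(topic_sketch[hash("alice") % BUCKETS])   # alice's estimated activity on this topic
```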