1
Monitoring Message Streams Algorithmic Methods
for Automatic Processing of Messages
Fred Roberts, Rutgers University
2
THE MONITORING MESSAGE STREAMS PROJECT TEAM
Endre Boros, Rutgers Operations Research; Paul
Kantor, Rutgers Information and Library Studies;
Dave Lewis, Consultant; Ilya Muchnik, Rutgers
DIMACS/CS; S. Muthukrishnan, Rutgers CS; David
Madigan, Rutgers Statistics; Rafail Ostrovsky,
Telcordia Technologies; Fred Roberts, Rutgers
DIMACS/Math; Martin Strauss, AT&T Labs; Wen-Hua
Ju, Avaya Labs (collaborator); Andrei Anghelescu,
Graduate Student; Dmitry Fradkin, Graduate
Student; Alex Genkin, Programmer; Vladimir
Menkov, Programmer
3
OBJECTIVE
Monitor huge streams of textualized communication
to automatically detect pattern changes and
"significant" events
Motivation: monitoring email traffic
4
TECHNICAL PROBLEM
  • Given a stream of text in any language.
  • Decide whether "new events" are present in the
    flow of messages.
  • Event = a new topic, or a topic with an unusual
    level of activity.
  • Initial problem: retrospective or supervised
    event identification, i.e., classification into
    pre-existing classes.

5
TECHNICAL PROBLEM
  • Batch filtering: relevant documents are given up
    front.
  • Adaptive filtering: pay for information about
    relevance as the process moves along.

6
  • MORE COMPLEX PROBLEM: PROSPECTIVE DETECTION OR
    UNSUPERVISED LEARNING
  • Classes change - new classes appear or existing
    classes change meaning
  • A difficult problem in statistics
  • Recent new C.S. approaches
  • Semi-supervised Learning
  • Algorithm suggests a possible new class
  • Human analyst labels it and determines its
    significance

7
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
  • (1). Compression of Text -- to meet storage and
    processing limitations
  • (2). Representation of Text -- put in form
    amenable to computation and statistical analysis
  • (3). Matching Scheme -- computing similarity
    between documents
  • (4). Learning Method -- build on judged examples
    to determine characteristics of document cluster
    (event)
  • (5). Fusion Scheme -- combine methods (scores) to
    yield improved detection/clustering.

8
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - II
  • These distinctions are somewhat arbitrary.
  • Many approaches to message processing overlap
    several of these components of automatic message
    processing.
  • Project premise: existing methods don't exploit
    the full power of the 5 components, synergies
    among them, and/or an understanding of how to
    apply them to text data.

9
COMPRESSION
  • Reduce the dimension before statistical analysis.
  • We often have just one shot at the data as it
    comes streaming by

10
COMPRESSION II
  • Recent results: one pass through the data can
    reduce volume significantly without degrading
    performance significantly.

We believe that sophisticated dimension reduction
methods in a preprocessing stage followed by
sophisticated statistical tools in a
detection/filtering stage can be a very powerful
approach. Our methods so far give us some
confidence that we are right.
11
REPRESENTATIONS
  • Term representation: binary, term frequency (TF),
    log TF, expTF
  • Term weighting: IDF (inverse document frequency),
    IDF², statistical
  • IDF(x) = log(1 + N/(i(x) + 1)), where
    N = number of documents and
    i(x) = number of documents in which term x appears
  • Document normalization: L1, L2
  • L1 = sum |x_i|, L2 = sqrt(sum x_i²)
    (a small representation sketch follows below)
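As a concrete illustration of these representation choices (a minimal sketch, not the project's code; names are illustrative), the following computes logTF x IDF weights with L2 document normalization:

```python
import math
from collections import Counter

def build_representation(docs):
    """Turn tokenized documents into logTF * IDF vectors with L2 normalization."""
    N = len(docs)
    # i(x): number of documents in which term x appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # IDF(x) = log(1 + N / (i(x) + 1)), as defined on the slide
    idf = {t: math.log(1.0 + N / (df + 1.0)) for t, df in doc_freq.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # term representation: log TF; term weighting: IDF
        vec = {t: (1.0 + math.log(c)) * idf[t] for t, c in tf.items()}
        # document normalization: L2 = sqrt(sum of squared weights)
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors, idf

# usage: two toy tokenized documents
vecs, idf = build_representation([["alert", "airport", "alert"], ["weather", "airport"]])
```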

12
MATCHING SCHEMES
  • Inner product
  • The usual inner product between vectors
  • Euclidean distance

13
LEARNING
  • Rocchio (A linear classifier based on the center
    of gravity of the known relevant (and not
    relevant) documents.)
  • Centroid (Essentially quadratic classifier based
    on ratio of distances from the centroids of the
    positive and negative judged examples.)
  • SVM (linear decision rule of the form a·x - b)
  • SVMLight
  • SVM and Regression
  • 1-NN, k-NN
  • Q_up (Centroid or Rocchio in online setting)
  • Bayesian Binary Regression with Normal or Laplace
    prior
  • Dense logit, sparse logit, dense probit, sparse
    probit

14
FUSION METHODS
  • combining scores based on ranks, linear
    functions, or nonparametric schemes

15
SAMPLE METHODS STUDIED
(Table of sample methods studied, not reproduced in
this transcript.)
ILH = inverted list heuristics (keep few terms)
16
DATA SETS USED
  • No readily available data set has all the
    characteristics of data on which we expect our
    methods to be used
  • However, many of our methods depend essentially
    only on a matrix of term frequencies by
    documents.
  • Thus, many available data sets can be used for
    experimentation.

17
DATA SETS USED II
  • TREC (Text Retrieval Conference) data:
    time-stamped subsets of the data (order 10^5 to
    10^6 messages)
  • Reuters Corpus Vol. 1 (8 x 10^5 messages)
  • Medline Abstracts (order 10^7, with human indexing)

18
COMPARING ALGORITHMS
  • Effectiveness
  • F1: harmonic mean of precision and recall
  • Precision = |Selected ∩ Relevant| / |Selected|
  • Recall = |Selected ∩ Relevant| / |Relevant|
  • F1 = 2/(1/P + 1/R) (harmonic mean)
  • Area under ROC (Receiver Operating
    Characteristic) curves
  • Comparing true positive rate (detection rate) vs.
    false positive rate (false alarm rate)
  • ROC curve: plot of true positive rate vs. false
    positive rate
  • The closer the area is to 1, the more accurate
    the classifier (see the sketch below)
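A small sketch (illustrative, not the project's evaluation code) of how these effectiveness measures can be computed from the set of selected documents, the set of relevant documents, and per-document scores:

```python
def precision_recall_f1(selected, relevant):
    """Precision = |selected & relevant| / |selected|; Recall = |selected & relevant| / |relevant|."""
    hits = len(selected & relevant)
    p = hits / len(selected) if selected else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0  # equivalent to 2 / (1/P + 1/R)
    return p, r, f1

def roc_auc(scores, labels):
    """Area under the ROC curve: probability a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(precision_recall_f1({1, 2, 3}, {2, 3, 4}))      # (0.667, 0.667, 0.667)
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))    # 0.75
```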

19
COMPARING ALGORITHMS
  • Space
  • Size of data structure used
  • Dimensionality reduction relative to dimension of
    original space
  • Index storage cost

20
COMPARING ALGORITHMS
  • Time
  • CPU time to train, to classify
  • Queries/minute
  • Operation count

21
COMPARING ALGORITHMS
  • Insight
  • Gains in understanding
  • No concrete measures

There is a tradeoff among criteria. A modest
decrease in effectiveness might be acceptable if
there is a significant savings in time or space.
22
Class of Methods I Classical
  • These models are based on counts of term
    frequencies (TF, bag of words), combined in an
    inner product, using a weight that reflects the
    prevalence of a term in the collection (IDF).
    They use the center of mass of the known relevant
    documents, and of the known non-relevant
    documents, to define a classifier.
  • Good as a baseline for comparison with other
    methods

23
Classical Models II
  • Rocchio Model
  • Developed in the 1970s
  • Vector representing the topic is modified by
    adding a multiple of each vector representing a
    known relevant document and subtracting a
    multiple of each vector representing a known
    irrelevant document
  • Scoring function is the inner product of this
    vector with the vector representing an incoming
    document
  • There is an "ideal direction" in the space of
    documents, and farther in that direction is
    always better (see the sketch below).
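A minimal sketch of the Rocchio idea just described, assuming dense NumPy term vectors; the weighting constants alpha, beta, gamma are illustrative defaults, not values from the slides:

```python
import numpy as np

def rocchio_profile(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Topic vector = alpha*query + beta*mean(relevant docs) - gamma*mean(irrelevant docs)."""
    profile = alpha * query
    if len(relevant):
        profile = profile + beta * np.mean(relevant, axis=0)
    if len(irrelevant):
        profile = profile - gamma * np.mean(irrelevant, axis=0)
    return profile

def rocchio_score(profile, doc):
    """Score an incoming document by its inner product with the topic vector."""
    return float(np.dot(profile, doc))
```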

24
Classical Models III
  • Centroid Model
  • Based on the idea that there is a desirable
    location
  • Anything closer to that location (in a certain
    ellipsoidal metric) is deemed better.
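A sketch of the centroid idea (illustrative only; the slides mention an ellipsoidal metric and a ratio of distances, while this simplified version uses plain Euclidean distances):

```python
import numpy as np

def centroid_classifier(positives, negatives):
    """Return a scoring function based on distances to the two class centroids."""
    c_pos = np.mean(positives, axis=0)   # centroid of judged relevant documents
    c_neg = np.mean(negatives, axis=0)   # centroid of judged non-relevant documents

    def score(doc):
        # larger score = closer to the positive centroid than to the negative one
        return np.linalg.norm(doc - c_neg) - np.linalg.norm(doc - c_pos)

    return score
```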

25
Geometrically
(Figure: Rocchio vs. Centroid decision geometry.)
Centroid method expects a localized concentration
of relevant documents. (This suggests the
importance of neighborhoods.) Rocchio expects
relevance to increase in certain preferred
directions.
26
Classic Methods: Components
  • Rocchio
  • Representation: term rep TF or logTF; term
    weight IDF or IDF²; document normalization none
    or L2
  • Compression: none
  • Matching: Euclidean (Q_up if online) or inner
    product
  • Learning: none, Rocchio, or Q_up
  • Centroid
  • Representation: term rep TF or logTF; term
    weight IDF or IDF²; document normalization none
    or L2
  • Compression: none or Bayesian (2 methods)
  • Matching: Euclidean (Q_up if online) or inner
    product
  • Learning: none or Centroid, Q_up (adaptive)

27
Classic Methods: Some Results
  • Rocchio
  • Effectiveness of our methods is comparable to the
    industry standard
  • With very careful tuning, a Rocchio method
    produced the best adaptive results at TREC-11
  • Centroid
  • Among the methods we have tested, Centroid
    methods are among the top performers in terms of
    effectiveness

28
Class of Methods II: Nearest Neighbor (kNN)
Classifiers for Text Filtering
  • Route a message by:
  • Finding the k most similar training messages
    (neighbors)
  • Assigning it to the classes that are most common
    among the neighbors (optionally weighting by
    distance)
  • kNN studied since 1958, for text since the early
    90s
  • Moderately effective for text; has been
    considered inefficient
  • But finding neighbors only needs to be done once,
    no matter how many classes
  • So for a large number of topics, it may be more
    efficient than one-classifier-per-topic
    approaches (a routing sketch follows below)
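A compact sketch of kNN routing over sparse term-weight vectors (cosine similarity; illustrative, not the project's implementation):

```python
import heapq
import math
from collections import Counter

def cosine(a, b):
    """Inner product of two sparse dict vectors divided by their norms."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_route(doc, training, k=5):
    """training: list of (sparse_vector, set_of_class_labels). Returns weighted class votes."""
    scored = [(cosine(doc, vec), labels) for vec, labels in training]
    neighbors = heapq.nlargest(k, scored, key=lambda item: item[0])
    votes = Counter()
    for sim, labels in neighbors:
        for label in labels:        # work proportional to labels seen in the neighborhood
            votes[label] += sim     # optional weighting by similarity
    return votes
```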

29
kNN Components
  • Representation: term representation TF, term
    weight IDF², document normalization L2
  • Compression: none, ILH-1, ILH-2, ILH-3, or RPV
  • Matching: Euclidean or L2 inner product
  • Learning: none for batch; otherwise traditional
    kNN

30
Speeding up kNN
  • Worked on a fast implementation
  • Store text and classes sparsely (Representation)
  • Store class labels sparsely
  • Arrange computations to do work proportional only
    to the number of class labels in the neighbors,
    not the total number of classes
  • Search engine heuristics: use the in-memory
    inverted file (Matching)
  • Use inverted file
  • Retain only high-impact terms within each
    document, or within each inverted list
  • Classification-time selection of inverted lists
    (see the pruned-index sketch below)
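A sketch of the inverted-file heuristic (illustrative): keep only the highest-impact postings for each term, then score neighbor candidates by walking just the query document's lists.

```python
from collections import Counter, defaultdict

def build_pruned_index(doc_vectors, keep=10):
    """Inverted file: term -> [(doc_id, weight)], keeping only the top-'keep' postings per term."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(doc_vectors):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    for term, postings in index.items():
        postings.sort(key=lambda p: p[1], reverse=True)
        index[term] = postings[:keep]          # retain only high-impact postings
    return index

def candidate_scores(query_vec, index):
    """Approximate inner-product scores using only the pruned inverted lists."""
    scores = Counter()
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return scores
```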

31
kNN Results
  • Great reduction in the size of the inverted index
    and in classification time
  • Slight additional cost in effectiveness
  • Effectiveness slightly below our best methods
    (Bayesian probit and logistic classifiers)
  • Compressed index is 90% smaller than the original
    index with only a 7-12% loss in effectiveness
  • Approximate matching is 90-95% faster with only a
    2-10% loss in effectiveness
  • Ours are the first large-scale experiments on
    search engine heuristics for neighbor lookup in
    kNN
  • Partnership between theoreticians and
    practitioners.

32
Random Projections Method + k-NN
First large-scale implementation of theoretical
random projection methods. Shows promise for
large-scale datasets.
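A minimal sketch of random projection (Johnson-Lindenstrauss style) for compressing term vectors before kNN; the dimensions below are illustrative, not the project's settings:

```python
import numpy as np

def random_projection(X, target_dim=256, seed=0):
    """Project rows of X (n_docs x n_terms) to target_dim dimensions;
    pairwise distances are approximately preserved with high probability."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], target_dim)) / np.sqrt(target_dim)
    return X @ R

X = np.random.rand(100, 5000)          # 100 documents over 5,000 terms
X_small = random_projection(X, 256)    # kNN can now work in 256 dimensions
```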
33
Class of Methods III Bayesian Methods
  • Bayesian statistical methods place prior
    probability distributions on all unknowns, and
    then compute posterior distribution for the
    unknowns conditional on the knowns.

(Portrait: Thomas Bayes)
34
Bayesian Methods
  • Zhang and Oles (2001): first use of large-scale
    Bayesian logistic regression (10,000 dimensions)
  • The Bayesian approach explicitly incorporates
    prior knowledge about model complexity
    (regularization)
  • Bayesian logistic with Laplace priors: we extend
    the Bayesian approach to incorporate a prior
    requirement for sparsity in applications with up
    to 100,000 dimensions
  • Logistic regression has one parameter per
    dimension; our sparse model sets many of these to
    zero
  • Resulting sparse models produce outstanding
    accuracy and ultra-fast predictions with no
    ad hoc feature selection (see the sketch below)
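The MAP estimate under a Laplace prior is equivalent to L1-regularized logistic regression, which drives many coefficients exactly to zero. A rough sketch using scikit-learn's generic solver (an assumption for illustration; the project used its own Bayesian implementation, not this library):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data: rows = documents (term-weight vectors), y = relevant / not relevant
X = np.random.rand(200, 1000)
y = np.random.randint(0, 2, size=200)

# L1 penalty corresponds to MAP estimation with a Laplace (double-exponential) prior;
# C controls how sharp the prior is, i.e., how aggressively coefficients are zeroed.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

print(np.count_nonzero(clf.coef_), "of", clf.coef_.size, "coefficients are nonzero")
```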

35
Bayesian Methods: Components
  • Representation: term representation logTF, term
    weighting IDF, document normalization none or L2
  • Compression: none or N-best (3 methods)
  • Matching: inner product
  • Learning: Bayesian binary regression with Laplace
    prior or normal prior

36
Bayesian Methods: Sample Results
(Table of results on RCV-1 and OHSUMED, not
reproduced in this transcript.)
Rows 1-3: Reuters Vol. 1 data; rows 4-6: Medline
data. SVM is in the table for comparison's sake.
NumFeat = dimensions in the document
representation; MedianUsed = words used to make
predictions (role of sparsity).
37
Bayesian Methods: Sample Results
  • Our implementation of Bayesian logistic
    regression with Laplace priors is highly
    efficient for very large-scale problems.
  • Other efficient variants implemented as well:
    Bayesian probit
  • Compared to Zhang & Oles, our implementation is:
  • 200 to 2000 times faster
  • Space required is 200 to 2000 times smaller
  • Accuracy as good as the best results ever
    published.
  • What we did is unique.
  • Very large scale: 122,000 features with no ad hoc
    feature selection
  • In sum, we have a sparseness-inducing Laplace
    prior that produces dramatically simpler models
    with essentially no loss in accuracy

38
Data Fusion
  • Data fusion combines several systems for
    representing, compressing, matching, or learning
    into a compound system that is generally able to
    perform better
  • Each system assigns a score to a new document
  • We have explored score-combining schemes for
    data fusion (see the sketch below).
  • Work has emphasized computational and
    visualization tools to facilitate study of
    relationships among various combinations of
    methods
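A sketch of two simple score-combining schemes (illustrative, not the project's fusion code): a rank-based combination and a per-topic linear combination of normalized scores.

```python
import numpy as np

def rank_fusion(score_lists):
    """Combine systems by averaging the rank each system assigns to each document."""
    ranks = [np.argsort(np.argsort(-np.asarray(s))) for s in score_lists]  # rank 0 = best
    return -np.mean(ranks, axis=0)       # higher fused score = better average rank

def linear_fusion(score_lists, weights):
    """Topic-specific linear combination of z-score-normalized system scores."""
    fused = np.zeros(len(score_lists[0]))
    for w, scores in zip(weights, score_lists):
        s = np.asarray(scores, dtype=float)
        fused += w * (s - s.mean()) / (s.std() + 1e-9)
    return fused

sys_a = [0.9, 0.2, 0.5]
sys_b = [0.7, 0.6, 0.1]
print(rank_fusion([sys_a, sys_b]), linear_fusion([sys_a, sys_b], [0.6, 0.4]))
```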

39
Fusion: Key Findings
  • The "global" fusion approach asks: is there a
    rule for combining the results of two or more
    systems without regard to the specific pattern or
    topic being sought? It does not work terribly
    well.
  • Local fusion: when specialized to the specific
    topic, simple linear methods can produce
    substantial improvements (up to 10% or 20%)
    judged by standard measures of retrieval
    performance
  • Some positive results in the case where the
    fusion rule is selected on the basis of one
    sample of topics and applied to another.

40
Some Comparisons of Effectiveness
These are just a few examples of results. We have
done over 5000 complete experiments.
(Charts: full RCV1 data set and a subset of 4060
items, not reproduced in this transcript.)
41
  • Compression Methods: Summary
  • Applied old heuristics and new theory-based
    algorithms
  • Random projections to real subspaces
  • Random projections to Hamming cubes
  • Combinatorial PCA
  • Streaming algorithms for finding deviant cases
    in massive data
  • Classic search engine inverted file heuristics

42
Compression Methods: Combinatorial PCA (cPCA)
  • PCA = principal components analysis, a
    mathematical procedure that transforms a number
    of (possibly) correlated variables into a
    (smaller) number of uncorrelated variables called
    principal components. Goals: reduce
    dimensionality, discover meaningful variables.
  • New: combinatorial analog of PCA (novel method)
  • Finds the most frequent and most strongly
    correlated features in the data matrix
  • Application: feature selection for a learning
    algorithm (SVM in our experiments)
  • cPCA is a compression method for any type of data
    matrix
  • cPCA can be used with any classifier learning
    method
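cPCA itself is the project's novel method and is not reproduced here; as a generic stand-in for the "select features, then learn" pipeline it feeds, the sketch below keeps only the most frequent features (by document frequency) before training a linear SVM. The thresholds and library choice are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import LinearSVC

def select_by_document_frequency(X, n_keep=1500):
    """Keep the n_keep features (columns) that appear in the most documents."""
    doc_freq = (X > 0).sum(axis=0)
    return np.argsort(-doc_freq)[:n_keep]

# toy data: 2,000 documents over 10,000 sparse binary features
X = (np.random.rand(2000, 10000) > 0.99).astype(float)
y = np.random.randint(0, 2, size=2000)

kept = select_by_document_frequency(X, n_keep=1500)
clf = LinearSVC().fit(X[:, kept], y)   # the classifier only ever sees the reduced matrix
```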

43
cPCA: Experimental Results
  • Experiments with 2 data sets, 16 representations,
    and 10 classifiers
  • Dimensionality reduction by a factor of about 30
    (47,000 down to 1,500)
  • Time: 3 min for 23,000 documents x 47,000 words
  • cPCA has computational complexity lower by an
    order of magnitude compared to PCA

44
STREAMING DATA ANALYSIS
  • Motivated by need to make decisions about data
    during an initial scan as data stream by.
  • Recent development of theoretical CS algorithms
  • Algorithms motivated by intrusion detection,
    transaction applications, time series
    transactions

45
Streaming Text Analyses
  • A_j = number of texts seen in time period j
  • A_j = number of documents that contain the word j
  • A_{i,j} = number of emails or bytes sent from
    address i to address j
  • A_{i,j} = number of occurrences of word j in
    document i

Using these different data representations, we
have developed rapid methods for a number of
monitoring applications such as finding changing
trends, outliers and deviants, unique items, rare
events, heavy hitters, etc.
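As one concrete example of this style of computation, a classical one-pass "heavy hitters" algorithm (Misra-Gries, shown here generically rather than as the project's code) finds frequently occurring items, e.g. words or sender addresses, in bounded memory:

```python
def misra_gries(stream, k=100):
    """One pass, at most k-1 counters: any item occurring more than n/k times survives."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # decrement every counter; drop counters that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# e.g., heavy hitters among senders in a message stream
print(misra_gries(["alice", "bob", "alice", "carol", "alice"], k=3))
```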
46
Decision-Theoretic Adaptive Filtering (A Formal
Framework)
  • Goal: pick out interesting documents in a stream
    and present them to an oracle for a reward
  • Oracle gives feedback only for presented
    documents
  • Complex rewards and penalties (e.g., if we never
    present any documents, we never improve)
  • Formal model chooses an optimal strategy,
    balancing instant rewards/penalties with
    long-term payoff (a toy sketch follows below)
  • Initial implementation and experimentation shows
    considerable promise
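A toy sketch of the trade-off in this framework (purely illustrative; the project's formal model is not reproduced): present a document when its estimated relevance makes the expected reward positive, and occasionally present anyway so the model keeps receiving feedback.

```python
import random

def adaptive_filter(stream, predict_relevance, update, oracle,
                    reward=3.0, penalty=1.0, explore=0.05):
    """Balance the instant expected payoff against the long-term value of feedback."""
    total = 0.0
    for doc in stream:
        p = predict_relevance(doc)                      # current estimate of relevance
        expected = p * reward - (1 - p) * penalty       # instant expected payoff
        if expected > 0 or random.random() < explore:   # exploit, or explore for feedback
            relevant = oracle(doc)                      # feedback only for presented docs
            total += reward if relevant else -penalty
            update(doc, relevant)                       # improve the model from feedback
    return total
```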

47
METHODS READY FOR INTEGRATION AND TESTING BY THE
PROTOTYPE DEVELOPERS
  • Classic method: Rocchio
  • Classic method: Centroid
  • kNN with IFH (inverted file heuristic)
  • Sparse Bayesian (Bayesian with Laplace priors)
  • cPCA
  • Lots of code for components

48
OUR EXPECTATION: IMPACT AFTER 12 MONTHS
  • We will have developed innovative methods for
    classification of accumulated documents in
    relation to known tasks/targets/themes and
    building profiles to track future relevant
    messages.
  • We are optimistic that by end-to-end
    experimentation, we will continue to discover new
    uses relevant to each of the component tasks for
    recently developed mathematical and statistical
    methods. We expect to achieve significant
    improvements in performance on accepted measures
    that could not be achieved by piecemeal study of
    one or two component tasks or by simply
    concentrating on pushing existing methods
    further.

49
ACCOMPLISHMENTS: IMPACT AFTER 12 MONTHS
  • General observation: on classic measures such as
    precision, recall, and F1, the existing methods
    do quite well. We have sought progress on other
    measures.
  • For example, if we can reduce the space required
    by 90% with only a 10% loss in F1, we can cast a
    wider net in the search for important messages.
    (Our kNN methods do this.)
  • Similarly, if we can reduce the time required
    significantly while only reducing effectiveness a
    little, we can have a similar impact. (Our kNN
    and Bayesian methods do this.)

50
FUTURE WORK: OVERALL GOALS FOR NEXT TWO YEARS
  • We will extend our analysis to semi-supervised
    discovery of potentially interesting clusters of
    documents, provided we can interact with
    appropriate judges who can provide realistic
    feedback.
  • This should allow us to identify potentially
    threatening events in time for cognizant agencies
    to prevent them from occurring.

51
Future Work: kNN
  • Continue efforts to improve efficiency by moving
    to online setting
  • Improve efficiency by demonstrating provable
    quality approximate matching guarantees
  • Randomized matrix multiplication methods
  • Draw on lessons from random projection work

52
Future Work: kNN
  • Now that we have fast kNN, work on the learning
    component to improve effectiveness
  • Local logistic regression
  • Online learning of thresholds
  • Big issue: only some neighbors are labeled
  • Apply more sophisticated learning algorithms to
    the neighbors once they are found: "local
    learning" (used in robotics, not much in text
    classification)
  • Might have no neighbor judged for a topic
  • Use reverse kNN learning?

53
Future Work: Random Projections
  • For random projection to be effective, we need
    huge spaces, so we will look at pairs or triples
    of terms
  • Develop random projection methods for on-line
    processing of text messages when items are being
    added/deleted from the database using
    derandomization tricks
  • Use for clustering methods for semi-supervised
    learning

54
Future Work: Bayesian Methods
  • Online Bayesian algorithms
  • Status: small experiments with two competing
    algorithms
  • Plan: comprehensive large-scale implementation
    experiments
  • Develop Bayesian multi-label models (k of n
    categories). All our work so far is on binary
    classification; real-world documents can
    simultaneously belong to many stochastically
    dependent categories
  • Status: designed three competing approaches
  • Plan: comprehensive large-scale implementation
    experiments

55
Future Work: Bayesian Methods II
  • Our target applications feature a paucity of
    labeled training data. Highly informative
    Bayesian priors, drawing on information in
    category descriptions, have the potential to
    provide dramatic improvements in predictive
    accuracy with tiny training datasets
  • Status: simple experiments with initial ideas
  • Plan: comprehensive large-scale implementation
    experiments

56
Future Work: Bayesian Methods III
  • Merging the formal decision-theoretic model for
    adaptive filtering with online Bayesian
    algorithms
  • Our decision-theoretic framework for adaptive
    filtering shows considerable promise
  • We need to marry our online sparse Bayesian model
    with the decision-theoretic framework
  • High-dimensional Bayesian topic detection and
    tracking via sparse hidden Markov models
  • Important for semi-supervised learning
  • Build on recent advances in variational
    approaches that provide dramatic speedups

57
Future Work: Feature Selection Methods, cPCA
  • Develop an online feature selection method which
    could work with any online classification
    algorithm
  • Develop a feature selection method to detect
    rare events in massive message streams
  • Tailor cPCA to the specifics of SVM classifiers

58
Future Work: Fusion
  • Fusing with scores means giving up most of the
    information that diverse methods contain. This
    presents a strong challenge for us. New tools are
    needed.
  • Fusion will be much more effective when it is
    applied earlier in the chain of components, to
    combine methods of representation, matching,
    compression, and learning.

59
Future Work: Streaming Text Data, Historic Data
Analysis
  • The accumulation of text messages is massive over
    time
  • A lot of streaming research is focused on ongoing
    or current analyses
  • It is a great challenge to use only summarized
    historic data and see whether a currently
    emerging phenomenon had precursors occurring in
    the past
  • We propose an exciting and novel architecture for
    historic and posterior analyses via small
    summaries

60
Architecture for Historic Text Data Analysis
(Architecture diagram; recoverable elements below.)
  • Conceptual representation: multidimensional data.
    Attributes: To/From user IDs, time sent.
    Metadata: Reply? Cc? Fwd? Attachments? Language?
    Text labels: words, phrases, etc.
  • Data stream feeds a materialized summary (updated
    on the stream): big changes, deviants, unusual
    trends, decision trees, clustering.
  • Approximate ad hoc queries and analyses.
61
Future Work: Streaming Text Data, Tracking Term
Statistics
  • Text filtering systems use word statistics
  • Frequency of words in all docs (IDF) or in
    classes of documents (Naive Bayes)
  • But new words (and phrases) never stop coming
  • How do we estimate the frequency of an unbounded
    number of terms from bounded-space statistics?
    (a sketch follows below)
  • Approaches we will study:
  • Randomized streaming data algorithms
  • Reinforcement learning
  • Backoff models on term structure
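One of the randomized streaming approaches could resemble a count-min sketch, which keeps frequency estimates for an unbounded vocabulary in fixed space (a generic illustration, not the project's algorithm):

```python
import hashlib

class CountMinSketch:
    """Fixed-space frequency estimates for an unbounded stream of terms (never underestimates)."""
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _slot(self, term, row):
        digest = hashlib.md5(f"{row}:{term}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, term, count=1):
        for row in range(self.depth):
            self.table[row][self._slot(term, row)] += count

    def estimate(self, term):
        return min(self.table[row][self._slot(term, row)] for row in range(self.depth))

cms = CountMinSketch()
for word in ["anthrax", "picnic", "anthrax"]:
    cms.add(word)
print(cms.estimate("anthrax"))   # >= true count; memory does not grow with vocabulary
```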

62
Future Work: Streaming Data Analysis, Tracking
Topics and Parties (new direction of research)
  • The number of participants in information
    exchange is vast and growing rapidly. How do we
    track topics as they emerge?
  • Key problem in semi-supervised online learning
  • Topic tracking: good relevance scores are linear
    combinations of words (frequencies or
    occurrences)
  • Idea: keep a sketch of parties for each word
    separately
  • Linear combinations of sketches make sense, so we
    can trace topics

63
Future Work: Streaming Data Analysis, Tracking
Topics and Parties
  • This design is topic-independent:
  • No need to know topics in advance
  • No limit on the number of topics
  • This design is efficient:
  • Sketches help search many combinations of words
    potentially related to a topic very quickly.
