1
Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages
Fred Roberts, Rutgers University
2
MMS: Goal
Monitor huge communication streams, in particular streams of textualized communication, to automatically detect pattern changes and "significant" events.
Motivation: monitoring email traffic, news, communiqués, faxes, voice intercepts (with speech recognition).
3
MMS: Overall Objectives
  • Synergistic improvements in:
    • Performance in terms of space, time, effectiveness, and/or insight
    • Understanding the tradeoffs among these types of improvements
  • Compression for efficient resource use
  • Representation that aids fitting models
  • Efficient matching of text to text and model to text
  • Learning models from data and prior knowledge
  • Reduction in the need for large amounts of training data or labor-intensive input
  • Fusion of complementary filtering approaches

4
MMS: Approaches
  • Emphasis on Supervised Filtering
    • Given example documents, textbook descriptions, etc., find documents on this topic in the incoming stream or in past data
  • Less Emphasis on Unsupervised Event Identification
    • Detect emergent characteristics, anomalous patterns, etc., in the incoming stream of text or in historical statistics on the stream

5
MMS Approaches: Supervised Filtering
  • Batch filtering: all training texts are processed before any texts of active interest to the user
  • Adaptive filtering: the user trains the system during use
  • The value of examples for both information and training must be considered

6
MMS Approaches: Dealing with Massive Data
  • Creating summary statistics on massive data streams
    • Detect outliers, heavy hitters (most frequent items), etc.; see the sketch after this list
    • Allow us to return to the past without keeping the raw data
  • Reducing the need for labeled training examples in supervised classification
    • Bayesian priors from domain knowledge
    • Tuning on unlabeled data
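
As a concrete illustration of keeping summary statistics without storing the raw stream, the sketch below finds heavy-hitter candidates in a single pass with a fixed memory budget. The slide does not name an algorithm at this point, so this uses the standard Misra-Gries counter method as a stand-in, in Python:

    def misra_gries(stream, k):
        # One-pass heavy-hitter summary using at most k counters; any
        # item occurring more than n/(k+1) times in a stream of length
        # n is guaranteed to survive in the summary.
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k:
                counters[item] = 1
            else:
                # Decrement every counter; drop those that reach zero.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    words = "a a a b a c a d a e".split()
    print(misra_gries(words, k=2))  # {'a': 4}: 'a' dominates the stream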

7
Accomplishments: Phase II (Jan 04 - Sep 04)
  • Bayesian Logistic Regression
    • Using sparseness-favoring priors, our methods have produced outstanding accuracy and fast predictions with no ad hoc feature selection; see the sketch after this list
    • State-of-the-art text classification effectiveness
    • Recently: highest score on the TREC 2004 triage task
    • Public release of our Bayesian Binary Regression (BBR) software (500 downloads)
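
For intuition about the sparseness-favoring priors mentioned above: MAP estimation of logistic regression under a Laplace prior is equivalent to L1-penalized logistic regression, which drives most coefficients exactly to zero. A minimal sketch of that equivalence using scikit-learn rather than the BBR software itself, on an invented toy corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Invented toy corpus; labels: 1 = on-topic, 0 = off-topic.
    docs = ["crude oil prices rise in the gulf",
            "parliament vote held in london",
            "oil exports fall as crude supply tightens",
            "uk election results announced"]
    labels = [1, 0, 1, 0]

    X = TfidfVectorizer().fit_transform(docs)
    # penalty="l1" gives the MAP estimate under a Laplace prior: most
    # weights become exactly zero, so no separate ad hoc feature
    # selection step is needed.
    model = LogisticRegression(penalty="l1", solver="liblinear", C=10.0)
    model.fit(X, labels)
    print((model.coef_ != 0).sum(), "non-zero weights out of", model.coef_.size)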

[Image: portrait of Thomas Bayes]
8
Accomplishments: Phase II (Jan 04 - Sep 04)
  • Bayesian Logistic Regression (cont'd)
    • The ability to use domain knowledge to set prior distributions led to large improvements in effectiveness when little training data is available
    • New online algorithms: online updating of Bayesian models as new data become available

9
Accomplishments: Phase II (Jan 04 - Sep 04)
  • Streaming Algorithms
    • New sketch-based algorithms for detecting word frequency changes and other patterns in massive text streams
    • Rapid methods for finding changing trends, outliers and deviants, rare events, and heavy hitters
    • Initial results using summarized data to search for meaningful answers to queries about the past
    • Initial work on textual and structural patterns in informal communication networks

10
Accomplishments: Phase II (Jan 04 - Sep 04)
  • Nearest neighbor classification: fast implementation
    • Continued development of heuristics for approximate neighbor finding with an in-memory inverted index; see the sketch after this list
    • Our results have reduced memory by 90% and time by 90 to 99% with minimal impact on effectiveness
    • Packaged and delivered kNN software
  • Developing algorithms for speeding up a slow but potentially highly effective local learning approach
    • Based on training a separate logistic regression on the neighbors of each test document!
    • Slow, but with many avenues to large speedups
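
A minimal sketch of the inverted-file idea behind the fast kNN implementation: score only the training documents that share at least one term with the test document, instead of scanning the whole collection. The weighting and data here are illustrative assumptions, not the project's exact heuristics:

    import math
    from collections import defaultdict

    train = {  # doc id -> tokenized training document
        "d1": ["oil", "price", "gulf"],
        "d2": ["election", "vote", "uk"],
        "d3": ["oil", "export", "crude"],
    }

    # Inverted index: term -> postings list of (doc id, term weight).
    index = defaultdict(list)
    for doc_id, terms in train.items():
        norm = math.sqrt(len(terms))
        for t in set(terms):
            index[t].append((doc_id, terms.count(t) / norm))

    def knn(query_terms, k=2):
        # Accumulate similarity only over the postings of the query's
        # terms; documents sharing no terms are never touched.
        scores = defaultdict(float)
        for t in set(query_terms):
            for doc_id, w in index.get(t, []):
                scores[doc_id] += w
        return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

    print(knn(["oil", "gulf", "price"]))  # nearest training documents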

11
Accomplishments: Phase II (Jan 04 - Sep 04)
  • Adaptive Filtering
    • Models to aid in learning when to act greedily (exploit: submit documents we believe are relevant) and when to take risks (explore: submit documents that may be irrelevant); see the sketch after this list
    • Seek approximate solutions to the intractable optimal exploration/exploitation tradeoff
    • Experiments show slight improvements in filtering effectiveness compared to the greedy (exploit-only) approach
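
One simple way to make the explore/exploit choice concrete is an epsilon-greedy rule: usually submit only documents scoring above the threshold, but occasionally submit a lower-scoring one to obtain a user judgment to learn from. This is a standard illustration, not necessarily the approximation scheme developed in the project:

    import random

    def submit(score, threshold=0.5, epsilon=0.1, rng=random.Random(0)):
        # Exploit: submit documents we believe are relevant.
        if score >= threshold:
            return True
        # Explore: with small probability, risk an irrelevant document
        # in exchange for a training judgment.
        return rng.random() < epsilon

    for s in [0.9, 0.7, 0.4, 0.2]:
        print(s, submit(s))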

12
Some MMS Work in Depth: Bayesian Priors from Domain Knowledge
  • Bayesian methods assume prior beliefs about parameters before data is seen
  • Project Phase I: generic, vague priors
  • Project Phase II: reference materials or intuitions about words may help predict class; use these to set priors (material very unlike the training examples)
  • Goal: reduce the need for training examples
    • Replace 1000s of randomly sampled examples with a few, possibly biased, examples

13
Knowledge-Driven Priors: Issues
  • Reference texts contain some non-topical words
    • Use words that discriminate among topics, via Inverse Document Frequency (IDF) weighting within the reference collection; see the sketch after this list
  • Small training sets increase problems with thresholding and text representation
    • Use unlabeled data to aid thresholding and to learn IDF weights
    • Use a separate prior for the intercept term of the model
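
A hypothetical sketch of the IDF idea above: give words from a topic's reference entry a non-zero prior mean, scaled by IDF computed within the reference collection, so that words appearing in many entries (and hence non-topical) get little or no prior weight. The entries and the scaling rule are invented for illustration:

    import math

    # Invented miniature reference collection (one entry per topic).
    reference = {
        "france": ["paris", "wine", "republic", "europe", "country"],
        "japan":  ["tokyo", "island", "pacific", "country"],
        "brazil": ["brasilia", "amazon", "coffee", "country"],
    }

    def idf(term):
        # IDF within the reference collection: widespread words score 0.
        df = sum(term in entry for entry in reference.values())
        return math.log(len(reference) / df) if df else 0.0

    def prior_means(topic, scale=1.0):
        # Prior mean for each word in the topic's reference entry.
        return {t: round(scale * idf(t), 3) for t in set(reference[topic])}

    # "country" appears in every entry, so its prior mean is 0, while
    # discriminating words like "tokyo" get positive prior weight.
    print(prior_means("japan"))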

14
Knowledge-Driven Priors: Results
  • Topics: 27 Reuters Region categories
  • Knowledge: CIA World Factbook (WFB) entries
  • Examples: 10 per topic
  • Baseline results (F1 measure): WFB 0.234, no WFB 0.052
  • With better small training sets and improved algorithms: WFB 0.591, no WFB 0.395

15
Knowledge-Driven Priors: Summary
  • Reference materials of a text type very different from the documents to be classified can aid supervised filtering
  • In combination with tuning on unlabeled data, this technique can provide immediate practical benefits
  • Current methods are crude and ad hoc; substantial improvements should be possible

16
Some MMS Work in Depth: Streaming Analysis
  • Problem: monitor fast, massive text streams and support both online tracking and historical analysis of events
  • Multidimensional data: source, destination, time sent or received, metadata (reply, language), text labels (words, phrases), links
  • Goal: use highly compact summaries that are computed at stream speed and support accurate analyses

17
Streaming Analysis Tool: CM Sketch
  • Theoretical: we have developed the CM Sketch, which uses O((1/ε) log(1/δ)) space to approximate the data distribution with error at most ε and probability of success at least 1 - δ
    • All previously known sampling or sketch methods use space at least Ω(1/ε²)
    • The CM Sketch is an order of magnitude better
  • Practical: a few tens of KBs give an accurate summary of large data; see the sketch after this list
  • Create summaries of the data that allow historic queries to find:
    • Heavy hitters (most frequent items)
    • Quantiles of a distribution (median, percentiles, etc.)
    • Items with large changes
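
A minimal Python reimplementation of the Count-Min sketch, for intuition (the project's deliverable is a C library): d = ⌈ln(1/δ)⌉ rows of w = ⌈e/ε⌉ counters; a point query returns the minimum counter over the rows, which overestimates the true count by at most εN with probability at least 1 - δ. Python's built-in hash stands in here for the pairwise-independent hash functions of the real data structure:

    import math, random

    class CountMinSketch:
        def __init__(self, eps=0.01, delta=0.01, seed=42):
            self.w = math.ceil(math.e / eps)         # counters per row
            self.d = math.ceil(math.log(1 / delta))  # number of rows
            rng = random.Random(seed)
            self.seeds = [rng.getrandbits(64) for _ in range(self.d)]
            self.table = [[0] * self.w for _ in range(self.d)]

        def _cells(self, item):
            # One cell per row, chosen by a seeded hash of the item.
            return ((r, hash((s, item)) % self.w)
                    for r, s in enumerate(self.seeds))

        def update(self, item, count=1):
            for r, c in self._cells(item):
                self.table[r][c] += count

        def query(self, item):
            # Collisions only ever add counts, so every row gives an
            # overestimate; the minimum over rows is the best answer.
            return min(self.table[r][c] for r, c in self._cells(item))

    cms = CountMinSketch(eps=0.01, delta=0.01)
    for word in "the cat sat on the mat the end".split():
        cms.update(word)
    print(cms.query("the"))  # true count 3; estimate >= 3, usually 3

At these settings the table is 5 rows of 272 counters, only a few KB, consistent with the few-tens-of-KBs figure above at more demanding parameters.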

18
Streaming Analysis: Using Web Logs
  • Web logs (blogs), regularly updated on-line journals, provide informal, opinionated, candid data that is more like email than the web at large
  • We have begun to automatically collect blogs, stripping formatting, tags, ads, etc., and feeding the resulting "bag of words" into streaming algorithms for analysis and archiving, at a 10s-to-100 GB scale
  • 3000:1 compression using CM Sketch methods
  • Allows accurate analysis of popular words, newly emergent words, etc., including multilingual occurrences

19
Deliverables: Phase I
  • Classic method: Rocchio
  • Classic method: Centroid
  • kNN with IFH (inverted file heuristic)
  • Sparse Bayesian (Bayesian with Laplace priors)
  • Combinatorial PCA
  • Homotopic linking of widely varying Rocchio methods
  • aiSVM
  • Fusion

20
Deliverables: Phase II
  • Revised and extended version of kNN code,
    including scripts for running local learning
    experiments
  • Substantially extended version of BBR, including
    use of domain knowledge to set priors
  • CM Sketch (C library for count-min sketching)
  • Code to use CM Sketch to find heavy hitters,
    quantiles, and large changes in streams

21
MMS: Future Directions
  • Bayesian
    • Expand the types of domain knowledge usable, for instance by making use of the taxonomies available in many subject areas
    • Improve self-tuning of the BBR software to make it more effective for novice users
    • Surprisingly subtle questions: cross-validation, calibration, scaling (e.g., when there are multiple features)
    • Incorporate previous work on online Bayesian methods into BBR

22
MMS: Future Directions
  • Streaming
    • Systematically explore summarization methods such as sampling, bitmaps, and sketches
    • Develop warehousing techniques for large-scale sketch-based historical analyses
    • The massiveness of the data implies that even linear algorithms are too inefficient; seek sublinear methods
    • Develop sketch-based methods for link analysis in temporally changing multigraphs (From and To addresses in email, links between blogs, etc.)
    • Add a modeling component to the sketch-based analysis: exploit knowledge of the distribution of the data

23
MMS: Future Directions
  • kNN
    • kNN with a small training sample for each of a massive number of topics (maybe only 5 to 10 known relevant/irrelevant documents per topic)
    • Since small samples have little overlap, extend the kNN approach to deal with partially labeled datasets
  • Bayesian kNN
    • Incorporate methods developed in our Bayesian work for dealing with small training sets (e.g., tuning thresholds on unlabeled data)
    • More fundamental combinations of Bayesian and kNN methods (e.g., tunable distance metrics)

24
MMS: Future Directions
  • Greedy Round Robin Feature Selection
    • In Phase I work, explored a greedy heuristic to choose a subset of the original set of terms as features
    • Did extremely well in TREC 2002 topic intersection tasks
    • Will develop a Greedy Round Robin (GRR) method; see the sketch after this list
      • Applies when features fall into two or more conceptually distinct sets (e.g., metadata such as source/destination, genre, or medium of the message)
      • Each list of features is consulted in turn
    • Plan experimental analysis of GRR
    • Plan theoretical analysis of GRR using simulation
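
The slides give only the outline of GRR (consult each conceptually distinct feature list in turn, greedily adding the best feature from it), so the sketch below fills in the details under those assumptions, with an invented additive scoring function standing in for a real effectiveness measure:

    def grr(feature_lists, score, max_rounds=10):
        # Greedy Round Robin: visit the feature lists in turn; from
        # each, greedily add the best remaining feature if it improves
        # the score. Stop after a full round with no improvement.
        selected, best = [], score([])
        for _ in range(max_rounds):
            improved = False
            for flist in feature_lists:
                pool = [f for f in flist if f not in selected]
                if not pool:
                    continue
                cand = max(pool, key=lambda f: score(selected + [f]))
                if score(selected + [cand]) > best:
                    selected.append(cand)
                    best = score(selected)
                    improved = True
            if not improved:
                break
        return selected

    # Invented example: word features vs. metadata features, with a
    # toy score that just sums per-feature gains.
    gains = {"oil": 3, "gulf": 2, "source=AP": 2, "lang=en": 1}
    score = lambda feats: sum(gains.get(f, 0) for f in feats)
    words = ["oil", "gulf", "the", "a"]
    meta = ["source=AP", "lang=en", "reply=yes"]
    print(grr([words, meta], score))  # alternates between the two lists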

25
MMS: Future Directions
  • Adaptive filtering
    • Experiment with new adaptive thresholding methods (synergy with the Bayesian thresholding work)
    • The scoring threshold is adjusted upward if too many irrelevant documents are being judged, and downward if too few relevant documents are being found; see the sketch after this list
    • Aim for an algorithm with state-of-the-art effectiveness and provable theoretical properties
    • Compare the rate of convergence of various algorithms on real data
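
A toy version of the thresholding rule just described, with an invented fixed step size (a real method would calibrate the step and use the score distribution):

    def update_threshold(threshold, judged_relevant, step=0.02):
        # Raise the threshold when a submitted document turns out to be
        # irrelevant (we are submitting too freely); lower it when the
        # document is relevant (we can afford to submit more).
        return threshold + (-step if judged_relevant else step)

    t = 0.5
    for relevant in [False, False, True, False]:
        t = update_threshold(t, relevant)
    print(round(t, 2))  # 0.54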

26
MMS PROJECT TEAM
Paul Kantor, Rutgers Communication, Information and Library Studies
Dave Lewis, Consultant
Michael Littman, Rutgers CS
David Madigan, Rutgers Statistics
S. Muthukrishnan, Rutgers CS
Rafail Ostrovsky, Telcordia/UCLA
Fred Roberts, Rutgers DIMACS/Math
Martin Strauss, AT&T Labs / U. Michigan
Wen-Hua Ju, Avaya Labs (collaborator)
Andrei Anghelescu, Graduate Student
Suhrid Balakrishnan, Graduate Student
Aynur Dayanik, Graduate Student
Dmitry Fradkin, Graduate Student
Peng Song, Graduate Student
Graham Cormode, Postdoc
Alex Genkin, Software Developer
Vladimir Menkov, Software Developer