PROGRAMS IN HOMELAND SECURITY AT DIMACS - PowerPoint PPT Presentation

Slides: 131
Provided by: dimacsR

Transcript and Presenter's Notes


1
PROGRAMS IN HOMELAND SECURITY AT DIMACS
Fred S. Roberts DIMACS Director
2
THE FOUNDING OF DIMACS: THE NSF SCIENCE AND
TECHNOLOGY CENTERS PROGRAM
  • The STC program was launched by the White House
    and the National Academy of Sciences in 1988 in
    order to increase the economic competitiveness of
    the U.S.
  • NSF ran a nationwide competition. The rules:
  • cutting-edge research
  • education and knowledge transfer
  • university-industry partnerships

3
THE FOUNDING OF DIMACS
  • Because of the increasing importance of discrete
    mathematics and theoretical computer science,
    especially in the fields of telecommunications
    and computing, four institutions (Rutgers and
    Princeton Universities, AT&T Bell Labs, and
    Bell Communications Research, or Bellcore) each
    developed strong research groups in these fields.
  • Under the leadership of Rutgers, they came
    together to found DIMACS and entered the STC
    competition.
  • There were more than 800 preproposals and more
    than 300 proposals, in all fields of science,
    and 11 winners.

4
The DIMACS Partners Today
Rutgers University, Princeton University, AT&T
Labs, Bell Labs (Lucent Technologies), NEC
Laboratories America, Telcordia Technologies
Affiliates: Avaya Labs, HP Labs, IBM Research,
Microsoft Research, Stevens Institute of Technology
5
WHO IS DIMACS?
  • There are about 250 scientists affiliated with
    DIMACS and called permanent members.
  • Most are from the partner and affiliated
    organizations.
  • They include many of the world's leaders in
    discrete mathematics and theoretical computer
    science and their applications.
  • They also include statisticians, biologists,
    psychologists, chemists, epidemiologists, and
    engineers.
  • None are paid by DIMACS, but they join in DIMACS
    projects.

6
Outline A Selection of DIMACS Projects
  • Bioterrorism Sensor Location
  • Port of Entry Inspection Algorithms
  • Monitoring Message Streams
  • Author Identification
  • Computational and Mathematical Epidemiology
  • Adverse Event/Disease Reporting/Surveillance/Analysis
  • Bioterrorism Working Group
  • Modeling Social Responses to Bioterrorism
  • Predicting Disease Outbreaks from Remote Sensing
    and Media Data
  • Communication Security and Information Privacy

7
The Bioterrorism Sensor Location Problem
8
  • Early warning is critical in defense against
    terrorism
  • This is a crucial factor underlying the
    government's plans to place networks of
    sensors/detectors to warn of a bioterrorist
    attack

The BASIS System Salt Lake City
9
Locating Sensors is not Easy
  • Sensors are expensive
  • How do we select them and where do we place them
    to maximize coverage, expedite an alarm, and
    keep the cost down?
  • Approaches that improve upon existing, ad hoc
    location methods could save countless lives in
    the case of an attack and also money in capital
    and operational costs.

10
Two Fundamental Problems
  • Sensor Location Problem
  • Choose an appropriate mix of sensors
  • decide where to locate them for best protection
    and early warning

11
Two Fundamental Problems
  • Pattern Interpretation Problem: When sensors set
    off an alarm, help public health decision makers
    decide:
  • Has an attack taken place?
  • What additional monitoring is needed?
  • What was its extent and location?
  • What is an appropriate response?

12
The SLP: What is a Measure of Success of a
Solution?
  • A modeling problem.
  • Needs to be made precise.
  • Many possible formulations.

13
The SLP: What is a Measure of Success of a
Solution?
  • Identify and ameliorate false alarms.
  • Defending against a worst case attack or an
    average case attack.
  • Minimize time to first alarm? (Worst case?
    Average case?)
  • Maximize coverage of the area.
  • Minimize geographical area not covered
  • Minimize size of population not covered
  • Minimize probability of missing an attack

14
The SLP: What is a Measure of Success of a
Solution?
  • Cost: Given a mix of available sensors and a
    fixed budget, what mix will best accomplish our
    other goals?

15
The SLP: What is a Measure of Success of a
Solution?
  • It's hard to separate the goals.
  • Even a small number of sensors might detect an
    attack if there is no constraint on time to
    alarm.
  • Without budgetary restrictions, a lot more can be
    accomplished.

16
The Sensor Location Problem
  • Approach is to develop new algorithmic methods.
  • We are building on approaches to other modeling
    problems, seeing if they can be modified in the
    sensor location context.
  • This is a multi-criteria modeling problem and it
    seems hopeless to try to find optimal solutions
  • We will be happy with efficient algorithms that
    find good solutions

17
Algorithmic Approaches I: Greedy Algorithms
18
Greedy Algorithms
  • Find the most important location first and locate
    a sensor there.
  • Find second-most important location.
  • Etc.
  • Builds on earlier mathematical work at Institute
    for Defense Analyses (Grotte, Platt)
  • Steepest ascent approach.
  • No guarantee of optimal or best solution.
  • In practice, gets pretty close to optimal
    solution.
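
The greedy idea above can be sketched in a few lines. This is only an illustration under stated assumptions: "importance" is taken to mean the number of not-yet-covered cells a candidate site would cover, and the site names, cell numbers, and the function `greedy_placement` are hypothetical, not taken from the DIMACS or IDA work.

```python
def greedy_placement(candidates, budget):
    """Pick up to `budget` sites, each time taking the site that adds
    the most newly covered cells (assumed measure of importance).

    candidates: dict mapping site name -> set of cells it covers.
    """
    covered = set()
    chosen = []
    for _ in range(budget):
        # Site with the largest marginal gain in coverage.
        best = max(candidates, key=lambda s: len(candidates[s] - covered))
        if not candidates[best] - covered:
            break  # nothing left to gain
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered


# Hypothetical candidate sites and the city cells each would cover.
sites = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {5, 6},
    "D": {1, 6},
}
chosen, covered = greedy_placement(sites, budget=2)
print(chosen, sorted(covered))
```

As the slide notes, nothing here guarantees an optimal placement; the sketch only shows the "most important location first" loop.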

19
Algorithmic Approaches II: Variants of Classic
Location and Clustering Methods
20
Algorithmic Approaches II: Variants of Classic
Location and Clustering Methods
  • Location theory: locate facilities (sensors) to
    be used by users located in a region.
  • Cluster analysis: Given points in a metric space,
    partition them into groups or clusters so points
    within clusters are relatively close.
  • Clusters correspond to points covered by a
    facility (sensor).

21
Variants of Classic Location and Clustering
Methods
  • k-median clustering: Given k sensors, place them
    so each point in the city is within x feet of a
    sensor.
  • Complications: more dimensions (location affects
    sensitivity, wind strength enters, sensors have
    different characteristics, etc.)
  • This higher-dimensional k-median clustering
    problem is hard! Best-known algorithms are due to
    Rafail Ostrovsky.

22
Variants of Classic Location and Clustering
Methods
  • Further complications make this even more
    challenging
  • Different costs of different sensors
  • Restrictions on where we can place different
    sensors
  • Is it better to have every point within x feet of
    some sensor or every point within y feet of at
    least three sensors (y > x)?
  • Approximation methods due to Chuzhoy,
    Ostrovsky, and Rabani and to Guha, Tardos, and
    Shmoys are relevant.
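
To make the two coverage criteria above concrete, here is a small check of "every point within x of some sensor" versus "every point within y of at least three sensors". It is a sketch only: the coordinates, radii, and function names are made up for illustration, distances are plain Euclidean, and it evaluates a given placement rather than finding one.

```python
import math


def coverage_counts(points, sensors, radius):
    """For each point, count the sensors within `radius` of it."""
    return [
        sum(1 for s in sensors if math.dist(p, s) <= radius)
        for p in points
    ]


def satisfies(points, sensors, radius, min_sensors):
    """True if every point has at least `min_sensors` sensors in range."""
    counts = coverage_counts(points, sensors, radius)
    return all(c >= min_sensors for c in counts)


# Hypothetical city points and sensor positions (arbitrary units).
points = [(0, 0), (4, 0), (2, 3)]
sensors = [(1, 0), (3, 0), (2, 2)]

single = satisfies(points, sensors, radius=1.5, min_sensors=1)
triple = satisfies(points, sensors, radius=4.0, min_sensors=3)
triple_tight = satisfies(points, sensors, radius=3.0, min_sensors=3)
print(single, triple, triple_tight)
```

Which criterion is "better" is exactly the modeling question the slide raises; the code only shows that the two can be evaluated on the same placement.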

23
Algorithmic Approaches III: Variants of Highway
Sensor Network Algorithms
24
Variants of Highway Sensor Network Algorithms
  • Sensors located along highways and nearby
    pathways measure atmospheric and road conditions.
  • Muthukrishnan, et al. have developed very
    efficient algorithms for sensor location.
  • Based on bichromatic clustering and
    bichromatic facility location (color nodes
    corresponding to sensors red, nodes corresponding
    to sensor messages blue)

25
Variants of Highway Sensor Network Algorithms
  • These algorithms apply to situations with many
    more sensors than the bioterrorism sensor
    location problem.
  • As BT sensor technology changes, we can envision
    a myriad of miniature sensors distributed around
    a city, making this work all the more relevant.

26
Algorithmic Approaches IV: Building on Equipment
Placing Algorithms
27
Building on Equipment Placing Algorithms
  • The Node Placement Problem is the problem of
    determining locations or nodes to install certain
    types of networking equipment.
  • Coverage and cost are a major consideration.
  • Researchers at Telcordia Technologies have
    studied variations of this problem arising from
    broadband access technologies.

28
The Broadband Access Node Placement Problem
  • There are inherent range limitations that drive
    placement.
  • E.g., a customer for DSL service must be within xx
    feet of an assigned multiplexer.
  • Multiplexer corresponds to sensor.
  • Problem solved using dynamic programming
    algorithms.
  • (Tamra Carpenter, Martin Eiger, David Shallcross,
    Paul Seymour)

29
The Broadband Access Node Placement Problem:
Complications
  • Restrictions on types of equipment that can be
    placed at a given node.
  • Constraints on how far a signal from a given
    piece of equipment can travel.
  • Cost and profit maximization considerations.
  • Relevance of work on general integer programming,
    the knapsack cover problem, and local access
    network expansion problems.

30
The Pattern Interpretation Problem
31
The Pattern Interpretation Problem
  • It will be up to the Decision Maker to decide how
    to respond to an alarm from the sensor network.

32
The Pattern Interpretation Problem
  • Little has been done to develop analytical models
    for rapid evaluation of a positive alarm or
    pattern of alarms from a sensor network.
  • How can this pattern be used to minimize false
    alarms?
  • Given an alarm, what other surveillance measures
    can be used to confirm an attack, locate areas of
    major threat, and guide public health
    interventions?

33
The Pattern Interpretation Problem (PIP)
  • Close connection to the SLP.
  • How we interpret a pattern of alarms will affect
    how we place the sensors.
  • The same simulation models used to place the
    sensors can help us in tracing back from an alarm
    to a triggering attack.

34
Approaching the PIP: Minimizing False Alarms
35
Approaching the PIP: Minimizing False Alarms
  • One approach: Redundancy. Require two or more
    sensors to make a detection before an alarm is
    considered confirmed.

36
Approaching the PIP: Minimizing False Alarms
  • Portal Shield requires two positives for the
    same agent during a specific time period.
  • Redundancy II: Place two or more sensors at or
    near the same location. Require two proximate
    sensors to give off an alarm before we consider
    it confirmed.
  • Redundancy drawbacks: cost, delay in confirming
    an alarm.
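
A redundancy rule of the kind described above might look like the following sketch. It is not the logic of Portal Shield or any fielded system: the requirement that the two positives come from distinct sensors, the time units, the agent names, and the function `confirmed_alarms` are all assumptions for illustration.

```python
def confirmed_alarms(detections, window):
    """Return the agents confirmed by two positives from distinct
    sensors no more than `window` time units apart.

    detections: list of (time, sensor_id, agent) tuples.
    """
    by_agent = {}
    for t, sensor, agent in sorted(detections):
        by_agent.setdefault(agent, []).append((t, sensor))

    confirmed = set()
    for agent, hits in by_agent.items():
        for i, (t1, s1) in enumerate(hits):
            for t2, s2 in hits[i + 1:]:
                if t2 - t1 <= window and s1 != s2:
                    confirmed.add(agent)
    return confirmed


# Hypothetical detection log.
detections = [
    (0, "s1", "anthrax"),
    (5, "s2", "anthrax"),   # second positive within the window
    (0, "s1", "ricin"),
    (50, "s3", "ricin"),    # second positive arrives too late
    (7, "s4", "plague"),    # single positive only
]
result = confirmed_alarms(detections, window=10)
print(result)
```

The sketch also makes the stated drawback visible: an alarm cannot be confirmed until the second positive arrives, which adds delay.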

37
Approaching the PIP: Using Decision Rules
  • Existing sensors come with a sensitivity level
    specified and sound an alarm when the number of
    particles collected is sufficiently high above
    threshold.

38
Approaching the PIP: Using Decision Rules
  • Alternative decision rule: alarm if two sensors
    reach 90% of threshold, three reach 75% of
    threshold, etc.
  • One approach: use clustering algorithms for
    sounding an alarm based on a given distribution
    of clusters of sensors reaching a percentage of
    threshold.
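
The alternative decision rule above can be written as a list of tiers, each pairing a sensor count with a fraction of the threshold. The sketch below is illustrative only: the tier values follow the slide's example (two sensors at 90 percent, three at 75 percent), the first tier (one sensor at the full threshold) is an added assumption, and the function name and readings are hypothetical.

```python
def tiered_alarm(readings, threshold,
                 tiers=((1, 1.00), (2, 0.90), (3, 0.75))):
    """Sound an alarm if, for any tier (count, fraction), at least
    `count` sensors read at or above fraction * threshold."""
    for count, fraction in tiers:
        hits = sum(1 for r in readings if r >= fraction * threshold)
        if hits >= count:
            return True
    return False


# Hypothetical particle counts against a threshold of 100.
print(tiered_alarm([95, 92, 10], threshold=100))   # two sensors above 90
print(tiered_alarm([80, 78, 76], threshold=100))   # three sensors above 75
print(tiered_alarm([95, 70, 60], threshold=100))   # no tier satisfied
```

A clustering-based rule, as the second bullet suggests, would replace the simple counts with counts over spatial clusters of sensors.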

39
Approaching the PIP: Using Decision Rules
  • When sensors are to be used jointly, the rules
    for tuning each sensor should be optimized to
    take advantage of the fact that each is part of a
    network.
  • The optimal tuning depends on the decision rule
    applied to reach an overall decision given the
    sensor inputs.

40
Approaching the PIP: Using Decision Rules
  • Prior work along these lines in missile detection
    (Cherikh and Kantor)

41
Approaching the PIP: Using Decision Rules
  • Most work has concentrated on the case of
    stochastic independence of information available
    at two sensors, which is clearly violated in BT
    sensor location problems.
  • Even with stochastic independence, finding
    optimal decision rules is nontrivial.
  • Recent promising approaches of Paul Kantor study
    fusion of multiple methods for monitoring message
    streams.

42
Approaching the PIP: Spatio-Temporal Mining of
Sensor Data
43
Approaching the PIP: Spatio-Temporal Mining of
Sensor Data
  • Sensors provide observations of the state of the
    world localized in space and time.
  • Finding trends in data from individual sensors:
    time series data mining.
  • PIP: detecting general correlations in multiple
    time series of observations.
  • This has been studied in statistics, database
    theory, knowledge discovery, data mining.
  • Complications: proximity relationships based on
    geography; complex chronological effects.

44
Approaching the PIP: Spatio-Temporal Mining of
Sensor Data
  • Sensor technology is evolving rapidly.
  • It makes sense to consider idealized settings
    where data are collected continuously and
    communicated instantly.
  • Then, modern methods of spatio-temporal data
    mining due to Muthukrishnan and others are
    relevant.

45
Approaching the PIP: Triggering Other Methods of
Surveillance
  • One type of BT surveillance cannot be considered
    in isolation.
  • Question: How can the pattern of sensor warnings
    guide other biosurveillance methods?
  • Increased syndromic surveillance?
  • Change threshold for alarm in syndromic
    surveillance?
  • Increased attention to E.R. visits in a certain
    region?

46
Approaching the PIP: Triggering Other Methods of
Surveillance
  • Decreased threshold for alarm from subway worker
    absenteeism levels?

47
Approaching the PIP: Triggering Other Methods of
Surveillance
  • If there is an initial alarm, each sensor may be
    read more often.
  • How do we pick the sensors to read more
    frequently?
  • This is adaptive biosensor engagement.
  • Methods of bichromatic combinatorial optimization
    may be relevant.
  • As for the SLP, sensors get one color, sensor
    messages another.
  • Relevance of work of Muthukrishnan.

48
Outline
  • Bioterrorism Sensor Location
  • Port of Entry Inspection Algorithms
  • Monitoring Message Streams
  • Author Identification
  • Computational and Mathematical Epidemiology
  • Adverse Event/Disease Reporting/Surveillance/Analysis
  • Bioterrorism Working Group
  • Modeling Social Responses to Bioterrorism
  • Predicting Disease Outbreaks from Remote Sensing
    and Media Data
  • Communication Security and Information Privacy

49
Port of Entry Inspection Algorithms
In collaboration with Los Alamos National
Laboratory
50
Port of Entry Inspection Algorithms
  • Goal: Find ways to intercept illicit nuclear
    materials and weapons destined for the U.S. via
    the maritime transportation system
  • Aim: Develop decision support algorithms that
    will help us to optimally intercept illicit
    materials and weapons
  • Find inspection schemes that minimize total
    cost, including cost of false positives and
    false negatives

51
Sequential Decision Making Problem
  • Stream of entities arrives at a port
  • Decision Maker needs to decide which to inspect,
    which to subject to increasingly stringent
    inspection based on outcomes of previous
    inspections
  • Our approach: decision logics and combinatorial
    optimization methods
  • Builds on approach of Stroud and Saeger and
    large literature in sequential decision making.

52
Sequential Decision Making Problem
  • Entities arriving to be classified into
    categories.
  • Simple case: 0 = ok, 1 = suspicious
  • Observations are made.
  • Inspection scheme specifies which observations
    are to be made based on previous observations
  • Entities have attributes a0, a1, ..., an, each in
    a number of states
  • Sample attributes:
  • Does the ship's manifest set off an alarm?
  • Does the container give off neutron or gamma
    emission above threshold?
  • Does a radiograph image come up positive?
  • Does an induced fission test come up positive?

53
Sequential Decision Making Problem
  • Simplest case: Attributes are in state 0 or 1
  • Then: Entity is a binary string like 011001
  • Then: Classification is a decision function F
    that assigns each binary string to a category.
  • If there are two categories, 0 and 1, F is a
    boolean function.
  • Example: F(000) = F(111) = 1, F(abc) = 0 otherwise
  • This classifies an entity as positive iff it has
    none of the attributes or all of them.
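
As a small illustration of the decision function on this slide, the sketch below tabulates F over all three-attribute binary strings. The function is the slide's example (positive iff none or all of the attributes are present); the variable names are illustrative.

```python
from itertools import product


def F(bits):
    """Return 1 iff the entity has none of the attributes or all of them."""
    none_present = all(b == 0 for b in bits)
    all_present = all(b == 1 for b in bits)
    return 1 if none_present or all_present else 0


# Truth table over all binary strings of length 3.
table = {
    "".join(map(str, bits)): F(bits)
    for bits in product((0, 1), repeat=3)
}
positives = sorted(s for s, value in table.items() if value == 1)
print(positives)
```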

54
Sequential Decision Making Problem
  • Different problems depending on whether or not F
    is known. Assume first that F is known.
  • Given an entity, test its attributes until we know
    enough to calculate the value of F.
  • An inspection scheme tells us in which order to
    test the attributes to minimize cost.
  • Even this simplified problem is hard
    computationally.

55
Binary Decision Tree Approach
  • We assume we have sensors to measure presence or
    absence of attributes.
  • Build a tree
  • Nodes are sensors or categories (0 or 1)
  • Label nodes with the attribute the sensor measures
    or the number of the category
  • Category nodes are leaves of the tree (nodes
    with only one neighbor)
  • Two arcs exit from each sensor node, labeled left
    and right.
  • Take the right arc when sensor says the attribute
    is present, left arc otherwise

56
Binary Decision Tree Approach
  • We reach category 1 from the root only through
    the path a0 to a1 to 1.
  • Thus, an entity is classified in category 1 iff
    it has both attributes.
  • The binary decision tree corresponds to the
    boolean function F(11) = 1, F(10) = F(01) = F(00)
    = 0.

Figure 1
57
Binary Decision Tree Approach
  • We reach category 1 from the root by
  • a0 L to a1 R to a2 R to 1, or
  • a0 R to a2 R to 1
  • An entity is classified in category 1 iff it has
  • a1 and a2 and not a0, or
  • a0 and a2 and possibly a1.
  • Corresponding boolean function: F(111) = F(101)
    = F(011) = 1, F(abc) = 0 otherwise.

Figure 2
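
The tree described on this slide can be encoded and checked against its boolean function. In this sketch an internal node is a tuple (attribute index, left subtree, right subtree) and a leaf is the category 0 or 1. That encoding, and the assumption that every branch not listed on the slide leads to category 0, are mine rather than the figure's.

```python
from itertools import product

# Right = attribute present, left = attribute absent.
TREE = (0,
        (1, 0, (2, 0, 1)),   # a0 absent: test a1, then a2
        (2, 0, 1))           # a0 present: test a2


def classify(tree, entity):
    """Follow sensor outcomes from the root until a category leaf."""
    while isinstance(tree, tuple):
        attr, left, right = tree
        tree = right if entity[attr] == 1 else left
    return tree


positives = sorted(
    "".join(map(str, e))
    for e in product((0, 1), repeat=3)
    if classify(TREE, e) == 1
)
print(positives)
```

The strings classified as 1 are exactly those with a2 present together with a0 or a1, matching the boolean function stated on the slide.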
58
Binary Decision Tree Approach
  • This binary decision tree corresponds to the same
    boolean function
  • F(111) = F(101) = F(011) = 1, F(abc) = 0
    otherwise.
  • However, it has one less observation node. So, it
    is more efficient if all observations are equally
    costly and equally likely.

Figure 3
59
Binary Decision Tree Approach
  • Even if the boolean function F is fixed, the
    problem of finding the optimal binary decision
    tree for it is NP-complete.
  • For small n, can try to solve it by brute force
    enumeration.
  • But even for n = 4, not practical. (n = 4 at Port
    of Long Beach-Los Angeles)
  • Seeking heuristic algorithms, approximations to
    optimal.
  • Making special assumptions about the boolean
    function F.
  • Example: For so-called monotone boolean
    functions, integer programming formulations give
    promising heuristics.

60
Cost Functions
  • Above analysis: Only uses number of sensors
  • Using a sensor has a cost
  • Unit cost of inspecting one item with it
  • Fixed cost of purchasing and deploying it
  • Delay cost from queuing up at the sensor station
  • How many nodes of the decision tree are actually
    visited during average inspection? Depends on
    distribution of entities.

61
Cost Functions
  • Cost of false positive: Cost of additional tests.
  • If it means opening the container, it's very
    expensive.
  • Cost of false negative: Complex issue.

62
Complications
  • Sensor errors: probabilistic approach
  • More than two values of an attribute (present,
    absent, present with 75% probability, ...)
  • Partially defined boolean functions (inferring
    the boolean function from observations)
  • In this case, machine learning approaches are
    promising
  • Bayesian binary regression
  • Splitting strategies
  • Pruning learned decision trees

63
Outline
  • Bioterrorism Sensor Location
  • Port of Entry Inspection Algorithms
  • Monitoring Message Streams
  • Author Identification
  • Computational and Mathematical Epidemiology
  • Adverse Event/Disease Reporting/Surveillance/Analysis
  • Bioterrorism Working Group
  • Modeling Social Responses to Bioterrorism
  • Predicting Disease Outbreaks from Remote Sensing
    and Media Data
  • Communication Security and Information Privacy

64
Monitoring Message Streams: Algorithmic Methods
for Automatic Processing of Messages
65
OBJECTIVE
Monitor huge communication streams, in
particular, streams of textualized communication,
to automatically detect pattern changes and
"significant" events
Motivation: monitoring email traffic, news,
communiques, faxes, voice intercepts (with speech
recognition)
66
TECHNICAL APPROACHES
  • Given stream of text in any language.
  • Decide whether "new events" are present in the
    flow of messages.
  • Event: new topic or topic with unusual level of
    activity.
  • Initial Problem: Retrospective or Supervised
    Event Identification. Classification into
    pre-existing classes. Given example messages on
    events/topics of interest, algorithm detects
    instances in the stream.

67
TECHNICAL APPROACHES: SUPERVISED FILTERING
  • Batch filtering: Given examples of relevant
    documents up front.
  • Adaptive filtering: Examples accumulated; need to
    decide whether to bother the analyst for guidance
    (pay for information about relevance) as the
    process moves along.

68
MORE COMPLEX PROBLEM: PROSPECTIVE DETECTION OR
UNSUPERVISED FILTERING
  • Classes change: new classes or change of meaning
  • A difficult problem in statistics
  • Recent new C.S. approaches
  • Semi-supervised Learning
  • Algorithm suggests a possible new event/topic
  • Human analyst labels it and determines its
    significance

69
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
  • (1) Compression of Text: increase speed, reduce
    memory/disk use
  • (2) Representation of Text: convert text to a
    form amenable to computation and statistical
    analysis
  • (3) Matching Scheme: compute similarity between
    texts
  • (4) Learning Method: create profiles of
    events/topics from known examples.
  • (5) Fusion Scheme: combine multiple filtering
    techniques to increase accuracy.

70
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - II
  • These distinctions are somewhat arbitrary.
  • Many approaches to message processing overlap
    several of these components of automatic message
    processing; our techniques usually address more
    than one component.
  • Project Premise: Existing methods don't exploit
    the full power of the 5 components, synergies
    among them, and/or an understanding of how to
    apply them to text data.

71
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - III
  • Our approach is to develop/explore methods for
    each component and then to combine them.
  • In the first phase of the project, we did over
    5000 complete experiments with different
    combinations of methods.

72
Nearest Neighbor (kNN) Classifiers
  • Route message by:
  • Finding k most similar training messages
    (neighbors)
  • Assigning to classes that are most common among
    neighbors (using weighting by distance)
  • kNN classifiers studied since 1958, for text
    since early 90s
  • Moderately effective for text; has been
    considered inefficient because finding neighbors
    is slow
  • But finding neighbors only needs to be done once,
    no matter how many classes (even if huge)
  • So for a large number of topics, may be more
    efficient than one-classifier-per-topic approaches
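
A bare-bones version of the kNN routing step might look as follows. It is a sketch under assumptions, not the project's implementation: messages are represented as raw word counts, similarity is cosine similarity, neighbor votes are weighted by similarity, and the training messages and labels are invented.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(message, training, k=3):
    """Assign the label with the largest similarity-weighted vote
    among the k most similar training messages."""
    query = Counter(message.lower().split())
    scored = sorted(
        ((cosine(query, Counter(text.lower().split())), label)
         for text, label in training),
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, label in scored:
        votes[label] += sim
    return max(votes, key=votes.get)


# Hypothetical labeled training messages.
training = [
    ("port container inspection delayed", "ports"),
    ("container ship arrives at port", "ports"),
    ("sensor alarm in subway station", "sensors"),
    ("new sensor network deployed", "sensors"),
]
label = knn_classify("container inspection at the port", training)
print(label)
```

Note that the neighbor search runs once per message regardless of how many labels exist, which is the efficiency argument made on the slide.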

73
Speeding up kNN
  • Can finding neighbors be made fast enough to make
    kNN practical?
  • Worked on fast implementation
  • Store text and classes sparsely (Representation)
  • Store class labels sparsely
  • Arrange computations to do work proportional only
    to number of class labels in neighbors, not total
    number of classes
  • Search engine heuristics use the in-memory
    inverted file (Matching)
  • Use inverted file (group by word, not by
    document)
  • Retain only high impact terms within each
    document, or within each inverted list
  • Compute similarities using only inverted lists
    for the few words occurring in test document
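
The inverted-file idea above can be sketched as follows: postings are grouped by word, each list can be pruned to its highest-weight entries, and a query touches only the lists for its own words. The weights, document ids, pruning rule, and function names are illustrative assumptions, not the heuristics actually used in the project.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: dict doc_id -> {term: weight}.
    Returns term -> list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, weight in terms.items():
            index[term].append((doc_id, weight))
    return index


def prune(index, keep):
    """Retain only the `keep` highest-weight postings per inverted list."""
    return {
        term: sorted(postings, key=lambda p: -p[1])[:keep]
        for term, postings in index.items()
    }


def scores(index, query):
    """Dot-product scores, reading only the lists for the query's terms."""
    acc = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, weight in index.get(term, []):
            acc[doc_id] += q_weight * weight
    return dict(acc)


# Hypothetical weighted training documents.
docs = {
    "d1": {"port": 2.0, "cargo": 1.0},
    "d2": {"port": 1.0, "sensor": 3.0},
    "d3": {"sensor": 1.0, "alarm": 2.0},
}
query = {"port": 1.0, "alarm": 1.0}

index = build_inverted_index(docs)
full = scores(index, query)
approx = scores(prune(index, keep=1), query)
print(full, approx)
```

Pruning shrinks the index and speeds up matching at the price of dropping low-weight matches (here d2 disappears from the approximate scores), which mirrors the size, speed, and effectiveness trade-off the slides report.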

74
kNN Results
  • Great reduction in size of inverted index and
    increase in speed of classification
  • Slight additional cost in effectiveness
  • Effectiveness slightly below our best methods
    (Bayesian probit and logistic classifiers)
  • Compressed index 90% smaller than original index
    w/ only 7-12% loss in effectiveness (macro-F1)
  • Approximate matching is 10 to 100 times faster w/
    only 2-10% loss in effectiveness (macro-F1)
  • Ours are first large scale experiments on search
    engine heuristic for neighbor lookup in kNN
  • Partnership between theoreticians and
    practitioners.

75
Bayesian Methods
  • Bayesian statistical methods place prior
    probability distributions on all unknowns, and
    then compute posterior distribution for the
    unknowns conditional on the knowns.

Thomas Bayes
76
Bayesian Methods
  • Zhang and Oles (2001) developed an efficient
    optimization algorithm for logistic regression
    (10,000 dimensions) and achieved excellent
    predictive performance.
  • The Bayesian approach explicitly incorporates
    prior knowledge about model complexity
    (regularization)
  • We extended the Bayesian approach to incorporate
    a prior requirement for sparsity.
  • Logistic regression has one parameter per
    dimension; our sparse model sets many of these to
    zero and handles hundreds of thousands of
    parameters efficiently.
  • Resulting sparse models produce outstanding
    accuracy and ultra-fast predictions with no
    ad-hoc feature selection
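
One common way to obtain sparsity of this kind, assumed here since the slides do not spell out the prior, is a Laplace (double-exponential) prior on each weight; its MAP estimate is L1-penalized logistic regression. The sketch below fits that objective with proximal gradient steps (soft-thresholding) on a tiny invented data set. It is not the project's software, and `lam`, `lr`, and `iters` are illustrative settings.

```python
import math


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def sparse_logistic(X, y, lam, lr=0.5, iters=500):
    """MAP estimate under an assumed Laplace prior: minimize mean
    log-loss + lam * sum(|w_j|) by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        w = [soft_threshold(wj - lr * gj, lr * lam)
             for wj, gj in zip(w, grad)]
    return w


# Feature 0 determines the label; feature 1 is only weakly related.
X = [(1, 1), (1, 1), (1, -1), (-1, 1), (-1, -1), (-1, -1)]
y = [1, 1, 1, 0, 0, 0]
w = sparse_logistic(X, y, lam=0.2)
print(w)
```

On this data the weakly related feature's weight is driven to exactly zero while the informative feature keeps a positive weight, so prediction needs only the nonzero features.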

77
Bayesian Methods: Sample Results
  • We have implemented several efficient variants,
    e.g., probit, informative priors.
  • Publicly released software: over 1000 downloads
  • Compared to Zhang & Oles, our implementation:
  • Eliminates ad hoc feature selection
  • Often uses less than 1% of the features at
    prediction time
  • Is publicly available
  • Accuracy as good as the best results ever
    published.
  • In sum, we have a sparseness-inducing Bayesian
    approach that produces dramatically simpler
    models with no loss in accuracy

78
Streaming Data Analysis
  • Motivated by need to make decisions about data
    during an initial scan as data stream by
  • Recent development of theoretical CS algorithms
  • Algorithms motivated by intrusion detection,
    transaction applications, time series
    transactions

79
Streaming Text Data: Historic Data Analysis
  • The accumulation of text messages is massive over
    time
  • A lot of streaming research is focused on
    on-going or current analyses
  • It is a great challenge to use only summarized
    historic data and see if a currently emerging
    phenomenon had precursors occurring in the past
  • We are working on a novel architecture for
    historic and posterior analyses via small
    summaries - sketches

80
Streaming Analysis Tool: CM Sketch
  • Theoretical: We have developed the CM Sketch, which
    uses (1/ε) log(1/δ) space to approximate the data
    distribution with error at most ε and
    probability of success at least 1-δ.
  • All other previously known sample or sketch
    methods use space at least 1/ε².
  • CM Sketch is an order of magnitude better.
  • Practical: A few 10's of KBs gives an accurate
    summary of large data. Create summaries of data
    that allow historic queries to find:
  • Heavy Hitters (Most Frequent Items)
  • Quantiles of a Distribution (Median, Percentiles
    etc.)
  • Finding items with large changes
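
A minimal Count-Min sketch for frequency estimates might look like this. It is a simplified sketch, not the authors' code: the width and depth follow the usual choice of about e/eps columns and ln(1/delta) rows, and Python's built-in `hash` with random salts stands in for the pairwise-independent hash functions the analysis assumes. The class and parameter names are illustrative.

```python
import math
import random


class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        # Table size grows like (1/eps) * log(1/delta).
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1.0 / delta))
        rng = random.Random(seed)
        # One salt per row; salted hash() is an assumed stand-in
        # for a pairwise-independent hash family.
        self.salts = [rng.getrandbits(64) for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item, count=1):
        self.total += count
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        """Never below the true count (for non-negative updates);
        collisions can only inflate it."""
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch(eps=0.01, delta=0.01)
stream = ["alpha"] * 50 + ["beta"] * 20 + ["gamma"] * 5
for word in stream:
    sketch.add(word)
print(sketch.width, sketch.depth, sketch.estimate("alpha"))
```

Heavy hitters can be found by keeping the items whose estimates exceed a chosen fraction of `sketch.total`; quantile and change queries build on the same table but need extra structure not shown here.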

81
Outline
  • Bioterrorism Sensor Location
  • Port of Entry Inspection Algorithms
  • Monitoring Message Streams
  • Author Identification
  • Computational and Mathematical Epidemiology
  • Adverse Event/Disease Reporting/Surveillance/Analysis
  • Bioterrorism Working Group
  • Modeling Social Responses to Bioterrorism
  • Predicting Disease Outbreaks from Remote Sensing
    and Media Data
  • Communication Security and Information Privacy

82
Large-scale Automated Author Identification
83
Statistical Analysis of Text
  • Statistical text analysis has a long history in
    literary analysis and in solving disputed
    authorship problems
  • First (?) is Thomas C. Mendenhall in 1887

84
  • Hamilton versus Madison: the Federalist Papers
  • Mosteller and Wallace (1963) used Naïve Bayes
    with a Poisson and Negative Binomial model
  • Good predictive performance

85
Some Background
  • Identification technologies important for
    homeland security and in the legal system
  • Author attribution for textual artifacts using
    topic independent stylometric features has a
    long history
  • Historical focus on small numbers of authors and
    low-dimensional representations via function words

86
Author ID Project Objectives
  • Application of state-of-the-art statistical and
    computing technologies to authorship attribution
  • Work with very high-dimensional document
    representations
  • Focus on providing working solutions to
    particular problems

87
Author ID Project Focus
  • Goal: Identification of Authors From Large
    Collection of Objects
  • traditional disputed authorship (choose among k
    known authors)
  • clustering of putative authors (e.g., internet
    handles: termin8r, heyr, KaMaKaZie)
  • document pair analysis: Were two documents
    written by the same author?
  • odd-man-out: Were these documents written by one
    of this set of authors or by someone else?

88
Representation
  • Long tradition in stylometry that seeks a small
    number of textual characteristics that
    distinguish the texts of authors from one another
    (Burrows, Holmes, Binongo, Hoover, Mosteller &
    Wallace, McMenamin, Tweedie, etc.)
  • Typically use function words (a, with, as,
    were, all, would, etc.) followed by PCA or
    cluster analysis
  • Function words aim to be topic-independent
  • Hoover (2003) shows that using all high-frequency
    words does a better job than function words alone

89
Idiosyncratic Usage
  • Idiosyncratic usage is less formalized in the
    literature (misspellings, repeated neologisms,
    etc.) but apparently useful. For example,
    Foster's unmasking of Klein as the author of
    Primary Colors:
  • Klein and Anonymous loved unusual adjectives
    ending in -y and -inous: cartoony, chunky,
    crackly, dorky, snarly, ..., slimetudinous,
    vertiginous, ...
  • Both Klein and Anonymous added letters to their
    interjections: ahh, aww, naww.
  • Both Klein and Anonymous loved to coin words
    beginning in hyper-, mega-, post-, quasi-, and
    semi-, more than all others put together
  • Klein and Anonymous use "riffle" to mean rifle
    or rustle, a usage for which the OED provides no
    instance in the past thousand years

90
Odd-Man Out
  • Were these documents written by one of this set
    of authors or by someone else?
  • Training data contains documents by given set of
    authors
  • Test data contains documents by some set of
    authors including some not in original set
  • Bayesian hierarchical model incorporates prior
    knowledge that model parameters for different
    authors differ from each other
  • Initial success on small-scale simulated examples
  • Generalizations for more than one new author

91
Some Results
  • Created largest-ever (?) feature set including
    function words, suffixes, POS tags, lengths,
    spelling errors, common English errors,
    grammatical errors, phrases, idiosyncratic usage,
    ngrams, etc.
  • Extensive experiments for 1-of-K and
    odd-man-out
  • New 1.2 million message Listserv corpus, 82,000
    authors

92
Some Results - II
  • Developed general purpose feature
    extraction software for author attribution
  • Bayesian Multinomial Regression Software extends
    our highly scalable, sparse, BBR software (MMS
    Project) to the multi-class case

93
Outline
  • Bioterrorism Sensor Location
  • Port of Entry Inspection Algorithms
  • Monitoring Message Streams
  • Author Identification
  • Computational and Mathematical Epidemiology
  • Adverse Event/Disease Reporting/Surveillance/Analysis
  • Bioterrorism Working Group
  • Modeling Social Responses to Bioterrorism
  • Predicting Disease Outbreaks from Remote Sensing
    and Media Data
  • Communication Security and Information Privacy

94
Special Focus on Computational and Mathematical
Epidemiology
smallpox
95
Components of a Special Focus
  • Working Groups
  • Tutorials
  • Workshops
  • Visitor Programs
  • Graduate Student Programs
  • Postdoc Programs
  • Dissemination

96
A Sampling of Working Groups
  • WGs on Large Data Sets
    • Adverse Event/Disease Reporting, Surveillance,
      and Analysis
    • Data Mining and Epidemiology
  • WGs on Analogies between Computers and Humans
    • Analogies between Computer Viruses/Immune Systems
      and Human Viruses/Immune Systems
    • Distributed Computing, Social Networks, and
      Disease Spread Processes

97
WGs on Methods/Tools of Theoretical CS
  • Phylogenetic Trees and Rapidly Evolving Diseases
  • Order-Theoretic Aspects of Epidemiology
  • WGs on Computational Methods for Analyzing Large
    Models for Spread/Control of Disease
    • Spatio-temporal and Network Modeling of Diseases
    • Methodologies for Comparing Vaccination
      Strategies

98
WGs on Mathematical Sciences Methodologies
  • Mathematical Models and Defense Against
    Bioterrorism
  • Predictive Methodologies for Infectious Diseases
  • Statistical, Mathematical, and Modeling Issues in
    the Analysis of Marine Diseases

99
A Sampling of Workshops
  • Workshops on Modeling of Infectious Diseases
    • The Pathogenesis of Infectious Diseases
    • Models/Methodological Problems of Botanical
      Epidemiology
  • Workshops on Modeling of Non-Infectious Diseases
    • Disease Clusters

100
Workshops on Evolution and Epidemiology
  • Genetics and Evolution of Pathogens
  • The Epidemiology and Evolution of Influenza
  • The Evolution and Control of Drug Resistance
  • Models of Co-Evolution of Hosts and Pathogens

101
Workshops on Methodological Issues
  • Capture-recapture Models in Epidemiology
  • Spatial Epidemiology and Geographic Information
    Systems
  • Ecologic Inference
  • Combinatorial Group Testing

102
Outline
  • Bioterrorism Sensor Location
  • Port of Entry Inspection Algorithms
  • Monitoring Message Streams
  • Author Identification
  • Computational and Mathematical Epidemiology
  • Adverse Event/Disease Reporting/Surveillance/
    Analysis
  • Bioterrorism Working Group
  • Modeling Social Responses to Bioterrorism
  • Predicting Disease Outbreaks from Remote Sensing
    and Media Data
  • Communication Security and Information Privacy

103
The DIMACS Working Group on Adverse Event/Disease
Reporting, Surveillance, and Analysis
104
Working Group on Adverse Event/Disease Reporting,
Surveillance, and Analysis
  • Health surveillance is a core activity in public
    health
  • Concerns about bioterrorism have attracted
    attention to new surveillance methods
    • OTC drug sales
    • Subway worker absenteeism
    • Ambulance dispatches
  • This creates a need for novel statistical methods
    for surveillance of multiple data streams.
  • WG coordinated closely with National Syndromic
    Surveillance Conferences

105
New Data Types for Public Health Surveillance
  • Managed care patient encounter data
  • Pre-diagnostic/chief complaint (text data)
  • Over-the-counter sales transactions
    • Drug store
    • Grocery store
  • 911-emergency calls
  • Ambulance dispatch data
  • Absenteeism data
  • ED discharge summaries
  • Prescription/pharmaceuticals
  • Adverse event reports

106
Farzad Mostashari
107
New Analytic Methods and Approaches
  • Spatial-temporal scan statistics
  • Statistical process control (SPC)
  • Bayesian applications
  • Market-basket association analysis
  • Text mining
  • Rule-based surveillance
  • Change-point techniques
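As one concrete instance of the statistical process control and change-point ideas listed above, the sketch below runs a one-sided CUSUM over a daily count stream such as ambulance dispatches. It is a textbook illustration, not a method attributed to the working group; the baseline mean, slack, and alarm threshold are assumed to be supplied by the analyst.

```python
def cusum_alarms(counts, baseline_mean, slack, threshold):
    """One-sided upper CUSUM over a sequence of daily counts.

    Accumulates excesses above baseline_mean + slack and raises an alarm
    (recording the day index) whenever the running sum exceeds threshold.
    All three parameters are assumptions chosen by the analyst.
    """
    s = 0.0
    alarms = []
    for day, x in enumerate(counts):
        s = max(0.0, s + (x - baseline_mean - slack))
        if s > threshold:
            alarms.append(day)
            s = 0.0  # restart accumulation after signalling
    return alarms
```

For example, `cusum_alarms([10, 11, 9, 10, 15, 16, 17, 10], 10, 1, 8)` accumulates 4 on day 4 and 9 on day 5, so it signals on day 5. Monitoring several data streams at once would require adjusting thresholds for multiple testing, which this sketch does not attempt.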

108
Subgroup on Privacy and Confidentiality of Health
Data
  • Privacy concerns are a major stumbling block to
    public health surveillance, in particular
    bioterrorism surveillance.
  • Challenge: produce anonymous data specific enough
    for research.
  • Exploring ways to remove identifiers (Social
    Security number, telephone number, zip code)
    from data sets.
  • Exploring ways to aggregate or remove information
    from data sets.
  • Partnerships with cryptographers
  • Exploring methods of combinatorial optimization
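The tension between anonymity and research value can be sketched in a few lines. The example below drops direct identifiers, coarsens the zip code, and reports the size of the smallest group of records that share the same remaining quasi-identifiers (the quantity that k-anonymity bounds from below). Field names such as `ssn`, `phone`, and `zip` are hypothetical, and this is an illustration of the general idea rather than the subgroup's technique.

```python
from collections import Counter

DIRECT_IDENTIFIERS = ("ssn", "phone")  # assumed field names


def deidentify(record, zip_digits=3):
    """Drop direct identifiers and keep only the leading zip digits."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    zip_code = record["zip"]
    out["zip"] = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return out


def smallest_group(records, quasi_identifiers):
    """Size of the smallest set of records sharing all quasi-identifier values.

    A data set is k-anonymous with respect to these fields when this
    value is at least k.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())
```

Keeping fewer zip digits enlarges the groups (more privacy) but blurs geography (less research value). Choosing which fields to coarsen so that every group reaches size k with minimal information loss is a combinatorial optimization problem.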

109
Outline
  • Bioterrorism Sensor Location
  • Port of Entry Inspection Algorithms
  • Monitoring Message Streams
  • Author Identification
  • Computational and Mathematical Epidemiology
  • Adverse Event/Disease Reporting/Surveillance/
    Analysis
  • Bioterrorism Working Group
  • Modeling Social Responses to Bioterrorism
  • Predicting Disease Outbreaks from Remote Sensing
    and Media Data
  • Communication Security and Information Privacy

110
Bioterrorism Working Group
anthrax
111
Bioterrorism Working Group
  • Biosurveillance
  • Evolution
  • Modeling Bioterror Response Logistics
  • Computer Science Challenges
  • Agroterrorism

112
Modeling Bioterror Response Logistics
  • Exploring Discrete Optimization/Queueing
    • size of stockpiles of vaccines
    • allocation of medications
    • analysis of bottlenecks in treatment facilities
    • transportation schedules
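A standard queueing calculation shows how bottleneck analysis of this kind might start. The sketch below treats a vaccination clinic as an M/M/c queue (Poisson arrivals, exponential service times, c identical stations) and uses the Erlang C formula to find the mean wait and the number of stations needed to meet a target. The M/M/c assumptions and all parameter values are illustrative, not taken from the working group.

```python
import math


def erlang_c(arrival_rate, service_rate, servers):
    """Probability that an arriving person must wait in an M/M/c queue."""
    a = arrival_rate / service_rate          # offered load
    rho = a / servers                        # utilization per station
    if rho >= 1:
        raise ValueError("queue is unstable: utilization must be below 1")
    top = a ** servers / math.factorial(servers)
    bottom = (1 - rho) * sum(a ** k / math.factorial(k)
                             for k in range(servers)) + top
    return top / bottom


def mean_wait(arrival_rate, service_rate, servers):
    """Mean time spent waiting in line (excluding service)."""
    p_wait = erlang_c(arrival_rate, service_rate, servers)
    return p_wait / (servers * service_rate - arrival_rate)


def servers_needed(arrival_rate, service_rate, max_wait):
    """Smallest number of stations keeping the mean wait at or below max_wait."""
    c = math.floor(arrival_rate / service_rate) + 1  # smallest stable c
    while mean_wait(arrival_rate, service_rate, c) > max_wait:
        c += 1
    return c
```

With one station this reduces to the familiar M/M/1 result: the mean wait is the utilization divided by the difference between service and arrival rates. Real mass-vaccination planning would need time-varying arrivals and multi-stage flow, which this sketch ignores.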

1947 smallpox vaccination queue, NYC
113
Agroterrorism
  • Subgroup just starting
  • Interest in plant diseases
  • Partnership with the National Plant Diagnostic
    Network
  • Emphasis on Data Mining and Epidemiology

114
Outline
  • Bioterrorism Sensor Location
  • Port of Entry Inspection Algorithms
  • Monitoring Message Streams
  • Author Identification
  • Computational and Mathematical Epidemiology
  • Adverse Event/Disease Reporting/Surveillance/
    Analysis
  • Bioterrorism Working Group
  • Modeling Social Responses to Bioterrorism
  • Predicting Disease Outbreaks from Remote Sensing
    and Media Data
  • Communication Security and Information Privacy

115
Working Group on Modeling Social Responses to
Bioterrorism
  • Models of the spread of infectious disease
    commonly assume passive bystanders and rational
    actors who will comply with health authorities.
  • It is not clear how well this assumption applies
    to situations like a bioterrorist attack using
    smallpox or plague.

116
Working Group on Modeling Social Responses to
Bioterrorism
  • Interdisciplinary group is discussing
    incorporating social behavior into models,
    building models of public health decision making,
    and risk communication.
  • Some Issues
    • Movement
    • Compliance
    • Rumor
    • Subcultural differences
    • Indirect economic effects
    • Social stigmata
    • Panic

How do you measure the indirect cost of an
attack?
117
Outline
  • Bioterrorism Sensor Location
  • Port of Entry Inspection Algorithms
  • Monitoring Message Streams
  • Author Identification
  • Computational and Mathematical Epidemiology
  • Adverse Event/Disease Reporting/Surveillance/
    Analysis
  • Bioterrorism Working Group
  • Modeling Social Responses to Bioterrorism
  • Predicting Disease Outbreaks from Remote Sensing
    and Media Data
  • Communication Security and Information Privacy

118
Predicting Disease Outbreaks from Remote Sensing
and Media Data
Outbreaks of disease in other parts of the world
have the capacity to affect the security of the
US.
Joint project with the Imaging Science and
Information Systems Center (ISIS Center) at
Georgetown University Medical School.
119
Predicting Disease Outbreaks from Remote Sensing
and Media Data
  • Recent work has shown that it is possible to
    predict disease outbreaks in distant parts of the
    world using remotely sensed satellite data.
  • SARS and heightened avian flu in the Pacific Rim
    appeared following temperature anomalies in
    China.
  • Could we have anticipated this given
    enviro-climatic information?

120
Predicting Disease Outbreaks from Remote Sensing
and Media Data
  • Rift Valley Fever epidemic in 1997/8 in East
    Africa occurred following heavy flooding related
    to El Niño
  • Flooding in Venezuela in 1995 resulted in a
    multi-pathogen outbreak.

121
Predicting Disease Outbreaks from Remote Sensing
and Media Data
  • Indications and warnings can alert US responders
    to bioevents in faraway places.
  • Disease that can result in social disruptions can
    be detected in open source media reports even if
    there is no official reporting of this.

122
Predicting Disease Outbreaks from Remote Sensing
and Media Data
  • A model developed at the ISIS Center at
    Georgetown predicts social disruptions due to
    disease based on keyword hit counts from
    text-based sources (media reports).
  • DIMACS project goal: use the media model to
    develop ways to predict disease-related social
    disruptions from remote sensing enviro-climatic
    data.
  • We will be using remote sensing data indicating
    increased Normalized Difference Vegetation Index
    (NDVI).
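For readers unfamiliar with the index, NDVI is conventionally computed from near-infrared and red reflectance as (NIR - Red) / (NIR + Red), giving values between -1 and 1 that rise with vegetation density. The sketch below computes it and a simple anomaly against a historical baseline; the anomaly definition and the sample reflectance values are assumptions for illustration, not the project's procedure.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from two band reflectances."""
    total = nir + red
    if total == 0:
        return 0.0  # no signal in either band; treat as neutral
    return (nir - red) / total


def ndvi_anomaly(current, history):
    """Difference between the current NDVI and the mean of past values.

    history is a list of NDVI values for the same location and season
    in earlier years (an assumed baseline definition).
    """
    baseline = sum(history) / len(history)
    return current - baseline
```

A positive anomaly after heavy rainfall is the kind of "increased NDVI" signal the slide refers to, for example `ndvi_anomaly(ndvi(0.5, 0.1), [0.4, 0.5, 0.45])`.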

123
Predicting Disease Outbreaks from Remote Sensing
and Media Data
  • Project premise: we can use enviro-climatic
    indices such as NDVI, coupled with disease-related
    social disruption predictors from media data
    delayed by several months, to validate the
    enviro-climatic indicators as predictors.
  • Approach: machine learning
  • Project waiting to get started
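The premise of pairing an environmental index with a media signal delayed by several months can be illustrated with a lag scan. The sketch below is a deliberately simple stand-in for the unspecified machine learning approach: it correlates a monthly NDVI series with a monthly media-based disruption index at each candidate lag and reports the lag with the highest Pearson correlation. Both series and the maximum lag are hypothetical inputs, assumed to be equal-length monthly values.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def best_lag(ndvi_series, media_series, max_lag):
    """Find the delay (in months) at which NDVI best tracks the media index.

    Assumes both series cover the same months and have equal length.
    Returns (lag, scores) where scores maps each lag to its correlation.
    """
    scores = {}
    for lag in range(max_lag + 1):
        x = ndvi_series[:len(ndvi_series) - lag]  # NDVI at month t
        y = media_series[lag:]                    # media index at month t + lag
        scores[lag] = pearson(x, y)
    return max(scores, key=scores.get), scores
```

A strong correlation at a lag of several months would be weak evidence that the environmental index leads the media signal. It would not by itself validate the index as a predictor, since seasonal patterns shared by both series can produce the same effect.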

124
Predicting Disease Outbreaks from Remote Sensing
and Media Data
  • The approach is similar to ones used by members
    of the DIMACS team to estimate probability of a
    match between remotely sensed signals and a
    signature that has been observed before. This
    work has been applied to face recognition and
    explosive detection.

125
Outline
  • Bioterrorism Sensor Location
  • Port of Entry Inspection Algorithms
  • Monitoring Message Streams
  • Author Identification
  • Computational and Mathematical Epidemiology
  • Adverse Event/Disease Reporting/Surveillance/
    Analysis
  • Bioterrorism Working Group
  • Modeling Social Responses to Bioterrorism
  • Predicting Disease Outbreaks from Remote Sensing
    and Media Data
  • Communication Security and Information Privacy

126
Special Focus on Communication Security and
Information Privacy
127
Special Focus on Communication Security and
Information Privacy
  • Working Groups
    • Privacy-Preserving Data Mining
    • Usable Privacy and Security Software
    • Data De-Identification, Combinatorial
      Optimization, Graph Theory, and the Stat-OR
      Interface
    • Intrusion Detection and Network Security
      Management Systems

128
Special Focus on Communication Security and
Information Privacy
  • A Selection of Workshops
    • Software Security
    • Applied Cryptography and Network Security
    • Large-scale Internet Attacks
    • Mobile and Wireless Security
    • Security of Web Services and E-Commerce
    • Database Security: Query Authorization and
      Information Inference

129
Working Group on Analogies between Computer
Viruses and Biological Viruses
  • Can ideas for defending against biological
    viruses lead to ideas for defending against
    computer viruses?
  • Concern about large gap between initial time of
    attack and implementation of defensive strategies
  • Public health approach: once a virus has
    infected a machine, it tries to connect to as
    many computers as possible, as fast as possible.
    A throttle limits the rate at which a computer
    can connect to new computers.
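The throttle idea can be sketched as a small rate limiter. In the toy below, connections to recently contacted hosts pass immediately, while requests to new hosts join a delay queue that releases at most one host per time step. The class name, the working-set size, and the one-per-tick release rate are assumptions for illustration, not details given on the slide.

```python
from collections import deque


class ConnectionThrottle:
    """Limit the rate at which a machine may contact previously unseen hosts."""

    def __init__(self, working_set_size=4):
        # Recently contacted hosts; the oldest entry is evicted when full.
        self.recent = deque(maxlen=working_set_size)
        # Requests to new hosts waiting to be released.
        self.delayed = deque()

    def request(self, host):
        """Return True if the connection may proceed now, False if it is queued."""
        if host in self.recent:
            return True
        if host not in self.delayed:
            self.delayed.append(host)
        return False

    def tick(self):
        """Advance one time step, releasing at most one queued host."""
        if not self.delayed:
            return None
        host = self.delayed.popleft()
        self.recent.append(host)
        return host
```

Normal traffic, which mostly revisits a few hosts, is barely delayed. A worm scanning hundreds of new addresses fills the delay queue, which both slows its spread and gives an easy signal (queue length) that the machine is infected.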
