Title: PROGRAMS IN HOMELAND SECURITY AT DIMACS
1 PROGRAMS IN HOMELAND SECURITY AT DIMACS
Fred S. Roberts, DIMACS Director
2 THE FOUNDING OF DIMACS: THE NSF SCIENCE AND
TECHNOLOGY CENTERS PROGRAM
- The STC program was launched by the White House
and the National Academy of Sciences in 1988 in
order to increase the economic competitiveness of
the U.S.
- NSF ran a nationwide competition. The rules:
- cutting edge research
- education and knowledge transfer
- university-industry partnerships
3 THE FOUNDING OF DIMACS
- Because of the increasing importance of discrete
mathematics and theoretical computer science,
especially in the fields of telecommunications
and computing, four institutions, Rutgers and
Princeton Universities, AT&T Bell Labs, and
Bell Communications Research (Bellcore), each
developed strong research groups in these fields.
- Under the leadership of Rutgers, they came
together to found DIMACS and entered the STC
competition.
- There were more than 800 preproposals and more
than 300 proposals, in all fields of science, and
11 winners.
4 The DIMACS Partners Today
Partners: Rutgers University, Princeton University,
AT&T Labs, Bell Labs (Lucent Technologies), NEC
Laboratories America, Telcordia Technologies
Affiliates: Avaya Labs, HP Labs, IBM Research,
Microsoft Research, Stevens Institute of Technology
5 WHO IS DIMACS?
- There are about 250 scientists affiliated with
DIMACS and called permanent members.
- Most are from the partner and affiliated
organizations.
- They include many of the world's leaders in
discrete mathematics and theoretical computer
science and their applications.
- They also include statisticians, biologists,
psychologists, chemists, epidemiologists, and
engineers.
- None are paid by DIMACS, but they join in DIMACS
projects.
6 Outline: A Selection of DIMACS Projects
- Bioterrorism Sensor Location
- Port of Entry Inspection Algorithms
- Monitoring Message Streams
- Author Identification
- Computational and Mathematical Epidemiology
- Adverse Event/Disease Reporting/Surveillance/Analysis
- Bioterrorism Working Group
- Modeling Social Responses to Bioterrorism
- Predicting Disease Outbreaks from Remote Sensing
and Media Data
- Communication Security and Information Privacy
7 The Bioterrorism Sensor Location Problem
8
- Early warning is critical in defense against
terrorism.
- This is a crucial factor underlying the
government's plans to place networks of
sensors/detectors to warn of a bioterrorist
attack.
The BASIS System: Salt Lake City
9 Locating Sensors is not Easy
- Sensors are expensive.
- How do we select them and where do we place them
to maximize coverage, expedite an alarm, and
keep the cost down?
- Approaches that improve upon existing, ad hoc
location methods could save countless lives in
the case of an attack and also money in capital
and operational costs.
10 Two Fundamental Problems
- Sensor Location Problem (SLP):
- Choose an appropriate mix of sensors
- Decide where to locate them for best protection
and early warning
11 Two Fundamental Problems
- Pattern Interpretation Problem (PIP): When sensors
set off an alarm, help public health decision makers
decide:
- Has an attack taken place?
- What additional monitoring is needed?
- What was its extent and location?
- What is an appropriate response?
12 The SLP: What is a Measure of Success of a
Solution?
- A modeling problem.
- Needs to be made precise.
- Many possible formulations.
13 The SLP: What is a Measure of Success of a
Solution?
- Identify and ameliorate false alarms.
- Defending against a worst case attack or an
average case attack.
- Minimize time to first alarm? (Worst case?
Average case?)
- Maximize coverage of the area.
- Minimize geographical area not covered
- Minimize size of population not covered
- Minimize probability of missing an attack
14 The SLP: What is a Measure of Success of a
Solution?
- Cost: Given a mix of available sensors and a
fixed budget, what mix will best accomplish our
other goals?
15 The SLP: What is a Measure of Success of a
Solution?
- It's hard to separate the goals.
- Even a small number of sensors might detect an
attack if there is no constraint on time to
alarm.
- Without budgetary restrictions, a lot more can be
accomplished.
16 The Sensor Location Problem
- Approach is to develop new algorithmic methods.
- We are building on approaches to other modeling
problems, seeing if they can be modified in the
sensor location context.
- This is a multi-criteria modeling problem and it
seems hopeless to try to find optimal solutions.
- We will be happy with efficient algorithms that
find good solutions.
17 Algorithmic Approaches I: Greedy Algorithms
18 Greedy Algorithms
- Find the most important location first and locate
a sensor there.
- Find second-most important location.
- Etc.
- Builds on earlier mathematical work at Institute
for Defense Analyses (Grotte, Platt).
- Steepest ascent approach.
- No guarantee of optimal or best solution.
- In practice, gets pretty close to optimal
solution.
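A minimal sketch of the greedy idea, under the assumption that the "importance" of a location is the number of not-yet-covered demand points it would cover. The site names, the coverage sets, and the function name are hypothetical; the slides do not specify how importance is measured.

```python
def greedy_sensor_placement(candidate_sites, num_sensors):
    """Pick sites one at a time, each time taking the site that adds
    the most not-yet-covered demand points (steepest ascent on coverage).

    candidate_sites: dict mapping site name -> set of demand points covered
    """
    covered = set()
    chosen = []
    remaining = dict(candidate_sites)
    for _ in range(num_sensors):
        if not remaining:
            break
        # Assumed importance measure: marginal gain in coverage.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # no remaining site adds coverage, so stop early
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical toy instance: four candidate sites, six demand points.
sites = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {5, 6},
    "D": {1, 6},
}
chosen, covered = greedy_sensor_placement(sites, 2)
```

As the slide notes, a greedy choice carries no guarantee of optimality; here it happens to cover all six points with two sensors.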
19 Algorithmic Approaches II: Variants of Classic
Location and Clustering Methods
20 Algorithmic Approaches II: Variants of Classic
Location and Clustering Methods
- Location theory: locate facilities (sensors) to
be used by users located in a region.
- Cluster analysis: Given points in a metric space,
partition them into groups or clusters so points
within clusters are relatively close.
- Clusters correspond to points covered by a
facility (sensor).
21 Variants of Classic Location and Clustering
Methods
- k-median clustering: Given k sensors, place them
so each point in the city is within x feet of a
sensor.
- Complications: More dimensions: location affects
sensitivity, wind strength enters, sensors have
different characteristics, etc.
- This higher-dimensional k-median clustering
problem is hard! Best-known algorithms are due to
Rafail Ostrovsky.
22 Variants of Classic Location and Clustering
Methods
- Further complications make this even more
challenging:
- Different costs of different sensors
- Restrictions on where we can place different
sensors
- Is it better to have every point within x feet of
some sensor or every point within y feet of at
least three sensors (y > x)?
- Approximation methods due to Chuzhoy,
Ostrovsky, and Rabani and to Guha, Tardos, and
Shmoys are relevant.
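To make the two coverage requirements concrete, here is a small checker, offered as a sketch only. It assumes points and sensors are planar coordinates and that distance is Euclidean; the function name and the sample layout are hypothetical, and it only tests a given placement rather than finding one.

```python
import math


def coverage_ok(points, sensors, radius, min_sensors=1):
    """True if every point has at least `min_sensors` sensors within `radius`."""
    for p in points:
        near = sum(1 for s in sensors if math.dist(p, s) <= radius)
        if near < min_sensors:
            return False
    return True


# Hypothetical layout: two points to protect, three sensors on a line.
points = [(0, 0), (10, 0)]
sensors = [(0, 0), (5, 0), (10, 0)]

within_x_of_one = coverage_ok(points, sensors, radius=5)                   # x = 5
within_x_of_three = coverage_ok(points, sensors, radius=5, min_sensors=3)  # too strict at x
within_y_of_three = coverage_ok(points, sensors, radius=10, min_sensors=3) # y = 10 > x
```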
23 Algorithmic Approaches III: Variants of Highway
Sensor Network Algorithms
24 Variants of Highway Sensor Network Algorithms
- Sensors located along highways and nearby
pathways measure atmospheric and road conditions.
- Muthukrishnan, et al. have developed very
efficient algorithms for sensor location.
- Based on bichromatic clustering and
bichromatic facility location (color nodes
corresponding to sensors red, nodes corresponding
to sensor messages blue).
25 Variants of Highway Sensor Network Algorithms
- These algorithms apply to situations with many
more sensors than the bioterrorism sensor
location problem.
- As BT sensor technology changes, we can envision
a myriad of miniature sensors distributed around
a city, making this work all the more relevant.
26 Algorithmic Approaches IV: Building on Equipment
Placing Algorithms
27 Building on Equipment Placing Algorithms
- The Node Placement Problem is the problem of
determining locations, or nodes, at which to install
certain types of networking equipment.
- Coverage and cost are major considerations.
- Researchers at Telcordia Technologies have
studied variations of this problem arising from
broadband access technologies.
28 The Broadband Access Node Placement Problem
- There are inherent range limitations that drive
placement.
- E.g., a customer for DSL service must be within xx
feet of an assigned multiplexer.
- Multiplexer corresponds to sensor.
- Problem solved using dynamic programming
algorithms.
- (Tamra Carpenter, Martin Eiger, David Shallcross,
Paul Seymour)
29 The Broadband Access Node Placement Problem:
Complications
- Restrictions on types of equipment that can be
placed at a given node.
- Constraints on how far a signal from a given
piece of equipment can travel.
- Cost and profit maximization considerations.
- Relevance of work on general integer programming,
the knapsack cover problem, and local access
network expansion problems.
30 The Pattern Interpretation Problem
31 The Pattern Interpretation Problem
- It will be up to the Decision Maker to decide how
to respond to an alarm from the sensor network.
32 The Pattern Interpretation Problem
- Little has been done to develop analytical models
for rapid evaluation of a positive alarm or
pattern of alarms from a sensor network.
- How can this pattern be used to minimize false
alarms?
- Given an alarm, what other surveillance measures
can be used to confirm an attack, locate areas of
major threat, and guide public health
interventions?
33 The Pattern Interpretation Problem (PIP)
- Close connection to the SLP.
- How we interpret a pattern of alarms will affect
how we place the sensors.
- The same simulation models used to place the
sensors can help us in tracing back from an alarm
to a triggering attack.
34 Approaching the PIP: Minimizing False Alarms
35 Approaching the PIP: Minimizing False Alarms
- One approach: Redundancy. Require two or more
sensors to make a detection before an alarm is
considered confirmed.
36 Approaching the PIP: Minimizing False Alarms
- Portal Shield: requires two positives for the
same agent during a specific time period.
- Redundancy II: Place two or more sensors at or
near the same location. Require two proximate
sensors to give off an alarm before we consider
it confirmed.
- Redundancy drawbacks: cost, delay in confirming
an alarm.
37 Approaching the PIP: Using Decision Rules
- Existing sensors come with a sensitivity level
specified and sound an alarm when the number of
particles collected is sufficiently high (above
threshold).
38 Approaching the PIP: Using Decision Rules
- Alternative decision rule: alarm if two sensors
reach 90% of threshold, three reach 75% of
threshold, etc.
- One approach: use clustering algorithms for
sounding an alarm based on a given distribution
of clusters of sensors reaching a percentage of
threshold.
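The alternative decision rule can be written down directly. The sketch below assumes every sensor shares one threshold and that a rule is a pair (number of sensors, fraction of threshold); the function name, the rule table, and the sample readings are hypothetical.

```python
def network_alarm(readings, threshold, rules=((1, 1.0), (2, 0.9), (3, 0.75))):
    """Return True if any rule (count, fraction) is satisfied, i.e. at
    least `count` sensors read at or above `fraction` * threshold."""
    for count, fraction in rules:
        hits = sum(1 for r in readings if r >= fraction * threshold)
        if hits >= count:
            return True
    return False


# Hypothetical particle counts against a threshold of 100.
two_near_threshold = network_alarm([95, 92, 10], threshold=100)   # two sensors at 90%
one_near_threshold = network_alarm([95, 50, 10], threshold=100)   # only one sensor is high
three_moderate = network_alarm([80, 78, 76], threshold=100)       # three sensors at 75%
```

The rule table is a parameter so that other combinations can be tried; it does not account for where the sensors are, which the clustering approach on the slide would add.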
39 Approaching the PIP: Using Decision Rules
- When sensors are to be used jointly, the rules
for tuning each sensor should be optimized to
take advantage of the fact that each is part of a
network.
- The optimal tuning depends on the decision rule
applied to reach an overall decision given the
sensor inputs.
40 Approaching the PIP: Using Decision Rules
- Prior work along these lines in missile detection
(Cherikh and Kantor)
41 Approaching the PIP: Using Decision Rules
- Most work has concentrated on the case of
stochastic independence of information available
at two sensors, which is clearly violated in BT
sensor location problems.
- Even with stochastic independence, finding
optimal decision rules is nontrivial.
- Recent promising approaches of Paul Kantor study
fusion of multiple methods for monitoring message
streams.
42 Approaching the PIP: Spatio-Temporal Mining of
Sensor Data
43 Approaching the PIP: Spatio-Temporal Mining of
Sensor Data
- Sensors provide observations of the state of the
world localized in space and time.
- Finding trends in data from individual sensors:
time series data mining.
- PIP: detecting general correlations in multiple
time series of observations.
- This has been studied in statistics, database
theory, knowledge discovery, data mining.
- Complications: proximity relationships based on
geography; complex chronological effects.
44 Approaching the PIP: Spatio-Temporal Mining of
Sensor Data
- Sensor technology is evolving rapidly.
- It makes sense to consider idealized settings
where data are collected continuously and
communicated instantly.
- Then, modern methods of spatio-temporal data
mining due to Muthukrishnan and others are
relevant.
45 Approaching the PIP: Triggering Other Methods of
Surveillance
- One type of BT surveillance cannot be considered
in isolation.
- Question: How can the pattern of sensor warnings
guide other biosurveillance methods?
- Increased syndromic surveillance?
- Change threshold for alarm in syndromic
surveillance?
- Increased attention to E.R. visits in a certain
region?
46 Approaching the PIP: Triggering Other Methods of
Surveillance
- Decreased threshold for alarm from subway worker
absenteeism levels?
47 Approaching the PIP: Triggering Other Methods of
Surveillance
- If there is an initial alarm, each sensor may be
read more often.
- How do we pick the sensors to read more
frequently?
- This is adaptive biosensor engagement.
- Methods of bichromatic combinatorial optimization
may be relevant.
- As for the SLP, sensors get one color, sensor
messages another.
- Relevance of work of Muthukrishnan.
49 Port of Entry Inspection Algorithms
In collaboration with Los Alamos National
Laboratory
50 Port of Entry Inspection Algorithms
- Goal: Find ways to intercept illicit nuclear
materials and weapons destined for the U.S. via
the maritime transportation system.
- Aim: Develop decision support algorithms that
will help us to optimally intercept illicit
materials and weapons.
- Find inspection schemes that minimize total
cost, including cost of false positives and
false negatives.
51 Sequential Decision Making Problem
- Stream of entities arrives at a port.
- Decision Maker needs to decide which to inspect,
which to subject to increasingly stringent
inspection based on outcomes of previous
inspections.
- Our approach: decision logics and combinatorial
optimization methods.
- Builds on approach of Stroud and Saeger and large
literature in sequential decision making.
52 Sequential Decision Making Problem
- Entities arriving to be classified into
categories.
- Simple case: 0 = ok, 1 = suspicious
- Observations are made.
- Inspection scheme: specifies which observations
are to be made based on previous observations.
- Entities have attributes a0, a1, ..., an, each in a
number of states.
- Sample attributes:
- Does the ship's manifest set off an alarm?
- Does the container give off neutron or gamma
emission above threshold?
- Does a radiograph image come up positive?
- Does an induced fission test come up positive?
53 Sequential Decision Making Problem
- Simplest case: Attributes are in state 0 or 1.
- Then: Entity is a binary string like 011001.
- Then: Classification is a decision function F
that assigns each binary string to a category.
- If there are two categories, 0 and 1, F is a
boolean function.
- Example: F(000) = F(111) = 1, F(abc) = 0 otherwise.
- This classifies an entity as positive iff it has
none of the attributes or all of them.
54 Sequential Decision Making Problem
- Different problems depending on whether or not F
is known. Assume first that F is known.
- Given an entity, test its attributes until we know
enough to calculate the value of F.
- An inspection scheme tells us in which order to
test the attributes to minimize cost.
- Even this simplified problem is hard
computationally.
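A small sketch of "test attributes until we know enough", assuming F is known, attributes are binary, and the test order is fixed in advance. The function names are hypothetical, and the exhaustive check over untested attributes is only sensible for a handful of attributes. The example F is the one with F(000) = F(111) = 1.

```python
from itertools import product


def decided_value(F, n, observed):
    """Return F's value if the observed attributes already determine it,
    else None. `observed` maps attribute index -> 0 or 1."""
    free = [i for i in range(n) if i not in observed]
    values = set()
    for bits in product((0, 1), repeat=len(free)):
        full = dict(observed)
        full.update(zip(free, bits))
        values.add(F(tuple(full[i] for i in range(n))))
        if len(values) > 1:
            return None  # two completions disagree, so keep testing
    return values.pop()


def inspect(F, n, entity, order):
    """Test attributes in `order` until F is determined.
    Returns (category, number of tests used)."""
    observed = {}
    result = decided_value(F, n, observed)
    for i in order:
        if result is not None:
            break
        observed[i] = entity[i]
        result = decided_value(F, n, observed)
    return result, len(observed)


def F(entity):
    # Positive iff the entity has none of the attributes or all of them.
    return 1 if len(set(entity)) == 1 else 0


mixed = inspect(F, 3, (0, 1, 1), order=(0, 1, 2))     # decided after two tests
all_ones = inspect(F, 3, (1, 1, 1), order=(0, 1, 2))  # needs all three tests
```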
55 Binary Decision Tree Approach
- We assume we have sensors to measure presence or
absence of attributes.
- Build a tree:
- Nodes are sensors or categories (0 or 1)
- Label nodes with the attribute the sensor measures
or the number of the category
- Category nodes are leaves of the tree (nodes
with only one neighbor)
- Two arcs exit from each sensor node, labeled left
and right.
- Take the right arc when sensor says the attribute
is present, left arc otherwise
56 Binary Decision Tree Approach
- We reach category 1 from the root only through
the path a0 to a1 to 1.
- Thus, an entity is classified in category 1 iff
it has both attributes.
- The binary decision tree corresponds to the
boolean function F(11) = 1, F(10) = F(01) = F(00)
= 0.
Figure 1
57 Binary Decision Tree Approach
- We reach category 1 from the root by:
- a0 L to a1 R to a2 R to 1, or
- a0 R to a2 R to 1.
- An entity is classified in category 1 iff it has
- a1 and a2 and not a0, or
- a0 and a2 and possibly a1.
- Corresponding boolean function: F(111) = F(101) =
F(011) = 1, F(abc) = 0 otherwise.
Figure 2
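The tree described for Figure 2 can be encoded and checked against its boolean function. This is a sketch under two assumptions: a tree is represented as a nested tuple (a hypothetical encoding, not the authors'), and every leaf the slide does not spell out is category 0.

```python
from itertools import product

# A tree is either a category (0 or 1) or a tuple
# (attribute index, left subtree, right subtree).
FIGURE_2 = (0,
            (1, 0, (2, 0, 1)),  # a0 absent: test a1, then a2
            (2, 0, 1))          # a0 present: test a2


def classify(tree, entity):
    """Follow the right arc when the attribute is present, left otherwise."""
    while isinstance(tree, tuple):
        attr, left, right = tree
        tree = right if entity[attr] else left
    return tree


def F(entity):
    # F(111) = F(101) = F(011) = 1, F(abc) = 0 otherwise.
    return 1 if entity in {(1, 1, 1), (1, 0, 1), (0, 1, 1)} else 0


agrees = all(classify(FIGURE_2, e) == F(e) for e in product((0, 1), repeat=3))
```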
58 Binary Decision Tree Approach
- This binary decision tree corresponds to the same
boolean function:
- F(111) = F(101) = F(011) = 1, F(abc) = 0
otherwise.
- However, it has one less observation node. So, it
is more efficient if all observations are equally
costly and equally likely.
Figure 3
59 Binary Decision Tree Approach
- Even if the boolean function F is fixed, the
problem of finding the optimal binary decision
tree for it is NP-complete.
- For small n, can try to solve it by brute force
enumeration.
- But even for n = 4, not practical. (n = 4 at Port
of Long Beach-Los Angeles)
- Seeking heuristic algorithms, approximations to
optimal.
- Making special assumptions about the boolean
function F.
- Example: For so-called monotone boolean
functions, integer programming formulations give
promising heuristics.
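For a very small number of attributes and one fixed F, brute force can be sketched as an exhaustive recursion. This is an illustration only: it assumes the cost of a tree is simply its number of sensor nodes, the function name is hypothetical, and the running time grows exponentially with the number of attributes.

```python
from functools import lru_cache
from itertools import product


def min_sensor_nodes(F, n):
    """Fewest sensor nodes in any binary decision tree computing F."""

    @lru_cache(maxsize=None)
    def best(fixed):
        # fixed: tuple of length n with entries 0, 1, or None (untested).
        free = [i for i, v in enumerate(fixed) if v is None]
        outcomes = set()
        for bits in product((0, 1), repeat=len(free)):
            full = list(fixed)
            for i, b in zip(free, bits):
                full[i] = b
            outcomes.add(F(tuple(full)))
        if len(outcomes) == 1:
            return 0  # F is already determined here, so this is a leaf
        costs = []
        for i in free:
            lo = fixed[:i] + (0,) + fixed[i + 1:]
            hi = fixed[:i] + (1,) + fixed[i + 1:]
            costs.append(1 + best(lo) + best(hi))
        return min(costs)

    return best((None,) * n)


def F(entity):
    # F(111) = F(101) = F(011) = 1, F(abc) = 0 otherwise.
    return 1 if entity in {(1, 1, 1), (1, 0, 1), (0, 1, 1)} else 0


fewest = min_sensor_nodes(F, 3)
```

For this F the search finds that three sensor nodes suffice, one fewer than the four-node tree with a0 at the root.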
60 Cost Functions
- Above analysis: only uses number of sensors
- Using a sensor has a cost:
- Unit cost of inspecting one item with it
- Fixed cost of purchasing and deploying it
- Delay cost from queuing up at the sensor station
- How many nodes of the decision tree are actually
visited during an average inspection? Depends on
distribution of entities.
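The dependence on the distribution of entities can be illustrated with an expected-cost calculation. The sketch assumes unit inspection costs only (no fixed or delay costs), a nested-tuple tree encoding, and a uniform distribution over entities. The second tree is a hypothetical three-node tree for the same function, not a transcription of any figure.

```python
from itertools import product

# Tree encoding: a category (0 or 1) or (attribute, left subtree, right subtree).
TREE_A0_ROOT = (0, (1, 0, (2, 0, 1)), (2, 0, 1))  # four sensor nodes
TREE_A2_ROOT = (2, 0, (0, (1, 0, 1), 1))          # three sensor nodes (hypothetical)


def expected_cost(tree, distribution, unit_cost):
    """Average cost of the sensors actually visited.

    distribution: dict mapping entity tuple -> probability
    unit_cost: dict mapping attribute index -> cost of one inspection
    """
    total = 0.0
    for entity, prob in distribution.items():
        node, cost = tree, 0.0
        while isinstance(node, tuple):
            attr, left, right = node
            cost += unit_cost[attr]
            node = right if entity[attr] else left
        total += prob * cost
    return total


uniform = {e: 1 / 8 for e in product((0, 1), repeat=3)}
unit = {0: 1.0, 1: 1.0, 2: 1.0}
cost_a0_root = expected_cost(TREE_A0_ROOT, uniform, unit)
cost_a2_root = expected_cost(TREE_A2_ROOT, uniform, unit)
```

Under these assumptions the smaller tree also visits fewer sensors on average; with a skewed distribution or unequal unit costs the comparison could come out differently.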
61 Cost Functions
- Cost of false positive: Cost of additional tests.
- If it means opening the container, it's very
expensive.
- Cost of false negative: Complex issue.
62 Complications
- Sensor errors: probabilistic approach
- More than two values of an attribute (present,
absent, present with 75% probability, ...)
- Partially defined boolean functions (inferring
the boolean function from observations)
- In this case, machine learning approaches are
promising:
- Bayesian binary regression
- Splitting strategies
- Pruning learned decision trees
64 Monitoring Message Streams: Algorithmic Methods
for Automatic Processing of Messages
65 OBJECTIVE
Monitor huge communication streams, in
particular, streams of textualized communication,
to automatically detect pattern changes and
"significant" events.
Motivation: monitoring email traffic, news,
communiques, faxes, voice intercepts (with speech
recognition)
66 TECHNICAL APPROACHES
- Given stream of text in any language.
- Decide whether "new events" are present in the
flow of messages.
- Event: new topic or topic with unusual level of
activity.
- Initial Problem: Retrospective or Supervised
Event Identification: Classification into
pre-existing classes. Given example messages on
events/topics of interest, algorithm detects
instances in the stream.
67 TECHNICAL APPROACHES: SUPERVISED FILTERING
- Batch filtering: Given examples of relevant
documents up front.
- Adaptive filtering: Examples are accumulated; need
to decide whether to bother the analyst for guidance
(pay for information about relevance) as the process
moves along.
68 MORE COMPLEX PROBLEM: PROSPECTIVE DETECTION OR
UNSUPERVISED FILTERING
- Classes change: new classes or change of meaning
- A difficult problem in statistics
- Recent new C.S. approaches
- Semi-supervised Learning:
- Algorithm suggests a possible new event/topic
- Human analyst labels it and determines its
significance
69 COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
- (1) Compression of Text: increase speed, reduce
memory/disk use
- (2) Representation of Text: convert text to
form amenable to computation and statistical
analysis
- (3) Matching Scheme: compute similarity between
texts
- (4) Learning Method: create profiles of
events/topics from known examples
- (5) Fusion Scheme: combine multiple filtering
techniques to increase accuracy
70 COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - II
- These distinctions are somewhat arbitrary.
- Many approaches to message processing overlap
several of these components of automatic message
processing; our techniques usually address more
than one component.
- Project Premise: Existing methods don't exploit
the full power of the 5 components, synergies
among them, and/or an understanding of how to
apply them to text data.
71 COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - III
- Our approach is to develop/explore methods for
each component and then to combine them.
- In the first phase of the project, we did over
5000 complete experiments with different
combinations of methods.
72 Nearest Neighbor (kNN) Classifiers
- Route message by:
- Finding k most similar training messages
(neighbors)
- Assigning to classes that are most common among
neighbors (using weighting by distance)
- kNN classifiers studied since 1958, for text
since early 90s
- Moderately effective for text; has been
considered inefficient because finding neighbors
is slow
- But, finding neighbors only needs to be done once,
no matter how many classes (even if huge)
- So for large number of topics, may be more
efficient than one-classifier-per-topic approaches
73 Speeding up kNN
- Can finding neighbors be made fast enough to make
kNN practical?
- Worked on fast implementation:
- Store text and classes sparsely (Representation)
- Store class labels sparsely
- Arrange computations to do work proportional only
to number of class labels in neighbors, not total
number of classes
- Search engine heuristics: use the in-memory
inverted file (Matching)
- Use inverted file (group by word, not by
document)
- Retain only high impact terms within each
document, or within each inverted list
- Compute similarities using only inverted lists
for the few words occurring in test document
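A minimal sketch of kNN routing over an in-memory inverted file. It assumes cosine similarity on raw term counts and similarity-weighted voting; the function names and the toy messages are hypothetical, and the pruning of low-impact terms mentioned on the slide is left out.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(training_docs):
    """training_docs: list of (list of words, set of class labels).
    Returns word -> list of (doc id, length-normalized weight)."""
    index = defaultdict(list)
    for doc_id, (words, _labels) in enumerate(training_docs):
        counts = Counter(words)
        norm = math.sqrt(sum(c * c for c in counts.values()))
        for word, c in counts.items():
            index[word].append((doc_id, c / norm))
    return index


def knn_classify(test_words, training_docs, index, k=3):
    """Score classes by summed cosine similarity over the k nearest docs."""
    counts = Counter(test_words)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    sims = defaultdict(float)
    # Only the inverted lists for words in the test document are touched.
    for word, c in counts.items():
        for doc_id, weight in index.get(word, ()):
            sims[doc_id] += (c / norm) * weight
    neighbors = sorted(sims.items(), key=lambda kv: -kv[1])[:k]
    votes = defaultdict(float)
    # Work here is proportional to the labels on the neighbors only.
    for doc_id, sim in neighbors:
        for label in training_docs[doc_id][1]:
            votes[label] += sim
    return max(votes, key=votes.get) if votes else None


# Hypothetical training messages with their class labels.
train = [
    (["sensor", "alarm", "sensor"], {"biosurveillance"}),
    (["port", "container", "inspection"], {"ports"}),
    (["alarm", "threshold", "sensor"], {"biosurveillance"}),
    (["container", "ship", "manifest"], {"ports"}),
]
index = build_inverted_index(train)
routed = knn_classify(["ship", "container", "alarm"], train, index, k=3)
```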
74 kNN Results
- Large reduction in size of inverted index and in
classification time
- Slight additional cost in effectiveness
- Effectiveness slightly below our best methods
(Bayesian probit and logistic classifiers)
- Compressed index 90% smaller than original index
with only 7-12% loss in effectiveness (macro-F1)
- Approximate matching is 10 to 100 times faster with
only 2-10% loss in effectiveness (macro-F1)
- Ours are the first large-scale experiments on search
engine heuristics for neighbor lookup in kNN
- Partnership between theoreticians and
practitioners.
75 Bayesian Methods
- Bayesian statistical methods place prior
probability distributions on all unknowns, and
then compute the posterior distribution for the
unknowns conditional on the knowns.
Thomas Bayes
76 Bayesian Methods
- Zhang and Oles (2001) developed an efficient
optimization algorithm for logistic regression
(10,000 dimensions) and achieved excellent
predictive performance.
- The Bayesian approach explicitly incorporates
prior knowledge about model complexity
(regularization).
- We extended the Bayesian approach to incorporate
a prior requirement for sparsity.
- Logistic regression has one parameter per
dimension; our sparse model sets many of these to
zero and handles hundreds of thousands of parameters
efficiently.
- Resulting sparse models produce outstanding
accuracy and ultra-fast predictions with no
ad hoc feature selection.
77 Bayesian Methods: Sample Results
- We have implemented several efficient variants,
e.g., probit, informative priors.
- Publicly released software: over 1000 downloads
- Compared to Zhang & Oles, our implementation:
- Eliminates ad hoc feature selection
- Often uses less than 1% of the features at
prediction time
- Is publicly available
- Accuracy as good as the best results ever
published.
- In sum, we have a sparseness-inducing Bayesian
approach that produces dramatically simpler
models with no loss in accuracy.
78 Streaming Data Analysis
- Motivated by need to make decisions about data
during an initial scan as data stream by
- Recent development of theoretical CS algorithms
- Algorithms motivated by intrusion detection,
transaction applications, time series
transactions
79 Streaming Text Data: Historic Data Analysis
- The accumulation of text messages is massive over
time.
- A lot of streaming research is focused on
on-going or current analyses.
- It is a great challenge to use only summarized
historic data and see if a currently emerging
phenomenon had precursors occurring in the past.
- We are working on a novel architecture for
historic and posterior analyses via small
summaries (sketches).
80 Streaming Analysis Tool: CM Sketch
- Theoretical: We have developed the CM Sketch, which
uses $(1/\varepsilon)\log(1/\delta)$ space to
approximate the data distribution with error at most
$\varepsilon$ and probability of success at least
$1-\delta$.
- All other previously known sample or sketch
methods use space at least $1/\varepsilon^2$.
- CM Sketch is an order of magnitude better.
- Practical: A few tens of KBs give an accurate
summary of large data. Create summaries of data
that allow historic queries to find:
- Heavy Hitters (Most Frequent Items)
- Quantiles of a Distribution (Median, Percentiles,
etc.)
- Items with large changes
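A compact sketch of a Count-Min style summary for point (frequency) queries. Two assumptions are worth flagging: the table dimensions follow the usual choice of width ceil(e/epsilon) and depth ceil(ln(1/delta)), and salted SHA-256 digests stand in for the pairwise-independent hash functions that the analysis of such sketches normally assumes. The class and method names are hypothetical.

```python
import hashlib
import math


class CountMinSketch:
    def __init__(self, epsilon, delta):
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _columns(self, item):
        # One column per row; the row number salts the hash (an assumption).
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        self.total += count
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum over rows never
        # underestimates the true count when all updates are non-negative.
        return min(self.table[row][col] for row, col in self._columns(item))


sketch = CountMinSketch(epsilon=0.01, delta=0.01)
for _ in range(50):
    sketch.add("alpha")
sketch.add("beta", 3)
for i in range(200):
    sketch.add(f"noise-{i}")
```

Heavy hitters and quantiles can be built on top of such point queries, but that machinery is not shown here.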
82 Large-scale Automated Author Identification
83 Statistical Analysis of Text
- Statistical text analysis has a long history in
literary analysis and in solving disputed
authorship problems.
- First (?) is Thomas C. Mendenhall in 1887.
84
- Hamilton versus Madison: the Federalist Papers
- Mosteller and Wallace (1963) used Naïve Bayes
with a Poisson and Negative Binomial model.
- Good predictive performance
85 Some Background
- Identification technologies important for
homeland security and in the legal system
- Author attribution for textual artifacts using
topic-independent stylometric features has a
long history
- Historical focus on small numbers of authors and
low-dimensional representations via function words
86 Author ID Project: Objectives
- Application of state-of-the-art statistical and
computing technologies to authorship attribution
- Work with very high-dimensional document
representations
- Focus on providing working solutions to
particular problems
87 Author ID Project: Focus
- Goal: Identification of Authors From Large
Collection of Objects
- traditional disputed authorship (choose among k
known authors)
- clustering of putative authors (e.g., internet
handles: termin8r, heyr, KaMaKaZie)
- document pair analysis: Were two documents
written by the same author?
- odd-man-out: Were these documents written by one
of this set of authors or by someone else?
88 Representation
- Long tradition in stylometry that seeks a small
number of textual characteristics that
distinguish the texts of authors from one another
(Burrows, Holmes, Binongo, Hoover, Mosteller &
Wallace, McMenamin, Tweedie, etc.)
- Typically use function words (a, with, as,
were, all, would, etc.) followed by PCA or cluster
analysis
- Function words aim to be topic-independent
- Hoover (2003) shows that using all high-frequency
words does a better job than function words alone
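A function-word representation can be as simple as a vector of relative frequencies. The sketch below assumes a crude lowercase tokenizer and uses only the six function words listed above as its vocabulary; the function name and the sample sentence are hypothetical. The resulting vectors would then feed the PCA or clustering step.

```python
import re
from collections import Counter

FUNCTION_WORDS = ("a", "with", "as", "were", "all", "would")


def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Relative frequency of each vocabulary word per token of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in vocabulary]


profile = function_word_profile("All were as a cat would be with a hat")
```

Swapping in a list of all high-frequency words, as Hoover suggests, only changes the `vocabulary` argument.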
89 Idiosyncratic Usage
- Idiosyncratic usage is less formalized in the
literature (misspellings, repeated neologisms,
etc.) but apparently useful. For example,
Foster's unmasking of Klein as the author of
"Primary Colors":
- Klein and Anonymous loved unusual adjectives
ending in -y and -inous: cartoony, chunky,
crackly, dorky, snarly, ..., slimetudinous,
vertiginous, ...
- Both Klein and Anonymous added letters to their
interjections: ahh, aww, naww.
- Both Klein and Anonymous loved to coin words
beginning in hyper-, mega-, post-, quasi-, and
semi-, more than all others put together.
- Klein and Anonymous use "riffle" to mean rifle
or rustle, a usage for which the OED provides no
instance in the past thousand years.
90 Odd-Man Out
- Were these documents written by one of this set
of authors or by someone else?
- Training data contains documents by given set of
authors
- Test data contains documents by some set of
authors including some not in original set
- Bayesian hierarchical model incorporates prior
knowledge that model parameters for different
authors differ from each other
- Initial success on small-scale simulated examples
- Generalizations for more than one new author
91 Some Results
- Created largest-ever (?) feature set including
function words, suffixes, POS tags, lengths,
spelling errors, common English errors,
grammatical errors, phrases, idiosyncratic usage,
n-grams, etc.
- Extensive experiments for 1-of-K and
odd-man-out
- New 1.2 million message Listserv corpus, 82,000
authors
92 Some Results - II
- Developed general purpose feature
extraction software for author attribution
- Bayesian Multinomial Regression Software extends
our highly scalable, sparse, BBR software (MMS
Project) to the multi-class case
94 Special Focus on Computational and Mathematical
Epidemiology
smallpox
95 Components of a Special Focus
- Working Groups
- Tutorials
- Workshops
- Visitor Programs
- Graduate Student Programs
- Postdoc Programs
- Dissemination
96 A Sampling of Working Groups
- WGs on Large Data Sets:
- Adverse Event/Disease Reporting, Surveillance, and
Analysis
- Data Mining and Epidemiology
- WGs on Analogies between Computers and Humans:
- Analogies between Computer Viruses/Immune Systems
and Human Viruses/Immune Systems
- Distributed Computing, Social Networks, and
Disease Spread Processes
97 WGs on Methods/Tools of Theoretical CS
- Phylogenetic Trees and Rapidly Evolving Diseases
- Order-Theoretic Aspects of Epidemiology
- WGs on Computational Methods for Analyzing Large
Models for Spread/Control of Disease:
- Spatio-temporal and Network Modeling of Diseases
- Methodologies for Comparing Vaccination
Strategies
98 WGs on Mathematical Sciences Methodologies
- Mathematical Models and Defense Against
Bioterrorism
- Predictive Methodologies for Infectious Diseases
- Statistical, Mathematical, and Modeling Issues in
the Analysis of Marine Diseases
99 A Sampling of Workshops
Workshops on Modeling of Infectious Diseases:
- The Pathogenesis of Infectious Diseases
- Models/Methodological Problems of Botanical
Epidemiology
- WS on Modeling of Non-Infectious Diseases:
- Disease Clusters
100 Workshops on Evolution and Epidemiology
- Genetics and Evolution of Pathogens
- The Epidemiology and Evolution of Influenza
- The Evolution and Control of Drug Resistance
- Models of Co-Evolution of Hosts and Pathogens
101 Workshops on Methodological Issues
- Capture-recapture Models in Epidemiology
- Spatial Epidemiology and Geographic Information
Systems
- Ecologic Inference
- Combinatorial Group Testing
103 The DIMACS Working Group on Adverse Event/Disease
Reporting, Surveillance, and Analysis
104 Working Group on Adverse Event/Disease Reporting,
Surveillance, and Analysis
- Health surveillance: a core activity in public
health
- Concerns about bioterrorism have attracted
attention to new surveillance methods:
- OTC drug sales
- Subway worker absenteeism
- Ambulance dispatches
- Spawns need for novel statistical methods for
surveillance of multiple data streams.
- WG coordinated closely with National Syndromic
Surveillance Conferences
105 New Data Types for Public Health Surveillance
- Managed care patient encounter data
- Pre-diagnostic/chief complaint (text data)
- Over-the-counter sales transactions
- Drug store
- Grocery store
- 911-emergency calls
- Ambulance dispatch data
- Absenteeism data
- ED discharge summaries
- Prescription/pharmaceuticals
- Adverse event reports
106 Farzad Mostashari
107 New Analytic Methods and Approaches
- Spatial-temporal scan statistics
- Statistical process control (SPC)
- Bayesian applications
- Market-basket association analysis
- Text mining
- Rule-based surveillance
- Change-point techniques
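As one illustration of the statistical process control and change-point ideas listed above, the following is a minimal sketch of a one-sided CUSUM chart applied to a single daily count stream. It is not the working group's method. The baseline length, the allowance `k`, the threshold `h`, and the sales figures are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def cusum_alarms(counts, baseline_n=7, k=0.5, h=4.0):
    """Return the day indices at which an upper CUSUM chart signals.

    baseline_n, k (allowance) and h (decision threshold) are illustrative
    assumptions, not values taken from the presentation.
    """
    baseline = counts[:baseline_n]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation; cannot standardize")
    s, alarms = 0.0, []
    for day, x in enumerate(counts[baseline_n:], start=baseline_n):
        z = (x - mu) / sigma          # standardized count for this day
        s = max(0.0, s + z - k)       # accumulate only upward drift
        if s > h:
            alarms.append(day)
            s = 0.0                   # restart the chart after a signal
    return alarms


# Hypothetical daily over-the-counter sales counts with a jump near the end.
otc_sales = [20, 22, 19, 21, 20, 23, 21, 22, 20, 30, 33, 35, 34]
print(cusum_alarms(otc_sales))
```

Surveillance of multiple data streams would need more than running one such chart per stream, since repeated testing inflates the false alarm rate; that is part of the need for novel methods noted on the earlier slide.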
108 SubGroup on Privacy and Confidentiality of Health
Data
- Privacy concerns are a major stumbling block to
public health surveillance, in particular
bioterrorism surveillance.
- Challenge: produce anonymous data specific enough
for research.
- Exploring ways to remove identifiers (s.s. #,
tel. #, zip code) from data sets.
- Exploring ways to aggregate, remove information
from data sets.
- Partnerships with cryptographers
- Exploring methods of combinatorial optimization
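The identifier removal and aggregation steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the field names (`ssn`, `phone`, `zip`, `syndrome`), the three-digit zip prefix, and the minimum group size `k` are hypothetical, and real de-identification requires far more care than this.

```python
from collections import Counter

# Hypothetical field names; real surveillance data sets will differ.
DIRECT_IDENTIFIERS = {"ssn", "phone"}


def deidentify(records, k=2, zip_digits=3):
    """Drop direct identifiers, coarsen zip codes, suppress small regions."""
    cleaned = []
    for rec in records:
        # Remove fields that identify a person directly.
        row = {f: v for f, v in rec.items() if f not in DIRECT_IDENTIFIERS}
        # Aggregate geography: keep only the leading digits of the zip code.
        z = row["zip"]
        row["zip"] = z[:zip_digits] + "*" * (len(z) - zip_digits)
        cleaned.append(row)
    # Suppress records whose coarsened region holds fewer than k people.
    region_size = Counter(r["zip"] for r in cleaned)
    return [r for r in cleaned if region_size[r["zip"]] >= k]


# Made-up records for illustration only.
records = [
    {"ssn": "000-00-0001", "phone": "555-0101", "zip": "08854", "syndrome": "respiratory"},
    {"ssn": "000-00-0002", "phone": "555-0102", "zip": "08855", "syndrome": "respiratory"},
    {"ssn": "000-00-0003", "phone": "555-0103", "zip": "07102", "syndrome": "rash"},
]
for row in deidentify(records):
    print(row)
```

The trade-off named on the slide shows up directly: a larger `k` or a shorter zip prefix gives more anonymity but leaves data that is less specific for research.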
110 Bioterrorism Working Group
anthrax
111 Bioterrorism Working Group
- Biosurveillance
- Evolution
- Modeling Bioterror Response Logistics
- Computer Science Challenges
- Agroterrorism
112 Modeling Bioterror Response Logistics
- Exploring Discrete Optimization/Queueing
- size of stockpiles of vaccines
- allocation of medications
- analysis of bottlenecks in treatment facilities
- transportation schedules
1947 smallpox vaccination queue, NYC
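A small queueing calculation shows the kind of question the logistics work addresses, such as how many vaccination stations keep waiting times acceptable. The sketch below uses the standard M/M/c (Erlang C) formula, which assumes Poisson arrivals and exponential service times. That modeling choice and the clinic numbers are assumptions for illustration, not results from the working group.

```python
from math import factorial


def erlang_c_wait(arrival_rate, service_rate, servers):
    """Mean time waiting in queue for an M/M/c system (same unit as the rates)."""
    a = arrival_rate / service_rate      # offered load
    rho = a / servers                    # utilization per server
    if rho >= 1:
        raise ValueError("queue is unstable: add servers")
    tail = a**servers / (factorial(servers) * (1 - rho))
    p_wait = tail / (sum(a**n / factorial(n) for n in range(servers)) + tail)
    return p_wait / (servers * service_rate - arrival_rate)


def min_servers(arrival_rate, service_rate, max_wait):
    """Smallest number of stations whose mean queueing delay is within max_wait."""
    c = int(arrival_rate // service_rate) + 1   # smallest stable station count
    while erlang_c_wait(arrival_rate, service_rate, c) > max_wait:
        c += 1
    return c


# Hypothetical clinic: 100 arrivals per hour, 12 vaccinations per hour per
# station, target mean wait of at most 5 minutes.
print(min_servers(100, 12, 5 / 60))
```

The same function can be used to look for bottlenecks: stations whose utilization `rho` is close to 1 are where waits grow fastest.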
113 Agroterrorism
- Subgroup just starting
- Interest in plant diseases
- Partnership with the National Plant Diagnostic
Network
- Emphasis on Data Mining and Epidemiology
115 Working Group on Modeling Social Responses to
Bioterrorism
- Models of the spread of infectious disease
commonly assume passive bystanders and rational
actors who will comply with health authorities.
- It is not clear how well this assumption applies
to situations like a bioterrorist attack using
smallpox or plague.
116 Working Group on Modeling Social Responses to
Bioterrorism
- Interdisciplinary group is discussing
incorporating social behavior into models,
building models of public health decision making,
and risk communication.
- Some Issues
- Movement
- Compliance
- Rumor
- Subcultural differences
- Indirect economic effects
- Social stigmata
- Panic
How do you measure the indirect cost of an
attack?
118 Predicting Disease Outbreaks from Remote Sensing
and Media Data
Outbreaks of disease in other parts of the world
have the capacity to affect the security of the
US.
Joint project with Imaging Science and
Information Systems Center at Georgetown University
Medical School (ISIS Center)
119 Predicting Disease Outbreaks from Remote Sensing
and Media Data
- Recent work has shown that it's possible to
predict disease outbreaks in distant parts of the
world using remotely sensed satellite data.
- SARS and heightened avian flu in the Pacific Rim
appeared following temperature anomalies in
China.
- Could we have anticipated this given
enviro-climatic information?
120 Predicting Disease Outbreaks from Remote Sensing
and Media Data
- Rift Valley Fever epidemic in 1997/8 in East
Africa occurred following heavy flooding related
to El Niño.
- Flooding in Venezuela in 1995 resulted in a
multi-pathogen outbreak.
121 Predicting Disease Outbreaks from Remote Sensing
and Media Data
- Indications and warnings can alert US responders
to bioevents in faraway places.
- Disease that can result in social disruptions can
be detected in open source media reports even if
there is no official reporting of it.
122 Predicting Disease Outbreaks from Remote Sensing
and Media Data
- A model developed at the ISIS Center at
Georgetown predicts social disruptions due to
disease based on keyword hit counts from
text-based sources (media reports).
- DIMACS Project goal: Use the media model to develop
ways to predict disease-related social disruptions
from remote sensing enviro-climatic data.
- We will be using remote sensing data indicating
increased Normalized Difference Vegetation Index
(NDVI).
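For readers unfamiliar with the index, NDVI is conventionally computed from near-infrared (NIR) and red reflectance as (NIR - Red) / (NIR + Red), giving values between -1 and 1. The sketch below computes it for one pixel and flags an increase relative to a historical mean. The reflectance values and the zero-signal convention are assumptions for illustration, not part of the project description.

```python
from statistics import mean


def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    total = nir + red
    if total == 0:
        return 0.0   # assumption: treat a pixel with no signal as NDVI 0
    return (nir - red) / total


def ndvi_anomaly(current, history):
    """Difference between the current NDVI and the mean of past values."""
    return current - mean(history)


# Hypothetical reflectances for the same pixel and calendar month.
past = [ndvi(0.30, 0.20), ndvi(0.32, 0.20), ndvi(0.31, 0.21)]
now = ndvi(0.50, 0.10)
print(round(now, 3), round(ndvi_anomaly(now, past), 3))
```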
123 Predicting Disease Outbreaks from Remote Sensing
and Media Data
- Project Premise: We can use enviro-climatic
indices such as NDVI, coupled with disease-related
social disruption predictors from media data
delayed by several months, to validate the
enviro-climatic indicators as predictors.
- Approach: Machine Learning
- Project waiting to get started
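The premise of pairing an enviro-climatic index with media signals delayed by several months can be illustrated with a simple lagged correlation. This is a stand-in for the machine learning approach the project names, not a description of it, and both monthly series below are synthetic, constructed so that the media index trails the NDVI anomaly by three months.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def best_lag(env, media, max_lag):
    """Correlate env[t] with media[t + lag] for each lag; return the best lag."""
    scores = {}
    for lag in range(max_lag + 1):
        scores[lag] = pearson(env[: len(env) - lag], media[lag:])
    return max(scores, key=scores.get), scores


# Synthetic monthly NDVI anomalies (assumption, for illustration only).
ndvi_anomaly = [0.1, 0.5, 0.2, 0.1, 0.6, 0.3, 0.1, 0.2, 0.7, 0.2, 0.1, 0.1]
# Synthetic media disruption index that echoes the anomaly three months later.
media_index = [1.0, 1.0, 1.0] + [10 * v for v in ndvi_anomaly[:-3]]

lag, scores = best_lag(ndvi_anomaly, media_index, max_lag=5)
print(lag)
```

A strong correlation at a plausible lag would support the enviro-climatic index as a leading indicator; real data would also need controls for seasonality and reporting bias.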
124 Predicting Disease Outbreaks from Remote Sensing
and Media Data
- The approach is similar to ones used by members
of the DIMACS team to estimate probability of a
match between remotely sensed signals and a
signature that has been observed before. This
work has been applied to face recognition and
explosive detection.
126 Special Focus on Communication Security and
Information Privacy
127 Special Focus on Communication Security and
Information Privacy
- Working Groups
- Privacy-Preserving Data Mining
- Usable Privacy and Security Software
- Data De-Identification, Combinatorial
Optimization, Graph Theory, and the Stat-OR
Interface
- Intrusion Detection and Network Security
Management Systems
128 Special Focus on Communication Security and
Information Privacy
- A Selection of Workshops
- Software Security
- Applied Cryptography and Network Security
- Large-scale Internet Attacks
- Mobile and Wireless Security
- Security of Web Services and E-Commerce
- Database Security: Query Authorization and
Information Inference
129 Working Group on Analogies between Computer
Viruses and Biological Viruses
- Can ideas for defending against biological
viruses lead to ideas for defending against
computer viruses?
- Concern about large gap between initial time of
attack and implementation of defensive strategies
- Public health approach: Once a virus has
infected a machine, it tries to connect it to as
many computers as possible, as fast as possible.
A throttle limits the rate at which a computer can
connect to new computers.
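The throttle idea can be sketched as a small rate limiter: connections to recently contacted hosts pass immediately, while connections to new hosts wait in a queue that releases one host per time period. This is a minimal sketch under assumptions; the class name, the working-set size, and the one-per-tick release rate are hypothetical choices, not details given in the presentation.

```python
from collections import deque


class ConnectionThrottle:
    """Limit how fast a machine may start talking to hosts it has not seen recently."""

    def __init__(self, working_set_size=5):
        # Hosts contacted recently; the oldest entry is dropped when full.
        self.recent = deque(maxlen=working_set_size)
        # Requests to new hosts, waiting to be released.
        self.pending = deque()

    def request(self, host):
        """Return True if the connection may proceed immediately."""
        if host in self.recent:
            return True
        if host not in self.pending:
            self.pending.append(host)
        return False

    def tick(self):
        """Call once per time period: release at most one new host."""
        if not self.pending:
            return None
        host = self.pending.popleft()
        self.recent.append(host)
        return host


# A worm-like burst: 100 new hosts requested within a single time period.
throttle = ConnectionThrottle()
burst = [throttle.request(f"10.0.0.{i}") for i in range(100)]
released = [throttle.tick() for _ in range(10)]
print(sum(burst), len(released), len(throttle.pending))
```

Normal traffic, which mostly revisits a small set of hosts, is barely slowed, while a fast-spreading virus is held to the release rate. A rapidly growing pending queue is itself a useful warning sign.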