Summarization and Deviation Detection What is new - PowerPoint PPT Presentation


Slides: 43

Provided by: gregoryp8


Transcript and Presenter's Notes


1
Summarization and Deviation Detection: What Is
New?
2
Outline

Novelty Detection
KEFIR: Key Findings Reporter
WSARE: What's Strange About Recent Events

3
What is New?
Old data
new data
4
Condition Monitoring

Detect when a system is not operating normally:
novelty detection
"Happy families are all alike; every unhappy
family is unhappy in its own way."
machinery
health screening
scientific measurement
Diagnose what is going wrong: classification
Much more difficult; needs many examples of fault
classes

5
Classic Approach

Look for a rare value for some attribute, e.g. a
value that is outside of the normal operating
range
This fails to identify anomalies that occur due
to a combination of features which by themselves
are not abnormal
What is needed is a multi-dimensional approach
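
A small illustration of why per-attribute range checks miss combination anomalies. This is only a sketch on made-up two-feature data; the Mahalanobis distance is swapped in here as one simple multi-dimensional score and is not taken from the slides. All names and values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def fit_2d(points):
    """Estimate the mean and sample covariance of 2-D points."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, model):
    """Squared Mahalanobis distance of a point from the fitted model."""
    (mx, my), (sxx, sxy, syy) = model
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy * sxy
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def within_ranges(point, points):
    """Classic check: is every attribute inside its observed range?"""
    xs, ys = zip(*points)
    return min(xs) <= point[0] <= max(xs) and min(ys) <= point[1] <= max(ys)

# Hypothetical normal data: the two features move together.
normal = [(i, i + (1 if i % 2 else -1)) for i in range(21)]
# Each coordinate is unremarkable on its own, but the pair is not.
suspect = (18, 2)

model = fit_2d(normal)
print(within_ranges(suspect, normal))   # passes the per-attribute check
print(mahalanobis_sq(suspect, model))   # far larger than for any normal point
```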

6
Novelty Detection

Build a probability density model p(x) based on
training data from the normal class (e.g. using
Gaussian mixture models).
Use a validation set of normal data to determine
the threshold θ for normality (e.g. choose θ so
that 99% of examples have p(x) > θ).
Apply model to new data y. If p(y) < θ, then
classify example as novel.
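
The three steps above can be sketched as follows. This is a minimal illustration, not the method used in the cited applications: a single one-dimensional Gaussian stands in for the Gaussian mixture model, the data are synthetic, and the function names are hypothetical.

```python
import random
from statistics import NormalDist

def fit_density(train):
    """Step 1: fit p(x) on normal-class data (single Gaussian as a stand-in)."""
    return NormalDist.from_samples(train)

def choose_threshold(model, validation, keep_fraction=0.99):
    """Step 2: pick the threshold so about keep_fraction of validation data passes."""
    densities = sorted(model.pdf(x) for x in validation)
    cut = int(len(densities) * (1 - keep_fraction))
    return densities[cut]

def is_novel(model, threshold, y):
    """Step 3: flag y as novel when its density falls below the threshold."""
    return model.pdf(y) < threshold

rng = random.Random(0)
train = [rng.gauss(50.0, 5.0) for _ in range(2000)]       # synthetic normal data
validation = [rng.gauss(50.0, 5.0) for _ in range(1000)]  # held-out normal data

model = fit_density(train)
threshold = choose_threshold(model, validation)

print(is_novel(model, threshold, 50.0))  # a typical value
print(is_novel(model, threshold, 90.0))  # a value far from the training data
```

Choosing the threshold on held-out normal data, rather than on the training set, keeps the stated pass rate honest when the density model overfits.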

7
Applications

Tarassenko et al.: Breast cancer screening.
Extract texture features from regions on
mammogram; use novelty score to decide which
should be passed to expert clinician for
follow-up
Pfizer: High-throughput screening. Assessing
the biological activity of thousands (millions)
of compounds on plates with c. 100 compounds per
plate. Looking for faulty measurements: model
distribution of control values on plates

8
Applications II

Novelty detection is valuable in any data mining
application
Models are trained for a specific generating
distribution p(x) and will only generalise well
to data drawn from that distribution
If the distribution changes, model performance
will suffer; thus it is worth monitoring the
input data density and flagging those examples
with a low probability, for which the model is
likely to give an inaccurate prediction

9
Summarization

Concisely summarize what is new and different,
unexpected
with respect to previous values
with respect to expected values
Focus on what is actionable!

10
Problem: Healthcare Costs

Healthcare costs in US: 1/7 of GDP and rising
Potential problems: fraud, misuse
Understanding where the problems are is the first
step to fixing them
GTE is self-insured for medical costs
GTE healthcare costs: $X00,000,000
Task: Analyze employee health care data and
generate a report that describes the major
problems

11
GTE Key Findings Reporter (KEFIR)

KEFIR Approach
Analyze all possible deviations
Select interesting findings
Augment key findings with
Explanations of plausible causes
Recommendations of appropriate actions
Convert findings to a user-friendly report with
text and graphics

12
KEFIR Search Space
13
Drill-Down Example
14
What Change Is Important?
Use tables of national and regional norms for key
measures, i.e. a very simple model of normality
15
Deviation Detection

Drill down through the search space
Generate a finding for each measure:
deviation from previous period
deviation from norm
deviation projected for next period, if no action
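
A sketch of the three deviations for one measure. The slides do not say how the next-period projection is computed, so a straight-line continuation of the last change is assumed here; the Finding class, the function name and the numbers are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    measure: str
    change_from_previous: float  # relative change versus the previous period
    deviation_from_norm: float   # relative gap versus the norm
    projected_next: float        # next-period value if no action is taken

def make_finding(measure, previous, current, norm):
    change = (current - previous) / previous
    from_norm = (current - norm) / norm
    # Assumption: the trend simply continues at the same absolute rate.
    projected = current + (current - previous)
    return Finding(measure, change, from_norm, projected)

# Made-up values for illustration only.
finding = make_finding("admission rate per 1000",
                       previous=80.0, current=92.0, norm=75.0)
print(finding)
```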

16
Interestingness of Deviations
Impact: how much the deviation affects the bottom
line
Savings Percentage: how much of the deviation
from the norm can be expected to be saved by the
action
17
Recommendations
Hierarchical recommendation rules define
appropriate intervention strategies for important
measures and study areas.
Example
If
measure = admission rate per 1000
study_area = Inpatient admissions
percent_change > 0.10
Then
Utilization review is needed in the area of
admission certification.
Expected Savings: 20%
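
The example rule can be written down as data, roughly as below. This is a sketch only: the class names are hypothetical, the rule hierarchy is not modelled, and multiplying impact by the savings percentage to get an expected saving is an assumed reading of the two quantities defined on the previous slide. The impact figure is made up.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    measure: str
    study_area: str
    percent_change: float
    impact: float  # hypothetical effect on the bottom line, in dollars

@dataclass
class Rule:
    measure: str
    study_area: str
    min_percent_change: float
    recommendation: str
    savings_percentage: float

    def matches(self, finding):
        return (finding.measure == self.measure
                and finding.study_area == self.study_area
                and finding.percent_change > self.min_percent_change)

RULES = [
    Rule("admission rate per 1000", "Inpatient admissions", 0.10,
         "Utilization review is needed in the area of admission certification.",
         0.20),
]

def recommend(finding, rules=RULES):
    """Return (recommendation, expected saving) for the first matching rule."""
    for rule in rules:
        if rule.matches(finding):
            return rule.recommendation, finding.impact * rule.savings_percentage
    return None

example = Finding("admission rate per 1000", "Inpatient admissions",
                  percent_change=0.15, impact=1_000_000.0)
print(recommend(example))
```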
18
Explanation
A measure is explained by finding the path of
related measures with the highest impact (greedy
search)
"The large increase in m1 in group s1 was caused
by an increase in m3, which was caused by a rise
in m5, primarily in sector s13."
Limited to single-factor explanations
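
The greedy search can be pictured with a few lines of code. This is a sketch under assumptions: the relation between measures is taken to be an acyclic mapping from a measure to its related measures, and the impact values are invented. Only the names m1, m3 and m5 come from the slide.

```python
def explain(measure, related, impact):
    """Follow the highest-impact related measure at each step (greedy)."""
    path = [measure]
    while related.get(path[-1]):
        candidates = related[path[-1]]
        path.append(max(candidates, key=lambda m: impact[m]))
    return path

# Hypothetical relations and impacts.
related = {"m1": ["m2", "m3"], "m3": ["m4", "m5"]}
impact = {"m2": 10.0, "m3": 40.0, "m4": 5.0, "m5": 30.0}

print(" <- ".join(explain("m1", related, impact)))
```

Because only the single best related measure is followed at each level, the result is a single-factor explanation, which matches the limitation noted on the slide.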
19
Report Generation

Automatic generation of business-user-oriented
reports
Natural language generation with template
matching
Graphics
Delivered via web browser

21
Sample KEFIR pages
Overview
Inpatient admissions
22
Status

Prototype implemented in GTE in 1995
KEFIR received GTE's highest award for technical
achievement in 1995
Key business user left GTE in 1996 and system was
no longer used
Publication:
Selecting and Reporting What is Interesting: The
KEFIR Application to Healthcare Data, C. Matheus,
G. Piatetsky-Shapiro, and D. McNeill, in Advances
in Knowledge Discovery and Data Mining, AAAI/MIT
Press, 1996

23
What's Strange About Recent Events (WSARE)

Weng-Keen Wong (Carnegie Mellon University)
Andrew Moore (Carnegie Mellon University)
Gregory Cooper (University of Pittsburgh)
Michael Wagner (University of Pittsburgh)
http://www.autonlab.org/wsare

Designed to be easily applicable to any
date/time-indexed biosurveillance-relevant data
stream
24
Motivation
Suppose we have access to Emergency Department
data from hospitals around a city (with patient
confidentiality preserved)
25
Traditional Approaches

We need to build a univariate detector to
monitor each interesting combination of
attributes:

Diarrhea cases among children
Number of cases involving people working in
southern part of the city
Respiratory syndrome cases among females
Number of cases involving teenage girls living
in the western part of the city
Viral syndrome cases involving senior citizens
from eastern part of city
Botulinic syndrome cases
Number of children from downtown hospital
And so on
You'll need hundreds of univariate detectors! We
would like to identify the groups with the
strangest behavior in recent events.
26
WSARE Approach

Rule-Based Anomaly Pattern Detection
Association rules used to characterize anomalous
patterns. For example, a two-component rule
would be
Gender = Male AND 40 ≤ Age < 50
Limit to two-attribute rules to reduce size of
search space

27
WSARE v2.0 Overview

1. Obtain Recent and Baseline datasets
2. Search for rule with best score
3. Determine p-value of best scoring rule through
randomization test
4. If p-value is less than threshold, signal alert
(Diagram labels: All Data, Recent Data, Baseline)
28
Step 1: Obtain Recent and Baseline Data
Recent Data:
Data from last 24 hours
Baseline:
Baseline data is assumed to capture non-outbreak
behavior. We use data from 35, 42, 49 and 56
days prior to the current day
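
A sketch of Step 1 on dated records. The record layout (a dict with a "date" key) and the function names are assumptions for illustration, and "last 24 hours" is approximated by "records dated the current day". Since the offsets are all multiples of 7, every baseline day falls on the same weekday as the current day.

```python
from datetime import date, timedelta

BASELINE_OFFSETS_DAYS = (35, 42, 49, 56)

def baseline_dates(current_day, offsets=BASELINE_OFFSETS_DAYS):
    """Days whose records form the baseline for current_day."""
    return [current_day - timedelta(days=d) for d in offsets]

def split_recent_and_baseline(records, current_day):
    """Partition records into today's cases and baseline cases."""
    wanted = set(baseline_dates(current_day))
    recent = [r for r in records if r["date"] == current_day]
    baseline = [r for r in records if r["date"] in wanted]
    return recent, baseline

today = date(2001, 12, 23)
for day in baseline_dates(today):
    print(day.isoformat())
```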
29
Example

Sat 12-23-2001
35.8% (48/134) of today's cases have 30 < age <
40
17.0% (45/265) of other (baseline) cases have
30 < age < 40

30
Step 2: Search for Best Rule

For each rule, form a 2x2 contingency table, e.g.
counts of cases that match and do not match the
rule, for recent versus baseline data
Perform Fisher's Exact Test to get a p-value
(score) for each rule (for this data: 0.00005)
Find rule RBEST with the lowest score.
Caution: This score is not the true p-value of
RBEST because of multiple tests
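
To make the scoring step concrete, here is a sketch of Fisher's exact test on the 2x2 table implied by the earlier example (48 of 134 recent cases and 45 of 265 baseline cases match the rule). The slides do not say whether the test is one-sided or two-sided; a two-sided version is assumed here, so the result is not claimed to reproduce the figure quoted on the slide. The function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Rows: recent, baseline. Columns: matches rule, does not match rule.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k matching cases in the recent row.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in (prob(k) for k in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))

#              matches rule   does not match
# recent            48             86
# baseline          45            220
score = fisher_exact_two_sided(48, 86, 45, 220)
print(score)
```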

31
Step 3: Randomization Test

Take the recent cases and the baseline cases.
Shuffle the date field to produce a randomized
dataset DBRand
Find the rule with the best score on DBRand.

32
Step 3: Randomization Test
Repeat the procedure 1000 times. Determine how
many scores from the 1000 iterations are better
than the original score.
If the original score would place in the top 1%
of the 1000 scores from the randomization test,
this is statistically significant and an alert
should be raised.
Estimated p-value of the rule is (number of
better scores) / (number of iterations)
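
The randomization test can be sketched as below. Assumptions: the date field is reduced to a recent/baseline flag, rules are plain predicates over a case, a one-sided Fisher score is used, and the data are synthetic. All names are hypothetical, and only 200 iterations are run to keep the example quick.

```python
import random
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(at least a matches among recent cases) under the hypergeometric null."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    tail = sum(comb(row1, k) * comb(row2, col1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / denom

def best_score(cases, is_recent, rules):
    """Lowest Fisher score over all candidate rules (lower is stranger)."""
    n_recent = sum(is_recent)
    best = 1.0
    for rule in rules:
        hits = [rule(case) for case in cases]
        a = sum(1 for h, r in zip(hits, is_recent) if h and r)
        c = sum(1 for h, r in zip(hits, is_recent) if h and not r)
        b = n_recent - a
        d = len(cases) - n_recent - c
        best = min(best, fisher_one_sided(a, b, c, d))
    return best

def randomization_p_value(cases, is_recent, rules, iterations=1000, seed=0):
    """Shuffle the recent/baseline labels and count better best-scores."""
    rng = random.Random(seed)
    original = best_score(cases, is_recent, rules)
    labels = list(is_recent)
    better = 0
    for _ in range(iterations):
        rng.shuffle(labels)
        if best_score(cases, labels, rules) < original:
            better += 1
    return better / iterations

# Synthetic data: males are over-represented among the recent cases.
baseline = [{"gender": "M" if i % 2 else "F", "age": "old" if i % 3 else "young"}
            for i in range(200)]
recent = [{"gender": "M" if i % 10 else "F", "age": "old" if i % 3 else "young"}
          for i in range(50)]
cases = baseline + recent
is_recent = [False] * len(baseline) + [True] * len(recent)
rules = [lambda c: c["gender"] == "M", lambda c: c["gender"] == "F",
         lambda c: c["age"] == "old", lambda c: c["age"] == "young"]

p_value = randomization_p_value(cases, is_recent, rules, iterations=200)
print(p_value)
```

Re-running the full rule search on every shuffled dataset is what compensates for the multiple tests: the original best score is compared against the best score that chance alone produces.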
33
Results on Actual ED Data from 2001

1. Sat 2001-02-13 SCORE = -0.00000004 PVALUE =
0.00000000
14.80% ( 74/500) of today's cases have Viral
Syndrome = True and Encephalitic Prodrome = False
7.42% (742/10000) of baseline have Viral
Syndrome = True and Encephalitic Syndrome = False
2. Sat 2001-03-13 SCORE = -0.00000464 PVALUE =
0.00000000
12.42% ( 58/467) of today's cases have
Respiratory Syndrome = True
6.53% (653/10000) of baseline have
Respiratory Syndrome = True
3. Wed 2001-06-30 SCORE = -0.00000013 PVALUE =
0.00000000
1.44% ( 9/625) of today's cases have 100 <
Age < 110
0.08% ( 8/10000) of baseline have 100 < Age
< 110
4. Sun 2001-08-08 SCORE = -0.00000007 PVALUE =
0.00000000
83.80% (481/574) of today's cases have
Unknown Syndrome = False
74.29% (7430/10001) of baseline have Unknown
Syndrome = False
5. Thu 2001-12-02 SCORE = -0.00000087 PVALUE =
0.00000000
14.71% ( 70/476) of today's cases have Viral
Syndrome = True and Encephalitic Syndrome = False
7.89% (789/9999) of baseline have Viral
Syndrome = True and Encephalitic Syndrome = False

N.B. each rule has conditions on at most two
attributes
34
WSARE 3.0: Improving the Baseline
Baseline
Recall that the baseline was assumed to be
captured by data that was from 35, 42, 49, and 56
days prior to the current day.
What if this assumption isn't true? What if data
from 7, 14, 21 and 28 days prior is better?
We would like to determine the baseline
automatically!
35
Temporal Trends
From Goldenberg, A., Shmueli, G., Caruana, R.
A., and Fienberg, S. E. (2002). Early
statistical detection of anthrax outbreaks by
tracking over-the-counter medication sales.
Proceedings of the National Academy of Sciences
(pp. 5237-5249)
36
WSARE v3.0

Generate the baseline:
Taking into account recent flu levels
Taking into account that today is a public
holiday
Taking into account that this is Spring
Taking into account recent heatwave
Taking into account that there's a known natural
food-borne outbreak in progress

Bonus: More efficient use of historical data
37
Idea: Bayesian Networks
Bayesian Network: A graphical model representing
the joint probability distribution of a set of
random variables
"On cold Tuesday mornings the folks coming in
from the North part of the city are more likely
to have respiratory problems"
"Patients from West Park Hospital are less likely
to be young"
"On the day after a major holiday, expect a boost
in the morning followed by a lull in the
afternoon"
"The Viral prodrome is more likely to co-occur
with a Rash prodrome than Botulinic"
38
Obtaining Baseline Data
Inputs: All Historical Data, Today's Environment
1. Learn Bayesian Network
2. Generate baseline given today's environment
Baseline: "What should be happening today given
today's environment"
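
A heavily simplified sketch of the two steps. It does not learn a network structure: it assumes a fixed structure in which every response attribute depends only on the environment attributes, estimates the conditional tables by counting, and samples a baseline for today's environment. The attribute names and data are invented, and an environment never seen in the history is not handled.

```python
import random
from collections import Counter, defaultdict

def learn_conditional_tables(history, env_keys, response_keys):
    """Count response values per environment setting: P(response | environment)."""
    tables = {key: defaultdict(Counter) for key in response_keys}
    for record in history:
        env = tuple(record[k] for k in env_keys)
        for key in response_keys:
            tables[key][env][record[key]] += 1
    return tables

def generate_baseline(tables, env, n, seed=0):
    """Sample n baseline records conditioned on today's environment."""
    rng = random.Random(seed)
    baseline = []
    for _ in range(n):
        record = {}
        for key, table in tables.items():
            counts = table[env]  # assumes this environment occurs in the history
            values = list(counts)
            weights = [counts[v] for v in values]
            record[key] = rng.choices(values, weights=weights)[0]
        baseline.append(record)
    return baseline

# Invented history: respiratory cases dominate in winter, not in summer.
history = ([{"season": "winter", "syndrome": "respiratory"}] * 80
           + [{"season": "winter", "syndrome": "other"}] * 20
           + [{"season": "summer", "syndrome": "respiratory"}] * 20
           + [{"season": "summer", "syndrome": "other"}] * 80)

tables = learn_conditional_tables(history, ["season"], ["syndrome"])
winter_baseline = generate_baseline(tables, ("winter",), n=1000)
share = sum(r["syndrome"] == "respiratory" for r in winter_baseline) / 1000
print(share)
```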
39
Simulation
(Diagram of the simulator; node labels recovered
from the slide)
Nodes: DAY OF WEEK, FLU LEVEL, SEASON, WEATHER,
Region Anthrax Concentration, Has Anthrax, AGE,
Outside Activity, Immune System, GENDER, Region
Grassiness, Has Flu, Has Sunburn, Heart Health,
DATE, Region Food Condition, Has Cold, Has
Allergy, REGION, Has Heart Attack, Has Food
Poisoning, Disease, Actual Symptom, REPORTED
SYMPTOM, ACTION, DRUG
Actions: None, Purchase Medication, ED visit,
Absent. If Action is not None, output record to
dataset.
40
Simulation

100 different data sets
Each data set consisted of a two-year period
Anthrax release occurred at a random point during
the second year
Algorithms allowed to train on data from the
current day back to the first day in the
simulation
Any alerts before the actual anthrax release are
considered false positives
Detection time calculated as first alert after
anthrax release. If no alerts raised, cap
detection time at 14 days
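
The scoring rules above can be sketched as a small function. Two assumptions not stated on the slide: an alert on the release day itself counts as a detection, and an alert later than 14 days after release is also capped at 14. Days are plain integer indices into the simulation, and the names are hypothetical.

```python
MAX_DETECTION_DAYS = 14

def evaluate_alerts(alert_days, release_day, cap=MAX_DETECTION_DAYS):
    """Return (false positive count, detection time in days) for one data set."""
    false_positives = sum(1 for day in alert_days if day < release_day)
    after_release = [day for day in alert_days if day >= release_day]
    if not after_release:
        return false_positives, cap  # no alert raised: use the cap
    return false_positives, min(min(after_release) - release_day, cap)

print(evaluate_alerts([400, 500, 510], release_day=505))
print(evaluate_alerts([], release_day=505))
```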