Summarization and Deviation Detection -- What is new?


1
Summarization and Deviation Detection -- What is new?
2
Outline
  • Summarization
  • KEFIR: Key Findings Reporter
  • WSARE: What is Strange About Recent Events

3
What is New?
Old data
new data
4
Summarization
  • Concisely summarize what is new, different, and
    unexpected
      • with respect to previous values
      • with respect to expected values
  • Focus on what is actionable!

5
Problem: Healthcare Costs
  • Healthcare costs in the US: 1 out of every 7 GDP
    dollars, and rising
  • Potential problems: fraud, misuse, ...
  • Understanding where the problems are is the first
    step to fixing them
  • GTE is self-insured for medical costs
  • GTE healthcare costs: $X00,000,000
  • Task: analyze employee health care data and
    generate a report that describes the major
    problems

6
GTE Key Findings Reporter (KEFIR)
  • KEFIR Approach
      • Analyze all possible deviations
      • Select interesting findings
      • Augment key findings with
          • Explanations of plausible causes
          • Recommendations of appropriate actions
      • Convert findings to a user-friendly report with
        text and graphics

7
KEFIR Search Space
8
Drill-Down Example
9
What Change Is Important?
10
Deviation Detection
  • Drill down through the search space
  • Generate a finding for each measure
      • deviation from previous period
      • deviation from norm
      • deviation projected for next period, if no action
        is taken
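The three deviations recorded in a finding can be sketched as follows. This is a minimal illustration, not KEFIR's code: the `Finding` class, the `make_finding` helper, and the sample numbers are hypothetical, and the projection rule (the next period repeats the latest change) is an assumption, since the slides do not say how the projection is made.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    measure: str
    dev_from_previous: float  # current value minus previous period's value
    dev_from_norm: float      # current value minus the norm (expected value)
    projected_dev: float      # projected next-period value minus the norm


def make_finding(measure, previous, current, norm):
    # Assumption: with no action, the next period repeats the latest change.
    projected_next = current + (current - previous)
    return Finding(
        measure=measure,
        dev_from_previous=current - previous,
        dev_from_norm=current - norm,
        projected_dev=projected_next - norm,
    )


# Hypothetical numbers for one measure in one study area.
f = make_finding("admission rate per 1000", previous=50.0, current=56.0, norm=52.0)
print(f)
```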

11
Interestingness of Deviations
  • Impact: how much the deviation affects the bottom
    line
  • Savings Percentage: how much of the deviation
    from the norm can be expected to be saved by the
    action
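One way to combine the two quantities into a single ranking score is to multiply them, giving an estimate of the savings an action could recover. The slide does not state the formula, so the product below is an assumption; the function name and the sample findings are hypothetical.

```python
def interestingness(impact, savings_pct):
    """Assumed score: the share of a deviation's impact an action could save."""
    return impact * savings_pct


# Hypothetical findings: (name, impact in dollars, savings percentage).
findings = [
    ("inpatient admissions", 400_000.0, 0.20),
    ("outpatient visits", 900_000.0, 0.05),
    ("pharmacy claims", 150_000.0, 0.50),
]

# Rank findings from most to least interesting.
ranked = sorted(findings, key=lambda f: interestingness(f[1], f[2]), reverse=True)
for name, impact, pct in ranked:
    print(name, interestingness(impact, pct))
```

Under this assumed score, a large deviation with little recoverable savings can rank below a smaller deviation that an action can mostly fix.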
12
Recommendations
Hierarchical recommendation rules define
appropriate intervention strategies for important
measures and study areas.
Example:
If   measure = admission rate per 1000
     and study_area = Inpatient admissions
     and percent_change > 0.10
Then Utilization review is needed in the area of
     admission certification.
     Expected Savings: 20%
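The example rule can be written down as data plus a matching function. This is a sketch only: the `Rule` class and `recommend` function are hypothetical names, the rule hierarchy is not modeled, and reading "Expected Savings 20" as 20% (0.20) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    measure: str
    study_area: str
    min_percent_change: float  # rule fires when percent_change exceeds this
    recommendation: str
    expected_savings: float    # assumed to be a fraction (0.20 = 20%)


def recommend(rules, measure, study_area, percent_change):
    """Return every rule whose conditions all hold for the given finding."""
    return [
        r for r in rules
        if r.measure == measure
        and r.study_area == study_area
        and percent_change > r.min_percent_change
    ]


rules = [
    Rule(
        measure="admission rate per 1000",
        study_area="Inpatient admissions",
        min_percent_change=0.10,
        recommendation="Utilization review is needed in the area of "
                       "admission certification.",
        expected_savings=0.20,
    )
]

hits = recommend(rules, "admission rate per 1000", "Inpatient admissions", 0.15)
misses = recommend(rules, "admission rate per 1000", "Inpatient admissions", 0.05)
```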
13
Explanation
A measure is explained by finding the path of
related measures with the highest impact.
Example: "The large increase in m1 in group s1 was
caused by an increase in m3, which was caused by a
rise in m5, primarily in sector s13."
14
Report Generation
  • Automatic generation of business-user-oriented
    reports
  • Natural language generation with template
    matching
  • Graphics
  • delivered via browser

16
Sample KEFIR pages
Overview
Inpatient admissions
17
Status
  • Prototype implemented in GTE in 1995
  • KEFIR received GTE's highest award for technical
    achievement in 1995
  • Key business user left GTE in 1996 and the system
    was no longer used
  • Publication
      • "Selecting and Reporting What is Interesting: The
        KEFIR Application to Healthcare Data," C. Matheus,
        G. Piatetsky-Shapiro, and D. McNeill, in Advances
        in Knowledge Discovery and Data Mining, AAAI/MIT
        Press, 1996

18
What's Strange About Recent Events (WSARE)
  • Weng-Keen Wong (Carnegie Mellon University)
  • Andrew Moore (Carnegie Mellon University)
  • Gregory Cooper (University of Pittsburgh)
  • Michael Wagner (University of Pittsburgh)
  • http://www.autonlab.org/wsare

Designed to be easily applicable to any
date/time-indexed biosurveillance-relevant data
stream
19
Motivation
Suppose we have access to Emergency Department
data from hospitals around a city (with patient
confidentiality preserved)
Primary Key  Date    Time   Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more
100          6/1/03  9:12   1         781   Fever        M       20s  NE             ?              ...
101          6/1/03  10:45  1         787   Diarrhea     F       40s  NE             NE             ...
102          6/1/03  11:03  1         786   Respiratory  F       60s  NE             N              ...
103          6/1/03  11:07  2         787   Diarrhea     M       60s  E              ?              ...
104          6/1/03  12:15  1         717   Respiratory  M       60s  E              NE             ...
105          6/1/03  13:01  3         780   Viral        F       50s  ?              NW             ...
106          6/1/03  13:05  3         487   Respiratory  F       40s  SW             SW             ...
107          6/1/03  13:57  2         786   Unmapped     M       50s  SE             SW             ...
108          6/1/03  14:22  1         780   Viral        M       40s  ?              ?              ...

20
Traditional Approaches
  • We need to build a univariate detector to monitor
    each interesting combination of attributes:
      • Diarrhea cases among children
      • Number of cases involving people working in the
        southern part of the city
      • Respiratory syndrome cases among females
      • Number of cases involving teenage girls living
        in the western part of the city
      • Viral syndrome cases involving senior citizens
        from the eastern part of the city
      • Botulinic syndrome cases
      • Number of children from downtown hospital
      • And so on

You'll need hundreds of univariate detectors! We
would like to identify the groups with the
strangest behavior in recent events.
21
WSARE Approach
  • Rule-Based Anomaly Pattern Detection
  • Association rules are used to characterize
    anomalous patterns. For example, a two-component
    rule would be
      • Gender = Male AND 40 ≤ Age < 50
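A two-component rule of this kind can be sketched as a list of conditions that must all hold for a record. The representation below (a list of predicate functions over dictionaries) and the sample records are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical ED records, reduced to the two attributes the rule uses.
records = [
    {"gender": "M", "age": 45},
    {"gender": "F", "age": 42},
    {"gender": "M", "age": 63},
    {"gender": "M", "age": 40},
]

# Two-component rule: Gender = Male AND 40 <= Age < 50.
rule = [
    lambda r: r["gender"] == "M",
    lambda r: 40 <= r["age"] < 50,
]


def matches(rule, record):
    """A record matches a rule when every component holds."""
    return all(component(record) for component in rule)


n_matching = sum(matches(rule, r) for r in records)
print(n_matching, "of", len(records), "records match the rule")
```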

22
WSARE v2.0 Overview
  1. Obtain Recent and Baseline datasets
  2. Search for rule with best score
  3. Determine p-value of best scoring rule through
     randomization test
  4. If p-value is less than threshold, signal alert

[Diagram: All Data is split into Recent Data and Baseline]
23
Step 1: Obtain Recent and Baseline Data
  • Recent Data: data from the last 24 hours
  • Baseline: baseline data is assumed to capture
    non-outbreak behavior. We use data from 35, 42, 49,
    and 56 days prior to the current day
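Step 1 can be sketched with the standard `datetime` module. The function names and the record layout (a dictionary with a `date` field) are hypothetical, and treating "the last 24 hours" as "records dated today" is a simplifying assumption.

```python
from datetime import date, timedelta

BASELINE_OFFSETS = (35, 42, 49, 56)  # days before the current day


def baseline_dates(today):
    """Dates whose records make up the baseline for `today`."""
    return [today - timedelta(days=d) for d in BASELINE_OFFSETS]


def split_recent_baseline(records, today):
    # Assumption: "last 24 hours" is approximated by records dated `today`.
    wanted = set(baseline_dates(today))
    recent = [r for r in records if r["date"] == today]
    baseline = [r for r in records if r["date"] in wanted]
    return recent, baseline


today = date(2003, 6, 1)
days = baseline_dates(today)
print(days)
```

All four offsets are multiples of 7, so every baseline day falls on the same day of the week as the current day.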
24
Example
  • Sat 12-23-2001
  • 35.8% (48/134) of today's cases have 30 < age < 40
  • 17.0% (45/265) of other (baseline) cases have
    30 < age < 40

25
Step 2: Search for Best Rule
  • For each rule, form a 2x2 contingency table, e.g.

                      Count Recent   Count Baseline
    Age Decile = 3         48              45
    Age Decile ≠ 3         86             220

  • Perform Fisher's Exact Test to get a p-value
    (score) for each rule (for this data, 0.00005)
  • Find the rule RBEST with the lowest score.
  • Caution: this score is not the true p-value of
    RBEST because of multiple tests
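The score for the contingency table above can be reproduced with a small standard-library implementation of Fisher's Exact Test. This is a sketch under assumptions: the function name is hypothetical, and the two-sided form (summing all tables no more probable than the observed one) is assumed, since the slide does not say whether a one-sided or two-sided test was used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum over every table that is no more probable than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Rows: Age Decile = 3 / Age Decile != 3; columns: recent / baseline counts.
p = fisher_exact_two_sided(48, 45, 86, 220)
print(p)
```

The printed value can be compared with the 0.00005 quoted on the slide.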
26
Step 3: Randomization Test
  • Take the recent cases and the baseline cases.
    Shuffle the date field to produce a randomized
    dataset called DBRand
  • Find the rule with the best score on DBRand.

Case   Original date    Shuffled date (DBRand)
C2     June 4, 2002     June 4, 2002
C3     June 5, 2002     June 12, 2002
C4     June 12, 2002    July 31, 2002
C5     June 19, 2002    June 26, 2002
C6     June 26, 2002    July 31, 2002
C7     June 26, 2002    June 5, 2002
C8     July 2, 2002     July 2, 2002
C9     July 3, 2002     July 3, 2002
C10    July 10, 2002    July 10, 2002
C11    July 17, 2002    July 17, 2002
C12    July 24, 2002    July 24, 2002
C13    July 30, 2002    July 30, 2002
C14    July 31, 2002    June 19, 2002
C15    July 31, 2002    June 26, 2002

27
Step 3: Randomization Test
Repeat the procedure on the previous slide for
1000 iterations. Determine how many scores from
the 1000 iterations are better than the original
score.
If the original score placed in the top 1% of the
1000 scores from the randomization test, we would
be impressed and an alert should be raised.
Estimated p-value of the rule =
(number of better scores) / (number of iterations)
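The randomization test can be sketched as below. To stay short, the rule score is a stand-in (the negated gap between recent and baseline proportions, lower is better) rather than Fisher's Exact Test, and only single-value rules over one attribute are searched; all names and the toy data are hypothetical.

```python
import random


def proportion_gap(dates, cases, today, value):
    # Stand-in score, NOT Fisher's Exact Test: negated absolute gap between
    # the recent and baseline proportions of `value`. Lower is better.
    recent = [c for d, c in zip(dates, cases) if d == today]
    baseline = [c for d, c in zip(dates, cases) if d != today]
    p_recent = sum(c == value for c in recent) / len(recent)
    p_baseline = sum(c == value for c in baseline) / len(baseline)
    return -abs(p_recent - p_baseline)


def best_score(dates, cases, today, score_fn):
    """Best (lowest) score over all single-value rules."""
    return min(score_fn(dates, cases, today, v) for v in set(cases))


def randomization_p_value(dates, cases, today, score_fn, iterations=1000, seed=0):
    rng = random.Random(seed)
    original = best_score(dates, cases, today, score_fn)
    shuffled = list(dates)
    better = 0
    for _ in range(iterations):
        rng.shuffle(shuffled)  # DBRand: same cases, date field permuted
        if best_score(shuffled, cases, today, score_fn) < original:
            better += 1
    return better / iterations


# Toy data: every recent case is "fever"; fever is rare in the baseline.
dates = ["today"] * 20 + ["baseline"] * 80
cases = ["fever"] * 20 + ["fever"] * 8 + ["other"] * 72
p = randomization_p_value(dates, cases, "today", proportion_gap)
print(p)
```

Because the best rule is searched again on every shuffled dataset, the estimate accounts for the multiple tests that make the raw score unusable as a p-value.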
28
Results on Actual ED Data from 2001
  1. Sat 2001-02-13  SCORE = -0.00000004  PVALUE = 0.00000000
      • 14.80% ( 74/500) of today's cases have Viral
        Syndrome = True and Encephalitic Prodrome = False
      • 7.42% (742/10000) of baseline have Viral
        Syndrome = True and Encephalitic Syndrome = False
  2. Sat 2001-03-13  SCORE = -0.00000464  PVALUE = 0.00000000
      • 12.42% ( 58/467) of today's cases have
        Respiratory Syndrome = True
      • 6.53% (653/10000) of baseline have
        Respiratory Syndrome = True
  3. Wed 2001-06-30  SCORE = -0.00000013  PVALUE = 0.00000000
      • 1.44% ( 9/625) of today's cases have
        100 < Age < 110
      • 0.08% ( 8/10000) of baseline have
        100 < Age < 110
  4. Sun 2001-08-08  SCORE = -0.00000007  PVALUE = 0.00000000
      • 83.80% (481/574) of today's cases have
        Unknown Syndrome = False
      • 74.29% (7430/10001) of baseline have
        Unknown Syndrome = False
  5. Thu 2001-12-02  SCORE = -0.00000087  PVALUE = 0.00000000
      • 14.71% ( 70/476) of today's cases have Viral
        Syndrome = True and Encephalitic Syndrome = False
      • 7.89% (789/9999) of baseline have Viral
        Syndrome = True and Encephalitic Syndrome = False

29
WSARE 3.0: Improving the Baseline
Recall that the baseline was assumed to be
captured by data from 35, 42, 49, and 56
days prior to the current day.
What if this assumption isn't true? What if data
from 7, 14, 21, and 28 days prior is better?
We would like to determine the baseline
automatically!
30
Temporal Trends
From: Goldenberg, A., Shmueli, G., Caruana, R.
A., and Fienberg, S. E. (2002). Early
statistical detection of anthrax outbreaks by
tracking over-the-counter medication sales.
Proceedings of the National Academy of Sciences
(pp. 5237-5249).
31
WSARE v3.0
  • Generate the baseline
      • Taking into account recent flu levels
      • Taking into account that today is a public
        holiday
      • Taking into account that this is Spring
      • Taking into account a recent heatwave
      • Taking into account that there's a known natural
        food-borne outbreak in progress

Bonus: more efficient use of historical data
32
Idea: Bayesian Networks
Bayesian Network: a graphical model representing
the joint probability distribution of a set of
random variables
  • "On cold Tuesday mornings, the folks coming in
    from the north part of the city are more likely
    to have respiratory problems"
  • "Patients from West Park Hospital are less likely
    to be young"
  • "On the day after a major holiday, expect a boost
    in the morning followed by a lull in the
    afternoon"
  • "The Viral prodrome is more likely to co-occur
    with a Rash prodrome than Botulinic"
33
Obtaining Baseline Data
  1. Learn Bayesian Network (from all historical data)
  2. Generate baseline given today's environment

Baseline: what should be happening today given
today's environment
34
Simulation
[Figure: network of simulation variables. Nodes: DAY OF WEEK,
FLU LEVEL, SEASON, WEATHER, Region Anthrax Concentration,
Has Anthrax, AGE, Outside Activity, Immune System, GENDER,
Region Grassiness, Has Flu, Has Sunburn, Heart Health, DATE,
Region Food Condition, Has Cold, Has Allergy, REGION,
Has Heart Attack, Has Food Poisoning, Disease, Actual Symptom,
REPORTED SYMPTOM, ACTION, DRUG]
Actions: None, Purchase Medication, ED visit,
Absent. If Action is not None, output record to
dataset.
35
Simulation
  • 100 different data sets
  • Each data set consisted of a two-year period
  • Anthrax release occurred at a random point during
    the second year
  • Algorithms allowed to train on data from the
    current day back to the first day in the
    simulation
  • Any alerts before the actual anthrax release are
    considered false positives
  • Detection time calculated as first alert after
    anthrax release. If no alerts are raised, cap
    detection time at 14 days
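The scoring of one simulated data set can be sketched as follows. The function name and the integer day indices are hypothetical. Two choices are assumptions not stated on the slide: an alert on the release day itself counts as a detection with time 0, and an alert arriving more than 14 days after the release is also capped at 14.

```python
def evaluate_alerts(alert_days, release_day, cap=14):
    """Return (false_positives, detection_time_in_days) for one data set."""
    # Alerts raised before the release are false positives.
    false_positives = sum(1 for d in alert_days if d < release_day)
    # Assumption: an alert on the release day itself counts as detection.
    after_release = [d for d in alert_days if d >= release_day]
    if not after_release:
        return false_positives, cap  # no alert raised: capped detection time
    # Assumption: late alerts are capped at `cap` days as well.
    return false_positives, min(min(after_release) - release_day, cap)


# Hypothetical run: release on day 500 of a two-year (730-day) simulation.
result = evaluate_alerts([100, 400, 503, 510], release_day=500)
print(result)  # (2, 3)
```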

36
Simulation Plot
Anthrax release (not highest peak)
37
Results on Simulation
38
Summary
  • Summarization of what is new and interesting
  • Key ideas
      • search many possible findings
      • compare to past data and expected data
      • avoid overfitting
      • focus on actionable changes
  • Example systems
      • KEFIR (GTE, 1992-1995)
      • WSARE (CMU/Pitt, 2002-2003)