Title: Intelligent Choices Preceding Data Analysis
1Intelligent Choices Preceding Data Analysis
- Katharina Morik
- Univ. Dortmund, www-ai.cs.uni-dortmund.de
- Knowledge Discovery in Databases (KDD)
- The Mining Mart approach
- Case studies
- Item sales
- Intensive care
2The UCI Library Approach
- Learning task: classification
- Evaluation criteria: accuracy and coverage
- Data sets
- Small number of examples
- Small number of features
- All and only relevant features included
- No noise
3KDD Task
- Learning task of the application needs to be transferred to a formal learning task (classification, regression, clustering)
- I want to predict sales 4 weeks ahead
- I want to know more about my best (worst) customers
- I want to detect fraud
- Databases
- Very large number of records
- Very large number of features
- Relevant features missing
- Noise included
4Observation
- Experienced users can apply any learning system successfully to any application, since they prepare the data well...
- The representation LE of examples and the choice of a sample determine the applicability of learning methods.
- A chain of data transformations (learning steps or manual preprocessing) leads to the LE of the method that delivers the desired result.
- Experienced users remember prototypical successful transformation/learning chains.
5The Real Process
[Diagram: application data; users; performance system; chain LE1, LH1, LE2, LH2, ..., LEn-1, LHn-1, LEn; learning/data mining yields LHn; further chain LHnm, LEnm, ..., LHn1, LEn1.]
6Intelligent Choices
- 80% of the KDD work is invested in:
- Choosing the learning task
- Sampling
- Feature generation, extraction, and selection
- Data cleaning
- Model selection or tuning the hypothesis space
- Defining appropriate evaluation criteria
7The Mining Mart Approach
- Best practice cases of preprocessing chains exist...
- Data, LE and LH are described on the meta level.
- The meta-level description is presented in application terms.
- MiningMart users choose a case and apply the corresponding transformation and learning chain to their application.
- ... and more can be obtained!
8Call for Participation
- MiningMart develops an operational meta-language for describing data and operators.
- MiningMart prepares the first cases of KDD.
- MiningMart will present the case-base on the WWW.
- You may contribute to the endeavour!
- Apply the meta-language to your application and deliver it as a positive example to the case-base, or
- apply a case of MiningMart to your data.
9The Consortium
- Katharina Morik, Univ. Dortmund, D (Coordinator)
- Lorenza Saitta, Univ. Piemonte del Avogadro, I
- Pieter Adriaans, Perot Systems Netherland, NL
- Michael May, GMD, D
- Jörg-Uwe Kietz, SwissLife, CH
- Fabio Malabocchia, TILab, I
10The Mining Mart System
[Diagram: Human Computer Interface; KDD process tasks and problem models; case-base of successful KDD processes; meta-data (applicability); manual pre-processing operators (time, multi-relation); ML operators (time, parameters, features); description logic; raw data; data augmented by results.]
11The Meta Model for Meta Data
- The Relational Model describes the database.
- The Execution Model generates SQL statements or calls to external tools.
- The Conceptual Model describes the individuals and classes of the domain with their relations.
- The Case Model describes chains of preprocessing operators.
12Use of the Meta Model
- The meta model is stored in a database.
- The database manager delivers the relational model.
- The data analyst delivers the conceptual model.
- The KDD expert delivers or adjusts a case model. First cases are delivered by the Mining Mart project.
- The system compiles meta data into SQL statements and calls to external tools, hence executing the case model on the data.
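As a rough illustration of compiling meta data into SQL, the sketch below turns a chain of operator descriptions into one view per step. The step format (`columns`, `where`), the function name `compile_case`, and the view names are assumptions made for this example; they are not the M4 meta model or the actual compiler.

```python
def compile_case(source_table, steps):
    """Turn a chain of (hypothetical) operator descriptions into SQL views.

    Each step reads from the output of the previous step, so the list of
    statements mirrors the chain of the case model.
    """
    statements = []
    previous = source_table
    for index, step in enumerate(steps, start=1):
        view = f"V_{index:02d}"
        columns = ", ".join(step["columns"])
        sql = f"CREATE VIEW {view} AS SELECT {columns} FROM {previous}"
        if step.get("where"):
            sql += f" WHERE {step['where']}"
        statements.append(sql)
        previous = view  # the next operator reads this view
    return statements


# Hypothetical two-step chain: select one shop, then project one item.
statements = compile_case(
    "Source",
    [
        {"columns": ["shop", "week", "item_1"], "where": "shop = 'dm_1'"},
        {"columns": ["week", "item_1"]},
    ],
)
for statement in statements:
    print(statement)
```

Keeping the chain as data rather than as hand-written SQL is what would allow a stored case to be re-applied to another table.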
13Sales of Items of a Drugstore
[Figure: weekly sales (axis from 0 to 160) plotted by week, from week 01/96 to week 53/98.]
14Learning Task 1: Predict Sales of an Item
- Given drug store sales data of 50 items in 20 shops over 104 weeks,
- predict the sales of an item such that
- the prediction never underestimates the sale,
- the prediction overestimates less than the rule of thumb.
- Observation: 90% of the items are sold less than 10 times a week.
- Requirement: the prediction horizon is more than 4 weeks ahead.
15Shop Application -- Data
LE(DB1): I: T1 A1 ... A50 (a set of multivariate time series)
16Preprocessing
- From shops to items: multivariate to univariate
- LE1: i: t1 a1 ... tk ak
- For all shops, for all items:
- Create view Univariate as
- Select shop, week, item_i
- From Source
- Where shop = dm_j
- Multiple learning
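The step from multivariate to univariate series can be sketched with one view per shop and item. This is a minimal illustration using SQLite; the column names (`item_1`, `item_2`), the shop labels (`dm_1`, `dm_2`), and the view naming scheme are assumptions, not the original schema.

```python
import sqlite3


def create_univariate_views(conn, shops, items):
    """Create one view per (shop, item) pair over the table Source."""
    names = []
    for shop in shops:
        for item in items:
            # Names are embedded in the SQL text, so only accept simple ones.
            for name in (shop, item):
                if not name.replace("_", "").isalnum():
                    raise ValueError(f"unsafe name: {name!r}")
            view = f"univariate_{shop}_{item}"
            conn.execute(
                f'CREATE VIEW "{view}" AS '
                f'SELECT shop, week, "{item}" AS sales FROM Source '
                f"WHERE shop = '{shop}'"
            )
            names.append(view)
    return names


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Source (shop TEXT, week INTEGER, item_1 INTEGER, item_2 INTEGER)"
)
conn.executemany(
    "INSERT INTO Source VALUES (?, ?, ?, ?)",
    [("dm_1", 1, 3, 0), ("dm_1", 2, 5, 1), ("dm_2", 1, 7, 2)],
)
views = create_univariate_views(conn, ["dm_1", "dm_2"], ["item_1", "item_2"])
rows = conn.execute(
    'SELECT week, sales FROM "univariate_dm_1_item_1" ORDER BY week'
).fetchall()
print(rows)
```

Each view is then the input of one learning run, which is what "multiple learning" refers to.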
17Method 1 for Task 1: Exponential Smoothing
- Univariate time series as input (LE1),
- incremental method: current hypothesis h and new observation o yield the next hypothesis by h := (1 - λ) h + λ o, where λ is given by the user,
- predicts the sales of the n-th next week by the last h.
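A minimal sketch of simple exponential smoothing, assuming the common update h := (1 - λ) h + λ o and an initial hypothesis equal to the first observation (both are assumptions; the slide does not state the initialization). The function names and the sample numbers are made up for illustration.

```python
def exponential_smoothing(observations, lam):
    """Return the last hypothesis h after processing all observations."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if not observations:
        raise ValueError("need at least one observation")
    h = float(observations[0])  # assumed initialization
    for o in observations[1:]:
        h = (1.0 - lam) * h + lam * o  # incremental update
    return h


def predict(observations, lam, horizon):
    """Flat forecast: every future week is predicted by the last h."""
    h = exponential_smoothing(observations, lam)
    return [h] * horizon


weekly_sales = [4, 6, 5, 9]
forecast = predict(weekly_sales, lam=0.5, horizon=4)
print(forecast)  # [7.0, 7.0, 7.0, 7.0]
```

Because the forecast is flat, the quality at a long horizon depends entirely on how well the last h summarizes the series.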
18Method 2 for Task 1: SVM in the Regression Mode
- Multiple learning: for each shop and each item, the support vector machine learns a function which is then used for prediction.
- Asymmetric loss:
- underestimation is multiplied by 20, i.e. 3 sales too few predicted -- 60 loss
- overestimation is counted as it is, i.e. 3 sales too many predicted -- 3 loss
- (Stefan Rüping 1999)
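The asymmetric loss can be written down directly from the two examples above. The sketch only evaluates predictions; it does not show how the loss is built into SVM training, and the function name is an assumption.

```python
def asymmetric_loss(actual, predicted, under_factor=20.0):
    """Penalize underestimation under_factor times as much as overestimation."""
    error = actual - predicted
    if error > 0:  # too few sales predicted
        return under_factor * error
    return -error  # too many sales predicted, counted as is


print(asymmetric_loss(10, 7))   # 3 too few  -> 60.0
print(asymmetric_loss(10, 13))  # 3 too many -> 3
```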
19Further Preprocessing
- Obtaining many vectors from one series by sliding windows
- LH5: i: t1 a1 ... tw aw; move a window of size w by m steps
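A short sketch of the sliding-window step, assuming the series is a list of (week, sales) pairs. The function name and the sample data are illustrative only.

```python
def sliding_windows(series, w, m=1):
    """Cut a series into windows of size w, moving the window by m steps."""
    if w <= 0 or m <= 0:
        raise ValueError("w and m must be positive")
    return [series[start:start + w] for start in range(0, len(series) - w + 1, m)]


series = [(1, 5), (2, 7), (3, 6), (4, 9), (5, 8)]
print(sliding_windows(series, w=3, m=1))  # three windows
print(sliding_windows(series, w=3, m=2))  # two windows
```

Each window becomes one example vector, so a single series of length k yields roughly (k - w) / m + 1 examples.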
20Article 766933 (bag?)
[Figure: sales over time.]
21Comparison with Exponential Smoothing
22[Figure: loss plotted against prediction horizon.]
23Learning Task 2: Learning Sequences
- Are there typical sequences that are valid for all items?
- After an action for an item its sales decrease.
- Each decrease of sales is followed by an increase.
- Given a set of subsequent events, find frequent sequences.
24From Sales Data to Event Sequences
[Diagram: multivariate time series; univariate time series; ?; subsequent events; ?; LHn-1, LEn; LHn: frequent event sequences; ?]
25From Series to Sequences
- Given some time series, detect events (states, intervals).
- An event is a triple (state, begin, finish).
- The state might be a label or a (mean) value.
- Typical labels are increase, decrease, stable...
26Unsupervised Methods
- All contiguous observations within one level (range) form one event (Bauer).
- All contiguous observations with more or less the same gradient form one event (Morik, Wessel).
- Clusters of subsequences form events (Das).
27Moving Gradient
Determining the time intervals with a user-given tolerance threshold. Abstracting into classes of gradients: increase, peak, decrease, stable...
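The idea of grouping contiguous observations with a similar gradient can be sketched as follows. This is a simplified illustration, not the published moving-gradient method: it labels each step by its difference, treats differences within the tolerance as stable, and merges neighbouring steps with the same label. The function names and sample values are assumptions.

```python
def label(delta, tolerance):
    """Abstract a single step into a gradient class."""
    if delta > tolerance:
        return "increasing"
    if delta < -tolerance:
        return "decreasing"
    return "stable"


def gradient_events(values, tolerance):
    """Return events (state, begin, end, change); time points start at 1."""
    events = []
    for t in range(1, len(values)):
        delta = values[t] - values[t - 1]
        state = label(delta, tolerance)
        if events and events[-1][0] == state:
            # Extend the current event instead of opening a new one.
            prev_state, begin, _, change = events[-1]
            events[-1] = (prev_state, begin, t + 1, change + delta)
        else:
            events.append((state, t, t + 1, delta))
    return events


values = [5, 5, 6, 5, 12, 4, 4]
tolerant = gradient_events(values, tolerance=2)
intolerant = gradient_events(values, tolerance=0)
print(tolerant)
print(len(intolerant))
```

With a tolerance of 2 the small wiggles at the start collapse into one stable event; with a tolerance of 0 every change opens a new, short interval.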
28Sales of Item 182830 in Shop 55
29Summarizing Sales by Tolerant Moving Gradient
(Wessel, Morik 1999)
30From Subsequent Events to Event Sequences
[Diagram: multivariate time series; univariate time series; moving gradient; subsequent events; ?; LHn-1, LEn; LHn: frequent event sequences; ?]
31Transformation into Facts
LE4:
stable(182830, 1, 33, 0).
decreasing(182830, 33, 34, -6).
stable(182830, 34, 39, 0).
increasing(182830, 39, 40, 7).
decreasing(182830, 40, 42, -5).
stable(182830, 42, 108, 0).
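Writing events out as facts is a plain formatting step. The sketch assumes each event is a tuple (state, begin, end, change), which is a representation chosen for this example.

```python
def to_fact(item, event):
    """Format one event as a fact: state(item, begin, end, change)."""
    state, begin, end, change = event
    return f"{state}({item}, {begin}, {end}, {change})."


events = [("stable", 1, 33, 0), ("decreasing", 33, 34, -6)]
facts = [to_fact(182830, event) for event in events]
print("\n".join(facts))
```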
32Summarizing Item 646152 in Shop 55 by Intolerant Moving Gradient
33Corresponding Facts
increasing(646152, 1, 2, 3).
decreasing(646152, 2, 3, -11).
increasingPeak(646152, 3, 4, 22).
...
stable(646152, 25, 37, 0).
increasing(646152, 37, 38, 8).
decreasing(646152, 38, 39, -7).
stable(646152, 39, 40, 0).
increasing(646152, 40, 41, 7).
decreasing(646152, 41, 42, -8).
increasing(646152, 42, 43, 10).
stable(646152, 43, 48, -1).
small time intervals
34Method 3 for Task 2: Inductive Logic Programming
- Rules about sequences: p1(I, Tb, Te, Ar), p2(I, Te, Te2, As) → p3(I, Te2, Te3, At)
- Results for sequences of sales trends:
- increasing(Item, Tb, Te) → decreasing(Item, Te, Te2)
- increasing(Item, Tb, Te), decreasing(Item, Te, Te2) → stable(Item, Te2, Te3)
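How well a rule of the form "body is followed by head" holds on a set of facts can be checked by counting. The sketch below only evaluates a given rule; it does not learn rules and is not an ILP system. The tuple layout and function name are assumptions, and the sample events are the tail of the item 646152 facts.

```python
def rule_coverage(events, body, head):
    """Count body events and how many are directly followed by a head event.

    events: tuples (state, item, begin, end). A head event follows a body
    event if it concerns the same item and begins where the body event ends.
    """
    starts = {}
    for state, item, begin, _ in events:
        starts.setdefault((item, begin), set()).add(state)
    covered = correct = 0
    for state, item, _, end in events:
        if state != body:
            continue
        covered += 1
        if head in starts.get((item, end), set()):
            correct += 1
    return covered, correct


events = [
    ("stable", 646152, 25, 37),
    ("increasing", 646152, 37, 38),
    ("decreasing", 646152, 38, 39),
    ("stable", 646152, 39, 40),
    ("increasing", 646152, 40, 41),
    ("decreasing", 646152, 41, 42),
    ("increasing", 646152, 42, 43),
    ("stable", 646152, 43, 48),
]
result = rule_coverage(events, "increasing", "decreasing")
print(result)  # (3, 2): three increases, two followed by a decrease
```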
35Same Data -- Several Cases
- Predict sales of a particular item in a particular shop:
- multivariate to univariate, multiple exponential smoothing, OR
- multivariate to univariate, sliding windows, multiple learning with regression SVM
- Find relations between trends that are valid for all sales in all shops:
- multivariate to univariate, summarizing, transformation into facts, rule learning
36Applications in Intensive Care
- On-line monitoring of intensive care patients
- high-dimensional data about patient and medication
- measured every minute
- stored in the Emtec database of patient records
- learning when to intervene in which way.
37Patient G.C., male, 60 years old
Right hemihepatectomy
38The Data
- LE(DB2):
- i1: t1 a11 ... a1k
- i1: t2 a21 ... a2k
- ...
- i2: t1 a11 ... a1k
- ...
A set of rows for each patient, 1 row for each minute
39Preprocessing
- Chaining database rows: i1: t1 a11 ... a1k, t2 a21 ... a2k, ...
- Multivariate to univariate: i1: t1 a1, t2 a1 ... tm a1; i1: t1 a2, t2 a2 ... tm a2; ...
- Detecting level changes and outliers
40Phase State Analysis
[Figure: time series (y_t over time t) and phase space plots (y_t against y_{t+1}) for a deterministic process, an AR(1) process with outlier (AO), and heart rate (HR_t). Credit: U. Gather, M. Bauer.]
41Level Change Detection
- level_change(pat4999, 50, 112, hr, up)
- level_change(pat4999, 112, 164, hr, down)
- level_change(pat4999, 10, 74, art, constant)
- level_change(pat4999, 74, 110, art, down)
- Computed feature:
- Comparing norm values for a vital sign and its mean in a time interval (± standard deviation)
- deviation(pat4999, 10, 74, art, up)
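One possible reading of the computed feature is sketched below: the mean of a vital sign over an interval, widened by one standard deviation, is compared with a norm range. The decision rule, the norm range of 60 to 100, and the sample values are assumptions for illustration and carry no clinical meaning.

```python
from statistics import mean, pstdev


def deviation(values, norm_low, norm_high):
    """Return 'up', 'down', or None for one vital sign in one interval."""
    m = mean(values)
    s = pstdev(values)
    if m - s > norm_high:  # whole band lies above the norm range
        return "up"
    if m + s < norm_low:   # whole band lies below the norm range
        return "down"
    return None


print(deviation([110, 112, 115, 113], norm_low=60, norm_high=100))  # up
print(deviation([80, 82, 81], norm_low=60, norm_high=100))          # None
```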
42Learning Task 3: Recommend Interventions for Patients
- Are there valid rules
- for all multivariate time series,
- such that therapeutic interventions follow from a patient's state?
43Method 3: Inductive Logic Programming
- Given patient records in the form of facts:
- deviations -- time intervals
- therapeutic interventions -- time points
- types of vital signs (group1: hr, swi, co; group2: art, vr)
- Learn rules about interventions:
- group1(V), deviation(P, T1, T2, V, Dir) → noradrenaline(P, T2, Dir)
44The Chain of Preprocessing Steps
45Learning Task 4: Predict Next Minute's Intervention
- Given a patient's state at time t_i,
- learn whether and how to intervene at t_{i+1}.
- Preprocessing:
- Selection of time points where an intervention was done
- Multiple to binary class: for each drug, form the concepts drug_up, drug_down
- Multiple learning for each binary class, resulting in classifiers for each drug and direction of dose change (SVM_light)
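The "multiple to binary class" step can be sketched as relabelling the same examples once per concept. The record layout (state vector, drug, direction), the +1/-1 labels, and the drug names in the sample are assumptions for illustration; training the classifiers themselves is not shown.

```python
def to_binary_tasks(records, drugs):
    """Build one binary data set per concept drug_up / drug_down.

    records: tuples (state_vector, drug, direction) taken at time points
    where an intervention was done.
    """
    tasks = {}
    for drug in drugs:
        for direction in ("up", "down"):
            concept = f"{drug}_{direction}"
            tasks[concept] = [
                (state, 1 if (d == drug and r == direction) else -1)
                for state, d, r in records
            ]
    return tasks


records = [
    ([82.0, 110.0], "noradrenaline", "up"),
    ([95.0, 70.0], "dobutamine", "down"),
]
tasks = to_binary_tasks(records, ["noradrenaline", "dobutamine"])
print(sorted(tasks))
print(tasks["noradrenaline_up"])
```

Each entry of `tasks` is then a separate learning problem, giving one classifier per drug and direction.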
46The Chain of Preprocessing Steps
47Same Data -- Several Cases
- Find time relations that express therapy protocols:
- chaining db rows, multivariate to univariate, level changes, deviations, RDT
- Predict intervention for a particular drug:
- select time points, multiple to binary class, SVM_light
48Behind the Boxes
49Functionality of MD-Compiler
Manual preprocessing operators of M4 are very
elementary. Results of operators are mostly
views.
50Several view definitions
Inline View versus Physical View-Object
Physical View-Object created by MD-Compiler for
reading data and executing statistics
51Several view definitions
Inline View versus Physical View-Object
Create View V_05 (...) as select ... from (select ... from (select ... from Table_x))
instead of
Create View V_05 (...) as select ... from V_04
52Several view definitions
Materialized view
Created by the MD-Compiler automatically in the background
- performance gain when selecting data from V_04 or V_07
- all operation outputs can be realized as views
- additional storage needed
53System-Architecture
[Diagram: M4 Relation Editor, M4 Concept Editor, and M4 Case Editor; M4 Relational Model and M4 Conceptual Model; statistics; tables T1 to T6; operators (time) as PL/SQL and Java code from UniDo; Mining Mart Database.]
54Summary of Cases Involving Time
55Summary
- Preprocessing is the key issue in data analysis!
- Goal: support users in making intelligent choices
- Approach: cases of best practice
- View of a computer scientist
- Scalability to very large databases
- Meta-data driven processing
- Case studies on analysing data involving time
56MiningMart Approach
- Manager -- end-user: knows about the business case
- Database manager: knows about the data
- Case designer -- power-user: expert in KDD
- Developer: supplies (learning) operators