Title: Intelligent Choices Preceding Data Analysis


1
Intelligent Choices Preceding Data Analysis
  • Katharina Morik
  • Univ. Dortmund, www-ai.cs.uni-dortmund.de
  • Knowledge Discovery in Databases (KDD)
  • The Mining Mart approach
  • Case studies
  • Item sales
  • Intensive care

2
The UCI Library Approach
  • Learning task: classification
  • Evaluation criteria: accuracy and coverage
  • Data sets:
  • Small number of examples
  • Small number of features
  • All and only relevant features included
  • No noise

3
KDD Task
  • Learning task of the application needs to be
    transferred to a formal learning task
    (classification, regression, clustering)
  • I want to predict sales 4 weeks ahead
  • I want to know more about my best (worst)
    customers
  • I want to detect fraud
  • Databases
  • Very large number of records
  • Very large number of features
  • Relevant features missing
  • Noise included

4
Observation
  • Experienced users can apply any learning system
    successfully to any application, since they
    prepare the data well...
  • The representation LE of examples and the choice
    of a sample determine the applicability of
    learning methods.
  • A chain of data transformations (learning steps
    or manual preprocessing) leads to LE of the
    method that delivers the desired result.
  • Experienced users remember prototypical
    successful transformation/learning chains

5
The Real Process
[Diagram labels: application data; users; performance system; chain of representations LE1, LH1, LE2, LH2, ..., LEn-1, LHn-1, LEn; learning/data mining yields LHn; further chain LHn+1, LEn+1, ..., LHn+m, LEn+m]
6
Intelligent Choices
  • 80% of the KDD work is invested in
  • Choosing the learning task
  • Sampling
  • Feature generation, extraction, and selection
  • Data cleaning
  • Model selection or tuning the hypothesis space
  • Defining appropriate evaluation criteria

7
The Mining Mart Approach
  • Best practice cases of preprocessing chains
    exist...
  • Data, LE and LH are described on the meta level.
  • The meta-level description is presented in
    application terms.
  • MiningMart users choose a case and apply the
    corresponding transformation and learning chain
    to their application.
  • ... and more can be obtained!

8
Call for Participation
  • MiningMart develops an operational meta-language
    for describing data and operators.
  • MiningMart prepares the first cases of KDD.
  • MiningMart will present the case-base in the WWW.
  • You may contribute to the endeavour!
  • Apply the meta-language to your application and
    deliver it as a positive example to the
    case-base or
  • apply a case of MiningMart to your data.

9
The Consortium
  • Katharina Morik, Univ. Dortmund, D (Coordinator)
  • Lorenza Saitta, Univ. Piemonte del Avogadro, I
  • Pieter Adriaans, Perot Systems Netherland, NL
  • Michael May, GMD, D
  • Jörg-Uwe Kietz, SwissLife, CH
  • Fabio Malabocchia, TILab, I

10
The Mining Mart System
[Diagram labels: Human Computer Interface; KDD process tasks, problem models; case-base of successful KDD processes; meta-data (applicability); manual pre-processing operators (time, multi-relation); ML operators (time, parameters, features); description logic; raw data; augmented data of results]
11
The Meta Model for Meta Data
  • The Relational Model describes the database.
  • The Execution Model generates SQL statements or
    calls to external tools.
  • The Conceptual Model describes the individuals
    and classes of the domain with their relations.
  • The Case Model describes chains of preprocessing
    operators.
12
Use of the Meta Model
  • The meta model is stored in a database.
  • The database manager delivers the relational
    model.
  • The data analyst delivers the conceptual model.
  • The KDD expert delivers or adjusts a case
    model. First cases are delivered by the Mining
    Mart project.
  • The system compiles meta data into SQL statements
    and calls to external tools, thereby executing
    the case model on the data.

13
Sales of Items of a Drugstore
[Figure: weekly sales (y-axis "Sales", 0 to 160) by week (x-axis "Week", from week 01/96 to week 53/98)]
14
Learning Task 1: Predict Sales of an Item
  • Given: drug store sales data of 50 items in 20
    shops over 104 weeks,
  • predict the sales of an item such that
  • the prediction never underestimates the sale,
  • the prediction overestimates less than the rule
    of thumb.
  • Observation: 90% of the items are sold less than
    10 times a week.
  • Requirement: prediction horizon is more than 4
    weeks ahead.

15
Shop Application -- Data
LE DB1: I: T1 A1 ... A50 (a set of multivariate
time series)
16
Preprocessing
  • From shops to items: multivariate to univariate
  • LE1: i: t1 a1 ... tk ak
  • For all shops, for all items:
  • Create view Univariate as
  • Select shop, week, item_i
  • From Source
  • Where shop = dm_j
  • Multiple learning
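
The view definition on this slide can be tried out with a minimal sketch in Python and SQLite. The table layout of Source, the view naming scheme, the reading of the garbled condition as "shop = dm_j", and the toy rows are all assumptions made for illustration; they are not taken from the Mining Mart system.

```python
import sqlite3

def create_univariate_views(conn, shops, items):
    """Create one view per (shop, item) pair and return the view names."""
    names = []
    for shop in shops:
        for item in items:
            # Identifiers cannot be bound as SQL parameters, so they are
            # validated before being formatted into the statement.
            if not (isinstance(shop, int) and item.isidentifier()):
                raise ValueError("unexpected shop or item name")
            name = f"Univariate_{shop}_{item}"  # hypothetical naming scheme
            conn.execute(
                f"CREATE VIEW {name} AS "
                f"SELECT shop, week, {item} FROM Source WHERE shop = {shop}"
            )
            names.append(name)
    return names

# Toy multivariate table: one row per shop and week, one column per item.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Source (shop INTEGER, week INTEGER, item1 INTEGER, item2 INTEGER)"
)
conn.executemany(
    "INSERT INTO Source VALUES (?, ?, ?, ?)",
    [(55, 1, 4, 0), (55, 2, 6, 1), (56, 1, 2, 3), (56, 2, 1, 5)],
)
views = create_univariate_views(conn, shops=[55, 56], items=["item1", "item2"])

# Each view is one univariate time series, ready for "multiple learning".
series = conn.execute(
    "SELECT week, item1 FROM Univariate_55_item1 ORDER BY week"
).fetchall()
```

Each view yields one univariate series, so a learner can be run once per shop and item.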

17
Method 1 for Task 1: Exponential Smoothing
  • Univariate time series as input (LE1),
  • incremental method: current hypothesis h and new
    observation o yield the next hypothesis by
    h := h + λ (o − h), where λ is given by the user,
  • predicts the sales of the n-th next week by the
    last h.
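
A minimal sketch of the incremental method, assuming the standard exponential smoothing update h := h + λ (o − h) and assuming that the first observation initializes h when no start value is given. The function name and parameters are illustrative.

```python
def exponential_smoothing(observations, lam, h0=None):
    """Return the last hypothesis h after processing all observations.

    lam is the user-given smoothing factor in [0, 1]. The forecast for
    any number of weeks ahead is simply this last h.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    it = iter(observations)
    if h0 is None:
        try:
            h = next(it)  # assumption: start from the first observation
        except StopIteration:
            raise ValueError("need at least one observation") from None
    else:
        h = h0
    for o in it:
        h = h + lam * (o - h)  # move h towards the new observation
    return h
```

For example, exponential_smoothing([2, 4, 8], 0.5) moves h from 2 to 3 and then to 5.5.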

18
Method 2 for Task 1: SVM in the Regression Mode
  • Multiple learning: for each shop and each item,
    the support vector machine learns a function
    which is then used for prediction.
  • Asymmetric loss:
  • underestimation is multiplied by 20, i.e. 3 sales
    too few predicted -- loss of 60
  • overestimation is counted as it is, i.e. 3 sales
    too many predicted -- loss of 3
  • (Stefan Rüping 1999)
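
The asymmetric loss can be written down as a small evaluation function. This is only a sketch of the loss as stated on the slide, not of how it is built into the SVM; the function names are illustrative.

```python
def asymmetric_loss(actual, predicted, under_factor=20):
    """Loss for one prediction: underestimation costs under_factor times
    the error, overestimation costs the error itself."""
    diff = predicted - actual
    if diff < 0:
        return under_factor * -diff
    return diff

def total_loss(actuals, predictions, under_factor=20):
    """Sum of the asymmetric losses over paired actual and predicted sales."""
    return sum(
        asymmetric_loss(a, p, under_factor) for a, p in zip(actuals, predictions)
    )
```

With actual sales of 10, predicting 7 costs 60 and predicting 13 costs 3, matching the two cases above.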

19
Further Preprocessing
  • Obtaining many vectors from one series by sliding
    windows
  • LH5: i: t1 a1 ... tw aw; move window of size w by
    m steps
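
A minimal sketch of the sliding-window step, assuming the series is a Python list and that windows which would run past the end of the series are dropped. The function name is illustrative.

```python
def sliding_windows(series, w, m=1):
    """Cut one series into vectors of length w, moving the window by m steps."""
    if w < 1 or m < 1:
        raise ValueError("w and m must be positive")
    return [series[start:start + w] for start in range(0, len(series) - w + 1, m)]
```

For a series of length 5, w = 3 and m = 2 give two vectors: positions 1 to 3 and positions 3 to 5.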

20
Article 766933 (bag?)
[Figure: sales (y-axis) over time (x-axis)]
21
Comparison with Exponential Smoothing
22
[Figure: loss (y-axis) over prediction horizon (x-axis)]
23
Learning Task 2: Learning Sequences
  • Are there typical sequences that are valid for
    all items?
  • After an action for an item its sales decrease.
  • Each decrease of sales is followed by an
    increase.
  • Given a set of subsequent events, find frequent
    sequences.

24
From Sales Data to Event Sequences
[Diagram: chain from multivariate time series via univariate time series and subsequent events (LHn-1, LEn) to frequent event sequences (LHn); the steps marked "?" are still open]
25
From Series to Sequences
  • Given some time series detect events (states,
    intervals)
  • An event is a triple (state, begin, finish).
  • The state might be a label or a (mean) value.
  • Typical labels are increase, decrease, stable...

26
Unsupervised Methods
  • All contiguous observations within one level
    (range) form one event (Bauer).
  • All contiguous observations with more or less the
    same gradient form one event (Morik, Wessel).
  • Clusters of subsequences form events (Das).

27
Moving Gradient
Determining the time intervals with a user-given
tolerance threshold. Abstracting into classes of
gradients: increase, peak, decrease, stable...
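
The idea can be illustrated with a rough sketch. It is not the authors' algorithm: it assumes that an interval ends as soon as the step-to-step gradient differs from the interval's first gradient by more than the tolerance, that the label follows from the mean gradient of the interval, and that each event also records the total change. Peak labels are left out.

```python
def _event(values, begin, end, start_time, tolerance):
    """Build one event (label, begin, end, change) for values[begin..end]."""
    change = values[end] - values[begin]
    mean_gradient = change / (end - begin)
    if mean_gradient > tolerance:
        label = "increasing"
    elif mean_gradient < -tolerance:
        label = "decreasing"
    else:
        label = "stable"
    return (label, begin + start_time, end + start_time, change)

def moving_gradient(values, tolerance=0.0, start_time=1):
    """Summarize a univariate series as a list of subsequent events."""
    events = []
    if len(values) < 2:
        return events
    begin = 0
    reference = values[1] - values[0]  # gradient that opened the interval
    for i in range(1, len(values) - 1):
        gradient = values[i + 1] - values[i]
        if abs(gradient - reference) > tolerance:
            events.append(_event(values, begin, i, start_time, tolerance))
            begin, reference = i, gradient
    events.append(_event(values, begin, len(values) - 1, start_time, tolerance))
    return events
```

A tolerance of 0 splits at every change of gradient, while a larger tolerance merges small fluctuations into longer "stable" intervals.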

28
Sales of Item 182830 in Shop 55
29
Summarizing Sales by Tolerant Moving Gradient
(Wessel, Morik 1999)
30
From Subsequent Events to Event Sequences
[Diagram: chain from multivariate time series via univariate time series and moving gradient to subsequent events (LHn-1, LEn) and on to frequent event sequences (LHn); the steps marked "?" are still open]
31
Transformation into Facts
LE4:
stable(182830, 1, 33, 0).
decreasing(182830, 33, 34, -6).
stable(182830, 34, 39, 0).
increasing(182830, 39, 40, 7).
decreasing(182830, 40, 42, -5).
stable(182830, 42, 108, 0).
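
Writing such facts from events is a simple formatting step. The sketch below assumes that an event is available as a tuple (label, begin, end, change); the tuple layout and function names are illustrative.

```python
def to_fact(item, event):
    """Format one event of an item as a Prolog-style fact."""
    label, begin, end, change = event
    return f"{label}({item},{begin},{end},{change})."

def to_facts(item, events):
    """Format all events of one item, one fact per event."""
    return [to_fact(item, event) for event in events]
```

For example, to_fact(182830, ("stable", 1, 33, 0)) gives "stable(182830,1,33,0).".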
32
Summarizing Item 646152 in Shop 55 by Intolerant
Moving Gradient
33
Corresponding Facts
increasing(646152, 1, 2, 3).
decreasing(646152, 2, 3, -11).
increasingPeak(646152, 3, 4, 22).
...
stable(646152, 25, 37, 0).
increasing(646152, 37, 38, 8).
decreasing(646152, 38, 39, -7).
stable(646152, 39, 40, 0).
increasing(646152, 40, 41, 7).
decreasing(646152, 41, 42, -8).
increasing(646152, 42, 43, 10).
stable(646152, 43, 48, -1).
(small time intervals)
34
Method 3 for Task 2: Inductive Logic Programming
  • Rules about sequences:
    p1(I, Tb, Te, Ar), p2(I, Te, Te2, As) → p3(I, Te2, Te3, At)
  • Results for sequences of sales trends:
    increasing(Item, Tb, Te) → decreasing(Item, Te, Te2)
    increasing(Item, Tb, Te), decreasing(Item, Te, Te2)
    → stable(Item, Te2, Te3)
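
A rule of this form can be checked against the facts with a few lines of Python. The sketch only evaluates a given two-event rule; it does not learn rules and is not an ILP system. Facts are assumed to be tuples (label, item, begin, end, change), and the example values are the facts for item 182830 shown earlier.

```python
def rule_support(facts, first="increasing", second="decreasing"):
    """Return (covered, total) for the rule
    first(Item, Tb, Te) -> second(Item, Te, Te2):
    total is the number of `first` events, covered the number of those
    that are directly followed by a `second` event of the same item."""
    starts = {
        (item, begin)
        for (label, item, begin, end, change) in facts
        if label == second
    }
    premises = [
        (item, end)
        for (label, item, begin, end, change) in facts
        if label == first
    ]
    covered = sum(1 for premise in premises if premise in starts)
    return covered, len(premises)

facts = [
    ("stable", 182830, 1, 33, 0),
    ("decreasing", 182830, 33, 34, -6),
    ("stable", 182830, 34, 39, 0),
    ("increasing", 182830, 39, 40, 7),
    ("decreasing", 182830, 40, 42, -5),
    ("stable", 182830, 42, 108, 0),
]
```

On these six facts the single increase is followed by a decrease, and both decreases are followed by a stable interval.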

35
Same Data -- Several Cases
  • Predict sales of a particular item in a
    particular shop:
  • multivariate to univariate, multiple exponential
    smoothing, OR multivariate to univariate, sliding
    windows, multiple learning with regression SVM
  • Find relations between trends that are valid for
    all sales in all shops: multivariate to
    univariate, summarizing, transformation into
    facts, rule learning

36
Applications in Intensive Care
  • On-line monitoring of intensive care patients
  • high-dimensional data about patient and
    medication
  • measured every minute
  • stored in the Emtec database of patient records
    ---
  • learning when to intervene in which way.

37
Patient G.C., male, 60 years old
Right hemihepatectomy
38
The Data
  • LE DB2: i_1: t_1 a_1,1 ... a_1,k
    i_1: t_2 a_2,1 ... a_2,k
  • ...
  • i_2: t_1 a_1,1 ... a_1,k
  • ...

A set of rows for each patient, 1 row for each
minute
39
Preprocessing
  • Chaining database rows:
    i_1: t_1 a_1,1 ... a_1,k, t_2 a_2,1 ... a_2,k, ...
  • Multivariate to univariate:
    i_1: t_1 a_1, t_2 a_1 ... t_m a_1
    i_1: t_1 a_2, t_2 a_2 ... t_m a_2
    ...
  • Detecting level changes and outliers

40
Phase State Analysis

[Figure: time series (y_t over time t) and the corresponding plots of y_t against y_t+1 for a deterministic process, an AR(1)-process with outlier (AO), and the heart rate (HR_t)]
U. Gather, M. Bauer
41
Level Change Detection
  • level_change(pat4999, 50, 112, hr, up)
  • level_change(pat4999, 112, 164, hr, down)
  • level_change(pat4999, 10, 74, art, constant)
  • level_change(pat4999, 74, 110, art, down)
  • Computed Feature
  • Comparing norm values for a vital sign and its
    mean in a time interval (± standard deviation)
  • deviation(pat4999, 10, 74, art, up)
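
One possible reading of this computed feature is sketched below. It assumes that a norm range (lower and upper bound) is known for each vital sign and that a deviation is reported only if the whole band "mean ± standard deviation" of the interval lies outside that range. Both the rule and the numbers used in the example are assumptions, not values from the application.

```python
from statistics import mean, stdev

def deviation(values, norm_low, norm_high):
    """Return "up", "down" or None for the measurements of one vital sign
    in one time interval, compared with its norm range."""
    m = mean(values)
    s = stdev(values) if len(values) > 1 else 0.0
    if m - s > norm_high:
        return "up"    # even the lower edge of the band is above the norm
    if m + s < norm_low:
        return "down"  # even the upper edge of the band is below the norm
    return None
```

With a hypothetical norm range of 60 to 100, the values 130, 132, 134 yield "up" and the values 80, 82, 84 yield no deviation.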

42
Learning Task 3: Recommend Interventions for
Patients
  • Are there valid rules
  • for all multivariate time series,
  • such that therapeutic interventions follow from
    a patient's state?

43
Method 3: Inductive Logic Programming
  • Given: patient records in the form of facts
  • deviations -- time intervals
  • therapeutic interventions -- time points
  • types of vital signs (group1: hr, swi, co;
    group2: art, vr)
  • Learn rules about interventions:
  • group1(V), deviation(P, T1, T2, V, Dir)
  • → noradrenaline(P, T2, Dir)

44
The Chain of Preprocessing Steps
45
Learning Task 4: Predict Next Minute's
Intervention
  • Given a patient's state at time t_i,
  • learn whether and how to intervene at t_i+1
  • Preprocessing:
  • Selection of time points where an intervention
    was done
  • Multiple to binary class: for each drug, form the
    concepts drug_up, drug_down
  • Multiple learning for each binary class, resulting
    in classifiers for each drug and direction of
    dose change (SVM_light)
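
The "multiple to binary class" step can be sketched as follows. The example layout (feature vector, drug, direction), the +1/-1 labels, and the drug name "drugB" are assumptions made for illustration; the resulting data sets would then be passed to a binary learner, one run per concept.

```python
def to_binary_tasks(examples):
    """Turn examples (features, drug, direction) into one binary data set
    per concept "<drug>_<direction>", with labels +1 and -1."""
    concepts = sorted({f"{drug}_{direction}" for _, drug, direction in examples})
    tasks = {concept: [] for concept in concepts}
    for features, drug, direction in examples:
        target = f"{drug}_{direction}"
        for concept in concepts:
            tasks[concept].append((features, 1 if concept == target else -1))
    return tasks

# Toy examples: patient state at an intervention time point, drug, dose change.
examples = [
    ([80.0, 120.0], "noradrenaline", "up"),
    ([95.0, 90.0], "noradrenaline", "down"),
    ([70.0, 110.0], "drugB", "up"),
]
tasks = to_binary_tasks(examples)
```

Every example appears in every binary data set, positive for its own concept and negative for all others.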

46
The Chain of Preprocessing Steps
47
Same Data -- Several Cases
  • Find time relations that express therapy
    protocols
  • chaining db rows, multivariate to univariate,
    level changes, deviations, RDT
  • Predict intervention for a particular drug
  • select time points, multiple to binary class,
    SVM_light

48
Behind the Boxes
49
Functionality of MD-Compiler
Manual preprocessing operators of M4 are very
elementary. Results of operators are mostly
views.
50
Several view definitions
Inline View versus Physical View-Object
Physical View-Object created by MD-Compiler for
reading data and executing statistics
51
Several view definitions
Inline View versus Physical View-Object
Create View V_05 (...) as select ... from
(select ... from (select ... from Table_x))
instead of
Create View V_05 (...) as select ... from V_04
52
Several view definitions
Materialized view
Created by the MD-Compiler automatically in the
background
+ performance gain when selecting data from
V_04 or V_07
+ all operation outputs can be realized as views
- additional storage needed
53
System-Architecture
[Diagram labels: M4 Relation Editor, M4 Relational Model; M4 Concept Editor, M4 Conceptual Model; M4 Case Editor; operators (statistics, time) in PL/SQL and Java code from UniDo; tables T1 to T6; Mining Mart database]
54
Summary of Cases Involving Time
55
Summary
  • Preprocessing is the key issue in data analysis!
  • Goal: Support users in making intelligent choices
  • Approach: Cases of best practice
  • View of a computer scientist
  • Scalability to very large databases
  • Meta-data driven processing
  • Case studies on analysing data involving time

56
MiningMart Approach
  • Manager -- end-user: knows about the business case
  • Database manager: knows about the data
  • Case designer -- power-user: expert in KDD
  • Developer: supplies (learning) operators