Title: Trend Analysis and Risk Identification
1 Trend Analysis and Risk Identification
Lenka Nováková (1), Jiří Kléma (1), Michal Jakob (1),
Simon Rawles (2), Olga Štěpánková (1)
(1) The Gerstner Laboratory for Intelligent Decision Making and Control, Czech Technical University, Prague
(2) Department of Computer Science, University of Bristol, Bristol, UK
PKDD 2003, Discovery Challenge
2 Outline
- STULONG data, orientation towards CVD
- Tools used
- SumatraTT, Statistica, Weka
- Techniques used
- mainly statistical tests (ANOVA, chi-square, etc.)
- Exploratory analysis and subgroup discovery
- Entry table
- Trend analysis
- Entry and Control tables
- three principal ways of preprocessing
- derived aggregated attributes
- univariate and multivariate analysis
3 STULONG Data
- Four tables: Entry, Control, Letter, Death
- Dependent variable: CVD
- CardioVascular Disease
- boolean attribute derived from the A2 questionnaire (Control table)
CVD false: The patient has no coronary disease.
CVD true: The patient has at least one of these attributes true (Hodn1, Hodn2, Hodn3, Hodn11, Hodn13, Hodn14):
positive angina pectoris
(silent) myocardial infarction
ischaemic heart disease
cerebrovascular accident
Patients who have only diabetes (Hodn4) or cancer (Hodn15) are removed.
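The derivation of the CVD label can be sketched in a few lines of Python. The attribute names come from the slide; the record layout (a dictionary of boolean flags per patient) and the reading that a patient with only diabetes or cancer is dropped are assumptions of this sketch.

```python
# Attribute names come from the slide; the dict-of-booleans record layout
# is an assumption made for this sketch.
CVD_FLAGS = ("Hodn1", "Hodn2", "Hodn3", "Hodn11", "Hodn13", "Hodn14")
EXCLUDE_FLAGS = ("Hodn4", "Hodn15")  # diabetes, cancer


def derive_cvd(record):
    """Return True/False for the CVD label, or None if the patient is removed."""
    if any(record.get(flag, False) for flag in CVD_FLAGS):
        return True
    if any(record.get(flag, False) for flag in EXCLUDE_FLAGS):
        # Only diabetes or cancer, no CVD attribute: removed from the analysis.
        return None
    return False


# Hypothetical patients, not STULONG records.
patients = [
    {"ICO": 1, "Hodn2": True},
    {"ICO": 2, "Hodn4": True},
    {"ICO": 3},
    {"ICO": 4, "Hodn4": True, "Hodn13": True},
]
labels = {p["ICO"]: derive_cvd(p) for p in patients}
kept = {ico: cvd for ico, cvd in labels.items() if cvd is not None}
print(kept)  # {1: True, 3: False, 4: True}
```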
4 ENTRY - subgroup discovery
- AQ no. 6: Are there any differences in the ENTRY examination for different CVD groups?
- Statistica 6.0
- module for interactive decision tree induction
- two-tailed t-test or chi-square test to assess significance of subgroups
- Dependencies are relatively weak
- Interesting dependencies found
- social characteristics, derived attribute AGE_of_ENTRY
- alcohol: positive effect of beer, no effect of wine
- sugar consumption increases CVD risk
- well-known dependencies are not mentioned (smoking, BMI, cholesterol)
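As an illustration of how the significance of a subgroup might be checked, the following sketch computes a Pearson chi-square test on a 2x2 table (inside or outside the subgroup versus CVD true or false). The function name and the counts are hypothetical; the authors used Statistica, not this code.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Rows: inside / outside the subgroup. Columns: CVD true / CVD false.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one patient")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the chi-square tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts, not STULONG results.
stat, p = chi_square_2x2(30, 70, 20, 180)
print(f"chi2 = {stat:.2f}, p = {p:.6f}")
```

The closed-form p-value only holds for one degree of freedom; larger tables need a general chi-square distribution function.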
5 ENTRY - general model
- General CVD model (in WEKA)
- feature selection and modeling (e.g., decision trees)
- tends to generate trivial models (always predicting false)
- asymmetric error-cost matrix does not help
- Predict CVD risk
- Identify principal variables (chi-squared test)
- Naïve Bayes, ROC evaluation
- three independent variables
- discretized AGE_of_ENTRY
- discretized BMI
- Cholrisk - derived from CHLST
- AUC 0.66
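The risk-prediction step can be sketched as a categorical Naive Bayes model scored by the area under the ROC curve. This is a minimal sketch, not the WEKA setup used by the authors: the add-one smoothing, the bin labels, and the toy rows are assumptions, and the model is scored on its own training data only to keep the example short.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Fit a categorical Naive Bayes model with add-one (Laplace) smoothing."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    domains = defaultdict(set)            # feature index -> set of seen values
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            domains[j].add(value)
    return class_counts, value_counts, domains


def predict_proba_true(model, row):
    """Posterior probability of CVD = True for one discretized patient row."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    joint = {}
    for label, count in class_counts.items():
        prob = count / total
        for j, value in enumerate(row):
            prob *= (value_counts[(j, label)][value] + 1) / (count + len(domains[j]))
        joint[label] = prob
    return joint.get(True, 0.0) / sum(joint.values())


def roc_auc(scores, labels):
    """Area under the ROC curve by the trapezoidal rule over score thresholds."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Move past all patients that share the same score (ties).
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        tpr, fpr = tp / pos, fp / neg
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return area


# Hypothetical rows: (AGE_of_ENTRY bin, BMI bin, Cholrisk), not STULONG data.
rows = [("old", "high", "risk"), ("old", "normal", "risk"), ("young", "normal", "ok"),
        ("young", "high", "ok"), ("old", "high", "ok"), ("young", "normal", "risk")]
cvd = [True, True, False, False, True, False]
model = train_naive_bayes(rows, cvd)
scores = [predict_proba_true(model, r) for r in rows]
auc = roc_auc(scores, cvd)
print(auc)
```

An AUC of 0.5 corresponds to random ranking, so the reported 0.66 indicates a weak but real signal.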
6 CONTROL - trend analysis
- AQ no. 7: Are there any differences in the development of risk factors for different CVD groups?
ENTRY table: ICO (primary key), Year of birth, Year of entry, Smoking, Alcohol, Cholesterol, Body Mass Index, Blood pressure
CONTR table: ICO, risk factors followed during 20 years
7 Global Approach
- Risk factors to be observed are selected
- SYST, DIAST, TRIGL, BMI, CHLSTMG
- Selected control examinations are transformed
- pivoting
- Patients with no control entries are removed
- about 60 patients
- Trend aggregates are calculated
ICO_1
ICO_2
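The pivoting step and the removal of patients without control entries can be sketched as follows. The long-format tuple layout and the sample values are assumptions of this sketch; only the factor names and the key ICO come from the slides.

```python
from collections import defaultdict

# Hypothetical long-format control records: (ICO, examination year, factor, value).
control_rows = [
    (1, 1977, "SYST", 130), (1, 1977, "DIAST", 85),
    (1, 1979, "SYST", 135), (1, 1979, "DIAST", 90),
    (2, 1978, "SYST", 120), (2, 1978, "DIAST", 80),
]
entry_icos = [1, 2, 3]  # patient 3 has no control entries


def pivot(rows, factor):
    """Return {ICO: [(year, value), ...]} sorted by year for one risk factor."""
    series = defaultdict(list)
    for ico, year, name, value in rows:
        if name == factor:
            series[ico].append((year, value))
    return {ico: sorted(points) for ico, points in series.items()}


syst = pivot(control_rows, "SYST")
# Patients without any control examination are dropped.
kept = [ico for ico in entry_icos if ico in syst]
print(syst)  # {1: [(1977, 130), (1979, 135)], 2: [(1978, 120)]}
print(kept)  # [1, 2]
```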
8 Derived trend attributes
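The list of derived attributes on this slide did not survive extraction. Two aggregates that the deck does name are the gradient and the intercept of a risk factor over time; a minimal least-squares sketch of both follows. The function name and the sample readings are hypothetical.

```python
def linear_trend(points):
    """Least-squares line through (time, value) points.

    Returns (gradient, intercept); needs at least two distinct time points.
    """
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct examination times")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in points)
    gradient = sxy / sxx
    intercept = mean_v - gradient * mean_t
    return gradient, intercept


# Hypothetical systolic pressure readings: (years since entry, mmHg).
syst = [(0, 130), (2, 134), (4, 138), (6, 142)]
gradient, intercept = linear_trend(syst)
print(gradient, intercept)  # 2.0 130.0
```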
9 Global Approach - results
- The derived aggregates were discretized
- e.g., the gradient can be strongly decreasing, decreasing, constant, increasing, strongly increasing
- Chi-square test for independence w.r.t. CVD
- A large number of aggregates proved to be significant, including gradients (chi-square test, p < 0.05)
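The discretization and the independence test can be sketched as below. The five category names come from the slide; the thresholds, function names, and data are assumptions, since the deck does not state the cut points.

```python
def discretize_gradient(gradient, small=0.5, large=2.0):
    """Map a gradient to one of five ordered categories (thresholds are assumed)."""
    if gradient <= -large:
        return "strongly decreasing"
    if gradient <= -small:
        return "decreasing"
    if gradient < small:
        return "constant"
    if gradient < large:
        return "increasing"
    return "strongly increasing"


def chi_square_statistic(categories, outcomes):
    """Pearson chi-square statistic and degrees of freedom for two label lists."""
    rows = sorted(set(categories))
    cols = sorted(set(outcomes))
    n = len(categories)
    observed = {(r, c): 0 for r in rows for c in cols}
    for r, c in zip(categories, outcomes):
        observed[(r, c)] += 1
    row_tot = {r: sum(observed[(r, c)] for c in cols) for r in rows}
    col_tot = {c: sum(observed[(r, c)] for r in rows) for c in cols}
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = row_tot[r] * col_tot[c] / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat, (len(rows) - 1) * (len(cols) - 1)


# Hypothetical gradients and CVD flags, not STULONG results.
gradients = [-2.5, -1.0, 0.1, 0.8, 2.4, 3.0, -0.2, 1.5]
cvd = [False, False, False, True, True, True, False, True]
trend = [discretize_gradient(g) for g in gradients]
stat, df = chi_square_statistic(trend, cvd)
print(stat, df)
```

With five trend categories and a boolean CVD attribute the test has 4 degrees of freedom, for which the 0.05 critical value is about 9.49.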
10 12
11 12
12 ControlCount vs. CVD
- ControlCount
- number of examinations
- strong relation with CVD
- AUC 0.35
- higher ControlCount goes with lower CVD risk
- anachronistic attribute
- introduced by the design of the study
- ControlCount has influence on the trend aggregates
- ControlCount also affects how steep the gradients tend to be, etc.
- Conclusion: the global approach cannot be applied (at least with these aggregates)
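An AUC below 0.5 for a single attribute means the attribute ranks patients in the opposite direction. The sketch below shows this with the rank (Mann-Whitney) form of the AUC; the function name and the counts are hypothetical.

```python
def rank_auc(values, cvd):
    """Probability that a random CVD patient has a higher value than a random
    non-CVD patient; ties count as one half (Mann-Whitney form of the AUC)."""
    positives = [v for v, flag in zip(values, cvd) if flag]
    negatives = [v for v, flag in zip(values, cvd) if not flag]
    if not positives or not negatives:
        raise ValueError("both CVD groups must be non-empty")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: CVD patients tend to have fewer control examinations.
control_count = [3, 5, 4, 12, 15, 10, 14, 11]
cvd = [True, True, True, False, False, False, False, True]
auc = rank_auc(control_count, cvd)
print(auc)  # 0.0625
```

A value this far below 0.5 means the attribute separates the groups well, only with the sign reversed: 1 - AUC is the AUC of the negated attribute when there are no ties.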
13 Windowing Approach I.
- The same risk factors, the same pivoting transformation and similar trend aggregates
- BUT a constant number of examinations
- Issues
- window
- time period vs. number of examinations
- 5 examinations are enough to express a trend
- patients' records (ControlCount between 1 and 3)
- entry is used as the first examination
- records are dependent
- CVD classification
- time from the last examination to CVD
- yes/no (yes: CVD in the next year or CVD in the future)
14 Windowing Approach I.
[Figure: window moving over the examination data; labels: Data, First vector, New vector]
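The windowing idea can be sketched as a fixed-size window that slides over a patient's examinations, each position producing one vector. The window size of 5 and the use of the entry examination as the first value come from the slides; the function name and the readings are hypothetical.

```python
WINDOW = 5  # number of examinations per window, as stated on the slide


def windows(entry_value, control_values, size=WINDOW):
    """Yield every run of `size` consecutive examinations.

    The entry examination is used as the first examination of the series.
    """
    series = [entry_value] + list(control_values)
    for start in range(len(series) - size + 1):
        yield series[start:start + size]


# Hypothetical SYST readings for one patient: entry plus six controls.
vectors = list(windows(130, [132, 135, 133, 138, 140, 145]))
for v in vectors:
    print(v)
# [130, 132, 135, 133, 138]
# [132, 135, 133, 138, 140]
# [135, 133, 138, 140, 145]
```

Neighbouring vectors share four of their five examinations, which is why the resulting records are dependent.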
15 Aggregate tests
T-tests; Grouping: Time_round (Trend_all_nahrady in Trend_analysis.stw); Group 1: 1000; Group 2: 1
- Trend aggregates approach the normal distribution in all (both) the specified CVD groups
- Two groups were selected: CVD never appears in the future (1000) vs. CVD appears at the next exam (1)
- T-test for comparison of the group means can be applied (p < 0.05)
- Do the means of the calculated aggregates differ in the different CVD groups?
- Just a few of them
- only two variables (gradients!) are clearly significant
- SYST and DIAST
- two significant intercepts
- TRIGL and CHLST
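The comparison of group means can be sketched with a two-sample t statistic. The slides do not say which variant Statistica applied; this sketch assumes the classic pooled-variance (Student) form, and the gradient values are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t(sample_a, sample_b):
    """Student's two-sample t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Assumes roughly equal group variances.
    """
    na, nb = len(sample_a), len(sample_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / df
    standard_error = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df


# Hypothetical SYST gradients: group 1000 (CVD never) vs. group 1 (CVD at next exam).
never_cvd = [1.0, 2.0, 3.0, 4.0, 5.0]
next_exam_cvd = [2.0, 3.0, 4.0, 5.0, 6.0]
t, df = pooled_t(never_cvd, next_exam_cvd)
print(t, df)
```

Turning t into a p-value needs the t distribution, which the Python standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_ind`) would be one option.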
16 Further tests of SYST, DIAST
T-tests; Grouping: Time_round (Trend_all_nahrady in Trend_analysis.stw); Group 1: 1000; Group 2: 1
- Try to test the gradients for all the CVD groups, not only the two extreme groups
- Repeated-measures ANOVA can be applied: development of the SYST/DIAST trend for different CVD groups
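Testing more than two CVD groups at once can be illustrated with an F statistic. Note that this sketch swaps in a plain one-way ANOVA on one gradient per patient; it does not model the repeated-measures structure named on the slide. The function name and the values are hypothetical.

```python
def one_way_anova_f(groups):
    """One-way ANOVA: returns (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Hypothetical DIAST gradients for three CVD groups (e.g. by time to CVD).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f, df_b, df_w = one_way_anova_f(groups)
print(f, df_b, df_w)  # 3.0 2 6
```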
20 Windowing Approach II.
- There are missing values of risk factors
- Windowing I.
- skips missing values
- different numbers of rows are generated for different factors
- Windowing II.
- replaces the missing values
- the same numbers of rows are generated for different factors
- enables multivariate analysis
- combination of different aggregates and their relation with CVD
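The difference between the two windowing variants can be sketched as two ways of handling gaps. The deck does not state how Windowing II replaces missing values; carrying the last observed value forward is an assumption made only for this illustration, as are the function names and the readings.

```python
def skip_missing(series):
    """Windowing I style: drop examinations where the factor is missing."""
    return [v for v in series if v is not None]


def fill_missing(series):
    """Windowing II style: replace gaps so every factor keeps the same length.

    Assumption: carry the last observed value forward; leading gaps take the
    first observed value. The deck does not state the actual rule.
    """
    observed = [v for v in series if v is not None]
    if not observed:
        raise ValueError("series has no observed value")
    filled, last = [], observed[0]
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical readings for one patient over five examinations.
syst = [130, None, 135, 138, None]
trigl = [None, 150, 155, 158, 160]
print(len(skip_missing(syst)), len(skip_missing(trigl)))  # 3 4
print(fill_missing(syst))   # [130, 130, 135, 138, 138]
print(fill_missing(trigl))  # [150, 150, 155, 158, 160]
```

Only the filled series line up examination by examination, which is what allows aggregates of different factors to be combined in one row.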
21 Windowing II.
[Figure: window moving over the examination data; labels: Data, First vector, New vector]
22 27 patients only!
23 Conclusions
- The main scope
- AQ no. 7: Are there any differences in the development of risk factors for different CVD groups?
- Contributions
- Pitfalls of the global approach revealed
- Using windowing, differences were proved for SYST and DIAST blood pressures
- Other assumptions and ideas
- interesting course of development of risk factors (DIAST decreases first, then increases, and CVD appears)
- other trends may have influence under specific conditions (BMITrend and overweight, etc.)