Dental Data Mining: Practical Issues and Potential Pitfalls

1
Dental Data Mining: Practical Issues and Potential Pitfalls

Stuart A. Gansky
University of California, San Francisco
Center to Address Disparities in Children's Oral Health
Support: US DHHS/NIH/NIDCR U54 DE14251

2
What is Knowledge Discovery and Data Mining (KDD)?

"Semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures in data" (MIT Tech Review, 2001)
Interface of: Artificial Intelligence, Machine Language, Computer Science, Engineering, Statistics
Association for Computing Machinery Special Interest Group on Knowledge Discovery in Data and Data Mining (ACM SIGKDD; sponsors KDD Cup)

3
Data Mining as Alchemy
4
Some Potential KDD Applications in Oral Health
Research

Large surveys (e.g., NHANES)
Longitudinal studies (e.g., VA Aging Study)
Disease registries (e.g., SEER)
Digital diagnostics (radiographic & others)
Molecular biology (e.g., PCR, microarrays)
Health services research / claims data
Provider and workforce databases

Supervised Learning
Regression
k nearest neighbor
Trees (CART, MART, boosting, bagging)
Random Forests
Multivariate Adaptive Regression Splines (MARS)
Neural Networks
Support Vector Machines

Unsupervised Learning
Hierarchical clustering
k-means

6
KDD Steps
7
Data Quality
8
Example Caries

Predicting disease with traditional logistic regression may run into modelling difficulties (Kattan et al, Comp Biomed Res, 98):
nonlinearity (ANN better)
interactions (CART better)
Want to compare the performance of logistic regression to popular data mining techniques (tree and artificial neural network models) in dental caries data
CART in caries (Stewart & Stamm, JDR, 91)

9
Example study child caries

Background: 20% of children have 80% of caries (tooth decay)
University of Rochester longitudinal study (Leverett et al, J Dent Res, 1993)
466 1st-2nd graders caries-free at baseline
Saliva samples & exams every 6 months
Goal: Predict 24-month caries incidence (output)

10
18-month Predictors (Inputs)

Salivary bacteria
Mutans Streptococci (log10 CFU/ml)
Lactobacilli (log10 CFU/ml)
Salivary chemistry
Fluoride (ppm)
Calcium (mmol/l)
Phosphate (ppm)

11
Modeling Methods
Logistic Regression
Neural Networks
Decision Trees
12
Logistic Regression Models
[Figure: schematic surface of logit(primary dentition caries) over log10 Mutans Streptococci and fluoride (F, ppm)]
13
Tree Models
[Figure: schematic surface of logit(primary dentition caries) over log10 Mutans Streptococci and fluoride (F, ppm)]
14
Artificial Neural Networks
[Figure: schematic surface of logit(primary dentition caries) over log10 Mutans Streptococci and fluoride (F, ppm)]
15
Artificial Neural Network (p-r-1)
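[Figure: network diagram with inputs x1, x2, ..., xp; hidden layer (neurons) h1, h2, ..., hr; output y; weights wij on the input-to-hidden connections and wj on the hidden-to-output connections]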
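
The p-r-1 network on this slide (p inputs, r hidden neurons, 1 output) can be sketched as a forward pass. This is a minimal illustration, not the software used in the study: the hyperbolic tangent hidden activation comes from the ANN settings slide, while the logistic output, the bias terms, and all weight values below are assumptions made for the example.

```python
import math

def mlp_predict(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a p-r-1 multilayer perceptron.

    x        : list of p inputs (x1..xp)
    w_hidden : r lists of p weights (the wij in the diagram)
    b_hidden : r hidden-unit biases (assumed, not shown in the diagram)
    w_out    : r hidden-to-output weights (the wj in the diagram)
    b_out    : output bias (assumed)
    Returns a predicted probability in (0, 1).
    """
    # Hidden layer: tanh of a weighted sum of the inputs.
    hidden = [
        math.tanh(b + sum(w * xi for w, xi in zip(weights, x)))
        for weights, b in zip(w_hidden, b_hidden)
    ]
    # Output: logistic transform of a weighted sum of the hidden units (assumption).
    logit = b_out + sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical 2-2-1 network with made-up weights.
w_hidden = [[0.5, -0.3], [0.1, 0.8]]
b_hidden = [0.0, -0.2]
w_out = [1.2, -0.7]
b_out = 0.1
p = mlp_predict([1.0, 2.0], w_hidden, b_hidden, w_out, b_out)
```

With all weights and biases set to zero the network returns 0.5 for any input, which is a quick sanity check on the output transform.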
16
Common Mistakes with ANN (Schwarzer et al, Stat Med, 2000)

Too many parameters for sample size
No validation
No model complexity penalty (e.g., Akaike Information Criterion (AIC))
Incorrect misclassification estimation
Implausible function
Incorrectly described network complexity
Inadequate statistical competitors
Insufficiently compared to stat competitors

17
Validation

Split sample (70% training/30% validation)
Validation estimates unbiased misclassification
K-fold Cross Validation
Mean squared error (Brier Score)
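
The split-sample idea and the Brier score listed above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function names, the seed, and the use of Python's `random` module are choices made for the example, and only the 70/30 proportion and the sample size of 466 come from the slides.

```python
import math
import random

def split_sample(n, train_fraction=0.7, seed=0):
    """Randomly split indices 0..n-1 into training and validation sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]

def brier_score(y_true, p_pred):
    """Mean squared error between 0/1 outcomes and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

# 466 children, as in the caries example.
train_idx, valid_idx = split_sample(466)

# The RMS error is the square root of the Brier score.
rms_error = math.sqrt(brier_score([1, 0], [0.5, 0.5]))
```

The model is fitted on `train_idx` only, and the Brier score is computed on `valid_idx`, so the error estimate is not flattered by data the model has already seen.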

18
Why Validate?

Example: Overfitting in 2 Dimensions

19
Data
20
Linear Fit to Data
21
High Degree Polynomial Fit to Data
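
The contrast between the linear fit and the high degree polynomial fit can be reproduced with a small simulation. This is a sketch on made-up data, not the data shown on the slides: the true relationship (a straight line), the noise level, and the seed are all assumptions, and the "high degree" fit is taken to its extreme, a polynomial that passes through every training point.

```python
import random

def linear_fit(xs, ys):
    """Ordinary least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def interpolating_polynomial(xs, ys):
    """Degree n-1 Lagrange polynomial through all n points (distinct x values)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: a straight line plus Gaussian noise.
rng = random.Random(1)
train_x = [i / 9 for i in range(10)]
train_y = [x + rng.gauss(0, 0.2) for x in train_x]
test_x = [(i + 0.5) / 9 for i in range(9)]
test_y = [x + rng.gauss(0, 0.2) for x in test_x]

line = linear_fit(train_x, train_y)
poly = interpolating_polynomial(train_x, train_y)

for name, model in [("linear", line), ("degree 9", poly)]:
    print(name, "train MSE:", mse(model, train_x, train_y),
          "test MSE:", mse(model, test_x, test_y))
```

The polynomial reproduces the training data exactly, so its training error says nothing about how it behaves on new points. Only the held-out error can show whether the extra flexibility helped or merely fitted the noise.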
22
10-Fold Cross-validation
1 2 3 4 5 6 7 8 9 10

23
10-Fold Cross-validation
1 2 3 4 5 6 7 8 9 10

24
10-Fold Cross-validation
[Figure: the ten folds 1 2 3 4 5 6 7 8 9 10, shown as repeated rows]
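
The fold diagram can be read as the following procedure: divide the sample into 10 parts, then hold out each part once while training on the other nine. The sketch below shows one way to generate such splits. The function name, the seed, and the shuffling scheme are assumptions made for illustration; only the number of folds and the sample size of 466 come from the slides.

```python
import random

def k_fold_splits(n, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield training, validation

splits = list(k_fold_splits(466, k=10))
for training, validation in splits:
    # Fit the model on `training`, score it on `validation`,
    # then average the k validation scores.
    pass
```

Every observation is used for validation exactly once, so the averaged error uses the whole sample while never scoring a model on data it was trained on.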

25
Caries Example Model Settings

Logit
Stepwise selection
Alpha = .05 to enter, alpha = .20 to stay
AIC to judge additional predictors
Tree
Splitting criterion: Gini index
Pruning: Proportion correctly classified
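
For a binary outcome such as caries versus no caries, the Gini index used as the tree splitting criterion can be sketched as follows. This is a minimal illustration of the standard definition, not the tree software's implementation. The function names are hypothetical, and the only figure taken from the slides is the 17% overall prevalence.

```python
def gini(p):
    """Gini impurity of a binary node whose event proportion is p."""
    return 2.0 * p * (1.0 - p)

def split_gain(n_left, p_left, n_right, p_right):
    """Reduction in Gini impurity from splitting a parent into two children."""
    n = n_left + n_right
    p_parent = (n_left * p_left + n_right * p_right) / n
    weighted_child = (n_left * gini(p_left) + n_right * gini(p_right)) / n
    return gini(p_parent) - weighted_child

# Impurity of the root node at the 17% overall caries prevalence.
root_impurity = gini(0.17)  # 2 * 0.17 * 0.83 = 0.2822
```

A tree-growing algorithm of this kind evaluates candidate cut points on each predictor and keeps the split with the largest gain. A pure node (all or none with caries) has impurity 0.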

26
ANN Settings

Artificial Neural Network (5-3-1, 22 df)
Multilayer perceptron
5 Preliminary runs
Levenberg-Marquardt optimization
No weight decay parameter
Average error selection
3 Hidden nodes/neurons
Activation function hyperbolic tangent
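
The 22 degrees of freedom quoted for the 5-3-1 network match a simple count of weights and biases, assuming one bias per hidden neuron and one for the output. The helper below is a hypothetical illustration of that arithmetic.

```python
def mlp_parameter_count(p, r, outputs=1):
    """Number of weights and biases in a p-r-outputs multilayer perceptron."""
    hidden = (p + 1) * r          # p weights plus 1 bias for each hidden neuron
    output = (r + 1) * outputs    # r weights plus 1 bias for each output
    return hidden + output

df_5_3_1 = mlp_parameter_count(5, 3)  # (5 + 1) * 3 + (3 + 1) * 1 = 22
```

The same count gives 15 parameters for 2 hidden neurons and 29 for 4, which is worth keeping in view against a sample of 466 children when judging whether a network has too many parameters for the sample size.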

27
ANN Sensitivity Analyses

Random seeds: 5 values
No differences
Weight decay parameters: 0, .001, .005, .01, .25
Only slight differences for .01 and .25
Hidden nodes/neurons: 2, 3, 4
3 seems best

28
Tree Model
29
Tree Model
Prevalence: Node > Overall (17%)
Overall Primary Caries: 17%
N = 466
Prevalence: Node < Overall (17%)
log10 LB < 5.71: 16%
log10 LB ≥ 5.71: 43%
log10 LB < 3.26: 10%
log10 MS < 7.09: 14%
log10 LB ≥ 3.26: 27%
log10 MS > 7.09: 100%
log10 LB > 5.76: 0%
log10 LB < 5.76: 67%
log10 MS ≥ 3.89: 15%
log10 MS < 3.89: 1%
log10 MS < 6.86: 25%
log10 MS ≥ 6.86: 100%
30
Receiver Operating Characteristic (ROC) Curves
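
The area under the ROC curve (AUC) can be computed without drawing the curve: it equals the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case, with ties counted as one half. The sketch below uses that pairwise definition on made-up data; the function name and the example values are assumptions.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0      # case ranked above non-case
            elif sp == sn:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes and predicted risks.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ordered correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from non-cases.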
31
Cumulative Captured Response Curves
32
Lift Chart
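
Cumulative captured response and lift both start from the same step: rank subjects from highest to lowest predicted risk, then look at the top fraction. The sketch below follows the common definitions (captured response is the share of all events found in the top fraction; lift is the event rate in the top fraction divided by the overall rate). The function name, the binning, and the data are assumptions, and the slides' software may define the charts slightly differently.

```python
def lift_table(y_true, scores, n_bins=10):
    """Return (depth, cumulative captured response, cumulative lift) rows.

    Assumes at least one event and at least n_bins observations.
    """
    # Outcomes ordered from highest to lowest predicted risk.
    ranked = [y for _, y in sorted(zip(scores, y_true), key=lambda t: -t[0])]
    n = len(ranked)
    total_events = sum(ranked)
    overall_rate = total_events / n
    rows = []
    for b in range(1, n_bins + 1):
        top = ranked[: round(n * b / n_bins)]
        captured = sum(top) / total_events
        lift = (sum(top) / len(top)) / overall_rate
        rows.append((b / n_bins, captured, lift))
    return rows

# Hypothetical outcomes and predicted risks for 10 subjects.
y = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
risk = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
rows = lift_table(y, risk, n_bins=5)
```

At full depth the captured response is always 1 and the lift is always 1, so the useful part of either chart is how quickly the curve rises in the highest-risk groups.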
33
Logistic Regression

Predictor   Beta   Std Err   Odds Ratio   95% CI
log10 MS    .238   .072      1.27         1.10-1.46
log10 LB    .311   .070      1.36         1.19-1.57
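
The odds ratios and confidence limits in this table follow from the coefficients and standard errors by exponentiation. The sketch below assumes Wald-type 95% limits (beta plus or minus 1.96 standard errors), which reproduces the tabulated values to two decimals; the function name is hypothetical.

```python
import math

def odds_ratio_ci(beta, std_err, z=1.96):
    """Odds ratio and Wald confidence limits per 1-unit increase in the predictor."""
    return (
        math.exp(beta),
        math.exp(beta - z * std_err),
        math.exp(beta + z * std_err),
    )

ms = odds_ratio_ci(0.238, 0.072)  # log10 Mutans Streptococci
lb = odds_ratio_ci(0.311, 0.070)  # log10 Lactobacilli
```

Because both predictors are on a log10 scale, each odds ratio describes the change in the odds of caries for a tenfold increase in bacterial count.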

34
MARS MS at 4 Times
35
(No Transcript)
36
(No Transcript)
37
5-fold CV Results

            Logit   Tree   ANN
RMS error   .365    .363   .362
AUC         .680    .553   .707

38
Summary

Data quality and study design are paramount
Utilize multiple methods
Be sure to validate
Graphical displays help interpretations
KDD methods may provide advantages over
traditional statistical models in dental data

39
(No Transcript)
40
Prediction as good as the data and model
