Title: Dental Data Mining: Practical Issues and Potential Pitfalls
1. Dental Data Mining: Practical Issues and Potential Pitfalls
- Stuart A. Gansky
- University of California, San Francisco
- Center to Address Disparities in Children's Oral Health
- Support: US DHHS/NIH/NIDCR U54 DE14251
2. What is Knowledge Discovery and Data Mining (KDD)?
- Semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures in data (MIT Tech Review, 2001)
- Interface of
  - Artificial Intelligence, Machine Learning
  - Computer Science, Engineering, Statistics
- Association for Computing Machinery Special Interest Group on Knowledge Discovery in Data and Data Mining (ACM SIGKDD; sponsors KDD Cup)
3. Data Mining as Alchemy
4. Some Potential KDD Applications in Oral Health Research
- Large surveys (e.g., NHANES)
- Longitudinal studies (e.g., VA Aging Study)
- Disease registries (e.g., SEER)
- Digital diagnostics (radiographic, others)
- Molecular biology (e.g., PCR, microarrays)
- Health services research / claims data
- Provider and workforce databases
5.
- Supervised Learning
  - Regression
  - k-nearest neighbor
  - Trees (CART, MART, boosting, bagging)
  - Random Forests
  - Multivariate Adaptive Regression Splines (MARS)
  - Neural Networks
  - Support Vector Machines
- Unsupervised Learning
  - Hierarchical clustering
  - k-means
6. KDD Steps
7. Data Quality
8. Example: Caries
- Predicting disease with traditional logistic regression may have modelling difficulties: nonlinearity (ANN better), interactions (CART better) (Kattan et al, Comp Biomed Res, 98)
- Want to compare the performance of logistic regression to popular data mining techniques (tree and artificial neural network models) in dental caries data
- CART in caries (Stewart & Stamm, JDR, 91)
9. Example study: child caries
- Background: 20% of children have 80% of caries (tooth decay)
- University of Rochester longitudinal study (Leverett et al, J Dent Res, 1993)
- 466 1st-2nd graders, caries-free at baseline
- Saliva samples & exams every 6 months
- Goal: predict 24-month caries incidence (output)
10. 18-month Predictors (Inputs)
- Salivary bacteria
  - Mutans Streptococci (log10 CFU/ml)
  - Lactobacilli (log10 CFU/ml)
- Salivary chemistry
  - Fluoride (ppm)
  - Calcium (mmol/l)
  - Phosphate (ppm)
11. Modeling Methods
- Logistic Regression
- Neural Networks
- Decision Trees
12. Logistic Regression Models
(Figure: schematic surface. Axes: logit (primary dentition caries), log10 Mutans Streptococci, fluoride (F) ppm)
13. Tree Models
(Figure: schematic surface. Axes: logit (primary dentition caries), log10 Mutans Streptococci, fluoride (F) ppm)
14. Artificial Neural Networks
(Figure: schematic surface. Axes: logit (primary dentition caries), log10 Mutans Streptococci, fluoride (F) ppm)
15. Artificial Neural Network (p-r-1)
(Diagram: inputs x1, x2, ..., xp; hidden layer (neurons) h1, h2, ..., hr; output y; weights wij between inputs and hidden layer, wj between hidden layer and output)
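The p-r-1 structure above can be sketched as a single forward pass. This is a minimal illustration, not the network used in the study: the tanh hidden activation follows the ANN settings given later in the talk, while the logistic output unit, the bias terms, and every weight value are assumptions made up for the example.

```python
import math

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a p-r-1 network: p inputs, r tanh hidden units, 1 output."""
    # Hidden layer: h_j = tanh(b_j + sum_i w_ij * x_i)
    hidden = [
        math.tanh(b_j + sum(w_ij * x_i for w_ij, x_i in zip(w_j, x)))
        for w_j, b_j in zip(w_hidden, b_hidden)
    ]
    # Output (assumed logistic): y = 1 / (1 + exp(-(b + sum_j w_j * h_j)))
    z = b_out + sum(w_j * h_j for w_j, h_j in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 2-2-1 network; all numbers are made up for illustration
x = [0.5, -1.0]
w_hidden = [[0.1, 0.2], [-0.3, 0.4]]  # w_hidden[j][i]: input i -> hidden unit j
b_hidden = [0.0, 0.1]
w_out = [0.7, -0.5]                   # w_out[j]: hidden unit j -> output
b_out = 0.2

y = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
print(round(y, 3))  # a predicted probability between 0 and 1
```

With all weights and biases set to zero the output is exactly 0.5, which is a quick sanity check on the wiring.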
16. Common Mistakes with ANN (Schwarzer et al, Stat Med, 2000)
- Too many parameters for sample size
- No validation
- No model complexity penalty
  - (e.g., Akaike Information Criterion (AIC))
- Incorrect misclassification estimation
- Implausible function
- Incorrectly described network complexity
- Inadequate statistical competitors
- Insufficiently compared to stat competitors
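To make the complexity penalty concrete, here is a small sketch of AIC for a binary-outcome model, using the standard definition AIC = 2k - 2 log L. The function names, outcomes, and predicted probabilities are hypothetical; only the formula itself is standard.

```python
import math

def bernoulli_log_likelihood(y_true, p_pred):
    """Log-likelihood of 0/1 outcomes given predicted probabilities."""
    return sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(y_true, p_pred)
    )

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: smaller is better."""
    return 2 * n_params - 2 * log_likelihood

# Made-up outcomes and predictions, identical for both models
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
p_pred = [0.7, 0.2, 0.3, 0.6, 0.1, 0.4, 0.2, 0.8]
ll = bernoulli_log_likelihood(y_true, p_pred)

# Same fit, different complexity: the larger model pays a larger penalty
print(aic(ll, n_params=3), aic(ll, n_params=22))
```

Two models with the same likelihood differ in AIC by twice the difference in parameter count, so a heavily parameterized network has to fit noticeably better to justify itself.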
17. Validation
- Split sample (70% training / 30% validation)
- Validation sample gives an unbiased estimate of misclassification
- K-fold Cross Validation
- Mean squared error (Brier Score)
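A split-sample validation and the Brier score can be sketched in a few lines. This is an illustration under stated assumptions: the helper names, the random seed, and the rounding rule for the split size are choices made for this example, and only the sample size of 466 comes from the study.

```python
import random

def split_sample(n, train_fraction=0.7, seed=0):
    """Randomly split indices 0..n-1 into training and validation sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]

def brier_score(y_true, p_pred):
    """Mean squared error between 0/1 outcomes and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

train_idx, valid_idx = split_sample(466)
print(len(train_idx), len(valid_idx))

# Perfect predictions score 0; always predicting 0.5 scores 0.25
print(brier_score([1, 0], [1.0, 0.0]), brier_score([1, 0], [0.5, 0.5]))
```

The model is fit on the training indices only, and the Brier score is then computed on the validation indices, which the model has not seen.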
18. Why Validate?
- Example: Overfitting in 2 Dimensions
19. Data
20. Linear Fit to Data
21. High Degree Polynomial Fit to Data
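The contrast between the linear and high-degree polynomial fits can be reproduced with made-up data. In this sketch the true relationship is assumed to be y = x, the "noise" is a deterministic alternating +1/-1, and the high-degree polynomial is the degree-9 interpolant through all 10 training points. None of these values come from the slides.

```python
def linear_fit(xs, ys):
    """Ordinary least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def interpolating_polynomial(xs, ys):
    """Degree n-1 polynomial passing through all n points (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Training data: y = x plus alternating +1 / -1 "noise" (made up)
train_x = [float(i) for i in range(10)]
train_y = [x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]
# Validation data: points halfway between training points, on the true line
valid_x = [i + 0.5 for i in range(9)]
valid_y = list(valid_x)

line = linear_fit(train_x, train_y)
poly = interpolating_polynomial(train_x, train_y)

print("training MSE:  ", mse(line, train_x, train_y), mse(poly, train_x, train_y))
print("validation MSE:", mse(line, valid_x, valid_y), mse(poly, valid_x, valid_y))
```

The polynomial has zero training error because it passes through every point, yet its validation error is far larger than the line's. Judging the two fits on training error alone would pick the wrong model.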
22. 10-Fold Cross-validation
(Diagram: data split into 10 folds, numbered 1 to 10)
23. 10-Fold Cross-validation
(Diagram: data split into 10 folds, numbered 1 to 10)
24. 10-Fold Cross-validation
(Diagram: the 10 folds, numbered 1 to 10, shown in repeated rows)
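The fold assignment behind k-fold cross-validation can be sketched as below. The function name, the seed, and the interleaved way of dealing shuffled indices into folds are assumptions for this example; the idea it shows is that each fold serves once as the validation set while the other folds are used for training.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_indices(466, k=10))
for training, validation in splits:
    # Fit on `training`, predict on `validation`, then pool or average the errors
    print(len(training), len(validation))
```

Every subject appears in exactly one validation fold, so each prediction used to estimate error comes from a model that never saw that subject.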
25. Caries Example: Model Settings
- Logit
  - Stepwise selection
  - Alpha = .05 to enter, alpha = .20 to stay
  - AIC to judge additional predictors
- Tree
  - Splitting criterion: Gini index
  - Pruning: proportion correctly classified
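As an illustration of the Gini splitting criterion for a binary outcome, the sketch below scores candidate thresholds on one predictor and picks the one with the lowest weighted impurity. The function names and the toy values are hypothetical, and real tree software adds details (multiple predictors, stopping rules, pruning) that are omitted here.

```python
def gini_impurity(labels):
    """Gini index for 0/1 labels: 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n

def best_threshold(values, labels):
    """Try midpoints between distinct sorted values; return the purest split."""
    candidates = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(candidates, candidates[1:])]
    return min(midpoints, key=lambda t: split_gini(values, labels, t))

# Made-up log10 counts and caries outcomes (1 = caries)
log10_lb = [2.0, 3.0, 4.0, 6.0, 6.5, 7.0]
caries = [0, 0, 0, 1, 1, 1]

threshold = best_threshold(log10_lb, caries)
print(threshold, split_gini(log10_lb, caries, threshold))
```

In this toy data the split at 5.0 separates the outcomes perfectly, so the weighted impurity drops from 0.5 at the root to 0.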
26. ANN Settings
- Artificial Neural Network (5-3-1; 22 df)
- Multilayer perceptron
- 5 preliminary runs
- Levenberg-Marquardt optimization
- No weight decay parameter
- Average error selection
- 3 hidden nodes/neurons
- Activation function: hyperbolic tangent
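The 22 df quoted for the 5-3-1 network is consistent with counting one weight per connection plus one bias per hidden and output unit. The sketch below makes that count explicit; the function name is hypothetical, and the assumption that "df" means weights plus biases is inferred from the arithmetic rather than stated in the slides.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases in a fully connected feed-forward network."""
    return sum(
        (n_in + 1) * n_out  # +1 for the bias feeding each unit in the next layer
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

# 5-3-1: (5 + 1) * 3 + (3 + 1) * 1 = 22
print(mlp_parameter_count([5, 3, 1]))

# The 2- and 4-node alternatives tried in the sensitivity analyses
print(mlp_parameter_count([5, 2, 1]), mlp_parameter_count([5, 4, 1]))
```

Counting parameters this way is a quick check against the "too many parameters for sample size" and "incorrectly described network complexity" mistakes listed earlier.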
27. ANN Sensitivity Analyses
- Random seeds: 5 values
  - No differences
- Weight decay parameters: 0, .001, .005, .01, .25
  - Only slight differences for .01 and .25
- Hidden nodes/neurons: 2, 3, 4
  - 3 seems best
28. Tree Model
29. Tree Model
(Figure: classification tree; each node shows its split and caries prevalence. Legend: node prevalence > overall (17%); node prevalence < overall (17%))
- Overall primary caries: 17% (N = 466)
- log10 LB < 5.71: 16%
- log10 LB ≥ 5.71: 43%
- log10 LB < 3.26: 10%
- log10 MS < 7.09: 14%
- log10 LB ≥ 3.26: 27%
- log10 MS > 7.09: 100%
- log10 LB > 5.76: 0%
- log10 LB < 5.76: 67%
- log10 MS ≥ 3.89: 15%
- log10 MS < 3.89: 1%
- log10 MS < 6.86: 25%
- log10 MS ≥ 6.86: 100%
30. Receiver Operating Characteristic (ROC) Curves
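The area under the ROC curve (AUC) can be computed without drawing the curve, as the probability that a randomly chosen case is scored higher than a randomly chosen non-case, with ties counting one half. The sketch below uses that pairwise definition; the function name and the toy scores are made up for illustration.

```python
def auc(y_true, scores):
    """AUC as the share of (case, non-case) pairs ranked correctly; ties count 0.5."""
    cases = [s for y, s in zip(y_true, scores) if y == 1]
    non_cases = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for case_score in cases:
        for non_case_score in non_cases:
            if case_score > non_case_score:
                wins += 1.0
            elif case_score == non_case_score:
                wins += 0.5
    return wins / (len(cases) * len(non_cases))

# Made-up outcomes and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, scores))  # 3 of the 4 case/non-case pairs are ordered correctly
```

An AUC of 0.5 corresponds to ranking no better than chance, and 1.0 to perfect separation of cases from non-cases.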
31. Cumulative Captured Response Curves
32. Lift Chart
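Both displays rank subjects by predicted risk and look at the top slice. A minimal sketch, assuming the usual definitions: cumulative captured response is the share of all cases found in the top fraction of subjects, and lift is the case rate in that top fraction divided by the overall case rate. The function names and data are hypothetical.

```python
def top_labels(y_true, scores, fraction):
    """Outcomes of the top `fraction` of subjects, ranked by descending score."""
    ranked = [y for _, y in sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)]
    n_top = max(1, round(len(ranked) * fraction))
    return ranked[:n_top]

def cumulative_captured_response(y_true, scores, fraction):
    """Share of all cases that fall in the top `fraction` of subjects."""
    return sum(top_labels(y_true, scores, fraction)) / sum(y_true)

def lift(y_true, scores, fraction):
    """Case rate in the top `fraction` relative to the overall case rate."""
    top = top_labels(y_true, scores, fraction)
    top_rate = sum(top) / len(top)
    overall_rate = sum(y_true) / len(y_true)
    return top_rate / overall_rate

# Made-up example: 10 subjects, 2 cases
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

for fraction in (0.2, 0.4, 1.0):
    print(fraction,
          cumulative_captured_response(y_true, scores, fraction),
          lift(y_true, scores, fraction))
```

At 100% of subjects the captured response is always 1 and the lift is always 1, so the useful part of either chart is how quickly the curve rises over the first few deciles.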
33. Logistic Regression
            Beta   Std Err   Odds Ratio   95% CI
log10 MS    .238   .072      1.27         1.10, 1.46
log10 LB    .311   .070      1.36         1.19, 1.57
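The odds ratios and confidence intervals follow from the coefficients as exp(beta) and exp(beta ± 1.96 × SE). The short sketch below recomputes them from the reported betas and standard errors; the function name is hypothetical, and the use of a Wald interval with z = 1.96 is an assumption, although it is consistent with the reported values.

```python
import math

def odds_ratio_with_ci(beta, std_err, z=1.96):
    """Odds ratio and Wald confidence limits for a logistic regression coefficient."""
    return (
        math.exp(beta),
        math.exp(beta - z * std_err),
        math.exp(beta + z * std_err),
    )

# Coefficients and standard errors as reported above
results = {}
for name, beta, std_err in [("log10 MS", 0.238, 0.072), ("log10 LB", 0.311, 0.070)]:
    results[name] = odds_ratio_with_ci(beta, std_err)
    odds_ratio, lower, upper = results[name]
    print(f"{name}: OR {odds_ratio:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

Each odds ratio is per one-unit increase in the log10 count, that is, per tenfold increase in CFU/ml.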
34. MARS: MS at 4 Times
37. 5-fold CV Results
             Logit   Tree   ANN
RMS error    .365    .363   .362
AUC          .680    .553   .707
38. Summary
- Data quality and study design are paramount
- Utilize multiple methods
- Be sure to validate
- Graphical displays help interpretations
- KDD methods may provide advantages over
traditional statistical models in dental data
40. Prediction as good as the data and model