Dental Data Mining: Practical Issues and Potential Pitfalls

1
Dental Data Mining: Practical Issues and
Potential Pitfalls
  • Stuart A. Gansky
  • University of California, San Francisco
  • Center to Address Disparities in Children's Oral
    Health
  • Support: US DHHS/NIH/NIDCR U54 DE14251

2
What is Knowledge Discovery and Data Mining (KDD)?
  • Semi-automatic discovery of patterns,
    associations, anomalies, and statistically
    significant structures in data (MIT Tech Review,
    2001)
  • Interface of
    • Artificial Intelligence, Machine Learning
    • Computer Science, Engineering, Statistics
  • Association for Computing Machinery Special
    Interest Group on Knowledge Discovery in Data and
    Data Mining (ACM SIGKDD; sponsors the KDD Cup)

3
Data Mining as Alchemy
4
Some Potential KDD Applications in Oral Health Research
  • Large surveys (e.g., NHANES)
  • Longitudinal studies (e.g., VA Aging Study)
  • Disease registries (e.g., SEER)
  • Digital diagnostics (radiographic and others)
  • Molecular biology (e.g., PCR, microarrays)
  • Health services research / claims data
  • Provider and workforce databases

5
  • Supervised Learning
    • Regression
    • k nearest neighbor
    • Trees (CART, MART, boosting, bagging)
    • Random Forests
    • Multivariate Adaptive Regression Splines (MARS)
    • Neural Networks
    • Support Vector Machines
  • Unsupervised Learning
    • Hierarchical clustering
    • k-means

6
KDD Steps
7
Data Quality
8
Example: Caries
  • Predicting disease with traditional logistic
    regression may run into modelling difficulties:
    nonlinearity (ANN better) and interactions (CART
    better) (Kattan et al, Comp Biomed Res, 98)
  • Want to compare the performance of logistic
    regression with popular data mining techniques
    (tree and artificial neural network models) in
    dental caries data
  • CART in caries (Stewart & Stamm, JDR, 91)

9
Example study: child caries
  • Background: 20% of children have 80% of caries
    (tooth decay)
  • University of Rochester longitudinal study
    (Leverett et al, J Dent Res, 1993)
  • 466 1st-2nd graders, caries-free at baseline
  • Saliva samples and exams every 6 months
  • Goal: Predict 24-month caries incidence (output)

10
18-month Predictors (Inputs)
  • Salivary bacteria
    • Mutans Streptococci (log10 CFU/ml)
    • Lactobacilli (log10 CFU/ml)
  • Salivary chemistry
    • Fluoride (ppm)
    • Calcium (mmol/l)
    • Phosphate (ppm)

11
Modeling Methods
Logistic Regression
Neural Networks
Decision Trees
12
Logistic Regression Models
[Figure: schematic surface of logit(primary dentition caries) against log10 Mutans Streptococci and fluoride (F, ppm)]
13
Tree Models
[Figure: schematic surface of logit(primary dentition caries) against log10 Mutans Streptococci and fluoride (F, ppm)]
14
Artificial Neural Networks
[Figure: schematic surface of logit(primary dentition caries) against log10 Mutans Streptococci and fluoride (F, ppm)]
15
Artificial Neural Network (p-r-1)
[Figure: network diagram. Inputs x1, x2, ..., xp connect through weights wij to the hidden layer (neurons) h1, h2, ..., hr, which connects through weights wj to the output y]
16
Common Mistakes with ANN (Schwarzer et al, Stat Med, 2000)
  • Too many parameters for sample size
  • No validation
  • No model complexity penalty (e.g., Akaike
    Information Criterion (AIC))
  • Incorrect misclassification estimation
  • Implausible function
  • Incorrectly described network complexity
  • Inadequate statistical competitors
  • Insufficiently compared to stat competitors

17
Validation
  • Split sample (70% training / 30% validation)
  • Validation sample gives an unbiased estimate of
    misclassification
  • K-fold Cross Validation
  • Mean squared error (Brier Score)
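
The validation ideas listed above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis used in the study: the function names, the toy data, and the prevalence-only baseline "model" are all assumptions made for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def cross_validated_brier(xs, ys, fit, k=10, seed=0):
    """Fit on k-1 folds, score the held-out fold, and pool the squared errors."""
    folds = k_fold_indices(len(xs), k, seed)
    sq_err = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err += sum((predict(xs[i]) - ys[i]) ** 2 for i in held_out)
    return sq_err / len(xs)

def fit_prevalence(train_x, train_y):
    """Hypothetical baseline 'model': always predict the training prevalence."""
    rate = sum(train_y) / len(train_y)
    return lambda x: rate

# Hypothetical toy data, not from the study: 20 subjects, 5 with caries.
xs = list(range(20))
ys = [1] * 5 + [0] * 15
cv_brier = cross_validated_brier(xs, ys, fit_prevalence, k=10)
```

Any fitting routine that returns a prediction function could be passed in place of fit_prevalence; the point of the sketch is that every subject is scored by a model that never saw that subject during fitting.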

18
Why Validate?
  • Example: Overfitting in 2 Dimensions

19
Data
20
Linear Fit to Data
21
High Degree Polynomial Fit to Data
22
10-Fold Cross-validation
[Figure: data split into folds 1 to 10]

23
10-Fold Cross-validation
[Figure: folds 1 to 10]

24
10-Fold Cross-validation
[Figure: folds 1 to 10, repeated row by row for the successive cross-validation rounds]

25
Caries Example: Model Settings
  • Logit
    • Stepwise selection
    • Alpha = .05 to enter, alpha = .20 to stay
    • AIC to judge additional predictors
  • Tree
    • Splitting criterion: Gini index
    • Pruning: proportion correctly classified
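
For readers unfamiliar with the Gini index used as the splitting criterion, the following sketch shows how it might be computed for a binary outcome. Only the 17% overall prevalence comes from the slides; the child node sizes and prevalences are hypothetical, chosen so they are consistent with a 17% parent node.

```python
def gini_index(p):
    """Gini impurity of a binary node whose positive-class proportion is p."""
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def split_impurity(n_left, p_left, n_right, p_right):
    """Size-weighted Gini impurity of the two children of a candidate split."""
    n = n_left + n_right
    return (n_left * gini_index(p_left) + n_right * gini_index(p_right)) / n

# Overall caries prevalence reported on the slides.
root = gini_index(0.17)

# Hypothetical child sizes and prevalences, for illustration only.
children = split_impurity(n_left=80, p_left=0.10, n_right=20, p_right=0.45)

# A tree grower would prefer the candidate split with the largest reduction.
gain = root - children
```

A pure node (prevalence 0 or 1) has impurity 0, and a 50/50 node has the maximum of 0.5, so a larger gain means the split separates caries from non-caries children more cleanly.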

26
ANN Settings
  • Artificial Neural Network (5-3-1; 22 df)
  • Multilayer perceptron
  • 5 preliminary runs
  • Levenberg-Marquardt optimization
  • No weight decay parameter
  • Average error selection
  • 3 hidden nodes/neurons
  • Activation function: hyperbolic tangent
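
The 22 df quoted for the 5-3-1 network matches a simple parameter count: one weight per connection plus one bias for each hidden and output unit. The sketch below shows that count and a single forward pass. The hyperbolic tangent hidden layer follows the settings above; the logistic output unit and the all-zero weights are assumptions for illustration, not details taken from the slides.

```python
import math

def n_parameters(p, r):
    """Weights plus biases in a p-r-1 multilayer perceptron."""
    return (p + 1) * r + (r + 1)

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: tanh hidden units, then a logistic output (assumed)."""
    hidden = [
        math.tanh(b + sum(w * xi for w, xi in zip(ws, x)))
        for ws, b in zip(hidden_weights, hidden_biases)
    ]
    z = output_bias + sum(w * h for w, h in zip(output_weights, hidden))
    return 1.0 / (1.0 + math.exp(-z))

# 5 inputs, 3 hidden neurons, 1 output, as in the settings above.
p, r = 5, 3
df = n_parameters(p, r)

# Hypothetical weights: with every weight at zero the output is exactly 0.5.
prob = forward([0.0] * p, [[0.0] * p for _ in range(r)], [0.0] * r, [0.0] * r, 0.0)
```

Counting parameters this way makes the first common mistake concrete: each added hidden neuron in a five-input network costs seven more parameters to estimate from the same 466 children.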

27
ANN Sensitivity Analyses
  • Random seeds: 5 values
    • No differences
  • Weight decay parameters: 0, .001, .005, .01, .25
    • Only slight differences for .01 and .25
  • Hidden nodes/neurons: 2, 3, 4
    • 3 seems best

28
Tree Model
29
Tree Model
[Figure: classification tree; nodes are marked by whether node prevalence is > or < the overall prevalence (17%)]
  • Overall primary caries: 17% (N = 466)
  • log10 LB < 5.71: 16%
  • log10 LB ≥ 5.71: 43%
  • log10 LB < 3.26: 10%
  • log10 MS < 7.09: 14%
  • log10 LB ≥ 3.26: 27%
  • log10 MS > 7.09: 100%
  • log10 LB > 5.76: 0%
  • log10 LB < 5.76: 67%
  • log10 MS ≥ 3.89: 15%
  • log10 MS < 3.89: 1%
  • log10 MS < 6.86: 25%
  • log10 MS ≥ 6.86: 100%
30
Receiver Operating Characteristic (ROC) Curves
31
Cumulative Captured Response Curves
32
Lift Chart
33
Logistic Regression
  Predictor   Beta   Std Err   Odds Ratio   95% CI
  log10 MS    .238   .072      1.27         (1.10, 1.46)
  log10 LB    .311   .070      1.36         (1.19, 1.57)
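
The odds ratios and intervals above can be reproduced from the coefficients and standard errors. The sketch assumes a Wald interval with z = 1.96, which is a common default rather than something the slides state; the function name is illustrative.

```python
import math

def odds_ratio_ci(beta, std_err, z=1.96):
    """Return (odds ratio, lower, upper) for a Wald interval (assumed method)."""
    return (
        math.exp(beta),
        math.exp(beta - z * std_err),
        math.exp(beta + z * std_err),
    )

ms = odds_ratio_ci(0.238, 0.072)  # log10 Mutans Streptococci
lb = odds_ratio_ci(0.311, 0.070)  # log10 Lactobacilli

print([round(v, 2) for v in ms])  # [1.27, 1.1, 1.46]
print([round(v, 2) for v in lb])  # [1.36, 1.19, 1.57]
```

Because both predictors are on a log10 scale, each odds ratio describes the change in caries odds for a tenfold increase in bacterial count.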

34
MARS: MS at 4 Times
37
5-fold CV Results
              Logit   Tree   ANN
  RMS error   .365    .363   .362
  AUC         .680    .553   .707
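
The two summary measures in this table can be computed from predicted probabilities and observed outcomes as sketched below. The four scores and labels are hypothetical toy values, not study data, and the pairwise AUC shown here is one standard formulation (the proportion of case/non-case pairs ranked correctly, with ties counted as half).

```python
import math

def auc(scores, labels):
    """Proportion of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rms_error(probs, labels):
    """Root mean squared error, i.e. the square root of the Brier score."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs))

# Hypothetical predictions for four subjects (two with caries, two without).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

toy_auc = auc(scores, labels)        # 3 of 4 pairs ranked correctly
toy_rms = rms_error(scores, labels)
```

The two measures answer different questions: RMS error reflects how close the predicted probabilities are to the outcomes, while AUC reflects only how well the model ranks children with caries above those without, which is why the three models can look nearly identical on one row of the table and quite different on the other.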

38
Summary
  • Data quality and study design are paramount
  • Utilize multiple methods
  • Be sure to validate
  • Graphical displays help interpretations
  • KDD methods may provide advantages over
    traditional statistical models in dental data

40
Prediction as good as the data and model