Title: OBSTree Analysis of Handwritten Digits
1 OBSTree Analysis of Handwritten Digits
- Using OBSTree to identify handwritten digits collected for zip code recognition.
- Atina Dunlap Brooks (adbrook2_at_stat.ncsu.edu)
- Jacqueline Hughes-Oliver (hughesol_at_stat.ncsu.edu)
- North Carolina State University
2 Outline
- OBSTree method
- Zip code dataset
- Results
- Interpretation
- Completeness penalty
3 OBSTree: Optimal Bit String Tree
- Classification tree
- Branches on sets of explanatory variables
- Not individual explanatory variables
- Computationally intensive
- Split search
- Stochastic search for explanatory variables
- Exhaustive search for values (2^m combinations for a set of m variables)
- Exhaustive test to trim variable set (2^m)
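The exhaustive search over values can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `entropy` and `best_pattern` are hypothetical, and the scoring rule (size-weighted entropy of the two child nodes) is an assumption, since the slides do not state the split criterion. The stochastic search that picks the variable set is not shown; the set is passed in.

```python
from itertools import product
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return sum((c / n) * log2(n / c) for c in counts.values())


def best_pattern(X, y, variables):
    """Try all 2^m value patterns for the m chosen binary variables.

    Returns (pattern, score) for the pattern whose split has the lowest
    size-weighted child entropy, or None if no pattern splits the node.
    The scoring rule is an assumption made for this sketch.
    """
    n = len(y)
    best = None
    for pattern in product((0, 1), repeat=len(variables)):
        # A row falls in the branch only if it matches every value in the pattern.
        inside = [all(row[v] == p for v, p in zip(variables, pattern)) for row in X]
        y_in = [lab for lab, hit in zip(y, inside) if hit]
        y_out = [lab for lab, hit in zip(y, inside) if not hit]
        if not y_in or not y_out:
            continue  # this pattern does not split the node
        score = (len(y_in) * entropy(y_in) + len(y_out) * entropy(y_out)) / n
        if best is None or score < best[1]:
            best = (pattern, score)
    return best
```

The loop visits 2^m patterns, which is why the variable set has to stay small for the search to remain feasible.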
4 Example
- 1 responses: X1=1, X2=1
- 2 responses: X3=1, X4=1
- 0 responses: unstructured
5 Traditional Trees
- 1s found: X1=1, X2=1
- 2s NOT found: confounded
- Tree
- Root: 6 1s, 4 2s, 6 0s
- X1=1: 6 1s, 2 2s, 3 0s
- X1=0: 3 0s, 2 2s
- X1=1, X2=1: 6 1s
- X1=1, X2=0: 3 0s, 2 2s
6 OBSTree
- 1s found: X1=1, X2=1
- 2s found: X3=1, X4=1
- Tree
- Root: 6 1s, 4 2s, 6 0s
- Branch on X1=1, X2=1: 6 1s (remainder: 4 2s, 6 0s)
- Branch on X3=1, X4=1: 4 2s (remainder: 6 0s)
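The contrast between the two trees can be reproduced on a small synthetic dataset. The 16 rows below are hypothetical: they were constructed only to match the node counts on the slides (6 1s, 4 2s, 6 0s), and the helper names `split` and `counts` are illustrative.

```python
from collections import Counter

# Hypothetical toy data: each row is (X1, X2, X3, X4, response).
DATA = (
    [(1, 1, 0, 0, 1)] * 2 + [(1, 1, 1, 0, 1)] * 2 + [(1, 1, 0, 1, 1)] * 2  # 1s
    + [(1, 0, 1, 1, 2)] * 2 + [(0, 0, 1, 1, 2), (0, 1, 1, 1, 2)]           # 2s
    + [(1, 0, 0, 0, 0), (1, 0, 1, 0, 0), (1, 0, 0, 1, 0)]                  # 0s
    + [(0, 0, 0, 0, 0), (0, 1, 0, 0, 0), (0, 1, 1, 0, 0)]                  # 0s
)


def split(rows, rule):
    """Split rows by a rule mapping column index to required value."""
    hit = [r for r in rows if all(r[i] == v for i, v in rule.items())]
    rest = [r for r in rows if not all(r[i] == v for i, v in rule.items())]
    return hit, rest


def counts(rows):
    """Class counts for a node, as {response: count}."""
    return dict(Counter(r[-1] for r in rows))


# Traditional tree: one variable per split leaves the 2s mixed with 0s.
x1_yes, x1_no = split(DATA, {0: 1})
x2_yes, x2_no = split(x1_yes, {1: 1})

# OBSTree-style: each branch tests a set of variables at once.
ones, rest = split(DATA, {0: 1, 1: 1})
twos, zeros = split(rest, {2: 1, 3: 1})
```

Here `counts(x1_no)` and `counts(x2_no)` both contain 0s and 2s, while `counts(twos)` contains only 2s.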
7 Algorithm Modifications
- Originally developed for finding QSARs (quantitative structure-activity relationships)
- For non-QSAR data
- C code (speed)
- Balanced multi-class
- Starting point
- Tie breakers
- Depth selection
- Penalty function
8 USPS Zip Code Dataset
- 256 covariates (16x16 pixels)
- Training: 7291 observations
- 10 responses (digits 0-9)
- 0: 1194    5: 556
- 1: 1005    6: 664
- 2: 731     7: 645
- 3: 658     8: 542
- 4: 652     9: 644
- Test: 2007 observations
- 10 responses (digits 0-9)
- 0: 359     5: 160
- 1: 264     6: 170
- 2: 198     7: 147
- 3: 166     8: 166
- 4: 200     9: 177
9 Binary Conversion
- OBSTree requires binary variables
- Grayscale values range from -1 to 1
- Converted to 0/1
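A minimal sketch of the conversion, assuming a simple cutoff. The slides do not say which rule was used to map grayscale to bits, so the `threshold` parameter and its default of 0.0 are hypothetical.

```python
def binarize(pixels, threshold=0.0):
    """Map grayscale values in [-1, 1] to 0/1 bits.

    threshold is a hypothetical cutoff: values above it become 1,
    all others become 0. The actual rule is not given in the slides.
    """
    return [1 if p > threshold else 0 for p in pixels]
```

Applied to one image, this turns the 256 grayscale covariates into a 256-bit string that the branch rules can test.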
10 Training Branches 1-3
- Branch 1: Present 40,72,168,216; Absent 7,27,117,124,171,230; 806 1s
- Branch 2: Present 59,116,230; Absent 105,121,137,169,193; 557 0s
- Branch 3: Present 22,24,26; Absent 88,105,116,136,198,201,212,230; 315 7s
- 160 more branches
11 Branch 1
- 806 1s
- Present: 40, 72, 168, 216
- Absent: 7, 27, 117, 124, 171, 230
- Examples
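A branch rule of this kind can be checked against a binarized image as sketched below. The function name `matches_branch` is illustrative, and the indexing is an assumption: the slides do not say whether pixel numbers are 0-based or 1-based, or how the 16x16 grid is flattened, so the sketch treats them as 0-based positions in a 256-element list.

```python
# Pixel numbers for Branch 1, taken from the slide.
BRANCH_1_PRESENT = (40, 72, 168, 216)
BRANCH_1_ABSENT = (7, 27, 117, 124, 171, 230)


def matches_branch(bits, present, absent):
    """Return True if every 'present' pixel is 1 and every 'absent' pixel is 0.

    bits is a flat list of 256 binary values; indices are assumed 0-based.
    """
    return all(bits[i] == 1 for i in present) and all(bits[i] == 0 for i in absent)
```

An observation that matches is sent down the branch; all others continue to the next branch in the tree.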
12 Why So Many Branches?
- Examples from Branch 6
- Examples from Branch 17
13 Training Confusion Matrix
- Misclassified: 17 (0.23%)
- Depth: 163 branches
14 Test Confusion Matrix
- Misclassified: 302 (15.05%)
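The reported percentages follow from the misclassified counts and the dataset sizes (7291 training and 2007 test observations). A short check, with a hypothetical helper name:

```python
def misclassification_rate(n_wrong, n_total):
    """Percentage of observations that were misclassified."""
    return 100.0 * n_wrong / n_total


train_rate = misclassification_rate(17, 7291)   # training set
test_rate = misclassification_rate(302, 2007)   # test set
```

The gap between the two rates reflects how closely the 163-branch tree fits the training data.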
15 Method Comparison (% misclassified)
- Human [1]: 2.5
- CART [2]: 17
- C4.5 [3]: 16
- OBSTree: 15.1
- Random Forest [4]: 6.5
- Neural Net [5]: 5.1
- SVM [6]: 4.2
- Note: first 149 nodes are pure
16 Completeness Penalty
- Entropy
- n: number of observations in node
- n_k: number of observations of class k in node
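The formula on this slide did not survive extraction. As a reference point only, the standard Shannon entropy of a node, written with the slide's n and n_k, is the sum over classes of (n_k/n) log(n/n_k). The sketch below computes it; the base-2 logarithm and the function name `node_entropy` are assumptions, and the way the completeness penalty combines with entropy is not given in the slides, so it is not attempted here.

```python
from math import log2


def node_entropy(class_counts):
    """Shannon entropy (bits) of a node from its per-class counts n_k.

    Classes with zero count contribute nothing. A pure node has entropy 0.
    """
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return sum((nk / n) * log2(n / nk) for nk in class_counts if nk > 0)
```

A pure node such as one holding 806 1s scores 0, while a node holding 916 1s and a single 4 scores slightly above 0.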
17 Branches 1-3: Completeness Penalty, ct = .75
- Branch 1: Present 40,72,168,216; Absent 22,27,117,124,171,230; 916 1s, 1 4 (before penalty: 806 1s)
- Branch 2: Present 76,116,230; Absent 105,121,136,169; 574 0s (before penalty: 557 0s)
- Branch 3: Present 24; Absent 72,88,103,105,119,136,149,151,198,213,220,229; 454 7s, 2 9s (before penalty: 315 7s)
- 147 more branches
18 Branch 1: Completeness Penalty, ct = .75
- 916 1s, 1 4
- Present descriptors: 40, 72, 168, 216
- Absent descriptors: 22, 27, 117, 124, 171, 230
- Examples
19 Digit 4 Misclassified as 1
20 Training Confusion Matrix - Completeness Penalty
- Misclassified: 25 (0.34%)
- Depth: 150 branches
21 Test Confusion Matrix - Completeness Penalty
- Misclassified: 286 (14.25%)
22 Method Comparison (% misclassified)
- Human [1]: 2.5
- CART [2]: 17
- C4.5 [3]: 16
- OBSTree: 15.1
- OBSTree with penalty: 14.3
- Random Forest [4]: 6.5
- Neural Net [5]: 5.1
- SVM [6]: 4.2
23 Acknowledgements
- The authors wish to acknowledge the work of, and discussions with, Ke Zhang, Stan Young, Jaijun Liu, and Haojun Ouyang from NC State University, who were invaluable in performing this research.
24 Thank You