Title: Data Mining on NIJ data
1. Data Mining on NIJ data
2. Unstructured Data Mining
Diagram: two parallel pipelines — Text goes through Keyword Extraction into a Structured Data Base and then Data Mining; Image goes through Feature Extraction into a Structured Data Base and then Data Mining.
3. Handwritten CEDAR Letter
4. Document Level Features
- 1. Entropy
- 2. Gray-level threshold
- 3. Number of black pixels
- 4. Stroke width
- 5. Number of interior contours
- 6. Number of exterior contours
- 7. Number of vertical slope components
- 8. Number of horizontal slope components
- 9. Number of negative slope components
- 10. Number of positive slope components
- 11. Slant
- 12. Height
These 12 features group into measures of pen pressure, writing movement, stroke formation, slant, and word proportion.
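To make these document-level features concrete, here is a minimal Python sketch (NumPy only) of how the first three might be computed from a grayscale scan; the threshold rule and exact feature definitions are illustrative assumptions, not necessarily those used on the NIJ/CEDAR data.

    import numpy as np

    def document_features(gray):
        """Rough sketch of three document-level features from a grayscale
        image (2-D NumPy array with values in 0..255). Illustrative only."""
        # 1. Entropy of the gray-level histogram
        hist, _ = np.histogram(gray, bins=256, range=(0, 256))
        p = hist / hist.sum()
        p = p[p > 0]
        entropy = -np.sum(p * np.log2(p))

        # 2. Gray-level threshold (here simply the mean gray level; the
        #    original work may use a different thresholding rule)
        threshold = gray.mean()

        # 3. Number of black (ink) pixels below that threshold
        num_black = int((gray < threshold).sum())

        return {"entropy": entropy,
                "gray_threshold": threshold,
                "num_black_pixels": num_black}

    # Example with a synthetic stand-in image
    img = np.random.randint(0, 256, size=(300, 400)).astype(np.uint8)
    print(document_features(img))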
5. Character Level Features
6. Character Level Features
Diagram: each character is encoded as a binary feature vector with a Gradient component (192 bits), a Structure component (192 bits), and a Concavity component (128 bits), 512 bits in total.
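A small, hedged sketch of how such Gradient, Structure and Concavity bit strings might be packed into one 512-bit vector and compared; the simple matching similarity below is just one common choice, not claimed to be the measure used in this work.

    import numpy as np

    def make_character_vector(gradient_bits, structure_bits, concavity_bits):
        """Concatenate the three binary feature groups (192 + 192 + 128 bits)
        into a single 512-element 0/1 vector."""
        assert len(gradient_bits) == 192
        assert len(structure_bits) == 192
        assert len(concavity_bits) == 128
        all_bits = gradient_bits + structure_bits + concavity_bits
        return np.array([int(c) for c in all_bits], dtype=np.uint8)

    def similarity(a, b):
        """Fraction of bit positions on which two character vectors agree
        (simple matching coefficient; illustrative only)."""
        return float((a == b).mean())

    # Example with random bit strings standing in for real character features
    rng = np.random.default_rng(0)
    bits = lambda n: "".join(rng.choice(list("01"), size=n))
    v1 = make_character_vector(bits(192), bits(192), bits(128))
    v2 = make_character_vector(bits(192), bits(192), bits(128))
    print(similarity(v1, v2))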
7. Writer and Feature Data
Writer data
Feature data (normalized)
8. Instances of the Data (normalized)
Document-level feature data (12 features), one row per document:

Entropy  dark  pixel  blob  hole  hslope  nslope  pslope  vslope  slant  width  ht
real     int   int    int   int   int     int     int     int     real   int    int
.95      .49   .70    .71   .50   .10     .51     .92     .13     .47    .32    .21
.94      .49   .75    .70   .50   .11     .53     .84     .26     .54    .35    .18
.94      .49   .67    .74   .50   .10     .45     .85     .23     .48    .32    .22
.93      .72   .33    .47   .50   .21     .28     .30     .66     .60    .42    .10
.93      .74   .33    .48   .50   .22     .26     .30     .60     .59    .45    .10
.93      .79   .36    .54   .50   .18     .27     .32     .60     .59    .52    .09
.92      .30   .61    .66   .60   .11     .35     .49     .70     .71    .57    .10
.94      .42   .72    .66   .60   .11     .32     .49     .67     .74    .53    .10
.94      .40   .75    .67   .60   .12     .34     .49     .75     .70    .54    .11
.96      .30   .60    .59   .50   .10     .21     .30     .66     .60    .36    .10
.95      .32   .60    .59   .50   .09     .22     .30     .60     .59    .39    .10
.95      .30   .66    .60   .50   .10     .21     .32     .60     .59    .34    .09
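The values above lie in [0, 1]; below is a minimal sketch of per-feature min-max normalization, assuming raw features arrive as a plain samples-by-features array (the exact normalization used for the NIJ data is not specified here).

    import numpy as np

    def min_max_normalize(raw):
        """Scale each feature column of `raw` (n_samples x n_features)
        into the range [0, 1] independently."""
        raw = np.asarray(raw, dtype=float)
        col_min = raw.min(axis=0)
        col_max = raw.max(axis=0)
        span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid divide-by-zero
        return (raw - col_min) / span

    # Example: 3 documents x 12 document-level features (random stand-in data)
    raw = np.random.rand(3, 12) * 100
    print(min_max_normalize(raw).round(2))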
9. Data Mining on sub-group
White male
White female
Black female
Black male
10. Data Mining on sub-group (Cont.)
- Subgroup analysis is useful information to be mined.
- 1-constraint subgroups
  - Male / Female
  - White / Black / Hispanic, etc.
- 2-constraint subgroups
  - Male-white / Female-white, etc.
- 3-constraint subgroups
  - Male-white-25-45 / Female-white-25-45, etc.
- There is a combinatorially large number of subgroups.
11. Subgroups
Diagram: lattice of constraint combinations over six categories — G (Gender), A (Age), H (Handedness), E (Ethnicity), D (eDucation), S (Schooling).
Level 1: single constraints G, A, H, E, D, S.
Level 2: pairs GA, GH, AH, AE, AD, AS, HE, HD, HS, ED, ES, DS, GS, GD, GE.
Level 3: triples GAE, GAD, GAH, GAS, GHE, GHD, GHS, GED, GES, GDS, AHE, ...
... and so on up to the full combination GAHEDS.
A candidate subgroup W is rejected if its support falls below the required support threshold. (A small enumeration sketch follows.)
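A minimal Python sketch of this enumerate-and-prune idea; the writer records, category values and minimum-support threshold below are placeholders, not the actual NIJ data.

    from itertools import combinations
    from collections import Counter

    # Hypothetical writer records: one dict of category values per writer.
    writers = [
        {"G": "male", "A": "25-44", "H": "right", "E": "white", "D": "college", "S": "public"},
        {"G": "female", "A": "25-44", "H": "right", "E": "black", "D": "college", "S": "public"},
        # ... more writers ...
    ]

    CATEGORIES = ["G", "A", "H", "E", "D", "S"]
    MIN_SUPPORT = 2  # placeholder threshold

    def enumerate_subgroups(writers, max_constraints=3):
        """Enumerate subgroups defined by 1..max_constraints category constraints
        and keep only those whose support (writer count) meets MIN_SUPPORT."""
        kept = {}
        for k in range(1, max_constraints + 1):
            for cats in combinations(CATEGORIES, k):
                counts = Counter(tuple(w[c] for c in cats) for w in writers)
                for values, support in counts.items():
                    if support >= MIN_SUPPORT:      # reject low-support subgroups
                        kept[tuple(zip(cats, values))] = support
        return kept

    for constraint, support in enumerate_subgroups(writers).items():
        print(dict(constraint), "support =", support)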
12. Database
Writer data
Raw feature data
Normalized feature data
Color scale: 0.0 to 1.0
13. Feature Database (White and Black)
Diagram: feature data grouped by gender (Female, Male), ethnicity (white, black), and age range (12-24, 25-44, 45-64, > 65).
14. What to do
- 1. Feature Selection
  - The process that chooses an optimal subset of features according to a certain criterion (Feature Selection for Knowledge Discovery and Data Mining, Huan Liu and Hiroshi Motoda)
  - Since there is a limited number of writers in each sub-group, a reduced subset of features is needed.
  - To improve performance (speed of learning, predictive accuracy, or simplicity of rules)
  - To visualize the data for model selection
  - To reduce dimensionality and remove noise
15. Feature Selection
Diagram: example of feature selection — pairwise comparisons of features (1-3, 9-11; 9-10 vs. 11-12; 1-2 vs. 2-3; 6-10 vs. 8-12).
- Knowing that some features are highly correlated with others can help remove redundant features (see the correlation-based sketch below).
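A minimal sketch of correlation-based redundancy removal on a normalized feature matrix; the 0.9 cutoff and the synthetic data are illustrative assumptions.

    import numpy as np

    def drop_correlated_features(X, threshold=0.9):
        """Greedily drop one feature from every highly correlated pair.
        X is an (n_samples x n_features) array; returns the kept column indices."""
        corr = np.abs(np.corrcoef(X, rowvar=False))
        keep = []
        for j in range(corr.shape[0]):
            # keep feature j only if it is not strongly correlated
            # with a feature already kept
            if all(corr[j, k] < threshold for k in keep):
                keep.append(j)
        return keep

    # Example: 50 synthetic writers x 12 features, feature 1 nearly a copy of feature 0
    rng = np.random.default_rng(0)
    X = rng.random((50, 12))
    X[:, 1] = X[:, 0] + 0.01 * rng.random(50)
    print(drop_correlated_features(X))   # feature 1 should be dropped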
16. What to do
- 2. Visualization of the trend (if any) of writer sub-groups
  - A useful tool for quickly obtaining an overall structural view of a sub-group's trend
  - Seeing is believing!
17. Implementation of Subgroup Analysis on NIJ Data
Task: Which writer subgroup is more distinguishable than others (if any)?
Diagram: the Writer Data is used to find a subgroup that has enough support; together with the Feature Data it goes through Data Preparation and into the Subgroup Classifier.
18. The Results of Subgroup Classification
- Procedure for writer subgroup analysis
  - Find a subgroup that has enough support
  - Choose the other (complement) group
  - Make four data sets for the Artificial Neural Network (one possible split is sketched below)
  - Train the ANN and get the results from the two test sets
- Limits
  - 3 categories are used (gender, ethnicity and age)
  - up to 2 constraints are considered
  - only document-level features are used
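One plausible reading of "four data sets" is a train/test split for the subgroup plus a train/test split for its complement. The sketch below makes that assumption explicit; the split ratio and synthetic data are placeholders.

    import numpy as np

    def make_subgroup_datasets(features, in_subgroup, train_fraction=0.7, seed=0):
        """Split subgroup rows and complement rows of `features` into
        train/test halves, giving four data sets in total.
        `in_subgroup` is a boolean mask over the rows. Illustrative split only."""
        rng = np.random.default_rng(seed)

        def split(rows):
            rows = rng.permutation(rows)
            cut = int(train_fraction * len(rows))
            return rows[:cut], rows[cut:]

        sub_train, sub_test = split(np.flatnonzero(in_subgroup))
        comp_train, comp_test = split(np.flatnonzero(~in_subgroup))
        return {"subgroup_train": features[sub_train],
                "subgroup_test": features[sub_test],
                "complement_train": features[comp_train],
                "complement_test": features[comp_test]}

    # Example: 100 synthetic writers x 12 features, first 40 in the subgroup
    X = np.random.rand(100, 12)
    mask = np.arange(100) < 40
    sets = make_subgroup_datasets(X, mask)
    print({k: v.shape for k, v in sets.items()})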
19. Subgroup Classifier
Diagram: document-level features (dark, blob, hole, slant, height, ...) are extracted from the writing and fed to an artificial neural network with an 11-6-1 topology, whose output answers "which group is the writer in?"
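A minimal sketch of an 11-6-1 network using scikit-learn's MLPClassifier (11 inputs, one hidden layer of 6 units, one binary output); the activation choice, solver defaults and synthetic training data are illustrative assumptions, not the settings reported in this work.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Synthetic stand-in data: 200 writers x 11 normalized document-level features,
    # labeled 1 if the writer belongs to the chosen subgroup, else 0.
    rng = np.random.default_rng(0)
    X = rng.random((200, 11))
    y = (X[:, 0] + X[:, 5] > 1.0).astype(int)   # arbitrary synthetic rule

    # 11 inputs -> 6 hidden units -> 1 output (binary decision)
    clf = MLPClassifier(hidden_layer_sizes=(6,), activation="logistic",
                        max_iter=2000, random_state=0)
    clf.fit(X[:150], y[:150])                   # training set
    print("test accuracy:", clf.score(X[150:], y[150:]))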
20. The Results of Subgroup Classification
21. They're distinguishable, but why...
- Need to explain why they're distinguishable
- The ANN does a good job, but can't clearly explain its output
- 12 features are too many to explain and visualize
- Only 2 (or 3) dimensions are visualizable
- Question: Does a reasonable two- or three-dimensional representation of the data exist that may be analyzed visually?
- Reference: Feature Selection for Knowledge Discovery and Data Mining, Huan Liu and Hiroshi Motoda
22. Feature Extraction
- The common characteristic of feature extraction methods is that they all produce new features y based on the original features x.
- After feature extraction, the representation of the data is changed so that many techniques, such as visualization and decision tree building, can be conveniently used.
- Feature extraction started as early as the 1960s and 70s as the problem of finding the intrinsic dimensionality of a data set, i.e. the minimum number of independent features required to generate the instances.
23. Visualization Perspective
- Data of high dimensions cannot be analyzed visually
- It is often necessary to reduce its dimensionality in order to visualize the data
- The most popular method of determining topological dimensionality is the Karhunen-Loeve (K-L) method (also called Principal Component Analysis), which is based on the eigenvalues of a covariance matrix R computed from the data
24. Visualization Perspective
- The M eigenvectors corresponding to the M largest eigenvalues of R define a linear transformation from the N-dimensional space to an M-dimensional space in which the features are uncorrelated.
- This property of uncorrelated features is derived from a theorem stating that if the eigenvalues of a matrix are distinct, then the associated eigenvectors are linearly independent.
- For the purpose of visualization, one may take the M features corresponding to the M largest eigenvalues of R.
25. Applied to the NIJ data
1. Normalize each feature's values into the range [0, 1].
2. Obtain the correlation matrix for the 12 original features.
3. Find the eigenvalues of the correlation matrix.
4. Select the two largest eigenvalues.
5. Output the eigenvectors associated with the chosen eigenvalues; this gives a 12 x 2 transformation matrix M.
6. Transform the normalized data D_old into data D_new of extracted features as follows: D_new = D_old M.
The resulting data is 2-dimensional, with the original class label attached to each instance. (A short sketch of these steps appears below.)
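A minimal NumPy sketch of steps 1-6; the random input stands in for the normalized 12-feature NIJ table.

    import numpy as np

    def project_to_2d(data):
        """Steps 1-6: min-max normalize, build the correlation matrix,
        take the eigenvectors of the two largest eigenvalues, and project."""
        data = np.asarray(data, dtype=float)

        # 1. normalize each feature into [0, 1]
        d_min, d_max = data.min(axis=0), data.max(axis=0)
        d_old = (data - d_min) / np.where(d_max > d_min, d_max - d_min, 1.0)

        # 2. correlation matrix of the 12 original features
        r = np.corrcoef(d_old, rowvar=False)

        # 3-4. eigen-decomposition; pick the two largest eigenvalues
        eigvals, eigvecs = np.linalg.eigh(r)     # ascending order for symmetric R
        top2 = np.argsort(eigvals)[-2:][::-1]

        # 5. 12 x 2 transformation matrix M
        m = eigvecs[:, top2]

        # 6. D_new = D_old M  (2-dimensional data)
        return d_old @ m

    # Example with 50 synthetic writers x 12 features
    d_new = project_to_2d(np.random.rand(50, 12))
    print(d_new.shape)   # (50, 2)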
26. Applied to the NIJ data
27. Applied to the NIJ data
Sample Iris data (the original is 4-dimensional)