Title: Data Mining on NIJ data
1. Data Mining on NIJ data
2. Unstructured Data Mining
Diagram: two parallel pipelines — Text goes through Keyword Extraction into a Structured Data Base and then Data Mining; Image goes through Feature Extraction into a Structured Data Base and then Data Mining.
3. Handwritten CEDAR Letter
4. Document Level Features
- 1. Entropy
- 2. Gray-level threshold
- 3. Number of black pixels
- 4. Stroke width
- 5. Number of interior contours
- 6. Number of exterior contours
- 7. Number of vertical slope components
- 8. Number of horizontal slope components
- 9. Number of negative slope components
- 10. Number of positive slope components
- 11. Slant
- 12. Height
These 12 features group into measures of pen pressure, writing movement, stroke formation, slant, and word proportion.
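To make these document-level features concrete, here is a minimal Python sketch (NumPy only) of how the first three might be computed from a grayscale scan; the threshold rule and exact feature definitions are illustrative assumptions, not necessarily those used on the NIJ/CEDAR data.

    import numpy as np

    def document_features(gray):
        """Rough sketch of three document-level features from a grayscale
        image (2-D NumPy array with values in 0..255). Illustrative only."""
        # 1. Entropy of the gray-level histogram
        hist, _ = np.histogram(gray, bins=256, range=(0, 256))
        p = hist / hist.sum()
        p = p[p > 0]
        entropy = -np.sum(p * np.log2(p))

        # 2. Gray-level threshold (here simply the mean gray level; the
        #    original work may use a different thresholding rule)
        threshold = gray.mean()

        # 3. Number of black (ink) pixels below that threshold
        num_black = int((gray < threshold).sum())

        return {"entropy": entropy,
                "gray_threshold": threshold,
                "num_black_pixels": num_black}

    # Example with a synthetic stand-in image
    img = np.random.randint(0, 256, size=(300, 400)).astype(np.uint8)
    print(document_features(img))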
5. Character Level Features
6. Character Level Features
Diagram: each character is encoded as a binary feature vector with a Gradient component (192 bits), a Structure component (192 bits), and a Concavity component (128 bits), 512 bits in total.
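A small, hedged sketch of how such Gradient, Structure and Concavity bit strings might be packed into one 512-bit vector and compared; the simple matching similarity below is just one common choice, not claimed to be the measure used in this work.

    import numpy as np

    def make_character_vector(gradient_bits, structure_bits, concavity_bits):
        """Concatenate the three binary feature groups (192 + 192 + 128 bits)
        into a single 512-element 0/1 vector."""
        assert len(gradient_bits) == 192
        assert len(structure_bits) == 192
        assert len(concavity_bits) == 128
        all_bits = gradient_bits + structure_bits + concavity_bits
        return np.array([int(c) for c in all_bits], dtype=np.uint8)

    def similarity(a, b):
        """Fraction of bit positions on which two character vectors agree
        (simple matching coefficient; illustrative only)."""
        return float((a == b).mean())

    # Example with random bit strings standing in for real character features
    rng = np.random.default_rng(0)
    bits = lambda n: "".join(rng.choice(list("01"), size=n))
    v1 = make_character_vector(bits(192), bits(192), bits(128))
    v2 = make_character_vector(bits(192), bits(192), bits(128))
    print(similarity(v1, v2))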
7. Writer and Feature Data
Writer data
Feature data (normalized)
8. Instances of the Data (normalized)
Document-level feature data (12 features), one row per document:

Entropy  dark  pixel  blob  hole  hslope  nslope  pslope  vslope  slant  width  ht
real     int   int    int   int   int     int     int     int     real   int    int
.95      .49   .70    .71   .50   .10     .51     .92     .13     .47    .32    .21
.94      .49   .75    .70   .50   .11     .53     .84     .26     .54    .35    .18
.94      .49   .67    .74   .50   .10     .45     .85     .23     .48    .32    .22
.93      .72   .33    .47   .50   .21     .28     .30     .66     .60    .42    .10
.93      .74   .33    .48   .50   .22     .26     .30     .60     .59    .45    .10
.93      .79   .36    .54   .50   .18     .27     .32     .60     .59    .52    .09
.92      .30   .61    .66   .60   .11     .35     .49     .70     .71    .57    .10
.94      .42   .72    .66   .60   .11     .32     .49     .67     .74    .53    .10
.94      .40   .75    .67   .60   .12     .34     .49     .75     .70    .54    .11
.96      .30   .60    .59   .50   .10     .21     .30     .66     .60    .36    .10
.95      .32   .60    .59   .50   .09     .22     .30     .60     .59    .39    .10
.95      .30   .66    .60   .50   .10     .21     .32     .60     .59    .34    .09
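The values above lie in [0, 1]; below is a minimal sketch of per-feature min-max normalization, assuming raw features arrive as a plain samples-by-features array (the exact normalization used for the NIJ data is not specified here).

    import numpy as np

    def min_max_normalize(raw):
        """Scale each feature column of `raw` (n_samples x n_features)
        into the range [0, 1] independently."""
        raw = np.asarray(raw, dtype=float)
        col_min = raw.min(axis=0)
        col_max = raw.max(axis=0)
        span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid divide-by-zero
        return (raw - col_min) / span

    # Example: 3 documents x 12 document-level features (random stand-in data)
    raw = np.random.rand(3, 12) * 100
    print(min_max_normalize(raw).round(2))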
9. Data Mining on sub-group
White male
White female
Black female
Black male
10. Data Mining on sub-group (Cont.)
- Subgroup analysis is useful information to be mined.
- 1-constraint subgroups
  - Male / Female
  - White / Black / Hispanic, etc.
- 2-constraint subgroups
  - Male-white / Female-white, etc.
- 3-constraint subgroups
  - Male-white-25-45 / Female-white-25-45, etc.
- There is a combinatorially large number of subgroups.
11. Subgroups
Diagram: lattice of constraint combinations over six categories — G (Gender), A (Age), H (Handedness), E (Ethnicity), D (eDucation), S (Schooling).
Level 1: single constraints G, A, H, E, D, S.
Level 2: pairs GA, GH, AH, AE, AD, AS, HE, HD, HS, ED, ES, DS, GS, GD, GE.
Level 3: triples GAE, GAD, GAH, GAS, GHE, GHD, GHS, GED, GES, GDS, AHE, ...
... and so on up to the full combination GAHEDS.
A candidate subgroup W is rejected if its support falls below the required support threshold. (A small enumeration sketch follows.)
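A minimal Python sketch of this enumerate-and-prune idea; the writer records, category values and minimum-support threshold below are placeholders, not the actual NIJ data.

    from itertools import combinations
    from collections import Counter

    # Hypothetical writer records: one dict of category values per writer.
    writers = [
        {"G": "male", "A": "25-44", "H": "right", "E": "white", "D": "college", "S": "public"},
        {"G": "female", "A": "25-44", "H": "right", "E": "black", "D": "college", "S": "public"},
        # ... more writers ...
    ]

    CATEGORIES = ["G", "A", "H", "E", "D", "S"]
    MIN_SUPPORT = 2  # placeholder threshold

    def enumerate_subgroups(writers, max_constraints=3):
        """Enumerate subgroups defined by 1..max_constraints category constraints
        and keep only those whose support (writer count) meets MIN_SUPPORT."""
        kept = {}
        for k in range(1, max_constraints + 1):
            for cats in combinations(CATEGORIES, k):
                counts = Counter(tuple(w[c] for c in cats) for w in writers)
                for values, support in counts.items():
                    if support >= MIN_SUPPORT:      # reject low-support subgroups
                        kept[tuple(zip(cats, values))] = support
        return kept

    for constraint, support in enumerate_subgroups(writers).items():
        print(dict(constraint), "support =", support)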
12. Database
Writer data
Raw feature data
Normalized feature data
Color scale: 0.0 to 1.0
13. Feature Database (White and Black)
Diagram: feature data grouped by gender (Female, Male), ethnicity (white, black), and age range (12-24, 25-44, 45-64, > 65).
14. What to do
- 1. Feature Selection
  - The process that chooses an optimal subset of features according to a certain criterion (Feature Selection for Knowledge Discovery and Data Mining, Huan Liu and Hiroshi Motoda)
  - Since there is a limited number of writers in each sub-group, a reduced subset of features is needed.
  - To improve performance (speed of learning, predictive accuracy, or simplicity of rules)
  - To visualize the data for model selection
  - To reduce dimensionality and remove noise
15. Feature Selection
Diagram: example of feature selection — pairwise comparisons of features (1-3, 9-11; 9-10 vs. 11-12; 1-2 vs. 2-3; 6-10 vs. 8-12).
- Knowing that some features are highly correlated with others can help remove redundant features (see the correlation-based sketch below).
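A minimal sketch of correlation-based redundancy removal on a normalized feature matrix; the 0.9 cutoff and the synthetic data are illustrative assumptions.

    import numpy as np

    def drop_correlated_features(X, threshold=0.9):
        """Greedily drop one feature from every highly correlated pair.
        X is an (n_samples x n_features) array; returns the kept column indices."""
        corr = np.abs(np.corrcoef(X, rowvar=False))
        keep = []
        for j in range(corr.shape[0]):
            # keep feature j only if it is not strongly correlated
            # with a feature already kept
            if all(corr[j, k] < threshold for k in keep):
                keep.append(j)
        return keep

    # Example: 50 synthetic writers x 12 features, feature 1 nearly a copy of feature 0
    rng = np.random.default_rng(0)
    X = rng.random((50, 12))
    X[:, 1] = X[:, 0] + 0.01 * rng.random(50)
    print(drop_correlated_features(X))   # feature 1 should be dropped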
16. What to do
- 2. Visualization of the trend (if any) of writer sub-groups
  - A useful tool for quickly obtaining an overall structural view of a sub-group's trend
  - Seeing is believing!
17. Implementation of Subgroup Analysis on NIJ Data
Task: Which writer subgroup is more distinguishable than others (if any)?
Diagram: the Writer Data is used to find a subgroup that has enough support; together with the Feature Data it goes through Data Preparation and into the Subgroup Classifier.
18. The Results of Subgroup Classification
- Procedure for writer subgroup analysis
  - Find a subgroup that has enough support
  - Choose the other (complement) group
  - Make four data sets for the Artificial Neural Network (one possible split is sketched below)
  - Train the ANN and get the results from the two test sets
- Limits
  - 3 categories are used (gender, ethnicity and age)
  - up to 2 constraints are considered
  - only document-level features are used
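One plausible reading of "four data sets" is a train/test split for the subgroup plus a train/test split for its complement. The sketch below makes that assumption explicit; the split ratio and synthetic data are placeholders.

    import numpy as np

    def make_subgroup_datasets(features, in_subgroup, train_fraction=0.7, seed=0):
        """Split subgroup rows and complement rows of `features` into
        train/test halves, giving four data sets in total.
        `in_subgroup` is a boolean mask over the rows. Illustrative split only."""
        rng = np.random.default_rng(seed)

        def split(rows):
            rows = rng.permutation(rows)
            cut = int(train_fraction * len(rows))
            return rows[:cut], rows[cut:]

        sub_train, sub_test = split(np.flatnonzero(in_subgroup))
        comp_train, comp_test = split(np.flatnonzero(~in_subgroup))
        return {"subgroup_train": features[sub_train],
                "subgroup_test": features[sub_test],
                "complement_train": features[comp_train],
                "complement_test": features[comp_test]}

    # Example: 100 synthetic writers x 12 features, first 40 in the subgroup
    X = np.random.rand(100, 12)
    mask = np.arange(100) < 40
    sets = make_subgroup_datasets(X, mask)
    print({k: v.shape for k, v in sets.items()})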
19. Subgroup Classifier
Diagram: document-level features (dark, blob, hole, slant, height, ...) are extracted from the writing and fed to an artificial neural network with an 11-6-1 topology, whose output answers "which group is the writer in?"
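A minimal sketch of an 11-6-1 network using scikit-learn's MLPClassifier (11 inputs, one hidden layer of 6 units, one binary output); the activation choice, solver defaults and synthetic training data are illustrative assumptions, not the settings reported in this work.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Synthetic stand-in data: 200 writers x 11 normalized document-level features,
    # labeled 1 if the writer belongs to the chosen subgroup, else 0.
    rng = np.random.default_rng(0)
    X = rng.random((200, 11))
    y = (X[:, 0] + X[:, 5] > 1.0).astype(int)   # arbitrary synthetic rule

    # 11 inputs -> 6 hidden units -> 1 output (binary decision)
    clf = MLPClassifier(hidden_layer_sizes=(6,), activation="logistic",
                        max_iter=2000, random_state=0)
    clf.fit(X[:150], y[:150])                   # training set
    print("test accuracy:", clf.score(X[150:], y[150:]))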
20. The Results of Subgroup Classification
21. They're distinguishable, but why...
- Need to explain why they're distinguishable
- The ANN does a good job, but can't clearly explain its output
- 12 features are too many to explain and visualize
- Only 2 (or 3) dimensions are visualizable
- Question: Does a reasonable two- or three-dimensional representation of the data exist that may be analyzed visually?
- Reference: Feature Selection for Knowledge Discovery and Data Mining, Huan Liu and Hiroshi Motoda
22. Feature Extraction
- The common characteristic of feature extraction methods is that they all produce new features y based on the original features x.
- After feature extraction, the representation of the data is changed so that many techniques, such as visualization and decision tree building, can be conveniently used.
- Feature extraction started as early as the 1960s and 70s as the problem of finding the intrinsic dimensionality of a data set, i.e. the minimum number of independent features required to generate the instances.
23. Visualization Perspective
- Data of high dimensions cannot be analyzed visually
- It is often necessary to reduce its dimensionality in order to visualize the data
- The most popular method of determining topological dimensionality is the Karhunen-Loeve (K-L) method (also called Principal Component Analysis), which is based on the eigenvalues of a covariance matrix R computed from the data
24. Visualization Perspective
- The M eigenvectors corresponding to the M largest eigenvalues of R define a linear transformation from the N-dimensional space to an M-dimensional space in which the features are uncorrelated.
- This property of uncorrelated features is derived from a theorem stating that if the eigenvalues of a matrix are distinct, then the associated eigenvectors are linearly independent.
- For the purpose of visualization, one may take the M features corresponding to the M largest eigenvalues of R.
25. Applied to the NIJ data
1. Normalize each feature's values into the range [0, 1].
2. Obtain the correlation matrix for the 12 original features.
3. Find the eigenvalues of the correlation matrix.
4. Select the two largest eigenvalues.
5. Output the eigenvectors associated with the chosen eigenvalues; this gives a 12 x 2 transformation matrix M.
6. Transform the normalized data D_old into data D_new of extracted features as follows: D_new = D_old M.
The resulting data is 2-dimensional, with the original class label attached to each instance. (A short sketch of these steps appears below.)
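A minimal NumPy sketch of steps 1-6; the random input stands in for the normalized 12-feature NIJ table.

    import numpy as np

    def project_to_2d(data):
        """Steps 1-6: min-max normalize, build the correlation matrix,
        take the eigenvectors of the two largest eigenvalues, and project."""
        data = np.asarray(data, dtype=float)

        # 1. normalize each feature into [0, 1]
        d_min, d_max = data.min(axis=0), data.max(axis=0)
        d_old = (data - d_min) / np.where(d_max > d_min, d_max - d_min, 1.0)

        # 2. correlation matrix of the 12 original features
        r = np.corrcoef(d_old, rowvar=False)

        # 3-4. eigen-decomposition; pick the two largest eigenvalues
        eigvals, eigvecs = np.linalg.eigh(r)     # ascending order for symmetric R
        top2 = np.argsort(eigvals)[-2:][::-1]

        # 5. 12 x 2 transformation matrix M
        m = eigvecs[:, top2]

        # 6. D_new = D_old M  (2-dimensional data)
        return d_old @ m

    # Example with 50 synthetic writers x 12 features
    d_new = project_to_2d(np.random.rand(50, 12))
    print(d_new.shape)   # (50, 2)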
26. Applied to the NIJ data
27. Applied to the NIJ data
Sample Iris data (the original is 4-dimensional)