Data Mining on NIJ data - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Data Mining on NIJ data
  • Sangjik Lee

2
Unstructured Data Mining
Text  → Keyword Extraction → Structured Database → Data Mining
Image → Feature Extraction → Structured Database → Data Mining
3
Handwritten CEDAR Letter
4
Document Level Features
  • 1. Entropy
  • 2. Gray-level threshold
  • 3. Number of black pixels
  • 4. Stroke width
  • 5. Number of interior contours
  • 6. Number of exterior contours
  • 7. Number of vertical slope components
  • 8. Number of horizontal slope components
  • 9. Number of negative slope components
  • 10. Number of positive slope components
  • 11. Slant
  • 12. Height

Feature categories: measure of pen pressure, measure of writing movement, measure of stroke formation, slant, word proportion
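For illustration, here is a minimal sketch of how the first three document-level features might be computed from a scanned page with NumPy; the mean-based threshold and the variable names are assumptions, not the CEDAR implementation.

import numpy as np

def document_level_features(gray):
    """Minimal sketch: entropy, gray-level threshold, black-pixel count.

    gray is a 2-D uint8 array (0 = black ink, 255 = white paper).
    The mean-based threshold is an assumption, not the CEDAR rule.
    """
    # Entropy of the normalized gray-level histogram
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = float(-np.sum(p * np.log2(p)))

    # Gray-level threshold: here simply the mean intensity (assumed)
    threshold = float(gray.mean())

    # Number of black pixels after binarization at that threshold
    black_pixels = int(np.sum(gray < threshold))

    return entropy, threshold, black_pixels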
5
Character Level Features
6
Character Level Features
Example binary feature string for a single character, split into its three components: Gradient (192 bits), Structure (192 bits), and Concavity (128 bits).
7
Writer and Feature Data
Writer data
Feature data (normalized)
8
Instances of the Data (normalized)
Feature document level data (12 features)
entropy  dark   pixel  blob   hole   hslope  nslope  pslope  vslope  slant  width  ht
(real)   (int)  (int)  (int)  (int)  (int)   (int)   (int)   (int)   (real) (int)  (int)
.95      .49    .70    .71    .50    .10     .51     .92     .13     .47    .32    .21
.94      .49    .75    .70    .50    .11     .53     .84     .26     .54    .35    .18
.94      .49    .67    .74    .50    .10     .45     .85     .23     .48    .32    .22
.93      .72    .33    .47    .50    .21     .28     .30     .66     .60    .42    .10
.93      .74    .33    .48    .50    .22     .26     .30     .60     .59    .45    .10
.93      .79    .36    .54    .50    .18     .27     .32     .60     .59    .52    .09
.92      .30    .61    .66    .60    .11     .35     .49     .70     .71    .57    .10
.94      .42    .72    .66    .60    .11     .32     .49     .67     .74    .53    .10
.94      .40    .75    .67    .60    .12     .34     .49     .75     .70    .54    .11
.96      .30    .60    .59    .50    .10     .21     .30     .66     .60    .36    .10
.95      .32    .60    .59    .50    .09     .22     .30     .60     .59    .39    .10
.95      .30    .66    .60    .50    .10     .21     .32     .60     .59    .34    .09
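The instances above are rescaled to [0, 1]. A minimal sketch of such per-feature min-max normalization, assuming the raw measurements are held in a NumPy array with one row per document:

import numpy as np

def minmax_normalize(raw):
    """Rescale each feature (column) of raw to the range [0, 1].

    raw: array of shape (n_documents, 12) holding the raw
         document-level feature measurements.
    """
    col_min = raw.min(axis=0)
    col_max = raw.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid divide-by-zero
    return (raw - col_min) / span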
9
Data Mining on sub-group
Sub-group panels: white male, white female, black female, black male
10
Data Mining on sub-group (Cont.)
  • Subgroup analysis yields useful information to be mined.
  • 1-constraint subgroups: Male / Female; White / Black / Hispanic; etc.
  • 2-constraint subgroups: Male-White / Female-White, etc.
  • 3-constraint subgroups: Male-White-25-45 / Female-White-25-45, etc.

There are a combinatorially large number of
subgroups.
11
subgroups
Constraints: Gender (G), Age (A), Handedness (H), Ethnicity (E), eDucation (D), Schooling (S)
A subgroup with writer count W is rejected if W falls below the support threshold (if W < support, reject).
The subgroups form a lattice ordered by the number of constraints:
1 constraint: G, A, H, E, D, S
2 constraints: GA, GH, AH, AE, AD, AS, HE, HD, HS, ED, ES, DS, GS, GD, GE
3 constraints: GAE, GAD, GAH, GAS, GHE, GHD, GHS, GED, GES, GDS, AHE, ...
...
6 constraints: GAHEDS
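A minimal sketch of enumerating this G/A/H/E/D/S lattice level by level and rejecting subgroups below the support threshold; the record layout and the min_support value are illustrative assumptions, not the author's code:

from itertools import combinations
from collections import Counter

CONSTRAINTS = ["gender", "age", "handedness", "ethnicity", "education", "schooling"]

def frequent_subgroups(writers, min_support):
    """Enumerate constraint combinations level by level and keep only
    subgroups whose writer count W meets the support threshold."""
    kept = {}
    for level in range(1, len(CONSTRAINTS) + 1):
        for attrs in combinations(CONSTRAINTS, level):
            counts = Counter(tuple(w[a] for a in attrs) for w in writers)
            for values, w_count in counts.items():
                if w_count >= min_support:          # if W < support, reject
                    kept[(attrs, values)] = w_count
    return kept

# Illustrative usage with made-up records:
# writers = [{"gender": "M", "age": "25-44", "handedness": "R",
#             "ethnicity": "white", "education": "BS", "schooling": "public"}, ...]
# groups = frequent_subgroups(writers, min_support=25)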
12
Database
Color-coded view of the database: writer data, raw feature data, and normalized feature data, shown on a color scale from 0.0 to 1.0

13
Feature Database (White and Black)
Breakdown by gender (female, male), ethnicity (white, black), and age group (12-24, 25-44, 45-64, over 65)
14
What to do
  • 1. Feature Selection
  • The process of choosing an optimal subset of features according to a certain criterion (Feature Selection for Knowledge Discovery and Data Mining, by Huan Liu and Hiroshi Motoda)
  • Since there is a limited number of writers in each sub-group, a reduced subset of features is needed.
  • To improve performance (speed of learning, predictive accuracy, or simplicity of rules)
  • To visualize the data for model selection
  • To reduce dimensionality and remove noise

15
Feature Selection
Example of feature selection
Scatter plots of example feature pairs (features 1-3 vs. 9-11, 9-10 vs. 11-12, 1-2 vs. 2-3, and 6-10 vs. 8-12)
  • Knowing that some features are highly correlated with others can help remove redundant features, as sketched below.
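A minimal sketch of that idea, assuming the normalized features sit in a NumPy array and using an illustrative correlation cutoff of 0.9 (both assumptions):

import numpy as np

def drop_correlated_features(X, cutoff=0.9):
    """Greedily drop features that are highly correlated with an earlier one.

    X: array of shape (n_samples, n_features) of normalized features.
    Returns the indices of the features kept.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < cutoff for k in keep):
            keep.append(j)
    return keep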

16
What to do
  • 2. Visualization of trends (if any) across writer sub-groups
  • A useful tool for quickly obtaining an overall structural view of sub-group trends
  • Seeing is believing!

17
Implementation of Subgroup Analysis on NIJ Data
Task: Which writer subgroup is more distinguishable than others (if any)?
Pipeline: Writer Data + Feature Data → Data Preparation (find a subgroup that has enough support) → Subgroup Classifier
18
Results of Subgroup Classification
  • Procedure for writer subgroup analysis:
  • Find a subgroup that has enough support
  • Choose the other (complement) group
  • Make four data sets for the Artificial Neural Network
  • Train the ANN and get the results from the two test sets
  • Limitations:
  • 3 categories are used (gender, ethnicity, and age)
  • Up to 2 constraints are considered
  • Only document-level features are used

19
Subgroup Classifier
Feature extraction supplies document-level inputs (dark, blob, hole, slant, height, ...) to an artificial neural network (11-6-1), whose single output answers: which group does the writer belong to?
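A minimal sketch of such an 11-6-1 network using scikit-learn's MLPClassifier; the library choice, solver settings, and variable names are assumptions, since the slides do not specify the implementation:

from sklearn.neural_network import MLPClassifier

# X: (n_writers, 11) normalized document-level features
# y: 1 if the writer belongs to the subgroup, 0 for the complement group
def train_subgroup_classifier(X_train, y_train):
    """11 inputs -> 6 hidden units -> 1 sigmoid output, mirroring the 11-6-1 network."""
    net = MLPClassifier(hidden_layer_sizes=(6,), activation="logistic",
                        max_iter=2000, random_state=0)
    net.fit(X_train, y_train)
    return net

# usage: accuracy = train_subgroup_classifier(X_train, y_train).score(X_test, y_test)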
20
Results of Subgroup Classification

21
They're distinguishable, but why...
  • Need to explain why they're distinguishable
  • The ANN does a good job, but cannot clearly explain its output
  • 12 features are too many to explain and visualize
  • Only 2 (or 3) dimensions can be visualized
  • Question: Does a reasonable two- or three-dimensional representation of the data exist that may be analyzed visually?
  • Reference: Feature Selection for Knowledge Discovery and Data Mining, by Huan Liu and Hiroshi Motoda

22
Feature Extraction
  • A common characteristic of feature extraction methods is that they all produce new features y based on the original features x.
  • After feature extraction, the representation of the data is changed so that many techniques, such as visualization and decision-tree building, can be conveniently used.
  • Feature extraction started, as early as the 1960s and 70s, as the problem of finding the intrinsic dimensionality of a data set - the minimum number of independent features required to generate the instances.

23
Visualization Perspective
  • High-dimensional data cannot be analyzed visually
  • It is often necessary to reduce its dimensionality in order to visualize the data
  • The most popular method of determining topological dimensionality is the Karhunen-Loeve (K-L) method (also called Principal Component Analysis), which is based on the eigenvalues of a covariance matrix (R) computed from the data

24
Visualization Perspective
  • The M eigenvectors corresponding to the M
    largest eigenvalues of R define a linear
    transformation from the N-dimensional space to an
    M-dimensional space in which the features are
    uncorrelated.
  • This property of uncorrelated features is
    derived from a theorem stating that if the
    eigenvalues of a matrix are distinct, then the
    associated eigenvectors are linearly independent
  • For the purpose of visualization, one may take
    the M features corresponding to the M largest
    eigenvalues of R

25
Applied to the NIJ data
1. Normalize each feature's values into the range [0, 1]
2. Obtain the correlation matrix for the 12 original features
3. Find the eigenvalues of the correlation matrix
4. Select the two largest eigenvalues
5. Output the eigenvectors associated with the chosen eigenvalues; here we obtain a 12 × 2 transformation matrix M
6. Transform the normalized data Dold into data Dnew of extracted features as follows: Dnew = Dold M
The resulting data is 2-dimensional, with the original class label attached to each instance.
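A minimal NumPy sketch of these six steps; the function and variable names are assumptions, as the slides show no code:

import numpy as np

def kl_transform_2d(raw):
    """Project the 12 document-level features to 2 dimensions via the K-L (PCA) steps above.

    raw: array of shape (n_instances, 12) of raw feature values.
    """
    # 1. Normalize each feature into [0, 1]
    lo, hi = raw.min(axis=0), raw.max(axis=0)
    d_old = (raw - lo) / np.where(hi > lo, hi - lo, 1.0)

    # 2. Correlation matrix of the 12 features
    r = np.corrcoef(d_old, rowvar=False)

    # 3-4. Eigen-decomposition; pick the two largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(r)          # ascending order for symmetric R
    top2 = np.argsort(eigvals)[::-1][:2]

    # 5. 12 x 2 transformation matrix M from the associated eigenvectors
    m = eigvecs[:, top2]

    # 6. Dnew = Dold M  (2-D data; class labels can be re-attached afterwards)
    return d_old @ m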
26
Applied to the NIJ data

27
Applied to the NIJ data

Sample Iris data (the original is 4-dimensional)