Title: Visualization and Data Mining techniques
1Visualization and Data Mining techniques
- By-
- Group number- 14
- Chidroop Madhavarapu(105644921)
- Deepanshu Sandhuria(105595184)
- Data Mining CSE 634
- Prof. Anita Wasilewska
2References
- http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/10335/ftpzSzzSzftp.cs.umn.eduzSzdeptzSzuserszSzkumarzSzdatavis.pdf/ganesh96visual.pdf
- http://www.ailab.si/blaz/predavanja/ozp/gradivo/2002-Keim-Visualization%20in%20DM-IEEE%20Trans%20Vis.pdf
- http://www.geocities.com/anand_palm/
- http://citeseer.ist.psu.edu/cache/papers/cs/27216/httpzSzzSzwww-users.cs.umn.eduzSzzCz7EctluzSzPaperTalkFilezSzits02.pdf/shekhar02cubeview.pdf
- http://www.cs.umn.edu/Research/shashi-group/
- http://www.cs.umn.edu/Research/shashi-group/Book/sdb-chap1.pdf
- http://www.cs.umn.edu/research/shashi-group/alan_planb.pdf
- http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/27637/httpzSzzSzwww-users.cs.umn.eduzSzzCz7EpushengzSzpubzSzkdd2001zSzkdd.pdf/shekhar01detecting.pdf
3Motivation
- Visualization for Data Mining
- Huge amounts of information
- Limited display capacity of output devices
- Visual Data Mining (VDM) is a new approach for exploring very large data sets, combining traditional mining methods and information visualization techniques.
4Why Visual Data Mining
5Why Visual Data Mining
6VDM Approach
- VDM takes advantage of both,
- The power of automatic calculations, and
- The capabilities of human processing.
- Human perception offers phenomenal abilities to
extract structures from pictures.
7Levels of VDM
- No or very limited integration
- Corresponds to the application of either traditional information visualization or automated data mining methods.
- Loose integration
- Visualization and automated mining methods are applied sequentially.
- The result of one step can be used as input for another step.
- Full integration
- Automated mining and visualization methods are applied in parallel.
- Combination of the results.
8Methods of Data Visualization
- Different methods are available for visualization of data, based on the type of data
- Data can be
- Univariate
- Bivariate
- Multivariate
9Univariate data
- Measurement of single quantitative variable
- Characterize distribution
- Represented using following methods
- Histogram
- Pie Chart
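A histogram characterizes the distribution of a single variable by counting how many values fall into each bin. The sketch below assumes equal-width bins; the helper name `histogram_counts` and the sample values are hypothetical and only for illustration.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are equal (zero range).
    width = (hi - lo) / num_bins or 1.0
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin, not in a bin of its own.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # illustrative data only
counts = histogram_counts(sample, 4)
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```

The printed bars give a rough text rendering of the distribution; a plotting library would normally draw the bars instead.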
10Histogram
11Pie Chart
12Bivariate Data
- Consists of paired samples of two quantitative variables
- Variables are related
- Represented using following methods
- Scatter plots
- Line graphs
13Scatter plots
14Line graphs
15Multivariate Data
- Multi dimensional representation of multivariate data
- Represented using following methods
- Icon based methods
- Pixel based methods
- Dynamic parallel coordinate system
16Icon based Methods
17Pixel Based Methods
- Approach
- Each attribute value is represented by one colored pixel (the value ranges of the attributes are mapped to a fixed color map).
- The values of each attribute are presented in separate sub windows.
- Examples
- Dense Pixel Displays
18 Dense Pixel Display
- Approach
- Each attribute value is represented by one colored pixel (the value ranges of the attributes are mapped to a fixed color map).
- Different attributes are presented in separate sub windows.
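The pixel-based approach can be sketched as a mapping from each attribute value to one color of a fixed color map, with one list of pixels per attribute. This is a minimal sketch: the five-step palette, the linear scaling, and the names `value_to_color` and `pixel_subwindows` are assumptions for illustration, not part of the original slides.

```python
# A fixed color map: an assumed five-step palette from dark blue to red (RGB).
COLOR_MAP = [(0, 0, 128), (0, 128, 255), (0, 255, 0), (255, 255, 0), (255, 0, 0)]


def value_to_color(value, lo, hi, color_map=COLOR_MAP):
    """Map a value in [lo, hi] to one color of the fixed color map."""
    if hi == lo:
        return color_map[0]
    fraction = (value - lo) / (hi - lo)
    index = min(int(fraction * len(color_map)), len(color_map) - 1)
    return color_map[index]


def pixel_subwindows(records):
    """Return one list of pixel colors per attribute (one sub window each)."""
    windows = {}
    for name in records[0]:
        column = [r[name] for r in records]
        lo, hi = min(column), max(column)
        windows[name] = [value_to_color(v, lo, hi) for v in column]
    return windows


records = [  # illustrative records only
    {"volume": 10, "occupancy": 0.1},
    {"volume": 50, "occupancy": 0.9},
    {"volume": 30, "occupancy": 0.5},
]
windows = pixel_subwindows(records)
print(windows["volume"])
```

A real dense pixel display would also choose how the pixels are arranged inside each sub window; this sketch only covers the value-to-color step.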
19Visual Data Mining Framework and Algorithm
Development
- Ganesh, M., Han, E.H., Kumar, V., Shekhar, S., Srivastava, J. (1996).
- Working Paper. Twin Cities, MN: University of Minnesota, Twin Cities Campus.
21Abstract
- VDM refers to the use of visualization techniques in the Data Mining process to
- Evaluate
- Monitor
- Guide
- This paper provides a framework for VDM via the loose coupling of databases and visualization systems.
- The paper applies VDM towards designing new algorithms that can learn decision trees by manually refining some of the decisions made by well known algorithms such as C4.5.
22Components of VQLBCI
- The three major components of VQLBCI are Visual
Representations, Computations and Events.
23Visual Development of Algorithms
- The most interesting use of visual data mining is the development of new insights and algorithms.
- The figure below shows the ER diagram for learning classification decision trees.
- This model allows the user to monitor the quality and impact of decisions made by the learning procedure.
- The learning procedure can be refined interactively via a visual interface.
24ER diagram for the search space of decision tree
learning algorithm
25General Framework
- Learning a classification decision tree from a training data set can be regarded as a process of searching for the best decision tree that meets user-provided goal constraints.
- The problem space of this search process consists of Model Candidates, Model Candidate Generator and Model Constraints.
- Many existing classification-learning algorithms like C4.5 and CDP fit nicely within this search framework. New learning algorithms that fit users' requirements can be developed by defining the components of the problem space.
26General Framework
- Model Candidate corresponds to the partial classification decision tree. Each node of the decision tree is a Model Atom.
- Search process is the process of finding a final model candidate such that it meets user goal specifications.
- Model Candidate Generator transforms the current model candidate into a new model candidate by selecting one model atom to expand from the expandable leaf model atoms.
- Model Constraints (used by Model Candidate Generator) provide controls and boundaries to the search space.
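The search framework described above can be sketched as a loop that repeatedly picks one expandable leaf model atom and expands it until the model candidate is acceptable. This is only a sketch under stated assumptions: the `ModelAtom` fields, the function names, and the toy split rule are hypothetical, and a real learner would split on attribute tests over the training data.

```python
from dataclasses import dataclass, field


@dataclass
class ModelAtom:
    """One node of the partial decision tree (fields are illustrative)."""
    depth: int
    errors: int  # misclassified training cases at this node
    children: list = field(default_factory=list)


def leaves(atom):
    """Collect the leaf model atoms of the tree rooted at atom."""
    if not atom.children:
        return [atom]
    return [leaf for child in atom.children for leaf in leaves(child)]


def search(root, is_acceptable, is_expandable, rank, expand):
    """Expand one leaf at a time until the model candidate is acceptable."""
    while True:
        expandable = [a for a in leaves(root) if is_expandable(a)]
        if is_acceptable(root, expandable) or not expandable:
            return root
        chosen = min(expandable, key=rank)  # traversal strategy picks the leaf
        chosen.children = expand(chosen)    # expand must return at least one child


def toy_expand(atom):
    # Hypothetical split: each child inherits half of the parent's errors.
    half = atom.errors // 2
    return [ModelAtom(atom.depth + 1, half), ModelAtom(atom.depth + 1, half)]


root = search(
    ModelAtom(depth=0, errors=8),
    is_acceptable=lambda tree, expandable: len(expandable) == 0,
    is_expandable=lambda a: a.errors > 0 and a.depth < 2,
    rank=lambda a: -a.errors,
    expand=toy_expand,
)
print(len(leaves(root)))
```

Swapping the four callables changes the algorithm without touching the loop, which is the point of treating the constraints as pluggable components.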
27Search Process
28Acceptability Constraint
- Model Constraints consist of Acceptability constraints, Expandability constraints and a Data-Entropy calculation function.
- Acceptability constraint predicate specifies when a model candidate is acceptable and thus allows the search process to stop. Examples:
- A1) Total number of expandable leaf model atoms = 0.
- A2) Overall error rate of the model candidate < acceptable error rate.
- A3) Total number of model atoms in the model candidate > maximal allowable tree size.
- A1 is used in C4.5 and CDP
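The three example acceptability predicates can be written as small boolean functions. In this sketch the `CandidateStats` fields, the default thresholds, the strict comparisons in A2 and A3, and the choice to stop when any one predicate holds are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CandidateStats:
    """Summary of a model candidate (field names are illustrative)."""
    expandable_leaves: int
    error_rate: float
    num_atoms: int


def a1(stats):
    """A1: no expandable leaf model atoms remain."""
    return stats.expandable_leaves == 0


def a2(stats, acceptable_error_rate):
    """A2: overall error rate is below the acceptable error rate."""
    return stats.error_rate < acceptable_error_rate


def a3(stats, max_tree_size):
    """A3: the tree has grown past the maximal allowable size."""
    return stats.num_atoms > max_tree_size


def acceptable(stats, acceptable_error_rate=0.05, max_tree_size=100):
    """Assumed combination: stop as soon as any of the predicates holds."""
    return a1(stats) or a2(stats, acceptable_error_rate) or a3(stats, max_tree_size)


print(acceptable(CandidateStats(expandable_leaves=3, error_rate=0.2, num_atoms=10)))
```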
29Expandability Constraint
- An Expandability constraint predicate specifies whether a leaf model atom is expandable or not. Examples:
- C4.5 uses E1 and E2
- CDP uses E2 and E3
30Traversal Strategy
- Traversal strategy ranks expandable leaf model atoms based on the model atom attributes. Examples:
- Increasing order of depth
- Decreasing order of depth
- Orders based on other model atom attributes.
31Steps in Visual Algorithm Development
- No single algorithm is the best all the time; performance is highly data dependent.
- By changing different predicates of model constraints, users can construct new classification-learning algorithms.
- This enables users to find an algorithm that works the best on a given data set.
- Two algorithms are developed: BF, based on the Best-First search idea, and CDP+, which is a modification of CDP.
32BF
- This algorithm is based on the Best-First search idea.
- For Acceptability criteria, it includes A1 and A2 with a user specified acceptable error rate.
- The Traversal strategy chosen is T3.
- In Best-First, expandable leaf model atoms are ranked in decreasing order of the number of misclassified training cases (local error rate × size of subset training data set).
- The traversal strategy will expand the model atom that has the most misclassified training cases, thus reducing the overall error rate the most.
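The Best-First ranking can be illustrated with a short sketch: each expandable leaf is scored by its number of misclassified training cases, computed as local error rate times subset size, and leaves are visited in decreasing order of that score. The dictionary keys and sample numbers are hypothetical.

```python
def misclassified(local_error_rate, subset_size):
    """Number of misclassified training cases at a leaf."""
    return local_error_rate * subset_size


def best_first_order(leaf_atoms):
    """Rank expandable leaves by decreasing number of misclassified cases."""
    return sorted(
        leaf_atoms,
        key=lambda leaf: misclassified(leaf["error_rate"], leaf["size"]),
        reverse=True,
    )


leaf_atoms = [  # illustrative leaves only
    {"name": "a", "error_rate": 0.50, "size": 10},   # about 5 misclassified
    {"name": "b", "error_rate": 0.10, "size": 200},  # about 20 misclassified
    {"name": "c", "error_rate": 0.25, "size": 40},   # about 10 misclassified
]
order = [leaf["name"] for leaf in best_first_order(leaf_atoms)]
print(order)
```

Leaf "a" has the highest local error rate but is expanded last, because its small subset contributes the fewest misclassified cases overall.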
33CDP+
- CDP+ is a modification of CDP
- CDP has dynamic pruning using expandability constraint E3.
- Here, the depth is modified according to the size of the training data set of the model atom.
- We set
- B is the branching factor of the decision tree, t is the size of the training data set belonging to the model atom, and T is the whole training data set.
34Comparison of different classification learning
algorithms
35Experiment
- The new BF and CDP+ algorithms are compared with the C4.5 and CDP algorithms.
- Various metrics are selected to compare the efficiency, accuracy and size of final decision trees of the classification algorithms.
- The generation efficiency of the nodes is measured in terms of the total number of nodes generated.
- To compare the accuracy of the various algorithms, the mean classification error on the test data sets has been computed.
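The accuracy metric can be sketched as follows: the classification error on one test set is the fraction of wrong predictions, and the mean classification error averages that fraction over all test sets. The function names and the tiny label lists are assumptions for illustration and are not results from the experiment.

```python
def classification_error(predicted, actual):
    """Fraction of test cases whose predicted class differs from the true class."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


def mean_classification_error(results):
    """Average the per-data-set error over a list of (predicted, actual) pairs."""
    errors = [classification_error(p, a) for p, a in results]
    return sum(errors) / len(errors)


results = [  # made-up labels, two tiny test sets
    (["y", "n", "y", "y"], ["y", "n", "n", "y"]),  # 1 of 4 wrong
    (["a", "b"], ["a", "b"]),                      # 0 of 2 wrong
]
mean_error = mean_classification_error(results)
print(mean_error)
```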
36Classification error for 10 data sets
37Nodes generated for 10 data sets
38Final decision tree size
39Results/Conclusion
- CDP has accuracy comparable to C4.5 while generating considerably fewer nodes.
- CDP+ outperformed CDP in error rate and number of nodes generated.
- Considering all performance metrics together, CDP+ is the best overall algorithm.
- Considering classification accuracy alone, C4.5P is the winner.
40Conclusion
- Different datasets require different algorithms for best results.
- Diverse user requirements put different constraints on the final decision tree.
- The experiment shows that the Interactive Visual Data Mining Framework can help find the most suitable algorithm for a given data set and group of user requirements.
41Data Mining for Selective Visualization of Large
Spatial Datasets
- Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'02), 2002.
- Washington, DC, USA (November 2002)
- Shashi Shekhar, Chang-Tien Lu, Pusheng Zhang, Rulin Liu
- Computer Science and Engineering Department
- University of Minnesota
43Basic Terminology
- Spatial databases
- Alphanumeric data plus geographical coordinates
- Spatial mining
- Mining of spatial databases
- Spatial datawarehouse
- Contains geographical data
- Spatial outliers
- Observations that appear to be inconsistent with
the remainder of that set of data
44Spatial Cluster
45Contribution
- Propose and implement the CubeView visualization system
- General data cube operations
- Built on the concept of spatial data warehouse to support data mining and data visualization
- Efficient and scalable spatial outlier detection algorithms
46Challenges in spatial data mining
- First, classical data mining deals with numbers and categories, whereas spatial data is more complex and includes extended objects such as points, lines and polygons.
- Second, classical data mining works with explicit inputs, whereas spatial predicates and attributes are often implicit.
- Third, classical data mining treats each input independently of other inputs.
47Application Domain
- The Traffic Management Center of the Minnesota Department of Transportation (MNDOT) has a database to archive sensor network measurements.
- Sensor network includes
- about nine hundred stations
- each of which contains one to four loop detectors
- Measurement of volume and occupancy.
- Volume is the number of vehicles passing through a station in a 5-minute interval
- Occupancy is the percentage of time a station is occupied by vehicles
48Basic Concepts
- Spatial Data Warehouse
- Spatial Data Mining
- Spatial Outliers Detection
49Spatial Data Warehouse
- Employs data cube structure
- Outputs - albums of maps.
- Traffic data warehouse
- Measures - volume and occupancy
- Dimensions - time and space.
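A data cube aggregates a measure over a chosen subset of the dimensions. The sketch below sums traffic volume grouped by any combination of dimensions; the record layout, field names and numbers are hypothetical, and summing is an assumption that suits volume (occupancy would more naturally be averaged).

```python
from collections import defaultdict


def aggregate(records, dimensions, measure):
    """Sum one measure over all records that share the same dimension values."""
    totals = defaultdict(float)
    for r in records:
        key = tuple(r[d] for d in dimensions)
        totals[key] += r[measure]
    return dict(totals)


records = [  # illustrative readings only
    {"station": "s1", "time_of_day": "08:00", "day_of_week": "Mon", "volume": 30},
    {"station": "s1", "time_of_day": "08:05", "day_of_week": "Mon", "volume": 40},
    {"station": "s2", "time_of_day": "08:00", "day_of_week": "Mon", "volume": 10},
    {"station": "s1", "time_of_day": "08:00", "day_of_week": "Tue", "volume": 20},
]
by_station = aggregate(records, ["station"], "volume")
by_station_and_day = aggregate(records, ["station", "day_of_week"], "volume")
print(by_station)
```

Each choice of dimensions corresponds to one view of the cube, for example one map per day of the week when grouping by station and day.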
50Spatial Data Mining
- Process of discovering interesting and useful but implicit spatial patterns.
- Key goal is to partially automate knowledge discovery
- Search for nuggets of information embedded in very large quantities of spatial data.
51Spatial Outliers Detection
- Suspiciously deviating observations
- Local instability
- Each Station
- Spatial attributes: time, space
- Non spatial attributes: volume, occupancy
52Basic Structure CubeView
53CubeView Visualization System
- Each node in the cube is a visualization style
- S: Traffic volume of station at all times.
- T_TD: Time of the day
- T_DW: Day of the week
- S T_TD: Daily traffic volume of each station
- T_TD T_DW S: Traffic volume at each station at different times on different days
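The nodes of the cube correspond to the subsets of the dimensions, so the lattice can be enumerated directly. This sketch assumes the three dimensions are named `T_TD`, `T_DW` and `S`; the function name is hypothetical.

```python
from itertools import combinations


def dimension_lattice(dimensions):
    """List every subset of the dimensions, from the empty set to the full set."""
    nodes = []
    for size in range(len(dimensions) + 1):
        nodes.extend(combinations(dimensions, size))
    return nodes


nodes = dimension_lattice(["T_TD", "T_DW", "S"])
for node in nodes:
    print(node if node else "(all data aggregated)")
```

With three dimensions the lattice has eight nodes, and each node can be paired with its own visualization style.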
54Dimension Lattice
55CubeView Visualization System
56CubeView Visualization System
57CubeView Visualization System
58Data Mining Algorithms for Visualization
- Problem Definition
- Given a spatial graph G = {S, E}
- S = {s1, s2, s3, s4, ...}
- E: edges (neighborhood of stations)
- f(x): attribute value for a data record
- N(x): fixed cardinality set of neighbors of x
- Average attribute value of the neighbors of x (the mean of f(y) over y in N(x))
- S(x): difference between the attribute value of each data object and the average attribute value of its neighbors.
59Data Mining Algorithms for Visualization
- Problem Definition (cont.)
- S(x): difference between the attribute value of each data object and the average attribute value of its neighbors.
- Test for detecting an outlier
- Confidence level threshold θ
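One way to turn these definitions into a test is sketched below. S(x) is computed as f(x) minus the mean of f over the neighbors of x, and a location is flagged when its S(x), standardized by the mean and standard deviation of S over all locations, exceeds the threshold in absolute value. The standardization step is an assumption, since the test formula itself is not preserved in the slide text; the station values and function names are made up.

```python
from statistics import mean, pstdev


def neighbor_differences(f, neighbors):
    """S(x) = f(x) minus the average attribute value of the neighbors of x."""
    return {x: f[x] - mean(f[y] for y in neighbors[x]) for x in f}


def spatial_outliers(f, neighbors, theta):
    """Flag x when |S(x) - mean(S)| / std(S) exceeds theta (assumed test)."""
    s = neighbor_differences(f, neighbors)
    mu, sigma = mean(s.values()), pstdev(s.values())
    if sigma == 0:
        return []  # all differences are identical, so nothing stands out
    return [x for x in s if abs((s[x] - mu) / sigma) > theta]


# Six stations along one route; neighbors are the adjacent stations.
f = {"s1": 10, "s2": 11, "s3": 10, "s4": 50, "s5": 11, "s6": 10}
neighbors = {
    "s1": ["s2"], "s2": ["s1", "s3"], "s3": ["s2", "s4"],
    "s4": ["s3", "s5"], "s5": ["s4", "s6"], "s6": ["s5"],
}
flagged = spatial_outliers(f, neighbors, theta=1.5)
print(flagged)
```

Only station s4 is flagged here: its reading is far above both neighbors, while s3 and s5 deviate less because only one of their neighbors is unusual.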
60Data Mining Algorithms for Visualization
- Few points
- First, the neighborhood can be selected based on a fixed cardinality, a fixed graph distance, or a fixed Euclidean distance.
- Second, the choice of neighborhood aggregate function can be mean, variance, or auto-correlation.
- Third, the choice for comparing a location with its neighbors can be either just a number or a vector of attribute values.
- Finally, the statistic for the base distribution can be selected as normal distribution.
61Data Mining Algorithms for Visualization
- Algorithms
- Test Parameters Computation(TPC) Algorithm
- Route Outlier Detection(ROD) Algorithm
62Data Mining Algorithms for Visualization
63Data Mining Algorithms for Visualization
64Data Mining Algorithms for Visualization
65Software
- http://www.cs.umn.edu/research/shashi-group/vis/traffic_volumemap2.htm
- http://www.cs.umn.edu/research/shashi-group/vis/DataCube.htm