Title: Visualization for Classification and Clustering Techniques
1. Visualization for Classification and Clustering Techniques
- Marc René
- CSE 8331
- Data Mining - Project 1
2. Overview
- Importance of Data Visualization in the KDD Process: Understanding and Trust
- Visualization techniques
- Classification
- Clustering
- Future Directions
3. KDD Process
- Selection
  - Obtain data from all of the sources
- Preprocessing
  - After selecting the data, clean it to make sure it is consistent
- Transformation
  - After preprocessing the data, analyze the format/amount of data
- Data Mining
  - Once the data is in a usable format/content, apply various algorithms based upon the results to be achieved
- Interpretation/Evaluation
  - Finally, present the results of the data mining step to the user, so that the results can be used to solve the business need at hand
4. Importance of Data Visualization
- The final step in the KDD process
- Highly dependent on the data visualization technique
- Bad/inappropriate technique may result in misunderstanding
- Misunderstanding may cause an incorrect (or no) decision
It is important to consider that the KDD process is useless if the results are not understandable
5. Current Issues w/Data Visualization
- The literature suggests a significant reliance on expert users
- General lack of data visualization support in many data mining tools [Goebel99]
- These are significant problems if KDD/DM/Data Visualization will expand at the rates suggested
- Data visualization tool market 2.2 billion by 2007 [Nuttall03]
6. Suggested Direction
- Need to determine techniques that balance simplicity with completeness
- If this can be done for non-expert users
  - Simplicity + Completeness -> Understanding
  - Understanding -> Trust
  - Trust -> more use of KDD/DM
- Result will be
  - Better business value
  - Higher ROI
7. Common Visualization Techniques
- Visualization techniques dependent upon
  - The type of data mining technique chosen
  - The underlying structure and attributes of the data
- Classification
  - Decision Trees
  - Scatter Plots
  - Axis-Parallel Decision Trees
  - Circle Segments
  - Decision Tables
- Clustering
  - Scatter Plots
  - Dendrograms
  - Smoothed Data Histograms
  - Self-Organizing Maps
  - Proximity Matrices
8. Classification
9. Decision Tree
- Information limited to
- Attributes
- Splitting values
- Terminal node class assignments
10. Decision Tree with Histograms
- Data mining rarely classifies 100% of the data correctly
- Include the success of properly classifying the data
  - Histogram added for each terminal node
  - Percentage of data that was classified correctly/incorrectly
- Assists users in determining if the classification is good enough
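The per-node percentages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the leaf names and labels are made up, each terminal node is assumed to predict its majority class, and `leaf_accuracy` and `text_histogram` are hypothetical helper names rather than part of any tool discussed in these slides.

```python
from collections import Counter

def leaf_accuracy(leaf_labels):
    """Return {leaf: (predicted_class, pct_correct, pct_incorrect)}.

    leaf_labels maps a (hypothetical) leaf name to the true class labels
    of the records that reached that terminal node.
    """
    summary = {}
    for leaf, labels in leaf_labels.items():
        counts = Counter(labels)
        # Assumption: the leaf predicts its majority class.
        predicted, n_correct = counts.most_common(1)[0]
        pct_correct = 100.0 * n_correct / len(labels)
        summary[leaf] = (predicted, pct_correct, 100.0 - pct_correct)
    return summary

def text_histogram(pct_correct, width=20):
    """Draw a one-line bar: '#' for correct, '.' for incorrect."""
    filled = round(width * pct_correct / 100.0)
    return "#" * filled + "." * (width - filled)

# Made-up example data for two terminal nodes.
leaves = {
    "leaf_A": ["yes", "yes", "yes", "no"],
    "leaf_B": ["no", "no", "no", "no", "yes"],
}
for leaf, (cls, ok, bad) in leaf_accuracy(leaves).items():
    print(f"{leaf} -> {cls:3s} {text_histogram(ok)} "
          f"{ok:.0f}% correct, {bad:.0f}% incorrect")
```

A user can read each bar as a quick answer to whether that branch of the tree is good enough to act on.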
11. Decision Tree - Different Format
- Vertical representation: allows for easy user interaction
- Combines the split points and classification accuracy compactly
- Key difference: colors are matched with a specific classification
12. Scatter Plot with Regression Line
- Excellent way to view 2-dimensional data
- Familiar to anyone who has taken high-school algebra
- Regression lines provide descriptive techniques for classification
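As a minimal sketch of where such a regression line comes from, the ordinary least-squares fit for 2-dimensional data can be computed directly. The data points are made up and `fit_line` is a hypothetical helper name; a plotting library would then draw the line over the scatter plot.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```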
13. Axis-Parallel Decision Tree
- Combination of Scatter Plot and Decision Tree
- Plot area divided into regions by lines parallel to the axes
- Well suited for classification problems with two attributes
- High visibility into the impact of outliers
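The idea of splitting the plot into axis-parallel regions can be sketched as follows. The tree, thresholds, and class names are invented for illustration, and the nested-dict tree format is an assumption of this sketch, not a format used by any tool cited here. Each split tests a single attribute against a threshold, so every terminal node corresponds to a rectangle on the scatter plot.

```python
def classify(point, node):
    """Walk the tree: each split compares one attribute with a threshold."""
    while isinstance(node, dict):
        branch = "low" if point[node["axis"]] <= node["threshold"] else "high"
        node = node[branch]
    return node  # a leaf is just a class label

def regions(node, bounds):
    """Yield (bounds, class) rectangles; bounds = ((xmin, xmax), (ymin, ymax))."""
    if not isinstance(node, dict):
        yield bounds, node
        return
    axis, t = node["axis"], node["threshold"]
    lo, hi = bounds[axis]
    low_bounds, high_bounds = list(bounds), list(bounds)
    low_bounds[axis] = (lo, t)    # part of the rectangle at or below the split
    high_bounds[axis] = (t, hi)   # part of the rectangle above the split
    yield from regions(node["low"], tuple(low_bounds))
    yield from regions(node["high"], tuple(high_bounds))

# Hypothetical two-attribute tree: split on x at 5.0, then on y at 2.0.
tree = {
    "axis": 0, "threshold": 5.0,
    "low": "A",
    "high": {"axis": 1, "threshold": 2.0, "low": "B", "high": "A"},
}
for rect, label in regions(tree, ((0.0, 10.0), (0.0, 10.0))):
    print(rect, "->", label)
print(classify((7.0, 1.0), tree))
```

Drawing each rectangle behind the scatter points is what makes a misplaced outlier stand out against its region's class color.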
14. Circle Segments
- Multi-dimensional data
- Maps a dataset with n dimensions onto a circle divided into n segments
- Each segment is a different attribute
- Each pixel inside a segment is a single value of the attribute
- Values of each attribute are then sorted (independently) and assigned a different color based upon their class
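The sorting-and-coloring step can be sketched as below. This covers only the ordering of pixels within each segment; the actual placement of pixels inside a circle segment is not shown. The records, class labels, palette, and the name `segment_colors` are all assumptions made for illustration.

```python
def segment_colors(records, classes, palette):
    """For each attribute (segment), sort the records by that attribute's
    value and return the class colors in that order, one per pixel."""
    n_attrs = len(records[0])
    segments = []
    for a in range(n_attrs):
        # Each attribute is sorted independently of the others.
        order = sorted(range(len(records)), key=lambda i: records[i][a])
        segments.append([palette[classes[i]] for i in order])
    return segments

# Made-up data: three records, two attributes, two classes.
records = [(3, 10), (1, 30), (2, 20)]
classes = ["a", "b", "a"]
palette = {"a": "red", "b": "blue"}  # hypothetical class-to-color mapping

for a, colors in enumerate(segment_colors(records, classes, palette)):
    print(f"segment {a}: {colors}")
```

When an attribute separates the classes well, its segment shows long runs of a single color; a poorly separating attribute shows the colors interleaved.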
15. Decision Table
- Interactive technique
- Maps attribute data to a 2D hierarchical matrix
- Levels can be drilled down: another set of attributes
- Height of a cell conveys the number of data entities
- Cells color coded
  - Neutral color: no data in that intersection point
  - Color coded by class (percentage)
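The numbers behind one level of such a table can be sketched as a simple cross-tabulation. The attribute names, rows, and the helper `decision_table` are hypothetical; a real tool would map the count to cell height and the class percentages to color.

```python
from collections import Counter, defaultdict

def decision_table(rows, attr_x, attr_y, class_key):
    """Return {(x_value, y_value): (count, {class: percentage})}.

    Attribute combinations that never occur are simply absent, which
    corresponds to a neutral-colored cell in the display.
    """
    cells = defaultdict(Counter)
    for row in rows:
        cells[(row[attr_x], row[attr_y])][row[class_key]] += 1
    table = {}
    for key, counts in cells.items():
        total = sum(counts.values())
        table[key] = (total, {c: 100.0 * n / total for c, n in counts.items()})
    return table

# Made-up records with two attributes and a class label.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
]
for cell, (count, pct) in decision_table(rows, "outlook", "windy", "play").items():
    print(cell, count, pct)
```

Drilling down a level would amount to filtering the rows for one cell and building the same table over another pair of attributes.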
16. Decision Table
17. Clustering
18. Scatter Plot
- Extensions include:
  - Displaying points in various sizes and colors to indicate additional attributes
  - Shading of points to introduce a third dimension
  - Using different brightness levels of the same color to represent continuous values for the same attribute
  - Using various points or classification identifiers (e.g., numbers, symbols)
  - Using various glyphs to display additional attributes
19. Scatter Plot
- Map decision trees on top of scatter plots to describe clusters
20. Scatter Plot with Regression Lines
21. Scatter Plot w/Min Spanning Tree
22. Dendrogram
- Intuitive representation: hierarchical decomposition of data into sets of nested clusters
- From an agglomerative perspective
  - Each leaf: a single data entity
  - Each internal node: the union of all data entities in its sub-tree
  - The root: the entire dataset
  - The height of any internal node: the similarity between its children
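The structure a dendrogram draws can be sketched with a naive agglomerative procedure. This example assumes single linkage on made-up 1-D points, and records each merge height as a distance (so a lower height means more similar children); other linkages and similarity measures are equally valid, and `single_linkage` is a hypothetical helper name.

```python
def single_linkage(points):
    """Naive agglomerative clustering of 1-D points.

    Returns the merges in order as (left_members, right_members, height),
    where members are tuples of point indices and height is the smallest
    distance between any pair of points drawn from the two clusters.
    """
    clusters = [(i,) for i in range(len(points))]  # each leaf: one entity
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]  # internal node: union of children
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

points = [1.0, 2.0, 6.0, 7.5]  # made-up data
for left, right, height in single_linkage(points):
    print(f"merge {left} + {right} at height {height}")
```

The last merge is the root and contains every index, matching the description of the root as the entire dataset.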
23. Dendrogram with Exemplars
- The most typical member of each cluster [Wishart99]
- Underlined labels of the leaves
- Done in combination with shading to identify the clustering level
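One simple way to pick a "most typical" member is sketched below. It assumes the exemplar is the member with the smallest total distance to the other members of its cluster (a medoid); this is an assumption of the sketch and may differ from the definition used in [Wishart99]. The data and the name `exemplar` are made up.

```python
def exemplar(members, distance):
    """Return the member with the smallest total distance to all members.

    Assumption: 'most typical' is interpreted as the cluster medoid.
    """
    return min(members,
               key=lambda m: sum(distance(m, other) for other in members))

# Made-up 1-D cluster; absolute difference as the distance.
cluster = [1.0, 2.0, 3.0, 4.0, 10.0]
print(exemplar(cluster, lambda a, b: abs(a - b)))
```

In a dendrogram, the leaf label of the returned member would be the one underlined for that cluster.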
24. Smoothed Data Histogram
- Represents data on a display map
- Similar data items are located close to each other
- The more defined the clusters, the lighter the colors
25. Smoothed Data Histogram - Detail
26. Self-Organizing Map Grid
- Source of Smoothed Data Histogram
- Numbers indicate most common cluster
27. Proximity Matrix
- Graphically displays the relationship between data elements
- Usually symmetric, but can be sorted by the strength of relationships
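A minimal sketch of building and reordering a proximity matrix follows. It assumes Euclidean distance on made-up 2-D points and, as one possible reading of sorting by strength of relationships, orders elements by their total distance to all others. The helper names are hypothetical; `math.dist` requires Python 3.8 or later.

```python
import math

def proximity_matrix(points):
    """Pairwise Euclidean distances; symmetric with a zero diagonal."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]

def order_by_closeness(matrix):
    """Indices sorted by total distance to all other elements (assumed
    ordering criterion; closest-overall elements come first)."""
    return sorted(range(len(matrix)), key=lambda i: sum(matrix[i]))

def reorder(matrix, order):
    """Apply the same permutation to rows and columns."""
    return [[matrix[i][j] for j in order] for i in order]

points = [(0, 0), (3, 4), (0, 1)]  # made-up data
matrix = proximity_matrix(points)
order = order_by_closeness(matrix)
for row in reorder(matrix, order):
    print(["%.2f" % value for value in row])
```

Applying one permutation to both rows and columns keeps the matrix symmetric while moving strongly related elements next to each other.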
28. Proximity Matrix and Dendrogram
29. Summary
- Data visualization techniques are extremely important for understanding the KDD process
- A balance of simplicity and completeness is important
- The techniques discussed allow average users to understand the results of the KDD process
- Understanding -> KDD results can be interpreted/trusted by non-expert users -> extending the business value
- If data visualization techniques do not establish a high level of trust in the KDD process, the process will fail
30. Future Direction
- Significant effort will be spent on improving data visualization techniques in the next few years
  - KDD process and data mining are becoming more widespread
  - Businesses will expect tools to become more user-friendly and support varied levels of skill
- Trends are moving to a more interactive mode
  - Static reporting techniques (e.g., standard decision trees, standard circle segments) are being replaced
  - Interactive techniques (e.g., smoothed data histograms, interactive circle segments and decision tables)
  - Very interactive data models, such as virtual reality, are also being considered/proposed
31. References Part 1
- Ahlberg, C., Spotfire: An Information Exploration Environment, ACM SIGMOD Record, Volume 25, Number 4, December 1996
- Ankerst, M., et al., Visual Classification: An Interactive Approach to Decision Tree Construction, KDD-99, San Diego, CA
- Ankerst, M., et al., Towards an Effective Cooperation of the User and the Computer for Classification, KDD00, Boston, MA, USA
- Apte, C. and Weiss, S. M., Data Mining with Decision Trees and Decision Rules, Future Generation Computer Systems, November 1997
- Arkin, E., et al., Decision Trees for Geometric Models, ACM, 9th Annual Computational Geometry, 5/93/CA, USA
- de Haan, G., et al., Towards Intuitive Exploration Tools for Data Visualization in VR, VRST02, November 11-13, 2002, Hong Kong
- Dunham, M. H., Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003
- Fekete, J. and Plaisant, C., Excentric Labeling: Dynamic Neighborhood Labeling for Data Visualization, Proceedings of the Conference on Human Factors in Computing Systems (CHI'99), ACM, New York
- Fredrikson, A., et al., Temporal, Geographical and Categorical Aggregations Viewed through Coordinated Displays: A Case Study with Highway Incident Data, NPIVM99, Kansas City, MO, 1999
- Goebel, M. and Gruenwald, L., A Survey of Data Mining and Knowledge Discovery Software Tools, SIGKDD Explorations, June 1999
- Han, J. and Cercone, N., RuleViz: A Model for Visualizing Knowledge Discovery Process, Sixth International Conference on Knowledge Discovery and Data Mining, 2000
- Ho, T., et al., Visualization Support for a User-Centered KDD Process, SIGKDD02, 2002
32. References Part 2
- Hsieh, H. and Shipman, F. M. III, VITE: A Visual Interface Supporting the Direct Manipulation of Structured Data Using Two-Way Mappings, IUI 2000, New Orleans, LA
- Solving Business Problems with IBM DB2 Intelligent Miner, Presented by DB2 Developer Domain, http://www7b.software.ibm.com/dmdd
- Jain, A. K., et al., Data Clustering: A Review, ACM Computing Surveys, Volume 31, Number 3, September 1999
- Keim, D. A., Visual Techniques for Exploring Databases, KDD97, Newport Beach, CA, 1997
- Kohavi, R. and Sommerfield, D., Targeting Business Users with Decision Table Classifiers, KDD98, New York City, 1998
- Kohavi, R., et al., Emerging Trends in Business Analytics, Communications of the ACM, Volume 45, Number 8, August 2002
- Liu, B., et al., Clustering Through Decision Tree Construction, CIKM 2000, ACM, McLean, VA, 2000
- Louie, J. Q. and Kraay, T., Origami: A New Visualization Tool, KDD-99, San Diego, CA
- Moret, B. M. E., Decision Trees and Diagrams, Computing Surveys, Volume 14, Number 4, December 1982
- Nuttall, C., "It's a Vision Thing", Financial Times-IT Review, November 12, 2003
- Pampalk, E., et al., Using Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps, Proceedings of the International Conference on Artificial Neural Networks (ICANN02), Springer Lecture Notes in Computer Science, Madrid, Spain, 2002
- Pampalk, E., et al., Content-based Organization and Visualization of Music Archives, Proceedings of the 10th ACM International Conference on Multimedia (MM02), Juan-les-Pins, France, 2002
- Pampalk, E., et al., A New Approach to Hierarchical Clustering and Structuring of Data with Self-Organizing Maps, Intelligent Data Analysis Journal (IDA), Volume 8, Number 2, 2003
33. References Part 3
- Rauber, A., et al., Empirical Evaluation of Clustering Algorithms, Journal of Information and Organizational Sciences (JIOS), Volume 24, Number 2, 2000
- Finding the Solution to Data Mining: Exploring the Features and Components of Enterprise Miner, Release 4.1 from SAS, SAS White Paper, 2001
- See5 - Data Mining Tools, Release 1.9, Rulequest Research, 1997-2003
- Simoff, S. J., VDM@ECML/PKDD2001: The International Workshop on Visual Data Mining at ECML/PKDD 2001, SIGKDD Explorations, Volume 3, Issue 2, 2001
- Thearling, K., Understanding Data Mining: It's All in the Interaction, DS Star, The On-Line Executive Journal for Data-Intensive Decision Support, Volume 1, Number 10, December 9, 1997
- Thearling, K., et al., Visualizing Data Mining Models, as published in Information Visualization in Data Mining and Knowledge Discovery, edited by Fayyad, Usama, et al., Morgan Kaufmann, 2001
- Ward, M. O., XmdvTool: Integrating Multiple Methods for Visualizing Multivariate Data, Proceedings of IEEE Visualization '94 (Washington, DC, 1994)
- Wishart, D., Efficient Hierarchical Cluster Analysis for Data Mining and Knowledge Discovery, Computing Science and Statistics, Volume 30, 1998
- Wishart, D., ClustanGraphics3: Interactive Graphics for Cluster Analysis, published in Classification in the Information Age, Gaul, W. and Locarek-Junge, H. (Eds.), Springer, 1999
- XmdvTool Home Page (http://davis.wpi.edu/xmdv/visualizations.html)