Visualization for Classification and Clustering Techniques - PowerPoint PPT Presentation


Slides: 34
Provided by: GTE2

Transcript and Presenter's Notes

Title: Visualization for Classification and Clustering Techniques


1
Visualization for Classification and Clustering
Techniques
  • Marc René
  • CSE 8331
  • Data Mining - Project 1

2
Overview
  • Importance of Data Visualization in the KDD
    Process
  • Understanding and Trust
  • Visualization techniques
  • Classification
  • Clustering
  • Future Directions

3
KDD Process
  • Selection: obtain data from all sources
  • Preprocessing: after selecting the data, clean it
    to make sure it is consistent
  • Transformation: after preprocessing the data,
    analyze the format/amount of data
  • Data Mining: once the data is in a usable
    format/content, apply various algorithms based
    upon the results to be achieved
  • Interpretation/Evaluation: finally, present the
    results of the data mining step to the user, so
    that the results can be used to solve the
    business need at hand

4
Importance of Data Visualization
  • The final step in the KDD process
  • Highly dependent on the Data Visualization
    technique
  • Bad/inappropriate technique may result in
    misunderstanding
  • Misunderstanding may cause an incorrect (or no)
    decision

It is important to consider that the KDD process
is useless if the results are not understandable.

5
Current Issues w/Data Visualization
  • The literature suggests a significant reliance on
    expert users
  • General lack of data visualization support in
    many data mining tools [Goebel99]
  • These are significant problems if KDD/DM/Data
    Visualization is to expand at the rates suggested
  • Data visualization tool market projected to
    reach 2.2 billion by 2007 [Nuttall03]

6
Suggested Direction
  • Need to determine techniques that balance
    simplicity with completeness
  • If this can be done for non-expert users
  • Simplicity + Completeness → Understanding
  • Understanding → Trust
  • Trust → more use of KDD/DM
  • Result will be
  • Better business value
  • Higher ROI

7
Common Visualization Techniques
  • Visualization techniques depend upon
  • The type of data mining technique chosen
  • The underlying structure and attributes of the
    data

    Classification                  Clustering
    Decision Trees                  Scatter Plots
    Scatter Plots                   Dendrograms
    Axis-Parallel Decision Trees    Smoothed Data Histograms
    Circle Segments                 Self-Organizing Maps
    Decision Tables                 Proximity Matrices

8
Classification
9
Decision Tree
  • Information limited to
  • Attributes
  • Splitting values
  • Terminal node class assignments

10
Decision Tree with Histograms
  • Data mining rarely classifies 100% of the data
    correctly
  • Include the success of properly classifying the
    data - histogram added for each terminal node
  • Percentage of data that was classified
    correctly/incorrectly
  • Assists users in determining if the
    classification is good enough
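
The per-node histogram described above can be sketched in a few lines. This is only an illustration under stated assumptions: the leaf names, the (leaf, predicted, actual) record layout, and the text-bar rendering are hypothetical and not taken from the slides.

```python
from collections import defaultdict


def leaf_accuracy(records):
    """Return {leaf: (correct, incorrect)} from (leaf, predicted, actual) triples."""
    counts = defaultdict(lambda: [0, 0])
    for leaf, predicted, actual in records:
        # Index 0 counts correct classifications, index 1 incorrect ones.
        counts[leaf][0 if predicted == actual else 1] += 1
    return {leaf: tuple(c) for leaf, c in counts.items()}


def render(summary, width=20):
    """Draw one text bar per terminal node showing the share classified correctly."""
    lines = []
    for leaf in sorted(summary):
        correct, incorrect = summary[leaf]
        share = correct / (correct + incorrect)
        filled = round(share * width)
        bar = "#" * filled + "." * (width - filled)
        lines.append(f"{leaf:<8} {bar} {share:.0%} correct")
    return lines


# Hypothetical records: which leaf a row landed in, the leaf's class, the true class.
records = [
    ("leaf_A", "yes", "yes"), ("leaf_A", "yes", "yes"),
    ("leaf_A", "yes", "no"), ("leaf_A", "yes", "yes"),
    ("leaf_B", "no", "no"), ("leaf_B", "no", "yes"),
]
summary = leaf_accuracy(records)
for line in render(summary):
    print(line)
```

A user can then decide per terminal node, rather than for the tree as a whole, whether the classification is good enough.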

11
Decision Tree - Different Format
  • Vertical representation - allows for easy user
    interaction
  • Combines the split points and classification
    accuracy - compactly
  • Key difference - colors are matched with a
    specific classification

12
Scatter Plot with Regression Line
  • Excellent way to view 2-dimensional data
  • Familiar to anyone who has taken high-school
    algebra
  • Regression lines provide descriptive techniques
    for classification
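
As a rough sketch of the idea, a least-squares line can be fitted to the 2-dimensional points and each point described by the side of the line it falls on. The sample points and the helper names fit_line and side_of_line are assumptions made for this example.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def side_of_line(point, slope, intercept):
    """Describe a point by whether it lies above or below the fitted line."""
    x, y = point
    return "above" if y > slope * x + intercept else "below"


# Hypothetical 2-dimensional data lying on y = 2x + 1.
points = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
slope, intercept = fit_line(points)
print(slope, intercept)
print(side_of_line((2.0, 8.0), slope, intercept))
```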

13
Axis-Parallel Decision Tree
  • Combination Scatter Plot and Decision Tree
  • Areas divided in parallel regions on the axis
  • Well suited for classification problems with two
    attribute values
  • High visibility into the impact of outliers
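
One way to picture the axis-parallel regions is as nested threshold tests on the two attributes. The sketch below assumes a hypothetical tree encoded as (axis, threshold, low_branch, high_branch) tuples with class names as leaves; none of these values come from the slides.

```python
def classify(node, point):
    """Walk axis-parallel splits until a class label (a string leaf) is reached."""
    while isinstance(node, tuple):
        axis, threshold, low, high = node
        # Each split is a line parallel to one axis of the scatter plot.
        node = low if point[axis] <= threshold else high
    return node


# Hypothetical tree over two attributes (x = axis 0, y = axis 1):
# x <= 5 is class "A"; otherwise y <= 3 is class "B" and y > 3 is class "C".
tree = (0, 5.0, "A", (1, 3.0, "B", "C"))

print(classify(tree, (2.0, 9.0)))
print(classify(tree, (7.0, 1.0)))
print(classify(tree, (7.0, 4.0)))
```

Each leaf corresponds to one rectangular region of the plot, which is why an outlier shows up as a point sitting in a region of the wrong class.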

14
Circle Segments
  • Multi-dimension data
  • Maps dataset with n dimensions onto a circle
    divided by n segments
  • Each segment is a different attribute
  • Each pixel inside a segment is a single value of
    the attribute
  • Values of each attribute are then sorted
    (independently) and each is assigned a color
    based upon its class
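
The sort-then-color step can be sketched without any drawing code: for each attribute, order the records by that attribute's value and read off the class colors in that order. The rows, the color names, and the function name segment_orders are hypothetical; laying the pixels out inside a circle segment is left out.

```python
def segment_orders(rows, labels):
    """For each attribute, return the class colors in ascending order of that attribute."""
    segments = []
    for attr in range(len(rows[0])):
        # Every attribute (segment) is sorted independently of the others.
        order = sorted(range(len(rows)), key=lambda i: rows[i][attr])
        segments.append([labels[i] for i in order])
    return segments


# Hypothetical dataset with n = 2 attributes and a class color per record.
rows = [(1.0, 9.0), (2.0, 7.0), (3.0, 8.0), (4.0, 1.0)]
labels = ["red", "red", "blue", "blue"]

segments = segment_orders(rows, labels)
for attr, colors in enumerate(segments):
    print(attr, colors)
```

In this made-up data the first attribute separates the classes cleanly (the colors form two solid runs), while the second does not (the colors alternate).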

15
Decision Table
  • Interactive technique
  • Maps attribute data to a 2D hierarchical matrix
  • Levels can be drilled down - another set of
    attributes
  • Height of a cell conveys the number of data
    entities
  • Cells color coded
  • Neutral color - no data in that intersection
    point
  • Color coded by class (percentage)
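
A minimal sketch of the cell contents behind such a table, assuming two hypothetical attributes and a class column: each cell keeps the number of data entities (the cell height) and the class percentages (the color coding). Drill-down and rendering are not shown.

```python
from collections import Counter, defaultdict


def decision_table(rows):
    """Map (attr_a, attr_b) to (entity count, {class: share}) for each non-empty cell."""
    cells = defaultdict(Counter)
    for attr_a, attr_b, cls in rows:
        cells[(attr_a, attr_b)][cls] += 1
    table = {}
    for key, counts in cells.items():
        total = sum(counts.values())
        table[key] = (total, {cls: n / total for cls, n in counts.items()})
    return table


# Hypothetical records: (age group, income band, class).
rows = [
    ("young", "low", "buy"), ("young", "low", "buy"),
    ("young", "low", "skip"), ("old", "high", "skip"),
]
table = decision_table(rows)
print(table[("young", "low")])
# A missing key corresponds to the neutral-colored cell: no data at that intersection.
print(table.get(("old", "low")))
```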

16
Decision Table
17
Clustering
18
Scatter Plot
  • Extensions include displaying points with
  • Various sizes and colors to indicate additional
    attributes
  • Shading of points to introduce a third dimension
  • Using different brightness levels of the same
    color to represent continuous values for the same
    attribute
  • Using various point markers or classification
    identifiers (e.g., numbers, symbols)
  • Using various glyphs to display additional
    attributes

19
Scatter Plot
  • Map decision trees on top of scatter plots to
    describe clusters

20
Scatter Plot with Regression Lines
21
Scatter Plot w/Min Spanning Tree
22
Dendrogram
  • Intuitive representation - hierarchical
    decomposition of data into sets of nested
    clusters.
  • From an agglomerative perspective
  • Each leaf - a single data entity
  • Each internal node - the union of all data
    entities in its sub-tree
  • The root - the entire dataset
  • The height of any internal node - the level of
    similarity (merge distance) between its children.
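
The agglomerative view can be sketched with a small single-linkage routine that records, for every merge, the two child clusters and the height at which they join. Single linkage, the 1-dimensional sample points, and the use of distance as the height are assumptions chosen for this example; the slides do not name a linkage method.

```python
def single_linkage(points):
    """Agglomerative clustering; returns a list of (left, right, height) merges."""
    clusters = [[i] for i in range(len(points))]  # each leaf is a single data entity
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]  # internal node = union of its sub-tree
        del clusters[j]
    return merges


points = [1.0, 2.0, 6.0, 7.5, 20.0]
merges = single_linkage(points)
for left, right, height in merges:
    print(left, right, height)
```

The last merge is the root and covers the entire dataset; its height is the largest in the tree.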

23
Dendrogram with Exemplars
  • The most typical member of each cluster
    [Wishart99]
  • Underlined labels of the leaves
  • Done in combination with shading to identify the
    clustering level
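
One possible reading of "most typical member" is the medoid, the member with the smallest total distance to the rest of its cluster. That choice is an assumption made for this sketch and is not necessarily the definition used in [Wishart99]; the sample values are hypothetical.

```python
def exemplar(cluster):
    """Return the member with the smallest total distance to all other members."""
    return min(cluster, key=lambda c: sum(abs(c - other) for other in cluster))


# Hypothetical 1-dimensional cluster with one far-away member.
cluster = [1.0, 2.0, 2.5, 4.0, 9.0]
print(exemplar(cluster))
```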

24
Smoothed Data Histogram
  • Represents data on a display map
  • Similar data items are located close to each
    other
  • The more defined the clusters, the lighter the
    colors

25
Smoothed Data Histogram - Detail
26
Self-Organizing Map Grid
  • Source of Smoothed Data Histogram
  • Numbers indicate most common cluster

27
Proximity Matrix
  • Graphically display the relationship between data
    elements
  • Usually symmetric, but can be sorted by the
    strength of relationships
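
A minimal sketch of the underlying matrix, assuming absolute difference between 1-dimensional values as the proximity measure. The sample points and the helper order_by_closeness are hypothetical.

```python
def proximity_matrix(points):
    """Pairwise distances; entry [i][j] equals entry [j][i], so the matrix is symmetric."""
    return [[abs(a - b) for b in points] for a in points]


def order_by_closeness(matrix, anchor=0):
    """Order element indices from the strongest to the weakest relationship with anchor."""
    return sorted(range(len(matrix)), key=lambda j: matrix[anchor][j])


points = [0.0, 3.0, 1.0]
matrix = proximity_matrix(points)
for row in matrix:
    print(row)
print(order_by_closeness(matrix))
```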

28
Proximity Matrix and Dendrogram
29
Summary
  • Data visualization techniques are extremely
    important for understanding the KDD process
  • A balance of simplicity and completeness is
    important
  • The techniques discussed allow average users to
    understand the results of the KDD process
  • Understanding → KDD results to be
    interpreted/trusted by non-expert users →
    extending the business value
  • If data visualization techniques do not establish
    a high level of trust in the KDD process, the
    process will fail

30
Future Direction
  • Significant effort will be spent on improving
    data visualization techniques in the next few
    years
  • KDD process and data mining are becoming more
    widespread
  • Business will expect tools to become more
    user-friendly and support the varied level of
    skills
  • Trends are moving to a more interactive mode
  • Static reporting techniques (e.g., standard
    decision trees, standard circle segments) are
    being replaced by interactive techniques (e.g.,
    smoothed data histograms, interactive circle
    segments and decision tables)
  • Very interactive data models, such as virtual
    reality, are also being considered/proposed

31
References - Part 1
  • Ahlberg, C., "Spotfire: An Information
    Exploration Environment", ACM SIGMOD Record,
    Volume 25, Number 4, December 1996
  • Ankerst, M., et al., "Visual Classification: An
    Interactive Approach to Decision Tree
    Construction", KDD-99, San Diego, CA
  • Ankerst, M., et al., "Towards an Effective
    Cooperation of the User and the Computer for
    Classification", KDD-2000, Boston, MA, USA
  • Apte, C. and Weiss, S. M., "Data Mining with
    Decision Trees and Decision Rules", Future
    Generation Computer Systems, November 1997
  • Arkin, E., et al., "Decision Trees for
    Geometric Models", ACM, 9th Annual Symposium on
    Computational Geometry, May 1993, CA, USA
  • de Haan, G., et al., "Towards Intuitive
    Exploration Tools for Data Visualization in VR",
    VRST'02, November 11-13, 2002, Hong Kong
  • Dunham, M. H., "Data Mining: Introductory and
    Advanced Topics", Prentice Hall, 2003
  • Fekete, J. and Plaisant, C., "Excentric Labeling:
    Dynamic Neighborhood Labeling for Data
    Visualization", Proceedings of the Conference on
    Human Factors in Computing Systems (CHI'99), ACM,
    New York
  • Fredrikson, A., et al., "Temporal, Geographical
    and Categorical Aggregations Viewed through
    Coordinated Displays: A Case Study with Highway
    Incident Data", NPIVM'99, Kansas City, MO, 1999
  • Goebel, M. and Gruenwald, L., "A Survey of Data
    Mining and Knowledge Discovery Software Tools",
    SIGKDD Explorations, June 1999
  • Han, J. and Cercone, N., "RuleViz: A Model for
    Visualizing Knowledge Discovery Process", Sixth
    International Conference on Knowledge Discovery
    and Data Mining, 2000
  • Ho, T., et al., "Visualization Support for a
    User-Centered KDD Process", SIGKDD'02, 2002

32
References - Part 2
  • Hsieh, H. and Shipman, F. M. III, "VITE: A Visual
    Interface Supporting the Direct Manipulation of
    Structured Data Using Two-Way Mappings", IUI
    2000, New Orleans, LA
  • "Solving Business Problems with IBM DB2
    Intelligent Miner", presented by DB2 Developer
    Domain, http://www7b.software.ibm.com/dmdd
  • Jain, A. K., et al., "Data Clustering: A
    Review", ACM Computing Surveys, Volume 31, Number
    3, September 1999
  • Keim, D. A., "Visual Techniques for Exploring
    Databases", KDD-97, Newport Beach, CA, 1997
  • Kohavi, R. and Sommerfield, D., "Targeting
    Business Users with Decision Table Classifiers",
    KDD-98, New York City, 1998
  • Kohavi, R., et al., "Emerging Trends in
    Business Analytics", Communications of the ACM,
    Volume 45, Number 8, August 2002
  • Liu, B., et al., "Clustering Through Decision
    Tree Construction", CIKM 2000, ACM, McLean, VA,
    2000
  • Louie, J. Q. and Kraay, T., "Origami: A New
    Visualization Tool", KDD-99, San Diego, CA
  • Moret, B. M. E., "Decision Trees and Diagrams",
    Computing Surveys, Volume 14, Number 4, December
    1982
  • Nuttall, C., "It's a Vision Thing", Financial
    Times-IT Review, November 12, 2003
  • Pampalk, E., et al., "Using Smoothed Data
    Histograms for Cluster Visualization in
    Self-Organizing Maps", Proceedings of the
    International Conference on Artificial Neural
    Networks (ICANN'02), Springer Lecture Notes in
    Computer Science, Madrid, Spain, 2002
  • Pampalk, E., et al., "Content-based Organization
    and Visualization of Music Archives", Proceedings
    of the 10th ACM International Conference on
    Multimedia (MM'02), Juan-les-Pins, France, 2002
  • Pampalk, E., et al., "A New Approach to
    Hierarchical Clustering and Structuring of Data
    with Self-Organizing Maps", Intelligent Data
    Analysis Journal (IDA), Volume 8, Number 2, 2003

33
References - Part 3
  • Rauber, A., et al., "Empirical Evaluation of
    Clustering Algorithms", Journal of Information
    and Organizational Sciences (JIOS), Volume 24,
    Number 2, 2000
  • "Finding the Solution to Data Mining: Exploring
    the Features and Components of Enterprise Miner,
    Release 4.1 from SAS", SAS White Paper, 2001
  • See5 - Data Mining Tools, Release 1.9, Rulequest
    Research, 1997-2003
  • Simoff, S. J., "VDM@ECML/PKDD2001: The
    International Workshop on Visual Data Mining at
    ECML/PKDD 2001", SIGKDD Explorations, Volume 3,
    Issue 2, 2001
  • Thearling, K., "Understanding Data Mining: It's
    All in the Interaction", DS Star, The On-Line
    Executive Journal for Data-Intensive Decision
    Support, Volume 1, Number 10, December 9, 1997
  • Thearling, K., et al., "Visualizing Data Mining
    Models", as published in Information
    Visualization in Data Mining and Knowledge
    Discovery, edited by Fayyad, Usama, et al.,
    Morgan Kaufmann, 2001
  • Ward, M. O., "XmdvTool: Integrating Multiple
    Methods for Visualizing Multivariate Data",
    Proceedings of IEEE Visualization '94
    (Washington, DC, 1994)
  • Wishart, D., "Efficient Hierarchical Cluster
    Analysis for Data Mining and Knowledge
    Discovery", Computing Science and Statistics,
    Volume 30, 1998
  • Wishart, D., "ClustanGraphics3: Interactive
    Graphics for Cluster Analysis", published in
    Classification in the Information Age, Gaul, W.
    and Locarek-Junge, H. (Eds.), Springer, 1999
  • XmdvTool Home Page
    (http://davis.wpi.edu/xmdv/visualizations.html)