Main Concepts of Data Mining Introduction to Data Preprocessing - PowerPoint PPT Presentation

About This Presentation
Title:

Main Concepts of Data Mining Introduction to Data Preprocessing

Description:

Institute for Medical Informatics and Biometry. University of Rostock, Germany ... have served various purposes in biomedical CBR systems, among which to organize ... – PowerPoint PPT presentation

Slides: 79
Provided by: isabellebi

Transcript and Presenter's Notes

Title: Main Concepts of Data Mining Introduction to Data Preprocessing


1
Main Concepts of Data Mining: Introduction to Data
Preprocessing
2
Learning Objectives
  • Study some examples of data mining systems
  • Understand why to preprocess the data.
  • Understand how to understand the data
    (descriptive data summarization)

3
Acknowledgements
  • Some of these slides are adapted from Jiawei Han
    and Micheline Kamber

4
Learning Objectives
  • Study some examples of data mining systems
  • Understand why to preprocess the data.
  • Understand how to understand the data
    (descriptive data summarization)

5
Data Mining: Confluence of Multiple Disciplines
Database Technology
Statistics
Data Mining
Machine Learning
Visualization
Information Science
Other Disciplines
6
Data Mining Classification Schemes
  • General functionality
  • Descriptive data mining
  • Predictive data mining
  • Different views, different classifications
  • Kinds of databases to be mined
  • Kinds of knowledge to be discovered
  • Kinds of techniques utilized
  • Kinds of applications adapted

7
Major Issues in Data Mining (1)
  • Mining methodology and user interaction
  • Mining different kinds of knowledge in databases
  • Interactive mining of knowledge at multiple
    levels of abstraction
  • Incorporation of background knowledge
  • Data mining query languages and ad-hoc data
    mining
  • Expression and visualization of data mining
    results
  • Handling noise and incomplete data
  • Pattern evaluation: the interestingness problem
  • Performance and scalability
  • Efficiency and scalability of data mining
    algorithms
  • Parallel, distributed and incremental mining
    methods

8
Major Issues in Data Mining (2)
  • Issues relating to the diversity of data types
  • Handling relational and complex types of data
  • Mining information from heterogeneous databases
    and global information systems (WWW)
  • Issues related to applications and social impacts
  • Application of discovered knowledge
  • Domain-specific data mining tools
  • Intelligent query answering
  • Process control and decision making
  • Integration of the discovered knowledge with
    existing knowledge: a knowledge fusion problem
  • Protection of data security, integrity, and
    privacy

9
Main Concepts in Data Mining
  • Data mining: discovering interesting patterns
    from large amounts of data
  • A natural evolution of database technology, in
    great demand, with wide applications
  • A KDD process includes data cleaning, data
    integration, data selection, transformation, data
    mining, pattern evaluation, and knowledge
    presentation
  • Mining can be performed in a variety of
    information repositories
  • Data mining functionalities: characterization,
    discrimination, association, classification,
    clustering, outlier and trend analysis, etc.
  • Classification of data mining systems
  • Major issues in data mining

10
Case-Based Reasoning
  • Case-based reasoning (CBR)
  • Problem-solving method from artificial
    intelligence (AI) that proposes to reuse
    previously solved and memorized problem
    situations, called cases
  • Instance-based method from machine learning
  • Can be used for classification/prediction tasks
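
A minimal sketch of the retrieve, reuse, and retain steps for a classification task, assuming numeric feature vectors, Euclidean distance, and a single nearest neighbor. The case base, labels, and function names are hypothetical, and the revise step is omitted for brevity.

```python
import math

# Hypothetical case base: each case is (feature vector, solution label).
case_base = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
]

def retrieve(query, cases):
    """Retrieve: return the stored case closest to the query (Euclidean distance)."""
    return min(cases, key=lambda case: math.dist(query, case[0]))

def solve(query, cases):
    """Reuse: copy the retrieved case's solution unchanged (no adaptation)."""
    _, solution = retrieve(query, cases)
    return solution

def retain(query, solution, cases):
    """Retain: store the newly solved case so later queries can reuse it."""
    cases.append((query, solution))

new_problem = (4.5, 5.2)
proposed = solve(new_problem, case_base)
retain(new_problem, proposed, case_base)
print(proposed, len(case_base))
```

Copying the nearest case's solution is the simplest possible reuse strategy; a real system would usually adapt the solution and have it checked before retaining it.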

11
Case-Based Reasoning
[Figure: the CBR cycle. Labels: Problem, User interface, Target case, New case, Retrieve, Retrieved case, Reuse, Solved case, Revise, Tested case, Retain, Case base (previous cases), Solution, Interpretation]
12
Fifth Workshop on Case-Based Reasoning in the
Health Sciences
  • Isabelle Bichindaritz
  • University of Washington, Tacoma, Washington, USA
  • ibichind_at_u.washington.edu
  • Stefania Montani
  • University of Piemonte Orientale, Italy
    stefania.montani_at_unipmn.it

13
Workshop Stats
  • Papers accepted: 10 papers
  • Attendees: 19 participants
  • Good news !!!

14
Workshop Goals
  • Provide a forum for identifying important
    contributions and opportunities for research on
    the application of CBR to the Health Sciences
  • Promote the systematic study of how to apply CBR
    to the Health Sciences
  • Showcase applications of CBR in the Health
    Sciences

15
A CBR Solution for Missing Medical Data
Olga Vorobieva and Rainer Schmidt, Institute for
Medical Informatics and Biometry, University of
Rostock, Germany; Alexander Rumiantzev, Pavlov
State Medical University, St. Petersburg, Russia
16
Summary
  • Application domain: dialysis medicine; effects of
    fitness on dialysis
  • System context: ISOR, a CBR system that explains
    the exceptional cases, those for which fitness
    does not improve renal function
  • Task / problem addressed: restoration of missing
    data
  • Research hypothesis: case-based reasoning can be
    applied to restore missing data in a dataset/case
    base
  • Main contribution: synergy between CBR and
    statistics (statistical modeling).

17
(No Transcript)
18
A Case-Based Reasoning Approach to Dose Planning
in Radiotherapy
Xueyan Song (1), Sanja Petrovic (1), and Santhanam Sundar (2)
(1) Automated Scheduling, Optimisation and Planning Group, School of Computer Science, University of Nottingham, UK
(2) Dept. of Oncology, City Hospital Campus, Nottingham University Hospitals NHS Trust, Nottingham, UK
19
Summary
  • Application domain: dose planning in radiotherapy
    for prostate cancer
  • System context: trade-off between the benefit in
    terms of cancer control and the risk in terms of
    harmful side effects to neighboring tissues
  • Task / problem addressed: planning problem;
    designing a radiotherapy dose plan
  • Research hypothesis: case-based reasoning can be
    applied to propose dose plans
  • Main contribution: fuzzy representation of
    attribute values and similarity measure; fusion of
    similar cases by Dempster-Shafer theory.

20
(No Transcript)
21
On-Line Domain Knowledge Management for
Case-Based Medical Recommendation
Amélie Cordier (1), Béatrice Fuchs (1), Jean Lieber (2), and Alain Mille (1)
(1) LIRIS CNRS, UMR 5202, Université Lyon 1, INSA Lyon, Université Lyon 2, ECL 43, bd du 11 Novembre 1918, Villeurbanne Cedex, France; Amelie.Cordier, Beatrice.Fuchs, Alain.Mille_at_liris.cnrs.fr
(2) LORIA (UMR 7503 CNRS-INRIA-Nancy Universities), BP 239, 54506 Vandoeuvre-lès-Nancy, France; Jean.Lieber_at_loria.fr
22
Summary
  • Application domain: breast cancer treatment
  • System context: Kasimir is a knowledge management
    and decision-support system in oncology focusing
    on case-based protocol treatment recommendations
  • Task / problem addressed: planning problem;
    recommending a treatment plan based on a protocol
  • Research hypotheses: conservative adaptation is
    recommended for adapting a protocol to a new case
    through case-based reasoning; new domain knowledge
    can be acquired by analysis of failures
  • Main contribution: improvement of adaptation;
    method for learning from failures of the
    case-based reasoning.

23
(No Transcript)
24
Concepts for Novelty Detection and Handling
based on Case-Based Reasoning
Petra Perner, Institute of Computer Vision and
applied Computer Sciences, IBaI
25
Summary
  • Application domain: Hep-2 cell image
    interpretation
  • System context: case-based image interpretation
  • Task / problem addressed: classification problem;
    improve recognition of over 30 different nuclear
    and cytoplasmic patterns when patterns change
    over time or new patterns emerge
  • Research hypothesis: case-based reasoning can be
    applied to the problem of novelty detection and
    also of concept drift
  • Main contribution: novel application for CBR;
    detecting novelty, detecting concept drift.

26
(No Transcript)
27
Similarity of Medical Cases in Health Care Using
Cosine Similarity and Ontology
Shahina Begum, Mobyen Uddin Ahmed, Peter Funk,
Ning Xiong, Bo von Schéele; Mälardalen University,
Department of Computer Science and Electronics, PO
Box 883, SE-721 23 Västerås, Sweden
firstname.lastname_at_mdh.se
28
Summary
  • Application domain: any medical domain
  • System context: electronic medical records
  • Task / problem addressed: retrieval task; finding
    similar cases represented with structured and
    semi-structured data
  • Research hypothesis: a hybrid similarity measure
    combining the cosine similarity measure, an
    ontology, and the nearest neighbor method permits
    successful retrieval of similar cases
  • Main contribution: synergy between case-based
    reasoning and information retrieval.
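
To make the cosine similarity component concrete, here is a minimal sketch that compares two free-text case descriptions by their term-frequency vectors. It is an illustration only, not the authors' hybrid measure: whitespace tokenization is an assumption, the function name is hypothetical, and the ontology and nearest neighbor parts are left out.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two texts' term-frequency vectors."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

print(cosine_similarity("chest pain and fever", "fever and headache"))
```

The score ranges from 0 (no shared terms) to 1 (identical term proportions), which makes it convenient for ranking retrieved cases.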

29
(No Transcript)
30
Towards Case-Based Reasoning for Diabetes
Management
Cindy Marling (1), Jay Shubrook (2), and Frank Schwartz (2)
(1) School of Electrical Engineering and Computer Science, Russ College of Engineering and Technology, Ohio University, Athens, Ohio 45701, USA; marling_at_ohio.edu
(2) Appalachian Rural Health Institute, Diabetes and Endocrine Center, College of Osteopathic Medicine, Ohio University, Athens, Ohio 45701, USA; shubrook_at_ohio.edu, schwartf_at_ohio.edu
31
Summary
  • Application domain: type I diabetes management
  • System context: real-time monitoring of glucose
    level through insulin pump
  • Task / problem addressed: treatment planning;
    adjusting insulin dosage
  • Research hypothesis: case-based reasoning can
    adjust insulin dosage in real time; cases required
    for the future CBR system can be acquired through
    an online Web-based interface
  • Main contribution: planning the development of a
    case-based reasoning system for automatic type I
    diabetes monitoring.

32
Hypothetico-Deductive Case-Based Reasoning
David McSherry, School of Computing and
Information Engineering, University of Ulster,
Northern Ireland
33
Summary
  • Application domain: contact lenses classification
  • System context: conversational CBR
  • Task / problem addressed: classification problem;
    recommending type of contact lenses
  • Research hypothesis: a hypothetico-deductive CBR
    approach to test selection can minimize the
    number of tests required to confirm a hypothesis
    proposed by the system or user
  • Main contribution: synergy between case-based
    reasoning and hypothetico-deductive reasoning;
    explanations in CBR.

34
(No Transcript)
35
Other Papers Summaries
  • Case-based Reasoning for managing non-compliance
    with clinical guidelines, Stefania Montani,
    University of Piemonte Orientale, Alessandria,
    Italy. A CBR system able to:
  • Retrieve similar past episodes (cases) of
    non-compliance to guidelines, to be suggested to
    the physician
  • Learn more general indications from ground
    non-compliance cases, adoptable for a formal GL
    revision by an experts committee
  • CBR for Temporal Abstractions Configuration in
    Haemodialysis, Leonardi Giorgio, Bottrighi
    Alessio, Portinale Luigi, Montani Stefania,
    University of Piemonte Orientale, Alessandria,
    Italy. A CBR system able to choose the appropriate
    parameters for the configuration of temporal
    abstractions in the medical domain of
    haemodialysis

36
Other Papers Summaries
  • Prototypical Cases for Knowledge Maintenance in
    Biomedical CBR, Isabelle Bichindaritz, University
    of Washington, Tacoma, WA, USA. Prototypical cases
    have served various purposes in biomedical CBR
    systems, among which to organize and structure
    the memory, to guide the retrieval as well as the
    reuse of cases, and to bootstrap a CBR system's
    memory when real cases are not available in
    sufficient quantity and/or quality. Knowledge
    maintenance is yet another role that these
    prototypical cases can play in biomedical CBR
    systems.

37
Discussion
  • Trends and issues
  • Integration of CBR with electronic patient
    records and/or in clinical practice (Begum et
    al., Marling et al.)
  • Importance of prototypical cases (Bichindaritz)
  • Incompleteness / non-reliability of cases or CBR
    system knowledge (Vorobieva et al., Cordier et
    al., Bichindaritz)
  • Novel domains of applications for CBR (Perner,
    Leonardi et al., Montani)
  • Need for synergy with other AI methods (Song et
    al., McSherry)

38
Discussion
  • Pearls of wisdom
  • Remember Occam's razor: introducing complexity
    in CBR should be carefully justified
  • Knowledge in medical cases / domain knowledge is
    often questionable; finding methods for dealing
    with this reality is essential for the
    development of CBR in biomedical domains
  • CBR can be promoted as the methodology of choice
    for evidence gathering in evidence-based medicine

39
Future Plans
  • A second special issue on CBR in the Health
    Sciences, based on papers from this Fifth
    Workshop on CBR in the Health Sciences, is going
    to be published in Computational Intelligence.
  • The Web site (version 1.beta) and mailing list
    for our research group are now live:
    http://www.cbr-health.org and
    http://www.cbr-biomed.org

40
(No Transcript)
41
(No Transcript)
42
Learning Objectives
  • Study some examples of data mining systems
  • Understand why to preprocess the data.
  • Understand how to understand the data
    (descriptive data summarization)

43
Why Data Preprocessing?
  • Data mining aims at discovering relationships and
    other forms of knowledge from data in the real
    world.
  • Data map entities in the application domain to
    symbolic representation through a measurement
    function.
  • Data in the real world is dirty:
  • incomplete: missing data, lacking attribute
    values, lacking certain attributes of interest,
    or containing only aggregate data
  • noisy: containing errors, such as measurement
    errors, or outliers
  • inconsistent: containing discrepancies in codes
    or names
  • distorted: sampling distortion
  • No quality data, no quality mining results!
    (GIGO)
  • Quality decisions must be based on quality data
  • Data warehouse needs consistent integration of
    quality data

44
Multi-Dimensional Measure of Data Quality
  • Data quality is multidimensional
  • Accuracy
  • Preciseness (reliability)
  • Completeness
  • Consistency
  • Timeliness
  • Believability (validity)
  • Value added
  • Interpretability
  • Accessibility
  • Broad categories
  • intrinsic, contextual, representational, and
    accessibility.

45
Major Tasks in Data Preprocessing
  • Data cleaning
  • Fill in missing values, smooth noisy data,
    identify or remove outliers, and resolve
    inconsistencies and errors
  • Data integration
  • Integration of multiple databases, data cubes, or
    files
  • Data transformation
  • Normalization and aggregation
  • Data reduction
  • Obtains reduced representation in volume but
    produces the same or similar analytical results
  • Data discretization
  • Part of data reduction but with particular
    importance, especially for numerical data
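
Three of these tasks can be sketched in a few lines. The example below assumes a single numeric attribute with missing entries marked as None; mean imputation, min-max scaling, and equal-width binning are each just one common choice among many, and all function names are hypothetical.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Data discretization: map each value to the index of an equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [20, None, 40, 60]             # made-up data with one missing value
cleaned = fill_missing_with_mean(ages)
scaled = min_max_normalize(cleaned)
bins = equal_width_bins(cleaned, 2)
print(cleaned, scaled, bins)
```

Mean imputation is simple but shrinks the variance of the attribute, so it is worth comparing against other strategies before relying on it.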

46
Forms of data preprocessing
47
Learning Objectives
  • Study some examples of data mining systems
  • Understand why to preprocess the data.
  • Understand how to understand the data
    (descriptive data summarization)

48
Mining Data Descriptive Characteristics
  • Motivation
  • To better understand the data: central tendency,
    variation and spread
  • Data dispersion characteristics
  • median, max, min, quantiles, outliers, variance,
    etc.
  • Numerical dimensions correspond to sorted
    intervals
  • Data dispersion analyzed with multiple
    granularities of precision
  • Boxplot or quantile analysis on sorted intervals
  • Dispersion analysis on computed measures
  • Folding measures into numerical dimensions
  • Boxplot or quantile analysis on the transformed
    cube

49
Measuring the Central Tendency
  • Mean (algebraic measure) (sample vs. population)
  • Weighted arithmetic mean
  • Trimmed mean: chopping extreme values
  • Median: a holistic measure
  • Middle value if odd number of values, or average
    of the middle two values otherwise
  • Estimated by interpolation (for grouped data)
  • Mode
  • Value that occurs most frequently in the data
  • Unimodal, bimodal, trimodal
  • Empirical formula (moderately skewed unimodal
    data): mean − mode ≈ 3 × (mean − median)
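
The measures above can be computed with Python's standard statistics module. This is a small sketch on made-up values; weighted_mean and trimmed_mean are hypothetical helpers, and the trimming rule (drop the same fraction from each end) is one convention among several.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end of the sorted data."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # made-up values

mean = statistics.mean(data)         # 58
median = statistics.median(data)     # 54.0, average of the two middle values
modes = statistics.multimode(data)   # [52, 70], so the data is bimodal
trimmed = trimmed_mean(data, 0.1)    # drops 30 and 110
print(mean, median, modes, trimmed)
```

The single large value 110 pulls the mean above the median, while the trimmed mean sits between the two.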

50
Symmetric vs. Skewed Data
symmetric
  • Median, mean and mode of symmetric, positively
    and negatively skewed data

positively skewed
negatively skewed
51
Measuring the Dispersion of Data
  • Quartiles, outliers and boxplots
  • Quartiles: Q1 (25th percentile), Q3 (75th
    percentile)
  • Inter-quartile range: IQR = Q3 − Q1
  • Five-number summary: min, Q1, M, Q3, max
  • Boxplot: the ends of the box are the quartiles,
    the median is marked, whiskers are added, and
    outliers are plotted individually
  • Outlier: usually, a value more than 1.5 × IQR
    above Q3 or below Q1
  • Variance and standard deviation (sample: s,
    population: σ)
  • Variance (algebraic, scalable computation)
  • Standard deviation s (or σ) is the square root of
    the variance s² (or σ²)
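
A short sketch of these dispersion measures on made-up values, using the standard statistics module. Quartile conventions differ between textbooks and tools; the "inclusive" interpolation method used here is an assumption, as is the usual rule of flagging values more than 1.5 × IQR beyond the quartiles.

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # made-up values

# Quartiles by linear interpolation, treating the data as the whole population.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
five_number = (min(data), q1, median, q3, max(data))

# Values beyond these fences are flagged as outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

sample_var = statistics.variance(data)       # divides by n - 1
population_var = statistics.pvariance(data)  # divides by n
sample_sd = statistics.stdev(data)           # square root of sample_var

print(five_number, iqr, outliers)
```

On this data the quartiles are 49.25 and 64.75, so the fences are 26.0 and 88.0 and only the value 110 is flagged.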

52
Boxplot Analysis
  • Five-number summary of a distribution
  • Minimum, Q1, M, Q3, Maximum
  • Boxplot
  • Data is represented with a box
  • The ends of the box are at the first and third
    quartiles, i.e., the height of the box is IQR
  • The median is marked by a line within the box
  • Whiskers: two lines outside the box extend to
    Minimum and Maximum

53
Visualization of Data Dispersion 3-D Boxplots
54
Properties of Normal Distribution Curve
  • The normal (distribution) curve
  • From µ − σ to µ + σ: contains about 68% of the
    measurements (µ: mean, σ: standard deviation)
  • From µ − 2σ to µ + 2σ: contains about 95% of them
  • From µ − 3σ to µ + 3σ: contains about 99.7% of them
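
These percentages follow from the normal cumulative distribution: the probability of falling within k standard deviations of the mean is erf(k / √2). A quick check with the standard math module (the function name is hypothetical):

```python
import math

def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * fraction_within(k), 1))
```

The result does not depend on the particular mean or standard deviation, which is why the rule applies to any normal curve.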

55
Graphic Displays of Basic Statistical Descriptions
  • Boxplot: graphic display of the five-number
    summary
  • Histogram: the x-axis shows values, the y-axis
    represents frequencies
  • Quantile plot: each value xi is paired with fi,
    indicating that approximately 100 fi % of the
    data are ≤ xi
  • Quantile-quantile (q-q) plot: graphs the
    quantiles of one univariate distribution against
    the corresponding quantiles of another
  • Scatter plot: each pair of values is a pair of
    coordinates and plotted as points in the plane
  • Loess (local regression) curve: add a smooth
    curve to a scatter plot to provide better
    perception of the pattern of dependence

56
Histogram Analysis
  • Graph displays of basic statistical class
    descriptions
  • Frequency histograms
  • A univariate graphical method
  • Consists of a set of rectangles that reflect the
    counts or frequencies of the classes present in
    the given data
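
A minimal sketch of the counting step behind a frequency histogram, assuming numeric data grouped into equal-width intervals. The function name, the bucket width, and the data are hypothetical.

```python
from collections import Counter

def frequency_histogram(values, width):
    """Count how many values fall into each interval [k*width, (k+1)*width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

prices = [1, 3, 5, 8, 12, 14, 15, 21]   # made-up values
hist = frequency_histogram(prices, 10)

# Draw one bar per interval, keyed by the interval's lower bound.
for lower, count in hist.items():
    print(f"{lower:>3}-{lower + width_label if (width_label := 9) else 0:<3} {'#' * count}")
```

Each rectangle's height is the count for its interval; dividing by the number of values would give relative frequencies instead.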

57
Histograms Often Tell More than Boxplots
  • The two histograms shown on the left may have the
    same boxplot representation
  • The same values for min, Q1, median, Q3, max
  • But they have rather different data distributions

58
Quantile Plot
  • Displays all of the data (allowing the user to
    assess both the overall behavior and unusual
    occurrences)
  • Plots quantile information
  • For data xi sorted in increasing order, fi
    indicates that approximately 100 fi % of the data
    are below or equal to the value xi
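
A sketch of how the plotted pairs can be computed. The slide does not state how fi is defined; the convention fi = (i − 0.5) / n used here is a common choice and should be read as an assumption, as should the function name.

```python
def quantile_plot_points(data):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, for i = 1..n."""
    ordered = sorted(data)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

points = quantile_plot_points([40, 10, 30, 20])  # made-up values
for f, x in points:
    print(f"{f:.3f} {x}")
```

Plotting x against f gives the quantile plot; every observation appears, so unusual values remain visible.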

59
Quantile-Quantile (Q-Q) Plot
  • Graphs the quantiles of one univariate
    distribution against the corresponding quantiles
    of another
  • Allows the user to view whether there is a shift
    in going from one distribution to another
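
A sketch of the pairing behind a q-q plot, assuming both samples are summarized at the same quantile levels with the standard statistics module. The function name, the choice of quartiles, and the "inclusive" interpolation method are assumptions.

```python
import statistics

def qq_points(sample_a, sample_b, n_quantiles=4):
    """Pair matching quantiles of two samples: (quantile of a, quantile of b)."""
    qa = statistics.quantiles(sample_a, n=n_quantiles, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n_quantiles, method="inclusive")
    return list(zip(qa, qb))

branch_1 = [1, 2, 3, 4, 5]        # made-up values
branch_2 = [11, 12, 13, 14, 15]   # the same values shifted up by 10
pairs = qq_points(branch_1, branch_2)
print(pairs)
```

Here every point lies exactly 10 above the line y = x, which is how a constant shift between two distributions shows up in the plot.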

60
Scatter plot
  • Provides a first look at bivariate data to see
    clusters of points, outliers, etc
  • Each pair of values is treated as a pair of
    coordinates and plotted as points in the plane

61
Loess Curve
  • Adds a smooth curve to a scatter plot in order to
    provide better perception of the pattern of
    dependence
  • Loess curve is fitted by setting two parameters:
    a smoothing parameter, and the degree of the
    polynomials that are fitted by the regression
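
A simplified sketch of the idea, assuming local polynomials of degree 1 and tricube weights. Here span plays the role of the smoothing parameter (the fraction of points used for each local fit); the function names are hypothetical, and production implementations add refinements such as robustness iterations.

```python
def loess_point(x0, xs, ys, span=0.5):
    """Fit a weighted straight line around x0 and return its value at x0."""
    n = len(xs)
    k = max(2, int(round(span * n)))      # number of neighbours considered
    dists = sorted(abs(x - x0) for x in xs)
    h = max(dists[k - 1], 1e-12)          # bandwidth: distance to the k-th nearest point
    # Tricube weights: 1 at x0, falling smoothly to 0 at distance h.
    ws = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(ws)
    if sw == 0:
        return sum(ys) / n                # degenerate case: fall back to the plain mean
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

def loess(xs, ys, span=0.5):
    """Smoothed value at every observed x."""
    return [loess_point(x, xs, ys, span) for x in xs]

xs = list(range(10))
ys = [2 * x + 1 for x in xs]              # made-up, perfectly linear data
smooth = loess(xs, ys, span=0.5)
print([round(v, 3) for v in smooth])
```

A larger span averages over more neighbours and gives a smoother curve; a smaller span follows the data more closely.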

62
Positively and Negatively Correlated Data
  • The left half fragment is positively correlated
  • The right half is negatively correlated
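
The direction and strength of a linear relationship can be quantified with the Pearson correlation coefficient. The sketch below uses made-up points; pearson_r is a hypothetical helper written out by hand so that each step is visible.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rising = pearson_r([1, 2, 3], [2, 4, 6])           # close to +1
falling = pearson_r([1, 2, 3], [6, 4, 2])          # close to -1
flat = pearson_r([1, 2, 3, 4], [1, -1, -1, 1])     # 0: no linear relationship
print(rising, falling, flat)
```

A coefficient of zero rules out only a linear relationship; the last example follows a clear U shape.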

63
Not Correlated Data
64
Data Visualization and Its Methods
  • Why data visualization?
  • Gain insight into an information space by mapping
    data onto graphical primitives
  • Provide qualitative overview of large data sets
  • Search for patterns, trends, structure,
    irregularities, relationships among data
  • Help find interesting regions and suitable
    parameters for further quantitative analysis
  • Provide a visual proof of computer
    representations derived
  • Typical visualization methods
  • Geometric techniques
  • Icon-based techniques
  • Hierarchical techniques

65
Direct Data Visualization
Ribbons with Twists Based on Vorticity
66
Geometric Techniques
  • Visualization of geometric transformations and
    projections of the data
  • Methods
  • Landscapes
  • Projection pursuit technique
  • Finding meaningful projections of
    multidimensional data
  • Scatterplot matrices
  • Prosection views
  • Hyperslice
  • Parallel coordinates

67
Scatterplot Matrices
Used by permission of M. Ward, Worcester
Polytechnic Institute
  • Matrix of scatterplots (x-y-diagrams) of the
    k-dim. data: total of (k² − k)/2 distinct
    scatterplots
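
Counting each unordered pair of attributes once gives k(k − 1)/2 distinct panels. A quick enumeration with hypothetical attribute names:

```python
from itertools import combinations

attributes = ["age", "income", "education", "height"]  # hypothetical attribute names
k = len(attributes)

# One scatterplot per unordered pair of attributes.
panels = list(combinations(attributes, 2))
print(len(panels), k * (k - 1) // 2)
for x_attr, y_attr in panels:
    print(f"{x_attr} vs {y_attr}")
```

A full k × k matrix shows each pair twice, mirrored across the diagonal, which is why only half of the off-diagonal panels are distinct.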

68
Landscapes
news articles visualized as a landscape
Used by permission of B. Wright, Visible
Decisions Inc.
  • Visualization of the data as perspective
    landscape
  • The data needs to be transformed into a (possibly
    artificial) 2D spatial representation which
    preserves the characteristics of the data

69
Parallel Coordinates
  • n equidistant axes which are parallel to one of
    the screen axes and correspond to the attributes
  • The axes are scaled to the [minimum, maximum]
    range of the corresponding attribute
  • Every data item corresponds to a polygonal line
    which intersects each of the axes at the point
    which corresponds to the value for the attribute
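
A sketch of the geometry only, with no drawing: each attribute is scaled to its own [minimum, maximum] range and each data item becomes a list of polyline vertices. The function name and the sample rows are hypothetical.

```python
def parallel_coordinates(rows):
    """For every row, return polyline vertices (axis index, height in [0, 1])."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    lines = []
    for row in rows:
        line = []
        for axis, value in enumerate(row):
            span = highs[axis] - lows[axis]
            # A constant attribute has no range; place it mid-axis.
            height = (value - lows[axis]) / span if span else 0.5
            line.append((axis, height))
        lines.append(line)
    return lines

rows = [(1, 10, 100), (3, 20, 300), (2, 15, 100)]  # made-up data items
lines = parallel_coordinates(rows)
for line in lines:
    print(line)
```

Connecting the vertices of one row from axis to axis gives that item's polygonal line.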

70
Parallel Coordinates of a Data Set
71
Icon-based Techniques
  • Visualization of the data values as features of
    icons
  • Methods
  • Chernoff Faces
  • Stick Figures
  • Shape Coding
  • Color Icons
  • TileBars: the use of small icons representing the
    relevance feature vectors in document retrieval

72
Chernoff Faces
  • A way to display variables on a two-dimensional
    surface, e.g., let x be eyebrow slant, y be eye
    size, z be nose length, etc.
  • The figure shows faces produced using 10
    characteristics (head eccentricity, eye size, eye
    spacing, eye eccentricity, pupil size, eyebrow
    slant, nose size, mouth shape, mouth size, and
    mouth opening), each assigned one of 10 possible
    values, generated using Mathematica (S. Dickson)
  • REFERENCE: Gonick, L. and Smith, W. The Cartoon
    Guide to Statistics. New York: Harper Perennial,
    p. 212, 1993
  • Weisstein, Eric W. "Chernoff Face." From
    MathWorld--A Wolfram Web Resource.
    mathworld.wolfram.com/ChernoffFace.html

73
Stick Figures

census data showing age, income, gender,
education, etc.
Used by permission of G. Grinstein, University of
Massachusetts at Lowell
74
Hierarchical Techniques
  • Visualization of the data using a hierarchical
    partitioning into subspaces.
  • Methods
  • Dimensional Stacking
  • Worlds-within-Worlds
  • Treemap
  • Cone Trees
  • InfoCube

75
Dimensional Stacking
  • Partitioning of the n-dimensional attribute space
    in 2-D subspaces which are stacked into each
    other
  • Partitioning of the attribute value ranges into
    classes; the important attributes should be used
    on the outer levels
  • Adequate for data with ordinal attributes of low
    cardinality
  • But, difficult to display more than nine
    dimensions
  • Important to map dimensions appropriately

76
Dimensional Stacking
Used by permission of M. Ward, Worcester
Polytechnic Institute
Visualization of oil mining data with longitude
and latitude mapped to the outer x-, y-axes and
ore grade and depth mapped to the inner x-, y-axes
77
Tree-Map
  • Screen-filling method which uses a hierarchical
    partitioning of the screen into regions depending
    on the attribute values
  • The x- and y-dimension of the screen are
    partitioned alternately according to the
    attribute values (classes)
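
A minimal sketch of this alternating partitioning (often called slice-and-dice), assuming each node is a (name, size, children) tuple and that a parent's size equals the sum of its children's sizes. The tree, the function name, and the 100 × 100 screen are hypothetical.

```python
def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Return (name, x, y, width, height) rectangles for all leaves of the tree."""
    if out is None:
        out = []
    name, size, children = node
    if not children:
        out.append((name, x, y, w, h))
        return out
    total = sum(child[1] for child in children)
    offset = 0.0
    for child in children:
        share = child[1] / total
        if depth % 2 == 0:
            # Even depth: cut the rectangle along the x-dimension.
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:
            # Odd depth: cut the rectangle along the y-dimension.
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out

# Hypothetical file-system-like tree: sizes could be bytes on disk.
tree = ("root", 100, [
    ("a", 50, []),
    ("b", 50, [("c", 25, []), ("d", 25, [])]),
])
rects = slice_and_dice(tree, 0, 0, 100, 100)
for rect in rects:
    print(rect)
```

Each leaf's area is proportional to its size, so the leaf rectangles together fill the whole screen.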

MSR Netscan Image
78
Tree-Map of a File System (Shneiderman)