Title: Main Concepts of Data Mining: Introduction to Data Preprocessing
1 Main Concepts of Data Mining: Introduction to Data Preprocessing
2 Learning Objectives
- Study some examples of data mining systems
- Understand why data must be preprocessed
- Understand how to explore and summarize the data (descriptive data summarization)
3 Acknowledgements
- Some of these slides are adapted from Jiawei Han and Micheline Kamber
5 Data Mining: Confluence of Multiple Disciplines
[Diagram: data mining at the center, drawing on database technology, statistics, machine learning, visualization, information science, and other disciplines]
6 Data Mining: Classification Schemes
- General functionality
- Descriptive data mining
- Predictive data mining
- Different views, different classifications
- Kinds of databases to be mined
- Kinds of knowledge to be discovered
- Kinds of techniques utilized
- Kinds of applications adapted
7 Major Issues in Data Mining (1)
- Mining methodology and user interaction
- Mining different kinds of knowledge in databases
- Interactive mining of knowledge at multiple levels of abstraction
- Incorporation of background knowledge
- Data mining query languages and ad hoc data mining
- Expression and visualization of data mining results
- Handling noise and incomplete data
- Pattern evaluation: the interestingness problem
- Performance and scalability
- Efficiency and scalability of data mining algorithms
- Parallel, distributed, and incremental mining methods
8 Major Issues in Data Mining (2)
- Issues relating to the diversity of data types
- Handling relational and complex types of data
- Mining information from heterogeneous databases and global information systems (WWW)
- Issues related to applications and social impacts
- Application of discovered knowledge
- Domain-specific data mining tools
- Intelligent query answering
- Process control and decision making
- Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
- Protection of data security, integrity, and privacy
9 Main Concepts in Data Mining
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Classification of data mining systems
- Major issues in data mining
10 Case-Based Reasoning
- Case-based reasoning (CBR)
- Problem-solving method from artificial intelligence (AI) that proposes to reuse previously solved and memorized problem situations, called cases
- Instance-based method from machine learning
- Can be used for classification/prediction tasks
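11 Case-Based Reasoning
[Diagram: the CBR cycle. A problem arrives through the user interface and is interpreted as a new (target) case. Retrieve matches it against the previous cases in the case base to obtain a retrieved case, Reuse yields a solved case, Revise yields a tested case, and Retain stores it in the case base; the solution is returned to the user.]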
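The four steps of the cycle can be sketched in a few lines. This is a minimal illustration, not the method of any system discussed in these slides: the `Case` class, the Euclidean distance, the copy-the-solution reuse step, and the two toy cases are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Case:
    features: tuple  # numeric problem description (hypothetical encoding)
    solution: str    # label or decision attached to the case


def retrieve(case_base, target):
    """Retrieve: return the stored case closest to the target (Euclidean distance)."""
    return min(case_base, key=lambda case: dist(case.features, target))


def reuse(retrieved, target):
    """Reuse: copy the retrieved solution onto the new problem unchanged."""
    return Case(features=tuple(target), solution=retrieved.solution)


def revise(solved, confirmed_solution=None):
    """Revise: replace the proposed solution if external feedback corrects it."""
    if confirmed_solution is not None:
        return Case(solved.features, confirmed_solution)
    return solved


def retain(case_base, tested):
    """Retain: store the tested case so it can be retrieved later."""
    case_base.append(tested)


# Toy case base, invented for illustration only.
case_base = [Case((1.0, 1.0), "low risk"), Case((8.0, 9.0), "high risk")]
target = (7.0, 8.5)

solved = reuse(retrieve(case_base, target), target)
tested = revise(solved)
retain(case_base, tested)
print(tested.solution, len(case_base))  # high risk 3
```

Real systems usually replace the plain distance with a domain-specific similarity measure and adapt the retrieved solution instead of copying it.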
12 Fifth Workshop on Case-Based Reasoning in the Health Sciences
- Isabelle Bichindaritz
- University of Washington, Tacoma, Washington, USA
- ibichind_at_u.washington.edu
- Stefania Montani
- University of Piemonte Orientale, Italy
- stefania.montani_at_unipmn.it
13 Workshop Stats
- Papers accepted: 10
- Attendees: 19 participants
- Good news!
14 Workshop Goals
- Provide a forum for identifying important contributions and opportunities for research on the application of CBR to the Health Sciences
- Promote the systematic study of how to apply CBR to the Health Sciences
- Showcase applications of CBR in the Health Sciences
15 A CBR Solution for Missing Medical Data
Olga Vorobieva and Rainer Schmidt, Institute for Medical Informatics and Biometry, University of Rostock, Germany
Alexander Rumiantzev, Pavlov State Medical University, St. Petersburg, Russia
16 Summary
- Application domain: dialysis medicine, effects of fitness on dialysis
- System context: ISOR, a CBR system that explains the exceptional cases, those for which fitness does not improve renal function
- Task / problem addressed: restoration of missing data
- Research hypothesis: case-based reasoning can be applied to restore missing data in a dataset/case base
- Main contribution: synergy between CBR and statistics (statistical modeling).
18 A Case-Based Reasoning Approach to Dose Planning in Radiotherapy
Xueyan Song (1), Sanja Petrovic (1), and Santhanam Sundar (2)
(1) Automated Scheduling, Optimisation and Planning Group, School of Computer Science, University of Nottingham, UK
(2) Dept. of Oncology, City Hospital Campus, Nottingham University Hospitals NHS Trust, Nottingham, UK
19 Summary
- Application domain: dose planning in radiotherapy for prostate cancer
- System context: trade-off between the benefit in terms of cancer control and the risk in terms of harmful side effects to neighboring tissues
- Task / problem addressed: planning problem, designing a radiotherapy dose plan
- Research hypothesis: case-based reasoning can be applied to propose dose plans
- Main contribution: fuzzy representation of attribute values and similarity measure; fusion of similar cases by Dempster-Shafer theory.
21 On-Line Domain Knowledge Management for Case-Based Medical Recommendation
Amélie Cordier (1), Béatrice Fuchs (1), Jean Lieber (2), and Alain Mille (1)
(1) LIRIS CNRS, UMR 5202, Université Lyon 1, INSA Lyon, Université Lyon 2, ECL; 43, bd du 11 Novembre 1918, Villeurbanne Cedex, France; Amelie.Cordier, Beatrice.Fuchs, Alain.Mille_at_liris.cnrs.fr
(2) LORIA (UMR 7503 CNRS, INRIA, Nancy Universities), BP 239, 54506 Vandoeuvre-lès-Nancy, France; Jean.Lieber_at_loria.fr
22 Summary
- Application domain: breast cancer treatment
- System context: Kasimir is a knowledge management and decision-support system in oncology focusing on case-based protocol treatment recommendations
- Task / problem addressed: planning problem, recommending a treatment plan based on a protocol
- Research hypotheses: conservative adaptation is recommended for adapting a protocol to a new case through case-based reasoning; new domain knowledge can be acquired by analysis of failures
- Main contribution: improvement of adaptation; method for learning from failures of the case-based reasoning.
24 Concepts for Novelty Detection and Handling based on Case-Based Reasoning
Petra Perner, Institute of Computer Vision and applied Computer Sciences, IBaI
25 Summary
- Application domain: Hep-2 cell image interpretation
- System context: case-based image interpretation
- Task / problem addressed: classification problem, improve recognition of over 30 different nuclear and cytoplasmic patterns when patterns change over time or new patterns emerge
- Research hypothesis: case-based reasoning can be applied to the problem of novelty detection and also of concept drift
- Main contribution: novel application for CBR, detecting novelty and detecting concept drift.
27 Similarity of Medical Cases in Health Care Using Cosine Similarity and Ontology
Shahina Begum, Mobyen Uddin Ahmed, Peter Funk, Ning Xiong, Bo von Schéele
Mälardalen University, Department of Computer Science and Electronics, PO Box 883, SE-721 23 Västerås, Sweden
firstname.lastname_at_mdh.se
28 Summary
- Application domain: any medical domain
- System context: electronic medical records
- Task / problem addressed: retrieval task, finding similar cases represented with structured and semi-structured data
- Research hypothesis: a hybrid similarity measure combining the cosine similarity measure, an ontology, and the nearest neighbor method can successfully retrieve similar cases
- Main contribution: synergy between case-based reasoning and information retrieval.
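The cosine similarity part of such a retrieval step can be illustrated as follows. This is only a sketch under simplifying assumptions: cases are plain text, terms are whitespace-separated words, and the ontology component of the hybrid measure is left out entirely. The case texts and names are invented.

```python
from collections import Counter
from math import sqrt


def vectorize(text):
    """Turn free text into a sparse term-count vector (bag of words)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical case descriptions, for illustration only.
cases = {
    "case_a": "chronic stress elevated heart rate",
    "case_b": "fractured wrist after fall",
}
query = vectorize("patient reports stress and elevated heart rate")

# Nearest-neighbor retrieval: rank stored cases by similarity to the query.
ranked = sorted(
    cases,
    key=lambda name: cosine_similarity(query, vectorize(cases[name])),
    reverse=True,
)
print(ranked[0])  # case_a
```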
30 Towards Case-Based Reasoning for Diabetes Management
Cindy Marling (1), Jay Shubrook (2), and Frank Schwartz (2)
(1) School of Electrical Engineering and Computer Science, Russ College of Engineering and Technology, Ohio University, Athens, Ohio 45701, USA; marling_at_ohio.edu
(2) Appalachian Rural Health Institute, Diabetes and Endocrine Center, College of Osteopathic Medicine, Ohio University, Athens, Ohio 45701, USA; shubrook_at_ohio.edu, schwartf_at_ohio.edu
31 Summary
- Application domain: type I diabetes management
- System context: real-time monitoring of glucose level through insulin pump
- Task / problem addressed: treatment planning, adjusting insulin dosage
- Research hypothesis: case-based reasoning can adjust insulin dosage in real time; cases required for the future CBR system can be acquired through an online Web-based interface
- Main contribution: planning the development of a case-based reasoning system for automatic type I diabetes monitoring.
32 Hypothetico-Deductive Case-Based Reasoning
David McSherry, School of Computing and Information Engineering, University of Ulster, Northern Ireland
33 Summary
- Application domain: contact lenses classification
- System context: conversational CBR
- Task / problem addressed: classification problem, recommending type of contact lenses
- Research hypothesis: a hypothetico-deductive CBR approach to test selection can minimize the number of tests required to confirm a hypothesis proposed by the system or user
- Main contribution: synergy between case-based reasoning and hypothetico-deductive reasoning; explanations in CBR.
35 Other Papers: Summaries
- Case-based Reasoning for managing non-compliance with clinical guidelines, Stefania Montani, University of Piemonte Orientale, Alessandria, Italy. A CBR system able to:
- Retrieve similar past episodes (cases) of non-compliance with guidelines, to be suggested to the physician
- Learn more general indications from ground non-compliance cases, which can be adopted for a formal guideline (GL) revision by an expert committee
- CBR for Temporal Abstractions Configuration in Haemodialysis, Leonardi Giorgio, Bottrighi Alessio, Portinale Luigi, Montani Stefania, University of Piemonte Orientale, Alessandria, Italy. A CBR system able to choose the appropriate parameters for the configuration of temporal abstractions in the medical domain of haemodialysis
36 Other Papers: Summaries
- Prototypical Cases for Knowledge Maintenance in Biomedical CBR, Isabelle Bichindaritz, University of Washington, Tacoma, WA, USA. Prototypical cases have served various purposes in biomedical CBR systems: organizing and structuring the memory, guiding the retrieval and the reuse of cases, and bootstrapping a CBR system memory when real cases are not available in sufficient quantity and/or quality. Knowledge maintenance is yet another role that these prototypical cases can play in biomedical CBR systems
37 Discussion
- Trends and issues
- Integration of CBR with electronic patient records and/or in clinical practice (Begum et al., Marling et al.)
- Importance of prototypical cases (Bichindaritz)
- Incompleteness / non-reliability of cases or CBR system knowledge (Vorobieva et al., Cordier et al., Bichindaritz)
- Novel domains of applications for CBR (Perner, Leonardi et al., Montani)
- Need for synergy with other AI methods (Song et al., McSherry)
38 Discussion
- Pearls of wisdom
- Remember Occam's razor: introducing complexity in CBR should be carefully justified
- Knowledge in medical cases / domain knowledge is often questionable: finding methods for dealing with this reality is essential for the development of CBR in biomedical domains
- CBR can be promoted as the methodology of choice for evidence gathering in evidence-based medicine
39 Future Plans
- A second special issue on CBR in the Health Sciences, based on papers from this Fifth Workshop on CBR in the Health Sciences, is going to be published in Computational Intelligence.
- The Web-site (version 1.beta) and mailing list for our research group are now live: http://www.cbr-health.org and http://www.cbr-biomed.org
43 Why Data Preprocessing?
- Data mining aims at discovering relationships and other forms of knowledge from data in the real world.
- Data map entities in the application domain to symbolic representation through a measurement function.
- Data in the real world is dirty
- incomplete: missing data, lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- noisy: containing errors, such as measurement errors, or outliers
- inconsistent: containing discrepancies in codes or names
- distorted: sampling distortion
- No quality data, no quality mining results! (GIGO: garbage in, garbage out)
- Quality decisions must be based on quality data
- Data warehouse needs consistent integration of quality data
44 Multi-Dimensional Measure of Data Quality
- Data quality is multidimensional
- Accuracy
- Preciseness (reliability)
- Completeness
- Consistency
- Timeliness
- Believability (validity)
- Value added
- Interpretability
- Accessibility
- Broad categories
- intrinsic, contextual, representational, and
accessibility.
45 Major Tasks in Data Preprocessing
- Data cleaning
- Fill in missing values, smooth noisy data,
identify or remove outliers, and resolve
inconsistencies and errors
- Data integration
- Integration of multiple databases, data cubes, or
files
- Data transformation
- Normalization and aggregation
- Data reduction
- Obtains reduced representation in volume but
produces the same or similar analytical results
- Data discretization
- Part of data reduction but with particular
importance, especially for numerical data
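Three of the tasks above can be shown on a single numeric attribute. The sketch below is one possible choice for each task, not the only one: mean imputation for cleaning, min-max scaling for transformation, and equal-width binning for discretization. The function names and the `ages` values are hypothetical.

```python
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Data discretization: map each value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for constant data
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, None, 40, 60]              # made-up attribute with one missing value
cleaned = fill_missing(ages)           # [20, 40, 40, 60]
scaled = min_max_normalize(cleaned)    # [0.0, 0.5, 0.5, 1.0]
bins = equal_width_bins(cleaned, 2)    # [0, 1, 1, 1]
print(cleaned, scaled, bins)
```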
46 Forms of Data Preprocessing
48 Mining Data Descriptive Characteristics
- Motivation
- To better understand the data: central tendency, variation and spread
- Data dispersion characteristics
- median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
- Data dispersion: analyzed with multiple granularities of precision
- Boxplot or quantile analysis on sorted intervals
- Dispersion analysis on computed measures
- Folding measures into numerical dimensions
- Boxplot or quantile analysis on the transformed cube
49 Measuring the Central Tendency
- Mean (algebraic measure) (sample vs. population)
- Weighted arithmetic mean
- Trimmed mean: chopping extreme values
- Median: a holistic measure
- Middle value if odd number of values, or average of the middle two values otherwise
- Estimated by interpolation (for grouped data)
- Mode
- Value that occurs most frequently in the data
- Unimodal, bimodal, trimodal
- Empirical formula: mean - mode ≈ 3 × (mean - median)
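A short example makes the difference between these measures concrete. It uses Python's standard `statistics` module; the `weighted_mean` and `trimmed_mean` helpers and the salary figures are assumptions introduced for illustration.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


salaries = [30, 31, 32, 32, 33, 35, 36, 110]  # hypothetical, in thousands

print(mean(salaries))             # 42.375, pulled up by the extreme value 110
print(trimmed_mean(salaries, 1))  # about 33.17 once 30 and 110 are dropped
print(median(salaries))           # 32.5, average of the two middle values
print(multimode(salaries))        # [32], so the data are unimodal
print(weighted_mean([1, 2, 3], [1, 1, 2]))  # 2.25
```

The single extreme value moves the mean far more than the median or the trimmed mean, which is why the latter two are preferred for skewed data.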
50 Symmetric vs. Skewed Data
- Median, mean and mode of symmetric, positively and negatively skewed data
[Figure: three distribution curves labeled symmetric, positively skewed, and negatively skewed]
51 Measuring the Dispersion of Data
- Quartiles, outliers and boxplots
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 - Q1
- Five-number summary: min, Q1, M, Q3, max
- Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
- Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: σ)
- Variance (algebraic, scalable computation)
- Standard deviation s (or σ) is the square root of variance s² (or σ²)
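These quantities can be computed with the standard `statistics` module, as in the sketch below. Quartile conventions differ between textbooks and tools; the "inclusive" interpolation method used here is one common choice and is an assumption, as are the helper names and the data.

```python
from statistics import quantiles, variance, pvariance, stdev, pstdev


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive interpolation."""
    q1, m, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, m, q3, max(values)


def iqr_outliers(values):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


data = [30, 31, 32, 32, 33, 35, 36, 110]  # hypothetical values

print(five_number_summary(data))  # (30, 31.75, 32.5, 35.25, 110)
print(iqr_outliers(data))         # [110]

# Sample statistics divide by n - 1, population statistics by n.
print(variance(data), stdev(data))
print(pvariance(data), pstdev(data))
```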
52 Boxplot Analysis
- Five-number summary of a distribution
- Minimum, Q1, M, Q3, Maximum
- Boxplot
- Data is represented with a box
- The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
- The median is marked by a line within the box
- Whiskers: two lines outside the box extend to Minimum and Maximum
53 Visualization of Data Dispersion: 3-D Boxplots
54 Properties of Normal Distribution Curve
- The normal (distribution) curve
- From µ - σ to µ + σ: contains about 68% of the measurements (µ: mean, σ: standard deviation)
- From µ - 2σ to µ + 2σ: contains about 95% of it
- From µ - 3σ to µ + 3σ: contains about 99.7% of it
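The three percentages follow from the normal cumulative distribution function: the probability of falling within k standard deviations of the mean is erf(k / √2). A minimal check, where the function name `within_k_sigma` is an assumption made for this example:

```python
from math import erf, sqrt


def within_k_sigma(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for a normally distributed X."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * within_k_sigma(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```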
55 Graphic Displays of Basic Statistical Descriptions
- Boxplot: graphic display of the five-number summary
- Histogram: the x-axis shows values, the y-axis represents frequencies
- Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
- Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence
56 Histogram Analysis
- Graph displays of basic statistical class descriptions
- Frequency histograms
- A univariate graphical method
- Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
57 Histograms Often Tell More than Boxplots
- The two histograms shown on the left may have the same boxplot representation
- The same values for min, Q1, median, Q3, max
- But they have rather different data distributions
58 Quantile Plot
- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information
- For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i
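The points of a quantile plot can be computed as follows. The slide does not define f_i; the sketch assumes the common convention f_i = (i - 0.5) / n for the i-th smallest of n values. The function name and data are illustrative.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


points = quantile_plot_points([40, 10, 30, 20])
print(points)  # [(0.125, 10), (0.375, 20), (0.625, 30), (0.875, 40)]
```

Plotting f_i on the x-axis against x_i on the y-axis gives the quantile plot.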
59 Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Allows the user to view whether there is a shift in going from one distribution to another
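A q-q plot needs matching quantiles from both samples. The sketch below pairs the quartiles of two hypothetical samples using the standard `statistics.quantiles` function; the sample values, the names, and the choice of the "inclusive" method are assumptions.

```python
from statistics import quantiles


def qq_points(sample_a, sample_b, n=4):
    """Pair the n-1 cut points of sample_a with those of sample_b."""
    qa = quantiles(sample_a, n=n, method="inclusive")
    qb = quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))


branch_1 = [10, 20, 30, 40, 50]  # hypothetical unit prices at one branch
branch_2 = [15, 25, 35, 45, 55]  # the same items at another branch

points = qq_points(branch_1, branch_2)
shift = [b - a for a, b in points]
print(points)  # [(20.0, 25.0), (30.0, 35.0), (40.0, 45.0)]
print(shift)   # [5.0, 5.0, 5.0]
```

Points lying above the line y = x by a constant amount, as here, indicate a uniform upward shift from the first distribution to the second.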
60 Scatter Plot
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as points in the plane
61 Loess Curve
- Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
- Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression
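To show what the two parameters do, here is a deliberately simplified local regression. It is a sketch, not a full loess implementation: the polynomial degree is fixed at 1, the smoothing parameter `span` is the fraction of points used around each location, the weights are tricube weights, and the robustness iterations of the full method are omitted. The function name and data are hypothetical.

```python
def loess_point(xs, ys, x0, span=0.5):
    """Fit a weighted straight line around x0 and return its value at x0."""
    n = len(xs)
    k = max(2, int(round(span * n)))        # number of neighbours used
    dists = sorted(abs(x - x0) for x in xs)
    # Bandwidth: slightly beyond the k-th nearest neighbour, so that
    # neighbour keeps a small positive weight.
    h = dists[k - 1] * 1.001 + 1e-12
    # Tricube weights: near 1 close to x0, 0 at distance h or more.
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


xs = list(range(10))
ys = [2 * x + 1 for x in xs]  # perfectly linear toy data
curve = [loess_point(xs, ys, x) for x in xs]
print(round(loess_point(xs, ys, 4.5), 6))  # 10.0
```

A larger `span` averages over more neighbours and gives a smoother curve; a smaller one follows local detail more closely.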
62 Positively and Negatively Correlated Data
- The left half fragment is positively correlated
- The right half is negatively correlated
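One way to quantify what the scatter plots show is Pearson's correlation coefficient r, which is positive for positively correlated data, negative for negatively correlated data, and near zero when there is no linear relationship. The slides do not name a specific measure, so its use here, along with the function name and data, is an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4]
print(pearson_r(x, [2, 4, 6, 8]))    # close to 1: positively correlated
print(pearson_r(x, [8, 6, 4, 2]))    # close to -1: negatively correlated
print(pearson_r(x, [1, -1, -1, 1]))  # 0.0: no linear correlation
```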
63 Not Correlated Data
64 Data Visualization and Its Methods
- Why data visualization?
- Gain insight into an information space by mapping data onto graphical primitives
- Provide qualitative overview of large data sets
- Search for patterns, trends, structure, irregularities, relationships among data
- Help find interesting regions and suitable parameters for further quantitative analysis
- Provide a visual proof of computer representations derived
- Typical visualization methods
- Geometric techniques
- Icon-based techniques
- Hierarchical techniques
65 Direct Data Visualization
[Figure: ribbons with twists based on vorticity]
66 Geometric Techniques
- Visualization of geometric transformations and projections of the data
- Methods
- Landscapes
- Projection pursuit technique
- Finding meaningful projections of multidimensional data
- Scatterplot matrices
- Prosection views
- Hyperslice
- Parallel coordinates
67 Scatterplot Matrices
Used by permission of M. Ward, Worcester Polytechnic Institute
- Matrix of scatterplots (x-y diagrams) of the k-dimensional data: a total of (k² - k)/2 distinct scatterplots
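The count of distinct panels is the number of unordered attribute pairs, k(k - 1)/2. A quick enumeration confirms it; the attribute names are invented for the example.

```python
from itertools import combinations


def scatterplot_pairs(attributes):
    """All unordered attribute pairs, one per distinct scatterplot."""
    return list(combinations(attributes, 2))


attrs = ["age", "income", "education", "hours"]  # hypothetical attributes
pairs = scatterplot_pairs(attrs)
k = len(attrs)
print(len(pairs), (k * k - k) // 2)  # 6 6
```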
68 Landscapes
[Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]
- Visualization of the data as perspective landscape
- The data needs to be transformed into a (possibly artificial) 2D spatial representation which preserves the characteristics of the data
69 Parallel Coordinates
- n equidistant axes which are parallel to one of the screen axes and correspond to the attributes
- The axes are scaled to the [minimum, maximum] range of the corresponding attribute
- Every data item corresponds to a polygonal line which intersects each of the axes at the point which corresponds to the value for the attribute
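The mapping from data items to polygonal lines can be sketched without any plotting library. Each attribute is scaled to its own [minimum, maximum] range, and each row becomes a list of (axis index, height) vertices. The function name and the sample rows are assumptions.

```python
def to_polylines(rows):
    """Scale each attribute to [0, 1] and return one polyline per data item."""
    columns = list(zip(*rows))
    ranges = [(min(c), max(c)) for c in columns]
    polylines = []
    for row in rows:
        line = []
        for axis, (value, (lo, hi)) in enumerate(zip(row, ranges)):
            # A constant attribute has no range; place it mid-axis.
            height = 0.5 if hi == lo else (value - lo) / (hi - lo)
            line.append((axis, height))
        polylines.append(line)
    return polylines


rows = [(20, 100.0, 1), (30, 300.0, 3), (40, 200.0, 2)]  # hypothetical items
lines = to_polylines(rows)
print(lines[1])  # [(0, 0.5), (1, 1.0), (2, 1.0)]
```

Drawing each vertex list as a connected line over n parallel axes gives the parallel coordinates display.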
70 Parallel Coordinates of a Data Set
71 Icon-based Techniques
- Visualization of the data values as features of icons
- Methods
- Chernoff Faces
- Stick Figures
- Shape Coding
- Color Icons
- TileBars: the use of small icons representing the relevance feature vectors in document retrieval
72 Chernoff Faces
- A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
- The figure shows faces produced using 10 characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening), each assigned one of 10 possible values, generated using Mathematica (S. Dickson)
- REFERENCE: Gonick, L. and Smith, W. The Cartoon Guide to Statistics. New York: Harper Perennial, p. 212, 1993
- Weisstein, Eric W. "Chernoff Face." From MathWorld--A Wolfram Web Resource. mathworld.wolfram.com/ChernoffFace.html
73 Stick Figures
[Figure: census data showing age, income, gender, education, etc. Used by permission of G. Grinstein, University of Massachusetts at Lowell]
74 Hierarchical Techniques
- Visualization of the data using a hierarchical partitioning into subspaces.
- Methods
- Dimensional Stacking
- Worlds-within-Worlds
- Treemap
- Cone Trees
- InfoCube
75 Dimensional Stacking
- Partitioning of the n-dimensional attribute space in 2-D subspaces which are stacked into each other
- Partitioning of the attribute value ranges into classes; the important attributes should be used on the outer levels
- Adequate for data with ordinal attributes of low cardinality
- But difficult to display more than nine dimensions
- Important to map dimensions appropriately
76 Dimensional Stacking
[Figure: visualization of oil mining data with longitude and latitude mapped to the outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes. Used by permission of M. Ward, Worcester Polytechnic Institute]
77 Tree-Map
- Screen-filling method which uses a hierarchical partitioning of the screen into regions depending on the attribute values
- The x- and y-dimension of the screen are partitioned alternately according to the attribute values (classes)
[Figure: MSR Netscan image]
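The alternating partitioning can be sketched as a "slice-and-dice" layout: each level splits its rectangle among its children in proportion to their sizes, switching between the x- and y-dimension at every level. The tree encoding (a node is a `(name, size)` leaf or a `(name, [children])` group), the function names, and the example tree are assumptions for illustration.

```python
def size(node):
    """Total size of a node: its own size for a leaf, the sum for a group."""
    _, payload = node
    if isinstance(payload, list):
        return sum(size(child) for child in payload)
    return payload


def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Assign a rectangle (x, y, w, h) to every leaf, alternating the split axis."""
    if out is None:
        out = {}
    name, payload = node
    if not isinstance(payload, list):  # leaf: it fills the rectangle it was given
        out[name] = (x, y, w, h)
        return out
    total = size(node)
    offset = 0.0
    for child in payload:
        share = size(child) / total
        if depth % 2 == 0:             # even levels split along x
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:                          # odd levels split along y
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out


# Hypothetical file system: directory "src" holds two files.
tree = ("root", [("docs", 2), ("src", [("a.py", 3), ("b.py", 1)]), ("data", 2)])
rects = slice_and_dice(tree, 0.0, 0.0, 8.0, 4.0)
print(rects["a.py"])  # (2.0, 0.0, 4.0, 3.0)
```

The leaf rectangles tile the full 8 × 4 screen area, with each area proportional to the leaf's size.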
78 Tree-Map of a File System (Shneiderman)