Main Concepts of Data Mining Introduction to Data Preprocessing

Slides: 79

Provided by: isabellebi

Transcript and Presenter's Notes

1
Main Concepts of Data Mining: Introduction to Data
Preprocessing
2
Learning Objectives

Study some examples of data mining systems
Understand why to preprocess the data.
Understand how to understand the data
(descriptive data summarization)

3
Acknowledgements

Some of these slides are adapted from Jiawei Han
and Micheline Kamber

4
Learning Objectives

Study some examples of data mining systems
Understand why to preprocess the data.
Understand how to understand the data
(descriptive data summarization)

5
Data Mining: Confluence of Multiple Disciplines
Database Technology
Statistics
Data Mining
Machine Learning
Visualization
Information Science
Other Disciplines
6
Data Mining: Classification Schemes

General functionality
Descriptive data mining
Predictive data mining
Different views, different classifications
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted

7
Major Issues in Data Mining (1)

Mining methodology and user interaction
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple
levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad-hoc data
mining
Expression and visualization of data mining
results
Handling noise and incomplete data
Pattern evaluation: the interestingness problem
Performance and scalability
Efficiency and scalability of data mining
algorithms
Parallel, distributed and incremental mining
methods

8
Major Issues in Data Mining (2)

Issues relating to the diversity of data types
Handling relational and complex types of data
Mining information from heterogeneous databases
and global information systems (WWW)
Issues related to applications and social impacts
Application of discovered knowledge
Domain-specific data mining tools
Intelligent query answering
Process control and decision making
Integration of the discovered knowledge with
existing knowledge: a knowledge fusion problem
Protection of data security, integrity, and
privacy

9
Main Concepts in Data Mining

Data mining: discovering interesting patterns
from large amounts of data
A natural evolution of database technology, in
great demand, with wide applications
A KDD process includes data cleaning, data
integration, data selection, transformation, data
mining, pattern evaluation, and knowledge
presentation
Mining can be performed in a variety of
information repositories
Data mining functionalities: characterization,
discrimination, association, classification,
clustering, outlier and trend analysis, etc.
Classification of data mining systems
Major issues in data mining

10
Case-Based Reasoning

Case-based reasoning (CBR)
Problem-solving method from artificial
intelligence (AI) that proposes to reuse
previously solved and memorized problem
situations, called cases
Instance-based method from machine learning
Can be used for classification/prediction tasks
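The reuse of memorized cases can be sketched in a few lines. The following is a minimal illustration, not a method from any of the systems discussed in these slides: it assumes cases are numeric feature vectors, uses Euclidean distance as the similarity measure, reuses the nearest case's solution without adaptation, and skips the revise step. The names `retrieve` and `solve` and the dictionary layout are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(case_base, problem):
    """Retrieve: return the stored case whose problem part is closest."""
    return min(case_base, key=lambda case: distance(case["problem"], problem))

def solve(case_base, problem):
    """Reuse the retrieved solution unchanged, then retain the new case."""
    nearest = retrieve(case_base, problem)
    solution = nearest["solution"]  # Reuse (assumption: no adaptation step)
    case_base.append({"problem": problem, "solution": solution})  # Retain
    return solution

case_base = [
    {"problem": (1.0, 1.0), "solution": "A"},
    {"problem": (5.0, 5.0), "solution": "B"},
]
print(solve(case_base, (4.5, 5.2)))  # B
```

A real CBR system would normally adapt the retrieved solution and test it before retaining the case; this sketch retains it unconditionally.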

11
Case-Based Reasoning
[Diagram: the CBR cycle. Labels: Problem, User Interface, Target Case, New Case, Retrieve, Retrieved Case, Reuse, Solved Case, Solution, Revise, Tested Case, Retain, Case Base (Previous Cases), Interpretation]
12
Fifth Workshop on Case-Based Reasoning in the
Health Sciences

Isabelle Bichindaritz
University of Washington, Tacoma, Washington, USA
ibichind_at_u.washington.edu
Stefania Montani
University of Piemonte Orientale, Italy
stefania.montani_at_unipmn.it

13
Workshop Stats

Papers accepted: 10
Attendees: 19 participants
Good news!

14
Workshop Goals

Provide a forum for identifying important
contributions and opportunities for research on
the application of CBR to the Health Sciences
Promote the systematic study of how to apply CBR
to the Health Sciences
Showcase applications of CBR in the Health
Sciences

15
A CBR Solution for Missing Medical Data
Olga Vorobieva and Rainer Schmidt, Institute for
Medical Informatics and Biometry, University of
Rostock, Germany
Alexander Rumiantzev, Pavlov State Medical
University, St. Petersburg, Russia
16
Summary

Application domain: dialysis medicine; effects of
fitness on dialysis
System context: ISOR, a CBR system that explains
the exceptional cases, those for which fitness
does not improve renal function
Task / problem addressed: restoration of missing
data
Research hypothesis: case-based reasoning can be
applied to restore missing data in a dataset/case
base
Main contribution: synergy between CBR and
statistics (statistical modeling).

17
(No Transcript)
18
A Case-Based Reasoning Approach to Dose Planning
in Radiotherapy
Xueyan Song1, Sanja Petrovic1, and Santhanam Sundar2
1 Automated Scheduling, Optimisation and Planning
Group, School of Computer Science, University of
Nottingham, UK
2 Dept. of Oncology, City Hospital Campus,
Nottingham University Hospitals NHS Trust,
Nottingham, UK
19
Summary

Application domain: dose planning in radiotherapy
for prostate cancer
System context: trade-off between the benefit in
terms of cancer control and the risk in terms of
harmful side effects to neighboring tissues
Task / problem addressed: planning problem;
designing a radiotherapy dose plan
Research hypothesis: case-based reasoning can be
applied to propose dose plans
Main contribution: fuzzy representation of
attribute values and similarity measure; fusion of
similar cases by Dempster-Shafer theory.

20
(No Transcript)
21
On-Line Domain Knowledge Management
for Case-Based Medical Recommendation
Amélie Cordier1, Béatrice Fuchs1, Jean Lieber2, and
Alain Mille1
1 LIRIS CNRS, UMR 5202, Université Lyon 1, INSA
Lyon, Université Lyon 2, ECL, 43, bd du 11 Novembre
1918, Villeurbanne Cedex, France, Amelie.Cordier,
Beatrice.Fuchs, Alain.Mille_at_liris.cnrs.fr
2 LORIA (UMR 7503 CNRS-INRIA-Nancy Universities),
BP 239, 54506 Vandoeuvre-lès-Nancy, France,
Jean.Lieber_at_loria.fr
22
Summary

Application domain: breast cancer treatment
System context: Kasimir is a knowledge management
and decision-support system in oncology focusing
on case-based protocol treatment recommendations
Task / problem addressed: planning problem;
recommending a treatment plan based on a protocol
Research hypotheses: conservative adaptation is
recommended for adapting a protocol to a new case
through case-based reasoning; new domain knowledge
can be acquired by analysis of failures
Main contribution: improvement of adaptation;
method for learning from failures of the
case-based reasoning.

23
(No Transcript)
24
Concepts for Novelty Detection and Handling
based on Case-Based Reasoning
Petra Perner, Institute of Computer Vision and
Applied Computer Sciences, IBaI
25
Summary

Application domain: Hep-2 cell image
interpretation
System context: case-based image interpretation
Task / problem addressed: classification problem;
improve recognition of over 30 different nuclear
and cytoplasmic patterns when patterns change
over time or new patterns emerge
Research hypothesis: case-based reasoning can be
applied to the problem of novelty detection and
also of concept drift
Main contribution: novel application for CBR;
detecting novelty, detecting concept drift.

26
(No Transcript)
27
Similarity of Medical Cases in Health Care Using
Cosine Similarity and Ontology
Shahina Begum, Mobyen Uddin Ahmed, Peter Funk,
Ning Xiong, Bo von Schéele
Mälardalen University, Department of Computer
Science and Electronics, PO Box 883, SE-721 23
Västerås, Sweden, firstname.lastname_at_mdh.se
28
Summary

Application domain: any medical domain
System context: electronic medical records
Task / problem addressed: retrieval task; finding
similar cases represented with structured and
semi-structured data
Research hypothesis: a hybrid similarity measure
that combines the cosine similarity measure,
an ontology, and the nearest neighbor method
permits successful retrieval of similar cases
Main contribution: synergy between case-based
reasoning and information retrieval.
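The cosine similarity measure named in this summary can be illustrated on its own. The sketch below only shows the standard cosine formula on two numeric vectors (for example, term-weight vectors of two case descriptions); it does not reproduce the hybrid measure of the paper, whose ontology and nearest-neighbor components are not described in these slides. The zero-vector convention is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Vectors pointing in the same direction score 1, orthogonal vectors score 0.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
print(cosine_similarity([1, 0], [0, 1]))
```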

29
(No Transcript)
30
Towards Case-Based Reasoning for
Diabetes Management
Cindy Marling1, Jay Shubrook2 and Frank Schwartz2
1 School of Electrical Engineering and Computer
Science, Russ College of Engineering and
Technology, Ohio University, Athens, Ohio 45701,
USA, marling_at_ohio.edu
2 Appalachian Rural Health Institute, Diabetes and
Endocrine Center, College of Osteopathic Medicine,
Ohio University, Athens, Ohio 45701, USA,
shubrook_at_ohio.edu, schwartf_at_ohio.edu
31
Summary

Application domain: type I diabetes management
System context: real-time monitoring of glucose
level through insulin pump
Task / problem addressed: treatment planning;
adjusting insulin dosage
Research hypothesis: case-based reasoning can
adjust insulin dosage in real time; cases required
for the future CBR system can be acquired through
an online Web-based interface
Main contribution: planning the development of a
case-based reasoning system for automatic type I
diabetes monitoring.

32
Hypothetico-Deductive Case-Based Reasoning
David McSherry, School of Computing and
Information Engineering, University of Ulster,
Northern Ireland
33
Summary

Application domain: contact lenses classification
System context: conversational CBR
Task / problem addressed: classification problem;
recommending type of contact lenses
Research hypothesis: a hypothetico-deductive CBR
approach to test selection can minimize the
number of tests required to confirm a hypothesis
proposed by the system or user
Main contribution: synergy between case-based
reasoning and hypothetico-deductive
reasoning; explanations in CBR.

34
(No Transcript)
35
Other Papers Summaries

Case-based Reasoning for managing non-compliance
with clinical guidelines, Stefania Montani,
University of Piemonte Orientale, Alessandria,
Italy. A CBR system able to:
Retrieve similar past episodes (cases) of
non-compliance to guidelines, to be suggested to
the physician
Learn more general indications from ground
non-compliance cases, adoptable for a formal
guideline (GL) revision by an expert committee
CBR for Temporal Abstractions Configuration in
Haemodialysis, Leonardi Giorgio, Bottrighi
Alessio, Portinale Luigi, Montani Stefania,
University of Piemonte Orientale, Alessandria,
Italy. A CBR system able to choose the appropriate
parameters for the configuration of temporal
abstractions in the medical domain of haemodialysis

36
Other Papers Summaries

Prototypical Cases for Knowledge Maintenance in
Biomedical CBR, Isabelle Bichindaritz, University
of Washington, Tacoma, WA, USA. Prototypical cases
have served various purposes in biomedical CBR
systems: to organize and structure the memory,
to guide the retrieval as well as the reuse of
cases, and to bootstrap a CBR system memory when
real cases are not available in sufficient
quantity and/or quality.
Knowledge maintenance is yet another role that
these prototypical cases can play in biomedical
CBR systems

37
Discussion

Trends and issues
Integration of CBR with electronic patient
records and/or in clinical practice (Begum et
al., Marling et al.)
Importance of prototypical cases (Bichindaritz)
Incompleteness / non-reliability of cases or CBR
system knowledge (Vorobieva et al., Cordier et
al., Bichindaritz)
Novel domains of applications for CBR (Perner,
Leonardi et al., Montani)
Need for synergy with other AI methods (Song et
al., McSherry)

38
Discussion

Pearls of wisdom
Remember Occam's razor: introducing complexity
in CBR should be carefully justified
Knowledge in medical cases / domain knowledge is
often questionable; finding methods for dealing
with this reality is essential for the
development of CBR in biomedical domains
CBR can be promoted as the methodology of choice
for evidence gathering in evidence-based medicine

39
Future Plans

A second special issue on CBR in the Health
Sciences, based on papers from this Fifth
Workshop on CBR in the Health Sciences, is going
to be published in Computational Intelligence.
The Web-site (version 1.beta) and mailing list
for our research group are now live:
http://www.cbr-health.org
http://www.cbr-biomed.org

40
(No Transcript)
41
(No Transcript)
42
Learning Objectives

Study some examples of data mining systems
Understand why to preprocess the data.
Understand how to understand the data
(descriptive data summarization)

43
Why Data Preprocessing?

Data mining aims at discovering relationships and
other forms of knowledge from data in the real
world.

Data map entities in the application domain to
symbolic representation through a measurement
function.

Data in the real world is dirty:
incomplete: missing data, lacking attribute
values, lacking certain attributes of interest,
or containing only aggregate data
noisy: containing errors, such as measurement
errors, or outliers
inconsistent: containing discrepancies in codes
or names
distorted: sampling distortion

No quality data, no quality mining results!
(GIGO)
Quality decisions must be based on quality data
Data warehouse needs consistent integration of
quality data

44
Multi-Dimensional Measure of Data Quality

Data quality is multidimensional
Accuracy
Preciseness (reliability)
Completeness
Consistency
Timeliness
Believability (validity)
Value added
Interpretability
Accessibility
Broad categories
intrinsic, contextual, representational, and
accessibility.

45
Major Tasks in Data Preprocessing

Data cleaning
Fill in missing values, smooth noisy data,
identify or remove outliers, and resolve
inconsistencies and errors

Data integration
Integration of multiple databases, data cubes, or
files

Data transformation
Normalization and aggregation

Data reduction
Obtains reduced representation in volume but
produces the same or similar analytical results

Data discretization
Part of data reduction but with particular
importance, especially for numerical data
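Three of the tasks above (cleaning, transformation, discretization) can be illustrated on a single numeric attribute. This is a minimal sketch under stated assumptions: missing values are encoded as `None` and replaced by the attribute mean, normalization is min-max scaling to [0, 1], and discretization uses equal-width bins. These are common choices, not the only ones, and the function names are hypothetical.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None by the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, k):
    """Data discretization: map each value to one of k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = fill_missing_with_mean([20.0, None, 40.0])
print(ages)                       # [20.0, 30.0, 40.0]
print(min_max_normalize(ages))    # [0.0, 0.5, 1.0]
print(equal_width_bins(ages, 2))  # [0, 1, 1]
```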

46
Forms of data preprocessing
47
Learning Objectives

Study some examples of data mining systems
Understand why to preprocess the data.
Understand how to understand the data
(descriptive data summarization)

48
Mining Data Descriptive Characteristics

Motivation
To better understand the data: central tendency,
variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance,
etc.
Numerical dimensions correspond to sorted
intervals
Data dispersion analyzed with multiple
granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed
cube

49
Measuring the Central Tendency

Mean (algebraic measure) (sample vs. population)
Weighted arithmetic mean
Trimmed mean: chopping extreme values
Median: a holistic measure
Middle value if odd number of values, or average
of the middle two values otherwise
Estimated by interpolation (for grouped data)
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula: mean − mode ≈ 3 × (mean − median)
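The measures listed above can be computed with Python's standard `statistics` module plus two small helpers. The sketch below is illustrative only: the data values are made up for the example, and the trimmed mean assumes a trimming fraction below 0.5 that is cut from each end of the sorted data.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(statistics.mean(data))       # 58
print(statistics.median(data))     # 54.0 (average of the middle two values)
print(statistics.multimode(data))  # [52, 70] (bimodal)
print(trimmed_mean(data, 0.1))     # drops 30 and 110 before averaging
print(weighted_mean([1, 2, 3], [1, 1, 2]))  # 2.25
```

The single extreme value 110 pulls the mean (58) above the median (54), which is why the median and the trimmed mean are often preferred for skewed data.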

50
Symmetric vs. Skewed Data
symmetric

Median, mean and mode of symmetric, positively
and negatively skewed data

positively skewed
negatively skewed
51
Measuring the Dispersion of Data

Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th
percentile)
Inter-quartile range: IQR = Q3 − Q1
Five-number summary: min, Q1, M, Q3, max
Boxplot: the ends of the box are the quartiles,
the median is marked, whiskers extend from the
box, and outliers are plotted individually
Outlier: usually, a value more than 1.5 × IQR
above Q3 or below Q1
Variance and standard deviation (sample: s,
population: σ)
Variance (algebraic, scalable computation)
Standard deviation s (or σ) is the square root of
variance s² (or σ²)
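A short sketch of these dispersion measures follows. Quartile values depend on the interpolation convention; this example assumes the "inclusive" method of `statistics.quantiles`, so other tools may report slightly different quartiles for the same data. The 1.5 × IQR rule is applied as fences below Q1 and above Q3, and the helper names are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return (ordered[0], q1, median, q3, ordered[-1])

def iqr_outliers(values):
    """Values lying more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(five_number_summary(data))  # (1, 3.0, 5.0, 7.0, 100)
print(iqr_outliers(data))         # [100]

# Sample statistics divide by n - 1, population statistics by n.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.pstdev(scores))    # population standard deviation: 2.0
print(statistics.variance(scores))  # sample variance, 32 / 7
```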

52
Boxplot Analysis

Five-number summary of a distribution
Minimum, Q1, M, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to
Minimum and Maximum

53
Visualization of Data Dispersion 3-D Boxplots
54
Properties of Normal Distribution Curve

The normal (distribution) curve
From µ−σ to µ+σ contains about 68% of the
measurements (µ: mean, σ: standard deviation)
From µ−2σ to µ+2σ contains about 95% of it
From µ−3σ to µ+3σ contains about 99.7% of it
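These percentages can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k / √2); the small sketch below evaluates that expression with the standard library (the function name is hypothetical).

```python
import math

def normal_mass_within(k):
    """P(|X - mu| <= k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_mass_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```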

55
Graphic Displays of Basic Statistical Descriptions

Boxplot: graphic display of five-number summary
Histogram: x-axis shows values, y-axis represents
frequencies
Quantile plot: each value xi is paired with fi
indicating that approximately 100·fi % of data
are ≤ xi
Quantile-quantile (q-q) plot: graphs the
quantiles of one univariate distribution against
the corresponding quantiles of another
Scatter plot: each pair of values is a pair of
coordinates and plotted as points in the plane
Loess (local regression) curve: add a smooth
curve to a scatter plot to provide better
perception of the pattern of dependence

56
Histogram Analysis

Graph displays of basic statistical class
descriptions
Frequency histograms
A univariate graphical method
Consists of a set of rectangles that reflect the
counts or frequencies of the classes present in
the given data

57
Histograms Often Tell More than Boxplots

The two histograms shown on the left may have the
same boxplot representation
The same values for min, Q1, median, Q3, max
But they have rather different data distributions
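The point can be reproduced with two small made-up data sets. The sketch below assumes inclusive quartiles (as computed by `statistics.quantiles`) and equal-width histogram bins; both data sets and the helper names are hypothetical and are not the data behind the slide's figure.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return (ordered[0], q1, median, q3, ordered[-1])

def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is clamped into the last bin.
        index = min(int((v - lo) / width), bins - 1) if width else 0
        counts[index] += 1
    return counts

a = [0, 1, 2, 3, 4, 5, 6, 7, 8]
b = [0, 2, 2, 4, 4, 4, 6, 6, 8]

print(five_number_summary(a) == five_number_summary(b))  # True
print(histogram_counts(a, 4))  # [2, 2, 2, 3]
print(histogram_counts(b, 4))  # [1, 2, 3, 3]
```

Both sets would produce the same boxplot, yet the bin counts differ.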

58
Quantile Plot

Displays all of the data (allowing the user to
assess both the overall behavior and unusual
occurrences)
Plots quantile information
For data xi sorted in increasing order, fi
indicates that approximately 100·fi % of the data
are below or equal to the value xi
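The pairs (fi, xi) of a quantile plot are easy to compute. The slide does not say how fi is defined; the sketch below assumes the common convention fi = (i − 0.5) / n for the i-th smallest of n values, and the function name is hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

for f, x in quantile_plot_points([30, 10, 20, 40]):
    print(f, x)
# 0.125 10
# 0.375 20
# 0.625 30
# 0.875 40
```

Plotting x against f gives the quantile plot; every observation appears as one point.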

59
Quantile-Quantile (Q-Q) Plot

Graphs the quantiles of one univariate
distribution against the corresponding quantiles
of another
Allows the user to view whether there is a shift
in going from one distribution to another
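A q-q plot pairs matching quantiles of two samples. The following sketch assumes both samples are summarized at the same cut points with `statistics.quantiles` (inclusive method); the sample values are made up. If the second distribution is simply shifted, every point lies the same distance from the line y = x.

```python
import statistics

def qq_points(sample_a, sample_b, n=4):
    """Pair the n - 1 cut-point quantiles of two samples."""
    qa = statistics.quantiles(sample_a, n=n, method="inclusive")
    qb = statistics.quantiles(sample_b, n=n, method="inclusive")
    return list(zip(qa, qb))

a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
b = [v + 2 for v in a]  # same shape, shifted up by 2

print(qq_points(a, b))  # [(3.0, 5.0), (5.0, 7.0), (7.0, 9.0)]
```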

60
Scatter plot

Provides a first look at bivariate data to see
clusters of points, outliers, etc
Each pair of values is treated as a pair of
coordinates and plotted as points in the plane

61
Loess Curve

Adds a smooth curve to a scatter plot in order to
provide better perception of the pattern of
dependence
Loess curve is fitted by setting two parameters
a smoothing parameter, and the degree of the
polynomials that are fitted by the regression
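The two parameters can be made concrete with a simplified local regression. This sketch is an assumption-laden illustration, not a full loess implementation: the polynomial degree is fixed at 1 (a straight line), the smoothing parameter `span` is the fraction of points used around each query position, weights follow the tricube function, and the bandwidth is widened slightly so the farthest neighbor keeps a small positive weight.

```python
def loess_point(xs, ys, x0, span=0.5):
    """Estimate y at x0 by a tricube-weighted straight-line fit."""
    n = len(xs)
    k = max(2, int(round(span * n)))  # number of neighbours used
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    max_dist = max(abs(xs[i] - x0) for i in nearest)
    if max_dist == 0:
        return sum(ys[i] for i in nearest) / k
    bandwidth = max_dist * 1.001  # assumption: keep all weights positive
    w = {i: (1 - (abs(xs[i] - x0) / bandwidth) ** 3) ** 3 for i in nearest}
    sw = sum(w.values())
    mx = sum(w[i] * xs[i] for i in nearest) / sw
    my = sum(w[i] * ys[i] for i in nearest) / sw
    sxx = sum(w[i] * (xs[i] - mx) ** 2 for i in nearest)
    sxy = sum(w[i] * (xs[i] - mx) * (ys[i] - my) for i in nearest)
    slope = sxy / sxx if sxx else 0.0
    return my + slope * (x0 - mx)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [2 * x + 1 for x in xs]  # exactly linear data
print(loess_point(xs, ys, 2.5, span=0.5))  # close to 6.0
```

Evaluating `loess_point` on a grid of x0 values and joining the results gives the smooth curve; a larger `span` produces a smoother curve.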

62
Positively and Negatively Correlated Data

The left half fragment is positively correlated
The right half is negatively correlated
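The slide shows correlation only graphically. One common way to quantify it, used here as an assumption rather than something the slides specify, is Pearson's correlation coefficient: values near +1 indicate positive correlation, values near −1 negative correlation, and values near 0 no linear correlation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # rising together: about +1
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # one rises, one falls: about -1
```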

63
Not Correlated Data
64
Data Visualization and Its Methods

Why data visualization?
Gain insight into an information space by mapping
data onto graphical primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure,
irregularities, relationships among data
Help find interesting regions and suitable
parameters for further quantitative analysis
Provide a visual proof of computer
representations derived
Typical visualization methods
Geometric techniques
Icon-based techniques
Hierarchical techniques

65
Direct Data Visualization
Ribbons with Twists Based on Vorticity
66
Geometric Techniques

Visualization of geometric transformations and
projections of the data
Methods
Landscapes
Projection pursuit technique
Finding meaningful projections of
multidimensional data
Scatterplot matrices
Prosection views
Hyperslice
Parallel coordinates

67
Scatterplot Matrices
Used by permission of M. Ward, Worcester
Polytechnic Institute

Matrix of scatterplots (x-y-diagrams) of the
k-dim. data: a total of (k² − k)/2 scatterplots
when each pair of distinct attributes is plotted once

68
Landscapes
News articles visualized as a landscape
Used by permission of B. Wright, Visible
Decisions Inc.

Visualization of the data as perspective
landscape
The data needs to be transformed into a (possibly
artificial) 2D spatial representation which
preserves the characteristics of the data

69
Parallel Coordinates

n equidistant axes which are parallel to one of
the screen axes and correspond to the attributes
The axes are scaled to the [minimum, maximum]
range of the corresponding attribute
Every data item corresponds to a polygonal line
which intersects each of the axes at the point
which corresponds to the value for the attribute
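The mapping from data items to polygonal lines can be sketched without any plotting library. This illustration assumes numeric attributes, scales each axis to [0, 1] over its own [minimum, maximum] range, and returns the vertices of each line as (axis index, scaled value) pairs; the function name and the handling of constant attributes are assumptions.

```python
def parallel_coordinates(rows):
    """Return, per row, the polyline vertices (axis_index, scaled_value)."""
    columns = list(zip(*rows))
    ranges = [(min(c), max(c)) for c in columns]
    lines = []
    for row in rows:
        line = []
        for axis, (value, (lo, hi)) in enumerate(zip(row, ranges)):
            # Assumption: a constant attribute is drawn at mid-height.
            scaled = (value - lo) / (hi - lo) if hi > lo else 0.5
            line.append((axis, scaled))
        lines.append(line)
    return lines

rows = [(1, 10), (2, 30), (3, 20)]
for line in parallel_coordinates(rows):
    print(line)
# [(0, 0.0), (1, 0.0)]
# [(0, 0.5), (1, 1.0)]
# [(0, 1.0), (1, 0.5)]
```

Drawing each list of vertices as a connected line over equidistant vertical axes yields the parallel-coordinates display.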

70
Parallel Coordinates of a Data Set
71
Icon-based Techniques

Visualization of the data values as features of
icons
Methods
Chernoff Faces
Stick Figures
Shape Coding
Color Icons
TileBars: the use of small icons representing the
relevance feature vectors in document retrieval

72
Chernoff Faces

A way to display variables on a two-dimensional
surface, e.g., let x be eyebrow slant, y be eye
size, z be nose length, etc.
The figure shows faces produced using 10
characteristics (head eccentricity, eye size, eye
spacing, eye eccentricity, pupil size, eyebrow
slant, nose size, mouth shape, mouth size, and
mouth opening), each assigned one of 10 possible
values, generated using Mathematica (S. Dickson)

REFERENCE: Gonick, L. and Smith, W. The Cartoon
Guide to Statistics. New York: Harper Perennial,
p. 212, 1993
Weisstein, Eric W. "Chernoff Face." From
MathWorld--A Wolfram Web Resource.
mathworld.wolfram.com/ChernoffFace.html

73
Stick Figures

census data showing age, income, gender,
education, etc.
Used by permission of G. Grinstein, University of
Massachusetts at Lowell
74
Hierarchical Techniques

Visualization of the data using a hierarchical
partitioning into subspaces.
Methods
Dimensional Stacking
Worlds-within-Worlds
Treemap
Cone Trees
InfoCube

75
Dimensional Stacking

Partitioning of the n-dimensional attribute space
into 2-D subspaces which are stacked into each
other
Partitioning of the attribute value ranges into
classes; the important attributes should be used
on the outer levels
Adequate for data with ordinal attributes of low
cardinality
But, difficult to display more than nine
dimensions
Important to map dimensions appropriately
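The stacking of 2-D subspaces can be pictured as nested grids. The sketch below is a hypothetical four-attribute case, assuming every attribute has already been discretized into class indices starting at 0: the outer pair selects a block of the display, and the inner pair selects a cell inside that block.

```python
def stacked_cell(outer_x, outer_y, inner_x, inner_y,
                 inner_x_size, inner_y_size):
    """Map four class indices to a (column, row) cell of the display.

    Each outer cell contains a full inner grid of
    inner_x_size by inner_y_size cells.
    """
    column = outer_x * inner_x_size + inner_x
    row = outer_y * inner_y_size + inner_y
    return (column, row)

# Outer block (1, 0), inner cell (2, 1), inner grid of 3 x 2 cells.
print(stacked_cell(1, 0, 2, 1, 3, 2))  # (5, 1)
```

Adding further attribute pairs repeats the same nesting, which is why displays with many dimensions quickly become hard to read.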

76
Dimensional Stacking
Used by permission of M. Ward, Worcester
Polytechnic Institute
Visualization of oil mining data with longitude
and latitude mapped to the outer x-, y-axes and
ore grade and depth mapped to the inner x-, y-axes
77
Tree-Map

Screen-filling method which uses a hierarchical
partitioning of the screen into regions depending
on the attribute values
The x- and y-dimension of the screen are
partitioned alternately according to the
attribute values (classes)
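The alternating partitioning can be sketched as a recursive "slice-and-dice" layout. This is an illustrative sketch under assumptions: the hierarchy is a nested dictionary with hypothetical keys `name`, `size`, and `children`, all leaf sizes are positive, and the split direction alternates by depth starting with the x-dimension.

```python
def node_size(node):
    """Size of a leaf, or the summed size of all leaves below a node."""
    children = node.get("children")
    if not children:
        return node["size"]
    return sum(node_size(c) for c in children)

def slice_and_dice(node, x, y, w, h, depth=0):
    """Return (name, x, y, width, height) rectangles for all leaves."""
    children = node.get("children")
    if not children:
        return [(node["name"], x, y, w, h)]
    total = node_size(node)
    rects = []
    offset = 0.0
    for child in children:
        share = node_size(child) / total
        if depth % 2 == 0:  # even depth: partition the x-dimension
            rects += slice_and_dice(child, x + offset * w, y,
                                    share * w, h, depth + 1)
        else:               # odd depth: partition the y-dimension
            rects += slice_and_dice(child, x, y + offset * h,
                                    w, share * h, depth + 1)
        offset += share
    return rects

tree = {"name": "root", "children": [
    {"name": "a", "size": 2},
    {"name": "dir", "children": [
        {"name": "b", "size": 1},
        {"name": "c", "size": 1},
    ]},
]}
for rect in slice_and_dice(tree, 0, 0, 8, 4):
    print(rect)
```

Each leaf receives an area proportional to its size, and the rectangles together fill the whole 8 by 4 screen region.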

MSR Netscan Image
78
Tree-Map of a File System (Shneiderman)
