Title: Course on Data Mining (581550-4)
1. Course on Data Mining (581550-4)
Intro/Assoc. Rules
Clustering
Episodes
KDD Process
Text Mining
Appl./Summary
2. Course on Data Mining (581550-4)
Today 22.11.2001
- Today's subject: KDD Process
- Next week's program:
  - Lecture: Data mining applications, future, summary
  - Exercise: KDD Process
  - Seminar: KDD Process
3. KDD process
- Overview
- Preprocessing
- Post-processing
- Summary
4. What is KDD? A process!
- Aim: the selection and processing of data for
  - the identification of novel, accurate, and useful patterns, and
  - the modeling of real-world phenomena
- Data mining is a major component of the KDD process
5. Typical KDD process (figure)
6. Phases of the KDD process (1)
- Learning the domain
- Creating a target data set
- Pre-processing:
  - Data cleaning, integration and transformation
  - Data reduction and projection
- Choosing the DM task
7. Phases of the KDD process (2)
- Choosing the DM algorithm(s)
- Data mining (search)
- Post-processing:
  - Pattern evaluation and interpretation
  - Knowledge presentation
- Use of discovered knowledge
8. Preprocessing - overview
- Why data preprocessing?
- Data cleaning
- Data integration and transformation
- Data reduction
9. Why data preprocessing?
- Aim: to select the data to be mined that is relevant to the task at hand
- Data in the real world is dirty:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
10. Measures of data quality
- accuracy
- completeness
- consistency
- timeliness
- believability
- value added
- interpretability
- accessibility
11. Preprocessing tasks (1)
- Data cleaning:
  - fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration:
  - integration of multiple databases, files, etc.
- Data transformation:
  - normalization and aggregation
12. Preprocessing tasks (2)
- Data reduction (including discretization):
  - obtains a representation that is reduced in volume but produces the same or similar analytical results
  - data discretization is part of data reduction, but of particular importance, especially for numerical data
13. Preprocessing tasks (3) (figure)
14. Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
15. Missing Data
- Data is not always available
- Missing data may be due to:
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
- Missing data may need to be inferred
16. How to Handle Missing Data? (1)
- Ignore the tuple:
  - usually done when the class label is missing
  - not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually:
  - tedious, and often infeasible
- Use a global constant to fill in the missing value:
  - e.g., "unknown" - a new class?!
17. How to Handle Missing Data? (2)
- Use the attribute mean to fill in the missing value
- Use the attribute mean of all samples belonging to the same class to fill in the missing value:
  - a smarter solution than using the general attribute mean (both variants are sketched below)
- Use the most probable value to fill in the missing value:
  - inference-based tools such as decision tree induction, a Bayesian formalism, or regression
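A minimal sketch of the two mean-based strategies in Python with pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical data: 'income' has missing values, 'class' is the class label.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30.0, None, 50.0, None, 70.0],
})

# Strategy 1: fill with the overall attribute mean.
overall = df["income"].fillna(df["income"].mean())

# Strategy 2: fill with the mean of the samples in the same class.
per_class = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(overall.tolist())    # [30.0, 50.0, 50.0, 50.0, 70.0]
print(per_class.tolist())  # [30.0, 30.0, 50.0, 60.0, 70.0]
```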
18. Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
19. How to Handle Noisy Data?
- Binning:
  - smooth sorted data values by looking at the values around them
- Clustering:
  - detect and remove outliers
- Combined computer and human inspection:
  - detect suspicious values automatically and have them checked by a human
- Regression:
  - smooth by fitting the data to regression functions
20. Binning methods (1)
- Equal-depth (frequency) partitioning:
  - sort the data and partition it into N bins (intervals), each containing approximately the same number of samples
  - smooth by bin means, bin medians, bin boundaries, etc.
  - gives good data scaling
  - managing categorical attributes can be tricky
21. Binning methods (2)
- Equal-width (distance) partitioning:
  - divide the range into N intervals of equal size (a uniform grid)
  - if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N
  - the most straightforward approach
  - outliers may dominate the presentation
  - skewed data is not handled well
22. Equal-depth binning - Example
- Sorted data for price (in dollars):
  - 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into (equal-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries (both smoothings are reproduced in the sketch below):
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
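A small NumPy sketch that reproduces these numbers; ties between the two boundaries are resolved toward the lower boundary here, matching the slide:

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-depth partitioning: 3 bins with 4 sorted values each.
bins = np.sort(prices).reshape(3, 4)

# Smoothing by bin means: replace each value by its bin's rounded mean.
means = np.round(bins.mean(axis=1)).astype(int)
by_means = np.repeat(means, 4).reshape(3, 4)

# Smoothing by bin boundaries: snap each value to the nearest bin edge.
lo, hi = bins[:, :1], bins[:, -1:]
by_bounds = np.where(hi - bins < bins - lo, hi, lo)

print(by_means)   # [[ 9  9  9  9], [23 23 23 23], [29 29 29 29]]
print(by_bounds)  # [[ 4  4  4 15], [21 21 25 25], [26 26 26 34]]
```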
23. Data Integration (1)
- Data integration:
  - combines data from multiple sources into a coherent store
- Schema integration:
  - integrate metadata from different sources
  - entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
24. Data Integration (2)
- Detecting and resolving data value conflicts:
  - for the same real-world entity, attribute values from different sources differ
  - possible reasons: different representations, different scales, e.g., metric vs. British units
25. Handling Redundant Data
- Redundant data occur often when multiple databases are integrated:
  - the same attribute may have different names in different databases
  - one attribute may be a derived attribute in another table, e.g., annual revenue
- Redundant data may be detected by correlation analysis
- Careful integration of data from multiple sources may:
  - help to reduce/avoid redundancies and inconsistencies
  - improve mining speed and quality
26. Data Transformation
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range, e.g. (both sketched below):
  - min-max normalization
  - normalization by decimal scaling
- Attribute/feature construction:
  - new attributes constructed from the given ones
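As an illustration, min-max normalization maps a value v to v' = (v - min) / (max - min) * (new_max - new_min) + new_min, and decimal scaling divides by a power of 10. A minimal sketch; the target range and sample values are illustrative:

```python
import numpy as np

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale to [new_min, new_max]."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def decimal_scaling(values):
    """Decimal scaling: divide by the smallest power of 10 that makes
    every |value| less than 1."""
    v = np.asarray(values, dtype=float)
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / 10 ** j

print(min_max([20, 30, 50]))            # [0.    0.333 1.   ]
print(decimal_scaling([-991, 12, 45]))  # [-0.991  0.012  0.045]
```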
27. Data Reduction
- Data reduction:
  - obtains a reduced representation of the data set that is much smaller in volume
  - produces the same (or almost the same) analytical results as the original data
- Data reduction strategies:
  - dimensionality reduction
  - numerosity reduction
  - discretization and concept hierarchy generation
28. Dimensionality Reduction
- Feature selection (i.e., attribute subset selection):
  - select a minimum set of features such that the probability distribution of the different classes, given the values of those features, is as close as possible to the original distribution given the values of all features
  - reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
- Heuristic methods (needed due to the exponential number of attribute subsets; a greedy sketch follows this list):
  - step-wise forward selection
  - step-wise backward elimination
  - combining forward selection and backward elimination
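A greedy sketch of step-wise forward selection; the scoring function is a hypothetical stand-in for a real class-separability measure:

```python
def forward_selection(candidates, score, max_features=3):
    """Step-wise forward selection: repeatedly add the attribute that
    improves the subset score the most; stop when nothing helps."""
    selected, best = [], score([])
    while len(selected) < max_features:
        gains = {a: score(selected + [a]) for a in candidates if a not in selected}
        attr, s = max(gains.items(), key=lambda kv: kv[1])
        if s <= best:
            break
        selected.append(attr)
        best = s
    return selected

# Toy score rewarding the informative attributes of the next slide's
# example (A1, A4, A6); a real score would evaluate class distributions.
weights = {"A1": 3, "A4": 5, "A6": 2}
score = lambda subset: sum(weights.get(a, 0) for a in subset)
print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], score))
# ['A4', 'A1', 'A6']
```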
29. Dimensionality Reduction - Example
- Initial attribute set: {A1, A2, A3, A4, A5, A6}
- (Figure: the induced decision tree tests only A4, A6, and A1 at its internal nodes; the leaves are labeled Class 1 and Class 2.)
- Reduced attribute set: {A1, A4, A6}
30. Numerosity Reduction
- Parametric methods:
  - assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - e.g., regression analysis, log-linear models
- Non-parametric methods:
  - do not assume models
  - e.g., histograms, clustering, sampling
31. Discretization
- Reduce the number of values of a given continuous attribute by dividing the range of the attribute into intervals
- Interval labels can then be used to replace the actual data values
- Some classification algorithms only accept categorical attributes
32. Concept Hierarchies
- Reduce the data by collecting low-level concepts and replacing them with higher-level concepts
- For example, replace numeric values of the attribute age by the more general values young, middle-aged, or senior (see the sketch below)
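A minimal pandas sketch of the age example; the cut points at 30 and 60 years are illustrative assumptions:

```python
import pandas as pd

ages = pd.Series([12, 25, 37, 45, 68, 81])

# Replace low-level numeric values by higher-level concepts.
labels = pd.cut(ages, bins=[0, 30, 60, 120],
                labels=["young", "middle-aged", "senior"])
print(labels.tolist())
# ['young', 'young', 'middle-aged', 'middle-aged', 'senior', 'senior']
```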
33. Discretization and concept hierarchy generation for numeric data
- Binning
- Histogram analysis
- Clustering analysis
- Entropy-based discretization
- Segmentation by natural partitioning
34. Concept hierarchy generation for categorical data
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- Specification of a portion of a hierarchy by explicit data grouping
- Specification of a set of attributes, but not of their partial ordering
- Specification of only a partial set of attributes
35. Specification of a set of attributes
- A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set: the attribute with the most distinct values is placed at the lowest level of the hierarchy
- (Figure: four attributes with 15, 65, 3567, and 674,339 distinct values, ordered top-down from fewest to most; a sketch of the heuristic follows.)
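A tiny sketch of this heuristic; the location data is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["FI", "FI", "US", "US"],
    "city":    ["Helsinki", "Helsinki", "Seattle", "Boston"],
    "street":  ["Main St 1", "Oak Rd 2", "Pine Ave 3", "Elm St 4"],
})

# Fewest distinct values -> top of the hierarchy, most -> bottom.
hierarchy = df.nunique().sort_values().index.tolist()
print(" < ".join(hierarchy))  # country < city < street
```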
36. Post-processing - overview
- Why data post-processing?
- Interestingness
- Visualization
- Utilization
37. Why data post-processing? (1)
- Aim: to show the results (more precisely, the most interesting findings) of the data mining phase to the user(s) in an understandable way
- A possible post-processing methodology:
  - find all potentially interesting patterns according to some rather loose criteria
  - provide flexible methods for iteratively and interactively creating different views of the discovered patterns
- Other, more restrictive or focused methodologies are possible as well
38. Why data post-processing? (2)
- A post-processing methodology is useful if:
  - the desired focus is not known in advance (the search process cannot be optimized to look only for the interesting patterns)
  - there is an algorithm that can produce all patterns from a class of potentially interesting patterns (the result is complete)
  - the time needed to discover all potentially interesting patterns is not considerably longer than if the discovery were focused on a small subset of the potentially interesting patterns
39. Are all the discovered patterns interesting?
- A data mining system/query may generate thousands of patterns, but are they all interesting?
- Usually NOT!
- How could we then choose the interesting patterns?
- => Interestingness
40. Interestingness criteria (1)
- Some possible criteria for interestingness:
  - evidence: statistical significance of the finding?
  - redundancy: similarity between findings?
  - usefulness: meeting the user's needs/goals?
  - novelty: is it already part of prior knowledge?
  - simplicity: syntactical complexity?
  - generality: how many examples are covered?
41. Interestingness criteria (2)
- One division of interestingness criteria:
  - objective measures, based on the statistics and structure of patterns, e.g.:
    - J-measure: statistical significance
    - certainty factor: support or frequency
    - strength: confidence
  - subjective measures, based on the user's beliefs about the data, e.g.:
    - unexpectedness: "is the found pattern surprising?"
    - actionability: "can I do something with it?"
42. Criticism of Support and Confidence
- Example (Aggarwal & Yu, PODS'98):
  - among 5000 students, 3000 play basketball and 3750 eat cereal
  - 2000 both play basketball and eat cereal
  - the rule "play basketball => eat cereal" [support 40%, confidence 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
  - the rule "play basketball => not eat cereal" [support 20%, confidence 33.3%] is far more accurate, although it has lower support and confidence (both rules are checked in the sketch below)
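These percentages, and the interest measure defined on the next slide, can be verified in a few lines of Python:

```python
n = 5000
basketball, cereal, both = 3000, 3750, 2000

support    = both / n           # 0.40
confidence = both / basketball  # 0.667, below P(eat cereal) = 0.75
interest   = support / ((basketball / n) * (cereal / n))

print(round(confidence, 3), round(interest, 3))  # 0.667 0.889 (< 1)
```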
43. Interest
- Yet another objective measure for interestingness is interest, defined as
  interest(A => B) = P(A ∧ B) / (P(A) * P(B))
- Properties of this measure:
  - takes both P(A) and P(B) into consideration
  - P(A ∧ B) = P(A) * P(B), i.e., interest = 1, if A and B are independent events
  - A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
44. J-measure
- The J-measure is another objective measure for interestingness, defined (in its standard form) as
  J(X => Y) = P(X) * [ P(Y|X) * log2(P(Y|X) / P(Y)) + (1 - P(Y|X)) * log2((1 - P(Y|X)) / (1 - P(Y))) ]
- Properties of the J-measure:
  - again, takes both P(X) and P(Y) into consideration
  - value is always between 0 and 1
  - can be computed using pre-calculated values (sketched below)
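A small sketch computing the J-measure in the standard form above from pre-calculated values (log base 2 is an assumption):

```python
import math

def j_measure(p_x, p_y, conf):
    """J-measure of rule X => Y from P(X), P(Y) and confidence P(Y|X)."""
    def term(p, q):
        return 0.0 if p == 0 else p * math.log2(p / q)
    return p_x * (term(conf, p_y) + term(1 - conf, 1 - p_y))

# The basketball/cereal rule of the previous slide:
print(round(j_measure(p_x=0.6, p_y=0.75, conf=2 / 3), 4))  # 0.015
```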
45. Support/Frequency/J-measure (figure)
46. Confidence (figure)
47. Example: Selection of Interesting Association Rules
- To reduce the number of association rules that have to be considered, we could, for example, use one of the following selection criteria:
  - frequency and confidence
  - J-measure or interest
  - maximum rule size (whole rule, left-hand side, right-hand side)
  - rule attributes (e.g., templates)
48. Example: Problems with selection of rules
- A rule can correspond to prior knowledge or expectations:
  - how to encode the background knowledge into the system?
- A rule can refer to uninteresting attributes or attribute combinations:
  - could this be avoided by enhancing the preprocessing phase?
- Rules can be redundant:
  - redundancy elimination by rule covers, etc.
49. Interpretation and evaluation of the results of data mining
- Evaluation:
  - statistical validation and significance testing
  - qualitative review by experts in the field
  - pilot surveys to evaluate model accuracy
- Interpretation:
  - tree and rule models can be read directly
  - clustering results can be graphed and tabulated
  - code can be automatically generated by some systems
50. Visualization of Discovered Patterns (1)
- In some cases, visualization of the results of data mining (rules, clusters, networks) can be very helpful
- Visualization is actually already important in the preprocessing phase, for selecting the appropriate data and for looking at the data
- Visualization requires training and practice
51. Visualization of Discovered Patterns (2)
- Different backgrounds/usages may require different forms of representation:
  - e.g., rules, tables, cross-tabulations, or pie/bar charts
- Concept hierarchies are also important:
  - discovered knowledge might be more understandable when represented at a high level of abstraction
  - interactive drill-up/drill-down, pivoting, slicing, and dicing provide different perspectives on the data
- Different kinds of knowledge require different kinds of representation:
  - association, classification, clustering, etc.
52. Visualization (figure)
54. Utilization of the results
(Figure: a pyramid; the potential to support business decisions increases toward the top, with the typical role at each layer in parentheses.)
- Making Decisions (End User)
- Data Presentation: Visualization Techniques (Business Analyst)
- Data Mining: Information Discovery (Data Analyst)
- Data Exploration: Statistical Analysis, Querying and Reporting (Data Analyst)
- Data Warehouses / Data Marts: OLAP, MDA (DBA)
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP (DBA)
55. Summary
- Data mining: semi-automatic discovery of interesting patterns from large data sets
- Knowledge discovery is a process:
  - preprocessing
  - data mining
  - post-processing
  - using and utilizing the knowledge
56. Summary
- Preprocessing is important in order to get useful results!
- If a loosely defined mining methodology is used, post-processing is needed in order to find the interesting results!
- Visualization is useful in both pre- and post-processing!
- One has to be able to utilize the found knowledge!
57. References: KDD Process
- P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley, Harlow, England, 1996.
- R. J. Brachman and T. Anand. The process of knowledge discovery in databases. In Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
- D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.
- M. S. Chen, J. Han, and P. S. Yu. Data mining: an overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
- T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
- Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997.
- D. Keim. Visual techniques for exploring databases. Tutorial notes, KDD'97, Newport Beach, CA, USA, 1997.
- D. Keim. Visual data mining. Tutorial notes, VLDB'97, Athens, Greece, 1997.
- D. Keim and H.-P. Kriegel. Visual techniques for mining large databases: a comparison. IEEE Transactions on Knowledge and Data Engineering, 8(6), 1996.
58. References: KDD Process
- W. Kloesgen. Explora: a multipattern and multistrategy discovery assistant. In U. M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, 249-271. AAAI/MIT Press, 1996.
- M. Klemettinen. A knowledge discovery methodology for telecommunication network alarm databases. Ph.D. thesis, University of Helsinki, Report A-1999-1, 1999.
- M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, Gaithersburg, Maryland, Nov. 1994.
- G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: an overview. In U. M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
- G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
- D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
- T. Redman. Data Quality: Management and Technology. Bantam Books, New York, 1992.
- A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8:970-974, Dec. 1996.
- D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: a generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.
59. References: KDD Process
- Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.
- R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.
60. Reminder: Course Organization
Course Evaluation
- Passing the course: min 30 points
  - home exam: min 13 points (max 30 points)
  - exercises/experiments: min 8 points (max 20 points)
    - at least 3 returned and reported experiments
  - group presentation: min 4 points (max 10 points)
- Remember also the other requirements:
  - attending the lectures (5/7)
  - attending the seminars (4/5)
  - attending the exercises (4/5)
61. Seminar Presentations / Groups 9-10
Visualization and data mining:
D. Keim, H.-P. Kriegel, and T. Seidl. "Supporting Data Mining of Large Databases by Visual Feedback Queries", ICDE'94.
62. Seminar Presentations / Groups 9-10
Interestingness:
G. Piatetsky-Shapiro and C. J. Matheus. "The Interestingness of Deviations", KDD'94.
63. KDD process
Thanks to Jiawei Han from Simon Fraser University and Mika Klemettinen from Nokia Research Center for their slides, which greatly helped in preparing this lecture! Thanks also to Fosca Giannotti and Dino Pedreschi from Pisa for their slides.