What is data mining?
1
What is data mining?
  • Wlodzislaw Duch
  • Dept. of Informatics, Nicholas Copernicus
    University, Torun, Poland
  • http://www.phys.uni.torun.pl/~duch

ISEP Porto, 8-12 July 2002
2
What is it about?
  • Data used to be precious! Now it is overwhelming...
  • In many areas of science, business and commerce people are drowning in data.
  • Ex.: astronomy, a super-telescope built by data mining in existing databases.
  • Database technology allows storing and retrieving large amounts of data of any kind.
  • There is knowledge hidden in data.
  • Data analysis requires intelligence.

3
Ancient history
  • 1960s: first databases, collections of data.
  • 1970s: RDBMS; the relational data model, most popular today; large centralized systems.
  • 1980s: application-oriented data models, specialized for scientific, geographic and engineering data, time series, text; object-oriented models; distributed databases.
  • 1990s: multimedia and Web databases, data warehousing (subject-oriented DB for decision support), and on-line analytical processing (OLAP); deduction and verification of hypothetical patterns.
  • Data mining: first conference in 1989, book in 1996; discover something useful!

4
Data Mining History
  • 1989 IJCAI Workshop on Knowledge Discovery in
    Databases (Piatetsky-Shapiro and W. Frawley 1991)
  • 1991-1994 Workshops on KDD
  • 1996 Advances in Knowledge Discovery and Data
    Mining (Fayyad et al.)
  • 1995-1998 International Conferences on Knowledge
    Discovery in Databases and Data Mining
    (KDD95-98)
  • 1997 Journal of Data Mining and Knowledge
    Discovery
  • 1998 ACM SIGKDD, SIGKDD1999-2001 conferences,
    and SIGKDD Explorations
  • Many conferences on data mining PAKDD, PKDD,
    SIAM-Data Mining, (IEEE) ICDM, etc.

5
References, papers
  • KDD WWW Resources
  • http://www.kdd.org
  • http://www.kdnuggets.com
  • http://www.the-data-mine.com
  • http://www.acm.org/sigkdd/

ResearchIndex: http://citeseer.nj.nec.com/cs
AI/ML aspects: http://www.phys.uni.torun.pl/kmk
NN/Statistics: http://www.phys.uni.torun.pl/kmk
Comparison of results on many datasets: http://www.phys.uni.torun.pl/kmk
6
Data Mining and statistics
  • Statisticians deal with data: what's new in DM?
  • Many DM methods have roots in statistics.
  • Statistics used to deal with small, controlled
    experiments, while DM deals with large, messy
    collections of data.
  • Statistics is based on analytical probabilistic
    models, DM is based on algorithms that find
    patterns in data.
  • Many DM algorithms came from other sources and
    slowly get some statistical justification.
  • Key factor for DM is the computer
    cost/performance.
  • Sometimes DM is more art than science.

7
Types of Data
  • Statistical data: clean, numerical, controlled experiments, vector space model.
  • Relational data: marketing, finances.
  • Textual data: Web, NLP, search.
  • Complex structures: chemistry, economics.
  • Sequence data: bioinformatics.
  • Multimedia data: images, video.
  • Signals: dynamic data, biosignals.
  • AI data: logical problems, games, behavior.

8
What is DM?
  • Discovering interesting patterns, finding useful summaries of large databases.
  • DM is more than database technology and On-Line Analytic Processing (OLAP) tools.
  • DM is more than statistical analysis, although it includes classification, association, clustering, outlier and trend analysis, decision rules, prototype cases, multidimensional visualization, etc. Understanding of data has not been an explicit goal of statistics, which focuses on predictive data models.

9
DM applications
  • Many applications, but spectacular new knowledge is rarely discovered. Some examples:
  • Diapers and beer correlation: place them close together and put potato chips in between.
  • Mining astronomical catalogs (Skycat, Sloan Sky Survey): a new subtype of stars has been discovered!
  • Bioinformatics: more precise characterization of some diseases; many discoveries to be made?
  • Credit card fraud detection (HNC company).
  • Discounts on air/hotel for frequent travelers.

10
Important issues in data mining.
  • Use of statistical and CI methods for KDD.
  • What makes an interesting pattern?
  • Handling uncertainty in the data.
  • Handling noise, outliers and missing or unknown
    data.
  • Finding linguistic variables, discretization of
    continuous data, presentation and evaluation of
    knowledge.
  • Knowledge representation for structural data, heterogeneous information, textual databases: NLP.
  • Performance, scalability, distributed data,
    incremental or on-line processing.
  • Best form of explanation depends on the
    application.

11
DM dangers
  • If there are too many conclusions to draw, some inferences will be true by chance due to too-small data samples (Bonferroni's theorem).
  • Example 1: David Rhine (Duke Univ.) ESP tests. 1 person in 1000 guessed correctly the color (red or black) of 10 cards. Is this evidence for ESP? Retesting of these people gave average results. Rhine's conclusion: telling people that they have ESP interferes with their ability!
  • Example 2: using m letters to form a random sequence of length N, all possible subsequences of length log_m N are found => Bible code!
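
The Rhine example can be checked with a few lines of arithmetic; a minimal sketch (the card and subject counts are taken from the slide, everything else is plain probability):

```python
# One subject guessing the color (red/black) of 10 cards purely by chance:
p_single = 0.5 ** 10                         # = 1/1024

# Among 1000 subjects, roughly one perfect guesser is expected by luck alone.
expected = 1000 * p_single                   # about 0.98 "psychics" per 1000
p_at_least_one = 1 - (1 - p_single) ** 1000  # about 0.62

print(f"expected perfect guessers: {expected:.2f}")
print(f"P(at least one)          : {p_at_least_one:.2f}")
```

So finding one "ESP subject" in 1000 is exactly what chance predicts, which is the point of the Bonferroni warning.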

12
Data Mining process
  • Knowledge discovery in databases (KDD):
  • a search process for understandable and useful patterns in data.

[Diagram: the KDD process; the data-mining stage takes most of the effort.]
13
Stages of DM process
  • Data gathering: data warehousing, Web crawling.
  • Preparation of the data: cleaning, removing outliers and impossible values, removing wrong records, finding missing data.
  • Exploratory data analysis: visualization of different aspects of data.
  • Finding relevant features for the questions that are asked, preparing data structures for predictive methods, converting symbolic values to numerical representation.
  • Pattern extraction: discovery, rules, prototypes.
  • Evaluation of the knowledge gained: finding useful patterns, consultation with experts.

14
Multidimensional Data Cuboids
  • Data warehouses use a multidimensional data model.
  • Projections (views) of data on different dimensions (attributes) form data cuboids.
  • In DB warehousing literature: base cuboid = the original N-dim. data; apex cuboid = the 0-D cuboid, the highest-level summary; data cube = the lattice of cuboids.
  • Ex.: a sales data cube, viewed in multiple dimensions.
  • Dimension tables, ex. item(item_name, brand, type) or time(day, week, month, quarter, year).
  • Fact tables: measures (such as cost) and keys to each of the related dimension tables.

15
Data Cube A Lattice of Cuboids
[Lattice diagram: example cuboids such as (time, item) and (time, item, location).]
16
Forms of useful knowledge
AI/Machine Learning camp: "Neural nets are black boxes. Unacceptable! Symbolic rules forever."
  • But ... knowledge accessible to humans is in:
  • symbols,
  • similarity to prototypes,
  • images, visual representations.
  • What type of explanation is satisfactory?
  • An interesting question for cognitive scientists.
  • Different answers in different fields.

17
Forms of knowledge
  • Humans remember examples of each category and refer to such examples, as similarity-based or nearest-neighbor methods do.
  • Humans create prototypes out of many examples, as Gaussian classifiers, RBF networks and neurofuzzy systems do.
  • Logical rules are the highest form of summarization of knowledge.
  • Types of explanation:
  • exemplar-based: prototypes and similarity;
  • logic-based: symbols and rules;
  • visualization-based: exploratory data analysis, maps, diagrams, relations ...

18
Computational Intelligence
Related fields: soft computing, Computational Intelligence (Data => Knowledge), Artificial Intelligence.
19
CI methods for data mining
  • Provide non-parametric (universal), predictive models of data.
  • Classify new data into pre-defined categories, supporting diagnosis and prognosis.
  • Discover new categories, clusters, patterns.
  • Discover interesting associations, correlations.
  • Allow understanding of the data by creating fuzzy or crisp logical rules, or prototypes.
  • Help to visualize multi-dimensional relationships among data samples.

20
Association rules
  • Classification rules: X => C(X).
  • Association rules: looking for correlations between components of X, i.e. the probability p(X_i | X_1, ..., X_{i-1}, X_{i+1}, ..., X_n).
  • Market basket problem: many items selected from an available pool into a basket; what are the correlations?
  • Only frequent items are interesting: itemsets with high support, i.e. appearing together in many baskets. Search for rules above a support threshold.
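
Support counting over baskets can be sketched in a few lines; the baskets and item names below are invented for illustration, and only itemsets of size 1 and 2 are counted:

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets (sets of items bought together).
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]
min_support = 0.5  # itemset must appear in at least half of the baskets

counts = Counter()
for basket in baskets:
    for size in (1, 2):  # singletons and pairs only, for brevity
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1

n = len(baskets)
frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
print(frequent)  # ('beer',), ('diapers',) and ('beer','diapers') have support 0.75
```

Real systems (Apriori and its successors) prune the search instead of enumerating all itemsets, but the support definition is the same.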

21
Association rules - related
  • Problems related to market basket: correlation between documents, high for plagiarism; phrases shared by documents, high for semantically related documents.
  • Causal relations matter, although they may be difficult to determine: lower the price of diapers and keep the beer price high, or try the reverse; what will happen?
  • More general approaches: Bayesian belief networks, causal networks, graphical models.

22
Clustering
  • Given points in a multidimensional space, divide them into groups that are similar.
  • Ex.: if an epidemic breaks out, look for the locations of cases on the map (cholera in London). Documents in the space of words cluster according to their topics.
  • How to measure similarity?
  • Hierarchical approaches start from single cases and join them, forming clusters; ex.: dendrograms. Centroid approaches assume a few centers and adapt their positions; ex.: k-means, LVQ, SOM.
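
The centroid idea can be sketched as a minimal 1-D k-means (illustrative only, not the LVQ/SOM variants): assign each point to its nearest center, then move each center to the mean of its cluster.

```python
def kmeans_1d(points, centers, iters=10):
    """Plain 1-D k-means; centers is the list of initial center positions."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], centers=[0.0, 5.0])
print(centers)  # converges to one center at 1.0 and one at 9.5
```

The same assign/update loop works in any dimension once `abs(p - c)` is replaced by a vector distance.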

23
Neural networks
  • Inspired by neurobiology: simple elements cooperate, changing internal parameters.
  • A large field: dozens of different models, over 500 papers on NN in medicine each year.
  • Supervised networks: heteroassociative mapping X => Y, symptoms => diseases; universal approximators.
  • Unsupervised networks: clusterization, competitive learning, autoassociation.
  • Reinforcement learning: modeling behavior, playing games, sequential data.

24
Supervised learning
  • Compare the desired with the achieved outputs: you can't always get what you want.
  • Examples: MLP/RBF NN, kNN, SVM, LDA, DT.
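
One of the listed methods, kNN, fits in a few lines; a toy sketch with invented data points:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs.
    Returns the majority label among the k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# hypothetical 2-D training set with two classes
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
         ((1.0, 1.1), "B"), ((0.9, 1.0), "B")]
print(knn_predict(train, (0.0, 0.1)))  # "A": all three nearest neighbors are A
```

Supervised learning here means the training pairs already carry the desired output (the label), and prediction compares a new input against them.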

25
Unsupervised learning
  • Find interesting structures in data.
  • SOM, many variants.

26
Reinforcement learning
  • Reward comes after the sequence of actions.
  • Games, survival behavior, planning sequences of
    actions.

27
Unsupervised NN example
Clustering and visualization of the quality of
life index (UN data) by SOM map.
Poor classification, inaccurate visualization.
28
Real and artificial neurons
[Figure: a real neuron (dendrites, signals, synapses, axon) beside an artificial network; nodes = artificial neurons, synapses = weights.]
29
Neural network for MI diagnosis
Myocardial Infarction
[Network diagram: inputs (sex, age, smoking, elevation, pain, ECG ST, duration) pass through input and output weights to produce p(MI|X) = 0.7.]
30
MI network function
  • Training: setting the values of weights and thresholds; efficient algorithms exist.

Effect: a non-linear regression function.
Such networks are universal approximators: they may learn any mapping X => Y.
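
The basic computation of one such unit can be sketched as a logistic neuron; the weights, bias and inputs below are invented, only the form of the computation matches the network above:

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs squashed to (0, 1); read the output as a probability."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-activation))

# hypothetical patient encoding and weights
p = neuron(inputs=[1.0, 0.6, 1.0], weights=[0.8, 1.1, -0.4], bias=-0.2)
print(round(p, 2))  # about 0.70, like the p(MI|X) shown on the slide
```

Training adjusts `weights` and `bias` to reduce the gap between such outputs and the desired ones; stacking layers of these units gives the universal approximation property.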
31
Knowledge from networks
  • Simplify networks: force most weights to 0, quantize remaining parameters, be constructive!
  • Regularization: a mathematical technique improving the predictive abilities of the network.
  • Result: MLP2LN neural networks that are equivalent to logical rules.

32
MLP2LN
  • Converts MLP neural networks into a network performing logical operations (LN).

Input layer.
Output: one node per class.
Aggregation: better features.
Rule units: threshold logic.
Linguistic units: windows, filters.
33
Learning dynamics
Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed, with wide margins.
34
Neurofuzzy systems
Fuzzy logic: m(x) ∈ {0,1} (no/yes) is replaced by a degree m(x) ∈ [0,1]. Triangular, trapezoidal, Gaussian ... membership functions (MF).
Membership functions in many dimensions:
  • Feature Space Mapping (FSM), a neurofuzzy system.
  • Neural adaptation, estimation of the probability density function (PDF) using a single-hidden-layer network (RBF-like) with nodes realizing separable functions.

35
GhostMiner Philosophy
  • GhostMiner, data mining tools from our lab: http://www.fqspl.com.pl/ghostminer/
  • Separate the process of model building and knowledge discovery from model use => GhostMiner Developer + GhostMiner Analyzer.
  • There is no free lunch: provide different types of tools for knowledge discovery. Decision tree, neural, neurofuzzy, similarity-based, committees.
  • Provide tools for visualization of data.
  • Support the process of knowledge discovery / model building and evaluation, organizing it into projects.

36
Heterogeneous systems
Homogeneous systems: one type of building block, same type of decision borders. Ex.: neural networks, SVMs, decision trees, kNNs. Committees combine many models together, but lead to complex models that are difficult to understand.
  • Discovering the simplest class structures, with the right inductive bias, requires heterogeneous adaptive systems (HAS).
  • Ockham's razor: simpler systems are better.
  • HAS examples:
  • NN with many types of neuron transfer functions.
  • k-NN with different distance functions.
  • DT with different types of test criteria.

37
Wine data example
Chemical analysis of wine from grapes grown in the same region in Italy but derived from three different cultivars. Task: recognize the source of a wine sample. 13 quantities measured, continuous features:
  • alcohol content
  • ash content
  • magnesium content
  • flavanoids content
  • proanthocyanins phenols content
  • OD280/D315 of diluted wines
  • malic acid content
  • alkalinity of ash
  • total phenols content
  • nonanthocyanins phenols content
  • color intensity
  • hue
  • proline.

38
Exploration and visualization
  • General info about the data

39
Exploration data
  • Inspect the data

40
Exploration data statistics
  • Distribution of feature values

Proline has very large values, the data should be
standardized before further processing.
41
Exploration data standardized
  • Standardized data: unit standard deviation; about 2/3 of all data should fall within [mean-std, mean+std].

Other options: normalize to fit in [-1,1], or normalize rejecting some extreme values.
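
Z-score standardization as described above is a one-liner per value; the sample numbers are invented proline-like magnitudes:

```python
import statistics

def standardize(values):
    """Subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std; stdev() is another common choice
    return [(v - mean) / std for v in values]

z = standardize([1065.0, 1050.0, 735.0, 520.0])
# the result has mean 0 and unit standard deviation, so features with very
# different raw scales (proline vs. hue) become comparable
```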
42
Exploration 1D histograms
  • Distribution of feature values in classes

Some features are more useful than others.
43
Exploration 1D/3D histograms
  • Distribution of feature values in classes, 3D

44
Exploration 2D projections
  • Projections (cuboids) onto selected 2D feature pairs.
45
Visualize data
Relations in more than 3D are hard to imagine. SOM mappings are popular for visualization, but rather inaccurate, with no measure of distortions.
Measure of topographical distortions: map all X_i points from R^n to x_i points in R^m, m < n, and ask: how well are the distances R_ij = D(X_i, X_j) reproduced by the distances r_ij = d(x_i, x_j)?
Use m = 2 for visualization, higher m for dimensionality reduction.
46
Visualize data MDS
Multidimensional scaling: invented in psychometry by Torgerson (1952), re-invented by Sammon (1969) and myself (1994). Minimize the measure of topographical distortions by moving the x coordinates.
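
The distortion measure being minimized can be sketched as Sammon's stress (standard formulation; the pairwise distances are assumed flattened into two parallel lists):

```python
def sammon_stress(D, d):
    """D: original pairwise distances D_ij, d: mapped distances d_ij,
    paired in the same order. Zero means a perfect reproduction."""
    scale = sum(D)
    return sum((Dij - dij) ** 2 / Dij for Dij, dij in zip(D, d)) / scale

print(sammon_stress([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0: distances preserved
print(sammon_stress([1.0, 2.0], [1.5, 2.0]))            # > 0: one distance distorted
```

MDS then moves the low-dimensional coordinates x_i (by gradient or other optimization) to drive this stress down.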
47
Visualize data Wine
3 clusters are clearly distinguished, 2D is fine.
The green outlier can be identified easily.
48
Decision trees
Simplest things first: use a decision tree to find logical rules.
Test a single attribute, find a good point to split the data, separating vectors from different classes. DT advantages: fast, simple, easy to understand, easy to program; many good algorithms.
4 attributes used; 10 errors, 168 correct, 94.4% correct.
49
Decision borders
Univariate trees: test the value of a single attribute, x < a.
Multivariate trees: test combinations of attributes (hyperplanes).
Result: the feature space is divided into cuboids.
Wine data: univariate decision tree borders for proline and flavanoids.
50
Logical rules
Crisp logic rules: for continuous x, use linguistic variables (predicate functions).
s_k(x) = True iff x ∈ [X_k, X'_k], for example:
small(x) = True iff x < 1
medium(x) = True iff x ∈ [1,2]
large(x) = True iff x > 2
Linguistic variables are used in crisp (propositional, Boolean) logic rules:
IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
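
The linguistic variables above translate directly into predicates, and the Brownie rule into a Boolean expression (the Brownie attributes are, of course, illustrative):

```python
# Crisp linguistic variables: each returns only True or False.
def small(x):  return x < 1
def medium(x): return 1 <= x <= 2
def large(x):  return x > 2

# A crisp rule built from such predicates.
def is_brownie(height, has_hat, has_beard):
    return small(height) and has_hat and has_beard

print(is_brownie(0.5, True, True))   # True
print(is_brownie(2.5, True, True))   # False: not small
```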
51
Crisp logic decisions
Crisp logic is based on rectangular membership functions:
True/False values jump from 0 to 1. Step functions are used for partitioning the feature space.
Very simple hyper-rectangular decision borders.
A severe limitation on the expressive power of crisp logical rules!
52
Logical rules - advantages
Logical rules, if simple enough, are preferable.
  • Rules may expose limitations of black-box solutions.
  • Only relevant features are used in rules.
  • Rules may sometimes be more accurate than NN and other CI methods.
  • Overfitting is easy to control; rules usually have a small number of parameters.
  • Rules forever!? A logical rule about logical rules is:

53
Logical rules - limitations
  • Logical rules are preferred, but ...
  • Only one class is predicted: p(Ci|X,M) = 0 or 1; a black-and-white picture may be inappropriate in many applications.
  • A discontinuous cost function allows only non-gradient optimization.
  • Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex sets of rules.
  • Reliable crisp rules may reject some cases as unclassified.
  • Interpretation of crisp rules may be misleading.
  • Fuzzy rules are not so comprehensible.

54
Rules - choices
Simplicity vs. accuracy. Confidence vs. rejection rate.
p++ is a hit; p+- a false alarm; p-+ a miss.
Accuracy (overall): A(M) = p++ + p--
Error rate: L(M) = p+- + p-+
Rejection rate: R(M) = p+r + p-r = 1 - L(M) - A(M)
Sensitivity: S+(M) = p++ / p+
Specificity: S-(M) = p-- / p-
55
Rules error functions
  • The overall accuracy is equal to a combination of sensitivity and specificity, weighted by the a priori probabilities:

A(M) = p+ S+(M) + p- S-(M)
Optimization of rules for the C+ class: a large γ means no errors but a high rejection rate.
E(M;γ) = γ L(M) - A(M) = γ (p+- + p-+) - (p++ + p--)
min_M E(M;γ) <=> min_M [(1+γ) L(M) + R(M)]
Optimization with different costs of errors:
min_M E(M;α) = min_M [p+- + α p-+]
             = min_M [p+ (1 - S+(M)) - p+r(M) + α (p- (1 - S-(M)) - p-r(M))]
ROC (Receiver Operating Characteristic) curves: p++ (p+-), i.e. hit rate vs. false-alarm rate.
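
These quantities are easy to compute from raw confusion counts; a sketch that ignores rejected cases for simplicity (the counts are invented):

```python
def rates(tp, fn, fp, tn):
    """tp: hits, fn: misses, fp: false alarms, tn: correct rejections of C+."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total        # A(M)
    error = (fn + fp) / total           # L(M)
    sensitivity = tp / (tp + fn)        # S+(M)
    specificity = tn / (tn + fp)        # S-(M)
    return accuracy, error, sensitivity, specificity

a, err, s_plus, s_minus = rates(tp=40, fn=10, fp=5, tn=45)
print(a, err, s_plus, s_minus)  # 0.85 0.15 0.8 0.9
```

Note that the weighted identity above holds here: with p+ = p- = 0.5, A(M) = 0.5 * 0.8 + 0.5 * 0.9 = 0.85.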
56
Wine example SSV rules
  • Decision trees provide rules of different complexity.

Simplest tree: 5 nodes, corresponding to 3 rules; 25 errors, mostly Class 2/3 wines mixed.
57
Wine SSV 5 rules
  • Lower pruning leads to a more complex tree.

7 nodes, corresponding to 5 rules; 10 errors, mostly Class 2/3 wines mixed.
58
Wine SSV optimal rules
What is the optimal complexity of rules? Use crossvalidation to estimate generalization.
Various solutions may be found, depending on the search: 5 rules with 12 premises, making 6 errors; 6 rules with 16 premises and 3 errors; 8 rules, 25 premises, and 1 error.
if OD280/D315 > 2.505 ∧ proline > 726.5 ∧ color > 3.435 then class 1
if OD280/D315 > 2.505 ∧ proline > 726.5 ∧ color < 3.435 then class 2
if OD280/D315 < 2.505 ∧ hue > 0.875 ∧ malic-acid < 2.82 then class 2
if OD280/D315 > 2.505 ∧ proline < 726.5 then class 2
if OD280/D315 < 2.505 ∧ hue < 0.875 then class 3
if OD280/D315 < 2.505 ∧ hue > 0.875 ∧ malic-acid > 2.82 then class 3
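
The six rules above transcribe directly into a function; a sketch that keeps the slide's strict inequalities, so a value landing exactly on a threshold is treated as rejected/unclassified (an assumption, the slide does not say):

```python
def wine_class(od, proline, color, hue, malic_acid):
    """Apply the six SSV rules; od stands for the OD280/D315 feature."""
    if od > 2.505 and proline > 726.5 and color > 3.435: return 1
    if od > 2.505 and proline > 726.5 and color < 3.435: return 2
    if od < 2.505 and hue > 0.875 and malic_acid < 2.82: return 2
    if od > 2.505 and proline < 726.5:                   return 2
    if od < 2.505 and hue < 0.875:                       return 3
    if od < 2.505 and hue > 0.875 and malic_acid > 2.82: return 3
    return None  # exactly on a threshold: unclassified

print(wine_class(od=3.0, proline=800, color=4.0, hue=1.0, malic_acid=2.0))  # 1
```

Writing the rules this way also makes the reject option of crisp rule sets (slide 53) concrete: some inputs simply match no rule.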
59
Wine FSM rules
SSV: hierarchical rules. FSM: density estimation with feature selection.
Complexity of rules depends on the desired accuracy. Use rectangular functions for crisp rules. Optimal accuracy may be evaluated using crossvalidation.
FSM discovers simpler rules, for example:
if proline > 929.5 then class 1 (48 cases, 45 correct, 2 recovered by other rules)
if color < 3.79285 then class 2 (63 cases, 60 correct)
60
Examples of interesting knowledge discovered!
  • The most famous example of knowledge discovered by data mining:
  • the correlation between beer, milk and diapers.

Other examples: 2 subtypes of galactic spectra forced astrophysicists to reconsider stellar evolutionary processes. Several examples of knowledge found by us in medical and other datasets follow.
61
Mushrooms
  • The Mushroom Guide: no simple rule for mushrooms; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy.

8124 cases, 51.8% edible, the rest non-edible. 22 symbolic attributes, up to 12 values each, equivalent to 118 logical features, or 2^118 ≈ 3·10^35 possible input vectors.
Odor: almond, anise, creosote, fishy, foul, musty, none, pungent, spicy.
Spore print color: black, brown, buff, chocolate, green, orange, purple, white, yellow.
Safe rule for edible mushrooms:
odor = (almond ∨ anise ∨ none) ∧ spore-print-color ≠ green
48 errors, 99.41% correct.
This is why animals have such a good sense of smell! What does it tell us about odor receptors?
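
The safe-edibility rule is a two-condition test; sketched in code with attribute values as plain strings (the encoding is assumed, not the actual UCI file format):

```python
def edible_safe(odor, spore_print_color):
    """The slide's safe rule: edible iff odor is mild AND spore print is not green."""
    return odor in {"almond", "anise", "none"} and spore_print_color != "green"

print(edible_safe("anise", "brown"))  # True
print(edible_safe("foul", "brown"))   # False
print(edible_safe("none", "green"))   # False
```

Two attributes out of 22 already give 99.41% accuracy, which is the point about odor dominating the problem.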
62
Mushrooms rules
  • To eat or not to eat, that is the question! Not any more ...

A mushroom is poisonous if:
R1) odor ≠ (almond ∨ anise ∨ none); 120 errors, 98.52%
R2) spore-print-color = green; 48 errors, 99.41%
R3) odor = none ∧ stalk-surface-below-ring = scaly ∧ stalk-color-above-ring ≠ brown; 8 errors, 99.90%
R4) habitat = leaves ∧ cap-color = white; no errors!
R1 + R2 are quite stable, found even with 10% of the data; R3 and R4 may be replaced by other rules, ex.:
R'3) gill-size = narrow ∧ stalk-surface-above-ring = (silky ∨ scaly)
R'4) gill-size = narrow ∧ population = clustered
Only 5 of 22 attributes used! Simplest possible rules? 100% in CV tests: the structure of this data is completely clear.
63
Recurrence of breast cancer
  • Data from Institute of Oncology, University
    Medical Center, Ljubljana, Yugoslavia.

286 cases: 201 no-recurrence (70.3%), 85 recurrence (29.7%).
Example record: no-recurrence-events, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes.
9 nominal features: age (9 bins), menopause, tumor-size (12 bins), nodes involved (13 bins), node-caps, degree-malignant (1,2,3), breast, breast quadrant, radiation.
64
Rules for breast cancer
  • Data from Institute of Oncology, University
    Medical Center, Ljubljana, Yugoslavia.

Many systems used; 65-78% accuracy reported.
Single rule:
IF nodes-involved ∉ [0,2] ∧ degree-malignant = 3 THEN recurrence, ELSE no-recurrence
76.2% accuracy; only trivial knowledge in the data: highly malignant breast cancer involving many nodes is likely to strike back.
65
Recurrence - comparison.
Method                    10xCV accuracy
MLP2LN, 1 rule            76.2
SSV DT, stable rules      75.7 ± 1.0
k-NN, k=10, Canberra      74.1 ± 1.2
MLP+backprop              73.5 ± 9.4 (Zarndt)
CART DT                   71.4 ± 5.0 (Zarndt)
FSM, Gaussian nodes       71.7 ± 6.8
Naive Bayes               69.3 ± 10.0 (Zarndt)
Other decision trees      < 70.0
66
Breast cancer diagnosis.
  • Data from University of Wisconsin Hospital,
    Madison, collected by dr. W.H. Wolberg.

699 cases, 9 features quantized from 1 to 10: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses.
Task: distinguish benign from malignant cases.
67
Breast cancer rules.
  • Data from University of Wisconsin Hospital,
    Madison, collected by dr. W.H. Wolberg.

Simplest rule from MLP2LN, large regularization:
IF uniformity of cell size < 3 THEN benign, ELSE malignant
Sensitivity = 0.97, Specificity = 0.85.
More complex NN solutions, from a 10CV estimate: Sensitivity = 0.98, Specificity = 0.94.
68
Breast cancer comparison.
Method                       10xCV accuracy
k-NN, k=3, Manhattan         97.0 ± 2.1 (GM)
FSM, neurofuzzy              96.9 ± 1.4 (GM)
Fisher LDA                   96.8
MLP+backprop                 96.7 (Ster, Dobnikar)
LVQ                          96.6 (Ster, Dobnikar)
IncNet (neural)              96.4 ± 2.1 (GM)
Naive Bayes                  96.4
SSV DT, 3 crisp rules        96.0 ± 2.9 (GM)
LDA (linear discriminant)    96.0
Various decision trees       93.5-95.6
69
Melanoma skin cancer
  • Collected in the Outpatient Center of Dermatology in Rzeszów, Poland.
  • Four types of melanoma: benign, blue, suspicious, or malignant.
  • 250 cases, with almost equal class distribution.
  • Each record in the database has 13 attributes: asymmetry, border, color (6), diversity (5).
  • TDS (Total Dermatoscopy Score): a single aggregated index.
  • Goal: a hardware scanner for preliminary diagnosis.

70
Melanoma rules
R1: IF TDS < 4.85 AND C-BLUE IS absent THEN MELANOMA IS Benign-nevus
R2: IF TDS < 4.85 AND C-BLUE IS present THEN MELANOMA IS Blue-nevus
R3: IF TDS > 5.45 THEN MELANOMA IS Malignant
R4: IF TDS > 4.85 AND TDS < 5.45 THEN MELANOMA IS Suspicious
5 errors (98.0%) on the training set; 0 errors (100%) on the test set.
Feature aggregation is important! Without TDS, 15 rules are needed.
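
The four rules above collapse into one small function; a sketch in which the handling of values exactly on the 4.85 and 5.45 boundaries is an assumption, since the slide leaves it ambiguous:

```python
def melanoma_type(tds, c_blue_present):
    """Classify a lesion from TDS and the C-BLUE attribute per rules R1-R4."""
    if tds < 4.85:                                     # R1, R2
        return "Blue-nevus" if c_blue_present else "Benign-nevus"
    if tds > 5.45:                                     # R3
        return "Malignant"
    return "Suspicious"                                # R4: TDS between 4.85 and 5.45

print(melanoma_type(4.0, False))  # Benign-nevus
print(melanoma_type(4.0, True))   # Blue-nevus
print(melanoma_type(6.0, False))  # Malignant
print(melanoma_type(5.0, False))  # Suspicious
```

The single aggregated TDS feature does almost all the work here, which is the slide's point about feature aggregation.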
71
Melanoma results
Method                        Rules  Training     Test
MLP2LN, crisp rules           4      98.0 (all)   100
SSV Tree, crisp rules         4      97.5 ± 0.3   100
FSM, rectangular f.           7      95.5 ± 1.0   100
kNN + prototype selection     13     97.5 ± 0.0   100
FSM, Gaussian f.              15     93.7 ± 1.0   95 ± 3.6
kNN, k=1, Manh, 2 features    --     97.4 ± 0.3   100
LERS, rough rules             21     --           96.2
72
Summary
  • Data mining is a large field; only a few issues have been mentioned here.
  • DM involves many steps; here only those related to pattern recognition were stressed, but in practice scalability and efficiency issues may be most important.

Neural networks are still used mostly for building predictive data models, but they may also provide simplified descriptions in the form of rules. Rules are not the only form of data understanding. Rules may be a beginning for a practical application. Some interesting knowledge has been discovered.
73
Challenges
  • Fully automatic, universal data analysis systems: press the button and wait for the truth!
  • Discovery of theories rather than data models.
  • Integration with image/signal analysis.
  • Integration with reasoning in complex domains.
  • Combining expert systems with neural networks.

We are slowly getting there. More and more computational intelligence tools (including our own) are available.
74
Disclaimer
  • A few slides/figures were taken from various presentations found on the Internet; unfortunately I cannot identify the original authors at the moment, since these slides went through different iterations.
  • I have to apologize for that.