Data Mining: An Overview

Transcript and Presenter's Notes


1
Data Mining: An Overview
David Madigan
dmadigan@rci.rutgers.edu
http://stat.rutgers.edu/madigan
2
Overview
  • Brief Introduction to Data Mining
  • Data Mining Algorithms
  • Specific Examples
  • Algorithms: Disease Clusters
  • Algorithms: Model-Based Clustering
  • Algorithms: Frequent Items and Association Rules
  • Future Directions, etc.

3
Of Laws, Monsters, and Giants
  • Moore's law: processing capacity doubles every 18 months (CPU, cache, memory)
  • Its more aggressive cousin: disk storage capacity doubles every 9 months

4
What is Data Mining?
  • Finding interesting structure in data
  • Structure refers to statistical patterns,
    predictive models, hidden relationships
  • Examples of tasks addressed by Data Mining
  • Predictive Modeling (classification, regression)
  • Segmentation (Data Clustering)
  • Summarization
  • Visualization

5
(No Transcript)
6
(No Transcript)
7
Ronny Kohavi, ICML 1998
8
Ronny Kohavi, ICML 1998
9
Ronny Kohavi, ICML 1998
10
Stories: Online Retailing
11
Chapter 4: Data Analysis and Uncertainty
  • Elementary statistical concepts: random variables, distributions, densities, independence, point and interval estimation, bias, variance, MLE
  • Model (global, represents prominent structure) vs. Pattern (local, idiosyncratic deviations)
  • Frequentist vs. Bayesian
  • Sampling methods

12
Bayesian Estimation
e.g. beta-binomial model
Predictive distribution
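The formulas on this slide are images in the transcript; a standard reconstruction of the beta-binomial setup, in LaTeX:

    \theta \sim \mathrm{Beta}(\alpha, \beta), \qquad x \mid \theta \sim \mathrm{Binomial}(n, \theta)

    p(\theta \mid x) \;\propto\; \theta^{x+\alpha-1} (1-\theta)^{n-x+\beta-1}
    \;\Longrightarrow\; \theta \mid x \sim \mathrm{Beta}(x+\alpha,\; n-x+\beta)

    p(x_{\mathrm{new}} \mid x) \;=\; \int p(x_{\mathrm{new}} \mid \theta)\, p(\theta \mid x)\, d\theta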
13
Issues to do with p-values
  • Using thresholds of 0.05 or 0.01 regardless of
    sample size
  • Multiple testing (e.g. Friedman (1983) selecting
    highly significant regressors from noise)
  • Subtle interpretation. Jeffreys (1980): "I have always considered the arguments for the use of P absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the trial value was improbable; that is, that it has not predicted something that has not happened."

14
p-value as measure of evidence
Schervish (1996): if hypothesis H implies hypothesis H', then there should be at least as much support for H' as for H. This is not satisfied by p-values.
Grimmet and Ridenhour (1996): one might expect an outlying data point to lend support to the alternative hypothesis in, for instance, a one-way analysis of variance. Yet the value of the outlying data point that minimizes the significance level can lie within the range of the data.
15
Chapter 5: Data Mining Algorithms
A data mining algorithm is a well-defined
procedure that takes data as input and produces
output in the form of models or patterns
"well-defined": can be encoded in software; "algorithm": must terminate after some finite number of steps
16
Data Mining Algorithms
A data mining algorithm is a well-defined
procedure that takes data as input and produces
output in the form of models or patterns
Hand, Mannila, and Smyth
"well-defined": can be encoded in software; "algorithm": must terminate after some finite number of steps
17
Algorithm Components
1. The task the algorithm is used to address (e.g. classification, clustering, etc.)
2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model)
3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.)
4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.)
5. The data management technique used for storing, indexing, and retrieving data (critical when data are too large to reside in memory)
18
(No Transcript)
19
Backpropagation data mining algorithm
[Figure: feed-forward network with inputs x1-x4, hidden units h1 and h2, and output y; layer sizes 4, 2, 1]
  • vector of p input values multiplied by p × d1 weight matrix
  • resulting d1 values individually transformed by non-linear function
  • resulting d1 values multiplied by d1 × d2 weight matrix (a numerical sketch follows)

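A minimal numerical sketch of the forward pass and one steepest-descent step for this 4-2-1 network (numpy; the tanh transfer function and squared-error score are assumptions, since the slide's formulas are images):

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 2))   # p x d1 weight matrix (p = 4 inputs, d1 = 2 hidden)
    W2 = rng.normal(size=(2, 1))   # d1 x d2 weight matrix (d2 = 1 output)

    def forward(x):
        h = np.tanh(x @ W1)        # d1 values, individually transformed
        return h @ W2, h           # output y_hat and hidden activations

    def backprop_step(x, y, lr=0.1):
        """One steepest-descent step on squared error (assumed score)."""
        global W1, W2
        y_hat, h = forward(x)
        err = y_hat - y                     # d(score)/d(y_hat)
        grad_W2 = np.outer(h, err)
        grad_h = (W2 @ err) * (1 - h**2)    # chain rule through tanh
        grad_W1 = np.outer(x, grad_h)
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2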
20
Backpropagation (cont.)
Parameters
Score
Search: steepest descent (search for structure?)
21
Models and Patterns
Models
Probability Distributions
Structured Data
Prediction
  • Linear regression
  • Piecewise linear

22
Models
Probability Distributions
Structured Data
Prediction
  • Linear regression
  • Piecewise linear
  • Nonparametric regression

23
(No Transcript)
24
Models
Probability Distributions
Structured Data
Prediction
  • Linear regression
  • Piecewise linear
  • Nonparametric regression
  • Classification

logistic regression, naïve Bayes/TAN/Bayesian networks, NN, support vector machines, trees, etc.
25
Models
Probability Distributions
Structured Data
Prediction
  • Linear regression
  • Piecewise linear
  • Nonparametric regression
  • Classification
  • Parametric models
  • Mixtures of parametric models
  • Graphical Markov models (categorical, continuous,
    mixed)

26
Models
Probability Distributions
Structured Data
Prediction
  • Time series
  • Markov models
  • Mixture Transition Distribution models
  • Hidden Markov models
  • Spatial models
  • Linear regression
  • Piecewise linear
  • Nonparametric regression
  • Classification
  • Parametric models
  • Mixtures of parametric models
  • Graphical Markov models (categorical, continuous,
    mixed)

27
Markov Models
First-order
e.g.
g linear ⇒ standard first-order auto-regressive model
[Figure: chain graph y1 → y2 → y3 → … → yT]
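In symbols (a standard reconstruction; the slide's equations are images), in LaTeX:

    p(y_1, \ldots, y_T) \;=\; p(y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1}),
    \qquad y_t = g(y_{t-1}) + e_t, \quad e_t \sim N(0, \sigma^2)

    g \text{ linear:} \qquad y_t = a_0 + a_1\, y_{t-1} + e_t \quad \text{(the AR(1) model)}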
28
First-Order HMM/Kalman Filter
[Figure: hidden state chain x1 → x2 → x3 → … → xT, with each xt emitting an observation yt]
Note: to compute p(y1, …, yT) we need to sum/integrate over all possible state sequences (see the sketch below)
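Summing over all K^T state sequences is done in O(T K²) time by the forward recursion; a minimal sketch for discrete states and emissions (numpy; the parameterization is an assumption):

    import numpy as np

    def log_likelihood(y, pi, A, B):
        """p(y_1, ..., y_T) for a discrete HMM via the forward algorithm.
        pi: (K,) initial state probabilities
        A:  (K, K) transitions, A[i, j] = p(x_t = j | x_{t-1} = i)
        B:  (K, M) emissions,   B[k, m] = p(y_t = m | x_t = k)"""
        alpha = pi * B[:, y[0]]            # alpha_1(k) = p(x_1 = k, y_1)
        log_p = 0.0
        for t in range(1, len(y)):
            c = alpha.sum()                # rescale to avoid underflow
            log_p += np.log(c)
            alpha = (alpha / c) @ A * B[:, y[t]]
        return log_p + np.log(alpha.sum())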
29
Bias-Variance Tradeoff
High Bias - Low Variance
Low Bias - High Variance: overfitting (modeling the random component)
Score function should embody the compromise
30
The Curse of Dimensionality
X ~ MVN_p(0, I)
  • Gaussian kernel density estimation
  • Bandwidth chosen to minimize MSE at the mean
  • Suppose we want a fixed level of accuracy at the mean; the number of data points required explodes with dimension:

    Dimension | Data points
    1         | 4
    2         | 19
    3         | 67
    6         | 2,790
    10        | 842,000
31
Patterns
Local
  • Outlier detection
  • Changepoint detection
  • Bump hunting
  • Scan statistics
  • Association rules
Global
  • Clustering via partitioning
  • Hierarchical Clustering
  • Mixture Models
32
Scan Statistics via Permutation Tests
[Figure: a curved road with accident locations marked by x's]

The curve represents a road. Each x marks an accident; a red x denotes an injury accident, a black x means no injury. Is there a stretch of road where there is an unusually large fraction of injury accidents?
33
Scan with Fixed Window
  • If we know the length of the stretch of road that we seek, we could slide a window of that length along the road and find the most unusual window location

[Figure: the same accident map with a fixed-length window slid along the road]
34
How Unusual is a Window?
  • Let p_W and p_W̄ denote the true probability of being red inside and outside the window, respectively. Let (x_W, n_W) and (x_W̄, n_W̄) denote the corresponding counts
  • Use the GLRT for comparing H0: p_W = p_W̄ versus H1: p_W ≠ p_W̄
  • λ measures how unusual a window is

-2 log λ has an asymptotic chi-squared distribution with 1 df under H0
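The statistic itself is an image in the transcript; the standard binomial GLRT it refers to is, in LaTeX:

    \lambda \;=\;
    \frac{\hat p^{\,x_W + x_{\bar W}}\,(1-\hat p)^{\,(n_W - x_W) + (n_{\bar W} - x_{\bar W})}}
         {\hat p_W^{\,x_W}(1-\hat p_W)^{\,n_W - x_W}\;
          \hat p_{\bar W}^{\,x_{\bar W}}(1-\hat p_{\bar W})^{\,n_{\bar W} - x_{\bar W}}}

    \hat p_W = x_W/n_W, \qquad \hat p_{\bar W} = x_{\bar W}/n_{\bar W}, \qquad
    \hat p = (x_W + x_{\bar W})/(n_W + n_{\bar W})

Small λ (equivalently, large -2 log λ) flags an unusual window.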
35
Permutation Test
  • Since we look at the smallest λ over all window locations, we need the distribution of the smallest λ under the null hypothesis that there are no clusters
  • Look at the distribution of the smallest λ over, say, 999 random relabellings of the colors of the x's

[Figure: four example relabellings of the road, with smallest-λ values 0.376, 0.233, 0.412, and 0.222]

  • Look at the position of the observed smallest λ in this distribution to get the scan statistic p-value (e.g., if the observed smallest λ is 5th smallest, the p-value is 0.005); a code sketch follows
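A minimal sketch of the full procedure: a fixed window of w points, the binomial GLRT above, and a permutation p-value over 999 relabellings (numpy; the function names are mine):

    import numpy as np

    def neg2loglam(x_in, n_in, x_out, n_out):
        """-2 log GLRT for equal injury proportions inside vs. outside."""
        def ll(x, n, p):  # binomial log-likelihood with 0*log(0) = 0
            with np.errstate(divide="ignore", invalid="ignore"):
                return np.nan_to_num(x * np.log(p)) + np.nan_to_num((n - x) * np.log(1 - p))
        p0 = (x_in + x_out) / (n_in + n_out)
        p1, p2 = x_in / n_in, x_out / n_out
        return 2 * (ll(x_in, n_in, p1) + ll(x_out, n_out, p2)
                    - ll(x_in, n_in, p0) - ll(x_out, n_out, p0))

    def scan_stat(labels, w):
        """Largest -2 log lambda over all windows of w consecutive accidents."""
        labels = np.asarray(labels)
        n, x = len(labels), labels.sum()
        return max(neg2loglam(labels[i:i+w].sum(), w, x - labels[i:i+w].sum(), n - w)
                   for i in range(n - w + 1))

    def scan_pvalue(labels, w, n_perm=999, seed=0):
        rng = np.random.default_rng(seed)
        obs = scan_stat(labels, w)
        hits = sum(scan_stat(rng.permutation(labels), w) >= obs for _ in range(n_perm))
        return (hits + 1) / (n_perm + 1)   # rank of observed value among all 1000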

36
Variable Length Window
  • No need to use fixed-length window. Examine all
    possible windows up to say half the length of the
    entire road

[Figure legend: one circle color marks fatal accidents, the other non-fatal accidents]
37
Spatial Scan Statistics
  • Spatial scan statistic uses, e.g., circles
    instead of line segments

38
(No Transcript)
39
Spatial-Temporal Scan Statistics
  • Spatial-temporal scan statistics use cylinders, where the height of the cylinder represents a time window

40
Other Issues
  • Poisson model also common (instead of the Bernoulli model)
  • Covariate adjustment
  • Andrew Moore's group at CMU: efficient algorithms for scan statistics

41
Software: SaTScan and others
http://www.satscan.org
http://www.phrl.org
http://www.terraseer.com
42
Association Rules Support and Confidence
[Venn diagram: customers buying beer, customers buying diapers, and the overlap buying both]

  • Find all the rules Y ⇒ Z with minimum confidence and support
  • support, s: probability that a transaction contains Y ∪ Z
  • confidence, c: conditional probability that a transaction containing Y also contains Z
  • With minimum support 50% and minimum confidence 50%, we have
  • A ⇒ C (50%, 66.6%)
  • C ⇒ A (50%, 100%)
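A toy computation of support and confidence over a four-transaction database consistent with the slide's numbers (the original database is an image, so the transactions are an assumption):

    # transactions as sets of items
    D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

    def support(itemset):
        return sum(itemset <= t for t in D) / len(D)

    def confidence(lhs, rhs):
        return support(lhs | rhs) / support(lhs)

    print(support({"A", "C"}))        # 0.5   -> A => C has 50% support
    print(confidence({"A"}, {"C"}))   # 0.666 -> A => C has 66.6% confidence
    print(confidence({"C"}, {"A"}))   # 1.0   -> C => A has 100% confidence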

43
Mining Association Rules: An Example
Min. support 50%; min. confidence 50%
  • For rule A ⇒ C:
  • support = support({A, C}) = 50%
  • confidence = support({A, C}) / support({A}) = 66.6%
  • The Apriori principle:
  • Any subset of a frequent itemset must be frequent

44
Mining Frequent Itemsets the Key Step
  • Find the frequent itemsets: the sets of items that have minimum support
  • A subset of a frequent itemset must also be a frequent itemset
  • i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  • Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
  • Use the frequent itemsets to generate association rules.

45
The Apriori Algorithm
  • Join Step: Ck is generated by joining Lk-1 with itself
  • Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
  • Pseudo-code (a Python sketch follows):

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k
    L1 = {frequent items}
    for (k = 1; Lk ≠ ∅; k++) do begin
        Ck+1 = candidates generated from Lk
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk
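A compact Python sketch of this pseudo-code (itemsets as frozensets, min_support as an absolute count; join and prune as described above):

    from itertools import combinations

    def apriori(D, min_support):
        """D: list of transactions (sets of items)."""
        count = lambda c: sum(c <= t for t in D)
        items = {i for t in D for i in t}
        L = [{frozenset([i]) for i in items if count(frozenset([i])) >= min_support}]
        k = 1
        while L[-1]:
            # join step: unions of frequent k-itemsets that give (k+1)-itemsets
            cands = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k + 1}
            # prune step: every k-subset of a candidate must itself be frequent
            cands = {c for c in cands
                     if all(frozenset(s) in L[-1] for s in combinations(c, k))}
            L.append({c for c in cands if count(c) >= min_support})
            k += 1
        return set().union(*L)

    D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    print(apriori(D, min_support=2))   # all itemsets appearing in >= 2 transactions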

46
The Apriori Algorithm Example
[Figure: trace of Apriori on a small database D — scan D to count candidate 1-itemsets C1 and keep the frequent ones L1; join to C2, scan D, keep L2; join to C3, scan D, keep L3]
47
Association Rule Mining A Road Map
  • Boolean vs. quantitative associations (based on the types of values handled)
  • buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  • age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
  • Single-dimension vs. multi-dimensional associations (see examples above)
  • Single-level vs. multiple-level analysis
  • What brands of beers are associated with what brands of diapers?
  • Various extensions (thousands!)

48
(No Transcript)
49
Model-based Clustering
Padhraic Smyth, UCI
50
(No Transcript)
51
(No Transcript)
52
(No Transcript)
53
(No Transcript)
54
(No Transcript)
55
(No Transcript)
56
(No Transcript)
57
Mixtures of Sequences, Curves, …

Generative model:
  - select a component ck for individual i
  - generate data according to p(Di | ck)
  - p(Di | ck) can be very general, e.g., sets of sequences, spatial patterns, etc.
Note: given p(Di | ck), we can define an EM algorithm
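In symbols (standard finite-mixture form; the slide's equations are images), in LaTeX:

    p(D_i) \;=\; \sum_{k=1}^{K} \pi_k \, p(D_i \mid c_k)

    \text{E step:} \quad w_{ik} = \frac{\pi_k \, p(D_i \mid c_k)}{\sum_j \pi_j \, p(D_i \mid c_j)},
    \qquad \text{M step: re-estimate } \pi_k \text{ and the component parameters from the weighted data}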
58
Example: Mixtures of SFSMs
  • Simple model for traversal on a Web site
  • (equivalent to first-order Markov with end-state)
  • Generative model for large sets of Web users
  • different behaviors ↔ mixture of SFSMs
  • EM algorithm is quite simple: weighted counts

59
WebCanvas (Cadez, Heckerman, et al., KDD 2000)
60
(No Transcript)
61
(No Transcript)
62
(No Transcript)
63
Discussion
  • What is data mining? Hard to pin down; who cares?
  • Textbook statistical ideas with a new focus on algorithms
  • Lots of new ideas too

64
Privacy and Data Mining
Ronny Kohavi, ICML 1998
65
Analyzing Hospital Discharge Data
David Madigan, Rutgers University
66
Comparing Outcomes Across Providers
  • Florence Nightingale wrote in 1863:

"In attempting to arrive at the truth, I have applied everywhere for information, but in scarcely an instance have I been able to obtain hospital records fit for any purposes of comparison. I am fain to sum up with an urgent appeal for adopting some uniform system of publishing the statistical records of hospitals."
67
Data
  • Data of various kinds are now available, e.g., data concerning all Medicare/Medicaid hospital admissions in a standard format (UB-92); covers >95% of all admissions nationally
  • Considerable interest in using these data to compare providers (hospitals, physician groups, physicians, etc.)
  • In Pennsylvania, large corporations such as Westinghouse and Hershey Foods are a motivating force and use the data to select providers.

68
SYSID DCSTATUS PPXDOW CANCER1
YEAR LOS SPX1DOW CANCER2
QUARTER DCHOUR SPX2DOW MDCHC4
PAF DCDOW SPX3DOW MQSEV
HREGION ECODE SPX4DOW MQNRSP
MAID PDX SPX5DOW PROFCHG
PTSEX SDX1 REFID TOTALCHG
ETHNIC SDX2 ATTID NONCVCHG
RACE SDX3 OPERID ROOMCHG
PSEUDOID SDX4 PAYTYPE1 ANCLRCHG
AGE SDX5 PAYTYPE2 DRUGCHG
AGECAT SDX6 PAYTYPE3 EQUIPCHG
PRIVZIP SDX7 ESTPAYER SPECLCHG
MKTSHARE SDX8 NAIC MISCCHG
COUNTY PPX OCCUR1 APRMDC
STATE SPX1 OCCUR2 APRDRG
ADTYPE SPX2 BILLTYPE APRSOI
ADSOURCE SPX3 DRGHOSP APRROM
ADHOUR SPX4 PCMU MQGCLUST
ADMDX SPX5 DRGHC4 MQGCELL
ADDOW  
Pennsylvania Healthcare Cost Containment Council, 2000-1, n ≈ 800,000
69
Risk Adjustment
  • Discharge data like these allow for comparisons
    of, e.g., mortality rates for CABG procedure
    across hospitals.
  • Some hospitals accept riskier patients than others; a fair comparison must account for such differences.
  • PHC4 (and many other organizations) use indirect standardization
  • http://www.phc4.org

70
(No Transcript)
71
Hospital Responses
72
(No Transcript)
73
p-value computation
  • n = 463; suppose the actual number of deaths = 40
  • e = 29.56 (expected deaths under the risk adjustment)
  • p-value < 0.05
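The formula on this slide is an image; under the usual Poisson approximation for an expected count e (an assumption here), the computation would be (scipy assumed):

    from scipy.stats import poisson

    n, deaths, e = 463, 40, 29.56
    p_value = poisson.sf(deaths - 1, e)   # P(X >= 40) when X ~ Poisson(29.56)
    print(p_value)                        # ~0.04, consistent with "p-value < 0.05"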
74
Concerns
  • Ad-hoc groupings of strata
  • Adequate risk adjustment for outcomes other than mortality? Sensitivity analysis? Hopeless?
  • Statistical testing versus estimation
  • Simpson's paradox

75
Hospital A:
  Risk Cat. | N   | Rate | Actual Number | Expected Number (ref. rate)
  Low       | 800 | 1%   | 8             | 8 (1%)
  High      | 200 | 8%   | 16            | 10 (5%)
  SMR = 24/18 = 1.33; p-value = 0.07

Hospital B:
  Risk Cat. | N   | Rate | Actual Number | Expected Number (ref. rate)
  Low       | 200 | 1%   | 2             | 2 (1%)
  High      | 800 | 8%   | 64            | 40 (5%)
  SMR = 66/42 = 1.57; p-value = 0.0002
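A short check of the table's arithmetic (indirect standardization: expected deaths computed at the reference rates in parentheses, 1% and 5%):

    # strata: (N, observed rate, reference rate); A and B have identical
    # stratum-specific rates and differ only in case mix
    hospitals = {
        "A": [(800, 0.01, 0.01), (200, 0.08, 0.05)],
        "B": [(200, 0.01, 0.01), (800, 0.08, 0.05)],
    }
    for name, strata in hospitals.items():
        observed = sum(n * r for n, r, _ in strata)
        expected = sum(n * ref for n, _, ref in strata)
        print(name, observed, expected, round(observed / expected, 2))
    # A: 24 vs 18 expected, SMR 1.33;  B: 66 vs 42 expected, SMR 1.57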
76
Hierarchical Model
  • Patients → physicians → hospitals
  • Build a model using data at each level and estimate quantities of interest

77
Bayesian Hierarchical Model
MCMC via WinBUGS
78
Goldstein and Spiegelhalter, 1996
79
Discussion
  • Markov chain Monte Carlo and compute power enable hierarchical modeling
  • Software is a significant barrier to the widespread application of better methodology
  • Are these data useful for the study of disease?