Data Mining: An Overview

Transcript and Presenter's Notes


1
Data Mining: An Overview
David Madigan
dmadigan@rci.rutgers.edu
http://stat.rutgers.edu/madigan
2
Overview
  • Brief Introduction to Data Mining
  • Data Mining Algorithms
  • Specific Examples
  • Algorithms: Disease Clusters
  • Algorithms: Model-Based Clustering
  • Algorithms: Frequent Items and Association Rules
  • Future Directions, etc.

3
Of Laws, Monsters, and Giants
  • Moore's law: processing capacity doubles every 18 months (CPU, cache, memory)
  • Its more aggressive cousin: disk storage capacity doubles every 9 months

4
What is Data Mining?
  • Finding interesting structure in data
  • Structure refers to statistical patterns,
    predictive models, hidden relationships
  • Examples of tasks addressed by Data Mining
  • Predictive Modeling (classification, regression)
  • Segmentation (Data Clustering)
  • Summarization
  • Visualization

5
(No Transcript)
6
(No Transcript)
7
Ronny Kohavi, ICML 1998
8
Ronny Kohavi, ICML 1998
9
Ronny Kohavi, ICML 1998
10
Stories: Online Retailing
11
Chapter 4: Data Analysis and Uncertainty
  • Elementary statistical concepts: random variables, distributions, densities, independence, point and interval estimation, bias, variance, MLE
  • Model (global, represents prominent structure) vs. Pattern (local, idiosyncratic deviations)
  • Frequentist vs. Bayesian
  • Sampling methods

12
Bayesian Estimation
e.g. beta-binomial model
Predictive distribution
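The formulas on this slide are images in the transcript; a standard reconstruction of the beta-binomial setup, in LaTeX:

    \theta \sim \mathrm{Beta}(\alpha, \beta), \qquad x \mid \theta \sim \mathrm{Binomial}(n, \theta)

    p(\theta \mid x) \;\propto\; \theta^{x+\alpha-1} (1-\theta)^{n-x+\beta-1}
    \;\Longrightarrow\; \theta \mid x \sim \mathrm{Beta}(x+\alpha,\; n-x+\beta)

    p(x_{\mathrm{new}} \mid x) \;=\; \int p(x_{\mathrm{new}} \mid \theta)\, p(\theta \mid x)\, d\theta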
13
Issues to do with p-values
  • Using thresholds of 0.05 or 0.01 regardless of
    sample size
  • Multiple testing (e.g. Friedman (1983) selecting
    highly significant regressors from noise)
  • Subtle interpretation. Jeffreys (1980): "I have always considered the arguments for the use of P absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the trial value was improbable; that is, that it has not predicted something that has not happened."

14
p-value as measure of evidence
Schervish (1996): if hypothesis H implies hypothesis H', then there should be at least as much support for H' as for H. This is not satisfied by p-values.
Grimmet and Ridenhour (1996): one might expect an outlying data point to lend support to the alternative hypothesis in, for instance, a one-way analysis of variance. Yet the value of the outlying data point that minimizes the significance level can lie within the range of the data.
15
Chapter 5: Data Mining Algorithms
A data mining algorithm is a well-defined
procedure that takes data as input and produces
output in the form of models or patterns
"well-defined": can be encoded in software; "algorithm": must terminate after some finite number of steps
16
Data Mining Algorithms
A data mining algorithm is a well-defined
procedure that takes data as input and produces
output in the form of models or patterns
Hand, Mannila, and Smyth
"well-defined": can be encoded in software; "algorithm": must terminate after some finite number of steps
17
Algorithm Components
1. The task the algorithm is used to address (e.g. classification, clustering, etc.)
2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model)
3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.)
4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.)
5. The data management technique used for storing, indexing, and retrieving data (critical when data are too large to reside in memory)
18
(No Transcript)
19
Backpropagation data mining algorithm
[Figure: feed-forward network with inputs x1-x4, hidden units h1 and h2, and output y; layer sizes 4, 2, 1]
  • vector of p input values multiplied by p × d1 weight matrix
  • resulting d1 values individually transformed by non-linear function
  • resulting d1 values multiplied by d1 × d2 weight matrix (a numerical sketch follows)

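A minimal numerical sketch of the forward pass and one steepest-descent step for this 4-2-1 network (numpy; the tanh transfer function and squared-error score are assumptions, since the slide's formulas are images):

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 2))   # p x d1 weight matrix (p = 4 inputs, d1 = 2 hidden)
    W2 = rng.normal(size=(2, 1))   # d1 x d2 weight matrix (d2 = 1 output)

    def forward(x):
        h = np.tanh(x @ W1)        # d1 values, individually transformed
        return h @ W2, h           # output y_hat and hidden activations

    def backprop_step(x, y, lr=0.1):
        """One steepest-descent step on squared error (assumed score)."""
        global W1, W2
        y_hat, h = forward(x)
        err = y_hat - y                     # d(score)/d(y_hat)
        grad_W2 = np.outer(h, err)
        grad_h = (W2 @ err) * (1 - h**2)    # chain rule through tanh
        grad_W1 = np.outer(x, grad_h)
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2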
20
Backpropagation (cont.)
Parameters
Score
Search: steepest descent (search for structure?)
21
Models and Patterns
Models
Probability Distributions
Structured Data
Prediction
  • Linear regression
  • Piecewise linear

22
Models
Probability Distributions
Structured Data
Prediction
  • Linear regression
  • Piecewise linear
  • Nonparametric regression

23
(No Transcript)
24
Models
Probability Distributions
Structured Data
Prediction
  • Linear regression
  • Piecewise linear
  • Nonparametric regression
  • Classification

logistic regression, naïve Bayes/TAN/Bayesian networks, NN, support vector machines, trees, etc.
25
Models
Probability Distributions
Structured Data
Prediction
  • Linear regression
  • Piecewise linear
  • Nonparametric regression
  • Classification
  • Parametric models
  • Mixtures of parametric models
  • Graphical Markov models (categorical, continuous,
    mixed)

26
Models
Probability Distributions
Structured Data
Prediction
  • Time series
  • Markov models
  • Mixture Transition Distribution models
  • Hidden Markov models
  • Spatial models
  • Linear regression
  • Piecewise linear
  • Nonparametric regression
  • Classification
  • Parametric models
  • Mixtures of parametric models
  • Graphical Markov models (categorical, continuous,
    mixed)

27
Markov Models
First-order
e.g.
g linear ⇒ standard first-order auto-regressive model
[Figure: chain graph y1 → y2 → y3 → … → yT]
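In symbols (a standard reconstruction; the slide's equations are images), in LaTeX:

    p(y_1, \ldots, y_T) \;=\; p(y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1}),
    \qquad y_t = g(y_{t-1}) + e_t, \quad e_t \sim N(0, \sigma^2)

    g \text{ linear:} \qquad y_t = a_0 + a_1\, y_{t-1} + e_t \quad \text{(the AR(1) model)}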
28
First-Order HMM/Kalman Filter
[Figure: hidden state chain x1 → x2 → x3 → … → xT, with each xt emitting an observation yt]
Note: to compute p(y1, …, yT) we need to sum/integrate over all possible state sequences (see the sketch below)
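Summing over all K^T state sequences is done in O(T K²) time by the forward recursion; a minimal sketch for discrete states and emissions (numpy; the parameterization is an assumption):

    import numpy as np

    def log_likelihood(y, pi, A, B):
        """p(y_1, ..., y_T) for a discrete HMM via the forward algorithm.
        pi: (K,) initial state probabilities
        A:  (K, K) transitions, A[i, j] = p(x_t = j | x_{t-1} = i)
        B:  (K, M) emissions,   B[k, m] = p(y_t = m | x_t = k)"""
        alpha = pi * B[:, y[0]]            # alpha_1(k) = p(x_1 = k, y_1)
        log_p = 0.0
        for t in range(1, len(y)):
            c = alpha.sum()                # rescale to avoid underflow
            log_p += np.log(c)
            alpha = (alpha / c) @ A * B[:, y[t]]
        return log_p + np.log(alpha.sum())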
29
Bias-Variance Tradeoff
High Bias - Low Variance
Low Bias - High Variance: overfitting (modeling the random component)
Score function should embody the compromise
30
The Curse of Dimensionality
X ~ MVN_p(0, I)
  • Gaussian kernel density estimation
  • Bandwidth chosen to minimize MSE at the mean
  • Suppose we want a fixed level of accuracy at the mean; the number of data points required explodes with dimension:

    Dimension | Data points
    1         | 4
    2         | 19
    3         | 67
    6         | 2,790
    10        | 842,000
31
Patterns
Local
  • Outlier detection
  • Changepoint detection
  • Bump hunting
  • Scan statistics
  • Association rules
Global
  • Clustering via partitioning
  • Hierarchical Clustering
  • Mixture Models
32
Scan Statistics via Permutation Tests
[Figure: a curved road with accident locations marked by x's]

The curve represents a road. Each x marks an accident; a red x denotes an injury accident, a black x means no injury. Is there a stretch of road where there is an unusually large fraction of injury accidents?
33
Scan with Fixed Window
  • If we know the length of the stretch of road that we seek, we could slide a window of that length along the road and find the most unusual window location

[Figure: the same accident map with a fixed-length window slid along the road]
34
How Unusual is a Window?
  • Let p_W and p_W̄ denote the true probability of being red inside and outside the window, respectively. Let (x_W, n_W) and (x_W̄, n_W̄) denote the corresponding counts
  • Use the GLRT for comparing H0: p_W = p_W̄ versus H1: p_W ≠ p_W̄
  • λ measures how unusual a window is

-2 log λ has an asymptotic chi-squared distribution with 1 df under H0
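The statistic itself is an image in the transcript; the standard binomial GLRT it refers to is, in LaTeX:

    \lambda \;=\;
    \frac{\hat p^{\,x_W + x_{\bar W}}\,(1-\hat p)^{\,(n_W - x_W) + (n_{\bar W} - x_{\bar W})}}
         {\hat p_W^{\,x_W}(1-\hat p_W)^{\,n_W - x_W}\;
          \hat p_{\bar W}^{\,x_{\bar W}}(1-\hat p_{\bar W})^{\,n_{\bar W} - x_{\bar W}}}

    \hat p_W = x_W/n_W, \qquad \hat p_{\bar W} = x_{\bar W}/n_{\bar W}, \qquad
    \hat p = (x_W + x_{\bar W})/(n_W + n_{\bar W})

Small λ (equivalently, large -2 log λ) flags an unusual window.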
35
Permutation Test
  • Since we look at the smallest λ over all window locations, we need the distribution of the smallest λ under the null hypothesis that there are no clusters
  • Look at the distribution of the smallest λ over, say, 999 random relabellings of the colors of the x's

[Figure: four example relabellings of the road, with smallest-λ values 0.376, 0.233, 0.412, and 0.222]

  • Look at the position of the observed smallest λ in this distribution to get the scan statistic p-value (e.g., if the observed smallest λ is 5th smallest, the p-value is 0.005); a code sketch follows
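A minimal sketch of the full procedure: a fixed window of w points, the binomial GLRT above, and a permutation p-value over 999 relabellings (numpy; the function names are mine):

    import numpy as np

    def neg2loglam(x_in, n_in, x_out, n_out):
        """-2 log GLRT for equal injury proportions inside vs. outside."""
        def ll(x, n, p):  # binomial log-likelihood with 0*log(0) = 0
            with np.errstate(divide="ignore", invalid="ignore"):
                return np.nan_to_num(x * np.log(p)) + np.nan_to_num((n - x) * np.log(1 - p))
        p0 = (x_in + x_out) / (n_in + n_out)
        p1, p2 = x_in / n_in, x_out / n_out
        return 2 * (ll(x_in, n_in, p1) + ll(x_out, n_out, p2)
                    - ll(x_in, n_in, p0) - ll(x_out, n_out, p0))

    def scan_stat(labels, w):
        """Largest -2 log lambda over all windows of w consecutive accidents."""
        labels = np.asarray(labels)
        n, x = len(labels), labels.sum()
        return max(neg2loglam(labels[i:i+w].sum(), w, x - labels[i:i+w].sum(), n - w)
                   for i in range(n - w + 1))

    def scan_pvalue(labels, w, n_perm=999, seed=0):
        rng = np.random.default_rng(seed)
        obs = scan_stat(labels, w)
        hits = sum(scan_stat(rng.permutation(labels), w) >= obs for _ in range(n_perm))
        return (hits + 1) / (n_perm + 1)   # rank of observed value among all 1000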

36
Variable Length Window
  • No need to use fixed-length window. Examine all
    possible windows up to say half the length of the
    entire road

[Figure legend: one circle color marks fatal accidents, the other non-fatal accidents]
37
Spatial Scan Statistics
  • Spatial scan statistic uses, e.g., circles
    instead of line segments

38
(No Transcript)
39
Spatial-Temporal Scan Statistics
  • Spatial-temporal scan statistics use cylinders, where the height of the cylinder represents a time window

40
Other Issues
  • Poisson model also common (instead of the Bernoulli model)
  • Covariate adjustment
  • Andrew Moore's group at CMU: efficient algorithms for scan statistics

41
Software: SaTScan and others
http://www.satscan.org
http://www.phrl.org
http://www.terraseer.com
42
Association Rules Support and Confidence
[Venn diagram: customers buying beer, customers buying diapers, and the overlap buying both]

  • Find all the rules Y ⇒ Z with minimum confidence and support
  • support, s: probability that a transaction contains Y ∪ Z
  • confidence, c: conditional probability that a transaction containing Y also contains Z
  • With minimum support 50% and minimum confidence 50%, we have
  • A ⇒ C (50%, 66.6%)
  • C ⇒ A (50%, 100%)
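A toy computation of support and confidence over a four-transaction database consistent with the slide's numbers (the original database is an image, so the transactions are an assumption):

    # transactions as sets of items
    D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

    def support(itemset):
        return sum(itemset <= t for t in D) / len(D)

    def confidence(lhs, rhs):
        return support(lhs | rhs) / support(lhs)

    print(support({"A", "C"}))        # 0.5   -> A => C has 50% support
    print(confidence({"A"}, {"C"}))   # 0.666 -> A => C has 66.6% confidence
    print(confidence({"C"}, {"A"}))   # 1.0   -> C => A has 100% confidence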

43
Mining Association Rules: An Example
Min. support 50%; min. confidence 50%
  • For rule A ⇒ C:
  • support = support({A, C}) = 50%
  • confidence = support({A, C}) / support({A}) = 66.6%
  • The Apriori principle:
  • Any subset of a frequent itemset must be frequent

44
Mining Frequent Itemsets the Key Step
  • Find the frequent itemsets: the sets of items that have minimum support
  • A subset of a frequent itemset must also be a frequent itemset
  • i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  • Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
  • Use the frequent itemsets to generate association rules.

45
The Apriori Algorithm
  • Join Step: Ck is generated by joining Lk-1 with itself
  • Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
  • Pseudo-code (a Python sketch follows):

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k
    L1 = {frequent items}
    for (k = 1; Lk ≠ ∅; k++) do begin
        Ck+1 = candidates generated from Lk
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk
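A compact Python sketch of this pseudo-code (itemsets as frozensets, min_support as an absolute count; join and prune as described above):

    from itertools import combinations

    def apriori(D, min_support):
        """D: list of transactions (sets of items)."""
        count = lambda c: sum(c <= t for t in D)
        items = {i for t in D for i in t}
        L = [{frozenset([i]) for i in items if count(frozenset([i])) >= min_support}]
        k = 1
        while L[-1]:
            # join step: unions of frequent k-itemsets that give (k+1)-itemsets
            cands = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k + 1}
            # prune step: every k-subset of a candidate must itself be frequent
            cands = {c for c in cands
                     if all(frozenset(s) in L[-1] for s in combinations(c, k))}
            L.append({c for c in cands if count(c) >= min_support})
            k += 1
        return set().union(*L)

    D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    print(apriori(D, min_support=2))   # all itemsets appearing in >= 2 transactions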

46
The Apriori Algorithm Example
[Figure: trace of Apriori on a small database D — scan D to count candidate 1-itemsets C1 and keep the frequent ones L1; join to C2, scan D, keep L2; join to C3, scan D, keep L3]
47
Association Rule Mining A Road Map
  • Boolean vs. quantitative associations (based on the types of values handled)
  • buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  • age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
  • Single-dimension vs. multi-dimensional associations (see examples above)
  • Single-level vs. multiple-level analysis
  • What brands of beers are associated with what brands of diapers?
  • Various extensions (thousands!)

48
(No Transcript)
49
Model-based Clustering
Padhraic Smyth, UCI
50
(No Transcript)
51
(No Transcript)
52
(No Transcript)
53
(No Transcript)
54
(No Transcript)
55
(No Transcript)
56
(No Transcript)
57
Mixtures of Sequences, Curves, …

Generative model:
  - select a component ck for individual i
  - generate data according to p(Di | ck)
  - p(Di | ck) can be very general, e.g., sets of sequences, spatial patterns, etc.
Note: given p(Di | ck), we can define an EM algorithm
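In symbols (standard finite-mixture form; the slide's equations are images), in LaTeX:

    p(D_i) \;=\; \sum_{k=1}^{K} \pi_k \, p(D_i \mid c_k)

    \text{E step:} \quad w_{ik} = \frac{\pi_k \, p(D_i \mid c_k)}{\sum_j \pi_j \, p(D_i \mid c_j)},
    \qquad \text{M step: re-estimate } \pi_k \text{ and the component parameters from the weighted data}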
58
Example: Mixtures of SFSMs
  • Simple model for traversal on a Web site
  • (equivalent to first-order Markov with end-state)
  • Generative model for large sets of Web users
  • different behaviors ↔ mixture of SFSMs
  • EM algorithm is quite simple: weighted counts

59
WebCanvas (Cadez, Heckerman, et al., KDD 2000)
60
(No Transcript)
61
(No Transcript)
62
(No Transcript)
63
Discussion
  • What is data mining? Hard to pin down; who cares?
  • Textbook statistical ideas with a new focus on algorithms
  • Lots of new ideas too

64
Privacy and Data Mining
Ronny Kohavi, ICML 1998
65
Analyzing Hospital Discharge Data
David Madigan, Rutgers University
66
Comparing Outcomes Across Providers
  • Florence Nightingale wrote in 1863:

"In attempting to arrive at the truth, I have applied everywhere for information, but in scarcely an instance have I been able to obtain hospital records fit for any purposes of comparison. I am fain to sum up with an urgent appeal for adopting some uniform system of publishing the statistical records of hospitals."
67
Data
  • Data of various kinds are now available, e.g., data concerning all Medicare/Medicaid hospital admissions in a standard format (UB-92); covers >95% of all admissions nationally
  • Considerable interest in using these data to compare providers (hospitals, physician groups, physicians, etc.)
  • In Pennsylvania, large corporations such as Westinghouse and Hershey Foods are a motivating force and use the data to select providers.

68
SYSID DCSTATUS PPXDOW CANCER1
YEAR LOS SPX1DOW CANCER2
QUARTER DCHOUR SPX2DOW MDCHC4
PAF DCDOW SPX3DOW MQSEV
HREGION ECODE SPX4DOW MQNRSP
MAID PDX SPX5DOW PROFCHG
PTSEX SDX1 REFID TOTALCHG
ETHNIC SDX2 ATTID NONCVCHG
RACE SDX3 OPERID ROOMCHG
PSEUDOID SDX4 PAYTYPE1 ANCLRCHG
AGE SDX5 PAYTYPE2 DRUGCHG
AGECAT SDX6 PAYTYPE3 EQUIPCHG
PRIVZIP SDX7 ESTPAYER SPECLCHG
MKTSHARE SDX8 NAIC MISCCHG
COUNTY PPX OCCUR1 APRMDC
STATE SPX1 OCCUR2 APRDRG
ADTYPE SPX2 BILLTYPE APRSOI
ADSOURCE SPX3 DRGHOSP APRROM
ADHOUR SPX4 PCMU MQGCLUST
ADMDX SPX5 DRGHC4 MQGCELL
ADDOW  
Pennsylvania Healthcare Cost Containment Council, 2000-1, n ≈ 800,000
69
Risk Adjustment
  • Discharge data like these allow for comparisons
    of, e.g., mortality rates for CABG procedure
    across hospitals.
  • Some hospitals accept riskier patients than others; a fair comparison must account for such differences.
  • PHC4 (and many other organizations) use indirect standardization
  • http://www.phc4.org

70
(No Transcript)
71
Hospital Responses
72
(No Transcript)
73
p-value computation
  • n = 463; suppose the actual number of deaths = 40
  • e = 29.56 (expected deaths under the risk adjustment)
  • p-value < 0.05
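The formula on this slide is an image; under the usual Poisson approximation for an expected count e (an assumption here), the computation would be (scipy assumed):

    from scipy.stats import poisson

    n, deaths, e = 463, 40, 29.56
    p_value = poisson.sf(deaths - 1, e)   # P(X >= 40) when X ~ Poisson(29.56)
    print(p_value)                        # ~0.04, consistent with "p-value < 0.05"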
74
Concerns
  • Ad-hoc groupings of strata
  • Adequate risk adjustment for outcomes other than mortality? Sensitivity analysis? Hopeless?
  • Statistical testing versus estimation
  • Simpson's paradox

75
Hospital A:
  Risk Cat. | N   | Rate | Actual Number | Expected Number (ref. rate)
  Low       | 800 | 1%   | 8             | 8 (1%)
  High      | 200 | 8%   | 16            | 10 (5%)
  SMR = 24/18 = 1.33; p-value = 0.07

Hospital B:
  Risk Cat. | N   | Rate | Actual Number | Expected Number (ref. rate)
  Low       | 200 | 1%   | 2             | 2 (1%)
  High      | 800 | 8%   | 64            | 40 (5%)
  SMR = 66/42 = 1.57; p-value = 0.0002
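A short check of the table's arithmetic (indirect standardization: expected deaths computed at the reference rates in parentheses, 1% and 5%):

    # strata: (N, observed rate, reference rate); A and B have identical
    # stratum-specific rates and differ only in case mix
    hospitals = {
        "A": [(800, 0.01, 0.01), (200, 0.08, 0.05)],
        "B": [(200, 0.01, 0.01), (800, 0.08, 0.05)],
    }
    for name, strata in hospitals.items():
        observed = sum(n * r for n, r, _ in strata)
        expected = sum(n * ref for n, _, ref in strata)
        print(name, observed, expected, round(observed / expected, 2))
    # A: 24 vs 18 expected, SMR 1.33;  B: 66 vs 42 expected, SMR 1.57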
76
Hierarchical Model
  • Patients → physicians → hospitals
  • Build a model using data at each level and estimate quantities of interest

77
Bayesian Hierarchical Model
MCMC via WinBUGS
78
Goldstein and Spiegelhalter, 1996
79
Discussion
  • Markov chain Monte Carlo and compute power enable hierarchical modeling
  • Software is a significant barrier to the widespread application of better methodology
  • Are these data useful for the study of disease?