Title: Data Mining
1Data Mining
2Outline
- What is data mining?
- Data Mining Tasks
- Association
- Classification
- Clustering
- Data mining Algorithms
- Are all the patterns interesting?
3What is Data Mining
- The huge number of databases and web pages makes information extraction next to impossible (remember the favored statement "I will bury them in data!")
- Inability of many other disciplines (statistics, AI, information retrieval) to provide scalable algorithms that extract information and/or rules from databases
- Necessity to find relationships among data
4What is Data Mining
- Discovery of useful, possibly unexpected, data patterns
- Subsidiary issues
- Data cleansing
- Visualization
- Warehousing
5Examples
- A big objection to it was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
- The Rhine Paradox is a great example of how not to conduct scientific research.
6Rhine Paradox --- (1)
- David Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception (ESP).
- He devised an experiment where subjects were asked to guess 10 hidden cards --- red or blue.
- He discovered that almost 1 in 1000 had ESP --- they were able to get all 10 right!
7Rhine Paradox --- (2)
- He told these people they had ESP and called them in for another test of the same type.
- Alas, he discovered that almost all of them had lost their ESP.
- What did he conclude?
- Answer on next slide.
8Rhine Paradox --- (3)
- He concluded that you shouldn't tell people they have ESP; it causes them to lose it.
9A Concrete Example
- This example illustrates a problem with intelligence-gathering.
- Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
- We want to find people who at least twice have stayed at the same hotel on the same day.
10The Details
- 10^9 people being tracked.
- 1000 days.
- Each person stays in a hotel 1% of the time (10 days out of 1000).
- Hotels hold 100 people (so 10^5 hotels).
- If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?
11Calculations --- (1)
- Probability that persons p and q will be at the same hotel on day d: 1/100 × 1/100 × 10^-5 = 10^-9.
- Probability that p and q will be at the same hotel on two given days: 10^-9 × 10^-9 = 10^-18.
- Pairs of days: about 5 × 10^5.
12Calculations --- (2)
- Probability that p and q will be at the same hotel on some two days: 5 × 10^5 × 10^-18 = 5 × 10^-13.
- Pairs of people: about 5 × 10^17.
- Expected number of suspicious pairs of people: 5 × 10^17 × 5 × 10^-13 = 250,000.
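A quick sanity check of these numbers in code (a sketch; the figures 10^9 people, 10^5 hotels, a 1% hotel-stay rate, and 1000 days are taken from the slides above):

```python
from math import comb

people = 10**9          # people being tracked
days = 1000             # days of data
hotels = 10**5          # hotels, each holding 100 people
p_in_hotel = 0.01       # each person is in some hotel 1% of the time

# Probability that p and q are in the same hotel on one given day
p_same_day = p_in_hotel * p_in_hotel * (1 / hotels)      # 1e-9

# Probability that they coincide on two specific days
p_two_given_days = p_same_day ** 2                       # 1e-18

pairs_of_days = comb(days, 2)                            # ~5e5
pairs_of_people = comb(people, 2)                        # ~5e17

# Expected number of "suspicious" pairs under pure chance
expected = pairs_of_people * pairs_of_days * p_two_given_days
print(f"{expected:,.0f}")   # about 250,000 (the slide rounds the pair counts)
```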
13Conclusion
- Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
- Analysts have to sift through 250,010 candidates to find the 10 real cases.
- Not gonna happen.
- But how can we improve the scheme?
14Appetizer
- Consider a file consisting of 24,471 records. The file contains at least two condition attributes, A and D.
A \ D    D=0     D=1    Total
A=0      9272     232    9504
A=1     14695     272   14967
Total   23967     504   24471
15Appetizer (cont)
- Probability that a person has A: P(A) ≈ 0.6
- Probability that a person has D: P(D) ≈ 0.02
- Conditional probability that a person has D given that it has A: P(D|A) = P(A ∧ D)/P(A) = (272/24471)/0.6 ≈ 0.02
- P(A|D) = P(A ∧ D)/P(D) ≈ 0.54
- What can we say about dependencies between A and D?
A \ D    D=0     D=1    Total
A=0      9272     232    9504
A=1     14695     272   14967
Total   23967     504   24471
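A small script reproducing these figures from the contingency table above (a sketch; the counts are read straight from the table):

```python
# Counts from the A/D contingency table above
n_AD = 272        # A = 1 and D = 1
n_A = 14967       # A = 1
n_D = 504         # D = 1
n = 24471         # total records

p_A = n_A / n                 # ~0.61
p_D = n_D / n                 # ~0.02
p_D_given_A = n_AD / n_A      # ~0.018
p_A_given_D = n_AD / n_D      # ~0.54

# If A and D were independent, P(D|A) would equal P(D)
print(p_A, p_D, p_D_given_A, p_A_given_D)
print("lift(A -> D) =", p_D_given_A / p_D)   # ~0.88: a slight negative association
```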
16Appetizer(3)
- So far we did not ask anything that statistics would not have asked. So is data mining just another word for statistics?
- We hope that the response will be a resounding NO
- The major difference is that statistical methods work with random data samples, whereas the data in databases is not necessarily random
- The second difference is the size of the data set
- The third difference is that statistical samples do not contain dirty data
17Architecture of a Typical Data Mining System
(Layered diagram, top to bottom:)
- Graphical user interface
- Pattern evaluation
- Data mining engine
- Knowledge base
- Database or data warehouse server
- Filtering
- Data cleaning, data integration
- Data warehouse
- Databases
18Data Mining Tasks
- Association (correlation and causality)
- Multi-dimensional vs. single-dimensional association
- age(X, "20..29") ∧ income(X, "20..29K") => buys(X, "PC") [support = 2%, confidence = 60%]
- contains(T, "computer") => contains(T, "software") [1%, 75%]
- What is support? The percentage of the tuples in the database that have age between 20 and 29, income between 20K and 29K, and buy a PC
- What is confidence? The probability that if a person is aged between 20 and 29 with income between 20K and 29K, then he or she buys a PC
- Clustering (grouping data that are close together into the same cluster)
- What does "close together" mean?
19Distances between data
- Distance between data is a measure of the dissimilarity between data.
- d(i,j) ≥ 0; d(i,j) = d(j,i); d(i,j) ≤ d(i,k) + d(k,j)
- Euclidean distance between <x1, x2, ..., xk> and <y1, y2, ..., yk>
- Standardize variables by finding the standard deviation and dividing each xi by the standard deviation of X
- Covariance(X, Y) = (1/k) Σ (xi - mean(X))(yi - mean(Y))
- Boolean variables and their distances
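A minimal sketch of these quantities for numeric vectors (the helper names and sample data are illustrative, not from the slides):

```python
import math

def euclidean(x, y):
    """Euclidean distance between <x1,...,xk> and <y1,...,yk>."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def standardize(x):
    """Divide each value by the standard deviation of the variable."""
    mean = sum(x) / len(x)
    sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / len(x))
    return [xi / sd for xi in x]

def covariance(x, y):
    """Cov(X, Y) = (1/k) * sum((xi - mean(X)) * (yi - mean(Y)))."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(euclidean(x, y), covariance(x, y), standardize(x))
```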
20Data Mining Tasks
- Outlier analysis
- Outlier: a data object that does not comply with the general behavior of the data
- It can be considered noise or an exception, but it is quite useful in fraud detection and rare-event analysis
- Trend and evolution analysis
- Trend and deviation: regression analysis
- Sequential pattern mining, periodicity analysis
- Similarity-based analysis
- Other pattern-directed or statistical analyses
21Are All the Discovered Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
- Suggested approach: human-centered, query-based, focused mining
- Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
- Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
- Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
22Are All the Discovered Patterns Interesting? -
Example
              coffee=0   coffee=1   total
tea=1             5          20       25
tea=0             5          70       75
total            10          90      100

The conditional probability that if one buys coffee, one also buys tea is 20/90 = 2/9. The conditional probability that if one buys tea, she also buys coffee is 20/25 = 0.8. However, the probability that she buys coffee at all is 0.9. So, is it a significant inference that if a customer buys tea she also buys coffee? Are buying tea and buying coffee independent activities?
23How to measure Interestingness
- Rule interest: RI = |X ∧ Y| - |X| |Y| / N
- Support and confidence: |X ∧ Y| / N is the support and |X ∧ Y| / |X| is the confidence of X -> Y
- Chi-square: (|X ∧ Y| - E[|X ∧ Y|])^2 / E[|X ∧ Y|]
- J-measure: J(X -> Y) = P(Y) ( P(X|Y) log(P(X|Y)/P(X)) + (1 - P(X|Y)) log((1 - P(X|Y))/(1 - P(X))) )
- Sufficiency(X -> Y) = P(X|Y)/P(X|¬Y); Necessity(X -> Y) = P(¬X|Y)/P(¬X|¬Y)
- Interestingness of Y -> X: NC = (1 - N(X -> Y)) P(Y) if N(X -> Y) is less than 1, and 0 otherwise
24Can We Find All and Only Interesting Patterns?
- Find all the interesting patterns: completeness
- Can a data mining system find all the interesting patterns?
- Association vs. classification vs. clustering
- Search for only the interesting patterns: optimization
- Can a data mining system find only the interesting patterns?
- Approaches
- First generate all the patterns and then filter out the uninteresting ones.
- Generate only the interesting patterns (mining query optimization)
25Clustering
- Partition data set into clusters, and one can
store cluster representation only - Can be very effective if data is clustered but
not if data is smeared - Can have hierarchical clustering and be stored in
multi-dimensional index tree structures - There are many choices of clustering definitions
and clustering algorithms.
26Example Clusters
(Scatter plot: the points form several dense clusters, with a few isolated outliers.)
27Sampling
- Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
- Choose a representative subset of the data
- Simple random sampling may have very poor performance in the presence of skew
- Develop adaptive sampling methods
- Stratified sampling
- Approximate the percentage of each class (or subpopulation of interest) in the overall database
- Used in conjunction with skewed data
- Sampling may not reduce database I/Os (page at a time).
28Sampling
SRSWOR (simple random sample without
replacement)
SRSWR
29Sampling
Cluster/Stratified Sample
Raw Data
30Discretization
- Three types of attributes
- Nominal: values from an unordered set
- Ordinal: values from an ordered set
- Continuous: real numbers
- Discretization
- divide the range of a continuous attribute into intervals
- Some classification algorithms only accept categorical attributes.
- Reduce data size by discretization
- Prepare for further analysis
31Discretization
- Discretization
- reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals. Interval labels can
then be used to replace actual data values.
32Discretization
(Flowchart: sort the attribute, select a cut point, evaluate the measure; if the criterion is satisfied, done; otherwise split/merge and repeat, or stop.)
33Discretization
- Dynamic vs Static
- Local vs Global
- Top-Down vs Bottom-Up
- Direct vs Incremental
34Discretization Quality Evaluation
- Total number of Intervals
- The Number of Inconsistencies
- Predictive Accuracy
- Complexity
35Discretization - Binning
- Equal width: the range between the min and max values is split into intervals of equal width
- Equal frequency: each bin contains approximately the same number of data points
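A minimal sketch of both binning strategies (function names and the sample data are illustrative):

```python
def equal_width_bins(values, k):
    """Split the [min, max] range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1        # guard against a constant attribute
    # Each value is mapped to a bin index 0..k-1
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Each bin gets approximately the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

data = [4, 8, 15, 16, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(data, 3))
print(equal_frequency_bins(data, 3))
```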
36Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is E(S, T) = (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2)
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
- The process is recursively applied to the partitions obtained until some stopping criterion is met
- Experiments show that it may reduce data size and improve classification accuracy
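A sketch of the entropy criterion for choosing a single binary boundary (labelled samples are assumed; the recursion and the stopping test are omitted):

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    """Pick the boundary T minimizing |S1|/|S|*Ent(S1) + |S2|/|S|*Ent(S2)."""
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2      # candidate cut point
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

values = [1, 2, 3, 10, 11, 12]
labels = ["no", "no", "no", "yes", "yes", "yes"]
print(best_boundary(values, labels))   # boundary near 6.5, entropy 0.0
```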
37Data Mining Primitives, Languages, and System
Architectures
- Data mining primitives: what defines a data mining task?
- A data mining query language
- Design graphical user interfaces based on a data mining query language
- Architecture of data mining systems
38Why Data Mining Primitives and Languages?
- Data mining should be an interactive process
- User directs what to be mined
- Users must be provided with a set of primitives to be used to communicate with the data mining system
- Incorporating these primitives in a data mining query language
- More flexible user interaction
- Foundation for design of graphical user interface
- Standardization of data mining industry and
practice
39What Defines a Data Mining Task ?
- Task-relevant data
- Type of knowledge to be mined
- Background knowledge
- Pattern interestingness measurements
- Visualization of discovered patterns
40Task-Relevant Data (Minable View)
- Database or data warehouse name
- Database tables or data warehouse cubes
- Condition for data selection
- Relevant attributes or dimensions
- Data grouping criteria
41Types of knowledge to be mined
- Characterization
- Discrimination
- Association
- Classification/prediction
- Clustering
- Outlier analysis
- Other data mining tasks
42A Data Mining Query Language (DMQL)
- Motivation
- A DMQL can provide the ability to support ad-hoc and interactive data mining
- By providing a standardized language like SQL
- we hope to achieve an effect similar to that which SQL has had on relational databases
- Foundation for system development and evolution
- Facilitates information exchange, technology transfer, commercialization and wide acceptance
- Design
- DMQL is designed with the primitives described earlier
43Syntax for DMQL
- Syntax for specification of
- task-relevant data
- the kind of knowledge to be mined
- concept hierarchy specification
- interestingness measure
- pattern presentation and visualization
- Putting it all together a DMQL query
44Syntax for task-relevant data specification
- use database database_name, or use data warehouse
data_warehouse_name - from relation(s)/cube(s)Â where condition
- in relevance to att_or_dim_list
- order by order_list
- group by grouping_list
- having condition
45Specification of task-relevant data
46Syntax for specifying the kind of knowledge to be
mined
- Characterization
- Mine_Knowledge_Specification ::= mine characteristics [as pattern_name] analyze measure(s)
- Discrimination
- Mine_Knowledge_Specification ::= mine comparison [as pattern_name] for target_class where target_condition {versus contrast_class_i where contrast_condition_i} analyze measure(s)
- Association
- Mine_Knowledge_Specification ::= mine associations [as pattern_name]
47Syntax for specifying the kind of knowledge to be
mined (cont.)
- Classification
- Mine_Knowledge_Specification ::= mine classification [as pattern_name] analyze classifying_attribute_or_dimension
- Prediction
- Mine_Knowledge_Specification ::= mine prediction [as pattern_name] analyze prediction_attribute_or_dimension {set {attribute_or_dimension_i = value_i}}
48Syntax for concept hierarchy specification
- To specify which concept hierarchies to use:
- use hierarchy <hierarchy> for <attribute_or_dimension>
- We use different syntax to define different types of hierarchies
- schema hierarchies
- define hierarchy time_hierarchy on date as [date, month, quarter, year]
- set-grouping hierarchies
- define hierarchy age_hierarchy for age on customer as
- level1: {young, middle_aged, senior} < level0: all
- level2: {20, ..., 39} < level1: young
- level2: {40, ..., 59} < level1: middle_aged
- level2: {60, ..., 89} < level1: senior
49Syntax for concept hierarchy specification (Cont.)
- operation-derived hierarchies
- define hierarchy age_hierarchy for age on customer as
- {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
- rule-based hierarchies
- define hierarchy profit_margin_hierarchy on item as
- level_1: low_profit_margin < level_0: all
- if (price - cost) < 50
- level_1: medium_profit_margin < level_0: all
- if ((price - cost) > 50) and ((price - cost) < 250)
- level_1: high_profit_margin < level_0: all
- if (price - cost) > 250
50Syntax for interestingness measure specification
- Interestingness measures and thresholds can be
specified by the user with the statement:
- with <interest_measure_name> threshold = threshold_value
- Example:
- with support threshold = 0.05
- with confidence threshold = 0.7
51Syntax for pattern presentation and visualization
specification
- We have syntax which allows users to specify the display of discovered patterns in one or more forms
- display as <result_form>
- To facilitate interactive viewing at different concept levels, the following syntax is defined:
- Multilevel_Manipulation ::= roll up on attribute_or_dimension | drill down on attribute_or_dimension | add attribute_or_dimension | drop attribute_or_dimension
52Putting it all together the full specification
of a DMQL query
- use database AllElectronics_db
- use hierarchy location_hierarchy for B.address
- mine characteristics as customerPurchasing
- analyze count
- in relevance to C.age, I.type, I.place_made
- from customer C, item I, purchases P, items_sold S, works_at W, branch B
- where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID
- and P.cust_ID = C.cust_ID and P.method_paid = "AmEx"
- and P.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID and B.address = "Canada" and I.price >= 100
- with noise threshold = 0.05
- display as table
53DMQL and SQL
- DMQL: describe general characteristics of graduate students in the Big-University database
- use Big_University_DB
- mine characteristics as Science_Students
- in relevance to name, gender, major, birth_place, birth_date, residence, phone, gpa
- from student
- where status in "graduate"
- Corresponding SQL statement:
- Select name, gender, major, birth_place, birth_date, residence, phone, gpa
- from student
- where status in {"Msc", "MBA", "PhD"}
54Decision Trees
- Example
- Conducted survey to see what customers were
interested in new model car - Want to select customers for advertising campaign
training set
55One Possibility
(Decision tree: the root tests age < 30; internal nodes test city = sf and car = van; leaves are labeled likely or unlikely.)
56Another Possibility
(Decision tree: the root tests car = taurus; internal nodes test city = sf and age < 45; leaves are labeled likely or unlikely.)
57Issues
- A decision tree cannot be too deep
- there would not be statistically significant amounts of data for the lower decisions
- Need to select the tree that most reliably predicts outcomes
58Top-Down Induction of Decision Tree
Attributes: Outlook, Temperature, Humidity, Wind
PlayTennis: {yes, no}
59Entropy and Information Gain
- S contains si tuples of class Ci, for i = 1, ..., m
- Information: measures the info required to classify any arbitrary tuple (see the formulas below)
- Entropy of attribute A with values {a1, a2, ..., av}
- Information gained by branching on attribute A
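The formula images on this slide did not survive conversion; in the usual ID3-style notation (an assumption based on the surrounding bullets) the three quantities are:

```latex
% Standard definitions assumed here (s = s_1 + ... + s_m; attribute A
% partitions S into subsets S_1..S_v with class counts s_{ij}).
I(s_1,\dots,s_m) = -\sum_{i=1}^{m} \frac{s_i}{s}\,\log_2\frac{s_i}{s}
\qquad
E(A) = \sum_{j=1}^{v} \frac{s_{1j}+\dots+s_{mj}}{s}\; I(s_{1j},\dots,s_{mj})
\qquad
\mathrm{Gain}(A) = I(s_1,\dots,s_m) - E(A)
```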
60Example Analytical Characterization
- Task
- Mine general characteristics describing graduate
students using analytical characterization - Given
- attributes name, gender, major, birth_place,
birth_date, phone, and gpa
- Gen(ai): concept hierarchies on ai
- Ui: attribute analytical thresholds for ai
- Ti: attribute generalization thresholds for ai
- R: attribute relevance threshold
61Example Analytical Characterization (contd)
- 1. Data collection
- target class graduate student
- contrasting class undergraduate student
- 2. Analytical generalization using Ui
- attribute removal
- remove name and phone
- attribute generalization
- generalize major, birth_place, birth_date and
gpa
- accumulate counts
- candidate relation gender, major, birth_country,
age_range and gpa
62Example Analytical characterization (3)
- 3. Relevance analysis
- Calculate expected info required to classify an
arbitrary tuple - Calculate entropy of each attribute e.g. major
63Example Analytical Characterization (4)
- Calculate expected info required to classify a
given sample if S is partitioned according to the
attribute - Calculate information gain for each attribute
- Information gain for all attributes
64Example Analytical characterization (5)
- 4. Initial working relation (W0) derivation
- R = 0.1
- remove irrelevant/weakly relevant attributes from the candidate relation => drop gender, birth_country
- remove the contrasting-class candidate relation
- 5. Perform attribute-oriented induction on W0
using Ti
Initial target class working relation W0
Graduate students
65What Is Association Mining?
- Association rule mining
- Finding frequent patterns, associations,
correlations, or causal structures among sets of
items or objects in transaction databases,
relational databases, and other information
repositories.
- Applications
- Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Examples
- Rule form: Body => Head [support, confidence].
- buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]
- major(x, "CS") ∧ takes(x, "DB") => grade(x, "A") [1%, 75%]
66Association Rule Mining
(Figure: a table of sales records with transaction id, customer id, and products bought, i.e., market-basket data.)
- Trend: products p5 and p8 are often bought together
- Trend: customer 12 likes product p9
67Association Rule
- Rule: {p1, p3, p8}
- Support: the number of baskets where these products appear
- High-support set: support ≥ threshold s
- Problem: find all high-support sets
68Association Rule Basic Concepts
- Given (1) database of transactions, (2) each
transaction is a list of items (purchased by a
customer in a visit) - Find all rules that correlate the presence of
one set of items with that of another set of
items
- E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications
- * => Maintenance Agreement (What should the store do to boost Maintenance Agreement sales?)
- Home Electronics => * (What other products should the store stock up on?)
- Attached mailing in direct marketing
- Detecting "ping-ponging" of patients, faulty "collisions"
69Rule Measures Support and Confidence
(Venn diagram: customers who buy diapers, customers who buy beer, and customers who buy both.)
- Find all the rules X ∧ Y => Z with minimum confidence and support
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction having {X, Y} also contains Z
- With minimum support 50% and minimum confidence 50%, we have
- A => C (50%, 66.6%)
- C => A (50%, 100%)
70Mining Association Rules: An Example
Min. support 50%, min. confidence 50%
- For rule A => C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle
- Any subset of a frequent itemset must be frequent
71Mining Frequent Itemsets the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
- A subset of a frequent itemset must also be a frequent itemset
- i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules.
72The Apriori Algorithm
- Join step: Ck is generated by joining Lk-1 with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code:
- Ck: candidate itemsets of size k
- Lk: frequent itemsets of size k
- L1 = {frequent items};
- for (k = 1; Lk != ∅; k++) do begin
-     Ck+1 = candidates generated from Lk;
-     for each transaction t in the database do
-         increment the count of all candidates in Ck+1 that are contained in t
-     Lk+1 = candidates in Ck+1 with min_support
- end
- return ∪k Lk;
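A compact, runnable rendering of the pseudo-code above (a sketch: transactions are sets of items and min_support is an absolute count):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    items = {frozenset([i]) for t in transactions for i in t}
    frequent, k_sets = {}, items
    while k_sets:
        # Count the candidates contained in each transaction
        counts = {c: sum(1 for t in transactions if c <= t) for c in k_sets}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: combine frequent k-itemsets into (k+1)-candidates;
        # prune step: keep only those whose k-subsets are all frequent.
        prev = list(level)
        k_sets = {a | b for a, b in combinations(prev, 2) if len(a | b) == len(a) + 1}
        k_sets = {c for c in k_sets
                  if all(frozenset(s) in level for s in combinations(c, len(c) - 1))}
    return frequent

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(transactions, min_support=2))
```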
73The Apriori Algorithm Example
(Worked example, shown as tables: scan database D to obtain candidate set C1 and frequent set L1; generate C2 and scan D to obtain L2; generate C3 and scan D to obtain L3.)
74How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1
- insert into Ck
- select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
- from Lk-1 p, Lk-1 q
- where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning
- forall itemsets c in Ck do
-     forall (k-1)-subsets s of c do
-         if (s is not in Lk-1) then delete c from Ck
75How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
- The total number of candidates can be very huge
- One transaction may contain many candidates
- Method
- Candidate itemsets are stored in a hash-tree
- Leaf node of hash-tree contains a list of
itemsets and counts - Interior node contains a hash table
- Subset function finds all the candidates
contained in a transaction
76Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
- Pruning
- acde is removed because ade is not in L3
- C4 = {abcd}
77Criticism to Support and Confidence
- Example 1 (Aggarwal & Yu, PODS'98)
- Among 5000 students
- 3000 play basketball
- 3750 eat cereal
- 2000 both play basketball and eat cereal
- play basketball => eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
- play basketball => not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence
78Criticism to Support and Confidence (Cont.)
- Example 2
- X and Y are positively correlated
- X and Z are negatively correlated
- yet the support and confidence of X => Z dominate
- We need a measure of dependent or correlated events
- P(B|A)/P(B) is also called the lift of rule A => B
79Other Interestingness Measures Interest
- Interest (correlation, lift)
- takes both P(A) and P(B) into consideration
- P(A ∧ B) = P(A) P(B) if A and B are independent events
- A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
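A quick lift computation for the basketball/cereal example from the previous slide:

```python
# Counts from the example: 5000 students, 3000 play basketball,
# 3750 eat cereal, 2000 do both.
n, n_b, n_c, n_bc = 5000, 3000, 3750, 2000

p_b, p_c, p_bc = n_b / n, n_c / n, n_bc / n
lift = p_bc / (p_b * p_c)            # P(B and C) / (P(B) * P(C))
print(lift)                          # ~0.89 < 1: negatively correlated
```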
80Classification vs. Prediction
- Classification
- predicts categorical class labels
- classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying
new data
- Prediction
- models continuous-valued functions, i.e., predicts unknown or missing values
- Typical Applications
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis
81Classification Process Model Construction
Classification Algorithms
IF rank = "professor" OR years > 6 THEN tenured = "yes"
82Classification Process Use the Model in
Prediction
(Jeff, Professor, 4)
Tenured?
83Supervised vs. Unsupervised Learning
- Supervised learning (classification)
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data is classified based on the training set
- Unsupervised learning (clustering)
- The class labels of training data are unknown
- Given a set of measurements, observations, etc.
with the aim of establishing the existence of
classes or clusters in the data
84Training Dataset
This follows an example from Quinlan's ID3
85Output A Decision Tree for buys_computer
(Decision tree: the root tests age; the <30 branch tests student?, the 30..40 branch is a leaf labeled yes, and the >40 branch tests credit rating? (fair/excellent); leaves are labeled yes or no.)
86Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
- Tree is constructed in a top-down recursive
divide-and-conquer manner - At start, all the training examples are at the
root - Attributes are categorical (if continuous-valued,
they are discretized in advance) - Examples are partitioned recursively based on
selected attributes - Test attributes are selected on the basis of a
heuristic or statistical measure (e.g.,
information gain)
- Conditions for stopping partitioning
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left
87Information Gain (ID3/C4.5)
- Select the attribute with the highest information
gain - Assume there are two classes, P and N
- Let the set of examples S contain p elements of
class P and n elements of class N
- The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
88Information Gain in Decision Tree Induction
- Assume that using attribute A a set S will be
partitioned into sets S1, S2, ..., Sv
- If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is E(A) = Σ_i ((pi + ni)/(p + n)) I(pi, ni)
- The encoding information that would be gained by branching on A is Gain(A) = I(p, n) - E(A)
89Attribute Selection by Information Gain
Computation
- Class P: buys_computer = "yes"
- Class N: buys_computer = "no"
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age
90Gini Index (IBM IntelligentMiner)
- If a data set T contains examples from n classes, the gini index gini(T) is defined as gini(T) = 1 - Σ pj^2, where pj is the relative frequency of class j in T.
- If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
- The attribute that provides the smallest gini_split(T) is chosen to split the node (one needs to enumerate all possible splitting points for each attribute).
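A minimal sketch of gini(T) and gini_split(T) (the labels and the split shown are illustrative):

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum(p_j^2) over the class frequencies p_j."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """gini_split(T) = N1/N * gini(T1) + N2/N * gini(T2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels_left = ["yes", "yes", "no"]
labels_right = ["no", "no", "no", "yes"]
print(gini(labels_left + labels_right), gini_split(labels_left, labels_right))
```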
91Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN
rules - One rule is created for each path from the root
to a leaf - Each attribute-value pair along a path forms a
conjunction - The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example
- IF age = "<30" AND student = "no" THEN buys_computer = "no"
- IF age = "<30" AND student = "yes" THEN buys_computer = "yes"
- IF age = "31..40" THEN buys_computer = "yes"
- IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
- IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"
92Avoid Overfitting in Classification
- The generated tree may overfit the training data
- Too many branches, some of which may reflect anomalies due to noise or outliers
- The result is poor accuracy for unseen samples
- Two approaches to avoid overfitting
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
- It is difficult to choose an appropriate threshold
- Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
- Use a set of data different from the training data to decide which is the best pruned tree
93Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross validation, e.g., 10-fold cross
validation - Use all the data for training
- but apply a statistical test (e.g., chi-square)
to estimate whether expanding or pruning a node
may improve the entire distribution - Use minimum description length (MDL) principle
- halting growth of the tree when the encoding is
minimized
94Scalable Decision Tree Induction Methods in Data
Mining Studies
- SLIQ (EDBT'96, Mehta et al.)
- builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.)
- constructs an attribute list data structure
- PUBLIC (VLDB'98, Rastogi & Shim)
- integrates tree splitting and tree pruning: stop growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
- separates the scalability aspects from the criteria that determine the quality of the tree
- builds an AVC-list (attribute, value, class label)
95Bayesian Theorem
- Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem: P(h|D) = P(D|h) P(h) / P(D)
- MAP (maximum a posteriori) hypothesis: the h that maximizes P(h|D), equivalently the h that maximizes P(D|h) P(h)
- Practical difficulty: requires initial knowledge of many probabilities, and has significant computational cost
96Naïve Bayes Classifier (I)
- A simplified assumption: attributes are conditionally independent
- Greatly reduces the computation cost; only count the class distribution.
97Naive Bayesian Classifier (II)
- Given a training set, we can compute the
probabilities
98Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities
- P(C|X) = probability that the sample tuple X = <x1, ..., xk> is of class C.
- E.g., P(class = N | outlook = sunny, windy = true, ...)
- Idea: assign to sample X the class label C such that P(C|X) is maximal
99Estimating a-posteriori probabilities
- Bayes' theorem:
- P(C|X) = P(X|C) P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class C samples
- C such that P(C|X) is maximum = C such that P(X|C) P(C) is maximum
- Problem: computing P(X|C) is unfeasible!
100Naïve Bayesian Classification
- Naïve assumption: attribute independence
- P(x1, ..., xk | C) = P(x1|C) ... P(xk|C)
- If the i-th attribute is categorical, P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
- If the i-th attribute is continuous, P(xi|C) is estimated through a Gaussian density function
- Computationally easy in both cases
101Play-tennis example estimating P(xiC)
outlook:     P(sunny|p) = 2/9      P(sunny|n) = 3/5
             P(overcast|p) = 4/9   P(overcast|n) = 0
             P(rain|p) = 3/9       P(rain|n) = 2/5
temperature: P(hot|p) = 2/9        P(hot|n) = 2/5
             P(mild|p) = 4/9       P(mild|n) = 2/5
             P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:    P(high|p) = 3/9       P(high|n) = 4/5
             P(normal|p) = 6/9     P(normal|n) = 2/5
windy:       P(true|p) = 3/9       P(true|n) = 3/5
             P(false|p) = 6/9      P(false|n) = 2/5
P(p) = 9/14
P(n) = 5/14
102Play-tennis example classifying X
- An unseen sample X = <rain, hot, high, false>
- P(X|p) P(p) = P(rain|p) P(hot|p) P(high|p) P(false|p) P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
- P(X|n) P(n) = P(rain|n) P(hot|n) P(high|n) P(false|n) P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
- Sample X is classified in class n (don't play)
103Association-Based Classification
- Several methods for association-based classification
- ARCS: quantitative association mining and clustering of association rules (Lent et al., '97)
- It beats C4.5 in (mainly) scalability and also accuracy
- Associative classification (Liu et al., '98)
- It mines high-support, high-confidence rules of the form "cond_set => y", where y is a class label
- CAEP (Classification by Aggregating Emerging Patterns) (Dong et al., '99)
- Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
- Mine EPs based on minimum support and growth rate
104What Is Prediction?
- Prediction is similar to classification
- First, construct a model
- Second, use model to predict unknown value
- Major method for prediction is regression
- Linear and multiple regression
- Non-linear regression
- Prediction is different from classification
- Classification refers to predicting a categorical class label
- Prediction models continuous-valued functions
105Regression Analysis and Log-Linear Models in
Prediction
- Linear regression: Y = α + β X
- The two parameters, α and β, specify the line and are to be estimated using the data at hand
- using the least squares criterion on the known values Y1, Y2, ..., and X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2
- Many nonlinear functions can be transformed into the above.
- Log-linear models
- The multi-way table of joint probabilities is approximated by a product of lower-order tables.
- Probability: p(a, b, c, d) = αab βac χad δbcd
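A minimal least-squares fit of Y = α + βX, written without libraries (the sample data are illustrative):

```python
def fit_line(xs, ys):
    """Estimate alpha and beta for Y = alpha + beta * X by least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    beta = num / den
    alpha = my - beta * mx
    return alpha, beta

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]
print(fit_line(xs, ys))    # roughly alpha = 0.11, beta = 1.97
```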
106What is Cluster Analysis?
- Cluster a collection of data objects
- Similar to one another within the same cluster
- Dissimilar objects are in different clusters
- Cluster analysis
- Grouping a set of data objects into clusters
- Clustering is unsupervised classification: no predefined classes
- Typical applications
- As a stand-alone tool to get insight into data distribution
- As a preprocessing step for other algorithms
107General Applications of Clustering
- Pattern Recognition
- Spatial Data Analysis
- create thematic maps in GIS by clustering feature
spaces - detect spatial clusters and explain them in
spatial data mining - Image Processing
- Economic Science (especially market research)
- WWW
- Document classification
- Cluster Weblog data to discover groups of similar
access patterns
108Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
109What Is Good Clustering?
- A good clustering method will produce high
quality clusters with - high intra-class similarity
- low inter-class similarity
- The quality of a clustering result depends on
both the similarity measure used by the method
and its implementation. - The quality of a clustering method is also
measured by its ability to discover some or all
of the hidden patterns.
110Types of Data in Cluster Analysis
- Data matrix
- Dissimilarity matrix
111Measure the Quality of Clustering
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric d(i, j)
- There is a separate "quality" function that measures the "goodness" of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough"
- the answer is typically highly subjective.
112Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the
similarity or dissimilarity between two data
objects - Some popular ones include Minkowski distance
- where i (xi1, xi2, , xip) and j (xj1, xj2,
, xjp) are two p-dimensional data objects, and q
is a positive integer - If q 1, d is Manhattan distance
113Similarity and Dissimilarity Between Objects
- If q = 2, d is the Euclidean distance
- Properties
- d(i,j) ≥ 0
- d(i,i) = 0
- d(i,j) = d(j,i)
- d(i,j) ≤ d(i,k) + d(k,j)
- One can also use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures.
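A sketch of the Minkowski distance family (q = 1 gives Manhattan, q = 2 Euclidean; the sample points are illustrative):

```python
def minkowski(i, j, q=2):
    """Minkowski distance between two p-dimensional objects i and j."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)

i = (1.0, 3.0, 5.0)
j = (4.0, 1.0, 2.0)
print(minkowski(i, j, q=1))   # Manhattan: 8.0
print(minkowski(i, j, q=2))   # Euclidean: ~4.69
```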
114Binary Variables
- A contingency table for binary data
- Simple matching coefficient (invariant, if the
binary variable is symmetric) - Jaccard coefficient (noninvariant if the binary
variable is asymmetric)
Object j
Object i
115Dissimilarity between Binary Variables
- Example
- gender is a symmetric attribute
- the remaining attributes are asymmetric binary
- let the values Y and P be set to 1, and the value
N be set to 0
116Major Clustering Methods
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to those models
117Partitioning Algorithms Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters
- Given k, find a partition into k clusters that optimizes the chosen partitioning criterion
- Globally optimal: exhaustively enumerate all partitions
- Heuristic methods: k-means and k-medoids algorithms
- k-means (MacQueen '67): each cluster is represented by the center of the cluster
- k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster
118The K-Means Clustering Method
- Given k, the k-means algorithm is implemented in four steps (sketched in code below):
- Partition objects into k nonempty subsets
- Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
- Assign each object to the cluster with the nearest seed point
- Go back to Step 2; stop when no new assignments are made
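A minimal sketch of those steps for 2-D points (the naive initialization and sample data are illustrative):

```python
import math

def kmeans(points, k, iters=100):
    """Basic k-means for 2-D points; returns (centroids, assignments)."""
    centroids = list(points[:k])                  # naive initial seed points
    assign = [-1] * len(points)
    for _ in range(iters):
        # Assign each object to the cluster with the nearest seed point
        new_assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:                  # stop when no new assignment
            break
        assign = new_assign
        # Recompute seed points as the centroids of the current partition
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, assign

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
print(kmeans(points, k=2))
```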
119The K-Means Clustering Method
120Comments on the K-Means Method
- Strength
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
- Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
- Weakness
- Applicable only when the mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes