1
Data Mining
2
Introduction
Finding interesting trends or patterns in large datasets
Statistics: exploratory data analysis
Artificial intelligence: knowledge discovery and machine learning
Scalability with respect to data size
An algorithm is scalable if, given the available system resources, its running time grows (linearly) in proportion to the dataset size
3
Introduction
Finding interesting trends or patterns in large datasets
SQL queries (based on the relational algebra)
OLAP queries (higher-level query constructs, multidimensional data model)
Data mining techniques
4
Knowledge Discovery (KDD)
The Knowledge Discovery Process
Data Selection: identify the target dataset and the relevant attributes
Data Cleaning: remove noise and outliers, transform field values to common units, generate new fields, bring the data into the relational schema
Data Mining: apply mining techniques to extract the patterns of interest
Evaluation: present the patterns in an understandable form to the end user (e.g., through visualization)
5
Overview
Counting Co-occurrences: Frequent Itemsets, Iceberg Queries
Mining for Rules: Association Rules, Sequential Rules, Classification and Regression
Tree-Structured Rules
Clustering
Similarity Search over Sequences
6
Counting Co-Occurrences
A market basket is a collection of items
purchased by a customer in a single customer
transaction
Identify items that are purchased together
7
Example
Transid  custid  date     item   qty
111      201     5/1/99   pen    2
111      201     5/1/99   ink    1
111      201     5/1/99   milk   3
111      201     5/1/99   juice  6
112      105     6/3/99   pen    1
112      105     6/3/99   ink    1
112      105     6/3/99   milk   1
113      106     5/10/99  pen    1
113      106     5/10/99  milk   1
114      201     6/1/99   pen    2
114      201     6/1/99   ink    2
114      201     6/1/99   juice  4

The four tuples with transid 111 form one transaction; note that there is redundancy across the tuples of a transaction.
We look for observations of the form "in 75% of the transactions both pen and ink are purchased together".
8
Frequent Itemsets
Itemset: a set of items
Support of an itemset: the fraction of transactions in the database that contain all items in the itemset
Example: itemset {pen, ink} has support 75%; itemset {milk, juice} has support 25%
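
As a concrete illustration, here is a minimal Python sketch of the support computation (the transaction sets are taken from the table on slide 7; the helper name is my own):

# Each transaction is the set of items bought together (slide 7).
transactions = [
    {"pen", "ink", "milk", "juice"},   # transid 111
    {"pen", "ink", "milk"},            # transid 112
    {"pen", "milk"},                   # transid 113
    {"pen", "ink", "juice"},           # transid 114
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"pen", "ink"}, transactions))     # 0.75
print(support({"milk", "juice"}, transactions))  # 0.25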
9
Frequent Itemsets
Frequent itemsets: itemsets whose support is higher than a user-specified minimum support, called minsup
Example: if minsup = 70%, the frequent itemsets are {pen}, {ink}, {milk}, {pen, ink}, {pen, milk}
10
Frequent Itemsets
An algorithm for identifying (all) frequent itemsets?
The a priori property: every subset of a frequent itemset must also be a frequent itemset
11
Frequent Itemsets
An algorithm for identifying frequent itemsets

For each item, check if it is a frequent itemset
k = 1
repeat
    for each new frequent itemset Ik with k items
        generate all itemsets Ik+1 with k+1 items, Ik ⊂ Ik+1
    scan all transactions once and check if the generated (k+1)-itemsets are frequent
    k = k + 1
until no new frequent itemsets are identified
12
Frequent Itemsets
A refinement of the algorithm for identifying frequent itemsets (a runnable sketch follows)

For each item, check if it is a frequent itemset
k = 1
repeat
    for each new frequent itemset Ik with k items
        generate all itemsets Ik+1 with k+1 items, Ik ⊂ Ik+1, whose subsets are all frequent itemsets
    scan all transactions once and check if the generated (k+1)-itemsets are frequent
    k = k + 1
until no new frequent itemsets are identified
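
A runnable Python sketch of this level-wise search with the subset-pruning refinement (the frozenset representation and function names are my choices, not from the slides):

from itertools import combinations

def apriori(transactions, minsup):
    """Return all itemsets with support >= minsup, level by level."""
    n = len(transactions)
    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if sup(frozenset([i])) >= minsup}
    result, k = set(level), 1
    while level:
        # Extend each frequent k-itemset by one item, keeping only
        # candidates whose k-item subsets are all frequent (pruning).
        candidates = set()
        for ik in level:
            for item in items - ik:
                c = ik | {item}
                if all(frozenset(s) in level for s in combinations(c, k)):
                    candidates.add(c)
        # Check the surviving candidates against the transactions (here
        # sup() rescans per candidate; a real implementation counts all
        # candidates in a single pass over the data).
        level = {c for c in candidates if sup(c) >= minsup}
        result |= level
        k += 1
    return result

With minsup = 0.7 on the transactions of slide 7 this returns {pen}, {ink}, {milk}, {pen, ink}, {pen, milk}, matching the earlier example.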
13
Iceberg Queries
Assume we want to find pairs of customers and items such that the customer has purchased the item at least 5 times:

select P.custid, P.item, sum(P.qty)
from Purchases P
group by P.custid, P.item
having sum(P.qty) > 5

Execution plan for the query?
The number of groups is very large, but the answer to the query (the tip of the iceberg) is usually very small
14
Iceberg Queries
An iceberg query has the general form:

select R.A1, R.A2, ..., R.Ak, agg(R.B)
from Relation R
group by R.A1, R.A2, ..., R.Ak
having agg(R.B) > constant

Is there an a priori property similar to the one for frequent itemsets?
15
Iceberg Queries
select P.custid, P.item, sum(P.qty)
from Purchases P
group by P.custid, P.item
having sum(P.qty) > 5

Q1: select P.custid
    from Purchases P
    group by P.custid
    having sum(P.qty) > 5

Q2: select P.item
    from Purchases P
    group by P.item
    having sum(P.qty) > 5

Generate (custid, item) pairs only for custid values returned by Q1 and item values returned by Q2 (see the sketch below)
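
In Python terms, the a-priori-style rewrite first prunes on each grouping attribute separately and only then counts the surviving candidate pairs; a rough sketch over the slide-7 table (the row layout and helper names are my own):

from collections import defaultdict

# Purchases rows: (transid, custid, date, item, qty), as in slide 7.
purchases = [
    (111, 201, "5/1/99", "pen", 2), (111, 201, "5/1/99", "ink", 1),
    (111, 201, "5/1/99", "milk", 3), (111, 201, "5/1/99", "juice", 6),
    (112, 105, "6/3/99", "pen", 1), (112, 105, "6/3/99", "ink", 1),
    (112, 105, "6/3/99", "milk", 1), (113, 106, "5/10/99", "pen", 1),
    (113, 106, "5/10/99", "milk", 1), (114, 201, "6/1/99", "pen", 2),
    (114, 201, "6/1/99", "ink", 2), (114, 201, "6/1/99", "juice", 4),
]

def totals(rows, key):
    """sum(qty) grouped by the given key function."""
    agg = defaultdict(int)
    for row in rows:
        agg[key(row)] += row[4]
    return agg

# Q1 and Q2: customers / items whose total quantity clears the threshold.
q1 = {c for c, s in totals(purchases, lambda r: r[1]).items() if s > 5}
q2 = {i for i, s in totals(purchases, lambda r: r[3]).items() if s > 5}

# Count (custid, item) groups only for surviving custid and item values.
pairs = totals([r for r in purchases if r[1] in q1 and r[3] in q2],
               lambda r: (r[1], r[3]))
answer = {g: s for g, s in pairs.items() if s > 5}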
16
Overview
Counting Co-occurrences: Frequent Itemsets, Iceberg Queries
Mining for Rules: Association Rules, Sequential Rules, Classification and Regression
Tree-Structured Rules
Clustering
Similarity Search over Sequences
17
Association Rules
Example: {pen} ⇒ {ink}
If a pen is purchased in a transaction, it is likely that ink will also be purchased in that transaction
In general: LHS ⇒ RHS
18
Association Rules
LHS ⇒ RHS
Support: support(LHS ∪ RHS), the percentage of transactions that contain all of these items
Confidence: support(LHS ∪ RHS) / support(LHS), an indication of the strength of the rule; it estimates P(RHS | LHS)
An algorithm for finding all rules with minsup and minconf?
19
Association Rules
An algorithm for finding all rules with minsup and minconf

Step 1: find all frequent itemsets with minsup
Step 2: generate all rules from the itemsets of step 1

Step 2:
for each frequent itemset I with support support(I)
    divide I into LHS_I and RHS_I
    confidence = support(I) / support(LHS_I)
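
A short Python sketch of step 2; the input is assumed (my choice) to be a dict mapping each frequent itemset, as a frozenset, to its support:

from itertools import combinations

def rules(frequent, minconf):
    """Generate (LHS, RHS, confidence) from frequent-itemset supports."""
    out = []
    for itemset, sup_i in frequent.items():
        # Try every way of dividing I into a non-empty LHS and RHS.
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                # By the a priori property, LHS is itself frequent,
                # so its support is already in the dict.
                conf = sup_i / frequent[lhs]
                if conf >= minconf:
                    out.append((set(lhs), set(itemset - lhs), conf))
    return out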
20
Association Rules and ISA Hierarchies
An ISA (category) hierarchy on the set of items: a transaction implicitly contains, for each of its items, all of the item's ancestors
  • Detect relationships between items at different levels of the hierarchy
  • In general, the support of an itemset can only increase if an item is replaced by its ancestor

Example hierarchy: Beverage → {Milk, Juice}; Stationery → {Ink, Pen}
21
Generalized Association Rules
More general groupings, not just customer transactions

Transid  custid  date     item   qty
111      201     5/1/99   pen    2
111      201     5/1/99   ink    1
111      201     5/1/99   milk   3
111      201     5/1/99   juice  6
112      105     6/3/99   pen    1
112      105     6/3/99   ink    1
112      105     6/3/99   milk   1
113      106     5/10/99  pen    1
113      106     5/10/99  milk   1
114      201     6/1/99   pen    2
114      201     6/1/99   ink    2
114      201     6/1/99   juice  4

e.g., group tuples by custid
Rule {pen} ⇒ {milk}: if a pen is purchased by a customer, it is likely that milk will also be purchased by that customer
22
Generalized Association Rules
Group tuples by date: calendric market basket analysis
A calendar is any group of dates, e.g., every first of the month
Given a calendar, compute association rules over the set of tuples whose date field falls within the calendar

Transid  custid  date     item   qty
111      201     5/1/99   pen    2
111      201     5/1/99   ink    1
111      201     5/1/99   milk   3
111      201     5/1/99   juice  6
112      105     6/3/99   pen    1
112      105     6/3/99   ink    1
112      105     6/3/99   milk   1
113      106     5/10/99  pen    1
113      106     5/10/99  milk   1
114      201     6/1/99   pen    2
114      201     6/1/99   ink    2
114      201     6/1/99   juice  4

Calendar: every first of the month
Rule {pen} ⇒ {juice} has support 100% within the calendar, 50% over the entire table
Rule {pen} ⇒ {milk} has support 50% within the calendar, 75% over the entire table
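
Sketched in Python, calendric analysis is just the earlier support computation restricted to the transactions whose date falls in the calendar (the m/d/yy date format follows the example table; the helper is my own):

# Transactions of slide 7 as (date, items) pairs.
baskets = [("5/1/99", {"pen", "ink", "milk", "juice"}),
           ("6/3/99", {"pen", "ink", "milk"}),
           ("5/10/99", {"pen", "milk"}),
           ("6/1/99", {"pen", "ink", "juice"})]

def support(itemset, baskets):
    return sum(1 for _, b in baskets if itemset <= b) / len(baskets)

# Calendar "every first of the month": keep dates whose day component is 1.
firsts = [(d, b) for d, b in baskets if d.split("/")[1] == "1"]

print(support({"pen", "juice"}, firsts))   # 1.00 within the calendar
print(support({"pen", "juice"}, baskets))  # 0.50 over the entire table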
23
Sequential Patterns
Sequence of itemsets: the sequence of itemsets purchased by a customer
Example: custid 201: ⟨{pen, ink, milk, juice}, {pen, ink, juice}⟩ (ordered by date)
A subsequence of a sequence of itemsets is obtained by deleting one or more itemsets, and is also a sequence of itemsets
24
Sequential Patterns
A sequence ⟨a1, a2, ..., am⟩ is contained in a sequence S if S has a subsequence ⟨b1, ..., bm⟩ such that ai ⊆ bi for 1 ≤ i ≤ m
Example: ⟨{pen}, {ink, milk}, {pen, juice}⟩ is contained in ⟨{pen, ink}, {shirt}, {juice, ink, milk}, {juice, pen, milk}⟩
The order of items within each itemset does not matter, but the order of itemsets does matter:
⟨{pen}, {ink, milk}, {pen, juice}⟩ is not contained in ⟨{pen, ink}, {shirt}, {juice, pen, milk}, {juice, milk, ink}⟩
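
The containment test can be written as a single greedy left-to-right scan over S; a small Python sketch (the list-of-sets representation is my choice):

def contained(a, s):
    """True if the sequence of itemsets a is contained in the sequence s."""
    i = 0
    for b in s:                       # scan s left to right
        if i < len(a) and a[i] <= b:  # a[i] fits inside this itemset of s
            i += 1                    # move on to the next itemset of a
    return i == len(a)

big = [{"pen", "ink"}, {"shirt"}, {"juice", "ink", "milk"},
       {"juice", "pen", "milk"}]
print(contained([{"pen"}, {"ink", "milk"}, {"pen", "juice"}], big))  # True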
25
Sequential Patterns
The support for a sequence S of itemsets is the
percentage of customer sequences of which S is a
subsequence
Identify all sequences that have a minimum support
26
Overview
Counting Co-occurrences: Frequent Itemsets, Iceberg Queries
Mining for Rules: Association Rules, Sequential Rules, Classification and Regression
Tree-Structured Rules
Clustering
Similarity Search over Sequences
27
Classification and Regression Rules
InsuranceInfo(age integer, cartype string, highrisk boolean)
There is one attribute (highrisk) whose value we would like to predict: the dependent attribute
The other attributes are called the predictors
General form of the types of rules we want to discover:
P1(X1) ∧ P2(X2) ∧ ... ∧ Pk(Xk) ⇒ Y = c
28
Classification and Regression Rules
P1(X1) ∧ P2(X2) ∧ ... ∧ Pk(Xk) ⇒ Y = c
  • Pi(Xi) are predicates
  • Two types
  • numerical: Pi(Xi) is li ≤ Xi ≤ hi
  • categorical: Pi(Xi) is Xi ∈ {v1, ..., vj}
  • numerical dependent attribute: regression rule
  • categorical dependent attribute: classification rule

Example (a small sketch follows):
(16 ≤ age ≤ 25) ∧ (cartype ∈ {Sports, Truck}) ⇒ highrisk = true
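
In code, such a rule is just a conjunction of per-attribute predicates; a tiny Python sketch of the example rule (the function name is illustrative):

def high_risk_rule(age, cartype):
    # (16 <= age <= 25) AND (cartype in {Sports, Truck}) => highrisk = true
    return 16 <= age <= 25 and cartype in {"Sports", "Truck"}

print(high_risk_rule(23, "Truck"))  # True
print(high_risk_rule(30, "Sedan"))  # False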
29
Classification and Regression Rules
P1(X1) ∧ P2(X2) ∧ ... ∧ Pk(Xk) ⇒ Y = c
Support: the support for a condition C is the percentage of tuples that satisfy C; the support for a rule C1 ⇒ C2 is the support of the condition C1 ∧ C2
Confidence: consider the tuples that satisfy condition C1; the confidence for a rule C1 ⇒ C2 is the percentage of such tuples that also satisfy condition C2
30
Classification and Regression Rules
They differ from association rules by considering continuous and categorical attributes, rather than one field that is set-valued
31
Overview
Counting Co-occurrences: Frequent Itemsets, Iceberg Queries
Mining for Rules: Association Rules, Sequential Rules, Classification and Regression
Tree-Structured Rules
Clustering
Similarity Search over Sequences
32
Tree-Structured Rules
Classification (decision) trees
Regression trees
Typically the tree itself is the output of data mining
Easy to understand
Efficient algorithms to construct them
33
Decision Trees
A graphical representation of a collection of
classification rules. Given a data record, the
tree directs the record from the root to a leaf.
  • Internal nodes labeled with a predictor
    attribute (called a splitting attribute)
  • Outgoing edges labeled with predicates that
    involve the splitting attribute of the node
    (splitting criterion)
  • Leaf nodes labeled with a value of a dependent
    attribute

34
Decision Trees
Example

Age
  > 25: No
  ≤ 25: Car Type
    Sports, Truck: Yes
    Other: No

Construct classification rules from the paths from the root to a leaf
LHS: the conjunction of the predicates along the path
RHS: the value at the leaf

(16 ≤ age ≤ 25) ∧ (cartype ∈ {Sports, Truck}) ⇒ highrisk = true
35
Decision Trees
Constructed in two phases
Phase 1 (growth phase): construct a very large tree (e.g., leaf nodes for individual records in the database)
Phase 2 (pruning phase)

Build the tree greedily top-down:
At the root node, examine the database and compute the locally best splitting criterion
Partition the database into two parts
Recurse on each child
36
Decision Trees
Input: node n, partition D, split selection method S
Output: decision tree for D rooted at node n

Top-down decision tree induction schema (a Python sketch follows):

BuildTree(node n, partition D, method S)
    apply S to D to find the splitting criterion
    if (a good splitting criterion is found)
        create two children nodes n1 and n2 of n
        partition D into D1 and D2
        BuildTree(n1, D1, S)
        BuildTree(n2, D2, S)
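
A minimal runnable rendering of this schema, under the (illustrative) assumption that the split selection method returns either None or a boolean predicate over a record, and that records are dicts with a "highrisk" field:

class Node:
    def __init__(self, label=None, test=None, left=None, right=None):
        self.label = label    # leaf: value of the dependent attribute
        self.test = test      # internal node: predicate record -> bool
        self.left, self.right = left, right  # children for true / false

def majority_leaf(partition):
    labels = [r["highrisk"] for r in partition]
    return Node(label=max(set(labels), key=labels.count))

def build_tree(partition, select_split):
    test = select_split(partition)   # locally best criterion, or None
    if test is None:                 # no good split found: make a leaf
        return majority_leaf(partition)
    d1 = [r for r in partition if test(r)]
    d2 = [r for r in partition if not test(r)]
    if not d1 or not d2:             # degenerate split: stop growing
        return majority_leaf(partition)
    return Node(test=test,
                left=build_tree(d1, select_split),
                right=build_tree(d2, select_split))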
37
Decision Trees
Split selection method: an algorithm that takes as input (part of) a relation and outputs the locally best splitting criterion
Example: examine the attributes cartype and age, select one of them as the splitting attribute, and then select the splitting predicates
38
Decision Trees
How can we construct decision trees when the input database is larger than main memory?
Provide the split selection method with aggregated information about the database instead of loading the complete database into main memory
We need aggregated information for each predictor attribute:
The AVC set of a predictor attribute X at node n is the projection of n's database partition onto X and the dependent attribute, where counts of the individual values in the domain of the dependent attribute are aggregated
39
Decision Trees
age  cartype  highrisk
23   Sedan    false
30   Sports   false
36   Sedan    false
25   Truck    true
30   Sedan    false
23   Truck    true
30   Truck    false
25   Sports   true
18   Sedan    false

AVC set of the predictor attribute age at the root node:

select R.age, R.highrisk, count(*)
from InsuranceInfo R
group by R.age, R.highrisk

AVC set of the predictor attribute cartype at the left child of the root node:

select R.cartype, R.highrisk, count(*)
from InsuranceInfo R
where R.age < 25
group by R.cartype, R.highrisk
40
Decision Trees
age  cartype  highrisk
23   Sedan    false
30   Sports   false
36   Sedan    false
25   Truck    true
30   Sedan    false
23   Truck    true
30   Truck    false
25   Sports   true
18   Sedan    false

AVC set of the predictor attribute age at the root node:

select R.age, R.highrisk, count(*)
from InsuranceInfo R
group by R.age, R.highrisk

The corresponding AVC set of cartype at the root node, shown as a table:

         true  false
Sedan     0     4
Sports    1     1
Truck     2     1
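
The same aggregation in a Python sketch (the tuple layout mirrors the InsuranceInfo table above; the column indexes are my own convention):

from collections import Counter

insurance = [(23, "Sedan", False), (30, "Sports", False), (36, "Sedan", False),
             (25, "Truck", True), (30, "Sedan", False), (23, "Truck", True),
             (30, "Truck", False), (25, "Sports", True), (18, "Sedan", False)]

def avc_set(partition, col):
    """Counts of (predictor value, class label) pairs in the partition."""
    return Counter((row[col], row[2]) for row in partition)

# AVC set of cartype at the root: {(Sedan, False): 4, (Truck, True): 2, ...}
print(avc_set(insurance, 1))
# AVC set of age at the left child of the root (partition with age < 25).
print(avc_set([r for r in insurance if r[0] < 25], 0))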
41
Decision Trees
The AVC group of a node n: the set of the AVC sets of all predictor attributes at node n
Size of the AVC set?
42
Decision Trees
Input: node n, partition D, split selection method S
Output: decision tree for D rooted at node n

Top-down decision tree induction schema:

BuildTree(node n, partition D, method S)
    make a scan over D and construct the AVC group of node n in memory
    apply S to the AVC group to find the splitting criterion
    if (a good splitting criterion is found)
        create two children nodes n1 and n2 of n
        partition D into D1 and D2
        BuildTree(n1, D1, S)
        BuildTree(n2, D2, S)
43
Overview
Counting Co-occurrences: Frequent Itemsets, Iceberg Queries
Mining for Rules: Association Rules, Sequential Rules, Classification and Regression
Tree-Structured Rules
Clustering
Similarity Search over Sequences
44
Clustering
Partition a set of records into groups (clusters) such that all records within a group are similar to each other and records that belong to two different groups are dissimilar
Similarity between records is measured computationally by a distance function
45
Clustering
CustomerInfo(age integer, salary real)

[Scatter plot of salary versus age: the records form three visually identifiable clusters]

  • Visually identify three clusters; the shape of the clusters here is spherical
46
Clustering
The output of a clustering algorithm consists of a summarized representation of each cluster
The type of output depends on the type and shape of the clusters; for example, for spherical clusters: center C (the mean) and radius R, where, given a collection of records r1, r2, ..., rn:

C = (Σ ri) / n        R = sqrt( Σ (ri − C)² / n )
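
For spherical clusters these two summaries are a single pass of arithmetic; a minimal Python sketch for records that are points in the plane:

from math import sqrt

def summarize(cluster):
    """Center (mean) and radius of a list of 2-d records."""
    n = len(cluster)
    cx = sum(x for x, _ in cluster) / n
    cy = sum(y for _, y in cluster) / n
    r = sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in cluster) / n)
    return (cx, cy), r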
47
Clustering
  • Two types of clustering algorithms
  • Partitional clustering: partitions the data into k groups such that some criterion that evaluates the clustering quality is optimized
  • Hierarchical clustering: generates a sequence of partitions of the records; starting with a partition in which each cluster consists of a single record, it merges two clusters in each step
48
Clustering
The BIRCH algorithm
  • Assumptions
  • a large number of records, and just one scan over them
  • a limited amount of main memory
  • Two parameters
  • k: main memory threshold, the maximum number of clusters that can be maintained in memory
  • e: initial threshold for the radius of each cluster; a cluster is compact if its radius is smaller than e
  • Always maintain in main memory k or fewer compact cluster summaries (Ci, Ri)
  • (If this is not possible, adjust e)
49
Clustering
The BIRCH algorithm
Read a record r from the database
Compute the distance between r and each of the existing cluster centers
Let i be the cluster (index) such that the distance between r and Ci is the smallest
Compute R'i, the radius of the i-th cluster assuming r is inserted into it
If R'i ≤ e, insert r into the i-th cluster and recompute Ri and Ci
else start a new cluster containing only r
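
A simplified Python sketch of this insertion loop; unlike real BIRCH, it keeps the full point list per cluster to recompute the center and radius, rather than maintaining compact sufficient statistics:

from math import dist, sqrt

def birch_insert(r, clusters, e, k):
    """clusters: list of point lists; e: radius threshold; k: memory bound."""
    def center(pts):
        return tuple(sum(c) / len(pts) for c in zip(*pts))
    def radius(pts):
        c = center(pts)
        return sqrt(sum(dist(p, c) ** 2 for p in pts) / len(pts))

    if clusters:
        # Index of the cluster whose center is closest to r.
        i = min(range(len(clusters)),
                key=lambda j: dist(r, center(clusters[j])))
        if radius(clusters[i] + [r]) <= e:   # still compact after insertion
            clusters[i].append(r)
            return
    if len(clusters) >= k:
        raise MemoryError("more than k clusters: increase e and rebuild")
    clusters.append([r])                     # new cluster containing only r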
50
Overview
Counting Co-occurrences: Frequent Itemsets, Iceberg Queries
Mining for Rules: Association Rules, Sequential Rules, Classification and Regression
Tree-Structured Rules
Clustering
Similarity Search over Sequences
51
Similarity Search over Sequences
A user specifies a query sequence and wants to
retrieve all data sequences that are similar to
the query sequence
Not exact matches
A data sequence X is a sequence of numbers X = ⟨x1, x2, ..., xk⟩, also called a time series; k is the length of the sequence
A subsequence Z = ⟨z1, z2, ..., zj⟩ is obtained from another sequence by deleting numbers from the front and back of the sequence
52
Similarity Search over Sequences
Given two sequences X and Y, we can define the distance between them as the Euclidean norm:

||X − Y|| = sqrt( (x1 − y1)² + ... + (xk − yk)² )

Given a user-specified query sequence and a threshold parameter e, retrieve all data sequences within distance e of the query sequence
  • Complete sequence matching (the query sequence and the sequences in the database have the same length)
  • Subsequence matching (the query is shorter)

53
Similarity Search over Sequences
Given a user-specified query sequence and a threshold parameter e, retrieve all data sequences within distance e of the query sequence
Brute-force method?
Each data sequence and the query sequence of length k may be represented as a point in a k-dimensional space
Construct a multidimensional index
Non-exact matches? Query the index with a hyper-rectangle with side length 2e centered at the query sequence
(A brute-force sketch follows)
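
A brute-force Python baseline for both matching modes, against which the index-based method would be compared (sequences are plain lists of numbers; the names are my own):

from math import dist

def complete_matches(query, data, e):
    """Data sequences of the same length within e-distance of the query."""
    return [x for x in data if len(x) == len(query) and dist(x, query) <= e]

def subsequence_matches(query, x, e):
    """Start offsets of windows of x within e-distance of a shorter query."""
    k = len(query)
    return [i for i in range(len(x) - k + 1) if dist(x[i:i + k], query) <= e]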