1
Data Mining (Chapter 24 of the book)
Dr. Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
eamonn@cs.ucr.edu
Don't bother reading 24.3.7 or 24.3.8
2
What is Data Mining?
Data Mining has been defined as "the nontrivial
extraction of implicit, previously unknown, and
potentially useful information from data."
[Figure: data mining shown at the intersection of data
visualization, statistics, artificial intelligence, and
databases]
Informally, data mining is the extraction of
interesting knowledge (rules, regularities,
patterns, constraints) from data in large
databases.
3
Data Mining
  • Broadly speaking, data mining is the process of
    semi-automatically analyzing large databases to
    find useful patterns
  • Like knowledge discovery in artificial
    intelligence, data mining discovers statistical
    rules and patterns
  • Differs from machine learning in that it deals
    with large volumes of data stored primarily on
    disk.
  • Some types of knowledge discovered from a
    database can be represented by a set of rules.
  • e.g., Young women with annual incomes greater
    than 50,000 are most likely to buy sports cars
  • Other types of knowledge represented by
    equations, or by prediction functions, or by
    clusters
  • Some manual intervention is usually required
  • Pre-processing of data, choice of which type of
    pattern to find, postprocessing to find novel
    patterns

4
Applications of Data Mining
  • Prediction based on past history
  • Predict if a credit card applicant poses a good
    credit risk, based on some attributes (income,
    job type, age, ..) and past history
  • Predict if a customer is likely to switch brand
    loyalty
  • Predict if a customer is likely to respond to
    junk mail
  • Predict if a pattern of phone calling card usage
    is likely to be fraudulent
  • Some examples of prediction mechanisms
  • Classification
  • Given a training set consisting of items
    belonging to different classes, and a new item
    whose class is unknown, predict which class it
    belongs to
  • Regression formulae
  • given a set of parameter-value to
    function-result mappings for an unknown function,
    predict the function-result for a new
    parameter-value

5
Applications of Data Mining (Cont.)
  • Descriptive Patterns
  • Associations
  • Find books that are often bought by the same
    customers. If a new customer buys one such book,
    suggest that he buys the others too.
  • Other similar applications camera accessories,
    clothes, etc.
  • Associations may also be used as a first step in
    detecting causation
  • E.g. association between exposure to chemical X
    and cancer, or new medicine and cardiac problems
  • Clusters
  • E.g. typhoid cases were clustered in an area
    surrounding a contaminated well
  • Detection of clusters remains important in
    detecting epidemics

6
Classification Rules
  • Classification rules help assign new objects to a
    set of classes. E.g., given a new automobile
    insurance applicant, should he or she be
    classified as low risk, medium risk or high risk?
  • Classification rules for above example could use
    a variety of knowledge, such as educational level
    of applicant, salary of applicant, age of
    applicant, etc.
  • ∀ person P, P.degree = masters and P.income >
    75,000 ⇒ P.credit = excellent
  • ∀ person P, P.degree = bachelors and
    (P.income ≥ 25,000 and P.income ≤ 75,000)
    ⇒ P.credit = good
    (these two rules are sketched in code after this
    list)
  • Rules are not necessarily exact: there may be
    some misclassifications
  • Classification rules can be compactly shown as a
    decision tree.
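The two rules above can also be read as a tiny program. A minimal Python sketch (the function name and the fallback class are illustrative assumptions, not part of the slides):

```python
# Illustrative sketch of the two credit-risk rules above.
def credit_rating(degree: str, income: float) -> str:
    if degree == "masters" and income > 75_000:
        return "excellent"
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    return "unknown"  # the slides do not specify a default class

print(credit_rating("masters", 90_000))  # excellent
```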

7
Decision Tree
∀ person P, P.degree = masters and P.income >
75,000 ⇒ P.credit = excellent
8
Construction of Decision Trees
  • Training set: a data sample in which the grouping
    for each tuple is already known.
  • Consider the credit risk example: suppose degree is
    chosen to partition the data at the root.
  • Since degree has a small number of possible
    values, one child is created for each value.
  • At each child node of the root, further
    classification is done if required. Here,
    partitions are defined by income.
  • Since income is a continuous attribute, some
    number of intervals are chosen, and one child
    created for each interval.
  • Different classification algorithms use different
    ways of choosing which attribute to partition on
    at each node, and what the intervals, if any,
    are.
  • In general
  • Different branches of the tree could grow to
    different levels.
  • Different nodes at the same level may use
    different partitioning attributes.

9
Construction of Decision Trees (Cont.)
  • Greedy top down generation of decision trees.
  • Each internal node of the tree partitions the
    data into groups based on a partitioning
    attribute, and a partitioning condition for the
    node
  • More on choosing partitioning attribute/condition
    shortly
  • Algorithm is greedy: the choice is made once and
    not revisited as more of the tree is constructed
  • The data at a node is not partitioned further if
    either
  • all (or most) of the items at the node belong to
    the same class, or
  • all attributes have been considered, and no
    further partitioning is possible.
  • Such a node is a leaf node.
  • Otherwise the data at the node is partitioned
    further by picking an attribute for partitioning
    data at the node.

10
Best Splits
  • Idea: evaluate different attributes and
    partitioning conditions and pick the one that
    best improves the purity of the training set
    examples
  • The initial training set has a mixture of
    instances from different classes and is thus
    relatively impure
  • E.g. if degree exactly predicts credit risk,
    partitioning on degree would result in each child
    having instances of only one class
  • I.e., the child nodes would be pure
  • The purity of a set S of training instances can
    be measured quantitatively in several ways.
  • Notation: number of classes = k, number of
    instances = |S|, fraction of instances in
    class i = pi.
  • The Gini measure of purity is defined as
  • Gini(S) = 1 - Σi pi²
  • When all instances are in a single class, the
    Gini value is 0, while it reaches its maximum (of
    1 - 1/k) if each class has the same number of
    instances.
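A minimal sketch of the Gini measure as defined above, assuming class labels are given as a plain Python list:

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_i p_i^2, where p_i is the fraction of class i in S."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["good"] * 4))                    # 0.0 (pure node)
print(gini(["good", "bad", "good", "bad"]))  # 0.5 (maximum 1 - 1/k for k = 2)
```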

11
Best Splits (Cont.)
  • Another measure of purity is the entropy measure,
    which is defined as
  • entropy(S) = -Σi pi log2 pi
  • When a set S is split into multiple sets Si,
    i = 1, 2, ..., r, we can measure the purity of the
    resultant set of sets as
  • purity(S1, S2, ..., Sr) = Σi (|Si|/|S|) purity(Si)
  • The information gain due to a particular split of S
    into Si, i = 1, 2, ..., r:
  • Information-gain(S, {S1, S2, ..., Sr}) =
    purity(S) - purity(S1, S2, ..., Sr)
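The same definitions in code form: a small sketch of entropy, the weighted purity of a split, and the resulting information gain, again assuming plain lists of class labels:

```python
import math
from collections import Counter

def entropy(labels):
    """entropy(S) = -sum_i p_i log2 p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_purity(subsets, measure=entropy):
    """purity(S1, ..., Sr) = sum_i (|Si|/|S|) * purity(Si)."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * measure(s) for s in subsets)

def information_gain(labels, subsets):
    return entropy(labels) - split_purity(subsets)

S = ["excellent", "excellent", "good", "good"]
print(information_gain(S, [["excellent", "excellent"], ["good", "good"]]))  # 1.0
```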

12
Best Splits (Cont.)
  • Measure of the cost of a split:
    Information-content(S, {S1, S2, ..., Sr}) =
    -Σi (|Si|/|S|) log2(|Si|/|S|)
  • Information-gain ratio =
    Information-gain(S, {S1, S2, ..., Sr}) /
    Information-content(S, {S1, S2, ..., Sr})
  • The best split for an attribute is the one that
    gives the maximum information gain ratio
    (a small sketch follows this list)
  • Continuous valued attributes
  • Can be ordered in a fashion meaningful to
    classification
  • e.g. integer and real values
  • Categorical attributes
  • Cannot be meaningfully ordered (e.g. country,
    school/university, item-color, ...)
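The sketch referred to above: information content of a split and the information-gain ratio, following the formulas on this slide (subset sizes are passed as plain integers):

```python
import math

def information_content(subset_sizes):
    """Information-content(S, {S1..Sr}) = -sum_i (|Si|/|S|) log2(|Si|/|S|)."""
    total = sum(subset_sizes)
    return -sum(s / total * math.log2(s / total) for s in subset_sizes)

def gain_ratio(info_gain, subset_sizes):
    """Information-gain ratio = information gain / information content."""
    return info_gain / information_content(subset_sizes)

print(information_content([50, 50]))  # 1.0 bit for an even two-way split
print(gain_ratio(0.5, [50, 50]))      # 0.5
```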

13
Finding Best Splits
  • Categorical attributes
  • Multi-way split, one child for each value
  • may have too many children in some cases
  • Binary split: try all possible breakups of values
    into two sets, and pick the best
  • Continuous valued attribute
  • Binary split
  • Sort values in the instances, try each as a split
    point
  • E.g. if values are 1, 10, 15, 25, split at ≤ 1,
    ≤ 10, ≤ 15
  • Pick the value that gives the best split
    (see the sketch after this list)
  • Multi-way split more complicated, see
    bibliographic notes
  • A series of binary splits on the same attribute
    has roughly equivalent effect
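The sketch referred to above: finding the best binary split point for a continuous attribute by trying each observed value as a threshold. Scoring by weighted Gini impurity is an assumption here; the slides leave the purity measure open:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try each observed value v as the split 'value <= v' and keep the one
    with the lowest weighted impurity of the two resulting subsets."""
    best = (None, float("inf"))
    for v in sorted(set(values))[:-1]:      # the largest value cannot split
        left = [lab for x, lab in zip(values, labels) if x <= v]
        right = [lab for x, lab in zip(values, labels) if x > v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (v, score)
    return best  # (split point, weighted impurity)

print(best_binary_split([1, 10, 15, 25], ["a", "a", "b", "b"]))  # (10, 0.0)
```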

14
Decision-Tree Construction Algorithm I
  • Procedure GrowTree(S)
        Partition(S);
    Procedure Partition(S)
        if (purity(S) > δp or |S| < δs) then
            return;
        for each attribute A
            evaluate splits on attribute A;
        Use best split found (across all attributes)
            to partition S into S1, S2, ..., Sr;
        for i = 1, 2, ..., r
            Partition(Si);
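A runnable Python rendering of the pseudocode above; the stopping thresholds, the use of Gini as the purity test, and the equality-based splits on categorical attributes are simplifying assumptions:

```python
from collections import Counter

def gini(rows, target):
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in Counter(r[target] for r in rows).values())

def partition(rows, target, purity_threshold=0.1, min_size=2, depth=0):
    """Greedy top-down partitioning: stop when the node is pure enough or too
    small, otherwise split on the (attribute, value) test that minimises the
    weighted Gini impurity, and recurse on each part."""
    indent = "  " * depth
    if gini(rows, target) <= purity_threshold or len(rows) < min_size:
        print(indent + "leaf ->", Counter(r[target] for r in rows).most_common(1)[0][0])
        return
    best = None
    for attr in rows[0]:
        if attr == target:
            continue
        for v in {r[attr] for r in rows}:
            groups = [[r for r in rows if r[attr] == v],
                      [r for r in rows if r[attr] != v]]
            if not groups[0] or not groups[1]:
                continue
            score = sum(len(g) * gini(g, target) for g in groups) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, v, groups)
    if best is None:  # no attribute separates the rows: make a leaf
        print(indent + "leaf ->", Counter(r[target] for r in rows).most_common(1)[0][0])
        return
    _, attr, v, groups = best
    print(f"{indent}split on {attr} == {v!r}?")
    for g in groups:
        partition(g, target, purity_threshold, min_size, depth + 1)

data = [{"degree": "masters", "income": "high", "credit": "excellent"},
        {"degree": "masters", "income": "low", "credit": "good"},
        {"degree": "bachelors", "income": "high", "credit": "good"},
        {"degree": "bachelors", "income": "low", "credit": "good"}]
partition(data, target="credit")
```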

15
Decision-Tree Construction Algorithm II
  • Variety of algorithms have been developed to
  • Reduce CPU cost and/or
  • Reduce IO cost when handling datasets larger than
    memory
  • Improve accuracy of classification
  • Decision tree may be overfitted, i.e., overly
    tuned to given training set
  • Pruning of decision tree may be done on branches
    that have too few training instances
  • When a subtree is pruned, an internal node
    becomes a leaf
  • and its class is set to the majority class of the
    instances that map to the node
  • Pruning can be done by using a part of the
    training set to build tree, and a second part to
    test the tree
  • prune subtrees that increase misclassification
    on second part

16
A visual intuition of the classification problem
Given a database (called a training database) of
labeled examples, predict future unlabeled
examples
[Figure: scatter plot of labeled examples with axes
Shoe Size and Blood Sugar. What is the class of Homer?]
17
Decision Tree: A visual intuition
[Figure: decision-tree splits shown on the Shoe Size /
Blood Sugar scatter plot]
18
[Figure: a further example with axes White Cell Count
and Blood Sugar]
19
Other Types of Classifiers
  • Further types of classifiers
  • Neural net classifiers
  • Bayesian classifiers
  • Neural net classifiers use the training data to
    train artificial neural nets
  • Widely studied in AI, won't cover here
  • Bayesian classifiers use Bayes theorem, which
    says
  • p(cj | d) = p(d | cj) p(cj) / p(d)
  • where p(cj | d) = probability of instance d being
    in class cj,
  • p(d | cj) = probability of generating instance d
    given class cj,
  • p(cj) = probability of occurrence of class cj, and
  • p(d) = probability of instance d occurring
  • For more details see Keogh, E. & Pazzani, M.
    (1999). Learning augmented Bayesian classifiers:
    A comparison of distribution-based and
    classification-based approaches. In Uncertainty
    99, 7th Int'l Workshop on AI and Statistics, Ft.
    Lauderdale, FL, pp. 225-230.

20
Naïve Bayesian Classifiers
  • Bayesian classifiers require
  • computation of p(d | cj)
  • precomputation of p(cj)
  • p(d) can be ignored since it is the same for all
    classes
  • To simplify the task, naïve Bayesian classifiers
    assume attributes have independent distributions,
    and thereby estimate
  • p(d | cj) = p(d1 | cj) p(d2 | cj) ... p(dn | cj)
  • Each of the p(di | cj) can be estimated from a
    histogram on di values for each class cj
  • the histogram is computed from the training
    instances
  • Histograms on multiple attributes are more
    expensive to compute and store
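A small sketch of a naïve Bayesian classifier over categorical attributes, estimating p(cj) and the per-attribute histograms p(di | cj) by counting (no smoothing; the toy data below is invented for illustration):

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Estimate p(c_j) and per-attribute histograms p(d_i | c_j) by counting."""
    class_counts = Counter(r[target] for r in rows)
    hists = defaultdict(Counter)  # (attribute, class) -> value counts
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                hists[(attr, r[target])][value] += 1
    return class_counts, hists

def classify(instance, class_counts, hists):
    n = sum(class_counts.values())
    best_class, best_score = None, 0.0
    for c, count in class_counts.items():
        score = count / n                             # p(c_j)
        for attr, value in instance.items():
            score *= hists[(attr, c)][value] / count  # p(d_i | c_j)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

rows = [{"height": "tall", "sex": "male"}, {"height": "tall", "sex": "male"},
        {"height": "short", "sex": "female"}, {"height": "tall", "sex": "female"}]
model = train_naive_bayes(rows, target="sex")
print(classify({"height": "tall"}, *model))  # male
```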

21
Naïve Bayesian Classifiers Visual Intuition I
[Figure: class-conditional height histograms, with
heights such as 4 foot 8, 5 foot 8, and 6 foot 6]
22
Naïve Bayesian Classifiers Visual Intuition II
p(cj | d) = probability of instance d being in
class cj
P(male | 5 foot 8) = 10 / (10 + 2) = 0.833
P(female | 5 foot 8) = 2 / (10 + 2) = 0.166
[Figure: histogram counts of 10 males and 2 females
at height 5 foot 8]
23
Clustering
  • Clustering: intuitively, finding clusters of
    points in the given data such that similar points
    lie in the same cluster
  • Can be formalized using distance metrics in
    several ways
  • E.g. group points into k sets (for a given k)
    such that the average distance of points from the
    centroid of their assigned group is minimized
    (sketched in code after this list)
  • Centroid: point defined by taking the average of
    coordinates in each dimension.
  • Another metric: minimize the average distance between
    every pair of points in a cluster
  • Has been studied extensively in statistics, but
    on small data sets
  • Data mining systems aim at clustering techniques
    that can handle very large data sets
  • E.g. the Birch clustering algorithm (more shortly)
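The sketch referred to above. The slides do not name an algorithm for the "minimize distance to the assigned centroid" formulation; the standard choice is k-means, shown here in simplified form (fixed iteration count, random initial centroids):

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Alternately assign each point to its nearest centroid and recompute
    each centroid as the coordinate-wise average of its assigned points."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        centroids = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```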

24
What is Clustering?
Also called unsupervised learning; sometimes
called "classification" by statisticians,
"sorting" by psychologists, and "segmentation" by
people in marketing
  • Organizing data into classes such that there is
  • high intra-class similarity
  • low inter-class similarity
  • Finding the class labels and the number of
    classes directly from the data (in contrast to
    classification).
  • More informally, finding natural groupings among
    objects (e.g. east coast cities, west coast
    cities)

25
What is a natural grouping among these objects?
26
What is a natural grouping among these objects?
Clustering is subjective
[Figure: the same characters grouped two ways: as
School Employees vs. Simpson's Family, or as Males
vs. Females]
27
What is Similarity?
"The quality or state of being similar; likeness;
resemblance; as, a similarity of features."
(Webster's Dictionary)
Similarity is hard to define, but "we know it
when we see it." The real meaning of similarity
is a philosophical question. We will take a more
pragmatic approach.
28
Similarity Measures
For the moment assume that we can measure the
similarity between any two objects. (we will
cover this in detail later).
One intuitive example is to measure the distance
between two cities and call it the similarity.
For example we have D(LA, San Diego) = 110, and
D(LA, New York) = 3,000.
This would allow us to make (subjectively
correct) statements like "LA is more similar to
San Francisco than it is to New York."
[Figure: map of US cities (MACSTEEL USA LOCATIONS)]

29
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the
universe of possible objects. The distance
(dissimilarity) between O1 and O2 is a real
number denoted by D(O1,O2)
[Figure: a black-box distance function maps pairs
such as (Peter, Piotr) to numbers such as 0.23, 3,
or 342.7]
30
When we peek inside one of these black boxes, we
see some function on two variables. These
functions might be very simple or very complex. In
either case it is natural to ask, what properties
should these functions have?
d('', '') = 0
d(s, '') = d('', s) = |s|   -- i.e. length of s
d(s1+ch1, s2+ch2) = min( d(s1, s2) + (if ch1 = ch2
                           then 0 else 1),
                         d(s1+ch1, s2) + 1,
                         d(s1, s2+ch2) + 1 )
  • What properties should a distance measure have?
  • D(A,B) = D(B,A)               Symmetry
  • D(A,A) = 0                    Constancy of Self-Similarity
  • D(A,B) = 0 iff A = B          Positivity (Separation)
  • D(A,B) ≤ D(A,C) + D(B,C)      Triangular Inequality

31
Intuitions behind desirable distance measure
properties
D(A,B) = D(B,A)  (Symmetry)
Otherwise you could claim "Alex looks like Bob, but
Bob looks nothing like Alex."
D(A,A) = 0  (Constancy of Self-Similarity)
Otherwise you could claim "Alex looks more like Bob
than Bob does."
D(A,B) = 0 iff A = B  (Positivity / Separation)
Otherwise there are objects in your world that are
different, but you cannot tell apart.
D(A,B) ≤ D(A,C) + D(B,C)  (Triangular Inequality)
Otherwise you could claim "Alex is very like Bob, and
Alex is very like Carl, but Bob is very unlike Carl."
32
Two Types of Clustering
  • Partitional algorithms Construct various
    partitions and then evaluate them by some
    criterion (we will see an example called BIRCH)
  • Hierarchical algorithms Create a hierarchical
    decomposition of the set of objects using some
    criterion

[Figure: an example of a partitional clustering and a
hierarchical clustering (dendrogram)]
33
A Useful Tool for Summarizing Similarity
Measurements
In order to better appreciate and evaluate the
examples given in the early part of this talk, we
will now introduce the dendrogram.
The similarity between two objects in a
dendrogram is represented as the height of the
lowest internal node they share.
34
Note that hierarchies are commonly used to
organize information, for example in a web
portal. Yahoo's hierarchy is manually created;
we will focus on the automatic creation of
hierarchies in data mining.
[Figure: a fragment of Yahoo's directory hierarchy:
Business & Economy; below it B2B, Finance, Shopping,
Jobs; below those Aerospace, Agriculture, Banking,
Bonds, Animals, Apparel, Career Workspace]
35
A dendrogram written in Newick tree format:
(Bovine:0.69395,(Gibbon:0.36079,(Orangutan:0.33636,
(Gorilla:0.17147,(Chimp:0.19268,Human:0.11927):0.08386)
:0.06124):0.15057):0.54939)
36
Desirable Properties of a Clustering Algorithm
  • Scalability (in terms of both time and space)
  • Ability to deal with different data types
  • Minimal requirements for domain knowledge to
    determine input parameters
  • Able to deal with noise and outliers
  • Insensitive to order of input records
  • Incorporation of user-specified constraints
  • Interpretability and usability

37
Hierarchical Clustering
Since we cannot test all possible trees, we will
have to do a heuristic search over the space of
possible trees. We could do this:
Bottom-Up (agglomerative): Starting with each item
in its own cluster, find the best pair to merge
into a new cluster. Repeat until all clusters are
fused together.
Top-Down (divisive): Starting with all the data in
a single cluster, consider every possible way to
divide the cluster into two. Choose the best
division and recursively operate on both sides.
  • The number of dendrograms with n leafs =
    (2n - 3)! / (2^(n - 2) (n - 2)!)
  • Number of Leafs    Number of Possible Dendrograms
  • 2                  1
  • 3                  3
  • 4                  15
  • 5                  105
  • ...
  • 10                 34,459,425
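A quick check of the formula above in code:

```python
from math import factorial

def num_dendrograms(n):
    """(2n - 3)! / (2^(n - 2) * (n - 2)!) rooted binary trees on n labelled leaves."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

print([num_dendrograms(n) for n in range(2, 6)])  # [1, 3, 15, 105]
print(num_dendrograms(10))                        # 34459425
```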

38
We begin with a distance matrix which contains
the distances between every pair of objects in
our database.
D( , ) = 8    D( , ) = 1
39
Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best

40
Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

41
Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

42
Bottom-Up (agglomerative) Starting with each
item in its own cluster, find the best pair to
merge into a new cluster. Repeat until all
clusters are fused together.
Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

Consider all possible merges
Choose the best

43
We know how to measure the distance between two
objects, but defining the distance between an
object and a cluster, or defining the distance
between two clusters, is not obvious.
  • Single linkage (nearest neighbor) In this
    method the distance between two clusters is
    determined by the distance of the two closest
    objects (nearest neighbors) in the different
    clusters.
  • Complete linkage (furthest neighbor) In this
    method, the distances between clusters are
    determined by the greatest distance between any
    two objects in the different clusters (i.e., by
    the "furthest neighbors").
  • Group average In this method, the distance
    between two clusters is calculated as the average
    distance between all pairs of objects in the two
    different clusters.
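A compact sketch of bottom-up (agglomerative) clustering over a precomputed distance matrix; single linkage is shown, and the other linkages above differ only in how the cluster-to-cluster distance is computed (the toy matrix is invented for illustration):

```python
def single_linkage(c1, c2, dist):
    """Distance between clusters = distance between their two closest members."""
    return min(dist[a][b] for a in c1 for b in c2)

def agglomerate(n_items, dist, linkage=single_linkage):
    """Start with every item in its own cluster and repeatedly merge the
    closest pair of clusters until only one cluster remains."""
    clusters = [[i] for i in range(n_items)]
    merges = []
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], dist))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

dist = [[0, 1, 6, 8],   # toy symmetric distance matrix over 4 objects
        [1, 0, 5, 7],
        [6, 5, 0, 2],
        [8, 7, 2, 0]]
print(agglomerate(4, dist))  # merges (0,1), then (2,3), then the two clusters
```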

44
  • Summary of Hierarchical Clustering Methods
  • No need to specify the number of clusters in
    advance.
  • Hierarchical nature maps nicely onto human
    intuition for some domains
  • They do not scale well: time complexity of at
    least O(n²), where n is the number of total
    objects.
  • Like any heuristic search algorithm, local
    optima are a problem.
  • Interpretation of results is subjective.

45
Partitional Clustering Algorithms
  • Clustering algorithms have been designed to
    handle very large datasets
  • E.g. the Birch algorithm
  • Main idea: use an in-memory R-tree to store
    points that are being clustered
  • Insert points one at a time into the R-tree,
    merging a new point with an existing cluster if
    it is less than some threshold distance δ away
  • If there are more leaf nodes than fit in memory,
    merge existing clusters that are close to each
    other
  • At the end of first pass we get a large number of
    clusters at the leaves of the R-tree
  • Merge clusters to reduce the number of clusters
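BIRCH itself maintains clustering features in a CF/R-tree; the following is only a much-simplified sketch of the one-pass, threshold-based insertion idea described above (2-D points, Euclidean distance, and the threshold value are assumptions):

```python
def one_pass_cluster(points, threshold):
    """Insert points one at a time: absorb each point into the nearest
    existing cluster if its centroid is within `threshold`, otherwise start
    a new cluster.  (BIRCH also merges clusters when memory runs out.)"""
    clusters = []  # each cluster stored as [sum_x, sum_y, count]
    for (x, y) in points:
        best, best_d = None, threshold
        for c in clusters:
            cx, cy = c[0] / c[2], c[1] / c[2]
            d = ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
            if d <= best_d:
                best, best_d = c, d
        if best is None:
            clusters.append([x, y, 1])
        else:
            best[0] += x; best[1] += y; best[2] += 1
    return [(c[0] / c[2], c[1] / c[2], c[2]) for c in clusters]

print(one_pass_cluster([(1, 1), (1.2, 0.9), (10, 10), (10.1, 9.8)], threshold=2.0))
```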

46
Partitional Clustering Algorithms
  • The Birch algorithm

We need to specify the number of clusters in
advance; I have chosen 2.
[Figure: an R-tree whose root entries R10, R11, R12
point to internal entries R1-R9, which in turn point
to the data nodes containing the points]
47
Partitional Clustering Algorithms
  • The Birch algorithm

[Figure: the same R-tree after merging the clusters
under R1 and R2; data nodes contain the points]
48
Partitional Clustering Algorithms
  • The Birch algorithm

[Figure: the final clusters corresponding to the
regions R10, R11, and R12]
49
Up to this point we have simply assumed that we
can measure similarity, but how do we measure
similarity?
50
A generic technique for measuring similarity
To measure the similarity between two objects,
transform one of the objects into the other, and
measure how much effort it took. The measure of
effort becomes the distance measure.
The distance between Patty and Selma:
Change dress color, 1 point
Change earring shape, 1 point
Change hair part, 1 point
D(Patty, Selma) = 3
The distance between Marge and Selma:
Change dress color, 1 point
Add earrings, 1 point
Decrease height, 1 point
Take up smoking, 1 point
Lose weight, 1 point
D(Marge, Selma) = 5
This is called the edit distance or the
transformation distance
51
Edit Distance Example
How similar are the names "Peter" and "Piotr"?
Assume the following cost function:
Substitution = 1 unit
Insertion = 1 unit
Deletion = 1 unit
D(Peter, Piotr) is 3
It is possible to transform any string Q into
string C, using only Substitution, Insertion and
Deletion. Assume that each of these operators has
a cost associated with it. The similarity
between two strings can be defined as the cost of
the cheapest transformation from Q to C. Note
that for now we have ignored the issue of how we
can find this cheapest transformation.
Peter → Piter → Pioter → Piotr
(Substitution of i for e, Insertion of o,
Deletion of e)
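The cheapest transformation can be computed with the classic dynamic-programming form of the recurrence shown two slides back; a sketch:

```python
def edit_distance(q, c):
    """Unit-cost edit distance: 1 per substitution, insertion, or deletion."""
    m, n = len(q), len(c)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete everything
    for j in range(n + 1):
        d[0][j] = j                       # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j - 1] + (0 if q[i - 1] == c[j - 1] else 1),
                          d[i - 1][j] + 1,     # deletion
                          d[i][j - 1] + 1)     # insertion
    return d[m][n]

print(edit_distance("Peter", "Piotr"))  # 3
```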
52
Association Rules (market basket analysis)
  • Retail shops are often interested in associations
    between different items that people buy.
  • Someone who buys bread is quite likely also to
    buy milk
  • A person who bought the book Database System
    Concepts is quite likely also to buy the book
    Operating System Concepts.
  • Association information can be used in several
    ways.
  • E.g. when a customer buys a particular book, an
    online shop may suggest associated books.
  • Association rules
  • bread ⇒ milk;  {DB-Concepts, OS-Concepts} ⇒
    Networks
  • Left hand side = antecedent, right hand side =
    consequent
  • An association rule must have an associated
    population: the population consists of a set of
    instances
  • E.g. each transaction (sale) at a shop is an
    instance, and the set of all transactions is the
    population

53
Association Rule Definitions
  • Set of items: I = {I1, I2, ..., Im}
  • Transactions: D = {t1, t2, ..., tn}, tj ⊆ I
  • Itemset: {Ii1, Ii2, ..., Iik} ⊆ I
  • Support of an itemset: percentage of transactions
    which contain that itemset.
  • Large (frequent) itemset: itemset whose number of
    occurrences is above a threshold.

54
Association Rules Example
I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%
55
Association Rule Definitions
  • Association Rule (AR): an implication X ⇒ Y where
    X, Y ⊆ I and X ∩ Y = ∅ (the null set)
  • Support of AR (s) X ⇒ Y: percentage of
    transactions that contain X ∪ Y
  • Confidence of AR (α) X ⇒ Y: ratio of the number of
    transactions that contain X ∪ Y to the number
    that contain X
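A small sketch of these two definitions over transactions represented as Python sets. The slide's actual five-transaction table is not reproduced in this transcript, so the transactions below are hypothetical, chosen only to reproduce the 60% / 75% figures quoted on the following slides:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence of lhs => rhs: support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

transactions = [{"Bread", "Jelly", "PeanutButter"},   # hypothetical data
                {"Bread", "PeanutButter"},
                {"Bread", "Milk", "PeanutButter"},
                {"Beer", "Bread"},
                {"Beer", "Milk"}]
print(support({"Bread", "PeanutButter"}, transactions))       # 0.6
print(confidence({"Bread"}, {"PeanutButter"}, transactions))  # 0.75
```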

56
Association Rules Example (cont'd)
57
Association Rules Example (cont'd)
  • Of 5 transactions, 3 involve both Bread and
    PeanutButter: 3/5 = 60%
  • Of the 4 transactions that involve Bread, 3 of
    them also involve PeanutButter: 3/4 = 75%

58
Association Rule Problem
  • Given a set of items I = {I1, I2, ..., Im} and a
    database of transactions D = {t1, t2, ..., tn} where
    ti = {Ii1, Ii2, ..., Iik} and Iij ∈ I, the Association
    Rule Problem is to identify all association rules
    X ⇒ Y with a minimum support and confidence
    (supplied by the user).
  • NOTE: Support of X ⇒ Y is the same as support of
    X ∪ Y.

59
Association Rule Algorithm (Basic Idea)
  • Find Large Itemsets.
  • Generate rules from frequent itemsets.

This is the simple naïve algorithm; better
algorithms exist.
60
Association Rule Algorithm
  • We are generally only interested in association
    rules with reasonably high support (e.g. support
    of 2% or greater)
  • Naïve algorithm
  • Consider all possible sets of relevant items.
  • For each set find its support (i.e. count how
    many transactions purchase all items in the
    set).
  • Large itemsets: sets with sufficiently high
    support
  • Use large itemsets to generate association rules.
  • From itemset A generate the rule A - b ⇒ b for
    each b ∈ A.
  • Support of rule = support(A).
  • Confidence of rule = support(A) / support(A - b)

61
  • From itemset A generate the rule A - b ⇒ b for
    each b ∈ A.
  • Support of rule = support(A).
  • Confidence of rule = support(A) / support(A - b)

Let's say itemset A = {Bread, Butter, Milk}. Then
A - b ⇒ b for each b ∈ A includes 3 possibilities:
{Bread, Butter} ⇒ Milk
{Bread, Milk} ⇒ Butter
{Butter, Milk} ⇒ Bread
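A one-line sketch of this rule-generation step:

```python
def rules_from_itemset(itemset):
    """Generate (A - {b}) => {b} for each b in the frequent itemset A."""
    return [(itemset - {b}, {b}) for b in itemset]

for lhs, rhs in rules_from_itemset({"Bread", "Butter", "Milk"}):
    print(sorted(lhs), "=>", sorted(rhs))
# ['Bread', 'Butter'] => ['Milk'], ['Bread', 'Milk'] => ['Butter'],
# ['Butter', 'Milk'] => ['Bread']
```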
62
Apriori
  • Large Itemset Property
  • Any subset of a large itemset is large.
  • Contrapositive
  • If an itemset is not large, none of its
    supersets are large.

63
Large Itemset Property
64
Large Itemset Property
If {B} is not frequent, then none of the supersets
of {B} can be frequent. If {A,C,D} is frequent,
then all subsets of {A,C,D} (e.g. {A,C}, {A,D},
{C,D}) must be frequent.
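A sketch of how the large-itemset property is used in practice: level-wise (Apriori-style) candidate generation, pruning any candidate with an infrequent subset before its support is counted. The transactions are the same hypothetical ones used earlier:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining using the large-itemset property:
    a size-k candidate is counted only if all its (k-1)-subsets were frequent."""
    items = {i for t in transactions for i in t}
    def sup(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)
    frequent = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_support]
    all_frequent, k = list(frequent), 2
    while frequent:
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        candidates = {c for c in candidates                      # prune
                      if all(frozenset(s) in set(frequent) for s in combinations(c, k - 1))}
        frequent = [c for c in candidates if sup(c) >= min_support]
        all_frequent += frequent
        k += 1
    return all_frequent

transactions = [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"},
                {"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"}]
print(apriori(transactions, min_support=0.6))
# frequent: {Bread}, {PeanutButter}, {Bread, PeanutButter}
```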