Title: Data Mining Overview
1 Data Mining (Overview)
2 Presentation overview
- Introduction
- Association Rules
- Classification
- Clustering
- Similar Time Sequences
- Similar Images
- Outliers
- WWW
- Summary
3 Background
- Corporations have huge databases containing a wealth of information
- Business databases potentially constitute a goldmine of valuable business information
- Very little functionality in database systems to support data mining applications
- Data mining: the efficient discovery of previously unknown patterns in large databases
4 Applications
- Fraud Detection
- Loan and Credit Approval
- Market Basket Analysis
- Customer Segmentation
- Financial Applications
- E-Commerce
- Decision Support
- Web Search
5 Data Mining Techniques
- Association Rules
- Sequential Patterns
- Classification
- Clustering
- Similar Time Sequences
- Similar Images
- Outlier Discovery
- Text/Web Mining
6 Examples of Discovered Patterns
- Association rules
- 98% of people who purchase diapers also buy beer
- Classification
- People with age less than 25 and salary > 40k drive sports cars
- Similar time sequences
- Stocks of companies A and B perform similarly
- Outlier Detection
- Residential customers for a telecom company with businesses at home
7 Association Rules
- Given
- A database of customer transactions
- Each transaction is a set of items
- Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
- Example: 98% of people who purchase diapers and baby food also buy beer
- Any number of items may appear in the consequent/antecedent of a rule
- Possible to specify constraints on rules (e.g., find only rules involving expensive imported products)
8 Association Rules
- Sample Applications
- Market basket analysis
- Attached mailing in direct marketing
- Fraud detection for medical insurance
- Department store floor/shelf planning
9 Confidence and Support
- A rule must have some minimum user-specified confidence
- {1, 2} => 3 has 90% confidence if, when a customer bought 1 and 2, in 90% of cases the customer also bought 3
- A rule must have some minimum user-specified support (how frequently the rule occurs)
- {1, 2} => 3 should hold in some minimum percentage of transactions to have business value
10 Example
- For minimum support 50% and minimum confidence 50%, we have the following rules (a sketch reproducing these numbers follows this list)
- 1 => 3 with 50% support and 66% confidence
- ({1, 3} appeared in 50% of transactions, but whenever 1 appeared, 3 appeared in only 2/3 of cases)
- 3 => 1 with 50% support and 100% confidence
- ({1, 3} appeared in 50% of transactions, and whenever 3 appeared, 1 appeared too)
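A minimal sketch in Python of how these numbers arise. The four transactions below are assumed for illustration (the slide does not list them); they are chosen so the support and confidence values match the ones above.

```python
# Hypothetical transaction database (assumed; chosen to reproduce the
# slide's numbers): support({1,3}) = 50%, conf(1=>3) = 66%, conf(3=>1) = 100%.
transactions = [{1, 3}, {1, 3}, {1}, {2}]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({1, 3}, transactions))       # 0.5   -> 50% support
print(confidence({1}, {3}, transactions))  # 0.667 -> 66% confidence for 1 => 3
print(confidence({3}, {1}, transactions))  # 1.0   -> 100% confidence for 3 => 1
```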
11 Quantitative Association Rules
- Quantitative attributes (e.g., age, income)
- Categorical attributes (e.g., make of car)
- Example: Age in [30..39] and Married = Yes => NumCars = 2
- (min support 40%, min confidence 50%)
12 Temporal Association Rules
- Can describe the rich temporal character in data
- Example
- diaper => beer (support 5%, confidence 87%)
- Support of this rule may jump to 25% between 6 and 9 PM on weekdays
- Problem: how to find rules that follow interesting user-defined temporal patterns
- The challenge is to design efficient algorithms that do much better than finding every rule in every time unit
13 Correlation Rules
- Association rules do not capture correlations
- Example
- Suppose 90% of customers buy coffee, 25% buy tea, and 20% buy both tea and coffee
- tea => coffee has high support (0.2) and confidence (0.8)
- Yet tea and coffee are not (positively) correlated
- The expected support of customers buying both is 0.9 × 0.25 = 0.225, which exceeds the observed 0.2 (see the sketch below)
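A quick check of the arithmetic, using the probabilities quoted on the slide; the lift ratio (observed over expected support) is one common way to quantify the correlation, though the slide itself does not name it.

```python
# Numbers from the slide: P(coffee) = 0.9, P(tea) = 0.25, P(both) = 0.2.
p_coffee, p_tea, p_both = 0.90, 0.25, 0.20

confidence = p_both / p_tea    # 0.8   -> the rule tea => coffee looks strong
expected = p_coffee * p_tea    # 0.225 -> support expected if independent
lift = p_both / expected       # ~0.89 -> below 1: slightly negatively correlated

print(confidence, expected, round(lift, 3))
```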
14 Sequential Patterns
- Given
- A sequence of customer transactions
- Each transaction is a set of items
- Find all maximal sequential patterns supported by more than a user-specified percentage of customers
- Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction
- 10% is the support of the pattern (a sketch of the support check follows)
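A minimal sketch of what "supporting" a sequential pattern means: the pattern's itemsets must occur, in order, across a customer's transactions. The customer histories below are made up for illustration.

```python
def supports(sequence, pattern):
    """True if `pattern` (a list of itemsets) occurs in order within
    `sequence` (one customer's transactions, oldest first)."""
    i = 0
    for transaction in sequence:
        if i < len(pattern) and pattern[i] <= transaction:
            i += 1
    return i == len(pattern)

# Hypothetical customer histories (assumed data).
customers = [
    [{"pc"}, {"mouse"}, {"memory"}],    # PC, then later a memory upgrade
    [{"pc", "monitor"}, {"printer"}],   # PC but no upgrade
    [{"memory"}, {"pc"}],               # wrong order: no support
]
pattern = [{"pc"}, {"memory"}]
print(sum(supports(s, pattern) for s in customers) / len(customers))  # 0.333...
```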
15 Classification
- Given
- Database of tuples, each assigned a class label
- Develop a model/profile for each class
- Example profile (good credit): (25 < age < 40 and income > 40k) or (married = YES)
- Sample applications
- Credit card approval (good, bad)
- Bank locations (good, fair, poor)
- Treatment effectiveness (good, fair, poor)
16 Decision Trees
A churn decision tree (figure):
Root: 50 Churners / 50 Non-Churners
- Old technology phone: 20 Churners / 0 Non-Churners
- New technology phone: 30 Churners / 50 Non-Churners
  - Customer > 2.3 years: 5 Churners / 40 Non-Churners
  - Customer < 2.3 years: 25 Churners / 10 Non-Churners
    - Age < 55: 20 Churners / 0 Non-Churners
    - Age > 55: 5 Churners / 10 Non-Churners
A decision tree is a predictive model that makes
a prediction on the basis of a series of decisions
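A minimal sketch of fitting such a tree in Python with scikit-learn. The feature names and the synthetic data are assumptions made to mirror the figure's predictors, not the slide's actual data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(0, 2, n),       # new_phone: 1 = new technology phone
    rng.uniform(0, 10, n),       # tenure_years: time as a customer
    rng.uniform(18, 80, n),      # age
])
# Synthetic churn label loosely following the figure: customers with a
# new technology phone and short tenure tend to churn.
y = ((X[:, 0] == 1) & (X[:, 1] < 2.3)).astype(int)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["new_phone", "tenure_years", "age"]))
```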
17 Decision Trees
DTs create a segmentation of the original data set; this segmentation is done to predict some piece of information. The records that fall in each segment are similar with respect to the information being predicted. The DT and the algorithms may be complex, but the results are presented in an easy-to-understand way, quite useful to the business user.
18 Decision Trees
- DT in business
- Automation: a very favorable technique for automating data mining and predictive modeling. DTs embed automated solutions to things that other techniques leave as a burden to the user (4/4)
- Clarity: the models are viewed as a tree of simple decisions based on familiar predictors, or as a set of rules. The user can confirm the DT, or modify it by hand on the basis of his own expertise (4/4)
- ROI: because DTs work well with relational databases, they provide well-integrated solutions with highly accurate models (3/4)
19 Decision Trees
- Pros
- Fast execution time
- Generated rules are easy to interpret by humans
- Scale well for large data sets
- Can handle high dimensional data
- Cons
- Cannot capture correlations among attributes
- Consider only axis-parallel cuts
20 Clustering
- Given
- Data points and number of desired clusters K
- Group the data points into K clusters
- Data points within clusters are more similar than across clusters (a clustering sketch follows this list)
- Sample applications
- Customer segmentation
- Market basket customer analysis
- Attached mailing in direct marketing
- Clustering companies with similar growth
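A minimal sketch of grouping points into K clusters with K-means (one common partitional method; the slide does not fix an algorithm). The two synthetic customer segments are assumed for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Made-up (age, income) points forming two loose customer segments.
points = np.vstack([
    rng.normal([30, 30_000], [3, 3_000], size=(50, 2)),
    rng.normal([55, 90_000], [4, 8_000], size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10).fit(points)
print(kmeans.cluster_centers_)  # one centroid per discovered segment
print(kmeans.labels_[:10])      # cluster assignment for the first points
```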
21 Where to use clustering and nearest-neighbor prediction
- Clustering for clarity
- A high-level view
- Segmentation
- Clustering for outlier analysis
- To see records that stick out of the rest
- e.g., wine distributors produce a certain level of profit, but one store produces significantly lower profit. It turns out that the distributor was delivering to, but not collecting payment from, one of its customers
- Nearest neighbor for prediction
- Objects near to each other have similar prediction values
- Examples: find more documents like a given one among journal articles; predict the next value of a stock price based on the time series
22 Outlier Discovery
- Sometimes clustering is performed to see when one record sticks out of the rest
- E.g., one store stands out as producing significantly lower profit. Closer examination shows that the distributor was not collecting payment from one of its customers
- E.g., a sale of men's suits is being held in all branches of a department store. All stores but one have seen at least a 100% jump in revenue. It turns out that store had advertised via radio rather than TV, as the other stores did
- Sample applications
- Credit card fraud detection
- Telecom fraud detection
- Customer segmentation
- Medical analysis
23 Outlier Discovery
- Given
- Data points and the number of outliers (n) to find
- Find the top n outlier points
- Outliers are considerably dissimilar from the remainder of the data
24 Statistical Approaches
- Model the underlying distribution that generates the dataset (e.g., a normal distribution)
- Use discordancy tests, which depend on
- the data distribution
- the distribution parameters (e.g., mean, variance)
- the number of expected outliers
- Drawbacks
- most tests are for a single attribute
- in many cases, the data distribution may not be known
25 Differences between the nearest-neighbor technique and clustering

Nearest neighbors
- Used for prediction and consolidation
- Space is defined by the problem to be solved
- Generally only uses distance metrics to determine nearness

Clustering
- Used for consolidating data into a high-level view and for general grouping of records into like behaviors
- Space is defined as a default n-dimensional space, or by the user, or is a predefined space driven by past experience
- Can use metrics other than distance to determine the nearness of two records, e.g., linking points together
26 How clustering and nearest-neighbor work
- Looking at n-dimensional space
- The distance between a cluster and a given data point is often measured from the center of mass of the cluster
- The center can be calculated
- by simply averaging, e.g., the income and age of each record in the cluster
- by the square-error criterion
- by other methods
- Many clustering problems have hundreds of dimensions; our intuition works only in 2- or 3-dimensional space
[Figure: customers of a golf equipment business plotted in three clusters, with some outliers. Cluster 1: retirees with modest income. Cluster 2: middle-aged weekend golfers. Cluster 3: wealthy youth with exclusive club memberships.]
27 Traditional Algorithms
- Partitional algorithms
- Enumerate K partitions optimizing some criterion
- Example: the square-error criterion E = Σ_{i=1..K} Σ_{p ∈ Ci} ||p − mi||², where mi is the mean of cluster Ci
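A direct transcription of that criterion into Python, under the assumption (standard for this criterion) that distances are Euclidean; the two toy clusters are made up.

```python
import numpy as np

def square_error(clusters):
    """Sum over clusters of squared Euclidean distances from each
    point to its cluster mean m_i (the square-error criterion E)."""
    total = 0.0
    for points in clusters:            # points: (n_points, n_dims) array for C_i
        m_i = points.mean(axis=0)      # mean of cluster C_i
        total += ((points - m_i) ** 2).sum()
    return total

clusters = [np.array([[1.0, 2.0], [3.0, 2.0]]), np.array([[10.0, 10.0]])]
print(square_error(clusters))  # 2.0: the first cluster contributes 1 + 1
```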
28 How is nearness defined?
ID  Name   Prediction  Age  Balance($)  Income  Eyes  Gender
1   Carla  Yes         21   2,300       High    Blue  F
2   Sue    ??          21   2,300       High    Blue  F

A record exactly the same as the record to be predicted is considered close. However, it is unlikely to find exact matches.
- The Manhattan Distance metric adds up the differences in each predictor between the historical record and the record to be predicted
- The Euclidean Distance metric calculates distance the Pythagorean way (the square of the hypotenuse is equal to the sum of the squares of the other two sides)
- Others
29 The Manhattan Distance metric (an example)

Calculating the difference between ages (6 years) and balances ($3,100) is simple. For the eye-color predictor, use e.g. match = 0, mismatch = 1. For income, assign numbers: high = 3, medium = 2, low = 1. The distance is then
3108 = 6 + 3100 + 0 + 1 + 1
The result must be normalized per dimension (e.g., to 0-100):
225 = 6 + 19 + 0 + 100 + 100
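The same arithmetic in Python. Which predictor each of the last two terms belongs to is an assumption (the slide only lists the per-dimension differences), as is the exact 0-100 normalization of each term.

```python
# Per-dimension differences from the slide (raw scale).
age_diff     = 6      # years
balance_diff = 3100   # dollars
eyes_diff    = 0      # match = 0, mismatch = 1
income_diff  = 1      # high=3, medium=2, low=1; adjacent levels differ by 1
other_diff   = 1      # assumed: one more mismatching categorical predictor

print(age_diff + balance_diff + eyes_diff + income_diff + other_diff)  # 3108

# After normalizing each dimension to a 0-100 scale, as on the slide
# (mismatches map to 100, the balance difference maps to 19):
print(6 + 19 + 0 + 100 + 100)  # 225
```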
30 Calculating dimension weights
- Different dimensions may have different weights
- e.g., in text classification not all words (dimensions) are created equal: "entrepreneur" is significant, "the" is not
- Two methods (see the sketch after this list)
- The inverse frequency of the word is used: "the" gets 1/10,000, "entrepreneur" gets 1/10
- The importance of the word to the topic to be predicted: "entrepreneur" and "venture capital" will be given higher weight than "tornado" when the topic is starting a small business
- Dimension weights have also been calculated via adaptive algorithms, where random weights are tried initially and then slowly modified to improve the accuracy of the system (neural networks, genetic algorithms)
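A sketch of the first method, inverse-frequency weighting, feeding into a weighted Manhattan distance between documents; the word counts are the ones quoted above, and the count-dictionary representation is an assumption.

```python
# Inverse-frequency weights: rare words count more, frequent words less.
doc_frequency = {"the": 10_000, "entrepreneur": 10}   # frequencies from the slide
weights = {word: 1 / freq for word, freq in doc_frequency.items()}
print(weights)  # {'the': 0.0001, 'entrepreneur': 0.1}

def weighted_manhattan(a, b, weights):
    """Distance between two documents given as word-count dicts."""
    words = set(a) | set(b)
    return sum(weights.get(w, 0) * abs(a.get(w, 0) - b.get(w, 0)) for w in words)

doc1 = {"the": 12, "entrepreneur": 3}
doc2 = {"the": 40, "entrepreneur": 0}
print(weighted_manhattan(doc1, doc2, weights))  # dominated by "entrepreneur"
```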
31 Hierarchy of Clusters
- The hierarchy of clusters is viewed as a tree in which the smallest clusters merge to create the next highest level of clusters
- The agglomerative technique starts with as many clusters as there are records. The clusters that are nearest each other are merged to form the next largest cluster, and this merging continues until a hierarchy of clusters is built
- The divisive technique takes the opposite approach. It starts with all the records in one cluster, then tries to split that cluster into smaller pieces, and so on
- The hierarchy allows the end user to choose the level to work with

[Figure: the hierarchy ranges from one large single cluster at the top down to the smallest clusters]
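A minimal agglomerative sketch using scipy, which repeatedly merges the two nearest clusters exactly as described above; the one-dimensional records are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

points = np.array([[1.0], [1.2], [5.0], [5.1], [9.0]])  # made-up records

# `linkage` builds the full merge hierarchy, starting from one cluster
# per record and merging the nearest pair at each step.
merges = linkage(points, method="single")

# The end user picks the level to work with, e.g. cut into 3 clusters:
print(fcluster(merges, t=3, criterion="maxclust"))  # e.g. [1 1 2 2 3]
```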
32 Similar Time Sequences
- Given
- A set of time-series sequences
- Find
- All sequences similar to the query sequence
- All pairs of similar sequences
- whole matching vs. subsequence matching
- Sample Applications
- Financial market
- Market basket data analysis
- Scientific databases
- Medical Diagnosis
33 Whole Sequence Matching
- Basic Idea
- Extract k features from every sequence
- Every sequence is then represented as a point in k-dimensional space
- Use a multi-dimensional index to store and search these points
- Spatial indices do not work well for high-dimensional data (a feature-extraction sketch follows)
34 Similar Time Sequences
- Sequences are normalized with amplitude scaling and offset translation (a sketch follows this list)
- Two subsequences are considered similar if one lies within an envelope of width ε around the other, ignoring outliers
- Two sequences are said to be similar if they have enough non-overlapping, time-ordered pairs of similar subsequences
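A sketch of the normalization step; z-normalization (subtract the mean, divide by the standard deviation) is one standard way to remove offset and amplitude, though the slide does not fix a formula.

```python
import numpy as np

def normalize(seq):
    """Remove offset (mean) and amplitude (standard deviation)."""
    seq = np.asarray(seq, dtype=float)
    return (seq - seq.mean()) / seq.std()

a = np.array([1.0, 2.0, 3.0, 2.0])
b = 10 * a + 5   # same shape, different offset and amplitude
print(np.allclose(normalize(a), normalize(b)))  # True
```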
35 Similar Sequences Found

[Figure: the VanEck International Fund and the Fidelity Selective Precious Metal and Mineral Fund, two similar mutual funds in different fund groups]
36 Similar Images
- Given
- A set of images
- Find
- All images similar to a given image
- All pairs of similar images
- Sample applications
- Medical diagnosis
- Weather prediction
- Web search engine for images
- E-commerce
37 Similar Images
- QBIC [Nib93, FSN95, JFS95], WBIIS [WWWFS98]
- Generate a single signature per image
- Fail when the images contain similar objects, but at different locations or of varying sizes
- [Smi97]
- Divides an image into individual objects
- Manual extraction can be very tedious and time-consuming
- Inaccurate in identifying objects and not robust
38 WALRUS
- Automatically extracts regions from an image based on the complexity of the image
- A single signature is computed for each region
- Two images are considered similar if they have enough similar region pairs
39 WALRUS
Similarity Model
40 WALRUS (Overview)

Image indexing phase:
- Compute wavelet signatures for sliding windows
- Cluster windows to generate regions
- Insert regions into a spatial index (R tree)

Image querying phase:
- Compute wavelet signatures for sliding windows
- Cluster windows to generate regions
- Find matching regions using the spatial index
- Compute similarity between the query image and target images
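A very rough sketch of the indexing phase under stated assumptions: the per-window signature here is just a mean intensity (WALRUS actually uses wavelet coefficients), and K-means stands in for its clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

def window_signatures(image, w=8, step=4):
    """Slide a w*w window over a grayscale image, one signature per window.
    The mean intensity is a stand-in for a wavelet signature."""
    sigs = []
    for y in range(0, image.shape[0] - w + 1, step):
        for x in range(0, image.shape[1] - w + 1, step):
            sigs.append(image[y:y+w, x:x+w].mean())
    return np.array(sigs).reshape(-1, 1)

image = np.random.default_rng(2).random((64, 64))   # made-up grayscale image
sigs = window_signatures(image)
regions = KMeans(n_clusters=4, n_init=10).fit_predict(sigs)  # windows -> regions
print(np.bincount(regions))  # number of windows in each extracted region
```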
41 WALRUS

[Figure: query image]
42 Web Mining Challenges
- Today's search engines are plagued by problems
- the abundance problem (99% of info is of no interest to 99% of people)
- limited coverage of the Web (internet sources hidden behind search interfaces)
- limited query interfaces based on keyword-oriented search
- limited customization to individual users
43 The Web is ...
- The web is a huge collection of documents
- Semistructured (HTML, XML)
- Hyper-link information
- Access and usage information
- Dynamic (i.e., new pages are constantly being generated)
44 Web Mining
- Web Content Mining
- Extract concept hierarchies/relations from the web
- Automatic categorization
- Web Log Mining
- Trend analysis (i.e., web dynamics info)
- Web access association/sequential pattern analysis
- Web Structure Mining
- Google: a page is important if important pages point to it (see the sketch below)
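A minimal power-iteration sketch of that idea (essentially PageRank); the four-page link graph and the damping factor of 0.85 are illustrative assumptions.

```python
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # page -> pages it points to
n, d = 4, 0.85                               # number of pages, damping factor

rank = np.full(n, 1 / n)
for _ in range(50):
    new = np.full(n, (1 - d) / n)
    for page, outs in links.items():
        for target in outs:
            new[target] += d * rank[page] / len(outs)  # importance flows along links
    rank = new

print(rank)  # page 2, pointed to by every other page, ranks highest
```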
45 Improving Search/Customization
- Learn about users' interests based on access patterns
- Provide users with pages, sites, and advertisements of interest
46 Summary
- Data mining
- Good science, with a leading position in the research community
- Recent progress for large databases: association rules, classification, clustering, similar time sequences, similar image retrieval, outlier discovery, etc.
- Many papers have been published in major conferences
- Still a promising and rich field with many challenging research issues
- Maturing in industry