Title: Data Mining
1. Data Mining
CIS 4262 Information Systems Design and Analysis
- Dr. Brahim Medjahed
- brahim_at_umich.edu
2. Why Mine Data? (1)
- Commercial Point of View
  - Lots of data is being collected
    - Web data, e-commerce
    - Purchases at department/grocery stores
    - Bank/credit card transactions
  - High competitive pressure
    - Provide better, customized services for an edge (e.g., in Customer Relationship Management)
3. Why Mine Data? (2)
- Scientific Point of View
  - Data is collected and stored at enormous speeds (TB/hour)
    - Remote sensors on a satellite
    - Telescopes scanning the skies
    - Microarrays generating gene expression data
    - Scientific simulations generating terabytes of data
  - Data mining may help scientists in
    - Classifying and segmenting data
    - Hypothesis formation
4. Mining Large Data Sets - Motivation
- There is often information hidden in the data that is not readily evident
- Human analysts may take weeks to discover useful information
- Much of the data is never analyzed at all
[Figure: "The Data Gap", comparing total new disk (TB) since 1995 with the number of analysts]
5. Applications of Data Mining
- Marketing
  - Identify likely responders to sales promotions
  - Analyze consumer behavior based on buying patterns
  - Determine marketing strategies, including advertising, store location, and targeted mailing
- Fraud detection
  - Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
- Customer relationship management
  - Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
- Banking: loan/credit card approval
  - Predict good customers based on old customers
- Healthcare
  - Analysis of the effectiveness of certain treatments
  - Relating patient wellness data to doctor qualifications
  - Analyzing side effects of drugs
- Etc.
6. What is Data Mining?
- Process of semi-automatically analyzing large databases to find patterns that are
  - valid: hold on new data with some certainty
  - novel: non-obvious to the system
  - useful: should be possible to act on the item
  - understandable: humans should be able to interpret the pattern
- Another definition
  - Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
7. Data Mining and Data Warehousing
Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence
8. Data Mining as Part of the Knowledge Discovery Process (1)
- Example
  - Transaction database maintained by a specialty consumer goods retailer
  - Client data includes
    - Customer name, zip code, phone number, date of purchase, item code, price, quantity, total amount
- Problem Formulation
  - Plan additional store locations based on demographics
  - Run store promotions
  - Combine items in advertisements
  - Plan seasonal marketing strategies
  - etc.
9. Data Mining as Part of the Knowledge Discovery Process (2)
- Data Selection
  - Data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected
- Data Cleansing
  - Correct invalid zip codes or eliminate records with incorrect phone prefixes
- Enrichment
  - Enhances the data with additional sources of information
  - The store may purchase data about age, income, and credit rating and append them to each record
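A minimal Python sketch of the cleansing and enrichment steps. The record fields, the set of valid phone prefixes, and the purchased demographic data are all hypothetical, and for simplicity invalid zip codes are dropped rather than corrected.

```python
import re

# Hypothetical customer records from the transaction database.
records = [
    {"name": "Ann", "zip": "48128", "phone": "313-555-0101", "amount": 40.0},
    {"name": "Bob", "zip": "4812", "phone": "313-555-0102", "amount": 15.0},   # invalid zip
    {"name": "Cy", "zip": "48104", "phone": "000-555-0103", "amount": 22.5},   # bad prefix
]

VALID_PREFIXES = {"313", "734"}                      # assumed list of valid phone prefixes
purchased = {"Ann": {"age": 34, "income": 52000}}    # assumed externally purchased data


def cleanse(rows):
    """Keep only records with a 5-digit zip code and a known phone prefix."""
    return [
        r for r in rows
        if re.fullmatch(r"\d{5}", r["zip"]) and r["phone"][:3] in VALID_PREFIXES
    ]


def enrich(rows, extra):
    """Append purchased attributes to each record for which they are available."""
    return [{**r, **extra.get(r["name"], {})} for r in rows]


cleaned = cleanse(records)
enriched = enrich(cleaned, purchased)
```

Only Ann's record survives cleansing here, and it gains the `age` and `income` fields after enrichment.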
10. Data Mining as Part of the Knowledge Discovery Process (3)
- Data Transformation and Encoding
  - Done to reduce the data
  - Zip codes may be aggregated into geographic regions, incomes may be divided into ten ranges
- Data Mining
  - Mine different rules and patterns
  - Whenever a customer buys video equipment, he/she also buys another electronic gadget
- Reporting and Display of the Discovered Information
  - Results may be reported in a variety of formats, such as listings, graphic outputs, summary tables, or visualizations
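A small Python sketch of the transformation and encoding step. The region mapping (first digit of the zip code) and the income bounds are assumptions chosen only for illustration.

```python
def zip_to_region(zip_code):
    """Aggregate zip codes into regions by their first digit (hypothetical mapping)."""
    return "region_" + zip_code[0]


def income_range(income, low=0, high=100_000, bins=10):
    """Map an income to one of ten equal-width ranges, numbered 0 to 9.

    Incomes at or below `low` fall in range 0; incomes at or above `high`
    fall in the last range.
    """
    if income <= low:
        return 0
    if income >= high:
        return bins - 1
    return int((income - low) * bins / (high - low))


region = zip_to_region("48128")
bucket = income_range(52_000)
```

After this step each record carries a coarse region and an income range instead of the raw zip code and income, which reduces the number of distinct values the mining step has to consider.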
11. Goals of Data Mining
- Prediction
- Identification
- Optimization
- Classification
12. Goals of Data Mining (1)
- Prediction
  - How certain attributes within the data will behave in the future
  - Predict what consumers will buy under certain discounts
  - Predict how much sales volume a store would generate in a given period
  - Predict whether deleting a product line would yield more profits
  - Predict an earthquake based on certain seismic wave patterns
13. Goals of Data Mining (2)
- Identification
  - Use data patterns to identify the existence of an item, an event, or an activity
  - Intruders trying to break into a system may be identified by the programs executed, files accessed, and CPU time per session
  - The existence of a gene may be identified by certain sequences of nucleotide symbols in the DNA sequence
- Optimization
  - Optimize the use of limited resources such as time, space, money, or materials, and maximize output variables such as sales or profits under certain constraints
14. Goals of Data Mining (3)
- Classification
  - Partition data so that different classes or categories can be identified based on combinations of parameters
  - Customers in a supermarket may be categorized into
    - Discount-seeking shoppers
    - Shoppers in a rush
    - Loyal regular shoppers
    - Infrequent shoppers
15. Data Mining Tasks
- Association Rule Discovery
- Classification
- Clustering
- Detection of Sequential Patterns
- Detection of Patterns within Time Series
- Etc.
16. Association Rules
- Given a set of records, each of which contains some number of items from a given collection
  - Produce dependency rules which will predict the occurrence of an item based on occurrences of other items
Rules Discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
17. Association Rules: Applications (1)
- Marketing and Sales Promotion
  - Let the rule discovered be
    {Bagels, ...} --> {Potato Chips}
  - Potato Chips as consequent => Can be used to determine what should be done to boost its sales
  - Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels
  - Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
18. Association Rules: Applications (2)
- Supermarket Shelf Management
  - Goal: To identify items that are bought together by sufficiently many customers
  - Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items
- Inventory Management
  - Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households
  - Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns
19. Prevalent Rules ≠ Interesting Rules
- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena
[Figure: the rule "Milk and cereal sell together!" appears twice, the first time dated 1995]
20. Classification
- Given old data about customers and payments, predict new applicants' loan eligibility
[Figure: data on previous customers (Age, Salary, Profession, Location, Customer type) is used to build a classifier; its decision rules (e.g., "Salary > 5 L", "Prof. = Exec") label new applicants' data as good/bad]
21. Classification: Definition
- Predictive task: Use some variables to predict unknown or future values of other variables
- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class
- Find a model for the class attribute as a function of the values of the other attributes
- Goal: previously unseen records should be assigned a class as accurately as possible
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
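The train/test procedure above can be sketched in Python as follows. The records are hypothetical, and the model is a deliberately simple single-threshold rule on salary, chosen only to keep the example short; the slides do not prescribe a particular classifier.

```python
# Hypothetical records: (age, salary in thousands, class label).
records = [
    (25, 30, "bad"), (32, 45, "bad"), (47, 90, "good"), (51, 120, "good"),
    (38, 80, "good"), (23, 20, "bad"), (44, 95, "good"), (29, 35, "bad"),
]


def split(data, test_every=4):
    """Hold out every `test_every`-th record as the test set."""
    train = [r for i, r in enumerate(data) if (i + 1) % test_every != 0]
    test = [r for i, r in enumerate(data) if (i + 1) % test_every == 0]
    return train, test


def learn_threshold(train):
    """Pick the salary threshold that classifies the most training records correctly."""
    best_t, best_correct = None, -1
    for t in sorted({salary for _, salary, _ in train}):
        correct = sum((salary >= t) == (label == "good") for _, salary, label in train)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def predict(threshold, record):
    """Assign a class to a record using only its salary attribute."""
    return "good" if record[1] >= threshold else "bad"


def accuracy(threshold, data):
    """Fraction of records whose predicted class matches the recorded class."""
    return sum(predict(threshold, r) == r[2] for r in data) / len(data)


train, test = split(records)
threshold = learn_threshold(train)        # model built on the training set only
test_accuracy = accuracy(threshold, test) # model validated on unseen records
```

The key design point is that `learn_threshold` never sees the test records, so `test_accuracy` estimates how the model behaves on previously unseen data.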
22. Classification: Example
[Figure: a training set with two categorical attributes, one continuous attribute, and a class attribute is used to learn a classifier]
23. Classification - Application (1)
- Direct Marketing
  - Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product
  - Approach:
    - Use the data for a similar product introduced before
    - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute
    - Collect various demographic, lifestyle, and company-interaction related information about all such customers
      - Type of business, where they stay, how much they earn, etc.
    - Use this information as input attributes to learn a classifier model
24. Classification - Application (2)
- Fraud Detection
  - Goal: Predict fraudulent cases in credit card transactions
  - Approach:
    - Use credit card transactions and the information on the account holder as attributes
      - When does a customer buy, what does he or she buy, how often does he or she pay on time, etc.
    - Label past transactions as fraud or fair transactions. This forms the class attribute
    - Learn a model for the class of the transactions
    - Use this model to detect fraud by observing credit card transactions on an account
25. Classification - Application (3)
- Customer Loyalty
  - Goal: To predict whether a customer is likely to be lost to a competitor
  - Approach:
    - Use detailed records of transactions with each of the past and present customers to find attributes
      - How often the customer calls, where he or she calls, what time of day he or she calls most, financial status, marital status, etc.
    - Label the customers as loyal or disloyal
    - Find a model for loyalty
26. Clustering
- Descriptive task: Find human-interpretable patterns that describe the data
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - Data points in one cluster are more similar to one another
  - Data points in separate clusters are less similar to one another
- Key requirement: Need a good measure of similarity between instances
- Similarity Measures
  - Euclidean distance if attributes are continuous
  - Other problem-specific measures
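A short Python sketch of clustering with Euclidean distance. The slides do not name a clustering algorithm; k-means is used here as one common choice, with hypothetical 2-D points and fixed starting centroids so the run is deterministic.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-centering."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical data: two visually separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
```

With these points the first three end up in one cluster and the last three in the other. Swapping `euclidean` for a problem-specific measure changes the notion of similarity without changing the rest of the procedure.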
27. Clustering - Application
- Document Clustering
  - Goal: To find groups of documents that are similar to each other based on the important terms appearing in them
  - Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster
  - Gain: Information retrieval can utilize the clusters to relate a new document or search term to clustered documents
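One way to form a similarity measure from term frequencies is cosine similarity, sketched below in Python. The choice of cosine similarity and the three sample documents are assumptions for illustration; the slides only require some frequency-based measure.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated term occurs in a document."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns in large data sets",
    "d3": "the stock market closed higher today",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}
sim_12 = cosine_similarity(tf["d1"], tf["d2"])
sim_13 = cosine_similarity(tf["d1"], tf["d3"])
```

Here `d1` and `d2` share several terms and score much higher than `d1` and `d3`, which share none, so a clustering step would group the first two together.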
28. Detection of Sequential Patterns
- A sequence of actions or events is sought
- Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events
29. Sequential Patterns - Examples
- In healthcare
  - If a patient underwent cardiac bypass surgery for blocked arteries and an aneurysm and later developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next 18 months
- In point-of-sale transaction sequences
  - Computer Bookstore
    - (Intro_To_Visual_C) (C_Primer) --> (Perl_for_dummies, Tcl_Tk)
  - Athletic Apparel Store
    - (Shoes) (Racket, Racketball) --> (Sports_Jacket)
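A minimal Python sketch of how the strength of a sequential pattern might be measured: the fraction of objects whose timeline contains the pattern's events in order. The customer timelines are hypothetical, and each event is simplified to a single item rather than a set of items.

```python
def contains_in_order(timeline, pattern):
    """True if the events in `pattern` occur in `timeline` in the same order (gaps allowed)."""
    events = iter(timeline)
    # Each `in` test consumes the iterator up to the match, so order is enforced.
    return all(step in events for step in pattern)


def sequence_support(timelines, pattern):
    """Fraction of timelines that contain the pattern as an ordered subsequence."""
    return sum(contains_in_order(t, pattern) for t in timelines) / len(timelines)


# Hypothetical purchase timelines, one per customer.
timelines = [
    ["Intro_To_Visual_C", "C_Primer", "Perl_for_dummies"],
    ["Intro_To_Visual_C", "Perl_for_dummies"],
    ["C_Primer", "Intro_To_Visual_C"],
    ["Intro_To_Visual_C", "C_Primer", "Tcl_Tk"],
]
pattern_support = sequence_support(timelines, ["Intro_To_Visual_C", "C_Primer"])
```

The third customer bought both books but in the opposite order, so that timeline does not count toward the pattern.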
30. Detection of Patterns within Time Series
- Similarities can be detected within positions of the time series
- Examples
  - Stocks of a utility company, ABC Power, and a financial company, XYZ Securities, show the same pattern during 1998 in terms of closing stock price
  - Two products show the same selling pattern in summer but a different one in winter
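One simple way to check whether two time series "show the same pattern" is the Pearson correlation of their values, sketched below in Python. The slides do not specify a similarity measure, so this choice is an assumption, and the price series are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, non-constant series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical closing prices.
abc_power = [20.0, 21.0, 23.0, 22.0, 25.0, 27.0]
xyz_securities = [40.0, 42.0, 46.0, 44.0, 50.0, 54.0]   # same shape at a different scale
other_stock = [30.0, 29.0, 27.0, 28.0, 25.0, 23.0]      # mirrored movements

same_pattern = pearson(abc_power, xyz_securities)
opposite_pattern = pearson(abc_power, other_stock)
```

A value near 1 indicates the two series rise and fall together regardless of scale; a value near -1 indicates opposite movements. Comparing seasonal behavior, as in the second example, would apply the same measure to the summer and winter slices separately.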
31. More about Association Rules
- Retail Shops
  - Someone who buys bread is quite likely to buy milk
  - A person who bought the book Database System Concepts is quite likely to buy the book Operating System Concepts
- Data consists of 2 parts
  - Transactions: customers' purchases
  - Items: things that were bought
- For each transaction, there is a list of items
32. Example
33. Association Rule Form
- X --> Y
  - Where X = {x1, ..., xn} and Y = {y1, ..., ym} are sets of items
  - If a customer buys X, he/she is likely to buy Y
- We need to automatically discover such association rules
- Two concepts to measure the strength of an association
  - Support
  - Confidence
34. Support
- Support of X --> Y
  - Also called prevalence
  - Percentage of transactions that hold all the items of the union X ∪ Y
  - Probability that the 2 item sets occurred together
- Estimated by
  support(X --> Y) = (number of transactions that contain every item in X and Y) / (number of all transactions)
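The support estimate can be computed directly, as in this minimal Python sketch. The four transactions are hypothetical and are not the example table from the slides.

```python
def support(x, y, transactions):
    """Fraction of transactions that contain every item in X and every item in Y."""
    both = set(x) | set(y)
    return sum(both <= t for t in transactions) / len(transactions)


# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"bread", "cookies"},
    {"milk", "bread"},
]

support_milk_juice = support({"milk"}, {"juice"}, transactions)
support_bread_juice = support({"bread"}, {"juice"}, transactions)
```

Because support depends only on the union of X and Y, the rules X --> Y and Y --> X always have the same support.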
35. Example
- Support of Milk --> Juice = ?
- Support of Bread --> Juice = ?
36. Large and Small Itemsets
- Support threshold: user-specified
- Large Itemsets
  - Sets of items that have a support that exceeds a certain threshold
- Small Itemsets
  - Sets of items that have a support that is below a certain threshold
37. Confidence
- Confidence of the rule X --> Y
  - Conditional probability of a transaction containing item set Y given that it contains item set X
- Estimated by
  confidence(X --> Y) = (number of transactions that contain every item in X and Y) / (number of transactions that contain the items in X)
- Confidence threshold: user-specified
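A minimal Python sketch of the confidence estimate. The transactions are hypothetical and are not the example table from the slides.

```python
def confidence(x, y, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    x, y = set(x), set(y)
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        return 0.0  # assumption: define confidence as 0 when X never occurs
    return sum(y <= t for t in with_x) / len(with_x)


# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"bread", "cookies"},
    {"milk", "bread"},
]

conf_milk_juice = confidence({"milk"}, {"juice"}, transactions)
conf_juice_milk = confidence({"juice"}, {"milk"}, transactions)
```

Unlike support, confidence is not symmetric: here every juice buyer also bought milk, but only two of the three milk buyers bought juice.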
38. Example
- Confidence of Milk --> Juice = ?
- Confidence of Bread --> Juice = ?
39. Goal of Mining Association Rules
- Generate all possible rules that exceed some minimum user-specified support and confidence thresholds
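The goal can be illustrated with a brute-force Python sketch that tries every itemset and every way of splitting it into X and Y. The transactions and thresholds are hypothetical, thresholds are treated as "at least" (an assumption), and `min_support` is assumed to be greater than zero.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_rules(transactions, min_support, min_confidence):
    """Return (X, Y, support, confidence) for every rule X --> Y meeting both thresholds."""
    items = sorted({i for t in transactions for i in t})
    rules = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset, transactions)
            if s < min_support:
                continue
            # Split the itemset into every non-empty antecedent X and consequent Y.
            for r in range(1, size):
                for lhs in combinations(combo, r):
                    x = frozenset(lhs)
                    y = itemset - x
                    c = s / support(x, transactions)
                    if c >= min_confidence:
                        rules.append((tuple(sorted(x)), tuple(sorted(y)), s, c))
    return rules


# Hypothetical transactions.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"bread", "cookies"},
    {"milk", "bread"},
]
rules = mine_rules(transactions, min_support=0.5, min_confidence=0.8)
```

With these thresholds the only surviving rule is juice --> milk, with support 0.5 and confidence 1.0.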
40. Generating Large Itemsets: Exploratory Method
- Consider all possibilities
  - Example: three items a, b, and c
  - {a}, {b}, {c}, {a,b}, {b,c}, {a,c}, {a,b,c}
- Works for a very small number of items
- Very computation-intensive if the number of items becomes large (thousands)
  - If the number of items is m, then the number of distinct itemsets is 2^m
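The exhaustive enumeration can be sketched in Python with `itertools.combinations`. The count of 2^m includes the empty set; the sketch lists only the 2^m - 1 non-empty itemsets, matching the seven sets in the example.

```python
from itertools import combinations


def all_itemsets(items):
    """List every non-empty subset of `items`, smallest sets first."""
    items = sorted(items)
    return [
        set(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


itemsets = all_itemsets(["a", "b", "c"])
count_for_20_items = 2 ** 20 - 1  # number of non-empty itemsets over 20 items
```

Three items give 7 candidate itemsets, but 20 items already give over a million, which is why this approach only works for a very small number of items.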
41. Generating Large Itemsets: A Priori Method
- Idea
  - A subset of a large itemset must also be large
    - If {juice, milk, bread} is frequent, so is {juice, bread}
    - Every transaction having {juice, milk, bread} also contains {juice, bread}
  - Conversely, an extension of a small itemset is also small
    - If there is any itemset which is infrequent, its superset should not be generated/tested!
- Overview
  - Only sets with single items are considered in the first pass
  - In the second pass, sets with two items are considered, and so on
42. Generating Large Itemsets: A Priori Method (cont'd)
- Test the support for itemsets of length 1, called 1-itemsets, by scanning the database
  - Discard those that do not exceed the threshold
- Extend the large 1-itemsets into 2-itemsets by appending one item each time, to generate all candidate itemsets of length two
  - Test the support for all candidate itemsets by scanning the database and eliminate those 2-itemsets that do not meet the minimum support
- Repeat the above steps
  - At step k, the previously found (k-1)-itemsets are extended into k-itemsets and tested for minimum support
- The process is repeated until no large itemsets can be found
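The level-wise steps above can be sketched in Python as follows. This is a compact illustration rather than an optimized implementation; the transactions are hypothetical, and "large" is taken to mean support at least `min_support` (an assumption, since the slides use both "exceed" and "meet").

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset` (one database scan)."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori(transactions, min_support):
    """Return {itemset: support} for all large itemsets, found level by level."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]   # candidate 1-itemsets
    large = {}
    k = 1
    while candidates:
        # Scan the database and keep the candidates that meet the minimum support.
        current = {c: support(c, transactions) for c in candidates}
        current = {c: s for c, s in current.items() if s >= min_support}
        large.update(current)
        # Extend large k-itemsets into (k+1)-itemsets, skipping any candidate
        # that has a k-item subset which is not large.
        k += 1
        unions = {a | b for a in current for b in current if len(a | b) == k}
        candidates = [
            c for c in unions
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        ]
    return large


# Hypothetical transactions.
transactions = [
    {"bread", "milk", "juice"},
    {"bread", "milk"},
    {"bread", "juice"},
    {"milk", "juice", "cookies"},
]
large_itemsets = apriori(transactions, min_support=0.5)
```

With these transactions, cookies is discarded in the first pass, so no itemset containing cookies is ever generated; all three pairs of bread, milk, and juice are large, and the single 3-itemset candidate fails the threshold, which ends the process.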
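43. The A Priori Method - Example
[Figure: database TDB is scanned three times; the 1st scan yields candidates C1 and large itemsets L1, the 2nd scan yields C2 and L2, and the 3rd scan yields C3 and L3]
Frequency 50%, Confidence 100%: A --> C, B --> E, BC --> E, CE --> B, BE --> C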