Data Mining - PowerPoint PPT Presentation

About This Presentation

Title:

Data Mining

Description:

... marketing strategies including advertising, store location, and targeted mailing ... Plan additional store locations based on demographics. Run store promotions, ... – PowerPoint PPT presentation

Slides: 44

Provided by: brahimm

Transcript and Presenter's Notes

Title: Data Mining

1
Data Mining
CIS 4262 Information Systems Design and Analysis

Dr. Brahim Medjahed
brahim_at_umich.edu

2
Why Mining Data? (1)

Commercial Point of View
Lots of data is being collected
Web data, e-commerce
Purchases at department/grocery stores
Bank/Credit Card Transactions
High competitive pressure
Provide better, customized services for an edge
(e.g. in Customer Relationship Management)

3
Why Mining Data? (2)

Scientific Point of View
Data collected and stored at enormous speeds
(TB/hour)
Remote sensors on a satellite
Telescopes scanning the skies
Microarrays generating gene expression data
Scientific simulations generating terabytes of
data
Data mining may help scientists in
Classifying and segmenting data
Hypothesis formation

4
Mining Large Data Set - Motivation

There is often information hidden in the data
that is not readily evident
Human analysts may take weeks to discover useful
information
Much of the data is never analyzed at all

[Figure: The Data Gap. Total new disk (TB) since
1995 compared with the number of analysts]
5
Applications of Data Mining

Marketing
Identify likely responders to sales promotions
Consumer behavior based on buying patterns
Determination of marketing strategies including
advertising, store location, and targeted mailing
Fraud detection
Which types of transactions are likely to be
fraudulent, given the demographics and
transactional history of a particular customer?
Customer relationship management
Which of my customers are likely to be the most
loyal, and which are most likely to leave for a
competitor?
Banking loan/credit card approval
Predict good customers based on old customers
Healthcare
Analysis of effectiveness of certain treatments
Relating patient wellness data with doctor
qualifications
Analyzing side effects of drugs

6
What is Data Mining?

Process of semi-automatically analyzing large
databases to find patterns that are
Valid: hold on new data with some certainty
Novel: non-obvious to the system
Useful: should be possible to act on the item
Understandable: humans should be able to
interpret the pattern
Another definition
Exploration and analysis, by automatic or
semi-automatic means, of large quantities of data
in order to discover meaningful patterns

7
Data Mining and Data Warehousing
Data Warehousing provides the Enterprise with
a memory
Data Mining provides the Enterprise with
intelligence
8
Data Mining as Part of the Knowledge Discovery
Process (1)

Example
Transaction database maintained by a specialty
consumer goods retailer
Client data includes
Customer name, zip code, phone number, date of
purchase, item code, price, quantity, total
amount
Problem Formulation
Plan additional store locations based on
demographics
Run store promotions
Combine items in advertisements
Plan seasonal marketing strategies
etc.

9
Data Mining as Part of the Knowledge Discovery
Process (2)

Data Selection
Data about specific items or categories of items,
or from stores in a specific region or area of
the country, may be selected
Data cleansing
Correct invalid zip codes or eliminate records
with incorrect phone prefixes
Enrichment
Enhances the data with additional sources of
information
The store may purchase data about age, income,
and credit rating and append them to each record

10
Data Mining as Part of the Knowledge Discovery
Process (3)

Data Transformation and Encoding
Done to reduce the data
Zip codes may be aggregated into geographic
regions; incomes may be divided into ten ranges
Data Mining
Mine different rules and patterns
Whenever a customer buys video equipment, he/she
also buys another electronic gadget
Reporting and display of the discovered
information
Results may be reported in a variety of formats,
such as listings, graphic outputs, summary
tables, or visualizations
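
As a minimal sketch of the transformation and encoding step on this slide, the Python below aggregates zip codes into regions and divides incomes into ten ranges. The prefix-to-region mapping, the income cap of 200,000, and the function names are assumptions made for illustration; they do not come from the slides.

```python
# Hypothetical mapping from the first zip-code digit to a coarse region.
ZIP_PREFIX_TO_REGION = {"0": "Northeast", "1": "Northeast", "4": "Midwest", "9": "West"}

def zip_to_region(zip_code: str) -> str:
    """Aggregate a zip code into a geographic region (assumed mapping)."""
    return ZIP_PREFIX_TO_REGION.get(zip_code[:1], "Other")

def income_to_range(income: float, max_income: float = 200_000.0, bins: int = 10) -> int:
    """Encode an income as one of `bins` equal-width ranges, numbered 0..bins-1."""
    width = max_income / bins
    index = int(income // width)
    return min(max(index, 0), bins - 1)  # clamp incomes outside the assumed cap

# Invented records: (zip code, income).
records = [("48128", 35_000), ("94105", 250_000)]
encoded = [(zip_to_region(z), income_to_range(i)) for z, i in records]
print(encoded)
```

Equal-width ranges are only one option; ranges holding equal numbers of customers would serve the same purpose of reducing the data.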

11
Goals of Data Mining

Prediction
Identification
Optimization
Classification

12
Goals of Data Mining (1)

Prediction
How certain attributes within the data will
behave in the future
Predict what consumers will buy under certain
discounts
Predict how much sales volume a store would
generate in a given period
Predict whether deleting a product line would
yield more profits
Predict an earthquake based on certain seismic
wave patterns

13
Goals of Data Mining (2)

Identification
Use data pattern to identify the existence of an
item, an event, or an activity
Intruders trying to break a system may be
identified by the programs executed, files
accessed, and CPU time per session.
Existence of a gene may be identified by certain
sequences of nucleotide symbols in the DNA
sequence
Optimization
Optimize the use of limited resources such as
time, space, money, or materials and maximize
output variables such as sales or profits under
certain constraints

14
Goals of Data Mining (3)

Classification
Partition data so that different classes or
categories can be identified based on
combinations of parameters
Customers in a supermarket may be categorized
into
Discount-seeking shoppers
Shoppers in a rush
Loyal regular shoppers
Infrequent shoppers

15
Data Mining Tasks

Association Rule Discovery
Classification
Clustering
Detection of Sequential Patterns
Detection of Patterns within Time Series
Etc.

16
Association Rules

Given a set of records each of which contain some
number of items from a given collection
Produce dependency rules which will predict
occurrence of an item based on occurrences of
other items.

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
17
Association Rules Applications (1)

Marketing and Sales Promotion
Let the rule discovered be
{Bagels, ...} --> {Potato Chips}
Potato Chips as consequent => Can be used to
determine what should be done to boost its sales.
Bagels in the antecedent => Can be used to see
which products would be affected if the store
discontinues selling bagels.
Bagels in antecedent and Potato Chips in
consequent => Can be used to see what products
should be sold with Bagels to promote sale of
Potato Chips!

18
Association Rules Applications (2)

Supermarket Shelf Management
Goal: To identify items that are bought together
by sufficiently many customers.
Approach: Process the point-of-sale data
collected with barcode scanners to find
dependencies among items.
Inventory Management
Goal: A consumer appliance repair company wants
to anticipate the nature of repairs on its
consumer products and keep the service vehicles
equipped with the right parts to reduce the number
of visits to consumer households.
Approach: Process the data on tools and parts
required in previous repairs at different
consumer locations and discover the co-occurrence
patterns.

19
Prevalent Rules ≠ Interesting Rules
1995: Milk and cereal sell together!

Analysts already know about prevalent rules
Interesting rules are those that deviate from
prior expectation
Mining's payoff is in finding surprising phenomena

Milk and cereal sell together!
20
Classification

Given old data about customers and payments,
predict new applicants' loan eligibility.

[Figure: previous customers' data (Age, Salary,
Profession, Location, Customer type) is fed to a
classifier, which learns decision rules (e.g.
Salary > 5 L, Prof. = Exec) and labels new
applicants' data as good/bad]
21
Classification Definition

Predictive task: Use some variables to predict
unknown or future values of other variables
Given a collection of records (training set)
Each record contains a set of attributes; one of
the attributes is the class.
Find a model for the class attribute as a function
of the values of other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine the accuracy of
the model. Usually, the given data set is divided
into training and test sets, with training set
used to build the model and test set used to
validate it.
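
The training/test procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records, the 75/25 split, and the single-threshold "model" on salary are all invented here, and a real classifier (such as a decision tree) would use many attributes.

```python
import random

def split(records, test_fraction=0.25, seed=0):
    """Shuffle the records and divide them into training and test sets."""
    rows = records[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def learn_threshold(training):
    """Build the model: the salary threshold that best fits the training set."""
    best_t, best_correct = None, -1
    for t in sorted({salary for salary, _ in training}):
        correct = sum((salary >= t) == (label == "good") for salary, label in training)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def accuracy(threshold, rows):
    """Fraction of rows whose class the threshold rule predicts correctly."""
    hits = sum((salary >= threshold) == (label == "good") for salary, label in rows)
    return hits / len(rows)

# Invented toy data: (salary attribute, class attribute).
data = [(20, "bad"), (25, "bad"), (30, "bad"), (35, "bad"),
        (50, "good"), (55, "good"), (60, "good"), (70, "good")]
train, test = split(data)
model = learn_threshold(train)
print("threshold:", model, "test accuracy:", accuracy(model, test))
```

The model is built from the training set only; the test set is touched once, to estimate how well the rule carries over to unseen records.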

22
Classification Example
[Figure: a training set with two categorical
attributes, one continuous attribute, and a class
column is used to learn a classifier]
23
Classification - Application (1)

Direct Marketing
Goal: Reduce cost of mailing by targeting a set
of consumers likely to buy a new cell-phone
product.
Approach:
Use the data for a similar product introduced
before.
We know which customers decided to buy and which
decided otherwise. This {buy, don't buy} decision
forms the class attribute.
Collect various demographic, lifestyle, and
company-interaction related information about all
such customers.
Type of business, where they stay, how much they
earn, etc.
Use this information as input attributes to learn
a classifier model.

24
Classification - Application (2)

Fraud Detection
Goal: Predict fraudulent cases in credit card
transactions.
Approach:
Use credit card transactions and the information
on the account holder as attributes.
When does a customer buy, what does he buy, how
often does he pay on time, etc.
Label past transactions as fraud or fair
transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing
credit card transactions on an account.

25
Classification - Application (3)

Customer Loyalty
Goal: To predict whether a customer is likely to
be lost to a competitor.
Approach:
Use detailed records of transactions with each of
the past and present customers to find
attributes.
How often the customer calls, where he calls,
what time-of-the day he calls most, his financial
status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.

26
Clustering

Descriptive task: Find human-interpretable
patterns that describe the data
Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
Data points in one cluster are more similar to
one another.
Data points in separate clusters are less similar
to one another.
Key requirement: Need a good measure of
similarity between instances.
Similarity Measures
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures
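
To make the role of the similarity measure concrete, here is a small sketch that clusters continuous data points with Euclidean distance. The slides do not name a clustering algorithm; k-means is swapped in here as one common choice, and the points, the starting centroids, and the function names are assumptions for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # A cluster that received no points keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids

# Invented data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = k_means(points, centroids=[points[0], points[3]])
print(clusters)
```

Swapping `euclidean` for a problem-specific measure changes which points end up together, which is why the choice of measure is the key requirement.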

27
Clustering - Application

Document Clustering
Goal: To find groups of documents that are
similar to each other based on the important
terms appearing in them.
Approach: To identify frequently occurring terms
in each document. Form a similarity measure based
on the frequencies of different terms. Use it to
cluster.
Gain: Information Retrieval can utilize the
clusters to relate a new document or search term
to clustered documents.
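
A similarity measure based on term frequencies could look like the sketch below. Cosine similarity is an assumption here (the slide only asks for some frequency-based measure), and the three short documents are invented.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each term occurs in a document."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented documents.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns in large data",
    "d3": "the store sells bagels and chips",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}
print(cosine_similarity(tf["d1"], tf["d2"]), cosine_similarity(tf["d1"], tf["d3"]))
```

Documents d1 and d2 share several terms and score well above d1 and d3, which share none, so a clustering step built on this measure would group d1 with d2.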

28
Detection of Sequential Patterns

A sequence of actions or events is sought
Given a set of objects, with each object
associated with its own timeline of events, find
rules that predict strong sequential dependencies
among different events.

29
Sequential Patterns - Examples

In healthcare
If a patient underwent cardiac bypass surgery for
blocked arteries and an aneurysm and later
developed high blood urea within a year of
surgery, he or she is likely to suffer from
kidney failure within the next 18 months
In point-of-sale transaction sequences
Computer Bookstore:
(Intro_To_Visual_C) (C_Primer) -->
(Perl_for_dummies, Tcl_Tk)
Athletic Apparel Store:
(Shoes) (Racket, Racketball) -->
(Sports_Jacket)
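
One way to check a sequential pattern against customer timelines is sketched below: a pattern is an ordered list of itemsets, and a timeline contains it if those itemsets appear in the same order. The timelines are invented, and the function names are hypothetical.

```python
def contains_sequence(timeline, pattern):
    """True if the pattern's itemsets occur, in order, within the timeline."""
    position = 0
    for transaction in timeline:
        if position < len(pattern) and pattern[position] <= transaction:
            position += 1
    return position == len(pattern)

def sequence_support(timelines, pattern):
    """Fraction of timelines that contain the pattern."""
    return sum(contains_sequence(t, pattern) for t in timelines) / len(timelines)

# Invented timelines: each is a time-ordered list of purchases by one customer.
timelines = [
    [{"Intro_To_Visual_C"}, {"C_Primer"}, {"Perl_for_dummies", "Tcl_Tk"}],
    [{"Intro_To_Visual_C"}, {"C_Primer", "Notebook"}],
    [{"C_Primer"}, {"Intro_To_Visual_C"}],
]
pattern = [{"Intro_To_Visual_C"}, {"C_Primer"}]
print(sequence_support(timelines, pattern))
```

The third customer bought the same two books in the opposite order, so that timeline does not count; order is what separates a sequential pattern from a plain association.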

30
Detection of Patterns within Time Series

Similarities can be detected within positions of
the time series
Examples
Stocks of a utility company ABC Power and a
financial company XYZ Securities show the same
pattern during 1998 in terms of closing stock
price
Two products show the same selling pattern in
summer but a different one in winter
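
As a rough illustration of detecting that two series "show the same pattern", the sketch below compares closing prices with the Pearson correlation coefficient. The choice of correlation as the similarity measure is an assumption, and the price values are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented closing prices for the two companies named on the slide.
abc_power = [10.0, 11.0, 12.5, 12.0, 13.0]
xyz_securities = [40.0, 42.0, 45.0, 44.0, 46.0]
print(pearson(abc_power, xyz_securities))
```

A value close to 1 means the two series rise and fall together even though their price levels differ; computing it separately for summer and winter windows would expose the second example on the slide.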

31
More about Association Rules

Retail Shops
Someone who buys bread is quite likely to buy
milk
A person who bought the book Database System
Concepts is quite likely to buy the book
Operating Systems Concepts
Data consists of 2 parts
Transactions: customers' purchases
Items: things that were bought
For each transaction, there is a list of items

32
Example
33
Association Rule Form

X ⇒ Y
Where X = {x1, ..., xn} and Y = {y1, ..., ym} are
sets of items
If a customer buys X, he/she is likely to buy Y
We need to automatically discover such
association rules
Two concepts to measure the strength of an
association
Support
Confidence

34
Support

Support of X ⇒ Y
Also called prevalence
Percentage of transactions that hold all the
items of the union X ∪ Y
Probability that the 2 item sets occurred
together
Estimated by
support(X ⇒ Y) = (number of transactions that
contain every item in X and Y) / (number of all
transactions)
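
The support estimate can be computed directly from its definition. The sketch below uses an invented four-transaction database; the item names and the function name are assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Invented toy database: one set of items per transaction.
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]
print(support(transactions, {"milk", "juice"}))   # 2 of 4 transactions
print(support(transactions, {"bread", "juice"}))  # 1 of 4 transactions
```

Because support depends only on the union of X and Y, the rules X ⇒ Y and Y ⇒ X always have the same support.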

35
Example
- Support of Milk ⇒ Juice?
- Support of Bread ⇒ Juice?
36
Large and Small Itemsets

Support threshold: user-specified
Large Itemsets
Sets of items that have a support that exceeds a
certain threshold
Small Itemsets
Sets of items that have a support that is below
a certain threshold

37
Confidence

Confidence of the rule X ⇒ Y
Conditional probability of a transaction
containing item set Y given that it contains
item set X
Estimated by
confidence(X ⇒ Y) = (number of transactions that
contain every item in X and Y) / (number of
transactions that contain the items in X)
Confidence threshold: user-specified
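
Confidence follows the same counting idea as support, with the denominator restricted to transactions that contain X. The database below is invented, and returning 0.0 when X never occurs is a convention chosen for this sketch.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions)

def confidence(transactions, x, y):
    """Estimate P(Y | X): transactions with X and Y over transactions with X."""
    x_count = count(transactions, x)
    if x_count == 0:
        return 0.0  # assumed convention: an antecedent that never occurs gives 0
    return count(transactions, set(x) | set(y)) / x_count

# Invented toy database: one set of items per transaction.
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]
print(confidence(transactions, {"milk"}, {"juice"}))   # 2 of the 3 milk transactions
print(confidence(transactions, {"bread"}, {"juice"}))  # 1 of the 2 bread transactions
```

Unlike support, confidence is not symmetric: on this data juice ⇒ milk has confidence 1.0 while milk ⇒ juice has only 2/3.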

38
Example
- Confidence of Milk ⇒ Juice?
- Confidence of Bread ⇒ Juice?
39
Goal of Mining Association Rules

Generate all possible rules that exceed some
minimum user-specified support and confidence
thresholds

40
Generating Large Itemsets: Exploratory Method

Consider all possibilities
Example: three items a, b, and c
{a}, {b}, {c}, {a,b}, {b,c}, {a,c}, {a,b,c}
Works for a very small number of items
Very computation intensive if the number of items
becomes large (thousands)
If the number of items is m, then the number of
distinct itemsets is 2^m (2^m - 1 excluding the
empty set)
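
The exhaustive enumeration is easy to write down, which makes its cost easy to see. The sketch below lists every non-empty itemset; the function name is hypothetical.

```python
from itertools import combinations

def all_itemsets(items):
    """Enumerate every non-empty subset of `items`, smallest sets first."""
    items = sorted(items)
    return [set(c) for size in range(1, len(items) + 1) for c in combinations(items, size)]

itemsets = all_itemsets({"a", "b", "c"})
print(len(itemsets))                       # three items give 2**3 - 1 itemsets
print(len(all_itemsets(range(20))))        # twenty items already give over a million
```

Each additional item doubles the number of candidates, so with thousands of items testing every itemset against the database is not feasible.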

41
Generating Large Itemsets: A Priori Method

Idea
A subset of a large itemset must also be large
If {juice, milk, bread} is frequent, so is
{juice, bread}
Every transaction having {juice, milk, bread}
also contains {juice, bread}
Conversely, an extension of a small itemset is
also small
If there is any itemset which is infrequent, its
supersets should not be generated/tested!
Overview
Only sets with single items are considered in the
first pass.
In the second pass, sets with two items are
considered, and so on.

42
Generating Large Itemsets: A Priori Method
(contd)

Test the support for itemsets of length 1, called
1-itemsets, by scanning the database
Discard those that do not exceed the threshold
Extend the large 1-itemsets into 2-itemsets by
appending one item each time, to generate all
candidate itemsets of length two
Test the support for all candidate itemsets by
scanning the database and eliminate those
2-itemsets that do not meet the minimum support
Repeat the above steps.
At step k, the previously found (k-1)-itemsets
are extended into k-itemsets and tested for
minimum support
The process is repeated until no large itemsets
can be found
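
The level-by-level procedure above can be sketched as follows. This is a compact illustration rather than an optimized implementation: the toy database is invented, candidates are formed by joining pairs of large itemsets, and an itemset is kept when its support is at least `min_support` (an assumption, since the steps above say both "exceed" and "meet").

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every large itemset, found level by level."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}  # candidate 1-itemsets
    large = {}
    k = 1
    while current:
        # Scan the database and keep the candidates that meet the threshold.
        kept = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                kept[candidate] = s
        large.update(kept)
        # Extend the large k-itemsets into candidate (k+1)-itemsets.
        k += 1
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Drop any candidate with a small subset: it cannot be large itself.
        current = {c for c in candidates
                   if all(frozenset(s) in kept for s in combinations(c, k - 1))}
    return large

# Invented toy database: one set of items per transaction.
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]
for itemset, s in sorted(apriori(transactions, 0.5).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a threshold of 0.5, eggs and coffee are discarded in the first pass, so no pair containing them is ever generated or counted; that saving is the point of the method.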

43
The A Priori Method - Example
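[Figure: Database TDB is scanned three times. The
1st scan produces candidates C1 and large itemsets
L1, the 2nd scan produces C2 and L2, and the 3rd
scan produces C3 and L3]
Frequency 50%, Confidence 100%
A ⇒ C
B ⇒ E
BC ⇒ E
CE ⇒ B
BE ⇒ C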
