Data Mining - PowerPoint PPT Presentation

About This Presentation
Title:

Data Mining

Slides: 44
Provided by: brahimm

Transcript and Presenter's Notes

Title: Data Mining


1
Data Mining
CIS 4262 Information Systems Design and Analysis
  • Dr. Brahim Medjahed
  • brahim_at_umich.edu

2
Why Mine Data? (1)
  • Commercial Point of View
  • Lots of data is being collected
  • Web data, e-commerce
  • Purchases at department/grocery stores
  • Bank/Credit Card Transactions
  • High competitive pressure
  • Provide better, customized services for an edge
    (e.g. in Customer Relationship Management)

3
Why Mine Data? (2)
  • Scientific Point of View
  • Data collected and stored at enormous speeds
    (TB/hour)
  • Remote sensors on a satellite
  • Telescopes scanning the skies
  • Microarrays generating gene expression data
  • Scientific simulations generating terabytes of
    data
  • Data mining may help scientists in
  • Classifying and segmenting data
  • Hypothesis formation

4
Mining Large Data Sets - Motivation
  • There is often information hidden in the data
    that is not readily evident
  • Human analysts may take weeks to discover useful
    information
  • Much of the data is never analyzed at all

[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts]
5
Applications of Data Mining
  • Marketing
  • Identify likely responders to sales promotions
  • Consumer behavior based on buying patterns
  • Determination of marketing strategies including
    advertising, store location, and targeted mailing
  • Fraud detection
  • Which types of transactions are likely to be
    fraudulent, given the demographics and
    transactional history of a particular customer?
  • Customer relationship management
  • Which of my customers are likely to be the most
    loyal, and which are most likely to leave for a
    competitor?
  • Banking loan/credit card approval
  • Predict which applicants will be good customers,
    based on data about past customers
  • Healthcare
  • Analysis of effectiveness of certain treatments
  • Relating patient wellness data with doctor
    qualifications
  • Analyzing side effects of drugs
  • ...

6
What is Data Mining?
  • Process of semi-automatically analyzing large
    databases to find patterns that are
  • valid: hold on new data with some certainty
  • novel: non-obvious to the system
  • useful: should be possible to act on the item
  • understandable: humans should be able to
    interpret the pattern
  • Another definition
  • Exploration and analysis, by automatic or
    semi-automatic means, of large quantities of data
    in order to discover meaningful patterns

7
Data Mining and Data Warehousing
Data Warehousing provides the Enterprise with
a memory
Data Mining provides the Enterprise with
intelligence
8
Data Mining as Part of the Knowledge Discovery
Process (1)
  • Example
  • Transaction database maintained by a specialty
    consumer goods retailer
  • Client data includes
  • Customer name, zip code, phone number, date of
    purchase, item code, price, quantity, total
    amount
  • Problem Formulation
  • Plan additional store locations based on
    demographics
  • Run store promotions
  • Combine items in advertisements
  • Plan seasonal marketing strategies
  • etc.

9
Data Mining as Part of the Knowledge Discovery
Process (2)
  • Data Selection
  • Data about specific items or categories of items,
    or from stores in a specific region or area of
    the country, may be selected
  • Data cleansing
  • Correct invalid zip codes or eliminate records
    with incorrect phone prefixes
  • Enrichment
  • Enhances the data with additional sources of
    information
  • The store may purchase data about age, income,
    and credit rating and append them to each record
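
The cleansing step above can be sketched in a few lines. This is only an illustration: the field name "zip", the sample records, and the rule that a valid zip code is five digits with an optional four-digit extension are assumptions, not part of the slides.

```python
import re

# Assumed format: US zip code, 5 digits with an optional "-NNNN" extension.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")

def is_valid_zip(zip_code):
    """Return True when the whole string matches the assumed zip format."""
    return ZIP_PATTERN.fullmatch(zip_code) is not None

def cleanse(records):
    """Keep only the records whose 'zip' field looks valid."""
    return [r for r in records if is_valid_zip(r["zip"])]

# Hypothetical customer records, for illustration only.
records = [
    {"name": "A", "zip": "48128"},
    {"name": "B", "zip": "4812"},        # too short, dropped
    {"name": "C", "zip": "48128-1491"},
    {"name": "D", "zip": "ABCDE"},       # not numeric, dropped
]
clean_records = cleanse(records)
```

A real cleansing step might try to correct an invalid value instead of dropping the record; this sketch only eliminates.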

10
Data Mining as Part of the Knowledge Discovery
Process (3)
  • Data Transformation and Encoding
  • Done to reduce the data
  • Zip codes may be aggregated into geographic
    regions; incomes may be divided into ten ranges
  • Data Mining
  • Mine different rules and patterns
  • Whenever a customer buys video equipment, he/she
    also buys another electronic gadget
  • Reporting and display of the discovered
    information
  • Results may be reported in a variety of formats,
    such as listings, graphic outputs, summary
    tables, or visualizations
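
The transformation and encoding step (zip codes into regions, incomes into ten ranges) might look like the sketch below. The income bounds, the equal-width ranges, and the prefix-to-region table are all assumptions chosen for illustration.

```python
def income_bucket(income, low=0, high=200_000, buckets=10):
    """Map an income to one of `buckets` equal-width ranges, numbered 0..buckets-1.

    Incomes outside [low, high) are clamped to the first or last range.
    """
    width = (high - low) / buckets
    index = int((income - low) // width)
    return max(0, min(buckets - 1, index))

# Hypothetical lookup from 3-digit zip prefix to a coarse region name.
ZIP_PREFIX_TO_REGION = {"481": "Midwest", "100": "Northeast", "941": "West"}

def zip_region(zip_code):
    """Aggregate a zip code into a region; unknown prefixes map to 'Unknown'."""
    return ZIP_PREFIX_TO_REGION.get(zip_code[:3], "Unknown")

def encode(record):
    """Reduce a record to the two encoded attributes."""
    return {
        "region": zip_region(record["zip"]),
        "income_range": income_bucket(record["income"]),
    }

encoded = encode({"zip": "48128", "income": 45_000})
```

Equal-width ranges are the simplest choice; ranges holding equal numbers of customers would be another reasonable encoding.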

11
Goals of Data Mining
  • Prediction
  • Identification
  • Optimization
  • Classification

12
Goals of Data Mining (1)
  • Prediction
  • How certain attributes within the data will
    behave in the future
  • Predict what consumers will buy under certain
    discounts
  • Predict how much sales volume a store would
    generate in a given period
  • Predict whether deleting a product line would
    yield more profits
  • Predict an earthquake based on certain seismic
    wave patterns

13
Goals of Data Mining (2)
  • Identification
  • Use data patterns to identify the existence of an
    item, an event, or an activity
  • Intruders trying to break into a system may be
    identified by the programs executed, files
    accessed, and CPU time per session.
  • Existence of a gene may be identified by certain
    sequences of nucleotide symbols in the DNA
    sequence
  • Optimization
  • Optimize the use of limited resources such as
    time, space, money, or materials and maximize
    output variables such as sales or profits under
    certain constraints

14
Goals of Data Mining (3)
  • Classification
  • Partition data so that different classes or
    categories can be identified based on
    combinations of parameters
  • Customers in a supermarket may be categorized
    into
  • Discount-seeking shoppers
  • Shoppers in a rush
  • Loyal regular shoppers
  • Infrequent shoppers

15
Data Mining Tasks
  • Association Rule Discovery
  • Classification
  • Clustering
  • Detection of Sequential Patterns
  • Detection of Patterns within Time Series
  • Etc.

16
Association Rules
  • Given a set of records, each of which contains some
    number of items from a given collection
  • Produce dependency rules which will predict
    occurrence of an item based on occurrences of
    other items.

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
17
Association Rules Applications (1)
  • Marketing and Sales Promotion
  • Let the rule discovered be
  • {Bagels, ...} --> {Potato Chips}
  • Potato Chips as consequent => Can be used to
    determine what should be done to boost its sales.
  • Bagels in the antecedent => Can be used to see
    which products would be affected if the store
    discontinues selling bagels.
  • Bagels in antecedent and Potato Chips in
    consequent => Can be used to see what products
    should be sold with Bagels to promote sale of
    Potato Chips!

18
Association Rules Applications (2)
  • Supermarket Shelf Management
  • Goal: To identify items that are bought together
    by sufficiently many customers.
  • Approach: Process the point-of-sale data
    collected with barcode scanners to find
    dependencies among items.
  • Inventory Management
  • Goal: A consumer appliance repair company wants
    to anticipate the nature of repairs on its
    consumer products and keep the service vehicles
    equipped with the right parts to reduce the number
    of visits to consumer households.
  • Approach: Process the data on tools and parts
    required in previous repairs at different
    consumer locations and discover the co-occurrence
    patterns.

19
Prevalent Rules ≠ Interesting Rules
1995: Milk and cereal sell together!
  • Analysts already know about prevalent rules
  • Interesting rules are those that deviate from
    prior expectation
  • Mining's payoff is in finding surprising phenomena

20
Classification
  • Given old data about customers and payments,
    predict new applicants' loan eligibility.

[Figure: data on previous customers (Age, Salary, Profession, Location, Customer type) is fed to a classifier, which produces decision rules (e.g. Salary > 5 L, Prof. = Exec) that label new applicants' data as good/bad]
21
Classification Definition
  • Predictive task: Use some variables to predict
    unknown or future values of other variables
  • Given a collection of records (training set)
  • Each record contains a set of attributes, one of
    the attributes is the class.
  • Find a model for class attribute as a function
    of the values of other attributes.
  • Goal: previously unseen records should be
    assigned a class as accurately as possible.
  • A test set is used to determine the accuracy of
    the model. Usually, the given data set is divided
    into training and test sets, with training set
    used to build the model and test set used to
    validate it.
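
The train/test procedure above can be sketched with a deliberately simple model. The "one rule" learner below (predict the most common class seen for each value of a single attribute) is a stand-in chosen for brevity, not a method named in the slides, and the customer records are hypothetical.

```python
from collections import Counter, defaultdict

def train_one_rule(records, attribute, target):
    """Learn a lookup: attribute value -> most common class in the training set."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r[attribute]][r[target]] += 1
    # Fallback class for attribute values never seen during training.
    default = Counter(r[target] for r in records).most_common(1)[0][0]
    rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return lambda record: rules.get(record[attribute], default)

def accuracy(model, records, target):
    """Fraction of records whose predicted class equals the recorded class."""
    correct = sum(1 for r in records if model(r) == r[target])
    return correct / len(records)

# Hypothetical labelled data: the class attribute is "customer_type".
data = [
    {"profession": "exec", "customer_type": "good"},
    {"profession": "exec", "customer_type": "good"},
    {"profession": "clerk", "customer_type": "bad"},
    {"profession": "clerk", "customer_type": "good"},
    {"profession": "clerk", "customer_type": "bad"},
    {"profession": "exec", "customer_type": "good"},
    {"profession": "clerk", "customer_type": "bad"},
    {"profession": "exec", "customer_type": "bad"},
]
training_set, test_set = data[:6], data[6:]  # build on one part, validate on the other

model = train_one_rule(training_set, "profession", "customer_type")
test_accuracy = accuracy(model, test_set, "customer_type")
```

The point of the split is that accuracy is measured only on records the model never saw while it was being built.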

22
Classification Example
[Figure: a training set with two categorical attributes, one continuous attribute, and a class attribute is used to learn a classifier]
23
Classification - Application (1)
  • Direct Marketing
  • Goal: Reduce cost of mailing by targeting a set
    of consumers likely to buy a new cell-phone
    product.
  • Approach:
  • Use the data for a similar product introduced
    before.
  • We know which customers decided to buy and which
    decided otherwise. This {buy, don't buy} decision
    forms the class attribute.
  • Collect various demographic, lifestyle, and
    company-interaction related information about all
    such customers.
  • Type of business, where they stay, how much they
    earn, etc.
  • Use this information as input attributes to learn
    a classifier model.

24
Classification - Application (2)
  • Fraud Detection
  • Goal: Predict fraudulent cases in credit card
    transactions.
  • Approach:
  • Use credit card transactions and the information
    on the account holder as attributes.
  • When a customer buys, what he or she buys, how
    often he or she pays on time, etc.
  • Label past transactions as fraud or fair
    transactions. This forms the class attribute.
  • Learn a model for the class of the transactions.
  • Use this model to detect fraud by observing
    credit card transactions on an account.

25
Classification - Application (3)
  • Customer Loyalty
  • Goal: To predict whether a customer is likely to
    be lost to a competitor.
  • Approach:
  • Use detailed record of transactions with each of
    the past and present customers, to find
    attributes.
  • How often the customer calls, where he calls,
    what time-of-the day he calls most, his financial
    status, marital status, etc.
  • Label the customers as loyal or disloyal.
  • Find a model for loyalty.

26
Clustering
  • Descriptive task: Find human-interpretable
    patterns that describe the data
  • Given a set of data points, each having a set of
    attributes, and a similarity measure among them,
    find clusters such that
  • Data points in one cluster are more similar to
    one another.
  • Data points in separate clusters are less similar
    to one another.
  • Key requirement: Need a good measure of
    similarity between instances.
  • Similarity Measures
  • Euclidean Distance if attributes are continuous.
  • Other Problem-specific Measures
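
As a rough illustration of Euclidean distance as a similarity measure, the sketch below assigns each point to the closest of a few given cluster centers. The slides do not name a clustering algorithm; the fixed centers and the sample points are assumptions made for the example.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_to_nearest(points, centers):
    """Group points by the index of their closest center."""
    clusters = {i: [] for i in range(len(centers))}
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

# Hypothetical 2-attribute data points and two assumed cluster centers.
points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centers = [(0, 0), (10, 10)]
clusters = assign_to_nearest(points, centers)
```

Points that end up in the same group are closer to each other's center than to any other, which is the "more similar within, less similar across" requirement in miniature.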

27
Clustering - Application
  • Document Clustering
  • Goal: To find groups of documents that are
    similar to each other based on the important
    terms appearing in them.
  • Approach: To identify frequently occurring terms
    in each document. Form a similarity measure based
    on the frequencies of different terms. Use it to
    cluster.
  • Gain: Information Retrieval can utilize the
    clusters to relate a new document or search term
    to clustered documents.
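
One possible similarity measure based on term frequencies is cosine similarity, sketched below. The choice of cosine similarity, the whitespace tokenizer, and the sample documents are assumptions; the slides only say that the measure is built from term frequencies.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents, for illustration only.
doc_a = term_frequencies("data mining finds patterns in data")
doc_b = term_frequencies("mining data for hidden patterns")
doc_c = term_frequencies("telescopes scan the skies")

similar = cosine_similarity(doc_a, doc_b)     # shares several terms with doc_a
unrelated = cosine_similarity(doc_a, doc_c)   # shares no terms with doc_a
```

A clustering step would then group documents whose pairwise similarity is high.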

28
Detection of Sequential Patterns
  • A sequence of actions or events is sought
  • Given a set of objects, each associated with its
    own timeline of events, find rules that predict
    strong sequential dependencies among different
    events.

29
Sequential Patterns - Examples
  • In healthcare
  • If a patient underwent cardiac bypass surgery for
    blocked arteries and an aneurysm and later
    developed high blood urea within a year of
    surgery, he or she is likely to suffer from
    kidney failure within the next 18 months
  • In point-of-sale transaction sequences,
  • Computer Bookstore
  • (Intro_To_Visual_C) (C_Primer) -->
    (Perl_for_dummies,Tcl_Tk)
  • Athletic Apparel Store
  • (Shoes) (Racket, Racketball) -->
    (Sports_Jacket)
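
To make the notation concrete, the sketch below checks whether a customer's timeline of purchases contains a sequential pattern such as (Shoes) (Racket, Racketball) (Sports_Jacket). Representing each event as a set of items and the sample timeline (including the extra item "Socks") are assumptions for illustration.

```python
def contains_sequence(timeline, pattern):
    """Return True if the pattern's item sets occur in order within the timeline.

    Each pattern element must be a subset of some event, and the matching
    events must appear in the same order as the pattern elements.
    """
    position = 0
    for event in timeline:
        if position < len(pattern) and pattern[position] <= event:
            position += 1
    return position == len(pattern)

# Hypothetical purchase timeline of one customer, oldest event first.
timeline = [
    {"Shoes"},
    {"Racket", "Racketball", "Socks"},
    {"Sports_Jacket"},
]
pattern = [{"Shoes"}, {"Racket", "Racketball"}, {"Sports_Jacket"}]

matches = contains_sequence(timeline, pattern)
matches_reversed = contains_sequence(timeline, list(reversed(pattern)))
```

Counting how many customers' timelines contain a pattern gives its support, in the same spirit as for association rules.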

30
Detection of Patterns within Time Series
  • Similarities can be detected within positions of
    the time series
  • Examples
  • Stocks of a utility company ABC Power and a
    financial company XYZ Securities show the same
    pattern during 1998 in terms of closing stock
    price
  • Two products show the same selling pattern in
    summer but a different one in winter
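
One simple way to test whether two series "show the same pattern" is to compute their Pearson correlation over a period. Using correlation here is an assumption (the slides do not name a similarity measure), and the price series are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)

# Hypothetical closing prices for two stocks over the same five days.
abc_power = [10.0, 11.0, 12.5, 12.0, 13.0]
xyz_securities = [40.0, 42.0, 45.0, 44.0, 46.0]

similarity = pearson(abc_power, xyz_securities)
```

A value close to 1 suggests the two series move together over that period; computing it separately for summer and winter windows would expose the seasonal difference in the second example.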

31
More about Association Rules
  • Retail Shops
  • Someone who buys bread is quite likely to buy
    milk
  • A person who bought the book Database System
    Concepts is quite likely to buy the book
    Operating Systems Concepts
  • Data consists of 2 parts
  • Transactions: customers' purchases
  • Items: things that were bought
  • For each transaction, there is a list of items

32
Example
33
Association Rule Form
  • X → Y
  • Where X = {x1, ..., xn} and Y = {y1, ..., ym} are sets of
    items
  • If a customer buys X, he/she is likely to buy Y
  • We need to automatically discover such
    association rules
  • Two concepts to measure the strength of an
    association
  • Support
  • Confidence

34
Support
  • Support of X → Y
  • Also called prevalence
  • Percentage of transactions that hold all the
    items of the union X ∪ Y
  • Probability that the 2 item sets occurred
    together
  • Estimated by
    (number of transactions that contain every item in X and Y) /
    (number of all transactions)
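
The estimate above translates directly into code. The four transactions below are hypothetical and are not claimed to be the table from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    matching = sum(1 for t in transactions if itemset <= t)
    return matching / len(transactions)

# Hypothetical transactions: each one is the set of items bought together.
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]

# For a rule X → Y, the support is computed over the union of X and Y.
milk_juice = support(transactions, {"milk"} | {"juice"})
bread_juice = support(transactions, {"bread"} | {"juice"})
```

With this data, two of the four transactions contain both milk and juice, and one of the four contains both bread and juice.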

35
Example
- Support of Milk → Juice?
- Support of Bread → Juice?
36
Large and Small Itemsets
  • Support threshold: user-specified
  • Large Itemsets
  • Sets of items that have a support that exceeds a
    certain threshold
  • Small Itemsets
  • Sets of items that have a support that is below
    a certain threshold

37
Confidence
  • Confidence of the rule X → Y
  • Conditional probability of a transaction
    containing item set Y given that it contains
    item set X
  • Estimated by
    (number of transactions that contain every item in X and Y) /
    (number of transactions that contain the items in X)
  • Confidence threshold: user-specified
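
A minimal sketch of the confidence estimate, again over hypothetical transactions that are not claimed to be the table from the slides:

```python
def confidence(transactions, x, y):
    """Estimate P(Y in transaction | X in transaction) for the rule X → Y."""
    x, y = set(x), set(y)
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        raise ValueError("no transaction contains X, confidence is undefined")
    with_x_and_y = sum(1 for t in with_x if y <= t)
    return with_x_and_y / len(with_x)

# Hypothetical transactions: each one is the set of items bought together.
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]

milk_to_juice = confidence(transactions, {"milk"}, {"juice"})
bread_to_juice = confidence(transactions, {"bread"}, {"juice"})
```

Here three transactions contain milk and two of them also contain juice, so the confidence of milk → juice is 2/3; two contain bread and one of them contains juice, giving 1/2.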

38
Example
- Confidence of Milk → Juice?
- Confidence of Bread → Juice?
39
Goal of Mining Association Rules
  • Generate all possible rules that exceed some
    minimum user-specified support and confidence
    thresholds

40
Generating Large Itemsets: Exploratory Method
  • Consider all possibilities
  • Example: three items a, b, and c
  • {a}, {b}, {c}, {a,b}, {b,c}, {a,c}, {a,b,c}
  • Works for a very small number of items
  • Very computation intensive if the number of items
    becomes large (thousands)
  • If the number of items is m, then the number of
    distinct item sets is 2^m (2^m - 1 without the
    empty set)
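
The exploratory method can be illustrated by enumerating every non-empty item set, which also shows how quickly the count grows. This is a sketch using Python's standard itertools module; the function name is hypothetical.

```python
from itertools import combinations

def all_itemsets(items):
    """List every non-empty subset of `items`, smallest subsets first."""
    items = sorted(items)
    return [
        set(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]

three_items = all_itemsets({"a", "b", "c"})      # the seven sets from the example
ten_items_count = len(all_itemsets(range(10)))   # 2**10 - 1 candidate sets
```

With thousands of items the same enumeration is infeasible, which motivates pruning the candidates.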

41
Generating Large Itemsets: A Priori Method
  • Idea
  • A subset of a large itemset must also be large
  • If {juice, milk, bread} is frequent, so is
    {juice, bread}
  • Every transaction having {juice, milk, bread}
    also contains {juice, bread}
  • Conversely, an extension of a small itemset is
    also small
  • If there is any itemset which is infrequent, its
    superset should not be generated/tested!
  • Overview
  • Only sets with single items are considered in the
    first pass.
  • In the second pass, sets with two items are
    considered, and so on.

42
Generating Large Itemsets: A Priori Method
(cont'd)
  • Test the support for itemsets of length 1, called
    1-itemsets, by scanning the database
  • Discard those that do not exceed the threshold
  • Extend the large 1-itemsets into 2-itemsets by
    appending one item each time, to generate all
    candidate itemsets of length two
  • Test the support for all candidate itemsets by
    scanning the database and eliminate those
    2-itemsets that do not meet the minimum support
  • Repeat the above steps.
  • At step k, the previously found (k-1)-itemsets
    are extended into k-itemsets and tested for
    minimum support
  • The process is repeated until no large itemsets
    can be found
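
The level-wise procedure above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the slides' own code: it treats "meets the minimum support" as support >= threshold, and the four-transaction database is hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every large itemset, found level by level."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Pass 1: large 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    large = {}
    k = 1
    while current:
        for itemset in current:
            large[itemset] = support(itemset)
        k += 1
        # Extend large (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune candidates that have any small (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Scan the database and keep candidates that meet the minimum support.
        current = {c for c in candidates if support(c) >= min_support}
    return large

# Hypothetical database of four transactions.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
large_itemsets = apriori(tdb, min_support=0.5)
```

On this data the loop stops after the third pass, because the only large 3-itemset, {B, C, E}, cannot be extended any further.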

43
The A Priori Method - Example
[Figure: A Priori run on Database TDB. 1st scan: C1, L1; 2nd scan: C2, L2; 3rd scan: C3, L3]
Frequency 50%, Confidence 100%: A → C, B → E, BC → E, CE → B, BE → C