Knowledge Discovery - PowerPoint PPT Presentation

1 / 30
About This Presentation
Title:

Knowledge Discovery

Description:

Knowledge Discovery & Data Mining process of extracting previously unknown, valid, and actionable (understandable) information from large databases – PowerPoint PPT presentation

Number of Views:176
Avg rating:3.0/5.0
Slides: 31
Provided by: Dr557
Category:

less

Transcript and Presenter's Notes

Title: Knowledge Discovery


1
Knowledge Discovery Data Mining
  • process of extracting previously unknown, valid,
    and actionable (understandable) information from
    large databases
  • Data mining is a step in the KDD process of
    applying data analysis and discovery algorithms
  • Machine learning, pattern recognition,
    statistics, databases, data visualization.
  • Traditional techniques may be inadequate
  • large data

2
Why Mine Data?
  • Huge amounts of data being collected and
    warehoused
  • Walmart records 20 millions transactions per day
  • WebLogs Millions of hits per day on major sites
  • health care transactions multi-gigabyte
    databases
  • Mobil Oil geological data of over 100 terabytes
  • Affordable computing
  • Competitive pressure
  • gain an edge by providing improved, customized
    services
  • information as a product in its own right

3
  • Knowledge discovery in databases (KDD) is the
    non-trivial process of identifying valid,
    potentially useful and ultimately understandable
    patterns in data

Data Mining
Clean, Collect, Summarize
Data Preparation
Training Data
Data Warehouse
Model Patterns
Verification, Evaluation
Operational Databases
4
Data mining
  • Pattern
  • 12121?
  • 12 pattern is found often enough So, with some
    confidence we can say ? is 2
  • If 1 then 2 follows
  • Pattern ? Model
  • Confidence
  • 121212?
  • 12121231212123121212?
  • 121212? 3
  • Models are created using historical data by
    detecting patterns. It is a calculated guess
    about likelihood of repetition of pattern.

5
  • Note Models and patterns A pattern can be
    thought of as an instantiation of a model. Eg.
    f(x) 3 x2 x is a pattern whereas f(x) ax2
    bx is considered a model.
  • Data mining involves fitting models to and
    determining patterns from observed data.

6
Data Mining
  • Prediction Methods
  • using some variables to predict unknown or future
    values of other variables
  • It uses database fields (predictors) for
    prediction model, using the field values we can
    make predictions
  • Descriptive Methods
  • finding human-interpretable patterns describing
    the data

7
Data Mining Techniques
  • Classification
  • Clustering
  • Association Rule Discovery
  • Sequential Pattern Discovery
  • Regression
  • Deviation Detection

8
Classification
  • Data defined in terms of attributes, one of which
    is the class
  • Find a model for class attribute as a function of
    the values of other(predictor) attributes, such
    that previously unseen records can be assigned a
    class as accurately as possible.
  • Training Data used to build the model
  • Test data used to validate the model (determine
    accuracy of the model)
  • Given data is usually divided into training and
    test sets.

9
Classification
  • Given old data about customers and payments,
    predict new applicants loan eligibility.

Previous customers
Classifier
Decision rules
Age Salary Profession Location Customer type
Salary gt 5 L
Good/ bad
Prof. Exec
New applicants data
10
Classification methods
  • Goal Predict class Ci f(x1, x2, .. Xn)
  • Regression (linear or any other polynomial)
  • Decision tree classifier divide decision space
    into piecewise constant regions.
  • Neural networks partition by non-linear
    boundaries
  • Probabilistic/generative models
  • Lazy learning methods nearest neighbor

11
Decision trees
  • Tree where internal nodes are simple decision
    rules on one or more attributes and leaf nodes
    are predicted class labels.

Salary lt 50 K
Prof teacher
Age lt 30
12
ClassificationExample
13
Decision Tree
Training Dataset
14
Output A Decision Tree for buys_computer
15
Algorithm for Decision Tree Induction
  • Basic algorithm (a greedy algorithm)
  • Tree is constructed in a top-down recursive
    divide-and-conquer manner
  • At start, all the training examples are at the
    root
  • Attributes are categorical (if continuous-valued,
    they are discretized in advance)
  • Examples are partitioned recursively based on
    selected attributes
  • Test attributes are selected on the basis of a
    heuristic or statistical measure (e.g.,
    information gain)
  • Conditions for stopping partitioning
  • All samples for a given node belong to the same
    class
  • There are no remaining attributes for further
    partitioning majority voting is employed for
    classifying the leaf
  • There are no samples left

16
Attribute Selection Measure Information Gain
(ID3/C4.5)
  • Select the attribute with the highest information
    gain
  • S contains si tuples of class Ci for i 1, ,
    m
  • information measures info required to classify
    any arbitrary tuple
  • entropy of attribute A with values a1,a2,,av
  • information gained by branching on attribute A

17
Attribute Selection by Information Gain
Computation
  • Class P buys_computer yes
  • Class N buys_computer no
  • I(p, n) I(9, 5) 0.940
  • Compute the entropy for age
  • means age lt30 has 5 out of 14
    samples, with 2 yeses and 3 nos. Hence
  • Similarly,

18
Classification Direct Marketing
  • Goal Reduce cost of soliciting (mailing) by
    targeting a set of consumers likely to buy a new
    product.
  • Data
  • for similar product introduced earlier
  • we know which customers decided to buy and which
    did not buy, not buy class attribute
  • collect various demographic, lifestyle, and
    company related information about all such
    customers - as possible predictor variables.
  • Learn classifier model

19
Classification Fraud detection
  • Goal Predict fraudulent cases in credit card
    transactions.
  • Data
  • Use credit card transactions and information on
    its account-holder as input variables
  • label past transactions as fraud or fair.
  • Learn a model for the class of transactions
  • Use the model to detect fraud by observing credit
    card transactions on a given account.

20
Clustering
  • Given a set of data points, each having a set of
    attributes, and a similarity measure among them,
    find clusters such that
  • data points in one cluster are more similar to
    one another
  • data points in separate clusters are less similar
    to one another.
  • Similarity measures
  • Euclidean distance, if attributes are continuous
  • Problem specific measures

21
Clustering Market Segmentation
  • Goal subdivide a market into distinct subsets of
    customers where any subset may conceivably be
    selected as a market target to be reached with a
    distinct marketing mix.
  • Approach
  • collect different attributes on customers based
    on geographical, and lifestyle related
    information
  • identify clusters of similar customers
  • measure the clustering quality by observing
    buying patterns of customers in same cluster vs.
    those from different clusters.

22
Association Rule Discovery
  • Given a set of records, each of which contain
    some number of items from a given collection
  • produce dependency rules which will predict
    occurrence of an item based on occurences of
    other items

23
Association Rule Basic Concepts
  • Given (1) database of transactions, (2) each
    transaction is a list of items (purchased by a
    customer in a visit)
  • Find all rules that correlate the presence of
    one set of items with that of another set of
    items
  • E.g., 98 of people who purchase tires and auto
    accessories also get automotive services done
  • Applications
  • ? Maintenance Agreement (What the store
    should do to boost Maintenance Agreement sales)
  • Home Electronics ? (What other products
    should the store stocks up?)
  • Attached mailing in direct marketing

24
Association Rule Basic Concepts
  • number of tuples containing both A and B
  • Support (A? B) ---------------------------------
    --------------
  • total number of tuples
  • number of tuples containing both A and B
  • Confidence (A? B) ------------------------------
    -----------
  • total number of tuples containg A

25
Rule Measures Support and Confidence
Customer buys both
  • Find all the rules X Y ? Z with minimum
    confidence and support
  • support, s, probability that a transaction
    contains X , Y , Z
  • confidence, c, conditional probability that a
    transaction having X , Y also contains Z

Customer buys d
Customer buys b
  • Let minimum support 50, and minimum confidence
    50, we have
  • A ? C (50, 66.6)
  • C ? A (50, 100)

26
Mining Association RulesAn Example
Min. support 50 Min. confidence 50
  • For rule A ? C
  • support support(A , C) 50
  • confidence support(A , C)/support(A) 66.6

27
Association RulesApplication
  • Marketing and Sales Promotion
  • Consider discovered rule
  • Bagels, --gt Potato Chips
  • Potato Chips as consequent can be used to
    determine what may be done to boost sales
  • Bagels as an antecedent can be used to see which
    products may be affected if bagels are
    discontinued
  • Can be used to see which products should be sold
    with Bagels to promote sale of Potato Chips

28
Association Rules Application
  • Supermarket shelf management
  • Goal to identify items which are bought together
    (by sufficiently many customers)
  • Approach process point-of-sale data (collected
    with barcode scanners) to find dependencies among
    items.
  • Example
  • If a customer buys Diapers and Milk, then he is
    very likely to buy Beer
  • so stack six-packs next to diapers?

29
Sequential Pattern Discovery
  • Given set of objects, each associated with its
    own timeline of events, find rules that predict
    strong sequential dependencies among different
    events, of the form (A B) (C) (D E) --gt (F)
  • xg max allowed time between consecutive
  • event-sets
  • ng min required time between consecutive
  • event sets
  • ws window-size, max time difference between
  • earliest and latest events in an event-set
    (events
  • within an event-set may occur in any order)
  • ms max allowed time between earliest and
  • latest events of the sequence.

30
Sequential Pattern Discovery Examples
  • Sequences in which customers purchase
    goods/services
  • Understanding long term customer behavior --
    timely promotions.
  • In point-of--sale transaction sequences
  • Computer bookstore
  • (Intro to Visual C) (C Primer) --gt (Perl
    for Dummies, Tcl/Tk)
  • Athletic Apparel Store
  • (Shoes) (Racket, Racquetball) --gt (Sports
    Jacket)
Write a Comment
User Comments (0)
About PowerShow.com