Knowledge Discovery - PowerPoint PPT Presentation

About This Presentation

Title:

Knowledge Discovery

Slides: 31

Provided by: mgmt

Learn more at: http://www.uic.edu

Transcript and Presenter's Notes

1
Knowledge Discovery & Data Mining

process of extracting previously unknown, valid,
and actionable (understandable) information from
large databases
Data mining is a step in the KDD process of
applying data analysis and discovery algorithms
Machine learning, pattern recognition,
statistics, databases, data visualization.
Traditional techniques may be inadequate
for large data

2
Why Mine Data?

Huge amounts of data being collected and
warehoused
Walmart records 20 million transactions per day
health care transactions: multi-gigabyte
databases
Mobil Oil: geological data of over 100 terabytes
Affordable computing
Competitive pressure
gain an edge by providing improved, customized
services
information as a product in its own right

Knowledge discovery in databases (KDD) is the
non-trivial process of identifying valid,
potentially useful and ultimately understandable
patterns in data

[Figure: KDD process diagram with the labels Operational Databases; Clean, Collect, Summarize; Data Warehouse; Data Preparation; Training Data; Data Mining; Model, Patterns; Verification, Evaluation]

4
Data mining algorithm components

Model representation
descriptions of discovered patterns
overly limited representation -- unable to
capture data patterns
too powerful -- potential for overfit
(decision trees, rules, linear/non-linear
regression and classification,
nearest neighbor and case-based reasoning
methods, graphical dependency models)
Model evaluation criteria
how well a pattern (model) meets goals (fit
function)
e.g., accuracy, novelty, etc.
Search method
parameter search: optimization of parameters
for a given model representation
model search: considers a family of models
Different methods suit different problems.
Proper problem formulation crucial.

Note: Models and patterns. A pattern can be
thought of as an instantiation of a model. E.g.,
f(x) = 3x^2 + x is a pattern, whereas f(x) = ax^2 + bx
is considered a model.
Data mining involves fitting models to and
determining patterns from observed data.
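
To make the model/pattern distinction concrete, here is a minimal sketch that fits the model f(x) = ax^2 + bx to observed points by least squares and recovers a specific pattern. The function name and the sample data are illustrative assumptions, not part of the original slides.

```python
def fit_quadratic_through_origin(xs, ys):
    """Least-squares fit of the model f(x) = a*x^2 + b*x; returns (a, b)."""
    # Sums needed for the 2x2 normal equations over the basis {x^2, x}.
    s44 = sum(x ** 4 for x in xs)
    s33 = sum(x ** 3 for x in xs)
    s22 = sum(x ** 2 for x in xs)
    t2 = sum(x ** 2 * y for x, y in zip(xs, ys))
    t1 = sum(x * y for x, y in zip(xs, ys))
    det = s44 * s22 - s33 * s33
    if det == 0:
        raise ValueError("need at least two distinct non-zero x values")
    # Cramer's rule on [[s44, s33], [s33, s22]] @ [a, b] = [t2, t1].
    a = (t2 * s22 - t1 * s33) / det
    b = (s44 * t1 - s33 * t2) / det
    return a, b


# Hypothetical observations generated from the pattern f(x) = 3x^2 + x.
xs = [-2, -1, 0, 1, 2, 3]
ys = [3 * x ** 2 + x for x in xs]
a, b = fit_quadratic_through_origin(xs, ys)
print(f"fitted pattern: f(x) = {a:.2f}x^2 + {b:.2f}x")
```

The model is the family of curves with free parameters a and b; the fitted values are the pattern found in this particular data set.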

6
Knowledge Discovery Process

Goal
understanding the application domain, and goals
of KDD effort
Data selection, acquisition, integration
Data cleaning
noise, missing data, outliers, etc.
Exploratory data analysis
dimensionality reduction, transformations
selection of appropriate model for analysis,
hypotheses to test
Data mining
selecting an appropriate method that matches the set goals
(classification, regression, clustering, etc)
selecting algorithm
Testing and verification
Interpretation
Consolidation and use

7
[Figure: bar chart, "Effort for each data-mining process step" (vertical axis 0 to 100), with bars for Business Objective Determination, Data Preparation, Data Mining, and Analysis of Results and Knowledge Assimilation]
8
Issues and challenges

large data
number of variables (features), number of cases
(examples)
multi gigabyte, terabyte databases
efficient algorithms, parallel processing
high dimensionality
large number of features: exponential increase in
search space
potential for spurious patterns
dimensionality reduction
Overfitting
models noise in training data, rather than just
the general patterns
Changing data, missing and noisy data
Use of domain knowledge
utilizing knowledge on complex data
relationships, known facts
Understandability of patterns

9
Data Mining

Prediction Methods
using some variables to predict unknown or future
values of other variables
Descriptive Methods
finding human-interpretable patterns describing
the data

10
Data Mining Tasks

Classification
Clustering
Association Rule Discovery
Sequential Pattern Discovery
Regression
Deviation Detection

11
Classification

Data defined in terms of attributes, one of which
is the class
Find a model for class attribute as a function of
the values of other (predictor) attributes, such
that previously unseen records can be assigned a
class as accurately as possible.
Training Data used to build the model
Test data used to validate the model (determine
accuracy of the model)
Given data is usually divided into training and
test sets.
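
The train/test procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the records are made up, and a 1-nearest-neighbor rule stands in for whatever classifier is actually chosen; the slides do not prescribe either.

```python
import random


def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train, features):
    """Return the class of the training record closest to `features`."""
    nearest = min(
        train,
        key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], features)),
    )
    return nearest[1]


def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for feats, label in test if predict_1nn(train, feats) == label)
    return correct / len(test)


# Hypothetical records: (predictor attributes, class attribute).
records = [((i * 0.1, i * 0.1), "low") for i in range(10)]
records += [((5 + i * 0.1, 5 + i * 0.1), "high") for i in range(10)]

train, test = split_train_test(records)
print(len(train), len(test), accuracy(train, test))
```

The model only ever sees the training records; accuracy is measured on records it has not seen, which is what the test set is for.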

12
Classification: Example
13
Classification: Direct Marketing

Goal: Reduce cost of soliciting (mailing) by
targeting a set of consumers likely to buy a new
product.
Data
for similar product introduced earlier
we know which customers decided to buy and which
did not; this buy / not buy decision is the class attribute
collect various demographic, lifestyle, and
company related information about all such
customers - as possible predictor variables.
Learn classifier model

14
Classification: Fraud detection

Goal: Predict fraudulent cases in credit card
transactions.
Data
Use credit card transactions and information on
its account-holder as input variables
label past transactions as fraud or fair.
Learn a model for the class of transactions
Use the model to detect fraud by observing credit
card transactions on a given account.

15
Clustering

Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
data points in one cluster are more similar to
one another
data points in separate clusters are less
similar to one another.
Similarity measures
Euclidean distance if attributes are continuous
Problem specific measures
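
As a sketch of clustering with Euclidean distance as the similarity measure, the following uses a basic k-means loop. K-means is one possible method, chosen here for illustration; the points and the starting centers are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, centers, iterations=10):
    """Basic k-means: assign each point to its nearest center, then recompute centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
            clusters[idx].append(p)
        # New center = attribute-wise mean; an empty cluster keeps its old center.
        centers = [
            tuple(sum(vals) / len(cluster) for vals in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(points, centers=[points[0], points[3]])
print(centers)
print(clusters)
```

For non-continuous attributes, `euclidean` would be swapped for a problem-specific measure; the rest of the loop stays the same.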

16
Clustering: Market Segmentation

Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
Approach
collect different attributes on customers based
on geographical, and lifestyle related
information
identify clusters of similar customers
measure the clustering quality by observing
buying patterns of customers in same cluster vs.
those from different clusters.

17
Association Rule Discovery

Given a set of records, each of which contains
some number of items from a given collection,
produce dependency rules which will predict
occurrence of an item based on occurrences of
other items
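
One common way to score such a dependency rule is by its support and confidence. These two measures are not named in the slides and are used here as an assumption; the baskets are made-up data.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count_containing(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    n_antecedent = count_containing(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0
    n_both = count_containing(transactions, set(antecedent) | set(consequent))
    return n_both / n_antecedent


# Hypothetical market baskets.
baskets = [
    {"bagels", "potato chips", "milk"},
    {"bagels", "potato chips"},
    {"bagels", "milk"},
    {"potato chips", "beer"},
    {"bagels", "potato chips", "beer"},
]

print(support(baskets, {"bagels", "potato chips"}))
print(confidence(baskets, {"bagels"}, {"potato chips"}))
```

A rule is usually kept only if both numbers clear chosen thresholds: support filters out rules seen in too few records, confidence filters out weak dependencies.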

18
Association Rules: Application

Marketing and Sales Promotion
Consider discovered rule
Bagels, ... --> Potato Chips
Potato Chips as consequent can be used to
determine what may be done to boost sales
Bagels as an antecedent can be used to see which
products may be affected if bagels are
discontinued
Can be used to see which products should be sold
with Bagels to promote sale of Potato Chips

19
Association Rules: Application

Supermarket shelf management
Goal: to identify items which are bought together
(by sufficiently many customers)
Approach: process point-of-sale data (collected
with barcode scanners) to find dependencies among
items.
Example
If a customer buys Diapers and Milk, then he is
very likely to buy Beer
so stack six-packs next to diapers?

20
Sequential Pattern Discovery

Given set of objects, each associated with its
own timeline of events, find rules that predict
strong sequential dependencies among different
events, of the form (A B) (C) (D E) --> (F)

xg: max allowed time between consecutive
event-sets
ng: min required time between consecutive
event-sets
ws: window-size, max time difference between
earliest and latest events in an event-set
(events within an event-set may occur in any order)
ms: max allowed time between earliest and
latest events of the sequence.
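
The four timing constraints can be checked against one candidate occurrence of a sequence, as in the sketch below. It assumes that the gap between consecutive event-sets is measured from the latest event of one set to the earliest event of the next; the slides do not pin this down, and other definitions exist. The function name and timestamps are hypothetical.

```python
def satisfies_timing(event_sets, xg, ng, ws, ms):
    """Check one occurrence, given as a list of event-sets (each a list of timestamps)."""
    # ws: events inside one event-set must fall within the window size.
    for times in event_sets:
        if max(times) - min(times) > ws:
            return False
    # ng / xg: gap between consecutive event-sets (assumed: earliest of next minus latest of previous).
    for prev, nxt in zip(event_sets, event_sets[1:]):
        gap = min(nxt) - max(prev)
        if gap < ng or gap > xg:
            return False
    # ms: total span from the earliest to the latest event of the sequence.
    all_times = [t for times in event_sets for t in times]
    return max(all_times) - min(all_times) <= ms


# Hypothetical occurrence of (A B) (C) (D E), timestamps in days.
occurrence = [[1, 2], [5], [8, 9]]
print(satisfies_timing(occurrence, xg=5, ng=1, ws=2, ms=10))
print(satisfies_timing(occurrence, xg=5, ng=1, ws=2, ms=7))
```

The second call fails only on ms: the occurrence spans 8 days, which exceeds the allowed 7.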

21
Sequential Pattern Discovery: Examples

sequences in which customers purchase
goods/services
understanding long term customer behavior --
timely promotions.
In point-of-sale transaction sequences
Computer bookstore
(Intro to Visual C) (C Primer) --> (Perl for Dummies, TCL/TK)
Athletic Apparel Store
(Shoes) (Racket, Racketball) --> (Sports Jacket)

22
Regression

Predict a value of a given continuous-valued
variable (dependent variable) based on values of
other variables (independent variables)
Statistics, Neural networks, Genetic algorithms
Examples
predicting sales volumes of new product based on
advertising expenditure
Time series prediction of stock market indices.
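
For the sales-volume example, a minimal sketch is an ordinary least-squares line with advertising expenditure as the independent variable. The figures are invented for illustration, and a straight line is only one of many possible regression models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical history: advertising expenditure vs. sales volume.
advertising = [10, 20, 30, 40]
sales = [25, 45, 65, 85]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 50 + intercept  # predicted sales at an expenditure of 50
print(slope, intercept, predicted)
```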

23
Visualization

complement to other DM techniques like
Segmentation, etc.

24
Sample Data Mining Plan: Example

Bank concerned about attrition for its Demand
Deposit Accounts
identify customers likely to leave, with
sufficient warning of impending attrition to
allow for some intervention (signature for
impending attrition?)
Hypothesis testing
transaction data may be insufficient
explore ideas about why customers might leave,
and how to identify
e.g., regular bi-weekly direct deposit ceases: new
job and no longer using direct deposits
got married and spouse used another bank
reduction in balance and number of transactions,
last-name change request

Data requirements
Careful attention to data generated by internal
decisions
bank started charging for debit card transactions
that were free
bank turned down loan or credit increase request
Is the data available?
Preparing data for analysis
Exploratory analysis of data
queries, OLAP, hypothesis testing
association rules
Knowledge Discovery plan
classes of customers rather than an overall
signature of attrition?
Deviation from normal behavior indicating
attrition potential

Preparing data for analysis
data organized over time-windows
demographic profiles
Clustering
unsupervised
models for different clusters
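
The idea of flagging a deviation from normal behavior over time-windows can be sketched with a simple z-score on a customer's own history. The z-score, the threshold of -3, and the monthly counts are all assumptions made for illustration; the slides do not specify a method.

```python
import math
from statistics import mean, pstdev


def deviation_score(history, latest):
    """Z-score of the latest time-window value against the customer's own history."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Perfectly constant history: any change is an unbounded deviation.
        return 0.0 if latest == mu else math.copysign(math.inf, latest - mu)
    return (latest - mu) / sigma


# Hypothetical monthly transaction counts for one account.
history = [20, 22, 19, 21, 18]
latest = 8

score = deviation_score(history, latest)
at_risk = score < -3  # assumed threshold for a sharp drop in activity
print(round(score, 2), at_risk)
```

Scoring each customer against their own past windows, rather than against all customers, fits the plan's note about classes of customers instead of one overall attrition signature.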

27
Example: improving direct mail responses

Direct mailing for home equity line of credit
(HELOC)
prospects are existing demand deposit account
(DDA) customers
use info. on lifetime value of existing customers
to derive model to predict customers likely to be
the most profitable long-term prospects

28
Example

Data
DDA history of loan balances over 3, 6, 9, 12, 18
months, returned checks
demographic data (age, income, length of
residence, etc.), both internal and external
property data sourced externally (home purchase
price, loan-to-value ratio, etc.)
credit worthiness data
response to previous mailings
120 variables selected
less than half the DDAs had history records
missing fields (45 K cases remaining for use --
prospects database)
exclude variables like sex, race, age (legal
restrictions)
Neural network (radial basis function) model for
value prediction

29
Example

Training data
randomly sample from prospects database weighted
to include more responders than present in actual
data
Validation
rank on likelihood of response
consider top and bottom 10 -- use visualization,
decision tree to understand rationale for
obtained classification
Testing
sample from prospects database unweighted with
normal proportion of responders and
non-responders
gains (lift) chart
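
The numbers behind a gains (lift) chart can be computed as in the sketch below: rank prospects by predicted likelihood of response, then track what share of all responders falls in each top slice. The scores and outcomes are invented, and the function name is an assumption.

```python
def cumulative_gains(scores, responded, bins=10):
    """Share of all responders captured in the top 1/bins, 2/bins, ... of ranked prospects."""
    ranked = [
        r for _, r in sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    ]
    total = sum(ranked)
    if total == 0:
        raise ValueError("no responders in the test sample")
    n = len(ranked)
    gains = []
    for b in range(1, bins + 1):
        cutoff = round(n * b / bins)
        gains.append(sum(ranked[:cutoff]) / total)
    return gains


# Hypothetical unweighted test sample: model score and actual response (1 = responded).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
responded = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

gains = cumulative_gains(scores, responded, bins=5)
lift_top_bin = gains[0] / (1 / 5)  # gain divided by the fraction of prospects mailed
print(gains, lift_top_bin)
```

A lift above 1 in the top slices means mailing the highest-ranked prospects reaches responders faster than mailing at random.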