... observations to form a model of the importan presentation

1
Chapter 32 Data Mining
CS 522 Fall 2001

Instructor Paul Chen

2
Descriptive (Operational, OLTP): The dealer sold
200 cars last month.
Explanatory (Traditional DW, OLAP): For every 1%
increase in the interest rate, auto sales
decrease by 5%.
Predictive (Data Mining): Predictions about
future buyer behavior.
3
Data Mining and OLAP

They are two separate breeds of analysis with
entirely different objectives, not to mention
tools, skill sets, and implementation methods.

4
Data Mining

With canned reports, ad hoc querying, and
OLAP, the end user defines a hypothesis and
determines which data to examine. With data
mining, the tool identifies the hypothesis, and
it actually tells the user where in the data to
start the exploration process.

5
Data Mining

Rather than using SQL to filter out values and
methodically reduce the data into a concise
answer set, data mining uses algorithms that
exhaustively review the relationships among data
elements to determine if any patterns exist. The
whole purpose of data mining is to yield new
business information that a business person can
act on.

6
The Data Mining Process

Define the problem.
Select the data.
Prepare the data.
Mine the data.
Deploy the model.
Take business action.

7
Define the problem

A successful data mining initiative always starts
with a well-defined project. To ensure that the
project produces incremental value, include an
assessment of the status quo solution and a
review of technology, organization, and business
processes.

8
Select the data

This step involves defining your data source
(not every data source and record is required).
The data is usually extracted from the source
system to a separate server.

9
Prepare the data

This step represents up to 80 percent of the
total project effort. For data mining, the data
must reside in one flat table (each record has
many columns). In addition to being the most
time-consuming, this step is also the most
critical. The resulting models are only as good
as the data used to create them.
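
As a minimal sketch of the "one flat table" requirement, the Python below collapses several purchase rows into a single wide record per customer. The field names (`customer`, `product`, `amount`) and the sample rows are assumptions made up for illustration; they do not come from the presentation.

```python
from collections import defaultdict

# Hypothetical source rows: one row per purchase (illustrative data only).
purchases = [
    {"customer": "c1", "product": "bread", "amount": 2.5},
    {"customer": "c1", "product": "milk", "amount": 1.0},
    {"customer": "c2", "product": "bread", "amount": 2.5},
]

def flatten(rows):
    """Collapse many purchase rows into one wide record per customer."""
    products = sorted({r["product"] for r in rows})
    # Every customer record starts with a zero total and one 0/1 flag per product.
    flat = defaultdict(
        lambda: {"total_spent": 0.0, **{f"bought_{p}": 0 for p in products}}
    )
    for r in rows:
        record = flat[r["customer"]]
        record["total_spent"] += r["amount"]
        record[f"bought_{r['product']}"] = 1
    return dict(flat)

flat_table = flatten(purchases)
for customer, record in flat_table.items():
    print(customer, record)
```

Each customer ends up as one record with many columns, which is the shape most mining algorithms expect as input.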

10
Mine the data

Typically the easiest and shortest phase, this
step involves applying statistical and AI tools
to create mathematical models. Data mining
typically occurs on a server separate from the
data warehouse and other corporate systems.

11
Deploy the Model

Model deployment is the process of implementing
the mathematical models into operational systems
to improve business results.

12
Take Business Action

Use the deployed model to achieve improved
results for the business problem identified at
the beginning of the process.

13
Data Mining Tools

Data mining tools are typically classified by the
type of algorithm they use to identify hidden
patterns. There are many different algorithms in
use, but the four most popular are association,
sequence, clustering (or segmentation), and
predictive modeling.

14
Data Mining Tools

ASSOCIATION
Association, also frequently referred to as
"affinity analysis," reviews numerous sets of
items and looks for common groupings. An example
of association is market basket analysis, which
involves reviewing the products that consumers
purchase in a single trip to the grocery store.

15
ASSOCIATION

Finds items that imply the presence of other
items in the same event.
Affinities between items are represented by
association rules.
e.g. When a customer rents property for more
than 2 years and is more than 25 years old, in
40% of cases, the customer will buy a property.
This association happens in 35% of all customers
who rent properties.
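
An association rule is usually summarized by two numbers: how often the rule applies at all (support) and how often the consequent holds when the antecedent does (confidence). The sketch below computes both for a market-basket rule, using the common convention that support is the fraction of all baskets containing both sides of the rule. The function name `rule_metrics` and the sample baskets are illustrative assumptions, not part of the presentation.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = baskets containing both sides / all baskets
    confidence = baskets containing both sides / baskets containing the antecedent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [b for b in baskets if antecedent <= b]
    with_both = [b for b in with_antecedent if consequent <= b]
    support = len(with_both) / len(baskets)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Illustrative baskets: each set is one trip to the grocery store.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

support, confidence = rule_metrics(baskets, {"bread"}, {"milk"})
print(f"bread -> milk: support={support:.2f}, confidence={confidence:.2f}")
```

In this sample, bread and milk appear together in 2 of 4 baskets (support 0.50), and milk appears in 2 of the 3 baskets that contain bread (confidence about 0.67).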

16
Data Mining Tools

SEQUENCE
Sequential analysis helps data miners
identify a set of order-specific items or events.
Association identifies the existence of patterns
or groups of items; sequential analysis
identifies the order of those patterns or groups
of items.

17
SEQUENCE

Finds patterns between events such that the
presence of one set of items is followed by
another set of items in a database of events over
a period of time.
e.g. Used to understand long-term customer
buying behavior.
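
To make the role of order concrete, here is a small sketch that counts how many customers show the pattern "a is later followed by b". It is a simplified pair counter, not a full sequential-pattern algorithm, and the function name `followed_by_counts` and the purchase histories are assumptions for illustration.

```python
from collections import Counter

def followed_by_counts(histories):
    """Count, per customer, each ordered pair (a, b) where a occurs before b."""
    counts = Counter()
    for events in histories:
        seen_pairs = set()  # count each pair at most once per customer
        for i, a in enumerate(events):
            for b in events[i + 1:]:
                if a != b:
                    seen_pairs.add((a, b))
        counts.update(seen_pairs)
    return counts

# Illustrative purchase histories, each listed in time order.
histories = [
    ["house", "cooker", "freezer"],
    ["house", "freezer"],
    ["cooker", "house"],
]

counts = followed_by_counts(histories)
print(counts[("house", "freezer")])  # customers who bought a freezer after a house
print(counts[("freezer", "house")])  # the reverse order is counted separately
```

Unlike association, the pair (house, freezer) and the pair (freezer, house) are different patterns here.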

18
Link Analysis - Similar Time Sequence Discovery

Finds links between two sets of data that are
time-dependent, and is based on the degree of
similarity between the patterns that both time
series demonstrate.
e.g. Within three months of buying property,
new home owners will purchase goods such as
cookers, freezers, and washing machines.
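
One simple way to score the similarity of two time series is to correlate them at different time shifts and keep the shift with the highest score. The sketch below does this with Pearson correlation; the choice of correlation as the similarity measure, the function names, and the monthly figures are all assumptions for illustration rather than the method the presentation describes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def best_lag(leading, following, max_lag):
    """Return (lag, scores): the shift at which `following` best tracks `leading`."""
    scores = {}
    for lag in range(max_lag + 1):
        # Pair leading[t] with following[t + lag].
        x = leading[: len(leading) - lag]
        y = following[lag:]
        scores[lag] = pearson(x, y)
    return max(scores, key=scores.get), scores

# Illustrative monthly counts: appliance sales echo home sales two months later.
home_sales = [1, 3, 2, 5, 4, 6, 3, 2]
appliance_sales = [0, 0, 1, 3, 2, 5, 4, 6]

lag, scores = best_lag(home_sales, appliance_sales, max_lag=3)
print(f"best lag: {lag} months")
```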

19
Data Mining Tools

CLUSTERING
Cluster analysis lets the data miner assemble
data into unforeseen groups containing similar
characteristics. Also known as "segmentation,"
this type of data mining is probably the most
widely used.

20
CLUSTERING

Aim is to partition a database into an unknown
number of segments, or clusters, of similar
records.
Uses unsupervised learning to discover
homogeneous sub-populations in a database to
improve the accuracy of the profiles.
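
The sketch below illustrates unsupervised partitioning into a number of segments that is not fixed in advance. It uses a simple "leader" clustering scheme, chosen here only because it is short: each record joins the first cluster whose leader is within a given radius, or starts a new cluster. The algorithm choice, the `radius` parameter, and the sample records are assumptions for illustration.

```python
import math

def leader_cluster(points, radius):
    """Group points: join the first cluster whose leader is within radius, else start one."""
    leaders, clusters = [], []
    for p in points:
        for i, leader in enumerate(leaders):
            if math.dist(p, leader) <= radius:
                clusters[i].append(p)
                break
        else:
            # No existing leader is close enough, so p founds a new cluster.
            leaders.append(p)
            clusters.append([p])
    return clusters

# Illustrative records: (age, income in thousands).
customers = [(25, 30), (27, 32), (60, 90), (62, 88), (26, 31)]

clusters = leader_cluster(customers, radius=10)
for i, members in enumerate(clusters):
    print(f"segment {i}: {members}")
```

No class labels are supplied; the number of segments falls out of the data and the chosen radius. The result depends on record order and on attribute scaling, so real tools typically normalize the attributes first.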

21
Data Mining Tools

PREDICTIVE MODELING
As the name implies, predictive modeling
involves developing a model from historical data
for predicting a future event. The power of
predictive modeling engines is that they can use
a broad range of data attributes to identify
future behavior. Both cluster analysis and
predictive modeling tools identify distinct
groups of items with common attributes; the
difference is that predictive modeling focuses on
the likelihood of a particular outcome for a
particular group.

22
PREDICTIVE MODELING

Similar to the human learning experience in that
it uses observations to form a model of the
important characteristics of some phenomenon.
Uses generalizations of the real world and the
ability to fit new data into a general framework.
Can analyze a database to determine essential
characteristics (model) about the data set.

23
PREDICTIVE MODELING

There are two techniques associated with
predictive modeling: classification and value
prediction, which are distinguished by the nature
of the variable being predicted.

24
PREDICTIVE MODELING-classification

Used to establish a specific predetermined class
for each record in a database from a finite set
of possible class values.
Two specializations of classification: tree
induction and neural induction.

25
[Figure: decision tree. Internal nodes test
car = taurus, city = seattle, and age < 45, each
with y/n branches; leaves are labeled likely or
unlikely.]
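
A decision tree produced by tree induction can be read as nested if/else tests that end in a class label. The sketch below hand-codes a tree over the three attributes that appear on this slide (car, city, age). The exact branch-to-leaf layout cannot be recovered from the transcript, so the arrangement below is an assumed reading, and the function name `classify` and the sample records are hypothetical.

```python
def classify(record):
    """Walk a hand-built decision tree and return the class label.

    ASSUMPTION: the branch layout is one plausible reading of the slide,
    not a confirmed copy of the original figure.
    """
    if record["car"] == "taurus":
        # Taurus owners are split on city.
        return "likely" if record["city"] == "seattle" else "unlikely"
    # Everyone else is split on age.
    return "likely" if record["age"] < 45 else "unlikely"

# Illustrative records only.
records = [
    {"car": "taurus", "city": "seattle", "age": 50},
    {"car": "taurus", "city": "boston", "age": 30},
    {"car": "civic", "city": "seattle", "age": 30},
    {"car": "civic", "city": "boston", "age": 60},
]

for r in records:
    print(r, "->", classify(r))
```

Every record lands in exactly one leaf, so each receives one class from the finite set {likely, unlikely}.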
26
Example of Classification using Neural Induction
27
PREDICTIVE MODELING- Value Prediction

Used to estimate a continuous numeric value that
is associated with a database record.
Uses the traditional statistical techniques of
linear regression and nonlinear regression.
Relatively easy-to-use and understand.

28
PREDICTIVE MODELING- Value Prediction

Linear regression attempts to fit a straight line
through a plot of the data, such that the line is
the best representation of the average of all
observations at that point in the plot.
Problem is that the technique only works well
with linear data and is sensitive to the presence
of outliers (that is, data values that do not
conform to the expected norm).
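
The outlier sensitivity described above is easy to see with a small ordinary least squares fit. The sketch below fits a line to five points twice, once with clean data and once with a single outlier; the function name `fit_line` and the data values are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]
clean = [2, 4, 6, 8, 10]          # exactly y = 2x
with_outlier = [2, 4, 6, 8, 30]   # same data, last value replaced by an outlier

slope_clean, intercept_clean = fit_line(xs, clean)
slope_out, intercept_out = fit_line(xs, with_outlier)

print(f"clean:        y = {slope_clean:.1f}x + {intercept_clean:.1f}")
print(f"with outlier: y = {slope_out:.1f}x + {intercept_out:.1f}")
```

With the clean data the fit recovers a slope of 2; replacing one value with 30 pulls the slope to 6 and the intercept to -8, so a single bad record changes the whole model.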

29
PREDICTIVE MODELING- Value Prediction

Although nonlinear regression avoids the main
problems of linear regression, it is still not
flexible enough to handle all possible shapes of
the data plot.
Statistical measurements are fine for building
linear models that describe predictable data
points; however, most data is not linear in
nature.

30
PREDICTIVE MODELING- Value Prediction

Data mining requires statistical methods that can
accommodate non-linearity, outliers, and
non-numeric data.
Applications of value prediction include credit
card fraud detection or target mailing list
identification.

31
ARE YOU READY FOR DATA MINING?

Just because you have a data warehouse doesn't
mean you're necessarily ready for data mining.
Much of the work our company does in the data
mining arena has more to do with data mining
readiness assessment than with actually
performing data mining.

32
Metrics you can use to gauge your data mining readiness

Do you have a staff of experienced knowledge
workers?
Do you have the data?
Do you have marketing processes in place that can
use this data?
Do you have a business champion who can embrace
the process and results?
Do you have the technology infrastructure to
support advanced analysis?

33
OLAP vs. Mining Tools

OLAP tools
Are ad hoc, shrink-wrapped tools that provide
an interface to data
Are used when you have specific questions
Look and feel like a spreadsheet that allows
rotation, slicing, and graphics
Can be deployed to a large number of users

Mining tools
Methods for analyzing multiple data types
-- Regression trees
-- Neural networks
-- Genetic algorithms
Usually textual in nature
Usually deployed to a small number of analysts
