... observations to form a model of the importan PowerPoint PPT Presentation

1
Chapter 32 Data Mining
CS 522 Fall 2001
  • Instructor: Paul Chen

2
  • Descriptive: "The dealer sold 200 cars last month."
    Operational systems (OLTP).
  • Explanatory: "For every increase of 1% in the interest
    rate, auto sales decrease by 5%." Traditional DW and
    OLAP.
  • Predictive: predictions about future buyer behavior.
    Data mining.

3
Data Mining and OLAP
  • They are two separate breeds of analysis with
    entirely different objectives, not to mention
    tools, skill sets, and implementation methods.

4
Data Mining
  • With canned reports, ad hoc querying, and OLAP,
    the end user defines a hypothesis and determines
    which data to examine.
  • With data mining, the tool identifies the
    hypothesis, and it actually tells the user where
    in the data to start the exploration process.

5
Data Mining
  • Rather than using SQL to filter out values and
    methodically reduce the data into a concise answer
    set, data mining uses algorithms that exhaustively
    review the relationships among data elements to
    determine if any patterns exist.
  • The whole purpose of data mining is to yield new
    business information that a business person can
    act on.

6
The Data Mining Process
  • Define the problem.
  • Select the data.
  • Prepare the data.
  • Mine the data.
  • Deploy the model.
  • Take business action.

7
Define the problem
  • A successful data mining initiative always starts
    with a well-defined project. To ensure that the
    project produces incremental value, include an
    assessment of the status quo solution and a review
    of technology, organization, and business
    processes.

8
Select the data
  • This step involves defining your data source (not
    every data source and record is required). The
    data is usually extracted from the source system
    to a separate server.

9
Prepare the data
  • This step represents up to 80 percent of the
    total project effort. For data mining, the data
    must reside in one flat table (each record has
    many columns). In addition to being the most
    time-consuming, this step is also the most
    critical. The resulting models are only as good
    as the data used to create them.
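
As a rough illustration of the "one flat table" requirement, the sketch below collapses two related tables into one record per customer. The table names, column names, and values are hypothetical and are not taken from the slides; a real preparation step would also involve cleaning and validating the data.

```python
from collections import defaultdict

# Hypothetical source data: one customer table and one purchase table.
customers = [
    {"customer_id": 1, "age": 34, "city": "Seattle"},
    {"customer_id": 2, "age": 51, "city": "Tacoma"},
]
purchases = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 45.5},
]


def build_flat_table(customers, purchases):
    """Return one record per customer with purchases summarized as columns."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for p in purchases:
        totals[p["customer_id"]] += p["amount"]
        counts[p["customer_id"]] += 1

    flat = []
    for c in customers:
        cid = c["customer_id"]
        row = dict(c)  # copy so the source record is not modified
        row["purchase_count"] = counts[cid]
        row["purchase_total"] = totals[cid]
        flat.append(row)
    return flat


flat_table = build_flat_table(customers, purchases)
```

Each row of `flat_table` now carries every attribute a mining tool would need for that customer, which is the shape the slide describes.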

10
Mine the data
  • Typically the easiest and shortest phase, this
    step involves applying statistical and AI tools
    to create mathematical models. Data mining
    typically occurs on a server separate from the
    data warehousing and other corporate systems.

11
Deploy the Model
  • Model deployment is the process of implementing
    the mathematical models into operational systems
    to improve business results.

12
Take Business Action
  • Use the deployed model to achieve improved
    results on the business problem identified at the
    beginning of the process.

13
Data Mining Tools
  • Data mining tools are typically classified by the
    type of algorithm they use to identify hidden
    patterns. There are many different algorithms in
    use, but the four most popular are association,
    sequence, clustering (or segmentation), and
    predictive modeling.

14
Data Mining Tools
  • ASSOCIATION
  • Association, also frequently referred to as
    "affinity analysis," reviews numerous sets of
    items and looks for common groupings. An example
    of association is market basket analysis, which
    involves reviewing the products that consumers
    purchase in a single trip to the grocery store.

15
ASSOCIATION
  • Finds items that imply the presence of other
    items in the same event.
  • Affinities between items are represented by
    association rules.
  • e.g. When a customer rents property for more
    than 2 years and is more than 25 years old, in
    40% of cases the customer will buy a property.
    This association happens in 35% of all customers
    who rent properties.
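
Association rules are usually scored with two numbers, commonly called support and confidence. In the rental example, the 40% figure plays the role of confidence and the 35% figure is read here as the rule's support. The sketch below computes both for a single rule over a handful of made-up market baskets; the function name and the data are assumptions for illustration only.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = share of all records containing antecedent and consequent
    confidence = share of records containing the antecedent that also
                 contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [r for r in records if antecedent <= r]
    with_both = [r for r in with_antecedent if consequent <= r]
    support = len(with_both) / len(records) if records else 0.0
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical grocery baskets, one set of items per store trip.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
support, confidence = rule_metrics(baskets, {"bread"}, {"milk"})
```

Here the rule "bread implies milk" holds in 3 of 5 baskets (support 0.6) and in 3 of the 4 baskets that contain bread (confidence 0.75).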

16
Data Mining Tools
  • SEQUENCE
  • Sequential analysis helps data miners identify a
    set of order-specific items or events.
    Association identifies the existence of patterns
    or groups of items; sequential analysis
    identifies the order of those patterns or groups
    of items.

17
SEQUENCE
  • Finds patterns between events such that the
    presence of one set of items is followed by
    another set of items in a database of events over
    a period of time.
  • e.g. Used to understand long term customer
    buying behavior.
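
One simple way to look for order-specific patterns is to count, across customers, how often one event precedes another. The sketch below does only that; real sequence-mining tools are considerably more elaborate, and the event names and histories here are hypothetical.

```python
from collections import Counter


def ordered_pair_counts(histories):
    """Count, per customer, each ordered pair (a, b) where a occurs before b."""
    counts = Counter()
    for events in histories.values():
        seen_pairs = set()
        for i, first in enumerate(events):
            for later in events[i + 1:]:
                if first != later:
                    seen_pairs.add((first, later))
        counts.update(seen_pairs)  # each customer counts once per pair
    return counts


# Hypothetical time-ordered event histories, keyed by customer.
histories = {
    "c1": ["rent", "buy_property", "buy_freezer"],
    "c2": ["rent", "buy_property"],
    "c3": ["buy_property", "buy_freezer"],
}
pair_counts = ordered_pair_counts(histories)
```

Unlike an association count, `("rent", "buy_property")` and `("buy_property", "rent")` are tallied separately, so the order of events is preserved.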

18
Link Analysis - Similar Time Sequence Discovery
  • Finds links between two sets of data that are
    time-dependent, and is based on the degree of
    similarity between the patterns that both time
    series demonstrate.
  • e.g. Within three months of buying property,
    new home owners will purchase goods such as
    cookers, freezers, and washing machines.
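
A minimal way to measure how similar two time series are, allowing for a delay between them, is to compute their correlation at several lags and keep the best one. This is only one possible similarity measure and is not necessarily what any particular tool uses; the function names and the monthly figures below are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series; use 0
    return cov / math.sqrt(var_x * var_y)


def best_lag(leader, follower, max_lag):
    """Return (lag, correlation) at which follower best tracks leader."""
    best = (0, pearson(leader, follower))
    for lag in range(1, max_lag + 1):
        # Compare leader at time t with follower at time t + lag.
        r = pearson(leader[:-lag], follower[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Hypothetical monthly counts: appliance sales echo home sales 2 months later.
home_sales = [5, 9, 4, 7, 12, 6, 8, 10]
appliance_sales = [3, 3, 5, 9, 4, 7, 12, 6]
lag, r = best_lag(home_sales, appliance_sales, max_lag=3)
```

With this made-up data the strongest match is found at a lag of two months, the kind of link the home-buyer example describes.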

19
Data Mining Tools
  • CLUSTERING
  • Cluster analysis lets the data miner assemble
    data into unforeseen groups containing similar
    characteristics. Also known as "segmentation,"
    this type of data mining is probably the most
    widely used.

20
CLUSTERING
  • Aim is to partition a database into an unknown
    number of segments, or clusters, of similar
    records.
  • Uses unsupervised learning to discover
    homogeneous sub-populations in a database to
    improve the accuracy of the profiles.
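
To show how records can be partitioned without fixing the number of segments in advance, the sketch below uses a simple "leader" clustering pass: a record joins the first cluster whose leader is close enough, otherwise it starts a new cluster. This is one basic unsupervised technique chosen for brevity, not the method of any specific tool. The distance threshold and the customer records are assumptions, and the result depends on both the threshold and the order of the records.

```python
import math


def leader_clustering(records, threshold):
    """Assign each record to the first cluster whose leader is within threshold.

    A record that is close to no existing leader starts a new cluster,
    so the number of clusters is an output rather than an input.
    """
    leaders = []   # first record of each cluster
    clusters = []  # list of member lists, parallel to leaders
    for rec in records:
        for idx, leader in enumerate(leaders):
            if math.dist(rec, leader) <= threshold:
                clusters[idx].append(rec)
                break
        else:
            # No leader was close enough: this record founds a new cluster.
            leaders.append(rec)
            clusters.append([rec])
    return clusters


# Hypothetical customer records: (age, monthly spend in hundreds).
customers = [(25, 2.0), (27, 2.5), (60, 9.0), (26, 1.5), (62, 8.5), (41, 5.0)]
segments = leader_clustering(customers, threshold=5.0)
```

For this data the pass yields three segments (younger low spenders, older high spenders, and one record in between) without being told to look for three.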

21
Data Mining Tools
  • PREDICTIVE MODELING
  • As the name implies, predictive modeling
    involves developing a model from historical data
    for predicting a future event. The power of
    predictive modeling engines is that they can use
    a broad range of data attributes to identify
    future behavior. Both cluster analysis and
    predictive modeling tools identify distinct
    groups of items with common attributes; the
    difference is that predictive modeling focuses on
    the likelihood of a particular outcome for a
    particular group.

22
PREDICTIVE MODELING
  • Similar to the human learning experience in that
    it uses observations to form a model of the
    important characteristics of some phenomenon.
  • Uses generalizations of the real world and the
    ability to fit new data into a general framework.
  • Can analyze a database to determine essential
    characteristics (model) about the data set.

23
PREDICTIVE MODELING
  • There are two techniques associated with
    predictive modeling: classification and value
    prediction, which are distinguished by the nature
    of the variable being predicted.

24
PREDICTIVE MODELING-classification
  • Used to establish a specific predetermined class
    for each record in a database from a finite set
    of possible class values.
  • Two specializations of classification: tree
    induction and neural induction.

25
[Figure: example decision tree. Decision nodes: car = taurus, city = seattle, age < 45, each with y/n branches. Leaf labels: likely, unlikely.]

26
Example of Classification using Neural Induction
27
PREDICTIVE MODELING- Value Prediction
  • Used to estimate a continuous numeric value that
    is associated with a database record.
  • Uses the traditional statistical techniques of
    linear regression and nonlinear regression.
  • Relatively easy to use and understand.

28
PREDICTIVE MODELING- Value Prediction
  • Linear regression attempts to fit a straight line
    through a plot of the data, such that the line is
    the best representation of the average of all
    observations at that point in the plot.
  • The problem is that the technique only works well
    with linear data and is sensitive to the presence
    of outliers (that is, data values that do not
    conform to the expected norm).
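
The sensitivity to outliers can be seen with a small ordinary least squares fit. The sketch below fits a straight line to a few points, then refits after one value is replaced by an outlier. The data is made up for illustration and `fit_line` is a hypothetical helper, not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
clean_slope, clean_intercept = fit_line(xs, ys)

# Replace the last observation with an outlier and refit.
ys_outlier = [3, 5, 7, 9, 31]
outlier_slope, outlier_intercept = fit_line(xs, ys_outlier)
```

With these numbers, a single outlying observation moves the fitted slope from 2 to 6 and the intercept from 1 to -7, even though four of the five points are unchanged.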

29
PREDICTIVE MODELING- Value Prediction
  • Although nonlinear regression avoids the main
    problems of linear regression, it is still not
    flexible enough to handle all possible shapes of
    the data plot.
  • Statistical measurements are fine for building
    linear models that describe predictable data
    points; however, most data is not linear in
    nature.

30
PREDICTIVE MODELING- Value Prediction
  • Data mining requires statistical methods that can
    accommodate non-linearity, outliers, and
    non-numeric data.
  • Applications of value prediction include credit
    card fraud detection or target mailing list
    identification.

31
ARE YOU READY FOR DATA MINING?
  • Just because you have a data warehouse doesn't
    mean you're necessarily ready for data mining.
    Much of the work our company does in the data
    mining arena has more to do with data mining
    readiness assessment than with actually
    performing data mining.

32
Metrics you can use to gauge your data mining readiness
  • Do you have a staff of experienced knowledge
    workers?
  • Do you have the data?
  • Do you have marketing processes in place that can
    use this data?
  • Do you have a business champion who can embrace
    the process and results?
  • Do you have the technology infrastructure to
    support advanced analysis?

33
OLAP vs. Mining Tools
  • OLAP tools:
  • -- Are ad hoc, shrink-wrapped tools that provide
    an interface to data
  • -- Are used when you have specific questions
  • -- Look and feel like a spreadsheet that allows
    rotation, slicing, and graphics
  • -- Can be deployed to a large number of users
  • Mining tools:
  • -- Are methods for analyzing multiple data types
    (regression trees, neural networks, genetic
    algorithms)
  • -- Are usually textual in nature
  • -- Are usually deployed to a small number of
    analysts