Lecture 3: Data Mining and Data Visualization

1
Lecture 3: Data Mining and Data Visualization
2
3-1 A Picture is Worth a Thousand Words
  • Definition: Data mining is the set of activities
    used to find new, hidden, or unexpected patterns
    in data.
  • These techniques are often called Knowledge Data
    Discovery (KDD), and include statistical
    analysis, neural networks, fuzzy logic,
    intelligent agents, and data visualization.
  • The KDD techniques not only discover useful
    patterns in the data, but also can be used to
    develop predictive models.

3
Verification Versus Discovery
  • In the past, decision support activities were
    primarily based on the concept of verification.
  • This required a great deal of prior knowledge on
    the decision-maker's part in order to verify a
    suspected relationship.
  • With the advance of technology, the concept of
    verification began to turn into discovery.

4
Data Mining's Growth in Popularity
  • One reason is that we keep getting more and more
    data all the time and need tools to understand
    it.
  • We also are aware that the human brain has
    trouble processing multidimensional data.
  • A third reason is that machine learning
    techniques are becoming more affordable and more
    refined at the same time.

5
Making Accurate Predictions with Data Mining
  • Although the literature contains statements such
    as "data mining will allow us to predict who will
    buy a particular product," predicting individuals
    with that kind of certainty is against human
    nature.
  • In situations where data mining is used to
    predict response to a marketing campaign, only
    about 5% of the people selected as likely
    respondents actually do respond.

6
Making Accurate Predictions with Data Mining
(cont.)
  • Although the accuracy of predicting individual
    behavior is not so good, it is better than it
    seems, since direct marketing efforts often have
    hit rates of only about 1% without data mining.
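
The comparison on these two slides can be made concrete with a little arithmetic. The sketch below reads the slide figures as roughly 5 percent with data mining and 1 percent without; the mailing size of 10,000 is a made-up number used only for illustration.

```python
# Hit rates quoted on the slides, expressed as fractions.
rate_with_mining = 0.05     # about 5% of the selected people respond
rate_without_mining = 0.01  # about 1% respond without data mining


def improvement_factor(targeted_rate: float, baseline_rate: float) -> float:
    """Return how many times better the targeted rate is than the baseline."""
    return targeted_rate / baseline_rate


factor = improvement_factor(rate_with_mining, rate_without_mining)

# Hypothetical mailing size, chosen only to make the ratio tangible.
mailing_size = 10_000
responses_with = round(mailing_size * rate_with_mining)
responses_without = round(mailing_size * rate_without_mining)

print(f"{factor:.1f}x improvement: "
      f"{responses_with} vs {responses_without} responses per {mailing_size} mailings")
```

Even a model that is wrong about most individuals can therefore multiply the yield of a campaign several times over.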

7
3-2 Online Analytical Processing (OLAP)
Codd developed a set of 12 rules for the
development of multidimensional databases:
  • Multidimensional view
  • Transparent to user
  • Accessible
  • Consistent reporting
  • Client-server architecture
  • Generic dimensionality
  • Dynamic sparse matrix handling
  • Multiuser support
  • Cross-dimensional ops
  • Intuitive manipulation
  • Flexible reporting
  • Unlimited dimension and aggregation

8
OLAP as Implemented
  • To date, no existing implementation appears to
    satisfy all 12 rules.
  • Some people argue it might not even be possible
    to attain all of them.
  • More recently, the term OLAP has come to
    represent the broad category of software
    technology that enables multidimensional analysis
    of enterprise data.

9
Multidimensional OLAP (MOLAP)
  • Data can be viewed across several dimensions.
    Here sales are arrayed by region and product.
  • A fourth dimension could be added by using
    several graphs -- perhaps at different time
    points.
  • Most analyses have many more dimensions than
    this. MOLAP handles data as an n-dimensional
    hypercube.
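
As a rough illustration of the hypercube idea, the sketch below stores each cell of a small cube under its coordinates and aggregates along chosen dimensions. The dimension names, the sales figures, and the helper functions roll_up and slice_cube are all hypothetical; real MOLAP servers use specialized array storage rather than a Python dictionary.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter) -> units sold.
DIMENSIONS = ("region", "product", "quarter")
cube = {
    ("East", "Widgets", "Q1"): 120,
    ("East", "Gadgets", "Q1"): 80,
    ("West", "Widgets", "Q1"): 95,
    ("West", "Gadgets", "Q2"): 60,
    ("East", "Widgets", "Q2"): 130,
}


def roll_up(cells, keep):
    """Sum the cube over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for coords, units in cells.items():
        totals[tuple(coords[p] for p in positions)] += units
    return dict(totals)


def slice_cube(cells, dimension, value):
    """Keep only the cells whose coordinate on `dimension` equals `value`."""
    position = DIMENSIONS.index(dimension)
    return {c: u for c, u in cells.items() if c[position] == value}


by_region = roll_up(cube, ["region"])
by_region_product = roll_up(cube, ["region", "product"])
q1_only = slice_cube(cube, "quarter", "Q1")

print(by_region)
print(by_region_product)
print(q1_only)
```

Adding another dimension only lengthens the coordinate tuple, which is why the n-dimensional view scales conceptually even when it can no longer be drawn.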

10
Relational OLAP (ROLAP)
  • A large relational database server replaces the
    multidimensional one.
  • The database contains both detailed and
    summarized data, allowing drill down techniques
    to be applied.
  • SQL interfaces allow vendors to build tools that
    are both portable and scalable.
  • This does require databases with many relational
    tables, which may lead to substantial processor
    overhead on complex joins.
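
A minimal sketch of the ROLAP approach, using Python's built-in sqlite3 module as a stand-in for a large relational server. The table names, columns, and figures are invented for illustration. The first query gives the summarized view; the second drills down into one region and needs an extra join to do so.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region  (region_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales (
    region_id  INTEGER REFERENCES region(region_id),
    product_id INTEGER REFERENCES product(product_id),
    units      INTEGER
);
INSERT INTO region  VALUES (1, 'East'), (2, 'West');
INSERT INTO product VALUES (1, 'Widgets'), (2, 'Gadgets');
INSERT INTO sales   VALUES (1, 1, 120), (1, 2, 80), (2, 1, 95), (2, 2, 60), (1, 1, 130);
""")

# Summarized view: total units per region.
summary = conn.execute("""
    SELECT r.name, SUM(s.units)
    FROM sales AS s
    JOIN region AS r ON r.region_id = s.region_id
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()

# Drill down: units per product within a single region.
detail = conn.execute("""
    SELECT p.name, SUM(s.units)
    FROM sales AS s
    JOIN region  AS r ON r.region_id  = s.region_id
    JOIN product AS p ON p.product_id = s.product_id
    WHERE r.name = ?
    GROUP BY p.name
    ORDER BY p.name
""", ("East",)).fetchall()

print(summary)
print(detail)
```

Each further dimension adds another table and another join, which is where the processor overhead mentioned above comes from.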

11
A Typical Relational Schema
12
3-3 Techniques Used to Mine the Data
  • Paralleling the popularity of data mining itself,
    the development of new techniques is exploding as
    well.
  • Many innovations are vendor-specific, which
    sometimes does little to advance the state of the
    art.
  • Regardless, data-mining techniques tend to fall
    into four major categories:
  • 1. Classification
  • 2. Association
  • 3. Sequencing
  • 4. Clustering

13
Classification methods
  • The goal is to discover rules that define whether
    an item belongs to a particular subset or class
    of data.
  • For example, if we are trying to determine which
    households will respond to a direct mail
    campaign, we will want rules that separate the
    probables from the not probables.
  • These IF-THEN rules often are portrayed in a
    tree-like structure.
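
To show how IF-THEN rules form a tree, here is a small sketch for the direct mail example. The attributes, thresholds, and the function name likely_responder are hypothetical; a real classifier would learn its rules from past campaign data rather than have them written by hand.

```python
def likely_responder(income: str, age: int) -> bool:
    """Walk a hand-written, two-level rule tree (thresholds are made up)."""
    if income == "high":
        # IF income is high AND age >= 30 THEN probable
        return age >= 30
    if income == "medium":
        # IF income is medium AND age >= 50 THEN probable
        return age >= 50
    # All other households fall into the "not probable" leaf.
    return False


households = [("high", 42), ("high", 25), ("medium", 55), ("low", 60)]
for income, age in households:
    label = "probable" if likely_responder(income, age) else "not probable"
    print(income, age, label)
```

Each nested condition corresponds to a branch, and each return statement to a leaf of the tree.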

14
Association Methods
  • These techniques search all transactions from a
    system for patterns of occurrence.
  • A common method is market basket analysis, in
    which the sets of products purchased by thousands
    of consumers are examined.
  • Results are then portrayed as percentages; for
    example, 30% of the people who buy steaks also
    buy charcoal.

15
Sequencing Methods
  • These methods are applied to time series data in
    an attempt to find hidden trends.
  • If found, these can be useful predictors of
    future events.
  • For example, customer groups that tend to
    purchase products tied-in with hit movies would
    be targeted with promotional campaigns timed to
    release dates.

16
Clustering Techniques
  • Clustering techniques attempt to create
    partitions in the data according to some distance
    metric.
  • The clusters formed are data grouped together
    simply by their similarity to their neighbors.
  • By examining the characteristics of each cluster,
    it may be possible to establish rules for
    classification.
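
One common way to partition data by distance is k-means. The slides do not name a specific algorithm, so the sketch below is only one possible choice, reduced to one dimension with absolute difference as the distance metric. The ages and starting centers are invented.

```python
def k_means_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move centers to the group means."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # A center with no members stays where it is.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    return centers, clusters


ages = [22, 25, 27, 61, 64, 68]  # hypothetical customer ages
centers, clusters = k_means_1d(ages, centers=[20, 70])
print(centers)
print(clusters)
```

Inspecting what the members of each resulting group have in common is the step that can suggest classification rules.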

17
Data Mining Technologies
  • Statistics: the most mature data mining
    technologies, but often not applicable
    because they need clean data. In addition, many
    statistical procedures assume linear
    relationships, which limits their use.
  • Neural networks, genetic algorithms, fuzzy logic:
    these technologies are able to work with
    complicated and imprecise data. Their broad
    applicability has made them popular in the field.

18
Data Mining Technologies (cont.)
  • Decision trees: these technologies are
    conceptually simple and have gained in popularity
    as better tree-growing software was introduced.
    Because of the way they are used, they are
    perhaps better called classification trees.

19
The Knowledge Discovery Search Process
  • Table 3-2 contains a more detailed outline of the
    process, but the major steps are:
  • Define the business problem and obtain the data
    to study it.
  • Use data mining software to model the problem.
  • Mine the data to search for patterns of interest.

20
The Knowledge Discovery Search Process (cont.)
  • Review the mining results and refine them by
    respecifying the model.
  • Once validated, make the model available to other
    users of the DW.

21
Creating a Data-Mining Model
  • Although syntax differs from vendor to vendor,
    building a model on top of a database is much
    like creating a table:
  • CREATE MODEL mail_list
  • Income character input, Age integer input,
    Respond character output
  • To populate it with data, use an SQL INSERT:
  • INSERT INTO mail_list
  • SELECT income, age, respond
  • FROM client_list
  • WHERE region = 'Southeast'

22
Creating a Data-Mining Model (cont.)
  • The process automatically creates additional
    views of the model (mail_list_UNDERSTAND and
    mail_list_PREDICT). These can be examined:
  • SELECT * FROM mail_list_UNDERSTAND
  • WHERE input_column_name = 'income' and
  • input_column_value = 'high' and
  • output_column_name = 'respond' and
  • output_column_value = 'yes'
  • Once these are created, they are treated as
    tables in the database so they can be viewed and
    joined by other users.
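
The populate step is ordinary SQL and can be tried with Python's built-in sqlite3 module. This is only a sketch: CREATE MODEL is vendor-specific and is not supported by SQLite, so a plain table stands in for the model here, and no _UNDERSTAND or _PREDICT views are generated. The client rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client_list (income TEXT, age INTEGER, respond TEXT, region TEXT);
INSERT INTO client_list VALUES
    ('high', 45, 'yes', 'Southeast'),
    ('low',  23, 'no',  'Southeast'),
    ('high', 51, 'yes', 'Northwest');

-- A plain table standing in for the vendor-specific CREATE MODEL statement.
CREATE TABLE mail_list (income TEXT, age INTEGER, respond TEXT);
""")

# Populate the stand-in model from the client list, as on the slide.
conn.execute(
    "INSERT INTO mail_list "
    "SELECT income, age, respond FROM client_list WHERE region = ?",
    ("Southeast",),
)

rows = conn.execute(
    "SELECT income, age, respond FROM mail_list ORDER BY age"
).fetchall()
print(rows)
```

Because the result lives in the database as a table, other users can query and join it like any other table.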

23
New Applications for Data Mining
  • As the technology matures, new applications
    emerge, especially in two new categories, text
    mining and web mining.
  • Some text mining examples are:
  • Distilling the meaning of a text
  • Accurate summarization of a text
  • Explication of the text theme structure
  • Clustering of texts

24
Web mining
  • Web mining is a special case of text mining where
    the mining occurs over a website.
  • It enhances the website with intelligent
    behavior, such as suggesting related links or
    recommending new products.
  • It allows you to unobtrusively learn the
    interests of the visitors and modify their user
    profiles in real time.
  • It also allows you to match resources to the
    interests of the visitor.

25
3-4 Market Basket Analysis: The King of
Algorithms
  • This is the most widely used and, in many ways,
    most successful data mining algorithm.
  • It essentially determines what products people
    purchase together.
  • Stores can use this information to place these
    products in the same area.
  • Direct marketers can use this information to
    determine which new products to offer to their
    current customers.
  • Inventory policies can be improved if reorder
    points reflect the demand for the complementary
    products.

26
Association Rules for Market Basket Analysis
  • Rules are written in the form "left-hand side
    implies right-hand side"; an example is:
  • Yellow Peppers IMPLIES Red Peppers, Bananas,
    Bakery
  • To make effective use of a rule, three numeric
    measures about that rule must be considered: (1)
    support, (2) confidence, and (3) lift.

27
Measures of Predictive Ability
  • Support refers to the percentage of baskets where
    the rule was true (both left and right side
    products were present).
  • Confidence measures what percentage of baskets
    that contained the left-hand product also
    contained the right.
  • Lift measures how much more frequently the
    left-hand item is found with the right than
    without the right.
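
The three measures can be computed directly from a list of baskets. In the sketch below, support and confidence follow the definitions on this slide. For lift, the code uses the common formulation (confidence divided by the overall share of baskets containing the right-hand side), which is an assumption on my part and a slightly different ratio from the wording above. The baskets are invented.

```python
def rule_measures(baskets, left, right):
    """Return (support, confidence, lift) for the rule left IMPLIES right."""
    left, right = set(left), set(right)
    n = len(baskets)
    n_left = sum(1 for b in baskets if left <= set(b))
    n_right = sum(1 for b in baskets if right <= set(b))
    n_both = sum(1 for b in baskets if (left | right) <= set(b))

    support = n_both / n                           # rule true in this share of baskets
    confidence = n_both / n_left if n_left else 0.0
    # Assumed definition: confidence relative to how common the right side is overall.
    lift = confidence / (n_right / n) if n_right else 0.0
    return support, confidence, lift


# Hypothetical baskets for illustration only.
baskets = [
    {"yellow peppers", "bananas"},
    {"yellow peppers", "bananas", "bread"},
    {"bananas"},
    {"bread"},
]
support, confidence, lift = rule_measures(baskets, ["yellow peppers"], ["bananas"])
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Under this formulation, a lift above 1 means the right-hand item shows up more often alongside the left-hand item than it does in baskets generally.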

28
An Example
  • The confidence suggests people buying any kind of
    pepper also buy bananas.
  • Green peppers sell in about the same quantities
    as red or yellow, but are not as predictive.

29
Market Basket Analysis Methodology
  • We first need a list of transactions and what was
    purchased. This is pretty easily obtained these
    days from scanning cash registers.
  • Next, we choose a list of products to analyze,
    and tabulate how many times each was purchased
    with the others.
  • The diagonal of the table shows how often a
    product is purchased in any combination, and the
    off-diagonal entries show which combinations were
    bought together.

30
A Convenience Store Example (5 transactions)
  • Consider the following simple example about five
    transactions at a convenience store:
  • Transaction 1: Frozen pizza, cola, milk
  • Transaction 2: Milk, potato chips
  • Transaction 3: Cola, frozen pizza
  • Transaction 4: Milk, pretzels
  • Transaction 5: Cola, pretzels
  • These need to be cross tabulated and displayed in
    a table.
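
The cross tabulation of these five transactions can be sketched in a few lines. The transactions are the ones listed on this slide; the table layout and variable names are my own choices.

```python
from collections import Counter
from itertools import combinations

transactions = [
    ["frozen pizza", "cola", "milk"],
    ["milk", "potato chips"],
    ["cola", "frozen pizza"],
    ["milk", "pretzels"],
    ["cola", "pretzels"],
]

counts = Counter()
for basket in transactions:
    products = sorted(set(basket))
    for product in products:
        counts[(product, product)] += 1      # diagonal: product bought at all
    for a, b in combinations(products, 2):
        counts[(a, b)] += 1                  # off-diagonal: bought together
        counts[(b, a)] += 1                  # keep the table symmetric

items = sorted({product for basket in transactions for product in basket})
width = max(len(name) for name in items) + 2

# Print the co-occurrence table; missing pairs count as 0.
print(" " * width + "".join(f"{name:>{width}}" for name in items))
for row in items:
    cells = "".join(f"{counts[(row, col)]:>{width}}" for col in items)
    print(f"{row:<{width}}" + cells)
```

The largest off-diagonal entry is frozen pizza with cola, which appear together in two of the five baskets.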

31
A Convenience Store Example (5 transactions)
  • Pizza and cola sell together more often than any
    other combo: a cross-marketing opportunity?
  • Milk sells well with everything; people probably
    come here specifically to buy it.

32
Using the Results
  • The tabulations can immediately be translated
    into association rules and the numerical measures
    computed.
  • Comparing this week's table to last week's table
    can immediately show the effect of this week's
    promotional activities.
  • Some rules are going to be trivial (hot dogs and
    buns sell together) or inexplicable (toilet rings
    sell only when a new hardware store is opened).

33
Limitations to Market Basket Analysis
  • A large number of real transactions are needed to
    do an effective basket analysis, but the data's
    accuracy is compromised if all the products do
    not occur with similar frequency.
  • The analysis can sometimes capture results that
    were due to the success of previous marketing
    campaigns (and not natural tendencies of
    customers).

34
Performing Analysis with Virtual Items
  • The sales data can be augmented with the addition
    of virtual items. For example, we could record
    that the customer was new to us, or had children.
  • The transaction record might look like:
  • Item 1: Sweater   Item 2: Jacket   Item 3: New
  • This might allow us to see what patterns new
    customers have versus old customers.
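
A minimal sketch of how virtual items might be appended to a transaction before analysis. The function name add_virtual_items and the "Children" label are hypothetical; "New" is the label used on the slide.

```python
def add_virtual_items(items, is_new_customer=False, has_children=False):
    """Return a copy of the basket with customer facts added as pseudo-items."""
    augmented = list(items)
    if is_new_customer:
        augmented.append("New")
    if has_children:
        augmented.append("Children")  # hypothetical label for illustration
    return augmented


record = add_virtual_items(["Sweater", "Jacket"], is_new_customer=True)
print(record)
```

Once the virtual item is in the basket, the same association machinery treats it like any product, so a rule such as New IMPLIES Jacket can surface without special handling.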

35
Taxonomies
  • The presence of items not purchased very
    frequently is an obstacle to a good market basket
    analysis.
  • One way to deal with this is to eliminate
    products that occur with a frequency less than
    some threshold.
  • A better idea would be to try to form groups of
    products that fall below the threshold. For
    example, four flavors of popsicle occur 9% of the
    time all together, but no more than 3% individually.
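
The grouping idea can be sketched as a roll-up: items that fall below the frequency threshold are replaced by their parent category before the analysis runs. The baskets, the taxonomy mapping, the 15 percent threshold, and the function names are all invented for illustration.

```python
from collections import Counter


def roll_up_rare_items(baskets, taxonomy, threshold):
    """Replace items rarer than `threshold` with their taxonomy parent, if any."""
    n = len(baskets)
    frequency = Counter(item for basket in baskets for item in set(basket))
    rare = {item for item, count in frequency.items() if count / n < threshold}
    return [
        sorted({taxonomy.get(item, item) if item in rare else item for item in basket})
        for basket in baskets
    ]


def share(baskets, item):
    """Fraction of baskets that contain `item`."""
    return sum(1 for basket in baskets if item in basket) / len(baskets)


# Hypothetical data: each popsicle flavor is rare on its own.
baskets = (
    [["milk", "cherry popsicle"], ["milk", "grape popsicle"]]
    + [["milk"]] * 4
    + [["bread"]] * 4
)
taxonomy = {"cherry popsicle": "popsicle", "grape popsicle": "popsicle"}

rolled = roll_up_rare_items(baskets, taxonomy, threshold=0.15)
print(share(baskets, "cherry popsicle"), share(rolled, "popsicle"))
```

After the roll-up, the combined category clears a threshold that neither flavor reached alone, so it stays in the analysis instead of being eliminated.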

36
Multidimensional Market Basket Analysis
  • Rules can involve more than two items; for
    example, Plant and Clay Pot IMPLIES Soil.
  • These rules are built iteratively. First, pairs
    are found, then relevant sets of three or four.
  • These are then pruned by removing those that
    occur infrequently.
  • In an environment like a grocery store, where
    customers commonly buy over 100 items, rules
    could involve as many as 10 items.
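
The iterative build-and-prune process resembles the well-known Apriori approach. The slides do not name an algorithm, so the sketch below is an assumed, simplified version: it grows itemsets one item at a time and discards any that occur in fewer than min_count baskets. The baskets are invented.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_count, max_size=3):
    """Return {itemset: count} for itemsets found in at least min_count baskets."""
    baskets = [frozenset(b) for b in baskets]

    def count(itemset):
        return sum(1 for b in baskets if itemset <= b)

    singles = {frozenset([item]) for b in baskets for item in b}
    current = {s for s in singles if count(s) >= min_count}
    result = {s: count(s) for s in current}

    size = 2
    while current and size <= max_size:
        # Build candidates by joining frequent sets from the previous round.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Keep a candidate only if every smaller subset was itself frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        }
        # Prune candidates that occur too infrequently.
        current = {c for c in candidates if count(c) >= min_count}
        result.update({c: count(c) for c in current})
        size += 1
    return result


baskets = [
    {"plant", "clay pot", "soil"},
    {"plant", "clay pot", "soil"},
    {"plant", "clay pot"},
    {"plant", "soil"},
    {"fertilizer"},
]
itemsets = frequent_itemsets(baskets, min_count=2)
for itemset, n in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Pruning at every size keeps the number of candidates manageable, which matters when rules may span many items.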

37
3-5 Current Limitations and Challenges to
Data Mining
  • Despite the potential power and value, data
    mining is still a new field. Some things that
    thus far have limited advancement are:
  • Identification of missing information: not all
    knowledge gets stored in a database
  • Data noise and missing values: future systems
    need better ways to handle this
  • Large databases and high dimensionality: future
    applications need ways to partition data into
    more manageable chunks

38
3-6 Data Visualization: Seeing the Data
39
Visual Presentation
  • For any kind of high dimensional data set,
    displaying predictive relationships is a
    challenge.
  • The picture on the previous slide uses 3-D
    graphics to portray weather balloon data. We
    learn very little from just examining the
    numbers.
  • Shading is used to represent relative degrees of
    thunderstorm activity, with the darkest regions
    the heaviest activity.

40
A Bit of History
  • An early effort used sequences of two-dimensional
    graphs to add depth.
  • Current virtual reality programs allow the user
    to step through a data set. Try going to a
    realtor's website and taking a tour of a house up
    for sale.

41
Human Visual Perception and Data Visualization
  • Data visualization is so powerful because the
    human visual cortex converts objects into
    information so quickly.
  • The next three slides show (1) usage of global
    private networks, (2) flow through natural gas
    pipelines, and (3) a risk analysis report that
    permits the user to draw an interactive yield
    curve.
  • All three use height or shading to add additional
    dimensions to the figure.

42
Global Private Network Activity
(Figure legend: shading runs from High Activity to Low Activity.)
43
Natural Gas Pipeline Analysis
Note: Height shows total flow through compressor
stations.
44
An Enlivened Risk Analysis Report
45
Geographical Information Systems
  • A GIS is a special purpose database that contains
    a spatial coordinate system. A comprehensive GIS
    requires:
  • Data input from maps, aerial photos, etc.
  • Data storage, retrieval and query
  • Data transformation and modeling
  • Data reporting (maps, reports and plans)

46
The Special Capabilities of a GIS
  • In general, a GIS contains two types of data:
  • Spatial data: these elements correspond to a
    uniquely defined location on earth. They could
    be in point, line, or polygon form.
  • Attribute data: these are the data that will be
    portrayed at the geographic references
    established by spatial data.
  • Example: Data from an opinion poll is displayed
    for multiple regions in the United States.
    Clicking on an area allows the user to drill down
    to the results for smaller areas.
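
The split between spatial and attribute data, and the drill-down in the polling example, can be sketched with a small data structure. The Feature class, its field names, the coordinates, and the poll figures are all hypothetical; a real GIS uses dedicated geometry types and spatial indexes.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    name: str
    geometry: str                                    # "point", "line", or "polygon"
    coordinates: list                                # spatial data: (longitude, latitude) pairs
    attributes: dict = field(default_factory=dict)   # attribute data shown at this location
    children: list = field(default_factory=list)     # smaller areas inside this one


def drill_down(feature, attribute):
    """Return the chosen attribute for each smaller area inside `feature`."""
    return {child.name: child.attributes.get(attribute) for child in feature.children}


# Made-up coordinates and poll results, for illustration only.
georgia = Feature("Georgia", "polygon", [(-85.6, 35.0), (-81.0, 32.0), (-84.9, 30.7)],
                  {"approve_pct": 52})
florida = Feature("Florida", "polygon", [(-87.6, 31.0), (-80.0, 25.8), (-81.5, 30.7)],
                  {"approve_pct": 48})
southeast = Feature("Southeast", "polygon", [(-91.0, 36.5), (-75.5, 36.5), (-80.0, 25.0)],
                    {"approve_pct": 50}, children=[georgia, florida])

print(southeast.attributes["approve_pct"])
print(drill_down(southeast, "approve_pct"))
```

Clicking a region on a live map amounts to calling something like drill_down on the feature under the cursor.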

47
Telephone Polling Results
Note: On the live map, clicking on an area
allows the user to drill down and see results
for smaller areas.
48
3-7 Siftware Technologies
  • Although data visualization product vendors seem
    to enter or leave the market with great
    frequency, several firms are beginning to develop
    significant brand loyalty.
  • Red Brick: Helped category managers at H.E.B. in
    San Antonio to determine which products to put in
    which stores. Another application was the
    consolidation of three old data warehouses at
    Hewlett-Packard.

49
Siftware -- Continued
  • Oracle: A large suite of connectivity products
    allows transparent access to mainframe databases.
    Some major customers include John Alden
    Insurance, ShopKo Stores and Pacific Bell.
  • Informix: Associated Grocers uses Informix data
    warehousing products at the heart of its
    three-tier client-server system.

50
Siftware -- Continued
  • Sybase: Sybase Warehouse WORKS is an integrated
    system designed around the four key functions in
    data warehousing.
  • Silicon Graphics: Data mining software is mated
    to 3-D visualization tools to allow users to fly
    through data.
  • IBM provides a number of decision support tools
    in its Information Warehouse Solutions.