Lecture 3: Data Mining and Data Visualization


Provided by: PatrickT87


1
Lecture 3: Data Mining and Data Visualization
2
3-1 A Picture is Worth a Thousand Words

Definition: Data mining is the set of activities
used to find new, hidden, or unexpected patterns
in data.
These techniques are often called Knowledge Data
Discovery (KDD), and include statistical
analysis, neural networks or fuzzy logic, intelligent
agents, and data visualization.
The KDD techniques not only discover useful
patterns in the data, but also can be used to
develop predictive models.

3
Verification Versus Discovery

In the past, decision support activities were
primarily based on the concept of verification.
This required a great deal of prior knowledge on
the decision-maker's part in order to verify a
suspected relationship.
With the advance of technology, the concept of
verification began to turn into discovery.

4
Data Mining's Growth in Popularity

One reason is that we keep getting more and more
data all the time and need tools to understand
it.
We also are aware that the human brain has
trouble processing multidimensional data.
A third reason is that machine learning
techniques are becoming more affordable and more
refined at the same time.

5
Making Accurate Predictions with Data Mining

Although the literature contains statements such
as "data mining will allow us to predict who will
buy a particular product," predicting individual
behavior that precisely runs against human nature.
In situations where data mining is used to
predict response to a marketing campaign, only
about 5% of the people selected as likely
respondents actually do respond.

6
Making Accurate Predictions with Data Mining
(cont.)

Although the accuracy of predicting individual
behavior is not so good, it is better than it
seems, since direct marketing efforts often have
hit rates of only about 1% without data mining.
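
The comparison above can be made concrete with a little arithmetic. This is a minimal sketch: the 5% and 1% hit rates come from the slides, while the mailing size of 10,000 is a made-up figure used only for illustration.

```python
# Hit rates from the slides; the mailing size is a hypothetical figure.
MAILING_SIZE = 10_000
HIT_RATE_WITH_MINING = 0.05
HIT_RATE_WITHOUT_MINING = 0.01


def expected_responses(mailing_size: int, hit_rate: float) -> int:
    """Expected number of responders for a mailing of the given size."""
    return round(mailing_size * hit_rate)


with_mining = expected_responses(MAILING_SIZE, HIT_RATE_WITH_MINING)
without_mining = expected_responses(MAILING_SIZE, HIT_RATE_WITHOUT_MINING)

# How many times better the mined list performs than an unmined one.
improvement = HIT_RATE_WITH_MINING / HIT_RATE_WITHOUT_MINING

print(with_mining, without_mining, improvement)
```

Under these assumptions the mined list yields roughly five times as many responders for the same mailing cost, even though 95% of the people selected still do not respond.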

7
3-2 Online Analytical Processing (OLAP)
Codd developed a set of 12 rules for the
development of multidimensional databases:

Multidimensional view
Transparent to user
Accessible
Consistent reporting
Client-server architecture
Generic dimensionality
Dynamic sparse matrix handling
Multiuser support
Cross-dimensional operations
Intuitive manipulation
Flexible reporting
Unlimited dimension and aggregation

8
OLAP as Implemented

To date, no existing implementation appears to
satisfy all 12 rules.
Some people argue it might not even be possible
to attain all of them.
More recently, the term OLAP has come to
represent the broad category of software
technology that enables multidimensional analysis
of enterprise data.

9
Multidimensional OLAP (MOLAP)

Data can be viewed across several dimensions.
Here sales are arrayed by region and product.
A fourth dimension could be added by using
several graphs -- perhaps at different time
points.
Most analyses have many more dimensions than
this. MOLAP handles data as an n-dimensional
hypercube.
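
The hypercube idea can be sketched in a few lines. This is only an illustration under stated assumptions: the dimension names, dimension values and sales figures are made up, and real MOLAP engines store and index cells very differently.

```python
from collections import defaultdict

# Each cell of the cube is addressed by one value per dimension.
# All figures and dimension values here are hypothetical.
DIMENSIONS = ("region", "product", "quarter")
cube = {
    ("East", "Widgets", "Q1"): 120,
    ("East", "Gadgets", "Q1"): 80,
    ("West", "Widgets", "Q1"): 95,
    ("West", "Widgets", "Q2"): 110,
    ("West", "Gadgets", "Q2"): 60,
}


def roll_up(cube, keep):
    """Aggregate the cube over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for cell, sales in cube.items():
        totals[tuple(cell[i] for i in positions)] += sales
    return dict(totals)


def slice_cube(cube, dimension, value):
    """Return only the cells where `dimension` equals `value`."""
    i = DIMENSIONS.index(dimension)
    return {cell: sales for cell, sales in cube.items() if cell[i] == value}


by_region = roll_up(cube, ["region"])
by_region_product = roll_up(cube, ["region", "product"])
q1_only = slice_cube(cube, "quarter", "Q1")
```

Adding another dimension (for example, sales channel) only lengthens the cell key; the roll-up and slice operations stay the same.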

10
Relational OLAP (ROLAP)

A large relational database server replaces the
multidimensional one.
The database contains both detailed and
summarized data, allowing drill down techniques
to be applied.
SQL interfaces allow vendors to build tools that
are both portable and scalable.
This does require databases with many relational
tables which may lead to substantial processor
overhead on complex joins.
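
A rough sketch of the ROLAP approach, using Python's built-in sqlite3 module as a stand-in for a large relational server. The table names (region_dim, sales_fact), columns and figures are hypothetical; the point is that summary and drill-down views are both ordinary SQL joins over the same detailed data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE region_dim (region_id INTEGER PRIMARY KEY, region TEXT, state TEXT);
    CREATE TABLE sales_fact (region_id INTEGER, product TEXT, amount INTEGER);
    INSERT INTO region_dim VALUES (1, 'Southeast', 'FL'), (2, 'Southeast', 'GA'),
                                  (3, 'West', 'CA');
    INSERT INTO sales_fact VALUES (1, 'Sweater', 40), (1, 'Jacket', 90),
                                  (2, 'Sweater', 30), (3, 'Jacket', 120);
""")

# Summary level: total sales per region.
summary = conn.execute("""
    SELECT r.region, SUM(f.amount)
    FROM sales_fact AS f JOIN region_dim AS r ON f.region_id = r.region_id
    GROUP BY r.region ORDER BY r.region
""").fetchall()

# Drill down: the same measure, one level finer, for a single region.
drill_down = conn.execute("""
    SELECT r.state, SUM(f.amount)
    FROM sales_fact AS f JOIN region_dim AS r ON f.region_id = r.region_id
    WHERE r.region = ?
    GROUP BY r.state ORDER BY r.state
""", ("Southeast",)).fetchall()

print(summary)
print(drill_down)
```

Every additional dimension adds another table to the join, which is where the processor overhead mentioned above comes from.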

11
A Typical Relational Schema
12
3-3 Techniques Used to Mine the Data

Paralleling the popularity of data mining itself,
the development of new techniques is exploding as
well.
Many innovations are vendor-specific, which
sometimes does little to advance the state of the
art.
Regardless, data-mining techniques tend to fall
into four major categories:
1. classification
2. association
3. sequencing
4. clustering

13
Classification methods

The goal is to discover rules that define whether
an item belongs to a particular subset or class
of data.
For example, if we are trying to determine which
households will respond to a direct mail
campaign, we will want rules that separate the
probables from the not probables.
These IF-THEN rules often are portrayed in a
tree-like structure.
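
As an illustration of IF-THEN rules arranged as a tree, here is a minimal sketch for the direct mail example. The attributes (income band, age) and the thresholds are invented for illustration; in practice they would be discovered from the data by tree-growing software.

```python
# Hypothetical IF-THEN rules for a direct mail campaign; the attributes
# and thresholds are made up, not mined from real data.
def classify_household(income: str, age: int) -> str:
    """Walk a small rule tree and return 'probable' or 'not probable'."""
    if income == "high":
        if age >= 35:
            return "probable"
        return "not probable"
    if income == "medium" and age >= 50:
        return "probable"
    return "not probable"


households = [("high", 42), ("high", 28), ("medium", 61), ("low", 45)]
labels = [classify_household(income, age) for income, age in households]
print(labels)
```

Each path from the first test to a return statement corresponds to one rule, for example "IF income is high AND age is at least 35 THEN probable".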

14
Association Methods

These techniques search all transactions from a
system for patterns of occurrence.
A common method is market basket analysis, in
which the sets of products purchased by thousands
of consumers are examined.
Results are then portrayed as percentages; for
example, 30% of the people who buy steaks also
buy charcoal.

15
Sequencing Methods

These methods are applied to time series data in
an attempt to find hidden trends.
If found, these can be useful predictors of
future events.
For example, customer groups that tend to
purchase products tied-in with hit movies would
be targeted with promotional campaigns timed to
release dates.

16
Clustering Techniques

Clustering techniques attempt to create
partitions in the data according to some distance
metric.
The clusters formed are data grouped together
simply by their similarity to their neighbors.
By examining the characteristics of each cluster,
it may be possible to establish rules for
classification.
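
One widely used way to partition data by a distance metric is k-means; the slides do not name a specific algorithm, so treat this as one possible sketch. It assumes Euclidean distance, hand-picked starting centroids, and made-up customer data.

```python
import math


def k_means(points, centroids, iterations=10):
    """Tiny k-means: assign each point to its nearest centroid, then recompute."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep the old
        # centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers described by (age, purchases per year).
points = [(22, 4), (25, 5), (24, 6), (58, 20), (61, 22), (60, 18)]
centroids, clusters = k_means(points, centroids=[(20, 0), (70, 30)])
print(centroids)
```

Examining each resulting cluster (here, younger occasional buyers versus older frequent buyers) is what suggests candidate rules for classification.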

17
Data Mining Technologies

Statistics: the most mature data mining
technologies, but often not applicable
because they need clean data. In addition, many
statistical procedures assume linear
relationships, which limits their use.
Neural networks, genetic algorithms, fuzzy logic:
these technologies are able to work with
complicated and imprecise data. Their broad
applicability has made them popular in the field.

18
Data Mining Technologies (cont.)

Decision trees: these technologies are
conceptually simple and have gained in popularity
as better tree growing software was introduced.
Because of the way they are used, they are
perhaps better called classification trees.

19
The Knowledge Discovery Search Process

Table 3-2 contains a more detailed outline of the
process, but the major steps are:
Define the business problem and obtain the data
to study it.
Use data mining software to model the problem.
Mine the data to search for patterns of interest.

20
The Knowledge Discovery Search Process (cont.)

Review the mining results and refine them by
respecifying the model.
Once validated, make the model available to other
users of the DW.

21
Creating a Data-Mining Model

Although syntax differs from vendor to vendor,
building a model on top of a database is much
like creating a table:
CREATE MODEL mail_list (
  income character input,
  age integer input,
  respond character output
);
To populate it with data, use an SQL INSERT:
INSERT INTO mail_list
SELECT income, age, respond
FROM client_list
WHERE region = 'Southeast';

22
Creating a Data-Mining Model (cont.)

The process automatically created additional
views of the model (mail_list_UNDERSTAND and
mail_list_PREDICT). These can be examined:
SELECT * FROM mail_list_UNDERSTAND
WHERE input_column_name = 'income' AND
input_column_value = 'high' AND
output_column_name = 'respond' AND
output_column_value = 'yes';
Once these are created, they are treated as
tables in the database so they can be viewed and
joined by other users.

23
New Applications for Data Mining

As the technology matures, new applications
emerge, especially in two new categories, text
mining and web mining.
Some text mining examples are:
Distilling the meaning of a text
Accurate summarization of a text
Explication of the text theme structure
Clustering of texts

24
Web mining

Web mining is a special case of text mining where
the mining occurs over a website.
It enhances the website with intelligent
behavior, such as suggesting related links or
recommending new products.
It allows you to unobtrusively learn the
interests of the visitors and modify their user
profiles in real time.
It also allows you to match resources to the
interests of the visitor.

25
3-4 Market Basket Analysis: The King of
Algorithms

This is the most widely used and, in many ways,
most successful data mining algorithm.
It essentially determines what products people
purchase together.
Stores can use this information to place these
products in the same area.
Direct marketers can use this information to
determine which new products to offer to their
current customers.
Inventory policies can be improved if reorder
points reflect the demand for the complementary
products.

26
Association Rules for Market Basket Analysis

Rules are written in the form "left-hand side
implies right-hand side"; an example is
Yellow Peppers IMPLIES Red Peppers, Bananas,
Bakery
To make effective use of a rule, three numeric
measures about that rule must be considered: (1)
support, (2) confidence, and (3) lift.

27
Measures of Predictive Ability

Support refers to the percentage of baskets where
the rule was true (both left and right side
products were present).
Confidence measures what percentage of baskets
that contained the left-hand product also
contained the right.
Lift measures how much more frequently the
left-hand item is found with the right than
without the right.
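
The three measures can be sketched as small functions. The baskets below are made up for illustration. Note one assumption: lift is computed here using a common definition (confidence divided by the overall frequency of the right-hand side), which may differ in detail from the wording on the slide.

```python
def support(baskets, lhs, rhs):
    """Fraction of baskets containing every item on both sides of the rule."""
    both = set(lhs) | set(rhs)
    return sum(both <= b for b in baskets) / len(baskets)


def confidence(baskets, lhs, rhs):
    """Of the baskets containing the left side, the fraction that also
    contain the right side. Assumes the left side occurs at least once."""
    with_lhs = [b for b in baskets if set(lhs) <= b]
    return sum(set(rhs) <= b for b in with_lhs) / len(with_lhs)


def lift(baskets, lhs, rhs):
    """Confidence divided by the right side's overall frequency."""
    rhs_rate = sum(set(rhs) <= b for b in baskets) / len(baskets)
    return confidence(baskets, lhs, rhs) / rhs_rate


# Hypothetical baskets, made up for illustration.
baskets = [
    {"steak", "charcoal", "cola"},
    {"steak", "charcoal"},
    {"steak", "bread"},
    {"charcoal", "matches"},
    {"bread", "milk"},
    {"milk", "cola"},
    {"steak", "charcoal", "bread"},
    {"milk"},
]
rule = (["steak"], ["charcoal"])
rule_support = support(baskets, *rule)
rule_confidence = confidence(baskets, *rule)
rule_lift = lift(baskets, *rule)
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 suggests the two sides appear together more often than their individual frequencies would predict; a lift near 1 suggests the rule adds little.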

28
An Example

The confidence suggests people buying any kind of
pepper also buy bananas.
Green peppers sell in about the same quantities
as red or yellow, but are not as predictive.

29
Market Basket Analysis Methodology

We first need a list of transactions and what was
purchased. This is pretty easily obtained these
days from scanning cash registers.
Next, we choose a list of products to analyze,
and tabulate how many times each was purchased
with the others.
The diagonal of the table shows how often each
product is purchased in any combination, and the
off-diagonal cells show which combinations were bought.

30
A Convenience Store Example (5 transactions)

Consider the following simple example about five
transactions at a convenience store:
Transaction 1: Frozen pizza, cola, milk
Transaction 2: Milk, potato chips
Transaction 3: Cola, frozen pizza
Transaction 4: Milk, pretzels
Transaction 5: Cola, pretzels
These need to be cross tabulated and displayed in
a table.
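
The cross tabulation of these five transactions can be sketched as follows. The transactions are the ones listed above; the layout of the printed table is an arbitrary choice.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"frozen pizza", "cola", "milk"},
    {"milk", "potato chips"},
    {"cola", "frozen pizza"},
    {"milk", "pretzels"},
    {"cola", "pretzels"},
]

counts = Counter()
for basket in transactions:
    for item in basket:
        counts[(item, item)] += 1          # diagonal: item bought at all
    for a, b in combinations(sorted(basket), 2):
        counts[(a, b)] += 1                # off-diagonal: bought together
        counts[(b, a)] += 1                # keep the table symmetric

# Print one row per product, with columns in the same sorted order.
items = sorted({item for basket in transactions for item in basket})
for row in items:
    print(f"{row:<13}", *(f"{counts[(row, col)]:>2}" for col in items))
```

The diagonal cells hold how many transactions included each product, and each off-diagonal cell holds how many transactions included both the row and the column product.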

31
A Convenience Store Example (5 transactions)

Pizza and Cola sell together more often than any
other combo: a cross-marketing opportunity?
Milk sells well with everything; people probably
come here specifically to buy it.

32
Using the Results

The tabulations can immediately be translated
into association rules and the numerical measures
computed.
Comparing this week's table to last week's table
can immediately show the effect of this week's
promotional activities.
Some rules are going to be trivial (hot dogs and
buns sell together) or inexplicable (toilet rings
sell only when a new hardware store is opened).

33
Limitations to Market Basket Analysis

A large number of real transactions are needed to
do an effective basket analysis, but the data's
accuracy is compromised if the products do not
all occur with similar frequency.
The analysis can sometimes capture results that
were due to the success of previous marketing
campaigns (and not natural tendencies of
customers).

34
Performing Analysis with Virtual Items

The sales data can be augmented with the addition
of virtual items. For example, we could record
that the customer was new to us, or had children.
The transaction record might look like:
Item 1: Sweater   Item 2: Jacket   Item 3: New
This might allow us to see what patterns new
customers have versus old customers.
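
A minimal sketch of virtual items, assuming made-up transactions and two hypothetical marker names ("NEW" and "RETURNING"). The marker is simply added to the basket so that it can take part in the analysis like any product.

```python
from collections import Counter

# Hypothetical transactions: (items purchased, customer is new?).
raw = [
    ({"sweater", "jacket"}, True),
    ({"sweater"}, True),
    ({"jacket", "scarf"}, False),
    ({"sweater", "scarf"}, False),
]


def add_virtual_items(items, is_new):
    """Return the basket with a virtual item marking the customer type."""
    return items | {"NEW" if is_new else "RETURNING"}


baskets = [add_virtual_items(items, is_new) for items, is_new in raw]


def co_occurrence(baskets, marker):
    """Count the real items bought in baskets carrying the given virtual item."""
    tally = Counter()
    for basket in baskets:
        if marker in basket:
            tally.update(basket - {marker})
    return tally


new_pattern = co_occurrence(baskets, "NEW")
returning_pattern = co_occurrence(baskets, "RETURNING")
print(new_pattern, returning_pattern)
```

Because the marker behaves like a product, rules such as "NEW IMPLIES Sweater" fall out of the same tabulation used for real items.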

35
Taxonomies

The presence of items not purchased very
frequently is an obstacle to a good market basket
analysis.
One way to deal with this is to eliminate
products that occur with a frequency less than
some threshold.
A better idea would be to try to form groups of
products that fall below the threshold. For example,
four flavors of popsicle occur 9% of the time all
together, but no more than 3% individually.
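
Grouping rare products under a taxonomy can be sketched like this. The taxonomy, the baskets and the 20% threshold are all hypothetical and chosen so the small example is easy to follow; they do not reproduce the 9% and 3% figures above.

```python
from collections import Counter

# Hypothetical taxonomy mapping low-frequency products to a parent group.
TAXONOMY = {
    "cherry popsicle": "popsicle",
    "grape popsicle": "popsicle",
    "orange popsicle": "popsicle",
    "lime popsicle": "popsicle",
}


def item_frequency(baskets):
    """Fraction of baskets in which each item appears."""
    tally = Counter(item for basket in baskets for item in basket)
    return {item: n / len(baskets) for item, n in tally.items()}


def generalize(baskets, threshold):
    """Replace items below `threshold` with their taxonomy group, if any."""
    freq = item_frequency(baskets)
    return [
        {TAXONOMY.get(i, i) if freq[i] < threshold else i for i in basket}
        for basket in baskets
    ]


baskets = [
    {"cherry popsicle", "milk"}, {"grape popsicle"},
    {"orange popsicle", "cola"}, {"lime popsicle", "milk"},
    {"milk"}, {"cola"}, {"milk", "cola"},
    {"bread"}, {"bread", "milk"}, {"cola", "bread"},
]
grouped = generalize(baskets, threshold=0.2)
grouped_freq = item_frequency(grouped)
print(grouped_freq)
```

Each flavor appears in only 10% of these baskets, below the threshold, but the combined "popsicle" group appears in 40% and so survives into the analysis.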

36
Multidimensional Market Basket Analysis

Rules can involve more than two items; for
example, Plant and Clay Pot IMPLIES Soil.
These rules are built iteratively. First, pairs
are found, then relevant sets of three or four.
These are then pruned by removing those that
occur infrequently.
In an environment like a grocery store, where
customers commonly buy over 100 items, rules
could involve as many as 10 items.
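
The build-then-prune loop described above resembles what is commonly called the Apriori approach. Here is a minimal sketch of that idea, assuming made-up baskets and a hypothetical minimum count of 3; it is not tuned for large data.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_count, max_size=3):
    """Grow itemsets one item at a time, pruning infrequent ones at each size."""
    def count(itemset):
        return sum(itemset <= basket for basket in baskets)

    items = {item for basket in baskets for item in basket}
    current = {frozenset([i]) for i in items if count(frozenset([i])) >= min_count}
    result = {}
    size = 1
    while current and size <= max_size:
        for itemset in current:
            result[itemset] = count(itemset)
        # Candidates of the next size are unions of surviving itemsets.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == size + 1}
        # Prune: keep only candidates that occur often enough.
        current = {c for c in candidates if count(c) >= min_count}
        size += 1
    return result


# Hypothetical garden-store baskets.
baskets = [
    {"plant", "clay pot", "soil"},
    {"plant", "clay pot", "soil", "fertilizer"},
    {"plant", "soil"},
    {"clay pot", "saucer"},
    {"plant", "clay pot", "soil"},
]
found = frequent_itemsets(baskets, min_count=3)

pair = frozenset({"plant", "clay pot"})
triple = frozenset({"plant", "clay pot", "soil"})
# Confidence of "Plant and Clay Pot IMPLIES Soil" in this toy data.
rule_confidence = found[triple] / found[pair]
print(len(found), rule_confidence)
```

Pruning at each size is what keeps the search manageable: an itemset is only extended if it is already frequent, so rare items such as the saucer never generate larger candidates.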

37
3-5 Current Limitations and Challenges to
Data Mining

Despite the potential power and value, data
mining is still a new field. Some things that
thus far have limited advancement are:
Identification of missing information: not all
knowledge gets stored in a database
Data noise and missing values: future systems
need better ways to handle this
Large databases and high dimensionality: future
applications need ways to partition data into
more manageable chunks

38
3-6 Data Visualization: Seeing the Data
39
Visual Presentation

For any kind of high dimensional data set,
displaying predictive relationships is a
challenge.
The picture on the previous slide uses 3-D
graphics to portray weather balloon data.
We learn very little from just
examining the numbers.
Shading is used to represent relative degrees of
thunderstorm activity, with the darkest regions
showing the heaviest activity.

40
A Bit of History

An early effort used sequences of two-dimensional
graphs to add depth.
Current virtual reality programs allow the user
to step through a data set. Try going to a
realtor's website and taking a tour of a house up
for sale.

41
Human Visual Perception and Data Visualization

Data visualization is so powerful because the
human visual cortex converts objects into
information so quickly.
The next three slides show (1) usage of global
private networks, (2) flow through natural gas
pipelines, and (3) a risk analysis report that
permits the user to draw an interactive yield
curve.
All three use height or shading to add additional
dimensions to the figure.

42
Global Private Network Activity
Legend: High Activity / Low Activity
43
Natural Gas Pipeline Analysis
Note: Height shows total flow through compressor
stations.
44
An Enlivened Risk Analysis Report
45
Geographical Information Systems

A GIS is a special purpose database that contains
a spatial coordinate system. A comprehensive GIS
requires
Data input from maps, aerial photos, etc.
Data storage, retrieval and query
Data transformation and modeling
Data reporting (maps, reports and plans)

46
The Special Capabilities of a GIS

In general, a GIS contains two types of data:
Spatial data: these elements correspond to a
uniquely-defined location on earth. They could
be in point, line or polygon form.
Attribute data: these are the data that will be
portrayed at the geographic references
established by spatial data.
Example: Data from an opinion poll is displayed
for multiple regions in the United States.
Clicking on an area allows the user to drill down
to the results for smaller areas.
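
The split between spatial and attribute data, and the drill-down behavior, can be sketched with a small data structure. Everything here is hypothetical: the Feature class, the place names, the rough coordinates and the poll figures are invented for illustration and do not reflect any particular GIS product.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """One map element: spatial data (geometry) plus attribute data."""
    name: str
    geometry_type: str              # "point", "line", or "polygon"
    coordinates: list               # (longitude, latitude) pairs
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)   # smaller areas inside this one


def drill_down(feature, attribute):
    """Return the attribute for each smaller area inside `feature`."""
    return {child.name: child.attributes[attribute] for child in feature.children}


# Hypothetical poll results; coordinates are rough and only illustrative.
county_a = Feature("County A", "polygon",
                   [(-84.5, 33.6), (-84.2, 33.6), (-84.2, 33.9)],
                   {"approval": 0.58})
county_b = Feature("County B", "polygon",
                   [(-83.9, 32.7), (-83.5, 32.7), (-83.5, 33.0)],
                   {"approval": 0.44})
state = Feature("State", "polygon",
                [(-85.6, 30.4), (-80.8, 30.4), (-83.1, 35.0)],
                {"approval": 0.51}, [county_a, county_b])

detail = drill_down(state, "approval")
print(detail)
```

Clicking an area on a live map would correspond to calling drill_down on the selected feature and redrawing the map with the smaller areas it returns.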

47
Telephone Polling Results
Note: On the live map, clicking on an area
allows the user to drill down and see results
for smaller areas.
48
3-7 Siftware Technologies

Although data visualization product vendors seem
to enter or leave the market with great
frequency, several firms are beginning to develop
significant brand loyalty.
Red Brick: Helped category managers at H.E.B. in
San Antonio to determine which products to put in
which stores. Another application was the
consolidation of three old data warehouses at
Hewlett-Packard.

49
Siftware -- Continued

Oracle: A large suite of connectivity products
allows transparent access to mainframe databases.
Some major customers include John Alden
Insurance, ShopKo Stores and Pacific Bell.
Informix: Associated Grocers uses Informix data
warehousing products at the heart of its
three-tier client-server system.

50
Siftware -- Continued

Sybase: Sybase Warehouse WORKS is an integrated
system designed around the four key functions in
data warehousing.
Silicon Graphics: Data mining software is mated
to 3-D visualization tools to allow users to fly
through data.
IBM: Provides a number of decision support tools
in its Information Warehouse Solutions.
