Title: Data Warehouses and Data Mining
1Data Warehouses and Data Mining
2Resources in VS
- Database and data warehousing
- Chapter 4 pages 167-177
- Data mining and OLAP
- Chapter 8 pages 321-325
3Physical versus Logical Views
- In managing information, physical deals with the
structure of information as it resides on various
storage media. - Logical deals with how knowledge workers view
their information needs, and includes such terms
as - CHARACTER - our smallest unit of information.
- FIELD - group of related characters.
- RECORD - group of related fields.
- FILE - group of related records.
- DATABASE - group of logically associated files.
- DATA WAREHOUSE - information from many databases.
4The Data Dictionary
- DATA DICTIONARY - contains the logical structure
of information in a database. - Definitions of all fields, records, and tables
- Relationships between tables
- Who is responsible for maintaining data in the
database - Descriptions of who is authorized to access
different parts of the database - Data dictionary contains meta data (data about
the data)
5Sample Data Dictionary Report
6Data Warehouse
- Definition- a database that stores current and
historical data designed to support business
analysis activities and decision-making tasks of
managers typically a relational database model
is used. The data warehouse uses special
software (tools) to assist managers extract
information. - Benefits
- improved access
- improved information
- isolation from operational systems
- tools permit advanced data analysis
- Users and data marts
7Building a Data Warehouse (ETL)
- Extraction phase create files on the computer
that will store the data warehouse and move
transaction data to this machine data may come
from many sources or parts of the organization - Transformation phase cleanse and standardize
the data. Why is this necessary? - Load phase transfer the data from the
transformation phase into the data warehouse - The ETL process becomes automated to make regular
transfers of transaction data into the data
warehouse
8Comparison of Data in a Data Warehouse and
Operational Data
- Operational Data
- Data is on many systems
- Current operational data
- Inconsistent data definitions
- Functionally organized data
- Data are constantly changing
- Support OLTP
- Warehouse Data
- Integrated in one enterprise-wide system
- Recent and historical data
- Consistent data definitions
- Data are organized around business entities
- Data are stabilized
- Support OLAP
9What Is Data Mining?
- Data mining (knowledge discovery in databases)
- Extraction of interesting (non-trivial,
previously unknown and potentially useful)
information or patterns from data in large
databases - Similar terms
- Exploratory data analysis
- Data driven discovery
- Deductive learning
- Knowledge extraction
10Data Mining
- A computer-based data analysis process
- Utilizes historical organizational data,
typically in a data warehouse - Uses a variety of data analysis, modeling, and
visualization techniques - One use is to discover previously unknown
patterns or potential relationships in the data
undirected - Also used to make predictions, verify
assumptions, or otherwise provide useful
information directed - Allows businesses to make proactive,
knowledge-driven decisions
11Data Mining Methods
- Prediction
- Use some variables to predict unknown or future
values of other variables - Description
- Find human-interpretable patterns that describe
the data
12Data Mining Methods and Models
13Data Mining Methods and Models
14Classification Definition
- Given a collection of data
- Each record contains a set of attributes, one of
the attributes is the class variable - Usually, the given data set is divided into
training and test sets, with training set used to
build the model and test set used to validate it - Find a model for the class attribute as a
function of the values of other attributes based
on the training set - Previously unseen data (test set) are used to
determine the accuracy of the model
15Classification Example Direct Marketing
- Goal Reduce cost of mailing by targeting a set
of consumers likely to buy a new cell-phone
product - Approach
- Use the data for a similar product introduced
before - We know which customers decided to buy and which
decided otherwise this buy, dont buy decision
is the class variable - Collect various demographic, lifestyle, and
company-interaction related information about all
such customers - Use this information as input attributes to
create a classifier model
16Classification Example Fraud Detection
- Goal Predict fraudulent cases in credit card
transactions - Approach
- Use past credit card transactions and the
information on its account-holder as attributes - When does a customer buy, what does he buy, how
often he pays on time, etc - Determine whether past transactions were fraud or
fair transactions this is the class variable - Create a model for the class of the transactions
17Classification Example Customer Churn
- Goal To predict whether a customer is likely to
be lost to a competitor - Approach
- Use detailed record of transactions with each of
the past and present customers, to find
attributes - How often the customer calls, where he calls,
what time-of-the day he calls most, his financial
status, marital status, etc - Label the customers as still with the company or
left the company (churned) - Create a model for churn
18Clustering Definition
- Clustering concerns segmenting a diverse
population into several homogeneous subgroups or
clusters. Clustering differs from classification
in that there are no predefined classifications. - Data points in one cluster are more similar to
one another - Data points in separate clusters are less similar
to one another - Similarity Measures
- Euclidean distance if attributes are continuous
- Other problem-specific measures
19Clustering Example Market Segmentation
- Goal Subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix - Approach
- Collect different attributes of customers based
on their geographical and lifestyle related
information - Find clusters of similar customers
- Measure the clustering quality by observing
buying patterns of customers in same cluster vs.
those from different clusters
20Clustering Document Retrieval
- Goal To find groups of documents that are
similar to each other based on the important
terms appearing in them - Approach To identify frequently occurring terms
in each document form a similarity measure
based on the frequencies of different terms
use it to cluster - Gain Retrieval can utilize the clusters to
relate a new document or search term to
clustered documents
21Association Rules or Affinity Grouping Definition
- Given a set of records, each of which contains
some number of items from a specific collection,
produce dependency rules which will predict
occurrence of an item based on occurrences of
other items
Rules Discovered Milk --gt Coke
Diaper, Milk --gt Beer
22Association Rules Example Market Basket
- Goal To identify items that are bought together
by sufficiently many customers - Approach Process the point-of-sale data
collected with barcode scanners to find
dependencies among items - A classic case If a customer buys diapers on
Friday evening, then he is very likely to
buy beer
23The Sad Truth About Diapers and Beer
- So, dont be surprised if you find six-packs
stacked next to diapers!
24Negative Implications of DM
- Privacy
-
- Profiling
- Unauthorized Use
- Big Brother
25Collecting Data for the Warehouse
- Customer loyalty cards have multiple uses, but
one use is to collect data for the data warehouse - Examples
- Grocery stores
- Web sites
- Harrahs
- Store related credit cards
- Assurance of a steady flow of data
26Multidimensionality or OLAP
- Multidimensional data analysis (or OLAP) enables
users to view data using various dimensions,
measures and time frames (i. e., OLAP) - dimensions products, business units, country,
industry (e.g., categories) - measures money, unit sales, head count,
variances - time daily, weekly, monthly, quarterly, yearly)
- This type of analysis also provides the ability
to view data in different ways (tables, charts,
3-D, geographically) - OLAP tools provide for this
- Pivot tables in Excel or Access
27Multi-dimensional Data Model
Three dimensional revenue model
Property
City
Time
28Example Multi-dimensional Data Model
29Example Multi-dimensional Data Table
30Example OLAP Analyses
- A common operation is to aggregate a measure over
one or more dimensions - Find the total revenue
- Find the total revenue for each city
- Find the top property-type for the 3rd quarter
based on total revenue across all cities
31Typical OLAP Operations
- Roll-up Aggregating data across different
dimension levels - Example given revenue by city, we can roll-up to
get total revenue by state - Drill-down The inverse of roll-up
disaggregating data - Example given total revenue by state, we can
drill-down to get revenue by city
32Roll-up and Drill-down
Rollup
Drill Down
33Typical OLAP Operations
- Slicing Dicing Selecting data within dimension
categories - Example given revenue by city for the entire
year, we can extract the revenue for a given
city for a given quarter - Rotating Reorienting the presentation of the
data cube - Example given a cube that presents revenue by
city property type for each quarter, we can
change the presentation to present the revenue
by property type quarter for each city
34Example OLAP Usage at an Automobile Dealership
35Example The Multi-dimensional Data Model Used
Sales Volume
Van
MOD E L
Coupe
Smith
Sedan
Clyde
Miller
Blue
Red
White
SALESPERSON
COLOR
36Example OLAP Slicing Dicing Selecting
Categories
Choose a range out of each dimension
Sales Volumes
Van
MOD E L
Coupe
Smith
Sedan
Clyde
Miller
Blue
Red
White
SALESPERSON
COLOR
37Example OLAP Rotation Changing the
Presentation of the Cube
Different Users will require different views of
the multidimensional cube
Sales Volume
Van
MOD E L
Coupe
Sedan
Blue
Red
White
COLOR
38Example OLAP Drill-Down and Roll-Up
Data can be disaggregated and aggregated along a
dimension according to the natural hierarchy
Sales Volume by Organization Dimension- three
level hierarchy -
Roll-Up
Georgia
State
Region
Atlanta
Athens
Salesperson
Miller
Smith
Clyde
Lucas
Gleason
Drill-Down
39Characteristics of OLAP Tools
- Primarily used to exploit data warehouses
- Provides extremely fast response
- View combinations of two dimensions
- Enable drilling down (start with broad info and
get more specific) - Produces results as counts or percentages
- Conversion of tables to charts/graphs
- Usually requires a tailored-made relational
database - OLAP applications are widely used by mid-level
and upper level managers - A form of business intelligence software
40Examples of OLAP Tools
- Go to www.fedscope.opm.gov
- Under data cubes on entry page click on
employment - Demonstrate drill down and adding charts
- Data for this example comes from the Central
Personnel Data File (CPDF) of the federal
government - The OLAP tool used to build this site is from a
company named Cognos (PowerPlay) - OLAP tools based on Excel
- http//www.cubularity.com
41Spreadsheets Versus DBMS
- Linkage between elements
- spreadsheet - between cells in same table
- DBMS - between elements in different tables
- Orientation
- spreadsheet is toward calculations
- DBMS is tilted toward organization and linkage of
data elements in different tables - Capabilities
- DBMS has extensive querying and reporting power
- spreadsheet is limited
- Memory requirements
- entire spreadsheet table must be in memory
- not true for the database table