Data Warehouses and Data Mining - PowerPoint PPT Presentation

1 / 41
About This Presentation
Title:

Data Warehouses and Data Mining

Description:

In managing information, physical deals with the structure ... Gleason. Atlanta. Athens. Georgia. Sales Volume by Organization Dimension - three level hierarchy ... – PowerPoint PPT presentation

Number of Views:29
Avg rating:3.0/5.0
Slides: 42
Provided by: davidbr6
Category:

less

Transcript and Presenter's Notes

Title: Data Warehouses and Data Mining


1
Data Warehouses and Data Mining
2
Resources in VS
  • Database and data warehousing
  • Chapter 4 pages 167-177
  • Data mining and OLAP
  • Chapter 8 pages 321-325

3
Physical versus Logical Views
  • In managing information, physical deals with the
    structure of information as it resides on various
    storage media.
  • Logical deals with how knowledge workers view
    their information needs, and includes such terms
    as
  • CHARACTER - our smallest unit of information.
  • FIELD - group of related characters.
  • RECORD - group of related fields.
  • FILE - group of related records.
  • DATABASE - group of logically associated files.
  • DATA WAREHOUSE - information from many databases.

4
The Data Dictionary
  • DATA DICTIONARY - contains the logical structure
    of information in a database.
  • Definitions of all fields, records, and tables
  • Relationships between tables
  • Who is responsible for maintaining data in the
    database
  • Descriptions of who is authorized to access
    different parts of the database
  • Data dictionary contains meta data (data about
    the data)

5
Sample Data Dictionary Report
6
Data Warehouse
  • Definition- a database that stores current and
    historical data designed to support business
    analysis activities and decision-making tasks of
    managers typically a relational database model
    is used. The data warehouse uses special
    software (tools) to assist managers extract
    information.
  • Benefits
  • improved access
  • improved information
  • isolation from operational systems
  • tools permit advanced data analysis
  • Users and data marts

7
Building a Data Warehouse (ETL)
  • Extraction phase create files on the computer
    that will store the data warehouse and move
    transaction data to this machine data may come
    from many sources or parts of the organization
  • Transformation phase cleanse and standardize
    the data. Why is this necessary?
  • Load phase transfer the data from the
    transformation phase into the data warehouse
  • The ETL process becomes automated to make regular
    transfers of transaction data into the data
    warehouse

8
Comparison of Data in a Data Warehouse and
Operational Data
  • Operational Data
  • Data is on many systems
  • Current operational data
  • Inconsistent data definitions
  • Functionally organized data
  • Data are constantly changing
  • Support OLTP
  • Warehouse Data
  • Integrated in one enterprise-wide system
  • Recent and historical data
  • Consistent data definitions
  • Data are organized around business entities
  • Data are stabilized
  • Support OLAP

9
What Is Data Mining?
  • Data mining (knowledge discovery in databases)
  • Extraction of interesting (non-trivial,
    previously unknown and potentially useful)
    information or patterns from data in large
    databases
  • Similar terms
  • Exploratory data analysis
  • Data driven discovery
  • Deductive learning
  • Knowledge extraction

10
Data Mining
  • A computer-based data analysis process
  • Utilizes historical organizational data,
    typically in a data warehouse
  • Uses a variety of data analysis, modeling, and
    visualization techniques
  • One use is to discover previously unknown
    patterns or potential relationships in the data
    undirected
  • Also used to make predictions, verify
    assumptions, or otherwise provide useful
    information directed
  • Allows businesses to make proactive,
    knowledge-driven decisions

11
Data Mining Methods
  • Prediction
  • Use some variables to predict unknown or future
    values of other variables
  • Description
  • Find human-interpretable patterns that describe
    the data

12
Data Mining Methods and Models
13
Data Mining Methods and Models
14
Classification Definition
  • Given a collection of data
  • Each record contains a set of attributes, one of
    the attributes is the class variable
  • Usually, the given data set is divided into
    training and test sets, with training set used to
    build the model and test set used to validate it
  • Find a model for the class attribute as a
    function of the values of other attributes based
    on the training set
  • Previously unseen data (test set) are used to
    determine the accuracy of the model

15
Classification Example Direct Marketing
  • Goal Reduce cost of mailing by targeting a set
    of consumers likely to buy a new cell-phone
    product
  • Approach
  • Use the data for a similar product introduced
    before
  • We know which customers decided to buy and which
    decided otherwise this buy, dont buy decision
    is the class variable
  • Collect various demographic, lifestyle, and
    company-interaction related information about all
    such customers
  • Use this information as input attributes to
    create a classifier model

16
Classification Example Fraud Detection
  • Goal Predict fraudulent cases in credit card
    transactions
  • Approach
  • Use past credit card transactions and the
    information on its account-holder as attributes
  • When does a customer buy, what does he buy, how
    often he pays on time, etc
  • Determine whether past transactions were fraud or
    fair transactions this is the class variable
  • Create a model for the class of the transactions

17
Classification Example Customer Churn
  • Goal To predict whether a customer is likely to
    be lost to a competitor
  • Approach
  • Use detailed record of transactions with each of
    the past and present customers, to find
    attributes
  • How often the customer calls, where he calls,
    what time-of-the day he calls most, his financial
    status, marital status, etc
  • Label the customers as still with the company or
    left the company (churned)
  • Create a model for churn

18
Clustering Definition
  • Clustering concerns segmenting a diverse
    population into several homogeneous subgroups or
    clusters. Clustering differs from classification
    in that there are no predefined classifications.
  • Data points in one cluster are more similar to
    one another
  • Data points in separate clusters are less similar
    to one another
  • Similarity Measures
  • Euclidean distance if attributes are continuous
  • Other problem-specific measures

19
Clustering Example Market Segmentation
  • Goal Subdivide a market into distinct subsets of
    customers where any subset may conceivably be
    selected as a market target to be reached with a
    distinct marketing mix
  • Approach
  • Collect different attributes of customers based
    on their geographical and lifestyle related
    information
  • Find clusters of similar customers
  • Measure the clustering quality by observing
    buying patterns of customers in same cluster vs.
    those from different clusters

20
Clustering Document Retrieval
  • Goal To find groups of documents that are
    similar to each other based on the important
    terms appearing in them
  • Approach To identify frequently occurring terms
    in each document form a similarity measure
    based on the frequencies of different terms
    use it to cluster
  • Gain Retrieval can utilize the clusters to
    relate a new document or search term to
    clustered documents

21
Association Rules or Affinity Grouping Definition
  • Given a set of records, each of which contains
    some number of items from a specific collection,
    produce dependency rules which will predict
    occurrence of an item based on occurrences of
    other items

Rules Discovered Milk --gt Coke
Diaper, Milk --gt Beer
22
Association Rules Example Market Basket
  • Goal To identify items that are bought together
    by sufficiently many customers
  • Approach Process the point-of-sale data
    collected with barcode scanners to find
    dependencies among items
  • A classic case If a customer buys diapers on
    Friday evening, then he is very likely to
    buy beer

23
The Sad Truth About Diapers and Beer
  • So, dont be surprised if you find six-packs
    stacked next to diapers!

24
Negative Implications of DM
  • Privacy
  • Profiling
  • Unauthorized Use
  • Big Brother

25
Collecting Data for the Warehouse
  • Customer loyalty cards have multiple uses, but
    one use is to collect data for the data warehouse
  • Examples
  • Grocery stores
  • Web sites
  • Harrahs
  • Store related credit cards
  • Assurance of a steady flow of data

26
Multidimensionality or OLAP
  • Multidimensional data analysis (or OLAP) enables
    users to view data using various dimensions,
    measures and time frames (i. e., OLAP)
  • dimensions products, business units, country,
    industry (e.g., categories)
  • measures money, unit sales, head count,
    variances
  • time daily, weekly, monthly, quarterly, yearly)
  • This type of analysis also provides the ability
    to view data in different ways (tables, charts,
    3-D, geographically)
  • OLAP tools provide for this
  • Pivot tables in Excel or Access

27
Multi-dimensional Data Model
Three dimensional revenue model
Property
City
Time
28
Example Multi-dimensional Data Model
29
Example Multi-dimensional Data Table
30
Example OLAP Analyses
  • A common operation is to aggregate a measure over
    one or more dimensions
  • Find the total revenue
  • Find the total revenue for each city
  • Find the top property-type for the 3rd quarter
    based on total revenue across all cities

31
Typical OLAP Operations
  • Roll-up Aggregating data across different
    dimension levels
  • Example given revenue by city, we can roll-up to
    get total revenue by state
  • Drill-down The inverse of roll-up
    disaggregating data
  • Example given total revenue by state, we can
    drill-down to get revenue by city

32
Roll-up and Drill-down
Rollup
Drill Down
33
Typical OLAP Operations
  • Slicing Dicing Selecting data within dimension
    categories
  • Example given revenue by city for the entire
    year, we can extract the revenue for a given
    city for a given quarter
  • Rotating Reorienting the presentation of the
    data cube
  • Example given a cube that presents revenue by
    city property type for each quarter, we can
    change the presentation to present the revenue
    by property type quarter for each city

34
Example OLAP Usage at an Automobile Dealership
35
Example The Multi-dimensional Data Model Used
Sales Volume
Van
MOD E L
Coupe
Smith
Sedan
Clyde
Miller
Blue
Red
White
SALESPERSON
COLOR
36
Example OLAP Slicing Dicing Selecting
Categories
  • Color Blue and White

Choose a range out of each dimension
  • Model Coupe only
  • Salesperson Clyde only

Sales Volumes
Van
MOD E L
Coupe
Smith
Sedan
Clyde
Miller
Blue
Red
White
SALESPERSON
COLOR
37
Example OLAP Rotation Changing the
Presentation of the Cube
Different Users will require different views of
the multidimensional cube
Sales Volume
Van
MOD E L
Coupe
Sedan
Blue
Red
White
COLOR
38
Example OLAP Drill-Down and Roll-Up
Data can be disaggregated and aggregated along a
dimension according to the natural hierarchy
Sales Volume by Organization Dimension- three
level hierarchy -
Roll-Up
Georgia
State
Region
Atlanta
Athens
Salesperson
Miller
Smith
Clyde
Lucas
Gleason
Drill-Down
39
Characteristics of OLAP Tools
  • Primarily used to exploit data warehouses
  • Provides extremely fast response
  • View combinations of two dimensions
  • Enable drilling down (start with broad info and
    get more specific)
  • Produces results as counts or percentages
  • Conversion of tables to charts/graphs
  • Usually requires a tailored-made relational
    database
  • OLAP applications are widely used by mid-level
    and upper level managers
  • A form of business intelligence software

40
Examples of OLAP Tools
  • Go to www.fedscope.opm.gov
  • Under data cubes on entry page click on
    employment
  • Demonstrate drill down and adding charts
  • Data for this example comes from the Central
    Personnel Data File (CPDF) of the federal
    government
  • The OLAP tool used to build this site is from a
    company named Cognos (PowerPlay)
  • OLAP tools based on Excel
  • http//www.cubularity.com

41
Spreadsheets Versus DBMS
  • Linkage between elements
  • spreadsheet - between cells in same table
  • DBMS - between elements in different tables
  • Orientation
  • spreadsheet is toward calculations
  • DBMS is tilted toward organization and linkage of
    data elements in different tables
  • Capabilities
  • DBMS has extensive querying and reporting power
  • spreadsheet is limited
  • Memory requirements
  • entire spreadsheet table must be in memory
  • not true for the database table
Write a Comment
User Comments (0)
About PowerShow.com