Penn State Student Chapter of the Association for Computing Machinery - PowerPoint PPT Presentation

1 / 50

About This Presentation

Title:

Penn State Student Chapter of the Association for Computing Machinery

Description:

Title: Data Warehousing: An information Delivery and Decision Support Infrastructure Last modified by: Todd Bacastow Created Date: 9/28/2000 8:38:05 AM – PowerPoint PPT presentation

Number of Views:139

Avg rating:3.0/5.0

Slides: 51

Provided by: personalP9

Learn more at: http://www.personal.psu.edu

Category:

more less

Transcript and Presenter's Notes

Title: Penn State Student Chapter of the Association for Computing Machinery

1
Penn State Student Chapter of theAssociation
forComputing Machinery

We welcome all interested students to our 4th
general meeting of the Spring 2005 semester!
When Monday, April 11th, 2005 from 7-8 pm
Where Cybertorium (213 IST)
Agenda
Brief overview of our ACM chapter
New officer introductions
Special topic presentation No Pain, No
Game Presented by IST Professor Brian K. Smith
Co-op/Intern presentation Working at IBM
Presented by Rick Osowski
Free refreshments will be provided

2
Data Warehousing, Data Mining, and Advanced
Applications
3
Data Rich, but Information Poor

Data is stored, not explored by its volume and
complexity it represents a burden, not a
support
Data overload results in uninformed decisions,
contradictory information, higher overhead,
wrong decisions, increased costs
Data is not designed and is not structured for
successful management decision making

4
Improving Decision Making
5
Data Warehouse Concepts
6
Whats a Data Warehouse?

A data warehouse is a single, integrated source
of decision support information formed by
collecting data from multiple sources, internal
to the organization as well as external, and
transforming and summarising this information to
enable improved decision making.
A data warehouse is designed for easy access by
users to large amounts of information, and data
access is typically supported by specialized
analytical tools and applications.

7
Data Warehouse Characteristics

Key Characteristics of a Data Warehouse
Subject-oriented
Integrated
Time-variant
Non-volatile

8
Subject Oriented

Example for an insurance company

9
Integrated

Data is stored once in a single integrated
location(e.g. insurance company)

Auto Policy Processing System
Data Warehouse Database
Customer data stored in several databases
Fire Policy Processing System
Subject Customer
FACTS, LIFE Commercial, Accounting Applications
10
Time - Variant

Data is stored as a series of snapshots or views
which record how it is collected across time.

Data Warehouse Data

Time
Data
Key

Data is tagged with some element of time -
creation date, as of date, etc.
Data is available on-line for long periods of
time for trend analysis and forecasting. For
example, five or more years

11
Non-Volatile

Existing data in the warehouse is not overwritten
or updated.

External Sources

Load

Read-Only

12
Transaction System vs. Data Warehouse
13
Transaction-Based Reporting System
On-line, real time update into disparate systems
Day-to-day operations
System Experts
Users
Data Manipulation
Unix
VMS
MVS
Other
14
Warehouse-Based Reporting System
Unix
Executive Reporting and On-Line Analysis
Interfaces
Summarization
Data Staging, Transformation and Cleansing
VMS
Data Warehouse
MVS
Environment
Other
OLAP
BENEFIT Integrated, consistent data available
for analysis
BENEFIT Improve Network Reporting processes and
analytical capabilities
15
Transaction - Warehouse Process
Transaction Based Process
On-line, real time update.
Day-to-day operations
Detailed Information to operational systems.
Warehouse Based Process
Batch Load
Summarize Refine
Decision support for management use.
Transform
16
Transaction System vs. Data Warehouse

Data Warehouse

Transaction System

Supports management analysis and decision-making
processes
Contains summarized, refined, and cleansed
information
Non-volatile -- provides a data snapshot
adjustments are not permitted, or are limited
Business analysis requirements drive the data
structure and system design
Integrated, consistent information on a single
technology platform
Users have direct, fast access via On-line
Analytical Processing tools
Minimal impact on operational processes

Supports day-to-day operational processes
Contains raw, detailed data that has not been
refined or cleansed
Volatile -- data changes from day-to-day, with
frequent updates
Technical issues drive the data structure and
system design
Disparate data structures, physical locations,
query types, etc.
Users rely on technical analysts for reporting
needs
Operational processes impacted by queries run off
of system

17
Data Warehouse Architecture
18
Data Warehouse Architecture
19
Data Warehouse ArchitectureConversion and
Cleansing Activities

Map source data to target
Data scrubbing
Derive new data
Data Extraction
Transform / convert data
Create / modify metadata

Conversion Cleansing
20
Data Warehouse ArchitectureData Warehouse
Components
Detailed Data
Summary Data

Ranges from detailed to summarized data
Contains metadata
Many views of the data
Subject-Oriented
Time-variant

Metadata
21
Requirements Gathering Process Business Measure
Definition

Standard definition and related business rules
and formulas
Source data element(s), including quality
constraints
Data granularity levels (e.g., county detail for
state)
Data retention (e.g., one month, one quarter, one
year, multiple years)
Priority of the information (For example, is the
information necessary to derive other business
measures?)
Data load frequency (e.g., monthly, quarterly,
etc.)

22
Star Join Schema
23
Multi-Dimensional Analysis
24
Application Solution Classes

Executive information system (EIS)
Present information at the highest level of
summarization using corporate business measures.
They are designed for extreme ease-of-use and, in
many cases, only a mouse is required. Graphics
are usually generously incorporated to provide
at-a-glance indications of performance
Decision Support Systems (DSS)
They ideally present information in graphical and
tabular form, providing the user with the ability
to drill down on selected information. Note the
increased detail and data manipulation options
presented

25
Data Mining
1
26
Data Mining

The process of extracting valid, previously
unknown, comprehensible, and actionable
information from large databases and using it to
make crucial business decisions, (Simoudis,1996).
Involves the analysis of data and the use of
software techniques for finding hidden and
unexpected patterns and relationships in sets of
data.

27
Data Mining

Reveals information that is hidden and
unexpected, as little value in finding patterns
and relationships that are already intuitive.
Patterns and relationships are identified by
examining the underlying rules and features in
the data.
Data mining can provide huge paybacks for
companies who have made a significant investment
in data warehousing.
Relatively new technology, however already used
in a number of industries.

28
Examples of Applications of Data Mining

Retail / Marketing
Identifying buying patterns of customers
Finding associations among customer demographic
characteristics
Predicting response to mailing campaigns
Market basket analysis
Banking
Detecting patterns of fraudulent credit card use
Identifying loyal customers
Predicting customers likely to change their
credit card affiliation
Determining credit card spending by customer
groups

29
Examples of Applications of Data Mining

Insurance
Claims analysis
Predicting which customers will buy new policies
Medicine
Characterizing patient behavior to predict
surgery visits
Identifying successful medical therapies for
different illnesses

30
Data Mining Operations and Associated Techniques
31
Database Segmentation

Aim is to partition a database into an unknown
number of segments, or clusters, of similar
records.
Uses unsupervised learning to discover
homogeneous sub-populations in a database to
improve the accuracy of the profiles.
Less precise than other operations thus less
sensitive to redundant and irrelevant features.
Sensitivity can be reduced by ignoring a subset
of the attributes that describe each instance or
by assigning a weighting factor to each variable.
Applications of database segmentation include
customer profiling, direct marketing, and cross
selling.

32
Scatterplot
33
Visualization
34
Data Mining and Data Warehousing

Major challenge to exploit data mining is
identifying suitable data to mine.
Data mining requires single, separate, clean,
integrated, and self-consistent source of data.
A data warehouse is well equipped for providing
data for mining.
Data quality and consistency is a pre-requisite
for mining to ensure the accuracy of the
predictive models. Data warehouses are populated
with clean, consistent data.

35
Data Mining and Data Warehousing

It is advantageous to mine data from multiple
sources to discover as many interrelationships as
possible. Data warehouses contain data from a
number of sources.
Selecting the relevant subsets of records and
fields for data mining requires the query
capabilities of the data warehouse.
The results of a data mining study are useful if
there is some way to further investigate the
uncovered patterns. Data warehouses provide the
capability to go back to the data source.

36
Advanced Database Topics
37
A Little History

Prior to the 1980s ? hierarchical and network
databases.
Hardware ? dumb terminals using private networks
Database ? centralized and stored on the disk
packs
End user terminals ? simply input/output devices
?Processing at the mainframe
Data ? text data
Networks had to handle text data
No access from outside to the organization's
private network.

38
New Needs

Microcomputer enabled workstation processing
power.
Satellite and network technology provided for
very high speed, high traffic, and low cost long
distance communications networks.
Internet in the late 1990s and the corresponding
phenomenal growth in electronic commerce
(E-commerce) necessitated public access to data
in people's homes.
The volume of data needed to be transmitted
increased greatly.

39
New Needs

Business environment changed during the last two
decades
Information stored at different locations, on
different hardware and operating systems, with
different commercial DBMS products, and with
different underlying data models had to be
combined
The centralized database was no longer feasible
to handle these new demands

40
Distributed Database Scenario

There are many advantages to using a distributed
database rather than a centralized database. They
are
Improved performance, because high traffic data
are stored locally.
More efficient data management, because the DBA
workload is shared.
Better network integrity, because the whole
system does not stop if one computer goes down.
Expansion of the database is facilitated when the
organization grows, since new data does not have
to be centralized. It can remain and be
administered in the original location.
Data for the whole organization can still be
accessed from any location.

41
Distributed Database

Data administration is improved (??)
In a distributed database system even a simple
task like creating a backup copy of the database
can take a considerable amount of time.
If the database is divided among several
locations the time and workload for this task can
be shared.

42
Replication of Data

System failure in one location should not stop
processing in other locations
Replicate all or parts of the database in more
than one location.
Database replication improves performance and
provides a fail-safe option, but it involves
considerable complexity
Replication of frequently used data improves
response time and reduces network traffic
If the data changes at one location it must be
changed at all locations

43
Distributed Systems in an Ideal World

C. J. Date established rules for the ideal
distributed DBMS system
Rules are a goal that distributed systems strive
toward, but have not yet reached
According to Date's rules
Each site is responsible for its own portion of
the distributed database, including security,
backup, and recovery.
Each site has equal capabilities and does not
rely on any other site.
The system should work regardless of the computer
hardware, operating system, or network installed
at any site.

44
Date's Rules of Distributed Databases

Local site independence
Central site independence
Failure independence
Location transparency
Fragmentation transparency
Replication transparency
Distributed query processing
Distributed transaction processing
Hardware independence
Operating system independence
Network independence
Database independence

45
Complexities of Distributed Databases

There also are many complications involved in the
management of distributed database systems.
The distributed database must be carefully
designed to insure the following
Store data as close as possible to where it is
used most often.
Make the location of the data transparent to the
end user.
Make the system easy to expand.
Optimize queries to improve response time in the
distributed environment.

46
Database Design

The designer must analyze the organization's
needs and business processes to determine the
best way to distribute the database.
There are several possibilities for storing the
data in more than one location
Centralized master database
Replication of the entire or part of the database
in several locations
Horizontal partitions
Vertical partitions
Mixture of the above

47
Fragmentation

Horizontal fragmentation of the database
means that rows of a table(s) may be stored in
different locations
Similar to the separation of the customer table
in the retailing example above.
Vertical fragmentation means that columns of a
table ( i.e., attributes or groups of attributes
of an entity) are stored in different locations.

48
Query Formulation

Distributed databases require a considerable
amount of network overhead
Poorly formulated query it may cause unnecessary
data retrieval from the database
Query optimization is ideally performed by the
distributed database management system

49
OODB

In traditional relational databases E-R Modeling
and normalization focuses on identifying
entities, their attributes, and the relationships
between entities
This works well for most organizational data,
especially business data
The advent of the microcomputer and processing
power on the desktop
Computer aided design, CAD, became the norm for
engineering work, so it became necessary to store
drawings
Powerful multimedia PCs with sound cards and
color monitors enabled the manipulation of sound
and video files
Many other applications were developed that
required more than just text and numeric
processing

50
Why??

These new applications were facilitated by the
development of Object-Oriented Programming
Still evolving development of object-oriented
data modeling, object-oriented databases, and
object-oriented database management systems
OODBMS and O/R DBMS are two types of database
management systems that are currently available
O/R DBMS uses the basic theory of relational
database management systems with object-oriented
features added
OODBMS is more object-oriented and was developed
separately from the relational products
OODMBS suffers from a lack of standardization
that is available with relational database systems