Penn State Student Chapter of the Association for Computing Machinery - PowerPoint PPT Presentation

1 / 50
About This Presentation
Title:

Penn State Student Chapter of the Association for Computing Machinery

Description:

Title: Data Warehousing: An information Delivery and Decision Support Infrastructure Last modified by: Todd Bacastow Created Date: 9/28/2000 8:38:05 AM – PowerPoint PPT presentation

Number of Views:139
Avg rating:3.0/5.0
Slides: 51
Provided by: personalP9
Category:

less

Transcript and Presenter's Notes

Title: Penn State Student Chapter of the Association for Computing Machinery


1
Penn State Student Chapter of theAssociation
forComputing Machinery
  • We welcome all interested students to our 4th
    general meeting of the Spring 2005 semester!
  • When Monday, April 11th, 2005 from 7-8 pm
    Where Cybertorium (213 IST)
  • Agenda
  • Brief overview of our ACM chapter
  • New officer introductions
  • Special topic presentation No Pain, No
    Game Presented by IST Professor Brian K. Smith
  • Co-op/Intern presentation Working at IBM
  • Presented by Rick Osowski
  • Free refreshments will be provided

2
Data Warehousing, Data Mining, and Advanced
Applications
3
Data Rich, but Information Poor
  • Data is stored, not explored by its volume and
    complexity it represents a burden, not a
    support
  • Data overload results in uninformed decisions,
    contradictory information, higher overhead,
    wrong decisions, increased costs
  • Data is not designed and is not structured for
    successful management decision making

4
Improving Decision Making
5
Data Warehouse Concepts
6
Whats a Data Warehouse?
  • A data warehouse is a single, integrated source
    of decision support information formed by
    collecting data from multiple sources, internal
    to the organization as well as external, and
    transforming and summarising this information to
    enable improved decision making.
  • A data warehouse is designed for easy access by
    users to large amounts of information, and data
    access is typically supported by specialized
    analytical tools and applications.

7
Data Warehouse Characteristics
  • Key Characteristics of a Data Warehouse
  • Subject-oriented
  • Integrated
  • Time-variant
  • Non-volatile

8
Subject Oriented
  • Example for an insurance company

9
Integrated
  • Data is stored once in a single integrated
    location(e.g. insurance company)

Auto Policy Processing System
Data Warehouse Database
Customer data stored in several databases
Fire Policy Processing System
Subject Customer
FACTS, LIFE Commercial, Accounting Applications
10
Time - Variant
  • Data is stored as a series of snapshots or views
    which record how it is collected across time.

Data Warehouse Data

Time
Data
Key
  • Data is tagged with some element of time -
    creation date, as of date, etc.
  • Data is available on-line for long periods of
    time for trend analysis and forecasting. For
    example, five or more years

11
Non-Volatile
  • Existing data in the warehouse is not overwritten
    or updated.

External Sources
  • Load
  • Read-Only

12
Transaction System vs. Data Warehouse
13
Transaction-Based Reporting System
On-line, real time update into disparate systems
Day-to-day operations
System Experts
Users
Data Manipulation
Unix
VMS
MVS
Other
14
Warehouse-Based Reporting System
Unix
Executive Reporting and On-Line Analysis
Interfaces
Summarization
Data Staging, Transformation and Cleansing
VMS
Data Warehouse
MVS
Environment
Other
OLAP
BENEFIT Integrated, consistent data available
for analysis
BENEFIT Improve Network Reporting processes and
analytical capabilities
15
Transaction - Warehouse Process
Transaction Based Process
On-line, real time update.
Day-to-day operations
Detailed Information to operational systems.
Warehouse Based Process
Batch Load
Summarize Refine
Decision support for management use.
Transform
16
Transaction System vs. Data Warehouse
  • Data Warehouse
  • Transaction System
  • Supports management analysis and decision-making
    processes
  • Contains summarized, refined, and cleansed
    information
  • Non-volatile -- provides a data snapshot
    adjustments are not permitted, or are limited
  • Business analysis requirements drive the data
    structure and system design
  • Integrated, consistent information on a single
    technology platform
  • Users have direct, fast access via On-line
    Analytical Processing tools
  • Minimal impact on operational processes
  • Supports day-to-day operational processes
  • Contains raw, detailed data that has not been
    refined or cleansed
  • Volatile -- data changes from day-to-day, with
    frequent updates
  • Technical issues drive the data structure and
    system design
  • Disparate data structures, physical locations,
    query types, etc.
  • Users rely on technical analysts for reporting
    needs
  • Operational processes impacted by queries run off
    of system

17
Data Warehouse Architecture
18
Data Warehouse Architecture
19
Data Warehouse ArchitectureConversion and
Cleansing Activities
  • Map source data to target
  • Data scrubbing
  • Derive new data
  • Data Extraction
  • Transform / convert data
  • Create / modify metadata

Conversion Cleansing
20
Data Warehouse ArchitectureData Warehouse
Components
Detailed Data
Summary Data
  • Ranges from detailed to summarized data
  • Contains metadata
  • Many views of the data
  • Subject-Oriented
  • Time-variant

Metadata
21
Requirements Gathering Process Business Measure
Definition
  • Standard definition and related business rules
    and formulas
  • Source data element(s), including quality
    constraints
  • Data granularity levels (e.g., county detail for
    state)
  • Data retention (e.g., one month, one quarter, one
    year, multiple years)
  • Priority of the information (For example, is the
    information necessary to derive other business
    measures?)
  • Data load frequency (e.g., monthly, quarterly,
    etc.)

22
Star Join Schema
23
Multi-Dimensional Analysis
24
Application Solution Classes
  • Executive information system (EIS)
  • Present information at the highest level of
    summarization using corporate business measures.
    They are designed for extreme ease-of-use and, in
    many cases, only a mouse is required. Graphics
    are usually generously incorporated to provide
    at-a-glance indications of performance
  • Decision Support Systems (DSS)
  • They ideally present information in graphical and
    tabular form, providing the user with the ability
    to drill down on selected information. Note the
    increased detail and data manipulation options
    presented

25
Data Mining
1
26
Data Mining
  • The process of extracting valid, previously
    unknown, comprehensible, and actionable
    information from large databases and using it to
    make crucial business decisions, (Simoudis,1996).
  • Involves the analysis of data and the use of
    software techniques for finding hidden and
    unexpected patterns and relationships in sets of
    data.

27
Data Mining
  • Reveals information that is hidden and
    unexpected, as little value in finding patterns
    and relationships that are already intuitive.
  • Patterns and relationships are identified by
    examining the underlying rules and features in
    the data.
  • Data mining can provide huge paybacks for
    companies who have made a significant investment
    in data warehousing.
  • Relatively new technology, however already used
    in a number of industries.

28
Examples of Applications of Data Mining
  • Retail / Marketing
  • Identifying buying patterns of customers
  • Finding associations among customer demographic
    characteristics
  • Predicting response to mailing campaigns
  • Market basket analysis
  • Banking
  • Detecting patterns of fraudulent credit card use
  • Identifying loyal customers
  • Predicting customers likely to change their
    credit card affiliation
  • Determining credit card spending by customer
    groups

29
Examples of Applications of Data Mining
  • Insurance
  • Claims analysis
  • Predicting which customers will buy new policies
  • Medicine
  • Characterizing patient behavior to predict
    surgery visits
  • Identifying successful medical therapies for
    different illnesses

30
Data Mining Operations and Associated Techniques
31
Database Segmentation
  • Aim is to partition a database into an unknown
    number of segments, or clusters, of similar
    records.
  • Uses unsupervised learning to discover
    homogeneous sub-populations in a database to
    improve the accuracy of the profiles.
  • Less precise than other operations thus less
    sensitive to redundant and irrelevant features.
  • Sensitivity can be reduced by ignoring a subset
    of the attributes that describe each instance or
    by assigning a weighting factor to each variable.
  • Applications of database segmentation include
    customer profiling, direct marketing, and cross
    selling.

32
Scatterplot
33
Visualization
34
Data Mining and Data Warehousing
  • Major challenge to exploit data mining is
    identifying suitable data to mine.
  • Data mining requires single, separate, clean,
    integrated, and self-consistent source of data.
  • A data warehouse is well equipped for providing
    data for mining.
  • Data quality and consistency is a pre-requisite
    for mining to ensure the accuracy of the
    predictive models. Data warehouses are populated
    with clean, consistent data.

35
Data Mining and Data Warehousing
  • It is advantageous to mine data from multiple
    sources to discover as many interrelationships as
    possible. Data warehouses contain data from a
    number of sources.
  • Selecting the relevant subsets of records and
    fields for data mining requires the query
    capabilities of the data warehouse.
  • The results of a data mining study are useful if
    there is some way to further investigate the
    uncovered patterns. Data warehouses provide the
    capability to go back to the data source.

36
Advanced Database Topics
37
A Little History
  • Prior to the 1980s ? hierarchical and network
    databases.
  • Hardware ? dumb terminals using private networks
  • Database ? centralized and stored on the disk
    packs
  • End user terminals ? simply input/output devices
    ?Processing at the mainframe
  • Data ? text data
  • Networks had to handle text data
  • No access from outside to the organization's
    private network.

38
New Needs
  • Microcomputer enabled workstation processing
    power.
  • Satellite and network technology provided for
    very high speed, high traffic, and low cost long
    distance communications networks.
  • Internet in the late 1990s and the corresponding
    phenomenal growth in electronic commerce
    (E-commerce) necessitated public access to data
    in people's homes.
  • The volume of data needed to be transmitted
    increased greatly.

39
New Needs
  • Business environment changed during the last two
    decades
  • Information stored at different locations, on
    different hardware and operating systems, with
    different commercial DBMS products, and with
    different underlying data models had to be
    combined
  • The centralized database was no longer feasible
    to handle these new demands

40
Distributed Database Scenario
  • There are many advantages to using a distributed
    database rather than a centralized database. They
    are
  • Improved performance, because high traffic data
    are stored locally.
  • More efficient data management, because the DBA
    workload is shared.
  • Better network integrity, because the whole
    system does not stop if one computer goes down.
  • Expansion of the database is facilitated when the
    organization grows, since new data does not have
    to be centralized. It can remain and be
    administered in the original location.
  • Data for the whole organization can still be
    accessed from any location.

41
Distributed Database
  • Data administration is improved (??)
  • In a distributed database system even a simple
    task like creating a backup copy of the database
    can take a considerable amount of time.
  • If the database is divided among several
    locations the time and workload for this task can
    be shared.

42
Replication of Data
  • System failure in one location should not stop
    processing in other locations
  • Replicate all or parts of the database in more
    than one location.
  • Database replication improves performance and
    provides a fail-safe option, but it involves
    considerable complexity
  • Replication of frequently used data improves
    response time and reduces network traffic
  • If the data changes at one location it must be
    changed at all locations

43
Distributed Systems in an Ideal World
  • C. J. Date established rules for the ideal
    distributed DBMS system
  • Rules are a goal that distributed systems strive
    toward, but have not yet reached
  • According to Date's rules
  • Each site is responsible for its own portion of
    the distributed database, including security,
    backup, and recovery.
  • Each site has equal capabilities and does not
    rely on any other site.
  • The system should work regardless of the computer
    hardware, operating system, or network installed
    at any site.

44
Date's Rules of Distributed Databases
  1. Local site independence
  2. Central site independence
  3. Failure independence
  4. Location transparency
  5. Fragmentation transparency
  6. Replication transparency
  7. Distributed query processing
  8. Distributed transaction processing
  9. Hardware independence
  10. Operating system independence
  11. Network independence
  12. Database independence

45
Complexities of Distributed Databases
  • There also are many complications involved in the
    management of distributed database systems.
  • The distributed database must be carefully
    designed to insure the following
  • Store data as close as possible to where it is
    used most often.
  • Make the location of the data transparent to the
    end user.
  • Make the system easy to expand.
  • Optimize queries to improve response time in the
    distributed environment.

46
Database Design
  • The designer must analyze the organization's
    needs and business processes to determine the
    best way to distribute the database.
  • There are several possibilities for storing the
    data in more than one location
  • Centralized master database
  • Replication of the entire or part of the database
    in several locations
  • Horizontal partitions
  • Vertical partitions
  • Mixture of the above

47
Fragmentation
  • Horizontal fragmentation of the database
  • means that rows of a table(s) may be stored in
    different locations
  • Similar to the separation of the customer table
    in the retailing example above.
  • Vertical fragmentation means that columns of a
    table ( i.e., attributes or groups of attributes
    of an entity) are stored in different locations.

48
Query Formulation
  • Distributed databases require a considerable
    amount of network overhead
  • Poorly formulated query it may cause unnecessary
    data retrieval from the database
  • Query optimization is ideally performed by the
    distributed database management system

49
OODB
  • In traditional relational databases E-R Modeling
    and normalization focuses on identifying
    entities, their attributes, and the relationships
    between entities
  • This works well for most organizational data,
    especially business data
  • The advent of the microcomputer and processing
    power on the desktop
  • Computer aided design, CAD, became the norm for
    engineering work, so it became necessary to store
    drawings
  • Powerful multimedia PCs with sound cards and
    color monitors enabled the manipulation of sound
    and video files
  • Many other applications were developed that
    required more than just text and numeric
    processing

50
Why??
  • These new applications were facilitated by the
    development of Object-Oriented Programming
  • Still evolving development of object-oriented
    data modeling, object-oriented databases, and
    object-oriented database management systems
  • OODBMS and O/R DBMS are two types of database
    management systems that are currently available
  • O/R DBMS uses the basic theory of relational
    database management systems with object-oriented
    features added
  • OODBMS is more object-oriented and was developed
    separately from the relational products
  • OODMBS suffers from a lack of standardization
    that is available with relational database systems
Write a Comment
User Comments (0)
About PowerShow.com