Title: Penn State Student Chapter of the Association for Computing Machinery
1Penn State Student Chapter of theAssociation
forComputing Machinery
- We welcome all interested students to our 4th
general meeting of the Spring 2005 semester! - When Monday, April 11th, 2005 from 7-8 pm
Where Cybertorium (213 IST) - Agenda
- Brief overview of our ACM chapter
- New officer introductions
- Special topic presentation No Pain, No
Game Presented by IST Professor Brian K. Smith - Co-op/Intern presentation Working at IBM
- Presented by Rick Osowski
- Free refreshments will be provided
2Data Warehousing, Data Mining, and Advanced
Applications
3Data Rich, but Information Poor
- Data is stored, not explored by its volume and
complexity it represents a burden, not a
support - Data overload results in uninformed decisions,
contradictory information, higher overhead,
wrong decisions, increased costs - Data is not designed and is not structured for
successful management decision making
4Improving Decision Making
5Data Warehouse Concepts
6Whats a Data Warehouse?
- A data warehouse is a single, integrated source
of decision support information formed by
collecting data from multiple sources, internal
to the organization as well as external, and
transforming and summarising this information to
enable improved decision making. - A data warehouse is designed for easy access by
users to large amounts of information, and data
access is typically supported by specialized
analytical tools and applications.
7Data Warehouse Characteristics
- Key Characteristics of a Data Warehouse
- Subject-oriented
- Integrated
- Time-variant
- Non-volatile
8Subject Oriented
- Example for an insurance company
9Integrated
- Data is stored once in a single integrated
location(e.g. insurance company)
Auto Policy Processing System
Data Warehouse Database
Customer data stored in several databases
Fire Policy Processing System
Subject Customer
FACTS, LIFE Commercial, Accounting Applications
10Time - Variant
- Data is stored as a series of snapshots or views
which record how it is collected across time.
Data Warehouse Data
Time
Data
Key
- Data is tagged with some element of time -
creation date, as of date, etc. - Data is available on-line for long periods of
time for trend analysis and forecasting. For
example, five or more years
11Non-Volatile
- Existing data in the warehouse is not overwritten
or updated.
External Sources
12Transaction System vs. Data Warehouse
13Transaction-Based Reporting System
On-line, real time update into disparate systems
Day-to-day operations
System Experts
Users
Data Manipulation
Unix
VMS
MVS
Other
14Warehouse-Based Reporting System
Unix
Executive Reporting and On-Line Analysis
Interfaces
Summarization
Data Staging, Transformation and Cleansing
VMS
Data Warehouse
MVS
Environment
Other
OLAP
BENEFIT Integrated, consistent data available
for analysis
BENEFIT Improve Network Reporting processes and
analytical capabilities
15Transaction - Warehouse Process
Transaction Based Process
On-line, real time update.
Day-to-day operations
Detailed Information to operational systems.
Warehouse Based Process
Batch Load
Summarize Refine
Decision support for management use.
Transform
16Transaction System vs. Data Warehouse
- Supports management analysis and decision-making
processes - Contains summarized, refined, and cleansed
information - Non-volatile -- provides a data snapshot
adjustments are not permitted, or are limited - Business analysis requirements drive the data
structure and system design - Integrated, consistent information on a single
technology platform - Users have direct, fast access via On-line
Analytical Processing tools - Minimal impact on operational processes
- Supports day-to-day operational processes
- Contains raw, detailed data that has not been
refined or cleansed - Volatile -- data changes from day-to-day, with
frequent updates - Technical issues drive the data structure and
system design - Disparate data structures, physical locations,
query types, etc. - Users rely on technical analysts for reporting
needs - Operational processes impacted by queries run off
of system
17Data Warehouse Architecture
18Data Warehouse Architecture
19Data Warehouse ArchitectureConversion and
Cleansing Activities
- Map source data to target
- Data scrubbing
- Derive new data
- Data Extraction
- Transform / convert data
- Create / modify metadata
Conversion Cleansing
20Data Warehouse ArchitectureData Warehouse
Components
Detailed Data
Summary Data
- Ranges from detailed to summarized data
- Contains metadata
- Many views of the data
- Subject-Oriented
- Time-variant
Metadata
21Requirements Gathering Process Business Measure
Definition
- Standard definition and related business rules
and formulas - Source data element(s), including quality
constraints - Data granularity levels (e.g., county detail for
state) - Data retention (e.g., one month, one quarter, one
year, multiple years) - Priority of the information (For example, is the
information necessary to derive other business
measures?) - Data load frequency (e.g., monthly, quarterly,
etc.)
22Star Join Schema
23Multi-Dimensional Analysis
24Application Solution Classes
- Executive information system (EIS)
- Present information at the highest level of
summarization using corporate business measures.
They are designed for extreme ease-of-use and, in
many cases, only a mouse is required. Graphics
are usually generously incorporated to provide
at-a-glance indications of performance - Decision Support Systems (DSS)
- They ideally present information in graphical and
tabular form, providing the user with the ability
to drill down on selected information. Note the
increased detail and data manipulation options
presented
25Data Mining
1
26Data Mining
- The process of extracting valid, previously
unknown, comprehensible, and actionable
information from large databases and using it to
make crucial business decisions, (Simoudis,1996). - Involves the analysis of data and the use of
software techniques for finding hidden and
unexpected patterns and relationships in sets of
data.
27Data Mining
- Reveals information that is hidden and
unexpected, as little value in finding patterns
and relationships that are already intuitive. - Patterns and relationships are identified by
examining the underlying rules and features in
the data. - Data mining can provide huge paybacks for
companies who have made a significant investment
in data warehousing. - Relatively new technology, however already used
in a number of industries.
28Examples of Applications of Data Mining
- Retail / Marketing
- Identifying buying patterns of customers
- Finding associations among customer demographic
characteristics - Predicting response to mailing campaigns
- Market basket analysis
- Banking
- Detecting patterns of fraudulent credit card use
- Identifying loyal customers
- Predicting customers likely to change their
credit card affiliation - Determining credit card spending by customer
groups
29Examples of Applications of Data Mining
- Insurance
- Claims analysis
- Predicting which customers will buy new policies
- Medicine
- Characterizing patient behavior to predict
surgery visits - Identifying successful medical therapies for
different illnesses
30Data Mining Operations and Associated Techniques
31Database Segmentation
- Aim is to partition a database into an unknown
number of segments, or clusters, of similar
records. - Uses unsupervised learning to discover
homogeneous sub-populations in a database to
improve the accuracy of the profiles. - Less precise than other operations thus less
sensitive to redundant and irrelevant features. - Sensitivity can be reduced by ignoring a subset
of the attributes that describe each instance or
by assigning a weighting factor to each variable.
- Applications of database segmentation include
customer profiling, direct marketing, and cross
selling.
32Scatterplot
33Visualization
34Data Mining and Data Warehousing
- Major challenge to exploit data mining is
identifying suitable data to mine. - Data mining requires single, separate, clean,
integrated, and self-consistent source of data. - A data warehouse is well equipped for providing
data for mining. - Data quality and consistency is a pre-requisite
for mining to ensure the accuracy of the
predictive models. Data warehouses are populated
with clean, consistent data.
35Data Mining and Data Warehousing
- It is advantageous to mine data from multiple
sources to discover as many interrelationships as
possible. Data warehouses contain data from a
number of sources. - Selecting the relevant subsets of records and
fields for data mining requires the query
capabilities of the data warehouse. - The results of a data mining study are useful if
there is some way to further investigate the
uncovered patterns. Data warehouses provide the
capability to go back to the data source.
36Advanced Database Topics
37A Little History
- Prior to the 1980s ? hierarchical and network
databases. - Hardware ? dumb terminals using private networks
- Database ? centralized and stored on the disk
packs - End user terminals ? simply input/output devices
?Processing at the mainframe - Data ? text data
- Networks had to handle text data
- No access from outside to the organization's
private network.
38New Needs
- Microcomputer enabled workstation processing
power. - Satellite and network technology provided for
very high speed, high traffic, and low cost long
distance communications networks. - Internet in the late 1990s and the corresponding
phenomenal growth in electronic commerce
(E-commerce) necessitated public access to data
in people's homes. - The volume of data needed to be transmitted
increased greatly.
39New Needs
- Business environment changed during the last two
decades - Information stored at different locations, on
different hardware and operating systems, with
different commercial DBMS products, and with
different underlying data models had to be
combined - The centralized database was no longer feasible
to handle these new demands
40Distributed Database Scenario
- There are many advantages to using a distributed
database rather than a centralized database. They
are - Improved performance, because high traffic data
are stored locally. - More efficient data management, because the DBA
workload is shared. - Better network integrity, because the whole
system does not stop if one computer goes down. - Expansion of the database is facilitated when the
organization grows, since new data does not have
to be centralized. It can remain and be
administered in the original location. - Data for the whole organization can still be
accessed from any location.
41Distributed Database
- Data administration is improved (??)
- In a distributed database system even a simple
task like creating a backup copy of the database
can take a considerable amount of time. - If the database is divided among several
locations the time and workload for this task can
be shared.
42Replication of Data
- System failure in one location should not stop
processing in other locations - Replicate all or parts of the database in more
than one location. - Database replication improves performance and
provides a fail-safe option, but it involves
considerable complexity - Replication of frequently used data improves
response time and reduces network traffic - If the data changes at one location it must be
changed at all locations
43Distributed Systems in an Ideal World
- C. J. Date established rules for the ideal
distributed DBMS system - Rules are a goal that distributed systems strive
toward, but have not yet reached - According to Date's rules
- Each site is responsible for its own portion of
the distributed database, including security,
backup, and recovery. - Each site has equal capabilities and does not
rely on any other site. - The system should work regardless of the computer
hardware, operating system, or network installed
at any site.
44Date's Rules of Distributed Databases
- Local site independence
- Central site independence
- Failure independence
- Location transparency
- Fragmentation transparency
- Replication transparency
- Distributed query processing
- Distributed transaction processing
- Hardware independence
- Operating system independence
- Network independence
- Database independence
45Complexities of Distributed Databases
- There also are many complications involved in the
management of distributed database systems. - The distributed database must be carefully
designed to insure the following - Store data as close as possible to where it is
used most often. - Make the location of the data transparent to the
end user. - Make the system easy to expand.
- Optimize queries to improve response time in the
distributed environment.
46Database Design
- The designer must analyze the organization's
needs and business processes to determine the
best way to distribute the database. - There are several possibilities for storing the
data in more than one location - Centralized master database
- Replication of the entire or part of the database
in several locations - Horizontal partitions
- Vertical partitions
- Mixture of the above
47Fragmentation
- Horizontal fragmentation of the database
- means that rows of a table(s) may be stored in
different locations - Similar to the separation of the customer table
in the retailing example above. - Vertical fragmentation means that columns of a
table ( i.e., attributes or groups of attributes
of an entity) are stored in different locations.
48Query Formulation
- Distributed databases require a considerable
amount of network overhead - Poorly formulated query it may cause unnecessary
data retrieval from the database - Query optimization is ideally performed by the
distributed database management system
49OODB
- In traditional relational databases E-R Modeling
and normalization focuses on identifying
entities, their attributes, and the relationships
between entities - This works well for most organizational data,
especially business data - The advent of the microcomputer and processing
power on the desktop - Computer aided design, CAD, became the norm for
engineering work, so it became necessary to store
drawings - Powerful multimedia PCs with sound cards and
color monitors enabled the manipulation of sound
and video files - Many other applications were developed that
required more than just text and numeric
processing
50Why??
- These new applications were facilitated by the
development of Object-Oriented Programming - Still evolving development of object-oriented
data modeling, object-oriented databases, and
object-oriented database management systems - OODBMS and O/R DBMS are two types of database
management systems that are currently available - O/R DBMS uses the basic theory of relational
database management systems with object-oriented
features added - OODBMS is more object-oriented and was developed
separately from the relational products - OODMBS suffers from a lack of standardization
that is available with relational database systems