Title: Data Warehousing
Data Warehousing
- Denis Manley
- Enterprise Systems
- FT228/3
Data Warehousing
- A data warehouse is a repository for all of an organisation's data. It is designed to present data in very user-friendly formats to end-users.
- Why is data warehousing necessary?
- Think of a typical company that uses computers for order processing, invoicing, etc. Data is gathered for every transaction that occurs. Different departments probably use different systems, each designed for a specific task.
Data Warehousing (contd)
- There will be lots of different database tables, in different formats, on different machines. There will be tie-ups between data items, but those tie-ups may not be obvious outside the applications that manipulate the tables.
- Also, the amount of data is constantly increasing. Some estimates state that the size of a typical data processing system doubles every two years.
- Both standard and ad-hoc reports must be catered for, and they must run in a reasonable time-frame.
Data Warehousing (contd)
- Some solutions to the issues listed can be implemented in the applications that manage specific data.
- E.g. if you want to ask "How much does this customer owe?", the original package is probably the one to use.
- But if you want to ask "Was this ad campaign more successful than that one?", you require data from more disparate sources, and one application may not provide all of it.
Data Warehousing (contd)
- The new alternative is a data warehouse.
- A data warehouse is a way of organising data from a wide range of formats, possibly from different locations, so it is all available in a single location.
- Once this stage is complete, the collection of data is usually replicated frequently across multiple locations.
- This means users have a local copy of the data they need to inspect, which improves query run-times and reduces communications overheads.
Data Warehousing (contd)
- The data is not only collected and joined, however. What the data means is described in a way that end-users can easily understand, e.g. using the name Customer Account Number instead of order>cacno for a field (a sketch of this kind of renaming follows below).
- The data must be quickly accessible, so that users can find data quickly and easily.
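As a loose illustration of this renaming (not taken from the slides; the table and column names are hypothetical), a warehouse can expose cryptic operational field names through a view that uses business-friendly names:

  -- Hypothetical operational table "orders" with a cryptic column "cacno";
  -- the view presents it to end-users as Customer Account Number.
  CREATE VIEW customer_orders AS
  SELECT
      cacno AS customer_account_number,
      order_date,
      order_total
  FROM orders;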
The Data Warehousing Process
- Operational data sits in legacy systems, e.g. OLTP applications.
- Data fusion assembles the diverse data; data cleansing fixes the data.
- Metadata records the transformations applied to the data, the origins of the data, etc.
- Data migration loads the data and metadata into the warehouse periodically.
- The data warehouse is then queried by decision-support analysts.
Data Warehousing Architecture
- A data warehouse is an architectural construct of an information system that provides users with current and historical decision-support information that is hard to access or present in traditional operational data stores.
- It comprises a number of components, illustrated in figure 1.
Data Warehousing Architecture
Figure 1: A data warehouse architecture
Data Warehousing Architecture
- Data warehouse database
- This is the cornerstone of the data warehouse (item 2 in figure 1).
- Almost always implemented on a relational database management system.
- However, very large databases, ad-hoc processing and the need for flexible user views (e.g. aggregates and drill-downs) are calling for different technological approaches:
- Parallel relational database designs requiring the use of symmetric multiprocessors, massively parallel processors and clusters.
Data Warehousing Architecture
- Speeding up the traditional RDBMS
- The use of multidimensional databases (MDDBs), which are tightly coupled to on-line analytical processing.
Data Warehousing Architecture
- Sourcing, acquisition, cleanup and transformation of data
- Implementing a data warehouse involves extracting data from operational systems, including legacy systems, and putting it into a suitable format.
- The various tools are illustrated as item 1 in figure 1.
- These tools perform all the conversions, summarisations, key changes, structural changes and condensations needed to transform disparate data into information that can be used by decision-support tools.
Data Warehousing Architecture
- These tools produce the programs and control statements required to move data into the data warehouse from multiple operational systems (a sketch of a few such transformations follows below). They also:
- Maintain the metadata
- Remove unwanted data
- Convert to common data names and definitions
- Calculate summaries
- Establish defaults for missing data
- Keep track of source data definition changes
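A minimal sketch of the kind of load statement such tools generate, assuming a relational warehouse; the staging and warehouse table names are illustrative, not from the slides:

  -- Build a daily sales summary from a staging copy of the operational data.
  -- COALESCE establishes a default for missing regions; SUM calculates the summary.
  INSERT INTO dw_daily_sales (sale_date, region, total_amount)
  SELECT
      sale_date,
      COALESCE(region, 'UNKNOWN') AS region,
      SUM(amount)                 AS total_amount
  FROM staging_sales
  GROUP BY sale_date, COALESCE(region, 'UNKNOWN');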
Data Warehousing Architecture
- The tools have to deal with:
- Database heterogeneity: DBMSs can differ greatly in data models, data access languages, etc.
- Data heterogeneity: differences in the way data is defined and used, e.g. synonyms and different attributes for the same entity.
Data Warehousing Architecture
- Metadata (item 3 in figure 1): data about data that describes the data warehouse (a sketch of a metadata table follows below).
- Technical metadata contains information for warehouse designers and administrators:
- Information about data sources
- Transformation descriptions
- Rules used to perform data clean-up
- Access authorisation, information delivery history, data acquisition history, data access, etc.
- Business metadata: information that gives users an understanding of the information stored in the data warehouse:
- Queries, reports, images
- Data warehouse operational information, e.g. data history and ownership
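As a rough illustration only (the table and column names are assumptions, not from the slides), technical metadata is often held in ordinary relational tables, one row per warehouse column recording its source, its transformation and its business name; a valid_from date supports the versioning discussed next:

  CREATE TABLE dw_column_metadata (
      warehouse_table  VARCHAR(64)  NOT NULL,
      warehouse_column VARCHAR(64)  NOT NULL,
      source_system    VARCHAR(64)  NOT NULL,   -- origin of the data
      source_field     VARCHAR(64)  NOT NULL,
      transformation   VARCHAR(255),            -- e.g. 'summed per day', 'default applied'
      business_name    VARCHAR(128),            -- business metadata: user-facing name
      valid_from       DATE NOT NULL,           -- time-variant history of the definition
      PRIMARY KEY (warehouse_table, warehouse_column, valid_from)
  );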
Meta Data Versioning
- In the operational environment, there tends to be a single instance of data and meta data at any one moment in time.
- In the data warehouse environment, there are multiple definitions of data and meta data over a historically long period of time.
- Therefore, versioning of data warehouse meta data is crucial to the success of the data warehouse vis-à-vis the end users' ability to access and understand the data in the data warehouse.
Guidelines for Metadata Management
- Develop an Information Directory that integrates technical and business metadata.
- Keep the metadata current and accurate!
- Maintain the time-variant history of the metadata!
- Provide meaningful descriptions and definitions (use business definitions, not technical definitions, where possible).
- End users must be educated about what metadata is, how to access it, how to use it, etc.
Meta Data Answers Questions for Users of the Data Warehouse
- How do I find the data I need?
- What is the original source of the data?
- How was this summarization created?
- What queries are available to access the data?
- How have business definitions and terms changed over time?
- How do product lines vary across organizations?
- What business assumptions have been made?
The Role of Meta Data in the Data Warehouse Architecture
Meta data enables data to become information, because with it you:
- Know what data you have, and
- You can trust it!
Data Marts (item 4 in figure 1)
- A data mart is a data store that is subsidiary to a data warehouse of integrated data.
- The data mart is directed at a partition of data (a subject area) that is created for the use of a dedicated group of users, and is sometimes termed a subject warehouse.
- The data mart might be a set of denormalised, summarised or aggregated data placed on the data warehouse database or, more often, on a separate physical store.
- Data marts are dependent data marts when the data is sourced from the data warehouse (a sketch of building one this way follows below).
- Independent data marts represent fragmented solutions to a range of business problems in the enterprise; however, such a concept should not be deployed, as it lacks the data integration concept that is associated with data warehouses.
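A loose sketch of populating a dependent data mart from warehouse tables; all object names are hypothetical, and the CREATE TABLE ... AS syntax varies slightly between RDBMS products:

  -- Departmental mart: sales summarised by product type and month,
  -- sourced from the atomic-level warehouse fact and dimension tables.
  CREATE TABLE mart_sales_by_product_month AS
  SELECT
      p.product_type,
      t.calendar_month,
      SUM(f.total_quantity) AS total_quantity
  FROM dw_sales_fact f
  JOIN dw_product_dim p ON p.product_key = f.product_key
  JOIN dw_date_dim    t ON t.date_key    = f.date_key
  GROUP BY p.product_type, t.calendar_month;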
Independent Data Marts
(Diagram: a data mart fed directly from the systems of record by its own extract, transform, clean, integrate and summarize process.)
Independent Data Marts
(Diagram: three independent marts, each repeating the extract, transform, clean, integrate and summarize work: three times!)
Independent Data Marts
Significant and expensive duplication of effort and data.
Independent Data Marts
Maintenance of proliferating, unarchitected marts is expensive and cumbersome.
Unarchitected Data Marts
There may be metadata for some marts, but what about consistency and history?
Contrast: Architected Data Warehouse
(Diagram: the systems of record feed the data warehouse, which feeds dependent (architected) departmental marts, each with the appropriate subset of metadata, accessed through a common data access layer.)
Independent Data Marts vs. The Real Thing
- Architected to meet organizational as well as departmental requirements
- Data and results are consistent
- Redundancy is managed
- Detailed history is available for drill-down
- Metadata is consistent!
- Independent marts are easy to do, but...
- Are the extracts, transformations, integrations and loads consistent?
- Is the redundancy managed?
- What is the impact on the sources?
Independent Data Marts
- However, such marts are not necessarily all bad.
- They are often a valid solution to a pressing business problem:
- Extremely urgent user requirements
- The absence of a budget for a full data warehouse
- The decentralisation of business units
Data Warehousing Architecture
- Access tools (item 5 in figure 1)
- The principal purpose of the data warehouse is to provide information for strategic decision making.
- The main tools used to achieve this objective are:
- Data query and reporting tools
- Executive information system tools
- On-line analytical processing tools
- Data mining tools
Data Warehousing Architecture
- Query and reporting tools
- Reporting tools:
- Production reporting tools, e.g. to generate operational reports
- Report writers: inexpensive desktop tools
- Managed query tools:
- Shield users from the complexities of SQL and database structures by inserting a metalayer between the users and the database (a sketch of such a metalayer follows below)
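One simple way to picture such a metalayer (a sketch under assumed, hypothetical object names, not a description of any particular tool) is a set of views that hide joins and cryptic identifiers behind business terms:

  -- End users query this view instead of the underlying star-schema tables.
  CREATE VIEW v_sales_report AS
  SELECT
      c.customer_type  AS customer_type,
      p.product_type   AS product_type,
      t.calendar_month AS sales_month,
      f.total_quantity AS quantity_sold
  FROM dw_sales_fact f
  JOIN dw_customer_dim c ON c.customer_key = f.customer_key
  JOIN dw_product_dim  p ON p.product_key  = f.product_key
  JOIN dw_date_dim     t ON t.date_key     = f.date_key;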
A Few Definitions
- OLAP
- On-Line Analytical Processing
- A set of functionality that attempts to facilitate multidimensional analysis
- Multidimensional Analysis
- The ability to manipulate information by a variety of relevant categories, or dimensions, to facilitate analysis and an understanding of that data
- Has also been called drill-down, drill-across and slice-and-dice
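As an informal illustration (not from the slides; table and column names are hypothetical), drilling down simply means re-aggregating along a finer level of a dimension:

  -- Summary by year (coarse grain)...
  SELECT t.calendar_year, SUM(f.total_quantity) AS qty
  FROM dw_sales_fact f
  JOIN dw_date_dim t ON t.date_key = f.date_key
  GROUP BY t.calendar_year;

  -- ...then drill down to month within a chosen year (finer grain).
  SELECT t.calendar_month, SUM(f.total_quantity) AS qty
  FROM dw_sales_fact f
  JOIN dw_date_dim t ON t.date_key = f.date_key
  WHERE t.calendar_year = 1996
  GROUP BY t.calendar_month;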
A Few Definitions
- Hypercube
- A means of visually representing multidimensional data
- Star Schema
- A means of aggregating data based on a set of known dimensions, attempting to store data multidimensionally in a two-dimensional RDBMS
- Snowflake Schema
- An extension of the star schema, formed by applying additional dimensions to the dimensions of a star schema in a relational environment
A Few Definitions
- Multidimensional Database
- Also known as MDDB or MDDBS
- A class of proprietary, non-relational database management tools that store and manage data in a multidimensional manner, as opposed to the two dimensions associated with traditional relational database management systems
- OLAP Tools
- A set of software products that attempt to facilitate multidimensional analysis
- Can incorporate data acquisition, data access, data manipulation, or any combination thereof
A Few Definitions
- ROLAP
- Relational OLAP
- Using an RDBMS to implement an OLAP environment
- Typically involves a star schema to provide the multidimensional capabilities
- The OLAP tool manipulates the RDBMS star-schema data
A Few Definitions
- MOLAP (1)
- Multidimensional OLAP
- Using an MDDBS to store and access data
- The MDDBS directly manages data multidimensionally
- Usually requires proprietary (non-SQL) access tools
- MOLAP (2)
- The OLAP tool facilitates multidimensional capabilities without the need for a star schema
- Often utilizes a 3-tier environment, where a middle-tier server pre-processes data from an RDBMS
- Some OLAP tools access an RDBMS directly and build cubes as a fat client
How Can OLAP Be Accomplished?
- Use the data warehouse as the architected foundation for the organization's informational processing requirements
- Use the appropriate design techniques to ensure that the data required is at the appropriate degree of granularity at the atomic level, and the appropriate degree of summarization at the departmental level
- Use the appropriate tools to either access relational data in a multidimensional manner, or manage multidimensional data
Data Mining
- A critical success factor (CSF) for any business is its ability to use information effectively.
- This strategic use can come from discovering hidden, undetected and frequently valuable information about customers, suppliers, retailers, etc.
- This information can be used to formulate effective business, marketing and sales strategies.
- A relatively new technology that can be used to achieve this strategic advantage is data mining.
Data Visualisation
- Data visualisation is a method of presenting the output of the previously mentioned methods in such a way that the problem or solution is clearly visible to domain experts and even casual observers.
- It goes way beyond simple bar and pie charts.
- It is a collection of complex techniques that focus on determining the best way to display complex patterns on a 2-D computer monitor.
- Such techniques involve experimenting with various colours, shapes, 3-D images, sound and virtual reality to help users really see and feel the problem and its solution.
DW Administration and Management (item 6 in figure 1)
- Managing data warehouses involves:
- Security and priority management
- Monitoring updates from multiple sources
- Data quality checks
- Managing and updating metadata
- Replicating, subsetting and distributing data
- Backup and recovery
- Data warehouse storage management, e.g. capacity planning, hierarchical storage management and purging of aged data
Information Delivery System (item 7 in figure 1)
- The IDS enables the process of subscribing to data warehouse information and having it delivered to one or more destinations of choice according to some user-specified scheduling algorithm.
- Delivery may be based on the time of day or on the completion of an external event.
- The IDS can be implemented with a client/server architecture and, increasingly, via the Internet/intranets and the World Wide Web.
Top 10 Don'ts of Data Warehousing
10. Pre-selecting Your Technical Environment
- This is a very common trap in which many organizations find themselves. It is traditional to select the hardware, software and other physical, technical components of a system as one of the earliest activities.
- However, a data warehouse is an unstable environment from a sizing perspective. How do you know the hardware/RDBMS/end-user tool is appropriate for your data warehouse before conducting even the first round of analysis?
- If at all possible, wait to select your technical environment until after you have analyzed the business requirements for information, data, and potential systems of record.
9. Allowing the Access Tool to Determine the Data Architecture
- This is an extension of number 10, but is important enough to list by itself.
- If you select an end-user tool before developing your data architecture, it is very likely that the architecture will suffer at the hands of design requirements dictated by the tool.
- If you have to sacrifice design requirements in order to meet the functional requirements of a tool, it is probably time to put that tool aside and select another one.
8. Unarchitected Data Marts
- OK, data marts are good; they are an essential part of the data warehouse architecture. But building only a data mart and ignoring the rest of the data warehouse (specifically the atomic-level data and centralized meta data) will lead you down a path that is more expensive and delivers lower-quality data than the alternative.
- The alternative is to architect and build the data warehouse incrementally and iteratively. Include data marts as departmental instances of the architecture, and populate them from the atomic-level data. This will ensure accuracy across the architecture, and reduce costs by eliminating unnecessary population of stand-alone data marts.
7. Boiling the Ocean
- It is more efficient to implement the data warehouse in small, achievable and palatable chunks than to try to implement it all at once. By "boiling the ocean" I mean trying to do too many things for too many people at the same time.
- There is an old adage: "You can have everything; where would you put it all?" The same holds true for a data warehouse. If you try to design, develop and implement a data warehouse that is all-encompassing as your first iteration, how will the users be able to use all that you delivered? And in the meantime, while you've been trying to meet all of their needs, you have failed to meet any needs. And users won't forget that for a long time.
6. "If You Build It, They Will Come"
- If you design, develop and implement an operational system, such as an order processing system, that new system is typically going to replace an existing system. In other words, the old system goes away and the users have no choice but to use the new one. Not so with the data warehouse.
- "If you build it" implies an analysis that includes only bottom-up activities. It is crucial to the success of a data warehouse that a top-down analysis of the users' requirements for information be conducted.
- After that, users must be tutored, mentored and otherwise have their hands held as part of the implementation of the data warehouse. Existence does not guarantee utilization and, therefore, value.
5. Putting ROI before RFI (Requirements for Information)
- It is very difficult to quantify the intangible benefits that a data warehouse can provide to an organization. How can you put a price on increased customer loyalty? Somewhere, sometime, someone has probably made this calculation. In most cases, however, the determination of how beneficial the data warehouse will be is based on criteria that were developed for operational systems. Just as you cannot use operational data to meet your strategic informational requirements, it is difficult to calculate the return on investment (ROI) of a data warehouse.
- In terms of benefits to the organization, it is more appropriate to concentrate on how well the data warehouse addresses the target users' requirements for information.
4. No Committed User Involvement
- Write this down:
- The success of any data warehouse is directly proportional to the amount of end-user participation!
- A data warehouse cannot be successful without active participation on the part of the target users. Period. If you do not have user participation, you will find yourself in a situation where you build it and hope that they will come.
- If there is no serious user participation in a data warehouse project, you have to seriously question whether or not the organization truly needs a data warehouse.
3. No Dedicated DBA
- In many situations, the lack of a dedicated database administrator (DBA) has prevented a data warehouse project from being completed 1) on time, or 2) successfully.
- "Borrowing" a DBA from the operational pool will only result in questions about the nature of the data warehouse data models and database design. "It's too flat", "not normalized properly", "too much redundancy" and other such criticisms are well suited to an operational system's database design, but not to a data warehouse.
- Considering that "data" is the first word in "data warehouse", be sure you have a dedicated database administration resource committed to this important project.
2. No Meta Data
- Meta data is like documentation and training: everyone knows it is necessary, but it usually gets dropped somewhere along the route to implementation.
- For the data warehouse, meta data is more important than your typical documentation. Remember, in order to turn data into information you have to have the data, know that you have it, be able to access it, and trust it. Meta data is the means by which the users will be able to understand and trust the data. A time-variant record of where the data came from, what happened to it along the way, where it is in the data warehouse, and what vehicles exist to access it will spell the difference between success and frustration.
1. Analysis Paralysis
- The inability to proceed past a sticking question. Wanting to boil the ocean and model/design everything before proceeding with development. Having to resolve political issues surrounding a "standard" or "common" definition. All of these things (and more!) will result in analysis paralysis.
- The 80/20 rule is very applicable to the development of a data warehouse. Expend 20% of the effort to get 80% of the total outcome, then move on to the next set of challenges and opportunities. Many data warehouse failures started when the development team stopped.
- Get your hands around an idea, understand what the users' requirements for information are, and build something that produces results that can be evaluated. Don't just stand there, do something!
Parallel Data Management
- A topic that is closely linked to data warehousing is parallel data management.
- The argument goes: if your main problem is that your queries run too slowly, use more than one machine at a time to make them run faster (parallel processing).
- Oracle uses this strategy in its warehousing products.
- There are two types of parallel processing: symmetric multiprocessing (SMP) and massively parallel processing (MPP).
Parallel Data Management (contd)
- SMP means the O.S. runs and schedules tasks on more than one processor without distinction.
- In other words, all processors are treated equally in an effort to get the list of jobs done.
- MPP is more varied in its design, but essentially consists of multiple processors, each running its own program.
- The problem with MPP is harnessing all these processors to solve a single problem.
Parallel Data Management (contd)
- Regardless of the architecture used, there are still alternatives regarding how the parallel processing capabilities are used.
- In normal transaction processing, each transaction runs on a separate processor, because transactions are small units of work that run in a reasonable time-span.
- However, the type of analysis carried out in data warehouse applications isn't like that. Typically you want to run a query that looks at all the data in a set of tables. The problem is splitting that work into chunks that can be assigned to the multiple processors.
Parallel Data Management (contd)
- There are two possible solutions to this problem: static and dynamic partitioning.
- In static partitioning you break the data up into a number of sections. Each section is placed on a different processor with its own data storage and memory. The query is then run on each of the processors, and the results are combined at the end to give the entire picture (a sketch follows below).
- This is like joining a queue in a supermarket: you stay with it until you reach the check-out.
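A rough way to picture static partitioning in SQL terms; the partitioned tables are hypothetical, and their placement on separate processors and disks is assumed rather than shown:

  -- The same aggregate runs against each section (sales_part_1..3);
  -- UNION ALL gathers the partial results and the outer query combines them.
  SELECT region, SUM(part_total) AS total_sales
  FROM (
      SELECT region, SUM(amount) AS part_total FROM sales_part_1 GROUP BY region
      UNION ALL
      SELECT region, SUM(amount) AS part_total FROM sales_part_2 GROUP BY region
      UNION ALL
      SELECT region, SUM(amount) AS part_total FROM sales_part_3 GROUP BY region
  ) AS partials
  GROUP BY region;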
Parallel Data Management (contd)
- The main problem with static partitioning is that you can't tell how much processing the various sections need. If most of the relevant data is processed by one processor, you could end up waiting almost as long as if you hadn't used parallel processing at all.
- In dynamic partitioning the data is stored in one place, and the data server takes care of splitting the query into multiple tasks, which are allocated to processors as they become available.
- This is like the single queue in a bank: as a counter position becomes free, the person at the head of the queue takes that position.
Parallel Data Management (contd)
- With dynamic partitioning the performance improvement can be dramatic, but the partitioning is out of the user's hands.
2-Tiered Architecture
Client tier:
- End user functionality
- Data display
- Personal data storage
- Personal application logic
Server tier:
- Shared global data storage
- Shared application logic
This slide defines the fundamentals of a two-tiered architecture.
2-Tiered Architecture in the DW Environment
Client tier:
- Data access processing
- Individual-level data
- Data manipulation
Server tier:
- Atomic data acquisition
- Organizational-level data
- Secondary data acquisition
- Departmental-level data
This slide depicts where the various components of the data warehouse architecture would fall in a two-tiered technical architecture.
2-Tiered Architecture in the DW Environment
This is a graphical representation of the two-tiered data warehouse architecture. The atomic level, metadata and departmental levels would reside on the Enterprise Server, while any individual-level data would be stored on the client. In this scenario the client is a thin one, with little processing beyond query and simple data manipulation.
2-Tiered Architecture in the DW Environment
In this variation, a fat client at the second
tier facilitates a considerable amount of data
manipulation, freeing up capacity on the
Enterprise Server for other, more global
activities.
2-Tiered Architecture in the DW Environment
In this variation the second tier is a thin
client, and volumes of data are pre-processed on
the Enterprise Server. This enables access to
sophisticated analyses by more than one thin
client.
3-Tiered Architecture
First tier (server):
- Shared global data storage
- Shared application logic
Middle tier:
- Shared local data
- Shared application logic
Third tier (client):
- End user functionality
- Personal data storage
- Personal application logic
In a three-tiered technical environment a middle tier has been added for shared data and application logic.
3-Tiered Architecture in the DW Environment
First tier (server):
- Atomic data acquisition
- Organizational-level data
Middle tier:
- Secondary data acquisition
- Departmental-level data
Third tier (client):
- Data access processing
- Individual data
- Data manipulation
In the data warehouse, this middle tier can take on the responsibility for departmental-level (data mart) processing. There is little or no impact on the third tier, while the first tier is relieved of some mid-level functionality.
3-Tiered Architecture in the DW Environment
In this graphical representation of the three-tiered technical architecture, the departmental data has been moved to the middle tier. In reality, departmental instances can still exist on the first tier. In cases where the middle tier is geographically distant from the first tier, it is advisable to provide a subset of local metadata for the users of the middle tier's data.
3-Tiered Architecture in the DW Environment
This scenario also supports a fat client at the
third tier, where the individual level data is
maintained and manipulated.
3-Tiered OLAP Architecture in the DW Environment
One of the biggest benefits of a three-tiered technical architecture for a data warehouse is the ability to facilitate tremendous amounts of pre-processing of data on the middle tier.
Middle tier:
- Secondary data acquisition
- Departmental-level data
- Shared data access processing
Third tier (client):
- Data access
- Individual data
- Data manipulation
First tier (server):
- Atomic data acquisition
- Organizational-level data
3-Tiered OLAP Architecture in the DW Environment
In this case, the middle tier pre-processes data for use by many thin clients, rather than requiring the individual levels on the third tier to repetitively process the same data with complex algorithms.
The Atomic Schema
The Star Schema
(Diagram: a central Fact Table whose key is the combination of Dimension Keys 1-4 and whose rows hold Fact 1 .. Fact n, surrounded by Dimension Tables 1-4; each dimension table carries its Dimension Key, a Description and Aggregation Levels x.1 .. x.n.)
Dimension Table
- Describes the data that has been organized in the fact table
- The key should either be the most detailed aggregation level necessary (e.g. country vs. county), if possible, or...
- Surrogate keys may be necessary, but will decrease the natural value of the key
- A manageable number of aggregation levels
Fact Table
- Quantifies the data that has been described by the dimension tables (a DDL sketch follows after this list)
- Key is made up of a unique combination of the values of the dimension keys
- ALWAYS contains a date or date dimension
- Fact values should be additive:
- Aggregations of quantities or amounts from the atomic level
- No percentages or ratios
- May also hold non-additive, time-variant data
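A minimal DDL sketch of a star schema, under hypothetical names and with only two dimensions shown (a real design repeats the pattern for each dimension):

  -- Date dimension: the fact table key always includes a date key.
  CREATE TABLE date_dim (
      date_key         INTEGER PRIMARY KEY,
      week_ending_date DATE NOT NULL,
      calendar_month   INTEGER NOT NULL,
      calendar_year    INTEGER NOT NULL
  );

  -- Product dimension: a description plus a manageable number of aggregation levels.
  CREATE TABLE product_dim (
      product_key      INTEGER PRIMARY KEY,
      description      VARCHAR(128) NOT NULL,
      product_type     VARCHAR(64)  NOT NULL,   -- aggregation level 1
      product_category VARCHAR(64)  NOT NULL    -- aggregation level 2
  );

  -- Fact table: key is the combination of dimension keys; facts are additive quantities.
  CREATE TABLE sales_fact (
      date_key       INTEGER NOT NULL REFERENCES date_dim (date_key),
      product_key    INTEGER NOT NULL REFERENCES product_dim (product_key),
      total_quantity INTEGER NOT NULL,
      total_amount   DECIMAL(12,2) NOT NULL,
      PRIMARY KEY (date_key, product_key)
  );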
For Example
Star Schema Query

  SELECT   E.Month, B.Customer_Type, C.Product_Type, D.Store_Location,
           SUM(A.Total_Quantity)
  FROM     Purchases_1 A, Customer_Type B, Product C,
           Selling_Responsibility D, Date_Information E
  WHERE    B.Customer_Type = A.Customer_Type
  AND      C.Product_Code = A.Product_Code
  AND      D.Sales_Rep_ID = A.Sales_Rep_ID
  AND      E.Week_Ending_Date = A.Week_Ending_Date
  AND      E.Year = 1996
  AND      C.Product_Category = 'V'
  GROUP BY E.Month, B.Customer_Type, C.Product_Type, D.Store_Location
Caution! Overly Complex Dimension
- The number of aggregation levels within the dimension becomes unmanageable
- Logically or functionally incorrect combinations of aggregation levels within a dimension
Answer 1: Split Dimension
Split an overly complex dimension into several logical, more manageable dimensions, based on the business function of the aggregation levels.
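A brief sketch of the idea, with hypothetical names: a single dimension mixing store geography and sales staffing is split into two dimensions, and the fact table carries a key for each.

  -- Two smaller dimensions, each with a single business function.
  CREATE TABLE store_dim (
      store_key      INTEGER PRIMARY KEY,
      store_location VARCHAR(64) NOT NULL,
      region         VARCHAR(64) NOT NULL
  );

  CREATE TABLE sales_rep_dim (
      sales_rep_key  INTEGER PRIMARY KEY,
      sales_rep_name VARCHAR(64) NOT NULL,
      sales_team     VARCHAR(64) NOT NULL
  );
  -- The fact table then references store_key and sales_rep_key separately.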
Answer 2: The Snowflake
- Identify hierarchies of aggregation levels and dimensionalize the primary dimension
- Secondary dimensions are descriptive of the primary dimension
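An informal sketch under hypothetical names: the product dimension is snowflaked by moving the category level into its own secondary dimension.

  -- Secondary dimension describing one aggregation level of the primary dimension.
  CREATE TABLE product_category_dim (
      category_key  INTEGER PRIMARY KEY,
      category_name VARCHAR(64) NOT NULL
  );

  -- The primary dimension references the secondary dimension instead of
  -- repeating the category description on every row.
  CREATE TABLE product_dim_sf (
      product_key  INTEGER PRIMARY KEY,
      description  VARCHAR(128) NOT NULL,
      product_type VARCHAR(64)  NOT NULL,
      category_key INTEGER NOT NULL REFERENCES product_category_dim (category_key)
  );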
Answer: Distinct Time-Period Fact Tables
- Create separate fact tables to account for different time periods (a sketch follows after this list)
- Date is still part of each fact table key
- The same dimension tables are used by both fact tables
- Improves overall performance (loading and accessing) for each time period
- Will not increase the amount of managed redundancy