Title: Data Mining and Data Warehousing
1Data Mining and Data Warehousing
2Evolution Of Data Warehouse
- Since the 1970s, organizations have gained
competitive advantage through automation of
business processes to offer more efficient and
cost-effective services to customers - This resulted in accumulation of growing amounts
of data in operational databases - Organizations now focus on ways to use
operational data to support decision-making, as a
means of gaining competitive advantage - However, operational systems were never designed
to support such business activities
3The Data Warehouse
- A data warehouse is a relational database that is
designed for query and analysis rather than for
transaction processing - It usually contains historical data derived from
transaction data, but it can include data from
other sources - It separates analysis workload from transaction
workload and enables an organization to
consolidate data from several sources to business
users
4The Data Warehouse
- A data warehouse differs from a conventional
database in a number of ways - Volume of data A data warehouse is likely to
hold far more data from operational database. - Volume could be 400 gigabytes.
- Diverse data sources The data stores in a data
warehouse come from different data
sources/application systems.
5The Data Warehouse
- Dimensional access A warehouse is designed to
fulfill various ways in which users may wish to
retrieve data.
6Data Warehouse Definitions
- A copy of transaction data, specifically
structured for query and analysis Ralph Kimball - A data warehouse is a simple, complete and
consistent store of data obtained from a variety
of sources and made available to end users in a
way they can understand and use it in a business
context IBM
7Data Warehouse Definitions
- A data warehouse is a subject-oriented,
integrated, time-variant, and non-volatile
collection of data in support of managements
decision-making process Bill Inmon - Bill Inmon is regarded as father of
- Data Warehousing
8Data Warehouse - Subject-Oriented
- Organized around major subjects, such as
customer, product, sales - Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing - Provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process
9Data Warehouse - Subject-Oriented (cont)
10Data Warehouse - Integrated
- Constructed by integrating multiple,
heterogeneous data sources - Relational databases, flat files, on-line
transaction records - Data cleaning and data integration techniques are
applied - Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources - E.g., Hotel price currency, tax, breakfast
covered, etc. - When data is moved to the warehouse, it is
converted
11Data Warehouse Integrated (cont)
12Data Warehouse - Time Variant
- The time horizon for data warehouses is much
longer than that of operational systems - Operational database current value data
- Data warehouse data provide information from a
historical perspective (e.g., past 5-10 years) - Every key structure in the data warehouse
- Contains an element of time, explicitly or
implicitly - But the key of operational data may or may not
contain time element - Need to decide how frequently data warehouse is
updated
13Data Warehouse - Non-Volatile
- A physically separate store of data transformed
from the operational environment - Operational update of data does not occur in the
data warehouse environment - Does not require transaction processing,
recovery, and concurrency control mechanisms - Requires only two operations in data accessing
- Initial loading of data and access of data
14Data Warehouse - Non-Volatile (cont)
15Data Warehouse Environment Capabilities
- A data warehouse environment typically includes
- Extraction
- Transformation
- Loading (ETL) solution
- An online analytical processing (OLAP) engine
- Other applications that manage the process of
gathering data and delivering it to business users
16What Is OLAP?
- Online Analytical Processing (OLAP) is an
industry-accepted reporting technology that
provides high-performance analysis and easy
reporting on large volumes of data - The goal of OLAP, also known as multidimensional
data analysis, is to provide fast and flexible
data summarization, analysis, and reporting
capabilities with the ability to view trends over
time
17What Is OLAP?
- OLAP is term coined by Tedd Codd to describe the
architecture that supports complex analytical
processing such as consolidation, drilling down,
and pivoting - Consolidation comprises the aggregation of data
such as module data being aggregated into courses
data, and courses data being aggregated into
department data - Drilling Down Disaggregating data such as
breaking down school-data into particular courses
18What Is OLAP?
- Pivoting sometimes referred to as slice and
dicing comprises the ability to analyse the same
data into different viewpoints, frequently along
time axis.
19What Is OLAP?
- OLAP applications, also called decision support
systems (DSS), have the following features - Enable users to look at different relationships
in data by looking beyond traditional
two-dimensional row and column data analysis - Offer high-performance access to large amounts of
presummarized data - Give users the power to retrieve answers to
multi-dimensional business questions quickly and
easily - Provide slice-and-dice views of multiple
relationships in large quantities of
presummarized data
20Data Warehouse Vs. Operational DBMS
- OLTP (on-line transaction processing)
- Major task of traditional relational DBMS
- Day-to-day operations purchasing, inventory,
banking, manufacturing, payroll, registration,
accounting, etc. - OLAP (on-line analytical processing)
- Major task of data warehouse system
- Data analysis and decision making
21OLTP Vs. Data Warehouse
OLTP Data Warehouse
Users Clerk, IT professional Knowledge worker
Function Day to day operations Decision support
DB Design Application-oriented Subject-oriented
Data Current, up-to-date detailed, flat relational Isolated Historical, summarized, multidimensional, integrated, consolidated
Records Accessed Tens Millions
Users Thousands Hundreds
DB Size 100MB-GB 100GB-TB
22Benefits of data warehousing
- A data warehouse is seen to deliver three major
benefits to organisations - It provides a single manageable structure for
decision-support data. - A data warehouse enables organisational users to
run complex queries on data that traverses a
number of business areas. - A data warehouse enables a number of business
intelligence applications such as on-line
analytical processing and data mining.
23Benefits of data warehousing
- The overall objective for a data warehouse is to
increase the productivity and effectiveness of
decision-making in organisations. This, in turn,
is expected to deliver competitive advantage to
deliver competitive advantage to organisation.
24Data Warehouse Architecture
25Building a Data Warehouse
- The main stages of getting data into the data
warehouse are - Data Extraction
- Data Cleaning
- Data Transformation
- Data Loading
- Once the data is loaded it needs to be put into
a suitable format - ER model
- Star Schema
26Data Extraction
- Process of copying the data from the
transactional databases in preparation for
loading it into the data warehouse - The data is likely to come from several
transactional databases - Some of the data entering into this process may
come from outside of the company (data enrichment)
27Data Extraction (cont)
- Internal
- Manufacturing, Accounting, HR, etc.
- Legacy
- External
- Competitor Data
- Economic Data
- Demographic Data
- Credit Data
28Data Cleaning
- Transactional data can have all kinds of errors
in it - Data warehouses are very sensitive to data errors
- Data errors must be cleaned or cleansed or
scrubbed as the data is loaded into the data
warehouse - Get data into a consistent state
29Data Transformation
- Data extracted from transactional databases must
go through several kinds of data transformation
on its way to a data warehouse - Data from different transactional databases being
merged to form the data warehouse tables - Data will often be aggregated as it is being
extracted from the transactional databases and
prepared for the data warehouse - Units of measure used for attributes in different
transactional databases must be reconciled as
they are being merged into common data warehouse
tables
30Data Transformation
- Coding schemes used for attributes in different
transactional databases must be reconciled as
they are being merged into common data warehouse
tables - Sometimes values from different attributes in
transactional databases are combined into a
single attribute in the data warehouse (e.g.,
employee name)
31Data Loading
- After all of the extracting, cleaning, and
transforming, the data is ready to be loaded into
the data warehouse - Data will be loaded into a loading or working
area in the database - Some of the previous steps may have been done in
the database - Data may have to go through a number of stages
dividing up the data and merging with other data - When the above has been done the Star Schemas are
populated with the new, time specific data
32Data Loading (cont)
- A schedule for regularly updating the data
warehouse must be put in place - Frequency of updates is important
- Time taken to get to this point is important
33Data Warehouse Queries
- Types of queries that a data warehouse is
expected to answer ranges from the relatively
simple to the highly complex and is dependent on
the type of end-user access tools used - End-user access tools include
- Reporting, query, and application development
tools - Executive information systems (EIS)
- OLAP tools
- Data mining tools
34Steps in Building DW
- Users specify information needs
- Analysts and users create a logical and physical
design - Sources of data is scrubbed, extracted and
transformed - Data is transferred and loaded into the warehouse
periodically - Users are given the access to warehouse
- The warehouse is maintained in terms of changing
requirements
35Typical Data Warehouse Queries
- Examples include
- What was total IFM revenue in 3rd quarter of
2006? - What was total revenue for property sales for
each type of property in Tanzania in 2006? - What are the three most popular areas in each
city for the renting of property in 2003 and how
does this compare with the figures for the
previous two years? - What would be effect on property sales in the
different regions of Europe if legal costs went
up by 3.5 and Government taxes went down by 1.5
for properties over 250,000? - What is monthly revenue for property sales at
each branch office, compared with rolling
12-monthly prior figures?
36Benefits Of Data Warehousing
- Gives the data you want, in a suitable format
- Removes inconsistency of reporting
- Gives one consistent picture of the data. i.e. It
provides single manageable structure for decision
support data. - Potential high returns on investment
37Benefits Of Data Warehousing
- Enable users to run complex queries on data that
traverses a number of business areas. - Competitive advantage
- Increased productivity of corporate
decision-makers