Title: The Software Infrastructure for Electronic Commerce
1The Software Infrastructurefor Electronic
Commerce
- Databases and Data Mining
- Lecture 2 Data Warehousing
- Johannes Gehrke
- johannes_at_cs.cornell.edu
- http//www.cs.cornell.edu/johannes
2Overview
- Conceptual design
- Querying relational data
- Dimensional data modeling OLTP versus decision
support - We will design and analyze a data mart with
click-stream data as an illustrative example.
3The Database Design Process
- Requirement analysis
- Conceptual design using the entity-relationship
(ER) model - Schema refinement
- Normalization
- Physical tuning
4Overview of Database Design
- Conceptual design (ER Model is used at this
stage.) - What are the entities and relationships in the
enterprise? - What information about these entities and
relationships should we store in the database? - What are the integrity constraints or business
rules that hold? - A database schema in the ER Model can be
represented pictorially (ER diagrams). - Can map an ER diagram into a relational schema.
5ER Model Basics
- EntityReal-world object distinguishable
fromother objects. Anentity is described
(inDB) using a set of attributes. - Entity Set A collection of similar entities.
E.g., all employees. - All entities in an entity set have the same set
of attributes. - Each entity set has a key.
- Each attribute has a domain.
6ER Model Basics (Contd.)
- Relationship Association among two or more
entities. E.g., Johannes works in the computer
science department. - Relationship set Collection of similar
relationships.
since
name
dname
budget
ssn
lot
did
Works_In
Departments
Employees
7ER-Model Basics (Contd.)
- An n-ary relationship set R relates n entity sets
E1 ... En each relationship in R involves
entitiese1, ..., en - Same entity set could participate in different
relationship sets, or in different roles in
same set.
name
ssn
address
Employees
subor-dinate
super-visor
Reports_To
8Key Constraints
- Consider Works_In An employee can work in many
departments a department can have many
employees. - In contrast, each deptartment has at most one
manager, according to the key constraint on
Manages.
did
budget
Departments
9Key Constraints (Contd.)
- Several types of key-constraints
1-to-1
1-to Many
Many-to-1
Many-to-Many
10Key constraints Examples
- Example Scenario 1 An inventory database
contains information about parts and
manufacturers. Each part is constructed by
exactly one manufacturer. - Example Scenario 2 A customer database contains
information about customers and sales persons.
Each customer has exactly one primary sales
person. - What do the ER diagrams look like?
11Participation Constraints
- Does every department have a manager? If so, this
is a participation constraint The participation
of Departments in Manages is said to be total
(vs. partial). - (Compare with foreign key constraints.)
since
since
name
dname
name
dname
ssn
lot
budget
did
budget
did
Departments
Employees
Manages
Works_In
since
12Participation Constraints Examples
- Example Scenario 1 (Contd.) Each part is
constructed by exactly one manufacturer. - Example Scenario 2 Each customer has exactly one
primary sales person.
13ER Modeling Case Study
- Drugwarehouse.com has offered you a free
life-time supply of prescription drugs (no
questions asked) if you design its database
schema. Given the rising cost of health care, you
agree. Here is the information that you gathered - Patients are identified by their SSN, and we also
store their names and age. - Doctors are identified by their SSN, and we also
store their names and specialty. - Each patient has one primary care physician, and
we want to know since when the patient has been
with her primary care physician. - Each doctor has at least one patient.
14ER Modeling Summary
- After the requirement analysis, the conceptual
design develops a high-level description of the
data - Main components
- Entities
- Relationships
- Attributes
- Integrity constraints Key constraints and
participation constraints - We covered only a subset
15Querying Relational Databases
- The relational query language SQL
16Structured Query Language
- SELECT target-list
- FROM relation-list
- WHERE qualifications
- relation-list A list of relation names
- target-list A list of attributes of relations
in relation-list - qualification Comparisons (Attr op const or
Attr1 op Attr2, where op is one of
lt,gt,,lt,gt,ltgt) combined using AND, OR and NOT.
17Recall Customer Relation
- Relation schemaCustomers(cid integer, name
string, byear integer, state string) - Relation instance
18Example Query
- Example SchemaCustomers( cid integer, name
string, byear integer, state string) - QuerySELECT Customers.cid, Customers.name, Cu
stomers.byear, Customers.stateFROM
CustomersWHERE cid 1950
19Example Query
- SELECTCustomers.cid, Customers.name,Customers.by
ear, Customers.state - FROM Customers
- WHERE cid 1960
20Range Variables
- SELECTCustomers.cid, Customers.name,Customers.by
ear, Customers.state - FROM Customers C
- WHERE cid 1960
21Example Query
- A range variable is a substitute for a relation
name. - QuerySELECT C.cid, C.nameFROM Customers
CWHERE C.byear 1960
22Common Shortcuts
- is a shortcut for all fields
- QuerySELECT FROM Customers CWHERE
C.nameSmith
23Example Query
- SELECT C.state
- FROM Customers C
24DISTINCT Keyword
- DISTINCTEliminates duplicates in the output
- QuerySELECT DISTINCT C.stateFROM Customers C
25Example Query
- QuerySELECT C.cid, C.name, C.byear,
C.stateFROM Customers CWHERE C.nameSmith - Answer
26Selection Predicates
- QuerySELECT C.cid, C.name, C.byear,
C.stateFROM Customers CWHERE C.nameSmith AND
C.stateNY - Answer
27Combining Relations Joins
- SELECT P.pid, P.pname
- FROM Products P, Transactions T
- WHERE P.pid T.tid AND T.tdate lt 2/1/2000
28Example Query
- Query Find the names and ids of customers who
have made purchases before February 1, 2000. - SQL
- SELECT C.name, C.id
- FROM Customers C, Transactions T
- WHERE C.cid T.cid ANDT.tdate lt 2/1/2000
29Example Query
- Query Find the names and ids of the customers
who have purchased MS Office Pro. - SQLSELECT C.name, C.idFROM Customers C,
Transactions T, Products P WHERE C.cid T.cid
AND T.pid P.pid AND P.pname MS Office Pro
30Aggregate Operators
- SQL allows computation of summary statistics for
a collection of records. - Operators
- MAX (maximum value)
- MIN (minimum value)
- SUM
- AVG (average)
- COUNT (distinct number)
31Example Queries
- Query Tell me the minimum and maximum price of
all products. - SQLSELECT MAX(P.price), MIN(P.price)FROM
Products P - Query How many different products do we have?
- SQLSELECT COUNT()FROM Products
32GROUP BY and HAVING
- Instead of applying aggregate operators to all
(qualifying) tuples, apply aggregate operators to
each of several groups of tuples. - Example For each year, show the number of
customers who are born that year. - Conceptually, many queries, one query per year.
- Suppose we know that years are between 1900 and
2000, we can write 1001 queries that look like
thisSELECT COUNT()FROM Customers CWHERE
C.byear i
33Queries With GROUP BY and HAVING
- Extended SQL Query StructureSELECT
DISTINCT target-listFROM
relation-listWHERE tuple-qualificationGROU
P BY grouping-listHAVING
group-qualification
34Example GROUP BY
- Example For each year, show the number of
customers who are born that year. - SQLSELECT C.byear, COUNT()FROM Customers
CGROUP BY C.byear
35Example
- Query For each customer, list the price of the
most expensive product she purchased. - SQLSELECT C.cid, C.name, MAX(P.price)FROM
Customers C, Transactions T, Products PWHERE
C.cid T.cid and T.pid P.pidGROUP BY C.cid,
C.name
36Example
- Query For each product that has been sold at
least twice, output how often it has been sold so
far. - SQLSELECT P.pid, P.pname, COUNT()FROM
Products P, Transactions TWHERE P.pid
T.pidGROUP BY P.pid, P.pnameHAVING COUNT() gt 1
37Notes on GROUP BY and HAVING
- SELECT DISTINCT attribute-list,
aggregate-listFROM relation-listWHERE
record-qualificationGROUP BY
grouping-listHAVING group-qualification - The query generates one output record per group.
A group is a set of tuples that have the same
value for all attributes in grouping-list.
38Notes on GROUP BY and HAVING
- SELECT DISTINCT attribute-list,
aggregate-listFROM relation-listWHERE
record-qualificationGROUP BY
grouping-listHAVING group-qualification - The attribute list must be a subset of the
grouping-list. Why? Each answer record
corresponds to a group, and there must be a
single value per group. - The aggregate-list generates one value per group.
- What about the group-qualification?
39Summary SQL
- Powerful query language for relational database
systems - End-users usually do not write SQL, but graphical
user front-ends generate SQL queries - SQL completely isolates users from the physical
structure of the DBMS ? You can tune your DBMS
for performance and your applications do not
change (This is physical data independence!)
40From OLTP To The Data Warehouse
- Traditionally, database systems stored data
relevant to current business processes - Old data was archived or purged
- Your database stores the current snapshot of your
business - Your current customers with current addresses
- Your current inventory
- Your current orders
- My current account balance
41The Data Warehouse
- The data warehouse is a historical collection of
all your data for analysis purposes - Examples
- Current customers versus all customers
- Current orders versus history of all orders
- Current inventory versus history of all shipments
- Thus the data warehouse stores information that
might be useless for the operational part of your
business
42Terminology
- OLTP (Online Transaction Processing)
- DSS (Decision Support System)
- DW (Data Warehouse)
- OLAP (Online Analytical Processing)
43OLTP Architecture
Clients
OLTPDBMSs
CashRegister
Product Purchase
Inventory Update
44DW Architecture
Clients
Data Warehouse Server
Information Sources
OLAP Servers
MOLAP
OLTPDBMSs
Analysis
Query/Reporting
ExtractCleanTransformAggregateLoadUpdate
Other Data Sources
Data Mining
Data Marts
ROLAP
45OLTP Versus Data Warehousing
46The Data Warehouse Market
- Market forecast for warehousing tools in 2002 8
billion (IDC 7/1999) - Revenue forecast for data warehouse front-end
tools1998 1.6 billion1999 2.1
billion2000 2.8 billion(Gartner Group 2/1999)
47Dimensional Data Modeling
- Recall The relational model.
- The dimensional data model
- Relational model with two different types of
attributes and tables. - Attribute level Facts (numerical, additive,
dependent) versus dimensions (descriptive,
independent). - Table level Fact tables (large tables with facts
and foreign keys to dimensions) versus dimension
tables (small tables with dimensions).
48Dimensional Modeling (Contd.)
- Fact (attribute) Measures performance of a
business. - Example facts
- Sales, budget, profit, inventory
- Example fact table
- Transactions (timekey, storekey, pkey, promkey,
ckey, units, price)
- Dimension (attribute) Specifies a fact.
- Example dimension
- Product, customer data, sales person, store
- Example dimension table
- Customer (ckey, firstname, lastname, address,
dateOfBirth, occupation, )
49OLTP Versus The Data Warehouse
- OLTP
- Regular relational schema
- Update queries change data in the databaseOne
instance of a customer with a unique customerID - Queries return information about the current
state of affairs
- The data warehouse
- Dimensional model
- Update queries create new records in the
databaseSeveral instances of the same customer
(with different data), in case the customer moved - Queries return aggregate information about
historical facts
50Example Dimensional Data Modeling
Customers Dimension Table
Time Dim. Table
Transactions Fact Table
Products Dim. Table
51Another View Star Schema
Transactions(timekey, storekey, pkey,promkey, ck
ey, units, price)
Time
Store
Customers
Products
Promotions
52Grain
- The grain defines the level of resolution of a
single record in the fact table. - Example fact tables
- Transactions (timekey, storekey, pkey, promkey,
ckey, units, price) grain is individual item - Transactions(timekey, storekey, ckey, units,
price) grain is one market basket - CustomerSessions(timekey, ckey, pagekey,
sessionkey, sessionSeconds, pagesVisited)what
is the grain?
53Tips
- Fact tables are usually very large they can grow
to several hundred GB and TB - Dimension tables are usually smaller (although
can grow large, e.g., Customers table), but they
have many fields - Queries over fact tables usually involve many
records - Indexes usually increase the size of each table a
factor of three or four
54Typical Queries
- SQLSELECT D1.d1, , Dk.dk, agg1(F.f1,)FROM
Dimension D1, , Dimension Dk, Fact FWHERE
D1.key F.key1 AND AND Dk.keyk F.keyk
AND otherPredicatesGROUP BY D1.d1, ,
Dk.dkHAVING groupPredicates - This query is called a Star Join.
55Example Query
- Break down sales by year and category for the
last two years show only categories with more
than 1M in sales. - SQLSELECT T.year, P.category, SUM(X.units
X.price)FROM Time T, Products P, Transactions
XWHERE T.year 1999 OR T.year 2000GROUP BY
T.year, P.categoryHAVING SUM(X.units X.price)
gt 1000000
56Design of the Clickstream Schema
- Dimensions
- Date. One record for each calendar day.
- Time. One record per second (is that accurate
enough?). - Customer. One record per customer. Several groups
of customers, depending on knowledge about the
customer - Anonymous web site visitor with cookie ID
- Know name, address, and customer ID is assigned
- Know demographics
- Can you think of other dimensions?
57Schema Design (Contd.)
- Often used clickstream dimensions
- Page. Grain One record per page or one record
per page type?Page(pagekey, pageFunction,
pageType, contentsType) - Event. Captures what the users does at a page
(enter data, click on a link, etc.).Event(eventke
y, eventtype) - Session. Attributes that characterize short-term
behavior.Session(sessionkey, sessiontype,
sequencetype, context, status) - Referral. Captures how the user arrived at our
site.Referral(referralkey, referraltype, url,
site, domain)
58Schema Design (Contd.)
- Example clickstream fact table schema for
analyzing sessionsSessionfacts(datekey,
timekey, ckey, totalSessionTime, numPages,
numItems, orderAmount) - Foreign keys datekey, timekey, ckey
- Facts totalSessionTime, number of pages visited,
number of items ordered, total order amount - Example clickstream fact table schema for
analyzing page usePagefacts(datekey, timekey,
ckey, pagekey, pagetime, numItems) - Foreign keys datekey, timekey, ckey, pagekey
- Facts pagetime, number of items orderd,
59Building a Data Warehouse Tips
- A data warehouse is a collection of data marts. A
data mart contains one dimensional star schema
that captures one business aspect. - Notes
- It is crucial to centralize the logical
definition and format of dimensions and facts
(political challenge assign a dimension
authority to each dimension). Everything else is
a distributed effort throughout your company
(technical challenge). - Each data mart will have its own fact table, but
we will duplicate dimension tables over several
data marts.
60Summary Dimensional Design
- A dimensional data model is similar to a
relational data model, but we plan for historical
data - Facts versus dimensions
- Queries are in star join format
61Questions?
- (In the third lecture Data analysis)