The Software Infrastructure for Electronic Commerce - PowerPoint PPT Presentation

About This Presentation
Title:

The Software Infrastructure for Electronic Commerce

Description:

T.tdate '2/1/2000' Example Query ... Conceptually, many queries, one query per year. ... WHERE T.year = 1999 OR T.year = 2000. GROUP BY T.year, P.category ... – PowerPoint PPT presentation

Number of Views:51
Avg rating:3.0/5.0
Slides: 62
Provided by: johanne46
Category:

less

Transcript and Presenter's Notes

Title: The Software Infrastructure for Electronic Commerce


1
The Software Infrastructurefor Electronic
Commerce
  • Databases and Data Mining
  • Lecture 2 Data Warehousing
  • Johannes Gehrke
  • johannes_at_cs.cornell.edu
  • http//www.cs.cornell.edu/johannes

2
Overview
  • Conceptual design
  • Querying relational data
  • Dimensional data modeling OLTP versus decision
    support
  • We will design and analyze a data mart with
    click-stream data as an illustrative example.

3
The Database Design Process
  • Requirement analysis
  • Conceptual design using the entity-relationship
    (ER) model
  • Schema refinement
  • Normalization
  • Physical tuning

4
Overview of Database Design
  • Conceptual design (ER Model is used at this
    stage.)
  • What are the entities and relationships in the
    enterprise?
  • What information about these entities and
    relationships should we store in the database?
  • What are the integrity constraints or business
    rules that hold?
  • A database schema in the ER Model can be
    represented pictorially (ER diagrams).
  • Can map an ER diagram into a relational schema.

5
ER Model Basics
  • EntityReal-world object distinguishable
    fromother objects. Anentity is described
    (inDB) using a set of attributes.
  • Entity Set A collection of similar entities.
    E.g., all employees.
  • All entities in an entity set have the same set
    of attributes.
  • Each entity set has a key.
  • Each attribute has a domain.

6
ER Model Basics (Contd.)
  • Relationship Association among two or more
    entities. E.g., Johannes works in the computer
    science department.
  • Relationship set Collection of similar
    relationships.

since
name
dname
budget
ssn
lot
did
Works_In
Departments
Employees
7
ER-Model Basics (Contd.)
  • An n-ary relationship set R relates n entity sets
    E1 ... En each relationship in R involves
    entitiese1, ..., en
  • Same entity set could participate in different
    relationship sets, or in different roles in
    same set.

name
ssn
address
Employees
subor-dinate
super-visor
Reports_To
8
Key Constraints
  • Consider Works_In An employee can work in many
    departments a department can have many
    employees.
  • In contrast, each deptartment has at most one
    manager, according to the key constraint on
    Manages.

did
budget
Departments
9
Key Constraints (Contd.)
  • Several types of key-constraints

1-to-1
1-to Many
Many-to-1
Many-to-Many
10
Key constraints Examples
  • Example Scenario 1 An inventory database
    contains information about parts and
    manufacturers. Each part is constructed by
    exactly one manufacturer.
  • Example Scenario 2 A customer database contains
    information about customers and sales persons.
    Each customer has exactly one primary sales
    person.
  • What do the ER diagrams look like?

11
Participation Constraints
  • Does every department have a manager? If so, this
    is a participation constraint The participation
    of Departments in Manages is said to be total
    (vs. partial).
  • (Compare with foreign key constraints.)

since
since
name
dname
name
dname
ssn
lot
budget
did
budget
did
Departments
Employees
Manages
Works_In
since
12
Participation Constraints Examples
  • Example Scenario 1 (Contd.) Each part is
    constructed by exactly one manufacturer.
  • Example Scenario 2 Each customer has exactly one
    primary sales person.

13
ER Modeling Case Study
  • Drugwarehouse.com has offered you a free
    life-time supply of prescription drugs (no
    questions asked) if you design its database
    schema. Given the rising cost of health care, you
    agree. Here is the information that you gathered
  • Patients are identified by their SSN, and we also
    store their names and age.
  • Doctors are identified by their SSN, and we also
    store their names and specialty.
  • Each patient has one primary care physician, and
    we want to know since when the patient has been
    with her primary care physician.
  • Each doctor has at least one patient.

14
ER Modeling Summary
  • After the requirement analysis, the conceptual
    design develops a high-level description of the
    data
  • Main components
  • Entities
  • Relationships
  • Attributes
  • Integrity constraints Key constraints and
    participation constraints
  • We covered only a subset

15
Querying Relational Databases
  • The relational query language SQL

16
Structured Query Language
  • SELECT target-list
  • FROM relation-list
  • WHERE qualifications
  • relation-list A list of relation names
  • target-list A list of attributes of relations
    in relation-list
  • qualification Comparisons (Attr op const or
    Attr1 op Attr2, where op is one of
    lt,gt,,lt,gt,ltgt) combined using AND, OR and NOT.

17
Recall Customer Relation
  • Relation schemaCustomers(cid integer, name
    string, byear integer, state string)
  • Relation instance

18
Example Query
  • Example SchemaCustomers( cid integer, name
    string, byear integer, state string)
  • QuerySELECT Customers.cid, Customers.name, Cu
    stomers.byear, Customers.stateFROM
    CustomersWHERE cid 1950

19
Example Query
  • SELECTCustomers.cid, Customers.name,Customers.by
    ear, Customers.state
  • FROM Customers
  • WHERE cid 1960

20
Range Variables
  • SELECTCustomers.cid, Customers.name,Customers.by
    ear, Customers.state
  • FROM Customers C
  • WHERE cid 1960

21
Example Query
  • A range variable is a substitute for a relation
    name.
  • QuerySELECT C.cid, C.nameFROM Customers
    CWHERE C.byear 1960

22
Common Shortcuts
  • is a shortcut for all fields
  • QuerySELECT FROM Customers CWHERE
    C.nameSmith

23
Example Query
  • SELECT C.state
  • FROM Customers C

24
DISTINCT Keyword
  • DISTINCTEliminates duplicates in the output
  • QuerySELECT DISTINCT C.stateFROM Customers C

25
Example Query
  • QuerySELECT C.cid, C.name, C.byear,
    C.stateFROM Customers CWHERE C.nameSmith
  • Answer

26
Selection Predicates
  • QuerySELECT C.cid, C.name, C.byear,
    C.stateFROM Customers CWHERE C.nameSmith AND
    C.stateNY
  • Answer

27
Combining Relations Joins
  • SELECT P.pid, P.pname
  • FROM Products P, Transactions T
  • WHERE P.pid T.tid AND T.tdate lt 2/1/2000

28
Example Query
  • Query Find the names and ids of customers who
    have made purchases before February 1, 2000.
  • SQL
  • SELECT C.name, C.id
  • FROM Customers C, Transactions T
  • WHERE C.cid T.cid ANDT.tdate lt 2/1/2000

29
Example Query
  • Query Find the names and ids of the customers
    who have purchased MS Office Pro.
  • SQLSELECT C.name, C.idFROM Customers C,
    Transactions T, Products P WHERE C.cid T.cid
    AND T.pid P.pid AND P.pname MS Office Pro

30
Aggregate Operators
  • SQL allows computation of summary statistics for
    a collection of records.
  • Operators
  • MAX (maximum value)
  • MIN (minimum value)
  • SUM
  • AVG (average)
  • COUNT (distinct number)

31
Example Queries
  • Query Tell me the minimum and maximum price of
    all products.
  • SQLSELECT MAX(P.price), MIN(P.price)FROM
    Products P
  • Query How many different products do we have?
  • SQLSELECT COUNT()FROM Products

32
GROUP BY and HAVING
  • Instead of applying aggregate operators to all
    (qualifying) tuples, apply aggregate operators to
    each of several groups of tuples.
  • Example For each year, show the number of
    customers who are born that year.
  • Conceptually, many queries, one query per year.
  • Suppose we know that years are between 1900 and
    2000, we can write 1001 queries that look like
    thisSELECT COUNT()FROM Customers CWHERE
    C.byear i

33
Queries With GROUP BY and HAVING
  • Extended SQL Query StructureSELECT
    DISTINCT target-listFROM
    relation-listWHERE tuple-qualificationGROU
    P BY grouping-listHAVING
    group-qualification

34
Example GROUP BY
  • Example For each year, show the number of
    customers who are born that year.
  • SQLSELECT C.byear, COUNT()FROM Customers
    CGROUP BY C.byear

35
Example
  • Query For each customer, list the price of the
    most expensive product she purchased.
  • SQLSELECT C.cid, C.name, MAX(P.price)FROM
    Customers C, Transactions T, Products PWHERE
    C.cid T.cid and T.pid P.pidGROUP BY C.cid,
    C.name

36
Example
  • Query For each product that has been sold at
    least twice, output how often it has been sold so
    far.
  • SQLSELECT P.pid, P.pname, COUNT()FROM
    Products P, Transactions TWHERE P.pid
    T.pidGROUP BY P.pid, P.pnameHAVING COUNT() gt 1

37
Notes on GROUP BY and HAVING
  • SELECT DISTINCT attribute-list,
    aggregate-listFROM relation-listWHERE
    record-qualificationGROUP BY
    grouping-listHAVING group-qualification
  • The query generates one output record per group.
    A group is a set of tuples that have the same
    value for all attributes in grouping-list.

38
Notes on GROUP BY and HAVING
  • SELECT DISTINCT attribute-list,
    aggregate-listFROM relation-listWHERE
    record-qualificationGROUP BY
    grouping-listHAVING group-qualification
  • The attribute list must be a subset of the
    grouping-list. Why? Each answer record
    corresponds to a group, and there must be a
    single value per group.
  • The aggregate-list generates one value per group.
  • What about the group-qualification?

39
Summary SQL
  • Powerful query language for relational database
    systems
  • End-users usually do not write SQL, but graphical
    user front-ends generate SQL queries
  • SQL completely isolates users from the physical
    structure of the DBMS ? You can tune your DBMS
    for performance and your applications do not
    change (This is physical data independence!)

40
From OLTP To The Data Warehouse
  • Traditionally, database systems stored data
    relevant to current business processes
  • Old data was archived or purged
  • Your database stores the current snapshot of your
    business
  • Your current customers with current addresses
  • Your current inventory
  • Your current orders
  • My current account balance

41
The Data Warehouse
  • The data warehouse is a historical collection of
    all your data for analysis purposes
  • Examples
  • Current customers versus all customers
  • Current orders versus history of all orders
  • Current inventory versus history of all shipments
  • Thus the data warehouse stores information that
    might be useless for the operational part of your
    business

42
Terminology
  • OLTP (Online Transaction Processing)
  • DSS (Decision Support System)
  • DW (Data Warehouse)
  • OLAP (Online Analytical Processing)

43
OLTP Architecture
Clients
OLTPDBMSs
CashRegister
Product Purchase
Inventory Update
44
DW Architecture
Clients
Data Warehouse Server
Information Sources
OLAP Servers
MOLAP
OLTPDBMSs
Analysis
Query/Reporting
ExtractCleanTransformAggregateLoadUpdate
Other Data Sources
Data Mining
Data Marts
ROLAP
45
OLTP Versus Data Warehousing
46
The Data Warehouse Market
  • Market forecast for warehousing tools in 2002 8
    billion (IDC 7/1999)
  • Revenue forecast for data warehouse front-end
    tools1998 1.6 billion1999 2.1
    billion2000 2.8 billion(Gartner Group 2/1999)

47
Dimensional Data Modeling
  • Recall The relational model.
  • The dimensional data model
  • Relational model with two different types of
    attributes and tables.
  • Attribute level Facts (numerical, additive,
    dependent) versus dimensions (descriptive,
    independent).
  • Table level Fact tables (large tables with facts
    and foreign keys to dimensions) versus dimension
    tables (small tables with dimensions).

48
Dimensional Modeling (Contd.)
  • Fact (attribute) Measures performance of a
    business.
  • Example facts
  • Sales, budget, profit, inventory
  • Example fact table
  • Transactions (timekey, storekey, pkey, promkey,
    ckey, units, price)
  • Dimension (attribute) Specifies a fact.
  • Example dimension
  • Product, customer data, sales person, store
  • Example dimension table
  • Customer (ckey, firstname, lastname, address,
    dateOfBirth, occupation, )

49
OLTP Versus The Data Warehouse
  • OLTP
  • Regular relational schema
  • Update queries change data in the databaseOne
    instance of a customer with a unique customerID
  • Queries return information about the current
    state of affairs
  • The data warehouse
  • Dimensional model
  • Update queries create new records in the
    databaseSeveral instances of the same customer
    (with different data), in case the customer moved
  • Queries return aggregate information about
    historical facts

50
Example Dimensional Data Modeling
Customers Dimension Table
Time Dim. Table
Transactions Fact Table
Products Dim. Table
51
Another View Star Schema
Transactions(timekey, storekey, pkey,promkey, ck
ey, units, price)
Time
Store
Customers
Products
Promotions
52
Grain
  • The grain defines the level of resolution of a
    single record in the fact table.
  • Example fact tables
  • Transactions (timekey, storekey, pkey, promkey,
    ckey, units, price) grain is individual item
  • Transactions(timekey, storekey, ckey, units,
    price) grain is one market basket
  • CustomerSessions(timekey, ckey, pagekey,
    sessionkey, sessionSeconds, pagesVisited)what
    is the grain?

53
Tips
  • Fact tables are usually very large they can grow
    to several hundred GB and TB
  • Dimension tables are usually smaller (although
    can grow large, e.g., Customers table), but they
    have many fields
  • Queries over fact tables usually involve many
    records
  • Indexes usually increase the size of each table a
    factor of three or four

54
Typical Queries
  • SQLSELECT D1.d1, , Dk.dk, agg1(F.f1,)FROM
    Dimension D1, , Dimension Dk, Fact FWHERE
    D1.key F.key1 AND AND Dk.keyk F.keyk
    AND otherPredicatesGROUP BY D1.d1, ,
    Dk.dkHAVING groupPredicates
  • This query is called a Star Join.

55
Example Query
  • Break down sales by year and category for the
    last two years show only categories with more
    than 1M in sales.
  • SQLSELECT T.year, P.category, SUM(X.units
    X.price)FROM Time T, Products P, Transactions
    XWHERE T.year 1999 OR T.year 2000GROUP BY
    T.year, P.categoryHAVING SUM(X.units X.price)
    gt 1000000

56
Design of the Clickstream Schema
  • Dimensions
  • Date. One record for each calendar day.
  • Time. One record per second (is that accurate
    enough?).
  • Customer. One record per customer. Several groups
    of customers, depending on knowledge about the
    customer
  • Anonymous web site visitor with cookie ID
  • Know name, address, and customer ID is assigned
  • Know demographics
  • Can you think of other dimensions?

57
Schema Design (Contd.)
  • Often used clickstream dimensions
  • Page. Grain One record per page or one record
    per page type?Page(pagekey, pageFunction,
    pageType, contentsType)
  • Event. Captures what the users does at a page
    (enter data, click on a link, etc.).Event(eventke
    y, eventtype)
  • Session. Attributes that characterize short-term
    behavior.Session(sessionkey, sessiontype,
    sequencetype, context, status)
  • Referral. Captures how the user arrived at our
    site.Referral(referralkey, referraltype, url,
    site, domain)

58
Schema Design (Contd.)
  • Example clickstream fact table schema for
    analyzing sessionsSessionfacts(datekey,
    timekey, ckey, totalSessionTime, numPages,
    numItems, orderAmount)
  • Foreign keys datekey, timekey, ckey
  • Facts totalSessionTime, number of pages visited,
    number of items ordered, total order amount
  • Example clickstream fact table schema for
    analyzing page usePagefacts(datekey, timekey,
    ckey, pagekey, pagetime, numItems)
  • Foreign keys datekey, timekey, ckey, pagekey
  • Facts pagetime, number of items orderd,

59
Building a Data Warehouse Tips
  • A data warehouse is a collection of data marts. A
    data mart contains one dimensional star schema
    that captures one business aspect.
  • Notes
  • It is crucial to centralize the logical
    definition and format of dimensions and facts
    (political challenge assign a dimension
    authority to each dimension). Everything else is
    a distributed effort throughout your company
    (technical challenge).
  • Each data mart will have its own fact table, but
    we will duplicate dimension tables over several
    data marts.

60
Summary Dimensional Design
  • A dimensional data model is similar to a
    relational data model, but we plan for historical
    data
  • Facts versus dimensions
  • Queries are in star join format

61
Questions?
  • (In the third lecture Data analysis)
Write a Comment
User Comments (0)
About PowerShow.com