Title: Chapters 24, 26, 27 Database System Architectures
1. Chapters 24, 26, 27: Database System Architectures
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department, The University of Connecticut
191 Auditorium Road, Box U-155, Storrs, CT 06269-3155
steve_at_engr.uconn.edu
http://www.engr.uconn.edu/steve
(860) 486-4818
- A portion of these slides are used with the permission of Dr. Ling Liu, Associate Professor, College of Computing, Georgia Tech.
- Remaining slides represent new material.
2. Classical and Distributed Architectures
- Classic/Centralized DBMS Dominated the Commercial Market from the 1970s Forward
- Problems of this Approach
- Difficult to Scale w.r.t. Performance Gains
- If the DB is Overloaded, Replace with a Faster Computer - this can Only Go So Far
- Disk Bottlenecks
- Distributed DBMS have Evolved to Address a Number of Issues
- Improved Performance
- Putting Data Near the Location where it is Needed
- Replication of Data for Fault Tolerance
- Vertical and Horizontal Partitioning of DB Tuples
3. Common Features of Centralized DBMS
- Data Independence
- High-Level Representation via Conceptual and External Schemas
- Physical Representation (Internal Schema) Hidden
- Program Independence
- Multiple Applications can Share Data
- Views/External Schemas Support this Capability
- Reduction of Program/Data Redundancy
- Single, Unique, Conceptual Schema
- Shared Database
- Almost No Data Redundancy
- Controlled Data Access Reduces Inconsistencies
- Programs Execute with Consistent Results
4. Common Features of Centralized DBMS
- Promote Sharing - Automatically Provided via CC (Concurrency Control)
- No Longer a Programmatic Issue
- Most DBMS Offer Locking for Key Shared Data
- Oracle Allows Locks on Data Items (Attributes)
- For Example, Controlling Access to a Shared Identifier
- Coherent and Central DB Administration
- Semantic DB Integrity via the Automatic Enforcement of Data Consistency via Integrity Constraints/Rules
- Data Resiliency
- Physical Integrity of Data in the Presence of Faults and Errors
- Supported by DB Recovery
- Data Security - Control Access to Sensitive Data for Authorized Users
5. Shared Nothing Architecture
- In this Architecture, Each DBMS Operates Autonomously - there is No Sharing
- Three Separate DBMSs on Three Different Computers
- Applications/Clients Must Know About the External Schemas of all Three DBMSs for
- Database Retrieval
- Client Processing
- Complicates the Client
- Different DBMS Platforms (Oracle, Sybase, Informix, ...)
- Different Access Modes (Query, Embedded, ODBC)
- Difficult for SWEs to Code
6. NW Architecture with Centralized DB
- High-Speed NWs/WANs Spawned Centralized DBs Accessible Worldwide
- Clients at Any Site can Access the Repository
- Data May be Far Away - Increased Access Time
- In Practice, Each Remote Site Needs only a Portion of the Data in DB1 and/or DB2
- Inefficient, with no Replication w.r.t. Failure
7. Fully Distributed Architecture
- The Five Sites (Chicago, SF, LA, NY, Atlanta) each have a Portion of the Database - it's Distributed
- Replication is Possible for Fault Tolerance
- Queries at one Site May Need to Access Data at Another Site (e.g., for a Join)
- Increased Transaction Processing Complexity
8. Goals of DDBMS
- Support User Distribution Across Multiple Sites
- Remote Access by Users Regardless of Location
- Distribution and Replication of Database Content
- Provide Location Transparency
- Users Manipulate their Own Data
- Non-Local Sites Appear Local to Any User
- Provide Transaction Control Akin to the Centralized Case
- Transaction Control Hides Distribution
- CC and Serializability - Must be Extended
- Minimize Communications Cost
- Optimize Use of the Network - a Critical Issue
- Distributed DB Design Supported by Partitioning (Fragmentation) and Replication
9. Goals of DDBMS
- Improve Response Time for DB Access
- Use a More Sophisticated Load Control for Transaction Processing
- However, Synchronization Across Sites May Introduce Additional Overhead
- System Availability
- Site Independence in the Presence of Site Failure
- A Subset of the Database is Always Available
- Replication can Keep All Data Available, Even When Multiple Sites Fail
- Modularity
- Incremental Growth with the Addition of Sites
- Dedicate Sites to Specific Tasks
10. Advantages of DDBMS
- There are Four Major Advantages
- Transparency
- Distribution/NW Transparency
- User Doesn't Know about the NW Configuration (Location Transparency)
- User can Find an Object at any Site (Naming Transparency)
- Replication Transparency (see next slide)
- User Doesn't Know the Location of Data
- Replicas are Transparently Accessible
- Fragmentation Transparency
- Horizontal Fragmentation (Distribute by Row)
- Vertical Fragmentation (Distribute by Column)
11. Data Distribution and Replication
12. Advantages of DDBMS
- Increased Reliability and Availability
- Reliability - System Always Running
- Availability - Data Always Present
- Achieved via Replication and Distribution
- Improved Performance
- Sites Able to Utilize Data that is Local for the Majority of Queries
- Easier Expansion
- Improve the Performance of a Site by
- Upgrading the Processor of the Computer
- Adding Additional Disks
- Splitting a Site into Two or More Sites
- Expansion over Time as the Business Grows
13. Challenges of DDBMS
- Tracking Data - Meta-Data is More Complex
- Must Track Distribution (Where is the Data?)
- V & H Fragmentation (How is the Data Split?)
- Replication (Multiple Copies for Consistency)
- Distributed Query Processing
- Optimization, Accessibility, etc., are More Complex
- Block Analysis of Data Size Must also Now Consider the NW Transmission Time
- Distributed Transaction Processing
- TP Potentially Spans Multiple Sites
- Submit Query to Multiple Sites
- Collect and Collate Results
14. Challenges of DDBMS
- Replicated Data Management
- TP Must Choose the Replica to Access
- Updates Must Modify All Replica Copies
- Distributed Database Recovery
- Recovery of Individual Sites
- Recovery Across the DDBMS
- Security
- Local and Remote Authorization
- During TP, be Able to Verify Remote Privileges
- Distributed Directory Management
- Meta-Data on the Database - Local and Remote
- Must Maintain Replicas of this - Every Site Tracks the Meta-Data for All Sites
15. A Complete Schema with Keys ...
- Keys Allow us to Establish Links Between Relations
- What is this Similar to in ER?
16. ... and Corresponding DB Tables
- which Represent Tuples/Instances of Each Relation
(Figure: sample tables with tuple values, not reproduced here)
17. ... with Remaining DB Tables
18. What is Fragmentation?
- Fragmentation Divides a DB Across Multiple Sites
- Two Types of Fragmentation
- Horizontal Fragmentation
- Given a Relation R with n Total Tuples, Spread Entire Tuples Across Multiple Sites
- Each Site has a Subset of the n Tuples
- Essentially, this Fragmentation is a Selection
- Vertical Fragmentation
- Given a Relation R with m Attributes and n Total Tuples, Spread the Columns Across Multiple Sites
- Essentially, this Fragmentation is a Projection
- In Both Cases, Sites can Overlap for Replication (see the sketch below)
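A minimal sketch of how the two fragment types could be materialized at a site with plain SQL issued over JDBC, assuming a COMPANY-style EMPLOYEE(Ssn, Lname, Salary, Dno) table like the one in the earlier schema slides; the table, columns, and connection details are illustrative only.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

/** Sketch: building a horizontal fragment (a selection) and a vertical
 *  fragment (a projection) of EMPLOYEE at one site. Names and the JDBC URL
 *  are assumptions, not part of the slides. */
public class FragmentBuilder {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@site2:1521:db", "user", "pass");
             Statement st = con.createStatement()) {

            // Horizontal fragment = SELECTION: whole tuples for Dept. 5 only
            st.executeUpdate(
                "CREATE TABLE EMPLOYEE_D5 AS SELECT * FROM EMPLOYEE WHERE Dno = 5");

            // Vertical fragment = PROJECTION: a column subset, keeping the key
            // (Ssn) so the fragments can later be rejoined
            st.executeUpdate(
                "CREATE TABLE EMPLOYEE_NAMES AS SELECT Ssn, Lname FROM EMPLOYEE");
        }
    }
}
```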
19. Horizontal Fragmentation
- Site 2 Tracks All Information Related to Dept. 5
20. Horizontal Fragmentation
- Site 3 Tracks All Information Related to Dept. 4
- Note that an Employee Could be Listed in Both Cases, if s/he Works on a Project for Both Departments
21. Refined Horizontal Fragmentation
- Further Fragment from Site 2 based on the Dept. that the Employee Works in
- Notice that G1 ∪ G2 ∪ G3 is the Same as WORKS_ON5 - there is no Overlap
22. Refined Horizontal Fragmentation
- Further Fragment from Site 3 based on the Dept. that the Employee Works in
- Notice that G4 ∪ G5 ∪ G6 is the Same as WORKS_ON4
- Note that Some Fragments can be Empty
23. Homogeneous DDBMS
- Homogeneous
- Identical Software (w.r.t. Database)
- One DB Product (e.g., Oracle) is Distributed and Available at All Sites
- Uniformity w.r.t. Administration, Maintenance, Client Access, Users, Security, etc.
- Interaction by Programmatic Clients is Consistent (e.g., JDBC or ODBC or ...)
24. Non-Federated Heterogeneous DDBMS
- Non-Federated Heterogeneous
- Different Software (w.r.t. Database)
- Multiple DB Products (e.g., Oracle at One Site, Access at Another, Sybase, Informix, etc.)
- Replicated Administration (e.g., Users Need Accounts on Multiple Systems)
- Varied Programmatic Access - SWEs Must Know All Platforms/Client Software - Complicated
25. Federated DDBMS
- Federated
- Multiple DBMS Platforms Overlaid with a Global Schema View
- A Single External Schema Combines the Schemas from all Sites
- Multiple Data Models
- Relational in one Component DBS
- Object in another Component DBS
- Hierarchical in a 3rd DBS
26. Federated DBMS Issues
- Differences in Data Models
- Reconcile Relational vs. Object-Oriented Models
- Each Model has Different Capabilities
- These Differences Must be Addressed in Order to Present a Federated Schema
- Differences in Constraints
- Referential Integrity Constraints in Different DBSs
- Different Constraints on Similar Data
- The Federated Schema Must Deal with these Conflicts
- Differences in Query Languages
- SQL-89, SQL-92, SQL2, SQL3
- Specific Types in Different DBMSs (e.g., BLOBs in Oracle)
27. Databases on the WWW
- The Web has Changed the Way we do Business & Research
- Facts
- Industry Saw an Opportunity and Knew it had to Move Quickly to Capitalize
- Lots of Action, Lots of Money, Lots of Releases
- The Line Between R&D is Very Narrow
- Many Researchers Moved to Industry (Trying to Return Back to Academia)
- Emergence of Java
- Java Changed the Way that Software was Designed, Developed, and Utilized
- Particularly w.r.t. Web-Based Applications, Database Interoperability, Web Architectures, etc.
- Emergence of Enterprise Computing
28. Internet and the Web
- A Major Opportunity for Business
- A Global Marketplace
- Business Across State and Country Boundaries
- A Way of Extending Services
- Online Payment vs. VISA, Mastercard
- A Medium for the Creation of New Services
- Publishers, Travel Agents, Tellers, Virtual Yellow Pages, Online Auctions
- A Boon for Academia
- Research Interactions and Collaborations
- Free Software for Classroom/Research Usage
- Opportunities for Exploration of Technologies in Student Projects
29. Market Forecast: New Internet Servers
30. WWW: Three Market Segments
- Business to Business
- Information Sharing
- Ordering Info./Status
- Targeted Electronic Commerce
- Intranet (Corporate Network)
- Decision Support
- Mfg. System Monitoring
- Corporate Repositories
- Workgroups
- Internet
- Sales
- Marketing
- Information
- Services
(Figure: servers on corporate networks and the Internet supporting each segment)
31. Information Delivery Problems on the Net
- Everyone can Publish Information on the Web Independently at Any Time
- Consequently, there is an Information Explosion
- Identifying Information Content is More Difficult
- There are too Many Search Engines but too Few Capable of Returning High-Quality Data
- Most Search Engines are Useful for Ad-hoc Searches but Awkward for Tracking Changes
32. Example Web Applications
- Scenario 1: World Wide Wait
- A Major Event is Underway and the Latest, Up-to-the-Minute Results are Being Posted on the Web
- You Want to Monitor the Results for this Important Event, so you Fire up your Trusty Web Browser, Point it at the Result-Posting Site, and Wait, and Wait, and Wait
- What is the Problem?
- The Scalability Problems are the Result of a Mismatch Between the Data Access Characteristics of the Application and the Technology Used to Implement the Application
33. Example Web Applications
- Scenario 2
- Many Applications Today Need to Track Changes in Local and Remote Data Sources and to Notify Users If Some Condition Over the Data Source(s) is Met
- If You Want to Monitor Changes on the Web, You Need to Fire up Your Trusty Web Browser from Time to Time, Cache the Most Recent Result, and Compute the Difference Manually Each Time You Poll the Data Source(s)
- What is the Problem?
- Pure Pull is Not the Answer to All Problems
34. What is the Problem?
- Applications are Asymmetric but the Web is Not
- Computation Centric vs. Information Flow Centric
- Types of Asymmetry
- Network Asymmetry
- Satellite, CATV, Mobile Clients, Etc.
- Client to Server Ratio
- Too Many Clients can Swamp Servers
- Data Volume
- Mouse and Key Clicks vs. Content Delivery
- Update and Information Creation
- Clients Need to be Informed or Must Poll
35. Useful Solutions
- Combination/Interleaving of Pull and Push Protocols
- User-Initiated, Comprehensive Search-Based Information Delivery (Pull)
- Server-Initiated Information Dissemination (Push)
- Provide Support for a Variety of Data Delivery Protocols, Frequencies, and Delivery Modes
- Information Delivery Frequencies
- Periodic, Conditional, Ad-Hoc
- Information Delivery Modes
- Information Delivery Protocols (IDP)
- Request/Respond, Polling, Publish/Subscribe, Broadcast
- Information Delivery Styles (IDS)
- Pull, Push, Hybrid
36. Information Delivery Frequencies
- Periodic
- Data is Delivered from a Server to Clients Periodically
- The Period can be Defined by a System Default or by Clients Using their Profiles
- The Period can be Influenced by Client and Bandwidth
- PDA/Cell Phone vs. PC w/Modem
- PC w/DSL vs. PC w/Cable Modem
- Conditional (Aperiodic)
- Data is Delivered from a Server when Conditions Installed by Clients in their Profiles are Satisfied
- Ad-hoc (or Irregular)
37. Information Delivery Modes
- Unicast
- Data is Sent from a Data Source (a Single Server) to Another Machine
- 1-to-n
- Data is Sent by a Single Data Source and Received by Multiple Machines
- Multicast vs. Broadcast
- Multicast: Data is Sent to a Specific Set of Clients
- Broadcast: Data is Sent Over a Medium to which an Unidentified or Unbounded Set of Clients can Listen
38. IDP: Request/Respond
- Semantics of Request/Respond
- Clients Send their Requests to Servers to Ask for the Information of Interest
- Servers Respond to the Client Request by Delivering the Information Requested
- The Client can Wait (Synchronous) or Not
- Applications
- Most Database Systems and Web Search Engines Use the Request/Respond Protocol for Client-Server Communication (a minimal sketch follows)
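A minimal request/respond sketch in Java: the client initiates the exchange and blocks for the answer. The URL is a hypothetical placeholder, not one referenced by these slides.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/** Minimal request/respond sketch: the client initiates, the server answers.
 *  The URL is a placeholder; error handling is kept to a minimum. */
public class RequestRespondClient {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://results.example.com/latest");   // hypothetical server
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");                              // the request
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {   // synchronous wait
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);                          // the response
            }
        }
    }
}
```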
39. IDP: Programmed Polling vs. User Polling
- Semantics
- Programmed Polling: a System Periodically Sends Requests to Other Sites to Obtain Status Information or Detect Changed Values (sketched below)
- User Polling: a User or Application Periodically or Aperiodically Polls the Data Sites and Obtains the Changes
- Applications
- Programmed Polling Saves the Users from having to Click, but does Nothing to Solve the Scalability Problems Caused by the Request/Respond Mechanism
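A minimal programmed-polling sketch: a scheduled task re-issues the same request on a fixed period and reports when the cached value changes. fetchLatest() is a hypothetical stand-in for any request/respond call such as the HTTP client above.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Minimal programmed-polling sketch: re-issue the same request on a schedule
 *  and compare against the previously cached answer. */
public class ProgrammedPoller {
    private static String lastSeen = "";

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            String current = fetchLatest();          // pull from the data source
            if (!current.equals(lastSeen)) {         // detect a changed value
                System.out.println("Change detected: " + current);
                lastSeen = current;
            }
        }, 0, 30, TimeUnit.SECONDS);                 // poll every 30 seconds
    }

    private static String fetchLatest() {
        // placeholder data source; in practice this would be a request/respond call
        return String.valueOf(System.currentTimeMillis() / 60000);
    }
}
```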
40. IDP: Publish/Subscribe
- Semantics: Servers Publish / Clients Subscribe
- Servers Publish Information Online
- Clients Subscribe to the Information of Interest (Subscription-Based Information Delivery)
- Data Flow is Initiated by the Data Sources (Servers) and is Aperiodic
- Danger: Subscriptions can Lead to Other Unwanted Subscriptions
- Applications
- Unicast: Database Triggers and Active Databases
- 1-to-n: Online News Groups (see the sketch below)
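A minimal in-process publish/subscribe sketch: the publisher keeps a subscriber list and pushes each published item to every subscriber, so data flow is server-initiated. The Publisher/Subscriber types and topic names are illustrative, not a real messaging API.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Minimal publish/subscribe sketch: clients register once, then the data
 *  source pushes each new item to all of them; clients never poll. */
public class PublishSubscribeDemo {

    interface Subscriber { void onPublish(String topic, String item); }

    static class Publisher {
        private final List<Subscriber> subscribers = new CopyOnWriteArrayList<>();
        void subscribe(Subscriber s) { subscribers.add(s); }
        void publish(String topic, String item) {        // server-initiated data flow
            for (Subscriber s : subscribers) s.onPublish(topic, item);
        }
    }

    public static void main(String[] args) {
        Publisher newsFeed = new Publisher();
        newsFeed.subscribe((topic, item) ->
                System.out.println("Client A got [" + topic + "]: " + item));
        newsFeed.subscribe((topic, item) ->
                System.out.println("Client B got [" + topic + "]: " + item));
        newsFeed.publish("sports", "Latest score posted");  // push to all subscribers
    }
}
```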
41. IDP: Broadcast
- Semantics: Server Broadcasts / Client(s) Listen
- Clients who Require Access to a Data Item Need to Wait Until the Item Appears
- Selective Broadcast vs. Random Broadcast
- Applications
- Unicast Doesn't Address Network Bandwidth Problems
- 1-to-n is Most Likely to Prevail
42. Information Delivery Styles
- Pull-Based System
- Transfer of Data from Server to Client is Initiated by a Client Pull
- Clients Determine when to Get Information
- Potential for Information to be Old Unless the Client Periodically Pulls
- Push-Based System
- Transfer of Data from Server to Client is Initiated by a Server Push
- Clients may get Overloaded if the Push is Too Frequent
- Hybrid
- Pull and Push Combined
- Pull First and then Push Continually
43. Summary: Pull vs. Push

            Request/Respond  Publish/Subscribe  Broadcast  Periodic  Conditional  Ad-hoc
Pure Pull          Y                                                                Y
Pure Push                            Y              Y         Y          Y
Hybrid             Y                 Y              Y         Y          Y           Y
44. Design Options for Nodes
- Three Types of Nodes
- Data Sources
- Provide the Base Data which is to be Disseminated
- Clients
- The Net Consumers of the Information
- Information Brokers
- Acquire Information from Other Data Sources, Add Value to that Information, and then Distribute this Information to Other Consumers
- By Creating a Hierarchy of Brokers, Information Delivery can be Tailored to the Needs of Many Users
45. The Next Big Challenge
- Interoperability
- Heterogeneous Distributed Databases
- Heterogeneous Distributed Systems
- Autonomous Applications
- Scalability
- Rapid and Continuous Growth
- Amount of Data
- Variety of Data Types
- Different Privacy Levels or Ownerships of Data
46. Interoperability: A Classic View
(Figure: Simple Federation - Local Schemas are combined via Federated Integration into an FDB Global Schema; Multiple Nested Federation - federations such as FDB 1 and FDB 3 are themselves integrated into a higher-level FDB Global Schema, e.g., FDB 4)
47. CORBA-Based Development
(Figure: an IDL file is run through the IDL Compiler to produce a client-side Stub and a server-side Skeleton; the Client Application calls the Stub, the Object Implementation sits behind the Skeleton, and the two sides communicate via ORB/IIOP)
48. ORB Integration: Java Client & Legacy Application
(Figure: a Java Client and a Legacy Application with a Java Wrapper both attach to an Object Request Broker (ORB))
- CORBA is the Medium of Info. Exchange - Requires Java/CORBA Capabilities
49. Java Client with Wrapper to Legacy Application
- Interactions Between the Java Client and the Legacy Appl. via C and RPC - C is the Medium of Info. Exchange
(Figure: Java Client with C/C++ Wrapper - Java Application Code and Mapping Classes in the JAVA LAYER sit above Native Functions (C) and RPC Client Stubs (C) in the NATIVE LAYER, which reach the Legacy Application over the Network)
50. COTS and Legacy Appls. to Java Clients
(Figure: for both a COTS Application and a Legacy Application, Native Functions that Map to the Appl. sit in the NATIVE LAYER beneath Mapping Classes and Java Application Code in the JAVA LAYER; a JAVA NETWORK WRAPPER connects each over the Network to a Java Client)
- Java is the Medium of Info. Exchange - C/C++ Appls. with Java Wrappers
51. Database Interoperability in the Internet
- Technology
- Web/HTTP, JDBC/ODBC, CORBA (ORBs & IIOP), XML
- Architecture
52. Java Client to Legacy App via RDBS
(Figure: data is Extracted and Generated from the Legacy Application, then Transformed and Stored in a Relational Database System (RDS); the Java Client reads the Transformed Legacy Data and writes Updated Data back)
53. JDBC
- The JDBC API Provides DB Access Protocols for Open, Query, Close, etc. (see the sketch below)
- Different Drivers for Different DB Platforms
(Figure: a Java Application calls the JDBC API; the Driver Manager selects among Drivers for Oracle, Sybase, Access, etc.)
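A minimal sketch of the open/query/close protocol over the JDBC API. The JDBC URL, credentials, and EMPLOYEE table are assumptions for illustration; the Driver Manager picks the platform-specific driver based on the URL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Minimal JDBC sketch: open a connection, run a query, iterate the result,
 *  and close (via try-with-resources). Connection details are placeholders. */
public class JdbcQueryExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:oracle:thin:@dbhost:1521:orcl";   // hypothetical Oracle site
        try (Connection con = DriverManager.getConnection(url, "user", "pass"); // open
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT Lname, Salary FROM EMPLOYEE WHERE Dno = 5")) {     // query
            while (rs.next()) {
                System.out.println(rs.getString("Lname") + " " + rs.getDouble("Salary"));
            }
        }                                                                        // close
    }
}
```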
54. Connecting a DB to the Web
- Web Servers are Stateless
- DB Interactions Tend to be Stateful
- Invoking a CGI Script on Each DB Interaction is Very Expensive, Mainly Due to the Cost of the DB Open
(Figure: Browser → Internet → Web Server → CGI Script Invocation or JDBC Invocation → DBMS)
55. Connecting More Efficiently
- To Avoid the Cost of Opening the Database, One can Use Helper Processes that Always Keep the Database Open and Outlive the Web Connection (a pooled-connection sketch follows)
- Newly Invoked CGI Scripts Connect to a Preexisting Helper Process
- The System is Still Stateless
(Figure: Browser → Internet → Web Server → CGI Script or JDBC Invocation → Helper Processes → DBMS)
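A minimal sketch of the helper-process idea expressed as an in-process connection pool: a few connections are opened once, kept open, and reused across web requests, so each request avoids the expensive DB open. The pool size, JDBC URL, and credentials are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Sketch of long-lived "helper" connections: open once, reuse per request. */
public class HelperConnectionPool {
    private final BlockingQueue<Connection> pool = new ArrayBlockingQueue<>(3);

    public HelperConnectionPool(String url, String user, String pass) throws Exception {
        for (int i = 0; i < 3; i++) {                        // open once, keep open
            pool.put(DriverManager.getConnection(url, user, pass));
        }
    }

    /** Called per web request: borrow an already-open connection, run, return it. */
    public String handleRequest(String sql) throws Exception {
        Connection con = pool.take();
        try (Statement st = con.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            return rs.next() ? rs.getString(1) : "";
        } finally {
            pool.put(con);                                    // connection stays open
        }
    }
}
```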
56. DB-Internet Architecture
(Figure: WWW Clients (Netscape, HotJava, Info. Explorer) connect over the Internet to an HTTP Server; a DBWeb Dispatcher routes requests among multiple DBWeb Gateways)
57. EJB Architecture
58. Technology Push
- Computer/Communication Technology (Almost Free)
- Plenty of Affordable CPU, Memory, Disk, Network Bandwidth
- Next Generation Internet: Gigabit Now
- Wireless: Ubiquitous, High Bandwidth
- Information Growth
- Massively Parallel Generation of Information on the Internet and from a New Generation of Sensors
- Disk Capacity on the Order of Petabytes
- Small, Handy Devices to Access Information
59. More Information Moves to Cyberspace
- What Grows Faster than the Internet?
- Why?
- Low Rent - roughly 10x Cheaper
- 100 Letters (1MB): Disk 10 vs. File Cabinet 500
- 1 Picture: Disk 10 vs. Printed 40
- Easy Access and Search
- A Robot can Find All Docs Matching a Predicate
- Access from Anywhere
- Human Costs 15/hr
- How Much Information is there?
- Gray, Turing Lecture 1998, ACM FCRC Atlanta
60. Sizes of Search Engines
- KEY: AV=AltaVista, NL=Northern Light, INK=Inktomi, EX=Excite, LY=Lycos, IS=Infoseek, WC=WebCrawler
- as of February, 1999
(Figure: chart of index sizes for the engines in the key)
61. Research Challenges
- Inherent Complexity
- Coping with Latency (Sometimes Unpredictable)
- Failure Detection and Recovery (Partial Failure)
- Concurrency, Load Balancing, Availability, Scale
- Service Partitioning
- Ordering of Distributed Events
- Accidental Complexity
- Heterogeneity Beyond the Local Case: Platform, Protocol, Plus All Local Heterogeneity in Spades
- Autonomy: Systems Change and Evolve Autonomously
- Tool Deficiencies: Language Support (Sockets, RPC), Debugging, Etc.
- Ubiquitous/Pervasive Computing: Many Computers and Information Appliances Everywhere, Networked Together
62. Infosphere
- Problem: too many sources, too much information
63. Current State-of-the-Art
(Figure: Thin Client → Web Server → Mainframe Database Server)
64. Infosphere Scenario
(Figure: Infotaps (Fat Clients) draw on Many Sources - Sensors and Database Servers)
65. Heterogeneity and Autonomy
- Heterogeneity
- How Much can we Really Integrate?
- Syntactic Integration
- Different Formats and Models
- Web/SQL Query Languages
- Semantic Interoperability
- Basic Research on Ontology, Etc.
- Autonomy
- No Central DBA on the Net
- Independent Evolution of Schema and Content
- Interoperation is Voluntary
- Interface Technology (Support for ISVs)
- DCOM: Microsoft Standard
- CORBA, Etc.
66. Security and Data Quality
- Security
- System Security in the Broad Sense
- Attacks: Penetrations, Denial of Service
- System (and Information) Survivability
- Security & Fault Tolerance
- Replication for Performance, Availability, and Survivability
- Data Quality
- Web Data Quality Problems
- Local Updates with Global Effects
- Unchecked Redundancy (Mutual Copying)
- Registration of Unchecked Information
- Spam on the Rise
67. Legacy Data Challenge
- Legacy Applications and Data
- Definition: Important and Difficult to Replace
- Typically, Mainframe Mission-Critical Code
- Most are OLTP and Database Applications
- Evolution of Legacy Databases
- Client-Server Architectures
- Wrappers
- Expensive and Gradual in Any Case
68. Potential Value Added / Jumping on the Bandwagon
- Sophisticated Query Capability
- Combining SQL with Keyword Queries
- Consistent Updates
- Atomic Transactions and Beyond
- But Everything has to be in a Database!
- Only If we Stick with Classic DB Assumptions
- Relaxing DB Assumptions
- Interoperable Query Processing
- Extended Transaction Updates
- Commodity DB Software
- A Little Help is Still Good If it is Cheap
- The Internet Facilitates Software Distribution
- Databases as Middleware
69. Data Warehousing and Data Mining
- Data Warehousing
- Provides Access to Data for Complex Analysis, Knowledge Discovery, and Decision Making
- Underlying Infrastructure in Support of Mining
- Provides the Means to Interact with Multiple DBs
- OLAP (On-Line Analytical Processing) vs. OLTP
- Data Mining
- Discovery of Information in Vast Data Sets
- Search for Patterns and Common Features based on the Data
- Discover Information not Previously Known
- Medical Records Accessible Nationwide
- Research/Discover Cures for Rare Diseases
- Relies on Knowledge Discovery in DBs (KDD)
70. Data Warehousing and OLAP
- A Data Warehouse
- A Database Maintained Separately from an Operational Database
- "A Subject-Oriented, Integrated, Time-Variant, and Non-Volatile Collection of Data in Support of Management's Decision Making Process" - W. H. Inmon
- OLAP (On-Line Analytical Processing)
- Analysis of Complex Data in the Warehouse
- Attempt to Attain Value through Analysis
- Relies on Trained and Adept Skilled Knowledge Workers who Discover Information
- Data Mart
- Organized Data for a Subset of an Organization
71. Building a Data Warehouse
- Option 1
- Leverage Existing Repositories
- Collate and Collect
- May Not Capture All Relevant Data
- Option 2
- Start from Scratch
- Utilize Underlying Corporate Data
(Figure: Option 1 consolidates existing Data Marts into the Corporate Data Warehouse; Option 2 builds the warehouse from scratch from Corporate Data)
72. Data Warehouse Characteristics
- Utilizes a Multi-Dimensional Data Model
- Warehouse is Comprised of
- A Store of Integrated Data from Multiple Sources
- Processed into a Multi-Dimensional Model
- Warehouse Supports
- Time Series and Trend Analysis
- "Super-Excel" Integrated with DB Technologies
- Data is Less Volatile than in a Regular DB
- Doesn't Dramatically Change Over Time
- Updates at Regular Intervals
- Specific Refresh Policy Regarding Some Data
73. Three-Tier Architecture
(Figure: Operational Databases and External Data Sources are Extracted, Transformed, Loaded, and Refreshed through a monitor/integrator into the Data Warehouse (with metadata) and Data Marts; an OLAP Server then serves Query/Report, Summarization/Report, and Data Mining tools)
74. Data Warehouse Design
- Most Data Warehouses use a Star Schema to Represent the Multi-Dimensional Data Model
- Each Dimension is Represented by a Dimension Table that Provides its Multidimensional Coordinates and Stores Measures for those Coordinates
- A Fact Table Connects All Dimension Tables with a Multiple Join
- Each Tuple in the Fact Table Represents the Content of One Dimension
- Each Tuple in the Fact Table Consists of a Pointer to Each of the Dimension Tables
- The Links Between the Fact Table and the Dimension Tables Form a Shape Like a Star
75. What is a Multi-Dimensional Data Cube?
- A Representation of Information in Two or More Dimensions
- Typical Two-Dimensional Example: a Spreadsheet
- In Practice, to Track Trends or Conduct Analysis, Three or More Dimensions are Useful
76. Multi-Dimensional Schemas
- Supporting Multi-Dimensional Schemas Requires Two Types of Tables
- Dimension Table: Tuples of Attributes for Each Dimension
- Fact Table: Measured/Observed Variables with Pointers into the Dimension Tables
- Star Schema
- Characterizes Data Cubes by having a Single Fact Table Linked to a Dimension Table for Each Dimension
- Snowflake Schema
- Dimension Tables from the Star Schema are Organized into a Hierarchy via Normalization
- Both Represent Storage Structures for Cubes
77. Example of Star Schema
- Date Dimension Table: Date, Month, Year
- Store Dimension Table: StoreID, City, State, Country, Region
- Sale Fact Table: Date, Product, Store, Customer, Unit_Sales, Dollar_Sales
(Figure: the fact table points into the Date, Product, Store, and Customer dimension tables; a star-join query sketch follows)
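A minimal star-join sketch over the fact and dimension tables listed above, issued over JDBC as in the earlier slides: total dollar sales per region and year. The Date dimension table is named Date_Dim here only to avoid the SQL reserved word, and the connection details are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Minimal star-join sketch: join the Sale fact table to its Date and Store
 *  dimension tables and aggregate the Dollar_Sales measure. */
public class StarJoinQuery {
    public static void main(String[] args) throws Exception {
        String sql =
            "SELECT s.Region, d.Year, SUM(f.Dollar_Sales) AS Total " +
            "FROM Sale f " +
            "JOIN Date_Dim d ON f.Date = d.Date " +      // fact -> Date dimension
            "JOIN Store s ON f.Store = s.StoreID " +     // fact -> Store dimension
            "GROUP BY s.Region, d.Year";
        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@warehouse:1521:dw", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("Region") + " " +
                        rs.getInt("Year") + " " + rs.getDouble("Total"));
            }
        }
    }
}
```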
78. A Second Example of Star Schema
79. ... and Corresponding Snowflake Schema
80. Data Warehouse Issues
- Data Acquisition (an extract/clean/load sketch follows this list)
- Extraction from Heterogeneous Sources
- Reformatted into the Warehouse Context - Names, Meanings, and Data Domains Must be Consistent
- Data Cleaning for Validity and Quality - is the Data as Expected w.r.t. Content? Value?
- Transition of Data into the Data Model of the Warehouse
- Loading of Data into the Warehouse
- Other Issues Include
- How Current is the Data? Frequency of Update?
- Availability of the Warehouse? Dependencies of Data?
- Distribution, Replication, and Partitioning Needs?
- Loading Time (Clean, Format, Copy, Transmit, Index Creation, etc.)?
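A minimal extract/clean/load sketch for the acquisition steps above: rows are pulled from a hypothetical operational source, names and domains are normalized, invalid rows are dropped, and the result is batch-loaded into a warehouse fact table. All table names, columns, and JDBC URLs are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

/** Sketch: extract from an operational DB, clean/normalize, load the warehouse. */
public class SimpleEtl {
    public static void main(String[] args) throws Exception {
        try (Connection src = DriverManager.getConnection(
                 "jdbc:oracle:thin:@operational:1521:ops", "user", "pass");
             Connection dw = DriverManager.getConnection(
                 "jdbc:oracle:thin:@warehouse:1521:dw", "user", "pass");
             Statement extract = src.createStatement();
             ResultSet rs = extract.executeQuery(
                 "SELECT store_id, region, sale_date, amount FROM daily_sales");
             PreparedStatement load = dw.prepareStatement(
                 "INSERT INTO Sale (Store, Region, Date_Key, Dollar_Sales) VALUES (?,?,?,?)")) {

            while (rs.next()) {
                String region = rs.getString("region");
                if (region == null || rs.getDouble("amount") < 0) continue; // cleaning: drop invalid rows
                load.setInt(1, rs.getInt("store_id"));
                load.setString(2, region.trim().toUpperCase());             // consistent names/domains
                load.setDate(3, rs.getDate("sale_date"));
                load.setDouble(4, rs.getDouble("amount"));
                load.addBatch();
            }
            load.executeBatch();                                            // periodic bulk load
        }
    }
}
```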
81. OLAP Strategies
- OLAP Strategies
- Roll-Up: Summarization of the Data (contrasted with drill-down in the sketch below)
- Drill-Down: from the General to the Specific (Details)
- Pivot: Cross-Tabulate the Data Cubes
- Slice and Dice: Projection Operations Across Dimensions
- Sorting: Ordering Result Sets
- Selection: Access by Value or Value Range
- Implementation Issues
- Persistent with Infrequent Updates (Loading)
- Optimization for Performance on Queries is More Complex - Across Multi-Dimensional Cubes
- Recovery is Less Critical - Mostly Read-Only
- Temporal Aspects of Data (Versions) are Important
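A minimal sketch contrasting roll-up and drill-down as GROUP BY queries over the star schema sketched earlier (Sale fact table with Date_Dim and Store dimensions); only the query text is built here, and it would be issued via JDBC as before.

```java
/** Sketch: drill-down adds a finer grouping level, roll-up removes levels. */
public class OlapQueries {
    public static void main(String[] args) {
        // Drill-down: from yearly totals down to the finer Month level
        String drillDown =
            "SELECT s.Region, d.Year, d.Month, SUM(f.Dollar_Sales) " +
            "FROM Sale f JOIN Date_Dim d ON f.Date = d.Date " +
            "JOIN Store s ON f.Store = s.StoreID " +
            "GROUP BY s.Region, d.Year, d.Month";

        // Roll-up: summarize away Month and Region for coarser totals per Year
        String rollUp =
            "SELECT d.Year, SUM(f.Dollar_Sales) " +
            "FROM Sale f JOIN Date_Dim d ON f.Date = d.Date " +
            "GROUP BY d.Year";

        System.out.println(drillDown);
        System.out.println(rollUp);
    }
}
```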
82. Knowledge Discovery
- Data Warehousing Requires Knowledge Discovery to Organize/Extract Information Meaningfully
- Knowledge Discovery
- Technology to Extract Interesting Knowledge (Rules, Patterns, Regularities, Constraints) from a Vast Data Set
- The Process of Non-trivial Extraction of Implicit, Previously Unknown, and Potentially Useful Information from a Large Collection of Data
- Data Mining
- A Critical Step in the Knowledge Discovery Process
- Extracts Implicit Information from a Large Data Set
83. Steps in a KDD Process
- Learning the Application Domain (Goals)
- Gathering and Integrating Data
- Data Cleaning
- Data Integration
- Data Transformation/Consolidation
- Data Mining
- Choosing the Mining Method(s) and Algorithm(s)
- Mining: Search for Patterns or Rules of Interest
- Analysis and Evaluation of the Mining Results
- Use of Discovered Knowledge in Decision Making
- Important Caveats
- This is Not an Automated Process!
- It Requires Significant Human Interaction!
84. On-Line Analytical Processing
- Data Cube
- A Multidimensional Array
- Each Attribute is a Dimension
- In the Example Below, the Data Must be Interpreted so that it Can be Aggregated by Region/Product/Date (an array-based sketch follows)
(Figure: a cube of sales with dimensions Product (Pants, Diapers, Beer, Nuts), Region (West, East, Central, Mountain, South), and Date (Jan, Feb, March, April))
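A minimal in-memory sketch of the cube in the figure: a 3-D array of sales indexed by region, product, and month, with one roll-up style aggregation. The cell values are made up for illustration.

```java
/** Sketch: a multidimensional array as a data cube, plus a simple aggregation. */
public class SalesCube {
    static final String[] REGIONS  = {"West", "East", "Central", "Mountain", "South"};
    static final String[] PRODUCTS = {"Pants", "Diapers", "Beer", "Nuts"};
    static final String[] MONTHS   = {"Jan", "Feb", "March", "April"};

    public static void main(String[] args) {
        double[][][] cube = new double[REGIONS.length][PRODUCTS.length][MONTHS.length];
        cube[0][2][0] = 120.0;   // West, Beer, Jan  (illustrative cell values)
        cube[1][2][1] = 95.0;    // East, Beer, Feb

        // Roll-up: total Beer sales over all regions and months
        double beerTotal = 0;
        for (int r = 0; r < REGIONS.length; r++)
            for (int m = 0; m < MONTHS.length; m++)
                beerTotal += cube[r][2][m];
        System.out.println("Total Beer sales = " + beerTotal);
    }
}
```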
85. Examples of Data Mining
- The Slicing Action
- A Vertical or Horizontal Slice Across the Entire Cube
(Figure: slicing the multi-dimensional data cube on city = Atlanta)
86. Examples of Data Mining
- The Dicing Action
- A Slice First Identifies one Dimension
- A Selection of Any Cube within the Slice then Essentially Constrains All Three Dimensions
(Figure: dicing on Electronics and Atlanta for March 2000)
87. Examples of Data Mining
- Drill-Down Takes a Facet (e.g., Q1) and Decomposes it into Finer Detail
- Roll-Up Combines Multiple Dimensions, e.g., From Individual Cities to the State Level
(Figure: a Products/Sales cube - drill down on Q1 from quarters (Q1-Q4) to cities (Atlanta, Columbus, Gainesville, Savannah); roll up on Location from cities to states (Arizona, California, Georgia, Iowa) and USA)
88. Mining Other Types of Data
- Analysis and Access are Dramatically More Complicated!
- Spatial Databases
- Multimedia Databases
- World Wide Web
- Time Series Data
- Geographical and Satellite Data
89. Advantages/Objectives of Data Mining
- Descriptive Mining
- Discover and Describe General Properties
- 60% of People who Buy Beer on Friday have also Bought Nuts or Chips in the Past Three Months
- Predictive Mining
- Infer Interesting Properties based on the Available Data
- People who Buy Beer on Friday usually also Buy Nuts or Chips
- Result of Mining
- Order from Chaos
- Mining Large Data Sets in Multiple Dimensions Allows Businesses, Individuals, etc. to Learn about Trends, Behavior, etc.
- Impact on Marketing Strategy
90. Data Mining Methods (1)
- Association
- Discover the Frequency of Items Occurring Together in a Transaction or an Event
- Examples
- 80% of Customers who Buy Milk also Buy Bread - Hence, Bread and Milk are Adjacent in the Supermarket
- 50% of Customers Forget to Buy Milk/Soda/Drinks - Hence, these are Available at the Register
- Prediction
- Predicts Some Unknown or Missing Information based on Available Data
- Example
- Forecast the Sale Value of Electronic Products for the Next Quarter via Available Data from the Past Three Quarters
91. Association Rules
- Motivated by Market Analysis
- Rules of the Form: Item1, Item2, ..., Itemk → Itemk+1, ..., Itemn
- Example: Beer, Soft Drink → Pop Corn
- Problem: Discovering All Interesting Association Rules in a Large Database is Difficult!
- Issues
- Interestingness
- Completeness
- Efficiency
- Basic Measurements for Association Rules (computed in the sketch below)
- Support of the Rule
- Confidence of the Rule
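A minimal sketch of the two measurements on a toy transaction list, using support(X → Y) = fraction of transactions containing X ∪ Y and confidence(X → Y) = support(X ∪ Y) / support(X); the transactions and the Beer, SoftDrink → PopCorn rule are illustrative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: compute support and confidence of one association rule. */
public class AssociationRuleMeasures {
    public static void main(String[] args) {
        List<Set<String>> transactions = Arrays.asList(
                new HashSet<>(Arrays.asList("Beer", "SoftDrink", "PopCorn")),
                new HashSet<>(Arrays.asList("Beer", "SoftDrink")),
                new HashSet<>(Arrays.asList("Beer", "Nuts")),
                new HashSet<>(Arrays.asList("SoftDrink", "PopCorn")));

        Set<String> lhs = new HashSet<>(Arrays.asList("Beer", "SoftDrink"));
        Set<String> both = new HashSet<>(lhs);
        both.add("PopCorn");

        long lhsCount  = transactions.stream().filter(t -> t.containsAll(lhs)).count();
        long bothCount = transactions.stream().filter(t -> t.containsAll(both)).count();

        double support    = (double) bothCount / transactions.size(); // 1/4 here
        double confidence = (double) bothCount / lhsCount;            // 1/2 here
        System.out.println("support=" + support + " confidence=" + confidence);
    }
}
```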
92. Data Mining Methods (2)
- Classification
- Determine the Class or Category of an Object based on its Properties
- Example
- Classify Companies based on the Final Sales Results in the Past Quarter
- Clustering
- Organize a Set of Multi-Dimensional Data Objects into Groups so as to Minimize Inter-Group Similarity and Maximize Intra-Group Similarity
- Example
- Group Crime Locations to Find Distribution Patterns
93. Classification
- Two Stages
- Learning Stage: Construction of a Classification Function or Model
- Classification Stage: Prediction of the Classes of Objects Using the Function or Model
- Tools for Classification
- Decision Tree
- Bayesian Network
- Neural Network
- Regression
- Problem
- Given a Set of Objects whose Classes are Known (the Training Set), Derive a Classification Model which can Correctly Classify Future Objects (a decision-tree-style sketch follows)
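A minimal hand-built decision-tree-style classifier in the spirit of the play/don't-play example on the next slide: the learned "model" is just a pair of nested attribute tests. The attributes and thresholds are illustrative assumptions, not the slide's actual training data.

```java
/** Sketch: a tiny decision-tree-like model, hard-coded instead of learned. */
public class PlayClassifier {

    /** Classification stage: apply the learned tests to a new object. */
    static String classify(String outlook, boolean windy, int humidity) {
        if (outlook.equals("sunny")) {
            return humidity > 75 ? "Don't Play" : "Play";  // humidity test under sunny
        }
        if (outlook.equals("rain")) {
            return windy ? "Don't Play" : "Play";          // wind test under rain
        }
        return "Play";                                      // overcast branch
    }

    public static void main(String[] args) {
        System.out.println(classify("sunny", false, 80));   // -> Don't Play
        System.out.println(classify("overcast", true, 90)); // -> Play
    }
}
```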
94. An Example
- Attributes
- Class Attribute: Play / Don't Play the Game
- Training Set
- Values that Set the Condition for the Classification
- What are the Patterns Below?
95. Data Mining Methods (3)
- Summarization
- Characterization (Summarization) of the General Features of Objects in the Target Class
- Example
- Characterize People's Buying Patterns on the Weekend
- Potential Impact on Sale Items When Sales Start
- Department Stores with Bonus Coupons
- Discrimination
- Comparison of the General Features of Objects Between a Target Class and a Contrasting Class
- Example
- Comparing Students in Engineering and in Art
- Attempt to Arrive at Commonalities/Differences
96. Summarization Technique
- Attribute-Oriented Induction
- Generalization using a Concept Hierarchy (Taxonomy)
(Figure: a base table (barcode, category, brand, content, size) with tuples such as (14998, milk, Dairyland, skim, 2L) and (12998, mechanical, MotorCraft, valve 23a, 12in) is generalized through a concept hierarchy (food → milk, bread; milk → skim, 2%; bread → white, whole wheat; brands such as Lucerne, Dairyland, Wonder, Safeway) into a summary table (category, content, count) with rows such as (milk, skim, 280) and (milk, 2%, 98))
97. Why is Data Mining Popular?
- Technology Push
- Technology for Collecting Large Quantities of Data
- Bar Codes, Scanners, Satellites, Cameras
- Technology for Storing Large Collections of Data
- Databases, Data Warehouses
- A Variety of Data Repositories, such as Virtual Worlds, Digital Media, and the World Wide Web
- Corporations want to Improve Direct Marketing and Promotions - Driving Technology Advances
- Targeted Marketing by Age, Region, Income, etc.
- Exploiting User Preferences/Customized Shopping
98. Requirements & Challenges in Data Mining
- Security and Social
- What Information is Available to Mine?
- Preferences via Store Cards/Web Purchases
- What is Your Comfort Level with these Trends?
- User Interfaces and Visualization
- What Tools Must be Provided for End Users of Data Mining Systems?
- How are Results for Multi-Dimensional Data Displayed?
- Performance Guarantees
- Range from Real-Time for Some Queries to Long-Term for Other Queries
- Data Sources of Complex Data Types or Unstructured Data
- Ability to Format, Clean, and Load Data Sets
99. Data Mining Visualization
- Leverage Improving 3D Graphics and Increased PC Processing Power for Displaying Results
- Significant Research in Visualization w.r.t. Displaying Multi-Dimensional Data
100. Successful Data Mining Applications
- Business Data Analysis and Decision Support
- Marketing, Customer Profiling, Market Analysis and Management, Risk Analysis and Management
- Fraud Detection
- Detecting Telephone Fraud, Automotive and Health Insurance Fraud, Credit-Card Fraud, Suspicious Money Transactions (Money Laundering)
- Text Mining
- Message Filtering (Email, Newsgroups, Etc.)
- Newspaper Article Analysis
- Sports
- IBM Advanced Scout Analyzed NBA Game Statistics (Shots Blocked, Assists, and Fouls) to Gain a Competitive Advantage
101. Select Data Mining Products
102. Concluding Remarks
- Four-Fold Objective
- Distributed Database Processing
- Web-Based Architectures for DB Interoperability
- Data Warehouses
- Data Mining of Vast Information Repositories
- All of These are Tightly Related
- DDBMS can Improve the Performance of Mining Repositories as Backend Database Processors
- Web-Based Architectures Provide the Access Means for DDBMS or Mining
- Warehouses are the Infrastructure that Facilitates Mining
- Related Directions: Clinical Data Repositories, Geographic Information Systems, Deductive DBMS, Multi-Media DBMS, Mobile DBMS, Embedded/Real-Time DBMS, etc.