Title: Distributed Database Design
1Distributed Database Design Semantic Data
Control
- Univ.-Prof. Dr. Peter Brezany
- Institut für Scientific Computing
- Universität Wien
- Tel. 4277 39425
- Sprechstunde Di, 13.00-14.00
- LV-Portal www.par.univie.ac.at/brezany/teach/gck
fk/06ss/300658.html
2Distributed Database Design
3A General Framework Considered in the Distributed
Database Design Problem
- The organization of distributed systems can be
investigated along 3 dimensions - Level of Sharing
- (a) no sharing each application and its
data execute at one site, and there - is no communication with any other
program or access to any data file - at other sites (not very common today)
(b) data sharing all the programs are
replicated at all the sites, but data files
are not (c) data-plus-program sharing
a program at a given site can request a
service from another program at a se-
cond site, which, in turn, may have to
access a data file located at a 3. site.
4A General Framework Considered in the Distributed
Database Design Problem (2)
2. Access patterns (along this dimension, the
relationship between distributed database
design and query processing is established)
(a) static (they do not change over time) the
design is easier for static
environments unfortunately, it is difficult to
find many real-life applications that
would be classified as static. (b) dynamic
3. Level of knowledge about the access pattern
(AP) behavior (a) no knowledge (a theoretical
possibility)
it is very difficult to design a distr.
database that can effectively cope
with this situation (b) complete knowledge
AP can be re- sonably (almost precicely)
predicted (c) partial information there
are deviations from the
predictions
5Alternative Design Strategies
- Top-Down approach
- Bottom-Up approach
- In most database designs, the two approaches
may need to be applied to complement one another. - It is a joint function of the database,
enterprise, and application system administrators
(or of the administrator performing all three
roles)
6Top-Down Design Process
7Bottom-Up Design Process
- Top-down design is a suitable approach when a
database system is being designed from scratch. - Commonly, a number of databases already exist,
and the design task involves integrating them
into one database. ? the bottom-up approach is
suited for this type of environment. - The starting point is individual local conceptual
schemas. - The process consists of integrating local schemas
into the global conceptual schemas. - This type of environment exists primarily in the
context of heterogeneous databases.
8Distribution Design Issues - Fragmentation
- Why fragment at all?
- How should we fragment?
- How much should we fragment?
- Is there any way to test the correctness of
decomposition? - How should we allocate?
- What is necessary information for fragmentation
and allocation?
9Reasons for Fragmentation
- The important issue is the appropriate unit of
distribution a relation is not a suitable unit
why? - Application views are usually subsets of
relations, therefore the locality of accesses of
applications is defined not on entire relations
but on their subsets ? it is only natural to
consider subsets of relations as distribution
units. - If the applications that have views defined on a
given relation reside at different sites, two
alternatives can be followed, with the entire
relation being the unit of distribution - The relation is not replicated ? this results
into in an unnecessarily high volume of remote
data accesses. - The relation is replicated at all or some of the
sites where the applications reside ? problems
in executing updates and may not be desirable if
storage is limited. - The decomposition of a relation into fragments,
each being treated as a unit, permits a large
number of transactions to execute concurrently
and intraquery concurrency is enabled. - Disadvantages of fragmentation
- Applications whose views are defined on more than
one fragment may suffer performance degradation. - Integrity checking (semantic data control)
Attributes participating in a dependency may be
decomposed into different fragments which might
be allocated to different sites.
10Fragmentation AlternativesHorizontal and
Vertical Fragmentation
Example Database
11Example of Horizontal Fragmentation
12Example of Vertical Fragmentation
Remark The fragmentation may, of course, be
nested. If the nestings are of
different types, one gets hybrid fragmentation.
13Hybrid (Mixed) Fragmentation
Relation A
Fragment H1
Fragment H2V2
Fragment H2V1H1
Fragment H2V1H2
Fragment H2V1H3
14Correctness Rules of Fragmentation
- Completness. If a relation R is decomposed into
fragments R1, R2, ..., Rn, each data item that
can be found in R can also be found in one or
more of Ris. - Reconstruction. If a relation R is decomposed
into a set fragments FR R1, R2, ..., Rn, it
should be possible to define a relational
operator ? such that - R ?Ri, ?Ri ? FR
- The operator ? will be different for for the
different forms of fragmentation. The
reconstructability of the relation from its
fragments ensures that constraints defined on the
data in the form of dependencies are preserved. - Disjointness. If a relation R is horizontally
decomposed into fragments R1, R2, ..., Rn and
data item di is in Rj, it is not in any other
fragment Rk (k ? j).If relation R is vertically
decomposed, its primary key attributes are
typically repeated in all its fragments ?
disjointness is defined only on the nonprimary
key attributes of a relation.
15Allokation
- Sei F F1, ..., Fn eine Menge von Fragmenten,
S S1, ..., Sm ein Netzwerk gegeben durch die
Menge seiner Sites, und Q Q1, ..., Qp die
Menge der relevanten Anwendungen. - Allokationsproblem
- Was ist die optimale Zuordnung von F zu S bzgl.
Q ? - Optimaltätskriterium
- Minimalität der Kosten gegeben durch die
Speicherkosten der Fi an den Sites Sj, der
Anfragekosten für Fi an Site Sj, der
Änderungskosten der Fi an allen Sites an den sie
gespeichert sind, und die Kosten der
Datenkommunikation. - Performanz im Sinne von Antwortzeiten oder
Systemdurchsatz.
16Sites
R1
R1,1
S1
R2
R2,1
R
R1,2
R3
R4
Globale Relation
S2
R2,2
Fragmente
Fragmente und ihre Allokation
Allokation an den Sites
R3,3
S3
R4,3
17Zusätzliche Beispiele
Beispiel 1
horizontale Fragmentierung
18Beispiel 2
abgeleitete horizontale Fragmentierung
19Beispiel 3
vertikale Fragmentierung
20Semantic Data Control
21Main Objectives
- View management
- Security control
- Semantic integrity control.
- The above items can be defined by rules that
the system automatically enforces. The violation
of some such rule by a user program (a set of
database operations) generally implies the
rejection of the effects of that program. - The rules are defined by the DB administration.
- The cost of enforcing semantic data control,
which is high in terms of resource utilization in
a centralized DBMS, can be prohibitive in a
distributed environment. - The rules must be stored in a catalog (directory)
? efficient management in a distributed
environment is important
22View Management
- In a rel. DB system a view is a virtual relation,
defined as the result of a query on base
relations (or real relations), but not
materialized like a base relation, which is
stored in the DB. - An external schema can be defined as a set of
views and/or base relations. - Besides their use in external schemas, views are
useful for ensuring data security in a simple way
? if users may only access the database through
views, they cannot see or manipulate the hidden
data, which are therefore secure. - In a distributed DBMS, a view can be derived from
distributed relations, and the access to a view
requires the execution of the distributed query
corresponding to the view definition. - An important issue in a distributed DBMS is to
make view materialization efficient.
23Views in Centralized DBMSs
- Example 1
- The view of system analysts (SYSAN) derived from
relation EMP (ENO,ENAME,TITLE), can be defined by
the following SQL query - CREATE VIEW SYSAN(ENO, ENAME)
- AS SELECT ENO, ENAME
- FROM EMP
- WHERE TITLE Syst. Anal.
- The single effect is the storage of the view
defintion in the catalog. No other information
needs not be recorded. Therefore, the result to
the query defining the view (i.e., a relation
having the attributes ENO and ENAME for the
system analysts as shown in the figure below) is
not produced. However, the view SYSAN can be
manipulated as a base relation.
ENO ENAME E2 M.Smith E5 B. Casey E8 J.Jones
24Views in Centralized DBMSs (cont.)
- Example 2
- The query Find the names of all the system
analysts with their - project number and responsibility(ies)
- involving the view SYSAN and relation
ASG(ENO,PNO,RESP,DUR) can be expressed as - SELECT ENAME, PNO, RESP
- FROM SYSAN, ASG
- WHERE SYSAN.ENO ASG.ENO
- Mapping a query expresed on views into a query
expressed on base relations can be done by query
modification at compile time. - The preceding can be modified to
- SELECT ENAME, PNO, RESP
- FROM EMP, ASG
- WHERE EMP.ENO ASG.ENO
- AND TITLE Syst. Anal.
ENAME PNO RESP M.Smith P1 Analyst M.Smith
P2 Analyst B.Casey P3 Manager J.Jones
P4 Manager
25Views in Distributed DBMSs
- The definition of a view is similar as in
centralized systems distinction, a view may be
derived from fragmented relations stored at
different sites. - View definitions can be centralized at one site,
partially duplicate, or fully duplicated. - Processing a query expressed on views
distributed query processor. - Views derived from distributed relations may be
costly to evaluate. - Since in a given organization it is likely that
many users access the same views, some proposals
have been made to optimize view derivation. - An alternative solution is to avoid view
derivation by maintaining actual versions of the
views snapshots. A snapshot represents a
particular state of the database and is therefore
static, meaning that it does not reflect updates
to base relations.
26Data Security
- Data security includes 2 aspects
- Data protection. It is required to prevent
unauthorized users from understanding the
physical content of data. - Authorization control. It must guarantee that
only authorized users perform operations they are
allowed to perform on DB. - Checking whether a given triple (user, operation,
object) can be allowed to proceed. - The introduction of a user (user name,
password). The user name uniquely identifies the
users of that name in the system, while the
password, known only to the users of that name,
authenticates the users. - Distributed authorization control
- Additional complexity objects and subjects are
distributed - remote user authentication
- management of distributed authorization rules
- handling of views and of user groups (DB
administration is simplified)
27Semantic Integrity Control
- How to guarantee database consistency?
- A DB state is said to be consistent if the DB
satisfies a set of constraints, called semantic
integrity constraints. - Mantaining a consistent DB requires various
mechanisms concurrency control, reliability,
protection, and semantic integrity control (SIC). - SIC ensures DB consistency by rejecting update
programs which lead to inconsistent DB states, or
by activating specific actions on the DB state,
which compensate for the effects of the update
programs. - SI constraints are rules that represent the
knowledge about the properties of an application. - Structural constraints (e.g. unique key
constraints in the rel. model) - Behavioral constraints regulate the application
behavior.
28Centralized Semantic Integrity Control
- The language for expression and manipulating
integrity assertions. Examples - Employee number in relation EMP cannot be null
ENO NOT NULL in EMP - The pair (ENO,PNO) is the unique key in relation
ASG (ENO,PNO) UNIQUE IN ASG - Enforcement mechanism that performs specific
actions to enforce DB integrity and updates.
Examples - The query for increasing the budget of the
CAD/CAM project by 10, which would be specified
as - UPDATE PROJ SET BUDGET BUDGET 1.1
- WHERE PNAME CAD/CAM
- will be transformed into the following query
in order to enforce the domain constraint - CHECK ON PROJ (BUDGET gt 500000 AND
BUDGET lt 1000000) - UPDATE PROJ SET BUDGET BUDGET 1.1
- WHERE PNAME CAD/CAM
AND NEW.BUDGET gt 500000 - AND NEW.BUDGET lt
1000000)
29Distributed Semantic Integrity Control
- Individual assertions. They refer only to tuples
to be updated independently of the rest of the
database. Example CHECK ON PROJ (BUDGET gt
500000 AND BUDGET lt 1000000). The assertion
definition is sent to all other sites that
contain fragments of the relation involved in the
assertion. The assertion must be compatible with
the relation data at each site. Compatibility can
be checked at two levels predicate and data.
First, predicate compatibility is verified by
comparing the assertion predicate with the
fragment predicate. An asertion C is not
compatible with a fragment predicate p if C is
true implies that p is false, and is
compatible with p otherwise. If noncompatibility
is found at one of the sites, the assertion
definition is globally rejected because tuples of
that fragment do not satisfy the integration
constraints. Second, if predicate compatibility
has been found, the assertion is tested against
the instance of the fragment. If it is not
satisfied by that instance, the assertion is also
globally rejected. If compatibility is found, the
assertion is stored at each site.
30Distributed Individual Assertions Example
Consider relation EMP, horizontally fragmented
across three sites using the predicates p1
0 lt ENO lt E3 p2 E3 lt ENO lt E6 p3
ENO gt E6 and the domain assertion C ENO lt
E4. Assertion is compatible with p1 (if C is
true, p1 is true) and p2 (if C is true, p2 is
not necessarily false), but is not with p3 (if C
is true, then p3 is false). Therefore, assertion
C should be globally rejected because the tuples
at site 3 cannot satisfy C, and thus relation
EMP does not satisfy C.
31Set-Oriented Assertions
- They include single-relation multivariable
constraints such as functional dependency
(Example 1) and multirelation multi-variable
constraints such as foreign key constraints
(Example 2) - Example 1 The employee number functionally
determines the employee name. - ENO IN EMP DETERMINES ENAME
- Example 2 The project number PNO in relation ASG
is a foreign key matching the primary key PNO of
relation PROJ. In other words, a project referred
to in relation ASG must exist in relation PROJ. - PNO IN ASG REFERENCES PNO IN PROJ
32Assertions Involving Aggregates
- Example The total duration for all employees in
the CAD/CAM project is less than 100. - CHECK ON gASG, jPROJ (SUM(g.DUR WHERE
g.PNOj.PNO) lt 100 - IF j.PNAMECAD/CAM)
- These assertions are among the most costly to
test because they require the calculations of the
aggregate functions.