Title: Fundamentals of Relational Database Design and Database Planning
1Fundamentals of Relational Database Design
andDatabase Planning
- J.Trumbo
- Fermilab
- CSS-DSG
2Outline
- Definitions
- Selecting a dbms
- Selecting an application layer
- Relational Design
- Planning
- A very few words about Replication
- Space
3DefinitionsWhat is a database?
- A database is the implementation of freeware or
commercial software that provides a means to
organize and retrieve data. The database is the
set of physical files in which all the objects
and database metadata are stored. These files can
usually be seen at the operating system level.
This talk will focus on the organize aspect of
data storage and retrieval. - Commercial vendors include MicroSoft and Oracle.
- Freeware products include mysql and postgres.
- For this discussion, all points/issues apply to
both commercial and freeware products.
4DefinitionsInstance
- A database instance, or an instance is made up
of the background processes needed by the
database software. - These processes usually include a process
monitor, session monitor, lock monitor, etc.
They will vary from database vendor to database
vendor.
5DefinitionsWhat is a schema?
- A SCHEMA IS NOT A DATABASE, AND A DATABASE IS NOT
A SCHEMA. - A database instance controls 0 or more databases.
- A database contains 0 or more database
application schemas. - A database application schema is the set of
database objects that apply to a specific
application. These objects are relational in
nature, and are related to each other, within a
database to serve a specific functionality. For
example payroll, purchasing, calibration,
trigger, etc. A database application schema not a
database. Usually several schemas coexist in a
database. - A database application is the code base to
manipulate and retrieve the data stored in the
database application schema.
6Definitions Cont.Primary Definitions
- Table, a set of columns that contain data. In the
old days, a table was called a file. - Row, a set of columns from a table reflecting a
record. - Index, an object that allows for fast retrieval
of table rows. Every primary key and foreign key
should have an index for retrieval speed. - Primary key, often designated pk, is 1 or more
columns in a table that makes a record unique.
7Definitions Cont.Primary Definitions
- Foreign key, often designated fk, is a common
column common between 2 tables that define the
relationship between those 2 tables. - Foreign keys are either mandatory or optional.
Mandatory forces a child to have a parent by
creating a not null column at the child. Optional
allows a child to exist without a parent,
allowing a nullable column at the child table
(not a common circumstance).
8Definitions Cont.Primary Definitions
- Entity Relationship Diagram or ER is a pictorial
representation of the application schema.
9Er Example
10Definitions Cont.Primary Definitions
- Constraints are rules residing in the databases
data dictionary governing relationships and
dictating the ways records are manipulated, what
is a legal move vs. what is an illegal move.
These are of the utmost importance for a secure
and consistent set of data.
11Definitions Cont.Primary Definitions
- Data Manipulation Language or DML, sql statements
that insert, update or delete database in a
database. - Data Definition Language or DDL, sql used to
create and modify database objects used in an
application schema.
12Definitions Cont.Primary Definitions
- A transaction is a logical unit of work that
contains one or more SQL statements. A
transaction is an atomic unit. The effects of all
the SQL statements in a transaction can be either
all committed (applied to the database) or all
rolled back (undone from the database), insuring
data consistency.
13Definitions Cont.Primary Definitions
- A view is a selective presentation of the
structure of, and data in, one or more tables (or
other views). A view is a virtual table, having
predefined columns and joins to one or more
tables, reflecting a specific facet of
information.
14Definitions Cont.Primary Definitions
- Database triggers are PL/SQL, Java, or C
procedures that run implicitly whenever a table
or view is modified or when some user actions or
database system actions occur. Database triggers
can be used in a variety of ways for managing
your database. For example, they can automate
data generation, audit data modifications,
enforce complex integrity constraints, and
customize complex security authorizations.
Trigger methodology differs between databases.
15Definitions Cont.Primary Definitions
- Replication is the process of copying and
maintaining database objects, such as tables, in
multiple databases that make up a distributed
database system. - Backups are copies of the database data in a
format specific to the database. Backups are
used to recover one or more files that have been
physically damaged as the result of a disk
failure. Media recovery requires the restoration
of the damaged files from the most recent
operating system backup of a database. It is of
the utmost importance to perform regularly
scheduled backups.
16Definitions Cont.
- Mission Critical Applications
- An application is defined as mission critical,
imho, if - 1. there are legal implications or financial
loss to the institution if the data is lost or
unavailable. - 2. there are safety issues if the data is lost
or unavailable. - 3. no data loss can be tolerated.
- 4. uptime must be maximized (98).
17Definitions Cont.
- large or very large or a lot
- Seems odd, but large is a hard definition to
determine. Vldb is an acronym for very large
databases. Its definition varies depending on
the database software one selects. Very large
normally indicates data that is reaching the
limits of capacity for the database software, or
data that needs extraordinary measures need to be
taken for operations such as backup, recovery,
storage, etc.
18Definitions Cont.
- Commercial databases do not a have a practical
limit to the size of the load. Issues will be
backup strategies for large databases. - Freeware does limit the size of the databases,
and the number of users. Documentation on these
issues vary widely from the freeware sites to the
user sites. Mysql supposedly can support 8T and
100 users. However, you will find arguments on
the users lists that these numbers cannot be met.
19Selecting a DBMS
- Many options, many decisions, planning, costs,
criticality. - For lots of good information, please refer to the
urls on the last slides. Many examples of people
choosing product.
20Selecting a DBMSHow do I Choose?
- Which database product is appropriate for my
application? You must make a requirements
assessment. - Does you database need 24x7 availability?
- Is your database mission critical, and no data
loss can be tolerated? - Is your database large? (backup recovery methods)
- What data types do I need? (binary, large
objects?) - Do I need replication? What level of replication
is required? Read only? Read/Write? Read/Write
is very expensive, so can I justify it?
21Selecting a DBMSHow do I Choose? Cont.
- If your answer to any of the above is yes, I
would strongly suggest purchasing and using a
commercial database with support. Support
includes - 24x7 assistance with technical issues
- Patches for bugs and security
- The ability to report bugs, and get them resolved
in a timely manner. - Priority for production issues
- Upgrades/new releases
- Assistance with and use of proven backup/recovery
methods
22Selecting a DBMSThe Freeware Choice
- Freeware is an alternative for applications.
However, be fore warned, support for these
databases is done via email to a ad hoc support
group. The level of support via these groups may
vary over the life of your database. Be
prepared. Also expect less functionality than
any commercial product. See http//www-css.fnal.go
v/dsg/external/freeware/
23Selecting a DBMSThe Freeware Choice
- Freeware is free.
- Freeware is open source.
- Freeware functionality is improving.
- Freeware is good for smaller non-mission critical
applications.
24Selecting an Application Layer
- Again, planning takes center stage. In the end
you want stability and dependability. - How many users need access?
- What will the security requirements be?
- Are there software licensing issues that need
consideration? - Is platform portability a requirement?
- Two tier or three tier architecture?
25Selecting an Application Layer
- Direct access to the database layer? (probably
should be avoided) - Are you replicating? How? Where? With what?
- There are no utilities that will port data from 1
database to another (i.e., postgres to mysql). if
database portability is a requirement, an
independent code must be written to satisfy this
requirement.
26Selecting an Application Layer Cont.
- Application maintenance issues
- People availability, working with users as a
team, talent, and turnover? (historically a huge
issue) - A known or common language?
- Freeware? Bug fixes, patchesare they important
and timely? - Documentation? Set standards, procedures, code
reviews making sure the documentation exists and
is clear. - Is the application flexible enough to easily
accommodate business rule changes that mandate
modifications? - The availability of an ER diagram at this stage
is invaluable. We consider it a must have. - There are no utilities to port data from 1 type
of db to another. This lack of portability means
a method to move data between databases - must be written independently.
-
27Selecting an Application Layer
- Misc. application definitions
- This presentation is not an application
presentation, but I will mention a few terms you
may hear. - Sql the query language for relational databases.
A must learn. - ODBC, open database connectivity. The software
that allows a database to talk to an application. - JDBC, java database connectivity.
28Relational Design
- The design of the application schema will
determine the usability and query ability of the
application. Done incorrectly, the application
and users will suffer until someone else is
forced to rewrite it.
29Relational DesignThe Setup
- The database group has a standard 3 tier
infrastructure for developing and deploying
production databases and applications. This
infrastructure provides 3 database instances,
development, integration and production. This
infrastructure is applicable to any application
schema, mission critical or not. It is designed
to insure development, testing, feedback,
signoff, and an protected production environment. - Each of these instances contain 1 or more
applications.
30Relational DesignThe Setup
- The 3 instances are used as follows
- Development instance. Developers playground.
Small in size compared to production. Much of the
data is invented and input by the developers.
Usually there is not enough disk space to ever
refresh with production data.
31Relational Design Cont.The Setup
- 2. The integration instance is used for moving
what is thought to be complete functionality to
a pre production implementation. Power users and
developers work in concert in integration to make
sure the specs were followed. The users should
use integration as their sign off area. Cuts from
dev to int are frequent and common to maintain
the newest releases in int for user testing.
32Relational Design Cont.The Setup
- 3. The production instance, real data. Needs to
be kept pure. NO testing allowed. Very few
logons. The optimal setup of a production
database server machine has 3 operating system
logons, root, the database logon (ie oracle), and
a monitoring tool. In a critical 24x7 supported
database, developers, development tools, web
servers, log files, all should be kept off the
production database server.
33Relational Design Cont.The Setup
- Lets talk about mission critical 24x7 a bit.
- To optimize a mission critical 24/7 database, the
database server machine should be dedicated to
running the database, nothing else. - All software products need maintenance and
downtime. Resist putting software products on the
db server machine so that their maintenance does
not inhibit the running of the database.
Further, if the product breaks, it could inhibit
access to the database for a long period.
Example, a logging application, monitoring users
on the db goes wild, fills all available space
and halts the database. If this logging app.
were not on the dbserver machine, the db would be
unaffected by the malfunction.
34Relational Design Cont.The Setup
- 3. All database applications and database
software require modifications. Most times these
modification require down time because the schema
or data modifications need to lock entire tables
exclusively. If you are sharing your database
instance with other many other applications, and
1 of those applications needs the database for an
upgrade, all apps may have to take the down time.
Avoid this by insuring your 24/7 database
application is segregated from all other software
that is not absolutely needed. In that way you
insure any down times are specific to your cause.
35Our 1st relational example
A cpu can house 1 or more databases
schema applications in d0ofprd1 (sam, runs,
calib)
Databases on d0ora2 (d0ofprd1, d0ofint1)
CPU (d0ora2)
schema applications in d0ofint1 (sam, runs,
calib)
An database can accommodate 1 or more instances
An instance may contain 1 or more application sche
mas
36What is a schema?
- One implements a schema by running scripts. These
scripts can be run against multiple servers and
should be archived.
37Relational DesignGetting Started
- Using your design tool, you will begin by
relating objects that will eventually become
tables. All the other schema objects will fall
out of this design. - You will spend LOADS of time in your design tool,
honing, redoing, reacting to modifications, etc. - The end users and the designers need to be
working almost at the same desk for this process.
If the end user is the designer, the end user
should involve additional users to insure an
unbiased and general design. - It is highly suggested that the design be kept up
to date for future documentation and maintainers.
- Tables are related, most frequently in a 0 to
many relationship. Example, 1 run will result in
0 or more events. Analyzing and defining these
relationships results in an application schema.
38What will a good schema design buy you?
- I am afraid the 80 planning 20 implementation
rule applies. Gather requirements. - Discovery of data that needs to be gathered.
- Fast query results
- Limited application code maintenance
- Data flexibility
- Less painful turnover of application to new
maintainers. - Fewer long term maintenance issues.
39Relational DesignLets get started
- Write a requirements document.
- You will not be able to anticipate all
requirements, but a document will be a start. A
well designed schema naturally allows for
additional functionality. - Who are the users? What is their mission?
- Identify objects that need to be stored/tracked.
- Think about how objects relate to each other.
- Do not be afraid to argue/debate the
relationships with others.
40Relational DesignSo how do you get there?
- Design tools are available, however, they do not
think for you. They will give you a clue that you
are doing something stupid, but it wont stop
you. It is highly recommended you use a design
tool. - A picture says 1000 words. Create ER, entity
relationship, diagrams. - Get a commitment from the developer(s) to see the
application through to implementation. We have
seen several applications redone multiple times.
A string of developers tried, left the project,
and left a mess. A new developer started from
scratch because there was no documentation or
design.
41Relational DesignHow do I get there?
- Adhere to the recommendations of your database
vendor for setup and architecture. - Dont be afraid to ask for help or to see other
examples. - Dont be afraid to pilfer others design work, if
it is good, if it closely fits your requirements,
then use it. - Ask questions, schedule reviews with experts and
users. - Work with your hardware system administrators to
insure you have the hardware you need for the
proposed job to be done.
42Relational DesignCommon Mistakes
- Mistakes we see ALL the time
- Do not design your schema around your favorite
query. A relational design will enable all
queries to be speedy, not only your favorite. - Dont design the schema around your narrow view
of the application. Get other users involved from
the start, ask for input and review.
43Relational DesignCommon Mistakes
- Create a relational structure, not a hierarchical
structure. The ER diagram should not necessarily
resemble a tree or a circle. It is the logical
building of relationships between data.
Relationships flow between subsets of data. The
resulting ER diagrams look is not a standard
by which one can judge the quality of the design.
44Relational DesignCommon Mistakes
- Do not create 1 huge table to hold 99 of the
data. We have seen a table with 1100
columnsunusable, unqueryable, required an entire
application rewrite, took over a year, made 80
tables from the 1 table. - Do not create separate schemas for the same
application or functions within an application. - Use indices and constraints, this is a MUST!
45Relational DesignExamples of Common Mistakes
- Using timestamp as the primary key assumes that
within a second, no other record will be
inserted. Actually this was not the case, and an
insert operation failed. Use database generated
sequences as primary keys and NON-UNIQUE index on
timestamp. - A table with more than 900 columns. Such design
will cause chaining since each record is not
going to fit in one block. One record spanning
many blocks, thus chaining, hence bad
performance.
46Relational DesignExamples of Common Mistakes
- Do not let the application control a generated
sequence. Have seen locking issues, and duplicate
values issues when the application increments the
sequence. Have the database increment/lock/constr
ain the sequence/primary key. That is why the
databases have sequence mechanisms, use them. - Use indices! An Atlas table with 200,000 rows,
halted during a query. Reason? No indices. Added
a primary key index, instantaneous query
response. Indices are not wasted space!
47Relational DesignExamples of Common Mistakes
- USE DATABASE CONSTRAINTS!!!!!!
- Have examples where constraints were not used,
but implemented via the api. Bugs in the api
allowed data to be deleted that should not have
been deleted, and constraints would have
prevented the error. Have also seen apis error
with cannot delete errors. They were trying to
force an invalid delete, luckily the database
constraints saved the data.
48Entity Relationship Diagrams1 to many
49Entity Relationship Diagramsmany to many
50Entity Relationship Diagrams1 to 1
51Relational DesignThe Good
Calibration type might have 3 rows, drift,
pedestal, gain This is a parent table.
Each calibration record will be Defined by drift,
pedestal or gain. In addition to start and end
times. This is a child table.
52Relational DesignThe Bad
You have now created 3 different children, all
reporting the same information, when 1 child
would suffice. Code will have to be written,
tested, and maintained for 4 tables now instead
of 2.
53Relational DesignThe Ugly
Now you have created 3 different applications,
using 6 tables. All of which could be managed
with 2 tables. Extra code, extra testing, extra
maintenance.
54Relational DesignThe Goodlets recap
AHHH, back to normal, or normalization as we
refer to it.
55Relational DesignWhat to expect from a design
tool
- An entity relationship diagram
- The ability to create the ddl (data definition
language) needed - The ability to project disk space usage
- Ddl in a format to allow you to enter the code
into a code library (cvs), and that will allow
you to run against your database
56Relational Design Why bother? Experience from
RunII
- TO SAVE TIME AND PRECIOUS PEOPLE RESOURCES!
- Personnel consistency does not exist.
Application developers come and go regularly. The
documentation that a design product provides will
the next developer an immediate understanding of
the application in picture format. - Application sharing is enhanced when others can
look at your design and determine whether the
application is reusable in their environment. Sam
is a good example of an application that 3
experiments are now using.
57Relational DesignWhy bother? Cont.
- When an application is under construction, the ER
diagram goes to every application meeting, and
quite possibly the wallet of the application
leader. It is the pictorial answer to many
issues. - Planning for disk space has been an issue, the
designer tool should assist with this task.
58PlanningOverall
- What do I need to plan for?
- People, hardware, software, obsolescence,
maintenance, emergencies. - How far out do I need to plan?
- Initially 2-4 years.
- How often do I need to review the plans?
- Annually.
- What if my plan fails or looks undoable?
- Nip it in the bud, be proactive, come up with
options.
59PlanningOverall
- Disk space requirements. My experience is all the
wags, (wild guesses) fall short of what is
needed. It is hard to predict the number of rows
in a table. It would be easier if we knew the
amount and results of the science ahead of time!
Remember, 10x what you think the data will take. - Hardware requirements. Experience tells us that
the database machine should serve 1 master (if it
is a large database or mission critical), the
database, nothing else. Ideally there will be
root, a database monitor user and a database
user, oracle for example. No apache, no log file
areas, no applications, etc.
60PlanningOverall
- Growth and obsolesce. Plan for 3-4 years before
needing to replace hardware. Hardware and
software become obsolete. New/upgraded software
gives addition functionality that you will
want/need. - Maintenance. Do you change the oil in your car?
Plan on 1 morning per month downtime for caring
for the hardware and software. Security patches
could mandate additional stoppages. I cannot
stress how important this is. Fire walling will
not protect you from bugs and obsolescence. If
the downtime is not needed, it will not be taken.
Planning maintenance time is as important as
planning to buy disks.
61PlanningUser Requirements
- Will user requirements influence your hardware
software decisions? - Do you need replication?
- What architecture is your api going to be?
- How many users will be loading the database and
hardware?
62PlanningMaintenance
- Database/Operating system software need upgrades.
One always hopes one can get on a stable version
of something and not upgrade. That is a fallacy.
Major version upgrades provide needed and new
functionality. Bug patches and security patches
are a never ending fact of life.
63PlanningBackup and Recovery
- Backup and recovery procedures of vldb (very
large databases) are difficult at best. Vldb is
normally defined as mulitple Gig or tera byte
databases. This is probably the most sensitive
area when choosing a freeware database. - Hardware plays a part here as well. Insure when
planning for hardware there is plan for backup
and recovery. Disk and tape may be needed.
64Planning Good Practices with a Hammer
- Make a standards document and enforce its use.
When dbas and developers are always on the same
page, life is easier for both. Expectations are
clear and defined. Anger and disappointment are
lessened. - System as well as database standards need to be
followed and enforced.
65PlanningFailover
- Yikes, we are down!
- Everyone always wants 24x7 scheduled uptime.
Until they see the cost. - Make anyone who insists on real 100 uptime to
justify it (and pay for it?). 98-99 uptime can
be realized at a much lower cost. - Uptime requirements will influence, possibly
dictate, database choices, hardware choices, fte
requirements.
66PlanningFailover
- The cheapest method of addressing a failure is
proactive planning. - Make sure your database and database software are
backed up. Unless you are using a commercial
database with roll forward recovery, assume you
will lose all dml since your last backup if you
need to recover. This should dictate your backup
schedule. - Do not forget tape backups as a catastrophic
recovery method. - Practice recovery on your integration and
development databases. Practice different
scenarios, delete a datafile, delete the entire
database.
67Replication
- Replication is the process of copying and
maintaining database objects in multiple
databases that make up a distributed database
system. Replication can improve the performance
and protect the availability of applications
because alternate data access options exist.
68Replication Cont.
- Oracle Supports 3 types of replication READ ONLY
Snapshots (Materialized views), Advanced
Replication and streams based replication. - Streams allows ddl modifications made to the
master automatically. - Streams can be configured in uni-directional (
Single Source and one or more than targets) or
master to master where updates can happen to any
participant database. - Advanced replication also supports master to
master . But streams based replication is
recommended. - READ ONLY Snapshots replication from a Sun box to
a Sun Linux box(s) is being done in CDF. When
a replica is under maintenance there is failover
to another replica. The replicas are up and
running in read only mode if the master is down
for maintenance.
69Replication cont.
- Oracle master to master replication allows for
updates on both the master and replica sides. - Master to master is a complex and a high
maintenance replication. It seems to be the 1st
option the unwitting opt for. Both Cern and Fermi
dbas have requested firm justification before
considering this type of replication request. - Every link in the multi master would be required
to be a fully staffed, as downtime will be
critical.
70Replication cont.
- Disk Space for Archives. If receiving site is
down for extended period of time, then source db
should be tuned enough to hold the archives logs,
otherwise, one has to reinstantiate the
replication. Reasonable downtime for target
depends upon archive area being generated on
source. Space, space and more space. - Conflict Resolution In Master to Master, conflict
resolution may be challenge. Rules should be well
defined to resolve the data conflicts. - Design of Data Model if Primary Keys are
populated by sequences , there is very much
chance of overlapping the sequences and will
cause integrity constraints. Data Model should be
designed very carefully. - DB Support In Master to Master Replication, all
master sites should be in 247 support mode.
Otherwise , sync up of data will be challenge or
one may lead to reinstantiation of replication.
Reinstantiation is not unplug and play type of
situation.
71Freeware Replication
- MySQL has replication in the last stable version
(3.23.32, v4.1 is out). It is master-slave
replication using binary log of operations on the
server side. It is possible to build star or
chain type structures. - There is a PostgreSQL replication tool. We have
not tested it yet.
72Lost in Space
- Space is the 1 area consistently under estimated
in every application I have seen. Imho,
consistently, data volume initial estimates were
undersized by a factor of 2 or 3. For example,
RunII events were estimated at 1 billion rows.
This estimate was surpassed Feb. 2004. We will
probably end up with 4-5 billion event rows. That
is a lot of disk space. - Disk hardware becomes unsupported, and obsolete
in what seems to be a blink of an eye.
73Lost In Space cont.
Unexpected?
N Gb
8 x N Gb
- All databases use disk to store data.
Data
Index
Redo
Rollback
Data mirror
Index mirror
Backup
Replication
Good rule of thumb You need 10x the disk
to hold a given amount of data in an RDB.
- Operate in 2 year cycles
- First 2 years storage available on day 1.
- Evaluate growth at end of year 1, begin prep of
next 2 yr.
74Lost in Space, cont.
- You will use as much disk space as you purchase,
and then some. - Database indices will take MINIMALLY at least as
much space as the tables, probably considerably
more. - Give WIDE lead time to purchase disk storage.
New disks are not installed and configured over
night. They require planning, downtime and .
75Additional References
- WARNING some of these may be database specific.
- Intro to database design http//www.cc.gatech.edu/
classes/AY2000/cs4400_spring/cs4400a/ - Intro to Oracle tutorial http//w2.syronex.com/jmr
/edu/db/ - Evolutionary Database Design http//www.martinfowl
er.com/articles/evodb.html mentions 1 dba for
atlas - Sql course http//sqlcourse.com/
-
76Additional References
- Highly recommended reading, db comparatives
http//www-css.fnal.gov/dsg/external/freeware/ - db infrastructure standard, support levels, etc.
for fermi computing http//www-css.fnal.gov/dsg/ex
ternal/oracle_admin/
77Additional References
- Oracle Designer tutorial http//www-css.fnal.gov/
dsg/internal/ora_adm/index.htmdesigner (choose
Oracle Designer tutorial or Oracle Designer Short
Cuts and Lessons Learned) - Btev specific additional information
http//www-css.fnal.gov/dsg/external/BTeV/index.ht
ml