Title: Databases
1Databases
- Practical Informatics Course
- APIII 2008
2What is a database?
- Technically, a database is any collection of
data, usually in some sort of structure, but this
definition includes essentially everything, and
therefore is not very useful - More practically, a database
- is a non-volatile repository for data with some
inherent meaning - is designed, built, and populated for a specific
purpose - stores multiple instances of some type of
information - stores similar details about each instance of
the information - Databases can be as simple as a list of CDs or as
complex as a medical record
3A Sample, Simple Database
- Address Book
- Last Name
- First Name
- Street Address
- City
- State
- Zip Code
- Phone Number
Last Name
First Name
Street
City
State
Zip
Phone
4Database Terminology
Records
The things about which information is stored.
Each record is indepen-dent of each other record.
Last Name
First Name
Sinard
John
Street
310 Cedar Street
City
New Haven
State
Zip
CT
06510
Phone
(203) 785-3891
Fields
Values/Data
The informational elements which are stored for
each record.
The actual information which is stored for each
field.
5Database Conceptualization
- It is convenient to think of simple databases as
tables (more complex databases as groups of
tables), and set aside the details of the data
storage and manipulation - In practice, these details are generally
relegated to a Database Management System (more
later), so it turns out that this abstraction
process is not only theoretical but also the
usual practice
Columns (fields, attributes)
Rows (instances, records, tuples)
Element (data, value)
6Single Table Databases
- Advantages
- Simple to conceptualize
- Many options for implementation, including simply
a spreadsheet - Disadvantages
- Only simple databases can be accommodated
- The real world is more complex than a single
entity with attributes - Repeating Fields
- Attributes which can occur multiple times for a
single instance of an entity - Redundant Fields
- Groups of attributes which have the same values
over multiple instances of an entity - Difficult to expand or modify as needs change
7Shortcomings of Single-Table Databases
Repeating Fields
- Some attributes (fields) may occur multiple times
for a single instance of an entity - A person can have more than one phone number
- How many repeats does one leave room for?
- Must allocate space in each row for the maximum
number of repeats needed for any one row - Are all repetitions equivalent or does the
position of the repeat carry some meaning? - If one repeat is deleted, do the others shift
down or stay where they are? - Problem increases when there are multiple fields
which repeat as a group
8Shortcomings of Single-Table Databases
Redundant Fields
- Values of groups of fields can repeat over
multiple instances (rows) of an entity - e.g, Two people can live at the same address
- Lots of database space can be used storing the
same data over and over again - Modification or incorrect entry of the data in
any one row can lead to inconsistencies in the
data across rows - Real changes to the data require the changes to
be made in multiple rows - e.g, A family moves to a new address
9Complex Database Architecture
- A number of different approaches have been
developed to address the one-to-many problem - Each approach has its own advantages and
disadvantages - The best choice depends on the particular
application - Complex Database Architectures
- Hierarchical
- Network
- Relational
- Object Oriented
10Relational Databases
- All data is stored in tables
- Each table represents an entity and corresponds
to some real-life concept - eg. person, address, phone number
- Each entity has attributes (columns) which
further define that entity - eg. Person first name, last name, date of birth,
gender - Address street, city, state, zip code
- Within a table, each instance (row) is defined
uniquely by the values of its attributes the
order of the rows can be changed/swapped/sorted
without consequence - Data in different tables is related to each other
by common data, column elements which are shared
by two or more tables
11Relational Database Terminology
Entity (Table, Relation)
Attributes (Columns, Fields)
Specimens
Accession
Date
Patient
Clinical History
Instances (Rows, Tuples)
1-1-97
Smith, John
Upper GI Bleed
Ovarian Tumor
1-1-97
Jones, Mary
1-1-97
Jones, Mary
Ovarian Tumor
1-1-97
Black, Frank
Gunshot Wound
Cardinality ( of Rows)
1-1-97
Simpson, Homer
Radiation Exposure
1-1-97
T
aylor, Tim
Hip Fracture
1-1-97
Skywalker
, Luke
s/p
T
raumatic
Amputation
1-1-97
Janeway
, Katherine
Viral Infection
Degree ( of Columns)
Element
12Relational Database Terminology
Entities
Patients
Specimens
Parts
- Each Patient may have 0, 1 or many specimens -
Each Specimen must belong to one and only one
patient
- Each Specimen must have at least 1 part, but
may have many parts - Each Part must belong to
one and only one Specimen
Relationships
13Relational Databases
Primary Key
- The Primary Key for a table is an attribute
(column, field) or combination of attributes for
which every instance (row) of the table has a
unique value - When a combination of attributes is used, it is
referred to as a composite key - Most tables in a relational database will have a
primary key defined - In general, the value of the primary key should
not change once assigned, since it will often be
pointed to by a foreign key in another table - Therefore, data which might change (name, phone
number) makes a bad key - Selection of an appropriate primary key requires
an understanding of the data which is stored
cannot be done by simply looking at the data in
the database - Depending upon the entity, it is often necessary
to create an artificial field such as a
sequential identifier to serve as the primary
key - eg for a person table, one cannot guarantee that
the last name, first name, or even the
combination of both will always be unique for
each person
14Relational Databases
Foreign Key
- A Foreign Key in a table is an attribute
(column, field) which defines a relationship
between two entities (tables) - The related rows are identified by the value in
the foreign key column of one table matching the
value in another column (usually the primary key)
of another table - This allows one-to-one, one-to-many, many-to-one,
and even many-to-many relationships - In a one-to-many relationship, the table with the
many rows contains the foreign key and points
to the primary key in the table with the one
rows - The foreign key can be identified as such to
the database management system within the
definition of the table, or simply supplied
on-the-fly during a query
15Relational Databases
Keys
Primary Key
Foreign Key
One-to-One
Many-to-One
Putting a foreign key in the person table allows
multiple people to be related to a single
address.
Primary Key
Foreign Key
Foreign Key
The Phone_Numbers table has two foreign keys,
allowing phone numbers to be associated with
either a person or an address, or both.
Last name would not work as a primary key because
two people can have the same last name.
Primary Key
Since the phone number is unique, it can serve as
the primary key for this table.
16Entity-Relationship Diagrams
- A tool for designing relational databases
- Does not contain actual data but rather names of
the tables (entities) and columns (attributes) as
well as a symbolic representation of the
relationships between these - Many different formats exist
17Entity-Relationship Diagram
Data View
Entity-Relationship View
18Entity - Attribute - Value Data Model
Attributes
Entities
Values
19Entity - Attribute - Value Data Model
- EAV Model Advantages
- When many of the possible attributes remain
unvalued, takes less space in database - Each instance of the entity can have different
attributes stored - Can accommodate attributes which one did not know
about at the time the database was created - EAV Model Disadvantages
- If most of the instances of an entity have values
for a particular attribute, it is less efficient
to use the EAV model since one must repeatedly
store the entity instance identifiers and the
attribute name - Data retrieval is less efficient
- The attributes for a given entity instance will
be scattered in the table - For aggregate data analysis, first have to find
the desired attributes
20Direct Program Access to Databases
Application Program
Application Program
Database
21Disadvantage of Direct Program Access to Databases
- Any modification to the type of data stored
(e.g., number of fields) requires a complete
rebuild of the entire database - Each application which accesses the data needs to
know the details of the physical storage of the
data - Applications need to be substantially modified if
the structure of the data stored is altered - Applications need to manually build and
maintain any desired indices
22Database Management System
Application Program
Application Program
Program Routine
Program Routine
Program Routine
Program Routine
Database Management System (DBMS)
Database
23Database Management Systems
DBMS
- A DBMS is a type of middle-ware which sits
between the applications and the database - All access to the database must go through the
DBMS - Only the DBMS needs to know the details of the
physical storage of the data - Application routines can treat the database in an
abstract fashion, funneling all requests for
data through the DBMS using a database language - If a common language is used by different DBMSs,
a completely different database system can be
substituted without requiring any modifications
to the application routines
24Client-Server Architecture
Client Computer
Client Computer
Client Computer
Application Program
Application Program
Application Program
Program Routine
Program Routine
Program Routine
Program Routine
Program Routine
Program Routine
DBMS Front End
DBMS Front End
DBMS Front End
Network
DBMS Back End
Database
Database Server
25Multi-user Access to Large Databases
Most large databases need to be accessible to
multiple users at the same time
- The DBMS is split into two components
- The Front-end or Client resides on each
users machine - Program routines submit requests to the DBMS
front-end - The DBMS front-end communicates over the network
with the DBMS back-end
Program
Program
Routine
Routine
Routine
Routine
Routine
Routine
DBMS Front-end
Server
DBMS Back-end
- The Back-end sits on a separate database
server, communicates with the database, and
processes the requests of the front-end(s) - Multi-user access requires a more sophisticated
DBMS
26Database Abstraction
Concerns of the DBMS
Concerns of the Database Designer and the
Application
- What type of data is stored in the database?
- How are the fields in the database related to
each other? - What data needs to be retrieved?
- What data should be indexed, based upon how the
data is routinely requested?
- How and where is the data stored on disk?
- How is free space in the database maintained?
- What changes need to be made to an index to
accommodate a change in the data? - What is the best way to find the data requested?
Physical Data Model
Conceptual Data Model
27Functions of a Full Featured DBMS
- Handle all the details of the physical storage of
the data - Data Types
- Maintain indices and keys
- Support data addition, updating and deleting
- Support changes in conceptual database design /
definition - Client/Server distribution
- Authorization / User Privileges / Security
- Data Integrity
- Referential Integrity
- Data Concurrency
- Data Recovery
- Database Backup
- Transaction Logs
- Data Locking and Deadlocks
- Support for a Query Language
28Preserving Data Integrity in a Database
Referential Integrity
- Making sure that the rows in a child table
related to a row in a parent table are not
orphaned by deletion of the row in the parent
table
?
29Preserving Data Integrity in a Database
Referential Integrity
- Declarative Referential Integrity
- Define Foreign Keys as such, indicating what
table/primary key they reference - The DBMS will make sure that any row referenced
by a foreign key in another table is not deleted
attempts to do so generates an error - Advantage Developer does not need to do anything
other than define the relationship (and check for
errors) - Disadvantage Deleting a parent requires
specifically deleting all the child rows first - Procedural Referential Integrity
- Create Procedures which are executed whenever
the contents of a table are changed (called
Triggers can be defined for inserts, updates,
and/or deletes) - Advantage Greater flexibility for developer to
handle references - Disadvantage Developer has to make sure triggers
properly address referential integrity issues - Example A delete trigger on the Address
table automatically deletes any rows in the
Person table which points to the row deleted
from the Address table
30Database Language
- Applications need a language with which to
communicate with databases - The database language must support
- Data definition
- Data retrieval
- Data manipulation
- Data integrity maintenance
- Access control
- Data sharing
31SQL
(A Language for communicating with relational
databases)
- SQL is not a complete programming language
- Not as structured as C or Pascal
- No IF, GOTO, DO, or FOR statements no variable
declaration - Not procedural statements specify what, but
not how - Statements resemble English sentences
- Includes the use of optional noise words
- Improves readability but does not alter
functionality - Not a stand-alone product
- SQL is a standard which is incorporated, to
varying degrees, into commercially available
products - eg DB2, Access, Oracle, Sybase, Informix, dBASE,
SQL Server
32Sample SQL Statement
Address add_id Street City State Zip a1 310 Cedar
St. New Haven CT 06510 a2 Disney
World Orlando FL 32300 a3 4 Privet Drive Little
Whinging Surrey a4 Overlook Hotel Sidewinder CO 8
0200
SELECT Person.Last, Person.First, Address.City,
Address.State FROM Person, Address WHERE
Person.add_id Address.add_id AND
Address.State "FL"
33Features of SQL
- Vendor independence (widespread vendor support)
- Hardware/System independence
- Relational basis
- Dynamic data definition
- Client/Server architecture
- Direct interactive access without having to write
programs to get at the data - Program level access
34Data Mining
- Data Mining is the process of trying to find
new relationships in data not previously known or
perhaps even suspected - Requires lots of data from multiple sources
- Numerous approaches, often based predominantly on
hit or miss queries - Uses neural networks, case-based reasoning,
statistical analyses - Two approaches for aggregating data from multiple
systems into a single database - Data Repository
- Data Federation
35Users
Users
Laboratory Medicine
Diagnostic Radiology
Diagnostic Radiology
Anatomic Pathology
Anatomic Pathology
Laboratory Medicine
36Data Repository vs Data Federation
- Data Repository
- Data physically collected in one location
- Structure can be optimized for data mining
- Queries do not affect production/transactional
environments - Data Federation
- Data is always up-to-date
- Fewer data ownership / political issues
- Any given data source can withdraw from the
federation at any time - Data inconsistencies can occur
- More complex to set up and maintain