Title: Introduction to Databases
1. Introduction to Databases
- Data Organisation
- Definition
- Data modelling
- SQL
- DBMS functions
2. Basics of Data Organisation
- Data hierarchy (four categories):
- Field: a single data item
- Record: a related set of fields describing one instance of an entity
- File / Table: a set of related records, with as many records as there are instances (occurrences) in the set
- Database: a collection of related files
3. Example of a Data Structure

Fields: Name, First name, Telephone

    Name      First name   Telephone
    Zidane    Zinedine     45 25 65 65
    Feller    Joe          25 58 96 63
    Clinton   Bill         12 25 28 89
    Henry     Thierry      25 78 85 85

- Each row is a record; the set of records forms a file / table
- Other files > complete data structure = the database
- A SQL rendering of this table follows below
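As a concrete illustration, here is a minimal SQL sketch of the table above; the table and column names are hypothetical choices, not part of the original slides.

    -- One field per column; each INSERT adds one record
    CREATE TABLE phone_list (
        name       VARCHAR(30),
        first_name VARCHAR(30),
        telephone  VARCHAR(15)
    );

    INSERT INTO phone_list VALUES ('Zidane', 'Zinedine', '45 25 65 65');
    INSERT INTO phone_list VALUES ('Feller', 'Joe', '25 58 96 63');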
4. Database Definition
- "A collection of interrelated data stored together with controlled redundancy, to serve one or more applications in an optimal fashion; the data is stored so that it is independent of the application programs which use it; a common and controlled approach is used in adding new data and in modifying existing data within the database."
5. Definition - a closer look
- A collection of interrelated data stored together
- with controlled redundancy
- to serve one or more applications in an optimal fashion
- the data is stored so that it is independent of the application programs which use it
- a common and controlled approach is used in adding new data and in modifying existing data within the database
6. Advantages of Databases
- data are independent of applications
- stored centrally
- the data repository is accessible to any new program
- data are not duplicated in different locations
- programmers do not have to write extensive descriptions of the files
- physical and logical protection is centralised
7. Disadvantages of DBs
- Centralisation can be a weakness
- Large DBs require expensive hardware and software
- Specialised, scarce personnel are required to develop and maintain large DBs
- Standardisation of data in a central repository has implications for the format in which it is stored
8. Characteristics of DBs
- High concurrency (high performance under load)
- Multi-user (reads do not interfere with writes)
- Data consistency: changes to data don't affect running queries; no phantom data changes
- High degree of recoverability (the "pull the plug" test)
9. ACID Test
- Atomicity: all or nothing
- Consistency: preserves the consistency of the database
- Isolation: transactions are independent of one another
- Durability: once committed, data is preserved
A transaction sketch illustrating these properties follows below.
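A minimal sketch of these properties in SQL, assuming a hypothetical accounts table; the syntax for starting a transaction varies by DBMS (Oracle starts one implicitly).

    -- Atomicity: the transfer commits as a whole or not at all
    UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
    COMMIT;   -- Durability: once committed, the changes survive a crash
    -- On any error, ROLLBACK restores the state before the transaction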
10. Database Management System (DBMS)
- A program that makes it possible to:
- create
- use (insert / update / delete data)
- maintain a database
- It provides an interface / translation mechanism between the logical organisation of the data stored in the DB and the physical organisation of that data
11. Using a Database
- Two main functions of the DBMS:
- Query language: searching for answers in the data (SQL) - see the example below
- Data manipulation language: for programmers who want to modify the data model in which the data is stored
- Host language: the language used by programmers to develop the rest of the application, e.g. Oracle Developer 2000
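For instance, a typical SQL query against the hypothetical phone_list table sketched earlier:

    -- Find the telephone number(s) for a given name
    SELECT name, first_name, telephone
    FROM   phone_list
    WHERE  name = 'Henry';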
12. Relational DBs
- Data items are stored in tables
- Specific fields in one table relate to fields in other tables (joins) - see the sketch below
- An infinite number of possible viewpoints on the data (queries)
- A highly flexible DB, but slow for complex searches
- Oracle, Sybase, Ingres, Access, Paradox for Windows...
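A sketch of such a join, using hypothetical customers and orders tables related through a shared customer_number field:

    -- List each customer's orders by matching the common field
    SELECT c.customer_name, o.order_number
    FROM   customers c
    JOIN   orders    o ON o.customer_number = c.customer_number;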
13. Describing Relationships
- An attempt at modelling the business elements (entities) and their relationships (links)
- Can be based on users' descriptions of the business processes
- Specifies dependencies between the data items
- Coded in an Entity-Relationship Diagram (ERD)
14. Types of Relationships
- one-to-one: one instance of one data item corresponds to one instance of another
- one-to-many: one instance corresponds to many instances
- many-to-many: many instances correspond to many instances
- Also, some relationships may be:
- compulsory
- optional
(A SQL sketch of these patterns follows below.)
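In a relational schema these types are typically expressed through keys. A hedged sketch with hypothetical student/course tables: the linking table resolves a many-to-many relationship, and NOT NULL makes a link compulsory rather than optional.

    CREATE TABLE students (
        student_id INT PRIMARY KEY,
        name       VARCHAR(50) NOT NULL    -- compulsory attribute
    );
    CREATE TABLE courses (
        course_id INT PRIMARY KEY,
        title     VARCHAR(50) NOT NULL
    );
    -- many-to-many: one student takes many courses,
    -- one course has many students
    CREATE TABLE registrations (
        student_id INT NOT NULL REFERENCES students(student_id),
        course_id  INT NOT NULL REFERENCES courses(course_id),
        PRIMARY KEY (student_id, course_id)
    );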
15. Example
- A student registration system
- What are the entities?
- What type of relationship do they have?
- Draw the diagram
16. Entity-Relationship Diagram
17. Example 2: Sales Order Processing
- Entities
- Relationships
- Use a business-object-based approach?
18. Next step - creating the data structure
- Few rules - a lot of experience
- Can get quite complex (and is paramount for the speed of the DB)
- Tables must be normalised, i.e. redundancy is limited to the strict minimum by an algorithm
- In practice, normalisation is not always the best approach
19. Data Structure Diagrams
- Describe the underlying structure of the DB: the complete logical structure
- Data items are stored in tables linked by pointers
- attribute pointers: data fields in one table that link it to another (common information)
- logical pointers: specific links that exist between tables
- Tables have a key
- Is it an attribute or an entity?
20. Example: Order Processing Data Structure
- ORDER: order number, item description, item price, quantity ordered, customer number, item number
- CUSTOMER: customer number, customer name, customer address, customer balance, customer special rate
- ITEM: item number, item description, item cost, quantity on hand
- Legend: attributes are either compulsory or optional (marked 0); links 1-4 in the diagram connect the tables through their shared fields
- A SQL rendering of this structure follows below
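A hedged SQL rendering of the structure above (names adapted; ORDER becomes orders because ORDER is a reserved word in SQL). Note that item description and price are kept in ITEM only; repeating them in ORDER, as the diagram does, is exactly the redundancy that normalisation removes.

    CREATE TABLE customer (
        customer_number       INT PRIMARY KEY,
        customer_name         VARCHAR(50) NOT NULL,    -- compulsory
        customer_address      VARCHAR(80) NOT NULL,
        customer_balance      DECIMAL(10,2) NOT NULL,
        customer_special_rate DECIMAL(4,2)             -- optional
    );
    CREATE TABLE item (
        item_number      INT PRIMARY KEY,
        item_description VARCHAR(50) NOT NULL,
        item_cost        DECIMAL(8,2) NOT NULL,
        quantity_on_hand INT NOT NULL
    );
    CREATE TABLE orders (
        order_number     INT PRIMARY KEY,
        quantity_ordered INT NOT NULL,
        -- attribute pointers: shared fields linking to the other tables
        customer_number  INT NOT NULL REFERENCES customer(customer_number),
        item_number      INT NOT NULL REFERENCES item(item_number)
    );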
21. Normalisation
- The process of simplifying the relationships amongst data items as much as possible (see the example provided in the handout)
- Through an iterative process, the structure of the data is refined to 1NF, 2NF, 3NF, etc.
- Reasons for normalisation:
- to simplify retrieval (speed of response)
- to simplify maintenance (updates, deletions, insertions)
- to reduce the need to restructure the data for each new application
22. First Normal Form
- design the record structure so that each record looks the same (same length, no repeating groups)
- repetition within a record means one relation was missed: create a new relation (see the sketch below)
- elements of repeating groups are stored as a separate entity, in a separate table
- normalised records have a fixed length and an expanded primary key
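A minimal sketch of the 1NF step, with hypothetical names: the repeating group of items inside an order record moves to its own table.

    -- Before 1NF: orders(order_number, date, item1, qty1, item2, qty2, ...)
    -- After 1NF: the repeating group becomes a separate table
    CREATE TABLE order_header (
        order_number INT PRIMARY KEY,
        order_date   DATE
    );
    CREATE TABLE order_line (
        order_number INT REFERENCES order_header(order_number),
        line_number  INT,
        item_number  INT,
        quantity     INT,
        PRIMARY KEY (order_number, line_number)   -- expanded primary key
    );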
23. Second Normal Form
- The record must be in first normal form first
- Each item in the record must be fully dependent on the key for identification
- Functional dependency means a data item's value is uniquely associated with another's
- Only one-to-one relationships between elements in the same file; otherwise split into more tables (see the sketch below)
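Continuing the hypothetical order_line sketch: storing item_description there would violate 2NF, because it depends on item_number alone rather than on the full (order_number, line_number) key; the fix is to split it out.

    -- 2NF fix: attributes that depend on only part of the key
    -- move to their own table, keyed by that part
    CREATE TABLE item (
        item_number      INT PRIMARY KEY,
        item_description VARCHAR(50)
    );
    -- order_line keeps only quantity (dependent on the full key),
    -- plus item_number as the link to item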
24. Third Normal Form
- To remove transitive dependencies
- i.e. when one item is dependent on an item which is itself dependent on the key of the file
- The relationship is split to avoid data being lost inadvertently (see the sketch below)
- This gives greater flexibility in the design of the application and eliminates deletion problems
- In practice, 3NF is not used all the time: speed of retrieval can be affected
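A sketch of a transitive dependency with hypothetical tables: department_name depends on department_id, which in turn depends on the key employee_id, so 3NF moves it to its own table.

    CREATE TABLE department (
        department_id   INT PRIMARY KEY,
        department_name VARCHAR(50)     -- no longer stored per employee
    );
    CREATE TABLE employee (
        employee_id   INT PRIMARY KEY,
        employee_name VARCHAR(50),
        department_id INT REFERENCES department(department_id)
    );
    -- Without the split, deleting the last employee of a department
    -- would also delete the department's name (a deletion problem)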
25. Beyond Data Modelling
- The model must be normalised
- An optimised model:
- no surprises
- resilience
- The outcome is a set of tables: the logical design
- Then the design can be warped until it meets the realistic constraints of the system
- E.g. what business problem are we trying to solve? (see handout, Riccardi pp. 113 and 127)
26. Realistic Constraints
- Users cannot cope with too many tables
- Too much development is required to hide a complex data structure
- Too much administration
- Optimisation is impossible with too many tables
- Actually, RDBs can be quite slow!
27. Key Practical Questions
- What are the most important tasks that the DB MUST accomplish efficiently?
- How must the DB be rigged physically to address these?
- What coding practices will keep the code clean and simple?
- What additional demands arise from the need for resilience and security?
28. Analysis - Three Levels of Schema
- External schemas (External Schema 1, External Schema 2, ...): the views presented to users and applications
- Logical schema: the tables
- Internal schema: the physical storage (e.g. the disk array)
29. Four-Way Trade-Off
- Security
- Ease of use
- Performance
- Clarity of code
30. Key Decisions
- Oracle offers many different ways to do things:
- Indexes (see the sketch below)
- Backups
- Good analysis is not just about knowing these options, but about understanding whether they are appropriate
- Failure to think it through leads to an unworkable model
- In particular, predicting performance must be done properly
- OK on the technical side, tricky on the business side
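For instance, one recurring Oracle decision is which indexes to create; a minimal sketch with a hypothetical table and column:

    -- Speeds up lookups by customer_name, at the cost of extra storage
    -- and slower inserts/updates - appropriate only if the workload
    -- actually queries on this column
    CREATE INDEX idx_customer_name ON customer (customer_name);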
31. Design Optimisation
- Sources of problems:
- Network traffic
- Excess CPU usage
- But physical I/O is the greatest threat (different from logical I/O)
- Disks are still the slowest link in the loop
- Solution: minimise or re-schedule access
- Also try to minimise the impact of Q4 above, the demands of resilience and security (e.g. mirroring, internal consistency checks)
32. Using Scenarios for Analysis
- Define standard situations for DB use
- Analyse their specific requirements
- Understand the implications for DB design
- Compare and contrast new problems with old ones
33. Categories of Critical Operations
- Manual transaction processing: complex data entry by a small number of operators
- Automatic transaction processing: a large number of concurrent users performing simple data entry
- High batch throughput: automatic batch input of a very large number of complex transactions into the DB
- Data warehousing: large volumes of new data thrown on top every day at fixed intervals, with intensive querying
34. Manual Transaction Processing
- E.g. an insurance telemarketing broker:
- Data entry
- Retrieving reference info
- Calculations
- On-line human-computer interaction!
- Instant validation (field by field)
- Drop-down lists (data-entry accelerators)
- Quick response time
- Critical issue: a user-friendly front end, but minimise traffic between the interface and the back end!
35. Automatic Transaction Processing
- A large number of users performing simple tasks
- Real-time credit card systems (e.g. authorisation) or check-outs (EPOS)
- Human interaction at its most simple, e.g. typing a code or swiping a card
- Minimum validation, no complex feedback
- Large numbers mean the potential problems are:
- Connection opening / closing rate
- Contention between concurrent users
- SQL engine problems: data consistency costs
- Design with multiple servers
36. Automatic Transaction Processing (continued)
- Another example: on-line shopping
- What specific problems would arise from shopping-cart type applications?
- How do you handle lost customers?
37. High Batch Throughput
- E.g. a mobile phone network operator
- Real-time, huge volumes of simultaneous complex transactions:
- Number checks
- Account info
- Price info
- Pattern checks
- Large processing capacity required: need to tackle all transactions together, in batches
- A DB query may not be the only solution (or the quickest):
- Move customer accounts to a cache
- Copy updated figures for accounts to a log and update the accounts in slack periods (2.5 GB an hour!)
- Indexing or partitioning for quicker access
38. Data Warehouse
- A huge store of data
- Large volumes added every day
- 99% new data, 1% corrections to existing data
- Substantial analysis required prior to development:
- What to include
- How to aggregate and organise it
- Where the data comes from
- Real Oracle territory, because the schedule is lax, i.e. not a real-time application
- Key issues:
- Getting the partitioning right
- Deciding how many summary levels
- Deciding what to hold and what to recalculate
39. Partitioning
- The oldest trick in the book to speed up retrieval (e.g.?)
- A smaller bunch of data
- Well labelled, so it can be found easily
- A smaller index
- Data manipulation, maintenance, copying and protection are far easier
- Break a big problem (e.g. a table) down into small ones (see the sketch below)
40. Internet Databases
- In between types 1 and 2 (manual and automatic transaction processing)
- Many concurrent sessions
- Reduced interaction between front end and back end
- Internet = extra response time (2 secs!)
- In practice, many sites are quite slow
- Key issues:
- thin client
- reduced dialogue
- management of sessions (e.g. cookies) to avoid multiple restarts
41. Conclusion: Key Issues
- At one end: very large numbers of small transactions
- Threat of network or process contention
- At the other end: a small number of processes with complex data crunching and time constraints
- The design of the DB and the application must reflect these constraints