Title: Lecture No. 7
1Lecture No. 7
Database - Structured and Unstructured
2Objectives of this Lecture
- This lecture will be focussed mostly on database
as a means of storing and retrieving data. - The term database is always associated with the
term database management system (DBMS) which
indicates that not only is data stored but it is
controlled and managed by software. - You will most likely be using Microsoft Access
for your assignment. This DBMS is aimed at the
lower end of data volumes and processing. - You will also use Oracle in your laboratory work.
This is a medium to large client-server DBMS
3(No Transcript)
4Some Aspects of Data
- A major benefit of Computing is Storage and
Retrieval of Data - We need to have answers to these questions -
- WHAT are we going to produce as output ?
-
- WHAT data is to be stored ?
- WHAT is the level of detail (Name, Address,
Height ?) - HOW long is to be retained - and is it
volatile? - WHAT form is it to be stored - on line,
off line ? - WHO is going to access it ?
-
5Some Aspects of Data
- HOW often will it be accessed ?
- WHAT security of access is to be applied ?
- WHAT are the starting volumes - what is the
growth or decay factor ? - WHAT response time is expected ?
- HOW accurate is the data content ?
- HOW current is it ? (e.g. on line sales
retailing)
6Information / Data
- A General Definition
- DATA - raw (unprocessed or partly processed)
facts which represent the state of
entities (things) which have
occurred - INFORMATION - data which has been processed into
a form USEFUL TO THE USER - What is Information to one user may be Data to
another user.
7Audit Trail
- General Definition
- The presence of data processing media and
procedures - which allow any and / or all
transaction(s) to be traced - through ALL STAGES of processing
- This infers that the following devices /
techniques are in place - 1. A logging device which traps all
transactions - 2. Some way of tagging each transaction so that
it can be identified - 3. Some way of retrieving the required
transaction(s) - 4. Some way of archiving - what is the required
period ? - 5. Control procedures and processes to ensure
integrity -
8Data Base
- A Database is a shared collection of
Inter-Related data designed to meet the needs of
multiple types of users and applications. - This implies that multiple user VIEWS can be
defined - Data stored is independent of the programs which
use it - Data is structured to provide a basis for future
applications - DATABASE Stored Collection of Related Data
- May be physically
distributed
9Database Management System
- A DBMS is SOFTWARE which provides access to the
database in an integrated and controlled manner - A DBMS must contain
- 1. Data Definition and Structure capabilities
- 2. Data Manipulation capabilities
-
10Data Definition and Manipulation
- Data Definition Language (DDL)
- used to describe data at the database level
- Schema level - complete database description
- Sub-Schema level - user views (restricted)
- Data Manipulation Language (DML)
- Provides for Create Insert
- Update Retrieve (extract)
- Delete Drop
- Modify Calculation
- Report
- capabilities
-
11Three Level Architecture
- 1. External schema - User Views
- 2. Conceptual schema - Total database
- description
- 3. Internal schema - Physical database
12The Many Faces of Database
- Databases can be
- 1. Transaction Intensive - ATMs Checkouts
- 2. Decision Support - Browsing for
trends - 3. Mixed-Load - Combination of both
- 4. Small databases - Few thousand
records - 5. Very Large Database - Many millions or
trillions - (VLDB) of
records (Banks) - 6. Non Traditional - Weather bureau,
flight plans - Computer Aided Design data
13The Many Faces of Database
- They can be
- Data Warehouses
- Data Marts
- How is a database size measured ?
- There are a number of measurements
- Raw data size
- Total database size
- Total usable disk space size (which includes
media protection such as mirroring)
14The Many Faces of Database
15DBMS Requirements
-
- Querying Capabilities
- Data Displays (Presentation)
-
- Data entry
- Data Validation
-
- Data Deletion
- Committing Procedures (of changes)
- AND Data Integrity, Security, Consistency and
Concurrency - Capabilities
-
16Important Database Features
- Data Integrity
- Data Independence
- Referential Integrity - Relational Database Model
- Concurrency Control - Multiple Users
- Consistency
- - multi users
- - distributed database
- - replicated database
- - partitioned database
- - mobile database
- Recovery from failure (Transaction and Media)
- Security
17File Processing (non database)
Purchasing Program
Billing Program
Accounts Buyers Inventory
Vendors Receivable
Customers
Accounts Payable Sales Order
Payroll
Processing
Vendors Invoice Customers
Inventory Employee
18File Processing With Database Technology
- Orders Department Accounting
Department
Program Program Program
Program Program A
B C A
B
Order Filling
Invoicing system
system
Inventory Back Inventory
Customer Master Orders
Pricing Master
Customers
19InterRelated File Outline
Sales Stats
Accounts Receivable
Customer
Salesperson
Buyers
Inventory
Vendor
Accounts Payable
General Ledger
Purchase Order
20Conceptual Data Model
customer
order
invoice
product
work order
raw material
vendor
21User Views
customer customer
product vendor
customer
order order invoice
raw material
22Database Models - Hierarchical
owner/parent child /parent
owner
member
child
child/parent
23Data Base Models - Network
set of data
owner
member
owner
member
set of data
Note Only linked sets can be accessed
24Data Base Models - Relational
table table
table table
table A B
C
D E
Any table(s) can be joined to any other table(s),
provided there is a means of effecting the
join Primary key / Foreign key concept. Data
redundancy No fixed linkages
252 Relations
EMPNUM NAME Date of Birth
DEPTNUM 3 JONES
16-05-1956 605
7 SMITH 23-09-1965 432 11
ADAMS 11-08-1972 201 15
NGUYEN 23-10-1964 314 18 PHAN 16-11-197
6 201 Relation (Table) Name EMP Relation
Schema EMP(empnum,name,date of birth,deptnum)
DEPTNUM DEPTNAME 201
Production 314 Finance 432
Information Systems 605 Administration
Relation (Table) Name DEPT Relation Schema
DEPT(deptnum, deptname)
26Definition of a Relational Database
- A relational database is a collection of
relations or two-dimensional tables.
Database
27Data Models
Model ofsystemin clientsmind
Entity model ofclients model
Table modelof entity model
Tables on disk
28Communicating with a RDBMS Using SQL
29Relational Database Management System
Server
User tables
Data dictionary
30Relational DataBase
- Data is represented in ROW and COLUMN form
(matrix) -
(attribute) - Collections of related data ---gt TABLES
(relations) - 1 or more tables ----gt DATA BASE
- ATTRIBUTES are generally static
- ROWS are DYNAMIC and Time-Varying
- The number of Attributes DEGREE of a table
- The number of Rows CARDINALITY of a
table
31Some RDB Considerations
- Data is held in tables
- No order of data in tables - row or attribute
- Concept of Foreign Key - Primary Key relationship
- Data Typing - including nulls
- Query Access - insert, update, delete, retrieval
- Indexing on candidate (and Primary) keys
- Integrity Constraints
- Attribute value ranges
- Referential Integrity
- Entity Integrity
- User Defined Integrity
- Set retention constraints
32Some RDB Considerations
- Domain constraints
- User defined Rules e.g. no booking of rooms for
lectures after 4.00pm Fridays (also known as
Business Rules) - Recovery procedures. Used to restore a database
after a failure - No explicit linkages between tables
- Linking or embedding database operations in a
procedural language (Cobol, C ..) - Databases may be distributed across similar or
different DBMSs - Security features
33Database Components
- 1. Back End Engine
- Used for Disk Input/Output processes
- (Read/Write/Find)
- 2. Front End Processor
- Data manipulation
- String/Arithmetic/Statistical operations
- 3. DBMS Interface
- Data Definition Language (DDL) Data
Manipulation Language (DML) - 4.Programmer Interface
- Applications Environment (4GLs, Embedded
capability)
34Data Description Language
- Used to describe data at the Database level
- Terms Used
- Structure and Attributes
- Schema Complete description of the database
using DDL - SubSchema Describes data in the database as it
is known to individual programs(processes) or
users - A segment or part of logical data record(s)
required - is commonly known as a VIEW
35Data Manipulation Language
- Language (commands and syntax) used to cause
transfers of data from the Database and the
Operating Environment and vice versa - Variety of Models - Access, DB2, dBASEV, MySQL,
SQLServer -
VisualDataBase, DataBoss, Ingres, - Oracle,
Informix .... - Windows versions provide Icons and Menu options
which are translated by the DBMS software to
Database manipulation commands - Typical commands get, put, replace, seek,
update,delete, insert, drop,
find, modify
36Application Systems
users(ATMs) users users users
C programs
DBMS
Cobol
Access
37DBMS Components
users
Database Management System
Program Language Interface
Application Programs
Utility Programs
Retrieval Update
Database
38A Practical Development Scenario
In 198n , a Company decided to develop and
introduce a Payroll system using database
technology. It looked this this
Payroll System
Payroll Data
39A Practical Development Scenario
- The Company grew in numbers and expertise and
decided - to introduce a Personnel System. The new design
was this
40A Practical Development Scenario
In the next few years, these components were added
Job History
Labor Distribution
Labor Analysis Data
Employee Tracing Data
41Advantages of DataBase
- Reduced Data Redundancy
- Data Integrity
- Data Independence
- Data Security
- Data Consistency
- Easier use of Data via DBMS Tools (Query
languages, 4GL's) -
42Disadvantages of Database
- Complexity
- Expense
- Vulnerability
- Size of - disk storage, processor memory
- Training Costs
- Compatibility
- Technology Lock In
43(No Transcript)
44Data Types
- Used to augment an attribute description and to
provide a means of Integrity - Normal data types are
- Character (or text)
- Numeric - Integer, Decimal, Money, Float
- (in Access Long and Short Integer,
- Decimal with options of a number of decimal
places) - Date - Standard date format - Access has
various forms - Logical - Yes/No True/False Exists/Does
Not Exist
45Attribute Size
- In many cases this is set by the Data Definition
facilities - e.g.. Date, Short Integer, Long Integer,
Logical,Number - Others are set by the Designer
- Number of Characters, Size of a Decimal
Number - Access allows for a Default value
- Duplicates/ No Duplicates
allowed - Primary Key nomination
- Indexing
- Required Status of an Attribute
-
46Integrity Examples
- Primary Key - Must have a value - not null
- - Must not be duplicated in the same table
- Attribute values must exist (or not)
- student record
- student identity number - must exist, not
duplicated, must be a valid number - student name - must exist, may be
duplicated - student course code - must exist, must be a
valid course - subject code - if enrolled, must
exist, code must be
a valid code - subject result - dependent on
time. May be null. If it
exists must be a valid grade and
mark
47Integrity Examples
- Questions
- 1. Do the same constraints exist in Excel, Word
? - 2. Should they ?
- 3. What are alternatives for embedding the same
or similar controls - 4. Are spreadsheets less reliable than
databases ?
48Queries
- A Query acts of the base table or tables of a
database and returns a subset of this data. - A Query normally returns a restricted set of
attributes (and their current values - this is
the User View of the database - A Query normally has some criteria
- e.g. salaries gt 50,000
- outstandings gt 30 days
- date (of some event , gt or lt some
designated date) - Criteria can be linked event A OR event B,
- event A
AND event B
49Reports
- A more formal output of data from base tables and
in many cases produces high volume. - The design is formatted
- Page numbering, Headings, Footers,
- Page breaks, Page or Item totals and sub-totals
- Reports can be criteria based and include
calculations (derived data) - Report content can be imported from another
source, or exported to another target. - Reports can contain Exception Full Graphic
Hypertext information and may be hard copy,
electronic copy,
50A Puzzle
How can these dots be joined by 4 straight lines
without lifting a pencil (or pen) from the
surface ?
51A Puzzle
1
2
3
No, thats 5 lines
4
5
52A Puzzle
1
2
4
5
Thats worse - its 6 lines
6
3
53A Puzzle
1
How about this ?
54A Sunburnt Country
- This could be a commentary of the Summer weather
- I love a sunburnt country
- A land of sweeping plains,
- Of ragged mountain ranges,
- Of droughts and flooding rains
- I love her far horizons,
- I love her jewel-sea,
- Her beauty and her terror -
- The wide brown land for me !
55Well, that was about 'Structured' Data. What
about 'Non Structured' Data ?
56Non Structured Data
- Up to date, we have studied some of the methods
of file organisation associated with efficient
data retrieval as embodied in the relational data
base model. - However, although industry has invested large
amounts of money in relational database
applications, the greater percentage of data
access requirements are made of non-structured
data sources (a classic one being the World Wide
Web). - This lecture will uncover some of the techniques
which locate and release data of this
non-structured nature.
57Information Retrieval
- Non - Structured i.e. Occurrence of
'Attribute Values'
neither regular nor regulated - Processing Requirements
- - Mainly to reveal/release textual information
- - Information searches on 'key terms'
- - May be further processed by computer systems
- (e.g. Spreadsheets, Mathematical Models,
Simulations)
58Information Retrieval
User Expectations - All 'relevant' references
will be found and
released (Compare to
structured DBMS closure theory) Note
Approximately 70 - 80 of 'Management
Information' is from Non-structured Databases
59Information Retrieval
- Consider the following
- When I do count the clock that tells the time,
- And see the brave day sunk in hideous night,
- When I behold the violet past prime,
- And sable curls all silvered o'er with white
- When lofty trees I see barren of leaves,
- Which erst from heat did canopy the herd
- And summer's green all girded up in sheaves
60Information Retrieval
Borne on the bier with white and bristly
beard Then of thy beauty do I question make That
thou amongst the wastes of time must go, Since
sweets and beauties do themselves forsake, And
die as fast as they see others grow, And nothing
against time's scythe can make defence Save breed
to brave him when he takes thee hence
Shakespeare, Sonnet No.12
61Information Retrieval
- Or , consider this
- Now is the winter of our discontent
- Made glorious summer by this sun of York
- And all the clouds that loured upon our house
- In the deep bosom of the ocean buried
- Now are our brows bound with victorious wreaths
- Our bruised arms hung up for monuments
- Our stern alarums changed to merry greetings
- Our dreadful marches to delightful measures
- Shakespeare Soliloquy ? ? ?
62Information Retrieval
- Finally, another extract
- The peasants who survived the plague found
themselves in many cases afflicted by fresh
burdens, for with fewer people to work the land,
overlords demanded a standstill in wages and a
return to feudal duties in full. - But, with the shortage of labour, workers
naturally expected to be valued more highly and
to be given better pay and more freedom. - Europe 14/15th Century
- A history of the world - Rjunstead ( perhaps
Enterprise Bargaining ?)
63Information Retrieval
- Definition An Information Retrieval System
stores items of Information which need to be - Processed
- Searched
- Retrieved
- Analysed, Condensed, Explained and Sent to User
Populations - Some Concerns
- Data Base Management
- Decision Support (as with other Information
Systems) -
64Information Retrieval
- Requirements
- Must have efficient storage organisations
- Rapid search procedures
- Effective dissemination and user interaction
65Information Retrieval
- Information Retrieval Systems are used to handle
data in various forms such as - Bibliograpic
- Textual
- Audio
- Video
-
66Information Retrieval
- Allied Topics
- Information Theory
- Probability Theory
- Computational Semantics
- Programming Theory
- Algebra
67Information Retrieval
- Measurement of Usefulness
- 1. Currency
- 2. Completeness
- Some queries
-
- 1. Excess Information
- 2. Obsolete (?) Information
68Information Retrieval
- Resource Management Is it Important ?
- Growth rates Up to 1800, the rate of
publication doubling every 50 years - 1800 to 1996 gt 100,000 scientific
publications - 1996 gt 400,000 scientific
publications - 2000 gt 680,000 scientific
publications - Impact of World Wide Web
- plus the translation
factor
69Information Retrieval
- Some Information Organisation Problems
- Volume expansion is not evenly distributed for
all topics - Location of related items
- Relationship methods (trace, link, chain)
70Information Retrieval
- Key Functions of Information Retrieval Systems
Input Process
Output
Indexing Language
Information Items
Requests
(mapping)
Organised for Search Formulation
Indexing Processes
Request Representation Information
Representation
71Information Retrieval
- File Structures
- 1. Linear Lists
- 2. Ordered Sequential Files
- 3. Indexed Files
- 1. Linear Lists
- UNordered collection of items
- Items examined one at a time
- Average Search (n 1)/2
-
72Information Retrieval
- 2. Ordered Sequential files
- Concept of keys Limited to items of special
importance such as Name of Author - File ordered on these keys
- Searching
- (n1)/2 same as linear list
- log2(n1)
-
73Information Retrieval
- 3 (a) Indexed Files
- Creation of a smaller, referenced table to the
main file - Provides a pointer to the location of items in
the main file - (memory and/or disk storage considerations)
- Frequently linked with a subsequent sequential
search - Can be extended to cover multiple keys - the
major disadvantage is the necessity to rebuild
main file and index
74Information Retrieval
- 3(b). Inverted File
- Data organisation creates an Index for all Unique
Key Values - Files are arranged in order by TOPIC
- Each Topic includes a corresponding list of item
numbers - This produces a quick access method - smaller
table - A Secondary index (an index of the index) is also
possible
75Information Retrieval
- Item No 1 2 3 4 5
- Author Ash Brown
Jones Reynolds Smith - Aspects of A Survey
A History The State of Users of - Title Computerised of users of of
Computer the Art of New - Information Information
Systems Retrieval Retrieval - Retrieval Retrieval
Systems
Systems - Topic Computer Information
Computer Information Retrieval - Information Retrieval
Systems Retrieval System - Retrieval Users
Systems
Users - Systems
-
76Information Retrieval
- Inverted File Organisation Sample
- Related Information Items
Computer Information Retrieval Systems Users
Additions and Deletions require Index
modifications
1 3 1 2 4 1 2
4 5 1 3 4
5 2 5
Topic
Inverted Index
77Information Retrieval
- Inverted Files in a Commercially Available I.R.S.
- Boolean logic associative terms queries
- Operators AND OR NOT
- Functions used Set Intersection
- Set Union
- Set Difference
-
78Information Retrieval
79Information Retrieval
Item No 1 2 3 4 5 Author
Ash Brown Jones
Reynolds Smith
Aspects of A Survey A History
The State of Users of Title
Computerised of users of of Computer the
Art of New Information
Information Systems Retrieval
Retrieval Retrieval
Retrieval Systems
Systems Topic Computer
Information Computer Information
Retrieval Information
Retrieval Systems Retrieval
System Retrieval Users
Systems
Users Systems
80Information AND Retrieval
- Based on the data given-
- 1. Use Inverted Index to Locate the reference
Numbers for the term Information
Set 1 - 2. Ditto for the term Retrieval Set 2
- 3. Intersect of Set 1 and Set 2 Set 3
- 4. Use set 3 to retrieve documents identified in
this set.
81Information OR Retrieval
- Create Set 1 as before on the occurrences of the
term Information - Create Set 2 as before on the occurrences of the
term Retrieval - Combine Set 1 and Set 2 Set 3
- This is known as Set Union
82Information NOT Retrieval
- Perform the formation of Set 1 and Set 2 based on
the - occurrence of the term Information and also
the term Retrieval - Remove any reference from Set 1 which is included
in Set 2 - to produce a final set, Set 3
- This is known as Set Difference
- And this is more Information. Its
- informing me its time to stop.
83Some Niceties
- Sample search X and Y or Z
- Start scan from Left, move to Right
- Start scan from Right, move to Left
- Partial scan for x number of identical
characters, again from Left to Right or Right to
Left - Perform all ANDs then all ORs
- Results would be (X and Y) OR Z (Left to
Right) - (X AND (Y OR Z)
(Right to Left) - Note This is not a Political movement of members
84Additional Features
- 1. Adjacency Operations
- Search items contained in adjacent word
positions - Can also be used in Inverted Index to point to
Word location within each document.
85Additional Features
-
- Document Reference No.
- Paragraph No.
Document Hierarchy - Sentence No.
- Word No
- Retrieval (345 1 2 5)
- Could also be a Word Count from the beginning of
text - (distance indicator)
- Retrieval (345 13) interpreted as Document 345,
13 words from Beginning of Text (B.O.T)
86Some Commercial Inverted File Systems
- 1. Dialog Uses Select and Combine
- 2. Stairs Utilities to create and maintain
database OnLine retrieval system AQUARIUS -
includes - a dictionary
- 3. Bibliographic Retrieval Services System (BRS)
- Based on Stairs
- 4. Medlars system (National Library of Medicine)
- Uses Index File Posting
File Data File - 5. Orbit system
- 6. Lexis system
87Dynamic Item Searching
- A Binary or simple tree search is not viable for
dynamic file conditions - the search depth and
thus time is high
Root ---gt
Node
Number of Levels (Height)
If the number of levels could be reduced (e.g.
increase the number of values per node (known as
order) then access times would improve. B-Tree,
B Tree and B Tree address this
requirement Access formula is logdn where d
order
88B Tree Example Document Clustering
Catalog Hardware Morpheu
Review Synonym
Number Publication
Apparatus Biomedicine
Statistics Stem Structure
Encyclopedia Grammar
File
Search for Query File
89String Matching - Boyer and Moore
- Based on 1. Analysis of query pattern
- 2. Auxiliary table
- 3. Character mapping
- 1 2 3 4 5 6 7 8 9 10 11
- L E C T U R E R O O M
- L occurs in position 1 E occurs in
positions 2 and 7 - C occurs in position 3 T occurs in position
4 - U occurs in position 5 R occurs in positions
6 and 8 - O occurs in positions 9 and 10 M occurs in
position 11 - Matching starts ltRightgt character and
progressively shifts left on match
90Information Management
- Data retrieval - an application package
- TITAN Museums, Births Deaths and Marriages
Registries - Art Galleries, Market Research,
News Services, - Hansard, Reference Libraries
- Platforms Unix, AIX, SunO/S, PCs to Mainframes
- Method - Signature Files - A bit string for each
record in the - data file. Uses Indexed terms
91Information Management
- 2 stages necessary
- 1. Create descriptor for each Indexed term
- 2. Superimpose each term descriptor (exclusive
OR-ing) - Professional ---gt
100100 .......... (32 bits) - term Computing ---gt 010100
.......... - descriptors Magazine ---gt 000101
.......... - Record descriptor 110101 (inclusive OR)
- Record descriptors ---gt record descriptor file
Pointers
92Some Calculations
- Inverted Files
- Number of searches no. of steps to search
index - (n1)/2 for subfile
- Assume 106 records.
- 1.Sequential Search specific character
- (n1)/2 500,001 steps average
- 2. Indexed file
- Assume that the number of records beginning with
the specified character is 50,000 - Index entries 26
- No. of search steps (261)/2 14
- plus (50,000 1)/2 25001 Total
25,015