Lecture No. 7 - PowerPoint PPT Presentation

1 / 83
About This Presentation
Title:

Lecture No. 7

Description:

Lecture No. 7 Database - Structured and Unstructured – PowerPoint PPT presentation

Number of Views:173
Avg rating:3.0/5.0
Slides: 84
Provided by: Rod113
Category:

less

Transcript and Presenter's Notes

Title: Lecture No. 7


1
Lecture No. 7
Database - Structured and Unstructured
2
Objectives of this Lecture
  • This lecture will be focussed mostly on database
    as a means of storing and retrieving data.
  • The term database is always associated with the
    term database management system (DBMS) which
    indicates that not only is data stored but it is
    controlled and managed by software.
  • You will most likely be using Microsoft Access
    for your assignment. This DBMS is aimed at the
    lower end of data volumes and processing.
  • You will also use Oracle in your laboratory work.
    This is a medium to large client-server DBMS

3
(No Transcript)
4
Some Aspects of Data
  • A major benefit of Computing is Storage and
    Retrieval of Data
  • We need to have answers to these questions -
  • WHAT are we going to produce as output ?
  • WHAT data is to be stored ?
  • WHAT is the level of detail (Name, Address,
    Height ?)
  • HOW long is to be retained - and is it
    volatile?
  • WHAT form is it to be stored - on line,
    off line ?
  • WHO is going to access it ?

5
Some Aspects of Data
  • HOW often will it be accessed ?
  • WHAT security of access is to be applied ?
  • WHAT are the starting volumes - what is the
    growth or decay factor ?
  • WHAT response time is expected ?
  • HOW accurate is the data content ?
  • HOW current is it ? (e.g. on line sales
    retailing)

6
Information / Data
  • A General Definition
  • DATA - raw (unprocessed or partly processed)
    facts which represent the state of
    entities (things) which have
    occurred
  • INFORMATION - data which has been processed into
    a form USEFUL TO THE USER
  • What is Information to one user may be Data to
    another user.

7
Audit Trail
  • General Definition
  • The presence of data processing media and
    procedures
  • which allow any and / or all
    transaction(s) to be traced
  • through ALL STAGES of processing
  • This infers that the following devices /
    techniques are in place
  • 1. A logging device which traps all
    transactions
  • 2. Some way of tagging each transaction so that
    it can be identified
  • 3. Some way of retrieving the required
    transaction(s)
  • 4. Some way of archiving - what is the required
    period ?
  • 5. Control procedures and processes to ensure
    integrity

8
Data Base
  • A Database is a shared collection of
    Inter-Related data designed to meet the needs of
    multiple types of users and applications.
  • This implies that multiple user VIEWS can be
    defined
  • Data stored is independent of the programs which
    use it
  • Data is structured to provide a basis for future
    applications
  • DATABASE Stored Collection of Related Data
  • May be physically
    distributed

9
Database Management System
  • A DBMS is SOFTWARE which provides access to the
    database in an integrated and controlled manner
  • A DBMS must contain
  • 1. Data Definition and Structure capabilities
  • 2. Data Manipulation capabilities

10
Data Definition and Manipulation
  • Data Definition Language (DDL)
  • used to describe data at the database level
  • Schema level - complete database description
  • Sub-Schema level - user views (restricted)
  • Data Manipulation Language (DML)
  • Provides for Create Insert
  • Update Retrieve (extract)
  • Delete Drop
  • Modify Calculation
  • Report
  • capabilities

11
Three Level Architecture
  • 1. External schema - User Views
  • 2. Conceptual schema - Total database
  • description
  • 3. Internal schema - Physical database

12
The Many Faces of Database
  • Databases can be
  • 1. Transaction Intensive - ATMs Checkouts
  • 2. Decision Support - Browsing for
    trends
  • 3. Mixed-Load - Combination of both
  • 4. Small databases - Few thousand
    records
  • 5. Very Large Database - Many millions or
    trillions
  • (VLDB) of
    records (Banks)
  • 6. Non Traditional - Weather bureau,
    flight plans
  • Computer Aided Design data

13
The Many Faces of Database
  • They can be
  • Data Warehouses
  • Data Marts
  • How is a database size measured ?
  • There are a number of measurements
  • Raw data size
  • Total database size
  • Total usable disk space size (which includes
    media protection such as mirroring)

14
The Many Faces of Database
15
DBMS Requirements
  • Querying Capabilities
  • Data Displays (Presentation)
  • Data entry
  • Data Validation
  • Data Deletion
  • Committing Procedures (of changes)
  • AND Data Integrity, Security, Consistency and
    Concurrency
  • Capabilities

16
Important Database Features
  • Data Integrity
  • Data Independence
  • Referential Integrity - Relational Database Model
  • Concurrency Control - Multiple Users
  • Consistency
  • - multi users
  • - distributed database
  • - replicated database
  • - partitioned database
  • - mobile database
  • Recovery from failure (Transaction and Media)
  • Security

17
File Processing (non database)
Purchasing Program
Billing Program
Accounts Buyers Inventory
Vendors Receivable
Customers
Accounts Payable Sales Order
Payroll
Processing
Vendors Invoice Customers
Inventory Employee
18
File Processing With Database Technology
  • Orders Department Accounting
    Department

Program Program Program
Program Program A
B C A
B
Order Filling
Invoicing system
system
Inventory Back Inventory
Customer Master Orders
Pricing Master

Customers

19
InterRelated File Outline
Sales Stats
Accounts Receivable
Customer
Salesperson
Buyers
Inventory
Vendor
Accounts Payable
General Ledger
Purchase Order
20
Conceptual Data Model
customer
order
invoice
product
work order
raw material
vendor
21
User Views
customer customer
product vendor
customer
order order invoice
raw material
22
Database Models - Hierarchical

owner/parent child /parent
owner
member
child
child/parent
23
Data Base Models - Network
set of data
owner
member
owner
member
set of data
Note Only linked sets can be accessed
24
Data Base Models - Relational
table table
table table
table A B
C
D E
Any table(s) can be joined to any other table(s),
provided there is a means of effecting the
join Primary key / Foreign key concept. Data
redundancy No fixed linkages
25
2 Relations
EMPNUM NAME Date of Birth
DEPTNUM 3 JONES
16-05-1956 605
7 SMITH 23-09-1965 432 11
ADAMS 11-08-1972 201 15
NGUYEN 23-10-1964 314 18 PHAN 16-11-197
6 201 Relation (Table) Name EMP Relation
Schema EMP(empnum,name,date of birth,deptnum)
DEPTNUM DEPTNAME 201
Production 314 Finance 432
Information Systems 605 Administration
Relation (Table) Name DEPT Relation Schema
DEPT(deptnum, deptname)
26
Definition of a Relational Database
  • A relational database is a collection of
    relations or two-dimensional tables.

Database
27
Data Models
Model ofsystemin clientsmind
Entity model ofclients model
Table modelof entity model
Tables on disk
28
Communicating with a RDBMS Using SQL
29
Relational Database Management System
Server
User tables
Data dictionary
30
Relational DataBase
  • Data is represented in ROW and COLUMN form
    (matrix)

  • (attribute)
  • Collections of related data ---gt TABLES
    (relations)
  • 1 or more tables ----gt DATA BASE
  • ATTRIBUTES are generally static
  • ROWS are DYNAMIC and Time-Varying
  • The number of Attributes DEGREE of a table
  • The number of Rows CARDINALITY of a
    table

31
Some RDB Considerations
  • Data is held in tables
  • No order of data in tables - row or attribute
  • Concept of Foreign Key - Primary Key relationship
  • Data Typing - including nulls
  • Query Access - insert, update, delete, retrieval
  • Indexing on candidate (and Primary) keys
  • Integrity Constraints
  • Attribute value ranges
  • Referential Integrity
  • Entity Integrity
  • User Defined Integrity
  • Set retention constraints

32
Some RDB Considerations
  • Domain constraints
  • User defined Rules e.g. no booking of rooms for
    lectures after 4.00pm Fridays (also known as
    Business Rules)
  • Recovery procedures. Used to restore a database
    after a failure
  • No explicit linkages between tables
  • Linking or embedding database operations in a
    procedural language (Cobol, C ..)
  • Databases may be distributed across similar or
    different DBMSs
  • Security features

33
Database Components
  • 1. Back End Engine
  • Used for Disk Input/Output processes
  • (Read/Write/Find)
  • 2. Front End Processor
  • Data manipulation
  • String/Arithmetic/Statistical operations
  • 3. DBMS Interface
  • Data Definition Language (DDL) Data
    Manipulation Language (DML)
  • 4.Programmer Interface
  • Applications Environment (4GLs, Embedded
    capability)

34
Data Description Language
  • Used to describe data at the Database level
  • Terms Used
  • Structure and Attributes
  • Schema Complete description of the database
    using DDL
  • SubSchema Describes data in the database as it
    is known to individual programs(processes) or
    users
  • A segment or part of logical data record(s)
    required
  • is commonly known as a VIEW

35
Data Manipulation Language
  • Language (commands and syntax) used to cause
    transfers of data from the Database and the
    Operating Environment and vice versa
  • Variety of Models - Access, DB2, dBASEV, MySQL,
    SQLServer

  • VisualDataBase, DataBoss, Ingres,
  • Oracle,
    Informix ....
  • Windows versions provide Icons and Menu options
    which are translated by the DBMS software to
    Database manipulation commands
  • Typical commands get, put, replace, seek,
    update,delete, insert, drop,
    find, modify

36
Application Systems
users(ATMs) users users users
C programs
DBMS
Cobol
Access
37
DBMS Components

users
Database Management System
Program Language Interface
Application Programs
Utility Programs
Retrieval Update
Database
38
A Practical Development Scenario
In 198n , a Company decided to develop and
introduce a Payroll system using database
technology. It looked this this
Payroll System
Payroll Data
39
A Practical Development Scenario
  • The Company grew in numbers and expertise and
    decided
  • to introduce a Personnel System. The new design
    was this

40
A Practical Development Scenario
In the next few years, these components were added
Job History
Labor Distribution
Labor Analysis Data
Employee Tracing Data
41
Advantages of DataBase
  • Reduced Data Redundancy
  • Data Integrity
  • Data Independence
  • Data Security
  • Data Consistency
  • Easier use of Data via DBMS Tools (Query
    languages, 4GL's)

42
Disadvantages of Database
  • Complexity
  • Expense
  • Vulnerability
  • Size of - disk storage, processor memory
  • Training Costs
  • Compatibility
  • Technology Lock In

43
(No Transcript)
44
Data Types
  • Used to augment an attribute description and to
    provide a means of Integrity
  • Normal data types are
  • Character (or text)
  • Numeric - Integer, Decimal, Money, Float
  • (in Access Long and Short Integer,
  • Decimal with options of a number of decimal
    places)
  • Date - Standard date format - Access has
    various forms
  • Logical - Yes/No True/False Exists/Does
    Not Exist

45
Attribute Size
  • In many cases this is set by the Data Definition
    facilities
  • e.g.. Date, Short Integer, Long Integer,
    Logical,Number
  • Others are set by the Designer
  • Number of Characters, Size of a Decimal
    Number
  • Access allows for a Default value
  • Duplicates/ No Duplicates
    allowed
  • Primary Key nomination
  • Indexing
  • Required Status of an Attribute

46
Integrity Examples
  • Primary Key - Must have a value - not null
  • - Must not be duplicated in the same table
  • Attribute values must exist (or not)
  • student record
  • student identity number - must exist, not
    duplicated, must be a valid number
  • student name - must exist, may be
    duplicated
  • student course code - must exist, must be a
    valid course
  • subject code - if enrolled, must
    exist, code must be
    a valid code
  • subject result - dependent on
    time. May be null. If it
    exists must be a valid grade and
    mark

47
Integrity Examples
  • Questions
  • 1. Do the same constraints exist in Excel, Word
    ?
  • 2. Should they ?
  • 3. What are alternatives for embedding the same
    or similar controls
  • 4. Are spreadsheets less reliable than
    databases ?

48
Queries
  • A Query acts of the base table or tables of a
    database and returns a subset of this data.
  • A Query normally returns a restricted set of
    attributes (and their current values - this is
    the User View of the database
  • A Query normally has some criteria
  • e.g. salaries gt 50,000
  • outstandings gt 30 days
  • date (of some event , gt or lt some
    designated date)
  • Criteria can be linked event A OR event B,
  • event A
    AND event B

49
Reports
  • A more formal output of data from base tables and
    in many cases produces high volume.
  • The design is formatted
  • Page numbering, Headings, Footers,
  • Page breaks, Page or Item totals and sub-totals
  • Reports can be criteria based and include
    calculations (derived data)
  • Report content can be imported from another
    source, or exported to another target.
  • Reports can contain Exception Full Graphic
    Hypertext information and may be hard copy,
    electronic copy,

50
A Puzzle
How can these dots be joined by 4 straight lines
without lifting a pencil (or pen) from the
surface ?
51
A Puzzle
1
2
3
No, thats 5 lines
4
5

52
A Puzzle
1
2
4
5
Thats worse - its 6 lines
6
3

53
A Puzzle
1
How about this ?

54
A Sunburnt Country
  • This could be a commentary of the Summer weather
  • I love a sunburnt country
  • A land of sweeping plains,
  • Of ragged mountain ranges,
  • Of droughts and flooding rains
  • I love her far horizons,
  • I love her jewel-sea,
  • Her beauty and her terror -
  • The wide brown land for me !

55
Well, that was about 'Structured' Data. What
about 'Non Structured' Data ?
56
Non Structured Data
  • Up to date, we have studied some of the methods
    of file organisation associated with efficient
    data retrieval as embodied in the relational data
    base model.
  • However, although industry has invested large
    amounts of money in relational database
    applications, the greater percentage of data
    access requirements are made of non-structured
    data sources (a classic one being the World Wide
    Web).
  • This lecture will uncover some of the techniques
    which locate and release data of this
    non-structured nature.

57
Information Retrieval
  • Non - Structured i.e. Occurrence of
    'Attribute Values'
    neither regular nor regulated
  • Processing Requirements
  • - Mainly to reveal/release textual information
  • - Information searches on 'key terms'
  • - May be further processed by computer systems
  • (e.g. Spreadsheets, Mathematical Models,

    Simulations)

58
Information Retrieval
User Expectations - All 'relevant' references
will be found and
released (Compare to
structured DBMS closure theory) Note
Approximately 70 - 80 of 'Management
Information' is from Non-structured Databases
59
Information Retrieval
  • Consider the following
  • When I do count the clock that tells the time,
  • And see the brave day sunk in hideous night,
  • When I behold the violet past prime,
  • And sable curls all silvered o'er with white
  • When lofty trees I see barren of leaves,
  • Which erst from heat did canopy the herd
  • And summer's green all girded up in sheaves

60
Information Retrieval
Borne on the bier with white and bristly
beard Then of thy beauty do I question make That
thou amongst the wastes of time must go, Since
sweets and beauties do themselves forsake, And
die as fast as they see others grow, And nothing
against time's scythe can make defence Save breed
to brave him when he takes thee hence
Shakespeare, Sonnet No.12
61
Information Retrieval
  • Or , consider this
  • Now is the winter of our discontent
  • Made glorious summer by this sun of York
  • And all the clouds that loured upon our house
  • In the deep bosom of the ocean buried
  • Now are our brows bound with victorious wreaths
  • Our bruised arms hung up for monuments
  • Our stern alarums changed to merry greetings
  • Our dreadful marches to delightful measures
  • Shakespeare Soliloquy ? ? ?

62
Information Retrieval
  • Finally, another extract
  • The peasants who survived the plague found
    themselves in many cases afflicted by fresh
    burdens, for with fewer people to work the land,
    overlords demanded a standstill in wages and a
    return to feudal duties in full.
  • But, with the shortage of labour, workers
    naturally expected to be valued more highly and
    to be given better pay and more freedom.
  • Europe 14/15th Century
  • A history of the world - Rjunstead ( perhaps
    Enterprise Bargaining ?)

63
Information Retrieval
  • Definition An Information Retrieval System
    stores items of Information which need to be
  • Processed
  • Searched
  • Retrieved
  • Analysed, Condensed, Explained and Sent to User
    Populations
  • Some Concerns
  • Data Base Management
  • Decision Support (as with other Information
    Systems)

64
Information Retrieval
  • Requirements
  • Must have efficient storage organisations
  • Rapid search procedures
  • Effective dissemination and user interaction

65
Information Retrieval
  • Information Retrieval Systems are used to handle
    data in various forms such as
  • Bibliograpic
  • Textual
  • Audio
  • Video

66
Information Retrieval
  • Allied Topics
  • Information Theory
  • Probability Theory
  • Computational Semantics
  • Programming Theory
  • Algebra

67
Information Retrieval
  • Measurement of Usefulness
  • 1. Currency
  • 2. Completeness
  • Some queries
  • 1. Excess Information
  • 2. Obsolete (?) Information

68
Information Retrieval
  • Resource Management Is it Important ?
  • Growth rates Up to 1800, the rate of
    publication doubling every 50 years
  • 1800 to 1996 gt 100,000 scientific
    publications
  • 1996 gt 400,000 scientific
    publications
  • 2000 gt 680,000 scientific
    publications
  • Impact of World Wide Web
  • plus the translation
    factor

69
Information Retrieval
  • Some Information Organisation Problems
  • Volume expansion is not evenly distributed for
    all topics
  • Location of related items
  • Relationship methods (trace, link, chain)

70
Information Retrieval
  • Key Functions of Information Retrieval Systems

Input Process
Output
Indexing Language
Information Items
Requests
(mapping)
Organised for Search Formulation
Indexing Processes
Request Representation Information
Representation
71
Information Retrieval
  • File Structures
  • 1. Linear Lists
  • 2. Ordered Sequential Files
  • 3. Indexed Files
  • 1. Linear Lists
  • UNordered collection of items
  • Items examined one at a time
  • Average Search (n 1)/2

72
Information Retrieval
  • 2. Ordered Sequential files
  • Concept of keys Limited to items of special
    importance such as Name of Author
  • File ordered on these keys
  • Searching
  • (n1)/2 same as linear list
  • log2(n1)

73
Information Retrieval
  • 3 (a) Indexed Files
  • Creation of a smaller, referenced table to the
    main file
  • Provides a pointer to the location of items in
    the main file
  • (memory and/or disk storage considerations)
  • Frequently linked with a subsequent sequential
    search
  • Can be extended to cover multiple keys - the
    major disadvantage is the necessity to rebuild
    main file and index

74
Information Retrieval
  • 3(b). Inverted File
  • Data organisation creates an Index for all Unique
    Key Values
  • Files are arranged in order by TOPIC
  • Each Topic includes a corresponding list of item
    numbers
  • This produces a quick access method - smaller
    table
  • A Secondary index (an index of the index) is also
    possible

75
Information Retrieval
  • Item No 1 2 3 4 5
  • Author Ash Brown
    Jones Reynolds Smith
  • Aspects of A Survey
    A History The State of Users of
  • Title Computerised of users of of
    Computer the Art of New
  • Information Information
    Systems Retrieval Retrieval
  • Retrieval Retrieval
    Systems
    Systems
  • Topic Computer Information
    Computer Information Retrieval
  • Information Retrieval
    Systems Retrieval System
  • Retrieval Users
    Systems
    Users
  • Systems

76
Information Retrieval
  • Inverted File Organisation Sample
  • Related Information Items

Computer Information Retrieval Systems Users
Additions and Deletions require Index
modifications
1 3 1 2 4 1 2
4 5 1 3 4
5 2 5
Topic
Inverted Index
77
Information Retrieval
  • Inverted Files in a Commercially Available I.R.S.
  • Boolean logic associative terms queries
  • Operators AND OR NOT
  • Functions used Set Intersection
  • Set Union
  • Set Difference

78
Information Retrieval
79
Information Retrieval
Item No 1 2 3 4 5 Author
Ash Brown Jones
Reynolds Smith
Aspects of A Survey A History
The State of Users of Title
Computerised of users of of Computer the
Art of New Information
Information Systems Retrieval
Retrieval Retrieval
Retrieval Systems
Systems Topic Computer
Information Computer Information
Retrieval Information
Retrieval Systems Retrieval
System Retrieval Users
Systems
Users Systems
80
Information AND Retrieval
  • Based on the data given-
  • 1. Use Inverted Index to Locate the reference
    Numbers for the term Information
    Set 1
  • 2. Ditto for the term Retrieval Set 2
  • 3. Intersect of Set 1 and Set 2 Set 3
  • 4. Use set 3 to retrieve documents identified in
    this set.

81
Information OR Retrieval
  • Create Set 1 as before on the occurrences of the
    term Information
  • Create Set 2 as before on the occurrences of the
    term Retrieval
  • Combine Set 1 and Set 2 Set 3
  • This is known as Set Union

82
Information NOT Retrieval
  • Perform the formation of Set 1 and Set 2 based on
    the
  • occurrence of the term Information and also
    the term Retrieval
  • Remove any reference from Set 1 which is included
    in Set 2
  • to produce a final set, Set 3
  • This is known as Set Difference
  • And this is more Information. Its
  • informing me its time to stop.

83
Some Niceties
  • Sample search X and Y or Z
  • Start scan from Left, move to Right
  • Start scan from Right, move to Left
  • Partial scan for x number of identical
    characters, again from Left to Right or Right to
    Left
  • Perform all ANDs then all ORs
  • Results would be (X and Y) OR Z (Left to
    Right)
  • (X AND (Y OR Z)
    (Right to Left)
  • Note This is not a Political movement of members

84
Additional Features
  • 1. Adjacency Operations
  • Search items contained in adjacent word
    positions
  • Can also be used in Inverted Index to point to
    Word location within each document.

85
Additional Features
  • Document Reference No.
  • Paragraph No.
    Document Hierarchy
  • Sentence No.
  • Word No
  • Retrieval (345 1 2 5)
  • Could also be a Word Count from the beginning of
    text
  • (distance indicator)
  • Retrieval (345 13) interpreted as Document 345,
    13 words from Beginning of Text (B.O.T)

86
Some Commercial Inverted File Systems
  • 1. Dialog Uses Select and Combine
  • 2. Stairs Utilities to create and maintain
    database OnLine retrieval system AQUARIUS -
    includes
  • a dictionary
  • 3. Bibliographic Retrieval Services System (BRS)
  • Based on Stairs
  • 4. Medlars system (National Library of Medicine)
  • Uses Index File Posting
    File Data File
  • 5. Orbit system
  • 6. Lexis system

87
Dynamic Item Searching
  • A Binary or simple tree search is not viable for
    dynamic file conditions - the search depth and
    thus time is high

Root ---gt
Node
Number of Levels (Height)
If the number of levels could be reduced (e.g.
increase the number of values per node (known as
order) then access times would improve. B-Tree,
B Tree and B Tree address this
requirement Access formula is logdn where d
order
88
B Tree Example Document Clustering
  • Network

Catalog Hardware Morpheu
Review Synonym
Number Publication
Apparatus Biomedicine
Statistics Stem Structure
Encyclopedia Grammar
File
Search for Query File
89
String Matching - Boyer and Moore
  • Based on 1. Analysis of query pattern
  • 2. Auxiliary table
  • 3. Character mapping
  • 1 2 3 4 5 6 7 8 9 10 11
  • L E C T U R E R O O M
  • L occurs in position 1 E occurs in
    positions 2 and 7
  • C occurs in position 3 T occurs in position
    4
  • U occurs in position 5 R occurs in positions
    6 and 8
  • O occurs in positions 9 and 10 M occurs in
    position 11
  • Matching starts ltRightgt character and
    progressively shifts left on match

90
Information Management
  • Data retrieval - an application package
  • TITAN Museums, Births Deaths and Marriages
    Registries
  • Art Galleries, Market Research,
    News Services,
  • Hansard, Reference Libraries
  • Platforms Unix, AIX, SunO/S, PCs to Mainframes
  • Method - Signature Files - A bit string for each
    record in the
  • data file. Uses Indexed terms

91
Information Management
  • 2 stages necessary
  • 1. Create descriptor for each Indexed term
  • 2. Superimpose each term descriptor (exclusive
    OR-ing)
  • Professional ---gt
    100100 .......... (32 bits)
  • term Computing ---gt 010100
    ..........
  • descriptors Magazine ---gt 000101
    ..........
  • Record descriptor 110101 (inclusive OR)
  • Record descriptors ---gt record descriptor file
    Pointers

92
Some Calculations
  • Inverted Files
  • Number of searches no. of steps to search
    index
  • (n1)/2 for subfile
  • Assume 106 records.
  • 1.Sequential Search specific character
  • (n1)/2 500,001 steps average
  • 2. Indexed file
  • Assume that the number of records beginning with
    the specified character is 50,000
  • Index entries 26
  • No. of search steps (261)/2 14
  • plus (50,000 1)/2 25001 Total
    25,015
Write a Comment
User Comments (0)
About PowerShow.com