File Processing: Index and Hash

1
File Processing: Index and Hash
  • 2005, Spring
  • Pusan National University
  • Ki-Joune Li

2
What is an index?
  • Index in a book
  • Index: Keyword → Pages
  • Without an index
  • Exhaustive search: too expensive
  • Index for a file or database
  • A function or mechanism
  • F_Index: S_Predicate → B (block numbers on hard disk)
  • e.g. find student records where student.GPA > 4.0

3
Data Retrieval Time
  • Data retrieval on disk: two phases
  • 1st phase: search with a condition (predicate)
  • 2nd phase: data access

[Figure: a search condition leads to a block; data access time depends on file structure, disk placement, clustering, etc.]
4
Blocking Factor Bf
  • Blocking Factor
  • Number of Records in a Block
  • Blocking factor and number of disk accesses
  • N_D = N_record / Bf
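
As a rough illustration of the relation between the blocking factor and the number of disk accesses, the sketch below counts the blocks an exhaustive scan would read. The function name and the rounding up to whole blocks are assumptions added here; the slide only gives the ratio.

```python
def scan_block_accesses(n_records, blocking_factor):
    """Blocks read by an exhaustive scan: N_record / Bf, rounded up.

    Rounding up is an assumption: a partly filled last block
    still costs one disk access.
    """
    return -(-n_records // blocking_factor)  # integer ceiling division


if __name__ == "__main__":
    # Hypothetical file: 1,000,000 records, 20 records per block.
    print(scan_block_accesses(1_000_000, 20))
```

With a larger blocking factor the same file occupies fewer blocks, so the scan needs fewer disk accesses.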

5
How to Accelerate Phase 1?
  • Phase 1 can be accelerated
  • by an index or by a hash
  • Index vs. hash
  • Index: a type of data structure
  • Needs additional data structures
  • Hash: a type of mechanism
  • May not need any additional data structure (not exactly true)
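
To make the contrast concrete, here is a minimal sketch of the hash idea: the block number is computed directly from the key, with no lookup table. The choice of CRC32 as the hash function, the name `block_for`, and the block count are all assumptions for illustration; the slides do not specify a hash function.

```python
import zlib


def block_for(key, n_blocks):
    """Map a key straight to a block number by hashing it.

    CRC32 is a stand-in hash chosen for illustration only.
    """
    return zlib.crc32(key.encode("utf-8")) % n_blocks


if __name__ == "__main__":
    # Hypothetical file with 256 blocks.
    for name in ("Romeo", "Hamlet", "Carmen"):
        print(name, "-> block", block_for(name, 256))
```

The "not exactly true" caveat shows up as soon as two keys hash to the same block and the block overflows, which usually requires some extra structure to handle.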

6
A Simple Idea on Index
  • Mapping Table from keywords to block numbers
  • Inverted File
  • Why is an inverted file better than nothing?
  • If the table is too large (to fit in main memory)
  • It has to be stored on disk
  • Disk accesses are needed for index access

Keyword   Block
Juliet    (missing)
Romeo     B26
Hamlet    B22
...
Carmen    B212
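
The mapping table can be sketched as a dictionary from keyword to block numbers. The pairing of keywords to blocks below is read off the slide's table and is an assumption where the table is incomplete; the function names are hypothetical.

```python
def build_inverted_file(blocks):
    """Invert a block -> keywords mapping into keyword -> sorted block list."""
    inverted = {}
    for block_id, keywords in blocks.items():
        for keyword in keywords:
            inverted.setdefault(keyword, []).append(block_id)
    for block_list in inverted.values():
        block_list.sort()
    return inverted


def lookup(inverted, keyword):
    """Return the blocks to read for a keyword (empty list if absent)."""
    return inverted.get(keyword, [])


if __name__ == "__main__":
    # Assumed contents, following the slide's example table.
    blocks = {"B26": ["Romeo"], "B22": ["Hamlet"], "B212": ["Carmen"]}
    inverted = build_inverted_file(blocks)
    print(lookup(inverted, "Hamlet"))
```

A lookup touches only the listed blocks instead of scanning the whole file, which is why even this simple table beats having no index.
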
7
Searching Algorithms and Index
  • A good way to accelerate searching
  • Tree: O(log n)
  • Reorganize the inverted file into a tree
  • Binary search tree: branching factor 2
  • Tree in memory space vs. in disk space
  • Memory space: number of comparisons
  • Disk space: number of block accesses

8
Paged Tree: m-way search tree
  • How to determine m?
  • One node: one disk page
  • e.g. when one disk page is 4 KB
  • 4 + 4m + 8(m-1) ≤ 4096 → m = 341
  • Very fat tree
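
The arithmetic for m can be checked with a short calculation. The interpretation of the three terms (a 4-byte header, 4 bytes per child pointer, 8 bytes per key) is an assumption; the slide gives only the numbers.

```python
def max_fanout(page_bytes, header_bytes=4, pointer_bytes=4, key_bytes=8):
    """Largest m with header + m*pointer + (m-1)*key <= page_bytes.

    The field sizes are assumed meanings of the slide's 4, 4m and 8(m-1).
    """
    # header + m*p + (m-1)*k <= page  <=>  m <= (page - header + k) / (p + k)
    return (page_bytes - header_bytes + key_bytes) // (pointer_bytes + key_bytes)


if __name__ == "__main__":
    print(max_fanout(4096))  # 4 KB page
```

Solving 12m - 4 ≤ 4096 gives m ≤ 341.67, so one 4 KB page holds a node with up to 341 children.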

9
Problem of m-Way Search Tree
  • m-way search tree
  • Search performance is determined by the height
  • Not balanced
  • Average: O(log n)
  • Worst case: n / Bf → O(n)
  • Height is determined by insertion order
  • e.g. insertion in ascending order
  • How to make it balanced?
  • Balanced m-way search tree: B-tree
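
The effect of insertion order on height can be seen with a small experiment. For brevity this sketch uses a plain binary search tree (branching factor 2) rather than a general m-way tree, which is a simplification; the function name is hypothetical.

```python
def bst_height(keys):
    """Insert keys into an unbalanced binary search tree; return its height."""
    root = None  # each node is a list: [key, left_child, right_child]
    height = 0
    for key in keys:
        if root is None:
            root = [key, None, None]
            height = 1
            continue
        node, depth = root, 1
        while True:
            side = 1 if key < node[0] else 2  # go left or right
            if node[side] is None:
                node[side] = [key, None, None]
                height = max(height, depth + 1)
                break
            node, depth = node[side], depth + 1
    return height


if __name__ == "__main__":
    print(bst_height(range(1, 101)))          # ascending order
    print(bst_height([4, 2, 6, 1, 3, 5, 7]))  # a friendlier order
```

Ascending insertion degenerates the tree into a chain whose height equals the number of keys, which is the worst case the slide describes.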

10
B-tree
  • B-tree: balanced m-way search tree
  • Root node: no child node, or at least two child nodes
  • Internal node: ⌈m/2⌉ to m child nodes (block numbers)
  • External node: data block numbers instead of child nodes
  • Balanced
  • Upward split
  • instead of the downward split of a binary tree

11
Downward Split
Suppose m = 3
Insert 10, 20
Insert 30
Insert 40
[Figure: tree after each insertion, starting from a node holding 10 and 20]
12
Downward Split
Insert 50
Insert 70
[Figure: tree after each insertion; the surviving node labels are 40, 60, and 70]
13
Meaning of Upward Split
  • Always Balanced
  • Not so much influenced by the order of insertions
  • Internal nodes: ⌈m/2⌉ to m child nodes (block numbers)
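
A minimal sketch of B-tree insertion with upward split, assuming order m = 3 and the ascending key sequence 10 to 70. The class and function names are hypothetical, and deletion and disk paging are left out.

```python
import bisect

M = 3  # order: at most M children and M - 1 keys per node


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []  # empty list for a leaf

    def is_leaf(self):
        return not self.children


def _insert(node, key):
    """Insert key under node; return (middle_key, right_node) if node split."""
    i = bisect.bisect_left(node.keys, key)
    if node.is_leaf():
        node.keys.insert(i, key)
    else:
        promoted = _insert(node.children[i], key)
        if promoted is not None:
            mid, right = promoted
            node.keys.insert(i, mid)
            node.children.insert(i + 1, right)
    if len(node.keys) < M:  # at most M - 1 keys: no overflow
        return None
    half = len(node.keys) // 2  # overflow: split, push the middle key up
    mid = node.keys[half]
    right = Node(node.keys[half + 1:], node.children[half + 1:])
    node.keys = node.keys[:half]
    node.children = node.children[:half + 1]
    return mid, right


def insert(root, key):
    """Insert key and return the (possibly new) root."""
    promoted = _insert(root, key)
    if promoted is None:
        return root
    mid, right = promoted
    return Node([mid], [root, right])  # the tree grows at the root


def height(node):
    return 1 if node.is_leaf() else 1 + height(node.children[0])


if __name__ == "__main__":
    root = Node()
    for k in (10, 20, 30, 40, 50, 60, 70):
        root = insert(root, k)
    print(root.keys, height(root))
```

Because a full node pushes its middle key to the parent and new levels appear only at the root, every leaf stays at the same depth even for ascending insertions.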

14
Search by B-tree
[Figure: search for key 45 in a B-tree with nodes 40, 20, 60, 50, 10, 30, and 70; the search ends in Not Found]
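
The search for key 45 can be traced with a short sketch. The tree shape used here (root 40, internal nodes 20 and 60, leaves 10, 30, 50, 70) is an assumption inferred from the node labels on the slide, and the tuple representation is chosen only for brevity.

```python
import bisect


def leaf(key):
    """A leaf node: (keys, no children)."""
    return ([key], [])


# Assumed tree shape; each node is a (keys, children) pair.
TREE = ([40], [([20], [leaf(10), leaf(30)]),
               ([60], [leaf(50), leaf(70)])])


def search(node, key):
    """Return (found, nodes_visited) for an exact-match search."""
    visited = 0
    while True:
        keys, children = node
        visited += 1
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return True, visited
        if not children:  # reached a leaf without a match
            return False, visited
        node = children[i]


if __name__ == "__main__":
    print(search(TREE, 45))
    print(search(TREE, 30))
```

The unsuccessful search visits one node per level before giving up, so its cost is bounded by the depth of the tree.
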
15
Performance of B-tree
  • Number of comparisons within a node: trivial
  • Number of nodes to visit: depth
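
A rough estimate of the depth, assuming every node has exactly the same number of children (real B-tree nodes may be only half full, so this is a best case). The function name and the example sizes are hypothetical.

```python
def levels_needed(n_keys, fanout):
    """Smallest h such that fanout**h >= n_keys (idealized full tree)."""
    h, reach = 0, 1
    while reach < n_keys:
        reach *= fanout
        h += 1
    return h


if __name__ == "__main__":
    n = 1_000_000
    print(levels_needed(n, 2))    # binary tree
    print(levels_needed(n, 341))  # one node per 4 KB page
```

Under this idealization a million keys need about 20 levels with branching factor 2 but only 3 with branching factor 341, which is why a fat paged tree keeps the number of node visits small.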

16
Problem of B-tree
  • Types of Search
  • Exact Match Search
  • Range Search
  • E.g. find students where 25 < student.GPA < 50
  • B-tree
  • Good for Exact match search
  • Bad for range search

17
B+-tree
  • A variant of B-tree
  • Duplicate all elements at leaf nodes (external nodes)
  • Linked list of leaf nodes
  • Performance
  • Exact match search and insertion
  • A small performance sacrifice
  • Range search: much more powerful than B-tree
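
The benefit of linking the leaf nodes can be sketched as follows for this variant (commonly called a B+-tree). Only the leaf level is modeled; the descent from the root is replaced by a binary search over the first key of each leaf, which is a simplification, and all names are hypothetical.

```python
import bisect


class Leaf:
    def __init__(self, keys):
        self.keys = keys  # sorted keys stored in this leaf
        self.next = None  # link to the right sibling leaf


def build_leaves(sorted_keys, per_leaf):
    """Pack sorted keys into leaves and chain them left to right."""
    leaves = [Leaf(sorted_keys[i:i + per_leaf])
              for i in range(0, len(sorted_keys), per_leaf)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves


def range_search(leaves, low, high):
    """Return (keys with low <= key <= high, number of leaves read)."""
    if not leaves:
        return [], 0
    # Stand-in for the root-to-leaf descent: find the leaf that may hold low.
    firsts = [lf.keys[0] for lf in leaves]
    start = max(bisect.bisect_right(firsts, low) - 1, 0)
    node, result, reads = leaves[start], [], 0
    while node is not None:
        reads += 1
        for k in node.keys:
            if k > high:
                return result, reads
            if k >= low:
                result.append(k)
        node = node.next  # follow the sibling link, no return to the root
    return result, reads


if __name__ == "__main__":
    leaves = build_leaves([10, 20, 30, 40, 50, 60, 70, 80], per_leaf=2)
    print(range_search(leaves, 25, 55))
```

After locating the first leaf, the search only follows sibling links, so the number of leaves read grows with the size of the answer rather than with repeated descents from the root.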

18
B+-tree Example
19
Range Search with B+-tree
  • Find students where GPA > 3.5

20
Performance of B+-tree
  • Performance
  • Determined by the depth d
  • Exact match search and insertion (without split)
  • d node (page) accesses
  • Range search
  • node accesses (nq: number of records to retrieve)