Title: Cosequential Processing
1. Cosequential Processing
2. Cosequential Processing
- Coordinated processing of two or more sequential lists
- Goals
  - To merge lists into a single sorted list (union): make a single sorted list from many
  - To match records with the same keys (intersection): find entries which exist in multiple lists
  - To apply transactions to a master file
3. Cosequential Processing
- Keys
  - Matching/merging may be by a single key or several.
  - The number of keys only affects the compare operator, not the sort strategy.
4. Master Transaction File Processing
- A common processing strategy on sequential files.
- Common since historically sequential processing was the rule (tapes, cards)
  - Companies stored data in sequential files
  - Lists of transactions were posted against these records periodically.
5. Master Transaction File Processing
- Consider a grocery store
  - A record of inventory for each type of item is stored in a large sequential file (the master file)
  - As items are sold, the item number and quantity sold are posted (written) as records to a transaction file
  - As trucks deliver new items, item numbers and quantities are entered into the transaction file.
  - As new types of items are added to inventory, or old items are discontinued, entries about this are placed in the transaction file.
6. Master Transaction File Processing

Master File:
  Item   Item Name        Type  Quan
  20231  Shoe Shine (br)  6     4
  20231  Shoe Shine (bl)  6     1
  20177  Cottage Cheese   5     392
  20179  Chicken Soup     6     32
  20231  T-bone           2     43
  ...

Transaction File:
  Item   Trans  Quan  Item Name
  20231  U      -2
  20231  U      50
  20379  U      -5
  20443  U      -4
  20445  A      40    Corn Chips
  20532  A      300   Butter
  20534  D
  20558  U      200
  ...

U - Update, A - Add, D - Delete
7. Master Transaction File Processing
- Periodically update the master from the transaction file

[Diagram: Old Master File and Transaction File feed the Update Operation, which produces a New Master File and Update Messages]
8. Master Transaction File Processing
- Transactions are applied against the master.
- A new master is created
- Invalid transactions result in a message
- Important changes are recorded in messages - an audit trail
- Transaction and master files must be in sorted order.
9. Master Transaction File Processing
- Processing Scheme
  - Read record Mast from old Master and Trans from Transaction
  - While more records in both files
    - If Trans.ID < Mast.ID and ADD, write Trans (the new record) to new master
    - else if Trans.ID = Mast.ID then
      - If UPDATE then update record and write to new master
      - If DELETE then continue (no write)
      - else transaction error
    - else write Mast to new master
    - Read the next transaction and/or master record, as each is consumed
  - If more records remain in old master, write them to new master
  - If more records remain in transaction, give errors
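A minimal sketch of this scheme in Python. The record layout (`id`, `op`, `quan` fields) is hypothetical, and at most one transaction per master record is assumed, matching the simple read-both step in the outline above:

```python
def update_master(master, trans):
    """Apply sorted transactions to a sorted master file.
    Ops: 'A' = add, 'U' = update quantity, 'D' = delete.
    Returns (new_master, error_messages)."""
    new_master, errors = [], []
    m = t = 0
    while m < len(master) and t < len(trans):
        mast, tr = master[m], trans[t]
        if tr["id"] < mast["id"]:
            if tr["op"] == "A":
                new_master.append({"id": tr["id"], "quan": tr["quan"]})
            else:
                errors.append(f"no master record for {tr['id']}")
            t += 1
        elif tr["id"] == mast["id"]:
            if tr["op"] == "U":
                new_master.append({"id": mast["id"],
                                   "quan": mast["quan"] + tr["quan"]})
            elif tr["op"] == "D":
                pass  # deleted: simply not written to the new master
            else:
                errors.append(f"duplicate add for {tr['id']}")
            m += 1
            t += 1
        else:
            new_master.append(mast)  # no transaction for this record
            m += 1
    new_master.extend(master[m:])    # leftover master records pass through
    errors.extend(f"unmatched transaction {tr['id']}" for tr in trans[t:])
    return new_master, errors
```

Both inputs must already be sorted by `id`; the single forward pass is what makes sequential (tape-era) processing workable.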
10. Merging
- Merge two (or more) sorted lists into a single sorted list
- May remove duplicates (union) or keep them

  List 1: Bill Gray Hillery Jenny Linda Mary Randy
  List 2: Cathy Fran Kenny Pete Sally Zeke
  Merged: Bill Cathy Fran Gray Hillery Jenny Kenny Linda Mary Pete Randy Sally Zeke
11. Merging
- Merge(List1, Max1, List2, Max2, Result)
  - int next1 = 0, next2 = 0, out = 0
  - while (next1 < Max1 and next2 < Max2)
    - if (List1[next1] > List2[next2])
      - Result[out++] = List2[next2++]
    - else
      - Result[out++] = List1[next1++]
  - while (next1 < Max1) Result[out++] = List1[next1++]
  - while (next2 < Max2) Result[out++] = List2[next2++]
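The two-way merge pseudocode translates directly to Python (a sketch; duplicates are kept, as in the merged name list above):

```python
def merge(list1, list2):
    """Merge two sorted lists into one sorted list, keeping duplicates."""
    result = []
    next1 = next2 = 0
    while next1 < len(list1) and next2 < len(list2):
        if list1[next1] > list2[next2]:
            result.append(list2[next2])
            next2 += 1
        else:
            result.append(list1[next1])
            next1 += 1
    result.extend(list1[next1:])  # copy whichever list has leftovers
    result.extend(list2[next2:])
    return result
```

Each comparison advances exactly one input, so the merge reads both lists serially in a single pass.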
12. Sorting
- Small files
- sort completely in memory
- Called internal sorting.
13. Sorting
- Larger files
- may be too large to fit in memory simultaneously
- require "external sorting"
- Sorting using secondary devices
14. External Sorting
- Criteria for evaluating external sorting algorithms
- Different from internal sorts
- Internal sort comparison criteria
- Number of comparisons required
- Number of swaps made
- Memory needs
- External sort comparison criteria
- Dominated by I/O time
- Minimize transfers between secondary storage and
main memory
15. External Sorting
- Two major external sorting methods
  - in situ - sort the file in place
  - non-in situ - use additional storage space
16. External Sorting
- Characteristics of in situ sorting
  - uses less file space, thus larger files may be sorted
  - if a crash occurs during the sort, the file may be left in a corrupt state
  - in situ sorts may be done on direct-access files using standard internal-type sorts
  - direct access is required (and may not be available)
  - performance of such algorithms tends to be data sensitive
17. External Sorting
- Consider a file with 1000 records, 120 bytes each
- We have 25,000 bytes available for a buffer.
- Solution?
  - read in 200 records at a time, sort internally
  - this results in 5 sorted files
  - merge the resulting sorted files into 1 sorted file
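The 1000-record scenario can be sketched in Python, with in-memory lists standing in for files (illustrative only; a real external sort would write each sorted partition to secondary storage):

```python
import heapq

def external_sort(records, buffer_size):
    """Two-stage sort: internally sort buffer_size-record chunks
    (the sorted partitions), then k-way merge the partitions."""
    partitions = []
    for start in range(0, len(records), buffer_size):
        # each chunk fits the buffer and is sorted with an internal sort
        partitions.append(sorted(records[start:start + buffer_size]))
    # heapq.merge streams the partitions serially, never loading them all
    return list(heapq.merge(*partitions))
```

With 1000 records and a 200-record buffer this produces exactly the 5 sorted partitions described above before the merge stage.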
18. Sort/Merge
- A common non-in situ method is an algorithm called "sort/merge"
- A "safe" sorting technique
  - performance is guaranteed
  - requires only serial file access
19. Sort/Merge

[Diagram: the file is partitioned, each partition is sorted, and the sorted partitions are merged]
20. Sort/Merge
- Sort/Merge techniques have two stages
  - sort stage - sorted partitions are generated
    - size depends on available memory
  - merge stage - sorted partitions are merged (repetitively if necessary)
- Why might more than one merge phase be needed?
21. Basic Sort/Merge
- initial partition size is 1
- Merge begins immediately (no sort)
- Smallest main memory use
- requires only 2 buffers in memory.
- File starts with N "sorted" files of size 1
- Similar to internal merge/sort
22. Improving Sort/Merge
- Increase buffer size
  - partitions are sorted (in memory) with little I/O
  - larger partitions mean fewer (I/O intensive) merges needed
- Take advantage of already sorted runs of data
  - consider the "unsortedness" of the data
23. Sort/Merge
- Producing sorted partitions
- internal sorting
- natural selection - (use already sorted runs)
- replacement selection
24. Internal sorting
- read M records (M determined by available memory)
- sort them using internal sorting techniques
- write back out, creating a partition of size M
25. Sort/Merge
- Replacement selection ("snowshovel")
  - files are usually not totally out of order
  - take advantage of partial ordering in the file
  - partition size varies with the already existing ordering
26. Replacement selection (snowshovel)
- Start with a primary buffer of size N (the snowshovel)
- 1. Read N records into the buffer
- 2. Output the record with the smallest key
- 3. Replace it with the next record in the file
- 4. If this new record is smaller than the last record written, "freeze" it (it must wait for the next partition)
- 5. If unfrozen records remain, go to 2
- 6. If all records are frozen, unfreeze them all, start a new partition, and go to 2
27. Replacement selection (snowshovel)
- If the file is sorted or almost sorted, one pass may suffice for a complete sort!
- Average partition length is 2N
- Consider a file with N = 4:
  - 29 42 3 7 9 101 99 87 89 100 16 8 12 2 15 EOF
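The numbered steps above can be traced with a short sketch, using a min-heap to stand in for the buffer (illustrative only; run it on the N = 4 data to see the partitions):

```python
import heapq

def replacement_selection(records, n):
    """Generate sorted partitions (runs) from records using an
    n-record buffer; frozen records wait for the next run."""
    it = iter(records)
    heap = [x for _, x in zip(range(n), it)]  # step 1: fill the buffer
    heapq.heapify(heap)
    frozen = []
    run = []
    while heap:
        smallest = heapq.heappop(heap)        # step 2: output smallest key
        run.append(smallest)
        nxt = next(it, None)                  # step 3: replace from input
        if nxt is not None:
            if nxt < smallest:
                frozen.append(nxt)            # step 4: freeze - too small
            else:
                heapq.heappush(heap, nxt)
        if not heap:                          # step 6: all frozen (or EOF)
            yield run
            run = []
            heap = frozen                     # unfreeze, start new partition
            heapq.heapify(heap)
            frozen = []
```

On the slide's data the first run already stretches to 10 records - longer than the 2N = 8 average - because the tail of the input happens to be ascending.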
28. Natural Selection
- Frozen records in the replacement scheme take up space and search time.
- Natural selection, rather than freezing, writes these unusable records to a fixed-length secondary file (called the reservoir)
- Partition creation terminates when the reservoir is full.
- Next, the buffer is refilled first with records from the reservoir, then with records from the file (if more are needed)
- Expected partition length is 2.718N (eN) if reservoir and buffer are the same size - about 36% longer than replacement selection's 2N
29. Natural Selection
- Redo the example with reservoir size 4:
  - 29 42 3 7 9 101 99 87 89 100 16 8 12 2 15 EOF
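A sketch of natural selection under one common interpretation of the rule above: reading stops once the reservoir fills, the buffer drains to finish the current run, and the next run starts from the reservoir. On this particular data the runs come out the same as replacement selection's:

```python
import heapq

def natural_selection(records, n, r):
    """Sorted runs via natural selection: an n-record buffer plus an
    r-record reservoir for input too small for the current run."""
    it = iter(records)
    buf = [x for _, x in zip(range(n), it)]
    heapq.heapify(buf)
    reservoir = []
    while buf:
        run = []
        while buf:
            smallest = heapq.heappop(buf)
            run.append(smallest)
            # refill the freed slot; too-small records go to the reservoir
            while len(reservoir) < r:
                nxt = next(it, None)
                if nxt is None:
                    break
                if nxt >= smallest:
                    heapq.heappush(buf, nxt)
                    break
                reservoir.append(nxt)
        yield run  # reservoir full (or input done): run is complete
        # next run: reservoir records first, then fresh input if needed
        buf = reservoir + [x for _, x in zip(range(n - len(reservoir)), it)]
        heapq.heapify(buf)
        reservoir = []
```

Unlike the frozen slots in replacement selection, the reservoir never occupies buffer space, which is where the longer eN expected run length comes from.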
30. Distribution and Merging
- Merging
  - required to bring the sorted partitions together into a sorted whole
  - may require a series of merge phases, where shorter partitions are merged into larger partitions
  - more than one partition per file
    - not all partitions can be opened at once
31. Merging - Single Phase

32. Merging - Multiple Phase

33. Merging - Multiple Partitions / File

[Diagram: multiple partitions per file - e.g. partitions through P5-8 merge into P1-8, which merges with P9-12 into P1-12]
34. Merging
- Major issue - minimizing overall I/O
  - Different-length partitions
    - time is spent simply reading and writing from one file
  - Left-over partitions
    - time is spent simply copying partitions
35. Distribution and Merging
- Distribution
  - In order to merge, partitions must be distributed to files in a manner facilitating the merge process.
  - If 1 partition per file, distribution is trivial
  - If >1 partition per file, distribution should minimize I/O
    - several partitions may be placed in each file
36. Balanced N-way merge
- Use as many files (or tapes) as the system can open at once (F files)
- Distribute the partitions evenly among F/2 files
- Repetitively merge back and forth between one set of F/2 files and the other
  - distribute the generated partitions evenly among the F/2 output files
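A small simulation of the balanced scheme, with lists of runs standing in for the F/2 input and output files (a sketch, not tuned for real I/O):

```python
import heapq

def balanced_merge(runs, f):
    """Balanced F-way merge: runs are spread over F/2 input 'files',
    and corresponding runs are merged pass by pass until one remains."""
    k = max(2, f // 2)  # number of input files per pass
    while len(runs) > 1:
        # distribute runs evenly (round-robin) among the k input files
        files = [runs[i::k] for i in range(k)]
        merged = []
        for p in range(max(len(fl) for fl in files)):
            group = [fl[p] for fl in files if p < len(fl)]
            merged.append(list(heapq.merge(*group)))  # one longer run
        runs = merged  # merged runs become next pass's input
    return runs[0] if runs else []
```

Each pass roughly halves (for F = 4) the number of runs, so the number of passes grows only logarithmically in the initial run count.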
37. Balanced 2-way merge

[Diagram: partitions distributed on Files 1 and 2 are merged to Files 3 and 4 (e.g. P5-8, then P1-8 and P9-12), and finally merged back to File 1 as P1-12]
38. Balanced 2-way merge
- Example: 4 files, 700 records, 100 records can be sorted in primary memory at once

  Sort:     File 1: 1-100  201-300  401-500  601-700
            File 2: 101-200  301-400  501-600
  Merge 1:  File 3: 1-200  401-600
            File 4: 201-400  601-700
  Merge 2:  File 1: 1-400
            File 2: 401-700
  Merge 3:  File 3: 1-700
39. Balanced N-way merge
- Advantage
  - simple
- Disadvantage
  - wastes time if partition sizes differ
  - time is spent reading and writing records without actually merging
40. Polyphase merging
- Strategically distribute the partitions onto F files based on the Fibonacci sequence
- Algorithm
  - During each phase, merge the input files until the end of the smallest one is reached
  - After each phase at least one file will be empty - this file becomes the new place to merge into
  - Continue to merge until only one file remains
41. Polyphase merging
- Consider: initially generate three files
  - 24 partitions, 20 partitions, and 13 partitions
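The three-file example can be simulated by tracking run counts only. One assumption here: a fourth, initially empty file receives each phase's output, and a phase merges one run from every non-empty file until the smallest file empties:

```python
def polyphase_phases(run_counts):
    """Count polyphase merge phases, given initial run counts per file
    plus one empty output file. Stops when a single run remains."""
    files = list(run_counts) + [0]  # the extra 0 is the empty output file
    phases = 0
    while sorted(files) != [0] * (len(files) - 1) + [1]:
        out = files.index(0)                # empty file receives merged runs
        m = min(c for c in files if c > 0)  # phase ends when this empties
        files = [c - m if c > 0 else c for c in files]
        files[out] = m                      # m merged (longer) runs written
        phases += 1
    return phases
```

Starting from (24, 20, 13) the counts shrink 13, 7, 4, 2, 1, 1 runs per phase, so six phases reduce 57 initial runs to one sorted file; only the smallest count's worth of data moves in each phase, which is the scheme's I/O advantage.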
42. Polyphase merging
- Advantages
  - no overhead from merging partitions of different sizes
- Disadvantages
  - complex management of files
  - must know partition sizes
  - still not completely optimal - partition sizes are not always maximal