Title: Raw%20Data%20engine
1Raw Data engine
- Perl a Generic Cure
- For a Perennial Problem
2The Task
- Get from a customer data files in his format
specs - Create from these files data understood by ALEPH
system
3General Description
Input
Output
Manipulation
4Input - Overview
Input
Record
Output
Manipulation
Record
Record
Record
Record
Record
5Input Structure
Input
field,field, field, field, field, field
Output
Manipulation
Record
Record
Record
Record
Record
6Input Variety
Input
field,field, field, field, field, field
fieldfieldfieldfieldfieldfield
f1 f2 f3 f4 f5
7Input - Multiline
Input
Rec name value name value name
value ..
Rec name value name value name
value ..
8Make Input Uniform
- All input into a hash
- input_hash
- input_hashf1 f1_val
- input_hashf2 f2_val
9Repeatable Fields
32137001535410 I0010.. 32137001535410
I0020.. 005
I0050.. 1992-07-29
I0060.. TB55
I0250.. 0004688683
I0300.. PERP
I0330.. PERP
I0360.. WCC3
I0390.. 0.00
I0250.. 0004688684
I0300.. PERP
I0330.. PERP
I0360.. WCC3
I0390.. 5.00
10Its a Perl Hash
- Some of input hash entries become pointers to
lists - x \input_hash
- 0 HASH(0x1a7e8f0)
- 'I0010' gt 32137001535410
- 'I0020' gt 005
- 'I0050' gt '1992-07-29'
- 'I0060' gt 'TB55'
- 'I0250' gt ARRAY(0x1c3e990)
- 0 0004688683
- 1 0004688684
- 'I0300' gt ARRAY(0x1c3e96c)
- 0 'PERP'
- 1 'PERP'
- 'I0330' gt ARRAY(0x1d40328)
- 0 'PERP'
- 1 'PERP'
- 'I0360' gt ARRAY(0x1d4031c)
- 0 'WCC3'
- 1 'WCC3'
11Some Code
- no value yet simple scalar
- unless (defined input_hashcurrent_name)
input_hashcurrent_name value already
value for key - else already list just push
- if (ref input_hashcurrent_name)
push _at_input_hashcurrent_name, value
only scalar make list with existing
new - else my _at_lst (input_hashcurrent_na
me, value) input_hashcurrent_name
\_at_lst
12Output
- Lets look at one typical output
- Each record is one simple string
- Each field has a fixed width
- (Typical for file to be imported into DB)
13Output Needed Structures
- To build a correct output record we need
- The values value per field
- Their correct order
- Their correct format (I.e. witdh per field)
14Output Info Source
- How do we know what the output record looks like?
- From DB describe of relevant table
- From Data Definition file (COBOL copy, C h file
etc) - From some special purpose definition file
15Output From COBOL Copy
- 01 Z71.
- 02 Z71-REC-KEY.
- 03 Z71-SEQUENCE-TYPE PICTURE
X(1). - 03 Z71-DOC-NUMBER PICTURE
9(9). - 03 Z71-COPY-ORDER-SEQUENCE PICTURE
9(5). - 03 Z71-REC-SEQUENCE PICTURE
9(15). - 02 Z71-USER-NAME PICTURE
X(10). - 02 Z71-OPEN-DATE PICTURE
9(8). - 02 Z71-OPEN-HOUR PICTURE
9(4). - 02 Z71-OPEN-SECOND PICTURE
9(4). - 02 Z71-ACTION-DATE PICTURE
9(8). - 02 Z71-TYPE
PICTURE X(2). - 02 Z71-ALPHA
PICTURE X(1). - 02 Z71-TEXT
PICTURE X(200). - 02 Z71-DATA
PICTURE X(2000). - 02 Z71-TRIGGERED PICTURE
X(1).
16Field Order
- _at_z71_cols_copy (
- 'z71_sequence_type',
- 'z71_doc_number',
- 'z71_copy_order_sequence',
- 'z71_rec_sequence',
- 'z71_user_name',
- 'z71_open_date',
- 'z71_open_hour',
- 'z71_open_second',
- 'z71_action_date',
- 'z71_type',
- 'z71_alpha',
- 'z71_text',
- 'z71_data',
- 'z71_triggered'
- )
17Output format
- z71_format_copy '-1s09s05s015s
- -10s08s04s04s08s-2s-1s-200s-2000s-1s'
- A regular Perl format for (s)print based on the
COPY info - Each character string (X(n)) is left justified
- Each number is right justified ans zero padded.
18Writing Correct Output
- Now suppose we have built zhash with values
corresponding to the Z71 copy. - The following code writes out a record
- foreach my f (_at_z71_cols_copy) push
_at_out_vals, zhashf - sprintf receives the format and a list
- out_str sprintf(z71_format_copy ,
- _at_out_vals)print out_str . "\n"
19Making It Generic
- current_cols ztable . _cols_copy
- cols current_cols
- current_format ztable . _format_copy
- format current_format
- foreach my f (_at_cols) push _at_out_vals,
zhashf - sprintf receives the format and a list
- out_str sprintf(format ,
- _at_out_vals)print out_str . "\n"
20Output Sum up
- To Create
- Read all relevant Copy files
- For each znn table create znn_cols_copy,
znn_format_copy - Use DataDumper to store in zvars_copy.txt
-
-
21Output Sum up
- To use
- do zvars_copy.txt
- establish current ztable
- use glob to get correct cols and format
- write output
22Input Simplifying Variety
- Define to which type belongs current input
- Fields separated by delimiter specify
delimiter, name fields - Fields with fixed widths specify names and
widths - Record multiline specify new record, field
format - Others (cvs, Excel etc.)
23File Type Definitions
- auth_kul_ext type fixed
- auth_kul_ext names d10-1lnk2-19d220-40va
l41-70 - ser_copy_data type fixed
- ser_copy_data names CNUM8LOCATION6
- ser_copy_data names COPY_NUM5SUPPLIER12
- item_data type delimited
- item_data names numpricedate
- cash_data type
- tag_multilinerec14sep2tag5sep3
val - acq_order_data type
tag_multilinedelimgt - acq_order_data new_rec STORE
24Input Implementation
- one module-object per input type
- Each has the new constructor
- Each has a process_infile function for actual
input reading - A module ChooseInputFormat.pm selects the right
object and return a pointer to it. - new just does some preparations based on the
type - process_infile reads the actual file
- for each input record builds input_hash and
calls build_output() received as argument
25Reading Delimited
- names n1n2n3
- _at_lst split //, names
- map name to its position
- ind 0
- foreach (_at_lst)
- lst_names_ ind
- ind
-
- _at_vals split /delim/, _
- foreach my name (keys lst_names)
take value of name from correct position - my val valslst_namesname input_
hashname val
26Reading Fixed
- my _at_lst split /\s\s/, names my
field_num 0 line_pat "" foreach my lst
(_at_lst) my (name, width) split //, lst - position per name
- lst_namename field_num
- unpack for reading
line_pat . 'A' . width . ' ' - my input_hash ()
- my _at_lst unpack (line_pat, _) foreach my
name (keys lst_name) input_hashname
lstlst_namename -
27From In To Out
Manipulation
In File
Out File
input_hash
zhash
???
28General Look
- Z30-DOC-NUMBER sequence,1,9
- Z30-ITEM-SEQUENCE sequencedocno,10,6
- Z30-SUB-LIBRARY PERM_LOC14sub_lib_ma
p - Z30-MATERIAL MATERIALmaterial_map
- Z30-ITEM-STATUS "09"
- Z30-OPEN-DATE INVENTORY_DATEformat_date
- Z30-CALL-NO CALL_NO
- Z30-PRICE PRICEdecimal
- Z30-NOTE-OPAC NOTE1 NOTE2
- Z30-NOTE-CIRCULATION NOTESsubstring_on_word2001
- Z30-NOTE-INTERNAL NOTESsubstring_on_word2
002
29Building zhash Syntax 1
- Z30-ITEM-STATUS "09"
- Build Z30-ITEM-STATUS from literal 09
- Z30-CALL-NO CALL_NO
- Build Z30-CALL-NO from input field CALL_NO as is
- Z30-SUB-LIBRARY PERM_LOC14sub_lib_map
- Build Z30-SUB-LIBRARY from input field PERM_LOC,
take first 4 characters, then map result using
sub_lib_map - Z30-PRICE PRICEdecimal
- Build Z30-PRICE from input PRICE, extract just
the decimal portion
30Building zhash Syntax 2
- Z30-NOTE-OPAC NOTE1 NOTE2
- To build Z30-NOTE-OPAC concatenate input NOTE1,
then , then input NOTE2 - Z30-DOC-NUMBER sequence,1,9
- Build a sequence with delta 1, width 9, leading
zeroes - Z30-ITEM-SEQUENCE sequencedocno,10,6
- Build a sequence with delta 10, width 6, leading
zeroes, restart the sequence for each input docno
with differen values. - 3 input records with docno500
000010,000020,000030 - 2 input records with docno732 000010,000020
31Implementation - 1
- Implementing the functions is done in 2 levels
- Group type
- Using modifiers
- Types
- Just a literal
- An input field no modifiers
- An input field with modifiers(s)
- Concatenation (Each element may include
modifiers) - Various sequences
32Implementation - 2
- For each output field we build an info hash, that
includes - the pointer to the function to call
- the target
- the source
- other info needed per specific function
- All hashes are pushed into a list
33The Functional Backbone
- f ZbuildOutFuncs'func
- handle_funcsfunc_indfunc \f
- handle_funcsfunc_indtarget target
handle_funcsfunc_indsource source - handle_funcsfunc_ind
34Actually Building zhash
- foreach my handle (_at_handle_funcs)
- my res handle-gt'func'(handle, \zhash,
input_hash) -
- Each function has something like
- target handle-gttarget What to build
- source handle-gtsource How to build
- my res
- .
- evaluates Build
- .
- zhashtarget res Prepare for
output - return res
35Getting Values
- sub get_gen_field_val
- my (my_input_hash, column) _at__
- my (org_val, field, modifiers)
- if (defined input_hashcolumn)
- org_val input_hashcolumn
- return org_val
-
- if (column /_\w/)
- org_val zhashcolumn
- return org_val
-
- (org_val, field, modifiers)
analyze_complex(column) - org_val get_modified_val(org_val,
modifiers) - return org_val
-
36Modifying a Value
- sub get_modified_val
- my (org_val, modifiers) _at__
- my non_square '(\\)'
- my _at_mods modifiers /non_square/go
-
- foreach my mod (_at_mods)
- next if mod /\s/
- my (type, action) mod
/(\w)(.)/ - if (type)
- org_val mod_functionstype-gt(org_val,
action) -
- else
- exit_on_err()
-
-
- return org_val
-
37Simple Modifier - uppercase
- sub mod_uppercase my (val, action)
_at__ return uc val - What if val is list ??
38Handling Lists
- sub handle_list my (val, action, func)
_at__ my _at_val_lst foreach my v (_at_val)
my res func-gt(v, action) push
_at_val_lst, res return \_at_val_lst
39Modifier with List
- sub mod_uppercase my (val, action) _at__
return handle_list(val, action,
\mod_uppercase) if ref val - return uc val
40regex
- Definition
- regex_check
- my_reg /\d4\D\d4/
- Code
- sub build_regex_check
- my (name, regex) _ /(\w)\s(.)/
- my code_src "sub my \valshiftmy
\res\valregexreturn \res" - my code eval code_src
-
41Running Index
- We need several running indexes.
- Maybe the easiest way to do it
- sub build_running_index
- my cnt 0
- return sub
- return cnt
-
-
- Each instance
- starts from 0
- increments by 1
42Using it
- cnt1 build_running_index()cnt2
build_running_index()res1 cnt1()res1
cnt1()res2 cnt2()res1
cnt1()print "c1 res1 c2 res2\n"sub
build_running_index my cnt 0
return sub return cnt
output - c1 3 c2 1
43More General
- Different start
- Different delta
- sub build_own_index_gen my (cnt, delta)
_at__ return sub return cnt
delta
44Usage
- cnt1 build_own_index_gen(0, 3)cnt2
build_own_index_gen(5,10)res1
cnt1()res2 cnt2()res1
cnt1()res1 cnt1()print "c1 res1 c2
res2\n"sub build_own_index_gen my (cnt,
delta) _at__ return sub return
cnt delta - output
- c1 9 c2 15
45Timer - 1
- We need several timers
- We also need a total timer
- sub get_time_diff
- my tls time()
- return sub
- my start shift
- my tle time()
- my tlt tle - tls
- tls tle unless start
- return tlt
-
-
46Timer
- Timer for differences
- sub get_time_diff
- my tls time()
- return sub
- my tle time()
- my tlt tle - tls
- tls tle
- return tlt
-
-
47Timer - Usage
- t1 get_time_diff()
- t2 get_time_diff()
- sleep(1)
- r t1-gt()
- r1 t2-gt()
- print r r r1 r1\n"
- sleep(1)
- r t1-gt()
- r1 t2-gt()
- print r r1 r1 r1\n"
- Results
- r 1 r1 1r 1 r1 1
48Timer 2
- Timer with total
- sub get_time_diff
- my tls time()
- return sub
- my start shift
- my tle time()
- my tlt tle - tls
- tls tle unless start
- return tlt
-
-
49Timer 2 - Usage
- t1 get_time_diff()
- t2 get_time_diff()
- sleep(1)
- r t1-gt()
- r1 t2-gt(1)
- print r r r1 r1\n"
- sleep(1)
- r t1-gt()
- r1 t2-gt(1)
- print r r1 r1 r1\n"
- Results
- r 1 r1 1r 1 r1 2
50Timer
- We need several timers
- We also need a total timer
- sub get_time_diff
- my tls time()
- return sub
- my start shift
- my tle time()
- my tlt tle - tls
- tls tle unless start
- return tlt
-
-
51Perl Things
- Hash-Direct mapping of string from source to code
any entry can hold scalar/list/hash/anon. sub -
does it dynamically - ref
- references
- hash-names/list-order
- DataDumper
- file globs
- glob -gt generic code
- tied hash glob
- closure
- sprintf format string values list
- evalconditions, regexp
- regexp qr, quotemeta, \Q
- external modules module defined name
- modules DB_File, DataDumper, dates, Excel, DBI