Raw%20Data%20engine - PowerPoint PPT Presentation

About This Presentation
Title:

Raw%20Data%20engine

Description:

Raw Data engine. Perl a Generic Cure. For a ... Get from a customer data files in his format specs. Create from these files data understood by ALEPH system ... – PowerPoint PPT presentation

Number of Views:19
Avg rating:3.0/5.0
Slides: 52
Provided by: mot56
Category:
Tags: 20data | 20engine | aleph | raw

less

Transcript and Presenter's Notes

Title: Raw%20Data%20engine


1
Raw Data engine
  • Perl a Generic Cure
  • For a Perennial Problem

2
The Task
  • Get from a customer data files in his format
    specs
  • Create from these files data understood by ALEPH
    system

3
General Description
Input
Output
Manipulation
4
Input - Overview
Input
Record
Output
Manipulation
Record
Record
Record
Record
Record
5
Input Structure
Input
field,field, field, field, field, field
Output
Manipulation
Record
Record
Record
Record
Record
6
Input Variety
Input
field,field, field, field, field, field
fieldfieldfieldfieldfieldfield
f1 f2 f3 f4 f5
7
Input - Multiline
Input
Rec name value name value name
value ..
Rec name value name value name
value ..
8
Make Input Uniform
  • All input into a hash
  • input_hash
  • input_hashf1 f1_val
  • input_hashf2 f2_val

9
Repeatable Fields
32137001535410 I0010.. 32137001535410
I0020.. 005
I0050.. 1992-07-29
I0060.. TB55
I0250.. 0004688683
I0300.. PERP
I0330.. PERP
I0360.. WCC3
I0390.. 0.00
I0250.. 0004688684
I0300.. PERP
I0330.. PERP
I0360.. WCC3
I0390.. 5.00
10
Its a Perl Hash
  • Some of input hash entries become pointers to
    lists
  • x \input_hash
  • 0 HASH(0x1a7e8f0)
  • 'I0010' gt 32137001535410
  • 'I0020' gt 005
  • 'I0050' gt '1992-07-29'
  • 'I0060' gt 'TB55'
  • 'I0250' gt ARRAY(0x1c3e990)
  • 0 0004688683
  • 1 0004688684
  • 'I0300' gt ARRAY(0x1c3e96c)
  • 0 'PERP'
  • 1 'PERP'
  • 'I0330' gt ARRAY(0x1d40328)
  • 0 'PERP'
  • 1 'PERP'
  • 'I0360' gt ARRAY(0x1d4031c)
  • 0 'WCC3'
  • 1 'WCC3'

11
Some Code
  • no value yet simple scalar
  • unless (defined input_hashcurrent_name)
    input_hashcurrent_name value already
    value for key
  • else already list just push
  • if (ref input_hashcurrent_name)
    push _at_input_hashcurrent_name, value
    only scalar make list with existing
    new
  • else my _at_lst (input_hashcurrent_na
    me, value) input_hashcurrent_name
    \_at_lst

12
Output
  • Lets look at one typical output
  • Each record is one simple string
  • Each field has a fixed width
  • (Typical for file to be imported into DB)

13
Output Needed Structures
  • To build a correct output record we need
  • The values value per field
  • Their correct order
  • Their correct format (I.e. witdh per field)

14
Output Info Source
  • How do we know what the output record looks like?
  • From DB describe of relevant table
  • From Data Definition file (COBOL copy, C h file
    etc)
  • From some special purpose definition file

15
Output From COBOL Copy
  • 01 Z71.
  • 02 Z71-REC-KEY.
  • 03 Z71-SEQUENCE-TYPE PICTURE
    X(1).
  • 03 Z71-DOC-NUMBER PICTURE
    9(9).
  • 03 Z71-COPY-ORDER-SEQUENCE PICTURE
    9(5).
  • 03 Z71-REC-SEQUENCE PICTURE
    9(15).
  • 02 Z71-USER-NAME PICTURE
    X(10).
  • 02 Z71-OPEN-DATE PICTURE
    9(8).
  • 02 Z71-OPEN-HOUR PICTURE
    9(4).
  • 02 Z71-OPEN-SECOND PICTURE
    9(4).
  • 02 Z71-ACTION-DATE PICTURE
    9(8).
  • 02 Z71-TYPE
    PICTURE X(2).
  • 02 Z71-ALPHA
    PICTURE X(1).
  • 02 Z71-TEXT
    PICTURE X(200).
  • 02 Z71-DATA
    PICTURE X(2000).
  • 02 Z71-TRIGGERED PICTURE
    X(1).

16
Field Order
  • _at_z71_cols_copy (
  • 'z71_sequence_type',
  • 'z71_doc_number',
  • 'z71_copy_order_sequence',
  • 'z71_rec_sequence',
  • 'z71_user_name',
  • 'z71_open_date',
  • 'z71_open_hour',
  • 'z71_open_second',
  • 'z71_action_date',
  • 'z71_type',
  • 'z71_alpha',
  • 'z71_text',
  • 'z71_data',
  • 'z71_triggered'
  • )

17
Output format
  • z71_format_copy '-1s09s05s015s
  • -10s08s04s04s08s-2s-1s-200s-2000s-1s'
  • A regular Perl format for (s)print based on the
    COPY info
  • Each character string (X(n)) is left justified
  • Each number is right justified ans zero padded.

18
Writing Correct Output
  • Now suppose we have built zhash with values
    corresponding to the Z71 copy.
  • The following code writes out a record
  • foreach my f (_at_z71_cols_copy) push
    _at_out_vals, zhashf
  • sprintf receives the format and a list
  • out_str sprintf(z71_format_copy ,
  • _at_out_vals)print out_str . "\n"

19
Making It Generic
  • current_cols ztable . _cols_copy
  • cols current_cols
  • current_format ztable . _format_copy
  • format current_format
  • foreach my f (_at_cols) push _at_out_vals,
    zhashf
  • sprintf receives the format and a list
  • out_str sprintf(format ,
  • _at_out_vals)print out_str . "\n"

20
Output Sum up
  • To Create
  • Read all relevant Copy files
  • For each znn table create znn_cols_copy,
    znn_format_copy
  • Use DataDumper to store in zvars_copy.txt

21
Output Sum up
  • To use
  • do zvars_copy.txt
  • establish current ztable
  • use glob to get correct cols and format
  • write output

22
Input Simplifying Variety
  • Define to which type belongs current input
  • Fields separated by delimiter specify
    delimiter, name fields
  • Fields with fixed widths specify names and
    widths
  • Record multiline specify new record, field
    format
  • Others (cvs, Excel etc.)

23
File Type Definitions
  • auth_kul_ext type fixed
  • auth_kul_ext names d10-1lnk2-19d220-40va
    l41-70
  • ser_copy_data type fixed
  • ser_copy_data names CNUM8LOCATION6
  • ser_copy_data names COPY_NUM5SUPPLIER12
  • item_data type delimited
  • item_data names numpricedate
  • cash_data type
  • tag_multilinerec14sep2tag5sep3
    val
  • acq_order_data type
    tag_multilinedelimgt
  • acq_order_data new_rec STORE

24
Input Implementation
  • one module-object per input type
  • Each has the new constructor
  • Each has a process_infile function for actual
    input reading
  • A module ChooseInputFormat.pm selects the right
    object and return a pointer to it.
  • new just does some preparations based on the
    type
  • process_infile reads the actual file
  • for each input record builds input_hash and
    calls build_output() received as argument

25
Reading Delimited
  • names n1n2n3
  • _at_lst split //, names
  • map name to its position
  • ind 0
  • foreach (_at_lst)
  • lst_names_ ind
  • ind
  • _at_vals split /delim/, _
  • foreach my name (keys lst_names)
    take value of name from correct position
  • my val valslst_namesname input_
    hashname val

26
Reading Fixed
  • my _at_lst split /\s\s/, names my
    field_num 0 line_pat "" foreach my lst
    (_at_lst) my (name, width) split //, lst
  • position per name
  • lst_namename field_num
  • unpack for reading
    line_pat . 'A' . width . ' '
  • my input_hash ()
  • my _at_lst unpack (line_pat, _) foreach my
    name (keys lst_name) input_hashname
    lstlst_namename

27
From In To Out
Manipulation
In File
Out File
input_hash
zhash
???
28
General Look
  • Z30-DOC-NUMBER sequence,1,9
  • Z30-ITEM-SEQUENCE sequencedocno,10,6
  • Z30-SUB-LIBRARY PERM_LOC14sub_lib_ma
    p
  • Z30-MATERIAL MATERIALmaterial_map
  • Z30-ITEM-STATUS "09"
  • Z30-OPEN-DATE INVENTORY_DATEformat_date
  • Z30-CALL-NO CALL_NO
  • Z30-PRICE PRICEdecimal
  • Z30-NOTE-OPAC NOTE1 NOTE2
  • Z30-NOTE-CIRCULATION NOTESsubstring_on_word2001
  • Z30-NOTE-INTERNAL NOTESsubstring_on_word2
    002

29
Building zhash Syntax 1
  • Z30-ITEM-STATUS "09"
  • Build Z30-ITEM-STATUS from literal 09
  • Z30-CALL-NO CALL_NO
  • Build Z30-CALL-NO from input field CALL_NO as is
  • Z30-SUB-LIBRARY PERM_LOC14sub_lib_map
  • Build Z30-SUB-LIBRARY from input field PERM_LOC,
    take first 4 characters, then map result using
    sub_lib_map
  • Z30-PRICE PRICEdecimal
  • Build Z30-PRICE from input PRICE, extract just
    the decimal portion

30
Building zhash Syntax 2
  • Z30-NOTE-OPAC NOTE1 NOTE2
  • To build Z30-NOTE-OPAC concatenate input NOTE1,
    then , then input NOTE2
  • Z30-DOC-NUMBER sequence,1,9
  • Build a sequence with delta 1, width 9, leading
    zeroes
  • Z30-ITEM-SEQUENCE sequencedocno,10,6
  • Build a sequence with delta 10, width 6, leading
    zeroes, restart the sequence for each input docno
    with differen values.
  • 3 input records with docno500
    000010,000020,000030
  • 2 input records with docno732 000010,000020

31
Implementation - 1
  • Implementing the functions is done in 2 levels
  • Group type
  • Using modifiers
  • Types
  • Just a literal
  • An input field no modifiers
  • An input field with modifiers(s)
  • Concatenation (Each element may include
    modifiers)
  • Various sequences

32
Implementation - 2
  • For each output field we build an info hash, that
    includes
  • the pointer to the function to call
  • the target
  • the source
  • other info needed per specific function
  • All hashes are pushed into a list

33
The Functional Backbone
  • f ZbuildOutFuncs'func
  • handle_funcsfunc_indfunc \f
  • handle_funcsfunc_indtarget target
    handle_funcsfunc_indsource source
  • handle_funcsfunc_ind

34
Actually Building zhash
  • foreach my handle (_at_handle_funcs)
  • my res handle-gt'func'(handle, \zhash,
    input_hash)
  • Each function has something like
  • target handle-gttarget What to build
  • source handle-gtsource How to build
  • my res
  • .
  • evaluates Build
  • .
  • zhashtarget res Prepare for
    output
  • return res

35
Getting Values
  • sub get_gen_field_val
  • my (my_input_hash, column) _at__
  • my (org_val, field, modifiers)
  • if (defined input_hashcolumn)
  • org_val input_hashcolumn
  • return org_val
  • if (column /_\w/)
  • org_val zhashcolumn
  • return org_val
  • (org_val, field, modifiers)
    analyze_complex(column)
  • org_val get_modified_val(org_val,
    modifiers)
  • return org_val

36
Modifying a Value
  • sub get_modified_val
  • my (org_val, modifiers) _at__
  • my non_square '(\\)'
  • my _at_mods modifiers /non_square/go
  • foreach my mod (_at_mods)
  • next if mod /\s/
  • my (type, action) mod
    /(\w)(.)/
  • if (type)
  • org_val mod_functionstype-gt(org_val,
    action)
  • else
  • exit_on_err()
  • return org_val

37
Simple Modifier - uppercase
  • sub mod_uppercase my (val, action)
    _at__ return uc val
  • What if val is list ??

38
Handling Lists
  • sub handle_list my (val, action, func)
    _at__ my _at_val_lst foreach my v (_at_val)
    my res func-gt(v, action) push
    _at_val_lst, res return \_at_val_lst

39
Modifier with List
  • sub mod_uppercase my (val, action) _at__
    return handle_list(val, action,
    \mod_uppercase) if ref val
  • return uc val

40
regex
  • Definition
  • regex_check
  • my_reg /\d4\D\d4/
  • Code
  • sub build_regex_check
  • my (name, regex) _ /(\w)\s(.)/
  • my code_src "sub my \valshiftmy
    \res\valregexreturn \res"
  • my code eval code_src

41
Running Index
  • We need several running indexes.
  • Maybe the easiest way to do it
  • sub build_running_index
  • my cnt 0
  • return sub
  • return cnt
  • Each instance
  • starts from 0
  • increments by 1

42
Using it
  • cnt1 build_running_index()cnt2
    build_running_index()res1 cnt1()res1
    cnt1()res2 cnt2()res1
    cnt1()print "c1 res1 c2 res2\n"sub
    build_running_index my cnt 0
    return sub return cnt
    output
  • c1 3 c2 1

43
More General
  • Different start
  • Different delta
  • sub build_own_index_gen my (cnt, delta)
    _at__ return sub return cnt
    delta

44
Usage
  • cnt1 build_own_index_gen(0, 3)cnt2
    build_own_index_gen(5,10)res1
    cnt1()res2 cnt2()res1
    cnt1()res1 cnt1()print "c1 res1 c2
    res2\n"sub build_own_index_gen my (cnt,
    delta) _at__ return sub return
    cnt delta
  • output
  • c1 9 c2 15

45
Timer - 1
  • We need several timers
  • We also need a total timer
  • sub get_time_diff
  • my tls time()
  • return sub
  • my start shift
  • my tle time()
  • my tlt tle - tls
  • tls tle unless start
  • return tlt

46
Timer
  • Timer for differences
  • sub get_time_diff
  • my tls time()
  • return sub
  • my tle time()
  • my tlt tle - tls
  • tls tle
  • return tlt

47
Timer - Usage
  • t1 get_time_diff()
  • t2 get_time_diff()
  • sleep(1)
  • r t1-gt()
  • r1 t2-gt()
  • print r r r1 r1\n"
  • sleep(1)
  • r t1-gt()
  • r1 t2-gt()
  • print r r1 r1 r1\n"
  • Results
  • r 1 r1 1r 1 r1 1

48
Timer 2
  • Timer with total
  • sub get_time_diff
  • my tls time()
  • return sub
  • my start shift
  • my tle time()
  • my tlt tle - tls
  • tls tle unless start
  • return tlt

49
Timer 2 - Usage
  • t1 get_time_diff()
  • t2 get_time_diff()
  • sleep(1)
  • r t1-gt()
  • r1 t2-gt(1)
  • print r r r1 r1\n"
  • sleep(1)
  • r t1-gt()
  • r1 t2-gt(1)
  • print r r1 r1 r1\n"
  • Results
  • r 1 r1 1r 1 r1 2

50
Timer
  • We need several timers
  • We also need a total timer
  • sub get_time_diff
  • my tls time()
  • return sub
  • my start shift
  • my tle time()
  • my tlt tle - tls
  • tls tle unless start
  • return tlt

51
Perl Things
  • Hash-Direct mapping of string from source to code
    any entry can hold scalar/list/hash/anon. sub -
    does it dynamically
  • ref
  • references
  • hash-names/list-order
  • DataDumper
  • file globs
  • glob -gt generic code
  • tied hash glob
  • closure
  • sprintf format string values list
  • evalconditions, regexp
  • regexp qr, quotemeta, \Q
  • external modules module defined name
  • modules DB_File, DataDumper, dates, Excel, DBI
Write a Comment
User Comments (0)
About PowerShow.com