Title: Larger than Life
1Larger than Life ?
Presented By Adrian Hodgson
2How big ?
- Jobcentre client activity data - 11 million
people - 80 Gbytes of data per month on DLT
- Processing 20 tables 250m records per month
- Largest tables have 80 million records, 5 Gbytes
- Key extract retrievals take 6 - 20 hours to run
- Full SPSS extracts for clients - 5 Gbytes
- Twenty five SIR databases - over 100 Gbytes
- from 1 Mbyte to 15 Gbytes
3Evaluation Database - Overview
- Background project requirements
- Setting the environment
- Visual PQL
- The Data Dictionary
- Program generators
- Fuzzy matching
- Use of the data
- New developments
- Questions
4Background
- Government program started in January 1998
- New Deal for Young People was set up to encourage
and assist unemployed groups into full time
sustainable employment. It gives unemployed
people aged 18-24 the opportunity to develop
their potential, gain skills and experience and
find work. - Employment Service needed an evaluation database
- March 98 - ORC issues first extract with 18000
clients - June 98 - program expanded to over 25s
- June 99 - expanded to cover all Jobcentre clients
- October 00 - Source database migrates from six
regional Ingres databases to single Oracle
database
5Contract extension -press notice
- The new contract will run to May 2004, with an
option to run a further two years to May 2006. - ORC Internationals database tools are designed
to help the ES evaluate the service it provides
to all clients registered on the Governments
JobCentre computer system.They allow the ES to
regularly monitor clients with sustainable jobs,
the effectiveness of equal opportunities
measures, and the relationship between job
vacancies and the labour market skills base. - It is our intention to create a project web
site, which will allow multiple level access to
different categories of users. This will include
project documentation and progress reports,
access to tabulations and small data extracts,
and customer feedback areas, as well as links to
other related sites. - ORC International is part of Opinion Research
Corporation, which was founded in 1938 with
offices in the United States, Europe, Asia, Latin
America and Africa, the Company provides
integrated marketing services to both businesses
and governments in more than 100 countries.
http//www.orc.co.uk
6Project Requirements
- LMS (Labour Market System) is a multi-user
transaction system used in Jobcentres - Needed new database with evaluative functionality
- Linked to additional data-sets including clerical
- Flexibility to change structure periodically
- Combine cross regional records for the same
client - Extracts provided to ES for statistical purposes
7Setting the environment
- PROGRAM
- PQL CONNECT DATABASE 'CLMIDS PREFIX
'' - END PROGRAM
- SET DATABASE CLMIDS
- SET PROCFILE ' reset to main
procedure file - ( main procedure file held in separate .SR4
file ) - Midlands database prefix
- retval globals ( CLI_MIDS ,
'\\urmston\d\50413\sirdb\client\') - and so on for the procedure file other
databases
8Visual PQL (1)
execute dbms CALL ddict.cprog ( tablenam
, region , extract ,
newdata , editdata , dbnum
)
9Visual PQL (2)
10Linking SIR to other software - Winzip
open inf /dsnvarfnamein /iostatierr1 /write
/lrecl300 ifthen ( ierr1 ne 0) . write 'File
not found write // 'Unzipping file using
winzip pql escape "C\Program
Files\WinZip\WINZIP32.EXE -e zipfn fnamein
" write 'Back to sir !!!!!! ' c Build in a
loop to check that input file now unzipped and
ready c wait and repeat if not else . write
'File opened OK ' fnamein / endif
11The Data Dictionary
12Automatic schema generation
- Source field names, labels data types read from
HTML into data dictionary database - Program to create SIR variable names
- strip out underscore characters _
- trims field length to eight
- if duplicated replaces eighth character with a
number - Program sets column positions based on field
types - Data types converted from Ingres / Oracle to SIR
- Procedure for Date Integers and true date fields
13Load programs - the report files (1)
- Client_action record
- Read
78224996 - \ fixed_width_data \ 2001_05 \north
19156652 - east 6949927
- west 11515552
- south 11774525
- northwest 11352025
- midlands 12147512
- oracle 5328791
- dodgy 12
14Load program generation (2)
- Develop standard file naming directory location
- Read pre -processing report files to pick up
numbers of records to be loaded - Detect any missing report files using iostat
values - Write loading program code to pql files
- Read these pql files back into the procedure file
15Load generator -finishing touches (3)
open inf6 / dsn 'Q\ps\50413\ProgGen\ldmidlands1
.txt' / read / lrecl 250 open ouf6 / dsn
'Q\ps\50413\ProgGen\ldmidlands2.txt' / write
/ lrecl 250 write ( ouf6 ) 'PROCEDURE
INITLOAD.T add the procedure header
line / 'call initload.dropall'
add call to drop all
databases // 'call initload.connmids' /
add call to connect
midlands loop . read (inf6, iostat ierr6)
textline(a250) copy the rest of the
program . if (ierr6 ne 0) exit loop
. write (ouf6)
textline end loop write (ouf6) 'call
initload.dropmids' add
module call to drop database / 'END
PROCEDURE
add the END PROCEDURE close (inf6)
close the input and output files close
(ouf6) pread 'Q\ps\50413\ProgGen\ldmidlands2.tx
t Pread the SIR pql
16Load program - Midlands database (4)
call initload.dropall drop all connected
databases call initload.connmids connect the
midlands database call initload.update
(\\Urmston\d\50413\, sir_input_files\lms\2001
_04\Midlands\edited\,
d\50413\sirdb\client\Midlands\l
og\2001_04\, Midlands_client2001_04ORCIDsort,
edt, 0.75, 1, 1, 181450, 0) call initload.update
(\\Urmston\d\50413\, sir_input_files\lms\2001_
04\Midlands\edited\, d\50413\sirdb\client\Mid
lands\log\2001_04\, Midlands_client_action2001
_04ORCIDsort, edt, 0.75, 2, 1, 236835, 0) .
more call to load data call
initload.dropmids
disconnect the midlands database
17Fuzzy matching
- Linking data by best combinations of Nino,
name,dob - Stripping non significant text
- blanks, apostrophes, hyphens
- Methods grown organically on case by case basis
- Variety of scoring methods eg surname matches
- HODGSON 4 points direct matches
- HODGESON 3points 0.9 for misaligned - 6.7/8
84 - Flexible generic modules applicable to all match
types - Reporting of false positive and negative matches
- Manual review of near /doubtful matches
18Generic fuzzy matching - issues
- Key fields different sizes and names across
applications - Some key fields absent or with high missing
values - Quality of key fields varies widely
- Matching varies from a handful to millions of
records - Bringing Access and SIR together - Visual PQL ?
- How to assess false positives when no other
common fields in data sets being matched - - set of core procedures with options to bypass
?
19General Issues
- Continual growth in database extract size
- Data Irregularities
- Embedded carriage returns in text fields
- Date formats (American /English)
- The team
- Keeping the routine /repeated processing
interesting - Mushrooming similar code - keeping it generic
/black box - SIR
- several large tabfiles required - largest
currently 11Gb - Some retrievals crash with sir.exe error 1 time
in 5 -why? - P4 /Windows 2000 mixed performance
20New Developments (1) - Project web site
- Management tool for display of project statistics
- Focus for collecting project documents ( Word )
- FAQs
- Glossary of abbreviations acronyms ?
- which variable should I use for this ?
- How is leaving date derived ?
- Where can I get the latest data dictionary for
NDYP ? - Whats the ORC variable name for expct_start_date
? - Small sampling /extraction tools
- Links to other related government sites
21New Developments (2) - Processing
- Full benefits data set arrives on Monday
- Increasingly complex extracts in SPSS and SAS
- Moving more of the data processing to Linux
- Generic fuzzy matching tools
- Adding other data , deprivation index , surveys
- Sir 2002 - reading an 800 byte record into a
single string - Secondary indexing /lookups - replace Case 0
- Better linking of the processing zipping /spss
/excel /HTML - Questions
22Larger than the Evaluation Database ?
Presented By Adrian Hodgson