SAS Challenge Matching Records - PowerPoint PPT Presentation

1 / 36
About This Presentation
Title:

SAS Challenge Matching Records

Description:

phone= compress(Phone, '0123456789', 'k'); Fname1=upcase(scan(Fname,1, ... run; SAS functions for cleaning character variables ... – PowerPoint PPT presentation

Number of Views:78
Avg rating:3.0/5.0
Slides: 37
Provided by: BCB1
Category:

less

Transcript and Presenter's Notes

Title: SAS Challenge Matching Records


1
SAS Challenge Matching Records
  • Libing Shi
  • Libing.shi_at_bcbsma.com
  • Blue Cross Blue Shield of MA
  • BASUG Quarter 1 Meeting
  • March 19, 2009

2
Overview
Final_outdata
Members.txt
Registrations.txt
Fname v Lname v Organization v Addr1 v Addr2
v City v State ? Zip v Phone v Email v Member
(Y/N) Match_Ind (L-sure match M-manual check N-
not matched)
DuesYear Organization Email Fname Lname Zip Phone
City AddressLine1 AddressLine2
Organization Email Fname Lname Phone

3
Input datasets
  • Method I Use SAS pipe feature to get the file
    names
  • filename indata pipe 'dir/b C\DOCUME1\lshi0001\D
    esktop\BASUG\.txt'
  • data file_list
  • length name_of_file 20
  • infile indata truncover
  • input name_of_file 20.
  • call symput ('num_files',_n_)
  • run

Path
file_list
missover
4
  • Method I Automatically read in data using
    PROC IMPORT
  • macro fileread
  • do j1 to num_files
  • data _null_ set file_list
  • if _n_j
  • call symput ('filein',name_of_file)
  • call symput ('fileout', scan(name_of_file,1,
    '.') ) run
  • PROC IMPORT OUT work.fileout
  • DATAFILE "C\Documents and settings\
    lshi0001\Desktop\BASUG\filein"
  • DBMSTAB REPLACE
  • DELIMITER''
  • GETNAMESYES
  • DATAROW2
  • RUN
  • end
  • mend fileread

Bedford VA Hospital"200 Springs Road,
152 Bedford VA Hospital"200 Springs Road, 1
5
Log window using Proc import wizard
6
Input datasets
  • Method II Old fashion way infile, input,
    informat and format
  • data WORK.registrations
  • let _EFIERR_ 0
  • infile 'C\DOCUME1\lshi0001\Desktop\BASUG\Registr
    ations.txt' delimiter '' MISSOVER DSD
    lrecl32767 firstobs2
  • informat Organization 58.
  • informat Email 50.
  • format Organization 58.
  • format Email 50.

input
Organization Email
if _ERROR_ then call
symputx('_EFIERR_',1) run
7
Compare two read data methods
  • Method I
  • Code is reusable.
  • Dont need to know the file names, variable
    names and variable lengths before read them in.
  • But cant read in data correctly if the data is
    messy.
  • Method II
  • extremely flexible, can deal with really messy
    data
  • But the code is not reusable, it is data
    specific.

8
Explore the datasets
  • Dataset Members (397 records)
  • Variable Type Len Format
    Informat
  • 9 AddressLine1 Char 50 50.
    50.
  • 10 AddressLine2 Char 37 37.
    37.
  • 8 City Char 20 20.
    20.
  • 1 DuesYear Char 4 4.
    4.
  • 3 Email Char 50 50.
    50.
  • 4 Fname Char 15 15.
    15.
  • 5 Lname Char 15 15.
    15.
  • 2 Organization Char 58 58.
    58.
  • 7 Phone Char 12 12.
    12.
  • 6 Zip Char 10
    10. 10.
  • Dataset Registrations (147 records)
  • Variable Type Len Format
    Informat
  • 2 Email Char 50 50.
    50.
  • 3 Fname Char 15 15.
    15.
  • 4 Lname Char 15 15.
    15.
  • 1 Organization Char 58 58.
    58.
  • 5 Phone Char 12 12.
    12.

Email Unique Count346 Total 43 records with
email missing
Phone Unique Count342 Total 30 records with
phone missing
Email Unique Count147
Phone Unique Count146, 1 dup
9
The challenges
  • No unique identifier (single variable or
    variables combination) can be used to match the
    records
  • Standardizing variable Fname, Lname and
    Organization is an endless task
  • There is no State information in both
    registrations and members files

10
Process Steps for Matching
  • Same email address, same Lname or same Fname
  • Same phone , same Lname and same Fname
  • Same phone , same Lname or same Fname
  • Same organization, same Lname and Fname
  • 5. Same Lname and Fname

L
L
M
L
M
11
Prepare Data for Matching
  • Clean the email address
  • 4/6/3835 Asiotscao_at_orgoewn.icn"38845382
  • Standardize the phone number
  • 832.941.8698 832-875-1600 832-9700640
  • Separate names in Fname column
  • Marilyn(Mary) Margaret Rose Robert-James
    Jack Paul Harri
  • Clean the Lname
  • O''Hearn De Kassu Biggers-Smith
  • Standardize the organization name
  • Angela and Women's Hospital Angela and
    Womens Hospital

12
Clean messy address
  • Address information in city column
  • City AddressLine1
    AddressLine2
  • 200 Harrison Ave VA Medical Center"200 Springs Rd
    Bedford
  • AddressLine1 and AddressLine2 Switched
  • City AddressLine1
    AddressLine2
  • Wakefield Bldg 919, 01-24
    99 Binney Street
  • Wakefield 326 AAC
    321 Forest Street
  • AddressLine1 has too much information
  • City AddressLine1
    AddressLine2
  • Norwood Bedford VA Hospital"200 Springs Road,
    152 Bedford One Kendall Square, Building
    200

13
Clean the messy data
/clean Registrations and Menbers
files/ let q1 str(") let
q2 str(') macro clean(indata) data
indata._2(droprid) set indata. if
scan(Email,2,'_at_')' ' then Email' '
email1lowcase(scan(Email,1, "q1." )) phone
compress(Phone, "0123456789", "k")
Fname1upcase(scan(Fname,1,'(- '))
Fname2upcase(scan(Fname,2,'()- '))
Fname3upcase(scan(Fname,3,'- '))
FnameMnameupcase(compress(Fname,'()- '))
Lname1translate(Lname,' ', "q1.q2.")
Lname1upcase(compress(Lname1,' - '))
organization compress(organization, ".,q1."
) organizationtranslate(organization,' ',
'/') organizationtranwrd(organization,'',
'and')

14
Clean the messy data contd
  • organizationtranwrd(organization,'Associates',
    'Assoc')
  • organizationtranwrd(organization,'Associateia
    tes', 'Assoc')
  • organizationtranwrd(organization, 'Univ',
    'University')
  • organizationtranwrd(organization,
    'Universityersity', 'University')
  • organizationtranwrd(organization, 'HealthCare',
    'Health Care')
  • ridprxparse('s/\s/ /')
  • call prxchange(rid, -1, organization)
  • organization1tranwrd(organization, 'Inc', '
    ')
  • organization1tranwrd(organization1,'Corp',
    ' ')
  • organization1tranwrd(organization1,'Institutes',
    ' ')
  • organization1tranwrd(organization1,'Institute',
    ' ')
  • run
  • mend clean
  • clean(registrations)
  • clean(members)

15
Clean address

/clean address in members file/ data
members_2(droppos)set members_2 length Zip1
5. new_AddressLine1 50. AddressLine1tranwrd(
AddressLine1,'One ','1 ') AddressLine2tranwrd(A
ddressLine2,'One ','1 ') if
substr(City,1,1) in ( '0', '1',
'2','3','4','5','6','7','8','9') and
AddressLine1' ' then do
AddressLine1City City ' ' end
else if substr(City,1,1) in ( '0', '1',
'2','3','4','5','6','7','8','9') and
AddressLine1 ne ' ' and AddressLine2' ' then
do AddressLine2City City '
' end else if substr(City,1,1) in
( '0', '1', '2','3','4','5','6','7','8','9')
and AddressLine1 ne ' ' and AddressLine2 ne '
' then City ' '
16
Clean address contd

if substr(AddressLine2,1,1) in ('0', '1',
'2','3','4', '5', '6', '7', '8', '9') and
scan(AddressLine2,2, ' ') ne 'floor ' and
(substr(AddressLine1,1,1) not in ('0','1',
'2','3','4', '5', '6', '7', '8', '9') or
scan(AddressLine1, 2, ' ')'AAC ') then
do new_AddressLine1AddressLine2 new_AddressLi
ne2AddressLine1 end else do
new_AddressLine1AddressLine1 new_AddressLi
ne2AddressLine2 end if scan(new_AddressLine
1,2,'"') ne ' ' then new_AddressLine1scan(new_Add
ressLine1,2,'"')
17
Clean address contd

/Split AddressLine1/ posindexc(new_addressLine1
, ',' ) pos_endlength(new_AddressLine1) if
pos ne 0 then do new_AddressLine2substr(new_Ad
dressLine1, pos1, pos_end - pos)
new_AddressLine1substr(new_AddressLine1,1,pos-1)
end Zip1substr(Zip,1,5) run
18
SAS functions for cleaning character variables
  • SCAN(string ,n, delimiter(s)) -- Returns a
    portion of the string as defined by delimiter. If
    you omit delimiter, SAS uses the following
    characters blank . lt ( ! ) - / ,
  • /clean the wrong email address /
  • if scan(Email,2,'_at_')' ' then Email' '
  • email1lowcase(scan(Email,1, "q1.q2." ))

4/6/3835 will become empty
Asiotscao_at_orgoewn.icn"38845382 changed to
asiotscao_at_orgoewn.icn
email1lowcase(scan(Email,1,
)) email1lowcase(scan(Email,1, " ))
19
SAS functions for cleaning character variables
  • COMPRESS(string, chars, modifiers)
  • chars a list of characters need to be removed or
    kept
  • Common used modifies
  • K (k) keeps the characters in the list
  • I (i) ignore the case of the characters in the
    list
  • phone compress(phone, "0123456789", "k")
  • Lname1upcase(compress(Lname1,' - '))
  • organization compress(organization,
    ".,q1." )

20
SAS functions for cleaning character variables
  • TRANSLATE(string,to-1,from-1lt,...to-n,from-ngt)
    Converts every occurrence of a user-supplied
    character to another character
  • organizationtranslate(organization,' ', '/')
  • Lname1translate(Lname,' ', "q1.", ' ', "q2.")
  • Lname1upcase(compress(Lname1,' - '))
  • TRANWRD(string, target, replacement) scans for
    words (or patterns of characters) and replaces
    those words with a second word (or pattern of
    characters).
  • organizationtranwrd(organization,'', 'and')
  • organizationtranwrd(organization,'Associates'
    , 'Assoc')
  • organizationtranwrd(organization,'Associateia
    tes', 'Assoc')

Organizationtranwrd(organization, /, )
Lname1translate(Lname,' ', "q1. q2.")
21
SAS functions for cleaning character variables
  • Prxparse and Call Prxchange to get rid of extra
    spaces between words
  • regular-expression-idPRXPARSE (perl-regular-expre
    ssion)
  • CALL PRXCHANGE (regular-expression-id, times,
    old-string , new-string )
  • Times is a numeric value that specifies the
    number of times to search for a match and replace
    a matching pattern.
  • TipIf the value of times is -1, then all
    matching patterns are replaced.
  • ridprxparse('s/\s/ /')
  • call prxchange(rid, -1, organization)

22
SAS functions for cleaning character variables
  • indexc(string, chars)
  • chars a list of characters need to be searched
  • Searches string from left to right, returns the
    first position of any character present in the
    searching list.
  • pos indexc(new_addressLine1, ',' )
  • pos_end length(new_AddressLine1)
  • if pos ne 0 then do
  • new_AddressLine2substr(new_AddressLine1, pos1,
    pos_end - pos)
  • new_AddressLine1substr(new_AddressLine1,1,pos-
    1)
  • end

23
Matching step1by email, either first or last
name
  • proc sql create table email_match as
  • select distinct a.Fname, a.Lname,
    a.Organization,
  • b.new_AddressLine1 as Addr1,
    b.new_AddressLine2 as Addr2,
  • b.City, b.Zip1 as Zip, a.phone,
  • a.email1 as Email, 'Y' as Member,
  • case when a.Lname1b.Lname1 or
  • (a.Fname1 b.Fname1 or
    a.Fname1b.Fname2 or a.Fname1b.Fname3 or
  • a.Fname2b.Fname1 or (a.Fname2b.Fname2 and
    a.Fname2 ne ' ') or (a.Fname2b.Fname3 and
    a.Fname2 ne ' ') or
  • a.FnameMname b.FnameMname)
  • then 'L'
  • else 'M'
  • end as match_ind
  • from registrations_2 as a
  • join members_2 as b
  • on lowcase(a.email1)lowcase(b.email1)

24
Find the rest unmatched membersunmatch1
  • /get the rest unmatched members/
  • proc sql
  • create table unmatch1 as
  • select
  • from registrations_2
  • where email1 not in (select email from
    email_match)
  • order by phone, Lname, Fname
  • quit

25
Matching step 2 by phone, last name and first
name
  • proc sql
  • create table phone_match as
  • select a.Fname, a.Lname, a.organization,
  • b.new_AddressLine1 as Addr1,
  • b.new_AddressLine2 as Addr2,b.City, b.Zip1
    as Zip, a.phone,
  • a.email1 as Email, 'Y' as Member,
  • 'L' as match_ind
  • from unmatch1 as a
  • join members_2 as b
  • on a.Lname1b.Lname1 and
  • (a.Fname1 b.Fname1 or a.Fname1b.Fname2
    or a.Fname1b.Fname3 or a.Fname2b.Fname1
    or (a.Fname2b.Fname2 and a.Fname2 ne ' ')
  • or (a.Fname2b.Fname3 and a.Fname2 ne '
    ') or
  • a.FnameMname b.FnameMname)
  • and a.phone b.phone
  • order by a.phone, a.Lname, a.Fname
  • quit

26
Find the rest unmatched members unmach2
  • /get the rest unmatched members/
  • data unmatch2(dropaddr1 addr2 city zip member
    match_ind)
  • merge unmatch1(ina)
  • phone_match(inb)
  • by phone Lname Fname
  • if a b
  • run

27
Matching step 3 by phone, last name or first
name
  • proc sql
  • create table phone_match as
  • select a.Fname, a.Lname, a.organization,
  • b.new_AddressLine1 as Addr1,
  • b.new_AddressLine2 as Addr2,b.City, b.Zip1
    as Zip, a.phone,
  • a.email1 as Email, 'Y' as Member,
  • M' as match_ind
  • from unmatch2 as a
  • join members_2 as b
  • on a.Lname1b.Lname1 or
  • (a.Fname1 b.Fname1 or a.Fname1b.Fname2
    or a.Fname1b.Fname3 or a.Fname2b.Fname1
    or (a.Fname2b.Fname2 and a.Fname2 ne ' ')
  • or (a.Fname2b.Fname3 and a.Fname2 ne '
    ') or
  • a.FnameMname b.FnameMname)
  • and a.phone b.phone
  • order by a.phone, a.Lname, a.Fname
  • quit

28
Find the rest unmatched members unmatch3
  • /get the rest unmatched members/
  • data unmatch3(dropaddr1 addr2 city zip member
    match_ind)
  • merge unmatch2(ina)
  • phone_match2(inb)
  • by phone Lname Fname
  • if a b
  • proc sort by organization Lname Fname
  • run

29
Matching step 4by organization, first and name
  • proc sql
  • create table name_org_match as
  • select distinct a.Fname, a.Lname,
  • a.organization, b.new_AddressLine1 as
    Addr1,
  • b.new_AddressLine2 as Addr2, b.City,
    b.Zip1 as Zip, a.phone,
  • a.email1 as Email, 'Y' as Member,
  • 'L' as match_ind
  • from unmatch3 as a
  • join members_2 as b
  • on (a.Lname1b.Lname1 and
  • (a.Fname1 b.Fname1 or a.Fname1b.Fname2
    or a.Fname1b.Fname3
  • or a.Fname2b.Fname1 or (a.Fname2b.Fname2 and
    a.Fname2 ne ' ') or
  • (a.Fname2b.Fname3 and a.Fname2 ne ' ')
    or
  • a.FnameMname b.FnameMname))
  • and upcase(a.organization1)upcase(b.organi
    zation1)
  • order by a.organization, a.Lname, a.Fname
  • quit

30
Find the rest unmatched members unmatch4
  • /get the rest unmatched members/
  • data unmatch4(dropaddr1 addr2 city zip member
    match_ind)
  • merge unmatch3(ina)
  • name_org_match(inb)
  • by organization Lname Fname
  • if a b
  • proc sort by Lname Fname
  • run

31
Matching step 5 by first and last name
  • proc sql
  • create table name_match as
  • select distinct a.Fname, a.Lname,
  • a.organization,
  • b.new_AddressLine1 as Addr1,
  • b.new_AddressLine2 as Addr2,b.City, b.Zip1
    as Zip, a.phone,
  • a.email1 as Email, 'Y' as Member,
  • 'M' as match_ind
  • from unmatch4 as a
  • join members_2 as b
  • on a.Lname1b.Lname1 and
  • (a.Fname1 b.Fname1 or
    a.Fname1b.Fname2 or
  • a.Fname1b.Fname3 or a.Fname2b.Fname1 or
    (a.Fname2b.Fname2 and a.Fname2 ne ' ') or
    (a.Fname2b.Fname3 and a.Fname2 ne ' ') or
  • a.FnameMname b.FnameMname)
  • order by a.Lname, a.Fname
  • quit

32
Final unmatched records
  • data unmatch (dropaddr1 addr2 city zip Fname1
    Fname2 Fname3 FnameMname Lname0
  • Lname1
    organization1
  • rename(email1email))
  • merge unmatch4(ina dropemail)
  • name_match(inb dropemail)
  • by Lname Fname
  • if a b
  • Member'N'
  • Match_ind'N'
  • run

33
Get state information
  • Use sashelp.zipcode in SAS system
  • Download most current zip code file from SAS web
    and use proc cimport to read it into the system
    http//support.sas.com/rnd/datavisualization/mapso
    nline/html/misc.html

34
Clean set the final data together
  • data final set email_match
  • phone_match phone_match2
  • name_org_match name_match unmatch
  • organizationtranwrd(organization, 'University ',
    'Univ ')
  • organizationtranwrd(organization, 'Health',
    'Hlth')
  • organizationtranwrd(organization, 'Services ',
    'Srvs ')
  • organizationtranwrd(organization, 'Center',
    'Cntr ')
  • organizationtranwrd(organization,
    'Investigations', 'Ivs ')
  • organizationtranwrd(organization, 'Inc', '')
  • organizationtranwrd(organization, ' and ', '')
  • organizationtranwrd(organization, 'Senior ', 'Sr
    ')
  • organizationtranwrd(organization, 'School ',
    'Schl ')
  • organizationtranwrd(organization, 'Medical ',
    'Med ')
  • organizationtranwrd(organization, 'Outcomes ',
    'Outcms ')
  • organizationtranwrd(organization, 'Assoc ', 'Ass
    ')
  • organizationtranwrd(organization, 'Public ',
    'Pblc ')
  • proc sort by zip run

35
Final output file
  • data out.SASChallenge_LS( drop _ )
  • retain Fname Lname Organization Addr1 Addr2 City
    State Zip Phone Email Member Match_Ind
  • merge final (ina rename(Addr1_Addr1
    Addr2_Addr2
    Email_Emai Organization_Organizat
    ion

  • Phone_Phone))
  • state (inb rename(StatecodeState
    ))
  • by zip
  • if a
  • length Organization Addr1 Addr2 30. Phone 10.
    Email 50.
  • Organizationsubstr(_Organization,1,30)
  • Addr1substr(_Addr1,1,30)
  • Addr2substr(_Addr2,1,30)
  • Phone_Phone
  • Emailsubstr(_email,1,50) run

36
References
  • A macro for reading multiple text files (SUGI29)
  • by Bebbie Miller, Denver, CO
  • Address cleaning using the TRANWRD function
    (NESUG 2008)
  • by Rena Jones, Mike Zdeb
  • SAS online documents
  • Thank You !
  • libing.shi_at_bcbsma.com
Write a Comment
User Comments (0)
About PowerShow.com