Data Transformation - PowerPoint PPT Presentation

About This Presentation
Title:

Data Transformation

Description:

Data Transformation Data cleaning Importing Data Reading data from external formats Libname/Infile/Input for text form data Proc Import for Excel/Access data ODBC for ... – PowerPoint PPT presentation

Number of Views:184
Avg rating:3.0/5.0
Slides: 32
Provided by: ouEduclas
Learn more at: https://www.ou.edu
Category:

less

Transcript and Presenter's Notes

Title: Data Transformation


1
Data Transformation
  • Data cleaning

2
Importing Data
  • Reading data from external formats
  • Libname/Infile/Input for text form data
  • Proc Import for Excel/Access data
  • ODBC for external database data

3
Importing an Excel Spreadsheet
  • PROC IMPORT OUT WORK.Fall2007
  • DATAFILE "L\DataWarehousing07f\Cours
    eDatabase\Fall2007.xls"
  • DBMSEXCEL REPLACE
  • SHEET"'Fall 07'"
  • GETNAMESYES
  • MIXEDNO
  • SCANTEXTYES
  • USEDATEYES
  • SCANTIMEYES
  • RUN

4
Import an Access Table
  • PROC IMPORT OUT WORK.OrderLine
  • DATATABLE "OrderLin"
  • DBMSACCESS REPLACE
  • DATABASE"I\DataWarehousing07f\WholesaleProd
    ucts.mdb"
  • SCANMEMOYES
  • USEDATENO
  • SCANTIMEYES
  • RUN

5
? Good Practice
  • Check the metadata for a dataset
  • PROC CONTENTS DATA OrderLine

  • RUN
  • Print a few records
  • PROC PRINT DATA OrderLine (OBS 10)

  • RUN

6
Saving SAS Datasets
  • LIBNAME course "L\DataWarehousing07f\CourseDataba
    se"
  • Data course.Spring2008
  • set spring2008
  • run
  • Note the name associated with the libname
    command (course) must be 8 characters or less.

7
LIBNAME / INFILE / INPUT for character data
  • LIBNAME identifies the location or folder where
    the data file is stored
  • INFILE specifies the libname to use for reading
    external data.
  • INPUT reads text format data
  • SET reads SAS data

8
INFILE with INPUT for character data files
  • DATA Fitness
  • INFILE "L\DataWarehousing07f\TransformationSAS\SA
    S1.txt"
  • INPUT NAME WEIGHT WAIST PULSE CHINS SITUPS
    JUMPS
  • run

9
Creating Derived Attributes
  • Generating new attributes for a table. SAS
    creates attributes when they are referred to in a
    data step. The metadata depends on the context
    of the code.
  • LENGTH statements
  • FORMAT statements
  • FORMATS and INFORMATS
  • PUT
  • INPUT

10
PUT and INPUT Functions
  • TextOutput PUT(variable, format)
  • Note the result of a put function is always
    character
  • Note there is also a PUT statement that writes
    the contents of a variable to the SAS log
  • Output INPUT(CharacterInput, informat)
  • Note the variable for an input function is
    always character

11
Formats
  • Formats always contain a period
  • Formats for character variables always start with
    a
  • The most used format categories are Character,
    Date and Time, and Numeric
  • Note use the SAS search tab to look for
    Formats. For a list of SAS formats look under
    Formats Formats by Category

12
? Good Practice
  • The following code is handy for testing
    functions and formats in SAS. The _Null_ dataset
    name tells SAS not to create the datset in the
    WORK library
  • Data _Null_
  • InputVal 123
  • OutputVal PUT(InputVal, Roman30.)
  • PUT InputVal OutputVal
  • run

13
Generating Dates
  • Generating a Date dimension
  • Usually done offline in something like Excel
  • SAS has extensive date and datetime functions and
    formats
  • SAS formats apply to only one of datetime, date
    or time variable types. Convert from one type to
    another with SAS functions.

14
Creating a text variable for Date
  • Data Orders2

  • Length Date 10.

  • Set Orders

  • Date PUT( Datepart(OrderDate), MDDYY8.)
  • The Length statement assures that the variable
    will have enough space. It must come before the
    SET.
  • OrderDate has DateTime format. The DATEPART
    function produces a date format output. MMDDYYx.
    is a date format type.

15
SAS Functions
  • We are especially interested in Character and
    Date and Time functions
  • Note use the SAS search tab to look for
    Functions. For a list of SAS functions look
    under Functions and CALL routines
    Functions and CALL Routines by Category

16
Useful Data Cleaning Functions
  • Text Manipulation
  • COMPRESS, STRIP, TRIM, LEFT, RIGHT, UPCASE,
    LOWCASE
  • Text Extraction
  • INDEX, SCAN, SUBSTR, TRANSLATE, TRANWRD

17
Parsing
  • The process of splitting a text field into
    multiple fields
  • Uses SAS functions to extract parts of a
    character string.
  • Fixed position in a string SUBSTR
  • Known delimiter SCAN
  • Note it is a good idea to strip blanks before
    you try to parse a string.

18
Example of Parsing
  • Data Customer2
  • LENGTH street cust_addr 20.
  • FORMAT street cust_addr 20.
  • SET Customer
  • Cust_Addr TRIM(Cust_Addr)
  • Number Scan(Cust_Addr,1,' ')
  • Street Scan(Cust_Addr,2,' ')
  • run
  • Note The LENGTH and FORMAT statements clear
    trailing blanks for further display.

19
Parsing Results
  • Obs cust_addr Number street
  • 1 481 OAK 481 OAK
  • 2 215 PETE 215 PETE
  • 3 48 COLLEGE 48 COLLEGE
  • 4 914 CHERRY 914 CHERRY
  • 5 519 WATSON 519 WATSON
  • 6 16 ELM 16 ELM
  • 7 108 PINE 108 PINE

20
? Good Practice
  • Always print the before and after images here.
    Parsing free form text can be quite a problem.
    For example, apartment addresses 110b Elm and
    110 b Elm will parse differently. In this case
    you may have to search the second word for things
    that look like apartments and correct the data.

21
SUBSTR( string, positionlt, lengthgt)
  • Use this when you have a known position for
    characters.
  • String character expression
  • Position start position (starts with 1)
  • Length number of characters to take (missing
    takes all to the end)
  • VAR ABCDEFG
  • NEWVAR SUBSTR(VAR,2,2)
  • NEWVAR2 SUBSTR(VAR,4)
  • NEWVAR BC
  • NEWVAR2 DEFG

22
SUBSTR(variable, positionlt,lengthgt)
new-characters
  • Replaces character value contents. Use this
    when you know where the replacement starts.
  • a'KIDNAP'
  • substr(a,1,3)'CAT'
  • a CATNAP
  • substr(a,4)'TY'
  • a KIDTY

23
INDEX(source, excerpt)
  • Searches a character expression for a string of
    characters. Returns the location (number) where
    the string begins.
  • a'ABC.DEF (XY)'
  • b'XY'
  • xindex(a,b)
  • x 10
  • x index(a,DEF)
  • x 5

24
Alternative INDEX functions
  • INDEXC searches for a single character
  • INDEXW searches for a word
  • Syntax
  • INDEXW(source, excerptlt,delimitergt)

25
Length
  • Returns the length of a character variable
  • The LENGTH and LENGTHN functions return the same
    value for non-blank character strings. LENGTH
    returns a value of 1 for blank character strings,
    whereas LENGTHN returns a value of 0.
  • The LENGTH function returns the length of a
    character string, excluding trailing blanks,
    whereas the LENGTHC function returns the length
    of a character string, including trailing blanks.
    LENGTH always returns a value that is less than
    or equal to the value returned by LENGTHC.

26
Standardizing
  • Adjusting terms to standard format.
  • Based off of frequency prints.
  • Use functions or IF statements
  • TRANWRD is easy but can produce unexpected
    results
  • IF statements are safer, but less general

27
Standardization Code
  • Supplier Tranwrd(supplier, " Incorporated", "")
  • If Supplier "Trinkets Things" then supplier
    "Trinkets n' Things"
  • More complex logic is often needed. See the
    course examples.

28
? Good Practice
  • It is a good idea to produce a change log for
    standardized changes
  • Data Products2 Changed
  • Set Products
  • SupplierOld Supplier
  • Output Products2
  • If Trim(supplier) Trim(SupplierOld) then
    output Changed
  • Proc Print Data Changed
  • Var SupplierOld Supplier

29
Locating Anomalies
  • Frequency counts are a good way to identify
    anomalies.
  • It is also helpful to identify standard changes
    that you do not have to review.
  • Probably the safest way to execute standard
    changes is with a Change Table that lists From
    and To values. (Advanced SAS exercise go for
    it!!)

30
De Duplicating
  • Reconcile different representations of the same
    entity
  • Done after standardizing. Usually requires
    multi-field testing.
  • May use probabilistic logic, depending on the
    application.
  • Should produce a change log.

31
Correcting
  • Identifying and correcting values that are wrong
  • Very difficult to do. Usually based off of
    exception reports or range checks.
Write a Comment
User Comments (0)
About PowerShow.com