Data Quality - PowerPoint PPT Presentation

About This Presentation
Title:

Data Quality

Slides: 45
Provided by: davidl116
Learn more at: https://cs.nyu.edu

Transcript and Presenter's Notes

Title: Data Quality


1
Data Quality
  • Class 6

2
This Week
  • Review for Exam
  • Project Questions
  • Data Standardization

3
Data Standardization
  • What is a standard?
  • Benefits of Standardization
  • Defining Data standards
  • Testing for standard form
  • Transforming into standard form

4
What is a Standard?
  • a standard is something set up and established by
    authority, custom, or general consent as a model
    or example
  • a model to which all objects of the same class
    must conform.

5
What is a Standard? 2
  • conforms to a predefined expected format which
    may be defined by
  • an organization with some official authority
    (e.g., government)
  • some recognized authoritative board (such as a
    standards committee)
  • negotiated agreement (such as electronic data
    interchange (EDI) agreements)
  • de facto convention (e.g., telephone number
    formats)

6
Benefits of Standardization
  • conformity for comparison (as well as aggregation
    and analysis purposes)
  • an audit trail for data error accountability
  • a streamlined means for the transfer and sharing
    of information

7
Defining Standards
  • Find representative body
  • Identify a simple set of rules that completely
    specify the valid structure and meaning of a
    correct data value
  • Present the standard to the committee (or even
    the community as a whole) for comments
  • Document and publish standard

8
Testing for Standard Form
  • If there is a standard, there should be a way to
    test to see if data is in standard form
  • Example: US telephone numbers
  • Defined by Industry Numbering Committee (INC)
  • NPA: Numbering Plan Area code
  • NXX: Central Office Code
  • Test for format conformance (e.g., 1-XXX-YYY-ZZZZ
    for telephone numbers)
  • Test for validity (e.g., is XXX a valid NPA, is
    YYY a valid NXX for the NPA XXX)
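
The two tests above (format conformance, then validity) can be sketched as follows. This is a minimal illustration, not the INC's procedure: the names `PHONE_FORMAT` and `VALID_NXX_BY_NPA` are hypothetical, and the small lookup table stands in for the real numbering-plan assignments.

```python
import re

# Format test: 1-XXX-YYY-ZZZZ, digits only in each group.
PHONE_FORMAT = re.compile(r"1-(\d{3})-(\d{3})-(\d{4})")

# Hypothetical sample of NPA -> valid NXX codes; a real check would
# load the published numbering-plan assignments instead.
VALID_NXX_BY_NPA = {
    "212": {"555", "998"},
    "718": {"555"},
}


def conforms_to_format(number: str) -> bool:
    """True if the string matches the 1-XXX-YYY-ZZZZ layout."""
    return PHONE_FORMAT.fullmatch(number) is not None


def is_valid(number: str) -> bool:
    """True if the format is right and the NPA/NXX pair is in the table."""
    match = PHONE_FORMAT.fullmatch(number)
    if match is None:
        return False
    npa, nxx, _line = match.groups()
    return nxx in VALID_NXX_BY_NPA.get(npa, set())
```

A value can pass the format test and still fail the validity test, which is why the slide lists them separately.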

9
Transforming into Standard Form
  • Given a good standard, it should be
    straightforward to transform data into that form
  • Must be able to recognize data components to be
    able to place them in proper locations
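
As a sketch of this idea, the function below recognizes the components of a US telephone number in several common layouts and places them in the 1-XXX-YYY-ZZZZ form used on the previous slide. The function name `to_standard_phone` is hypothetical, and the sketch assumes that any 10-digit value (or 11-digit value starting with 1) is a phone number.

```python
import re
from typing import Optional


def to_standard_phone(raw: str) -> Optional[str]:
    """Return the number as 1-XXX-YYY-ZZZZ, or None if unrecognized."""
    digits = re.sub(r"\D", "", raw)  # drop punctuation and spaces
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip the leading country code
    if len(digits) != 10:
        return None  # components cannot be recognized
    npa, nxx, line = digits[:3], digits[3:6], digits[6:]
    return f"1-{npa}-{nxx}-{line}"
```

Values that return `None` are the ones whose components could not be recognized and would need manual review.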

10
Error Paradigms
  • How are errors introduced into data?
  • Attribute Granularity
  • Finger Flubs
  • Format Conformance
  • Semi-structured form
  • Transcription Errors
  • Transformation Flubs
  • Misfielded Data
  • Floating Data
  • Overloaded Attributes

11
Attribute Granularity
  • Data granularity is not at the proper level
  • Example: name vs. last name, first name
  • Creates confusion when more than one entity can
    be represented in the same attribute

12
Finger Flubs
  • This happens whem the incorrect letter is typed
    on the keybpard
  • Also, sometimes mnore than one letter is hit by
    mistake
  • Also, a leter might be missing

13
Format Conformance
  • When the format is too restrictive, the user may
    not be able to properly enter the data
  • Example: first name, middle initial, last name
  • Some people go by their middle name

14
Semi-structured form
  • There may be multiple valid formats that appear
    in free-form
  • Example: corporate structure laid out at web
    sites
  • Example:
  • (first name) (middle initial) (last name) or
  • (last name), (first name)
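
One way to handle the two name formats above is to try each valid pattern in turn. This is a rough sketch under simplifying assumptions (single-word names, ASCII letters only); the pattern names and `parse_name` are hypothetical.

```python
import re

# (first name) (middle initial) (last name); the middle initial is optional.
FIRST_MIDDLE_LAST = re.compile(
    r"(?P<first>[A-Za-z]+) (?:(?P<middle>[A-Za-z])\.? )?(?P<last>[A-Za-z]+)"
)
# (last name), (first name)
LAST_COMMA_FIRST = re.compile(r"(?P<last>[A-Za-z]+), (?P<first>[A-Za-z]+)")


def parse_name(text):
    """Return the name components as a dict, or None if no format matches."""
    for pattern in (LAST_COMMA_FIRST, FIRST_MIDDLE_LAST):
        match = pattern.fullmatch(text.strip())
        if match:
            # Drop components that were absent (e.g., no middle initial).
            return {k: v for k, v in match.groupdict().items() if v}
    return None
```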

15
Transcription Errors
  • Data is collected through fuzzy media and is
    not properly transcribed
  • Mispronounced data
  • Incorrect spellings

16
Transformation Flubs
  • Automated processing may introduce errors
  • We've already seen this example:
  • a database of names was found to have an
    inordinately large number of high-frequency word
    fragments, such as INCORP, ATIONAL, COMPA.
  • Text spanned multiple fields, which were not
    concatenated properly on extraction

17
Misfielded Data
  • Data that is placed in the wrong field
  • Example: street addresses
  • Fields may not be big enough
  • Text spills over to next field

18
Floating Data
  • Information that belongs in one field is
    contained in different fields in different
    records in the database
  • See examples in housing authority database

19
Overloaded Attributes
  • More than one entity shows up in data
  • We've already seen this example:
  • John and Mary Smith, TTES, Smith Foundation

20
Record Parsing
  • Tokenizing data elements within an attribute
  • Assign meaning to tokens
  • Domain membership
  • Patterns
  • Context

21
Record Parsing 2
  • In order to do this, we need
  • The names and types of the data components
    expected to be found in the field
  • The set of valid values for each data component
    type
  • The acceptable forms that the data may take
  • A means for tagging records that have
    unidentified data components
  • We can do this with domains, mappings, and rules!
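
A minimal sketch of this approach: tokens are tagged by domain membership first, then by pattern, and anything left over is tagged as unidentified. The domain contents and the names `DOMAINS`, `PATTERNS`, and `tag_token` are illustrative assumptions, not part of the course material.

```python
import re

# Hypothetical domains: sets of valid values for a component type.
DOMAINS = {
    "directional": {"N", "S", "E", "W", "NE", "NW", "SE", "SW"},
    "suffix": {"RD", "ST", "AVE", "BLVD", "HWY"},
}
# Hypothetical patterns: acceptable forms for a component type.
PATTERNS = {
    "number": re.compile(r"\d+"),
}


def tag_token(token):
    """Assign a component type to one token, or 'unknown'."""
    upper = token.upper().rstrip(".")
    for name, values in DOMAINS.items():
        if upper in values:
            return name
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(upper):
            return name
    return "unknown"


def parse_field(text):
    """Tokenize on whitespace and tag each token."""
    return [(token, tag_token(token)) for token in text.split()]


def has_unidentified(tagged):
    """True if the record should be flagged for review."""
    return any(tag == "unknown" for _, tag in tagged)
```

In `"123 N Main St"`, the token `Main` belongs to no domain and matches no pattern, so it is tagged `unknown`; resolving it as a street name is where context (its position between a directional and a suffix) would come in.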

22
Data Correction
  • If we can automatically recognize data as not
    conforming to a standard, can we automate its
    correction?
  • If we have translation rules or mappings from
    incorrect values to correct values
  • This is how many data cleansing applications work
  • Example: Internatinal → International
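
A mapping-based correction might look like the sketch below. The `CORRECTIONS` table is a hypothetical sample; only the first entry comes from the slide. For simplicity the sketch uppercases every value before lookup.

```python
# Mapping from known incorrect values to correct values.
CORRECTIONS = {
    "INTERNATINAL": "INTERNATIONAL",  # example from the slide
    "INCORPERATED": "INCORPORATED",   # hypothetical additional entry
}


def correct_value(value):
    """Uppercase the value and replace each known-bad word."""
    words = value.upper().split()
    return " ".join(CORRECTIONS.get(word, word) for word in words)
```

Words that are not in the mapping pass through unchanged, so the quality of the result depends entirely on how complete the mapping is.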

23
Data Correction 2
  • Correction by consolidation
  • Makes use of record linkage
  • Find a pivot attribute across which to link
  • The pivot should be unique (such as social
    security number)
  • Link records together and consolidate correct
    name based on other factors, such as data source,
    timestamp, etc.
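
The consolidation step can be sketched as grouping records on the pivot and then picking one name per group. The record layout (`pivot`, `name`, `timestamp` keys) and the "most recent timestamp wins" policy are assumptions made for illustration; a real policy might also weigh the data source.

```python
from collections import defaultdict


def consolidate(records):
    """Link records by pivot value and keep the most recent name.

    Each record is a dict with 'pivot', 'name', and 'timestamp' keys;
    timestamps are ISO date strings, so they compare chronologically.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record["pivot"]].append(record)
    return {
        pivot: max(group, key=lambda r: r["timestamp"])["name"]
        for pivot, group in groups.items()
    }
```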

24
Data Standardization
  • Use standard form as a pivot for linkage and
    consolidation
  • Example:
  • Elizabeth R. Johnson, 123 Main St
  • Beth R. Johnson, 123 Main St
  • It's a good hunch that these records represent
    the same person
  • We can standardize components based on nicknames,
    abbreviations, etc.

25
Data Standardization 2
  • Examples:
  • Robert, Rob, Bob, Robby, Bobby
  • Elizabeth, Elisabeth, Liz, Lizzie, Beth
  • International, Intl, Int'l, Intrntnl
  • Make use of a standard form, even if it is not
    necessarily correct
  • In other words, change all Roberts, Robs, Bobs,
    Robbys, and Bobbys to Robert
  • Use standard form for linkage

26
Data Standardization 3
  • Again, this concept sounds familiar
  • Many to one mapping
  • Maintain the standardization mapping as metadata
  • Apply mapping to get standard form
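
The many-to-one mapping can be kept as a small table and applied before linkage. In this sketch the variants come from the earlier slide, while the names `VARIANTS`, `standardize_first_name`, and `linkage_key` are hypothetical.

```python
# Standardization mapping kept as metadata: standard form -> variants.
VARIANTS = {
    "ROBERT": ["ROB", "BOB", "ROBBY", "BOBBY"],
    "ELIZABETH": ["ELISABETH", "LIZ", "LIZZIE", "BETH"],
}
# Inverted into the many-to-one lookup: variant -> standard form.
TO_STANDARD = {
    variant: standard
    for standard, variants in VARIANTS.items()
    for variant in variants
}


def standardize_first_name(name):
    """Map a first name to its standard form (unknown names pass through)."""
    key = name.strip().upper()
    return TO_STANDARD.get(key, key)


def linkage_key(first, last, address):
    """Build a key on which candidate duplicate records can be linked."""
    return (
        standardize_first_name(first),
        last.strip().upper(),
        address.strip().upper(),
    )
```

With this key, `Elizabeth Johnson` and `Beth Johnson` at `123 Main St` link together, even though the standard form may not be what either person actually goes by.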

27
Abbreviation Expansion
  • Rule/mapping oriented
  • Translates common abbreviations to a standard
    form
  • Types
  • Shortenings (INC for INCORPORATED)
  • Compression (INTL for INTERNATIONAL)
  • Acronyms (IBM for you know what)

28
Transformation Rules
  • Standardization is a process of transforming
    nonconforming forms to conforming forms
  • Use mappings/transformation rules
  • Create a rule engine instance and integrate the
    rules
  • Engine becomes a filter
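
A toy version of such an engine is sketched below: rules are (pattern, replacement) pairs, and the engine is applied as a filter over a stream of values. The `RuleEngine` class is a hypothetical illustration, not a reference to any particular rule-engine product.

```python
import re


class RuleEngine:
    """Applies an ordered list of transformation rules to each value."""

    def __init__(self):
        self.rules = []

    def add_rule(self, pattern, replacement):
        """Register a rule: text matching `pattern` becomes `replacement`."""
        self.rules.append((re.compile(pattern), replacement))

    def apply(self, value):
        """Run every rule, in order, over a single value."""
        for pattern, replacement in self.rules:
            value = pattern.sub(replacement, value)
        return value

    def filter(self, values):
        """Yield the conforming form of each value in a stream."""
        for value in values:
            yield self.apply(value)


engine = RuleEngine()
engine.add_rule(r"\bINC\b\.?", "INCORPORATED")
engine.add_rule(r"\bINTL\b\.?", "INTERNATIONAL")
```

The word-boundary anchors keep a rule from firing inside a longer word, so an already expanded `INCORPORATED` is left alone when the stream is filtered a second time.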

29
Example: Address Standardization
  • United States Postal Service (USPS) has done a
    very good job of presenting their addressing
    standard
  • Their goal: increase readability of mail to
    increase deliverability
  • Benefits are given to postal customers when data
    is in correct form

30
USPS Address Standard
  • Multiple address lines
  • Recipient line
  • Delivery Address line
  • Last line
  • Standard Address Block

31
Recipient Line
  • Person or entity to whom mail is to be delivered
  • First line of standard address block

32
Delivery Address Line
  • Contains location information
  • Includes street address
  • Broken down into
  • Primary address number
  • Predirectional and/or Postdirectional
  • Street name
  • Suffix (RD, ST, etc.)
  • Secondary address designator
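
The breakdown above can be sketched as a simple token-based parser. This is a rough illustration only: the domain sets are small hypothetical samples, only abbreviated directionals are recognized, and real addresses have many special cases this does not handle.

```python
# Small hypothetical samples of each domain.
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}
SUFFIXES = {"RD", "ST", "AVE", "BLVD", "DR", "LN", "HWY"}
SECONDARY = {"APT", "STE", "UNIT"}


def parse_delivery_line(line):
    """Split a delivery address line into its components."""
    tokens = line.upper().replace(".", "").split()
    parts = {
        "number": None, "predirectional": None, "street_name": None,
        "suffix": None, "postdirectional": None, "secondary": None,
    }
    # The secondary designator and everything after it form one component.
    for i, token in enumerate(tokens):
        if token in SECONDARY:
            parts["secondary"] = " ".join(tokens[i:])
            tokens = tokens[:i]
            break
    if tokens and tokens[0].isdigit():
        parts["number"] = tokens.pop(0)
    # Always leave at least one token for the street name.
    if len(tokens) > 1 and tokens[0] in DIRECTIONALS:
        parts["predirectional"] = tokens.pop(0)
    if len(tokens) > 1 and tokens[-1] in DIRECTIONALS:
        parts["postdirectional"] = tokens.pop()
    if len(tokens) > 1 and tokens[-1] in SUFFIXES:
        parts["suffix"] = tokens.pop()
    parts["street_name"] = " ".join(tokens) or None
    return parts
```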

33
Last Line
  • City
  • State
  • ZIP+4 code

34
Standard Abbreviations
  • USPS expects addresses to be represented in a
    reduced form, using standard abbreviations
  • This can be represented using a mapping
  • See example (pub. 28)
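
As a sketch of such a mapping, the table below holds a handful of street suffix abbreviations of the kind listed in Publication 28 (the full tables are much longer). The function name is hypothetical, and the sketch assumes the suffix is the last word of the line, which is not true when a postdirectional or secondary unit follows it.

```python
# A few spelled-out suffixes and their standard abbreviations.
SUFFIX_ABBREVIATIONS = {
    "AVENUE": "AVE",
    "BOULEVARD": "BLVD",
    "HIGHWAY": "HWY",
    "ROAD": "RD",
    "STREET": "ST",
}


def abbreviate_suffix(delivery_line):
    """Uppercase the line and abbreviate its final word if it is a suffix."""
    tokens = delivery_line.upper().split()
    if tokens:
        tokens[-1] = SUFFIX_ABBREVIATIONS.get(tokens[-1], tokens[-1])
    return " ".join(tokens)
```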

35
ZIP+4
  • Encoding of geographical data
  • Actually, the ZIP code is an overloaded data
    value
  • It contains state information as well as delivery
    location focus

36
Address Standardization
  • First: Is the address already in standard form?
  • This can be checked by making sure that the
    address conforms to the address block layout
  • Some special cases need addressing (East West
    Hwy)
  • Are real city names used, or vanity names?
  • Is the correct ZIP+4 used?

37
Address Standardization 2
  • More:
  • Identify all addressing elements
  • Make sure placement is correct; if not, correct
    it
  • Is the street specified a valid street name?
    (USPS provides database)
  • Is the address number valid within the street
    address ranges?
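
The last two checks can be sketched against a street table. The `STREET_RANGES` table below is entirely hypothetical and stands in for the USPS-provided database; its layout is an assumption made for illustration.

```python
# Hypothetical street table: (street name, suffix) -> valid number ranges.
STREET_RANGES = {
    ("MAIN", "ST"): [(100, 199), (300, 399)],
    ("EAST WEST", "HWY"): [(4000, 4999)],
}


def is_valid_street(name, suffix):
    """True if the street appears in the street table."""
    return (name.upper(), suffix.upper()) in STREET_RANGES


def is_number_in_range(number, name, suffix):
    """True if the address number falls in one of the street's ranges."""
    ranges = STREET_RANGES.get((name.upper(), suffix.upper()), [])
    return any(low <= number <= high for low, high in ranges)
```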

38
Address Standardization 3
  • Next: Correct if necessary
  • Identify all address elements
  • Look up proper city name
  • Look up correct ZIP+4
  • If the right one cannot be used, use the ZIP+4
    centroid
  • Move elements to proper location in address block
  • Transform elements into standard abbreviated form
  • Generate bar code (if needed)

39
Business Data Elements
  • USPS standard is a nice source for business rules
  • Elements are broken down into element classes

40
Business Elements
  • Secondary unit indicator
  • Secondary number
  • Company name
  • PO box number
  • City
  • State
  • ZIP/ZIP+4
  • Carrier Route code
  • Operational Endorsement
  • Key line code
  • POSTNET barcode
  • POSTNET address barcode
  • Name Prefix (Mr., Mrs.)
  • First name
  • Middle name or initial
  • Surname
  • Suffix title (e.g., JR, PHD)
  • Professional title (PROJECT MANAGER)
  • Division/Department
  • Mailstop code
  • Street number
  • Predirectional
  • Street name
  • Street suffix

41
CASS
  • Acronym for Coding Accuracy Support System
  • Provides a platform to measure the quality of
    address matching and standardization software
  • Addresses are CASS certified if they pass USPS
    provided tests (i.e., they are standardized)
  • Only mail that is CASS certified can qualify for
    postage savings

42
NCOA
  • 20% of population changes addresses each year
  • NCOA: National Change of Address

43
Other Standards
  • Telephone industry
  • Financial industry (SWIFT, FIX)
  • HTML
  • SIC codes
  • GIS standards

44
Next Week
  • Midterm
  • Following week
  • Data cleansing
  • Record linkage
  • Similarity and distance