Access to Confidential Data for Statistical Analysis

1
Access to Confidential Data for Statistical
Analysis
Kenneth Harris, Director of Research Data Center
2
National Center for Health Statistics (NCHS)
  • Despite the wide dissemination of its data
    through publications, CD-ROMs, etc., the
    inability to release files with, for instance,
    lower levels of geography severely limits the
    utility of some data for research, policy, and
    programmatic purposes and sets a boundary on one
    of the Center's goals: to increase its capacity to
    provide state and local area estimates.

3
NCHS (cont.)
  • In pursuit of this goal and in response to the
    research community's interest in restricted data,
    NCHS established the Research Data Center (RDC),
    a mechanism whereby researchers can access
    detailed data files in a secure environment,
    without jeopardizing the confidentiality of the
    respondents.

4
Research Data Center
  • The NCHS Research Data Center, established in
    1998, is a facility at the NCHS headquarters in
    Hyattsville, Maryland, where researchers are
    granted access to restricted data files needed to
    complete approved projects. Restricted data
    files may contain information, such as lower
    levels of geography, but do not contain direct
    identifiers (e.g., name or social security
    number).

5
Data Restrictions
  • Section 308(d) of the Public Health Service Act
    and the NCHS Staff Confidentiality Manual do not
    permit the release of data that are either
    identified or identifiable to persons outside of
    NCHS.

6
Data Restrictions (cont.)
  • Identifiable data include not only direct
    identifiers such as name, social security number,
    etc., but also data that can serve to allow
    inferential identification of either individual
    or institutional respondents by a number of means.

7
Data Restrictions (cont.)
  • Research indicates that identifiability is
    greatly enhanced if geographic identifiers for
    state, county, census tract, block-group or block
    are released on public use files.

8
Key Issues for Research Data Availability
  • CONFIDENTIALITY
  • The dissemination of data in a manner that would
    allow public identification of the respondent or
    would in any way be harmful to him/her is
    prohibited and the data are immune from legal
    process.

9
Key Issues for Research Data Availability (cont.)
  • DISCLOSURE
  • Disclosure relates to inappropriate attribution
    of information to a data subject, whether an
    individual or an organization. Disclosure occurs
    when a data subject is identified from a released
    file (identity disclosure), sensitive information
    about a data subject is revealed through the
    released file (attribute disclosure), or the
    released data make it possible to determine the
    value of some characteristic of an individual
    more accurately than otherwise would have been
    possible (inferential disclosure).

10
Appendix I: Rules for the Release of Micro Data
Files
  • The data file must not contain any detailed
    information about the subject that could
    facilitate identification and that is not
    essential for research purposes (e.g., exact date
    of the subject's birth).
  • Geographic places that have fewer than 100,000
    people are not to be identified on the data file.
  • Characteristics of an area are not to appear on
    the data file if they would uniquely identify an
    area of less than 100,000 people.
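
The population rule above can be sketched as a simple filter. This is an illustration only, not NCHS's procedure: the function name, the record layout, and the population lookup are hypothetical, and only the 100,000 threshold comes from the rule itself.

```python
# Hypothetical sketch: blank out place identifiers below the population threshold.
MIN_POPULATION = 100_000  # threshold stated in the release rules


def mask_small_places(records, populations, min_population=MIN_POPULATION):
    """Return copies of the records with 'place' set to None when the place
    has fewer than min_population people or its population is unknown."""
    masked = []
    for rec in records:
        out = dict(rec)  # copy so the caller's records are not modified
        pop = populations.get(rec.get("place"))
        if pop is None or pop < min_population:
            out["place"] = None
        masked.append(out)
    return masked
```

Treating an unknown population as "too small" is a conservative design choice made for this sketch; the rules do not address that case.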

11
Appendix I: Rules for the Release of Micro Data
Files (cont.)
  • Information on the drawing of the sample which
    might assist in identifying a data subject must
    not be released outside the Center. Thus, the
    identities of primary sampling units are not to
    be made available outside the Center.
  • Before any new or revised micro data files are
    published, they, together with their full
    documentation, must be approved for publication
    by the NCHS Director or Deputy Director.
  • A micro data file containing confidential data on
    unidentified individuals or facilities may not be
    released to any person or organization outside
    NCHS until that person, or a responsible
    representative of that organization, has first
    signed the statement on the Order Form, whereby
    he gives assurance that the data provided will be
    used only for statistical reporting or research
    purposes.

12
Why NCHS Does Not Release Files With Lower Levels
of Geography
  • Research suggests that, in personal surveys, a
    combination of nine commonly collected variables
    makes the following share of the sample
    identifiable, by size of geopolitical area.

Population of Geopolitical Area    Percent of Sample Identifiable
25,000                             24
50,000                             20
100,000                            14
200,000                            8
300,000                            5
400,000                            4
500,000                            3

13
Why NCHS Does Not Release Files With Lower Levels
of Geography (cont.)
  • Notes: A geopolitical area may be a county,
    city, town, or other place with well-defined
    boundaries.
  • In this case, identification refers to
    identification with certainty.

14
How Does RDC Operate?
  • On-Site Access
  • Remote Access
  • Staff Assisted Analytical Session

15
User Procedures
  • To gain access to NCHS restricted data through
    either method, the user must:
  • Submit a research proposal.
  • An advisory and proposal review committee
    receives, reviews, and approves researcher
    proposals.
  • Proposals are evaluated primarily on the
    confidentiality disclosure risk.
  • Scientific merit is not an evaluation criterion.
  • Sign an affidavit of confidentiality and promise
    not to use any method to attempt to identify
    respondents.

16
User Procedures (cont.)
  • Not take any materials or equipment into the RDC
    unless approved by RDC staff.
  • Submit data files to be merged onto NCHS data
    ahead of time; all merging is done by RDC staff.
  • Subject all output and/or materials removed from
    the RDC to a disclosure limitation review.
  • Not remove any NCHS restricted data files or
    linked data files.

17
Researcher Affidavit of Confidentiality
  • I certify that no confidential data or
    information viewed or otherwise obtained while I
    am a researcher in the National Center for
    Health Statistics (NCHS), Research Data Center
    (RDC) will be removed from NCHS. Further, I
    understand that NCHS will perform a disclosure
    review and must provide approval to me before I
    remove any data from the RDC, whether it be in
    electronic or paper form. I acknowledge the NCHS
    Confidentiality Statute, 308(d) of the Public
    Health Service Act, stated below and fully
    understand my legal obligations to NCHS to
    protect all confidential data. Further, I
    understand any violation I may perform is
    punishable under 18 United States Code (USC)
    1001, which carries a fine of up to $10,000 or up
    to 5 years in prison.

18
Researcher Affidavit of Confidentiality (cont.)
  • NCHS 308(d) Confidentiality Statute - No
    information, if an establishment or person
    supplying the information or described in it is
    identified, obtained in the course of activities
    undertaken or supported under section 304, 305,
    306, 307, or 309 may be used for any purpose
    other than the purpose for which it was supplied
    unless such establishment or person has consented
    to its use for such other purpose and in the case
    of information obtained in the course of health
    statistical or epidemiological activities under
    section 304 or 306, such information may not be
    published or released in other form if the
    particular establishment or person supplying the
    information or described in it is identifiable
    unless such establishment or person has consented
    to its publication or release in other form.

19
Researcher Affidavit of Confidentiality (cont.)
  • 18 United States Code, 1001 - Deliberately making
    a false statement in any matter within the
    jurisdiction of any Department or Agency of the
    Federal Government violates 18 USC 1001 and is
    punishable by a fine of up to $10,000 or up to 5
    years in prison.
  • ____________________ _______________
    Researcher's Signature Date
  • ____________________ _______________
    NCHS Witness Date

20
Can a Researcher Merge His/Her Data with NCHS Data?
  • Researchers must interact with RDC staff to ensure
    that their data can be merged with the NCHS data.
  • User-supplied data will be merged with NCHS data
    by RDC staff only.
  • The NCHS RDC policy states that merged and
    user-supplied data will not be made available for
    analysis to anyone without the written consent of
    the user.

21
The Cost per Project
  • On-Site: $200 per day (2-day minimum)
  • Remote Access:
  • NSFG-CDF: $500/year
  • NHIS-polio: $500/year
  • NHIS Linked Mort. File: $250/month
  • NHANES Linked Mort. File: $250/month

22
The Cost per Project (cont.)
  • Files < 130k records: $500 per month
  • Files > 130k records: $1000 per month
  • Staff Assisted: variable
  • File Construction and Setup:
  • For Mortality Files: $250 per day
  • For all Other Files: $500 per day
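
As a worked example of the on-site fee arithmetic, here is a minimal sketch. It assumes the figures are dollars, that the 2-day minimum is a billing minimum, and that file construction and setup is charged per day on top of the daily on-site fee; the slides state none of this explicitly, and the function and constant names are hypothetical.

```python
ON_SITE_DAILY_FEE = 200     # per day, from the slide (assumed dollars)
ON_SITE_MINIMUM_DAYS = 2    # "2 day minimum", read here as a billing minimum
SETUP_DAILY_FEE = {"mortality": 250, "other": 500}  # file construction and setup


def on_site_cost(days_on_site, setup_days=0, file_type="other"):
    """Estimate the cost of an on-site project under the assumptions above."""
    billed_days = max(days_on_site, ON_SITE_MINIMUM_DAYS)
    return billed_days * ON_SITE_DAILY_FEE + setup_days * SETUP_DAILY_FEE[file_type]
```

For instance, `on_site_cost(5, setup_days=2, file_type="mortality")` gives 5 × 200 + 2 × 250 = 1500 under these assumptions.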

23
Do Doctors Perform Defensive Cesareans?
  • Overview: This project re-examined the effects of
    defensive medicine, and of state reforms designed
    to limit malpractice risk, on the use of cesarean
    section delivery.
  • NCHS Data Used: National Hospital Discharge
    Survey (NHDS)
  • Years of Data Used: 1980 through 1992,
    inclusive.
  • User's Data Merged with NCHS? Yes
  • Method of Access to NCHS Data: Remote and
    On-site Access
  • Statistical Software Used: SAS

24
Economic Model to Explain the Incidence of Sexual
Activity, Contraceptive Use, STD, and Pregnancy
Among Teenage Girls.
  • Overview: National Survey of Family Growth data
    provide extensive socio-demographic information
    and reports of the sexual histories of these
    women. The researcher focused on the effects of a
    number of policies measured at the state level.
    These included:
  • Parental notification or consent laws.
  • Medicaid funding of abortions.
  • Welfare generosity.
  • NCHS Data Used: National Survey of Family Growth
    (NSFG)
  • User's Data Merged with NCHS? Yes
  • Method of Access to NCHS Data: Remote Access
  • Statistical Software Used: SAS

25
Nursing Home Admission and Payment Source?
  • Overview: This project tested whether patients with
    Medicare were being discriminated against because
    their reimbursement rate was significantly below
    the private pay rate for nursing homes.
  • NCHS Data Used: National Nursing Home Survey
    (NNHS)
  • Years of Data Used: 1985, 1995, and 1997
  • User's Data Merged with NCHS? No
  • Method of Access to NCHS Data: Remote Access
  • Statistical Software Used: SAS

26
Hardware and Software
  • All RDC hardware and software are standard.
  • Hardware
  • Pentium IV computers with Windows 2000
  • Software
  • SAS (only language on ANDRE)
  • SUDAAN
  • Fortran
  • HLM
  • Stata
  • LIMDEP
  • text editors/viewers
  • Onsite workstations do NOT have email or internet
    access
  • Only access to printer is through RDC staff

27
Record Linkage for Epidemiologic Research:
Accessing Linked Data at the NCHS Research Data
Center
Christine S. Cox, NCHS Data Users Conference, July
12, 2006
28
What is Record Linkage?
[Diagram: Administrative records and NCHS Surveys combine into a Linked Data File]
29
NCHS Linked Data: Major Activities
  • Mortality: National Death Index
  • Health Care Utilization and Costs: Medicare Data
  • Retirement and Disability: Social Security Data

30
NCHS Linked Data: Mortality
  • Eligibility status
  • Assigned vital status
  • Date of death
  • Age at death
  • Underlying and multiple causes of death
  • Adjusted sample weights

31
Research Potential of Linked Mortality Data
  • The Income-Associated Burden of Disease in the
    United States. P Muennig, P Franks, H Jia, E
    Lubetkin, and MR Gold.
  • Excess Deaths Associated with Underweight,
    Overweight, and Obesity. KM Flegal, BI Graubard,
    DF Williamson, and MH Gail. JAMA. 2005;293:1861-1867.
  • Living and Dying in the USA: Behavioral, Health,
    and Social Differentials of Adult Mortality. RG
    Rogers, CB Nam, and RA Hummer.
  • A Semiparametric Analysis of the Body Mass
    Index's Relationship to Mortality. JT Gronniger.
32
NCHS Linked Data: Medicare
  • Medicare entitlement and health care utilization
    and payment data for 1991-2000
  • Denominator file
  • MEDPAR Inpatient hospitalization
  • MEDPAR Skilled nursing facility
  • Hospital outpatient
  • Home Health Care
  • Hospice
  • Carrier (physician/supplier Part B file)
  • Durable Medical Equipment

33
Research Potential of Linked Medicare Data
  • Examine risk factors for health conditions
  • Examine reliability of survey data
  • Examine survey report of disability with program
    participation eligibility criteria
  • Compare survey reported health conditions to
    claims records
  • Examine disparities in Medicare service
    utilization

34
NCHS Linked Data: Retirement/Disability
  • Social Security data from Retirement, Survivors,
    and Disability Insurance (RSDI) and Supplemental
    Security Income (SSI) programs
  • Master Beneficiary Record (MBR): 1962-2003
  • Payment History Update System (PHUS): 1984-2003
  • Supplemental Security Record (SSR): 1974-2003

35
Research Potential of Linked Social Security Data
  • Examine reliability of survey information for SSA
    program participation and benefits
  • Compare the health characteristics of those who
    take early (age 62) Social Security benefits to
    those who postpone benefits
  • Policy analysis using validated survey data
  • Predicting the number of people who will become
    disabled based upon survey reported health
    conditions
  • Determining whether current disability
    entitlement funding levels will be adequate as
    the population ages

36
Summary: NCHS Data Linkage
37
www.cdc.gov/nchs/rd/nchs_datalinkage/data_linkage_activities.htm
38
Why can't you just give me the data?
  • NCHS does not own the linked administrative
    data
  • NCHS data confidentiality rules prohibit the
    release of potentially identifiable data;
    special considerations apply to the protection
    of linked data
  • The RDC is the only option for access for now.

39
Overview: Data Access Procedures
  • Proposal Requirements
  • Access Methods
  • Helpful Tips
  • Where to get help?

40
Proposal Requirements
  • Proposal is evaluated by review committee
  • Review criteria
  • Scientific and technical feasibility
  • Availability of RDC resources
  • Disclosure risk for restricted information
  • The extent to which project is in accordance with
    the mission of NCHS
  • Special note: NCHS does not try to determine
    whether proposals are duplicative

41
Proposal Requirements
  • Cover letter
  • Project title
  • Abstract (maximum 300 words summarizing project)
  • Full contact information
  • Institutional affiliation
  • Mail address, phone, email
  • Dates of proposed time at RDC (or indication of
    using remote access)
  • Source of funding for proposed research

42
Proposal Requirements
  • Study background
  • Key study questions or hypotheses
  • Public health benefits
  • Methods
  • Analytic approach and statistical methods
  • Statistical software requirements
  • Description of intended output for nondisclosure
    review, e.g.
  • Table shells
  • Model equations
  • Test statistics that researcher plans to remove
    from RDC

43
Proposal Requirements
  • Explanation of why restricted data are needed,
    e.g. describe why publicly available data are
    insufficient
  • Summary of data requirements to be included in
    analytic file
  • Identification of sample
  • Identification of variables
  • Description of additional data to be supplied by
    researcher to be merged with NCHS or other data
    source (must clearly identify source of other
    data)

44
Proposal Requirements: Appendices
  • Current curriculum vitae or resume for each
    investigator
  • Data dictionary: a complete listing of the specific
    data requested and its source(s), indicating whether
    variables are public use or restricted access
  • specific files and years
  • sample
  • variables (dependent, independent,
    matching/linking)

45
Proposal Requirements: Appendices
  • For remote-access applicants
  • Description of the computer and email system to
    be used to receive output
  • Security provisions for the computer and email
    systems
  • For students
  • Letter from department chair or academic advisor
    stating that student is working under the
    direction of the department

46
Overview: RDC Data Access Procedures
  • Proposal Requirements
  • Access Methods
  • Helpful Tips
  • Where to get help?

47
Access Methods
  • Once approved, there are three methods to access
    restricted data:
  • on-site - use local computing resources in the
    NCHS RDC, Hyattsville, MD
  • remote - submit programs electronically to be
    executed in the RDC with output returned by email
  • staff assisted - RDC staff provide on-site
    programming for off-site approved researchers
  • For all methods of access, restricted data files
    remain in the RDC and output is inspected for
    disclosure violations

48
On-Site Access
  • RDC staff constructs necessary data files,
    including merged user data
  • Most statistical packages available with
    sufficient lead time
  • Output subject to disclosure review
  • Open only during normal working hours

49
Remote Access Method
  • RDC staff constructs necessary data files,
    including merged user data
  • SAS programs only (certain procedures and
    functions not allowed); additional software
    options expected
  • Both submitted programs and output undergo a
    programmed disclosure limitation review

50
RDC Staff-assisted Programming Method
  • Subcontract with the RDC staff to perform
    programming tasks
  • Useful for those planning to use statistical
    software not available for the remote system and
    who are not able to travel to the RDC facility
  • Cost is estimated for each research project

51
Overview: RDC Data Access Procedures
  • Proposal Requirements
  • Access Methods
  • Helpful Tips
  • Where to get help?

52
RDC Helpful Tips
  • Be clear about research and data requirements
    (helps to determine feasibility of project)
  • Clearly identify the sample to be included in the
    analytic file
  • Provide data dictionaries for both
  • Public use data
  • Restricted data
  • Provide examples of expected output

53
Overview: RDC Data Access Procedures
  • Proposal Requirements
  • Access Methods
  • Helpful Tips
  • Where to get help?

54
Visit the RDC at www.cdc.gov/nchs/rd/rdc.htm
or email rdca@cdc.gov
55
LINKED DATA, CONTEXTUAL DATA, and GEO-CODING
ON-SITE and STAFF-ASSISTED DATA ACCESS
Christopher Rogers, Research Data Center,
cor2@cdc.gov
56
Why Link Data Sets?
  • Improve modeling and make use of existing data.
  • Compensate for increased difficulties taking
    surveys.
  • Open your mind.
  • Common example: economic variables versus ethnic
    variables

57
Historical Trends
  • More linking of scientific data sets between
    government agencies, following the Confidential
    Information Protection and Statistical Efficiency
    Act of 2002 (CIPSEA).
  • Confused political and social situation in US.

58
Quality NCHS Resources
  • Linked Birth and Infant Death Data with Fetal
    Death Data.
  • Geo-coded NHIS 1986-2003 (2004-2005).
  • Geo-coded NHANES III.
  • Cycles 4, 5, and 6 NSFG Contextual Data.
  • Linked Data Sets described earlier.

59
Linked Birth and Infant Death
  • Designed to study factors in infant death.
  • Links birth and death certificates for deaths
    under one year of age. Includes fetal deaths for
    1995-1997.
  • Years: 1983-1991 and 1995-1997
  • Numerator File (for deceased children): Parental
    information and behavior, prenatal care, infant
    health variables, demographics, cause of death.
  • Denominator File (for control group): Parental
    information and behavior, prenatal health, infant
    health, demographics.
  • Fetal Death Data: 1995-1997
  • Restricted Data: County/city of mother's
    residence or county of child's birth or death
    when the population is under 250,000 (under
    100,000 starting in 1989).

60
Data Example
  • From the Division of Vital Statistics. Proposals
    or questions can go either to the RDC or the DVS.
  • Fetal Death Data portion. Given 1989-1999.
  • Linked to county level contextual data.
  • Goal: to model fetal death with emphasis on ground
    water quality. The model estimates death rates for
    each county.

61
Geo-Coded NHIS
  • National Health Interview Survey. RDC has access
    to files from 1963 to present. Previously
    geo-coded households for 1986-1994. Recently
    geo-coded by RDC from 1995-2003. 2004-2005
    coding in progress.
  • State (2 digits), County (3 digits), Tract (6
    digits), Block Group (1 digit), and Block (3-4
    digits) levels. Households coded to 1990 and
    2000 Censuses.
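
The digit widths above can be illustrated with a short parser. This is a sketch under an assumption the slides do not confirm: that state, county, tract, and block group are stored concatenated, in that order, as one 12-digit string. The class and function names are hypothetical, and the code used in the usage note is made up.

```python
from typing import NamedTuple


class GeoCode(NamedTuple):
    state: str        # 2 digits
    county: str       # 3 digits
    tract: str        # 6 digits
    block_group: str  # 1 digit


def split_block_group_code(code: str) -> GeoCode:
    """Split a 12-digit state+county+tract+block group string into its parts."""
    if len(code) != 12 or not code.isdigit():
        raise ValueError("expected a 12-digit state+county+tract+block group code")
    return GeoCode(code[0:2], code[2:5], code[5:11], code[11])
```

With the made-up code "240338001021", the parts are state "24", county "033", tract "800102", and block group "1". Keeping the parts as strings preserves leading zeros.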

62
Geo-Coded NHANES III
  • NHANES III is also linked to NDI Mortality data.
  • NHANES III has been geo-coded twice. The RDC has
    done it at the same level of detail as NHIS.
  • Continuous NHANES has not been geo-coded yet.
  • Example: Large project with neighborhood,
    economic, ethnic, and individual medical and
    behavioral variables. Multi-level models.

63
NSFG Contextual Data
  • Contextual variables available with Cycles 4, 5,
    and 6. Supplied for each individual in sample.
  • Cycle 6: 1,054 contextual variables at the state,
    county, tract, and block group levels, for
    respondent addresses in 2000 and 2002.
  • Contextual data include both economic and
    demographic characteristics of locations. Easily
    merged by case ID to individual characteristics,
    behaviors, and histories.
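
A merge by case ID of the kind described above might look like the following. This is a generic left join in plain Python, not the RDC's procedure (at the RDC, staff perform all merging); the key name `case_id` and the sample column names are hypothetical.

```python
def merge_by_case_id(respondents, contextual, key="case_id"):
    """Left-join contextual rows onto respondent rows using a shared case ID.

    Respondents without a matching contextual row are kept unchanged.
    """
    by_id = {row[key]: row for row in contextual}
    merged = []
    for resp in respondents:
        row = dict(resp)  # copy so the input rows are not modified
        extra = by_id.get(resp[key], {})
        row.update({k: v for k, v in extra.items() if k != key})
        merged.append(row)
    return merged
```

Usage: merging `[{"case_id": 1, "age": 17}]` with `[{"case_id": 1, "county_unemp": 5.2}]` yields one row carrying both `age` and `county_unemp`.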

64
Simple NSFG Example
  • A simple example relating state-level economics,
    ethnicity, and behavior, but not using
    contextual variables.
  • Treatment: States given a waiver to offer more
    family planning services (FPS).
  • Questions:
  • FPS effects on behavior?
  • FPS effect on pregnancy rates?
  • Differential impacts across demographic subgroups?

65
Change of Topic: Accessing Data
  • On-site access to data at the RDC in Hyattsville.
  • Staff-assisted remote access to data via e-mail.
  • Researchers often use both types of access.
  • Potential Designated Agent status. (CIPSEA)
  • The RDC has put many resources into automated
    remote access.

66
On-Site Access
  • Rules are in the 24-page file
    GuidelinesRDC11-8-05.pdf, available on-line.
  • The RDC and NCHS surveys have knowledgeable
    professional staffs that review proposals
    carefully. Clients can only remove what has been
    approved. Checked by staff.
  • Exploratory Data Analysis. If needed, ask.
    Recent example: Checking general shapes of
    variables for model validity. OKed by survey.
  • Modeling needs. Recent example: Nested
    randomized geo-codes.
  • Estimation problems. Example: Single PSU in a
    Stratum.

67
Staff-Assisted Remote Access
  • Analysis done through a particular staff member.
    Usually efficient, but could be very busy.
  • Staff member determines costs based on time.
  • Staff usually not asked to do much programming.
  • Staff creates data, runs e-mailed programs,
    checks, and returns output to researcher.
  • Staff can do exploratory analysis, if needed.
  • Staff can help check modeling problems.
  • Commonly done after on-site visit.

68
Our Mission
  • The RDC has a professional staff dedicated to
    helping researchers uncover knowledge and advance
    understanding.

69
Remote Access System
Vijay Gambhir
70
Remote Access System
  • Envisioned as an integral Part of RDC
  • Pre onsite usage
  • Post onsite usage
  • Super store/ Convenience store

71
Basics of Remote Access System
  • Object oriented, event driven system based upon
    the principles of distributed computing
  • About two years of development efforts
  • Set of applications called in service by resident
    component
  • Advanced pattern recognition techniques

72
Analytic Data Research by Email (ANDRE)
  • NCHS has been providing remote data access to
    researchers through ANDRE since April 1998.
  • In the past five years, ANDRE has served 45
    different data analysts and executed over 9,500
    SAS programs for their research programs.

73
Main Features of ANDRE
  • Completely automated system
  • Operates round the clock
  • without any human intervention
  • Registered subscribers only
  • Proposals already reviewed and approved
  • Have an agreement with NCHS/RDC
  • Unlimited Access during the subscription period

74
Data Requests
  • Registered user can submit data requests by email
    from anywhere and at any time.
  • Results of the data request released to a
    specified email address that has been certified
    as secure by the subscriber and approved by
    NCHS/RDC.

75
Authentication
  • Multi-levels of system security
  • Submission syntax
  • User id
  • Password
  • Email/code word
  • Package
  • Path info

76
Data Request Analysis
  • Compliance with the disclosure limitation
    constraints of NCHS
  • Integrity of the system
  • Resource constraints (CPU time, storage
    requirements)
  • Protection of ANDRE's work environment

77
Prevention of Direct Disclosure
  • Cleaning up of the Log File
  • Categorization of SAS commands/words
  • Forbidden Commands
  • Modifications to the Commands
  • Output suppression

78
Sample Original Log
    1    options nocenter;
    2    Data one;
    3    Infile 'd:\nchs\respnd95.dat' lrecl=13064;
    4    Input
    5    TODAYSPG 6847-6847
    6    CONSTAT1 11934-11935
    7    CONSTAT2 11936-11937
    8    CONSTAT3 11938-11939
    9    CONSTAT4 11940-11941
    10   SEX1MTHD 11945-11946
    11   POST_WT 12350-12359;
    12   if constat1 = 'ab' then vjvar=1; else vjvar=2;
    13   WGT1000=POST_WT/1000;
    14   title 'NSFG cycle 1995';
    NOTE: Character values have been converted to
          numeric values at the places given by:
          (Line):(Column).
          12:15
    NOTE: The infile 'd:\nchs\respnd95.dat' is:
          File Name=d:\nchs\respnd95.dat,

79
Sample Original Log (cont.)
    12901 11232521101 0526721310303392181193101
    1103 01030000000321120000392702210611511200403
    1344 1316
    13001 622501001006034
    TODAYSPG=1 CONSTAT1=5 CONSTAT2=88 CONSTAT3=88
    CONSTAT4=88 SEX1MTHD=1 POST_WT=2545.7569 vjvar=2
    WGT1000=2.5457569 _ERROR_=1
    _N_=20
    NOTE: 10847 records were read from the infile
          'd:\nchs\respnd95.dat'.
          The minimum record length was 13064.
          The maximum record length was 13064.
    NOTE: The data set WORK.ONE has 10847
          observations and 9 variables.
    NOTE: DATA statement used:
          real time           39.88 seconds
          cpu time            12.10 seconds
    15   proc freq;
    16   tables CONSTAT1 vjvar;
    17   run;

80
Sample Cleaned Log
    1    options nocenter;
    2    Data one;
    3    Infile 'd:\nchs\respnd95.dat' lrecl=13064;
    4    Input
    5    TODAYSPG 6847-6847
    6    CONSTAT1 11934-11935
    7    CONSTAT2 11936-11937
    8    CONSTAT3 11938-11939
    9    CONSTAT4 11940-11941
    10   SEX1MTHD 11945-11946
    11   POST_WT 12350-12359;
    12   if constat1 = 'ab' then vjvar=1; else vjvar=2;
    13   WGT1000=POST_WT/1000;
    14   title 'NSFG cycle 1995';
    NOTE: Character values have been converted to
          numeric values at the places given by:
          (Line):(Column).
          12:15
    NOTE: The infile 'd:\nchs\respnd95.dat' is:
          File Name=d:\nchs\respnd95.dat,

81
Sample Cleaned Log (cont.)
    NOTE: 10847 records were read from the infile
          'd:\nchs\respnd95.dat'.
          The minimum record length was 13064.
          The maximum record length was 13064.
    NOTE: The data set WORK.ONE has 10847
          observations and 9 variables.
    NOTE: DATA statement used:
          real time           39.88 seconds
          cpu time            12.10 seconds

    15   proc freq;
    16   tables CONSTAT1 vjvar;
    17   run;

    NOTE: There were 10847 observations read from the
          data set WORK.ONE.
    NOTE: PROCEDURE FREQ used:
          real time           0.49 seconds
          cpu time            0.04 seconds
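
Comparing the original and cleaned samples, the cleaned log omits the raw input record and the variable-value listing that SAS prints when a data error occurs. The slides do not describe how ANDRE does this, so the following is only a rough heuristic sketch in Python: it drops any log line made up entirely of digit runs and NAME=value pairs. The function names and the rule itself are assumptions.

```python
import re

# A token such as CONSTAT1=5 or _ERROR_=1 (a variable-value pair printed by SAS).
_PAIR = re.compile(r"^[A-Za-z_]\w*=[\w.+-]*$")


def is_data_dump(line: str) -> bool:
    """True when every token on the line is a digit run or a NAME=value pair."""
    tokens = line.split()
    if not tokens:
        return False  # keep blank lines
    return all(tok.isdigit() or _PAIR.match(tok) for tok in tokens)


def clean_log(lines):
    """Return the log lines with suspected data dumps removed."""
    return [line for line in lines if not is_data_dump(line)]
```

Numbered source statements such as `13   WGT1000=POST_WT/1000;` survive because the assignment token contains `/` and `;`, which the pair pattern rejects. A production filter would need a stricter rule than this.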


82
Forbidden Commands
  • Commands that pose unacceptable disclosure
    risks, OR are disallowed to protect the
    integrity/internal environment of ANDRE:

    Add        firstobs   report     iml
    Print      first.     Pctn       nofreq
    Obs        last.      Pctsum     nocum
    Firstobs   nocol      tabulate   editor
    Browse     summary    list       put
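
A scan for the words listed above could be sketched as follows. This is not ANDRE's implementation, which the slides do not show; it is a simple case-insensitive word match in Python with hypothetical names. It does not understand SAS syntax, so it would also flag these words inside quoted strings or comments.

```python
import re

# Words taken from the slide, lower-cased; "first." and "last." keep their dots.
FORBIDDEN = {
    "add", "firstobs", "report", "iml", "print", "first.", "pctn", "nofreq",
    "obs", "last.", "pctsum", "nocum", "nocol", "tabulate", "editor",
    "browse", "summary", "list", "put",
}

# An identifier, optionally followed by a dot (to catch FIRST.var and LAST.var).
_WORD = re.compile(r"[A-Za-z_]\w*\.?")


def find_forbidden(program: str):
    """Return the sorted forbidden words that appear in a SAS program text."""
    found = set()
    for match in _WORD.finditer(program):
        word = match.group(0).lower()
        if word in FORBIDDEN:
            found.add(word)
        elif word.endswith(".") and word[:-1] in FORBIDDEN:
            found.add(word[:-1])
    return sorted(found)
```

For example, `find_forbidden("proc print data=one(obs=10); run;")` returns `["obs", "print"]`.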

83
Commands Modification
  • Modify the user's program to enforce restrictions
    on the options allowed with certain SAS procedures,
    to prevent objectionable information from appearing
    in the output
  • PROC MEANS n mean std

84
Output Suppression
  • Wiping out extreme values from the output of
    Proc Univariate.
  • Suppressing the complete output line (Procs Means,
    Corr, Univariate, etc.) where the sample size is
    less than the minimum acceptable value.
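
The line-level rule can be sketched as below. The slides do not state the minimum acceptable sample size, so `min_n` is a hypothetical parameter, and the column layout is invented for illustration. The "Values Suppressed" marker mirrors the wording in the sample output.

```python
def render_means(rows, min_n):
    """Format (name, n, mean, std) rows, replacing any row whose sample size
    is below min_n with a 'Values Suppressed' marker line."""
    lines = []
    for name, n, mean, std in rows:
        if n < min_n:
            lines.append("Values Suppressed")
        else:
            lines.append(f"{name:<10}{n:>8}{mean:>14.7f}{std:>14.7f}")
    return lines
```

Replacing the whole line, including the variable name, avoids revealing which variable had too few observations.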

85
Proc Means Suppression
The MEANS Procedure

Variable  Label                                          N          Mean       Std Dev
---------------------------------------------------------------------------------------
EXPEND_R  Current expend/pupil in public schl/1000    5424     5.0830820     1.3958710
Values Suppressed
RPUB87    exp. for contr. serv. and supplies 1997     5424   23472052.60   18806802.86
RPUB92    exp. for contr. serv. and supplies 1997     5424   34800922.98   30481634.59
PRGPRO    Coordinated Pregnancy Prevention Program    1708     0.0679157     0.2516749
HIVED     HIV/AIDS Education                          1708     3.5146370     0.8044378
Values Suppressed
PRGPRO87  Coordinated Pregnancy Prevention Program    5424     0.0540192     0.2260764
HIVED87   HIV/AIDS Education                          5424     3.4968658     0.8008324
WT_PER15  Wt females aged 15-19/total 15-19           5424     0.7279681     0.1265796
BK_PER15  Bk females aged 15-19/total 15-19           5424     0.1409869     0.0932332
HS_PER15  Hs females aged 15-19/total 15-19           5424     0.0962413     0.1055191
TEENMMC2  Teenmom by cohort (1,2,3r)                  1201     1.7119067     0.7715351
C18_2_1S  R in C2 (vs 1) at 18-19 endpt (1,2)         1770     1.5248588     0.4995228
TM2_1S18  R tnmm in Coh 2 (vs 1)-age 18 @ ext          358     1.4804469     0.5003168

86
Proc Univariate Output: Unsuppressed
    The SAS System                                                      9
                                       14:09 Sunday, October 24, 1999

                        Univariate Procedure

    Variable=AVHRATET

                 Moments                          Quantiles(Def=5)

    N              2283  Sum Wgts      2283    100% Max  -0.25314   99%  -1.62008
    Mean       -4.66219  Sum       -10643.8     75% Q3   -3.56179   95%  -2.37588
    Std Dev    1.892017  Variance   3.57973     50% Med  -4.50491   90%  -2.79152
    Skewness   -2.11919  Kurtosis  6.892929     25% Q1   -5.30374   10%  -6.07639
    USS        57792.36  CSS       8168.944      0% Min  -13.5463    5%  -7.19645
    CV         -40.5821  Std Mean  0.039598                          1%  -12.7402
    T:Mean=0   -117.738  Pr>|T|      0.0001    Range     13.29321
    Num ^= 0       2283  Num > 0          0    Q3-Q1     1.741949
    M(Sign)     -1141.5  Pr>=|M|     0.0001    Mode      -13.5463
    Sgn Rank   -1303593  Pr>=|S|     0.0001

                              Extremes

        Lowest    Obs     Highest    Obs
      -13.5463(  1547)   -0.90519(   649)
      -13.5397(  1836)   -0.81756(  1094)

87
Proc Univariate Output: Suppressed
    The SAS System                                                      9
                                       14:09 Sunday, October 24, 1999

                        Univariate Procedure

    Variable=AVHRATET

                 Moments                          Quantiles(Def=5)

    N              2283  Sum Wgts      2283    100% Max  -0.25314   99%  -1.62008
    Mean       -4.66219  Sum       -10643.8     75% Q3   -3.56179   95%  -2.37588
    Std Dev    1.892017  Variance   3.57973     50% Med  -4.50491   90%  -2.79152
    Skewness   -2.11919  Kurtosis  6.892929     25% Q1   -5.30374   10%  -6.07639
    USS        57792.36  CSS       8168.944      0% Min  -13.5463    5%  -7.19645
    CV         -40.5821  Std Mean  0.039598                          1%  -12.7402
    T:Mean=0   -117.738  Pr>|T|      0.0001    Range     13.29321
    Num ^= 0       2283  Num > 0          0    Q3-Q1     1.741949
    M(Sign)     -1141.5  Pr>=|M|     0.0001    Mode      -13.5463
    Sgn Rank   -1303593  Pr>=|S|     0.0001

88
Proc Univariate Output: Suppressed (sample size 1)
  • Univariate Procedure
  • Variable=FREQ (sum) freq
  • Moments
  • Quantiles(Def=5)
  • Serious disclosure limitation violations
  • Values too low to release
  • Output of Proc Univariate withheld
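
The three Proc Univariate slides suggest a simple release rule: withhold everything when the sample is too small, and otherwise release summary statistics while dropping the Extremes listing that ties individual values to observation numbers. The following is a minimal Python sketch of that idea, not the RDC's actual program. The function name `release_univariate` and the minimum sample size of 5 are assumptions made for illustration; the slides do not state the real threshold.

```python
import statistics
from typing import Dict, Optional, Sequence

MIN_N = 5  # assumed minimum sample size; the slides do not give the real value


def release_univariate(values: Sequence[float],
                       min_n: int = MIN_N) -> Optional[Dict[str, float]]:
    """Return releasable summary statistics, or None if output is withheld.

    No 'extremes' entry is ever produced, so individual observations
    (value plus observation number) are never listed.
    """
    n = len(values)
    # stdev needs at least two points, so never go below 2.
    if n < max(min_n, 2):
        return None  # "Values too low to release": withhold all output
    return {
        "n": n,
        "mean": statistics.fmean(values),
        "std_dev": statistics.stdev(values),
        "median": statistics.median(values),
    }
```

A caller would treat a `None` result as "Output of Proc Univariate withheld" and print only the dictionary entries otherwise.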

89
Proc Freq Suppression (One-Way Tables)
  • Suppress at least two consecutive rows to prevent
    derivation of suppressed values from cumulative
    totals.
  • Disallow single-row output.
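
Why two consecutive rows: if only one row is hidden, its frequency is simply the difference between the cumulative totals printed just before and just after it. With two adjacent hidden rows, only their sum can be recovered. The sketch below illustrates the rule in Python under stated assumptions: the threshold of 5 and the name `suppress_one_way` are hypothetical, and the neighbor chosen for the extra suppression (the next row, or the previous row at the end of the table) is one simple choice, not necessarily the one the RDC software makes.

```python
from typing import List, Optional

MIN_CELL = 5  # assumed minimum releasable frequency


def suppress_one_way(counts: List[int],
                     min_cell: int = MIN_CELL) -> Optional[List[bool]]:
    """Return a mask (True = suppress the row), or None to withhold the table."""
    n = len(counts)
    if n < 2:
        return None  # single-row output is disallowed
    mask = [c < min_cell for c in counts]  # primary suppression
    i = 0
    while i < n:
        if not mask[i]:
            i += 1
            continue
        # Find the end of the run of suppressed rows starting at i.
        j = i
        while j + 1 < n and mask[j + 1]:
            j += 1
        if j == i:
            # Isolated suppressed row: hide a neighbor as well.
            k = i + 1 if i + 1 < n else i - 1
            mask[k] = True
            if k > i:
                continue  # rescan the run, which now has at least two rows
        i = j + 1
    return mask
```

In the displayed table, a suppressed row would also have its cumulative frequency and cumulative percent masked, as in the example table on the next slide.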

90
One-Way Freq Table: Suppressed
  • LOGRNTOPAT      Frequency   Percent   Cumulative Frequency   Cumulative Percent
  • --------------------------------------------------------------------------------
  • 0.2277839309    ?????       ?????     ?????                  ?????
  • 0.2277839309    ?????       ?????     ?????                  ?????
  • 0.2305236586    5           0.08      6429                   97.99
  • 0.231111721     5           0.08      6434                   98.06
  • 0.232058915     ?????       ?????     ?????                  ?????
  • 0.232058915     ?????       ?????     ?????                  ?????
  • 0.2436220827    ?????       ?????     ?????                  ?????
  • 0.2436220827    ?????       ?????     ?????                  ?????
  • 0.2498117984    6           0.09      6456                   98.40
  • 0.2504106777    6           0.09      6462                   98.49
  • 0.2513144283    18          0.27      6480                   98.77
  • 0.2595111955    6           0.09      6486                   98.86
  • 0.2670627852    ?????       ?????     ?????                  ?????
  • 0.2670627852    ?????       ?????     ?????                  ?????
  • 0.2736958305    5           0.08      6500                   99.07
  • 0.2814124594    5           0.08      6505                   99.15

91
One-Way Freq Table: Suppressed (cont.)
  • LOGRNTOPAT      Frequency   Percent   Cumulative Frequency   Cumulative Percent
  • --------------------------------------------------------------------------------
  • 0.3403258059    ?????       ?????     ?????                  ?????
  • 0.3403258059    ?????       ?????     ?????                  ?????
  • 0.3715635564    6           0.09      6537                   99.63
  • 0.3856624808    ?????       ?????     ?????                  ?????
  • 0.3856624808    ?????       ?????     ?????                  ?????
  • 0.6931471806    6           0.09      6550                   99.83
  • 1.2527629685    ?????       ?????     ?????                  ?????
  • 1.2527629685    ?????       ?????     ?????                  ?????
  • 1.2527629685    ?????       ?????     ?????                  ?????

92
Proc Freq Suppression (Two-way Tables)
  • Row and column totals are preserved.
  • Cells with values less than the acceptable
    minimum are suppressed.
  • Additional cells are suppressed to ensure that no
    row and no column has a single suppression.
  • Logical stitching of horizontal and vertical
    splits.
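
Because row and column totals stay visible, a lone suppressed cell in any row or column could be recovered by subtraction, which is why extra (complementary) suppressions are needed. The following Python sketch shows one simple way to apply the first three rules. It is a heuristic written for illustration, not the RDC's algorithm: the name `suppress_two_way`, the threshold of 5, and the choice of the smallest visible cell as the complement are all assumptions. The stitching of tables split across pages is not modeled.

```python
from typing import List, Tuple

MIN_CELL = 5  # assumed minimum releasable cell count


def suppress_two_way(table: List[List[int]],
                     min_cell: int = MIN_CELL) -> List[List[bool]]:
    """Return a mask (True = suppress) for the interior cells of a two-way table.

    Totals are not part of `table`; they are assumed to be published as is.
    Zero cells are treated like any other small count in this sketch.
    """
    rows, cols = len(table), len(table[0])
    # Primary suppression: cells below the minimum.
    mask = [[table[r][c] < min_cell for c in range(cols)] for r in range(rows)]

    def add_complement(cells: List[Tuple[int, int]]) -> bool:
        """If exactly one cell in this row/column is hidden, hide one more."""
        hidden = [rc for rc in cells if mask[rc[0]][rc[1]]]
        if len(hidden) != 1:
            return False
        visible = [rc for rc in cells if not mask[rc[0]][rc[1]]]
        if not visible:
            return False
        # Assumed choice: hide the smallest remaining cell.
        r, c = min(visible, key=lambda rc: table[rc[0]][rc[1]])
        mask[r][c] = True
        return True

    # Repeat until no row or column has a single suppression.
    changed = True
    while changed:
        changed = False
        for r in range(rows):
            changed |= add_complement([(r, c) for c in range(cols)])
        for c in range(cols):
            changed |= add_complement([(r, c) for r in range(rows)])
    return mask
```

The loop always terminates because each pass either hides at least one more cell or stops. Choosing the smallest visible cell keeps information loss low but is only one possible rule.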

93
Proc Freq Two-way Tables Suppression
  • TABLE OF FAMREL BY FAMSIZER
  • FAMREL (rows) by FAMSIZER (columns); each cell lists Frequency, Percent, Row Pct, Col Pct
  • FAMREL  Statistic         2        3        4        5    Total
  • ---------------------------------------------------------------
  • 3       Frequency        94      388      792      533     2206
  • 3       Percent        3.97    16.40    33.47    22.53    93.24
  • 3       Row Pct        4.26    17.59    35.90    24.16
  • 3       Col Pct       98.95    96.28    96.12    94.34
  • ---------------------------------------------------------------
  • 4       Frequency    ??????        9       22       27      104
  • 4       Percent      ??????     0.38     0.93     1.14     4.40
  • 4       Row Pct      ??????     8.65    21.15    25.96
  • 4       Col Pct      ??????     2.23     2.67     4.78
  • ---------------------------------------------------------------
  • 6       Frequency    ??????        6       10        5       56
  • 6       Percent      ??????     0.25     0.42     0.21     2.37

94
Proc Freq Two-way Tables Suppression (cont.)
  • checking frequencies    4
  • 12:01 Thursday, May 6, 1999
  • TABLE OF FAMREL BY FAMSIZER
  • FAMREL (rows) by FAMSIZER (columns); each cell lists Frequency, Percent, Row Pct, Col Pct
  • FAMREL  Statistic         6        7        8        9    Total
  • ---------------------------------------------------------------
  • 3       Frequency       209       98       19       73     2206
  • 3       Percent        8.83     4.14     0.80     3.09    93.24
  • 3       Row Pct        9.47     4.44     0.86     3.31
  • 3       Col Pct       90.48    83.05    59.38    74.49
  • ---------------------------------------------------------------
  • 4       Frequency        13       10   ??????       12      104
  • 4       Percent        0.55     0.42   ??????     0.51     4.40
  • 4       Row Pct       12.50     9.62   ??????    11.54
  • 4       Col Pct        5.63     8.47   ??????    12.24
  • ---------------------------------------------------------------

95
Fully Automated and Expert System?
  • Fully automated?
  • Reboot to deal with memory leakage.
  • Confidentiality expert? How reliable?
  • Only as good as the underlying algorithms. Needs
    constant monitoring.

101
What is new?
  • Improved and expanded hardware platform
  • Two machines dedicated to heavy remote access
    usage
  • Three additional machines dedicated to general
    remote access usage

102
What is new? (cont.)
  • SUDAAN now available to remote access users
  • Proc Crosstab
  • Proc Rlogist
  • Proc Regress
  • Proc Multilog
  • Proc Survival

103
What is new? (cont.)
  • Proc Descript
  • Other new SUDAAN procedures will be made
    available shortly
  • Plans to make Stata available through remote
    access

104
What is new? (cont.)
  • Web component of ANDRE under construction.
  • On-line scanning of users' code
  • Valuable research tools and information readily
    available to the users.

105
Contact Information
  • For general questions/comments
  • Email: rdca@cdc.gov; Phone: (301) 458-4732
  • For on-site info
  • Email: Neb9@cdc.gov; Phone: (301) 458-4097
  • For remote access info
  • Email: vgambhir@cdc.gov; Phone: (301) 458-4226