Title: Security Methods for Statistical Databases by Karen Goodwin
1Security Methods for Statistical Databases
by Karen Goodwin
2Introduction
- Statistical databases containing medical information are often used for research
- Some of the data is protected by laws that help protect the privacy of the patient
- Proper security precautions must be implemented to comply with these laws and respect the sensitivity of the data
3Accuracy vs. Confidentiality
- Accuracy
  - Researchers want to extract accurate and meaningful data
- Confidentiality
  - Patients, laws, and database administrators want to maintain the privacy of patients and the confidentiality of their information
4Laws
- Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule
- Covered organizations must comply by April 14, 2003
- Designed to improve the efficiency of the healthcare system by using electronic exchange of data while maintaining security
- Covered entities (health plans, healthcare clearinghouses, healthcare providers) may not use or disclose protected information except as permitted or required
- The Privacy Rule establishes a "minimum necessary" standard so that covered entities evaluate their current regulations and security precautions
5HIPAA Compliance
- Companies offer third-party certification of covered entities
- Such companies will check your company and associated companies for compliance with HIPAA
- This can help with rapid implementation of, and compliance with, HIPAA regulations
6Types of Statistical Databases
- Static: a static database is created once and never changes
  - Example: U.S. Census
- Dynamic: a dynamic database changes continuously to reflect real-time data
  - Example: most online research databases
7Security Methods
- Access Restriction
- Query Set Restriction
- Microaggregation
- Data Perturbation
- Output Perturbation
- Auditing
- Random Sampling
8Access Restriction
- Databases normally have different access levels for different types of users
- User IDs and passwords are the most common method of restricting access
- In a medical database:
  - Doctors and healthcare representatives: full access to information
  - Researchers: access to partial information only (e.g. aggregate information)
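The two access levels above can be illustrated with a minimal Python sketch. The role names, the sample records, and the `view_for` function are all hypothetical and exist only for this illustration; a real system would authenticate the user (e.g. by user ID and password) before the role is trusted.

```python
# Hypothetical records, for illustration only.
RECORDS = [
    {"patient": "A", "age": 34, "diagnosis": "asthma"},
    {"patient": "B", "age": 51, "diagnosis": "diabetes"},
    {"patient": "C", "age": 47, "diagnosis": "asthma"},
]


def view_for(role, records):
    """Return what a user with the given (assumed) role is allowed to see."""
    if role == "doctor":
        # Full access: individual records are returned as copies.
        return [dict(r) for r in records]
    if role == "researcher":
        # Partial access: aggregate information only, no individual rows.
        ages = [r["age"] for r in records]
        return {"count": len(records), "mean_age": sum(ages) / len(ages)}
    raise PermissionError(f"unknown role: {role!r}")


print(view_for("researcher", RECORDS))  # aggregate summary only
```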
9Query Set Restriction
- A query-set size control sets the number of records that must be in the result set
- Query results are displayed only if the size of the query set satisfies the condition
- Setting a minimum query-set size helps protect against the disclosure of individual data
10Query Set Restriction
- Let K represent the minimum number of records that must be present in the query set
- Let R represent the size of the query set
- The query set can be displayed only if
  - K ≤ R
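As a rough sketch of the rule K ≤ R, the Python function below answers a mean query only when the query set is large enough. The function name `restricted_mean`, the sample records, and the choice of a mean as the statistic are assumptions made for this example.

```python
def restricted_mean(records, predicate, field, k):
    """Return the mean of `field` over the query set, or None if the set is too small.

    k is the minimum query-set size (K); the query-set size is R = len(query_set).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    query_set = [r for r in records if predicate(r)]
    if len(query_set) < k:
        return None  # K <= R does not hold, so the result is withheld
    return sum(r[field] for r in query_set) / len(query_set)


# Hypothetical data, for illustration only.
PATIENTS = [
    {"age": 34, "zip": "22801"},
    {"age": 51, "zip": "22801"},
    {"age": 47, "zip": "22801"},
    {"age": 29, "zip": "22807"},
]

# Three matching records with K = 3: the query is answered.
print(restricted_mean(PATIENTS, lambda r: r["zip"] == "22801", "age", k=3))
# One matching record with K = 3: the query is refused (None).
print(restricted_mean(PATIENTS, lambda r: r["zip"] == "22807", "age", k=3))
```

Without the size check, the second query would return the exact age of a single patient.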
12Microaggregation
- Raw (individual) data is grouped into small aggregates before publication
- The average value of the group replaces each individual value
- Records that are most similar are grouped together to maintain data accuracy
- Helps to prevent disclosure of individual data
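The grouping-and-averaging idea can be sketched in a few lines of Python. This is only one simple way to do it, under stated assumptions: a single numeric attribute, similarity measured by sorted order, and a fixed minimum group size `k`. The function name `microaggregate` is hypothetical.

```python
def microaggregate(values, k):
    """Replace each value with the mean of its group of at least k similar values."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(values)
    # Sort positions by value so that neighbours in the ordering are similar.
    order = sorted(range(n), key=lambda i: values[i])
    result = [0.0] * n
    start = 0
    while start < n:
        end = start + k
        # Fold a short tail into the current group so no group has fewer than k members.
        if n - end < k:
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
        start = end
    return result


# Hypothetical values, for illustration only.
print(microaggregate([10, 52, 12, 50, 11, 54, 90], k=3))
# [11.0, 61.5, 11.0, 61.5, 11.0, 61.5, 61.5]
```

Because each value is replaced by its group mean, the overall total (and therefore the overall mean) of the published data matches the raw data, while no individual value is released.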
13Microaggregation
- The National Agricultural Statistics Service (NASS) publishes data about farms
- To protect against data disclosure, data is released only at the county level
- Farms in each county are averaged together to preserve as much accuracy as possible while still protecting against disclosure
16Data Perturbation
- Perturbed data is raw data with noise added
- Pro: with a perturbed database, if data is accessed without authorization, the true values are not disclosed
- Con: data perturbation runs the risk of presenting biased data
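A minimal Python sketch of data perturbation follows. It assumes zero-mean Gaussian noise with a chosen standard deviation `scale`; the slides do not say which noise distribution is used, so that choice and the name `perturb_data` are assumptions.

```python
import random


def perturb_data(values, scale, seed=None):
    """Return a perturbed copy of the raw values with Gaussian noise added to each one."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in values]


# Hypothetical raw values, for illustration only.
raw = [34.0, 51.0, 47.0, 29.0, 62.0]
stored = perturb_data(raw, scale=5.0, seed=42)

# Only the perturbed copy is stored and queried; the raw values are never exposed.
print(sum(raw) / len(raw), sum(stored) / len(stored))
```

Even with zero-mean noise, statistics computed from the perturbed data differ from the true ones. For example, independent noise with standard deviation `scale` inflates the variance by `scale ** 2` on average, which is one form of the bias noted above.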
18Output Perturbation
- Instead of the raw data being transformed, as in data perturbation, only the output (the query results) is perturbed
- The bias problem is less severe than with data perturbation
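For contrast with data perturbation, the Python sketch below leaves the stored values untouched and adds noise only to the answer of a query. The Gaussian noise, the mean query, and the name `perturbed_mean` are assumptions for this example.

```python
import random


def perturbed_mean(values, scale, rng):
    """Compute the true mean from the raw data, then perturb only the returned result."""
    true_mean = sum(values) / len(values)
    return true_mean + rng.gauss(0.0, scale)


# Hypothetical raw values, for illustration only.
raw = [34.0, 51.0, 47.0, 29.0, 62.0]
rng = random.Random(7)

# The database keeps the raw values; each answer carries its own noise.
print(perturbed_mean(raw, scale=1.0, rng=rng))
```

Since the statistic is computed from the true data and perturbed once, the error does not accumulate across records the way it can when every stored value is noisy.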
19Output Perturbation
[Figure: diagram labeled "Query" and "Results"]
20Auditing
- Auditing is the process of keeping track of all queries made by each user
- Usually done with up-to-date logs
- Each time a user issues a query, the log is checked to see whether the user is querying the database maliciously
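The log check can be sketched in Python as follows. The slides do not define what counts as a malicious query, so this example assumes one simple rule: refuse a query whose query set differs from one of the user's previously answered query sets by only a few records, since subtracting the two answers could isolate an individual. The class name `QueryAuditor` and the parameter `min_difference` are hypothetical.

```python
from collections import defaultdict


class QueryAuditor:
    """Keeps a per-user log of query sets and checks each new query against it."""

    def __init__(self, min_difference):
        self.min_difference = min_difference
        # user -> list of (query set, whether it was answered)
        self.log = defaultdict(list)

    def check(self, user, record_ids):
        """Log the query and return True if it may be answered."""
        query_set = frozenset(record_ids)
        answered = True
        for previous, was_answered in self.log[user]:
            if not was_answered:
                continue
            # Number of records that appear in exactly one of the two query sets.
            difference = len(query_set ^ previous)
            if 0 < difference < self.min_difference:
                answered = False
                break
        self.log[user].append((query_set, answered))
        return answered


auditor = QueryAuditor(min_difference=3)
print(auditor.check("alice", {1, 2, 3, 4, 5}))  # True: first query
print(auditor.check("alice", {1, 2, 3, 4}))     # False: isolates record 5
print(auditor.check("bob", {1, 2, 3, 4}))       # True: bob has no prior queries
```

The log grows with every query and each new query is compared against the user's history, which is consistent with auditing having a high processing cost.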
21Random Sampling
- Only a sample of the records meeting the requirements of the query is shown
- Must maintain consistency by giving exactly the same results for the same query
- Weakness: logically equivalent queries can result in different query sets
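One way to get both the consistency property and the weakness described above is to derive the sample from a hash of the query text. The Python sketch below assumes that design; the function name `sampled_query_set`, the sampling `rate`, and the example query strings are hypothetical.

```python
import hashlib


def sampled_query_set(query_text, matching_ids, rate):
    """Return a repeatable sample of the matching record IDs for this exact query text."""
    sample = []
    for record_id in matching_ids:
        digest = hashlib.sha256(f"{query_text}|{record_id}".encode()).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        draw = int.from_bytes(digest[:8], "big") / 2**64
        if draw < rate:
            sample.append(record_id)
    return sample


matching = list(range(1000))  # hypothetical IDs of records that satisfy the query

first = sampled_query_set("age > 40 AND sex = 'F'", matching, rate=0.5)
again = sampled_query_set("age > 40 AND sex = 'F'", matching, rate=0.5)
reordered = sampled_query_set("sex = 'F' AND age > 40", matching, rate=0.5)

print(first == again)      # same query text, same sample
print(first == reordered)  # equivalent query written differently, different sample
```

Repeating a query cannot be used to average out the sampling, but a user who rewrites the same condition in different forms can collect several distinct samples of the same underlying query set.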
22Comparison Methods
The following criteria are used to determine the
most effective methods of statistical database
security
- Security: possibility of exact disclosure, partial disclosure, robustness
- Richness of information: amount of non-confidential information eliminated, bias, precision, consistency
- Costs: initial implementation cost, processing overhead per query, user education
23A Comparison of Methods
Method                 Security      Richness of Information  Costs
Query-set Restriction  Low           Low (1)                  Low
Microaggregation       Moderate      Moderate                 Moderate
Data Perturbation      High          High-Moderate            Low
Output Perturbation    Moderate      Moderate-Low             Low
Auditing               Moderate-Low  Moderate                 High
Sampling               Moderate      Moderate-Low             Moderate

(1) Quality is low because a lot of information can be eliminated if the query does not meet the requirements
24Sources
- This presentation is posted on http://www.cs.jmu.edu/users/aboutams
- Adam, Nabil R., and Wortmann, John C. Security-Control Methods for Statistical Databases: A Comparative Study. ACM Computing Surveys, Vol. 21, No. 4, December 1989 (http://delivery.acm.org/10.1145/80000/76895/p515-adam.pdf?key1=76895&key2=1947043301&coll=portal&dl=ACM&CFID=4702747&CFTOKEN=83773110)
- Official HIPAA (http://cms.hhs.gov/hipaa/)
- Bernstein, Stephen W. Impact of HIPAA on BioTech/Pharma Research: Rules of the Road (http://www.privacyassociation.org/docs/3-02bernstein.pdf)
- Service Bureau 3rd Party Testing (http://hipaatesting.com/service_bureau.html)