Privacy Statistics and Data Linkage

1
Privacy Statistics and Data Linkage
  • Mark Elliot
  • Confidentiality and Privacy Group
  • University of Manchester

2
Overview
  • The disclosure risk problem
  • Some e-science possibilities
  • Monitored data access
  • Grid-based Data Environment Analysis
  • The meaning of privacy

3
Data Data Everywhere
  • Massive and exponential increase in data: Mackey
    and Purdam (2002); Purdam and Elliot (2002).
  • These studies have led to the setting up of the
    data monitoring service.
  • Singer (1999) noted three behavioural tendencies:
  • Collect more information on each population unit
  • Replace aggregate data with person-specific
    databases
  • Given the opportunity, collect personal
    information
  • Purdam and Elliot add a fourth:
  • Link data whenever you can

4
Disclosure Risk I: Microdata
5
The Disclosure Risk Problem Type I: Identification
[Diagram: an identification file holding ID variables (Name, Address) and key variables (Sex, Age, ..) is matched against a target file holding the same key variables (Sex, Age, ..) and target variables (Income, ..).]
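
The identification scenario in this diagram can be sketched in a few lines. This is a minimal illustration, not the presenter's method: the records, the choice of sex and age as key variables, and the function names are all hypothetical. It treats a target record as identified only when its key combination is unique in both files.

```python
from collections import Counter, defaultdict

KEY_VARIABLES = ("sex", "age")  # assumed key variables, shared by both files

# Hypothetical toy records, invented for illustration only.
identification_file = [
    {"name": "A. Smith", "address": "1 High St", "sex": "F", "age": 34},
    {"name": "B. Jones", "address": "2 Low Rd", "sex": "M", "age": 34},
    {"name": "C. Brown", "address": "3 Mill Ln", "sex": "M", "age": 51},
    {"name": "D. Green", "address": "4 Park Av", "sex": "M", "age": 51},
]
target_file = [
    {"sex": "F", "age": 34, "income": 28000},
    {"sex": "M", "age": 34, "income": 41000},
    {"sex": "M", "age": 51, "income": 19000},
]


def key_of(record):
    """Project a record onto the key variables."""
    return tuple(record[v] for v in KEY_VARIABLES)


def identify(identification_file, target_file):
    """Return (name, income) for each one-to-one match on the key variables."""
    candidates = defaultdict(list)
    for person in identification_file:
        candidates[key_of(person)].append(person)
    target_counts = Counter(key_of(row) for row in target_file)

    matches = []
    for row in target_file:
        key = key_of(row)
        # A match is only claimed when the key is unique in both files.
        if len(candidates[key]) == 1 and target_counts[key] == 1:
            matches.append((candidates[key][0]["name"], row["income"]))
    return matches


matches = identify(identification_file, target_file)
```

In this toy data the two 34-year-olds are unique on the key and so acquire an income, while the two 51-year-old men share a key and remain ambiguous.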


6
Disclosure Risk II: Aggregate Tables of Counts
7
The Disclosure Risk Problem Type II: Attribution
8
The Disclosure Risk Problem Type II: Attribution
9
The Disclosure Risk Problem Type II: Attribution
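
The content of these three slides did not survive in the transcript. As a hedged illustration of the general idea only: in a table of counts, attribution becomes possible when every member of a group falls into a single category, so that knowing someone is in the group reveals their attribute. The table and function name below are hypothetical.

```python
def attributable_groups(table):
    """Return {row: column} for rows whose whole count sits in one column."""
    found = {}
    for row, counts in table.items():
        nonzero = [column for column, n in counts.items() if n > 0]
        if len(nonzero) == 1:
            found[row] = nonzero[0]
    return found


# Hypothetical table of counts, invented for illustration only.
counts = {
    "age 16-24": {"employed": 12, "unemployed": 5},
    "age 25-44": {"employed": 30, "unemployed": 0},
    "age 45-64": {"employed": 21, "unemployed": 4},
}

risky = attributable_groups(counts)
```

Here the zero cell means that anyone known to be aged 25-44 in this population can be attributed the status "employed" without being individually identified.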
10
Multiple datasets
  • Disclosure risk assessment for single datasets is
    a reasonably well understood problem.
  • But what happens with multiple datasets?

11
Data Mining and the Grid
  • Traditional Data Mining examines and identifies
    patterns on single (if massive) datasets.
  • But Data Mining is really a method/approach/technology
    that has been waiting for the grid to happen.

12
  • Smith and Elliot (2005, 2006, 2007)
  • Increases in data availability lead inexorably to
    an increase in disclosure risk
  • My ability to make linkages (disclosive or
    otherwise) between datasets X and Y is
    facilitated by the copresence of dataset Z.
  • It's all about information!
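
The point about dataset Z can be made concrete with a small sketch. Everything here is an assumption for illustration: the three toy datasets, their variable names, and the `link` helper. X and Y share no variables, so they cannot be linked directly; Z carries the variables of both and acts as a bridge.

```python
def link(left, right, keys):
    """Inner join of two lists of dicts on the given key variables."""
    index = {}
    for r in right:
        index.setdefault(tuple(r[k] for k in keys), []).append(r)
    joined = []
    for l in left:
        for r in index.get(tuple(l[k] for k in keys), []):
            joined.append({**l, **r})
    return joined


# Hypothetical datasets, invented for illustration only.
X = [
    {"x_id": "x1", "postcode": "M13", "birth_year": 1970},
    {"x_id": "x2", "postcode": "M14", "birth_year": 1985},
]
Y = [
    {"y_id": "y1", "employer": "Acme", "job": "nurse"},
    {"y_id": "y2", "employer": "Bolt", "job": "driver"},
]
Z = [
    {"postcode": "M13", "birth_year": 1970, "employer": "Acme", "job": "nurse"},
    {"postcode": "M14", "birth_year": 1985, "employer": "Bolt", "job": "driver"},
]

shared_xy = set(X[0]) & set(Y[0])  # empty: no direct linkage is possible

# Link X to Z on X's variables, then carry the result over to Y on Y's.
x_to_z = link(X, Z, ("postcode", "birth_year"))
x_to_y = link(x_to_z, Y, ("employer", "job"))
pairs = [(row["x_id"], row["y_id"]) for row in x_to_y]
```

Neither X nor Y changed, yet the arrival of Z turns two unlinkable datasets into linked pairs.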

13
CLEF: Clinical e-Science Framework
  • A solution involving monitored access

14
CLEF Consortium
  • Approximately 40 staff from:
  • University of Manchester
  • University of Sheffield
  • University College London
  • University of Brighton
  • Royal Marsden Hospital, London

15
Purpose
  • To provide a system for allowing research access
    to patient data, whilst maintaining privacy.
  • Patient records
  • Database
  • Texts such as referral letters and other clinical
    texts
  • A text mining system converts the texts to microdata

16
CLEF: one possible architecture
[Architecture diagram; component labels: Firewall, Raw Data, Pre-access DQI Monitor, Pre-access SDRA/SDC, Treated Data, Pre-output SDRA/SDC, Pre-output DQI Monitor, Data Intrusion Sentry, Workbench]
17
Data Sentry: an AI system
  • Monitors patterns of analytical requests
  • Three levels: user, institution, world.
  • Looks for intrusive patterns.
  • Tracks numbers of requests.
  • Stores analytical requests for future use.
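
A minimal sketch of the simplest signal listed above, request counts tracked at the three levels. The slides describe an AI system that looks for intrusive patterns; plain threshold counting is a deliberately crude stand-in, and the class name, the limits, and the request strings are all hypothetical.

```python
from collections import Counter


class RequestSentry:
    """Log analytical requests and flag levels whose count exceeds a limit."""

    def __init__(self, limits):
        self.limits = limits  # e.g. {"user": 2, "institution": 3, "world": 10}
        self.counts = {level: Counter() for level in ("user", "institution", "world")}
        self.log = []  # stored requests, kept for later pattern analysis

    def record(self, user, institution, request):
        """Store one request and return the levels it pushes over their limit."""
        self.log.append((user, institution, request))
        keys = {"user": user, "institution": institution, "world": "all"}
        flags = []
        for level, key in keys.items():
            self.counts[level][key] += 1
            if self.counts[level][key] > self.limits[level]:
                flags.append(level)
        return flags


sentry = RequestSentry({"user": 2, "institution": 3, "world": 10})
history = [
    sentry.record("u1", "inst_a", "crosstab(age, diagnosis)"),
    sentry.record("u1", "inst_a", "crosstab(age, diagnosis, sex)"),
    sentry.record("u1", "inst_a", "crosstab(age, diagnosis, sex, ward)"),
    sentry.record("u2", "inst_a", "mean(length_of_stay)"),
]
```

The third request trips the user-level limit and the fourth trips the institution-level limit, even though user u2 has made only one request.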

18
CLEF: Proposed Architecture
[Architecture diagram; component labels: Firewall, Raw Data, Pre-access DQI Monitor, Pre-access SDRA/SDC, Treated Data, Pre-output SDRA/SDC, Pre-output DQI Monitor, Data Intrusion Sentry, Workbench]
19
Data Quality
  • User analyses are run on both treated and
    untreated data.
  • Outputs are compared and assessed for difference.
  • Major research area: knowledge engineering
  • Analyses are stored and collectively run over pre-
    and post-SDC files for assessment of impact.
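
The comparison described above can be sketched as follows. The treatment used here (top-coding incomes at a cap), the data, and the relative-difference measure are assumptions chosen for illustration, not the CLEF design.

```python
from statistics import mean, median


def top_code(values, cap):
    """A simple disclosure control treatment: replace values above cap by cap."""
    return [min(v, cap) for v in values]


def relative_difference(analysis, untreated, treated):
    """Relative change in an analysis result caused by the treatment."""
    raw = analysis(untreated)
    return abs(analysis(treated) - raw) / abs(raw)


# Hypothetical data, invented for illustration only.
incomes = [18000, 22000, 25000, 31000, 104000]
treated_incomes = top_code(incomes, 50000)

# Stored analyses are run collectively over the pre- and post-SDC files.
stored_analyses = {"mean": mean, "median": median}
impact = {
    name: relative_difference(fn, incomes, treated_incomes)
    for name, fn in stored_analyses.items()
}
```

Running the stored analyses together shows that the same treatment can distort one analysis (the mean) while leaving another (the median) untouched.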

20
The Grid: the context for massive combining
  • Integrated infrastructure for high-performance
    distributed computation (Cannataro and Talia,
    2002)
  • Grid middleware handles the technical issues:
    communication, security, access/authentication,
    etc. (Cole et al., 2002)
  • Data grid
  • Knowledge grid

21
Grid-based Data Environment Analysis
22
What's it about?
  • Disclosure risk analysis is forever constrained
    by the fact that we tend to only look at the
    release object.
  • This is a bit like evaluating the risk of a house
    being vulnerable to flooding without looking at
    where it is located!
  • Data Environment Analysis aims to remedy that
    situation and, in so doing, completely change the
    face of disclosure control.

23
What would it involve?
  • Web Crawling
  • Data Monitoring
  • Synthetic Data Generation
  • Grid-based disclosure risk analysis

24
Web crawling
  • Untrained screen scraping of all websites that
    collect personal data.
  • Generic gathering of web-published personal
    information (personal web pages, MySpace, etc.)

25
Data Monitoring
  • The development of sophisticated metadatabases
    representing available info fields
  • Combined database of web-available data.
  • Involves intelligent interpretation of web data,
    record linkage and other AI crossover techniques.
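
One way to picture a metadatabase of "available info fields" is a map from each field name to the sites that collect it. The sketch below is only that: the site names and HTML are hypothetical inline strings, no network access is performed, and reading form field names is far simpler than the intelligent interpretation the slide has in mind.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the name attribute of every form field in a page."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            name = dict(attrs).get("name")
            if name:
                self.fields.append(name)


def fields_collected(html):
    """Return the form field names found in one page."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.fields


def build_metadatabase(pages):
    """Map each field name to the set of sites that ask for it."""
    metadatabase = {}
    for site, html in pages.items():
        for field in fields_collected(html):
            metadatabase.setdefault(field, set()).add(site)
    return metadatabase


# Hypothetical pages, invented for illustration only.
pages = {
    "shop.example": '<form><input name="email"><input name="postcode"></form>',
    "forum.example": '<form><input name="email"><select name="birth_year"></select></form>',
}
metadatabase = build_metadatabase(pages)
```

Fields that appear under several sites, such as `email` here, are candidate key variables for linking those sources.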

26
Architecture
[Architecture diagram; component labels: five Web Crawlers, Data monitor, Synthesiser, SDRA system, Repository (Data, Metadata)]
27
What next?
  • Decide on roles.
  • Identify funder.
  • Develop grant application.

28
Synthetic Data Generation
  • Uses techniques like multiple imputation to
    generate artificial data from the metadata
    generated by the data monitors and from data
    stored and accessed through data repositories.
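
As a rough sketch of generating artificial records from metadata: the code below samples each field independently from a distribution recorded in the metadata. This is a much simpler stand-in for multiple imputation, which would model relationships between variables; the metadata values and the function name are hypothetical.

```python
import random


def synthesise(metadata, n, seed=0):
    """Draw n artificial records, sampling each field from its recorded distribution."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    records = []
    for _ in range(n):
        record = {}
        for field, distribution in metadata.items():
            values = list(distribution)
            weights = [distribution[v] for v in values]
            record[field] = rng.choices(values, weights=weights, k=1)[0]
        records.append(record)
    return records


# Hypothetical metadata: field -> {category: share}, invented for illustration.
metadata = {
    "sex": {"F": 0.51, "M": 0.49},
    "age_band": {"16-24": 0.15, "25-44": 0.35, "45-64": 0.30, "65+": 0.20},
}
synthetic = synthesise(metadata, 1000)
```

Because fields are drawn independently, the synthetic file reproduces the marginal distributions but none of the associations between variables, which is exactly the gap that imputation-based methods aim to close.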

29
Closing thoughts
30
A Blurring of Concepts
  • The boundaries between data and processes become
    less distinct.
  • Cyberidentities
  • I am my data?
  • The distinction between informational and
    physical privacy becomes less distinct.

31
Data Growth
  • There is no reason to suppose that data growth
    will not continue at the same breakneck pace
  • The data environment will become increasingly
    rich
  • In this context the meaning of privacy will
    undoubtedly change.
  • But how?

32
The meaning of Privacy
  • Do people care about privacy in an orthodox,
    absolute sense?
  • What does a blog mean?
  • Private-public Public Privacy
  • Control and ownership are more important than the
    absolute right to secrecy.

33
From Data Subjects to Data Citizens
  • A data-actualised individual, in control of and
    self-aware about their own data.
  • What would data citizens be concerned about?
  • Ownership
  • The use/abuse of their data
  • Harm
  • Permission/Consent
  • This suggests that the law should focus on data
    abuse rather than privacy per se.

34
Summary
  • Statistical disclosure presents a problem for the
    use of data
  • Multiple linkable datasets exacerbate that
    problem.
  • E-science provides some tools for new modes of
    data access

35
But...
  • Assuming that the global culture continues to
    feed and be fed by the information explosion
  • Our view of ourselves/our data will/must change.
  • The meaning of privacy must change with it.
  • The key question is what sort of society we are
    constructing; the meaning of privacy will reflect
    this.