Title: National Data Network
1National Data Network
- Qld Discussion
- 29 March 2004
2The problem
- Researchers and policy analysts want easy and
unrestricted access to data.
- Agencies increasingly want to share/integrate
data with each other.
- Administrative data sources are under-utilised
for research/policy development.
- Custodians cannot just give data away. They have
legal obligations, customer expectations, and
budget restrictions which must be met. So, they
need to be able to expose data in a way that
allows them to be confident that their
obligations will be honoured.
3The problem (cont.)
- Data sources are not well documented, which makes
it difficult for users to know whether data is
fit for purpose.
- Agencies lack infrastructure for managing data.
They need user-friendly tools to manage
metadata, design data collections and support data
access.
4The National Data Network
- Shared facilities and protocols which are well
understood by a large body of custodians and
analysts/researchers.
- Streamlined and consistent data access
approaches
- Trusted facilities for protecting secrecy and
confidentiality
- Will provide a range of services to both people
and applications
5The National Data Network
- Data Owners will remain in control of their data.
- Will exist as
- - a collection of network nodes. Each (major)
custodian will operate their own node on which
their own data is stored
- - a hub which maintains a catalog of sources and
services (and network administration services)
- Will service multiple sectors and jurisdictions:
health, transport, spatial, statistics
6Data Network
[Diagram: several Nodes, each holding data and offering Services, connected to a hub at www.nationaldatanetwork.org that provides the Catalog and Services]
7Data Network Service Framework
Owner/ Custodian Services
Design
Data Capture
Process
Publish
Search
Acquire
Analyse
Report
Link/ Integrate
Researcher/ Analyst Services
Network Administration Services
Registration, Audit, Planning
8Data Network Services
- Auto coding
- Assisted coding
- Character recognition
- Collection Control
- Respondent Management
- Confidentialise
- Document
- Expose
- Define Access Policy
- Archive
- Acquire test data
- Qualify for access
- Extract
- Subscribe
- Document data source issues
- Document Findings
- Expose Findings
Design
Data Capture
Process
Publish
Search
Acquire
Analyse
Report
Link/ Integrate
- Supervised Analysis
- Analyse
- Graph
- Tabulate
- Data Item Definitions
- Classifications
- Standard Questions
- Sample selection
- Form Design
- Edit
- Seasonal Adjustment
- Estimation
- Imputation
- Aggregate
9Getting Started..
- Start with development of services at the
interface between custodian and researcher:
Publish, Search, Link/Integrate
- Develop base standards for documenting data
sources and access rules
10Getting Started..
- Support 4 access classes
- - Unrestricted (user can freely take the data and
do anything with it)
- - Approval Required (researcher must apply for
access and agree to/meet conditions specified by the
data owner/custodian. Example conditions:
payment, signing an undertaking..)
- - Remote Analysis Only (researcher cannot acquire
the data but, on giving required undertakings,
can submit programs to analyse the data. Output
subject to vetting by data owner)
- - Specification Only (researcher cannot acquire
the data but can submit specifications for
interrogations/tabulations. Scope of specs may
be restricted)
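
The four access classes can be sketched as a small decision function. This is a minimal illustration only: the names `AccessClass` and `permitted_action`, and the action strings, are assumptions made for the example and are not part of any agreed NDN protocol.

```python
from enum import Enum


class AccessClass(Enum):
    """The four access classes listed on the slide."""
    UNRESTRICTED = "unrestricted"
    APPROVAL_REQUIRED = "approval required"
    REMOTE_ANALYSIS_ONLY = "remote analysis only"
    SPECIFICATION_ONLY = "specification only"


def permitted_action(access: AccessClass,
                     approved: bool = False,
                     undertaking_given: bool = False) -> str:
    """Return what a researcher may do next (hypothetical action names)."""
    if access is AccessClass.UNRESTRICTED:
        return "download"
    if access is AccessClass.APPROVAL_REQUIRED:
        # The custodian's conditions (payment, undertaking..) must be met first.
        return "download" if approved else "apply for approval"
    if access is AccessClass.REMOTE_ANALYSIS_ONLY:
        # The data never leaves the custodian; programs are submitted instead.
        return "submit program" if undertaking_given else "give undertaking"
    # Specification Only: the researcher describes the tabulation wanted.
    return "submit specification"


print(permitted_action(AccessClass.APPROVAL_REQUIRED))
print(permitted_action(AccessClass.REMOTE_ANALYSIS_ONLY, undertaking_given=True))
```

Keeping the class on the data source, rather than on the researcher, matches the idea that the data owner stays in control of their data.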
11Getting Started..
- Support 3 linkage models
- - User linkage (user can do their own linking
using their own or Data Network facilities;
typically applies where privacy is not an issue,
eg. aggregate data, consent has been given)
- - Blind (identifying information from each
dataset is provided to an independent linking unit,
a link key is passed back to each custodian, and
custodians supply datasets with link keys to the
researcher)
- - Trusted (applies when an agency is trusted to
do the linking)
- NDN should support confidentialisation of data
so that, where required, a linked dataset can be
confidentialised before it is given to the researcher
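
The blind linkage model can be illustrated with a toy example. It is a sketch under strong assumptions: records match on exact name and date of birth (real linkage is usually fuzzier), and all function names, field names and records are invented for illustration.

```python
def linking_unit(ids_a, ids_b):
    """Independent linking unit: sees only identifying information.

    ids_a / ids_b map each custodian's local record id to an identity
    tuple. Returns, per custodian, a map of local id -> link key.
    """
    key_by_identity = {}

    def assign(ids):
        keys = {}
        for local_id, identity in ids.items():
            new_key = f"LK{len(key_by_identity) + 1:04d}"
            keys[local_id] = key_by_identity.setdefault(identity, new_key)
        return keys

    return assign(ids_a), assign(ids_b)


def release(records, link_keys, content_fields):
    """Custodian supplies content plus link key, without identifiers."""
    return [{"link_key": link_keys[r["id"]],
             **{f: r[f] for f in content_fields}} for r in records]


def join(released_a, released_b):
    """Researcher links the two releases on the link key alone."""
    b_by_key = {r["link_key"]: r for r in released_b}
    return [{**a, **b_by_key[a["link_key"]]}
            for a in released_a if a["link_key"] in b_by_key]


# Invented example records held by two custodians.
health = [{"id": "h1", "name": "A Citizen", "dob": "1970-01-01", "admissions": 2},
          {"id": "h2", "name": "B Resident", "dob": "1980-05-05", "admissions": 0}]
welfare = [{"id": "w9", "name": "A Citizen", "dob": "1970-01-01", "benefit": "X"}]


def identities(records):
    return {r["id"]: (r["name"], r["dob"]) for r in records}


keys_h, keys_w = linking_unit(identities(health), identities(welfare))
linked = join(release(health, keys_h, ["admissions"]),
              release(welfare, keys_w, ["benefit"]))
print(linked)
```

The linking unit never sees the content fields, and the researcher never sees names or dates of birth; only the custodians hold both.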
12How the Data Network will work
- Governing Body: endorses protocols and standards,
acquires funds, commissions work
- Administering Body: administers the network
(promotion, audit, registration, performance
monitoring)
- Custodians document and expose data sources by
registering them.
- Registration includes documenting access rules
- Service Providers can register and provide
services (which comply with standards)
- Researchers/Analysts agree to comply with
access rules and provide feedback
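
Registration and search against the hub catalog might look like the sketch below. The `Resource` and `Catalog` classes, their fields, and the access rule and descriptions used in the example are hypothetical; the only rule taken from the slide is that registering a data source includes documenting its access rules.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    custodian: str
    kind: str              # "source" or "service"
    description: str
    access_rule: str = ""  # documented at registration time


class Catalog:
    """Toy in-memory stand-in for the hub catalog."""

    def __init__(self):
        self._resources = []

    def register(self, resource: Resource) -> None:
        # A data source may not be exposed without a documented access rule.
        if resource.kind == "source" and not resource.access_rule:
            raise ValueError(f"{resource.name}: access rule must be documented")
        self._resources.append(resource)

    def search(self, term: str) -> list:
        term = term.lower()
        return [r for r in self._resources
                if term in r.name.lower() or term in r.description.lower()]


catalog = Catalog()
# Illustrative entries; the access rule shown is an assumption.
catalog.register(Resource("CURFs", "ABS", "source",
                          "Confidentialised unit record files",
                          access_rule="approval required"))
catalog.register(Resource("ANZSIC Coder", "ABS", "service",
                          "Industry coding service"))
print([r.name for r in catalog.search("coding")])
```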
13Data Network resource registration
[Diagram: sources and services are registered at www.nationaldatanetwork.org. Elements shown: ABS, Centrelink, Census BCPs, CURFs, RADL, ANZSIC Coder, Data Definitions]
14Data Network - search
[Diagram: a Researcher searches the Catalog at www.nationaldatanetwork.org for Data Definitions and Data Sets. Elements shown: ABS, Centrelink, Census BCPs, CURFs, RADL, ANZSIC Coder]
15Data Network undertaking process
[Diagram: the undertaking process between the Researcher and the Catalog at www.nationaldatanetwork.org. Elements shown: ABS, Centrelink, Census BCPs, CURFs, RADL, ANZSIC Coder, Data Definitions, Data Sets]
16Data Network Access via download
[Diagram: the Researcher accesses data via download. Elements shown: Catalog at www.nationaldatanetwork.org, ABS, Centrelink, Census BCPs, CURFs, RADL, ANZSIC Coder, Data Definitions, Data Sets]
17Data Network Access via RADL
[Diagram: the Researcher accesses data through a RADL session. Elements shown: Catalog at www.nationaldatanetwork.org, ABS, Centrelink, Census BCPs, CURFs, RADL, ANZSIC Coder, Data Definitions, Data Sets]
18Getting Started..
- We have started by working with ARACY and CSIRO, with
meetings/roundtables of major custodians,
researchers, privacy commissioners and others.
Now starting to engage more with States.
- Need to progress on three fronts
- - governance, protocols, priorities, resources
- - data source development
- - infrastructure development
19Getting Started..
- Governance, protocols, priorities
- - Establish an Interim Governing Board (ABS,
ARACY, AIHW, DOHA, User rep, State gov rep..)
and a broader member network which can be
consulted and kept informed
- - agree on principles, priorities
20Getting Started..
- Data Sources
- - pick a couple of exemplar projects, identify
range of relevant data sources and work with
custodians to expose the data sources using the
Data Network Infrastructure
- - Range of regional data sources (eg. IRDB,
Healthwhiz?)
- - ABS CURFs (via RADL and download)
21Getting Started..
- Infrastructure
- - Form infrastructure development consortium (this
has started with ABS, CSIRO, Geoscience Australia)
- - Develop demonstration version of NDN system
22Data Source Documentation Standards
- Need a rich metadata schema and an agreed
minimum documentation standard
- A plethora of partial solutions and standards
- A single schema possible in theory but too
cumbersome in practice?
- Start with schemas for some common data object
types? (eg. Time Series, Classification, Unit
Record Dataset..)
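
One way to picture a minimum documentation standard with per-type schemas is a check for missing fields. The sketch below is purely illustrative: every field name is an assumption, and the three object types are the ones named on the slide.

```python
# Fields every data source would document (hypothetical names).
REQUIRED_COMMON = {"title", "custodian", "description", "access_class"}

# Extra fields per data object type (hypothetical names).
REQUIRED_BY_TYPE = {
    "time_series": {"frequency", "start_period"},
    "classification": {"version", "levels"},
    "unit_record_dataset": {"unit_of_observation", "data_items"},
}


def missing_fields(object_type: str, metadata: dict) -> list:
    """List required fields that are absent or empty, in sorted order."""
    required = REQUIRED_COMMON | REQUIRED_BY_TYPE[object_type]
    return sorted(f for f in required if not metadata.get(f))


draft = {
    "title": "Example monthly series",
    "custodian": "Example agency",
    "description": "Illustrative entry only",
    "access_class": "unrestricted",
    "frequency": "monthly",
}
print(missing_fields("time_series", draft))
```

Splitting the standard into a common core plus small per-type additions is one response to the concern that a single schema would be too cumbersome in practice.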
23NDN Services internal resource as well as external
- NDN software can be installed inside and outside
of the firewall. So, NDN services can be used
privately as part of an agency's
infrastructure for managing its own data
holdings.
- This may improve the value proposition for some
custodians as well as make it easier to meet
standards and maintain quality.
24Open Source Software
- Ideally, NDN software will be developed using
Open Source code.
- Benefits: portability, reduced barriers to
adoption (no license fees required), transparency
(anyone can see source code), and avoiding a
start from scratch by building on top of
existing Open Source products (eg. Zope, Napster)
25Aim to have demonstrable system in 2004
- National Data Network Website established
- Min 3 nodes established with reasonable range of
data
- Search service which can locate data on the nodes
- Access services working (3 classes)
- Linking Services working
26Data Network demonstration version
[Diagram: the Data Network Services chart from slide 8 (services grouped under Design, Data Capture, Process, Publish, Search, Acquire, Analyse, Report, Link/Integrate), with RADL added]
27Service description Assisted Coding
- Consistent approach for coding according to
standard classifications. For example, the
standard Industry Classification is ANZSIC. The
network will provide
- - A human interface: a web page where you can
key in an industry description ("fishing trawler
operation") and get back a code (0922)
- - An application interface which allows an
application to invoke a coding service which
returns an ANZSIC code (or a set of codes)
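
The application interface could be as simple as a function that takes a description and returns ranked candidate codes. The sketch below uses naive word overlap, which is an assumption and not the coder's actual method. Only the 0922 entry comes from the slide; the X-prefixed entries are placeholders and not real ANZSIC codes.

```python
# Tiny illustrative index: code -> description.
INDEX = {
    "0922": "fishing trawler operation",  # example given on the slide
    "X001": "rock lobster fishing",       # placeholder, not a real code
    "X002": "road freight transport",     # placeholder, not a real code
}


def code_industry(description: str, index: dict, max_results: int = 3) -> list:
    """Return candidate codes ranked by number of shared words."""
    words = set(description.lower().split())
    scored = []
    for code, text in index.items():
        overlap = len(words & set(text.lower().split()))
        if overlap:
            scored.append((overlap, code))
    # Best overlap first; ties broken by code for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [code for _, code in scored[:max_results]]


print(code_industry("Fishing trawler operation", INDEX))
```

Returning a list rather than a single code covers the "set of codes" case, where a description is ambiguous and a person (assisted coding) makes the final choice.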
28Service description Remote Analysis
- The user does not get a copy of the data but can
submit programs against it
- Programs written in SAS or SPSS
- Programs subjected to automated and human vetting
- Program outputs also checked to ensure that they
do not disclose information which should be
protected
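
The automated part of the vetting can be sketched in two steps: scan the submitted program for statements that would list unit records, then suppress small cells in the output. The banned-statement list and the minimum cell count are assumptions chosen for illustration; they are not the rules of any actual remote analysis service.

```python
# Illustrative only: SAS PROC PRINT lists individual observations.
# A real vetting list would be longer and would cover SPSS as well.
BANNED_STATEMENTS = ("proc print",)

# Assumed disclosure threshold for counts in output tables.
MIN_CELL_COUNT = 5


def vet_program(source: str) -> list:
    """Return the banned statements found in a submitted program."""
    lowered = source.lower()
    return [s for s in BANNED_STATEMENTS if s in lowered]


def vet_output(table: dict) -> dict:
    """Replace counts below the threshold with None (suppressed)."""
    return {cell: (count if count >= MIN_CELL_COUNT else None)
            for cell, count in table.items()}


print(vet_program("PROC PRINT data=survey; run;"))
print(vet_output({"Region A": 12, "Region B": 3}))
```

Anything the automated checks flag, or cannot decide, would still go to a person, in line with the combined automated and human vetting described above.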
29Principles
- Who can use the network?
- What data will the network serve?
- What conditions can custodians impose?
- What conditions should the network impose?
- How visible/open should the network be?
- Who can be a node?
- Architecture/design principles