Title: Provision of access to data for secondary analysis
1Provision of access to data for secondary analysis
- Louise Corti, Jo Wathan and Keith Cole
- Economic and Social Data Service
- E-society Programme - March 07
2Overview of chapter
- Why access secondary quantitative data?
- brief overview of the potential of secondary data
- Finding, accessing and obtaining secondary data
- describes the ESDS (the UK Economic and Social Data Service), a distributed national online data service
- Case studies
- practical exemplars of how data can be re-used
3Why access secondary quantitative data?
- Quantitative methods have an important, longstanding place in social research. They can identify
- typical characteristics and background description
- the amount of variation within a population of interest
- differences between groups
- how possible explanatory factors can account for differences
- predictions and forecasts
- Kinds of data
- Micro data resemble the sort of data obtained from a survey
- Longitudinal data follow the same individuals (or other study unit) over time
- Macro or aggregate data contain records for much larger units, e.g. countries or regions
4Secondary analysis
- reduces respondent burden
- enables data linkage and the creation of new datasets
- informs policy disputes about the interpretation of analyses
- provides transparency within research
- enables methodologists to learn from each other
- allows students to engage with real data, to
obtain results which relate to the real world and
to tackle real problems of data management
(substantive social science and research methods
teaching)
5Data expensive
- Collecting good quality, reliable, representative data is expensive and technically demanding
- In 2001/2 the British General Household Survey (GHS) sample included all individuals in 8,989 households and cost £1.43 million
- In 2001, the American Community Survey collected data from nearly 400,000 interviews in the year at an estimated cost of $131 million
6Data historical - enabling trend analysis
- In the UK the General Household Surveys (GHS) and Labour Force Surveys (LFS) date back to 1971 and 1973
- In the United States, the General Social Survey series dates back to 1972 and Current Population Survey data date back to 1964 (ICPSR)
- Longitudinal studies
- US Panel Study of Income Dynamics, started in 1968
- German Socioeconomic Panel in 1984
- British Household Panel Survey in 1991
7Finding, accessing and obtaining secondary data
- The development of secondary analysis has depended on the development and growth of social science data archives
- Inter-University Consortium for Political and Social Research (ICPSR)
- the UK Data Archive (UKDA)
- Zentralarchiv für Empirische Sozialforschung (ZA)
- Norwegian Social Science Data Services (NSD)
- Now networked
- Council of European Social Science Data Archives (CESSDA)
- International Federation of Data Organisations (IFDO)
8Changing provision
- early data archives predated e-social science, and the internet as we know it, by decades
- the gradual development of online data archives and dissemination services has varied across the world
- the more mature archives have reached the point at which most users will interact with the data service wholly through the internet
- Internet delivery has broadened the potential role of data services
9Functions of the modern archive
- acquire - nurture, cajole, plead, evaluate
- prepare, document and enhance data - check and add context
- store data safely for ever - back up, store and migrate
- distribute data - download, explore online
- provide support for their use - promote, write, teach
- improve resource discovery and data access - R&D
10Acquisition and checking
- data archives typically select and evaluate potential data collections against criteria designed to ensure that they are appropriate for re-use
- collections are assessed for their
- research value, quality, degree of fit with the existing collection
- data are checked and validated by the receiving archive by
- examining the data values or text - validation and consistency checking
- ensuring that, where required, the data are anonymous
- checking for intellectual property and commercial ownership rights in the data
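The validation and anonymity checks listed above can be illustrated with a minimal sketch. It is not the procedure any particular archive uses: the codebook, the missing-value code and the list of direct identifiers are all hypothetical, chosen only to show the kind of checks involved.

```python
# Hypothetical codebook: variable name -> set of permitted codes (illustrative only).
CODEBOOK = {
    "sex": {1, 2},
    "smoker": {1, 2, -9},  # -9 is an assumed "missing" code
}

# Hypothetical direct identifiers that should not appear in an anonymised file.
DIRECT_IDENTIFIERS = {"name", "address", "postcode"}


def check_records(records):
    """Return a list of human-readable problems found in the records."""
    problems = []
    for i, row in enumerate(records):
        for var, value in row.items():
            if var in DIRECT_IDENTIFIERS:
                # Anonymity check: the variable itself should have been removed.
                problems.append(f"row {i}: direct identifier '{var}' present")
            elif var in CODEBOOK and value not in CODEBOOK[var]:
                # Validation check: the value is not a documented code.
                problems.append(f"row {i}: {var}={value!r} not in codebook")
    return problems


sample = [
    {"sex": 1, "smoker": 2},
    {"sex": 3, "smoker": 1},                 # invalid code for sex
    {"sex": 2, "smoker": -9, "name": "A"},   # identifier left in the file
]

issues = check_records(sample)
for line in issues:
    print(line)
```

In practice such checks would be driven by the deposited documentation rather than a hand-written dictionary.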
11Documentation and metadata
- Documentation which enables users to understand the origins of the data and to correctly interpret outputs
- user guides created - how the data were collected
- questionnaires, code books, interviewer instructions, technical reports, original and subsequent publications and outputs
- catalogue record, and full variable and value labels (standard used - DDI)
- a few archives work closely with data creators in the early stages to ensure that good data management practices are adhered to
12Online dissemination
- first steps towards online data archiving and dissemination came with the development of archive websites
- increasingly sophisticated data catalogues
- nowadays, searchable online data catalogues enable users to search and browse collections
- and view documentation freely online
- online registration, account management, data download
- access data via a web browser
13New generation data services
- online data exploration with tools
- Survey Documentation and Analysis (SDA), Nesstar, Beyond 20/20, interactive (GIS) mapping tools
- increasingly necessary to link to data sites, offsite support and related datasets as the complexity of the data infrastructure increases
- data services may be distributed services
- data need not be co-located
- social science increasingly looking to the
potential of grid technologies
14Economic and Social Data Service (ESDS)
- new generation distributed data service that provides a seamless integrated service
- offers enhanced support for the secondary use of key economic and social data across the research, learning and teaching communities
- value-added service goes far beyond the original role of traditional data archives as data storage and dissemination houses
- brings together centres of expertise in data creation, dissemination, preservation and use
15UK Data archiving history
- Data Archive established in 1968 (as Data Bank)
- funded by (then) SSRC to provide a service to UK HE sector
- initial focus on academic surveys, then government survey data
- new distributed service established 1 January 2003 as the ESDS
- core archiving service plus four value-added specialist services
16Types of data
- ESDS acquires mixed data types and formats
- social surveys
- aggregate data
- administrative data
- textual data
- images
- audio visual data
- UKDA hosts specialist Qualidata unit, Census unit, and History Data Service
- since 2005 designated as Place of Deposit by The National Archives (TNA)
- New data types
- Online surveys, interviews and focus groups
- social transaction data
- Linked admin data
- blogs and so on
17Who produces the social science data held by ESDS?
- government agencies
- increasing tendency for government agencies to contract out survey work to private sector (NatCen)
- academic sector
- private sector
- local Government
- Research Council funded
- ESRC, MRC, NERC, AHRB, Wellcome, Leverhulme
- increasing number of large digitisation projects
- JISC, NOF
- access to international data via links with other data archives worldwide
- IGOs
18Core Service
- run by UKDA
- acquiring, processing, preserving and disseminating data
- data creation and deposit support
- central registration service operating across the ESDS
- central 'first stop' help desk service
- front line user support
- cataloguing and describing data
- maintaining and developing web presence
- publicity and training
19Specialist data services
- ESDS Government
- ESDS International
- ESDS Longitudinal
- ESDS Qualidata
- Greater emphasis on
- value-added data and documentation
- enhanced resource discovery
- improved delivery services
- support and training for the secondary use of data for research, learning and teaching
- outreach and promotion
20Facts and figures UKDA
- 4,000 datasets in the collection
- 350 new datasets and editions added each year
- 30,000 registered users
- 15,000 datasets distributed worldwide p.a.
- 100,000 online sessions p.a.
- 15,000,000 web hits p.a.
21Data In
- Data acquisition
- offers and proactive scoping of data
- formal data evaluation via committee
- Data ingest
- checking, verifying
- converting, formatting, processing
- documenting and contextualising
- Data preservation
- long-term data management
- Preservation Policy
22Online exploration
- Online data browsing, including
- simple data analysis, visualisation, downloading and subsetting via Nesstar
- ESDS Government Vital Statistics online
- International macro data via Beyond 20/20 and visualisation interface
- ESDS Qualidata Online interview transcripts
- Census data services
231 Using Government microdata to explore health
- UK is fortunate in its wealth of available major cross-sectional surveys
- government surveys are rich resources
- large micro data files with a large number of detailed variables
- series of repeated cross sections which enable comparisons over time
- nationally representative - United Kingdom or constituent countries
- sample survey data, which may involve a degree of complexity
- structure (hierarchical) and sampling strategy
- data holdings and documentation are extensive
241 Government data
- General Household Survey/Continuous Household Survey (NI)
- Labour Force Survey/NI LFS
- Health Survey for England/Wales/Scotland
- Family Expenditure Survey/NI FES
- British/Scottish Crime Survey
- Family Resources Survey
- National Food Survey/Expenditure and Food Survey
- ONS Omnibus Survey
- Survey of English Housing
- British Social Attitudes/Scottish Social Attitudes/Young People's Social Attitudes/NI Life and Times
- National Travel Survey
- Time Use Survey
- Vital Statistics for England and Wales
251 Investigating smoking
- ESDS high web presence
- Google search ESDS pages
- ESDS catalogue advanced searching on key words - study and variable level information
- browse by subject
- major studies lists
- Government series pages
- theme guides
- publications database
- software and analysis
- guides
261 Accessing Data
- register with ESDS, using the online authentication system ATHENS (currently moving towards a new system, Shibboleth, which provides a greater degree of differentiation in user types)
- ESDS users must specify the purpose for which they will use each data set
- registered users can choose to download the whole file (typically SPSS, Stata and tab delimited) or undertake further analyses, including graphing, within Nesstar
- more stringent conditions apply to more sensitive data such as detailed microdata with detailed geography (Special Licence)
271 Online exploration
- Nesstar system - allows unregistered users to view metadata and univariate distributions online
- based on the DDI standard to describe data
- permits users to specify subsets and download in a wide range of formats
- ability to quickly browse data is useful where particular subsets of cases in the data are of interest
- e.g. using the GHS to undertake an analysis of people who would like to give up smoking - need to know whether there is a sufficiently large number of people in the dataset who smoke but would like to give up
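The subset check described above amounts to counting the cases that meet a set of conditions before committing to an analysis. A minimal sketch follows. The records, the variable names (`smokes`, `wants_to_quit`) and the threshold are all hypothetical. They are not GHS variables or Nesstar functions.

```python
# Hypothetical extract; variable names and values are illustrative, not GHS ones.
respondents = [
    {"smokes": True, "wants_to_quit": True},
    {"smokes": True, "wants_to_quit": False},
    {"smokes": False, "wants_to_quit": None},
    {"smokes": True, "wants_to_quit": True},
    {"smokes": False, "wants_to_quit": None},
    {"smokes": True, "wants_to_quit": True},
]


def subset_size(rows, **conditions):
    """Count the rows in which every named variable equals the given value."""
    return sum(
        all(row.get(var) == value for var, value in conditions.items())
        for row in rows
    )


MIN_CASES = 3  # assumed minimum number of cases for the planned analysis

n_smokers = subset_size(respondents, smokes=True)
n_would_quit = subset_size(respondents, smokes=True, wants_to_quit=True)
enough_cases = n_would_quit >= MIN_CASES

print(n_smokers, n_would_quit, enough_cases)
```

With a real survey file the same count would normally come from a weighted cross-tabulation, so the unweighted figure here is only a first look.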
291 What can a user do with the data?
- multivariate analyses that look within households and analyses that look at change over time
- look at relationships between multiple individual characteristics
- depth of many questionnaires allows users to explore the validity of existing means of operationalising concepts, or to use new ones
302 Analysing longitudinal health data
- true cohort analysis requires information about the same individuals over time
- explore the chronological ordering of behaviours or characteristics
- ESDS Longitudinal specializes in supporting five major UK-based longitudinal data sets
- British Household Panel Survey (BHPS)
- 1970 British Cohort Study (BCS70)
- National Child Development Study (NCDS)
- Millennium Cohort Study (MCS)
- English Longitudinal Study of Ageing (ELSA)
- BHPS is a household hierarchical dataset -
interviews all members of the households of panel
members. Can explore household factors
313 Providing a common user interface to
international macro data to support comparative
research
- researchers now require access to the key international evidence bases in order to contribute and comment on trans-national policy responses to global issues
- ESDS International was established to address these needs through the provision of free web-based access to a portfolio of authoritative, high quality international databanks
- high quality, regularly updated time series databanks - contain huge range of macro-economic and social indicators aggregated to national or regional level worldwide
32- datasets supported are produced by a number of key International Governmental Organisations (IGOs) such as the International Monetary Fund, the United Nations, the World Bank, the Organisation for Economic Co-operation and Development and the International Energy Agency
- access via a common user interface to all the international aggregate datasets, which makes it easy for users to obtain access to data
- Beyond 20/20 Web Data Server (WDS) to display, subset, visualize, chart and download data
- Iraqi exports to the rest of the world 1980-2005 (Source: International Monetary Fund (IMF), Direction of Trade Statistics (DOTS), July 2006)
33- CommonGIS used to build a web-based data exploration interface to geographically referenced international data
- CommonGIS provides standard GIS functionality and can be used as a tool for visualisation and exploratory analysis based on geographically referenced statistical data
- CommonGIS visualization shows the relationship between birth and death rates in European countries in 2005 (source: CIA World Factbook)
- the cross classification map shows those countries, such as Moldova, which have high birth and death rates
344 Grid-enabling quantitative datasets to support
more complex forms of analysis
- Data Grids facilitate unimpeded and integrated use of distributed, heterogeneous, autonomous data resources
- grid enabling a dataset creates new opportunities for its use
- enables users to integrate it with other datasets
- makes it possible to analyse the dataset using techniques that require the kind of computational power that is only feasible using the Grid (e.g. more complex models, more data points)
- standardisation of procedures and mechanisms used to access and update the dataset increases its shareability
- automated analyses (i.e. analyses can be re-run automatically when databases are updated)
354 ConvertGrid Key Objectives
- a practical demonstration of how the Grid can be used to facilitate data integration and overcome a major barrier to research use of multiple datasets
- demonstrates how to build a social science Data Grid by grid enabling a number of key geo-referenced socio-economic data sources
- uses Grid technologies to extend the functionality of an existing web based data service (i.e. Convert) to exploit the existence of a Data Grid
- demonstrates how Grid technologies can automate complex workflows and enhance the capacity to address substantive social science research questions
- builds a user interface to a Grid based service which is suitable for student/teaching use
364 ConvertGrid The Research Context
- many research questions require the combination of data from multiple geo-referenced datasets
- e.g. linking postcoded data to census geography
- conversion of data relating to different geographies to a common target geography is
- a complex, time-consuming task
- requires a range of data handling/processing skills
- the data conversion process will require users to perform the following generic tasks
- extract and download data in different formats from a number of databases using different interfaces
- convert each dataset to the desired target geography using geographical conversion tables
- combine the converted sets into a single dataset for analysis
- these generic tasks can be automated!
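The convert and combine steps above can be sketched in a few lines. This is not the Convert or ConvertGrid implementation. It assumes a conversion table that gives, for each source zone, the share falling in each target zone, and it apportions counts by that share. The zone codes, weights and figures are invented for illustration. Proportional apportionment suits counts. Averages such as house prices would need a weighted mean instead.

```python
from collections import defaultdict

# Hypothetical conversion table:
# (source zone, target zone, share of the source zone falling in the target zone)
LOOKUP = [
    ("PS1", "WardA", 0.75),
    ("PS1", "WardB", 0.25),
    ("PS2", "WardB", 1.0),
]


def convert_counts(source_values, lookup):
    """Apportion counts held on source zones to target zones using the lookup shares."""
    target = defaultdict(float)
    for src, tgt, share in lookup:
        if src in source_values:
            target[tgt] += source_values[src] * share
    return dict(target)


def combine(**datasets):
    """Merge several {zone: value} datasets into one record per zone."""
    zones = sorted(set().union(*[d.keys() for d in datasets.values()]))
    return {z: {name: d.get(z) for name, d in datasets.items()} for z in zones}


# Invented figures: sales counts by postcode sector, applicants already by ward.
sales_by_sector = {"PS1": 40, "PS2": 10}
applicants_by_ward = {"WardA": 12, "WardB": 5}

sales_by_ward = convert_counts(sales_by_sector, LOOKUP)
combined = combine(sales=sales_by_ward, applicants=applicants_by_ward)

for zone, values in combined.items():
    print(zone, values)
```

Once every source is expressed on the target geography, the combined records can go straight into an analysis or a mapping tool.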
374 ConvertGrid A Worked Example
- what factors explain spatial variations in participation rates in higher education?
- study target geography - 1991 Census Ward
- data required
- 1991 Census
- total persons aged 16-17 and 18-19 (1991 Census Ward)
- Neighbourhood Statistics
- number of applicants aged under 20 entering university (1998 Electoral Ward)
- Experian
- average house price sales Quarter 2 2000 to Quarter 1 2001 (1999 Postcode Sectors)
384 ConvertGrid Data Visualisation Interface
High average house price sales but low
participation rates
Low average house price sales but high
participation rates
Ten minutes from start to finish
- relationship between average house price sales
(Experian) and percentage of 16-19 year olds
entering university (Neighbourhood Statistics
Census aggregate statistics)
395 Mixed Methods Data
- there is an increasing interest in and recognition of the value of re-using qualitative data
- in the past few years there has been a significant move to utilise mixed methods strategies in research
- ESDS has seen the deposit of multiple methods datasets combining quantitative and qualitative datasets
- processed and supported by dedicated unit - ESDS Qualidata
405 ESDS Qualidata
- range of qualitative datasets, hosted by the UK Data Archive
- data from National Research Council (ESRC) individual and programme research grant awards (Data Policy)
- data from classic social science studies
- other funders/sources
- focus on DIGITAL Collections, but also facilitate
paper-based archiving
415 Types of qualitative data
- diverse data types: in-depth interviews, semi-structured interviews, focus groups, oral histories, mixed methods data, open-ended survey questions, case notes/records of meetings, diaries/research diaries
- multimedia: audio, video, photos and text (most common is interview transcriptions)
- formats: digital, paper, analogue audio-visual
- data structures - differ across different document types
425 Classic study datasets
- Townsend - Poverty, old age and Katherine Buildings
- Thompson - oral history and Edwardians
- Goldthorpe et al - The Affluent Worker
- Jackson and Marsden - Education and the Working Class
- National Social Policy and Social Change Archive
435 Online access to data
445 Schoolchildren's attitudes towards risk-taking and health
- typical example of a mixed methods study might be undertaking a sample survey and conducting ethnographic fieldwork (e.g. observation and in-depth interviews) based on the survey sample or on other cases
- Incidents and the Health-related Behaviour of Schoolchildren, 1997, M. Denscombe
- Studying critical incidents in the life of young people which act as crucial flashpoints in the generation of attitudes towards health-related behaviour
455 Schoolchildren's attitudes towards risk-taking and health
- the project used a mixture of quantitative and qualitative methodology
- survey of 1648 children
- eleven transcripts of focus group interviews
- eight transcripts of interviews - two students together
- Denscombe in-depth interviews also cover a lot of detail about the role and pressure of exams at the age of 15/16, and future life ambitions
46Secondary use?
- qualitative aspect can offer a more detailed explanation of a quantitative analysis and possibly enable a more complex model to be built
- sequencing of data collection methods or the selection of cases needs to be carefully considered in re-use
- in larger data collections, the data types may have been collected by different teams with differing methodological agendas - researchers tend to prioritise one method because of familiarity with the data type and analytic methods
- possibility that each method could show conflicting findings - re-users should be aware how they report findings and be reflexive about how the secondary data were selected, confronted and analysed
47Collaboration - UK
- Government agencies work closely
- Research Councils on formal data sharing policies
- Research Centres and Programmes collecting data
- Other funding agencies e.g. JISC on technical issues - authentication, digitisation, T&L resources
- TNA on records management and preservation practice
- E-science on grid enabled data issues, ontologies
- Research Methods centres on data quality and secondary analysis
48Conclusion
- secondary analysis permits a range of valuable analyses to be undertaken quickly, effectively, transparently and with minimal respondent burden
- digital formats have enabled users to easily consult full documentation, explore and analyse data online
- and to make linkages between appropriate resources in a context of an increasingly complex data infrastructure
- data access services themselves may be virtual centres, distributed across multiple sites
- anticipate that grid developments will provide increased scope for harmonising access to different data types
49- Contact
- www.esds.ac.uk
- help_at_esds.ac.uk
- corti_at_essex.ac.uk
- 01206 872145