Title: Towards a Data Network for Integrated Social Science Research
1Towards a Data Network for Integrated Social
Science Research
- Micah Altman
- Harvard University
- Archival Director, Henry A. Murray Research Archive
- Associate Director, Harvard-MIT Data Center
- Senior Research Scientist, Institute for Quantitative Social Sciences
- E: micah_altman_at_harvard.edu
- W: http://maltman.hmdc.harvard.edu/
2This Talk
- Why Social Science Data?
- Integrating Data Access -- Dataverse Network
- Replicated Institutional Preservation: the Data-PASS Alliance for the Social Sciences
- Social Science and Cyberinfrastructure
3Related Work
- Papers
- M. Altman and G. King. "A Proposed Standard for the Scholarly Citation of Quantitative Data," D-Lib, 13, 3/4 (March/April), 2007.
- M. Altman, et al. "Data Preservation Alliance for the Social Sciences: A Model for Collaboration," Proceedings of DigCCurr2007, Chapel Hill, April 2007.
- G. King. "An Introduction to the Dataverse Network as an Infrastructure for Data Sharing," Sociological Methods and Research, 32, 2 (November 2007): 173-199.
- M. Altman. "A Fingerprint Method for Verification of Scientific Data," in Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007), Springer Verlag. Forthcoming 2008.
- Collaborators and Co-conspirators
- Margaret Adams, Ken Bollen, Cavan Capps, Jonathan Crabtree, Darrell Donakowski, Myron Gutmann, Gary King, Lois Timms-Ferrara, Marc Maynard, Amy Pienta
- Research Support
- Thanks to the Library of Congress (PANDP03-1), the National Institute on Aging (P01 AG17625-01), the National Science Foundation (SES-0318275, IIS-9874747), the Harvard University Library, the Institute for Quantitative Social Science, the Harvard-MIT Data Center, and the Murray Research Archive.
4What is Digital Social-Science Data?
- DIGITAL
- Optical: DVD, CD
- Magnetic: tapes, floppies
- Paper: cards, tapes
- SOCIAL SCIENCE
- Social: class, crime, social movements, culture, folklore, family
- Economic: wealth, prosperity, labor, business, equity
- Psychology: cognition, attitudes, stereotypes
- Politics: justice, democracy, public policy, public administration, international conflict
- DATA
- Raw measurements
- Numeric tables
- Administrative records (email)
- Video and audio interviews, transcripts (blogs)
5Data Access is the Key To Science
- Science is not (only) about being scientific
- Scientific progress requires community: competition and collaboration in the pursuit of common goals
- Without access to the same materials no community exists: data is the nucleus of collaboration.
- What is the value of an article that can't be replicated?
- Scholarly articles are summaries, not the actual research results
- But: data access is spotty by field, and finding the data is still hard
- Hard for journal editors to verify. If you find it, how do you know it's the same?
- Replication projects show most published articles in social science cannot be replicated; data is necessary for replication and verification
6Data Access is the Key To Democracy
- Statistics: state-istics
- The state: tax authority, counting people, estimating wealth
- Reformers use data to assess the performance of the state
- Science informs public policy continually
- In modern democracy the public needs a direct source of information
7Why is Infrastructure for Data Needed?
- Accessibility
- Most large data sets in public archives
- Most data in published articles not accessible; results not replicable without the original author
- Most data sets from federal grants not publicly available
- Problems even with professional archives
- Data in different archives have different identifiers
- Archives change identifiers, links
- Changes to data are made; identifiers are reused or removed; old data are lost
- Data sets are not like books
- Static data files (even if on the web) unreadable after a few years
- When storage methods change some data sets are lost; others have altered content!
- Why not a single centralized infrastructure?
- Single point of failure
- Impossible when data are heterogeneous in format, origin, size, effort needed to collect or analyze, IRB access rules, etc.
- Data producers want credit, control, and visibility
- Requirements
- Recognition for data producers, distributors, related publishers
- Rule-based Public Distribution
8The Dataverse Network
- An Open-Source, Federated, Web 2.0 Data Network
- Gateway to over 20,000 social science studies (world's largest catalog)
- Web Virtual Hosting 2.0 Service: over 100 virtual archives
- Federated access to other networks
- Unified access to major U.S. research data archives, government data
- Open service: endowed hosting
- Open source: GPL-Affero-3
- Discovery Services
- Simple fielded search
- Virtual collection browsing
- Management
- Ingest
- Curation review
- Virtual Hosting and administration
- Metadata delivery
- Descriptive and structural
- Provenance (chain-of-custody metadata)
- Human and OAI interfaces
- Preservation
- Standards based
- Reformatting
- Universal Numeric Fingerprints
- Enhanced Delivery
- Replication
- Layered analysis services
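The Preservation bullets above list Universal Numeric Fingerprints, which address the earlier question of how to know that a found data set is the same one an article used. The sketch below illustrates the general idea only: normalize values to a fixed precision, then hash the normalized form, so the fingerprint survives reformatting but changes when the data change. It is a simplified illustration and not the actual UNF algorithm; the function name `sketch_fingerprint`, the seven-digit precision, and the truncation length are assumptions made for this example.

```python
import base64
import hashlib


def sketch_fingerprint(values, digits=7):
    """Return a short fingerprint of a numeric column.

    Illustrative only: this is NOT the real UNF specification.
    """
    # Normalize every value to a fixed number of significant digits so that
    # storage-format noise (1.0 vs 1.00000000001) does not alter the result.
    normalized = [format(float(v), f".{digits - 1}e") for v in values]
    # Terminate each value with a newline so adjacent values cannot run together.
    payload = "".join(item + "\n" for item in normalized).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    # Truncate to 16 bytes and base64-encode for a compact printable string.
    return base64.b64encode(digest[:16]).decode("ascii")


original = [1.0, 2.5, 3.25]
reformatted = [1.00000000001, 2.5, 3.25]  # same data after a lossy format change
altered = [1.0, 2.5, 3.26]                # a genuinely different value

print(sketch_fingerprint(original) == sketch_fingerprint(reformatted))
print(sketch_fingerprint(original) == sketch_fingerprint(altered))
```

Because the hash is computed over normalized content rather than over the bytes of a file, the same fingerprint could in principle be recomputed after a file is migrated to a new storage format.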
9DVN Screenshots
http://dvn.iq.harvard.edu/
10Some Dataverse Uses
- Future researchers: discovery, linking, forward citation, verification, analysis
- Journals, for replication
- Authors, for their own data
- Teachers, in depth analysis
- Sections of scholarly organizations, to organize existing data
- Granting agencies
- Research centers
- Archives
- Major Research Projects
- Academic departments, universities, centers, libraries
11Fixing Data Citations
- Citations are a traditional formal mechanism to link together intellectual works
- Citations glue together Regulations, Publications, and Evidence
- But, lack of rules for citing numeric data
- No consistency in practice
- No fixed rules for copyeditors
- Sometimes in the list of references; sometimes a casual mention in the text
- Sometimes the archive is noted
- Sometimes a version number exists
- Sometimes the version number is listed (if it exists)
- Archive numbers are sometimes given, if they exist
- Sometimes the author is noted
- Date of creation is sometimes given
- URLs often given, rarely persist
- Dates of access protect the researcher, do not help find the data
- The data may not be available publicly
- The data may no longer exist
12A New Citation Standard for Numeric Data
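A minimal sketch of what a structured data citation might look like, assuming the elements proposed in the Altman and King paper listed under Related Work: author, date, title, a persistent global identifier, and a numeric fingerprint. The class name `DataCitation`, the rendering format, and every value in the example are hypothetical placeholders, not a real study or a real identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical container for the elements of a data citation."""

    authors: str
    year: int
    title: str
    identifier: str   # persistent global identifier, e.g. a handle
    fingerprint: str  # content fingerprint, e.g. a UNF

    def render(self) -> str:
        # One possible rendering; the punctuation is an assumption.
        return (
            f'{self.authors}. {self.year}. "{self.title}", '
            f"{self.identifier} {self.fingerprint}"
        )


# Placeholder values only: this is not a real study, handle, or fingerprint.
citation = DataCitation(
    authors="A. Researcher",
    year=2007,
    title="Example Survey Data",
    identifier="hdl:0000.0/EXAMPLE",
    fingerprint="UNF:EXAMPLE-PLACEHOLDER",
)
print(citation.render())
```

Keeping the identifier and fingerprint as required fields addresses two of the problems listed on the previous slide: URLs that do not persist, and the inability to tell whether the data found are the data cited.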
13(No Transcript)
14DataPASS
15(No Transcript)
16Future: Replication as Institutional Insurance
Data-PASS Syndicated Storage Project
- External Causes of Preservation Failure
- Third party attacks
- Institutional funding
- Change in legal regimes
- Quis custodiet ipsos custodes?
- Unintentional curatorial modification
- Loss of institutional knowledge and skills
- Intentional removal
- Change in institutional mission
- Schema driven: capture inter-archival preservation commitments
- Asymmetric: resource commitments proportional to holdings
- Versioned: versioned data and citations
- Integration: LOCKSS and DVN technology, archival workflows
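The failure modes above (unintentional modification, intentional removal) are the reason replication works as insurance: when several institutions hold copies, a damaged copy can be detected by comparing them. The sketch below is a hypothetical illustration of that idea using a simple majority vote over content hashes. It is not the LOCKSS protocol and not the Data-PASS implementation; the function name `audit_replicas` and the archive names are invented for the example.

```python
import hashlib
from collections import Counter


def audit_replicas(replicas):
    """Compare copies held by different archives.

    `replicas` maps an archive name to the bytes of its copy.
    Returns (majority_digest, names of archives whose copy disagrees).
    """
    if not replicas:
        raise ValueError("no replicas to audit")
    digests = {
        name: hashlib.sha256(content).hexdigest()
        for name, content in replicas.items()
    }
    majority_digest, votes = Counter(digests.values()).most_common(1)[0]
    # Without a strict majority there is no basis for choosing a good copy.
    if votes <= len(replicas) / 2:
        raise ValueError("no majority: cannot tell which copy is intact")
    damaged = sorted(n for n, d in digests.items() if d != majority_digest)
    return majority_digest, damaged


# Hypothetical archives; one copy has been silently altered.
copies = {
    "archive_a": b"id,income\n1,200\n2,350\n",
    "archive_b": b"id,income\n1,200\n2,350\n",
    "archive_c": b"id,income\n1,200\n2,351\n",
}
_, damaged = audit_replicas(copies)
print(damaged)
```

A single archive auditing only its own copy cannot detect a change made by its own curators, which is the point of the "Quis custodiet ipsos custodes?" bullet.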
17Future: Social Science Data Deluge
- Collective holdings of all U.S. numeric social science data in all major data archives and government repositories: estimated 10s of TB
- Ambient data increasingly becoming subject of social science research.
- Data deluge, annual volume (2002 figures)
- Web (surface) 167 TB
- Radio 3,500 TB
- Television 69,000 TB
- Web (deep) 92,000 TB
- Email (originals) 441,000 TB
- Telephone 18,000,000 TB
Or, what are you thinking?
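To put the annual volumes listed above side by side with the archive holdings, the short calculation below totals the six categories and computes each one's share. The per-category figures are taken from the slide; the value `ARCHIVE_TB_ASSUMED = 50` is only an assumed stand-in for "10s of TB" and is not a figure from the talk.

```python
# Annual data volumes in terabytes, as listed on the slide (2002 figures).
ANNUAL_TB = {
    "Web (surface)": 167,
    "Radio": 3_500,
    "Television": 69_000,
    "Web (deep)": 92_000,
    "Email (originals)": 441_000,
    "Telephone": 18_000_000,
}

# ASSUMPTION: a stand-in for "10s of TB" of archived social science data.
ARCHIVE_TB_ASSUMED = 50


def shares(volumes):
    """Return the total volume and each category's fraction of it."""
    total = sum(volumes.values())
    return total, {name: tb / total for name, tb in volumes.items()}


total_tb, share = shares(ANNUAL_TB)
print(f"Total: {total_tb:,} TB per year")
for name, fraction in sorted(share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} {fraction:8.3%}")
print(f"Ratio to assumed archive holdings: {total_tb / ARCHIVE_TB_ASSUMED:,.0f}x")
```

The total comes to 18,605,667 TB per year, of which telephone traffic is about 97 percent; even the surface web alone exceeds the assumed archive holdings.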
18What's Next: New Social Science Examples
- From Social Science Research Computing Environment Project
- Assess need for high performance computing among social scientists at Harvard
- Prototype interfaces to make grid computing usable by social scientists
- Examples
- Harvesting and analysis of blogs for virtual political opinion surveys
- Continuous collection of CSPAN, real-time subject coding, continuous dissemination
- Cell phone data stream collection: movement, proximity to others, social network analysis
- Participative goals-based redistricting
- Agent-based models of emerging institutions
- FMRI analyses of reaction to political and social scenarios
- Modal Features
- Analyses emerge through exploration and interactions
- Data collection from non-experimental, non-instrumental sources
- Increasing scale of data
- Compute limited
- Data confidentiality
- High-level analysis tools
- Remote collaboration is part of projects
Meta-features of social science: messy data, an abundance of plausible models
19Social Science Research Infrastructure Challenges
- Social science challenges
- Few definitive answers
- Complex conceptual primitives
- Complex theories of behavior
- Reliance on observational data
- Specification uncertainty
- Changing evidence base (blogs, video, continuously recorded behavioral data)
- Some trends
- Compute-intensive inferential statistics
- Specification searches
- Sensitivity analyses
- Curse of dimensionality
- Data explosion
- Changing evidence base
- Agent-based models
- Important Gaps
- No tool covers entire scholarly research lifecycle
- Most not yet mature
- Poor integration across most tools
20For More Information
Dataverse Network Project: http://TheData.Org
Data-PASS Alliance: http://www.icpsr.umich.edu/DATAPASS/
Contact me: http://maltman.hmdc.harvard.edu/