Towards a Data Network for Integrated Social Science Research

Transcript and Presenter's Notes

Title: Towards a Data Network for Integrated Social Science Research


1
Towards a Data Network for Integrated Social
Science Research
  • Micah Altman
  • Harvard University
  • Archival Director, Henry A. Murray Research
    Archive
  • Associate Director, Harvard-MIT Data Center
  • Senior Research Scientist, Institute for
    Quantitative Social Sciences
  • E: micah_altman_at_harvard.edu  W: http://maltman.hmdc.harvard.edu/

2
This Talk
  • Why Social Science Data?
  • Integrating Data Access -- Dataverse Network
  • Replicated Institutional Preservation: the
    Data-PASS Alliance for the Social Sciences
  • Social Science and Cyberinfrastructure

3
Related Work
  • Papers
  • M. Altman and G. King. A Proposed Standard for
    the Scholarly Citation of Quantitative Data,
    D-Lib, 13, 3/4 (March/April). 2007.
  • M. Altman, et al., Data Preservation Alliance
    for the Social Sciences: A Model for
    Collaboration, Proceedings of DigCCurr 2007, Chapel
    Hill. April 2007.
  • G. King, An Introduction to the Dataverse
    Network as an Infrastructure for Data Sharing,
    Sociological Methods and Research, 32, 2
    (November, 2007) 173-199.
  • M. Altman, "A Fingerprint Method for
    Verification of Scientific Data", in Advances in
    Systems, Computing Sciences and Software
    Engineering (Proceedings of the International
    Conference on Systems, Computing Sciences and
    Software Engineering 2007), Springer Verlag.
    Forthcoming 2008.
  • Collaborators & Co-conspirators
  • Margaret Adams, Ken Bollen, Cavan Capps, Jonathan
    Crabtree, Darrell Donakowski, Myron Gutmann, Gary
    King, Lois Timms-Ferrarra, Marc Maynard, Amy
    Pienta
  • Research Support
  • Thanks to the Library of Congress (PANDP03-1),
    the National Institutes of Aging (P01
    AG17625-01), the National Science Foundation
    (SES-0318275, IIS-9874747), the Harvard
    University Library, the Institute for
    Quantitative Social Science, the Harvard-MIT Data
    Center, and the Murray Research Archive.

4
What is Digital Social-Science Data?
  • DIGITAL
  • Optical: DVD, CD
  • Magnetic: tapes, floppies
  • Paper: cards, tapes
  • SOCIAL SCIENCE
  • Social: class, crime, social movements, culture,
    folklore, family
  • Economic: wealth, prosperity, labor, business,
    equity
  • Psychology: cognition, attitudes, stereotypes
  • Politics: justice, democracy, public policy,
    public administration, international conflict
  • DATA
  • Raw measurements
  • Numeric tables
  • Administrative records (email)
  • Video and audio interviews, transcripts (blogs)

5
Data Access is the Key To Science
  • Science is not (only) about being scientific
  • Scientific progress requires community:
    competition and collaboration in the pursuit of
    common goals
  • Without access to the same materials no
    community exists; data is the nucleus of
    collaboration.
  • The value of an article that can't be
    replicated?
  • Scholarly articles are summaries, not the actual
    research results
  • But: data access is spotty by field; finding the
    data is still hard
  • Hard for journal editors to verify. If you find
    it, how do you know it's the same?
  • Replication projects show: most published
    articles in social science cannot be
    replicated; data is necessary for replication
    and verification

6
Data Access is the Key To Democracy
  • Statistics: state-istics
  • The state tax authority counting people,
    estimating wealth
  • Reformers use data to assess the performance of
    the state
  • Science informs public policy continually
  • In modern democracy the public needs a direct
    source of information

7
Why is Infrastructure for Data Needed?
  • Accessibility
  • Most large data sets in public archives
  • Most data in published articles not accessible,
    results not replicable without the original
    author
  • Most data sets from federal grants not publicly
    available
  • Problems even with professional archives
  • Data in different archives have different
    identifiers
  • Archives change identifiers, links
  • Changes to data are made; identifiers are reused
    or removed; old data are lost
  • Data sets are not like books
  • Static data files (even if on the web)
    unreadable after a few years
  • When storage methods change, some data sets are
    lost; others have altered content!
  • Why not a single centralized infrastructure?
  • Single point of failure
  • Impossible when data are heterogeneous in format,
    origin, size, effort needed to collect or
    analyze, IRB access rules, etc.
  • Data producers want credit, control, and
    visibility
  • Requirements
  • Recognition, for data producers, distributors,
    related publishers
  • Rule-based Public Distribution

8
The Dataverse Network
  • An Open-Source, Federated, Web 2.0 Data Network
  • Gateway to over 20,000 social science studies
    (world's largest catalog)
  • Web Virtual Hosting 2.0 Service: over 100
    virtual archives
  • Federated access to other networks
  • Unified access to major U.S. research data
    archives, government data
  • Open service: endowed hosting
  • Open source: GPL-Affero-3
  • Discovery Services
  • Simple fielded search
  • Virtual collection browsing
  • Management
  • Ingest
  • Curation review
  • Virtual Hosting and administration
  • Metadata delivery
  • Descriptive and structural
  • Provenance (chain-of-custody metadata)
  • Human and OAI interfaces
  • Preservation
  • Standards based
  • Reformatting
  • Universal Numeric Fingerprints
  • Enhanced Delivery
  • Replication
  • Layered analysis services
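
The idea behind a numeric fingerprint can be illustrated with a minimal sketch: normalize each value to a fixed number of significant digits, then hash the normalized text, so that the same data yields the same fingerprint regardless of the file format it was stored in. This is a simplified illustration, not the actual Universal Numeric Fingerprint algorithm; the function names, the choice of SHA-256, and the 7-digit default are assumptions made for the example.

```python
import hashlib

def canonical(value, digits=7):
    """Render a number in exponential notation with `digits` significant digits."""
    return format(float(value), f".{digits - 1}e")

def fingerprint(values, digits=7):
    """Hash the canonical text of each value (illustrative only, not the real UNF)."""
    h = hashlib.sha256()
    for v in values:
        h.update(canonical(v, digits).encode("utf-8"))
        h.update(b"\n")  # separator so adjacent values cannot run together
    return h.hexdigest()

from_text_file = ["1.50", "2.25", "1e3"]   # values as read from a text file
from_binary_file = [1.5, 2.25, 1000.0]     # the same values read from a binary format

# Same data, different storage formats: the fingerprints agree.
print(fingerprint(from_text_file) == fingerprint(from_binary_file))
# A changed value produces a different fingerprint.
print(fingerprint([1.5, 2.25, 1000.1]) == fingerprint(from_binary_file))
```

Because the hash is taken over normalized values rather than raw bytes, reformatting a data set for preservation need not change its fingerprint, while an altered value does.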

9
DVN Screenshots
http://dvn.iq.harvard.edu/
10
Some Dataverse Uses
  • Future researchers: discovery, linking, forward
    citation, verification, analysis
  • Journals, for replication
  • Authors, for their own data
  • Teachers, in depth analysis
  • Sections of scholarly organizations, to organize
    existing data
  • Granting agencies
  • Research centers
  • Archives
  • Major Research Projects
  • Academic departments, universities, centers,
    libraries

11
Fixing Data Citations
  • Citations are a traditional formal mechanism to
    link together intellectual works
  • Citations glue together Regulations,
    Publications, and Evidence
  • But, lack of rules for citing numeric data
  • No consistency in practice
  • No fixed rules for copyeditors
  • Sometimes in the list of references; sometimes a
    casual mention in the text
  • Sometimes the archive is noted
  • Sometimes a version number exists
  • Sometimes the version number is listed (if it
    exists)
  • Archive numbers are sometimes given, if they
    exist
  • Sometimes the author is noted
  • Date of creation is sometimes given
  • URLs often given, rarely persist
  • Dates of access protect the researcher, do not
    help find the data
  • The data may not be available publicly
  • The data may no longer exist

12
A New Citation Standard for Numeric Data
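
As a rough sketch of what a structured data citation could look like, the record below carries the elements the previous slide describes as inconsistently given (author, date, title), plus a persistent identifier in place of a URL and a fingerprint for verifying the cited version. The class name, field names, rendering order, and all example values are hypothetical placeholders, not the published standard's exact format.

```python
from dataclasses import dataclass

@dataclass
class DataCitation:
    authors: str
    year: int
    title: str
    identifier: str   # persistent global identifier (hypothetical example below)
    fingerprint: str  # fingerprint of the cited data version (placeholder below)

    def render(self) -> str:
        """Format the citation as one line, in an assumed element order."""
        return (f'{self.authors}, {self.year}, "{self.title}", '
                f'{self.identifier} {self.fingerprint}')

# All values here are placeholders for illustration only.
citation = DataCitation(
    authors="A. Author",
    year=2007,
    title="Example Study",
    identifier="hdl:0000/example",
    fingerprint="UNF:PLACEHOLDER",
)
print(citation.render())
```

Keeping the elements as separate fields, rather than free text, is what lets copyeditors and indexing tools check that none of them is missing.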
14
Data-PASS
16
Future: Replication as Institutional Insurance
Data-PASS Syndicated Storage Project
  • External Causes of Preservation Failure
  • Third party attacks
  • Institutional funding
  • Change in legal regimes
  • Quis custodiet ipsos custodes?
  • Unintentional curatorial modification
  • Loss of institutional knowledge & skills
  • Intentional removal
  • Change in institutional mission
  • Schema driven: capture inter-archival
    preservation commitments
  • Asymmetric: resource commitments proportional
    to holdings
  • Versioned: versioned data and citations
  • Integration: LOCKSS + DVN technology, archival
    workflows

17
Future: Social Science Data Deluge
  • Collective holdings of all U.S. numeric social
    science data in all major data archives and
    government repositories: estimated 10s of TB
  • Ambient data increasingly becoming subject of
    social science research.
  • Data deluge, annual volume (2002 estimates)
  • Web (surface): 167 TB
  • Radio: 3,500 TB
  • Television: 69,000 TB
  • Web (deep): 92,000 TB
  • Email (originals): 441,000 TB
  • Telephone: 18,000,000 TB
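
To put these figures side by side, a short calculation can total the annual flows and compare them with archive holdings. The flow values are the ones listed above; the 100 TB ceiling for archive holdings is an assumption, reading "10s of TB" as "below 100 TB".

```python
# Annual flows in terabytes, as listed on the slide (2002 estimates).
flows_tb = {
    "Web (surface)": 167,
    "Radio": 3_500,
    "Television": 69_000,
    "Web (deep)": 92_000,
    "Email (originals)": 441_000,
    "Telephone": 18_000_000,
}

# Assumption: "10s of TB" is read as an upper bound of 100 TB.
ARCHIVE_UPPER_BOUND_TB = 100

total_tb = sum(flows_tb.values())
telephone_share = flows_tb["Telephone"] / total_tb
ratio_to_archives = total_tb / ARCHIVE_UPPER_BOUND_TB

print(f"Total annual flow: {total_tb:,} TB")
print(f"Telephone share of total: {telephone_share:.1%}")
print(f"Total flow vs. archive upper bound: at least {ratio_to_archives:,.0f}x")
```

Even the smallest listed flow, the surface web, exceeds the assumed upper bound on everything held in the archives.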

Or, what are you thinking?
18
What's Next: New Social Science Examples
  • From Social Science Research Computing
    Environment Project
  • Assess need for high performance computing among
    social scientists at Harvard
  • Prototype interfaces to make grid computing
    usable by social scientists
  • Examples
  • Harvesting and analysis of blogs for virtual
    political opinion surveys
  • Continuous collection of CSPAN, real-time subject
    coding, continuous dissemination
  • Cell phone data stream collection: movement,
    proximity to others, social network analysis
  • Participative goals-based redistricting
  • Agent-based models of emerging institutions
  • fMRI analyses of reaction to political and social
    scenarios
  • Modal Features
  • Analyses emerge through exploration and
    interactions
  • Data collection from non-experimental,
    non-instrumental sources
  • Increasing scale of data
  • Compute limited
  • Data confidentiality
  • High-level analysis tools
  • Remote collaboration is part of projects

Meta-features of social science: messy data,
an abundance of plausible models
19
Social Science Research Infrastructure Challenges
  • Social science challenges
  • Few definitive answers
  • Complex conceptual primitives
  • Complex theories of behavior
  • Reliance on observational data
  • Specification uncertainty
  • Changing evidence base (blogs, video,
    continuously recorded behavioral data)
  • Some trends
  • Compute-intensive inferential statistics
  • Specification searches
  • Sensitivity analyses
  • Curse of dimensionality
  • Data explosion
  • Changing evidence base
  • Agent-based models
  • Important Gaps
  • No tool covers entire scholarly research
    lifecycle
  • Most not yet mature
  • Poor integration across most tools
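
Why specification searches become compute-intensive can be seen in a minimal sketch: if any subset of k candidate covariates is a plausible model, there are 2^k specifications to estimate. The covariate names and the helper function are hypothetical, and treating every subset as a candidate model is a simplifying assumption.

```python
from itertools import combinations

def specifications(covariates):
    """Yield every subset of the candidate covariates, from empty to full."""
    for size in range(len(covariates) + 1):
        for subset in combinations(covariates, size):
            yield subset

# Hypothetical candidate covariates for illustration.
candidates = ["age", "income", "education", "region"]
specs = list(specifications(candidates))

print(len(specs))   # number of candidate specifications for 4 covariates
print(2 ** 30)      # the count formula 2**k applied to 30 covariates
```

Each added covariate doubles the number of models, which is one reason sensitivity analyses over many specifications call for high performance computing.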

20
For More Information
Dataverse Network Project: http://TheData.Org
Data-PASS Alliance: http://www.icpsr.umich.edu/DATAPASS/
Contact me: http://maltman.hmdc.harvard.edu/