Transparently Gathering Provenance with Provenance Aware Condor

1
Transparently Gathering Provenance with
Provenance Aware Condor
  • Christine Reilly and Jeffrey Naughton
  • Department of Computer Sciences
  • University of Wisconsin - Madison
  • TaPP '09, February 23, 2009

2
Motivations for Our Work
  • Scientific Computing
  • Grid Computing
  • Condor job scheduling system

3
Motivation: Scientific Computing
  • What input data was used to produce this output
    data?
  • Example:
  • The BLAST sequence DB is updated periodically.
  • Did my last computation use the latest version of
    the BLAST DB?

4
Motivation: Scientific Computing
  • Need to find the root of anomalous results.
  • Example:
  • Multiple members of a research group can update
    the code for the simulation.
  • Did two researchers use the same version of the
    executable?

5
Motivation: Grid Computing
  • What grid resources did my computation use?
  • Example:
  • A machine in the grid has a hardware problem (bad
    DIMM, corrupt disk).
  • Did any of my computations use the bad machine?

6
Motivation: Condor
  • Condor
  • Distributed computing system.
  • Runs jobs from a wide range of applications and
    fields of study.
  • Quill: Captures operational information exposed
    by Condor.
  • Could Quill be used for provenance?
  • Users would get provenance for free.

7
Provenance System Goals
  • Generic: Can be used by many different
    applications.
  • Transparent: Users don't need to alter their
    applications.
  • Can we use Condor as the execution system for
    this type of provenance system?
  • Can we do this with minimal impact on Condor
    developers?

8
Our Model of Provenance
  • File level granularity.
  • The provenance of a file is:
  • All files involved in its creation.
  • Job execution environment.

[Figure: provenance model. Labels: Execution Environment, exe, input, output.]
9
Outline of Talk
  • Motivation and Introduction
  • PAC - Provenance Aware Condor
  • Is PAC Practical? (Storage requirements and
    overhead)
  • DBLife: Benefits and Limitations of PAC
  • Conclusions

10
Provenance Aware Condor
  • Three parts: Condor, Quill, FileTrace
  • Condor is the job execution system.
  • Quill gathers operational information from
    Condor.
  • FileTrace gathers information about files used by
    Condor jobs.
  • Provenance is obtained by querying the Quill
    database.

11
What is Condor?
  • Provides a simple interface to a cluster of
    machines.
  • User gives computing jobs to Condor.
  • Condor finds an available machine that meets the
    job requirements.
  • Condor ensures that the job completes execution,
    even in the presence of failures.
  • Used for many different applications:
  • Physics, Biology, Chemistry, Computer Science,
    Engineering applications.
  • Academic research, national labs, industry.

12
What is Condor? (part 2)
  • Research project at UW, but also resembles
    commercial software.
  • 35 faculty, full-time staff, students.
  • Used at 2000 locations world-wide.
  • Running on more than 250,000 machines.
  • Annual user conference attracts hundreds of
    attendees.

13
Condor Overview
[Figure: Condor overview. Labels: User's Machine, Submit Machine, Central Manager, Execute Machine.]
14
What is Quill?
  • Gathers operational data from Condor.
  • Improves performance of users' inquiries about
    their jobs.
  • Information about Condor operations is stored in
    an RDBMS.
  • Created by group of database researchers
    (including us).
  • Ships with Condor.

15
Quill's Design Requirements
  • Expose Condor operational data for querying
    through a DBMS interface.
  • Cannot affect how Condor operates:
  • No changes to the existing Condor code.
  • No change to how Condor is used.
  • Failures in Quill cannot cause Condor to fail.

16
Quill Overview
[Figure: Quill overview. Nodes: User's Machine, Submit Machine, Central Manager, Execute Machine. Data flow labels: Job Specs; Exe, In Files; Job Requirements; Machine Resources; Match Info; Executable, Input Files; Output Files.]
17
File Access in Condor
  • Three methods of accessing files:
  • Remote system calls
  • Condor File Transfer
  • Shared file system
  • Quill's information about files:
  • Detailed information about File Transfers
  • Limited information about some other files
  • No guarantee that it detects all files
  • Problem: Because Condor does not track all file
    access, Quill may miss file information.
  • FileTrace gathers information about all files.

18
FileTrace
  • Modification of UNIX strace.
  • Transparent to Condor.
  • For each file open or close system call, FileTrace
    gathers:
  • Condor Job Id.
  • File information: pathname, last modified time,
    checksum, size.
  • File access information: activity type, open
    flags, file pointer.
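
The fields listed above can be pictured as one record per open or close event. The following is a minimal sketch in Python, not FileTrace's actual code (FileTrace is a modification of strace): the class name, the `make_record` helper, and the choice of SHA-256 for the checksum are all assumptions made for illustration, and the file pointer field is left out.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class FileTraceRecord:
    """One row per observed open/close event (field names are illustrative)."""
    job_id: str           # Condor job id
    pathname: str
    last_modified: float  # modification time, seconds since the epoch
    checksum: str         # hex digest; SHA-256 is an assumption
    size: int             # bytes
    activity_type: str    # "open" or "close"
    open_flags: int       # flags passed to open(), e.g. os.O_RDONLY


def make_record(job_id, pathname, activity_type, open_flags):
    """Collect file metadata for one event (hypothetical helper)."""
    st = os.stat(pathname)
    digest = hashlib.sha256()
    with open(pathname, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return FileTraceRecord(
        job_id=job_id,
        pathname=pathname,
        last_modified=st.st_mtime,
        checksum=digest.hexdigest(),
        size=st.st_size,
        activity_type=activity_type,
        open_flags=open_flags,
    )
```

Storing a checksum alongside the last modified time is what lets later queries tell whether two jobs saw the same version of a file, even if the pathname is unchanged.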

19
Provenance in PAC
  • Provenance of a file is:
  • All files involved in its creation: executable,
    input files, libraries and system files.
  • Job execution environment: when, on what machine.
  • PAC has no control over files in the user's file
    system.

20
Provenance Queries in PAC
  • What files were used by this job?
  • SELECT pathname, activity_type, open_flags,
    last_modified, size, checksum
    FROM filetrace
    WHERE globaljobid = my_job_id
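
To make the query above concrete, here is a small runnable sketch that uses an in-memory SQLite database as a stand-in for the Quill database. The column names come from the slide; the column types, the sample rows, and the `files_used_by` helper are assumptions for illustration only.

```python
import sqlite3

# Assumed schema: column names are taken from the query on the slide;
# the types and the sample rows below are hypothetical.
SCHEMA = """
CREATE TABLE filetrace (
    globaljobid   TEXT,
    pathname      TEXT,
    activity_type TEXT,
    open_flags    INTEGER,
    last_modified TEXT,
    size          INTEGER,
    checksum      TEXT
)
"""

SAMPLE_ROWS = [
    ("job-1", "/data/blast.db", "open", 0, "2009-02-01 10:00:00", 1024, "aaa"),
    ("job-1", "/home/u/out.txt", "close", 1, "2009-02-20 12:00:00", 64, "bbb"),
    ("job-2", "/data/blast.db", "open", 0, "2009-02-15 09:00:00", 2048, "ccc"),
]


def files_used_by(conn, job_id):
    """Return every filetrace row recorded for one job."""
    query = (
        "SELECT pathname, activity_type, open_flags, "
        "last_modified, size, checksum "
        "FROM filetrace WHERE globaljobid = ? ORDER BY pathname"
    )
    # A bound parameter replaces my_job_id from the slide.
    return conn.execute(query, (job_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany("INSERT INTO filetrace VALUES (?, ?, ?, ?, ?, ?, ?)", SAMPLE_ROWS)

for row in files_used_by(conn, "job-1"):
    print(row)
```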

21
Provenance Queries in PAC
  • More questions PAC can answer:
  • Were any of my files created on the bad machine?
  • Was this output created using the current
    versions of the input and executable files?
  • Did any input files change between two runs of
    the application?
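
The last question, whether any input files changed between two runs, could be expressed as a self-join on the filetrace table that compares checksums. This is a sketch under assumptions, not a query from the talk: it uses SQLite as a stand-in database, a reduced hypothetical table, and it assumes that `open_flags` stores the integer flag value so that read-only opens identify input files.

```python
import os
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical version of the filetrace table.
conn.execute(
    "CREATE TABLE filetrace ("
    "globaljobid TEXT, pathname TEXT, open_flags INTEGER, checksum TEXT)"
)
conn.executemany(
    "INSERT INTO filetrace VALUES (?, ?, ?, ?)",
    [
        ("run-1", "/data/blast.db", os.O_RDONLY, "v1"),
        ("run-1", "/data/params.txt", os.O_RDONLY, "p1"),
        ("run-2", "/data/blast.db", os.O_RDONLY, "v2"),
        ("run-2", "/data/params.txt", os.O_RDONLY, "p1"),
    ],
)


def changed_inputs(conn, first_job, second_job):
    """List files read by both jobs whose checksums differ between them."""
    query = """
        SELECT DISTINCT a.pathname, a.checksum, b.checksum
        FROM filetrace AS a
        JOIN filetrace AS b ON a.pathname = b.pathname
        WHERE a.globaljobid = ? AND b.globaljobid = ?
          AND a.open_flags = ? AND b.open_flags = ?
          AND a.checksum <> b.checksum
        ORDER BY a.pathname
    """
    params = (first_job, second_job, os.O_RDONLY, os.O_RDONLY)
    return conn.execute(query, params).fetchall()


print(changed_inputs(conn, "run-1", "run-2"))
```

With the sample rows, only `/data/blast.db` is reported, because its checksum differs between the two runs while `/data/params.txt` is unchanged.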

22
Outline of Talk
  • Motivation and Introduction
  • PAC - Provenance Aware Condor
  • Is PAC Practical? (Storage requirements and
    overhead)
  • DBLife: Benefits and Limitations of PAC
  • Conclusions

23
Is PAC Practical?
  • Users will not tolerate PAC if:
  • Data storage requirements are too onerous.
  • Computational overhead is significant.

24
Estimate Table Growth
  • Use the run time and number of files used by
    seven scientific applications (BLAST, IBIS, CMS,
    Nautilus, Messkit Hartree-Fock, AMANDA,
    SETI_at_home).
  • Simulate running each application constantly for
    a year on a 1000-machine cluster.
  • How big is the FileTrace table?

25
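[Chart: estimated FileTrace table size for each of the seven applications. Values shown, largest to smallest: 1.15 TB, 900 GB, 540 GB, 190 GB, 41 GB, 30 GB, 9 GB.]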
26
Test Cluster for Overhead
  • Machine Specs
  • Pentium 2.40 GHz Core 2 Duo
  • 4GB RAM
  • Two 250GB SATA-I hard disks.
  • Condor Cluster
  • Central Manager
  • Submit Machine
  • Database Machine
  • 10 execute machines

27
Overhead of PAC
  • Use synthetic program.
  • Writes random numbers to files.
  • Can vary the number of files generated and run
    time.
  • 4 sets of jobs with run time and number of files
    based on the scientific programs from the size
    experiment.
  • Run time per job: 5 or 20 minutes.
  • Files per job: 10 or 300.

28
Overhead of PAC
  • Two clusters: PAC, and Condor without Quill or
    FileTrace.
  • For each set of jobs, submit as many jobs as will
    take 12 hours to complete.
  • No significant difference in average time per job
    or total run time.

29
Outline of Talk
  • Motivation and Introduction
  • PAC - Provenance Aware Condor
  • Is PAC Practical? (Storage requirements and
    overhead)
  • DBLife: Benefits and Limitations of PAC
  • Conclusions

30
DBLife
  • Community Information Management (CIM)
  • Creates and maintains an ER graph of the database
    research community.
  • Gathers data by crawling web.
  • Interesting example for PAC:
  • Large, complex workflow.
  • Accesses many files.

31
Overview of DBLife
32
DBLife Front Page
33
DBLife Person Homepage
34
Questions PAC Can Answer
  • In general - questions developers and system
    administrators would ask.
  • Examples
  • Is this the version of the program that was used
    to create the current data set?
  • Was this machine used to create any part of the
    DBLife data set?
  • Does the DBLife ER model reflect the recent
    changes on my web page?

35
Questions PAC Cannot Answer
  • In general - questions that require knowledge of
    the semantics of DBLife.
  • Examples
  • Why does the DBLife portal show that person X is
    affiliated with institution Y?
  • Why doesn't my page on the DBLife portal include
    that I presented a paper at this workshop?

36
Problem: Large FileTrace Table
  • Ran DBLife with 10 crawl start points
  • FileTrace table: 100 MB
  • Full scale run uses 1000 start points
  • FileTrace table: est. 10 GB
  • Run DBLife daily for 1 year
  • FileTrace table: est. 3.7 TB
  • Possibilities for reducing table size:
  • Could be smarter about how data is stored.
  • Table compression.
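
The estimates above are consistent with simple linear scaling of the measured 100 MB. The sketch below reproduces that arithmetic; the linear-scaling assumption, the decimal units (1 GB = 1000 MB), and the function name are ours, not stated on the slide.

```python
def estimate_filetrace_growth(measured_mb, measured_start_points,
                              full_start_points, runs_per_year):
    """Linearly scale a measured table size.

    Returns (GB per full-scale run, TB per year), using decimal units.
    """
    mb_per_start_point = measured_mb / measured_start_points
    gb_per_run = mb_per_start_point * full_start_points / 1000
    tb_per_year = gb_per_run * runs_per_year / 1000
    return gb_per_run, tb_per_year


# Inputs from the slide: 100 MB for 10 start points, 1000 start points at
# full scale, one run per day for a year.
gb_per_run, tb_per_year = estimate_filetrace_growth(100, 10, 1000, 365)
print(f"{gb_per_run:.0f} GB per run, {tb_per_year:.2f} TB per year")
```

This gives 10 GB per full-scale run and 3.65 TB per year, which rounds to the 3.7 TB estimate on the slide.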

37
Outline of Talk
  • Motivation and Introduction
  • PAC - Provenance Aware Condor
  • Is PAC Practical? (Storage requirements and
    overhead)
  • DBLife: Benefits and Limitations of PAC
  • Conclusions

38
Limitations of PAC
  • File level is coarse granularity.
  • Files are not under PAC's management.
  • Workflows are not explicitly recorded.
  • Doesn't know about the semantics of an
    application.

39
Contributions of PAC
  • Many different applications can use PAC for
    provenance.
  • Has little impact on Condor.
  • Could be combined with other systems to yield a
    stronger provenance system:
  • Workflow management system.
  • File management system.

40
Future Work
  • Integrate PAC with an Xlog version of DBLife.
  • Xlog is similar to Datalog.
  • Record information when Xlog rule is triggered.
  • Provides fine grained provenance.
  • Can answer questions that depend on the semantics
    of the application.

41
Acknowledgements
  • Supported in part by National Science Foundation
    Award SCI-0515491.
  • Thank you, CondorDB, DBLife, and Condor groups!