lbsh: Breadcrumbs as You Work

1
lbsh: Breadcrumbs as You Work
  • Eric Osterweil

2
Problem
  • Measurement studies, simulations, and many other
    investigations require a lot of data work
  • Data processing (or experimentation) can be
    ad-hoc
  • Given some raw data (measurements, observations,
    etc.) we often try a number of things
  • Simulations are often tweaked and re-run numerous
    times
  • In this mode, experiments can recursively lead to
    subsequent experiments
  • How can a researcher always remember the exact
    provenance of their results?

3
What is Data Provenance?
  • The concept of data provenance is a lot like
    chain of custody
  • More specifically, we borrow a definition from
    [1]:
  • We defined data provenance as information that
    helps determine the derivation history of a data
    product, starting from its original sources.

4
Why Does Provenance Matter?
  • Do we need to be able to remember exactly how we
    got results?
  • Setting:
  • Student does a lot of processing, gets
    compelling results
  • Advisor wants to re-run with new data
  • Student panics (silently of course)
  • Reviewers hope that results are reproducible

5
Sharing Work With Others?
  • How many people have had to re-implement someone
    else's algorithms for a paper?
  • How about getting a tarball of crufty scripts
    from an author and trying to get them to work?
  • What if you could get a tarball that was totally
    self-descriptive?
  • What if the tarball could totally describe the
    work that led to that user's results?
  • What if it could allow you to re-run the whole
    thing?

6
Example
  • sort data2.out > sorted-data2.out
  • awk '{print $1 "\t" $5}' sorted-data2.out > dook
  • sort data3.out > sorted-data3.out
  • join -1 1 -2 1 dook sorted-data3.out > blah
  • vi script.sh
  • script.sh dook
  • awk '{tot+=$11}END{print tot}' blah > day1.out
  • vi blah.pl
  • blah.pl data1.out data2.out > day2.out
  • sort data1.out > sorted-data1.out
  • join -1 1 -2 1 dook sorted-data1.out > blah
  • vi blindluck.awk
  • blindluck.awk blah > day3.out

7
Results?
  • What if it turns out that day3.out has the
    results I wanted?
  • Can anyone recall what the commands were for
    that?
  • Were any of the files overwritten?
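
The overwrite question can be checked after the fact if the commands were saved somewhere. The sketch below assumes a saved list with one command per line and plain " > file" redirects; the function name overwritten_targets is made up for this illustration and is not part of lbsh. It counts how often each file is a redirect target and reports those written more than once.

```shell
#!/bin/sh
# Report every file that is the target of a ">" redirect more than once.
# Reads one command per line on stdin. Only the " > file" form (with spaces)
# is recognised; ">>", tee, and in-place edits are not handled.
overwritten_targets() {
  awk '{ for (i = 1; i < NF; i++) if ($i == ">") count[$(i + 1)]++ }
       END { for (f in count) if (count[f] > 1) print f, count[f] }' | sort
}

# Demo: a subset of the commands from the Example slide.
overwritten_targets <<'EOF'
sort data2.out > sorted-data2.out
sort data3.out > sorted-data3.out
join -1 1 -2 1 dook sorted-data3.out > blah
sort data1.out > sorted-data1.out
join -1 1 -2 1 dook sorted-data1.out > blah
EOF
```

For this input the demo prints "blah 2": the intermediate file blah is written by both join commands, so the version that fed day1.out is gone by the time day3.out is produced.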

8
Outline
  • Inspiration
  • Goal
  • lbsh (Pound-Shell)
  • Usage
  • Contribution
  • Future

9
Inspiration
  • Computer Scientists cannot be the first group to
    have this problem
  • In fact, we're not
  • Science is predicated on reproducibility, so how
    do (for example) biologists deal with this?
  • They have lab-books, and they take notes

10
Can We Do the Same?
  • A biologist may make a few notes and then spend
    several days conducting experiments
  • Conversely, we process data as fast as we can
    type, and block on I/O occasionally
  • Note taking is a small task in proportion to a
    biologist's experiment
  • Note taking is a large task in proportion to our
    fast-fingers
  • Even then, a lab-book can look like a dictionary
    (too full of noise to use)

11
What Else Do People Do?
  • Scientific Workflow
  • Design experiments in workflow environments
  • Lets each experiment be re-run and transparent
  • Lower level of noise
  • Of course, users must do all work in a foreign,
    and oftentimes restrictive, environment

12
Observation
  • We can't always (ever?) know what experiments
    will be fruitful before we run them
  • So, we may not want to set up a large experiment
    and design a workflow every time we try something
  • Corollary: We may not realize our results are
    good until some time after we first examine them

13
What Holds Us Back?
  • A lack of motivation?
  • Shouldn't a solution:
  • Be easy
  • Support automation that makes it worth doing? Why
    bother if it isn't directly useful?

14
Goal
  • What we really want is to know how day3.out was
    generated, because:
  • We need to be sure we did it right
  • We need to be able to show our collaborators that
    we aren't smoking crack
  • We often want to re-run our analysis with new
    data
  • More? Let's stop here for now

15
How COULD We Do This?
  • Keep a manual lab-book file of all commands run
  • This is feasible, but very prone to both bloat
    and stale/missing/mistaken info
  • It's a very manual process and a pain. You can't
    copy-and-paste w/o stripping the prompts, etc.
  • Look at the history file
  • Multiple shells will cause holes in the history
  • What about commands issued in R, gnuplot, etc?
  • An ideal solution:
  • Automatic, just specify start and stop points.
  • Wasted experiments are not a factor

16
Meaningless Eye Candy

17
lbsh (Pound-Shell)
  • Let's provide lab-book support on the command
    line!
  • While typing, we should be able to just start an
    experiment, do some work, and then stop it
  • In addition, we should keep track of what files
    were accessed and modified during this
  • Goal: provide provenance for files based on
    lab-book entries

18
Level-Setting
  • lbsh is in alpha
  • The code works well, but there are certainly bugs
  • The features that are there are a starting point
  • Feedback is welcome
  • Tell me about bugs, tell me what you like, tell
    me what you dislike, etc
  • The page is hosted here, but there are links to
    sourceforge for bug tracking and feature reqs
  • http://lbsh.cs.ucla.edu/

19
How Does it Work?
  • lbsh is a monitor that spawns a worker shell and
    passes commands to it
  • When a user starts an experiment, lbsh starts
    recording
  • The experiments are entered as separate lab-book
    entries
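
The monitor idea can be illustrated with a few lines of shell. This is only a sketch under stated assumptions, not lbsh's code: the names lb_toggle, lb_run, and LABBOOK, and the "# begin/end experiment N" entry format, are invented here. It also starts a fresh sh for every command, whereas the slide describes a single worker shell, so state such as cd or shell variables does not carry over between commands.

```shell
#!/bin/sh
# Hypothetical lab-book file and recorder state (names are assumptions).
LABBOOK=${LABBOOK:-./labbook.log}
RECORDING=0
EXP_ID=0

# lb_toggle: start an experiment if none is running, otherwise stop it.
lb_toggle() {
  if [ "$RECORDING" -eq 0 ]; then
    EXP_ID=$((EXP_ID + 1))
    RECORDING=1
    echo "# begin experiment $EXP_ID" >> "$LABBOOK"
  else
    RECORDING=0
    echo "# end experiment $EXP_ID" >> "$LABBOOK"
  fi
}

# lb_run CMD: log CMD if an experiment is running, then hand it to a shell.
lb_run() {
  if [ "$RECORDING" -eq 1 ]; then
    printf '%s\n' "$1" >> "$LABBOOK"
  fi
  sh -c "$1"
}
```

Calling lb_toggle, then lb_run 'sort data1.out > sorted-data1.out', then lb_toggle again would leave one three-line entry in the lab-book file; commands run outside an experiment are executed but not logged.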

20
Specifically
  • lbsh uses a user config file ($HOME/.lbshrc)
  • Records commands (even in R, etc.)
  • Stats files in a user-specified directory
    (atime/mtime)
  • Can repeat experiments
  • Is able to avoid repeating editor sessions (vi,
    emacs, etc.)
  • Can report the experimental provenance of
    individual files
  • i.e., "How did I get day3.out?"
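
The file-tracking step can be approximated with a marker file and find. The sketch below is an assumption about one way to do it, not a description of how lbsh stats files: exp_begin and exp_changed are hypothetical names, only mtime is considered (not atime), and the marker is expected to live outside the watched directory.

```shell
#!/bin/sh
# exp_begin MARKER: remember "now" as the mtime of a marker file.
exp_begin() {
  touch "$1"
}

# exp_changed MARKER DIR: list regular files under DIR whose mtime is
# later than the marker's, i.e. files written since exp_begin was called.
exp_changed() {
  find "$2" -type f -newer "$1" | sort
}
```

A typical use would be exp_begin /tmp/exp.marker before the work and exp_changed /tmp/exp.marker ./data after it. Because the comparison is strictly "newer than", a file written within the same timestamp tick as the marker can be missed on filesystems with coarse timestamps.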

21
Usage
  • To use lbsh, just launch it
  • To start/stop an experiment:
  • ctrl-b
  • To tell if lbsh is running, or if an experiment
    is running:
  • lbshrunning.sh -v
  • exprunning.sh -v
  • To find a file's provenance:
  • file-provenance.pl
  • To re-run an old experiment:
  • exeggutor.pl <experiment ID>
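
To make the last two operations concrete, here is a sketch of a provenance lookup and a replay filter over a plain-text lab-book. Everything about it is assumed for illustration: the lab-book format ("# begin experiment N", command lines, "# modified FILE", "# end experiment N") and the functions provenance and replay are invented here and say nothing about how file-provenance.pl or exeggutor.pl actually work.

```shell
#!/bin/sh
# provenance FILE LABBOOK: print every experiment entry that modified FILE.
provenance() {
  awk -v target="$1" '
    /^# begin experiment/ { buf = ""; hit = 0 }
    { buf = buf $0 "\n" }
    $1 == "#" && $2 == "modified" && $3 == target { hit = 1 }
    /^# end experiment/ { if (hit) printf "%s", buf }
  ' "$2"
}

# replay ID LABBOOK: print the commands of experiment ID, skipping
# comment lines and interactive editor sessions (vi, emacs).
replay() {
  awk -v id="$1" '
    $0 == ("# begin experiment " id) { on = 1; next }
    $0 == ("# end experiment " id)   { on = 0 }
    on && $1 != "#" && $1 != "vi" && $1 != "emacs" { print }
  ' "$2"
}
```

With such a file, provenance day3.out labbook.log would show the entry that produced day3.out, and replay 2 labbook.log | sh would re-run experiment 2 without reopening the editor.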

22
Revisiting Example
  • sort data2.out > sorted-data2.out
  • awk '{print $1 "\t" $5}' sorted-data2.out > dook
  • sort data3.out > sorted-data3.out
  • join -1 1 -2 1 dook sorted-data3.out > blah
  • vi script.sh
  • script.sh dook
  • awk '{tot+=$11}END{print tot}' blah > day1.out
  • vi blah.pl
  • blah.pl data1.out data2.out > day2.out
  • sort data1.out > sorted-data1.out
  • join -1 1 -2 1 dook sorted-data1.out > blah
  • vi blindluck.awk
  • blindluck.awk blah > day3.out

23
Real Experiments
  • This example is too simple to be interesting
  • Though simple is good
  • Let's see the result of some real usage from a
    paper submission

24
(No Transcript)

25
Contribution
  • What we want is to make reproducibility a
    foregone conclusion, not a pipedream
  • Can we do it?
  • lbsh is a simple tool that is NOT fool-proof
  • Evidence: I've already found ways to trick it
  • lbsh is just a useful tool that makes it easier
    for each of us to be more diligent
  • What lbsh really contributes is:
  • An automation framework for us to be more
    efficient, and more secure in our work
    (reproducing data, etc.)
  • An enabling technology for us to do better

26
Future
  • In addition to tending our own farm, can we build
    on someone else's work now?
  • Ex: IMC requires datasets to be made public to be
    considered for best-paper
  • From public data, can I automatically see how
    someone got their results and try to do follow-on
    work?
  • Feature requests:
  • Svn support: version control some files
  • File cleanup
  • Fix NFS support

27
http://lbsh.cs.ucla.edu/

28
References
  • [1] Y. L. Simmhan, B. Plale, and D. Gannon. A
    survey of data provenance in e-science. SIGMOD
    Rec., 34(3):31-36, 2005.

29
Thanks!
  • Questions?
  • Ideas?