Title: Implementation and Evaluation of a Protocol for Recording Process Documentation in the Presence of Failures
1. Implementation and Evaluation of a Protocol for Recording Process Documentation in the Presence of Failures
- Zheng Chen and Luc Moreau
- zc05r_at_ecs.soton.ac.uk
- L.Moreau_at_ecs.soton.ac.uk
- University of Southampton
2. Outline
- Motivation
- Protocol Overview
- Implementation
- Experimental Setup
- Experimental Results and Analysis
- Conclusions and Future Work
3. Provenance and Process Documentation
- The provenance of a data product refers to the process that led to that data product
- Process documentation is a computer-based representation of a past process for determining provenance
- Process documentation consists of a set of p-assertions
- Process documentation is stored in provenance stores
- Provenance is obtained by querying provenance stores
4. PReP (Groth 04-08)
- A protocol to record process documentation
- Multiple provenance stores are interlinked to enable retrievability of distributed process documentation
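The interlinking idea above can be sketched as a tiny link table that lets a query follow documentation across stores. All names here (`LinkedStore`, `recordLink`, `resolve`) are illustrative, not part of the PReP implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: alongside its p-assertions, a provenance store keeps
// links naming the store that holds the rest of an interaction's
// documentation, so a query can hop from store to store.
public class LinkedStore {
    // interaction key -> URL of the store holding the remaining documentation
    private final Map<String, String> links = new HashMap<>();

    public void recordLink(String interactionKey, String remoteStoreUrl) {
        links.put(interactionKey, remoteStoreUrl);
    }

    /** Resolve where the remaining documentation of an interaction lives,
     *  or null if the documentation is entirely local. */
    public String resolve(String interactionKey) {
        return links.get(interactionKey);
    }
}
```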
5. Failures
- Provenance store crash, communication failures
- We do not consider application failures, e.g. actor crash
- Poor quality process documentation
  - Incomplete
  - Disconnected
6. Requirements
- Guaranteed Recording
  - After a process completes, the entire documentation of the process must eventually be recorded in provenance stores
- Link Accuracy
  - All the links recorded during a process must eventually be accurate, to enable retrievability of distributed documentation
- Efficient Recording
  - The protocol should be efficient and introduce minimal overhead
7. F-PReP
- A protocol for recording process documentation in the presence of failures
- Derives from PReP, inheriting its generic nature
- Introduces an Update Coordinator to facilitate updating links (we assume the coordinator does not crash)
- Actor side
  - Uses timeout and retransmission to record p-assertions
  - Chooses alternative provenance stores in case of failures
  - Requests the coordinator to update links
- Provenance store
  - Replies with an acknowledgement only after it has successfully recorded p-assertions in its persistent storage
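The actor-side behaviour above (timeout, retransmission, fallback to alternative stores) might be sketched as follows. `ActorRecorder`, `Store`, and `recordBatch` are assumed names for illustration, not the actual F-PReP API.

```java
import java.util.List;

// Hypothetical sketch of F-PReP's actor-side recording loop.
public class ActorRecorder {
    public interface Store {
        // Returns true iff the store acknowledged a durable write within the timeout.
        boolean recordBatch(List<String> pAssertions, long timeoutMs);
    }

    private final int maxRetries;

    public ActorRecorder(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /**
     * Retransmit to the primary store up to maxRetries times; on repeated
     * failure, fall back to the first responsive alternative store (after
     * which a link-update request would be sent to the coordinator).
     * Returns the store that acknowledged the batch, or null if all failed.
     */
    public Store record(List<String> batch, Store primary,
                        List<Store> alternatives, long timeoutMs) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (primary.recordBatch(batch, timeoutMs)) return primary;
        }
        for (Store alt : alternatives) {
            if (alt.recordBatch(batch, timeoutMs)) return alt;
        }
        return null; // batch kept locally; recording retried later
    }
}
```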
8. F-PReP
9. Implementation
- Provenance Store
  - Implemented as a Java Servlet
  - Backend store (Berkeley DB)
  - Disk cache: flushing OS buffers to disk before providing an ack to the actor
  - Update Plug-In
- Client Side Library
  - Remedial actions that cope with failures
  - Multithreading for the creation and recording of p-assertions
  - A local file store (Berkeley DB) for temporarily maintaining p-assertions
- Update Coordinator
  - Implemented as a Java Servlet
  - Berkeley DB is also employed to maintain request information
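A minimal sketch of the "flush OS buffers before acking" rule, using standard Java NIO rather than Berkeley DB; `DurableWriter` and `writeThenAck` are hypothetical names for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative sketch: the store acknowledges only after the p-assertion
// bytes have been forced out of OS buffers onto the disk.
public class DurableWriter {
    public static boolean writeThenAck(Path file, byte[] pAssertion) {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            ch.write(ByteBuffer.wrap(pAssertion));
            ch.force(true); // flush file data and metadata to disk
            return true;    // only now may the ack be sent to the actor
        } catch (IOException e) {
            return false;   // no ack: the actor retransmits or picks an alt. PS
        }
    }
}
```

This is the property exploited in Experiment 5: without the forced flush, an OS crash can lose p-assertions that were already acknowledged.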
10. Performance Study
- Throughput of provenance store and coordinator
- Scalability of update coordinator
- Failure-free recording performance
- Overhead of taking remedial actions
- Performance impact on application
11. Experimental Setup
- Iridis cluster (Over 1000 processor-cores)
- Gigabit Ethernet
- Tomcat 5.0 container
- Berkeley DB Java Edition database
- Java 1.5
- A generator is used on an actor's side to inject random failure events
  - Failure to submit a batch of p-assertions to a provenance store
  - Failure to receive an acknowledgement from a provenance store before a timeout
  - Generates a failure event based on a failure rate, i.e., the number of failure events occurring over a total number of recordings
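The failure-event generator described above could look like this minimal sketch; the class name and its parameters are assumptions for illustration.

```java
import java.util.Random;

// Hypothetical sketch of the client-side failure injector: roughly
// `failureRate` of all recordings are turned into failure events
// (a dropped submission or a suppressed acknowledgement).
public class FailureGenerator {
    private final double failureRate; // fraction of recordings that should fail
    private final Random rng;

    public FailureGenerator(double failureRate, long seed) {
        this.failureRate = failureRate;
        this.rng = new Random(seed); // seeded for reproducible experiments
    }

    /** Decide whether the current recording should become a failure event. */
    public boolean shouldFail() {
        return rng.nextDouble() < failureRate;
    }
}
```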
12. Experiment 1: Provenance Store (PS) Throughput
- Setup: up to 512 clients sending 10k p-assertions to 1 PS in 10 min
- Hypothesis: the disk cache may sacrifice a provenance store's throughput
- Result: 20% decrease in throughput
13. Experiment 2: Coordinator Throughput
- Setup: up to 512 clients sending 100 requests to 1 coordinator in 10 min
- Hypothesis: the coordinator's throughput is high
- Result: 30,000 repair requests accepted in 10 min
14. Experiment 3: Throughput Experiment with Failures (1 client)
- Setup: 1 client sending 10k p-assertions to 1 PS; 1 alternative PS and 1 coordinator used in the case of failures
- Hypothesis: (a) For transient failures, resending to the same PS is preferred over an alternative PS
- (b) The update coordinator is not a bottleneck
15. Experiment 4: Throughput Experiment with Failures (128 clients)
- Setup: 128 clients sending 10k p-assertions to 1 PS; 1 alternative PS and 1 coordinator used in the case of failures
- Hypothesis: (a) Resending to an alternative PS is preferred over the same PS
- (b) The coordinator is not a bottleneck
16. Experiment 5: Failure-free Recording Performance
- Setup: 1 client recording 10,000 10k p-assertions to 1 PS; 100 p-assertions shipped in a single batch
- Hypothesis: the disk cache causes overhead
- Results: (a) 900 10k p-assertions may be lost if the PS's OS crashes (PReP)
- (b) 13.8% overhead, compared to PReP
17. Experiment 6: Overhead of Taking Remedial Actions
- Setup: 1 client recording 100 p-assertions to 1 PS; 1 alternative PS and 1 coordinator used in the case of failures
- Hypothesis: remedial actions have acceptable overhead
- Result: record time
18. Experiment 7: Performance Impact on Application
- Amino Acid Compressibility Experiment (ACE)
  - High performance and fine grained, thus representative
- One run of ACE: 20 parallel jobs, 54,000 interactions/job
- Extremely detailed process documentation
  - 1.08 GB of p-assertions/job in 25 minutes
19. Recording Performance in ACE
- Setup: 5 PSs and 1 coordinator; multithreading for the creation and recording of p-assertions
- Hypothesis: F-PReP has acceptable recording overhead
- Results: (a) similar overhead (12%) as PReP on application performance when no failure occurs
- (b) Timeout and queue management affect performance
20. Impact of Queue Management on Performance
- Hypothesis: flow control on the queue affects performance
- Conclusions: (a) The result supports our hypothesis
- (b) We can monitor the queue and take actions, e.g., employing the local file store
21. Experiment 8: Quality of Recorded Process Documentation
- Setup: using F-PReP and PReP to record p-assertions; querying the PS to verify the recorded documentation
- Results: (a) PReP: incomplete; F-PReP: complete
- (b) PReP: irretrievable; F-PReP: retrievable
22. Conclusions and Future Work
- The coordinator does not affect an actor's recording performance.
- In an application, F-PReP has recording overhead on application performance similar to PReP's when there is no failure.
- Although it introduces overhead in the presence of failures, we believe the overhead is still acceptable, given that it records high quality (i.e., complete and retrievable) process documentation.
- We are currently investigating how to create process documentation when an application has its own fault tolerance schemes to tolerate application-level failures.
- In future work, we plan to make use of the process documentation recorded in the presence of failures to diagnose failures.
23. Questions?
Thank you!