recovering (memprof-limits.recovering)

Summary

For the program to recover from an interrupt:

The task must simply be written like any program on its own, programmed to clean-up and “fail fast” in the event of a failure, using the provided resource-management features.
The monitor must make sure not to share state with the task, unless write access to this state is protected under a resource abstraction that enforces its consistency in face of interruption.

Introduction

Memprof_limits interrupts tasks by raising an exception at some allocation point chosen randomly by the statmemprof runtime, if some limit is reached. It is in general not possible to reason about where this exception will arise nor to handle it locally: there are far too many possible locations in the code. Such exceptions are called asynchronous exceptions or interrupts [Reppy, 1990].

The advantage of asynchrony is that the user code always remains responsive to memprof-limits interruptions, without requiring user code nor libraries it depends on to be reorganised to actively poll memprof-limits. The downside is that the asynchronous interrupt is akin to other exceptions denoting unexpected conditions causing the failure of the whole task at hand, such as Stdlib.Out_of_memory, Stdlib.Stack_overflow, Stdlib.Assert_failure… The recovery from these unexpected exceptions is subject to some discipline, which we describe now.

In short:

Do not rely on the returned exception,
Re-raise exceptions promptly,
Ensure isolation of state.

1. Do not rely on the returned exception

Catching a specific asynchronous interrupt is not reliable and in general hides bugs. The exception could have been caught in the user code with the intention of re-raising it, but another unexpected exception arose in the meanwhile, causing the original exception to be discarded. Yet it may very well be that the first exception arising in the wrong place was the cause of the second one, in which case the intention was to catch the first exception. In addition, exceptions are sometimes wrapped, such as inside the Stdlib.Fun.Finally_raised and Dynlink.Error exceptions in the OCaml stdlib, in order to augment them with contextual information.

The indiscernability of asynchronous interrupts was noticed in Standard ML, which has attempted to solve it by representing all causes of interruption (out of memory, stack overflow, Ctrl+C…) with a single exception Interrupt. While Standard ML's notion of dedicated interrupt has never been part of OCaml, we can emulate it as a programmer discipline, by never matching on specific interrupts, and using the catch-all _ instead. This amounts to treating all asynchronous interrupts, bugs, and uncaught exceptions as being equivalent at that boundary.

In the case of memprof-limits

Memprof-limits handles this for you. In order to reliably tell whether a computation has been interrupted, it records instead with a flag whether it has previously raised an exception due to this particular limit instance.

The generic form to handle the return value is the following.

match Memprof_limits.limit_global_memory task with
| Ok result -> …
| Error exn1 -> (* limit reached *) …
| exception exn2 -> (* some other error *) …

The Error case is encountered whenever the computation of task has been interrupted by the corresponding limit being reached, whether the task has returned normally for some reason (exception caught and discarded), has let through the exception raised by memprof-limits, or has raised any other exception instead. In the latter case, this exception is exn1 above, the latest exception raised in the task after interruption. exn1 can be discarded assuming that the monitored computation was properly isolated (see further below).

If the limit has not been reached, no exception arising from the task is caught by memprof-limits (exn2 above).

2. Re-raise exceptions promptly

Software relying on interrupts, such as Isabelle or Rocq, have found necessary to request from users that they re-raise interrupts promptly. For instance, the Isabelle/ML manual states:

“Isabelle/ML user code needs to terminate promptly on interruption, without guessing at its meaning to the system infrastructure. Temporary handling of interrupts for cleanup of global resources etc. needs to be followed immediately by re-raising of the original exception.”

For this purpose, both Isabelle/ML and the OCaml libraries of Rocq offer predicates to test whether an exception is an interrupt, so that the user can act accordingly.

In the case of memprof-limits

OCaml does not have a built-in notion of interrupt and therefore does not provide such a predicate. However, another property comes save memprof-limits: if it happened that an interrupt was not re-raised promptly, then the task would be interrupted again soon.

It is still possible for adversarial code to systematically catch and discard all interrupts. In practice, memprof-limits expects some cooperation from the task to not do anything like this. But as far as bugs are concerned, further interruptions are always imminent.

Nevertheless, ideally one should not to use the catch-all _ with exceptions without re-raising the exception promptly, and one should prefer combinators such as Stdlib.Fun.protect or Memprof_limits.Masking.with_resource, which ensure that the interrupt is not discarded. The use of the predicate Memprof_limits.is_interrupted inside when-clauses of catch-alls lets you avoid catching interrupts.

3. Ensure isolation of state

We have mentioned that memprof-limits catches all exceptions if a limit is reached, which amounts to treating bugs and uncaught exceptions as interrupts too. In fact the memprof-limits functions that execute tasks act as an isolation boundary, similar to a toplevel prompt or a thread.

One aspect of isolation concerns the consistency of state in the program. An exception raised asynchronously can corrupt the program state if it occurs in the middle of a block meant to maintain some invariant. Then, it is the responsibility of the person catching the interrupt to make sure that no corrupt state escapes the boundary. This is done in two fashion:

Internal state is one that is not shared outside of the task. Since the interrupt has the intent of halting the computation of the task, internal state is permitted to be corrupted (i.e. violating abstractions) in a transient manner during resource clean-up, since nobody will be there to witness the corrupted state after the task terminates.

External state, shared outside of the task, must be modified in such a way that the interrupt does not result in corrupted state to be made available to the rest of the program. This is done by protecting the external state through resource abstractions.

The typical example of external state is indeed resource allocation which must be paired with deallocation. The advice for exception-safety [David Abrahams, 2000] must be applied, see Memprof_limits.Masking.with_resource.

A particular case concerns modifications to some state that in case of failure must be invalidated in order to prevent further access by other threads, or rolled back to a previous valid value. This can be achieved with an invariant-enforcing guard placed around critical sections using Memprof_limits.Masking.with_resource.

Invalidation: State invalidation can be seen in the Rust language, where the release action associated with a mutex is responsible not only for unlocking the mutex, but also for “poisoning” it in case of an exception, meaning that the data it protects cannot be accessed normally any further [The Rust Team, 2017]. This strategy amounts to communicating the interruption to other users of the same state.

Rollback: Rather than invalidating state, the strategy used in the Rocq prover is to contain the consequence of interrupts (e.g. Ctrl-C), so they are not witnessed by later users of the shared state, and the software can continue normally as before. This is done by setting the value back to a previous known-valid value.

To sum up, the task itself can be written without reasoning about interrupts: 1) corrupted internal state disappears along with the task, 2) external state shared with the task must already be provided by the monitor through interrupt-safe abstractions.

In the case of memprof-limits

Memprof-limits offers Memprof_limits.Masking.with_resource, which can be used to clean-up resources reliably and to implement interrupt-safe abstractions. Checking whether clean-up was triggered by an interrupt (to implement rollback or poisoning) can be done using Memprof_limits.is_interrupted.

An example of an unwind-safe shared data structure is given, using a Mutex wrapper which communicates the interruption to other threads using invalidation.

Caveat: Memprof-limits only support well-bracketed resources at the moment. This is known to be too limited in some situations.

Caution: The case of global values can be seen as escaping the control of the monitor. Ideally, all global values should be interrupt-safe by default. In the OCaml stdlib, some global state is interrupt-safe, but it is currently undocumented which one is and which one is not.

Bibliography

Notions of exception safety go way back and seem to be in perpetual rediscovery. Futher reading: Armstrong (2003), Siedersleben (2006), Duffy (2016), Snoyman (2018). Important milestones are C++ RAII [Abrahams, 2000] and Rust's extension of Armstrong's fail-fast technique to shared mutable state (e.g. [Alex Crichton & the Rust team, 2015]).

Goodenough (1975), “Structured exception handling“, POPL 1975.
Parnas and Würges (1976), “Response to undesired events in software systems”, ICSE 1976.
John Reppy (1990), “Asynchronous signals in Standard ML”, Technical Report TR 90-1144, Department of Computer Science, Cornell University, 1990.
https://people.cs.uchicago.edu/~jhr/papers/1990/cornell-tr-1144.pdf
David Abrahams (2000), “Exception-safety in generic components: Lessons learned from specifying exception-safety for the C++ standard library”, in LNCS 1766 (Springer Verlag, 2000), 69-79,
https://doi.org/10.1007/3-540-39953-4_6
Simon Marlow, Simon Peyton Jones, Andrew Moran, and John Reppy (2001). “Asynchronous exceptions in Haskell”. SIGPLAN Notices 36, nᵒ5 (2001): 274-85.
https://doi.org/10.1145/381694.378858
Joe Armstrong (2003), “Making reliable distributed systems in the presence of software errors”, Ph.D. thesis, Royal Institute of Technology, Sweden, 2003.
http://urn.kb.se/resolve?urn=urn:nbn:se:ri:diva-22455
Johannes Siedersleben (2006), “Errors and Exceptions – Rights and Obligations”, in C. Dony et al. (Eds.): Advanced Topics in Exception Handling Techniques, LNCS 4119 (Springer Verlag, 2006), 275-287.
https://doi.org/10.1007/11818502_15
The Isabelle developers (2010), “Interrupts” in Isabelle/ML manual, documented circa 2010-2015.
http://isabelle.in.tum.de/dist/library/Doc/Implementation/ML.html
Alex Crichton & the Rust team (2015), RFC 1236: “Stabilize catch_panic”, 2015.
https://github.com/rust-lang/rfcs/blob/master/text/1236-stabilize-catch-panic.md
Joe Duffy (2016), “The Error Model” in a series of blog posts on the Midori project, 2016.
http://joeduffyblog.com/2016/02/07/the-error-model/
The Rust team (2017), “Poisoning” in The Rustonomicon, 2017 (the feature itself was implemented circa 2012).
https://doc.rust-lang.org/nomicon/poisoning.html
Tony Garnock-Jones (2017). “Conversational Concurrency”, Ph.D. thesis, Northeastern University, 2017.
http://hdl.handle.net/2047/D20261862
Michael Snoyman (2018), “Asynchronous Exception Handling in Haskell”, blog post, 2018.
https://www.fpcomplete.com/blog/2018/04/async-exception-handling-haskell/