Cancel with Confidence: Efficiently Manage Workflows in Covalent#
Covalent offers the unique ability to cancel a workflow or any task in a workflow at any point before or during execution.
Why Cancel?#
Most ETL and workflow orchestrators focus solely on production workflows, with emphasis on automation, performance, and efficient management of business processes. Covalent was designed to support research-based as well as production workflows. Research workflows require the flexibility to accommodate modifications, including cancellation of running tasks.
Covalent’s cancellation feature is indispensable in research workflow management, enabling you to stop any task or workflow that is taking too long to complete from consuming further resources. Covalent empowers you to confidently execute computation tasks on cutting-edge hardware platforms with high cost or limited availability, such as quantum computers, HPC clusters, GPU arrays, and cloud services, preventing budget and schedule overruns.
Besides stopping unneeded, over-long, and unnecessary jobs, the ability to cancel workflows or tasks encourages researchers and developers to modify their workflows to suit their changing needs, such as adjusting the computation pipeline or input data.
How To Cancel a Task or Workflow#
To cancel a dispatched workflow from a Python notebook or interactive environment, you issue a cancel
directive with the dispatch_id
of the workflow.
The following code illustrates how to use the Covalent cancel
API.
# In the notebook where you've dispatched the workflow ...
import covalent as ct
dispatch_id = ct.dispatch(workflow)(args, kwargs)
# ... Do the following to cancel the workflow:
ct.cancel(dispatch_id)
Similarly, to cancel a single task or a set of tasks within a single dispatch, use the same cancel
command, but as follows:
# In the notebook where you've dispatched the workflow ...
import covalent as ct
dispatch_id = ct.dispatch(workflow)(args, kwargs)
# ... to cancel tasks 1, 3, and 5, for example:
ct.cancel(dispatch_id, task_ids=[1, 3, 5])
The cancel
command interrupts or prevents tasks 1, 3, and 5. All resources (local and remote) being consumed by these tasks are released. Subsequent electrons dependent on the outcomes of tasks 1, 3, and 5 (that is, downstream in the transport graph) are also canceled and will not execute.
Note
If a node in a lattice is a sublattice, then cancelling that node recursively cancels all the tasks within the sublattice. Cancelling individual nodes within a sublattice is not supported in Covalent because the transport graphs associated with sublattices are dynamically built at runtime.
For complete examples of cancelling a task or workflow, see the How-to guides.
How Task Cancellation Works#
Recall that Covalent tasks (@electrons
) are executed by various executors. If the task is already running, cancelling it involves stopping the task’s process, thread, or program on the resource fronted by the executor.
An executor assigns a unique job handle to a task when the executor begins processing it. The job handle identifies the compute resource and the ID assigned by the compute resource. Some examples are:
Covalent Executor |
Compute Resource ID |
---|---|
|
|
|
|
AWS |
Job Amazon Resource Name (ARN) |
|
Linux process ID |
Using the compute resource and resource-specific ID, the job handle uniquely identifies each task. Of course, tasks that have not yet been started do not yet have a job handle.
Note
In case of task packing when multiple electrons that use the same executor are packed and executed as a single task, the job handle-to-task relationship is not one-to-one but one-to-many.
Once a unique job handle is assigned to a task, Covalent stores it in its local database for later retrieval and processing. Covalent uses this job handle to cancel the individual task in a workflow. The executor plugin implements the necessary APIs to map the job handle to a process on a compute resource and use that information to stop the task process.
How It Works in Even More Detail#
A task decorated with a remote executor such as AWSBatch or Slurm carries out several steps before executing the electron code, including:
pickling the electron’s
function
,args
, andkwargs
uploading the resulting pickle files to the remote destination
provisioning compute resources
fetching remote files
Once a workflow is dispatched, if you then request a particular node be canceled, Covalent might be performing any of these preliminary actions when the cancellation request comes in. In general, Covalent stops at the earliest possible point in the process, doing no more processing of any kind on the task once it receives the cancellation request.
If … |
Then Covalent … |
---|---|
The node has already started executing (the executor has already started running the electron’s code) |
Kills the job (using SIGKILL/SIGTERM, for example) |
Covalent has not yet begun executing the task |
Aborts and does not attempt to process it |
A node’s function, |
Does not pickle them; aborts immediately |
The required compute resources for the node have not yet been provisioned |
Does not provision them; aborts immediately |
Covalent has not instantiated the executor |
Does not instantiate an instance; aborts immediately |
Warning
Cancelling a task in Covalent is treated as a hard abort and instructs Covalent to abandon processing the task any further.
To abort a task as early as possible, Covalent needs to poll for cancellation requests. When a cancellation is requested, Covalent asynchronously updates its database record for the canceled task, setting its Boolean cancel_requested
flag to True
. (This flag defaults to False
for each node when it is created.)
Covalent’s task execution process checks this flag before each of the steps named above.
Implications for Custom Executors#
Executor plugin developers can check the cancel_requested
flag before each step, such as uploading pickled object to remote object stores, provisioning compute resources, executing the task, and so on. If an executor finds that the cancel_requested
flag has been set to True
for a task, it should raise a TaskCancelledError
exception, which Covalent then handles to abort processing the task any further.
To support task cancellation, an executor plugin must override the cancel
method provided in the base executor class. This method has the following function signature:
def cancel(self, task_metadata: Dict, job_handle: str):
...
The metadata associated with a task, dispatch_id
and node_id
, and the task’s job handle are provided as inputs to the cancel()
method. Use these inputs to implement the logic and backend-specific API calls to cancel the running task.
For examples of how to cancel workflows and individual tasks, refer to the How-to guides.