covalent-hpc-plugin 0.0.8

Creator: codyrutscher

Description:

Covalent HPC Plugin
Covalent is a Pythonic workflow tool used to execute tasks on advanced computing hardware. This executor plugin uses PSI/J to allow Covalent to seamlessly interface with a variety of common high-performance computing job schedulers and pilot systems (e.g. Slurm, PBS, LSF, Flux, Cobalt, RADICAL-Pilot). For workflows to be deployable, users must have SSH access to the login node, access to the job scheduler, and write access to the remote filesystem.
Installation
Server Environment
To use this plugin with Covalent, simply install it using pip in whatever Python environment you use to run the Covalent server (your local machine by default):
pip install covalent-hpc-plugin

Run the following in Python to have Covalent automatically register the plugin:
import covalent

HPC Environment
Additionally, on the remote machine(s) where you plan to execute Covalent workflows with this plugin, ensure that the remote Python environment has Covalent and PSI/J installed:
pip install covalent psij-python

Note that the Python major and minor version numbers on both the local and remote machines must match to ensure reliable (un)pickling of the various objects.
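One way to check this requirement is to compare the (major, minor) pair reported by each interpreter. The following is a minimal sketch, not part of the plugin: the helper name `same_minor_version` and the example remote version string are illustrative assumptions.

```python
import sys


def same_minor_version(local: str, remote: str) -> bool:
    """Return True if two 'X.Y[.Z]' version strings share major and minor numbers."""
    local_pair = tuple(int(part) for part in local.split(".")[:2])
    remote_pair = tuple(int(part) for part in remote.split(".")[:2])
    return local_pair == remote_pair


# Version of the interpreter running this script, e.g. "3.10.12".
local_version = ".".join(str(n) for n in sys.version_info[:3])

# Hypothetical value; in practice, read it from `python --version` on the remote machine.
remote_version = "3.10.4"

print(local_version, remote_version, same_minor_version(local_version, remote_version))
```

Patch-level differences (the third number) are ignored here because only the major and minor numbers need to match.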
Usage
Default Configuration Parameters
By default, when you install the covalent-hpc-plugin and run import covalent for the first time, your Covalent configuration file (found at ~/.config/covalent/covalent.conf by default) will automatically be updated to include the following sections. These are not all of the available parameters but are simply the default values.
[executors.hpc]
address = ""
username = ""
ssh_key_file = "~/.ssh/id_rsa"
instance = "slurm"
launcher = "single"
inherit_environment = true
pre_launch_cmds = []
post_launch_cmds = []
shebang = "#!/bin/bash"
remote_python_exe = "python"
remote_workdir = "~/covalent-workdir"
create_unique_workdir = false
cache_dir = "~/.cache/covalent"
poll_freq = 60

[executors.hpc.environment]

[executors.hpc.resource_spec_kwargs]
node_count = 1
processes_per_node = 1
gpu_cores_per_process = 0

[executors.hpc.job_attributes_kwargs]
duration = 10

You can modify the parameters in the Covalent config file to suit your setup, such as the address of the remote machine, the username to use when logging in, the ssh_key_file to use for authentication, and the type of job scheduler (instance). Because PSI/J provides a single interface to many job schedulers, you only need to change the instance value to switch between them.
Full descriptions of the input parameters are given in the docstring of the HPCExecutor class:
https://github.com/Quantum-Accelerators/covalent-hpc-plugin/blob/25785d0c546851c4b11e5c227f2e7aebb12aba8c/covalent_hpc_plugin/hpc.py#L115-L159
Defining Resource Specifications and Job Attributes
Two of the most important sets of parameters are resource_spec_kwargs and job_attributes_kwargs, which specify the resources required for the job (e.g. number of nodes, number of processes per node, etc.) and the job attributes (e.g. duration, queue name, etc.), respectively.

resource_spec_kwargs is a dictionary of keyword arguments passed to PSI/J's ResourceSpecV1 class.
job_attributes_kwargs is a dictionary of keyword arguments passed to PSI/J's JobAttributes class.

The allowed types are listed here.
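To make the two dictionaries concrete, the sketch below summarizes what a given pair of settings requests: the total number of processes (nodes times processes per node) and the wall time, treating duration as minutes. It deliberately does not call PSI/J; `summarize_request` is a hypothetical helper, and its fallback values simply mirror the defaults in the configuration file.

```python
from datetime import timedelta


def summarize_request(resource_spec_kwargs: dict, job_attributes_kwargs: dict) -> dict:
    """Summarize a job request (illustrative helper, not part of the plugin or PSI/J)."""
    nodes = resource_spec_kwargs.get("node_count", 1)
    per_node = resource_spec_kwargs.get("processes_per_node", 1)
    minutes = job_attributes_kwargs.get("duration", 10)  # assumed to be in minutes
    return {
        "total_processes": nodes * per_node,
        "walltime": timedelta(minutes=minutes),
        "queue_name": job_attributes_kwargs.get("queue_name"),
    }


summary = summarize_request(
    {"node_count": 2, "processes_per_node": 24},
    {"duration": 30, "queue_name": "debug"},
)
print(summary["total_processes"], summary["walltime"])  # → 48 0:30:00
```

A quick summary like this can help catch a request that exceeds a queue's limits before the job is submitted.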
Using the Plugin in a Workflow: Approach 1
With the configuration file appropriately set up, one can run a workflow on the HPC machine as follows:
import covalent as ct

@ct.electron(executor="HPCExecutor")
def add(a, b):
    return a + b

@ct.lattice
def workflow(a, b):
    return add(a, b)


dispatch_id = ct.dispatch(workflow)(1, 2)
result = ct.get_result(dispatch_id)

Using the Plugin in a Workflow: Approach 2
If you wish to set parameters within your Python script rather than relying solely on the Covalent configuration file, you can do so by instantiating a custom instance of the HPCExecutor class. An example with some commonly used parameters is shown below. Any parameters not specified in the HPCExecutor will be inherited from the configuration file.
import covalent as ct

executor = ct.executor.HPCExecutor(
    address="coolmachine.university.edu",
    username="UserName",
    ssh_key_file="~/.ssh/id_rsa",
    instance="slurm",
    remote_conda_env="myenv",
    environment={"HELLO": "WORLD"},
    resource_spec_kwargs={
        "node_count": 2,
        "processes_per_node": 24,
    },
    job_attributes_kwargs={
        "duration": 30,  # minutes
        "queue_name": "debug",
        "project_name": "AccountName",
    },
    launcher="single",
    remote_workdir="~/covalent-workdir",
)

@ct.electron(executor=executor)
def add(a, b):
    return a + b

@ct.lattice
def workflow(a, b):
    return add(a, b)


dispatch_id = ct.dispatch(workflow)(1, 2)
result = ct.get_result(dispatch_id)

Working Example: Perlmutter
The following is a minimal working example to submit a Covalent job on NERSC's Perlmutter machine. It assumes that you have used the sshproxy utility to generate a certificate file in order to circumvent the need for multi-factor authentication for each login.
import covalent as ct

executor = ct.executor.HPCExecutor(
    address="perlmutter-p1.nersc.gov",
    username="UserName",
    ssh_key_file="~/.ssh/nersc",
    cert_file="~/.ssh/nersc-cert.pub",
    remote_conda_env="myenv",
    job_attributes_kwargs={
        "project_name": "ProjectName",
        "custom_attributes": {"slurm.constraint": "cpu", "slurm.qos": "debug"},
    },
)

@ct.electron(executor=executor)
def add(a, b):
    return a + b

@ct.lattice
def workflow(a, b):
    return add(a, b)


dispatch_id = ct.dispatch(workflow)(1, 2)
result = ct.get_result(dispatch_id)

Troubleshooting
Most issues stem from the job scheduler details (i.e. the resource_spec_kwargs and the job_attributes_kwargs). If your job fails on the remote machine, set cleanup=False, then check the files left behind in the working directory. Also check the ~/.psij directory, which holds a history of your attempted job submissions and the associated log files.
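When inspecting these directories on the remote machine, it can help to list the most recently modified files first. The following is a minimal sketch using only the standard library; `newest_files` is a hypothetical helper, and the layout of ~/.psij is not assumed beyond it being a directory that may contain files.

```python
from pathlib import Path


def newest_files(directory: Path, limit: int = 10) -> list[Path]:
    """Return up to `limit` files under `directory`, most recently modified first."""
    if not directory.is_dir():
        return []
    files = [p for p in directory.rglob("*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


# ~/.psij only exists after PSI/J has run on this machine.
for path in newest_files(Path.home() / ".psij"):
    print(path)
```

The same function can be pointed at the remote working directory (~/covalent-workdir by default) to see what a failed job left behind.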
Release Notes
Release notes are available in the Changelog.
Credit
This plugin was developed by Andrew S. Rosen, building off of prior work by the Agnostiq team on the covalent-slurm-plugin.
If you use this plugin, be sure to cite Covalent as follows:

W. J. Cunningham, S. K. Radha, F. Hasan, J. Kanem, S. W. Neagle, and S. Sanand.
Covalent. Zenodo, 2022. https://doi.org/10.5281/zenodo.5903364

License
Covalent is licensed under the Apache 2.0 License. Covalent may be distributed under other licenses upon request. See the LICENSE file or contact the support team for more details.

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.
