Francesco Oteri

Introduction

As a biologist transitioning to computational biology, you are likely to use High-Performance Computing (HPC) clusters, which nowadays come equipped with powerful CPUs and GPUs. Using a cluster correctly is crucial for the success of your career. In this series of articles, we will go through the main concepts you need to know to deploy your pipelines on a cluster that uses Slurm as its resource manager.

In this first article, I will provide an overview of the HPC cluster architecture and explain how Slurm helps you interact with it.

What is an HPC Cluster?

Even though you don’t necessarily need to know the details of the architecture of an HPC cluster to be a successful computational biologist, knowing the big picture will be quite useful to get the most out of these systems and to troubleshoot some of the issues you may encounter.

An HPC cluster is a system composed of interconnected computers, known as nodes. It is made up of three main kinds of nodes: the frontend, computing nodes, and storage nodes.

Frontend, computing nodes, and storage nodes are connected together by a network. You can think of it as the Wi-Fi router at your home, just faster. How the nodes are interconnected is another important concept you need to grasp to improve the performance of your pipelines.

Schema showing the interplay between frontend, computing nodes and storage nodes.

These three different types of nodes interact to perform complex calculations and process large datasets at high speeds. Slurm is part of the backbone of a modern HPC cluster and is crucial for managing shared resources. It lies between the operating system and the user (i.e., you!), abstracting away the management and dispatching of computations and ensuring that resources are allocated efficiently and fairly. It also allows you to monitor and cancel running jobs.
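
If you want a first taste of these commands, the three below are the ones you will use most often to inspect the cluster and your jobs (partition names and output will of course differ on your cluster):

# List the partitions (queues) and the state of their nodes
sinfo
# List your own jobs currently queued or running
squeue -u $USER
# Cancel a job you no longer need (replace 12345 with your job ID)
scancel 12345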

Details

In a computational biology setting, tasks often involve processing large datasets, running complex simulations, or performing extensive data analysis. These tasks can be resource-intensive, requiring significant computational power and time. Here's where a queue management system like Slurm becomes invaluable:

  1. Resource Allocation: Slurm ensures that computational resources are distributed fairly among users. This prevents any single user from monopolizing the cluster and ensures equitable access.
  2. Job Prioritization: Not all tasks are created equal. Some jobs may be more urgent than others. Slurm allows jobs to be prioritized based on criteria defined by the system administrator, such as user- or project-defined importance, submission time, and resource requirements.
  3. Optimized Utilization: By managing job queues and scheduling tasks efficiently, Slurm maximizes the utilization of cluster resources, reducing idle times and increasing overall productivity.
  4. Scalability: As the number of users and computational tasks grows, manual resource management becomes impractical. Slurm's automated scheduling and resource management capabilities scale effortlessly with the increasing demand.

How Does Slurm Work?

At its core, Slurm works by accepting job submissions from users, placing them in a queue, and then scheduling these jobs based on available resources and predefined policies. Here's a high-level overview of the process:

  1. Job Submission: Users submit jobs to Slurm using a set of commands. Each job includes information about the required resources (e.g., number of CPUs, memory) and any specific constraints (e.g., time limits, node specifications).
  2. Job Queuing: Once submitted, jobs are placed in a queue. Slurm evaluates the queue and determines the order in which jobs should be executed based on priority, resource availability, and other scheduling policies.
  3. Resource Allocation: When the required resources become available, Slurm allocates them to the job. This can involve assigning specific nodes, CPUs, or GPUs to the task.
  4. Job Execution: The job is executed on the allocated nodes. Slurm monitors the job's progress and manages its execution, ensuring that it runs efficiently.
  5. Job Completion: Once the job is completed, Slurm releases the resources, making them available for other tasks in the queue.

These steps are performed transparently, and you can access the resources with a handful of simple commands. In the next few articles, I will demystify the most commonly used commands and introduce some advanced usage to make your life easier and more productive.
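
As a preview, a minimal job script might look like the sketch below; the resource values and the script name my_analysis.py are placeholders, so adjust them to your cluster and pipeline:

#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00

# Run the analysis with the allocated resources
python3 my_analysis.py

You would save this as, for example, my_job.sh and submit it with sbatch my_job.sh; Slurm takes care of the rest.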

The diagram shows the interaction flow between the user and Slurm.

Practical Considerations

When transitioning from traditional benchwork to computational biology, understanding and leveraging tools like Slurm makes your workflows more effective. Here are a few practical considerations to keep in mind:

  1. Understanding Resource Needs: Before submitting jobs, it's important to have a clear understanding of your resource requirements. This includes the amount of CPU, memory, and storage your task will need.
  2. Job Dependencies: Some tasks may depend on the completion of others. Slurm allows you to specify job dependencies, ensuring that jobs are executed in the correct order (see the sketch after this list).
  3. Efficient Coding: Writing efficient code and optimizing your scripts can significantly reduce resource usage and execution time, making it easier to manage large-scale computations.
  4. Collaboration and Fair Use: Be mindful of other users on the cluster. Collaboration and communication can help ensure that resources are used efficiently and fairly.
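
As mentioned in point 2, jobs can be chained through dependencies. A minimal sketch, assuming align.sh and analyse.sh are your own job scripts:

# Submit the first job and capture its job ID
JOBID=$(sbatch --parsable align.sh)
# The second job starts only if the first one finishes successfully
sbatch --dependency=afterok:$JOBID analyse.sh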

In the next articles, I will get into the details of the most important commands and, using practical examples, show you how to use Slurm efficiently.

Introduction

AlphaFold2

AlphaFold2 (AF2) is a deep learning-based system developed by DeepMind to predict the 3D structure of proteins from their amino acid sequences. The algorithm uses a Multiple Sequence Alignment (MSA) to determine the interactions between distant amino acids and infer the protein’s 3D structure (https://www.nature.com/articles/s41586-021-03819-2). The MSA information is paired with a transformer neural network, a type of model known for its success in capturing context in natural language processing tasks. AF2 can also use information extracted from structural templates to refine the structure. Even though it was primarily developed to predict monomers, it has since been adapted to predict multimers (https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1 and https://www.nature.com/articles/s41467-022-29394-2).

Why serverless AlphaFold2?

Even though the AF2 code is freely available, as with the majority of deep learning models you need a powerful GPU to get the best out of it. If you don’t have access to a GPU cluster but still want to perform a prediction, you can use Google Colab to run AF2, taking advantage of ColabFold (https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb).

Even if Google Colab is an incredibly valuable tool for many projects, it does have some limitations:

  1. Resource Limitations: After a certain period, your session will time out, and you'll lose all the data stored in the instance's RAM. Furthermore, you may experience slower processing times if the servers are under heavy use.
  2. Storage Limitations: Colab does not offer permanent storage. You can connect it with Google Drive, but the maximum storage space depends on your Google Drive account, which might not be enough for larger datasets.
  3. Privacy Concerns: As your code is run on Google's servers, it may not be suitable for sensitive, confidential, or proprietary data.
  4. Limited Offline Capabilities: Colab notebooks are designed to be used online and have limited offline capabilities.

There are also time restrictions, because a single session can run for a maximum of 12 hours. This is not usually an issue for AF2, since each prediction takes at most a couple of minutes, or a bit longer if you have to build the MSA from scratch.

To overcome these limitations, I delved into serverless computing. In serverless computing, the cloud provider manages the infrastructure and automatically allocates computing resources as needed to execute and scale applications. This model is often used for event-driven computing, such as microservices, APIs, and mobile and web applications. You pay only when you use the resources.

The purpose of this tutorial is to enable you to submit a prediction to RunPod workers with a command as simple as:

python3 launch.py --msa your_msa.fasta --output your_structure.pdb --endpointId your_endpoint

While the meaning of the options --msa and --output is quite clear, you will learn the meaning of --endpointId in a few minutes.

Refactoring AF2

Before delving into the main topic, let’s briefly discuss the need to refactor AF2. The official implementation, provided by DeepMind, follows a specific workflow: it takes a sequence as input, generates a Multiple Sequence Alignment (MSA), and then utilizes a deep learning model to make predictions. However, this structure is not well suited to serverless computing.

Firstly, acquiring the sequences to build the MSA involves scanning multiple databases, which must be available locally. These databases can reach up to 4 terabytes of data, and this sheer size makes hosting them on cloud platforms impractical.

Secondly, this approach lacks flexibility because it doesn’t provide an option to supply your own custom MSA. This limitation can be a hindrance when you have specific MSA requirements or preferences.

We therefore need to decouple the MSA generation from the structure prediction. Additionally, I wanted to enhance the tool’s usability by enabling it to read multiple file formats, including the default a3m format, the Stockholm format, and FASTA files (I hate converting files just to run a tool).

To achieve this, I began by examining the repository maintained by Sergey Ovchinnikov, available at https://github.com/sokrypton/alphafold/. If you’re considering working on AF2 or similar projects, I highly recommend starting with this optimized repository rather than the one provided by DeepMind. For additional insights into the optimization strategy employed, you can refer to this tweet by Sergey Ovchinnikov: https://twitter.com/sokrypton/status/1535941266579640320.

From there, I further refactored the code, removing redundant checks on the MSA and restructuring the Python code so that it can easily be imported as a module. You can find the resulting code in my GitHub repository at https://github.com/oteri/alphafold_MSA.

RunPod platform description

There are several providers of GPUs in the cloud, but very few offer serverless GPUs (check this article if you want a list: https://fullstackdeeplearning.com/cloud-gpus/). I’ve chosen RunPod (https://www.runpod.io) for this tutorial because it has a simple API and rich documentation with several examples to use as inspiration.

RunPod allows you to use a GPU either on a VM or in a serverless fashion. To use the serverless option, you must create a template and an endpoint.

A template, in RunPod terminology, is a Docker container. To define a template, you must supply the Docker image, the amount of disk needed for each container, and optionally, the Docker command to run it.

An endpoint represents the actual set of workers. You can set the minimum and maximum number of workers, the autoscaling policy, and optionally, a shared volume (which gets automatically mounted at /runpod-volume). This last option is particularly valuable for machine learning tasks because it can be used to store model weights that are loaded at runtime without the need to download them repeatedly and, in general, to share data.

Whenever a worker is instantiated, a new container based on the template is spun up. When an endpoint is invoked, either the job is queued on an already running worker or a new worker is instantiated. I suggest you read the latest documentation at https://docs.runpod.io/docs/overview

Now, let’s see how to practically set up RunPod to serve our purposes.

Setting up the system

Once logged in, you need to set up the system (Settings tab).

General settings for RunPod serverless platform

The most important setting is in the “API Keys” tab:

API key generation in RunPod serverless platform

Click on the “+ API Keys” button to create an API key. Copy it, because you will need it later on.
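
For convenience, you can keep the key in an environment variable. The scripts later in this article assume it is available as API_TOKEN; this variable name is my own convention, not something imposed by RunPod:

export API_TOKEN="your_runpod_api_key"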

Setting up a Network Storage

Since we don’t want to download the model weights each time a worker is instantiated, we need a network shared volume to store the weights.

To set up a network storage volume, go to Storage, then click on “+ Network Volume”.

Add a network volume for RunPod serverless platform

You will be presented with a dialog to set up the name, size, and deployment region of the volume:

Create a network volume for RunPod serverless platform

Setting up a Template

To set up a template, start by clicking on “Serverless”, then “Templates”, and finally on “New Template”.

Create a template for RunPod serverless platform

You will be presented with a dialog that enables you to set up your template:

Add an environment variable to RunPod serverless platform

The most important parameter is the container name. It is actually the name of the image used for deploying a worker. For this tutorial, we will be using the image francescooteri/af2_msa_serverless:latest (you can find the code in my GitHub account: https://github.com/oteri/alphafold_serverless). The image installs alphafold_MSA, downloads the models from Google's servers, and stores them on the network-attached disk (whose path is specified by the environment variable PARAM_DIR). The default command is used to run a prediction; you need to set it here only if you want to override the one defined in the container. We also set the variable WORKDIR, which specifies where the system should put the data used at prediction time.
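
If you want to sanity-check the image outside RunPod, a rough local test could look like the sketch below. This is only an assumption of how you might run it: it requires Docker with GPU support, enough disk space for the model weights, and a local folder mounted at /runpod-volume to mimic the network volume:

docker run --rm --gpus all \
  -e PARAM_DIR=/runpod-volume \
  -e WORKDIR=/tmp \
  -v "$(pwd)/runpod-volume:/runpod-volume" \
  francescooteri/af2_msa_serverless:latest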

Setting up an Endpoint

This is the procedure to set up a serverless worker on RunPod:

  1. Setting up the workers (you need at least €10 in credit to create one). Go to “My Endpoints” and click on “New Endpoint”.
Setting up an Endpoint for RunPod serverless platform

2. Configuring the endpoint.

Configuring the endpoint for RunPod serverless platform

Several parameters are available:

The parameters “Min provisioned worker”, “Max provisioned worker”, and ”Idle timeout” control how reactive the system is to new requests. Their values are a compromise between cost savings and responsiveness of the app. If you want to save as much as possible, set “Min provisioned worker” to 0 and ”Idle timeout” as low as possible, so that the system shuts down when it is not used. If you want to increase responsiveness, set “Min provisioned worker” greater than 0 (so that a worker is always ready to serve a request), or increase ”Idle timeout” so that a worker doesn’t shut down immediately after the last usage. For debugging purposes, though, you want to set both Min and Max to 1. This causes the worker to never sleep, so you can ssh into it and debug issues.

  3. Once the endpoint is created, copy its endpoint ID (in this case x61hmm6op9q92r); we will need it to submit our job.
Get the endpoint ID from the serverless platform RunPod

Let’s dive into the code

So far, you have blindly used what has been provided to you. Now, let’s go into the details.

When an HTTP request hits the endpoint, RunPod spins up a container and the default command is run. The basic requirement for such a command to be compatible with the RunPod infrastructure is to invoke the function runpod.serverless.start (the package runpod must be installed in the image used for the template):

import runpod

def handler(event):
    # Do something with event["input"]
    result = {"message": "done"}
    # Return a JSON-serializable result
    return result

runpod.serverless.start({"handler": handler})

The file /app/handler.py contains the script that performs the actual prediction. Here I show a simplified version of the script:

import json
import os
import tempfile

import runpod

WORKDIR = os.environ.get("WORKDIR", "/data/")

def handler(event):
    # Create a dedicated working directory for this job
    job_dir = tempfile.mkdtemp(dir=WORKDIR)
    # Save the incoming MSA (event["input"]["msa"]) to <job_dir>/msa.fasta
    msa_file_path = os.path.join(job_dir, "msa.fasta")
    with open(msa_file_path, "w") as file:
        file.write(event["input"]["msa"])
    # run_prediction is provided by the refactored alphafold_MSA code
    run_prediction(precomputed_msa=msa_file_path, output_dir=job_dir, data_dir="/data/")
    # Read the output file.
    # RunPod doesn't allow returning more than 2 MB, so only the best result is returned.
    # If you want to store more, use a bucket and return the object id instead.
    output_file_path = os.path.join(job_dir, "msa/ranked_0.pdb")
    with open(output_file_path, "r") as file:
        output_content = file.read()
    return json.dumps({"structure": output_content})

runpod.serverless.start({"handler": handler})

In our scenario, we also need to verify whether the model weights already exist on the shared volume. This check is performed each time a worker is created. The worker checks whether the parameters are available in a pre-defined folder ($PARAM_DIR/params, located on the network volume) and downloads them only if they are not already present. This is the resulting bash script, run as the default command (start.sh). It handles the verification and then executes the actual handler once the check is completed:

#!/bin/bash
set -x
date

# Fall back to /data/ if PARAM_DIR is unset or empty
if [ -z "$PARAM_DIR" ]; then
    PARAM_DIR="/data/"
fi

# Download the model weights only if they are not already present
if [ ! -d "$PARAM_DIR/params" ]; then
    mkdir -p "$PARAM_DIR/params"
    echo "Directory $PARAM_DIR/params created."
    cd "$PARAM_DIR/params"
    wget -qO- https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar | tar xf - --no-same-owner
fi

cd /app/
export PYTHONUNBUFFERED=1

export LD_LIBRARY_PATH=/app/env/lib/:${LD_LIBRARY_PATH}
micromamba run -p /app/env/ python3 -u /app/handler.py

Submission of a Prediction

At the beginning, I promised to make submitting a prediction as easy as:

python3 launch.py --msa your_msa.fasta --output your_structure.pdb --endpointId your_endpoint

When it comes to submitting requests to the endpoint, the process is straightforward. When an endpoint is created, it automatically exposes an API that you can use for communication. The endpoint address is https://api.runpod.ai/v2/endpointId/task. In our example, the endpointId is x61hmm6op9q92r and the task part of the path specifies the function you want the endpoint to perform. There are various available actions; you can refer to the documentation for more details.
In this tutorial, our main focus revolves around 'run' and 'status'. The 'run' action allows you to transmit data to the endpoint, initiating the computation process. The system autonomously manages worker startup, and the newly created job joins the queue of one of the active workers. After you've submitted the job, you'll receive a job_id, which allows you to monitor the job's progress and retrieve the results.

Regarding authentication, the API uses the Bearer scheme: every request must include your API key (the API_TOKEN created earlier) in the Authorization header.
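
If you prefer to test the endpoint from the command line before touching any Python, the same request can be sent with curl. This is just a sketch: it assumes your API key is exported as API_TOKEN and that msa.json is a file you have prepared with a body of the form {"input": {"msa": "..."}}:

curl -X POST "https://api.runpod.ai/v2/x61hmm6op9q92r/run" \
  -H "accept: application/json" \
  -H "content-type: application/json" \
  -H "authorization: Bearer $API_TOKEN" \
  -d @msa.json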

Keeping this in mind, submitting a job is as easy as this:

# Read the content of the MSA file
with open(args.msa, "r") as file:
    msa_content = file.read()

payload = {
    "input": {
        "msa": msa_content,
    }
}

# The 'run' action of the endpoint
url = f"https://api.runpod.ai/v2/{args.endpointId}/run"

headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": f"Bearer {API_TOKEN}",
}

logger.info("Submitting job")
response = requests.post(url, json=payload, headers=headers)

As usual, response.status_code == 200 means that the job has been correctly submitted. We can now extract the job_id:

response_dict = json.loads(response.text)
job_id = response_dict["id"]

and use it to poll the server until the prediction is completed:

url = f"https://api.runpod.ai/v2/{args.endpointId}/status/{job_id}"
headers = {
    "accept": "application/json",
    "authorization": f"Bearer {API_TOKEN}",
}
response = requests.get(url, headers=headers)
response_dict = json.loads(response.text)
status = response_dict["status"]

status can be one of IN_QUEUE, IN_PROGRESS, FAILED, or COMPLETED.

When the job completes correctly, we get status == "COMPLETED" and the result is in response_dict["output"]:

if status == "COMPLETED":
    output_file = f"{job_id}.pdb" if args.output is None else args.output
    response_output_dict = json.loads(response_dict["output"])
    with open(output_file, "w") as file:
        file.write(response_output_dict["structure"])
    break  # exit the polling loop

You can find the complete code in launch.py in the accompanying repository (https://github.com/oteri/alphafold_serverless).

And now, submitting a job becomes as easy as:

python3 launch.py --msa your_msa.fasta --output your_structure.pdb --endpointId your_endpoint

Meanwhile, you can observe what is going on in the worker logs:

Observing the worker logs from the serverless platform RunPod

Submitting the job triggers the creation of a new worker if there are not enough workers running. Here, the first time the job is submitted, a new worker is instantiated, and a green box is shown when it runs.

Once the job starts, you get the job ID:

Submitting a job prediction to the serverless platform RunPod

And the resources are shown on the dashboard:

Observing the resources used by an endpoint on the serverless platform RunPod

The system logs are also available:

Observing the system log of an endpoint on the serverless platform RunPod

If you are debugging, the RunPod interface gives you a command that you can simply copy and paste to ssh into the worker. As long as the worker is active, the connection stays open.

Getting the ssh command to connect to a running job on the serverless platform RunPod

And voilà, once the job is completed, you will find your predicted structure as a PDB file.

Getting an MSA

The last step is getting the MSA to use. If you are just testing, you can use a precomputed MSA taken from Pfam. In my tests I have used HATPase_c (PF02518) as stored on Pfam (http://pfam-legacy.xfam.org/family/PF02518). Approximately 100 PDB structures are available for this family. The full MSA contains 300,774 sequences, but I will be using the seed alignment, which contains only 658 sequences and can be downloaded directly as a FASTA file from the Pfam website. Remember to download the version with gaps represented as ‘-’.

Getting a MSA from PFAM
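
Once the seed alignment is on disk, a quick sanity check and submission could look like this; the file name PF02518_seed.fasta is just the name I assume you saved it under:

# Count the sequences in the downloaded alignment (should be 658 for the PF02518 seed)
grep -c ">" PF02518_seed.fasta
# Submit the prediction to your endpoint
python3 launch.py --msa PF02518_seed.fasta --output PF02518.pdb --endpointId x61hmm6op9q92r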

Limitations