In the enterprise, multi-architecture is the reality
Don't get stuck with an overly prescriptive vendor
In almost every large company MemVerge works with, there is some form of multi-architecture infrastructure:
Managed services (like AWS Batch or Google Cloud Batch) for standard containerized workloads
Traditional HPC schedulers (like Slurm) for specialized computing
Kubernetes for cloud-native and high-throughput applications
Much of this comes down to people.
For example, many pharmaceutical enterprises hire from academia; researchers trained on Slurm prefer running it on AWS ParallelCluster over, say, AWS Batch.
Another example: engineers working on cloud-native services or AI model deployment favor containerization and will therefore orchestrate their workloads on Kubernetes.
Enterprise infra teams seek the best available technologies for their use cases, so long as the technology meets the company's compliance standards and integrates into the organization's stack. In practice, this often looks like custom-built technology with many interfaces across many teams.
Let’s look at how bioinformatics infra has evolved as an example:
Traditionally, WDL and CWL were the dominant workflow languages in bioinformatics. In recent years a third, Nextflow, has been heavily adopted by the younger generation of scientists and bioinformaticians, driven by a healthy open source community developing pipelines in the open (nf-core).
As a result, many large pharmas no longer support just one workflow language to orchestrate their bioinformatics pipelines on HPC; they support several. These workflow languages can live in the cloud or on local HPC, further diversifying the tech stack needed for computational biology.
This can be particularly challenging because:
Different technologies have different design intentions (and therefore limitations)
Yet teams have entrenched technology preferences (it’s hard to teach an old dog new tricks)
For example, Slurm was never designed for training, inference, and fine-tuning, but forcing researchers and developers to reskill and learn how to use containers would threaten productivity.
Similarly, many organizations want to migrate jobs to the cloud for cost and elasticity reasons, but engineering leaders who are used to operating on-prem may view a cloud migration as expensive and difficult.
These dynamics can make it difficult to work with vendors who support only one type of technology, or who are slow to reach feature parity across the different parts of the infra stack (e.g. AWS Batch, Slurm, and Kubernetes).
But often this is the reality of enterprise infrastructure.
MemVerge has a customer who uses our spot surfing technology for cost savings on CPU workloads, leveraging spot instances safely across their genomics workloads without losing work to interruptions.
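To make that concrete: AWS gives spot instances a roughly two-minute warning before reclaiming them, exposed through the instance metadata service. Below is a minimal sketch (not MemVerge's actual implementation) of a watcher that polls for that warning and triggers a checkpoint; the `checkpoint_job` hook is a hypothetical stand-in for whatever checkpoint mechanism you use.

```python
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"

def imds_token(ttl: int = 21600) -> str:
    # IMDSv2: fetch a session token before reading any metadata
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl)},
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()

def interruption_pending(token: str) -> bool:
    # /spot/instance-action returns 404 until AWS schedules a reclaim,
    # then 200 with a JSON body during the ~2-minute warning window
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False
        raise

def checkpoint_job() -> None:
    # Hypothetical hook: replace with your checkpoint mechanism
    # (application-level snapshotting, a CRIU-style tool, etc.)
    print("interruption notice received, checkpointing now")

if __name__ == "__main__":
    token = imds_token()
    while True:
        if interruption_pending(token):
            checkpoint_job()
            break
        time.sleep(5)
```

The metadata endpoints shown are AWS's documented spot-interruption interface; everything around them is illustrative scaffolding.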
Unlike the Nextflow vendor mentioned above, we support multiple workflow languages (WDL and CWL) and multiple orchestrators (miniwdl and Cromwell in addition to Nextflow). We went to great lengths to support these different languages because we believe the product should mirror the reality of enterprise biotech and pharma customers.
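As a rough illustration of what meeting teams where they are looks like in practice, here is a sketch of a dispatcher that routes a workflow to the right engine based on its language. The CLI invocations (`nextflow run`, `miniwdl run`, `cwltool`, Cromwell as a JAR) are the engines' standard entry points, but the surrounding structure and file paths are ours for illustration, not a real MemVerge interface.

```python
import subprocess
from pathlib import Path

# Map workflow file extensions to their usual launch commands.
LAUNCHERS = {
    ".nf":  lambda wf: ["nextflow", "run", str(wf)],
    ".wdl": lambda wf: ["miniwdl", "run", str(wf)],
    ".cwl": lambda wf: ["cwltool", str(wf)],
}

def run_workflow(path: str, use_cromwell: bool = False) -> int:
    wf = Path(path)
    if wf.suffix == ".wdl" and use_cromwell:
        # Cromwell is an alternative WDL engine, launched as a JAR
        # ("cromwell.jar" here is a placeholder path)
        cmd = ["java", "-jar", "cromwell.jar", "run", str(wf)]
    else:
        try:
            cmd = LAUNCHERS[wf.suffix](wf)
        except KeyError:
            raise ValueError(f"no engine registered for {wf.suffix!r}")
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    run_workflow("main.nf")        # Nextflow pipeline
    run_workflow("wf.wdl")         # WDL via miniwdl
    run_workflow("wf.wdl", True)   # same WDL via Cromwell
```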
With the AI boom of recent years, the same biotech customers are now expanding into GPU spot surfing, which is why we’ve invested in letting customers checkpoint on both CPUs and GPUs.
Realizing customers want feature parity across different clouds, we built support for checkpointing ARM workloads on both AWS Graviton and GCP. And on the GPU front, we are one of the few vendors to checkpoint both NVIDIA and AMD GPUs.
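To ground what process-level checkpointing means, here is a minimal sketch using CRIU, the open source checkpoint/restore tool for Linux. This illustrates the general technique only; it is not MemVerge's implementation, and transparently checkpointing GPU state requires vendor-specific machinery beyond what CRIU alone provides. The PID and image directory are hypothetical.

```python
import subprocess
from pathlib import Path

def checkpoint(pid: int, image_dir: str) -> None:
    """Dump a running process tree to disk with CRIU (requires root)."""
    Path(image_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["criu", "dump",
         "--tree", str(pid),          # root of the process tree to dump
         "--images-dir", image_dir,   # where image files are written
         "--shell-job",               # process was started from a shell
         "--leave-running"],          # keep the original process alive
        check=True,
    )

def restore(image_dir: str) -> None:
    """Recreate the process tree from a CRIU image directory."""
    subprocess.run(
        ["criu", "restore",
         "--images-dir", image_dir,
         "--shell-job"],
        check=True,
    )

if __name__ == "__main__":
    checkpoint(pid=12345, image_dir="/tmp/ckpt")  # hypothetical PID
    # ...later, possibly on a fresh spot instance with the same image dir:
    restore("/tmp/ckpt")
```

Pair this with an interruption watcher like the one earlier, and you have the skeleton of spot surfing: dump on the two-minute warning, restore on a replacement instance.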
In today’s environment where enterprises hate vendor lock-in and software stacks are complex, you need a partner who can act as a true platform and build tooling that’s compatible with your entire stack, not just one opinionated part of it.