Training

Senior Training Platform Engineer

Build the platform layer for serious training workloads: Kubernetes, Slurm, Ray, multi-node jobs, observability, scheduling, reliability, and customer workflows.

Apply via build@jarvislabs.ai

Jarvis Labs already lets customers launch VMs and containers. The next step is to make serious training workloads easy: Kubernetes as a service, Slurm clusters, Ray clusters, multi-node training, reliable job execution, observability, and customer workflows that do not break under real load.

We are looking for senior engineers who can build and operate the platform layer for distributed AI training. This role sits at the intersection of distributed systems, GPU clusters, orchestration, storage, networking, reliability, and customer-facing AI workloads.

What You Will Own

  • Build Jarvis Labs training cluster products across Kubernetes, Slurm, Ray, and related orchestration systems.
  • Make multi-node GPU training reliable, debuggable, and easy enough for serious customers to adopt.
  • Design cluster lifecycle workflows: provisioning, scaling, node health, scheduling, isolation, job submission, teardown, and recovery.
  • Debug failures across NCCL, InfiniBand/RDMA, storage, networking, drivers, containers, schedulers, checkpointing, and distributed training frameworks.
  • Improve observability for jobs and clusters: GPU utilization, network issues, slow nodes, failed jobs, storage bottlenecks, scheduler behavior, and wasted compute.
  • Build reference architectures and examples for training LLMs and other large models across multiple servers (see the multi-node sketch after this list).
  • Work with Forward Deployed AI Engineers to understand customer training workloads and convert repeated problems into platform improvements.
  • Own production reliability as part of building the product, not as an afterthought.
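
To make the "reference architectures and examples" point concrete, here is the kind of minimal multi-node sketch this role would own and harden for customers. It is a generic PyTorch DDP skeleton launched with torchrun, not a Jarvis Labs product API; the model, batch size, and training loop are placeholders.

```python
# Minimal multi-node data-parallel training sketch (illustrative only).
# Launch one torchrun per node, e.g.:
#   torchrun --nnodes=<N> --nproc_per_node=<GPUs per node> \
#            --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker;
    # NCCL handles cross-node gradient all-reduce.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        x = torch.randn(32, 1024, device=local_rank)  # stand-in for a real data loader
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()  # gradients are synchronized across all nodes here
        optimizer.step()
        if dist.get_rank() == 0 and step % 10 == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The platform work in this role is everything around a script like this: scheduling it, keeping it alive across node failures, and making its behavior observable.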

What We Are Looking For

  • Strong distributed systems, platform, backend, HPC, Kubernetes, or product engineering experience.
  • Experience building or operating production systems where reliability, automation, and debugging mattered.
  • Strong Linux systems fundamentals, including containers, networking, storage, process isolation, and automation.
  • Ability to reason about long-running jobs, retries, failure recovery, state, checkpointing, and reproducibility.
  • High agency and comfort working in a flat team.
  • Ability to use AI tools responsibly while owning the final technical judgment.

Strong Pluses

  • Hands-on experience with Kubernetes, Slurm, Ray, Argo, Kubeflow, Volcano, Kueue, or similar schedulers/orchestrators.
  • Experience with NCCL, PyTorch DDP/FSDP, DeepSpeed, Megatron-LM, checkpointing, data loading, or multi-node training.
  • Experience with InfiniBand, RDMA, GPUDirect, RoCE, NVLink/NVSwitch, topology-aware scheduling, or GPU cluster validation.
  • Experience building tools for researchers or ML engineers.
  • Prior work at an AI cloud, GPU cloud, HPC platform, ML platform, or frontier AI lab.

This Role Is Not For You If

  • You want to build features but avoid operations.
  • You are not interested in customer workflows or developer experience.
  • You need a manager to turn ambiguity into a task list.
  • You treat reliability, observability, and documentation as secondary work.
  • You are uncomfortable debugging across many layers of the stack.