Training Forward Deployed AI Engineer
Help customers run distributed training workloads successfully. Understand their models, data, clusters, and failure modes, and turn repeated problems into better product direction.
Apply via build@jarvislabs.ai

Jarvis Labs needs a customer-facing engineer who can make distributed training workloads successful. This is not generic support. It is a hands-on technical customer success role for someone who can understand AI training workloads, guide customers through cluster setup, debug failures, and convert repeated customer problems into better product direction.
You will work with customers who want to run multi-node training, fine-tuning, data processing, and long-running GPU workloads on Jarvis Labs, orchestrated with Kubernetes, Slurm, or Ray.
What You Will Own
- Own technical success for training-cluster customers from discovery to production.
- Understand the customer's training workload: framework, model size, data pipeline, cluster size, GPU type, networking needs, storage needs, checkpointing strategy, and reliability expectations.
- Recommend cluster architecture across Kubernetes, Slurm, Ray, containers, storage, networking, and observability.
- Help customers run pilots, validate cluster performance, debug jobs, and move toward production usage.
- Debug issues across NCCL, networking, storage, scheduler behavior, containers, drivers, distributed training frameworks, checkpointing, and job failures.
- Create reusable playbooks, reference architectures, examples, and troubleshooting guides.
- Bring high-signal product feedback to platform engineers: what customers are trying to train, what breaks, what is confusing, and what Jarvis Labs should improve.
- Support pre-sales technical evaluation and post-sales workload success, without owning revenue quota.
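Much of the debugging work above starts with triaging job logs into failure categories before digging into a specific layer. As a minimal sketch of how such a triage helper might look (the error signatures and category names here are hypothetical examples, not an actual Jarvis Labs playbook):

```python
import re

# Hypothetical log signatures mapped to failure categories. A real
# playbook would grow these patterns from actual customer incidents.
FAILURE_PATTERNS = {
    r"NCCL WARN|NCCL error": "nccl/communication",
    r"CUDA out of memory": "gpu-memory",
    r"No space left on device": "storage",
    r"Connection (refused|reset|timed out)": "networking",
    r"(TIMEOUT|NODE_FAIL)": "scheduler",
}

def classify_failure(log_text: str) -> list[str]:
    """Return sorted failure categories whose signatures appear in a job log."""
    return sorted(
        category
        for pattern, category in FAILURE_PATTERNS.items()
        if re.search(pattern, log_text)
    )
```

A helper like this is where "reusable playbooks" often begin: each debugged incident adds a signature, so the next occurrence is classified in seconds instead of rediscovered from scratch.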
What We Are Looking For
- Strong technical background in ML engineering, platform engineering, solutions architecture, or production training workloads.
- Practical understanding of distributed training or large-scale compute workloads.
- Comfortable with Kubernetes, Linux, containers, networking, storage, logs, metrics, and debugging.
- Ability to communicate clearly with ML engineers, platform teams, and customer stakeholders.
- High agency: you solve customer problems, synthesize patterns, and improve the system.
- Strong written communication for customer notes, internal feedback, runbooks, and examples.
Strong Pluses
- Experience with Slurm, Ray, Kubernetes, Argo, Kubeflow, Volcano, Kueue, or similar systems.
- Experience with PyTorch DDP/FSDP, DeepSpeed, Megatron-LM, NCCL, checkpointing, data loading, or multi-node training.
- Experience with InfiniBand/RDMA, GPU topology, storage bottlenecks, or HPC operations.
- Experience as an ML engineer who has actually trained or fine-tuned models on cloud GPU infrastructure.
- Ability to write scripts/tools to automate setup, validation, benchmarking, and debugging.
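As one concrete example of the validation scripting mentioned above: a common cluster acceptance check is comparing measured all-reduce bandwidth against what the fabric should deliver. A sketch using the bus-bandwidth convention from nccl-tests (the 0.8 acceptance tolerance is a hypothetical threshold, not a Jarvis Labs standard):

```python
def allreduce_bus_bandwidth_gbps(bytes_per_rank: int, seconds: float, num_ranks: int) -> float:
    """Bus bandwidth for ring all-reduce: algbw * 2*(n-1)/n, per the nccl-tests convention."""
    algo_bw = bytes_per_rank / seconds               # bytes/s processed per rank
    bus_bw = algo_bw * 2 * (num_ranks - 1) / num_ranks
    return bus_bw / 1e9                              # report in GB/s

def within_expected(measured_gbps: float, expected_gbps: float, tolerance: float = 0.8) -> bool:
    """Flag runs that fall below a fraction of the fabric's expected bandwidth."""
    return measured_gbps >= expected_gbps * tolerance
```

For example, 1 GB per rank reduced across 8 ranks in 10 ms yields 175 GB/s of bus bandwidth; a script like this turns that number into a pass/fail signal during pilot validation.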
This Role Is Not For You If
- You want a ticket-routing support role.
- You are uncomfortable debugging across product and ML layers.
- You cannot turn customer pain into product feedback.
- You avoid writing clear technical docs or runbooks.
- You need every customer engagement to be pre-scoped before you start.