Inference

Senior Inference Platform Engineer

Build and evolve Jarvis Labs' inference-as-a-service platform for open and custom AI models. Own performance, reliability, runtime integrations, benchmarks, APIs, and production behavior.

Apply via build@jarvislabs.ai

Jarvis Labs is building a modern inference-as-a-service platform for open and custom AI models. Customers should be able to choose a model, pick a serving runtime such as vLLM, SGLang, or Ollama, tune the right configuration, and deploy a highly optimized production endpoint without fighting infrastructure.

We are looking for senior engineers who can build this product like owners. This is not a narrow backend role. You will work across distributed systems, model serving, GPU performance, reliability, APIs, benchmarks, customer workload patterns, and production operations.

What You Will Own

  • Build and evolve Jarvis Labs' inference-as-a-service platform from an early internal product into a state-of-the-art customer-facing platform.
  • Integrate and optimize inference runtimes such as vLLM, SGLang, Ollama, TensorRT-LLM, and Triton, and adopt future serving stacks as they become relevant.
  • Improve throughput, latency, time-to-first-token, cost per token, GPU utilization, autoscaling behavior, and production reliability.
  • Design configuration surfaces that let customers choose models, precision, quantization, batching, context length, replicas, GPU type, and runtime-specific parameters without unnecessary complexity.
  • Build benchmarking and regression systems for common models and workloads: Gemma, Qwen, Llama, multimodal models, diffusion/video models, and customer-specific deployments.
  • Debug hard serving issues across GPU memory, KV cache behavior, speculative decoding, batching, networking, storage, container startup, runtime bugs, and customer traffic patterns.
  • Work with Forward Deployed AI Engineers when customers need deep inference optimization or production debugging.
  • Write the docs, examples, runbooks, and technical notes needed for customers and internal teams to trust what you build.

What We Are Looking For

  • Strong distributed systems, backend, platform, or infrastructure engineering experience.
  • Proven ownership of production systems where performance and reliability mattered.
  • Strong debugging instincts under ambiguity.
  • Practical understanding of Linux, containers, networking, observability, and production operations.
  • Ability to learn AI inference deeply if you have not already worked in AI products.
  • Comfort using AI coding tools like Claude Code or Codex while remaining responsible for correctness and quality.
  • High agency in a flat team: you do not wait to be assigned tickets before making progress.

Strong Pluses

  • Experience with vLLM, SGLang, TensorRT-LLM, Triton, Ollama, TGI, llama.cpp, or similar inference stacks.
  • Understanding of KV cache, speculative decoding, prefix caching, continuous batching, quantization, parallelism, and GPU memory tradeoffs.
  • Experience with Kubernetes, GPU scheduling, model gateways, inference APIs, or multi-tenant serving.
  • Prior work at an AI cloud, ML platform, model-serving company, or high-scale backend platform.
  • Experience publishing benchmarks, technical docs, or reference deployments.

This Role Is Not For You If

  • You only want to implement pre-scoped tickets.
  • You are uncomfortable owning production reliability.
  • You think docs, benchmarks, and customer usability are someone else's job.
  • You use AI tools to generate code but do not deeply review or test the output.
  • You want to stay in a narrow backend lane and avoid product or customer context.