Senior Inference Platform Engineer
Build and evolve Jarvis Labs' inference-as-a-service platform for open and custom AI models. Own performance, reliability, runtime integrations, benchmarks, APIs, and production behavior.
Apply via build@jarvislabs.ai

Jarvis Labs is building a modern inference-as-a-service platform for open and custom AI models. Customers should be able to choose a model, choose a serving runtime such as vLLM, SGLang, Ollama, or similar systems, tune the right configuration, and deploy a highly optimized production endpoint without fighting infrastructure.
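To make that concrete, here is a rough sketch of the customer experience we are aiming for. Everything below is illustrative, not a published Jarvis Labs API: the spec type, field names, and model/GPU identifiers are assumptions for the sake of the example.

```python
# Hypothetical sketch of the target customer experience.
# The EndpointSpec type and its fields are illustrative, not a real API.
from dataclasses import dataclass

@dataclass
class EndpointSpec:
    model: str                      # open-weights or custom model identifier
    runtime: str                    # "vllm", "sglang", "ollama", ...
    gpu_type: str                   # hardware the endpoint runs on
    replicas: int                   # autoscaling baseline
    quantization: str | None = None # e.g. "fp8"; None means native precision
    max_context_len: int = 8192

spec = EndpointSpec(
    model="meta-llama/Llama-3.1-8B-Instruct",
    runtime="vllm",
    gpu_type="A100-80GB",
    replicas=2,
    quantization="fp8",
)
# A single deploy call would turn this spec into a running,
# benchmarked, autoscaled production endpoint.
```

The point of the sketch is the shape of the problem: a small, legible configuration surface in front, and all of the runtime, scheduling, and performance complexity owned by the platform behind it.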
We are looking for senior engineers who can build this product like owners. This is not a narrow backend role. You will work across distributed systems, model serving, GPU performance, reliability, APIs, benchmarks, customer workload patterns, and production operations.
What You Will Own
- Build and evolve Jarvis Labs' inference-as-a-service from an early internal product to a state-of-the-art customer-facing platform.
- Integrate and optimize inference runtimes such as vLLM, SGLang, Ollama, TensorRT-LLM, Triton, and future serving stacks when they matter.
- Improve throughput, latency, time-to-first-token, cost per token, GPU utilization, autoscaling behavior, and production reliability.
- Design configuration surfaces that let customers choose models, precision, quantization, batching, context length, replicas, GPU type, and runtime-specific parameters without unnecessary complexity.
- Build benchmarking and regression systems for common models and workloads: Gemma, Qwen, Llama, multimodal models, diffusion/video models, and customer-specific deployments (a minimal measurement sketch follows this list).
- Debug hard serving issues across GPU memory, KV cache behavior, speculative decoding, batching, networking, storage, container startup, runtime bugs, and customer traffic patterns.
- Work with Forward Deployed AI Engineers when customers need deep inference optimization or production debugging.
- Write the docs, examples, runbooks, and technical notes needed for customers and internal teams to trust what you build.
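For a flavor of the benchmarking work, here is a minimal sketch that measures time-to-first-token and decode throughput against an OpenAI-compatible endpoint, which runtimes like vLLM and SGLang expose. The base URL and model name are placeholders, and counting streamed chunks is only an approximation of token count:

```python
import time
from openai import OpenAI  # pip install openai

# Point at any OpenAI-compatible serving runtime (vLLM, SGLang, ...).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def measure(prompt: str, model: str) -> tuple[float, float]:
    """Return (time_to_first_token_s, decode_tokens_per_s) for one request."""
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0  # each streamed chunk carries roughly one token
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_chunks += 1
    end = time.perf_counter()
    ttft = first_token_at - start
    decode_tps = (n_chunks - 1) / (end - first_token_at) if n_chunks > 1 else 0.0
    return ttft, decode_tps

ttft, tps = measure("Summarize KV cache in one sentence.", model="my-model")
print(f"TTFT: {ttft * 1000:.0f} ms, decode: {tps:.1f} tok/s")
```

The real regression systems go far beyond this single-request probe: load sweeps, concurrency curves, per-runtime configuration matrices, and alerting when a runtime upgrade quietly regresses cost per token.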
What We Are Looking For
- Strong distributed systems, backend, platform, or infrastructure engineering experience.
- Proven ownership of production systems where performance and reliability mattered.
- Strong debugging instincts under ambiguity.
- Practical understanding of Linux, containers, networking, observability, and production operations.
- Ability to learn AI inference deeply if you have not already worked in AI products.
- Comfort using AI coding tools like Claude Code or Codex while remaining responsible for correctness and quality.
- High agency in a flat team: you do not wait to be assigned tickets before making progress.
Strong Pluses
- Experience with vLLM, SGLang, TensorRT-LLM, Triton, Ollama, TGI, llama.cpp, or similar inference stacks.
- Understanding of KV cache, speculative decoding, prefix caching, continuous batching, quantization, parallelism, and GPU memory tradeoffs (see the sketch after this list).
- Experience with Kubernetes, GPU scheduling, model gateways, inference APIs, or multi-tenant serving.
- Prior work at an AI cloud, ML platform, model-serving company, or high-scale backend platform.
- Experience publishing benchmarks, technical docs, or reference deployments.
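As one concrete example of the GPU memory tradeoffs mentioned above: for a standard dense transformer, per-token KV cache size is 2 (keys and values) x layers x KV heads x head dim x bytes per element. A minimal sketch, using Llama-3.1-8B-like shapes purely as an illustration:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, dtype_bytes: int = 2) -> int:
    """Estimate KV cache size for a dense transformer.

    The factor of 2 accounts for storing both keys and values per layer.
    dtype_bytes=2 assumes fp16/bf16; a quantized KV cache shrinks this.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Llama-3.1-8B-like shapes: 32 layers, 8 KV heads (GQA), head_dim 128.
gib = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=16) / 2**30
print(f"~{gib:.1f} GiB of KV cache")  # ~16 GiB: why batch size, context
                                      # length, and GPU memory trade off
```

This back-of-the-envelope arithmetic is exactly the kind of reasoning the role demands daily: it determines how many concurrent requests fit on a GPU before the runtime starts evicting or rejecting work.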
This Role Is Not For You If
- You only want to implement pre-scoped tickets.
- You are uncomfortable owning production reliability.
- You think docs, benchmarks, and customer usability are someone else's job.
- You use AI tools to generate code but do not deeply review or test the output.
- You want to stay in a narrow backend lane and avoid product or customer context.