Senior Inference Platform Engineer
Build and evolve Jarvis Labs' inference-as-a-service platform for open and custom AI models. Own performance, reliability, runtime integrations, benchmarks, APIs, and production behavior.
Apply via build@jarvislabs.ai

Jarvis Labs is building a modern inference-as-a-service platform for open and custom AI models. Customers should be able to choose a model, choose a serving runtime such as vLLM, SGLang, Ollama, or similar systems, tune the right configuration, and deploy a highly optimized production endpoint without fighting infrastructure.
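To make that concrete, here is a rough sketch of the customer experience we are aiming for. Everything below is illustrative, not a published Jarvis Labs API: the spec type, field names, and model/GPU identifiers are assumptions for the sake of the example.

```python
# Hypothetical sketch of the target customer experience.
# The EndpointSpec type and its fields are illustrative, not a real API.
from dataclasses import dataclass

@dataclass
class EndpointSpec:
    model: str                      # open-weights or custom model identifier
    runtime: str                    # "vllm", "sglang", "ollama", ...
    gpu_type: str                   # hardware the endpoint runs on
    replicas: int                   # autoscaling baseline
    quantization: str | None = None # e.g. "fp8"; None means native precision
    max_context_len: int = 8192

spec = EndpointSpec(
    model="meta-llama/Llama-3.1-8B-Instruct",
    runtime="vllm",
    gpu_type="A100-80GB",
    replicas=2,
    quantization="fp8",
)
# A single deploy call would turn this spec into a running,
# benchmarked, autoscaled production endpoint.
```

The point of the sketch is the shape of the problem: a small, legible configuration surface in front, and all of the runtime, scheduling, and performance complexity owned by the platform behind it.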
We are looking for senior engineers who can build this product like owners. This is not a narrow backend role. You will work across distributed systems, model serving, GPU performance, reliability, APIs, benchmarks, customer workload patterns, and production operations.
What You Will Own
- Build and evolve Jarvis Labs' inference-as-a-service from an early internal product to a state-of-the-art customer-facing platform.
- Integrate and optimize inference runtimes such as vLLM, SGLang, Ollama, TensorRT-LLM, Triton, and future serving stacks when they matter.
- Improve throughput, latency, time-to-first-token, cost per token, GPU utilization, autoscaling behavior, and production reliability.
- Design configuration surfaces that let customers choose models, precision, quantization, batching, context length, replicas, GPU type, and runtime-specific parameters without unnecessary complexity.
- Build benchmarking and regression systems for common models and workloads: Gemma, Qwen, Llama, multimodal models, diffusion/video models, and customer-specific deployments (a minimal measurement sketch follows this list).
- Debug hard serving issues across GPU memory, KV cache behavior, speculative decoding, batching, networking, storage, container startup, runtime bugs, and customer traffic patterns.
- Work with Forward Deployed AI Engineers when customers need deep inference optimization or production debugging.
- Write the docs, examples, runbooks, and technical notes needed for customers and internal teams to trust what you build.
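For a flavor of the benchmarking work, here is a minimal sketch that measures time-to-first-token and decode throughput against an OpenAI-compatible endpoint, which runtimes like vLLM and SGLang expose. The base URL and model name are placeholders, and counting streamed chunks is only an approximation of token count:

```python
import time
from openai import OpenAI  # pip install openai

# Point at any OpenAI-compatible serving runtime (vLLM, SGLang, ...).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def measure(prompt: str, model: str) -> tuple[float, float]:
    """Return (time_to_first_token_s, decode_tokens_per_s) for one request."""
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0  # each streamed chunk carries roughly one token
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_chunks += 1
    end = time.perf_counter()
    ttft = first_token_at - start
    decode_tps = (n_chunks - 1) / (end - first_token_at) if n_chunks > 1 else 0.0
    return ttft, decode_tps

ttft, tps = measure("Summarize KV cache in one sentence.", model="my-model")
print(f"TTFT: {ttft * 1000:.0f} ms, decode: {tps:.1f} tok/s")
```

The real regression systems go far beyond this single-request probe: load sweeps, concurrency curves, per-runtime configuration matrices, and alerting when a runtime upgrade quietly regresses cost per token.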
What We Are Looking For
- Strong distributed systems, backend, platform, or infrastructure engineering experience.
- Proven ownership of production systems where performance and reliability mattered.
- Strong debugging instincts under ambiguity.
- Practical understanding of Linux, containers, networking, observability, and production operations.
- Ability to learn AI inference deeply if you have not already worked in AI products.
- Comfort using AI coding tools like Claude Code or Codex while remaining responsible for correctness and quality.
- High agency in a flat team: you do not wait to be assigned tickets before making progress.
Strong Pluses
- Experience with vLLM, SGLang, TensorRT-LLM, Triton, Ollama, TGI, llama.cpp, or similar inference stacks.
- Understanding of KV cache, speculative decoding, prefix caching, continuous batching, quantization, parallelism, and GPU memory tradeoffs (see the sketch after this list).
- Experience with Kubernetes, GPU scheduling, model gateways, inference APIs, or multi-tenant serving.
- Prior work at an AI cloud, ML platform, model-serving company, or high-scale backend platform.
- Experience publishing benchmarks, technical docs, or reference deployments.
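As one concrete example of the GPU memory tradeoffs mentioned above: for a standard dense transformer, per-token KV cache size is 2 (keys and values) x layers x KV heads x head dim x bytes per element. A minimal sketch, using Llama-3.1-8B-like shapes purely as an illustration:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, dtype_bytes: int = 2) -> int:
    """Estimate KV cache size for a dense transformer.

    The factor of 2 accounts for storing both keys and values per layer.
    dtype_bytes=2 assumes fp16/bf16; a quantized KV cache shrinks this.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Llama-3.1-8B-like shapes: 32 layers, 8 KV heads (GQA), head_dim 128.
gib = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=16) / 2**30
print(f"~{gib:.1f} GiB of KV cache")  # ~16 GiB: why batch size, context
                                      # length, and GPU memory trade off
```

This back-of-the-envelope arithmetic is exactly the kind of reasoning the role demands daily: it determines how many concurrent requests fit on a GPU before the runtime starts evicting or rejecting work.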
This Role Is Not For You If
- You only want to implement pre-scoped tickets.
- You are uncomfortable owning production reliability.
- You think docs, benchmarks, and customer usability are someone else's job.
- You use AI tools to generate code but do not deeply review or test the output.
- You want to stay in a narrow backend lane and avoid product or customer context.