Inference Forward Deployed AI Engineer
Work with customers deploying real inference workloads. Help them choose runtimes, GPUs, precision, quantization, scaling patterns, and production deployment paths.
Apply via build@jarvislabs.ai

Jarvis Labs needs a customer-facing engineer who can make real inference workloads successful. This is not a support role. It is a hands-on technical customer success and product feedback role for someone who understands AI workloads, can debug production deployments, and can translate customer reality into better product direction.
You will work with customers deploying models such as Gemma, Qwen, Llama, multimodal models, and other open or custom models. You should be able to reason about runtime choice, precision, quantization, GPU type, latency, throughput, cost, reliability, and production deployment patterns.
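To give a feel for the back-of-envelope reasoning this involves, here is a minimal sketch in Python of how serving memory scales with precision, context length, and batch size. The constants are approximations, and real usage also includes runtime overhead, activation memory, and the CUDA context; treat it as an illustration, not a sizing tool.

```python
# Rough GPU-memory estimates for serving a decoder-only model.
# Approximate constants; real deployments add runtime overhead,
# activation memory, and CUDA context on top of these figures.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(params_billion: float, precision: str) -> float:
    """Weights only: parameter count (in billions) times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 (K and V) * layers * kv_heads * head_dim * tokens * batch."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# A Llama-7B-class model (32 layers, 32 KV heads, head_dim 128):
print(f"fp16 weights: {weight_gb(7, 'fp16'):.0f} GB")    # ~14 GB
print(f"int4 weights: {weight_gb(7, 'int4'):.1f} GB")    # ~3.5 GB
print(f"fp16 KV cache, 4k ctx, batch 8: "
      f"{kv_cache_gb(32, 32, 128, 4096, 8):.1f} GB")     # ~17.2 GB
```

The same arithmetic drives quantization and GPU recommendations: int4 weights fit a 7B model on a much smaller card, but the KV cache still grows linearly with context length and batch size.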
What You Will Own
- Own technical success for inference customers from discovery to production.
- Understand the customer's workload: model, traffic shape, latency needs, throughput, quality constraints, budget, scaling pattern, and operational requirements.
- Recommend serving runtimes, model precision, quantization strategy, GPU configuration, batching strategy, context length, and deployment architecture.
- Help customers run evaluations, pilots, benchmarks, migrations, and production launches.
- Debug deployment and performance issues across model serving, containers, Kubernetes, networking, GPU memory, runtime configuration, and observability (a first-pass check of the kind involved is sketched after this list).
- Build reusable playbooks, examples, reference architectures, and troubleshooting guides from customer work.
- Bring high-signal product feedback to platform engineers: what customers are trying to do, where they get stuck, and what Jarvis Labs should build or fix next.
- Support pre-sales technical evaluation and post-sales workload success, without owning a revenue quota.
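As one illustration of the debugging work above, a minimal first-pass sketch: snapshotting per-GPU memory and utilization before digging into runtime configuration. It assumes the NVIDIA driver and nvidia-smi are installed on the node; the query fields are standard nvidia-smi properties.

```python
# First-pass check when a customer reports OOM or degraded throughput:
# snapshot per-GPU memory and utilization before touching runtime config.
# Assumes the NVIDIA driver and nvidia-smi are installed on the node.
import subprocess

def gpu_snapshot() -> list[dict]:
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,name,memory.used,memory.total,utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    keys = ["index", "name", "mem_used_mib", "mem_total_mib", "util_pct"]
    return [dict(zip(keys, (v.strip() for v in line.split(","))))
            for line in out.strip().splitlines()]

for gpu in gpu_snapshot():
    print(gpu)
```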
What We Are Looking For
- Strong technical background in ML engineering, AI cloud, cloud infrastructure, solutions engineering, or production AI workloads.
- Practical understanding of inference tradeoffs: quality, latency, throughput, cost, GPU memory, precision, quantization, scaling, and reliability.
- Comfortable with Kubernetes, containers, Linux, APIs, logs, metrics, and debugging real systems.
- Excellent customer communication: you can explain hard technical tradeoffs clearly without handwaving.
- High agency: you do not merely route issues to engineers; you solve what you can and synthesize what the product team needs to know.
- Strong written communication for customer notes, internal product feedback, runbooks, and examples.
Strong Pluses
- Experience with vLLM, SGLang, Ollama, TensorRT-LLM, Triton, TGI, llama.cpp, or other model-serving systems.
- Experience helping customers deploy ML models or AI applications in production.
- Experience with GPU clouds, ML platforms, devtools, model APIs, or AI cloud startups.
- Ability to write code or scripts to automate deployments, benchmarks, troubleshooting, or customer workflows.
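As one example of the kind of script this covers, here is a minimal latency probe against an OpenAI-compatible completions endpoint (vLLM and SGLang both expose one). The URL and model name are placeholders for a customer's deployment.

```python
# Minimal latency probe for an OpenAI-compatible /v1/completions endpoint.
# URL and MODEL are placeholders; adjust for the customer's deployment.
import time, statistics, requests

URL = "http://localhost:8000/v1/completions"    # placeholder
MODEL = "meta-llama/Llama-3.1-8B-Instruct"      # placeholder

def time_request(prompt: str, max_tokens: int = 128) -> float:
    t0 = time.perf_counter()
    r = requests.post(URL, json={
        "model": MODEL, "prompt": prompt, "max_tokens": max_tokens,
    }, timeout=120)
    r.raise_for_status()
    return time.perf_counter() - t0

latencies = sorted(time_request("Summarize: GPUs are ...") for _ in range(20))
print(f"p50 {statistics.median(latencies):.2f}s  "
      f"p95 {latencies[int(0.95 * len(latencies)) - 1]:.2f}s")
```

A real benchmark would also sweep concurrency and measure time-to-first-token, but this is the shape of it.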
This Role Is Not For You If
- You want to be a traditional support engineer.
- You avoid ambiguous customer problems.
- You are uncomfortable going deep technically.
- You escalate everything instead of forming hypotheses and debugging.
- You do not want to write clear internal notes and product feedback.