Your role
Dialpad’s AI Engineering organization is responsible for building and maintaining customer-facing AI features at scale across all of our cloud-native products and services. Every day, millions of users worldwide leverage our technology to communicate effectively and efficiently.
Dialpad’s Agentic Runtime team owns the infrastructure and execution engine that runs AI agents at scale across Dialpad’s core product modalities, including voice, messaging, video, and digital engagement. From multi-step task orchestration and tool execution to real-time context management and agent memory, our team builds the foundational platform that powers Dialpad’s next-generation intelligent, autonomous experiences. Our teams are highly collaborative and comprise cross-disciplinary professionals, including Product Managers, QA Specialists, and Engineers specializing in Distributed Systems, ML Infrastructure, and Platform Engineering.
This position reports to the Engineering Manager, who is based in Kitchener, Canada. The role has the opportunity to be based in our Buenos Aires, Argentina office.
What you’ll do
- Contribute to the design, development, and maintenance of agentic runtime systems, including agent orchestration, tool execution pipelines, and multi-step reasoning loops.
- Build and optimize core runtime components, including task planners, action dispatchers, memory managers, and context window management systems.
- Work on agent coordination techniques, including dynamic tool selection, parallel agent execution, state management, and result aggregation across multi-agent workflows.
- Maintain and enhance highly scalable agentic platforms with a focus on low-latency execution, cost efficiency, and deterministic behavior.
- Ensure high availability, reliability, and fault tolerance in agent runtime services, including graceful degradation when LLM or tool calls fail.
- Collaborate with cross-functional teams — including ML researchers, product, and platform engineers — to translate agentic product requirements into robust runtime infrastructure.
- Develop and optimize real-time distributed systems, microservices, and event-driven architectures powering agentic task execution.
- Design and implement sandboxed execution environments so that agents can safely use tools, execute code, and call external APIs.
- Implement and maintain monitoring, alerting, and performance metrics covering agent run success rates, token consumption, latency, and cost attribution.
- Evaluate and integrate emerging agentic frameworks, LLM APIs, and tooling ecosystems to continuously improve platform capabilities.
- Write clean, modular, and well-tested code while following best engineering practices in a rapidly evolving problem space.
- Participate in code reviews to ensure the quality, maintainability, and scalability of runtime components.
- Provide mentorship and technical guidance to junior engineers navigating the unique challenges of agentic systems.
Skills you’ll bring
- 3–6 years of experience in distributed systems, platform engineering, or ML infrastructure, with exposure to LLM-based or agentic systems strongly preferred.
- Strong understanding of agent architectures, including ReAct, plan-and-execute, and multi-agent coordination patterns.
- Deep knowledge of context management, prompt lifecycle, tool-call protocols (e.g., function calling, MCP), and agent memory strategies (short-term, episodic, and long-term).
- Experience integrating and managing external tool ecosystems, including web search, code interpreters, databases, and third-party APIs.
- Familiarity with retrieval-augmented generation (RAG) and how retrieval fits into broader agentic pipelines.
- Understanding of LLM output reliability challenges — hallucination, non-determinism, and retry/fallback strategies at runtime.
- Proficiency in Go and Python 3 (experience with Rust or TypeScript is a plus).
- Strong understanding of distributed systems, microservices, and event-driven architectures suited to long-running agent tasks.
- Passion for real-time performance optimization, including streaming responses, async execution, and parallel tool invocation.
- Experience with API design using OpenAPI, Swagger, or equivalent, with an eye toward agentic interaction patterns.
- Knowledge of gRPC or equivalent RPC protocols for inter-service communication within agent runtimes.
- Experience with Docker and Kubernetes, including managing long-running or stateful agent workloads in containerized environments.
- Familiarity with cloud platforms (GCP preferred, AWS/Azure optional), including managed services relevant to agentic workloads such as queuing, secrets management, and compute autoscaling.
- Hands-on experience with Infrastructure as Code tools like Terraform or Ansible.
- Knowledge of CI/CD frameworks and continuous delivery practices, with comfort shipping infrastructure in a fast-moving, research-adjacent environment.