Incident Engineer

Netomi • Full-time • Gurugram

Why This Job is Featured on The SaaS Jobs

This Incident Engineer role stands out in SaaS because it sits at the point where a customer-facing platform meets real-time operational reliability. Netomi’s product context combines enterprise CX workflows with AI components and multiple integrations, which typically increases the surface area for incidents and makes disciplined response practices a core capability rather than a support function.

For a SaaS career, the work builds durable operating instincts: turning ambiguous production signals into structured triage, aligning teams around severity and service objectives, and converting post-incident learning into runbooks, alerting improvements, and better observability. Experience spanning APIs, cloud infrastructure, and integration-heavy environments translates well across modern SaaS, especially as more products incorporate LLM-related dependencies and require clear incident communication across technical and non-technical stakeholders.

This is best suited to professionals who enjoy being a coordinating force during outages and who prefer measurable operational outcomes over project-only work. It will fit someone comfortable switching between deep technical investigation and crisp stakeholder updates, and who values process design such as SLAs, SLOs, and incident frameworks as much as rapid resolution.

The section above is editorial commentary from The SaaS Jobs, provided to help SaaS professionals understand the role in a broader industry context.

Job Description

About the Company:

Netomi is the leading agentic AI platform for enterprise customer experience. We work with some of the largest global brands, including Delta Air Lines, MetLife, MGM, and United, to enable agentic automation at scale across the entire customer journey. Our no-code platform delivers the fastest time to market, lowest total cost of ownership, and simple, scalable management of AI agents for any CX use case. Backed by WndrCo, Y Combinator, and Index Ventures, we help enterprises drive efficiency, lower costs, and deliver higher-quality customer experiences.

Want to be part of the AI revolution and transform how the world’s largest global brands do business? Join us!

About the role

We are looking for a proactive Incident Engineer to own end-to-end incident response across our AI and platform stack. You will ensure rapid detection, triage, communication, and resolution of incidents impacting customers and internal systems.

Responsibilities

Own the incident lifecycle: detection, triage, escalation, resolution, and postmortems
Act as the central command during major incidents (war rooms, stakeholder updates)
Define and enforce SLAs/SLOs, incident severity frameworks, and runbooks
Collaborate with Engineering, ML, and Integrations teams to resolve issues quickly
Monitor system health across integrations (agent desks, LLMs, ASR/TTS pipelines)
Drive root cause analysis (RCA) and preventive actions
Improve observability, alerting, and incident tooling
Maintain clear internal and customer-facing communication during incidents

Requirements

3–6 years in Incident Management / SRE / Production Support roles
Strong understanding of distributed systems, APIs, and cloud environments (AWS)
Experience with observability tools (e.g., Datadog)
Familiarity with AI/ML systems, especially LLM integrations and voice stacks (ASR/TTS), is a plus
Experience with LLM monitoring/tracing tools such as Langfuse
Excellent communication and stakeholder management skills
Ability to stay calm under pressure and drive structured resolution

Nice to Have

Exposure to OpenAI or similar LLM platforms
Experience supporting customer-facing SaaS products
Automation mindset (runbooks, alert tuning, incident tooling)

Netomi is an equal opportunity employer committed to diversity in the workplace. We evaluate qualified applicants without regard to race, color, religion, sex, sexual orientation, disability, veteran status, or other protected characteristics.