Site Reliability Engineer II

PagerDuty • Toronto • 1m ago

Why This Job is Featured on The SaaS Jobs

Modern SaaS platforms are increasingly defined by their operational guarantees, and this Site Reliability Engineer II role sits directly on the layer that makes those guarantees real. Working on core infrastructure for a real-time digital operations product means the day-to-day focus is less about isolated services and more about the shared primitives that every customer-facing capability depends on, including networking, compute, Kubernetes, and traffic management.

For a long-term SaaS career, this kind of remit builds durable platform instincts: designing for scale, hardening systems already in production, and using observability signals to steer reliability work. Exposure to on-call, incident response, and iterative infrastructure rollouts also develops the operational judgment that many SaaS companies look for when moving engineers into senior SRE, platform, or reliability leadership tracks.

This position tends to fit engineers who enjoy balancing planned engineering with operational responsibility, and who prefer work that improves systems for many internal teams rather than a single application surface. It also suits someone looking to deepen cloud-native fundamentals in a product environment where uptime, latency, and security are treated as first-class engineering outcomes.

The section above is editorial commentary from The SaaS Jobs, provided to help SaaS professionals understand the role in a broader industry context.

Job Description

As an intermediate Site Reliability Engineer on the Core Infrastructure team in our Toronto office, you'll help build and operate the foundational infrastructure that powers PagerDuty's real-time digital operations platform. Our systems support millions of events and alerts daily, enabling customers to detect, respond to, and resolve incidents quickly and reliably.

You'll work at the intersection of platform evolution and operational excellence, building and evolving foundational network, compute, and ingress infrastructure while scaling and hardening existing systems. Your work will directly impact the reliability, scalability, and security of the services our customers rely on to keep their businesses running as PagerDuty continues to grow across products, regions, and customer use cases.

Key Responsibilities

Support and improve foundational infrastructure, including networking, compute platforms, Kubernetes clusters, and ingress/traffic management systems.
Contribute to the reliability and scalability of PagerDuty's core platform by hardening existing systems and supporting the rollout of new infrastructure capabilities.
Participate in agile rituals (standups, planning, retros) and communicate progress and risks early.
Stay current on technical trends and suggest innovative tools and approaches to interesting problems.
Monitor system health using metrics, logs, and alerts, and participate in 24/7 on-call rotations to help detect, respond to, and resolve incidents.

Basic Qualifications

3+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles
Hands-on experience operating Linux-based systems in production environments
Working knowledge of networking fundamentals, such as load balancing, DNS, TLS, and ingress traffic flow
Experience with container orchestration (e.g., EKS, Kubernetes)
Experience working on cloud-native infrastructure (e.g., AWS, GCP, Azure), including networking and compute concepts
Proficiency in at least one programming language (e.g., Python, Ruby, or Go)
Experience with Infrastructure as Code (e.g., Terraform, CloudFormation)

Preferred Qualifications

Experience with AWS cloud networking concepts such as VPCs, subnets, routing, security groups, and load balancers
Experience operating or contributing to production Kubernetes platforms (e.g., EKS), including cluster upgrades, networking, or ingress configuration
Experience with monitoring, observability, and logging platforms (e.g., DataDog, New Relic, SumoLogic, Splunk, Prometheus, Grafana)
Familiarity with service meshes, ingress controllers, or API gateways (e.g., Envoy, Istio, NGINX)

The base salary range for this position is 115,000 - 165,000 CAD. This role may also be eligible for bonus, commission, equity, and/or benefits.

Our base salary ranges are determined by role, level, and location. The range, which is subject to change based on primary work location, reflects the minimum and maximum base salary we expect to pay newly hired employees for the position. Within the range, we determine pay for an individual based on a number of factors including market location, job-related knowledge, skills/competencies and experience.

Your recruiter can share more about the specific offerings for this role, as well as the salary range for your primary work location during the hiring process.

PagerDuty is a flexible, hybrid workplace. We embrace and encourage in-person working as an integral part of our culture. Both our employees and external research tell us that co-located collaboration strengthens connections, drives innovation, and accelerates learning.

In this role, you are expected to come into our Toronto office 2 days per week, so you can thrive in your new role and fully embrace being a Dutonian!
