Why This Job is Featured on The SaaS Jobs
This Senior Site Reliability Engineer role sits at the core of a product-led SaaS business where uptime is part of the offering, not an internal metric. With Algolia operating search infrastructure at very high query volumes for a large customer base, reliability work here is inherently customer-facing and tied to how the platform is experienced through APIs. The remit spans availability and cost at scale, which is a recurring theme for mature SaaS infrastructure teams.
From a SaaS career perspective, the role builds durable platform skills: defining and tracking SLOs and error budgets, designing self-healing systems, and reducing operational toil through automation. That combination maps well to modern SaaS operating models where engineering teams are expected to own production outcomes and where reliability is managed as an engineering discipline. Exposure to Kubernetes, IaC, CI/CD, and multi-environment hosting also translates across many SaaS companies.
This position is best suited to an engineer who enjoys operational ownership and pragmatic engineering trade-offs, including incident response and continuous improvement of existing systems. It fits someone comfortable collaborating across product and engineering groups, and who wants to deepen expertise in scalable, API-driven services while still writing software to improve the platform.
The section above is editorial commentary from The SaaS Jobs, provided to help SaaS professionals understand the role in a broader industry context.
Job Description
Algolia is set to enable every company to create world-class Search and Discovery experiences with an API-first approach. Performance and Scalability are at the heart of our mission: we power 1.5 trillion searches a year, for 10K+ customers all over the world.
If you're a problem solver, able to think outside the box and eager to nurture others and learn from them, then this is your challenge!
The Team
The Fleet team is a Site Reliability Engineering team focusing on one goal: the Search products should always be available. To make this possible, the Fleet team creates pragmatic solutions to optimize the availability and cost of the Search products at scale, taking into account the needs of customers, the product teams, and the many engineering teams involved in delivering a unique Search Experience to our customers.
The Opportunity
The team is looking for an individual who has prior experience building and operating scalable architectures. You will contribute to the delivery of solutions that support other engineering teams and will have a direct impact on the success of Algolia's Search products.
In this role, you'll help design and implement systems focused on reliability, scalability, and cost efficiency, while also having opportunities to grow your skills and collaborate with team members.
Your role will include
- Operating the Search products, building self-healing and automated incident response mechanisms
- Building components that improve reliability and performance
- Monitoring and computing the SLO and the error budget of the product you operate
- Reducing the toil and the technical debt by automating tasks and increasing the quality of existing components
- Managing Incidents and Customer Requests
You might be a good fit if you have
- 5 years of experience in a scalable environment
- Knowledge of at least one programming language (Python, Golang, Ruby) and familiarity with software craftsmanship
- Experience working with APIs
- A focus on designing reliable, operable, and highly available applications
- Familiarity with at least one public cloud provider, such as GCP, AWS, or Microsoft Azure, and its Kubernetes service
- A good understanding of Linux system administration, networking, and troubleshooting
- Strong communication and organizational skills
Team’s current stack:
- Programming languages: Golang, Python, Ruby
- CI/CD: GitHub Actions, CircleCI
- IaC & configuration management: Terraform, Chef
- Platform: Linux, Kubernetes
- Hosting: Bare Metal Servers & Cloud on AWS & Azure
- Monitoring: Datadog & custom monitoring stack for our Search infrastructure