Why This Job is Featured on The SaaS Jobs
SRE roles inside research-adjacent product teams are still relatively uncommon in SaaS, and this one stands out for being embedded directly within an AI Research group that ships customer-facing systems. The remit spans cloud-first, service-oriented infrastructure on GCP, with production inference and data-platform components that sit on critical paths—work that mirrors the reliability challenges increasingly typical of modern SaaS offerings adopting AI features.
From a SaaS career perspective, the role builds durable platform skills: operating Kubernetes services, shaping CI/CD and GitOps practices, and designing observability and incident-response routines that scale with real usage. The emphasis on enabling iteration while maintaining safeguards is a useful lens for anyone looking to deepen judgment around operational maturity, not just tooling. Experience partnering with researchers also translates well to SaaS environments where product bets are exploratory but reliability expectations remain high.
The section above is editorial commentary from The SaaS Jobs, provided to help SaaS professionals understand the role in a broader industry context.
Job Description
About the AI Research Team
The AI Research team at Algolia combines fundamental research with product engineering to deliver customer-facing AI-powered features.
The team is highly cross-functional, made up of PhD researchers, full-stack engineers, and infrastructure specialists working together to explore new ideas, validate impact, and bring successful research outcomes into production. While the work is research-driven, the output is real, customer-facing systems.
The Opportunity
We are looking for an embedded Senior Site Reliability Engineer to join the AI Research team as a full member of the group. In this role, you will support both the research and product-engineering aspects of the team by ensuring the stability, scalability, and operability of the infrastructure that enables this work.
This is a classic SRE role focused on cloud-first, service-oriented architectures running on Google Cloud Platform. While the team builds AI-powered systems, AI or ML experience is not required for this role. Our priority is strong SRE fundamentals, experience operating production services, and comfort working in an environment with ambiguity and high ownership.
You will play an important role in day-to-day execution as well as in longer-term (12-month) planning, helping shape how the team builds and operates its platforms over time.
What You’ll Work On
Platform Reliability & Enablement
- Support and evolve the reliability of platforms used by the AI Research team. Examples of our infrastructure work to date include:
  - A production inference service (an embedding model serving API; see the sketch after this list)
  - An AI data feature store
  - Internal tools used for novel research and experimentation
  - Infrastructure that combines the above for offline testing of customer deployments, using agents to discover configuration improvements
- Ensure production services meet expectations for availability, latency, and operational readiness, particularly for systems that sit on customer-critical paths
- Design infrastructure and operational patterns that prioritize iteration speed while maintaining appropriate safeguards for production systems
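As a concrete, purely illustrative example of the kind of service described above, the sketch below shows how a Go embedding-serving endpoint might bound per-request latency. The route, request shape, and `embedText` stub are hypothetical assumptions, not Algolia's actual API.

```go
// Hypothetical sketch of a latency-bounded embedding endpoint in Go.
// All names here (/v1/embed, embedRequest, embedText) are illustrative.
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type embedRequest struct {
	Text string `json:"text"`
}

type embedResponse struct {
	Vector []float32 `json:"vector"`
}

// embedText stands in for a call to the real model backend; it honors
// context cancellation so the handler's timeout actually takes effect.
func embedText(ctx context.Context, text string) ([]float32, error) {
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case <-time.After(10 * time.Millisecond): // simulated inference
		return make([]float32, 768), nil
	}
}

func handleEmbed(w http.ResponseWriter, r *http.Request) {
	// Bound per-request latency so slow inference cannot pile up work
	// behind a customer-critical path.
	ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
	defer cancel()

	var req embedRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	vec, err := embedText(ctx, req.Text)
	if err != nil {
		http.Error(w, "inference unavailable", http.StatusServiceUnavailable)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(embedResponse{Vector: vec})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/embed", handleEmbed)
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```

The timeout is the operational point: bounding inference time keeps tail latency predictable, which is the kind of safeguard the bullets above describe.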
Embedded Collaboration
- Work closely with researchers and engineers in a cross-functional setting, acting as an advisor on infrastructure, reliability, and operational concerns
- Participate directly in team planning and execution, from early exploration through production rollout
- Help researchers self-serve infrastructure safely and effectively, without becoming a bottleneck
Cloud Infrastructure & Operations
- Build and maintain Kubernetes-based services on GCP using infrastructure-as-code and GitOps (Terraform, ArgoCD)
- Own and improve CI/CD pipelines for services written primarily in Go, with some Python-based services
- Design and operate observability systems using tools such as Datadog (a minimal instrumentation sketch follows this list)
- Participate in a relatively light on-call rotation, responding to incidents and helping improve systems over time
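To illustrate the observability and operational-readiness side in the same spirit, here is a minimal sketch using only the Go standard library. In a real service the latency measurement would be exported to a metrics backend such as Datadog, and the `/healthz` and `/readyz` endpoint names are common conventions, not confirmed details of Algolia's stack.

```go
// Minimal sketch of health endpoints and per-request latency measurement.
// Standard library only; in production the latency numbers would feed a
// metrics backend such as Datadog rather than the log.
package main

import (
	"log"
	"net/http"
	"time"
)

// withLatencyLog wraps a handler and records how long each request took.
func withLatencyLog(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		log.Printf("method=%s path=%s duration=%s",
			r.Method, r.URL.Path, time.Since(start))
	})
}

func main() {
	mux := http.NewServeMux()

	// Liveness: the process is running. Kubernetes restarts the pod if
	// this probe fails.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: the service can take traffic. A real check would verify
	// downstream dependencies (model backend, feature store) here.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", withLatencyLog(mux)))
}
```

In a Kubernetes deployment these endpoints would back liveness and readiness probes, which is where much of the operational-readiness work described in this role tends to live.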
What We’re Looking For
Required Experience
- Strong experience operating cloud-first infrastructure
- Hands-on experience running production services on Kubernetes
- Proficiency with infrastructure-as-code (Terraform) and CI/CD systems
- Experience supporting production services written in Go (Python experience is a plus)
- Solid grounding in service reliability, incident response, and operational best practices
- Comfort working in environments with ambiguity, where problems are not always well-defined upfront
Nice to Have
- Experience supporting mission-critical internal platforms
- Exposure to research or experimentation-heavy environments
- Familiarity working alongside researchers or highly specialized domain experts
Explicitly Not Required
- AI, ML, or deep learning experience
- Model training, tuning, or ML framework expertise (e.g., PyTorch, JAX)
Ways This Role May Not Be a Fit
This role may not be a good match if:
- You are only interested in maintaining existing infrastructure without contributing to what is being built
- You want to work exclusively on customer-facing product features
- You are looking to avoid on-call or production systems entirely
- You are seeking narrowly defined work with low ambiguity and limited ownership
- You want to build or train AI models yourself rather than enable the systems around them
Why Join the AI Research Team
- High Impact: Your work directly enables new AI-powered capabilities that reach customers
- High Agency: You’ll help shape what gets built, how it’s built, and whether it’s worth building
- Strong Peers: Collaborate with experienced SREs, engineers, and PhD researchers
- Growth: Build expertise in research-adjacent infrastructure and platform reliability
- Flexibility: Australia-based role with remote-friendly culture; occasional off-hours collaboration may be required