Why This Job is Featured on The SaaS Jobs
SRE roles inside research-adjacent product teams are still relatively uncommon in SaaS, and this one stands out for being embedded directly within an AI Research group that ships customer-facing systems. The remit spans cloud-first, service-oriented infrastructure on GCP, with production inference and data-platform components that sit on critical paths—work that mirrors the reliability challenges increasingly typical of modern SaaS offerings adopting AI features.
From a SaaS career perspective, the role builds durable platform skills: operating Kubernetes services, shaping CI/CD and GitOps practices, and designing observability and incident-response routines that scale with real usage. The emphasis on enabling iteration while maintaining safeguards is a useful lens for anyone looking to deepen judgment around operational maturity, not just tooling. Experience partnering with researchers also translates well to SaaS environments where product bets are exploratory but reliability expectations remain high.
The section above is editorial commentary from The SaaS Jobs, provided to help SaaS professionals understand the role in a broader industry context.
Job Description
About the AI Research Team
The AI Research team at Algolia combines fundamental research with product engineering to deliver customer-facing AI-powered features.
The team is highly cross-functional, made up of PhD researchers, full-stack engineers, and infrastructure specialists working together to explore new ideas, validate impact, and bring successful research outcomes into production. While the work is research-driven, the output is real, customer-facing systems.
The Opportunity
We are looking for an embedded Senior Site Reliability Engineer to join the AI Research team as a full member of the group. In this role, you will support both the research and product-engineering aspects of the team by ensuring the stability, scalability, and operability of the infrastructure that enables this work.
This is a classic SRE role focused on cloud-first, service-oriented architectures running on Google Cloud Platform. While the team builds AI-powered systems, AI or ML experience is not required for this role. Our priority is strong SRE fundamentals, experience operating production services, and comfort working in an environment with ambiguity and high ownership.
You will play an important role in day-to-day execution as well as in longer-term (12-month) planning, helping shape how the team builds and operates its platforms over time.
What You’ll Work On
Platform Reliability & Enablement
- Support and evolve the reliability of platforms used by the AI Research team. Examples of our infrastructure work to date include:
  - A production inference service (an embedding model serving API; see the sketch after this list)
  - An AI data feature store
  - Internal tools used for novel research and experimentation
  - Infrastructure that combines the above for offline testing of customer deployments, using agents to discover configuration improvements
- Ensure production services meet expectations for availability, latency, and operational readiness, particularly for systems that sit on customer-critical paths
- Design infrastructure and operational patterns that prioritize iteration speed while maintaining appropriate safeguards for production systems
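As a concrete, purely illustrative example of the kind of service described above, the sketch below shows how a Go embedding-serving endpoint might bound per-request latency. The route, request shape, and `embedText` stub are hypothetical assumptions, not Algolia's actual API.

```go
// Hypothetical sketch of a latency-bounded embedding endpoint in Go.
// All names here (/v1/embed, embedRequest, embedText) are illustrative.
package main

import (
	"context"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type embedRequest struct {
	Text string `json:"text"`
}

type embedResponse struct {
	Vector []float32 `json:"vector"`
}

// embedText stands in for a call to the real model backend; it honors
// context cancellation so the handler's timeout actually takes effect.
func embedText(ctx context.Context, text string) ([]float32, error) {
	select {
	case <-ctx.Done():
		return nil, ctx.Err()
	case <-time.After(10 * time.Millisecond): // simulated inference
		return make([]float32, 768), nil
	}
}

func handleEmbed(w http.ResponseWriter, r *http.Request) {
	// Bound per-request latency so slow inference cannot pile up work
	// behind a customer-critical path.
	ctx, cancel := context.WithTimeout(r.Context(), 500*time.Millisecond)
	defer cancel()

	var req embedRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	vec, err := embedText(ctx, req.Text)
	if err != nil {
		http.Error(w, "inference unavailable", http.StatusServiceUnavailable)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(embedResponse{Vector: vec})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/embed", handleEmbed)
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```

The timeout is the operational point: bounding inference time keeps tail latency predictable, which is the kind of safeguard the bullets above describe.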
Embedded Collaboration
- Work closely with researchers and engineers in a cross-functional setting, acting as an advisor on infrastructure, reliability, and operational concerns
- Participate directly in team planning and execution, from early exploration through production rollout
- Help researchers self-serve infrastructure safely and effectively, without becoming a bottleneck
Cloud Infrastructure & Operations
- Build and maintain Kubernetes-based services on GCP using infrastructure-as-code and GitOps (Terraform, ArgoCD)
- Own and improve CI/CD pipelines for services written primarily in Go, with some Python-based services
- Design and operate observability systems using tools such as Datadog (a minimal instrumentation sketch follows this list)
- Participate in a relatively light on-call rotation, responding to incidents and helping improve systems over time
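To illustrate the observability and operational-readiness side in the same spirit, here is a minimal sketch using only the Go standard library. In a real service the latency measurement would be exported to a metrics backend such as Datadog, and the `/healthz` and `/readyz` endpoint names are common conventions, not confirmed details of Algolia's stack.

```go
// Minimal sketch of health endpoints and per-request latency measurement.
// Standard library only; in production the latency numbers would feed a
// metrics backend such as Datadog rather than the log.
package main

import (
	"log"
	"net/http"
	"time"
)

// withLatencyLog wraps a handler and records how long each request took.
func withLatencyLog(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		log.Printf("method=%s path=%s duration=%s",
			r.Method, r.URL.Path, time.Since(start))
	})
}

func main() {
	mux := http.NewServeMux()

	// Liveness: the process is running. Kubernetes restarts the pod if
	// this probe fails.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: the service can take traffic. A real check would verify
	// downstream dependencies (model backend, feature store) here.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", withLatencyLog(mux)))
}
```

In a Kubernetes deployment these endpoints would back liveness and readiness probes, which is where much of the operational-readiness work described in this role tends to live.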
What We’re Looking For
Required Experience
- Strong experience operating cloud-first infrastructure
- Hands-on experience running production services on Kubernetes
- Proficiency with infrastructure-as-code (Terraform) and CI/CD systems
- Experience supporting production services written in Go (Python experience is a plus)
- Solid grounding in service reliability, incident response, and operational best practices
- Comfort working in environments with ambiguity, where problems are not always well-defined upfront
Nice to Have
- Experience supporting mission-critical internal platforms
- Exposure to research or experimentation-heavy environments
- Familiarity working alongside researchers or highly specialized domain experts
Explicitly Not Required
- AI, ML, or deep learning experience
- Model training, tuning, or ML framework expertise (e.g., PyTorch, JAX)
Ways This Role May Not Be a Fit
This role may not be a good match if:
- You are only interested in maintaining existing infrastructure without contributing to what is being built
- You want to work exclusively on customer-facing product features
- You are looking to avoid on-call or production systems entirely
- You are seeking narrowly defined work with low ambiguity and limited ownership
- You want to build or train AI models yourself rather than enable the systems around them
Why Join the AI Research Team
- High Impact: Your work directly enables new AI-powered capabilities that reach customers
- High Agency: You’ll help shape what gets built, how it’s built, and whether it’s worth building
- Strong Peers: Collaborate with experienced SREs, engineers, and PhD researchers
- Growth: Build expertise in research-adjacent infrastructure and platform reliability
- Flexibility: Australia-based role with remote-friendly culture; occasional off-hours collaboration may be required