Site Reliability Engineer

Clay • Full-time • Remote (New York, NY, United States) • $130k - $300k / year

Why This Job is Featured on The SaaS Jobs

This Site Reliability Engineer role stands out in SaaS because it sits at the point where product demand meets platform resilience. With a cloud-native stack and explicit ownership of availability, performance, and cost efficiency, the work reflects the operational realities of a revenue-backed SaaS business where reliability is part of the customer experience, not just an internal concern. The remit spans infrastructure design, automation, and incident response, which are core levers for sustaining a multi-tenant service as usage grows.

For a long-term SaaS career, the role builds durable skills in operating production systems: infrastructure as code, CI/CD guardrails, observability, and pragmatic trade-offs between developer velocity and risk. Experience balancing these constraints is highly transferable across SaaS companies, particularly those running on AWS and modern container and serverless patterns. Regular on-call participation also develops the operational judgment that differentiates senior platform engineers in SaaS environments.

This position is best suited to an engineer who prefers end-to-end ownership and treats automation as a product, not a side task. It will fit someone comfortable switching between coding and systems work, collaborating across teams, and making measured decisions under incident pressure. The emphasis on learning unfamiliar technologies also signals a good match for engineers who like evolving stacks rather than fixed runbooks.

The section above is editorial commentary from The SaaS Jobs, provided to help SaaS professionals understand the role in a broader industry context.

Job Description

About Clay

Our mission is to help organizations turn any growth idea into reality.

We see growth as a creative practice, not a formula. Finding and reaching your best-fit customers takes unique ideas and constant iteration. As AI makes execution faster and tactics easier to copy, creativity is the only lasting advantage. We're already helping thousands of customers — including Anthropic, Notion, Google, and Ramp — go to market with unique data, signals, and AI research.

In 2025, we raised a $100M Series C backed by world-class investors including Sequoia, CapitalG, and First Round — and crossed $100M in revenue.

In 2026, we announced our second employee tender offer in 9 months at a new $5B valuation. We also launched a community equity round, for our customers, agency partners, and club members.

Some things to know about us:

Our community includes 11,000+ customers, 150+ integration partners, 125+ agencies, 50+ Clay clubs, and 30k members on Slack.
Our culture is unique inside and outside of work. Our team members are also DJs, activists, writers, clowns, marathoners, skydivers, psychedelic therapists, social workers, and more.
All employees can work for free with world-class coaches who specialize in creativity, management, and more.
Our operating principles, including negative maintenance and non-attached action, guide our work.
Read about us in the NYT, Forbes, First Round Review, and more.

Hear from our employees directly on our Glassdoor page!

SRE @ Clay

In this role, you’ll join our growing infrastructure team in building and fine-tuning our infrastructure to keep our services running smoothly. We’re looking for someone who’s excited about automation and continuous improvement. While your main focus will be on infrastructure, coding skills are a must. As a growing startup, we all jump in where needed, so you’ll need to be comfortable taking on a variety of roles.

What You’ll Do

Architect, design, implement, and manage robust, scalable, and secure infrastructure solutions.
Develop, maintain, and enforce best practices for CI/CD, infrastructure as code, and automation.
Oversee the management and optimization of cloud infrastructure, ensuring high availability, performance, and cost-efficiency.
Implement monitoring, logging, and alerting solutions to maintain system health and quickly resolve issues.
Lead incident response efforts, troubleshooting and resolving complex issues in a timely manner.
Participate in an on-call rotation.
Work with teams across the company to ensure we achieve the right balance of developer velocity, reliability and performance, and cost efficiency.

What You’ll Bring

5+ years of experience
Experience with containerization and orchestration tools
Strong understanding of CI/CD concepts and tools
Knowledge of infrastructure automation tools
Experience with on-call and incident response
Proficiency in one or more programming languages
Familiarity with our stack or ability to learn unfamiliar technologies quickly:
- Aurora Postgres RDS, ElastiCache Redis, Docker + ECS, Lambda, OpenSearch
- Terraform and Atlantis
- CircleCI, Netlify, Playwright
- CloudWatch, Datadog, Mezmo
- TypeScript, Python