Staff Software Engineer: Reliability, Performance

Feldera • Full-time • 3w ago

Location: Remote
Commitment: Full-time

About Feldera

Feldera is redefining how engineers compute over changing data. Powered by an award-winning breakthrough in database theory, our platform incrementally maintains even the most complex SQL views as data changes, including pipelines with hundreds of joins, aggregates, unions, and recursion. The result? Real-time insights over both live and historical data, with 10x lower cost and 100x faster time-to-value.

This isn’t faster batch compute. It’s a fundamentally new model of computation.

We’re now growing the team that brings this breakthrough to production for our customers.

In this role, you’ll focus on our secure, scalable, production-grade self-hosted platform, one that customers deploy on their own infrastructure, from laptops to multi-node clusters.

The Role

We are looking for a Performance and Reliability Engineer for our internal automated testing and performance benchmarking platform. You’ll help ensure Feldera runs flawlessly under real-world conditions: pushing the engine and control plane to their limits; validating performance and correctness under stress via benchmarks, soak tests, chaos tests, fault tests, and differential tests; and making sure upgrades and recovery are smooth for our customers running mission-critical workloads. See this blog post on how our team approaches software correctness.

This role is ideal for someone who thrives at the intersection of systems engineering, testing, automation, and performance engineering.

What You’ll Do

Real-world testing at scale: Design and run long-lived workloads that mimic production environments, including sustained load, skewed data distributions, and upgrade workflows.
Performance monitoring & tuning: Build metrics and dashboards to continuously measure throughput, latency, and resource efficiency, and use these insights to guide system improvements.
Chaos and fault injection: Run experiments involving node failures, crashes, network partitions, resource contention, and rolling upgrades, validating correctness and resilience under stress.
CI/CD engineering: Own and evolve our CI/CD pipelines to make them faster, more reliable, and more reflective of production conditions. Ensure that every change is validated under meaningful workloads before it ships.
Collaboration: Work closely with our systems engineers to pinpoint bottlenecks, identify regressions, and improve reliability mechanisms.

Minimum Requirements

Strong background in systems engineering, performance testing, or site reliability engineering.
Fluency in Python and Linux fundamentals. Rust experience is strongly valued.
Experience with distributed systems and database concepts (consistency, fault tolerance, transactions).
Experience with CI/CD pipeline engineering (GitHub Actions, Docker, Kubernetes).
Hands-on experience running large-scale and long-running workloads, preferably in a cloud-native environment.
Curiosity, rigor, and the ability to design experiments that simulate messy real-world conditions.

Candidates must have authorization to work in their country of residence (United States, European Union, or India). We are unable to sponsor employment visas at this time.