Staff CloudOps Engineer

Wrike • Remote (Estonia) • 20h ago

About the Role:

As a CloudOps Engineer at Wrike, you have advanced skills in supporting cloud and data center infrastructure with security in mind. You know how to work with monitoring and logging systems, containers, networking, and automation, and how to debug reasonably complex infrastructure. You feel comfortable defining your own work based on the team OKRs, and you can help others do so when necessary. You are used to proposing meaningful improvements to the existing infrastructure in alignment with architects and tech leads, and you can drive their execution.

In this role, you will join a core development team of 250+ engineers building Wrike and become part of the operations department, which works with a wide range of technologies and systems. Does this sound like you? If your answer is yes, we'd love to speak with you!

Team Dynamics:

We have two dozen folks in the SysOps Department, organized into three teams distributed across Prague, Cyprus, and Tallinn. As a core member of our team, you will be:

Managing the Wrike product infrastructure
Implementing reliable solutions to ensure a product uptime SLA of 99.9%
Working with GCP, AWS and other cloud providers in the IaC paradigm
Introducing and supporting new infrastructure services
Actively participating in incident response and management, including on-call duties
Developing and maintaining professional connections within and outside of the team

Technical Environment:

We run 150+ Java-based SaaS applications on Kubernetes for more than 20,000 organizations, across 3 data centers, both on-premises and in the cloud.

Key technologies and tools include:

Linux (core platform for all services)
Kubernetes and ArgoCD (Service-oriented architecture)
Nginx, HAProxy and Istio for load balancing
GCP, AWS and Cloudflare as cloud providers
Zabbix and Prometheus (VictoriaMetrics) for monitoring and alerting
Graylog, Elasticsearch/Logstash, Fluentd for centralized logging
Puppet, Ansible and Terraform for defining everything as code
Python and Bash for automation and tooling
Jenkins and GitLab CI for CI/CD
PostgreSQL as the database platform
Kafka and RabbitMQ for messaging

Your Impact:

Lead the evolution of our enterprise-grade logging and monitoring platforms (Graylog, Elasticsearch, Zabbix, Prometheus), ensuring they scale with business growth.
Design and extend observability pipelines (data ingestion, storage, correlation, and alerting).
Partner with developers, SysOps, and security teams to proactively improve visibility, reliability, and incident response.
Ensure availability, performance, and security of mission-critical Linux-based infrastructure.
Drive automation-first approaches for infrastructure and operational tasks using Python, Bash, and configuration management/IaC tools.
Influence architectural decisions with a strong site reliability engineering mindset, balancing performance, cost, and resilience.

Your Qualifications:

Deep expertise in monitoring and logging ecosystems, ideally with Zabbix, Prometheus (VictoriaMetrics), Graylog, and Elasticsearch.
Expert-level Linux administration skills with proven experience running large-scale, highly available infrastructure for SaaS/web applications.
Hands-on production experience with Kubernetes.
Strong background in infrastructure as code (Terraform, Ansible, or Puppet).
Experience operating across multi-cloud and hybrid environments (AWS, GCP, on-prem).
Experience working with cross-functional teams, driving initiatives, and acting as a technical authority.
Effective communication skills (Upper-Intermediate English or higher).