ABOUT ALGOLIA
Algolia is a fast-growing company that helps users deliver intuitive search and discovery experiences on their websites and mobile apps. We provide APIs used by thousands of customers in more than 100 countries. Today, Algolia powers 1.5 trillion searches a year – that’s 4 times more than Bing, Yahoo, DuckDuckGo, Baidu, and Yandex combined!
THE MISSION
We are building the next generation of AI-powered search products. We make AI explainable, and we help customers make data-driven decisions. In this role, you will work with the product function to guide product development through analytics and experimentation. You will be an integral part of building the future of AI search. If you’re passionate about turning product data into actionable insights and driving product success, we’d love to hear from you.
THE OPPORTUNITY
We are seeking a skilled Senior AI / MLOps Engineer to enable our Data Scientists to move faster and our customers to receive smarter search & discovery experiences by turning prototypes into robust, scalable, and observable AI services. You will own the end-to-end engineering lifecycle: packaging, deploying, operating, and continuously improving the machine-learning models that power search ranking, recommendations, and related information-retrieval features on our e-commerce platform.
What you'll be doing:
- Productionization & Packaging: Convert notebooks and research codebases into production-ready Python and Go microservices, libraries, or Kubeflow pipelines; design reproducible build pipelines (Docker, Conda, Poetry); and manage artifacts in centralized registries.
- Scalable Deployment: Orchestrate real-time and batch inference workloads on Kubernetes, AWS/GCP managed services, or similar platforms, ensuring low latency and high throughput; implement blue-green / canary rollouts, automatic rollback, and model versioning strategies (SageMaker, Vertex AI, KServe, MLflow, BentoML, etc.).
- MLOps & CI/CD: Build and maintain CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, Argo) covering unit, integration, data-quality, and performance tests; automate feature store updates, model retraining triggers, and scheduled batch jobs using Airflow, Dagster, or similar orchestration tools.
- Observability & Reliability: Define and monitor SLIs/SLOs for model latency, throughput, accuracy, drift, and cost; integrate logging, tracing, and metrics (e.g., Datadog); and establish alerting and on-call practices.
- Data & Feature Engineering: Collaborate with data engineers to create scalable pipelines that ingest clickstream logs, catalog metadata, images, and user signals; implement real-time and offline feature extraction, validation, and lineage tracking.
- Performance & Cost Optimization: Profile models and services; leverage hardware acceleration (GPU, TPU), optimized runtimes (ONNX, OpenVINO), and caching strategies (Redis, Faiss) to meet aggressive latency targets; and right-size clusters and workloads to balance performance with cloud spend.
- Governance & Compliance: Embed security, privacy, and responsible-AI checks in pipelines; manage secrets, IAM roles, and data-access controls via Terraform or CloudFormation; and ensure auditability and reproducibility through comprehensive documentation and artifact tracking.
- Collaboration & Mentorship: Partner closely with Data Scientists, Product Owners, and Site Reliability Engineers to align technical solutions with business goals; coach junior engineers on MLOps best practices and contribute to internal knowledge-sharing sessions.
Role Requirements:
- Spend 1-2 days per week in a local coworking space to collaborate with your teammates in person.
- 5+ years of experience in software engineering with 2+ years focused on deploying ML/AI systems at scale.
- Strong coding skills in Python (preferred) and at least one statically typed language (Go preferred).
- Hands-on expertise with containerization (Docker), orchestration (Kubernetes/EKS/GKE/AKS), and cloud platforms (AWS, GCP, or Azure).
- Proven record of building CI/CD pipelines and automated testing frameworks for data or ML workloads.
- Deep understanding of REST/gRPC APIs, message queues (Kafka, Kinesis, Pub/Sub), and stream/batch data processing frameworks (Spark, Flink, Beam).
- Experience implementing monitoring, alerting, and logging for mission-critical services.
- Familiarity with common ML lifecycle tools (MLflow, Kubeflow, SageMaker, Vertex AI, Feature Store, etc.).
- Working knowledge of ML concepts such as feature engineering, model evaluation, A/B testing, and drift detection.
We’re looking for someone who can live our values:
- GRIT - Problem-solving and perseverance in an ever-changing, fast-growing environment.
- TRUST - Willingness to trust our co-workers and to take ownership.
- CANDOR - Ability to receive and give constructive feedback.
- CARE - Genuine care for other team members, our clients, and the decisions we make in the company.
- HUMILITY - Aptitude for learning from others, putting ego aside.
#LI-Hybrid