Senior Site Reliability Engineer (SRE), Platform Team
ReadyOn
About the role
As a Senior Site Reliability Engineer, you’ll lead the reliability and operations of our Kubernetes-based platform, ensuring scalable, resilient application deployments via GitOps. You’ll drive operational excellence by spearheading incident investigations, root cause analysis, and post-mortems to minimize downtime. Your expertise in observability, Kubernetes, ArgoCD, and CI/CD will empower developer productivity and system stability. We’re seeking a proactive go-getter who thrives in fast-paced startup environments, anticipates issues, and champions continuous improvements in reliability and performance.
Responsibilities
Infrastructure & Observability:
- Manage multi-cluster Kubernetes environments for high availability
- Design and enhance observability systems (monitoring, logging, tracing with Prometheus, OpenTelemetry, Grafana, ELK)
- Configure ingress, service mesh, and networking for performance and security
- Implement Infrastructure as Code (IaC) using Terraform for AWS infrastructure
Operations & Incident Leadership:
- Lead incident response, investigations, and post-mortems to drive root cause resolution
- Design proactive monitoring and alerting systems to minimize outages
- Establish SLAs/SLOs and conduct chaos engineering to improve reliability
- Build and automate remediation runbooks
- Lead and coordinate regular Failover and Disaster Recovery drills to validate recovery procedures and minimize RTO/RPO
- Mentor teams on operational best practices and troubleshooting
Platform Engineering:
- Develop and maintain internal tools and automation using ArgoCD for Kubernetes continuous delivery, enhancing developer productivity and platform reliability
- Design and implement platform features (e.g., service templates, libraries, security controls) supporting scalable, automated GitOps deployments
- Drive cloud-native technology adoption and optimization, integrating with continuous delivery pipelines
- Collaborate with application teams, provide platform support, and promote reliability and scalability best practices
Cloud Operations:
- Manage Kubernetes and AWS services (RDS, VPC, IAM) in multi-account, multi-region setups
- Optimize infrastructure for cost, performance, and scalability
- Enforce security policies and compliance standards
Your background
Core:
- 5+ years in SRE, DevOps, or Platform Engineering (7–10+ years preferred)
- Expert in Kubernetes, Helm, and GitOps (ArgoCD/Flux)
- Advanced observability skills (OpenTelemetry)
- Proficient in Terraform and AWS services (RDS, VPC, IAM)
- Strong scripting/automation (Python, Bash)
- Proven leadership in incident investigations and system troubleshooting
- Expertise in CI/CD pipelines (e.g., GitHub Actions)
- Solid Linux operations and networking fundamentals
- Experience operating production-grade platforms at scale
Preferred:
- PostgreSQL administration and migration automation
- Service mesh expertise (e.g., Istio)
- Mentoring junior engineers in operational practices
- Security engineering in cloud environments
Certifications (Preferred):
- CKA/CKAD
- AWS Solutions Architect or DevOps Engineer
- Terraform Associate
To apply
If you’re excited about this opportunity and believe you’d be a great fit, we’d love to hear from you! Send a short note to careers@readyon.ai explaining why you’re interested in joining ReadyOn.
We encourage all candidates to apply, even if their experience doesn’t exactly match our job description. We are committed to building a diverse and inclusive workplace where everyone (regardless of age, religion, ethnicity, gender, sexual orientation, and more) feels like they belong.