Senior Site Reliability Engineer (SRE), Platform Team

ReadyOn

Software Engineering
San Francisco, CA, USA
Posted 6+ months ago

About the role

As a Senior Site Reliability Engineer, you’ll lead the reliability and operations of our Kubernetes-based platform, ensuring scalable, resilient application deployments via GitOps. You’ll drive operational excellence by spearheading incident investigations, root cause analysis, and post-mortems to minimize downtime. Your expertise in observability, Kubernetes, ArgoCD, and CI/CD will empower developer productivity and system stability. We’re seeking a proactive go-getter who thrives in fast-paced startup environments, anticipates issues, and champions continuous improvements in reliability and performance.

Responsibilities

Infrastructure & Observability:

  • Manage multi-cluster Kubernetes environments for high availability
  • Design and enhance observability systems (monitoring, logging, tracing with Prometheus, OpenTelemetry, Grafana, ELK)
  • Configure ingress, service mesh, and networking for performance and security
  • Implement Infrastructure as Code (IaC) using Terraform for AWS infrastructure

Operations & Incident Leadership:

  • Lead incident response, investigations, and post-mortems to drive root cause resolution
  • Design proactive monitoring and alerting systems to minimize outages
  • Establish SLAs/SLOs and conduct chaos engineering to improve reliability
  • Build and automate remediation runbooks
  • Lead and coordinate regular failover and disaster recovery drills to validate recovery procedures and minimize RTO/RPO
  • Mentor teams on operational best practices and troubleshooting

Platform Engineering:

  • Develop and maintain internal tools and automation using ArgoCD for Kubernetes continuous delivery, enhancing developer productivity and platform reliability
  • Design and implement platform features (e.g., service templates, libraries, security controls) supporting scalable, automated GitOps deployments
  • Drive cloud-native technology adoption and optimization, integrating with continuous delivery pipelines
  • Collaborate with application teams, providing platform support and promoting reliability and scalability best practices

Cloud Operations:

  • Manage Kubernetes on AWS and related services (RDS, VPC, IAM) in multi-account, multi-region setups
  • Optimize infrastructure for cost, performance, and scalability
  • Enforce security policies and compliance standards

Your background

Core:

  • 5+ years in SRE, DevOps, or Platform Engineering (7–10+ years preferred)
  • Expert in Kubernetes, Helm, and GitOps (ArgoCD/Flux)
  • Advanced observability skills (OpenTelemetry)
  • Proficient in Terraform and AWS services (RDS, VPC, IAM)
  • Strong scripting/automation (Python, Bash)
  • Proven leadership in incident investigations and system troubleshooting
  • Expertise in CI/CD pipelines (e.g., GitHub Actions)
  • Solid Linux operations and networking fundamentals
  • Experience operating production-grade platforms at scale

Preferred:

  • PostgreSQL administration and migration automation
  • Service mesh expertise (e.g., Istio)
  • Mentoring junior engineers in operational practices
  • Security engineering in cloud environments

Certifications (Preferred):

  • CKA/CKAD
  • AWS Solutions Architect or DevOps Engineer
  • Terraform Associate

To apply

If you’re excited about this opportunity and believe you’d be a great fit, we’d love to hear from you! Send a short note to careers@readyon.ai explaining why you’re interested in joining ReadyOn.

We encourage all candidates to apply, even if their experience doesn’t exactly match the job description. We are committed to building a diverse and inclusive workplace where everyone (regardless of age, religion, ethnicity, gender, sexual orientation, and more) feels they belong.