Senior Site Reliability Engineer (SRE), Platform Team

ReadyOn

Software Engineering
San Francisco, CA, USA
Posted on Sep 23, 2025

About the role

As a Senior Site Reliability Engineer, you’ll lead the reliability and operations of our Kubernetes-based platform, ensuring scalable, resilient application deployments via GitOps. You’ll drive operational excellence by spearheading incident investigations, root cause analysis, and post-mortems to minimize downtime. Your expertise in observability, Kubernetes, ArgoCD, and CI/CD will empower developer productivity and system stability. We’re seeking a proactive go-getter who thrives in fast-paced startup environments, anticipates issues, and champions continuous improvements in reliability and performance.

Responsibilities

Infrastructure & Observability:

  • Manage multi-cluster Kubernetes environments for high availability
  • Design and enhance observability systems (monitoring, logging, tracing with Prometheus, OpenTelemetry, Grafana, ELK)
  • Configure ingress, service mesh, and networking for performance and security
  • Implement Infrastructure as Code (IaC) using Terraform for AWS infrastructure

Operations & Incident Leadership:

  • Lead incident response, investigations, and post-mortems to drive root cause resolution
  • Design proactive monitoring and alerting systems to minimize outages
  • Establish SLAs/SLOs and conduct chaos engineering to improve reliability
  • Build and automate remediation runbooks
  • Lead and coordinate regular Failover and Disaster Recovery drills to validate recovery procedures and minimize RTO/RPO
  • Mentor teams on operational best practices and troubleshooting

Platform Engineering:

  • Develop and maintain internal tools and automation using ArgoCD for Kubernetes continuous delivery, enhancing developer productivity and platform reliability
  • Design and implement platform features (e.g., service templates, libraries, security controls) supporting scalable, automated GitOps deployments
  • Drive cloud-native technology adoption and optimization, integrating with continuous delivery pipelines
  • Collaborate with application teams, providing platform support and promoting reliability and scalability best practices

Cloud Operations:

  • Manage AWS services (Kubernetes, RDS, VPC, IAM) in multi-account, multi-region setups
  • Optimize infrastructure for cost, performance, and scalability
  • Enforce security policies and compliance standards

Your background

Core:

  • 5+ years in SRE, DevOps, or Platform Engineering (7–10+ years preferred)
  • Expert in Kubernetes, Helm, and GitOps (ArgoCD/Flux)
  • Advanced observability skills (OpenTelemetry)
  • Proficient in Terraform and AWS services (RDS, VPC, IAM)
  • Strong scripting/automation (Python, Bash)
  • Proven leadership in incident investigations and system troubleshooting
  • Expertise in CI/CD pipelines (e.g., GitHub Actions)
  • Solid Linux operations and networking fundamentals
  • Experience operating production-grade platforms at scale

Preferred:

  • PostgreSQL administration and migration automation
  • Service mesh expertise (e.g., Istio)
  • Mentoring junior engineers in operational practices
  • Security engineering in cloud environments

Certifications (Preferred):

  • CKA/CKAD
  • AWS Solutions Architect or DevOps Engineer
  • Terraform Associate

To apply

If you’re excited about this opportunity and believe you’d be a great fit, we’d love to hear from you! Send a short note to careers@readyon.ai explaining why you’re interested in joining ReadyOn.

We encourage you to apply even if your experience doesn't exactly match our job description. We are committed to building a diverse and inclusive workplace where everyone (regardless of age, religion, ethnicity, gender, sexual orientation, and more) feels like they belong.