Senior Site Reliability Engineer (SRE), Platform Team

ReadyOn

Software Engineering

San Francisco, CA, USA

Posted 6+ months ago

About the role

As a Senior Site Reliability Engineer, you’ll lead the reliability and operations of our Kubernetes-based platform, ensuring scalable, resilient application deployments via GitOps. You’ll drive operational excellence by leading incident investigations, root cause analysis, and post-mortems to minimize downtime. Your expertise in observability, Kubernetes, ArgoCD, and CI/CD will strengthen developer productivity and system stability. We’re seeking a proactive go-getter who thrives in fast-paced startup environments, anticipates issues, and champions continuous improvement in reliability and performance.

Responsibilities

Infrastructure & Observability:

Manage multi-cluster Kubernetes environments for high availability
Design and enhance observability systems (monitoring, logging, tracing with Prometheus, OpenTelemetry, Grafana, ELK)
Configure ingress, service mesh, and networking for performance and security
Implement Infrastructure as Code (IaC) using Terraform for AWS infrastructure

Operations & Incident Leadership:

Lead incident response, investigations, and post-mortems to drive root cause resolution
Design proactive monitoring and alerting systems to minimize outages
Establish SLAs/SLOs and conduct chaos engineering to improve reliability
Build and automate remediation runbooks
Lead and coordinate regular failover and disaster recovery drills to validate recovery procedures and minimize RTO/RPO
Mentor teams on operational best practices and troubleshooting

Platform Engineering:

Develop and maintain internal tools and automation using ArgoCD for Kubernetes continuous delivery, enhancing developer productivity and platform reliability
Design and implement platform features (e.g., service templates, libraries, security controls) supporting scalable, automated GitOps deployments
Drive cloud-native technology adoption and optimization, integrating with continuous delivery pipelines
Collaborate with application teams, providing platform support and promoting reliability and scalability best practices

Cloud Operations:

Manage Kubernetes on AWS and related AWS services (RDS, VPC, IAM) in multi-account, multi-region setups
Optimize infrastructure for cost, performance, and scalability
Enforce security policies and compliance standards

Your background

Core:

5+ years in SRE, DevOps, or Platform Engineering (7–10+ years preferred)
Expert in Kubernetes, Helm, and GitOps (ArgoCD/Flux)
Advanced observability skills (OpenTelemetry)
Proficient in Terraform and AWS services (RDS, VPC, IAM)
Strong scripting/automation (Python, Bash)
Proven leadership in incident investigations and system troubleshooting
Expertise in CI/CD pipelines (e.g., GitHub Actions)
Solid Linux operations and networking fundamentals
Experience operating production-grade platforms at scale

Preferred:

PostgreSQL administration and migration automation
Service mesh expertise (e.g., Istio)
Mentoring junior engineers in operational practices
Security engineering in cloud environments

Certifications (Preferred):

CKA/CKAD
AWS Solutions Architect or DevOps Engineer
Terraform Associate

To apply

If you’re excited about this opportunity and believe you’d be a great fit, we’d love to hear from you! Send a short note to careers@readyon.ai explaining why you’re interested in joining ReadyOn.

We encourage you to apply even if your experience doesn’t exactly match our job description. We are committed to building a diverse and inclusive workplace where everyone (regardless of age, religion, ethnicity, gender, sexual orientation, and more) feels like they belong.
