Cloud - Staff Site Reliability Engineer
Bedrock Robotics
Location
New York, NY, San Francisco, CA, Remote
Employment Type
Full time
Location Type
Hybrid
Department
Engineering
Join the team bringing advanced autonomy to the built world
At Bedrock, we’re moving AI out of the lab and into the real world. Our team is composed of industry veterans who helped launch Waymo, scaled Segment to a $3.2B acquisition, and grew Uber Freight to $5B in revenue. Today, we’re deploying autonomous systems on heavy construction machinery across the country, accelerating schedules on billion-dollar infrastructure projects and improving safety on job sites. Backed by $350M in funding, we’re working quickly to close the gap between America's surging demand for housing, data centers, and manufacturing hubs and the construction industry's growing labor shortage.
This is where algorithms meet steel-toed boots. You’ll collaborate with construction veterans and world-class engineers to solve physical-world problems that simulations can’t touch. If you're ready to apply cutting-edge technology to meaningful problems alongside a talented team, we'd love to have you join us.
The Role:
We are seeking an experienced Staff Site Reliability Engineer to own and evolve our cloud infrastructure, with a focus on scalable design, operational excellence, and system reliability.
The ideal candidate brings a strong production-engineering mindset and a deep commitment to observability, resilience, and well-instrumented distributed systems. They hold a high bar for production readiness and believe no service should ship without meaningful telemetry and safeguards in place.
This role is critical to scaling the infrastructure that underpins our core data pipelines and directly enables our Machine Learning and Robotics engineering teams. If you enjoy tackling complex production challenges and building robust, highly scalable systems, this role offers significant scope and impact.
What you'll do:
System Design & Operations: Design, build, and operate highly scalable, reliable systems used by all Bedrock engineering teams.
Cloud Infrastructure Ownership: Take full ownership of Bedrock’s cloud infrastructure (AWS, GCP, Azure), ensuring best-in-class security, performance, and cost efficiency.
Observability Stack: Design, implement, and maintain Bedrock’s end-to-end observability stack (including monitoring, logging, and tracing).
Production Excellence: Pave the road for production engineering by developing and implementing best practices for system reliability, security, on-call rotation, and effective incident response.
Performance & Cost Optimization: Continuously identify and implement improvements to enhance system performance and optimize cloud resource consumption.
Required Qualifications:
6+ years building and maintaining reliable, fault-tolerant distributed systems.
Strong proficiency in major cloud platforms (such as AWS, GCP, or Azure) and Infrastructure as Code (IaC) tools like Terraform.
Proven experience with container technologies and orchestration platforms, particularly Kubernetes.
Hands-on experience with observability techniques and tools (e.g., Datadog, Prometheus, Splunk).
Strong understanding of distributed systems, networking concepts, database technologies, and compute infrastructure.
Strong understanding of, and experience implementing, security best practices in cloud environments.
Ability to work in a fast-paced, high-growth environment, deal effectively with ambiguity, and take decisive ownership of challenging problems.
Our roles are often flexible. If you don't fit all the criteria, or are in another location (especially one where we have an office, such as SF or NY), please apply anyway! We'd love to consider you.