Find a career with Emergence Capital Partners companies

Explore career opportunities across the Emergence Capital portfolio.

ML Infra Engineer (Data Systems)

Physical Intelligence

Software Engineering, Data Science
San Francisco, CA, USA
Posted on Jan 24, 2026

Location

San Francisco

Employment Type

Full time

Location Type

On-site

Department

Engineering

As an ML Infra Engineer (Data Systems), you’ll build and operate the data infrastructure that powers large-scale robot learning. Your systems will sit directly between raw data sources and training/evaluation, enabling us to move faster while maintaining performance, correctness, and reliability at scale.

This is a systems role at the intersection of distributed systems, storage, and machine learning infrastructure.

The Team

The Infrastructure organization builds the foundations that make large-scale learning possible at PI. This includes training systems, data platforms, evaluation pipelines, and the tooling that allows researchers and roboticists to work with massive datasets safely and efficiently.

In This Role You Will

- Data Ingestion & Processing: Design and build high-throughput pipelines that validate, transform, and featurize raw multimodal data.

- Batch & Streaming Systems: Operate large-scale batch and streaming workflows over massive datasets.

- Storage Systems: Design object storage layouts, metadata systems, and efficient access patterns; choose file formats with performance and scalability in mind.

- Data Lifecycle Management: Build systems for backfills, dataset rebuilds, garbage collection, and large-scale transformations.

- Training-Time Performance: Optimize dataloaders, sharding, prefetching, caching, and throughput to reduce the time from data arrival to model training.

- Metadata & Indexing: Build scalable metadata stores for datasets, annotations, and training artifacts.

- Data Movement: Move hundreds of terabytes to petabytes efficiently across clusters and environments.

- Operational Correctness: Implement observability, validation, and guardrails to prevent silent data regressions.

- Cross-Functional Collaboration: Work closely with researchers, engineers, and roboticists to translate evolving data needs into robust systems.

What We Hope You’ll Bring

- Strong software engineering fundamentals.

- Experience building distributed systems or large-scale data pipelines.

- Comfort reasoning about performance, memory, I/O, and storage efficiency.

- Familiarity with batch and/or streaming processing systems.

- Experience with object storage systems and data format tradeoffs.

- Ownership mindset: design, build, operate, and iterate on systems end-to-end.

- Enjoyment of working closely with researchers and unblocking fast-moving projects.

Bonus Points If You Have

- Experience with large ML training pipelines or dataloading systems.

- Knowledge of columnar or custom data formats.

- Experience with systems like ClickHouse, Ray, Flink, Spark, or similar.

- Hands-on experience managing petabyte-scale datasets.

- Experience debugging and fixing performance bottlenecks in data-heavy systems.