Find a career with Emergence Capital Partners companies

Explore career opportunities across the Emergence Capital portfolio.

ML Infra Engineer (Data Systems)

Physical Intelligence

Software Engineering, Data Science

San Francisco, CA, USA

Posted 6+ months ago

Who We Are

Physical Intelligence is bringing general-purpose AI into the physical world. We are a group of engineers, scientists, roboticists, and company builders developing foundation models and learning algorithms to power the robots of today and the physically actuated devices of the future.

As an ML Infra Engineer (Data Systems), you’ll build and operate the data infrastructure that powers large-scale robot learning. Your systems will sit directly between raw data sources and training/evaluation, enabling us to move faster while maintaining performance, correctness, and reliability at scale.

This is a systems role at the intersection of distributed systems, storage, and machine learning infrastructure.

The Team

The Infrastructure organization builds the foundations that make large-scale learning possible at PI. This includes training systems, data platforms, evaluation pipelines, and the tooling that allows researchers and roboticists to work with massive datasets safely and efficiently.

In This Role You Will

Data Ingestion & Processing: Design and build high-throughput pipelines that validate, transform, and featurize raw multimodal data.
Batch & Streaming Systems: Operate large-scale batch and streaming workflows over massive datasets.
Storage Systems: Design object storage layouts, metadata systems, and efficient access patterns; choose file formats with performance and scalability in mind.
Data Lifecycle Management: Build systems for backfills, dataset rebuilds, garbage collection, and large-scale transformations.
Training-Time Performance: Optimize dataloaders, sharding, prefetching, caching, and throughput to reduce the time from data arrival to model training.
Metadata & Indexing: Build scalable metadata stores for datasets, annotations, and training artifacts.
Data Movement: Move petabytes efficiently across clusters and environments.
Operational Correctness: Implement observability, validation, and guardrails to prevent silent data regressions.
Cross-Functional Collaboration: Work closely with researchers, engineers, and roboticists to translate evolving data needs into robust systems.

What We Hope You’ll Bring

Strong software engineering fundamentals.
Experience building distributed systems or large-scale data pipelines.
Comfort reasoning about performance, memory, I/O, and storage efficiency.
Familiarity with batch and/or streaming processing systems.
Experience with object storage systems and data format tradeoffs.
Ownership mindset: design, build, operate, and iterate on systems end-to-end.
Enjoyment of working closely with researchers and unblocking fast-moving projects.

Bonus Points If You Have

Experience with large ML training pipelines or dataloading systems.
Knowledge of columnar or custom data formats.
Experience with systems like ClickHouse, Ray, Flink, Spark, or similar.
Hands-on experience operating petabyte-scale datasets.
Experience debugging and fixing performance bottlenecks in data-heavy systems.

Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.
