Member of Technical Staff, AI Supercomputing

Radical Numerics

Software Engineering, IT, Data Science

San Francisco, CA, USA · Tokyo, Japan

Posted on Jun 9, 2026

Location

San Francisco; Tokyo

Employment Type

Full time

Location Type

On-site

Department

R&D

About Radical Numerics

Radical Numerics is an AI lab bringing the rigor of distributed systems, model architecture, and numerics research to the challenges of biology. We are building the infrastructure needed to unlock scaling on vast biological sequence, structure, and image datasets so that biological world models become a reality. Our team introduced hybrid architectures that unlocked million-token context windows, enabling work toward AI-designed whole genomes and real gene-editing tools.

We believe biological world models will require not only strong research ideas, but the supercomputing environment that powers them: GPU clusters that are efficient, reliable, and cost-effective enough to support rapid scientific iteration at scale. This role is focused on building and operating that foundation.

About the Role

As a Member of Technical Staff, AI Supercomputing at Radical Numerics, you will design, build, and operate the GPU supercomputing environment that powers our large-scale training and inference. You will deliver high-performance, reliable, and cost-efficient compute so our researchers can move fast at scale, turning frontier infrastructure into the foundation for the next generation of biological world models.

This role is ideal for someone who combines deep operational instincts with an interest in modern machine learning. You should care about how every layer of the cluster affects research velocity: provisioning and capacity, scheduling and multi-tenancy, storage and lineage, communication overhead, observability, and the reliability of long-running jobs across thousands of accelerators.

What You'll Do

  • Operate and automate large GPU clusters. Own provisioning, imaging, and capacity planning across large distributed compute systems, with a focus on uptime, utilization, and cost efficiency.

  • Build a unified compute interface. Write software that abstracts cluster management and presents a single, ergonomic interface for training and inference, so researchers spend their time on science rather than infrastructure.

  • Extend scheduling and orchestration. Adapt systems like Kubernetes or Slurm for topology-aware placement, preemption, quotas, and fair-share multi-tenancy across competing workloads.

  • Maximize throughput and hardware efficiency. Profile and tune performance across the stack, including communication patterns, memory efficiency, custom kernels, compilation paths, and systems instrumentation, to ensure training compute is used effectively.

  • Improve reliability and recovery. Establish standards and mechanisms for robustness and error recovery, including monitoring, fault tolerance, checkpointing, and incident analysis for fast-moving research infrastructure.

  • Build reliable storage and artifact paths. Design durable paths for datasets, checkpoints, and logs, with clear retention and lineage that support reproducible, large-scale experimentation.

  • Collaborate across research and engineering. Partner closely with model researchers and training scientists to unblock large-scale runs, advise on parallelism and performance trade-offs, and design systems that support new scientific directions rather than constrain them.

What We're Looking For

  • Proven track record operating large-scale GPU clusters with orchestration and scheduling systems such as Kubernetes or Slurm.

  • Proficiency in building maintainable software in at least one backend language (we use Python and Rust), with a focus on performance and reliability.

  • Strong systems background spanning Linux, networking, and infrastructure-as-code.

  • Strong understanding of modern deep learning frameworks and the systems stack beneath them (e.g., PyTorch, Triton, CUDA, C++).

  • Ability to debug complex, multi-layered systems involving distributed training, memory and performance regressions, and reliability issues in large codebases.

  • Comfort operating across the stack and owning projects end to end, with a bias toward initiative and execution.

  • Excellent written and verbal communication skills bridging technical and scientific domains.

Nice to Have

  • Familiarity with CUDA/NCCL and performance profiling for distributed training and inference.

  • Experience supporting large-scale distributed training for frontier or foundation models.

  • Contributions to open-source ML systems or infrastructure such as PyTorch, Torchtitan, or Megatron-LM.

  • Familiarity with ML runtimes, compilers, numerics, communication libraries, and custom kernel development.

  • Experience improving researcher productivity through infrastructure design, developer tooling, or workflow improvements.

  • Background in applied math, systems, computational biology, or related quantitative sciences.

Why Radical Numerics

  • Help build the computational foundation for multimodal biological world models aimed at rapid detection, response, and countermeasures across global health.

  • Work on systems problems at the frontier of distributed training, architecture, and numerics, in service of real biological applications.

  • Join a collaborative culture that values rigor, creativity, and cross-disciplinary partnership across AI labs, biotechs, hospital systems, and research institutes.

  • Competitive compensation, comprehensive benefits, and support for continual learning.

Radical Numerics is committed to equal employment opportunity and does not discriminate in any employment opportunities or practices based on an individual's race, color, creed, gender (including gender identity and gender expression), religion (all aspects of religious beliefs, observance or practice, including religious dress or grooming practices), marital status, registered domestic partner status, age, national origin or ancestry (including language use restrictions and possession of a driver’s license issued under California Vehicle Code section 12801.9), natural hair, physical or mental disability, political affiliation, medical condition (including cancer or a record or history of cancer, and genetic characteristics), sex (including pregnancy, childbirth, breastfeeding or related medical condition), genetic information, sexual orientation, military and veteran status or any other consideration made unlawful by federal, state, or local laws. It also prohibits unlawful discrimination based on the perception that anyone has any of those characteristics, or is associated with a person who has or is perceived as having any of those characteristics.