Infrastructure & DevOps Engineer: Distributed LLM Training
Arcee.ai
This job is no longer accepting applications
About the role:
As an Infrastructure & DevOps Engineer: Distributed LLM Training at Arcee, you will collaborate closely with our Product and Research Engineering teams to design and implement processes and systems that ensure the stability and availability of our services. Your expertise in AWS and modern DevOps technologies will be key to optimizing our infrastructure for continual pretraining and model merging.
What you’ll do:
- Develop and maintain cloud infrastructure on AWS using modern technologies such as Terraform and Kubernetes.
- Instrument, monitor, and optimize the performance and reliability of our AI models and HPC cluster.
- Implement and maintain automation tools and processes to prevent and mitigate service disruptions.
- Create and manage high-performance computing clusters using Slurm on AWS, with a focus on efficient continual pretraining.
- Develop and implement strategies for efficient model merging to enhance AI performance.
- Ensure security best practices are applied to application build pipelines and cloud/SaaS infrastructure.
- Stay updated on the latest trends and best practices in DevOps, Security, and AI to continuously improve our products.
What we’re seeking:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- Proven experience in DevOps and cloud infrastructure management, with a focus on AWS.
- Strong proficiency in infrastructure-as-code tools such as Terraform.
- Experience with continuous deployment tools such as ArgoCD.
- Proficiency in container orchestration using Kubernetes.
- Experience with high-performance computing clusters and job scheduling systems like Slurm.
- Understanding of continual pretraining and model merging techniques.
- Strong understanding of monitoring and logging tools and best practices.
- Knowledge of security best practices in cloud and application build pipelines.
- Experience with other cloud platforms such as Google Cloud or Azure.
- Familiarity with AI/ML infrastructure and distributed training environments.
- Knowledge of MLOps practices and tools for managing ML workflows.
- Prior experience in a startup environment or a fast-paced, dynamic work setting.
- Excellent problem-solving skills and the ability to work effectively in a collaborative startup environment.
- Strong communication skills, with the ability to explain complex technical concepts to both technical and non-technical stakeholders.
About Arcee.AI
Arcee.AI was founded on a mission of empowering companies to productionalize LLMs, broadening access to world-class LLM technology, and furthering the frontiers of Generative AI research. We are firmly committed to the values of the Open Source community, in particular through our work on the arcee-ai/mergekit repository.
Equal Opportunity
We are an Equal Opportunity Employer, offering equal opportunity to all regardless of race, religion, gender identity, sexual orientation, age, citizenship, marital status, disability, and more. We would like to remind candidates that the listed qualifications for each role are not hard requirements, and we encourage them to apply if they feel they would be a good fit.
Compensation
We offer competitive salaries, equity, and benefits. We base our salaries on location, role, and level as well as consideration of the candidate’s experience and overall qualifications.