Data Engineer: LLM Training Data

Arcee.ai

Data Science
Remote
Posted on Jul 24, 2024

About the role:

As a Data Engineer: LLM Training Data at Arcee, you will be responsible for designing and implementing data pipelines and processes that ensure the integrity and quality of data used for training Large Language Models, as well as synthetic dataset generation for LLM training. You will collaborate closely with our Researchers, Machine Learning Engineers, and other stakeholders to gather requirements and ensure the availability of high-quality datasets.

What you’ll do:

  • Source and acquire diverse and high-quality datasets from various sources.
  • Develop and maintain robust data pipelines to ingest, process, and transform raw data into formats suitable for LLM training.
  • Clean and preprocess data, including handling missing values, normalization, deduplication, and other data quality tasks.
  • Implement data validation and monitoring processes to ensure the accuracy and consistency of datasets.
  • Implement a data pipeline to augment datasets with public data as well as synthetically generated data.
  • Collaborate with Researchers and Machine Learning Engineers to understand data requirements and deliver datasets that meet their needs.
  • Optimize data storage and retrieval processes for efficiency and scalability.
  • Stay up-to-date with industry best practices and emerging technologies in data engineering and AI.

What we’re seeking:

  • Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field.
  • Proven experience in data engineering, with a focus on data sourcing, preparation, and cleaning.
  • Strong programming skills in languages such as Python and SQL, and familiarity with data engineering frameworks (e.g., Apache Spark, Apache Kafka).
  • Experience with cloud platforms (AWS in particular) and data storage solutions (e.g., S3, BigQuery, Redshift).
  • Knowledge of ETL (Extract, Transform, Load) processes and tools.
  • Understanding of data quality best practices and techniques for ensuring data integrity.
  • Experience working with large-scale datasets and distributed data processing systems.
  • Familiarity with Machine Learning concepts and the specific data requirements for training Large Language Models.
  • Knowledge of MLOps practices and tools for managing data pipelines and workflows.
  • Strong problem-solving skills and the ability to work effectively in a collaborative startup environment.
  • Excellent communication skills, with the ability to explain complex data concepts to both technical and non-technical stakeholders.
  • Prior experience in a startup environment or a fast-paced, dynamic work setting.

About Arcee.AI

Arcee.AI was founded on a mission of empowering companies to productionize LLMs, broadening access to world-class LLM technology, and furthering the frontiers of Generative AI research. We are firmly committed to the values of the Open Source community, in particular through our work on our arcee-ai/mergekit repository.

Equal Opportunity

We are an Equal Opportunity Employer, offering equal opportunity to all regardless of race, religion, gender identity, sexual orientation, age, citizenship, marital status, disability, and more. We would like to remind candidates that the listed qualifications for each role are not hard requirements, and we encourage them to apply if they feel they would be a good fit.

Compensation

We offer competitive salaries, equity, and benefits. We base our salaries on location, role, and level as well as consideration of the candidate’s experience and overall qualifications.