Lead Software Engineer, AI/ML Model Inference Job at Annapurna Labs (U.S.) Inc., Cupertino, CA

  • Annapurna Labs (U.S.) Inc.
  • Cupertino, CA

Job Description

Salary: $151,300 - $261,500 per year

Requirements:

  • Bachelor’s degree in computer science or a related field
  • 5+ years of non-internship professional experience in software development
  • 5+ years of experience in designing or architecting new and existing systems with a focus on design patterns, reliability, and scalability
  • Solid understanding of machine learning fundamentals, particularly in large language models (LLMs), including architecture, training, and inference lifecycles, with hands-on experience in model optimization
  • Proficiency in software development using C++ or Python, with professional experience in at least one of these languages required
  • Strong grasp of system performance, memory management, and principles of parallel computing
  • Expertise in debugging, profiling, and applying best practices in software engineering in large-scale systems
Responsibilities:
In this pivotal role, you will lead efforts to develop distributed inference support for PyTorch within the Neuron SDK, optimizing models to maximize their performance and efficiency on AWS Trainium and Inferentia silicon and servers. Your responsibilities include:
  • Designing, developing, and fine-tuning machine learning models and frameworks for deployment on custom ML hardware accelerators
  • Participating in all phases of the ML system development lifecycle, including architecture design, implementation, performance profiling, hardware-specific optimizations, testing, and production deployment
  • Creating infrastructure for systematic analysis and onboarding of various models with diverse architectures
  • Designing and implementing high-performance kernels and features for ML operations, leveraging the Neuron architecture and programming models
  • Analyzing and optimizing system-level performance across multiple generations of Neuron hardware
  • Conducting detailed performance analysis using profiling tools to identify and address bottlenecks
  • Implementing optimizations such as fusion, sharding, tiling, and scheduling
  • Conducting comprehensive testing, including unit and end-to-end testing with continuous deployment through pipelines
  • Collaborating directly with customers to enable and optimize their ML models on AWS accelerators
  • Innovating optimization techniques in collaboration with cross-functional teams
Technologies:
  • AI
  • AWS
  • Hardware
  • Machine Learning
  • PyTorch
  • Python
  • Cloud
  • CUDA
  • GitHub
  • LLM

More:

As part of the Inference Enablement and Acceleration team, you will contribute to pioneering efforts that enhance inference capabilities for Generative AI applications. Working with a cross-functional team of applied scientists, system engineers, and product managers, you will debug performance issues, optimize memory usage, and influence the future of Neuron's inference stack across Amazon and the open-source community. You will be expected to build impactful solutions for an extensive customer base and to participate actively in design discussions, code reviews, and communication with both internal and external stakeholders. The team operates in a startup-like environment focused on innovation and prioritizing important initiatives, and promotes a culture of builders that emphasizes collaboration, technical ownership, and continuous learning while supporting new members. Knowledge-sharing and mentorship are valued, fostering an environment for career growth and technical excellence. Join us to tackle some of the most fascinating and influential challenges in AI/ML infrastructure today.

Last updated: week 48 of 2025

Job Tags

Full time