Lead Software Engineer, AI/ML Model Inference Job at Annapurna Labs (U.S.) Inc., Cupertino, CA

  • Annapurna Labs (U.S.) Inc.
  • Cupertino, CA

Job Description

Salary: $151,300 - $261,500 per year

Requirements:

  • Bachelor’s degree in computer science or a related field
  • 5+ years of non-internship professional experience in software development
  • 5+ years of experience in designing or architecting new and existing systems with a focus on design patterns, reliability, and scalability
  • Solid understanding of machine learning fundamentals, particularly in large language models (LLMs), including architecture, training, and inference lifecycles, with hands-on experience in model optimization
  • Proficiency in software development using C++ or Python, with professional experience in at least one of these languages required
  • Strong grasp of system performance, memory management, and principles of parallel computing
  • Expertise in debugging, profiling, and applying best practices in software engineering in large-scale systems
Responsibilities:
In this pivotal role, you will lead efforts to develop distributed inference support for PyTorch within the Neuron SDK, optimizing models to maximize their performance and efficiency on AWS Trainium and Inferentia silicon and servers. Your responsibilities include:
  • Designing, developing, and fine-tuning machine learning models and frameworks for deployment on custom ML hardware accelerators
  • Participating in all phases of the ML system development lifecycle, including architecture design, implementation, performance profiling, hardware-specific optimizations, testing, and production deployment
  • Creating infrastructure for systematic analysis and onboarding of various models with diverse architectures
  • Designing and implementing high-performance kernels and features for ML operations, leveraging the Neuron architecture and programming models
  • Analyzing and optimizing system-level performance across multiple generations of Neuron hardware
  • Conducting detailed performance analysis using profiling tools to identify and address bottlenecks
  • Implementing optimizations such as fusion, sharding, tiling, and scheduling
  • Conducting comprehensive testing, including unit and end-to-end testing with continuous deployment through pipelines
  • Collaborating directly with customers to enable and optimize their ML models on AWS accelerators
  • Innovating optimization techniques in collaboration with cross-functional teams
Technologies:
  • AI
  • AWS
  • Hardware
  • Machine Learning
  • PyTorch
  • Python
  • Cloud
  • CUDA
  • GitHub
  • LLM

More:

As part of the Inference Enablement and Acceleration team, you will contribute to pioneering efforts that enhance inference capabilities for Generative AI applications. Working with a cross-functional team of applied scientists, system engineers, and product managers, you will debug performance issues, optimize memory usage, and influence the future of Neuron's inference stack across Amazon and the open-source community. You will be expected to build impactful solutions for an extensive customer base and to participate actively in design discussions, code reviews, and communication with both internal and external stakeholders. The team operates in a startup-like environment focused on innovation and prioritizing important initiatives, and promotes a culture of builders that emphasizes collaboration, technical ownership, and continuous learning while supporting new members. Knowledge-sharing and mentorship are valued, fostering an environment for career growth and technical excellence. Join us to tackle some of the most fascinating and influential challenges in AI/ML infrastructure today.

Last updated: week 48 of 2025

Job Tags

Full time