AWS Trainium

Get high performance for deep learning and generative AI training while lowering costs

Why Trainium?

AWS Trainium chips are a family of AI chips purpose-built by AWS for AI training and inference, delivering high performance while reducing costs.

The first-generation AWS Trainium chip powers Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances, which have up to 50% lower training costs than comparable Amazon EC2 instances. Many customers, including Databricks, Ricoh, NinjaTech AI, and Arcee AI, are realizing the performance and cost benefits of Trn1 instances.

The AWS Trainium2 chip delivers up to 4x the performance of first-generation Trainium. Trainium2-based Amazon EC2 Trn2 instances are purpose-built for generative AI and are the most powerful EC2 instances for training and deploying models with hundreds of billions to more than a trillion parameters. Trn2 instances offer 30-40% better price performance than the current generation of GPU-based EC2 P5e and P5en instances. Each Trn2 instance features 16 Trainium2 chips interconnected with NeuronLink, our proprietary chip-to-chip interconnect. You can use Trn2 instances to train and deploy the most demanding models, including large language models (LLMs), multi-modal models, and diffusion transformers, to build a broad set of next-generation generative AI applications.

Trn2 UltraServers, a completely new EC2 offering (available in preview), are ideal for the largest models that require more memory and memory bandwidth than standalone EC2 instances can provide. The UltraServer design uses NeuronLink to connect 64 Trainium2 chips across four Trn2 instances into one node. For inference, UltraServers help deliver industry-leading response time to create the best real-time experiences. For training, UltraServers improve model training speed and efficiency with faster collective communication for model parallelism than standalone instances provide.

You can get started training and deploying models on Trn2 and Trn1 instances with native support for popular machine learning (ML) frameworks such as PyTorch and JAX.

Benefits

Trn2 UltraServers and instances deliver breakthrough performance in Amazon EC2 for generative AI training and inference. Each Trn2 UltraServer has 64 Trainium2 chips interconnected with NeuronLink, our proprietary chip-to-chip interconnect, and delivers up to 83.2 petaflops of FP8 compute, 6 TB of HBM3 with 185 terabytes per second (TBps) of memory bandwidth, and 12.8 terabits per second (Tbps) of Elastic Fabric Adapter (EFA) networking. Each Trn2 instance has 16 Trainium2 chips connected with NeuronLink and delivers up to 20.8 petaflops of FP8 compute, 1.5 TB of HBM3 with 46 TBps of memory bandwidth, and 3.2 Tbps of EFA networking. Each Trn1 instance features up to 16 Trainium chips and delivers up to 3 petaflops of FP8 compute, 512 GB of HBM with 9.8 TBps of memory bandwidth, and up to 1.6 Tbps of EFA networking.
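As a quick sanity check on how these figures relate, the sketch below scales the per-instance Trn2 numbers quoted above by four (one UltraServer spans four Trn2 instances) and divides them by 16 to get rough per-chip values. The dictionary keys and the `scale` helper are illustrative names, not part of any AWS API, and the per-chip values are simple averages derived from the quoted totals rather than published specifications.

```python
# Per-instance Trn2 figures as quoted in the text above.
TRN2_INSTANCE = {
    "chips": 16,
    "fp8_petaflops": 20.8,
    "hbm_tb": 1.5,
    "hbm_bandwidth_tbps": 46.0,  # terabytes per second
    "efa_tbps": 3.2,             # terabits per second
}

# The text states that one UltraServer connects four Trn2 instances.
INSTANCES_PER_ULTRASERVER = 4


def scale(spec, factor):
    """Multiply every figure in a spec dictionary by the same factor."""
    return {name: value * factor for name, value in spec.items()}


# Aggregate figures for one UltraServer (4 x instance).
ultraserver = scale(TRN2_INSTANCE, INSTANCES_PER_ULTRASERVER)

# Rough per-chip averages (instance / 16); derived, not published specs.
per_chip = scale(TRN2_INSTANCE, 1 / TRN2_INSTANCE["chips"])

for name in TRN2_INSTANCE:
    print(f"{name:>20}: instance={TRN2_INSTANCE[name]:g} "
          f"ultraserver={ultraserver[name]:g} per_chip={per_chip[name]:g}")
```

The scaled compute, memory, and networking values match the UltraServer figures quoted above. The scaled memory bandwidth comes out at 184 TBps rather than the quoted 185 TBps, which is presumably because the per-instance 46 TBps figure is itself rounded.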

AWS Neuron SDK helps you extract the full performance from Trn2 and Trn1 instances so you can focus on building and deploying models and accelerating your time to market. AWS Neuron integrates natively with JAX, PyTorch, and essential libraries like Hugging Face, PyTorch Lightning, and NeMo. AWS Neuron supports over 100,000 models on the Hugging Face model hub including popular models such as Meta’s Llama family of models and Stable Diffusion XL. It optimizes models out of the box for distributed training and inference, while providing deep insights for profiling and debugging. AWS Neuron integrates with services such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), AWS ParallelCluster, and AWS Batch, as well as third-party services like Ray (Anyscale), Domino Data Lab, and Datadog.

To deliver high performance while meeting accuracy goals, Trainium chips are optimized for FP32, TF32, BF16, FP16, and the new configurable FP8 (cFP8) data type. To support the fast pace of innovation in generative AI, Trainium2 has hardware optimizations for 4x sparsity (16:4), micro-scaling, stochastic rounding, and dedicated collective engines.
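To make two of these terms concrete, here is a minimal pure-Python sketch of the underlying ideas. It assumes that "16:4" means keeping 4 nonzero values in every block of 16 (which gives the 4x reduction), and it shows stochastic rounding as rounding up with probability equal to the fractional remainder. The function names `prune_n_of_m` and `stochastic_round` are hypothetical. The code illustrates the concepts only and says nothing about how Trainium2 implements them in hardware.

```python
import random


def prune_n_of_m(values, keep=4, group=16):
    """Zero all but the `keep` largest-magnitude entries in each block of `group`."""
    if len(values) % group != 0:
        raise ValueError("length must be a multiple of the group size")
    out = []
    for start in range(0, len(values), group):
        block = values[start:start + group]
        # Positions of the `keep` largest magnitudes within this block.
        ranked = sorted(range(group), key=lambda i: abs(block[i]), reverse=True)
        kept = set(ranked[:keep])
        out.extend(v if i in kept else 0.0 for i, v in enumerate(block))
    return out


def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, rounding up with probability
    equal to the fractional remainder, so the result is unbiased on average."""
    lower = (x // step) * step
    frac = (x - lower) / step
    return lower + step if rng.random() < frac else lower


rng = random.Random(0)  # fixed seed so the run is repeatable

# Structured sparsity: 32 random weights become two blocks with 4 survivors each.
weights = [rng.uniform(-1.0, 1.0) for _ in range(32)]
sparse = prune_n_of_m(weights)
nonzero_per_block = [sum(1 for v in sparse[i:i + 16] if v != 0.0) for i in (0, 16)]

# Stochastic rounding: 0.3 rounds to 0.0 or 1.0, averaging close to 0.3.
samples = [stochastic_round(0.3, 1.0, rng) for _ in range(10_000)]
mean_rounded = sum(samples) / len(samples)

print(nonzero_per_block, mean_rounded)
```

Round-to-nearest would map 0.3 to 0.0 every time, so small updates can vanish at low precision. Stochastic rounding preserves them in expectation, which is why it is commonly paired with reduced-precision data types such as BF16 and FP8.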

Neuron Kernel Interface (NKI) enables direct access to the instruction set architecture (ISA) using a Python-based environment with a Triton-like interface, allowing you to innovate new model architectures and highly optimized compute kernels that outperform existing techniques.

Trn2 instances are designed to be three times more energy efficient than Trn1 instances. Trn1 instances are up to 25% more energy efficient than comparable accelerated computing EC2 instances. These instances help you meet your sustainability goals when training ultra-large models.

Videos

Behind the scenes look at generative AI infrastructure at Amazon
Accelerate DL and innovate faster with AWS Trainium
Introducing Amazon EC2 Trn1 instances powered by AWS Trainium