AMD Instinct MI355X GPUs Exceed 1M Tokens/Sec in MLPerf 6.0, Advancing Distributed Inference

MLPerf 6.0 results reveal how AMD is advancing large-scale inference with faster model deployment, flexible architectures, and ecosystem-wide consistency

AMD’s submission to MLCommons for MLPerf Inference 6.0 represents a significant inflection point in the evolution of AI inference infrastructure. Rather than simply delivering incremental performance improvements, the company demonstrated a comprehensive advancement across multiple dimensions: raw throughput, scalability, model diversity, ecosystem validation, and software maturity. These results collectively signal a transition from isolated benchmark wins to production-ready, enterprise-scale AI deployment capabilities.

A Broader Definition of Inference Performance

Historically, inference benchmarking focused on single metrics—typically latency or throughput on a fixed workload. However, enterprise buyers now evaluate platforms holistically. They demand consistent single-node performance, efficient scaling across clusters, rapid enablement of new models, reproducibility across vendors, and a software ecosystem capable of sustaining innovation.

MLPerf Inference 6.0 provided a platform for AMD to address all these criteria simultaneously. The company’s results were not limited to a single configuration or workload. Instead, they spanned multiple AI models, deployment scenarios, and hardware environments, demonstrating a level of maturity that aligns with real-world production requirements.

AMD Instinct MI355X: Architecture Built for Modern AI

At the center of these achievements is the AMD Instinct MI355X GPU, built on the CDNA 4 architecture and manufactured using a 3nm process. This GPU represents a substantial leap in design, integrating approximately 185 billion transistors and supporting emerging low-precision data types such as FP4 and FP6—critical for optimizing large language model (LLM) inference.

The MI355X also features up to 288GB of HBM3E memory, enabling it to handle extremely large models—up to 520 billion parameters—on a single GPU. With compute performance reaching up to 10 petaflops in FP4/FP6 operations, the platform is engineered not only for speed but also for capacity and deployment flexibility.
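The 520-billion-parameter figure follows from straightforward low-precision arithmetic. A minimal back-of-the-envelope sketch, assuming 4-bit (FP4) weights at 0.5 bytes per parameter and ignoring the additional memory needed for activations and KV cache:

```python
# Rough check: does a 520B-parameter model fit in 288GB of HBM3E?
# Assumes FP4 weights (4 bits = 0.5 bytes per parameter); activations
# and KV cache consume additional memory in a real deployment.
params = 520e9
bytes_per_param = 0.5
hbm_capacity_gb = 288

weights_gb = params * bytes_per_param / 1e9
print(f"Weight footprint: {weights_gb:.0f} GB of {hbm_capacity_gb} GB")
# Weight footprint: 260 GB of 288 GB -> fits, with headroom for runtime state
```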

Importantly, AMD has aligned the hardware with practical deployment requirements. The MI355X supports industry-standard UBB8 configurations and is available in both air-cooled and liquid-cooled systems, making it adaptable to diverse data center environments.

Breaking the 1 Million Tokens per Second Barrier

One of the most notable achievements in this MLPerf round is AMD’s ability to surpass 1 million tokens per second in inference throughput at multinode scale. This milestone was achieved using models such as Llama 2 70B and GPT-OSS-120B across Server and Offline scenarios.

This threshold is more than a symbolic achievement—it represents a shift toward production-grade AI infrastructure. In real-world deployments, inference workloads are rarely confined to a single node. Instead, they operate across clusters where aggregate throughput and latency determine user experience and system viability.

By exceeding 1 million tokens per second, AMD demonstrated that its platform can handle high-demand, large-scale inference workloads. This capability is particularly relevant for applications such as conversational AI, content generation, and enterprise copilots, where responsiveness and scalability are critical.
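The arithmetic behind the milestone is worth making explicit. Using the approximate figures reported elsewhere in this article (roughly 100,000 tokens per second per node on Llama 2 70B Server and about 93% multinode scaling efficiency across 11 nodes), a simple sketch shows how a cluster clears the threshold; these inputs are illustrative, not an independent measurement:

```python
# Aggregate throughput = per-node throughput x node count x scaling efficiency.
per_node_tps = 100_000   # approx. tokens/sec per MI355X node (Llama 2 70B Server)
nodes = 11
efficiency = 0.93        # multinode scaling efficiency reported in this round

aggregate_tps = per_node_tps * nodes * efficiency
print(f"{aggregate_tps:,.0f} tokens/sec")  # ~1,023,000 -> past the 1M mark
```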

A Generational Leap in Performance

The MI355X also delivered a significant generational improvement over its predecessor, the AMD Instinct MI325X GPU. On the Llama 2 70B Server benchmark, the MI355X achieved over 100,000 tokens per second—approximately 3.1 times higher than the MI325X.

This rapid performance gain within a relatively short timeframe underscores the effectiveness of AMD’s full-stack approach. The improvements are not solely attributable to hardware advancements but also to software optimizations within the ROCm platform, enhanced memory bandwidth, and support for lower-precision computation.

Such generational leaps are critical in the AI space, where model sizes and computational demands are growing exponentially. They provide organizations with a clear upgrade path and measurable ROI when adopting new hardware.

Competitive Single-Node Performance

In addition to scaling performance, AMD demonstrated strong competitiveness at the single-node level. On the widely recognized Llama 2 70B benchmark, the MI355X platform delivered results comparable to leading GPUs from NVIDIA, including the B200 and B300.

The platform achieved near parity or better performance across multiple scenarios:

  • Offline (batch processing): competitive throughput
  • Server (sustained inference): approximately 93–97% of competing systems’ throughput
  • Interactive (low-latency responses): up to 119% of competing systems’ performance in certain cases

This breadth of competitiveness is particularly important. It demonstrates that AMD’s solution is not optimized for a single scenario but performs consistently across diverse workloads, from high-throughput batch jobs to real-time interactive applications.
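The three scenarios exercise very different serving patterns. In the MLPerf LoadGen harness, Offline presents the entire query set up front and measures pure batch throughput, while Server issues queries with Poisson-distributed arrival times under per-query latency bounds. The sketch below illustrates the two arrival patterns in simplified form; it is not the actual LoadGen implementation:

```python
import random

def offline_arrivals(num_queries: int) -> list[float]:
    # Offline: every query is available at time zero, so the system
    # is free to batch aggressively for maximum throughput.
    return [0.0] * num_queries

def server_arrivals(num_queries: int, target_qps: float) -> list[float]:
    # Server: queries arrive with exponentially distributed gaps
    # (a Poisson process), so the system must sustain throughput
    # while meeting a latency target on each individual query.
    t, times = 0.0, []
    for _ in range(num_queries):
        t += random.expovariate(target_qps)
        times.append(t)
    return times
```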

Rapid Enablement of New Models

Another standout aspect of AMD’s submission is its ability to support new models quickly. GPT-OSS-120B, a first-time MLPerf workload, was successfully deployed and optimized within the benchmark timeframe.

First-time model bring-up is a complex process involving integration, optimization, and validation. Despite these challenges, the MI355X platform delivered performance exceeding that of competing systems in certain scenarios, including surpassing the NVIDIA B200 in both Offline and Server modes.

This capability highlights the flexibility of AMD’s platform and its readiness for emerging AI workloads. As the AI landscape evolves, organizations need infrastructure that can adapt quickly to new models without extensive reengineering.

Expanding into Multimodal AI

MLPerf Inference 6.0 also marked AMD’s entry into multimodal workloads, specifically text-to-video generation with the Wan-2.2-t2v model. This represents a significant expansion beyond traditional LLMs into generative media applications.

Even as a first-time effort, the MI355X platform delivered competitive performance relative to established solutions. Subsequent optimizations further improved results, demonstrating the platform’s potential for rapid performance tuning.

The importance of this development lies in the broader trend toward multimodal AI. Future applications will increasingly combine text, images, video, and audio, requiring infrastructure that can handle diverse data types efficiently.

Efficient Multinode Scaling

Scalability remains one of the most critical factors in AI deployment. AMD’s results show that the MI355X platform scales efficiently across multiple nodes, maintaining performance close to ideal linear scaling.

For example, on Llama 2 70B:

  • Scaling from 1 to 11 nodes achieved over 1 million tokens per second
  • Efficiency remained around 93% for both Offline and Server scenarios
  • Interactive workloads achieved up to 98% scaling efficiency

Similar results were observed with GPT-OSS-120B, where the system maintained over 90% efficiency across 12 nodes.
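Scaling efficiency here is simply measured throughput divided by what perfect linear scaling from one node would predict. A minimal sketch of the calculation, with an aggregate throughput value chosen to be consistent with the reported ~93%:

```python
def scaling_efficiency(single_node_tps: float, nodes: int, measured_tps: float) -> float:
    # Efficiency = measured aggregate throughput / ideal linear throughput.
    return measured_tps / (single_node_tps * nodes)

# Illustrative Llama 2 70B figures consistent with this round's results:
print(f"{scaling_efficiency(100_000, 11, 1_023_000):.0%}")  # -> 93%
```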

These results demonstrate that AMD’s platform can expand seamlessly as workload demands increase. High scaling efficiency ensures better utilization of hardware resources, reducing cost per token and improving overall system economics.

Ecosystem Validation and Reproducibility

A key differentiator in AMD’s MLPerf submission is the strength of its ecosystem. Nine partners—including major OEMs and cloud providers—submitted results using AMD Instinct GPUs, spanning multiple generations.

Importantly, these partner submissions closely matched AMD’s own results, often within a 4% margin. This level of reproducibility indicates that performance is not confined to controlled lab environments but can be replicated across diverse systems and configurations.

For enterprise customers, this translates into reduced deployment risk and greater confidence in achieving expected performance outcomes.

Heterogeneous Computing Across Geographies

AMD also demonstrated a forward-looking capability: heterogeneous inference across multiple GPU generations and geographic locations. A joint submission combined MI300X, MI325X, and MI355X GPUs across systems located in different regions.

This configuration achieved strong performance on Llama 2 70B, proving that distributed inference can be orchestrated effectively even in non-uniform environments.
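One way to reason about orchestration over a non-uniform fleet is capacity-weighted routing, where each GPU pool receives traffic in proportion to its measured throughput. The sketch below is a hypothetical illustration of that idea, not AMD’s actual scheduler; the pool names refer to real product generations, but the throughput weights are assumptions:

```python
import random

# Hypothetical capacity weights (tokens/sec) for each pool; illustrative only.
pools = {
    "mi300x-pool": 40_000,
    "mi325x-pool": 55_000,
    "mi355x-pool": 100_000,
}

def route_request(pools: dict[str, float]) -> str:
    # Choose a pool with probability proportional to its capacity,
    # so faster generations absorb proportionally more of the load.
    names, weights = zip(*pools.items())
    return random.choices(names, weights=weights, k=1)[0]

# Example: spread ten requests across the heterogeneous fleet.
print([route_request(pools) for _ in range(10)])
```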

This heterogeneous approach offers several advantages:

  • Extending the lifespan of existing hardware investments
  • Enabling flexible infrastructure scaling
  • Supporting hybrid and geographically distributed deployments

Such capabilities are increasingly relevant as organizations adopt global AI strategies and seek to optimize resource utilization.

The Role of ROCm Software

Underlying all these achievements is ROCm, AMD’s open software stack for GPU computing. ROCm plays a central role in enabling performance, scalability, and flexibility across the Instinct platform.

Key contributions of ROCm include:

  • Optimized execution for low-precision data types (FP4/FP6)
  • Efficient communication for multinode scaling
  • Dynamic workload distribution across heterogeneous systems
  • Rapid enablement of new models and workloads

ROCm is not just a supporting layer—it is a critical enabler of AMD’s full-stack strategy. By integrating tightly with hardware, it ensures that performance gains are realized consistently across different use cases.
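To make that concrete: on a ROCm system, popular serving stacks run largely unchanged, because frameworks target ROCm the same way they target other GPU backends. A minimal sketch using vLLM, which supports ROCm on Instinct GPUs; the model choice and parallelism settings here are illustrative, not AMD’s benchmark configuration:

```python
from vllm import LLM, SamplingParams

# On a ROCm build of vLLM this application code is identical to the
# CUDA path; the backend is selected when vLLM is installed.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # illustrative model choice
    tensor_parallel_size=8,                  # shard across the 8 GPUs in a node
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain multinode inference scaling."], params)
print(outputs[0].outputs[0].text)
```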

A Roadmap for the Future

AMD’s MLPerf Inference 6.0 results are part of a broader roadmap characterized by an annual cadence of innovation. From the MI300X to the MI325X and now the MI355X, each generation has delivered substantial improvements in performance and capability.

Looking ahead, the company plans to introduce the MI400 series based on the next-generation CDNA 5 architecture. This will further extend AMD’s capabilities into rack-scale AI systems, including solutions like the Helios platform.

This consistent roadmap provides customers with confidence that AMD is committed to long-term innovation and scalability in AI infrastructure.

Conclusion

AMD’s MLPerf Inference 6.0 submission marks a pivotal moment in the evolution of AI inference platforms. By combining high-performance hardware, a robust software ecosystem, and strong partner validation, the company has demonstrated that it is ready to support production-scale generative AI workloads.

From surpassing 1 million tokens per second to enabling new models and achieving efficient multinode scaling, the results highlight a platform that is not only competitive but also forward-looking. As AI continues to evolve, AMD is positioning itself as a key player in defining the next generation of inference infrastructure—one that is scalable, flexible, and ready for real-world deployment.

Source link: https://www.amd.com
