
MLPerf Inference v5.0 Benchmark Results: Driving AI Performance and Innovation
The machine learning (ML) and artificial intelligence (AI) industries continue to evolve at a breakneck pace, with generative AI emerging as a focal point for innovation. Today, MLCommons® released the latest results for its industry-standard MLPerf® Inference v5.0 benchmark suite, showcasing remarkable advancements in AI system performance across datacenter and edge computing environments. These benchmarks provide a neutral, reproducible, and representative framework for evaluating how quickly systems can execute AI and ML models across diverse workloads.
This round of results highlights significant progress in generative AI, driven by breakthroughs in hardware and software optimizations. The release also introduces four new benchmarks—Llama 3.1 405B, Llama 2 70B Interactive, RGAT, and Automotive PointPainting—reflecting the growing complexity and diversity of AI applications. These updates ensure that MLPerf remains a trusted resource for customers procuring and tuning AI systems, while fostering innovation and energy efficiency across the industry.
Generative AI Takes Center Stage
Generative AI continues to dominate the AI landscape, as evidenced by the surge in submissions to the Llama 2 70B benchmark, which has now overtaken ResNet50 as the most-submitted test in MLPerf Inference. Submissions to Llama 2 70B have grown 2.5x over the past year, and the performance results show dramatic improvements: the median submitted score has doubled, and the best-performing systems are 3.3 times faster than in last year’s MLPerf Inference v4.0 round.
“This is clear evidence that much of the ecosystem is focused on deploying generative AI,” said David Kanter, head of MLPerf at MLCommons. “The combination of cutting-edge hardware and software innovations, including support for the FP4 data format, is driving unprecedented performance gains.”
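For readers unfamiliar with the term, FP4 refers to 4-bit floating-point number formats used to compress model weights and activations so accelerators can move and multiply them faster. The sketch below is a deliberately simplified, illustrative view of block-scaled rounding onto an E2M1-style 4-bit value grid; it is not any vendor’s actual implementation, and the block size and function names are assumptions made for illustration only.

```python
# Illustrative sketch of FP4 (E2M1-style) block quantization -- not a real kernel.
import numpy as np

# The eight non-negative magnitudes representable in an E2M1-style FP4 format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_VALUES = np.concatenate([-FP4_GRID[::-1], FP4_GRID])  # signed value grid


def quantize_fp4(x: np.ndarray, block: int = 32) -> np.ndarray:
    """Round each block of values to the nearest FP4 grid point after per-block scaling."""
    blocks = x.reshape(-1, block)
    # Scale each block so its largest magnitude maps onto the largest FP4 value (6.0).
    scale = np.abs(blocks).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)
    scaled = blocks / scale
    # Round to the nearest representable 4-bit value, then rescale back.
    idx = np.abs(scaled[..., None] - FP4_VALUES).argmin(axis=-1)
    return (FP4_VALUES[idx] * scale).reshape(x.shape)


if __name__ == "__main__":
    weights = np.random.randn(4096).astype(np.float32)
    approx = quantize_fp4(weights)
    print("mean abs quantization error:", float(np.abs(weights - approx).mean()))
```

Production inference stacks pair low-precision formats like this with calibration, stored per-block scales, and fused kernels to preserve accuracy while improving throughput and memory bandwidth.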
To keep pace with the industry’s shift toward larger and more complex models, MLPerf Inference v5.0 introduces a new benchmark utilizing Llama 3.1 405B, a model with 405 billion parameters—far exceeding the scale of previous benchmarks. This test evaluates three key tasks: general question-answering, math, and code generation, pushing the boundaries of what’s possible in generative AI inference.
“This is our most ambitious inference benchmark to date,” said Miro Hodak, co-chair of the MLPerf Inference working group. “It reflects the trend toward larger models, which offer greater accuracy and versatility but also require more sophisticated infrastructure. Trusted benchmark results are critical for organizations deploying such models at scale.”
New Benchmarks Reflect Industry Trends
In addition to Llama 3.1 405B, MLPerf Inference v5.0 introduces several other groundbreaking tests designed to address emerging AI use cases:
- Llama 2 70B Interactive:
Building on the existing Llama 2 70B benchmark, this new test adds low-latency requirements to simulate real-world interactive scenarios, such as chatbots and reasoning systems. Systems under test must meet strict latency targets for time to first token (TTFT) and time per output token (TPOT); an illustrative sketch of these two metrics follows this list. “Responsiveness is a key measure of performance for query systems and chatbots,” explained Mitchelle Rasquinha, co-chair of the MLPerf Inference working group. “This interactive version provides deeper insights into how well models perform in practical, user-facing applications.”
- RGAT (Graph Neural Network Benchmark):
A new datacenter benchmark implementing a graph neural network (GNN) model based on the Illinois Graph Benchmark Heterogeneous (IGBH) dataset. GNNs are widely used in recommendation systems, fraud detection, and knowledge graphs, making this benchmark highly relevant for modern AI applications.
- Automotive PointPainting:
Designed for edge devices, this benchmark evaluates 3D object detection in camera feeds—a critical capability for autonomous vehicles. While not a full-fledged automotive benchmark, it serves as a proxy for assessing AI performance in real-world driving scenarios.
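To make the interactive metrics concrete, here is a minimal sketch of how TTFT and TPOT can be derived from per-token arrival times. This is not the official MLPerf measurement methodology, which is defined by the MLPerf LoadGen harness and the v5.0 rules; the data structure and function names below are hypothetical and exist only for illustration.

```python
# Illustrative only: simplified derivation of time-to-first-token (TTFT) and
# time-per-output-token (TPOT) from per-token timestamps. The official MLPerf
# Inference measurement is performed by the LoadGen harness.
from dataclasses import dataclass
from typing import List


@dataclass
class QueryTrace:
    issue_time: float          # when the query was sent to the system under test (seconds)
    token_times: List[float]   # wall-clock arrival time of each output token (seconds)


def ttft(trace: QueryTrace) -> float:
    """Time from issuing the query until the first output token arrives."""
    return trace.token_times[0] - trace.issue_time


def tpot(trace: QueryTrace) -> float:
    """Average time between consecutive output tokens after the first one."""
    if len(trace.token_times) < 2:
        return 0.0
    decode_time = trace.token_times[-1] - trace.token_times[0]
    return decode_time / (len(trace.token_times) - 1)


if __name__ == "__main__":
    trace = QueryTrace(issue_time=0.00, token_times=[0.45, 0.48, 0.51, 0.54])
    print(f"TTFT: {ttft(trace) * 1000:.0f} ms, TPOT: {tpot(trace) * 1000:.0f} ms")
```

The interactive benchmark constrains both metrics more tightly than the standard Llama 2 70B server scenario, which is what distinguishes it as a test of user-facing responsiveness rather than raw throughput.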
Hardware and Software Innovations Drive Performance
The MLPerf Inference v5.0 results include submissions from six newly available or soon-to-be-released processors:
- AMD Instinct MI325X
- Intel Xeon 6980P “Granite Rapids”
- Google TPU Trillium (TPU v6e)
- NVIDIA B200
- NVIDIA Jetson AGX Thor 128
- NVIDIA GB200
These processors demonstrate the rapid pace of hardware innovation, particularly in accelerators optimized for generative AI workloads. Paired with advanced software techniques, they deliver record-breaking performance and energy efficiency.
“We rarely introduce four new tests in a single update,” said Hodak. “But given the rapid advancements in machine learning and the diversity of applications, we felt it was necessary to keep the benchmark suite relevant and comprehensive.”
Broad Industry Participation and Energy Efficiency Focus
The MLPerf Inference v5.0 results include 17,457 performance results from 23 submitting organizations, including first-time participants like CoreWeave, FlexAI, GATEOverflow, Lambda, and MangoBoost. Notably, Fujitsu and GATEOverflow contributed extensive power benchmark submissions for datacenter and edge systems, underscoring the growing importance of energy efficiency in AI deployments.
“The machine learning ecosystem is delivering ever-greater capabilities,” said Kanter. “We’re seeing larger models, faster responsiveness, and broader deployment of AI compute than ever before. MLCommons is proud to support these advancements by providing up-to-date, reliable performance data.”
Why MLPerf Matters
As AI systems grow more complex, the need for accurate, standardized benchmarks becomes even more critical. MLPerf Inference v5.0 ensures that stakeholders—from researchers to enterprise buyers—have access to trustworthy data to guide their decisions. By introducing new benchmarks and supporting a wide range of hardware and software configurations, MLCommons continues to drive innovation and accountability in the AI ecosystem.
For more information about the MLPerf Inference v5.0 results, visit the MLCommons website or explore detailed analyses on their blog.
About MLCommons
MLCommons is the world’s leader in AI benchmarking. An open engineering consortium supported by over 125 members and affiliates, MLCommons has a proven record of bringing together academia, industry, and civil society to measure and improve AI. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. Since then, MLCommons has continued using collective engineering to build the benchmarks and metrics required for better AI – ultimately helping to evaluate and improve AI technologies’ accuracy, safety, speed, and efficiency.



