ESC8000-E12P Server with Intel® Gaudi® 3 PCIe AI Accelerator Delivers Exceptional Performance and TCO for Gen AI

White paper

Summary

 

The ASUS ESC8000-E12P server, configured with Intel® Gaudi® 3 PCIe AI Accelerator, is engineered to deliver exceptional performance and scalability for next-generation Generative AI (GenAI) and large language model (LLM) workloads.

 

Built on PCIe 5.0 architecture, this 4U platform supports up to eight 600W dual-slot GPUs and five high-speed NICs, with an advanced thermal design that ensures stable, high-efficiency operation under intensive AI training and inference.

 

Benchmark results demonstrate outstanding throughput on the Llama 3.1 model and up to 89.19 GB/s of bandwidth in HCCL Quad-to-Quad testing over RoCE Ethernet, matching proprietary InfiniBand systems in performance while significantly reducing total cost of ownership (TCO).

 

Combining powerful computing density, open Ethernet scalability, and cost-efficient design, ASUS ESC8000-E12P with Intel® Gaudi® 3 PCIe AI Accelerator stands as a proven and future-ready AI infrastructure solution for enterprise AI transformation.

 

                   

 

A. Introduction
 

The rapid development of Generative AI (GenAI) and large language models (LLMs) is fundamentally reshaping industries, from finance and manufacturing to healthcare and retail. These models demand unprecedented computational power and memory bandwidth, placing ever-increasing pressure on enterprise data centers.

 

As organizations strive to accelerate AI innovation, they require infrastructure that not only delivers outstanding performance but also offers the flexibility to scale quickly in response to evolving workloads and business needs. In this landscape, the ability to support high-density, multi-GPU architectures, ensure efficient data movement, and maintain operational reliability has become critical for success in the AI era.

 

The ASUS ESC8000-E12P server is specifically engineered to meet the most demanding requirements of next-generation AI infrastructure. Built on PCIe 5.0 high-bandwidth architecture and designed as a high-density 4U platform, ESC8000-E12P can accommodate up to eight dual-slot, high-end GPU cards, each supporting up to 600 watts. Its advanced thermal design ensures efficient cooling for all components, even during the most intensive AI training or inference workloads.

 

The system also supports up to five high-speed NICs, providing tremendous network bandwidth for large-scale AI deployments. With an optimized spatial layout and a high-performance PCIe switch, ESC8000-E12P guarantees stable and efficient data paths between Intel® Xeon® 6 CPUs, Intel® Gaudi® 3 PCIe AI Accelerator, and 400G NICs—fully unlocking the advantages of multi-card collaborative computing.

 

Intel® Gaudi® 3 PCIe AI Accelerator (HL-338) is purpose-built for next-generation AI applications, featuring 128GB HBM2E memory, PCIe 5.0 x16 with up to 128GB/s bidirectional bandwidth, and an impressive 3.7TB/s HBM peak bandwidth per card.
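To put these figures in perspective, the following back-of-envelope sketch (a hypothetical illustration, not a vendor benchmark; the constants are simply the peak rates quoted above) compares ideal transfer times for the full 128GB memory footprint over on-card HBM versus the host PCIe link:

```python
# Back-of-envelope transfer-time estimates from the published per-card peaks.
# Real workloads sustain less than peak; this only illustrates the ratio.

HBM_CAPACITY_GB = 128     # HBM2E capacity per card
HBM_PEAK_GBPS = 3700      # 3.7 TB/s peak HBM bandwidth
PCIE_BIDIR_GBPS = 128     # PCIe 5.0 x16, bidirectional

def transfer_time_s(size_gb: float, bandwidth_gbps: float) -> float:
    """Ideal time to move size_gb at a sustained bandwidth_gbps."""
    return size_gb / bandwidth_gbps

hbm_sweep = transfer_time_s(HBM_CAPACITY_GB, HBM_PEAK_GBPS)    # ~0.035 s
pcie_fill = transfer_time_s(HBM_CAPACITY_GB, PCIE_BIDIR_GBPS)  # 1.0 s

print(f"Full HBM sweep at peak bandwidth: {hbm_sweep * 1000:.1f} ms")
print(f"Filling HBM over the PCIe link:   {pcie_fill:.2f} s")
```

The roughly 29x gap between on-card and host-link bandwidth is why the 128GB capacity matters: once model weights are resident in HBM, inference runs at on-card speeds rather than being gated by the PCIe link.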

 

The platform is designed for flexible multi-card expansion and efficient interconnection. With dedicated top boards (HLTB-304A/HLTB-304B), it enables high-speed data transfer between multiple cards, empowering enterprises to build high-performance, open-architecture AI clusters.

 

This white paper demonstrates the real-world performance of the ESC8000-E12P in Llama 3 model inference and HCCL network collaboration, establishing it as the optimal choice for Intel® Gaudi® 3 PCIe AI Accelerator deployments by combining performance, openness, and total cost of ownership (TCO) advantages.

 



B. Llama 3 Performance

 

  • Llama 3.1 Inference Throughput Across Different Scenarios (tokens/sec)


     

  • Performance Description
    Across various input/output lengths and batch sizes, the ASUS ESC8000-E12P platform demonstrates very high inference throughput on the Llama 3.1 model. With short input/output sequences and a large batch size, throughput reaches up to 14,909.2 tokens/sec, and even in long-sequence scenarios performance remains stable.


This proves that ESC8000-E12P’s thermal and power design can support Intel® Gaudi® 3 PCIe AI Accelerator at full performance, providing customers with a stable and high-speed AI model serving platform.
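Throughput figures like these are conventionally computed as generated tokens divided by wall-clock time. A minimal sketch of that measurement (illustrative only; the function names and the toy stand-in for the model call are our own, not part of the benchmark harness):

```python
import time

def measure_throughput(generate, batch_size: int, output_len: int) -> float:
    """Time one batched generation and return tokens/sec.

    `generate` is any callable that produces output_len tokens for each
    of batch_size sequences; this helper only times the call.
    """
    start = time.perf_counter()
    generate(batch_size, output_len)
    elapsed = time.perf_counter() - start
    return (batch_size * output_len) / elapsed

# Toy stand-in for a real model call, so the sketch is runnable:
def fake_generate(batch_size, output_len):
    time.sleep(0.01)

tps = measure_throughput(fake_generate, batch_size=64, output_len=128)
print(f"{tps:,.1f} tokens/sec")
```

Note that published results usually also distinguish time-to-first-token from steady-state decode throughput; the simple ratio above captures only the aggregate rate.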


C. HCCL Quad-to-Quad Performance


One of the greatest advantages of the Intel® Gaudi® 3 PCIe AI Accelerator is its support for high-bandwidth RoCE Ethernet scale-out when paired with high-performance NICs and a suitable network environment. ESC8000-E12P can house up to eight Intel® Gaudi® 3 PCIe AI Accelerators and, in a 2x4 (dual quad) configuration combined with 400G NICs and a RoCE switch, enables the creation of an open and scalable large-scale AI cluster.

 

Based on Intel HCCL (Habana Collective Communications Library) ALL_REDUCE testing, Quad-to-Quad bandwidth can reach 89.19 GB/s, showcasing the platform's performance and scalability on standard Ethernet.
The test environment is further optimized with the MTU set to 9000 (jumbo frames) and PFC (Priority Flow Control) enabled for lossless operation, ensuring stable, low-latency RDMA transfers.
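For reference, collective-communication benchmarks in the NCCL/HCCL style commonly report a "bus bandwidth" that scales the raw algorithm bandwidth by 2(n-1)/n for a ring all-reduce across n ranks. A sketch of that standard convention (our own helper; the source does not state which convention its 89.19 GB/s figure uses):

```python
def allreduce_bus_bw_gbps(size_gb: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth for a ring all-reduce: algbw * 2*(n-1)/n.

    size_gb is the per-rank buffer size and time_s the measured collective
    time; the correction factor accounts for the data traversing the ring
    twice (reduce-scatter then all-gather), minus the locally held shard.
    """
    algbw = size_gb / time_s
    return algbw * 2 * (n_ranks - 1) / n_ranks

# Example: an 8 GB per-rank buffer reduced across 8 ranks in 0.1 s
print(f"{allreduce_bus_bw_gbps(8, 0.1, 8):.1f} GB/s")  # → 140.0 GB/s
```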

This architecture brings two key benefits:

 

  • High-Efficiency Collaboration: On an open Ethernet architecture, the ESC8000-E12P delivers latency and bandwidth on par with high-end proprietary solutions, meeting the demands of large-scale AI training and inference.
     

  • TCO Reduction: Users no longer need to purchase costly, closed InfiniBand solutions. With industry-standard Ethernet equipment, they can build highly performant and scalable AI clusters, dramatically lowering deployment and maintenance costs.




D. Conclusion & Recommended L10 Configuration

ASUS ESC8000-E12P is more than just a server: it is a proven GenAI solution. Pairing the Intel® Gaudi® 3 PCIe AI Accelerator HL-338 with an advanced thermal design, high-bandwidth PCIe 5.0 slots, and enterprise-class power delivery, it offers industry-leading performance, stability, and scalability for AI/LLM inference and training.

 

ESC8000-E12P has demonstrated top-tier performance in Llama 3 and HCCL scale-out testing, helping enterprises embrace open Ethernet standards, break away from proprietary ecosystems, and significantly reduce AI infrastructure deployment and maintenance costs.
 

Recommended L10 Configuration:

  • ASUS ESC8000-E12P (4U, 8 x PCIe 5.0 x16 slots)

  • 8 x Intel® Gaudi® 3 PCIe AI Accelerator HL-338

  • 2 x Intel® Xeon® 6 CPU 6776P

  • 2TB DDR5 6400 ECC memory

  • 4 x 1P NDR400 MCX7 (NVIDIA) NIC

  • 1 x SN5560 RoCE switch

  • OS revision: Ubuntu 22.04.5 LTS 

  • Kernel: 6.8.0-83-generic

  • Gaudi3 SW revision: 1.23.0-fw-62.2.0.0

 

 
