Qualcomm Centriq 2400 Processor: Designed for scalability and throughput performance on cloud datacenter workloads
OCT 5, 2017
Qualcomm products mentioned within this post are offered by Qualcomm Technologies, Inc. and/or its subsidiaries.
When Qualcomm Datacenter Technologies unveiled the details of the Qualcomm Falkor CPU core in August, we discussed the market shift to a cloud-based computing model and how datacenter infrastructure is being optimized to address the demand for scalable performance under the unique characteristics of cloud software and services. Falkor, our fully custom core built specifically for the cloud datacenter market, was designed for optimal throughput performance and efficiency on today’s multi-threaded cloud workloads. Falkor serves as the scalable building block for the Qualcomm Centriq 2400 Processor, the world’s first 10nm server processor, which will begin shipping commercially later this year.
SoC architectures for cloud-based workloads must provide a balance of aggregate throughput performance and performance-per-watt efficiency. In addition, they must be designed for compute density and predictable performance in order to perform well in highly loaded, multi-tenant environments. When developing the highly scalable 48-core Qualcomm Centriq 2400 SoC, we applied the same “built for the cloud” design philosophy from Falkor to all the other foundational elements of the SoC. Preliminary estimates based on internal testing show integer throughput performance comparable to Intel Xeon Platinum Series at significantly lower power.
At the 2017 Linley Processor Conference this week, we will share additional details about the SoC foundational elements and how they address the needs of cloud datacenter workloads:
Highly integrated server SoC: The Qualcomm Centriq 2400 SoC was designed using a scalable architecture to maximize efficiency and performance for throughput-oriented workloads. This single-chip, platform-level solution eliminates the board real estate, power, and cost of a separate chipset for I/O. The SoC is ARM SBSA Level 3 compliant to help simplify development and deployment by our ecosystem partners and customers.
Qualcomm Falkor core as a building block: Our processor design team has a rich history of delivering high-performance yet power-efficient custom ARM CPUs for mobile platforms, and has applied this world-class design expertise to architecting a CPU core specifically designed to support the features and performance demands of cloud service providers. Falkor is AArch64-only and fully ARMv8 compliant. The Falkor core duplex includes two custom Falkor CPUs, a shared 512 KB L2 cache with ECC (SEC/DED), and a shared system bus interface.
Scalable on-chip interconnect: The Qualcomm Centriq 2400 SoC includes a high-bandwidth and low-latency bi-directional segmented ring bus that utilizes a Qualcomm proprietary protocol. The multi-ring architecture and interconnect protocol are built for SoC scalability and outstanding throughput performance with capabilities such as full coherency (cache and I/O), shortest path routing, and multicast on read.
Distributed L3 Cache: The SoC includes a distributed 60MB non-inclusive/non-exclusive L3 Cache (12 x 5MB) with ECC (SEC/DED) that is 20-way set associative. The memory address is hashed across all 12 L3 cache blocks to evenly distribute accesses and smooth out access latencies. The memory subsystem includes innovative shared resource management techniques such as L3 Quality of Service (QoS) to improve cache utilization, reduce application latency, and manage cache resource bandwidth. Resources can be managed by virtual machine, container, or thread groups.
Scalable Multi-channel DDR: The memory subsystem includes six 64-bit DDR4 memory controllers with ECC (SEC/DED). The SoC supports RDIMM or LRDIMM with one or two DIMMs per channel and memory speeds up to 2667 MT/s. The controllers have full out-of-order execution, with memory addresses hashed across all DDR channels. The design includes a proprietary algorithm for memory bandwidth enhancement via in-line and transparent memory compression. Memory compression is performed at cache-line granularity and delivers up to 50% compression and up to 2x memory bandwidth on highly compressible data.
Distributed IOMMUs: Distributed I/O Memory Management Units (IOMMUs) provide address translation and access control with shared/distributed virtual memory support. Each major I/O function (PCIe, DMA, SATA, etc.) has dedicated IOMMU instances to eliminate resource contention and enable concurrent page table lookup/translation for maximum I/O throughput and concurrency.
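To make a few of the mechanisms in the list above more concrete, here is a rough Python sketch. It is illustrative only and does not describe the actual hardware: the number of ring stops, the 128-byte line size, and the XOR-fold hash are all hypothetical stand-ins, since the real interconnect protocol and hashing functions are proprietary and not described here. Only the DDR figures (six channels, 64-bit, 2667 MT/s) and the count of 12 L3 blocks come from the text.

```python
def peak_ddr_bandwidth_gb_s(channels=6, bus_bits=64, mt_per_s=2667):
    """Theoretical peak DDR bandwidth in GB/s (1 GB = 10**9 bytes).

    Defaults are the figures quoted in the text; ECC bits are not counted.
    """
    bytes_per_transfer = bus_bits // 8
    return channels * bytes_per_transfer * mt_per_s * 10**6 / 10**9


def ring_route(src, dst, stops):
    """Pick the shorter direction around a bi-directional ring.

    `stops` is a hypothetical number of ring stops; the real topology
    is not given in the text. Returns (direction, hop_count).
    """
    clockwise = (dst - src) % stops
    counter_clockwise = (src - dst) % stops
    if clockwise <= counter_clockwise:
        return ("cw", clockwise)
    return ("ccw", counter_clockwise)


def l3_slice(addr, line_bytes=128, slices=12):
    """Map a physical address to one of the L3 blocks.

    ASSUMPTION: the line size and this XOR-fold hash are placeholders
    chosen for illustration, not the SoC's actual hash function.
    """
    line = addr // line_bytes
    folded = line ^ (line >> 7) ^ (line >> 15)
    return folded % slices


if __name__ == "__main__":
    print(f"peak DDR bandwidth: {peak_ddr_bandwidth_gb_s():.3f} GB/s")
    print("route 0 -> 9 on a 12-stop ring:", ring_route(0, 9, 12))
    print("slice for address 0x4000:", l3_slice(0x4000))
```

With the quoted figures, the theoretical peak works out to roughly 128 GB/s across the six channels, so the "up to 2x" compression claim would correspond to a best-case effective figure of about 256 GB/s on highly compressible data. The routing helper simply shows what "shortest path" means on a bi-directional ring: traffic travels in whichever direction needs fewer hops.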
The Qualcomm Datacenter Technologies product roadmap is tailored to the emerging demands of highly scalable, performant, and power-efficient servers that will fuel the next wave of cloud datacenters. We look forward to beginning commercial shipments of the Qualcomm Centriq 2400, the world’s first 10nm server processor, by the end of 2017.