WHERE CHINA’S LONG ROAD TO DATACENTER COMPUTE INDEPENDENCE LEADS
September 20, 2021 Timothy Prickett Morgan
The Sunway TaihuLight machine has a peak performance of 125.4 petaflops across 10,649,600 cores. It sports 1.31 petabytes of main memory. To put the peak performance figure in some context, recall that the top supercomputer (by far) until this announcement had been Tianhe-2, with 33.86 petaflops of capability. One key difference, other than the clear peak potential, is that TaihuLight came out of the gate with demonstrated high performance on real-world applications, some of which are able to utilize over 8 million of the machine’s 10 million-plus cores.
While we are big fans of laissez faire capitalism like that of the United States and sometimes Europe – right up to the point where monopolies naturally form and therefore competition essentially stops, and thus monopolists need to be regulated in some fashion to promote the common good as well as their own profits – we also see the benefits that accrue from a command economy like that which China has built over the past four decades.
A recently rumored announcement of a GPU designed by Chinese chip maker Jingjia Micro and presumably etched by Semiconductor Manufacturing International Corp (SMIC), the indigenous foundry in China that is playing catch up to Taiwan Semiconductor Manufacturing Co, Intel, GlobalFoundries, and Samsung Semiconductor, got us to thinking about this and what it might mean when – and if – China ever reaches datacenter compute independence.
Taking Steps Five Years At A Time
While China has been successful in many areas, particularly in becoming the manufacturing center of the world, it has not been particularly successful in achieving independence in datacenter compute. Some of that has to do with the immaturity of its chip foundry business, and some of it has to do with its limited experience in making big, wonking, complex CPU and GPU designs that can take on the big loads in the datacenter. China has a bit of a chicken and egg problem here, and as usual, the smartphone and tablet markets are giving the Middle Kingdom’s chip designers and foundries the experience they need to take it up another notch and take on the datacenter.
The motivations are certainly there for China to achieve chip independence: the current supply chain issues in semiconductors, and the messy geopolitical situation between China and the United States, which draws in Taiwan, South Korea, Japan, and Europe as well. Like every other country on Earth, China has an imbalance between semiconductor production and semiconductor consumption, and that is partly a function of the immense amount of electronics and computer manufacturing that has been moved to China over the past two decades.
According to Dauxe Consulting, which provides research into the Chinese market, back in 2003 China consumed about 18.5 percent of semiconductors (that’s revenue, not shipments), which was a little bit less than the Americas (19.4 percent), Europe (19.4 percent), or Japan (23.4 percent). SMIC was only founded in 2000 and had negligible semiconductor shipment revenue at the time. Fast forward to 2019, which is the last year for which data is publicly available, and China’s chip manufacturing accounts for about 30 percent of chip revenues in the aggregate, but the chips that Chinese companies buy to build stuff account for over 60 percent of semiconductor consumption (which is revenues going to SMIC as well as all of the other foundries, big and small, around the world). This is a huge imbalance, and it is not surprising that the Chinese government wants to achieve chip independence.
There may be strong political and economic reasons for it, but Chinese chip independence might mean China’s reach outside of its own markets diminishes in proportion to how much it can take care of its own business. China can compel its own state, regional, and national governments as well as state-controlled businesses to Buy China, but it can’t do that outside of its political borders. It can make companies and governments in Africa and South America attractive offers they probably won’t refuse. It will be a harder sell indeed in the United States and Europe and their cultural and economic satellites.
More about that in a moment.
Let’s start our Chinese datacenter compute overview with that GPU chip from Jingjia Micro that we heard about last week, because it illustrates the problem China has. We backed through all of the stories and found that a site called MyDrivers is the originator of the story, as far as we can see, and has this table nicked from Jingjia Micro to show how the JM9 series of GPUs stacks up against the Nvidia GeForce GTX 1050 and GTX 1080 GPUs that debuted in late 2015 and that started shipping in 2016 in volume:
There are two of these JM9 series GPUs from Jingjia, and they are roughly equal to or better than the Nvidia equivalents. The top end JM9271 is the interesting one as far as we are concerned because it has a PCI-Express 4.0 interface and, thanks to HBM2 stacked memory weighing in at 16 GB, twice the capacity of the GTX 1080. At 512 GB/sec, it has 60 percent more memory bandwidth, while burning 11.1 percent more power and delivering 9.8 percent lower performance at 8 teraflops at FP32 single precision.
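The percentages in that comparison can be turned back into the GTX 1080 baseline they imply. What follows is a minimal sketch in Python, not anything from Jingjia or Nvidia: it uses only the JM9271 figures and percentage deltas quoted above, the helper name `implied_baseline` is our own, and the relative performance per watt it derives is an inference from those two percentages, not a published number.

```python
# Back out the GTX 1080 baseline implied by the JM9271 comparison.
# All inputs come from the figures quoted in the text; nothing here
# is taken from a vendor datasheet.

def implied_baseline(value, pct_delta):
    """Return the baseline that `value` differs from by pct_delta percent.

    A positive pct_delta means `value` is higher than the baseline,
    a negative one means it is lower.
    """
    return value / (1.0 + pct_delta / 100.0)

# JM9271 figures as quoted.
jm9271 = {"mem_gb": 16.0, "bw_gb_per_sec": 512.0, "fp32_tflops": 8.0}

# Quoted deltas versus the GTX 1080, in percent.
deltas_pct = {"mem_gb": 100.0, "bw_gb_per_sec": 60.0, "fp32_tflops": -9.8}
power_delta_pct = 11.1  # only the relative power is quoted, no wattage

gtx1080_implied = {
    key: implied_baseline(jm9271[key], deltas_pct[key]) for key in jm9271
}

# Relative FP32 performance per watt of the JM9271 versus the GTX 1080:
# (relative performance) / (relative power).
rel_perf_per_watt = (1.0 + deltas_pct["fp32_tflops"] / 100.0) / (
    1.0 + power_delta_pct / 100.0
)

for key, value in gtx1080_implied.items():
    print(f"implied GTX 1080 {key}: {value:.2f}")
print(f"JM9271 FP32 per watt relative to GTX 1080: {rel_perf_per_watt:.2f}x")
```

Taken at face value, the quoted deltas put the JM9271 at roughly four-fifths of the FP32 performance per watt of the five-year-old Nvidia part, which is the gap the rest of this comparison is about.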
This Jingjia card is puny compared to the top-of-the-line “Ampere” GA100 GPU engine from Nvidia, which runs at 1.41 GHz, has 40 GB or 80 GB of HBM2E stacked memory, and 19.49 teraflops at single precision. The cheaper Ampere GA102 processor used in the GeForce RTX 3090 gamer GPU (as well as the slower RTX 3080) runs at 1.71 GHz, has 24 GB of GDDR6X memory, and delivers an incredible 35.69 teraflops at FP32 precision – and has ray tracing accelerators that can also be used to boost machine learning inference. The Ampere A100 and RTX 3090 devices burn 400 watts and 350 watts, respectively, because the laws of physics must be obeyed. If you want to run faster these days, you also have to run hotter because Moore’s Law transistor shrinks are harder to come by.
Architecturally speaking, the JM9 series is about five years behind Nvidia, with the exception of the HBM memory and the PCI-Express 4.0 interface. The chip is implemented in SMIC’s 28 nanometer processes, which is not even close to the 14 nanometer process that SMIC has working or its follow-on, which is akin to TSMC’s 10 nanometer node and Samsung’s 8 nanometer node (the latter process being used to make the Ampere RTX GPUs). Jingjia is hanging back, getting its architecture out there and tested before it jumps to a process shrink. TSMC has had 28 nanometer in the field for a decade now.
This is not even close to China’s best effort. Tianshu Zhixin is working on a 7 nanometer GPU accelerator called “Big Island” that looks to be etched by TSMC and to include its CoWoS packaging (the same one used by Nvidia for its GPU accelerator cards). The Big Island GPU is aimed squarely at HPC and AI acceleration in the datacenter, not gaming, and it will absolutely be competitive if the reports (on very thin data and a lot of big talk it looks like) pan out. Another company called Biren Technology is working on its own GPU accelerator for the datacenter, and thin reports out of China say the Biren chip, etched using TSMC 7 nanometer processes, will compete with Nvidia’s next-gen “Hopper” GPUs. We shall see when Biren ships its GPU next year.
We are skeptical of such claims, and reasonably so if you look at the plan for the “Godson” family of MIPS-derived and X86-emulating processors that were created by the Institute of Computing Technology at the Chinese Academy of Sciences. (You know CAS, they are the largest shareholder in Chinese IT gear maker Lenovo.) We reported with great interest on the Godson processors (also known by the synonymous name Loongson) and the roadmap to span them from handhelds to supercomputers way back in February 2011. These processors made their way into the Dawning 6000 supercomputers made by Sugon, but as far as we know they did not really get any of the traction that Sugon had hoped for in the datacenter.
It remains to be seen if the Loongson 3A5000 clone of the AMD Epyc processor, which is derived from the four-core Ryzen chiplet used in the “Naples” Epyc processor from 2017 and which is said to have its own “in-house” GS464V microarchitecture (oh, give me a break. . . .), will do better in the broader Chinese datacenter market. With the licensing limited to the original Zen 1 cores and the four-core chiplets, the AMD-China joint venture, called Tianjin Haiguang Advanced Technology Investment Co, has the Chinese Academy of Sciences as a big (but not majority) shareholder, and it is expected that a variant of this processor will be at the heart of at least one of China’s exascale HPC systems.
By the way, the old VIA Technologies (the third company with an X86 license) has partnered with the Shanghai Municipal Government to create the Zhaoxin Semiconductor partnership, which makes client devices based on the X86 architecture. Zhaoxin could be tapped to make a big, bad X86 processor at some point. Why not?
Thanks to being blacklisted by the US government, Huawei Technologies, one of the dominant IT equipment suppliers on Earth, has every motivation to help create an indigenous and healthy market for CPUs, GPUs, and other kinds of ASICs in China, and has a good footing with the design efforts of its arm’s length (pun intended) fabless semiconductor division, HiSilicon. The HiSilicon Kunpeng CPUs and Kirin GPUs hew pretty close to the Arm Holdings roadmaps, which is fine, and there is no reason to believe that if properly motivated – meaning enough money is thrown at it and China takes an attitude that it is going to be very aggressive with Huawei sales outside of the United States and Europe – it could not do more custom CPUs and even GPUs. It could acquire Jingjia, Tianshu Zhixin, or Biren, for that matter.
For a while there, it looked like Suzhou PowerCore, a revamped PowerPC re-implementer that joined IBM’s OpenPower Consortium and that delivered a variant of the Power8 processor for the Chinese market, might try to extend into the Power9 and Power10 eras with its own Power chip designs. But that does not seem to have happened, or if it did, it is being done secretly.
Then there is the future Sunway exascale supercomputer at the National Supercomputing Center in Wuxi, which is one of the three exascale systems being funded by the Chinese government. It has a custom processor, a kicker to the SW26010 processor used in the original Sunway TaihuLight supercomputer, which also dates from 2016. The SW26010 had 260 cores, 256 of them skinny cores for doing math and four of them fat cores for managing the data that feeds the skinny cores, and we think that the Sunway exascale machine won’t have a big architectural change, but will have some tweaks, add more compute element blocks to the die, and ride down the die shrink to reach exascale. The SW26010 and its kicker, which we have jokingly called the SW52020 because it has double of everything, mix architectural elements of CPUs and math accelerators, much as Fujitsu’s A64FX Arm chips do. The A64FX is used in the “Fugaku” pre-exascale supercomputer at the RIKEN lab in Japan. Hewlett Packard Enterprise is reselling the A64FX in Apollo supercomputer clusters, but as far as we know, no one is reselling SW26010 in any commercial machines.
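The core counts make for a quick sanity check. The sketch below, in Python, divides the 10,649,600-core, 125.4 petaflops TaihuLight figures by the SW26010's 260 cores per chip to get the chip count and the peak per chip. The last step is purely our assumption for illustration: it treats the jokingly named "SW52020" as delivering exactly twice the peak of an SW26010 and asks how many such chips one exaflops of peak would take. Nothing here comes from a Sunway specification.

```python
import math

# Figures quoted for Sunway TaihuLight and the SW26010.
TOTAL_CORES = 10_649_600
PEAK_PETAFLOPS = 125.4
SKINNY_CORES_PER_CHIP = 256  # math cores
FAT_CORES_PER_CHIP = 4       # management cores
CORES_PER_CHIP = SKINNY_CORES_PER_CHIP + FAT_CORES_PER_CHIP  # 260

# How many SW26010 chips does the core count imply?
chips, leftover_cores = divmod(TOTAL_CORES, CORES_PER_CHIP)

# Peak teraflops per chip (1 petaflops = 1,000 teraflops).
peak_tflops_per_chip = PEAK_PETAFLOPS * 1000.0 / chips

# ASSUMPTION, for illustration only: a hypothetical kicker with
# "double of everything" delivers exactly 2x the peak per chip.
kicker_tflops_per_chip = 2.0 * peak_tflops_per_chip
EXAFLOPS_IN_TFLOPS = 1_000_000.0
kicker_chips_for_exaflops = math.ceil(EXAFLOPS_IN_TFLOPS / kicker_tflops_per_chip)

print(f"SW26010 chips in TaihuLight: {chips} (leftover cores: {leftover_cores})")
print(f"Peak per SW26010: {peak_tflops_per_chip:.2f} teraflops")
print(f"Hypothetical 2x chips for 1 exaflops peak: {kicker_chips_for_exaflops}")
```

The division comes out even, with no leftover cores, which is a good sign the quoted core count and the 260-core chip description agree. Under the doubling assumption, reaching an exaflops of peak takes roughly four times as many chips as TaihuLight has, which is why more compute blocks per die and a process shrink both have to do some of the work.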
Arm server chip maker Phytium made a lot of noise back in 2016 with its four-core “Earth” and 64-core “Mars” Arm server chips, but almost immediately went mostly dark thanks to the trade war between the US and China that really got going in 2018.
The most successful indigenous accelerator to be developed and manufactured in China is the Matrix2000 DSP accelerator used at the National Super Computer Center in Guangzhou. That Matrix2000 chip, which uses DSPs to do single-precision and double-precision math acceleration in an offload model from CPU hosts, just like GPUs and FPGAs, was created because Intel’s “Knights” many-core X86 accelerators were blocked for sale to China back in 2013 for supercomputers. The Matrix2000 DSP engines, along with the proprietary TH-Express 2+ interconnect, were deployed in the Tianhe-2A supercomputer with 4.8 teraflops of oomph each at FP32 single precision. That was back in 2015, mind you, when the GTX 1080 was being unveiled by Nvidia, for comparison.
As far as we know, these Matrix2000 DSP engines were not commercialized beyond this system and the upcoming Tianhe-3 exascale system, which will use a 64-core Phytium 2000+ CPU and a Matrix2000+ DSP accelerator. One-off or two-off compute engines are interesting, of course, but they don’t change the world except inasmuch as they show what can be done with a particular technology. But the real point is to bring such compute engines to the masses, thereby lowering their unit costs as volumes increase.
And China surely has masses. For now, a lot of Chinese organizations, both in government and in industry, have free will when it comes to architectures. But that could change. China could whittle down the choices for datacenter compute to a few architectures, all of them homegrown and all of them isolated from the rest of the world. It has enough money – and enough market of its own – to do that.