Interesting. With the huge challenges that further scaling faces, AMD is in a good place with their chiplet design methodology, as you say. You may have already read this recent article at SemiWiki, discussing Imec's presentations at the SPIE Advanced Lithography Conference, but if not, it is an interesting look at standard cell scaling down to 5nm, using innovative changes such as Buried Power Rail and Backside Power Distribution to free up space for interconnect. However, this reminds me of the revolutionary things Intel did on their 10nm, their "hyper-scaling" technologies, which proved to be more challenging than anticipated (Dummy Gate, Contact-Over-Active-Gate, and switching to cobalt for lower-layer interconnect).
Training deep neural networks is one of the more computationally intensive applications running in datacenters today. Arguably, training these models is even more compute-demanding than your average physics simulation using HPC. Nevertheless, deep learning has rather different hardware requirements than that of conventional high performance computing.
For the most part, that has to do with numerical formats. While most HPC models rely on double precision floating point (FP64), with the occasional excursion into single precision (FP32), deep learning models are typically built with FP32, supplemented by half precision FP16. In general, the more you can use lower precision values for calculations, the better off you are since cutting the number of bytes in half doubles data throughput for an application. And that goes for HPC, deep learning, or anything else.
All of these formats are based on the IEEE 754 standard, which was set up more than 30 years ago when floating point was primarily used for scientific computation. As a result, the field that contains the significant bits in these IEEE formats (the mantissa or significand) takes up most of the space: 52 bits in FP64, 23 bits in FP32 and 10 bits in FP16. The idea is to maintain high precision, which reflects its original intended use.
The exponent field for these IEEE formats is relatively smaller, which means the dynamic range is limited. The rationale is that if you need more range, you just keep using larger formats – FP32, FP64, FP128, and so on – until the exponent field is large enough to support the numbers your application needs.
But for deep learning, high precision is not necessarily desirable. “Deep learning, in fact, performs better with lower precision,” says Pradeep Dubey, who directs the Parallel Computing Lab at Intel. While he acknowledges that sounds confusing, his explanation is that when you’re training deep learning models, “you need an ability to generalize.”
What he’s referring to is that when building a model, it’s better to construct something that is generalized enough to detect a range of possibilities. For example, in pattern recognition where you’re looking for a particular object like a cat, it’s better not to be too precise about the pattern that represents a cat. Too much precision would limit the kind of images that would be recognized or even prevent the model from converging while training.
On the other hand, you do need enough of a numeric range so the model will be able to encompass a decent number of possibilities – what Dubey calls “learning the curve.” Thus, for deep learning, the range is more important than the precision, which is the inverse of the rationale used for IEEE’s floating point formats.
According to Dubey, IEEE’s FP16 format reduces the dynamic range too much in an effort to keep more bits for precision, but again, that’s not the tradeoff you want for deep learning computations. What often happens is that with FP16, the model doesn’t converge, so you end up needing to tune the hyperparameters – things like the learning rate, batch size, and weight decay.
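The range problem with FP16 is easy to see in a small sketch. The helper name `to_fp16` and the sample values below are illustrative assumptions, not taken from the article; the code only relies on Python's standard `struct` module, whose `e` format code is IEEE half precision.

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE FP16 and back, mapping overflow to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range;
        # model that here as infinity, as hardware conversion would.
        return math.copysign(math.inf, x)

# Hypothetical values of the kind that can appear during training.
tiny_gradient = to_fp16(1e-8)        # below the smallest FP16 subnormal (about 6e-8)
large_activation = to_fp16(70000.0)  # above the largest finite FP16 value (65504)
ordinary_value = to_fp16(1e-3)       # comfortably inside the FP16 range

print(tiny_gradient, large_activation, ordinary_value)
```

The tiny value is flushed to zero and the large one becomes infinite, which is the kind of behaviour that forces extra tuning when a model's values do not fit the narrow FP16 range.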
Bfloat16 has a 7-bit mantissa, along with an 8-bit exponent, which means it has the same range as FP32, but with less precision. According to Intel though, that’s more than enough to cover the range of deep learning domains. To prove the point, Dubey and his team from the Parallel Computing Lab, along with some Facebook researchers, set out to test bfloat16 on some typical deep learning models, encompassing convolutional neural networks (CNNs), recurrent neural networks (RNNs), and generative adversarial networks (GANs).
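The exponent and mantissa widths quoted above can be turned into range and precision figures with a few lines of arithmetic. This is only a sketch using the standard formulas for IEEE-style binary formats; the names `FORMATS`, `max_finite`, `min_normal` and `epsilon` are made up for the illustration.

```python
# (exponent bits, mantissa bits) for each format; one extra bit holds the sign.
FORMATS = {
    "FP16": (5, 10),
    "bfloat16": (8, 7),
    "FP32": (8, 23),
    "FP64": (11, 52),
}

def max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value: (2 - 2^-m) * 2^bias, with bias = 2^(e-1) - 1."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2.0 - 2.0 ** -man_bits) * 2.0 ** bias

def min_normal(exp_bits: int) -> float:
    """Smallest positive normal value: 2^(1 - bias)."""
    bias = 2 ** (exp_bits - 1) - 1
    return 2.0 ** (1 - bias)

def epsilon(man_bits: int) -> float:
    """Gap between 1.0 and the next representable value: 2^-m."""
    return 2.0 ** -man_bits

for name, (e, m) in FORMATS.items():
    print(f"{name:9s} max={max_finite(e, m):.3e} "
          f"min_normal={min_normal(e):.3e} eps={epsilon(m):.3e}")
```

Because bfloat16 and FP32 share the same 8-bit exponent, their largest and smallest normal values have the same order of magnitude, while the bfloat16 spacing around 1.0 is much coarser than that of FP16.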
In particular, Intel used bfloat16 to train AlexNet, ResNet-50, DC-GAN, SR-GAN, Baidu’s DeepSpeech2, and Google’s neural machine translation (GNMT). They also benchmarked a couple of industrial deep learning workloads: a Deep and Cross Network, and a DNN recommendation system. The bfloat16 data was used to hold the tensor values (activations and weights), with results accumulated in FP32.
At this point, Intel doesn’t have bfloat16 implemented in any of its processors, so they used current AVX512 vector hardware present in its existing processors to emulate the format and the requisite operations. According to the researchers, this resulted in “only a very slight performance tax.”
Dubey says the emulated bfloat16 worked beautifully across the workloads. The models converged in the same number of iterations as when using FP32 for all the computations, without any hyperparameter tuning required. In fact, the bfloat16 runs tracked the FP32 runs almost exactly, as documented in the research paper penned by Dubey and his colleagues.
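The article does not describe how the AVX512 emulation was written, but the general idea of emulating bfloat16 on FP32 hardware can be sketched with NumPy: a bfloat16 value is simply the upper 16 bits of an FP32 value, so rounding an FP32 number to that width and clearing the lower 16 bits gives the emulated result. The function name `to_bfloat16` and the choice of round-to-nearest-even are assumptions for this illustration, and NaN inputs are not handled specially.

```python
import numpy as np

def to_bfloat16(values):
    """Round FP32 values to bfloat16 precision, returned as FP32 for further math.

    Assumption: round-to-nearest-even on the 16 bits that are dropped.
    NaN payloads are not treated specially in this sketch.
    """
    x = np.atleast_1d(np.asarray(values, dtype=np.float32))
    bits = x.view(np.uint32)
    # Least significant bit of the part we keep, used to break ties toward even.
    lsb = (bits >> 16) & 1
    rounded = bits + np.uint32(0x7FFF) + lsb
    # Clear the low 16 bits: what remains is exactly a bfloat16 bit pattern.
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

sample = to_bfloat16([1.0, 3.14159265, 1e38])
print(sample)
```

Here 1e38, which is far outside the FP16 range, stays finite, while pi loses most of its digits; that is the range-for-precision trade the article describes.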
Essentially, they were able to get the benefit of the 16-bit throughput for free, the slight caveat being that some of the work, like the fused multiply-add (FMA), needs an FP32 accumulator. But, according to Dubey, depending on how much you’re able to keep the computations in the bfloat16 realm, you should be able to improve training speed by at least 1.7x. Which is a big deal when training a model takes days or even weeks.
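Why the accumulator needs to stay in FP32 can be shown with a deliberately simple dot product. This is a pure-Python sketch under stated assumptions, not the method used in the paper: `round_to_bf16`, `round_to_fp32` and `dot` are hypothetical helpers, and the inputs are just vectors of ones.

```python
import struct

def round_to_fp32(x: float) -> float:
    """Round a Python float (FP64) to FP32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_to_bf16(x: float) -> float:
    """Round to bfloat16 (round-to-nearest-even on the low 16 bits of FP32)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot(a, b, round_accumulator):
    """Multiply bfloat16 inputs, rounding the running sum with the given function."""
    total = 0.0
    for x, y in zip(a, b):
        product = round_to_bf16(x) * round_to_bf16(y)
        total = round_accumulator(total + product)
    return total

ones = [1.0] * 1000
fp32_accumulated = dot(ones, ones, round_to_fp32)  # 1000.0
bf16_accumulated = dot(ones, ones, round_to_bf16)  # stalls at 256.0

print(fp32_accumulated, bf16_accumulated)
```

With only 8 significant bits, the bfloat16 running sum stops growing once it reaches 256, because 257 is not representable and rounds back to 256. Keeping the sum in FP32 avoids this while the inputs still travel as 16-bit values.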
The researchers conclude that bfloat16 is able to represent tensor values across many application domains, including vision, speech, language, generative networks, and recommendation systems, and doesn’t suffer from the drawbacks of FP16 implementations. They go on to say that they “expect industry-wide adoption of bfloat16 across emerging domains.” Of course, given that Intel will be supporting the format in both its general purpose Xeon line and its purpose-built NNP processor, that adoption is more assured than ever.
Is BFloat16 something that belongs in an accelerator or in a general-purpose core?
It is obviously useful for ML, but how useful is it across domains? One interesting part of the article is that Intel, rather than implementing BFloat16 in hardware to evaluate it, simulated it with AVX-512, with "only a very slight performance tax". However, according to the article, "Intel will be supporting the format in both its general purpose Xeon line and its purpose-built NNP processor".
I am sceptical of the idea that a balanced general-purpose core should be encumbered by BFloat16 (and AVX-512, for that matter) unless the inclusion can be shown to have greater benefits than executing such code on a GPGPU or a dedicated accelerator.
By the way, BFloat16 was supported in AMD's ROCm 2.6 (rocBLAS/Tensile), released earlier this month:
"Radeon ROCm 2.6 brings various information reporting improvements, the first official release of rocThrust and hipCUB, MIGraphX 0.3 for reading models frozen from Tensorflow, MIOpen 2.0 with Bfloat16 support and other features, BFloat 16 for rocBLAS/Tensile, AMD Infinity Fabric Link support, RCCL2 support, rocFFT improvements, ROCm SMI fixes, and other enhancements."
Government Approved AMD’s China Exports July 21, 2019 9:56 am ET wsj.com
AMD has actively sought open and productive dialogue with the U.S. government to protect national-security interests. Advanced Micro Devices Inc. is a proud American company with a long history of working with the U.S. government on projects to advance U.S. innovation. AMD has always complied with all laws, regulations and policies governing the sale of products and licensing of technology overseas, so we were extremely disappointed to read “Chip Maker Shared ‘Keys to Kingdom’” (Page One, June 28).
The implications that AMD put financial interests above national-security interests or structured joint ventures to evade government regulatory review aren’t true. When AMD formed these joint ventures in 2016, compliance with U.S. export regulations was the highest priority—and that hasn’t changed. Before forming the joint ventures, AMD proactively briefed the Departments of Commerce, Defense, State and multiple other federal agencies, and received no objections. Prior to transferring any technology, AMD obtained written notification from the Commerce Department that the technology was classified for export to China without a license. The technology transferred was specifically modified to be of lower performance than chips commercially available in China from AMD and others. AMD also implemented significant controls to protect our intellectual property.
The environment around technology and national security has evolved over the last three years. As the U.S. government has communicated new concerns to the technology industry, AMD has actively sought open and productive dialogue with the U.S. government to protect national-security interests. AMD remains committed to working closely with the U.S. government and its peers to make America as competitive as possible in the global markets while placing national-security interests as the foremost priority.
As more reviews, experience and evidence emerge for the Ryzen 3000 boosting behaviour, they seem to corroborate my intuition that AMD did not hit the frequency targets for Zen 2. I think Lisa Su hinted about it when she announced the launch schedule earlier this year ("we need to get frequencies where we want them", or something to that effect). That schedule was a bit later than most expected, me included.
Maybe the AMD engineers put too much priority on power-efficiency and targeted a sweet spot a little too low on the power-vs-frequency curve, hoping and expecting they would reach the frequencies necessary on the high-end to challenge Intel's single-thread performance dominance. However, it now seems the SKUs have trouble even reaching advertised boost frequencies, to any meaningful degree at least.
AMD Senior Technical Marketing Manager Robert Hallock, who seems to me, in all his presentations and demeanour, to be a genuine PC enthusiast trying to do clear and fair marketing and engage with the community, is now under criticism for his promotion of and claims about Precision Boost Overdrive (which currently seems to do very little, if anything at all), and in particular the reference to possible gains of another 200 MHz over and beyond the 4.6 GHz frequency advertised for Ryzen 3900X (see his PBO explanation video).
Hopefully, Zen 2 is doing better than expected on the power-efficiency part of the curve to compensate. In the end, Ryzen 3000 is still impressively competitive, and Zen 2's service in EPYC 2 is after all the top priority, where power-efficiency at the sweet spot on the frequency curve is more important.
Still, I hope AMD has more they can do to refine the Zen 2 implementation and 7nm process to gain some frequency. If so, perhaps we will have a refresh of Ryzen 3000 before Zen 3 arrives. It will be interesting to see how Ryzen 3950X fares when it arrives in September, and whether the silicon is any better.