> Is BFloat16 something that belongs in an accelerator or in a general-purpose core?
It is obviously useful for ML, but how useful is it across domains? One interesting part of the article is that Intel, rather than implementing BFloat16 in hardware to evaluate it, simulated it with AVX-512, with "only a very slight performance tax". However, according to the article, "Intel will be supporting the format in both its general purpose Xeon line and its purpose-built NNP processor".
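For context on what "simulating BFloat16" means in practice: BFloat16 is simply the top 16 bits of an IEEE binary32 value (same 8-bit exponent, a 7-bit mantissa), so it can be emulated on any FP32 hardware by rounding away the low 16 bits. The sketch below is my own illustration of that trick in plain Python, not Intel's actual AVX-512 code path:

```python
import struct

def f32_to_bits(x: float) -> int:
    # Reinterpret a float as its IEEE-754 binary32 bit pattern.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b))[0]

def f32_to_bf16(x: float) -> float:
    # BFloat16 keeps the high 16 bits of binary32.  Round to nearest,
    # ties to even, then zero the low 16 bits so the result is exactly
    # representable in bfloat16 (returned here as a float for clarity).
    bits = f32_to_bits(x)
    lsb = (bits >> 16) & 1               # lowest surviving mantissa bit
    rounded = (bits + 0x7FFF + lsb) & 0xFFFF0000
    return bits_to_f32(rounded)
```

Values like 1.0 or 3.0 pass through exactly; something like 0.1 loses its low mantissa bits, which is the precision trade-off the whole FP16-vs-BFloat16 debate is about.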
I am sceptical of the idea that a balanced general-purpose core should be encumbered by BFloat16 — and AVX-512, for that matter — unless the inclusion can be shown to have greater benefits than executing such code on a GPGPU or a dedicated accelerator.
FP16 vs BFloat16: nickhigham.wordpress.com
By the way, BFloat16 was already supported in AMD's ROCm 2.6 (rocBLAS/Tensile), released earlier this month:
"Radeon ROCm 2.6 brings various information reporting improvements, the first official release of rocThrust and hipCUB, MIGraphX 0.3 for reading models frozen from Tensorflow, MIOpen 2.0 with Bfloat16 support and other features, BFloat16 for rocBLAS/Tensile, AMD Infinity Fabric Link support, RCCL2 support, rocFFT improvements, ROCm SMI fixes, and other enhancements."
However there is no hardware support for BFloat16 in Radeon yet:
"Added mixed precision bfloat16/IEEE f32 to gemm_ex. The input and output matrices are bfloat16. All arithmetic is in IEEE f32."
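That release note describes the common mixed-precision scheme: matrices are stored in bfloat16, but every multiply-accumulate runs in f32, and only the final result is rounded back down. A minimal Python model of that scheme (my own sketch, not the rocBLAS implementation; Python floats stand in for the f32 accumulator):

```python
import struct

def bf16(x: float) -> float:
    # Round a binary32 value to bfloat16 precision (nearest, ties to even).
    b = struct.unpack("<I", struct.pack("<f", x))[0]
    b = (b + 0x7FFF + ((b >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", b))[0]

def gemm_bf16_f32(A, B):
    # A and B are row-major lists of lists.  Inputs are quantised to
    # bfloat16 on the way in, all arithmetic is done at full precision
    # (modelling the f32 accumulator), and the output is quantised to
    # bfloat16 on the way out — matching the gemm_ex description above.
    n, k, m = len(A), len(A[0]), len(B[0])
    Ab = [[bf16(x) for x in row] for row in A]
    Bb = [[bf16(x) for x in row] for row in B]
    return [[bf16(sum(Ab[i][p] * Bb[p][j] for p in range(k)))
             for j in range(m)] for i in range(n)]
```

With small integer-valued matrices everything is exactly representable, so the bfloat16 rounding is invisible; with realistic data the inputs lose precision but the accumulation does not, which is why this scheme works so well for ML workloads.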
It would be interesting to see how simulating BFloat16 on RDNA would compare, in suitability and performance, with what Intel did to evaluate the format with AVX-512.