The native FP16 implementation of the Fused MultiHeadAttention (FMHA) operator in onnxruntime exhibits numerical divergence from the equivalent PyTorch FP16 implementation, even when running ...
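
A minimal sketch of the kind of comparison described above, not the original reproduction script: it builds a small `torch.nn.MultiheadAttention` module in FP16, exports it to ONNX, runs it with onnxruntime's CUDA execution provider (where the fused attention kernel is expected to be selected), and reports the divergence from the PyTorch output. The module dimensions, sequence length, opset version, and file name are illustrative assumptions.

```python
import numpy as np
import torch
import onnxruntime as ort

torch.manual_seed(0)

# Small multi-head attention block in FP16 on GPU (assumed available).
embed_dim, num_heads, seq_len, batch = 64, 4, 128, 2
mha = torch.nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
mha = mha.eval().half().cuda()

x = torch.randn(batch, seq_len, embed_dim, dtype=torch.float16, device="cuda")


class Wrapper(torch.nn.Module):
    """Wrap MultiheadAttention so the ONNX graph has a single input/output."""

    def __init__(self, mha):
        super().__init__()
        self.mha = mha

    def forward(self, x):
        out, _ = self.mha(x, x, x, need_weights=False)
        return out


model = Wrapper(mha)

with torch.no_grad():
    ref = model(x)  # PyTorch FP16 reference output

# Export to ONNX; onnxruntime's graph optimizer may fuse the attention
# subgraph into a fused attention kernel on the CUDA execution provider.
torch.onnx.export(
    model, (x,), "mha_fp16.onnx",
    input_names=["x"], output_names=["out"], opset_version=17,
)

sess = ort.InferenceSession("mha_fp16.onnx", providers=["CUDAExecutionProvider"])
ort_out = sess.run(None, {"x": x.cpu().numpy()})[0]

# Report the divergence between the two FP16 paths.
diff = np.abs(ref.cpu().numpy().astype(np.float32) - ort_out.astype(np.float32))
print("max abs diff:", diff.max(), "mean abs diff:", diff.mean())
```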