Given integral type A and non integral type B and depending on rounding
mode, optimization, compiler, and phase of the moon A(A)*B != A(A*B) so
split the two cases.
While at it, also make the template automagically work for complex types
instead of requiring manual casts, the general idea here is to allow
inlining and vectorization by treating all args as plain arrays, which is fine.
This works as expected with -tune=native, x64 implies sse2, and we do not
target any neon-less arm versions either.
Clang only array length hints can improve this even more.
Change-Id: I93f077f967daf2ed382d12cc20a54846b3688634