Using a derived key r^2 improves performance, as it allows loop unrolling
and slightly better use of SIMD instructions.
Overall ChaCha20-Poly1305 performance increases by ~12%.
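
The idea behind the r^2 unrolling can be sketched as follows. This is only
an illustration, not the driver code: it uses a toy modulus (TOY_P) and
plain 64-bit integers instead of 2^130 - 5 and the multi-word
representation, and the function names are hypothetical. It relies on
((h + m1) * r + m2) * r == (h + m1) * r^2 + m2 * r, so the two products in
one iteration no longer depend on each other.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy prime modulus (assumption for illustration only), small enough that
 * all products fit in 64 bits. Real Poly1305 works modulo 2^130 - 5. */
#define TOY_P 1000003ULL

/* Reference: one block per iteration, h = (h + m[i]) * r mod TOY_P.
 * Block values are assumed to be well below 2^63. */
static uint64_t poly_serial(const uint64_t *m, size_t n, uint64_t r)
{
	uint64_t h = 0;

	r %= TOY_P;
	for (size_t i = 0; i < n; i++)
		h = ((h + m[i]) % TOY_P) * r % TOY_P;
	return h;
}

/* Unrolled: two blocks per iteration using the derived key r2 = r^2. */
static uint64_t poly_unrolled(const uint64_t *m, size_t n, uint64_t r)
{
	uint64_t h = 0, r2;
	size_t i = 0;

	r %= TOY_P;
	r2 = r * r % TOY_P;
	for (; i + 2 <= n; i += 2) {
		/* (h + m1) * r^2 and m2 * r are independent products */
		uint64_t a = ((h + m[i]) % TOY_P) * r2 % TOY_P;
		uint64_t b = (m[i + 1] % TOY_P) * r % TOY_P;

		h = (a + b) % TOY_P;
	}
	if (i < n)	/* odd trailing block, handled the serial way */
		h = ((h + m[i]) % TOY_P) * r % TOY_P;
	return h;
}
```

Both functions return the same value for any input; the unrolled variant
just exposes two independent multiplications per iteration.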
Converting integers to/from our 5-word representation in SSE does not seem
to pay off, so we work on individual words.
We always build the driver on x86/x64, but enable it only if SSSE3 support
is detected at runtime.
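
A runtime check of this kind might look like the sketch below. It assumes
a GCC or Clang toolchain and uses their __builtin_cpu_supports() builtin;
the actual detection mechanism used by the driver is not shown here, and
have_ssse3() and select_poly_driver() are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical helper: true if the CPU reports SSSE3 support.
 * Falls back to false on other compilers or architectures. */
static bool have_ssse3(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
	__builtin_cpu_init();
	return __builtin_cpu_supports("ssse3") != 0;
#else
	return false;
#endif
}

/* Hypothetical selection: the driver is always compiled in, but only
 * picked if the feature is present at runtime. */
static const char *select_poly_driver(void)
{
	return have_ssse3() ? "ssse3" : "portable";
}
```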
Poly1305 uses multiplications of 32-bit operands yielding a 64-bit result,
two of which can be done in parallel in SSE. This is minimally faster than
multiplication with 64-bit operands, and also works on 32-bit builds that
lack an __int128 result type.
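
The paired multiplication can be sketched with the SSE2 intrinsic
_mm_mul_epu32(), which multiplies the low 32 bits of each 64-bit lane and
yields two full 64-bit products. This is a minimal illustration, not the
driver code; mul2_32x32_64() is a hypothetical name, and a scalar fallback
is included so the sketch also builds without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Computes out[0] = a0 * b0 and out[1] = a1 * b1 as full 64-bit
 * products of 32-bit operands. */
static void mul2_32x32_64(uint32_t a0, uint32_t a1,
			  uint32_t b0, uint32_t b1, uint64_t out[2])
{
#if defined(__SSE2__)
	/* place each operand in the low half of its own 64-bit lane */
	__m128i a = _mm_set_epi32(0, (int)a1, 0, (int)a0);
	__m128i b = _mm_set_epi32(0, (int)b1, 0, (int)b0);

	/* both 32x32->64 multiplications are done by one instruction */
	_mm_storeu_si128((__m128i *)out, _mm_mul_epu32(a, b));
#else
	out[0] = (uint64_t)a0 * b0;
	out[1] = (uint64_t)a1 * b1;
#endif
}
```

No 128-bit integer type is needed, as each product fits in 64 bits.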
On a 32-bit architecture, this is more than twice as fast as the portable
driver, and on 64-bit it is ~30% faster.