Now we use intrinsics when possible, and fallback to optimized
implementation in portable C++. The difference is about 4x when
we can use intrinsics and about 2x when we cannot.
This should speed up our Lemire's algorithm implementation nicely.
* Utility for extended mult n x n bits -> 2n bits
* Utility to adapt output from URBG to target (unsigned) integral
type
* Utility to reorder signed values into unsigned type while keeping
the order.