RDTSC is not ordered with respect to earlier instructions on Intel and
AMD CPUs unless a memory barrier precedes it, and worse, the results
can be non-monotonic if compared across different processors.
The Intel SDM says lfence;rdtsc causes all previous instructions to
complete before the TSC is read, and the AMD APM says to use
mfence;rdtsc to do the same thing. This is what the Linux kernel does
in its rdtsc_ordered function.
* https://github.com/torvalds/linux/blob/03b9730b769fc4d87e40f6104f4c5b2e43889f19/arch/x86/include/asm/msr.h#L130-L154
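
For illustration, the two barrier sequences can be sketched in user
space with compiler intrinsics. This is a minimal sketch, not the
kernel's code: it assumes GCC or Clang targeting x86-64, the helper
names tsc_read_lfence and tsc_read_mfence are hypothetical, and it
does not try to pick the barrier per CPU the way rdtsc_ordered does.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, _mm_lfence, _mm_mfence (GCC/Clang, x86 only) */

/* Hypothetical helper: the lfence;rdtsc sequence described by the Intel SDM. */
static inline uint64_t tsc_read_lfence(void)
{
    _mm_lfence();      /* earlier instructions complete before the TSC read */
    return __rdtsc();
}

/* Hypothetical helper: the mfence;rdtsc sequence described by the AMD APM. */
static inline uint64_t tsc_read_mfence(void)
{
    _mm_mfence();      /* full memory fence before the TSC read */
    return __rdtsc();
}
```

Neither helper addresses the cross-processor problem: the barrier only
orders the read on the current CPU, so values taken on different
processors may still not be comparable.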