D assumes that N-bit machines support 2N-bit atomic loads, which is the case for all modern processors. The compiler should generate code that loads and stores all built-in shared slices atomically.
Apparently, on x86-64 the only way to do so is to use CMPXCHG16B (http://stackoverflow.com/questions/4099002/x86-128-bit-atomic-ops). The instruction is supported by the front end too.
Also, core.atomic should support atomicLoad() for 128-bit values on 64-bit models.
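
For illustration, here is a minimal sketch of the usage being requested. It assumes a druntime in which core.atomic exposes the has128BitCAS flag and accepts 16-byte types in atomicLoad(), atomicStore() and cas(); that is the behaviour this report asks for, not something it confirms. FatPointer and roundTrip are hypothetical names standing in for a built-in slice, which is a length plus a pointer (16 bytes on a 64-bit target).

```d
import core.atomic : atomicLoad, atomicStore, cas, has128BitCAS;

// Hypothetical stand-in for a slice: a pointer-sized length plus a
// pointer-sized address, 16 bytes in total on a 64-bit target.
struct FatPointer
{
    size_t length;
    size_t address;
}

// Stores, loads and compare-and-swaps a 16-byte value as a single unit.
// Returns true if every step observed a consistent (untorn) value.
bool roundTrip()
{
    static if (has128BitCAS)
    {
        // CMPXCHG16B requires its memory operand to be 16-byte aligned.
        align(16) shared FatPointer slot;

        atomicStore(slot, FatPointer(3, 0x1000));
        FatPointer seen = atomicLoad(slot);

        // Replace both halves at once, only if neither has changed.
        bool swapped = cas(&slot, FatPointer(3, 0x1000), FatPointer(5, 0x2000));
        FatPointer after = atomicLoad(slot);

        return seen.length == 3 && seen.address == 0x1000
            && swapped && after.length == 5 && after.address == 0x2000;
    }
    else
    {
        // Assumption: without 128-bit CAS there is nothing to demonstrate.
        return true;
    }
}

void main()
{
    assert(roundTrip());
}
```

The static if guard keeps the sketch compiling on targets without a 128-bit compare-and-swap, where the 16-byte path is simply skipped.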
Comment #1 by blah38621 — 2014-08-29T18:18:28Z
If someone is already going to be digging around in core.atomic, perhaps they
could add intrinsic versions as well, so that the operations can be inlined and
don't need to do the excessive amount of extra moving that they currently do.