In general, the scheme I proposed had three core operations: watch, touch, and validate. The trouble with watch is that it suggests some sort of global monitor which detects touches to various locations. Such a thing is easy enough to build; one can use a hash table, a bitvector, a Bloom filter, or any number of other techniques. This is no mystery.
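To make the centralized version concrete, here is a minimal sketch of a global monitor built on a hash table guarded by a single lock. The class name `CentralWatchTable`, the method signatures, and the use of transaction ids are all assumptions made for illustration; the scheme itself only names the three operations.

```python
import threading


class CentralWatchTable:
    """Hypothetical centralized conflict detector: one table, one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._watchers = {}       # location -> set of txn ids watching it
        self._conflicted = set()  # txn ids whose watched locations were touched

    def watch(self, txn, loc):
        # Register interest in a location.
        with self._lock:
            self._watchers.setdefault(loc, set()).add(txn)

    def touch(self, txn, loc):
        # Mark every *other* watcher of this location as conflicted.
        with self._lock:
            for other in self._watchers.get(loc, ()):
                if other != txn:
                    self._conflicted.add(other)

    def validate(self, txn):
        # True if nobody touched what txn watched; txn's state is then forgotten.
        with self._lock:
            ok = txn not in self._conflicted
            for watchers in self._watchers.values():
                watchers.discard(txn)
            self._conflicted.discard(txn)
            return ok
```

Every operation takes the same lock, which is exactly the single point of contention that stops scaling as the processor count grows.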
The real problem is that in a highly distributed system, or even a multicore system with more than 32 cores, no centralized memory system is feasible. Beyond 32 processors, you get contention problems. In a distributed system, a single shared memory is out of the question.
At higher levels of contention, it becomes necessary to get rid of the centralized coordination, or at least reduce it. We can employ version tags to do this (bounded-length infinite-reuse tags for n processors can be implemented with 2n bits): watch records a location's tag, touch increments the tag for our own copy, and validate checks the recorded tag against the global version. This approach seems to imply a slightly different set of intrinsics. First, watch and read must be combined, though read without watch is still permissible. Second, rollback performs writes, and so bumps version tags. However, since rollback restores old values, we need some way to state that two version tags are equivalent. There's a fairly simple obstruction-free, undo-logged transaction implementation using this method.
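A rough sketch of the version-tagged scheme follows. It is single-threaded and purely illustrative: a real implementation would need atomic read-modify-write operations on the tags, and it would use the bounded tags mentioned above rather than the unbounded integers assumed here. The names `Cell`, `Txn`, `watch_read`, and `touch_write` are hypothetical, and the sketch does not attempt to solve the tag-equivalence question for rollback.

```python
class Cell:
    """A memory location paired with a version tag (unbounded int, by assumption)."""

    def __init__(self, value):
        self.value = value
        self.tag = 0


class Txn:
    """Undo-logged transaction that detects conflicts by comparing tags."""

    def __init__(self):
        self.seen = {}  # Cell -> tag recorded when first watched
        self.undo = []  # (Cell, old value) pairs, newest last

    def watch_read(self, cell):
        # Combined watch + read: remember the tag we saw, return the value.
        self.seen.setdefault(cell, cell.tag)
        return cell.value

    def touch_write(self, cell, value):
        # Log the old value, write in place, and bump the tag.
        was_current = self.seen.get(cell) == cell.tag
        self.undo.append((cell, cell.value))
        cell.value = value
        cell.tag += 1
        if was_current:
            # Our own touch should not invalidate our own watch.
            self.seen[cell] = cell.tag

    def validate(self):
        # Valid only if every watched tag still matches the current one.
        return all(cell.tag == tag for cell, tag in self.seen.items())

    def rollback(self):
        # Rollback is itself a series of writes, so it bumps tags too.
        while self.undo:
            cell, old = self.undo.pop()
            cell.value = old
            cell.tag += 1
        self.seen.clear()
```

Note that after a rollback the value is back where it started but the tag is not, which is the equivalence problem described above: a concurrent watcher sees a changed tag even though the value it read is current again.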
The apparent way of the world would suggest we'll run into another wall roughly a factor of 8 further on, at around 256 processors. At this point, we're best off using Herlihy's methodology. Since my intrinsic set was designed to allow the implementation of either undo or redo logging (and generally to maximize flexibility), we would need to slip redo logging in under the constructs I've provided for doing undo logging. This could be done by buffering all locations which are watched or touched, having log instructions allocate a slot in the redo log to which subsequent writes store, having rollback simply clean up, and having validate do an implicit commit. It's less efficient, but it works. This gets you a lock-free redo-logged implementation.
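The redo-logging variant can be sketched in the same style. This is a single-threaded illustration of the buffering idea only: it is not lock-free as written, since the commit step in `validate` would have to be published atomically in a real system. The class `RedoTxn`, the shared `memory` dict of `(value, tag)` pairs, and the method names are assumptions for the example.

```python
class RedoTxn:
    """Redo-logged transaction: writes are buffered until validate commits them."""

    def __init__(self, memory):
        self.memory = memory  # shared dict: location -> (value, tag)
        self.seen = {}        # location -> tag recorded when first watched
        self.redo = {}        # location -> buffered value (the redo log)

    def watch_read(self, loc):
        # Reads see our own buffered writes first.
        if loc in self.redo:
            return self.redo[loc]
        value, tag = self.memory[loc]
        self.seen.setdefault(loc, tag)
        return value

    def log(self, loc):
        # Allocate a redo-log slot, seeded with the current value.
        if loc not in self.redo:
            self.redo[loc] = self.watch_read(loc)

    def write(self, loc, value):
        # Stores go to the redo-log slot, never directly to shared memory.
        self.log(loc)
        self.redo[loc] = value

    def rollback(self):
        # Nothing was written in place, so rollback only cleans up.
        self.redo.clear()
        self.seen.clear()

    def validate(self):
        # Check every watched tag, then commit the buffered writes implicitly.
        for loc, tag in self.seen.items():
            if self.memory[loc][1] != tag:
                return False
        for loc, value in self.redo.items():
            self.memory[loc] = (value, self.memory[loc][1] + 1)
        self.rollback()
        return True
```

The extra cost is visible in the sketch: every read has to consult the redo log first, and every committed write is copied twice.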
One would also expect that at around 1024 processors, the protocol would need to move to wait-freedom.
In summary, I theorize the following strategies by CPU count:
- 1-8: Lock-based centralized conflict detection
- 8-32: Lock-free centralized conflict detection
- 32-256: Obstruction-free version-tagged decentralized conflict detection
- 256-1024: Lock-free Herlihy
- 1024+: Wait-free Herlihy
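
As a small illustration, the tiers above can be written as a lookup. The function name `strategy_for` is hypothetical, and since the listed ranges share their endpoints, the sketch assumes each boundary value (8, 32, 256, 1024) belongs to the lower tier.

```python
from bisect import bisect_left

# Upper bound of each tier except the last; a boundary value is assumed
# to fall in the lower tier.
_UPPER_BOUNDS = [8, 32, 256, 1024]
_STRATEGIES = [
    "Lock-based centralized conflict detection",
    "Lock-free centralized conflict detection",
    "Obstruction-free version-tagged decentralized conflict detection",
    "Lock-free Herlihy",
    "Wait-free Herlihy",
]


def strategy_for(cpus):
    """Return the theorized strategy for a given CPU count."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return _STRATEGIES[bisect_left(_UPPER_BOUNDS, cpus)]
```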