lockfree stack with ABA fix #java AtomicStampedRef

http://15418.courses.cs.cmu.edu/spring2013/article/46 explains ABA in the specific context of a … lockfree stack. To counter ABA problem, It uses CAS2 i.e. CAS on a group of two objects — the original object + a stamp.

Java CAS2? I believe AtomicStampedReference is designed specifically for it.

http://tutorials.jenkov.com/java-util-concurrent/atomicstampedreference.html#atomicstampedreference-and-the-a-b-a-problem explains the AtomicStampedReference solving the ABA problem but in the last section it doesn’t clearly explain the benefit of get().

Also, the retry need a loop.

The stamp is a plain old integer. The all-important “Increment Control” is not by some sophisticated magic inside the library, but by your own hand-written logic.

====Actually, many ABA illustrations are simplistic. Consider this typically simplistic illustration of a CAS stack:

  1. Thread 0 begins a POP and sees “A” as the top, followed by “B”. Thread saves “A” and “B”, before committing anything.
  2. Thread 1 begins and completes a POP , returning “A”.
  3. Thread 1 begins and completes a Push of “D”.
  4. Thread 1 begins and completes a Push of the same “A”. So now the actual stack top has A above D above B, i.e. with D inserted just below the original top-dog “A”
  5. Thread 0 sees that “A” is “still” on top and returns “A”, setting the new top to “B”. Therefore, Node D is lost.

— With a vector-of-pointer implementation, Thread 0 needs to save integer position within the stack. At Step 5, it should then detect that the A sitting at stack top is now at a higher position than observed earlier, detecting a symptom of ABA.

The rest is simple. Thread 0 should then query the new item (D) below A. Lastly, the CAS would compare current stack top position with the saved position, before committing.

However, in reality I think a vector-of-ptr is non-trivial to implement if we have two shared mutable things to update via CAS: 1) the stackTopPosition integer variable and 2) the (null) ptr in the next slot of the vector.

However, if Thread 1 replaces B with D, keeping the stack size unchanged, then the ABA would be undetectable by position check alone.

— with a linked list implementation, I think we only have node addresses, so at Step 5, Thread 0 can’t tell that D has been inserted between A and B.

You may consider using stack size as a second check, but it would be similar complexity as CAS2 but less reliable.

single-writer lockfree data structure@@

Most of the lockfree data structures I have seen have 2 writer threads or more.

Q: what if in your design, there can be at most one writer thread but some reader threads. Any need for lockfree?
A: I don’t think so. I think both scenarios below are very common.

  • Scenario: If notification required, then mutex + condVar
  • Scenario: In many scenarios, notification is not needed — Readers simply drop by and read the latest record. In java, a volatile field suffices.


latency favors STM; lockfree constructs target moderate latency

Strictly single-threaded mode means no shared mutable. Therefore, Really low-latency apps should probably avoid lockfree programming which assumes the presence of shared mutable. If no shared mutable, then the lockfree constructs would exert an unwanted performance penalty.

Now I think the real low-latency systems always prefer Single-Threaded-Mode. But is it feasible?

  • xtap and (loosely) Rebus are both STM — very high performance, proven designs.
  • Nasdaq’s new java-based architecture is STM, including their matching engine.
  • The matching engines in many exchanges/ECNs are STM. Remember FXAll interview?



This is a pure QQ topic, but you can score a few points by mentioning it.

[[javaPerf2014]] P268 says biased locking is enabled by default but can be disabled to improve performance in app servers using a thread pool and contended locks. It explains the reason i.e. biased locking comes with a price — a runtime overhead.

https://stackoverflow.com/questions/9439602/biased-locking-in-java is very concise — un-contended locking may incur zero cost, as claimed in the accepted answer

  • .. incurs no synchronization cost. I suppose this is typically geared towards overly conservative code that performs locks on objects without ever exposing them to another thread. The actual synchronization overhead will only kick in once another thread tries to obtain a lock on the object.
  • Biased locking is the default in java6 ====> -XX:+UseBiasedLocking improving the performance of uncontended synchronization. An object is “biased” toward the thread which first acquires its monitor; subsequent monitor-related operations are relatively faster. Some applications with significant amounts of uncontended synchronization may attain significant speedups

Is this technique considered lockfree? No, but some may speculate that it might be even faster than lockfree. So if you suspect most of the time there’s only one thread accessing a critical section, you could choose to rely on the (java6 default) biased locking rather than lockfree solution. Most of the time this mutex-based design would challenge (if not beat) a lockfree performance.

However, I believe single-threaded mode is still faster, where a thread isn’t aware of other threads, as if it is the only thread access those objects i.e. no shared-mutable. There would be no lock grab, no memory fencing at all. [[javaPerf2014]] P375 agrees.

ABA problem ] lockfree designs #DeepakCM

  • breaks basic CAS assumptions
  • most common solution is [1]
  • affects c++ CAS only, probably not java CAS

[1] https://lumian2015.github.io/lockFreeProgramming/aba-problem.html and http://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists explain the “tagged state reference”  aka “tagged pointer“. I feel I don’t need to understand it. Even if I spent some time getting a grasp it may not be enough to impress an interviewer.

http://www.modernescpp.com/index.php/aba-a-is-not-the-same-as-a has more details, including

  • compare_exchange_strong
  • lockfree stack hitting UndefinedBehavior due to ABA
  • RCU

This topic was asked briefly in Deepak’s MS c++ interview in 2019. I think it’s too arcane and low-level for most interviewers.


##lockfree queue implementations c++J

I think a linked queue with 2 pointers (head / tail) could be relatively easy to implement by myself.

c++ lockfree data types: mostly bool/int/pointer

  • In generally, only atomic bool/int/pointer can be lock-free. Bigger types need special implementation techniques and I have read that lock-free is usually not possible.
  • Atomic flags are lock-free (this is the only type guaranteed to be lock-free on all library implementations)

[18]fastest threadsafe queue,minimal synchronization #CSY

I got this question in a 2017 Wells white-board coding interview, and discussed with my friend Shanyou. We hoped to avoid locks and also avoid other synchronization devices such as atomic variables..

Q1: only a single producer thread and a single consumer thread and no other threads.

I put together a java implementation that can enqueue without synchronization, most of the time … See https://wp.me/p74oew-7mE

Q1b: Is it possible to avoid synchronization completely, i.e. single-threaded mode?
A: No. Consumer thread would have absolutely NO idea whatsoever how close it is to the producer end. No. We asneed a memory barrier at the very least.

Q2: what if there are multiple producer/consumer threads?

I believe we can use 2 separate locks for the two ends, rather than a global lock. This is more efficient but invites the tricky question “how to detect when the two ends meet“. I am not sure. I just hope the locks enforce a memory barrier.

Alternatively, we could use CAS on both ends, but see lockfree queue #popular IV


unbounded queue for 1-Producer-1-Consumer #Wells

———- Forwarded message ———
From: Bin TAN (Victor) <tiger40490@gmail.com>
Subject: demo of my design that first interviewer didn’t see

Hi Angelo,

During my interview with Jun, I spent the last 20 minutes locked in a rigorous debate — given an unbounded queue designed for single-threaded mode, can we provide a thread-safe facade so a producer thread and a consumer thread can operate on it safely, but without using locks in every operation.
Note In JDK and STL there are many collections designed for single-threaded mode, because such designs are simpler and significantly faster.
I argued valiantly that there exist designs where most of the time we can avoid both locking and CompareAndSwap. I failed to convince Jun. Jun believed firmly that unsynchronized access by two threads is always unsafe so we must use lock or CompareAndSwap every single time.
I just spent 20 minutes at home posting an implementation in https://github.com/tiger40490/repo1/tree/jProj/java/com/wells
I would also like to point out the java ConcurrentHashMap lets two threads (possibly a writer thread and a reader thread) access the shared data structure concurrently, usually without locking or CompareAndSwap , when the two threads happen to access two distinct segments. Note there can be a large number of segments, up to a few hundred at least, so the chance of two threads hitting distinct segments is very high (99.999% chance for a 333-segments map). Therefore, contrary to what Jun said, for reader/writer threads to access the same data structure,  we don’t need locking in every read and write access.
concurrencyLevel – the estimated number of concurrently updating threads. The implementation may use this value as a sizing hint.
I wrote this (like most of my concurrent designs) in Java since Java memory model is better documented and better understood.

CAS cpu-instruction: 3inputs #curVal discovered]hw

A CVA interviewer asked me to explain the cmpxch cpu-instruction. I now believe it COMPARES two values (expected^current) and IIF matched, updates the memory location to a “newValue”.

Out of these 3 “inputs”, only the expected and newValue are inputs to the function. The 3rd item “current” is NOT an input parameter to the function, but discovered in the hardware.

See P1018 [[the c++StdLib]] by Josuttis

This knowledge is slightly below java API

lockfree usage in high speed trading system@@

Hi Deepak,

Further to our conversation, do high frequency shops use lock-free? I would guess that the most demanding high-speeding data processing system (such as the matching engine in an exchange) are single-threaded, rather than lock-free multithreaded.

I hope to hear a typical high-speed data processing system that has lots of shared mutable data. I have not found one.

· Order matching

· Airline booking

· Vote counting in real time?

If 99% of the data is not shared mutable, then single-threaded design is probably better.

· I feel one reason is complexity vs simplicity. Lock-free designs can be very tricky, according to experts.

· Another reason is performance. Lock-free is, in my view, definitely slower than single-threaded. The memory fence required on those atomic objects impeded compiler optimizations.

· Most important reason, in my view — it’s always possible to achieve the same functionality without using locks or lock-free designs. Single-threaded designs are possible, simpler and faster.

If we have 64 cores, just run 64 processes, each single-threaded. However, in reality these systems often do have some shared mutable data between two threads. There are various solutions I’m not qualified to compare and illustrate. These solutions could use lock-free or locks.

serialize access to shared mutable: mutex^CAS

[[optimized c++]] P290 points out that, in addition to mutex, CAS construct also serializes access to shared mutable objects.

I feel it’s nothing but a restatement of the definition of “shared mutable”.  More relevant question is

Q: what constructs support unimpeded concurrent access to shared mutable?
A: read-write lock lets multiple readers proceed, in the absence of writers.
A: RCU lets all readers proceed, but writers are restricted.


See also blog on c++ atomic<int> — primary usage is load/store, not CAS

lock free isn’t equal to CAS — lock-free construct also include Read-Copy-Update, widely used in linux and other kernels.

atomic isn’t equal to lock-free — For example, c++ atomic classes (actually templates) can use locks if the underlying processor lacks CAS instruction.


read-copy-update lockfree +! retry #RCU

RCU is an advanced concurrency construct, not implemented in a regular application, but valuable for lock-free interviews.

http://concurrencyfreaks.blogspot.sg/2016/09/a-simple-userspace-rcu-in-java.html is a “simple Userspace” java implementation. I didn’t read it in detail and assume it’s rather different from the linux kernel RCU

http://www.modernescpp.com/index.php/aba-a-is-not-the-same-as-a (by a published author) has a brief mention of Userspace RCU solution to ABA problem. Also touches on Garbage Collection.

https://lwn.net/Articles/262464/ — by RCU inventor. I can see RCU is non-trivial !

The Wikipedia article is accessible until the paragraph introducing read-side-critical-section and grace-period. This is the first paragraph on the implementation. I found it hard. Therefore a few background pointers:

  • · There must be multiple threads, typically many readers and few writers.
  • · There must a shared mutable data structure, probably on heap.
  • · Not all data structures are suitable for RCU. So which subset are? I would say pointer-graph including hash tables.

In the simplified model, we are talking about

  • · A reader thread R1 executing a block of code that has started reading the shared mutable data structure. This code block is the critical section
  • · A writer thread W1 executing a block of code attempting to update the same data.
  • · After the so-called grace period, a GC thread would reclaim the obsolete data that R1 has finished reading.
  • · A reader R2 enters the critical section after the update is done. R2 would see the new data

GC need to know when it’s safe to reclaim. It needs the wait-for-readers operation and the grace period.

Q: What if R1 gets a reference to the “obsolete node”, performs some slow task, reads the node, then exits the critical section? This node is reclaimable only after R1 exits critical section. The grace period would be long?

Q: Among 9999 nodes, how does system keep track which each node is reclaimable?
%%A: I feel kernel access to CPU registers might be needed.

Q: How does it compare with copy-on-write?
A: COW probably copies the entire data structure. Also, the modified copy is not visible to subsequent readers (like R2)

Q: How does it compare with read/write lock?
A: readlock can block

Q: How does it relate to lock-free concepts?
A: reader threads are lock-free and even better — not required to retry.
A GC threads need to wait.
A: Writer thread (like W1) ? not sure. Might be lock-free in some situations.

minimize locking – queue for producer/consumer

I was asked this question in a big bank IV.

Q: if our queue needs to be as fast as possible, we want to avoid a global lock. How?

%%A1: multi-queue, based on cusip or account prefix. I implemented this partially in a JPM take-home coding test

%%A2: if were are very sure the queue is never depleted, then use 2 locks at both ends, so consumer threads only need to contend for the consumer lock.

%%A3: lock free queues are probably quite common in c#, java and c++

For A2, On one hand, we want to keep things simple by having fewer locks. On the other hand, we want to increase parallelism by breaking up one big sync group (of threads) into independent groups, each having a “smaller” lock.
Coarse-grained parallelism is key. If the 2 smaller sync groups never cross paths, then the additional locks won’t add complexity. We may need a way to ensure that the 2 ends of the queue never cross, hopefully without needing both locks. The risk — the consumer is too fast and “eats” an item that’s not yet added. We can make sure this always fails reliable in production and stress test, rather than UndefinedBehavior.

I feel in reality, there is usually some bound on the capacity of producer, consumer or queue. Sometimes producer will be too fast (overflow), or too slow (cross), so a queue without any bound check is unreliable in practice.

atomic operations offered in c++11 ^ pthreads libraries

P341 [[c++ concurrency in action]] has a nice table showing about 7 to 10 most important concurrency features across pthreads vs c++11 [2]. It’s obvious to me the 2 fundamental[1] and universally important features are locks and condVars. These are thoroughly and adequately supported everywhere — pthreads, c++11, java, c#. Beyond these 2 features, other features aren’t universally supported across the board. Let’s look at some of them.

— Atomic operations —
pthreads? no support
c++11? yes atomic types
java? yes atomic variables
boost? no support
c#? yes interlocked

Notice in the C#, c++11, java thread libraries there are specific constructs (classes and methods) for atomic INT and atomic BOOLEAN (but not atomic float), because in practice most atomic algorithms use primitive int and boolean types.

Atomic operations don’t always rely on specific CPU instructions. They are often “offered” on (no more than a few) specific atomic data types. Can we apply atomic operations on a random data type, like some method in a regular class? I doubt it. I feel in each thread library, there are no more than a few specific atomic Operations, tied to a handful of atomic data types.

— thread pool —
java? yes
c#? yes
c++11? no
pthreads? no
boost? To my surprise, No, according to the maintainer of boost::thread

I think it’s technically very feasible to implement thread pool using locks and condVars, so this feature is left out of the c/c++ base libraries.

[1] “Fundamental” is a imprecise term that I would not spend too much time debating. In c#, locks and condVars aren’t really rock-bottom fundamental. They are based on more fundamental constructs namely primitive kernel objects. In other languages, I’m not sure. Locks and condVars are often implemented in thread libraries not syscalls. It’s a bit obscure, arcane and even irrelevant to many developers.
[2] (and java and boost too, but this blog is about c++)

%%Lockfree Unbounded Stack(generic) using AtomicReference

import static java.lang.System.*;
import java.util.concurrent.atomic.AtomicReference;
public class LockfreeUnboundedStack {
                final private AtomicReference<Node> arTop = new AtomicReference<Node>();
                public LockfreeUnboundedStack() {
                                this.arTop.set((Node) Node.theTerminator);
                public static void main(String[] oipwuir) {
                                LockfreeUnboundedStack stack = new LockfreeUnboundedStack();
                                Node a = stack.pop();
                                a = stack.pop();
                                a = stack.pop();
                                a = stack.pop();
                                a = stack.pop();
                                a = stack.pop();
                * @return a terminator node if empty stack
                public Node pop() {
                                boolean done;
                                Node oldTopNode;
                                do {
                                                oldTopNode = this.arTop.get();
                                                if (oldTopNode.isTerminator()) {
                                                                out.println(“returning terminator”);
                                                                return oldTopNode;
                                                Node newTopNode = oldTopNode.previous;
                                                done = this.arTop.compareAndSet(oldTopNode, newTopNode);
                                } while (!done);
                                out.println(“returning ” + oldTopNode);
                                return oldTopNode;
                public void push(V item) {
                                assert item != null;
                                Node newNode = new Node(item);
                                boolean done;
                                do {
                                                Node replacedTopNode = this.arTop.get();
                                                newNode.previous = replacedTopNode;
                                                done = arTop.compareAndSet(replacedTopNode, newNode);
                                } while (!done);
class Node {
                final V payload;
                Node previous = null;
                public Node(V pay) {
                                assert pay != null;
                                this.payload = pay;
                * callable only by the anonymous nested class
                private Node() {
                                this.payload = null;
                public boolean isTerminator() {
                                return false;
                public String toString() {
                                String p = “an invalid payload”;
                                if (this.payload != null) p = this.payload.toString();
                                return “Node holding ” + this.payload.toString();
                static final Node theTerminator = new Node() {
                                final public boolean isTerminator() {
                                                return true;

CAS and concurrent access to a simple C++ array of pointers

(Some 2010 MS interviewer asked me about this concurrency hazard.) This array could be part of a hash table.

Thread 1 is about to remove an element from a hash table. It needs to first substitute a NULL address into the array to prevent Thread 2 from seeing the to-be-deleted entry. Then it will delete on the obsolete ptr … but actually it has to save that address before storing the NULL.

The critical operation is the atomic load-and-store — “load current element and substitute a NULL”. XCHG is the way to go. But what if another thread is doing the same operation? I feel a CAS loop is better — Thread 1 first loads current value into a register. Then it attempts to compareAndSet. If current value is unchanged, it puts in the NULL. Otherwise it should retry.

Q: What if in the same clock cycle, 2 threads (on separate processors) succeed in the load-and-store. Both loads the obsolete address and both try to delete on it… double-free
%%A: I think the hardware spec should cover this scenario. Should not be left to chances.

This situation is virtually impossible to test. But if they don’t happen in the same clock cycle, then the slow thread will see a NULL in the array – lock-free thread safety.

AtomicReference needs immutables

Another post briefly explains AtomicReference + immutable = lockfree. Can we make do without the immutable guarantee? A subtle but crucial question that deserves a focused discussion.

Look at the earlier post showing sample code. Now introduce one of the common loopholes (reference leak for eg) into the Immutable wrapper. If i succeed with compareAndSet, but another thread actually modifies my Account object state in the mean time, then i have unknowingly put another thread’s data into the “global-lookup-table” .

Since my method is “committing” this object into the lookup table, then this method need control on the object state, and should prevent “brood parasite” like this.

AtomicBoolean is irreplaceable

People question the justification for seemingly trivial classes like AtomicSomething. Let’s take the simplest and show AtomicBoolean is irreplaceable.

compareAndSet(true, false) means “if value in main memory before this clock cycle is positive, then toggle it, otherwise do nothing. No locking please.

Read it over and over, and you would realize it’s impossible without Atomics. Volatiles can’t do it, because this involves more than a single load or store. AtomicBoolean does the load and the store in one clock cycle, using special hardware features — “cmpxchg” in intel instruction.

java’s getAndSet() is written using compareAndSet()

Q: Now, since a volatile done/stop flag is frequently used to stop a thread, shall we replace it with an AtomicBoolean?
A: no harm, but I don’t see the need.

AtomicReference + immutable = lockfree #code

import static java.lang.System.out;
import java.util.concurrent.atomic.AtomicReference;

public class AtomicAccount {
    public static void main(String[] a) {
        final AtomicAccount inst = new AtomicAccount(100);
        Runnable r = new Runnable() {
            public void run() {
        new Thread(r).start();
        new Thread(r).start();

    public final AtomicReference AR = new AtomicReference();

    public AtomicAccount(float bal) {
        Account tmp = new Account();
        tmp.bal = bal;
        this.AR.set(new ImmutableAccount(tmp));

     * add 1% to balance. load and store must be atomic.
    void addInterest() {
        while (true) {
            ImmutableAccount b4 = this.AR.get();
            Account tmp = new Account();
            tmp.bal = b4.getBal() * 1.01f;
            ImmutableAccount update = new ImmutableAccount(tmp);
            if (this.AR.compareAndSet(b4, update)) {

    public void printBal() {
        out.println(“current balance = ” + this.AR.get().getBal());

    static private class ImmutableAccount {
        final private Account mutable = new Account();

         * if bal is a non-volatile double, we might read the 2 parts
         * un-atomically.
         * @param a
        public ImmutableAccount(Account a) {
            this.mutable.bal = a.bal;
            // copy other fields

        public float getBal() {
            return this.mutable.bal;

class Account {
    float bal;

AtomicReference + immutable = lockfree

Q: what does atomic mean?
A: “within this method between first and last access to some shared object, no other thread can modify it without me knowing.”

For a mutable bean like Account, we’d love atomic operations like addInterest(). Note there’s no AtomicFloat class.

Note AtomicReference is a wrapper/bridge of the raw pointer. Somewhat similar to how a smart pointer wraps a raw pointer. Our AtomicAccount.java could be a bigger wrapper around the AtomicRefernce wrapper.

Note the cornerstone method AtomicReference.compareAndSet() works solely with addresses, in the comparison and the reseating. Much simpler and faster than equals(), esp. in a lockfree atomic context.

Q: Can we use AtomicReference to add thread safety without locking?

Suppose someone supplies a mutable Account class. We will construct an AtomicAccount class meeting these requirements all without locking —

– Req: List getStocks() to atomically return all the stock symbols held in the account. Without locking, it’s possible to see an invalid list undergoing simultaneous changes in 2 threads.
– Req: addInterest() and addFund(double). Without locking or atomic classes, you can lose updates.
– Req: getBalance()

% assumption: we have strict access control over the Account instance. No thread should use a naked Account reference to access its state.

A key design principle — “If RAW-POINTER wrapped in the AtomicReference is not _reseated_, then the pointee object state is intact.” In other words, account is mutable only by creating a new object. —

Achieved by wrapping Account instance in an ImmutableAccount instance. Change to Account results in a new ImmutableAccount wrapping a new Account. New ImmutableAccount then replaces pointer in the AtomicReference.

Now you realize an AtomicReference is a thread-safe lookup table with just one key-value pair. Key is the unnamed default key; value is an address. Salient features:
* compareAndSet in a loop, lockfree.
* immutables
* optimistic.
* layers of wrapper[1] to give the lockfree advantage

[1] In our example, 3 wrappers on Account ) ImmutableAccount ) AtomicReference ) AtomicAccount

compareAndSet by 2 competing threads — probably safe

static boolean unsynchronizedSetter(Date expected){
Date newDate = new Date();
AtomicReference myAtomicReference = Lookup.getAtomicRef();
boolean myStatus = myAtomicReference.compareAndSet(expected, newDate); //CAS
return myStatus;

Q: If 2 threads execute the CAS simultaneously, which object will get stored in the atomic reference?

In a multi-processor machine, 2 threads could be performing the CAS in the same clock cycle. Suppose they both use the same myAtomicReference object to do the CAS, both use the correct value of “expected”, but they try to put in 2 distinct objects ie the 2 newDate. One of them must fail, but will myStatus be false in that thread?

I feel it’s ok if both threads get myStatus==true, in the same clock cycle or not. Right after the CAS clock cycle, the atomic reference holds one of the 2 values. But how many clock cycles later will the subsequent myStatus-dependent statement execute? No guarantee. Therefore when it does execute, the atomic reference could hold any value. myStatus==true doesn’t mean “my value was stored”.

We must also consider statement reorder and per-thread caching, as described by Doug Lea.

I guess one hardware implementation of CompareAndSwap might make the 2 threads queue up to do their updates. I guess even if the 2 processors are executing the CAS instruction in the same clock cycle, one of them is probably delayed.

In general, Simultaneous writes aren’t allowed by the memory module — just consider voltage signal in that write cycle. We won’t go into details but input voltage applied to a particular bit in RAM must be either 0 or 1, not both. This is the original meaning of electronic race-condition. Some links —



lockfree and compare-and-swap in my project

A simple race condition – If 2 threads update the same object in the same clock cycle, how do we avoid lost update?

Traditional solution is synchronization, i.e. locking. There’s huge interest in lockfree, but I feel either it’s impossible or requires hardware support like xchg or cmpxchg(?).

while (!done) {
final PositionMark oldMark = getRegion().get(key);
newMark = producer.producePrice(key);
final PositionMark replaced = getRegion().put(key, newMark); // atomic?
// The following ensures consistency given concurrent updates to the same key
done = (replaced == oldMark);

Idea is to first get old value of object, then update it and get the replaced value in one sweep. (Perhaps using xchg). If old != replaced, then keep retrying. This is probably swap-and-compare, not to be confused with the more complex compare-and-swap hardware feature [1].

Challenge is the atomicity of the xchg. Without hardware support, can u guarantee atomicity without locking? I doubt it.

In this code, getRegion returns a gemfire region, backed by a concurrent hash map, whose put() is actually synchronized with lock striping. If you were to use a hash map, I would assume put() is completely thread unsafe. 2 threads can both get the same replacedMark, and put in their new marks and overwrite each other — lost update.

[1] compare-And-Swap does the swap conditionally — safer.

lockfree atomic variables — JDK 1.5

(see article at http://www.ibm.com/developerworks/java/library/j-jtp11234/)

Until JDK 5.0, it was not possible to write wait-free, lock-free algorithms in the Java language without using native code. With the addition of the atomic variables classes in the java.util.concurrent.atomic package, that has changed. The atomic variable classes all expose a compare-and-set primitive (similar to compare-and-swap), which is implemented using the fastest native construct available on the platform.

The atomic variable classes can be thought of as a generalization of volatile variables, extending the concept of volatile variables to support atomic conditional compare-and-set update

Nearly all the classes in the java.util.concurrent package use atomic variables instead of synchronization, either directly or indirectly.

http://java.sun.com/j2se/1.5.0/docs/guide/concurrency/overview.html has unusual comments
Atomic Variables – Classes for atomically manipulating single[1] variables (primitive types or references), providing high-performance atomic arithmetic and compare-and-set[2] methods. The atomic variable implementations offer higher[3] performance than would be available by using synchronization (on most platforms), making them useful for implementing high-performance concurrent algorithms as well as conveniently implementing counters and sequence number generators[4].
[1] atomic variables are quite low-level and only cover monolithic variables. No Object here. Object is different from reference (monolithic).
[2] CAS is the biggest feature
[3] synch performance hit is actually a real concern for java designers. I think synch severely limits multi-processor parallelism
[4] these simple integers are a key use case for atomic? You can see Atomic variables aren’t for everyday use.

base version ^ current version and CAS

http://en.wikipedia.org/wiki/Compare-and-swap#Implementation_in_C describes a compare-And-Swap algo — compare my base version with master’s current version and update master only if matched. If unmatched, then my base version is stale => can’t safely update.

i think iterator uses similar constructs. When an iterator is created from a “master” collection object, it copies master’s current version as its immutable base version. Whenever something changes concurrently in master, current version increments. Next iteration will notice it…. and triggers fail fast??

In other words….I guess a Collection object has an auto-incrementing private counter serving as a version number. Each new iterator gets an immutable snapshot of the version number. At each iteration the iterator checks if the “current version #” has incremented which means a change in the collection??

I guess CVS/SVN use similar constructs?