C++ 并发实战《C++ Concurrency in Action》的学习笔记 5, 记录第五章的部分 Designing lock-free concurrent data structures.
内容是设计 lock-free 的数据结构用于并发, 实现了 stack 与 queue 的示例, 同时包含了与之相关的内存管理的技术(例如垃圾收集).
- Algorithms and data structures that use mutexes, condition variables, and futures to synchronize the data are called blocking data structures and algorithms.
- Data structures and algorithms that don’t use blocking library functions are said to be nonblocking. Not all these data structures are lock-free, though.
std::atomic_flag 实现的自旋锁即为 nonblocking, 但是不是 lock-free.
Obstruction-Free: If all other threads are paused, then any given thread will complete its operation in a bounded number of steps. ==> this is more useful as a characterization of a failed lock-free implementation.
Lock-Free: If multiple threads are operating on a data structure, then after a bounded number of steps one of them will complete its operation.
Wait-Free: Every thread operating on a data structure will complete its operation in a bounded number of steps, even if other threads are also operating on the data structure.
多线程安全访问: more than one thread must be able to access the data structure concurrently. They don’t have to be able to do the same
operations; a lock-free queue might allow one thread to push and one to pop but break if two threads try to push new items at the same time.
其他线程的中断不能妨碍此线程的完成. One of the threads accessing the data structure is suspended by the scheduler midway through its operation, the other threads must still be able to complete their operations without waiting for the suspended thread.
例如 This code can still be lock-free if the compare/exchange would eventually succeed if the other threads were suspended. If it didn’t, you’d have a spin lock, which is nonblocking but not lock-free.
starvation 问题: Data structures that avoid this problem are wait-free as well as lock-free.
定义: A wait-free data structure is a lock-free data structure with the additional property that every thread accessing the data structure can complete its operation within a bounded number of steps, regardless of the behavior of other threads.
不是 wait-free 的例子: a
while loop on a
实现 wait-free 极其困难: Writing wait-free data structures correctly is extremely hard. In order to ensure that every thread can complete its operations within a bounded number of steps, you have to ensure that each operation can be performed in a single pass and that the steps performed by one thread don’t cause an operation on another thread to fail.
- enable maximum concurrency
If a thread dies while holding a lock, that data structure is broken forever. But if a thread dies partway through an operation on a lock-free data structure, nothing is lost except that thread’s data; other threads can proceed normally.
- invariants are upheld: not easy.
- deadlocks are impossible with lock-free data structures, although there is the possibility of live locks instead.
A live lock occurs when two threads each try to change the data structure, but for each thread, the changes made by the other require the operation to be restarted, so both threads loop and try again. 书中举了一个例子, 两个人无法同时通过一个窄门, 当二者碰巧都要去通过时, 都会发现过不去, 因此退出来然后又一次尝试通过, 又一次失败, 循环往复. live locks are typically short-lived because they depend on the exact scheduling of threads. 因此只会导致性能的一时下降不会导致长期问题(They therefore sap performance rather than cause long-term problems)
- it may well decrease overall performance 某些情况下还是会导致整体性能下降
- the atomic operations used for lock-free code can be much slower than non-atomic operations.
- hardware must synchronize data between threads that access the same atomic variables(例如 cache ping-pong 问题).
- 看你怎么测, 看重什么方面: relevant performance aspects (whether that’s worst-case wait time, average wait time, overall execution time, or something else)
注意不是说用了 atomic operation 以及 memory ordering options 就可以保证里面没锁. 这与平台/实现强相关.
std::atomic_flagis guaranteed not to use locks in the implementation.
- Create a new node.
- Set its
nextpointer to the current
- Set the
headnode to point to it.
2,3 步会有可能 data race.
注意第 3 步, 我们不能回撤已经添加的 new node. Note that once head has been updated to point to your new node, another thread could read that node. It’s therefore vital that your new node is thoroughly prepared before head is set to point to it; you can’t modify the node afterward.
push() without locks: atomic compare/exchange operation
- Read the current value of
- Return the data from the retrieved node.
- Delete the retrieved node.
- dereference a dangling pointer: 线程1,2 同时进入步骤 1, 线程 2 率先删除 node, 导致步骤 2 的线程 1 访问空悬的指针.
- if two threads read the same value of
head, they’ll return the same node. 这部符合 stack 的要求.
- 没处理空的 stack. if
headis a null pointer, it will cause undefined behavior as it tries to read the
- exception-safety issue: if an exception is thrown when copying the return value, the value is lost. 解决方案: return a (smart) pointer to the data value.
注意 If you do the heap allocation as part of
pop(), you’re still no better off, because the heap allocation might throw an exception. Instead, you can allocate the memory when you
push()the data onto the stack—you have to allocate memory for the node anyway.
为了防止内存泄漏 write a special-purpose garbage collector for nodes
Reclaiming nodes when no threads are in
Because you’re going to potentially delay the deletion of the node itself, you can use
swap() to remove the data from the node rather than copying the pointer, so that the data will be deleted automatically when you no longer need it rather than it being kept alive because there’s still a reference in a not-yet-deleted node. 我们希望实现 2 种效果:
- 通过智能指针返回被删除的 node 的内容–> data, 由于是智能指针, 因此可以不用考虑生命周期的管理问题.
- 推迟被删除 node 的删除部分, 直到我们确定是安全的(没有其他线程在
swap()函数做到了内容部分提取与删除 node 部分的分离(这一部分的安全性放到了
else if(nodes_to_delete) 进一步的 check 的意义.
第二张图是线程 A 执行到
if(threads_in_pop==1) 语句, 被挂起.
第三张图是线程 B 执行到
node* old_head=head.load(); 语句, 也就是
old_head 指向 node Y, 被挂起.
第四张图是线程 C 执行完
pop() 把 node Y 放入了
to_be_deleted list 里.
第五张图是线程 A 接着执行
to_be_deleted.exchange(nullptr); 在 A 的世界里
threads_in_pop == 1, 因此会执行 delete pending list, node Y 会被删除.
第六张图是线程 B 接着执行
old_head->next, 发现 node Y 已经被删除了, B 还尝试访问它 导致 UB.
换句话说要保证能安全地 delete pending list 需要满足 2 个条件, 一个是
nullptr(只有能安全 delete pending list 的线程才能设置为
- 为了不影响性能, 考虑 lazy 策略, 先把危险的操作堆积起来, 一有机会就全部处理掉
- 拆解过后的危险的部分, 进一步分解成为可以理解的步骤(线程间可能冲突), 找到关键变量然后在关键步骤 re-check.
注意此方法的缺陷, pending list 有可能会无限增加导致资源泄漏: In high-load situations, there may never be this quiescent state, because other threads have entered
pop() before all the threads initially in
pop() have left. Under this scenario, the
to_be_deleted list would grow without bounds, and you’d be leaking memory again.
解决思路, 想办法标记那些不被 access 的 node 删除: If there aren’t going to be any quiescent periods, you need to find an alternative mechanism for reclaiming the nodes. The key is to identify when no more threads are accessing a particular node so that it can be reclaimed. By far the easiest such mechanism to reason about is the use of hazard pointers.
hazard pointers: The basic idea is that if a thread is going to access an object that another thread might want to delete, it first sets a hazard pointer to reference the object, informing the other thread that deleting the object would indeed be hazardous. Once the object is no longer needed, the hazard pointer is cleared. 通过标记/解标记为 hazard 的形式来说明能否在定期的循环检查里安全地删除被指向的对象. 还是垃圾收集的思路.
When a thread wants to delete an object, it must first check the hazard pointers belonging to the other threads in the system. If none of the hazard pointers reference the object, it can safely be deleted. Otherwise, it must be left until later. Periodically, the list of objects that have been left until later is checked to see if any of them can now be deleted.
discovered by Maged Michael: “Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic Reads and Writes,” .
一个前提, 即便一个指针指向的对象被删除了, 我们依旧可以查看这个指针的值(内存地址), 这在一般情况下是 UB, 需要自定义 allocator: Using hazard pointers like this relies on the fact that it’s safe to use the value of a pointer after the object it references has been deleted. This is technically undefined behavior if you are using the default implementation of new and delete, so either you need to ensure that your implementation permits it, or you need to use a custom allocator that permits this usage.
An implementation of
pop() using hazard pointers
- You’re using
compare_exchange_strong()here because you’re doing work inside the while loop: a spurious failure on
compare_exchange_weak()would result in resetting the hazard pointer unnecessarily.
- Any nodes for which there are still outstanding hazard pointers will be left for the next thread that calls
A simple implementation of
初始化的过程耗时, 后面就不会太多资源需求了: Once the
hp_owner instance has been created for a given thread, further accesses are much faster because the pointer is cached, so the table doesn’t have to be scanned again.
bool outstanding_hazard_pointers_for(void* p)
A simple implementation of the reclaim functions
pop()an expensive operation
pop()call => checking
max_hazard_pointersatomic variables. Atomic operations are inherently slow—often 100 times slower than an equivalent non-atomic operation on desktop CPUs
- you scan the hazard pointer list for the node you’re about to remove + also scan it for each node in the waiting list. => checking all of them against
max_hazard_pointersstored hazard pointers.
- trade memory for performance.
- checking 2*
pop()and reclaiming at least
- downside: have to count the nodes on the reclamation list, which means using an atomic count, and you still have multiple threads competing to access the reclamation list itself.
- checking 2*
- each thread keeps its own reclamation list in a thread-local variable.
- If a thread exits before all its nodes have been reclaimed, they can be stored in the global list as before and added to the local list of the next thread doing a reclamation process.
可惜 hazard pointers are covered by a patent application submitted by IBM.
Reference counting tackles the problem by storing a count of the number of threads accessing each node.
直观反应的智能有它的问题: although some operations on
std::shared_ptr<> are atomic, they aren’t guaranteed to be lock-free. 为啥不加?
std::shared_ptr<> is intended for use in many contexts, and making the atomic operations lock-free would likely impose an overhead on all uses of the class.
A lock-free stack using a lock-free
如果智能指针是 lock-free 的话, 实现起来就会很简单:
Concurrency TS 提供了
std::experimental::atomic_shared_ptr<T> in the
<experimental/atomic> header. 注意 it may or may not be lock-free on any given implementation.
Stack implementation using
好处: without having to remember to include the
使用智能指针里的计数不一定是 lock-free, 因此考虑手动实现引用计数.
two reference counts for each node: an internal count and an external count.
- The sum of these values is the total number of references to the node.
- The external count is kept alongside the pointer to the node and is increased every time the pointer is read.
- When the reader is finished with the node, it decreases the internal count.
- A simple operation that reads the pointer will leave the external count increased by one and the internal count decreased by one when it’s finished.
- When the external count/pointer pairing is no longer required (the node is no longer accessible from a location accessible to multiple threads), the internal count is increased by the value of the external count minus one and the external counter is discarded.
- Once the internal count is equal to zero, there are no outstanding references to the node and it can be safely deleted.
node* ptr; 放在一起形成
struct counted_node_ptr 放入
std::atomic<counted_node_ptr> head; 要注意其 struct 大小问题.
std::atomic<counted_node_ptr> 不会 lock-free: On those platforms that support a double-word-compare-and-swap operation(可以参考[wiki]https://en.wikipedia.org/wiki/Double_compare-and-swap), this structure will be small enough for
std::atomic<counted_node_ptr> to be lock-free.
一些 trick: 利用指针数值里的无效 bits 存储 counter. => require platform-specific knowledge
if you’re willing to limit the size of the counter, and you know that your platform has spare bits in a pointer (for example, because the address space is only 48 bits but a pointer is 64 bits), you can store the count inside the spare bits of the pointer to fit it all back in a single machine word.
Popping a node from a lock-free stack using split reference counts
需要考虑一些场景, 先从最简单的开始: one thread pushes a data item onto the stack and another thread then pops that data item off the stack some time later.
分析数据结构, three important pieces of data are involved:
counted_node_ptrused for transferring the data:
dataitem pointed to by that node.
- constructs the
- constructs the
- loads the value of
- does a compare/exchange loop on
headto increase the reference count.
- reads the
nodestructure to obtain the
a happens-before relationship between the store (by the pushing thread) and the load (by the popping thread).
- only atomic operation in the
compare_exchange_weak(), and you need a release operation to get a happens-before relationship between threads, the
- If the
compare_exchange_weak()call fails, nothing has changed and you keep looping, so you need only
std::memory_order_relaxedin that case.
void push(T const& data)
std::memory_order_acquireor stronger before the access to
- The pointer you dereference to access the
nextfield is the old value read by the
increase_head_count(), so you need the ordering on that if it succeeds.
void increase_head_count(counted_node_ptr& old_counter)
- store to
push()thread is sequenced before the store to
- the call to
increase_head_count()is sequenced before the load of
- the acquire operation in
increase_head_count()ensures that there’s a synchronizes-with relationship between the store in the
push()thread and that compare/exchange.
==> there’s a happens-before relationship with
data between two threads.
The only other place where
ptr->data is changed is the call to
swap(). no other thread can be operating on the same node. ==> safe
head is only ever modified by compare/exchange operations. Because these are read-modify-write operations, they form part of the release sequence headed by the compare/exchange in
push(). Therefore, the
push() synchronizes with a call to
increase_head_count(), which reads the value stored, even if many other threads modify
head in the meantime.
any thread that did not successfully retrieve the data knows that another thread did modify the node data; the successful thread used
swap() to extract the referenced data item. Therefore you need to ensure that
swap() happens before the
delete in order to avoid a data race.
自然而然地推出: make the
fetch_add() in the successful-return branch use
std::memory_order_release and the
fetch_add() in the loop-again branch use
else if(ptr->internal_count.fetch_add( 里的
fetch_add() is a read-modify-write operation, it forms part of the release
sequence, so you can do that with an additional
A lock-free stack with reference counting and relaxed atomic operations
lots of the complexity in lock-free code comes from managing memory.
a perfectly serviceable single-producer, single-consumer(SPSC) queue:
- The store to
tailsynchronizes with the load from
- the store to
datahappens before the load:
- store to
tailsynchronizes with the load from
- the store to the preceding node’s data pointer is sequenced before the store to
- the load from
tailis sequenced before the load from the data pointer.
- store to
思路1: add a dummy node between the real nodes.
push(): the current
tail node that needs updating is the
next pointer, which could therefore be made atomic.
pop(): require a minor change to
pop() in order to discard nodes with a null data pointer and loop again.
The downside here is that every
pop() call will typically have to remove two nodes, and there are twice as many memory allocations.
思路2: make the
data pointer atomic and set that with a call to compare/exchange.
push(): If a call to compare/exchange succeeds, this is your
tail node, and you can safely set the
next pointer to your new node and then update
tail. If the compare/exchange fails because another thread has stored the data, you loop around, reread
tail, and start again.
Listing 7.15 A (broken) first attempt at revising
void push(T new_value)
tail 导致 UB. 如何解决呢?
pop() 一样的问题, 可以考虑再加一对 external 与 internal count. 具体思路如下:
Having two external counts for the same node requires a modification to the reference-counting scheme to avoid deleting the node too early. You can address this by also counting the number of external counters inside the node structure and decreasing this number when each external counter is destroyed (as well as adding the corresponding external count to the internal count). If the internal count is zero and there are no external counters, you know the node can safely be deleted.
Listing 7.16 Implementing
push() for a lock-free queue with a reference-counted tail
Listing 7.17 Popping a node from a lock-free queue with a reference-counted tail
接下来是 reference-counting functions
Listing 7.18 Releasing a node reference in a lock-free queue:
Listing 7.19 Obtaining a new reference to a node in a lock-free queue
Listing 7.20 Freeing an external counter to a node in a lock-free queue
上面的做法其实不是 lock-free 的:
Once one thread has started a
push() operation by successfully completing the
old_tail.ptr->data. no other thread can perform a
This is a busy wait, which consumes CPU cycles without achieving anything. Consequently, this is effectively a lock.
比一般的 mutex 锁更坏的一点是:
the operating system can give priority to the thread that holds the lock on a mutex if there are blocked threads, it can’t do so in this case, so the blocked threads will waste CPU cycles until the first thread is done.
解决思路: find a way for a waiting thread to make progress even if the thread doing the
push() is stalled. One way to do this is to help the stalled thread by doing its work for it.
具体的内容: the next pointer on the
tail node needs to be set to a new dummy node, and then the
tail pointer itself must be updated.
pop() modified to allow helping on the
Listing 7.22 A sample
push() with helping for a lock-free queue
一个潜在的效率风险点: there are a lot of
delete calls for such a small piece of code, because new nodes are allocated on
push() and destroyed in
pop(). The efficiency of the memory allocator therefore has a considerable impact on the performance of this code.
- try it and measure the performance of the code before and after.
- having a separate memory allocator on each thread and using free lists to recycle nodes rather than returning them to the allocator.
对于检查: Unless you have an algorithm checker that can systematically test all possible combinations of thread visibilities that are consistent with the specified ordering guarantees (and these things do exist), running the code isn’t enough.
难点: managing memory.
- It’s essential to avoid deleting objects when other threads might still have references to them,
- but you still want to delete the object as soon as possible in order to avoid excessive memory consumption.
three techniques for ensuring that memory can safely be reclaimed:
- Waiting until no threads are accessing the data structure and deleting all objects that are pending deletion.
- Using hazard pointers to identify that a thread is accessing a particular object.
- Reference counting the objects so that they aren’t deleted until there are no outstanding references.
- this is the ideal scenario for using a garbage collector.
- recycle nodes and only free them completely when the data structure is destroyed.
The downside here is that another problem becomes more prevalent. This is the so-called ABA problem.
解决方法: an ABA counter alongside the variable
x. The compare/exchange operation is then done on the combined structure of
x plus the counter as a single unit.
常见的场景: The ABA problem is particularly prevalent in algorithms that use free lists or otherwise recycle nodes rather than returning them to the allocator.
解决办法: the waiting thread performs the incomplete steps if it’s scheduled to run before the original thread completes the operation, you can remove the busy-wait and the operation is no longer blocking.
本章设计 data structures 的意义: Wherever data is shared between threads, you need to think about the data structures used and how the data is synchronized between threads. By designing data structures for concurrency, you can encapsulate that responsibility in the data structure itself, so the rest of the code can focus on the task it’s trying to perform with the data rather than the data synchronization.