C++ Concurrency in Action, Chapter 7: Study Notes, Part 5
[TOC]
Study notes, part 5, for *C++ Concurrency in Action*, covering Chapter 7, Designing lock-free concurrent data structures.
The chapter shows how to design lock-free data structures for concurrency, implements stack and queue examples, and covers the related memory-management techniques (such as garbage collection).
Chapter 7 Designing lock-free concurrent data structures
7.1 Definitions and consequences
Definitions:
- Algorithms and data structures that use mutexes, condition variables, and futures to synchronize the data are called blocking data structures and algorithms.
- Data structures and algorithms that don’t use blocking library functions are said to be nonblocking. Not all these data structures are lock-free, though.
7.1.1 Types of nonblocking data structures
A spin lock implemented with `std::atomic_flag` is nonblocking but not lock-free.
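A minimal sketch of such a spin lock (the class name `spinlock_mutex` is this sketch's choice, not mandated by the text): it uses no blocking library calls, so it is nonblocking, but a thread suspended while holding the flag blocks everyone else, so it is not lock-free.

```cpp
#include <atomic>

class spinlock_mutex
{
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        // spin until the previous holder clears the flag
        while (flag.test_and_set(std::memory_order_acquire))
            ;
    }
    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};
```

It satisfies the mutex requirements, so it can be used with `std::lock_guard<spinlock_mutex>`.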
Finer-grained terms:
- Obstruction-Free: if all other threads are paused, then any given thread will complete its operation in a bounded number of steps. (This is mostly useful as a characterization of a failed lock-free implementation.)
- Lock-Free: if multiple threads are operating on a data structure, then after a bounded number of steps one of them will complete its operation.
- Wait-Free: every thread operating on a data structure will complete its operation in a bounded number of steps, even if other threads are also operating on the data structure.
7.1.2 Lock-free data structures
Properties:
- Safe concurrent access by multiple threads: more than one thread must be able to access the data structure concurrently. They don't have to be able to do the same operations; a lock-free queue might allow one thread to push and one to pop, but break if two threads try to push new items at the same time.
- Suspension of one thread must not prevent the others from completing: if one of the threads accessing the data structure is suspended by the scheduler midway through its operation, the other threads must still be able to complete their operations without waiting for the suspended thread.
- For example, a compare/exchange loop can still be lock-free if the compare/exchange would eventually succeed were the other threads suspended. If it wouldn't, you'd have a spin lock, which is nonblocking but not lock-free.
- Starvation is still possible: one thread's compare/exchange may keep failing while others make progress. Data structures that avoid this problem are wait-free as well as lock-free.
7.1.3 Wait-free data structures
Definition: A wait-free data structure is a lock-free data structure with the additional property that every thread accessing the data structure can complete its operation within a bounded number of steps, regardless of the behavior of other threads.
An example that is not wait-free: a `while` loop on a `compare_exchange_weak` or `compare_exchange_strong` operation.
Implementing wait-free structures is extremely hard: to ensure that every thread can complete its operations within a bounded number of steps, you have to ensure that each operation can be performed in a single pass, and that the steps performed by one thread don't cause an operation on another thread to fail.
7.1.4 The pros and cons of lock-free data structures
pros:
- enable maximum concurrency
- robustness
If a thread dies while holding a lock, that data structure is broken forever. But if a thread dies partway through an operation on a lock-free data structure, nothing is lost except that thread’s data; other threads can proceed normally.
cons:
- Upholding the data structure's invariants without locks is not easy.
- Deadlocks are impossible with lock-free data structures, but live locks are possible instead. A live lock occurs when two threads each try to change the data structure, but for each thread the changes made by the other require the operation to be restarted, so both threads loop and try again. The book's analogy: two people cannot fit through a narrow doorway at the same time; when both happen to try, each finds the way blocked, backs out, tries again, fails again, and so on. Live locks are typically short-lived because they depend on the exact scheduling of threads, so they sap performance rather than cause long-term problems—but overall performance may well still decrease while they occur.
- the atomic operations used for lock-free code can be much slower than non-atomic operations.
- the hardware must synchronize data between threads that access the same atomic variables (e.g. the cache ping-pong problem).
- The verdict depends on how you measure and what you care about: compare the relevant performance aspects (whether that's worst-case wait time, average wait time, overall execution time, or something else) before committing to a lock-free design.
7.2 Examples of lock-free data structures
Note that using atomic operations and memory ordering options does not by itself guarantee that the structure contains no locks; this is strongly platform/implementation dependent. Only `std::atomic_flag` is guaranteed not to use locks in the implementation.
7.2.1 Writing a thread-safe stack without locks
Adding a node (singly linked list), the steps:
- Create a new node.
- Set its `next` pointer to the current `head` node.
- Set the `head` node to point to it.
Steps 2 and 3 can race with other threads.
Also, after step 3 you cannot retract the newly added node: once `head` has been updated to point to your new node, another thread could read that node. It's therefore vital that your new node is thoroughly prepared before `head` is set to point to it; you can't modify the node afterward.
Implementing `push()` without locks: use an atomic compare/exchange operation.
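The three steps above can be sketched as a Treiber-style stack push; the `node` layout follows the chapter's description, while the `empty()` helper is a convenience added here for exercising the sketch:

```cpp
#include <atomic>

template<typename T>
class lock_free_stack
{
    struct node
    {
        T data;
        node* next;
        node(T const& data_) : data(data_), next(nullptr) {}
    };
    std::atomic<node*> head{nullptr};
public:
    void push(T const& data)
    {
        node* const new_node = new node(data);   // 1: create the node
        new_node->next = head.load();            // 2: point it at the current head
        // 3: swing head to the new node; on failure, compare_exchange_weak
        // reloads the current head into new_node->next, so we just retry
        while (!head.compare_exchange_weak(new_node->next, new_node))
            ;
    }
    bool empty() const { return !head.load(); }  // test convenience, not in the book
};
```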
Removing a node, the steps:
- Read the current value of `head`.
- Read `head->next`.
- Set `head` to `head->next`.
- Return the data from the retrieved node.
- Delete the retrieved node.
Possible problems:
- Dereferencing a dangling pointer: threads 1 and 2 both complete step 1, then thread 2 races ahead and deletes the node, so in step 2 thread 1 dereferences a dangling pointer.
- If two threads read the same value of `head`, they'll return the same node—which violates the stack's semantics.
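A sketch of the flawed first-cut `pop()` the text discusses next (the `push()` and `node` scaffolding are repeated so the fragment compiles on its own); the comments mark the problems:

```cpp
#include <atomic>

template<typename T>
class lock_free_stack
{
    struct node
    {
        T data;
        node* next;
        node(T const& data_) : data(data_), next(nullptr) {}
    };
    std::atomic<node*> head{nullptr};
public:
    void push(T const& data)
    {
        node* const new_node = new node(data);
        new_node->next = head.load();
        while (!head.compare_exchange_weak(new_node->next, new_node))
            ;
    }
    void pop(T& result)
    {
        node* old_head = head.load();
        // problem 1: if the stack is empty, old_head is null and
        // old_head->next below is undefined behavior
        while (!head.compare_exchange_weak(old_head, old_head->next))
            ;
        // problem 2: if this copy throws, the value is lost for good
        result = old_head->data;
        // problem 3: old_head is never deleted here -- deleting it naively
        // could leave another thread dereferencing a dangling pointer
    }
};
```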
This code has some problems:
- It doesn't handle an empty stack: if `head` is a null pointer, reading the `next` pointer causes undefined behavior.
- An exception-safety issue: if an exception is thrown when copying the return value, the value is lost. Solution: return a (smart) pointer to the data value.
Note that doing the heap allocation as part of `pop()` leaves you no better off, because the heap allocation might throw an exception. Instead, allocate the memory when you `push()` the data onto the stack—you have to allocate memory for the node anyway.
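A sketch of the revised version: the data lives in a `std::shared_ptr<T>` allocated during `push()`, and `pop()` returns that pointer (the node itself is still leaked here; reclamation is the subject of the next section):

```cpp
#include <atomic>
#include <memory>

template<typename T>
class lock_free_stack
{
    struct node
    {
        std::shared_ptr<T> data;   // allocated up front in push()
        node* next;
        node(T const& data_) : data(std::make_shared<T>(data_)), next(nullptr) {}
    };
    std::atomic<node*> head{nullptr};
public:
    void push(T const& data)
    {
        node* const new_node = new node(data);
        new_node->next = head.load();
        while (!head.compare_exchange_weak(new_node->next, new_node))
            ;
    }
    std::shared_ptr<T> pop()
    {
        node* old_head = head.load();
        while (old_head &&   // check for an empty stack before dereferencing
               !head.compare_exchange_weak(old_head, old_head->next))
            ;
        return old_head ? old_head->data : std::shared_ptr<T>();
        // note: old_head itself is never freed -- fixed in the next section
    }
};
```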
7.2.2 Stopping those pesky leaks: managing memory in lock-free data structures
To prevent memory leaks, write a special-purpose garbage collector for the nodes.
Reclaiming nodes when no threads are in pop()
Because you're going to potentially delay the deletion of the node itself, you can use `swap()` to remove the data from the node rather than copying the pointer, so that the data will be deleted automatically when you no longer need it, rather than being kept alive because there's still a reference in a not-yet-deleted node. We want two effects:
- Return the removed node's contents (the `data`) through a smart pointer, so no manual lifetime management is needed.
- Defer the deletion of the removed node itself until we are sure it is safe (no other thread is in `pop()`).
Using `swap()` separates extracting the contents from deleting the node (the safety of the latter is pushed into the implementation of `try_reclaim`). A pattern well worth borrowing!
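The scheme can be sketched as follows. The names `threads_in_pop`, `to_be_deleted`, `try_reclaim`, and `chain_pending_nodes` follow the discussion; the null check before chaining `old_head` is a small safety addition of this sketch:

```cpp
#include <atomic>
#include <memory>

template<typename T>
class lock_free_stack
{
    struct node
    {
        std::shared_ptr<T> data;
        node* next;
        node(T const& d) : data(std::make_shared<T>(d)), next(nullptr) {}
    };
    std::atomic<node*> head{nullptr};
    std::atomic<unsigned> threads_in_pop{0};   // threads currently inside pop()
    std::atomic<node*> to_be_deleted{nullptr}; // pending-deletion list

    static void delete_nodes(node* nodes)
    {
        while (nodes) { node* next = nodes->next; delete nodes; nodes = next; }
    }
    void chain_pending_nodes(node* first, node* last)
    {
        last->next = to_be_deleted;
        while (!to_be_deleted.compare_exchange_weak(last->next, first))
            ;
    }
    void chain_pending_nodes(node* nodes)
    {
        node* last = nodes;
        while (node* const next = last->next) last = next;
        chain_pending_nodes(nodes, last);
    }
    void chain_pending_node(node* n) { chain_pending_nodes(n, n); }
    void try_reclaim(node* old_head)
    {
        if (threads_in_pop == 1)
        {
            node* nodes_to_delete = to_be_deleted.exchange(nullptr); // claim the list
            if (!--threads_in_pop)            // still the only thread in pop()?
                delete_nodes(nodes_to_delete);
            else if (nodes_to_delete)         // not safe after all: put it back
                chain_pending_nodes(nodes_to_delete);
            delete old_head;                  // no other thread can reference old_head
        }
        else
        {
            if (old_head) chain_pending_node(old_head); // defer deletion
            --threads_in_pop;
        }
    }
public:
    void push(T const& data)
    {
        node* const n = new node(data);
        n->next = head.load();
        while (!head.compare_exchange_weak(n->next, n))
            ;
    }
    std::shared_ptr<T> pop()
    {
        ++threads_in_pop;
        node* old_head = head.load();
        while (old_head && !head.compare_exchange_weak(old_head, old_head->next))
            ;
        std::shared_ptr<T> res;
        if (old_head) res.swap(old_head->data);  // extract contents, not the node
        try_reclaim(old_head);
        return res;
    }
};
```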
In my own words, here is the process shown in the book's figures; it mainly explains why `try_reclaim()` needs the `if(!--threads_in_pop)` check followed by the further `else if(nodes_to_delete)` check.
Figure 1 is the initial state.
Figure 2: thread A reaches the `if(threads_in_pop==1)` statement and is suspended.
Figure 3: thread B reaches `node* old_head=head.load();` inside `pop()`, so its `old_head` points to node Y, and is suspended.
Figure 4: thread C finishes `pop()` and puts node Y on the `to_be_deleted` list.
Figure 5: thread A resumes and executes `to_be_deleted.exchange(nullptr);`. From A's point of view `threads_in_pop == 1`, so it deletes the pending list, and node Y is deleted.
Figure 6: thread B resumes and accesses `old_head->next`, but node Y has already been deleted, so the access is UB.
In other words, safely deleting the pending list requires two conditions: `threads_in_pop==1`, i.e. only one thread is in `pop()`, and `to_be_deleted` has not been changed away from `nullptr` by another thread in the meantime (only a thread that can safely delete the pending list sets it to `nullptr`). Hence the two-level check.
The other points worth noting are written in the code comments.
The overall design approach:
- Find the key point of contention.
- Separate the safer parts from the dangerous parts.
- To preserve performance, use a lazy strategy: pile up the dangerous operations and dispose of them all at the first safe opportunity.
- Arrange for operations that drag on performance but are not dangerous (such as memory operations) to run while threads would otherwise be waiting.
- Break the remaining dangerous parts into understandable steps (where threads may conflict), find the key variables, and re-check them at the key steps.
Note this method's flaw: the pending list may grow without bound, leaking resources. In high-load situations there may never be a quiescent state, because other threads enter `pop()` before all the threads initially in `pop()` have left. In that scenario the `to_be_deleted` list grows without bound, and you're leaking memory again.
The way out is to identify and delete nodes that are no longer accessed: if there aren't going to be any quiescent periods, you need an alternative mechanism for reclaiming nodes. The key is to identify when no more threads are accessing a particular node so that it can be reclaimed. By far the easiest such mechanism to reason about is the use of hazard pointers.
7.2.3 Detecting nodes that can’t be reclaimed using hazard pointers
hazard pointers: The basic idea is that if a thread is going to access an object that another thread might want to delete, it first sets a hazard pointer to reference the object, informing the other thread that deleting the object would indeed be hazardous. Once the object is no longer needed, the hazard pointer is cleared. Marking and unmarking objects as hazardous tells a periodic scan whether the referenced object can be deleted safely—still the garbage-collection mindset.
When a thread wants to delete an object, it must first check the hazard pointers belonging to the other threads in the system. If none of the hazard pointers reference the object, it can safely be deleted. Otherwise, it must be left until later. Periodically, the list of objects that have been left until later is checked to see if any of them can now be deleted.
Discovered by Maged Michael: "Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic Reads and Writes."
One prerequisite: even after the object a pointer refers to has been deleted, we must still be able to examine the pointer's value (the memory address), which is normally UB, so a custom allocator may be needed. Using hazard pointers like this relies on the fact that it's safe to use the value of a pointer after the object it references has been deleted. This is technically undefined behavior with the default implementation of `new` and `delete`, so either you need to ensure that your implementation permits it, or you need to use a custom allocator that permits this usage.
An implementation of `pop()` using hazard pointers:
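A condensed, self-contained sketch of the scheme: a fixed table of (thread id, pointer) slots, a `pop()` that publishes the head it is about to read, and a deferred-reclamation list swept after each pop. The helpers are shown inline in simplified form here (the reclaim list reuses `node::next` instead of the book's type-erased wrapper); the following sections develop them individually:

```cpp
#include <atomic>
#include <memory>
#include <thread>
#include <stdexcept>

unsigned const max_hazard_pointers = 100;
struct hazard_pointer
{
    std::atomic<std::thread::id> id;
    std::atomic<void*> pointer;
};
hazard_pointer hazard_pointers[max_hazard_pointers];

class hp_owner
{
    hazard_pointer* hp;
public:
    hp_owner(hp_owner const&) = delete;
    hp_owner& operator=(hp_owner const&) = delete;
    hp_owner() : hp(nullptr)
    {
        for (unsigned i = 0; i < max_hazard_pointers; ++i)
        {
            std::thread::id old_id;   // claim a free slot with our thread id
            if (hazard_pointers[i].id.compare_exchange_strong(
                    old_id, std::this_thread::get_id()))
            { hp = &hazard_pointers[i]; break; }
        }
        if (!hp) throw std::runtime_error("No hazard pointers available");
    }
    std::atomic<void*>& get_pointer() { return hp->pointer; }
    ~hp_owner() { hp->pointer.store(nullptr); hp->id.store(std::thread::id()); }
};
std::atomic<void*>& get_hazard_pointer_for_current_thread()
{
    thread_local static hp_owner hazard;   // one slot per thread, cached
    return hazard.get_pointer();
}
bool outstanding_hazard_pointers_for(void* p)
{
    for (unsigned i = 0; i < max_hazard_pointers; ++i)
        if (hazard_pointers[i].pointer.load() == p) return true;
    return false;
}

template<typename T>
class lock_free_stack
{
    struct node
    {
        std::shared_ptr<T> data;
        node* next;
        node(T const& d) : data(std::make_shared<T>(d)), next(nullptr) {}
    };
    std::atomic<node*> head{nullptr};
    std::atomic<node*> reclaim_list{nullptr};
    void reclaim_later(node* n)
    {
        n->next = reclaim_list.load();
        while (!reclaim_list.compare_exchange_weak(n->next, n))
            ;
    }
    void delete_nodes_with_no_hazards()
    {
        node* current = reclaim_list.exchange(nullptr);
        while (current)
        {
            node* const next = current->next;
            if (!outstanding_hazard_pointers_for(current)) delete current;
            else reclaim_later(current);
            current = next;
        }
    }
public:
    void push(T const& data)
    {
        node* const n = new node(data);
        n->next = head.load();
        while (!head.compare_exchange_weak(n->next, n))
            ;
    }
    std::shared_ptr<T> pop()
    {
        std::atomic<void*>& hp = get_hazard_pointer_for_current_thread();
        node* old_head = head.load();
        do
        {
            node* temp;
            do
            {   // keep publishing until the published pointer equals head
                temp = old_head;
                hp.store(old_head);
                old_head = head.load();
            } while (old_head != temp);
        } while (old_head &&
                 !head.compare_exchange_strong(old_head, old_head->next));
        hp.store(nullptr);   // finished reading through old_head
        std::shared_ptr<T> res;
        if (old_head)
        {
            res.swap(old_head->data);
            if (outstanding_hazard_pointers_for(old_head))
                reclaim_later(old_head);   // another thread still holds it
            else
                delete old_head;
            delete_nodes_with_no_hazards();
        }
        return res;
    }
};
```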
Points to note in the code above:
- It uses `compare_exchange_strong()` because work is done inside the `while` loop: a spurious failure of `compare_exchange_weak()` would reset the hazard pointer unnecessarily.
- Any nodes for which there are still outstanding hazard pointers are left for the next thread that calls `pop()`.
A simple implementation of get_hazard_pointer_for_current_thread()
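A sketch of the per-thread slot ownership: each thread claims one entry of a fixed table on first use (via a `thread_local` owner object) and releases it when the thread exits. Names follow the book's discussion:

```cpp
#include <atomic>
#include <thread>
#include <stdexcept>

unsigned const max_hazard_pointers = 100;
struct hazard_pointer
{
    std::atomic<std::thread::id> id;
    std::atomic<void*> pointer;
};
hazard_pointer hazard_pointers[max_hazard_pointers];

class hp_owner
{
    hazard_pointer* hp;
public:
    hp_owner(hp_owner const&) = delete;
    hp_owner& operator=(hp_owner const&) = delete;
    hp_owner() : hp(nullptr)
    {
        for (unsigned i = 0; i < max_hazard_pointers; ++i)
        {
            std::thread::id old_id;
            // claim a free slot by installing our thread id
            if (hazard_pointers[i].id.compare_exchange_strong(
                    old_id, std::this_thread::get_id()))
            {
                hp = &hazard_pointers[i];
                break;
            }
        }
        if (!hp) throw std::runtime_error("No hazard pointers available");
    }
    std::atomic<void*>& get_pointer() { return hp->pointer; }
    ~hp_owner()
    {
        hp->pointer.store(nullptr);
        hp->id.store(std::thread::id());   // free the slot for other threads
    }
};

std::atomic<void*>& get_hazard_pointer_for_current_thread()
{
    thread_local static hp_owner hazard;   // scan once, then cached
    return hazard.get_pointer();
}
```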
Initialization is the costly part; afterwards little extra work is needed: once the `hp_owner` instance has been created for a given thread, further accesses are much faster because the pointer is cached, so the table doesn't have to be scanned again.
Implementation of `outstanding_hazard_pointers_for()`:
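A sketch, assuming the `hazard_pointers` table from the previous section: a node may be reclaimed only if no slot currently points at it, so this is a single pass over the whole table (the cost discussed below):

```cpp
#include <atomic>
#include <thread>

unsigned const max_hazard_pointers = 100;
struct hazard_pointer
{
    std::atomic<std::thread::id> id;
    std::atomic<void*> pointer;
};
hazard_pointer hazard_pointers[max_hazard_pointers];

bool outstanding_hazard_pointers_for(void* p)
{
    for (unsigned i = 0; i < max_hazard_pointers; ++i)
    {
        if (hazard_pointers[i].pointer.load() == p)
            return true;   // some thread still reads through p
    }
    return false;
}
```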
A simple implementation of the reclaim functions
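A sketch of the reclamation machinery, with a type-erased wrapper so the deferred list can hold nodes of any type (the hazard table is repeated so the fragment stands on its own):

```cpp
#include <atomic>
#include <thread>
#include <functional>

unsigned const max_hazard_pointers = 100;
struct hazard_pointer
{
    std::atomic<std::thread::id> id;
    std::atomic<void*> pointer;
};
hazard_pointer hazard_pointers[max_hazard_pointers];

bool outstanding_hazard_pointers_for(void* p)
{
    for (unsigned i = 0; i < max_hazard_pointers; ++i)
        if (hazard_pointers[i].pointer.load() == p) return true;
    return false;
}

// type-erased deleter: the list may hold objects of different types
template<typename T>
void do_delete(void* p) { delete static_cast<T*>(p); }

struct data_to_reclaim
{
    void* data;
    std::function<void(void*)> deleter;
    data_to_reclaim* next;
    template<typename T>
    data_to_reclaim(T* p) : data(p), deleter(&do_delete<T>), next(nullptr) {}
    ~data_to_reclaim() { deleter(data); }   // destroying the wrapper frees the node
};

std::atomic<data_to_reclaim*> nodes_to_reclaim{nullptr};

void add_to_reclaim_list(data_to_reclaim* node)
{
    node->next = nodes_to_reclaim.load();
    while (!nodes_to_reclaim.compare_exchange_weak(node->next, node))
        ;
}

template<typename T>
void reclaim_later(T* data)
{
    add_to_reclaim_list(new data_to_reclaim(data));
}

void delete_nodes_with_no_hazards()
{
    // claim the whole list, then delete what is safe and re-queue the rest
    data_to_reclaim* current = nodes_to_reclaim.exchange(nullptr);
    while (current)
    {
        data_to_reclaim* const next = current->next;
        if (!outstanding_hazard_pointers_for(current->data))
            delete current;
        else
            add_to_reclaim_list(current);
        current = next;
    }
}
```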
The implementation above is not very efficient:
- It makes `pop()` an expensive operation: every `pop()` call checks `max_hazard_pointers` atomic variables, and atomic operations are inherently slow—often 100 times slower than an equivalent non-atomic operation on desktop CPUs.
- You scan the hazard pointer list for the node you're about to remove, and also scan it for each node in the waiting list: checking all of them against `max_hazard_pointers` stored hazard pointers.
BETTER RECLAMATION STRATEGIES USING HAZARD POINTERS
Ideas for fixing these problems:
- Trade memory for performance:
  - Check `2*max_hazard_pointers` nodes every `max_hazard_pointers` calls to `pop()`, reclaiming at least `max_hazard_pointers` nodes each time.
  - Downside: you have to count the nodes on the reclamation list, which means an atomic count, and multiple threads still compete to access the reclamation list itself.
- Each thread keeps its own reclamation list in a thread-local variable:
  - Up to `max_hazard_pointers*max_hazard_pointers` nodes may be allocated.
  - If a thread exits before all its nodes have been reclaimed, they can be stored in the global list as before and added to the local list of the next thread doing a reclamation process.
Unfortunately, hazard pointers are covered by a patent application submitted by IBM.
They may eventually make it into the C++ standard: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0566r5.pdf
7.2.4 Detecting nodes in use with reference counting
This section uses reference counting to implement the garbage collection: reference counting tackles the problem by storing a count of the number of threads accessing each node.
The intuitive candidate, the smart pointer, has a problem: although some operations on `std::shared_ptr<>` are atomic, they aren't guaranteed to be lock-free. Why not make them so? `std::shared_ptr<>` is intended for use in many contexts, and making the atomic operations lock-free would likely impose an overhead on all uses of the class.
A lock-free stack using a lock-free `std::shared_ptr<>` implementation: if the smart-pointer operations were lock-free, the implementation would be very simple:
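A sketch using the free-function atomic operations on `std::shared_ptr` (`std::atomic_load`, `std::atomic_store`, `std::atomic_compare_exchange_weak`; these exist since C++11, though they were deprecated in C++20 in favor of `std::atomic<std::shared_ptr<T>>`):

```cpp
#include <atomic>
#include <memory>

template<typename T>
class lock_free_stack
{
    struct node
    {
        std::shared_ptr<T> data;
        std::shared_ptr<node> next;
        node(T const& data_) : data(std::make_shared<T>(data_)) {}
    };
    std::shared_ptr<node> head;
public:
    void push(T const& data)
    {
        std::shared_ptr<node> const new_node = std::make_shared<node>(data);
        new_node->next = std::atomic_load(&head);
        while (!std::atomic_compare_exchange_weak(&head,
                   &new_node->next, new_node))
            ;
    }
    std::shared_ptr<T> pop()
    {
        std::shared_ptr<node> old_head = std::atomic_load(&head);
        while (old_head &&
               !std::atomic_compare_exchange_weak(&head,
                   &old_head, std::atomic_load(&old_head->next)))
            ;
        if (old_head)
        {
            // break the link so the popped node doesn't keep its successor alive
            std::atomic_store(&old_head->next, std::shared_ptr<node>());
            return old_head->data;
        }
        return std::shared_ptr<T>();
    }
    ~lock_free_stack()
    {
        while (pop())
            ;
    }
};
```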
The Concurrency TS provides `std::experimental::atomic_shared_ptr<T>` in the `<experimental/atomic>` header. Note that it may or may not be lock-free on any given implementation.
Stack implementation using std::experimental::atomic_shared_ptr<>
Benefit: you don't have to remember to use the `atomic_load` and `atomic_store` calls.
Because the counting inside the smart pointer isn't guaranteed to be lock-free, consider implementing the reference counting by hand, with two reference counts for each node: an internal count and an external count.
- The sum of these values is the total number of references to the node.
- The external count is kept alongside the pointer to the node and is increased every time the pointer is read.
- When the reader is finished with the node, it decreases the internal count.
- A simple operation that reads the pointer will leave the external count increased by one and the internal count decreased by one when it’s finished.
- When the external count/pointer pairing is no longer required (the node is no longer accessible from a location accessible to multiple threads), the internal count is increased by the value of the external count minus one and the external counter is discarded.
- Once the internal count is equal to zero, there are no outstanding references to the node and it can be safely deleted.
Putting `external_count` next to `node* ptr` in `struct counted_node_ptr`, and that struct inside `std::atomic<counted_node_ptr> head`, raises a size concern: if the struct is too large, `std::atomic<counted_node_ptr>` won't be lock-free. On platforms that support a double-word compare-and-swap operation (see https://en.wikipedia.org/wiki/Double_compare-and-swap), the structure is small enough for `std::atomic<counted_node_ptr>` to be lock-free.
A trick: store the counter in a pointer's unused bits—which requires platform-specific knowledge. If you're willing to limit the size of the counter, and you know that your platform has spare bits in a pointer (for example, because the address space is only 48 bits but a pointer is 64 bits), you can store the count inside the spare bits of the pointer to fit it all back in a single machine word.
Popping a node from a lock-free stack using split reference counts
7.2.5 Applying the memory model to the lock-free stack
Several scenarios need considering; start with the simplest: one thread pushes a data item onto the stack, and another thread pops that data item off the stack some time later.
Analyzing the data structure, three important pieces of data are involved:
- `head`: the `counted_node_ptr` used for transferring the data.
- the `node` structure that `head` refers to.
- the `data` item pointed to by that node.
Breaking down `push()`:
- Construct the `data` item.
- Construct the `node`.
- Set `head`.
Breaking down `pop()`:
- Load the value of `head`.
- Do a compare/exchange loop on `head` to increase the reference count.
- Read the `node` structure to obtain the `next` value.
The goal is a happens-before relationship between the store (by the pushing thread) and the load (by the popping thread).
Considering the `next` pointer from each side:
Pushing:
- The only atomic operation in `push()` is the `compare_exchange_weak()`, and you need a release operation to get a happens-before relationship between threads, so the `compare_exchange_weak()` must be `std::memory_order_release` or stronger.
- If the `compare_exchange_weak()` call fails, nothing has changed and you keep looping, so `std::memory_order_relaxed` suffices in that case.
Popping:
- You need `std::memory_order_acquire` or stronger before the access to `next`.
- The pointer you dereference to access the `next` field is the old value read by the `compare_exchange_strong()` in `increase_head_count()`, so you need the ordering on that operation if it succeeds.
Sorting out the ordering for `data`:
- The store to `data` in the `push()` thread is sequenced before the store to `head`.
- The call to `increase_head_count()` is sequenced before the load of `ptr->data`.
- The acquire operation in `increase_head_count()` ensures a synchronizes-with relationship between the store in the `push()` thread and that compare/exchange.
==> There's a happens-before relationship on `data` between the two threads.
The only other place where `ptr->data` is changed is the call to `swap()`, and at that point no other thread can be operating on the same node, so that is safe.
What if many other threads modify `head` in the meantime? `head` is only ever modified by compare/exchange operations. Because these are read-modify-write operations, they form part of the release sequence headed by the compare/exchange in `push()`. Therefore the `compare_exchange_weak()` in `push()` synchronizes with a call to `compare_exchange_strong()` in `increase_head_count()`, which reads the value stored, even if many other threads modify `head` in the meantime.
Modifying the reference counts (the `fetch_add()` operations): any thread that did not successfully retrieve the data knows that another thread did modify the node data; the successful thread used `swap()` to extract the referenced data item. So you need to ensure that the `swap()` happens before the `delete` in order to avoid a data race.
It follows naturally that the `fetch_add()` in the successful-return branch should use `std::memory_order_release`, and the `fetch_add()` in the loop-again branch should use `std::memory_order_acquire`.
This can be squeezed further: the `fetch_add()` in the `else if(ptr->internal_count.fetch_add(...))` branch is a read-modify-write operation, so it forms part of the release sequence. You can therefore make that `fetch_add()` use `std::memory_order_relaxed`, and add a `load()` with `std::memory_order_acquire` only on the path that actually deletes the node.
Putting it all together: a lock-free stack with reference counting and relaxed atomic operations.
lots of the complexity in lock-free code comes from managing memory.
7.2.6 Writing a thread-safe queue without locks
First, a perfectly serviceable single-producer, single-consumer (SPSC) queue:
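A sketch of that SPSC queue: a dummy node separates `head` (touched only by the consumer) from `tail` (touched only by the producer), so the two threads never contend on the same node:

```cpp
#include <atomic>
#include <memory>

template<typename T>
class lock_free_queue
{
    struct node
    {
        std::shared_ptr<T> data;
        node* next;
        node() : next(nullptr) {}
    };
    std::atomic<node*> head;
    std::atomic<node*> tail;
    node* pop_head()
    {
        node* const old_head = head.load();
        if (old_head == tail.load())   // only the dummy node left: empty
            return nullptr;
        head.store(old_head->next);
        return old_head;
    }
public:
    lock_free_queue() : head(new node), tail(head.load()) {}
    lock_free_queue(const lock_free_queue&) = delete;
    lock_free_queue& operator=(const lock_free_queue&) = delete;
    ~lock_free_queue()
    {
        while (node* const old_head = head.load())
        {
            head.store(old_head->next);
            delete old_head;
        }
    }
    std::shared_ptr<T> pop()
    {
        node* old_head = pop_head();
        if (!old_head) return std::shared_ptr<T>();
        std::shared_ptr<T> const res(old_head->data);
        delete old_head;
        return res;
    }
    void push(T new_value)
    {
        std::shared_ptr<T> new_data(std::make_shared<T>(new_value));
        node* p = new node;                // the new dummy node
        node* const old_tail = tail.load();
        old_tail->data.swap(new_data);     // fill the current dummy
        old_tail->next = p;
        tail.store(p);                     // publish: pop() checks tail
    }
};
```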
The points that need synchronization:
- The store to `tail` synchronizes with the load from `tail` (via the atomic variable).
- The store to `data` happens before the load, because:
  - the store to `tail` synchronizes with the load from `tail`;
  - the store to the preceding node's data pointer is sequenced before the store to `tail`;
  - the load from `tail` is sequenced before the load from the data pointer.
HANDLING MULTIPLE THREADS IN PUSH()
How to support concurrent `pop()` and `push()` from multiple threads.
Idea 1: add a dummy node between the real nodes.
- For `push()`: the only part of the current `tail` node that needs updating is the `next` pointer, which could therefore be made atomic.
- For `pop()`: a minor change is required to discard nodes with a null data pointer and loop again.
- Downside: every `pop()` call typically has to remove two nodes, and there are twice as many memory allocations.
Idea 2: make the `data` pointer atomic and set it with a call to compare/exchange.
- For `push()`: if the compare/exchange succeeds, this is your `tail` node, and you can safely set the `next` pointer to your new node and then update `tail`. If the compare/exchange fails because another thread has stored the data, you loop around, reread `tail`, and start again.
Listing 7.15 A (broken) first attempt at revising push()
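A runnable single-threaded sketch of this broken attempt. The node layout is simplified relative to the book's listing (`data` is a plain `std::atomic<T*>`, no reference counts yet); the commented lines mark where a concurrent `pop()` could free memory that `push()` is still using:

```cpp
#include <atomic>
#include <memory>

template<typename T>
class lock_free_queue
{
    struct node
    {
        std::atomic<T*> data{nullptr};
        node* next{nullptr};
    };
    std::atomic<node*> head;
    std::atomic<node*> tail;
public:
    lock_free_queue() : head(new node), tail(head.load()) {}
    ~lock_free_queue()
    {
        while (node* n = head.load())
        {
            head.store(n->next);
            delete n->data.load();
            delete n;
        }
    }
    void push(T new_value)
    {
        std::unique_ptr<T> new_data(new T(new_value));
        node* const p = new node;              // new dummy node
        for (;;)
        {
            node* const old_tail = tail.load();   // BUG: may already be freed
            T* old_data = nullptr;
            // claim the current tail node by CAS'ing its data pointer
            if (old_tail->data.compare_exchange_strong(old_data, new_data.get()))
            {
                old_tail->next = p;               // BUG: old_tail may dangle here
                tail.store(p);
                new_data.release();
                break;
            }
        }
    }
    std::unique_ptr<T> pop()
    {
        node* const old_head = head.load();
        if (old_head == tail.load()) return nullptr;   // empty
        head.store(old_head->next);
        std::unique_ptr<T> res(old_head->data.load());
        delete old_head;   // this free is what races with push() above
        return res;
    }
};
```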
If another thread modifies the `tail` pointer around the `tail.store(new_next.ptr);`—for example a thread in `pop()` that deletes the old tail node—then this thread accesses a `tail` node that has already been deleted: UB. How to solve it?
Because it's the same problem as in `pop()`, consider adding another external/internal count pair. Specifically: having two external counts for the same node requires a modification to the reference-counting scheme to avoid deleting the node too early. You can address this by also counting the number of external counters inside the node structure, and decreasing this number when each external counter is destroyed (as well as adding the corresponding external count to the internal count). If the internal count is zero and there are no external counters, you know the node can safely be deleted.
Listing 7.16 Implementing push() for a lock-free queue with a reference-counted tail
Listing 7.17 Popping a node from a lock-free queue with a reference-counted tail
Next come the reference-counting functions.
Listing 7.18 Releasing a node reference in a lock-free queue: release_ref()
Listing 7.19 Obtaining a new reference to a node in a lock-free queue increase_external_count()
Listing 7.20 Freeing an external counter to a node in a lock-free queue
The approach above is not actually lock-free: once one thread has started a `push()` operation by successfully completing the `compare_exchange_strong()` on `old_tail.ptr->data`, no other thread can perform a `push()` operation. The other threads' busy wait consumes CPU cycles without achieving anything; this is effectively a lock.
It's even worse than an ordinary mutex in one respect: the operating system can give priority to the thread that holds the lock on a mutex when there are blocked threads, but it can't do so here, so the blocked threads waste CPU cycles until the first thread is done.
MAKING THE QUEUE LOCK-FREE BY HELPING OUT ANOTHER THREAD
The fix: find a way for a waiting thread to make progress even if the thread doing the `push()` is stalled. One way is to help the stalled thread by doing its work for it.
Concretely: the `next` pointer on the `tail` node needs to be set to a new dummy node, and then the `tail` pointer itself must be updated.
Listing 7.21 pop() modified to allow helping on the push() side
Listing 7.22 A sample push() with helping for a lock-free queue
A potential efficiency risk: there are a lot of `new` and `delete` calls for such a small piece of code, because new nodes are allocated in `push()` and destroyed in `pop()`. The efficiency of the memory allocator therefore has a considerable impact on the performance of this code.
Remedies:
- Try it, and measure the performance of the code before and after any change.
- Have a separate memory allocator on each thread, and use free lists to recycle nodes rather than returning them to the allocator.
7.3 Guidelines for writing lock-free data structures
7.3.1 Guideline: use std::memory_order_seq_cst for prototyping
On verification: unless you have an algorithm checker that can systematically test all possible combinations of thread visibilities that are consistent with the specified ordering guarantees (and such things do exist), running the code isn't enough.
7.3.2 Guideline: use a lock-free memory reclamation scheme
The hard part is managing memory:
- It's essential to avoid deleting objects while other threads might still hold references to them,
- yet you still want to delete the object as soon as possible to avoid excessive memory consumption.
three techniques for ensuring that memory can safely be reclaimed:
- Waiting until no threads are accessing the data structure and deleting all objects that are pending deletion.
- Using hazard pointers to identify that a thread is accessing a particular object.
- Reference counting the objects so that they aren’t deleted until there are no outstanding references.
Solutions at a larger scale:
- This is the ideal scenario for using a garbage collector.
- Recycle nodes, and only free them completely when the data structure is destroyed.
The downside is that another problem becomes more prevalent: the so-called ABA problem.
7.3.3 Guideline: watch out for the ABA problem
The fix: keep an ABA counter alongside the variable `x`; the compare/exchange operation is then done on the combined structure of `x` plus the counter as a single unit.
A common scenario: the ABA problem is particularly prevalent in algorithms that use free lists or otherwise recycle nodes rather than returning them to the allocator.
7.3.4 Guideline: identify busy-wait loops and help the other thread
The fix: if the waiting thread performs the incomplete steps when it's scheduled to run before the original thread completes the operation, you can remove the busy-wait and the operation is no longer blocking.
Summary
Why this chapter's design work matters: wherever data is shared between threads, you need to think about the data structures used and how the data is synchronized between threads. By designing data structures for concurrency, you can encapsulate that responsibility in the data structure itself, so the rest of the code can focus on the task it's trying to perform with the data rather than on the data synchronization.
https://www.chuxin911.com/Notes_On_C++_Concurrency_in_Action_5_20221003/