menu

perfbook-c4

4 Tools of the Trade

4.1 Scripting Languages

直接使用脚本后台运行。

4.2 POSIX Multiprocessing

4.2.1 POSIX Process Creation and Destruction

POSIX进程的接口:

通过fork()原语创建
使用kill()原语销毁,也可以用exit()原语自我销毁。
执行fork()的进程被称为新创建进程的“父进程”。
父进程可以通过wait()原语等待子进程执行完毕,每次调用wait()只能等待一个子进程。
It is critically important to note that the parent and child do not share memory.

4.2.2 POSIX Thread Creation and Destruction

#include <pthread.h>
int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                  void *(*start_routine) (void *), void *arg);
/* 对fork-join中的wait()的模仿 */
int pthread_join(pthread_t thread, void **retval);
void pthread_exit(void *status);

线程之间共享内存。

4.2.3 POSIX Locking

常见接口如下所示:

#include <pthread.h>
/* 静态初始化 */
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; 
/* 动态初始化 */
int pthread_mutex_init(pthread_mutex_t *restrict mutex,
const pthread_mutexattr_t *restrict attr);
/* 销毁 */
int pthread_mutex_destroy(pthread_mutex_t *mutex);
/* 获取一个指定的锁 */
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
/* 释放一个指定的锁 */
int pthread_mutex_unlock(pthread_mutex_t *mutex); 

4.2.4 POSIX Reader-Writer Locking

/* 静态初始化 */
pthread_rwlock_t  rwlock = PTHREAD_RWLOCK_INITIALIZER;
/* 动态初始化 */
int pthread_rwlock_init(pthread_rwlock_t *restrict rwlock,
const pthread_rwlockattr_t *restrict attr); 
/* 销毁 */
int pthread_rwlock_destroy(pthread_rwlock_t *rwlock);

4.2.5 Atomic Operations (GCC Classic)

常用的基本原子操作接口

__sync_fetch_and_sub()
__sync_fetch_and_or()
__sync_fetch_and_and()
__sync_fetch_and_xor()
__sync_fetch_and_nand()
/* all of which return the old value. If you instead need the new value
you can instead use the follow */
__sync_add_and_fetch()
__sync_sub_and_fetch()
__sync_or_and_fetch()
__sync_and_and_fetch()
__sync_xor_and_fetch()
__sync_nand_and_fetch()

常用的编译器级别屏障原语

/* Listing 4.9: Compiler Barrier Primitive (for GCC) */
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
#define READ_ONCE(x) \
    ({ typeof(x) ___x = ACCESS_ONCE(x); ___x; })
#define WRITE_ONCE(x, val) ({ ACCESS_ONCE(x) = (val); })
#define barrier() __asm__ __volatile__("": : :"memory")

4.2.6 Atomic Operations (C11)

C11标准的原子操作

常见接口如下所示:

atomic_flag_test_and_set
atomic_flag_clear
atomic_init
atomic_is_lock_free
atomic_store
atomic_load
atomic_exchange
atomic_compare_exchange_strong
atomic_compare_exchange_weak
atomic_fetch_add
atomic_fetch_sub
atomic_fetch_or
atomic_fetch_xor
atomic_fetch_and
atomic_thread_fence
atomic_signal_fence

可以用该接口实现应用级别的内存屏障, 如下所示:

#define MFENCE std::atomic_thread_fence(std::memory_order::memory_order_acq_rel)

4.2.7 Atomic Operations (Modern GCC)

__atomic_load()
__atomic_load_n()
__atomic_store()
__atomic_store_n()
__atomic_thread_fence()

在这里会用到内存序相关的参数,如下所示:

__ATOMIC_RELAXED
__ATOMIC_CONSUME
__ATOMIC_ACQUIRE
__ATOMIC_RELEASE
__ATOMIC_ACQ_REL
__ATOMIC_SEQ_CST

4.2.8 Per-Thread Variables

pthread_key_create()
pthread_key_delete()
pthread_setspecific()
pthread_setspecific()

可以在类型后跟上__thread说明符,C11标准引入了代替这个说明符的_Thread_local关键字。

4.3 Alternatives to POSIX Operations(POSIX接口的替代方法)

4.3.2 Thread Creation, Destruction, and Control

内核提供的相关接口:

/* The Linux kernel uses struct task_struct pointers to track kthreads */
/* create */
kthread_create()
/* To externally suggest that theystop (which has no POSIX equivalent) */
kthread_should_stop() 
/* wait for them to stop */
kthread_stop()
/* For a timed wait. */ 
schedule_timeout_interruptible()

/* 以下是用户态的接口 */
int smp_thread_id(void)
thread_id_t create_thread(void *(*func)(void *), void *arg)
for_each_thread(t)
for_each_running_thread(t)
void *wait_thread(thread_id_t tid)
void wait_all_threads(void)

4.3.3 Locking

void spin_lock_init(spinlock_t *sp);
void spin_lock(spinlock_t *sp);
int spin_trylock(spinlock_t *sp);
void spin_unlock(spinlock_t *sp);

4.3.4 Accessing Shared Variables

4.3.4.1 Shared-Variable Shenanigans

这一节主要是讲编译器能对我们的程序做些什么来优化,这种优化常常会导致我们预期之外的逻辑。

  • Load tearing
    Load tearing occurs when the compiler uses multiple load instructions for a single asscess.
  • Store tearing
    Store tearing occurs when the compiler uses multiple store instructions for a single access.
  • Load fusing
    Load fusing occurs when the compiler uses the result of a prior load from a given variable instead of repeating the load. 程序实际Load的顺序可能与代码顺序不符,带来判断的错误。
  • Store fusing
    Store fusing can occur when the compiler notices a pair of successive stores to a given variable with no intervening loads from that variable. 从4.174.18的例子可以看出,由于Store fusing机制,第3行的store操作被优化掉了,导致函数shut_it_down以及work_until_shut_down都无法退出循环,进入了死锁。

  • Code reordering
  • Invented loads
    编译器去掉了中间变量导致了更多的load操作,在某些情况可以会增加cache miss的概率,带来性能损耗。

  • Invented stores
    复杂的inline函数可能会导致寄存器溢出,部分临时变量比如other_task_ready的值被覆盖,导致程序执行逻辑错误,这里例子实际应该解读为UnInvented store
    例子4.19意在说明编译器的分支优化逻辑:This transforms the if-then-else into an if-then, saving one branch.这个应该是Invented store更为常用的场景,在glibc的汇编实现里也很多这种逻辑。

4.3.4.2 A Volatile Solution

Although it is now much maligned, before the advent of C11 and C++11 [Bec11], the volatile keyword was an indispensible tool in the parallel programmer’s toolbox. This raises the question of exactly what volatile means, a question that is not answered with excessive precision even by more recent versions of this standard [Smi18].6 This version guarantees that “Accesses through volatile glvalues are evaluated strictly according to the rules of the abstract machine”, that volatile accesses are side effects, that they are one of the four forward-progress indicators, and that their exact semantics are implementation-defined. Perhaps the most clear guidance is provided by this nonnormative note:

volatile关键字是并行程序员工具箱中必不可少的工具。其语义如下:

volatile暗示编译器避免对使用对象进行积极优化,因为对象的值可能通过实现无法检测到的方式进行更改。此外,对于某些实现,volatile可能指示需要特殊的硬件指令才能访问该对象。

通过volatile语义定义的宏可以解决4.3.4.1中发现的各种问题。

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
#define READ_ONCE(x) \
    ({ typeof(x) ___x = ACCESS_ONCE(x); ___x; })
#define WRITE_ONCE(x, val) ({ ACCESS_ONCE(x) = (val); })

volatile语义解决方案的总结

To summarize, the volatile keyword can prevent load tearing and store tearing in cases where the loads and stores are machine-sized and properly aligned. It can also prevent load fusing, store fusing, invented loads, and invented stores. However, although it does prevent the compiler from reordering volatile accesses with each other, it does nothing to prevent the CPU from reordering these accesses. Furthermore, it does nothing to prevent either compiler or CPU from reordering non-volatile accesses with each other or with volatile accesses. Preventing these types of reordering requires the techniques described in the next section.

也就是说volatile无法解决所有的问题,还是需要内存屏障,如下小节所示。

4.3.4.3 Assembling the Rest of a Solution

  • 编译级的屏障
    while (!need_to_stop) {
      barrier();
      do_something_quickly();
      barrier();
    }
    
  • 运行时屏障 ```c void shut_it_down(void) { WRITE_ONCE(status, SHUTTING_DOWN); smp_mb(); start_shutdown(); while (!READ_ONCE(other_task_ready)) continue; smp_mb(); finish_shutdown(); smp_mb(); WRITE_ONCE(status, SHUT_DOWN); do_something_else(); }

void work_until_shut_down(void) { while (READ_ONCE(status) != SHUTTING_DOWN) { smp_mb(); do_more_work(); } smp_mb(); WRITE_ONCE(other_task_ready, 1); }


#### 4.3.4.4 Avoiding Data Races
使用锁解决并发访问变量的问题,可基本遵循如下原则:
> 1. If a shared variable is only modified while holding a given lock by a given owning CPU or thread, then all stores must use WRITE_ONCE() and non-owning CPUs or threads that are not holding the lock must use READ_ONCE() for loads. The owning CPU or thread may use plain loads, as may any CPU or thread holding the lock.  
如果仅在拥有给定的拥有CPU或线程持有给定锁的同时修改共享变量,则所有存储区都必须使用WRITE_ONCE(),而没有拥有锁的非拥有CPU或线程必须使用READ_ONCE() 。  
拥有CPU或线程的线程可以使用普通负载,而持有锁的任何CPU或线程也可以使用。
> 2. If a shared variable is only modified while holding a given lock, then all stores must use WRITE_ONCE(). CPUs or threads not holding the lock must use READ_ ONCE() for loads. CPUs or threads holding the lock may use plain loads.  
如果仅在持有给定锁的情况下修改共享变量,则所有存储区都必须使用WRITE_ONCE()。  
未持有锁的CPU或线程必须使用READ_ ONCE()进行加载。CPU或线程持有锁可以使用普通的负载。
> 3. If a shared variable is only modified by a given owning CPU or thread, then all stores must use WRITE_ONCE() and non-owning CPUs or threads must use READ_ONCE() for loads. The owning CPU or thread may use plain loads.  
如果共享变量仅由给定的拥有CPU或线程修改,则所有存储必须使用WRITE_ONCE(),非拥有CPU或线程必须使用READ_ONCE()进行加载。  
拥有的CPU或线程可能使用普通负载。


### 4.3.5 Atomic Operations
[内核文档atomic_ops.txt](https://github.com/tinganho/linux-kernel/blob/master/Documentation/atomic_ops.txt)

### 4.3.6 Per-CPU Variables
内核提供了一些接口用于创造、初始化、访问线程独立的变量。
```c
DEFINE_PER_THREAD()
DECLARE_PER_THREAD()
per_thread()
__get_thread_var()
init_per_thread()

4.4 The Right Tool for the Job: How to Choose?

从简到难,开销从大到小:

  • 脚本的forkexec接口
  • C的fork()wait()接口
  • POSIX threading
  • Chapter 9 Deferred Processing
  • inter-process communication and message-passing

针对工作平台、语言的差异,可以选择以下不同的多线程编程接口:

  • C11标准中的接口
  • GCC接口
  • 使用volatile语义
  • 使用内存屏障

设计层面的告诫:

The smarter you are, the deeper a hole you will dig for yourself before you realize that you are in trouble