TRANSCRIPT
Read-Copy Update
P. E. McKenney, J. Appavoo, A. Kleen, O. Krieger, R. Russell, D. Sarma, M. Soni
Ottawa Linux Symposium 2001
Presented by Bogdan Simion
Motivation
• Locking can be expensive
  – Overhead of locking code
  – Cache bouncing
• Linux uses locking to protect against infrequent destructive modifications
  – e.g., racy accesses to unloaded modules
• Want to avoid locking expense for reads of data that are infrequently modified
Grace Periods
• Starts when new operations see new state
  – e.g., remove the pointer to a module from a list
  – First phase of the update
  – No new references made once the period starts
• Extends until after all operations that started before the grace period finish
  – Operations with outstanding references finish safely
• When the period ends, the system may clean up
  – Second phase of the update, e.g., free module data
Grace Period Duration
Safe to end the grace period when all CPUs have finished prior operations
In a non-preemptive operating system, a CPU has finished all of its in-flight operations by the time it context switches
Thus, grace period ends after all CPUs have context switched at least once
Zero reference count deduced without using any shared data!
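One way this deduction could be implemented, sketched under assumptions: the per-CPU ctxt_switches counter here is hypothetical, and the paper's actual implementation (shown later) uses a different trick.

    /* Wait until every CPU has context switched at least once since we
     * started.  Readers never touch these counters, so no data shared
     * with readers is required. */
    extern long ctxt_switches[NR_CPUS];  /* hypothetical per-CPU counters */

    void wait_for_grace_period(void)
    {
        long snap[NR_CPUS];
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
            snap[cpu] = ctxt_switches[cpu];
        for (cpu = 0; cpu < NR_CPUS; cpu++)
            while (ctxt_switches[cpu] == snap[cpu])
                schedule();     /* yield until that CPU has switched */
    }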
RCU So Far
• RCU performs updates in two phases:
  – Update enough so that new operations see the new state, but old operations can proceed using the old state
  – Complete the update after the grace period
• RCU works well when:
  – Updates can be done in two phases
  – Operations still work with stale state
  – Destructive updates are infrequent
Let's look at an example of how it's used
• Simple circular doubly linked-list
• Compare a reference-counting locking algorithm taken from Linux with its read-copy-update equivalent
Example: Reference Counted Search
• For each algorithm:
• search()
• delete()
• search(): returns a pointer to an element in the list given its addr, and ensures that the element is not freed while the caller uses it
• delete(): arranges for the specified element to eventually be freed
• delete() may not be able to free the element immediately, due to concurrent searches (see the sketch below)
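A hedged sketch of the reference-counted variant; struct el, list_lock, refcnt, and release() are illustrative names, not the paper's exact Linux code.

    struct el {
        struct el *next, *prev;
        long addr;
        atomic_t refcnt;        /* element freed when this reaches zero */
    };

    spinlock_t list_lock;       /* guards the list and refcnt transitions */

    /* search(): find the element and pin it with a reference so that a
     * concurrent delete() cannot free it underneath the caller. */
    struct el *search(struct el *head, long addr)
    {
        struct el *p;

        spin_lock(&list_lock);
        for (p = head->next; p != head; p = p->next) {
            if (p->addr == addr) {
                atomic_inc(&p->refcnt);
                spin_unlock(&list_lock);
                return p;
            }
        }
        spin_unlock(&list_lock);
        return NULL;
    }

    /* Callers drop their reference when done; the last holder frees. */
    void release(struct el *p)
    {
        if (atomic_dec_and_test(&p->refcnt))
            kfree(p);
    }

    /* delete(): unlink now; the element is actually freed only when the
     * last outstanding reference is released. */
    void delete(struct el *p)
    {
        spin_lock(&list_lock);
        p->prev->next = p->next;
        p->next->prev = p->prev;
        spin_unlock(&list_lock);
        release(p);             /* drop the list's own reference */
    }

Note that every search acquires list_lock and performs an atomic increment, which is exactly the overhead the discussion below is about.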
Search / Delete Discussion
• Searching scales perfectly
  – No locks, so it scales well
  – No cache line bouncing
  – Clear advantage over reference counting (see the sketch below)
• Search can return stale data
  – There is a race between search and delete
  – Reference counting + locks does not have this problem
• Delete is similar to before: a global lock
  – Good speedups only if searches are much more frequent than deletes
kfree_rcu is neither trivial nor inexpensive
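By contrast, the read-copy-update search needs no locks or atomic operations at all. A minimal sketch, reusing the illustrative struct el from above (the 2001 code predates today's rcu_read_lock()/rcu_dereference() helpers):

    /* RCU-style search: no locking, no atomic ops, no writes to shared
     * memory.  It may return an element that a concurrent delete() has
     * already unlinked (stale data), which callers must tolerate. */
    struct el *search(struct el *head, long addr)
    {
        struct el *p;

        for (p = head->next; p != head; p = p->next)
            if (p->addr == addr)
                return p;
        return NULL;
    }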
Read-Copy Deletion Scenario
• To delete element B, the updater task acquires the list lock to exclude other list manipulation, unlinks element B from the list, and releases the list lock
Read-Copy Deletion Scenario
• The updater task passes a pointer to B to the kfree_rcu() primitive, which adds the memory to a list waiting to be freed.
• Safe to return B to the freelist at the end of the grace period (when all pre-existing ops complete)
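Put together, the deletion just described might look like this sketch (same illustrative names as before):

    void delete(struct el *b)
    {
        /* Phase 1: exclude other updaters and unlink B.  Searches that
         * already hold a pointer to B proceed safely on the old state;
         * new searches can no longer reach it. */
        spin_lock(&list_lock);
        b->prev->next = b->next;
        b->next->prev = b->prev;
        spin_unlock(&list_lock);

        /* Phase 2: defer the free.  kfree_rcu() returns B to the
         * freelist only after the grace period ends. */
        kfree_rcu(b);
    }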
List After Element B Returned to Freelist
List After Grace Period
Implementing kfree_rcu
Delay deletion until the end of the grace period:

    wait_for_rcu()
    {
        ...
        /* Allow this task to run on every CPU. */
        current->cpus_allowed = (1 << num_cpus) - 1;
        while (true) {
            /* Clear the bit for the CPU we are running on now: being
             * scheduled here means this CPU has context switched. */
            current->cpus_allowed &= ~(1 << cpu_index());
            if (current->cpus_allowed == 0)
                break;
            /* Migrate to one of the remaining CPUs. */
            schedule();
        }
        /* Grace period now over.  Now it's safe to delete. */
        ...
    }
• Doesn't work with preemptible kernels. Why?
• Can't be called from an interrupt handler or while a spinlock is held. Why?
• Can be relatively slow. Why?
Deferring wait_for_rcu
    struct rcu_head {
        tq_struct task;         /* task-queue element used to defer the wait */
    };

    /* Allocate with room for an rcu_head hidden just before the object. */
    void *kmalloc_rcu(size_t size, int flags)
    {
        rcu_head *ret = kmalloc(size + sizeof(*ret), flags);
        return ret + 1;
    }

    /* Runs later, from a task queue: safe to block in wait_for_rcu(). */
    void sync_and_destroy(void *head)
    {
        wait_for_rcu();
        kfree(head);
    }

    /* May be called from interrupt context or under a spinlock: it only
     * schedules the deferred destruction instead of waiting itself. */
    void kfree_rcu(void *obj)
    {
        rcu_head *head = ((rcu_head *) obj) - 1;

        head->task.routine = &sync_and_destroy;
        head->task.data = head;
        schedule_task(&head->task);
    }
Why is kmalloc_rcu necessary?
RCU Application: File Descriptors
• The kernel maintains the mapping of file descriptors to instances of struct file in an array
• Expansion of the array is a destructive update:
  – Copies the old elements into a new array
  – Updates pointers and deletes the old array
• RCU employed (sketched below):
  – Phase 1: create the new array and update the pointers
  – Phase 2: delete the old array
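A simplified sketch of the two-phase expansion, using the kmalloc_rcu/kfree_rcu primitives from the earlier slides (struct fd_table and its fields are illustrative; the real files_struct code differs in detail):

    struct fd_table {
        int max_fds;
        struct file **fd_array;     /* readers index this without locks */
    };

    /* Caller is assumed to hold the file-table update lock. */
    void expand_fd_array(struct fd_table *t, int new_max)
    {
        /* Allocate via kmalloc_rcu so the array can later be handed to
         * kfree_rcu when it is itself replaced. */
        struct file **new_array =
            kmalloc_rcu(new_max * sizeof(*new_array), GFP_KERNEL);
        struct file **old_array = t->fd_array;

        /* Phase 1: copy the old elements, then switch the pointer so new
         * operations see the new state.  (A memory barrier would be needed
         * so readers never see the new pointer before the copied contents.) */
        memcpy(new_array, old_array, t->max_fds * sizeof(*new_array));
        t->fd_array = new_array;
        t->max_fds = new_max;

        /* Phase 2: pre-existing readers may still be walking old_array;
         * return it to the freelist only after the grace period. */
        kfree_rcu(old_array);
    }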
RCU Performance: File Descriptors
Chat benchmark, 2.4.2 SMP Kernel
Why does R/W lock incur so much overhead?
RCU Performance Improvements
• A number of improvements to the basic mechanism
• Batch grace period measurements (see the sketch below)
  – wait_for_rcu is expensive
  – A single measurement satisfies multiple deferred-free requests
• Maintain per-CPU request lists
• Faster grace period algorithm
• See the paper for details
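A sketch of the batching and per-CPU-list ideas combined; all names are illustrative, and this variant rcu_head carries a list pointer instead of the tq_struct used earlier:

    struct rcu_head {
        struct rcu_head *next;      /* links pending requests */
    };

    /* One private request list per CPU: enqueueing touches no shared
     * data (interrupt masking omitted for brevity). */
    struct rcu_head *pending[NR_CPUS];

    void kfree_rcu(void *obj)
    {
        struct rcu_head *head = ((struct rcu_head *) obj) - 1;

        head->next = pending[cpu_index()];
        pending[cpu_index()] = head;
    }

    /* Run periodically on each CPU: one grace-period measurement
     * satisfies every request queued before it started. */
    void process_pending(int cpu)
    {
        struct rcu_head *list = pending[cpu];

        pending[cpu] = NULL;
        wait_for_rcu();             /* single, batched measurement */
        while (list) {
            struct rcu_head *next = list->next;
            kfree(list);            /* kmalloc_rcu placed the header first */
            list = next;
        }
    }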
Comparing RCU to other Locking Algorithms
• Data locking
  – Does not avoid reader locks
  – Also prone to deadlocks
  – Although list elements can be manipulated in parallel, searches cannot be done in parallel
  – Can be used to prevent stale reads in RCU
• brlock (sketched below)
  – Effectively lock-free reads
  – Not clear how its performance differs from RCU
  – i.e., can't brlock be used for the file descriptor arrays?
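For reference, the brlock ("big-reader" lock) idea, sketched here with illustrative per-CPU locks rather than the real 2.4 <linux/brlock.h> API:

    spinlock_t br_lock[NR_CPUS];    /* one lock per CPU */

    /* Readers take only their own CPU's lock: the lock word stays in
     * that CPU's cache, so there is no cross-CPU cache-line bouncing
     * (though, unlike RCU, an atomic operation is still performed). */
    void br_read_lock(void)   { spin_lock(&br_lock[cpu_index()]); }
    void br_read_unlock(void) { spin_unlock(&br_lock[cpu_index()]); }

    /* Writers pay heavily: they must sweep every per-CPU lock. */
    void br_write_lock(void)
    {
        int i;

        for (i = 0; i < NR_CPUS; i++)
            spin_lock(&br_lock[i]);
    }

    void br_write_unlock(void)
    {
        int i;

        for (i = NR_CPUS - 1; i >= 0; i--)
            spin_unlock(&br_lock[i]);
    }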
Conclusions
• RCU is an effective approach for avoiding locking for read-mostly data structures
• An elegant method for implicit reference counting
• Main advantage: readers need not acquire locks, perform any atomic ops, write to shared memory, or use barriers
• The destructive update is delayed until the grace period finishes, i.e., until all CPUs have context switched (if non-preemptible)
• Since 2001, it has been used in hundreds of places in the Linux kernel