diff --git a/doc/CODING-GUIDELINES b/doc/CODING-GUIDELINES index 88cc7dced..985df6846 100644 --- a/doc/CODING-GUIDELINES +++ b/doc/CODING-GUIDELINES @@ -639,6 +639,232 @@ The old style, with one event named "ThisEventOn" and another named Check manager.txt for more information on manager and existing headers. Please update this file if you add new headers. +* Locking in Asterisk +----------------------------- + +A) Locking Fundamentals + +Asterisk is a heavily multithreaded application. It makes extensive +use of locking to ensure safe access to shared resources between +different threads. + +When more that one lock is involved in a given code path, there is the +potential for deadlocks. A deadlock occurs when a thread is stuck +waiting for a resource that it will never acquire. Here is a classic +example of a deadlock: + + Thread 1 Thread 2 + ------------ ------------ + Holds Lock A Holds Lock B + Waiting for Lock B Waiting for Lock A + +In this case, there is a deadlock between threads 1 and 2. +This deadlock would have been avoided if both threads had +agreed that one must acquire Lock A before Lock B. + +In general, the fundamental rule for dealing with multiple locks is + + an order _must_ be established to acquire locks, and then all threads + must respect that order when acquiring locks. + + +A.1) Establishing a locking order + +Because any ordering for acquiring locks is ok, one could establish +the rule arbitrarily, e.g. ordering by address, or by some other criterion. +The main issue, though, is defining an order that + i) is easy to check at runtime; + ii) reflects the order in which the code executes. +As an example, if a data structure B is only accessible through a +data structure A, and both require locking, then the natural order +is locking first A and then B. +As another example, if we have some unrelated data structures to +be locked in pairs, then a possible order can be based on the address +of the data structures themselves. + +B) Minding the boundary between channel drivers and the Asterisk core + +The #1 cause of deadlocks in Asterisk is by not properly following the +locking rules that exist at the boundary between Channel Drivers and +the Asterisk core. The Asterisk core allocates an ast_channel, and +Channel Drivers allocate "technology specific private data" (PVT) that is +associated with an ast_channel. Typically, both the ast_channel and +PVT have their own lock. There are _many_ +code paths that require both objects to be locked. + +The locking order in this situation is the following: + + 1) ast_channel + 2) PVT + +Channel Drivers implement the ast_channel_tech interface to provide a +channel implementation for Asterisk. Most of the channel_tech +interface callbacks are called with the associated ast_channel +locked. When accessing technology specific data, the PVT can be locked +directly because the locking order is respected. + +C) Preventing lock ordering reversals. + +There are some code paths which make it extremely difficult to +respect the locking order. +Consider for example the following situation: + + 1) A message comes in over the "network" + 2) The Channel Driver (CD) monitor thread receives the message + 3) The CD associates the message with a PVT and locks the PVT + 4) While processing the message, the CD must do something that requires + locking the ast_channel associated to the PVT + +This is the point that must be handled carefully. +The following psuedo-code + + unlock(pvt); + lock(ast_channel); + lock(pvt); + +is _not_ correct for two reasons: + +i) first and foremost, unlocking the PVT means that other threads + can acquire the lock and believe it is safe to modify the + associated data. When reacquiring the lock, the original thread + might find unexpected changes in the protected data structures. + This essentially means that the original thread must behave as if + the lock on the pvt was not held, in which case it could have + released it itself altogether; + +ii) Asterisk uses the so called "recursive" locks, which allow a thread + to issue a lock() call multiple times on the same lock. Recursive + locks count the number of calls, and they require an equivalent + number of unlock() to be actually released. + + For this reason, just calling unlock() once does not guarantee that the + lock is actually released -- it all depends on how many times lock() + was called before. + +An alternative, but still incorrect, construct is widely used in +the asterisk code to try and improve the situation: + + while (trylock(ast_channel) == FAILURE) { + unlock(pvt); + usleep(1); /* yield to other threads */ + lock(pvt); + } + +Here the trylock() is non blocking, so we do not deadlock if the ast_channel +is already locked by someone else: in this case, we try to unlock the PVT +(which happens only if the PVT lock counter is 1), yield the CPU to +give other threads a chance to run, and then acquire the lock again. + +This code is not correct for two reasons: + i) same as in the previous example, it releases the lock when the thread + probably did not expect it; + ii) if the PVT lock counter is greater than 1 we will not + really release the lock on the PVT. We might be lucky and have the + other contender actually release the lock itself, and so we will "win" + the race, but if both contenders have their lock counts > 1 then + they will loop forever (basically replacing deadlock with livelock). + +Another variant of this code is the following: + + if (trylock(ast_channel) == FAILURE) { + unlock(pvt); + lock(ast_channel); + lock(pvt); + } + +which has the same issues as the while(trylock...) code, but just +deadlocks instead of looping forever in case of lock counts > 1. + +The deadlock/livelock could be in principle spared if one had an +unlock_all() function that calls unlock as many times as needed to +actually release the lock, and reports the count. Then we could do: + + if (trylock(ast_channel) == FAILURE) { + n = unlock_all(pvt); + lock(ast_channel) + while (n-- > 0) lock(pvt); + } + +The issue with unexpected unlocks remains, though. + +C) Locking multiple channels. + +The next situation to consider is what to do when you need a lock on +multiple ast_channels (or multiple unrelated data structures). + +If we are sure that we do not hold any of these locks, then the +following construct is sufficient: + + lock(MIN(chan1, chan2)); + lock(MAX(chan1, chan2)); + +That type of code would follow an established locking order of always +locking the channel that has a lower address first. Also keep in mind +that to use this construct for channel locking, one would have to go +through the entire codebase to ensure that when two channels are locked, +this locking order is used. + However, if we enter the above section of code with some lock held +(which would be incorrect using non-recursive locks, but is completely +legal using recursive mutexes) then the locking order is not guaranteed +anymore because it depends on which locks we already hold. So we have +to go through the same tricks used for the channel+PVT case. + +D) Recommendations + +As you can see from the above discussion, getting locking right is all +but easy. So please follow these recommendations when using locks: + +*) Use locks only when really necessary + Please try to use locks only when strictly necessary, and only for + the minimum amount of time required to run critical sections of code. + A common use of locks in the current code is to protect a data structure + from being released while you use it. + With the use of reference-counted objects (astobj2) this should not be + necessary anymore. + +*) Do not sleep while holding a lock + If possible, do not run any blocking code while holding a lock, + because you will also block other threads trying to access the same + lock. In many cases, you can hold a reference to the object to avoid + that it is deleted while you sleep, perhaps set a flag in the object + itself to report other threads that you have some pending work to + complete, then release and acquire the lock around the blocking path, + checking the status of the object after you acquire the lock to make + sure that you can still perform the operation you wanted to. + +*) Try not to exploit the 'recursive' feature of locks. + Recursive locks are very convenient when coding, as you don't have to + worry, when entering a section of code, whether or not you already + hold the lock -- you can just protect the section with a lock/unlock + pair and let the lock counter track things for you. + But as you have seen, exploiting the features of recursive locks + make it a lot harder to implement proper deadlock avoidance strategies. + So please try to analyse your code and determine statically whether you + already hold a lock when entering a section of code. + If you need to call some function foo() with and without a lock held, + you could define two function as below: + foo_locked(...) { + ... do something, assume lock held + } + + foo(...) { + lock(xyz) + ret = foo_locked(...) + unlock(xyz) + return ret; + } + and call them according to the needs. + +*) Document locking rules. + Please document the locking order rules are documented for every + lock introduced into Asterisk. This is done almost nowhere in the + existing code. However, it will be expected to be there for newly + introduced code. Over time, this information should be added for + all of the existing lock usage. + +----------------------------------------------------------------------- + + ------------------------------------ == PART TWO: BUILD ARCHITECTURE == ------------------------------------