dect
/
asterisk
Archived
13
0
Fork 0

Add a locking section to the coding guidelines document.

This section covers some locking fundamentals, as well as some information on
locking as it is used in Asterisk.  It describes some of the ways that are used
and could be used to achieve deadlock avoidance.  It also demonstrates the
unfortunate conclusion that with the use of recursive locks, none of the
constructs in use today are failsafe from deadlocks.  Finally, it makes some
recommendations for new code being written.  As proper locking strategies is a
complex subject, this section still has room for expansion and improvement.

This is a result of collaboration between Luigi Rizzo and myself on the 
asterisk-dev mailing list.


git-svn-id: http://svn.digium.com/svn/asterisk/trunk@127363 f38db490-d61c-443f-a65b-d21fe96a405b
This commit is contained in:
russell 2008-07-02 12:08:33 +00:00
parent 6d96663d29
commit ffd283a966
1 changed files with 226 additions and 0 deletions

View File

@ -639,6 +639,232 @@ The old style, with one event named "ThisEventOn" and another named
Check manager.txt for more information on manager and existing
headers. Please update this file if you add new headers.
* Locking in Asterisk
-----------------------------
A) Locking Fundamentals
Asterisk is a heavily multithreaded application. It makes extensive
use of locking to ensure safe access to shared resources between
different threads.
When more that one lock is involved in a given code path, there is the
potential for deadlocks. A deadlock occurs when a thread is stuck
waiting for a resource that it will never acquire. Here is a classic
example of a deadlock:
Thread 1 Thread 2
------------ ------------
Holds Lock A Holds Lock B
Waiting for Lock B Waiting for Lock A
In this case, there is a deadlock between threads 1 and 2.
This deadlock would have been avoided if both threads had
agreed that one must acquire Lock A before Lock B.
In general, the fundamental rule for dealing with multiple locks is
an order _must_ be established to acquire locks, and then all threads
must respect that order when acquiring locks.
A.1) Establishing a locking order
Because any ordering for acquiring locks is ok, one could establish
the rule arbitrarily, e.g. ordering by address, or by some other criterion.
The main issue, though, is defining an order that
i) is easy to check at runtime;
ii) reflects the order in which the code executes.
As an example, if a data structure B is only accessible through a
data structure A, and both require locking, then the natural order
is locking first A and then B.
As another example, if we have some unrelated data structures to
be locked in pairs, then a possible order can be based on the address
of the data structures themselves.
B) Minding the boundary between channel drivers and the Asterisk core
The #1 cause of deadlocks in Asterisk is by not properly following the
locking rules that exist at the boundary between Channel Drivers and
the Asterisk core. The Asterisk core allocates an ast_channel, and
Channel Drivers allocate "technology specific private data" (PVT) that is
associated with an ast_channel. Typically, both the ast_channel and
PVT have their own lock. There are _many_
code paths that require both objects to be locked.
The locking order in this situation is the following:
1) ast_channel
2) PVT
Channel Drivers implement the ast_channel_tech interface to provide a
channel implementation for Asterisk. Most of the channel_tech
interface callbacks are called with the associated ast_channel
locked. When accessing technology specific data, the PVT can be locked
directly because the locking order is respected.
C) Preventing lock ordering reversals.
There are some code paths which make it extremely difficult to
respect the locking order.
Consider for example the following situation:
1) A message comes in over the "network"
2) The Channel Driver (CD) monitor thread receives the message
3) The CD associates the message with a PVT and locks the PVT
4) While processing the message, the CD must do something that requires
locking the ast_channel associated to the PVT
This is the point that must be handled carefully.
The following psuedo-code
unlock(pvt);
lock(ast_channel);
lock(pvt);
is _not_ correct for two reasons:
i) first and foremost, unlocking the PVT means that other threads
can acquire the lock and believe it is safe to modify the
associated data. When reacquiring the lock, the original thread
might find unexpected changes in the protected data structures.
This essentially means that the original thread must behave as if
the lock on the pvt was not held, in which case it could have
released it itself altogether;
ii) Asterisk uses the so called "recursive" locks, which allow a thread
to issue a lock() call multiple times on the same lock. Recursive
locks count the number of calls, and they require an equivalent
number of unlock() to be actually released.
For this reason, just calling unlock() once does not guarantee that the
lock is actually released -- it all depends on how many times lock()
was called before.
An alternative, but still incorrect, construct is widely used in
the asterisk code to try and improve the situation:
while (trylock(ast_channel) == FAILURE) {
unlock(pvt);
usleep(1); /* yield to other threads */
lock(pvt);
}
Here the trylock() is non blocking, so we do not deadlock if the ast_channel
is already locked by someone else: in this case, we try to unlock the PVT
(which happens only if the PVT lock counter is 1), yield the CPU to
give other threads a chance to run, and then acquire the lock again.
This code is not correct for two reasons:
i) same as in the previous example, it releases the lock when the thread
probably did not expect it;
ii) if the PVT lock counter is greater than 1 we will not
really release the lock on the PVT. We might be lucky and have the
other contender actually release the lock itself, and so we will "win"
the race, but if both contenders have their lock counts > 1 then
they will loop forever (basically replacing deadlock with livelock).
Another variant of this code is the following:
if (trylock(ast_channel) == FAILURE) {
unlock(pvt);
lock(ast_channel);
lock(pvt);
}
which has the same issues as the while(trylock...) code, but just
deadlocks instead of looping forever in case of lock counts > 1.
The deadlock/livelock could be in principle spared if one had an
unlock_all() function that calls unlock as many times as needed to
actually release the lock, and reports the count. Then we could do:
if (trylock(ast_channel) == FAILURE) {
n = unlock_all(pvt);
lock(ast_channel)
while (n-- > 0) lock(pvt);
}
The issue with unexpected unlocks remains, though.
C) Locking multiple channels.
The next situation to consider is what to do when you need a lock on
multiple ast_channels (or multiple unrelated data structures).
If we are sure that we do not hold any of these locks, then the
following construct is sufficient:
lock(MIN(chan1, chan2));
lock(MAX(chan1, chan2));
That type of code would follow an established locking order of always
locking the channel that has a lower address first. Also keep in mind
that to use this construct for channel locking, one would have to go
through the entire codebase to ensure that when two channels are locked,
this locking order is used.
However, if we enter the above section of code with some lock held
(which would be incorrect using non-recursive locks, but is completely
legal using recursive mutexes) then the locking order is not guaranteed
anymore because it depends on which locks we already hold. So we have
to go through the same tricks used for the channel+PVT case.
D) Recommendations
As you can see from the above discussion, getting locking right is all
but easy. So please follow these recommendations when using locks:
*) Use locks only when really necessary
Please try to use locks only when strictly necessary, and only for
the minimum amount of time required to run critical sections of code.
A common use of locks in the current code is to protect a data structure
from being released while you use it.
With the use of reference-counted objects (astobj2) this should not be
necessary anymore.
*) Do not sleep while holding a lock
If possible, do not run any blocking code while holding a lock,
because you will also block other threads trying to access the same
lock. In many cases, you can hold a reference to the object to avoid
that it is deleted while you sleep, perhaps set a flag in the object
itself to report other threads that you have some pending work to
complete, then release and acquire the lock around the blocking path,
checking the status of the object after you acquire the lock to make
sure that you can still perform the operation you wanted to.
*) Try not to exploit the 'recursive' feature of locks.
Recursive locks are very convenient when coding, as you don't have to
worry, when entering a section of code, whether or not you already
hold the lock -- you can just protect the section with a lock/unlock
pair and let the lock counter track things for you.
But as you have seen, exploiting the features of recursive locks
make it a lot harder to implement proper deadlock avoidance strategies.
So please try to analyse your code and determine statically whether you
already hold a lock when entering a section of code.
If you need to call some function foo() with and without a lock held,
you could define two function as below:
foo_locked(...) {
... do something, assume lock held
}
foo(...) {
lock(xyz)
ret = foo_locked(...)
unlock(xyz)
return ret;
}
and call them according to the needs.
*) Document locking rules.
Please document the locking order rules are documented for every
lock introduced into Asterisk. This is done almost nowhere in the
existing code. However, it will be expected to be there for newly
introduced code. Over time, this information should be added for
all of the existing lock usage.
-----------------------------------------------------------------------
------------------------------------
== PART TWO: BUILD ARCHITECTURE ==
------------------------------------