dect
/
linux-2.6
Archived
13
0
Fork 0
Commit Graph

13879 Commits

Author SHA1 Message Date
Tejun Heo 56e6a08154 Merge branch 'timers/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-3.7 2012-08-21 12:31:12 -07:00
Tejun Heo c5f66e99b7 timer: Implement TIMER_IRQSAFE
Timer internals are protected with irq-safe locks but timer execution
isn't, so a timer being dequeued for execution and its execution
aren't atomic against IRQs.  This makes it impossible to wait for its
completion from IRQ handlers and difficult to shoot down a timer from
IRQ handlers.

This issue caused some issues for delayed_work interface.  Because
there's no way to reliably shoot down delayed_work->timer from IRQ
handlers, __cancel_delayed_work() can't share the logic to steal the
target delayed_work with cancel_delayed_work_sync(), and can only
steal delayed_works which are on queued on timer.  Similarly, the
pending mod_delayed_work() can't be used from IRQ handlers.

This patch adds a new timer flag TIMER_IRQSAFE, which makes the timer
to be executed without enabling IRQ after dequeueing such that its
dequeueing and execution are atomic against IRQ handlers.

This makes it safe to wait for the timer's completion from IRQ
handlers, for example, using del_timer_sync().  It can never be
executing on the local CPU and if executing on other CPUs it won't be
interrupted until done.

This will enable simplifying delayed_work cancel/mod interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1344449428-24962-5-git-send-email-tj@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-21 16:28:31 +02:00
Tejun Heo fc683995a6 timer: Clean up timer initializers
Over time, timer initializers became messy with unnecessarily
duplicated code which are inconsistently spread across timer.h and
timer.c.

This patch cleans up timer initializers.

* timer.c::__init_timer() is renamed to do_init_timer().

* __TIMER_INITIALIZER() added.  It takes @flags and all initializers
  are wrappers around it.

* init_timer[_on_stack]_key() now take @flags.

* __init_timer[_on_stack]() added.  They take @flags and all init
  macros are wrappers around them.

* __setup_timer[_on_stack]() added.  It uses __init_timer() and takes
  @flags.  All setup macros are wrappers around the two.

Note that this patch doesn't add missing init/setup combinations -
e.g. init_timer_deferrable_on_stack().  Adding missing ones is
trivial.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1344449428-24962-4-git-send-email-tj@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-21 16:28:30 +02:00
Tejun Heo e52b1db37b timer: Generalize timer->base flags handling
To prepare for addition of another flag, generalize timer->base flags
handling.

* Rename from TBASE_*_FLAG to TIMER_* and make them LU constants.

* Define and use TIMER_FLAG_MASK for flags masking so that multiple
  flags can be handled correctly.

* Don't dereference timer->base directly even if
  !tbase_get_deferrable().  All two such places are already passed in
  @base, so use it instead.

* Make sure tvec_base's alignment is large enough for timer->base
  flags using BUILD_BUG_ON().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1344449428-24962-2-git-send-email-tj@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-08-21 16:28:30 +02:00
Tejun Heo 3b07e9ca26 workqueue: deprecate system_nrt[_freezable]_wq
system_nrt[_freezable]_wq are now spurious.  Mark them deprecated and
convert all users to system[_freezable]_wq.

If you're cc'd and wondering what's going on: Now all workqueues are
non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
Please use system[_freezable]_wq instead.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-By: Lai Jiangshan <laijs@cn.fujitsu.com>

Cc: Jens Axboe <axboe@kernel.dk>
Cc: David Airlie <airlied@linux.ie>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
2012-08-20 14:51:24 -07:00
Tejun Heo ae930e0f4e workqueue: gut system_nrt[_freezable]_wq()
Now that all workqueues are non-reentrant, system[_freezable]_wq() are
equivalent to system_nrt[_freezable]_wq().  Replace the latter with
wrappers around system[_freezable]_wq().  The wrapping goes through
inline functions so that __deprecated can be added easily.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-20 14:51:23 -07:00
Tejun Heo 606a5020b9 workqueue: gut flush[_delayed]_work_sync()
Now that all workqueues are non-reentrant, flush[_delayed]_work_sync()
are equivalent to flush[_delayed]_work().  Drop the separate
implementation and make them thin wrappers around
flush[_delayed]_work().

* start_flush_work() no longer takes @wait_executing as the only left
  user - flush_work() - always sets it to %true.

* __cancel_work_timer() uses flush_work() instead of wait_on_work().

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-20 14:51:23 -07:00
Tejun Heo dbf2576e37 workqueue: make all workqueues non-reentrant
By default, each per-cpu part of a bound workqueue operates separately
and a work item may be executing concurrently on different CPUs.  The
behavior avoids some cross-cpu traffic but leads to subtle weirdities
and not-so-subtle contortions in the API.

* There's no sane usefulness in allowing a single work item to be
  executed concurrently on multiple CPUs.  People just get the
  behavior unintentionally and get surprised after learning about it.
  Most either explicitly synchronize or use non-reentrant/ordered
  workqueue but this is error-prone.

* flush_work() can't wait for multiple instances of the same work item
  on different CPUs.  If a work item is executing on cpu0 and then
  queued on cpu1, flush_work() can only wait for the one on cpu1.

  Unfortunately, work items can easily cross CPU boundaries
  unintentionally when the queueing thread gets migrated.  This means
  that if multiple queuers compete, flush_work() can't even guarantee
  that the instance queued right before it is finished before
  returning.

* flush_work_sync() was added to work around some of the deficiencies
  of flush_work().  In addition to the usual flushing, it ensures that
  all currently executing instances are finished before returning.
  This operation is expensive as it has to walk all CPUs and at the
  same time fails to address competing queuer case.

  Incorrectly using flush_work() when flush_work_sync() is necessary
  is an easy error to make and can lead to bugs which are difficult to
  reproduce.

* Similar problems exist for flush_delayed_work[_sync]().

Other than the cross-cpu access concern, there's no benefit in
allowing parallel execution and it's plain silly to have this level of
contortion for workqueue which is widely used from core code to
extremely obscure drivers.

This patch makes all workqueues non-reentrant.  If a work item is
executing on a different CPU when queueing is requested, it is always
queued to that CPU.  This guarantees that any given work item can be
executing on one CPU at maximum and if a work item is queued and
executing, both are on the same CPU.

The only behavior change which may affect workqueue users negatively
is that non-reentrancy overrides the affinity specified by
queue_work_on().  On a reentrant workqueue, the affinity specified by
queue_work_on() is always followed.  Now, if the work item is
executing on one of the CPUs, the work item will be queued there
regardless of the requested affinity.  I've reviewed all workqueue
users which request explicit affinity, and, fortunately, none seems to
be crazy enough to exploit parallel execution of the same work item.

This adds an additional busy_hash lookup if the work item was
previously queued on a different CPU.  This shouldn't be noticeable
under any sane workload.  Work item queueing isn't a very
high-frequency operation and they don't jump across CPUs all the time.
In a micro benchmark to exaggerate this difference - measuring the
time it takes for two work items to repeatedly jump between two CPUs a
number (10M) of times with busy_hash table densely populated, the
difference was around 3%.

While the overhead is measureable, it is only visible in pathological
cases and the difference isn't huge.  This change brings much needed
sanity to workqueue and makes its behavior consistent with timer.  I
think this is the right tradeoff to make.

This enables significant simplification of workqueue API.
Simplification patches will follow.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-20 14:51:23 -07:00
Valentin Ilie 044c782ce3 workqueue: fix checkpatch issues
Fixed some checkpatch warnings.

tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit.

Signed-off-by: Valentin Ilie <valentin.ilie@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1345326762-21747-1-git-send-email-valentin.ilie@gmail.com>
2012-08-20 13:37:07 -07:00
Joonsoo Kim 7635d2fd7f workqueue: use system_highpri_wq for unbind_work
To speed cpu down processing up, use system_highpri_wq.
As scheduling priority of workers on it is higher than system_wq and
it is not contended by other normal works on this cpu, work on it
is processed faster than system_wq.

tj: CPU up/downs care quite a bit about latency these days.  This
    shouldn't hurt anything and makes sense.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16 14:21:16 -07:00
Joonsoo Kim e2b6a6d570 workqueue: use system_highpri_wq for highpri workers in rebind_workers()
In rebind_workers(), we do inserting a work to rebind to cpu for busy workers.
Currently, in this case, we use only system_wq. This makes a possible
error situation as there is mismatch between cwq->pool and worker->pool.

To prevent this, we should use system_highpri_wq for highpri worker
to match theses. This implements it.

tj: Rephrased comment a bit.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16 14:21:15 -07:00
Joonsoo Kim 1aabe902ca workqueue: introduce system_highpri_wq
Commit 3270476a6c ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduce separate worker pool
for HIGHPRI. When we handle busyworkers for gcwq, it can be normal worker
or highpri worker. But, we don't consider this difference in rebind_workers(),
we use just system_wq for highpri worker. It makes mismatch between
cwq->pool and worker->pool.

It doesn't make error in current implementation, but possible in the future.
Now, we introduce system_highpri_wq to use proper cwq for highpri workers
in rebind_workers(). Following patch fix this issue properly.

tj: Even apart from rebinding, having system_highpri_wq generally
    makes sense.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16 14:21:15 -07:00
Joonsoo Kim e42986de48 workqueue: change value of lcpu in __queue_delayed_work_on()
We assign cpu id into work struct's data field in __queue_delayed_work_on().
In current implementation, when work is come in first time,
current running cpu id is assigned.
If we do __queue_delayed_work_on() with CPU A on CPU B,
__queue_work() invoked in delayed_work_timer_fn() go into
the following sub-optimal path in case of WQ_NON_REENTRANT.

	gcwq = get_gcwq(cpu);
	if (wq->flags & WQ_NON_REENTRANT &&
		(last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) {

Change lcpu to @cpu and rechange lcpu to local cpu if lcpu is WORK_CPU_UNBOUND.
It is sufficient to prevent to go into sub-optimal path.

tj: Slightly rephrased the comment.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16 14:21:15 -07:00
Joonsoo Kim b75cac9368 workqueue: correct req_cpu in trace_workqueue_queue_work()
When we do tracing workqueue_queue_work(), it records requested cpu.
But, if !(@wq->flag & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND,
requested cpu is changed as local cpu.
In case of @wq->flag & WQ_UNBOUND, above change is not occured,
therefore it is reasonable to correct it.

Use temporary local variable for storing requested cpu.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16 14:21:15 -07:00
Joonsoo Kim 330dad5b9c workqueue: use enum value to set array size of pools in gcwq
Commit 3270476a6c ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduce separate worker_pool
for HIGHPRI. Although there is NR_WORKER_POOLS enum value which represent
size of pools, definition of worker_pool in gcwq doesn't use it.
Using it makes code robust and prevent future mistakes.
So change code to use this enum value.

Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-16 14:21:15 -07:00
Tejun Heo 23657bb192 workqueue: add missing wmb() in clear_work_data()
Any operation which clears PENDING should be preceded by a wmb to
guarantee that the next PENDING owner sees all the changes made before
PENDING release.

There are only two places where PENDING is cleared -
set_work_cpu_and_clear_pending() and clear_work_data().  The caller of
the former already does smp_wmb() but the latter doesn't have any.

Move the wmb above set_work_cpu_and_clear_pending() into it and add
one to clear_work_data().

There hasn't been any report related to this issue, and, given how
clear_work_data() is used, it is extremely unlikely to have caused any
actual problems on any architecture.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
2012-08-13 17:08:19 -07:00
Tejun Heo 1265057fa0 workqueue: fix CPU binding of flush_delayed_work[_sync]()
delayed_work encodes the workqueue to use and the last CPU in
delayed_work->work.data while it's on timer.  The target CPU is
implicitly recorded as the CPU the timer is queued on and
delayed_work_timer_fn() queues delayed_work->work to the CPU it is
running on.

Unfortunately, this leaves flush_delayed_work[_sync]() no way to find
out which CPU the delayed_work was queued for when they try to
re-queue after killing the timer.  Currently, it chooses the local CPU
flush is running on.  This can unexpectedly move a delayed_work queued
on a specific CPU to another CPU and lead to subtle errors.

There isn't much point in trying to save several bytes in struct
delayed_work, which is already close to a hundred bytes on 64bit with
all debug options turned off.  This patch adds delayed_work->cpu to
remember the CPU it's queued for.

Note that if the timer is migrated during CPU down, the work item
could be queued to the downed global_cwq after this change.  As a
detached global_cwq behaves like an unbound one, this doesn't change
much for the delayed_work.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
2012-08-13 16:27:55 -07:00
Tejun Heo 8376fe22c7 workqueue: implement mod_delayed_work[_on]()
Workqueue was lacking a mechanism to modify the timeout of an already
pending delayed_work.  delayed_work users have been working around
this using several methods - using an explicit timer + work item,
messing directly with delayed_work->timer, and canceling before
re-queueing, all of which are error-prone and/or ugly.

This patch implements mod_delayed_work[_on]() which behaves similarly
to mod_timer() - if the delayed_work is idle, it's queued with the
given delay; otherwise, its timeout is modified to the new value.
Zero @delay guarantees immediate execution.

v2: Updated to reflect try_to_grab_pending() changes.  Now safe to be
    called from bh context.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
2012-08-03 10:30:47 -07:00
Tejun Heo bbb68dfaba workqueue: mark a work item being canceled as such
There can be two reasons try_to_grab_pending() can fail with -EAGAIN.
One is when someone else is queueing or deqeueing the work item.  With
the previous patches, it is guaranteed that PENDING and queued state
will soon agree making it safe to busy-retry in this case.

The other is if multiple __cancel_work_timer() invocations are racing
one another.  __cancel_work_timer() grabs PENDING and then waits for
running instances of the target work item on all CPUs while holding
PENDING and !queued.  try_to_grab_pending() invoked from another task
will keep returning -EAGAIN while the current owner is waiting.

Not distinguishing the two cases is okay because __cancel_work_timer()
is the only user of try_to_grab_pending() and it invokes
wait_on_work() whenever grabbing fails.  For the first case, busy
looping should be fine but wait_on_work() doesn't cause any critical
problem.  For the latter case, the new contender usually waits for the
same condition as the current owner, so no unnecessarily extended
busy-looping happens.  Combined, these make __cancel_work_timer()
technically correct even without irq protection while grabbing PENDING
or distinguishing the two different cases.

While the current code is technically correct, not distinguishing the
two cases makes it difficult to use try_to_grab_pending() for other
purposes than canceling because it's impossible to tell whether it's
safe to busy-retry grabbing.

This patch adds a mechanism to mark a work item being canceled.
try_to_grab_pending() now disables irq on success and returns -EAGAIN
to indicate that grabbing failed but PENDING and queued states are
gonna agree soon and it's safe to busy-loop.  It returns -ENOENT if
the work item is being canceled and it may stay PENDING && !queued for
arbitrary amount of time.

__cancel_work_timer() is modified to mark the work canceling with
WORK_OFFQ_CANCELING after grabbing PENDING, thus making
try_to_grab_pending() fail with -ENOENT instead of -EAGAIN.  Also, it
invokes wait_on_work() iff grabbing failed with -ENOENT.  This isn't
necessary for correctness but makes it consistent with other future
users of try_to_grab_pending().

v2: try_to_grab_pending() was testing preempt_count() to ensure that
    the caller has disabled preemption.  This triggers spuriously if
    !CONFIG_PREEMPT_COUNT.  Use preemptible() instead.  Reported by
    Fengguang Wu.

v3: Updated so that try_to_grab_pending() disables irq on success
    rather than requiring preemption disabled by the caller.  This
    makes busy-looping easier and will allow try_to_grap_pending() to
    be used from bh/irq contexts.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
2012-08-03 10:30:46 -07:00
Tejun Heo 36e227d242 workqueue: reorganize try_to_grab_pending() and __cancel_timer_work()
* Use bool @is_dwork instead of @timer and let try_to_grab_pending()
  use to_delayed_work() to determine the delayed_work address.

* Move timer handling from __cancel_work_timer() to
  try_to_grab_pending().

* Make try_to_grab_pending() use -EAGAIN instead of -1 for
  busy-looping and drop the ret local variable.

* Add proper function comment to try_to_grab_pending().

This makes the code a bit easier to understand and will ease further
changes.  This patch doesn't make any functional change.

v2: Use @is_dwork instead of @timer.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03 10:30:46 -07:00
Tejun Heo 7beb2edf44 workqueue: factor out __queue_delayed_work() from queue_delayed_work_on()
This is to prepare for mod_delayed_work[_on]() and doesn't cause any
functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03 10:30:46 -07:00
Tejun Heo b549007727 workqueue: introduce WORK_OFFQ_FLAG_*
Low WORK_STRUCT_FLAG_BITS bits of work_struct->data contain
WORK_STRUCT_FLAG_* and flush color.  If the work item is queued, the
rest point to the cpu_workqueue with WORK_STRUCT_CWQ set; otherwise,
WORK_STRUCT_CWQ is clear and the bits contain the last CPU number -
either a real CPU number or one of WORK_CPU_*.

Scheduled addition of mod_delayed_work[_on]() requires an additional
flag, which is used only while a work item is off queue.  There are
more than enough bits to represent off-queue CPU number on both 32 and
64bits.  This patch introduces WORK_OFFQ_FLAG_* which occupy the lower
part of the @work->data high bits while off queue.  This patch doesn't
define any actual OFFQ flag yet.

Off-queue CPU number is now shifted by WORK_OFFQ_CPU_SHIFT, which adds
the number of bits used by OFFQ flags to WORK_STRUCT_FLAG_SHIFT, to
make room for OFFQ flags.

To avoid shift width warning with large WORK_OFFQ_FLAG_BITS, ulong
cast is added to WORK_STRUCT_NO_CPU and, just in case, BUILD_BUG_ON()
to check that there are enough bits to accomodate off-queue CPU number
is added.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03 10:30:46 -07:00
Tejun Heo bf4ede014e workqueue: move try_to_grab_pending() upwards
try_to_grab_pending() will be used by to-be-implemented
mod_delayed_work[_on]().  Move try_to_grab_pending() and related
functions above queueing functions.

This patch only moves functions around.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03 10:30:46 -07:00
Tejun Heo 715f130080 workqueue: fix zero @delay handling of queue_delayed_work_on()
If @delay is zero and the dealyed_work is idle, queue_delayed_work()
queues it for immediate execution; however, queue_delayed_work_on()
lacks this logic and always goes through timer regardless of @delay.

This patch moves 0 @delay handling logic from queue_delayed_work() to
queue_delayed_work_on() so that both functions behave the same.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03 10:30:46 -07:00
Tejun Heo 57469821fd workqueue: unify local CPU queueing handling
Queueing functions have been using different methods to determine the
local CPU.

* queue_work() superflously uses get/put_cpu() to acquire and hold the
  local CPU across queue_work_on().

* delayed_work_timer_fn() uses smp_processor_id().

* queue_delayed_work() calls queue_delayed_work_on() with -1 @cpu
  which is interpreted as the local CPU.

* flush_delayed_work[_sync]() were using raw_smp_processor_id().

* __queue_work() interprets %WORK_CPU_UNBOUND as local CPU if the
  target workqueue is bound one but nobody uses this.

This patch converts all functions to uniformly use %WORK_CPU_UNBOUND
to indicate local CPU and use the local binding feature of
__queue_work().  unlikely() is dropped from %WORK_CPU_UNBOUND handling
in __queue_work().

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03 10:30:45 -07:00
Tejun Heo d8e794dfd5 workqueue: set delayed_work->timer function on initialization
delayed_work->timer.function is currently initialized during
queue_delayed_work_on().  Export delayed_work_timer_fn() and set
delayed_work timer function during delayed_work initialization
together with other fields.

This ensures the timer function is always valid on an initialized
delayed_work.  This is to help mod_delayed_work() implementation.

To detect delayed_work users which diddle with the internal timer,
trigger WARN if timer function doesn't match on queue.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03 10:30:45 -07:00
Tejun Heo 8930caba3d workqueue: disable irq while manipulating PENDING
Queueing operations use WORK_STRUCT_PENDING_BIT to synchronize access
to the target work item.  They first try to claim the bit and proceed
with queueing only after that succeeds and there's a window between
PENDING being set and the actual queueing where the task can be
interrupted or preempted.

There's also a similar window in process_one_work() when clearing
PENDING.  A work item is dequeued, gcwq->lock is released and then
PENDING is cleared and the worker might get interrupted or preempted
between releasing gcwq->lock and clearing PENDING.

cancel[_delayed]_work_sync() tries to claim or steal PENDING.  The
function assumes that a work item with PENDING is either queued or in
the process of being [de]queued.  In the latter case, it busy-loops
until either the work item loses PENDING or is queued.  If canceling
coincides with the above described interrupts or preemptions, the
canceling task will busy-loop while the queueing or executing task is
preempted.

This patch keeps irq disabled across claiming PENDING and actual
queueing and moves PENDING clearing in process_one_work() inside
gcwq->lock so that busy looping from PENDING && !queued doesn't wait
for interrupted/preempted tasks.  Note that, in process_one_work(),
setting last CPU and clearing PENDING got merged into single
operation.

This removes possible long busy-loops and will allow using
try_to_grab_pending() from bh and irq contexts.

v2: __queue_work() was testing preempt_count() to ensure that the
    caller has disabled preemption.  This triggers spuriously if
    !CONFIG_PREEMPT_COUNT.  Use preemptible() instead.  Reported by
    Fengguang Wu.

v3: Disable irq instead of preemption.  IRQ will be disabled while
    grabbing gcwq->lock later anyway and this allows using
    try_to_grab_pending() from bh and irq contexts.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
2012-08-03 10:30:45 -07:00
Tejun Heo 959d1af8cf workqueue: add missing smp_wmb() in process_one_work()
WORK_STRUCT_PENDING is used to claim ownership of a work item and
process_one_work() releases it before starting execution.  When
someone else grabs PENDING, all pre-release updates to the work item
should be visible and all updates made by the new owner should happen
afterwards.

Grabbing PENDING uses test_and_set_bit() and thus has a full barrier;
however, clearing doesn't have a matching wmb.  Given the preceding
spin_unlock and use of clear_bit, I don't believe this can be a
problem on an actual machine and there hasn't been any related report
but it still is theretically possible for clear_pending to permeate
upwards and happen before work->entry update.

Add an explicit smp_wmb() before work_clear_pending().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
2012-08-03 10:30:45 -07:00
Tejun Heo d4283e9378 workqueue: make queueing functions return bool
All queueing functions return 1 on success, 0 if the work item was
already pending.  Update them to return bool instead.  This signifies
better that they don't return 0 / -errno.

This is cleanup and doesn't cause any functional difference.

While at it, fix comment opening for schedule_work_on().

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03 10:30:44 -07:00
Tejun Heo 0a13c00e9d workqueue: reorder queueing functions so that _on() variants are on top
Currently, queue/schedule[_delayed]_work_on() are located below the
counterpart without the _on postifx even though the latter is usually
implemented using the former.  Swap them.

This is cleanup and doesn't cause any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-08-03 10:30:44 -07:00
Linus Torvalds a0e881b7c1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second vfs pile from Al Viro:
 "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
  deadlock reproduced by xfstests 068), symlink and hardlink restriction
  patches, plus assorted cleanups and fixes.

  Note that another fsfreeze deadlock (emergency thaw one) is *not*
  dealt with - the series by Fernando conflicts a lot with Jan's, breaks
  userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
  for massive vfsmount leak; this is going to be handled next cycle.
  There probably will be another pull request, but that stuff won't be
  in it."

Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
  delousing target_core_file a bit
  Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
  fs: Remove old freezing mechanism
  ext2: Implement freezing
  btrfs: Convert to new freezing mechanism
  nilfs2: Convert to new freezing mechanism
  ntfs: Convert to new freezing mechanism
  fuse: Convert to new freezing mechanism
  gfs2: Convert to new freezing mechanism
  ocfs2: Convert to new freezing mechanism
  xfs: Convert to new freezing code
  ext4: Convert to new freezing mechanism
  fs: Protect write paths by sb_start_write - sb_end_write
  fs: Skip atime update on frozen filesystem
  fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
  fs: Improve filesystem freezing handling
  switch the protection of percpu_counter list to spinlock
  nfsd: Push mnt_want_write() outside of i_mutex
  btrfs: Push mnt_want_write() outside of i_mutex
  fat: Push mnt_want_write() outside of i_mutex
  ...
2012-08-01 10:26:23 -07:00
Linus Torvalds 2d53492620 irqdomain changes for Linux v3.6
Round of refactoring and enhancements to irq_domain infrastructure. This
 series starts the process of simplifying irqdomain. The ultimate goal is
 to merge LEGACY, LINEAR and TREE mappings into a single system, but had
 to back off from that after some last minute bugs. Instead it mainly
 reorganizes the code and ensures that the reverse map gets populated
 when the irq is mapped instead of the first time it is looked up.
 
 Merging of the irq_domain types is deferred to v3.7
 
 In other news, this series adds helpers for creating static mappings on
 a linear or tree mapping.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQGKKkAAoJEEFnBt12D9kB59gQAJnTjrihej1tr0OEkffIthGK
 RyVI/DMo0jMgLs4K/rIo3Y+PdTSsNYd8x4R7ln8O7rNRQn8W6jE6NQgoMh51EvNc
 FAltmTsBldq6hUNuz2FEnbmojBP4QklTzL8bAiXtX5EufWQsgMsP4guOuHXLCjEV
 CkWYVk/slXEWJ8yYJc6GKVRvL+CNeiXVCTcOsYA0CI3ofN7O0rd+YAL314CRllIc
 e5uARbWM+s9FJ/eXwCZP4+3jCmdI/CHJb284WldMc/mBD8Rbiqpb4kH6AZI+TH2O
 CyiNEPWs6FG5eJPTID7HrOarXGzwYq/pvv8iG7Mh8NiKSae1C1HdkHelCjbLQ+pU
 POya0fWF1Gvzlmw0gHik86dqaKjwb29btjj7SFg8KnQExWn2ifhsY70mM9wCTo3s
 cwcQlssDIsARE83nttTFCoV/iAWh9AvTxafrXu/+9OKTjpsYlC8kgzdVjq5aAxON
 JaAUK1OduTWRsd1TabKlh6naRXr9nRcLKikwKri2oYVKkj97wahBuib4ffzAcNqz
 VklRBxTH6M+dz/t5NpcVyLXJpqzTN++QNdTAmeQG6LOnHJL4tpFTsx5sMa7ghmzX
 LNpmp/AkVfP0MT7Drf0FUUx6iFA7sjANYzcepUVDrPGKHx0E3LyqbG5JKcC5LgM6
 +UIoKAktF3vY7pdZJL9z
 =ZUF/
 -----END PGP SIGNATURE-----

Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6

Pull irqdomain changes from Grant Likely:
 "Round of refactoring and enhancements to irq_domain infrastructure.
  This series starts the process of simplifying irqdomain.  The ultimate
  goal is to merge LEGACY, LINEAR and TREE mappings into a single
  system, but had to back off from that after some last minute bugs.
  Instead it mainly reorganizes the code and ensures that the reverse
  map gets populated when the irq is mapped instead of the first time it
  is looked up.

  Merging of the irq_domain types is deferred to v3.7

  In other news, this series adds helpers for creating static mappings
  on a linear or tree mapping."

* tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6:
  irqdomain: Improve diagnostics when a domain mapping fails
  irqdomain: eliminate slow-path revmap lookups
  irqdomain: Fix irq_create_direct_mapping() to test irq_domain type.
  irqdomain: Eliminate dedicated radix lookup functions
  irqdomain: Support for static IRQ mapping and association.
  irqdomain: Always update revmap when setting up a virq
  irqdomain: Split disassociating code into separate function
  irq_domain: correct a minor wrong comment for linear revmap
  irq_domain: Standardise legacy/linear domain selection
  irqdomain: Make ops->map hook optional
  irqdomain: Remove unnecessary test for IRQ_DOMAIN_MAP_LEGACY
  irqdomain: Simple NUMA awareness.
  devicetree: add helper inline for retrieving a node's full name
2012-07-31 20:44:03 -07:00
Linus Torvalds ac694dbdbc Merge branch 'akpm' (Andrew's patch-bomb)
Merge Andrew's second set of patches:
 - MM
 - a few random fixes
 - a couple of RTC leftovers

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
  rtc/rtc-88pm80x: remove unneed devm_kfree
  rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
  mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
  tmpfs: distribute interleave better across nodes
  mm: remove redundant initialization
  mm: warn if pg_data_t isn't initialized with zero
  mips: zero out pg_data_t when it's allocated
  memcg: gix memory accounting scalability in shrink_page_list
  mm/sparse: remove index_init_lock
  mm/sparse: more checks on mem_section number
  mm/sparse: optimize sparse_index_alloc
  memcg: add mem_cgroup_from_css() helper
  memcg: further prevent OOM with too many dirty pages
  memcg: prevent OOM with too many dirty pages
  mm: mmu_notifier: fix freed page still mapped in secondary MMU
  mm: memcg: only check anon swapin page charges for swap cache
  mm: memcg: only check swap cache pages for repeated charging
  mm: memcg: split swapin charge function into private and public part
  mm: memcg: remove needless !mm fixup to init_mm when charging
  mm: memcg: remove unneeded shmem charge type
  ...
2012-07-31 19:25:39 -07:00
Linus Torvalds 3e9a97082f This patch series contains a major revamp of how we collect entropy
from interrupts for /dev/random and /dev/urandom.  The goal is to
 addresses weaknesses discussed in the paper "Mining your Ps and Qs:
 Detection of Widespread Weak Keys in Network Devices", by Nadia
 Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman, which will
 be published in the Proceedings of the 21st Usenix Security Symposium,
 August 2012.  (See https://factorable.net for more information and an
 extended version of the paper.)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQF/0DAAoJENNvdpvBGATwIowQAOep9QKtLrBvb2lwIRVmeiy8
 lRf7V/tYZnz4FePbR0W92JQfKYkCV8yyOO0bmeRzWL3v4m+lRwDTSyA1DDyQMoH+
 LOMzvDKSLJMSXTXdSOIr1WYACphViCR/9CrbMBCKSkYfZLJ1MdaEDxT3rcpTGD0T
 6iknUweiSkHHhkerU5yQL7FKzD5kYUe0hsF47w7QVlHRHJsW2fsZqkFoh+RpnhNw
 03u+djxNGBo9qV81vZ9D1b0vA9uRlEjoWOOEG2XE4M2iq6TUySueA72dQnCwunfi
 3kG/u1Swv2dgq6aRrP3H7zdwhYSourGxziu3jNhEKwKEohrxYY7xjNX3RVeTqP67
 AzlKsOTWpRLIDrzjSLlb8VxRQiZewu8Unex3e1G+eo20sbcIObHGrxNp7K00zZvd
 QZiMHhOwItwFTe4lBO+XbqH2JKbL9/uJmwh5EipMpQTraKO9E6N3CJiUHjzBLo2K
 iGDZxRMKf4gVJRwDxbbP6D70JPVu8ZJ09XVIpsXQ3Z1xNqaMF0QdCmP3ty56q1o0
 NvkSXxPKrijZs8Sk0rVDqnJ3ll8PuDnXMv5eDtL42VT818I5WxESn9djjwEanGv0
 TYxbFub/NRxmPEE5B2Js5FBpqsLf5f282OSMeS/5WLBbnHJR1OoPoAhGVpHvxntC
 bi5FC1OolqhvzVIdsqgt
 =u7KM
 -----END PGP SIGNATURE-----

Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random

Pull random subsystem patches from Ted Ts'o:
 "This patch series contains a major revamp of how we collect entropy
  from interrupts for /dev/random and /dev/urandom.

  The goal is to addresses weaknesses discussed in the paper "Mining
  your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
  by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J.  Alex Halderman,
  which will be published in the Proceedings of the 21st Usenix Security
  Symposium, August 2012.  (See https://factorable.net for more
  information and an extended version of the paper.)"

Fix up trivial conflicts due to nearby changes in
drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
  random: mix in architectural randomness in extract_buf()
  dmi: Feed DMI table to /dev/random driver
  random: Add comment to random_initialize()
  random: final removal of IRQF_SAMPLE_RANDOM
  um: remove IRQF_SAMPLE_RANDOM which is now a no-op
  sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  [ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
  board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
  isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
  uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
  drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
  xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
  n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
  i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
  input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
  mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
  ...
2012-07-31 19:07:42 -07:00
Mel Gorman 907aed48f6 mm: allow PF_MEMALLOC from softirq context
This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.

Currently softirq context cannot use PF_MEMALLOC due to it not being
associated with a task, and therefore not having task flags to fiddle with
- thus the gfp to alloc flag mapping ignores the task flags when in
interrupts (hard or soft) context.

Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery.  This patch borrows the task flags from whatever process happens
to be preempted by the softirq.  It then modifies the gfp to alloc flags
mapping to not exclude task flags in softirq context, and modify the
softirq code to save, clear and restore the PF_MEMALLOC flag.

The save and clear, ensures the preempted task's PF_MEMALLOC flag doesn't
leak into the softirq.  The restore ensures a softirq's PF_MEMALLOC flag
cannot leak back into the preempted process.  This should be safe due to
the following reasons

Softirqs can run on multiple CPUs sure but the same task should not be
	executing the same softirq code. Neither should the softirq
	handler be preempted by any other softirq handler so the flags
	should not leak to an unrelated softirq.

Softirqs re-enable hardware interrupts in __do_softirq() so can be
	preempted by hardware interrupts so PF_MEMALLOC is inherited
	by the hard IRQ. However, this is similar to a process in
	reclaim being preempted by a hardirq. While PF_MEMALLOC is
	set, gfp_to_alloc_flags() distinguishes between hard and
	soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
	flag.

If the softirq is deferred to ksoftirq then its flags may be used
        instead of a normal tasks but as the softirq cannot be preempted,
        the PF_MEMALLOC flag does not leak to other code by accident.

[davem@davemloft.net: Document why PF_MEMALLOC is safe]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:45 -07:00
Jiang Liu 9adb62a5df mm/hotplug: correctly setup fallback zonelists when creating new pgdat
When hotadd_new_pgdat() is called to create new pgdat for a new node, a
fallback zonelist should be created for the new node.  There's code to try
to achieve that in hotadd_new_pgdat() as below:

	/*
	 * The node we allocated has no zone fallback lists. For avoiding
	 * to access not-initialized zonelist, build here.
	 */
	mutex_lock(&zonelists_mutex);
	build_all_zonelists(pgdat, NULL);
	mutex_unlock(&zonelists_mutex);

But it doesn't work as expected.  When hotadd_new_pgdat() is called, the
new node is still in offline state because node_set_online(nid) hasn't
been called yet.  And build_all_zonelists() only builds zonelists for
online nodes as:

        for_each_online_node(nid) {
                pg_data_t *pgdat = NODE_DATA(nid);

                build_zonelists(pgdat);
                build_zonelist_cache(pgdat);
        }

Though we hope to create zonelist for the new pgdat, but it doesn't.  So
add a new parameter "pgdat" the build_all_zonelists() to build pgdat for
the new pgdat too.

Signed-off-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:44 -07:00
Andrew Morton c255a45805 memcg: rename config variables
Sanity:

CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

[mhocko@suse.cz: fix missed bits]
Cc: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:43 -07:00
Wanpeng Li 3965c9ae47 mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads
Since per-BDI flusher threads were introduced in 2.6, the pdflush
mechanism is not used any more.  But the old interface exported through
/proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

For back-compatibility, printk warning information and return 2 to notify
the users that the interface is removed.

Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:40 -07:00
Huang Shijie 44de9d0cad mm: account the total_vm in the vm_stat_account()
vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now.
But we can also account for total_vm in the vm_stat_account() which makes
the code tidy.

Even for mprotect_fixup(), we can get the right result in the end.

Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:39 -07:00
Linus Torvalds bca1a5c0ea Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes
  updates/cleanups/fixes from Oleg and diverse tooling updates (mostly
  fixes) now that Arnaldo is back from vacation."

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  uprobes: __replace_page() needs munlock_vma_page()
  uprobes: Rename vma_address() and make it return "unsigned long"
  uprobes: Fix register_for_each_vma()->vma_address() check
  uprobes: Introduce vaddr_to_offset(vma, vaddr)
  uprobes: Teach build_probe_list() to consider the range
  uprobes: Remove insert_vm_struct()->uprobe_mmap()
  uprobes: Remove copy_vma()->uprobe_mmap()
  uprobes: Fix overflow in vma_address()/find_active_uprobe()
  uprobes: Suppress uprobe_munmap() from mmput()
  uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
  uprobes: Clean up and document write_opcode()->lock_page(old_page)
  uprobes: Kill write_opcode()->lock_page(new_page)
  uprobes: __replace_page() should not use page_address_in_vma()
  uprobes: Don't recheck vma/f_mapping in write_opcode()
  perf/x86: Fix missing struct before structure name
  perf/x86: Fix format definition of SNB-EP uncore QPI box
  perf/x86: Make bitfield unsigned
  perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
  perf/x86: Add Intel Nehalem-EX uncore support
  perf/x86: Fix typo in format definition of uncore PCU filter
  ...
2012-07-31 15:34:13 -07:00
Octavian Purdila 65fed8f6f2 resource: make sure requested range is included in the root range
When the requested range is outside of the root range the logic in
__reserve_region_with_split will cause an infinite recursion which will
overflow the stack as seen in the warning bellow.

This particular stack overflow was caused by requesting the
(100000000-107ffffff) range while the root range was (0-ffffffff).  In
this case __request_resource would return the whole root range as
conflict range (i.e.  0-ffffffff).  Then, the logic in
__reserve_region_with_split would continue the recursion requesting the
new range as (conflict->end+1, end) which incidentally in this case
equals the originally requested range.

This patch aborts looking for an usable range when the request does not
intersect with the root range.  When the request partially overlaps with
the root range, it ajust the request to fall in the root range and then
continues with the new request.

When the request is modified or aborted errors and a stack trace are
logged to allow catching the errors in the upper layers.

[    5.968374] WARNING: at kernel/sched.c:4129 sub_preempt_count+0x63/0x89()
[    5.975150] Modules linked in:
[    5.978184] Pid: 1, comm: swapper Not tainted 3.0.22-mid27-00004-gb72c817 #46
[    5.985324] Call Trace:
[    5.987759]  [<c1039dfc>] ? console_unlock+0x17b/0x18d
[    5.992891]  [<c1039620>] warn_slowpath_common+0x48/0x5d
[    5.998194]  [<c1031758>] ? sub_preempt_count+0x63/0x89
[    6.003412]  [<c1039644>] warn_slowpath_null+0xf/0x13
[    6.008453]  [<c1031758>] sub_preempt_count+0x63/0x89
[    6.013499]  [<c14d60c4>] _raw_spin_unlock+0x27/0x3f
[    6.018453]  [<c10c6349>] add_partial+0x36/0x3b
[    6.022973]  [<c10c7c0a>] deactivate_slab+0x96/0xb4
[    6.027842]  [<c14cf9d9>] __slab_alloc.isra.54.constprop.63+0x204/0x241
[    6.034456]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.039842]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.045232]  [<c10c7dc9>] kmem_cache_alloc_trace+0x51/0xb0
[    6.050710]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.056100]  [<c103f78f>] kzalloc.constprop.5+0x29/0x38
[    6.061320]  [<c17b45e9>] __reserve_region_with_split+0x1c/0xd1
[    6.067230]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
...
[    7.179057]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
[    7.184970]  [<c17b4779>] reserve_region_with_split+0x30/0x42
[    7.190709]  [<c17a8ebf>] e820_reserve_resources_late+0xd1/0xe9
[    7.196623]  [<c17c9526>] pcibios_resource_survey+0x23/0x2a
[    7.202184]  [<c17cad8a>] pcibios_init+0x23/0x35
[    7.206789]  [<c17ca574>] pci_subsys_init+0x3f/0x44
[    7.211659]  [<c1002088>] do_one_initcall+0x72/0x122
[    7.216615]  [<c17ca535>] ? pci_legacy_init+0x3d/0x3d
[    7.221659]  [<c17a27ff>] kernel_init+0xa6/0x118
[    7.226265]  [<c17a2759>] ? start_kernel+0x334/0x334
[    7.231223]  [<c14d7482>] kernel_thread_helper+0x6/0x10

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:21 -07:00
Alan Cox 25353b3377 taskstats: check nla_reserve() return
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=44621

Reported-by: <rucsoftsec@gmail.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:21 -07:00
Steven Rostedt fd4b616b0f sysctl: suppress kmemleak messages
register_sysctl_table() is a strange function, as it makes internal
allocations (a header) to register a sysctl_table.  This header is a
handle to the table that is created, and can be used to unregister the
table.  But if the table is permanent and never unregistered, the header
acts the same as a static variable.

Unfortunately, this allocation of memory that is never expected to be
freed fools kmemleak in thinking that we have leaked memory.  For those
sysctl tables that are never unregistered, and have no pointer referencing
them, kmemleak will think that these are memory leaks:

unreferenced object 0xffff880079fb9d40 (size 192):
  comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8146b590>] kmemleak_alloc+0x73/0x98
    [<ffffffff8110a935>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
    [<ffffffff8110b852>] __kmalloc+0x107/0x153
    [<ffffffff8116fa72>] kzalloc.constprop.8+0xe/0x10
    [<ffffffff811703c9>] __register_sysctl_paths+0xe1/0x160
    [<ffffffff81170463>] register_sysctl_paths+0x1b/0x1d
    [<ffffffff8117047d>] register_sysctl_table+0x18/0x1a
    [<ffffffff81afb0a1>] sysctl_init+0x10/0x14
    [<ffffffff81b05a6f>] proc_sys_init+0x2f/0x31
    [<ffffffff81b0584c>] proc_root_init+0xa5/0xa7
    [<ffffffff81ae5b7e>] start_kernel+0x3d0/0x40a
    [<ffffffff81ae52a7>] x86_64_start_reservations+0xae/0xb2
    [<ffffffff81ae53ad>] x86_64_start_kernel+0x102/0x111
    [<ffffffffffffffff>] 0xffffffffffffffff

The sysctl_base_table used by sysctl itself is one such instance that
registers the table to never be unregistered.

Use kmemleak_not_leak() to suppress the kmemleak false positive.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:21 -07:00
Vivek Goyal 63dca8d5b5 kdump: append newline to the last lien of vmcoreinfo note
The last line of vmcoreinfo note does not end with \n.  Parsing all the
lines in note becomes easier if all lines end with \n instead of trying to
special case the last line.

I know at least one tool, vmcore-dmesg in kexec-tools tree which made the
assumption that all lines end with \n.  I think it is a good idea to fix
it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Akinobu Mita f19b9f74b7 fork: fix error handling in dup_task()
The function dup_task() may fail at the following function calls in the
following order.

0) alloc_task_struct_node()
1) alloc_thread_info_node()
2) arch_dup_task_struct()

Error by 0) is not a matter, it can just return.  But error by 1) requires
releasing task_struct allocated by 0) before it returns.  Likewise, error
by 2) requires releasing task_struct and thread_info allocated by 0) and
1).

The existing error handling calls free_task_struct() and
free_thread_info() which do not only release task_struct and thread_info,
but also call architecture specific arch_release_task_struct() and
arch_release_thread_info().

The problem is that task_struct and thread_info are not fully initialized
yet at this point, but arch_release_task_struct() and
arch_release_thread_info() are called with them.

For example, x86 defines its own arch_release_task_struct() that releases
a task_xstate.  If alloc_thread_info_node() fails in dup_task(),
arch_release_task_struct() is called with task_struct which is just
allocated and filled with garbage in this error handling.

This actually happened with tools/testing/fault-injection/failcmd.sh

	# env FAILCMD_TYPE=fail_page_alloc \
		./tools/testing/fault-injection/failcmd.sh --times=100 \
		--min-order=0 --ignore-gfp-wait=0 \
		-- make -C tools/testing/selftests/ run_tests

In order to fix this issue, make free_{task_struct,thread_info}() not to
call arch_release_{task_struct,thread_info}() and call
arch_release_{task_struct,thread_info}() implicitly where needed.

Default arch_release_task_struct() and arch_release_thread_info() are
defined as empty by default.  So this change only affects the
architectures which implement their own arch_release_task_struct() or
arch_release_thread_info() as listed below.

arch_release_task_struct(): x86, sh
arch_release_thread_info(): mn10300, tile

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Salman Qazi <sqazi@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Andrew Morton 87bec58a52 revert "sched: Fix fork() error path to not crash"
To make way for "fork: fix error handling in dup_task()", which fixes the
errors more completely.

Cc: Salman Qazi <sqazi@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Huang Shijie b2412b7fa7 fork: use vma_pages() to simplify the code
The current code can be replaced by vma_pages().  So use it to simplify
the code.

[akpm@linux-foundation.org: initialise `len' at its definition site]
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Tetsuo Handa 0f20784d4b kmod: avoid deadlock from recursive kmod call
The system deadlocks (at least since 2.6.10) when
call_usermodehelper(UMH_WAIT_EXEC) request triggers
call_usermodehelper(UMH_WAIT_PROC) request.

This is because "khelper thread is waiting for the worker thread at
wait_for_completion() in do_fork() since the worker thread was created
with CLONE_VFORK flag" and "the worker thread cannot call complete()
because do_execve() is blocked at UMH_WAIT_PROC request" and "the khelper
thread cannot start processing UMH_WAIT_PROC request because the khelper
thread is waiting for the worker thread at wait_for_completion() in
do_fork()".

The easiest example to observe this deadlock is to use a corrupted
/sbin/hotplug binary (like shown below).

  # : > /tmp/dummy
  # chmod 755 /tmp/dummy
  # echo /tmp/dummy > /proc/sys/kernel/hotplug
  # modprobe whatever

call_usermodehelper("/tmp/dummy", UMH_WAIT_EXEC) is called from
kobject_uevent_env() in lib/kobject_uevent.c upon loading/unloading a
module.  do_execve("/tmp/dummy") triggers a call to
request_module("binfmt-0000") from search_binary_handler() which in turn
calls call_usermodehelper(UMH_WAIT_PROC).

In order to avoid deadlock, as a for-now and easy-to-backport solution, do
not try to call wait_for_completion() in call_usermodehelper_exec() if the
worker thread was created by khelper thread with CLONE_VFORK flag.  Future
and fundamental solution might be replacing singleton khelper thread with
some workqueue so that recursive calls up to max_active dependency loop
can be handled without deadlock.

[akpm@linux-foundation.org: add comment to kmod_thread_locker]
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Andrew Morton 79c743dd1e kernel/kmod.c: document call_usermodehelper_fns() a bit
This function's interface is, uh, subtle.  Attempt to apologise for it.

Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:20 -07:00
Joe Perches 088a52aac8 printk: only look for prefix levels in kernel messages
vprintk_emit() prefix parsing should only be done for internal kernel
messages.  This allows existing behavior to be kept in all cases.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 17:25:14 -07:00