dect
/
linux-2.6
Archived
13
0
Fork 0
Commit Graph

13201 Commits

Author SHA1 Message Date
Alexander Gordeev 540b60e24f genirq: Fix incorrect check for forced IRQ thread handler
We do not want a bitwise AND between boolean operands

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20120309135912.GA2114@dhcp-26-207.brq.redhat.com
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-03-09 17:19:08 +01:00
Linus Torvalds 4293f20c19 Revert "CPU hotplug, cpusets, suspend: Don't touch cpusets during suspend/resume"
This reverts commit 8f2f748b06.

It causes some odd regression that we have not figured out, and it's too
late in the -rc series to try to figure it out now.

As reported by Konstantin Khlebnikov, it causes consistent hangs on his
laptop (Thinkpad x220: 2x cores + HT).  They can be avoided by adding
calls to "rebuild_sched_domains();" in cpuset_cpu_[in]active() for the
CPU_{ONLINE/DOWN_FAILED/DOWN_PREPARE}_FROZEN cases, but it's not at all
clear why, and it makes no sense.

Konstantin's config doesn't even have CONFIG_CPUSETS enabled, just to
make things even more interesting.  So it's not the cpusets, it's just
the scheduling domains.

So until this is understood, revert.

Bisected-reported-and-tested-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-07 08:21:19 -08:00
Thomas Gleixner 52abb700e1 genirq: Clear action->thread_mask if IRQ_ONESHOT is not set
Xommit ac5637611(genirq: Unmask oneshot irqs when thread was not woken)
fails to unmask when a !IRQ_ONESHOT threaded handler is handled by
handle_level_irq.

This happens because thread_mask is or'ed unconditionally in
irq_wake_thread(), but for !IRQ_ONESHOT interrupts never cleared.  So
the check for !desc->thread_active fails and keeps the interrupt
disabled.

Keep the thread_mask zero for !IRQ_ONESHOT interrupts.

Document the thread_mask magic while at it.

Reported-and-tested-by: Sven Joachim <svenjoac@gmx.de>
Reported-and-tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-06 16:46:39 -08:00
Heiko Carstens b2a0017861 softirq: Reduce invoke_softirq() code duplication
The two invoke_softirq() variants are identical except for a single
line. So move the #ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED inside one of
the functions and get rid of the other one.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-03-06 13:33:27 +01:00
Russell King a09b659cd6 genirq: Fix long-term regression in genirq irq_set_irq_type() handling
In 2008, commit 0c5d1eb77a ("genirq: record trigger type") modified the
way set_irq_type() handles the 'no trigger' condition.  However, this has
an adverse effect on PCMCIA support on Intel StrongARM and probably PXA
platforms.

PCMCIA has several status signals on the socket which can trigger
interrupts; some of these status signals depend on the card's mode
(whether it is configured in memory or IO mode).  For example, cards have
a 'Ready/IRQ' signal: in memory mode, this provides an indication to
PCMCIA that the card has finished its power up initialization.  In IO
mode, it provides the device interrupt signal.  Other status signals
switch between on-board battery status and loud speaker output.

In classical PCMCIA implementations, where you have a specific socket
controller, the controller provides a method to mask interrupts from the
socket, and importantly ignore any state transitions on the pins which
correspond with interrupts once masked.  This masking prevents unwanted
events caused by the removal and application of socket power being
forwarded.

However, on platforms where there is no socket controller, the PCMCIA
status and interrupt signals are routed to standard edge-triggered GPIOs. 
These GPIOs can be configured to interrupt on rising edge, falling edge,
or never.  This is where the problems start.

Edge triggered interrupts are required to record events while disabled via
the usual methods of {free,request,disable,enable}_irq() to prevent
problems with dropped interrupts (eg, the 8390 driver uses disable_irq()
to defer the delivery of interrupts).  As a result, these interfaces can
not be used to implement the desired behaviour.

The side effect of this is that if the 'Ready/IRQ' GPIO is disabled via
disable_irq() on suspend, and enabled via enable_irq() after resume, we
will record the state transitions caused by powering events as valid
interrupts, and foward them to the card driver, which may attempt to
access a card which is not powered up.

This leads delays resume while drivers spin in their interrupt handlers,
and complaints from drivers before they realize what's happened.

Moreover, in the case of the 'Ready/IRQ' signal, this is requested and
freed by the card driver itself; the PCMCIA core has no idea whether the
interrupt is requested, and, therefore, whether a call to disable_irq()
would be valid.  (We tried this around 2.4.17 / 2.5.1 kernel era, and
ended up throwing it out because of this problem.)

Therefore, it was decided back in around 2002 to disable the edge
triggering instead, resulting in all state transitions on the GPIO being
ignored.  That's what we actually need the hardware to do.

The commit above changes this behaviour; it explicitly prevents the 'no
trigger' state being selected.

The reason that request_irq() does not accept the 'no trigger' state is
for compatibility with existing drivers which do not provide their desired
triggering configuration.  The set_irq_type() function is 'new' and not
used by non-trigger aware drivers.

Therefore, revert this change, and restore previously working platforms
back to their former state.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: linux@arm.linux.org.uk
Cc: Ingo Molnar <mingo@elte.hu>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-03-06 13:33:27 +01:00
Oleg Nesterov 6027ce497d hung_task: fix the broken rcu_lock_break() logic
check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by
"softlockup: check all tasks in hung_task" commit ce9dbe24 looks
absolutely wrong.

	- rcu_lock_break() does put_task_struct(). If the task has exited
	  it is not safe to even read its ->state, nothing protects this
	  task_struct.

	- The TASK_DEAD checks are wrong too. Contrary to the comment, we
	  can't use it to check if the task was unhashed. It can be unhashed
	  without TASK_DEAD, or it can be valid with TASK_DEAD.

	  For example, an autoreaping task can do release_task(current)
	  long before it sets TASK_DEAD in do_exit().

	  Or, a zombie task can have ->state == TASK_DEAD but release_task()
	  was not called, and in this case we must not break the loop.

Change this code to check pid_alive() instead, and do this before we drop
the reference to the task_struct.

Note: while_each_thread() under rcu_read_lock() is not really safe, it can
livelock.  This will be fixed later, but fortunately in this case the
"max_count" logic saves us anyway.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Mandeep Singh Baines <msb@google.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05 15:49:42 -08:00
Oleg Nesterov 6e27f63edb vfork: kill PF_STARTING
Previously it was (ab)used by utrace.  Then it was wrongly used by the
scheduler code.

Currently it is not used, kill it before it finds the new erroneous user.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05 15:49:42 -08:00
Oleg Nesterov 57b59c4a14 coredump_wait: don't call complete_vfork_done()
Now that CLONE_VFORK is killable, coredump_wait() no longer needs
complete_vfork_done().  zap_threads() should find and kill all tasks with
the same ->mm, this includes our parent if ->vfork_done is set.

mm_release() becomes the only caller, unexport complete_vfork_done().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05 15:49:42 -08:00
Oleg Nesterov d68b46fe16 vfork: make it killable
Make vfork() killable.

Change do_fork(CLONE_VFORK) to do wait_for_completion_killable().  If it
fails we do not return to the user-mode and never touch the memory shared
with our child.

However, in this case we should clear child->vfork_done before return, we
use task_lock() in do_fork()->wait_for_vfork_done() and
complete_vfork_done() to serialize with each other.

Note: now that we use task_lock() we don't really need completion, we
could turn task->vfork_done into "task_struct *wake_up_me" but this needs
some complications.

NOTE: this and the next patches do not affect in-kernel users of
CLONE_VFORK, kernel threads run with all signals ignored including
SIGKILL/SIGSTOP.

However this is obviously the user-visible change.  Not only a fatal
signal can kill the vforking parent, a sub-thread can do execve or
exit_group() and kill the thread sleeping in vfork().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05 15:49:42 -08:00
Oleg Nesterov c415c3b47e vfork: introduce complete_vfork_done()
No functional changes.

Move the clear-and-complete-vfork_done code into the new trivial helper,
complete_vfork_done().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05 15:49:42 -08:00
Prashanth Nageshappa f986a499ef kprobes: return proper error code from register_kprobe()
register_kprobe() aborts if the address of the new request falls in a
prohibited area (such as ftrace pouch, __kprobes annotated functions,
non-kernel text addresses, jump label text).  We however don't return the
right error on this abort, resulting in a silent failure - incorrect
adding/reporting of kprobes ('perf probe do_fork+18' or 'perf probe
mcount' for instance).

In V2 we are incorporating Masami Hiramatsu's  feedback.

This patch fixes it by returning -EINVAL upon failure.

While we are here, rename the label used for exit to be more appropriate.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Prashanth K Nageshappa <prashanth@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05 15:49:42 -08:00
Matthew Garrett c22ab33290 kmsg_dump: don't run on non-error paths by default
Since commit 04c6862c05 ("kmsg_dump: add kmsg_dump() calls to the
reboot, halt, poweroff and emergency_restart paths"), kmsg_dump() gets
run on normal paths including poweroff and reboot.

This is less than ideal given pstore implementations that can only
represent single backtraces, since a reboot may overwrite a stored oops
before it's been picked up by userspace.  In addition, some pstore
backends may have low performance and provide a significant delay in
reboot as a result.

This patch adds a printk.always_kmsg_dump kernel parameter (which can also
be changed from userspace).  Without it, the code will only be run on
failure paths rather than on normal paths.  The option can be enabled in
environments where there's a desire to attempt to audit whether or not a
reboot was cleanly requested or not.

Signed-off-by: Matthew Garrett <mjg@redhat.com>
Acked-by: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05 15:49:42 -08:00
Stephane Eranian d010b3326c perf: Add callback to flush branch_stack on context switch
With branch stack sampling, it is possible to filter by priv levels.

In system-wide mode, that means it is possible to capture only user
level branches. The builtin SW LBR filter needs to disassemble code
based on LBR captured addresses. For that, it needs to know the task
the addresses are associated with. Because of context switches, the
content of the branch stack buffer may contain addresses from
different tasks.

We need a callback on context switch to either flush the branch stack
or save it. This patch adds a new callback in struct pmu which is called
during context switches. The callback is called only when necessary.
That is when a system-wide context has, at least, one event which
uses PERF_SAMPLE_BRANCH_STACK. The callback is never called for
per-thread context.

In this version, the Intel x86 code simply flushes (resets) the LBR
on context switches (fills it with zeroes). Those zeroed branches are
then filtered out by the SW filter.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-11-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 14:55:42 +01:00
Stephane Eranian 2481c5fa6d perf: Disable PERF_SAMPLE_BRANCH_* when not supported
PERF_SAMPLE_BRANCH_* is disabled for:

 - SW events (sw counters, tracepoints)
 - HW breakpoints
 - ALL but Intel x86 architecture
 - AMD64 processors

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-10-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 14:55:42 +01:00
Stephane Eranian bce38cd53e perf: Add generic taken branch sampling support
This patch adds the ability to sample taken branches to the
perf_event interface.

The ability to capture taken branches is very useful for all
sorts of analysis. For instance, basic block profiling, call
counts, statistical call graph.

This new capability requires hardware assist and as such may
not be available on all HW platforms. On Intel x86 it is
implemented on top of the Last Branch Record (LBR) facility.

To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK
bit must be set in attr->sample_type.

Sampled taken branches may be filtered by type and/or priv
levels.

The patch adds a new field, called branch_sample_type, to the
perf_event_attr structure. It contains a bitmask of filters
to apply to the sampled taken branches.

Filters may be implemented in HW. If the HW filter does not exist
or is not good enough, some arch may also implement a SW filter.

The following generic filters are currently defined:
- PERF_SAMPLE_USER
  only branches whose targets are at the user level

- PERF_SAMPLE_KERNEL
  only branches whose targets are at the kernel level

- PERF_SAMPLE_HV
  only branches whose targets are at the hypervisor level

- PERF_SAMPLE_ANY
  any type of branches (subject to priv levels filters)

- PERF_SAMPLE_ANY_CALL
  any call branches (may incl. syscall on some arch)

- PERF_SAMPLE_ANY_RET
  any return branches (may incl. syscall returns on some arch)

- PERF_SAMPLE_IND_CALL
  indirect call branches

Obviously filter may be combined. The priv level bits are optional.
If not provided, the priv level of the associated event are used. It
is possible to collect branches at a priv level different from the
associated event. Use of kernel, hv priv levels is subject to permissions
and availability (hv).

The number of taken branch records present in each sample may vary based
on HW, the type of sampled branches, the executed code. Therefore
each sample contains the number of taken branches it contains.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 14:55:39 +01:00
Ingo Molnar 737f24bda7 Merge branch 'perf/urgent' into perf/core
Conflicts:
	tools/perf/builtin-record.c
	tools/perf/builtin-top.c
	tools/perf/perf.h
	tools/perf/util/top.h

Merge reason: resolve these cherry-picking conflicts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05 09:20:08 +01:00
Rafael J. Wysocki ed819e3b96 Merge branch 'pm-qos'
* pm-qos:
  PM / QoS: unconditionally build the feature
  PM / QoS: Simplify PM QoS expansion/merge
2012-03-04 23:11:24 +01:00
Rafael J. Wysocki 643161ace2 Merge branch 'pm-sleep'
* pm-sleep:
  PM / Freezer: Remove references to TIF_FREEZE in comments
  PM / Sleep: Add more wakeup source initialization routines
  PM / Hibernate: Enable usermodehelpers in hibernate() error path
  PM / Sleep: Make __pm_stay_awake() delete wakeup source timers
  PM / Sleep: Fix race conditions related to wakeup source timer function
  PM / Sleep: Fix possible infinite loop during wakeup source destruction
  PM / Hibernate: print physical addresses consistently with other parts of kernel
  PM: Add comment describing relationships between PM callbacks to pm.h
  PM / Sleep: Drop suspend_stats_update()
  PM / Sleep: Make enter_state() in kernel/power/suspend.c static
  PM / Sleep: Unify kerneldoc comments in kernel/power/suspend.c
  PM / Sleep: Remove unnecessary label from suspend_freeze_processes()
  PM / Sleep: Do not check wakeup too often in try_to_freeze_tasks()
  PM / Sleep: Initialize wakeup source locks in wakeup_source_add()
  PM / Hibernate: Refactor and simplify freezer_test_done
  PM / Hibernate: Thaw kernel threads in hibernation_snapshot() in error/test path
  PM / Freezer / Docs: Document the beauty of freeze/thaw semantics
  PM / Suspend: Avoid code duplication in suspend statistics update
  PM / Sleep: Introduce generic callbacks for new device PM phases
  PM / Sleep: Introduce "late suspend" and "early resume" of devices
2012-03-04 23:11:14 +01:00
Marcos Paulo de Souza 37f08be11b PM / Freezer: Remove references to TIF_FREEZE in comments
This patch removes all the references in the code about the TIF_FREEZE
flag removed by commit a3201227f8

    freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE

There still are some references to TIF_FREEZE in
Documentation/power/freezing-of-tasks.txt, but it looks like that
documentation needs more thorough work to reflect how the new
freezer works, and hence merely removing the references to TIF_FREEZE
won't really help. So I have not touched that part in this patch.

Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Marcos Paulo de Souza <marcos.mage@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-03-04 23:08:54 +01:00
Srivatsa S. Bhat 05b4877f6a PM / Hibernate: Enable usermodehelpers in hibernate() error path
If create_basic_memory_bitmaps() fails, usermodehelpers are not re-enabled
before returning. Fix this. And while at it, reword the goto labels so that
they look more meaningful.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-03-04 23:08:37 +01:00
Linus Torvalds 2273d5ccb8 Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pulling latest branches from Ingo:

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  memblock: Fix size aligning of memblock_alloc_base_nid()

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf probe: Ensure offset provided is not greater than function length without DWARF info too
  perf tools: Ensure comm string is properly terminated
  perf probe: Ensure offset provided is not greater than function length
  perf evlist: Return first evsel for non-sample event on old kernel
  perf/hwbp: Fix a possible memory leak

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  CPU hotplug, cpusets, suspend: Don't touch cpusets during suspend/resume
2012-03-02 11:38:43 -08:00
Hiroshi Shimamoto 2e5b5b3a1b sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
Pass nice as a value to proc_sched_autogroup_set_nice().

No side effect is expected, and the variable err will be overwritten with
the return value.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4F45FBB7.5090607@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-02 12:23:49 +01:00
Alan Stern 62d3c5439c Block: use a freezable workqueue for disk-event polling
This patch (as1519) fixes a bug in the block layer's disk-events
polling.  The polling is done by a work routine queued on the
system_nrt_wq workqueue.  Since that workqueue isn't freezable, the
polling continues even in the middle of a system sleep transition.

Obviously, polling a suspended drive for media changes and such isn't
a good thing to do; in the case of USB mass-storage devices it can
lead to real problems requiring device resets and even re-enumeration.

The patch fixes things by creating a new system-wide, non-reentrant,
freezable workqueue and using it for disk-events polling.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
CC: <stable@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-02 10:51:00 +01:00
Steven Rostedt b892e5c897 tracing: Keep NMI watchdog from triggering when dumping trace
As ftrace_dump() (called by ftrace_dump_on_oops) disables interrupts
as it dumps its output to the console, it can keep interrupts disabled
for long periods of time. This is likely to trigger the NMI watchdog,
and it can disrupt the output of critical data.

Add a touch_nmi_watchdog() to each event that is written to the screen
to keep the NMI watchdog from affecting the output.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-03-01 22:06:48 -05:00
Peter Zijlstra 367456c756 sched: Ditch per cgroup task lists for load-balancing
Per cgroup load-balance has numerous problems, chief amongst them that
there is no real sane order in them. So stop pretending it makes sense
and enqueue all tasks on a single list.

This also allows us to more easily fix the fwd progress issue
uncovered by the lock-break stuff. Rotate the list on failure to
migreate and limit the total iterations to nr_running (which with
releasing the lock isn't strictly accurate but close enough).

Also add a filter that skips very light tasks on the first attempt
around the list, this attempts to avoid shooting whole cgroups around
without affecting over balance.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Link: http://lkml.kernel.org/n/tip-tx8yqydc7eimgq7i4rkc3a4g@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 13:08:37 +01:00
Peter Zijlstra ddcdf6e7d9 sched: Rename load-balancing fields
s/env->this_/env->dst_/g
 s/env->busiest_/env->src_/g
 s/pull_task/move_task/g

Makes everything clearer.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Link: http://lkml.kernel.org/n/tip-0yvgms8t8x962drpvl0fu0kk@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 10:51:23 +01:00
Peter Zijlstra 8e45cb545d sched: Move load-balancing arguments into helper struct
Passing large sets of similar arguments all around the load-balancer
gets tiresom when you want to modify something. Stick them all in a
helper structure and pass the structure around.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Link: http://lkml.kernel.org/n/tip-5slqz0vhsdzewrfk9eza1aon@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 10:51:22 +01:00
Thomas Gleixner 3c7d51843b sched/rt: Do not submit new work when PI-blocked
When we are PI-blocked then we want to get things done ASAP.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-vw8et3445km5b8mpihf4trae@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 10:46:54 +01:00
Thomas Gleixner 1c4dd99bed sched/rt: Prevent idle task boosting
Idle task boosting is a nono in general. There is one
exception, when PREEMPT_RT and NOHZ is active:

The idle task calls get_next_timer_interrupt() and holds
the timer wheel base->lock on the CPU and another CPU wants
to access the timer (probably to cancel it). We can safely
ignore the boosting request, as the idle CPU runs this code
with interrupts disabled and will complete the lock
protected section without being interrupted. So there is no
real need to boost.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-755rvsosz7sdzot12a3gbha6@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 10:28:05 +01:00
Thomas Gleixner 63b2001169 sched/wait: Add __wake_up_all_locked() API
For code which protects the waitqueue itself with another lock it
makes no sense to acquire the waitqueue lock for wakeup all. Provide
__wake_up_all_locked().

This is an optimization on the vanilla kernel (to be used by the
PCI code) and an important semantic distinction on -rt.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-ux6m4b8jonb9inx8xafh77ds@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 10:28:04 +01:00
Thomas Gleixner ba74c1448f sched/rt: Document scheduler related skip-resched-check sites
Create a distinction between scheduler related preempt_enable_no_resched()
calls and the nearly one hundred other places in the kernel that do not
want to reschedule, for one reason or another.

This distinction matters for -rt, where the scheduler and the non-scheduler
preempt models (and checks) are different. For upstream it's purely
documentational.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/tip-gs88fvx2mdv5psnzxnv575ke@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 10:28:04 +01:00
Thomas Gleixner bd2f55361f sched/rt: Use schedule_preempt_disabled()
Coccinelle based conversion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-24swm5zut3h9c4a6s46x8rws@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 10:28:03 +01:00
Thomas Gleixner c5491ea779 sched/rt: Add schedule_preempt_disabled()
Add helper to get rid of the ever repeating:

    preempt_enable_no_resched();
    schedule();
    preempt_disable();

patterns.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-wxx7btox7coby6ifv5vzhzgp@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 10:28:03 +01:00
Peter Zijlstra 7abc63b1bd sched/rt: Do not throttle when PI boosting
When a runqueue has rt_runtime_us = 0 then the only way it can
accumulate rt_time is via PI boosting. That causes the runqueue
to be throttled and replenishing does not change anything due to
rt_runtime_us = 0. So avoid that situation by clearing rt_time and
skip the throttling alltogether.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[ Changelog ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/tip-7x70cypsotjb4jvcor3edctk@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 10:28:02 +01:00
Peter Zijlstra 42c62a589f sched/rt: Keep period timer ticking when rt throttling is active
When a runqueue is throttled we cannot disable the period timer
because that timer is the only way to undo the throttling.

We got stale throttling entries when a rq was throttled and then the
global sysctl was disabled, which stopped the timer.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[ Added changelog ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/tip-nuj34q52p6ro7szapuz84i0v@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 10:28:01 +01:00
Ingo Molnar 7e4d960993 Merge branch 'linus' into sched/core
Merge reason: we'll queue up dependent patches.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01 10:26:43 +01:00
Paul Gortmaker 13ae246db4 includecheck: delete any duplicate instances of module.h
Different tree maintainers picked up independently generated
trivial compile fixes based on linux-next testing, resulting
in some cases where a file would have got more than one addition
of module.h once everything was all merged together.

Delete any duplicates so includecheck isn't complaining about
anything related to module.h/export.h changes.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-02-28 19:31:56 -05:00
Jason Baron 8eedce9965 static keys: Inline the static_key_enabled() function
In the jump label enabled case, calling static_key_enabled()
results in a function call. The function returns the results of
a compare, so it really doesn't need the overhead of a full
function call. Let's make it 'static inline' for both the jump
label enabled and disabled cases.

Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: a.p.zijlstra@chello.nl
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@polymtl.ca
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/201202281849.q1SIn1p2023270@int-mx10.intmail.prod.int.phx2.redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-28 20:01:08 +01:00
Ingo Molnar a706d4fc9e Merge branch 'perf/jump-labels' into perf/core
Merge reason: After much naming discussion, there seems to be consensus
              now - queue it up for v3.4.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-28 19:59:47 +01:00
Ingo Molnar bdd4431c8d Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
The major features of this series are:

 - making RCU more aggressive about entering dyntick-idle mode in order to
   improve energy efficiency

 - converting a few more call_rcu()s to kfree_rcu()s

 - applying a number of rcutree fixes and cleanups to rcutiny

 - removing CONFIG_SMP #ifdefs from treercu

 - allowing RCU CPU stall times to be set via sysfs

 - adding CPU-stall capability to rcutorture

 - adding more RCU-abuse diagnostics

 - updating documentation

 - fixing yet more issues located by the still-ongoing top-to-bottom
   inspection of RCU, this time with a special focus on the
   CPU-hotplug code path.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-28 10:16:10 +01:00
Namhyung Kim 30ce2f7eef perf/hwbp: Fix a possible memory leak
If kzalloc() for TYPE_DATA failed on a given cpu, previous chunk
of TYPE_INST will be leaked. Fix it.

Thanks to Peter Zijlstra for suggesting this better solution. It
should work as long as the initial value of the region is all
0's and that's the case of static (per-cpu) memory allocation.

Signed-off-by: Namhyung Kim <namhyung.kim@lge.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/1330391978-28070-1-git-send-email-namhyung.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-28 09:52:54 +01:00
Linus Torvalds 70ca00db10 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/events: Revert trace_sched_stat_sleeptime()
2012-02-27 07:55:39 -08:00
Linus Torvalds faf3502a3f Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Handle pending irqs in irq_startup()
  genirq: Unmask oneshot irqs when thread was not woken
2012-02-27 07:54:57 -08:00
Gerlando Falauto 8c9cf542b8 tracing: Do not select FRAME_POINTER on PPC
On PowerPC, FUNCTION_TRACER selects FRAME_POINTER, even
though the architecture does not support it.

This causes the following warning:
warning: (LOCKDEP && FAULT_INJECTION_STACKTRACE_FILTER && LATENCYTOP && FUNCTION_TRACER && KMEMCHECK) selects FRAME_POINTER which has unmet direct dependencies (DEBUG_KERNEL && (CRIS || M68K || FRV || UML || AVR32 || SUPERH || BLACKFIN || MN10300) || ARCH_WANT_FRAME_POINTERS)

So remove the warning by adding the extra condition
"if !PPC" to FUNCTION_TRACER for FRAME_POINTER selection

Link: http://lkml.kernel.org/r/1330330101-8618-1-git-send-email-gerlando.falauto@keymile.com

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Gerlando Falauto <gerlando.falauto@keymile.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-27 08:45:11 -05:00
Srivatsa S. Bhat 8f2f748b06 CPU hotplug, cpusets, suspend: Don't touch cpusets during suspend/resume
Currently, during CPU hotplug, the cpuset callbacks modify the cpusets
to reflect the state of the system, and this handling is asymmetric.
That is, upon CPU offline, that CPU is removed from all cpusets. However
when it comes back online, it is put back only to the root cpuset.

This gives rise to a significant problem during suspend/resume. During
suspend, we offline all non-boot cpus and during resume we online them back.
Which means, after a resume, all cpusets (except the root cpuset) will be
restricted to just one single CPU (the boot cpu). But the whole point of
suspend/resume is to restore the system to a state which is as close as
possible to how it was before suspend.

So to fix this, don't touch cpusets during suspend/resume. That is, modify
the cpuset-related CPU hotplug callback to just ignore CPU hotplug when it
is initiated as part of the suspend/resume sequence.

Reported-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/4F460D7B.1020703@linux.vnet.ibm.com
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-27 11:38:13 +01:00
Oleg Nesterov d80e731eca epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()
This patch is intentionally incomplete to simplify the review.
It ignores ep_unregister_pollwait() which plays with the same wqh.
See the next change.

epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
f_op->poll() needs. In particular it assumes that the wait queue
can't go away until eventpoll_release(). This is not true in case
of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
which is not connected to the file.

This patch adds the special event, POLLFREE, currently only for
epoll. It expects that init_poll_funcptr()'ed hook should do the
necessary cleanup. Perhaps it should be defined as EPOLLFREE in
eventpoll.

__cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
->signalfd_wqh is not empty, we add the new signalfd_cleanup()
helper.

ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
This make this poll entry inconsistent, but we don't care. If you
share epoll fd which contains our sigfd with another process you
should blame yourself. signalfd is "really special". I simply do
not know how we can define the "right" semantics if it used with
epoll.

The main problem is, epoll calls signalfd_poll() once to establish
the connection with the wait queue, after that signalfd_poll(NULL)
returns the different/inconsistent results depending on who does
EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
has nothing to do with the file, it works with the current thread.

In short: this patch is the hack which tries to fix the symptoms.
It also assumes that nobody can take tasklist_lock under epoll
locks, this seems to be true.

Note:

	- we do not have wake_up_all_poll() but wake_up_poll()
	  is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

	- signalfd_cleanup() uses POLLHUP along with POLLFREE,
	  we need a couple of simple changes in eventpoll.c to
	  make sure it can't be "lost".

Reported-by: Maxime Bizon <mbizon@freebox.fr>
Cc: <stable@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-24 11:42:50 -08:00
Grant Likely abd2363f6a irq_domain/mips: Allow irq_domain on MIPS
This patch makes IRQ_DOMAIN usable on MIPS.  It uses an ugly workaround
to preserve current behaviour so that MIPS has time to add irq_domain
registration to the irq controller drivers.  The workaround will be
removed in Linux v3.6

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mips@linux-mips.org
2012-02-24 09:47:23 -07:00
Ingo Molnar c5905afb0e static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]()
So here's a boot tested patch on top of Jason's series that does
all the cleanups I talked about and turns jump labels into a
more intuitive to use facility. It should also address the
various misconceptions and confusions that surround jump labels.

Typical usage scenarios:

        #include <linux/static_key.h>

        struct static_key key = STATIC_KEY_INIT_TRUE;

        if (static_key_false(&key))
                do unlikely code
        else
                do likely code

Or:

        if (static_key_true(&key))
                do likely code
        else
                do unlikely code

The static key is modified via:

        static_key_slow_inc(&key);
        ...
        static_key_slow_dec(&key);

The 'slow' prefix makes it abundantly clear that this is an
expensive operation.

I've updated all in-kernel code to use this everywhere. Note
that I (intentionally) have not pushed through the rename
blindly through to the lowest levels: the actual jump-label
patching arch facility should be named like that, so we want to
decouple jump labels from the static-key facility a bit.

On non-jump-label enabled architectures static keys default to
likely()/unlikely() branches.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: a.p.zijlstra@chello.nl
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-24 10:05:59 +01:00
Steven Rostedt 499e547057 tracing/ring-buffer: Only have tracing_on disable tracing buffers
As the ring-buffer code is being used by other facilities in the
kernel, having tracing_on file disable *all* buffers is not a desired
affect. It should only disable the ftrace buffers that are being used.

Move the code into the trace.c file and use the buffer disabling
for tracing_on() and tracing_off(). This way only the ftrace buffers
will be affected by them and other kernel utilities will not be
confused to why their output suddenly stopped.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-22 15:50:28 -05:00
Hiroshi Shimamoto de5bdff7a7 sched: Make initial SCHED_RR timeslace DEF_TIMESLICE
Current the initial SCHED_RR timeslice of init_task is HZ, which means
1s, and is not same as the default SCHED_RR timeslice DEF_TIMESLICE.

Change that initial timeslice to the DEF_TIMESLICE.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
[ s/DEF_TIMESLICE/RR_TIMESLICE/g ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4F3C9995.3010800@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-22 12:28:29 +01:00
Nikunj A. Dadhania 62f6536a63 sched: Remove rcu_read_lock/unlock() from select_idle_sibling()
select_idle_sibling() is called from select_task_rq_fair(), which
already has the RCU read lock held.

Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120217030409.11748.12491.stgit@abhimanyu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-22 12:28:28 +01:00
Peter Zijlstra 8c79a045fd sched/events: Revert trace_sched_stat_sleeptime()
Commit 1ac9bc69 ("sched/tracing: Add a new tracepoint for sleeptime")
added a new sched:sched_stat_sleeptime tracepoint.

It's broken: the first sample we get on a task might be bad because
of a stale sleep_start value that wasn't reset at the last task switch
because the tracepoint was not active.

It also breaks the existing schedstat samples due to the side
effects of:

-               se->statistics.sleep_start = 0;
...
-               se->statistics.block_start = 0;

Nor do I see means to fix it without adding overhead to the scheduler
fast path, which I'm not willing to for the sake of redundant
instrumentation.

Most importantly, sleep time information can already be constructed
by tracing context switches and wakeups, and taking the timestamp
difference between the schedule-out, the wakeup and the schedule-in.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-pc4c9qhl8q6vg3bs4j6k0rbd@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-22 12:06:55 +01:00
Jason Baron a746e3cc98 jump label: Fix compiler warning
While cross-compiling on sparc64, I found:

kernel/jump_label.c: In function 'jump_label_update':
kernel/jump_label.c:447:40: warning: cast from pointer to
integer of different size [-Wpointer-to-int-cast]

Fix by casting to 'unsigned long'.

Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: a.p.zijlstra@chello.nl
Link: http://lkml.kernel.org/r/08026cbc6df80619cae833ef1ebbbc43efab69ab.1329851692.git.jbaron@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-22 07:59:40 +01:00
Jason Baron fadf0464b8 jump label: Add a WARN() if jump label key count goes negative
The count on a jump label key should never go negative. Add a
WARN() to check for this condition.

Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: a.p.zijlstra@chello.nl
Link: http://lkml.kernel.org/r/3c68556121be4d1920417a3fe367da1ec38246b4.1329851692.git.jbaron@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-22 07:59:39 +01:00
Hugh Dickins 1cc85961e2 rcu: Stop spurious warnings from synchronize_sched_expedited
synchronize_sched_expedited() is spamming CONFIG_DEBUG_PREEMPT=y
users with an unintended warning from the cpu_is_offline() check: use
raw_smp_processor_id() instead of smp_processor_id() there.

Because the warning is under a get_online_cpus(), it is not possible
for any CPUs to go offline, though it is quite possible that the
task might migrate between the raw_smp_processor_id() and the check
of cpu_is_offline().  This is not a problem because the task cannot
migrate from an offline CPU to an online one or vice versa.  The point
of the check is to verify that synchronize_sched_expedited() is not
called from an offline CPU, for example, from a CPU_DYING notifier, or,
more important, from an outgoing CPU making its way from its CPU_DYING
notifiers to the idle loop.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 15:33:34 -08:00
Frederic Weisbecker 3ce3230a0c cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list
Walking through the tasklist in cgroup_enable_task_cg_list() inside
an RCU read side critical section is not enough because:

- RCU is not (yet) safe against while_each_thread()

- If we use only RCU, a forking task that has passed cgroup_post_fork()
  without seeing use_task_css_set_links == 1 is not guaranteed to have
  its child immediately visible in the tasklist if we walk through it
  remotely with RCU. In this case it will be missing in its css_set's
  task list.

Thus we need to traverse the list (unfortunately) under the
tasklist_lock. It makes us safe against while_each_thread() and also
make sure we see all forked task that have been added to the tasklist.

As a secondary effect, reading and writing use_task_css_set_links are
now well ordered against tasklist traversing and modification. The new
layout is:

CPU 0                                      CPU 1

use_task_css_set_links = 1                write_lock(tasklist_lock)
read_lock(tasklist_lock)                  add task to tasklist
do_each_thread() {                        write_unlock(tasklist_lock)
	add thread to css set links       if (use_task_css_set_links)
} while_each_thread()                         add thread to css set links
read_unlock(tasklist_lock)

If CPU 0 traverse the list after the task has been added to the tasklist
then it is correctly added to the css set links. OTOH if CPU 0 traverse
the tasklist before the new task had the opportunity to be added to the
tasklist because it was too early in the fork process, then CPU 1
catches up and add the task to the css set links after it added the task
to the tasklist. The right value of use_task_css_set_links is guaranteed
to be visible from CPU 1 due to the LOCK/UNLOCK implicit barrier properties:
the read_unlock on CPU 0 makes the write on use_task_css_set_links happening
and the write_lock on CPU 1 make the read of use_task_css_set_links that comes
afterward to return the correct value.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:46:47 -08:00
Frederic Weisbecker 9a4b430451 cgroup: Remove wrong comment on cgroup_enable_task_cg_list()
Remove the stale comment about RCU protection. Many callers
(all of them?) of cgroup_enable_task_cg_list() don't seem
to be in an RCU read side critical section. Besides, RCU is
not helpful to protect against while_each_thread().

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:46:43 -08:00
Paul E. McKenney 696a02cc16 rcu: Hold off RCU_FAST_NO_HZ after timer posted
This commit handles workloads that transition quickly between idle and
non-idle, and where the CPU's callbacks cannot be invoked, but where
RCU does not have anything immediate for the CPU to do.  Without this
patch, the RCU_FAST_NO_HZ code can be invoked repeatedly on each entry
to idle.  The commit sets the per-CPU rcu_dyntick_holdoff variable to
hold off further attempts for a tick.

Reported-by: "Abou Gazala, Neven M" <neven.m.abou.gazala@intel.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:42:30 -08:00
Paul E. McKenney c3ce910b14 rcu: Eliminate softirq-mediated RCU_FAST_NO_HZ idle-entry loop
If a softirq is pending, the current CPU has RCU callbacks pending,
and RCU does not immediately need anything from this CPU, then the
current code resets the RCU_FAST_NO_HZ state machine.  This means that
upon exit from the subsequent softirq handler, RCU_FAST_NO_HZ will
try really hard to force RCU into dyntick-idle mode.  And if the same
conditions hold after a few tries (determined by RCU_IDLE_OPT_FLUSHES),
the same situation can repeat, possibly endlessly.  This scenario is
not particularly good for battery lifetime.

This commit therefore suppresses the early exit from the RCU_FAST_NO_HZ
state machine in the case where there is a softirq pending.  This change
forces the state machine to retain its memory, and to enter holdoff if
this condition persists.

Reported-by: "Abou Gazala, Neven M" <neven.m.abou.gazala@intel.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:42:11 -08:00
Paul E. McKenney 8a2ecf474d rcu: Add RCU_NONIDLE() for idle-loop RCU read-side critical sections
RCU, RCU-bh, and RCU-sched read-side critical sections are forbidden
in the inner idle loop, that is, between the rcu_idle_enter() and the
rcu_idle_exit() -- RCU will happily ignore any such read-side critical
sections.  However, things like powertop need tracepoints in the inner
idle loop.

This commit therefore provides an RCU_NONIDLE() macro that can be used to
wrap code in the idle loop that requires RCU read-side critical sections.

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
2012-02-21 09:06:13 -08:00
Paul E. McKenney 29e37d8141 rcu: Allow nesting of rcu_idle_enter() and rcu_idle_exit()
Use of RCU in the idle loop is incorrect, quite a few instances of
just that have made their way into mainline, primarily event tracing.
The problem with RCU read-side critical sections on CPUs that RCU believes
to be idle is that RCU is completely ignoring the CPU, along with any
attempts and RCU read-side critical sections.

The approaches of eliminating the offending uses and of pushing the
definition of idle down beyond the offending uses have both proved
impractical.  The new approach is to encapsulate offending uses of RCU
with rcu_idle_exit() and rcu_idle_enter(), but this requires nesting
for code that is invoked both during idle and and during normal execution.
Therefore, this commit modifies rcu_idle_enter() and rcu_idle_exit() to
permit nesting.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
2012-02-21 09:06:12 -08:00
Paul E. McKenney ce5df97be5 rcu: Remove redundant check for rcu_head misalignment
There is now an unconditional check for rcu_head misalignment in
__call_rcu(), so remove the old conditional one in debug_rcu_head_queue().

Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:06:11 -08:00
Julia Lawall 3c1b1ce00d PTR_ERR should be called before its argument is cleared.
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression e,e1;
constant c;
@@

*e = c
... when != e = e1
    when != &e
    when != true IS_ERR(e)
*PTR_ERR(e)
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:06:10 -08:00
Paul E. McKenney 7129d383d9 rcu: Trace only after NULL-pointer check
Fix a bonehead error introduced when adding event tracing to rcutorture.
Move the traces to follow the NULL-pointer checks.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:06:09 -08:00
Paul E. McKenney 236fefafe5 rcu: Call out dangers of expedited RCU primitives
The expedited RCU primitives can be quite useful, but they have some
high costs as well.  This commit updates and creates docbook comments
calling out the costs, and updates the RCU documentation as well.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:06:08 -08:00
Paul E. McKenney 2036d94a7b rcu: Rework detection of use of RCU by offline CPUs
Because newly offlined CPUs continue executing after completing the
CPU_DYING notifiers, they legitimately enter the scheduler and use
RCU while appearing to be offline.  This calls for a more sophisticated
approach as follows:

1.	RCU marks the CPU online during the CPU_UP_PREPARE phase.

2.	RCU marks the CPU offline during the CPU_DEAD phase.

3.	Diagnostics regarding use of read-side RCU by offline CPUs use
	RCU's accounting rather than the cpu_online_map.  (Note that
	__call_rcu() still uses cpu_online_map to detect illegal
	invocations within CPU_DYING notifiers.)

4.	Offline CPUs are prevented from hanging the system by
	force_quiescent_state(), which pays attention to cpu_online_map.
	Some additional work (in a later commit) will be needed to
	guarantee that force_quiescent_state() waits a full jiffy before
	assuming that a CPU is offline, for example, when called from
	idle entry.  (This commit also makes the one-jiffy wait
	explicit, since the old-style implicit wait can now be defeated
	by RCU_FAST_NO_HZ and by rcutorture.)

This approach avoids the false positives encountered when attempting to
use more exact classification of CPU online/offline state.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:06:07 -08:00
Paul E. McKenney c5fdcec927 lockdep: Add CPU-idle/offline warning to lockdep-RCU splat
It is illegal to use RCU from a CPU that has reported idleness or
offlinedness to RCU.  However, it can be quite difficult to determine
from a stack trace whether or not a given CPU is idle or offline.
Therefore, this commit adds idle/offline diagnostics to the lockdep-RCU
error message.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:06:06 -08:00
Paul E. McKenney c0cfbbb0d4 rcu: No interrupt disabling for rcu_prepare_for_idle()
The rcu_prepare_for_idle() function is always called with interrupts
disabled, so there is no reason to disable interrupts again within
rcu_prepare_for_idle().  Therefore, this commit removes all of the
interrupt disabling, also removing a latent disabling-unbalance bug.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:06:05 -08:00
Paul E. McKenney 3d3b7db0a2 rcu: Move synchronize_sched_expedited() to rcutree.c
Now that TREE_RCU and TREE_PREEMPT_RCU no longer do anything different
for the single-CPU case, there is no need for multiple definitions of
synchronize_sched_expedited().  It is no longer in any sense a plug-in,
so move it from kernel/rcutree_plugin.h to kernel/rcutree.c.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:06:04 -08:00
Paul E. McKenney c0d6d01bff rcu: Check for illegal use of RCU from offlined CPUs
Although it is legal to use RCU during early boot, it is anything
but legal to use RCU at runtime from an offlined CPU.  After all, RCU
explicitly ignores offlined CPUs.  This commit therefore adds checks
for runtime use of RCU from offlined CPUs.

These checks are not perfect, in particular, they can be subverted
through use of things like rcu_dereference_raw().  Note that it is not
possible to put checks in rcu_read_lock() and friends due to the fact
that these primitives are used in code that might be used under either
RCU or lock-based protection, which means that checking rcu_read_lock()
gets you fat piles of false positives.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:06:03 -08:00
Paul E. McKenney c13f3757d0 rcu: Add CPU-stall capability to rcutorture
Add module parameters to rcutorture that induce a CPU stall.
The stall_cpu parameter specifies how long to stall in seconds,
defaulting to zero, which indicates no stalling is to be undertaken.
The stall_cpu_holdoff parameter specifies how many seconds after
insmod (or boot, if rcutorture is built into the kernel) that this
stall is to start.  The default value for stall_cpu_holdoff is ten
seconds.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:52 -08:00
Paul E. McKenney 9b9ec9b90e rcutorture: Permit holding off CPU-hotplug operations during boot
When rcutorture is started automatically at boot time, it might well
also start CPU-hotplug operations at that time, which might not be
desirable.  This commit therefore adds an rcutorture parameter that
allows CPU-hotplug operations to be held off for the specified number
of seconds after the start of boot.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:50 -08:00
Paul E. McKenney a858af2875 rcu: Print scheduling-clock information on RCU CPU stall-warning messages
There have been situations where RCU CPU stall warnings were caused by
issues in scheduling-clock timer initialization.  To make it easier to
track these down, this commit causes the RCU CPU stall-warning messages
to print out the number of scheduling-clock interrupts taken in the
current grace period for each stalled CPU.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:49 -08:00
Paul E. McKenney 13cfcca0e4 rcu: Set RCU CPU stall times via sysfs
The default CONFIG_RCU_CPU_STALL_TIMEOUT value of 60 seconds has served
Linux users well for production use for quite some time.  However, for
debugging, there will be more than three minutes between subsequent
stall-warning messages.  This can be an annoyingly long wait if you
are trying to work out where the offending infinite loop is hiding.

Therefore, this commit provides a rcu_cpu_stall_timeout sysfs
parameter that may be adjusted at boot time and at runtime to speed
up debugging.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:48 -08:00
Paul E. McKenney 27565d64a4 rcu: Remove #ifdef CONFIG_SMP from TREE_RCU
Now that both TINY_RCU and TINY_PREEMPT_RCU have been in place for awhile,
it is time to remove UP support from TREE_RCU, which is what this commit
does.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:47 -08:00
Paul E. McKenney c44e2cddac rcu: Check for idle-loop entry while in RCU read-side critical section
The inner idle loop is an extended quiescent state for all flavors
of RCU, but there have been recent bug involving use of RCU read-side
primitives from within the idle loop.  Therefore, this commit enlists
lockdep-RCU to detect attempts to enter the inner idle loop while in
an RCU read-side critical section, emitting a lockdep-RCU splat if so.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:45 -08:00
Paul E. McKenney 30fbcc90b0 rcu: Clean up straggling rcu_preempt_needs_cpu() name
The recent updates to RCU_CPU_FAST_NO_HZ have an rcu_needs_cpu() that
does more than just check for callbacks, so get the name for
rcu_preempt_needs_cpu() consistent with that change, now calling it
rcu_preempt_cpu_has_callbacks().

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:44 -08:00
Paul E. McKenney 1aa03f1188 rcu: Simplify unboosting checks
This is a port of commit #82e78d80 from TREE_PREEMPT_RCU to
TINY_PREEMPT_RCU.

This commit uses the fact that current->rcu_boost_mutex is set
any time that the RCU_READ_UNLOCK_BOOSTED flag is set in the
current->rcu_read_unlock_special bitmask.  This allows tests of
the bit to be changed to tests of the pointer, which in turn allows
the RCU_READ_UNLOCK_BOOSTED flag to be eliminated.

Please note that the check of current->rcu_read_unlock_special need not
change because any time that RCU_READ_UNLOCK_BOOSTED was set, so was
RCU_READ_UNLOCK_BLOCKED.  Therefore, __rcu_read_unlock() can continue
testing current->rcu_read_unlock_special for non-zero, as before.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:43 -08:00
Paul E. McKenney 8762705ad4 rcu: Inform RCU of irq_exit() activity
This is a port to TINY_RCU of Peter Zijlstra's commit #ec433f0c5

The rcu_read_unlock_special() function relies on in_irq() to exclude
scheduler activity from interrupt level.  This fails because exit_irq()
can invoke the scheduler after clearing the preempt_count() bits that
in_irq() uses to determine that it is at interrupt level.  This situation
can result in failures as follows:

     $task			IRQ		SoftIRQ

     rcu_read_lock()

     /* do stuff */

     <preempt> |= UNLOCK_BLOCKED

     rcu_read_unlock()
       --t->rcu_read_lock_nesting

    			irq_enter();
    			/* do stuff, don't use RCU */
    			irq_exit();
    			  sub_preempt_count(IRQ_EXIT_OFFSET);
    			  invoke_softirq()

    					ttwu();
    					  spin_lock_irq(&pi->lock)
    					  rcu_read_lock();
    					  /* do stuff */
    					  rcu_read_unlock();
    					    rcu_read_unlock_special()
    					      rcu_report_exp_rnp()
    					        ttwu()
    					          spin_lock_irq(&pi->lock) /* deadlock */

       rcu_read_unlock_special(t);

This can be triggered 'easily' because invoke_softirq() immediately does
a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff first,
but even without that the above happens.

Cure this by also excluding softirqs from the rcu_read_unlock_special()
handler and ensuring the force_irqthreads ksoftirqd/# wakeup is done
from full softirq context.

It is also necessary to delay the ->rcu_read_lock_nesting decrement until
after rcu_read_unlock_special().  This delay is handled by the commit
"Protect __rcu_read_unlock() against scheduler-using irq handlers".

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:42 -08:00
Paul E. McKenney 768dfffdff rcu: Prevent RCU callbacks from executing before scheduler initialized
This is a port of commit #b0d3041 from TREE_RCU to TREE_PREEMPT_RCU.

Under some rare but real combinations of configuration parameters, RCU
callbacks are posted during early boot that use kernel facilities that are
not yet initialized.  Therefore, when these callbacks are invoked, hard
hangs and crashes ensue.  This commit therefore prevents RCU callbacks
from being invoked until after the scheduler is fully up and running,
as in after multiple tasks have been spawned.

It might well turn out that a better approach is to identify the specific
RCU callbacks that are causing this problem, but that discussion will
wait until such time as someone really needs an RCU callback to be invoked
(as opposed to merely registered) during early boot.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:41 -08:00
Paul E. McKenney afef20540f rcu: Streamline code produced by __rcu_read_unlock()
This is a port of commit #be0e1e21 to TINY_PREEMPT_RCU.  This uses
noinline to prevent rcu_read_unlock_special() from being inlined into
__rcu_read_unlock().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:40 -08:00
Paul E. McKenney 26861faf89 rcu: Protect __rcu_read_unlock() against scheduler-using irq handlers
This commit ports commit #10f39bb1b2 (rcu: protect __rcu_read_unlock()
against scheduler-using irq handlers) from TREE_PREEMPT_RCU to
TINY_PREEMPT_RCU.  The following is a corresponding port of that
commit message.

The addition of RCU read-side critical sections within runqueue and
priority-inheritance critical sections introduced some deadlocks,
for example, involving interrupts from __rcu_read_unlock() where the
interrupt handlers call wake_up().  This situation can cause the
instance of __rcu_read_unlock() invoked from interrupt to do some
of the processing that would otherwise have been carried out by the
task-level instance of __rcu_read_unlock().  When the interrupt-level
instance of __rcu_read_unlock() is called with a scheduler lock held from
interrupt-entry/exit situations where in_irq() returns false, deadlock can
result.  Of course, in a UP kernel, there are not really any deadlocks,
but the upper-level critical section can still be be fatally confused
by the lower-level critical section changing things out from under it.

This commit resolves these deadlocks by using negative values of the
per-task ->rcu_read_lock_nesting counter to indicate that an instance of
__rcu_read_unlock() is in flight, which in turn prevents instances from
interrupt handlers from doing any special processing.  Note that nested
rcu_read_lock()/rcu_read_unlock() pairs are still permitted, but they will
never see ->rcu_read_lock_nesting go to zero, and will therefore never
invoke rcu_read_unlock_special(), thus preventing them from seeing the
RCU_READ_UNLOCK_BLOCKED bit should it be set in ->rcu_read_unlock_special.
This patch also adds a check for ->rcu_read_unlock_special being negative
in rcu_check_callbacks(), thus preventing the RCU_READ_UNLOCK_NEED_QS
bit from being set should a scheduling-clock interrupt occur while
__rcu_read_unlock() is exiting from an outermost RCU read-side critical
section.

Of course, __rcu_read_unlock() can be preempted during the time that
->rcu_read_lock_nesting is negative.  This could result in the setting
of the RCU_READ_UNLOCK_BLOCKED bit after __rcu_read_unlock() checks it,
and would also result it this task being queued on the corresponding
rcu_node structure's blkd_tasks list.  Therefore, some later RCU read-side
critical section would enter rcu_read_unlock_special() to clean up --
which could result in deadlock (OK, OK, fatal confusion) if that RCU
read-side critical section happened to be in the scheduler where the
runqueue or priority-inheritance locks were held.

To prevent the possibility of fatal confusion that might result from
preemption during the time that ->rcu_read_lock_nesting is negative,
this commit also makes rcu_preempt_note_context_switch() check for
negative ->rcu_read_lock_nesting, thus refraining from queuing the task
(and from setting RCU_READ_UNLOCK_BLOCKED) if we are already exiting
from the outermost RCU read-side critical section (in other words,
we really are no longer actually in that RCU read-side critical
section).  In addition, rcu_preempt_note_context_switch() invokes
rcu_read_unlock_special() to carry out the cleanup in this case, which
clears out the ->rcu_read_unlock_special bits and dequeues the task
(if necessary), in turn avoiding needless delay of the current RCU grace
period and needless RCU priority boosting.

It is still illegal to call rcu_read_unlock() while holding a scheduler
lock if the prior RCU read-side critical section has ever had both
preemption and irqs enabled.  However, the common use case is legal,
namely where then entire RCU read-side critical section executes with
irqs disabled, for example, when the scheduler lock is held across the
entire lifetime of the RCU read-side critical section.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:39 -08:00
Paul E. McKenney f38bd1020f rcu: Remove single-rcu_node optimization in rcu_start_gp()
The grace-period initialization sequence in rcu_start_gp() has a special
case for systems where the rcu_node tree is a single rcu_node structure.
This made sense some years ago when systems were smaller and up to 64
CPUs could share a single rcu_node structure, but now that large systems
are common and a given leaf rcu_node structure can support only 16 CPUs
(due to lock contention on the rcu_node's ->lock field), this optimization
is almost never taken.  And even the small mobile platforms that might
make use of it might rather have the kernel text reduction.

Therefore, this commit removes the check for single-rcu_node trees.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-02-21 09:03:38 -08:00
Paul E. McKenney a50c3af910 rcu: Don't make callbacks go through second full grace period
RCU's current CPU-offline code path dumps all of the outgoing CPU's
callbacks onto the RCU_NEXT_TAIL portion of the surviving CPU's
callback list.  This means that all the ready-to-invoke callbacks from
the outgoing CPU must wait for another full RCU grace period.  This was
just fine when CPU-hotplug events were rare, but there is increasing
evidence that users are planning to make increasing use of CPU hotplug.

Therefore, this commit changes the callback-dumping procedure so that
callbacks that are ready to invoke are moved to the RCU_DONE_TAIL
portion of the surviving CPU's callback list.  This avoids running
these callbacks through a second unnecessary grace period.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:37 -08:00
Paul E. McKenney 8146c4e2e2 rcu: Check for callback invocation from offline CPUs
Because quiescent states are now reported from offline CPUs in
CPU_DYING state, there is some possibility that such a CPU might
note the end of a grace period and attempt to start invoking
callbacks.  This would be a very bad thing, and is supposed to
be prevented by the fact that the CPU_DYING CPU gets rid of all
its callbacks before reporting the quiescent state.  However,
there is other CPU-offline code in the kernel, and it is quite
possible that someone will invoke RCU core processing from that
code.  Therefore, this commit adds a warning for this case.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:37 -08:00
Paul E. McKenney 778d250a29 rcu: Limit lazy-callback duration
Currently, a given CPU is permitted to remain in dyntick-idle mode
indefinitely if it has only lazy RCU callbacks queued.  This is vulnerable
to corner cases in NUMA systems, so limit the time to six seconds by
default.  (Currently controlled by a cpp macro.)

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:36 -08:00
Paul E. McKenney 091541bbdb rcu: Make rcutorture flag online/offline failures
Make rcutorture check for CPU-hotplug failures and complain if there
were any.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:35 -08:00
Paul E. McKenney e560140008 rcu: Simplify offline processing
Move ->qsmaskinit and blkd_tasks[] manipulation to the CPU_DYING
notifier.  This simplifies the code by eliminating a potential
deadlock and by reducing the responsibilities of force_quiescent_state().
Also rename functions to make their connection to the CPU-hotplug
stages explicit.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:34 -08:00
Paul E. McKenney 486e259340 rcu: Avoid waking up CPUs having only kfree_rcu() callbacks
When CONFIG_RCU_FAST_NO_HZ is enabled, RCU will allow a given CPU to
enter dyntick-idle mode even if it still has RCU callbacks queued.
RCU avoids system hangs in this case by scheduling a timer for several
jiffies in the future.  However, if all of the callbacks on that CPU
are from kfree_rcu(), there is no reason to wake the CPU up, as it is
not a problem to defer freeing of memory.

This commit therefore tracks the number of callbacks on a given CPU
that are from kfree_rcu(), and avoids scheduling the timer if all of
a given CPU's callbacks are from kfree_rcu().

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:25 -08:00
Paul E. McKenney 0bb7b59d6e rcu: Add diagnostic for misaligned rcu_head structures
The push for energy efficiency will require that RCU tag rcu_head
structures to indicate whether or not their invocation is time critical.
This tagging is best carried out in the bottom bits of the ->next
pointers in the rcu_head structures.  This tagging requires that the
rcu_head structures be properly aligned, so this commit adds the required
diagnostics.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:24 -08:00
Paul E. McKenney fe15d706cf rcu: Add lockdep-RCU checks for simple self-deadlock
It is illegal to have a grace period within a same-flavor RCU read-side
critical section, so this commit adds lockdep-RCU checks to splat when
such abuse is encountered.  This commit does not detect more elaborate
RCU deadlock situations.  These situations might be a job for lockdep
enhancements.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-02-21 09:03:23 -08:00
Jiri Olsa 5500fa5119 ftrace, perf: Add filter support for function trace event
Adding support to filter function trace event via perf
interface. It is now possible to use filter interface
in the perf tool like:

  perf record -e ftrace:function --filter="(ip == mm_*)" ls

The filter syntax is restricted to the the 'ip' field only,
and following operators are accepted '==' '!=' '||', ending
up with the filter strings like:

  ip == f1[, ]f2 ... || ip != f3[, ]f4 ...

with comma ',' or space ' ' as a function separator. If the
space ' ' is used as a separator, the right side of the
assignment needs to be enclosed in double quotes '"', e.g.:

  perf record -e ftrace:function --filter '(ip == do_execve,sys_*,ext*)' ls
  perf record -e ftrace:function --filter '(ip == "do_execve,sys_*,ext*")' ls
  perf record -e ftrace:function --filter '(ip == "do_execve sys_* ext*")' ls

The '==' operator adds trace filter with same effect as would
be added via set_ftrace_filter file.

The '!=' operator adds trace filter with same effect as would
be added via set_ftrace_notrace file.

The right side of the '!=', '==' operators is list of functions
or regexp. to be added to filter separated by space.

The '||' operator is used for connecting multiple filter definitions
together. It is possible to have more than one '==' and '!='
operators within one filter string.

Link: http://lkml.kernel.org/r/1329317514-8131-8-git-send-email-jolsa@redhat.com

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-21 11:08:30 -05:00
Jiri Olsa 02aa3162ed ftrace: Allow to specify filter field type for ftrace events
Adding FILTER_TRACE_FN event field type for function tracepoint
event, so it can be properly recognized within filtering code.

Currently all fields of ftrace subsystem events share the common
field type FILTER_OTHER. Since the function trace fields need
special care within the filtering code we need to recognize it
properly, hence adding the FILTER_TRACE_FN event type.

Adding filter parameter to the FTRACE_ENTRY macro, to specify the
filter field type for the event.

Link: http://lkml.kernel.org/r/1329317514-8131-7-git-send-email-jolsa@redhat.com

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-21 11:08:29 -05:00
Jiri Olsa ced39002f5 ftrace, perf: Add support to use function tracepoint in perf
Adding perf registration support for the ftrace function event,
so it is now possible to register it via perf interface.

The perf_event struct statically contains ftrace_ops as a handle
for function tracer. The function tracer is registered/unregistered
in open/close actions.

To be efficient, we enable/disable ftrace_ops each time the traced
process is scheduled in/out (via TRACE_REG_PERF_(ADD|DELL) handlers).
This way tracing is enabled only when the process is running.
Intentionally using this way instead of the event's hw state
PERF_HES_STOPPED, which would not disable the ftrace_ops.

It is now possible to use function trace within perf commands
like:

  perf record -e ftrace:function ls
  perf stat -e ftrace:function ls

Allowed only for root.

Link: http://lkml.kernel.org/r/1329317514-8131-6-git-send-email-jolsa@redhat.com

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-21 11:08:27 -05:00
Jiri Olsa e59a0bff3e ftrace: Add FTRACE_ENTRY_REG macro to allow event registration
Adding FTRACE_ENTRY_REG macro so particular ftrace entries
could specify registration function and thus become accesible
via perf.

This will be used in upcomming patch for function trace.

Link: http://lkml.kernel.org/r/1329317514-8131-5-git-send-email-jolsa@redhat.com

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-21 11:08:26 -05:00
Jiri Olsa 489c75c3b3 ftrace, perf: Add add/del tracepoint perf registration actions
Adding TRACE_REG_PERF_ADD and TRACE_REG_PERF_DEL to handle
perf event schedule in/out actions.

The add action is invoked for when the perf event is scheduled in,
while the del action is invoked when the event is scheduled out.

Link: http://lkml.kernel.org/r/1329317514-8131-4-git-send-email-jolsa@redhat.com

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-21 11:08:25 -05:00
Jiri Olsa ceec0b6fc7 ftrace, perf: Add open/close tracepoint perf registration actions
Adding TRACE_REG_PERF_OPEN and TRACE_REG_PERF_CLOSE to differentiate
register/unregister from open/close actions.

The register/unregister actions are invoked for the first/last
tracepoint user when opening/closing the event.

The open/close actions are invoked for each tracepoint user when
opening/closing the event.

Link: http://lkml.kernel.org/r/1329317514-8131-3-git-send-email-jolsa@redhat.com

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-21 11:08:24 -05:00
Jiri Olsa e248491ac2 ftrace: Add enable/disable ftrace_ops control interface
Adding a way to temporarily enable/disable ftrace_ops. The change
follows the same way as 'global' ftrace_ops are done.

Introducing 2 global ftrace_ops - control_ops and ftrace_control_list
which take over all ftrace_ops registered with FTRACE_OPS_FL_CONTROL
flag. In addition new per cpu flag called 'disabled' is also added to
ftrace_ops to provide the control information for each cpu.

When ftrace_ops with FTRACE_OPS_FL_CONTROL is registered, it is
set as disabled for all cpus.

The ftrace_control_list contains all the registered 'control' ftrace_ops.
The control_ops provides function which iterates ftrace_control_list
and does the check for 'disabled' flag on current cpu.

Adding 3 inline functions:
  ftrace_function_local_disable/ftrace_function_local_enable
  - enable/disable the ftrace_ops on current cpu
  ftrace_function_local_disabled
  - get disabled ftrace_ops::disabled value for current cpu

Link: http://lkml.kernel.org/r/1329317514-8131-2-git-send-email-jolsa@redhat.com

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-21 11:08:23 -05:00
Steven Rostedt 5b34926114 tracing: Don't use p->len field to determine output in __print_*() functions
If more than one __print_*() function is used in a tracepoint
(__print_flags(), __print_symbols(), etc), then the temp seq buffer will
not be zero on entry. Using the temp seq buffer's length to know if
data has been printed or not in the current function is incorrect and
may produce incorrect results.

Currently, no in-tree tracepoint causes this bug, but new ones may
be created.

Cc: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-21 11:08:13 -05:00
Andrey Vagin e404b321db tracing: Don't print an extra separator of flags
If __print_flags() is used after another __print_*() function, the
temp seq_file buffer will not be empty on entry, and the delimiter will
be printed even though there's just one field. We get something like:

	|S

instead of just:

	S

This is because the length of the temp seq buffer is used to determine
if the delimiter is printed or not. But this algorithm fails when
the seq buffer is not empty on entry, and the delimiter will be printed
because it thinks that a previous field was already printed.

Link: http://lkml.kernel.org/r/1329650167-480655-1-git-send-email-avagin@openvz.org

Signed-off-by: Andrew Vagin <avagin@openvz.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-20 20:33:31 -05:00
Linus Torvalds 8ebbfb4957 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Assorted fixes, sat in -next for a week or so...

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ocfs2: deal with wraparounds of i_nlink in ocfs2_rename()
  vfs: fix compat_sys_stat() handling of overflows in st_nlink
  quota: Fix deadlock with suspend and quotas
  vfs: Provide function to get superblock and wait for it to thaw
  vfs: fix panic in __d_lookup() with high dentry hashtable counts
  autofs4 - fix lockdep splat in autofs
  vfs: fix d_inode_lookup() dentry ref leak
2012-02-20 16:13:58 -08:00
H. Peter Anvin 6684ba202b compat: Add helper functions to read/write struct timeval, timespec
Add helper functions to read and write struct timeval and struct
timespec from userspace.  We already had helper functions for reading
and writing struct compat_timespec; add a set of functions to do the
same with struct timeval, and add a second suite of functions which
can be sensitive to COMPAT_USE_64BIT_TIME and access either 32- or
64-bit time structures.

This also exports these helper functions to modules.

Rename the existing inlines for converting between struct
compat_timeval and native struct timespec so we can have a saner
naming convention for the exported functions.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20 12:48:47 -08:00
David Howells 1fd36adcd9 Replace the fd_sets in struct fdtable with an array of unsigned longs
Replace the fd_sets in struct fdtable with an array of unsigned longs and then
use the standard non-atomic bit operations rather than the FD_* macros.

This:

 (1) Removes the abuses of struct fd_set:

     (a) Since we don't want to allocate a full fd_set the vast majority of the
     	 time, we actually, in effect, just allocate a just-big-enough array of
     	 unsigned longs and cast it to an fd_set type - so why bother with the
     	 fd_set at all?

     (b) Some places outside of the core fdtable handling code (such as
     	 SELinux) want to look inside the array of unsigned longs hidden inside
     	 the fd_set struct for more efficient iteration over the entire set.

 (2) Eliminates the use of FD_*() macros in the kernel completely.

 (3) Permits the __FD_*() macros to be deleted entirely where not exposed to
     userspace.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: http://lkml.kernel.org/r/20120216174954.23314.48147.stgit@warthog.procyon.org.uk
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2012-02-19 10:30:57 -08:00
Bjorn Helgaas 69f1d475cc PM / Hibernate: print physical addresses consistently with other parts of kernel
Print physical address info in a style consistent with the %pR style used
elsewhere in the kernel.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-02-17 23:36:41 +01:00
Rafael J. Wysocki bc25cf5089 PM / Sleep: Drop suspend_stats_update()
Since suspend_stats_update() is only called from pm_suspend(),
move its code directly into that function and remove the static
inline definition from include/linux/suspend.h.  Clean_up
pm_suspend() in the process.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
2012-02-17 23:36:23 +01:00
Rafael J. Wysocki 93e1ee43a7 PM / Sleep: Make enter_state() in kernel/power/suspend.c static
The enter_state() function in kernel/power/suspend.c should be
static and state_store() in kernel/power/suspend.c should call
pm_suspend() instead of it, so make that happen (which also reduces
code duplication related to suspend statistics).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
2012-02-17 23:36:10 +01:00
Rafael J. Wysocki 55ae451918 PM / Sleep: Unify kerneldoc comments in kernel/power/suspend.c
The kerneldoc comments in kernel/power/suspend.c are not formatted
in the same way and the quality of some of them is questionable.
Unify the formatting and improve the contents.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
2012-02-17 23:36:03 +01:00
Ingo Molnar 09bda4432a Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core 2012-02-17 12:55:07 +01:00
Grant Likely a18dc81bf5 irq_domain: constify irq_domain_ops
Make irq_domain_ops pointer a constant to make it safer for multiple
instances to share the same ops pointer and change the irq_domain code
so that it does not modify the ops.

v4: Fix mismatched type reference in powerpc code

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
2012-02-16 06:11:24 -07:00
Grant Likely 16b2e6e2f3 irq_domain: Create common xlate functions that device drivers can use
Rather than having each interrupt controller driver creating its own barely
unique .xlate function for irq_domain, create a library of translators which
any driver can use directly.

v5: - Remove irq_domain_xlate_pci().  It was incorrect.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
2012-02-16 06:11:23 -07:00
Grant Likely 6b783f7c5d irq_domain: Remove irq_domain_add_simple()
irq_domain_add_simple() was a stop-gap measure until complete irq_domain
support was complete.  This patch removes the irq_domain_add_simple()
interface.

This patch also drops the explicit irq_domain initialization performed
by the mach-versatile code because the versatile interrupt controller
already has irq_domain support built into it.  This was a bug that was
hanging around quietly for a while, but with the full irq_domain which
actually verifies that irq_domain ranges are available it would cause
the registration to fail and the system wouldn't boot.

v4: Fixed number of irqs in mx5 gpio code
v2: Updated to pass in host_data pointer on irq_domain allocation.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Cc: Russell King <linux@arm.linux.org.uk>
Tested-by: Olof Johansson <olof@lixom.net>
2012-02-16 06:11:23 -07:00
Grant Likely 75294957be irq_domain: Remove 'new' irq_domain in favour of the ppc one
This patch removes the simplistic implementation of irq_domains and enables
the powerpc infrastructure for all irq_domain users.  The powerpc
infrastructure includes support for complex mappings between Linux and
hardware irq numbers, and can manage allocation of irq_descs.

This patch also converts the few users of irq_domain_add()/irq_domain_del()
to call irq_domain_add_legacy() instead.

v3: Fix bug that set up too many irqs in translation range.
v2: Fix removal of irq_alloc_descs() call in gic driver

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
2012-02-16 06:11:23 -07:00
Grant Likely 1bc04f2cf8 irq_domain: Add support for base irq and hwirq in legacy mappings
Add support for a legacy mapping where irq = (hwirq - first_hwirq + first_irq)
so that a controller driver can allocate a fixed range of irq_descs and use
a simple calculation to translate back and forth between linux and hw irq
numbers.  This is needed to use an irq_domain with many of the ARM interrupt
controller drivers that manage their own irq_desc allocations.  Ultimately
the goal is to migrate those drivers to use the linear revmap, but doing it
this way allows each driver to be converted separately which makes the
migration path easier.

This patch generalizes the IRQ_DOMAIN_MAP_LEGACY method to use
(first_irq-first_hwirq) as the offset between hwirq and linux irq number,
and adds checks to make sure that the hwirq number does not exceed range
assigned to the controller.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
2012-02-16 06:11:23 -07:00
Grant Likely a8db8cf0d8 irq_domain: Replace irq_alloc_host() with revmap-specific initializers
Each revmap type has different arguments for setting up the revmap.
This patch splits up the generator functions so that each revmap type
can do its own setup and the user doesn't need to keep track of how
each revmap type handles the arguments.

This patch also adds a host_data argument to the generators.  There are
cases where the host_data pointer will be needed before the function returns.
ie. the legacy map calls the .map callback for each irq before returning.

v2: - Add void *host_data argument to irq_domain_add_*() functions
    - fixed failure to compile
    - Moved IRQ_DOMAIN_MAP_* defines into irqdomain.c

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
2012-02-16 06:11:22 -07:00
Grant Likely 68700650e7 irq_domain: Remove references to old irq_host names
No functional changes.  Replaces non-exported references to 'host' with domain.
Does not change any symbol names referenced by other .c files.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
2012-02-16 06:11:22 -07:00
Grant Likely 03848373ea irq_domain: remove NO_IRQ from irq domain code
zero always means no irq when using irq domains.  Get rid of the NO_IRQ
references.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
2012-02-16 06:11:21 -07:00
Grant Likely cc79ca691c irq_domain: Move irq_domain code from powerpc to kernel/irq
This patch only moves the code.  It doesn't make any changes, and the
code is still only compiled for powerpc.  Follow-on patches will generalize
the code for other architectures.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
2012-02-16 01:37:49 -07:00
Thomas Gleixner 97ac984d2f Merge branch 'fortglx/3.4/time' of git://git.linaro.org/people/jstultz/linux into timers/core 2012-02-15 20:29:18 +01:00
Frederic Weisbecker 0a8a2e78b7 timer: Fix bad idle check on irq entry
idle_cpu() is called on irq entry to guess if we need to call
tick_check_idle(). This way we can catch up with jiffies if the tick
was stopped, stop accounting idle time during the interrupt and
maintain the sched clock if it is unstable.

But if we are going to exit the idle loop to schedule a new task (ie:
if we have a task in the runqueue or a remotely enqueued ttwu to
perform), the idle_cpu() check will return 0 such that we miss the
call to tick_check_idle() for all interrupts happening before we
schedule the new task.

As a result these interrupts and the softirqs coming along may deal
with stale jiffies values, bad sched clock values, and won't substract
their time from the idle time accounting.

Fix this with using is_idle_task() instead that strictly checks that
we are running the idle task, without caring about the fact we are
going to schedule a task soon.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@elte.hu>
Link: http://lkml.kernel.org/r/1327427984-23282-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-02-15 15:23:09 +01:00
Frederic Weisbecker 15f827be93 nohz: Remove ts->Einidle checks before restarting the tick
ts->inidle is set by tick_nohz_idle_enter() and unset by
tick_nohz_idle_exit(). However these two calls are assumed
to be always paired. This means that by the time we call
tick_nohz_idle_exit(), ts->inidle is supposed to be always
set to 1.

Remove the checks for ts->inidle in tick_nohz_idle_exit().
This simplifies a bit the code and improves its debuggability
(ie: ensure the call is paired with a tick_nohz_idle_enter()
call).

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@elte.hu>
Link: http://lkml.kernel.org/r/1327427984-23282-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-02-15 15:23:09 +01:00
Michal Hocko 430ee88195 nohz: Remove update_ts_time_stat from tick_nohz_start_idle
There is no reason to call update_ts_time_stat from tick_nohz_start_idle
anymore (after e0e37c20 sched: Eliminate the ts->idle_lastupdate field)
when we updated idle_lastupdate unconditionally.

We haven't set idle_active yet and do not provide last_update_time so
the whole call end up being just 2 wasted branches.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: http://lkml.kernel.org/r/1322755222-6951-1-git-send-email-mhocko@suse.cz
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-02-15 15:23:09 +01:00
Suresh Siddha 77b0d60c5a clockevents: Leave the broadcast device in shutdown mode when not needed
Platforms with Always Running APIC Timer doesn't use the broadcast timer
but the kernel is leaving the broadcast timer (HPET in this case)
in oneshot mode.

On these platforms, before the switch to oneshot mode, broadcast device is
actually in shutdown mode. Code checks for empty tick_broadcast_mask and
avoids going into the periodic mode.

During switch to oneshot mode, add the same tick_broadcast_mask checks in the
tick_broadcast_switch_to_oneshot() and avoid the broadcast device going into
the oneshot mode.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: john stultz <johnstul@us.ibm.com>
Cc: venki@google.com
Link: http://lkml.kernel.org/r/1320452301.15071.16.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-02-15 15:23:09 +01:00
Thomas Gleixner 81b40539e7 futex: Simplify return logic
No need to assign ret in each case and break. Simply return the result
of the handler function directly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@linux.intel.com>
2012-02-15 12:19:13 +01:00
Thomas Gleixner 59263b513c futex: Cover all PI opcodes with cmpxchg enabled check
Some of the newer futex PI opcodes do not check the cmpxchg enabled
variable and call unconditionally into the handling functions. Cover
all PI opcodes in a separate check.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Darren Hart <dvhart@linux.intel.com>
2012-02-15 12:19:13 +01:00
Thomas Gleixner b4bc724e82 genirq: Handle pending irqs in irq_startup()
An interrupt might be pending when irq_startup() is called, but the
startup code does not invoke the resend logic. In some cases this
prevents the device from issuing another interrupt which renders the
device non functional.

Call the resend function in irq_startup() to keep things going.

Reported-and-tested-by: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-02-15 11:56:59 +01:00
Thomas Gleixner ac56376111 genirq: Unmask oneshot irqs when thread was not woken
When the primary handler of an interrupt which is marked IRQ_ONESHOT
returns IRQ_HANDLED or IRQ_NONE, then the interrupt thread is not
woken and the unmask logic of the interrupt line is never
invoked. This keeps the interrupt masked forever.

This was not noticed as most IRQ_ONESHOT users wake the thread
unconditionally (usually because they cannot access the underlying
device from hard interrupt context). Though this behaviour was nowhere
documented and not necessarily intentional. Some drivers can avoid the
thread wakeup in certain cases and run into the situation where the
interrupt line s kept masked.

Handle it gracefully.

Reported-and-tested-by: Lothar Wassmann <lw@karo-electronics.de>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-02-15 11:56:59 +01:00
Grant Likely 7bb69bade0 irq_domain: Make irq_domain structure match powerpc's irq_host
Part of the series to unify the irq remapping mechanisms in the
kernel.  A follow up patch will copy the powerpc implementation into
kernel/irq/irqdomain.c, which will be a lot easier if the structures
are identical.

Where they differ, I've chose to use the powerpc names since there is
a lot more code using those names.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Milton Miller <miltonm@bga.com>
Tested-by: Olof Johansson <olof@lixom.net>
2012-02-14 14:06:48 -07:00
Grant Likely e1964c50a8 irq_domain: Be less verbose
irq_domain printk's too much.  Drop some output.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Rob Herring <rob.herring@calxeda.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Olof Johansson <olof@lixom.net>
2012-02-14 14:06:48 -07:00
Dimitri Sivanich 074b85175a vfs: fix panic in __d_lookup() with high dentry hashtable counts
When the number of dentry cache hash table entries gets too high
(2147483648 entries), as happens by default on a 16TB system, use of a
signed integer in the dcache_init() initialization loop prevents the
dentry_hashtable from getting initialized, causing a panic in
__d_lookup().  Fix this in dcache_init() and similar areas.

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-02-13 20:45:38 -05:00
Linus Torvalds e3f89f4ae4 Autogenerated GPG tag for Rusty D1ADB8F1: 15EE 8D6C AB0E 7F0C F999 BFCB D920 0E6C D1AD B8F1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPOauPAAoJENkgDmzRrbjx1YsP/0+Il3FvyCvN6Qx5NHef61nZ
 9mW6UCwiY/mnIldsBS5S5iKVvXchOa7PdOebfhnpmizilZPInahXrs2qO5WV2vaa
 p+CA3UyZk9kLlqZFTfrZKsafpqZeAWJ5MHAYlu1t/CaBvRNGVtHLDthrU0DrbHdy
 AUqeIPeo4fGrhZm/zrhSPqie0Tc3Xfw1xgxXYNZGNaaiEcaeSKUdwCTQepWIdyv1
 ljKjJ5VlRmS01NdCLtyWiVfkzBmmSkYI5nYzxPxnawt43uiszm9gPnZ8ruUdP/1B
 ZTDYvhXnDYdLO7sEfOh19leQg85BMY37i5U2PTa/3r2aU+0DTSGDwnX0HDc+TCah
 jsCmyIkLomFqiy6UduPZRE2xoj3LWJcB7/uESCPfn1P51w2OMhRKwQsA3EZj8Lze
 flga09EvrO5qOT1Pbx59FjQ2rXG8Ek7uWZ73elqlAbHk9K4J+1DLIF/wynI0rmJy
 VEWKzMGSaCAeNKayEoe6aPwLU6x7XeA4/UvBmrWtsuPPdgJ1S0UxaEHRP5ZoTssj
 7DqxtcY9Q4BdSxPFVOU5hnfDa4TotS9i19s+gqmrinmULzH8cgzMh5xkGgBGn+HA
 EVkwqi+4iHe1pj6KAYT4CT5Yhr2k/tgOxRzcIW+3019z3ro1/x8kmxCMmekzdVq3
 9KHESNlwkzush1+T3vvm
 =QjWG
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://github.com/rustyrussell/linux

* tag 'for-linus' of git://github.com/rustyrussell/linux:
  module: fix broken isapnp handling in file2alias
  module: make module param bint handle nul value
2012-02-13 16:59:53 -08:00
Dave Young 10f296cbfe module: make module param bint handle nul value
Allow bint param accept nul values, just do same as bool param.

Signed-off-by: Dave Young <dyoung@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-02-14 11:02:15 +10:30
Al Viro 4040153087 security: trim security.h
Trim security.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>
2012-02-14 10:45:42 +11:00
Al Viro 191c542442 mm: collapse security_vm_enough_memory() variants into a single function
Collapse security_vm_enough_memory() variants into a single function.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>
2012-02-14 10:45:39 +11:00
Thomas Meyer 47b0edcb59 tracing/trivial: Use kcalloc instead of kzalloc to allocate array
The advantage of kcalloc is, that will prevent integer overflows which could
result from the multiplication of number of elements and size and it is also
a bit nicer to read.

The semantic patch that makes this change is available
in https://lkml.org/lkml/2011/11/25/107

Link: http://lkml.kernel.org/r/1322600880.1534.347.camel@localhost.localdomain

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-13 13:48:11 -05:00
Johannes Berg 9510035849 printk/tracing: Add console output tracing
Add a printk.console trace point to record any printk
messages into the trace, regardless of the current
console loglevel. This can help correlate (existing)
printk debugging with other tracing.

Link: http://lkml.kernel.org/r/1322161388.5366.54.camel@jlt3.sipsolutions.net

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-13 13:46:05 -05:00
Rafael J. Wysocki 6f585f750d PM / Sleep: Remove unnecessary label from suspend_freeze_processes()
The Finish label in suspend_freeze_processes() is in fact unnecessary
and makes the function look more complicated than it really is, so
remove that label (along with a few empty lines).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
2012-02-13 16:26:13 +01:00
Rafael J. Wysocki 6c83b4818d PM / Sleep: Do not check wakeup too often in try_to_freeze_tasks()
Use the observation that it is more efficient to check the wakeup
variable once before the loop reporting tasks that were not
frozen in try_to_freeze_tasks() than to do that in every step of that
loop.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-02-13 16:26:05 +01:00
Jean Pihet a9b542ee60 PM / QoS: unconditionally build the feature
The PM QoS feature originally didn't depend on CONFIG_PM, which was
mistakenly changed by commit e8db0be124

    PM QoS: Move and rename the implementation files

Later, commit d020283dc6

    PM / QoS: CPU C-state breakage with PM Qos change

partially fixed that by introducing a static inline definition of
pm_qos_request(), but that still didn't allow user space to use
the PM QoS interface if CONFIG_PM was unset (which had been possible
before).  For this reason, remove the dependency of PM QoS on
CONFIG_PM to make it work (as intended) with CONFIG_PM unset.

[rjw: Replaced the original changelog with a new one.]

Signed-off-by: Jean Pihet <j-pihet@ti.com>
Reported-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-02-13 16:23:42 +01:00
Rafael J. Wysocki 654b0c9627 Merge commit 'pm-fixes-for-3.3-rc3' into pm-qos
New material in the pm-qos branch depends on recent power management
fixes.
2012-02-13 16:20:07 +01:00
Geunsik Lim 1e42e83fde ftrace: sched_switch plugin is deprecated
Actually, sched_switch function tracer is merged into wakeup/wakeup_rt
Update 'mini-HOWTO' for ftrace(Kernel function tracer).
If we want to trace "sched:sched_switch" to trace sched_switch func,
We may utilize event option.(e.g: trace-cmd list -e | grep sched)
This patch is based on Linux-3.3.rc2-SMP-PREEMPT

Link: http://lkml.kernel.org/r/1328695537-15081-1-git-send-email-geunsik.lim@gmail.com

Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Geunsik Lim <geunsik.lim@samsung.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-13 09:14:47 -05:00
Linus Torvalds 3ec1e88b33 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Says Jens:

 "Time to push off some of the pending items.  I really wanted to wait
  until we had the regression nailed, but alas it's not quite there yet.
  But I'm very confident that it's "just" a missing expire on exit, so
  fix from Tejun should be fairly trivial.  I'm headed out for a week on
  the slopes.

  - Killing the barrier part of mtip32xx.  It doesn't really support
    barriers, and it doesn't need them (writes are fully ordered).

  - A few fixes from Dan Carpenter, preventing overflows of integer
    multiplication.

  - A fixup for loop, fixing a previous commit that didn't quite solve
    the partial read problem from Dave Young.

  - A bio integer overflow fix from Kent Overstreet.

  - Improvement/fix of the door "keep locked" part of the cdrom shared
    code from Paolo Benzini.

  - A few cfq fixes from Shaohua Li.

  - A fix for bsg sysfs warning when removing a file it did not create
    from Stanislaw Gruszka.

  - Two fixes for floppy from Vivek, preventing a crash.

  - A few block core fixes from Tejun.  One killing the over-optimized
    ioc exit path, cleaning that up nicely.  Two others fixing an oops
    on elevator switch, due to calling into the scheduler merge check
    code without holding the queue lock."

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix lockdep warning on io_context release put_io_context()
  relay: prevent integer overflow in relay_open()
  loop: zero fill bio instead of return -EIO for partial read
  bio: don't overflow in bio_get_nr_vecs()
  floppy: Fix a crash during rmmod
  floppy: Cleanup disk->queue before caling put_disk() if add_disk() was never called
  cdrom: move shared static to cdrom_device_info
  bsg: fix sysfs link remove warning
  block: don't call elevator callbacks for plug merges
  block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions
  mtip32xx: removed the irrelevant argument of mtip_hw_submit_io() and the unused member of struct driver_data
  block: strip out locking optimization in put_io_context()
  cdrom: use copy_to_user() without the underscores
  block: fix ioc locking warning
  block: fix NULL icq_cache reference
  block,cfq: change code order
2012-02-11 10:07:11 -08:00
Fernando Luis Vázquez Cao 86f5e6a7b1 watchdog: Fix code/comments mismatches
Reflect the change in the soft and hard lockup thresholds and
their relation to the frequency of the hrtimer and NMI events in
the code comments. While at it, remove references to files that
do not exist anymore.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/1328827342-6253-3-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-11 15:11:33 +01:00
Linus Torvalds ce2814f227 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Fix double start/stop in x86_pmu_start()
  perf evsel: Fix an issue where perf report fails to show the proper percentage
  perf tools: Fix prefix matching for kernel maps
  perf tools: Fix perf stack to non executable on x86_64
  perf: Remove deprecated WARN_ON_ONCE()
2012-02-10 09:05:07 -08:00
Dan Carpenter f6302f1bcd relay: prevent integer overflow in relay_open()
"subbuf_size" and "n_subbufs" come from the user and they need to be
capped to prevent an integer overflow.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-10 09:04:49 +01:00
Srivatsa S. Bhat a556d5b583 PM / Hibernate: Refactor and simplify freezer_test_done
The code related to 'freezer_test_done' is needlessly convoluted.
Refactor the code and simplify the implementation.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-02-09 23:56:01 +01:00
Srivatsa S. Bhat 51d6ff7acd PM / Hibernate: Thaw kernel threads in hibernation_snapshot() in error/test path
In the hibernation call path, the kernel threads are frozen inside
hibernation_snapshot(). If we happen to encounter an error further down
the road or if we are exiting early due to a successful freezer test,
then thaw kernel threads before returning to the caller.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-02-09 23:55:54 +01:00
Marcos Paulo de Souza 8916e3702e PM / Suspend: Avoid code duplication in suspend statistics update
The code
       if (error) {
               suspend_stats.fail++;
               dpm_save_failed_errno(error);
       } else
               suspend_stats.success++;

Appears in the kernel/power/main.c and kernel/power/suspend.c.

This patch just creates a new function to avoid duplicated code.

Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Marcos Paulo de Souza <marcos.mage@gmail.com>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-02-09 23:55:43 +01:00
Rafael J. Wysocki 3ed3c7b559 Merge commit 'pm-fixes-for-3.3-rc3' into pm-sleep
New material in the pm-sleep branch depends on recent
power management fixes.
2012-02-09 23:54:09 +01:00
Kees Cook 1a2a4d06e1 security: create task_free security callback
The current LSM interface to cred_free is not sufficient for allowing
an LSM to track the life and death of a task. This patch adds the
task_free hook so that an LSM can clean up resources on task death.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <jmorris@namei.org>
2012-02-10 09:14:51 +11:00
Stephane Eranian f39d47ff81 perf: Fix double start/stop in x86_pmu_start()
The following patch fixes a bug introduced by the following
commit:

        e050e3f0a7 ("perf: Fix broken interrupt rate throttling")

The patch caused the following warning to pop up depending on
the sampling frequency adjustments:

  ------------[ cut here ]------------
  WARNING: at arch/x86/kernel/cpu/perf_event.c:995 x86_pmu_start+0x79/0xd4()

It was caused by the following call sequence:

perf_adjust_freq_unthr_context.part() {
     stop()
     if (delta > 0) {
          perf_adjust_period() {
              if (period > 8*...) {
                  stop()
                  ...
                  start()
              }
          }
      }
      start()
}

Which caused a double start and a double stop, thus triggering
the assert in x86_pmu_start().

The patch fixes the problem by avoiding the double calls. We
pass a new argument to perf_adjust_period() to indicate whether
or not the event is already stopped. We can't just remove the
start/stop from that function because it's called from
__perf_event_overflow where the event needs to be reloaded via a
stop/start back-toback call.

The patch reintroduces the assertion in x86_pmu_start() which
was removed by commit:

	84f2b9b ("perf: Remove deprecated WARN_ON_ONCE()")

In this second version, we've added calls to disable/enable PMU
during unthrottling or frequency adjustment based on bug report
of spurious NMI interrupts from Eric Dumazet.

Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: markus@trippelsdorf.de
Cc: paulus@samba.org
Link: http://lkml.kernel.org/r/20120207133956.GA4932@quad
[ Minor edits to the changelog and to the code ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-07 16:58:56 +01:00
Tejun Heo 11a3122f6c block: strip out locking optimization in put_io_context()
put_io_context() performed a complex trylock dancing to avoid
deferring ioc release to workqueue.  It was also broken on UP because
trylock was always assumed to succeed which resulted in unbalanced
preemption count.

While there are ways to fix the UP breakage, even the most
pathological microbench (forced ioc allocation and tight fork/exit
loop) fails to show any appreciable performance benefit of the
optimization.  Strip it out.  If there turns out to be workloads which
are affected by this change, simpler optimization from the discussion
thread can be applied later.

Signed-off-by: Tejun Heo <tj@kernel.org>
LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-07 07:51:30 +01:00
Arnaldo Carvalho de Melo 5ddf146f70 Merge branch 'perf/urgent' into perf/core
So that we can get the perf bench exec stack fixes and then apply the
remaining fix for the files added after what is in perf/urgent.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2012-02-06 19:11:02 -02:00
Linus Torvalds 23783f817b Power management fixes for 3.3-rc3
Three power management regression fixes, one for a recent regression introcuded
 by the freezer changes during the 3.3 merge window and two for regressions
 in cpuidle (resulting from PM QoS changes) and in the hibernate user space
 interface, both introduced during the 3.2 development cycle.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJPLbiPAAoJEKhOf7ml8uNscvwQAJYYhSBL+ouK8ERS0OLkeEoB
 k4O1Ap0hb5Kv54Sr85WKEm5zGRDJXUxlWeMklo9K/fvs04CU1gsBb8jhDdbZ2ovE
 rnyybPjfieExQbLxX6nYIP4qKMLtnZvHhHpafuDSUz0RWq/7sCTiFI2htNj97gGu
 DzXYpeePFgvzG6AaznywWkvNdXoQfmsTC0adDrXWcuKXnNrH6h8o/OIB+pO70Szw
 gmU8SjVGGQjrlnuQ+Ku4WqbSyXs1bXlUkyTHJilg6CNJySrA/LUHhKPrRnP1i3Hu
 LxX/rsrTqohhD1tz1qQOpnMiu86FSez+UVA65b2cF3EqZbNROY2+O1/V+OlczKYy
 V9Q3rk+J4uRJtnL8DEgcniMGrRsjyle5USN5KDX50BkrC56h3mZirnEu1yaiMIJn
 K8NWI/4JdK7JbA6f2hXuPuesmudSP4uo8vuUzKthEUi88QReYXYSMcz/Fy/G9z8n
 JW7PimC5OmeTwYIqBcjZf+8j/1u6cHaEkvjPAJhIUgCR/ZVi6VFySnUByDD6JKTJ
 bQcUSqZZ8TvEc4A6JjG18/QfmWIZMErfuG0WAKb8sqtXoPkHKR/XXjbaXof9Oppn
 nRS5iJUaZGY4YivSHZZOFAk24ThqKx5ZK3qXq/dBbj9JwtJdc+++b9f0RwXUHjd9
 ECoM3bFtO8ewINmZ7wRQ
 =EKGs
 -----END PGP SIGNATURE-----

Merge tag 'pm-fixes-for-3.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Power management fixes for 3.3-rc3

Three power management regression fixes, one for a recent regression introcuded
by the freezer changes during the 3.3 merge window and two for regressions
in cpuidle (resulting from PM QoS changes) and in the hibernate user space
interface, both introduced during the 3.2 development cycle.

They include:

* Two hibernate (s2disk) regression fixes from Srivatsa S. Bhat (for
 regressions introduced during the 3.3 merge window and during the 3.2
 development cycle).

* A cpuidle fix from Venki Pallipadi for a regression resulting from PM QoS
 changes during the 3.2 development cycle causing cpuidle to work incorrectly
 for CONFIG_PM unset.

* tag 'pm-fixes-for-3.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / QoS: CPU C-state breakage with PM Qos change
  PM / Freezer: Thaw only kernel threads if freezing of kernel threads fails
  PM / Hibernate: Thaw kernel threads in SNAPSHOT_CREATE_IMAGE ioctl path
2012-02-04 15:21:39 -08:00
Srivatsa S. Bhat 379e0be812 PM / Freezer: Thaw only kernel threads if freezing of kernel threads fails
If freezing of kernel threads fails, we are expected to automatically
thaw tasks in the error recovery path. However, at times, we encounter
situations in which we would like the automatic error recovery path
to thaw only the kernel threads, because we want to be able to do
some more cleanup before we thaw userspace. Something like:

error = freeze_kernel_threads();
if (error) {
	/* Do some cleanup */

	/* Only then thaw userspace tasks*/
	thaw_processes();
}

An example of such a situation is where we freeze/thaw filesystems
during suspend/hibernation. There, if freezing of kernel threads
fails, we would like to thaw the frozen filesystems before thawing
the userspace tasks.

So, modify freeze_kernel_threads() to thaw only kernel threads in
case of freezing failure. And change suspend_freeze_processes()
accordingly. (At the same time, let us also get rid of the rather
cryptic usage of the conditional operator (:?) in that function.)

[rjw: In fact, this patch fixes a regression introduced during the
 3.3 merge window, because without it thaw_processes() may be called
 before swsusp_free() in some situations and that may lead to massive
 memory allocation failures.]

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-02-04 22:23:05 +01:00
Jiang Liu 55ca6140e9 kprobes: fix a memory leak in function pre_handler_kretprobe()
In function pre_handler_kretprobe(), the allocated kretprobe_instance
object will get leaked if the entry_handler callback returns non-zero.
This may cause all the preallocated kretprobe_instance objects exhausted.

This issue can be reproduced by changing
samples/kprobes/kretprobe_example.c to probe "mutex_unlock".  And the fix
is straightforward: just put the allocated kretprobe_instance object back
onto the free_instances list.

[akpm@linux-foundation.org: use raw_spin_lock/unlock]
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-03 16:16:41 -08:00
Cong Wang 2410574866 kernel/resource.c: move EXPORT_SYMBOL right after definition
EXPORT_SYMBOL(adjust_resource) should be right after adjust_resource().

Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-02-03 23:37:07 +01:00
Steven Rostedt f069686e4b tracing/softirq: Move __raise_softirq_irqoff() out of header
The __raise_softirq_irqoff() contains a tracepoint. As tracepoints in headers
can cause issues, and not to mention, bloats the kernel when they are
in a static inline, it is best to move the function that contains the
tracepoint out of the header and into softirq.c.

Link: http://lkml.kernel.org/r/20120118120711.GB14863@elte.hu

Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-03 09:48:19 -05:00
Jiri Olsa ac483c446b ftrace: Change filter/notrace set functions to return exit code
Currently the ftrace_set_filter and ftrace_set_notrace functions
do not return any return code. So there's no way for ftrace_ops
user to tell wether the filter was correctly applied.

The set_ftrace_filter interface returns error in case the filter
did not match:

  # echo krava > set_ftrace_filter
  bash: echo: write error: Invalid argument

Changing both ftrace_set_filter and ftrace_set_notrace functions
to return zero if the filter was applied correctly or -E* values
in case of error.

Link: http://lkml.kernel.org/r/1325495060-6402-2-git-send-email-jolsa@redhat.com

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-03 09:48:18 -05:00
John Stultz a80b83b7b8 Input: add infrastructure for selecting clockid for event time stamps
As noted by Arve and others, since wall time can jump backwards, it is
difficult to use for input because one cannot determine if one event
occurred before another or for how long a key was pressed.

However, the timestamp field is part of the kernel ABI, and cannot be
changed without possibly breaking existing users.

This patch adds a new IOCTL that allows a clockid to be set in the
evdev_client struct that will specify which time base to use for event
timestamps (ie: CLOCK_MONOTONIC instead of CLOCK_REALTIME).

For now we only support CLOCK_MONOTONIC and CLOCK_REALTIME, but
in the future we could support other clockids if appropriate.

The default remains CLOCK_REALTIME, so we don't change the ABI.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Daniel Kurtz <djkurtz@google.com>
Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
2012-02-03 00:24:58 -08:00
Christopher Yeoh 8cdb878dcb Fix race in process_vm_rw_core
This fixes the race in process_vm_core found by Oleg (see

  http://article.gmane.org/gmane.linux.kernel/1235667/

for details).

This has been updated since I last sent it as the creation of the new
mm_access() function did almost exactly the same thing as parts of the
previous version of this patch did.

In order to use mm_access() even when /proc isn't enabled, we move it to
kernel/fork.c where other related process mm access functions already
are.

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-02 12:55:17 -08:00
Linus Torvalds 2f2fde9272 Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus', 'sched-urgent-for-linus' and 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  bugs, x86: Fix printk levels for panic, softlockups and stack dumps

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf top: Fix number of samples displayed
  perf tools: Fix strlen() bug in perf_event__synthesize_event_type()
  perf tools: Fix broken build by defining _GNU_SOURCE in Makefile
  x86/dumpstack: Remove unneeded check in dump_trace()
  perf: Fix broken interrupt rate throttling

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSW
  sched: Fix ancient race in do_exit()
  sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug
  sched/s390: Fix compile error in sched/core.c
  sched: Fix rq->nr_uninterruptible update race

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/reboot: Remove VersaLogic Menlow reboot quirk
  x86/reboot: Skip DMI checks if reboot set by user
  x86: Properly parenthesize cmpxchg() macro arguments
2012-02-02 11:11:13 -08:00
Li Zefan 761b3ef50e cgroup: remove cgroup_subsys argument from callbacks
The argument is not used at all, and it's not necessary, because
a specific callback handler of course knows which subsys it
belongs to.

Now only ->pupulate() takes this argument, because the handlers of
this callback always call cgroup_add_file()/cgroup_add_files().

So we reduce a few lines of code, though the shrinking of object size
is minimal.

 16 files changed, 113 insertions(+), 162 deletions(-)

   text    data     bss     dec     hex filename
5486240  656987 7039960 13183187         c928d3 vmlinux.o.orig
5486170  656987 7039960 13183117         c9288d vmlinux.o

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-02-02 09:20:22 -08:00
Srivatsa S. Bhat fe9161db2e PM / Hibernate: Thaw kernel threads in SNAPSHOT_CREATE_IMAGE ioctl path
In the SNAPSHOT_CREATE_IMAGE ioctl, if the call to hibernation_snapshot()
fails, the frozen tasks are not thawed.

And in the case of success, if we happen to exit due to a successful freezer
test, all tasks (including those of userspace) are thawed, whereas actually
we should have thawed only the kernel threads at that point. Fix both these
issues.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@vger.kernel.org
2012-02-01 22:16:36 +01:00
Hiroshi Shimamoto ed387b781e sched: Move SMP-only variable into the SMP section
This also fixes the following compilation warning on !SMP:

  CC kernel/sched/fair.o
  kernel/sched/fair.c:218:36: warning: 'max_load_balance_interval' defined but not used [-Wunused-variable]

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4F2754A0.9090306@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-31 13:40:59 +01:00
Ingo Molnar bb1693f89a Merge branch 'perf/urgent' into perf/core
We cherry-picked 3 commits into perf/urgent, merge them back to allow
conflict-free work on those files.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-31 13:02:37 +01:00
Mandeep Singh Baines 61d1d219c4 cgroup: remove extra calls to find_existing_css_set
In cgroup_attach_proc, we indirectly call find_existing_css_set 3
times. It is an expensive call so we want to call it a minimum
of times. This patch only calls it once and stores the result so
that it can be used later on when we call cgroup_task_migrate.

This required modifying cgroup_task_migrate to take the new css_set
(which we obtained from find_css_set) as a parameter. The nice side
effect of this is that cgroup_task_migrate is now identical for
cgroup_attach_task and cgroup_attach_proc. It also now returns a
void since it can never fail.

Changes in V5:
* https://lkml.org/lkml/2012/1/20/344 (Tejun Heo)
  * Remove css_set_refs
Changes in V4:
* https://lkml.org/lkml/2011/12/22/421 (Li Zefan)
  * Avoid GFP_KERNEL (sleep) in rcu_read_lock by getting css_set in
    a separate loop not under an rcu_read_lock
Changes in V3:
* https://lkml.org/lkml/2011/12/22/13 (Li Zefan)
  * Fixed earlier bug by creating a seperate patch to remove tasklist_lock
Changes in V2:
* https://lkml.org/lkml/2011/12/20/372 (Tejun Heo)
  * Move find_css_set call into loop which creates the flex array
* Author
  * Kill css_set_refs and use group_size instead
  * Fix an off-by-one error in counting css_set refs
  * Add a retval check in out_list_teardown

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
2012-01-30 13:01:39 -08:00
Alex Frid d031e1de2c PM / QoS: Simplify PM QoS expansion/merge
- Replace class ID #define with enumeration
 - Loop through PM QoS objects during initialization (rather than
   initializing them one-by-one)

Signed-off-by: Alex Frid <afrid@nvidia.com>
Reviewed-by: Antti Miettinen <amiettinen@nvidia.com>
Reviewed-by: Diwakar Tundlam <dtundlam@nvidia.com>
Reviewed-by: Scott Williams <scwilliams@nvidia.com>
Reviewed-by: Yu-Huan Hsu <yhsu@nvidia.com>
Acked-by: markgross <markgross@thegnar.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-01-29 20:39:25 +01:00
Rafael J. Wysocki cf579dfb82 PM / Sleep: Introduce "late suspend" and "early resume" of devices
The current device suspend/resume phases during system-wide power
transitions appear to be insufficient for some platforms that want
to use the same callback routines for saving device states and
related operations during runtime suspend/resume as well as during
system suspend/resume.  In principle, they could point their
.suspend_noirq() and .resume_noirq() to the same callback routines
as their .runtime_suspend() and .runtime_resume(), respectively,
but at least some of them require device interrupts to be enabled
while the code in those routines is running.

It also makes sense to have device suspend-resume callbacks that will
be executed with runtime PM disabled and with device interrupts
enabled in case someone needs to run some special code in that
context during system-wide power transitions.

Apart from this, .suspend_noirq() and .resume_noirq() were introduced
as a workaround for drivers using shared interrupts and failing to
prevent their interrupt handlers from accessing suspended hardware.
It appears to be better not to use them for other porposes, or we may
have to deal with some serious confusion (which seems to be happening
already).

For the above reasons, introduce new device suspend/resume phases,
"late suspend" and "early resume" (and analogously for hibernation)
whose callback will be executed with runtime PM disabled and with
device interrupts enabled and whose callback pointers generally may
point to runtime suspend/resume routines.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Reviewed-by: Kevin Hilman <khilman@ti.com>
2012-01-29 20:38:29 +01:00
Rafael J. Wysocki 181e9bdef3 PM / Hibernate: Fix s2disk regression related to freezing workqueues
Commit 2aede851dd

  PM / Hibernate: Freeze kernel threads after preallocating memory

introduced a mechanism by which kernel threads were frozen after
the preallocation of hibernate image memory to avoid problems with
frozen kernel threads not responding to memory freeing requests.
However, it overlooked the s2disk code path in which the
SNAPSHOT_CREATE_IMAGE ioctl was run directly after SNAPSHOT_FREE,
which caused freeze_workqueues_begin() to BUG(), because it saw
that worqueues had been already frozen.

Although in principle this issue might be addressed by removing
the relevant BUG_ON() from freeze_workqueues_begin(), that would
reintroduce the very problem that commit 2aede851dd
attempted to avoid into that particular code path.  For this reason,
to fix the issue at hand, introduce thaw_kernel_threads() and make
the SNAPSHOT_FREE ioctl execute it.

Special thanks to Srivatsa S. Bhat for detailed analysis of the
problem.

Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: stable@kernel.org
2012-01-29 20:35:52 +01:00
Rakib Mullick 30fd049afc sched: Remove sched_switch
Currently we don't utilize the sched_switch field anymore.

But, simply removing sched_switch field from the middle of the
sched_stat output will break tools.

So, to stay compatible we hardcode it to zero and remove the
field from the scheduler data structures.

Update the schedstat documentation accordingly.

Signed-off-by: Rakib Mullick <rakib.mullick@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1327422836.27181.5.camel@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-27 13:28:53 +01:00
Vincent Guittot 4ec4412e1e sched: Ensure cpu_power periodic update
With a lot of small tasks, the softirq sched is nearly never called
when no_hz is enabled. In this case load_balance() is mainly called
with the newly_idle mode which doesn't update the cpu_power.

Add a next_update field which ensure a maximum update period when
there is short activity.

Having stale cpu_power information can skew the load-balancing
decisions, this is cured by the guaranteed update.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1323717668-2143-1-git-send-email-vincent.guittot@linaro.org
2012-01-27 13:28:49 +01:00
Peter Zijlstra 39be350127 sched, block: Unify cache detection
The block layer has some code trying to determine if two CPUs share a
cache, the scheduler has a similar function. Expose the function used
by the scheduler and make the block layer use it, thereby removing the
block layers usage of CONFIG_SCHED* and topology bits.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jens Axboe <axboe@kernel.dk>
Link: http://lkml.kernel.org/r/1327579450.2446.95.camel@twins
2012-01-27 13:28:48 +01:00
Chanho Min cb297a3e43 sched/rt: Fix task stack corruption under __ARCH_WANT_INTERRUPTS_ON_CTXSW
This issue happens under the following conditions:

 1. preemption is off
 2. __ARCH_WANT_INTERRUPTS_ON_CTXSW is defined
 3. RT scheduling class
 4. SMP system

Sequence is as follows:

 1.suppose current task is A. start schedule()
 2.task A is enqueued pushable task at the entry of schedule()
   __schedule
    prev = rq->curr;
    ...
    put_prev_task
     put_prev_task_rt
      enqueue_pushable_task
 4.pick the task B as next task.
   next = pick_next_task(rq);
 3.rq->curr set to task B and context_switch is started.
   rq->curr = next;
 4.At the entry of context_swtich, release this cpu's rq->lock.
   context_switch
    prepare_task_switch
     prepare_lock_switch
      raw_spin_unlock_irq(&rq->lock);
 5.Shortly after rq->lock is released, interrupt is occurred and start IRQ context
 6.try_to_wake_up() which called by ISR acquires rq->lock
    try_to_wake_up
     ttwu_remote
      rq = __task_rq_lock(p)
      ttwu_do_wakeup(rq, p, wake_flags);
        task_woken_rt
 7.push_rt_task picks the task A which is enqueued before.
   task_woken_rt
    push_rt_tasks(rq)
     next_task = pick_next_pushable_task(rq)
 8.At find_lock_lowest_rq(), If double_lock_balance() returns 0,
   lowest_rq can be the remote rq.
  (But,If preemption is on, double_lock_balance always return 1 and it
   does't happen.)
   push_rt_task
    find_lock_lowest_rq
     if (double_lock_balance(rq, lowest_rq))..
 9.find_lock_lowest_rq return the available rq. task A is migrated to
   the remote cpu/rq.
   push_rt_task
    ...
    deactivate_task(rq, next_task, 0);
    set_task_cpu(next_task, lowest_rq->cpu);
    activate_task(lowest_rq, next_task, 0);
 10. But, task A is on irq context at this cpu.
     So, task A is scheduled by two cpus at the same time until restore from IRQ.
     Task A's stack is corrupted.

To fix it, don't migrate an RT task if it's still running.

Signed-off-by: Chanho Min <chanho.min@lge.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/CAOAMb1BHA=5fm7KTewYyke6u-8DP0iUuJMpgQw54vNeXFsGpoQ@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-27 12:49:41 +01:00
Ingo Molnar 44a6839711 Merge branch 'perf/fast' into perf/core
Merge reason: Lets ready it for v3.4

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-27 12:08:09 +01:00
Stephane Eranian e050e3f0a7 perf: Fix broken interrupt rate throttling
This patch fixes the sampling interrupt throttling mechanism.

It was broken in v3.2. Events were not being unthrottled. The
unthrottling mechanism required that events be checked at each
timer tick.

This patch solves this problem and also separates:

  - unthrottling
  - multiplexing
  - frequency-mode period adjustments

Not all of them need to be executed at each timer tick.

This third version of the patch is based on my original patch +
PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).

At each timer tick, for each context:

  - if the current CPU has throttled events, we unthrottle events

  - if context has frequency-based events, we adjust sampling periods

  - if we have reached the jiffies interval, we multiplex (rotate)

We decoupled rotation (multiplexing) from frequency-mode sampling
period adjustments.  They should not necessarily happen at the same
rate. Multiplexing is subject to jiffies_interval (currently at 1
but could be higher once the tunable is exposed via sysfs).

We have grouped frequency-mode adjustment and unthrottling into the
same routine to minimize code duplication. When throttled while in
frequency mode, we scan the events only once.

We have fixed the threshold enforcement code in __perf_event_overflow().
There was a bug whereby it would allow more than the authorized rate
because an increment of hwc->interrupts was not executed at the right
place.

The patch was tested with low sampling limit (2000) and fixed periods,
frequency mode, overcommitted PMU.

On a 2.1GHz AMD CPU:

 $ cat /proc/sys/kernel/perf_event_max_sample_rate
 2000

We set a rate of 3000 samples/sec (2.1GHz/3000 = 700000):

 $ perf record -e cycles,cycles -c 700000  noploop 10
 $ perf report -D | tail -21

 Aggregated stats:
           TOTAL events:      80086
            MMAP events:         88
            COMM events:          2
            EXIT events:          4
        THROTTLE events:      19996
      UNTHROTTLE events:      19996
          SAMPLE events:      40000

 cycles stats:
           TOTAL events:      40006
            MMAP events:          5
            COMM events:          1
            EXIT events:          4
        THROTTLE events:       9998
      UNTHROTTLE events:       9998
          SAMPLE events:      20000

 cycles stats:
           TOTAL events:      39996
        THROTTLE events:       9998
      UNTHROTTLE events:       9998
          SAMPLE events:      20000

For 10s, the cap is 2x2000x10 = 40000 samples.
We get exactly that: 20000 samples/event.

Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: <stable@kernel.org> # v3.2+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120126160319.GA5655@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-27 12:06:39 +01:00
Ingo Molnar 801493c2e2 Merge branch 'linus' into perf/core
Merge reason: pick up fixes that are upstream, we need them
              because we'll apply dependent patches.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-27 12:01:43 +01:00
Yasunori Goto b5740f4b2c sched: Fix ancient race in do_exit()
try_to_wake_up() has a problem which may change status from TASK_DEAD to
TASK_RUNNING in race condition with SMI or guest environment of virtual
machine. As a result, exited task is scheduled() again and panic occurs.

Here is the sequence how it occurs:

 ----------------------------------+-----------------------------
                                   |
            CPU A                  |             CPU B
 ----------------------------------+-----------------------------

TASK A calls exit()....

do_exit()

  exit_mm()
    down_read(mm->mmap_sem);

    rwsem_down_failed_common()

      set TASK_UNINTERRUPTIBLE
      set waiter.task <= task A
      list_add to sem->wait_list
           :
      raw_spin_unlock_irq()
      (I/O interruption occured)

                                      __rwsem_do_wake(mmap_sem)

                                        list_del(&waiter->list);
                                        waiter->task = NULL
                                        wake_up_process(task A)
                                          try_to_wake_up()
                                             (task is still
                                               TASK_UNINTERRUPTIBLE)
                                              p->on_rq is still 1.)

                                              ttwu_do_wakeup()
                                                 (*A)
                                                   :
     (I/O interruption handler finished)

      if (!waiter.task)
          schedule() is not called
          due to waiter.task is NULL.

      tsk->state = TASK_RUNNING

          :
                                              check_preempt_curr();
                                                  :
  task->state = TASK_DEAD
                                              (*B)
                                        <---    set TASK_RUNNING (*C)

     schedule()
     (exit task is running again)
     BUG_ON() is called!
 --------------------------------------------------------

The execution time between (*A) and (*B) is usually very short,
because the interruption is disabled, and setting TASK_RUNNING at (*C)
must be executed before setting TASK_DEAD.

HOWEVER, if SMI is interrupted between (*A) and (*B),
(*C) is able to execute AFTER setting TASK_DEAD!
Then, exited task is scheduled again, and BUG_ON() is called....

If the system works on guest system of virtual machine, the time
between (*A) and (*B) may be also long due to scheduling of hypervisor,
and same phenomenon can occur.

By this patch, do_exit() waits for releasing task->pi_lock which is used
in try_to_wake_up(). It guarantees the task becomes TASK_DEAD after
waking up.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120117174031.3118.E1E9C6FF@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-27 11:55:36 +01:00
Thomas Gleixner cc06268c6a time: Move common updates to a function
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-01-26 19:44:31 -08:00
Thomas Gleixner 058892e632 time: Reorder so the hot data is together
Keep all the interesting data in a single cache line.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-01-26 19:44:29 -08:00
John Stultz 92c1d3ed4d time: Remove most of xtime_lock usage in timekeeping.c
Now that ntp.c's locking is reworked, we can remove most
of the xtime_lock usage in timekeeping.c

The remaining xtime_lock presence is really for jiffies access
and the global load calculation.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-01-26 19:44:27 -08:00
John Stultz bd3312681f ntp: Add ntp_lock to replace xtime_locking
Use a ntp_lock spin lock to replace xtime_lock locking in ntp.c

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-01-26 19:44:25 -08:00
John Stultz ea7cf49a76 ntp: Access tick_length variable via ntp_tick_length()
Currently the NTP managed tick_length value is accessed globally,
in preparations for locking cleanups, make sure it is accessed via
a function and mark it as static.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-01-26 19:44:23 -08:00
John Stultz 8357929e6a ntp: Cleanup timex.h
Move ntp_sycned to ntp.c and mark time_status as static.
Also yank function declaration for non-existant function.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-01-26 19:44:20 -08:00
John Stultz 70471f2f06 time: Add timekeeper lock
Now that all the timekeeping variables are stored in
the timekeeper structure, add a new lock to protect the
structure.

For now, this lock nests under the xtime_lock for writes.

For readers, we don't need to take xtime_lock anymore.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-01-26 19:44:18 -08:00
John Stultz 8fcce546be time: Cleanup global variables and move them to the top
Move global xtime_lock and timekeeping_suspended values up
to the top of timekeeping.c

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-01-26 19:44:16 -08:00
John Stultz 01f71b47e0 time: Move raw_time into timekeeper structure
In preparation for locking cleanups, move raw_time into
timekeeper structure.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-01-26 19:44:14 -08:00
John Stultz 8ff2cb92dd time: Move xtime into timekeeeper structure
In preparation for locking cleanups, move xtime into
timekeeper structure.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-01-26 19:44:12 -08:00
John Stultz d9f7217aac time: Move wall_to_monotonic into the timekeeper structure
In preparation for locking cleanups, move wall_to_monotonic
into the timekeeper structure.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-01-26 19:44:09 -08:00
John Stultz 00c5fb774e time: Move total_sleep_time into the timekeeper structure
Move total_sleep_time into the timekeeper structure in preparation
for locking cleanups

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-01-26 19:44:07 -08:00
Linus Torvalds b57cea5e33 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Call perf_cgroup_event_time() directly
  perf: Don't call release_callchain_buffers() if allocation fails
2012-01-26 12:45:57 -08:00
Linus Torvalds 2437dcbf55 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rcu: Add missing __cpuinit annotation in rcutorture code
  sched: Add "const" to is_idle_task() parameter
  rcu: Make rcutorture bool parameters really bool (core code)
  memblock: Fix alloc failure due to dumb underflow protection in memblock_find_in_range_node()
2012-01-26 12:45:41 -08:00
Prarit Bhargava b0f4c4b32c bugs, x86: Fix printk levels for panic, softlockups and stack dumps
rsyslog will display KERN_EMERG messages on a connected
terminal.  However, these messages are useless/undecipherable
for a general user.

For example, after a softlockup we get:

 Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
 kernel:Stack:

 Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
 kernel:Call Trace:

 Message from syslogd@intel-s3e37-04 at Jan 25 14:18:06 ...
 kernel:Code: ff ff a8 08 75 25 31 d2 48 8d 86 38 e0 ff ff 48 89
 d1 0f 01 c8 0f ae f0 48 8b 86 38 e0 ff ff a8 08 75 08 b1 01 4c 89 e0 0f 01 c9 <e8> ea 69 dd ff 4c 29 e8 48 89 c7 e8 0f bc da ff 49 89 c4 49 89

This happens because the printk levels for these messages are
incorrect. Only an informational message should be displayed on
a terminal.

I modified the printk levels for various messages in the kernel
and tested the output by using the drivers/misc/lkdtm.c kernel
modules (ie, softlockups, panics, hard lockups, etc.) and
confirmed that the console output was still the same and that
the output to the terminals was correct.

For example, in the case of a softlockup we now see the much
more informative:

 Message from syslogd@intel-s3e37-04 at Jan 25 10:18:06 ...
 BUG: soft lockup - CPU4 stuck for 60s!

instead of the above confusing messages.

AFAICT, the messages no longer have to be KERN_EMERG.  In the
most important case of a panic we set console_verbose().  As for
the other less severe cases the correct data is output to the
console and /var/log/messages.

Successfully tested by me using the drivers/misc/lkdtm.c module.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: dzickus@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1327586134-11926-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 21:28:45 +01:00
Suresh Siddha 71325960d1 sched/nohz: Fix nohz cpu idle load balancing state with cpu hotplug
With the recent nohz scheduler changes, rq's nohz flag
'NOHZ_TICK_STOPPED' and its associated state doesn't get cleared
immediately after the cpu exits idle. This gets cleared as part
of the next tick seen on that cpu.

For the cpu offline support, we need to clear this state
manually. Fix it by registering a cpu notifier, which clears the
nohz idle load balance state for this rq explicitly during the
CPU_DYING notification.

There won't be any nohz updates for that cpu, after the
CPU_DYING notification. But lets be extra paranoid and skip
updating the nohz state in the select_nohz_load_balancer() if
the cpu is not in active state anymore.

Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Reviewed-and-tested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1327026538.16150.40.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 19:38:13 +01:00
Christian Borntraeger db7e527da4 sched/s390: Fix compile error in sched/core.c
Commit 029632fbb7 ("sched: Make
separate sched*.c translation units") removed the include of
asm/mutex.h from sched.c.

This breaks the combination of:

 CONFIG_MUTEX_SPIN_ON_OWNER=yes
 CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX=yes

like s390 without mutex debugging:

  CC      kernel/sched/core.o
  kernel/sched/core.c: In function ‘mutex_spin_on_owner’:
  kernel/sched/core.c:3287: error: implicit declaration of function ‘arch_mutex_cpu_relax’

Lets re-add the include to kernel/sched/core.c

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1326268696-30904-1-git-send-email-borntraeger@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 19:38:11 +01:00
Peter Zijlstra 4ca9b72b71 sched: Fix rq->nr_uninterruptible update race
KOSAKI Motohiro noticed the following race:

 > CPU0                    CPU1
 > --------------------------------------------------------
 > deactivate_task()
 >                         task->state = TASK_UNINTERRUPTIBLE;
 > activate_task()
 >    rq->nr_uninterruptible--;
 >
 >                         schedule()
 >                           deactivate_task()
 >                             rq->nr_uninterruptible++;
 >

Kosaki-San's scenario is possible when CPU0 runs
__sched_setscheduler() against CPU1's current @task.

__sched_setscheduler() does a dequeue/enqueue in order to move
the task to its new queue (position) to reflect the newly provided
scheduling parameters. However it should be completely invariant to
nr_uninterruptible accounting, sched_setscheduler() doesn't affect
readyness to run, merely policy on when to run.

So convert the inappropriate activate/deactivate_task usage to
enqueue/dequeue_task, which avoids the nr_uninterruptible accounting.

Also convert the two other sites: __migrate_task() and
normalize_task() that still use activate/deactivate_task. These sites
aren't really a problem since __migrate_task() will only be called on
non-running task (and therefore are immume to the described problem)
and normalize_task() isn't ever used on regular systems.

Also remove the comments from activate/deactivate_task since they're
misleading at best.

Reported-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1327486224.2614.45.camel@laptop
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-26 19:38:09 +01:00
Ingo Molnar 96070c83b2 Merge branch 'sigtrace' of git://github.com/utrace/linux into perf/core 2012-01-26 11:09:19 +01:00
Rob Herring 2ed86b16ea irq: make SPARSE_IRQ an optionally hidden option
On ARM, we don't want SPARSE_IRQ to be a user visible option. Make
SPARSE_IRQ visible based on MAY_HAVE_SPARSE_IRQ instead of depending
on HAVE_SPARSE_IRQ.

With this, SPARSE_IRQ is not visible on C6X and ARM.

Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-sh@vger.kernel.org
2012-01-25 20:37:42 -06:00
Eric W. Biederman 1f87f0b52b sysctl: Move the implementation into fs/proc/proc_sysctl.c
Move the core sysctl code from kernel/sysctl.c and kernel/sysctl_check.c
into fs/proc/proc_sysctl.c.

Currently sysctl maintenance is hampered by the sysctl implementation
being split across 3 files with artificial layering between them.
Consolidate the entire sysctl implementation into 1 file so that
it is easier to see what is going on and hopefully allowing for
simpler maintenance.

For functions that are now only used in fs/proc/proc_sysctl.c remove
their declarations from sysctl.h and make them static in fs/proc/proc_sysctl.c

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:54 -08:00
Eric W. Biederman de4e83bd6b sysctl: Register the base sysctl table like any other sysctl table.
Simplify the code by treating the base sysctl table like any other
sysctl table and register it with register_sysctl_table.

To ensure this table is registered early enough to avoid problems
call sysctl_init from proc_sys_init.

Rename sysctl_net.c:sysctl_init() to net_sysctl_init() to avoid
name conflicts now that kernel/sysctl.c:sysctl_init() is no longer
static.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:54 -08:00
Eric W. Biederman 0ce8974d50 sysctl: Consolidate !CONFIG_SYSCTL handling
- In sysctl.h move functions only available if CONFIG_SYSCL
  is defined inside of #ifdef CONFIG_SYSCTL

- Move the stub function definitions for !CONFIG_SYSCTL
  into sysctl.h and make them static inlines.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-01-24 16:37:54 -08:00
Linus Torvalds 701b259f44 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Davem says:

1) Fix JIT code generation on x86-64 for divide by zero, from Eric Dumazet.

2) tg3 header length computation correction from Eric Dumazet.

3) More build and reference counting fixes for socket memory cgroup
   code from Glauber Costa.

4) module.h snuck back into a core header after all the hard work we
   did to remove that, from Paul Gortmaker and Jesper Dangaard Brouer.

5) Fix PHY naming regression and add some new PCI IDs in stmmac, from
   Alessandro Rubini.

6) Netlink message generation fix in new team driver, should only advertise
   the entries that changed during events, from Jiri Pirko.

7) SRIOV VF registration and unregistration fixes, and also add a
   missing PCI ID, from Roopa Prabhu.

8) Fix infinite loop in tx queue flush code of brcmsmac, from Stanislaw Gruszka.

9) ftgmac100/ftmac100 build fix, missing interrupt.h include.

10) Memory leak fix in net/hyperv do_set_mutlicast() handling, from Wei Yongjun.

11) Off by one fix in netem packet scheduler, from Vijay Subramanian.

12) TCP loss detection fix from Yuchung Cheng.

13) TCP reset packet MD5 calculation uses wrong address, fix from Shawn Lu.

14) skge carrier assertion and DMA mapping fixes from Stephen Hemminger.

15) Congestion recovery undo performed at the wrong spot in BIC and CUBIC
    congestion control modules, fix from Neal Cardwell.

16) Ethtool ETHTOOL_GSSET_INFO is unnecessarily restrictive, from Michał Mirosław.

17) Fix triggerable race in ipv6 sysctl handling, from Francesco Ruggeri.

18) Statistics bug fixes in mlx4 from Eugenia Emantayev.

19) rds locking bug fix during info dumps, from your's truly.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
  rds: Make rds_sock_lock BH rather than IRQ safe.
  netprio_cgroup.h: dont include module.h from other includes
  net: flow_dissector.c missing include linux/export.h
  team: send only changed options/ports via netlink
  net/hyperv: fix possible memory leak in do_set_multicast()
  drivers/net: dsa/mv88e6xxx.c files need linux/module.h
  stmmac: added PCI identifiers
  llc: Fix race condition in llc_ui_recvmsg
  stmmac: fix phy naming inconsistency
  dsa: Add reporting of silicon revision for Marvell 88E6123/88E6161/88E6165 switches.
  tg3: fix ipv6 header length computation
  skge: add byte queue limit support
  mv643xx_eth: Add Rx Discard and Rx Overrun statistics
  bnx2x: fix compilation error with SOE in fw_dump
  bnx2x: handle CHIP_REVISION during init_one
  bnx2x: allow user to change ring size in ISCSI SD mode
  bnx2x: fix Big-Endianess in ethtool -t
  bnx2x: fixed ethtool statistics for MF modes
  bnx2x: credit-leakage fixup on vlan_mac_del_all
  macvlan: fix a possible use after free
  ...
2012-01-24 15:51:40 -08:00
Linus Torvalds ac1e3d4f5c Power management fixes for 3.3
Two fixes for regressions introduced during the merge window, one fix for
 a long-standing obscure issue in the computation of hibernate image size
 and two small PM documentation fixes.
 
 Namhyung Kim (1):
       PM / Hibernate: Correct additional pages number calculation
 
 Srivatsa S. Bhat (1):
       PM / Hibernate: Rewrite unlock_system_sleep() to fix s2disk regression
 
 Tetsuo Handa (1):
       PM / Sleep: Fix read_unlock_usermodehelper() call.
 
 Viresh Kumar (2):
       PM / Documentation: Fix spelling mistake in basic-pm-debugging.txt
       PM / Documentation: Fix minor issue in freezing_of_tasks.txt
 
  Documentation/power/basic-pm-debugging.txt |    2 +-
  Documentation/power/freezing-of-tasks.txt  |    8 ++++----
  drivers/base/firmware_class.c              |    3 +--
  include/linux/suspend.h                    |   19 +++++++++++++++++--
  kernel/power/snapshot.c                    |    3 ++-
  5 files changed, 25 insertions(+), 10 deletions(-)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJPHdAnAAoJEKhOf7ml8uNsMOgQAJVw9lUK1grMRoWILq/oAIAy
 jWQfcVmLq04T32ZTAqohocMsJHTRRgHzhP60zWmiUtpiikO7UJT/sMPt1vSIMxOg
 83dT0s4VLCrHjpKW2J3UbyUq9CGlXjMchkwt5ZwCtQtGoNNBg3Q8ZCtS6ksUh7+c
 1xnUANBdHXzoz8gp9zemoW05vZu7KFX5SWuWYv+5FLd7gb0GpOFZksBAgQ7KAOwT
 jk2sx20yeXoCFshWTVIaDQa9ydQxGyfWu7ZoTst3fNJvpwXwozEFCSro/WsF4FFQ
 fwOl8hU6Ptaogy6VuzQZ8B9XNKxgzcSDfLqvmynUAzWbwZkgW7V9EA0cRmnyBI4C
 Mmus5npRq+zC09e82noPDI8Y+gOMe70u/pHPL/9aSiK8D97baQd7JCwV9iCf3bxr
 h+woLfEF3maMV0vdfzFz6Zh435oO668ZGUUxqSpFIKyJ2uVcCKRrpEo47N1xKChr
 msH8sSLl4yrVYg0XUhN62qOrPiMvWw0i6httrXXOBMs4jVEBNrdSUSeAjn269gsx
 xaDZS9teZNnn3Vf/lJdPvPC7dvz394mOkWOq707S/D/hHdA96dYii5PET366FNjE
 O0jgCrlbRLnFmcngeBdklxu2Zf1GIhSw+Lk+LdwnEloSZ4RKMjUy0vf4l5shDS02
 N/e4vEbwR68dN8Hg5evl
 =dgnH
 -----END PGP SIGNATURE-----

Merge tag 'pm-fixes-for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Power management fixes for 3.3

Two fixes for regressions introduced during the merge window, one fix for
a long-standing obscure issue in the computation of hibernate image size
and two small PM documentation fixes.

* tag 'pm-fixes-for-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / Sleep: Fix read_unlock_usermodehelper() call.
  PM / Hibernate: Rewrite unlock_system_sleep() to fix s2disk regression
  PM / Hibernate: Correct additional pages number calculation
  PM / Documentation: Fix minor issue in freezing_of_tasks.txt
  PM / Documentation: Fix spelling mistake in basic-pm-debugging.txt
2012-01-23 15:11:27 -08:00
Linus Torvalds a99cbf6b43 Merge branch 'kernel-doc' from Randy Dunlap
The usual kernel-doc fixups from Randy.  Some of them David acked as
merged in his tree, this is the random left-overs.

* kernel-doc:
  docbook: fix sched source file names in device-drivers book
  docbook: change iomap source filename in deviceiobook
  docbook: don't use serial_core.h in device-drivers book
  kernel-doc: fix kernel-doc warnings in sched
  kernel-doc: fix new warnings in cfg80211.h
  kernel-doc: fix new warning in usb.h
  kernel-doc: fix new warnings in device.h
  kernel-doc: fix new warnings in debugfs
  kernel-doc: fix new warning in regulator core
  kernel-doc: fix new warnings in pci
  kernel-doc: fix new warnings in driver-core
  kernel-doc: fix new warnings in auditsc.c
  scripts/kernel-doc: fix fatal error caused by cfg80211.h
2012-01-23 10:08:08 -08:00
Randy Dunlap fa757281a0 kernel-doc: fix kernel-doc warnings in sched
Fix new kernel-doc notation warnings:

Warning(include/linux/sched.h:2094): No description found for parameter 'p'
Warning(include/linux/sched.h:2094): Excess function parameter 'tsk' description in 'is_idle_task'
Warning(kernel/sched/cpupri.c:139): No description found for parameter 'newpri'
Warning(kernel/sched/cpupri.c:139): Excess function parameter 'pri' description in 'cpupri_set'
Warning(kernel/sched/cpupri.c:208): Excess function parameter 'bootmem' description in 'cpupri_init'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc:	Ingo Molnar <mingo@elte.hu>
Cc:	Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-23 08:44:54 -08:00
Randy Dunlap 42ae610c1a kernel-doc: fix new warnings in auditsc.c
Fix new kernel-doc warnings in auditsc.c:

Warning(kernel/auditsc.c:1875): No description found for parameter 'success'
Warning(kernel/auditsc.c:1875): No description found for parameter 'return_code'
Warning(kernel/auditsc.c:1875): Excess function parameter 'pt_regs' description in '__audit_syscall_exit'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-23 08:44:53 -08:00
Ananth N Mavinakayanahalli d496aab567 kprobes: initialize before using a hlist
Commit ef53d9c5e ("kprobes: improve kretprobe scalability with hashed
locking") introduced a bug where we can potentially leak
kretprobe_instances since we initialize a hlist head after having used
it.

Initialize the hlist head before using it.

Reported by: Jim Keniston <jkenisto@us.ibm.com>
Acked-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Srinivasa D S <srinivasa@in.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-23 08:38:48 -08:00
Glauber Costa 0e90b31f4b net: introduce res_counter_charge_nofail() for socket allocations
There is a case in __sk_mem_schedule(), where an allocation
is beyond the maximum, but yet we are allowed to proceed.
It happens under the following condition:

	sk->sk_wmem_queued + size >= sk->sk_sndbuf

The network code won't revert the allocation in this case,
meaning that at some point later it'll try to do it. Since
this is never communicated to the underlying res_counter
code, there is an inbalance in res_counter uncharge operation.

I see two ways of fixing this:

1) storing the information about those allocations somewhere
   in memcg, and then deducting from that first, before
   we start draining the res_counter,
2) providing a slightly different allocation function for
   the res_counter, that matches the original behavior of
   the network code more closely.

I decided to go for #2 here, believing it to be more elegant,
since #1 would require us to do basically that, but in a more
obscure way.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
CC: Tejun Heo <tj@kernel.org>
CC: Li Zefan <lizf@cn.fujitsu.com>
CC: Laurent Chavey <chavey@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-22 15:08:46 -05:00
Namhyung Kim 46cd6a7f68 perf: Call perf_cgroup_event_time() directly
The perf_event_time() will call perf_cgroup_event_time()
if @event is a cgroup event. Just do it directly and avoid
the extra check..

Signed-off-by: Namhyung Kim <namhyung.kim@lge.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/1327021966-27688-2-git-send-email-namhyung.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-21 09:33:42 +01:00
Namhyung Kim fd45c15f13 perf: Don't call release_callchain_buffers() if allocation fails
When alloc_callchain_buffers() fails, it frees all of
entries before return. In addition, calling the
release_callchain_buffers() will cause a NULL pointer
dereference since callchain_cpu_entries is not set.

Signed-off-by: Namhyung Kim <namhyung.kim@lge.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Link: http://lkml.kernel.org/r/1327021966-27688-1-git-send-email-namhyung.kim@lge.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-21 09:33:41 +01:00
Mandeep Singh Baines fb5d2b4cfc cgroup: replace tasklist_lock with rcu_read_lock
We can replace the tasklist_lock in cgroup_attach_proc with an
rcu_read_lock().

Changes in V4:
* https://lkml.org/lkml/2011/12/23/284 (Frederic Weisbecker)
  * Minimize size of rcu_read_lock critical section
  * Add comment
* https://lkml.org/lkml/2011/12/26/136 (Li Zefan)
  * Split into two patches
Changes in V3:
* https://lkml.org/lkml/2011/12/22/419 (Frederic Weisbecker)
  * Add an rcu_read_lock to protect against exit
Changes in V2:
* https://lkml.org/lkml/2011/12/22/86 (Tejun Heo)
  * Use a goto instead of returning -EAGAIN

Suggested-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
2012-01-20 15:58:13 -08:00
Mandeep Singh Baines b78949ebfb cgroup: simplify double-check locking in cgroup_attach_proc
To keep the complexity of the double-check locking in one place, move
the thread_group_leader check up into attach_task_by_pid().  This
allows us to use a goto instead of returning -EAGAIN.

While at it, convert a couple of returns to gotos and use rcu for the
!pid case also in order to simplify the logic.

Changes in V2:
* https://lkml.org/lkml/2011/12/22/86 (Tejun Heo)
  * Use a goto instead of returning -EAGAIN

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
2012-01-20 15:58:13 -08:00
Li Zefan 245282557c cgroup: move struct cgroup_pidlist out from the header file
It's internally used only.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2012-01-20 09:30:08 -08:00
Linus Torvalds 567e47935a Merge branches 'sched-urgent-for-linus', 'perf-urgent-for-linus' and 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/accounting, proc: Fix /proc/stat interrupts sum

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tracepoints/module: Fix disabling tracepoints with taint CRAP or OOT
  x86/kprobes: Add arch/x86/tools/insn_sanity to .gitignore
  x86/kprobes: Fix typo transferred from Intel manual

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, syscall: Need __ARCH_WANT_SYS_IPC for 32 bits
  x86, tsc: Fix SMI induced variation in quick_pit_calibrate()
  x86, opcode: ANDN and Group 17 in x86-opcode-map.txt
  x86/kconfig: Move the ZONE_DMA entry under a menu
  x86/UV2: Add accounting for BAU strong nacks
  x86/UV2: Ack BAU interrupt earlier
  x86/UV2: Remove stale no-resources test for UV2 BAU
  x86/UV2: Work around BAU bug
  x86/UV2: Fix BAU destination timeout initialization
  x86/UV2: Fix new UV2 hardware by using native UV2 broadcast mode
  x86: Get rid of dubious one-bit signed bitfield
2012-01-19 14:53:06 -08:00
Namhyung Kim 160cb5a97d PM / Hibernate: Correct additional pages number calculation
The struct bm_block is allocated by chain_alloc(),
so it'd better counting it in LINKED_PAGE_DATA_SIZE.

Signed-off-by: Namhyung Kim <namhyung.kim@lge.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-01-19 23:23:10 +01:00
Linus Torvalds f429ee3b80 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit: (29 commits)
  audit: no leading space in audit_log_d_path prefix
  audit: treat s_id as an untrusted string
  audit: fix signedness bug in audit_log_execve_info()
  audit: comparison on interprocess fields
  audit: implement all object interfield comparisons
  audit: allow interfield comparison between gid and ogid
  audit: complex interfield comparison helper
  audit: allow interfield comparison in audit rules
  Kernel: Audit Support For The ARM Platform
  audit: do not call audit_getname on error
  audit: only allow tasks to set their loginuid if it is -1
  audit: remove task argument to audit_set_loginuid
  audit: allow audit matching on inode gid
  audit: allow matching on obj_uid
  audit: remove audit_finish_fork as it can't be called
  audit: reject entry,always rules
  audit: inline audit_free to simplify the look of generic code
  audit: drop audit_set_macxattr as it doesn't do anything
  audit: inline checks for not needing to collect aux records
  audit: drop some potentially inadvisable likely notations
  ...

Use evil merge to fix up grammar mistakes in Kconfig file.

Bad speling and horrible grammar (and copious swearing) is to be
expected, but let's keep it to commit messages and comments, rather than
expose it to users in config help texts or printouts.
2012-01-17 16:41:31 -08:00
Kees Cook c158a35c8a audit: no leading space in audit_log_d_path prefix
audit_log_d_path() injects an additional space before the prefix,
which serves no purpose and doesn't mix well with other audit_log*()
functions that do not sneak extra characters into the log.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:04 -05:00
Xi Wang 5afb8a3f96 audit: fix signedness bug in audit_log_execve_info()
In the loop, a size_t "len" is used to hold the return value of
audit_log_single_execve_arg(), which returns -1 on error.  In that
case the error handling (len <= 0) will be bypassed since "len" is
unsigned, and the loop continues with (p += len) being wrapped.
Change the type of "len" to signed int to fix the error handling.

	size_t len;
	...
	for (...) {
		len = audit_log_single_execve_arg(...);
		if (len <= 0)
			break;
		p += len;
	}

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:03 -05:00
Peter Moody 10d6836087 audit: comparison on interprocess fields
This allows audit to specify rules in which we compare two fields of a
process.  Such as is the running process uid != to the running process
euid?

Signed-off-by: Peter Moody <pmoody@google.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:03 -05:00
Peter Moody 4a6633ed08 audit: implement all object interfield comparisons
This completes the matrix of interfield comparisons between uid/gid
information for the current task and the uid/gid information for inodes.
aka I can audit based on differences between the euid of the process and
the uid of fs objects.

Signed-off-by: Peter Moody <pmoody@google.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:02 -05:00
Eric Paris c9fe685f7a audit: allow interfield comparison between gid and ogid
Allow audit rules to compare the gid of the running task to the gid of the
inode in question.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:02 -05:00
Eric Paris b34b039324 audit: complex interfield comparison helper
Rather than code the same loop over and over implement a helper function which
uses some pointer magic to make it generic enough to be used numerous places
as we implement more audit interfield comparisons

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:02 -05:00
Eric Paris 02d86a568c audit: allow interfield comparison in audit rules
We wish to be able to audit when a uid=500 task accesses a file which is
uid=0.  Or vice versa.  This patch introduces a new audit filter type
AUDIT_FIELD_COMPARE which takes as an 'enum' which indicates which fields
should be compared.  At this point we only define the task->uid vs
inode->uid, but other comparisons can be added.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:01 -05:00
Eric Paris 4043cde8ec audit: do not call audit_getname on error
Just a code cleanup really.  We don't need to make a function call just for
it to return on error.  This also makes the VFS function even easier to follow
and removes a conditional on a hot path.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:01 -05:00
Eric Paris 633b454545 audit: only allow tasks to set their loginuid if it is -1
At the moment we allow tasks to set their loginuid if they have
CAP_AUDIT_CONTROL.  In reality we want tasks to set the loginuid when they
log in and it be impossible to ever reset.  We had to make it mutable even
after it was once set (with the CAP) because on update and admin might have
to restart sshd.  Now sshd would get his loginuid and the next user which
logged in using ssh would not be able to set his loginuid.

Systemd has changed how userspace works and allowed us to make the kernel
work the way it should.  With systemd users (even admins) are not supposed
to restart services directly.  The system will restart the service for
them.  Thus since systemd is going to loginuid==-1, sshd would get -1, and
sshd would be allowed to set a new loginuid without special permissions.

If an admin in this system were to manually start an sshd he is inserting
himself into the system chain of trust and thus, logically, it's his
loginuid that should be used!  Since we have old systems I make this a
Kconfig option.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:00 -05:00
Eric Paris 0a300be6d5 audit: remove task argument to audit_set_loginuid
The function always deals with current.  Don't expose an option
pretending one can use it for something.  You can't.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:17:00 -05:00
Eric Paris 54d3218b31 audit: allow audit matching on inode gid
Much like the ability to filter audit on the uid of an inode collected, we
should be able to filter on the gid of the inode.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:59 -05:00
Eric Paris efaffd6e44 audit: allow matching on obj_uid
Allow syscall exit filter matching based on the uid of the owner of an
inode used in a syscall.  aka:

auditctl -a always,exit -S open -F obj_uid=0 -F perm=wa

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:59 -05:00
Eric Paris 6422e78de6 audit: remove audit_finish_fork as it can't be called
Audit entry,always rules are not allowed and are automatically changed in
exit,always rules in userspace.  The kernel refuses to load such rules.

Thus a task in the middle of a syscall (and thus in audit_finish_fork())
can only be in one of two states: AUDIT_BUILD_CONTEXT or AUDIT_DISABLED.
Since the current task cannot be in AUDIT_RECORD_CONTEXT we aren't every
going to actually use the code in audit_finish_fork() since it will
return without doing anything.  Thus drop the code.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:59 -05:00
Eric Paris 7ff68e53ec audit: reject entry,always rules
We deprecated entry,always rules a long time ago.  Reject those rules as
invalid.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:58 -05:00
Eric Paris a4ff8dba7d audit: inline audit_free to simplify the look of generic code
make the conditional a static inline instead of doing it in generic code.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:58 -05:00
Eric Paris 07c4941787 audit: inline checks for not needing to collect aux records
A number of audit hooks make function calls before they determine that
auxilary records do not need to be collected.  Do those checks as static
inlines since the most common case is going to be that records are not
needed and we can skip the function call overhead.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:57 -05:00
Eric Paris 56179a6ec6 audit: drop some potentially inadvisable likely notations
The audit code makes heavy use of likely() and unlikely() macros, but they
don't always make sense.  Drop any that seem questionable and let the
computer do it's thing.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:57 -05:00
Eric Paris 997f5b6444 audit: remove AUDIT_SETUP_CONTEXT as it isn't used
Audit contexts have 3 states.  Disabled, which doesn't collect anything,
build, which collects info but might not emit it, and record, which
collects and emits.  There is a 4th state, setup, which isn't used.  Get
rid of it.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:57 -05:00
Eric Paris b05d8447e7 audit: inline audit_syscall_entry to reduce burden on archs
Every arch calls:

if (unlikely(current->audit_context))
	audit_syscall_entry()

which requires knowledge about audit (the existance of audit_context) in
the arch code.  Just do it all in static inline in audit.h so that arch's
can remain blissfully ignorant.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:56 -05:00
Eric Paris d7e7528bcd Audit: push audit success and retcode into arch ptrace.h
The audit system previously expected arches calling to audit_syscall_exit to
supply as arguments if the syscall was a success and what the return code was.
Audit also provides a helper AUDITSC_RESULT which was supposed to simplify things
by converting from negative retcodes to an audit internal magic value stating
success or failure.  This helper was wrong and could indicate that a valid
pointer returned to userspace was a failed syscall.  The fix is to fix the
layering foolishness.  We now pass audit_syscall_exit a struct pt_reg and it
in turns calls back into arch code to collect the return value and to
determine if the syscall was a success or failure.  We also define a generic
is_syscall_success() macro which determines success/failure based on if the
value is < -MAX_ERRNO.  This works for arches like x86 which do not use a
separate mechanism to indicate syscall failure.

We make both the is_syscall_success() and regs_return_value() static inlines
instead of macros.  The reason is because the audit function must take a void*
for the regs.  (uml calls theirs struct uml_pt_regs instead of just struct
pt_regs so audit_syscall_exit can't take a struct pt_regs).  Since the audit
function takes a void* we need to use static inlines to cast it back to the
arch correct structure to dereference it.

The other major change is that on some arches, like ia64, MIPS and ppc, we
change regs_return_value() to give us the negative value on syscall failure.
THE only other user of this macro, kretprobe_example.c, won't notice and it
makes the value signed consistently for the audit functions across all archs.

In arch/sh/kernel/ptrace_64.c I see that we were using regs[9] in the old
audit code as the return value.  But the ptrace_64.h code defined the macro
regs_return_value() as regs[3].  I have no idea which one is correct, but this
patch now uses the regs_return_value() function, so it now uses regs[3].

For powerpc we previously used regs->result but now use the
regs_return_value() function which uses regs->gprs[3].  regs->gprs[3] is
always positive so the regs_return_value(), much like ia64 makes it negative
before calling the audit code when appropriate.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: H. Peter Anvin <hpa@zytor.com> [for x86 portion]
Acked-by: Tony Luck <tony.luck@intel.com> [for ia64]
Acked-by: Richard Weinberger <richard@nod.at> [for uml]
Acked-by: David S. Miller <davem@davemloft.net> [for sparc]
Acked-by: Ralf Baechle <ralf@linux-mips.org> [for mips]
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [for ppc]
2012-01-17 16:16:56 -05:00
Eric Paris 85e7bac33b seccomp: audit abnormal end to a process due to seccomp
The audit system likes to collect information about processes that end
abnormally (SIGSEGV) as this may me useful intrusion detection information.
This patch adds audit support to collect information when seccomp forces a
task to exit because of misbehavior in a similar way.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:55 -05:00
Eric Paris 16c174bd95 audit: check current inode and containing object when filtering on major and minor
The audit system has the ability to filter on the major and minor number of
the device containing the inode being operated upon.  Lets say that
/dev/sda1 has major,minor 8,1 and that we mount /dev/sda1 on /boot.  Now lets
say we add a watch with a filter on 8,1.  If we proceed to open an inode
inside /boot, such as /vboot/vmlinuz, we will match the major,minor filter.

Lets instead assume that one were to use a tool like debugfs and were to
open /dev/sda1 directly and to modify it's contents.  We might hope that
this would also be logged, but it isn't.  The rules will check the
major,minor of the device containing /dev/sda1.  In other words the rule
would match on the major/minor of the tmpfs mounted at /dev.

I believe these rules should trigger on either device.  The man page is
devoid of useful information about the intended semantics.  It only seems
logical that if you want to know everything that happened on a major,minor
that would include things that happened to the device itself...

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:55 -05:00
Eric Paris 3035c51e8a audit: drop the meaningless and format breaking word 'user'
userspace audit messages look like so:

type=USER msg=audit(1271170549.415:24710): user pid=14722 uid=0 auid=500 ses=1 subj=unconfined_u:unconfined_r:auditctl_t:s0-s0:c0.c1023 msg=''

That third field just says 'user'.  That's useless and doesn't follow the
key=value pair we are trying to enforce.  We already know it came from the
user based on the record type.  Kill that word.  Die.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:54 -05:00
Eric Paris 5195d8e217 audit: dynamically allocate audit_names when not enough space is in the names array
This patch does 2 things.  First it reduces the number of audit_names
allocated in every audit context from 20 to 5.  5 should be enough for all
'normal' syscalls (rename being the worst).  Some syscalls can still touch
more the 5 inodes such as mount.  When rpc filesystem is mounted it will
create inodes and those can exceed 5.  To handle that problem this patch will
dynamically allocate audit_names if it needs more than 5.  This should
decrease the typicall memory usage while still supporting all the possible
kernel operations.

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:54 -05:00
Eric Paris 5ef30ee53b audit: make filetype matching consistent with other filters
Every other filter that matches part of the inodes list collected by audit
will match against any of the inodes on that list.  The filetype matching
however had a strange way of doing things.  It allowed userspace to
indicated if it should match on the first of the second name collected by
the kernel.  Name collection ordering seems like a kernel internal and
making userspace rules get that right just seems like a bad idea.  As it
turns out the userspace audit writers had no idea it was doing this and
thus never overloaded the value field.  The kernel always checked the first
name collected which for the tested rules was always correct.

This patch just makes the filetype matching like the major, minor, inode,
and LSM rules in that it will match against any of the names collected.  It
also changes the rule validation to reject the old unused rule types.

Noone knew it was there.  Noone used it.  Why keep around the extra code?

Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-17 16:16:54 -05:00
Linus Torvalds 951880e634 Revert "capabitlies: ns_capable can use the cap helpers rather than lsm call"
This reverts commit d2a7009f0b.

J. R. Okajima explains:

 "After this commit, I am afraid access(2) on NFS may not work
  correctly.  The scenario based upon my guess.
   - access(2) overrides the credentials.
   - calls inode_permission() -- ... -- generic_permission() --
      ns_capable().
   - while the old ns_capable() calls security_capable(current_cred()),
     the new ns_capable() calls has_ns_capability(current) --
     security_capable(__task_cred(t)).

  current_cred() returns current->cred which is effective (overridden)
  credentials, but __task_cred(current) returns current->real_cred (the
  NFSD's credential).  And the overridden credentials by access(2) lost."

Requested-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-17 10:19:41 -08:00
Ingo Molnar b64b223aed Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent 2012-01-17 09:44:17 +01:00
Linus Torvalds 97740400bc Merge branch 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
* 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / Hibernate: Drop the check of swap space size for compressed image
  PM / shmobile: fix A3SP suspend method
  PM / Domains: Skip governor functions for CONFIG_PM_RUNTIME unset
  PM / Domains: Fix build for CONFIG_PM_SLEEP unset
  PM: Make sysrq-o be available for CONFIG_PM unset
2012-01-16 15:02:30 -08:00
Heiko Carstens 4410030646 rcu: Add missing __cpuinit annotation in rcutorture code
"rcu: Add rcutorture CPU-hotplug capability" adds cpu hotplug operations
to the rcutorture code but produces a false positive warning about section
mismatches:

WARNING: vmlinux.o(.text+0x1e420c): Section mismatch in reference from the
	 function rcu_torture_onoff() to the function .cpuinit.text:cpu_up()
	 The function rcu_torture_onoff() references
	 the function __cpuinit cpu_up().
	 This is often because rcu_torture_onoff lacks a __cpuinit
	 annotation or the annotation of cpu_up is wrong.

This commit therefore adds a __cpuinit annotation so the warning goes away.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-01-16 09:44:05 -08:00
Rusty Russell d8e8ed95cd rcu: Make rcutorture bool parameters really bool (core code)
module_param(bool) used to counter-intuitively take an int.  In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.

It's time to remove the int/unsigned int option.  For this version
it'll simply give a warning, but it'll break next kernel version.
This commit makes this change to rcutorture.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-01-16 09:43:56 -08:00
Paul E. McKenney 51aa78271b Merge branch 'rcu/fixes-for-v3.2' into rcu/urgent
Merge reason: Add these commits so that fixes on this branch do not
	conflict with already-mainlined code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-01-16 09:41:18 -08:00
Steven Rostedt c10076c430 tracepoints/module: Fix disabling tracepoints with taint CRAP or OOT
Tracepoints are disabled for tainted modules, which is usually because the
module is either proprietary or was forced, and we don't want either of them
using kernel tracepoints.

But, a module can also be tainted by being in the staging directory or
compiled out of tree. Either is fine for use with tracepoints, no need
to punish them.  I found this out when I noticed that my sample trace event
module, when done out of tree, stopped working.

Cc: stable@vger.kernel.org # 3.2
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Dave Jones <davej@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-01-16 11:35:57 -05:00
Kevin Winchester 53999bf34d error: implicit declaration of function 'module_flags_taint'
Recent changes to kernel/module.c caused the following compile
error:

  kernel/module.c: In function ‘show_taint’:
  kernel/module.c:1024:2: error: implicit declaration of function ‘module_flags_taint’ [-Werror=implicit-function-declaration]
  cc1: some warnings being treated as errors

Correct this error by moving the definition of module_flags_taint
outside of the #ifdef CONFIG_MODULE_UNLOAD section.

Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-15 16:21:07 -08:00
Linus Torvalds b3c9dd182e Merge branch 'for-3.3/core' of git://git.kernel.dk/linux-block
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
  Revert "block: recursive merge requests"
  block: Stop using macro stubs for the bio data integrity calls
  blockdev: convert some macros to static inlines
  fs: remove unneeded plug in mpage_readpages()
  block: Add BLKROTATIONAL ioctl
  block: Introduce blk_set_stacking_limits function
  block: remove WARN_ON_ONCE() in exit_io_context()
  block: an exiting task should be allowed to create io_context
  block: ioc_cgroup_changed() needs to be exported
  block: recursive merge requests
  block, cfq: fix empty queue crash caused by request merge
  block, cfq: move icq creation and rq->elv.icq association to block core
  block, cfq: restructure io_cq creation path for io_context interface cleanup
  block, cfq: move io_cq exit/release to blk-ioc.c
  block, cfq: move icq cache management to block core
  block, cfq: move io_cq lookup to blk-ioc.c
  block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
  block, cfq: reorganize cfq_io_context into generic and cfq specific parts
  block: remove elevator_queue->ops
  block: reorder elevator switch sequence
  ...

Fix up conflicts in:
 - block/blk-cgroup.c
	Switch from can_attach_task to can_attach
 - block/cfq-iosched.c
	conflict with now removed cic index changes (we now use q->id instead)
2012-01-15 12:24:45 -08:00
Linus Torvalds 83c2f912b4 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  perf tools: Fix compile error on x86_64 Ubuntu
  perf report: Fix --stdio output alignment when --showcpuutilization used
  perf annotate: Get rid of field_sep check
  perf annotate: Fix usage string
  perf kmem: Fix a memory leak
  perf kmem: Add missing closedir() calls
  perf top: Add error message for EMFILE
  perf test: Change type of '-v' option to INCR
  perf script: Add missing closedir() calls
  tracing: Fix compile error when static ftrace is enabled
  recordmcount: Fix handling of elf64 big-endian objects.
  perf tools: Add const.h to MANIFEST to make perf-tar-src-pkg work again
  perf tools: Add support for guest/host-only profiling
  perf kvm: Do guest-only counting by default
  perf top: Don't update total_period on process_sample
  perf hists: Stop using 'self' for struct hist_entry
  perf hists: Rename total_session to total_period
  x86: Add counter when debug stack is used with interrupts enabled
  x86: Allow NMIs to hit breakpoints in i386
  x86: Keep current stack in NMI breakpoints
  ...
2012-01-15 11:26:35 -08:00