linux-2.6

dect

Archived

Author	SHA1	Message	Date
Matthias Kaehlcke	2e1975868a	kernel/time/clocksource.c: Use list_for_each_entry instead of list_for_each kernel/time/clocksource.c: Convert list_for_each to list_for_each_entry in clocksource_resume(), sysfs_override_clocksource() and show_available_clocksources() Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:38 -07:00
Matthias Kaehlcke	03ff17979c	kernel/exit.c: Use list_for_each_entry(_safe) instead of list_for_each(_safe) kernel/exit.c: Convert list_for_each(_safe) to list_for_each_entry(_safe) in forget_original_parent(), exit_notify() and do_wait() Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:38 -07:00
Johannes Berg	4e6045f134	workqueue: debug flushing deadlocks with lockdep In the following scenario: code path 1: my_function() -> lock(L1); ...; flush_workqueue(); ... code path 2: run_workqueue() -> my_work() -> ...; lock(L1); ... you can get a deadlock when my_work() is queued or running but my_function() has acquired L1 already. This patch adds a pseudo-lock to each workqueue to make lockdep warn about this scenario. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:38 -07:00
Pavel Emelyanov	cf7b708c8d	Make access to task's nsproxy lighter When someone wants to deal with some other taks's namespaces it has to lock the task and then to get the desired namespace if the one exists. This is slow on read-only paths and may be impossible in some cases. E.g. Oleg recently noticed a race between unshare() and the (sent for review in cgroups) pid namespaces - when the task notifies the parent it has to know the parent's namespace, but taking the task_lock() is impossible there - the code is under write locked tasklist lock. On the other hand switching the namespace on task (daemonize) and releasing the namespace (after the last task exit) is rather rare operation and we can sacrifice its speed to solve the issues above. The access to other task namespaces is proposed to be performed like this: rcu_read_lock(); nsproxy = task_nsproxy(tsk); if (nsproxy != NULL) { / * * work with the namespaces here * e.g. get the reference on one of them * / } / * * NULL task_nsproxy() means that this task is * almost dead (zombie) * / rcu_read_unlock(); This patch has passed the review by Eric and Oleg :) and, of course, tested. [clg@fr.ibm.com: fix unshare()] [ebiederm@xmission.com: Update get_net_ns_by_pid] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:37 -07:00
Sukadev Bhattiprolu	a6f5e06378	pid namespaces: move alloc_pid() to copy_process() Move alloc_pid() into copy_process(). This will keep all pid and pid namespace code together and simplify error handling when we support multiple pid namespaces. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:37 -07:00
Serge E. Hallyn	b460cbc581	pid namespaces: define is_global_init() and is_container_init() is_init() is an ambiguous name for the pid==1 check. Split it into is_global_init() and is_container_init(). A cgroup init has it's tsk->pid == 1. A global init also has it's tsk->pid == 1 and it's active pid namespace is the init_pid_ns. But rather than check the active pid namespace, compare the task structure with 'init_pid_ns.child_reaper', which is initialized during boot to the /sbin/init process and never changes. Changelog: 2.6.22-rc4-mm2-pidns1: - Use 'init_pid_ns.child_reaper' to determine if a given task is the global init (/sbin/init) process. This would improve performance and remove dependence on the task_pid(). 2.6.21-mm2-pidns2: - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc, ppc,avr32}/traps.c for the _exception() call to is_global_init(). This way, we kill only the cgroup if the cgroup's init has a bug rather than force a kernel panic. [akpm@linux-foundation.org: fix comment] [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c] [bunk@stusta.de: kernel/pid.c: remove unused exports] [sukadev@us.ibm.com: Fix capability.c to work with threaded init] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:37 -07:00
Sukadev Bhattiprolu	88f21d8182	pid namespaces: rename child_reaper() function Rename the child_reaper() function to task_child_reaper() to be similar to other task_* functions and to distinguish the function from 'struct pid_namspace.child_reaper'. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:37 -07:00
Sukadev Bhattiprolu	2894d650cd	pid namespaces: define and use task_active_pid_ns() wrapper With multiple pid namespaces, a process is known by some pid_t in every ancestor pid namespace. Every time the process forks, the child process also gets a pid_t in every ancestor pid namespace. While a process is visible in >=1 pid namespaces, it can see pid_t's in only one pid namespace. We call this pid namespace it's "active pid namespace", and it is always the youngest pid namespace in which the process is known. This patch defines and uses a wrapper to find the active pid namespace of a process. The implementation of the wrapper will be changed in when support for multiple pid namespaces are added. Changelog: 2.6.22-rc4-mm2-pidns1: - [Pavel Emelianov, Alexey Dobriyan] Back out the change to use task_active_pid_ns() in child_reaper() since task->nsproxy can be NULL during task exit (so child_reaper() continues to use init_pid_ns). to implement child_reaper() since init_pid_ns.child_reaper to implement child_reaper() since tsk->nsproxy can be NULL during exit. 2.6.21-rc6-mm1: - Rename task_pid_ns() to task_active_pid_ns() to reflect that a process can have multiple pid namespaces. Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Acked-by: Pavel Emelianov <xemul@openvz.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Herbert Poetzel <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:37 -07:00
Pavel Emelianov	baf8f0f82d	pid namespaces: dynamic kmem cache allocator for pid namespaces Add kmem_cache to pid_namespace to allocate pids from. Since both implementations expand the struct pid to carry more numerical values each namespace should have separate cache to store pids of different sizes. Each kmem cache is name "pid_<NR>", where <NR> is the number of numerical ids on the pid. Different namespaces with same level of nesting will have same caches. This patch has two FIXMEs that are to be fixed after we reach the consensus about the struct pid itself. The first one is that the namespace to free the pid from in free_pid() must be taken from pid. Now the init_pid_ns is used. The second FIXME is about the cache allocation. When we do know how long the object will be then we'll have to calculate this size in create_pid_cachep. Right now the sizeof(struct pid) value is used. [akpm@linux-foundation.org: coding-style repair] Signed-off-by: Pavel Emelianov <xemul@openvz.org> Acked-by: Cedric Le Goater <clg@fr.ibm.com> Acked-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:37 -07:00
Pavel Emelianov	a47afb0f9d	pid namespaces: round up the API The set of functions process_session, task_session, process_group and task_pgrp is confusing, as the names can be mixed with each other when looking at the code for a long time. The proposals are to * equip the functions that return the integer with _nr suffix to represent that fact, * and to make all functions work with task (not process) by making the common prefix of the same name. For monotony the routines signal_session() and set_signal_session() are replaced with task_session_nr() and set_task_session(), especially since they are only used with the explicit task->signal dereference. Signed-off-by: Pavel Emelianov <xemul@openvz.org> Acked-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:37 -07:00
Serge E. Hallyn	858d72ead4	cgroups: implement namespace tracking subsystem When a task enters a new namespace via a clone() or unshare(), a new cgroup is created and the task moves into it. This version names cgroups which are automatically created using cgroup_clone() as "node_<pid>" where pid is the pid of the unsharing or cloned process. (Thanks Pavel for the idea) This is safe because if the process unshares again, it will create /cgroups/(...)/node_<pid>/node_<pid> The only possibilities (AFAICT) for a -EEXIST on unshare are 1. pid wraparound 2. a process fails an unshare, then tries again. Case 1 is unlikely enough that I ignore it (at least for now). In case 2, the node_<pid> will be empty and can be rmdir'ed to make the subsequent unshare() succeed. Changelog: Name cloned cgroups as "node_<pid>". [clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:37 -07:00
Balbir Singh	846c7bb055	Add cgroupstats This patch is inspired by the discussion at http://lkml.org/lkml/2007/4/11/187 and implements per cgroup statistics as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263. The patch is on top of 2.6.21-mm1 with Paul's cgroups v9 patches (forward ported) This patch implements per cgroup statistics infrastructure and re-uses code from the taskstats interface. A new set of cgroup operations are registered with commands and attributes. It should be very easy to extend per cgroup statistics, by adding members to the cgroupstats structure. The current model for cgroupstats is a pull, a push model (to post statistics on interesting events), should be very easy to add. Currently user space requests for statistics by passing the cgroup file descriptor. Statistics about the state of all the tasks in the cgroup is returned to user space. TODO's/NOTE: This patch provides an infrastructure for implementing cgroup statistics. Based on the needs of each controller, we can incrementally add more statistics, event based support for notification of statistics, accumulation of taskstats into cgroup statistics in the future. Sample output # ./cgroupstats -C /cgroup/a sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0 # ./cgroupstats -C /cgroup/ sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0 If the approach looks good, I'll enhance and post the user space utility for the same Feedback, comments, test results are always welcome! [akpm@linux-foundation.org: build fix] Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Paul Menage	006cb99200	Task Control Groups: simple task cgroup debug info subsystem This example subsystem exports debugging information as an aid to diagnosing refcount leaks, etc, in the cgroup framework. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Paul Menage	62d0df6406	Task Control Groups: example CPU accounting subsystem This example demonstrates how to use the generic cgroup subsystem for a simple resource tracker that counts, for the processes in a cgroup, the total CPU time used and the %CPU used in the last complete 10 second interval. Portions contributed by Balbir Singh <balbir@in.ibm.com> Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Paul Menage	8793d854ed	Task Control Groups: make cpusets a client of cgroups Remove the filesystem support logic from the cpusets system and makes cpusets a cgroup subsystem The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get passed through to the cgroup filesystem with the appropriate options to emulate the old cpuset filesystem behaviour. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Paul Menage	81a6a5cdd2	Task Control Groups: automatic userspace notification of idle cgroups Add the following files to the cgroup filesystem: notify_on_release - configures/reports whether the cgroup subsystem should attempt to run a release script when this cgroup becomes unused release_agent - configures/reports the release agent to be used for this hierarchy (top level in each hierarchy only) releasable - reports whether this cgroup would have been auto-released if notify_on_release was true and a release agent was configured (mainly useful for debugging) To avoid locking issues, invoking the userspace release agent is done via a workqueue task; cgroups that need to have their release agents invoked by the workqueue task are linked on to a list. [pj@sgi.com: Need to include kmod.h] Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Paul Menage	817929ec27	Task Control Groups: shared cgroup subsystem group arrays Replace the struct css_set embedded in task_struct with a pointer; all tasks that have the same set of memberships across all hierarchies will share a css_set object, and will be linked via their css_sets field to the "tasks" list_head in the css_set. Assuming that many tasks share the same cgroup assignments, this reduces overall space usage and keeps the size of the task_struct down (three pointers added to task_struct compared to a non-cgroups kernel, no matter how many subsystems are registered). [akpm@linux-foundation.org: fix a printk] [akpm@linux-foundation.org: build fix] Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Paul Menage	a424316ca1	Task Control Groups: add procfs interface Add: /proc/cgroups - general system info /proc/*/cgroup - per-task cgroup membership info [a.p.zijlstra@chello.nl: cgroups: bdi init hooks] Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Paul Menage	697f416108	Task Control Groups: add cgroup_clone() interface Add support for cgroup_clone(), a way to create new cgroups intended to be used for systems such as namespace unsharing. A new subsystem callback, post_clone(), is added to allow subsystems to automatically configure cloned cgroups. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Paul Menage	b4f48b6363	Task Control Groups: add fork()/exit() hooks This adds the necessary hooks to the fork() and exit() paths to ensure that new children inherit their parent's cgroup assignments, and that exiting processes release reference counts on their cgroups. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Paul Menage	355e0c48b7	Add cgroup write_uint() helper method Add write_uint() helper method for cgroup subsystems This helper is analagous to the read_uint() helper method for reporting u64 values to userspace. It's designed to reduce the amount of boilerplate requierd for creating new cgroup subsystems. Signed-off-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Paul Menage	bbcb81d091	Task Control Groups: add tasks file interface Add the per-directory "tasks" file for cgroupfs mounts; this allows the user to determine which tasks are members of a cgroup by reading a cgroup's "tasks", and to move a task into a cgroup by writing its pid to its "tasks". Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Paul Menage	ddbcc7e8e5	Task Control Groups: basic task cgroup framework Generic Process Control Groups -------------------------- There have recently been various proposals floating around for resource management/accounting and other task grouping subsystems in the kernel, including ResGroups, User BeanCounters, NSProxy cgroups, and others. These all need the basic abstraction of being able to group together multiple processes in an aggregate, in order to track/limit the resources permitted to those processes, or control other behaviour of the processes, and all implement this grouping in different ways. This patchset provides a framework for tracking and grouping processes into arbitrary "cgroups" and assigning arbitrary state to those groupings, in order to control the behaviour of the cgroup as an aggregate. The intention is that the various resource management and virtualization/cgroup efforts can also become task cgroup clients, with the result that: - the userspace APIs are (somewhat) normalised - it's easier to test e.g. the ResGroups CPU controller in conjunction with the BeanCounters memory controller, or use either of them as the resource-control portion of a virtual server system. - the additional kernel footprint of any of the competing resource management systems is substantially reduced, since it doesn't need to provide process grouping/containment, hence improving their chances of getting into the kernel This patch: Add the main task cgroups framework - the cgroup filesystem, and the basic structures for tracking membership and associating subsystem state objects to tasks. Signed-off-by: Paul Menage <menage@google.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Jackson <pj@sgi.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:36 -07:00
Paul Jackson	55a230aae6	cpuset: zero malloc - revert the old cpuset fix The cpuset code to present a list of tasks using a cpuset to user space could write to an array that it had kmalloc'd, after a kmalloc request of zero size. The problem was that the code didn't check for writes past the allocated end of the array until -after- the first write. This is a race condition that is likely rare -- it would only show up if a cpuset went from being empty to having a task in it, during the brief time between the allocation and the first write. Prior to roughly 2.6.22 kernels, this was also a benign problem, because a zero kmalloc returned a few usable bytes anyway, and no harm was done with the bogus write. With the 2.6.22 kernel changes to make issue a warning if code tries to write to the location returned from a zero size allocation, this problem is no longer benign. This cpuset code would occassionally trigger that warning. The fix is trivial -- check before storing into the array, not after, whether the array is big enough to hold the store. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Serge E. Hallyn" <serue@us.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Cc: Kirill Korotaev <dev@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:35 -07:00
Alexey Dobriyan	fe9d4f5763	Add kernel/notifier.c There is separate notifier header, but no separate notifier .c file. Extract notifier code out of kernel/sys.c which will remain for misc syscalls I hope. Merge kernel/die_notifier.c into kernel/notifier.c. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-19 11:53:34 -07:00
Linus Torvalds	58f9b52ee8	Merge ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt * ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt: hrtimer: hook compat_sys_nanosleep up to high res timer code hrtimer: Rework hrtimer_nanosleep to make sys_compat_nanosleep easier	2007-10-18 15:12:41 -07:00
Linus Torvalds	54e840dd50	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: reduce schedstat variable overhead a bit sched: add KERN_CONT annotation sched: cleanup, make struct rq comments more consistent sched: cleanup, fix spacing sched: fix return value of wait_for_completion_interruptible()	2007-10-18 14:54:03 -07:00
Stephen Hemminger	c80544dc0b	sparse pointer use of zero as null Get rid of sparse related warnings from places that use integer as NULL pointer. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Cc: Andi Kleen <ak@suse.de> Cc: Jeff Garzik <jeff@garzik.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Ian Kent <raven@themaw.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:31 -07:00
Michael Neuling	c66f08be7e	Add scaled time to taskstats based process accounting This adds items to the taststats struct to account for user and system time based on scaling the CPU frequency and instruction issue rates. Adds account_(user\|system)_time_scaled callbacks which architectures can use to account for time using this mechanism. Signed-off-by: Michael Neuling <mikey@neuling.org> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:28 -07:00
Daniel Walker	5b4db0c2f2	whitespace fixes: system timers Signed-off-by: Daniel Walker <dwalker@mvista.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:26 -07:00
Daniel Walker	0719e3702e	whitespace fixes: interval timers Signed-off-by: Daniel Walker <dwalker@mvista.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:26 -07:00
Daniel Walker	62769dce86	whitespace fixes: execution domains Signed-off-by: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:26 -07:00
Daniel Walker	5600b89278	whitespace fixes: system auditing Just removing white space at the end of lines. Signed-off-by: Daniel Walker <dwalker@mvista.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:25 -07:00
Daniel Walker	bd3a8492ba	whitespace fixes: compat syscalls Signed-off-by: Daniel Walker <dwalker@mvista.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:25 -07:00
Daniel Walker	9f8dbe9c9c	whitespace fixes: syscall auditing Signed-off-by: Daniel Walker <dwalker@mvista.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:25 -07:00
Daniel Walker	314f70fd96	whitespace fixes: capability syscalls Large chunks of 5 spaces instead of tabs. Signed-off-by: Daniel Walker <dwalker@mvista.com> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:25 -07:00
Daniel Walker	c277e63fbe	whitespace fixes: panic handling Signed-off-by: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:25 -07:00
Daniel Walker	22a8bdeb53	whitespace fixes: module loading Signed-off-by: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:25 -07:00
Daniel Walker	23ff444024	whitespace fixes: fork Signed-off-by: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:25 -07:00
Daniel Walker	902749cdbd	whitespace fixes: DMA channel allocator Signed-off-by: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:24 -07:00
Daniel Walker	9ce34218a8	whitespace fixes: audit filtering Signed-off-by: Daniel Walker <dwalker@mvista.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:24 -07:00
Daniel Walker	1bfbc608ba	whitespace fixes: relayfs Signed-off-by: Daniel Walker <dwalker@mvista.com> Cc: Tom Zanussi <zanussi@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:24 -07:00
Daniel Walker	dedf8b79ec	whitespace fixes: cpuset Signed-off-by: Daniel Walker <dwalker@mvista.com> Cc: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:24 -07:00
Daniel Walker	6ae965cd64	whitespace fixes: process accounting Lots of converting spaces to tabs. Signed-off-by: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:24 -07:00
Daniel Walker	6fa6c3b1d1	whitespace fixes: time syscalls Signed-off-by: Daniel Walker <dwalker@mvista.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:24 -07:00
Andrew Morgan	72c2d5823f	V3 file capabilities: alter behavior of cap_setpcap The non-filesystem capability meaning of CAP_SETPCAP is that a process, p1, can change the capabilities of another process, p2. This is not the meaning that was intended for this capability at all, and this implementation came about purely because, without filesystem capabilities, there was no way to use capabilities without one process bestowing them on another. Since we now have a filesystem support for capabilities we can fix the implementation of CAP_SETPCAP. The most significant thing about this change is that, with it in effect, no process can set the capabilities of another process. The capabilities of a program are set via the capability convolution rules: pI(post-exec) = pI(pre-exec) pP(post-exec) = (X(aka cap_bset) & fP) \| (pI(post-exec) & fI) pE(post-exec) = fE ? pP(post-exec) : 0 at exec() time. As such, the only influence the pre-exec() program can have on the post-exec() program's capabilities are through the pI capability set. The correct implementation for CAP_SETPCAP (and that enabled by this patch) is that it can be used to add extra pI capabilities to the current process - to be picked up by subsequent exec()s when the above convolution rules are applied. Here is how it works: Let's say we have a process, p. It has capability sets, pE, pP and pI. Generally, p, can change the value of its own pI to pI' where (pI' & ~pI) & ~pP = 0. That is, the only new things in pI' that were not present in pI need to be present in pP. The role of CAP_SETPCAP is basically to permit changes to pI beyond the above: if (pE & CAP_SETPCAP) { pI' = anything; /* ie., even (pI' & ~pI) & ~pP != 0 */ } This capability is useful for things like login, which (say, via pam_cap) might want to raise certain inheritable capabilities for use by the children of the logged-in user's shell, but those capabilities are not useful to or needed by the login program itself. One such use might be to limit who can run ping. You set the capabilities of the 'ping' program to be "= cap_net_raw+i", and then only shells that have (pI & CAP_NET_RAW) will be able to run it. Without CAP_SETPCAP implemented as described above, login(pam_cap) would have to also have (pP & CAP_NET_RAW) in order to raise this capability and pass it on through the inheritable set. Signed-off-by: Andrew Morgan <morgan@kernel.org> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: James Morris <jmorris@namei.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:24 -07:00
Eric W. Biederman	7058cb02dd	sysctl: deprecate sys_sysctl in a user space visible fashion. After adding checking to register_sysctl_table and finding a whole new set of bugs. Missed by countless code reviews and testers I have finally lost patience with the binary sysctl interface. The binary sysctl interface has been sort of deprecated for years and finding a user space program that uses the syscall is more difficult then finding a needle in a haystack. Problems continue to crop up, with the in kernel implementation. So since supporting something that no one uses is silly, deprecate sys_sysctl with a sufficient grace period and notice that the handful of user space applications that care can be fixed or replaced. The /proc/sys sysctl interface that people use will continue to be supported indefinitely. This patch moves the tested warning about sysctls from the path where sys_sysctl to a separate path called from both implementations of sys_sysctl, and it adds a proper entry into Documentation/feature-removal-schedule. Allowing us to revisit this in a couple years time and actually kill sys_sysctl. [lethal@linux-sh.org: sysctl: Fix syscall disabled build] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:23 -07:00
Eric W. Biederman	8ada720d89	sysctl: for irda update sysctl_checks list of binary paths It turns out that the net/irda code didn't register any of it's binary paths in the global sysctl.h header file so I missed them completely when making an authoritative list of binary sysctl paths in the kernel. So add them to the list of valid binary sysctl paths. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Samuel Ortiz <samuel@sortiz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:23 -07:00
Eric W. Biederman	49ffcf8f99	sysctl: update sysctl_check_table Well it turns out after I dug into the problems a little more I was returning a few false positives so this patch updates my logic to remove them. - Don't complain about 0 ctl_names in sysctl_check_binary_path It is valid for someone to remove the sysctl binary interface and still keep the same sysctl proc interface. - Count ctl_names and procnames as matching if they both don't exist. - Only warn about missing min&max when the generic functions care. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:23 -07:00
Eric W. Biederman	fc6cd25b73	sysctl: Error on bad sysctl tables After going through the kernels sysctl tables several times it has become clear that code review and testing is just not effective in prevent problematic sysctl tables from being used in the stable kernel. I certainly can't seem to fix the problems as fast as they are introduced. Therefore this patch adds sysctl_check_table which is called when a sysctl table is registered and checks to see if we have a problematic sysctl table. The biggest part of the code is the table of valid binary sysctl entries, but since we have frozen our set of binary sysctls this table should not need to change, and it makes it much easier to detect when someone unintentionally adds a new binary sysctl value. As best as I can determine all of the several hundred errors spewed on boot up now are legitimate. [bunk@kernel.org: kernel/sysctl_check.c must #include <linux/string.h>] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:23 -07:00
Eric W. Biederman	c65f92398e	sysctl: remove the cad_pid binary sysctl path It looks like we inadvertently killed the cad_pid binary sysctl support when cap_pid was changed to be a struct pid. Since no one has complained just remove the binary path. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:23 -07:00
Eric W. Biederman	35834ca1e4	sysctl: simplify the pty sysctl logic Instead of having a bunch of ifdefs in sysctl.c move all of the pty sysctl logic into drivers/char/pty.c As well as cleaning up the logic this prevents sysctl_check_table from complaining that the root table has a NULL data pointer on something with generic methods. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:23 -07:00
Eric W. Biederman	0d135a4a8c	sysctl: remove the binary interface for aio-nr, aio-max-nr, acpi_video_flags aio-nr, aio-max-nr, acpi_video_flags are unsigned long values which sysctl does not handle properly with a 64bit kernel and a 32bit user space. Since no one is likely to be using the binary sysctl values and the ascii interface still works, this patch just removes support for the binary sysctl interface from the kernel. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Len Brown <lenb@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:23 -07:00
Eric W. Biederman	f5ead5cefc	sysctl: remove binary sysctl support where it clearly doesn't work These functions are all wrapper functions for the proc interface that are needed for them to work correctly. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Acked-by: Andrew Morgan <morgan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:22 -07:00
Eric W. Biederman	49a0c45833	sysctl: Factor out sysctl_data. There as been no easy way to wrap the default sysctl strategy routine except for returning 0. Which is not always what we want. The few instances I have seen that want different behaviour have written their own version of sysctl_data. While not too hard it is unnecessary code and has the potential for extra bugs. So to make these situations easier and make that part of sysctl more symetric I have factord sysctl_data out of do_sysctl_strategy and exported as a function everyone can use. Further having sysctl_data be an explicit function makes checking for badly formed sysctl tables much easier. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:22 -07:00
Eric W. Biederman	d8217f076b	sysctl core: Stop using the unnecessary ctl_table typedef In sysctl.h the typedef struct ctl_table ctl_table violates coding style isn't needed and is a bit of a nuisance because it makes it harder to recognize ctl_table is a type name. So this patch removes it from the generic sysctl code. Hopefully I will have enough energy to send the rest of my patches will follow and to remove it from the rest of the kernel. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:22 -07:00
Akinobu Mita	a0d8cdb652	cpu hotplug: cpu: deliver CPU_UP_CANCELED only to NOTIFY_OKed callbacks with CPU_UP_PREPARE The functions in a CPU notifier chain is called with CPU_UP_PREPARE event before making the CPU online. If one of the callback returns NOTIFY_BAD, it stops to deliver CPU_UP_PREPARE event, and CPU online operation is canceled. Then CPU_UP_CANCELED event is delivered to the functions in a CPU notifier chain again. This CPU_UP_CANCELED event is delivered to the functions which have been called with CPU_UP_PREPARE, not delivered to the functions which haven't been called with CPU_UP_PREPARE. The problem that makes existing cpu hotplug error handlings complex is that the CPU_UP_CANCELED event is delivered to the function that has returned NOTIFY_BAD, too. Usually we don't expect to call destructor function against the object that has failed to initialize. It is like: err = register_something(); if (err) { unregister_something(); return err; } So it is natural to deliver CPU_UP_CANCELED event only to the functions that have returned NOTIFY_OK with CPU_UP_PREPARE event and not to call the function that have returned NOTIFY_BAD. This is what this patch is doing. Otherwise, every cpu hotplug notifiler has to track whether notifiler event is failed or not for each cpu. (drivers/base/topology.c is doing this with topology_dev_map) Similary this patch makes same thing with CPU_DOWN_PREPARE and CPU_DOWN_FAILED evnets. Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:21 -07:00
Dave Young	faf8c714f4	param_sysfs_builtin memchr argument fix If memchr argument is longer than strlen(kp->name), there will be some weird result. It will casuse duplicate filenames in sysfs for the "nousb". kernel warning messages are as bellow: sysfs: duplicate filename 'usbcore' can not be created WARNING: at fs/sysfs/dir.c:416 sysfs_add_one() [<c01c4750>] sysfs_add_one+0xa0/0xe0 [<c01c4ab8>] create_dir+0x48/0xb0 [<c01c4b69>] sysfs_create_dir+0x29/0x50 [<c024e0fb>] create_dir+0x1b/0x50 [<c024e3b6>] kobject_add+0x46/0x150 [<c024e2da>] kobject_init+0x3a/0x80 [<c053b880>] kernel_param_sysfs_setup+0x50/0xb0 [<c053b9ce>] param_sysfs_builtin+0xee/0x130 [<c053ba33>] param_sysfs_init+0x23/0x60 [<c024d062>] __next_cpu+0x12/0x20 [<c052aa30>] kernel_init+0x0/0xb0 [<c052aa30>] kernel_init+0x0/0xb0 [<c052a856>] do_initcalls+0x46/0x1e0 [<c01bdb12>] create_proc_entry+0x52/0x90 [<c0158d4c>] register_irq_proc+0x9c/0xc0 [<c01bda94>] proc_mkdir_mode+0x34/0x50 [<c052aa30>] kernel_init+0x0/0xb0 [<c052aa92>] kernel_init+0x62/0xb0 [<c0104f83>] kernel_thread_helper+0x7/0x14 ======================= kobject_add failed for usbcore with -EEXIST, don't try to register things with the same name in the same directory. [<c024e466>] kobject_add+0xf6/0x150 [<c053b880>] kernel_param_sysfs_setup+0x50/0xb0 [<c053b9ce>] param_sysfs_builtin+0xee/0x130 [<c053ba33>] param_sysfs_init+0x23/0x60 [<c024d062>] __next_cpu+0x12/0x20 [<c052aa30>] kernel_init+0x0/0xb0 [<c052aa30>] kernel_init+0x0/0xb0 [<c052a856>] do_initcalls+0x46/0x1e0 [<c01bdb12>] create_proc_entry+0x52/0x90 [<c0158d4c>] register_irq_proc+0x9c/0xc0 [<c01bda94>] proc_mkdir_mode+0x34/0x50 [<c052aa30>] kernel_init+0x0/0xb0 [<c052aa92>] kernel_init+0x62/0xb0 [<c0104f83>] kernel_thread_helper+0x7/0x14 ======================= Module 'usbcore' failed to be added to sysfs, error number -17 The system will be unstable now. Signed-off-by: Dave Young <hidave.darkstar@gmail.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:21 -07:00
Tony Breeds	2c62214831	Fix discrepancy between VDSO based gettimeofday() and sys_gettimeofday(). On platforms that copy sys_tz into the vdso (currently only x86_64, soon to include powerpc), it is possible for the vdso to get out of sync if a user calls (admittedly unusual) settimeofday(NULL, ptr). This patch adds a hook for architectures that set CONFIG_GENERIC_TIME_VSYSCALL to ensure when sys_tz is updated they can also updatee their copy in the vdso. Signed-off-by: Tony Breeds <tony@bakeyournoodle.com> Cc: Andi Kleen <ak@suse.de> Cc: Tony Luck <tony.luck@intel.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:20 -07:00
Alexey Dobriyan	6212e3a388	Remove struct task_struct::io_wait Hell knows what happened in commit 63b05203af57e7de4f3bb63b8b81d43bc196d32b during 2.6.9 development. Commit introduced io_wait field which remained write-only than and still remains write-only. Also garbage collect macros which "use" io_wait. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:20 -07:00
Rafael J. Wysocki	9cd9a0058d	Hibernation: Enter platform hibernation state in a consistent way Make hibernation_platform_enter() execute the enter-a-sleep-state sequence instead of the mixed shutdown-with-entering-S4 thing. Replace the shutting down of devices done by kernel_shutdown_prepare(), before entering the ACPI S4 sleep state, with suspending them and the shutting down of sysdevs with calling device_power_down(PMSG_SUSPEND) (just like before entering S1 or S3, but the target state is now S4). Also, disable the nonboot CPUs before entering the sleep state (S4), which generally always is a good idea. This is known to fix the "double disk spin down during hibernation" on some machines, eg. HPC nx6325 (ref. http://lkml.org/lkml/2007/8/7/316 and the following thread). Moreover, it has been reported to make /sys/class/rtc/rtc0/wakealarm work correctly with hibernation for some users. It also generally causes the hibernation state (ACPI S4) to be entered faster. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:20 -07:00
Rafael J. Wysocki	c7e0831d38	Hibernation: Check if ACPI is enabled during restore in the right place The following scenario leads to total confusion of the platform firmware on some boxes (eg. HPC nx6325): * Hibernate with ACPI enabled * Resume passing "acpi=off" to the boot kernel To prevent this from happening it's necessary to check if ACPI is enabled (and enable it if that's not the case) _right_ _after_ control has been transfered from the boot kernel to the image kernel, before device_power_up() is called (ie. with interrupts disabled). Enabling ACPI after calling device_power_up() turns out to be insufficient. For this reason, introduce new hibernation callback ->leave() that will be executed before device_power_up() by the restored image kernel. To make it work, it also is necessary to move swsusp_suspend() from swsusp.c to disk.c (it's name is changed to "create_image", which is more up to the point). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:20 -07:00
Rafael J. Wysocki	d307c4a8e8	Hibernation: Arbitrary boot kernel support - generic code Add the bits needed for supporting arbitrary boot kernels to the common hibernation code. To support arbitrary boot kernels, make it possible to replace the 'struct new_utsname' and the kernel version in the hibernation image header by some architecture specific data that will be used to verify if the image is valid and to restore the image. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:19 -07:00
Andres Salomon	8f4ce8c32f	serial: turn serial console suspend a boot rather than compile time option Currently, there's a CONFIG_DISABLE_CONSOLE_SUSPEND that allows one to stop the serial console from being suspended when the rest of the machine goes to sleep. This is incredibly useful for debugging power management-related things; however, having it as a compile-time option has proved to be incredibly inconvenient for us (OLPC). There are plenty of times that we want serial console to not suspend, but for the most part we'd like serial console to be suspended. This drops CONFIG_DISABLE_CONSOLE_SUSPEND, and replaces it with a kernel boot parameter (no_console_suspend). By default, the serial console will be suspended along with the rest of the system; by passing 'no_console_suspend' to the kernel during boot, serial console will remain alive during suspend. For now, this is pretty serial console specific; further fixes could be applied to make this work for things like netconsole. Signed-off-by: Andres Salomon <dilinger@debian.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@suspend2.net> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:19 -07:00
Rafael J. Wysocki	438e2ce68d	freezer: measure freezing time Measure the time of the freezing of tasks, even if it doesn't fail. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:19 -07:00
Rafael J. Wysocki	b842ee578e	freezer: be more verbose Increase the freezer's verbosity a bit, so that it's easier to read problem reports related to it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:19 -07:00
Adrian Bunk	c3d42d7527	unexport pm_power_off_prepare This patch removes the unused EXPORT_SYMBOL(pm_power_off_prepare). Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:19 -07:00
Rafael J. Wysocki	d5d8c5976d	freezer: do not send signals to kernel threads The freezer should not send signals to kernel threads, since that may lead to subtle problems. In particular, commit `b74d0deb96` has changed recalc_sigpending_tsk() so that it doesn't clear TIF_SIGPENDING. For this reason, if the freezer continues to send fake signals to kernel threads and the freezing of kernel threads fails, some of them may be running with TIF_SIGPENDING set forever. Accordingly, recalc_sigpending_tsk() shouldn't set the task's TIF_SIGPENDING flag if TIF_FREEZE is set. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:19 -07:00
Rafael J. Wysocki	2e1318956c	freezer: prevent new tasks from inheriting TIF_FREEZE set Tasks should go to the refrigerator only if explicitly requested to do that by the freezer and not as a result of inheriting the TIF_FREEZE flag set from the parent. Make it happen. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:18 -07:00
Rafael J. Wysocki	232b143280	freezer: do not sync filesystems from freeze_processes The syncing of filesystems from within the freezer is generally not needed. Also, if there's an ext3 filesystem loopback-mounted from a FUSE one, the syncing results in writes to it and deadlocks. Similarly, it will deadlock if FUSE implements sync. Change freeze_processes() so that it doesn't execute sys_sync() and make the suspend and hibernation code path sync filesystems independently of the freezer. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Nigel Cunningham <nigel@nigel.suspend2.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:18 -07:00
Rafael J. Wysocki	b3dac3b304	PM: Rename hibernation_ops to platform_hibernation_ops Rename 'struct hibernation_ops' to 'struct platform_hibernation_ops' in analogy with 'struct platform_suspend_ops'. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:18 -07:00
Rafael J. Wysocki	74f270af0c	PM: Rework struct hibernation_ops During hibernation we also need to tell the ACPI core that we're going to put the system into the S4 sleep state. For this reason, an additional method in 'struct hibernation_ops' is needed, playing the role of set_target() in 'struct platform_suspend_operations'. Moreover, the role of the .prepare() method is now different, so it's better to introduce another method, that in general may be different from .prepare(), that will be used to prepare the platform for creating the hibernation image (.prepare() is used anyway to notify the platform that we're going to enter the low power state after the image has been saved). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:18 -07:00
Rafael J. Wysocki	f242d9196f	PM: Make suspend_ops static The variable suspend_ops representing the set of global platform-specific suspend-related operations, used by the PM core, need not be exported outside of kernel/power/main.c . Make it static. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:18 -07:00
Rafael J. Wysocki	e6c5eb9541	PM: Rework struct platform_suspend_ops There is no reason why the .prepare() and .finish() methods in 'struct platform_suspend_ops' should take any arguments, since architectures don't use these methods' argument in any practically meaningful way (ie. either the target system sleep state is conveyed to the platform by .set_target(), or there is only one suspend state supported and it is indicated to the PM core by .valid(), or .prepare() and .finish() aren't defined at all). There also is no reason why .finish() should return any result. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:18 -07:00
Rafael J. Wysocki	26398a70ea	PM: Rename struct pm_ops and related things The name of 'struct pm_ops' suggests that it is related to the power management in general, but in fact it is only related to suspend. Moreover, its name should indicate what this structure is used for, so it seems reasonable to change it to 'struct platform_suspend_ops'. In that case, the name of the global variable of this type used by the PM core and the names of related functions should be changed accordingly. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: Len Brown <lenb@kernel.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:18 -07:00
Adrian Bunk	a065c86e1b	make kernel/power/main.c:suspend_enter() static suspend_enter() can now become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-18 14:37:17 -07:00
Anton Blanchard	c70878b4e0	hrtimer: hook compat_sys_nanosleep up to high res timer code Now we have high res timers on ppc64 I thought Id test them. It turns out compat_sys_nanosleep hasnt been converted to the hrtimer code and so is limited to HZ resolution. The follow patch converts compat_sys_nanosleep to use high res timers. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-18 22:54:18 +02:00
Anton Blanchard	04c227140f	hrtimer: Rework hrtimer_nanosleep to make sys_compat_nanosleep easier Pull the copy_to_user out of hrtimer_nanosleep and into the callers (common_nsleep, sys_nanosleep) in preparation for converting compat_sys_nanosleep to use hrtimers. Signed-off-by: Anton Blanchard <anton@samba.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2007-10-18 22:54:18 +02:00
Ken Chen	480b9434c5	sched: reduce schedstat variable overhead a bit schedstat is useful in investigating CPU scheduler behavior. Ideally, I think it is beneficial to have it on all the time. However, the cost of turning it on in production system is quite high, largely due to number of events it collects and also due to its large memory footprint. Most of the fields probably don't need to be full 64-bit on 64-bit arch. Rolling over 4 billion events will most like take a long time and user space tool can be made to accommodate that. I'm proposing kernel to cut back most of variable width on 64-bit system. (note, the following patch doesn't affect 32-bit system). Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-18 21:32:56 +02:00
Ingo Molnar	cc4ea79588	sched: add KERN_CONT annotation printk: add the KERN_CONT annotation (which is empty string but via which checkpatch.pl can notice that the lacking KERN_ level is fine). Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-18 21:32:56 +02:00
Ingo Molnar	d801649162	sched: cleanup, make struct rq comments more consistent cleanup, make struct rq comments more consistent. found via scripts/checkpatch.pl. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-18 21:32:55 +02:00
Ingo Molnar	8401f77505	sched: cleanup, fix spacing cleanup: fix sysctl_sched_features initialization spacing, and fix sd_alloc_ctl_cpu_table() prototype spacing. found via scripts/checkpatch.pl. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-18 21:32:55 +02:00
Andi Kleen	51e9799023	sched: fix return value of wait_for_completion_interruptible() The recent wait_for_completion() cleanups: commit `8cbbe86dfc` Author: Andi Kleen <ak@suse.de> Date: Mon Oct 15 17:00:14 2007 +0200 sched: cleanup: refactor common code of sleep_on / wait_for_completion Refactor common code of sleep_on / wait_for_completion broke the return value of wait_for_completion_interruptible(). Previously it returned 0 on success, now -1. Fix that. Problem found by Geert Uytterhoeven. [ mingo: fixed whitespace damage ] Reported-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-18 21:32:55 +02:00
Thomas Gleixner	3dfbc88464	x86: C1E late detection fix. Really switch off lapic timer Doh, I completely missed that devices marked DUMMY are not running the set_mode function. So we force broadcasting, but we keep the local APIC timer running. Let the clock event layer mark the device _after_ switching it off. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-17 20:15:13 +02:00
Linus Torvalds	e6d5a11dad	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched * git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched: sched: fix new task startup crash sched: fix !SYSFS build breakage sched: fix improper load balance across sched domain sched: more robust sd-sysctl entry freeing	2007-10-17 09:11:18 -07:00
Adrian Bunk	cbfee34520	security/ cleanups This patch contains the following cleanups that are now possible: - remove the unused security_operations->inode_xattr_getsuffix - remove the no longer used security_operations->unregister_security - remove some no longer required exit code - remove a bunch of no longer used exports Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: James Morris <jmorris@namei.org> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:07 -07:00
Alexey Dobriyan	57c521ce61	ifdef struct task_struct::security For those who don't care about CONFIG_SECURITY. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:07 -07:00
Oleg Nesterov	d2da272a4e	migration_call(CPU_DEAD): use spin_lock_irq() instead of task_rq_lock() Change migration_call(CPU_DEAD) to use direct spin_lock_irq() instead of task_rq_lock(rq->idle), rq->idle can't change its task_rq(). This makes the code a bit more symmetrical with migrate_dead_tasks()'s path which uses spin_lock_irq/spin_unlock_irq. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Cliff Wickman <cpw@sgi.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
Oleg Nesterov	f7b4cddcc5	do CPU_DEAD migrating under read_lock(tasklist) instead of write_lock_irq(tasklist) Currently move_task_off_dead_cpu() is called under write_lock_irq(tasklist). This means it can't use task_lock() which is needed to improve migrating to take task's ->cpuset into account. Change the code to call move_task_off_dead_cpu() with irqs enabled, and change migrate_live_tasks() to use read_lock(tasklist). This all is a preparation for the futher changes proposed by Cliff Wickman, see http://marc.info/?t=117327786100003 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Cliff Wickman <cpw@sgi.com> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:03 -07:00
Akinobu Mita	d58ae67813	module: return error when mod_sysfs_init() failed load_module() returns zero when mod_sysfs_init() fails, then the module loading will succeed accidentally. This patch makes load_module() return error correctly in that case. Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:01 -07:00
Ralf Baechle	28e3fed8b7	Compile handle_percpu_irq even for uniprocessor kernels Compiling handle_percpu_irq only on uniprocessor generates an artificial special case so a typical use like: set_irq_chip_and_handler(irq, &some_irq_type, handle_percpu_irq); needs to be conditionally compiled only on SMP systems as well and an alternative UP construct is usually needed - for no good reason. This fixes uniprocessor configurations for some MIPS SMP systems. Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:00 -07:00
Andrey Mirkin	fd5eea4214	change inotifyfs magic as the same magic is used for futexfs Right now futexfs and inotifyfs have one magic 0xBAD1DEA, that looks a little bit confusing. Use 0xBAD1DEA as magic for futexfs and 0x2BAD1DEA as magic for inotifyfs. Signed-off-by: Andrey Mirkin <major@openvz.org> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:43:00 -07:00
Pavel Emelyanov	db8906da59	Use KMEM_CACHE macro to create the nsproxy cache The blessed way for standard caches is to use it. Besides, this may give this cache a better alignment. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Cedric Le Goater <clg@fr.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:59 -07:00
Alexey Dobriyan	970a8645ca	user.c: #ifdef ->mq_bytes For those who deselect POSIX message queues. Reduces SLAB size of user_struct from 64 to 32 bytes here, SLUB size -- from 40 bytes to 32 bytes. [akpm@linux-foundation.org: fix build] Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:59 -07:00
Alexey Dobriyan	40aeb400f6	user.c: deinline Save some space because uid_hash_find() has 3 callsites. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:58 -07:00
Jan Beulich	22e48eaf58	constify string/array kparam tracking structures .. in an effort to make read-only whatever can be made, so that CONFIG_DEBUG_RODATA can catch as many issues as possible. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:56 -07:00
Adrian Bunk	b012d346c0	make kernel/profile.c:time_hook static {,un}register_timer_hook() is the API that should be used. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:55 -07:00
Adrian Bunk	0732a552cb	kernel/sys_ni.c: add dummy sys_ni_syscall() prototype kernel/sys_ni.c can't #include <linux/syscalls.h> due to cond_syscall(), but let's tell gcc to not warn with -Wmissing-prototypes. Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:55 -07:00
Avi Kivity	e98c320291	Move PREEMPT_NOTIFIERS into an always-included Kconfig Kconfig.preempt is not included on some archs (for example, m68k). On those archs, the Kconfig machinery complains that KVM selects an undefined symbol PREEMPT_NOTIFIERS (which lives in Kconfig.preempt). So move the offending symbol into a Kconfig file which is included by everyone. Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Avi Kivity <avi@qumranet.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:55 -07:00
Alexey Dobriyan	42b2dd0a02	Shrink task_struct if CONFIG_FUTEX=n robust_list, compat_robust_list, pi_state_list, pi_state_cache are really used if futexes are on. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:55 -07:00
Ken'ichi Ohmichi	bcbba6c10e	add-vmcore: add a prefix "VMCOREINFO_" to the vmcoreinfo macros Add a prefix "VMCOREINFO_" to the vmcoreinfo macros. Old vmcoreinfo macros were defined as generic names SYMBOL/SIZE/OFFSET /LENGTH/CONFIG, and it is impossible to grep for them. So these names should be changed. This discussion is the following: http://www.ussg.iu.edu/hypermail/linux/kernel/0709.1/0415.html Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:54 -07:00
Ken'ichi Ohmichi	6cfa062f01	add-vmcore: add nodemask_t's size and NR_FREE_PAGES's value to vmcoreinfo_data [2/3] Add nodemask_t's size and NR_FREE_PAGES's value to vmcoreinfo_data. The dump filetering command 'makedumpfile'(v1.1.6 or before) had assumed the above values, and it was not good from the reliability viewpoint. So makedumpfile v1.2.0 came to need these values and I created the patch to let the kernel output them. makedumpfile site: https://sourceforge.net/projects/makedumpfile/ Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:54 -07:00
Ken'ichi Ohmichi	d768281e97	add-vmcore: cleanup the coding style according to Andrew's comments [1/3] Cleanup the coding style according to Andrew's comments: http://lists.infradead.org/pipermail/kexec/2007-August/000522.html - vmcoreinfo_append_str() should have suitable __attribute__s so that the compiler can check its use. - vmcoreinfo_max_size should have size_t. - Use get_seconds() instead of xtime.tv_sec. - Use init_uts_ns.name.release instead of UTS_RELEASE. Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:54 -07:00
Ken'ichi Ohmichi	fd59d231f8	Add vmcoreinfo This patch set frees the restriction that makedumpfile users should install a vmlinux file (including the debugging information) into each system. makedumpfile command is the dump filtering feature for kdump. It creates a small dumpfile by filtering unnecessary pages for the analysis. To distinguish unnecessary pages, it needs a vmlinux file including the debugging information. These days, the debugging package becomes a huge file, and it is hard to install it into each system. To solve the problem, kdump developers discussed it at lkml and kexec-ml. As the result, we reached the conclusion that necessary information for dump filtering (called "vmcoreinfo") should be embedded into the first kernel file and it should be accessed through /proc/vmcore during the second kernel. (http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.0/1806.html) Dan Aloni created the patch set for the above implementation. (http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.1/1053.html) And I updated it for multi architectures and memory models. (http://lists.infradead.org/pipermail/kexec/2007-August/000479.html) Signed-off-by: Dan Aloni <da-x@monatomic.org> Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp> Signed-off-by: Bernhard Walle <bwalle@suse.de> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:54 -07:00
Oleg Nesterov	13fbcb7312	do_sigaction: don't worry about signal_pending() do_sigaction() returns -ERESTARTNOINTR if signal_pending(). The comment says: * If there might be a fatal signal pending on multiple * threads, make sure we take it before changing the action. I think this is not needed. We should only worry about SIGNAL_GROUP_EXIT case, bit it implies a pending SIGKILL which can't be cleared by do_sigaction. Kill this special case. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:54 -07:00
Oleg Nesterov	6db840fa78	exec: RT sub-thread can livelock and monopolize CPU on exec de_thread() yields waiting for ->group_leader to be a zombie. This deadlocks if an rt-prio execer shares the same cpu with ->group_leader. Change the code to use ->group_exit_task/notify_count mechanics. This patch certainly uglifies the code, perhaps someone can suggest something better. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:54 -07:00
Paul E. McKenney	c17ac85504	Make rcutorture RNG use temporal entropy Repost of http://lkml.org/lkml/2007/8/10/472 made available by request. The locking used by get_random_bytes() can conflict with the preempt_disable() and synchronize_sched() form of RCU. This patch changes rcutorture's RNG to gather entropy from the new cpu_clock() interface (relying on interrupts, preemption, daemons, and rcutorture's reader thread's rock-bottom scheduling priority to provide useful entropy), and also adds and EXPORT_SYMBOL_GPL() to make that interface available to GPLed kernel modules such as rcutorture. Passes several hours of rcutorture. [ego@in.ibm.com: Use raw_smp_processor_id() in rcu_random()] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:53 -07:00
john stultz	b2d9323d13	Use num_possible_cpus() instead of NR_CPUS for timer distribution To avoid lock contention, we distribute the sched_timer calls across the cpus so they do not trigger at the same instant. However, I used NR_CPUS, which can cause needless grouping on small smp systems depending on your kernel config. This patch converts to using num_possible_cpus() so we spread it as evenly as possible on every machine. Briefly tested w/ NR_CPUS=255 and verified reduced contention. Signed-off-by: John Stultz <johnstul@us.ibm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:53 -07:00
Adrian Bunk	ba2a631b14	kernel/time/timekeeping.c: cleanups - remove the no longer required __attribute__((weak)) of xtime_lock - remove the following no longer used EXPORT_SYMBOL's: - xtime - xtime_lock Signed-off-by: Adrian Bunk <bunk@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:53 -07:00
Oleg Nesterov	715015e8da	wait_task_stopped/continued: remove unneeded p->signal != NULL check The child was found on ->children list under tasklist_lock, it must have a valid ->signal. __exit_signal() both removes the task from parent->children and clears ->signal "atomically" under write_lock(tasklist). Remove unneeded checks. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:52 -07:00
Oleg Nesterov	18442cf28a	__group_complete_signal: eliminate unneeded wakeup of ->group_exit_task Cleanup. __group_complete_signal() wakes up ->group_exit_task twice. The second wakeup's state includes TASK_UNINTERRUPTIBLE, which is not very appropriate. Change the code to pass the "correct" argument to signal_wake_up() and kill now unneeded wake_up_process(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:51 -07:00
Oleg Nesterov	442a10cf9e	wait_task_zombie: don't fight with non-existing race with a dying ptracee The "p->exit_signal == -1 && p->ptrace == 0" check and the comment are bogus. We already did exactly the same check in eligible_child(), we did not drop tasklist_lock since then, and both variables need write_lock(tasklist) to be changed. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:51 -07:00
Oleg Nesterov	ebca4cda11	zap_other_threads: don't optimize thread_group_empty() case Nowadays thread_group_empty() and next_thread() are simple list operations, this optimization doesn't make sense: we are doing exactly same check one line below. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:51 -07:00
Oleg Nesterov	3ae4cbadf4	exit_notify: don't take tasklist for TIF_SIGPENDING re-targeting ->siglock provides enough protection to iterate over the thread group. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:51 -07:00
Oleg Nesterov	2f4e6e2a81	wait_task_zombie: fix 2/3 races vs forget_original_parent() Two threads, T1 and T2. T2 ptraces P, and P is not a child of ptracer's thread group. P exits and goes to TASK_ZOMBIE. T1 does wait_task_zombie(P): P->exit_state = TASK_DEAD; ... read_unlock(&tasklist_lock); T2 does exit(), takes tasklist, forget_original_parent() does __ptrace_unlink(P) but doesn't call do_notify_parent(P) because p->exit_state == EXIT_DEAD. Now, P is not visible to our process: __ptrace_unlink() removed it from ->children. We should send notification to P->parent and release P if and only if SIGCHLD is ignored. And we have 3 bugs: 1. P->parent does do_wait() and gets -ECHILD (P is on ->parent->children, but its state is TASK_DEAD). 2. // wait_task_zombie() continues if (put_user(...)) { // TODO: is this safe? p->exit_state = EXIT_ZOMBIE; return; } we return without notification/release, task_struct leaked. Solution: ignore -EFAULT and proceed. It is an application's bug if we can't fill infop/stat_addr (in case of VM_FAULT_OOM we have much more problems). 3. // wait_task_zombie() continues if (p->real_parent != p->parent) { // Not taken, it was untraced'ed ... } release_task(p); we released the task which we shouldn't. Solution: check ->real_parent != ->parent before, under tasklist_lock, but use ptrace_unlink() instead of __ptrace_unlink() to check ->ptrace. This patch hopefully solves 2 and 3, the 1st bug will be fixed later, we need some cleanups in forget_original_parent/reparent_thread. However, the first race is very unlikely and not critical, so I hope it makes sense to fix 1 and 2 for now. 4. Small cleanup: don't "restore" EXIT_ZOMBIE unless we know we are not going to realease the child. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:51 -07:00
Oleg Nesterov	407af46a96	wait_task_zombie: remove unneeded child->signal check A zombie must have a valid ->signal, we are going to release it and __exit_signal() starts with BUG_ON(!sig). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:51 -07:00
Oleg Nesterov	84eb646b6e	handle the multi-threaded init's exit() properly With or without this patch, multi-threaded init's are not fully supported, but do_exit() is completely wrong. This becomes a real problem when we support pid namespaces. 1. do_exit() panics when the main thread of /sbin/init exits. It should not until the whole thread group exits. Move the code below, under the "if (group_dead)" check. Note: this means that forget_original_parent() can use an already dead child_reaper()'s task_struct. This is OK for /sbin/init because - do_wait() from alive sub-thread still can reap a zombie, we iterate over all sub-thread's ->children lists - do_notify_parent() will wakeup some alive sub-thread because it sends the group-wide signal However, we should remove choose_new_parent()->BUG_ON(reaper->exit_state) for this. 2. We are playing games with ->nsproxy->pid_ns. This code is bogus today, and it has to be changed anyway when we really support pid namespaces, just remove it. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Roland McGrath <roland@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:51 -07:00
Oleg Nesterov	045f902de5	do_sigaction: remove now unneeded recalc_sigpending() With the recent changes, do_sigaction()->recalc_sigpending_and_wake() can never clear TIF_SIGPENDING. Instead, it can set this flag and wake up the thread without any reason. Harmless, but unneeded and wastes CPU. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:51 -07:00
Oleg Nesterov	d2ee7198cc	pi-futex: set PF_EXITING without taking ->pi_lock It is a bit annoying that do_exit() takes ->pi_lock to set PF_EXITING. All we need is to synchronize with lookup_pi_state() which saw this task without PF_EXITING under ->pi_lock. Change do_exit() to use spin_unlock_wait(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:51 -07:00
Mike Frysinger	0b15d04af3	printk: add interfaces for external access to the log buffer Add two new functions for reading the kernel log buffer. The intention is for them to be used by recovery/dump/debug code so the kernel log can be easily retrieved/parsed in a crash scenario, but they are generic enough for other people to dream up other fun uses. [akpm@linux-foundation.org: buncha fixes] Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Robin Getz <rgetz@blackfin.uclinux.org> Cc: Greg Ungerer <gerg@snapgear.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: Tim Bird <tim.bird@am.sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:50 -07:00
Adrian Bunk	a4d63e729e	kernel/rtmutex-debug.c: cleanups This patch contains the following cleanups: - make the needlessly global variable rt_trace_on static - remove the unused global function deadlock_trace_off() Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:50 -07:00
Roland McGrath	6d76013381	Add /sys/module/name/notes This patch adds the /sys/module/<name>/notes/ magic directory, which has a file for each allocated SHT_NOTE section that appears in <name>.ko. This is the counterpart for each module of /sys/kernel/notes for vmlinux. Reading this delivers the contents of the module's SHT_NOTE sections. This lets userland easily glean any detailed information about that module's build that was stored there at compile time (e.g. by ld --build-id). Signed-off-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:50 -07:00
David Woodhouse	1d99493b3a	Fix CONFIG_DEBUG_SHIRQ trigger on free_irq() Andy Gospodarek pointed out that because we return in the middle of the free_irq() function, we never actually do call the IRQ handler that just got deregistered. This should fix it, although I expect Andrew will want to convert those 'return's to 'break'. That's a separate change though. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Fernando Luis Vzquez Cao <fernando@oss.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:49 -07:00
Rusty Russell	af49d9248f	Remove "unsafe" from module struct Adrian Bunk points out that "unsafe" was used to mark modules touched by the deprecated MOD_INC_USE_COUNT interface, which has long gone. It's time to remove the member from the module structure, as well. If you want a module which can't unload, don't register an exit function. (Vlad Yasevich says SCTP is now safe to unload, so just remove the __unsafe there). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Shannon Nelson <shannon.nelson@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com> Cc: Sridhar Samudrala <sri@us.ibm.com> Cc: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:49 -07:00
Avi Kivity	bf020cb7b3	time: simplify smp_call_function_single() call sequence smp_call_function_single() now knows how to call the function on the current cpu. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Avi Kivity <avi@qumranet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:48 -07:00
Jesper Juhl	a9022e9cb9	Clean up duplicate includes in kernel/ This patch cleans up duplicate includes in kernel/ Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Satyam Sharma <ssatyam@cse.iitk.ac.in> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:48 -07:00
Alexey Dobriyan	040b5c6f95	SLAB_PANIC more (proc, posix-timers, shmem) These aren't modular, so SLAB_PANIC is OK. Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:47 -07:00
Ravikiran G Thirumalai	c4f3b63fe1	softlockup: add a /proc tuning parameter Control the trigger limit for softlockup warnings. This is useful for debugging softlockups, by lowering the softlockup_thresh to identify possible softlockups earlier. This patch: 1. Adds a sysctl softlockup_thresh with valid values of 1-60s (Higher value to disable false positives) 2. Changes the softlockup printk to print the cpu softlockup time [akpm@linux-foundation.org: Fix various warnings and add definition of "two"] Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:47 -07:00
Ingo Molnar	a5f2ce3c60	softlockup watchdog: style cleanups kernel/softirq.c grew a few style uncleanlinesses in the past few months, clean that up. No functional changes: text data bss dec hex filename 1126 76 4 1206 4b6 softlockup.o.before 1129 76 4 1209 4b9 softlockup.o.after ( the 3 bytes .text increase is due to the "<1>" appended to one of the printk messages. ) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:47 -07:00
Ingo Molnar	43581a1007	softlockup: improve debug output Improve the debuggability of kernel lockups by enhancing the debug output of the softlockup detector: print the task that causes the lockup and try to print a more intelligent backtrace. The old format was: BUG: soft lockup detected on CPU#1! [<c0105e4a>] show_trace_log_lvl+0x19/0x2e [<c0105f43>] show_trace+0x12/0x14 [<c0105f59>] dump_stack+0x14/0x16 [<c015f6bc>] softlockup_tick+0xbe/0xd0 [<c013457d>] run_local_timers+0x12/0x14 [<c01346b8>] update_process_times+0x3e/0x63 [<c0145fb8>] tick_sched_timer+0x7c/0xc0 [<c0140a75>] hrtimer_interrupt+0x135/0x1ba [<c011bde7>] smp_apic_timer_interrupt+0x6e/0x80 [<c0105aa3>] apic_timer_interrupt+0x33/0x38 [<c0104f8a>] syscall_call+0x7/0xb ======================= The new format is: BUG: soft lockup detected on CPU#1! [prctl:2363] Pid: 2363, comm: prctl EIP: 0060:[<c013915f>] CPU: 1 EIP is at sys_prctl+0x24/0x18c EFLAGS: 00000213 Not tainted (2.6.22-cfs-v20 #26) EAX: 00000001 EBX: 000003e7 ECX: 00000001 EDX: f6df0000 ESI: 000003e7 EDI: 000003e7 EBP: f6df0fb0 DS: 007b ES: 007b FS: 00d8 CR0: 8005003b CR2: 4d8c3340 CR3: 3731d000 CR4: 000006d0 [<c0105e4a>] show_trace_log_lvl+0x19/0x2e [<c0105f43>] show_trace+0x12/0x14 [<c01040be>] show_regs+0x1ab/0x1b3 [<c015f807>] softlockup_tick+0xef/0x108 [<c013457d>] run_local_timers+0x12/0x14 [<c01346b8>] update_process_times+0x3e/0x63 [<c0145fcc>] tick_sched_timer+0x7c/0xc0 [<c0140a89>] hrtimer_interrupt+0x135/0x1ba [<c011bde7>] smp_apic_timer_interrupt+0x6e/0x80 [<c0105aa3>] apic_timer_interrupt+0x33/0x38 [<c0104f8a>] syscall_call+0x7/0xb ======================= Note that in the old format we only knew that some system call locked up, we didnt know _which_. With the new format we know that it's at a specific place in sys_prctl(). [which was where i created an artificial kernel lockup to test the new format.] This is also useful if the lockup happens in user-space - the user-space EIP (and other registers) will be printed too. (such a lockup would either suggest that the task was running at SCHED_FIFO:99 and looping for more than 10 seconds, or that the softlockup detector has a false-positive.) The task name is printed too first, just in case we dont manage to print a useful backtrace. [satyam@infradead.org: fix warning] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Satyam Sharma <satyam@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:47 -07:00
Ingo Molnar	a115d5caca	fix the softlockup watchdog to actually work this Xen related commit: commit `966812dc98` Author: Jeremy Fitzhardinge <jeremy@goop.org> Date: Tue May 8 00:28:02 2007 -0700 Ignore stolen time in the softlockup watchdog broke the softlockup watchdog to never report any lockups. (!) print_timestamp defaults to 0, this makes the following condition always true: if (print_timestamp < (touch_timestamp + 1) \|\| and we'll in essence never report soft lockups. apparently the functionality of the soft lockup watchdog was never actually tested with that patch applied ... Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:47 -07:00
Ingo Molnar	a3b13c23f1	softlockup: use cpu_clock() instead of sched_clock() sched_clock() is not a reliable time-source, use cpu_clock() instead. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:46 -07:00
David Rientjes	bbe373f2c6	oom: compare cpuset mems_allowed instead of exclusive ancestors Instead of testing for overlap in the memory nodes of the the nearest exclusive ancestor of both current and the candidate task, it is better to simply test for intersection between the task's mems_allowed in their task descriptors. This does not require taking callback_mutex since it is only used as a hint in the badness scoring. Tasks that do not have an intersection in their mems_allowed with the current task are not explicitly restricted from being OOM killed because it is quite possible that the candidate task has allocated memory there before and has since changed its mems_allowed. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:46 -07:00
David Rientjes	fe071d7e8a	oom: add oom_kill_allocating_task sysctl Adds a new sysctl, 'oom_kill_allocating_task', which will automatically kill the OOM-triggering task instead of scanning through the tasklist to find a memory-hogging target. This is helpful for systems with an insanely large number of tasks where scanning the tasklist significantly degrades performance. Cc: Andrea Arcangeli <andrea@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:46 -07:00
Christoph Lameter	4ba9b9d0ba	Slab API: remove useless ctor parameter and reorder parameters Slab constructors currently have a flags parameter that is never used. And the order of the arguments is opposite to other slab functions. The object pointer is placed before the kmem_cache pointer. Convert ctor(void object, struct kmem_cache s, unsigned long flags) to ctor(struct kmem_cache s, void object) throughout the kernel [akpm@linux-foundation.org: coupla fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
Peter Zijlstra	3e26c149c3	mm: dirty balancing for tasks Based on ideas of Andrew: http://marc.info/?l=linux-kernel&m=102912915020543&w=2 Scale the bdi dirty limit inversly with the tasks dirty rate. This makes heavy writers have a lower dirty limit than the occasional writer. Andrea proposed something similar: http://lwn.net/Articles/152277/ The main disadvantage to his patch is that he uses an unrelated quantity to measure time, which leaves him with a workload dependant tunable. Other than that the two approaches appear quite similar. [akpm@linux-foundation.org: fix warning] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
Peter Zijlstra	04fbfdc14e	mm: per device dirty threshold Scale writeback cache per backing device, proportional to its writeout speed. By decoupling the BDI dirty thresholds a number of problems we currently have will go away, namely: - mutual interference starvation (for any number of BDIs); - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts). It might be that all dirty pages are for a single BDI while other BDIs are idling. By giving each BDI a 'fair' share of the dirty limit, each one can have dirty pages outstanding and make progress. A global threshold also creates a deadlock for stacked BDIs; when A writes to B, and A generates enough dirty pages to get throttled, B will never start writeback until the dirty pages go away. Again, by giving each BDI its own 'independent' dirty limit, this problem is avoided. So the problem is to determine how to distribute the total dirty limit across the BDIs fairly and efficiently. A DBI that has a large dirty limit but does not have any dirty pages outstanding is a waste. What is done is to keep a floating proportion between the DBIs based on writeback completions. This way faster/more active devices get a larger share than slower/idle devices. [akpm@linux-foundation.org: fix warnings] [hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-17 08:42:45 -07:00
Srivatsa Vaddagiri	b9dca1e0fc	sched: fix new task startup crash Child task may be added on a different cpu that the one on which parent is running. In which case, task_new_fair() should check whether the new born task's parent entity should be added as well on the cfs_rq. Patch below fixes the problem in task_new_fair. This could fix the put_prev_task_fair() crashes reported. Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Reported-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-17 16:55:11 +02:00
Dhaval Giani	b1a8c172c3	sched: fix !SYSFS build breakage When CONFIG_SYSFS is not set, CONFIG_FAIR_USER_SCHED fails to build with kernel/built-in.o: In function `uids_kobject_init': (.init.text+0x1488): undefined reference to `kernel_subsys' kernel/built-in.o: In function `uids_kobject_init': (.init.text+0x1490): undefined reference to `kernel_subsys' kernel/built-in.o: In function `uids_kobject_init': (.init.text+0x1480): undefined reference to `kernel_subsys' kernel/built-in.o: In function `uids_kobject_init': (.init.text+0x1494): undefined reference to `kernel_subsys' This patch fixes this build error. Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-17 16:55:11 +02:00
Ken Chen	908a7c1b9b	sched: fix improper load balance across sched domain We recently discovered a nasty performance bug in the kernel CPU load balancer where we were hit by 50% performance regression. When tasks are assigned to a subset of CPUs that span across sched_domains (either ccNUMA node or the new multi-core domain) via cpu affinity, kernel fails to perform proper load balance at these domains, due to several logic in find_busiest_group() miss identified busiest sched group within a given domain. This leads to inadequate load balance and causes 50% performance hit. To give you a concrete example, on a dual-core, 2 socket numa system, there are 4 logical cpu, organized as: CPU0 attaching sched-domain: domain 0: span 0003 groups: 0001 0002 domain 1: span 000f groups: 0003 000c CPU1 attaching sched-domain: domain 0: span 0003 groups: 0002 0001 domain 1: span 000f groups: 0003 000c CPU2 attaching sched-domain: domain 0: span 000c groups: 0004 0008 domain 1: span 000f groups: 000c 0003 CPU3 attaching sched-domain: domain 0: span 000c groups: 0008 0004 domain 1: span 000f groups: 000c 0003 If I run 2 tasks with CPU affinity set to 0x5. There are situation where cpu0 has run queue length of 2, and cpu2 will be idle. The kernel load balancer is unable to balance out these two tasks over cpu0 and cpu2 due to at least three logics in find_busiest_group() that heavily bias load balance towards power saving mode. e.g. while determining "busiest" variable, kernel only set it when "sum_nr_running > group_capacity". This test is flawed that "sum_nr_running" is not necessary same as sum-tasks-allowed-to-run-within-the sched-group. The end result is that kernel "think" everything is balanced, but in reality we have an imbalance and thus causing one CPU to be over-subscribed and leaving other idle. There are two other logic in the same function will also causing similar effect. The nastiness of this bug is that kernel not be able to get unstuck in this unfortunate broken state. From what we've seen in our environment, kernel will stuck in imbalanced state for extended period of time and it is also very easy for the kernel to stuck into that state (it's pretty much 100% reproducible for us). So proposing the following fix: add addition logic in find_busiest_group to detect intrinsic imbalance within the busiest group. When such condition is detected, load balance goes into spread mode instead of default grouping mode. Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-17 16:55:11 +02:00
Milton Miller	cd79007634	sched: more robust sd-sysctl entry freeing It occurred to me this morning that the procname field was dynamically allocated and needed to be freed. I started to put in break statements when allocation failed but it was approaching 50% error handling code. I came up with this alternative of looping while entry->mode is set and checking proc_handler instead of ->table. Alternatively, the string version of the domain name and cpu number could be stored the structs. I verified by compiling CONFIG_DEBUG_SLAB and checking the allocation counts after taking a cpuset exclusive and back. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2007-10-17 16:55:11 +02:00
Ingo Molnar	f20bf61256	time: introduce xtime_seconds improve performance of sys_time(). sys_time() returns time in seconds, but it does so by calling do_gettimeofday() and then returning the tv_sec portion of the GTOD time. But the data structure "xtime", which is updated by every timer/scheduler tick, already offers HZ granularity time. the patch improves the sysbench oltp macrobenchmark by 4-5% on an AMD dual-core system: v2.6.23: #threads 1: transactions: 4073 (407.23 per sec.) 2: transactions: 8530 (852.81 per sec.) 3: transactions: 8321 (831.88 per sec.) 4: transactions: 8407 (840.58 per sec.) 5: transactions: 8070 (806.74 per sec.) v2.6.23 + sys_time-speedup.patch: 1: transactions: 4281 (428.09 per sec.) 2: transactions: 8910 (890.85 per sec.) 3: transactions: 8659 (865.79 per sec.) 4: transactions: 8676 (867.34 per sec.) 5: transactions: 8532 (852.91 per sec.) and by 4-5% on an Intel dual-core system too: 2.6.23: 1: transactions: 4560 (455.94 per sec.) 2: transactions: 10094 (1009.30 per sec.) 3: transactions: 9755 (975.36 per sec.) 4: transactions: 9859 (985.78 per sec.) 5: transactions: 9701 (969.72 per sec.) 2.6.23 + sys_time-speedup.patch: 1: transactions: 4779 (477.84 per sec.) 2: transactions: 10103 (1010.14 per sec.) 3: transactions: 10141 (1013.93 per sec.) 4: transactions: 10371 (1036.89 per sec.) 5: transactions: 10178 (1017.50 per sec.) (the more CPUs the system has, the more speedup this patch gives for this particular workload.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 10:01:50 -07:00
Masami Hiramatsu	f438d914b2	kprobes: support kretprobe blacklist Introduce architecture dependent kretprobe blacklists to prohibit users from inserting return probes on the function in which kprobes can be inserted but kretprobes can not. This patch also removes "__kprobes" mark from "__switch_to" on x86_64 and registers "__switch_to" to the blacklist on x86-64, because that mark is to prohibit user from inserting only kretprobe. Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:10 -07:00
Paul Jackson	607717a65d	cpuset: remove sched domain hooks from cpusets Remove the cpuset hooks that defined sched domains depending on the setting of the 'cpu_exclusive' flag. The cpu_exclusive flag can only be set on a child if it is set on the parent. This made that flag painfully unsuitable for use as a flag defining a partitioning of a system. It was entirely unobvious to a cpuset user what partitioning of sched domains they would be causing when they set that one cpu_exclusive bit on one cpuset, because it depended on what CPUs were in the remainder of that cpusets siblings and child cpusets, after subtracting out other cpu_exclusive cpusets. Furthermore, there was no way on production systems to query the result. Using the cpu_exclusive flag for this was simply wrong from the get go. Fortunately, it was sufficiently borked that so far as I know, almost no successful use has been made of this. One real time group did use it to affectively isolate CPUs from any load balancing efforts. They are willing to adapt to alternative mechanisms for this, such as someway to manipulate the list of isolated CPUs on a running system. They can do without this present cpu_exclusive based mechanism while we develop an alternative. There is a real risk, to the best of my understanding, of users accidentally setting up a partitioned scheduler domains, inhibiting desired load balancing across all their CPUs, due to the nonobvious (from the cpuset perspective) side affects of the cpu_exclusive flag. Furthermore, since there was no way on a running system to see what one was doing with sched domains, this change will be invisible to any using code. Unless they have real insight to the scheduler load balancing choices, they will be unable to detect that this change has been made in the kernel's behaviour. Initial discussion on lkml of this patch has generated much comment. My (probably controversial) take on that discussion is that it has reached a rough concensus that the current cpuset cpu_exclusive mechanism for defining sched domains is borked. There is no concensus on the replacement. But since we can remove this mechanism, and since its continued presence risks causing unwanted partitioning of the schedulers load balancing, we should remove it while we can, as we proceed to work the replacement scheduler domain mechanisms. Signed-off-by: Paul Jackson <pj@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:09 -07:00
Christoph Hellwig	0ac1555915	m32r: convert to generic sys_ptrace Convert m32r to the generic sys_ptrace. The conversion requires an architecture hook after ptrace_attach which this patch adds. The hook will also be needed for a conersion of ia64 to the generic ptrace code. Thanks to Hirokazu Takata for fixing a bug in the first version of this code. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:04 -07:00
Adam Litke	54f9f80d65	hugetlb: Add hugetlb_dynamic_pool sysctl The maximum size of the huge page pool can be controlled using the overall size of the hugetlb filesystem (via its 'size' mount option). However in the common case the this will not be set as the pool is traditionally fixed in size at boot time. In order to maintain the expected semantics, we need to prevent the pool expanding by default. This patch introduces a new sysctl controlling dynamic pool resizing. When this is enabled the pool will expand beyond its base size up to the size of the hugetlb filesystem. It is disabled by default. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Dave McCracken <dave.mccracken@oracle.com> Cc: William Irwin <bill.irwin@oracle.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:02 -07:00
KAMEZAWA Hiroyuki	75884fb1c6	memory unplug: memory hotplug cleanup A clean up patch for "scanning memory resource [start, end)" operation. Now, find_next_system_ram() function is used in memory hotplug, but this interface is not easy to use and codes are complicated. This patch adds walk_memory_resouce(start,len,arg,func) function. The function 'func' is called per valid memory resouce range in [start,pfn). [pbadari@us.ibm.com: Error handling in walk_memory_resource()] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:01 -07:00
Mel Gorman	e12ba74d8f	Group short-lived and reclaimable kernel allocations This patch marks a number of allocations that are either short-lived such as network buffers or are reclaimable such as inode allocations. When something like updatedb is called, long-lived and unmovable kernel allocations tend to be spread throughout the address space which increases fragmentation. This patch groups these allocations together as much as possible by adding a new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be reclaimed on demand, but not moved. i.e. they can be migrated by deleting them and re-reading the information from elsewhere. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:43:00 -07:00
Christoph Lameter	0e1e7c7a73	Memoryless nodes: Use N_HIGH_MEMORY for cpusets cpusets try to ensure that any node added to a cpuset's mems_allowed is on-line and contains memory. The assumption was that online nodes contained memory. Thus, it is possible to add memoryless nodes to a cpuset and then add tasks to this cpuset. This results in continuous series of oom-kill and apparent system hang. Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a. node_memory_map] in place of node_online_map when vetting memories. Return error if admin attempts to write a non-empty mems_allowed node mask containing only memoryless-nodes. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:42:59 -07:00
Christoph Lameter	4199cfa02b	Memoryless nodes: Allow profiling data to fall back to other nodes Processors on memoryless nodes must be able to fall back to remote nodes in order to get a profiling buffer. This may lead to excessive NUMA traffic but I think we should allow this rather than failing. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Bob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2007-10-16 09:42:58 -07:00

1 2 3 4 5 ...

2957 Commits