dect
/
linux-2.6
Archived
13
0
Fork 0
This repository has been archived on 2022-02-17. You can view files and clone it, but cannot push or open issues or pull requests.
linux-2.6/kernel
Miao Xie 708c1bbc9d mempolicy: restructure rebinding-mempolicy functions
Nick Piggin reported that the allocator may see an empty nodemask when
changing cpuset's mems[1].  It happens only on the kernel that do not do
atomic nodemask_t stores.  (MAX_NUMNODES > BITS_PER_LONG)

But I found that there is also a problem on the kernel that can do atomic
nodemask_t stores.  The problem is that the allocator can't find a node to
alloc page when changing cpuset's mems though there is a lot of free
memory.  The reason is like this:

(mpol: mempolicy)
	task1			task1's mpol	task2
	alloc page		1
	  alloc on node0? NO	1
				1		change mems from 1 to 0
				1		rebind task1's mpol
				0-1		  set new bits
				0	  	  clear disallowed bits
	  alloc on node1? NO	0
	  ...
	can't alloc page
	  goto oom

I can use the attached program reproduce it by the following step:

# mkdir /dev/cpuset
# mount -t cpuset cpuset /dev/cpuset
# mkdir /dev/cpuset/1
# echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
# echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
# echo $$ > /dev/cpuset/1/tasks
# numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
   <nr_tasks> = max(nr_cpus - 1, 1)
# killall -s SIGUSR1 cpuset_mem_hog
# ./change_mems.sh

several hours later, oom will happen though there is a lot of free memory.

This patchset fixes this problem by expanding the nodes range first(set
newly allowed bits) and shrink it lazily(clear newly disallowed bits).  So
we use a variable to tell the write-side task that read-side task is
reading nodemask, and the write-side task clears newly disallowed nodes
after read-side task ends the current memory allocation.

This patch:

In order to fix no node to alloc memory, when we want to update mempolicy
and mems_allowed, we expand the set of nodes first (set all the newly
nodes) and shrink the set of nodes lazily(clean disallowed nodes), But the
mempolicy's rebind functions may breaks the expanding.

So we restructure the mempolicy's rebind functions and split the rebind
work to two steps, just like the update of cpuset's mems: The 1st step:
expand the set of the mempolicy's nodes.  The 2nd step: shrink the set of
the mempolicy's nodes.  It is used when there is no real lock to protect
the mempolicy in the read-side.  Otherwise we can do rebind work at once.

In order to implement it, we define

	enum mpol_rebind_step {
		MPOL_REBIND_ONCE,
		MPOL_REBIND_STEP1,
		MPOL_REBIND_STEP2,
		MPOL_REBIND_NSTEP,
	};

If the mempolicy needn't be updated by two steps, we can pass
MPOL_REBIND_ONCE to the rebind functions.  Or we can pass
MPOL_REBIND_STEP1 to do the first step of the rebind work and pass
MPOL_REBIND_STEP2 to do the second step work.

Besides that, it maybe long time between these two step and we have to
release the lock that protects mempolicy and mems_allowed.  If we hold the
lock once again, we must check whether the current mempolicy is under the
rebinding (the first step has been done) or not, because the task may
alloc a new mempolicy when we don't hold the lock.  So we defined the
following flag to identify it:

#define MPOL_F_REBINDING (1 << 2)

The new functions will be used in the next patch.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Paul Menage <menage@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 08:06:57 -07:00
..
debug x86, kgdb, init: Add early and late debug states 2010-05-20 21:04:29 -05:00
gcov microblaze: Enable GCOV_PROFILE_ALL 2009-09-21 14:29:21 +02:00
irq genirq: Clear CPU mask in affinity_hint when none is provided 2010-05-12 11:23:34 +02:00
power PM / Hibernate: Fix block_io.c printk warning 2010-05-10 23:08:18 +02:00
time Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-05-19 17:11:10 -07:00
trace Merge branch 'master' into for-2.6.35 2010-05-21 21:27:26 +02:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.locks mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
Kconfig.preempt
Makefile Move kernel/kgdb.c to kernel/debug/debug_core.c 2010-05-20 21:04:18 -05:00
acct.c Merge branch 'next' into for-linus 2010-05-18 08:57:00 +10:00
async.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
audit.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
audit.h
audit_tree.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
audit_watch.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
auditfilter.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
auditsc.c audit: preface audit printk with audit 2010-04-05 13:19:45 -07:00
backtracetest.c
bounds.c kbuild: move bounds.h to include/generated 2009-12-12 13:08:14 +01:00
capability.c sched: Remove remaining USER_SCHED code 2010-04-02 20:12:00 +02:00
cgroup.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2010-05-20 09:20:59 -07:00
cgroup_freezer.c Freezer / cgroup freezer: Update stale locking comments 2010-05-10 23:18:47 +02:00
compat.c cpumask: fix compat getaffinity 2010-05-19 11:48:18 -07:00
configs.c
cpu.c stop_machine: reimplement using cpu_stop 2010-05-06 18:49:20 +02:00
cpuset.c mempolicy: restructure rebinding-mempolicy functions 2010-05-25 08:06:57 -07:00
cred.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 2010-05-20 08:55:50 -07:00
delayacct.c headers: taskstats_kern.h trim 2009-09-18 09:48:52 -07:00
dma.c
early_res.c x86: Do not free zero sized per cpu areas 2010-03-29 18:55:40 +02:00
elfcore.c elf coredump: add extended numbering support 2010-03-06 11:26:46 -08:00
exec_domain.c
exit.c Merge branch 'linus' into sched/core 2010-04-15 09:36:16 +02:00
extable.c
fork.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-05-18 08:19:03 -07:00
freezer.c
futex.c futex: Handle futex value corruption gracefully 2010-02-03 15:13:22 +01:00
futex_compat.c futex: Protect pid lookup in compat code with RCU 2009-12-09 14:22:14 +01:00
groups.c security: remove dead hook task_setgroups 2010-04-12 12:19:18 +10:00
hrtimer.c hrtimers: Provide schedule_hrtimeout for CLOCK_REALTIME 2010-04-06 21:50:03 +02:00
hung_task.c softlockup: Fix hung_task_check_count sysctl 2009-11-27 06:21:57 +01:00
hw_breakpoint.c hw_breakpoints: Fix percpu build failure 2010-05-04 08:39:36 +02:00
itimer.c itimers: Fix racy writes to cpu_itimer fields 2009-11-18 16:32:12 +01:00
kallsyms.c kdb: core for kgdb back end (2 of 2) 2010-05-20 21:04:21 -05:00
kexec.c kexec: fix OOPS in crash_kernel_shrink 2010-05-11 17:33:42 -07:00
kfifo.c kfifo: Don't use integer as NULL pointer 2010-02-16 15:11:08 -08:00
kmod.c kmod: fix resource leak in call_usermodehelper_pipe() 2010-01-11 09:34:04 -08:00
kprobes.c kprobes: Move enable/disable_kprobe() out from debugfs code 2010-05-08 18:08:30 +02:00
ksysfs.c sysfs: add struct file* to bin_attr callbacks 2010-05-21 09:37:31 -07:00
kthread.c cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node 2010-03-24 16:31:21 -07:00
latencytop.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
lockdep.c lockdep: Add novalidate class for dev->mutex conversion 2010-05-21 09:37:30 -07:00
lockdep_internals.h lockdep: No need to disable preemption in debug atomic ops 2010-05-04 05:38:16 +02:00
lockdep_proc.c lockstat: Make lockstat counting per cpu 2010-04-06 00:15:37 +02:00
lockdep_states.h
module.c Merge branch 'modules' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus 2010-05-21 17:15:44 -07:00
mutex-debug.c headers: remove sched.h from interrupt.h 2009-10-11 11:20:58 -07:00
mutex-debug.h locking: Implement new raw_spinlock 2009-12-14 23:55:32 +01:00
mutex.c mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
mutex.h
notifier.c sched: Use lockdep-based checking on rcu_dereference() 2010-02-25 10:34:26 +01:00
ns_cgroup.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
nsproxy.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
padata.c padata: Use get_online_cpus/put_online_cpus in padata_free 2010-05-19 13:45:35 +10:00
panic.c panic: Add taint flag TAINT_FIRMWARE_WORKAROUND ('I') 2010-05-19 08:37:43 +01:00
params.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2010-03-12 16:04:50 -08:00
perf_event.c perf: Fix exit() vs event-groups 2010-05-11 17:08:24 +02:00
pid.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-03-13 14:43:01 -08:00
pid_namespace.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
pm_qos_params.c PM: PM QOS update fix 2010-05-17 00:21:03 +02:00
posix-cpu-timers.c posix-cpu-timers: Optimize run_posix_cpu_timers() 2010-05-10 14:24:26 +02:00
posix-timers.c posix-timers.c: Don't export local functions 2010-02-05 14:54:10 +01:00
printk.c printk,kdb: capture printk() when in kdb shell 2010-05-20 21:04:27 -05:00
profile.c profile: fix stats and data leakage 2010-05-14 19:45:06 -07:00
ptrace.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-05-18 08:19:03 -07:00
range.c x86: Change range end to start+size 2010-02-10 17:47:17 -08:00
rcupdate.c rcu: slim down rcutiny by removing rcu_scheduler_active and friends 2010-05-10 11:08:34 -07:00
rcutiny.c rcu: remove all rcu head initializations, except on_stack initializations 2010-05-11 16:10:47 -07:00
rcutiny_plugin.h rcu: slim down rcutiny by removing rcu_scheduler_active and friends 2010-05-10 11:08:34 -07:00
rcutorture.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-05-18 08:27:54 -07:00
rcutree.c rcu: remove all rcu head initializations, except on_stack initializations 2010-05-11 16:10:47 -07:00
rcutree.h rcu: reduce the number of spurious RCU_SOFTIRQ invocations 2010-05-10 11:08:35 -07:00
rcutree_plugin.h rcu: remove all rcu head initializations, except on_stack initializations 2010-05-11 16:10:47 -07:00
rcutree_trace.c rcu: reduce the number of spurious RCU_SOFTIRQ invocations 2010-05-10 11:08:35 -07:00
relay.c pipe: add support for shrinking and growing pipes 2010-05-21 21:12:40 +02:00
res_counter.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
resource.c resource: shared I/O region support 2010-05-11 12:01:10 -07:00
rtmutex-debug.c sched: Convert pi_lock to raw_spinlock 2009-12-14 23:55:33 +01:00
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c rtmutes: Convert rtmutex.lock to raw_spinlock 2009-12-14 23:55:33 +01:00
rtmutex.h
rtmutex_common.h
rwsem.c
sched.c Merge branch 'kdb-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb 2010-05-21 11:08:05 -07:00
sched_clock.c blkio: fix for modular blk-cgroup build 2010-04-15 08:54:59 +02:00
sched_cpupri.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
sched_cpupri.h sched: Convert cpupri lock to raw_spinlock 2009-12-14 23:55:33 +01:00
sched_debug.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-05-18 08:27:54 -07:00
sched_fair.c sched: replace migration_thread with cpu_stop 2010-05-06 18:49:21 +02:00
sched_features.h sched: Remove ASYM_GRAN feature 2010-03-11 18:32:53 +01:00
sched_idletask.c sched: Cure load average vs NO_HZ woes 2010-04-23 11:02:02 +02:00
sched_rt.c sched: Add enqueue/dequeue flags 2010-04-02 20:12:05 +02:00
sched_stats.h
seccomp.c
semaphore.c
signal.c kdb: core for kgdb back end (2 of 2) 2010-05-20 21:04:21 -05:00
slow-work-debugfs.c SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
slow-work.c slow-work: use get_ref wrapper instead of directly calling get_ref 2010-03-29 09:13:30 -07:00
slow-work.h SLOW_WORK: CONFIG_SLOW_WORK_PROC should be CONFIG_SLOW_WORK_DEBUG 2010-03-29 09:14:47 -07:00
smp.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
softirq.c rcu: refactor RCU's context-switch handling 2010-05-10 11:08:33 -07:00
softlockup.c softlockup: Stop spurious softlockup messages due to overflow 2010-03-21 19:30:13 +01:00
spinlock.c locking: Cleanup the name space completely 2009-12-14 23:55:33 +01:00
srcu.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
stacktrace.c
stop_machine.c stop_machine: Move local variable closer to the usage site in cpu_stop_cpu_callback() 2010-05-18 00:17:44 +02:00
sys.c Merge branch 'master' into next 2010-05-06 10:56:07 +10:00
sys_ni.c Add generic sys_ipc wrapper 2010-03-12 15:52:32 -08:00
sysctl.c Merge branch 'for-2.6.35' of git://git.kernel.dk/linux-2.6-block 2010-05-21 15:25:33 -07:00
sysctl_binary.c ipv4: remove ip_rt_secret timer (v4) 2010-05-08 01:57:52 -07:00
sysctl_check.c ipv4 05/05: add sysctl to accept packets with local source addresses 2009-12-03 12:14:38 -08:00
taskstats.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
test_kprobes.c
time.c timekeeping: Fix timezone update 2010-05-24 11:50:38 +02:00
timeconst.pl
timer.c timers: Fix slack calculation for expired timers 2010-05-24 12:10:23 +02:00
tracepoint.c trivial: fix typo "to to" in multiple files 2009-09-21 15:14:55 +02:00
tsacct.c mm: clean up mm_counter 2010-03-06 11:26:23 -08:00
uid16.c headers: utsname.h redux 2009-09-23 18:13:10 -07:00
up.c
user-return-notifier.c core: Clean up user return notifers use of per_cpu 2009-12-02 10:22:59 +01:00
user.c sched: Remove a stale comment 2010-05-10 08:48:39 +02:00
user_namespace.c kref: remove kref_set 2010-05-21 09:37:29 -07:00
utsname.c
utsname_sysctl.c sysctl kernel: Remove binary sysctl logic 2009-11-12 02:04:55 -08:00
wait.c locking, sched: Give waitqueue spinlocks their own lockdep classes 2009-08-10 14:43:09 +02:00
workqueue.c workqueue: change cancel_work_sync() to clear work->data 2010-04-30 08:57:25 +02:00