dect
/
linux-2.6
Archived
13
0
Fork 0
Commit Graph

2089 Commits

Author SHA1 Message Date
Lee Schermerhorn 53f2556b67 mempolicy: mPOL_PREFERRED cleanups for "local allocation"
Here are a couple of "cleanups" for MPOL_PREFERRED behavior when
v.preferred_node < 0 -- i.e., "local allocation":

1)  [do_]get_mempolicy() calls the now renamed get_policy_nodemask()
    to fetch the nodemask associated with a policy.  Currently,
    get_policy_nodemask() returns the set of nodes with memory, when
    the policy 'mode' is 'PREFERRED, and the preferred_node is < 0.
    Change to return an empty nodemask, as this is what was specified
    to achieve "local allocation".

2)  When a task is moved into a [new] cpuset, mpol_rebind_policy() is
    called to adjust any task and vma policy nodes to be valid in the
    new cpuset.  However, when the policy is MPOL_PREFERRED, and the
    preferred_node is <0, no rebind is necessary.  The "local allocation"
    indication is valid in any cpuset.  Existing code will "do the right
    thing" because node_remap() will just return the argument node when
    it is outside of the valid range of node ids.  However, I think it is
    clearer and cleaner to skip the remap explicitly in this case.

3)  mpol_to_str() produces a printable, "human readable" string from a
    struct mempolicy.  For MPOL_PREFERRED with preferred_node <0,  show
    "local", as this indicates local allocation, as the task migrates
    among nodes.  Note that this matches the usage of "local allocation"
    in libnuma() and numactl.  Without this change, I believe that node_set()
    [via set_bit()] will set bit 31, resulting in a misleading display.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:24 -07:00
Lee Schermerhorn bea904d54d mempolicy: use MPOL_PREFERRED for system-wide default policy
Currently, when one specifies MPOL_DEFAULT via a NUMA memory policy API
[set_mempolicy(), mbind() and internal versions], the kernel simply installs a
NULL struct mempolicy pointer in the appropriate context: task policy, vma
policy, or shared policy.  This causes any use of that policy to "fall back"
to the next most specific policy scope.

The only use of MPOL_DEFAULT to mean "local allocation" is in the system
default policy.  This requires extra checks/cases for MPOL_DEFAULT in many
mempolicy.c functions.

There is another, "preferred" way to specify local allocation via the APIs.
That is using the MPOL_PREFERRED policy mode with an empty nodemask.
Internally, the empty nodemask gets converted to a preferred_node id of '-1'.
All internal usage of MPOL_PREFERRED will convert the '-1' to the id of the
node local to the cpu where the allocation occurs.

System default policy, except during boot, is hard-coded to "local
allocation".  By using the MPOL_PREFERRED mode with a negative value of
preferred node for system default policy, MPOL_DEFAULT will never occur in the
'policy' member of a struct mempolicy.  Thus, we can remove all checks for
MPOL_DEFAULT when converting policy to a node id/zonelist in the allocation
paths.

In slab_node() return local node id when policy pointer is NULL.  No need to
set a pol value to take the switch default.  Replace switch default with
BUG()--i.e., shouldn't happen.

With this patch MPOL_DEFAULT is only used in the APIs, including internal
calls to do_set_mempolicy() and in the display of policy in
/proc/<pid>/numa_maps.  It always means "fall back" to the the next most
specific policy scope.  This simplifies the description of memory policies
quite a bit, with no visible change in behavior.

get_mempolicy() continues to return MPOL_DEFAULT and an empty nodemask when
the requested policy [task or vma/shared] is NULL.  These are the values one
would supply via set_mempolicy() or mbind() to achieve that condition--default
behavior.

This patch updates Documentation to reflect this change.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:24 -07:00
Lee Schermerhorn 52cd3b0740 mempolicy: rework mempolicy Reference Counting [yet again]
After further discussion with Christoph Lameter, it has become clear that my
earlier attempts to clean up the mempolicy reference counting were a bit of
overkill in some areas, resulting in superflous ref/unref in what are usually
fast paths.  In other areas, further inspection reveals that I botched the
unref for interleave policies.

A separate patch, suitable for upstream/stable trees, fixes up the known
errors in the previous attempt to fix reference counting.

This patch reworks the memory policy referencing counting and, one hopes,
simplifies the code.  Maybe I'll get it right this time.

See the update to the numa_memory_policy.txt document for a discussion of
memory policy reference counting that motivates this patch.

Summary:

Lookup of mempolicy, based on (vma, address) need only add a reference for
shared policy, and we need only unref the policy when finished for shared
policies.  So, this patch backs out all of the unneeded extra reference
counting added by my previous attempt.  It then unrefs only shared policies
when we're finished with them, using the mpol_cond_put() [conditional put]
helper function introduced by this patch.

Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
containing just the policy.  read_swap_cache_async() can call alloc_page_vma()
multiple times, so we can't let alloc_page_vma() unref the shared policy in
this case.  To avoid this, we make a copy of any non-null shared policy and
remove the MPOL_F_SHARED flag from the copy.  This copy occurs before reading
a page [or multiple pages] from swap, so the overhead should not be an issue
here.

I introduced a new static inline function "mpol_cond_copy()" to copy the
shared policy to an on-stack policy and remove the flags that would require a
conditional free.  The current implementation of mpol_cond_copy() assumes that
the struct mempolicy contains no pointers to dynamically allocated structures
that must be duplicated or reference counted during copy.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:24 -07:00
Lee Schermerhorn aab0b1029f mempolicy: mark shared policies for unref
As part of yet another rework of mempolicy reference counting, we want to be
able to identify shared policies efficiently, because they have an extra ref
taken on lookup that needs to be removed when we're finished using the policy.

  Note:  the extra ref is required because the policies are
  shared between tasks/processes and can be changed/freed
  by one task while another task is using them--e.g., for
  page allocation.

Building on David Rientjes mempolicy "mode flags" enhancement, this patch
indicates a "shared" policy by setting a new MPOL_F_SHARED flag in the flags
member of the struct mempolicy added by David.  MPOL_F_SHARED, and any future
"internal mode flags" are reserved from bit zero up, as they will never be
passed in the upper bits of the mode argument of a mempolicy API.

I set the MPOL_F_SHARED flag when the policy is installed in the shared policy
rb-tree.  Don't need/want to clear the flag when removing from the tree as the
mempolicy is freed [unref'd] internally to the sp_delete() function.  However,
a task could hold another reference on this mempolicy from a prior lookup.  We
need the MPOL_F_SHARED flag to stay put so that any tasks holding a ref will
unref, eventually freeing, the mempolicy.

A later patch in this series will introduce a function to conditionally unref
[mpol_free] a policy.  The MPOL_F_SHARED flag is one reason [currently the
only reason] to unref/free a policy via the conditional free.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:24 -07:00
Lee Schermerhorn 45c4745af3 mempolicy: rename struct mempolicy 'policy' member to 'mode'
The terms 'policy' and 'mode' are both used in various places to describe the
semantics of the value stored in the 'policy' member of struct mempolicy.
Furthermore, the term 'policy' is used to refer to that member, to the entire
struct mempolicy and to the more abstract concept of the tuple consisting of a
"mode" and an optional node or set of nodes.  Recently, we have added "mode
flags" that are passed in the upper bits of the 'mode' [or sometimes,
'policy'] member of the numa APIs.

I'd like to resolve this confusion, which perhaps only exists in my mind, by
renaming the 'policy' member to 'mode' throughout, and fixing up the
Documentation.  Man pages will be updated separately.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:24 -07:00
Lee Schermerhorn ae4d8c16aa mempolicy: fixup Fallback for Default Shmem Policy
get_vma_policy() is not handling fallback to task policy correctly when the
get_policy() vm_op returns NULL.  The NULL overwrites the 'pol' variable that
was holding the fallback task mempolicy.  So, it was falling back directly to
system default policy.

Fix get_vma_policy() to use only non-NULL policy returned from the vma
get_policy op.

shm_get_policy() was falling back to current task's mempolicy if the "backing
file system" [tmpfs vs hugetlbfs] does not support the get_policy vm_op and
the vma policy is null.  This is incorrect for show_numa_maps() which is
likely querying the numa_maps of some task other than current.  Remove this
fallback.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:24 -07:00
Lee Schermerhorn f4e53d910b mempolicy: write lock mmap_sem while changing task mempolicy
A read of /proc/<pid>/numa_maps holds the target task's mmap_sem for read
while examining each vma's mempolicy.  A vma's mempolicy can fall back to the
task's policy.  However, the task could be changing it's task policy and free
the one that the show_numa_maps() is examining.

To prevent this, grab the mmap_sem for write when updating task mempolicy.
Pointed out to me by Christoph Lameter and extracted and reworked from
Christoph's alternative mempol reference counting patch.

This is analogous to the way that do_mbind() and do_get_mempolicy() prevent
races between task's sharing an mm_struct [a.k.a.  threads] setting and
querying a mempolicy for a particular address.

Note: this is necessary, but not sufficient, to allow us to stop taking an
extra reference on "other task's mempolicy" in get_vma_policy.  Subsequent
patches will complete this update, allowing us to simplify the tests for
whether we need to unref a mempolicy at various points in the code.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:23 -07:00
Lee Schermerhorn 846a16bf0f mempolicy: rename mpol_copy to mpol_dup
This patch renames mpol_copy() to mpol_dup() because, well, that's what it
does.  Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
existing mempolicy, allocates a new one and copies the contents.

In a later patch, I want to use the name mpol_copy() to copy the contents from
one mempolicy to another like, e.g., strcpy() does for strings.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:23 -07:00
Lee Schermerhorn f0be3d32b0 mempolicy: rename mpol_free to mpol_put
This is a change that was requested some time ago by Mel Gorman.  Makes sense
to me, so here it is.

Note: I retain the name "mpol_free_shared_policy()" because it actually does
free the shared_policy, which is NOT a reference counted object.  However, ...

The mempolicy object[s] referenced by the shared_policy are reference counted,
so mpol_put() is used to release the reference held by the shared_policy.  The
mempolicy might not be freed at this time, because some task attached to the
shared object associated with the shared policy may be in the process of
allocating a page based on the mempolicy.  In that case, the task performing
the allocation will hold a reference on the mempolicy, obtained via
mpol_shared_policy_lookup().  The mempolicy will be freed when all tasks
holding such a reference have called mpol_put() for the mempolicy.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:23 -07:00
Adam Litke 3b11630063 Subject: [PATCH] hugetlb: vmstat events for huge page allocations
Allocating huge pages directly from the buddy allocator is not guaranteed to
succeed.  Success depends on several factors (such as the amount of physical
memory available and the level of fragmentation).  With the addition of
dynamic hugetlb pool resizing, allocations can occur much more frequently.
For these reasons it is desirable to keep track of huge page allocation
successes and failures.

Add two new vmstat entries to track huge page allocations that succeed and
fail.  The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE
being enabled.

[akpm@linux-foundation.org: reduced ifdeffery]
Signed-off-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Eric Munson <ebmunson@us.ibm.com>
Tested-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andy Whitcroft <apw@shadowen.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:23 -07:00
Nick Piggin 70688e4dd1 xip: support non-struct page backed memory
Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP for
the user mappings.

This requires the get_xip_page API to be changed to an address based one.
Improve the API layering a little bit too, while we're here.

This is required in order to support XIP filesystems on memory that isn't
backed with struct page (but memory with struct page is still supported too).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:23 -07:00
Nick Piggin 423bad6004 mm: add vm_insert_mixed
vm_insert_mixed will insert either a raw pfn or a refcounted struct page into
the page tables, depending on whether vm_normal_page() will return the page or
not.  With the introduction of the new pte bit, this is now a too tricky for
drivers to be doing themselves.

filemap_xip uses this in a subsequent patch.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:23 -07:00
Nick Piggin 7e675137a8 mm: introduce pte_special pte bit
s390 for one, cannot implement VM_MIXEDMAP with pfn_valid, due to their memory
model (which is more dynamic than most).  Instead, they had proposed to
implement it with an additional path through vm_normal_page(), using a bit in
the pte to determine whether or not the page should be refcounted:

vm_normal_page()
{
	...
        if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
                if (vma->vm_flags & VM_MIXEDMAP) {
#ifdef s390
			if (!mixedmap_refcount_pte(pte))
				return NULL;
#else
                        if (!pfn_valid(pfn))
                                return NULL;
#endif
                        goto out;
                }
	...
}

This is fine, however if we are allowed to use a bit in the pte to determine
refcountedness, we can use that to _completely_ replace all the vma based
schemes.  So instead of adding more cases to the already complex vma-based
scheme, we can have a clearly seperate and simple pte-based scheme (and get
slightly better code generation in the process):

vm_normal_page()
{
#ifdef s390
	if (!mixedmap_refcount_pte(pte))
		return NULL;
	return pte_page(pte);
#else
	...
#endif
}

And finally, we may rather make this concept usable by any architecture rather
than making it s390 only, so implement a new type of pte state for this.
Unfortunately the old vma based code must stay, because some architectures may
not be able to spare pte bits.  This makes vm_normal_page a little bit more
ugly than we would like, but the 2 cases are clearly seperate.

So introduce a pte_special pte state, and use it in mm/memory.c.  It is
currently a noop for all architectures, so this doesn't actually result in any
compiled code changes to mm/memory.o.

BTW:
I haven't put vm_normal_page() into arch code as-per an earlier suggestion.
The reason is that, regardless of where vm_normal_page is actually
implemented, the *abstraction* is still exactly the same. Also, while it
depends on whether the architecture has pte_special or not, that is the
only two possible cases, and it really isn't an arch specific function --
the role of the arch code should be to provide primitive functions and
accessors with which to build the core code; pte_special does that. We do
not want architectures to know or care about vm_normal_page itself, and
we definitely don't want them being able to invent something new there
out of sight of mm/ code. If we made vm_normal_page an arch function, then
we have to make vm_insert_mixed (next patch) an arch function too. So I
don't think moving it to arch code fundamentally improves any abstractions,
while it does practically make the code more difficult to follow, for both
mm and arch developers, and easier to misuse.

[akpm@linux-foundation.org: build fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:23 -07:00
Jared Hulbert b379d79019 mm: introduce VM_MIXEDMAP
This series introduces some important infrastructure work.  The overall result
is that:

1. We now support XIP backed filesystems using memory that have no
   struct page allocated to them. And patches 6 and 7 actually implement
   this for s390.

   This is pretty important in a number of cases. As far as I understand,
   in the case of virtualisation (eg. s390), each guest may mount a
   readonly copy of the same filesystem (eg. the distro). Currently,
   guests need to allocate struct pages for this image. So if you have
   100 guests, you already need to allocate more memory for the struct
   pages than the size of the image. I think. (Carsten?)

   For other (eg. embedded) systems, you may have a very large non-
   volatile filesystem. If you have to have struct pages for this, then
   your RAM consumption will go up proportionally to fs size. Even
   though it is just a small proportion, the RAM can be much more costly
   eg in terms of power, so every KB less that Linux uses makes it more
   attractive to a lot of these guys.

2. VM_MIXEDMAP allows us to support mappings where you actually do want
   to refcount _some_ pages in the mapping, but not others, and support
   COW on arbitrary (non-linear) mappings. Jared needs this for his NVRAM
   filesystem in progress. Future iterations of this filesystem will
   most likely want to migrate pages between pagecache and XIP backing,
   which is where the requirement for mixed (some refcounted, some not)
   comes from.

3. pte_special also has a peripheral usage that I need for my lockless
   get_user_pages patch. That was shown to speed up "oltp" on db2 by
   10% on a 2 socket system, which is kind of significant because they
   scrounge for months to try to find 0.1% improvement on these
   workloads. I'm hoping we might finally be faster than AIX on
   pSeries with this :). My reference to lockless get_user_pages is not
   meant to justify this patchset (which doesn't include lockless gup),
   but just to show that pte_special is not some s390 specific thing that
   should be hidden in arch code or xip code: I definitely want to use it
   on at least x86 and powerpc as well.

This patch:

Introduce a new type of mapping, VM_MIXEDMAP.  This is unlike VM_PFNMAP in
that it can support COW mappings of arbitrary ranges including ranges without
struct page *and* ranges with a struct page that we actually want to refcount
(PFNMAP can only support COW in those cases where the un-COW-ed translations
are mapped linearly in the virtual address, and can only support non
refcounted ranges).

VM_MIXEDMAP achieves this by refcounting all pfn_valid pages, and not
refcounting !pfn_valid pages (which is not an option for VM_PFNMAP, because it
needs to avoid refcounting pfn_valid pages eg.  for /dev/mem mappings).

Signed-off-by: Jared Hulbert <jaredeh@gmail.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:22 -07:00
Christoph Lameter e20b8cca76 PAGEFLAGS_EXTENDED and separate page flags for Head and Tail
Having separate page flags for the head and the tail of a compound page allows
the compiler to use bitops instead of operations on a word to check for a tail
page.  That is f.e.  important for virt_to_head_page() which is used in
various critical code paths (kfree for example):

Code for PageTail(page)

Before:

 mov    (%rdi),%rdx		page->flags
 mov    %rdx,%rax		3 bytes
 and    $0x12000,%eax		5 bytes
 cmp    $0x12000,%rax		6 bytes
 je     897 <kfree+0xa7>

After:

 mov    (%rdi),%rax
 test   $0x40,%ah			(3 bytes)
 jne    887 <kfree+0x97>

So we go from 14 bytes to 3 bytes and from 3 instructions to one.  From the
use of 2 registers we go to none.

We can only use page flags for this if we have page flags available.  This
patch introduces CONFIG_PAGEFLAGS_EXTENDED that is set if pageflags are not
scarce due to SPARSEMEM using page flags for its sectionid on 32 bit NUMA
platforms.

Additional page flag definitions can be added to the CONFIG_PAGEFLAGS_EXTENDED
section in page-flags.h if the functionality depends on PAGEFLAGS_EXTENDED or
if more page flag overlapping tricks are used for the !PAGEFLAGS_EXTENDED
fallback (the upcoming virtual compound patch may hook in here and Rik's/Lee's
additional page flags to solve the reclaim issues could also be added there
[hint...  hint...  where are these patchsets?]).

Avoiding the overlaying of Pg_reclaim also clears the way for possible use of
compound pages for the pagecache or on the LRU.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:22 -07:00
Christoph Lameter 0a128b2b1a pageflags: eliminate PG_xxx aliases
Remove aliases of PG_xxx.  We can easily drop those now and alias by
specifying the PG_xxx flag in the macro that generates the functions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:22 -07:00
Christoph Lameter 2301696932 vmallocinfo: add caller information
Add caller information so that /proc/vmallocinfo shows where the allocation
request for a slice of vmalloc memory originated.

Results in output like this:

0xffffc20000000000-0xffffc20000801000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
0xffffc20000801000-0xffffc20000806000   20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
0xffffc20000806000-0xffffc20000c07000 4198400 alloc_large_system_hash+0x127/0x246 pages=1024 vmalloc vpages
0xffffc20000c07000-0xffffc20000c0a000   12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
0xffffc20000c0a000-0xffffc20000c0c000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
0xffffc20000c0c000-0xffffc20000c0f000   12288 acpi_os_map_memory+0x13/0x1c phys=cff64000 ioremap
0xffffc20000c10000-0xffffc20000c15000   20480 acpi_os_map_memory+0x13/0x1c phys=cff65000 ioremap
0xffffc20000c16000-0xffffc20000c18000    8192 acpi_os_map_memory+0x13/0x1c phys=cff69000 ioremap
0xffffc20000c18000-0xffffc20000c1a000    8192 acpi_os_map_memory+0x13/0x1c phys=fed1f000 ioremap
0xffffc20000c1a000-0xffffc20000c1c000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
0xffffc20000c1c000-0xffffc20000c1e000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
0xffffc20000c1e000-0xffffc20000c20000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
0xffffc20000c20000-0xffffc20000c22000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
0xffffc20000c22000-0xffffc20000c24000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
0xffffc20000c24000-0xffffc20000c26000    8192 acpi_os_map_memory+0x13/0x1c phys=e0081000 ioremap
0xffffc20000c26000-0xffffc20000c28000    8192 acpi_os_map_memory+0x13/0x1c phys=e0080000 ioremap
0xffffc20000c28000-0xffffc20000c2d000   20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
0xffffc20000c2d000-0xffffc20000c31000   16384 tcp_init+0xd5/0x31c pages=3 vmalloc
0xffffc20000c31000-0xffffc20000c34000   12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
0xffffc20000c34000-0xffffc20000c36000    8192 init_vdso_vars+0xde/0x1f1
0xffffc20000c36000-0xffffc20000c38000    8192 pci_iomap+0x8a/0xb4 phys=d8e00000 ioremap
0xffffc20000c38000-0xffffc20000c3a000    8192 usb_hcd_pci_probe+0x139/0x295 [usbcore] phys=d8e00000 ioremap
0xffffc20000c3a000-0xffffc20000c3e000   16384 sys_swapon+0x509/0xa15 pages=3 vmalloc
0xffffc20000c40000-0xffffc20000c61000  135168 e1000_probe+0x1c4/0xa32 phys=d8a20000 ioremap
0xffffc20000c61000-0xffffc20000c6a000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc20000c6a000-0xffffc20000c73000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc20000c73000-0xffffc20000c7c000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc20000c7c000-0xffffc20000c7f000   12288 e1000e_setup_tx_resources+0x29/0xbe pages=2 vmalloc
0xffffc20000c80000-0xffffc20001481000 8392704 pci_mmcfg_arch_init+0x90/0x118 phys=e0000000 ioremap
0xffffc20001481000-0xffffc20001682000 2101248 alloc_large_system_hash+0x127/0x246 pages=512 vmalloc
0xffffc20001682000-0xffffc20001e83000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
0xffffc20001e83000-0xffffc20002204000 3674112 alloc_large_system_hash+0x127/0x246 pages=896 vmalloc vpages
0xffffc20002204000-0xffffc2000220d000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc2000220d000-0xffffc20002216000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc20002216000-0xffffc2000221f000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc2000221f000-0xffffc20002228000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc20002228000-0xffffc20002231000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
0xffffc20002231000-0xffffc20002234000   12288 e1000e_setup_rx_resources+0x35/0x122 pages=2 vmalloc
0xffffc20002240000-0xffffc20002261000  135168 e1000_probe+0x1c4/0xa32 phys=d8a60000 ioremap
0xffffc20002261000-0xffffc2000270c000 4894720 sys_swapon+0x509/0xa15 pages=1194 vmalloc vpages
0xffffffffa0000000-0xffffffffa0022000  139264 module_alloc+0x4f/0x55 pages=33 vmalloc
0xffffffffa0022000-0xffffffffa0029000   28672 module_alloc+0x4f/0x55 pages=6 vmalloc
0xffffffffa002b000-0xffffffffa0034000   36864 module_alloc+0x4f/0x55 pages=8 vmalloc
0xffffffffa0034000-0xffffffffa003d000   36864 module_alloc+0x4f/0x55 pages=8 vmalloc
0xffffffffa003d000-0xffffffffa0049000   49152 module_alloc+0x4f/0x55 pages=11 vmalloc
0xffffffffa0049000-0xffffffffa0050000   28672 module_alloc+0x4f/0x55 pages=6 vmalloc

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:21 -07:00
Christoph Lameter a10aa57987 vmalloc: show vmalloced areas via /proc/vmallocinfo
Implement a new proc file that allows the display of the currently allocated
vmalloc memory.

It allows to see the users of vmalloc.  That is important if vmalloc space is
scarce (i386 for example).

And it's going to be important for the compound page fallback to vmalloc.
Many of the current users can be switched to use compound pages with fallback.
 This means that the number of users of vmalloc is reduced and page tables no
longer necessary to access the memory.  /proc/vmallocinfo allows to review how
that reduction occurs.

If memory becomes fragmented and larger order allocations are no longer
possible then /proc/vmallocinfo allows to see which compound page allocations
fell back to virtual compound pages.  That is important for new users of
virtual compound pages.  Such as order 1 stack allocation etc that may
fallback to virtual compound pages in the future.

/proc/vmallocinfo permissions are made readable-only-by-root to avoid possible
information leakage.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: CONFIG_MMU=n build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:21 -07:00
Miklos Szeredi ac6aadb24b mm: rotate_reclaimable_page() cleanup
Clean up messy conditional calling of test_clear_page_writeback() from both
rotate_reclaimable_page() and end_page_writeback().

The only user of rotate_reclaimable_page() is end_page_writeback() so this is
OK.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:20 -07:00
S.Caglar Onur f05111f501 mm/page_alloc.c: fix indentation
zlc_setup(): handle jiffies wraparound
(10ed273f50) changes tab with spaces

Signed-off-by: S.Caglar Onur <caglar@pardus.org.tr>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:20 -07:00
Andi Kleen b5ee5befa7 dmapool: enable debugging for CONFIG_SLUB_DEBUG_ON too
Previously it was only enabled for CONFIG_DEBUG_SLAB.

Not hooked into the slub runtime debug configuration, so you currently only
get it with CONFIG_SLUB_DEBUG_ON, not plain CONFIG_SLUB_DEBUG

Acked-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:20 -07:00
Lee Schermerhorn a43361cf3c mempolicy: fix parsing of tmpfs mpol mount option
Parsing of new mode flags in the tmpfs mpol mount option is slightly broken:

Setting a valid flag works OK:
	#mount -o remount,mpol=bind=static:1-2 /dev/shm
	#mount
	...
	tmpfs on /dev/shm type tmpfs (rw,mpol=bind=static:1-2)
	...

However, we can't remove them or change them, once we've
set a valid flag:

	#mount -o remount,mpol=bind:1-2 /dev/shm
	#mount
	...
	tmpfs on /dev/shm type tmpfs (rw,mpol=bind:1-2)
	...

It SAYS it removed it, but that's just a copy of the input
string.  If we now try to set it to a different flag, we
get:

	#mount -o remount,mpol=bind=relative:1-2 /dev/shm
	mount: /dev/shm not mounted already, or bad option

And on the console, we see:
	tmpfs: Bad value 'bind' for mount option 'mpol'
	                      ^ lost remainder of string

Furthermore, bogus flags are accepted with out error.
Granted, they are a no-op:

	#mount -o remount,mpol=interleave=foo:0-3 /dev/shm
	#mount
	...
	tmpfs on /dev/shm type tmpfs (rw,mpol=interleave=foo:0-3)

Again, that's just a copy of the input string shown by the mount command.

This patch fixes the behavior by pre-zeroing the flags so that only one of the
mutually exclusive flags can be set at one time.  It also reports an error
when an unrecognized flag is specified.

The check for both flags being set is removed because it can't happen with
this implementation.  If we ever want to support multiple non-exclusive flags,
this area will need rework and we will need to check that any mutually
exclusive flags aren't specified.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Eric Whitney <eric.whitney@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:20 -07:00
David Rientjes 3e1f064562 mempolicy: disallow static or relative flags for local preferred mode
MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES don't mean anything for
MPOL_PREFERRED policies that were created with an empty nodemask (for purely
local allocations).  They'll never be invalidated because the allowed mems of
a task changes or need to be rebound relative to a cpuset's placement.

Also fixes a bug identified by Lee Schermerhorn that disallowed empty
nodemasks to be passed to MPOL_PREFERRED to specify local allocations.  [A
different, somewhat incomplete, patch already existed in 25-rc5-mm1.]

Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:20 -07:00
David Rientjes 37012946da mempolicy: create mempolicy_operations structure
Create a mempolicy_operations structure that currently points to two
functions[*] for the various modes:

	int (*create)(struct mempolicy *, const nodemask_t *);
	void (*rebind)(struct mempolicy *, const nodemask_t *);

This splits the implementation for the various modes out of two large
functions, mpol_new() and mpol_rebind_policy().  Eventually it may be
beneficial to add additional functions to accomodate the existing switch()
statements in mm/mempolicy.c.

 [*] The ->create() function for MPOL_DEFAULT is currently NULL since no
     struct mempolicy is dynamically allocated.

[Lee.Schermerhorn@hp.com: fix regression in the package mempolicy regression tests]
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:20 -07:00
David Rientjes 1d0d2680a0 mempolicy: move rebind functions
Move the mpol_rebind_{policy,task,mm}() functions after mpol_new() to avoid
having to declare function prototypes.

Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:20 -07:00
David Rientjes 4c50bc0116 mempolicy: add MPOL_F_RELATIVE_NODES flag
Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
nodemasks passed via set_mempolicy() or mbind() should be considered relative
to the current task's mems_allowed.

When the mempolicy is created, the passed nodemask is folded and mapped onto
the current task's mems_allowed.  For example, consider a task using
set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
nodemask of 1-3.  If current's mems_allowed is 4-7, the effected nodemask is
5-7 (the second, third, and fourth node of mems_allowed).

If the same task is attached to a cpuset, the mempolicy nodemask is rebound
each time the mems are changed.  Some possible rebinds and results are:

	mems			result
	1-3			1-3
	1-7			2-4
	1,5-6			1,5-6
	1,5-7			5-7

Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
to the resultant nodemask from the relative remap.

In the MPOL_PREFERRED case, the preferred node is remapped from the currently
effected nodemask to the relative nodemask.

This mempolicy mode flag was conceived of by Paul Jackson <pj@sgi.com>.

Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
David Rientjes f5b087b52f mempolicy: add MPOL_F_STATIC_NODES flag
Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
node remap when the policy is rebound.

Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
a union with cpuset_mems_allowed:

	struct mempolicy {
		...
		union {
			nodemask_t cpuset_mems_allowed;
			nodemask_t user_nodemask;
		} w;
	}

that stores the the nodemask that the user passed when he or she created the
mempolicy via set_mempolicy() or mbind().  When using MPOL_F_STATIC_NODES,
which is passed with any mempolicy mode, the user's passed nodemask
intersected with the VMA or task's allowed nodes is always used when
determining the preferred node, setting the MPOL_BIND zonelist, or creating
the interleave nodemask.  This happens whenever the policy is rebound,
including when a task's cpuset assignment changes or the cpuset's mems are
changed.

This creates an interesting side-effect in that it allows the mempolicy
"intent" to lie dormant and uneffected until it has access to the node(s) that
it desires.  For example, if you currently ask for an interleaved policy over
a set of nodes that you do not have access to, the mempolicy is not created
and the task continues to use the previous policy.  With this change, however,
it is possible to create the same mempolicy; it is only effected when access
to nodes in the nodemask is acquired.

It is also possible to mount tmpfs with the static nodemask behavior when
specifying a node or nodemask.  To do this, simply add "=static" immediately
following the mempolicy mode at mount time:

	mount -o remount mpol=interleave=static:1-3

Also removes mpol_check_policy() and folds its logic into mpol_new() since it
is now obsoleted.  The unused vma_mpol_equal() is also removed.

Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
David Rientjes 028fec414d mempolicy: support optional mode flags
With the evolution of mempolicies, it is necessary to support mempolicy mode
flags that specify how the policy shall behave in certain circumstances.  The
most immediate need for mode flag support is to suppress remapping the
nodemask of a policy at the time of rebind.

Both the mempolicy mode and flags are passed by the user in the 'int policy'
formal of either the set_mempolicy() or mbind() syscall.  A new constant,
MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
passed as part of this int.  Mempolicies that include illegal flags as part of
their policy are rejected as invalid.

An additional member to struct mempolicy is added to support the mode flags:

	struct mempolicy {
		...
		unsigned short policy;
		unsigned short flags;
	}

The splitting of the 'int' actual passed by the user is done in
sys_set_mempolicy() and sys_mbind() for their respective syscalls.  This is
done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall of
there are additional flags, and storing it in the new 'flags' member of struct
mempolicy.  The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
the 'policy' member of the struct and all current users of pol->policy remain
unchanged.

The union of the policy mode and optional mode flags is passed back to the
user in get_mempolicy().

This combination of mode and flags within the same actual does not break
userspace code that relies on get_mempolicy(&policy, ...) and either

	switch (policy) {
	case MPOL_BIND:
		...
	case MPOL_INTERLEAVE:
		...
	};

statements or

	if (policy == MPOL_INTERLEAVE) {
		...
	}

statements.  Such applications would need to use optional mode flags when
calling set_mempolicy() or mbind() for these previously implemented statements
to stop working.  If an application does start using optional mode flags, it
will need to mask the optional flags off the policy in switch and conditional
statements that only test mode.

An additional member is also added to struct shmem_sb_info to store the
optional mode flags.

[hugh@veritas.com: shmem mpol: fix build warning]
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
David Rientjes a3b51e0142 mempolicy: convert MPOL constants to enum
The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
MPOL_INTERLEAVE, are better declared as part of an enum since they are
sequentially numbered and cannot be combined.

The policy member of struct mempolicy is also converted from type short to
type unsigned short.  A negative policy does not have any legitimate meaning,
so it is possible to change its type in preparation for adding optional mode
flags later.

The equivalent member of struct shmem_sb_info is also changed from int to
unsigned short.

For compatibility, the policy formal to get_mempolicy() remains as a pointer
to an int:

	int get_mempolicy(int *policy, unsigned long *nmask,
			  unsigned long maxnode, unsigned long addr,
			  unsigned long flags);

although the only possible values is the range of type unsigned short.

Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
Pekka Enberg 1b27d05b6e mm: move cache_line_size() to <linux/cache.h>
Not all architectures define cache_line_size() so as suggested by Andrew move
the private implementations in mm/slab.c and mm/slob.c to <linux/cache.h>.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
Adam Litke 19fc3f0acd hugetlb: decrease hugetlb_lock cycling in gather_surplus_huge_pages
To reduce hugetlb_lock acquisitions and releases when freeing excess surplus
pages, scan the page list in two parts.  First, transfer the needed pages to
the hugetlb pool.  Then drop the lock and free the remaining pages back to the
buddy allocator.

In the common case there are zero excess pages and no lock operations are
required.

Thanks Mel Gorman for this improvement.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
Chris Dearman 797df57490 mm: try both endianess when checking for endianess
When checking for the swap header try byteswapping the endianess dependent
fields to allow the swap partition to be shared between big & little endian
systems.

Signed-off-by: Chris Dearman <chris@mips.com>
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
Mel Gorman 19770b3260 mm: filter based on a nodemask as well as a gfp_mask
The MPOL_BIND policy creates a zonelist that is used for allocations
controlled by that mempolicy.  As the per-node zonelist is already being
filtered based on a zone id, this patch adds a version of __alloc_pages() that
takes a nodemask for further filtering.  This eliminates the need for
MPOL_BIND to create a custom zonelist.

A positive benefit of this is that allocations using MPOL_BIND now use the
local node's distance-ordered zonelist instead of a custom node-id-ordered
zonelist.  I.e., pages will be allocated from the closest allowed node with
available memory.

[Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
[Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
Mel Gorman dd1a239f6f mm: have zonelist contains structs with both a zone pointer and zone_idx
Filtering zonelists requires very frequent use of zone_idx().  This is costly
as it involves a lookup of another structure and a substraction operation.  As
the zone_idx is often required, it should be quickly accessible.  The node idx
could also be stored here if it was found that accessing zone->node is
significant which may be the case on workloads where nodemasks are heavily
used.

This patch introduces a struct zoneref to store a zone pointer and a zone
index.  The zonelist then consists of an array of these struct zonerefs which
are looked up as necessary.  Helpers are given for accessing the zone index as
well as the node index.

[kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
[hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
[hugh@veritas.com: just return do_try_to_free_pages]
[hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman 54a6eb5c47 mm: use two zonelist that are filtered by GFP mask
Currently a node has two sets of zonelists, one for each zone type in the
system and a second set for GFP_THISNODE allocations.  Based on the zones
allowed by a gfp mask, one of these zonelists is selected.  All of these
zonelists consume memory and occupy cache lines.

This patch replaces the multiple zonelists per-node with two zonelists.  The
first contains all populated zones in the system, ordered by distance, for
fallback allocations when the target/preferred node has no free pages.  The
second contains all populated zones in the node suitable for GFP_THISNODE
allocations.

An iterator macro is introduced called for_each_zone_zonelist() that interates
through each zone allowed by the GFP flags in the selected zonelist.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman 18ea7e710d mm: remember what the preferred zone is for zone_statistics
On NUMA, zone_statistics() is used to record events like numa hit, miss and
foreign.  It assumes that the first zone in a zonelist is the preferred zone.
When multiple zonelists are replaced by one that is filtered, this is no
longer the case.

This patch records what the preferred zone is rather than assuming the first
zone in the zonelist is it.  This simplifies the reading of later patches in
this set.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman 0e88460da6 mm: introduce node_zonelist() for accessing the zonelist for a GFP mask
Introduce a node_zonelist() helper function.  It is used to lookup the
appropriate zonelist given a node and a GFP mask.  The patch on its own is a
cleanup but it helps clarify parts of the two-zonelist-per-node patchset.  If
necessary, it can be merged with the next patch in this set without problems.

Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman dac1d27bc8 mm: use zonelists instead of zones when direct reclaiming pages
The following patches replace multiple zonelists per node with two zonelists
that are filtered based on the GFP flags.  The patches as a set fix a bug with
regard to the use of MPOL_BIND and ZONE_MOVABLE.  With this patchset, the
MPOL_BIND will apply to the two highest zones when the highest zone is
ZONE_MOVABLE.  This should be considered as an alternative fix for the
MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters
only custom zonelists.

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones where other places use zonelist.

The second patch introduces a helper function node_zonelist() for looking up
the appropriate zonelist for a GFP mask which simplifies patches later in the
set.

The third patch defines/remembers the "preferred zone" for numa statistics, as
it is no longer always the first zone in a zonelist.

The forth patch replaces multiple zonelists with two zonelists that are
filtered.  The two zonelists are due to the fact that the memoryless patchset
introduces a second set of zonelists for __GFP_THISNODE.

The fifth patch introduces helper macros for retrieving the zone and node
indices of entries in a zonelist.

The final patch introduces filtering of the zonelists based on a nodemask.
Two zonelists exist per node, one for normal allocations and one for
__GFP_THISNODE.

Performance results varied depending on the machine configuration.  In real
workloads the gain/loss will depend on how much the userspace portion of the
benchmark benefits from having more cache available due to reduced referencing
of zonelists.

These are the range of performance losses/gains when running against
2.6.24-rc4-mm1.  The set and these machines are a mix of i386, x86_64 and
ppc64 both NUMA and non-NUMA.
			     loss   to  gain
Total CPU time on Kernbench: -0.86% to  1.13%
Elapsed   time on Kernbench: -0.79% to  0.76%
page_test from aim9:         -4.37% to  0.79%
brk_test  from aim9:         -0.71% to  4.07%
fork_test from aim9:         -1.84% to  4.60%
exec_test from aim9:         -0.71% to  1.08%

This patch:

The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation.  Similarly, direct reclaim of pages
iterates over an array of zones.  For consistency, this patch converts direct
reclaim to use a zonelist.  No functionality is changed by this patch.  This
simplifies zonelist iterators in the next patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Nick Piggin 3c18ddd160 mm: remove nopage
Nothing in the tree uses nopage any more.  Remove support for it in the
core mm code and documentation (and a few stray references to it in
comments).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Oleg Nesterov 4d3d5b41a7 mmap_region: cleanup the final vma_merge() related code
It is not easy to actually understand the "if (!file || !vma_merge())"
code, turn it into "if (file && vma_merge())".  This makes immediately
obvious that the subsequent "if (file)" is superfluous.

As Hugh Dickins pointed out, we can also factor out the ->i_writecount
corrections, and add a small comment about that.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Hisashi Hifumi 0dd1334faf fix invalidate_inode_pages2_range() to not clear ret
DIO invalidates page cache through invalidate_inode_pages2_range().
invalidate_inode_pages2_range() sets ret=-EIO when
invalidate_complete_page2() fails, but this ret is cleared if
do_launder_page() succeed on a page of next index.

In this case, dio is carried out even if invalidate_complete_page2() fails
on some pages.

This can cause inconsistency between memory and blocks on HDD because the
page cache still exists.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Chuck Lever <cel@citi.umich.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Jeremy Fitzhardinge 180c06efce hotplug-memory: make online_page() common
All architectures use an effectively identical definition of online_page(), so
just make it common code.  x86-64, ia64, powerpc and sh are actually
identical; x86-32 is slightly different.

x86-32's differences arise because it puts its hotplug pages in the highmem
zone.  We can handle this in the generic code by inspecting the page to see if
its in highmem, and update the totalhigh_pages count appropriately.  This
leaves init_32.c:free_new_highpage with a single caller, so I folded it into
add_one_highpage_init.

I also removed an incorrect comment referring to the NUMA case; any NUMA
details have already been dealt with by the time online_page() is called.

[akpm@linux-foundation.org: fix indenting]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com>
Tested-by: KAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:17 -07:00
Badari Pulavarty ea01ea937d hotplug memory remove: generic __remove_pages() support
Generic helper function to remove section mappings and sysfs entries for the
section of the memory we are removing.  offline_pages() correctly adjusted
zone and marked the pages reserved.

TODO: Yasunori Goto is working on patches to free up allocations from bootmem.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:17 -07:00
Johannes Weiner 556637cdab mm: fix possible off-by-one in walk_pte_range()
After the loop in walk_pte_range() pte might point to the first address after
the pmd it walks.  The pte_unmap() is then applied to something bad.

Spotted by Roel Kluin and Andreas Schwab.

Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Cc: Roel Kluin <12o3l@tiscali.nl>
Cc: Andreas Schwab <schwab@suse.de>
Acked-by: Matt Mackall <mpm@selenic.com>
Acked-by: Mikael Pettersson <mikpe@it.uu.se>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:16 -07:00
Christian Borntraeger 5b7baf0578 s390: KVM preparation: host memory management changes for s390 kvm
This patch changes the s390 memory management defintions to use the pgste field
for dirty and reference bit tracking of host and guest code. Usually on s390,
dirty and referenced are tracked in storage keys, which belong to the physical
page. This changes with virtualization: The guest and host dirty/reference bits
are defined to be the logical OR of the values for the mapping and the physical
page. This patch implements the necessary changes in pgtable.h for s390.

There is a common code change in mm/rmap.c, the call to
page_test_and_clear_young must be moved. This is a no-op for all
architecture but s390. page_referenced checks the referenced bits for
the physiscal page and for all mappings:
o The physical page is checked with page_test_and_clear_young.
o The mappings are checked with ptep_test_and_clear_young and friends.

Without pgstes (the current implementation on Linux s390) the physical page
check is implemented but the mapping callbacks are no-ops because dirty
and referenced are not tracked in the s390 page tables. The pgstes introduces
guest and host dirty and reference bits for s390 in the host mapping. These
mapping must be checked before page_test_and_clear_young resets the reference
bit.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 12:00:40 +03:00
Yinghai Lu c2b91e2eec x86_64/mm: check and print vmemmap allocation continuous
On big systems with lots of memory, don't print out too much during
bootup, and make it easy to find if it is continuous.

on 256G 8 sockets system will get
 [ffffe20000000000-ffffe20002bfffff] PMD -> [ffff810001400000-ffff810003ffffff] on node 0
[ffffe2001c700000-ffffe2001c7fffff] potential offnode page_structs
 [ffffe20002c00000-ffffe2001c7fffff] PMD -> [ffff81000c000000-ffff8100255fffff] on node 0
[ffffe20038700000-ffffe200387fffff] potential offnode page_structs
 [ffffe2001c800000-ffffe200387fffff] PMD -> [ffff810820200000-ffff81083c1fffff] on node 1
 [ffffe20040000000-ffffe2007fffffff] PUD ->ffff811027a00000 on node 2
 [ffffe20038800000-ffffe2003fffffff] PMD -> [ffff811020200000-ffff8110279fffff] on node 2
[ffffe20054700000-ffffe200547fffff] potential offnode page_structs
 [ffffe20040000000-ffffe200547fffff] PMD -> [ffff811027c00000-ffff81103c3fffff] on node 2
[ffffe20070700000-ffffe200707fffff] potential offnode page_structs
 [ffffe20054800000-ffffe200707fffff] PMD -> [ffff811820200000-ffff81183c1fffff] on node 3
 [ffffe20080000000-ffffe200bfffffff] PUD ->ffff81202fa00000 on node 4
 [ffffe20070800000-ffffe2007fffffff] PMD -> [ffff812020200000-ffff81202f9fffff] on node 4
[ffffe2008c700000-ffffe2008c7fffff] potential offnode page_structs
 [ffffe20080000000-ffffe2008c7fffff] PMD -> [ffff81202fc00000-ffff81203c3fffff] on node 4
[ffffe200a8700000-ffffe200a87fffff] potential offnode page_structs
 [ffffe2008c800000-ffffe200a87fffff] PMD -> [ffff812820200000-ffff81283c1fffff] on node 5
 [ffffe200c0000000-ffffe200ffffffff] PUD ->ffff813037a00000 on node 6
 [ffffe200a8800000-ffffe200bfffffff] PMD -> [ffff813020200000-ffff8130379fffff] on node 6
[ffffe200c4700000-ffffe200c47fffff] potential offnode page_structs
 [ffffe200c0000000-ffffe200c47fffff] PMD -> [ffff813037c00000-ffff81303c3fffff] on node 6
 [ffffe200c4800000-ffffe200e07fffff] PMD -> [ffff813820200000-ffff81383c1fffff] on node 7

instead of a very long print out...

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-26 22:51:09 +02:00
Yinghai Lu a5645a61b3 mm: allow reserve_bootmem() cross nodes
split reserve_bootmem_core() into two functions, one which checks
conflicts, and one which sets the bits.

and make reserve_bootmem to loop bdata_list to cross the nodes.

user could be crashkernel and ramdisk..., in case the range provided
by those externalities crosses the nodes.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-26 22:51:08 +02:00
Yinghai Lu 9a2dc04cf0 mm: offset align in alloc_bootmem()
need offset alignment when node_boot_start's alignment is less than
the alignment required.

use local node_boot_start to match alignment - so don't add extra operation
in search loop.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-26 22:51:08 +02:00
Yinghai Lu ad09315cad mm: fix alloc_bootmem_core to use fast searching for all nodes
Make the nodes other than node 0 use bdata->last_success for fast
search too.

We need to use __alloc_bootmem_core() for vmemmap allocation for other
nodes when numa and sparsemem/vmemmap are enabled.

Also, make fail_block path increase i with incr only after ALIGN
to avoid extra increase when size is larger than align.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-26 22:51:07 +02:00
Yinghai Lu e123dd3f0e mm: make mem_map allocation continuous
vmemmap allocation currently has this layout:

 [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0
 [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001800000 on node 0
 [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001c00000 on node 0
 [ffffe20000600000-ffffe200007fffff] PMD ->ffff810002000000 on node 0
 [ffffe20000800000-ffffe200009fffff] PMD ->ffff810002400000 on node 0
...

note that there is a 2M hole between them - not optimal.

the root cause is that usemap (24 bytes) will be allocated after every 2M
mem_map, and it will push next vmemmap (2M) to the next (2M) alignment.

solution: try to allocate the mem_map continously.

after the patch, we get:

 [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0
 [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001600000 on node 0
 [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001800000 on node 0
 [ffffe20000600000-ffffe200007fffff] PMD ->ffff810001a00000 on node 0
 [ffffe20000800000-ffffe200009fffff] PMD ->ffff810001c00000 on node 0
...

which is the ideal layout.

and usemap will share a page because of they are allocated continuously too:

sparse_early_usemap_alloc: usemap = ffff810024e00000 size = 24
sparse_early_usemap_alloc: usemap = ffff810024e00080 size = 24
sparse_early_usemap_alloc: usemap = ffff810024e00100 size = 24
sparse_early_usemap_alloc: usemap = ffff810024e00180 size = 24
...

so we make the bootmem allocation more compact and use less memory
for usemap => mission accomplished ;-)

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-26 22:51:07 +02:00