dect
/
linux-2.6
Archived
13
0
Fork 0
Commit Graph

172 Commits

Author SHA1 Message Date
NeilBrown 4254376914 [PATCH] md: Don't write dirty/clean update to spares - leave them alone
- record the 'event' count on each individual device (they
  might sometimes be slightly different now)
- add a new value for 'sb_dirty': '3' means that the super
  block only needs to be updated to record a clean<->dirty
  transition.
- Prefer odd event numbers for dirty states and even numbers
  for clean states
- Using all the above, don't update the superblock on
  a spare device if the update is just doing a clean-dirty
  transition.  To accomodate this, a transition from
  dirty back to clean might now decrement the events counter
  if nothing else has changed.

The net effect of this is that spare drives will not see any IO requests
during normal running of the array, so they can go to sleep if that is what
they want to do.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:39 -07:00
NeilBrown d785a06a0b [PATCH] md/bitmap: change md/bitmap file handling to use bmap to file blocks
If md is asked to store a bitmap in a file, it tries to hold onto the page
cache pages for that file, manipulate them directly, and call a cocktail of
operations to write the file out.  I don't believe this is a supportable
approach.

This patch changes the approach to use the same approach as swap files.  i.e.
bmap is used to enumerate all the block address of parts of the file and we
write directly to those blocks of the device.

swapfile only uses parts of the file that provide a full pages at contiguous
addresses.  We don't have that luxury so we have to cope with pages that are
non-contiguous in storage.  To handle this we attach buffers to each page, and
store the addresses in those buffers.

With this approach the pagecache may contain data which is inconsistent with
what is on disk.  To alleviate the problems this can cause, md invalidates the
pagecache when releasing the file.  If the file is to be examined while the
array is active (a non-critical but occasionally useful function), O_DIRECT io
must be used.  And new version of mdadm will have support for this.

This approach simplifies a lot of code:
 - we no longer need to keep a list of pages which we need to wait for,
   as the b_endio function can keep track of how many outstanding
   writes there are.  This saves a mempool.
 - -EAGAIN returns from write_page are no longer possible (not sure if
    they ever were actually).

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:38 -07:00
NeilBrown 0b79ccf0cd [PATCH] md/bitmap: remove bitmap writeback daemon
md/bitmap currently has a separate thread to wait for writes to the bitmap
file to complete (as we cannot get a callback on that action).

However this isn't needed as bitmap_unplug is called from process context and
waits for the writeback thread to do it's work.  The same result can be
achieved by doing the waiting directly in bitmap_unplug.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:38 -07:00
Adrian Bunk 5e56341d02 [PATCH] md: make md_print_devices() static
This patch makes the needlessly global md_print_devices() static.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:37 -07:00
NeilBrown c93983bf51 [PATCH] md: support stripe/offset mode in raid10
The "industry standard" DDF format allows for a stripe/offset layout where
data is duplicated on different stripes.  e.g.

  A  B  C  D
  D  A  B  C
  E  F  G  H
  H  E  F  G

(columns are drives, rows are stripes, LETTERS are chunks of data).

This is similar to raid10's 'far' mode, but not quite the same.  So enhance
'far' mode with a 'far/offset' option which follows the layout of DDFs
stripe/offset.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:37 -07:00
NeilBrown 7c7546ccf6 [PATCH] md: allow a linear array to have drives added while active
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:37 -07:00
NeilBrown 5fd6c1dce0 [PATCH] md: allow checkpoint of recovery with version-1 superblock
For a while we have had checkpointing of resync.  The version-1 superblock
allows recovery to be checkpointed as well, and this patch implements that.

Due to early carelessness we need to add a feature flag to signal that the
recovery_offset field is in use, otherwise older kernels would assume that a
partially recovered array is in fact fully recovered.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:37 -07:00
NeilBrown 16a53ecc35 [PATCH] md: merge raid5 and raid6 code
There is a lot of commonality between raid5.c and raid6main.c.  This patches
merges both into one module called raid456.  This saves a lot of code, and
paves the way for online raid5->raid6 migrations.

There is still duplication, e.g.  between handle_stripe5 and handle_stripe6.
This will probably be cleaned up later.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:37 -07:00
NeilBrown 8932c2e0dc [PATCH] md: remove arbitrary limit on chunk size
The largest chunk size the code can support without substantial surgery is
2^30 bytes, so make that the limit instead of an arbitrary 4Meg.  Some day,
the 'chunksize' should change to a sector-shift instead of a byte-count.  Then
no limit would be needed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26 09:58:36 -07:00
David Woodhouse 8555255f0b Add generic Kbuild files for 'make headers_install'
This adds the Kbuild files listing the files which are to be installed by
the 'headers_install' make target, in generic directories.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-06-18 12:14:01 +01:00
NeilBrown 6f91fe88e4 [PATCH] md: make sure 64bit fields in version-1 metadata are 64-bit aligned
reshape_position is a 64bit field that was not 64bit aligned.  So swap with
new_level.

NOTE: this is a user-visible change.  However:
  - The bad code has not appeared in a released kernel
  - This code is still marked 'experimental'
  - This only affects version-1 superblock, which are not in wide use
  - These field are only used (rather than simply reported) by user-space
    tools in extemely rare circumstances : after a reshape crashes in the
    first second of the reshape process.

So I believe that, at this stage, the change is safe.  Especially if people
heed the 'help' message on use mdadm-2.4.1.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11 06:18:30 -07:00
NeilBrown df5b89b323 [PATCH] md: Convert reconfig_sem to reconfig_mutex
... being careful that mutex_trylock is inverted wrt down_trylock

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:03 -08:00
NeilBrown e464eafdb4 [PATCH] md: Support suspending of IO to regions of an md array
This allows user-space to access data safely.  This is needed for raid5
reshape as user-space needs to take a backup of the first few stripes before
allowing reshape to commence.

It will also be useful in cluster-aware raid1 configurations so that all
cluster members can leave a section of the array untouched while a
resync/recovery happens.

A 'start' and 'end' of the suspended range are written to 2 sysfs attributes.
Note that only one range can be suspended at a time.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:02 -08:00
NeilBrown 63c70c4f3a [PATCH] md: Split reshape handler in check_reshape and start_reshape
check_reshape checks validity and does things that can be done instantly -
like adding devices to raid1.  start_reshape initiates a restriping process to
convert the whole array.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:02 -08:00
NeilBrown b578d55fdd [PATCH] md: Only checkpoint expansion progress occasionally
Instead of checkpointing at each stripe, only checkpoint when a new write
would overwrite uncheckpointed data.  Block any write to the uncheckpointed
area.  Arbitrarily checkpoint at least every 3Meg.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:02 -08:00
NeilBrown f67055780c [PATCH] md: Checkpoint and allow restart of raid5 reshape
We allow the superblock to record an 'old' and a 'new' geometry, and a
position where any conversion is up to.  The geometry allows for changing
chunksize, layout and level as well as number of devices.

When using verion-0.90 superblock, we convert the version to 0.91 while the
conversion is happening so that an old kernel will refuse the assemble the
array.  For version-1, we use a feature bit for the same effect.

When starting an array we check for an incomplete reshape and restart the
reshape process if needed.  If the reshape stopped at an awkward time (like
when updating the first stripe) we refuse to assemble the array, and let
user-space worry about it.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00
NeilBrown 292695531a [PATCH] md: Final stages of raid5 expand code
This patch adds raid5_reshape and end_reshape which will start and finish the
reshape processes.

raid5_reshape is only enabled in CONFIG_MD_RAID5_RESHAPE is set, to discourage
accidental use.

Read the 'help' for the CONFIG_MD_RAID5_RESHAPE entry.

and Make sure that you have backups, just in case.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00
NeilBrown ccfcc3c10b [PATCH] md: Core of raid5 resize process
This patch provides the core of the resize/expand process.

sync_request notices if a 'reshape' is happening and acts accordingly.

It allocated new stripe_heads for the next chunk-wide-stripe in the target
geometry, marking them STRIPE_EXPANDING.

Then it finds which stripe heads in the old geometry can provide data needed
by these and marks them STRIPE_EXPAND_SOURCE.  This causes stripe_handle to
read all blocks on those stripes.

Once all blocks on a STRIPE_EXPAND_SOURCE stripe_head are read, any that are
needed are copied into the corresponding STRIPE_EXPANDING stripe_head.  Once a
STRIPE_EXPANDING stripe_head is full, it is marks STRIPE_EXPAND_READY and then
is written out and released.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00
NeilBrown 7ecaa1e6a1 [PATCH] md: Infrastructure to allow normal IO to continue while array is expanding
We need to allow that different stripes are of different effective sizes, and
use the appropriate size.  Also, when a stripe is being expanded, we must
block any IO attempts until the stripe is stable again.

Key elements in this change are:
 - each stripe_head gets a 'disk' field which is part of the key,
   thus there can sometimes be two stripe heads of the same area of
   the array, but covering different numbers of devices.  One of these
   will be marked STRIPE_EXPANDING and so won't accept new requests.
 - conf->expand_progress tracks how the expansion is progressing and
   is used to determine whether the target part of the array has been
   expanded yet or not.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00
NeilBrown ad01c9e375 [PATCH] md: Allow stripes to be expanded in preparation for expanding an array
Before a RAID-5 can be expanded, we need to be able to expand the stripe-cache
data structure.

This requires allocating new stripes in a new kmem_cache.  If this succeeds,
we copy cache pages over and release the old stripes and kmem_cache.

We then allocate new pages.  If that fails, we leave the stripe cache at it's
new size.  It isn't worth the effort to shrink it back again.

Unfortuanately this means we need two kmem_cache names as we, for a short
period of time, we have two kmem_caches.  So they are raid5/%s and
raid5/%s-alt

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00
NeilBrown b55e6bfcd2 [PATCH] md: Split disks array out of raid5 conf structure so it is easier to grow
The remainder of this batch implements raid5 reshaping.  Currently the only
shape change that is supported is added a device, but it is envisioned that
changing the chunksize and layout will also be supported, as well as changing
the level (e.g.  1->5, 5->6).

The reshape process naturally has to move all of the data in the array, and so
should be used with caution.  It is believed to work, and some testing does
support this, but wider testing would be great for increasing my confidence.

You will need a version of mdadm newer than 2.3.1 to make use of raid5 growth.
 This is because mdadm need to take a copy of a 'critical section' at the
start of the array incase there is a crash at an awkward moment.  On restart,
mdadm will restore the critical section and allow reshape to continue.

I hope to release a 2.4-pre by early next week - it still needs a little more
polishing.

This patch:

Previously the array of disk information was included in the raid5 'conf'
structure which was allocated to an appropriate size.  This makes it awkward
to change the size of that array.  So we split it off into a separate
kmalloced array which will require a little extra indexing, but is much easier
to grow.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-27 08:45:01 -08:00
NeilBrown 9e71f9c848 [PATCH] DM: Fix bug: BIO_RW_BARRIER requests to md/raid1 hang.
Both R1BIO_Barrier and R1BIO_Returned are 4 !!!!

This means that barrier requests don't get returned (i.e.  b_endio called)
because it looks like they already have been.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-23 07:38:03 -08:00
NeilBrown 88202a0c84 [PATCH] md: allow sync-speed to be controlled per-device
Also export current (average) speed and status in sysfs.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:10 -08:00
NeilBrown 4dbcdc751c [PATCH] md: count corrected read errors per drive
Store this total in superblock (As appropriate), and make it available to
userspace via sysfs.

Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:09 -08:00
NeilBrown d9d166c2a9 [PATCH] md: allow array level to be set textually via sysfs
Signed-off-by: Neil Brown <neilb@suse.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:09 -08:00
NeilBrown 2989ddbd6e [PATCH] md: make a couple of names in md.c static
.. because they aren't used outside md.c

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:07 -08:00
NeilBrown 1345b1d8ad [PATCH] md: define and use safe_put_page for md
md sometimes call put_page on NULL pointers (treating it like kfree).  This is
not safe, so define and use a 'safe_put_page' which checks for NULL.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:07 -08:00
NeilBrown 2604b703b6 [PATCH] md: remove personality numbering from md
md supports multiple different RAID level, each being implemented by a
'personality' (which is often in a separate module).

These personalities have fairly artificial 'numbers'.  The numbers
are use to:
 1- provide an index into an array where the various personalities
    are recorded
 2- identify the module (via an alias) which implements are particular
    personality.

Neither of these uses really justify the existence of personality numbers.
The array can be replaced by a linked list which is searched (array lookup
only happens very rarely).  Module identification can be done using an alias
based on level rather than 'personality' number.

The current 'raid5' modules support two level (4 and 5) but only one
personality.  This slight awkwardness (which was handled in the mapping from
level to personality) can be better handled by allowing raid5 to register 2
personalities.

With this change in place, the core md module does not need to have an
exhaustive list of all possible personalities, so other personalities can be
added independently.

This patch also moves the check for chunksize being non-zero into the ->run
routines for the personalities that need it, rather than having it in core-md.
 This has a side effect of allowing 'faulty' and 'linear' not to have a
chunk-size set.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:06 -08:00
NeilBrown fccddba060 [PATCH] md: tidy up raid5/6 hash table code
- replace open-coded hash chain with hlist macros

- Fix hash-table size at one page - it is already quite generous, so there
  will never be a need to use multiple pages, so no need for __get_free_pages

No functional change.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:06 -08:00
NeilBrown 0eb3ff12aa [PATCH] md: raid10 read-error handling - resync and read-only
Add in correct read-error handling for resync and read-only situations.

When read-only, we don't over-write, so we need to mark the failed drive in
the r10_bio so we don't re-try it.  During resync, we always read all blocks,
so if there is a read error, we simply over-write it with the good block that
we found (assuming we found one).

Note that the recovery case still isn't handled in an interesting way.  There
is nothing useful to do for the 2-copies case.  If there are 3 or more copies,
then we could try reading from one of the non-missing copies, but this is a
bit complicated and very rarely would be used, so I'm leaving it for now.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:05 -08:00
NeilBrown 4443ae10ca [PATCH] md: auto-correct correctable read errors in raid10
Largely just a cross-port from raid1.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:05 -08:00
NeilBrown 9910f16af3 [PATCH] md: fix up some rdev rcu locking in raid5/6
There is this "FIXME" comment with a typo in it!!  that been annoying me for
days, so I just had to remove it.

conf->disks[i].rdev should only be accessed if
  - we know we hold a reference or
  - the mddev->reconfig_sem is down or
  - we have a rcu_readlock

handle_stripe was referencing rdev in three places without any of these.  For
the first two, get an rcu_readlock.  For the last, the same access
(md_sync_acct call) is made a little later after the rdev has been claimed
under and rcu_readlock, if R5_Syncio is set.  So just use that access...
However R5_Syncio isn't really needed as the 'syncing' variable contains the
same information.  So use that instead.

Issues, comment, and fix are identical in raid5 and raid6.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:04 -08:00
NeilBrown cf30a473a0 [PATCH] md: handle errors when read-only
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:04 -08:00
NeilBrown ddaf22abaa [PATCH] md: attempt to auto-correct read errors in raid1
On a read-error we suspend the array, then synchronously read the block from
other arrays until we find one where we can read it.  Then we try writing the
good data back everywhere and make sure it works.  If any write or subsequent
read fails, only then do we fail the device out of the array.

To be able to suspend the array, we need to also keep track of how many
requests are queued for handling by raid1d.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:03 -08:00
NeilBrown ca65b73bd9 [PATCH] md: fix raid6 resync check/repair code
raid6 currently does not check the P/Q syndromes when doing a resync, it just
calculates the correct value and writes it.  Doing the check can reduce writes
(often to 0) for a resync, and it is needed to properly implement the

  echo check > sync_action

operation.

This patch implements the appropriate checks and tidies up some related code.

It also allows raid6 user-requested resync to bypass the intent bitmap.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:03 -08:00
NeilBrown 6cce3b23f6 [PATCH] md: write intent bitmap support for raid10
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:03 -08:00
NeilBrown 6ff8d8ec06 [PATCH] md: allow dirty raid[456] arrays to be started at boot
See patch to md.txt for more details

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:02 -08:00
NeilBrown 0a27ec96b6 [PATCH] md: improve raid10 "IO Barrier" concept
raid10 needs to put up a barrier to new requests while it does resync or other
background recovery.  The code for this is currently open-coded, slighty
obscure by its use of two waitqueues, and not documented.

This patch gathers all the related code into 4 functions, and includes a
comment which (hopefully) explains what is happening.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:02 -08:00
NeilBrown 17999be4aa [PATCH] md: improve raid1 "IO Barrier" concept
raid1 needs to put up a barrier to new requests while it does resync or other
background recovery.  The code for this is currently open-coded, slighty
obscure by its use of two waitqueues, and not documented.

This patch gathers all the related code into 4 functions, and includes a
comment which (hopefully) explains what is happening.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-06 08:34:01 -08:00
NeilBrown 787453c239 [PATCH] md: complete conversion of md to use kthreads
There are a few loose ends following the conversion of md to use kthreads:

- Some fields in mdk_thread_t that aren't needed (kthreads does it's own
  completion and manages it's own name).

- thread->run is now never NULL, so no need to check

- Some tests for signal_pending that aren't needed (As we don't use signals
  to stop threads any more)

- Some flush_signals are not needed

- Some waits are interruptible and don't need to be.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:40 -08:00
NeilBrown a9701a3047 [PATCH] md: support BIO_RW_BARRIER for md/raid1
We can only accept BARRIER requests if all slaves handle
barriers, and that can, of course, change with time....

So we keep track of whether the whole array seems safe for barriers,
and also whether each individual rdev handles barriers.

We initially assumes barriers are OK.

When writing the superblock we try a barrier, and if that fails, we flag
things for no-barriers.  This will usually clear the flags fairly quickly.

If writing the superblock finds that BIO_RW_BARRIER is -ENOTSUPP, we need to
resubmit, so introduce function "md_super_wait" which waits for requests to
finish, and retries ENOTSUPP requests without the barrier flag.

When writing the real raid1, write requests which were BIO_RW_BARRIER but
which aresn't supported need to be retried.  So raid1d is enhanced to do this,
and when any bio write completes (i.e.  no retry needed) we remove it from the
r1bio, so that devices needing retry are easy to find.

We should hardly ever get -ENOTSUPP errors when writing data to the raid.
It should only happen if:
  1/ the device used to support BARRIER, but now doesn't.  Few devices
     change like this, though raid1 can!
or
  2/ the array has no persistent superblock, so there was no opportunity to
     pre-test for barriers when writing the superblock.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:38 -08:00
NeilBrown bd926c63b7 [PATCH] md: make md on-disk bitmaps not host-endian
Current bitmaps use set_bit et.al and so are host-endian, which means
not-portable.  Oops.

Define a new version number (4) for which bitmaps are little-endian.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:38 -08:00
NeilBrown b2d444d7ad [PATCH] md: convert 'faulty' and 'in_sync' fields to bits in 'flags' field
This has the advantage of removing the confusion caused by 'rdev_t' and
'mddev_t' both having 'in_sync' fields.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:38 -08:00
NeilBrown ba22dcbf10 [PATCH] md: improvements to raid5 handling of read errors
Two refinements to the 'attempt-overwrite-on-read-error' mechanism.
1/ If the array is read-only, don't attempt an over-write.
2/ If there are more than max_nr_stripes read errors on a device with
   no success, fail the drive.  This will make sure a dead
   drive will be eventually kicked even when we aren't trying
   to rewrite (which would normally kick a dead drive more quickly.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:38 -08:00
NeilBrown 007583c925 [PATCH] md: change raid5 sysfs attribute to not create a new directory
There isn't really a need for raid5 attributes to be an a subdirectory,
so this patch moves them from
  /sys/block/mdX/md/raid5/attribute
to
  /sys/block/mdX/md/attribute

This suggests that all md personalities should co-operate about
namespace usage, but that shouldn't be a problem.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:38 -08:00
NeilBrown 9d88883e68 [PATCH] md: teach raid5 the difference between 'check' and 'repair'.
With this, raid5 can be asked to check parity without repairing it.  It also
keeps a count of the number of incorrect parity blocks found (mismatches) and
reports them through sysfs.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:37 -08:00
NeilBrown 24dd469d72 [PATCH] md: allow a manual resync with md
You can trigger a 'check' with
  echo check > /sys/block/mdX/md/scan_mode
or a check-and-repair errors with
  echo repair > /sys/block/mdX/md/scan_mode

and read the current state from the same file.

Note: personalities need to know the different between 'check' and 'repair',
but don't yet.  Until they do, 'check' will be the same as 'repair' and will
just do a normal resync pass.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:37 -08:00
NeilBrown 3f294f4fb6 [PATCH] md: add kobject/sysfs support to raid5
/sys/block/mdX/md/raid5/
contains raid5-related attributes.
Currently
  stripe_cache_size
is number of entries in stripe cache, and is settable.
  stripe_cache_active
is number of active entries, and in only readable.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:37 -08:00
NeilBrown 86e6ffdd24 [PATCH] md: extend md sysfs support to component devices.
Each device in an md array how has a corresponding
  /sys/block/mdX/md/devNN/
directory which can contain attributes.  Currently there is only 'state' which
summarises the state, nd 'super' which has a copy of the superblock, and
'block' which is a symlink to the block device.

Also, /sys/block/mdX/md/rdNN represents slot 'NN' in the array, and is a
symlink to the relevant 'devNN'.  Obviously spare devices do not have a slot
in the array, and so don't have such a symlink.

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:37 -08:00
NeilBrown eae1701fbd [PATCH] md: initial sysfs support for md
Start using kobjects in mddevs, and provide a couple of simple attributes
(level and disks).  Attributes live in
  /sys/block/mdX/md/attr-name

Signed-off-by: Neil Brown <neilb@suse.de>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:36 -08:00
NeilBrown 4e5314b56a [PATCH] md: better handling of readerrors with raid5.
This patch changes the behaviour of raid5 when it gets a read error.
Instead of just failing the device, it tried to find out what should have
been there, and writes it over the bad block.  For some media-errors, this
has a reasonable chance of fixing the error.  If the write succeeds, and a
subsequent read succeeds as well, raid5 decided the address is OK and
conitnues.

Instead of failing a drive on read-error, we attempt to re-write the block,
and then re-read.  If that all works, we allow the device to remain in the
array.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-09 07:56:36 -08:00
NeilBrown 72626685dc [PATCH] md: add write-intent-bitmap support to raid5
Most awkward part of this is delaying write requests until bitmap updates have
been flushed.

To achieve this, we have a sequence number (seq_flush) which is incremented
each time the raid5 is unplugged.

If the raid thread notices that this has changed, it flushes bitmap changes,
and assigned the value of seq_flush to seq_write.

When a write request arrives, it is given the number from seq_write, and that
write request may not complete until seq_flush is larger than the saved seq
number.

We have a new queue for storing stripes which are waiting for a bitmap flush
and an extra flag for stripes to record if the write was 'degraded' and so
should not clear the a bit in the bitmap.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 16:39:12 -07:00
NeilBrown 0002b2718d [PATCH] md: limit size of sb read/written to appropriate amount
version-1 superblocks are not (normally) 4K long, and can be of variable size.
 Writing the full 4K can cause corruption (but only in non-default
configurations).

With this patch the super-block-flavour can choose a size to read, and set a
size to write based on what it finds.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 16:39:12 -07:00
NeilBrown 773f783442 [PATCH] md: remove old cruft from md_k.h header file
These inlines haven't been used for ages, they should go.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 16:39:12 -07:00
NeilBrown 71c0805cb4 [PATCH] md: allow md to load a superblock with feature-bit '1' set
As this is used to flag an internal bitmap.

Also, introduce symbolic names for feature bits.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 16:39:11 -07:00
NeilBrown 15945fee6f [PATCH] md: support md/linear array with components greater than 2 terabytes.
linear currently uses division by the size of the smallest componenet device
to find which device a request goes to.  If that smallest device is larger
than 2 terabytes, then the division will not work on some systems.

So we introduce a pre-shift, and take care not to make the hash table too
large, much like the code in raid0.

Also get rid of conf->nr_zones, which is not needed.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 16:39:10 -07:00
NeilBrown 4b6d287f62 [PATCH] md: add write-behind support for md/raid1
If a device is flagged 'WriteMostly' and the array has a bitmap, and the
bitmap superblock indicates that write_behind is allowed, then write_behind is
enabled for WriteMostly devices.

Write requests will be acknowledges as complete to the caller (via b_end_io)
when all non-WriteMostly devices have completed the write, but will not be
cleared from the bitmap until all devices complete.

This requires memory allocation to make a local copy of the data being
written.  If there is insufficient memory, then we fall-back on normal write
semantics.

Signed-Off-By: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 16:39:10 -07:00
NeilBrown 8ddf9efe67 [PATCH] md: support write-mostly device in raid1
This allows a device in a raid1 to be marked as "write mostly".  Read requests
will only be sent if there is no other option.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 16:39:10 -07:00
NeilBrown 36fa30636f [PATCH] md: all hot-add and hot-remove of md intent logging bitmaps
Both file-bitmaps and superblock bitmaps are supported.

If you add a bitmap file on the array device, you lose.

This introduces a 'default_bitmap_offset' field in mddev, as the ioctl used
for adding a superblock bitmap doesn't have room for giving an offset.  Later,
this value will be setable via sysfs.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 16:39:10 -07:00
NeilBrown 6b8b3e8a8b [PATCH] md: make sure md bitmap updates are flushed when array is stopped.
The recent change to never ignore the bitmap, revealed that the bitmap isn't
begin flushed properly when an array is stopped.

We call bitmap_daemon_work three times as there is a three-stage pipeline for
flushing updates to the bitmap file.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04 13:00:54 -07:00
NeilBrown 6a806c510d [PATCH] md/raid1: clear bitmap when fullsync completes
We need to be careful differentiating between a resync of a complete array,
in which we can clear the bitmap, and a resync of a degraded array, in
which we cannot.

This patch cleans all that up.

Cc: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-15 09:54:51 -07:00
NeilBrown 39730960d9 [PATCH] Two small fixes for md verion-1 superblocks.
1/ Must typecast int to (sector_t) before inverting or we
 might not invert enough bits.

2/ When "bitmap_offset" was added to mdp_superblock_1, we didn't increase
   the count of words-used (96 to 100).

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 19:07:47 -07:00
NeilBrown 7bfa19f274 [PATCH] md: allow md to update multiple superblocks in parallel.
currently, md updates all superblocks (one on each device) in series.  It
waits for one write to complete before starting the next.  This isn't a big
problem as superblock updates don't happen that often.

However it is neater to do it in parallel, and if the drives in the array have
gone to "sleep" after a period of idleness, then waking them is parallel is
faster (and someone else should be worrying about power drain).

Futher, we will need parallel superblock updates for a future patch which
keeps the intent-logging bitmap near the superblock.

Also remove the silly code that retired superblock updates 100 times.  This
simply never made sense.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 19:07:47 -07:00
NeilBrown a654b9d8f8 [PATCH] md: allow md intent bitmap to be stored near the superblock.
This provides an alternate to storing the bitmap in a separate file.  The
bitmap can be stored at a given offset from the superblock.  Obviously the
creator of the array must make sure this doesn't intersect with data....
After is good for version-0.90 superblocks.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 19:07:47 -07:00
NeilBrown 3d310eb7b3 [PATCH] md: fix deadlock due to md thread processing delayed requests.
Before completing a 'write' the md superblock might need to be updated.
This is best done by the md_thread.

The current code schedules this up and queues the write request for later
handling by the md_thread.

However some personalities (Raid5/raid6) will deadlock if the md_thread
tries to submit requests to its own array.

So this patch changes things so the processes submitting the request waits
for the superblock to be written and then submits the request itself.

This fixes a recently-created deadlock in raid5/raid6

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 19:07:46 -07:00
NeilBrown 41158c7eb2 [PATCH] md: optimise reconstruction when re-adding a recently failed drive.
When an array is degraded, bit in the intent-bitmap are never cleared.  So if
a recently failed drive is re-added, we only need to reconstruct the block
that are still reflected in the bitmap.

This patch adds support for this re-adding.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 19:07:46 -07:00
NeilBrown 191ea9b2c7 [PATCH] md: raid1 support for bitmap intent logging
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 19:07:46 -07:00
NeilBrown 77ad4bc706 [PATCH] md: enable the bitmap write-back daemon and wait for it.
Currently we don't wait for updates to the bitmap to be flushed to disk
properly.  The infrastructure all there, but it isn't being used....

A separate kernel thread (bitmap_writeback_daemon) is needed to wait for each
page as we cannot get callbacks when a page write completes.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 19:07:45 -07:00
NeilBrown 32a7627cf3 [PATCH] md: optimised resync using Bitmap based intent logging
With this patch, the intent to write to some block in the array can be logged
to a bitmap file.  Each bit represents some number of sectors and is set
before any update happens, and only cleared when all writes relating to all
sectors are complete.

After an unclean shutdown, information in this bitmap can be used to optimise
resync - only sectors which could be out-of-sync need to be updated.

Also if a drive is removed and then added back into an array, the recovery can
make use of the bitmap to optimise reconstruction.  This is not implemented in
this patch.

Currently the bitmap is stored in a file which must (obviously) be stored on a
separate device.

The patch only provided infrastructure.  It does not update any personalities
to bitmap intent logging.

Md arrays can still be used with no bitmap file.  This patch has minimal
impact on such arrays.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 19:07:43 -07:00
NeilBrown 57afd89f98 [PATCH] md: improve the interface to sync_request
1/ change the return value (which is number-of-sectors synced)
 from 'int' to 'sector_t'.
 The number of sectors is usually easily small enough to fit
 in an int, but if resync needs to abort, it may want to return
 the total number of remaining sectors, which could be large.
 Also errors cannot be returned as negative numbers now, so use
 0 instead
2/ Add a 'skipped' return parameter to allow the array to report
 that it skipped the sectors.  This allows md to take this into account
 in the speed calculations.
 Currently there is no important skipping, but the bitmap-based-resync
 that is coming will use this.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 19:07:43 -07:00
NeilBrown 06d91a5fe0 [PATCH] md: improve locking on 'safemode' and move superblock writes
When md marks the superblock dirty before a write, it calls
generic_make_request (to write the superblock) from within
generic_make_request (to write the first dirty block), which could cause
problems later.

With this patch, the superblock write is always done by the helper thread, and
write request are delayed until that write completes.

Also, the locking around marking the array dirty and writing the superblock is
improved to avoid possible races.

Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 19:07:43 -07:00
Linus Torvalds 1da177e4c3 Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!
2005-04-16 15:20:36 -07:00