linux-2.6

dect

Archived

Author	SHA1	Message	Date
NeilBrown	e384e58549	md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log. This allows md/raid5 to fully work as a dm target. Normally md uses a 'filemap' which contains a list of pages of bits each of which may be written separately. dm-log uses an all-or-nothing approach to writing the log, so when using a dm-log, ->filemap is NULL and the flags normally stored in filemap_attr are stored in ->logattrs instead. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:21:34 +10:00
NeilBrown	ef42567335	md/bitmap: optimise scanning of empty bitmaps. A bitmap is stored as one page per 2048 bits. If none of the bits are set, the page is not allocated. When bitmap_get_counter finds that a page isn't allocated, it just reports that one bit worth of space isn't flagged, rather than reporting that 2048 bits worth of space are unflagged. This can cause searches for flagged bits (e.g. bitmap_close_sync) to do more work than is really necessary. So change bitmap_get_counter (when creating) to report a number of blocks that more accurately reports the range of the device for which no counter currently exists. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:21:32 +10:00
NeilBrown	b63d7c2e29	md/bitmap: clean up plugging calls. 1/ use md_unplug in bitmap.c as we will soon be using bitmaps under arrays with no queue attached. 2/ Don't bother plugging the queue when we set a bit in the bitmap. The reason for this was to encourage as many bits as possible to get set before we unplug and write stuff out. However every personality already plugs the queue after bitmap_startwrite either directly (raid1/raid10) or by setting STRIPE_BIT_DELAY which causes the queue to be plugged later (raid5). Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:21:32 +10:00
NeilBrown	5ff5afffe6	md/bitmap: reduce dependence on sysfs. For dm-raid45 we will want to use bitmaps in dm-targets which don't have entries in sysfs, so cope with the mddev not living in sysfs. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:21:31 +10:00
NeilBrown	ac2f40be46	md/bitmap: white space clean up and similar. Fixed some whitespace problems. Fixed some checkpatch.pl complaints. Replaced kmalloc ... memset(0) with kzalloc. Fixed an unlikely memory leak on an error path. Reformatted a number of 'if/else' sets, sometimes replacing goto with an else clause. Removed some old comments and commented-out code. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 13:07:22 +10:00
NeilBrown	9f7c222001	md/raid5: export raid5 unplugging interface. Also remove remaining accesses to ->queue and ->gendisk when ->queue is NULL (As it is in a DM target). Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:53:10 +10:00
NeilBrown	252ac5221a	md/plug: optionally use plugger to unplug an array during resync/recovery. If an array doesn't have a 'queue' then md_do_sync cannot unplug it. In that case it will have a 'plugger', so make that available to the mddev, and use it to unplug the array if needed. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:53:08 +10:00
NeilBrown	2ac8740151	md/raid5: add simple plugging infrastructure. md/raid5 uses the plugging infrastructure provided by the block layer and 'struct request_queue'. However when we plug raid5 under dm there is no request queue so we cannot use that. So create a similar infrastructure that is much lighter weight and use it for raid5. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:53:08 +10:00
NeilBrown	11d8a6e371	md/raid5: export is_congested test the dm module will need this for dm-raid45. Also only access ->queue->backing_dev_info->congested_fn if ->queue actually exists. It won't in a dm target. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:29 +10:00
NeilBrown	4a5add4995	raid5: Don't set read-ahead when there is no queue dm-raid456 does not provide a 'queue' for raid5 to use, so we must make raid5 stop depending on the queue. First: read_ahead dm handles read-ahead adjustment fully in userspace, so simply don't do any readahead adjustments if there is no queue. Also re-arrange code slightly so all the accesses to ->queue are together. Finally, move the blk_queue_merge_bvec function into the 'if' as the ->split_io setting in dm-raid456 has the same effect. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:27 +10:00
NeilBrown	768a418db1	md: add support for raising dm events. dm uses scheduled work to raise events to user-space. So allow md device to have work_structs and schedule them on an error. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:27 +10:00
NeilBrown	390ee602a1	md: export various start/stop interfaces export entry points for starting and stopping md arrays. This will be used by a module to make md/raid5 work under dm. Also stop calling md_stop_writes from md_stop, as that won't work well with dm - it will want to call the two separately. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:27 +10:00
NeilBrown	e8bb9a839a	md: split out md_rdev_init This functionality will be needed separately in a subsequent patch, so split it into its own exported function. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:27 +10:00
NeilBrown	676e42d896	md: be more careful setting MD_CHANGE_CLEAN When MD_CHANGE_CLEAN is set we might block in md_write_start. So we should only set it when fairly sure that something will clear it. There are two places where it is set so as to encourage a metadata update to record the progress of resync/recovery. This should only be done if the internal metadata update mechanisms are in use, which can be tested by inspecting '->persistent'. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:27 +10:00
NeilBrown	f4be6b43f1	md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk We will shortly allow md devices with no gendisk (they are attached to a dm-target instead). That will cause mdname() to return 'mdX'. There is one place where mdname really needs to be unique: when creating the name for a slab cache. So in that case, if there is no gendisk, use the address of the mddev formatted in HEX to provide a unique name. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-26 12:52:26 +10:00
NeilBrown	c41d4ac40d	md/raid5: factor out code for changing size of stripe cache. Separate the actual 'change' code from the sysfs interface so that it can eventually be called internally. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-21 13:28:15 +10:00
NeilBrown	00bcb4ac7e	md: reduce dependence on sysfs. We will want md devices to live as dm targets where sysfs is not visible. So allow md to not connect to sysfs. Signed-off-by: NeilBrown <neilb@suse.de>	2010-07-21 13:27:53 +10:00
NeilBrown	3424bf6a77	md/raid5: don't include 'spare' drives when reshaping to fewer devices. There are few situations where it would make any sense to add a spare when reducing the number of devices in an array, but it is conceivable: A 6 drive RAID6 with two missing devices could be reshaped to a 5 drive RAID6, and a spare could become available just in time for the reshape, but not early enough to have been recovered first. 'freezing' recovery can make this easy to do without any races. However doing such a thing is a bad idea. md will not record the partially-recovered state of the 'spare' and when the reshape finished it will think that the spare is still spare. Easiest way to avoid this confusion is to simply disallow it. Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:36:04 +10:00
NeilBrown	2f11588249	md/raid5: add a missing 'continue' in a loop. As the comment says, the tail of this loop only applies to devices that are not fully in sync, so if In_sync was set, we should avoid the rest of the loop. This bug will hardly ever cause an actual problem. The worst it can do is allow an array to be assembled that is dirty and degraded, which is not generally a good idea (without warning the sysadmin first). This will only happen if the array is RAID4 or a RAID5/6 in an intermediate state during a reshape and so has one drive that is all 'parity' - no data - while some other device has failed. This is certainly possible, but not at all common. Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:35:49 +10:00
NeilBrown	415e72d034	md/raid5: Allow recovered part of partially recovered devices to be in-sync During a recovery or reshape the early part of some devices might be in-sync while the later parts are not. When we know we are looking at an early part it is good to treat that part as in-sync for stripe calculations. This is particularly important for a reshape which suffers device failure. Treating the data as in-sync can mean the difference between data-safety and data-loss. Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:35:39 +10:00
NeilBrown	674806d62f	md/raid5: More careful check for "has array failed". When we are reshaping an array, the device failure combinations that cause us to decide that the array has failed are more subtle. In particular, any 'spare' will be fully in-sync in the section of the array that has already been reshaped, thus failures that affect only that section are less critical. So encode this subtlety in a new function and call it as appropriate. The case that showed this problem was a 4 drive RAID5 to 8 drive RAID6 conversion where the last two devices failed. This resulted in: good good good good incomplete good good failed failed while converting a 5-drive RAID6 to 8 drive RAID5 The incomplete device causes the whole array to look bad, but as it was actually good for the section that had been converted to 8-drives, all the data was actually safe. Reported-by: Terry Morris <tbmorris@tbmorris.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:35:27 +10:00
NeilBrown	70fffd0bfa	md: Don't update ->recovery_offset when reshaping an array to fewer devices. When an array is reshaped to have fewer devices, the reshape proceeds from the end of the devices to the beginning. If a device happens to be non-In_sync (which is possible but rare) we would normally update the ->recovery_offset as the reshape progresses. However that would be wrong as the recovery_offset records that the early part of the device is in_sync, while in fact it would only be the later part that is in_sync, and in any case the offset number would be measured from the wrong end of the device. Relatedly, if after a reshape a spare is discovered to not be recovered all the way to the end, do not allow spare_active to incorporate it in the array. This becomes relevant in the following sample scenario: A 4 drive RAID5 is converted to a 6 drive RAID6 in a combined operation. The RAID5->RAID6 conversion will cause a 5th drive to be included as a spare, then the 5drive -> 6drive reshape will effectively rebuild that spare as it progresses. The 6th drive is treated as in_sync the whole time as there is never any case that we might consider reading from it, but must not because there is no valid data. If we interrupt this reshape part-way through and reverse it to return to a 5-drive RAID6 (or even a 4-drive RAID5), we don't want to update the recovery_offset - as that would be wrong - and we don't want to include that spare as active in the 5-drive RAID6 when the reversed reshape completes, as it will be mostly out-of-sync still. Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:35:18 +10:00
NeilBrown	e4e11e385d	md/raid5: avoid oops when number of devices is reduced then increased. The entries in the stripe_cache maintained by raid5 are enlarged when we increase the number of devices in the array, but not shrunk when we reduce the number of devices. So if entries are added after reducing the number of devices, we must ensure to initialise the whole entry, not just the part that is currently relevant. Otherwise if we enlarge the array again, we will reference uninitialised values. As grow_buffers/shrink_buffer now want to use a count that is stored explicitly in the raid_conf, they should get it from there rather than being passed it as a parameter. Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:35:02 +10:00
Maciej Trela	049d6c1ef9	md: enable raid4->raid0 takeover Only level 5 with layout=PARITY_N can be taken over to raid0 now. Let's allow level 4 as well. Signed-off-by: Maciej Trela <maciej.trela@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:34:57 +10:00
Maciej Trela	001048a318	md: clear layout after ->raid0 takeover After takeover from raid5/10 -> raid0 mddev->layout is not cleared. Signed-off-by: Maciej Trela <maciej.trela@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:34:45 +10:00
Maciej Trela	f73ea87375	md: fix raid10 takeover: use new_layout for setup_conf Use mddev->new_layout in setup_conf. Also use new_chunk, and don't set ->degraded in takeover(). That gets set in run() Signed-off-by: Maciej Trela <maciej.trela@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:33:51 +10:00
NeilBrown	e93f68a1fc	md: fix handling of array level takeover that re-arranges devices. Most array level changes leave the list of devices largely unchanged, possibly causing one at the end to become redundant. However conversions between RAID0 and RAID10 need to renumber all devices (except 0). This renumbering is currently being done in the ->run method when the new personality takes over. However this is too late as the common code in md.c might already have invalidated some of the devices if they had a ->raid_disk number that appeared too high. Moving it into the ->takeover method is too early as the array is still active at that time and wrong ->raid_disk numbers could cause confusion. So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate the new raid_disk number. Now the common code knows exactly which devices need to be renumbered, and which can be invalidated, and can do it all at a convenient time when the array is suspended. It can also update some symlinks in sysfs which previously were not being updated correctly. Reported-by: Maciej Trela <maciej.trela@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:33:24 +10:00
Prasanna S. Panchamukhi	0544a21db0	md: raid10: Fix null pointer dereference in fix_read_error() Such a NULL pointer dereference can occur when the driver was fixing the read errors/bad blocks and the disk was physically removed, causing a system crash. This patch checks if the rcu_dereference() returns a valid rdev before accessing it in fix_read_error(). Cc: stable@kernel.org Signed-off-by: Prasanna S. Panchamukhi <prasanna.panchamukhi@riverbed.com> Signed-off-by: Rob Becker <rbecker@riverbed.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:31:03 +10:00
NeilBrown	f3b99be19d	Restore partition detection of newly created md arrays. Commit `b821eaa572` broke partition detection for md arrays. The logic was almost right. However if revalidate_disk is called when the device is not yet open, bdev->bd_disk won't be set, so the flush_disk() call will not set bd_invalidated. So when md_open is called we still need to ensure that ->bd_invalidated gets set. This is easily done with a call to check_disk_size_change in the place where the offending commit removed check_disk_change. At the important times, the size will have changed from 0 to non-zero, so check_disk_size_change will set bd_invalidated. Tested-by: Duncan <1i5t5.duncan@cox.net> Reported-by: Duncan <1i5t5.duncan@cox.net> Signed-off-by: NeilBrown <neilb@suse.de>	2010-06-24 13:31:03 +10:00
Akinobu Mita	55af6bb509	md: convert cpu notifier to return encapsulate errno value By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for raid5. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:48 -07:00
Linus Torvalds	e8bebe2f71	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits) fix handling of offsets in cris eeprom.c, get rid of fake on-stack files get rid of home-grown mutex in cris eeprom.c switch ecryptfs_write() to struct inode , kill on-stack fake files switch ecryptfs_get_locked_page() to struct inode simplify access to ecryptfs inodes in ->readpage() and friends AFS: Don't put struct file on the stack Ban ecryptfs over ecryptfs logfs: replace inode uid,gid,mode initialization with helper function ufs: replace inode uid,gid,mode initialization with helper function udf: replace inode uid,gid,mode init with helper ubifs: replace inode uid,gid,mode initialization with helper function sysv: replace inode uid,gid,mode initialization with helper function reiserfs: replace inode uid,gid,mode initialization with helper function ramfs: replace inode uid,gid,mode initialization with helper function omfs: replace inode uid,gid,mode initialization with helper function bfs: replace inode uid,gid,mode initialization with helper function ocfs2: replace inode uid,gid,mode initialization with helper function nilfs2: replace inode uid,gid,mode initialization with helper function minix: replace inode uid,gid,mode init with helper ext4: replace inode uid,gid,mode init with helper ... Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)	2010-05-21 19:37:45 -07:00
NeilBrown	19fdb9eefb	Merge commit '3ff195b011d7decf501a4d55aeed312731094796' into for-linus Conflicts: drivers/md/md.c - Resolved conflict in md_update_sb - Added extra 'NULL' arg to new instance of sysfs_get_dirent. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-22 08:31:36 +10:00
Christoph Hellwig	8018ab0574	sanitize vfs_fsync calling conventions Now that the last user passing a NULL file pointer is gone we can remove the redundant dentry argument and associated hacks inside vfs_fsync_range. The next step will be removing the dentry argument from ->fsync, but given the luck with the last round of method prototype changes I'd rather defer this until after the main merge window. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-21 18:31:21 -04:00
Eric W. Biederman	3ff195b011	sysfs: Implement sysfs tagged directory support. The problem. When implementing a network namespace I need to be able to have multiple network devices with the same name. Currently this is a problem for /sys/class/net/, /sys/devices/virtual/net/, and potentially a few other directories of the form /sys/ ... /net/. What this patch does is to add an additional tag field to the sysfs dirent structure. For directories that should show different contents depending on the context such as /sys/class/net/, and /sys/devices/virtual/net/ this tag field is used to specify the context in which those directories should be visible. Effectively this is the same as creating multiple distinct directories with the same name but internally to sysfs the result is nicer. I am calling the concept of a single directory that looks like multiple directories all at the same path in the filesystem tagged directories. For the networking namespace the set of directories whose contents I need to filter with tags can depend on the presence or absence of hotplug hardware or which modules are currently loaded. Which means I need a simple race free way to setup those directories as tagged. To achieve a race free design all tagged directories are created and managed by sysfs itself. Users of this interface: - define a type in the sysfs_tag_type enumeration. - call sysfs_register_ns_types with the type and its operations - sysfs_exit_ns when an individual tag is no longer valid - Implement mount_ns() which returns the ns of the calling process so we can attach it to a sysfs superblock. - Implement ktype.namespace() which returns the ns of a sysfs kobject. Everything else is left up to sysfs and the driver layer. For the network namespace mount_ns and namespace() are essentially one line functions, and look to remain that. Tags are currently represented as const void pointers as that is both generic, provides enough information for equality comparisons, and is trivial to create for current users, as it is just the existing namespace pointer. The work needed in sysfs is more extensive. At each directory or symlink creation I need to check if the directory it is being created in is a tagged directory and if so generate the appropriate tag to place on the sysfs_dirent. Likewise at each symlink or directory removal I need to check if the sysfs directory it is being removed from is a tagged directory and if so figure out which tag goes along with the name I am deleting. Currently only directories which hold kobjects, and symlinks are supported. There is not enough information in the current file attribute interfaces to give us anything to discriminate on which makes it useless, and there are no potential users which makes it an uninteresting problem to solve. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Benjamin Thery <benjamin.thery@bull.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-05-21 09:37:31 -07:00
NeilBrown	be6800a73a	md: don't insist on valid event count for spare devices. Devices which know that they are spares do not really need to have an event count that matches the rest of the array, so there are no data-in-sync issues. It is enough that the uuid matches. So remove the requirement that the event count is up-to-date. We currently still write out an event count on spares, but this allows us in a year or 3 to stop doing that completely. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:28:01 +10:00
NeilBrown	a8707c08f4	md: simplify updating of event count to sometimes avoid updating spares. When updating the event count for a simple clean <-> dirty transition, we try to avoid updating the spares so they can safely spin-down. As the event_counts across an array must be +/- 1, this means decrementing the event_count on a dirty->clean transition. This is not always safe and we have to avoid the unsafe time. We currently do this with a misguided idea about it being safe or not depending on whether the event_count is odd or even. This approach only works reliably in a few common instances, but easily falls down. So instead, simply keep internal state concerning whether it is safe or not, and always assume it is not safe when an array is first assembled. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:28:01 +10:00
Gabriele A. Trombetti	7b0bb5368a	md/raid6: Fix raid-6 read-error correction in degraded state Fix: Raid-6 was not trying to correct a read-error when in singly-degraded state and was instead dropping one more device, going to doubly-degraded state. This patch fixes this behaviour. Tested-by: Janos Haar <janos.haar@netcenter.hu> Signed-off-by: Gabriele A. Trombetti <g.trombetti.lkrnl1213@logicschema.com> Reported-by: Janos Haar <janos.haar@netcenter.hu> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2010-05-18 15:28:00 +10:00
NeilBrown	75a73a29e5	md: restore ability of spare drives to spin down. Some time ago we stopped the clean/active metadata updates from being written to a 'spare' device in most cases so that it could spin down and stay spun down. Device failure/removal etc are still recorded on spares. However commit `51d5668cb2` broke this 50% of the time, depending on whether the event count is even or odd. The change log entry said: This means that the alignment between 'odd/even' and 'clean/dirty' might take a little longer to attain, however the code makes no attempt to create that alignment, so it could take arbitrarily long. So when we find that clean/dirty is not aligned with odd/even, force a second metadata-update immediately. There are already cases where a second metadata-update is needed immediately (e.g. when a device fails during the metadata update). We just piggy-back on that. Reported-by: Joe Bryant <tenminjoe@yahoo.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2010-05-18 15:28:00 +10:00
NeilBrown	af3a2cd6b8	md: Fix read balancing in RAID1 and RAID10 on drives > 2TB read_balance uses a "unsigned long" for a sector number which will get truncated beyond 2TB. This will cause read-balancing to be non-optimal, and can cause data to be read from the 'wrong' branch during a resync. This has a very small chance of returning wrong data. Reported-by: Jordan Russell <jr-list-2010@quo.to> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:28:00 +10:00
NeilBrown	2dc40f8094	md/linear: standardise all printk messages md/linear:mdname: Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:59 +10:00
NeilBrown	b5a20961f3	md/raid0: tidy up printk messages. All messages now start md/raid0:md-device-name: Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:59 +10:00
NeilBrown	128595ed6f	md/raid10: tidy up printk messages. All raid10 printk messages now start md/raid10:md-device-name: Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:59 +10:00
NeilBrown	9dd1e2faf7	md/raid1: improve printk messages Make sure the array name is included in a uniform way in all printk messages. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:59 +10:00
NeilBrown	0c55e02259	md/raid5: improve consistency of error messages. Many 'printk' messages from the raid456 module mention 'raid5' even though it may be a 'raid6' or even 'raid4' array. This can cause confusion. Also the actual array name is not always reported and when it is it is not reported consistently. So change all the messages to start: md/raid:%s: where '%s' becomes e.g. md3 to identify the particular array. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:58 +10:00
NeilBrown	08fb730ca3	md: remove EXPERIMENTAL designation from RAID10 RAID10 has been available for quite a while now and is quite well tested, so we can remove the EXPERIMENTAL designation. Reported-by: Eric MSP Veith <eveith@wwweb-library.net> Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:58 +10:00
Dan Williams	f2859af671	md: allow integers to be passed to md/level e.g. allow md to interpret 'echo 4 > md/level' as a request for raid4. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2010-05-18 15:27:58 +10:00
Dan Williams	bb7f8d2217	md: notify mdstat waiters of level change Level modifications change the output of mdstat. The mdmon manager thread is interested in these events for external metadata management. Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2010-05-18 15:27:57 +10:00
Dan Williams	f1b29bcae1	md/raid4: permit raid0 takeover For consistency allow raid4 to takeover raid0 in addition to raid5 (with a raid4 layout). Signed-off-by: Dan Williams <dan.j.williams@intel.com>	2010-05-18 15:27:57 +10:00
NeilBrown	e555190d82	md/raid1: delay reads that could overtake behind-writes. When a raid1 array is configured to support write-behind on some devices, it normally only reads from other devices. If all devices are write-behind (because the rest have failed) it is possible for a read request to be serviced before a behind-write request, which would appear as data corruption. So when forced to read from a WriteMostly device, wait for any write-behind to complete, and don't start any more behind-writes. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:57 +10:00
NeilBrown	d754c5ae1f	md/raid1: fix confusing 'redirect sector' message. This message seems to suggest the named device is the one on which a read failed, however it is actually the device that the read will be redirected to. So make the message a little clearer. Reported-by: Tim Burgess <ozburgess@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:56 +10:00
NeilBrown	9e35b99c7e	md: don't unregister the thread in mddev_suspend This is - unnecessary because mddev_suspend is always followed by a call to ->stop, and each ->stop unregisters the thread, and - a problem as it makes it awkward to suspend and then resume a device as we will want later. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:56 +10:00
NeilBrown	fafd7fb052	md: factor out init code for an mddev This is a simple factorisation that makes mddev_find easier to read. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:55 +10:00
NeilBrown	21a52c6d05	md: pass mddev to make_request functions rather than request_queue We used to pass the personality make_request function direct to the block layer so the first argument had to be a queue. But now we have the intermediary md_make_request so it makes a lot more sense to pass a struct mddev_s. It makes it possible to have an mddev without its own queue too. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:55 +10:00
NeilBrown	cca9cf90c5	md: call md_stop_writes from md_stop This moves the call to the other side of set_readonly, but that should not be an issue. This encapsulates in 'md_stop' all of the functionality for internally stopping the array, leaving all the interactions with externalities (sysfs, request_queue, gendisk) in do_md_stop. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:54 +10:00
NeilBrown	a4bd82d0d0	md: split md_set_readonly out of do_md_stop Using do_md_stop to set an array to read-only is a little confusing. Now most of the common code has been factored out, split md_set_readonly off in to a separate function. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:54 +10:00
NeilBrown	a047e12540	md: factor md_stop_writes out of do_md_stop. Further refactoring of do_md_stop. This one requires some explanation as it takes code from different places in do_md_stop, so some re-ordering happens. We only get into this part of do_md_stop if there are no active opens of the device, so no writes can be happening and the device must have been flushed. In md_stop_writes we want to stop any internal sources of writes - i.e. resync - and flush out the metadata. The only code that was previously before some of this code is code to clean up the queue, the mddev, the gendisk, or sysfs, all of which is probably better after code that makes active changes (i.e. triggers writes). Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:54 +10:00
NeilBrown	6177b472ab	md: start to refactor do_md_stop do_md_stop is large and clunky, so hard to understand. This is a first step of refactoring, pulling two simple sub-functions out. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:53 +10:00
NeilBrown	fe60b01428	md: factor do_md_run to separate accesses to ->gendisk As part of relaxing the binding between an mddev and gendisk, we separate do_md_run into two functions. md_run does all the work internal to md; do_md_run calls md_run and makes any changes to gendisk that are required. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:53 +10:00
NeilBrown	b821eaa572	md: remove ->changed and related code. We set ->changed to 1 and call check_disk_change at the end of md_open so that bd_invalidated would be set and thus partition rescan would happen appropriately. Now that we call revalidate_disk directly, which sets bd_invalidated, that indirection is no longer needed and can be removed. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:53 +10:00
NeilBrown	49ce6cea85	md: don't reference gendisk in getgeo Using ->array_sectors rather than get_capacity() is more direct and is a step towards relaxing the tight connection between mddev and gendisk. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:52 +10:00
NeilBrown	490773268c	md: move io accounting out of personalities into md_make_request While I generally prefer letting personalities do as much as possible, given that we have a central md_make_request anyway we may as well use it to simplify code. Also this centralises knowledge of ->gendisk which will help later. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:52 +10:00
NeilBrown	2b7f22284d	md/raid5: small tidyup in raid5_align_endio Diving through ->queue to find mddev is unnecessarily complex - there is an easier path to finding mddev, so use that. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:50 +10:00
NeilBrown	a78d38a1a1	md: add support for raid5 to raid4 conversion This is unlikely to be wanted, but we may as well provide it for completeness. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:49 +10:00
Maciej Trela	5cac7861b2	md: notify level changes through sysfs. Level changes can be very significant, so make sure to notify them via sysfs. Signed-off-by: Maciej Trela <maciej.trela@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:49 +10:00
NeilBrown	233fca36bb	md: Relax checks on ->max_disks when external metadata handling is used. When metadata is being managed by user-space, md doesn't know what the maximum number of devices allowed in an array is so ->max_disks is 0. In this case we should allow any (+ve) number of disks. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:49 +10:00
Maciej Trela	b71031076e	md: Correctly handle device removal via sysfs Writing "none" to "../md/dev-xx/slot" removes that device from being an active part of the array, but it didn't set ->raid_disk to -1 to record this fact. Signed-off-by: Maciej Trela <Maciej.Trela@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:48 +10:00
Trela, Maciej	dab8b29248	md: Add support for Raid0->Raid10 takeover Signed-off-by: Maciej Trela <maciej.trela@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:48 +10:00
Trela, Maciej	9af204cf72	md: Add support for Raid5->Raid0 and Raid10->Raid0 takeover Signed-off-by: Maciej Trela <maciej.trela@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:48 +10:00
Trela Maciej	54071b3808	md:Add support for Raid0->Raid5 takeover Signed-off-by: Maciej Trela <maciej.trela@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:47 +10:00
NeilBrown	84707f38e7	md: don't use mddev->raid_disks in raid0 or raid10 while array is active. In a subsequent patch we will make it possible to change mddev->raid_disks while a RAID0 or RAID10 array is active. This is part of the process of reshaping such an array. This means that we cannot use this value while processing requests (it is OK to use it during initialisation as we are locked against changes then). Both RAID0 and RAID10 have the same value stored in the private data structure, so use that value instead. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:47 +10:00
NeilBrown	c0cc75f84e	md: discard StateChanged device flag. This was needed when sysfs files could only be 'notified' from process context. Now that we have sys_notify_direct, we can call it directly from an interrupt. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:47 +10:00
H Hartley Sweeten	7b92813c3c	drivers/md: Remove unnecessary casts of void * void pointers do not need to be cast to other pointer types. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:46 +10:00
Paul Clements	696fcd535b	md: expose max value of behind writes counter Keep track of the maximum number of concurrent write-behind requests for an md array and expose this number in sysfs at md/bitmap/max_backlog_used. Writing any value to this file will clear it. This allows userspace to be involved in tuning bitmap/backlog. Signed-off-by: Paul Clements <paul.clements@steeleye.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:46 +10:00
NeilBrown	ee8b81b03d	md: remove some dead fields from mddev_s These fields have never been used. Commit `4b6d287f62` added them, but also added identical fields to bitmap_super_s, and only used the latter. So remove these unused fields. Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:45 +10:00
NeilBrown	964147d5c8	md/raid1: fix counting of write targets. There is a very small race window when writing to a RAID1 such that if a device is marked faulty at exactly the wrong time, the write-in-progress will not be sent to the device, but the bitmap (if present) will be updated to say that the write was sent. Then if the device turned out to still be usable and was re-added to the array, the bitmap-based-resync would skip resyncing that block, possibly leading to corruption. This would only be a problem if no further writes were issued to that area of the device (i.e. that bitmap chunk). Suitable for any pending -stable kernel. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-18 15:27:13 +10:00
NeilBrown	a64c876fd3	md: manage redundancy group in sysfs when changing level. Some levels expect the 'redundancy group' to be present, others don't. So when we change level of an array we might need to add or remove this group. This requires fixing up the current practice of overloading ->private to indicate (when ->pers == NULL) that something needs to be removed. So create a new ->to_remove to fill that role. When changing levels, we may need to add or remove attributes. When changing RAID5 -> RAID6, we both add and remove the same thing. It is important to catch this and optimise it out as the removal is delayed until a lock is released, so trying to add immediately would cause problems. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-17 14:45:40 +10:00
NeilBrown	b6eb127d27	md: remove unneeded sysfs files more promptly When an array is stopped we need to remove some sysfs files which are dependent on the type of array. We need to delay that deletion as deleting them while holding reconfig_mutex can lead to deadlocks. We currently delay them until the array is completely destroyed. However it is possible to deactivate and then reactivate the array. It is also possible to need to remove sysfs files when changing level, which can potentially happen several times before an array is destroyed. So we need to delete these files more promptly: as soon as reconfig_mutex is dropped. We need to ensure this happens before do_md_run can restart the array, so we use open_mutex for some extra locking. This is not deadlock prone. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-17 14:40:07 +10:00
NeilBrown	ef2f80ff73	md/linear: avoid possible oops and array stop Since commit `ef286f6fa6` it has been important that each personality clears ->private in the ->stop() function, or sets it to an attribute group to be removed. linear.c doesn't. This can sometimes lead to an oops, though it doesn't always. Suitable for 2.6.33-stable and 2.6.34. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2010-05-17 14:38:18 +10:00
Dan Williams	e221835046	md: set mddev readonly flag on blkdev BLKROSET ioctl When the user sets the block device to readwrite then the mddev should follow suit. Otherwise, the BUG_ON in md_write_start() will be set to trigger. The reverse direction, setting mddev->ro to match a set readonly request, can be ignored because the blkdev level readonly flag precludes the need to have mddev->ro set correctly. Nevermind the fact that setting mddev->ro to 1 may fail if the array is in use. Cc: <stable@kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-05-12 08:25:37 +10:00
NeilBrown	1176568de7	md: restore ability of spare drives to spin down. Some time ago we stopped the clean/active metadata updates from being written to a 'spare' device in most cases so that it could spin down and stay spun down. Device failure/removal etc are still recorded on spares. However commit `51d5668cb2` broke this 50% of the time, depending on whether the event count is even or odd. The change log entry said: This means that the alignment between 'odd/even' and 'clean/dirty' might take a little longer to attain, however the code makes no attempt to create that alignment, so it could take arbitrarily long. So when we find that clean/dirty is not aligned with odd/even, force a second metadata-update immediately. There are already cases where a second metadata-update is needed immediately (e.g. when a device fails during the metadata update). We just piggy-back on that. Reported-by: Joe Bryant <tenminjoe@yahoo.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2010-05-07 21:10:57 +10:00
Gabriele A. Trombetti	87aa63000c	md/raid6: Fix raid-6 read-error correction in degraded state Fix: Raid-6 was not trying to correct a read-error when in singly-degraded state and was instead dropping one more device, going to doubly-degraded state. This patch fixes this behaviour. Tested-by: Janos Haar <janos.haar@netcenter.hu> Signed-off-by: Gabriele A. Trombetti <g.trombetti.lkrnl1213@logicschema.com> Reported-by: Janos Haar <janos.haar@netcenter.hu> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2010-05-07 21:10:35 +10:00
NeilBrown	6e3b96ed61	md/raid5: fix previous patch. Previous patch changes stripe and chunk_number to sector_t but mistakenly did not update all of the divisions to use sector_div(). This patch changes all of those divisions (actually the '%' operator) to sector_div. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org Tested-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>	2010-04-23 07:08:28 +10:00
NeilBrown	35f2a59119	md/raid5: allow for more than 2^31 chunks. With many large drives and small chunk sizes it is possible to create a RAID5 with more than 2^31 chunks. Make sure this works. Reported-by: Brett King <king.br@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2010-04-20 14:13:34 +10:00
Tejun Heo	5a0e3ad6af	include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities to include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the following. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and tries to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually widely available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build tests were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>	2010-03-30 22:02:32 +09:00
Linus Torvalds	31cc1dd344	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: md: deal with merge_bvec_fn in component devices better.	2010-03-18 16:55:24 -07:00
NeilBrown	627a2d3c29	md: deal with merge_bvec_fn in component devices better. If a component device has a merge_bvec_fn then as we never call it we must ensure we never need to. Currently this is done by setting max_sector to 1 PAGE, however this does not stop a bio being created with several sub-page iovecs that would violate the merge_bvec_fn. So instead set max_segments to 1 and set the segment boundary to the same as a page boundary to ensure there is only ever one single-page segment of IO requested at a time. This can particularly be an issue when 'xen' is used as it is known to submit multiple small buffers in a single bio. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org	2010-03-16 17:04:24 +11:00
Emese Revfy	52cf25d0ab	Driver core: Constify struct sysfs_ops in struct kobj_type Constify struct sysfs_ops. This is part of the ops structure constification effort started by Arjan van de Ven et al. Benefits of this constification: * prevents modification of data that is shared (referenced) by many other structure instances at runtime * detects/prevents accidental (but not intentional) modification attempts on archs that enforce read-only kernel data at runtime * potentially better optimized code as the compiler can assume that the const data cannot be changed * the compiler/linker move const data into .rodata and therefore exclude them from false sharing Signed-off-by: Emese Revfy <re.emese@gmail.com> Acked-by: David Teigland <teigland@redhat.com> Acked-by: Matt Domsch <Matt_Domsch@dell.com> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Acked-by: Hans J. Koch <hjk@linutronix.de> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Jens Axboe <jens.axboe@oracle.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:49 -08:00
Takahiro Yasui	f070304094	dm raid1: fix deadlock when suspending failed device To prevent deadlock, bios in the hold list should be flushed before dm_rh_stop_recovery() is called in mirror_suspend(). The recovery can't start because there are pending bios and therefore dm_rh_stop_recovery deadlocks. When there are pending bios in the hold list, the recovery waits for the completion of the bios after recovery_count is acquired. The recovery_count is released when the recovery finished, however, the bios in the hold list are processed after dm_rh_stop_recovery() in mirror_presuspend(). dm_rh_stop_recovery() also acquires recovery_count, then deadlock occurs. Signed-off-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>	2010-03-06 02:32:35 +00:00
Mike Snitzer	924e600d41	dm: eliminate some holes in data structures Eliminate a 4-byte hole in 'struct dm_io_memory' by moving 'offset' above the 'ptr' to which it applies (size reduced from 24 to 16 bytes). And by association, one 4-byte hole is eliminated in 'struct dm_io_request' (size reduced from 56 to 48 bytes). Eliminate all 6 4-byte holes and 1 cache-line in 'struct dm_snapshot' (size reduced from 392 to 368 bytes). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-03-06 02:32:33 +00:00
Peter Rajnoha	3abf85b5b5	dm ioctl: introduce flag indicating uevent was generated Set a new DM_UEVENT_GENERATED_FLAG when returning from ioctls to indicate that a uevent was actually generated. This tells the userspace caller that it may need to wait for the event to be processed. Signed-off-by: Peter Rajnoha <prajnoha@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-03-06 02:32:31 +00:00
Mikulas Patocka	a97f925a32	dm: free dm_io before bio_endio not after Free the dm_io structure before calling bio_endio() instead of after it, to ensure that the io_pool containing it is not referenced after it is freed. This partially fixes a problem described here https://www.redhat.com/archives/dm-devel/2010-February/msg00109.html thread 1: bio_endio(bio, io_error); /* scheduling happens */ thread 2: close the device remove the device thread 1: free_io(md, io); Thread 2, when removing the device, sees non-empty md->io_pool (because the io hasn't been freed by thread 1 yet) and may crash with BUG in mempool_free. Thread 1 may also crash, when freeing into a nonexisting mempool. To fix this we must make sure that bio_endio() is the last call and the md structure is not accessed afterwards. There is another bio_endio in process_barrier, but it is called from the thread and the thread is destroyed prior to freeing the mempools, so this call is not affected by the bug. A similar bug exists with module unloads - the module may be unloaded immediately after bio_endio - but that is more difficult to fix. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-03-06 02:32:29 +00:00
Nikanth Karthikesan	8215d6ec5f	dm table: remove unused dm_get_device range parameters Remove unused parameters(start and len) of dm_get_device() and fix the callers. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-03-06 02:32:27 +00:00
Mike Snitzer	0f3649a9e3	dm ioctl: only issue uevent on resume if state changed Only issue a uevent on a resume if the state of the device changed, i.e. if it was suspended and/or its table was replaced. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-03-06 02:32:24 +00:00
Mikulas Patocka	ede5ea0b8b	dm raid1: always return error if all legs fail If all mirror legs fail, always return an error instead of holding the bio, even if the handle_errors option was set. At present it is the responsibility of the driver underneath us to deal with retries, multipath etc. The patch adds the bio to the failures list instead of holding it directly. do_failures tests first if all legs failed and, if so, returns the bio with -EIO. If any leg is still alive and handle_errors is set, do_failures calls hold_bio. Reviewed-by: Takahiro Yasui <tyasui@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-03-06 02:32:22 +00:00
Kiyoshi Ueda	fb61264297	dm mpath: refactor pg_init This patch pulls the pg_init path activation code out of process_queued_ios() into a new function. No functional change. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-03-06 02:32:18 +00:00
Kiyoshi Ueda	2bded7bd7e	dm mpath: wait for pg_init completion when suspending When suspending the device we must wait for all I/O to complete, but pg-init may be still in progress even after flushing the workqueue for kmpath_handlerd in multipath_postsuspend. This patch waits for pg-init completion correctly in multipath_postsuspend(). Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-03-06 02:32:13 +00:00
Kiyoshi Ueda	d0259bf0ee	dm mpath: hold io until all pg_inits completed m->queue_io is set to block processing I/Os, and it needs to be kept while pg-init, which issues multiple path activations, is in progress. But m->queue_io is cleared when a path activation completes without error in pg_init_done(), even while other path activations are in progress. That may cause undesired -EIO on paths which have not completed activation. This patch fixes that by not clearing m->queue_io until all path activations complete. (Before the hardware handlers were moved into the SCSI layer, pg_init only used one path.) Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-03-06 02:30:02 +00:00
Kiyoshi Ueda	fce323dd68	dm mpath: avoid storing private suspended state 'suspended' flag in struct multipath was introduced to check whether the multipath target is in suspended state, but the same check is done through dm_suspended() now, so remove the flag and related code. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Mike Anderson <andmike@linux.vnet.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-03-06 02:29:59 +00:00
Kiyoshi Ueda	ecdb2e257a	dm table: remove dm_get from dm_table_get_md Remove the dm_get() in dm_table_get_md() because dm_table_get_md() could be called from presuspend/postsuspend, which are called while mapped_device is in DMF_FREEING state, where dm_get() is not allowed. Justification for that is the lifetime of both objects: As far as the current dm design/implementation, mapped_device is never freed while targets are doing something, because dm core waits for targets to become quiet in dm_put() using presuspend/postsuspend. So targets should be able to touch mapped_device without holding reference count of the mapped_device, and we should allow targets to touch mapped_device even if it is in DMF_FREEING state. Backgrounds: I'm trying to remove the multipath internal queue, since dm core now has a generic queue for request-based dm. In the patch-set, the multipath target wants to request dm core to start/stop queue. One of such start/stop requests can happen during postsuspend() while the target waits for pg-init to complete, because the target stops queue when starting pg-init and tries to restart it when completing pg-init. Since queue belongs to mapped_device, it involves calling dm_table_get_md() and dm_put(). On the other hand, postsuspend() is called in dm_put() for mapped_device which is in DMF_FREEING state, and that triggers BUG_ON(DMF_FREEING) in the 2nd dm_put(). I had tried to solve this problem by changing only multipath not to touch mapped_device which is in DMF_FREEING state, but I couldn't and I came up with a question why we need dm_get() in dm_table_get_md(). Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-03-06 02:29:52 +00:00
Moger, Babu	f7b934c812	dm mpath: skip activate_path for failed paths This patch adds two minor fixes while processing device mapper path activation. Skip failed paths while calling activate_path. If the path is already failed then activate_path will fail for sure. We don't have to call in that case. In some case this might cause prolonged retries unnecessarily. Change the misleading message if the path being activated fails with SCSI_DH_NOSYS. Signed-off-by: Babu Moger <babu.moger@lsi.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-03-06 02:29:49 +00:00
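
The hole elimination described in commit 924e600d41 above can be illustrated with a minimal sketch. It uses Python's ctypes, which lays out native structures according to the platform C ABI. The MemBefore and MemAfter classes are hypothetical stand-ins, not the kernel's 'struct dm_io_memory': the field names come from the commit message, while the field types (an int, a pointer, an unsigned int) are assumptions chosen to reproduce the described 24 to 16 byte reduction on a 64-bit platform.

```python
import ctypes


class MemBefore(ctypes.Structure):
    # Hypothetical layout: the 8-byte-aligned pointer follows a 4-byte field,
    # so a 4-byte hole appears before 'ptr' and 4 bytes of tail padding
    # appear after 'offset' on 64-bit platforms.
    _fields_ = [
        ("type", ctypes.c_int),
        ("ptr", ctypes.c_void_p),
        ("offset", ctypes.c_uint),
    ]


class MemAfter(ctypes.Structure):
    # Same fields with 'offset' moved above 'ptr': the two 4-byte fields
    # share one 8-byte slot and no padding is needed.
    _fields_ = [
        ("type", ctypes.c_int),
        ("offset", ctypes.c_uint),
        ("ptr", ctypes.c_void_p),
    ]


def layout(struct):
    """Return a mapping of field name to byte offset within the structure."""
    return {name: getattr(struct, name).offset for name, _ in struct._fields_}


size_before = ctypes.sizeof(MemBefore)
size_after = ctypes.sizeof(MemAfter)

if __name__ == "__main__":
    print("before:", size_before, "bytes", layout(MemBefore))
    print("after: ", size_after, "bytes", layout(MemAfter))
```

On a 32-bit platform both layouts would be expected to have the same size, since a 4-byte pointer needs no extra alignment. For real kernel structures, a tool such as pahole reports holes directly from debug information.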

1745 Commits