dect
/
linux-2.6
Archived
13
0
Fork 0
Commit Graph

2242 Commits

Author SHA1 Message Date
Alexandre Oliva 425d83156c Btrfs: skip block groups without enough space for a cluster
We test whether a block group has enough free space to hold the
requested block, but when we're doing clustered allocation, we can
save some cycles by testing whether it has enough room for the cluster
upfront, otherwise we end up attempting to set up a cluster and
failing.  Only in the NO_EMPTY_SIZE loop do we attempt an unclustered
allocation, and by then we'll have zeroed the cluster size, so this
patch won't stop us from using the block group as a last resort.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-30 13:43:00 -05:00
Alexandre Oliva 1b22bad779 Btrfs: start search for new cluster at the beginning
Instead of starting at zero (offset is always zero), request a cluster
starting at search_start, that denotes the beginning of the current
block group.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-30 13:43:00 -05:00
Alexandre Oliva b78d09bceb Btrfs: reset cluster's max_size when creating bitmap
The field that indicates the size of the largest contiguous chunk of
free space in the cluster is not initialized when setting up bitmaps,
it's only increased when we find a larger contiguous chunk.  We end up
retaining a larger value than appropriate for highly-fragmented
clusters, which may cause pointless searches for large contiguous
groups, and even cause clusters that do not meet the density
requirements to be set up.

Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-30 13:43:00 -05:00
Alexandre Oliva f2d0f6765d Btrfs: initialize new bitmaps' list
We're failing to create clusters with bitmaps because
setup_cluster_no_bitmap checks that the list is empty before inserting
the bitmap entry in the list for setup_cluster_bitmap, but the list
field is only initialized when it is restored from the on-disk free
space cache, or when it is written out to disk.

Besides a potential race condition due to the multiple use of the list
field, filesystem performance severely degrades over time: as we use
up all non-bitmap free extents, the try-to-set-up-cluster dance is
done at every metadata block allocation.  For every block group, we
fail to set up a cluster, and after failing on them all up to twice,
we fall back to the much slower unclustered allocation.

To make matters worse, before the unclustered allocation, we try to
create new block groups until we reach the 1% threshold, which
introduces additional bitmaps and thus block groups that we'll iterate
over at each metadata block request.
2011-11-30 18:46:06 +01:00
Li Zefan b772a86ea6 Btrfs: fix oops when calling statfs on readonly device
To reproduce this bug:

  # dd if=/dev/zero of=img bs=1M count=256
  # mkfs.btrfs img
  # losetup -r /dev/loop1 img
  # mount /dev/loop1 /mnt
  OOPS!!

It triggered BUG_ON(!nr_devices) in btrfs_calc_avail_data_space().

To fix this, instead of checking write-only devices, we check all open
deivces:

  # df -h /dev/loop1
  Filesystem            Size  Used Avail Use% Mounted on
  /dev/loop1            250M   28K  238M   1% /mnt

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-11-30 18:46:05 +01:00
Mike Fleetwood ece7d20e8b Btrfs: Don't error on resizing FS to same size
It seems overly harsh to fail a resize of a btrfs file system to the
same size when a shrink or grow would succeed.  User app GParted trips
over this error.  Allow it by bypassing the shrink or grow operation.

Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com>
2011-11-30 18:46:04 +01:00
Miao Xie aa38a711a8 Btrfs: fix deadlock on metadata reservation when evicting a inode
When I ran the xfstests, I found the test tasks was blocked on meta-data
reservation.

By debugging, I found the reason of this bug:
   start transaction
        |
	v
   reserve meta-data space
	|
	v
   flush delay allocation -> iput inode -> evict inode
	^					|
	|					v
   wait for delay allocation flush <- reserve meta-data space

And besides that, the flush on evicting inode will block the thread, which
is reclaiming the memory, and make oom happen easily.

Fix this bug by skipping the flush step when evicting inode.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-11-30 18:46:03 +01:00
Dan Carpenter 26bdef541d btrfs scrub: handle -ENOMEM from init_ipath()
init_ipath() can return an ERR_PTR(-ENOMEM).

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2011-11-30 18:46:01 +01:00
Rafael J. Wysocki 986b11c3ee Merge branch 'pm-freezer' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into pm-freezer
* 'pm-freezer' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc: (24 commits)
  freezer: fix wait_event_freezable/__thaw_task races
  freezer: kill unused set_freezable_with_signal()
  dmatest: don't use set_freezable_with_signal()
  usb_storage: don't use set_freezable_with_signal()
  freezer: remove unused @sig_only from freeze_task()
  freezer: use lock_task_sighand() in fake_signal_wake_up()
  freezer: restructure __refrigerator()
  freezer: fix set_freezable[_with_signal]() race
  freezer: remove should_send_signal() and update frozen()
  freezer: remove now unused TIF_FREEZE
  freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE
  cgroup_freezer: prepare for removal of TIF_FREEZE
  freezer: clean up freeze_processes() failure path
  freezer: kill PF_FREEZING
  freezer: test freezable conditions while holding freezer_lock
  freezer: make freezing indicate freeze condition in effect
  freezer: use dedicated lock instead of task_lock() + memory barrier
  freezer: don't distinguish nosig tasks on thaw
  freezer: remove racy clear_freeze_flag() and set PF_NOFREEZE on dead tasks
  freezer: rename thaw_process() to __thaw_task() and simplify the implementation
  ...
2011-11-23 21:09:02 +01:00
Linus Torvalds af36d15f58 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: remove free-space-cache.c WARN during log replay
  Btrfs: sectorsize align offsets in fiemap
  Btrfs: clear pages dirty for io and set them extent mapped
  Btrfs: wait on caching if we're loading the free space cache
  Btrfs: prefix resize related printks with btrfs:
  btrfs: fix stat blocks accounting
  Btrfs: avoid unnecessary bitmap search for cluster setup
  Btrfs: fix to search one more bitmap for cluster setup
  btrfs: mirror_num should be int, not u64
  btrfs: Fix up 32/64-bit compatibility for new ioctls
  Btrfs: fix barrier flushes
  Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush
2011-11-22 08:53:40 -08:00
Tejun Heo a0acae0e88 freezer: unexport refrigerator() and update try_to_freeze() slightly
There is no reason to export two functions for entering the
refrigerator.  Calling refrigerator() instead of try_to_freeze()
doesn't save anything noticeable or removes any race condition.

* Rename refrigerator() to __refrigerator() and make it return bool
  indicating whether it scheduled out for freezing.

* Update try_to_freeze() to return bool and relay the return value of
  __refrigerator() if freezing().

* Convert all refrigerator() users to try_to_freeze().

* Update documentation accordingly.

* While at it, add might_sleep() to try_to_freeze().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Christoph Hellwig <hch@infradead.org>
2011-11-21 12:32:22 -08:00
Chris Mason 24a7031396 Btrfs: remove free-space-cache.c WARN during log replay
The log replay code only partially loads block groups, since
the block group caching code is able to detect and deal with
extents the logging code has pinned down.

While the logging code is pinning down block groups, there is
a bogus WARN_ON we're hitting if the code wasn't able to find
an extent in the cache.  This commit removes the warning because
it can happen any time there isn't a valid free space cache
for that block group.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-21 14:57:33 -05:00
Josef Bacik 4d479cf010 Btrfs: sectorsize align offsets in fiemap
We've been hitting BUG()'s in btrfs_cont_expand and btrfs_fallocate and anywhere
else that calls btrfs_get_extent while running xfstests 13 in a loop.  This is
because fiemap is calling btrfs_get_extent with non-sectorsize aligned offsets,
which will end up adding mappings that are not sectorsize aligned, which will
cause problems in some cases for subsequent calls to btrfs_get_extent for
similar areas that are sectorsize aligned.  With this patch I ran xfstests 13 in
a loop for a couple of hours and didn't hit the problem that I could previously
hit in at most 20 minutes.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-11-20 07:42:17 -05:00
Josef Bacik f7d61dcd68 Btrfs: clear pages dirty for io and set them extent mapped
When doing the io_ctl helpers to clean up the free space cache stuff I stopped
using our normal prepare_pages stuff, which means I of course forgot to do
things like set the pages extent mapped, which will cause us all sorts of
wonderful propblems.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-11-20 07:42:17 -05:00
Josef Bacik 291c7d2f57 Btrfs: wait on caching if we're loading the free space cache
We've been hitting panics when running xfstest 13 in a loop for long periods of
time.  And actually this problem has always existed so we've been hitting these
things randomly for a while.  Basically what happens is we get a thread coming
into the allocator and reading the space cache off of disk and adding the
entries to the free space cache as we go.  Then we get another thread that comes
in and tries to allocate from that block group.  Since block_group->cached !=
BTRFS_CACHE_NO it goes ahead and tries to do the allocation.  We do this because
if we're doing the old slow way of caching we don't want to hold people up and
wait for everything to finish.  The problem with this is we could end up
discarding the space cache at some arbitrary point in the future, which means we
could very well end up allocating space that is either bad, or when the real
caching happens it could end up thinking the space isn't in use when it really
is and cause all sorts of other problems.

The solution is to add a new flag to indicate we are loading the free space
cache from disk, and always try to cache the block group if cache->cached !=
BTRFS_CACHE_FINISHED.  That way if we are loading the space cache anybody else
who tries to allocate from the block group will have to wait until it's finished
to make sure it completes successfully.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-11-20 07:42:16 -05:00
Arnd Hannemann 5bb1468238 Btrfs: prefix resize related printks with btrfs:
For the user it is confusing to find something like:
[10197.627710] new size for /dev/mapper/vg0-usr_share is 3221225472
in kernel log, because it doesn't point directly to btrfs.

This patch prefixes those messages with "btrfs:" like other btrfs
related printks.

Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:16 -05:00
David Sterba fadc0d8be4 btrfs: fix stat blocks accounting
Round inode bytes and delalloc bytes up to real blocksize before
converting to sector size. Otherwise eg. files smaller than 512
are reported with zero blocks due to incorrect rounding.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:15 -05:00
Li Zefan 52621cb6ed Btrfs: avoid unnecessary bitmap search for cluster setup
setup_cluster_no_bitmap() searches all the extents and bitmaps starting
from offset. Therefore if it returns -ENOSPC, all the bitmaps starting
from offset are in the bitmaps list, so it's sufficient to search from
this list in setup_cluser_bitmap().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:15 -05:00
Li Zefan 0f0fbf1d0e Btrfs: fix to search one more bitmap for cluster setup
Suppose there are two bitmaps [0, 256], [256, 512] and one extent
[100, 120] in the free space cache, and we want to setup a cluster
with offset=100, bytes=50.

In this case, there will be only one bitmap [256, 512] in the temporary
bitmaps list, and then setup_cluster_bitmap() won't search bitmap [0, 256].

The cause is, the list is constructed in setup_cluster_no_bitmap(),
and only bitmaps with bitmap_entry->offset >= offset will be added
into the list, and the very bitmap that convers offset has
bitmap_entry->offset <= offset.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:14 -05:00
Jan Schmidt 32240a913d btrfs: mirror_num should be int, not u64
My previous patch introduced some u64 for failed_mirror variables, this one
makes it consistent again.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:14 -05:00
Jeff Mahoney 745c4d8e16 btrfs: Fix up 32/64-bit compatibility for new ioctls
This patch casts to unsigned long before casting to a pointer and fixes
 the following warnings:
fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-20 07:42:13 -05:00
Chris Mason 387125fc72 Btrfs: fix barrier flushes
When btrfs is writing the super blocks, it send barrier flushes to make
sure writeback caching drives get all the metadata on disk in the
right order.

But, we have two bugs in the way these are sent down.  When doing
full commits (not via the tree log), we are sending the barrier down
before the last super when it should be going down before the first.

In multi-device setups, we should be waiting for the barriers to
complete on all devices before writing any of the supers.

Both of these bugs can cause corruptions on power failures.  We fix it
with some new code to send down empty barriers to all devices before
writing the first super.

Alexandre Oliva found the multi-device bug.  Arne Jansen did the async
barrier loop.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
2011-11-20 07:21:14 -05:00
Al Viro ea441d1104 new helper: mount_subtree()
takes vfsmount and relative path, does lookup within that vfsmount
(possibly triggering automounts) and returns the result as root
of subtree suitable for return by ->mount() (i.e. a reference to
dentry and an active reference to its superblock grabbed, superblock
locked exclusive).

btrfs and nfs switched to it instead of open-coding the sucker.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-16 22:00:34 -05:00
Al Viro c133449587 switch create_mnt_ns() to saner calling conventions, fix double mntput() in nfs
Life is much saner if create_mnt_ns(mnt) drops mnt in case of error...
Switch it to such calling conventions, switch callers, fix double mntput() in
fs/nfs/super.c one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-16 16:12:14 -05:00
Al Viro 8d514bbf37 btrfs: fix double mntput() in mount_subvol()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-16 16:06:09 -05:00
Liu Bo f1ebcc74d5 Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush
The btrfs snapshotting code requires that once a root has been
snapshotted, we don't change it during a commit.

But there are two cases to lead to tree corruptions:

1) multi-thread snapshots can commit serveral snapshots in a transaction,
   and this may change the src root when processing the following pending
   snapshots, which lead to the former snapshots corruptions;

2) the free inode cache was changing the roots when it root the cache,
   which lead to corruptions.

This fixes things by making sure we force COW the block after we create a
snapshot during commiting a transaction, then any changes to the roots
will result in COW, and we get all the fs roots and snapshot roots to be
consistent.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-15 09:53:28 -05:00
Linus Torvalds c1f4246716 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: rename the option to nospace_cache
  Btrfs: handle bio_add_page failure gracefully in scrub
  Btrfs: fix deadlock caused by the race between relocation
  Btrfs: only map pages if we know we need them when reading the space cache
  Btrfs: fix orphan backref nodes
  Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush}
  Btrfs: fix unreleased path in btrfs_orphan_cleanup()
  Btrfs: fix no reserved space for writing out inode cache
  Btrfs: fix nocow when deleting the item
  Btrfs: tweak the delayed inode reservations again
  Btrfs: rework error handling in btrfs_mount()
  Btrfs: close devices on all error paths in open_ctree()
  Btrfs: avoid null dereference and leaks when bailing from open_ctree()
  Btrfs: fix subvol_name leak on error in btrfs_mount()
  Btrfs: fix memory leak in btrfs_parse_early_options()
  Btrfs: fix our reservations for updating an inode when completing io
  Btrfs: fix oops on NULL trans handle in btrfs_truncate
  btrfs: fix double-free 'tree_root' in 'btrfs_mount()'
2011-11-11 23:47:06 -02:00
David Sterba 8965593e41 btrfs: rename the option to nospace_cache
Rename no_space_cache option to nospace_cache to be more consistent with
the rest, where the simple prefix 'no' is used to negate an option.

The option has been introduced during the -rc1 cycle and there are has not been
widely used, so it's safe.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-11 10:14:57 -05:00
Arne Jansen 69f4cb526b Btrfs: handle bio_add_page failure gracefully in scrub
Currently scrub fails with ENOMEM when bio_add_page fails. Unfortunately
dm based targets accept only one page per bio, thus making scrub always
fails. This patch just submits the current bio when an error is encountered
and starts a new one.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-11 08:17:10 -05:00
Miao Xie 62f30c5462 Btrfs: fix deadlock caused by the race between relocation
We can not do flushable reservation for the relocation when we create snapshot,
because it may make the transaction commit task and the flush task wait for
each other and the deadlock happens.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:05 -05:00
Josef Bacik 2f120c05e6 Btrfs: only map pages if we know we need them when reading the space cache
People have been running into a warning when loading space cache because the
page is already mapped when trying to read in a bitmap.  The way we read in
entries and pages is kind of convoluted, so fix it so that io_ctl_read_entry
maps the entries if it needs to, and if it hits the end of the page it simply
unmaps the page.  That way we can unconditionally unmap the io_ctl before
reading in the bitmap and we should stop hitting these warnings.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:05 -05:00
Miao Xie 76b9e23d25 Btrfs: fix orphan backref nodes
If the root node of a fs/file tree is in the block group that is
being relocated, but the others are not in the other block groups.
when we create a snapshot for this tree between the relocation tree
creation ends and ->create_reloc_tree is set to 0, Btrfs will create
some backref nodes that are the lowest nodes of the backrefs cache.
But we forget to add them into ->leaves list of the backref cache
and deal with them, and at last, they will triggered BUG_ON().

  kernel BUG at fs/btrfs/relocation.c:239!

This patch fixes it by adding them into ->leaves list of backref cache.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:05 -05:00
Miao Xie 61b520a9d0 Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush}
btrfs_block_rsv_add{, _noflush}() have similar code, so abstract that code.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:05 -05:00
Miao Xie 3254c87618 Btrfs: fix unreleased path in btrfs_orphan_cleanup()
When we did stress test for the space relocation, the deadlock happened.
By debugging, We found it was caused by the carelessness that we forgot
to unlock the read lock of the extent buffers in btrfs_orphan_cleanup()
before we end the transaction handle, so the transaction commit task waited
the task, which called btrfs_orphan_cleanup(), to unlock the extent buffer,
but that task waited the commit task to end the transaction commit, and
the deadlock happened. Fix it.

Signed-ff-by: Miao Xie <miaox@cn.fujitsu.com>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:05 -05:00
Miao Xie ba38eb4de3 Btrfs: fix no reserved space for writing out inode cache
I-node cache forgets to reserve the space when writing out it. And when
we do some stress test, such as synctest, it will trigger WARN_ON() in
use_block_rsv().

WARNING: at fs/btrfs/extent-tree.c:5718 btrfs_alloc_free_block+0xbf/0x281 [btrfs]()
...
Call Trace:
 [<ffffffff8104df86>] warn_slowpath_common+0x80/0x98
 [<ffffffff8104dfb3>] warn_slowpath_null+0x15/0x17
 [<ffffffffa0369c60>] btrfs_alloc_free_block+0xbf/0x281 [btrfs]
 [<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
 [<ffffffffa035c040>] __btrfs_cow_block+0x118/0x3b5 [btrfs]
 [<ffffffffa035c7ba>] btrfs_cow_block+0x103/0x14e [btrfs]
 [<ffffffffa035e4c4>] btrfs_search_slot+0x249/0x6a4 [btrfs]
 [<ffffffffa036d086>] btrfs_lookup_inode+0x2a/0x8a [btrfs]
 [<ffffffffa03788b7>] btrfs_update_inode+0xaa/0x141 [btrfs]
 [<ffffffffa036d7ec>] btrfs_save_ino_cache+0xea/0x202 [btrfs]
 [<ffffffffa03a761e>] ? btrfs_update_reloc_root+0x17e/0x197 [btrfs]
 [<ffffffffa0373867>] commit_fs_roots+0xaa/0x158 [btrfs]
 [<ffffffffa03746a6>] btrfs_commit_transaction+0x405/0x731 [btrfs]
 [<ffffffff810690df>] ? wake_up_bit+0x25/0x25
 [<ffffffffa039d652>] ? btrfs_log_dentry_safe+0x43/0x51 [btrfs]
 [<ffffffffa0381c5f>] btrfs_sync_file+0x16a/0x198 [btrfs]
 [<ffffffff81122806>] ? mntput+0x21/0x23
 [<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
 [<ffffffff8112d170>] vfs_fsync+0x17/0x19
 [<ffffffff8112d316>] do_fsync+0x29/0x3e
 [<ffffffff8112d348>] sys_fsync+0xb/0xf
 [<ffffffff81468352>] system_call_fastpath+0x16/0x1b

Sometimes it causes BUG_ON() in the reservation code of the delayed inode
is triggered.

So we must reserve enough space for inode cache.

Note: If we can not reserve the enough space for inode cache, we will
give up writing out it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:04 -05:00
Miao Xie 924cd8fbe4 Btrfs: fix nocow when deleting the item
btrfs_previous_item() just search the b+ tree, do not COW the nodes or leaves,
if we modify the result of it, the meta-data will be broken. fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:04 -05:00
Chris Mason f7d572188b Merge branch 'mount-fixes' of git://github.com/idryomov/btrfs-unstable into integration 2011-11-10 20:42:53 -05:00
Chris Mason 2115133f8b Btrfs: tweak the delayed inode reservations again
Josef sent along an incremental to the inode reservation
code to make sure we try and fall back to directly updating
the inode item if things go horribly wrong.

This reworks that patch slightly, adding a fallback function
that will always try to update the inode item directly without
going through the delayed_inode code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:39:08 -05:00
Ilya Dryomov 04d21a244f Btrfs: rework error handling in btrfs_mount()
Commits 6c41761f and 45ea6095 introduced the possibility of NULL pointer
dereference on error paths, also we would leave all devices busy and
leak fs_info with all sub-structures on error when trying to mount an
already mounted fs to a different directory.

Fix this by doing all allocations before trying to open any of the
devices, adjust error path for mount-already-mounted-fs case.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-11-09 22:53:39 +02:00
Ilya Dryomov 586e46e281 Btrfs: close devices on all error paths in open_ctree()
Fix a bug introduced by 7e662854 where we would leave devices busy on
certain error paths in open_ctree().  fs_info is guaranteed to be
non-NULL now so it's safe to dereference it on all error paths.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-11-09 22:53:38 +02:00
Ilya Dryomov 4d34b27895 Btrfs: avoid null dereference and leaks when bailing from open_ctree()
Fix bugs introduced by 6c41761f.  Firstly, after failing to allocate any
of the tree roots (first 'goto fail' in open_ctree()) we would
dereference a NULL fs_info pointer in free_fs_info().  Secondly, after
failures from init_srcu_struct(), setup_bdi() and new_inode() we would
leak all earlier allocated roots: fs_info fields haven't been
initialized yet so free_fs_info() is rendered useless.

Fix this by initializing fs_info pointer and fs_info fields before any
allocations happen.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-11-09 22:53:38 +02:00
Ilya Dryomov f23c8af8ca Btrfs: fix subvol_name leak on error in btrfs_mount()
btrfs_parse_early_options() can fail due to error while scanning devices
(-o device= option), but still strdup() subvol_name string:

mount -o subvol=SUBV,device=BAD_DEVICE <dev> <mnt>

So free subvol_name string on error.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-11-09 22:53:38 +02:00
Ilya Dryomov a90e8b6fb8 Btrfs: fix memory leak in btrfs_parse_early_options()
Don't leak subvol_name string in case multiple subvol= options are
given.  "The lastest option is effective" behavior (consistent with
subvolid= and subvolrootid= options) is preserved.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-11-09 22:53:38 +02:00
Josef Bacik 7fd2ae21a4 Btrfs: fix our reservations for updating an inode when completing io
People have been reporting ENOSPC crashes in finish_ordered_io.  This is because
we try to steal from the delalloc block rsv to satisfy a reservation to update
the inode.  The problem with this is we don't explicitly save space for updating
the inode when doing delalloc.  This is kind of a problem and we've gotten away
with this because way back when we just stole from the delalloc reserve without
any questions, and this worked out fine because generally speaking the leaf had
been modified either by the mtime update when we did the original write or
because we just updated the leaf when we inserted the file extent item, only on
rare occasions had the leaf not actually been modified, and that was still ok
because we'd just use a block or two out of the over-reservation that is
delalloc.

Then came the delayed inode stuff.  This is amazing, except it wants a full
reservation for updating the inode since it may do it at some point down the
road after we've written the blocks and we have to recow everything again.  This
worked out because the delayed inode stuff just stole from the global reserve,
that is until recently when I changed that because it caused other problems.

So here we are, we're doing everything right and being screwed for it.  So take
an extra reservation for the inode at delalloc reservation time and carry it
through the life of the delalloc reservation.  If we need it we can steal it in
the delayed inode stuff.  If we have already stolen it try and do a normal
metadata reservation.  If that fails try to steal from the delalloc reservation.
If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to
solve this and in the meantime we'll steal from the global reserve.

With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see
any problems.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-08 15:47:34 -05:00
Chris Mason 917c16b2b6 Btrfs: fix oops on NULL trans handle in btrfs_truncate
If we fail to reserve space in the transaction during truncate, we can
error out with a NULL trans handle.  The cleanup code needs an extra
check to make sure we aren't trying to use the bad handle.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-08 14:49:59 -05:00
slyich@gmail.com 45ea6095c8 btrfs: fix double-free 'tree_root' in 'btrfs_mount()'
On error path 'tree_root' is treed in 'free_fs_info()'.
No need to free it explicitely. Noticed by SLUB in debug mode:

Complete reproducer under usermode linux (discovered on real
machine):

    bdev=/dev/ubda
    btr_root=/btr
    /mkfs.btrfs $bdev
    mount $bdev $btr_root
    mkdir $btr_root/subvols/
    cd $btr_root/subvols/
    /btrfs su cr foo
    /btrfs su cr bar
    mount $bdev -osubvol=subvols/foo $btr_root/subvols/bar
    umount $btr_root/subvols/bar

which gives

device fsid 4d55aa28-45b1-474b-b4ec-da912322195e devid 1 transid 7 /dev/ubda
=============================================================================
BUG kmalloc-2048: Object already free
-----------------------------------------------------------------------------

INFO: Allocated in btrfs_mount+0x389/0x7f0 age=0 cpu=0 pid=277
INFO: Freed in btrfs_mount+0x51c/0x7f0 age=0 cpu=0 pid=277
INFO: Slab 0x0000000062886200 objects=15 used=9 fp=0x0000000070b4d2d0 flags=0x4081
INFO: Object 0x0000000070b4d2d0 @offset=21200 fp=0x0000000070b4a968
...
Call Trace:
70b31948:  [<6008c522>] print_trailer+0xe2/0x130
70b31978:  [<6008c5aa>] object_err+0x3a/0x50
70b319a8:  [<6008e242>] free_debug_processing+0x142/0x2a0
70b319e0:  [<600ebf6f>] btrfs_mount+0x55f/0x7f0
70b319f8:  [<6008e5c1>] __slab_free+0x221/0x2d0

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Arne Jansen <sensille@gmx.net>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-07 16:08:01 -05:00
Linus Torvalds 6a6662ced4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (114 commits)
  Btrfs: check for a null fs root when writing to the backup root log
  Btrfs: fix race during transaction joins
  Btrfs: fix a potential btrfs_bio leak on scrub fixups
  Btrfs: rename btrfs_bio multi -> bbio for consistency
  Btrfs: stop leaking btrfs_bios on readahead
  Btrfs: stop the readahead threads on failed mount
  Btrfs: fix extent_buffer leak in the metadata IO error handling
  Btrfs: fix the new inspection ioctls for 32 bit compat
  Btrfs: fix delayed insertion reservation
  Btrfs: ClearPageError during writepage and clean_tree_block
  Btrfs: be smarter about committing the transaction in reserve_metadata_bytes
  Btrfs: make a delayed_block_rsv for the delayed item insertion
  Btrfs: add a log of past tree roots
  btrfs: separate superblock items out of fs_info
  Btrfs: use the global reserve when truncating the free space cache inode
  Btrfs: release metadata from global reserve if we have to fallback for unlink
  Btrfs: make sure to flush queued bios if write_cache_pages waits
  Btrfs: fix extent pinning bugs in the tree log
  Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN
  Btrfs: don't wait as long for more batches during SSD log commit
  ...
2011-11-06 20:03:41 -08:00
Linus Torvalds 208bca0860 Merge branch 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
* 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
  writeback: Add a 'reason' to wb_writeback_work
  writeback: send work item to queue_io, move_expired_inodes
  writeback: trace event balance_dirty_pages
  writeback: trace event bdi_dirty_ratelimit
  writeback: fix ppc compile warnings on do_div(long long, unsigned long)
  writeback: per-bdi background threshold
  writeback: dirty position control - bdi reserve area
  writeback: control dirty pause time
  writeback: limit max dirty pause time
  writeback: IO-less balance_dirty_pages()
  writeback: per task dirty rate limit
  writeback: stabilize bdi->dirty_ratelimit
  writeback: dirty rate control
  writeback: add bg_threshold parameter to __bdi_update_bandwidth()
  writeback: dirty position control
  writeback: account per-bdi accumulated dirtied pages
2011-11-06 19:02:23 -08:00
Chris Mason 7c7e82a77f Btrfs: check for a null fs root when writing to the backup root log
During log replay, can commit the transaction before the fs_root
pointers are setup, so we have to make sure they are not null before
trying to use them.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 18:50:56 -05:00
Chris Mason d43317dcd0 Btrfs: fix race during transaction joins
While we're allocating ram for a new transaction, we drop our spinlock.
When we get the lock back, we do check to see if a transaction started
while we slept, but we don't check to make sure it isn't blocked
because a commit has already started.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:26:19 -05:00
Ilya Dryomov 56d2a48f81 Btrfs: fix a potential btrfs_bio leak on scrub fixups
In case we were able to map less than we wanted (length < PAGE_SIZE
clause is true) btrfs_bio is still allocated and we have to free it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:11:29 -05:00
Ilya Dryomov 21ca543efc Btrfs: rename btrfs_bio multi -> bbio for consistency
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:11:21 -05:00
Ilya Dryomov 9510dc4c62 Btrfs: stop leaking btrfs_bios on readahead
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:11:08 -05:00
Chris Mason 306c8b68c8 Btrfs: stop the readahead threads on failed mount
If we don't stop them, they linger around corrupting
memory by using pointers to freed things.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:09:41 -05:00
Chris Mason c674e04e1c Btrfs: fix extent_buffer leak in the metadata IO error handling
The scrub readahead branch brought in a new error handling hook,
but it was leaking extent_buffer references.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:09:10 -05:00
Chris Mason 740c3d226c Btrfs: fix the new inspection ioctls for 32 bit compat
The new ioctls to follow backrefs are not clean for 32/64 bit
compat.  This reworks them for u64s everywhere.  They are brand new, so
there are no problems with changing the interface now.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:08:49 -05:00
Chris Mason 806468f8bf Merge git://git.jan-o-sch.net/btrfs-unstable into integration
Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/extent_io.c
	fs/btrfs/extent_io.h
	fs/btrfs/scrub.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:07:10 -05:00
Chris Mason 531f4b1ae5 Merge branch 'for-chris' of git://github.com/sensille/linux into integration
Conflicts:
	fs/btrfs/ctree.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:05:08 -05:00
Josef Bacik c06a0e120a Btrfs: fix delayed insertion reservation
We all keep getting those stupid warnings from use_block_rsv when running
stress.sh, and it's because the delayed insertion stuff is being stupid.  It's
not the delayed insertion stuffs fault, it's all just stupid.  When marking an
inode dirty for oh say updating the time on it, we just do a
btrfs_join_transaction, which doesn't reserve any space.  This is stupid because
we're going to have to have space reserve to make this change, but we do it
because it's fast because chances are we're going to call it over and over again
and it doesn't matter.  Well thanks to the delayed insertion stuff this is
mostly the case, so we do actually need to make this reservation.  So if
trans->bytes_reserved is 0 then try to do a normal reservation.  If not return
ENOSPC which will make the btrfs_dirty_inode start a proper transaction which
will let it do the whole ENOSPC dance and reserve enough space for the delayed
insertion to steal the reservation from the transaction.

The other stupid thing we do is not reserve space for the inode when writing to
the thing.  Usually this is ok since we have to update the time so we'd have
already done all this work before we get to the endio stuff, so it doesn't
matter.  But this is stupid because we could write the data after the
transaction commits where we changed the mtime of the inode so we have to cow
all the way down to the inode anyway.  This used to be masked by the delalloc
reservation stuff, but because we delay the update it doesn't get masked in this
case.  So again the delayed insertion stuff bites us in the ass.  So if our
trans->block_rsv is delalloc, just steal the reservation from the delalloc
reserve.  Hopefully this won't bite us in the ass, but I've said that before.

With this patch stress.sh no longer spits out those stupid warnings (famous last
words).  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:04:20 -05:00
Chris Mason bf0da8c183 Btrfs: ClearPageError during writepage and clean_tree_block
Failure testing was tripping up over stale PageError bits in
metadata pages.  If we have an io error on a block, and later on
end up reusing it, nobody ever clears PageError on those pages.

During commit, we'll find PageError and think we had trouble writing
the block, which will lead to aborts and other problems.

This changes clean_tree_block and the btrfs writepage code to
clear the PageError bit.  In both cases we're either completely
done with the page or the page has good stuff and the error bit
is no longer valid.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:04:20 -05:00
Josef Bacik 663350ac38 Btrfs: be smarter about committing the transaction in reserve_metadata_bytes
Because of the overcommit stuff I had to make it so that we committed the
transaction all the time in reserve_metadata_bytes in case we had overcommitted
because of delayed items.  This was because previously we had no way of knowing
how much space was reserved for delayed items.  Now that we have the
delayed_block_rsv we can check it to see if committing the transaction would get
us anywhere.  This patch breaks out the committing logic into a helper function
that will check to see if committing the transaction would free enough space for
us to get anything done.  With this patch xfstests 83 goes from taking 445
seconds to taking 28 seconds on my box.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:04:19 -05:00
Josef Bacik 6d668dda0c Btrfs: make a delayed_block_rsv for the delayed item insertion
I've been hitting warnings in use_block_rsv when running the delayed insertion
stuff.  It's because we will readjust global block rsv based on what is in use,
which means we could end up discarding reservations that are for the delayed
insertion stuff.  So instead create a seperate block rsv for the delayed
insertion stuff.  This will also make it easier to debug problems with the
delayed insertion reservations since we will know that only the delayed
insertion code touches this block_rsv.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:04:18 -05:00
Chris Mason af31f5e5b8 Btrfs: add a log of past tree roots
This takes some of the free space in the btrfs super block
to record information about most of the roots in the last four
commits.

It also adds a -o recovery to use the root history log when
we're not able to read the tree of tree roots, the extent
tree root, the device tree root or the csum root.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:04:15 -05:00
David Sterba 6c41761fc6 btrfs: separate superblock items out of fs_info
fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)

Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-11-06 03:04:01 -05:00
Josef Bacik c8174313a8 Btrfs: use the global reserve when truncating the free space cache inode
We no longer use the orphan block rsv for holding the reservation for truncating
the inode, so instead use the global block rsv and check to make sure it has
enough space for us to truncate the space.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:03:50 -05:00
Josef Bacik 5a77d76c24 Btrfs: release metadata from global reserve if we have to fallback for unlink
I fixed a problem where we weren't reserving space for an orphan item when we
had to fallback to using the global reserve for an unlink, but I introduced
another problem.  I was migrating the bytes from the transaction reserve to the
global reserve and then releasing from the global reserve in
btrfs_end_transaction().  The problem with this is that a migrate will jack up
the size for the destination, but leave the size alone for the source, with the
idea that you can do a release normally on the source and it all washes out, and
then you can do a release again on the destination and it works out right.  My
way was skipping the release on the trans_block_rsv which still had the jacked
up size from our original reservation.  So instead release manually from the
global reserve if this transaction was using it, and then set the
trans->block_rsv back to the trans_block_rsv so that btrfs_end_transaction
cleans everything up properly.  With this patch xfstest 83 doesn't emit warnings
about leaking space.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:03:49 -05:00
Chris Mason 01d658f2ca Btrfs: make sure to flush queued bios if write_cache_pages waits
write_cache_pages tries to build up a large bio to stuff down the pipe.
But if it needs to wait for a page lock, it needs to make sure and send
down any pending writes so we don't deadlock with anyone who has the
page lock and is waiting for writeback of things inside the bio.

Dave Sterba triggered this as a deadlock between the autodefrag code and
the extent write_cache_pages

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:03:48 -05:00
Chris Mason e688b7252f Btrfs: fix extent pinning bugs in the tree log
The tree log had two important bugs that could cause corruptions after a
crash.  Sometimes we were allowing tree log blocks to be reused after
the tree log was committed but before the transaction commit was done.

This allowed a future metadata write to overwrite the tree log data.  It
is fixed by adding a new variant of freeing reserved extents that always
pins them.  Credit goes to Stefan Behrens and Arne Jansen for many many
hours spent tracking this bug down.

During tree log replay, we do a pass through the tree log and pin all
the extents we find.  This makes sure the replay code won't go in and
use any of those blocks for new allocations during replay.  The problem
is the free space cache isn't honoring these pinned extents.  So the
allocator can end up handing them out, leading to all kinds of problems
during replay.

The fix here is to force any free space cache to load while we pin the
extents, and then to make sure we remove the pinned extents from the
free space rbtree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
2011-11-06 03:03:48 -05:00
Chris Mason 1eae31e918 Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN
btrfs_remove_free_space needs to make sure to set ret back to a
valid return value after setting it to EAGAIN, otherwise we return
it to the callers.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:03:47 -05:00
Chris Mason cd354ad613 Btrfs: don't wait as long for more batches during SSD log commit
When we're doing log commits, we try to wait for more writers to come in
and make the commit bigger.  This helps improve performance on rotating
disks, but on SSDs it adds latencies.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:03:47 -05:00
Miklos Szeredi bfe8684869 filesystems: add set_nlink()
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-02 12:53:43 +01:00
Curt Wohlgemuth 0e175a1835 writeback: Add a 'reason' to wb_writeback_work
This creates a new 'reason' field in a wb_writeback_work
structure, which unambiguously identifies who initiates
writeback activity.  A 'wb_reason' enumeration has been
added to writeback.h, to enumerate the possible reasons.

The 'writeback_work_class' and tracepoint event class and
'writeback_queue_io' tracepoints are updated to include the
symbolic 'reason' in all trace events.

And the 'writeback_inodes_sbXXX' family of routines has had
a wb_stats parameter added to them, so callers can specify
why writeback is being started.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-10-31 00:33:36 +08:00
Linus Torvalds f362f98e7c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits)
  leases: fix write-open/read-lease race
  nfs: drop unnecessary locking in llseek
  ext4: replace cut'n'pasted llseek code with generic_file_llseek_size
  vfs: add generic_file_llseek_size
  vfs: do (nearly) lockless generic_file_llseek
  direct-io: merge direct_io_walker into __blockdev_direct_IO
  direct-io: inline the complete submission path
  direct-io: separate map_bh from dio
  direct-io: use a slab cache for struct dio
  direct-io: rearrange fields in dio/dio_submit to avoid holes
  direct-io: fix a wrong comment
  direct-io: separate fields only used in the submission path from struct dio
  vfs: fix spinning prevention in prune_icache_sb
  vfs: add a comment to inode_permission()
  vfs: pass all mask flags check_acl and posix_acl_permission
  vfs: add hex format for MAY_* flag values
  vfs: indicate that the permission functions take all the MAY_* flags
  compat: sync compat_stats with statfs.
  vfs: add "device" tag to /proc/self/mountstats
  cleanup: vfs: small comment fix for block_invalidatepage
  ...

Fix up trivial conflict in fs/gfs2/file.c (llseek changes)
2011-10-28 10:49:34 -07:00
Andi Kleen ef3d0fd27e vfs: do (nearly) lockless generic_file_llseek
The i_mutex lock use of generic _file_llseek hurts.  Independent processes
accessing the same file synchronize over a single lock, even though
they have no need for synchronization at all.

Under high utilization this can cause llseek to scale very poorly on larger
systems.

This patch does some rethinking of the llseek locking model:

First the 64bit f_pos is not necessarily atomic without locks
on 32bit systems. This can already cause races with read() today.
This was discussed on linux-kernel in the past and deemed acceptable.
The patch does not change that.

Let's look at the different seek variants:

SEEK_SET: Doesn't really need any locking.
If there's a race one writer wins, the other loses.

For 32bit the non atomic update races against read()
stay the same. Without a lock they can also happen
against write() now.  The read() race was deemed
acceptable in past discussions, and I think if it's
ok for read it's ok for write too.

=> Don't need a lock.

SEEK_END: This behaves like SEEK_SET plus it reads
the maximum size too. Reading the maximum size would have the
32bit atomic problem. But luckily we already have a way to read
the maximum size without locking (i_size_read), so we
can just use that instead.

Without i_mutex there is no synchronization with write() anymore,
however since the write() update is atomic on 64bit it just behaves
like another racy SEEK_SET.  On non atomic 32bit it's the same
as SEEK_SET.

=> Don't need a lock, but need to use i_size_read()

SEEK_CUR: This has a read-modify-write race window
on the same file. One could argue that any application
doing unsynchronized seeks on the same file is already broken.
But for the sake of not adding a regression here I'm
using the file->f_lock to synchronize this. Using this
lock is much better than the inode mutex because it doesn't
synchronize between processes.

=> So still need a lock, but can use a f_lock.

This patch implements this new scheme in generic_file_llseek.
I dropped generic_file_llseek_unlocked and changed all callers.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-28 14:58:58 +02:00
Linus Torvalds 36b8d186e6 Merge branch 'next' of git://selinuxproject.org/~jmorris/linux-security
* 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits)
  TOMOYO: Fix incomplete read after seek.
  Smack: allow to access /smack/access as normal user
  TOMOYO: Fix unused kernel config option.
  Smack: fix: invalid length set for the result of /smack/access
  Smack: compilation fix
  Smack: fix for /smack/access output, use string instead of byte
  Smack: domain transition protections (v3)
  Smack: Provide information for UDS getsockopt(SO_PEERCRED)
  Smack: Clean up comments
  Smack: Repair processing of fcntl
  Smack: Rule list lookup performance
  Smack: check permissions from user space (v2)
  TOMOYO: Fix quota and garbage collector.
  TOMOYO: Remove redundant tasklist_lock.
  TOMOYO: Fix domain transition failure warning.
  TOMOYO: Remove tomoyo_policy_memory_lock spinlock.
  TOMOYO: Simplify garbage collector.
  TOMOYO: Fix make namespacecheck warnings.
  target: check hex2bin result
  encrypted-keys: check hex2bin result
  ...
2011-10-25 09:45:31 +02:00
David Sterba dff51cd1c6 btrfs: ratelimit WARN_ON in use_block_rsv
The WARN_ON under some circumstances heavily polute log and slow down
the machine. This is just a safety, as the warning should be fixed by
another patch, nevertheless, it still pops up during testing.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-10-24 14:48:00 +02:00
David Sterba a81d3b1ba2 Merge branch 'hotfixes-20111024/josef/for-chris' into btrfs-next-stable 2011-10-24 14:47:58 +02:00
David Sterba afd582ac8f Merge remote-tracking branch 'remotes/josef/for-chris' into btrfs-next-stable 2011-10-24 14:47:57 +02:00
David Sterba f9d9ef62cd btrfs: do not allow mounting non-subvolumes via subvol option
There's a missing test whether the path passed to subvol=path option
during mount is a real subvolume, allowing any directory located in
default subovlume to be passed and accepted for mount.

(current btrfs progs prevent this early)
$ btrfs subvol snapshot . p1-snap
ERROR: '.' is not a subvolume

(with "is subvolume?" test bypassed)
$ btrfs subvol snapshot . p1-snap
Create a snapshot of '.' in './p1-snap'

$ btrfs subvol list -p .
ID 258 parent 5 top level 5 path subvol
ID 259 parent 5 top level 5 path subvol1
ID 260 parent 5 top level 5 path default-subvol1
ID 262 parent 5 top level 5 path p1/p1-snapshot
ID 263 parent 259 top level 5 path subvol1/subvol1-snap

The problem I see is that this makes a false impression of snapshotting the
given subvolume but in fact snapshots the default one: a user expects outcome
like ID 263 but in fact gets ID 262 .

This patch makes mount fail with EINVAL with a message in syslog.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-10-24 14:43:25 +02:00
Ilya Dryomov 20bcd64934 Btrfs: close all bdevs on mount failure
Fix a bug introduced by 20b45077.  We have to return EINVAL on mount
failure, but doing that too early in the sequence leaves all of the
devices opened exclusively.  This also fixes an issue where under some
scenarios only a second mount -o degraded <devices> command would
succeed.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-10-20 18:20:57 +02:00
Ilya Dryomov 5f524444c3 Btrfs: fix a bug when opening seed devices
Initialize fs_info->bdev_holder a bit earlier to be able to pass a
correct holder id to blkdev_get() when opening seed devices with O_EXCL.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-10-20 18:20:36 +02:00
Daniel J Blueman 068132bad1 btrfs: fix oops on failure path
If lookup_extent_backref fails, path->nodes[0] reasonably could be
null along with other callers of btrfs_print_leaf, so ensure we have a
valid extent buffer before dereferencing.

Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
2011-10-20 18:10:50 +02:00
Miao Xie 60d2adbb1e Btrfs: fix race between multi-task space allocation and caching space
The task may fail to get free space though it is enough when multi-task
space allocation and caching space happen at the same time.

	Task1			Caching Thread		Task2
	------------------------------------------------------------------------
	find_free_extent
	  The space has not
	  be cached, and start
	  caching thread. And
	  wait for it.
				cache space, if
				the space is > 2MB
				wake up Task1
							find_free_extent
							  get all the space that
							  is cached.
	  try to allocate space,
	  but there is no space
	  now.
	trigger BUG_ON()

The message is following:
btrfs allocation failed flags 1, wanted 4096
space_info has 1040187392 free, is not full
space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:835!
 [<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
 [<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
 [<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
 [<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
 [<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
 [<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
 [<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
 [<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
 [<ffffffff811380fa>] ? fsnotify+0x236/0x266
 [<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
 [<ffffffff810cc215>] do_writepages+0x1c/0x25
 [<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
 [<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
 [<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
 [<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
 [<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
 [<ffffffff8112d170>] vfs_fsync+0x17/0x19
 [<ffffffff8112d316>] do_fsync+0x29/0x3e
 [<ffffffff8112d348>] sys_fsync+0xb/0xf
 [<ffffffff81468352>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP  [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]

We fix this bug by trying to allocate the space again if there are block groups
in caching.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2011-10-20 18:10:49 +02:00
Tsutomu Itoh cfbffc39ac Btrfs: fix return value of btrfs_get_acl()
In btrfs_get_acl(), when the second __btrfs_getxattr() call fails,
acl is not correctly set.
Therefore, a wrong value might return to the caller.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2011-10-20 18:10:47 +02:00
Ilya Dryomov 10b2f34d6e Btrfs: pass the correct root to lookup_free_space_inode()
Free space items are located in tree of tree roots, not in the extent
tree.  It didn't pop up because lookup_free_space_inode() grabs the
inode all the time instead of actually searching the tree.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-10-20 18:10:46 +02:00
Liu Bo fee187d9d9 Btrfs: do not set EXTENT_DIRTY along with EXTENT_DELALLOC
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2011-10-20 18:10:45 +02:00
Li Zefan f0dd9592a1 Btrfs: fix direct-io vs nodatacow
To reproduce the bug:

  # mount -o nodatacow /dev/sda7 /mnt/
  # dd if=/dev/zero of=/mnt/tmp bs=4K count=1
  1+0 records in
  1+0 records out
  4096 bytes (4.1 kB) copied, 0.000136115 s, 30.1 MB/s
  # dd if=/dev/zero of=/mnt/tmp bs=4K count=1 conv=notrunc oflag=direct
  dd: writing `/mnt/tmp': Input/output error
  1+0 records in
  0+0 records out

btrfs_ordered_update_i_size() may return 1, but btrfs_endio_direct_write()
mistakenly takes it as an error.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:44 +02:00
Li Zefan 560f7d7545 Btrfs: remove BUG_ON() in compress_file_range()
It's not a big deal if we fail to allocate the array, and instead of
panic we can just give up compressing.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:43 +02:00
Li Zefan a05a9bb18a Btrfs: fix array bound checking
Otherwise we can execced the array bound of path->slots[].

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:41 +02:00
Lukas Czerner f4c697e640 btrfs: return EINVAL if start > total_bytes in fitrim ioctl
We should retirn EINVAL if the start is beyond the end of the file
system in the btrfs_ioctl_fitrim(). Fix that by adding the appropriate
check for it.

Also in the btrfs_trim_fs() it is possible that len+start might overflow
if big values are passed. Fix it by decrementing the len so that start+len
is equal to the file system size in the worst case.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2011-10-20 18:10:40 +02:00
Li Zefan 008873eafb Btrfs: honor extent thresh during defragmentation
We won't defrag an extent, if it's bigger than the threshold we
specified and there's no small extent before it, but actually
the code doesn't work this way.

There are three bugs:

- When should_defrag_range() decides we should keep on defragmenting
  an extent, last_len is not incremented. (old bug)

- The length that passes to should_defrag_range() is not the length
  we're going to defrag. (new bug)

- We always defrag 256K bytes data, and a big extent can be part of
  this range. (new bug)

For a file with 4 extents:

        | 4K | 4K | 256K | 256K |

The result of defrag with (the default) 256K extent thresh should be:

        | 264K | 256K |

but with those bugs, we'll get:

        | 520K |

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:39 +02:00
Jeff Liu 83c8c9bde0 btrfs: trivial fix, a potential memory leak in btrfs_parse_early_options()
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
2011-10-20 18:10:38 +02:00
Li Zefan 5ca496604b Btrfs: fix wrong max_to_defrag in btrfs_defrag_file()
It's off-by-one, and thus we may skip the last page while defragmenting.

An example case:

  # create /mnt/file with 2 4K file extents
  # btrfs fi defrag /mnt/file
  # sync
  # filefrag /mnt/file
  /mnt/file: 2 extents found

So it's not defragmented.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:37 +02:00
Li Zefan 151a31b25e Btrfs: use i_size_read() in btrfs_defrag_file()
Don't use inode->i_size directly, since we're not holding i_mutex.

This also fixes another bug, that i_size can change after it's checked
against 0 and then (i_size - 1) can be negative.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:35 +02:00
Li Zefan cbcc83265d Btrfs: fix defragmentation regression
There's an off-by-one bug:

  # create a file with lots of 4K file extents
  # btrfs fi defrag /mnt/file
  # sync
  # filefrag -v /mnt/file
  Filesystem type is: 9123683e
  File size of /mnt/file is 1228800 (300 blocks, blocksize 4096)
   ext logical physical expected length flags
     0       0     3372              64
     1      64     3136     3435      1
     2      65     3436     3136     64
     3     129     3201     3499      1
     4     130     3500     3201     64
     5     194     3266     3563      1
     6     195     3564     3266     64
     7     259     3331     3627      1
     8     260     3628     3331     40 eof

After this patch:

  ...
  # filefrag -v /mnt/file
  Filesystem type is: 9123683e
  File size of /mnt/file is 1228800 (300 blocks, blocksize 4096)
   ext logical physical expected length flags
     0       0     3372             300 eof
  /mnt/file: 1 extent found

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-10-20 18:10:34 +02:00
Diego Calleja 60ccf82f5b btrfs: fix memory leak in btrfs_defrag_file
kmemleak found this:
unreferenced object 0xffff8801b64af968 (size 512):
  comm "btrfs-cleaner", pid 3317, jiffies 4306810886 (age 903.272s)
  hex dump (first 32 bytes):
    00 82 01 07 00 ea ff ff c0 83 01 07 00 ea ff ff  ................
    80 82 01 07 00 ea ff ff c0 87 01 07 00 ea ff ff  ................
  backtrace:
    [<ffffffff816875cc>] kmemleak_alloc+0x5c/0xc0
    [<ffffffff8114aec3>] kmem_cache_alloc_trace+0x163/0x240
    [<ffffffff8127a290>] btrfs_defrag_file+0xf0/0xb20
    [<ffffffff8125d9a5>] btrfs_run_defrag_inodes+0x165/0x210
    [<ffffffff812479d7>] cleaner_kthread+0x177/0x190
    [<ffffffff81075c7d>] kthread+0x8d/0xa0
    [<ffffffff816af5f4>] kernel_thread_helper+0x4/0x10
    [<ffffffffffffffff>] 0xffffffffffffffff

"pages" is not always freed. Fix it removing the unnecesary additional return.

Signed-off-by: Diego Calleja <diegocg@gmail.com>
2011-10-20 18:10:33 +02:00
Yan, Zheng 84850e8d8a btrfs: check file extent backref offset underflow
Offset field in data extent backref can underflow if clone range ioctl
is used. We can reliably detect the underflow because max file size is
limited to 2^63 and max data extent size is limited by block group size.

Signed-off-by: Zheng Yan  <zheng.z.yan@intel.com>
2011-10-20 18:10:31 +02:00
Josef Bacik 016fc6a63e Btrfs: don't flush the cache inode before writing it
I noticed we had a little bit of latency when writing out the space cache
inodes.  It's because we flush it before we write anything in case we have dirty
pages already there.  This doesn't matter though since we're just going to
overwrite the space, and there really shouldn't be any dirty pages anyway.  This
makes some of my tests run a little bit faster.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:13:01 -04:00
Josef Bacik 7e355b83ef Btrfs: if we have a lot of pinned space, commit the transaction
Mitch kept hitting a panic because he was getting ENOSPC.  One of my previous
patches makes it so we are much better at not allocating new metadata chunks.
Unfortunately coupled with the overcommit patch this works us into a bit of a
problem if we are removing a bunch of space and end up chewing up all of our
space with pinned extents.  We can allocate chunks fine and overflow is ok, but
the only way to reclaim this space is to commit the transaction.  So if we go to
overcommit, first check and see how much pinned space we have.  If we have more
than 80% of the free space chewed up with pinned extents, just commit the
transaction, this will free up enough space for our reservation and we won't
have this problem anymore.  With this patch Mitch's test doesn't blow up
anymore.  Thanks,

Reported-and-tested-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:13:00 -04:00
Josef Bacik 36ba022ac0 Btrfs: seperate out btrfs_block_rsv_check out into 2 different functions
Currently btrfs_block_rsv_check does 2 things, it will either refill a block
reserve like in the truncate or refill case, or it will check to see if there is
enough space in the global reserve and possibly refill it.  However because of
overcommit we could be well overcommitting ourselves just to try and refill the
global reserve, when really we should just be committing the transaction.  So
breack this out into btrfs_block_rsv_refill and btrfs_block_rsv_check.  Refill
will try to reserve more metadata if it can and btrfs_block_rsv_check will not,
it will only tell you if the factor of the total space is still reserved.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:59 -04:00
Josef Bacik 3880a1b46d Btrfs: reserve some space for an orphan item when unlinking
In __unlink_start_trans() if we don't have enough room for a reservation we will
check to see if the unlink will free up space.  If it does that's great, but we
will still could add an orphan item, so we need to reserve enough space to add
the orphan item.  Do this and migrate the space the global reserve so it all
works out right.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:59 -04:00
Josef Bacik b24e03db0d Btrfs: release trans metadata bytes before flushing delayed refs
We started setting trans->block_rsv = NULL to allow the delayed refs flushing
stuff to use the right block_rsv and then just made
btrfs_trans_release_metadata() unconditionally use the trans block rsv.  The
problem with this is we need to reserve some space in the transaction and then
migrate it to the global block rsv, so we need to be able to free that out
properly.  So instead just move btrfs_trans_release_metadata() before the
delayed ref flushing and use trans->block_rsv for the freeing.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:58 -04:00
Josef Bacik 877da17430 Btrfs: allow shrink_delalloc flush the needed reclaimed pages
Currently we only allow a maximum of 2 megabytes of pages to be flushed at a
time.  This was ok before, but now we have overcommit which will screw us in a
heartbeat if we are quickly filling the disk.  So instead pick either 2
megabytes or the number of pages we need to reclaim to be safe again, which ever
is larger.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:58 -04:00
Josef Bacik f104d04437 Btrfs: wait for ordered extents if we're in trouble when shrinking delalloc
The only way we actually reclaim delalloc space is waiting for the IO to
completely finish.  Usually we kick off a bunch of IO and wait for a little bit
and hope we can make our reservation, and usually this works out pretty well.
With overcommit however we can get seriously underwater if we're filling up the
disk quickly, so we need to be able to force the delalloc shrinker to wait for
the ordered IO to finish to give us a better chance of actually reclaiming
enough space to get our reservation.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:57 -04:00
Josef Bacik bbb495c2ed Btrfs: don't check bytes_pinned to determine if we should commit the transaction
Before the only reason to commit the transaction to recover space in
reserve_metadata_bytes() was if there were enough pinned_bytes to satisfy our
reservation.  But now we have the delayed inode stuff which will hold it's
reservations until we commit the transaction.  So say we max out our reservation
by creating a bunch of files but don't have any pinned bytes we will ENOSPC out
early even though we could commit the transaction and get that space back.  So
now just unconditionally commit the transaction since currently there is no way
to know how much metadata space is being reserved by delayed inode stuff.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:56 -04:00
Josef Bacik ed3ee9f44b Btrfs: fix regression in re-setting a large xattr
Recently I changed the xattr stuff to unconditionally set the xattr first in
case the xattr didn't exist yet.  This has introduced a regression when setting
an xattr that already exists with a large value.  If we find the key we are
looking for split_leaf will assume that we're extending that item.  The problem
is the size we pass down to btrfs_search_slot includes the size of the item
already, so if we have the largest xattr we can possibly have plus the size of
the xattr item plus the xattr item that btrfs_search_slot we'd overflow the
leaf.  Thankfully this is not what we're doing, but split_leaf doesn't know this
so it just returns EOVERFLOW.  So in the xattr code we need to check and see if
we got back EOVERFLOW and treat it like EEXIST since that's really what
happened.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:56 -04:00
Josef Bacik e70bea5fe0 Btrfs: fix the amount of space reserved for unlink
Our unlink reservations were a bit much, we were reserving 10 and I only count 8
possible items we're touching, so comment what we're reserving for and fix the
count value.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:55 -04:00
Josef Bacik 4b91c14f91 Btrfs: wait for ordered extents if we didn't reclaim enough
I noticed recently that my overcommit patch was causing one of my enospc tests
to fail 25% of the time with early ENOSPC.  This is because my overcommit patch
was letting us go way over board, but it wasn't waiting long enough to let the
delalloc shrinker do it's job.  The problem is we just start writeback and wait
a little bit hoping we flush enough, but we only free up delalloc space by
having the writes complete all the way.  We do this by waiting for ordered
extents, which we do but only if we already free'd enough for the reservation,
which isn't right, we should flush ordered extents if we didn't reclaim enough
in case that will push us over the edge.  With this patch I've not seen a
failure in this enospc test after running it in a loop for an hour.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:55 -04:00
Josef Bacik 5b0e95bf60 Btrfs: inline checksums into the disk free space cache
Yeah yeah I know this is how we used to do it and then I changed it, but damnit
I'm changing it back.  The fact is that writing out checksums will modify
metadata, which could cause us to dirty a block group we've already written out,
so we have to truncate it and all of it's checksums and re-write it which will
write new checksums which could dirty a blockg roup that has already been
written and you see where I'm going with this?  This can cause unmount or really
anything that depends on a transaction to commit to take it's sweet damned time
to happen.  So go back to the way it was, only this time we're specifically
setting NODATACOW because we can't go through the COW pathway anyway and we're
doing our own built-in cow'ing by truncating the free space cache.  The other
new thing is once we truncate the old cache and preallocate the new space, we
don't need to do that song and dance at all for the rest of the transaction, we
can just overwrite the existing space with the new cache if the block group
changes for whatever reason, and the NODATACOW will let us do this fine.  So
keep track of which transaction we last cleared our cache in and if we cleared
it in this transaction just say we're all setup and carry on.  This survives
xfstests and stress.sh.

The inode cache will continue to use the normal csum infrastructure since it
only gets written once and there will be no more modifications to the fs tree in
a transaction commit.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:54 -04:00
Josef Bacik 9a82ca659d Btrfs: take overflow into account in reserving space
My overcommit stuff can be a little racy when we're filling up the disk with
fs_mark and we overcommit into things that quickly get used up for data.  So use
num_bytes to see if we have enough available space so we're less likely to
overcommit ourselves out of the ability to make reservations.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:53 -04:00
Josef Bacik 549b4fdb8f Btrfs: check the return value of filemap_write_and_wait in the space cache
We need to check the return value of filemap_write_and_wait in the space cache
writeout code.  Also don't set the inode's generation until we're sure nothing
else is going to fail.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:53 -04:00
Josef Bacik a67509c300 Btrfs: add a io_ctl struct and helpers for dealing with the space cache
In writing and reading the space cache we have one big loop that keeps track of
which page we are on and then a bunch of sizeable loops underneath this big loop
to try and read/write out properly.  Especially in the write case this makes
things hugely complicated and hard to follow, and makes our error checking and
recovery equally as complex.  So add a io_ctl struct with a bunch of helpers to
keep track of the pages we have, where we are, if we have enough space etc.
This unifies how we deal with the pages we're writing and keeps all the messy
tracking internal.  This allows us to kill the big loops in both the read and
write case and makes reviewing and chaning the write and read paths much
simpler.  I've run xfstests and stress.sh on this code and it survives.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:52 -04:00
Josef Bacik f75b130e9b Btrfs: don't skip writing out a empty block groups cache
I noticed a slight bug where we will not bother writing out the block group
cache's space cache if it's space tree is empty.  Since it could have a cluster
or pinned extents that need to be written out this is just not a valid test.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:51 -04:00
Josef Bacik 73bc187680 Btrfs: introduce mount option no_space_cache
Some users have requested this and I've found I needed a way to disable cache
loading without actually clearing the cache, so introduce the no_space_cache
option.  Before we check the super blocks cache generation field and if it was
populated we always turned space caching on.  Now we check this and set the
space cache option on, and then parse the mount options so that if we want it
off it get's turned off.  Then we check the mount option all the places we do
the caching work instead of checking the super's cache generation.  This makes
things more consistent and lets us turn space caching off.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:51 -04:00
Josef Bacik e27425d614 Btrfs: only inherit btrfs specific flags when creating files
Xfstests 79 was failing because we were inheriting the S_APPEND flag when we
weren't supposed to.  There isn't any specific documentation on this so I'm
taking the test as the standard of how things work, and having S_APPEND set on a
directory doesn't mean that S_APPEND gets inherited by its children according to
this test.  So only inherit btrfs specific things.  This will let us set
compress/nocompress on specific directories and everything in the directories
will inherit this flag, same with nodatacow.  With this patch test 79 passes.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:50 -04:00
Josef Bacik 2bf64758fd Btrfs: allow us to overcommit our enospc reservations
One of the things that kills us is the fact that our ENOSPC reservations are
horribly over the top in most normal cases.  There isn't too much that can be
done about this because when we are completely full we really need them to work
like this so we don't under reserve.  However if there is plenty of unallocated
chunks on the disk we can use that to gauge how much we can overcommit.  So this
patch adds chunk free space accounting so we always know how much unallocated
space we have.  Then if we fail to make a reservation within our allocated
space, check to see if we can overcommit.  In the normal flushing case (like
with delalloc metadata reservations) we'll take the free space and divide it by
2 if our metadata profile is setup for DUP or any of those, and then divide it
by 8 to make sure we don't overcommit too much.  Then if we're in a non-flushing
case (we really need this reservation now!) we only limit ourselves to half of
the free space.  This makes this fio test

[torrent]
filename=torrent-test
rw=randwrite
size=4g
ioengine=sync
directory=/mnt/btrfs-test

go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB
file system.  This doesn't seem to break my other enospc tests, but could really
use some more testing as this is a super scary change.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:50 -04:00
Josef Bacik 8f6d7f4f45 Btrfs: break out of orphan cleanup if we can't make progress
I noticed while running xfstests 83 that if we didn't have enough space to
delete our inode the orphan cleanup would just loop.  This is because it keeps
finding the same orphan item and keeps trying to kill it but can't because we
don't get an error back from iput for deleting the inode.  So keep track of the
last guy we tried to kill, if it's the same as the one we're trying to kill
currently we know we are having problems and can just error out.  I don't have a
way to test this so look hard and make sure it's right.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:49 -04:00
Josef Bacik 726c35fa0e Btrfs: use the global reserve as a backup for deleting inodes
Xfstests 83 really stresses our ENOSPC since it uses a 100mb fs which ends up
with the mixed block group stuff.  Because of this we can run into a situation
where we don't have enough space to delete inodes, or even worse we can't free
the inodes when we next mount the fs which causes the orphan code to lose its
mind.  So if we fail to make our reservation, steal from the global reserve.
The global reserve will end up taking up the entire rest of the free space on
the fs in this worst case so there really is no other option.  With this patch
test 83 doesn't freak out.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:48 -04:00
Josef Bacik 1728366efa Btrfs: stop using write_one_page
While looking for a performance regression a user was complaining about, I
noticed that we had a regression with the varmail test of filebench.  This was
introduced by

0d10ee2e6d

which keeps us from calling writepages in writepage.  This is a correct change,
however it happens to help the varmail test because we write out in larger
chunks.  This is largly to do with how we write out dirty pages for each
transaction.  If you run filebench with

load varmail
set $dir=/mnt/btrfs-test
run 60

prior to this patch you would get ~1420 ops/second, but with the patch you get
~1200 ops/second.  This is a 16% decrease.  So since we know the range of dirty
pages we want to write out, don't write out in one page chunks, write out in
ranges.  So to do this we call filemap_fdatawrite_range() on the range of bytes.
Then we convert the DIRTY extents to NEED_WAIT extents.  When we then call
btrfs_wait_marked_extents() we only have to filemap_fdatawait_range() on that
range and clear the NEED_WAIT extents.  This doesn't get us back to our original
speeds, but I've been seeing ~1380 ops/second, which is a <5% regression as
opposed to a >15% regression.  That is acceptable given that the original commit
greatly reduces our latency to begin with.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:48 -04:00
Josef Bacik 462d6fac89 Btrfs: introduce convert_extent_bit
If I have a range where I know a certain bit is and I want to set it to another
bit the only option I have is to call set and then clear bit, which will result
in 2 tree searches.  This is inefficient, so introduce convert_extent_bit which
will go through and set the bit I want and clear the old bit I don't want.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:47 -04:00
Josef Bacik ef3be45722 Btrfs: check unused against how much space we actually want
There is a bug that may lead to early ENOSPC in our reservation code.  We've
been checking against num_bytes which may be above and beyond what we want to
actually reserve, which could give us a false ENOSPC.  Fix this by making sure
the unused space is above how much we want to reserve and not how much we're
trying to flush.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:47 -04:00
Josef Bacik a8c9e57697 Btrfs: fix orphan cleanup regression
In fixing how we deal with bad inodes, we had a regression in the orphan cleanup
code, since it expects to get a bad inode back.  So fix it to deal with getting
-ESTALE back by deleting the orphan item manually and moving on.  Thanks,

Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:46 -04:00
Josef Bacik 3b16a4e3c3 Btrfs: use the inode's mapping mask for allocating pages
Johannes pointed out we were allocating only kernel pages for doing writes,
which is kind of a big deal if you are on 32bit and have more than a gig of ram.
So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
don't re-enter.  Thanks,

Reported-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:45 -04:00
Josef Bacik 455757c322 Btrfs: delay iput when deleting a block group
I kept getting warnings from evict because we were calling
btrfs_start_transaction() with a transaction already started when doing a
balance.  This is because we remove a block group which requires a transaction,
and the put the last reference on the cache inode.  Instead of doing this we
need to delay the iput so it is done not within a transaction having started.
This gets rid of our warnings.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:45 -04:00
Josef Bacik 9c8d86db9a Btrfs: make sure to unset trans->block_rsv before running delayed refs
Checksums are charged in 2 different ways.  The first case is when we're writing
to the disk, we account for the new checksums with the delalloc block rsv.  In
order for this to work we check if we're allocating a block for the csum root
and if trans->block_rsv == the delalloc block rsv.  But when we're deleting the
csums because of cow, this is charged to the global block rsv, and is done when
we run the delayed refs.  So we need to make sure that trans->block_rsv == NULL
when running the delayed refs.  So set it to NULL and reset it in
should_end_transaction, and set it to NULL in commit_transaction.  This got rid
of the ridiculous amount of warnings I was seeing when trying to do a balance.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:44 -04:00
Josef Bacik 4a92b1b8d2 Btrfs: stop passing a trans handle all around the reservation code
The only thing that we need to have a trans handle for is in
reserve_metadata_bytes and thats to know how much flushing we can do.  So
instead of passing it around, just check current->journal_info for a
trans_handle so we know if we can commit a transaction to try and free up space
or not.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:44 -04:00
Josef Bacik d02c9955de Btrfs: don't get the block_rsv in btrfs_free_tree_block
Since the durable block rsv stuff has been killed there is no need to get the
block_rsv in btrfs_free_tree_block anymore.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:43 -04:00
Josef Bacik 4c13d758b7 Btrfs: use the transactions block_rsv for the csum root
The alloc warnings everybody has been seeing is because we have been reserving
space for csums, but we weren't actually using that space.  So make
get_block_rsv() return the trans->block_rsv if we're modifying the csum root.
Also set the trans->block_rsv to NULL so that if we modify the csum root when
running delayed ref's that comes out of the global reserve like it's supposed
to.  With this patch I'm not seeing those alloc warnings anymore.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:42 -04:00
Josef Bacik c09544e07f Btrfs: handle enospc accounting for free space inodes
Since free space inodes now use normal checksumming we need to make sure to
account for their metadata use.  So reserve metadata space, and then if we fail
to write out the metadata we can just release it, otherwise it will be freed up
when the io completes.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:42 -04:00
Josef Bacik 300e4f8a56 Btrfs: put the block group cache after we commit the super
In moving some enospc stuff around I noticed that when we unmount we are often
evicting the free space cache inodes before we do our last commit.  This isn't
bad, but it makes us constantly have to re-read the inodes back.  So instead
don't evict the cache until after we do our last commit, this will make things a
little less crappy and makes a future enospc change work properly.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:41 -04:00
Josef Bacik 4a33854257 Btrfs: set truncate block rsv's size
While debugging a different issue I noticed that we were always reserving space
when we tried to use our truncate block rsv's.  This is because they didn't have
a ->size value, so use_block_rsv just assumes there is nothing reserved and it
does a reserve_metadata_bytes.  This is because btrfs_check_block_rsv() doesn't
actually add to the size of the block rsv.  That seems to be the right thing to
do so set ->size to the minimum truncate size we need, since we will always only
refill to that size anyway, and this way everything works out correctly.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:40 -04:00
Josef Bacik 7f70150896 Btrfs: don't increase the block_rsv's size when emergency allocating space
If we have to emergency reserve space we need to not increase the block_rsv
size, otherwise we'll leak space.  Take for instance delalloc, say we reserve
4k, and we use that 4k, and then we have to emergency allocate another 4k, we
bump the size up to 8k, however we've only accounted for 4k in reservations in
all of our supporting logic, so we'll go to free the 4k and end up having a size
of 4k, which will cause us to later not free as much space.  I saw this doing
testing where I wasn't reserving enough space for something but was still
leaking space, very frustrating.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:40 -04:00
Josef Bacik 7ed49f187c Btrfs: fix space leak when we fail to make an allocation
When changing back to using a spin_lock to protect the extent counters I decided
that since we would only be dropping our original extent, it was ok to just drop
the extent and return.  However since somebody else could have come in and done
a reservation, we need to do the normal song and dance to clear the reservation
out properly.  So calculate how much space we need to free, and then subtract
what we just attempted to reserve.  If it's more then we know we need to drop
those bytes from the delalloc block rsv.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:39 -04:00
Josef Bacik a9b5fcddce Btrfs: fix call to btrfs_search_slot in free space cache
We are setting ins_len to 1 even tho we are just modifying an item that should
be there already.  This may cause the search stuff to split nodes on the way
down needelessly.  Set this to 0 since we aren't inserting anything.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:38 -04:00
Josef Bacik 482e6dc526 Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_check
If you run xfstest 224 it you will get lots of messages about not being able to
delete inodes and that they will be cleaned up next mount.  This is because
btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to
flush, so if there was not enough space, it simply failed.  But in truncate and
evict case we could easily flush space to try and get enough space to do our
work, so make btrfs_block_rsv_check take a flush argument to pass down to
reserve_metadata_bytes.  Now xfstests 224 runs fine without all those
complaints.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:38 -04:00
Josef Bacik 07127184ef Btrfs: reduce the amount of space needed for truncates
With btrfs_truncate_inode_items we always return if we have to go to another
leaf, which makes us do our reservation again.  This means we will only ever
modify one leaf at a time, so we only need 1 items worth of slack space.  Also,
since we are deleting we will not be creating nodes as we go down, if anything
we'll be free'ing them as we merge them together, so make a different
calculation for truncate which will only have the worst case useage of COW'ing
the entire path down to the leaf.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:37 -04:00
Josef Bacik 1b9c332b6c Btrfs: only reserve space in fallocate if we have to do a preallocate
Lukas found a problem where if he tries to fallocate over the same region twice
and the first fallocate took up all the space we would fail with ENOSPC.  This
is because we reserve the total space we want to use for fallocate, regardless
of wether or not we will have to actually preallocate.  So instead move the
check into the loop where we actually have to do the preallocate.  Thanks,

Tested-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:36 -04:00
Josef Bacik 5e962c7850 Btrfs: kill btrfs_truncate_reserve_metadata
Since we've optimized the truncate path, we no longer require this function.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:36 -04:00
Josef Bacik 907cbcebd4 Btrfs: optimize how we account for space in truncate
Currently we're starting and stopping a transaction for no real reason, so kill
that and just reserve enough space as if we can truncate all in one transaction.
Also use btrfs_block_rsv_check() for our reserve to minimize the amount of space
we may have to allocate for our slack space.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:35 -04:00
Josef Bacik 13553e5221 Btrfs: don't try to commit in btrfs_block_rsv_check
We will try and reserve metadata bytes in btrfs_block_rsv_check and if we cannot
because we have a transaction open it will return EAGAIN, so we do not need to
try and commit the transaction again.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:35 -04:00
Josef Bacik dabdb6408c Btrfs: kill unused parts of block_rsv
The priority and refill_used flags are not used anymore, and neither is the
usage counter, so just remove them from btrfs_block_rsv.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:34 -04:00
Josef Bacik 6ab60601d5 Btrfs: ratelimit the generation printk for the free space cache
A user reported getting spammed when moving to 3.0 by this message.  Since we
switched to the normal checksumming infrastructure all old free space caches
will be wrong and need to be regenerated so people are likely to see this
message a lot, so ratelimit it so it doesn't fill up their logs and freak them
out.  Thanks,

Reported-by: Andrew Lutomirski <luto@mit.edu>
Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:33 -04:00
Josef Bacik 4289a667a0 Btrfs: fix how we reserve space for deleting inodes
I converted btrfs_truncate to do sane reservations for truncate, but didn't
convert btrfs_evict_inode.  Basically we need to save the orphan_rsv for
deleting the orphan item, and do normal reservations for our truncate.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:33 -04:00
Josef Bacik 37be25bcb6 Btrfs: kill the durable block rsv stuff
This is confusing code and isn't used by anything anymore, so delete it.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:32 -04:00
Josef Bacik dba68306f3 Btrfs: kill the orphan space calculation for snapshots
This patch kills off the calculation for the amount of space needed for the
orphan operations during a snapshot.  The thing is we only do snapshots on
commit, so any space that is in the block_rsv->freed[] isn't going to be in the
new snapshot anyway, so there isn't any reason to require that space to be
reserved for the snapshot to occur.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:32 -04:00
Josef Bacik 7709cde33f Btrfs: calculate checksum space correctly
We have not been reserving enough space for checksums.  We were just reserving
bytes for the checksum items themselves, we were not taking into account having
to cow the tree and such.  This patch adds a csum_bytes counter to the inode for
keeping track of the number of bytes outstanding we have for checksums.  Then we
calculate how many leaves would be required for the checksums we are given and
use that to reserve space.  This adds a significant amount of bytes to our
reservations, but we will handle this later.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:31 -04:00
Josef Bacik 9e4871070b Btrfs: skip looking for delalloc if we don't have ->fill_delalloc
We always look for delalloc bytes in our io_tree so we can fill in delalloc.
This is fine in most cases, but if we're writing out the btree_inode this is
just a superfluous tree search on the io_tree, and if we have a lot of metadata
dirty this could be an expensive check.  So instead check to see if our io_tree
has a ->fill_delalloc op, and if not don't even bother doing the lookup.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:30 -04:00
Josef Bacik fb25e9141a Btrfs: use bytes_may_use for all ENOSPC reservations
We have been using bytes_reserved for metadata reservations, which is wrong
since we use that to keep track of outstanding reservations from the allocator.
This resulted in us doing a lot of silly things to make sure we don't allocate a
bunch of metadata chunks since we never had a real view of how much space was
actually in use by metadata.

This passes Arne's enospc test and xfstests as well as my own enospc tests.
Hopefully this will get us moving in the right direction.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:30 -04:00
Josef Bacik 830c4adbd0 Btrfs: fix how we mount subvol=<whatever>
We've only been able to mount with subvol=<whatever> where whatever was a subvol
within whatever root we had as the default.  This allows us to mount -o
subvol=path/to/subvol/you/want relative from the normal fs_tree root.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:29 -04:00
Josef Bacik ba5b8958da Btrfs: use d_obtain_alias when mounting subvol/subvolid
Currently what we do is just wrong.  We either

1) Alloc a new "root" dentry with sb->s_root as it's parent which is just wrong
as we could walk into this subvol later on via another path and hilarity could
ensue.  Also we don't check the return value of d_splice_alias which isn't good
either.

or

2) Do a d_find_alias() which we could have lost our dentry from cache at this
point and found nothing.

So use d_obtain_alias().  In the case that we already have the inode/dentry in
cache we will get the correct dentry.  If not we will get a disconnected dentry
tree so if we walk into it later on everything will be connected up properly.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:29 -04:00
Josef Bacik 0cbbdf7c9c Btrfs: kill reserved_bytes in inode
reserved_bytes is not used for anything in the inode, remove it.

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:28 -04:00
Josef Bacik f1bdcc0a82 Btrfs: move stuff around in btrfs_inode to get better packing
Moving things around to give us better packing in the btrfs_inode.  This reduces
the size of our inode by 8 bytes.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:28 -04:00
Linus Torvalds b2f9452bd5 Merge branch 'btrfs-3.0' of git://github.com/chrismason/linux
* 'btrfs-3.0' of git://github.com/chrismason/linux:
  Btrfs: make sure not to defrag extents past i_size
  Btrfs: fix recursive auto-defrag
2011-10-13 18:20:40 +12:00
Chris Mason f7f43cc841 Btrfs: make sure not to defrag extents past i_size
The btrfs file defrag code will loop through the extents and
force COW on them.  But there is a concurrent truncate in the middle of
the defrag, it might end up defragging the same range over and over
again.

The problem is that writepage won't go through and do anything on pages
past i_size, so the cow won't happen, so the file will appear to still
be fragmented.  defrag will end up hitting the same extents again and
again.

In the worst case, the truncate can actually live lock with the defrag
because the defrag keeps creating new ordered extents which the truncate
code keeps waiting on.

The fix here is to make defrag check for i_size inside the main loop,
instead of just once before the looping starts.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-10-11 11:45:55 -04:00
Li Zefan 2a0f7f5769 Btrfs: fix recursive auto-defrag
Follow those steps:

  # mount -o autodefrag /dev/sda7 /mnt
  # dd if=/dev/urandom of=/mnt/tmp bs=200K count=1
  # sync
  # dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc

and then it'll go into a loop: writeback -> defrag -> writeback ...

It's because writeback writes [8K, 200K] and then writes [0, 8K].

I tried to make writeback know if the pages are dirtied by defrag,
but the patch was a bit intrusive. Here I simply set writeback_index
when we defrag a file.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-10-10 15:43:34 -04:00
Linus Torvalds 7fd21be75d Merge branch 'btrfs-3.0' of git://github.com/chrismason/linux
* 'btrfs-3.0' of git://github.com/chrismason/linux:
  Btrfs: force a page fault if we have a shorty copy on a page boundary
2011-10-03 12:17:44 -07:00
Arne Jansen 7a26285eea btrfs: use readahead API for scrub
Scrub uses a simple tree-enumeration to bring the relevant portions
of the extent- and csum-tree into the page cache before starting the
scrub-I/O. This is now replaced by using the new readahead-API.
During readahead the scrub is being accounted as paused, so it won't
hold off transaction commits.

This change raises the average disk bandwith utilisation on my test
volume from 70% to 90%. On another volume, the time for a test run
went down from 89s to 43s.

Changes v5:
 - reada1/2 are now of type struct reada_control *

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02 08:48:45 +02:00
Arne Jansen 4bb31e928d btrfs: hooks for readahead
This adds the hooks needed for readahead. In the readpage_end_io_hook,
the extent state is checked for the EXTENT_READAHEAD flag. Only in this
case the readahead hook is called, to keep the impact on non-ra as low
as possible.
Additionally, a hook for a failed IO is added, otherwise readahead would
wait indefinitely for the extent to finish.

Changes for v2:
 - eliminate race condition

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02 08:48:44 +02:00
Arne Jansen 7414a03fbf btrfs: initial readahead code and prototypes
This is the implementation for the generic read ahead framework.

To trigger a readahead, btrfs_reada_add must be called. It will start
a read ahead for the given range [start, end) on tree root. The returned
handle can either be used to wait on the readahead to finish
(btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).

The read ahead works as follows:
On btrfs_reada_add, the root of the tree is inserted into a radix_tree.
reada_start_machine will then search for extents to prefetch and trigger
some reads. When a read finishes for a node, all contained node/leaf
pointers that lie in the given range will also be enqueued. The reads will
be triggered in sequential order, thus giving a big win over a naive
enumeration. It will also make use of multi-device layouts. Each disk
will have its on read pointer and all disks will by utilized in parallel.
Also will no two disks read both sides of a mirror simultaneously, as this
would waste seeking capacity. Instead both disks will read different parts
of the filesystem.
Any number of readaheads can be started in parallel. The read order will be
determined globally, i.e. 2 parallel readaheads will normally finish faster
than the 2 started one after another.

Changes v2:
 - protect root->node by transaction instead of node_lock
 - fix missed branches:
    The readahead had a too simple check to determine if a branch from
    a node should be checked or not. It now also records the upper bound
    of each node to see if the requested RA range lies within.
 - use KERN_CONT to debug output, to avoid line breaks
 - defer reada_start_machine to worker to avoid deadlock

Changes v3:
 - protect root->node by rcu

Changes v5:
 - changed EIO-semantics of reada_tree_block_flagged
 - remove spin_lock from reada_control and make elems an atomic_t
 - remove unused read_total from reada_control
 - kill reada_key_cmp, use btrfs_comp_cpu_keys instead
 - use kref-style release functions where possible
 - return struct reada_control * instead of void * from btrfs_reada_add

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02 08:48:44 +02:00
Arne Jansen 90519d66ab btrfs: state information for readahead
Add state information for readahead to btrfs_fs_info and btrfs_device

Changes v2:
 - don't wait in radix_trees
 - add own set of workers for readahead

Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02 08:48:30 +02:00
Arne Jansen ab0fff0305 btrfs: add READAHEAD extent buffer flag
Add a READAHEAD extent buffer flag.
Add a function to trigger a read with this flag set.

Changes v2:
 - use extent buffer flags instead of extent state flags

Changes v5:
 - adapt to changed read_extent_buffer_pages interface
 - don't return eb from reada_tree_block_flagged if it has CORRUPT flag set

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02 08:47:57 +02:00
Arne Jansen bb82ab88df btrfs: add an extra wait mode to read_extent_buffer_pages
read_extent_buffer_pages currently has two modes, either trigger a read
without waiting for anything, or wait for the I/O to finish. The former
also bails when it's unable to lock the page. This patch now adds an
additional parameter to allow it to block on page lock, but don't wait
for completion.

Changes v5:
 - merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and
   WAIT_PAGE_LOCK

Change v6:
 - fix bug introduced in v5

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02 08:47:55 +02:00
Chris Mason 286d6e70aa Merge branch 'btrfs-3.0' into for-linus 2011-09-30 15:26:09 -04:00
Josef Bacik b6316429af Btrfs: force a page fault if we have a shorty copy on a page boundary
A user reported a problem where ceph was getting into 100% cpu usage while doing
some writing.  It turns out it's because we were doing a short write on a not
uptodate page, which means we'd fall back at one page at a time and fault the
page in.  The problem is our position is on the page boundary, so our fault in
logic wasn't actually reading the page, so we'd just spin forever or until the
page got read in by somebody else.  This will force a readpage if we end up
doing a short copy.  Alexandre could reproduce this easily with ceph and reports
it fixes his problem.  I also wrote a reproducer that no longer hangs my box
with this patch.  Thanks,

Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-30 15:23:54 -04:00
Jan Schmidt 5da6fcbc4e btrfs: integrating raid-repair and scrub-fixup-nodatasum
This ties nodatasum fixup in scrub together with raid repair patches. While
both series are working fine alone, scrub will report uncorrectable errors
if they occur in a nodatasum extent *and* the page is in the page cache.

Previously, we would have triggered readpage to find good data and do the
repair. However, readpage wouldn't read anything in the case where the page
is up to date in the cache. So, we simply take that good data we have and
call repair_io_failure directly (unless the page in the cache is dirty).

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 13:38:43 +02:00
Jan Schmidt 4a54c8c165 btrfs: Moved repair code from inode.c to extent_io.c
The raid-retry code in inode.c can be generalized so that it works for
metadata as well. Thus, this patch moves it to extent_io.c and makes the
raid-retry code a raid-repair code.

Repair works that way: Whenever a read error occurs and we have more
mirrors to try, note the failed mirror, and retry another. If we find a
good one, check if we did note a failure earlier and if so, do not allow
the read to complete until after the bad sector was written with the good
data we just fetched. As we have the extent locked while reading, no one
can change the data in between.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 13:38:42 +02:00
Jan Schmidt 2774b2ca3d btrfs: Put mirror_num in bi_bdev
The error correction code wants to make sure that only the bad mirror is
rewritten. Thus, we need to know which mirror is the bad one. I did not
find a more apropriate field than bi_bdev. But I think using this is fine,
because it is modified by the block layer, anyway, and should not be read
after the bio returned.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 13:38:42 +02:00
Jan Schmidt 1503140d3e btrfs: Do not use bio->bi_bdev after submission
The block layer modifies bio->bi_bdev and bio->bi_sector while working on
the bio, they do _not_ come back unmodified in the completion callback.

To call add_page, we need at least some bi_bdev set, which is why the code
was working, previously. With this patch, we use the latest_bdev from
fsinfo instead of the leftover in the bio. This gives us the possibility to
use the bi_bdev field for another purpose.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 13:38:42 +02:00
Jan Schmidt a1d3c4786a btrfs: btrfs_multi_bio replaced with btrfs_bio
btrfs_bio is a bio abstraction able to split and not complete after the last
bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio
tracks the mirror_num used to read data which can be used for error
correction purposes.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 13:38:42 +02:00
Jan Schmidt d7728c960d btrfs: new ioctls to do logical->inode and inode->path resolving
these ioctls make use of the new functions initially added for scrub. they
return all inodes belonging to a logical address (BTRFS_IOC_LOGICAL_INO) and
all paths belonging to an inode (BTRFS_IOC_INO_PATHS).

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 12:54:28 +02:00
Jan Schmidt 0ef8e45158 btrfs scrub: add fixup code for errors on nodatasum files
This removes a FIXME comment and introduces the first part of nodatasum
fixup: It gets the corresponding inode for a logical address and triggers a
regular readpage for the corrupted sector.

Once we have on-the-fly error correction our error will be automatically
corrected. The correction code is expected to clear the newly introduced
EXTENT_DAMAGED flag, making scrub report that error as "corrected" instead
of "uncorrectable" eventually.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 12:54:28 +02:00
Jan Schmidt e12fa9cd39 btrfs scrub: use int for mirror_num, not u64
the rest of the code uses int mirror_num, and so should scrub

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 12:54:28 +02:00
Jan Schmidt 8ddc7d9cd0 btrfs: add mirror_num to extent_read_full_page
Currently, extent_read_full_page always assumes we are trying to read mirror
0, which generally is the best we can do. To add flexibility, pass it as a
parameter. This will be needed by scrub fixup code.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 12:54:28 +02:00
Jan Schmidt 193ea74b27 btrfs scrub: bugfix: mirror_num off by one
Fix the mirror_num determination in scrub_stripe. The rest of the scrub code
did not use mirror_num for anything important and that error went unnoticed.
The nodatasum fixup patch of this set depends on a correct mirror_num.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 12:54:28 +02:00
Jan Schmidt 558540c177 btrfs scrub: print paths of corrupted files
While scrubbing, we may encounter various errors. Previously, a logical
address was printed to the log only. Now, all paths belonging to that
address are resolved and printed separately. That should work for hardlinks
as well as reflinks.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 12:54:28 +02:00
Jan Schmidt 13db62b7a1 btrfs scrub: added unverified_errors
In normal operation, scrub is reading data sequentially in large portions.
In case of an i/o error, we try to find the corrupted area(s) by issuing
page sized read requests. With this commit we increment the
unverified_errors counter if all of the small size requests succeed.

Userland patches carrying such conspicous events to the administrator should
already be around.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 12:54:27 +02:00
Jan Schmidt a542ad1baf btrfs: added helper functions to iterate backrefs
These helper functions iterate back references and call a function for each
backref. There is also a function to resolve an inode to a path in the
file system.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 12:54:27 +02:00
Chris Mason 0a7a0519d1 Merge branch 'btrfs-3.0' into for-linus 2011-09-20 14:49:29 -04:00
Sage Weil b6f3409b21 Btrfs: reserve sufficient space for ioctl clone
Fix a crash/BUG_ON in the clone ioctl due to insufficient reservation. We
need to reserve space for:

 - adjusting the old extent (possibly splitting it)
 - adding the new extent
 - updating the inode

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-20 14:48:51 -04:00
Josef Bacik a66e7cc626 Btrfs: only clear the need lookup flag after the dentry is setup
We can race with readdir and the RCU path walking stuff.  This is because we
clear the need lookup flag before actually instantiating the inode.  This will
lead the RCU path walk stuff to find a dentry it thinks is valid without a
d_inode attached.  So instead unhash the dentry when we first start the lookup,
and then clear the flag after we've instantiated the dentry so we're garunteed
to either try the slow lookup, or have the d_inode set properly.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-18 10:34:03 -04:00
Jeff Liu 48802c8ae2 BTRFS: Fix lseek return value for error
The recent reworking of btrfs' lseek lead to incorrect
values being returned.  This adds checks for seeking
beyond EOF in SEEK_HOLE and makes sure the error
values come back correct.

Andi Kleen also sent in similar patches.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-18 10:34:02 -04:00
Chris Mason 2cf4ce7c2a Merge branch 'btrfs-3.0' into for-linus 2011-09-18 10:31:44 -04:00
Li Zefan dde820fbf7 Btrfs: don't change inode flag of the dest clone file
The dst file will have the same inode flags with dst file after
file clone, and I think it's unexpected.

For example, the dst file will suddenly become immutable after
getting some share of data with src file, if the src is immutable.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-18 10:20:46 -04:00
Li Zefan 0e7b824c4e Btrfs: don't make a file partly checksummed through file clone
To reproduce the bug:

  # mount /dev/sda7 /mnt
  # dd if=/dev/zero of=/mnt/src bs=4K count=1
  # umount /mnt

  # mount -o nodatasum /dev/sda7 /mnt
  # dd if=/dev/zero of=/mnt/dst bs=4K count=1
  # clone_range -s 4K -l 4K /mnt/src /mnt/dst

  # echo 3 > /proc/sys/vm/drop_caches
  # cat /mnt/dst
  # dmesg
  ...
  btrfs no csum found for inode 258 start 0
  btrfs csum failed ino 258 off 0 csum 2566472073 private 0

It's because part of the file is checksummed and the other part is not,
and then btrfs will complain checksum is not found when we read the file.

Disallow file clone if src and dst file have different checksum flag,
so we ensure a file is completely checksummed or unchecksummed.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-18 10:20:46 -04:00
Li Zefan 71ef078610 Btrfs: fix pages truncation in btrfs_ioctl_clone()
It's a bug in commit f81c9cdc56
(Btrfs: truncate pages from clone ioctl target range)

We should pass the dest range to the truncate function, but not the
src range.

Also move the function before locking extent state.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-18 10:20:46 -04:00
Hidetoshi Seto 3765fefaee btrfs: fix d_off in the first dirent
Since the d_off in the first dirent for "." (that originates from
the 4th argument "offset" of filldir() for the 2nd dirent for "..")
is wrongly assigned in btrfs_real_readdir(), telldir returns same
offset for different locations.

 | # mkfs.btrfs /dev/sdb1
 | # mount /dev/sdb1 fs0
 | # cd fs0
 | # touch file0 file1
 | # ../test
 | telldir: 0
 | readdir: d_off = 2, d_name = "."
 | telldir: 2
 | readdir: d_off = 2, d_name = ".."
 | telldir: 2
 | readdir: d_off = 3, d_name = "file0"
 | telldir: 3
 | readdir: d_off = 2147483647, d_name = "file1"
 | telldir: 2147483647

To fix this problem, pass filp->f_pos (which is loff_t) instead.

 | # ../test
 | telldir: 0
 | readdir: d_off = 1, d_name = "."
 | telldir: 1
 | readdir: d_off = 2, d_name = ".."
 | telldir: 2
 | readdir: d_off = 3, d_name = "file0"
 :

At the moment the "offset" for "." is unused because there is no
preceding dirent, however it is better to pass filp->f_pos to follow
grammatical usage.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-18 10:20:46 -04:00
Linus Torvalds 0b001b2eda Merge branch 'for-linus' of git://github.com/chrismason/linux
* 'for-linus' of git://github.com/chrismason/linux:
  Btrfs: add dummy extent if dst offset excceeds file end in
  Btrfs: calc file extent num_bytes correctly in file clone
  btrfs: xattr: fix attribute removal
  Btrfs: fix wrong nbytes information of the inode
  Btrfs: fix the file extent gap when doing direct IO
  Btrfs: fix unclosed transaction handle in btrfs_cont_expand
  Btrfs: fix misuse of trans block rsv
  Btrfs: reset to appropriate block rsv after orphan operations
  Btrfs: skip locking if searching the commit root in csum lookup
  btrfs: fix warning in iput for bad-inode
  Btrfs: fix an oops when deleting snapshots
2011-09-12 11:47:49 -07:00
Li Zefan d525e8ab02 Btrfs: add dummy extent if dst offset excceeds file end in
You can see there's no file extent with range [0, 4096]. Check this by
btrfsck:

 # btrfsck /dev/sda7
 root 5 inode 258 errors 100
 ...

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:25 -04:00
Li Zefan d72c0842ff Btrfs: calc file extent num_bytes correctly in file clone
num_bytes should be 4096 not 12288.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:25 -04:00
David Sterba 4815053aba btrfs: xattr: fix attribute removal
An attribute is not removed by 'setfattr -x attr file' and remains
visible in attr list. This makes xfstests/062 pass again.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:25 -04:00
Miao Xie a39f752143 Btrfs: fix wrong nbytes information of the inode
If we write some data into the data hole of the file(no preallocation for this
hole), Btrfs will allocate some disk space, and update nbytes of the inode, but
the other element--disk_i_size needn't be updated. At this condition, we must
update inode metadata though disk_i_size is not changed(btrfs_ordered_update_i_size()
return 1).

 # mkfs.btrfs /dev/sdb1
 # mount /dev/sdb1 /mnt
 # touch /mnt/a
 # truncate -s 856002 /mnt/a
 # dd if=/dev/zero of=/mnt/a bs=4K count=1 conv=nocreat,notrunc
 # umount /mnt
 # btrfsck /dev/sdb1
 root 5 inode 257 errors 400
 found 32768 bytes used err is 1

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:25 -04:00
Miao Xie 0c1a98c814 Btrfs: fix the file extent gap when doing direct IO
When we write some data to the place that is beyond the end of the file
in direct I/O mode, a data hole will be created. And Btrfs should insert
a file extent item that point to this hole into the fs tree. But unfortunately
Btrfs forgets doing it.

The following is a simple way to reproduce it:
 # mkfs.btrfs /dev/sdc2
 # mount /dev/sdc2 /test4
 # touch /test4/a
 # dd if=/dev/zero of=/test4/a seek=8 count=1 bs=4K oflag=direct conv=nocreat,notrunc
 # umount /test4
 # btrfsck /dev/sdc2
 root 5 inode 257 errors 100

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:24 -04:00
Miao Xie 5b397377e9 Btrfs: fix unclosed transaction handle in btrfs_cont_expand
The function - btrfs_cont_expand() forgot to close the transaction handle before
it jump out the while loop. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:24 -04:00
Liu Bo 98c9942aca Btrfs: fix misuse of trans block rsv
At the beginning of create_pending_snapshot, trans->block_rsv is set
to pending->block_rsv and is used for snapshot things, however, when
it is done, we do not recover it as will.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:24 -04:00
Liu Bo 65450aa645 Btrfs: reset to appropriate block rsv after orphan operations
While truncating free space cache, we forget to change trans->block_rsv
back to the original one, but leave it with the orphan_block_rsv, and
then with option inode_cache enable, it leads to countless warnings of
btrfs_alloc_free_block and btrfs_orphan_commit_root:

WARNING: at fs/btrfs/extent-tree.c:5711 btrfs_alloc_free_block+0x180/0x350 [btrfs]()
...
WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:24 -04:00
Josef Bacik ddf23b3fc6 Btrfs: skip locking if searching the commit root in csum lookup
It's not enough to just search the commit root, since we could be cow'ing the
very block we need to search through, which would mean that its locked and we'll
still deadlock.  So use path->skip_locking as well.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:24 -04:00
Sergei Trofimovich e0b6d65be5 btrfs: fix warning in iput for bad-inode
iput() shouldn't be called for inodes in I_NEW state.
We need to mark inode as constructed first.

WARNING: at fs/inode.c:1309 iput+0x20b/0x210()
Call Trace:
 [<ffffffff8103e7ba>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff8103e805>] warn_slowpath_null+0x15/0x20
 [<ffffffff810eaf0b>] iput+0x20b/0x210
 [<ffffffff811b96fb>] btrfs_iget+0x1eb/0x4a0
 [<ffffffff811c3ad6>] btrfs_run_defrag_inodes+0x136/0x210
 [<ffffffff811ad55f>] cleaner_kthread+0x17f/0x1a0
 [<ffffffff81035b7d>] ? sub_preempt_count+0x9d/0xd0
 [<ffffffff811ad3e0>] ? transaction_kthread+0x280/0x280
 [<ffffffff8105af86>] kthread+0x96/0xa0
 [<ffffffff814336d4>] kernel_thread_helper+0x4/0x10
 [<ffffffff8105aef0>] ? kthread_worker_fn+0x190/0x190
 [<ffffffff814336d0>] ? gs_change+0xb/0xb

Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
CC: Konstantin Khlebnikov <khlebnikov@openvz.org>
Tested-by: David Sterba <dsterba@suse.cz>
CC: Josef Bacik <josef@redhat.com>
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:24 -04:00
Liu Bo 14c7cca780 Btrfs: fix an oops when deleting snapshots
We can reproduce this oops via the following steps:

$ mkfs.btrfs /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs
$ for ((i=0; i<3; i++)); do btrfs sub snap /mnt/btrfs /mnt/btrfs/s_$i; done
$ rm -fr /mnt/btrfs/*
$ rm -fr /mnt/btrfs/*

then we'll get
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:2264!
[...]
Call Trace:
 [<ffffffffa05578c7>] btrfs_rmdir+0xf7/0x1b0 [btrfs]
 [<ffffffff81150b95>] vfs_rmdir+0xa5/0xf0
 [<ffffffff81153cc3>] do_rmdir+0x123/0x140
 [<ffffffff81145ac7>] ? fput+0x197/0x260
 [<ffffffff810aecff>] ? audit_syscall_entry+0x1bf/0x1f0
 [<ffffffff81153d0d>] sys_unlinkat+0x2d/0x40
 [<ffffffff8147896b>] system_call_fastpath+0x16/0x1b
RIP  [<ffffffffa054f7b9>] btrfs_orphan_add+0x179/0x1a0 [btrfs]

When it comes to btrfs_lookup_dentry, we may set a snapshot's inode->i_ino
to BTRFS_EMPTY_SUBVOL_DIR_OBJECTID instead of BTRFS_FIRST_FREE_OBJECTID,
while the snapshot's location.objectid remains unchanged.

However, btrfs_ino() does not take this into account, and returns a wrong ino,
and causes the oops.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-09-11 10:52:24 -04:00
Josef Bacik 6719db6a23 Btrfs: fix 64 bit divide problem
This fixes a regression introduced by commit cdcb725c05 ("Btrfs: check
if there is enough space for balancing smarter").  We can't do 64-bit
divides on 32-bit architectures.

In cases where we need to divide/multiply by 2 we should just left/right
shift respectively, and in cases where theres N number of devices use
do_div.  Also make the counters u64 to match up with rw_devices.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Acked-and-tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-21 07:02:00 -07:00
Chris Mason 81d86e1b70 Merge branch 'btrfs-3.0' into for-linus 2011-08-18 10:38:03 -04:00