dect
/
linux-2.6
Archived
13
0
Fork 0
Commit Graph

517 Commits

Author SHA1 Message Date
Chris Mason 12fcfd22fe Btrfs: tree logging unlink/rename fixes
The tree logging code allows individual files or directories to be logged
without including operations on other files and directories in the FS.
It tries to commit the minimal set of changes to disk in order to
fsync the single file or directory that was sent to fsync or O_SYNC.

The tree logging code was allowing files and directories to be unlinked
if they were part of a rename operation where only one directory
in the rename was in the fsync log.  This patch adds a few new rules
to the tree logging.

1) on rename or unlink, if the inode being unlinked isn't in the fsync
log, we must force a full commit before doing an fsync of the directory
where the unlink was done.  The commit isn't done during the unlink,
but it is forced the next time we try to log the parent directory.

Solution: record transid of last unlink/rename per directory when the
directory wasn't already logged.  For renames this is only done when
renaming to a different directory.

mkdir foo/some_dir
normal commit
rename foo/some_dir foo2/some_dir
mkdir foo/some_dir
fsync foo/some_dir/some_file

The fsync above will unlink the original some_dir without recording
it in its new location (foo2).  After a crash, some_dir will be gone
unless the fsync of some_file forces a full commit

2) we must log any new names for any file or dir that is in the fsync
log.  This way we make sure not to lose files that are unlinked during
the same transaction.

2a) we must log any new names for any file or dir during rename
when the directory they are being removed from was logged.

2a is actually the more important variant.  Without the extra logging
a crash might unlink the old name without recreating the new one

3) after a crash, we must go through any directories with a link count
of zero and redo the rm -rf

mkdir f1/foo
normal commit
rm -rf f1/foo
fsync(f1)

The directory f1 was fully removed from the FS, but fsync was never
called on f1, only its parent dir.  After a crash the rm -rf must
be replayed.  This must be able to recurse down the entire
directory tree.  The inode link count fixup code takes care of the
ugly details.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:52 -04:00
Chris Mason b9473439d3 Btrfs: leave btree locks spinning more often
btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying.  This may require a kmalloc and it
was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.

This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer.  Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.

This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.

Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:28 -04:00
Chris Mason b7ec40d784 Btrfs: reduce stalls during transaction commit
To avoid deadlocks and reduce latencies during some critical operations, some
transaction writers are allowed to jump into the running transaction and make
it run a little longer, while others sit around and wait for the commit to
finish.

This is a bit unfair, especially when the callers that jump in do a bunch
of IO that makes all the others procs on the box wait.  This commit
reduces the stalls this produces by pre-reading file extent pointers
during btrfs_finish_ordered_io before the transaction is joined.

It also tunes the drop_snapshot code to politely wait for transactions
that have started writing out their delayed refs to finish.  This avoids
new delayed refs being flooded into the queue while we're trying to
close off the transaction.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:26 -04:00
Chris Mason c3e69d58e8 Btrfs: process the delayed reference queue in clusters
The delayed reference queue maintains pending operations that need to
be done to the extent allocation tree.  These are processed by
finding records in the tree that are not currently being processed one at
a time.

This is slow because it uses lots of time searching through the rbtree
and because it creates lock contention on the extent allocation tree
when lots of different procs are running delayed refs at the same time.

This commit changes things to grab a cluster of refs for processing,
using a cursor into the rbtree as the starting point of the next search.
This way we walk smoothly through the rbtree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:26 -04:00
Chris Mason 1887be66dc Btrfs: try to cleanup delayed refs while freeing extents
When extents are freed, it is likely that we've removed the last
delayed reference update for the extent.  This checks the delayed
ref tree when things are freed, and if no ref updates area left it
immediately processes the delayed ref.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:26 -04:00
Chris Mason 56bec294de Btrfs: do extent allocation and reference count updates in the background
The extent allocation tree maintains a reference count and full
back reference information for every extent allocated in the
filesystem.  For subvolume and snapshot trees, every time
a block goes through COW, the new copy of the block adds a reference
on every block it points to.

If a btree node points to 150 leaves, then the COW code needs to go
and add backrefs on 150 different extents, which might be spread all
over the extent allocation tree.

These updates currently happen during btrfs_cow_block, and most COWs
happen during btrfs_search_slot.  btrfs_search_slot has locks held
on both the parent and the node we are COWing, and so we really want
to avoid IO during the COW if we can.

This commit adds an rbtree of pending reference count updates and extent
allocations.  The tree is ordered by byte number of the extent and byte number
of the parent for the back reference.  The tree allows us to:

1) Modify back references in something close to disk order, reducing seeks
2) Significantly reduce the number of modifications made as block pointers
are balanced around
3) Do all of the extent insertion and back reference modifications outside
of the performance critical btrfs_search_slot code.

#3 has the added benefit of greatly reducing the btrfs stack footprint.
The extent allocation tree modifications are done without the deep
(and somewhat recursive) call chains used in the past.

These delayed back reference updates must be done before the transaction
commits, and so the rbtree is tied to the transaction.  Throttling is
implemented to help keep the queue of backrefs at a reasonable size.

Since there was a similar mechanism in place for the extent tree
extents, that is removed and replaced by the delayed reference tree.

Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:25 -04:00
Chris Mason 4184ea7f90 Btrfs: Fix locking around adding new space_info
Storage allocated to different raid levels in btrfs is tracked by
a btrfs_space_info structure, and all of the current space_infos are
collected into a list_head.

Most filesystems have 3 or 4 of these structs total, and the list is
only changed when new raid levels are added or at unmount time.

This commit adds rcu locking on the list head, and properly frees
things at unmount time.  It also clears the space_info->full flag
whenever new space is added to the FS.

The locking for the space info list goes like this:

reads: protected by rcu_read_lock()
writes: protected by the chunk_mutex

At unmount time we don't need special locking because all the readers
are gone.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-10 12:39:20 -04:00
Chris Mason b9447ef80b Btrfs: fix spinlock assertions on UP systems
btrfs_tree_locked was being used to make sure a given extent_buffer was
properly locked in a few places.  But, it wasn't correct for UP compiled
kernels.

This switches it to using assert_spin_locked instead, and renames it to
btrfs_assert_tree_locked to better reflect how it was really being used.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-09 11:45:38 -04:00
Josef Bacik 4e06bdd6cb Btrfs: try committing transaction before returning ENOSPC
This fixes a problem where we could return -ENOSPC when we may actually have
plenty of space, the space is just pinned.  Instead of returning -ENOSPC
immediately, commit the transaction first and then try and do the allocation
again.

This patch also does chunk allocation for metadata if we pass the 80%
threshold for metadata space.  This will help with stack usage since the chunk
allocation will happen early on, instead of when the allocation is happening.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-20 10:59:53 -05:00
Josef Bacik 6a63209fc0 Btrfs: add better -ENOSPC handling
This is a step in the direction of better -ENOSPC handling.  Instead of
checking the global bytes counter we check the space_info bytes counters to
make sure we have enough space.

If we don't we go ahead and try to allocate a new chunk, and then if that fails
we return -ENOSPC.  This patch adds two counters to btrfs_space_info,
bytes_delalloc and bytes_may_use.

bytes_delalloc account for extents we've actually setup for delalloc and will
be allocated at some point down the line. 

bytes_may_use is to keep track of how many bytes we may use for delalloc at
some point.  When we actually set the extent_bit for the delalloc bytes we
subtract the reserved bytes from the bytes_may_use counter.  This keeps us from
not actually being able to allocate space for any delalloc bytes.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-20 11:00:09 -05:00
Yan Zheng 2456242530 Btrfs: hold trans_mutex when using btrfs_record_root_in_trans
btrfs_record_root_in_trans needs the trans_mutex held to make sure two
callers don't race to setup the root in a given transaction.  This adds
it to all the places that were missing it.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-02-12 14:14:53 -05:00
Chris Mason 4008c04a07 Btrfs: make a lockdep class for the extent buffer locks
Btrfs is currently using spin_lock_nested with a nested value based
on the tree depth of the block.  But, this doesn't quite work because
the max tree depth is bigger than what spin_lock_nested can deal with,
and because locks are sometimes taken before the level field is filled in.

The solution here is to use lockdep_set_class_and_name instead, and to
set the class before unlocking the pages when the block is read from the
disk and just after init of a freshly allocated tree block.

btrfs_clear_path_blocking is also changed to take the locks in the proper
order, and it also makes sure all the locks currently held are properly
set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
gets upset about bad lock orderin.

The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:09:45 -05:00
Chris Mason 536ac8ae86 Btrfs: use larger metadata clusters in ssd mode
Larger metadata clusters can significantly improve writeback performance
on ssd drives with large erasure blocks.  The larger clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle.

On spinning media, lager metadata clusters end up spreading out the
metadata more over time, which makes fsck slower, so we don't want this
to be the default.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 09:41:38 -05:00
Josef Bacik eb09967089 Btrfs: make sure all pending extent operations are complete
Theres a slight problem with finish_current_insert, if we set all to 1 and then
go through and don't actually skip any of the extents on the pending list, we
could exit right after we've added new extents.

This is a problem because by inserting the new extents we could have gotten new
COW's to happen and such, so we may have some pending updates to do or even
more inserts to do after that.

So this patch will only exit if we have never skipped any of the extents in the
pending list, and we have no extents to insert, this will make sure that all of
the pending work is truly done before we return.  I've been running with this
patch for a few days with all of my other testing and have not seen issues.
Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2009-02-12 09:27:38 -05:00
Chris Mason 806638bce9 Btrfs: Fix memory leak in cache_drop_leaf_ref
The code wasn't doing a kfree on the sorted array

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-05 09:08:14 -05:00
Chris Mason bd56b30205 Btrfs: Make btrfs_drop_snapshot work in larger and more efficient chunks
Every transaction in btrfs creates a new snapshot, and then schedules the
snapshot from the last transaction for deletion.  Snapshot deletion
works by walking down the btree and dropping the reference counts
on each btree block during the walk.

If if a given leaf or node has a reference count greater than one,
the reference count is decremented and the subtree pointed to by that
node is ignored.

If the reference count is one, walking continues down into that node
or leaf, and the references of everything it points to are decremented.

The old code would try to work in small pieces, walking down the tree
until it found the lowest leaf or node to free and then returning.  This
was very friendly to the rest of the FS because it didn't have a huge
impact on other operations.

But it wouldn't always keep up with the rate that new commits added new
snapshots for deletion, and it wasn't very optimal for the extent
allocation tree because it wasn't finding leaves that were close together
on disk and processing them at the same time.

This changes things to walk down to a level 1 node and then process it
in bulk.  All the leaf pointers are sorted and the leaves are dropped
in order based on their extent number.

The extent allocation tree and commit code are now fast enough for
this kind of bulk processing to work without slowing the rest of the FS
down.  Overall it does less IO and is better able to keep up with
snapshot deletions under high load.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:27:02 -05:00
Chris Mason b4ce94de9b Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.

So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.

This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.

We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.

The basic idea is:

btrfs_tree_lock() returns with the spin lock held

btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock.  The buffer is
still considered locked by all of the btrfs code.

If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.

Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time.  So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.

btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.

btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.

ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:25:08 -05:00
Chris Mason b7a9f29fcf Btrfs: sort references by byte number during btrfs_inc_ref
When a block goes through cow, we update the reference counts of
everything that block points to.  The internal pointers of the block
can be in just about any order, and it is likely to have clusters of
things that are close together and clusters of things that are not.

To help reduce the seeks that come with updating all of these reference
counts, sort them by byte number before actual updates are done.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:23:45 -05:00
Yan Zheng 7237f18336 Btrfs: fix tree logs parallel sync
To improve performance, btrfs_sync_log merges tree log sync
requests. But it wrongly merges sync requests for different
tree logs. If multiple tree logs are synced at the same time,
only one of them actually gets synced.

This patch has following changes to fix the bug:

Move most tree log related fields in btrfs_fs_info to
btrfs_root. This allows merging sync requests separately
for each tree log.

Don't insert root item into the log root tree immediately
after log tree is allocated. Root item for log tree is
inserted when log tree get synced for the first time. This
allows syncing the log root tree without first syncing all
log trees.

At tree-log sync, btrfs_sync_log first sync the log tree;
then updates corresponding root item in the log root tree;
sync the log root tree; then update the super block.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 12:54:03 -05:00
Yan Zheng 86288a198d Btrfs: fix stop searching test in replace_one_extent
replace_one_extent searches tree leaves for references to a given extent. It
stops searching if it goes beyond the last possible position.

The last possible position is computed by adding the starting offset of a found
file extent to the full size of the extent. The code uses physical size of the
extent as the full size. This is incorrect when compression is used.

The fix is get the full size from ram_bytes field of file extent item.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 10:49:16 -05:00
Huang Weiyi 653249ff9a Btrfs: remove duplicated #include
Removed duplicated #include "compat.h"in
fs/btrfs/extent-tree.c

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Yan Zheng 5a7be515b1 Btrfs: Fix infinite loop in btrfs_extent_post_op
btrfs_extent_post_op calls finish_current_insert and del_pending_extents. They
both may enter infinite loops.

finish_current_insert enters infinite loop if it only finds some backrefs to
update.  The fix is to check for pending backref updates before restarting the
loop.

The infinite loop in del_pending_extents is due to a the skipped variable
not being properly reset before looping around.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 10:49:16 -05:00
Yan Zheng 3dfdb9348a Btrfs: fix locking issue in btrfs_remove_block_group
We should hold the block_group_cache_lock while modifying the
block groups red-black tree. Thank you,

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-21 10:49:16 -05:00
Qinghuang Feng c6e308713a Btrfs: simplify iteration codes
Merge list_for_each* and list_entry to list_for_each_entry*

Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:59:08 -05:00
Huang Weiyi 7eaebe7d50 Btrfs: removed unused #include <version.h>'s
Removed unused #include <version.h>'s in btrfs

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-21 10:49:16 -05:00
Yan Zheng 07d400a6df Btrfs: tree logging checksum fixes
This patch contains following things.

1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE.  This
struct is kmalloced so we want to keep it reasonable.

2) Replace copy_extent_csums by btrfs_lookup_csums_range.  This was
duplicated code in tree-log.c

3) Remove replay_one_csum. csum items are replayed at the same time as
   replaying file extents. This guarantees we only replay useful csums.

4) nbytes accounting fix.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2009-01-06 11:42:00 -05:00
Chris Mason 9ca03b997f Btrfs: drop remaining LINUX_KERNEL_VERSION checks and compat code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-06 09:38:55 -05:00
Chris Mason d397712bcc Btrfs: Fix checkpatch.pl warnings
There were many, most are fixed now.  struct-funcs.c generates some warnings
but these are bogus.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-05 21:25:51 -05:00
Liu Hui 1f3c79a28c Btrfs: Fix free block discard calls down to the block layer
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.

There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.

This patch change discard behavior as follows:
1, For the extents which need to be free at once,
   we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
   when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
   FTL will destroy file system data.
2009-01-05 15:57:51 -05:00
Yan Zheng 1f80e4db0f Btrfs: set EXTENT_BOUNDARY bit before marking extent delalloc.
There is a race in relocate_inode_pages, it happens when
find_delalloc_range finds the delalloc extent before the
boundary bit is set. Thank you,

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-19 10:59:04 -05:00
Yan Zheng 34bf63c4dd Btrfs: properly update block accounting for metadata
This adds the missing block accounting code to finish_current_insert and makes
block accounting for root item properly protected by the delalloc spin lock.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-19 10:58:46 -05:00
Chris Mason dcbdd4dcb9 Btrfs: delete checksum items before marking blocks free
Btrfs maintains a cache of blocks available for allocation in ram.  The
code that frees extents was marking the extents free and then deleting
the checksum items.

This meant it was possible the extent would be reallocated before the
checksum item was actually deleted, leading to races and other
problems as the checksums were updated for the newly allocated extent.

The fix is to delete the checksum before marking the extent free.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-16 13:51:01 -05:00
Chris Mason 75eff68ea6 Btrfs: Don't use spin*lock_irq for the delalloc lock
The delalloc lock doesn't need to have irqs disabled, nobody that
changes the number of delalloc bytes in the FS is running with irqs off.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-15 15:54:40 -05:00
Yan Zheng 17d217fe97 Btrfs: fix nodatasum handling in balancing code
Checksums on data can be disabled by mount option, so it's
possible some data extents don't have checksums or have
invalid checksums. This causes trouble for data relocation.
This patch contains following things to make data relocation
work.

1) make nodatasum/nodatacow mount option only affects new
files. Checksums and COW on data are only controlled by the
inode flags.

2) check the existence of checksum in the nodatacow checker.
If checksums exist, force COW the data extent. This ensure that
checksum for a given block is either valid or does not exist.

3) update data relocation code to properly handle the case
of checksum missing.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-12 10:03:38 -05:00
Yan Zheng e4404d6e8d Btrfs: shared seed device
This patch makes seed device possible to be shared by
multiple mounted file systems. The sharing is achieved
by cloning seed device's btrfs_fs_devices structure.
Thanks you,

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-12 10:03:26 -05:00
Yan Zheng d2fb3437e4 Btrfs: fix leaking block group on balance
The block group structs are referenced in many different
places, and it's not safe to free while balancing.  So, those block
group structs were simply leaked instead.

This patch replaces the block group pointer in the inode with the starting byte
offset of the block group and adds reference counting to the block group
struct.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-11 16:30:39 -05:00
Yan Zheng 0403e47ee2 Btrfs: Add checking of csum tree in balancing code
This updates the space balancing code for the
new checksum format.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-10 20:32:51 -05:00
Chris Mason 459931eca5 Btrfs: Delete csum items when freeing extents
This finishes off the new checksumming code by removing csum items
for extents that are no longer in use.

The trick is doing it without racing because a single csum item may
hold csums for more than one extent.  Extra checks are added to
btrfs_csum_file_blocks to make sure that we are using the correct
csum item after dropping locks.

A new btrfs_split_item is added to split a single csum item so it
can be split without dropping the leaf lock.  This is used to
remove csum bytes from the middle of an item.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-10 09:10:46 -05:00
Yan Zheng a512bbf855 Btrfs: superblock duplication
This patch implements superblock duplication. Superblocks
are stored at offset 16K, 64M and 256G on every devices.
Spaces used by superblocks are preserved by the allocator,
which uses a reverse mapping function to find the logical
addresses that correspond to superblocks. Thank you,

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-08 16:46:26 -05:00
Christoph Hellwig b2950863c6 Btrfs: make things static and include the right headers
Shut up various sparse warnings about symbols that should be either
static or have their declarations in scope.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2008-12-02 09:54:17 -05:00
Josef Bacik ea6a478ed9 Btrfs: Fix for lockdep warnings with alloc_mutex and pinned_mutex
This the lockdep complaint by having a different mutex to gaurd caching the
block group, so you don't end up with this backwards dependancy.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-20 12:16:16 -05:00
Chris Mason 4b4e25f2a6 Btrfs: compat code fixes
The btrfs git kernel trees is used to build a standalone tree for
compiling against older kernels.  This commit makes the standalone tree
work with 2.6.27

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-20 10:22:27 -05:00
Chris Mason 15916de835 Btrfs: Fixes for 2.6.28-rc API changes
* open/close_bdev_excl -> open/close_bdev_exclusive
* blkdev_issue_discard takes a GFP mask now
* Fix blkdev_issue_discard usage now that it is enabled

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-19 21:17:22 -05:00
Josef Bacik 07103a3cdb Btrfs: fix free space accounting when unpinning extents
This patch fixes what I hope is the last early ENOSPC bug left.  I did not know
that pinned extents would merge into one big extent when inserted on to the
pinned extent tree, so I was adding free space to a block group that could
possibly span multiple block groups.

This is a big issue because first that space doesn't exist in that block group,
and second we won't actually use that space because there are a bunch of other
checks to make sure we're allocating within the constraints of the block group.

This patch fixes the problem by adding the btrfs_add_free_space to
btrfs_update_pinned_extents which makes sure we are adding the appropriate
amount of free space to the appropriate block group.  Thanks much to Lee Trager
for running my myriad of debug patches to help me track this problem down.
Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-19 15:17:55 -05:00
Liu Hui b4eec2ca11 Btrfs: Some fixes for batching extent insert.
In insert_extents(), when ret==1 and last is not zero, it should
check if the current inserted item is the last item in this batching
inserts. If so, it should just break from loop. If not, 'cur =
insert_list->next' will make no sense because the list is empty now,
and 'op' will point to an unexpectable place.

There are also some trivial fixs in this patch including one comment
typo error and deleting two redundant lines.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-18 11:30:10 -05:00
Josef Bacik 4ce4cb526f Btrfs: Add some debugging around the ENOSPC bugs
Some people are still reporting problems with early enospc.  This
will help narrown down the cause.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-17 21:12:00 -05:00
Josef Bacik e3e469f86e Btrfs: fix free space leak
In my batch delete/update/insert patch I introduced a free space leak.  The
extent that we do the original search on in free_extents is never pinned, so we
always update the block saying that it has free space, but the free space never
actually gets added to the free space tree, since op->del will always be 0 and
it's never actually added to the pinned extents tree.

This patch fixes this problem by making sure we call pin_down_bytes on the
pending extent op and set op->del to the return value of pin_down_bytes so
update_block_group is called with the right value.  This seems to fix the case
where we were getting ENOSPC when there was plenty of space available.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-17 21:11:49 -05:00
Yan Zheng 2b82032c34 Btrfs: Seed device support
Seed device is a special btrfs with SEEDING super flag
set and can only be mounted in read-only mode. Seed
devices allow people to create new btrfs on top of it.

The new FS contains the same contents as the seed device,
but it can be mounted in read-write mode.

This patch does the following:

1) split code in btrfs_alloc_chunk into two parts. The first part does makes
the newly allocated chunk usable, but does not do any operation that modifies
the chunk tree. The second part does the the chunk tree modifications. This
division is for the bootstrap step of adding storage to the seed device.

2) Update device management code to handle seed device.
The basic idea is: For an FS grown from seed devices, its
seed devices are put into a list. Seed devices are
opened on demand at mounting time. If any seed device is
missing or has been changed, btrfs kernel module will
refuse to mount the FS.

3) make btrfs_find_block_group not return NULL when all
block groups are read-only.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-17 21:11:30 -05:00
Yan Zheng c146afad2c Btrfs: mount ro and remount support
This patch adds mount ro and remount support. The main
changes in patch are: adding btrfs_remount and related
helper function; splitting the transaction related code
out of close_ctree into btrfs_commit_super; updating
allocator to properly handle read only block group.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-12 14:34:12 -05:00
Josef Bacik f3465ca44e Btrfs: batch extent inserts/updates/deletions on the extent root
While profiling the allocator I noticed a good amount of time was being spent in
finish_current_insert and del_pending_extents, and as the filesystem filled up
more and more time was being spent in those functions.  This patch aims to try
and reduce that problem.  This happens two ways

1) track if we tried to delete an extent that we are going to update or insert.
Once we get into finish_current_insert we discard any of the extents that were
marked for deletion.  This saves us from doing unnecessary work almost every
time finish_current_insert runs.

2) Batch insertion/updates/deletions.  Instead of doing a btrfs_search_slot for
each individual extent and doing the needed operation, we instead keep the leaf
around and see if there is anything else we can do on that leaf.  On the insert
case I introduced a btrfs_insert_some_items, which will take an array of keys
with an array of data_sizes and try and squeeze in as many of those keys as
possible, and then return how many keys it was able to insert.  In the update
case we search for an extent ref, update the ref and then loop through the leaf
to see if any of the other refs we are looking to update are on that leaf, and
then once we are done we release the path and search for the next ref we need to
update.  And finally for the deletion we try and delete the extent+ref in pairs,
so we will try to find extent+ref pairs next to the extent we are trying to free
and free them in bulk if possible.

This along with the other cluster fix that Chris pushed out a bit ago helps make
the allocator preform more uniformly as it fills up the disk.  There is still a
slight drop as we fill up the disk since we start having to stick new blocks in
odd places which results in more COW's than on a empty fs, but the drop is not
nearly as severe as it was before.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-12 14:19:50 -05:00
Chris Mason 2ed6d66408 Btrfs: Fix handling of space info full during allocations
When we fail to allocate a new block group, we should still do the
checks to make sure allocations try again with the minimum requested
allocation size.

This also fixes a deadlock that come from a missed down_read in
the chunk allocation failure handling.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-13 09:59:33 -05:00
Chris Mason 8a1413a296 Btrfs: empty_size allocation fixes again
The allocator wasn't catching all of the cases where it needed to do
extra loops because the check to enforce them wasn't happening early
enough.

When the allocator decided to increase the size of the allocation
for metadata clustering, it wasn't always setting the empty_size to
include the extra (optional) bytes.  This also fixes the empty_size field
to be correct.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 16:13:54 -05:00
Chris Mason f5a31e1667 Btrfs: Try harder while searching for free space
The loop searching for free space would exit out too soon when
metadata clustering was trying to allocate a large extent.  This makes
sure a full scan of the free space is done searching for only the
minimum extent size requested by the higher layers.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 11:47:09 -05:00
Chris Mason 5b7c3fcc46 Btrfs: Don't substract too much from the allocation target (avoid wrapping)
When metadata allocation clustering has to fall back to unclustered
allocs because large free areas could not be found, it was sometimes
substracting too much from the total bytes to allocate.  This would
make it wrap below zero.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-10 07:26:33 -05:00
Chris Mason 42e70e7a2f Btrfs: Fix more false enospc errors and an oops from empty clustering
In comes cases the empty cluster was added twice to the total number of
bytes the allocator was trying to find.

With empty clustering on, the hint byte was sometimes outside of the
block group.  Add an extra goto to find the correct block group.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-07 18:17:11 -05:00
Chris Mason 4366211ccd Btfs: More metadata allocator optimizations
This lowers the empty cluster target for metadata allocations.  The lower
target makes it easier to do allocations and still seems to perform well.

It also fixes the allocator loop to drop the empty cluster when things
start getting difficult, avoiding false enospc warnings.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-07 09:06:11 -05:00
Chris Mason 3b7885bf96 Btrfs: enforce metadata allocation clustering
The allocator uses the last allocation as a starting point for metadata
allocations, and tries to allocate in clusters of at least 256k.

If the search for a free block fails to find the expected block, this patch
forces a new cluster to be found in the free list.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-06 21:48:27 -05:00
Chris Mason 771ed689d2 Btrfs: Optimize compressed writeback and reads
When reading compressed extents, try to put pages into the page cache
for any pages covered by the compressed extent that readpages didn't already
preload.

Add an async work queue to handle transformations at delayed allocation processing
time.  Right now this is just compression.  The workflow is:

1) Find offsets in the file marked for delayed allocation
2) Lock the pages
3) Lock the state bits
4) Call the async delalloc code

The async delalloc code clears the state lock bits and delalloc bits.  It is
important this happens before the range goes into the work queue because
otherwise it might deadlock with other work queue items that try to lock
those extent bits.

The file pages are compressed, and if the compression doesn't work the
pages are written back directly.

An ordered work queue is used to make sure the inodes are written in the same
order that pdflush or writepages sent them down.

This changes extent_write_cache_pages to let the writepage function
update the wbc nr_written count.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-06 22:02:51 -05:00
Yan Zheng d899e05215 Btrfs: Add fallocate support v2
This patch updates btrfs-progs for fallocate support.

fallocate is a little different in Btrfs because we need to tell the
COW system that a given preallocated extent doesn't need to be
cow'd as long as there are no snapshots of it.  This leverages the
-o nodatacow checks.
 
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-30 14:25:28 -04:00
Yan Zheng 80ff385665 Btrfs: update nodatacow code v2
This patch simplifies the nodatacow checker. If all references
were created after the latest snapshot, then we can avoid COW
safely. This patch also updates run_delalloc_nocow to do more
fine-grained checking.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-30 14:20:02 -04:00
Yan Zheng 6643558db2 Btrfs: Fix bookend extent race v2
When dropping middle part of an extent, btrfs_drop_extents truncates
the extent at first, then inserts a bookend extent.

Since truncation and insertion can't be done atomically, there is a small
period that the bookend extent isn't in the tree. This causes problem for
functions that search the tree for file extent item. The way to fix this is
lock the range of the bookend extent before truncation.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-30 14:19:50 -04:00
Chris Mason 87ef2bb46b Btrfs: prevent looping forever in finish_current_insert and del_pending_extents
finish_current_insert and del_pending_extents process extent tree modifications
that build up while we are changing the extent tree.  It is a confusing
bit of code that prevents recursion.

Both functions run through a list of pending operations and both funcs
add to the list of pending operations.  If you have two procs in either
one of them, they can end up looping forever making more work for each other.

This patch makes them walk forward through the list of pending changes instead
of always trying to process the entire list.  At transaction commit
time, we catch any changes that were left over.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-30 11:23:27 -04:00
Yan Zheng 84234f3a1f Btrfs: Add root tree pointer transaction ids
This patch adds transaction IDs to root tree pointers.
Transaction IDs in tree pointers are compared with the
generation numbers in block headers when reading root
blocks of trees. This can detect some types of IO errors.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-29 14:49:05 -04:00
Josef Bacik 2517920135 Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.

There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.

The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same.  Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.

Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one.  I have tested this heavily and it does
not appear to break anything.  This has to be applied on top of my
find_free_extent redo patch.

I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-29 14:49:05 -04:00
Josef Bacik 80eb234af0 Btrfs: fix enospc when there is plenty of space
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had.  It only happens with Metadata writes, and happens _very_
infrequently.  What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start.  We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP.  We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.

If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of.  This patch completely kills find_free_space and
rolls it into find_free_extent.  I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.

The basic flow is this:  We have the variable loop which is 0, meaning we are
in the hint phase.  We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of.  If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.

This is also where we add the empty_cluster to total_needed.  At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups.  If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good.  If not we start
over at space_info->block_groups and loop through again, with loop == 2.  If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.

Also I've made a groups_sem to handle the group list for the space_info.  This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time.  Thanks,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-29 14:49:05 -04:00
Yan Zheng f82d02d9d8 Btrfs: Improve space balancing code
This patch improves the space balancing code to keep more sharing
of tree blocks. The only case that breaks sharing of tree blocks is
data extents get fragmented during balancing. The main changes in
this patch are:

Add a 'drop sub-tree' function. This solves the problem in old code
that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block.

Remove relocation mapping tree. Relocation mappings are stored in
struct btrfs_ref_path and updated dynamically during walking up/down
the reference path. This reduces CPU usage and simplifies code.

This patch also fixes a bug. Root items for reloc trees should be
updated in btrfs_free_reloc_root.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-29 14:49:05 -04:00
Chris Mason c8b978188c Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents.  It does some fairly large
surgery to the writeback paths.

Compression is off by default and enabled by mount -o compress.  Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.

If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.

* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler.  This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.

* Inline extents are inserted at delalloc time now.  This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.

* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.

From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field.  Neither the encryption or the
'other' field are currently used.

In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k.  This is a
software only limit, the disk format supports u64 sized compressed extents.

In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k.  This is a software only limit
and will be subject to tuning later.

Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data.  This way additional encodings can be
layered on without having to figure out which encoding to checksum.

Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread.  This makes it tricky to
spread the compression load across all the cpus on the box.  We'll have to
look at parallel pdflush walks of dirty inodes at a later time.

Decompression is hooked into readpages and it does spread across CPUs nicely.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 14:49:59 -04:00
Yan Zheng 5b84e8d6ee Btrfs: Fix leaf reference cache miss
Due to the optimization for truncate, tree leaves only containing
checksum items can be deleted without being COW'ed first. This causes
reference cache misses. The way to fix the miss is create cache
entries for tree leaves only contain checksum.

This patch also fixes a -EEXIST issue in shared reference cache.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-09 11:46:19 -04:00
Yan Zheng 3bb1a1bc42 Btrfs: Remove offset field from struct btrfs_extent_ref
The offset field in struct btrfs_extent_ref records the position
inside file that file extent is referenced by. In the new back
reference system, tree leaves holding references to file extent
are recorded explicitly. We can scan these tree leaves very quickly, so the
offset field is not required.

This patch also makes the back reference system check the objectid
when extents are in deleting.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-09 11:46:24 -04:00
Yan Zheng a76a3cd40c Btrfs: Count space allocated to file in bytes
This patch makes btrfs count space allocated to file in bytes instead
of 512 byte sectors.

Everything else in btrfs uses a byte count instead of sector sizes or
blocks sizes, so this fits better.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-09 11:46:29 -04:00
Chris Mason 30c43e2444 Btrfs: remove last_log_alloc allocator optimization
The tree logging code was trying to separate tree log allocations
from normal metadata allocations to improve writeback patterns during
an fsync.

But, the code was not effective and ended up just mixing tree log
blocks with regular metadata.  That seems to be working fairly well,
so the last_log_alloc code can be removed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-03 12:24:01 -04:00
Josef Bacik cf74982385 Btrfs: fix deadlock between alloc_mutex/chunk_mutex
This fixes a deadlock that happens between the alloc_mutex and chunk_mutex.
Process A comes in, decides to do a do_chunk_alloc, which takes the
chunk_mutex, and is holding the alloc_mutex because the only way you get to
do_chunk_alloc is by holding the alloc_mutex.  btrfs_alloc_chunk does its thing
and goes to insert a new item, which results in a cow of the block.

We get into del_pending_extents from there, where if we need to be rescheduled
we drop the alloc_mutex and schedule.  At this point process B comes in to do
an allocation and gets the alloc_mutex, and because process A did not do the
chunk allocation completely it thinks its a good time to do a chunk allocation
as well, and hangs on the chunk_mutex.

Process A wakes up and tries to take the alloc_mutex and cannot.  The way to
fix this is do a mutex_trylock() on chunk_mutex.  If we return 0 we didn't get
the lock, and if this is just a "hey it may be a good time to allocate a chunk"
then we just exit.  If we are trying to force an allocation then we reschedule
and keep trying to acquire the chunk_mutex.  If once we acquire it the space is
already full then we can just exit, otherwise we can continue with the chunk
allocation.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-01 19:11:18 -04:00
Chris Mason 75ccf47d13 Btrfs: fix multi-device code to use raid policies set by mkfs
When reading in block groups, a global mask of the available raid policies
should be adjusted based on the types of block groups found on disk.  This
global mask is then used to decide which raid policy to use for new
block groups.

The recent allocator changes dropped the call that updated the global
mask, making all the block groups allocated at run time single striped
onto a single drive.

This also fixes the async worker threads to set any thread that uses
the requeue mechanism as busy.  This allows us to avoid blocking
on get_request_wait for the async bio submission threads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-30 19:36:34 -04:00
Josef Bacik 45b8c9a8b1 Btrfs: fix seekiness due to finding the wrong block group
This patch fixes a problem where we end up seeking too much when *last_ptr is
valid.  This happens because btrfs_lookup_first_block_group only returns a
block group that starts on or after the given search start, so if the
search_start is in the middle of a block group it will return the block group
after the given search_start, which is suboptimal.

This patch fixes that by doing a btrfs_lookup_block_group, which will return
the block group that contains the given search start.  If we fail to find a
block group, we fall back on btrfs_lookup_first_block_group so we can find the
next block group, not sure if this is absolutely needed, but better safe than
sorry.

Also if we can't find the block group that we need, or it happens to not be of
the right type, we need to add empty_cluster since *last_ptr could point to a
mismatched block group, which means we need to start over with empty_cluster
added to total needed.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-30 14:40:06 -04:00
Zheng Yan 1a40e23b95 Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format.  Before, btrfs-vol -b would break any COW links
on data blocks or metadata.  This was slow and caused the amount
of space used to explode if a large number of snapshots were present.

The new code can keeps the sharing of all data extents and
most of the tree blocks.

To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.

To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).

To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:09:34 -04:00
Zheng Yan e465768938 Btrfs: Add shared reference cache
Btrfs has a cache of reference counts in leaves, allowing it to
avoid reading tree leaves while deleting snapshots.  To reduce
contention with multiple subvolumes, this cache is private to each
subvolume.

This patch adds shared reference cache support. The new space
balancing code plays with multiple subvols at the same time, So
the old per-subvol reference cache is not well suited.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:04:53 -04:00
Zheng Yan e856981384 Btrfs: allocator fixes for space balancing update
* Reserved extent accounting:  reserved extents have been
allocated in the rbtrees that track free space but have not
been allocated on disk.  They were never properly accounted for
in the past, making it hard to know how much space was really free.

* btrfs_find_block_group used to return NULL for block groups that
had been removed by the space balancing code.  This made it hard
to account for space during the final stages of a balance run.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:05:48 -04:00
Chris Mason 4434c33c7f Btrfs: fix sleep with spinlock held during unmount
The code to free block groups needs to drop the space info spin lock
before calling btrfs_remove_free_space_cache (which can schedule).

This is safe because at unmount time, nobody else is going to play
with the block groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 15:41:59 -04:00
Zheng Yan 31840ae1a6 Btrfs: Full back reference support
This patch makes the back reference system to explicit record the
location of parent node for all types of extents. The location of
parent node is placed into the offset field of backref key. Every
time a tree block is balanced, the back references for the affected
lower level extents are updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason 1c2308f8e7 Add check for tree-log roots in btrfs_alloc_reserved_extents
Tree log blocks are only reserved, and should not ever get fully
allocated on disk.  This check makes sure they stay out of the
extent tree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Josef Bacik 0f9dd46cda Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size.  The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing.  If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible.  When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.

2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset.  also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.

3) cleaned up the allocation code to make it a little easier to read and a
little less complicated.  Basically there are 3 steps, first look from our
provided hint.  If we couldn't find from that given hint, start back at our
original search start and look for space from there.  If that fails try to
allocate space if we can and start looking again.  If not we're screwed and need
to start over again.

4) small fixes.  there were some issues in volumes.c where we wouldn't allocate
the rest of the disk.  fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space.  Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space.  Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups.  The alloc_hint has fixed
this slight degredation and made things semi-normal.

There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space.  This only happens with metadata
allocations, and only when we are almost full.  So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall.  I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Josef Bacik ef8bbdfe7e Btrfs: fix cache_block_group error handling
cache block group had a few bugs in the error handling code,
this makes sure paths get properly released and the correct return value
goes out.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason d0c803c404 Btrfs: Record dirty pages tree-log pages in an extent_io tree
This is the same way the transaction code makes sure that all the
other tree blocks are safely on disk.  There's an extent_io tree
for each root, and any blocks allocated to the tree logs are
recorded in that tree.

At tree-log sync, the extent_io tree is walked to flush down the
dirty pages and wait for them.

The main benefit is less time spent walking the tree log and skipping
clean pages, and getting sequential IO down to the drive.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason d00aff0013 Btrfs: Optimize tree log block allocations
Since tree log blocks get freed every transaction, they never really
need to be written to disk.  This skips the step where we update
metadata to record they were allocated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason 4bef084857 Btrfs: Tree logging fixes
* Pin down data blocks to prevent them from being reallocated like so:

trans 1: allocate file extent
trans 2: free file extent
trans 3: free file extent during old snapshot deletion
trans 3: allocate file extent to new file
trans 3: fsync new file

Before the tree logging code, this was legal because the fsync
would commit the transation that did the final data extent free
and the transaction that allocated the extent to the new file
at the same time.

With the tree logging code, the tree log subtransaction can commit
before the transaction that freed the extent.  If we crash,
we're left with two different files using the extent.

* Don't wait in start_transaction if log replay is going on.  This
avoids deadlocks from iput while we're cleaning up link counts in the
replay code.

* Don't deadlock in replay_one_name by trying to read an inode off
the disk while holding paths for the directory

* Hold the buffer lock while we mark a buffer as written.  This
closes a race where someone is changing a buffer while we write it.
They are supposed to mark it dirty again after they change it, but
this violates the cow rules.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason e02119d5a7 Btrfs: Add a write ahead tree log to optimize synchronous operations
File syncs and directory syncs are optimized by copying their
items into a special (copy-on-write) log tree.  There is one log tree per
subvolume and the btrfs super block points to a tree of log tree roots.

After a crash, items are copied out of the log tree and back into the
subvolume.  See tree-log.c for all the details.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
David Woodhouse 21af804c07 Btrfs: Discard sector data in __free_extent()
Date: Tue, 12 Aug 2008 14:13:26 +0100
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Yan Zheng 7ea394f119 Btrfs: Fix nodatacow for the new data=ordered mode
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason ea8c281947 Btrfs: Maintain a list of inodes that are delalloc and a way to wait on them
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason d7a029a89e Btrfs: Don't corrupt ram in shrink_extent_tree, leak it instead
Far from the perfect fix, but these structs are small.  TODO for the
next release.  The block group cache structs are referenced in many
different places, and it isn't safe to just free them while resizing.

A real fix will be a larger change to the allocator so that it doesn't
have to carry about the block group cache structs to find good places
to search for free blocks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason 2dd3e67b1e Btrfs: More throttle tuning
* Make walk_down_tree wake up throttled tasks more often
* Make walk_down_tree call cond_resched during long loops
* As the size of the ref cache grows, wait longer in throttle
* Get rid of the reada code in walk_down_tree, the leaves don't get
  read anymore, thanks to the ref cache.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason 65b51a009e btrfs_search_slot: reduce lock contention by cowing in two stages
A btree block cow has two parts, the first is to allocate a destination
block and the second is to copy the old bock over.

The first part needs locks in the extent allocation tree, and may need to
do IO.  This changeset splits that into a separate function that can be
called without any tree locks held.

btrfs_search_slot is changed to drop its path and start over if it has
to COW a contended block.  This often means that many writers will
pre-alloc a new destination for a the same contended block, but they
cache their prealloc for later use on lower levels in the tree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason 18e35e0ab3 Btrfs: Throttle less often waiting for snapshots to delete
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason f87f057b49 Btrfs: Improve and cleanup locking done by walk_down_tree
While dropping snapshots, walk_down_tree does most of the work of checking
reference counts and limiting tree traversal to just the blocks that
we are freeing.

It dropped and held the allocation mutex in strange and confusing ways,
this commit changes it to only hold the mutex while actually freeing a block.

The rest of the checks around reference counts should be safe without the lock
because we only allow one process in btrfs_drop_snapshot at a time.  Other
processes dropping reference counts should not drop it to 1 because
their tree roots already have an extra ref on the block.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Chris Mason 37d1aeee39 Btrfs: Throttle tuning
This avoids waiting for transactions with pages locked by breaking out
the code to wait for the current transaction to close into a function
called by btrfs_throttle.

It also lowers the limits for where we start throttling.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason 47ac14fa0f Btrfs: Add missing hunk from Yan Zheng's cache reclaim patch
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Yan bcc63abbf3 Btrfs: implement memory reclaim for leaf reference cache
The memory reclaiming issue happens when snapshot exists. In that
case, some cache entries may not be used during old snapshot dropping,
so they will remain in the cache until umount.

The patch adds a field to struct btrfs_leaf_ref to record create time. Besides,
the patch makes all dead roots of a given snapshot linked together in order of
create time. After a old snapshot was completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Yan Zheng f321e49103 Btrfs: Update and fix mount -o nodatacow
To check whether a given file extent is referenced by multiple snapshots, the
checker walks down the fs tree through dead root and checks all tree blocks in
the path.

We can easily detect whether a given tree block is directly referenced by other
snapshot. We can also detect any indirect reference from other snapshot by
checking reference's generation. The checker can always detect multiple
references, but can't reliably detect cases of single reference. So btrfs may
do file data cow even there is only one reference.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason ab78c84de1 Btrfs: Throttle operations if the reference cache gets too large
A large reference cache is directly related to a lot of work pending
for the cleaner thread.  This throttles back new operations based on
the size of the reference cache so the cleaner thread will be able to keep
up.

Overall, this actually makes the FS faster because the cleaner thread will
be more likely to find things in cache.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason 017e5369eb Btrfs: Leaf reference cache update
This changes the reference cache to make a single cache per root
instead of one cache per transaction, and to key by the byte number
of the disk block instead of the keys inside.

This makes it much less likely to have cache misses if a snapshot
or something has an extra reference on a higher node or a leaf while
the first transaction that added the leaf into the cache is dropping.

Some throttling is added to functions that free blocks heavily so they
wait for old transactions to drop.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Yan Zheng 31153d8128 Btrfs: Add a leaf reference cache
Much of the IO done while dropping snapshots is done looking up
leaves in the filesystem trees to see if they point to any extents and
to drop the references on any extents found.

This creates a cache so that IO isn't required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Yan 974e35a82d Btrfs: Properly release lock in pin_down_bytes
When buffer isn't uptodate, pin_down_bytes may leave the tree locked
after it returns.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Josef Bacik 8e8a1e31f2 Btrfs: Fix a few functions that exit without stopping their transaction
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason 3eaa288527 Btrfs: Fix the defragmention code and the block relocation code for data=ordered
Before setting an extent to delalloc, the code needs to wait for
pending ordered extents.

Also, the relocation code needs to wait for ordered IO before scanning
the block group again.  This is because the extents are not removed
until the IO for the new extents is finished

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason c286ac48ed Btrfs: alloc_mutex latency reduction
This releases the alloc_mutex in a few places that hold it for over long
operations.  btrfs_lookup_block_group is changed so that it doesn't need
the mutex at all.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason e34a5b4f77 Btrfs: Add some conditional schedules near the alloc_mutex
This helps prevent stalls, especially while the snapshot cleaner is
running hard

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason a61e6f29dc Btrfs: Use a mutex in the extent buffer for tree block locking
This replaces the use of the page cache lock bit for locking, which wasn't
suitable for block size < page size and couldn't be used recursively.

The mutexes alone don't fix either problem, but they are the first step.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason 4a09675279 Btrfs: Data ordered fixes
* In btrfs_delete_inode, wait for ordered extents after calling
truncate_inode_pages.  This is much faster, and more correct

* Properly clear our the PageChecked bit everywhere we redirty the page.

* Change the writepage fixup handler to lock the page range and check to
see if an ordered extent had been inserted since the improperly dirtied
page was discovered

* Wait for ordered extents outside the transaction.  This isn't required
for locking rules but does improve transaction latencies

* Reduce contention on the alloc_mutex by dropping it while incrementing
refs on a node/leaf and while dropping refs on a leaf.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason 54641bd17d Btrfs: Force caching of metadata block groups on mount to avoid deadlock
This is a temporary change to avoid deadlocks until the extent tree locking
is fixed up.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason ee6e6504e1 Add a per-inode lock around btrfs_drop_extents
btrfs_drop_extents is always called with a range lock held on the inode.
But, it may operate on extents outside that range as it drops and splits
them.

This patch adds a per-inode mutex that is held while calling
btrfs_drop_extents and while inserting new extents into the tree.  It
prevents races from two procs working against adjacent ranges in the tree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason e6dcd2dc9c Btrfs: New data=ordered implementation
The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk.  This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.

The new code changes the way data allocations and extents work:

* When delayed allocation is filled, data extents are reserved, and
  the extent bit EXTENT_ORDERED is set on the entire range of the extent.
  A struct btrfs_ordered_extent is allocated an inserted into a per-inode
  rbtree to track the pending extents.

* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
  to that page.

* When all of the bytes corresponding to a single struct btrfs_ordered_extent
  are written, The previously reserved extent is inserted into the FS
  btree and into the extent allocation trees.  The checksums for the file
  data are also updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason 7d9eb12c87 Btrfs: Add locking around volume management (device add/remove/balance)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason 3f157a2fd2 Btrfs: Online btree defragmentation fixes
The btree defragger wasn't making forward progress because the new key wasn't
being saved by the btrfs_search_forward function.

This also disables the automatic btree defrag, it wasn't scaling well to
huge filesystems.  The auto-defrag needs to be done differently.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason 079899c238 Btrfs: Change find_extent_buffer to use TestSetPageLocked
This makes it possible for callers to check for extent_buffers in cache
without deadlocking against any btree locks held.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason e7a84565bc Btrfs: Add btree locking to the tree defragmentation code
The online btree defragger is simplified and rewritten to use
standard btree searches instead of a walk up / down mechanism.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason a74a4b97b6 Btrfs: Replace the transaction work queue with kthreads
This creates one kthread for commits and one kthread for
deleting old snapshots.  All the work queues are removed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 333db94cdd Btrfs: Fix snapshot deletion to release the alloc_mutex much more often.
This lowers the impact of snapshot deletion on the rest of the FS.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 5cd57b2cbb Btrfs: Add a skip_locking parameter to struct path, and make various funcs honor it
Allocations may need to read in block groups from the extent allocation tree,
which will require a tree search and take locks on the extent allocation
tree.  But, those locks might already be held in other places, leading
to deadlocks.

Since the alloc_mutex serializes everything right now, it is safe to
skip the btree locking while caching block groups.  A better fix will be
to either create a recursive lock or find a way to back off existing
locks while caching block groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 051e1b9f74 Drop locks in btrfs_search_slot when reading a tree block.
One lock per btree block can make for significant congestion if everyone
has to wait for IO at the high levels of the btree.  This drops
locks held by a path when doing reads during a tree search.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason a213501153 Btrfs: Replace the big fs_mutex with a collection of other locks
Extent alloctions are still protected by a large alloc_mutex.
Objectid allocations are covered by a objectid mutex
Other btree operations are protected by a lock on individual btree nodes

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 925baeddc5 Btrfs: Start btree concurrency work.
The allocation trees and the chunk trees are serialized via their own
dedicated mutexes.  This means allocation location is still not very
fine grained.

The main FS btree is protected by locks on each block in the btree.  Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.

The end result of a search is now a path where only the lowest level
is locked.  Releasing or freeing the path drops any locks held.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 0ef3e66b67 Btrfs: Allocator fix variety pack
* Force chunk allocation when find_free_extent has to do a full scan
* Record the max key at the start of defrag so it doesn't run forever
* Block groups might not be contiguous, make a forward search for the
  next block group in extent-tree.c
* Get rid of extra checks for total fs size
* Fix relocate_one_reference to avoid relocating the same file data block
  twice when referenced by an older transaction
* Use the open device count when allocating chunks so that we don't
  try to allocate from devices that don't exist

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 1259ab75c6 Btrfs: Handle write errors on raid1 and raid10
When duplicate copies exist, writes are allowed to fail to one of those
copies.  This changeset includes a few changes that allow the FS to
continue even when some IOs fail.

It also adds verification of the parent generation number for btree blocks.
This generation is stored in the pointer to a block, and it ensures
that missed writes to are detected.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason ca7a79ad8d Btrfs: Pass down the expected generation number when reading tree blocks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 323da79c9f Btrfs: Chunk relocation fine tuning, and add a few printks to show progress
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason bbaf549e0c Btrfs: A number of nodatacow fixes
Once part of a delalloc request fails the cow checks, just cow the
entire range

It is possible for the back references to all be from the same root,
but still have snapshots against an extent.  The checks are now more strict,
forcing cow any time there are multiple refs against the data extent.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason a68d5933a0 Btrfs: Update nodatacow mode to support cloned single files and resizing
Before, nodatacow only checked to make sure multiple roots didn't have
references on a single extent.  This check makes sure that multiple
inodes don't have references.

nodatacow needed an extra check to see if the block group was currently
readonly.  This way cows forced by the chunk relocation code are honored.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason bf4ef67924 Btrfs: Properly find the root for snapshotted blocks during chunk relocation
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason a061fc8da7 Btrfs: Add support for online device removal
This required a few structural changes to the code that manages bdev pointers:

The VFS super block now gets an anon-bdev instead of a pointer to the
lowest bdev.  This allows us to avoid swapping the super block bdev pointer
around at run time.

The code to read in the super block no longer goes through the extent
buffer interface.  Things got ugly keeping the mapping constant.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason a236aed14c Btrfs: Deal with failed writes in mirrored configurations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason ec44a35cbe Btrfs: Add balance ioctl to restripe the chunks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason 8e7bf94fd5 Btrfs: Do more optimal file RA during shrinking and defrag
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason 3bf3d9e9c2 Btrfs: Avoid recursive chunk allocations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason 8f18cf1339 Btrfs: Make the resizer work based on shrinking and growing devices
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason bce4eae986 Btrfs: Fix balance_level to free the middle block if there is room in the left one
balance level starts by trying to empty the middle block, and then
pushes from the right to the middle.  This might empty the right block
and leave a small number of pointers in the middle.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason 3c12ac7205 Btrfs: Simplify device selection for mirrored reads
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason 3b951516ed Btrfs: Use the extent map cache to find the logical disk block during data retries
The data read retry code needs to find the logical disk block before it
can resubmit new bios.  But, finding this block isn't allowed to take
the fs_mutex because that will deadlock with a number of different callers.

This changes the retry code to use the extent map cache instead, but
that requires the extent map cache to have the extent we're looking for.
This is a problem because btrfs_drop_extent_cache just drops the entire
extent instead of the little tiny part it is invalidating.

The bulk of the code in this patch changes btrfs_drop_extent_cache to
invalidate only a portion of the extent cache, and changes btrfs_get_extent
to deal with the results.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 699122f559 Btrfs: Don't wait on tree block writeback before freeing them anymore
This isn't required anymore because we don't reallocate blocks that
have already been written in this transaction.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 321aecc656 Btrfs: Add RAID10 support
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason e17cade25f Btrfs: Add chunk uuids and update multi-device back references
Block headers now store the chunk tree uuid

Chunk items records the device uuid for each stripes

Device extent items record better back refs to the chunk tree

Block groups record better back refs to the chunk tree

The chunk tree format has also changed.  The objectid of BTRFS_CHUNK_ITEM_KEY
used to be the logical offset of the chunk.  Now it is a chunk tree id,
with the logical offset being stored in the offset field of the key.

This allows a single chunk tree to record multiple logical address spaces,
upping the number of bytes indexed by a chunk tree from 2^64 to
2^128.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 98d20f67cf Add a min size parameter to btrfs_alloc_extent
On huge machines, delayed allocation may try to allocate massive extents.
This change allows btrfs_alloc_extent to return something smaller than
the caller asked for, and the data allocation routines will loop over
the allocations until it fills the whole delayed alloc.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Miguel a5eb62e345 Btrfs: Endianess bug fix for v0.13 with kernels
Fix for a endianess BUG when using btrfs v0.13 with kernels older than 2.6.23

Problem:

Has of v0.13, btrfs-progs is using crc32c.c equivalent to the one found on
linux-2.6.23/lib/libcrc32c.c Since crc32c_le() changed in linux-2.6.23, when
running btrfs v0.13 with older kernels we have a missmatch between the versions
of crc32c_le() from btrfs-progs and libcrc32c in the kernel.  This missmatch
causes a bug when using btrfs on big endian machines.

Solution:
btrfs_crc32c() macro that when compiling for kernels older than 2.6.23, does
endianess conversion to parameters and return value of crc32c().
This endianess conversion nullifies the differences in implementation
of crc32c_le().
If kernel 2.6.23 or better, it calls crc32c().

Signed-off-by: Miguel Sousa Filipe <miguel.filipe@gmail.com>
---

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason ce9adaa5a7 Btrfs: Do metadata checksums for reads via a workqueue
Before, metadata checksumming was done by the callers of read_tree_block,
which would set EXTENT_CSUM bits in the extent tree to show that a given
range of pages was already checksummed and didn't need to be verified
again.

But, those bits could go away via try_to_releasepage, and the end
result was bogus checksum failures on pages that never left the cache.

The new code validates checksums when the page is read.  It is a little
tricky because metadata blocks can span pages and a single read may
end up going via multiple bios.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason d18a2c4475 Btrfs: Fix allocation profile init
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 6bc34676c0 Btrfs: Don't allow written blocks from this transaction to be reallocated
When a block is freed, it can be immediately reused if it is from
the current transaction.  But, an extra check is required to make sure
the block had not been written yet.  If it were reused after being written,
the transid in the block header might match the transid of the
next time the block was allocated.

The parent node records the transaction ID of the block it is pointing to,
and this is used as part of validating the block on reads.  So, there
can only be one version of a block per transaction.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 611f0e00a2 Btrfs: Add support for duplicate blocks on a single spindle
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 8790d502e4 Btrfs: Add support for mirroring across drives
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 0999df54f8 Btrfs: Verify checksums on tree blocks found without read_tree_block
Checksums were only verified by btrfs_read_tree_block, which meant the
functions to probe the page cache for blocks were not validating checksums.
Normally this is fine because the buffers will only be in cache if they
have already been validated.

But, there is a window while the buffer is being read from disk where
it could be up to date in the cache but not yet verified.  This patch
makes sure all buffers go through checksum verification before they
are used.

This is safer, and it prevents modification of buffers before they go
through the csum code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason ecbe2402cb Btrfs: Keep fs_mutex during reads done by snapshot deletion
There was an optimization to drop the fs_mutex when doing snapshot deletion
reads, but this can lead to false positives on checksumming errors.  Keep
the lock for now.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 593060d756 Btrfs: Implement raid0 when multiple devices are present
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 239b14b32d Btrfs: Bring back mount -o ssd optimizations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason a9218f6b00 Add /dev/btrfs-control for device scanning ioctls
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 7d1660d411 Btrfs: Bring back find_free_extent CPU usage optimizations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 6324fbf334 Btrfs: Dynamic chunk and block group allocation
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 0b86a832a1 Btrfs: Add support for multiple devices per filesystem
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason 7f93bf8d27 Match the extent tree code to btrfs-progs for multi-device merging
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason 952fccac50 Btrfs: Remove extent back refs in batches, and avoid duplicate searches
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason d7fc640e6f Btrfs: Allocator improvements
Reduce CPU time searching for free blocks by optimizing find_first_extent_bit

Fix find_free_extent to make better use of the last_alloc hint.  Before it
was often finding blocks just before the hint.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason 9afbb0b752 Btrfs: Disable tree defrag in SSD mode
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason 5d196fc15d Btrfs: Use 2MB as the empty_size for clustered allocations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason 068fe39fa1 Btrfs: Add checks for last byte in disk to allocator grouping
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason f594706643 Btrfs: Add debugging for block group update failure
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason 60cde612c8 Btrfs: Use last_alloc optimizations for metadata, even without -o ssd
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason 21a4989d26 Btrfs: Hash in the offset and owner for file extent backref keys
This makes searches for backrefs and backref insertion much more efficient
when there are many backrefs for a single extent

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason 47e4bb988c Btrfs: Insert extent record and the first backref in a single balance
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason 4529ba495c Btrfs: Add data block hints to SSD mode too
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason 291d673e6a Btrfs: Do delalloc accounting via hooks in the extent_state code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason bea495e5b4 Btrfs: Tune readahead during defrag to avoid reading too much at once
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason d1310b2e0c Btrfs: Split the extent_map code into two parts
There is now extent_map for mapping offsets in the file to disk and
extent_io for state tracking, IO submission and extent_bufers.

The new extent_map code shifts from [start,end] pairs to [start,len], and
pushes the locking out into the caller.  This allows a few performance
optimizations and is easier to use.

A number of extent_map usage bugs were fixed, mostly with failing
to remove extent_map entries when changing the file.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason e18e4809b1 Btrfs: Add mount -o ssd, which includes optimizations for seek free storage
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason 55c69072d6 Btrfs: Fix extent_buffer usage when nodesize != leafsize
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason c31f8830f0 Btrfs: online shrinking fixes
While shrinking the FS, the allocation functions need to make sure
they don't try to allocate bytes past the end of the FS.

nodatacow needed an extra check to force cows when the existing extents are
past the end of the FS.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason b0331a4c4c Btrfs: Disable btree reada during extent backref lookups.
This reada is generally not effective.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason dc17ff8f11 Btrfs: Add data=ordered support
This forces file data extents down the disk along with the metadata that
references them.  The current implementation is fairly simple, and just
writes out all of the dirty pages in an inode before the commit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason 725c8463ea Btrfs: resizer: don't hold the fs_mutex for long periods of time
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason 1372f8e609 Properly call btrfs_search_slot while shrinking
The shrinking code used btrfs_next_leaf to find the next item, but
this does not cow the blocks it touches.  This fix calls search_slot after
finding the next item to do appropriate cow and balancing.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Yan 73e48b277a Btrfs: Properly handle overlapping extent in shrink_extent_tree
The patch fixes the overlapping extent issue in shrink_extent_tree.
It checks whether there is an overlapping extent by using
find_previous_extent. If there is an overlapping extent, it setups
key.objectid and cur_byte properly.

---

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Yan d548ee5182 Btrfs: Add a helper that finds previous extent item
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason bd09835d9a count_snapshots: Properly update the leaf pointer after btrfs_next_leaf
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 98ed51747b Btrfs: Force inlining off in a few places to save stack usage
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason f9ef6604ac Btrfs: 32 bit compile fixes for the resizer and enospc checks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 4313b3994d Btrfs: Reduce stack usage in the resizer, fix 32 bit compiles
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 56b453c92f Btrfs: Explicitly send a root objectid to count_snapshots_in_path
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 8f662a76c6 Btrfs: Add readahead to the online shrinker, and a mount -o alloc_start= for testing
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason e52ec0eb62 Btrfs: Fix NULL block groups on reading the inode
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason edbd8d4efe Btrfs: Support for online FS resize (grow and shrink)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason be20aa9dba Btrfs: Add mount option to turn off data cow
A number of workloads do not require copy on write data or checksumming.
mount -o nodatasum to disable checksums and -o nodatacow to disable
both copy on write and checksumming.

In nodatacow mode, copy on write is still performed when a given extent
is under snapshot.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason f6dbff55d7 Btrfs: Reorder extent back refs to differentiate btree blocks from file data
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 6caab489c5 Fix btrfs_inc_ref to add backref hints
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 70b043f0c7 Btrfs: Extra NULL block group checks in find_free_extent
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason d8d5f3e16d Btrfs: Add lowest key information to back refs for extent tree blocks as well.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 7bb86316c3 Btrfs: Add back pointers from extents to the btree or file referencing them
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 74493f7a59 Btrfs: Implement generation numbers in block pointers
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 1a2b2ac78a Btrfs: Fix extent allocation for btree blocks as the disk fills
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 87ee04eb0f Btrfs: Add simple stripe size parameter
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 00f5c795fc btrfs_drop_extents: make sure the item is getting smaller before truncate
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 015a739c7c Btrfs: Handle writeback under high memory pressure better
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 0e4de58432 Btrfs: Add check for null block group to find_search_start
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Yan 5cf664263b Btrfs: Off by one fixes for extent-tree.c
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Yan 5e5745dcaf Btrfs: Add full_scan parameter to find_search_start
This patch adds a new parameter 'full_scan' to 'find_search_start',
thereby 'find_search_start' can know whether 'find_free_extent' is in
full scan phrase. I feel that 'find_search_start' should skip calling
'btrfs_find_block_group' when 'find_free_extent' is in full scan
phrase. In my test on a 2GB volume, Oops occurs when space usage is
about 76%. After apply the patch,  Oops occurs when space usage is
near 100%.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Yan 324ae4df00 Btrfs: Add block group pinned accounting back
This patch adds a helper function 'update_pinned_extents' to
extent-tree.c. The usage of the helper function is similar to
'update_block_group',  the last parameter of the function indicates
pin vs unpin.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 257d0ce36f Btrfs: Allow large data extents in a single file to span into metadata block groups
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason f84a8b362d Btrfs: Optimize allocations as we need to mix data and metadata into one group
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Yan c549228ff6 Btrfs: Properly update free space cache in __free_extent
When pin_down_bytes decides not to pin a block because it was from the
current transaction, make sure the in memory cache of free extents is updated

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Yan b97f9203b4 Btrfs: Fix typo and memory leak in extent-tree.c
This patch fixes a typo in update_block_group and memory leak in
btrfs_free_block_groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Jens Axboe ae2f5411c4 btrfs: 32-bit type problems
An assorted set of casts to get rid of the warnings on 32-bit archs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 19c00ddcc3 Btrfs: Add back metadata checksumming
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 0f82731fc5 Breakout BTRFS_SETGET_FUNCS into a separate C file, the inlines were too big.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 4dc119046d Btrfs: Add an extent buffer LRU to reduce radix tree hits
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason e19caa5f0e Btrfs: Fix allocation routines to avoid intermixing data and metadata allocations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 6b80053d02 Btrfs: Add back the online defragging code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason db94535db7 Btrfs: Allow tree blocks larger than the page size
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 1a5bc167f6 Btrfs: Change the remaining radix trees used by extent-tree.c to extent_map trees
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 96b5179d0d Btrfs: Stop using radix trees for the block group cache
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason f510cfecfc Btrfs: Fix extent_buffer and extent_state leaks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 5f39d397df Btrfs: Create extent_buffer interface for large blocksizes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason cf67582bb2 Btrfs: Fix duplicate ENOSPC checks in find_free_extent
find_free_extent would fail to wrap around to the start of the drive because
it was doing the enospc case checking twice in some cases, causing it
to return -ENOSPC early.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason d3c2fdcf7b Btrfs: Use balance_dirty_pages_nr on btree blocks
btrfs_btree_balance_dirty is changed to pass the number of pages dirtied
for more accurate dirty throttling.  This lets the VM make better decisions
about when to force some writeback.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:00:48 -04:00
Yan 7d7d6068be Btrfs: Fix cache_block_group to catch holes at the start of the group
Cache block group was overly complex and missed free blocks at the very start
of the group.  This patch simplifies things significantly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-09-14 16:15:28 -04:00
Yan e9fe395e47 Btrfs: Fix oopsen in extent_tree.c during enospc
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-29 09:11:44 -04:00
Josef Bacik 58176a9604 Btrfs: Add per-root block accounting and sysfs entries
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-29 15:47:34 -04:00
Chris Mason b888db2bd7 Btrfs: Add delayed allocation to the extent based page tree code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-27 16:49:44 -04:00
Chris Mason 2cc58cf24f Btrfs: Do more extensive readahead during tree searches
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-27 16:49:44 -04:00
Chris Mason f2183bde1a Btrfs: Add BH_Defrag to mark buffers that are in need of defragging
This allows the tree walking code to defrag only the newly allocated
buffers, it seems to be a good balance between perfect defragging and the
performance hit of repeatedly reallocating blocks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-10 14:42:37 -04:00
Chris Mason e9d0b13b5b Btrfs: Btree defrag on the extent-mapping tree as well
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-10 14:06:19 -04:00
Chris Mason 409eb95d7f Btrfs: Further reduce the concurrency penalty of defrag and drop_snapshot
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-08 20:17:12 -04:00
Chris Mason 26b8003f10 Btrfs: Replace extent tree preallocation code with some bit radix magic.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-08 20:17:12 -04:00
Chris Mason f4468e94c8 Btrfs: Let some locks go during defrag and snapshot dropping
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-08 10:08:58 -04:00
Chris Mason 6702ed490c Btrfs: Add run time btree defrag, and an ioctl to force btree defrag
This adds two types of btree defrag, a run time form that tries to
defrag recently allocated blocks in the btree when they are still in ram,
and an ioctl that forces defrag of all btree blocks.

File data blocks are not defragged yet, but this can make a huge difference
in sequential btree reads.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-07 16:15:09 -04:00
Chris Mason 3c69faecb8 Btrfs: Fold some btree readahead routines into something more generic.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-07 15:52:22 -04:00
Chris Mason 9f3a742736 Btrfs: Do snapshot deletion in smaller chunks.
Before, snapshot deletion was a single atomic unit.  This caused considerable
lock contention and required an unbounded amount of space.  Now,
the drop_progress field in the root item is used to indicate how far along
snapshot deletion is, and to resume where it left off.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-08-07 15:52:19 -04:00
Zach Brown ec6b910fb3 Btrfs: trivial include fixups
Almost none of the files including module.h need to do so,
remove them.

Include sched.h in extent-tree.c to silence a warning about cond_resched()
being undeclared.

Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-07-11 10:00:37 -04:00
Chris Mason ccd467d60e Btrfs: crash recovery fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-28 15:57:36 -04:00
Chris Mason f2654de42a Btrfs: Allow find_free_extent callers to pass in an exclusion range
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-26 12:20:46 -04:00
Chris Mason 4b52dff6d3 Btrfs: Fix super block updates during transaction commit
The super block written during commit was not consistent with the state of
the trees.  This change adds an in-memory copy of the super so that we can
make sure to write out consistent data during a commit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-26 10:06:50 -04:00
Chris Mason 54aa1f4dfd Btrfs: Audit callers and return codes to make sure -ENOSPC gets up the stack
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-22 14:16:25 -04:00
Chris Mason e011599b0f Btrfs: reada while dropping snapshots
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-19 16:23:05 -04:00
Chris Mason 85e55b13e4 Btrfs: cache the extent tree preallocation
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-19 15:50:51 -04:00
Chris Mason 8c2383c3dd Subject: Rework btrfs_file_write to only allocate while page locks are held
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-18 09:57:58 -04:00
Aneesh f1ace244c8 btrfs: Code cleanup
Attaching below is some of the code cleanups that i came across while
reading the code.

a) alloc_path already calls init_path.
b) Mention that btrfs_inode is the in memory copy.Ext4 have ext4_inode_info as
the in memory copy ext4_inode as the disk copy

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-13 16:18:26 -04:00
Chris Mason 6cbd557078 Btrfs: add GPLv2
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-12 09:07:21 -04:00
Chris Mason 5af3981c18 Btrfs: printk fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-12 07:50:13 -04:00
Chris Mason 84f54cfa78 Btrfs: 64 bit div fixes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-12 07:43:08 -04:00
Chris Mason 5276aedab0 Btrfs: fix oops after block group lookup
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-11 21:33:38 -04:00
Chris Mason fabb568183 Btrfs: d_type optimization
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-06-07 22:13:21 -04:00
Chris Mason fbdc762b4e Btrfs: use a separate flag for search_start vs a hint in find_free_extent
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-30 10:22:12 -04:00
Chris Mason 1e2677e000 Btrfs: block group switching
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-29 16:52:18 -04:00
Chris Mason 3a68637562 Btrfs: sparse files!
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-24 13:35:57 -04:00
Chris Mason de428b63b1 Btrfs: allocator optimizations, truncate readahead
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-18 13:28:27 -04:00
Chris Mason 8d7be552a7 Btrfs: fix check_node and check_leaf to use less cpu
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2007-05-10 11:24:42 -04:00