Archived
14
0
Fork 0
Commit graph

14372 commits

Author SHA1 Message Date
Christoph Hellwig
e1696834e8 xfs: update max log size
Commit a6634fba3dec4a92f0a2c4e30c80b634c0576ad5 in xfsprogs increased the
maximum log size supported by mkfs.  Merged back the changes to xfs_fs.h
so the growfs enforced the same limit and the headers are in sync.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
2009-06-08 15:32:59 +02:00
Christoph Hellwig
21bea49594 fat: split fat_generic_ioctl
Split up fat_generic_ioctl and add separate functions for the two
implemented ioctls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
2009-06-08 21:39:50 +09:00
Artem Bityutskiy
f2c5dbd7b7 UBIFS: start using hrtimers
UBIFS uses timers for write-buffer write-back. It is not
crucial for us to write-back exactly on time. We are fine
to write-back a little earlier or later. And this means
we may optimize UBIFS timer so that it could be groped
with a close timer event, so that the CPU would not be
waken up just to do the write back. This is optimization
to lessen power consumption, which is important in
embedded devices UBIFS is used for.

hrtimers have a nice feature: they are effectively range
timers, and we may defind the soft and hard limits for
it. Standard timers do not have these feature. They may
only be made deferrable, but this means there is effectively
no hard limit. So, we will better use hrtimers.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-06-08 11:14:58 +03:00
Artem Bityutskiy
3f36406f26 UBIFS: do not forget to register BDI device
Reviewed-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-06-08 11:14:21 +03:00
Bartlomiej Zolnierkiewicz
db429e9ec0 partitions: add ->set_capacity block device method
* Add ->set_capacity block device method and use it in rescan_partitions()
  to attempt enabling native capacity of the device upon detecting the
  partition which exceeds device capacity.

* Add GENHD_FL_NATIVE_CAPACITY flag to try limit attempts of enabling
  native capacity during partition scan.

Together with the consecutive patch implementing ->set_capacity method in
ide-gd device driver this allows automatic disabling of Host Protected Area
(HPA) if any partitions overlapping HPA are detected.

Cc: Robert Hancock <hancockrwd@gmail.com>
Cc: Frans Pop <elendil@planet.nl>
Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Emphatically-Acked-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2009-06-07 13:52:52 +02:00
Bartlomiej Zolnierkiewicz
02c33b123e partitions: warn about the partition exceeding device capacity
The current warning message says only about the kernel's action taken
without mentioning the underlying reason behind it.

Noticed-by: Robert Hancock <hancockrwd@gmail.com>
Cc: Frans Pop <elendil@planet.nl>
Cc: "Andries E. Brouwer" <Andries.Brouwer@cwi.nl>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Emphatically-Acked-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2009-06-07 13:52:51 +02:00
Hugh Dickins
f07502dae2 integrity: fix IMA inode leak
CONFIG_IMA=y inode activity leaks iint_cache and radix_tree_node objects
until the system runs out of memory.  Nowhere is calling ima_inode_free()
a.k.a. ima_iint_delete().  Fix that by calling it from destroy_inode().

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-06 14:33:41 -07:00
Steve French
f0472d0ec8 [CIFS] Add mention of new mount parm (forceuid) to cifs readme
Also update fs/cifs/CHANGES

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-06 21:09:39 +00:00
Jeff Layton
4ae1507f6d cifs: make overriding of ownership conditional on new mount options
We have a bit of a problem with the uid= option. The basic issue is that
it means too many things and has too many side-effects.

It's possible to allow an unprivileged user to mount a filesystem if the
user owns the mountpoint, /bin/mount is setuid root, and the mount is
set up in /etc/fstab with the "user" option.

When doing this though, /bin/mount automatically adds the "uid=" and
"gid=" options to the share. This is fortunate since the correct uid=
option is needed in order to tell the upcall what user's credcache to
use when generating the SPNEGO blob.

On a mount without unix extensions this is fine -- you generally will
want the files to be owned by the "owner" of the mount. The problem
comes in on a mount with unix extensions. With those enabled, the
uid/gid options cause the ownership of files to be overriden even though
the server is sending along the ownership info.

This means that it's not possible to have a mount by an unprivileged
user that shows the server's file ownership info. The result is also
inode permissions that have no reflection at all on the server. You
simply cannot separate ownership from the mode in this fashion.

This behavior also makes MultiuserMount option less usable. Once you
pass in the uid= option for a mount, then you can't use unix ownership
info and allow someone to share the mount.

While I'm not thrilled with it, the only solution I can see is to stop
making uid=/gid= force the overriding of ownership on mounts, and to add
new mount options that turn this behavior on.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-06 21:03:27 +00:00
Ingo Molnar
75b5032212 Merge branch 'linus' into perfcounters/core
Merge reason: Pick up the latest fixes before the -v8 perfcounters
	      release.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-06 20:21:28 +02:00
Al Viro
72a43d63cb ext3/4 with synchronous writes gets wedged by Postfix
OK, that's probably the easiest way to do that, as much as I don't like it...
Since iget() et.al. will not accept I_FREEING (will wait to go away
and restart), and since we'd better have serialization between new/free
on fs data structures anyway, we can afford simply skipping I_FREEING
et.al. in insert_inode_locked().

We do that from new_inode, so it won't race with free_inode in any interesting
ways and it won't race with iget (of any origin; nfsd or in case of fs
corruption a lookup) since both still will wait for I_LOCK.

Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Tested-by: David Watson <dbwatson@ukfsn.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-06 06:17:26 -04:00
Theodore Ts'o
460bcf57b1 Fix nobh_truncate_page() to not pass stack garbage to get_block()
The nobh_truncate_page() function is used by ext2, exofs, and jfs.  Of
these three, only ext2 and jfs's get_block() function pays attention
to bh->b_size --- which is normally always the filesystem blocksize
except when the get_block() function is called by either
mpage_readpage(), mpage_readpages(), or the direct I/O routines in
fs/direct_io.c.

Unfortunately, nobh_truncate_page() does not initialize map_bh before
calling the filesystem-supplied get_block() function.  So ext2 and jfs
will try to calculate the number of blocks to map by taking stack
garbage and shifting it left by inode->i_blkbits.  This should be
*mostly* harmless (except the filesystem will do some unnneeded work)
unless the stack garbage is less than filesystem's blocksize, in which
case maxblocks will be zero, and the attempt to find out whether or
not the filesystem has a hole at a given logical block will fail, and
the page cache entry might not get zero'ed out.

Also if the stack garbage in in map_bh->state happens to have the
BH_Mapped bit set, there could be an attempt to call readpage() on a
non-existent page, which could cause nobh_truncate_page() to return an
error when it should not.

Fix this by initializing map_bh->state and map_bh->size.

Fortunately, it's probably fairly unlikely that ext2 and jfs users
mount with nobh these days.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-06-06 06:17:25 -04:00
Linus Torvalds
064e38aade Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: Fix oops and use after free during space balancing
  Btrfs: set device->total_disk_bytes when adding new device
2009-06-05 11:54:28 -07:00
Steven Whitehouse
f6d03139d7 GFS2: Fix locking issue mounting gfs2meta fs
This patch uses sget() to get a reference to the
existing gfs2 sb when mouting the gfs2meta filesystem
(in fact thats just another mount of the gfs2
filesystem with a different root and this interface
is for backward compatibility).

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Tested-by: Benjamin Marzinski <bmarzins@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
2009-06-05 08:35:15 +01:00
Aneesh Kumar K.V
f8514083cd ext4: truncate the file properly if we fail to copy data from userspace
In generic_perform_write if we fail to copy the user data we don't
update the inode->i_size.  We should truncate the file in the above
case so that we don't have blocks allocated outside inode->i_size.  Add
the inode to orphan list in the same transaction as block allocation
This ensures that if we crash in between the recovery would do the
truncate.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC:  Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-05 00:56:49 -04:00
Aneesh Kumar K.V
1938a150c2 ext4: Avoid leaking blocks after a block allocation failure
We should add inode to the orphan list in the same transaction
as block allocation.  This ensures that if we crash after a failed
block allocation and before we do a vmtruncate we don't leak block
(ie block marked as used in bitmap but not claimed by the inode).

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC:  Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-05 01:00:26 -04:00
Eric Sandeen
b31e15527a ext4: Change all super.c messages to print the device
This patch changes ext4 super.c to include the device name with all 
warning/error messages, by using a new utility function ext4_msg. 
It's a rather large patch, but very mechanic. I left debug printks
alone.

This is a straightforward port of a patch which Andi Kleen did for
ext3.

Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-04 17:36:36 -04:00
Jan Kara
03f5d8bcf0 ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle()
Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle(). This
seems to be a relict from some old days and setting disksize in this
function does not make much sense.  Currently it was set only by
ext4_getblk().  Since the parameter has some effect only if create ==
1, it is easy to check by grepping through the sources that the three
callers which end up calling ext4_getblk() with create == 1
(ext4_append, ext4_quota_write, ext4_mkdir) do the right thing and set
disksize themselves.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-09 00:17:05 -04:00
Jens Axboe
172124e220 Revert "block: implement blkdev_readpages"
This reverts commit db2dbb12dc.

It apparently causes problems with partition table read-ahead
on archs with large page sizes. Until that problem is diagnosed
further, just drop the readpages support on block devices.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-06-04 22:34:44 +02:00
Chris Mason
44fb551163 Btrfs: Fix oops and use after free during space balancing
The btrfs allocator uses list_for_each to walk the available block
groups when searching for free blocks.  It starts off with a hint
to help find the best block group for a given allocation.

The hint is resolved into a block group, but we don't properly check
to make sure the block group we find isn't in the middle of being
freed due to filesystem shrinking or balancing.  If it is being
freed, the list pointers in it are bogus and can't be trusted.  But,
the code happily goes along and uses them in the list_for_each loop,
leading to all kinds of fun.

The fix used here is to check to make sure the block group we find really
is on the list before we use it.  list_del_init is used when removing
it from the list, so we can do a proper check.

The allocation clustering code has a similar bug where it will trust
the block group in the current free space cluster.  If our allocation
flags have changed (going from single spindle dup to raid1 for example)
because the drives in the FS have changed, we're not allowed to use
the old block group any more.

The fix used here is to check the current cluster against the
current allocation flags.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-04 15:41:27 -04:00
Yan Zheng
2cc3c559fb Btrfs: set device->total_disk_bytes when adding new device
It was not being properly initialized, and so the size saved to
disk was not correct.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-04 09:23:57 -04:00
Tao Ma
06c59bb896 ocfs2: Remove redundant gotos in ocfs2_mount_volume()
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:20:15 -07:00
Joel Becker
73be192b17 ocfs2: Add statistics for the checksum and ecc operations.
It would be nice to know how often we get checksum failures.  Even
better, how many of them we can fix with the single bit ecc.  So, we add
a statistics structure.  The structure can be installed into debugfs
wherever the user wants.

For ocfs2, we'll put it in the superblock-specific debugfs directory and
pass it down from our higher-level functions.  The stats are only
registered with debugfs when the filesystem supports metadata ecc.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:15:36 -07:00
Srinivas Eeda
15633a220f ocfs2 patch to track delayed orphan scan timer statistics
Patch to track delayed orphan scan timer statistics.

Modifies ocfs2_osb_dump to print the following:
  Orphan Scan=> Local: 10  Global: 21  Last Scan: 67 seconds ago

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:31 -07:00
Srinivas Eeda
83273932fb ocfs2: timer to queue scan of all orphan slots
When a dentry is unlinked, the unlinking node takes an EX on the dentry lock
before moving the dentry to the orphan directory. Other nodes that have
this dentry in cache have a PR on the same dentry lock.  When the EX is
requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED
during downconvert.  The inode is finally deleted when the last node to iput
the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set.

A problem arises if a node is forced to free dentry locks because of memory
pressure. If this happens, the node will no longer get downconvert
notifications for the dentries that have been unlinked on another node.
If it also happens that node is actively using the corresponding inode and
happens to be the one performing the last iput on that inode, it will fail
to delete the inode as it will not have the MAYBE_ORPHANED flag set.

This patch fixes this shortcoming by introducing a periodic scan of the
orphan directories to delete such inodes. Care has been taken to distribute
the workload across the cluster so that no one node has to perform the task
all the time.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:31 -07:00
Jan Kara
edd45c0849 ocfs2: Correct ordering of ip_alloc_sem and localloc locks for directories
We use ordering ip_alloc_sem -> local alloc locks in ocfs2_write_begin().
So change lock ordering in ocfs2_extend_dir() and ocfs2_expand_inline_dir()
to also use this lock ordering.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:30 -07:00
Jan Kara
80d73f15d1 ocfs2: Fix possible deadlock in quota recovery
In ocfs2_finish_quota_recovery() we acquired global quota file lock and started
recovering local quota file. During this process we need to get quota
structures, which calls ocfs2_dquot_acquire() which gets global quota file lock
again. This second lock can block in case some other node has requested the
quota file lock in the mean time. Fix the problem by moving quota file locking
down into the function where it is really needed.  Then dqget() or dqput()
won't be called with the lock held.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:30 -07:00
Jan Kara
65bac575e3 ocfs2: Fix possible deadlock with quotas in ocfs2_setattr()
We called vfs_dq_transfer() with global quota file lock held. This can lead
to deadlocks as if vfs_dq_transfer() has to allocate new quota structure,
it calls ocfs2_dquot_acquire() which tries to get quota file lock again and
this can block if another node requested the lock in the mean time.

Since we have to call vfs_dq_transfer() with transaction already started
and quota file lock ranks above the transaction start, we cannot just rely
on ocfs2_dquot_acquire() or ocfs2_dquot_release() on getting the lock
if they need it. We fix the problem by acquiring pointers to all quota
structures needed by vfs_dq_transfer() already before calling the function.
By this we are sure that all quota structures are properly allocated and
they can be freed only after we drop references to them. Thus we don't need
quota file lock anywhere inside vfs_dq_transfer().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:29 -07:00
Jan Kara
b4c30de39a ocfs2: Fix lock inversion in ocfs2_local_read_info()
This function is called with dqio_mutex held but it has to acquire lock
from global quota file which ranks above this lock. This is not deadlockable
lock inversion since this code path is take only during mount when noone
else can race with us but let's clean this up to silence lockdep.

We just drop the dqio_mutex in the beginning of the function and reacquire
it in the end since we don't need it - noone can race with us at this moment.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:29 -07:00
Jan Kara
4e8a301929 ocfs2: Fix possible deadlock in ocfs2_global_read_dquot()
It is not possible to get a read lock and then try to get the same write lock
in one thread as that can block on downconvert being requested by other node
leading to deadlock. So first drop the quota lock for reading and only after
that get it for writing.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:28 -07:00
Andreas Dilger
0b8e58a140 ext4: super.c whitespace cleanup
Cleanup of whitespace and formatting.  Initially driven by confusing indents
for the ext4_{block,inode}_bitmap() et. al. helper routines, but figured I'd
cleanup some other 80-column wrapping and other indenting problems at the
same time.

Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-03 17:59:28 -04:00
Alberto Bertogli
bfcd3555af jbd2: Fix minor typos in comments in fs/jbd2/journal.c
Signed-off-by: Alberto Bertogli <albertito@blitiri.com.ar>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2009-06-09 00:06:20 -04:00
Denis Karpov
85c7859190 FAT: add 'errors' mount option
On severe errors FAT remounts itself in read-only mode. Allow to
specify FAT fs desired behavior through 'errors' mount option:
panic, continue or remount read-only.

`mount -t [fat|vfat] -o errors=[panic,remount-ro,continue] \
	<bdev> <mount point>`

This is analog to ext2 fs 'errors' mount option.

Signed-off-by: Denis Karpov <ext-denis.2.karpov@nokia.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
2009-06-04 02:34:51 +09:00
Steven Whitehouse
e09f9446b9 GFS2: Remove unused variable
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-06-03 10:07:44 +01:00
Linus Torvalds
4157fd85fc Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: prevent deadlock in xfs_qm_shake()
  xfs: fix overflow in xfs_growfs_data_private
  xfs: fix double unlock in xfs_swap_extents()
2009-06-02 09:47:21 -07:00
Jeff Layton
50b64e3b77 cifs: fix IPv6 address length check
For IPv6 the userspace mount helper sends an address in the "ip="
option.  This check fails if the length is > 35 characters. I have no
idea where the magic 35 character limit came from, but it's clearly not
enough for IPv6. Fix it by making it use the INET6_ADDRSTRLEN #define.

While we're at it, use the same #define for the address length in SPNEGO
upcalls.

Reported-by: Charles R. Anderson <cra@wpi.edu>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-06-02 15:45:40 +00:00
Artem Bityutskiy
8379ea31e9 UBIFS: allow sync option in rootflags
When passing UBIFS parameters via kernel command line, the
sync option will be passed to UBIFS as a string, not as an
MS_SYNCHRONOUS flag. Teach UBIFS interpreting this flag.

Reported-by: Aurélien GÉRÔME <ag@debian.org>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-06-02 11:08:07 +03:00
Abhijith Das
a12af1ebe6 GFS2: smbd proccess hangs with flock() call.
GFS2 currently does not support mandatory flocks. An flock() call with
LOCK_MAND triggers unexpected behavior because gfs2 is not checking for
this lock type. This patch corrects that.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-06-02 08:01:12 +01:00
Felix Blyakher
1b17d76646 xfs: prevent deadlock in xfs_qm_shake()
It's possible to recurse into filesystem from the memory
allocation, which deadlocks in xfs_qm_shake(). Add check
for __GFP_FS, and bail out if it is not set.

Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Hedi Berriche <hedi@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-01 22:59:45 -05:00
Eric Sandeen
e6da7c9fed xfs: fix overflow in xfs_growfs_data_private
In the case where growing a filesystem would leave the last AG
too small, the fixup code has an overflow in the calculation
of the new size with one fewer ag, because "nagcount" is a 32
bit number.  If the new filesystem has > 2^32 blocks in it
this causes a problem resulting in an EINVAL return from growfs:

 # xfs_io -f -c "truncate 19998630180864" fsfile
 # mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile
 # mount -o loop fsfile /mnt
 # xfs_growfs /mnt

meta-data=/dev/loop0             isize=256    agcount=52,
agsize=76288719 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3905982455, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument

Reported-by: richard.ems@cape-horn-eng.com
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-01 22:59:38 -05:00
Felix Blyakher
1f23920dbf xfs: fix double unlock in xfs_swap_extents()
Regreesion from commit ef8f7fc, which rearranged the code in
xfs_swap_extents() leading to double unlock of xfs inode ilock.
That resulted in xfs_fsr deadlocking itself on platforms, which
don't handle double unlock of rw_semaphore nicely. It caused the
count go negative, which represents the write holder, without
really having one. ia64 is one of the platforms where deadlock
was easily reproduced and the fix was tested.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-01 22:59:29 -05:00
Felix Blyakher
4156e735d3 xfs: prevent deadlock in xfs_qm_shake()
It's possible to recurse into filesystem from the memory
allocation, which deadlocks in xfs_qm_shake(). Add check
for __GFP_FS, and bail out if it is not set.

Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Hedi Berriche <hedi@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-06-01 13:13:24 -05:00
Ingo Molnar
23db9f430b Merge branch 'linus' into perfcounters/core
Merge reason: merge almost-rc8 into perfcounters/core, which was -rc6
              based - to pick up the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-06-01 10:01:39 +02:00
Linus Torvalds
b4566ac524 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix bh leak in nilfs_cpfile_delete_checkpoints function
2009-05-30 08:04:15 -07:00
Ryusuke Konishi
62013ab5d5 nilfs2: fix bh leak in nilfs_cpfile_delete_checkpoints function
The nilfs_cpfile_delete_checkpoints() wrongly skips brelse() for the
header block of checkpoint file in case of errors.  This fixes the
leak bug.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-30 22:07:50 +09:00
Linus Torvalds
3218911f83 Merge git://git.infradead.org/~dwmw2/mtd-2.6.30
* git://git.infradead.org/~dwmw2/mtd-2.6.30:
  jffs2: Fix corruption when flash erase/write failure
  mtd: MXC NAND driver fixes (v5)
2009-05-29 08:52:13 -07:00
Linus Torvalds
deeb103412 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
  Driver Core: do not oops when driver_unregister() is called for unregistered drivers
  sysfs: file.c: use create_singlethread_workqueue()
2009-05-29 08:49:52 -07:00
Linus Torvalds
c8bce3d3bd Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux:
  svcrdma: dma unmap the correct length for the RPCRDMA header page.
  nfsd: Revert "svcrpc: take advantage of tcp autotuning"
  nfsd: fix hung up of nfs client while sync write data to nfs server
2009-05-29 08:49:09 -07:00
Oskar Schirmer
c3dc5bec05 flat: fix data sections alignment
The flat loader uses an architecture's flat_stack_align() to align the
stack but assumes word-alignment is enough for the data sections.

However, on the Xtensa S6000 we have registers up to 128bit width
which can be used from userspace and therefor need userspace stack and
data-section alignment of at least this size.

This patch drops flat_stack_align() and uses the same alignment that
is required for slab caches, ARCH_SLAB_MINALIGN, or wordsize if it's
not defined by the architecture.

It also fixes m32r which was obviously kaput, aligning an
uninitialized stack entry instead of the stack pointer.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oskar Schirmer <os@emlix.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Bryan Wu <cooloney@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Johannes Weiner <jw@emlix.com>
Acked-by: Mike Frysinger <vapier.adi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-29 08:40:02 -07:00
KOSAKI Motohiro
bd6daba909 procfs: make errno values consistent when open pident vs exit(2) race occurs
proc_pident_instantiate() has following call flow.

proc_pident_lookup()
  proc_pident_instantiate()
    proc_pid_make_inode()

And, proc_pident_lookup() has following error handling.

	const struct pid_entry *p, *last;
	error = ERR_PTR(-ENOENT);
	if (!task)
		goto out_no_task;

Then, proc_pident_instantiate should return ENOENT too when racing against
exit(2) occur.

EINAL has two bad reason.
  - it implies caller is wrong. bad the race isn't caller's mistake.
  - man 2 open don't explain EINVAL. user often don't handle it.

Note: Other proc_pid_make_inode() caller already use ENOENT properly.

Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-29 08:40:02 -07:00
Artem Bityutskiy
428ff9d2e3 UBIFS: remove dead code
UBIFS assumes that @c->min_io_size is 8 in case of NOR flash. This
is because UBIFS alignes all nodes to 8-byte boundary, and maintaining
@c->min_io_size introduced unnecessary complications.

This patch removes senseless constructs like:

if (c->min_io_size == 1)
	NOR-specific code

Also, few commentaries amendments.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-29 14:38:37 +03:00
Joakim Tjernlund
81e2962801 jffs2: Fix corruption when flash erase/write failure
Erase errors such as:
"Newly-erased block contained word 0xa4ef223e at offset 0x0296a014"
and failure to write the clean marker,
moves the offending erase block to erasing list before calling
jffs2_erase_failed(). This is bad as jffs2_erase_failed() will
also move the block to the bad_list, but is now moving the
wrong block, causing FS corruption.

Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-05-29 10:44:46 +01:00
Andrew Morton
086a377edc sysfs: file.c: use create_singlethread_workqueue()
We don't need a kernel thread per CPU for this application.

Acked-by: Alex Chiang <achiang@hp.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-05-28 14:24:07 -07:00
Christoph Hellwig
b96d31a62f cifs: clean up set_cifs_acl interfaces
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 18:41:56 +00:00
Christoph Hellwig
1bf4072da6 cifs: reorganize get_cifs_acl
Thus spake Christoph:

"But this whole set_cifs_acl function is a real mess anyway and needs
some splitting up."

With this change too, it's possible to call acl_to_uid_mode() with a
NULL inode pointer. That (or something close to it) will eventually be
necessary when cifs_get_inode_info is reorganized.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 17:08:02 +00:00
Steve French
c5077ec423 [CIFS] Update readme to indicate change to default mount (serverino)
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 15:09:04 +00:00
Jeff Layton
a0c9217f64 cifs: make serverino the default when mounting
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 15:04:17 +00:00
Jeff Layton
bd433d4cf4 cifs: rename cifs_iget to cifs_root_iget
The current cifs_iget isn't suitable for anything but the root inode.
Rename it with a more appropriate name.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 14:57:29 +00:00
Jeff Layton
c4a2c08db7 cifs: make cnvrtDosUnixTm take a little-endian args and an offset
The callers primarily end up converting the args from le anyway. Also,
most of the callers end up needing to add an offset to the result. The
exception to these rules is cnvrtDosCifsTm, but there are no callers of
that function, so we might as well remove it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 14:57:20 +00:00
Jeff Layton
07119a4df8 cifs: have cifs_NTtimeToUnix take a little-endian arg
...and just have the function call le64_to_cpu.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-28 14:32:31 +00:00
Mimi Zohar
14dba5331b integrity: nfsd imbalance bug fix
An nfsd exported file is opened/closed by the kernel causing the
integrity imbalance message.

Before a file is opened, there normally is permission checking, which
is done in inode_permission().  However, as integrity checking requires
a dentry and mount point, which is not available in inode_permission(),
the integrity (permission) checking must be called separately.

In order to detect any missing integrity checking calls, we keep track
of file open/closes.  ima_path_check() increments these counts and
does the integrity (permission) checking. As a result, the number of
calls to ima_path_check()/ima_file_free() should be balanced.  An extra
call to fput(), indicates the file could have been accessed without first
calling ima_path_check().

In nfsv3 permission checking is done once, followed by multiple reads,
which do an open/close for each read.  The integrity (permission) checking
call should be in nfsd_permission() after the inode_permission() call, but
as there is no correlation between the number of permission checking and
open calls, the integrity checking call should not increment the counters,
but defer it to when the file is actually opened.

This patch adds:
- integrity (permission) checking for nfsd exported files in nfsd_permission().
- a call to increment counts for files opened by nfsd.

This patch has been updated to return the nfs error types.

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-28 09:32:43 +10:00
Wei Yongjun
a0d24b295a nfsd: fix hung up of nfs client while sync write data to nfs server
Commit 'Short write in nfsd becomes a full write to the client'
(31dec2538e) broken the sync write.
With the following commands to reproduce:

  $ mount -t nfs -o sync 192.168.0.21:/nfsroot /mnt
  $ cd /mnt
  $ echo aaaa > temp.txt

Then nfs client is hung up.

In SYNC mode the server alaways return the write count 0 to the
client. This is because the value of host_err in nfsd_vfs_write()
will be overwrite in SYNC mode by 'host_err=nfsd_sync(file);',
and then we return host_err(which is now 0) as write count.

This patch fixed the problem.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-27 17:40:06 -04:00
David Howells
911e690e70 CacheFiles: Fixup renamed filenames in comments in internal.h
Fix up renamed filenames in comments in fs/cachefiles/internal.h.

Originally, the files were all called cf-xxx.c, but they got renamed to
just xxx.c.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-27 10:20:13 -07:00
David Howells
348ca1029e FS-Cache: Fixup renamed filenames in comments in internal.h
Fix up renamed filenames in comments in fs/fscache/internal.h.

Originally, the files were all called fsc-xxx.c, but they got renamed to
just xxx.c.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-27 10:20:13 -07:00
Eric Sandeen
096324873f xfs: fix overflow in xfs_growfs_data_private
In the case where growing a filesystem would leave the last AG
too small, the fixup code has an overflow in the calculation
of the new size with one fewer ag, because "nagcount" is a 32
bit number.  If the new filesystem has > 2^32 blocks in it
this causes a problem resulting in an EINVAL return from growfs:

 # xfs_io -f -c "truncate 19998630180864" fsfile
 # mkfs.xfs -f -bsize=4096 -dagsize=76288719b,size=3905982455b fsfile
 # mount -o loop fsfile /mnt
 # xfs_growfs /mnt

meta-data=/dev/loop0             isize=256    agcount=52,
agsize=76288719 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=3905982455, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=4096   blocks=0, rtextents=0
xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: Invalid argument

Reported-by: richard.ems@cape-horn-eng.com
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-05-26 17:46:37 -05:00
Jeff Layton
f55ed1a83d cifs: tighten up default file_mode/dir_mode
The current default file mode is 02767 and dir mode is 0777. This is
extremely "loose". Given that CIFS is a single-user protocol, these
permissions allow anyone to use the mount -- in effect, giving anyone on
the machine access to the credentials used to mount the share.

Change this by making the default permissions restrict write access to
the default owner of the mount. Give read and execute permissions to
everyone else. These are the same permissions that VFAT mounts get by
default so there is some precedent here.

Note that this patch also removes the mandatory locking flags from the
default file_mode. After having looked at how these flags are used by
the kernel, I don't think that keeping them as the default offers any
real benefit. That flag combination makes it so that the kernel enforces
mandatory locking.

Since the server is going to do that for us anyway, I don't think we
want the client to enforce this by default on applications that just
want advisory locks. Anyone that does want this behavior can always
enable it by setting the file_mode appropriately.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-26 21:10:55 +00:00
Jeff Layton
46a7574caf cifs: fix artificial limit on reading symlinks
There's no reason to limit the size of a symlink that we can read to
4000 bytes. That may be nowhere near PATH_MAX if the server is sending
UCS2 strings. CIFS should be able to read in a symlink up to the size of
the buffer. The size of the header has already been accounted for when
creating the slabcache, so CIFSMaxBufSize should be the correct size to
pass in.

Fixes samba bug #6384.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-26 21:09:14 +00:00
Trond Myklebust
95baa25c73 NFSv4: Fix the case where NFSv4 renewal fails
If the asynchronous lease renewal fails (usually due to a soft timeout),
then we _must_ schedule state recovery in order to ensure that we don't
lose the lease unnecessarily or, if the lease is already lost, that we
recover the locking state promptly...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-05-26 14:51:00 -04:00
Sam Ravnborg
d0367a508a nfs: fix build error in nfsroot with initconst
fix build error with latest kbuild adjustments to initconst.

The commit a447c09324 ("vfs: Use
const for kernel parser table") changed:

    static match_table_t __initdata tokens = {
to
    static match_table_t __initconst tokens = {

But the missing const causes popwerpc to fail with latest
updates to __initconst like this:

fs/nfs/nfsroot.c:400: error: __setup_str_nfs_root_setup causes a section type conflict
fs/nfs/nfsroot.c:400: error: __setup_str_nfs_root_setup causes a section type conflict

The bug is only present with kbuild-next.
Following patch has been build tested.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-05-26 14:51:00 -04:00
Steven Whitehouse
f6eb53498e GFS2: Remove args subdir from gfs2 sysfs files
Since we can cat /proc/mounts there is no need to have this
subdirectory in the gfs2 sysfs files. In fact this does not
reflect the full range of possible mount argumenmts, where
as /proc/mounts does.

There was only one userland user of this set of sysfs files
and it will function perfectly well without these files
being present (in fact that subcommand of gfs2_tool is
obsolete anyway).

The tune/* subdirectory is also considered mostly obsolete,
but there are a few uses of this until mount arguments can
be added for the last few functions for which there are no
equivalents currently. However the tune/* directory is still
in my sights and new code should avoid using it. Only the gfs2_quota
and gfs2_tool programs are know to use tune/* at the moment.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-26 15:50:25 +01:00
Steven Whitehouse
e1b28aab58 GFS2: Remove lockstruct subdir from gfs2 sysfs files
The lockstruct sub directory contained two entries, both of
which are duplicated elsewhere in the gfs2 sysfs files as
well as being available via /proc/mounts. There is no userland program
using either of them, so this patch removes them.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-26 15:41:27 +01:00
Artem Bityutskiy
7c83f5cb55 UBIFS: use anonymous device
UBIFS has erroneuosly set 'sb->s_dev' to the UBI volume
character device major/minor. This may lead to clashes
if there is another FS mounted to a block device with
the same major/minor numbers. User-space programs which
use 'stat->st_dev' may get confused because of this.

This problem was found by Al Viro. He also pointed the
way to fix the problem - use 'set_anon_super()' and
'kill_anon_super()' VFS helpers.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-26 12:37:11 +03:00
Theodore Ts'o
88b6edd17c ext4: Clean up calls to ext4_get_group_desc()
If the caller isn't planning on modifying the block group descriptors,
there's no need to pass in a pointer to a struct buffer_head.  Nuking
this saves a tiny amount of CPU time and stack space usage.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-25 11:50:39 -04:00
Theodore Ts'o
759d427aa5 ext4: remove unused function __ext4_write_dirty_metadata
The __ext4_write_dirty_metadata() function was introduced by commit
0390131b, "ext4: Allow ext4 to run without a journal", but nothing
ever used the function, either then or since.  So let's remove it and
save a bit of space.

Cc: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-25 11:51:00 -04:00
Corentin Chary
8eec2f36fb UBIFS: return proper error code if the compr is not present
If the compressor is not present, mount_ubifs need
to return an error code. This way ubifs_fill_super
will stop and handle the error.

Signed-off-by: Corentin Chary <corentincj@iksaif.net>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-25 12:28:27 +03:00
Dave Kleikamp
79f52b77b8 jfs: Add missing mutex_unlock call to error path
Jan Kucera found an missing call to mutex_unlock() with his static code
checker.  It's an unlikely error path to hit in the real world, but it
should be fixed.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Reported-by: Jan Kucera <kucera.jan.cz@gmail.com>
2009-05-23 20:28:41 -05:00
Linus Torvalds
3eb9c8be0c Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  [CIFS] Avoid open on possible directories since Samba now rejects them
2009-05-23 13:42:53 -07:00
Steve French
8db14ca125 [CIFS] Avoid open on possible directories since Samba now rejects them
Small change (mostly formatting) to limit lookup based open calls to
file create only.

After discussion yesteday on samba-technical about the posix lookup
regression,  and looking at a problem with cifs posix open to one
particular Samba version, Jeff and JRA realized that Samba server's
behavior changed in this area (posix open behavior on files vs.
directories).   To make this behavior consistent, JRA just made a
fix to Samba server to alter how it handles open of directories (now
returning the equivalent of EISDIR instead of success). Since we don't
know at lookup time whether the inode is a directory or file (and
thus whether posix open will succeed with most current Samba server),
this change avoids the posix open code on lookup open (just issues
posix open on creates).    This gets the semantic benefits we want
(atomicity, posix byte range locks, improved write semantics on newly
created files) and file create still is fast, and we avoid the problem
that Jeff noticed yesterday with "openat" (and some open directory
calls) of non-cached directories to one version of Samba server, and
will work with future Samba versions (which include the fix jra just
pushed into Samba server).  I confirmed this approach with jra
yesterday and with Shirish today.

Posix open is only called (at lookup time) for file create now.
For opens (rather than creates), because we do not know if it
is a file or directory yet, and current Samba no longer allows
us to do posix open on dirs, we could end up wasting an open call
on what turns out to be a dir. For file opens, we wait to call posix
open till cifs_open.  It could be added here (lookup) in the future
but the performance tradeoff of the extra network request when EISDIR
or EACCES is returned would have to be weighed against the 50%
reduction in network traffic in the other paths.

Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Tested-by: Jeff Layton <jlayton@redhat.com>
CC: Jeremy Allison <jra@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-23 18:57:25 +00:00
Martin K. Petersen
c72758f337 block: Export I/O topology for block devices and partitions
To support devices with physical block sizes bigger than 512 bytes we
need to ensure proper alignment.  This patch adds support for exposing
I/O topology characteristics as devices are stacked.

  logical_block_size is the smallest unit the device can address.

  physical_block_size indicates the smallest I/O the device can write
  without incurring a read-modify-write penalty.

  The io_min parameter is the smallest preferred I/O size reported by
  the device.  In many cases this is the same as the physical block
  size.  However, the io_min parameter can be scaled up when stacking
  (RAID5 chunk size > physical block size).

  The io_opt characteristic indicates the optimal I/O size reported by
  the device.  This is usually the stripe width for arrays.

  The alignment_offset parameter indicates the number of bytes the start
  of the device/partition is offset from the device's natural alignment.
  Partition tools and MD/DM utilities can use this to pad their offsets
  so filesystems start on proper boundaries.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:55 +02:00
Martin K. Petersen
ae03bf639a block: Use accessor functions for queue limits
Convert all external users of queue limits to using wrapper functions
instead of poking the request queue variables directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:54 +02:00
Martin K. Petersen
e1defc4ff0 block: Do away with the notion of hardsect_size
Until now we have had a 1:1 mapping between storage device physical
block size and the logical block sized used when addressing the device.
With SATA 4KB drives coming out that will no longer be the case.  The
sector size will be 4KB but the logical block size will remain
512-bytes.  Hence we need to distinguish between the physical block size
and the logical ditto.

This patch renames hardsect_size to logical_block_size.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 23:22:54 +02:00
Jens Axboe
9bd7de51ee Merge branch 'master' into for-2.6.31
Conflicts:
	drivers/ide/ide-io.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 20:28:35 +02:00
Jens Axboe
e4b636366c Merge branch 'master' into for-2.6.31
Conflicts:
	drivers/block/hd.c
	drivers/block/mg_disk.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22 20:25:34 +02:00
Linus Torvalds
6a44587ee7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: fix memory leak in nilfs_ioctl_clean_segments
2009-05-22 08:41:13 -07:00
Ryusuke Konishi
d504685363 nilfs2: fix memory leak in nilfs_ioctl_clean_segments
This fixes a new memory leak problem in garbage collection.  The
problem was brought by the bugfix patch ("nilfs2: fix lock order
reversal in nilfs_clean_segments ioctl").

Thanks to Kentaro Suzuki for finding this problem.

Reported-by: Kentaro Suzuki <k_suzuki@ms.sylc.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-22 20:49:04 +09:00
Steven Whitehouse
87ec217411 GFS2: Move gfs2_unlink_ok into ops_inode.c
Another function which is only called from one ops_inode.c so
we move it and make it static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:54:50 +01:00
Steven Whitehouse
536baf02f6 GFS2: Move gfs2_readlinki into ops_inode.c
Move gfs2_readlinki into ops_inode.c and make it static

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:48:59 +01:00
Steven Whitehouse
2286dbfad1 GFS2: Move gfs2_rmdiri into ops_inode.c
Move gfs2_rmdiri() into ops_inode.c and make it static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:45:09 +01:00
Steven Whitehouse
9e6e0a128b GFS2: Merge mount.c and ops_super.c into super.c
mount.c only contained a single function, so is not really
worth retaining on its own. All of the super related code
is now either in super.c or ops_fstype.c

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:36:01 +01:00
Steven Whitehouse
b1e71b0622 GFS2: Clean up some file names
This patch renames the ops_*.c files which have no counterpart
without the ops_ prefix in order to shorten the name and make
it more readable. In addition, ops_address.h (which was very
small) is moved into inode.h and inode.h is cleaned up by
adding extern where required.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-22 10:01:55 +01:00
James Morris
2c9e703c61 Merge branch 'master' into next
Conflicts:
	fs/exec.c

Removed IMA changes (the IMA checks are now performed via may_open()).

Signed-off-by: James Morris <jmorris@namei.org>
2009-05-22 18:40:59 +10:00
Mimi Zohar
c9d9ac525a integrity: move ima_counts_get
Based on discussion on lkml (Andrew Morton and Eric Paris),
move ima_counts_get down a layer into shmem/hugetlb__file_setup().
Resolves drm shmem_file_setup() usage case as well.

HD comment:
  I still think you're doing this at the wrong level, but recognize
  that you probably won't be persuaded until a few more users of
  alloc_file() emerge, all wanting your ima_counts_get().

  Resolving GEM's shmem_file_setup() is an improvement, so I'll say

Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-22 09:45:33 +10:00
Mimi Zohar
b9fc745db8 integrity: path_check update
- Add support in ima_path_check() for integrity checking without
incrementing the counts. (Required for nfsd.)
- rename and export opencount_get to ima_counts_get
- replace ima_shm_check calls with ima_counts_get
- export ima_path_check

Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-22 09:43:41 +10:00
Steve French
703a3b8e5c [CIFS] fix posix open regression
Posix open code was not properly adding the file to the
list of open files.  Fix  allocating cifsFileInfo
more than once, and adding twice to flist and tlist.
Also fix mode setting to be done in one place in these
paths.

Signed-off-by: Steve French <sfrench@us.ibm.com>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Tested-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Luca Tettamanti <kronos.it@gmail.com>
2009-05-21 22:38:08 +00:00
Steven Whitehouse
1ce97e564b GFS2: Be more aggressive in reclaiming unlinked inodes
This patch increases the frequency with which gfs2 looks
for unlinked, but still allocated inodes. Its the equivalent
operation to ext3's orphan list, but done with bitmaps in
the resource groups.

This also fixes a bug where a field in the rgrp was too small.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-21 15:18:19 +01:00
Steven Whitehouse
60a0b8f936 GFS2: Add a rgrp bitmap full flag
During block allocation, it is useful to know if sections of disk
are full on a finer grained basis than a single resource group.
This can make a performance difference when resource groups have
larger numbers of bitmap blocks, since we no longer have to search
them all block by block in each individual bitmap.

The full flag is set on a per-bitmap basis when it has been
searched and found to have no free space. It is then skipped in
subsequent searches until the flag is reset. The resetting
occurs if we have to drop the glock on the resource group for any
reason, or if we deallocate some blocks within that resource
group and thus free up some space.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-21 12:23:12 +01:00
Linus Torvalds
929a8651f4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: fix pointer initialization and checks in cifs_follow_symlink (try #4)
2009-05-20 08:36:53 -07:00
Steven Whitehouse
0901097834 GFS2: Improve resource group error handling
This patch improves the error handling in the case where we
discover that the summary information in the resource group
doesn't match the bitmap information while in the process of
allocating blocks. Originally this resulted in a kernel bug,
but this patch changes that so that we return -EIO and print
some messages explaining what went wrong, and how to fix it.

We also remember locally not to try and allocate from the
same rgrp again, so that a subsequent allocation in a
different rgrp should succeed.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-20 10:48:47 +01:00
Jeff Layton
8b6427a2a8 cifs: fix pointer initialization and checks in cifs_follow_symlink (try #4)
This is the third respin of the patch posted yesterday to fix the error
handling in cifs_follow_symlink. It also includes a fix for a bogus NULL
pointer check in CIFSSMBQueryUnixSymLink that Jeff Moyer spotted.

It's possible for CIFSSMBQueryUnixSymLink to return without setting
target_path to a valid pointer. If that happens then the current value
to which we're initializing this pointer could cause an oops when it's
kfree'd.

This patch is a little more comprehensive than the last patches. It
reorganizes cifs_follow_link a bit for (hopefully) better readability.
It should also eliminate the uneeded allocation of full_path on servers
without unix extensions (assuming they can get to this point anyway, of
which I'm not convinced).

On a side note, I'm not sure I agree with the logic of enabling this
query even when unix extensions are disabled on the client. It seems
like that should disable this as well. But, changing that is outside the
scope of this fix, so I've left it alone for now.

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@inraded.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-19 15:31:20 +00:00
Steven Whitehouse
ef9e8b14a5 GFS2: Don't warn when delete inode fails on ro filesystem
If the filesystem is read-only, then we expect that delete inode
will fail, so there is no need to warn about it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-19 14:25:16 +01:00
Miklos Szeredi
b2858d7d16 splice: fix kmaps in default_file_splice_write()
Unfortunately multiple kmap() within a single thread are deadlockable,
so writing out multiple buffers with writev() isn't possible.

Change the implementation so that it does a separate write() for each
buffer.  This actually simplifies the code a lot since the
splice_from_pipe() helper can be used.

This limitation is caused by HIGHMEM pages, and so only affects a
subset of architectures and configurations.  In the future it may be
worth to implement default_file_splice_write() in a more efficient way
on configs that allow it.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-19 11:37:46 +02:00
Tejun Heo
4fc981ef9e bio: always copy back data for copied kernel requests
When a read bio_copy_kern() request fails, the content of the bounce
buffer is not copied back.  However, as request failure doesn't
necessarily mean complete failure, the buffer state can be useful.
This behavior is also inconsistent with the user map counterpart and
causes the subtle difference between bounced and unbounced IO causes
confusion.

This patch makes bio_copy_kern_endio() ignore @err and always copy
back data on request completion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-19 11:36:08 +02:00
Steven Whitehouse
fe64d517df GFS2: Umount recovery race fix
This patch fixes a race condition where we can receive recovery
requests part way through processing a umount. This was causing
problems since the recovery thread had already gone away.

Looking in more detail at the recovery code, it was really trying
to implement a slight variation on a work queue, and that happens to
align nicely with the recently introduced slow-work subsystem. As a
result I've updated the code to use slow-work, rather than its own home
grown variety of work queue.

When using the wait_on_bit() function, I noticed that the wait function
that was supplied as an argument was appearing in the WCHAN field, so
I've updated the function names in order to produce more meaningful
output.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-19 10:01:18 +01:00
Hunter Adrian
8b3884a841 UBIFS: return error if link and unlink race
Consider a scenario when 'vfs_link(dirA/fileA)' and
'vfs_unlink(dirA/fileA, dirB/fileB)' race. 'vfs_link()' does not
lock 'dirA->i_mutex', so this is possible. Both of the functions
lock 'fileA->i_mutex' though. Suppose 'vfs_unlink()' wins, and takes
'fileA->i_mutex' mutex first. Suppose 'fileA->i_nlink' is 1. In this
case 'ubifs_unlink()' will drop the last reference, and put 'inodeA'
to the list of orphans. After this, 'vfs_link()' will link
'dirB/fileB' to 'inodeA'. Thir is a problem because, for example,
the subsequent 'vfs_unlink(dirB/fileB)' will add the same inode
to the list of orphans.

This problem was reported by J. R. Okajima <hooanon05@yahoo.co.jp>

[Artem: add more comments, amended commit message]

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-19 11:01:31 +03:00
Frank Filz
7ee2cb7f32 nfs: Fix NFS v4 client handling of MAY_EXEC in nfs_permission.
The problem is that permission checking is skipped if atomic open is
possible, but when exec opens a file, it just opens it O_READONLY which
means EXEC permission will not be checked at that time.

This problem is observed by the following sequence (executed as root):

  mount -t nfs4 server:/ /mnt4
  echo "ls" >/mnt4/foo
  chmod 744 /mnt4/foo
  su guest -c "mnt4/foo"

Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
Tested-by: Eugene Teo <eugeneteo@kernel.sg>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-18 20:11:12 -07:00
Ingo Molnar
dc3f81b129 Merge commit 'v2.6.30-rc6' into perfcounters/core
Merge reason: this branch was on an -rc4 base, merge it up to -rc6
              to get the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-05-18 07:37:49 +02:00
Manish Katiyar
0f7ee7c172 ext2: Fix memory leak in ext2_fill_super() in case of a failed mount
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:52:51 -04:00
Manish Katiyar
de5ce03730 ext3: Fix memory leak in ext3_fill_super() in case of a failed mount
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:52:47 -04:00
Manish Katiyar
f68301656b ext4: Fix memory leak in ext4_fill_super() in case of a failed mount
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:52:44 -04:00
Theodore Ts'o
0568c51893 ext4: down i_data_sem only for read when walking tree for fiemap
Not sure why I put this in as down_write originally; all we are
doing is walking the tree, nothing will change under us and
concurrent reads should be no problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 23:31:23 -04:00
Theodore Ts'o
6fd058f779 ext4: Add a comprehensive block validity check to ext4_get_blocks()
To catch filesystem bugs or corruption which could lead to the
filesystem getting severly damaged, this patch adds a facility for
tracking all of the filesystem metadata blocks by contiguous regions
in a red-black tree.  This allows quick searching of the tree to
locate extents which might overlap with filesystem metadata blocks.

This facility is also used by the multi-block allocator to assure that
it is not allocating blocks out of the system zone, as well as by the
routines used when reading indirect blocks and extents information
from disk to make sure their contents are valid.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-17 15:38:01 -04:00
Jeff Mahoney
b83674c0da reiserfs: fixup perms when xattrs are disabled
This adds CONFIG_REISERFS_FS_XATTR protection from reiserfs_permission.

This is needed to avoid warnings during file deletions and chowns with
xattrs disabled.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17 11:45:45 -07:00
Jeff Mahoney
ceb5edc457 reiserfs: deal with NULL xattr root w/ xattrs disabled
This avoids an Oops in open_xa_root that can occur when deleting a file
with xattrs disabled.  It assumes that the xattr root will be there, and
that is not guaranteed.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17 11:45:45 -07:00
Jeff Mahoney
12abb35a03 reiserfs: clean up ifdefs
With xattr cleanup even with xattrs disabled, much of the initial setup
is still performed.  Some #ifdefs are just not needed since the options
they protect wouldn't be available anyway.

This cleans those up.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17 11:45:45 -07:00
David Teigland
748285ccf7 dlm: use more NOFS allocation
Change some GFP_KERNEL allocations to use either GFP_NOFS or
ls_allocation (when available) which the fs sets to GFP_NOFS.
The point is to prevent allocations from going back into the
cluster fs in places where that might lead to deadlock.

Signed-off-by: David Teigland <teigland@redhat.com>
2009-05-15 11:24:59 -05:00
Linus Torvalds
5d41343ac8 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix race in ext4_inode_info.i_cached_extent
  ext4: Clear the unwritten buffer_head flag after the extent is initialized
  ext4: Use a fake block number for delayed new buffer_head
  ext4: Fix sub-block zeroing for writes into preallocated extents
2009-05-15 08:07:25 -07:00
Sukadev Bhattiprolu
1f71ebedb3 devpts: correctly set default options
devpts_get_sb() calls memset(0) to clear mount options and calls
parse_mount_options() if user specified any mount options.

The memset(0) is bogus since the 'mode' and 'ptmxmode' options are
non-zero by default.  parse_mount_options() restores options to default
anyway and can properly deal with NULL mount options.

So in devpts_get_sb() remove memset(0) and call parse_mount_options() even
for NULL mount options.

Bug reported by Eric Paris: http://lkml.org/lkml/2009/5/7/448.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
Reported-by: Eric Paris <eparis@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Reviewed-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-15 08:03:23 -07:00
Christine Caulfield
391fbdc5d5 dlm: connect to nodes earlier
Make network connections to other nodes earlier, in the context of
dlm_recoverd.  This avoids connecting to nodes from dlm_send where we
try to avoid allocations which could possibly deadlock if memory reclaim
goes into the cluster fs which may try to do a dlm operation.

Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2009-05-15 09:34:12 -05:00
Thomas Gleixner
2d02494f5a sched, timers: cleanup avenrun users
avenrun is an rough estimate so we don't have to worry about
consistency of the three avenrun values. Remove the xtime lock
dependency and provide a function to scale the values. Cleanup the
users.

[ Impact: cleanup ]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
2009-05-15 15:32:45 +02:00
Theodore Ts'o
2ec0ae3ace ext4: Fix race in ext4_inode_info.i_cached_extent
If two CPU's simultaneously call ext4_ext_get_blocks() at the same
time, there is nothing protecting the i_cached_extent structure from
being used and updated at the same time.  This could potentially cause
the wrong location on disk to be read or written to, including
potentially causing the corruption of the block group descriptors
and/or inode table.

This bug has been in the ext4 code since almost the very beginning of
ext4's development.  Fortunately once the data is stored in the page
cache cache, ext4_get_blocks() doesn't need to be called, so trying to
replicate this problem to the point where we could identify its root
cause was *extremely* difficult.  Many thanks to Kevin Shanahan for
working over several months to be able to reproduce this easily so we
could finally nail down the cause of the corruption.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
2009-05-15 09:07:28 -04:00
Linus Torvalds
bd67ce0f66 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
  cifs: fix error handling in parse_DFS_referrals
2009-05-14 19:20:04 -07:00
Linus Torvalds
5732c46849 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
  Btrfs: make show_options result match actual option names
  Btrfs: remove outdated comment in btrfs_ioctl_resize()
  Btrfs: remove some WARN_ONs in the IO failure path
  Btrfs: Don't loop forever on metadata IO failures
  Btrfs: init inode ordered_data_close flag properly
2009-05-14 19:18:44 -07:00
Aneesh Kumar K.V
2a8964d63d ext4: Clear the unwritten buffer_head flag after the extent is initialized
The BH_Unwritten flag indicates that the buffer is allocated on disk
but has not been written; that is, the disk was part of a persistent
preallocation area.  That flag should only be set when a get_blocks()
function is looking up a inode's logical to physical block mapping.

When ext4_get_blocks_wrap() is called with create=1, the uninitialized
extent is converted into an initialized one, so the BH_Unwritten flag
is no longer appropriate.  Hence, we need to make sure the
BH_Unwritten is not left set, since the combination of BH_Mapped and
BH_Unwritten is not allowed; among other things, it will result ext4's
get_block() to be called over and over again during the write_begin
phase of write(2).

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 17:05:39 -04:00
Sankar P
9f55684c2d Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
Signed-off-by: Sankar P <sankar.curiosity@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:34 -04:00
Sage Weil
6b65c5c61b Btrfs: make show_options result match actual option names
The notreelog and flushoncommit mount options were being printed slightly
differently.

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:34 -04:00
Li Hong
5d847a8ed9 Btrfs: remove outdated comment in btrfs_ioctl_resize()
In Li Zefan's commit dae7b665cf,
a combination call of kmalloc() and copy_from_user() is replaced by
memdup_user(). So btrfs_ioctl_resize() doesn't use GFP_NOFS any more.

Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:33 -04:00
Chris Mason
cc7b0c9b70 Btrfs: remove some WARN_ONs in the IO failure path
These debugging WARN_ONs make too much console noise during regular
IO failures.  An IO failure will still generate a number of messages
as we verify checksums etc, but these two are not needed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:33 -04:00
Chris Mason
76a05b35a3 Btrfs: Don't loop forever on metadata IO failures
When a btrfs metadata read fails, the first thing we try to do is find
a good copy on another mirror of the block.  If this fails, read_tree_block()
ends up returning a buffer that isn't up to date.

The btrfs btree reading code was reworked to drop locks and repeat
the search when IO was done, but the changes didn't add a check for failed
reads.  The end result was looping forever on buffers that were never
going to become up to date.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:32 -04:00
Chris Mason
2757495c90 Btrfs: init inode ordered_data_close flag properly
This flag is used to decide when we need to send a given file through
the ordered code to make sure it is fully written before a transaction
commits.  It was not being properly set to zero when the inode was
being setup.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-05-14 14:00:31 -04:00
Theodore Ts'o
2ac3b6e00a ext4: Clean up ext4_get_blocks() so it does not depend on bh_result->b_state
The ext4_get_blocks() function was depending on the value of
bh_result->b_state as an input parameter to decide whether or not
update the delalloc accounting statistics by calling
ext4_da_update_reserve_space().  We now use a separate flag,
EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE, to requests this update, so that
all callers of ext4_get_blocks() can clear map_bh.b_state before
calling ext4_get_blocks() without worrying about any consistency
issues.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 13:57:08 -04:00
Jeff Layton
d8e2f53ac9 cifs: fix error handling in parse_DFS_referrals
cifs_strndup_from_ucs returns NULL on error, not an ERR_PTR

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-14 13:55:32 +00:00
Theodore Ts'o
2fa3cdfb31 ext4: Merge ext4_da_get_block_write() into mpage_da_map_blocks()
The static function ext4_da_get_block_write() was only used by
mpage_da_map_blocks().  So to simplify the code, merge that function
into mpage_da_map_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 09:29:45 -04:00
Andrew Morton
77f6bf57ba splice: fix error return code
fs/splice.c: In function 'default_file_splice_read':
fs/splice.c:566: warning: 'error' may be used uninitialized in this function

which is sort-of true.  The code will in fact return -ENOMEM instead of the
kernel_readv() return value.

Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-14 09:49:44 +02:00
Linus Torvalds
210af919c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus
* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus:
  Squashfs: cody tidying, remove commented out line in Makefile
  Squashfs: check page size is not larger than the filesystem block size
  Squashfs: fix breakage when page size > metadata block size
2009-05-13 16:33:25 -07:00
Linus Torvalds
a6aeeebf51 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: destroy bdi on error
2009-05-13 16:32:57 -07:00
Linus Torvalds
014049a1c7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
  nilfs2: check size of array structured data exchanged via ioctls
  nilfs2: fix lock order reversal in nilfs_clean_segments ioctl
  nilfs2: fix possible circular locking for get information ioctls
  nilfs2: ensure to clear dirty state when deleting metadata file block
  nilfs2: fix circular locking dependency of writer mutex
  nilfs2: fix possible recovery failure due to block creation without writer
2009-05-13 16:32:16 -07:00
Mel Gorman
f2deae9d4e Remove implementation of readpage from the hugetlbfs_aops
The core VM assumes the page size used by the address_space in
inode->i_mapping is PAGE_SIZE but hugetlbfs breaks this assumption by
inserting pages into the page cache at offsets the core VM considers
unexpected.

This would not be a problem except that hugetlbfs also provide a
->readpage implementation.  As it exists, the core VM can assume the
base page size is being used, allocate pages on behalf of the
filesystem, insert them into the page cache and call ->readpage to
populate them.  These pages are the wrong size and at the wrong offset
for hugetlbfs causing confusion.

This patch deletes the ->readpage implementation for hugetlbfs on the
grounds the core VM should not be allocating and populating pages on
behalf of hugetlbfs.  There should be no existing users of the
->readpage implementation so it should not cause a regression.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-13 08:04:45 -07:00
Steven Whitehouse
9582d41135 GFS2: Remove a couple of unused sysfs entries
These two tunables are pointless and would never need to be
changed anyway. There is also a race between them and umount
as the deamons which they refer to might have gone away. The
easiest way to fix the race is to remove the interface.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-13 15:06:25 +01:00
Steven Whitehouse
48c2b61361 GFS2: Add commit= mount option
It has always been possible to adjust the gfs2 log commit
interval, but only from the sysfs interface. This adds a
mount option, commit=<nn>, which will be familar to ext3
users.

The sysfs interface continues to be available as well, although
this might be removed in the future.

Also this patch cleans up some duplicated structures in the GFS2
sysfs code.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-13 14:49:48 +01:00
Steven Whitehouse
a1c0643ff9 GFS2: Move journal live test at transaction start
There seems little point grabbing the transaction glock
only to have to release it again if the journal isn't
live. This moves the test earlier to avoid grabbing the lock
when we don't need it in the first place.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-13 10:56:52 +01:00
Jens Axboe
4f23122858 splice: fix repeated kmap()'s in default_file_splice_read()
We cannot reliably map more than one page at the time, or we risk
deadlocking. Just allocate the pages from low mem instead.

Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-13 08:35:35 +02:00
Phillip Lougher
e5d287539d Squashfs: cody tidying, remove commented out line in Makefile
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2009-05-13 03:25:20 +01:00
Phillip Lougher
fffb47b80e Squashfs: check page size is not larger than the filesystem block size
Normally the block size (by default 128K) will be larger than the
page size, unless a non-standard block size has been specified in
Mksquashfs, and the page size is larger than 4K.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2009-05-13 02:59:26 +01:00
Doug Chapman
a37b06d589 Squashfs: fix breakage when page size > metadata block size
Squashfs is broken on any system where the page size is larger than
the metadata size (8192).  This is easily fixed by ensuring cache->pages
is always > 0.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Doug Chapman <doug.chapman@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
2009-05-13 02:56:39 +01:00
Linus Torvalds
2ea3f86848 Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux:
  nfsd: silence lockdep warning
  lockd: fix list corruption on lockd restart
  nfsd4: check for negative dentry before use in nfsv4 readdir
  nfsd41: slots are freed with session
  svcrdma: clean up error paths.
  svcrdma: Fix dma map direction for rdma read targets
2009-05-12 17:11:56 -07:00
Davide Libenzi
bfe3891a5f epoll: fix size check in epoll_create()
Fix a size check WRT the manual pages.  This was inadvertently broken by
commit 9fe5ad9c8c ("flag parameters
add-on: remove epoll_create size param").

Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <Hiroyuki.Mach@gmail.com>
Cc: rohit verma <rohit.170309@gmail.com>
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-12 14:11:35 -07:00
Aneesh Kumar K.V
33b9817e2a ext4: Use a fake block number for delayed new buffer_head
Use a very large unsigned number (~0xffff) as as the fake block number
for the delayed new buffer. The VFS should never try to write out this
number, but if it does, this will make it obvious.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 14:40:37 -04:00
Aneesh Kumar K.V
9c1ee184a3 ext4: Fix sub-block zeroing for writes into preallocated extents
We need to mark the buffer_head mapping preallocated space as new
during write_begin. Otherwise we don't zero out the page cache content
properly for a partial write. This will cause file corruption with
preallocation.

Now that we mark the buffer_head new we also need to have a valid
buffer_head blocknr so that unmap_underlying_metadata() unmaps the
correct block.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-13 18:36:58 -04:00
Theodore Ts'o
a2dc52b5d1 ext4: Add BUG_ON debugging checks to noalloc_get_block_write()
Enforce that noalloc_get_block_write() is only called to map one block
at a time, and that it always is successful in finding a mapping for
given an inode's logical block block number if it is called with
create == 1.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 13:51:29 -04:00
Theodore Ts'o
b920c75502 ext4: Add documentation to the ext4_*get_block* functions
This adds more documentation to various internal functions in
fs/ext4/inode.c, most notably ext4_ind_get_blocks(),
ext4_da_get_block_write(), ext4_da_get_block_prep(),
ext4_normal_get_block_write().

In addition, the static function ext4_normal_get_block_write() has
been renamed noalloc_get_block_write(), since it is used in many
places far beyond ext4_normal_writepage().

Plenty of warnings have been added to the noalloc_get_block_write()
function, since the way it is used is amazingly fragile.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 00:54:29 -04:00
Theodore Ts'o
c217705733 ext4: Define a new set of flags for ext4_get_blocks()
The functions ext4_get_blocks(), ext4_ext_get_blocks(), and
ext4_ind_get_blocks() used an ad-hoc set of integer variables used as
boolean flags passed in as arguments.  Use a single flags parameter
and a setandard set of bitfield flags instead.  This saves space on
the call stack, and it also makes the code a bit more understandable.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 00:58:52 -04:00
Theodore Ts'o
12b7ac1768 ext4: Rename ext4_get_blocks_wrap() to be ext4_get_blocks()
Another function rename for clarity's sake.  The _wrap prefix simply
confuses people, and didn't add much people trying to follow the code
paths.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-14 00:57:44 -04:00
Abhijith Das
7537d81aa7 GFS2: Fix timestamps on write
This patch copies the timestamps from the vfs inode into gfs2 and syncs
it to the disk inode during writes.

Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-12 16:14:05 +01:00
Theodore Ts'o
e4d996ca80 ext4: Rename ext4_get_blocks_handle() to be ext4_ind_get_blocks()
The static function ext4_get_blocks_handle() is badly named.  Of
*course* it takes a handle.  Since its counterpart for extent-based
file is ext4_ext_get_blocks(), rename it to be ext4_ind_get_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 00:25:28 -04:00
Theodore Ts'o
f888e652d7 ext4: Simplify function signature for ext4_da_get_block_write()
The function ext4_da_get_block_write() is called in exactly one write,
and the last argument, create, is always 1.  Remove it to simplify the
code slightly.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 00:21:29 -04:00
Vincent Minet
bc8e67409c ext4: Fix spinlock assertions on UP systems
On UP systems without DEBUG_SPINLOCK, ext4_is_group_locked always fails
which triggers a BUG_ON() call.
This patch fixes it by using assert_spin_locked instead.

Signed-off-by: Vincent Minet <vincent@vincent-minet.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-15 08:33:18 -04:00
J. Bruce Fields
8daed1e549 nfsd: silence lockdep warning
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-11 17:23:14 -04:00
Jeff Mahoney
2b79bc4f7e dup2: Fix return value with oldfd == newfd and invalid fd
The return value of dup2 when oldfd == newfd and the fd isn't valid is
not getting properly sign extended.  We end up with 4294967287 instead
of -EBADF.

I've reproduced this on SLE11 (2.6.27.21), openSUSE Factory
(2.6.29-rc5), and Ubuntu 9.04 (2.6.28).

This patch uses a signed int for the error value so it is properly
extended.

Commit 6c5d0512a0 introduced this
regression.

Reported-by: Jiri Dluhos <jdluhos@novell.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-11 12:18:06 -07:00
Ryusuke Konishi
83aca8f480 nilfs2: check size of array structured data exchanged via ioctls
Although some ioctls of nilfs2 exchange data in the form of indirectly
referenced array, some of them lack size check on the array elements.

This inserts the missing checks and rejects requests if data of ioctl
does not have a valid format.

We usually don't have to check size of structures that we associated
with ioctl commands because the size is tested implicitly for
identifying ioctl command; the checks this patch adds are for the
cases where the implicit check is not applied.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-12 01:48:54 +09:00
Miklos Szeredi
0b0a47f5c4 splice: implement default splice_write method
If f_op->splice_write() is not implemented, fall back to a plain write.
Use vfs_writev() to write from the pipe buffers.

This will allow splice on all filesystems and file types.  This
includes "direct_io" files in fuse which bypass the page cache.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 14:13:10 +02:00
Miklos Szeredi
6818173bd6 splice: implement default splice_read method
If f_op->splice_read() is not implemented, fall back to a plain read.
Use vfs_readv() to read into previously allocated pages.

This will allow splice and functions using splice, such as the loop
device, to work on all filesystems.  This includes "direct_io" files
in fuse which bypass the page cache.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 14:13:10 +02:00
Miklos Szeredi
7c77f0b3f9 splice: implement pipe to pipe splicing
Allow splice(2) to work when both the input and the output is a pipe.

Based on the impementation of the tee(2) syscall, but instead of
duplicating the buffer references move the buffers from the input pipe
to the output pipe.

Moving the whole buffer only succeeds if the full length of the buffer
is spliced.  Otherwise duplicate the buffer, just like tee(2), set the
length of the output buffer and advance the offset on the input
buffer.

Since splice is operating on two pipes, special care needs to be taken
with locking to prevent AN ABBA deadlock.  Again this is done
similarly to the tee(2) syscall, first preparing the input and output
pipes so there's data to consume and space for that data, and then
doing the move operation while holding both locks.

If other processes are doing I/O on the same pipes parallel to the
splice, then by the time both inodes are locked there might be no
buffers left to move, or no space to move them to.  In this case retry
the whole operation, including the preparation phase.  This could lead
to starvation, but I'm not sure if that's serious enough to worry
about.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 14:13:09 +02:00
Steven Whitehouse
48bf2b1711 GFS2: Something nonlinear this way comes!
For some reason GFS2 has been missing support for non-linear
mappings. This patch fixes that, and also avoids taking any
locks for mmap in the O_NOATIME case. In fact we don't actually need
to take the lock here at all - just doing file_accessed() would be
enough, but we have to take the lock eventually and this helps
it hit disk (and thus be seen by other nodes) faster.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-11 12:36:44 +01:00
Steven Whitehouse
4a0f9a321a GFS2: Optimise writepage for metadata
This adds a GFS2 specific writepage for metadata, rather than
continuing to use the VFS function. As a result we now tag all
our metadata I/O with the correct flag so that blktraces will
now be less confusing.

Also, the generic function was checking for a number of corner
cases which cannot happen on the metadata address spaces so that
this should be faster too.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-11 12:36:43 +01:00
Steven Whitehouse
c969f58ca4 GFS2: Update the rw flags
After Jens recent updates:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=a1f242524c3c1f5d40f1c9c343427e34d1aadd6e
et al. this is a patch to bring gfs2 uptodate with the core
code. Also I've managed to squash another call to ll_rw_block()
along the way.

There is still one part of the GFS2 I/O paths which are not correctly
annotated and that is due to the sharing of the writeback code between
the data and metadata address spaces. I would like to change that too,
but this patch is still worth doing on its own, I think.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-11 12:36:41 +01:00
Tejun Heo
c3a4d78c58 block: add rq->resid_len
rq->data_len served two purposes - the length of data buffer on issue
and the residual count on completion.  This duality creates some
headaches.

First of all, block layer and low level drivers can't really determine
what rq->data_len contains while a request is executing.  It could be
the total request length or it coulde be anything else one of the
lower layers is using to keep track of residual count.  This
complicates things because blk_rq_bytes() and thus
[__]blk_end_request_all() relies on rq->data_len for PC commands.
Drivers which want to report residual count should first cache the
total request length, update rq->data_len and then complete the
request with the cached data length.

Secondly, it makes requests default to reporting full residual count,
ie. reporting that no data transfer occurred.  The residual count is
an exception not the norm; however, the driver should clear
rq->data_len to zero to signify the normal cases while leaving it
alone means no data transfer occurred at all.  This reverse default
behavior complicates code unnecessarily and renders block PC on some
drivers (ide-tape/floppy) unuseable.

This patch adds rq->resid_len which is used only for residual count.

While at it, remove now unnecessasry blk_rq_bytes() caching in
ide_pc_intr() as rq->data_len is not changed anymore.

Boaz	: spotted missing conversion in osd
Sergei	: spotted too early conversion to blk_rq_bytes() in ide-tape

[ Impact: cleanup residual count handling, report 0 resid by default ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Darrick J. Wong <djwong@us.ibm.com>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-11 09:50:53 +02:00
Ryusuke Konishi
4f6b828837 nilfs2: fix lock order reversal in nilfs_clean_segments ioctl
This is a companion patch to ("nilfs2: fix possible circular locking
for get information ioctls").

This corrects lock order reversal between mm->mmap_sem and
nilfs->ns_segctor_sem in nilfs_clean_segments() which was detected by
lockdep check:

 =======================================================
 [ INFO: possible circular locking dependency detected ]
 2.6.30-rc3-nilfs-00003-g360bdc1 #7
 -------------------------------------------------------
 mmap/5294 is trying to acquire lock:
  (&nilfs->ns_segctor_sem){++++.+}, at: [<d0d0e846>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]

 but task is already holding lock:
  (&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (&mm->mmap_sem){++++++}:
        [<c01470a5>] __lock_acquire+0x1066/0x13b0
        [<c01474a9>] lock_acquire+0xba/0xdd
        [<c01836bc>] might_fault+0x68/0x88
        [<c023c61d>] copy_from_user+0x2a/0x111
        [<d0d120d0>] nilfs_ioctl_prepare_clean_segments+0x1d/0xf1 [nilfs2]
        [<d0d0e2aa>] nilfs_clean_segments+0x6d/0x1b9 [nilfs2]
        [<d0d11f68>] nilfs_ioctl+0x2ad/0x318 [nilfs2]
        [<c01a3be7>] vfs_ioctl+0x22/0x69
        [<c01a408e>] do_vfs_ioctl+0x460/0x499
        [<c01a4107>] sys_ioctl+0x40/0x5a
        [<c01031a4>] sysenter_do_call+0x12/0x38
        [<ffffffff>] 0xffffffff

 -> #0 (&nilfs->ns_segctor_sem){++++.+}:
        [<c0146e0b>] __lock_acquire+0xdcc/0x13b0
        [<c01474a9>] lock_acquire+0xba/0xdd
        [<c0433f1d>] down_read+0x2a/0x3e
        [<d0d0e846>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
        [<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2]
        [<c0183b0b>] __do_fault+0x165/0x376
        [<c01855cd>] handle_mm_fault+0x287/0x5d1
        [<c043712d>] do_page_fault+0x2fb/0x30a
        [<c0435462>] error_code+0x72/0x78
        [<ffffffff>] 0xffffffff

where nilfs_clean_segments() holds:

  nilfs->ns_segctor_sem -> copy_from_user()
                             --> page fault -> mm->mmap_sem

And, page fault path may hold:

  page fault -> mm->mmap_sem
         --> nilfs_page_mkwrite() -> nilfs->ns_segctor_sem

Even though nilfs_clean_segments() does not perform write access on
given user pages, it may cause deadlock because nilfs->ns_segctor_sem
is shared per device and mm->mmap_sem can be shared with other tasks.

To avoid this problem, this patch moves all calls of copy_from_user()
outside the nilfs->ns_segctor_sem lock in the ioctl.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-11 14:54:41 +09:00
Ryusuke Konishi
47eb6b9c8f nilfs2: fix possible circular locking for get information ioctls
This is one of two patches which are to correct possible circular
locking between mm->mmap_sem and nilfs->ns_segctor_sem.

The problem was detected by lockdep check as follows:

 =======================================================
 [ INFO: possible circular locking dependency detected ]
 2.6.30-rc3-nilfs-00002-g3552613 #6
 -------------------------------------------------------
 mmap/5418 is trying to acquire lock:
 (&nilfs->ns_segctor_sem){++++.+}, at: [<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]

 but task is already holding lock:
 (&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (&mm->mmap_sem){++++++}:
 [<c01470a5>] __lock_acquire+0x1066/0x13b0
 [<c01474a9>] lock_acquire+0xba/0xdd
 [<c01836bc>] might_fault+0x68/0x88
 [<c023c730>] copy_to_user+0x2c/0xfc
 [<d0d11b4f>] nilfs_ioctl_wrap_copy+0x103/0x160 [nilfs2]
 [<d0d11fa9>] nilfs_ioctl+0x30a/0x3b0 [nilfs2]
 [<c01a3be7>] vfs_ioctl+0x22/0x69
 [<c01a408e>] do_vfs_ioctl+0x460/0x499
 [<c01a4107>] sys_ioctl+0x40/0x5a
 [<c01031a4>] sysenter_do_call+0x12/0x38
 [<ffffffff>] 0xffffffff

 -> #0 (&nilfs->ns_segctor_sem){++++.+}:
 [<c0146e0b>] __lock_acquire+0xdcc/0x13b0
 [<c01474a9>] lock_acquire+0xba/0xdd
 [<c0433f1d>] down_read+0x2a/0x3e
 [<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
 [<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2]
 [<c0183b0b>] __do_fault+0x165/0x376
 [<c01855cd>] handle_mm_fault+0x287/0x5d1
 [<c043712d>] do_page_fault+0x2fb/0x30a
 [<c0435462>] error_code+0x72/0x78
 [<ffffffff>] 0xffffffff

 other info that might help us debug this:

 1 lock held by mmap/5418:
 #0:  (&mm->mmap_sem){++++++}, at: [<c043700a>] do_page_fault+0x1d8/0x30a

 stack backtrace:
 Pid: 5418, comm: mmap Not tainted 2.6.30-rc3-nilfs-00002-g3552613 #6
 Call Trace:
 [<c0432145>] ? printk+0xf/0x12
 [<c0145c48>] print_circular_bug_tail+0xaa/0xb5
 [<c0146e0b>] __lock_acquire+0xdcc/0x13b0
 [<d0d10149>] ? nilfs_sufile_get_stat+0x1e/0x105 [nilfs2]
 [<c013b59a>] ? up_read+0x16/0x2c
 [<d0d10225>] ? nilfs_sufile_get_stat+0xfa/0x105 [nilfs2]
 [<c01474a9>] lock_acquire+0xba/0xdd
 [<d0d0e852>] ? nilfs_transaction_begin+0xb6/0x10c [nilfs2]
 [<c0433f1d>] down_read+0x2a/0x3e
 [<d0d0e852>] ? nilfs_transaction_begin+0xb6/0x10c [nilfs2]
 [<d0d0e852>] nilfs_transaction_begin+0xb6/0x10c [nilfs2]
 [<d0cfe0e5>] nilfs_page_mkwrite+0xe7/0x154 [nilfs2]
 [<c0183b0b>] __do_fault+0x165/0x376
 [<c01855cd>] handle_mm_fault+0x287/0x5d1
 [<c043700a>] ? do_page_fault+0x1d8/0x30a
 [<c013b54f>] ? down_read_trylock+0x39/0x43
 [<c043712d>] do_page_fault+0x2fb/0x30a
 [<c0436e32>] ? do_page_fault+0x0/0x30a
 [<c0435462>] error_code+0x72/0x78
 [<c0436e32>] ? do_page_fault+0x0/0x30a

This makes the lock granularity of nilfs->ns_segctor_sem finer than
that of the mmap semaphore for ioctl commands except
nilfs_clean_segments().

The successive patch ("nilfs2: fix lock order reversal in
nilfs_clean_segments ioctl") is required to fully resolve the problem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-11 12:57:46 +09:00
David Howells
107db7c7dd CRED: Guard the setprocattr security hook against ptrace
Guard the setprocattr security hook against ptrace by taking the target task's
cred_guard_mutex around it.  The problem is that setprocattr() may otherwise
note the lack of a debugger, and then perform an action on that basis whilst
letting a debugger attach between the two points.  Holding cred_guard_mutex
across the test and the action prevents ptrace_attach() from doing that.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-11 08:15:39 +10:00
David Howells
5e751e992f CRED: Rename cred_exec_mutex to reflect that it's a guard against ptrace
Rename cred_exec_mutex to reflect that it's a guard against foreign
intervention on a process's credential state, such as is made by ptrace().  The
attachment of a debugger to a process affects execve()'s calculation of the new
credential state - _and_ also setprocattr()'s calculation of that state.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
2009-05-11 08:15:36 +10:00
Linus Torvalds
93b49d45eb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (22 commits)
  Fix the race between capifs remount and node creation
  Fix races around the access to ->s_options
  switch ufs directories to ufs_sync_file()
  Switch open_exec() and sys_uselib() to do_open_filp()
  Make open_exec() and sys_uselib() use may_open(), instead of duplicating its parts
  Reduce path_lookup() abuses
  Make checkpatch.pl shut up on fs/inode.c
  NULL noise in fs/super.c:kill_bdev_super()
  romfs: cleanup romfs_fs.h
  ROMFS: romfs_dev_read() error ignored
  fs: dcache fix LRU ordering
  ocfs2: Use nd_set_link().
  Fix deadlock in ipathfs ->get_sb()
  Fix a leak in failure exit in 9p ->get_sb()
  Convert obvious places to deactivate_locked_super()
  New helper: deactivate_locked_super()
  reiserfs: remove privroot hiding in lookup
  reiserfs: dont associate security.* with xattr files
  reiserfs: fixup xattr_root caching
  Always lookup priv_root on reiserfs mount and keep it
  ...
2009-05-10 10:49:08 -07:00
Ryusuke Konishi
843382370e nilfs2: ensure to clear dirty state when deleting metadata file block
This would fix the following failure during GC:

 nilfs_cpfile_delete_checkpoints: cannot delete block
 NILFS: GC failed during preparation: cannot delete checkpoints: err=-2

The problem was caused by a break in state consistency between page
cache and btree; the above block was removed from the btree but the
page buffering the block was remaining in the page cache in dirty
state.

This resolves the inconsistency by ensuring to clear dirty state of
the page buffering the deleted block.

Reported-by: David Arendt <admin@prnet.org>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-10 17:04:42 +09:00
Al Viro
2a32cebd6c Fix races around the access to ->s_options
Put generic_show_options read access to s_options under rcu_read_lock,
split save_mount_options() into "we are setting it the first time"
(uses in foo_fill_super()) and "we are relacing and freeing the old one",
synchronize_rcu() before kfree() in the latter.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:51:34 -04:00
Al Viro
f9dbd05bc9 switch ufs directories to ufs_sync_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:42 -04:00
Al Viro
6e8341a11e Switch open_exec() and sys_uselib() to do_open_filp()
... and make path_lookup_open() static

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:42 -04:00
Al Viro
a44ddbb6d8 Make open_exec() and sys_uselib() use may_open(), instead of duplicating its parts
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:42 -04:00
Al Viro
e24977d45f Reduce path_lookup() abuses
... use kern_path() where possible

[folded a fix from rdd]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:42 -04:00
Manish Katiyar
6b3304b531 Make checkpatch.pl shut up on fs/inode.c
Code Quality According To Mingo(tm) has been vastly improved,
no code has been damaged^Wchanged^Wdamaged.

[commit message rewritten -- AV]

Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:41 -04:00
H Hartley Sweeten
ddbaaf3024 NULL noise in fs/super.c:kill_bdev_super()
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Subrata Modak <subrata@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:41 -04:00
Roel Kluin
774e33e70b ROMFS: romfs_dev_read() error ignored
romfs_dev_read() may return -EIO, but ret is unsigned, so the errorpath
isn't taken.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:41 -04:00
npiggin@suse.de
c490d79bb7 fs: dcache fix LRU ordering
Fix ordering of LRU when moving referenced dentries to the head of the list
(they should go to the head of the list in the same order as they were found
from the tail, rather than reverse order).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:40 -04:00
Joel Becker
a731d12d6d ocfs2: Use nd_set_link().
ocfs2 was hand-calling vfs_follow_link(), but there's no point to that.
Let's use page_follow_link_light() and nd_set_link().

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:40 -04:00
Al Viro
c96f585737 Fix a leak in failure exit in 9p ->get_sb()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:40 -04:00
Al Viro
6f5bbff9a1 Convert obvious places to deactivate_locked_super()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:40 -04:00
Al Viro
74dbbdd7fd New helper: deactivate_locked_super()
Does equivalent of up_write(&s->s_umount); deactivate_super(s);
However, it does not does not unlock it until it's all over.
As the result, it's safe to use to dispose of new superblock on ->get_sb()
failure exits - nobody will see the sucker until it's all over.
Equivalent using up_write/deactivate_super is safe for that purpose
if superblock is either	safe to use or has NULL ->s_root when we unlock.
Normally filesystems take the required precautions, but
	a) we do have bugs in that area in some of them.
	b) up_write/deactivate_super sequence is extremely common,
so the helper makes sense anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:39 -04:00
Jeff Mahoney
677c9b2e39 reiserfs: remove privroot hiding in lookup
With Al Viro's patch to move privroot lookup to fs mount, there's no need
 to have special code to hide the privroot in reiserfs_lookup.

 I've also cleaned up the privroot hiding in reiserfs_readdir_dentry and
 removed the last user of reiserfs_xattrs().

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:39 -04:00
Jeff Mahoney
b82bb72ba7 reiserfs: dont associate security.* with xattr files
The security.* xattrs are ignored for xattr files, so don't create them.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:39 -04:00
Jeff Mahoney
ab17c4f021 reiserfs: fixup xattr_root caching
The xattr_root caching was broken from my previous patch set. It wouldn't
 cause corruption, but could cause decreased performance due to allocating
 a larger chunk of the journal (~ 27 blocks) than it would actually use.

 This patch loads the xattr root dentry at xattr initialization and creates
 it on-demand. Since we're using the cached dentry, there's no point
 in keeping lookup_or_create_dir around, so that's removed.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:39 -04:00
Al Viro
edcc37a047 Always lookup priv_root on reiserfs mount and keep it
... even if it's a negative dentry.  That way we can set ->d_op on
root before anyone could race with us.  Simplify d_compare(), while
we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:38 -04:00
Jeff Mahoney
5a6059c358 reiserfs: Expand i_mutex to enclose lookup_one_len
2.6.30-rc3 introduced some sanity checks in the VFS code to avoid NFS
 bugs by ensuring that lookup_one_len is always called under i_mutex.

 This patch expands the i_mutex locking to enclose lookup_one_len. This was
 always required, but not not enforced in the reiserfs code since it
 does locking around the xattr interactions with the xattr_sem.

 This is obvious enough, and it survived an overnight 50 thread ACL test.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:38 -04:00
Alessio Igor Bogani
67e55205ec vfs: umount_begin BKL pushdown
Push BKL down into ->umount_begin()

Signed-off-by: Alessio Igor Bogani <abogani@texware.it>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-05-09 10:49:38 -04:00
Steven Whitehouse
0c7a531a20 GFS2: Fix glock ref counting bug
Depending on the ordering of events as we go around the
glock shrinker loop, it is possible to drop the ref count
of a glock incorrectly. It doesn't happen very often. This
patch corrects the got_ref variable, fixing the problem.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-05-09 15:15:17 +01:00
Ryusuke Konishi
201913ed74 nilfs2: fix circular locking dependency of writer mutex
This fixes the following circular locking dependency problem:

 =======================================================
 [ INFO: possible circular locking dependency detected ]
 2.6.30-rc3 #5
 -------------------------------------------------------
 segctord/3895 is trying to acquire lock:
  (&nilfs->ns_writer_mutex){+.+...}, at: [<d0d02172>]
   nilfs_mdt_get_block+0x89/0x20f [nilfs2]

 but task is already holding lock:
  (&bmap->b_sem){++++..}, at: [<d0d02d99>]
   nilfs_bmap_propagate+0x14/0x2e [nilfs2]

 which lock already depends on the new lock.

The bugfix is done by replacing call sites of nilfs_get_writer() which
are never called from read-only context with direct dereferencing of
pointer to a writable FS-instance.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-09 13:36:57 +09:00
Ryusuke Konishi
85c2a74fab nilfs2: fix possible recovery failure due to block creation without writer
Some function calls in nilfs_prepare_segment_for_recovery() may fail
because they can create blocks on meta data files without configuring
a writable FS-instance.  Concretely, nilfs_mdt_create_block() routine
of meta data files will fail in that case.

This fixes the problem by temporarily attaching a writable FS-instace
during the function is called.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2009-05-09 13:36:56 +09:00
Felix Blyakher
ec91d1335f xfs: fix double unlock in xfs_swap_extents()
Regreesion from commit ef8f7fc, which rearranged the code in
xfs_swap_extents() leading to double unlock of xfs inode ilock.
That resulted in xfs_fsr deadlocking itself on platforms, which
don't handle double unlock of rw_semaphore nicely. It caused the
count go negative, which represents the write holder, without
really having one. ia64 is one of the platforms where deadlock
was easily reproduced and the fix was tested.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-05-08 00:29:44 -05:00
Linus Torvalds
d7a5926978 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: (32 commits)
  [CIFS] Fix double list addition in cifs posix open code
  [CIFS] Allow raw ntlmssp code to be enabled with sec=ntlmssp
  [CIFS] Fix SMB uid in NTLMSSP authenticate request
  [CIFS] NTLMSSP reenabled after move from connect.c to sess.c
  [CIFS] Remove sparse warning
  [CIFS] remove checkpatch warning
  [CIFS] Fix final user of old string conversion code
  [CIFS] remove cifs_strfromUCS_le
  [CIFS] NTLMSSP support moving into new file, old dead code removed
  [CIFS] Fix endian conversion of vcnum field
  [CIFS] Remove trailing whitespace
  [CIFS] Remove sparse endian warnings
  [CIFS] Add remaining ntlmssp flags and standardize field names
  [CIFS] Fix build warning
  cifs: fix length handling in cifs_get_name_from_search_buf
  [CIFS] Remove unneeded QuerySymlink call and fix mapping for unmapped status
  [CIFS] rename cifs_strndup to cifs_strndup_from_ucs
  Added loop check when mounting DFS tree.
  Enable dfs submounts to handle remote referrals.
  [CIFS] Remove older session setup implementation
  ...
2009-05-07 21:13:24 -07:00
Steve French
90e4ee5d31 [CIFS] Fix double list addition in cifs posix open code
Remove adding open file entry twice to lists in the file
Do not fill file info twice in case of posix opens and creates

Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-08 03:04:30 +00:00
David Teigland
8511a2728a dlm: fix use count with multiple joins
When a lockspace was joined multiple times, the global dlm
use count was incremented when it should not have been.  This
caused the global dlm threads to not be stopped when all
lockspaces were eventually be removed.

Signed-off-by: David Teigland <teigland@redhat.com>
2009-05-07 10:14:42 -05:00
Geert Uytterhoeven
08ce4c91e4 dlm: Make name input parameter of {,dlm_}new_lockspace() const
| fs/gfs2/lock_dlm.c:207: warning: passing argument 1 of 'dlm_new_lockspace' discards qualifiers from pointer target type

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David Teigland <teigland@redhat.com>
2009-05-07 10:14:26 -05:00
Josef Bacik
df3935ffd6 fiemap: fix problem with setting FIEMAP_EXTENT_LAST
Fix a problem where the generic block based fiemap stuff would not
properly set FIEMAP_EXTENT_LAST on the last extent.  I've reworked things
to keep track if we go past the EOF, and mark the last extent properly.
The problem was reported by and tested by Eric Sandeen.

Tested-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: <xfs-masters@oss.sgi.com>
Cc: <linux-btrfs@vger.kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <Joel.Becker@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-06 16:36:09 -07:00
Wu Fengguang
381a80e6df inotify: use GFP_NOFS in kernel_event() to work around a lockdep false-positive
There is what we believe to be a false positive reported by lockdep.

inotify_inode_queue_event() => take inotify_mutex => kernel_event() =>
kmalloc() => SLOB => alloc_pages_node() => page reclaim => slab reclaim =>
dcache reclaim => inotify_inode_is_dead => take inotify_mutex => deadlock

The plan is to fix this via lockdep annotation, but that is proving to be
quite involved.

The patch flips the allocation over to GFP_NFS to shut the warning up, for
the 2.6.30 release.

Hopefully we will fix this for real in 2.6.31.  I'll queue a patch in -mm
to switch it back to GFP_KERNEL so we don't forget.

  =================================
  [ INFO: inconsistent lock state ]
  2.6.30-rc2-next-20090417 #203
  ---------------------------------
  inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
  kswapd0/380 [HC0[0]:SC0[0]:HE1:SE1] takes:
   (&inode->inotify_mutex){+.+.?.}, at: [<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0
  {RECLAIM_FS-ON-W} state was registered at:
    [<ffffffff81079188>] mark_held_locks+0x68/0x90
    [<ffffffff810792a5>] lockdep_trace_alloc+0xf5/0x100
    [<ffffffff810f5261>] __kmalloc_node+0x31/0x1e0
    [<ffffffff81130652>] kernel_event+0xe2/0x190
    [<ffffffff81130826>] inotify_dev_queue_event+0x126/0x230
    [<ffffffff8112f096>] inotify_inode_queue_event+0xc6/0x110
    [<ffffffff8110444d>] vfs_create+0xcd/0x140
    [<ffffffff8110825d>] do_filp_open+0x88d/0xa20
    [<ffffffff810f6b68>] do_sys_open+0x98/0x140
    [<ffffffff810f6c50>] sys_open+0x20/0x30
    [<ffffffff8100c272>] system_call_fastpath+0x16/0x1b
    [<ffffffffffffffff>] 0xffffffffffffffff
  irq event stamp: 690455
  hardirqs last  enabled at (690455): [<ffffffff81564fe4>] _spin_unlock_irqrestore+0x44/0x80
  hardirqs last disabled at (690454): [<ffffffff81565372>] _spin_lock_irqsave+0x32/0xa0
  softirqs last  enabled at (690178): [<ffffffff81052282>] __do_softirq+0x202/0x220
  softirqs last disabled at (690157): [<ffffffff8100d50c>] call_softirq+0x1c/0x50

  other info that might help us debug this:
  2 locks held by kswapd0/380:
   #0:  (shrinker_rwsem){++++..}, at: [<ffffffff810d0bd7>] shrink_slab+0x37/0x180
   #1:  (&type->s_umount_key#17){++++..}, at: [<ffffffff8110cfbf>] shrink_dcache_memory+0x11f/0x1e0

  stack backtrace:
  Pid: 380, comm: kswapd0 Not tainted 2.6.30-rc2-next-20090417 #203
  Call Trace:
   [<ffffffff810789ef>] print_usage_bug+0x19f/0x200
   [<ffffffff81018bff>] ? save_stack_trace+0x2f/0x50
   [<ffffffff81078f0b>] mark_lock+0x4bb/0x6d0
   [<ffffffff810799e0>] ? check_usage_forwards+0x0/0xc0
   [<ffffffff8107b142>] __lock_acquire+0xc62/0x1ae0
   [<ffffffff810f478c>] ? slob_free+0x10c/0x370
   [<ffffffff8107c0a1>] lock_acquire+0xe1/0x120
   [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
   [<ffffffff81562d43>] mutex_lock_nested+0x63/0x420
   [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
   [<ffffffff8112f1b5>] ? inotify_inode_is_dead+0x35/0xb0
   [<ffffffff81012fe9>] ? sched_clock+0x9/0x10
   [<ffffffff81077165>] ? lock_release_holdtime+0x35/0x1c0
   [<ffffffff8112f1b5>] inotify_inode_is_dead+0x35/0xb0
   [<ffffffff8110c9dc>] dentry_iput+0xbc/0xe0
   [<ffffffff8110cb23>] d_kill+0x33/0x60
   [<ffffffff8110ce23>] __shrink_dcache_sb+0x2d3/0x350
   [<ffffffff8110cffa>] shrink_dcache_memory+0x15a/0x1e0
   [<ffffffff810d0cc5>] shrink_slab+0x125/0x180
   [<ffffffff810d1540>] kswapd+0x560/0x7a0
   [<ffffffff810ce160>] ? isolate_pages_global+0x0/0x2c0
   [<ffffffff81065a30>] ? autoremove_wake_function+0x0/0x40
   [<ffffffff8107953d>] ? trace_hardirqs_on+0xd/0x10
   [<ffffffff810d0fe0>] ? kswapd+0x0/0x7a0
   [<ffffffff8106555b>] kthread+0x5b/0xa0
   [<ffffffff8100d40a>] child_rip+0xa/0x20
   [<ffffffff8100cdd0>] ? restore_args+0x0/0x30
   [<ffffffff81065500>] ? kthread+0x0/0xa0
   [<ffffffff8100d400>] ? child_rip+0x0/0x20

[eparis@redhat.com: fix audit too]
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-06 16:36:09 -07:00
J. Bruce Fields
89996df4b5 lockd: fix list corruption on lockd restart
If lockd is signalled soon enough after restart then locks_start_grace()
will try to re-add an entry to a list and trigger a lock corruption
warning.

Thanks to Wang Chen for the problem report and diagnosis.

WARNING: at lib/list_debug.c:26 __list_add+0x27/0x5c()
...
list_add corruption. next->prev should be prev (ef8fe958), but was ef8ff128.  (next=ef8ff128).
...
Pid: 23062, comm: lockd Tainted: G        W  2.6.30-rc2 #3
Call Trace:
[<c042d5b5>] warn_slowpath+0x71/0xa0
[<c0422a96>] ? update_curr+0x11d/0x125
[<c044b12d>] ? trace_hardirqs_on_caller+0x18/0x150
[<c044b270>] ? trace_hardirqs_on+0xb/0xd
[<c051c61a>] ? _raw_spin_lock+0x53/0xfa
[<c051c89f>] __list_add+0x27/0x5c
[<ef8f6daa>] locks_start_grace+0x22/0x30 [lockd]
[<ef8f34da>] set_grace_period+0x39/0x53 [lockd]
[<c06b8921>] ? lock_kernel+0x1c/0x28
[<ef8f3558>] lockd+0x64/0x164 [lockd]
[<c044b12d>] ? trace_hardirqs_on_caller+0x18/0x150
[<c04227b0>] ? complete+0x34/0x3e
[<ef8f34f4>] ? lockd+0x0/0x164 [lockd]
[<ef8f34f4>] ? lockd+0x0/0x164 [lockd]
[<c043dd42>] kthread+0x45/0x6b
[<c043dcfd>] ? kthread+0x0/0x6b
[<c0403c23>] kernel_thread_helper+0x7/0x10

Reported-by: Wang Chen <wangchen@cn.fujitsu.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: stable@kernel.org
2009-05-06 17:19:36 -04:00
J. Bruce Fields
b2c0cea6b1 nfsd4: check for negative dentry before use in nfsv4 readdir
After 2f9092e102 "Fix i_mutex vs.  readdir
handling in nfsd" (and 14f7dd63 "Copy XFS readdir hack into nfsd code"),
an entry may be removed between the first mutex_unlock and the second
mutex_lock. In this case, lookup_one_len() will return a negative
dentry.  Check for this case to avoid a NULL dereference.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Reviewed-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Cc: stable@kernel.org
2009-05-06 16:16:36 -04:00
Adrian Hunter
6d6cb0d688 UBIFS: reset no_space flag after inode deletion
When UBIFS runs out of space it spends a lot of time trying to
find more space before returning ENOSPC.  As there is no point
repeating that unless something has changed, UBIFS has an
optimization to record that the file system is 100% full and not
try to find space.  That flag was not being reset when a pending
deletion was finally done.

Signed-off-by: Adrian Hunter <adrian.hunter@nokia.com>
Reviewed-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-05-06 10:37:56 +03:00
Steve French
ac68392460 [CIFS] Allow raw ntlmssp code to be enabled with sec=ntlmssp
On mount, "sec=ntlmssp" can now be specified to allow
"rawntlmssp" security to be enabled during
CIFS session establishment/authentication (ntlmssp used to
require specifying krb5 which was counterintuitive).

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-06 04:16:04 +00:00
Steve French
844823cb82 [CIFS] Fix SMB uid in NTLMSSP authenticate request
We were not setting the SMB uid in NTLMSSP authenticate
request which could lead to INVALID_PARAMETER error
on 2nd session setup.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-06 00:48:30 +00:00
Coly Li
2b53bc7bff ocfs2: update comments in masklog.h
In the mainline ocfs2 code, the interface for masklog is in files under
/sys/fs/o2cb/masklog, but the comments in fs/ocfs2/cluster/masklog.h
reference the old /proc interface.  They are out of date.

This patch modifies the comments in cluster/masklog.h, which also provides
a bash script example on how to change the log mask bits.

Signed-off-by: Coly Li <coly.li@suse.de>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-05-05 14:48:11 -07:00
Tao Ma
a46fa684fc ocfs2: Don't printk the error when listing too many xattrs.
Currently the kernel defines XATTR_LIST_MAX as 65536
in include/linux/limits.h.  This is the largest buffer that is used for
listing xattrs.

But with ocfs2 xattr tree, we actually have no limit for the number.  If
filesystem has more names than can fit in the buffer, the kernel
logs will be pollluted with something like this when listing:

(27738,0):ocfs2_iterate_xattr_buckets:3158 ERROR: status = -34
(27738,0):ocfs2_xattr_tree_list_index_block:3264 ERROR: status = -34

So don't print "ERROR" message as this is not an ocfs2 error.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-05-05 14:43:24 -07:00
Jake Edge
f83ce3e6b0 proc: avoid information leaks to non-privileged processes
By using the same test as is used for /proc/pid/maps and /proc/pid/smaps,
only allow processes that can ptrace() a given process to see information
that might be used to bypass address space layout randomization (ASLR).
These include eip, esp, wchan, and start_stack in /proc/pid/stat as well
as the non-symbolic output from /proc/pid/wchan.

ASLR can be bypassed by sampling eip as shown by the proof-of-concept
code at http://code.google.com/p/fuzzyaslr/ As part of a presentation
(http://www.cr0.org/paper/to-jt-linux-alsr-leak.pdf) esp and wchan were
also noted as possibly usable information leaks as well.  The
start_stack address also leaks potentially useful information.

Cc: Stable Team <stable@kernel.org>
Signed-off-by: Jake Edge <jake@lwn.net>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-04 15:14:23 -07:00
Steve French
0b3cc85800 [CIFS] NTLMSSP reenabled after move from connect.c to sess.c
The NTLMSSP code was removed from fs/cifs/connect.c and merged
(75% smaller, cleaner) into fs/cifs/sess.c

As with the old code it requires that cifs be built with
CONFIG_CIFS_EXPERIMENTAL, the /proc/fs/cifs/Experimental flag
must be set to 2, and mount must turn on extended security
(e.g. with sec=krb5).

Although NTLMSSP encapsulated in SPNEGO is not enabled yet,
"raw" ntlmssp is common and useful in some cases since it
offers more complete security negotiation, and is the
default way of negotiating security for many Windows systems.
SPNEGO encapsulated NTLMSSP will be able to reuse the same
code.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-04 08:37:12 +00:00
Andy Adamson
ccecee1e5e nfsd41: slots are freed with session
The session and slots are allocated all in one piece.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-05-03 14:45:02 -04:00
Trond Myklebust
7fdf523067 NFS: Close page_mkwrite() races
Follow up to Nick Piggin's patches to ensure that nfs_vm_page_mkwrite
returns with the page lock held, and sets the VM_FAULT_LOCKED flag.

See http://bugzilla.kernel.org/show_bug.cgi?id=12913

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 19:42:39 -07:00
Aneesh Kumar K.V
955ce5f5be ext4: Convert ext4_lock_group to use sb_bgl_lock
We have sb_bgl_lock() and ext4_group_info.bb_state
bit spinlock to protech group information. The later is only
used within mballoc code. Consolidate them to use sb_bgl_lock().
This makes the mballoc.c code much simpler and also avoid
confusion with two locks protecting same info.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-02 20:35:09 -04:00
Linus Torvalds
b4348f32da Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: fix getbmap vs mmap deadlock
  xfs: a couple getbmap cleanups
  xfs: add more checks to superblock validation
  xfs_file_last_byte() needs to acquire ilock
2009-05-02 16:52:50 -07:00
Linus Torvalds
afc1e702e8 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/configfs
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/configfs:
  configfs: Fix Trivial Warning in fs/configfs/symlink.c
2009-05-02 16:50:46 -07:00
Linus Torvalds
7e567b44e6 Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2: Change repository in MAINTAINERS.
  ocfs2: Fix a missing credit when deleting from indexed directories.
  ocfs2/trivial: Remove unused variable in ocfs2_rename.
  ocfs2: Add missing iput() during error handling in ocfs2_dentry_attach_lock()
  ocfs2: Fix some printk() warnings.
  ocfs2: Fix 2 warning during ocfs2 make.
  ocfs2: Reserve 1 more cluster in expanding_inline_dir for indexed dir.
2009-05-02 16:30:47 -07:00
Theodore Ts'o
eefd7f03b8 ext4: fix the length returned by fiemap for an unallocated extent
If the file's blocks have not yet been allocated because of delayed
allocation, the length of the extent returned by fiemap is incorrect.
This commit fixes this bug.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-02 19:05:37 -04:00
Oleg Nesterov
0ae05fb254 ptrace: s/parent/real_parent/ in binfmt_elf_fdpic.c
->real_parent is the parent. ->parent may be the tracer.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Greg Ungerer <gerg@snapgear.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:10 -07:00
KOSAKI Motohiro
00a62ce91e mm: fix Committed_AS underflow on large NR_CPUS environment
The Committed_AS field can underflow in certain situations:

>         # while true; do cat /proc/meminfo  | grep _AS; sleep 1; done | uniq -c
>               1 Committed_AS: 18446744073709323392 kB
>              11 Committed_AS: 18446744073709455488 kB
>               6 Committed_AS:    35136 kB
>               5 Committed_AS: 18446744073709454400 kB
>               7 Committed_AS:    35904 kB
>               3 Committed_AS: 18446744073709453248 kB
>               2 Committed_AS:    34752 kB
>               9 Committed_AS: 18446744073709453248 kB
>               8 Committed_AS:    34752 kB
>               3 Committed_AS: 18446744073709320960 kB
>               7 Committed_AS: 18446744073709454080 kB
>               3 Committed_AS: 18446744073709320960 kB
>               5 Committed_AS: 18446744073709454080 kB
>               6 Committed_AS: 18446744073709320960 kB

Because NR_CPUS can be greater than 1000 and meminfo_proc_show() does
not check for underflow.

But NR_CPUS proportional isn't good calculation.  In general,
possibility of lock contention is proportional to the number of online
cpus, not theorical maximum cpus (NR_CPUS).

The current kernel has generic percpu-counter stuff.  using it is right
way.  it makes code simplify and percpu_counter_read_positive() don't
make underflow issue.

Reported-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>		[All kernel versions]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:10 -07:00
Ivan Kokshaysky
74641f584d alpha: binfmt_aout fix
This fixes the problem introduced by commit 3bfacef412 (get rid of
special-casing the /sbin/loader on alpha): osf/1 ecoff binary segfaults
when binfmt_aout built as module.  That happens because aout binary
handler gets on the top of the binfmt list due to late registration, and
kernel attempts to execute the binary without preparatory work that must
be done by binfmt_loader.

Fixed by changing the registration order of the default binfmt handlers
using list_add_tail() and introducing insert_binfmt() function which
places new handler on the top of the binfmt list.  This might be generally
useful for installing arch-specific frontends for default handlers or just
for overriding them.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Richard Henderson <rth@twiddle.net
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:10 -07:00
Vitaly Mayatskikh
0816178638 pagemap: require aligned-length, non-null reads of /proc/pid/pagemap
The intention of commit aae8679b0e
("pagemap: fix bug in add_to_pagemap, require aligned-length reads of
/proc/pid/pagemap") was to force reads of /proc/pid/pagemap to be a
multiple of 8 bytes, but now it allows to read 0 bytes, which actually
puts some data to user's buffer.  According to POSIX, if count is zero,
read() should return zero and has no other results.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Cc: Thomas Tuttle <ttuttle@google.com>
Acked-by: Matt Mackall <mpm@selenic.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:09 -07:00
Nick Piggin
b827e496c8 mm: close page_mkwrite races
Change page_mkwrite to allow implementations to return with the page
locked, and also change it's callers (in page fault paths) to hold the
lock until the page is marked dirty.  This allows the filesystem to have
full control of page dirtying events coming from the VM.

Rather than simply hold the page locked over the page_mkwrite call, we
call page_mkwrite with the page unlocked and allow callers to return with
it locked, so filesystems can avoid LOR conditions with page lock.

The problem with the current scheme is this: a filesystem that wants to
associate some metadata with a page as long as the page is dirty, will
perform this manipulation in its ->page_mkwrite.  It currently then must
return with the page unlocked and may not hold any other locks (according
to existing page_mkwrite convention).

In this window, the VM could write out the page, clearing page-dirty.  The
filesystem has no good way to detect that a dirty pte is about to be
attached, so it will happily write out the page, at which point, the
filesystem may manipulate the metadata to reflect that the page is no
longer dirty.

It is not always possible to perform the required metadata manipulation in
->set_page_dirty, because that function cannot block or fail.  The
filesystem may need to allocate some data structure, for example.

And the VM cannot mark the pte dirty before page_mkwrite, because
page_mkwrite is allowed to fail, so we must not allow any window where the
page could be written to if page_mkwrite does fail.

This solution of holding the page locked over the 3 critical operations
(page_mkwrite, setting the pte dirty, and finally setting the page dirty)
closes out races nicely, preventing page cleaning for writeout being
initiated in that window.  This provides the filesystem with a strong
synchronisation against the VM here.

- Sage needs this race closed for ceph filesystem.
- Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
- I need it for fsblock.
- I suspect other filesystems may need it too (eg. btrfs).
- I have converted buffer.c to the new locking. Even simple block allocation
  under dirty pages might be susceptible to i_size changing under partial page
  at the end of file (we also have a buffer.c-side problem here, but it cannot
  be fixed properly without this patch).
- Other filesystems (eg. NFS, maybe btrfs) will need to change their
  page_mkwrite functions themselves.

[ This also moves page_mkwrite another step closer to fault, which should
  eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
  filesystem calldown and page lock/unlock cycle in __do_fault. ]

[akpm@linux-foundation.org: fix derefs of NULL ->mapping]
Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:09 -07:00
Ian Kent
a8985f3ac5 autofs4: fix incorrect return in autofs4_mount_busy()
Fix an obvious incorrect return status in autofs4_mount_busy().

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02 15:36:09 -07:00
Steve French
24d35add2b [CIFS] Remove sparse warning
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-02 05:40:39 +00:00
Steve French
989c7e512f [CIFS] remove checkpatch warning
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-02 05:32:20 +00:00
Steve French
afe48c31ea [CIFS] Fix final user of old string conversion code
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-02 05:25:46 +00:00
Jeff Layton
3410602732 [CIFS] remove cifs_strfromUCS_le
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-02 04:59:34 +00:00
Steve French
2edd6c5b05 [CIFS] NTLMSSP support moving into new file, old dead code removed
Remove dead NTLMSSP support from connect.c prior to addition of
the new code to replace it.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-02 04:55:39 +00:00
Eric Sandeen
c9877b205f ext4: fix for fiemap last-block test
Carl Henrik Lunde reported and debugged this; the test for the
last allocated block was comparing bytes to blocks in this test:

	if (logical + length - 1 == EXT_MAX_BLOCK ||
	    ext4_ext_next_allocated_block(path) == EXT_MAX_BLOCK)
		flags |= FIEMAP_EXTENT_LAST;

so any extent which ended right at 4G was stopping the extent
walk.  Just replacing these values with the extent block &
length should fix it.

Also give blksize_bits a saner type, and reverse the order 
of the tests to make the more likely case tested first.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Tested-by: Carl Henrik Lunde <chlunde@ping.uio.no>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 23:32:06 -04:00
Aneesh Kumar K.V
19ba0559f9 vfs: Enable FS_IOC_FIEMAP and FIGETBSZ for all filetypes
The fiemap and get_blk_size ioctls should be enabled even for
directories.  So move it outisde file_ioctl.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-13 18:12:05 -04:00
Aneesh Kumar K.V
abc8746eb9 ext4: hook fiemap operation for directories
Add fiemap callback for directories

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-02 22:54:32 -04:00
Curt Wohlgemuth
f40339031b ext4: Make the length of the mb_history file tunable
In memory-constrained systems with many partitions, the ~68K for each
partition for the mb_history buffer can be excessive.

This patch adds a new mount option, mb_history_length, as well as a
way of setting the default via a module parameter (or via a sysfs
parameter in /sys/module/ext4/parameter/default_mb_history_length).
If the mb_history_length is set to zero, the mb_history facility is
disabled entirely.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 20:27:20 -04:00
Theodore Ts'o
bb23c20a85 ext4: Move fs/ext4/group.h into ext4.h
Move the function prototypes in group.h into ext4.h so they are all
defined in one place.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 19:44:44 -04:00
Theodore Ts'o
596397b77c ext4: Move fs/ext4/namei.h into ext4.h
The fs/ext4/namei.h header file had only a single function
declaration, and should have never been a standalone file.  Move it
into ext4.h, where should have been from the beginning.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 13:49:15 -04:00
Theodore Ts'o
ca0faba0e8 ext4: Move the ext4_sb.h header file into ext4.h
There is no longer a reason for a separate ext4_sb.h header file, so
move it into ext4.h just to make life easier for developers to find
the relevant data structures and typedefs.  Should also speed up
compiles slightly, too.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-03 16:33:44 -04:00
Theodore Ts'o
d444c3c381 ext4: Move the ext4_i.h header file into ext4.h
There is no longer a reason for a separate ext4_i.h header file, so
move it into ext4.h just to make life easier for developers to find
the relevant data structures and typedefs.  Should also speed up
compiles slightly, too.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 13:44:33 -04:00
Theodore Ts'o
75507efb13 ext4: Don't avoid using BLOCK_UNINIT block groups in mballoc
By avoiding the use of not-yet-used block groups (i.e., block groups
with the BLOCK_UNINIT flag), mballoc had a tendency to create large
files with large non-contiguous gaps.  In addition avoiding the use of
new block groups had a tendency to push regular file data into the
first block group in a flex_bg group, which slows down the speed of
e2fsck pass 2, since it has a tendency to seek much more.  For
example:

               Before Patch                       After Patch
              Time in seconds                   Time in seconds
            Real /  User/  Sys   MB/s      Real /  User/  Sys    MB/s
Pass 1      8.52 / 2.21 / 0.46  20.43      8.84 / 4.97 / 1.11   19.68
Pass 2     21.16 / 1.02 / 1.86  11.30      6.54 / 1.77 / 1.78   36.39
Pass 3      0.01 / 0.00 / 0.00 139.00      0.01 / 0.01 / 0.00  128.90
Pass 4      0.16 / 0.15 / 0.00   0.00      0.17 / 0.17 / 0.00    0.00
Pass 5      2.52 / 1.99 / 0.09   0.79      2.31 / 1.78 / 0.06    0.86
Total      32.40 / 5.11 / 2.49  12.81     17.99 / 8.75 / 2.98   23.01

This was on a sample 80 gig root filesystem which was approximately
50% full.  Note the improved e2fsck pass 2 performance, by over a
factor of 3, due to a decreased number of seeks.  (The total amount of
I/O in pass 2 was unchanged; the layout of the directory blocks was
simply much better from e2fsck's's perspective.)

Other changes as a result of this patch on this sample filesystem:

                             Before Patch    After Patch
# of non-contig files           762             779
# of non-contig directories     571             570
# of BLOCK_UNINIT bg's          307             293
# of INODE_UNINIT bg's          503             503

Out of 640 block groups, of which 333 were in use, this patch caused
an extra 14 block groups to be utilized.  The number of non-contiguous
files did go up slightly, but when measured against the 99.9% of the
files (603,154) which were contiguously allocated, this is pretty
insignificant.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andreas Dilger <adilger@sun.com>
2009-05-01 12:58:36 -04:00
Steve French
051a2a0d32 [CIFS] Fix endian conversion of vcnum field
When multiply mounting from the same client to the same server, with
different userids, we create a vcnum which should be unique if
possible (this is not the same as the smb uid, which is the handle
to the security context).  We were not endian converting additional
(beyond the first which is zero) vcnum properly.

CC: Stable <stable@kernel.org>
Acked-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-01 16:25:15 +00:00
Steve French
e836f015b5 [CIFS] Remove trailing whitespace
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-01 16:20:35 +00:00
Steve French
0e0d2cf327 [CIFS] Remove sparse endian warnings
Removes two sparse CHECK_ENDIAN warnings from Jeffs earlier patch,
and removes the dead readlink code (after noting where in
findfirst we will need to add something like that in the future
to handle the newly discovered unexpected error on FindFirst of NTFS symlinks.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-01 05:27:32 +00:00
Steve French
e14b2fe1e6 [CIFS] Add remaining ntlmssp flags and standardize field names
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-01 04:37:43 +00:00
Steve French
cf398e3a11 [CIFS] Fix build warning
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-01 03:50:42 +00:00
Jeff Layton
18295796a3 cifs: fix length handling in cifs_get_name_from_search_buf
The earlier patch to move this code to use the new unicode helpers
assumed that the filename strings would be null terminated. That's not
always the case.

Instead of passing "max_len" to the string converter, pass "min(len,
max_len)", which makes it do the right thing while still keeping the
parser confined to the response. Also fix up the prototypes of this
function and the callers so that max_len is unsigned (like len is).

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-05-01 00:49:23 +00:00
Steve French
9e39b0ae8a [CIFS] Remove unneeded QuerySymlink call and fix mapping for unmapped status
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 21:31:15 +00:00
Joel Becker
dfa13f39b7 ocfs2: Fix a missing credit when deleting from indexed directories.
The ocfs2 directory index updates two blocks when we remove an entry -
the dx root and the dx leaf.  OCFS2_DELETE_INODE_CREDITS was only
accounting for the dx leaf.  This shows up when ocfs2_delete_inode()
runs out of credits in jbd2_journal_dirty_metadata() at
"J_ASSERT_JH(jh, handle->h_buffer_credits > 0);".

The test that caught this was running dirop_file_racer from the
ocfs2-test suite with a 250-character filename PREFIX.  Run on a 512B
blocksize, it forces the orphan dir index to grow large enough to
trigger.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-04-30 13:21:56 -07:00
Thomas Gleixner
3c56999eec Merge branch 'core/signal' into perfcounters/core
This is necessary to avoid the conflict of syscall numbers.

Conflicts:
	arch/x86/ia32/ia32entry.S
	arch/x86/include/asm/unistd_32.h
	arch/x86/include/asm/unistd_64.h

Fixes up the borked syscall numbers of perfcounters versus
preadv/pwritev as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-04-30 21:16:49 +02:00
Louis Rilling
420118caa3 configfs: Rework configfs_depend_item() locking and make lockdep happy
configfs_depend_item() recursively locks all inodes mutex from configfs root to
the target item, which makes lockdep unhappy. The purpose of this recursive
locking is to ensure that the item tree can be safely parsed and that the target
item, if found, is not about to leave.

This patch reworks configfs_depend_item() locking using configfs_dirent_lock.
Since configfs_dirent_lock protects all changes to the configfs_dirent tree, and
protects tagging of items to be removed, this lock can be used instead of the
inodes mutex lock chain.
This needs that the check for dependents be done atomically with
CONFIGFS_USET_DROPPING tagging.

Now lockdep looks happy with configfs.

[ Lifted the setting of s_type into configfs_new_dirent() to satisfy the
  atomic setting of CONFIGFS_USET_CREATING  -- Joel ]

Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-04-30 10:48:26 -07:00
Louis Rilling
e74cc06df3 configfs: Silence lockdep on mkdir() and rmdir()
When attaching default groups (subdirs) of a new group (in mkdir() or
in configfs_register()), configfs recursively takes inode's mutexes
along the path from the parent of the new group to the default
subdirs. This is needed to ensure that the VFS will not race with
operations on these sub-dirs. This is safe for the following reasons:

- the VFS allows one to lock first an inode and second one of its
  children (The lock subclasses for this pattern are respectively
  I_MUTEX_PARENT and I_MUTEX_CHILD);
- from this rule any inode path can be recursively locked in
  descending order as long as it stays under a single mountpoint and
  does not follow symlinks.

Unfortunately lockdep does not know (yet?) how to handle such
recursion.

I've tried to use Peter Zijlstra's lock_set_subclass() helper to
upgrade i_mutexes from I_MUTEX_CHILD to I_MUTEX_PARENT when we know
that we might recursively lock some of their descendant, but this
usage does not seem to fit the purpose of lock_set_subclass() because
it leads to several i_mutex locked with subclass I_MUTEX_PARENT by
the same task.

>From inside configfs it is not possible to serialize those recursive
locking with a top-level one, because mkdir() and rmdir() are already
called with inodes locked by the VFS. So using some
mutex_lock_nest_lock() is not an option.

I am proposing two solutions:
1) one that wraps recursive mutex_lock()s with
   lockdep_off()/lockdep_on().
2) (as suggested earlier by Peter Zijlstra) one that puts the
   i_mutexes recursively locked in different classes based on their
   depth from the top-level config_group created. This
   induces an arbitrary limit (MAX_LOCK_DEPTH - 2 == 46) on the
   nesting of configfs default groups whenever lockdep is activated
   but this limit looks reasonably high. Unfortunately, this also
   isolates VFS operations on configfs default groups from the others
   and thus lowers the chances to detect locking issues.

Nobody likes solution 1), which I can understand.

This patch implements solution 2). However lockdep is still not happy with
configfs_depend_item(). Next patch reworks the locking of
configfs_depend_item() and finally makes lockdep happy.

[ Note: This hides a few locking interactions with the VFS from lockdep.
  That was my big concern, because we like lockdep's protection.  However,
  the current state always dumps a spurious warning.  The locking is
  correct, so I tell people to ignore the warning and that we'll keep
  our eyes on the locking to make sure it stays correct.  With this patch,
  we eliminate the warning.  We do lose some of the lockdep protections,
  but this only means that we still have to keep our eyes on the locking.
  We're going to do that anyway.  -- Joel ]

Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-04-30 10:48:23 -07:00
Steve French
d185cda771 [CIFS] rename cifs_strndup to cifs_strndup_from_ucs
In most cases, cifs_strndup is converting from Unicode (UCS2 / UTF-32) to
the configured local code page for the Linux mount (usually UTF8), so
Jeff suggested that to make it more clear that cifs_strndup is doing
a conversion not just memory allocation and copy, rename the function
to including "from_ucs" (ie Unicode)

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 17:45:10 +00:00
Igor Mammedov
5c2503a8e3 Added loop check when mounting DFS tree.
Added loop check when mounting DFS tree. mount will fail with
ELOOP if referral walks exceed MAX_NESTED_LINK count.

Signed-off-by: Igor Mammedov <niallain@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 17:24:09 +00:00
Igor Mammedov
1af28ceb92 Enable dfs submounts to handle remote referrals.
Having remote dfs root support in cifs_mount, we can
afford to pass into it UNC that is remote.

Signed-off-by: Igor Mammedov <niallain@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 17:24:00 +00:00
Steve French
20418acd68 [CIFS] Remove older session setup implementation
Two years ago, when the session setup code in cifs was rewritten and moved
to fs/cifs/sess.c, we were asked to keep the old code for a release or so
(which could be reenabled at runtime) since it was such a large change and
because the asn (SPNEGO) and NTLMSSP code was not rewritten and needed to
be. This was useful to avoid regressions, but is long overdue to be removed.
Now that the Kerberos (asn/spnego) code is working in fs/cifs/sess.c,
and the NTLMSSP code moved (NTLMSSP blob setup be rewritten with the
next patch in this series) quite a bit of dead code from fs/cifs/connect.c
now can be removed.

This old code should have been removed last year, but the earlier krb5
patches did not move/remove the NTLMSSP code which we had asked to
be done first.  Since no one else volunteered, I am doing it now.

It is extremely important that we continue to examine the documentation
for this area, to make sure our code continues to be uptodate with
changes since Windows 2003.

Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 16:13:32 +00:00
Jeff Layton
f58841666b cifs: change cifs_get_name_from_search_buf to use new unicode helper
...and remove cifs_convertUCSpath. There are no more callers. Also add a
#define for the buffer used in the readdir path so that we don't have so
many magic numbers floating around.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:45:01 +00:00
Jeff Layton
460b96960d cifs: change CIFSSMBUnixQuerySymLink to use new helpers
Change CIFSSMBUnixQuerySymLink to use the new unicode helper functions.
Also change the calling conventions so that the allocation of the target
name buffer is done in CIFSSMBUnixQuerySymLink rather than by the caller.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:45:00 +00:00
Jeff Layton
59140797c5 cifs: fix session setup unicode string saving to use new unicode helpers
...and change decode_unicode_ssetup to be a void function. It never
returns an actual error anyway.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:45:00 +00:00
Jeff Layton
cc20c031bb cifs: convert CIFSTCon to use new unicode helper functions
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:45:00 +00:00
Jeff Layton
066ce68994 cifs: rename cifs_strlcpy_to_host and make it use new functions
Rename cifs_strlcpy_to_host to cifs_strndup since that better describes
what this function really does. Then, convert it to use the new string
conversion and measurement functions that work in units of bytes rather
than wide chars.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:45:00 +00:00
Jeff Layton
69f801fcaa cifs: add new function to get unicode string length in bytes
Working in units of words means we do a lot of unnecessary conversion back
and forth. Standardize on bytes instead since that's more useful for
allocating buffers and such. Also, remove hostlen_fromUCS since the new
function has a similar purpose.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:45:00 +00:00
Jeff Layton
7fabf0c947 cifs: add replacement for cifs_strtoUCS_le called cifs_from_ucs2
Add a replacement function for cifs_strtoUCS_le. cifs_from_ucs2
takes args for the source and destination length so that we can ensure
that the function is confined within the intended buffers.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:44:59 +00:00
Jeff Layton
66345f50f0 cifs: move #defines for mapchars into cifs_unicode.h
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-04-30 15:44:59 +00:00
Steve French
912bc6ae3d Merge branch 'master' of /pub/scm/linux/kernel/git/torvalds/linux-2.6 2009-04-30 15:36:52 +00:00
Christoph Hellwig
28e211700a xfs: fix getbmap vs mmap deadlock
xfs_getbmap (or rather the formatters called by it) copy out the getbmap
structures under the ilock, which can deadlock against mmap.  This has
been reported via bugzilla a while ago (#717) and has recently also
shown up via lockdep.

So allocate a temporary buffer to format the kernel getbmap structures
into and then copy them out after dropping the locks.

A little problem with this is that we limit the number of extents we
can copy out by the maximum allocation size, but I see no real way
around that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-30 00:29:02 -05:00
Christoph Hellwig
5f79ed685f xfs: a couple getbmap cleanups
- reshuffle various conditionals for data vs attr fork to make the code
   more readable
 - do fine-grainded goto-based error handling
 - exit early from conditionals instead of keeping a long else branch around
 - allow kmem_alloc to fail

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-30 00:28:31 -05:00
Olaf Weber
b9ec9068d7 xfs: add more checks to superblock validation
There had been reports where xfs filesystem was randomly
corrupted with fsfuzzer, and xfs failed to handle it
gracefully. This patch fixes couple of reported problem
by providing additional checks in the superblock
validation routine.

Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-30 00:26:14 -05:00
Lachlan McIlroy
def6b3ba56 xfs_file_last_byte() needs to acquire ilock
We had some systems crash with this stack:

[<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
[<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
[<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs]
[<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs]
[<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs]
[<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
[<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs]
[<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs]
[<a000000100162ea0>] __fput+0x1a0/0x420
[<a000000100163160>] fput+0x40/0x60

The problem here is that xfs_file_last_byte() does not acquire the
inode lock and can therefore race with another thread that is modifying
the extext list.  While xfs_bmap_last_offset() is trying to lookup
what was the last extent some extents were merged and the extent list
shrunk so the index we lookup is now beyond the end of the extent list
and potentially in a freed buffer.

Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-30 00:25:25 -05:00
Christoph Hellwig
6321e3ed2a xfs: fix getbmap vs mmap deadlock
xfs_getbmap (or rather the formatters called by it) copy out the getbmap
structures under the ilock, which can deadlock against mmap.  This has
been reported via bugzilla a while ago (#717) and has recently also
shown up via lockdep.

So allocate a temporary buffer to format the kernel getbmap structures
into and then copy them out after dropping the locks.

A little problem with this is that we limit the number of extents we
can copy out by the maximum allocation size, but I see no real way
around that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-29 13:25:29 -05:00
Tao Ma
7e31a966ad ocfs2/trivial: Remove unused variable in ocfs2_rename.
With indexed dir enabled, now we use ocfs2_dir_lookup_result to
wrap all the bh used for dir. So remove the 2 unused variables.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-04-29 10:57:18 -07:00
Christoph Hellwig
4be4a00fb5 xfs: a couple getbmap cleanups
- reshuffle various conditionals for data vs attr fork to make the code
   more readable
 - do fine-grainded goto-based error handling
 - exit early from conditionals instead of keeping a long else branch around
 - allow kmem_alloc to fail

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-29 10:00:01 -05:00
Olaf Weber
2ac00af7a6 xfs: add more checks to superblock validation
There had been reports where xfs filesystem was randomly
corrupted with fsfuzzer, and xfs failed to handle it
gracefully. This patch fixes couple of reported problem
by providing additional checks in the superblock
validation routine.

Signed-off-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-29 09:24:29 -05:00
Lachlan McIlroy
f25181f598 xfs_file_last_byte() needs to acquire ilock
We had some systems crash with this stack:

[<a00000010000cb20>] ia64_leave_kernel+0x0/0x280
[<a00000021291ca00>] xfs_bmbt_get_startoff+0x0/0x20 [xfs]
[<a0000002129080b0>] xfs_bmap_last_offset+0x210/0x280 [xfs]
[<a00000021295b010>] xfs_file_last_byte+0x70/0x1a0 [xfs]
[<a00000021295b200>] xfs_itruncate_start+0xc0/0x1a0 [xfs]
[<a0000002129935f0>] xfs_inactive_free_eofblocks+0x290/0x460 [xfs]
[<a000000212998fb0>] xfs_release+0x1b0/0x240 [xfs]
[<a0000002129ad930>] xfs_file_release+0x70/0xa0 [xfs]
[<a000000100162ea0>] __fput+0x1a0/0x420
[<a000000100163160>] fput+0x40/0x60

The problem here is that xfs_file_last_byte() does not acquire the
inode lock and can therefore race with another thread that is modifying
the extext list.  While xfs_bmap_last_offset() is trying to lookup
what was the last extent some extents were merged and the extent list
shrunk so the index we lookup is now beyond the end of the extent list
and potentially in a freed buffer.

Signed-off-by: Lachlan McIlroy <lmcilroy@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
2009-04-29 09:14:10 -05:00
Ingo Molnar
e7fd5d4b3d Merge branch 'linus' into perfcounters/core
Merge reason: This brach was on -rc1, refresh it to almost-rc4 to pick up
              the latest upstream fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-04-29 14:47:05 +02:00
FUJITA Tomonori
69838727bc bio: fix memcpy corruption in bio_copy_user_iov()
st driver uses blk_rq_map_user() in order to just build a request out
of page frames. In this case, map_data->offset is a non zero value and
iov[0].iov_base is NULL. We need to increase nr_pages for that.

Cc: stable@kernel.org
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28 20:24:29 +02:00
Tejun Heo
08cbf542bf fuse: export symbols to be used by CUSE
Export the following symbols for CUSE.

fuse_conn_put()
fuse_conn_get()
fuse_conn_kill()
fuse_send_init()
fuse_do_open()
fuse_sync_release()
fuse_direct_io()
fuse_do_ioctl()
fuse_file_poll()
fuse_request_alloc()
fuse_get_req()
fuse_put_request()
fuse_request_send()
fuse_abort_conn()
fuse_dev_release()
fuse_dev_operations

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:42 +02:00
Tejun Heo
a325f9b922 fuse: update fuse_conn_init() and separate out fuse_conn_kill()
Update fuse_conn_init() such that it doesn't take @sb and move bdi
registration into a separate function.  Also separate out
fuse_conn_kill() from fuse_put_super().

These will be used to implement cuse.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:41 +02:00
Miklos Szeredi
797759aaf3 fuse: don't use inode in fuse_file_poll
Use ff->fc and ff->nodeid instead of file->f_dentry->d_inode in the
fuse_file_poll() implementation.

This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:41 +02:00
Miklos Szeredi
d36f248710 fuse: don't use inode in fuse_do_ioctl() helper
Create a helper for sending an IOCTL request that doesn't use a struct
inode.

This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:39 +02:00
Miklos Szeredi
8b0797a498 fuse: don't use inode in fuse_sync_release()
Make fuse_sync_release() a generic helper function that doesn't need a
struct inode pointer.  This makes it suitable for use by CUSE.

Change return value of fuse_release_common() from int to void.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:39 +02:00
Miklos Szeredi
91fe96b403 fuse: create fuse_do_open() helper for CUSE
Create a helper for sending an OPEN request that doesn't need a struct
inode pointer.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:37 +02:00
Miklos Szeredi
c7b7143c63 fuse: clean up args in fuse_finish_open() and fuse_release_fill()
Move setting ff->fh, ff->nodeid and file->private_data outside
fuse_finish_open().  Add ->open_flags member to struct fuse_file.

This simplifies the argument passing to fuse_finish_open() and
fuse_release_fill(), and paves the way for creating an open helper
that doesn't need an inode pointer.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:37 +02:00
Miklos Szeredi
2106cb1893 fuse: don't use inode in helpers called by fuse_direct_io()
Use ff->fc and ff->nodeid instead of passing down the inode.

This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:37 +02:00
Miklos Szeredi
da5e471457 fuse: add members to struct fuse_file
Add new members ->fc and ->nodeid to struct fuse_file.  This will aid
in converting functions for use by CUSE, where the inode is not owned
by a fuse filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:36 +02:00
Miklos Szeredi
d09cb9d7f6 fuse: prepare fuse_direct_io() for CUSE
Move code operating on the inode out from fuse_direct_io().

This prepares this function for use by CUSE, where the inode is not
owned by a fuse filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:36 +02:00
Miklos Szeredi
2d698b0703 fuse: clean up fuse_write_fill()
Move out code from fuse_write_fill() which is not common to all
callers.  Remove two function arguments which become unnecessary.

Also remove unnecessary memset(), the request is already initialized
to zero.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:36 +02:00
Miklos Szeredi
b0be46ebf7 fuse: use struct path in release structure
Use struct path instead of separate dentry and vfsmount in
req->misc.release.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:36 +02:00
Miklos Szeredi
fd9db72977 fuse: destroy bdi on error
Destroy bdi on error in fuse_fill_super().

This was an omission from commit 26c3679101
"fuse: destroy bdi on umount", which moved the bdi_destroy() call from
fuse_conn_put() to fuse_put_super().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2009-04-28 16:56:35 +02:00
Tejun Heo
6b2db28a7a fuse: misc cleanups
* fuse_file_alloc() was structured in weird way.  The success path was
  split between else block and code following the block.  Restructure
  the code such that it's easier to read and modify.

* Unindent success path of fuse_release_common() to ease future
  changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:35 +02:00
Jeff Moyer
db2dbb12dc block: implement blkdev_readpages
Doing a proper block dev ->readpages() speeds up the crazy dump(8)
approach of using interleaved process IO.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-28 07:37:33 +02:00
Theodore Ts'o
8b0f9e8f78 ext4: avoid unnecessary spinlock in critical POSIX ACL path
If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem
to do POSIX ACL checks on any files not owned by the caller, and it does
this for every single pathname component that it looks up.

That obviously can be pretty expensive if the filesystem isn't careful
about it, especially with locking. That's doubly sad, since the common
case tends to be that there are no ACL's associated with the files in
question.

ext4 already caches the ACL data so that it doesn't have to look it up
over and over again, but it does so by taking the inode->i_lock spinlock
on every lookup. Which is a noticeable overhead even if it's a private
lock, especially on CPU's where the serialization is expensive (eg Intel
Netburst aka 'P4').

For the special case of not actually having any ACL's, all that locking is
unnecessary. Even if somebody else were to be changing the ACL's on
another CPU, we simply don't care - if we've seen a NULL ACL, we might as
well use it.

So just load the ACL speculatively without any locking, and if it was
NULL, just use it. If it's non-NULL (either because we had a cached
entry, or because the cache hasn't been filled in at all), it means that
we'll need to get the lock and re-load it properly.

(This commit was ported from a patch originally authored by Linus for
ext3.)

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-27 17:33:23 -04:00
Linus Torvalds
96159f2511 ext3: avoid unnecessary spinlock in critical POSIX ACL path
If a filesystem supports POSIX ACL's, the VFS layer expects the filesystem 
to do POSIX ACL checks on any files not owned by the caller, and it does 
this for every single pathname component that it looks up.

That obviously can be pretty expensive if the filesystem isn't careful 
about it, especially with locking. That's doubly sad, since the common 
case tends to be that there are no ACL's associated with the files in 
question.

ext3 already caches the ACL data so that it doesn't have to look it up 
over and over again, but it does so by taking the inode->i_lock spinlock 
on every lookup. Which is a noticeable overhead even if it's a private 
lock, especially on CPU's where the serialization is expensive (eg Intel 
Netburst aka 'P4').

For the special case of not actually having any ACL's, all that locking is 
unnecessary. Even if somebody else were to be changing the ACL's on 
another CPU, we simply don't care - if we've seen a NULL ACL, we might as 
well use it.

So just load the ACL speculatively without any locking, and if it was 
NULL, just use it. If it's non-NULL (either because we had a cached 
entry, or because the cache hasn't been filled in at all), it means that 
we'll need to get the lock and re-load it properly.

This is noticeable even on Nehalem, which does locking quite well (much 
better than P4). From lmbench:

	Processor, Processes - times in microseconds - smaller is better
	--------------------------------------------------------------------
	Host                 OS  Mhz null null      open slct fork exec sh  
	                             call  I/O stat clos TCP  proc proc proc
	--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ----
 - before:
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.45 2.18 69.1 273. 1141
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.95 1.48 2.28 69.9 253. 1140
	nehalem.l Linux 2.6.30- 3193 0.04 0.10 0.95 1.42 2.19 68.6 284. 1141
 - after:
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.44 2.12 68.3 282. 1094
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.20 67.0 308. 1123
	nehalem.l Linux 2.6.30- 3193 0.04 0.09 0.92 1.39 2.36 67.4 293. 1148

where you can see what appears to be a roughly 3% improvement in stat
and open/close latencies from just the removal of the locking overhead. 

Of course, this only matters for files you don't own (the owner never 
needs to do the ACL checks), but that's the common case for libraries, 
header files, and executables. As well as for the base components of any 
absolute pathname, even if you are the owner of the final file.

[ At some point we probably want to move this ACL caching logic entirely
  into the VFS layer (and only call down to the filesystem when
  uncached), but in the meantime this improves ext3 a bit.

  A similar fix to btrfs makes a much bigger difference (15x improvement
  in lmbench) due to broken caching. ]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
2009-04-27 17:24:22 -04:00
Theodore Ts'o
9bffad1ed2 ext4: convert instrumentation from markers to tracepoints
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17 11:48:11 -04:00
Theodore Ts'o
879c5e6b7c jbd2: convert instrumentation from markers to tracepoints
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-06-17 11:47:48 -04:00
Tyler Hicks
ac20100df7 eCryptfs: Fix min function comparison warning
This warning shows up on 64 bit builds:

fs/ecryptfs/inode.c:693: warning: comparison of distinct pointer types
lacks a cast

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-27 13:31:12 -05:00
Linus Torvalds
4ebf662337 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: look for acls during btrfs_read_locked_inode
  Btrfs: fix acl caching
  Btrfs: Fix a bunch of printk() warnings.
  Btrfs: Fix a trivial warning using max() of u64 vs ULL.
  Btrfs: remove unused btrfs_bit_radix slab
  Btrfs: ratelimit IO error printks
  Btrfs: remove #if 0 code
  Btrfs: When shrinking, only update disk size on success
  Btrfs: fix deadlocks and stalls on dead root removal
  Btrfs: fix fallocate deadlock on inode extent lock
  Btrfs: kill btrfs_cache_create
  Btrfs: don't export symbols
  Btrfs: simplify makefile
  Btrfs: try to keep a healthy ratio of metadata vs data block groups
2009-04-27 11:16:33 -07:00
Randy Dunlap
802b352f29 ecryptfs: fix printk format warning
fs/ecryptfs/inode.c:670: warning: format '%d' expects type 'int', but argument 3 has type 'size_t'

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: Dustin Kirkland <kirkland@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2009-04-27 13:10:06 -05:00
Chris Mason
46a53cca82 Btrfs: look for acls during btrfs_read_locked_inode
This changes btrfs_read_locked_inode() to peek ahead in the btree for acl items.
If it is certain a given inode has no acls, it will set the in memory acl
fields to null to avoid acl lookups completely.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 13:18:35 -04:00
Chris Mason
7b1a14bbb0 Btrfs: fix acl caching
Linus noticed the btrfs code to cache acls wasn't properly caching
a NULL acl when the inode didn't have any acls.  This meant the common
case of no acls resulted in expensive btree searches every time the
kernel checked permissions (which is quite often).

This is a modified version of Linus' original patch:

Properly set initial acl fields to BTRFS_ACL_NOT_CACHED in the inode.
This forces an acl lookup when permission checks are done.

Fix btrfs_get_acl to avoid lookups and locking when the inode acls fields
are set to null.

Fix btrfs_get_acl to use the right return value from __btrfs_getxattr
when deciding to cache a NULL acl.  It was storing a NULL acl when
__btrfs_getxattr return -ENOENT, but __btrfs_getxattr was actually returning
-ENODATA for this case.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 13:18:26 -04:00
Linus Torvalds
dccdee460e Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6:
  ext2: missing unlock in ext2_quota_write()
  quota: remove obsolete comments in fs/quota/Makefile
2009-04-27 08:40:00 -07:00
Dan Carpenter
a069e9cee1 ext2: missing unlock in ext2_quota_write()
The inode->i_mutex should be unlocked.

Found by smatch (http://repo.or.cz/w/smatch.git).  Compile tested.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-04-27 16:49:52 +02:00
Christoph Hellwig
fd1b52435a quota: remove obsolete comments in fs/quota/Makefile
Get rid of useless comments and the equally useless obj-y
initialization.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2009-04-27 16:49:52 +02:00
Joel Becker
21380931eb Btrfs: Fix a bunch of printk() warnings.
Just happened to notice a bunch of %llu vs u64 warnings.  Here's a patch
to cast them all.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 08:37:49 -04:00
Joel Becker
e63b6a6c0f Btrfs: Fix a trivial warning using max() of u64 vs ULL.
A small warning popped up on ia64 because inode-map.c was comparing a
u64 object id with the ULL FIRST_FREE_OBJECTID.  My first thought was
that all the OBJECTID constants should contain the u64 cast because
btrfs code deals entirely in u64s.  But then I saw how large that was,
and figured I'd just fix the max() call.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 08:37:49 -04:00
Chris Mason
45c06543af Btrfs: remove unused btrfs_bit_radix slab
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 08:37:48 -04:00
Chris Mason
193f284d49 Btrfs: ratelimit IO error printks
Btrfs has printks for various IO errors, including bad checksums and
mismatches between what we expect the block headers to contain and what
we actually find on the disk.

Longer term we need a real reporting mechanism for this, but for now
printk is going to have to do.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 07:41:47 -04:00
Chris Mason
b7967db75a Btrfs: remove #if 0 code
Btrfs had some old code sitting around under #if 0, this drops it.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 07:40:52 -04:00
Chris Ball
d6397baee4 Btrfs: When shrinking, only update disk size on success
Previously, we updated a device's size prior to attempting a shrink
operation.  This patch moves the device resizing logic to only happen if
the shrink completes successfully.  In the process, it introduces a new
field to btrfs_device -- disk_total_bytes -- to track the on-disk size.

Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-27 07:40:51 -04:00
Theodore Ts'o
32ed5058ce ext4: Replace lock/unlock_super() with an explicit lock for resizing
Use a separate lock to protect s_groups_count and the other block
group descriptors which get changed via an on-line resize operation,
so we can stop overloading the use of lock_super().
    
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-25 22:53:39 -04:00
Theodore Ts'o
3b9d4ed266 ext4: Replace lock/unlock_super() with an explicit lock for the orphan list
Use a separate lock to protect the orphan list, so we can stop
overloading the use of lock_super().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-25 22:54:04 -04:00
Theodore Ts'o
a63c9eb2ce ext4: ext4_mark_recovery_complete() doesn't need to use lock_super
The function ext4_mark_recovery_complete() is called from two call
paths: either (a) while mounting the filesystem, in which case there's
no danger of any other CPU calling write_super() until the mount is
completed, and (b) while remounting the filesystem read-write, in
which case the fs core has already locked the superblock.  This also
allows us to take out a very vile unlock_super()/lock_super() pair in
ext4_remount().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 01:59:42 -04:00
Theodore Ts'o
114e9fc907 ext4: Remove outdated comment about lock_super()
ext4_fill_super() is no longer called by read_super(), and it is no
longer called with the superblock locked.  The
unlock_super()/lock_super() is no longer present, so this comment is
entirely superfluous.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-25 15:48:07 -04:00
Theodore Ts'o
8df9675f8b ext4: Avoid races caused by on-line resizing and SMP memory reordering
Ext4's on-line resizing adds a new block group and then, only at the
last step adjusts s_groups_count.  However, it's possible on SMP
systems that another CPU could see the updated the s_group_count and
not see the newly initialized data structures for the just-added block
group.  For this reason, it's important to insert a SMP read barrier
after reading s_groups_count and before reading any (for example) the
new block group descriptors allowed by the increased value of
s_groups_count.

Unfortunately, we rather blatently violate this locking protocol
documented in fs/ext4/resize.c.  Fortunately, (1) on-line resizes
happen relatively rarely, and (2) it seems rare that the filesystem
code will immediately try to use just-added block group before any
memory ordering issues resolve themselves.  So apparently problems
here are relatively hard to hit, since ext3 has been vulnerable to the
same issue for years with no one apparently complaining.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 08:50:38 -04:00
Theodore Ts'o
9ca92389c5 ext4: Use separate super_operations structure for no_journal filesystems
By using a separate super_operations structure for filesystems that
have and don't have journals, we can simply ext4_write_super() ---
which is only needed when no journal is present --- and ext4_freeze(),
ext4_unfreeze(), and ext4_sync_fs(), which are only needed when the
journal is present.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 12:52:25 -04:00
Theodore Ts'o
7234ab2a55 ext4: Fix and simplify s_dirt handling
The s_dirt flag wasn't completely handled correctly, but it didn't
really matter when journalling was enabled.  It turns out that when
ext4 runs without a journal, we don't clear s_dirt in places where we
should have, with the result that the high-level write_super()
function was writing the superblock when it wasn't necessary.

So we fix this by making ext4_commit_super() clear the s_dirt flag,
and removing many of the other places where s_dirt is manipulated.
When journalling is enabled, the s_dirt flag might be left set more
often, but s_dirt really doesn't matter when journalling is enabled.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-30 21:24:04 -04:00
Theodore Ts'o
e2d670523c ext4: Simplify ext4_commit_super()'s function signature
The ext4_commit_super() function took both a struct super_block * and
a struct ext4_super_block *, but the struct ext4_super_block can be
derived from the struct super_block.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-01 00:33:44 -04:00
Theodore Ts'o
f7c439504c ext4: Use is_power_of_2() for clarity
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-24 23:31:59 -04:00
Theodore Ts'o
c5ca7c7636 ext4: Fallback to vmalloc if kmalloc can't allocate s_flex_groups array
For very large filesystems, the s_flex_groups array can get quite big.
For example, a filesystem that can be resized up to 16TB will have
8192 flex groups (assuming the default flex_bg size of 16), so the
array is 96k, which is *very* marginal for kmalloc().  On the other
hand, a 160GB filesystem without the resize_inode feature will only
require 960 bytes.  So we try to allocate the array first using
kmalloc(), and if that fails, we'll try to use vmalloc() instead.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-27 22:48:48 -04:00
Aneesh Kumar K.V
29fa89d088 ext4: Mark the unwritten buffer_head as mapped during write_begin
Setting BH_Unwritten buffer_heads as BH_Mapped avoids multiple
(unnecessary) calls to get_block() during the call to the write(2)
system call.  Setting BH_Unwritten buffer heads as BH_Mapped requires
that the writepages() functions can handle BH_Unwritten buffer_heads.

After this commit, things work as follows:

ext4_ext_get_block() returns unmapped, unwritten, buffer head when
called with create = 0 for prealloc space. This makes sure we handle
the read path and non-delayed allocation case correctly.  Even though
the buffer head is marked unmapped we have valid b_blocknr and b_bdev
values in the buffer_head.

ext4_da_get_block_prep() called for block resrevation will now return
mapped, unwritten, new buffer_head for prealloc space. This avoids
multiple calls to get_block() for write to same offset. By making such
buffers as BH_New, we also assure that sub-block zeroing of buffered
writes happens correctly.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-12 16:30:27 -04:00
Aneesh Kumar K.V
8fb0e34248 vfs: Add BUG_ON for delayed and unwritten flags in submit_bh()
The BH_Delay and BH_Unwritten flags should never leak out to
submit_bh().  So add some BUG_ON() checks to submit_bh so we can get a
stack trace and determine how and why this might have happened.

(Note that only XFS and ext4 use these buffer head flags, and XFS does
not use submit_bh().  So this patch should only modify behavior for
ext4.)

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
2009-05-12 16:22:37 -04:00
Aneesh Kumar K.V
79ffab3439 ext4: Properly initialize the buffer_head state
These struct buffer_heads are allocated on the stack (and hence are
initialized with stack garbage).  They are only used to call a
get_blocks() function, so that's mostly OK, but b_state must be
initialized to be 0 so we don't have any unexpected BH_* flags set by
accident, such as BH_Unwritten or BH_Delay.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-05-13 15:13:42 -04:00
Theodore Ts'o
c4b5a61431 ext4: Do not try to validate extents on special files
The EXTENTS_FL flag should never be set on special files, but if it
is, don't bother trying to validate that the extents tree is valid,
since only files, directories, and non-fast symlinks will ever have an
extent data structure.  We perhaps should flag the filesystem as being
corrupted if we see a special file (named pipes, device nodes, Unix
domain sockets, etc.) with the EXTENTS_FL flag, but e2fsck doesn't
currently check this case, so we'll just ignore this for now, since
it's harmless.

Without this fix, a special device with the extents flag is flagged as
an error by the kernel, so it is impossible to access or delete the
inode, but e2fsck doesn't see it as a problem, leading to
confused/frustrated users.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-24 18:45:35 -04:00
David Howells
4b2b0b9753 ROMFS: Advance destination buffer pointer when reading from a blockdev
RomFS should advance the destination buffer pointer when reading data from a
blockdev source (the data may be split over multiple blocks, each requiring its
own sb_read() call).  Without this, all the data is copied to the beginning of
the output buffer.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-24 13:28:31 -07:00
David Howells
84baf74bf2 ROMFS: romfs_lookup() shouldn't be doing a partial name comparison
romfs_lookup() should be using a routine akin to strcmp() on the backing store,
rather than one akin to strncmp().  If it uses the latter, it's liable to match
/bin/shutdown when looking up /bin/sh.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-24 13:28:31 -07:00
Theodore Ts'o
a9e817425d ext4: Ignore i_file_acl_high unless EXT4_FEATURE_INCOMPAT_64BIT is present
Don't try to look at i_file_acl_high unless the INCOMPAT_64BIT feature
bit is set.  The field is normally zero, but older versions of e2fsck
didn't automatically check to make sure of this, so in the spirit of
"be liberal in what you accept", don't look at i_file_acl_high unless
we are using a 64-bit filesystem.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-24 16:11:18 -04:00
Chris Mason
59bc5c758e Btrfs: fix deadlocks and stalls on dead root removal
After a transaction commit, the old root of the subvol btrees are sent through
snapshot removal.  This is what actually frees up any blocks replaced by
COW, and anything the old blocks pointed to.

Snapshot deletion will pause when a transaction commit has started, which
helps to avoid a huge amount of delayed reference count updates piling up
as the transaction is trying to close.

But, this pause happens after the snapshot deletion process has asked other
procs on the system to throttle back a bit so that it can make progress.

We don't want to throttle everyone while we're waiting for the transaction
commit, it leads to deadlocks in the user transaction ioctls used by Ceph
and makes things slower in general.

This patch changes things to avoid the throttling while we sleep.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:05 -04:00
Chris Mason
e980b50cda Btrfs: fix fallocate deadlock on inode extent lock
The btrfs fallocate call takes an extent lock on the entire range
being fallocated, and then runs through insert_reserved_extent on each
extent as they are allocated.

The problem with this is that btrfs_drop_extents may decide to try
and take the same extent lock fallocate was already holding.  The solution
used here is to push down knowledge of the range that is already locked
going into btrfs_drop_extents.

It turns out that at least one other caller had the same bug.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:05 -04:00
Christoph Hellwig
9601e3f633 Btrfs: kill btrfs_cache_create
Just use kmem_cache_create directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:04 -04:00
Christoph Hellwig
0d4bf11e53 Btrfs: don't export symbols
Currently the extent_map code is only for btrfs so don't export it's
symbols.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:04 -04:00
Christoph Hellwig
2ea2544ef5 Btrfs: simplify makefile
Get rid of the hacks for building out of tree, and always use += for
assigning to the object lists.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:03 -04:00
Josef Bacik
97e728d435 Btrfs: try to keep a healthy ratio of metadata vs data block groups
This patch makes the chunk allocator keep a good ratio of metadata vs data
block groups.  By default for every 8 data block groups, we'll allocate 1
metadata chunk, or about 12% of the disk will be allocated for metadata.  This
can be changed by specifying the metadata_ratio mount option.

This is simply the number of data block groups that have to be allocated to
force a metadata chunk allocation.  By making sure we allocate metadata chunks
more often, we are less likely to get into situations where the whole disk
has been allocated as data block groups.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-04-24 15:46:02 -04:00
Theodore Ts'o
485c26ec70 ext4: Fix softlockup caused by illegal i_file_acl value in on-disk inode
If the block containing external extended attributes (which is stored
in i_file_acl and i_file_acl_high) is larger than the on-disk
filesystem, the process which tried to access the extended attributes
will endlessly issue kernel printks complaining that
"__find_get_block_slow() failed", locking up that CPU until the system
is forcibly rebooted.

So when we read in the inode, make sure the i_file_acl value is legal,
and if not, flag the filesystem as being corrupted.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-24 13:43:20 -04:00
Linus Torvalds
a4277bf122 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix potential inode allocation soft lockup in Orlov allocator
  ext4: Make the extent validity check more paranoid
  jbd: use SWRITE_SYNC_PLUG when writing synchronous revoke records
  jbd2: use SWRITE_SYNC_PLUG when writing synchronous revoke records
  ext4: really print the find_group_flex fallback warning only once
2009-04-24 08:37:40 -07:00
Linus Torvalds
ff91fad2db Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
  eCryptfs: Larger buffer for encrypted symlink targets
  eCryptfs: Lock lower directory inode mutex during lookup
  eCryptfs: Remove ecryptfs_unlink_sigs warnings
  eCryptfs: Fix data corruption when using ecryptfs_passthrough
  eCryptfs: Print FNEK sig properly in /proc/mounts
  eCryptfs: NULL pointer dereference in ecryptfs_send_miscdev()
  eCryptfs: Copy lower inode attrs before dentry instantiation
2009-04-24 08:32:44 -07:00
Linus Torvalds
58be18c4de Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
  [S390] update default configuration.
  [S390] omit frame pointers on s390 when possible
  [S390] Use tape_generic_offline directly.
  [S390] /proc/stat idle field for idle cpus
  [S390] appldata: avoid deadlock with appldata_mem
  [S390] ipl: fix compile breakage
2009-04-24 08:28:27 -07:00
Linus Torvalds
12bac708e6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
  GFS2: Ensure that the inode goal block settings are updated
  GFS2: Fix bug in block allocation
  bitops: Add __ffs64 bitop
2009-04-24 08:27:02 -07:00
Linus Torvalds
97c68d00db Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  cfq-iosched: cache prio_tree root in cfqq->p_root
  cfq-iosched: fix bug with aliased request and cooperation detection
  cfq-iosched: clear ->prio_trees[] on cfqd alloc
  block: fix intermittent dm timeout based oops
  umem: fix request_queue lock warning
  block: simplify I/O stat accounting
  pktcdvd.h should include mempool.h
  cfq-iosched: use the default seek distance when there aren't enough seek samples
  cfq-iosched: make seek_mean converge more quickly
  block: make blk_abort_queue() ignore non-request based devices
  block: include empty disks in /proc/diskstats
  bio: use bio_kmalloc() in copy/map functions
  bio: fix bio_kmalloc()
  block: fix queue bounce limit setting
  block: fix SG_IO vector request data length handling
  scatterlist: make sure sg_miter_next() doesn't return 0 sized mappings
2009-04-24 07:48:24 -07:00
Oleg Nesterov
437f7fdb60 check_unsafe_exec: s/lock_task_sighand/rcu_read_lock/
write_lock(&current->fs->lock) guarantees we can't wrongly miss
LSM_UNSAFE_SHARE, this is what we care about. Use rcu_read_lock()
instead of ->siglock to iterate over the sub-threads. We must see
all CLONE_THREAD|CLONE_FS threads which didn't pass exit_fs(), it
takes fs->lock too.

With or without this patch we can miss the freshly cloned thread
and set LSM_UNSAFE_SHARE, we don't care.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
[ Fixed lock/unlock typo  - Hugh ]
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-24 07:39:45 -07:00
Oleg Nesterov
8c652f96d3 do_execve() must not clear fs->in_exec if it was set by another thread
If do_execve() fails after check_unsafe_exec(), it clears fs->in_exec
unconditionally. This is wrong if we race with our sub-thread which
also does do_execve:

	Two threads T1 and T2 and another process P, all share the same
	->fs.

	T1 starts do_execve(BAD_FILE). It calls check_unsafe_exec(), since
	->fs is shared, we set LSM_UNSAFE but not ->in_exec.

	P exits and decrements fs->users.

	T2 starts do_execve(), calls check_unsafe_exec(), now ->fs is not
	shared, we set fs->in_exec.

	T1 continues, open_exec(BAD_FILE) fails, we clear ->in_exec and
	return to the user-space.

	T1 does clone(CLONE_FS /* without CLONE_THREAD */).

	T2 continues without LSM_UNSAFE_SHARE while ->fs is shared with
	another process.

Change check_unsafe_exec() to return res = 1 if we set ->in_exec, and change
do_execve() to clear ->in_exec depending on res.

When do_execve() suceeds, it is safe to clear ->in_exec unconditionally.
It can be set only if we don't share ->fs with another process, and since
we already killed all sub-threads either ->in_exec == 0 or we are the
only user of this ->fs.

Also, we do not need fs->lock to clear fs->in_exec.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-24 07:39:45 -07:00
Sunil Mushran
a5a0a63092 ocfs2: Add missing iput() during error handling in ocfs2_dentry_attach_lock()
In ocfs2_dentry_attach_lock(), if unable to get the dentry lock, we need to
call iput(inode) because a failure here means no d_instantiate(), which means
the normally matching iput() will not be called during dput(dentry).

This patch fixes the oops that accompanies the following message:
(3996,1):dlm_empty_lockres:2708 ERROR: lockres W00000000000000000a1046b06a4382 still has local locks!
kernel BUG in dlm_empty_lockres at /rpmbuild/smushran/BUILD/ocfs2-1.4.2/fs/ocfs2/dlm/dlmmaster.c:2709!

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-04-23 14:56:13 -07:00
Martin Schwidefsky
e1c805309d [S390] /proc/stat idle field for idle cpus
The cpu idle field in the output of /proc/stat is too small for cpus
that have been idle for more than a tick. Add the architecture hook
arch_idle_time that allows to add the not accounted idle time of a
sleeping cpu without waking the cpu.

The s390 implementation of arch_idle_time uses the already existing
s390_idle_data per_cpu variable to find the sleep time of a neighboring
idle cpu.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2009-04-23 13:58:17 +02:00
Steven Whitehouse
d9ba7615bf GFS2: Ensure that the inode goal block settings are updated
GFS2 has a goal block associated with each inode indicating the
search start position for future block allocations (in fact there
are two, but thats for backward compatibility with GFS1 as they
are set to identical locations in GFS2).

In some circumstances, depending on the ordering of updates to
the inode it was possible for the goal block settings to not
be updated on disk. This patch ensures that the goal block will
always get updated, thus reducing the potential for searching
the same (already allocated) blocks again when looking for free
space during block allocation.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-23 10:07:37 +01:00
Steven Whitehouse
d8bd504ab8 GFS2: Fix bug in block allocation
The new bitfit algorithm was counting from the wrong end of
64 bit words in the bitfield. This fixes it by using __ffs64
instead of fls64

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2009-04-23 10:07:16 +01:00
Theodore Ts'o
b5451f7b26 ext4: Fix potential inode allocation soft lockup in Orlov allocator
If the Orlov allocator is having trouble finding an appropriate block
group, the fallback code could loop forever, causing a soft lockup
warning in find_group_orlov():

BUG: soft lockup - CPU#0 stuck for 61s! [cp:11728]
     ...
Pid: 11728, comm: cp Not tainted (2.6.30-rc1-dirty #77) Lenovo          
EIP: 0060:[<c021650e>] EFLAGS: 00000246 CPU: 0
EIP is at ext4_get_group_desc+0x54/0x9d
    ...
Call Trace:
 [<c0218021>] find_group_orlov+0x2ee/0x334
 [<c0120a5f>] ? sched_clock+0x8/0xb
 [<c02188e3>] ext4_new_inode+0x2cf/0xb1a

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-22 21:00:36 -04:00
Theodore Ts'o
e84a26ce17 ext4: Make the extent validity check more paranoid
Instead of just checking that the extent block number is greater or
equal than s_first_data_block, make sure it it is not pointing into
the block group descriptors, since that is clearly wrong.  This helps
prevent filesystem from getting very badly corrupted in case an extent
block is corrupted.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-22 20:52:25 -04:00
Tyler Hicks
3a6b42cadc eCryptfs: Larger buffer for encrypted symlink targets
When using filename encryption with eCryptfs, the value of the symlink
in the lower filesystem is encrypted and stored as a Tag 70 packet.
This results in a longer symlink target than if the target value wasn't
encrypted.

Users were reporting these messages in their syslog:

[ 45.653441] ecryptfs_parse_tag_70_packet: max_packet_size is [56]; real
packet size is [51]
[ 45.653444] ecryptfs_decode_and_decrypt_filename: Could not parse tag
70 packet from filename; copying through filename as-is

This was due to bufsiz, one the arguments in readlink(), being used to
when allocating the buffer passed to the lower inode's readlink().
That symlink target may be very large, but when decoded and decrypted,
could end up being smaller than bufsize.

To fix this, the buffer passed to the lower inode's readlink() will
always be PATH_MAX in size when filename encryption is enabled.  Any
necessary truncation occurs after the decoding and decrypting.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22 17:02:46 -05:00
Tyler Hicks
ca8e34f2b0 eCryptfs: Lock lower directory inode mutex during lookup
This patch locks the lower directory inode's i_mutex before calling
lookup_one_len() to find the appropriate dentry in the lower filesystem.
This bug was found thanks to the warning set in commit 2f9092e1.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22 16:27:12 -05:00
Tyler Hicks
e77cc8d243 eCryptfs: Remove ecryptfs_unlink_sigs warnings
A feature was added to the eCryptfs umount helper to automatically
unlink the keys used for an eCryptfs mount from the kernel keyring upon
umount.  This patch keeps the unrecognized mount option warnings for
ecryptfs_unlink_sigs out of the logs.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22 04:08:46 -05:00
Tyler Hicks
13a791b4e6 eCryptfs: Fix data corruption when using ecryptfs_passthrough
ecryptfs_passthrough is a mount option that allows eCryptfs to allow
data to be written to non-eCryptfs files in the lower filesystem.  The
passthrough option was causing data corruption due to it not always
being treated as a non-eCryptfs file.

The first 8 bytes of an eCryptfs file contains the decrypted file size.
This value was being written to the non-eCryptfs files, too.  Also,
extra 0x00 characters were being written to make the file size a
multiple of PAGE_CACHE_SIZE.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22 03:54:13 -05:00
Tyler Hicks
3a5203ab3c eCryptfs: Print FNEK sig properly in /proc/mounts
The filename encryption key signature is not properly displayed in
/proc/mounts.  The "ecryptfs_sig=" mount option name is displayed for
all global authentication tokens, included those for filename keys.

This patch checks the global authentication token flags to determine if
the key is a FEKEK or FNEK and prints the appropriate mount option name
before the signature.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22 03:54:13 -05:00
Tyler Hicks
57ea34d199 eCryptfs: NULL pointer dereference in ecryptfs_send_miscdev()
If data is NULL, msg_ctx->msg is set to NULL and then dereferenced
afterwards.  ecryptfs_send_raw_message() is the only place that
ecryptfs_send_miscdev() is called with data being NULL, but the only
caller of that function (ecryptfs_process_helo()) is never called.  In
short, there is currently no way to trigger the NULL pointer
dereference.

This patch removes the two unused functions and modifies
ecryptfs_send_miscdev() to remove the NULL dereferences.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22 03:54:13 -05:00
Tyler Hicks
ae6e84596e eCryptfs: Copy lower inode attrs before dentry instantiation
Copies the lower inode attributes to the upper inode before passing the
upper inode to d_instantiate().  This is important for
security_d_instantiate().

The problem was discovered by a user seeing SELinux denials like so:

type=AVC msg=audit(1236812817.898:47): avc:  denied  { 0x100000 } for
pid=3584 comm="httpd" name="testdir" dev=ecryptfs ino=943872
scontext=root:system_r:httpd_t:s0
tcontext=root:object_r:httpd_sys_content_t:s0 tclass=file

Notice target class is file while testdir is really a directory,
confusing the permission translation (0x100000) due to the wrong i_mode.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-04-22 03:54:12 -05:00
Tejun Heo
a9e9dc24bb bio: use bio_kmalloc() in copy/map functions
Impact: remove possible deadlock condition

There is no reason to use mempool backed allocation for map functions.
Also, because kern mapping is used inside LLDs (e.g. for EH), using
mempool backed allocation can lead to deadlock under extreme
conditions (mempool already consumed by the time a request reached EH
and requests are blocked on EH).

Switch copy/map functions to bio_kmalloc().

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-22 08:35:10 +02:00