dect
/
linux-2.6
Archived
13
0
Fork 0
Commit Graph

5060 Commits

Author SHA1 Message Date
Ralf Baechle 75606dc69a [AX25/NETROM/ROSE]: Convert to use modern wait queue API
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:15 -07:00
Peter P. Waskiewicz Jr 80feaacb8a [AF_PACKET]: Add option to return orig_dev to userspace.
Add a packet socket option to allow the orig_dev index to be returned
to userspace when passing traffic through a decapsulated device, such
as the bonding driver.

This is very useful for layer 2 traffic being able to report which
physical device actually received the traffic, instead of having the
encapsulating device hide that information.

The new option is called PACKET_ORIGDEV.

Signed-off-by: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:14 -07:00
YOSHIFUJI Hideaki 1370b5a59b [IPV6] SNMP: Export statistics via netlink without CONFIG_PROC_FS.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:13 -07:00
YOSHIFUJI Hideaki 334901700f [IPV4] SNMP: Move some statistic bits to net/ipv4/proc.c.
This also fixes memory leak in error path.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:12 -07:00
YOSHIFUJI Hideaki 49ed67a9ee [IPV6] SNMP: Move some statistic bits to net/ipv6/proc.c.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:11 -07:00
YOSHIFUJI Hideaki bf99f1bde3 [IPV6] SNMP: Netlink interface.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:10 -07:00
John Heffner 628a5c5618 [INET]: Add IP(V6)_PMTUDISC_RPOBE
Add IP(V6)_PMTUDISC_PROBE value for IP(V6)_MTU_DISCOVER.  This option forces
us not to fragment, but does not make use of the kernel path MTU discovery.
That is, it allows for user-mode MTU probing (or, packetization-layer path
MTU discovery).  This is particularly useful for diagnostic utilities, like
traceroute/tracepath.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:10 -07:00
John Heffner b881ef7603 [IPV6]: MTU discovery check in ip6_fragment()
Adds a check in ip6_fragment() mirroring ip_fragment() for packets
that we can't fragment, and sends an ICMP Packet Too Big message
in response.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:09 -07:00
Patrick McHardy fd44de7cc1 [NET_SCHED]: ingress: switch back to using ingress_lock
Switch ingress queueing back to use ingress_lock. qdisc_lock_tree now locks
both the ingress and egress qdiscs on the device. All changes to data that
might be used on both ingress and egress needs to be protected by using
qdisc_lock_tree instead of manually taking dev->queue_lock. Additionally
the qdisc stats_lock needs to be initialized to ingress_lock for ingress
qdiscs.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:08 -07:00
Patrick McHardy 0463d4ae25 [NET_SCHED]: Eliminate qdisc_tree_lock
Since we're now holding the rtnl during the entire dump operation, we
can remove qdisc_tree_lock, whose only purpose is to protect dump
callbacks from concurrent changes to the qdisc tree.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:07 -07:00
Patrick McHardy ffa4d7216e [NETLINK]: don't reinitialize callback mutex
Don't reinitialize the callback mutex the netlink_kernel_create caller
handed in, it is supposed to already be initialized and could already
be held by someone.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:06 -07:00
Patrick McHardy 6313c1e099 [RTNETLINK]: Remove unnecessary locking in dump callbacks
Since we're now holding the rtnl during the entire dump operation, we can
remove additional locking for rtnl protected data. This patch does that
for all simple cases (dev_base_lock for dev_base walking, RCU protection
for FIB rule dumping).

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:05 -07:00
Patrick McHardy 1c2d670f36 [RTNETLINK]: Hold rtnl_mutex during netlink dump callbacks
Hold rtnl_mutex during the entire netlink dump operation. This allows
to simplify locking in the dump callbacks, since they can now rely on
that no concurrent changes happen.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:04 -07:00
Patrick McHardy af65bdfce9 [NETLINK]: Switch cb_lock spinlock to mutex and allow to override it
Switch cb_lock to mutex and allow netlink kernel users to override it
with a subsystem specific mutex for consistent locking in dump callbacks.
All netlink_dump_start users have been audited not to rely on any
side-effects of the previously used spinlock.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:03 -07:00
Patrick McHardy b076deb849 [NETFILTER]: ipt_ULOG: add compat conversion functions
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:02 -07:00
Patrick McHardy d3a2c3ca8e [NETFILTER]: nfnetlink_log: remove fallback to group 0
Don't fallback to group 0 if no instance can be found for the given group.
This potentially confuses the listener and is not what the user configured.
Also remove the ring buffer spamming that happens when rules are set up
before the logging daemon is started.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:01 -07:00
Patrick McHardy 3b5018d676 [NETFILTER]: {eb,ip6,ip}t_LOG: remove remains of LOG target overloading
All LOG targets always use their internal logging function nowadays, so
remove the incorrect error message and handle real errors (!= -EEXIST)
by failing to load.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:29:00 -07:00
Patrick McHardy fe6092ea00 [NETFILTER]: nf_nat: use HW checksumming when possible
When mangling packets forwarded to a HW checksumming capable device,
offload recalculation of the checksum instead of doing it in software.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:59 -07:00
Bart De Schuymer c15bf6e699 [NETFILTER]: ebt_arp: add gratuitous arp filtering
The attached patch adds gratuitous arp filtering, more precisely: it
allows checking that the IPv4 source address matches the IPv4
destination address inside the ARP header. It also adds a check for the
hardware address type when matching MAC addresses (nothing critical,
just for better consistency).

Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Acked-by: Carl-Daniel Hailfinger <c-d.hailfinger.devel.2006@gmx.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:58 -07:00
Michael Milner 516299d2f5 [NETFILTER]: bridge-nf: filter bridged IPv4/IPv6 encapsulated in pppoe traffic
The attached patch by Michael Milner adds support for using iptables and
ip6tables on bridged traffic encapsulated in ppoe frames, similar to
what's already supported for vlan.

Signed-off-by: Michael Milner <milner@blissisland.ca>
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:57 -07:00
Gerrit Renker f73f7097c9 [DCCP]: Debug statements for Elapsed Time option
This prints the value of the parsed Elapsed Time when received via a
Timestamp Echo option [RFC 4342, 13.3].

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:55 -07:00
Gerrit Renker b2449fdc30 [DCCP]: Fix bug in the calculation of very low sending rates
This fixes an error in the calculation of t_ipi when X converges towards
very low sending rates (between 1 and 64 bytes per second).

Although this case may not sound likely, it can be reproduced by connecting,
hitting enter (1 byte sent) and waiting for some time, during which the
nofeedback timer halves the sending rate until finally it reaches the region
1..64 bytes/sec. Computing X is handled correctly (tested separately); but by
dividing X _before_ entering the calculation of t_ipi, X becomes zero as
a result.  This in turn triggers a BUG condition caught in scaled_div().

Fixed by replacing with equivalent statement and explicit typecast for good
measure.

Calculation verified and effect of patch tested - reduced never below 1 byte
per 64 seconds afterwards, i.e. not allowing divide-by-zero.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:54 -07:00
Patrick McHardy efd1e8d569 [SK_BUFF]: Fix missing offset adjustment in skb_copy_expand
skb_copy_expand changes the headroom, so it needs to adjust the header
offsets by the difference between the old and the new value.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:53 -07:00
Akinobu Mita 87a596e0b8 bridge: check kmem_cache_create() error
This patch checks kmem_cache_create() error and aborts loading module
on failure.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
2007-04-25 22:28:51 -07:00
Stephen Hemminger ffe1d49cc3 bridge: allow changing hardware address to any valid address
For case of bridging pseudo devices, the get created/destroyed (Xen)
need to allow setting address to any valid value.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
2007-04-25 22:28:50 -07:00
Stephen Hemminger b86c45035c bridge: change when netlink events go to STP
Need to tell STP daemon about more events, like any time a
device is added even when it is down.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
2007-04-25 22:28:48 -07:00
Stephen Hemminger 9cde070874 bridge: add support for user mode STP
This patchset based on work by Aji_Srinivas@emc.com provides allows
spanning tree to be controled from userspace.  Like hotplug, it
uses call_usermodehelper when spanning tree is enabled so there
is no visible API change. If call to start usermode STP fails
it falls back to existing kernel STP.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
2007-04-25 22:28:48 -07:00
Stephen Hemminger 9cf637473c bridge: add sysfs hook to flush forwarding table
The RSTP daemon needs to be able to flush all dynamic forwarding
entries in the case of topology change.

This is a temporary interface. It will change to a netlink interface
before RSTP daemon is officially released.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
2007-04-25 22:28:47 -07:00
Stephen Hemminger 3f89092318 bridge: simpler hash with salt
Instead of hashing the whole Ethernet address, it should be faster
to just use the last 4 bytes. Add a random salt value to the hash
to make it more difficult to construct worst case DoS hash chains.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
2007-04-25 22:28:46 -07:00
Stephen Hemminger 467aea0ddf bridge: don't route packets while learning
While in the STP learning state, don't route packets; wait until
forwarding delay has expired. The purpose of the forwarding delay
is to detect loops in the network, and if a brouter started up
and started forwarding, it could cause a flood.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
2007-04-25 22:28:45 -07:00
Stephen Hemminger 6229e362dd bridge: eliminate call by reference
Change the bridging hook to be simple function with return value
rather than modifying the skb argument. This could generate better
code and is cleaner.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
2007-04-25 22:28:44 -07:00
Herbert Xu 604763722c [NET]: Treat CHECKSUM_PARTIAL as CHECKSUM_UNNECESSARY
When a transmitted packet is looped back directly, CHECKSUM_PARTIAL
maps to the semantics of CHECKSUM_UNNECESSARY.  Therefore we should
treat it as such in the stack.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:43 -07:00
Herbert Xu 663ead3bb8 [NET]: Use csum_start offset instead of skb_transport_header
The skb transport pointer is currently used to specify the start
of the checksum region for transmit checksum offload.  Unfortunately,
the same pointer is also used during receive side processing.

This creates a problem when we want to retransmit a received
packet with partial checksums since the skb transport pointer
would be overwritten.

This patch solves this problem by creating a new 16-bit csum_start
offset value to replace the skb transport header for the purpose
of checksums.  This offset is calculated from skb->head so that
it does not have to change when skb->data changes.

No extra space is required since csum_offset itself fits within
a 16-bit word so we can use the other 16 bits for csum_start.

For backwards compatibility, just before we push a packet with
partial checksums off into the device driver, we set the skb
transport header to what it would have been under the old scheme.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:40 -07:00
Patrick McHardy ac758e3c55 [XFRM]: beet: fix worst case header_len calculation
esp_init_state doesn't account for the beet pseudo header in the header_len
calculation, which may result in undersized skbs hitting xfrm4_beet_output,
causing unnecessary reallocations in ip_finish_output2.

The skbs should still always have enough room to avoid causing
skb_under_panic in skb_push since we have at least 16 bytes available
from LL_RESERVED_SPACE in xfrm_state_check_space.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:39 -07:00
Patrick McHardy c5c2523893 [XFRM]: Optimize MTU calculation
Replace the probing based MTU estimation, which usually takes 2-3 iterations
to find a fitting value and may underestimate the MTU, by an exact calculation.

Also fix underestimation of the XFRM trailer_len, which causes unnecessary
reallocations.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:38 -07:00
Patrick McHardy 557922584d [XFRM]: esp: fix skb_tail_pointer conversion bug
Fix incorrect switch of "trailer" skb by "skb" during skb_tail_pointer
conversion:

-       *(u8*)(trailer->tail - 1) = top_iph->protocol;
+       *(skb_tail_pointer(skb) - 1) = top_iph->protocol;

-       *(u8 *)(trailer->tail - 1) = *skb_network_header(skb);
+       *(skb_tail_pointer(skb) - 1) = *skb_network_header(skb);

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:37 -07:00
Patrick McHardy 56eb88828b [SK_BUFF]: Fix missing offset adjustment in pskb_expand_head
Since we're increasing the headroom, the header offsets need to be
increased by the same amount as well.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:36 -07:00
YOSHIFUJI Hideaki 29f6af7712 [IPV6] FIB6RULE: Find source address during looking up route.
When looking up route for destination with rules with
source address restrictions, we may need to find a source
address for the traffic if not given.

Based on patch from Noriaki TAKAMIYA <takamiya@po.ntts.co.jp>.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:35 -07:00
Patrick McHardy ea2f10a3c8 [XFRM]: beet: minor cleanups
Remove unnecessary initialization/variable.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:34 -07:00
Thomas Graf 038890fed8 [RTNL]: Improve error codes for unsupported operations
The most common trigger of these errors is that the
config option hasn't been enable wich would make the
functionality available. Therefore returning EOPNOTSUPP
gives a better idea on what is going wrong.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:34 -07:00
David Howells 716ea3a7aa [NET]: Move generic skbuff stuff from XFRM code to generic code
Move generic skbuff stuff from XFRM code to generic code so that
AF_RXRPC can use it too.

The kdoc comments I've attached to the functions needs to be checked
by whoever wrote them as I had to make some guesses about the workings
of these functions.

Signed-off-By: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:33 -07:00
Arnaldo Carvalho de Melo 1a4e2d093f [SK_BUFF]: Some more conversions to skb_copy_from_linear_data
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
2007-04-25 22:28:30 -07:00
Arnaldo Carvalho de Melo 27d7ff46a3 [SK_BUFF]: Introduce skb_copy_to_linear_data{_offset}
To clearly state the intent of copying to linear sk_buffs, _offset being a
overly long variant but interesting for the sake of saving some bytes.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
2007-04-25 22:28:29 -07:00
Rusty Russell c45d286e72 [NET]: Inline net_device_stats
Network drivers which keep stats allocate their own stats structure
then write a get_stats() function to return them.  It would be nice if
this were done by default.

1) Add a new "stats" field to "struct net_device".
2) Add a new feature field to say "this driver uses the internal one"
3) Have a default "get_stats" which returns NULL if that feature not set.
4) Change callers to check result of get_stats call for NULL, not if
   ->get_stats is set.

This should not break backwards compatibility with older drivers, yet
allow modern drivers to shed some boilerplate code.

Lightly tested: works for a modified lguest network driver.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:26 -07:00
Thomas Graf 4b19ca44cb [NET] fib_rules: delay route cache flush by ip_rt_min_delay
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:24 -07:00
Arnaldo Carvalho de Melo d626f62b11 [SK_BUFF]: Introduce skb_copy_from_linear_data{_offset}
To clearly state the intent of copying from linear sk_buffs, _offset being a
overly long variant but interesting for the sake of saving some bytes.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2007-04-25 22:28:23 -07:00
Eric Dumazet 03d4f879b9 [IPV4]: align inet_protos[] on SMP
As IPPROTO_TCP is 6, it makes sense to make sure inet_protos[] array
is properly cache line aligned to avoid false sharing on SMP.

c0680540 b peer_total
c0680544 b inet_peer_unused_head
c0680560 B inet_protos

On i386 this example, we can see that inet_protos[IPPROTO_TCP] shares
a potentially hot (and modified) cache line.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:20 -07:00
Eric Dumazet 4103f8cd5c [TCP]: tcp_memory_pressure and tcp_socket are__read_mostly candidates
tcp_memory_pressure and tcp_socket currently share a cache line with tcp_memory_allocated, tcp_sockets_allocated.
(Very hot cache line)
It makes sense to declare these variables as __read_mostly, to avoid false sharing on SMP.

ffffffff8081d9c0 B tcp_orphan_count
ffffffff8081d9c4 B tcp_memory_allocated
ffffffff8081d9c8 B tcp_sockets_allocated
ffffffff8081d9cc B tcp_memory_pressure
ffffffff8081d9d0 b tcp_md5sig_users
ffffffff8081d9d8 b tcp_md5sig_pool
ffffffff8081d9e0 b warntime.31570
ffffffff8081d9e8 b tcp_socket

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:19 -07:00
Thomas Graf 73417f617a [NET] fib_rules: Flush route cache after rule modifications
The results of FIB rules lookups are cached in the routing cache
except for IPv6 as no such cache exists. So far, it was the
responsibility of the user to flush the cache after modifying any
rules. This lead to many false bug reports due to misunderstanding
of this concept.

This patch automatically flushes the route cache after inserting
or deleting a rule.

Thanks to Muli Ben-Yehuda <muli@il.ibm.com> for catching a bug
in the previous patch.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:18 -07:00
Eric Dumazet be776281ae [NET]: inet_ehash_secret should be __read_mostly and set only once
There is a very tiny probability that build_ehash_secret() is called
at the same time by different CPUS.

Also, using __read_mostly is a must for inet_ehash_secret

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:17 -07:00
Herbert Xu 35fc92a9de [NET]: Allow forwarding of ip_summed except CHECKSUM_COMPLETE
Right now Xen has a horrible hack that lets it forward packets with
partial checksums.  One of the reasons that CHECKSUM_PARTIAL and
CHECKSUM_COMPLETE were added is so that we can get rid of this hack
(where it creates two extra bits in the skbuff to essentially mirror
ip_summed without being destroyed by the forwarding code).

I had forgotten that I've already gone through all the deivce drivers
last time around to make sure that they're looking at ip_summed ==
CHECKSUM_PARTIAL rather than ip_summed != 0 on transmit.  In any case,
I've now done that again so it should definitely be safe.

Unfortunately nobody has yet added any code to update CHECKSUM_COMPLETE
values on forward so we I'm setting that to CHECKSUM_NONE.  This should
be safe to remove for bridging but I'd like to check that code path
first.

So here is the patch that lets us get rid of the hack by preserving
ip_summed (mostly) on forwarded packets.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:16 -07:00
Janusz Krzysztofik 2d771cd86d [IPV4] LVS: Allow to send ICMP unreachable responses when real-servers are removed
this is a small patch by Janusz Krzysztofik to ip_route_output_slow()
that allows VIP-less LVS linux director to generate packets
originating >From VIP if sysctl_ip_nonlocal_bind is set.

In a nutshell, the intention is for an LVS linux director to be able
to send ICMP unreachable responses to end-users when real-servers are
removed.

http://archive.linuxvirtualserver.org/html/lvs-users/2007-01/msg00106.html

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:15 -07:00
Thomas Graf fa0b2d1d21 [NET] fib_rules: Add no-operation action
The use of nop rules simplifies the usage of goto rules
and adds more flexibility as they allow targets to remain
while the actual content of the branches can change easly.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:14 -07:00
Thomas Graf 2b44368307 [NET] fib_rules: Mark rules detached from the device
Rules which match against device names in their selector can
remain while the device itself disappears, in fact the device
doesn't have to present when the rule is added in the first
place. The device name is resolved by trying when the rule is
added and later by listening to NETDEV_REGISTER/UNREGISTER
notifications.

This patch adds the flag FIB_RULE_DEV_DETACHED which is set
towards userspace when a rule contains a device match which
is unresolved at the moment. This eases spotting the reason
why certain rules seem not to function properly.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:13 -07:00
Thomas Graf 0947c9fe56 [NET] fib_rules: goto rule action
This patch adds a new rule action FR_ACT_GOTO which allows
to skip a set of rules by jumping to another rule. The rule
to jump to is specified via the FRA_GOTO attribute which
carries a rule preference.

Referring to a rule which doesn't exists is explicitely allowed.
Such goto rules are marked with the flag FIB_RULE_UNRESOLVED
and will act like a rule with a non-matching selector. The rule
will become functional as soon as its target is present.

The goto action enables performance optimizations by reducing
the average number of rules that have to be passed per lookup.

Example:
0:      from all lookup local
40:     not from all to 192.168.23.128 goto 32766
41:     from all fwmark 0xa blackhole
42:     from all fwmark 0xff blackhole
32766:  from all lookup main

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:12 -07:00
Stephen Hemminger 85795d64ed [TCP] tcp_probe: improvements for net-2.6.22
Change tcp_probe to use ktime (needed to add one export).
Add option to only get events when cwnd changes - from Doug Leith

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:10 -07:00
Stephen Hemminger e1c3e7ab6d [TCP]: cubic update for net-2.6.22
The following update received from Injong updates TCP cubic to the latest
version. I am running more complete tests and will have results after 4/1.

According to Injong: the new version improves on its scalability,
fairness and stability.  So in all properties, we confirmed it shows better
performance.

NCSU results (for 2.6.18 and 2.6.20) available:
	http://netsrv.csc.ncsu.edu/wiki/index.php/TCP_Testing

This version is described in a new Internet draft for CUBIC.
	http://www.ietf.org/internet-drafts/draft-rhee-tcp-cubic-00.txt

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:09 -07:00
John Heffner 9af3912ec9 [NET] Move DF check to ip_forward
Do fragmentation check in ip_forward, similar to ipv6 forwarding.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:07 -07:00
David S. Miller b3da2cf37c [INET]: Use jhash + random secret for ehash.
The days are gone when this was not an issue, there are folks out
there with huge bot networks that can be used to attack the
established hash tables on remote systems.

So just like the routing cache and connection tracking
hash, use Jenkins hash with random secret input.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:06 -07:00
Johannes Berg d30045a0bc [NETLINK]: introduce NLA_BINARY type
This patch introduces a new NLA_BINARY attribute policy type with the
verification of simply checking the maximum length of the payload.

It also fixes a small typo in the example.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:05 -07:00
Vlad Yasevich 703315712c [SCTP]: Implement SCTP_MAX_BURST socket option.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:04 -07:00
Vlad Yasevich a5a35e7675 [SCTP]: Implement sac_info field in SCTP_ASSOC_CHANGE notification.
As stated in the sctp socket api draft:

   sac_info: variable

   If the sac_state is SCTP_COMM_LOST and an ABORT chunk was received
   for this association, sac_info[] contains the complete ABORT chunk as
   defined in the SCTP specification RFC2960 [RFC2960] section 3.3.7.

We now save received ABORT chunks into the sac_info field and pass that
to the user.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:03 -07:00
Vlad Yasevich bdf3092af6 [SCTP]: Honor flags when setting peer address parameters
Parameters only take effect when a corresponding flag bit is set
and a value is specified. This means we need to check the flags
in addition to checking for non-zero value.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:02 -07:00
Vlad Yasevich 1ae4114dce [SCTP]: Implement SCTP_ADDR_CONFIRMED state for ADDR_CHNAGE event
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:01 -07:00
Vlad Yasevich d49d91d79a [SCTP]: Implement SCTP_PARTIAL_DELIVERY_POINT option.
This option induces partial delivery to run as soon
as the specified amount of data has been accumulated on
the association.  However, we give preference to fully
reassembled messages over PD messages.  In any case,
window and buffer is freed up.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@.hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:28:00 -07:00
Vlad Yasevich b6e1331f3c [SCTP]: Implement SCTP_FRAGMENT_INTERLEAVE socket option
This option was introduced in draft-ietf-tsvwg-sctpsocket-13.  It
prevents head-of-line blocking in the case of one-to-many endpoint.
Applications enabling this option really must enable SCTP_SNDRCV event
so that they would know where the data belongs.  Based on an
earlier patch by Ivan Skytte Jørgensen.

Additionally, this functionality now permits multiple associations
on the same endpoint to enter Partial Delivery.  Applications should
be extra careful, when using this functionality, to track EOR indicators.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:59 -07:00
Patrick McHardy c95e939508 [NET_SCHED]: qdisc: remove unnecessary memory barriers
We're holding dev->queue_lock in qdisc_watchdog_schedule and
qdisc_watchdog_cancel, no need for the barriers.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:58 -07:00
Patrick McHardy a48b5a6144 [NET_SCHED]: Unline tcf_destroy
Uninline tcf_destroy and add a helper function to destroy an entire filter
chain.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:56 -07:00
Patrick McHardy 3bebcda280 [NET_SCHED]: turn PSCHED_GET_TIME into inline function
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:55 -07:00
Patrick McHardy 03cc45c0a5 [NET_SCHED]: turn PSCHED_TDIFF_SAFE into inline function
Also rename to psched_tdiff_bounded.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:54 -07:00
Patrick McHardy 8edc0c31d6 [NET_SCHED]: kill PSCHED_TDIFF
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:53 -07:00
Patrick McHardy a084980dcb [NET_SCHED]: kill PSCHED_SET_PASTPERFECT/PSCHED_IS_PASTPERFECT
Use direct assignment and comparison instead.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:51 -07:00
Patrick McHardy 104e087898 [NET_SCHED]: kill PSCHED_TLESS
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:50 -07:00
Patrick McHardy 7c59e25f31 [NET_SCHED]: kill PSCHED_TADD/PSCHED_TADD2
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:49 -07:00
Patrick McHardy 26e252df1e [NET_SCHED]: kill PSCHED_AUDIT_TDIFF
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:48 -07:00
Patrick McHardy 76d643cd3b [NET_SCHED]: sch_netem: fix off-by-one in send time comparison
netem checks PSCHED_TLESS(cb->time_to_send, now) to find out whether it is
allowed to send a packet, which is equivalent to cb->time_to_send < now.
Use !PSCHED_TLESS(now, cb->time_to_send) instead to properly handle
cb->time_to_send == now.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:47 -07:00
Thomas Graf c7bf5f9dc2 [NETFILTER] nfnetlink: netlink_run_queue() already checks for NLM_F_REQUEST
Patrick has made use of netlink_run_queue() in nfnetlink while my patches
have been waiting for net-2.6.22 to open. So this check for NLM_F_REQUEST
can go as well.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:46 -07:00
Yasuyuki Kozakai de6e05c49f [NETFILTER]: nf_conntrack: kill destroy() in struct nf_conntrack for diet
The destructor per conntrack is unnecessary, then this replaces it with
system wide destructor.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:45 -07:00
Yasuyuki Kozakai 5f79e0f916 [NETFILTER]: nf_conntrack: don't use nfct in skb if conntrack is disabled
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:44 -07:00
Patrick McHardy e6f689db51 [NETFILTER]: Use setup_timer
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:43 -07:00
Patrick McHardy 9afdb00c80 [NETFILTER]: nfnetlink_log: remove conditional locking
This is gross, have the wrapper function take the lock.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:42 -07:00
Michal Miroslaw 370e6a8789 [NETFILTER]: nfnetlink_log: micro-optimization: inst->skb != NULL in __nfulnl_send()
No other function calls __nfulnl_send() with inst->skb == NULL than
nfulnl_timer().

Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:41 -07:00
Michal Miroslaw f76cdcee5b [NETFILTER]: nfnetlink_log: iterator functions need iter_state * only
get_*() don't need access to seq_file - iter_state is enough for them.

Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:40 -07:00
Michal Miroslaw 9a36e8c2b3 [NETFILTER]: nfnetlink_log: micro-optimization: don't modify destroyed instance
Simple micro-optimization: Don't change any options if the instance is
being destroyed.

Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:39 -07:00
Michal Miroslaw f414c16c04 [NETFILTER]: nfnetlink_log: micro-optimization for inst==NULL in nfulnl_recv_config()
Simple micro-optimization: don't call instance_put() on known NULL pointers.

Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:38 -07:00
Michal Miroslaw 55b5a91e17 [NETFILTER]: nfnetlink_log: kill duplicate code
Kill some duplicate code in nfulnl_log_packet().

Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:37 -07:00
Michal Miroslaw 09972d6f96 [NETFILTER]: nfnetlink_log: don't count max(a,b) twice
We don't need local nlbufsiz (skb size) as nfulnl_alloc_skb() takes
the maximum anyway.

Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:36 -07:00
Patrick McHardy 1b53d9042c [NETFILTER]: Remove changelogs and CVS IDs
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:35 -07:00
Stephen Hemminger bb2f8cc0ec [NETEM]: spelling errors
Get rid of some of my creative spelling.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:33 -07:00
Thomas Graf c702e8047f [NETLINK]: Directly return -EINTR from netlink_dump_start()
Now that all users of netlink_dump_start() use netlink_run_queue()
to process the receive queue, it is possible to return -EINTR from
netlink_dump_start() directly, therefore simplying the callers.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:33 -07:00
Thomas Graf ead592ba24 [IPv4] diag: Use netlink_run_queue() to process the receive queue
Makes use of netlink_run_queue() to process the receive queue and
converts inet_diag_rcv_msg() to use the type safe netlink interface.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:32 -07:00
Thomas Graf 1d00a4eb42 [NETLINK]: Remove error pointer from netlink message handler
The error pointer argument in netlink message handlers is used
to signal the special case where processing has to be interrupted
because a dump was started but no error happened. Instead it is
simpler and more clear to return -EINTR and have netlink_run_queue()
deal with getting the queue right.

nfnetlink passed on this error pointer to its subsystem handlers
but only uses it to signal the start of a netlink dump. Therefore
it can be removed there as well.

This patch also cleans up the error handling in the affected
message handlers to be consistent since it had to be touched anyway.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:30 -07:00
Thomas Graf 45e7ae7f71 [NETLINK]: Ignore control messages directly in netlink_run_queue()
Changes netlink_rcv_skb() to skip netlink controll messages and don't
pass them on to the message handler.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:29 -07:00
Thomas Graf d35b685640 [NETLINK]: Ignore !NLM_F_REQUEST messages directly in netlink_run_queue()
netlink_rcv_skb() is changed to skip messages which don't have the
NLM_F_REQUEST bit to avoid every netlink family having to perform this
check on their own.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:29 -07:00
Thomas Graf 33a0543cd9 [NETLINK]: Remove unused groups variable
Leftover from dynamic multicast groups allocation work.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:27 -07:00
Thomas Graf 267281058c [TCP] westwood: Use type safe netlink interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:26 -07:00
Thomas Graf e9195d677d [TCP] vegas: Use type safe netlink interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:26 -07:00
Thomas Graf 51057f2fec [RTNL]: Properly return rntl message handler
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:24 -07:00
Stephen Hemminger 1936502d00 [NET_SCHED] qdisc: avoid transmit softirq on watchdog wakeup
If possible, avoid having to do a transmit softirq when a qdisc
watchdog decides to re-enable.  The watchdog routine runs off
a timer, so it is already in the same effective context as
the softirq.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:23 -07:00
Stephen Hemminger 11274e5a43 [NETEM]: avoid excessive requeues
The netem code would call getnstimeofday() and dequeue/requeue after
every packet, even if it was waiting. Avoid this overhead by using
the throttled flag.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:22 -07:00
Stephen Hemminger 075aa573b7 [NETEM]: Optimize tfifo
In most cases, the next packet will be sent after the
last one. So optimize that case.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:21 -07:00
Stephen Hemminger b407621c35 [NETEM]: use better types for time values
The random number generator always generates 32 bit values.
The time values are limited by psched_tdiff_t

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:20 -07:00
Stephen Hemminger a362e0a789 [NETEM]: report reorder percent correctly.
If you setup netem to just delay packets; "tc qdisc ls" will report
the reordering as 100%. Well it's a lie, reorder isn't used unless
gap is set, so just set value to 0 so the output of utility
is correct.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:20 -07:00
Stephen Hemminger 7e58886b45 [TCP]: cubic optimization
Use willy's work in optimizing cube root by having table for small values.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:19 -07:00
Thomas Graf c454673da7 [NET] rules: Unified rules dumping
Implements a unified, protocol independant rules dumping function
which is capable of both, dumping a specific protocol family or
all of them. This speeds up dumping as less lookups are required.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:17 -07:00
Thomas Graf 687ad8cc64 [RTNL]: Use rtnl registration interface for dump-all aliases
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:15 -07:00
Thomas Graf 32fe21c0c0 [BRIDGE]: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:14 -07:00
Thomas Graf c127ea2c45 [IPv6]: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:13 -07:00
Thomas Graf fa34ddd739 [DECNet]: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:12 -07:00
Thomas Graf 708914cc5e [PKT_SCHED] act: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:11 -07:00
Thomas Graf 82623c0d73 [PKT_SCHED] cls: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:10 -07:00
Thomas Graf be577ddc2b [PKT_SCHED] qdisc: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:09 -07:00
Thomas Graf 63f3444fb9 [IPv4]: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:08 -07:00
Thomas Graf 9d9e6a5819 [NET] rules: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:07 -07:00
Thomas Graf c8822a4e00 [NEIGH]: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:06 -07:00
Thomas Graf 340d17fc9d [NET] link: Use rtnl registration interface
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:05 -07:00
Thomas Graf e284986385 [RTNL]: Message handler registration interface
This patch adds a new interface to register rtnetlink message
handlers replacing the exported rtnl_links[] array which
required many message handlers to be exported unnecessarly.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:04 -07:00
Gerrit Renker 30833ffead [CCID3]: Use initial RTT sample from SYN exchange
The patch follows the following recommendation made in an erratum to RFC 4342:

  "Senders MAY additionally make use of other available RTT measurements,
   including those from the initial Request-Response packet exchange."

It implements larger initial windows with regard to this inital RTT measurement,
using the mechanism suggested in draft-ietf-dccp-rfc3448bis, section 4.2.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:04 -07:00
Gerrit Renker 89560b53b9 [DCCP]: Sample RTT from SYN exchange
Function:
2007-04-25 22:27:02 -07:00
Gerrit Renker 7dfee1a9c0 [CCID3]: Use function for RTT sampling
This replaces the existing occurrences of RTT sampling with
the use of the new function dccp_sample_rtt.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:01 -07:00
Gerrit Renker 4712a792ee [DCCP]: Provide function for RTT sampling
A recurring problem, in particular in the CCID code, is that RTT samples
from packets with timestamp echo and elapsed time options need to be taken.

This service is provided via a new function dccp_sample_rtt in this patch.
Furthermore, to protect against `insane' RTT samples, the sampled value
is bounded between 100 microseconds and 4 seconds - for which u32 is sufficient.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:27:00 -07:00
Gerrit Renker 0c150efb28 [CCID3]: Handle Idle and Application-Limited periods
This updates the code with regard to handling idle and application-limited
periods as specified in [RFC 4342, 5.1].

Background:
2007-04-25 22:26:59 -07:00
Gerrit Renker a21f9f96cd [CCID3]: Wrap computation of RFC3390-initial rate into separate function
The CCID 3 and TFRC specs (RFC 4342, RFC 3448, draft-3448bis) make frequent
reference to the computation of the RFC-3390 initial sending rate:

  1. Initial sending rate when RTT is known (RFC 4342, p. 6)
  2. Response to Idle/Application-Limited periods (RFC 4342, 5.1)

This warrants putting the code into its own function, for later code reuse.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:58 -07:00
Gerrit Renker 1761f7d7fe [CCID3]: Remove build warnings for 64bit
This clears the following sparc64 build warnings:
 1) warning: format "%ld" expects type "long int", but argument 3 has type "suseconds_t"
 2) warning: format "%llu" expects type "long long unsigned int", but argument 3 has type "__u64"
Fixed by using typecast to unsigned. This is argued to be safe, since the quantities, after
de-scaling (factor 2^6) fit all in u32.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:57 -07:00
Gerrit Renker fddc2feb94 [CCID3]: More to see in dccp_probe
This adds a few more fields of interest to /proc/net/dccpprobe, the following output ensues:

1           2          3           4     5  6     7   8        9        10   11
sec.usec   src:sport   dst:dport   size  s  rtt   p   X_calc   X_recv   X    t_ipi

Also made the formatting consistent.

Scripts that go with this can be downloaded from http://139.133.210.30/users/gerrit/dccp/dccp_probe/

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:56 -07:00
Gerrit Renker 6626e3628f [DCCP]: More debug information for dccp_wait_for_ccid
This adds more detail in the wait_for_ccid packet scheduling loop.
In particular, it informs about (i) when delay is used and (ii) why
a packet is discarded.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:54 -07:00
Gerrit Renker ac12b0c495 [DCCP]: Always use debug-toggle parameters
Currently debugging output (when configured) is automatically enabled when
DCCP modules are compiled into the kernel rather than built as loadable modules.
This is not necessary, since the module parameters in this case become kernel
commandline parameters, e.g. DCCP or CCID3 debug output can be enabled for a
static build by appending the following at the boot prompt:

	dccp.dccp_debug=1 	dccp_ccid3.ccid3_debug=1

This patch therefore does away with the more complicated way of always enabling
debug output for static builds

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:53 -07:00
Gerrit Renker 1266adee12 [CCID3]: Remove race condition and update t_ipi when `s' changes
This:

 1. removes a race condition in the access to the scheduled send time t_nom which
    results from allowing asynchronous r/w access to t_nom without locks;

 2. updates the inter-packet interval t_ipi = s/X when `s' changes, following a
    suggestion by Ian McDonald.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:52 -07:00
Ian McDonald 8699be7d24 [CCID3]: More verbose debugging
This adds a few debugging statements to ccid3.c

Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:51 -07:00
Ian McDonald 551dc5f7a1 [CCID3]: Fix use of invalid loss intervals
This fixes a bug which uses an invalid comparison.
The bug resulted in the use of invalid loss intervals.

Signed-off-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Acked-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:50 -07:00
Gerrit Renker 371fe7779c [CCID3]: Use MSS for larger initial windows
This improves the slow-start phase by using the MSS
(as suggested in RFC 4342, sec. 5) instead of the packet size s.
Also figured out that __u32 is ample resource enough.

After applying, I got the following in the logs:

  ccid3_hc_tx_packet_recv: client(f7421700), s=6, MSS=1424, w_init=4380, R_sample=176us, X=24886363

Had the previous variant been used, w_init would have been as low as 24.

Committer note: removed unneeded cast to unsigned long long that was
                causing a compiler warning on 64bit architectures.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:49 -07:00
Gerrit Renker 9bf17475eb [CCID3]: Re-order CCID 3 source file
No code change at all.
This splits ccid3.c into a RX and a TX section, so that the file has an
organisation similar to the other ones (e.g. packet_history.{h,c}).

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:48 -07:00
Gerrit Renker 353b13e10a [CCID3]: Remove redundant `len' test
Since CCID3 avoids  sending 0-byte data packets (cf. ccid3_hc_tx_send_packet),
testing for zero-payload length, as performed by ccid3_hc_tx_update_s, is
redundant - hence removed by this patch.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:47 -07:00
Gerrit Renker 8d13bf9a0b [DCCP]: Remove ambiguity in the way before48 is used
This removes two ambiguities in employing the new definition of before48,
following the analysis on http://www.mail-archive.com/dccp@vger.kernel.org/msg01295.html

 (1) Updating GSR when P.seqno >= S.SWL
     With the old definition we did not update when P.seqno and S.SWL are 2^47 apart. To
     ensure the same behaviour as with the old definition, this is replaced with the
     equivalent condition dccp_delta_seqno(S.SWL, P.seqno) >= 0

 (2) Sending SYNC when P.seqno >= S.OSR
     Here it is debatable whether the new definition causes an ambiguity: the case is
     similar to (1); and to have consistency with the case (1), we use the equivalent
     condition dccp_delta_seqno(S.OSR, P.seqno) >= 0

Detailed Justification
2007-04-25 22:26:46 -07:00
Gerrit Renker b16be51b5e [DCCP]: Fix for follows48
The follows48 relation identifies whether 48-bit sequence number
x is the direct successor of y. Currently, it does not handle cases
of the following type correctly:

	follows48(0x(prefix)10000LL, 0x(prefix)0FFFFLL)

where prefix is an arbitrary hex sequence of up to 7 digits.

This is fixed by reusing the new dccp_delta_seqno function.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:45 -07:00
Gerrit Renker d52de17b8c [DCCP]: Make `before' relation unambiguous
Problem:
2007-04-25 22:26:44 -07:00
Gerrit Renker 0aec51c869 [DCCP]: Make dccp_delta_seqno return signed numbers
Problem:
2007-04-25 22:26:43 -07:00
Gerrit Renker 6b811d43f6 [DCCP]: 48-bit sequence number arithmetic
This patch
 * organizes the sequence arithmetic functions into one corner of dccp.h
 * performs a small modification of dccp_set_seqno to make it more widely reusable
   (now it is safe to use any number, since it performs modulo-2^48 assignment)
 * adds functions and generic macros for 48-bit sequence arithmetic:
 	--48 bit complement
 	--modulo-48 addition and modulo-48 subtraction
	--dccp_inc_seqno now a special case of add48
Constants renamed following a suggestion by Arnaldo.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:42 -07:00
Arnaldo Carvalho de Melo 6b88dd966b [SK_BUFF] ipv6: Use skb_network_offset in some more places
So that we reduce the number of direct accesses to skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2007-04-25 22:26:38 -07:00
Arnaldo Carvalho de Melo dc5fc579b9 [NETLINK]: Use nlmsg_trim() where appropriate
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:37 -07:00
Arnaldo Carvalho de Melo 897933bcdf [SK_BUFF]: Remove skb_add_mtu() leftovers
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2007-04-25 22:26:35 -07:00
Arnaldo Carvalho de Melo b529ccf279 [NETLINK]: Introduce nlmsg_hdr() helper
For the common "(struct nlmsghdr *)skb->data" sequence, so that we reduce the
number of direct accesses to skb->data and for consistency with all the other
cast skb member helpers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:34 -07:00
Robert Olsson 965ffea43d [IPV4]: fib_trie root node settings
The threshold for root node can be more aggressive set to get
better tree compression. The new setting mekes the root grow
from 16 to 19 bits and substansial improvemnt in Aver depth
this with the current table of 214393 prefixes

But really the dynamic resize should need more investigation
both in terms convergence and performance and maybe it should
be possible to change...

Maybe just for the brave to start with or we may have to back
this out.
2007-04-25 22:26:32 -07:00
Robert Olsson 05eee48c5a [IPV4]: fib_trie resize break
The patch below adds break condition for the resize operations. If
we don't achieve the desired fill factor a warning is printed. Trie
should still be operational but new thresholds should be considered.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:32 -07:00
Arnaldo Carvalho de Melo ca0605a7c8 [SK_BUFF]: Adjust the zeroing up to tail in __alloc_skb too
I did it just in alloc_skb_from_cache, forgot __alloc_skb, fixed now.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:31 -07:00
Arnaldo Carvalho de Melo 4305b54135 [SK_BUFF]: Convert skb->end to sk_buff_data_t
Now to convert the last one, skb->data, that will allow many simplifications
and removal of some of the offset helpers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:29 -07:00
Arnaldo Carvalho de Melo 27a884dc3c [SK_BUFF]: Convert skb->tail to sk_buff_data_t
So that it is also an offset from skb->head, reduces its size from 8 to 4 bytes
on 64bit architectures, allowing us to combine the 4 bytes hole left by the
layer headers conversion, reducing struct sk_buff size to 256 bytes, i.e. 4
64byte cachelines, and since the sk_buff slab cache is SLAB_HWCACHE_ALIGN...
:-)

Many calculations that previously required that skb->{transport,network,
mac}_header be first converted to a pointer now can be done directly, being
meaningful as offsets or pointers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:28 -07:00
David S. Miller be8bd86321 [VLAN] vlan_dev: Use skb_reset_network_header().
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:26 -07:00
Samuel Ortiz c7630a4b93 [IrDA]: irda lockdep annotation
Rmmoding irda triggers a lockdep false positive.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:23 -07:00
Arnaldo Carvalho de Melo 2e07fa9cd3 [SK_BUFF]: Use offsets for skb->{mac,network,transport}_header on 64bit architectures
With this we save 8 bytes per network packet, leaving a 4 bytes hole to be used
in further shrinking work, likely with the offsetization of other pointers,
such as ->{data,tail,end}, at the cost of adds, that were minimized by the
usual practice of setting skb->{mac,nh,n}.raw to a local variable that is then
accessed multiple times in each function, it also is not more expensive than
before with regards to most of the handling of such headers, like setting one
of these headers to another (transport to network, etc), or subtracting, adding
to/from it, comparing them, etc.

Now we have this layout for sk_buff on a x86_64 machine:

[acme@mica net-2.6.22]$ pahole vmlinux sk_buff
struct sk_buff {
	struct sk_buff *       next;             /*   0   8 */
	struct sk_buff *       prev;             /*   8   8 */
	struct rb_node         rb;               /*  16  24 */
	struct sock *          sk;               /*  40   8 */
	ktime_t                tstamp;           /*  48   8 */
	struct net_device *    dev;              /*  56   8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct net_device *    input_dev;        /*  64   8 */
	sk_buff_data_t         transport_header; /*  72   4 */
	sk_buff_data_t         network_header;   /*  76   4 */
	sk_buff_data_t         mac_header;       /*  80   4 */

	/* XXX 4 bytes hole, try to pack */

	struct dst_entry *     dst;              /*  88   8 */
	struct sec_path *      sp;               /*  96   8 */
	char                   cb[48];           /* 104  48 */
	/* cacheline 2 boundary (128 bytes) was 24 bytes ago*/
	unsigned int           len;              /* 152   4 */
	unsigned int           data_len;         /* 156   4 */
	unsigned int           mac_len;          /* 160   4 */
	union {
		__wsum         csum;             /*       4 */
		__u32          csum_offset;      /*       4 */
	};                                       /* 164   4 */
	__u32                  priority;         /* 168   4 */
	__u8                   local_df:1;       /* 172   1 */
	__u8                   cloned:1;         /* 172   1 */
	__u8                   ip_summed:2;      /* 172   1 */
	__u8                   nohdr:1;          /* 172   1 */
	__u8                   nfctinfo:3;       /* 172   1 */
	__u8                   pkt_type:3;       /* 173   1 */
	__u8                   fclone:2;         /* 173   1 */
	__u8                   ipvs_property:1;  /* 173   1 */

	/* XXX 2 bits hole, try to pack */

	__be16                 protocol;         /* 174   2 */
	void    (*destructor)(struct sk_buff *); /* 176   8 */
	struct nf_conntrack *  nfct;             /* 184   8 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	struct sk_buff *       nfct_reasm;       /* 192   8 */
	struct nf_bridge_info *nf_bridge;        /* 200   8 */
	__u16                  tc_index;         /* 208   2 */
	__u16                  tc_verd;          /* 210   2 */
	dma_cookie_t           dma_cookie;       /* 212   4 */
	__u32                  secmark;          /* 216   4 */
	__u32                  mark;             /* 220   4 */
	unsigned int           truesize;         /* 224   4 */
	atomic_t               users;            /* 228   4 */
	unsigned char *        head;             /* 232   8 */
	unsigned char *        data;             /* 240   8 */
	unsigned char *        tail;             /* 248   8 */
	/* --- cacheline 4 boundary (256 bytes) --- */
	unsigned char *        end;              /* 256   8 */
}; /* size: 264, cachelines: 5 */
   /* sum members: 260, holes: 1, sum holes: 4 */
   /* bit holes: 1, sum bit holes: 2 bits */
   /* last cacheline: 8 bytes */

On 32 bits nothing changes, and pointers continue to be used with the compiler
turning all this abstraction layer into dust. But there are some sk_buff
validation tricks that are now possible, humm... :-)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:21 -07:00
Arnaldo Carvalho de Melo b0e380b1d8 [SK_BUFF]: unions of just one member don't get anything done, kill them
Renaming skb->h to skb->transport_header, skb->nh to skb->network_header and
skb->mac to skb->mac_header, to match the names of the associated helpers
(skb[_[re]set]_{transport,network,mac}_header).

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:20 -07:00
Arnaldo Carvalho de Melo cfe1fc7759 [SK_BUFF]: Introduce skb_network_header_len
For the common sequence "skb->h.raw - skb->nh.raw", similar to skb->mac_len,
that is precalculated tho, don't think we need to bloat skb with one more
member, so just use this new helper, reducing the number of non-skbuff.h
references to the layer headers even more.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:19 -07:00
Arnaldo Carvalho de Melo bff9b61ce3 [SK_BUFF]: Use the helpers to get the layer header pointer
Some more cases...

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:18 -07:00
Patrick McHardy 514bca322c [NET_SCHED]: Fix warning
net/sched/sch_api.c: In function 'psched_show':
net/sched/sch_api.c:1219: warning: format '%08x' expects type 'unsigned int', but argument 6 has type 's64'

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:17 -07:00
Patrick McHardy bb239acf56 [NET_SCHED]: sch_cbq: fix watchdog scheduled too late
q->now is increased during dequeue and doesn't contain the current time
afterwards, resulting in a too large timeout value for the qdisc watchdog.
Use "now" instead, which still contains the current time.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:16 -07:00
Patrick McHardy 4361cb17f0 [NET_SCHED]: Export real timer resolution in /proc/net/psched
The timer resolution exported in /proc/net/psched is used by userspace to
calculate HTB's burst values. Currently it is set to HZ, since we're now
using hrtimers, use KTIME_MONOTONIC_RES, which makes HTB use smaller burst
values.

This patch also affects libnl, which incorrectly uses this value for
the SFQ perturbation parameter, which is always in seconds, and some
routing cache values, which are in USER_HZ, so both cases are broken
anyway.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:15 -07:00
Patrick McHardy 00c04af9df [NET_SCHED]: kill jiffie conversion macros
Now that all packet schedulers have been converted to hrtimers most users
of PSCHED_JIFFIE2US and PSCHED_US2JIFFIE are gone. The remaining users use
it to convert external time units to packet scheduler clock ticks, so use
PSCHED_TICKS_PER_SEC instead.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:14 -07:00
Patrick McHardy fb983d4578 [NET_SCHED]: sch_htb: use hrtimer based watchdog
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:13 -07:00
Patrick McHardy 1a13cb63d6 [NET_SCHED]: sch_cbq: use hrtimer for delay_timer
Switch delay_timer to hrtimer.

The class penalty parameter is changed to use psched ticks as units.
Since iproute never supported using this and the only existing user
(libnl) incorrectly assumes psched ticks as units anyway, this
shouldn't break anything.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:12 -07:00
Patrick McHardy e9054a339e [NET_SCHED]: sch_cbq: fix cbq_undelay_prio for non-active priorites
cbq_undelay_prio is supposed to return a time delta, but returns the
current time for non-active priorities, causing cbq_undelay to mark
the priority as active and schedule a timer for twice the current
time.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:11 -07:00
Patrick McHardy 88a993540a [NET_SCHED]: sch_cbq: use hrtimer based watchdog
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:09 -07:00
Patrick McHardy 59cb5c6734 [NET_SCHED]: sch_netem: use hrtimer based watchdog
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:08 -07:00
Patrick McHardy f7f593e383 [NET_SCHED]: sch_tbf: use hrtimer based watchdog
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:07 -07:00
Patrick McHardy ed2b229a97 [NET_SCHED]: sch_hfsc: use hrtimer based watchdog
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:06 -07:00
Patrick McHardy 4179477f63 [NET_SCHED]: Add hrtimer based qdisc watchdog
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:05 -07:00
Patrick McHardy 641b9e0e8b [NET_SCHED]: Use ktime as clocksource
Get rid of the manual clock source selection mess and use ktime. Also
use a scalar representation, which allows to clean up pkt_sched.h a bit
more and results in less ktime_to_ns() calls in most cases.

The PSCHED_US2JIFFIE/PSCHED_JIFFIE2US macros are implemented quite
inefficient by this patch, following patches will convert all qdiscs
to hrtimers and get rid of them entirely.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:04 -07:00
Arnaldo Carvalho de Melo ddc7b8e32b [SK_BUFF]: Some more layer header conversions
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:03 -07:00
Arnaldo Carvalho de Melo d10ba34b00 [SK_BUFF]: More skb_put related skb_reset_transport_header
This time we have to set it to skb->tail that is not anymore equal to
skb->data, so we either add a new helper or just add the skb->tail - skb->data
offset, for now do the later.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:01 -07:00
Arnaldo Carvalho de Melo 55f79cc0c0 [IPV6]: Reset the network header in ip6_nd_hdr
ip6_nd_hdr is always called immediately after a alloc_skb + skb_reserve
sequence, i.e. when skb->tail is equal to skb->data, making it correct to use
skb_reset_network_header().

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:26:00 -07:00
Arnaldo Carvalho de Melo eeeb03745b [SK_BUFF]: More skb_put related conversions to skb_reset_transport_header
This is similar to the skb_reset_network_header(), i.e. at the point we reset
the transport header pointer/offset skb->tail is equal to skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:59 -07:00
Pablo Neira Ayuso ac6d141dc7 [NETFILTER]: nfnetlink: parse attributes with nfattr_parse in nfnetlink_check_attribute
Use nfattr_parse to parse attributes, this patch also modifies the default
behaviour since unknown attributes will be ignored instead of returning
EINVAL. This ensure backward compatibility: new libraries with new
attributes and old kernels can work.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:58 -07:00
Pablo Neira Ayuso c8e2078cfe [NETFILTER]: ctnetlink: add support for internal tcp connection tracking flags handling
This patch let userspace programs set the IP_CT_TCP_BE_LIBERAL flag to
force the pickup of established connections.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:57 -07:00
Willy Tarreau 5c8ce7c921 [NETFILTER]: TCP conntrack: factorize out the PUSH flag
The PUSH flag is accepted with every other valid combination.
Let's get it out of the tcp_valid_flags table and reduce the
number of combinations we have to handle. This does not
significantly reduce the table size however (8 bytes).

Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:56 -07:00
Willy Tarreau 8f5bd99071 [NETFILTER]: TCP conntrack: accept RST|PSH as valid
This combination has been encountered on an IBM AS/400 in response
to packets sent to a closed session. There is no particular reason
to mark it invalid.

Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:56 -07:00
Yasuyuki Kozakai e7ac05f340 [NETFILTER]: nf_conntrack: add nf_copy() to safely copy members in skb
This unifies the codes to copy netfilter related datas. Before copying,
nf_copy() puts original members in destination skb.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:55 -07:00
Yasuyuki Kozakai edda553c32 [NETFILTER]: nf_conntrack: add __nf_copy() to copy members in skb
This unifies the codes to copy netfilter related datas. Note that
__nf_copy() assumes destination skb doesn't have any netfilter
related members.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:54 -07:00
Sami Farin 9b88790972 [NETFILTER]: nf_conntrack: use jhash2 in __hash_conntrack
Now it uses jhash, but using jhash2 would be around 3-4 times faster
(on P4).

Signed-off-by: Sami Farin <safari-netfilter@safari.iki.fi>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:53 -07:00
Pablo Neira Ayuso f4bc177f0f [NETFILTER]: nfnetlink: move EXPORT_SYMBOL declarations next to the exported symbol
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:51 -07:00
Pablo Neira Ayuso 8a2e89533a [NETFILTER]: nfnetlink: remove unused includes in nfnetlink.c
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:50 -07:00
Pablo Neira Ayuso ac0f1d9894 [NETFILTER]: nfnetlink: remove unrequired check in nfnetlink_get_subsys
subsys_table is initialized to NULL, therefore just returns NULL in case
that it is not set.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:49 -07:00
Pablo Neira Ayuso d9e6d02949 [NETFILTER]: nfnetlink: remove duplicate checks in nfnetlink_check_attributes
Remove nfnetlink_check_attributes duplicates message size and callback
id checks. nfnetlink_find_client and nfnetlink_rcv_msg already do
such checks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:48 -07:00
Pablo Neira Ayuso 67ca396606 [NETFILTER]: nfnetlink: remove early debugging messages from nfnetlink
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:47 -07:00
Patrick McHardy 010c7d6f86 [NETFILTER]: nf_conntrack: uninline notifier registration functions
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:46 -07:00
Patrick McHardy 73c361862c [NETFILTER]: nfnetlink: use netlink_run_queue()
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:44 -07:00
Patrick McHardy a3c5029cf7 [NETFILTER]: nfnetlink: use mutex instead of semaphore
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:43 -07:00
Patrick McHardy c6a1e615d1 [NETFILTER]: nf_conntrack: simplify l4 protocol array allocation
The retrying after an allocation failure is not necessary anymore
since we're holding the mutex the entire time, for the same
reason the double allocation race can't happen anymore.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:43 -07:00
Patrick McHardy 0661cca9c2 [NETFILTER]: nf_conntrack: simplify protocol locking
Now that we don't use nf_conntrack_lock anymore but a single mutex for
all protocol handling, no need to release and grab it again for sysctl
registration.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:41 -07:00
Patrick McHardy ac5357ebac [NETFILTER]: nf_conntrack: remove ugly hack in l4proto registration
Remove ugly special-casing of nf_conntrack_l4proto_generic, all it
wants is its sysctl tables registered, so do that explicitly in an
init function and move the remaining protocol initialization and
cleanup code to nf_conntrack_proto.c as well.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:40 -07:00
Patrick McHardy b19caa0ca0 [NETFILTER]: nf_conntrack: switch protocol registration/unregistration to mutex
The protocol lookups done by nf_conntrack are already protected by RCU,
there is no need to keep taking nf_conntrack_lock for registration
and unregistration. Switch to a mutex.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:39 -07:00
Patrick McHardy 587aa64163 [NETFILTER]: Remove IPv4 only connection tracking/NAT
Remove the obsolete IPv4 only connection tracking/NAT as scheduled in
feature-removal-schedule.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:34 -07:00
Tobias Klauser ce18afe57b [NETFILTER]: x_tables: remove duplicate of xt_prefix
Remove xt_proto_prefix array which duplicates xt_prefix and change all
users of xt_proto_prefix to xt_prefix.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:33 -07:00
David S. Miller 239254fedc [IPV4] xfrm4_mode_beet: Use skb_transport_header().
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:32 -07:00
Arnaldo Carvalho de Melo 9c70220b73 [SK_BUFF]: Introduce skb_transport_header(skb)
For the places where we need a pointer to the transport header, it is
still legal to touch skb->h.raw directly if just adding to,
subtracting from or setting it to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:31 -07:00
Arnaldo Carvalho de Melo a27ef749e7 [SCTP]: Eliminate some pointer attributions to the skb layer headers
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:30 -07:00
Arnaldo Carvalho de Melo bd82393ca2 [SK_BUFF]: More skb_reset_transport_header conversions
These are a bit more subtle, they are of this type:

-       skb->h.raw = payload;
        __skb_pull(skb, payload - skb->data);
+       skb_reset_transport_header(skb);

__skb_pull results in:

skb->data = skb->data + payload - skb->data;
skb->data = payload;

So after __skb_pull we have skb->data pointing to payload and we can
just call skb_reset_transport_header(skb), that will do:

skb->h.raw = payload;

The others are similar, allowing us to get rid of some more cases where a
pointer was being attributed to the layer headers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:29 -07:00
Arnaldo Carvalho de Melo 39b89160df [SK_BUFF]: Introduce ipipv6_hdr(), remove skb->h.ipv6h
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:28 -07:00
Arnaldo Carvalho de Melo b0061ce49c [SK_BUFF]: Introduce ipip_hdr(), remove skb->h.ipiph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:27 -07:00
Arnaldo Carvalho de Melo aa8223c7bb [SK_BUFF]: Introduce tcp_hdr(), remove skb->h.th
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:26 -07:00
Arnaldo Carvalho de Melo ab6a5bb6b2 [TCP]: Introduce tcp_hdrlen() and tcp_optlen()
The ip_hdrlen() buddy, created to reduce the number of skb->h.th-> uses and to
avoid the longer, open coded equivalent.

Ditched a no-op in bnx2 in the process.

I wonder if we should have a BUG_ON(skb->h.th->doff < 5) in tcp_optlen()...

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:24 -07:00
Arnaldo Carvalho de Melo 88c7664f13 [SK_BUFF]: Introduce icmp_hdr(), remove skb->h.icmph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:23 -07:00
Arnaldo Carvalho de Melo 4bedb45203 [SK_BUFF]: Introduce udp_hdr(), remove skb->h.uh
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:22 -07:00
Arnaldo Carvalho de Melo d9edf9e2be [SK_BUFF]: Introduce igmp_hdr() & friends, remove skb->h.igmph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:21 -07:00
Arnaldo Carvalho de Melo cc70ab261c [ICMP6]: Introduce icmp6_hdr()
For consistency with all the other skb->h.raw accessors.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:20 -07:00
Arnaldo Carvalho de Melo 2c0fd387b0 [SCTP]: Introduce sctp_hdr()
For consistency with all the other skb->h.raw accessors.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:19 -07:00
Arnaldo Carvalho de Melo 967b05f64e [SK_BUFF]: Introduce skb_set_transport_header
For the cases where the transport header is being set to a offset from
skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:17 -07:00
Arnaldo Carvalho de Melo ea2ae17d64 [SK_BUFF]: Introduce skb_transport_offset()
For the quite common 'skb->h.raw - skb->data' sequence.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:16 -07:00
Arnaldo Carvalho de Melo badff6d01a [SK_BUFF]: Introduce skb_reset_transport_header(skb)
For the common, open coded 'skb->h.raw = skb->data' operation, so that we can
later turn skb->h.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple cases:

skb->h.raw = skb->data;
skb->h.raw = {skb_push|[__]skb_pull}()

The next ones will handle the slightly more "complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:15 -07:00
Arnaldo Carvalho de Melo 0660e03f6b [SK_BUFF]: Introduce ipv6_hdr(), remove skb->nh.ipv6h
Now the skb->nh union has just one member, .raw, i.e. it is just like the
skb->mac union, strange, no? I'm just leaving it like that till the transport
layer is done with, when we'll rename skb->mac.raw to skb->mac_header (or
->mac_header_offset?), ditto for ->{h,nh}.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:14 -07:00
Arnaldo Carvalho de Melo d0a92be05e [SK_BUFF]: Introduce arp_hdr(), remove skb->nh.arph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:12 -07:00
Stephen Hemminger fd74e6ccd5 [BRIDGE]: faster compare for link local addresses
Use logic operations rather than memcmp() to compare destination
address with link local multicast addresses.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:11 -07:00
Arnaldo Carvalho de Melo eddc9ec53b [SK_BUFF]: Introduce ip_hdr(), remove skb->nh.iph
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:10 -07:00
Arnaldo Carvalho de Melo e023dd6437 [IPMR]: Fix bug introduced when converting to skb_network_reset_header
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:08 -07:00
Arnaldo Carvalho de Melo c9bdd4b525 [IP]: Introduce ip_hdrlen()
For the common sequence "skb->nh.iph->ihl * 4", removing a good number of open
coded skb->nh.iph uses, now to go after the rest...

Just out of curiosity, here are the idioms found to get the same result:

skb->nh.iph->ihl << 2
skb->nh.iph->ihl<<2
skb->nh.iph->ihl * 4
skb->nh.iph->ihl*4
(skb->nh.iph)->ihl * sizeof(u32)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:07 -07:00
Arnaldo Carvalho de Melo 0272ffc46f [SK_BUFF] ipmr: Missed one conversion to skb_network_header()
We can't access skb->nh.raw directly anymore, it will become an offset.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:05 -07:00
Stephen Hemminger 0e1256ffd1 [NET]: show bound packet types
Show what protocols are bound to what packet types in /proc/net/ptype
Uses kallsyms to decode function pointers if possible.
Example:
	Type Device      Function
	ALL  eth1     packet_rcv_spkt+0x0
	0800          ip_rcv+0x0
	0806          arp_rcv+0x0
	86dd          :ipv6:ipv6_rcv+0x0

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:04 -07:00
Stephen Hemminger f690808e17 [NET]: make seq_operations const
The seq_file operations stuff can be marked constant to
get it out of dirty cache.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:03 -07:00
Stephen Hemminger 6b2bedc3a6 [NET]: network dev read_mostly
For Eric, mark packet type and network device watermarks
as read mostly.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:02 -07:00
Arnaldo Carvalho de Melo c14d2450cb [SK_BUFF]: Introduce skb_set_network_header
For the cases where the network header is being set to a offset from skb->data.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:01 -07:00
Arnaldo Carvalho de Melo 878c814500 [SK_BUFF] ipmr: Another skb_push related conversion to skb_reset_network_header
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:25:00 -07:00
Arnaldo Carvalho de Melo d56f90a7c9 [SK_BUFF]: Introduce skb_network_header()
For the places where we need a pointer to the network header, it is still legal
to touch skb->nh.raw directly if just adding to, subtracting from or setting it
to another layer header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:59 -07:00
Arnaldo Carvalho de Melo bbe735e424 [SK_BUFF]: Introduce skb_network_offset()
For the quite common 'skb->nh.raw - skb->data' sequence.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:58 -07:00
Arnaldo Carvalho de Melo 7f5c0cb05f [SK_BUFF] xfrm4: use skb_reset_network_header
Setting it to skb->h.raw, which is valid, in the (to become) old pointer based
world order and in the new world of offset based layer headers.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:55 -07:00
Arnaldo Carvalho de Melo 1ced98e81d [SK_BUFF] ipv6: More skb_reset_network_header conversions related to skb_pull
Now related to this form:

skb->nh.ipv6h = (struct ipv6hdr *)skb_put(skb, length);

That, as the others, is done when skb->tail is still equal to skb->data, making
the conversion to skb_reset_network_header possible.

Also one more case equivalent to skb->nh.raw = skb->data, of this form:

iph = (struct ipv6hdr *)skb->data;
<SNIP>
skb->nh.ipv6h = iph;

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:54 -07:00
Arnaldo Carvalho de Melo 8856dfa3e9 [SK_BUFF]: Use skb_reset_network_header after skb_push
Some more cases where skb->nh.iph was being set that were converted
to using skb_reset_network_header.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:53 -07:00
Arnaldo Carvalho de Melo 04b964dbad [SK_BUFF] ipconfig: Another conversion to skb_reset_network_header related to skb_put
boot_pkt->iph is the first member, that is at skb->data, so just use
skb_reset_network_header().

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:52 -07:00
Arnaldo Carvalho de Melo 2ca9e6f2c2 [SK_BUFF]: Some more skb_put cases converted to skb_reset_network_header
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:51 -07:00
Arnaldo Carvalho de Melo 31c7711b50 [SK_BUFF]: Some more simple skb_reset_network_header conversions
This time of the type:

 skb->nh.iph = (struct iphdr *)skb->data;

That is completely equivalent to:

 skb->nh.raw = skb->data;

Wonder why people love casts... :-)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:50 -07:00
Arnaldo Carvalho de Melo 4209fb601c [SK_BUFF]: Use skb_reset_network_header where the return of __pskb_pull was being used
It returns skb->data, so we can just use skb_reset_network_header after it.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:49 -07:00
Arnaldo Carvalho de Melo 7e28ecc282 [SK_BUFF]: Use skb_reset_network_header where the skb_pull return was being used
But only in the cases where its a newly allocated skb, i.e. one where skb->tail
is equal to skb->data, or just after skb_reserve, where this requirement is
maintained.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:48 -07:00
Arnaldo Carvalho de Melo e2d1bca7e6 [SK_BUFF]: Use skb_reset_network_header in skb_push cases
skb_push updates and returns skb->data, so we can just call
skb_reset_network_header after the call to skb_push.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:47 -07:00
Arnaldo Carvalho de Melo c1d2bbe1cd [SK_BUFF]: Introduce skb_reset_network_header(skb)
For the common, open coded 'skb->nh.raw = skb->data' operation, so that we can
later turn skb->nh.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple case, next will handle the slightly more
"complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:46 -07:00
Arnaldo Carvalho de Melo 57effc70a5 [IPV6]: Use skb->nh.ipv6h instead of casting skb->nh.raw
nh.ipv6h is there exactly for this reason! Use it while it exists ;-)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:45 -07:00
Arnaldo Carvalho de Melo 98e399f82a [SK_BUFF]: Introduce skb_mac_header()
For the places where we need a pointer to the mac header, it is still legal to
touch skb->mac.raw directly if just adding to, subtracting from or setting it
to another layer header.

This one also converts some more cases to skb_reset_mac_header() that my
regex missed as it had no spaces before nor after '=', ugh.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:41 -07:00
Arnaldo Carvalho de Melo 31713c333d [TCP]: Use skb_set_mac_header in tcp_collapse
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:39 -07:00
Arnaldo Carvalho de Melo c51957dafa [TCP]: Do the layer header setting in tcp_collapse relative to skb->data
That is equal to skb->head before skb_reserve, to help in the layer header
changes.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:38 -07:00
Arnaldo Carvalho de Melo 39f69c6f92 [SK_BUFF] xfrm: Use skb_set_mac_header in the memmove cases
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:37 -07:00
Arnaldo Carvalho de Melo 48d49d0ccd [SK_BUFF]: Introduce skb_set_mac_header()
For the cases where we want to set skb->mac.raw to an offset from skb->data.

Simple cases first, the memmove ones and specially pktgen will be left for later.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:37 -07:00
Arnaldo Carvalho de Melo f64955eb11 [LLC]: Use skb_reset_mac_header in llc_mac_hdr_init
skb_push updates and returns skb->data, so we can just call
skb_reset_mac_header after the call to skb_push.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:35 -07:00
Arnaldo Carvalho de Melo 0a1b0ad9ae [LLC]: Use skb_reset_mac_header in llc_alloc_frame
skb->head is equal to skb->data after alloc_skb, so reset the mac header while
this is true, i.e. before skb_reserve.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:34 -07:00
Arnaldo Carvalho de Melo 459a98ed88 [SK_BUFF]: Introduce skb_reset_mac_header(skb)
For the common, open coded 'skb->mac.raw = skb->data' operation, so that we can
later turn skb->mac.raw into a offset, reducing the size of struct sk_buff in
64bit land while possibly keeping it as a pointer on 32bit.

This one touches just the most simple case, next will handle the slightly more
"complex" cases.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:32 -07:00
Arnaldo Carvalho de Melo 4c13eb6657 [ETH]: Make eth_type_trans set skb->dev like the other *_type_trans
One less thing for drivers writers to worry about.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:30 -07:00
Arnaldo Carvalho de Melo 0a4f23fbbf [HIPPI/FDDI]: Make {hippi,fddi}_type_trans set skb->dev
Now all the _type_trans routines are consistent in this regard.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:26 -07:00
Arnaldo Carvalho de Melo c8fb7948dc [TR]: Make tr_type_trans set skb->dev
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:24 -07:00
Arnaldo Carvalho de Melo c1a4b86e39 [TR]: Use tr_hdr() were appropriate
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:23 -07:00
Arnaldo Carvalho de Melo 7c81fd8bfb [SOCKET]: Export __sock_recv_timestamp
Kernel: arch/x86_64/boot/bzImage is ready  (#2)
  MODPOST 1816 modules
WARNING: "__sock_recv_timestamp" [net/sctp/sctp.ko] undefined!
WARNING: "__sock_recv_timestamp" [net/packet/af_packet.ko] undefined!
WARNING: "__sock_recv_timestamp" [net/key/af_key.ko] undefined!
WARNING: "__sock_recv_timestamp" [net/ipv6/ipv6.ko] undefined!
WARNING: "__sock_recv_timestamp" [net/atm/atm.ko] undefined!
make[2]: *** [__modpost] Error 1
make[1]: *** [modules] Error 2
make: *** [_all] Error 2

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:22 -07:00
Eric Dumazet 92f37fd2ee [NET]: Adding SO_TIMESTAMPNS / SCM_TIMESTAMPNS support
Now that network timestamps use ktime_t infrastructure, we can add a new
SOL_SOCKET sockopt  SO_TIMESTAMPNS.

This command is similar to SO_TIMESTAMP, but permits transmission of
a 'timespec struct' instead of a 'timeval struct' control message.
(nanosecond resolution instead of microsecond)

Control message is labelled SCM_TIMESTAMPNS instead of SCM_TIMESTAMP

A socket cannot mix SO_TIMESTAMP and SO_TIMESTAMPNS : the two modes are
mutually exclusive.

sock_recv_timestamp() became too big to be fully inlined so I added a
__sock_recv_timestamp() helper function.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: linux-arch@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:21 -07:00
Arnaldo Carvalho de Melo c7a3c5da35 [UDP]: Use __skb_pull since we have checked it won't fail with pskb_may_pull
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:20 -07:00
Eric Dumazet 6dea649a8a [NET]: New sysctls should use __read_mostly tags
net_msg_warn should be placed in the read_mostly section, to avoid
performance problems on SMP

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:19 -07:00
YOSHIFUJI Hideaki e5268f12f2 [IPV6]: Ensure to truncate result and return full length for sticky options.
Bug noticed by Chris Wright <chrisw@sous-sol.org>.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:17 -07:00
YOSHIFUJI Hideaki 4c6510a738 [IPV6]: Return correct result for sticky options.
We returned incorrect result with IPV6_RTHDRDSTOPTS, IPV6_RTHDR and
IPV6_DSTOPTS.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:16 -07:00
Stephen Hemminger 3fbe070a42 [UDP]: deinline
A couple of functions are exported or used indirectly
so it is pointless to mark them as inline.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:15 -07:00
Stephen Hemminger 6f05f62971 [NET]: deinline some functions
Several functions are marked inline or forced inline, but it
would be better to let the compiler decide.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:14 -07:00
Stephen Hemminger 2de979bd7d [TCP]: whitespace cleanup
Add whitespace around keywords.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:13 -07:00
Stephen Hemminger 132adf5463 [IPV4]: cleanup
Add whitespace around keywords.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:11 -07:00
Stephen Hemminger 1ac58ee37f [WIRELESS]: use ARRAY_SIZE()
Use ARRAY_SIZE() macro now.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:10 -07:00
Stephen Hemminger e71a4783aa [NET] core: whitespace cleanup
Fix whitespace around keywords. Fix indentation especially of switch
statements.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:09 -07:00
Stephen Hemminger add459aa1a [UDP]: ipv6 style cleanup
Fix whitespace around keywords. Eliminate unnecessary ()'s on return
statements.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:08 -07:00
Stephen Hemminger 6516c65573 [UDP]: ipv4 whitespace cleanup
Fix whitespace around keywords.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:07 -07:00
Stephen Hemminger a2a316fd06 [NET]: Replace CONFIG_NET_DEBUG with sysctl.
Covert network warning messages from a compile time to runtime choice.
Removes kernel config option and replaces it with new /proc/sys/net/core/warnings.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:05 -07:00
Eric Dumazet ae40eb1ef3 [NET]: Introduce SIOCGSTAMPNS ioctl to get timestamps with nanosec resolution
Now network timestamps use ktime_t infrastructure, we can add a new
ioctl() SIOCGSTAMPNS command to get timestamps in 'struct timespec'.
User programs can thus access to nanosecond resolution.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:04 -07:00
Adrian Bunk cb69cc5236 [TCP/DCCP/RANDOM]: Remove unused exports.
This patch removes the following not or no longer used exports:
- drivers/char/random.c: secure_tcp_sequence_number
- net/dccp/options.c: sysctl_dccp_feat_sequence_window
- net/netlink/af_netlink.c: netlink_set_err

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:03 -07:00
David S. Miller fe067e8ab5 [TCP]: Abstract out all write queue operations.
This allows the write queue implementation to be changed,
for example, to one which allows fast interval searching.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:02 -07:00
YOSHIFUJI Hideaki 02ea4923b4 [NET] TIPC: Use htons() where appropriate.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:01 -07:00
YOSHIFUJI Hideaki b6d9bcb069 [NET] SCHED: Use htons() where appropriate.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:24:00 -07:00
YOSHIFUJI Hideaki 8f05ce91c8 [NET] NETFILTER: Use htonl() where appropriate.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:59 -07:00
YOSHIFUJI Hideaki 4412ec4948 [NET] IPV4: Use hton{s,l}() where appropriate.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:58 -07:00
YOSHIFUJI Hideaki 1c9e8ef7f7 [NET] IEEE80211: Use htons() where appropriate.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:57 -07:00
YOSHIFUJI Hideaki f576e24ffa [NET] ETHERNET: Use htons() where appropriate.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:56 -07:00
YOSHIFUJI Hideaki 724800d61b [NET] CORE: Use htons() where appropriate.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:55 -07:00
YOSHIFUJI Hideaki aca3192cc6 [NET] BLUETOOTH: Use cpu_to_le{16,32}() where appropriate.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:54 -07:00
YOSHIFUJI Hideaki acde4855bb [NET] ATM: Use htons() where appropriate.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:53 -07:00
YOSHIFUJI Hideaki b93b7eebd3 [NET] 8021Q: Use htons() where appropriate.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:53 -07:00
YOSHIFUJI Hideaki 2953fd2468 [NET] 802: Use hton{s,l}() where appropriate.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:52 -07:00
Herbert Xu 759e5d0064 [UDP]: Clean up UDP-Lite receive checksum
This patch eliminates some duplicate code for the verification of
receive checksums between UDP-Lite and UDP.  It does this by
introducing __skb_checksum_complete_head which is identical to
__skb_checksum_complete_head apart from the fact that it takes
a length parameter rather than computing the first skb->len bytes.

As a result UDP-Lite will be able to use hardware checksum offload
for packets which do not use partial coverage checksums.  It also
means that UDP-Lite loopback no longer does unnecessary checksum
verification.

If any NICs start support UDP-Lite this would also start working
automatically.

This patch removes the assumption that msg_flags has MSG_TRUNC clear
upon entry in recvmsg.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:51 -07:00
Herbert Xu 1ab6eb62b0 [UDP6]: Restore sk_filter optimisation
This reverts the changeset

    [IPV6]: UDPv6 checksum.

    We always need to check UDPv6 checksum because it is mandatory.

The sk_filter optimisation has nothing to do whether we verify the
checksum.  It simply postpones it to the point when the user calls
recv or poll.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:50 -07:00
Eric Dumazet 243bbcaa09 [IPV4]: Optimize inet_getpeer()
1) Some sysctl vars are declared __read_mostly

2) We can avoid updating stack[] when doing an AVL lookup only.

    lookup() macro is extended to receive a second parameter, that may be NULL
in case of a pure lookup (no need to save the AVL path). This removes
unnecessary instructions, because compiler knows if this _stack parameter is
NULL or not.

    text size of net/ipv4/inetpeer.o is 2063 bytes instead of 2107 on x86_64

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:49 -07:00
Stephen Hemminger 43e683926f [TCP] TCP Yeah: cleanup
Eliminate need for full 6/4/64 divide to compute queue.
Variable maxqueue was really a constant.
Fix indentation.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:48 -07:00
Stephen Hemminger c5f5877c04 [TCP] tcp_cubic: faster cube root
The Newton-Raphson method is quadratically convergent so
only a small fixed number of steps are necessary.
Therefore it is faster to unroll the loop. Since div64_64 is no longer
inline it won't cause code explosion.

Also fixes a bug that can occur if x^2 was bigger than 32 bits.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:47 -07:00
YOSHIFUJI Hideaki ca04356939 [IPV6] ADDRCONF: Fix possible inet6_ifaddr leakage with CONFIG_OPTIMISTIC_DAD.
The inet6_ifaddr for source address of RS is leaked if the address
is not an optimistic address.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:44 -07:00
Neil Horman 95c385b4d5 [IPV6] ADDRCONF: Optimistic Duplicate Address Detection (RFC 4429) Support.
Nominally an autoconfigured IPv6 address is added to an interface in the
Tentative state (as per RFC 2462).  Addresses in this state remain in this
state while the Duplicate Address Detection process operates on them to
determine their uniqueness on the network.  During this period, these
tentative addresses may not be used for communication, increasing the time
before a node may be able to communicate on a network.  Using Optimistic
Duplicate Address Detection, autoconfigured addresses may be used
immediately for communication on the network, as long as certain rules are
followed to avoid conflicts with other nodes during the Duplicate Address
Detection process.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:43 -07:00
Yasuyuki Kozakai 502b093569 [IPV6] IP6TUNNEL: Enable to control the handled inner protocol.
ip6_tunnel before supporting IPv4/IPv6 tunnel allows only IPPROTO_IPV6
in configurations from userland. This allows userland to set IPPROTO_IPIP
and 0(wildcard). ip6_tunnel only handles allowed inner protocols.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:42 -07:00
Yasuyuki Kozakai 3144581cb0 [IPV6] IP6TUNNEL: Rename functions ip6ip6_* to ip6_tnl_*.
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:41 -07:00
Yasuyuki Kozakai c4d3efafcc [IPV6] IP6TUNNEL: Add support to IPv4 over IPv6 tunnel.
Some notes
- Protocol number IPPROTO_IPIP is used for IPv4 over IPv6 packets.
- If IP6_TNL_F_USE_ORIG_TCLASS is set, TOS in IPv4 header is copied to
  Traffic Class in outer IPv6 header on xmit.
- IP6_TNL_F_USE_ORIG_FLOWLABEL is ignored on xmit of IPv4 packets, because
  IPv4 header does not have flow label.
- Kernel sends ICMP error if IPv4 packet is too big on xmit, even if
  DF flag is not set.

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:40 -07:00
Yasuyuki Kozakai 61ec2aec28 [IPV6] IP6TUNNEL: Split out generic routine in ip6ip6_xmit().
This enables to add IPv4/IPv6 specific handling later,

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:39 -07:00
Yasuyuki Kozakai 8359925be8 [IPV6] IP6TUNNEL: Split out generic routine in ip6ip6_rcv().
This enables to add IPv4/IPv6 specific handling later,

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:38 -07:00
Yasuyuki Kozakai e490d1d85c [IPV6] IP6TUNNEL: Split out generic routine in ip6ip6_err().
This enables to add IPv4/IPv6 specific error handling later,

Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:37 -07:00
YOSHIFUJI Hideaki 7159039a12 [IPV6]: Decentralize EXPORT_SYMBOLs.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2007-04-25 22:23:36 -07:00
David S. Miller b558ff7999 [NETLINK]: Mirror UDP MSG_TRUNC semantics.
If the user passes MSG_TRUNC in via msg_flags, return
the full packet size not the truncated size.

Idea from Herbert Xu and Thomas Graf.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:35 -07:00
Eric Dumazet b7aa0bf70c [NET]: convert network timestamps to ktime_t
We currently use a special structure (struct skb_timeval) and plain
'struct timeval' to store packet timestamps in sk_buffs and struct
sock.

This has some drawbacks :
- Fixed resolution of micro second.
- Waste of space on 64bit platforms where sizeof(struct timeval)=16

I suggest using ktime_t that is a nice abstraction of high resolution
time services, currently capable of nanosecond resolution.

As sizeof(ktime_t) is 8 bytes, using ktime_t in 'struct sock' permits
a 8 byte shrink of this structure on 64bit architectures. Some other
structures also benefit from this size reduction (struct ipq in
ipv4/ip_fragment.c, struct frag_queue in ipv6/reassembly.c, ...)

Once this ktime infrastructure adopted, we can more easily provide
nanosecond resolution on top of it. (ioctl SIOCGSTAMPNS and/or
SO_TIMESTAMPNS/SCM_TIMESTAMPNS)

Note : this patch includes a bug correction in
compat_sock_get_timestamp() where a "err = 0;" was missing (so this
syscall returned -ENOENT instead of 0)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
CC: Stephen Hemminger <shemminger@linux-foundation.org>
CC: John find <linux.kernel@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:34 -07:00
Stephen Hemminger 3927f2e8f9 [NET]: div64_64 consolidate (rev3)
Here is the current version of the 64 bit divide common code.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:33 -07:00
James Morris 9d729f72dc [NET]: Convert xtime.tv_sec to get_seconds()
Where appropriate, convert references to xtime.tv_sec to the
get_seconds() helper function.

Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:32 -07:00
Stephen Hemminger 39df232f1a [PKTGEN]: fix device name handling
Since devices can change name and other wierdness, don't hold onto
a copy of device name, instead use pointer to output device.

Fix a couple of leaks in error handling path as well.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:31 -07:00
Stephen Hemminger d5f1ce9a5e [PKTGEN]: don't use __constant_htonl()
The existing htonl() macro is smart enough to do the same code as
using __constant_htonl() and it looks cleaner.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:30 -07:00
Stephen Hemminger 5fa6fc76f5 [PKTGEN]: use random32
Can use random32() now.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:29 -07:00
Stephen Hemminger 25c4e53a4c [PKTGEN]: use pr_debug
Remove private debug macro and replace with standard version

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:28 -07:00
Eric Dumazet fa438ccfdf [NET]: Keep sk_backlog near sk_lock
sk_backlog is a critical field of struct sock. (known famous words)

It is (ab)used in hot paths, in particular in release_sock(), tcp_recvmsg(),
tcp_v4_rcv(), sk_receive_skb().

It really makes sense to place it next to sk_lock, because sk_backlog is only
used after sk_lock locked (and thus memory cache line in L1 cache). This
should reduce cache misses and sk_lock acquisition time.

(In theory, we could only move the head pointer near sk_lock, and leaving tail
far away, because 'tail' is normally not so hot, but keep it simple :) )

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:27 -07:00
Ilpo Järvinen e317f6f69c [TCP]: FRTO undo response falls back to ratehalving one if ECEd
Undoing ssthresh is disabled in fastretrans_alert whenever
FLAG_ECE is set by clearing prior_ssthresh. The clearing does
not protect FRTO because FRTO operates before fastretrans_alert.
Moving the clearing of prior_ssthresh earlier seems to be a
suboptimal solution to the FRTO case because then FLAG_ECE will
cause a second ssthresh reduction in try_to_open (the first
occurred when FRTO was entered). So instead, FRTO falls back
immediately to the rate halving response, which switches TCP to
CA_CWR state preventing the latter reduction of ssthresh.

If the first ECE arrived before the ACK after which FRTO is able
to decide RTO as spurious, prior_ssthresh is already cleared.
Thus no undoing for ssthresh occurs. Besides, FLAG_ECE should be
set also in the following ACKs resulting in rate halving response
that sees TCP is already in CA_CWR, which again prevents an extra
ssthresh reduction on that round-trip.

If the first ECE arrived before RTO, ssthresh has already been
adapted and prior_ssthresh remains cleared on entry because TCP
is in CA_CWR (the same applies also to a case where FRTO is
entered more than once and ECE comes in the middle).

High_seq must not be touched after tcp_enter_cwr because CWR
round-trip calculation depends on it.

I believe that after this patch, FRTO should be ECN-safe and
even able to take advantage of synergy benefits.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:26 -07:00
Ilpo Järvinen e01f9d7793 [TCP]: Complete icsk-to-local-variable change (in tcp_enter_cwr)
A local variable for icsk was created but this change was
missing. Spotted by Jarek Poplawski.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:25 -07:00
Ilpo Järvinen 3cfe3baaf0 [TCP]: Add two new spurious RTO responses to FRTO
New sysctl tcp_frto_response is added to select amongst these
responses:
	- Rate halving based; reuses CA_CWR state (default)
	- Very conservative; used to be the only one available (=1)
	- Undo cwr; undoes ssthresh and cwnd reductions (=2)

The response with rate halving requires a new parameter to
tcp_enter_cwr because FRTO has already reduced ssthresh and
doing a second reduction there has to be prevented. In addition,
to keep things nice on 80 cols screen, a local variable was
added.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:23 -07:00
Ilpo Järvinen c5e7af0df5 [TCP]: Correct reordering detection change (no FRTO case)
The reordering detection must work also when FRTO has not been
used at all which was the original intention of mine, just the
expression of the idea was flawed.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:22 -07:00
Eric Dumazet 54287cc178 [TCP]: Keep copied_seq, rcv_wup and rcv_next together.
I noticed in oprofile study a cache miss in tcp_rcv_established() to read
copied_seq.

ffffffff80400a80 <tcp_rcv_established>: /* tcp_rcv_established total: 4034293  
2.0400 */

 55493  0.0281 :ffffffff80400bc9:   mov    0x4c8(%r12),%eax copied_seq
543103  0.2746 :ffffffff80400bd1:   cmp    0x3e0(%r12),%eax   rcv_nxt    

if (tp->copied_seq == tp->rcv_nxt &&
        len - tcp_header_len <= tp->ucopy.len) {

In this function, the cache line 0x4c0 -> 0x500 is used only for this
reading 'copied_seq' field.

rcv_wup and copied_seq should be next to rcv_nxt field, to lower number of
active cache lines in hot paths. (tcp_rcv_established(), tcp_poll(), ...)

As you suggested, I changed tcp_create_openreq_child() so that these fields
are changed together, to avoid adding a new store buffer stall.

Patch is 64bit friendly (no new hole because of alignment constraints)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:21 -07:00
Ilpo Järvinen cf4c6bf83d [TCP]: struct *sock argument renamed: sp -> sk
In general, TCP code uses "sk" for struct sock pointer.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:20 -07:00
John Heffner 886236c124 [TCP]: Add RFC3742 Limited Slow-Start, controlled by variable sysctl_tcp_max_ssthresh.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:19 -07:00
Angelo P. Castellani 5ef814753e [TCP] YeAH-TCP: algorithm implementation
YeAH-TCP is a sender-side high-speed enabled TCP congestion control
algorithm, which uses a mixed loss/delay approach to compute the
congestion window. It's design goals target high efficiency, internal,
RTT and Reno fairness, resilience to link loss while keeping network
elements load as low as possible.

For further details look here:
    http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf

Signed-off-by: Angelo P. Castellani <angelo.castellani@gmail.con>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:18 -07:00
Ilpo Järvinen 4dc2665e36 [TCP]: SACK enhanced FRTO
Implements the SACK-enhanced FRTO given in RFC4138 using the
variant given in Appendix B.

RFC4138, Appendix B:
  "This means that in order to declare timeout spurious, the TCP
   sender must receive an acknowledgment for non-retransmitted
   segment between SND.UNA and RecoveryPoint in algorithm step 3.
   RecoveryPoint is defined in conservative SACK-recovery
   algorithm [RFC3517]"

The basic version of the FRTO algorithm can still be used also
when SACK is enabled. To enabled SACK-enhanced version, tcp_frto
sysctl is set to 2.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:16 -07:00
Ilpo Järvinen 288035f915 [TCP]: Prevent reordering adjustments during FRTO
To be honest, I'm not too sure how the reord stuff works in the
first place but this seems necessary.

When FRTO has been active, the one and only retransmission could
be unnecessary but the state and sending order might not be what
the sacktag code expects it to be (to work correctly).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:15 -07:00
Ilpo Järvinen 66e93e45c0 [TCP] FRTO: Fake cwnd for ssthresh callback
TCP without FRTO would be in Loss state with small cwnd. FRTO,
however, leaves cwnd (typically) to a larger value which causes
ssthresh to become too large in case RTO is triggered again
compared to what conventional recovery would do. Because
consecutive RTOs result in only a single ssthresh reduction,
RTO+cumulative ACK+RTO pattern is required to trigger this
event.

A large comment is included for congestion control module writers
trying to figure out what CA_EVENT_FRTO handler should do because
there exists a remote possibility of incompatibility between
FRTO and module defined ssthresh functions.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:14 -07:00
Ilpo Järvinen d1a54c6a0a [TCP] FRTO: Reverse RETRANS bit clearing logic
Previously RETRANS bits were cleared on the entry to FRTO. We
postpone that into tcp_enter_frto_loss, which is really the
place were the clearing should be done anyway. This allows
simplification of the logic from a clearing loop to the head skb
clearing only.

Besides, the other changes made in the previous patches to
tcp_use_frto made it impossible for the non-SACKed FRTO to be
entered if other than the head has been rexmitted.

With SACK-enhanced FRTO (and Appendix B), however, there can be
a number retransmissions in flight when RTO expires (same thing
could happen before this patchset also with non-SACK FRTO). To
not introduce any jumpiness into the packet counting during FRTO,
instead of clearing RETRANS bits from skbs during entry, do it
later on.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:13 -07:00
Ilpo Järvinen 46d0de4ed9 [TCP] FRTO: Entry is allowed only during (New)Reno like recovery
This interpretation comes from RFC4138:
    "If the sender implements some loss recovery algorithm other
     than Reno or NewReno [FHG04], the F-RTO algorithm SHOULD
     NOT be entered when earlier fast recovery is underway."

I think the RFC means to say (especially in the light of
Appendix B) that ...recovery is underway (not just fast recovery)
or was underway when it was interrupted by an earlier (F-)RTO
that hasn't yet been resolved (snd_una has not advanced enough).
Thus, my interpretation is that whenever TCP has ever
retransmitted other than head, basic version cannot be used
because then the order assumptions which are used as FRTO basis
do not hold.

NewReno has only the head segment retransmitted at a time.
Therefore, walk up to the segment that has not been SACKed, if
that segment is not retransmitted nor anything before it, we know
for sure, that nothing after the non-SACKed segment should be
either. This assumption is valid because TCPCB_EVER_RETRANS does
not leave holes but each non-SACKed segment is rexmitted
in-order.

Check for retrans_out > 1 avoids more expensive walk through the
skb list, as we can know the result beforehand: F-RTO will not be
allowed.

SACKed skb can turn into non-SACked only in the extremely rare
case of SACK reneging, in this case we might fail to detect
retransmissions if there were them for any other than head. To
get rid of that feature, whole rexmit queue would have to be
walked (always) or FRTO should be prevented when SACK reneging
happens. Of course RTO should still trigger after reneging which
makes this issue even less likely to show up. And as long as the
response is as conservative as it's now, nothing bad happens even
then.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:12 -07:00
Ilpo Järvinen 7c9a4a5b67 [TCP]: Prevent unrelated cwnd adjustment while using FRTO
FRTO controls cwnd when it still processes the ACK input or it
has just reverted back to conventional RTO recovery; the normal
rules apply when FRTO has reverted to standard congestion
control.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:11 -07:00
Ilpo Järvinen 94d0ea7786 [TCP] FRTO: frto_counter modulo-op converted to two assignments
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:10 -07:00
Ilpo Järvinen 52c63f1e86 [TCP]: Don't enter to fast recovery while using FRTO
Because TCP is not in Loss state during FRTO recovery, fast
recovery could be triggered by accident. Non-SACK FRTO is more
robust than not yet included SACK-enhanced version (that can
receiver high number of duplicate ACKs with SACK blocks during
FRTO), at least with unidirectional transfers, but under
extraordinary patterns fast recovery can be incorrectly
triggered, e.g., Data loss+ACK losses => cumulative ACK with
enough SACK blocks to meet sacked_out >= dupthresh condition).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:09 -07:00
Ilpo Järvinen aa8b6a7ad1 [TCP] FRTO: Response should reset also snd_cwnd_cnt
Since purpose is to reduce CWND, we prevent immediate growth. This
is not a major issue nor is "the correct way" specified anywhere.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:08 -07:00
Ilpo Järvinen 95c4922bf9 [TCP] FRTO: fixes fallback to conventional recovery
The FRTO detection did not care how ACK pattern affects to cwnd
calculation of the conventional recovery. This caused incorrect
setting of cwnd when the fallback becames necessary. The
knowledge tcp_process_frto() has about the incoming ACK is now
passed on to tcp_enter_frto_loss() in allowed_segments parameter
that gives the number of segments that must be added to
packets-in-flight while calculating the new cwnd.

Instead of snd_una we use FLAG_DATA_ACKED in duplicate ACK
detection because RFC4138 states (in Section 2.2):
  If the first acknowledgment after the RTO retransmission
  does not acknowledge all of the data that was retransmitted
  in step 1, the TCP sender reverts to the conventional RTO
  recovery.  Otherwise, a malicious receiver acknowledging
  partial segments could cause the sender to declare the
  timeout spurious in a case where data was lost.

If the next ACK after RTO is duplicate, we do not retransmit
anything, which is equal to what conservative conventional
recovery does in such case.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:07 -07:00
Ilpo Järvinen 6408d206c7 [TCP] FRTO: Ignore some uninteresting ACKs
Handles RFC4138 shortcoming (in step 2); it should also have case
c) which ignores ACKs that are not duplicates nor advance window
(opposite dir data, winupdate).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:06 -07:00
Ilpo Järvinen 7b0eb22b1d [TCP] FRTO: Use Disorder state during operation instead of Open
Retransmission counter assumptions are to be changed. Forcing
reason to do this exist: Using sysctl in check would be racy
as soon as FRTO starts to ignore some ACKs (doing that in the
following patches). Userspace may disable it at any moment
giving nice oops if timing is right. frto_counter would be
inaccessible from userspace, but with SACK enhanced FRTO
retrans_out can include other than head, and possibly leaving
it non-zero after spurious RTO, boom again.

Luckily, solution seems rather simple: never go directly to Open
state but use Disorder instead. This does not really change much,
since TCP could anyway change its state to Disorder during FRTO
using path tcp_fastretrans_alert -> tcp_try_to_open (e.g., when
a SACK block makes ACK dubious). Besides, Disorder seems to be
the state where TCP should be if not recovering (in Recovery or
Loss state) while having some retransmissions in-flight (see
tcp_try_to_open), which is exactly what happens with FRTO.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:05 -07:00
Ilpo Järvinen 7487c48c4f [TCP] FRTO: Consecutive RTOs keep prior_ssthresh and ssthresh
In case a latency spike causes more than one RTO, the later should not
cause the already reduced ssthresh to propagate into the prior_ssthresh
since FRTO declares all such RTOs spurious at once or none of them. In
treating of ssthresh, we mimic what tcp_enter_loss() does.

The previous state (in frto_counter) must be available until we have
checked it in tcp_enter_frto(), and also ACK information flag in
process_frto().

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:04 -07:00
Ilpo Järvinen 30935cf4f9 [TCP] FRTO: Comment cleanup & improvement
Moved comments out from the body of process_frto() to the head
(preferred way; see Documentation/CodingStyle). Bonus: it's much
easier to read in this compacted form.

FRTO algorithm and implementation is described in greater detail.
For interested reader, more information is available in RFC4138.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:03 -07:00
Ilpo Järvinen bdaae17da8 [TCP] FRTO: Moved tcp_use_frto from tcp.h to tcp_input.c
In addition, removed inline.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:02 -07:00
Ilpo Järvinen 9ead9a1d38 [TCP] FRTO: Separated response from FRTO detection algorithm
FRTO spurious RTO detection algorithm (RFC4138) does not include response
to a detected spurious RTO but can use different response algorithms.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:01 -07:00
Ilpo Järvinen 522e7548a9 [TCP] FRTO: Incorrectly clears TCPCB_EVER_RETRANS bit
FRTO was slightly too brave... Should only clear
TCPCB_SACKED_RETRANS bit.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 22:23:00 -07:00
Alexey Kuznetsov 1194ed0a3e [NETLINK]: Infinite recursion in netlink.
Reply to NETLINK_FIB_LOOKUP messages were misrouted back to kernel,
which resulted in infinite recursion and stack overflow.

The bug is present in all kernel versions since the feature appeared.

The patch also makes some minimal cleanup:

1. Return something consistent (-ENOENT) when fib table is missing
2. Do not crash when queue is empty (does not happen, but yet)
3. Put result of lookup

Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25 13:07:28 -07:00
YOSHIFUJI Hideaki a23cf14b16 IPv6: fix Routing Header Type 0 handling thinko
Oops, thinko.  The test for accempting a RH0 was exatly the wrong way
around.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-24 19:26:06 -07:00
YOSHIFUJI Hideaki 0bcbc92629 [IPV6]: Disallow RH0 by default.
A security issue is emerging.  Disallow Routing Header Type 0 by default
as we have been doing for IPv4.
Note: We allow RH2 by default because it is harmless.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-24 14:58:30 -07:00
Patrick McHardy 05d224468a [XFRM]: beet: fix pseudo header length value
draft-nikander-esp-beet-mode-07.txt is not entirely clear on how the length
value of the pseudo header should be calculated, it states "The Header Length
field contains the length of the pseudo header, IPv4 options, and padding in
8 octets units.", but also states "Length in octets (Header Len + 1) * 8".
draft-nikander-esp-beet-mode-08-pre1.txt [1] clarifies this, the header length
should not include the first 8 byte.

This change affects backwards compatibility, but option encapsulation didn't
work until very recently anyway.

[1] http://users.piuha.net/jmelen/BEET/draft-nikander-esp-beet-mode-08-pre1.txt

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-23 22:39:02 -07:00
Stephen Hemminger 4d4d3d1e88 [TCP]: Congestion control initialization.
Change to defer congestion control initialization.

If setsockopt() was used to change TCP_CONGESTION before
connection is established, then protocols that use sequence numbers
to keep track of one RTT interval (vegas, illinois, ...) get confused.

Change the init hook to be called after handshake.

Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-23 22:32:11 -07:00
Trond Myklebust 241c39b9ac RPC: Fix the TCP resend semantics for NFSv4
Fix a regression due to the patch "NFS: disconnect before retrying NFSv4
requests over TCP"

The assumption made in xprt_transmit() that the condition
	"req->rq_bytes_sent == 0 and request is on the receive list"
should imply that we're dealing with a retransmission is false.
Firstly, it may simply happen that the socket send queue was full
at the time the request was initially sent through xprt_transmit().
Secondly, doing this for each request that was retransmitted implies
that we disconnect and reconnect for _every_ request that happened to
be retransmitted irrespective of whether or not a disconnection has
already occurred.

Fix is to move this logic into the call_status request timeout handler.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-20 22:56:30 -07:00
Denis Lunev ac57b3a9ce [NETLINK]: Don't attach callback to a going-away netlink socket
There is a race between netlink_dump_start() and netlink_release()
that can lead to the situation when a netlink socket with non-zero
callback is freed.

Here it is:

CPU1:                           CPU2
netlink_release():              netlink_dump_start():

                                sk = netlink_lookup(); /* OK */

netlink_remove();

spin_lock(&nlk->cb_lock);
if (nlk->cb) { /* false */
  ...
}
spin_unlock(&nlk->cb_lock);

                                spin_lock(&nlk->cb_lock);
                                if (nlk->cb) { /* false */
                                         ...
                                }
                                nlk->cb = cb;
                                spin_unlock(&nlk->cb_lock);
                                ...
sock_orphan(sk);
/*
 * proceed with releasing
 * the socket
 */

The proposal it to make sock_orphan before detaching the callback
in netlink_release() and to check for the sock to be SOCK_DEAD in
netlink_dump_start() before setting a new callback.

Signed-off-by: Denis Lunev <den@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-18 17:05:58 -07:00
Olaf Kirch bfb6709d0b [IrDA]: Correctly handling socket error
This patch fixes an oops first reported in mid 2006 - see
http://lkml.org/lkml/2006/8/29/358 The cause of this bug report is that
when an error is signalled on the socket, irda_recvmsg_stream returns
without removing a local wait_queue variable from the socket's sk_sleep
queue. This causes havoc further down the road.

In response to this problem, a patch was made that invoked sock_orphan on
the socket when receiving a disconnect indication. This is not a good fix,
as this sets sk_sleep to NULL, causing applications sleeping in recvmsg
(and other places) to oops.

This is against the latest net-2.6 and should be considered for -stable
inclusion. 

Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Signed-off-by: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-18 15:07:22 -07:00
Vlad Yasevich d0cf0d9940 [SCTP]: Do not interleave non-fragments when in partial delivery
The way partial delivery is currently implemnted, it is possible to
intereleave a message (either from another steram, or unordered) that
is not part of partial delivery process.  The only way to this is for
a message to not be a fragment and be 'in order' or unorderd for a
given stream.  This will result in bypassing the reassembly/ordering
queues where things live duing partial delivery, and the
message will be delivered to the socket in the middle of partial delivery.

This is a two-fold problem, in that:
1.  the app now must check the stream-id and flags which it may not
be doing.
2.  this clearing partial delivery state from the association and results
in ulp hanging.

This patch is a band-aid over a much bigger problem in that we
don't do stream interleave.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-18 14:16:09 -07:00
David S. Miller fefaa75e04 [IPSEC] af_key: Fix thinko in pfkey_xfrm_policy2msg()
Make sure to actually assign the determined mode to
rq->sadb_x_ipsecrequest_mode.

Noticed by Joe Perches.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-18 14:16:07 -07:00
Linus Torvalds 80d74d5123 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [BRIDGE]: Unaligned access when comparing ethernet addresses
  [SCTP]: Unmap v4mapped addresses during SCTP_BINDX_REM_ADDR operation.
  [SCTP]: Fix assertion (!atomic_read(&sk->sk_rmem_alloc)) failed message
  [NET]: Set a separate lockdep class for neighbour table's proxy_queue
  [NET]: Fix UDP checksum issue in net poll mode.
  [KEY]: Fix conversion between IPSEC_MODE_xxx and XFRM_MODE_xxx.
  [NET]: Get rid of alloc_skb_from_cache
2007-04-17 16:51:32 -07:00
NeilBrown 30f3deeee8 knfsd: use a spinlock to protect sk_info_authunix
sk_info_authunix is not being protected properly so the object that it
points to can be cache_put twice, leading to corruption.

We borrow svsk->sk_defer_lock to provide the protection.  We should
probably rename that lock to have a more generic name - later.

Thanks to Gabriel for reporting this.

Cc: Greg Banks <gnb@melbourne.sgi.com>
Cc: Gabriel Barazer <gabriel@oxeva.fr>
Signed-off-by: Neil Brown <neilb@suse.de>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-17 16:36:27 -07:00
Evgeny Kravtsunov 19bb3506e2 [BRIDGE]: Unaligned access when comparing ethernet addresses
compare_ether_addr() implicitly requires that the addresses
passed are 2-bytes aligned in memory.

This is not true for br_stp_change_bridge_id() and
br_stp_recalculate_bridge_id() in which one of the addresses
is unsigned char *, and thus may not be 2-bytes aligned.

Signed-off-by: Evgeny Kravtsunov <emkravts@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
2007-04-17 14:16:00 -07:00
Paolo Galtieri 0304ff8a2d [SCTP]: Unmap v4mapped addresses during SCTP_BINDX_REM_ADDR operation.
During the sctp_bindx() call to add additional addresses to the
endpoint, any v4mapped addresses are converted and stored as regular
v4 addresses.  However, when trying to remove these addresses, the
v4mapped addresses are not converted and the operation fails.  This
patch unmaps the addresses on during the remove operation as well.

Signed-off-by: Paolo Galtieri <pgaltieri@mvista.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-17 13:13:42 -07:00
Tsutomu Fujii ea2bc483ff [SCTP]: Fix assertion (!atomic_read(&sk->sk_rmem_alloc)) failed message
In current implementation, LKSCTP does receive buffer accounting for
data in sctp_receive_queue and pd_lobby. However, LKSCTP don't do
accounting for data in frag_list when data is fragmented. In addition,
LKSCTP doesn't do accounting for data in reasm and lobby queue in
structure sctp_ulpq.
When there are date in these queue, assertion failed message is printed
in inet_sock_destruct because sk_rmem_alloc of oldsk does not become 0
when socket is destroyed.

Signed-off-by: Tsutomu Fujii <t-fujii@nb.jp.nec.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-17 13:13:37 -07:00
Pavel Emelianov c2ecba7171 [NET]: Set a separate lockdep class for neighbour table's proxy_queue
Otherwise the following calltrace will lead to a wrong
lockdep warning:

  neigh_proxy_process()
    `- lock(neigh_table->proxy_queue.lock);
  arp_redo /* via tbl->proxy_redo */
  arp_process
  neigh_event_ns
  neigh_update
  skb_queue_purge
    `- lock(neighbor->arp_queue.lock);

This is not a deadlock actually, as neighbor table's proxy_queue
and the neighbor's arp_queue are different queues.

Lockdep thinks there is a deadlock as both queues are initialized
with skb_queue_head_init() and thus have a common class.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-17 13:13:31 -07:00
Aubrey.Li 5e7d7fa573 [NET]: Fix UDP checksum issue in net poll mode.
In net poll mode, the current checksum function doesn't consider the
kind of packet which is padded to reach a specific minimum length. I
believe that's the problem causing my test case failed. The following
patch fixed this issue.

Signed-off-by: Aubrey.Li <aubreylee@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-17 13:13:26 -07:00
Kazunori MIYAZAWA 55569ce256 [KEY]: Fix conversion between IPSEC_MODE_xxx and XFRM_MODE_xxx.
We should not blindly convert between IPSEC_MODE_xxx and XFRM_MODE_xxx just
by incrementing / decrementing because the assumption is not true any longer.

Signed-off-by: Kazunori MIYAZAWA <miyazawa@linux-ipv6.org>
Singed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
2007-04-17 13:13:21 -07:00
Herbert Xu b4dfa0b1fb [NET]: Get rid of alloc_skb_from_cache
Since this was added originally for Xen, and Xen has recently (~2.6.18)
stopped using this function, we can safely get rid of it.  Good timing
too since this function has started to bit rot.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-17 13:13:16 -07:00
Linus Torvalds b1847a041a Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6:
  [SPARC64]: Fix inline directive in pci_iommu.c
  [SPARC64]: Fix arg passing to compat_sys_ipc().
  [SPARC]: Fix section mismatch warnings in pci.c and pcic.c
  [SUNRPC]: Make sure on-stack cmsg buffer is properly aligned.
  [SPARC]: avoid CHILD_MAX and OPEN_MAX constants
  [SPARC64]: Fix SBUS IOMMU allocation code.
2007-04-13 18:20:39 -07:00
David S. Miller 49688c8431 [NETFILTER] arp_tables: Fix unaligned accesses.
There are two device string comparison loops in arp_packet_match().
The first one goes byte-by-byte but the second one tries to be
clever and cast the string to a long and compare by longs.

The device name strings in the arp table entries are not guarenteed
to be aligned enough to make this value, so just use byte-by-byte
for both cases.

Based upon a report by <drraid@gmail.com>.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-13 16:37:54 -07:00
YOSHIFUJI Hideaki 612f09e849 [IPV6] SNMP: Fix {In,Out}NoRoutes statistics.
A packet which is being discarded because of no routes in the
forwarding path should not be counted as OutNoRoutes but as
InNoRoutes.
Additionally, on this occasion, a packet whose destinaion is
not valid should be counted as InAddrErrors separately.

Based on patch from Mitsuru Chinen <mitch@linux.vnet.ibm.com>.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-13 16:18:02 -07:00
Joy Latten 661697f728 [IPSEC] XFRM_USER: kernel panic when large security contexts in ACQUIRE
When sending a security context of 50+ characters in an ACQUIRE 
message, following kernel panic occurred.

kernel BUG in xfrm_send_acquire at net/xfrm/xfrm_user.c:1781!
cpu 0x3: Vector: 700 (Program Check) at [c0000000421bb2e0]
    pc: c00000000033b074: .xfrm_send_acquire+0x240/0x2c8
    lr: c00000000033b014: .xfrm_send_acquire+0x1e0/0x2c8
    sp: c0000000421bb560
   msr: 8000000000029032
  current = 0xc00000000fce8f00
  paca    = 0xc000000000464b00
    pid   = 2303, comm = ping
kernel BUG in xfrm_send_acquire at net/xfrm/xfrm_user.c:1781!
enter ? for help
3:mon> t
[c0000000421bb650] c00000000033538c .km_query+0x6c/0xec
[c0000000421bb6f0] c000000000337374 .xfrm_state_find+0x7f4/0xb88
[c0000000421bb7f0] c000000000332350 .xfrm_tmpl_resolve+0xc4/0x21c
[c0000000421bb8d0] c0000000003326e8 .xfrm_lookup+0x1a0/0x5b0
[c0000000421bba00] c0000000002e6ea0 .ip_route_output_flow+0x88/0xb4
[c0000000421bbaa0] c0000000003106d8 .ip4_datagram_connect+0x218/0x374
[c0000000421bbbd0] c00000000031bc00 .inet_dgram_connect+0xac/0xd4
[c0000000421bbc60] c0000000002b11ac .sys_connect+0xd8/0x120
[c0000000421bbd90] c0000000002d38d0 .compat_sys_socketcall+0xdc/0x214
[c0000000421bbe30] c00000000000869c syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 0000000007f0ca9c
SP (fc0ef8f0) is in userspace

We are using size of security context from xfrm_policy to determine
how much space to alloc skb and then putting security context from
xfrm_state into skb. Should have been using size of security context 
from xfrm_state to alloc skb. Following fix does that

Signed-off-by: Joy Latten <latten@austin.ibm.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-13 16:14:35 -07:00
Jerome Borsboom 279e172a58 [VLAN]: Allow VLAN interface on top of bridge interface
When a VLAN interface is created on top of a bridge interface and 
netfilter is enabled to see the bridged packets, the packets can be 
corrupted when passing through the netfilter code. This is caused by the 
VLAN driver not setting the 'protocol' and 'nh' members of the sk_buff 
structure. In general, this is no problem as the VLAN interface is mostly 
connected to a physical ethernet interface which does not use the 
'protocol' and 'nh' members. For a bridge interface, however, these 
members do matter.

Signed-off-by: Jerome Borsboom <j.borsboom@erasmusmc.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-13 16:12:47 -07:00
Andrew Morton 09fe3ef46c [PKTGEN]: Add try_to_freeze()
The pktgen module prevents suspend-to-disk.  Fix.

Acked-by: "Michal Piotrowski" <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-12 14:45:32 -07:00
Patrick McHardy 01102e7ca2 [NETFILTER]: ipt_ULOG: use put_unaligned
Use put_unaligned to fix warnings about unaligned accesses.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-12 14:27:03 -07:00
David S. Miller bc375ea7ef [SUNRPC]: Make sure on-stack cmsg buffer is properly aligned.
Based upon a report from Meelis Roos.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-12 13:35:59 -07:00
Jaroslav Kysela 50c9cc2e54 [NETFILTER]: ipt_CLUSTERIP: fix oops in checkentry function
The clusterip_config_find_get() already increases entries reference
counter, so there is no reason to do it twice in checkentry() callback.

This causes the config to be freed before it is removed from the list,
resulting in a crash when adding the next rule.

Signed-off-by: Jaroslav Kysela <perex@suse.cz>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-10 13:26:48 -07:00
David S. Miller 15d33c070d [TCP]: slow_start_after_idle should influence cwnd validation too
For the cases that slow_start_after_idle are meant to deal
with, it is almost a certainty that the congestion window
tests will think the connection is application limited and
we'll thus decrease the cwnd there too.  This defeats the
whole point of setting slow_start_after_idle to zero.

So test it there too.

We do not cancel out the entire tcp_cwnd_validate() function
so that if the sysctl is changed we still have the validation
state maintained.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-09 13:31:15 -07:00
Patrick McHardy bb8a954f27 [NET_SCHED]: cls_tcindex: fix compatibility breakage
Userspace uses an integer for TCA_TCINDEX_SHIFT, the kernel was changed
to expect and use a u16 value in 2.6.11, which broke compatibility on
big endian machines. Change back to use int.

Reported by Ole Reinartz <ole.reinartz@gmx.de>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-09 13:31:13 -07:00
David S. Miller 161980f4c6 [IPV6]: Revert recent change to rt6_check_dev().
This reverts a0d78ebf3a

It causes pings to link-local addresses to fail.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-06 11:42:27 -07:00
Patrick McHardy 254d0d24e3 [XFRM]: beet: fix IP option decapsulation
Beet mode looks for the beet pseudo header after the outer IP header,
which is wrong since that is followed by the ESP header. Additionally
it needs to adjust the packet length after removing the pseudo header
and point the data pointer to the real data location.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-05 16:03:33 -07:00
Patrick McHardy d4b1e84062 [XFRM]: beet: fix beet mode decapsulation
Beet mode decapsulation fails to properly set up the skb pointers, which
only works by accident in combination with CONFIG_NETFILTER, since in that
case the skb is fixed up in xfrm4_input before passing it to the netfilter
hooks.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-05 15:59:41 -07:00
Patrick McHardy 04fef9893a [XFRM]: beet: use IPOPT_NOP for option padding
draft-nikander-esp-beet-mode-07.txt states "The padding MUST be filled
with NOP options as defined in Internet Protocol [1] section 3.1
Internet header format.", so do that.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-05 15:54:39 -07:00
Patrick McHardy c5027c9a89 [XFRM]: beet: fix IP option encapsulation
Beet mode calculates an incorrect value for the transport header location
when IP options are present, resulting in encapsulation errors.

The correct location is 4 or 8 bytes before the end of the original IP
header, depending on whether the pseudo header is padded.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-05 15:54:02 -07:00
Herbert Xu 4c4d51a731 [IPSEC]: Reject packets within replay window but outside the bit mask
Up until this point we've accepted replay window settings greater than
32 but our bit mask can only accomodate 32 packets.  Thus any packet
with a sequence number within the window but outside the bit mask would
be accepted.

This patch causes those packets to be rejected instead.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-05 00:07:39 -07:00
Mitsuru Chinen 60e5c16641 [IPv6]: Exclude truncated packets from InHdrErrors statistics
Incoming trancated packets are counted as not only InTruncatedPkts but
also InHdrErrors. They should be counted as InTruncatedPkts only.

Signed-off-by: Mitsuru Chinen <mitch@linux.vnet.ibm.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-04 23:54:59 -07:00
Jean Delvare 75559c167b [APPLETALK]: Fix a remotely triggerable crash
When we receive an AppleTalk frame shorter than what its header says,
we still attempt to verify its checksum, and trip on the BUG_ON() at
the end of function atalk_sum_skb() because of the length mismatch.

This has security implications because this can be triggered by simply
sending a specially crafted ethernet frame to a target victim,
effectively crashing that host. Thus this qualifies, I think, as a
remote DoS. Here is the frame I used to trigger the crash, in npg
format:

<Appletalk Killer>
{
# Ethernet header -----

  XX XX XX XX XX XX  # Destination MAC
  00 00 00 00 00 00  # Source MAC
  00 1D              # Length

# LLC header -----

  AA AA 03
  08 00 07 80 9B  # Appletalk

# Appletalk header -----

  00 1B        # Packet length (invalid)
  00 01        # Fake checksum 
  00 00 00 00  # Destination and source networks
  00 00 00 00  # Destination and source nodes and ports

# Payload -----

  0C 0D 0E 0F 10 11 12 13
  14
}

The destination MAC address must be set to those of the victim.

The severity is mitigated by two requirements:
* The target host must have the appletalk kernel module loaded. I
  suspect this isn't so frequent.
* AppleTalk frames are non-IP, thus I guess they can only travel on
  local networks. I am no network expert though, maybe it is possible
  to somehow encapsulate AppleTalk packets over IP.

The bug has been reported back in June 2004:
  http://bugzilla.kernel.org/show_bug.cgi?id=2979
But it wasn't investigated, and was closed in July 2006 as both
reporters had vanished meanwhile.

This code was new in kernel 2.6.0-test5:
  http://git.kernel.org/?p=linux/kernel/git/tglx/history.git;a=commitdiff;h=7ab442d7e0a76402c12553ee256f756097cae2d2
And not modified since then, so we can assume that vanilla kernels
2.6.0-test5 and later, and distribution kernels based thereon, are
affected.

Note that I still do not know for sure what triggered the bug in the
real-world cases. The frame could have been corrupted by the kernel if
we have a bug hiding somewhere. But more likely, we are receiving the
faulty frame from the network.

Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-04 23:52:46 -07:00
Adrian Bunk 418106d624 [PATCH] net/sunrpc/svcsock.c: fix a check
The return value of kernel_recvmsg() should be assigned to "err", not
compared with the random value of a never initialized "err" (and the "< 0"
check wrongly always returned false since == comparisons never have a
result < 0).

Spotted by the Coverity checker.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 21:12:47 -07:00
Eric W. Biederman 927498217c [PATCH] net: Ignore sysfs network device rename bugs.
The generic networking code ensures that no two networking devices
have the same name, so  there is no time except when sysfs has
implementation bugs that device_rename when called from
dev_change_name will fail.

The current error handling for errors from device_rename in
dev_change_name is wrong and results in an unusable and unrecoverable
network device if device_rename is happens to return an error.

This patch removes the buggy error handling.  Which confines the mess
when device_rename hits a problem to sysfs, instead of propagating it
the rest of the network stack.  Making linux a little more robust.

Without this patch you can observe what happens when sysfs has a bug
when CONFIG_SYSFS_DEPRECATED is not set and you attempt to rename
a real network device to a name like (broken_parity_status, device,
modalias, power, resource2, subsystem_vendor, class,  driver, irq,
msi_bus, resource, subsystem, uevent, config, enable, local_cpus,
numa_node, resource0, subsystem_device, vendor)

Greg has a patch that fixes the sysfs bugs but he doesn't trust it
for a 2.6.21 timeframe.  This patch which just ignores errors should
be safe and it keeps the system from going completely wacky.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-04-04 08:51:52 -07:00
John Heffner 84565070e4 [TCP]: Do receiver-side SWS avoidance for rcvbuf < MSS.
Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-02 13:56:32 -07:00
YOSHIFUJI Hideaki b59e139bbd [IPv6]: Fix incorrect length check in rawv6_sendmsg()
In article <20070329.142644.70222545.davem@davemloft.net> (at Thu, 29 Mar 2007 14:26:44 -0700 (PDT)), David Miller <davem@davemloft.net> says:

> From: Sridhar Samudrala <sri@us.ibm.com>
> Date: Thu, 29 Mar 2007 14:17:28 -0700
>
> > The check for length in rawv6_sendmsg() is incorrect.
> > As len is an unsigned int, (len < 0) will never be TRUE.
> > I think checking for IPV6_MAXPLEN(65535) is better.
> >
> > Is it possible to send ipv6 jumbo packets using raw
> > sockets? If so, we can remove this check.
>
> I don't see why such a limitation against jumbo would exist,
> does anyone else?
>
> Thanks for catching this Sridhar.  A good compiler should simply
> fail to compile "if (x < 0)" when 'x' is an unsigned type, don't
> you think :-)

Dave, we use "int" for returning value,
so we should fix this anyway, IMHO;
we should not allow len > INT_MAX.

Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Acked-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-02 13:30:54 -07:00
Patrick McHardy 31ba548f96 [NET_SCHED]: cls_basic: fix memory leak in basic_destroy
tp->root is not freed on destruction.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-02 13:30:52 -07:00
Steven Whitehouse 83886b6b63 [NET]: Change "not found" return value for rule lookup
This changes the "not found" error return for the lookup
function to -ESRCH so that it can be distinguished from
the case where a rule or route resulting in -ENETUNREACH
has been found during the search.

It fixes a bug where if DECnet was compiled with routing
support, but no routes were added to the routing table,
it was failing to fall back to endnode routing.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-02 13:30:51 -07:00
Linus Torvalds 9415fddd99 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [IFB]: Fix crash on input device removal
  [BNX2]: Fix link interrupt problem.
2007-03-29 13:15:13 -07:00
Patrick McHardy c01003c205 [IFB]: Fix crash on input device removal
The input_device pointer is not refcounted, which means the device may
disappear while packets are queued, causing a crash when ifb passes packets
with a stale skb->dev pointer to netif_rx().

Fix by storing the interface index instead and do a lookup where neccessary.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-29 11:46:52 -07:00
Jiri Kosina cb3fecc2f2 [PATCH] bluetooth hid quirks: mightymouse quirk
I have a bugreport that scrollwheel of bluetooth version of apple
mightymouse doesn't work.  The USB version of mightymouse works, as there
is a quirk for handling scrollwheel in hid/usbhid for it.

Now that bluetooth git tree is hooked to generic hid layer, it could easily
use the quirks which are already present in generic hid parser, hid-input,
etc.

Below is a simple patch against bluetooth git tree, which adds quirk
handling to current bluetooth hidp code, and sets quirk flags for device
0x05ac/0x030c, which is the bluetooth version of the apple mightymouse.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-29 08:22:24 -07:00
Linus Torvalds c203b33d2e Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [DCCP] getsockopt: Fix DCCP_SOCKOPT_[SEND,RECV]_CSCOV
2007-03-28 14:00:27 -07:00
Linus Torvalds 4db43e677e Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6:
  SUN3/3X Lance trivial fix improved
  mv643xx_eth: Fix use of uninitialized port_num field
  forcedeth: fix tx timeout
  forcedeth: fix nic poll
  qla3xxx: bugfix: Jumbo frame handling.
  qla3xxx: bugfix: Dropping interrupt under heavy network load.
  qla3xxx: bugfix: Multi segment sends were getting whacked.
  qla3xxx: bugfix: Add tx control block memset.
  atl1: remove unnecessary crc inversion
  myri10ge: correctly detect when TSO should be used
  [PATCH] WE-22 : prevent information leak on 64 bit
  [PATCH] wext: Add missing ioctls to 64<->32 conversion
  [PATCH] bcm43xx: Fix machine check on PPC for version 1 PHY
  [PATCH] bcm43xx: fix radio_set_tx_iq
  [PATCH] bcm43xx: Fix code for confusion between PHY revision and PHY version
2007-03-28 13:45:13 -07:00
Arnaldo Carvalho de Melo 39ebc0276b [DCCP] getsockopt: Fix DCCP_SOCKOPT_[SEND,RECV]_CSCOV
We were only checking if there was enough space to put the int, but
left len as specified by the (malicious) user, sigh, fix it by setting
len to sizeof(val) and transfering just one int worth of data, the one
asked for.

Also check for negative len values.

Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-28 11:54:32 -07:00
Jeff Garzik a9c87a10db Merge branch 'upstream-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6 into upstream-fixes 2007-03-28 02:21:18 -04:00
Herbert Xu 53aadcc909 [IPV6]: Set IF_READY if the device is up and has carrier
We still need to set the IF_READY flag in ipv6_add_dev for the case
where all addresses (including the link-local) are deleted and then
recreated.  In that case the IPv6 device too will be destroyed and
then recreated.

In order to prevent the original problem, we simply ensure that
the device is up before setting IF_READY.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-27 14:31:52 -07:00
Patrick McHardy c38c83cb70 [NET_SCHED]: sch_htb/sch_hfsc: fix oops in qlen_notify
During both HTB and HFSC class deletion the class is removed from the
class hash before calling qdisc_tree_decrease_qlen. This makes the
->get operation in qdisc_tree_decrease_qlen fail, so it passes a NULL
pointer to ->qlen_notify, causing an oops.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-27 14:04:24 -07:00
Jean Tourrilhes c2805fbb86 [PATCH] WE-22 : prevent information leak on 64 bit
Johannes Berg discovered that kernel space was leaking to
userspace on 64 bit platform. He made a first patch to fix that. This
is an improved version of his patch.

Signed-off-by: Jean Tourrilhes <jt@hpl.hp.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
2007-03-27 14:10:26 -04:00
Robert P. J. Day 9b2f7bcf0e [NET]: Remove dead net/sched/Makefile entry for sch_hpfq.o.
Remove the worthless net/sched/Makefile entry for the non-existent
source file sch_hpfq.c.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-26 16:20:34 -07:00
Robert Olsson d562f1f8a9 [IPV4] fib_trie: Document locking.
Paul E. McKenney writes:

> Those of use who dive into networking only occasionally would much
> appreciate this.  ;-)

No problem here... 

Acked-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> (but trivial)
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-26 14:22:22 -07:00
Alexey Dobriyan 79f4f6428f [NET]: Correct accept(2) recovery after sock_attach_fd()
* d_alloc() in sock_attach_fd() fails leaving ->f_dentry of new file NULL
* bail out to out_fd label, doing fput()/__fput() on new file
* but __fput() assumes valid ->f_dentry and dereferences it

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-26 14:09:52 -07:00
Patrick McHardy 035832a280 [NET_SCHED]: Fix ingress locking
Ingress queueing uses a seperate lock for serializing enqueue operations,
but fails to properly protect itself against concurrent changes to the
qdisc tree. Use queue_lock for now since the real fix it quite intrusive.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-25 18:48:12 -07:00
Patrick McHardy d3fa76ee6b [NET_SCHED]: cls_basic: fix NULL pointer dereference
cls_basic doesn't allocate tp->root before it is linked into the
active classifier list, resulting in a NULL pointer dereference
when packets hit the classifier before its ->change function is
called.

Reported by Chris Madden <chris@reflexsecurity.com>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-25 18:48:11 -07:00
Adrian Bunk c93a882ebe [DCCP]: make dccp_write_xmit_timer() static again
dccp_write_xmit_timer() needlessly became global.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-25 18:48:10 -07:00
David S. Miller f11e6659ce [IPV6]: Fix routing round-robin locking.
As per RFC2461, section 6.3.6, item #2, when no routers on the
matching list are known to be reachable or probably reachable we
do round robin on those available routes so that we make sure
to probe as many of them as possible to detect when one becomes
reachable faster.

Each routing table has a rwlock protecting the tree and the linked
list of routes at each leaf.  The round robin code executes during
lookup and thus with the rwlock taken as a reader.  A small local
spinlock tries to provide protection but this does not work at all
for two reasons:

1) The round-robin list manipulation, as coded, goes like this (with
   read lock held):

	walk routes finding head and tail

	spin_lock();
	rotate list using head and tail
	spin_unlock();

   While one thread is rotating the list, another thread can
   end up with stale values of head and tail and then proceed
   to corrupt the list when it gets the lock.  This ends up causing
   the OOPS in fib6_add() later onthat many people have been hitting.

2) All the other code paths that run with the rwlock held as
   a reader do not expect the list to change on them, they
   expect it to remain completely fixed while they hold the
   lock in that way.

So, simply stated, it is impossible to implement this correctly using
a manipulation of the list without violating the rwlock locking
semantics.

Reimplement using a per-fib6_node round-robin pointer.  This way we
don't need to manipulate the list at all, and since the round-robin
pointer can only ever point to real existing entries we don't need
to perform any locking on the changing of the round-robin pointer
itself.  We only need to reset the round-robin pointer to NULL when
the entry it is pointing to is removed.

The idea is from Thomas Graf and it is very similar to how this
was implemented before the advanced router selection code when in.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-25 18:48:05 -07:00
Thomas Graf a979101106 [DECNet] fib: Fix out of bound access of dn_fib_props[]
Fixes a typo which caused fib_props[] to have the wrong size
and makes sure the value used to index the array which is
provided by userspace via netlink is checked to avoid out of
bound access.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-25 18:48:04 -07:00
Thomas Graf a0ee18b9b7 [IPv4] fib: Fix out of bound access of fib_props[]
Fixes a typo which caused fib_props[] to have the wrong size
and makes sure the value used to index the array which is
provided by userspace via netlink is checked to avoid out of
bound access.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-25 18:48:03 -07:00
Ralf Baechle 954b2e7f4c [NET] AX.25 Kconfig and docs updates and fixes
o The AX.25 Howto is unmaintained since several years.  I've replaced it
   with a wiki at http://www.linux-ax25.org which provides more uptodate
   information.
 o Change default for AX25_DAMA_SLAVE to Y.  AX25_DAMA_SLAVE only compiles
   in support for DAMA but doesn't activate it.  I hope this gets Linux
   distributions to ship their AX.25 kernels with AX25_DAMA_SLAVE enabled.
   The price for this would be very small.
 o Delete historic changelog from comments, that's what SCM systems are
   meant to do.
 o ---help--- in Kconfig looks so yellingly eye insulting.  Use just help.
 o Rewrite the commented out piece of old Linux 2.4 configuration language
   to Kconfig for consistency.
 o Fixup dependencies.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-25 18:48:02 -07:00
Alexey Kuznetsov ecbb416939 [NET]: Fix neighbour destructor handling.
->neigh_destructor() is killed (not used), replaced with
->neigh_cleanup(), which is called when neighbor entry goes to dead
state. At this point everything is still valid: neigh->dev,
neigh->parms etc.

The device should guarantee that dead neighbor entries (neigh->dead !=
0) do not get private part initialized, otherwise nobody will cleanup
it.

I think this is enough for ipoib which is the only user of this thing.
Initialization private part of neighbor entries happens in ipib
start_xmit routine, which is not reached when device is down.  But it
would be better to add explicit test for neigh->dead in any case.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-25 18:48:01 -07:00
Thomas Graf e1701c68c1 [NET]: Fix fib_rules compatibility breakage
Based upon a patch from Patrick McHardy.

The fib_rules netlink attribute policy introduced in 2.6.19 broke
userspace compatibilty. When specifying a rule with "from all"
or "to all", iproute adds a zero byte long netlink attribute,
but the policy requires all addresses to have a size equal to
sizeof(struct in_addr)/sizeof(struct in6_addr), resulting in a
validation error.

Check attribute length of FRA_SRC/FRA_DST in the generic framework
by letting the family specific rules implementation provide the
length of an address. Report an error if address length is non
zero but no address attribute is provided. Fix actual bug by
checking address length for non-zero instead of relying on
availability of attribute.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-25 18:48:00 -07:00
Patrick Ringl 0fa7d868ca [PATCH] fix typos in net/ieee80211/Kconfig
This is just a QA / cosmetic fix ..

[ "a modules" => "a module" ]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-24 16:51:53 -07:00
Patrick McHardy 848c29fd64 [NETFILTER]: nat: avoid rerouting packets if only XFRM policy key changed
Currently NAT not only reroutes packets in the OUTPUT chain when the
routing key changed, but also if only the non-routing part of the
IPsec policy key changed. This breaks ping -I since it doesn't use
SO_BINDTODEVICE but IP_PKTINFO cmsg to specify the output device, and
this information is lost.

Only do full rerouting if the routing key changed, and just do a new
policy lookup with the old route if only the ports changed.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-22 12:30:29 -07:00
Patrick McHardy ca8fbb859c [NETFILTER]: nf_conntrack_netlink: add missing dependency on NF_NAT
NF_CT_NETLINK=y, NF_NAT=m results in:

 LD      .tmp_vmlinux1
 net/built-in.o: dans la fonction « nfnetlink_parse_nat_proto »:
 nf_conntrack_netlink.c:(.text+0x28db9): référence indéfinie vers « nf_nat_proto_find_get »
 nf_conntrack_netlink.c:(.text+0x28dd6): référence indéfinie vers « nf_nat_proto_put »
 net/built-in.o: dans la fonction « ctnetlink_new_conntrack »:
 nf_conntrack_netlink.c:(.text+0x29959): référence indéfinie vers « nf_nat_setup_info »
 nf_conntrack_netlink.c:(.text+0x29b35): référence indéfinie vers « nf_nat_setup_info »
 nf_conntrack_netlink.c:(.text+0x29cf7): référence indéfinie vers « nf_nat_setup_info »
 nf_conntrack_netlink.c:(.text+0x29de2): référence indéfinie vers « nf_nat_setup_info »
 make: *** [.tmp_vmlinux1] Erreur 1 

Reported by Kevin Baradon <kevin.baradon@gmail.com>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-22 12:29:57 -07:00
Dave Jones b6f99a2119 [NET]: fix up misplaced inlines.
Turning up the warnings on gcc makes it emit warnings
about the placement of 'inline' in function declarations.
Here's everything that was under net/

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-22 12:27:49 -07:00
Vlad Yasevich 289f42492c [SCTP]: Correctly reset ssthresh when restarting association
Reset ssthresh to the correct value (peer's a_rwnd) when restarting
association.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-22 12:26:25 -07:00
Patrick McHardy b19cbe2a16 [BRIDGE]: Fix fdb RCU race
br_fdb_get use atomic_inc to increase the refcount of an element found
on a RCU protected list, which can lead to the following race:

CPU0					CPU1

					br_fdb_get:   rcu_read_lock
					__br_fdb_get: find element
fdb_delete:   hlist_del_rcu
	      br_fdb_put
br_fdb_put:   atomic_dec_and_test
	      call_rcu(fdb_rcu_free)	br_fdb_get:   atomic_inc
						      rcu_read_unlock
fdb_rcu_free: kmem_cache_free

Use atomic_inc_not_zero instead.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-22 12:25:20 -07:00
Patrick McHardy ec25615b9d [NET]: Fix fib_rules dump race
fib_rules_dump needs to use list_for_each_entry_rcu to protect against
concurrent changes to the rules list.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-22 12:24:38 -07:00
Joy Latten 961995582e [XFRM]: ipsecv6 needs a space when printing audit record.
This patch adds a space between printing of the src and dst ipv6 addresses.
Otherwise, audit or other test tools may fail to process the audit
record properly because they cannot find the dst address.

Signed-off-by: Joy Latten <latten@austin.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-20 00:09:47 -07:00
Adrian Bunk e4ce837de9 [X25] x25_forward_call(): fix NULL dereferences
This patch fixes two NULL dereferences spotted by the Coverity checker.

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-20 00:09:46 -07:00
Vlad Yasevich 749bf9215e [SCTP]: Reset some transport and association variables on restart
If the association has been restarted, we need to reset the
transport congestion variables as well as accumulated error
counts and CACC variables.  If we do not, the association
will use the wrong values and may terminate prematurely.

This was found with a scenario where the peer restarted
the association when lksctp was in the last HB timeout for
its association.  The restart happened, but the error counts
have not been reset and when the timeout occurred, a newly
restarted association was terminated due to excessive
retransmits.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-20 00:09:45 -07:00
Vlad Yasevich fb78525ae1 [SCTP]: Increment error counters on user requested HBs.
2960bis states (Section 8.3):

   D) Request an on-demand HEARTBEAT on a specific destination transport
      address of a given association.

   The endpoint should increment the respective error counter of the
   destination transport address each time a HEARTBEAT is sent to that
   address and not acknowledged within one RTO.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-20 00:09:44 -07:00
Vlad Yasevich 0b58a81146 [SCTP]: Clean up stale data during association restart
During association restart we may have stale data sitting
on the ULP queue waiting for ordering or reassembly.  This
data may cause severe problems if not cleaned up.  In particular
stale data pending ordering may cause problems with receive
window exhaustion if our peer has decided to restart the
association.

Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-20 00:09:43 -07:00
Samuel Ortiz c577c2b993 [IrDA]: Calling ppp_unregister_channel() from process context
We need to call ppp_unregister_channel() when IrNET disconnects, and this
must be done from a process context.

Bug reported and patch tested by Guennadi Liakhovetski.

Signed-off-by: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-20 00:09:42 -07:00
G. Liakhovetski 7bb1bbe615 [IrDA]: irttp_dup spin_lock initialisation
Without this initialization one gets

kernel BUG at kernel/rtmutex_common.h:80!

This patch should also be included in the -stable kernel.

Signed-off-by: G. Liakhovetski <gl@dsa-ac.de>
Signed-off-by: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-20 00:09:41 -07:00
Masayuki Nakagawa d35690beda [IPV6]: ipv6_fl_socklist is inadvertently shared.
The ipv6_fl_socklist from listening socket is inadvertently shared
with new socket created for connection.  This leads to a variety of
interesting, but fatal, bugs. For example, removing one of the
sockets may lead to the other socket's encountering a page fault
when the now freed list is referenced.

The fix is to not share the flow label list with the new socket.

Signed-off-by: Masayuki Nakagawa <nakagawa.msy@ncos.nec.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-16 16:14:03 -07:00
John Heffner 53cdcc04c1 [TCP]: Fix tcp_mem[] initialization.
Change tcp_mem initialization function.  The fraction of total memory
is now a continuous function of memory size, and independent of page
size.

Signed-off-by: John Heffner <jheffner@psc.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-16 15:04:03 -07:00
Alexey Dobriyan 3e6b3b2e34 [NET]: Copy mac_len in skb_clone() as well
ANK says: "It is rarely used, that's wy it was not noticed.
But in the places, where it is used, it should be disaster."

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-16 15:00:46 -07:00
Robert Olsson d5cc4a73a5 [IPV4]: Do not disable preemption in trie_leaf_remove().
Hello, Just discussed this Patrick...

We have two users of trie_leaf_remove, fn_trie_flush and fn_trie_delete
both are holding RTNL. So there shouldn't be need for this preempt stuff.
This is assumed to a leftover from an older RCU-take.

> Mhh .. I think I just remembered something - me incorrectly suggesting
> to add it there while we were talking about this at OLS :) IIRC the
> idea was to make sure tnode_free (which at that time didn't use
> call_rcu) wouldn't free memory while still in use in a rcu read-side
> critical section. It should have been synchronize_rcu of course,
> but with tnode_free using call_rcu it seems to be completely
> unnecessary. So I guess we can simply remove it.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-16 15:00:07 -07:00
Joy Latten 75e252d981 [XFRM]: Fix missing protocol comparison of larval SAs.
I noticed that in xfrm_state_add we look for the larval SA in a few
places without checking for protocol match. So when using both 
AH and ESP, whichever one gets added first, deletes the larval SA. 
It seems AH always gets added first and ESP is always the larval 
SA's protocol since the xfrm->tmpl has it first. Thus causing the
additional km_query()

Adding the check eliminates accidental double SA creation. 

Signed-off-by: Joy Latten <latten@austin.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-12 17:14:07 -07:00
Robert P. J. Day ce0ecd594d [WANROUTER]: Delete superfluous source file "net/wanrouter/af_wanpipe.c".
Delete the apparently superfluous source file
net/wanrouter/af_wanpipe.c.

Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-12 17:06:27 -07:00
Geert Uytterhoeven 08882669e0 [IPV4]: Fix warning in ip_mc_rejoin_group.
Kill warning about unused variable `in_dev' when CONFIG_IP_MULTICAST
is not set.

Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-12 17:02:37 -07:00
Ralf Baechle 2536b94a2d [ROSE]: Socket locking is a great invention.
Especially if you actually try to do it ;-)

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-12 15:53:33 -07:00
Ralf Baechle 6cee77dbf2 [ROSE]: Remove ourselves from waitqueue when receiving a signal
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-12 15:52:52 -07:00
Paul Moore 38c8947c1b [NetLabel]: parse the CIPSO ranged tag on incoming packets
Commit 484b366932 added support for the CIPSO
ranged categories tag.  However, it appears that I made a mistake when rebasing
then patch to the latest upstream sources for submission and dropped the part
of the patch that actually parses the tag on incoming packets.  This patch
fixes this mistake by adding the required function call to the
cipso_v4_skbuff_getattr() function.

I've run this patch over the weekend and have not noticed any problems.

Signed-off-by: Paul Moore <paul.moore@hp.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-12 14:38:02 -07:00
Chris Wright d2b02ed948 [IPV6] fix ipv6_getsockopt_sticky copy_to_user leak
User supplied len < 0 can cause leak of kernel memory.
Use unsigned compare instead.

Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-09 16:19:17 -08:00
Olaf Kirch dfee0a725b [IPV6]: Fix for ipv6_setsockopt NULL dereference
I came across this bug in http://bugzilla.kernel.org/show_bug.cgi?id=8155

Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-09 13:55:38 -08:00
Gerrit Renker aabb601b0f [DCCP]: Initialise write_xmit_timer also on passive sockets
The TX CCID needs the write_xmit_timer for delaying packet sends. Previously
this timer was only activated on active (connecting) sockets.

This patch initialises the write_xmit_timer in sync with the other timers, i.e.
the timer will be ready on any socket. This is used by applications with a
listening socket which start to stream after receiving an initiation by the
client.  The write_xmit_timer is stopped when the application closes, as before.

Was tested to work and to remove the timer bug reported on dccp@vger.

Also moved timer initialisation into timer.c (static). 

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-09 13:47:58 -08:00
Evgeniy Polyakov c4e38f41e3 [IPV4]: Fix rtm_to_ifaddr() error handling.
Return negative error value (embedded in the pointer) instead of
returning NULL.

Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-09 13:43:24 -08:00
Jarek Poplawski e2eb8d4528 [SCTP] ipv6: inconsistent lock state ipv6_add_addr/sctp_v6_copy_addrlist
lockdep found that dev->lock taken from softirq in ipv6_add_addr
is also taken in sctp_v6_copy_addrlist with softirqs enabled, so
lockup is possible.

Noticed-by: Simon Arlott <simon@arlott.org>
Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-08 14:43:41 -08:00
Jiri Kosina b40df5743e [PATCH] bluetooth: fix socket locking in hci_sock_dev_event()
[Bluetooth] Fix socket locking in hci_sock_dev_event()

hci_sock_dev_event() uses bh_lock_sock() to lock the socket lock.
This is not deadlock-safe against locking of the same socket lock in
l2cap_connect_cfm() from softirq context. In addition to that,
hci_sock_dev_event() doesn't seem to be called from softirq context,
so it is safe to use lock_sock()/release_sock() instead.

The lockdep warning can be triggered on my T42p simply by switching
the Bluetooth off by the keyboard button.

  =================================
  [ INFO: inconsistent lock state ]
  2.6.21-rc2 #4
  ---------------------------------
  inconsistent {in-softirq-W} -> {softirq-on-W} usage.
  khubd/156 [HC0[0]:SC0[0]:HE1:SE1] takes:
   (slock-AF_BLUETOOTH){-+..}, at: [<e0ca5520>] hci_sock_dev_event+0xa8/0xc5 [bluetooth]
  {in-softirq-W} state was registered at:
    [<c012d1db>] mark_lock+0x59/0x414
    [<e0cef688>] l2cap_connect_cfm+0x4e/0x11f [l2cap]
    [<c012dfd7>] __lock_acquire+0x3e5/0xb99
    [<e0cef688>] l2cap_connect_cfm+0x4e/0x11f [l2cap]
    [<c012e7f2>] lock_acquire+0x67/0x81
    [<e0cef688>] l2cap_connect_cfm+0x4e/0x11f [l2cap]
    [<c036ee72>] _spin_lock+0x29/0x34
    [<e0cef688>] l2cap_connect_cfm+0x4e/0x11f [l2cap]
    [<e0cef688>] l2cap_connect_cfm+0x4e/0x11f [l2cap]
    [<e0ca17c3>] hci_send_cmd+0x126/0x14f [bluetooth]
    [<e0ca4ce4>] hci_event_packet+0x729/0xebd [bluetooth]
    [<e0ca205b>] hci_rx_task+0x2a/0x20f [bluetooth]
    [<e0ca209d>] hci_rx_task+0x6c/0x20f [bluetooth]
    [<c012d7be>] trace_hardirqs_on+0x10d/0x14e
    [<c011ac85>] tasklet_action+0x3d/0x68
    [<c011abba>] __do_softirq+0x41/0x92
    [<c011ac32>] do_softirq+0x27/0x3d
    [<c0105134>] do_IRQ+0x7b/0x8f
    [<c0103dec>] common_interrupt+0x24/0x34
    [<c0103df6>] common_interrupt+0x2e/0x34
    [<c0248e65>] acpi_processor_idle+0x1b3/0x34a
    [<c0248e68>] acpi_processor_idle+0x1b6/0x34a
    [<c010232b>] cpu_idle+0x39/0x4e
    [<c04bab0c>] start_kernel+0x372/0x37a
    [<c04ba42b>] unknown_bootoption+0x0/0x202
    [<ffffffff>] 0xffffffff

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-08 07:38:21 -08:00
Aji Srinivas de79059ecd [BRIDGE]: adding new device to bridge should enable if up
One change introduced by the workqueue removal patch is that adding an
interface that is up to a bridge which is also up does not ever call
br_stp_enable_port(), leaving the port in DISABLED state until we do
ifconfig down and up or link events occur.

The following patch to the br_add_if function fixes it.
This is a regression introduced in 2.6.21.

Submitted-by: Aji_Srinivas@emc.com
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-07 16:10:53 -08:00
Herbert Xu c7ababbdc6 [IPV6]: Do not set IF_READY if device is down
Now that we add the IPv6 device at registration time we don't need
to set IF_READY in ipv6_add_dev anymore because we will always get
a NETDEV_UP event later on should the device ever become ready.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-07 16:08:12 -08:00
Eric Paris 16bec31db7 [IPSEC]: xfrm audit hook misplaced in pfkey_delete and xfrm_del_sa
Inside pfkey_delete and xfrm_del_sa the audit hooks were not called if
there was any permission/security failures in attempting to do the del
operation (such as permission denied from security_xfrm_state_delete).
This patch moves the audit hook to the exit path such that all failures
(and successes) will actually get audited.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Venkat Yekkirala <vyekkirala@trustedcs.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-07 16:08:11 -08:00
Eric Paris 215a2dd3b4 [IPSEC]: Add xfrm policy change auditing to pfkey_spdget
pfkey_spdget neither had an LSM security hook nor auditing for the
removal of xfrm_policy structs.  The security hook was added when it was
moved into xfrm_policy_byid instead of the callers to that function by
my earlier patch and this patch adds the auditing hooks as well.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Venkat Yekkirala <vyekkirala@trustedcs.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-07 16:08:10 -08:00
Eric Paris ef41aaa0b7 [IPSEC]: xfrm_policy delete security check misplaced
The security hooks to check permissions to remove an xfrm_policy were
actually done after the policy was removed.  Since the unlinking and
deletion are done in xfrm_policy_by* functions this moves the hooks
inside those 2 functions.  There we have all the information needed to
do the security check and it can be done before the deletion.  Since
auditing requires the result of that security check err has to be passed
back and forth from the xfrm_policy_by* functions.

This patch also fixes a bug where a deletion that failed the security
check could cause improper accounting on the xfrm_policy
(xfrm_get_policy didn't have a put on the exit path for the hold taken
by xfrm_policy_by*)

It also fixes the return code when no policy is found in
xfrm_add_pol_expire.  In old code (at least back in the 2.6.18 days) err
wasn't used before the return when no policy is found and so the
initialization would cause err to be ENOENT.  But since err has since
been used above when we don't get a policy back from the xfrm_policy_by*
function we would always return 0 instead of the intended ENOENT.  Also
fixed some white space damage in the same area.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Venkat Yekkirala <vyekkirala@trustedcs.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-07 16:08:09 -08:00
Gerrit Renker 151a99317e [DCCP]: Revert patch which disables bidirectional mode
This reverts an earlier patch which disabled bidirectional mode, meaning that
a listening (passive) socket was not allowed to write to the other (active)
end of the connection.

This mode had been disabled when there were problems with CCID3, but it
imposes a constraint on socket programming and thus hinders deployment.

A change is included to ignore RX feedback received by the TX CCID3 module.

Many thanks to Andre Noll for pointing out this issue.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-07 16:08:07 -08:00
David S. Miller 286930797d [IPV6]: Handle np->opt being NULL in ipv6_getsockopt_sticky().
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-07 16:08:05 -08:00
Herbert Xu d644329bc9 [UDP]: Reread uh pointer after pskb_trim
The header may have moved when trimming.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-07 16:08:04 -08:00
Patrick McHardy ba5dcee128 [NETFILTER]: nfnetlink_log: fix crash on bridged packet
physoutdev is only set on purely bridged packet, when nfnetlink_log is used
in the OUTPUT/FORWARD/POSTROUTING hooks on packets forwarded from or to a
bridge it crashes when trying to dereference skb->nf_bridge->physoutdev.

Reported by Holger Eitzenberger <heitzenberger@astaro.com>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-07 16:08:03 -08:00
Patrick McHardy 881dbfe8ac [NETFILTER]: nfnetlink_log: zero-terminate prefix
Userspace expects a zero-terminated string, so include the trailing
zero in the netlink message.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-07 16:08:02 -08:00
Patrick McHardy dd63006b8f [NETFILTER]: nf_conntrack_ipv6: fix incorrect classification of IPv6 fragments as ESTABLISHED
The individual fragments of a packet reassembled by conntrack have the
conntrack reference from the reassembled packet attached, but nfctinfo
is not copied. This leaves it initialized to 0, which unfortunately is
the value of IP_CT_ESTABLISHED.

The result is that all IPv6 fragments are tracked as ESTABLISHED,
allowing them to bypass a usual ruleset which accepts ESTABLISHED
packets early.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-07 16:08:01 -08:00
Linus Torvalds 5b3c1184e7 Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
  [DCCP]: Set RTO for newly created child socket
  [DCCP]: Correctly split CCID half connections
  [NET]: Fix compat_sock_common_getsockopt typo.
  [NET]: Revert incorrect accept queue backlog changes.
  [INET]: twcal_jiffie should be unsigned long, not int
  [GIANFAR]: Fix compile error in latest git
  [PPPOE]: Use ifindex instead of device pointer in key lookups.
  [NETFILTER]: ip6_route_me_harder should take into account mark
  [NETFILTER]: nfnetlink_log: fix reference counting
  [NETFILTER]: nfnetlink_log: fix module reference counting
  [NETFILTER]: nfnetlink_log: fix possible NULL pointer dereference
  [NETFILTER]: nfnetlink_log: fix NULL pointer dereference
  [NETFILTER]: nfnetlink_log: fix use after free
  [NETFILTER]: nfnetlink_log: fix reference leak
  [NETFILTER]: tcp conntrack: accept SYN|URG as valid
  [NETFILTER]: nf_conntrack/nf_nat: fix incorrect config ifdefs
  [NETFILTER]: conntrack: fix {nf,ip}_ct_iterate_cleanup endless loops
2007-03-06 19:53:34 -08:00
Linus Torvalds 205c911da3 Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6:
  sis900 warning fixes
  mv643xx_eth: Place explicit port number in mv643xx_eth_platform_data
  pcnet32: Fix PCnet32 performance bug on non-coherent architecutres
  __devinit & __devexit cleanups for de2104x driver
  3c59x: Handle pci_enable_device() failure while resuming
  dmfe: Fix link detection
  dmfe: fix two bugs
  dmfe: trivial/spelling fixes
  revert "drivers/net/tulip/dmfe: support basic carrier detection"
  ucc_geth: returns NETDEV_TX_BUSY when BD ring is full
  ucc_geth: Fix BD processing
  natsemi: netpoll fixes
  bonding: Improve IGMP join processing
  bonding: only receive ARPs for us
  bonding: fix double dev_add_pack
2007-03-06 17:30:59 -08:00
Gerrit Renker 99c72ce091 [DCCP]: Set RTO for newly created child socket
This mirrors a recent change in tcp_open_req_child, whereby the icsk_rto of the
newly created child socket was not set (but rather on the parent socket). Same
fix for DCCP.

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-06 14:24:44 -08:00
Gerrit Renker 4d46861be6 [DCCP]: Correctly split CCID half connections
This fixes a bug caused by a previous patch, which causes DCCP servers in
LISTEN state to not receive packets.

This patch changes the logic so that
 * servers in either LISTEN or OPEN state get the RX half connection packets
 * clients in OPEN state get the TX half connection packets

Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-06 14:24:18 -08:00
Johannes Berg 1e51f9513e [NET]: Fix compat_sock_common_getsockopt typo.
This patch fixes a typo in compat_sock_common_getsockopt.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-06 13:44:06 -08:00
David S. Miller 64a146513f [NET]: Revert incorrect accept queue backlog changes.
This reverts two changes:

8488df894d
248f06726e

A backlog value of N really does mean allow "N + 1" connections
to queue to a listening socket.  This allows one to specify
"0" as the backlog and still get 1 connection.

Noticed by Gerrit Renker and Rick Jones.

Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-06 11:21:05 -08:00
Greg Banks 42a7fc4a65 [PATCH] knfsd: provide sunrpc pool_mode module option
Provide a module param "pool_mode" for sunrpc.ko which allows a sysadmin to
choose the mode for mapping NFS thread service pools to CPUs.  Values are:

auto	    choose a mapping mode heuristically
global	    (default, same as the pre-2.6.19 code) a single global pool
percpu	    one pool per CPU
pernode	    one pool per NUMA node

Note that since 2.6.19 the hardcoded behaviour has been "auto", this patch
makes the default "global".

The pool mode can be changed after boot/modprobe using /sys, if the NFS and
lockd services have been shut down.  A useful side effect of this change is to
fix a small memory leak when unloading the module.

Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-06 09:30:26 -08:00
NeilBrown cda1fd4abd [PATCH] knfsd: fix recently introduced problem with shutting down a busy NFS server
When the last thread of nfsd exits, it shuts down all related sockets.  It
currently uses svc_close_socket to do this, but that only is immediately
effective if the socket is not SK_BUSY.

If the socket is busy - i.e.  if a request has arrived that has not yet been
processes - svc_close_socket is not effective and the shutdown process spins.

So create a new svc_force_close_socket which removes the SK_BUSY flag is set
and then calls svc_close_socket.

Also change some open-codes loops in svc_destroy to use
list_for_each_entry_safe.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-06 09:30:26 -08:00
NeilBrown 5a05ed73e1 [PATCH] knfsd: remove CONFIG_IPV6 ifdefs from sunrpc server code
They don't really save that much, and aren't worth the hassle.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-06 09:30:26 -08:00
NeilBrown 7a37f5787e [PATCH] knfsd: use recv_msg to get peer address for NFSD instead of code-copying
The sunrpc server code needs to know the source and destination address for
UDP packets so it can reply properly.  It currently copies code out of the
network stack to pick the pieces out of the skb.  This is ugly and causes
compile problems with the IPv6 stuff.

So, rip that out and use recv_msg instead.  This is a much cleaner interface,
but has a slight cost in that the checksum is now checked before the copy, so
we don't benefit from doing both at the same time.  This can probably be
fixed.

Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-03-06 09:30:26 -08:00
Jay Vosburgh a816c7c712 bonding: Improve IGMP join processing
In active-backup mode, the current bonding code duplicates IGMP
traffic to all slaves, so that switches are up to date in case of a
failover from an active to a backup interface.  If bonding then fails
back to the original active interface, it is likely that the "active
slave" switch's IGMP forwarding for the port will be out of date until
some event occurs to refresh the switch (e.g., a membership query).

	This patch alters the behavior of bonding to no longer flood
IGMP to all ports, and to issue IGMP JOINs to the newly active port at
the time of a failover.  This insures that switches are kept up to date
for all cases.

	"GOELLESCH Niels" <niels.goellesch@eurocontrol.int> originally
reported this problem, and included a patch.  His original patch was
modified by Jay Vosburgh to additionally remove the existing IGMP flood
behavior, use RCU, streamline code paths, fix trailing white space, and
adjust for style.

Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
2007-03-06 06:08:11 -05:00
Yasuyuki Kozakai bc5f774347 [NETFILTER]: ip6_route_me_harder should take into account mark
Signed-off-by: Yasuyuki Kozakai <yasuyuki.kozakai@toshiba.co.jp>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-05 13:25:27 -08:00
Michal Miroslaw b4d6202b36 [NETFILTER]: nfnetlink_log: fix reference counting
Fix reference counting (memory leak) problem in __nfulnl_send() and callers
related to packet queueing.

Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-05 13:25:26 -08:00
Patrick McHardy 7d90e86d31 [NETFILTER]: nfnetlink_log: fix module reference counting
Count module references correctly: after instance_destroy() there
might be timer pending and holding a reference for this netlink instance.

Based on patch by Michal Miroslaw <mirq-linux@rere.qmqm.pl>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-05 13:25:25 -08:00
Michal Miroslaw dd16704eba [NETFILTER]: nfnetlink_log: fix possible NULL pointer dereference
Eliminate possible NULL pointer dereference in nfulnl_recv_config().

Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-05 13:25:24 -08:00
Michal Miroslaw a497097d35 [NETFILTER]: nfnetlink_log: fix NULL pointer dereference
Fix the nasty NULL dereference on multiple packets per netlink message.

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000004
 printing eip:
f8a4b3bf
*pde = 00000000
Oops: 0002 [#1]
SMP
Modules linked in: nfnetlink_log ipt_ttl ipt_REDIRECT xt_tcpudp iptable_nat nf_nat nf_conntrack_ipv4 xt_state ipt_ipp2p xt_NFLOG xt_hashlimit ip6_tables iptable_filter xt_multiport xt_mark ipt_set iptable_raw xt_MARK iptable_mangle ip_tables cls_fw cls_u32 sch_esfq sch_htb ip_set_ipmap ip_set ipt_ULOG x_tables dm_snapshot dm_mirror loop e1000 parport_pc parport e100 floppy ide_cd cdrom
CPU:    0
EIP:    0060:[<f8a4b3bf>]    Not tainted VLI
EFLAGS: 00010206   (2.6.20 #5)
EIP is at __nfulnl_send+0x24/0x51 [nfnetlink_log]
eax: 00000000   ebx: f2b5cbc0   ecx: c03f5f54   edx: c03f4000
esi: f2b5cbc8   edi: c03f5f54   ebp: f8a4b3ec   esp: c03f5f30
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, ti=c03f4000 task=c03bece0 task.ti=c03f4000)
Stack: f2b5cbc0 f8a4b401 00000100 c0444080 c012af49 00000000 f6f19100 f6f19000
       c1707800 c03f5f54 c03f5f54 00000123 00000021 c03e8d08 c0426380 00000009
       c0126932 00000000 00000046 c03e9980 c03e6000 0047b007 c01269bd 00000000
Call Trace:
 [<f8a4b401>] nfulnl_timer+0x15/0x25 [nfnetlink_log]
 [<c012af49>] run_timer_softirq+0x10a/0x164
 [<c0126932>] __do_softirq+0x60/0xba
 [<c01269bd>] do_softirq+0x31/0x35
 [<c0104f6e>] do_IRQ+0x62/0x74
 [<c01036cb>] common_interrupt+0x23/0x28
 [<c0101018>] default_idle+0x0/0x3f
 [<c0101045>] default_idle+0x2d/0x3f
 [<c01010fa>] cpu_idle+0xa0/0xb9
 [<c03fb7f5>] start_kernel+0x1a8/0x1ac
 [<c03fb293>] unknown_bootoption+0x0/0x181
 =======================
Code: 5e 5f 5b 5e 5f 5d c3 53 89 c3 8d 40 1c 83 7b 1c 00 74 05 e8 2c ee 6d c7 83 7b 14 00 75 04 31 c0 eb 34 83 7b 10 01 76 09 8b 43 18 <66> c7 40 04 03 00 8b 53 34 8b 43 14 b9 40 00 00 00 e8 08 9a 84
EIP: [<f8a4b3bf>] __nfulnl_send+0x24/0x51 [nfnetlink_log] SS:ESP 0068:c03f5f30
 <0>Kernel panic - not syncing: Fatal exception in interrupt
 <0>Rebooting in 5 seconds..

Panic no more!

Signed-off-by: Micha Mirosaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-05 13:25:23 -08:00
Michal Miroslaw 05f7b7b369 [NETFILTER]: nfnetlink_log: fix use after free
Paranoia: instance_put() might have freed the inst pointer when we
spin_unlock_bh().

Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-05 13:25:22 -08:00
Michal Miroslaw ed32abeaf3 [NETFILTER]: nfnetlink_log: fix reference leak
Stop reference leaking in nfulnl_log_packet(). If we start a timer we
are already taking another reference.

Signed-off-by: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-05 13:25:21 -08:00
Patrick McHardy d3ab4298aa [NETFILTER]: tcp conntrack: accept SYN|URG as valid
Some stacks apparently send packets with SYN|URG set. Linux accepts
these packets, so TCP conntrack should to.

Pointed out by Martijn Posthuma <posthuma@sangine.com>.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-05 13:25:20 -08:00
Patrick McHardy e281db5cdf [NETFILTER]: nf_conntrack/nf_nat: fix incorrect config ifdefs
The nf_conntrack_netlink config option is named CONFIG_NF_CT_NETLINK,
but multiple files use CONFIG_IP_NF_CONNTRACK_NETLINK or
CONFIG_NF_CONNTRACK_NETLINK for ifdefs.

Fix this and reformat all CONFIG_NF_CT_NETLINK ifdefs to only use a line.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-05 13:25:19 -08:00
Patrick McHardy ec68e97ded [NETFILTER]: conntrack: fix {nf,ip}_ct_iterate_cleanup endless loops
Fix {nf,ip}_ct_iterate_cleanup unconfirmed list handling:

- unconfirmed entries can not be killed manually, they are removed on
  confirmation or final destruction of the conntrack entry, which means
  we might iterate forever without making forward progress.

  This can happen in combination with the conntrack event cache, which
  holds a reference to the conntrack entry, which is only released when
  the packet makes it all the way through the stack or a different
  packet is handled.

- taking references to an unconfirmed entry and using it outside the
  locked section doesn't work, the list entries are not refcounted and
  another CPU might already be waiting to destroy the entry

What the code really wants to do is make sure the references of the hash
table to the selected conntrack entries are released, so they will be
destroyed once all references from skbs and the event cache are dropped.

Since unconfirmed entries haven't even entered the hash yet, simply mark
them as dying and skip confirmation based on that.

Reported and tested by Chuck Ebbert <cebbert@redhat.com>

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-05 13:25:18 -08:00
Dan Aloni 5c15bdec5c [VLAN]: Avoid a 4-order allocation.
This patch splits the vlan_group struct into a multi-allocated struct. On
x86_64, the size of the original struct is a little more than 32KB, causing
a 4-order allocation, which is prune to problems caused by buddy-system
external fragmentation conditions.

I couldn't just use vmalloc() because vfree() cannot be called in the
softirq context of the RCU callback.

Signed-off-by: Dan Aloni <da-x@monatomic.org>
Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-03-02 20:44:51 -08:00