linux-2.6

dect

Archived

Author	SHA1	Message	Date
Jiri Pirko	0b5c9db1b1	vlan: Fix the ingress VLAN_FLAG_REORDER_HDR check Testing of VLAN_FLAG_REORDER_HDR does not belong in vlan_untag but rather in vlan_do_receive. Otherwise the vlan header will not be properly put on the packet in the case of vlan header accelleration. As we remove the check from vlan_check_reorder_header rename it vlan_reorder_header to keep the naming clean. Fix up the skb->pkt_type early so we don't look at the packet after adding the vlan tag, which guarantees we don't goof and look at the wrong field. Use a simple if statement instead of a complicated switch statement to decided that we need to increment rx_stats for a multicast packet. Hopefully at somepoint we will just declare the case where VLAN_FLAG_REORDER_HDR is cleared as unsupported and remove the code. Until then this keeps it working correctly. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Jiri Pirko <jpirko@redhat.com> Acked-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-06-11 16:15:50 -07:00
Linus Torvalds	53ee7569ce	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits) bnx2x: allow device properly initialize after hotplug bnx2x: fix DMAE timeout according to hw specifications bnx2x: properly handle CFC DEL in cnic flow bnx2x: call dev_kfree_skb_any instead of dev_kfree_skb net: filter: move forward declarations to avoid compile warnings pktgen: refactor pg_init() code pktgen: use vzalloc_node() instead of vmalloc_node() + memset() net: skb_trim explicitely check the linearity instead of data_len ipv4: Give backtrace in ip_rt_bug(). net: avoid synchronize_rcu() in dev_deactivate_many net: remove synchronize_net() from netdev_set_master() rtnetlink: ignore NETDEV_RELEASE and NETDEV_JOIN event net: rename NETDEV_BONDING_DESLAVE to NETDEV_RELEASE bridge: call NETDEV_JOIN notifiers when add a slave netpoll: disable netpoll when enslave a device macvlan: Forward unicast frames in bridge mode to lowerdev net: Remove linux/prefetch.h include from linux/skbuff.h ipv4: Include linux/prefetch.h in fib_trie.c netlabel: Remove prefetches from list handlers. drivers/net: add prefetch header for prefetch users ... Fixed up prefetch parts: removed a few duplicate prefetch.h includes, fixed the location of the igb prefetch.h, took my version of the skbuff.h code without the extra parentheses etc.	2011-05-23 08:39:24 -07:00
Linus Torvalds	a1e4891fd4	Remove prefetch() from <linux/skbuff.h> and "netlabel_addrlist.h" Commit `e66eed651f` ("list: remove prefetching from regular list iterators") removed the include of prefetch.h from list.h. The skbuff list traversal still had them. Quoth David Miller: "Please just remove the prefetches. Those are modelled after list.h as I intend to eventually convert SKB list handling to "struct list_head" but we're not there yet. Therefore if we kill prefetches from list.h we should kill it from these things in skbuff.h too." Requested-by: David Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-22 21:43:41 -07:00
Emmanuel Grumbach	c4264f27e8	net: skb_trim explicitely check the linearity instead of data_len The purpose of the check on data_len is to check linearity, so use the inline helper for this. No overhead and more explicit. Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-22 21:01:21 -04:00
David S. Miller	67f11f4ded	net: Remove linux/prefetch.h include from linux/skbuff.h No longer needed. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-22 20:54:11 -04:00
David S. Miller	0fcbe742ea	net: Remove prefetches from SKB list handlers. Noticed by Linus. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-05-22 20:35:29 -04:00
Heiko Carstens	34ea646c9f	net: add missing prefetch.h include Fixes build errors on s390 and probably other archs as well: In file included from net/ipv4/ip_forward.c:32:0: include/net/udp.h: In function 'udp_csum_outgoing': include/net/udp.h:141:2: error: implicit declaration of function 'prefetch' Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-05-22 11:26:02 -07:00
Eric Dumazet	0a14842f5a	net: filter: Just In Time compiler for x86-64 In order to speedup packet filtering, here is an implementation of a JIT compiler for x86_64 It is disabled by default, and must be enabled by the admin. echo 1 >/proc/sys/net/core/bpf_jit_enable It uses module_alloc() and module_free() to get memory in the 2GB text kernel range since we call helpers functions from the generated code. EAX : BPF A accumulator EBX : BPF X accumulator RDI : pointer to skb (first argument given to JIT function) RBP : frame pointer (even if CONFIG_FRAME_POINTER=n) r9d : skb->len - skb->data_len (headlen) r8 : skb->data To get a trace of generated code, use : echo 2 >/proc/sys/net/core/bpf_jit_enable Example of generated code : # tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24 flen=18 proglen=147 pass=3 image=ffffffffa00b5000 JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60 JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00 JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00 JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0 JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00 JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00 JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24 JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31 JIT code: ffffffffa00b5090: c0 c9 c3 BPF program is 144 bytes long, so native program is almost same size ;) (000) ldh [12] (001) jeq #0x800 jt 2 jf 8 (002) ld [26] (003) and #0xffffff00 (004) jeq #0xc0a81400 jt 16 jf 5 (005) ld [30] (006) and #0xffffff00 (007) jeq #0xc0a81400 jt 16 jf 17 (008) jeq #0x806 jt 10 jf 9 (009) jeq #0x8035 jt 10 jf 17 (010) ld [28] (011) and #0xffffff00 (012) jeq #0xc0a81400 jt 16 jf 13 (013) ld [38] (014) and #0xffffff00 (015) jeq #0xc0a81400 jt 16 jf 17 (016) ret #65535 (017) ret #0 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-04-27 23:05:08 -07:00
Linus Torvalds	42933bac11	Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 * 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6: Fix common misspellings	2011-04-07 11:14:49 -07:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
David S. Miller	eec009548e	net: Fix warnings caused by MAX_SKB_FRAGS change. After commit `a715dea3c8` ("net: Always allocate at least 16 skb frags regardless of page size"), the value of MAX_SKB_FRAGS can now take on either an "unsigned long" or an "int" value. This causes warnings like: net/packet/af_packet.c: In function ‘tpacket_fill_skb’: net/packet/af_packet.c:948: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 2 has type ‘int’ Fix by forcing the constant to be unsigned long, otherwise we have a situation where the type of a system wide constant is variable. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-29 23:34:08 -07:00
Anton Blanchard	a715dea3c8	net: Always allocate at least 16 skb frags regardless of page size When analysing performance of the cxgb3 on a ppc64 box I noticed that we weren't doing much GRO merging. It turns out we are limited by the number of SKB frags: #define MAX_SKB_FRAGS (65536/PAGE_SIZE + 2) With a 4kB page size we have 18 frags, but with a 64kB page size we only have 3 frags. I ran a single stream TCP bandwidth test to compare the performance of Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-28 22:26:32 -07:00
Jiri Pirko	8a4eb5734e	net: introduce rx_handler results and logic around that This patch allows rx_handlers to better signalize what to do next to it's caller. That makes skb->deliver_no_wcard no longer needed. kernel-doc for rx_handler_result is taken from Nicolas' patch. Signed-off-by: Jiri Pirko <jpirko@redhat.com> Reviewed-by: Nicolas de Pesloüan <nicolas.2p.debian@free.fr> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-03-16 12:53:54 -07:00
Michał Mirosław	04ed3e741d	net: change netdev->features to u32 Quoting Ben Hutchings: we presumably won't be defining features that can only be enabled on 64-bit architectures. Occurences found by `grep -r` on net/, drivers/net, include/ [ Move features and vlan_features next to each other in struct netdev, as per Eric Dumazet's suggestion -DaveM ] Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-24 15:32:47 -08:00
David S. Miller	686a295553	net: Add safe reverse SKB queue walkers. Signed-off-by: David S. Miller <davem@davemloft.net>	2011-01-20 22:47:32 -08:00
KOVACS Krisztian	2fc72c7b84	netfilter: fix compilation when conntrack is disabled but tproxy is enabled The IPv6 tproxy patches split IPv6 defragmentation off of conntrack, but failed to update the #ifdef stanzas guarding the defragmentation related fields and code in skbuff and conntrack related code in nf_defrag_ipv6.c. This patch adds the required #ifdefs so that IPv6 tproxy can truly be used without connection tracking. Original report: http://marc.info/?l=linux-netdev&m=129010118516341&w=2 Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: KOVACS Krisztian <hidden@balabit.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>	2011-01-12 20:25:08 +01:00
Michał Mirosław	04fb451eff	net: Introduce skb_checksum_start_offset() Introduce skb_checksum_start_offset() to replace repetitive calculation. Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-16 14:43:14 -08:00
Vladislav Zolotarov	a3d22a68d7	bnx2x: Take the distribution range definition out of skb_tx_hash() Move the calcualation of the Tx hash for a given hash range into a separate function and define the skb_tx_hash(), which calculates a Tx hash for a [0; dev->real_num_tx_queues - 1] hash values range, using this function (__skb_tx_hash()). Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-12-16 13:15:53 -08:00
Tom Herbert	3853b5841c	xps: Improvements in TX queue selection In dev_pick_tx, don't do work in calculating queue index or setting the index in the sock unless the device has more than one queue. This allows the sock to be set only with a queue index of a multi-queue device which is desirable if device are stacked like in a tunnel. We also allow the mapping of a socket to queue to be changed. To maintain in order packet transmission a flag (ooo_okay) has been added to the sk_buff structure. If a transport layer sets this flag on a packet, the transmit queue can be changed for the socket. Presumably, the transport would set this if there was no possbility of creating OOO packets (for instance, there are no packets in flight for the socket). This patch includes the modification in TCP output for setting this flag. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-11-24 11:44:19 -08:00
Eric Dumazet	27b75c95f1	net: avoid RCU for NOCACHE dst There is no point using RCU for dst we allocate for a very short time (used once). Change dst_release() to take DST_NOCACHE into account, but also change skb_dst_set_noref() to force a refcount increment for such dst. This is a _huge_ gain, because we dont waste memory to store xx thousand of dsts. Instead of queueing them to RCU, we can free them instantly. CPU caches can stay hot, re-using same memory blocks to hold temporary dsts. Note : remove unneeded smp_mb__before_atomic_dec(); in dst_release(), since atomic_dec_return() implies a full memory barrier. Stress test, 160.000.000 udp frames sent, IP route cache disabled (DDOS). Before: real 0m38.091s user 0m13.189s sys 7m53.018s After: real 0m29.946s user 0m12.157s sys 7m40.605s For reference, if IP route cache was enabled : real 0m32.030s user 0m10.521s sys 8m15.243s Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-20 03:02:23 -07:00
Eric Dumazet	564824b0c5	net: allocate skbs on local node commit `b30973f877` (node-aware skb allocation) spread a wrong habit of allocating net drivers skbs on a given memory node : The one closest to the NIC hardware. This is wrong because as soon as we try to scale network stack, we need to use many cpus to handle traffic and hit slub/slab management on cross-node allocations/frees when these cpus have to alloc/free skbs bound to a central node. skb allocated in RX path are ephemeral, they have a very short lifetime : Extra cost to maintain NUMA affinity is too expensive. What appeared as a nice idea four years ago is in fact a bad one. In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load, and two 10Gb NIC might deliver more than 28 million packets per second, needing all the available cpus. Cost of cross-node handling in network and vm stacks outperforms the small benefit hardware had when doing its DMA transfert in its 'local' memory node at RX time. Even trying to differentiate the two allocations done for one skb (the sk_buff on local node, the data part on NIC hardware node) is not enough to bring good performance. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-10-16 11:13:19 -07:00
Eric Dumazet	cb4dfe562c	net: skb_frag_t can be smaller on small arches On 32bit arches, if PAGE_SIZE is smaller than 65536, we can use 16bit offset and size fields. This patch saves 72 bytes per skb on i386, or 128 bytes after rounding. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-26 18:31:13 -07:00
Eric Dumazet	a02cec2155	net: return operator cleanup Change "return (EXPR);" to "return EXPR;" return is not a function, parentheses are not required. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-23 14:33:39 -07:00
Eric Dumazet	bc8acf2c8c	drivers/net: avoid some skb->ip_summed initializations fresh skbs have ip_summed set to CHECKSUM_NONE (0) We can avoid setting again skb->ip_summed to CHECKSUM_NONE in drivers. Introduce skb_checksum_none_assert() helper so that we keep this assertion documented in driver sources. Change most occurrences of : skb->ip_summed = CHECKSUM_NONE; by : skb_checksum_none_assert(skb); Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-09-02 19:06:22 -07:00
David S. Miller	21dc330157	net: Rename skb_has_frags to skb_has_frag_list SKBs can be "fragmented" in two ways, via a page array (called skb_shinfo(skb)->frags[]) and via a list of SKBs (called skb_shinfo(skb)->frag_list). Since skb_has_frags() tests the latter, it's name is confusing since it sounds more like it's testing the former. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-23 00:13:46 -07:00
Oliver Hartkopp	2244d07bfa	net: simplify flags for tx timestamping This patch removes the abstraction introduced by the union skb_shared_tx in the shared skb data. The access of the different union elements at several places led to some confusion about accessing the shared tx_flags e.g. in skb_orphan_try(). http://marc.info/?l=linux-netdev&m=128084897415886&w=2 Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-19 00:08:30 -07:00
Krishna Kumar	bfb564e739	core: Factor out flow calculation from get_rps_cpu Factor out flow calculation code from get_rps_cpu, since other functions can use the same code. Revisions: v2 (Ben): Separate flow calcuation out and use in select queue. v3 (Arnd): Don't re-implement MIN. v4 (Changli): skb->data points to ethernet header in macvtap, and make a fast path. Tested macvtap with this patch. v5 (Changli): - Cache skb->rxhash in skb_get_rxhash - macvtap may not have pow(2) queues, so change code for queue selection. (Arnd): - Use first available queue if all fails. Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-16 21:06:24 -07:00
Changli Gao	f9599ce111	sk_buff: introduce pskb_network_may_pull() Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-04 21:53:14 -07:00
Oliver Hartkopp	cff0d6e6ed	can-raw: Fix skb_orphan_try handling Commit `fc6055a5ba` (net: Introduce skb_orphan_try()) allows an early orphan of the skb and takes care on tx timestamping, which needs the sk-reference in the skb on driver level. So does the can-raw socket, which has not been taken into account here. The patch below adds a 'prevent_sk_orphan' bit in the skb tx shared info, which fixes the problem discovered by Matthias Fuchs here: http://marc.info/?t=128030411900003&r=1&w=2 Even if it's not a primary tx timestamp topic it fits well into some skb shared tx context. Or should be find a different place for the information to protect the sk reference until it reaches the driver level? Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-08-03 00:31:48 -07:00
Eric Dumazet	fed66381d6	net: pskb_expand_head() optimization Move frags[] at the end of struct skb_shared_info, and make pskb_expand_head() copy only the used part of it instead of whole array. This should avoid kmemcheck warnings and speedup pskb_expand_head() as well, avoiding a lot of cache misses. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-24 21:05:57 -07:00
Richard Cochran	c1f19b51d1	net: support time stamping in phy devices. This patch adds a new networking option to allow hardware time stamps from PHY devices. When enabled, likely candidates among incoming and outgoing network packets are offered to the PHY driver for possible time stamping. When accepted by the PHY driver, incoming packets are deferred for later delivery by the driver. The patch also adds phylib driver methods for the SIOCSHWTSTAMP ioctl and callbacks for transmit and receive time stamping. Drivers may optionally implement these functions. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-18 19:15:26 -07:00
Richard Cochran	4507a71507	net: add driver hook for tx time stamping. This patch adds a hook for transmit time stamps. The transmit hook allows a software fallback for transmit time stamps, for MACs lacking time stamping hardware. Using the hook will still require adding an inline function call to each MAC driver. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-07-18 19:15:25 -07:00
Eric Dumazet	5933dd2f02	net: NET_SKB_PAD should depend on L1_CACHE_BYTES In old kernels, NET_SKB_PAD was defined to 16. Then commit `d6301d3dd1` (net: Increase default NET_SKB_PAD to 32), and commit `18e8c134f4` (net: Increase NET_SKB_PAD to 64 bytes) increased it to 64. While first patch was governed by network stack needs, second was more driven by performance issues on current hardware. Real intent was to align data on a cache line boundary. So use max(32, L1_CACHE_BYTES) instead of 64, to be more generic. Remove microblaze and powerpc own NET_SKB_PAD definitions. Thanks to Alexander Duyck and David Miller for their comments. Suggested-by: David Miller <davem@davemloft.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-06-15 18:16:43 -07:00
David S. Miller	62522d36d7	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6	2010-06-11 13:32:31 -07:00
John Fastabend	597a264b1a	net: deliver skbs on inactive slaves to exact matches Currently, the accelerated receive path for VLAN's will drop packets if the real device is an inactive slave and is not one of the special pkts tested for in skb_bond_should_drop(). This behavior is different then the non-accelerated path and for pkts over a bonded vlan. For example, vlanx -> bond0 -> ethx will be dropped in the vlan path and not delivered to any packet handlers at all. However, bond0 -> vlanx -> ethx and bond0 -> ethx will be delivered to handlers that match the exact dev, because the VLAN path checks the real_dev which is not a slave and netif_recv_skb() doesn't drop frames but only delivers them to exact matches. This patch adds a sk_buff flag which is used for tagging skbs that would previously been dropped and allows the skb to continue to skb_netif_recv(). Here we add logic to check for the deliver_no_wcard flag and if it is set only deliver to handlers that match exactly. This makes both paths above consistent and gives pkt handlers a way to identify skbs that come from inactive slaves. Without this patch in some configurations skbs will be delivered to handlers with exact matches and in others be dropped out right in the vlan path. I have tested the following 4 configurations in failover modes and load balancing modes. # bond0 -> ethx # vlanx -> bond0 -> ethx # bond0 -> vlanx -> ethx # bond0 -> ethx \| vlanx -> -- Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-06-10 22:23:34 -07:00
Alexander Duyck	b78462ebc6	skbuff: add check for non-linear to warn_if_lro and needs_linearize We can avoid an unecessary cache miss by checking if the skb is non-linear before accessing gso_size/gso_type in skb_warn_if_lro, the same can also be done to avoid a cache miss on nr_frags if data_len is 0. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-06-05 02:23:16 -07:00
Changli Gao	5b0daa3474	skb: make skb_recycle_check() return a bool value Signed-off-by: Changli Gao <xiaosuo@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-29 00:12:13 -07:00
Eric Dumazet	7fee226ad2	net: add a noref bit on skb dst Use low order bit of skb->_skb_dst to tell dst is not refcounted. Change _skb_dst to _skb_refdst to make sure all uses are catched. skb_dst() returns the dst, regardless of noref bit set or not, but with a lockdep check to make sure a noref dst is not given if current user is not rcu protected. New skb_dst_set_noref() helper to set an notrefcounted dst on a skb. (with lockdep check) skb_dst_drop() drops a reference only if skb dst was refcounted. skb_dst_force() helper is used to force a refcount on dst, when skb is queued and not anymore RCU protected. Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in sock_queue_rcv_skb(), in __nf_queue(). Use skb_dst_force() in dev_requeue_skb(). Note: dst_use_noref() still dirties dst, we might transform it later to do one dirtying per jiffies. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-17 17:18:50 -07:00
Eric Dumazet	18e8c134f4	net: Increase NET_SKB_PAD to 64 bytes eth_type_trans() & get_rps_cpus() currently need two 64bytes cache lines in packet to compute rxhash. Increasing NET_SKB_PAD from 32 to 64 reduces the need to one cache line only, and makes RPS faster. NET_IP_ALIGN(2) + ethernet_header(14) + IP_header(20/40) + ports(8) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-06 21:58:51 -07:00
Eric Dumazet	ec7d2f2cf3	net: __alloc_skb() speedup With following patch I can reach maximum rate of my pktgen+udpsink simulator : - 'old' machine : dual quad core E5450 @3.00GHz - 64 UDP rx flows (only differ by destination port) - RPS enabled, NIC interrupts serviced on cpu0 - rps dispatched on 7 other cores. (~130.000 IPI per second) - SLAB allocator (faster than SLUB in this workload) - tg3 NIC - 1.080.000 pps without a single drop at NIC level. Idea is to add two prefetchw() calls in __alloc_skb(), one to prefetch first sk_buff cache line, the second to prefetch the shinfo part. Also using one memset() to initialize all skb_shared_info fields instead of one by one to reduce number of instructions, using long word moves. All skb_shared_info fields before 'dataref' are cleared in __alloc_skb(). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-05 01:07:37 -07:00
David S. Miller	47d29646a2	net: Inline skb_pull() in eth_type_trans(). In commit `6be8ac2f` ("[NET]: uninline skb_pull, de-bloats a lot") we uninlined skb_pull. But in some critical paths it makes sense to inline this thing and it helps performance significantly. Create an skb_pull_inline() so that we can do this in a way that serves also as annotation. Based upon a patch by Eric Dumazet. Signed-off-by: David S. Miller <davem@davemloft.net>	2010-05-02 02:21:44 -07:00
Rami Rosen	ccb7c7732e	net: Remove two unnecessary exports (skbuff). There is no need to export skb_under_panic() and skb_over_panic() in skbuff.c, since these methods are used only in skbuff.c ; this patch removes these two exports. It also marks these functions as 'static' and removeS the extern declarations of them from include/linux/skbuff.h Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-04-20 22:39:53 -07:00
Alexander Duyck	cd58950a53	skbuff: remove unused dev_consume_skb macro definition dev_consume_skb and kfree_skb_clean have no users and in the case of kfree_skb_clean could cause potential build issues since I cannot find where it is defined. Based on the patch in which it was introduced it appears to have been a bit of leftover code from an earlier version of the patch in which kfree_skb_clean was dropped in favor of consume_skb. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-04-13 02:58:24 -07:00
David S. Miller	4a35ecf8bf	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/bonding/bond_main.c drivers/net/via-velocity.c drivers/net/wireless/iwlwifi/iwl-agn.c	2010-04-06 23:53:30 -07:00
Alexander Duyck	03e6d819c2	skbuff: remove unused dma_head & dma_maps fields The dma map fields in the skb_shared_info structure no longer has any users and can be dropped since it is making the skb_shared_info unecessarily larger. Running slabtop show that we were using 4K slabs for the skb->head on x86_64 w/ an allocation size of 1522. It turns out that the dma_head and dma_maps array made skb_shared large enough that we had crossed over the 2k boundary with standard frames and as such we were using 4k blocks of memory for all skbs. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-03-24 11:13:35 -07:00
Tom Herbert	0a9627f264	rps: Receive Packet Steering This patch implements software receive side packet steering (RPS). RPS distributes the load of received packet processing across multiple CPUs. Problem statement: Protocol processing done in the NAPI context for received packets is serialized per device queue and becomes a bottleneck under high packet load. This substantially limits pps that can be achieved on a single queue NIC and provides no scaling with multiple cores. This solution queues packets early on in the receive path on the backlog queues of other CPUs. This allows protocol processing (e.g. IP and TCP) to be performed on packets in parallel. For each device (or each receive queue in a multi-queue device) a mask of CPUs is set to indicate the CPUs that can process packets. A CPU is selected on a per packet basis by hashing contents of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index into the CPU mask. The IPI mechanism is used to raise networking receive softirqs between CPUs. This effectively emulates in software what a multi-queue NIC can provide, but is generic requiring no device support. Many devices now provide a hash over the 4-tuple on a per packet basis (e.g. the Toeplitz hash). This patch allow drivers to set the HW reported hash in an skb field, and that value in turn is used to index into the RPS maps. Using the HW generated hash can avoid cache misses on the packet when steering it to a remote CPU. The CPU mask is set on a per device and per queue basis in the sysfs variable /sys/class/net/<device>/queues/rx-<n>/rps_cpus. This is a set of canonical bit maps for receive queues in the device (numbered by <n>). If a device does not support multi-queue, a single variable is used for the device (rx-0). Generally, we have found this technique increases pps capabilities of a single queue device with good CPU utilization. Optimal settings for the CPU mask seem to depend on architectures and cache hierarcy. Below are some results running 500 instances of netperf TCP_RR test with 1 byte req. and resp. Results show cumulative transaction rate and system CPU utilization. e1000e on 8 core Intel Without RPS: 108K tps at 33% CPU With RPS: 311K tps at 64% CPU forcedeth on 16 core AMD Without RPS: 156K tps at 15% CPU With RPS: 404K tps at 49% CPU bnx2x on 16 core AMD Without RPS 567K tps at 61% CPU (4 HW RX queues) Without RPS 738K tps at 96% CPU (8 HW RX queues) With RPS: 854K tps at 76% CPU (4 HW RX queues) Caveats: - The benefits of this patch are dependent on architecture and cache hierarchy. Tuning the masks to get best performance is probably necessary. - This patch adds overhead in the path for processing a single packet. In a lightly loaded server this overhead may eliminate the advantages of increased parallelism, and possibly cause some relative performance degradation. We have found that masks that are cache aware (share same caches with the interrupting CPU) mitigate much of this. - The RPS masks can be changed dynamically, however whenever the mask is changed this introduces the possibility of generating out of order packets. It's probably best not change the masks too frequently. Signed-off-by: Tom Herbert <therbert@google.com> include/linux/netdevice.h \| 32 ++++- include/linux/skbuff.h \| 3 + net/core/dev.c \| 335 +++++++++++++++++++++++++++++++++++++-------- net/core/net-sysfs.c \| 225 ++++++++++++++++++++++++++++++- net/core/skbuff.c \| 2 + 5 files changed, 538 insertions(+), 59 deletions(-) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-03-16 21:23:18 -07:00
Eric Dumazet	4ab408dea0	net: fix protocol sk_buff field Commit `e992cd9b72` (kmemcheck: make bitfield annotations truly no-ops when disabled) allows us to revert a workaround we did in the past to not add holes in sk_buff structure. This patch partially reverts commit `14d18a81b5` (net: fix kmemcheck annotations) so that sparse doesnt complain: include/linux/skbuff.h:357:41: error: invalid bitfield specifier for type restricted __be16. Reported-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-03-02 03:05:05 -08:00
Felix Fietkau	da3f5cf1f8	skbuff: align sk_buff::cb to 64 bit and close some potential holes The alignment requirement for 64-bit load/store instructions on ARM is implementation defined. Some CPUs (such as Marvell Feroceon) do not generate an exception, if such an instruction is executed with an address that is not 64 bit aligned. In such a case, the Feroceon corrupts adjacent memory, which showed up in my tests as a crash in the rx path of ath9k that only occured with CONFIG_XFRM set. This crash happened, because the first field of the mac80211 rx status info in the cb is an u64, and changing it corrupted the skb->sp field. This patch also closes some potential pre-existing holes in the sk_buff struct surrounding the cb[] area. Signed-off-by: Felix Fietkau <nbd@openwrt.org> Cc: stable@kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-27 03:16:59 -08:00
Ben Hutchings	1a5778aa00	net: Fix first line of kernel-doc for a few functions The function name must be followed by a space, hypen, space, and a short description. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2010-02-14 22:35:47 -08:00
Alexander Duyck	c81c2d9544	skbuff: remove skb_dma_map/unmap The two functions skb_dma_map/unmap are unsafe to use as they cause problems when packets are cloned and sent to multiple devices while a HW IOMMU is enabled. Due to this it is best to remove the code so it is not used by any other network driver maintainters. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2009-12-02 19:57:15 -08:00

1 2 3 4 5 ...

274 Commits