Handle non-linear skbs by linearizing them instead of silently failing.
Long term the helper should be fixed to either work with non-linear skbs
directly by using the string search API or work on a copy of the data.
Based on patch by Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch removes from net/ netfilter files
all the unnecessary return; statements that precede the
last closing brace of void functions.
It does not remove the returns that are immediately
preceded by a label as gcc doesn't like that.
Done via:
$ grep -rP --include=*.[ch] -l "return;\n}" net/ | \
xargs perl -i -e 'local $/ ; while (<>) { s/\n[ \t\n]+return;\n}/\n}/g; print; }'
Signed-off-by: Joe Perches <joe@perches.com>
[Patrick: changed to keep return statements in otherwise empty function bodies]
Signed-off-by: Patrick McHardy <kaber@trash.net>
Make sure all printk messages have a severity level.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Aviod these link-time errors when IPV6=m, XT_TEE=y:
net/built-in.o: In function `tee_tg_route6':
xt_TEE.c:(.text+0x45ca5): undefined reference to `ip6_route_output'
net/built-in.o: In function `tee_tg6':
xt_TEE.c:(.text+0x45d79): undefined reference to `ip6_local_out'
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since xt_action_param is writable, let's use it. The pointer to
'bool hotdrop' always worried (8 bytes (64-bit) to write 1 byte!).
Surprisingly results in a reduction in size:
text data bss filename
5457066 692730 357892 vmlinux.o-prev
5456554 692730 357892 vmlinux.o
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
In future, layer-3 matches will be an xt module of their own, and
need to set the fragoff and thoff fields. Adding more pointers would
needlessy increase memory requirements (esp. so for 64-bit, where
pointers are wider).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Restore the rcu_dereference() calls in conntrack/expectation notifier
and logger registration/unregistration, but use the _protected variant,
which will be required by the upcoming __rcu annotations.
Based on patch by Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
I suspect an unfortunatly series of events occuring under a DDoS
attack, in function __nf_conntrack_find() nf_contrack_core.c.
Adding a stats counter to see if the search is restarted too often.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The jumpstack allocation needs to be moved out of the critical region.
Corrects this notice:
BUG: sleeping function called from invalid context at mm/slub.c:1705
[ 428.295762] in_atomic(): 1, irqs_disabled(): 0, pid: 9111, name: iptables
[ 428.295771] Pid: 9111, comm: iptables Not tainted 2.6.34-rc1 #2
[ 428.295776] Call Trace:
[ 428.295791] [<c012138e>] __might_sleep+0xe5/0xed
[ 428.295801] [<c019e8ca>] __kmalloc+0x92/0xfc
[ 428.295825] [<f865b3bb>] ? xt_jumpstack_alloc+0x36/0xff [x_tables]
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Define a new function to return the waitqueue of a "struct sock".
static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
return sk->sk_sleep;
}
Change all read occurrences of sk_sleep by a call to this function.
Needed for a future RCU conversion. sk_sleep wont be a field directly
available.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the runtime oif name resolving by netdevice notifier based
resolving. When an oif is given, a netdevice notifier is registered
to resolve the name on NETDEV_REGISTER or NETDEV_CHANGE and unresolve
it again on NETDEV_UNREGISTER or NETDEV_CHANGE to a different name.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Since Xtables is now reentrant/nestable, the cloned packet can also go
through Xtables and be subject to rules itself.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Currently, the table traverser stores return addresses in the ruleset
itself (struct ip6t_entry->comefrom). This has a well-known drawback:
the jumpstack is overwritten on reentry, making it necessary for
targets to return absolute verdicts. Also, the ruleset (which might
be heavy memory-wise) needs to be replicated for each CPU that can
possibly invoke ip6t_do_table.
This patch decouples the jumpstack from struct ip6t_entry and instead
puts it into xt_table_info. Not being restricted by 'comefrom'
anymore, we can set up a stack as needed. By default, there is room
allocated for two entries into the traverser.
arp_tables is not touched though, because there is just one/two
modules and further patches seek to collapse the table traverser
anyhow.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
xt_TEE can be used to clone and reroute a packet. This can for
example be used to copy traffic at a router for logging purposes
to another dedicated machine.
References: http://www.gossamer-threads.com/lists/iptables/devel/68781
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Fix some coding styles and remove moduleparam.h
Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Add reference counting to the netfilter LED target, to fix errors when
multiple rules point to the same target ("LED trigger already exists").
Signed-off-by: Adam Nielsen <a.nielsen@shikadi.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The CONFIG_PROVE_RCU option discovered a few invalid uses of
rcu_dereference() in netfilter. In all these cases, the code code
intends to check whether a pointer is already assigned when
performing registration or whether the assigned pointer matches
when performing unregistration. The entire registration/
unregistration is protected by a mutex, so we don't need the
rcu_dereference() calls.
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
As we will set ip_summed to CHECKSUM_NONE when necessary in
nfqnl_mangle, there is no need to zap CHECKSUM_COMPLETE in
nfqnl_build_packet_message.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
When protocols use very long names, the sprintf calls might overflow
the on-stack buffer. No protocol in the kernel does this however.
Print the protocol name in the pr_debug statement directly to avoid
this.
Based on patch by Zhitong Wang <zhitong.wangzt@alibaba-inc.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
xt_hashlimit uses a central lock per hash table and suffers from
contention on some workloads. (Multiqueue NIC or if RPS is enabled)
After RCU conversion, central lock is only used when a writer wants to
add or delete an entry.
For 'readers', updating an existing entry, they use an individual lock
per entry.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Message size should be dependent on the presence of an accounting
extension, not on CONFIG_NF_CT_ACCT definition.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
A missing break statement in hashlimit_ipv6_mask(), and masks
between /64 and /95 are not working at all...
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
When extended status codes are available, such as ENOMEM on failed
allocations, or subsequent functions (e.g. nf_ct_get_l3proto), passing
them up to userspace seems like a good idea compared to just always
EINVAL.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Restore function signatures from bool to int so that we can report
memory allocation failures or similar using -ENOMEM rather than
always having to pass -EINVAL back.
// <smpl>
@@
type bool;
identifier check, par;
@@
-bool check
+int check
(struct xt_tgchk_param *par) { ... }
// </smpl>
Minus the change it does to xt_ct_find_proto.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Restore function signatures from bool to int so that we can report
memory allocation failures or similar using -ENOMEM rather than
always having to pass -EINVAL back.
This semantic patch may not be too precise (checking for functions
that use xt_mtchk_param rather than functions referenced by
xt_match.checkentry), but reviewed, it produced the intended result.
// <smpl>
@@
type bool;
identifier check, par;
@@
-bool check
+int check
(struct xt_mtchk_param *par) { ... }
// </smpl>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Supplement to 1159683ef4.
Downgrade the log level to INFO for most checkentry messages as they
are, IMO, just an extra information to the -EINVAL code that is
returned as part of a parameter "constraint violation". Leave errors
to real errors, such as being unable to create a LED trigger.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
If dl_seq_start() memory allocation fails, we crash later in
dl_seq_stop(), trying to kfree(ERR_PTR(-ENOMEM))
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Commit 8ccb92ad (netfilter: xt_recent: fix false match) fixed supposedly
false matches in rules using a zero hit_count. As it turns out there is
nothing false about these matches and people are actually using entries
with a hit_count of zero to make rules dependant on addresses inserted
manually through /proc.
Since this slipped past the eyes of three reviewers, instead of
reverting the commit in question, this patch explicitly checks
for a hit_count of zero to make the intentions more clear.
Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Tested-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch fixes a bug that allows to lose events when reliable
event delivery mode is used, ie. if NETLINK_BROADCAST_SEND_ERROR
and NETLINK_RECV_NO_ENOBUFS socket options are set.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ENOMEM is a very obvious error code (cf. EINVAL), so I think we do not
really need a warning message. Not to mention that if the allocation
fails, the user is most likely going to get a stack trace from slab
already.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
The matches can have .family = NFPROTO_UNSPEC, and though that is not
the case for the touched modules, it seems better to just use the
nfproto from the caller.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
I do not see a point of allowing the MAC module to work with devices
that don't possibly have one, e.g. various tunnel interfaces such as
tun and sit.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
XT_ALIGN is already applied on matchsize/targetsize in x_tables.c,
so it is not strictly needed in the extensions.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Remove unused headers in net/netfilter/nfnetlink.c
Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
One of the problems with the way xt_recent is implemented is that
there is no efficient way to remove expired entries. Of course,
one can write a rule '-m recent --remove', but you have to know
beforehand which entry to delete. This commit adds reaper
logic which checks the head of the LRU list when a rule
is invoked that has a '--seconds' value and XT_RECENT_REAP set. If an
entry ceases to accumulate time stamps, then it will eventually bubble
to the top of the LRU list where it is then reaped.
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Two arguments for combining the two:
- xt_mark is pretty useless without xt_MARK
- the actual code is so small anyway that the kmod metadata and the module
in its loaded state totally outweighs the combined actual code size.
i586-before:
-rw-r--r-- 1 jengelh users 3821 Feb 10 01:01 xt_MARK.ko
-rw-r--r-- 1 jengelh users 2592 Feb 10 00:04 xt_MARK.o
-rw-r--r-- 1 jengelh users 3274 Feb 10 01:01 xt_mark.ko
-rw-r--r-- 1 jengelh users 2108 Feb 10 00:05 xt_mark.o
text data bss dec hex filename
354 264 0 618 26a xt_MARK.o
223 176 0 399 18f xt_mark.o
And the runtime size is like 14 KB.
i586-after:
-rw-r--r-- 1 jengelh users 3264 Feb 18 17:28 xt_mark.o
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
NIPQUAD has very few uses left.
Remove this use and make the code have the identical form of the only
other use of "%u,%u,%u,%u,%u,%u" in net/ipv4/netfilter/nf_nat_ftp.c
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Quick fix for memory/module refcount leak.
Reference count of listener instance never reaches 0.
Start/stop of ulogd2 is enough to trigger this bug!
Now, refcounting there looks very fishy in particular this code:
if (!try_module_get(THIS_MODULE)) {
...
and creation of listener instance with refcount 2,
so it may very well be ripped and redone. :-)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Use list_head rather than a custom list implementation.
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
The macro is replaced by a list.h-like foreach loop. This makes
the code more inspectable.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
A rule with a zero hit_count will always match.
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
e->index overflows e->stamps[] every ip_pkt_list_tot packets.
Consider the case when ip_pkt_list_tot==1; the first packet received is stored
in e->stamps[0] and e->index is initialized to 1. The next received packet
timestamp is then stored at e->stamps[1] in recent_entry_update(),
a buffer overflow because the maximum e->stamps[] index is 0.
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Cc: stable@kernel.org
Signed-off-by: Patrick McHardy <kaber@trash.net>
commit 3bc38712e3 (handle NF_STOP and unknown verdicts in
nf_reinject) was a partial fix to packet leaks.
If user asks NF_STOLEN status, we must free the skb as well.
Reported-by: Afi Gjermund <afigjermund@gmail.com>
Signed-off-by: Eric DUmazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
This patch fixes a bug that triggers an assertion if you create
a conntrack entry with a helper and netfilter debugging is enabled.
Basically, we hit the assertion because the confirmation flag is
set before the conntrack extensions are added. To fix this, we
move the extension addition before the aforementioned flag is
set.
This patch also removes the possibility of setting a helper for
existing conntracks. This operation would also trigger the
assertion since we are not allowed to add new extensions for
existing conntracks. We know noone that could benefit from
this operation sanely.
Thanks to Eric Dumazet for initial posting a preliminary patch
to address this issue.
Reported-by: David Ramblewski <David.Ramblewski@atosorigin.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Dunno, what was the idea, it wasn't used for a long time.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enhance IPVS to load balance SCTP transport protocol packets. This is done
based on the SCTP rfc 4960. All possible control chunks have been taken
care. The state machine used in this code looks some what lengthy. I tried
to make the state machine easy to understand.
Signed-off-by: Venkata Mohan Reddy Koppula <mohanreddykv@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Commit 2eff25c18c
(netfilter: xt_hashlimit: fix race condition and simplify locking)
added a mutex deadlock :
htable_create() is called with hashlimit_mutex already locked
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
with 32 bit userland and 64 bit kernels, it is unlikely but possible
that insertion of new rules fails even tough there are only about 2000
iptables rules.
This happens because the compat delta is using a short int.
Easily reproducible via "iptables -m limit" ; after about 2050
rules inserting new ones fails with -ELOOP.
Note that compat_delta included 2 bytes of padding on x86_64, so
structure size remains the same.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Normally, each connection needs a unique identity. Conntrack zones allow
to specify a numerical zone using the CT target, connections in different
zones can use the same identity.
Example:
iptables -t raw -A PREROUTING -i veth0 -j CT --zone 1
iptables -t raw -A OUTPUT -o veth1 -j CT --zone 1
Signed-off-by: Patrick McHardy <kaber@trash.net>
The error handlers might need the template to get the conntrack zone
introduced in the next patches to perform a conntrack lookup.
Signed-off-by: Patrick McHardy <kaber@trash.net>
It is one of these things that iptables cannot catch and which can
cause "Invalid argument" to be printed. Without a hint in dmesg, it is
not going to be helpful.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
call_rcu() will unconditionally reinitialize RCU head anyway.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Add TCP support, which is mandated by RFC3261 for all SIP elements.
SIP over TCP is similar to UDP, except that messages are delimited
by Content-Length: headers and multiple messages may appear in one
packet.
Signed-off-by: Patrick McHardy <kaber@trash.net>
When using TCP multiple SIP messages might be present in a single packet.
A following patch will parse them by setting the dptr to the beginning of
each message. The NAT helper needs to reload the dptr value after mangling
the packet however, so it needs to know the offset of the message to the
beginning of the packet.
Signed-off-by: Patrick McHardy <kaber@trash.net>
When requests are parsed, the "sip:" part of the SIP URI should be skipped.
Usually this doesn't matter because address parsing skips forward until after
the username part, but in case REGISTER requests it doesn't contain a username
and the address can not be parsed.
Signed-off-by: Patrick McHardy <kaber@trash.net>