\documentclass[twocolumn,12pt]{article}

\usepackage{alltt}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{isolatin1}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{times}
\usepackage{url}
\usepackage[T1,obeyspaces]{zrl}

% "verbatim" with line breaks, obeying spaces
\providecommand\code{\begingroup \xrlstyle{tt}\Xrl}
% as above, but okay to break lines at spaces
\providecommand\brcode{\begingroup \zrlstyle{tt}\Zrl}

% Same as the pair above, but 'l' for long == small type
\providecommand\lcode{\begingroup \small\xrlstyle{tt}\Xrl}
\providecommand\lbrcode{\begingroup \small\zrlstyle{tt}\Zrl}

% For identifiers - "verbatim" with line breaks at punctuation
\providecommand\ident{\begingroup \urlstyle{tt}\Url}
\providecommand\lident{\begingroup \small\urlstyle{tt}\Url}

\begin{document}

% Required: do not print the date.
\date{}

\title{\texttt{ct\_sync}: state replication of \texttt{ip\_conntrack}\\
% {\normalsize Subtitle goes here}
}

\author{
Harald Welte \\
{\em netfilter core team / Astaro AG / hmw-consulting.de}\\
{\tt\normalsize laforge@gnumonks.org}\\
% \and
% Second Author\\
% {\em Second Institution}\\
% {\tt\normalsize another@address.for.email.com}\\
} % end author section

\maketitle

% Required: do not use page numbers on title page.
\thispagestyle{empty}


\section*{Abstract}

With traditional, stateless firewalling (such as ipfwadm, ipchains)
there is no need for special HA support in the firewalling
subsystem. As long as all packet filtering rules and routing table
entries are configured in exactly the same way, one can use any
available tool for IP-address takeover to accomplish the goal of
failing over from one node to the other.

With Linux 2.4/2.6 netfilter/iptables, the Linux firewalling code
moves beyond traditional packet filtering. Netfilter provides a
modular connection tracking subsystem which can be employed for
stateful firewalling. The connection tracking subsystem gathers
information about the state of all current network flows
(connections). Packet filtering decisions and NAT information are
associated with this state information.

In a high availability scenario, this connection tracking state needs
to be replicated from the currently active firewall node to all
standby slave firewall nodes. Only when all connection tracking state
is replicated will the slave node have all necessary state
information at the time a failover event occurs.

Due to funding by Astaro AG, the netfilter/iptables project now offers
a \ident{ct_sync} kernel module for replicating connection tracking state
across multiple nodes. This paper covers the architectural
design and implementation of the connection tracking failover system.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% BODY OF PAPER GOES HERE %%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{Failover of stateless firewalls}

No special precautions are needed when installing a highly available
stateless packet filter. Since no state is kept, all information
needed for filtering is the ruleset and the individual, separate packets.

Building a set of highly available stateless packet filters can thus be
achieved by using any traditional means of IP-address takeover, such
as Heartbeat or VRRPd.

The only remaining issue is to make sure the firewalling ruleset is
exactly the same on both machines. This should be ensured by the firewall
administrator every time he updates the ruleset and can optionally be managed
by some scripts utilizing scp or rsync.

If this is not applicable, because a very dynamic ruleset is employed, one can
build a very easy solution using the iptables-supplied tools iptables-save and
iptables-restore. The output of iptables-save can be piped over ssh to
iptables-restore on a different host.

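For example, assuming the standby node is reachable via ssh as
\texttt{fw2} (a placeholder name), the complete ruleset can be
replicated in one pipeline:
\code{iptables-save | ssh fw2 iptables-restore}.
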
This approach has some limitations:
\begin{itemize}
\item
no state tracking
\item
not possible in combination with iptables stateful NAT
\item
no consistency of per-rule packet/byte counters
\end{itemize}

\section{Failover of stateful firewalls}

Modern firewalls implement state tracking (a.k.a.\ connection tracking) in order
to keep some state about the currently active sessions. The amount of
per-connection state kept at the firewall depends on the particular
configuration and networking protocols used.

As soon as \textit{any} state is kept at the packet filter, this state
information needs to be replicated to the slave/backup nodes within the
failover setup.

Since Linux 2.4.x, all relevant state is kept within the \textit{connection
tracking subsystem}. In order to understand how this state could possibly be
replicated, we need to understand the architecture of this conntrack subsystem.

\subsection{Architecture of the Linux Connection Tracking Subsystem}

Connection tracking within Linux is implemented as a netfilter module, called
\ident{ip_conntrack.o} (\ident{ip_conntrack.ko} in 2.6.x kernels).

Before describing the connection tracking subsystem, we need to introduce a
couple of definitions and primitives used throughout the conntrack code.

A connection is represented within the conntrack subsystem using
\brcode{struct ip_conntrack}, also called a \textit{connection tracking entry}.

Connection tracking utilizes \textit{conntrack tuples}, which are tuples
consisting of
\begin{itemize}
\item
source IP address
\item
source port (or icmp type/code, gre key, \ldots)
\item
destination IP address
\item
destination port
\item
layer 4 protocol number
\end{itemize}

A connection is uniquely identified by two tuples: the tuple in the original
direction (\lident{IP_CT_DIR_ORIGINAL}) and the tuple for the reply direction
(\lident{IP_CT_DIR_REPLY}).

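For illustration, such a tuple can be pictured as the following C
structure. This is a simplified sketch only; the actual
\brcode{struct ip_conntrack_tuple} in the kernel nests these fields
in source/destination sub-structures and uses unions for the
protocol-specific parts:

{\small
\begin{verbatim}
/* Simplified sketch of a conntrack tuple.
 * The real struct ip_conntrack_tuple nests
 * these fields and uses unions for the
 * layer-4-specific parts. */
struct tuple_sketch {
    u_int32_t src_ip;
    u_int16_t src_port; /* or icmp id,
                           gre key, ... */
    u_int32_t dst_ip;
    u_int16_t dst_port; /* or icmp
                           type/code */
    u_int8_t  protonum; /* layer 4
                           protocol */
};
\end{verbatim}
}
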
Connection tracking itself does not drop packets\footnote{Well, in some rare
cases in combination with NAT it needs to drop. But don't tell anyone; this is
secret.} or impose any policy. It just associates every packet with a
connection tracking entry, which in turn has a particular state. All other
kernel code can use this state information\footnote{State information is
referenced via the \brcode{nfct} member of a packet's
\brcode{struct sk_buff}.}.

\subsubsection{Integration of conntrack with netfilter}

If the \ident{ip_conntrack.[k]o} module is registered with netfilter, it
attaches to the \lident{NF_IP_PRE_ROUTING}, \lident{NF_IP_POST_ROUTING}, \lident{NF_IP_LOCAL_IN},
and \lident{NF_IP_LOCAL_OUT} hooks.

Because forwarded packets are the most common case on firewalls, I will only
describe how connection tracking works for forwarded packets. The two relevant
hooks for forwarded packets are \lident{NF_IP_PRE_ROUTING} and \lident{NF_IP_POST_ROUTING}.

Every time a packet arrives at the \lident{NF_IP_PRE_ROUTING} hook, connection
tracking creates a conntrack tuple from the packet. It then compares this
tuple to the original and reply tuples of all already-seen
connections\footnote{Of course this is not implemented as a linear
search over all existing connections.} to find out if this
just-arrived packet belongs to any existing
connection. If there is no match, a new conntrack table entry
(\brcode{struct ip_conntrack}) is created.

Let's walk through the case where there are no existing connections yet,
i.e., we are starting from scratch.

The first packet comes in; we derive the tuple from the packet headers, look
it up in the conntrack hash table, and don't find any matching entry. As a
result, we create a new \brcode{struct ip_conntrack}. This
\brcode{struct ip_conntrack} is filled with
all necessary data, like the original and reply tuple of the connection.
How do we know the reply tuple? By inverting the source and destination
parts of the original tuple.\footnote{So why do we need two tuples, if they can
be derived from each other? Wait until we discuss NAT.}
Please note that this new \brcode{struct ip_conntrack} is \textbf{not} yet placed
into the conntrack hash table.

The packet is now passed on to other callback functions which have registered
with a lower priority at \lident{NF_IP_PRE_ROUTING}. It then continues traversal of
the network stack as usual, including all respective netfilter hooks.

If the packet survives (i.e., is not dropped by the routing code, network stack,
firewall ruleset, \ldots), it re-appears at \lident{NF_IP_POST_ROUTING}. In this case,
we can now safely assume that this packet will be sent off on the outgoing
interface, and thus put the connection tracking entry which we created at
\lident{NF_IP_PRE_ROUTING} into the conntrack hash table. This process is called
\textit{confirming the conntrack}.

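As an illustration of how a module attaches to these hooks, the
following sketch registers a dummy callback at
\lident{NF_IP_PRE_ROUTING} with the conntrack priority. The real
registrations live inside the conntrack module itself; error handling
is omitted here:

{\small
\begin{verbatim}
static unsigned int
my_hook(unsigned int hooknum,
        struct sk_buff **pskb,
        const struct net_device *in,
        const struct net_device *out,
        int (*okfn)(struct sk_buff *))
{
    /* inspect *pskb, attach state, ... */
    return NF_ACCEPT;
}

static struct nf_hook_ops my_ops = {
    .hook     = my_hook,
    .pf       = PF_INET,
    .hooknum  = NF_IP_PRE_ROUTING,
    .priority = NF_IP_PRI_CONNTRACK,
};

/* in the module init function: */
/* nf_register_hook(&my_ops);   */
\end{verbatim}
}
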
The connection tracking code itself is not monolithic, but consists of a
couple of separate modules\footnote{They don't actually have to be separate
kernel modules; e.g.\ the TCP, UDP, and ICMP tracking modules are all part of
the Linux kernel module \ident{ip_conntrack.o}.}. Besides the conntrack core,
there are two important kinds of modules: protocol helpers and application
helpers.

Protocol helpers implement the layer-4-protocol specific parts. They currently
exist for TCP, UDP, and ICMP (an experimental helper for GRE exists).

\subsubsection{TCP connection tracking}

As TCP is a connection oriented protocol, it is not very difficult to imagine
how connection tracking for this protocol could work. There are well-defined
state transitions possible, and conntrack can decide which state transitions
are valid within the TCP specification. In reality it's not all that easy,
since we cannot assume that all packets that pass the packet filter actually
arrive at the receiving end\ldots

It is noteworthy that the standard connection tracking code does \textbf{not}
do TCP sequence number and window tracking. A well-maintained patch to add
this feature has existed for almost as long as connection tracking itself. It
will be integrated with the 2.5.x kernel. The problem with window tracking is
its bad interaction with connection pickup. The TCP conntrack code is able to
pick up already existing connections, e.g.\ in case your firewall was rebooted.
However, connection pickup conflicts with TCP window tracking: the TCP
window scaling option is only transferred at connection setup time, and we
don't know about it in case of pickup\ldots

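To illustrate the idea of tracking TCP state transitions (this is not
the kernel's actual state table in \ident{ip_conntrack_proto_tcp.c},
just a sketch of the principle), a transition check for the
three-way handshake could look like:

{\small
\begin{verbatim}
/* Illustrative sketch only; the real TCP
 * state machine covers all states and
 * both directions. */
enum ct_tcp_state {
    CT_NONE, CT_SYN_SENT,
    CT_SYN_RECV, CT_ESTABLISHED
};

static int
valid_transition(enum ct_tcp_state s,
                 int syn, int ack)
{
    switch (s) {
    case CT_NONE:     /* first packet   */
        return syn && !ack;
    case CT_SYN_SENT: /* expect SYN/ACK */
        return syn && ack;
    case CT_SYN_RECV: /* expect ACK     */
        return ack && !syn;
    default:
        return 1;
    }
}
\end{verbatim}
}
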
\subsubsection{ICMP tracking}

ICMP is not really a connection oriented protocol. So how is it possible to
do connection tracking for ICMP?

The ICMP protocol can be split into two groups of messages:

\begin{itemize}
\item
ICMP error messages, which sort-of belong to a different connection; they
are associated as \textit{RELATED} to that connection
(\lident{ICMP_DEST_UNREACH}, \lident{ICMP_SOURCE_QUENCH},
\lident{ICMP_TIME_EXCEEDED},
\lident{ICMP_PARAMETERPROB}, \lident{ICMP_REDIRECT}).
\item
ICMP queries, which have a \ident{request-reply} character. What the conntrack
code does is let the request have a state of \textit{NEW}, and the reply
\textit{ESTABLISHED}. The reply closes the connection immediately.
(\lident{ICMP_ECHO}, \lident{ICMP_TIMESTAMP}, \lident{ICMP_INFO_REQUEST}, \lident{ICMP_ADDRESS})
\end{itemize}

\subsubsection{UDP connection tracking}

UDP is designed as a connectionless datagram protocol. But most common
protocols using UDP as their layer 4 protocol have bi-directional UDP
communication. Imagine a DNS query, where the client sends a UDP frame to
port 53 of the nameserver, and the nameserver sends back a DNS reply packet
from its UDP port 53 to the client.

Netfilter treats this as a connection. The first packet (the DNS request) is
assigned a state of \textit{NEW}, because the packet is expected to create a new
`connection.' The DNS server's reply packet is marked as \textit{ESTABLISHED}.

\subsubsection{conntrack application helpers}

More complex application protocols involving multiple connections need special
support by a so-called ``conntrack application helper module.'' The stock
kernel comes with modules for FTP, IRC (DCC), TFTP, and Amanda. Netfilter CVS
currently contains
patches for PPTP, H.323, Eggdrop botnet, mms, DirectX, RTSP, and talk/ntalk.
We're still lacking
a lot of protocols (e.g.\ SIP, SMB/CIFS)---but they are unlikely to appear
until somebody really needs them and either develops them on his own or
funds development.

\subsubsection{Integration of connection tracking with iptables}

As stated earlier, conntrack doesn't impose any policy on packets. It just
determines the relation of a packet to already existing connections. To base
packet filtering decisions on this state information, the iptables \textit{state}
match can be used. Every packet falls into one of the following categories:

\begin{itemize}
\item
\textbf{NEW}: packet would create a new connection, if it survives
\item
\textbf{ESTABLISHED}: packet is part of an already established connection
(either direction)
\item
\textbf{RELATED}: packet is in some way related to an already established
connection, e.g.\ ICMP errors or FTP data sessions
\item
\textbf{INVALID}: conntrack is unable to derive conntrack information
from this packet. Please note that all multicast or broadcast packets
fall into this category.
\end{itemize}

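A typical stateful ruleset therefore accepts \textit{ESTABLISHED} and
\textit{RELATED} packets early on, with a rule like
\code{iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT},
and then matches only \textit{NEW} packets against the actual policy.
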
\subsection{Poor man's conntrack failover}

When thinking about failover of stateful firewalls, one usually thinks about
replication of state. This presumes that the state is gathered at one
firewalling node (the currently active node), and replicated to several other
passive standby nodes. There is, however, a very different approach to
replication: concurrent state tracking on all firewalling nodes.

While this scheme has not been implemented within \ident{ct_sync}, the author
still thinks it is worth an explanation in this paper.

The basic assumption of this approach is: in a setup where all firewalling
nodes receive exactly the same traffic, all nodes will deduce the same state
information.

The implementability of this approach depends entirely on fulfillment of
this assumption.

\begin{itemize}
\item
\textit{All packets need to be seen by all nodes}. This is not always true, but
can be achieved by using shared media like traditional ethernet (no switches!)
and promiscuous mode on all ethernet interfaces.
\item
\textit{All nodes need to be able to process all packets}. This cannot be
universally guaranteed. Even if the hardware (CPU, RAM, chipset, NICs) and
software (Linux kernel) are exactly the same, they might behave differently,
especially under high load. To avoid those effects, the hardware should be
able to deal with way more traffic than seen during normal operation. Also,
there
should be no userspace processes (like proxies, etc.) running on the firewalling
nodes at all. WARNING: Nobody guarantees this behaviour. However, the poor
man is usually not interested in scientific proof but in usability in his
particular practical setup.
\end{itemize}

However, even if those conditions are fulfilled, there are remaining issues:
\begin{itemize}
\item
\textit{No resynchronization after reboot}. If a node is rebooted (because of
a hardware fault, software bug, software update, etc.) it will lose all state
information gathered before the reboot. The effects depend on the traffic.
Generally, it is only assured that state information about all connections
initiated after the reboot will be present. If there are short-lived
connections (like http), the state information on the just-rebooted node will
approximate the state information of an older node. Only after all sessions
active at the time of reboot have terminated is state information guaranteed
to be resynchronized.
\item
\textit{Only possible with a shared medium}. The practical implication is that no
switched ethernet (and thus no full duplex) can be used.
\end{itemize}

The major advantage of the poor man's approach is implementation simplicity.
No state transfer mechanism needs to be developed. Only very few changes
to the existing conntrack code would be needed in order to be able to
do tracking based on packets received from promiscuous interfaces. The active
node would have packet forwarding turned on; the passive nodes, off.

I'm not proposing this as a real solution to the failover problem. It's
hackish, buggy, and likely to break very easily. But considering it can be
implemented in very little programming time, it could be an option for very
small installations with low reliability criteria.

\subsection{Conntrack state replication}

The preferred solution to the failover problem is, without any doubt,
replication of the connection tracking state.

The proposed conntrack state replication solution consists of several
parts:
\begin{itemize}
\item
A connection tracking state replication protocol
\item
An event interface generating event messages as soon as state information
changes on the active node
\item
An interface for explicit generation of connection tracking table entries on
the standby slaves
\item
Some code (preferably a kernel thread) running on the active node, receiving
state updates from the event interface and generating conntrack state
replication protocol messages
\item
Some code (preferably a kernel thread) running on the slave node(s), receiving
conntrack state replication protocol messages and updating the local conntrack
table accordingly
\end{itemize}

Flow of events in chronological order:
\begin{itemize}
\item
\textit{on active node, inside the network RX softirq}
\begin{itemize}
\item
\ident{ip_conntrack} analyzes a forwarded packet
\item
\ident{ip_conntrack} gathers some new state information
\item
\ident{ip_conntrack} updates the conntrack hash table
\item
\ident{ip_conntrack} calls the event API
\item
the function registered with the event API builds a message and enqueues it
to the send ring
\end{itemize}
\item
\textit{on active node, inside the conntrack-sync sender kernel thread}
\begin{itemize}
\item
\ident{ct_sync_send} aggregates multiple messages into one packet
\item
\ident{ct_sync_send} dequeues the packet from the ring
\item
\ident{ct_sync_send} sends the packet via the in-kernel sockets API
\end{itemize}
\item
\textit{on slave node(s), inside the network RX softirq}
\begin{itemize}
\item
\ident{ip_conntrack} ignores packets coming from the \ident{ct_sync}
interface via the NOTRACK mechanism (see the example after this list)
\item
the UDP stack appends the packet to the socket receive queue of the
\ident{ct_sync_recv} kernel thread
\end{itemize}
\item
\textit{on slave node(s), inside the conntrack-sync receive kernel thread}
\begin{itemize}
\item
the \ident{ct_sync_recv} thread receives the state replication packet
\item
the \ident{ct_sync_recv} thread parses the packet into individual messages
\item
the \ident{ct_sync_recv} thread creates/updates the local
\ident{ip_conntrack} entry
\end{itemize}
\end{itemize}

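The NOTRACK exemption mentioned above boils down to a single rule in
the \texttt{raw} table; assuming the synchronization interface is
\texttt{eth1} (name chosen for illustration):
\code{iptables -t raw -A PREROUTING -i eth1 -j NOTRACK}.
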
\subsubsection{Connection tracking state replication protocol}

In order to be able to replicate the state between two or more firewalls, a
state replication protocol is needed. This protocol is used over a private
network segment shared by all nodes for state replication. It is designed to
work over IP unicast and IP multicast transport. IP unicast will be used for
direct point-to-point communication between one active firewall and one
standby firewall. IP multicast will be used when the state needs to be
replicated to more than one standby firewall.

The principal design criteria of this protocol are:
\begin{itemize}
\item
\textbf{reliable against data loss}, as the underlying UDP layer only
provides checksumming against data corruption, but doesn't employ any
means against data loss
\item
\textbf{lightweight}, since generating the state update messages is
already a very expensive process for the sender, eating additional CPU,
memory, and I/O bandwidth
\item
\textbf{easy to parse}, to minimize overhead at the receiver(s)
\end{itemize}

The protocol does not employ any security mechanism like encryption,
authentication, or reliability against spoofing attacks. It is
assumed that the private conntrack sync network is a secure communications
channel, not accessible to any malicious third party.

To achieve reliability against data loss, a simple sequence numbering
scheme is used. All protocol messages are prefixed by a sequence number,
determined by the sender. If a slave detects packet loss by discontinuous
sequence numbers, it can request the retransmission of the missing packets
by stating the missing sequence number(s). Since there is no acknowledgement
for successfully received packets, the sender has to keep a
reasonably-sized\footnote{\textit{Reasonable size} must be large enough for the
round-trip time between master and slowest slave.} backlog of recently-sent
packets in order to be able to fulfill retransmission
requests.

The different state replication protocol packet types are:
\begin{itemize}
\item
\textbf{\ident{CT_SYNC_PKT_MASTER_ANNOUNCE}}: A new master announces itself.
Any still existing master will downgrade itself to slave upon
reception of this packet.
\item
\textbf{\ident{CT_SYNC_PKT_SLAVE_INITSYNC}}: A slave requests initial
synchronization from the master (after reboot or loss of sync).
\item
\textbf{\ident{CT_SYNC_PKT_SYNC}}: A packet containing synchronization data
from master to slaves
\item
\textbf{\ident{CT_SYNC_PKT_NACK}}: A slave indicates packet loss of a
particular sequence number
\end{itemize}

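The exact wire format is beyond the scope of this paper; purely as an
illustration, a packet header carrying the pieces described above
(sequence number and packet type) could be laid out like this:

{\small
\begin{verbatim}
/* Hypothetical sketch of a ct_sync packet
 * header; not the actual wire format. */
struct ct_sync_pkthdr {
    u_int32_t seq;   /* sender sequence */
    u_int8_t  type;  /* CT_SYNC_PKT_*   */
    u_int8_t  flags;
    u_int16_t nmsgs; /* messages inside */
};
\end{verbatim}
}
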
The messages within a \lident{CT_SYNC_PKT_SYNC} packet always refer to a particular
\textit{resource} (currently \lident{CT_SYNC_RES_CONNTRACK} and \lident{CT_SYNC_RES_EXPECT},
although support for the latter has not been fully implemented yet).

For every resource, there are several message types. So far, only
\lident{CT_SYNC_MSG_UPDATE} and \lident{CT_SYNC_MSG_DELETE} have been implemented. This
means a new connection as well as state changes to an existing connection will
always be encapsulated in a \lident{CT_SYNC_MSG_UPDATE} message and therefore contain
the full conntrack entry.

To uniquely identify (and later reference) a conntrack entry, the only unique
criterion is used: the \ident{ip_conntrack_tuple}.

\subsubsection{\texttt{ct\_sync} sender thread}

Maximum care needs to be taken in the implementation of the
\ident{ct_sync} sender.

The normal workload of the active firewall node is likely to be already very
high, so generating and sending the conntrack state replication messages needs
to be highly efficient.

It was therefore decided to use a pre-allocated ringbuffer for outbound
\ident{ct_sync} packets. New messages are appended to individual buffers in this
ring, and pointers into this ring are passed to the in-kernel sockets API to
ensure a minimum number of copies and memory allocations.

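A minimal sketch of such a pre-allocated ring (all names hypothetical)
might look as follows: the event callback fills buffers at the head in
softirq context, while the sender thread drains them from the tail.

{\small
\begin{verbatim}
/* Hypothetical pre-allocated send ring;
 * filled from the event callback, drained
 * by the sender kernel thread. */
#define CT_RING_SIZE 256
#define CT_BUF_LEN  1400

struct ring_buf {
    unsigned char data[CT_BUF_LEN];
    unsigned int  len;
};

static struct ring_buf ring[CT_RING_SIZE];
static unsigned int head, tail;

/* enqueue: fill ring[head].data, set len,
 * then head = (head + 1) % CT_RING_SIZE;
 * dequeue: send ring[tail].data via the
 * in-kernel sockets API, then
 * tail = (tail + 1) % CT_RING_SIZE */
\end{verbatim}
}
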
\subsubsection{\texttt{ct\_sync} initsync sender thread}

In order to facilitate ongoing state synchronization at the same time as
responding to initial sync requests of an individual slave, the sender has a
separate kernel thread for initial state synchronization
(\ident{ct_sync_initsync}).

At the moment it iterates over the state table and transmits packets at a
fixed rate of about 1000 packets per second, resulting in about 4000
connections per second, averaging to about 1.5 Mbps of bandwidth consumed.

The speed of this initial sync should be configurable by the system
administrator, especially since there is no flow control mechanism, and the
slave node(s) will have to deal with the packets or otherwise lose sync again.

This is certainly an area of future improvement and development---but first we
want to see practical problems with this primitive scheme.

\subsubsection{\texttt{ct\_sync} receiver thread}

Implementation of the receiver is very straightforward.

For performance reasons, and to facilitate code reuse, the receiver uses the
same pre-allocated ring buffer structure as the sender. Incoming packets are
written into ring members and then successively parsed into their individual
messages.

Apart from dealing with lost packets, it just needs to call the
respective conntrack add/modify/delete functions.

\subsubsection{Necessary changes within netfilter conntrack core}
|
|
|
|
To be able to achieve the described conntrack state replication mechanism,
|
|
the following changes to the conntrack core were implemented:
|
|
\begin{itemize}
|
|
\item
|
|
Ability to exclude certain packets from being tracked. This was a
|
|
long-wanted feature on the TODO list of the netfilter project and is
|
|
implemented by having a ``raw'' table in combination with a
|
|
``NOTRACK'' target.
|
|
\item
|
|
Ability to register callback functions to be called every time a new
|
|
conntrack entry is created or an existing entry modified. This is
|
|
part of the nfnetlink-ctnetlink patch, since the ctnetlink event
|
|
interface also uses this API.
|
|
\item
|
|
Export an API to externally add, modify, and remove conntrack entries.
|
|
\end{itemize}
|
|
|
|
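As a rough sketch (names modeled on the ctnetlink event interface;
treat them as illustrative rather than the final API), registering
such a callback could look like:

{\small
\begin{verbatim}
/* Sketch of conntrack event callback
 * registration; names are illustrative. */
static int
ct_event(struct notifier_block *nb,
         unsigned long events, void *ptr)
{
    struct ip_conntrack *ct = ptr;

    /* build a replication message for ct
     * and enqueue it to the send ring */
    return NOTIFY_DONE;
}

static struct notifier_block ct_nb = {
    .notifier_call = ct_event,
};

/* ip_conntrack_register_notifier(&ct_nb); */
\end{verbatim}
}
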
Since the number of changes is very low, their inclusion into the mainline
kernel is not a problem and can happen during the 2.6.x stable kernel series.

\subsubsection{Layer 2 dropping and \texttt{ct\_sync}}

In most cases, netfilter/iptables-based firewalls will not only function as a
packet filter but also run local processes such as proxies, DNS relays, SMTP
relays, etc.

In order to minimize failover time, it is helpful if the full startup and
configuration of all network interfaces and all of those userspace processes
can happen at system bootup time rather than in the instance of a failover.

l2drop provides a convenient way to achieve this goal: it hooks into the
layer 2
netfilter hooks (immediately attached to \ident{netif_rx()} and
\ident{dev_queue_xmit()}) and blocks all incoming and outgoing network packets at this
very low layer. Even kernel-generated messages such as ARP replies, IPv6
neighbour discovery, IGMP, \dots\ are blocked this way.

Of course there has to be an exemption for the state synchronization messages
themselves. In order to still facilitate remote administration via SSH and
other communication between the cluster nodes, the whole network
interface used for synchronization is subject to this exemption from
l2drop.

As soon as a node is promoted to master state, l2drop is disabled and the
system becomes visible to the network.

\subsubsection{Configuration}

All configuration happens via module parameters.

\begin{itemize}
\item
\texttt{syncdev}: Name of the multicast-capable network device
used for state synchronization among the nodes
\item
\texttt{state}: Initial state of the node (0=slave, 1=master)
\item
\texttt{id}: Unique node ID (0..255)
\item
\texttt{l2drop}: Enable (1) or disable (0) the l2drop functionality
\end{itemize}

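A master node could, for example, be brought up with
\code{modprobe ct_sync syncdev=eth1 state=1 id=1 l2drop=1}
(interface name and node ID chosen for illustration).
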
\subsubsection{Interfacing with the cluster manager}

As indicated in the beginning of this paper, \ident{ct_sync} itself does not provide
any mechanism to determine outage of the master node within a cluster. This
job is left to cluster manager software running in userspace.

Once an outage of the master is detected, the cluster manager needs to elect
one of the remaining (slave) nodes to become the new master. On this elected
node, the cluster manager will write the ASCII character \texttt{1} into the
\ident{/proc/net/ct_sync} file. Reading from this file will return the current
state of the local node.

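From a shell, promoting the local node is thus as simple as
\code{echo 1 > /proc/net/ct_sync}; reading the file back with
\code{cat /proc/net/ct_sync} shows the current state.
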
\section{Acknowledgements}

The author would like to thank his fellow netfilter developers for their
help. Particularly important to \ident{ct_sync} is Krisztian KOVACS
\ident{<hidden@balabit.hu>}, who did a proof-of-concept implementation based on my
first paper on \ident{ct_sync} at OLS2002.

Without the financial support of Astaro AG, I would not have been able to spend any
time on \ident{ct_sync} at all.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\end{document}