316 lines
9.4 KiB
Plaintext
316 lines
9.4 KiB
Plaintext
%include "default.mgp"
|
|
%default 1 bgrad
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
%nodefault
|
|
%back "blue"
|
|
|
|
%center
|
|
%size 7
|
|
|
|
|
|
Linux Kernel Architecture
|
|
%size 5
|
|
SMP issues, locking primitives
|
|
|
|
|
|
%center
|
|
%size 4
|
|
by
|
|
|
|
Harald Welte <laforge@gnumonks.org>
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
Prerequirements
|
|
|
|
Due to the technical nature of this presentation, the audience should be familiar with the following subjects
|
|
|
|
experience in programming on a Linux/*NIX system
|
|
C language preferred
|
|
general knowledge about computer hardware
|
|
interrupts / IO / DMA
|
|
general knowledge about modern CPU architeture
|
|
address space / MMU
|
|
'protected mode' / supervisor mode / ...
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
Kernel / Userspace
|
|
|
|
OS kernel provides
|
|
hardware abstraction (file I/O, network I/O, ...)
|
|
ressource allocation / limiting
|
|
address sepraration
|
|
privilege separation
|
|
IPC
|
|
|
|
the traditional process model in *NIX operating systems
|
|
processes reside in seperate virtual address spaces
|
|
kernel only executes one process (init) at bootup
|
|
all other processes descend from from init
|
|
processes are scheduled and preempted by the kernel
|
|
processes invoke system functions via syscalls.
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
System calls
|
|
|
|
Definition
|
|
|
|
a userspace process enters the kernel
|
|
mechanism is CPU architecture dependent
|
|
can be software interrupt (int 0x80)
|
|
can be special asm instruction (sysenter)
|
|
arguments are passed on the stack
|
|
common examples
|
|
open/close/read/write
|
|
exit/fork/execve/kill
|
|
socketcall, implements (socket/bind/connect/listen)
|
|
about 270 system calls in 2.6.x kernels
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
Invocation of system call
|
|
|
|
chronological order of events in case of a system call
|
|
|
|
userspace process calls library function
|
|
library function is executed within the process' address space
|
|
library will eventually issue a systemcall, pushing arguments on the stack
|
|
library will issue syscall (int 0x80 / sysenter / ...)
|
|
execution will switch to syscall context in kernel mode
|
|
kernel will look up systemcall table and dispatch to respective function
|
|
syscall function in the kernel will handle the syscall
|
|
all data between kernel/userspace needs to be copied between address spaces
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
Execution contexts
|
|
|
|
apart from scheduling between different userspace processes, the kernel has different jobs like reacting to an external event
|
|
|
|
hardirq
|
|
hardware interrupt line was triggered
|
|
softirq
|
|
the workhorse behind a hardirq
|
|
userspace
|
|
executing within userspace process
|
|
syscall
|
|
invoked by a system call from userspace
|
|
vsyscall
|
|
virtual system calls, executed in userspace context
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
hardirq context
|
|
|
|
interrupt generated by hardware is received + handled
|
|
can be interrupted by other hardirq's
|
|
does only minimal job and returns
|
|
examples
|
|
packet has arrived on network board
|
|
character was received on serial port
|
|
dma read/write to disk drive has completed
|
|
timer interrupt went off
|
|
|
|
in most cases, a hardirq is followed by softirq or tasklet.
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
softirq context
|
|
|
|
softirqs are run after hardirq
|
|
do the real work associated withe a hardirq
|
|
multithreaded (can run simultaneously on multiple cpus)
|
|
examples
|
|
network receive softirq
|
|
timer softirq
|
|
|
|
prior to softirq's, linux had so-called 'bottom halves'
|
|
softirq introduced in 2.4.x (net rx/tx softirq)
|
|
bottom halves removed in 2.6.x
|
|
difference: only one BH can be run at a time
|
|
BH's have to be converted to tasklets in 2.6.x
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
tasklets
|
|
|
|
tasklets are somewhat in between of softirq's and bottom halves
|
|
one particular tasklet cannot run on multiple CPUs simultaneously
|
|
different tasklets can run on different CPUs simultaneosly
|
|
|
|
otherwise, same as softirq context
|
|
tasklets are impl. inside the 'tasklet softirq'
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
syscall / userspace context
|
|
|
|
userspace context
|
|
in userspace, executing a process
|
|
|
|
syscall context
|
|
inside kernel, when userspace process issues syscall()
|
|
|
|
vsyscalls (virtual syscalls)
|
|
first introduced with the x86-64 (AMD Opteron) arch
|
|
fast read-only access to kernel data structures
|
|
can do stuff like gettimeofday() without context switch
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
synchronization
|
|
|
|
Due to reentrancy and SMP, synchronization issues arise:
|
|
|
|
simple case: UP system
|
|
softirq can be interrupted by hardirq
|
|
thus, shared structures (queues, ...) need to be protected
|
|
complex case: SMP system
|
|
softirq can run at the same time on multiple CPU's
|
|
as softirqs are multithreaded, synchronization between threads has to be implemented
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
synchronization primitives
|
|
|
|
busy-waiting locks
|
|
|
|
spinlocks
|
|
if lock was not taken, take it and continue
|
|
if lock was taken, bysy-loop until it is free
|
|
rwlocks
|
|
special case of spinlocks
|
|
useful when structure protected by lock is often read but rarely updated/written to
|
|
allows either
|
|
multiple readers simultaneously, or
|
|
only one writer [and no readers]
|
|
brlocks
|
|
super-fast read/write locks, with write-side penalty
|
|
avoid cache ping-pong in multi reader case
|
|
only in kernel 2.4.x
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
synchronization primitives (cont'd)
|
|
|
|
sleeper locks
|
|
|
|
semaphores
|
|
if semaphore can be acquired, continue
|
|
if semaphore cannot be acquired, put current process to sleep
|
|
once semaphore is available again, wakeup process
|
|
|
|
WARNING: can only be used for sync userspace/syscall context
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
new locking primitives in 2.6.x
|
|
|
|
seqlocks
|
|
introduced with vsyscalls in 2.5/2.6
|
|
reader/writer consistent mechanism without starving writers
|
|
readers never block but may have to retry if write in progress
|
|
|
|
read copy update
|
|
new lockless mechanism in kernel 2.5/2.6
|
|
defers update of data structure until all CPU's have scheduled and thus nobody has any references left
|
|
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
example: incoming network packet
|
|
|
|
hardirq context
|
|
NIC issues interrupt line after a packet was received
|
|
kernel enters (arch/i386/kernel/entry.S:common_interrupt)
|
|
core interrupt handler (arch/i386/kernel/irq.c:do_IRQ)
|
|
hardirq handler of network driver (drivers/net/tulip/interrupt.c:tulip_interrupt)
|
|
net/core/dev.c:netif_rx(): append skb to backlog queue
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
example: incoming network packet
|
|
|
|
softirq context
|
|
net/core/dev.c:net_rx_action()
|
|
net/core/dev.c:process_backlog()
|
|
net/core/dev.c:netif_receive_skb()
|
|
net/core/dev.c:deliver_skb()
|
|
net/ipv4/ip_input.c:ip_rcv()
|
|
netfilter prerouting hook
|
|
net/ipv4/ip_input.c:ip_rcv_finish()
|
|
call routing code
|
|
net/ipv4/ip_input.c:ip_local_deliver()
|
|
netfilter localin hook
|
|
net/ipv4/ip_input.c:ip_local_deliver_finish()
|
|
call l4 protocol
|
|
net/ipv4/udp.c:udp_rcv()
|
|
lookup socket, if any
|
|
include/net/sock.h:sock_queue_rcv_skb()
|
|
enqueue into socket receiver queue
|
|
net/core/sock.c:sock_def_readable()
|
|
wake_up_interruptible() on socket waitqueue
|
|
return from recv() via socketcall
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Linux Kernel Architecture
|
|
Cache Effects
|
|
|
|
SMP systems have multiple CPU's
|
|
Every CPU has it's own cache(s) / cache hierarchies
|
|
Most modern CPU archs are cache coherent in hardware
|
|
This means a certain chunk of memory can only be write-cached on one CPU at a given time
|
|
Frequently updated data structures will ping-pong between CPU caches
|
|
Data structures have to be organized to avoid cache issues
|
|
Cacheline alignment
|
|
very easy by using SLAB_HWCACHE_ALIGN
|
|
per-cpu data structures
|
|
e.g. packet counters: have one for every CPU
|
|
structure layout
|
|
put all writeable/updated members together
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%page
|
|
Future of Linux packet filtering
|
|
Thanks
|
|
The slides and the an according paper of this presentation are available at http://www.gnumonks.org/
|
|
|
|
Thanks to
|
|
the BBS people, Z-Netz, FIDO, ...
|
|
for heavily increasing my computer usage in 1992
|
|
KNF
|
|
for bringing me in touch with the internet as early as 1994
|
|
for providing a playground for technical people
|
|
for telling me about the existance of Linux!
|
|
Alan Cox, Alexey Kuznetsov, David Miller, Andi Kleen
|
|
for implementing (one of?) the world's best TCP/IP stacks
|
|
Paul 'Rusty' Russell
|
|
for starting the netfilter/iptables project
|
|
for trusting me to maintain it today
|
|
Astaro AG
|
|
for sponsoring parts of my netfilter work
|
|
linux-bangalore
|
|
for sponsoring my trip to this conference
|
|
|