net: TX_RING and packet mmap

Add a new packet socket feature that makes packet sockets more efficient
for transmission.

- It reduces the number of system calls through a PACKET_TX_RING mechanism
  modeled on PACKET_RX_RING (a circular buffer allocated in kernel space
  and mmapped from user space).

- It minimizes CPU copies by using fragmented SKBs (almost zero copy).

Signed-off-by: Johann Baudy <johann.baudy@gnu-log.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johann Baudy 2009-05-18 22:11:22 -07:00 committed by David S. Miller
parent f67f340849
commit 69e3c75f4d
4 changed files with 622 additions and 141 deletions

View File

@ -4,16 +4,18 @@
 This file documents the CONFIG_PACKET_MMAP option available with the PACKET
 socket interface on 2.4 and 2.6 kernels. This type of sockets is used for
-capture network traffic with utilities like tcpdump or any other that uses
-the libpcap library.
+capture network traffic with utilities like tcpdump or any other that needs
+raw access to network interface.
-You can find the latest version of this document at
+You can find the latest version of this document at:
     http://pusa.uv.es/~ulisses/packet_mmap/
-Please send me your comments to
+Howto can be found at:
+    http://wiki.gnu-log.net (packet_mmap)
+Please send your comments to
     Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
+    Johann Baudy <johann.baudy@gnu-log.net>
 -------------------------------------------------------------------------------
 + Why use PACKET_MMAP
@ -25,19 +27,24 @@ to capture each packet, it requires two if you want to get packet's
 timestamp (like libpcap always does).
 In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
-configurable circular buffer mapped in user space. This way reading packets just
-needs to wait for them, most of the time there is no need to issue a single
-system call. By using a shared buffer between the kernel and the user
-also has the benefit of minimizing packet copies.
+configurable circular buffer mapped in user space that can be used to either
+send or receive packets. This way reading packets just needs to wait for them,
+most of the time there is no need to issue a single system call. Concerning
+transmission, multiple packets can be sent through one system call to get the
+highest bandwidth.
+By using a shared buffer between the kernel and the user also has the benefit
+of minimizing packet copies.
-It's fine to use PACKET_MMAP to improve the performance of the capture process,
-but it isn't everything. At least, if you are capturing at high speeds (this
-is relative to the cpu speed), you should check if the device driver of your
-network interface card supports some sort of interrupt load mitigation or
-(even better) if it supports NAPI, also make sure it is enabled.
+It's fine to use PACKET_MMAP to improve the performance of the capture and
+transmission process, but it isn't everything. At least, if you are capturing
+at high speeds (this is relative to the cpu speed), you should check if the
+device driver of your network interface card supports some sort of interrupt
+load mitigation or (even better) if it supports NAPI, also make sure it is
+enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
+supported by devices of your network.
 --------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP
++ How to use CONFIG_PACKET_MMAP to improve capture process
 --------------------------------------------------------------------------------
 From the user standpoint, you should use the higher level libpcap library, which
@ -57,7 +64,7 @@ the low level details or want to improve libpcap by including PACKET_MMAP
 support.
 --------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP directly
++ How to use CONFIG_PACKET_MMAP directly to improve capture process
 --------------------------------------------------------------------------------
 From the system calls stand point, the use of PACKET_MMAP involves
@ -66,6 +73,7 @@ the following process:
 [setup]     socket() -------> creation of the capture socket
             setsockopt() ---> allocation of the circular buffer (ring)
+                              option: PACKET_RX_RING
             mmap() ---------> mapping of the allocated buffer to the
                               user process
@ -96,6 +104,65 @@ Next I will describe PACKET_MMAP settings and it's constraints,
 also the mapping of the circular buffer in the user process and
 the use of this buffer.
+--------------------------------------------------------------------------------
++ How to use CONFIG_PACKET_MMAP directly to improve transmission process
+--------------------------------------------------------------------------------
+Transmission process is similar to capture as shown below.
+[setup]          socket() -------> creation of the transmission socket
+                 setsockopt() ---> allocation of the circular buffer (ring)
+                                   option: PACKET_TX_RING
+                 bind() ---------> bind transmission socket with a network interface
+                 mmap() ---------> mapping of the allocated buffer to the
+                                   user process
+[transmission]   poll() ---------> wait for free packets (optional)
+                 send() ---------> send all packets that are set as ready in
+                                   the ring
+                                   The flag MSG_DONTWAIT can be used to return
+                                   before end of transfer.
+[shutdown]       close() --------> destruction of the transmission socket and
+                                   deallocation of all associated resources.
+Binding the socket to your network interface is mandatory (with zero copy) to
+know the header size of frames used in the circular buffer.
+As with capture, each frame contains two parts:
+ --------------------
+| struct tpacket_hdr | Header. It contains the status of
+|                    | this frame
+|--------------------|
+| data buffer        |
+.                    .  Data that will be sent over the network interface.
+.                    .
+ --------------------
+bind() associates the socket with your network interface through the
+sll_ifindex field of struct sockaddr_ll.
+Initialization example:
+struct sockaddr_ll my_addr;
+struct ifreq s_ifr;
+...
+strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
+/* get interface index of eth0 */
+ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
+/* fill sockaddr_ll struct to prepare binding */
+my_addr.sll_family = AF_PACKET;
+/* sll_protocol is expressed in network byte order */
+my_addr.sll_protocol = htons(ETH_P_ALL);
+my_addr.sll_ifindex = s_ifr.ifr_ifindex;
+/* bind socket to eth0 */
+bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
+A complete tutorial is available at: http://wiki.gnu-log.net/
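The setup sequence described in this section (socket, setsockopt with PACKET_TX_RING, bind, mmap) could be sketched roughly as follows. This is only an illustration: `tx_ring_open` is a hypothetical helper name, error handling is reduced to returning -1, and opening an AF_PACKET socket normally requires CAP_NET_RAW.

```c
#include <assert.h>
#include <arpa/inet.h>        /* htons */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* PACKET_TX_RING, struct tpacket_req, struct sockaddr_ll */
#include <net/if.h>           /* if_nametoindex */
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263
#endif

/* Hypothetical helper: returns the socket fd and stores the mapped ring in
 * *ring, or returns -1 (leaving *ring untouched) on any failure. */
int tx_ring_open(const char *ifname, const struct tpacket_req *req, void **ring)
{
    struct sockaddr_ll addr;
    unsigned int ifindex;
    void *map;
    int fd;

    assert(ifname != NULL && req != NULL && ring != NULL);

    ifindex = if_nametoindex(ifname);
    if (ifindex == 0)
        return -1;                       /* no such interface */

    fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;                       /* typically EPERM without CAP_NET_RAW */

    /* Allocate the circular buffer in the kernel. */
    if (setsockopt(fd, SOL_PACKET, PACKET_TX_RING, req, sizeof(*req)) < 0)
        goto fail;

    /* Bind to the interface so the kernel knows the link-layer header size. */
    memset(&addr, 0, sizeof(addr));
    addr.sll_family = AF_PACKET;
    addr.sll_protocol = htons(ETH_P_ALL);
    addr.sll_ifindex = (int)ifindex;
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto fail;

    /* Map the whole ring (all blocks) into this process. */
    map = mmap(NULL, (size_t)req->tp_block_size * req->tp_block_nr,
               PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto fail;

    *ring = map;
    return fd;

fail:
    close(fd);
    return -1;
}
```

A caller would then fill frames in the mapped ring and call send() on the returned descriptor; close() releases both the socket and the ring.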
 --------------------------------------------------------------------------------
 + PACKET_MMAP settings
 --------------------------------------------------------------------------------
@ -103,7 +170,10 @@ the use of this buffer.
 To setup PACKET_MMAP from user level code is done with a call like
+ - Capture process
      setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
+ - Transmission process
+     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
 The most significant argument in the previous call is the req parameter,
 this parameter must to have the following structure:
@ -117,11 +187,11 @@ this parameter must to have the following structure:
     };
 This structure is defined in /usr/include/linux/if_packet.h and establishes a
-circular buffer (ring) of unswappable memory mapped in the capture process.
+circular buffer (ring) of unswappable memory.
 Being mapped in the capture process allows reading the captured frames and
 related meta-information like timestamps without requiring a system call.
-Captured frames are grouped in blocks. Each block is a physically contiguous
+Frames are grouped in blocks. Each block is a physically contiguous
 region of memory and holds tp_block_size/tp_frame_size frames. The total number
 of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because
@ -336,6 +406,7 @@ struct tpacket_hdr). If this field is 0 means that the frame is ready
 to be used for the kernel, If not, there is a frame the user can read
 and the following flags apply:
++++ Capture process:
      from include/linux/if_packet.h
      #define TP_STATUS_COPY          2
@ -391,6 +462,37 @@ packets are in the ring:
 It doesn't incur in a race condition to first check the status value and
 then poll for frames.
+++ Transmission process
+Those defines are also used for transmission:
+     #define TP_STATUS_AVAILABLE        0 // Frame is available
+     #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
+     #define TP_STATUS_SENDING          2 // Frame is currently in transmission
+     #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct
+First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
+packet, the user fills a data buffer of an available frame, sets tp_len to
+current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
+This can be done on multiple frames. Once the user is ready to transmit, it
+calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
+forwarded to the network device. The kernel updates each status of sent
+frames with TP_STATUS_SENDING until the end of transfer.
+At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
+    header->tp_len = in_i_size;
+    header->tp_status = TP_STATUS_SEND_REQUEST;
+    retval = send(this->socket, NULL, 0, 0);
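The user-side half of this state machine can be sketched as below, assuming TPACKET_V1 frames. `tx_frame_queue` is a hypothetical helper; the payload offset follows the kernel's `ph.raw + po->tp_hdrlen - sizeof(struct sockaddr_ll)` in tpacket_fill_skb(), and the bounds check only protects the frame buffer (the kernel applies a stricter limit derived from the MTU).

```c
#include <assert.h>
#include <linux/if_packet.h>  /* struct tpacket_hdr, TPACKET_HDRLEN, TP_STATUS_* */
#include <stddef.h>
#include <string.h>

/* Where the kernel expects the payload inside a TPACKET_V1 TX frame. */
#define TX_DATA_OFFSET (TPACKET_HDRLEN - sizeof(struct sockaddr_ll))

/* Hypothetical helper: copy one packet into an available frame and mark it
 * for transmission. Returns 0 on success, -1 if the frame is not available
 * or the payload does not fit. */
int tx_frame_queue(void *frame, size_t frame_size, const void *payload, size_t len)
{
    struct tpacket_hdr *hdr = frame;

    assert(frame != NULL && payload != NULL);

    if (hdr->tp_status != TP_STATUS_AVAILABLE)
        return -1;      /* still owned by the kernel or already queued */
    if (frame_size < TX_DATA_OFFSET || len > frame_size - TX_DATA_OFFSET)
        return -1;

    memcpy((char *)frame + TX_DATA_OFFSET, payload, len);
    hdr->tp_len = (unsigned int)len;

    /* Precaution: make the payload and length visible before the status
     * change (GCC/Clang builtin full barrier). */
    __sync_synchronize();
    hdr->tp_status = TP_STATUS_SEND_REQUEST;
    return 0;
}
```

After queuing one or more frames this way, a single send() on the socket hands every TP_STATUS_SEND_REQUEST frame to the device.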
+The user can also use poll() to wait until a buffer is available again
+(that is, while its status is still TP_STATUS_SENDING):
+    struct pollfd pfd;
+    pfd.fd = fd;
+    pfd.revents = 0;
+    pfd.events = POLLOUT;
+    retval = poll(&pfd, 1, timeout);
 --------------------------------------------------------------------------------
 + THANKS
 --------------------------------------------------------------------------------

View File

@ -46,6 +46,8 @@ struct sockaddr_ll
 #define PACKET_VERSION			10
 #define PACKET_HDRLEN			11
 #define PACKET_RESERVE			12
+#define PACKET_TX_RING			13
+#define PACKET_LOSS			14
 struct tpacket_stats
 {
@ -63,14 +65,22 @@ struct tpacket_auxdata
 	__u16		tp_vlan_tci;
 };
+/* Rx ring - header status */
+#define TP_STATUS_KERNEL	0x0
+#define TP_STATUS_USER		0x1
+#define TP_STATUS_COPY		0x2
+#define TP_STATUS_LOSING	0x4
+#define TP_STATUS_CSUMNOTREADY	0x8
+/* Tx ring - header status */
+#define TP_STATUS_AVAILABLE	0x0
+#define TP_STATUS_SEND_REQUEST	0x1
+#define TP_STATUS_SENDING	0x2
+#define TP_STATUS_WRONG_FORMAT	0x4
 struct tpacket_hdr
 {
 	unsigned long	tp_status;
-#define TP_STATUS_KERNEL	0
-#define TP_STATUS_USER		1
-#define TP_STATUS_COPY		2
-#define TP_STATUS_LOSING	4
-#define TP_STATUS_CSUMNOTREADY	8
 	unsigned int	tp_len;
 	unsigned int	tp_snaplen;
 	unsigned short	tp_mac;

View File

@ -203,6 +203,9 @@ struct skb_shared_info {
 #ifdef CONFIG_HAS_DMA
 	dma_addr_t	dma_maps[MAX_SKB_FRAGS + 1];
 #endif
+	/* Intermediate layers must ensure that destructor_arg
+	 * remains valid until skb destructor */
+	void *		destructor_arg;
 };
 /* We divide dataref into two halves. The higher 16 bits hold references

View File

@ -39,6 +39,7 @@
  *					will simply extend the hardware address
  *					byte arrays at the end of sockaddr_ll
  *					and packet_mreq.
+ *		Johann Baudy	:	Added TX RING.
  *
  *		This program is free software; you can redistribute it and/or
  *		modify it under the terms of the GNU General Public License
@ -157,7 +158,25 @@ struct packet_mreq_max
 };
 #ifdef CONFIG_PACKET_MMAP
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing);
+static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
+		int closing, int tx_ring);
+struct packet_ring_buffer {
+	char *			*pg_vec;
+	unsigned int		head;
+	unsigned int		frames_per_block;
+	unsigned int		frame_size;
+	unsigned int		frame_max;
+	unsigned int		pg_vec_order;
+	unsigned int		pg_vec_pages;
+	unsigned int		pg_vec_len;
+	atomic_t		pending;
+};
+struct packet_sock;
+static int tpacket_snd(struct packet_sock *po, struct msghdr *msg);
 #endif
 static void packet_flush_mclist(struct sock *sk);
@ -167,11 +186,8 @@ struct packet_sock {
 	struct sock		sk;
 	struct tpacket_stats	stats;
 #ifdef CONFIG_PACKET_MMAP
-	char *			*pg_vec;
-	unsigned int		head;
-	unsigned int		frames_per_block;
-	unsigned int		frame_size;
-	unsigned int		frame_max;
+	struct packet_ring_buffer	rx_ring;
+	struct packet_ring_buffer	tx_ring;
 	int			copy_thresh;
 #endif
 	struct packet_type	prot_hook;
@ -185,12 +201,10 @@ struct packet_sock {
 	struct packet_mclist	*mclist;
 #ifdef CONFIG_PACKET_MMAP
 	atomic_t		mapped;
-	unsigned int		pg_vec_order;
-	unsigned int		pg_vec_pages;
-	unsigned int		pg_vec_len;
 	enum tpacket_versions	tp_version;
 	unsigned int		tp_hdrlen;
 	unsigned int		tp_reserve;
+	unsigned int		tp_loss:1;
 #endif
 };
@ -206,35 +220,6 @@ struct packet_skb_cb {
 #ifdef CONFIG_PACKET_MMAP
-static void *packet_lookup_frame(struct packet_sock *po, unsigned int position,
-				 int status)
-{
-	unsigned int pg_vec_pos, frame_offset;
-	union {
-		struct tpacket_hdr *h1;
-		struct tpacket2_hdr *h2;
-		void *raw;
-	} h;
-	pg_vec_pos = position / po->frames_per_block;
-	frame_offset = position % po->frames_per_block;
-	h.raw = po->pg_vec[pg_vec_pos] + (frame_offset * po->frame_size);
-	switch (po->tp_version) {
-	case TPACKET_V1:
-		if (status != (h.h1->tp_status ? TP_STATUS_USER :
-						TP_STATUS_KERNEL))
-			return NULL;
-		break;
-	case TPACKET_V2:
-		if (status != (h.h2->tp_status ? TP_STATUS_USER :
-						TP_STATUS_KERNEL))
-			return NULL;
-		break;
-	}
-	return h.raw;
-}
 static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 {
 	union {
@ -247,12 +232,88 @@ static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 	switch (po->tp_version) {
 	case TPACKET_V1:
 		h.h1->tp_status = status;
+		flush_dcache_page(virt_to_page(&h.h1->tp_status));
 		break;
 	case TPACKET_V2:
 		h.h2->tp_status = status;
+		flush_dcache_page(virt_to_page(&h.h2->tp_status));
 		break;
+	default:
+		printk(KERN_ERR "TPACKET version not supported\n");
+		BUG();
+	}
+	smp_wmb();
+}
+static int __packet_get_status(struct packet_sock *po, void *frame)
+{
+	union {
+		struct tpacket_hdr *h1;
+		struct tpacket2_hdr *h2;
+		void *raw;
+	} h;
+	smp_rmb();
+	h.raw = frame;
+	switch (po->tp_version) {
+	case TPACKET_V1:
+		flush_dcache_page(virt_to_page(&h.h1->tp_status));
+		return h.h1->tp_status;
+	case TPACKET_V2:
+		flush_dcache_page(virt_to_page(&h.h2->tp_status));
+		return h.h2->tp_status;
+	default:
+		printk(KERN_ERR "TPACKET version not supported\n");
+		BUG();
+		return 0;
 	}
 }
+static void *packet_lookup_frame(struct packet_sock *po,
+		struct packet_ring_buffer *rb,
+		unsigned int position,
+		int status)
+{
+	unsigned int pg_vec_pos, frame_offset;
+	union {
+		struct tpacket_hdr *h1;
+		struct tpacket2_hdr *h2;
+		void *raw;
+	} h;
+	pg_vec_pos = position / rb->frames_per_block;
+	frame_offset = position % rb->frames_per_block;
+	h.raw = rb->pg_vec[pg_vec_pos] + (frame_offset * rb->frame_size);
+	if (status != __packet_get_status(po, h.raw))
+		return NULL;
+	return h.raw;
+}
+static inline void *packet_current_frame(struct packet_sock *po,
+		struct packet_ring_buffer *rb,
+		int status)
+{
+	return packet_lookup_frame(po, rb, rb->head, status);
+}
+static inline void *packet_previous_frame(struct packet_sock *po,
+		struct packet_ring_buffer *rb,
+		int status)
+{
+	unsigned int previous = rb->head ? rb->head - 1 : rb->frame_max;
+	return packet_lookup_frame(po, rb, previous, status);
+}
+static inline void packet_increment_head(struct packet_ring_buffer *buff)
+{
+	buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
+}
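The same index arithmetic applies on the user side of the mapping. The sketch below is a user-space analogue of packet_lookup_frame(), packet_previous_frame() and packet_increment_head(); the names `ring_geom`, `ring_frame_at`, `ring_next` and `ring_prev` are hypothetical, and it assumes the mmap()ed region lays the blocks out one after another.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a mapped ring. */
struct ring_geom {
    unsigned int block_size;  /* tp_block_size */
    unsigned int frame_size;  /* tp_frame_size */
    unsigned int frame_max;   /* index of the last frame (tp_frame_nr - 1) */
};

/* Address of frame `position`: find its block, then its slot in the block.
 * Any space left at the end of a block is padding and is skipped. */
void *ring_frame_at(void *base, const struct ring_geom *g, unsigned int position)
{
    unsigned int frames_per_block, block, index;

    assert(g->frame_size != 0 && g->frame_size <= g->block_size);
    assert(position <= g->frame_max);

    frames_per_block = g->block_size / g->frame_size;
    block = position / frames_per_block;
    index = position % frames_per_block;
    return (char *)base + (size_t)block * g->block_size
                        + (size_t)index * g->frame_size;
}

/* Advance the head with wrap-around, as packet_increment_head() does. */
unsigned int ring_next(unsigned int head, unsigned int frame_max)
{
    return head != frame_max ? head + 1 : 0;
}

/* Step back with wrap-around, as packet_previous_frame() does. */
unsigned int ring_prev(unsigned int head, unsigned int frame_max)
{
    return head ? head - 1 : frame_max;
}
```

With 4096-byte blocks and 1536-byte frames there are two frames per block, so frame 2 starts at offset 4096 rather than 3072.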
 #endif
 static inline struct packet_sock *pkt_sk(struct sock *sk)
@ -648,7 +709,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
 		macoff = netoff - maclen;
 	}
-	if (macoff + snaplen > po->frame_size) {
+	if (macoff + snaplen > po->rx_ring.frame_size) {
 		if (po->copy_thresh &&
 		    atomic_read(&sk->sk_rmem_alloc) + skb->truesize <
 		    (unsigned)sk->sk_rcvbuf) {
@ -661,16 +722,16 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
 			if (copy_skb)
 				skb_set_owner_r(copy_skb, sk);
 		}
-		snaplen = po->frame_size - macoff;
+		snaplen = po->rx_ring.frame_size - macoff;
 		if ((int)snaplen < 0)
 			snaplen = 0;
 	}
 	spin_lock(&sk->sk_receive_queue.lock);
-	h.raw = packet_lookup_frame(po, po->head, TP_STATUS_KERNEL);
+	h.raw = packet_current_frame(po, &po->rx_ring, TP_STATUS_KERNEL);
 	if (!h.raw)
 		goto ring_is_full;
-	po->head = po->head != po->frame_max ? po->head+1 : 0;
+	packet_increment_head(&po->rx_ring);
 	po->stats.tp_packets++;
 	if (copy_skb) {
 		status |= TP_STATUS_COPY;
@ -727,7 +788,6 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
 	__packet_set_status(po, h.raw, status);
 	smp_mb();
 	{
 		struct page *p_start, *p_end;
 		u8 *h_end = h.raw + macoff + snaplen - 1;
@ -760,10 +820,249 @@ ring_is_full:
 	goto drop_n_restore;
 }
static void tpacket_destruct_skb(struct sk_buff *skb)
{
struct packet_sock *po = pkt_sk(skb->sk);
void * ph;
BUG_ON(skb == NULL);
if (likely(po->tx_ring.pg_vec)) {
ph = skb_shinfo(skb)->destructor_arg;
BUG_ON(__packet_get_status(po, ph) != TP_STATUS_SENDING);
BUG_ON(atomic_read(&po->tx_ring.pending) == 0);
atomic_dec(&po->tx_ring.pending);
__packet_set_status(po, ph, TP_STATUS_AVAILABLE);
}
sock_wfree(skb);
}
static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff * skb,
void * frame, struct net_device *dev, int size_max,
__be16 proto, unsigned char * addr)
{
union {
struct tpacket_hdr *h1;
struct tpacket2_hdr *h2;
void *raw;
} ph;
int to_write, offset, len, tp_len, nr_frags, len_max;
struct socket *sock = po->sk.sk_socket;
struct page *page;
void *data;
int err;
ph.raw = frame;
skb->protocol = proto;
skb->dev = dev;
skb->priority = po->sk.sk_priority;
skb_shinfo(skb)->destructor_arg = ph.raw;
switch (po->tp_version) {
case TPACKET_V2:
tp_len = ph.h2->tp_len;
break;
default:
tp_len = ph.h1->tp_len;
break;
}
if (unlikely(tp_len > size_max)) {
printk(KERN_ERR "packet size is too long (%d > %d)\n",
tp_len, size_max);
return -EMSGSIZE;
}
skb_reserve(skb, LL_RESERVED_SPACE(dev));
skb_reset_network_header(skb);
data = ph.raw + po->tp_hdrlen - sizeof(struct sockaddr_ll);
to_write = tp_len;
if (sock->type == SOCK_DGRAM) {
err = dev_hard_header(skb, dev, ntohs(proto), addr,
NULL, tp_len);
if (unlikely(err < 0))
return -EINVAL;
} else if (dev->hard_header_len ) {
/* net device doesn't like empty head */
if (unlikely(tp_len <= dev->hard_header_len)) {
printk(KERN_ERR "packet size is too short "
"(%d < %d)\n", tp_len,
dev->hard_header_len);
return -EINVAL;
}
skb_push(skb, dev->hard_header_len);
err = skb_store_bits(skb, 0, data,
dev->hard_header_len);
if (unlikely(err))
return err;
data += dev->hard_header_len;
to_write -= dev->hard_header_len;
}
err = -EFAULT;
page = virt_to_page(data);
offset = offset_in_page(data);
len_max = PAGE_SIZE - offset;
len = ((to_write > len_max) ? len_max : to_write);
skb->data_len = to_write;
skb->len += to_write;
skb->truesize += to_write;
atomic_add(to_write, &po->sk.sk_wmem_alloc);
while (likely(to_write)) {
nr_frags = skb_shinfo(skb)->nr_frags;
if (unlikely(nr_frags >= MAX_SKB_FRAGS)) {
printk(KERN_ERR "Packet exceed the number "
"of skb frags(%lu)\n",
MAX_SKB_FRAGS);
return -EFAULT;
}
flush_dcache_page(page);
get_page(page);
skb_fill_page_desc(skb,
nr_frags,
page++, offset, len);
to_write -= len;
offset = 0;
len_max = PAGE_SIZE;
len = ((to_write > len_max) ? len_max : to_write);
}
return tp_len;
}
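The page-splitting loop at the end of tpacket_fill_skb() can be illustrated in isolation. The sketch below only computes the fragment lengths that the loop would pass to skb_fill_page_desc(); `split_into_page_frags` is a hypothetical name and the page size is a parameter so that the arithmetic can be exercised outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split `to_write` bytes starting at address `addr`
 * into per-page fragment lengths. Returns the number of fragments written
 * to lens[], or -1 if more than max_frags would be needed (the kernel
 * returns -EFAULT when MAX_SKB_FRAGS is exceeded). */
int split_into_page_frags(uintptr_t addr, size_t to_write, size_t page_size,
                          size_t *lens, int max_frags)
{
    size_t offset, len_max, len;
    int nr_frags = 0;

    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    offset = addr & (page_size - 1);      /* offset_in_page(data) */
    len_max = page_size - offset;         /* room left in the first page */
    len = to_write > len_max ? len_max : to_write;

    while (to_write) {
        if (nr_frags >= max_frags)
            return -1;
        lens[nr_frags++] = len;
        to_write -= len;
        /* every following fragment starts at the beginning of a page */
        len_max = page_size;
        len = to_write > len_max ? len_max : to_write;
    }
    return nr_frags;
}
```

A 300-byte payload that starts 96 bytes before a page boundary therefore becomes two fragments of 96 and 204 bytes.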
static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
{
struct socket *sock;
struct sk_buff *skb;
struct net_device *dev;
__be16 proto;
int ifindex, err, reserve = 0;
void * ph;
struct sockaddr_ll *saddr=(struct sockaddr_ll *)msg->msg_name;
int tp_len, size_max;
unsigned char *addr;
int len_sum = 0;
int status = 0;
sock = po->sk.sk_socket;
mutex_lock(&po->pg_vec_lock);
err = -EBUSY;
if (saddr == NULL) {
ifindex = po->ifindex;
proto = po->num;
addr = NULL;
} else {
err = -EINVAL;
if (msg->msg_namelen < sizeof(struct sockaddr_ll))
goto out;
if (msg->msg_namelen < (saddr->sll_halen
+ offsetof(struct sockaddr_ll,
sll_addr)))
goto out;
ifindex = saddr->sll_ifindex;
proto = saddr->sll_protocol;
addr = saddr->sll_addr;
}
dev = dev_get_by_index(sock_net(&po->sk), ifindex);
err = -ENXIO;
if (unlikely(dev == NULL))
goto out;
reserve = dev->hard_header_len;
err = -ENETDOWN;
if (unlikely(!(dev->flags & IFF_UP)))
goto out_put;
size_max = po->tx_ring.frame_size
- sizeof(struct skb_shared_info)
- po->tp_hdrlen
- LL_ALLOCATED_SPACE(dev)
- sizeof(struct sockaddr_ll);
if (size_max > dev->mtu + reserve)
size_max = dev->mtu + reserve;
do {
ph = packet_current_frame(po, &po->tx_ring,
TP_STATUS_SEND_REQUEST);
if (unlikely(ph == NULL)) {
schedule();
continue;
}
status = TP_STATUS_SEND_REQUEST;
skb = sock_alloc_send_skb(&po->sk,
LL_ALLOCATED_SPACE(dev)
+ sizeof(struct sockaddr_ll),
0, &err);
if (unlikely(skb == NULL))
goto out_status;
tp_len = tpacket_fill_skb(po, skb, ph, dev, size_max, proto,
addr);
if (unlikely(tp_len < 0)) {
if (po->tp_loss) {
__packet_set_status(po, ph,
TP_STATUS_AVAILABLE);
packet_increment_head(&po->tx_ring);
kfree_skb(skb);
continue;
} else {
status = TP_STATUS_WRONG_FORMAT;
err = tp_len;
goto out_status;
}
}
skb->destructor = tpacket_destruct_skb;
__packet_set_status(po, ph, TP_STATUS_SENDING);
atomic_inc(&po->tx_ring.pending);
status = TP_STATUS_SEND_REQUEST;
err = dev_queue_xmit(skb);
if (unlikely(err > 0 && (err = net_xmit_errno(err)) != 0))
goto out_xmit;
packet_increment_head(&po->tx_ring);
len_sum += tp_len;
}
while (likely((ph != NULL) || ((!(msg->msg_flags & MSG_DONTWAIT))
&& (atomic_read(&po->tx_ring.pending))))
);
err = len_sum;
goto out_put;
out_xmit:
skb->destructor = sock_wfree;
atomic_dec(&po->tx_ring.pending);
out_status:
__packet_set_status(po, ph, status);
kfree_skb(skb);
out_put:
dev_put(dev);
out:
mutex_unlock(&po->pg_vec_lock);
return err;
}
 #endif
-static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
+static int packet_snd(struct socket *sock,
 			  struct msghdr *msg, size_t len)
 {
 	struct sock *sk = sock->sk;
@ -854,6 +1153,19 @@ out:
 	return err;
 }
+static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *msg, size_t len)
+{
+#ifdef CONFIG_PACKET_MMAP
+	struct sock *sk = sock->sk;
+	struct packet_sock *po = pkt_sk(sk);
+	if (po->tx_ring.pg_vec)
+		return tpacket_snd(po, msg);
+	else
+#endif
+		return packet_snd(sock, msg, len);
+}
 /*
  *	Close a PACKET socket. This is fairly simple. We immediately go
  *	to 'closed' state and remove our protocol entry in the device list.
@ -864,6 +1176,9 @@ static int packet_release(struct socket *sock)
 	struct sock *sk = sock->sk;
 	struct packet_sock *po;
 	struct net *net;
+#ifdef CONFIG_PACKET_MMAP
+	struct tpacket_req req;
+#endif
 	if (!sk)
 		return 0;
@ -893,11 +1208,13 @@ static int packet_release(struct socket *sock)
 	packet_flush_mclist(sk);
 #ifdef CONFIG_PACKET_MMAP
-	if (po->pg_vec) {
-		struct tpacket_req req;
-		memset(&req, 0, sizeof(req));
-		packet_set_ring(sk, &req, 1);
-	}
+	memset(&req, 0, sizeof(req));
+	if (po->rx_ring.pg_vec)
+		packet_set_ring(sk, &req, 1, 0);
+	if (po->tx_ring.pg_vec)
+		packet_set_ring(sk, &req, 1, 1);
 #endif
 	/*
@ -1391,7 +1708,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 	if (level != SOL_PACKET)
 		return -ENOPROTOOPT;
-	switch(optname) {
+	switch (optname) {
 	case PACKET_ADD_MEMBERSHIP:
 	case PACKET_DROP_MEMBERSHIP:
 	{
@ -1415,6 +1732,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 #ifdef CONFIG_PACKET_MMAP
 	case PACKET_RX_RING:
+	case PACKET_TX_RING:
 	{
 		struct tpacket_req req;
@ -1422,7 +1740,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 			return -EINVAL;
 		if (copy_from_user(&req,optval,sizeof(req)))
 			return -EFAULT;
-		return packet_set_ring(sk, &req, 0);
+		return packet_set_ring(sk, &req, 0, optname == PACKET_TX_RING);
 	}
 	case PACKET_COPY_THRESH:
 	{
@ -1442,7 +1760,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		if (optlen != sizeof(val))
 			return -EINVAL;
-		if (po->pg_vec)
+		if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
 			return -EBUSY;
 		if (copy_from_user(&val, optval, sizeof(val)))
 			return -EFAULT;
@ -1461,13 +1779,26 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		if (optlen != sizeof(val))
 			return -EINVAL;
-		if (po->pg_vec)
+		if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
 			return -EBUSY;
 		if (copy_from_user(&val, optval, sizeof(val)))
 			return -EFAULT;
 		po->tp_reserve = val;
 		return 0;
 	}
+	case PACKET_LOSS:
+	{
+		unsigned int val;
+		if (optlen != sizeof(val))
+			return -EINVAL;
+		if (po->rx_ring.pg_vec || po->tx_ring.pg_vec)
+			return -EBUSY;
+		if (copy_from_user(&val, optval, sizeof(val)))
+			return -EFAULT;
+		po->tp_loss = !!val;
+		return 0;
+	}
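From user space the new option would be driven with setsockopt() and getsockopt(). The wrappers below are a hypothetical sketch (`tx_ring_set_loss` and `tx_ring_get_loss` are made-up names). As the -EBUSY check above shows, the option has to be set before either ring is created; when it is enabled, tpacket_snd() skips a malformed frame instead of stopping with TP_STATUS_WRONG_FORMAT.

```c
#include <assert.h>
#include <linux/if_packet.h>  /* PACKET_LOSS */
#include <sys/socket.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263
#endif

/* Hypothetical wrapper: enable or disable PACKET_LOSS on a packet socket.
 * Returns 0 on success, -1 with errno set otherwise (EBUSY if a ring
 * already exists on the socket). */
int tx_ring_set_loss(int fd, int enable)
{
    unsigned int val = enable ? 1u : 0u;

    return setsockopt(fd, SOL_PACKET, PACKET_LOSS, &val, sizeof(val));
}

/* Hypothetical wrapper: read the flag back. Returns 0 or 1, or -1 on error. */
int tx_ring_get_loss(int fd)
{
    unsigned int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, SOL_PACKET, PACKET_LOSS, &val, &len) < 0)
        return -1;
    assert(len == sizeof(val));
    return (int)val;
}
```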
 #endif
 	case PACKET_AUXDATA:
 	{
@ -1517,7 +1848,7 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	if (len < 0)
 		return -EINVAL;
-	switch(optname) {
+	switch (optname) {
 	case PACKET_STATISTICS:
 		if (len > sizeof(struct tpacket_stats))
 			len = sizeof(struct tpacket_stats);
@ -1573,6 +1904,12 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 		val = po->tp_reserve;
 		data = &val;
 		break;
+	case PACKET_LOSS:
+		if (len > sizeof(unsigned int))
+			len = sizeof(unsigned int);
+		val = po->tp_loss;
+		data = &val;
+		break;
 #endif
 	default:
 		return -ENOPROTOOPT;
@ -1643,7 +1980,7 @@ static int packet_ioctl(struct socket *sock, unsigned int cmd,
 {
 	struct sock *sk = sock->sk;
-	switch(cmd) {
+	switch (cmd) {
 	case SIOCOUTQ:
 	{
 		int amount = atomic_read(&sk->sk_wmem_alloc);
@ -1705,13 +2042,17 @@ static unsigned int packet_poll(struct file * file, struct socket *sock,
 	unsigned int mask = datagram_poll(file, sock, wait);
 	spin_lock_bh(&sk->sk_receive_queue.lock);
-	if (po->pg_vec) {
-		unsigned last = po->head ? po->head-1 : po->frame_max;
-		if (packet_lookup_frame(po, last, TP_STATUS_USER))
+	if (po->rx_ring.pg_vec) {
+		if (!packet_previous_frame(po, &po->rx_ring, TP_STATUS_KERNEL))
 			mask |= POLLIN | POLLRDNORM;
 	}
 	spin_unlock_bh(&sk->sk_receive_queue.lock);
+	spin_lock_bh(&sk->sk_write_queue.lock);
+	if (po->tx_ring.pg_vec) {
+		if (packet_current_frame(po, &po->tx_ring, TP_STATUS_AVAILABLE))
+			mask |= POLLOUT | POLLWRNORM;
+	}
+	spin_unlock_bh(&sk->sk_write_queue.lock);
 	return mask;
 }
@ -1788,21 +2129,33 @@ out_free_pgvec:
 	goto out;
 }
-static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing)
+static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
+		int closing, int tx_ring)
 {
 	char **pg_vec = NULL;
 	struct packet_sock *po = pkt_sk(sk);
 	int was_running, order = 0;
+	struct packet_ring_buffer *rb;
+	struct sk_buff_head *rb_queue;
 	__be16 num;
-	int err = 0;
+	int err;
+	rb = tx_ring ? &po->tx_ring : &po->rx_ring;
+	rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
+	err = -EBUSY;
+	if (!closing) {
+		if (atomic_read(&po->mapped))
+			goto out;
+		if (atomic_read(&rb->pending))
+			goto out;
+	}
 	if (req->tp_block_nr) {
-		int i;
 		/* Sanity tests and some calculations */
+		err = -EBUSY;
-		if (unlikely(po->pg_vec))
-			return -EBUSY;
+		if (unlikely(rb->pg_vec))
+			goto out;
 		switch (po->tp_version) {
 		case TPACKET_V1:
@ -1813,42 +2166,35 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing
break; break;
} }
err = -EINVAL;
if (unlikely((int)req->tp_block_size <= 0)) if (unlikely((int)req->tp_block_size <= 0))
return -EINVAL; goto out;
if (unlikely(req->tp_block_size & (PAGE_SIZE - 1))) if (unlikely(req->tp_block_size & (PAGE_SIZE - 1)))
return -EINVAL; goto out;
if (unlikely(req->tp_frame_size < po->tp_hdrlen + if (unlikely(req->tp_frame_size < po->tp_hdrlen +
po->tp_reserve)) po->tp_reserve))
return -EINVAL; goto out;
if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1))) if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1)))
return -EINVAL; goto out;
po->frames_per_block = req->tp_block_size/req->tp_frame_size; rb->frames_per_block = req->tp_block_size/req->tp_frame_size;
if (unlikely(po->frames_per_block <= 0)) if (unlikely(rb->frames_per_block <= 0))
return -EINVAL; goto out;
if (unlikely((po->frames_per_block * req->tp_block_nr) != if (unlikely((rb->frames_per_block * req->tp_block_nr) !=
req->tp_frame_nr)) req->tp_frame_nr))
return -EINVAL; goto out;
err = -ENOMEM; err = -ENOMEM;
order = get_order(req->tp_block_size); order = get_order(req->tp_block_size);
pg_vec = alloc_pg_vec(req, order); pg_vec = alloc_pg_vec(req, order);
if (unlikely(!pg_vec)) if (unlikely(!pg_vec))
goto out; goto out;
}
for (i = 0; i < req->tp_block_nr; i++) { /* Done */
void *ptr = pg_vec[i]; else {
int k; err = -EINVAL;
for (k = 0; k < po->frames_per_block; k++) {
__packet_set_status(po, ptr, TP_STATUS_KERNEL);
ptr += req->tp_frame_size;
}
}
/* Done */
} else {
if (unlikely(req->tp_frame_nr)) if (unlikely(req->tp_frame_nr))
return -EINVAL; goto out;
} }
lock_sock(sk); lock_sock(sk);
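The ring geometry checks in packet_set_ring() above can be mirrored in user space before calling setsockopt(). This is a minimal sketch, not the kernel's code. The struct ring_req type, the check_ring_req() helper and the min_frame parameter (standing in for po->tp_hdrlen + po->tp_reserve) are hypothetical names introduced for illustration. Only the order of the tests and the value of TPACKET_ALIGNMENT (16) are taken from the kernel sources.

```c
#include <errno.h>

/* Hypothetical stand-in for struct tpacket_req (illustration only). */
struct ring_req {
	unsigned int block_size;	/* tp_block_size */
	unsigned int block_nr;		/* tp_block_nr   */
	unsigned int frame_size;	/* tp_frame_size */
	unsigned int frame_nr;		/* tp_frame_nr   */
};

#define SKETCH_ALIGNMENT 16u	/* same value as TPACKET_ALIGNMENT */

/*
 * Returns 0 if the request passes the same sanity tests as the hunk above,
 * -EINVAL otherwise. page_size must be a power of two. min_frame is an
 * assumed stand-in for tp_hdrlen + tp_reserve.
 */
static int check_ring_req(const struct ring_req *req,
			  unsigned int page_size, unsigned int min_frame)
{
	unsigned int frames_per_block;

	/* No blocks means "tear the ring down": frame_nr must be 0 too. */
	if (req->block_nr == 0)
		return req->frame_nr ? -EINVAL : 0;

	if ((int)req->block_size <= 0)
		return -EINVAL;
	if (req->block_size & (page_size - 1))		/* page multiple */
		return -EINVAL;
	if (req->frame_size == 0 || req->frame_size < min_frame)
		return -EINVAL;
	if (req->frame_size & (SKETCH_ALIGNMENT - 1))	/* frame alignment */
		return -EINVAL;

	frames_per_block = req->block_size / req->frame_size;
	if (frames_per_block == 0)			/* frame > block */
		return -EINVAL;
	if (frames_per_block * req->block_nr != req->frame_nr)
		return -EINVAL;

	return 0;
}
```

A caller would run this check once per ring (RX and TX) with the system page size, and only then issue the corresponding setsockopt() request.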
@ -1872,23 +2218,24 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req, int closing
if (closing || atomic_read(&po->mapped) == 0) { if (closing || atomic_read(&po->mapped) == 0) {
err = 0; err = 0;
#define XC(a, b) ({ __typeof__ ((a)) __t; __t = (a); (a) = (b); __t; }) #define XC(a, b) ({ __typeof__ ((a)) __t; __t = (a); (a) = (b); __t; })
spin_lock_bh(&rb_queue->lock);
pg_vec = XC(rb->pg_vec, pg_vec);
rb->frame_max = (req->tp_frame_nr - 1);
rb->head = 0;
rb->frame_size = req->tp_frame_size;
spin_unlock_bh(&rb_queue->lock);
spin_lock_bh(&sk->sk_receive_queue.lock); order = XC(rb->pg_vec_order, order);
pg_vec = XC(po->pg_vec, pg_vec); req->tp_block_nr = XC(rb->pg_vec_len, req->tp_block_nr);
po->frame_max = (req->tp_frame_nr - 1);
po->head = 0;
po->frame_size = req->tp_frame_size;
spin_unlock_bh(&sk->sk_receive_queue.lock);
order = XC(po->pg_vec_order, order); rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE;
req->tp_block_nr = XC(po->pg_vec_len, req->tp_block_nr); po->prot_hook.func = (po->rx_ring.pg_vec) ?
tpacket_rcv : packet_rcv;
po->pg_vec_pages = req->tp_block_size/PAGE_SIZE; skb_queue_purge(rb_queue);
po->prot_hook.func = po->pg_vec ? tpacket_rcv : packet_rcv;
skb_queue_purge(&sk->sk_receive_queue);
#undef XC #undef XC
if (atomic_read(&po->mapped)) if (atomic_read(&po->mapped))
printk(KERN_DEBUG "packet_mmap: vma is busy: %d\n", atomic_read(&po->mapped)); printk(KERN_DEBUG "packet_mmap: vma is busy: %d\n",
atomic_read(&po->mapped));
} }
mutex_unlock(&po->pg_vec_lock); mutex_unlock(&po->pg_vec_lock);
@ -1909,11 +2256,13 @@ out:
return err; return err;
} }
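With this patch, packet_mmap() accepts a single mapping whose length must equal the sum of all configured rings. Pages are inserted RX ring first, then TX ring. A user-space caller therefore has to compute the total length and the offset of the TX ring itself. The sketch below shows that arithmetic. The struct ring_geom type and both helper names are hypothetical, and it assumes each ring occupies block_size * block_nr bytes, which matches pg_vec_len * pg_vec_pages * PAGE_SIZE when block_size is a page multiple.

```c
#include <stddef.h>

/* Hypothetical description of one ring as requested by user space. */
struct ring_geom {
	unsigned int block_size;	/* tp_block_size, page multiple */
	unsigned int block_nr;		/* tp_block_nr, 0 if ring unused */
};

/* Length to pass to mmap(): both rings, back to back. */
static size_t packet_mmap_len(const struct ring_geom *rx,
			      const struct ring_geom *tx)
{
	return (size_t)rx->block_size * rx->block_nr +
	       (size_t)tx->block_size * tx->block_nr;
}

/* Offset of the TX ring inside the mapping: it follows the RX ring. */
static size_t tx_ring_offset(const struct ring_geom *rx)
{
	return (size_t)rx->block_size * rx->block_nr;
}
```

If only a TX ring is configured, the RX geometry is all zeroes and the TX ring starts at offset 0 of the mapping.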
static int packet_mmap(struct file *file, struct socket *sock, struct vm_area_struct *vma) static int packet_mmap(struct file *file, struct socket *sock,
struct vm_area_struct *vma)
{ {
struct sock *sk = sock->sk; struct sock *sk = sock->sk;
struct packet_sock *po = pkt_sk(sk); struct packet_sock *po = pkt_sk(sk);
unsigned long size; unsigned long size, expected_size;
struct packet_ring_buffer *rb;
unsigned long start; unsigned long start;
int err = -EINVAL; int err = -EINVAL;
int i; int i;
@ -1921,26 +2270,43 @@ static int packet_mmap(struct file *file, struct socket *sock, struct vm_area_st
if (vma->vm_pgoff) if (vma->vm_pgoff)
return -EINVAL; return -EINVAL;
size = vma->vm_end - vma->vm_start;
mutex_lock(&po->pg_vec_lock); mutex_lock(&po->pg_vec_lock);
if (po->pg_vec == NULL)
expected_size = 0;
for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
if (rb->pg_vec) {
expected_size += rb->pg_vec_len
* rb->pg_vec_pages
* PAGE_SIZE;
}
}
if (expected_size == 0)
goto out; goto out;
if (size != po->pg_vec_len*po->pg_vec_pages*PAGE_SIZE)
size = vma->vm_end - vma->vm_start;
if (size != expected_size)
goto out; goto out;
start = vma->vm_start; start = vma->vm_start;
for (i = 0; i < po->pg_vec_len; i++) { for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
struct page *page = virt_to_page(po->pg_vec[i]); if (rb->pg_vec == NULL)
int pg_num; continue;
for (pg_num = 0; pg_num < po->pg_vec_pages; pg_num++, page++) { for (i = 0; i < rb->pg_vec_len; i++) {
err = vm_insert_page(vma, start, page); struct page *page = virt_to_page(rb->pg_vec[i]);
if (unlikely(err)) int pg_num;
goto out;
start += PAGE_SIZE; for (pg_num = 0; pg_num < rb->pg_vec_pages;
pg_num++, page++) {
err = vm_insert_page(vma, start, page);
if (unlikely(err))
goto out;
start += PAGE_SIZE;
}
} }
} }
atomic_inc(&po->mapped); atomic_inc(&po->mapped);
vma->vm_ops = &packet_mmap_ops; vma->vm_ops = &packet_mmap_ops;
err = 0; err = 0;