Merge branch 'io_uring-zerocopy-send' of git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux

Pavel Begunkov says:

====================
io_uring zerocopy send

The patchset implements io_uring zerocopy send. It works with both registered
and normal buffers; mixing the two is allowed but not recommended. Apart from
the usual request completions, just as with MSG_ZEROCOPY, io_uring separately
notifies userspace when buffers are freed and can be reused (see API design
below); these notifications are delivered into io_uring's Completion Queue.
The "buffer-free" notifications are not necessarily per request: userspace
controls the granularity and can explicitly attach a number of requests to a
single notification. The series also adds some internal optimisations for
registered buffers, such as removing page referencing.

From the kernel networking perspective there are two main changes. The first
is passing ubuf_info into the network layer from io_uring (inside an in-kernel
struct msghdr). This allows extra optimisations, e.g. ubuf_info caching on the
io_uring side, and also helps to avoid cross-referencing and synchronisation
problems. The second is an optional optimisation that removes page referencing
for requests with registered buffers.
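
For illustration, a minimal sketch of how an in-kernel sender might drive the
new msghdr fields. The example_* names are hypothetical, and ubuf_info
lifetime management and error handling are elided; only msg_ubuf, the
completion callback and sock_sendmsg() are the real interfaces touched here:

/* illustrative only: not part of the series' patches */
static void example_zc_done(struct sk_buff *skb, struct ubuf_info *uarg,
			    bool success)
{
	/* the kernel is done with the pages; tell userspace they are
	 * reusable, e.g. by posting a CQE
	 */
}

static int example_send_zc(struct socket *sock, struct ubuf_info *uarg,
			   struct iov_iter *iter)
{
	struct msghdr msg = {
		.msg_flags = MSG_ZEROCOPY | MSG_DONTWAIT,
		.msg_ubuf  = uarg,	/* field added by this series */
	};

	uarg->callback = example_zc_done;
	uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
	msg.msg_iter = *iter;
	return sock_sendmsg(sock, &msg);
}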

Benchmarking UDP with an optimised version of the selftest (see [1]), which
sends a batch of requests, waits for completions and repeats: the "zc + flush"
column posts one additional "buffer-free" notification per request, while
plain "zc" doesn't post buffer notifications at all.

NIC (requests / second):
IO size | non-zc    | zc             | zc + flush
4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)

dummy (requests / second):
IO size | non-zc    | zc             | zc + flush
8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)

Previously it also brought a massive performance speedup compared to the
msg_zerocopy tool (see [3]), which is probably not that interesting. There is
also a bunch of additional refcounting optimisations that were omitted from
the series for simplicity; since they don't change the picture drastically,
they will be sent as a follow-up, as will flushing optimisations closing the
performance gap between the two last columns.

For TCP on localhost (with hacks enabling localhost zerocopy) and including
additional overhead for receive:

IO size | non-zc    | zc
1200    | 4174      | 4148
4096    | 7597      | 11228

With a real NIC and 1200-byte IO, zc is ~5-10% worse than non-zc; the omitted
optimisations may help somewhat, and 4000 bytes should look better, but this
couldn't be tested properly because of setup problems.

Links:

  liburing (benchmark + tests):
  [1] https://github.com/isilence/liburing/tree/zc_v4

  kernel repo:
  [2] https://github.com/isilence/linux/tree/zc_v4

  RFC v1:
  [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/

  RFC v2:
  https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/

  Net patches are based on:
  git@github.com:isilence/linux.git zc_v4-net-base
  or
  https://github.com/isilence/linux/tree/zc_v4-net-base

API design overview:

  The series introduces the io_uring concept of notifiers. From the userspace
  perspective a notifier is an entity to which one or more requests can be
  bound and which can then be asked to flush. Flushing a notifier makes it
  impossible to attach new requests to it, and instructs the notifier to post
  a completion once all requests attached to it have completed and the kernel
  no longer needs the buffers.

  Notifications are stored in notification slots, which should be registered
  as an array in io_uring. Each slot stores only one notifier at any given
  moment. Flushing removes the notifier from its slot, and the slot
  automatically replaces it with a new one. All operations on notifiers are
  done by specifying the index of the slot they currently occupy.

  When registering notifications, userspace specifies a u64 tag for each
  slot, which is copied into notification completion entries as
  cqe::user_data. cqe::res is 0, and cqe::flags holds a wrap-around u32
  sequence number counting the notifiers of a slot.
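
  As a rough sketch of the intended userspace flow (every *notif* helper and
  struct below is a placeholder, not a settled uAPI; see the zc_v4 liburing
  branch [1] for the actual helpers):

	/* pseudocode; assumes an initialised 'struct io_uring ring' */
	struct io_uring_notification_slot slots[] = {
		{ .tag = 0xcafe },	/* becomes cqe::user_data */
	};

	/* hypothetical registration of one notification slot */
	io_uring_register_notifications(&ring, 1, slots);

	for (int i = 0; i < nr_reqs; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		/* hypothetical prep: zerocopy send bound to slot 0 */
		io_uring_prep_sendzc(sqe, sockfd, buf, len, 0, /*slot=*/0);
	}
	io_uring_submit(&ring);

	/* hypothetical flush of slot 0: no new requests may attach, and
	 * once all attached requests complete, a CQE is posted with
	 * user_data == 0xcafe, res == 0 and flags holding the slot's
	 * sequence number
	 */
	io_uring_flush_notification(&ring, /*slot=*/0);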

====================

Link: https://lore.kernel.org/r/cover.1657643355.git.asml.silence@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 7f9eee196e by Jakub Kicinski <kuba@kernel.org>, 2022-07-19 14:13:33 -07:00
9 changed files with 191 additions and 64 deletions

--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -509,10 +509,18 @@ enum {
	 * charged to the kernel memory.
	 */
	SKBFL_PURE_ZEROCOPY = BIT(2),
+
+	SKBFL_DONT_ORPHAN = BIT(3),
+
+	/* page references are managed by the ubuf_info, so it's safe to
+	 * use frags only up until ubuf_info is released
+	 */
+	SKBFL_MANAGED_FRAG_REFS = BIT(4),
 };

 #define SKBFL_ZEROCOPY_FRAG	(SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG)
-#define SKBFL_ALL_ZEROCOPY	(SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY)
+#define SKBFL_ALL_ZEROCOPY	(SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY | \
+				 SKBFL_DONT_ORPHAN | SKBFL_MANAGED_FRAG_REFS)

 /*
  * The callback notifies userspace to release buffers when skb DMA is done in
@@ -1596,13 +1604,14 @@ void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
 void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
			   bool success);

-int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
-			    struct iov_iter *from, size_t length);
+int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
+			    struct sk_buff *skb, struct iov_iter *from,
+			    size_t length);

 static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
					  struct msghdr *msg, int len)
 {
-	return __zerocopy_sg_from_iter(skb->sk, skb, &msg->msg_iter, len);
+	return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
 }

 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
@@ -1629,6 +1638,11 @@ static inline bool skb_zcopy_pure(const struct sk_buff *skb)
	return skb_shinfo(skb)->flags & SKBFL_PURE_ZEROCOPY;
 }

+static inline bool skb_zcopy_managed(const struct sk_buff *skb)
+{
+	return skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAG_REFS;
+}
+
 static inline bool skb_pure_zcopy_same(const struct sk_buff *skb1,
				       const struct sk_buff *skb2)
 {
@@ -1703,6 +1717,14 @@ static inline void skb_zcopy_clear(struct sk_buff *skb, bool zerocopy_success)
	}
 }

+void __skb_zcopy_downgrade_managed(struct sk_buff *skb);
+
+static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
+{
+	if (unlikely(skb_zcopy_managed(skb)))
+		__skb_zcopy_downgrade_managed(skb);
+}
+
 static inline void skb_mark_not_on_list(struct sk_buff *skb)
 {
	skb->next = NULL;
@@ -2351,6 +2373,22 @@ static inline unsigned int skb_pagelen(const struct sk_buff *skb)
	return skb_headlen(skb) + __skb_pagelen(skb);
 }

+static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
+					      int i, struct page *page,
+					      int off, int size)
+{
+	skb_frag_t *frag = &shinfo->frags[i];
+
+	/*
+	 * Propagate page pfmemalloc to the skb if we can. The problem is
+	 * that not all callers have unique ownership of the page but rely
+	 * on page_is_pfmemalloc doing the right thing(tm).
+	 */
+	frag->bv_page = page;
+	frag->bv_offset = off;
+	skb_frag_size_set(frag, size);
+}
+
 /**
  * skb_len_add - adds a number to len fields of skb
  * @skb: buffer to add len to
@@ -2379,17 +2417,7 @@ static inline void skb_len_add(struct sk_buff *skb, int delta)
 static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
					struct page *page, int off, int size)
 {
-	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-
-	/*
-	 * Propagate page pfmemalloc to the skb if we can. The problem is
-	 * that not all callers have unique ownership of the page but rely
-	 * on page_is_pfmemalloc doing the right thing(tm).
-	 */
-	frag->bv_page = page;
-	frag->bv_offset = off;
-	skb_frag_size_set(frag, size);
-
+	__skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
	page = compound_head(page);
	if (page_is_pfmemalloc(page))
		skb->pfmemalloc	= true;
@@ -3019,8 +3047,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 {
	if (likely(!skb_zcopy(skb)))
		return 0;
-	if (!skb_zcopy_is_nouarg(skb) &&
-	    skb_uarg(skb)->callback == msg_zerocopy_callback)
+	if (skb_shinfo(skb)->flags & SKBFL_DONT_ORPHAN)
		return 0;
	return skb_copy_ubufs(skb, gfp_mask);
 }
@@ -3333,7 +3360,10 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
  */
 static inline void skb_frag_unref(struct sk_buff *skb, int f)
 {
-	__skb_frag_unref(&skb_shinfo(skb)->frags[f], skb->pp_recycle);
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+	if (!skb_zcopy_managed(skb))
+		__skb_frag_unref(&shinfo->frags[f], skb->pp_recycle);
 }

 /**

--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -14,6 +14,8 @@ struct file;
 struct pid;
 struct cred;
 struct socket;
+struct sock;
+struct sk_buff;

 #define __sockaddr_check_size(size)	\
	BUILD_BUG_ON(((size) > sizeof(struct __kernel_sockaddr_storage)))
@@ -69,6 +71,9 @@ struct msghdr {
	unsigned int	msg_flags;	/* flags on received message */
	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
+	struct ubuf_info *msg_ubuf;
+	int (*sg_from_iter)(struct sock *sk, struct sk_buff *skb,
+			    struct iov_iter *from, size_t length);
 };

 struct user_msghdr {

--- a/net/compat.c
+++ b/net/compat.c
@@ -80,6 +80,7 @@ int __get_compat_msghdr(struct msghdr *kmsg,
		return -EMSGSIZE;

	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
	*ptr = msg.msg_iov;
	*len = msg.msg_iovlen;
	return 0;

--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -610,10 +610,16 @@ fault:
 }
 EXPORT_SYMBOL(skb_copy_datagram_from_iter);

-int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
-			    struct iov_iter *from, size_t length)
+int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
+			    struct sk_buff *skb, struct iov_iter *from,
+			    size_t length)
 {
-	int frag = skb_shinfo(skb)->nr_frags;
+	int frag;
+
+	if (msg && msg->sg_from_iter)
+		return msg->sg_from_iter(sk, skb, from, length);
+
+	frag = skb_shinfo(skb)->nr_frags;

	while (length && iov_iter_count(from)) {
		struct page *pages[MAX_SKB_FRAGS];
@@ -699,7 +705,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
	if (skb_copy_datagram_from_iter(skb, 0, from, copy))
		return -EFAULT;

-	return __zerocopy_sg_from_iter(NULL, skb, from, ~0U);
+	return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U);
 }
 EXPORT_SYMBOL(zerocopy_sg_from_iter);
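
The msg->sg_from_iter dispatch above is the whole contract: when set, frag
filling is delegated to the sender. A minimal sketch of such a provider for
bvec-backed iterators follows; it assumes the caller pins the pages until the
ubuf_info is released (as io_uring does for registered buffers), and the
function name is illustrative, not from the series:

static int example_sg_from_iter(struct sock *sk, struct sk_buff *skb,
				struct iov_iter *from, size_t length)
{
	struct skb_shared_info *shinfo = skb_shinfo(skb);
	const struct bio_vec *bv = from->bvec;
	size_t skip = from->iov_offset;
	int frag = shinfo->nr_frags;
	size_t copied = 0;

	if (!iov_iter_is_bvec(from))
		return -EOPNOTSUPP;
	if (!frag)
		shinfo->flags |= SKBFL_MANAGED_FRAG_REFS;
	else if (!skb_zcopy_managed(skb))
		return -EFAULT;	/* don't mix managed and refcounted frags */

	length = min_t(size_t, length, iov_iter_count(from));
	while (length && frag < MAX_SKB_FRAGS) {
		size_t seg = min_t(size_t, length, bv->bv_len - skip);

		/* no get_page(): references are managed by the ubuf_info */
		__skb_fill_page_desc_noacc(shinfo, frag++, bv->bv_page,
					   bv->bv_offset + skip, seg);
		copied += seg;
		length -= seg;
		skip = 0;
		bv++;
	}
	shinfo->nr_frags = frag;
	skb_len_add(skb, copied);
	iov_iter_advance(from, copied);
	return length ? -EMSGSIZE : 0;
}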

--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -669,11 +669,18 @@ static void skb_release_data(struct sk_buff *skb)
			      &shinfo->dataref))
		goto exit;

-	skb_zcopy_clear(skb, true);
+	if (skb_zcopy(skb)) {
+		bool skip_unref = shinfo->flags & SKBFL_MANAGED_FRAG_REFS;
+
+		skb_zcopy_clear(skb, true);
+		if (skip_unref)
+			goto free_head;
+	}

	for (i = 0; i < shinfo->nr_frags; i++)
		__skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);

+free_head:
	if (shinfo->frag_list)
		kfree_skb_list(shinfo->frag_list);
@@ -898,7 +905,10 @@ EXPORT_SYMBOL(skb_dump);
  */
 void skb_tx_error(struct sk_buff *skb)
 {
-	skb_zcopy_clear(skb, true);
+	if (skb) {
+		skb_zcopy_downgrade_managed(skb);
+		skb_zcopy_clear(skb, true);
+	}
 }
 EXPORT_SYMBOL(skb_tx_error);
@@ -1196,7 +1206,7 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
	uarg->len = 1;
	uarg->bytelen = size;
	uarg->zerocopy = 1;
-	uarg->flags = SKBFL_ZEROCOPY_FRAG;
+	uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
	refcount_set(&uarg->refcnt, 1);
	sock_hold(sk);
@@ -1215,6 +1225,10 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
	const u32 byte_limit = 1 << 19;		/* limit to a few TSO */
	u32 bytelen, next;

+	/* there might be non MSG_ZEROCOPY users */
+	if (uarg->callback != msg_zerocopy_callback)
+		return NULL;
+
	/* realloc only when socket is locked (TCP, UDP cork),
	 * so uarg->len and sk_zckey access is serialized
	 */
@@ -1357,7 +1371,7 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
	if (orig_uarg && uarg != orig_uarg)
		return -EEXIST;

-	err = __zerocopy_sg_from_iter(sk, skb, &msg->msg_iter, len);
+	err = __zerocopy_sg_from_iter(msg, sk, skb, &msg->msg_iter, len);
	if (err == -EFAULT || (err == -EMSGSIZE && skb->len == orig_len)) {
		struct sock *save_sk = skb->sk;
@@ -1374,6 +1388,16 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);

+void __skb_zcopy_downgrade_managed(struct sk_buff *skb)
+{
+	int i;
+
+	skb_shinfo(skb)->flags &= ~SKBFL_MANAGED_FRAG_REFS;
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+		skb_frag_ref(skb, i);
+}
+EXPORT_SYMBOL_GPL(__skb_zcopy_downgrade_managed);
+
 static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
			      gfp_t gfp_mask)
 {
@@ -1692,6 +1716,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
	BUG_ON(skb_shared(skb));

+	skb_zcopy_downgrade_managed(skb);
+
	size = SKB_DATA_ALIGN(size);

	if (skb_pfmemalloc(skb))
@@ -3486,6 +3512,8 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len)
	int pos = skb_headlen(skb);
	const int zc_flags = SKBFL_SHARED_FRAG | SKBFL_PURE_ZEROCOPY;

+	skb_zcopy_downgrade_managed(skb);
+
	skb_shinfo(skb1)->flags |= skb_shinfo(skb)->flags & zc_flags;
	skb_zerocopy_clone(skb1, skb, 0);
	if (len < pos)	/* Split line is inside header. */
@@ -3834,6 +3862,7 @@ int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
	if (skb_can_coalesce(skb, i, page, offset)) {
		skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size);
	} else if (i < MAX_SKB_FRAGS) {
+		skb_zcopy_downgrade_managed(skb);
		get_page(page);
		skb_fill_page_desc(skb, i, page, offset, size);
	} else {

--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -969,7 +969,6 @@ static int __ip_append_data(struct sock *sk,
	struct inet_sock *inet = inet_sk(sk);
	struct ubuf_info *uarg = NULL;
	struct sk_buff *skb;
-
	struct ip_options *opt = cork->opt;
	int hh_len;
	int exthdrlen;
@@ -977,6 +976,7 @@ static int __ip_append_data(struct sock *sk,
	int copy;
	int err;
	int offset = 0;
+	bool zc = false;
	unsigned int maxfraglen, fragheaderlen, maxnonfragsize;
	int csummode = CHECKSUM_NONE;
	struct rtable *rt = (struct rtable *)cork->dst;
@@ -1017,17 +1017,35 @@ static int __ip_append_data(struct sock *sk,
	    (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
		csummode = CHECKSUM_PARTIAL;

-	if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
-		uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
-		if (!uarg)
-			return -ENOBUFS;
-		extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
-		if (rt->dst.dev->features & NETIF_F_SG &&
-		    csummode == CHECKSUM_PARTIAL) {
-			paged = true;
-		} else {
-			uarg->zerocopy = 0;
-			skb_zcopy_set(skb, uarg, &extra_uref);
+	if ((flags & MSG_ZEROCOPY) && length) {
+		struct msghdr *msg = from;
+
+		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
+			if (skb_zcopy(skb) && msg->msg_ubuf != skb_zcopy(skb))
+				return -EINVAL;
+
+			/* Leave uarg NULL if can't zerocopy, callers should
+			 * be able to handle it.
+			 */
+			if ((rt->dst.dev->features & NETIF_F_SG) &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+				uarg = msg->msg_ubuf;
+			}
+		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+			if (!uarg)
+				return -ENOBUFS;
+			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
+			if (rt->dst.dev->features & NETIF_F_SG &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+			} else {
+				uarg->zerocopy = 0;
+				skb_zcopy_set(skb, uarg, &extra_uref);
+			}
		}
	}
@@ -1091,9 +1109,12 @@ alloc_new_skb:
			    (fraglen + alloc_extra < SKB_MAX_ALLOC ||
			     !(rt->dst.dev->features & NETIF_F_SG)))
				alloclen = fraglen;
-			else {
+			else if (!zc) {
				alloclen = min_t(int, fraglen, MAX_HEADER);
				pagedlen = fraglen - alloclen;
+			} else {
+				alloclen = fragheaderlen + transhdrlen;
+				pagedlen = datalen - transhdrlen;
			}
			alloclen += alloc_extra;
@@ -1188,13 +1209,14 @@ alloc_new_skb:
				err = -EFAULT;
				goto error;
			}
-		} else if (!uarg || !uarg->zerocopy) {
+		} else if (!zc) {
			int i = skb_shinfo(skb)->nr_frags;

			err = -ENOMEM;
			if (!sk_page_frag_refill(sk, pfrag))
				goto error;

+			skb_zcopy_downgrade_managed(skb);
			if (!skb_can_coalesce(skb, i, pfrag->page,
					      pfrag->offset)) {
				err = -EMSGSIZE;

--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1223,17 +1223,23 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
	flags = msg->msg_flags;

-	if (flags & MSG_ZEROCOPY && size && sock_flag(sk, SOCK_ZEROCOPY)) {
+	if ((flags & MSG_ZEROCOPY) && size) {
		skb = tcp_write_queue_tail(sk);
-		uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
-		if (!uarg) {
-			err = -ENOBUFS;
-			goto out_err;
-		}

-		zc = sk->sk_route_caps & NETIF_F_SG;
-		if (!zc)
-			uarg->zerocopy = 0;
+		if (msg->msg_ubuf) {
+			uarg = msg->msg_ubuf;
+			net_zcopy_get(uarg);
+			zc = sk->sk_route_caps & NETIF_F_SG;
+		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
+			if (!uarg) {
+				err = -ENOBUFS;
+				goto out_err;
+			}
+			zc = sk->sk_route_caps & NETIF_F_SG;
+			if (!zc)
+				uarg->zerocopy = 0;
+		}
	}

	if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
@@ -1356,8 +1362,11 @@ new_segment:
			copy = min_t(int, copy, pfrag->size - pfrag->offset);

-			if (tcp_downgrade_zcopy_pure(sk, skb))
-				goto wait_for_space;
+			if (unlikely(skb_zcopy_pure(skb) || skb_zcopy_managed(skb))) {
+				if (tcp_downgrade_zcopy_pure(sk, skb))
+					goto wait_for_space;
+				skb_zcopy_downgrade_managed(skb);
+			}

			copy = tcp_wmem_schedule(sk, copy);
			if (!copy)

--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1464,6 +1464,7 @@ static int __ip6_append_data(struct sock *sk,
	int copy;
	int err;
	int offset = 0;
+	bool zc = false;
	u32 tskey = 0;
	struct rt6_info *rt = (struct rt6_info *)cork->dst;
	struct ipv6_txoptions *opt = v6_cork->opt;
@@ -1541,17 +1542,35 @@ emsgsize:
	    rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
		csummode = CHECKSUM_PARTIAL;

-	if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
-		uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
-		if (!uarg)
-			return -ENOBUFS;
-		extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
-		if (rt->dst.dev->features & NETIF_F_SG &&
-		    csummode == CHECKSUM_PARTIAL) {
-			paged = true;
-		} else {
-			uarg->zerocopy = 0;
-			skb_zcopy_set(skb, uarg, &extra_uref);
+	if ((flags & MSG_ZEROCOPY) && length) {
+		struct msghdr *msg = from;
+
+		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
+			if (skb_zcopy(skb) && msg->msg_ubuf != skb_zcopy(skb))
+				return -EINVAL;
+
+			/* Leave uarg NULL if can't zerocopy, callers should
+			 * be able to handle it.
+			 */
+			if ((rt->dst.dev->features & NETIF_F_SG) &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+				uarg = msg->msg_ubuf;
+			}
+		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+			if (!uarg)
+				return -ENOBUFS;
+			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
+			if (rt->dst.dev->features & NETIF_F_SG &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+			} else {
+				uarg->zerocopy = 0;
+				skb_zcopy_set(skb, uarg, &extra_uref);
+			}
		}
	}
@@ -1630,9 +1649,12 @@ alloc_new_skb:
			    (fraglen + alloc_extra < SKB_MAX_ALLOC ||
			     !(rt->dst.dev->features & NETIF_F_SG)))
				alloclen = fraglen;
-			else {
+			else if (!zc) {
				alloclen = min_t(int, fraglen, MAX_HEADER);
				pagedlen = fraglen - alloclen;
+			} else {
+				alloclen = fragheaderlen + transhdrlen;
+				pagedlen = datalen - transhdrlen;
			}
			alloclen += alloc_extra;
@@ -1742,13 +1764,14 @@ alloc_new_skb:
				err = -EFAULT;
				goto error;
			}
-		} else if (!uarg || !uarg->zerocopy) {
+		} else if (!zc) {
			int i = skb_shinfo(skb)->nr_frags;

			err = -ENOMEM;
			if (!sk_page_frag_refill(sk, pfrag))
				goto error;

+			skb_zcopy_downgrade_managed(skb);
			if (!skb_can_coalesce(skb, i, pfrag->page,
					      pfrag->offset)) {
				err = -EMSGSIZE;

--- a/net/socket.c
+++ b/net/socket.c
@@ -2103,6 +2103,7 @@ int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
	msg.msg_control = NULL;
	msg.msg_controllen = 0;
	msg.msg_namelen = 0;
+	msg.msg_ubuf = NULL;
	if (addr) {
		err = move_addr_to_kernel(addr, addr_len, &address);
		if (err < 0)
@@ -2402,6 +2403,7 @@ int __copy_msghdr_from_user(struct msghdr *kmsg,
		return -EMSGSIZE;

	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
	*uiov = msg.msg_iov;
	*nsegs = msg.msg_iovlen;
	return 0;