It would be useful e.g. in a server or desktop environment to have
a facility in the notion of fine-grained "per application" or "per
application group" firewall policies. Probably, users in the mobile,
embedded area (e.g. Android based) with different security policy
requirements for application groups could have great benefit from
that as well. For example, with a little bit of configuration effort,
an admin could whitelist well-known applications, and thus block
otherwise unwanted "hard-to-track" applications like [1] from a
user's machine. Blocking is just one example, but it is not limited
to that, meaning we can have much different scenarios/policies that
netfilter allows us than just blocking, e.g. fine grained settings
where applications are allowed to connect/send traffic to, application
traffic marking/conntracking, application-specific packet mangling,
and so on.
Implementation of PID-based matching would not be appropriate
as they frequently change, and child tracking would make that
even more complex and ugly. Cgroups would be a perfect candidate
for accomplishing that as they associate a set of tasks with a
set of parameters for one or more subsystems, in our case the
netfilter subsystem, which, of course, can be combined with other
cgroup subsystems into something more complex if needed.
As mentioned, to overcome this constraint, such processes could
be placed into one or multiple cgroups where different fine-grained
rules can be defined depending on the application scenario, while
e.g. everything else that is not part of that could be dropped (or
vice versa), thus making life harder for unwanted processes to
communicate to the outside world. So, we make use of cgroups here
to track jobs and limit their resources in terms of iptables
policies; in other words, limiting, tracking, etc what they are
allowed to communicate.
In our case we're working on outgoing traffic based on which local
socket that originated from. Also, one doesn't even need to have
an a-prio knowledge of the application internals regarding their
particular use of ports or protocols. Matching is *extremly*
lightweight as we just test for the sk_classid marker of sockets,
originating from net_cls. net_cls and netfilter do not contradict
each other; in fact, each construct can live as standalone or they
can be used in combination with each other, which is perfectly fine,
plus it serves Tejun's requirement to not introduce a new cgroups
subsystem. Through this, we result in a very minimal and efficient
module, and don't add anything except netfilter code.
One possible, minimal usage example (many other iptables options
can be applied obviously):
1) Configuring cgroups if not already done, e.g.:
mkdir /sys/fs/cgroup/net_cls
mount -t cgroup -o net_cls net_cls /sys/fs/cgroup/net_cls
mkdir /sys/fs/cgroup/net_cls/0
echo 1 > /sys/fs/cgroup/net_cls/0/net_cls.classid
(resp. a real flow handle id for tc)
2) Configuring netfilter (iptables-nftables), e.g.:
iptables -A OUTPUT -m cgroup ! --cgroup 1 -j DROP
3) Running applications, e.g.:
ping 208.67.222.222 <pid:1799>
echo 1799 > /sys/fs/cgroup/net_cls/0/tasks
64 bytes from 208.67.222.222: icmp_seq=44 ttl=49 time=11.9 ms
[...]
ping 208.67.220.220 <pid:1804>
ping: sendmsg: Operation not permitted
[...]
echo 1804 > /sys/fs/cgroup/net_cls/0/tasks
64 bytes from 208.67.220.220: icmp_seq=89 ttl=56 time=19.0 ms
[...]
Of course, real-world deployments would make use of cgroups user
space toolsuite, or own custom policy daemons dynamically moving
applications from/to various cgroups.
[1] http://www.blackhat.com/presentations/bh-europe-06/bh-eu-06-biondi/bh-eu-06-biondi-up.pdf
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: cgroups@vger.kernel.org
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We currently use prandom_u32() for allocation of ports in tcp bind(0)
and udp code. In case of plain SNAT we try to keep the ports as is
or increment on collision.
SNAT --random mode does use per-destination incrementing port
allocation. As a recent paper pointed out in [1] that this mode of
port allocation makes it possible to an attacker to find the randomly
allocated ports through a timing side-channel in a socket overloading
attack conducted through an off-path attacker.
So, NF_NAT_RANGE_PROTO_RANDOM actually weakens the port randomization
in regard to the attack described in this paper. As we need to keep
compatibility, add another flag called NF_NAT_RANGE_PROTO_RANDOM_FULLY
that would replace the NF_NAT_RANGE_PROTO_RANDOM hash-based port
selection algorithm with a simple prandom_u32() in order to mitigate
this attack vector. Note that the lfsr113's internal state is
periodically reseeded by the kernel through a local secure entropy
source.
More details can be found in [1], the basic idea is to send bursts
of packets to a socket to overflow its receive queue and measure
the latency to detect a possible retransmit when the port is found.
Because of increasing ports to given destination and port, further
allocations can be predicted. This information could then be used by
an attacker for e.g. for cache-poisoning, NS pinning, and degradation
of service attacks against DNS servers [1]:
The best defense against the poisoning attacks is to properly
deploy and validate DNSSEC; DNSSEC provides security not only
against off-path attacker but even against MitM attacker. We hope
that our results will help motivate administrators to adopt DNSSEC.
However, full DNSSEC deployment make take significant time, and
until that happens, we recommend short-term, non-cryptographic
defenses. We recommend to support full port randomisation,
according to practices recommended in [2], and to avoid
per-destination sequential port allocation, which we show may be
vulnerable to derandomisation attacks.
Joint work between Hannes Frederic Sowa and Daniel Borkmann.
[1] https://sites.google.com/site/hayashulman/files/NIC-derandomisation.pdf
[2] http://arxiv.org/pdf/1205.5190v1.pdf
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
With this plugin, user could specify IPComp tagged with certain
CPI that host not interested will be DROPped or any other action.
For example:
iptables -A INPUT -p 108 -m ipcomp --ipcompspi 0x87 -j DROP
ip6tables -A INPUT -p 108 -m ipcomp --ipcompspi 0x87 -j DROP
Then input IPComp packet with CPI equates 0x87 will not reach
upper layer anymore.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Thanks to commits 41063e9 (ipv4: Early TCP socket demux) and 421b388
(udp: ipv4: Add udp early demux) it is now possible to parse UID and
GID socket info also for incoming TCP and UDP connections. Having
this info available, it is convenient to let NFQUEUE parse it in
order to improve and refine the traffic analysis in userspace.
Signed-off-by: Valentina Giusti <valentina.giusti@bmw-carit.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces a PACKET_QDISC_BYPASS socket option, that
allows for using a similar xmit() function as in pktgen instead
of taking the dev_queue_xmit() path. This can be very useful when
PF_PACKET applications are required to be used in a similar
scenario as pktgen, but with full, flexible packet payload that
needs to be provided, for example.
On default, nothing changes in behaviour for normal PF_PACKET
TX users, so everything stays as is for applications. New users,
however, can now set PACKET_QDISC_BYPASS if needed to prevent
own packets from i) reentering packet_rcv() and ii) to directly
push the frame to the driver.
In doing so we can increase pps (here 64 byte packets) for
PF_PACKET a bit:
# CPUs -- QDISC_BYPASS -- qdisc path -- qdisc path[**]
1 CPU == 1,509,628 pps -- 1,208,708 -- 1,247,436
2 CPUs == 3,198,659 pps -- 2,536,012 -- 1,605,779
3 CPUs == 4,787,992 pps -- 3,788,740 -- 1,735,610
4 CPUs == 6,173,956 pps -- 4,907,799 -- 1,909,114
5 CPUs == 7,495,676 pps -- 5,956,499 -- 2,014,422
6 CPUs == 9,001,496 pps -- 7,145,064 -- 2,155,261
7 CPUs == 10,229,776 pps -- 8,190,596 -- 2,220,619
8 CPUs == 11,040,732 pps -- 9,188,544 -- 2,241,879
9 CPUs == 12,009,076 pps -- 10,275,936 -- 2,068,447
10 CPUs == 11,380,052 pps -- 11,265,337 -- 1,578,689
11 CPUs == 11,672,676 pps -- 11,845,344 -- 1,297,412
[...]
20 CPUs == 11,363,192 pps -- 11,014,933 -- 1,245,081
[**]: qdisc path with packet_rcv(), how probably most people
seem to use it (hopefully not anymore if not needed)
The test was done using a modified trafgen, sending a simple
static 64 bytes packet, on all CPUs. The trick in the fast
"qdisc path" case, is to avoid reentering packet_rcv() by
setting the RAW socket protocol to zero, like:
socket(PF_PACKET, SOCK_RAW, 0);
Tradeoffs are documented as well in this patch, clearly, if
queues are busy, we will drop more packets, tc disciplines are
ignored, and these packets are not visible to taps anymore. For
a pktgen like scenario, we argue that this is acceptable.
The pointer to the xmit function has been placed in packet
socket structure hole between cached_dev and prot_hook that
is hot anyway as we're working on cached_dev in each send path.
Done in joint work together with Jesper Dangaard Brouer.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Creating an address with this flag set will result in kernel taking care
of temporary addresses in the same way as if the address was created by
kernel itself (after RA receive). This allows userspace applications
implementing the autoconfiguration (NetworkManager for example) to
implement ipv6 addresses privacy.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no more space in u8 ifa_flags. So do what davem suffested and
add another netlink attr called IFA_FLAGS for carry more flags.
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Thomas Haller <thaller@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville says:
====================
Please pull this batch of updates intended for the 3.14 stream...
For the mac80211 bits, Johannes says:
"I have various improvements/cleanups/fixes all over, but the shortlog
shows that Luis's regulatory work and mesh work from the cozybit folks
are the biggest ones, along with the CSA fixes."
Along with that, we have big batches of updates to brcmfmac, rtlwifi,
and ath9k. There are updates to wcn36xx, rt2x00, and a handful of
others as well.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
With the introduction of TCP Small Queues, TSO auto sizing, and TCP
pacing, we can implement Automatic Corking in the kernel, to help
applications doing small write()/sendmsg() to TCP sockets.
Idea is to change tcp_push() to check if the current skb payload is
under skb optimal size (a multiple of MSS bytes)
If under 'size_goal', and at least one packet is still in Qdisc or
NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
will be delayed up to TX completion time.
This delay might allow the application to coalesce more bytes
in the skb in following write()/sendmsg()/sendfile() system calls.
The exact duration of the delay is depending on the dynamics
of the system, and might be zero if no packet for this flow
is actually held in Qdisc or NIC TX ring.
Using FQ/pacing is a way to increase the probability of
autocorking being triggered.
Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
this feature and default it to 1 (enabled)
Add a new SNMP counter : nstat -a | grep TcpExtTCPAutoCorking
This counter is incremented every time we detected skb was under used
and its flush was deferred.
Tested:
Interesting effects when using line buffered commands under ssh.
Excellent performance results in term of cpu usage and total throughput.
lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
9410.39
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
35209.439626 task-clock # 2.901 CPUs utilized
2,294 context-switches # 0.065 K/sec
101 CPU-migrations # 0.003 K/sec
4,079 page-faults # 0.116 K/sec
97,923,241,298 cycles # 2.781 GHz [83.31%]
51,832,908,236 stalled-cycles-frontend # 52.93% frontend cycles idle [83.30%]
25,697,986,603 stalled-cycles-backend # 26.24% backend cycles idle [66.70%]
102,225,978,536 instructions # 1.04 insns per cycle
# 0.51 stalled cycles per insn [83.38%]
18,657,696,819 branches # 529.906 M/sec [83.29%]
91,679,646 branch-misses # 0.49% of all branches [83.40%]
12.136204899 seconds time elapsed
lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
6624.89
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
40045.864494 task-clock # 3.301 CPUs utilized
171 context-switches # 0.004 K/sec
53 CPU-migrations # 0.001 K/sec
4,080 page-faults # 0.102 K/sec
111,340,458,645 cycles # 2.780 GHz [83.34%]
61,778,039,277 stalled-cycles-frontend # 55.49% frontend cycles idle [83.31%]
29,295,522,759 stalled-cycles-backend # 26.31% backend cycles idle [66.67%]
108,654,349,355 instructions # 0.98 insns per cycle
# 0.57 stalled cycles per insn [83.34%]
19,552,170,748 branches # 488.244 M/sec [83.34%]
157,875,417 branch-misses # 0.81% of all branches [83.34%]
12.130267788 seconds time elapsed
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several files refer to an old address for the Free Software Foundation
in the file header comment. Resolve by replacing the address with
the URL <http://www.gnu.org/licenses/> so that we do not have to keep
updating the header comments anytime the address changes.
CC: netfilter@vger.kernel.org
CC: Pablo Neira Ayuso <pablo@netfilter.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several files refer to an old address for the Free Software Foundation
in the file header comment. Resolve by replacing the address with
the URL <http://www.gnu.org/licenses/> so that we do not have to keep
updating the header comments anytime the address changes.
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Hutchings says:
====================
SIOCGHWTSTAMP ioctl
1. Add the SIOCGHWTSTAMP ioctl and update the timestamping
documentation.
2. Implement SIOCGHWTSTAMP in most drivers that support SIOCSHWTSTAMP.
3. Add a test program to exercise SIOC{G,S}HWTSTAMP.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking updates from David Miller:
"Here is a pile of bug fixes that accumulated while I was in Europe"
1) In fixing kernel leaks to userspace during copying of socket
addresses, we broke a case that used to work, namely the user
providing a buffer larger than the in-kernel generic socket address
structure. This broke Ruby amongst other things. Fix from Dan
Carpenter.
2) Fix regression added by byte queue limit support in 8139cp driver,
from Yang Yingliang.
3) The addition of MSG_SENDPAGE_NOTLAST buggered up a few sendpage
implementations, they should just treat it the same as MSG_MORE.
Fix from Richard Weinberger and Shawn Landden.
4) Handle icmpv4 errors received on ipv6 SIT tunnels correctly, from
Oussama Ghorbel. In particular we should send an ICMPv6 unreachable
in such situations.
5) Fix some regressions in the recent genetlink fixes, in particular
get the pmcraid driver to use the new safer interfaces correctly.
From Johannes Berg.
6) macvtap was converted to use a per-cpu set of statistics, but some
code was still bumping tx_dropped elsewhere. From Jason Wang.
7) Fix build failure of xen-netback due to missing include on some
architectures, from Andy Whitecroft.
8) macvtap double counts received packets in statistics, fix from Vlad
Yasevich.
9) Fix various cases of using *_STATS_BH() when *_STATS() is more
appropriate. From Eric Dumazet and Hannes Frederic Sowa.
10) Pktgen ipsec mode doesn't update the ipv4 header length and checksum
properly after encapsulation. Fix from Fan Du.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (61 commits)
net/mlx4_en: Remove selftest TX queues empty condition
{pktgen, xfrm} Update IPv4 header total len and checksum after tranformation
virtio_net: make all RX paths handle erors consistently
virtio_net: fix error handling for mergeable buffers
virtio_net: Fixed a trivial typo (fitler --> filter)
netem: fix gemodel loss generator
netem: fix loss 4 state model
netem: missing break in ge loss generator
net/hsr: Support iproute print_opt ('ip -details ...')
net/hsr: Very small fix of comment style.
MAINTAINERS: Added net/hsr/ maintainer
ipv6: fix possible seqlock deadlock in ip6_finish_output2
ixgbe: Make ixgbe_identify_qsfp_module_generic static
ixgbe: turn NETIF_F_HW_L2FW_DOFFLOAD off by default
ixgbe: ixgbe_fwd_ring_down needs to be static
e1000: fix possible reset_task running after adapter down
e1000: fix lockdep warning in e1000_reset_task
e1000: prevent oops when adapter is being closed and reset simultaneously
igb: Fixed Wake On LAN support
inet: fix possible seqlock deadlocks
...
nla_put_flag needs a real nl80211 attribute id, not a wiphy flag bit.
While at it, split 5 and 10 MHz capability flags in case we ever need
to support hardware that can only do one of the two.
Also move the flag settings to the split-only information so we don't
increase the space needed for old userspace.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
[change location of flag setting]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This implements the rtnl_link_ops fill_info routine for HSR.
Signed-off-by: Arvid Brodin <arvid.brodin@alten.se>
Signed-off-by: David S. Miller <davem@davemloft.net>
The pmcraid driver is abusing the genetlink API and is using its
family ID as the multicast group ID, which is invalid and may
belong to somebody else (and likely will.)
Make it use the correct API, but since this may already be used
as-is by userspace, reserve a family ID for this code and also
reserve that group ID to not break userspace assumptions.
My previous patch broke event delivery in the driver as I missed
that it wasn't using the right API and forgot to update it later
in my series.
While changing this, I noticed that the genetlink code could use
the static group ID instead of a strcmp(), so also do that for
the VFS_DQUOT family.
Cc: Anil Ravindranath <anil_ravindranath@pmc-sierra.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The first netlink attribute (value 0) must always be defined as none/unspec.
This is correctly done in inet_diag.h, but other diag interfaces are wrong.
Because we cannot change an existing API, I add a comment to point the mistake
and avoid to propagate it in a new diag API in the future.
CC: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
These two flags are used for the same purpose, just
combine them into a no-ir flag to annotate no initiating
radiation is allowed.
Old userspace sending either flag will have it treated as
the no-ir flag. To be considerate to older userspace we
also send both the no-ir flag and the old no-ibss flags.
Newer userspace will have to be aware of older kernels.
Update all places in the tree using these flags with the
following semantic patch:
@@
@@
-NL80211_RRF_PASSIVE_SCAN
+NL80211_RRF_NO_IR
@@
@@
-NL80211_RRF_NO_IBSS
+NL80211_RRF_NO_IR
@@
@@
-IEEE80211_CHAN_PASSIVE_SCAN
+IEEE80211_CHAN_NO_IR
@@
@@
-IEEE80211_CHAN_NO_IBSS
+IEEE80211_CHAN_NO_IR
@@
@@
-NL80211_RRF_NO_IR | NL80211_RRF_NO_IR
+NL80211_RRF_NO_IR
@@
@@
-IEEE80211_CHAN_NO_IR | IEEE80211_CHAN_NO_IR
+IEEE80211_CHAN_NO_IR
@@
@@
-(NL80211_RRF_NO_IR)
+NL80211_RRF_NO_IR
@@
@@
-(IEEE80211_CHAN_NO_IR)
+IEEE80211_CHAN_NO_IR
Along with some hand-optimisations in documentation, to
remove duplicates and to fix some indentation.
Signed-off-by: Luis R. Rodriguez <mcgrof@do-not-panic.com>
[do all the driver updates in one go]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This command has different semantics depending on the action code sent.
Document this fact and detail the supported action codes.
Signed-off-by: Arik Nemtsov <arik@wizery.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Pull DRM fixes from Dave Airlie:
"I was going to leave this until post -rc1 but sysfs fixes broke
hotplug in userspace, so I had to fix it harder, otherwise a set of
pulls from intel, radeon and vmware,
The vmware/ttm changes are bit larger but since its early and they are
unlikely to break anything else I put them in, it lets vmware work
with dri3"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (36 commits)
drm/sysfs: fix hotplug regression since lifetime changes
drm/exynos: g2d: fix memory leak to userptr
drm/i915: Fix gen3 self-refresh watermarks
drm/ttm: Remove set_need_resched from the ttm fault handler
drm/ttm: Don't move non-existing data
drm/radeon: hook up backlight functions for CI and KV family.
drm/i915: Replicate BIOS eDP bpp clamping hack for hsw
drm/i915: Do not enable package C8 on unsupported hardware
drm/i915: Hold pc8 lock around toggling pc8.gpu_idle
drm/i915: encoder->get_config is no longer optional
drm/i915/tv: add ->get_config callback
drm/radeon/cik: Add macrotile mode array query
drm/radeon/cik: Return backend map information to userspace
drm/vmwgfx: Make vmwgfx dma buffers prime aware
drm/vmwgfx: Make surfaces prime-aware
drm/vmwgfx: Hook up the prime ioctls
drm/ttm: Add a minimal prime implementation for ttm base objects
drm/vmwgfx: Fix false lockdep warning
drm/ttm: Allow execbuf util reserves without ticket
drm/i915: restore the early forcewake cleanup
...
Pull security subsystem updates from James Morris:
"In this patchset, we finally get an SELinux update, with Paul Moore
taking over as maintainer of that code.
Also a significant update for the Keys subsystem, as well as
maintenance updates to Smack, IMA, TPM, and Apparmor"
and since I wanted to know more about the updates to key handling,
here's the explanation from David Howells on that:
"Okay. There are a number of separate bits. I'll go over the big bits
and the odd important other bit, most of the smaller bits are just
fixes and cleanups. If you want the small bits accounting for, I can
do that too.
(1) Keyring capacity expansion.
KEYS: Consolidate the concept of an 'index key' for key access
KEYS: Introduce a search context structure
KEYS: Search for auth-key by name rather than target key ID
Add a generic associative array implementation.
KEYS: Expand the capacity of a keyring
Several of the patches are providing an expansion of the capacity of a
keyring. Currently, the maximum size of a keyring payload is one page.
Subtract a small header and then divide up into pointers, that only gives
you ~500 pointers on an x86_64 box. However, since the NFS idmapper uses
a keyring to store ID mapping data, that has proven to be insufficient to
the cause.
Whatever data structure I use to handle the keyring payload, it can only
store pointers to keys, not the keys themselves because several keyrings
may point to a single key. This precludes inserting, say, and rb_node
struct into the key struct for this purpose.
I could make an rbtree of records such that each record has an rb_node
and a key pointer, but that would use four words of space per key stored
in the keyring. It would, however, be able to use much existing code.
I selected instead a non-rebalancing radix-tree type approach as that
could have a better space-used/key-pointer ratio. I could have used the
radix tree implementation that we already have and insert keys into it by
their serial numbers, but that means any sort of search must iterate over
the whole radix tree. Further, its nodes are a bit on the capacious side
for what I want - especially given that key serial numbers are randomly
allocated, thus leaving a lot of empty space in the tree.
So what I have is an associative array that internally is a radix-tree
with 16 pointers per node where the index key is constructed from the key
type pointer and the key description. This means that an exact lookup by
type+description is very fast as this tells us how to navigate directly to
the target key.
I made the data structure general in lib/assoc_array.c as far as it is
concerned, its index key is just a sequence of bits that leads to a
pointer. It's possible that someone else will be able to make use of it
also. FS-Cache might, for example.
(2) Mark keys as 'trusted' and keyrings as 'trusted only'.
KEYS: verify a certificate is signed by a 'trusted' key
KEYS: Make the system 'trusted' keyring viewable by userspace
KEYS: Add a 'trusted' flag and a 'trusted only' flag
KEYS: Separate the kernel signature checking keyring from module signing
These patches allow keys carrying asymmetric public keys to be marked as
being 'trusted' and allow keyrings to be marked as only permitting the
addition or linkage of trusted keys.
Keys loaded from hardware during kernel boot or compiled into the kernel
during build are marked as being trusted automatically. New keys can be
loaded at runtime with add_key(). They are checked against the system
keyring contents and if their signatures can be validated with keys that
are already marked trusted, then they are marked trusted also and can
thus be added into the master keyring.
Patches from Mimi Zohar make this usable with the IMA keyrings also.
(3) Remove the date checks on the key used to validate a module signature.
X.509: Remove certificate date checks
It's not reasonable to reject a signature just because the key that it was
generated with is no longer valid datewise - especially if the kernel
hasn't yet managed to set the system clock when the first module is
loaded - so just remove those checks.
(4) Make it simpler to deal with additional X.509 being loaded into the kernel.
KEYS: Load *.x509 files into kernel keyring
KEYS: Have make canonicalise the paths of the X.509 certs better to deduplicate
The builder of the kernel now just places files with the extension ".x509"
into the kernel source or build trees and they're concatenated by the
kernel build and stuffed into the appropriate section.
(5) Add support for userspace kerberos to use keyrings.
KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches
KEYS: Implement a big key type that can save to tmpfs
Fedora went to, by default, storing kerberos tickets and tokens in tmpfs.
We looked at storing it in keyrings instead as that confers certain
advantages such as tickets being automatically deleted after a certain
amount of time and the ability for the kernel to get at these tokens more
easily.
To make this work, two things were needed:
(a) A way for the tickets to persist beyond the lifetime of all a user's
sessions so that cron-driven processes can still use them.
The problem is that a user's session keyrings are deleted when the
session that spawned them logs out and the user's user keyring is
deleted when the UID is deleted (typically when the last log out
happens), so neither of these places is suitable.
I've added a system keyring into which a 'persistent' keyring is
created for each UID on request. Each time a user requests their
persistent keyring, the expiry time on it is set anew. If the user
doesn't ask for it for, say, three days, the keyring is automatically
expired and garbage collected using the existing gc. All the kerberos
tokens it held are then also gc'd.
(b) A key type that can hold really big tickets (up to 1MB in size).
The problem is that Active Directory can return huge tickets with lots
of auxiliary data attached. We don't, however, want to eat up huge
tracts of unswappable kernel space for this, so if the ticket is
greater than a certain size, we create a swappable shmem file and dump
the contents in there and just live with the fact we then have an
inode and a dentry overhead. If the ticket is smaller than that, we
slap it in a kmalloc()'d buffer"
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (121 commits)
KEYS: Fix keyring content gc scanner
KEYS: Fix error handling in big_key instantiation
KEYS: Fix UID check in keyctl_get_persistent()
KEYS: The RSA public key algorithm needs to select MPILIB
ima: define '_ima' as a builtin 'trusted' keyring
ima: extend the measurement list to include the file signature
kernel/system_certificate.S: use real contents instead of macro GLOBAL()
KEYS: fix error return code in big_key_instantiate()
KEYS: Fix keyring quota misaccounting on key replacement and unlink
KEYS: Fix a race between negating a key and reading the error set
KEYS: Make BIG_KEYS boolean
apparmor: remove the "task" arg from may_change_ptraced_domain()
apparmor: remove parent task info from audit logging
apparmor: remove tsk field from the apparmor_audit_struct
apparmor: fix capability to not use the current task, during reporting
Smack: Ptrace access check mode
ima: provide hash algo info in the xattr
ima: enable support for larger default filedata hash algorithms
ima: define kernel parameter 'ima_template=' to change configured default
ima: add Kconfig default measurement list template
...
Pull audit updates from Eric Paris:
"Nothing amazing. Formatting, small bug fixes, couple of fixes where
we didn't get records due to some old VFS changes, and a change to how
we collect execve info..."
Fixed conflict in fs/exec.c as per Eric and linux-next.
* git://git.infradead.org/users/eparis/audit: (28 commits)
audit: fix type of sessionid in audit_set_loginuid()
audit: call audit_bprm() only once to add AUDIT_EXECVE information
audit: move audit_aux_data_execve contents into audit_context union
audit: remove unused envc member of audit_aux_data_execve
audit: Kill the unused struct audit_aux_data_capset
audit: do not reject all AUDIT_INODE filter types
audit: suppress stock memalloc failure warnings since already managed
audit: log the audit_names record type
audit: add child record before the create to handle case where create fails
audit: use given values in tty_audit enable api
audit: use nlmsg_len() to get message payload length
audit: use memset instead of trying to initialize field by field
audit: fix info leak in AUDIT_GET requests
audit: update AUDIT_INODE filter rule to comparator function
audit: audit feature to set loginuid immutable
audit: audit feature to only allow unsetting the loginuid
audit: allow unsetting the loginuid (with priv)
audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
audit: loginuid functions coding style
selinux: apply selinux checks on new audit message types
...
More fixes for radeon. This adds new queries for tiling on CIK, and
fixes a crash in handling acpi atif backlight events on CIK.
Some fixes for radeon for 3.13. Mostly CI stability fixes. I think
I've tracked down the stability problems with dpm on Trinity/Richland,
so I'm going to enable that by default now.
* 'drm-next-3.13' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon: hook up backlight functions for CI and KV family.
drm/radeon/cik: Add macrotile mode array query
drm/radeon/cik: Return backend map information to userspace
drm/radeon: enable DPM by default in TN asics
drm/radeon: adjust TN dpm parameters for stability (v2)
drm/radeon: use a single doorbell for cik kms compute
drm/radeon/vm: don't attempt to update ptes if ib allocation fails
drm/radeon: disable CIK CP semaphores for now
drm/radeon: allow semaphore emission to fail
drm/radeon: add semaphore trace point
radeon: workaround pinning failure on low ram gpu
radeon/i2c: do not count reg index in number of i2c byte we are writing.
drm/radeon: cypress_dpm: Fix unused variable warning when CONFIG_ACPI=n
drm: radeon: ni_dpm: Fix unused variable warning when CONFIG_ACPI=n
Mostly optimisations and obscure bug fixes.
- raid5 gets less lock contention
- raid1 gets less contention between normal-io and resync-io
during resync.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIVAwUAUovzDznsnt1WYoG5AQJ1pQ//bDuXadoJ5dwjWjVxFOKoQ9j/9joEI0yH
XTApD3ADKckdBc4TSLOIbCNLW1Pbe23HlOI/GjCiJ/7mePL3OwHd7Fx8Rfq3BubV
f7NgjVwu8nwYD0OXEZsshImptEtrbYwQdy+qlKcHXcZz1MUfR+Egih3r/ouTEfEt
FNq/6MpyN0IKSY82xP/jFZgesBucgKz/YOUIbwClxm7UiyISKvWQLBIAfLB3dyI3
HoEdEzQX6I56Rw0mkSUG4Mk+8xx/8twxL+yqEUqfdJREWuB56Km8kl8y/e465Nk0
ZZg6j/TrslVEwbEeVMx0syvYcaAWFZ4X2jdKfo1lI0g9beZp7H1GRF8yR1s2t/h4
g/vb55MEN++4LPaE9ut4z7SG2yLyGkZgFTzTjyq5of+DFL0cayO7wXxbgpcD7JYf
Doef/OSa6csKiGiJI48iQa08Bolmz9ZWzZQXhAthKfFQ9Rv+GEtIAi4kLR8EZPbu
0/FL1ylYNUY9O7p0g+iy9Kcoc+xW36I95pPZf8pO8GFcXTjyuCCBVh/SNvFZZHPl
3xk3aZJknAEID8VrVG2IJPkeDI8WK8YxmpU/nARCoytn07Df6Ye8jGvLdR8pL3lB
TIZV6eRY4yciB8LtoK9Kg4XTmOMhBtjt4c3znkljp98vhOQQb/oHN+BXMGcwqvr9
fk0KGrg31VA=
=8RCg
-----END PGP SIGNATURE-----
Merge tag 'md/3.13' of git://neil.brown.name/md
Pull md update from Neil Brown:
"Mostly optimisations and obscure bug fixes.
- raid5 gets less lock contention
- raid1 gets less contention between normal-io and resync-io during
resync"
* tag 'md/3.13' of git://neil.brown.name/md:
md/raid5: Use conf->device_lock protect changing of multi-thread resources.
md/raid5: Before freeing old multi-thread worker, it should flush them.
md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
UAPI: include <asm/byteorder.h> in linux/raid/md_p.h
raid1: Rewrite the implementation of iobarrier.
raid1: Add some macros to make code clearly.
raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
raid1: Add a field array_frozen to indicate whether raid in freeze state.
md: Convert use of typedef ctl_table to struct ctl_table
md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
md: fix some places where mddev_lock return value is not checked.
raid5: Retry R5_ReadNoMerge flag when hit a read error.
raid5: relieve lock contention in get_active_stripe()
raid5: relieve lock contention in get_active_stripe()
wait: add wait_event_cmd()
md/raid5.c: add proper locking to error path of raid5_start_reshape.
md: fix calculation of stacking limits on level change.
raid5: Use slow_path to release stripe when mddev->thread is null
Pull networking fixes from David Miller:
"Mostly these are fixes for fallout due to merge window changes, as
well as cures for problems that have been with us for a much longer
period of time"
1) Johannes Berg noticed two major deficiencies in our genetlink
registration. Some genetlink protocols we passing in constant
counts for their ops array rather than something like
ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were
using fixed IDs for their multicast groups.
We have to retain these fixed IDs to keep existing userland tools
working, but reserve them so that other multicast groups used by
other protocols can not possibly conflict.
In dealing with these two problems, we actually now use less state
management for genetlink operations and multicast groups.
2) When configuring interface hardware timestamping, fix several
drivers that simply do not validate that the hwtstamp_config value
is one the driver actually supports. From Ben Hutchings.
3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.
4) In dev_forward_skb(), set the skb->protocol in the right order
relative to skb_scrub_packet(). From Alexei Starovoitov.
5) Bridge erroneously fails to use the proper wrapper functions to make
calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki
Makita.
6) When detaching a bridge port, make sure to flush all VLAN IDs to
prevent them from leaking, also from Toshiaki Makita.
7) Put in a compromise for TCP Small Queues so that deep queued devices
that delay TX reclaim non-trivially don't have such a performance
decrease. One particularly problematic area is 802.11 AMPDU in
wireless. From Eric Dumazet.
8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
here. Fix from Eric Dumzaet, reported by Dave Jones.
9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.
10) When computing mergeable buffer sizes, virtio-net fails to take the
virtio-net header into account. From Michael Dalton.
11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic
bumping, this one has been with us for a while. From Eric Dumazet.
12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
Hugne.
13) 6lowpan bit used for traffic classification was wrong, from Jukka
Rissanen.
14) macvlan has the same issue as normal vlans did wrt. propagating LRO
disabling down to the real device, fix it the same way. From Michal
Kubecek.
15) CPSW driver needs to soft reset all slaves during suspend, from
Daniel Mack.
16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.
17) The xen-netfront RX buffer refill timer isn't properly scheduled on
partial RX allocation success, from Ma JieYue.
18) When ipv6 ping protocol support was added, the AF_INET6 protocol
initialization cleanup path on failure was borked a little. Fix
from Vlad Yasevich.
19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
blocks we can do the wrong thing with the msg_name we write back to
userspace. From Hannes Frederic Sowa. There is another fix in the
works from Hannes which will prevent future problems of this nature.
20) Fix route leak in VTI tunnel transmit, from Fan Du.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
genetlink: make multicast groups const, prevent abuse
genetlink: pass family to functions using groups
genetlink: add and use genl_set_err()
genetlink: remove family pointer from genl_multicast_group
genetlink: remove genl_unregister_mc_group()
hsr: don't call genl_unregister_mc_group()
quota/genetlink: use proper genetlink multicast APIs
drop_monitor/genetlink: use proper genetlink multicast APIs
genetlink: only pass array to genl_register_family_with_ops()
tcp: don't update snd_nxt, when a socket is switched from repair mode
atm: idt77252: fix dev refcnt leak
xfrm: Release dst if this dst is improper for vti tunnel
netlink: fix documentation typo in netlink_set_err()
be2net: Delete secondary unicast MAC addresses during be_close
be2net: Fix unconditional enabling of Rx interface options
net, virtio_net: replace the magic value
ping: prevent NULL pointer dereference on write to msg_name
bnx2x: Prevent "timeout waiting for state X"
bnx2x: prevent CFC attention
bnx2x: Prevent panic during DMAE timeout
...
The quota code is abusing the genetlink API and is using
its family ID as the multicast group ID, which is invalid
and may belong to somebody else (and likely will.)
Make the quota code use the correct API, but since this
is already used as-is by userspace, reserve a family ID
for this code and also reserve that group ID to not break
userspace assumptions.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIOCSHWTSTAMP returns the real configuration to the application
using it, but there is currently no way for any other
application to find out the configuration non-destructively.
Add a new ioctl for this, making it unprivileged.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
linux/raid/md_p.h is using conditionals depending on endianess and fails
with an error if neither of __BIG_ENDIAN, __LITTLE_ENDIAN or
__BYTE_ORDER are defined, but it doesn't include any header which can
define these constants. This make this header unusable alone.
This patch adds a #include <asm/byteorder.h> at the beginning of this
header to make it usable alone. This is needed to compile klibc on MIPS.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: NeilBrown <neilb@suse.de>
Fix the name of the rx_filter field.
Remove text about 32/64-bit compatibility; this works just the same as
for most socket ioctls and as the structure is not allowed to grow
there is no need to remind anyone how to maintain it.
Add explanation about drivers changing the filter mode.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
- Re-enable flow steering verbs with new improved userspace ABI
- Fixes for slow connection due to GID lookup scalability
- IPoIB fixes
- Many fixes to HW drivers including mlx4, mlx5, ocrdma and qib
- Further improvements to SRP error handling
- Add new transport type for Cisco usNIC
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQIcBAABCAAGBQJSil7BAAoJEENa44ZhAt0hbtgP/A+AmUalbOX6ZKzuOFxsrtY2
r55CX9b1JBeFM/Zhn2o6y+81lpCjkckJSggESMe4izNgocGw0nW4vYGN4SBynatj
y8sR9OSn+G3ihuENrzG41MJUGEa5WbcNMy4boN+Oa+qyTlV/WjLR7Fv4WbikK7Wm
o8FNlXiiDhMoGfHHG5J0MD0EQsnxuLDk2XP+ciu4tLtTs+wBka+gFK8WnMvztle3
gTeMNna5ilvCS2fdBxteuPA3KeDnJE9AgJSMJ2a4Rh+DR8uTgWYQ6n7amjmOc546
yhAKkoBkxPE10+Yj82WOPhCFxSeWcuSwJvpgv5dTVZ1XqUUcC1V3TEcZDHmyyHQ7
uPXgS1A+erBW3OYPBjZqtKvnHObscV12fL+rId3vIhcAQIbFroci08ZwPidEYRkn
fvwlEKcrIsBIpRXEyjlFCxsiiDnfq1wC1VayMR3jrIK0P6idf1SXf/geiRp9+RGT
wKUc0j51jvEx29qc65xuhEP9FQV9pCMxyd+FEE0d0KkjMz5hsIkjmcUcBbgF0CGg
GEyDPlgRLv+vmWDGpT8XraaV/0CJOEQDIgB4WSN87/AZ4UoNt7spW2xqsLsp1toy
5e0100tpWUleTPLe/Wig5GtBdagQ2jAUK1+186CP93pFPtkwc4/7X3hyp7qPIPTz
VDvT9DEy6zjSMCLpMcdo
=xxC+
-----END PGP SIGNATURE-----
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband/rdma updates from Roland Dreier:
- Re-enable flow steering verbs with new improved userspace ABI
- Fixes for slow connection due to GID lookup scalability
- IPoIB fixes
- Many fixes to HW drivers including mlx4, mlx5, ocrdma and qib
- Further improvements to SRP error handling
- Add new transport type for Cisco usNIC
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (66 commits)
IB/core: Re-enable create_flow/destroy_flow uverbs
IB/core: extended command: an improved infrastructure for uverbs commands
IB/core: Remove ib_uverbs_flow_spec structure from userspace
IB/core: Use a common header for uverbs flow_specs
IB/core: Make uverbs flow structure use names like verbs ones
IB/core: Rename 'flow' structs to match other uverbs structs
IB/core: clarify overflow/underflow checks on ib_create/destroy_flow
IB/ucma: Convert use of typedef ctl_table to struct ctl_table
IB/cm: Convert to using idr_alloc_cyclic()
IB/mlx5: Fix page shift in create CQ for userspace
IB/mlx4: Fix device max capabilities check
IB/mlx5: Fix list_del of empty list
IB/mlx5: Remove dead code
IB/core: Encorce MR access rights rules on kernel consumers
IB/mlx4: Fix endless loop in resize CQ
RDMA/cma: Remove unused argument and minor dead code
RDMA/ucma: Discard events for IDs not yet claimed by user space
IB/core: Add Cisco usNIC rdma node and transport types
RDMA/nes: Remove self-assignment from nes_query_qp()
IB/srp: Report receive errors correctly
...
Pull media updates from Mauro Carvalho Chehab:
"This series include:
- a new Remote Controller driver for ST SoC with the corresponding DT
bindings
- a new frontend (cx24117)
- a new I2C camera flash driver (lm3560)
- a new mem2mem driver for TI SoC (ti-vpe)
- support for Raphael r828d added to r820t driver
- some improvements on buffer allocation at VB2 core
- usual driver fixes and improvements
PS this time, we have a smaller number of patches. While it is hard
to pinpoint to the reasons, I believe that it is mainly due to:
1) there are several patch series ready, but depending on DT review.
I decided to grant some extra time for DT maintainers to look on
it, as they're expecting to have more time with the changes agreed
during ARM mini-summit and KS. If they can't review in time for
3.14, I'll review myself and apply for the next merge window.
2) I suspect that having both LinuxCon EU and LinuxCon NA happening
during the same merge window affected the development
productivity, as several core media developers participated on
both events"
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (151 commits)
[media] media: st-rc: Add ST remote control driver
[media] gpio-ir-recv: Include linux/of.h header
[media] tvp7002: Include linux/of.h header
[media] tvp514x: Include linux/of.h header
[media] ths8200: Include linux/of.h header
[media] adv7343: Include linux/of.h header
[media] v4l: Fix typo in v4l2_subdev_get_try_crop()
[media] media: i2c: add driver for dual LED Flash, lm3560
[media] rtl28xxu: add 15f4:0131 Astrometa DVB-T2
[media] rtl28xxu: add RTL2832P + R828D support
[media] rtl2832: add new tuner R828D
[media] r820t: add support for R828D
[media] media/i2c: ths8200: fix build failure with gcc 4.5.4
[media] Add support for KWorld UB435-Q V2
[media] staging/media: fix msi3101 build errors
[media] ddbridge: Remove casting the return value which is a void pointer
[media] ngene: Remove casting the return value which is a void pointer
[media] dm1105: remove unneeded not-null test
[media] sh_mobile_ceu_camera: remove deprecated IRQF_DISABLED
[media] media: rcar_vin: Add preliminary r8a7790 support
...
This is required to properly calculate the tiling parameters
in userspace.
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This commit reverts commit 7afbddfae9 ("IB/core: Temporarily disable
create_flow/destroy_flow uverbs"). Since the uverbs extensions
functionality was experimental for v3.12, this patch re-enables the
support for them and flow-steering for v3.13.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 400dbc9658 ("IB/core: Infrastructure for extensible uverbs
commands") added an infrastructure for extensible uverbs commands
while later commit 436f2ad05a ("IB/core: Export ib_create/destroy_flow
through uverbs") exported ib_create_flow()/ib_destroy_flow() functions
using this new infrastructure.
According to the commit 400dbc9658, the purpose of this
infrastructure is to support passing around provider (eg. hardware)
specific buffers when userspace issue commands to the kernel, so that
it would be possible to extend uverbs (eg. core) buffers independently
from the provider buffers.
But the new kernel command function prototypes were not modified to
take advantage of this extension. This issue was exposed by Roland
Dreier in a previous review[1].
So the following patch is an attempt to a revised extensible command
infrastructure.
This improved extensible command infrastructure distinguish between
core (eg. legacy)'s command/response buffers from provider
(eg. hardware)'s command/response buffers: each extended command
implementing function is given a struct ib_udata to hold core
(eg. uverbs) input and output buffers, and another struct ib_udata to
hold the hw (eg. provider) input and output buffers.
Having those buffers identified separately make it easier to increase
one buffer to support extension without having to add some code to
guess the exact size of each command/response parts: This should make
the extended functions more reliable.
Additionally, instead of relying on command identifier being greater
than IB_USER_VERBS_CMD_THRESHOLD, the proposed infrastructure rely on
unused bits in command field: on the 32 bits provided by command
field, only 6 bits are really needed to encode the identifier of
commands currently supported by the kernel. (Even using only 6 bits
leaves room for about 23 new commands).
So this patch makes use of some high order bits in command field to
store flags, leaving enough room for more command identifiers than one
will ever need (eg. 256).
The new flags are used to specify if the command should be processed
as an extended one or a legacy one. While designing the new command
format, care was taken to make usage of flags itself extensible.
Using high order bits of the commands field ensure that newer
libibverbs on older kernel will properly fail when trying to call
extended commands. On the other hand, older libibverbs on newer kernel
will never be able to issue calls to extended commands.
The extended command header includes the optional response pointer so
that output buffer length and output buffer pointer are located
together in the command, allowing proper parameters checking. This
should make implementing functions easier and safer.
Additionally the extended header ensure 64bits alignment, while making
all sizes multiple of 8 bytes, extending the maximum buffer size:
legacy extended
Maximum command buffer: 256KBytes 1024KBytes (512KBytes + 512KBytes)
Maximum response buffer: 256KBytes 1024KBytes (512KBytes + 512KBytes)
For the purpose of doing proper buffer size accounting, the headers
size are no more taken in account in "in_words".
One of the odds of the current extensible infrastructure, reading
twice the "legacy" command header, is fixed by removing the "legacy"
command header from the extended command header: they are processed as
two different parts of the command: memory is read once and
information are not duplicated: it's making clear that's an extended
command scheme and not a different command scheme.
The proposed scheme will format input (command) and output (response)
buffers this way:
- command:
legacy header +
extended header +
command data (core + hw):
+----------------------------------------+
| flags | 00 00 | command |
| in_words | out_words |
+----------------------------------------+
| response |
| response |
| provider_in_words | provider_out_words |
| padding |
+----------------------------------------+
| |
. <uverbs input> .
. (in_words * 8) .
| |
+----------------------------------------+
| |
. <provider input> .
. (provider_in_words * 8) .
| |
+----------------------------------------+
- response, if present:
+----------------------------------------+
| |
. <uverbs output space> .
. (out_words * 8) .
| |
+----------------------------------------+
| |
. <provider output space> .
. (provider_out_words * 8) .
| |
+----------------------------------------+
The overall design is to ensure that the extensible infrastructure is
itself extensible while begin more reliable with more input and bound
checking.
Note:
The unused field in the extended header would be perfect candidate to
hold the command "comp_mask" (eg. bit field used to handle
compatibility). This was suggested by Roland Dreier in a previous
review[2]. But "comp_mask" field is likely to be present in the uverb
input and/or provider input, likewise for the response, as noted by
Matan Barak[3], so it doesn't make sense to put "comp_mask" in the
header.
[1]:
http://marc.info/?i=CAL1RGDWxmM17W2o_era24A-TTDeKyoL6u3NRu_=t_dhV_ZA9MA@mail.gmail.com
[2]:
http://marc.info/?i=CAL1RGDXJtrc849M6_XNZT5xO1+ybKtLWGq6yg6LhoSsKpsmkYA@mail.gmail.com
[3]:
http://marc.info/?i=525C1149.6000701@mellanox.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
[ Convert "ret ? ret : 0" to the equivalent "ret". - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
The structure holding any types of flow_spec is of no use to
userspace. It would be wrong for userspace to do:
struct ib_uverbs_flow_spec flow_spec;
flow_spec.type = IB_FLOW_SPEC_TCP;
flow_spec.size = sizeof(flow_spec);
Instead, userspace should use the dedicated flow_spec structure for
- Ethernet : struct ib_uverbs_flow_spec_eth,
- IPv4 : struct ib_uverbs_flow_spec_ipv4,
- TCP/UDP : struct ib_uverbs_flow_spec_tcp_udp.
In other words, struct ib_uverbs_flow_spec is a "virtual" data
structure that can only be use by the kernel as an alias to the other.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
Signed-off-by: Roland Dreier <roland@purestorage.com>
A common header will allows better checking of flow specs size, while
ensuring strict alignment to 64 bits.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch adds "flow" prefix to most of data structure added as part
of commit 436f2ad05a ("IB/core: Export ib_create/destroy_flow through
uverbs") to keep those names in sync with the data structures added in
commit 319a441d13 ("IB/core: Add receive flow steering support").
It's just a matter of translating 'ib_flow' to 'ib_uverbs_flow'.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 436f2ad05a ("IB/core: Export ib_create/destroy_flow through
uverbs") added public data structures to support receive flow
steering. The new structs are not following the 'uverbs' pattern:
they're lacking the common prefix 'ib_uverbs'.
This patch replaces ib_kern prefix by ib_uverbs.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://marc.info/?i=cover.1383773832.git.ydroneaud@opteya.com
Signed-off-by: Roland Dreier <roland@purestorage.com>
This patch fixes the following issues:
1. Unneeded checks were removed
2. Removed the fixed size out of flow_attr.size, thus simplifying the checks.
3. Remove a 32bit hole on 64bit systems with strict alignment in
struct ib_kern_flow_att by adding a reserved field.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
For performance reasons, sch_fq tried hard to not setup timers for every
sent packet, using a quantum based heuristic : A delay is setup only if
the flow exhausted its credit.
Problem is that application limited flows can refill their credit
for every queued packet, and they can evade pacing.
This problem can also be triggered when TCP flows use small MSS values,
as TSO auto sizing builds packets that are smaller than the default fq
quantum (3028 bytes)
This patch adds a 40 ms delay to guard flow credit refill.
Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7eec4174ff ("pkt_sched: fq: fix non TCP flows pacing")
obsoleted TCA_FQ_FLOW_DEFAULT_RATE without notice for the users.
Suggested by David Miller
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull second round of block driver updates from Jens Axboe:
"As mentioned in the original pull request, the bcache bits were pulled
because of their dependency on the immutable bio vecs. Kent re-did
this part and resubmitted it, so here's the 2nd round of (mostly)
driver updates for 3.13. It contains:
- The bcache work from Kent.
- Conversion of virtio-blk to blk-mq. This removes the bio and request
path, and substitutes with the blk-mq path instead. The end result
almost 200 deleted lines. Patch is acked by Asias and Christoph, who
both did a bunch of testing.
- A removal of bootmem.h include from Grygorii Strashko, part of a
larger series of his killing the dependency on that header file.
- Removal of __cpuinit from blk-mq from Paul Gortmaker"
* 'for-linus' of git://git.kernel.dk/linux-block: (56 commits)
virtio_blk: blk-mq support
blk-mq: remove newly added instances of __cpuinit
bcache: defensively handle format strings
bcache: Bypass torture test
bcache: Delete some slower inline asm
bcache: Use ida for bcache block dev minor
bcache: Fix sysfs splat on shutdown with flash only devs
bcache: Better full stripe scanning
bcache: Have btree_split() insert into parent directly
bcache: Move spinlock into struct time_stats
bcache: Kill sequential_merge option
bcache: Kill bch_next_recurse_key()
bcache: Avoid deadlocking in garbage collection
bcache: Incremental gc
bcache: Add make_btree_freeing_key()
bcache: Add btree_node_write_sync()
bcache: PRECEDING_KEY()
bcache: bch_(btree|extent)_ptr_invalid()
bcache: Don't bother with bucket refcount for btree node allocations
bcache: Debug code improvements
...
Pull drm updates from Dave Airlie:
"This is a combo of -next and some -fixes that came in in the
intervening time.
Highlights:
New drivers:
ARM Armada driver for Marvell Armada 510 SOCs
Intel:
Broadwell initial support under a default off switch,
Stereo/3D HDMI mode support
Valleyview improvements
Displayport improvements
Haswell fixes
initial mipi dsi panel support
CRC support for debugging
build with CONFIG_FB=n
Radeon:
enable DPM on a number of GPUs by default
secondary GPU powerdown support
enable HDMI audio by default
Hawaii support
Nouveau:
dynamic pm code infrastructure reworked, does nothing major yet
GK208 modesetting support
MSI fixes, on by default again
PMPEG improvements
pageflipping fixes
GMA500:
minnowboard SDVO support
VMware:
misc fixes
MSM:
prime, plane and rendernodes support
Tegra:
rearchitected to put the drm driver into the drm subsystem.
HDMI and gr2d support for tegra 114 SoC
QXL:
oops fix, and multi-head fixes
DRM core:
sysfs lifetime fixes
client capability ioctl
further cleanups to device midlayer
more vblank timestamp fixes"
* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (789 commits)
drm/nouveau: do not map evicted vram buffers in nouveau_bo_vma_add
drm/nvc0-/gr: shift wrapping bug in nvc0_grctx_generate_r406800
drm/nouveau/pwr: fix missing mutex unlock in a failure path
drm/nv40/therm: fix slowing down fan when pstate undefined
drm/nv11-: synchronise flips to vblank, unless async flip requested
drm/nvc0-: remove nasty fifo swmthd hack for flip completion method
drm/nv10-: we no longer need to create nvsw object on user channels
drm/nouveau: always queue flips relative to kernel channel activity
drm/nouveau: there is no need to reserve/fence the new fb when flipping
drm/nouveau: when bailing out of a pushbuf ioctl, do not remove previous fence
drm/nouveau: allow nouveau_fence_ref() to be a noop
drm/nvc8/mc: msi rearm is via the nvc0 method
drm/ttm: Fix vma page_prot bit manipulation
drm/vmwgfx: Fix a couple of compile / sparse warnings and errors
drm/vmwgfx: Resource evict fixes
drm/edid: compare actual vrefresh for all modes for quirks
drm: shmob_drm: Convert to clk_prepare/unprepare
drm/nouveau: fix 32-bit build
drm/i915/opregion: fix build error on CONFIG_ACPI=n
Revert "drm/radeon/audio: don't set speaker allocation on DCE4+"
...
side: the HV and emulation flavors can now coexist in a single kernel
is probably the most interesting change from a user point of view.
On the x86 side there are nested virtualization improvements and a
few bugfixes. ARM got transparent huge page support, improved
overcommit, and support for big endian guests.
Finally, there is a new interface to connect KVM with VFIO. This
helps with devices that use NoSnoop PCI transactions, letting the
driver in the guest execute WBINVD instructions. This includes
some nVidia cards on Windows, that fail to start without these
patches and the corresponding userspace changes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJShPAhAAoJEBvWZb6bTYbyl48P/297GgmELHAGBgjvb6q7yyGu
L8+eHjKbh4XBAkPwyzbvUjuww5z2hM0N3JQ0BDV9oeXlO+zwwCEns/sg2Q5/NJXq
XxnTeShaKnp9lqVBnE6G9rAOUWKoyLJ2wItlvUL8JlaO9xJ0Vmk0ta4n2Nv5GqDp
db6UD7vju6rHtIAhNpvvAO51kAOwc01xxRixCVb7KUYOnmO9nvpixzoI/S0Rp1gu
w/OWMfCosDzBoT+cOe79Yx1OKcpaVW94X6CH1s+ShCw3wcbCL2f13Ka8/E3FIcuq
vkZaLBxio7vjUAHRjPObw0XBW4InXEbhI1DjzIvm8dmc4VsgmtLQkTCG8fj+jINc
dlHQUq6Do+1F4zy6WMBUj8tNeP1Z9DsABp98rQwR8+BwHoQpGQBpAxW0TE0ZMngC
t1caqyvjZ5pPpFUxSrAV+8Kg4AvobXPYOim0vqV7Qea07KhFcBXLCfF7BWdwq/Jc
0CAOlsLL4mHGIQWZJuVGw0YGP7oATDCyewlBuDObx+szYCoV4fQGZVBEL0KwJx/1
7lrLN7JWzRyw6xTgJ5VVwgYE1tUY4IFQcHu7/5N+dw8/xg9KWA3f4PeMavIKSf+R
qteewbtmQsxUnvuQIBHLs8NRWPnBPy+F3Sc2ckeOLIe4pmfTte6shtTXcLDL+LqH
NTmT/cfmYp2BRkiCfCiS
=rWNf
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM changes from Paolo Bonzini:
"Here are the 3.13 KVM changes. There was a lot of work on the PPC
side: the HV and emulation flavors can now coexist in a single kernel
is probably the most interesting change from a user point of view.
On the x86 side there are nested virtualization improvements and a few
bugfixes.
ARM got transparent huge page support, improved overcommit, and
support for big endian guests.
Finally, there is a new interface to connect KVM with VFIO. This
helps with devices that use NoSnoop PCI transactions, letting the
driver in the guest execute WBINVD instructions. This includes some
nVidia cards on Windows, that fail to start without these patches and
the corresponding userspace changes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
kvm, vmx: Fix lazy FPU on nested guest
arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu
arm/arm64: KVM: MMIO support for BE guest
kvm, cpuid: Fix sparse warning
kvm: Delete prototype for non-existent function kvm_check_iopl
kvm: Delete prototype for non-existent function complete_pio
hung_task: add method to reset detector
pvclock: detect watchdog reset at pvclock read
kvm: optimize out smp_mb after srcu_read_unlock
srcu: API for barrier after srcu read unlock
KVM: remove vm mmap method
KVM: IOMMU: hva align mapping page size
KVM: x86: trace cpuid emulation when called from emulator
KVM: emulator: cleanup decode_register_operand() a bit
KVM: emulator: check rex prefix inside decode_register()
KVM: x86: fix emulation of "movzbl %bpl, %eax"
kvm_host: typo fix
KVM: x86: emulate SAHF instruction
MAINTAINERS: add tree for kvm.git
Documentation/kvm: add a 00-INDEX file
...
Pull btrfs update frm Chris Mason:
"This is our usual merge window set of bug fixes, performance
improvements and cleanups. Miao Xie has some really nice
optimizations for writeback.
Josef also expanded our sanity checks quite a bit; these make up a big
chunk of the new lines"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (98 commits)
Btrfs: rename btrfs_start_all_delalloc_inodes
Btrfs: don't wait for the completion of all the ordered extents
Btrfs: don't wait for all the async delalloc when shrinking delalloc
Btrfs: fix the confusion between delalloc bytes and metadata bytes
Btrfs: pick up the code for the item number calculation in flush_space()
Btrfs: wait for the ordered extent only when we want
Btrfs: remove unnecessary initialization and memory barrior in shrink_delalloc()
Btrfs: avoid unnecessary scrub workers allocation
Btrfs: check file extent type before anything else
btrfs: Remove useless variable in write_ctree_super()
btrfs: Fix checkpatch.pl warning of spacing issues
btrfs: Replace kmalloc with kmalloc_array
btrfs: Enclose macros with complex values within parenthesis
btrfs: Use WARN_ON()'s return value in place of WARN_ON(1)
btrfs: Remove redundant local zero structure
btrfs: Pack struct btrfs_device
btrfs: Replace multiple atomic_inc() with atomic_add()
btrfs: Add helper function for free_root_pointers()
Btrfs: fix a crash when running balance and defrag concurrently
Btrfs: do not run snapshot-aware defragment on error
...
Fix whitespace, capitalization, and spelling errors. No functional change.
I know "busses" is not an error, but "buses" was more common, so I used it
consistently.
Signed-off-by: Marta Rybczynska <rybczynska@gmail.com> (pci_reset_bridge_secondary_bus())
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>