New mount option "idsfromsid" indicates to cifs.ko that
it should try to retrieve the uid and gid owner fields
from special sids. This patch adds the code to parse the owner
sids in the ACL to see if they match, and if so populate the
uid and/or gid from them. This is faster than upcalling and asking
winbind, covers a fairly common case, and also helps when
cifs.upcall and idmapping are not configured.
Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Add "idsfromsid" mount option to indicate to cifs.ko that it should
try to retrieve the uid and gid owner fields from special sids in the
ACL if present. This first patch just adds the parsing for the mount
option.
Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
We are already doing the same thing for an ordinary open case:
we can't keep read oplock on a file if we have mandatory byte-range
locks because page reading can conflict with these locks on the server.
Fix it by setting oplock level to NONE.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
The openFileList of a tcon can change while cifs_reopen_file() is being
called, which can lead to unexpected behavior when we return to the loop.
Fix this by introducing a temporary list that holds all file handles that
need to be reopened.
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
We split the rawntlmssp authentication into negotiate and
authenticate parts. We also clean up the code and add helpers.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Add helper functions and split Kerberos authentication off
SMB2_sess_setup.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
/sys/module/cifs/parameters should display the three
other module load time configuration settings for cifs.ko
Signed-off-by: Germano Percossi <germano.percossi@citrix.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
Clean up some missing mem frees on some cifs ioctls, and
clarify others to make more obvious that no data is returned.
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Sachin Prabhu <sprabhu@redhat.com>
[CIFS] We had cases where we sent a SMB2/SMB3 setinfo request with all
timestamp (and DOS attribute) fields marked as 0 (ie do not change)
e.g. on chmod or chown.
Signed-off-by: Steve French <steve.french@primarydata.com>
CC: Stable <stable@vger.kernel.org>
Add mount option "max_credits" to allow setting maximum SMB3
credits to any value from 10 to 64000 (default is 32000).
This can be useful to work around servers with problems allocating
credits, to throttle the client to use a smaller amount of
simultaneous i/o, or to work around server performance issues.
Also adds a cap, so that even if the server granted us more than
65000 credits due to a server bug, we would not use that many.
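
The range check and the cap described above can be sketched roughly as
follows. This is only an illustration: the helper names are hypothetical,
the numeric bounds are the ones quoted in this message, and rejecting
out-of-range values (rather than clamping them) is an assumption of the
sketch, not something the message states.

```c
#include <stdbool.h>

/* Bounds as quoted in the commit message above. */
#define SKETCH_MAX_CREDITS_MIN      10
#define SKETCH_MAX_CREDITS_MAX      64000
#define SKETCH_MAX_CREDITS_DEFAULT  32000

/*
 * Hypothetical helper: validate a user-supplied max_credits value.
 * Returns false (and leaves *out untouched) when the value is out of range.
 */
static bool parse_max_credits(long requested, unsigned int *out)
{
    if (requested < SKETCH_MAX_CREDITS_MIN ||
        requested > SKETCH_MAX_CREDITS_MAX)
        return false;
    *out = (unsigned int)requested;
    return true;
}

/*
 * Hypothetical helper: add newly granted credits to the credits already
 * held, but never end up above max_credits, however many the server grants.
 */
static unsigned int cap_granted_credits(unsigned int held,
                                        unsigned int granted,
                                        unsigned int max_credits)
{
    unsigned long long total = (unsigned long long)held + granted;

    return total > max_credits ? max_credits : (unsigned int)total;
}
```

With the default of 32000, a client holding 31990 credits that is granted
100 more would end up with 32000 in this sketch.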
Signed-off-by: Steve French <steve.french@primarydata.com>
Continuous Availability features like persistent handles
require that clients reconnect their open files, not
just the sessions, soon after the network connection comes
back up, otherwise the server will throw away the state
(byte range locks, leases, deny modes) on those handles
after a timeout.
Add code to reconnect handles when use_persistent set
(e.g. Continuous Availability shares) after tree reconnect.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Germano Percossi <germano.percossi@citrix.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Remove the global file_list_lock to simplify cifs/smb3 locking and
have spinlocks that more closely match the information they are
protecting.
Add new tcon->open_file_lock and file->file_info_lock spinlocks.
Locks continue to follow a hierarchy,
cifs_socket --> cifs_ses --> cifs_tcon --> cifs_file
where global tcp_ses_lock still protects socket and cifs_ses, while the
newer locks protect the lower level structure's information
(tcon and cifs_file respectively).
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <steve.french@primarydata.com>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Germano Percossi <germano.percossi@citrix.com>
Patch a6b5058 causes the -EREMOTE returned by is_path_accessible() in
cifs_mount() to be ignored, which breaks DFS mounting.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
When we open a durable handle we give a Globally Unique
Identifier (GUID) to the server which we must keep for later reference
e.g. when reopening persistent handles on reconnection.
Without this the GUID generated for a new persistent handle was lost and
16 zero bytes were used instead on re-opening.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
GUIDs although random, and 16 bytes, need to be generated as
proper uuids.
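
As a rough illustration, assuming "proper uuids" means the RFC 4122
version 4 (random) layout, 16 random bytes become such a UUID by fixing
the version and variant bits. The helper name is hypothetical and the
caller is assumed to supply the random bytes.

```c
#include <stdint.h>

/*
 * Turn 16 already-random bytes into an RFC 4122 version 4 UUID in place.
 * Only the version and variant bits are touched; everything else stays
 * random.
 */
static void make_uuid_v4(uint8_t b[16])
{
    b[6] = (uint8_t)((b[6] & 0x0F) | 0x40);  /* high nibble: version 4 */
    b[8] = (uint8_t)((b[8] & 0x3F) | 0x80);  /* top two bits: variant 10 */
}
```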
Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reported-by: David Goebels <davidgoe@microsoft.com>
CC: Stable <stable@vger.kernel.org>
The kernel client requests 2 credits for many operations even though
they only use 1 credit (presumably to build up a buffer of credit).
Some servers seem to give the client as much credit as is requested. In
this case, the amount of credit the client has continues increasing to
the point where (server->credits * MAX_BUFFER_SIZE) overflows in
smb2_wait_mtu_credits().
Fix this by throttling the credit requests if a set limit is reached.
For async requests where the credit charge may be > 1, request as much
credit as what is charged.
The limit is chosen somewhat arbitrarily. The Windows client
defaults to 128 credits, the Windows server allows clients up to
512 credits (or 8192 for Windows 2016), and the NetApp server
(and at least one other) does not limit clients at all.
Choose a high enough value such that the client shouldn't limit
performance.
This behavior was seen with a NetApp filer (NetApp Release 9.0RC2).
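
The overflow can be illustrated with a small sketch. It assumes a 64 KiB
buffer size per credit and 32-bit unsigned arithmetic, both of which are
assumptions made for illustration; the throttle shown is one possible
policy for single-credit requests with hypothetical names, not
necessarily what the patch does.

```c
#include <stdint.h>

/* Assumption for illustration: each credit covers 64 KiB. */
#define SKETCH_MAX_BUFFER_SIZE 65536u

/* credits * buffer size in 32-bit arithmetic; wraps modulo 2^32. */
static uint32_t bytes_for_credits_32(uint32_t credits)
{
    return (uint32_t)(credits * SKETCH_MAX_BUFFER_SIZE);
}

/*
 * Hypothetical throttle: once the client already holds `limit` credits,
 * stop asking for more; below the limit, ask for `wanted`.
 */
static uint32_t throttled_request(uint32_t held, uint32_t wanted,
                                  uint32_t limit)
{
    return held >= limit ? 0 : wanted;
}
```

Under these assumptions the product is still exact at 65535 credits but
wraps to 0 at 65536, which is why an unbounded credit count is a problem.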
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
In debugging smb3, it is useful to display the number
of credits available, so we can see when the server has not granted
sufficient operations for the client to make progress, or alternatively
the client has requested too many credits (as we saw in a recent bug)
so we can compare with the number of credits the server thinks
we have.
Add a /proc/fs/cifs/DebugData line to display the client view
on how many credits are available.
Signed-off-by: Steve French <steve.french@primarydata.com>
Reported-by: Germano Percossi <germano.percossi@citrix.com>
CC: Stable <stable@vger.kernel.org>
Add parsing for new pseudo-xattr user.cifs.creationtime file
attribute to allow backup and test applications to view
birth time of file on cifs/smb3 mounts.
Signed-off-by: Steve French <steve.french@primarydata.com>
Add parsing for new pseudo-xattr user.cifs.dosattrib file attribute
so tools can recognize what kind of file it is, and verify if common
SMB3 attributes (system, hidden, archive, sparse, indexed etc.) are
set.
Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
Merge more updates from Andrew Morton:
- a few block updates that fell in my lap
- lib/ updates
- checkpatch
- autofs
- ipc
- a ton of misc other things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits)
mm: split gfp_mask and mapping flags into separate fields
fs: use mapping_set_error instead of opencoded set_bit
treewide: remove redundant #include <linux/kconfig.h>
hung_task: allow hung_task_panic when hung_task_warnings is 0
kthread: add kerneldoc for kthread_create()
kthread: better support freezable kthread workers
kthread: allow to modify delayed kthread work
kthread: allow to cancel kthread work
kthread: initial support for delayed kthread work
kthread: detect when a kthread work is used by more workers
kthread: add kthread_destroy_worker()
kthread: add kthread_create_worker*()
kthread: allow to call __kthread_create_on_node() with va_list args
kthread/smpboot: do not park in kthread_create_on_cpu()
kthread: kthread worker API cleanup
kthread: rename probe_kthread_data() to kthread_probe_data()
scripts/tags.sh: enable code completion in VIM
mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping
kdump, vmcoreinfo: report memory sections virtual addresses
ipc/sem.c: add cond_resched in exit_sme
...
The mapping_set_error() helper sets the correct AS_ flag for the mapping
so there is no reason to open code it. Use the helper directly.
[akpm@linux-foundation.org: be honest about conversion from -ENXIO to -EIO]
Link: http://lkml.kernel.org/r/20160912111608.2588-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kernel source files need not include <linux/kconfig.h> explicitly
because the top Makefile forces to include it with:
-include $(srctree)/include/linux/kconfig.h
This commit removes explicit includes except the following:
* arch/s390/include/asm/facilities_src.h
* tools/testing/radix-tree/linux/kernel.h
These two are used for host programs.
Link: http://lkml.kernel.org/r/1473656164-11929-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a patch that provides behavior that is more consistent, and
probably less surprising to users. I consider the change optional, and
welcome opinions about whether it should be applied.
By default, pipes are created with a capacity of 64 kiB. However,
/proc/sys/fs/pipe-max-size may be set smaller than this value. In this
scenario, an unprivileged user could thus create a pipe whose initial
capacity exceeds the limit. Therefore, it seems logical to cap the
initial pipe capacity according to the value of pipe-max-size.
The test program shown earlier in this patch series can be used to
demonstrate the effect of the change brought about with this patch:
# cat /proc/sys/fs/pipe-max-size
1048576
# sudo -u mtk ./test_F_SETPIPE_SZ 1
Initial pipe capacity: 65536
# echo 10000 > /proc/sys/fs/pipe-max-size
# cat /proc/sys/fs/pipe-max-size
16384
# sudo -u mtk ./test_F_SETPIPE_SZ 1
Initial pipe capacity: 16384
# ./test_F_SETPIPE_SZ 1
Initial pipe capacity: 65536
The last two executions of 'test_F_SETPIPE_SZ' show that pipe-max-size
caps the initial allocation for a new pipe for unprivileged users, but
not for privileged users.
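
The capping rule can be sketched as below. This is a userspace
illustration only: the function and macro names are hypothetical, and a
4096-byte page with a 16-page (64 kiB) default is assumed.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SIZE    4096UL  /* assumed page size */
#define SKETCH_DEF_BUFFERS  16UL    /* 16 pages = 64 kiB default capacity */

/*
 * Initial capacity (in bytes) of a new pipe: the default, unless the
 * caller is unprivileged and pipe-max-size is smaller than the default.
 */
static unsigned long initial_pipe_capacity(unsigned long pipe_max_size,
                                           bool privileged)
{
    unsigned long bufs = SKETCH_DEF_BUFFERS;

    if (!privileged && bufs * SKETCH_PAGE_SIZE > pipe_max_size)
        bufs = pipe_max_size / SKETCH_PAGE_SIZE;
    return bufs * SKETCH_PAGE_SIZE;
}
```

The three runs shown above correspond to (1048576, unprivileged),
(16384, unprivileged) and (16384, privileged).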
Link: http://lkml.kernel.org/r/31dc7064-2a17-9c5b-1df1-4e3012ee992c@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an optional patch, to provide a small performance
improvement. Alter account_pipe_buffers() so that it returns the
new value in user->pipe_bufs. This means that we can refactor
too_many_pipe_buffers_soft() and too_many_pipe_buffers_hard() to
avoid the costs of repeated use of atomic_long_read() to get the
value user->pipe_bufs.
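
A minimal userspace sketch of the idea, using C11 atomics in place of the
kernel's atomic_long_t. The struct and function names are stand-ins, and
treating a limit of 0 as "no limit" with a strict "greater than"
comparison is an assumption of the sketch.

```c
#include <stdatomic.h>

/* Stand-in for the per-user accounting structure. */
struct sketch_user {
    atomic_long pipe_bufs;
};

/*
 * Replace old_bufs with new_bufs in the user's total and return the
 * resulting total, so callers need no separate atomic read afterwards.
 */
static long account_bufs(struct sketch_user *u, long old_bufs, long new_bufs)
{
    long delta = new_bufs - old_bufs;

    return atomic_fetch_add(&u->pipe_bufs, delta) + delta;
}

/* Limit check that works on the value returned by account_bufs(). */
static int over_limit(long user_bufs, long limit)
{
    return limit != 0 && user_bufs > limit;
}
```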
Link: http://lkml.kernel.org/r/93e5f193-1e5e-3e1f-3a20-eae79b7e1310@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The limit checking in alloc_pipe_info() (used by pipe(2) and when
opening a FIFO) has the following problems:
(1) When checking capacity required for the new pipe, the checks against
the limit in /proc/sys/fs/pipe-user-pages-{soft,hard} are made
against existing consumption, and exclude the memory required for
the new pipe capacity. As a consequence: (a) the memory allocation
throttling provided by the soft limit does not kick in quite as
early as it should, and (b) the user can overrun the hard limit.
(2) As currently implemented, accounting and checking against the limits
is done as follows:
(a) Test whether the user has exceeded the limit.
(b) Make new pipe buffer allocation.
(c) Account new allocation against the limits.
This is racy. Multiple processes may pass point (a) simultaneously,
and then allocate pipe buffers that are accounted for only in step
(c). The race means that the user's pipe buffer allocation could be
pushed over the limit (by an arbitrary amount, depending on how
unlucky we were in the race). [Thanks to Vegard Nossum for spotting
this point, which I had missed.]
This patch addresses the above problems as follows:
* Alter the checks against limits to include the memory required for the
new pipe.
* Re-order the accounting step so that it precedes the buffer allocation.
If the accounting step determines that a limit has been reached, revert
the accounting and cause the operation to fail.
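
The reordered sequence (account, check, allocate, revert on failure) can
be sketched in userspace C as follows. All names are hypothetical, a
single global counter stands in for the per-user accounting, and only a
hard limit is modelled.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Stand-in for the per-user count of pipe buffers. */
static atomic_long sketch_pipe_bufs;

/*
 * Allocate nbufs buffers of buf_size bytes. The new buffers are accounted
 * first, so the check sees the total including this request; on any
 * failure the accounting is reverted. hard_limit == 0 means no limit.
 */
static void *alloc_pipe_sketch(long nbufs, long hard_limit, size_t buf_size)
{
    void *p = NULL;
    long total = atomic_fetch_add(&sketch_pipe_bufs, nbufs) + nbufs;

    if (hard_limit == 0 || total <= hard_limit)
        p = calloc((size_t)nbufs, buf_size);
    if (p == NULL)
        atomic_fetch_sub(&sketch_pipe_bufs, nbufs);  /* revert accounting */
    return p;
}
```

Because the counter is bumped before the check, two concurrent callers
cannot both pass the check on the strength of the same headroom.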
Link: http://lkml.kernel.org/r/8ff3e9f9-23f6-510c-644f-8e70cd1c0bd9@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace an 'if' block that covers most of the code in this function
with a 'goto'. This makes the code a little simpler to read, and also
simplifies the next patch (fix limit checking in alloc_pipe_info())
Link: http://lkml.kernel.org/r/aef030c1-0257-98a9-4988-186efa48530c@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The limit checking in pipe_set_size() (used by fcntl(F_SETPIPE_SZ))
has the following problems:
(1) When increasing the pipe capacity, the checks against the limits in
/proc/sys/fs/pipe-user-pages-{soft,hard} are made against existing
consumption, and exclude the memory required for the increased pipe
capacity. The new increase in pipe capacity can then push the total
memory used by the user for pipes (possibly far) over a limit. This
can also trigger the problem described next.
(2) The limit checks are performed even when the new pipe capacity is
less than the existing pipe capacity. This can lead to problems if a
user sets a large pipe capacity, and then the limits are lowered,
with the result that the user will no longer be able to decrease the
pipe capacity.
(3) As currently implemented, accounting and checking against the
limits is done as follows:
(a) Test whether the user has exceeded the limit.
(b) Make new pipe buffer allocation.
(c) Account new allocation against the limits.
This is racy. Multiple processes may pass point (a)
simultaneously, and then allocate pipe buffers that are accounted
for only in step (c). The race means that the user's pipe buffer
allocation could be pushed over the limit (by an arbitrary amount,
depending on how unlucky we were in the race). [Thanks to Vegard
Nossum for spotting this point, which I had missed.]
This patch addresses the above problems as follows:
* Perform checks against the limits only when increasing a pipe's
capacity; an unprivileged user can always decrease a pipe's capacity.
* Alter the checks against limits to include the memory required for
the new pipe capacity.
* Re-order the accounting step so that it precedes the buffer
allocation. If the accounting step determines that a limit has
been reached, revert the accounting and cause the operation to fail.
The program below can be used to demonstrate problems 1 and 2, and the
effect of the fix. The program takes one or more command-line arguments.
The first argument specifies the number of pipes that the program should
create. The remaining arguments are, alternately, pipe capacities that
should be set using fcntl(F_SETPIPE_SZ), and sleep intervals (in
seconds) between the fcntl() operations. (The sleep intervals allow the
possibility to change the limits between fcntl() operations.)
Problem 1
=========
Using the test program on an unpatched kernel, we first set some
limits:
# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB
Then show that we can set a pipe with capacity (100MB) that is
over the hard limit
# sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
Initial pipe capacity: 65536
Loop 1: set pipe capacity to 100000000 bytes
F_SETPIPE_SZ returned 134217728
Now set the capacity to 100MB twice. The second call fails (which is
probably surprising to most users, since it seems like a no-op):
# sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000 0 100000000
Initial pipe capacity: 65536
Loop 1: set pipe capacity to 100000000 bytes
F_SETPIPE_SZ returned 134217728
Loop 2: set pipe capacity to 100000000 bytes
Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted
With a patched kernel, setting a capacity over the limit fails at the
first attempt:
# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard
# sudo -u mtk ./test_F_SETPIPE_SZ 1 100000000
Initial pipe capacity: 65536
Loop 1: set pipe capacity to 100000000 bytes
Loop 1, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted
There is a small chance that the change to fix this problem could
break user-space, since there are cases where fcntl(F_SETPIPE_SZ)
calls that previously succeeded might fail. However, the chances are
small, since (a) the pipe-user-pages-{soft,hard} limits are new (in
4.5), and (b) the default soft/hard limits are high/unlimited. Therefore,
it seems warranted to make these limits operate more precisely (and
behave more like what users probably expect).
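
The value 134217728 returned for a request of 100000000 bytes in the runs
above is the request rounded up to a power-of-two number of pages. A
sketch of that rounding, assuming 4096-byte pages and using a
hypothetical function name:

```c
#define SKETCH_PAGE_SIZE 4096UL  /* assumed page size */

/* Round a requested capacity up to a power-of-two number of pages. */
static unsigned long round_capacity(unsigned long size)
{
    unsigned long pages = (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    unsigned long pow2 = 1;

    while (pow2 < pages)
        pow2 <<= 1;
    return pow2 * SKETCH_PAGE_SIZE;
}
```

For 100000000 bytes this gives 24415 pages, rounded up to 32768 pages,
i.e. 134217728 bytes.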
Problem 2
=========
Running the test program on an unpatched kernel, we first set some limits:
# getconf PAGESIZE
4096
# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard # 40.96 MB
Now perform two fcntl(F_SETPIPE_SZ) operations on a single pipe,
first setting a pipe capacity (10MB), sleeping for a few seconds,
during which time the hard limit is lowered, and then set pipe
capacity to a smaller amount (5MB):
# sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
[1] 748
# Initial pipe capacity: 65536
Loop 1: set pipe capacity to 10000000 bytes
F_SETPIPE_SZ returned 16777216
Sleeping 15 seconds
# echo 1000 > /proc/sys/fs/pipe-user-pages-hard # 4.096 MB
# Loop 2: set pipe capacity to 5000000 bytes
Loop 2, pipe 0: F_SETPIPE_SZ failed: fcntl: Operation not permitted
In this case, the user should be able to lower the pipe capacity.
With a kernel that has the patch below, the second fcntl()
succeeds:
# echo 0 > /proc/sys/fs/pipe-user-pages-soft
# echo 1000000000 > /proc/sys/fs/pipe-max-size
# echo 10000 > /proc/sys/fs/pipe-user-pages-hard
# sudo -u mtk ./test_F_SETPIPE_SZ 1 10000000 15 5000000 &
[1] 3215
# Initial pipe capacity: 65536
# Loop 1: set pipe capacity to 10000000 bytes
F_SETPIPE_SZ returned 16777216
Sleeping 15 seconds
# echo 1000 > /proc/sys/fs/pipe-user-pages-hard
# Loop 2: set pipe capacity to 5000000 bytes
F_SETPIPE_SZ returned 8388608
8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
/* test_F_SETPIPE_SZ.c

   (C) 2016, Michael Kerrisk; licensed under GNU GPL version 2 or later

   Test operation of fcntl(F_SETPIPE_SZ) for setting pipe capacity
   and interactions with limits defined by /proc/sys/fs/pipe-* files.
*/

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
    int (*pfd)[2];
    int npipes;
    int pcap, rcap;
    int j, p, s, stime, loop;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s num-pipes "
                "[pipe-capacity sleep-time]...\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    npipes = atoi(argv[1]);

    pfd = calloc(npipes, sizeof (int [2]));
    if (pfd == NULL) {
        perror("calloc");
        exit(EXIT_FAILURE);
    }

    for (j = 0; j < npipes; j++) {
        if (pipe(pfd[j]) == -1) {
            fprintf(stderr, "Loop %d: pipe() failed: ", j);
            perror("pipe");
            exit(EXIT_FAILURE);
        }
    }

    printf("Initial pipe capacity: %d\n", fcntl(pfd[0][0], F_GETPIPE_SZ));

    for (j = 2; j < argc; j += 2 ) {
        loop = j / 2;
        pcap = atoi(argv[j]);
        printf(" Loop %d: set pipe capacity to %d bytes\n", loop, pcap);

        for (p = 0; p < npipes; p++) {
            s = fcntl(pfd[p][0], F_SETPIPE_SZ, pcap);
            if (s == -1) {
                fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ "
                        "failed: ", loop, p);
                perror("fcntl");
                exit(EXIT_FAILURE);
            }

            if (p == 0) {
                printf(" F_SETPIPE_SZ returned %d\n", s);
                rcap = s;
            } else {
                if (s != rcap) {
                    fprintf(stderr, " Loop %d, pipe %d: F_SETPIPE_SZ "
                            "unexpected return: %d\n", loop, p, s);
                    exit(EXIT_FAILURE);
                }
            }

            stime = (j + 1 < argc) ? atoi(argv[j + 1]) : 0;
            if (stime > 0) {
                printf(" Sleeping %d seconds\n", stime);
                sleep(stime);
            }
        }
    }

    exit(EXIT_SUCCESS);
}
8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---8x---
Patch history:
v2
* Switch order of test in 'if' statement to avoid function call
(to capability()) in normal path. [This is a fix to a preexisting
wart in the code. Thanks to Willy Tarreau]
* Perform (size > pipe_max_size) check before calling
account_pipe_buffers(). [Thanks to Vegard Nossum]
Quoting Vegard:
The potential problem happens if the user passes a very large number
which will overflow pipe->user->pipe_bufs.
On 32-bit, sizeof(int) == sizeof(long), so if they pass arg = INT_MAX
then round_pipe_size() returns INT_MAX. Although it's true that the
accounting is done in terms of pages and not bytes, so you'd need on
the order of (1 << 13) = 8192 processes hitting the limit at the same
time in order to make it overflow, which seems a bit unlikely.
(See https://lkml.org/lkml/2016/8/12/215 for another discussion on the
limit checking)
Link: http://lkml.kernel.org/r/1e464945-536b-2420-798b-e77b9c7e8593@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a preparatory patch for following work. account_pipe_buffers()
performs accounting in the 'user_struct'. There is no need to pass a
pointer to a 'pipe_inode_info' struct (which is then dereferenced to
obtain a pointer to the 'user' field). Instead, pass a pointer directly
to the 'user_struct'. This change is needed in preparation for a
subsequent patch that fixes the limit checking in alloc_pipe_info()
(and the resulting code is a little more logical).
Link: http://lkml.kernel.org/r/7277bf8c-a6fc-4a7d-659c-f5b145c981ab@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is a preparatory patch for following work. Move the F_SETPIPE_SZ
limit-checking logic from pipe_fcntl() into pipe_set_size(). This
simplifies the code a little, and allows for reworking required in
a later patch that fixes the limit checking in pipe_set_size()
Link: http://lkml.kernel.org/r/3701b2c5-2c52-2c3e-226d-29b9deb29b50@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "pipe: fix limit handling", v2.
When changing a pipe's capacity with fcntl(F_SETPIPE_SZ), various limits
defined by /proc/sys/fs/pipe-* files are checked to see if unprivileged
users are exceeding limits on memory consumption.
While documenting and testing the operation of these limits I noticed
that, as currently implemented, these checks have a number of problems:
(1) When increasing the pipe capacity, the checks against the limits
in /proc/sys/fs/pipe-user-pages-{soft,hard} are made against
existing consumption, and exclude the memory required for the
increased pipe capacity. The new increase in pipe capacity can then
push the total memory used by the user for pipes (possibly far) over
a limit. This can also trigger the problem described next.
(2) The limit checks are performed even when the new pipe capacity
is less than the existing pipe capacity. This can lead to problems
if a user sets a large pipe capacity, and then the limits are
lowered, with the result that the user will no longer be able to
decrease the pipe capacity.
(3) As currently implemented, accounting and checking against the
limits is done as follows:
(a) Test whether the user has exceeded the limit.
(b) Make new pipe buffer allocation.
(c) Account new allocation against the limits.
This is racy. Multiple processes may pass point (a) simultaneously,
and then allocate pipe buffers that are accounted for only in step
(c). The race means that the user's pipe buffer allocation could be
pushed over the limit (by an arbitrary amount, depending on how
unlucky we were in the race). [Thanks to Vegard Nossum for spotting
this point, which I had missed.]
This patch series addresses these three problems.
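
As a rough userspace sketch of how the three problems can be avoided
together when resizing a pipe: account the change first, apply the limit
check only when the capacity grows, and revert the accounting when the
check fails. The names are hypothetical, a single global counter stands
in for the per-user accounting, and only a hard limit is modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the per-user count of pipe buffers (in pages). */
static atomic_long sketch_user_bufs;

/*
 * Account a resize from old_bufs to new_bufs pages. Returns true when the
 * resize is allowed. hard_limit == 0 means no limit.
 */
static bool resize_accounting(long old_bufs, long new_bufs,
                              long hard_limit, bool privileged)
{
    long delta = new_bufs - old_bufs;
    /* Account first, so the check includes the requested increase. */
    long total = atomic_fetch_add(&sketch_user_bufs, delta) + delta;

    /* Only increases by unprivileged users are subject to the limit. */
    if (new_bufs > old_bufs && !privileged &&
        hard_limit != 0 && total > hard_limit) {
        atomic_fetch_sub(&sketch_user_bufs, delta);  /* revert */
        return false;
    }
    return true;
}
```

In this sketch a shrink always succeeds, even if the limit was lowered
below the user's current consumption in the meantime.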
This patch (of 8):
This is a minor preparatory patch. After subsequent patches,
round_pipe_size() will be called from pipe_set_size(), so place
round_pipe_size() above pipe_set_size().
Link: http://lkml.kernel.org/r/91a91fdb-a959-ba7f-b551-b62477cc98a1@gmail.com
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: <socketpair@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cmd part of this struct is the same as an index of itself within
_ioctls[]. In fact this cmd is unused, so we can drop this part.
Link: http://lkml.kernel.org/r/20160831033414.9910.66697.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Having this in autofs_i.h gives the illusion that uncommenting this enables
pr_debug(), but it doesn't enable all the pr_debug() in autofs because
inclusion order matters.
XFS has the same DEBUG macro in its core header fs/xfs/xfs.h, however XFS
seems to have a rule to include this prior to other XFS headers as well as
kernel headers. This is not the case with autofs, and DEBUG could be
enabled via Makefile, so autofs should just get rid of this comment to
make the code less confusing. It's a comment, so there is literally no
functional difference.
Link: http://lkml.kernel.org/r/20160831033409.9910.77067.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All other warnings use "cmd(0x%08x)" and this is the only one with
"cmd(%d)". (The output below comes from my userspace debug program, not
the automount daemon.)
[ 1139.905676] autofs4:pid:1640:check_dev_ioctl_version: ioctl control interface version mismatch: kernel(1.0), user(0.0), cmd(-1072131215)
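
For comparison, the same command value printed in the hexadecimal form
used by the other warnings can be produced with a short sketch like the
following (the helper name is hypothetical):

```c
#include <stdio.h>

/* Format an ioctl command number the way the other warnings do. */
static int format_cmd_hex(char *buf, size_t len, int cmd)
{
    return snprintf(buf, len, "cmd(0x%08x)", (unsigned int)cmd);
}
```

For the value in the log line above, -1072131215, this yields
cmd(0xc0189371), which is much easier to match against ioctl definitions
than the signed decimal form.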
Link: http://lkml.kernel.org/r/20160812024851.12352.75458.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <ikent@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No functional changes, based on the following justification.
1. Make the code more consistent using the ioctl vector _ioctls[],
rather than assigning NULL only for this ioctl command.
2. Remove goto done; for better maintainability in the long run.
3. The existing code is based on the fact that validate_dev_ioctl()
sets ioctl version for any command, but AUTOFS_DEV_IOCTL_VERSION_CMD
should explicitly set it regardless of the default behavior.
Link: http://lkml.kernel.org/r/20160812024846.12352.9885.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <ikent@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The count of miscellaneous device ioctls in fs/autofs4/autofs_i.h is wrong.
The number of ioctls is the difference between AUTOFS_DEV_IOCTL_VERSION_CMD
and AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD (14) not the difference between
AUTOFS_IOC_COUNT and 11 (21).
[kusumi.tomohiro@gmail.com: fix typo that made the count macro negative]
Link: http://lkml.kernel.org/r/20160831033420.9910.16809.stgit@pluto.themaw.net
Link: http://lkml.kernel.org/r/20160812024841.12352.11975.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This isn't a return value, so change the message to indicate the status is
the result of may_umount().
(or locate pr_debug() after put_user() with the same message)
Link: http://lkml.kernel.org/r/20160812024836.12352.74628.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <ikent@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These two were left from commit aa55ddf340 ("autofs4: remove unused
ioctls") which removed unused ioctls.
Link: http://lkml.kernel.org/r/20160812024810.12352.96377.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <ikent@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Free dentry data allocated by autofs4_new_ino() with autofs4_free_ino()
instead of raw kfree(), since we have an interface to free autofs_info*.
This patch was modified to remove the need to set the dentry info field to
NULL due to a change in the previous patch.
Link: http://lkml.kernel.org/r/20160812024805.12352.43650.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The inode allocation failure case in autofs4_dir_symlink() frees the
autofs dentry info of the dentry without setting ->d_fsdata to NULL.
That could lead to a double free so just get rid of the free and leave it
to ->d_release().
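The ownership rule described above can be modelled in userspace. This is
only a sketch under assumptions: sk_dentry, sk_d_release() and
sk_dir_symlink() are hypothetical stand-ins, not the autofs code. It shows
the error path returning without freeing, so the release hook stays the
single place that frees the private data.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a dentry carrying private data. */
struct sk_dentry {
	void *d_fsdata;
};

static int sk_free_count;	/* how many times private data was freed */

/* Single release hook: the only place that frees d_fsdata. */
static void sk_d_release(struct sk_dentry *dentry)
{
	if (dentry->d_fsdata) {
		free(dentry->d_fsdata);
		dentry->d_fsdata = NULL;
		sk_free_count++;
	}
}

/* Model of a symlink operation whose inode allocation may fail. */
static int sk_dir_symlink(struct sk_dentry *dentry, int inode_alloc_fails)
{
	(void)dentry;
	if (inode_alloc_fails) {
		/* No free here: the data stays attached for the release hook. */
		return -ENOMEM;
	}
	return 0;
}
```

With this shape, a failed sk_dir_symlink() followed by sk_d_release()
frees the data exactly once.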
Link: http://lkml.kernel.org/r/20160812024759.12352.10653.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's invalid if the given mode is neither dir nor link, so warn in the
else case.
Link: http://lkml.kernel.org/r/20160812024754.12352.8536.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Somewhere along the line the error handling gotos have become incorrect.
Link: http://lkml.kernel.org/r/20160812024749.12352.15100.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch does what the comment below suggests. The test can be done
earlier, and it is better to do it first, before the various functions get
called during initialization.
/* Couldn't this be tested earlier? */
Link: http://lkml.kernel.org/r/20160812024744.12352.43075.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
autofs4_kill_sb() doesn't need to be declared as extern, and no other
functions in .h are explicitly declared as extern.
Link: http://lkml.kernel.org/r/20160812024739.12352.99354.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
with the number of fds passed. We had a customer report page allocation
failures of order-4 for this allocation. This is a costly order, so it might
easily fail, as the VM expects such allocation to have a lower-order fallback.
A trivial fallback is vmalloc(), as the memory doesn't have to be physically
contiguous and the allocation is temporary for the duration of the syscall
only. There were some concerns about whether this would have a negative impact
on the system by exposing vmalloc() to userspace. Although excessive use of
vmalloc() can cause some system-wide performance issues (TLB flushes etc.), a
large-order allocation is not free either, and excessive reclaim/compaction
can have a similar effect. Also note that the size is effectively limited by
RLIMIT_NOFILE, which defaults to 1024 on the systems I checked. That means the
bitmaps will fit well within a single page and thus the vmalloc() fallback can
only be exercised for processes where root allows a higher limit.
Note that the poll(2) syscall seems to use a linked list of order-0 pages, so
it doesn't need this kind of fallback.
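The size argument above can be checked with a little arithmetic. The
following is a userspace sketch, not the kernel code: the helper names are
made up, and the factor of six (in/out/ex bitmaps for both input and
result) and the 4096-byte page size are assumptions based on how
fs/select.c is commonly described.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Bytes for one fd bitmap covering nr descriptors, rounded up to longs. */
static size_t sk_fds_bytes(size_t nr)
{
	size_t bits_per_long = sizeof(long) * CHAR_BIT;

	return ((nr + bits_per_long - 1) / bits_per_long) * sizeof(long);
}

/* Assumed: six bitmaps per select() call (in/out/ex, input and result). */
static size_t sk_select_alloc_size(size_t nr)
{
	return 6 * sk_fds_bytes(nr);
}

/* Smallest page order whose span (page_size << order) holds size bytes. */
static unsigned int sk_alloc_order(size_t size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	assert(page_size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

Under these assumptions, 1024 fds need 768 bytes (order 0), while 65536
fds need 49152 bytes, which is an order-4 request with 4096-byte pages.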
[eric.dumazet@gmail.com: fix failure path logic]
[akpm@linux-foundation.org: use proper type for size]
Link: http://lkml.kernel.org/r/20160927084536.5923-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After much discussion, it seems that the fallocate feature flag
FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME, and the feature
FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been whitelisted
for zeroing SCSI UNMAP. Punch still requires that FALLOC_FL_KEEP_SIZE is
set. A length that goes past the end of the device will be clamped to the
device size if KEEP_SIZE is set, or will return -EINVAL if not. Both
start and length must be aligned to the device's logical block size.
Since the semantics of fallocate are fairly well established already, wire
up the two pieces. The other fallocate variants (collapse range, insert
range, and allocate blocks) are not supported.
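The argument rules above can be sketched as a small userspace model. This
is not the kernel implementation: sk_blk_falloc_check() and the SK_*
names are hypothetical, the flag values mirror <linux/falloc.h>, and
returning -EOPNOTSUPP for unsupported modes and -EINVAL for a zero length
are assumptions that the text above does not state.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Flag values mirroring <linux/falloc.h>, redefined for portability. */
#define SK_KEEP_SIZE	0x01
#define SK_PUNCH_HOLE	0x02
#define SK_ZERO_RANGE	0x10

/*
 * Validate a request against a device of dev_size bytes with logical
 * block size lbs. On success, store the (possibly clamped) length.
 */
static int sk_blk_falloc_check(int mode, uint64_t start, uint64_t len,
			       uint64_t dev_size, uint32_t lbs,
			       uint64_t *out_len)
{
	/* lbs must be a non-zero power of two for the mask test below. */
	assert(lbs != 0 && (lbs & (lbs - 1)) == 0);

	switch (mode) {
	case SK_ZERO_RANGE:
	case SK_ZERO_RANGE | SK_KEEP_SIZE:
	case SK_PUNCH_HOLE | SK_KEEP_SIZE:	/* punch requires KEEP_SIZE */
		break;
	default:
		return -EOPNOTSUPP;		/* assumed error code */
	}

	if (len == 0 || start >= dev_size)
		return -EINVAL;

	/* Past the end: clamp with KEEP_SIZE, reject without it. */
	if (len > dev_size - start) {
		if (!(mode & SK_KEEP_SIZE))
			return -EINVAL;
		len = dev_size - start;
	}

	/* Both start and length must be block aligned. */
	if ((start | len) & (uint64_t)(lbs - 1))
		return -EINVAL;

	*out_len = len;
	return 0;
}
```

From userspace, such a request would be issued with fallocate(2) on an
open block device file descriptor using the same mode flags.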
Link: http://lkml.kernel.org/r/147518379992.22791.8849838163218235007.stgit@birch.djwong.org
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com> # tweaked header
Cc: Brian Foster <bfoster@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In dlm_migrate_request_handler(), when `ret' is -EEXIST, the mle
should be freed; otherwise the memory will be leaked.
Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4A3D3522A@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
Reviewed-by: Mark Fasheh <mfasheh@versity.com>
Cc: Eric Ren <zren@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* PMEM sub-division support: Allow a single PMEM region to be divided
into multiple namespaces. Originally, ~2 years ago, it was thought that
partitions of a /dev/pmemX block device could handle sub-allocations of
persistent memory for different use cases. With the decision to not
support DAX mappings of raw block-devices, and the genesis of
device-dax, the need for having multiple pmem-namespaces per region has
grown.
* Device-DAX unified inode: In support of dynamic-resizing of a
device-dax instance the kernel arranges for all mappings of a
device-dax node to share the same inode. This allows unmap / truncate /
invalidation events to affect all instances of the device similar to the
behavior of mmap on block devices.
* Hardware error scrubbing reworks: The original address-range-scrub +
badblocks tracking solution allowed clearing entries at the individual
namespace level, but it failed to clear the internal list of media
errors maintained at the bus level. The result was that the next scrub
or namespace disable/re-enable event would restore the cleared
badblocks, but now that is fixed. The v4.8 kernel introduced an
auto-scrub-on-machine-check behavior to repopulate the badblocks list.
Now, in v4.9, the auto-scrub behavior can be disabled, in which case the
kernel simply arranges for the error reported in the machine-check to be
added to the list.
* DIMM health-event notification support: ACPI 6.1 defines a
notification event code that can be sent to ACPI NVDIMM devices. A
poll(2) capable file descriptor for these events can be obtained from
the nmemX/nfit/flags sysfs-attribute of a libnvdimm memory device.
* Miscellaneous fixes: NVDIMM-N probe error, device-dax build error, and
a change to dedup the flush hint list to not flush the memory controller
more than necessary.
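For the health-event item above, waiting on a sysfs attribute from
userspace usually follows the generic sysfs poll pattern: read the current
contents, then poll for POLLPRI | POLLERR. The sketch below assumes that
pattern applies to the nmemX/nfit/flags attribute; sk_wait_for_attr_event()
is a hypothetical helper and the full sysfs path is left to the caller.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait for a change notification on a sysfs attribute.
 * Returns 1 on an event, 0 on timeout, -1 on error.
 */
static int sk_wait_for_attr_event(const char *path, int timeout_ms)
{
	char buf[256];
	struct pollfd pfd;
	int ret;

	pfd.fd = open(path, O_RDONLY);
	if (pfd.fd < 0)
		return -1;

	/* sysfs only signals changes made after the contents were read. */
	if (read(pfd.fd, buf, sizeof(buf)) < 0) {
		close(pfd.fd);
		return -1;
	}

	pfd.events = POLLPRI | POLLERR;
	pfd.revents = 0;
	ret = poll(&pfd, 1, timeout_ms);
	close(pfd.fd);

	if (ret < 0)
		return -1;
	return ret > 0 ? 1 : 0;
}
```

A caller would pass the path of the flags attribute of the memory device
of interest and re-read the attribute after an event to see the new state.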
Merge tag 'libnvdimm-for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"Aside from the recently added pmem sub-division support these have
been in -next for several releases with no reported issues. The sub-
division support was included in next-20161010 with no reported
issues. It passes all unit tests including new tests for all the new
functionality below.
Summary:
- PMEM sub-division support: Allow a single PMEM region to be divided
into multiple namespaces. Originally, ~2 years ago, it was thought
that partitions of a /dev/pmemX block device could handle
sub-allocations of persistent memory for different use cases. With
the decision to not support DAX mappings of raw block-devices, and
the genesis of device-dax, the need for having multiple
pmem-namespaces per region has grown.
- Device-DAX unified inode: In support of dynamic-resizing of a
device-dax instance the kernel arranges for all mappings of a
device-dax node to share the same inode. This allows unmap /
truncate / invalidation events to affect all instances of the
device similar to the behavior of mmap on block devices.
- Hardware error scrubbing reworks: The original address-range-scrub
and badblocks tracking solution allowed clearing entries at the
individual namespace level, but it failed to clear the internal
list of media errors maintained at the bus level. The result was
that the next scrub or namespace disable/re-enable event would
restore the cleared badblocks, but now that is fixed. The v4.8
kernel introduced an auto-scrub-on-machine-check behavior to
repopulate the badblocks list. Now, in v4.9, the auto-scrub
behavior can be disabled, in which case the kernel simply arranges
for the error reported in the machine-check to be added to the list.
- DIMM health-event notification support: ACPI 6.1 defines a
notification event code that can be sent to ACPI NVDIMM devices. A
poll(2) capable file descriptor for these events can be obtained
from the nmemX/nfit/flags sysfs-attribute of a libnvdimm memory
device.
- Miscellaneous fixes: NVDIMM-N probe error, device-dax build error,
and a change to dedup the flush hint list to not flush the memory
controller more than necessary"
* tag 'libnvdimm-for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
/dev/dax: fix Kconfig dependency build breakage
dax: use correct dev_t value
dax: convert devm_create_dax_dev to PTR_ERR
libnvdimm, namespace: allow creation of multiple pmem-namespaces per region
libnvdimm, namespace: lift single pmem limit in scan_labels()
libnvdimm, namespace: filter out of range labels in scan_labels()
libnvdimm, namespace: enable allocation of multiple pmem namespaces
libnvdimm, namespace: update label implementation for multi-pmem
libnvdimm, namespace: expand pmem device naming scheme for multi-pmem
libnvdimm, region: update nd_region_available_dpa() for multi-pmem support
libnvdimm, namespace: sort namespaces by dpa at init
libnvdimm, namespace: allow multiple pmem-namespaces per region at scan time
tools/testing/nvdimm: support for sub-dividing a pmem region
libnvdimm, namespace: unify blk and pmem label scanning
libnvdimm, namespace: refactor uuid_show() into a namespace_to_uuid() helper
libnvdimm, label: convert label tracking to a linked list
libnvdimm, region: move region-mapping input-paramters to nd_mapping_desc
nvdimm: reduce duplicated wpq flushes
libnvdimm: clear the internal poison_list when clearing badblocks
pmem: reduce kmap_atomic sections to the memcpys only
...