Commit Graph

Masami Hiramatsu
d2d4edbebe perf probe: Fix to show correct locations for events on modules
Fix to show correct locations for events on modules by relocating the given
address instead of retrying after failure.

This happens when the module text size is big enough (bigger than
sh_addr), because the original code retries with the given address + sh_addr
only if it failed to find a CU DIE at the given address.

Any address smaller than sh_addr always fails and is retried with the
correct address, but an address bigger than sh_addr gets a CU DIE which
is at the given address (not adjusted by sh_addr).
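
For illustration, a minimal sketch of the idea (the helper and variable
names here are assumptions, not the perf source): always relocate the given
address by the .text sh_addr before the DWARF lookup, instead of retrying
with the adjusted address only after a failed lookup.

  #include <elfutils/libdw.h>

  /* Sketch only: look up the CU DIE at the relocated address. */
  static Dwarf_Die *module_cudie(Dwarf *dbg, Dwarf_Addr given_addr,
				 Dwarf_Addr text_sh_addr, Dwarf_Die *die_mem)
  {
	/* e.g. ffffffffc0270000 + 0x10030 = ffffffffc0280030, see below */
	return dwarf_addrdie(dbg, given_addr + text_sh_addr, die_mem);
  }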

In my environment (x86-64), the sh_addr of the ".text" section is 0x10030.
Since i915 is a huge kernel module, we can see this issue as below.

  $ grep "[Tt] .*\[i915\]" /proc/kallsyms | sort | head -n1
  ffffffffc0270000 t i915_switcheroo_can_switch	[i915]

ffffffffc0270000 + 0x10030 = ffffffffc0280030, so we'll check
symbols crossing this boundary.

  $ grep "[Tt] .*\[i915\]" /proc/kallsyms | grep -B1 ^ffffffffc028\
  | head -n 2
  ffffffffc027ff80 t haswell_init_clock_gating	[i915]
  ffffffffc0280110 t valleyview_init_clock_gating	[i915]

So set up probes on both functions and see what happens.

  $ sudo ./perf probe -m i915 -a haswell_init_clock_gating \
        -a valleyview_init_clock_gating
  Added new events:
    probe:haswell_init_clock_gating (on haswell_init_clock_gating in i915)
    probe:valleyview_init_clock_gating (on valleyview_init_clock_gating in i915)

  You can now use it in all perf tools, such as:

  	perf record -e probe:valleyview_init_clock_gating -aR sleep 1

  $ sudo ./perf probe -l
    probe:haswell_init_clock_gating (on haswell_init_clock_gating@gpu/drm/i915/intel_pm.c in i915)
    probe:valleyview_init_clock_gating (on i915_vga_set_decode:4@gpu/drm/i915/i915_drv.c in i915)

As you can see, haswell_init_clock_gating is correctly shown,
but valleyview_init_clock_gating is not.

With this patch, both events are shown correctly.

  $ sudo ./perf probe -l
    probe:haswell_init_clock_gating (on haswell_init_clock_gating@gpu/drm/i915/intel_pm.c in i915)
    probe:valleyview_init_clock_gating (on valleyview_init_clock_gating@gpu/drm/i915/intel_pm.c in i915)

Committer notes:

In my case:

  # perf probe -m i915 -a haswell_init_clock_gating -a valleyview_init_clock_gating
  Added new events:
    probe:haswell_init_clock_gating (on haswell_init_clock_gating in i915)
    probe:valleyview_init_clock_gating (on valleyview_init_clock_gating in i915)

  You can now use it in all perf tools, such as:

	  perf record -e probe:valleyview_init_clock_gating -aR sleep 1

  # perf probe -l
    probe:haswell_init_clock_gating (on i915_getparam+432@gpu/drm/i915/i915_drv.c in i915)
    probe:valleyview_init_clock_gating (on __i915_printk+240@gpu/drm/i915/i915_drv.c in i915)
  #

  # readelf -SW /lib/modules/4.9.0+/build/vmlinux | egrep -w '.text|Name'
   [Nr] Name   Type      Address          Off    Size   ES Flg Lk Inf Al
   [ 1] .text  PROGBITS  ffffffff81000000 200000 822fd3 00  AX  0   0 4096
  #

  So both are b0rked, now with the fix:

  # perf probe -m i915 -a haswell_init_clock_gating -a valleyview_init_clock_gating
  Added new events:
    probe:haswell_init_clock_gating (on haswell_init_clock_gating in i915)
    probe:valleyview_init_clock_gating (on valleyview_init_clock_gating in i915)

  You can now use it in all perf tools, such as:

	perf record -e probe:valleyview_init_clock_gating -aR sleep 1

  # perf probe -l
    probe:haswell_init_clock_gating (on haswell_init_clock_gating@gpu/drm/i915/intel_pm.c in i915)
    probe:valleyview_init_clock_gating (on valleyview_init_clock_gating@gpu/drm/i915/intel_pm.c in i915)
  #

Both look correct.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/148411436777.9978.1440275861947194930.stgit@devbox
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-16 15:14:06 -03:00
Jiri Olsa
18e7a45af9 perf/x86: Reject non sampling events with precise_ip
As Peter suggested [1], reject non-sampling PEBS events,
because they don't make any sense and could cause bugs
in the NMI handler [2].

  [1] http://lkml.kernel.org/r/20170103094059.GC3093@worktop
  [2] http://lkml.kernel.org/r/1482931866-6018-3-git-send-email-jolsa@kernel.org
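
The check amounts to something like the following sketch (the placement and
helper name are assumptions; is_sampling_event() and the precise_ip
attribute are the real kernel ones):

  #include <linux/perf_event.h>

  /* Sketch only: refuse PEBS (precise_ip) for non-sampling events. */
  static int reject_non_sampling_pebs(struct perf_event *event)
  {
	/* Without a sample period there is nothing for the PEBS/NMI
	 * machinery to deliver, so such an event makes no sense. */
	if (event->attr.precise_ip && !is_sampling_event(event))
		return -EINVAL;
	return 0;
  }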

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20170103142454.GA26251@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:06:50 +01:00
Jiri Olsa
475113d937 perf/x86/intel: Account interrupts for PEBS errors
It's possible to set up PEBS events to get only errors and not
any data, like on SNB-X (model 45) and IVB-EP (model 62)
via 2 perf commands running simultaneously:

    taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10

This leads to a soft lockup, because the error path of
intel_pmu_drain_pebs_nhm() does not account event->hw.interrupt
for error PEBS interrupts, so in case you're getting ONLY
errors you don't have a way to stop the event when it's over
the max_samples_per_tick limit:

  NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
  ...
  RIP: 0010:[<ffffffff81159232>]  [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
  ...
  Call Trace:
   ? trace_hardirqs_on_caller+0xf5/0x1b0
   ? perf_cgroup_attach+0x70/0x70
   perf_install_in_context+0x199/0x1b0
   ? ctx_resched+0x90/0x90
   SYSC_perf_event_open+0x641/0xf90
   SyS_perf_event_open+0x9/0x10
   do_syscall_64+0x6c/0x1f0
   entry_SYSCALL64_slow_path+0x25/0x25

Add perf_event_account_interrupt() which does the interrupt
and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
error path.

We keep the pending_kill and pending_wakeup logic only in the
__perf_event_overflow() path, because they make sense only if
there's any data to deliver.
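
The error path then ends up doing something along these lines (a simplified
sketch of the drain loop's error handling; the variable names are assumed):

	if (error[bit]) {
		perf_log_lost_samples(event, error[bit]);

		/* No data to deliver, but account the interrupt so the
		 * event can be throttled and eventually stopped. */
		if (perf_event_account_interrupt(event))
			x86_pmu_stop(event, 0);
	}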

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:06:49 +01:00
Peter Zijlstra
321027c1fe perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race
Di Shen reported a race between two concurrent sys_perf_event_open()
calls where both try and move the same pre-existing software group
into a hardware context.

The problem is exactly that described in commit:

  f63a8daa58 ("perf: Fix event->ctx locking")

... where, while we wait for a ctx->mutex acquisition, the event->ctx
relation can have changed under us.

That very same commit failed to recognise sys_perf_event_open() as an
external access vector to the events and thereby didn't apply the
established locking rules correctly.

So while one sys_perf_event_open() call is stuck waiting on
mutex_lock_double(), the other (which owns said locks) moves the group
about. So by the time the former sys_perf_event_open() acquires the
locks, the context we've acquired is stale (and possibly dead).

Apply the established locking rules as per perf_event_ctx_lock_nested()
to the mutex_lock_double() for the 'move_group' case. This obviously means
we need to validate state after we acquire the locks.
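
The resulting pattern is roughly the following sketch (simplified; the
context and refcount field names are assumed to match kernel/events/core.c):

  again:
	rcu_read_lock();
	gctx = READ_ONCE(group_leader->ctx);
	if (!atomic_inc_not_zero(&gctx->refcount)) {
		rcu_read_unlock();
		goto again;
	}
	rcu_read_unlock();

	mutex_lock_double(&gctx->mutex, &ctx->mutex);

	/* While we waited for the mutexes the group may have moved. */
	if (group_leader->ctx != gctx) {
		mutex_unlock(&ctx->mutex);
		mutex_unlock(&gctx->mutex);
		put_ctx(gctx);
		goto again;
	}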

Reported-by: Di Shen (Keen Lab)
Tested-by: John Dias <joaodias@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Min Chong <mchong@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: f63a8daa58 ("perf: Fix event->ctx locking")
Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 10:56:11 +01:00
Peter Zijlstra
63cae12bce perf/core: Fix sys_perf_event_open() vs. hotplug
There is a problem with installing an event in a task that is 'stuck' on
an offline CPU.

Blocked tasks are not disassociated from offlined CPUs; after all, a
blocked task doesn't run and doesn't require a CPU etc. Only on
wakeup do we amend the situation and place the task on an available
CPU.

If we hit such a task with perf_install_in_context() we'll loop until
either that task wakes up or the CPU comes back online, if the task
waking depends on the event being installed, we're stuck.

While looking into this issue, I also spotted another problem: if we
hit a task with perf_install_in_context() that is in the middle of
being migrated, that is we observe the old CPU before sending the IPI,
but run the IPI (on the old CPU) while the task is already running on
the new CPU, things also go sideways.

Rework things to rely on task_curr() -- outside of rq->lock -- which
is rather tricky. Imagine the following scenario where we're trying to
install the first event into our task 't':

CPU0            CPU1            CPU2

                (current == t)

t->perf_event_ctxp[] = ctx;
smp_mb();
cpu = task_cpu(t);

                switch(t, n);
                                migrate(t, 2);
                                switch(p, t);

                                ctx = t->perf_event_ctxp[]; // must not be NULL

smp_function_call(cpu, ..);

                generic_exec_single()
                  func();
                    spin_lock(ctx->lock);
                    if (task_curr(t)) // false

                    add_event_to_ctx();
                    spin_unlock(ctx->lock);

                                perf_event_context_sched_in();
                                  spin_lock(ctx->lock);
                                  // sees event

So it's CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
Because if CPU2's load of that variable were to observe NULL, it would
not try to schedule the ctx and we'd have a task running without its
counter, which would be 'bad'.

As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
first and not see the event yet, then CPU0 must observe task_curr()
and retry. If the install happens first, then we must see the event on
sched-in and all is well.

I think we can translate the first part (until the 'must not be NULL')
of the scenario to a litmus test like:

  C C-peterz

  {
  }

  P0(int *x, int *y)
  {
          int r1;

          WRITE_ONCE(*x, 1);
          smp_mb();
          r1 = READ_ONCE(*y);
  }

  P1(int *y, int *z)
  {
          WRITE_ONCE(*y, 1);
          smp_store_release(z, 1);
  }

  P2(int *x, int *z)
  {
          int r1;
          int r2;

          r1 = smp_load_acquire(z);
          smp_mb();
          r2 = READ_ONCE(*x);
  }

  exists
  (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)

Where:
  x is perf_event_ctxp[],
  y is our tasks's CPU, and
  z is our task being placed on the rq of CPU2.

The P0 smp_mb() is the one added by this patch, ordering the store to
perf_event_ctxp[] from find_get_context() and the load of task_cpu()
in task_function_call().

The smp_store_release/smp_load_acquire model the RCpc locking of the
rq->lock and the smp_mb() of P2 is the context switch switching from
whatever CPU2 was running to our task 't'.

This litmus test evaluates into:

  Test C-peterz Allowed
  States 7
  0:r1=0; 2:r1=0; 2:r2=0;
  0:r1=0; 2:r1=0; 2:r2=1;
  0:r1=0; 2:r1=1; 2:r2=1;
  0:r1=1; 2:r1=0; 2:r2=0;
  0:r1=1; 2:r1=0; 2:r2=1;
  0:r1=1; 2:r1=1; 2:r2=0;
  0:r1=1; 2:r1=1; 2:r2=1;
  No
  Witnesses
  Positive: 0 Negative: 7
  Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
  Observation C-peterz Never 0 7
  Hash=e427f41d9146b2a5445101d3e2fcaa34

And the strong and weak model agree.

Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: jeremy.linton@arm.com
Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 10:56:10 +01:00
Colin King
ad5013d569 perf/x86/intel: Use ULL constant to prevent undefined shift behaviour
When x86_pmu.num_counters is 32, the shift of the integer constant 1
exceeds 32 bits and is therefore undefined behaviour.

Fix this by shifting 1ULL instead of 1.
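
The class of bug is easy to demonstrate in userspace (an illustrative demo,
not the driver code):

  #include <stdio.h>

  int main(void)
  {
	unsigned int num_counters = 32;

	/* (1 << num_counters) would shift an int by 32 bits: undefined. */
	unsigned long long mask = (1ULL << num_counters) - 1;

	printf("%llx\n", mask);	/* ffffffff */
	return 0;
  }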

Reported-by: CoverityScan CID#1192105 ("Bad bit shift operation")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Link: http://lkml.kernel.org/r/20170111114310.17928-1-colin.king@canonical.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-01-11 16:43:30 +01:00
Prarit Bhargava
6d6daa2094 perf/x86/intel/uncore: Fix hardcoded socket 0 assumption in the Haswell init code
hswep_uncore_cpu_init() uses a hardcoded physical package id 0 for the boot
CPU. This works as long as the boot CPU is actually on physical package
0, which is normally the case after power on / reboot.

But it fails with a NULL pointer dereference when a kdump kernel is started
on a secondary socket which has a different physical package id, because the
logical package translation for physical package 0 does not exist.

Use the logical package id of the boot CPU instead of the hardcoded 0.

[ tglx: Rewrote changelog once more ]

Fixes: cf6d445f68 ("perf/x86/uncore: Track packages, not per CPU data")
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Harish Chegondi <harish.chegondi@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1483628965-2890-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-01-11 12:13:21 +01:00
David Carrillo-Cisneros
74545f6389 perf/x86: Set pmu->module in Intel PMU modules
The conversion of Intel PMU drivers into modules did not include reference
counting. The machine will crash when attempting to access deleted code
if an event from a module PMU is started and the module is removed before
the event is destroyed.

i.e. this crashes the machine:

	$ insmod intel-rapl-perf.ko
	$ perf stat -e power/energy-cores/ -C 0 &
	$ rmmod intel-rapl-perf.ko

Set THIS_MODULE to pmu->module in Intel module PMUs so that generic code
can handle reference counting and deny rmmod while an event still exists.
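
The pattern being applied looks roughly like this (a sketch; only the
relevant fields of struct pmu are shown and the rest of the driver is
omitted):

  #include <linux/module.h>
  #include <linux/perf_event.h>

  static struct pmu example_module_pmu = {
	.module		= THIS_MODULE,	/* perf core pins the module per event */
	.task_ctx_nr	= perf_invalid_context,
	/* .event_init, .add, .del, .start, .stop, .read, ... */
  };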

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1482455860-116269-1-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-05 09:13:55 +01:00
Ingo Molnar
4e06d4f083 perf/urgent fixes and one improvement:
Fixes:
 
 - Fix prev/next_prio formatting for deadline tasks in libtraceevent (Daniel Bristot de Oliveira)
 
 - Robustify reading of build-ids from /sys/kernel/notes (Arnaldo Carvalho de Melo)
 
 - Fix building some sample/bpf in Alpine Linux 3.4 (Arnaldo Carvalho de Melo)
 
 - Fix 'make install-bin' to install libtraceevent plugins (Arnaldo Carvalho de Melo)
 
 - Fix 'perf record --switch-output' documentation and comment (Jiri Olsa)
 
 - 'perf probe' fixes for cross arch probing (Masami Hiramatsu)
 
 Improvement:
 
 - Show total scheduling time in 'perf sched timehist' (Namhyung Kim)
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJYbS6KAAoJENZQFvNTUqpAWqkP/1xdvbDqeTA2YjcWG+ffQJye
 LULSNQpRShvBGpB3zsOzOebOtw0R5oWb7YAp4JFZNyMIk5qcnS9gVwdw9Qx1lc/W
 Gq8OuhI28Pr9AHE37eTJ1+12HI2aAOylwKFqUpEWMzoaUqBDRh5JV2MeVGVQXBRe
 YBJ4Qwk1a7Ng74uzN/pdiC6V7pjddPEoLZEDG5p35vrgtNAY1JGlX5rvNR7CO6nN
 72QHRnB4r1d9fgNGS5oO5zEU1q5+JN5phe+gDANdk/8pCqDegRjZpHnj5gcm71jr
 mEvbstmlV1XPZSA9aFVu2L5LZuv6asS02HjrbGydqO2DUTpDgrynHsjAuqqaTnwk
 kLGwA2I+Kyh8g155H+HrQANONe2TTOXRkBVWIoFBWkhd1JUS/p3GE2qvqx5N36sr
 U26d9i/Bk/pbOYOdRUFMc4lckZY2HrrRWj8dVr06qLr2rlbYoLv8hpqvdyHr9MJS
 Id2Yqqd1JVzBDU3/+GfdFDBjltVtpSG1mdc6LWy9IpXFMJoSOQNsrbjf+mgMadeV
 EbtY8MdDZ525EB8niv9E9Vfd56LiWTKCjg9Mc758+U+Ns50YlQFsDq2hBO2+jQ/k
 W3QNX+WJiLgxqQK8adveXoYgCSWPNVttIvg8Lj1KNk8Y//vUC1KSrvDfaKiROzLB
 iF0u5mLRAQ+Pfec6X5PP
 =o9KL
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-for-mingo-4.10-20170104' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent

Pull perf/urgent fixes and one improvement from Arnaldo Carvalho de Melo:

Fixes:

  - Fix prev/next_prio formatting for deadline tasks in libtraceevent (Daniel Bristot de Oliveira)

  - Robustify reading of build-ids from /sys/kernel/notes (Arnaldo Carvalho de Melo)

  - Fix building some sample/bpf in Alpine Linux 3.4 (Arnaldo Carvalho de Melo)

  - Fix 'make install-bin' to install libtraceevent plugins (Arnaldo Carvalho de Melo)

  - Fix 'perf record --switch-output' documentation and comment (Jiri Olsa)

  - Fix 'perf probe' for cross arch probing (Masami Hiramatsu)

Improvement:

  - Show total scheduling time in 'perf sched timehist' (Namhyung Kim)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-05 08:33:02 +01:00
Masami Hiramatsu
8a937a25a7 perf probe: Fix to probe on gcc generated symbols for offline kernel
Fix perf-probe to show probe definition on gcc generated symbols for
offline kernel (including cross-arch kernel image).

gcc sometimes optimizes functions and generates new symbols with suffixes
such as ".constprop.N" or ".isra.N" etc. Since those symbol names are
not recorded in DWARF, we have to find correct generated symbols from
offline ELF binary to probe on it (kallsyms doesn't correct it).  For
online kernel or uprobes we don't need it because those are rebased on
_text, or a section relative address.

E.g. Without this:

  $ perf probe -k build-arm/vmlinux -F __slab_alloc*
  __slab_alloc.constprop.9
  $ perf probe -k build-arm/vmlinux -D __slab_alloc
  p:probe/__slab_alloc __slab_alloc+0

If you put the above definition on the target machine, it will fail
because there is no __slab_alloc in kallsyms.

With this fix, perf probe shows correct probe definition on
__slab_alloc.constprop.9:

  $ perf probe -k build-arm/vmlinux -D __slab_alloc
  p:probe/__slab_alloc __slab_alloc.constprop.9+0

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/148350060434.19001.11864836288580083501.stgit@devbox
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-04 11:44:22 -03:00
Masami Hiramatsu
eebc509b20 perf probe: Fix --funcs to show correct symbols for offline module
Fix the --funcs (-F) option to show correct symbols for an offline module.
Since the previous perf-probe used machine__findnew_module_map() for an
offline module, even if the user passes a module file (with full path) which
is for another architecture, perf-probe always tries to load the symbol map
for the current kernel's module.

This fix uses dso__new_map() to load the map from the given binary, the same
as is done for user application maps.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/148350053478.19001.15435255244512631545.stgit@devbox
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-04 11:15:09 -03:00
Arnaldo Carvalho de Melo
7934c98a6e perf symbols: Robustify reading of build-id from sysfs
Markus reported that perf segfaults when reading /sys/kernel/notes from
a kernel linked with GNU gold, due to what looks like a gold bug, so do
some bounds checking to avoid crashing in that case.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Report-Link: http://lkml.kernel.org/r/20161219161821.GA294@x4
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-ryhgs6a6jxvz207j2636w31c@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-03 16:11:13 -03:00
Arnaldo Carvalho de Melo
30a9c64448 perf tools: Install tools/lib/traceevent plugins with install-bin
Those are binaries as well, so they should be installed by:

  make -C tools/perf install-bin

too.

Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/n/tip-3841b37u05evxrs1igkyu6ks@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-03 16:11:13 -03:00
Daniel Bristot de Oliveira
074859184d tools lib traceevent: Fix prev/next_prio for deadline tasks
Currently, the sched:sched_switch tracepoint reports deadline tasks with
priority -1. But when reading the trace via perf script I've got the
following output:

  # ./d & # (d is a deadline task, see [1])
  # perf record -e sched:sched_switch -a sleep 1
  # perf script
      ...
         swapper     0 [000]  2146.962441: sched:sched_switch: swapper/0:0 [120] R ==> d:2593 [4294967295]
               d  2593 [000]  2146.972472: sched:sched_switch: d:2593 [4294967295] R ==> g:2590 [4294967295]

The task d reports the wrong priority [4294967295]. This happens because
the "int prio" is stored in an unsigned long long val. Although it is
printed as a %lld, since int is shorter than unsigned long long,
trace_seq_printf() prints it as a positive number.

The fix is just to cast val to an int and print it as a %d, as in the
sched:sched_switch tracepoint's "format".
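
The effect of the cast can be seen with a trivial userspace demo
(illustrative only, not the libtraceevent code):

  #include <stdio.h>

  int main(void)
  {
	/* the deadline prio -1 read into a wider unsigned field */
	unsigned long long val = (unsigned int)-1;

	printf("%llu\n", val);		/* 4294967295 */
	printf("%d\n", (int)val);	/* -1 */
	return 0;
  }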

The output with the fix is:

  # ./d &
  # perf record -e sched:sched_switch -a sleep 1
  # perf script
      ...
         swapper     0 [000]  4306.374037: sched:sched_switch: swapper/0:0 [120] R ==> d:10941 [-1]
               d 10941 [000]  4306.383823: sched:sched_switch: d:10941 [-1] R ==> swapper/0:0 [120]

[1] d.c

 ---
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/types.h>
  #include <linux/sched.h>

  struct sched_attr {
	__u32 size, sched_policy;
	__u64 sched_flags;
	__s32 sched_nice;
	__u32 sched_priority;
	__u64 sched_runtime, sched_deadline, sched_period;
  };

  int sched_setattr(pid_t pid, const struct sched_attr *attr, unsigned int flags)
  {
	return syscall(__NR_sched_setattr, pid, attr, flags);
  }

  int main(void)
  {
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE, /* This creates a 10ms/30ms reservation */
		.sched_runtime	= 10 * 1000 * 1000,
		.sched_period	= attr.sched_deadline = 30 * 1000 * 1000,
	};

	if (sched_setattr(0, &attr, 0) < 0) {
		perror("sched_setattr");
		return -1;
	}

	for(;;);
  }
 ---

Committer notes:

Got the program from the provided URL, http://bristot.me/lkml/d.c,
trimmed it and included in the cset log above, so that we have
everything needed to test it in one place.

Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/866ef75bcebf670ae91c6a96daa63597ba981f0d.1483443552.git.bristot@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-03 16:11:12 -03:00
Jiri Olsa
60437ac02f perf record: Fix --switch-output documentation and comment
There's no --signal-trigger option; also add the code comment to the
record man page.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Wang Nan <wangnan0@huawei.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1483431600-19887-4-git-send-email-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-03 11:11:38 -03:00
Jiri Olsa
efd2130711 perf record: Make __record_options static
There's no need for this one to be global.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Wang Nan <wangnan0@huawei.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1483431600-19887-3-git-send-email-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-03 11:11:10 -03:00
Jiri Olsa
b66fb1da5a tools lib subcmd: Add OPT_STRING_OPTARG_SET option
To allow string options with a default argument and a variable that gets
set when the option is used.
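
Usage looks something like the sketch below (the parameter order is an
assumption; check tools/lib/subcmd/parse-options.h for the real macro):

  #include <stdbool.h>
  #include <subcmd/parse-options.h>

  static const char *switch_output_str;
  static bool switch_output_set;

  static struct option options[] = {
	/* (short, long, value, set-flag, argh, help, default argument) */
	OPT_STRING_OPTARG_SET(0, "switch-output", &switch_output_str,
			      &switch_output_set, "signal",
			      "Switch output when receiving SIGUSR2",
			      "signal"),
	OPT_END()
  };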

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Tested-by: Wang Nan <wangnan0@huawei.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1483431600-19887-2-git-send-email-jolsa@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-03 11:10:38 -03:00
Masami Hiramatsu
1f2ed153b9 perf probe: Fix to get correct modname from elf header
Since 'perf probe' supports cross-arch probes, it is possible to analyze a
kernel image for a different arch, which may have a different bits-per-long.

In that case, it fails to get the module name because it uses the
MOD_NAME_OFFSET macro based on the host machine's bits-per-long, instead
of the target arch's bits-per-long.

This fixes the above issue by choosing the modname offset based on the
target arch's bit width. This is OK because the Linux kernel uses the LP64
model on 64-bit arches.

E.g. without this (on x86_64, and target module is arm32):

  $ perf probe -m build-arm/fs/configfs/configfs.ko -D configfs_lookup
  p:probe/configfs_lookup :configfs_lookup+0
                          ^-Here is an empty module name.

With this fix, you can see correct module name:

  $ perf probe -m build-arm/fs/configfs/configfs.ko -D configfs_lookup
  p:probe/configfs_lookup configfs:configfs_lookup+0

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/148337043836.6752.383495516397005695.stgit@devbox
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2017-01-02 14:09:17 -03:00
Linus Torvalds
0c744ea4f7 Linux 4.10-rc2 2017-01-01 14:31:53 -08:00
Linus Torvalds
4759d386d5 Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull DAX updates from Dan Williams:
 "The completion of Jan's DAX work for 4.10.

  As I mentioned in the libnvdimm-for-4.10 pull request, these are some
  final fixes for the DAX dirty-cacheline-tracking invalidation work
  that was merged through the -mm, ext4, and xfs trees in -rc1. These
  patches were prepared prior to the merge window, but we waited for
  4.10-rc1 to have a stable merge base after all the prerequisites were
  merged.

  Quoting Jan on the overall changes in these patches:

     "So I'd like all these 6 patches to go for rc2. The first three
      patches fix invalidation of exceptional DAX entries (a bug which
      is there for a long time) - without these patches data loss can
      occur on power failure even though user called fsync(2). The other
      three patches change locking of DAX faults so that ->iomap_begin()
      is called in a more relaxed locking context and we are safe to
      start a transaction there for ext4"

  These have received a build success notification from the kbuild
  robot, and pass the latest libnvdimm unit tests. There have not been
  any -next releases since -rc1, so they have not appeared there"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  ext4: Simplify DAX fault path
  dax: Call ->iomap_begin without entry lock during dax fault
  dax: Finish fault completely when loading holes
  dax: Avoid page invalidation races and unnecessary radix tree traversals
  mm: Invalidate DAX radix tree entries only if appropriate
  ext2: Return BH_New buffers for zeroed blocks
2017-01-01 12:27:05 -08:00
Linus Torvalds
238d1d0f79 Two small fixes:
- A merge error on my part broke the DocBook build.  I've requisitioned
    one of tglx's frozen sharks for appropriate disciplinary action and
    resolved to be more careful about testing the DocBook stuff as long as
    it's still around.
 
  - Fix an error in unaligned-memory-access.txt
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYZoL/AAoJEI3ONVYwIuV6zCQP/2q0yjKH0qVe+FQqjqBkAcfs
 m6B/8QXKLEiTfS4sthvIa2iQgN6ZKWTg8DyVndXoGhE5qgZZAuctxxNj2nwpt0XD
 jjTrfLVhyASXCCCfSDPcujtgrQElNOm814+ZsAo5e4KCJk1V7Ap1NlwJpd6beld5
 ac6INH3r8X3Qgm2BNbiMJ3VApubcjAkjJRi75mc1JIZGtIAf8ePzjkKVpGyTsaaM
 qQQDbx5mXNrYPXFJYHgCOGcVGkWRAqIcDiYscS6XHpmj5DWfz/x6OdicdeB9l7b4
 Jw4TWVNTXTWz0THRUD74N1SdHwwetnWnZ3abcaorFIA/yr8tw1OuMmEqGn5VNGX5
 Bxk8YYMjjezH4y7+XC7IGhZ1dcrGrdc3AJTqZYIoTOI+HxGJ4RjZsqHghgHy3Qvi
 YgLumVmXLQQ6gSTE7sRixq/2hpLUlUrR0po+zQcwO/8Ka0JA860qhYcTjNzRdO1s
 csQIJtybytbtX28lz4eMxadFcwTtMyYMyVTslUsP/HJQefNcpPz97OQ0zV7UTmi8
 zpHRBo5GnDJNL0u/R6tNDsDo5XaXzy+uR/HDUz/oBn7QofhtrUsgBX2DHeV6pqr7
 guvfVVhQ4I3r7/BGyFaiXf28W9f3ADwE1jbibLjEA2IrLxaClIGLxLS/fGql5MAI
 FQhqlQhsKFeVTAxMtC9j
 =W4e5
 -----END PGP SIGNATURE-----

Merge tag 'docs-4.10-rc1-fix' of git://git.lwn.net/linux

Pull documentation fixes from Jonathan Corbet:
 "Two small fixes:

   - A merge error on my part broke the DocBook build. I've
     requisitioned one of tglx's frozen sharks for appropriate
     disciplinary action and resolved to be more careful about testing
     the DocBook stuff as long as it's still around.

   - Fix an error in unaligned-memory-access.txt"

* tag 'docs-4.10-rc1-fix' of git://git.lwn.net/linux:
  Documentation/unaligned-memory-access.txt: fix incorrect comparison operator
  docs: Fix build failure
2016-12-30 09:32:26 -08:00
Linus Torvalds
f3de082c12 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
 "This fixes a boot failure on some platforms when crypto self test is
  enabled along with the new acomp interface"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: testmgr - Use heap buffer for acomp test input
2016-12-30 09:29:50 -08:00
Olof Johansson
98473f9f3f mm/filemap: fix parameters to test_bit()
mm/filemap.c: In function 'clear_bit_unlock_is_negative_byte':
  mm/filemap.c:933:9: error: too few arguments to function 'test_bit'
    return test_bit(PG_waiters);
         ^~~~~~~~
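
The obvious fix is to pass the word being tested as well (a sketch; the
parameter name of the surrounding helper is assumed):

	return test_bit(PG_waiters, mem);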

Fixes: b91e1302ad ('mm: optimize PageWaiters bit use for unlock_page()')
Signed-off-by: Olof Johansson <olof@lixom.net>
Brown-paper-bag-by: Linus Torvalds <dummy@duh.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-29 14:46:39 -08:00
Linus Torvalds
b91e1302ad mm: optimize PageWaiters bit use for unlock_page()
In commit 6290602709 ("mm: add PageWaiters indicating tasks are
waiting for a page bit") Nick Piggin made our page locking no longer
unconditionally touch the hashed page waitqueue, which not only helps
performance in general, but is particularly helpful on NUMA machines
where the hashed wait queues can bounce around a lot.

However, the "clear lock bit atomically and then test the waiters bit"
sequence turns out to be much more expensive than it needs to be,
because you get a nasty stall when trying to access the same word that
just got updated atomically.

On architectures where locking is done with LL/SC, this would be trivial
to fix with a new primitive that clears one bit and tests another
atomically, but that ends up not working on x86, where the only atomic
operations that return the result end up being cmpxchg and xadd.  The
atomic bit operations return the old value of the same bit we changed,
not the value of an unrelated bit.

On x86, we could put the lock bit in the high bit of the byte, and use
"xadd" with that bit (where the overflow ends up not touching other
bits), and look at the other bits of the result.  However, an even
simpler model is to just use a regular atomic "and" to clear the lock
bit, and then the sign bit in eflags will indicate the resulting state
of the unrelated bit #7.

So by moving the PageWaiters bit up to bit #7, we can atomically clear
the lock bit and test the waiters bit on x86 too.  And architectures
with LL/SC (which is all the usual RISC suspects), the particular bit
doesn't matter, so they are fine with this approach too.

This avoids the extra access to the same atomic word, and thus avoids
the costly stall at page unlock time.

The only downside is that the interface ends up being a bit odd and
specialized: clear a bit in a byte, and test the sign bit.  Nick doesn't
love the resulting name of the new primitive, but I'd rather make the
name be descriptive and very clear about the limitation imposed by
trying to work across all relevant architectures than make it be some
generic thing that doesn't make the odd semantics explicit.

So this introduces the new architecture primitive

    clear_bit_unlock_is_negative_byte();

and adds the trivial implementation for x86.  We have a generic
non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
combination) which can be overridden by any architecture that can do
better.  According to Nick, Power has the same hiccup x86 has, for
example, but some other architectures may not even care.
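
A sketch of that generic fallback, as described above (the in-tree version
may differ in argument types and barriers; x86 overrides it with the atomic
"and" plus sign-flag trick):

  #ifndef clear_bit_unlock_is_negative_byte
  static inline bool clear_bit_unlock_is_negative_byte(long nr,
						       volatile unsigned long *mem)
  {
	clear_bit_unlock(nr, mem);
	return test_bit(7, mem);	/* PG_waiters now lives in bit 7 */
  }
  #endif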

All these optimizations mean that my page locking stress-test (which is
just executing a lot of small short-lived shell scripts: "make test" in
the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, just the unlock_page() costs were just
over 3% of all CPU overhead on "make test".  After this, it's down to
0.66%, so just a quarter of the cost it used to be.

(The difference on NUMA is bigger, but there this micro-optimization is
likely less noticeable, since the big issue on NUMA was not the accesses
to 'struct page', but the waitqueue accesses that were already removed
by Nick's earlier commit).

Acked-by: Nick Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-29 11:03:15 -08:00
Arnaldo Carvalho de Melo
b6f4c66704 samples/bpf trace_output_user: Remove duplicate sys/ioctl.h include
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Joe Stringer <joe@ovn.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-3awp0nv8tpnblatojmwjww7z@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-28 10:47:13 -03:00
Linus Torvalds
2d706e790f Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
 "This fixes a hash corruption bug in the marvell driver"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: marvell - Copy IVDIG before launching partial DMA ahash requests
2016-12-27 17:51:36 -08:00
Arnaldo Carvalho de Melo
ee12996c9d samples/bpf sock_example: Avoid getting ethhdr from two includes
To avoid the following build failure on Alpine Linux 3.4, that has
clang-3.8 with the bpf target:

    HOSTCC  samples/bpf/sock_example.o
  In file included from /usr/include/net/ethernet.h:10:0,
                   from /git/linux/samples/bpf/sock_example.h:7,
                   from /git/linux/samples/bpf/sock_example.c:30:
  /usr/include/netinet/if_ether.h:96:8: error: redefinition of 'struct
  ethhdr'
   struct ethhdr {
          ^
  In file included from /git/linux/samples/bpf/sock_example.c:26:0:
  ./usr/include/linux/if_ether.h:144:8: note: originally defined here
   struct ethhdr {
          ^
  scripts/Makefile.host:124: recipe for target
  'samples/bpf/sock_example.o' failed
  make[2]: *** [samples/bpf/sock_example.o] Error 1
  /git/linux/Makefile:1658: recipe for target 'samples/bpf/' failed

So include net/if_ether.h for the needs of sock_example.h, using the
same include that sock_example.c uses.

Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Joe Stringer <joe@ovn.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-m9avekl1b651qe1r1zd5tzz9@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-27 21:49:17 -03:00
Namhyung Kim
9396c9cb0d perf sched timehist: Show total scheduling time
Show the length of the analyzed sample time and the rate of idle task
running. This also takes care of the time range given by the --time option.

  $ perf sched timehist -sI | tail
  Samples do not have callchains.
  Idle stats:
      CPU  0 idle for    930.316  msec  ( 92.93%)
      CPU  1 idle for    963.614  msec  ( 96.25%)
      CPU  2 idle for    885.482  msec  ( 88.45%)
      CPU  3 idle for    938.635  msec  ( 93.76%)

      Total number of unique tasks: 118
  Total number of context switches: 2337
             Total run time (msec): 3718.048
      Total scheduling time (msec): 1001.131  (x 4)

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20161222060350.17655-3-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2016-12-27 21:47:57 -03:00
Linus Torvalds
8f18e4d03e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Various ipvlan fixes from Eric Dumazet and Mahesh Bandewar.

    The most important is to not assume the packet is RX just because
    the destination address matches that of the device. Such an
    assumption causes problems when an interface is put into loopback
    mode.

 2) If we retry when creating a new tc entry (because we dropped the
    RTNL mutex in order to load a module, for example) we end up with
    -EAGAIN and then loop trying to replay the request. But we didn't
    reset some state when looping back to the top like this, and if
    another thread meanwhile inserted the same tc entry we were trying
    to, we re-link it creating an endless loop in the tc chain. Fix from
    Daniel Borkmann.

 3) There are two different WRITE bits in the MDIO address register for
    the stmmac chip, depending upon the chip variant. Due to a bug we
    could set them both, fix from Hock Leong Kweh.

 4) Fix mlx4 bug in XDP_TX handling, from Tariq Toukan.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: stmmac: fix incorrect bit set in gmac4 mdio addr register
  r8169: add support for RTL8168 series add-on card.
  net: xdp: remove unused bfp_warn_invalid_xdp_buffer()
  openvswitch: upcall: Fix vlan handling.
  ipv4: Namespaceify tcp_tw_reuse knob
  net: korina: Fix NAPI versus resources freeing
  net, sched: fix soft lockup in tc_classify
  net/mlx4_en: Fix user prio field in XDP forward
  tipc: don't send FIN message from connectionless socket
  ipvlan: fix multicast processing
  ipvlan: fix various issues in ipvlan_process_multicast()
2016-12-27 16:04:37 -08:00
Cihangir Akturk
36f671be1d Documentation/unaligned-memory-access.txt: fix incorrect comparison operator
In the actual implementation, the ether_addr_equal() function tests for
equality to 0 when returning. It seems that in commit 0d74c4 changing this
operator to reflect the actual function was somehow overlooked.

Signed-off-by: Cihangir Akturk <cakturk@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2016-12-27 13:08:42 -07:00
John Brooks
66115335fb docs: Fix build failure
The 80211.tmpl DocBook file was removed in commit 819bf59376 ("docs-rst:
sphinxify 802.11 documentation"), but the 80211.xml target was re-added to
the Makefile by commit 7ddedebb03 ("ALSA: doc: ReSTize
writing-an-alsa-driver document"), leading to a failure when building the
documentation:

*** No rule to make target 'Documentation/DocBook/80211.xml', needed by
'Documentation/DocBook/80211.aux.xml'.

cc: stable@vger.kernel.org
Signed-off-by: John Brooks <john@fastquake.com>
Mea-culpa-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2016-12-27 13:05:36 -07:00
Jonathan Corbet
54ab6db090 Linux 4.10-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYYGCgAAoJEHm+PkMAQRiGjVMH/R1WKLSCVyU2QboSTZVyBGqU
 6E42pMalPNaY72uxf29ZmUzds1uV5KyFn7OntsyD4qc+sQb2wxG5PvMSYAsL7HKN
 lTFiW738zC9Hfx8MzC/fHLGm/7HTHpPFndZJkDOJjIPnS0MeTHAmOFM+RwCRq+px
 5uvRHV4Z8yibHtijET6GqCywV0gw/uyXCi6xJfJNAspnj3hsm3ZXKJ0JPvP2ja+V
 yhdnWYHDEQwRs6FyNtIWnfjH92XilVn4KcOtwnb1pFahALiTmmVqJVMiGartagqJ
 fPRw98B3YHwmZpEc2SDbXaZi36WLu4hcWvvDa22SN/srXwYIzzblEwuNq1+fiBw=
 =X7z+
 -----END PGP SIGNATURE-----

Merge tag 'v4.10-rc1' into docs-next

Linux 4.10-rc1
2016-12-27 12:53:44 -07:00
Kweh, Hock Leong
5799fc9059 net: stmmac: fix incorrect bit set in gmac4 mdio addr register
Fix the gmac4 mdio write access to use MII_GMAC4_WRITE only, instead of
ORing it together with MII_WRITE.

Signed-off-by: Kweh, Hock Leong <hock.leong.kweh@intel.com>
Acked-By: Joao Pinto <jpinto@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-27 12:28:08 -05:00
Chun-Hao Lin
610c908773 r8169: add support for RTL8168 series add-on card.
This chip is the same as RTL8168, but its device id is 0x8161.

Signed-off-by: Chun-Hao Lin <hau@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-27 12:28:07 -05:00
Jason Wang
be26727772 net: xdp: remove unused bpf_warn_invalid_xdp_buffer()
After commit 73b62bd085 ("virtio-net:
remove the warning before XDP linearizing"), there are no users of
bpf_warn_invalid_xdp_buffer(), so remove it. This is a revert of
commit f23bc46c30.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-27 12:28:07 -05:00
pravin shelar
df30f7408b openvswitch: upcall: Fix vlan handling.
The networking stack accelerates vlan tag handling by
keeping the topmost vlan header in the skb. This works as
long as the packet remains in the OVS datapath. But during
an OVS upcall the vlan header is pushed onto the packet.
When such a packet is sent back to the OVS datapath, the core
networking stack might not handle it correctly. The following
patch avoids this issue by accelerating the vlan tag
during flow key extraction. This simplifies the datapath by
bringing uniform packet processing for packets from
all code paths.

Fixes: 5108bbaddc ("openvswitch: add processing of L3 packets").
CC: Jarno Rajahalme <jarno@ovn.org>
CC: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-27 12:28:07 -05:00
Haishuang Yan
56ab6b9300 ipv4: Namespaceify tcp_tw_reuse knob
Different namespaces might have different requirements for reusing
TIME-WAIT sockets for new connections. This might be required in
cases where applications in different namespaces are in place which
require TIME_WAIT socket connections to be reduced independently
of the host.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-27 12:28:07 -05:00
Laura Abbott
02608e02fb crypto: testmgr - Use heap buffer for acomp test input
Christopher Covington reported a crash on aarch64 on recent Fedora
kernels:

kernel BUG at ./include/linux/scatterlist.h:140!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Modules linked in:
CPU: 2 PID: 752 Comm: cryptomgr_test Not tainted 4.9.0-11815-ge93b1cc #162
Hardware name: linux,dummy-virt (DT)
task: ffff80007c650080 task.stack: ffff800008910000
PC is at sg_init_one+0xa0/0xb8
LR is at sg_init_one+0x24/0xb8
...
[<ffff000008398db8>] sg_init_one+0xa0/0xb8
[<ffff000008350a44>] test_acomp+0x10c/0x438
[<ffff000008350e20>] alg_test_comp+0xb0/0x118
[<ffff00000834f28c>] alg_test+0x17c/0x2f0
[<ffff00000834c6a4>] cryptomgr_test+0x44/0x50
[<ffff0000080dac70>] kthread+0xf8/0x128
[<ffff000008082ec0>] ret_from_fork+0x10/0x50

The test vectors used for input are part of the kernel image. These
inputs are passed as a buffer to sg_init_one which eventually blows up
with BUG_ON(!virt_addr_valid(buf)). On arm64, virt_addr_valid returns
false for the kernel image since virt_to_page will not return the
correct page. Fix this by copying the input vectors to heap buffer
before setting up the scatterlist.

Reported-by: Christopher Covington <cov@codeaurora.org>
Fixes: d7db7a882d ("crypto: acomp - update testmgr with support for acomp")
Signed-off-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-12-27 17:32:11 +08:00
Jan Kara
1db175428e ext4: Simplify DAX fault path
Now that dax_iomap_fault() calls ->iomap_begin() without entry lock, we
can use transaction starting in ext4_iomap_begin() and thus simplify
ext4_dax_fault(). It also provides us proper retries in case of ENOSPC.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-26 20:29:25 -08:00
Jan Kara
9f141d6ef6 dax: Call ->iomap_begin without entry lock during dax fault
Currently ->iomap_begin() handler is called with entry lock held. If the
filesystem held any locks between ->iomap_begin() and ->iomap_end()
(such as ext4 which will want to hold transaction open), this would cause
lock inversion with the iomap_apply() from standard IO path which first
calls ->iomap_begin() and only then calls ->actor() callback which grabs
entry locks for DAX (if it faults when copying from/to user provided
buffers).

Fix the problem by nesting grabbing of entry lock inside ->iomap_begin()
- ->iomap_end() pair.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-26 20:29:25 -08:00
Jan Kara
f449b936f1 dax: Finish fault completely when loading holes
The only case when we do not finish the page fault completely is when we
are loading hole pages into a radix tree. Avoid this special case and
finish the fault in that case as well inside the DAX fault handler. It
will allow us for easier iomap handling.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-26 20:29:25 -08:00
Jan Kara
e3fce68cdb dax: Avoid page invalidation races and unnecessary radix tree traversals
Currently dax_iomap_rw() takes care of invalidating page tables and
evicting hole pages from the radix tree when write(2) to the file
happens. This invalidation is only necessary when there is some block
allocation resulting from write(2). Furthermore, in its current place the
invalidation is racy wrt a page fault instantiating a hole page just after
we have invalidated it.

So perform the page invalidation inside dax_iomap_actor() where we can
do it only when really necessary and after blocks have been allocated so
nobody will be instantiating new hole pages anymore.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-26 20:29:24 -08:00
Jan Kara
c6dcf52c23 mm: Invalidate DAX radix tree entries only if appropriate
Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
just delete all exceptional radix tree entries they find. For DAX this
is not desirable as we track cache dirtiness in these entries and when
they are evicted, we may not flush caches although it is necessary. This
can for example manifest when we write to the same block both via mmap
and via write(2) (to different offsets) and fsync(2) then does not
properly flush CPU caches when modification via write(2) was the last
one.

Create appropriate DAX functions to handle invalidation of DAX entries
for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
wire them up into the corresponding mm functions.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-26 20:29:24 -08:00
Jan Kara
e568df6b84 ext2: Return BH_New buffers for zeroed blocks
So far we did not return BH_New buffers from ext2_get_blocks() when we
allocated and zeroed-out a block for DAX inode to avoid racy zeroing in
DAX code. This zeroing is gone these days so we can remove the
workaround.

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-12-26 20:29:24 -08:00
Thomas Gleixner
0dad3a3014 x86/mce/AMD: Make the init code more robust
If mce_device_init() fails then the mce device pointer is NULL and the
AMD mce code happily dereferences it.

Add a sanity check.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-26 17:30:24 -08:00
Thomas Gleixner
b9d9d6911b smp/hotplug: Undo tglxs brainfart
The attempt to prevent overwriting an active state resulted in a
disaster which effectively disables all dynamically allocated hotplug
states.

Clean up the mess.

Fixes: dc280d9362 ("cpu/hotplug: Prevent overwriting of callbacks")
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-26 17:30:24 -08:00
Al Viro
b4b8664d29 arm64: don't pull uaccess.h into *.S
Split asm-only parts of arm64 uaccess.h into a new header and use that
from *.S.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-12-26 13:05:17 -05:00
Florian Fainelli
e6afb1ad88 net: korina: Fix NAPI versus resources freeing
Commit beb0babfb7 ("korina: disable napi on close and restart")
introduced calls to napi_disable() that were missing before;
unfortunately this leaves a small window during which NAPI has a chance
to run, even though we have just freed resources, since korina_free_ring()
has been called.

Fix this by disabling NAPI first and then freeing the resources, and make
sure that we also cancel the restart task before doing the resource
freeing.
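
The resulting ordering is along these lines (a sketch with assumed member
names, not the actual korina driver code):

  static int korina_close_sketch(struct net_device *dev)
  {
	struct korina_private *lp = netdev_priv(dev);

	cancel_work_sync(&lp->restart_task);	/* member name assumed */
	napi_disable(&lp->napi);		/* stop NAPI before ... */

	korina_free_ring(dev);			/* ... freeing the rings */
	return 0;
  }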

Fixes: beb0babfb7 ("korina: disable napi on close and restart")
Reported-by: Alexandros C. Couloumbis <alex@ozo.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-26 11:26:16 -05:00
Daniel Borkmann
628185cfdd net, sched: fix soft lockup in tc_classify
Shahar reported a soft lockup in tc_classify(), where we run into an
endless loop when walking the classifier chain due to tp->next == tp
which is a state we should never run into. The issue only seems to
trigger under load in the tc control path.

What happens is that in tc_ctl_tfilter(), thread A allocates a new
tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
with it. In that classifier callback we had to unlock/lock the rtnl
mutex and returned with -EAGAIN. One reason why we need to drop there
is, for example, that we need to request an action module to be loaded.

This happens via tcf_exts_validate() -> tcf_action_init/_1() meaning
after we loaded and found the requested action, we need to redo the
whole request so we don't race against others. While we had to unlock
rtnl in that time, thread B's request was processed next on that CPU.
Thread B added a new tp instance successfully to the classifier chain.
When thread A returned grabbing the rtnl mutex again, propagating -EAGAIN
and destroying its tp instance which never got linked, we goto replay
and redo A's request.

This time when walking the classifier chain in tc_ctl_tfilter() for
checking for existing tp instances we had a priority match and found
the tp instance that was created and linked by thread B. Now calling
again into tp->ops->change() with that tp was successful and returned
without error.

tp_created was never cleared in the second round, thus the kernel thinks
that we need to link it into the classifier chain (once again). tp and
*back point to the same object due to the match we had earlier on. Thus
for thread B's already public tp, we reset tp->next to tp itself and
link it into the chain, which eventually causes the mentioned endless
loop in tc_classify() once a packet hits the data path.

Fix is to clear tp_created at the beginning of each request, also when
we replay it. On the paths that can cause -EAGAIN we already destroy
the original tp instance we had and on replay we really need to start
from scratch. It seems that this issue was first introduced in commit
12186be7d2 ("net_cls: fix unconfigured struct tcf_proto keeps chaining
and avoid kernel panic when we use cls_cgroup").

Fixes: 12186be7d2 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup")
Reported-by: Shahar Klein <shahark@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Shahar Klein <shahark@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-26 11:24:10 -05:00
Linus Torvalds
7ce7d89f48 Linux 4.10-rc1 2016-12-25 16:13:08 -08:00