The last cleanup patch triggered another issue, as now another function
should be moved into the same section:
kernel/locking/lockdep.c:3580:12: error: 'mark_lock' defined but not used [-Werror=unused-function]
static int mark_lock(struct task_struct *curr, struct held_lock *this,
Move mark_lock() into the same #ifdef section as its only caller, and
remove the now-unused mark_lock_irq() stub helper.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yuyang Du <duyuyang@gmail.com>
Fixes: 0d2cc3b345 ("locking/lockdep: Move valid_state() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING")
Link: https://lkml.kernel.org/r/20190617124718.1232976-1-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix sparse warning:
arch/x86/kernel/jump_label.c:106:5: warning:
symbol 'tp_vec_nr' was not declared. Should it be static?
It's only used in jump_label.c, so make it static.
Fixes: ba54f0c3f7 ("x86/jump_label: Batch jump label updates")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <bp@alien8.de>
Cc: <hpa@zytor.com>
Cc: <peterz@infradead.org>
Cc: <bristot@redhat.com>
Cc: <namit@vmware.com>
Link: https://lkml.kernel.org/r/20190625034548.26392-1-yuehaibing@huawei.com
Since raw_cpu_xchg() doesn't need to be IRQ-safe, like
this_cpu_xchg(), we can use a simple load-store instead of the cmpxchg
loop.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Nadav reported that code-gen changed because of the this_cpu_*()
constraints, avoid this for select_idle_cpu() because that runs with
preemption (and IRQs) disabled anyway.
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Nadav reported that since the this_cpu_*() ops got asm-volatile
constraints on, code generation suffered for do_IRQ(), but since this
is all with IRQs disabled we can use __this_cpu_*().
smp_x86_platform_ipi 234 222 -12,+0
smp_kvm_posted_intr_ipi 74 66 -8,+0
smp_kvm_posted_intr_wakeup_ipi 86 78 -8,+0
smp_apic_timer_interrupt 292 284 -8,+0
smp_kvm_posted_intr_nested_ipi 74 66 -8,+0
do_IRQ 195 187 -8,+0
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The upper bits of the count field is used as reader count. When
sufficient number of active readers are present, the most significant
bit will be set and the count becomes negative. If the number of active
readers keep on piling up, we may eventually overflow the reader counts.
This is not likely to happen unless the number of bits reserved for
reader count is reduced because those bits are need for other purpose.
To prevent this count overflow from happening, the most significant
bit is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock
attempts will now fail for both the fast and slow paths whenever this
bit is set. So all those extra readers will be put to sleep in the wait
list. Wakeup will not happen until the reader count reaches 0.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-17-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It makes readers
relatively more preferred than writers. When a writer times out spinning
on a reader-owned lock and set the nospinnable bits, there are two main
reasons for that.
1) The reader critical section is long, perhaps the task sleeps after
acquiring the read lock.
2) There are just too many readers contending the lock causing it to
take a while to service all of them.
In the former case, long reader critical section will impede the progress
of writers which is usually more important for system performance.
In the later case, reader optimistic spinning tends to make the reader
groups that contain readers that acquire the lock together smaller
leading to more of them. That may hurt performance in some cases. In
other words, the setting of nonspinnable bits indicates that reader
optimistic spinning may not be helpful for those workloads that cause it.
Therefore, any writers that have observed the setting of the writer
nonspinnable bit for a given rwsem after they fail to acquire the lock
via optimistic spinning will set the reader nonspinnable bit once they
acquire the write lock. Similarly, readers that observe the setting
of reader nonspinnable bit at slowpath entry will also set the reader
nonspinnable bit when they acquire the read lock via the wakeup path.
Once the reader nonspinnable bit is on, it will only be reset when
a writer is able to acquire the rwsem in the fast path or somehow a
reader or writer in the slowpath doesn't observe the nonspinable bit.
This is to discourage reader optmistic spinning on that particular
rwsem and make writers more preferred. This adaptive disabling of reader
optimistic spinning will alleviate some of the negative side effect of
this feature.
In addition, this patch tries to make readers in the spinning queue
follow the phase-fair principle after quitting optimistic spinning
by checking if another reader has somehow acquired a read lock after
this reader enters the optimistic spinning queue. If so and the rwsem
is still reader-owned, this reader is in the right read-phase and can
attempt to acquire the lock.
On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
the will-it-scale benchmark was run with various number of threads. The
number of operations done before reader optimistic spinning patches,
this patch and after this patch were:
Threads Before rspin Before patch After patch %change
------- ------------ ------------ ----------- -------
20 5541068 5345484 5455667 -3.5%/ +2.1%
40 10185150 7292313 9219276 -28.5%/+26.4%
60 8196733 6460517 7181209 -21.2%/+11.2%
80 9508864 6739559 8107025 -29.1%/+20.3%
This patch doesn't recover all the lost performance, but it is more
than half. Given the fact that reader optimistic spinning does benefit
some workloads, this is a good compromise.
Using the rwsem locking microbenchmark with very short critical section,
this patch doesn't have too much impact on locking performance as shown
by the locking rates (kops/s) below with equal numbers of readers and
writers before and after this patch:
# of Threads Pre-patch Post-patch
------------ --------- ----------
2 4,730 4,969
4 4,814 4,786
8 4,866 4,815
16 4,715 4,511
32 3,338 3,500
64 3,212 3,389
80 3,110 3,044
When running the locking microbenchmark with 40 dedicated reader and writer
threads, however, the reader performance is curtailed to favor the writer.
Before patch:
40 readers, Iterations Min/Mean/Max = 204,026/234,309/254,816
40 writers, Iterations Min/Mean/Max = 88,515/95,884/115,644
After patch:
40 readers, Iterations Min/Mean/Max = 33,813/35,260/36,791
40 writers, Iterations Min/Mean/Max = 95,368/96,565/97,798
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-16-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the rwsem is owned by reader, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where
the readers are unlikely to sleep and optimistic spinning can help
performance.
This patch provides a simple mechanism for spinning on a reader-owned
rwsem by a writer. It is a time threshold based spinning where the
allowable spinning time can vary from 10us to 25us depending on the
condition of the rwsem.
When the time threshold is exceeded, the nonspinnable bits will be set
in the owner field to indicate that no more optimistic spinning will
be allowed on this rwsem until it becomes writer owned again. Not even
readers is allowed to acquire the reader-locked rwsem by optimistic
spinning for fairness.
We also want a writer to acquire the lock after the readers hold the
lock for a relatively long time. In order to give preference to writers
under such a circumstance, the single RWSEM_NONSPINNABLE bit is now split
into two - one for reader and one for writer. When optimistic spinning
is disabled, both bits will be set. When the reader count drop down
to 0, the writer nonspinnable bit will be cleared to allow writers to
spin on the lock, but not the readers. When a writer acquires the lock,
it will write its own task structure pointer into sem->owner and clear
the reader nonspinnable bit in the process.
The time taken for each iteration of the reader-owned rwsem spinning
loop varies. Below are sample minimum elapsed times for 16 iterations
of the loop.
System Time for 16 Iterations
------ ----------------------
1-socket Skylake ~800ns
4-socket Broadwell ~300ns
2-socket ThunderX2 (arm64) ~250ns
When the lock cacheline is contended, we can see up to almost 10X
increase in elapsed time. So 25us will be at most 500, 1300 and 1600
iterations for each of the above systems.
With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:
# of Threads Pre-patch Post-patch
------------ --------- ----------
2 1,759 6,684
4 1,684 6,738
8 1,074 7,222
16 900 7,163
32 458 7,316
64 208 520
128 168 425
240 143 474
This patch gives a big boost in performance for mixed reader/writer
workloads.
With 32 locking threads, the rwsem lock event data were:
rwsem_opt_fail=79850
rwsem_opt_nospin=5069
rwsem_opt_rlock=597484
rwsem_opt_wlock=957339
rwsem_sleep_reader=57782
rwsem_sleep_writer=55663
With 64 locking threads, the data looked like:
rwsem_opt_fail=346723
rwsem_opt_nospin=6293
rwsem_opt_rlock=1127119
rwsem_opt_wlock=1400628
rwsem_sleep_reader=308201
rwsem_sleep_writer=72281
So a lot more threads acquired the lock in the slowpath and more threads
went to sleep.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-15-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The rwsem->owner contains not just the task structure pointer, it also
holds some flags for storing the current state of the rwsem. Some of
the flags may have to be atomically updated. To reflect the new reality,
the owner is now changed to an atomic_long_t type.
New helper functions are added to properly separate out the task
structure pointer and the embedded flags.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-14-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch enables readers to optimistically spin on a
rwsem when it is owned by a writer instead of going to sleep
directly. The rwsem_can_spin_on_owner() function is extracted
out of rwsem_optimistic_spin() and is called directly by
rwsem_down_read_slowpath() and rwsem_down_write_slowpath().
With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBrige-EX system with equal
numbers of readers and writers before and after the patch were as
follows:
# of Threads Pre-patch Post-patch
------------ --------- ----------
4 1,674 1,684
8 1,062 1,074
16 924 900
32 300 458
64 195 208
128 164 168
240 149 143
The performance change wasn't significant in this case, but this change
is required by a follow-on patch.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-13-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
anonymous owner - readers or an anonymous writer. The setting of this
anonymous bit is used as an indicator that optimistic spinning cannot
be done on this rwsem.
With the upcoming reader optimistic spinning patches, a reader-owned
rwsem can be spinned on for a limit period of time. We still need
this bit to indicate a rwsem is nonspinnable, but not setting this
bit loses its meaning that the owner is known. So rename the bit
to RWSEM_NONSPINNABLE to clarify its meaning.
This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-12-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the front of the wait queue is a reader, other readers
immediately following the first reader will also be woken up at the
same time. However, if there is a writer in between. Those readers
behind the writer will not be woken up.
Because of optimistic spinning, the lock acquisition order is not FIFO
anyway. The lock handoff mechanism will ensure that lock starvation
will not happen.
Assuming that the lock hold times of the other readers still in the
queue will be about the same as the readers that are being woken up,
there is really not much additional cost other than the additional
latency due to the wakeup of additional tasks by the waker. Therefore
all the readers up to a maximum of 256 in the queue are woken up when
the first waiter is a reader to improve reader throughput. This is
somewhat similar in concept to a phase-fair R/W lock.
With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:
# of Threads Pre-Patch Post-patch
------------ --------- ----------
4 1,641 1,674
8 731 1,062
16 564 924
32 78 300
64 38 195
240 50 149
There is no performance gain at low contention level. At high contention
level, however, this patch gives a pretty decent performance boost.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-11-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
An RT task can do optimistic spinning only if the lock holder is
actually running. If the state of the lock holder isn't known, there
is a possibility that high priority of the RT task may block forward
progress of the lock holder if it happens to reside on the same CPU.
This will lead to deadlock. So we have to make sure that an RT task
will not spin on a reader-owned rwsem.
When the owner is temporarily set to NULL, there are two cases
where we may want to continue spinning:
1) The lock owner is in the process of releasing the lock, sem->owner
is cleared but the lock has not been released yet.
2) The lock was free and owner cleared, but another task just comes
in and acquire the lock before we try to get it. The new owner may
be a spinnable writer.
So an RT task is now made to retry one more time to see if it can
acquire the lock or continue spinning on the new owning writer.
When testing on a 8-socket IvyBridge-EX system, the one additional retry
seems to improve locking performance of RT write locking threads under
heavy contentions. The table below shows the locking rates (in kops/s)
with various write locking threads before and after the patch.
Locking threads Pre-patch Post-patch
--------------- --------- -----------
4 2,753 2,608
8 2,529 2,520
16 1,727 1,918
32 1,263 1,956
64 889 1,343
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-10-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the use of wake_q, we can do task wakeups without holding the
wait_lock. There is one exception in the rwsem code, though. It is
when the writer in the slowpath detects that there are waiters ahead
but the rwsem is not held by a writer. This can lead to a long wait_lock
hold time especially when a large number of readers are to be woken up.
Remediate this situation by releasing the wait_lock before waking
up tasks and re-acquiring it afterward. The rwsem_try_write_lock()
function is also modified to read the rwsem count directly to avoid
stale count value.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-9-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Because of writer lock stealing, it is possible that a constant
stream of incoming writers will cause a waiting writer or reader to
wait indefinitely leading to lock starvation.
This patch implements a lock handoff mechanism to disable lock stealing
and force lock handoff to the first waiter or waiters (for readers)
in the queue after at least a 4ms waiting period unless it is a RT
writer task which doesn't need to wait. The waiting period is used to
avoid discouraging lock stealing too much to affect performance.
The setting and clearing of the handoff bit is serialized by the
wait_lock. So racing is not possible.
A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core
80-thread Skylake system with a v5.1 based kernel and 240 write_lock
threads with 5us sleep critical section.
Before the patch, the min/mean/max numbers of locking operations for
the locking threads were 1/7,792/173,696. After the patch, the figures
became 5,842/6,542/7,458. It can be seen that the rwsem became much
more fair, though there was a drop of about 16% in the mean locking
operations done which was a tradeoff of having better fairness.
Making the waiter set the handoff bit right after the first wakeup can
impact performance especially with a mixed reader/writer workload. With
the same microbenchmark with short critical section and equal number of
reader and writer threads (40/40), the reader/writer locking operation
counts with the current patch were:
40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796
40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081
By making waiter set handoff bit immediately after wakeup:
40 readers, Iterations Min/Mean/Max = 43/44/46
40 writers, Iterations Min/Mean/Max = 43/1,263/3,191
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-8-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch modifies rwsem_spin_on_owner() to return four possible
values to better reflect the state of lock holder which enables us to
make a better decision of what to do next.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-7-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After merging all the relevant rwsem code into one single file, there
are a number of optimizations and cleanups that can be done:
1) Remove all the EXPORT_SYMBOL() calls for functions that are not
accessed elsewhere.
2) Remove all the __visible tags as none of the functions will be
called from assembly code anymore.
3) Make all the internal functions static.
4) Remove some unneeded blank lines.
5) Remove the intermediate rwsem_down_{read|write}_failed*() functions
and rename __rwsem_down_{read|write}_failed_common() to
rwsem_down_{read|write}_slowpath().
6) Remove "__" prefix of __rwsem_mark_wake().
7) Use atomic_long_try_cmpxchg_acquire() as much as possible.
8) Remove the rwsem_rtrylock and rwsem_wtrylock lock events as they
are not that useful.
That enables the compiler to do better optimization and reduce code
size. The text+data size of rwsem.o on an x86-64 machine with gcc8 was
reduced from 10237 bytes to 5030 bytes with this change.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-6-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now we only have one implementation of rwsem. Even though we still use
xadd to handle reader locking, we use cmpxchg for writer instead. So
the filename rwsem-xadd.c is not strictly correct. Also no one outside
of the rwsem code need to know the internal implementation other than
function prototypes for two internal functions that are called directly
from percpu-rwsem.c.
So the rwsem-xadd.c and rwsem.h files are now merged into rwsem.c in
the following order:
<upper part of rwsem.h>
<rwsem-xadd.c>
<lower part of rwsem.h>
<rwsem.c>
The rwsem.h file now contains only 2 function declarations for
__up_read() and __down_read().
This is a code relocation patch with no code change at all except
making __up_read() and __down_read() non-static functions so they
can be used by percpu-rwsem.c.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-5-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current way of using various reader, writer and waiting biases
in the rwsem code are confusing and hard to understand. I have to
reread the rwsem count guide in the rwsem-xadd.c file from time to
time to remind myself how this whole thing works. It also makes the
rwsem code harder to be optimized.
To make rwsem more sane, a new locking scheme similar to the one in
qrwlock is now being used. The atomic long count has the following
bit definitions:
Bit 0 - writer locked bit
Bit 1 - waiters present bit
Bits 2-7 - reserved for future extension
Bits 8-X - reader count (24/56 bits)
The cmpxchg instruction is now used to acquire the write lock. The read
lock is still acquired with xadd instruction, so there is no change here.
This scheme will allow up to 16M/64P active readers which should be
more than enough. We can always use some more reserved bits if necessary.
With that change, we can deterministically know if a rwsem has been
write-locked. Looking at the count alone, however, one cannot determine
for certain if a rwsem is owned by readers or not as the readers that
set the reader count bits may be in the process of backing out. So we
still need the reader-owned bit in the owner field to be sure.
With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) of the benchmark on a 8-socket 120-core
IvyBridge-EX system before and after the patch were as follows:
Before Patch After Patch
# of Threads wlock rlock wlock rlock
------------ ----- ----- ----- -----
1 30,659 31,341 31,055 31,283
2 8,909 16,457 9,884 17,659
4 9,028 15,823 8,933 20,233
8 8,410 14,212 7,230 17,140
16 8,217 25,240 7,479 24,607
The locking rates of the benchmark on a Power8 system were as follows:
Before Patch After Patch
# of Threads wlock rlock wlock rlock
------------ ----- ----- ----- -----
1 12,963 13,647 13,275 13,601
2 7,570 11,569 7,902 10,829
4 5,232 5,516 5,466 5,435
8 5,233 3,386 5,467 3,168
The locking rates of the benchmark on a 2-socket ARM64 system were
as follows:
Before Patch After Patch
# of Threads wlock rlock wlock rlock
------------ ----- ----- ----- -----
1 21,495 21,046 21,524 21,074
2 5,293 10,502 5,333 10,504
4 5,325 11,463 5,358 11,631
8 5,391 11,712 5,470 11,680
The performance are roughly the same before and after the patch. There
are run-to-run variations in performance. Runs with higher variances
usually have higher throughput.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-4-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After the following commit:
59aabfc7e9 ("locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()")
the rwsem_wake() forgoes doing a wakeup if the wait_lock cannot be directly
acquired and an optimistic spinning locker is present. This can help performance
by avoiding spinning on the wait_lock when it is contended.
With the later commit:
133e89ef5e ("locking/rwsem: Enable lockless waiter wakeup(s)")
the performance advantage of the above optimization diminishes as the average
wait_lock hold time become much shorter.
With a later patch that supports rwsem lock handoff, we can no
longer relies on the fact that the presence of an optimistic spinning
locker will ensure that the lock will be acquired by a task soon and
rwsem_wake() will be called later on to wake up waiters. This can lead
to missed wakeup and application hang.
So the original 59aabfc7e9 commit has to be reverted.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-3-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The owner field in the rw_semaphore structure is used primarily for
optimistic spinning. However, identifying the rwsem owner can also be
helpful in debugging as well as tracing locking related issues when
analyzing crash dump. The owner field may also store state information
that can be important to the operation of the rwsem.
So the owner field is now made a permanent member of the rw_semaphore
structure irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Link: https://lkml.kernel.org/r/20190520205918.22251-2-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Recent probing at the Linux Kernel Memory Model uncovered a
'surprise'. Strongly ordered architectures where the atomic RmW
primitive implies full memory ordering and
smp_mb__{before,after}_atomic() are a simple barrier() (such as x86)
fail for:
*x = 1;
atomic_inc(u);
smp_mb__after_atomic();
r0 = *y;
Because, while the atomic_inc() implies memory order, it
(surprisingly) does not provide a compiler barrier. This then allows
the compiler to re-order like so:
atomic_inc(u);
*x = 1;
smp_mb__after_atomic();
r0 = *y;
Which the CPU is then allowed to re-order (under TSO rules) like:
atomic_inc(u);
r0 = *y;
*x = 1;
And this very much was not intended. Therefore strengthen the atomic
RmW ops to include a compiler barrier.
NOTE: atomic_{or,and,xor} and the bitops already had the compiler
barrier.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All callers of lockdep_assert_held_exclusive() use it to verify the
correct locking state of either a semaphore (ldisc_sem in tty,
mmap_sem for perf events, i_rwsem of inode for dax) or rwlock by
apparmor. Thus it makes sense to rename _exclusive to _write since
that's the semantics callers care. Additionally there is already
lockdep_assert_held_read(), which this new naming is more consistent with.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190531100651.3969-1-nborisov@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the jump label of a static key is transformed via the arch
specific function:
void arch_jump_label_transform(struct jump_entry *entry,
enum jump_label_type type)
The new approach (batch mode) uses two arch functions, the first has the
same arguments of the arch_jump_label_transform(), and is the function:
bool arch_jump_label_transform_queue(struct jump_entry *entry,
enum jump_label_type type)
Rather than transforming the code, it adds the jump_entry in a queue of
entries to be updated. This functions returns true in the case of a
successful enqueue of an entry. If it returns false, the caller must to
apply the queue and then try to queue again, for instance, because the
queue is full.
This function expects the caller to sort the entries by the address before
enqueueuing then. This is already done by the arch independent code, though.
After queuing all jump_entries, the function:
void arch_jump_label_transform_apply(void)
Applies the changes in the queue.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/57b4caa654bad7e3b066301c9a9ae233dea065b5.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the architecture supports the batching of jump label updates, use it!
An easy way to see the benefits of this patch is switching the
schedstats on and off. For instance:
-------------------------- %< ----------------------------
#!/bin/sh
while [ true ]; do
sysctl -w kernel.sched_schedstats=1
sleep 2
sysctl -w kernel.sched_schedstats=0
sleep 2
done
-------------------------- >% ----------------------------
while watching the IPI count:
-------------------------- %< ----------------------------
# watch -n1 "cat /proc/interrupts | grep Function"
-------------------------- >% ----------------------------
With the current mode, it is possible to see +- 168 IPIs each 2 seconds,
while with this patch the number of IPIs goes to 3 each 2 seconds.
Regarding the performance impact of this patch set, I made two measurements:
The time to update a key (the task that is causing the change)
The time to run the int3 handler (the side effect on a thread that
hits the code being changed)
The schedstats static key was chosen as the key to being switched on and off.
The reason being is that it is used in more than 56 places, in a hot path. The
change in the schedstats static key will be done with the following command:
while [ true ]; do
sysctl -w kernel.sched_schedstats=1
usleep 500000
sysctl -w kernel.sched_schedstats=0
usleep 500000
done
In this way, they key will be updated twice per second. To force the hit of the
int3 handler, the system will also run a kernel compilation with two jobs per
CPU. The test machine is a two nodes/24 CPUs box with an Intel Xeon processor
@2.27GHz.
Regarding the update part, on average, the regular kernel takes 57 ms to update
the schedstats key, while the kernel with the batch updates takes just 1.4 ms
on average. Although it seems to be too good to be true, it makes sense: the
schedstats key is used in 56 places, so it was expected that it would take
around 56 times to update the keys with the current implementation, as the
IPIs are the most expensive part of the update.
Regarding the int3 handler, the non-batch handler takes 45 ns on average, while
the batch version takes around 180 ns. At first glance, it seems to be a high
value. But it is not, considering that it is doing 56 updates, rather than one!
It is taking four times more, only. This gain is possible because the patch
uses a binary search in the vector: log2(56)=5.8. So, it was expected to have
an overhead within four times.
(voice of tv propaganda) But, that is not all! As the int3 handler keeps on for
a shorter period (because the update part is on for a shorter time), the number
of hits in the int3 handler decreased by 10%.
The question then is: Is it worth paying the price of "135 ns" more in the int3
handler?
Considering that, in this test case, we are saving the handling of 53 IPIs,
that takes more than these 135 ns, it seems to be a meager price to be paid.
Moreover, the test case was forcing the hit of the int3, in practice, it
does not take that often. While the IPI takes place on all CPUs, hitting
the int3 handler or not!
For instance, in an isolated CPU with a process running in user-space
(nohz_full use-case), the chances of hitting the int3 handler is barely zero,
while there is no way to avoid the IPIs. By bounding the IPIs, we are improving
a lot this scenario.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/acc891dbc2dbc9fd616dd680529a2337b1d1274c.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the patch of an address is done in three steps:
-- Pseudo-code #1 - Current implementation ---
1) add an int3 trap to the address that will be patched
sync cores (send IPI to all other CPUs)
2) update all but the first byte of the patched range
sync cores (send IPI to all other CPUs)
3) replace the first byte (int3) by the first byte of replacing opcode
sync cores (send IPI to all other CPUs)
-- Pseudo-code #1 ---
When a static key has more than one entry, these steps are called once for
each entry. The number of IPIs then is linear with regard to the number 'n' of
entries of a key: O(n*3), which is O(n).
This algorithm works fine for the update of a single key. But we think
it is possible to optimize the case in which a static key has more than
one entry. For instance, the sched_schedstats jump label has 56 entries
in my (updated) fedora kernel, resulting in 168 IPIs for each CPU in
which the thread that is enabling the key is _not_ running.
With this patch, rather than receiving a single patch to be processed, a vector
of patches is passed, enabling the rewrite of the pseudo-code #1 in this
way:
-- Pseudo-code #2 - This patch ---
1) for each patch in the vector:
add an int3 trap to the address that will be patched
sync cores (send IPI to all other CPUs)
2) for each patch in the vector:
update all but the first byte of the patched range
sync cores (send IPI to all other CPUs)
3) for each patch in the vector:
replace the first byte (int3) by the first byte of replacing opcode
sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - This patch ---
Doing the update in this way, the number of IPI becomes O(3) with regard
to the number of keys, which is O(1).
The batch mode is done with the function text_poke_bp_batch(), that receives
two arguments: a vector of "struct text_to_poke", and the number of entries
in the vector.
The vector must be sorted by the addr field of the text_to_poke structure,
enabling the binary search of a handler in the poke_int3_handler function
(a fast path).
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/ca506ed52584c80f64de23f6f55ca288e5d079de.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the batching mode, all the entries of a given key are updated at once.
During the update of a key, a hit in the int3 handler will check if the
hitting code address belongs to one of these keys.
To optimize the search of a given code in the vector of entries being
updated, a binary search is used. The binary search relies on the order
of the entries of a key by its code. Hence the keys need to be sorted
by the code too, so sort the entries of a given key by the code.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/f57ae83e0592418ba269866bb7ade570fc8632e0.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the definition of the code to be written from
__jump_label_transform() to a specialized function. No functional
change.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/d2f52a0010ecd399cf9b02a65bcf5836571b9e52.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the check if a jump_entry is valid to a function. No functional
change.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris von Recklinghausen <crecklin@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott Wood <swood@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/56b69bd3f8e644ed64f2dbde7c088030b8cbe76b.1560325897.git.bristot@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Thomas Gleixner:
"The accumulated fixes from this and last week:
- Fix vmalloc TLB flush and map range calculations which lead to
stale TLBs, spurious faults and other hard to diagnose issues.
- Use fault_in_pages_writable() for prefaulting the user stack in the
FPU code as it's less fragile than the current solution
- Use the PF_KTHREAD flag when checking for a kernel thread instead
of current->mm as the latter can give the wrong answer due to
use_mm()
- Compute the vmemmap size correctly for KASLR and 5-Level paging.
Otherwise this can end up with a way too small vmemmap area.
- Make KASAN and 5-level paging work again by making sure that all
invalid bits are masked out when computing the P4D offset. This
worked before but got broken recently when the LDT remap area was
moved.
- Prevent a NULL pointer dereference in the resource control code
which can be triggered with certain mount options when the
requested resource is not available.
- Enforce ordering of microcode loading vs. perf initialization on
secondary CPUs. Otherwise perf tries to access a non-existing MSR
as the boot CPU marked it as available.
- Don't stop the resource control group walk early otherwise the
control bitmaps are not updated correctly and become inconsistent.
- Unbreak kgdb by returning 0 on success from
kgdb_arch_set_breakpoint() instead of an error code.
- Add more Icelake CPU model defines so depending changes can be
queued in other trees"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback
x86/kasan: Fix boot with 5-level paging and KASAN
x86/fpu: Don't use current->mm to check for a kthread
x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()
x86/resctrl: Prevent NULL pointer dereference when local MBM is disabled
x86/resctrl: Don't stop walking closids when a locksetup group is found
x86/fpu: Update kernel's FPU state before using for the fsave header
x86/mm/KASLR: Compute the size of the vmemmap section properly
x86/fpu: Use fault_in_pages_writeable() for pre-faulting
x86/CPU: Add more Icelake model numbers
mm/vmalloc: Avoid rare case of flushing TLB with weird arguments
mm/vmalloc: Fix calculation of direct map addr range
Pull timer fixes from Thomas Gleixner:
"A set of small fixes:
- Repair the ktime_get_coarse() functions so they actually deliver
what they are supposed to: tick granular time stamps. The current
code missed to add the accumulated nanoseconds part of the
timekeeper so the resulting granularity was 1 second.
- Prevent the tracer from infinitely recursing into time getter
functions in the arm architectured timer by marking these functions
notrace
- Fix a trivial compiler warning caused by wrong qualifier ordering"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Repair ktime_get_coarse*() granularity
clocksource/drivers/arm_arch_timer: Don't trace count reader functions
clocksource/drivers/timer-ti-dm: Change to new style declaration
Pull RAS fixes from Thomas Gleixner:
"Two small fixes for RAS:
- Use a proper search algorithm to find the correct element in the
CEC array. The replacement was a better choice than fixing the
crash causes by the original search function with horrible duct
tape.
- Move the timer based decay function into thread context so it can
actually acquire the mutex which protects the CEC array to prevent
corruption"
* 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
RAS/CEC: Convert the timer callback to a workqueue
RAS/CEC: Fix binary search function
Couple of driver enumeration issues have been fixed for Mellanox.
ASUS laptops got a regression with backlight, which is fixed now.
Dell computers got a wrong mode (tablet versus laptop) after resume,
that is fixed now.
The following is an automated git shortlog grouped by driver:
asus-wmi:
- Only Tell EC the OS will handle display hotkeys from asus_nb_wmi
intel-vbtn:
- Report switch events when event wakes device
mlx-platform:
- Fix parent device in i2c-mux-reg device registration
platform/mellanox:
- mlxreg-hotplug: Add devm_free_irq call to remove flow
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEqaflIX74DDDzMJJtb7wzTHR8rCgFAl0FJxcACgkQb7wzTHR8
rCirlhAAqtceFp9MwqM3EaHkgurK/zMdq3dw62/TKFQfK6n0MaxYeE6G43BI7vBX
XOoqNzOH3vDELGrZaP1DfmXpd6xASngd57p2qElHUYVqqVHgkqORcuIG5VJpzKsY
1+GRey4eWJ114G16tpGi/MH+i44ro+Y+WJV/F0qbZHRDANenW/VCl1EDWYYf4SYX
9QGKtQWkmGKMzklorhIHiTmrg3HoH08tddR5zStTr88Db8/hErnCtp8Xfm48cJ2J
4mLEXuta0bh2W9WbG2prFL14FE12iawLjDQH67r1XpYtXb+LN1n/f7nq+AgdSTbH
raEYkmhARGbxUjwuW46hDUBHFMkgRY4VG+VyBeGgAZAFtSTbyV5XtQIoV1541sW5
7R4hjta8Gb/6r40L6izAyy23KVjI9wygt++fpSwRtf3/77yICMWHq3Uel4RuX/gX
AWiyrGoOPb3BVwk+Hx1ME6HNMupV0OLTdS+eS0nUH1MTdidawAZ+xM+Y7D7BPJKd
116lHuGlQcMzvP6r+cGAQiptzQvRP+NzuSfbY7H7YY+11dQGm4XULWbFC6q28EgY
42REhawu0PXpMWUQRPK7ZcqFKiqE+nJiKmAuRpzpC9nlr3aaPKP6KCU0eBbYbeyZ
N4rOB8y6MBqm2HgFL9yh9l0tcnMF7o9P5NU/aztZEndL6/LMxQA=
=/H6h
-----END PGP SIGNATURE-----
Merge tag 'platform-drivers-x86-v5.2-3' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver fixes from Andy Shevchenko:
- fix a couple of Mellanox driver enumeration issues
- fix ASUS laptop regression with backlight
- fix Dell computers that got a wrong mode (tablet versus laptop) after
resume
* tag 'platform-drivers-x86-v5.2-3' of git://git.infradead.org/linux-platform-drivers-x86:
platform/mellanox: mlxreg-hotplug: Add devm_free_irq call to remove flow
platform/x86: mlx-platform: Fix parent device in i2c-mux-reg device registration
platform/x86: intel-vbtn: Report switch events when event wakes device
platform/x86: asus-wmi: Only Tell EC the OS will handle display hotkeys from asus_nb_wmi
Here are some small USB driver fixes for 5.2-rc5
Nothing major, just some small gadget fixes, usb-serial new device ids,
a few new quirks, and some small fixes for some regressions that have
been found after the big 5.2-rc1 merge.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXQUTIg8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykPJQCfZdiElw3N35bKVf7bhruxSb5DevEAni96uDM7
jSOSEB/EWqxZEy0ZjU7J
=cKx4
-----END PGP SIGNATURE-----
Merge tag 'usb-5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes for 5.2-rc5
Nothing major, just some small gadget fixes, usb-serial new device
ids, a few new quirks, and some small fixes for some regressions that
have been found after the big 5.2-rc1 merge.
All of these have been in linux-next for a while with no reported
issues"
* tag 'usb-5.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: typec: Make sure an alt mode exist before getting its partner
usb: gadget: udc: lpc32xx: fix return value check in lpc32xx_udc_probe()
usb: gadget: dwc2: fix zlp handling
usb: dwc2: Set actual frame number for completed ISOC transfer for none DDMA
usb: gadget: udc: lpc32xx: allocate descriptor with GFP_ATOMIC
usb: gadget: fusb300_udc: Fix memory leak of fusb300->ep[i]
usb: phy: mxs: Disable external charger detect in mxs_phy_hw_init()
usb: dwc2: Fix DMA cache alignment issues
usb: dwc2: host: Fix wMaxPacketSize handling (fix webcam regression)
USB: Fix chipmunk-like voice when using Logitech C270 for recording audio.
USB: usb-storage: Add new ID to ums-realtek
usb: typec: ucsi: ccg: fix memory leak in do_flash
USB: serial: option: add Telit 0x1260 and 0x1261 compositions
USB: serial: pl2303: add Allied Telesis VT-Kit3
USB: serial: option: add support for Simcom SIM7500/SIM7600 RNDIS mode
One fix for a regression introduced by our 32-bit KASAN support, which broke
booting on machines with "bootx" early debugging enabled.
A fix for a bug which broke kexec on 32-bit, introduced by changes to the 32-bit
STRICT_KERNEL_RWX support in v5.1.
Finally two fixes going to stable for our THP split/collapse handling,
discovered by Nick. The first fixes random crashes and/or corruption in guests
under sufficient load.
Thanks to:
Nicholas Piggin, Christophe Leroy, Aaro Koskinen, Mathieu Malaterre.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJdBOivAAoJEFHr6jzI4aWA/T0P/1pmr4JLWzXOPQOGOwS2mmu5
HhXBFuDflpGiWS31syOhfKhiE2qlwdIcGaclSo1wAgUnMKp+sxVEagF6DEt484r3
DXs3eRyrGu5vQT7Q6yReuT3Kw2ZR474a5ob00WGAQosBKyJF4ZHWz16ETVWMdAMQ
TknEEU3hOUnMWWIEvLnZOKT7eJcmzj5IYy1OtLHBjWiHHizGC8PSxdVhiRcD/O6R
F6C7XrFb7RRj5ran6gxwMbcTvjgu922TSQPOCw93qnXYWLfvWDUXC4yCqY21oHnr
b3zgJNgIdSoMYxE8pfOH7Y+eaJrbzgnhlS5OJNEz/4NOfGQnJXYcSF8QO6eeVoKM
L2SkT1Ov+QMmZQjMC5e9OAe7DFHfM59RYFg11eaUqfiaObRsmwu8rqjngITV5Ede
Ydq3W39XQkjB3aQ4qb0MnEBbVgyQ/y6/T5hoHRlvnb5byFk5Pd3jpub56sq87UGM
M4GD3YdmT5eBqzGxApddyIiS839PZpdw7g/Ivtp2GYCtiNNZcJqaXvN7IwfcVF3c
YJrCfhNTUJPjICIL9k9v2RQovGuGZqtM3BHc8PmyVvDxTqtEbh8B9GEg+QC58xp/
s+UI2sEd/Vg6NVKluIJHau7aEmJlaxYIshqsIhYvPgJRN6KrFMhdfPNx7D+zMlu8
nbM6Z1n2VFk9Cxb0vA9K
=MXJI
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"One fix for a regression introduced by our 32-bit KASAN support, which
broke booting on machines with "bootx" early debugging enabled.
A fix for a bug which broke kexec on 32-bit, introduced by changes to
the 32-bit STRICT_KERNEL_RWX support in v5.1.
Finally two fixes going to stable for our THP split/collapse handling,
discovered by Nick. The first fixes random crashes and/or corruption
in guests under sufficient load.
Thanks to: Nicholas Piggin, Christophe Leroy, Aaro Koskinen, Mathieu
Malaterre"
* tag 'powerpc-5.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/32s: fix booting with CONFIG_PPC_EARLY_DEBUG_BOOTX
powerpc/64s: __find_linux_pte() synchronization vs pmdp_invalidate()
powerpc/64s: Fix THP PMD collapse serialisation
powerpc: Fix kexec failure on book3s/32
- Out of range read of stack trace output
- Fix for NULL pointer dereference in trace_uprobe_create()
- Fix to a livepatching / ftrace permission race in the module code
- Fix for NULL pointer dereference in free_ftrace_func_mapper()
- A couple of build warning clean ups
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXQToxhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qusmAP4/mmJPgsDchnu5ui0wB8BByzJlsPn8
luXFDuqI4f34zgD+JCmeYbj5LLh98D9XkaaEgP4yz3yKsWeSdwWPCU0vTgo=
=M/+E
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Out of range read of stack trace output
- Fix for NULL pointer dereference in trace_uprobe_create()
- Fix to a livepatching / ftrace permission race in the module code
- Fix for NULL pointer dereference in free_ftrace_func_mapper()
- A couple of build warning clean ups
* tag 'trace-v5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix NULL pointer dereference in free_ftrace_func_mapper()
module: Fix livepatch/ftrace module text permissions race
tracing/uprobe: Fix obsolete comment on trace_uprobe_create()
tracing/uprobe: Fix NULL pointer dereference in trace_uprobe_create()
tracing: Make two symbols static
tracing: avoid build warning with HAVE_NOP_MCOUNT
tracing: Fix out-of-range read in trace_stack_print()
Adric Blake reported the following warning during suspend-resume:
Enabling non-boot CPUs ...
x86: Booting SMP configuration:
smpboot: Booting Node 0 Processor 1 APIC 0x2
unchecked MSR access error: WRMSR to 0x10f (tried to write 0x0000000000000000) \
at rIP: 0xffffffff8d267924 (native_write_msr+0x4/0x20)
Call Trace:
intel_set_tfa
intel_pmu_cpu_starting
? x86_pmu_dead_cpu
x86_pmu_starting_cpu
cpuhp_invoke_callback
? _raw_spin_lock_irqsave
notify_cpu_starting
start_secondary
secondary_startup_64
microcode: sig=0x806ea, pf=0x80, revision=0x96
microcode: updated to revision 0xb4, date = 2019-04-01
CPU1 is up
The MSR in question is MSR_TFA_RTM_FORCE_ABORT and that MSR is emulated
by microcode. The log above shows that the microcode loader callback
happens after the PMU restoration, leading to the conjecture that
because the microcode hasn't been updated yet, that MSR is not present
yet, leading to the #GP.
Add a microcode loader-specific hotplug vector which comes before
the PERF vectors and thus executes earlier and makes sure the MSR is
present.
Fixes: 400816f60c ("perf/x86/intel: Implement support for TSX Force Abort")
Reported-by: Adric Blake <promarbler14@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: x86@kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=203637
Pull cgroup fixes from Tejun Heo:
"This has an unusually high density of tricky fixes:
- task_get_css() could deadlock when it races against a dying cgroup.
- cgroup.procs didn't list thread group leaders with live threads.
This could mislead readers to think that a cgroup is empty when
it's not. Fixed by making PROCS iterator include dead tasks. I made
a couple mistakes making this change and this pull request contains
a couple follow-up patches.
- When cpusets run out of online cpus, it updates cpusmasks of member
tasks in bizarre ways. Joel improved the behavior significantly"
* 'for-5.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cpuset: restore sanity to cpuset_cpus_allowed_fallback()
cgroup: Fix css_task_iter_advance_css_set() cset skip condition
cgroup: css_task_iter_skip()'d iterators must be advanced before accessed
cgroup: Include dying leaders with live threads in PROCS iterations
cgroup: Implement css_task_iter_skip()
cgroup: Call cgroup_release() before __exit_signal()
docs cgroups: add another example size for hugetlb
cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css()
- fix regression on amdgpu on SI
- fix edid override regression
- driver fixes: amdgpu, i915, mediatek, meson, panfrost
- fix writecombine for vmap in gem-shmem helper (used by panfrost)
- add more panel quirks
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEb4nG6jLu8Y5XI+PfTA9ye/CYqnEFAl0D4LAACgkQTA9ye/CY
qnFePw/8DymJIm8hNIJ1hF65be0XVdQiKhggw76llBDAWv/Zc5LHZxK/4QJgwbrF
ounfcJOnLWHyeMaoxO8AtLFZQc+4hlm6PMV3tkou8UwYPszcTK++wbvWiYmYn5/t
q7MoLUf/kd1sngzFYJHEBDmNHFo34uPInc9YJX4Ufvz5fl6uIr+Yc2gLPOZsUhig
PvcKSm7sJwNgUGJq7g0qPgLFaIwRL0FMVrPltTg533OkTBcQwAmC+cJVrx5AKxmo
qJYyPy0iK3YNAz2KvMf4q+c8TGk4EpkExpqLD/6bHHiglD8PKMh9C5ER7iyhlzqo
vgog+dD3cjFaTn+Fo1mX5UYo/rRl6DYantySKZcdfDVJ8VUo8HALeHs6vNE47WzD
jDziMVRwgXs8uYFpQJM+E+V1XmuxmsJ+H21p+Lzv0BFxHDakTnPNY7Yibclgoz+v
zMJP7uQbrMthGmM7VsVsHbiMARSak9UasbHoikUuMBhYG3pJ59jFpCwKlqTGfJ3+
hKWvL5jnXpOOpLP1beJg4bpFn2seBkN1uNDllCz0gv+WBw7B8vJMBOQR4c29Oy4v
fDD/scs/eG29btdLUMeA4cJQNbUNY2MC2hfde3gWJHO3V1GCFdGMfAlQh09b+qhg
Q9RqiRmw4oZqAZLCmx5oa+7vf19SWjUqh36zfXOu3aTy/9nDa0Q=
=4gYP
-----END PGP SIGNATURE-----
Merge tag 'drm-fixes-2019-06-14' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Daniel Vetter:
"Nothing unsettling here, also not aware of anything serious still
pending.
The edid override regression fix took a bit longer since this seems to
be an area with an overabundance of bad options. But the fix we have
now seems like a good path forward.
Next week it should be back to Dave.
Summary:
- fix regression on amdgpu on SI
- fix edid override regression
- driver fixes: amdgpu, i915, mediatek, meson, panfrost
- fix writecombine for vmap in gem-shmem helper (used by panfrost)
- add more panel quirks"
* tag 'drm-fixes-2019-06-14' of git://anongit.freedesktop.org/drm/drm: (25 commits)
drm/amdgpu: return 0 by default in amdgpu_pm_load_smu_firmware
drm/amdgpu: Fix bounds checking in amdgpu_ras_is_supported()
drm: add fallback override/firmware EDID modes workaround
drm/edid: abstract override/firmware EDID retrieval
drm/i915/perf: fix whitelist on Gen10+
drm/i915/sdvo: Implement proper HDMI audio support for SDVO
drm/i915: Fix per-pixel alpha with CCS
drm/i915/dmc: protect against reading random memory
drm/i915/dsi: Use a fuzzy check for burst mode clock check
drm/amdgpu/{uvd,vcn}: fetch ring's read_ptr after alloc
drm/panfrost: Require the simple_ondemand governor
drm/panfrost: make devfreq optional again
drm/gem_shmem: Use a writecombine mapping for ->vaddr
drm: panel-orientation-quirks: Add quirk for GPD MicroPC
drm: panel-orientation-quirks: Add quirk for GPD pocket2
drm/meson: fix G12A primary plane disabling
drm/meson: fix primary plane disabling
drm/meson: fix G12A HDMI PLL settings for 4K60 1000/1001 variations
drm/mediatek: call mtk_dsi_stop() after mtk_drm_crtc_atomic_disable()
drm/mediatek: clear num_pipes when unbind driver
...
A single bug fix for hpsa. The user visible consequences aren't
clear, but the ioaccel2 raid acceleration may misfire on the malformed
request assuming the payload is big enough to require chaining (more
than 31 sg entries).
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCXQQdoCYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pishfXkAP0aPFp3
fktJnJ/33j7uk4GwOGKB3cOzTnRG+gQ99SNp4AD+NPxXD1AXhdVqtEnxOAt2cm3B
wX3ImBnV+4zIDQEEg60=
=w5Is
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fix from James Bottomley:
"A single bug fix for hpsa.
The user visible consequences aren't clear, but the ioaccel2 raid
acceleration may misfire on the malformed request assuming the payload
is big enough to require chaining (more than 31 sg entries)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: hpsa: correct ioaccel2 chaining
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl0DzeIQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppgoD/wOQ4LaeFvnRZkOQ+1+64xot3okN2BvFFAB
HN+yc8rS2EvHYMG/2PRpLJpGPzTdu83m7Erc5SHBOQVTFqbADxmjRlVtNezOrefk
TWcpsNw2z4Fc7cRQ1u6CNZxqjTVF2QqQvC4bv3fNrZYgckhI4FuiIjJ4pxGI0o9N
HjrGLmp9hdpr20geMomfKSYx9XWUyZs7SxIr8oOFiZYF+K5goO4K8fRsdDIpFGYD
xvzwkohf0xeI1kdfehII+yanlUrdYdeSShLgGl7F/qbiupyKhkxSjvmH8tOShFYa
qSQFRhacjKy84VTOA4MQjkpcTaJfsDqINIPuU8IR+fhTqFbQW8kq9ovlhW2GeVZ/
63b8mWD0XTq7+9vnNeJid8MZ3LqmcKy5qbLWYnnh927sdzjc1Y/RgPWQS4pBllaT
m1rtRMBS0OBcHU0ofsshzL0UZshVh+M+mrKBp+tctIUVmF0bDRqNOJqatihzgNL1
hI/9GY0KbBJQdHJD1iUgw+egikw/GQfoUcNG0IVbHHIREqy+0u/mBSmNzVw/hWeS
jynnsuaAOSrwWeaIeHBGYQFUeQrjkgS1brTYIsXcG5wIcX99uFrp41kuTJGKD2fR
jxkU3EIGxyLQL25T4PaxFqJQrfh/2rAia3Jq99lao5/rLNUdgX/2Ysv+hiw8M/Y7
N+DyskreNQ==
=6C8e
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190614' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Remove references to old schedulers for the scheduler switching and
blkio controller documentation (Andreas)
- Kill duplicate check for report zone for null_blk (Chaitanya)
- Two bcache fixes (Coly)
- Ensure that mq-deadline is selected if zoned block device is enabled,
as we need that to support them (Damien)
- Fix io_uring memory leak (Eric)
- ps3vram fallout from LBDAF removal (Geert)
- Redundant blk-mq debugfs debugfs_create return check cleanup (Greg)
- Extend NOPLM quirk for ST1000LM024 drives (Hans)
- Remove error path warning that can now trigger after the queue
removal/addition fixes (Ming)
* tag 'for-linus-20190614' of git://git.kernel.dk/linux-block:
block/ps3vram: Use %llu to format sector_t after LBDAF removal
libata: Extend quirks for the ST1000LM024 drives with NOLPM quirk
bcache: only set BCACHE_DEV_WB_RUNNING when cached device attached
bcache: fix stack corruption by PRECEDING_KEY()
blk-mq: remove WARN_ON(!q->elevator) from blk_mq_sched_free_requests
blkio-controller.txt: Remove references to CFQ
block/switching-sched.txt: Update to blk-mq schedulers
null_blk: remove duplicate check for report zone
blk-mq: no need to check return value of debugfs_create functions
io_uring: fix memory leak of UNIX domain socket inode
block: force select mq-deadline for zoned block devices
Pull i2c fixes from Wolfram Sang:
"I2C has two simple but wanted driver fixes for you"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: pca-platform: Fix GPIO lookup code
i2c: acorn: fix i2c warning
The 5.1 mount system rework changed the smackfsdef mount option to
smackfsdefault. This fixes the regression by making smackfsdef treated
the same way as smackfsdefault.
Also fix the smack_param_specs[] to have "smack" prefixes on all the
names. This isn't visible to a user unless they either:
(a) Try to mount a filesystem that's converted to the internal mount API
and that implements the ->parse_monolithic() context operation - and
only then if they call security_fs_context_parse_param() rather than
security_sb_eat_lsm_opts().
There are no examples of this upstream yet, but nfs will probably want
to do this for nfs2 or nfs3.
(b) Use fsconfig() to configure the filesystem - in which case
security_fs_context_parse_param() will be called.
This issue is that smack_sb_eat_lsm_opts() checks for the "smack" prefix
on the options, but smack_fs_context_parse_param() does not.
Fixes: c3300aaf95 ("smack: get rid of match_token()")
Fixes: 2febd254ad ("smack: Implement filesystem context security hooks")
Cc: stable@vger.kernel.org
Reported-by: Jose Bollo <jose.bollo@iot.bzh>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>