Linux Kernel 4.4.x (Ubuntu 16.04) - 'double-fdput()' bpf(BPF_PROG_LOAD) Privilege Escalation








In Linux >=4.4, when the CONFIG_BPF_SYSCALL config option is set and the
kernel.unprivileged_bpf_disabled sysctl is not explicitly set to 1 at runtime,
unprivileged code can use the bpf() syscall to load eBPF socket filter programs.
These conditions are fulfilled in Ubuntu 16.04.

When an eBPF program is loaded using bpf(BPF_PROG_LOAD, ...), the first
function that touches the supplied eBPF instructions is
replace_map_fd_with_map_ptr(), which looks for instructions that reference eBPF
map file descriptors and looks up pointers for the corresponding map files.
This is done as follows:

	/* look for pseudo eBPF instructions that access map FDs and
	 * replace them with actual map pointers
	static int replace_map_fd_with_map_ptr(struct verifier_env *env)
		struct bpf_insn *insn = env->prog->insnsi;
		int insn_cnt = env->prog->len;
		int i, j;

		for (i = 0; i < insn_cnt; i++, insn++) {
			[checks for bad instructions]

			if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) {
				struct bpf_map *map;
				struct fd f;

				[checks for bad instructions]

				f = fdget(insn->imm);
				map = __bpf_map_get(f);
				if (IS_ERR(map)) {
					verbose("fd %d is not pointing to valid bpf_map\n",
					return PTR_ERR(map);


__bpf_map_get contains the following code:

/* if error is returned, fd is released.
 * On success caller should complete fd access with matching fdput()
struct bpf_map *__bpf_map_get(struct fd f)
	if (!f.file)
		return ERR_PTR(-EBADF);
	if (f.file->f_op != &bpf_map_fops) {
		return ERR_PTR(-EINVAL);

	return f.file->private_data;

The problem is that when the caller supplies a file descriptor number referring
to a struct file that is not an eBPF map, both __bpf_map_get() and
replace_map_fd_with_map_ptr() will call fdput() on the struct fd. If
__fget_light() detected that the file descriptor table is shared with another
task and therefore the FDPUT_FPUT flag is set in the struct fd, this will cause
the reference count of the struct file to be over-decremented, allowing an
attacker to create a use-after-free situation where a struct file is freed
although there are still references to it.

A simple proof of concept that causes oopses/crashes on a kernel compiled with
memory debugging options is attached as crasher.tar.

One way to exploit this issue is to create a writable file descriptor, start a
write operation on it, wait for the kernel to verify the file's writability,
then free the writable file and open a readonly file that is allocated in the
same place before the kernel writes into the freed file, allowing an attacker
to write data to a readonly file. By e.g. writing to /etc/crontab, root
privileges can then be obtained.

There are two problems with this approach:

The attacker should ideally be able to determine whether a newly allocated
struct file is located at the same address as the previously freed one. Linux
provides a syscall that performs exactly this comparison for the caller:
kcmp(getpid(), getpid(), KCMP_FILE, uaf_fd, new_fd).

In order to make exploitation more reliable, the attacker should be able to
pause code execution in the kernel between the writability check of the target
file and the actual write operation. This can be done by abusing the writev()
syscall and FUSE: The attacker mounts a FUSE filesystem that artificially delays
read accesses, then mmap()s a file containing a struct iovec from that FUSE
filesystem and passes the result of mmap() to writev(). (Another way to do this
would be to use the userfaultfd() syscall.)

writev() calls do_writev(), which looks up the struct file * corresponding to
the file descriptor number and then calls vfs_writev(). vfs_writev() verifies
that the target file is writable, then calls do_readv_writev(), which first
copies the struct iovec from userspace using import_iovec(), then performs the
rest of the write operation. Because import_iovec() performs a userspace memory
access, it may have to wait for pages to be faulted in - and in this case, it
has to wait for the attacker-owned FUSE filesystem to resolve the pagefault,
allowing the attacker to suspend code execution in the kernel at that point

An exploit that puts all this together is in exploit.tar. Usage:

user@host:~/ebpf_mapfd_doubleput$ ./
user@host:~/ebpf_mapfd_doubleput$ ./doubleput
starting writev
woohoo, got pointer reuse
writev returned successfully. if this worked, you'll have a root shell in <=60 seconds.
suid file detected, launching rootshell...
we have root privs now...
root@host:~/ebpf_mapfd_doubleput# id
uid=0(root) gid=0(root) groups=0(root),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),113(lpadmin),128(sambashare),999(vboxsf),1000(user)

This exploit was tested on a Ubuntu 16.04 Desktop system.


Proof of Concept:
Exploit-DB Mirror: