Perf call stack surprises

Note: I posted a short version of this on Twitter some time ago. Let’s make it permanent here.

A sampling profiler typically captures stack traces at various points of program execution. Because sampling is random, you can’t expect every sample to come from your own program. Some samples will land in the kernel, and for certain workloads most of them will.

Capturing stack traces on Linux is done through the perf_event_open syscall. The specifics of how to do this are out of scope here. Let’s just say that neither the exclude_callchain_kernel nor the exclude_callchain_user bit field is set, so both kernel and user frames are recorded.

With that setup, a sampled call stack may look similar to the following example:

0xffffffffffffff80
0xffffffff9c7d14c4
0xffffffff9c7daa23
0xffffffff9c98373c
0xffffffff9c983c0c
0xffffffff9c241dd3
0xffffffff9c40007c
0xfffffffffffffe00
0x00007f0304a2e79c

To properly read these addresses, you must know how to differentiate between user space and kernel space. User-space addresses have the highest bit cleared (e.g., 0x00007f0304a2e79c), and kernel addresses have the highest bit set (e.g., 0xffffffff9c7daa23). Current x64 processor designs do not allow addressing the whole 64-bit address space. Instead, they require addresses to be canonical: the unused high bits must all be copies of the highest implemented bit. This is why none of the frames listed above has an address that looks like 0xa012bc...
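As a rough illustration, these rules can be written down in a few lines of C. This is only a sketch: it assumes 48-bit virtual addresses (4-level paging; CPUs with 5-level paging implement 57 bits), and the helper names is_canonical and is_kernel_address are made up for this post.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48 implemented virtual address bits (4-level paging). */
#define VA_BITS 48

/* An address is canonical when bits 63..(VA_BITS-1) are all equal,
 * i.e. the upper bits are a sign extension of the highest implemented bit. */
bool is_canonical(uint64_t addr)
{
    uint64_t high = addr >> (VA_BITS - 1);
    uint64_t all_ones = UINT64_MAX >> (VA_BITS - 1);
    return high == 0 || high == all_ones;
}

/* On Linux x64, the kernel lives in the upper half of the address space,
 * so the highest bit alone tells the two halves apart. */
bool is_kernel_address(uint64_t addr)
{
    return (addr >> 63) != 0;
}
```

Note that a value such as 0xffffffffffffff80 passes both checks, which is exactly why it is so easy to mistake for a real kernel frame.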

The problem is that the stack contains two seemingly valid addresses, 0xffffffffffffff80 and 0xfffffffffffffe00. At first glance, they look just like any other kernel address. It’s more complex, of course. If you look at the kernel symbol table, you will find that no symbol exists at these addresses. If you look at the kernel memory map, you will find that these addresses fall into an unused memory hole. So what are they? Is it some particular encoding for syscalls? Or maybe it’s an entry point into an interrupt handler? Or perhaps something related to securing the CPU against problems like Spectre?

When I looked into this problem, nothing about it was documented, and no search engine turned up anything.

I eventually found the answer inside the Linux kernel tree, in enum perf_callchain_context:

PERF_CONTEXT_KERNEL     = (__u64)-128,
PERF_CONTEXT_USER       = (__u64)-512,
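To see how these two constants relate to the mystery entries, here is a small C sketch. The macro names and the context_marker_name helper are hypothetical and only mirror the two values quoted above. The real enum contains further entries that are not covered here.

```c
#include <stddef.h>
#include <stdint.h>

/* Values mirrored from enum perf_callchain_context as quoted above.
 * Converting a negative number to an unsigned 64-bit type wraps around. */
#define CTX_KERNEL ((uint64_t)-128) /* 0xffffffffffffff80 */
#define CTX_USER   ((uint64_t)-512) /* 0xfffffffffffffe00 */

/* Returns a label if ip is one of the two markers,
 * or NULL if it is an ordinary frame address. */
const char *context_marker_name(uint64_t ip)
{
    if (ip == CTX_KERNEL)
        return "kernel";
    if (ip == CTX_USER)
        return "user";
    return NULL;
}
```

Fed with the first and the second-to-last entry of the example stack, the helper reports "kernel" and "user", while every other entry yields NULL.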

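Interpreted as unsigned 64-bit values, -128 and -512 are exactly 0xffffffffffffff80 and 0xfffffffffffffe00. These entries are not addresses at all. They are markers telling you whether the frames that follow belong to the kernel or to user space. The change was introduced in commit f9188e02, with no documentation anywhere at all. The only way to deal with these markers is to manually remove them from the call stack. There is no perf_event_open option to inhibit their insertion.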
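Removing the markers by hand could look roughly like the sketch below. It assumes the call stack has already been copied out of the sample into a plain array of 64-bit values. The function name strip_context_markers and the macro names are made up for this post, and only the two markers discussed here are handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Values mirrored from enum perf_callchain_context. */
#define MARKER_KERNEL ((uint64_t)-128)
#define MARKER_USER   ((uint64_t)-512)

/* Compacts ips[0..n) in place, dropping the two context markers.
 * Returns the number of real frames left at the front of the array. */
size_t strip_context_markers(uint64_t *ips, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (ips[i] == MARKER_KERNEL || ips[i] == MARKER_USER)
            continue; /* a marker, not an address */
        ips[out++] = ips[i];
    }
    return out;
}
```

Applied to the nine-entry example stack, this leaves seven frames: six kernel addresses followed by the single user-space address. If you need to know which frames belong to which side, remember the last marker you saw instead of just skipping it.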