02 August 2017

Debugging a Weird GDB Misbehavior

I was debugging a crashing program with gdb last week, and I observed a very strange behavior.  I was using gdb to print the faulting address following a segfault, and it was getting it wrong sometimes.  I determined from the registers and the faulting instruction that the faulting address was something like 0xce4a414100000000, but gdb was reporting the faulting address as 0x0.  WTF?

So I started playing with it.  I observed that this failure mode did not occur when I threw the same crash against a 32-bit version of the target; it still crashed, but gdb reported the faulting address correctly.  Weird.

So I checked out a copy of the gdb source tree and started poking around.  I was under the impression that there was a way to get the faulting address of a segfault from the kernel inside a segfault handler, and googling revealed that it's available in a siginfo_t struct.  Reading some man pages suggested that a ptracing program should be able to get such information about a fault in the program it's tracing using PTRACE_GETSIGINFO, so I looked for all the places that gdb was using this option to ptrace.  I found some code that seemed to be translating 64-bit siginfo_t structs into 32-bit siginfo_t structs, which seemed like a likely candidate.  There were only one or two places where GETSIGINFO was used, so I added some debugging printfs, compiled gdb, and...  compilation failed, I forget why exactly.  I chalked this up as a hazard of using the bleeding-edge source version; I downloaded a source release tarball, added my printfs again, and got it to compile, but it didn't seem to be hitting the debug prints around the PTRACE_GETSIGINFO calls, and it complained about not having its python installation in the correct place, so I was somewhat suspicious of its correct operation.  I did confirm that I still got the wrong address even in the freshest version of gdb, though.
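(As an aside, the underlying mechanism is simple enough to demo.  Here's a minimal sketch, not from gdb or my crashing program, of how a process can read the faulting address of its own segfault: install a handler with SA_SIGINFO and look at the si_addr and si_code fields of the siginfo_t the kernel hands in.)

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* The three-argument handler receives the siginfo_t the kernel fills in. */
    static void segv_handler(int sig, siginfo_t *si, void *ucontext) {
        (void)sig; (void)ucontext;
        /* fprintf isn't async-signal-safe; fine only for a throwaway demo. */
        fprintf(stderr, "SIGSEGV: si_addr=%p si_code=%d\n", si->si_addr, si->si_code);
        _exit(1);
    }

    int main(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = segv_handler;   /* three-argument handler form */
        sa.sa_flags = SA_SIGINFO;         /* ask the kernel for the siginfo_t */
        sigaction(SIGSEGV, &sa, NULL);

        volatile int *bad = (volatile int *)0x42;  /* ordinary unmapped address */
        *bad = 1;                                  /* handler should report 0x42 */
        return 0;
    }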

At this point, after a morning of debugging and poking around in the gdb source, I told my boss that it was going to take longer to get to the bottom of this rat-hole than I expected, and tabled the gdb issue to investigate later.

The weekend rolled around, and I decided that if printf debugging wasn't going to cut it, I should use gdb to debug gdb.  Some googling indicated that this was a thing that people do, and suggested that I use gdbserver for it.  So I fired up gdbserver running gdb running my crashing program, then fired up gdb and used target remote to connect to the gdbserver, hit run...  and my ssh session to my work machine died.  Some pinging around the work network revealed that my host was down.  My suspicions of a kernel panic were confirmed on Monday morning; nobody else was in the office over the weekend to see it go down and reboot it.

So I was left to debug this thing on my own machine.  I checked out the crashing project, built it, ran it in gdb, and observed the same failure mode, the incorrect faulting address.  Deciding that gdb-on-gdb action was just too hot, I decided to give strace a shot.  stracing gdb revealed that ptrace(PTRACE_GETSIGINFO, ...) was returning a faulting address of 0...  from the kernel!  So this wasn't a gdb bug at all, but a weird kernel behavior.  Along with this weird faulting address, ptrace's siginfo_t struct also had a weird si_code value of SI_KERNEL.  When I ran the same gdb command under strace on some other crashing programs, si_code was usually SEGV_MAPERR.
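(For the curious, the ptrace call strace was showing is easy to reproduce without gdb in the loop.  A minimal sketch, assuming a forked child that faults on the same sort of noncanonical address; this is my own illustration, not gdb's code.  If the behavior above is any guide, the noncanonical fault should come back with si_addr 0 and si_code SI_KERNEL, while an ordinary unmapped pointer comes back with the real address and SEGV_MAPERR.)

    #include <signal.h>
    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t child = fork();
        if (child == 0) {
            /* Child: ask to be traced, then fault on a noncanonical address. */
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            volatile char *p = (volatile char *)0xce4a414100000000ULL;
            *p = 1;  /* SIGSEGV; the tracer is notified before delivery */
            _exit(0);
        }

        int status;
        waitpid(child, &status, 0);  /* child stops when the SIGSEGV is generated */
        if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGSEGV) {
            siginfo_t si;
            if (ptrace(PTRACE_GETSIGINFO, child, NULL, &si) == 0)
                printf("si_signo=%d si_code=%d si_addr=%p\n",
                       si.si_signo, si.si_code, si.si_addr);
        }
        kill(child, SIGKILL);  /* don't leave a stopped child lying around */
        return 0;
    }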

Some googling later I found this stackoverflow answer.  The relevant part is:
A segmentation violation that occurs as a result of userspace process accessing virtual memory above the TASK_SIZE limit will cause a segmentation violation with an si_code of SI_KERNEL. In other words, the TASK_SIZE limit is the highest virtual address that any process is allowed to access. This is normally 3GB unless the kernel is configured for high memory support. The area above the TASK_SIZE limit is referred to as the "kernel segment".
And indeed, the address that I was faulting on was above the TASK_SIZE limit.  But what I found odd about this whole thing was that it wasn't even really a kernel address; looking at this description of 48-bit memory layout, my faulting address fell into the noncanonical zone.
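(Concretely: with 48-bit virtual addresses, an address is canonical only when bits 63 through 47 are all copies of bit 47, i.e. all zeros for the user half or all ones for the kernel half.  0xce4a414100000000 is neither, so it isn't a legal user address or a kernel address.  A quick sketch of that check, just for illustration:)

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* An x86-64 address is canonical (48-bit addressing) iff bits 63..47
       are all zero (user half) or all one (kernel half). */
    static bool is_canonical_48(uint64_t addr) {
        uint64_t top = addr >> 47;            /* bits 63..47: 17 bits */
        return top == 0 || top == 0x1ffff;
    }

    int main(void) {
        uint64_t fault = 0xce4a414100000000ULL;  /* the address from this crash */
        printf("0x%016llx canonical: %s\n",
               (unsigned long long)fault,
               is_canonical_48(fault) ? "yes" : "no");  /* prints "no" */
        return 0;
    }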

So anyway, the moral of this story: if, on 64-bit Linux, gdb is telling you that the faulting address of a segfault is 0, it might be lying, and the address might just be in the noncanonical region.

And that's what I did this Saturday.

Analysis / post-mortem:

I don't think this was a terrible performance on my part.  Total time elapsed was something like four hours, and some of that was spent compiling gdb variants.  I did a decent job of changing approaches when something seemed unproductive.  I didn't make the best use of the early observation that the behavior was different on 32-bit; instead, that observation sent me into the gdb source where it translates between 64- and 32-bit siginfo_t structs, which was a false lead, though at least I didn't get stuck on it.  I googled early and often.  I should probably have resorted to strace earlier; it's a very strong tool.  Arguably I should have known beforehand that this was a 48-bit issue, but this is how you learn.
