23 August 2017

Paper Notes - Predicting Bugs from History, 2008

Notes on this chapter of what seems to be a book on "software evolution".

"The defects we measure from history can only be mapped to components because they have been fixed."  That seems rather optimistic...

"This mistake manifests itself as an error in some development artefact, be it requirements, specification, or a design document."  This is getting concerningly software-engineering-ish...

Sort of weird that everything is module-scoped.  I guess it makes sense for very large systems, but I'd tend to think of things more on the function / basic-block level.

Also weird that all of their coupling metrics (except class coupling) are for global variables.  Any / all shared state is a potential bug source; global just happens to be the most egregious possible case (except for system-globals, like files on disk and registry entries...).
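
To make that concrete, a toy example (mine, not the chapter's) of why global-variable coupling is such a rich source of bugs: two "modules" that look independent are silently coupled through one global, so an innocent-looking call in one clobbers the other's state.

    /* Two "modules" coupled only through a global. */
    #include <stdio.h>

    static int cursor = 0;                   /* shared global state */

    /* "Module A": walks a buffer using the global cursor. */
    static char next_char(const char *buf) { return buf[cursor++]; }

    /* "Module B": resets the cursor for its own parsing pass. */
    static void start_parse(void) { cursor = 0; }

    int main(void)
    {
        const char *buf = "abc";
        printf("%c", next_char(buf));        /* prints 'a' */
        start_parse();                       /* innocent-looking call from elsewhere */
        printf("%c\n", next_char(buf));      /* prints 'a' again, not 'b' */
        return 0;
    }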

"Never blindly trust a metric."  Wiser words are rarely written.

Not a whole lot of surprises.  Typical complexity metrics correlate with bugs.  Churn correlates with bugs.  Tricky problem domains correlate with bugs (though I feel like their example of Eclipse compiler internals vs GUI is sort of disingenuous; if the compiler internals are a little broken, Eclipse cannot perform its core function, but if the UI is a little broken, often end users can work around it or just live with it.  So is it a function of the inherent difficulty of the problem domain, or the centrality of that problem domain to the function of the project?).  Buggy dependencies correlate with bugs, but fall off with distance.  Would've been interesting to see the d=4 case for "domino effect in Windows Server 2003".

Sort of bummed that their references weren't included.

Potential follow-ups:

20 August 2017

Paper Notes - Valgrind 2007

I've been doing some work with Valgrind recently, and the suggested way to get a big-picture understanding of how Valgrind works was to read this paper. Having read it, I think this is a good recommendation.  Some notes, primarily for my own benefit.

Dynamic recompilation seems very similar to my under-informed understanding of QEMU's approach.  Substantially more complex than our hacked-up approach to static binary instrumentation.  Would probably be a lot easier to implement nowadays with LLVM than it was in 2007.  Interesting loading procedure, though it has the same issue that PIN does where it shares an address space with its target (and a target seeking to interfere with analysis will likely be able to do so).  The dispatcher / scheduler translation execution mechanism is also interesting; it doesn't do translation block chaining like QEMU does (we ran into an issue with QEMU's tb-linking a couple of weeks ago), but it has a very tight "dispatcher" that checks a cache and executes known / hot translations, with the slower "scheduler" as fallback.  Coming from writing system call models in PIN, the events system sounds pretty great; I wonder how much of Valgrind's syscall models are stealable for use in other dynamic instrumentation frameworks.
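
To fix the dispatcher / scheduler structure in my head, here's a toy sketch of that split.  None of this is Valgrind's actual code or naming; the "translations" are faked with ordinary C functions keyed by a pretend guest PC, where a real DBT would be JITting machine code.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef uint64_t guest_pc;
    typedef guest_pc (*translation)(void);   /* run one translated block, return next guest PC */

    /* Fake "guest program": three blocks chained by the PCs they return (0 = halt). */
    static guest_pc block_100(void) { puts("block 100"); return 200; }
    static guest_pc block_200(void) { puts("block 200"); return 300; }
    static guest_pc block_300(void) { puts("block 300"); return 0; }

    /* Slow path ("scheduler"): in a real DBT this is where recompilation happens. */
    static translation translate(guest_pc pc)
    {
        printf("scheduler: translating block at %llu\n", (unsigned long long)pc);
        switch (pc) {
        case 100: return block_100;
        case 200: return block_200;
        case 300: return block_300;
        default:  fprintf(stderr, "no block at %llu\n", (unsigned long long)pc); exit(1);
        }
    }

    /* Direct-mapped translation cache consulted by the fast path ("dispatcher"). */
    #define SLOTS 64
    static struct { guest_pc pc; translation code; } cache[SLOTS];

    static void run(guest_pc pc)
    {
        while (pc != 0) {
            unsigned slot = pc % SLOTS;
            if (cache[slot].pc != pc) {      /* miss: fall back to the scheduler */
                cache[slot].code = translate(pc);
                cache[slot].pc   = pc;
            }
            /* No block chaining: every block returns here and we dispatch again. */
            pc = cache[slot].code();
        }
    }

    int main(void)
    {
        run(100);                            /* first pass misses into the scheduler... */
        run(100);                            /* ...second pass runs entirely out of the cache */
        return 0;
    }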

Follow-up topics I should read more on:

13 August 2017

Two Technical Contradictions

Observed two "technical contradictions" in the style of TRIZ at work the other day:

We want to show the user all of this data, but to most users it won't be useful and we don't have the space on the webpage.

We want the performance benefits that this unsafe optimization gives us, but we observe that this unsafe optimization is causing a huge amount of incorrect behavior (much more than we expected when we enabled it).

The first was resolved by proposing to pick out the important bits of information for the user, bring those to attention on the main page, and make the details available on request.

The performance / correctness tradeoff was temporarily resolved with preference to correctness, but has not yet been fully resolved because we do not understand the root cause of the incorrectness caused by the optimization (and this flows into TRIZ's root cause analysis procedure).

But it's curious that I noticed these contradictions in TRIZ terms at all, let alone both on the same day, when I haven't read or thought about TRIZ in months.  It's all the more curious because I've been reading up on John Gall and systemantics, which suggest that TRIZ (as a system) is precisely not what you want if you want to get results.  That agrees with general intuition - the US outperformed the Soviet Union at research with a much less structured approach - though there are an infinitude of confounding factors.

02 August 2017

Debugging a Weird GDB Misbehavior

I was debugging a crashing program with gdb last week, and I observed a very strange behavior.  I was using gdb to print the faulting address following a segfault, and it was getting it wrong sometimes.  I determined from the registers and the faulting instruction that the faulting address was something like 0xce4a414100000000, but gdb was reporting the faulting address as 0x0.  WTF?

So I started playing with it.  I observed that this failure mode did not occur when I threw the same crash against a 32-bit version of the target; it still crashed, but gdb reported the faulting address correctly.  Weird.

So I checked out a copy of the gdb source tree and started poking around.  I was under the impression that there was a way to get the faulting address of a segfault from the kernel inside a segfault handler, and googling revealed that it's available in a siginfo_t struct.  Reading some man pages suggested that a ptracing program should be able to get such information about a fault in the program it's tracing using PTRACE_GETSIGINFO, so I looked for all the places that gdb was using this option to ptrace.  I found some code that seemed to be translating 64-bit siginfo_t structs into 32-bit siginfo_t structs, which seemed like a likely candidate.  There were only one or two places where GETSIGINFO was used, so I added some debugging printfs, compiled gdb, and...  compilation failed, I forget why exactly.  I chalked this up as a hazard of using the bleeding-edge source version; I downloaded a source release tarball, added my printfs again, and got it to compile, but it didn't seem to be hitting the debug prints around the PTRACE_GETSIGINFO calls, and it complained about not having its python installation in the correct place, so I was somewhat suspicious of its correct operation.  I did confirm that I still got the wrong address even in the freshest version of gdb, though.
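
As an aside, the in-process version of that mechanism is tiny (a minimal sketch of my own, not anything from the gdb source): install a SIGSEGV handler with SA_SIGINFO and the kernel hands it a siginfo_t whose si_addr is the faulting address, the same struct a ptracer reads with PTRACE_GETSIGINFO.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void on_segv(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        /* printf isn't async-signal-safe; fine for a throwaway demo. */
        printf("SIGSEGV at %p, si_code=%d\n", info->si_addr, info->si_code);
        _exit(0);
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_segv;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        *(volatile int *)0x1234 = 1;         /* fault on an ordinary unmapped address */
        return 0;
    }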

At this point, after a morning of debugging and poking around in the gdb source, I told my boss that it was going to take longer to get to the bottom of this rat-hole than I expected, and tabled the gdb issue to investigate later.

The weekend rolled around, and I decided that if printf debugging wasn't going to cut it, I should use gdb to debug gdb.  Some googling indicated that this was a thing that people do, and suggested that I use gdbserver for it.  So I fired up gdbserver running gdb running my crashing program, then fired up gdb and used target remote to connect to the gdbserver, hit run...  and my ssh session to my work machine died. Some pinging around the work network revealed that my host was down. My suspicions of a kernel panic were confirmed on Monday morning; nobody else was in the office over the weekend to notice the panic and reboot the machine.

So I was left to debug this thing on my own machine.  I checked out the crashing project, built it, ran it in gdb, and observed the same failure mode, the incorrect faulting address.  Deciding that gdb-on-gdb action was just too hot, I gave strace a shot.  stracing gdb revealed that ptrace(PTRACE_GETSIGINFO, ...) was returning a faulting address of 0...  from the kernel!  So this wasn't a gdb bug at all, but a weird kernel behavior.  Along with the weird faulting address, ptrace's siginfo_t struct also had a weird si_code value of SI_KERNEL.  When I ran the same gdb command to get the faulting address under strace on some other crashing programs, si_code was usually SEGV_MAPERR.
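
This is easy to reproduce without gdb in the picture at all.  A minimal sketch (x86-64 Linux assumed; this is my own toy, not gdb's code): trace a child that faults on a noncanonical address and read the kernel's siginfo_t with PTRACE_GETSIGINFO, the same call strace showed gdb making.  If the behavior above holds, it should print si_code = SI_KERNEL and si_addr = 0 rather than the real faulting address.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            /* Fault on a noncanonical 64-bit address, like the one in my crash. */
            *(volatile int *)0xce4a414100000000ULL = 1;
            _exit(0);                        /* not reached */
        }

        int status;
        waitpid(child, &status, 0);          /* child stops when the SIGSEGV is delivered */
        if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGSEGV) {
            siginfo_t si;
            ptrace(PTRACE_GETSIGINFO, child, NULL, &si);
            printf("si_signo=%d si_code=%d (SI_KERNEL=%d, SEGV_MAPERR=%d) si_addr=%p\n",
                   si.si_signo, si.si_code, SI_KERNEL, SEGV_MAPERR, si.si_addr);
        }
        kill(child, SIGKILL);
        return 0;
    }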

Some googling later I found this stackoverflow answer.  The relevant part is:
A segmentation violation that occurs as a result of userspace process accessing virtual memory above the TASK_SIZE limit will cause a segmentation violation with an si_code of SI_KERNEL. In other words, the TASK_SIZE limit is the highest virtual address that any process is allowed to access. This is normally 3GB unless the kernel is configured for high memory support. The area above the TASK_SIZE limit is referred to as the "kernel segment".
And indeed, the address that I was faulting on was above the TASK_SIZE limit.  But what I found odd about this whole thing was that it wasn't even really a kernel address; looking at this description of 48-bit memory layout, my faulting address fell into the noncanonical zone.  (To spell out why: on x86-64, a canonical address has bits 48-63 equal to bit 47.  Here the low 48 bits are 0x414100000000, so bit 47 is 0 and the top 16 bits would have to be all zeroes; they're 0xce4a, so the address is noncanonical.)

So anyway, the moral of this story: if, on 64-bit Linux, gdb is telling you that the faulting address of a segfault is 0, it might be lying, and the address might just be in the noncanonical region.

And that's what I did this Saturday.

Analysis / post-mortem:

I don't think this was a terrible performance on my part.  Total time elapsed was something like four hours, and some of that was spent compiling gdb variants.  I did a decent job of changing approaches when something seemed unproductive.  I did not make maximum use of the early observation that the behavior was different on 32-bit; instead this caused me to investigate the gdb source, where it translates between 64- and 32-bit siginfo_t structs, which was a false lead, but at least I didn't get stuck on it.  I googled early and often.  I should probably have resorted to strace earlier; it's a very strong tool.  Arguably I should have known beforehand that this was a 48-bit issue, but this is how you learn.