07 February 2017

Debugging Journal 2: splitext, copypasta, failure to orient

Recently I wrote, and then fixed, a very stupid bug.  For the record, this journal is not the sum total of my debugging, only episodes that bear some examination, which typically means I did something dumb (in other words: I'm not like this all the time, I swear.  I'd be fired if I were).

Fatigue state: tired (4.5 hours of solid sleep and 1.5 of fade-in-fade-out, late afternoon, having some difficulty understanding coworkers when they were talking), hungry
Emotional state: slightly up (working on getting a thing finally merged after a long time, slight fatigue/hunger mania)

As noted, I was making edits during code-review on a branch that I've been developing on for a couple months.  In a piece of code that drops xinetd service config files, I had a line of code that looked something like this:

# strips ".xinetd" suffix from service file name
service_name = '.'.join(os.path.basename(service_path).split('.')[:-1])

My reviewing coworker suggested that I use os.path.splitext (I'm not generally one for library functions that I could write myself, especially when they have lousy names).  He pasted to me in the merge request comments:

os.path.splitext("path_to_file")[0]

In my not-proudest moment, I pasted that line over the '.'.join...  code I'd been using previously and fired up a test run of our whole system on AWS.  Better to trust my coworkers too much than too little, I suppose?

Predictably, it did not operate as intended, and now we get to the debugging part.

I SSHed into the VM and noted that none of the services xinetd was supposed to be running were running, despite logs noting that they deployed correctly.  Checking the xinetd/conf.d folder, I saw six entries, none of which had the names of the services that should've been deployed.  I could tell that something was wrong, though, because there were usually five default service config files.  And indeed, one of those six files was named "path_to_file".  But I didn't catch it; I just knew that my services weren't up, and they weren't in the conf.d directory, and left with a feeling that something was off.

So I dug into the code for deploying services and ran it at various debug levels and eventually realized that the whole basename operation had gone missing and we were writing to path_to_file.  Notably, despite editing the log levels of printing operations in the same function as the offending line, I did not scrutinize the line itself until the debug output made it very obvious that that was where things were going wrong.

Total time elapsed, probably around an hour between breaking it and fixing it.  Ouch.

Post-mortem:

Not my finest work, by any means.  The main thing about this that strikes me is that I saw the "path_to_file" in conf.d, but failed to make sense of it.  To use OODA-loop terminology, I observed, but did not orient on that information / make it part of my mental model of what had gone wrong.  If last time I drew a conclusion too strong for the available data, this time I had data and ignored it in hypothesis formation.  "Observed but failed to orient" is also somewhat true of the paste that created the bug in the first place.  This is what you get for programming in Kahneman System 1.

I also failed to take the bug apart temporally.  Having missed the conf.d/path_to_file indicator, had I then started re-reading recently changed code, I would've caught it faster than the deploy-read-deploy cycle I ended up in.

Other morals: implement "trust but verify" policy towards coworker code, get more sleep.  Maybe I should try to shift my debugging to first thing in the morning when possible; this lets me sleep on the bug and historically has been very effective.  Both this bug and previous sub-par journaled bug were late in the afternoon, around 4PM.  Two data points is hardly a trend, but is more than no data points.