04 September 2017

Reading / Links, 4 Sept 17

Stuff I've been reading in the last week or two:

Network Science, chapters 3 and 4. Pretty funny; he throws some shade on Erdos and Strogatz.  The editing / proofreading continues to disappoint, but the material is decent.  The main thing I want out of this book is an understanding of cascade failures (which he claims to have a good model for in the introduction); a graph theory refresher doesn't hurt though.

The Mind Illuminated, Chapter 4, Interlude 4, beginning of Chapter 5.  Interlude 4 was very interesting - consciousness is quantized and cut up into frames, like network packets, and dullness is packets dropping.  I really wish he'd include footnote references for the science behind this stuff, given that he's a neuro guy...  Given chapters 3, 4, and 5, it seems like I'm somewhere in late phase 3 or early phase 4 (modulo the fact that my practice is still irregular).

The Systems Bible.  Has nothing to do with systems programming, except inasmuch as programmers build systems.  Describes ways in which complex systems evolve and dysfunction.  Not at all rigorous, to the point where it doesn't bother to define "system", but it has some parallels with Rao's Gervais Principle in the organizational context (organizations are built with backdoors that allow actual work to get done, and eventually collapse under their own entropy) and with some of Scott's criticisms of high modernism in Seeing Like a State (the designed system opposes its own intended function and scales in unpredictable ways).  Also seems sort of linked to The Dispossessed, with its point about the emergence of effectively-bureaucratic systems under anarchist conditions.

Introduction to the DWARF Debugging Format.  I'm looking for stupid dwarf tricks, and was excited to find that DWARF contains at least two sorts of bytecode for generating tables of debugging information.
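
For my own reference, a minimal sketch of walking one of those bytecode-generated tables - the line-number program - using the pyelftools library (./a.out is a stand-in path for whatever binary you're poking at):

from elftools.elf.elffile import ELFFile

# Dump the DWARF line-number table; each row comes from executing the
# line-program bytecode embedded in .debug_line.
with open('./a.out', 'rb') as f:
    elf = ELFFile(f)
    if not elf.has_dwarf_info():
        raise SystemExit('no DWARF info here')
    dwarf = elf.get_dwarf_info()
    for cu in dwarf.iter_CUs():
        lineprog = dwarf.line_program_for_CU(cu)
        if lineprog is None:
            continue
        for entry in lineprog.get_entries():
            # Entries without state are directives, not table rows.
            if entry.state is not None:
                print(hex(entry.state.address), 'line', entry.state.line)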

Relatedly, Funky File Formats.

Documentation on ptrace, proc, ELF, bpf, more bpf...  There's all kinds of fun stuff in /proc that I didn't know about.
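
As a tiny example of the kind of thing living in /proc, a sketch that inspects the current process's own memory map and open file descriptors (self-inspection, so no ptrace needed):

import os

# /proc/self/maps: one line per memory mapping (address range, perms, backing file).
with open('/proc/self/maps') as f:
    for line in f:
        print(line.rstrip())

# /proc/self/fd: symlinks from fd numbers to whatever they point at.
for fd in os.listdir('/proc/self/fd'):
    try:
        print(fd, '->', os.readlink('/proc/self/fd/' + fd))
    except OSError:
        pass  # the fd from listdir itself may already be gone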

23 August 2017

Paper Notes - Predicting Bugs from History, 2008

Notes on this chapter of what seems to be a book on "software evolution".

"The defects we measure from history can only be mapped to components because they have been fixed."  That seems rather optimistic...

"This mistake manifests itself as an error in some development artefact, be it requirements, specification, or a design document."  This is getting concerningly software-engineering-ish...

Sort of weird that everything is module-scoped.  I guess it makes sense for very large systems, but I'd tend to think of things more on the function / basic-block level.

Also weird that all of their coupling metrics (except class coupling) are for global variables.  Any / all shared state is a potential bug source; global just happens to be the most egregious possible case (except for system-globals, like files on disk and registry entries...).

"Never blindly trust a metric."  Wiser words are rarely written.

Not a whole lot of surprises.  Typical complexity metrics correlate with bugs.  Churn correlates with bugs.  Tricky problem domains correlate with bugs (though I feel like their example of Eclipse compiler internals vs GUI is sort of disingenuous; if the compiler internals are a little broken, Eclipse cannot perform its core function, but if the UI is a little broken, often end users can work around it or just live with it.  So is it a function of the inherent difficulty of the problem domain, or the centrality of that problem domain to the function of the project?).  Buggy dependencies correlate with bugs, but fall off with distance.  Would've been interesting to see the d=4 case for "domino effect in Windows Server 2003".
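
As an aside, a crude version of the churn metric is cheap to compute yourself.  A rough sketch, counting the number of commits that touch each file in a git repository (the paper's churn measures are more refined than raw touch counts):

import subprocess
from collections import Counter

# One block of file names per commit; blank lines separate commits.
log = subprocess.run(['git', 'log', '--name-only', '--pretty=format:'],
                     capture_output=True, text=True, check=True).stdout

churn = Counter(line for line in log.splitlines() if line.strip())
for path, count in churn.most_common(20):
    print(count, path)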

Sort of bummed that their references weren't included.

Potential follow-ups:

20 August 2017

Paper Notes - Valgrind 2007

I've been doing some work with Valgrind recently, and the suggested way to get a big-picture understanding of how Valgrind works was to read this paper. Having read it, I think this is a good recommendation.  Some notes, primarily for my own benefit.

Dynamic recompilation seems very similar to my under-informed understanding of QEMU's approach.  Substantially more complex than our hacked-up approach to static binary instrumentation.  Would probably be a lot easier to implement nowadays with LLVM than it was in 2007.  Interesting loading procedure, though it has the same issue that PIN does where it shares an address space with its target (and a target seeking to interfere with analysis will likely be able to).  The dispatcher / scheduler translation execution mechanism is also interesting; doesn't do translation block chaining like QEMU does (we ran into an issue with QEMU's tb-linking a couple weeks ago), but has a very tight "dispatcher" mechanism that checks a cache and executes known / hot translations, with the slower "scheduler" as fallback.  Coming from writing system call models in PIN, the events system sounds pretty great; I wonder how much of Valgrind's syscall models are stealable for use in other dynamic instrumentation frameworks.
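
A toy sketch of that dispatcher-plus-cache shape, purely to illustrate the pattern (this is nothing like Valgrind's actual implementation): the fast path looks the target up in a cache of already-translated blocks, and the slow path translates and caches.

# Toy model: fast "dispatcher" path hits a translation cache, slow
# "scheduler" path does the (expensive) translation and fills the cache.
translation_cache = {}

def translate_block(addr):
    # Stand-in for the expensive instrument-and-translate step.
    return lambda: print('running translated block at', hex(addr))

def dispatch(addr):
    block = translation_cache.get(addr)   # fast path
    if block is None:
        block = translate_block(addr)     # slow path
        translation_cache[addr] = block
    block()

dispatch(0x400000)
dispatch(0x400000)  # second call is a cache hit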

Follow-up topics I should read more on:

13 August 2017

Two Technical Contradictions

Observed two "technical contradictions" in the style of TRIZ at work the other day:

We want to show the user all of this data, but to most users it won't be useful and we don't have the space on the webpage.

We want the performance benefits that this unsafe optimization gives us, but we observe that this unsafe optimization is causing a huge amount of incorrect behavior (much greater than we expected when we enabled it).

The first was resolved by proposing to pick out the important bits of information for the user, bring those to attention on the main page, and then make details available on request.

The performance / correctness tradeoff was temporarily resolved with preference to correctness, but has not yet been fully resolved because we do not understand the root cause of the incorrectness caused by the optimization (and this flows into TRIZ's root cause analysis procedure).

But it's curious that I noticed these contradictions in terms of TRIZ, especially both on the same day, and having not read or thought about TRIZ in some months.  This is particularly curious because I've been reading up on John Gall and systemantics, which suggest that TRIZ (as a system) is precisely not what you want if you want to get results (which agrees with general intuition - the US outperformed the Soviet Union at research with a much less structured approach, though there are an infinitude of confounding factors).

02 August 2017

Debugging a Weird GDB Misbehavior

I was debugging a crashing program with gdb last week, and I observed a very strange behavior.  I was using gdb to print the faulting address following a segfault, and it was getting it wrong sometimes.  I determined from the registers and the faulting instruction that the faulting address was something like 0xce4a414100000000, but gdb was reporting the faulting address as 0x0.  WTF?

So I started playing with it.  I observed that this failure mode did not occur when I threw the same crash against a 32-bit version of the target; it still crashed, but gdb reported the faulting address correctly.  Weird.

So I checked out a copy of the gdb source tree and started poking around.  I was under the impression that there was a way to get the faulting address of a segfault from the kernel inside a segfault handler, and googling revealed that it's available in a siginfo_t struct.  Reading some man pages suggested that a ptracing program should be able to get such information about a fault in the program it's tracing using PTRACE_GETSIGINFO, so I looked for all the places that gdb was using this option to ptrace.  I found some code that seemed to be translating 64-bit siginfo_t structs into 32-bit siginfo_t structs, which seemed like a likely candidate.  There were only one or two places where GETSIGINFO was used, so I added some debugging printfs, compiled gdb, and...  compilation failed, I forget why exactly.  I chalked this up as a hazard of using the bleeding-edge source version; I downloaded a source release tarball, added my printfs again, and got it to compile, but it didn't seem to be hitting the debug prints around the PTRACE_GETSIGINFO calls, and it complained about not having its python installation in the correct place, so I was somewhat suspicious of its correct operation.  I did confirm that I still got the wrong address even in the freshest version of gdb, though.

At this point, after a morning of debugging and poking around in the gdb source, I told my boss that it was going to take longer to get to the bottom of this rat-hole than I expected, and tabled the gdb issue to investigate later.

The weekend rolled around, and I decided that if printf debugging wasn't going to cut it, I should use gdb to debug gdb.  Some googling indicated that this was a thing that people do, and suggested that I use gdbserver for it.  So I fired up gdbserver running gdb running my crashing program, then fired up gdb and used target remote to connect to the gdbserver, hit run...  and my ssh session to my work machine died. Some pinging around the work network revealed that my host was down. My suspicions of a kernel panic were confirmed on Monday morning; nobody else was in the office to see and reboot it.

So I was left to debug this thing on my own machine.  I checked out the crashing project, built it, ran it in gdb, and observed the same failure mode, the incorrect faulting address.  Deciding that gdb-on-gdb action was just too hot, I decided to give strace a shot.  stracing gdb revealed that ptrace(PTRACE_GETSIGINFO, ...) was returning a faulting address of 0...  from the kernel!  So this wasn't a gdb bug at all, but a weird kernel behavior.  Along with this weird faulting address, ptrace's siginfo_t struct also had a weird si_code value of SI_KERNEL.  Running the same gdb command to get the faulting address under strace on some other crashing programs, si_code was usually SEGV_MAPERR.

Some googling later I found this stackoverflow answer.  The relevant part is:
A segmentation violation that occurs as a result of userspace process accessing virtual memory above the TASK_SIZE limit will cause a segmentation violation with an si_code of SI_KERNEL. In other words, the TASK_SIZE limit is the highest virtual address that any process is allowed to access. This is normally 3GB unless the kernel is configured for high memory support. The area above the TASK_SIZE limit is referred to as the "kernel segment".
And indeed, the address that I was faulting on was above the TASK_SIZE limit.  But what I found odd about this whole thing was that it wasn't even really a kernel address; looking at this description of 48-bit memory layout, my faulting address fell into the noncanonical zone.

So anyway, the moral of this story: if, on 64-bit linux, gdb is telling you that the faulting address of a segfault is 0, it might be lying, and the address might just be in the noncanonical region.
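
If you want to sanity-check an address yourself, the 48-bit canonical rule is simple: bits 63 through 48 must all be copies of bit 47.  A quick sketch:

# Canonical 48-bit x86-64 address check: the top 17 bits (63..47)
# must be either all zeros or all ones.
def is_canonical_48(addr):
    top = addr >> 47
    return top == 0 or top == 0x1FFFF

print(is_canonical_48(0xce4a414100000000))  # False - noncanonical (my faulting address)
print(is_canonical_48(0x00007fffffffe000))  # True - ordinary userspace address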

And that's what I did this Saturday.

Analysis / post-mortem:

I don't think this was a terrible performance on my part.  Total time elapsed was something like four hours, and some of that was spent compiling gdb variants.  I did a decent job of changing approaches when something seemed unproductive.  I did not make maximum use of the early observation that the behavior was different on 32-bit; instead this caused me to investigate the gdb source, where it translates between 64- and 32-bit siginfo_t structs, which was a false lead, but at least I didn't get stuck on it.  I googled early and often.  I should probably have resorted to strace earlier; it's a very strong tool.  Arguably I should have known beforehand that this was a 48-bit issue, but this is how you learn.

28 June 2017

Google and Grothendieck

There is a piece of open source software that we use occasionally.  Its primary author is a single Google employee, whose work on it is (as far as we can tell) a large part of his employment.  I was reading part of the source today, and remarked that it was surprisingly bad code stylistically - enormous functions, enormous files, many global variables, and so forth - but tremendously functional.  A coworker replied:

You can really tell who at Google is good enough that they're left alone to do their own thing.

And it reminded me of this, something that Alexander Grothendieck supposedly said:

In those critical years I learned how to be alone. [But even] this formulation doesn't really capture my meaning. I didn't, in any literal sense learn to be alone, for the simple reason that this knowledge had never been unlearned during my childhood. It is a basic capacity in all of us from the day of our birth. However these three years of work in isolation [1945–1948], when I was thrown onto my own resources, following guidelines which I myself had spontaneously invented, instilled in me a strong degree of confidence, unassuming yet enduring, in my ability to do mathematics, which owes nothing to any consensus or to the fashions which pass as law....By this I mean to say: to reach out in my own way to the things I wished to learn, rather than relying on the notions of the consensus, overt or tacit, coming from a more or less extended clan of which I found myself a member, or which for any other reason laid claim to be taken as an authority. This silent consensus had informed me, both at the lycĂ©e and at the university, that one shouldn't bother worrying about what was really meant when using a term like "volume," which was "obviously self-evident," "generally known," "unproblematic," etc....It is in this gesture of "going beyond," to be something in oneself rather than the pawn of a consensus, the refusal to stay within a rigid circle that others have drawn around one—it is in this solitary act that one finds true creativity. All others things follow as a matter of course.


Since then I've had the chance, in the world of mathematics that bid me welcome, to meet quite a number of people, both among my "elders" and among young people in my general age group, who were much more brilliant, much more "gifted" than I was. I admired the facility with which they picked up, as if at play, new ideas, juggling them as if familiar with them from the cradle—while for myself I felt clumsy, even oafish, wandering painfully up an arduous track, like a dumb ox faced with an amorphous mountain of things that I had to learn (so I was assured), things I felt incapable of understanding the essentials or following through to the end. Indeed, there was little about me that identified the kind of bright student who wins at prestigious competitions or assimilates, almost by sleight of hand, the most forbidding subjects.

In fact, most of these comrades who I gauged to be more brilliant than I have gone on to become distinguished mathematicians. Still, from the perspective of thirty or thirty-five years, I can state that their imprint upon the mathematics of our time has not been very profound. They've all done things, often beautiful things, in a context that was already set out before them, which they had no inclination to disturb. Without being aware of it, they've remained prisoners of those invisible and despotic circles which delimit the universe of a certain milieu in a given era. To have broken these bounds they would have had to rediscover in themselves that capability which was their birthright, as it was mine: the capacity to be alone.
And it lines up - I'm pretty sure this developer of ours wrote his magnum opus in solitude first, and then was hired by Google to develop it after it had shaken up the field.  No concern for "best practices", no design by committee, just wrestling with the hard problems and solving them by whatever means necessary, in the time required to do it correctly and efficiently, releasing when it's good and ready.

O!  To program like that!

...  Well, what're you doing this weekend?

(I am in turn reminded of something Hamming said in "You and Your Research": you'll get the resources to do the job after you've proven you can do it without them, on your own time)

25 June 2017

Linkpost, 19-25 June 2017

Some things I read this week:

News: 

Give the FSB your source code, they said.  It'll be fun, they said.

Bulletin of the Atomic Scientists analyzes feasibility of North Korean chemical bombardment of Seoul - a highly improbable scenario, but an interesting (if pessimistic) analysis nonetheless.

Blogs / Culture War:

SSC: To understand polarization, understand conservatism's failures

Samzdat: The meridian of her greatness - sounds to me like Polanyi had an accurate view of the world (but coming from reading a bunch of James C. Scott in the last year, I would say that).  Reminds me somewhat of this.

SSC: Against murderism

David Brin: The Jefferson Rifle - came up at work because a coworker claimed that no compromise on gun control was possible.  Which may be correct, but part of his argument was an unavailability heuristic - he had never even heard of a good-faith proposal for compromise (granted: young, very work-focused engineer).

David Brin: A Time for Colonels, Part 3 - I think he takes Lakoff entirely too seriously.  Might work in the short term, but I suspect there's a good reason for that norm that even retired officers mostly stay out of tribal politics.  Potential "guilt by association" backfire failure mode of "the officer corps is now publicly aligned with the Blue Tribe, ergo the officer corps is no longer to be trusted."

Speeches / Lectures:

Alan Kay: [pdf warning] The Power of the Context

Marvin Minsky: Turing Award address - a bit dated, but a novel perspective on education:
– To help people learn is to help them build, in their heads, various kinds of computational models.
– This can best be done by a teacher who has, in his head, a reasonable model of what is in the pupil's head.
– For the same reason the student, when debugging his own models and procedures, should have a model of what he is doing, and must know good debugging techniques, such as how to formulate simple but critical test cases.
– It will help the student to know something about computational models and programming. The idea of debugging itself, for example, is a very powerful concept - in contrast to the helplessness promoted by our cultural heritage about gifts, talents, and aptitudes. The latter encourages "I'm not good at this" instead of "How can I make myself better at it?"
...
The child needs models: to understand the city he may use the organism model: it must eat, breathe, excrete, defend itself, etc. Not a very good model, but useful enough. The metabolism of a real organism he can understand, in turn, by comparison with an engine. But to model his own self he cannot use the engine or the organism or the city or the telephone switchboard; nothing will serve at all but the computer with its programs and their bugs. Eventually, programming itself will become more important even than mathematics in early education.
Richard Hamming: n-dimensional spaces - impressively fast derivations, and man my calc is rusty.  Not my favorite Hamming lecture.  Interesting notes on testing at the very end.

Tensorflow without a PhD

Books:

Not a great week for books.  Read a little of Generatingfunctionology after an epiphany in the shower, started Barabasi's Network Science, stalled on The Strategy of Technology and The Mind Illuminated.  

09 June 2017

Vignettes from a "Tech Happy Hour"

Demographics: ~30 attendees total, relatively large fraction of non-technical folks (management, marketing, MBA students, ...).  Almost all white, Indian, or Middle Eastern; only two Asians (one of whom was definitely nontechnical) and one black dude teaching himself to program, but not sure what language to use.  Surprisingly high fraction of women, maybe 20%, including at least one female engineer.  Also a relatively senior crowd for startuppy software; looked like mostly late-20s / early-30s, with one or two late-30s or early-40s.  Free beer and awkward swag tshirts provided by a beer company representative.

Inebriated man who works in the oil and gas industry is looking for someone to build a website to track company finances, because the finance thing they currently use was written by a friend of the founder, isn't very good, and feels like borderline corruption.  Nobody's interested.

Javascript developer wants to learn Haskell.

Woman: "Why are these fries so delicious?"
Man: "Salt.  Also a reasonable level of Maillard browning."
Woman: "..."
Man: "What?"

Lawyer-in-training tells startup people that lawyers are most clutch at the beginning, and by the time the money is coming in it's already too late.  Attendees immediately and in parallel posit "legal debt" analogous to technical debt.

Man is working on a website to get beer delivered to your house despite $STATE's arcane alcohol laws.  Most of the work he does is talking to lawyers; he's outsourced all of his development to India, because he "doesn't have $50k to drop on the project, you know?"

Sad, quiet Indian man has been working on industrial control systems for six years, looking for new job; declines free beer because "I've been drinking a lot lately."

"So what do you do?"
"I'm a developer for a security company."
"But like...  is that all?  Do you have a side hustle?  Do you invest?"
"Not really; they keep me pretty busy."
"Ah, don't give me that.  We're all slaves here.  What are you doing for your freedom?"
"I have no hope for freedom; I am planning to work until I die."
"Strong attitude, but augh!  With a little change of direction, you could be working for yourself.  Gotta look out for #1.  That's you!  You're #1!"

UX designer laments the difficulty of finding remote work, speculates that IBM's recent "move or you're fired" termination of remote-work policy was actually just an excuse for staffing cuts without severance.

Man is working on a system for providing free Subway sandwiches and gift cards and things to people who volunteer for charitable causes.  But is it really volunteering if you're getting stuff for it?  Seems to me like it's low-cost feel-good advertising for Subway.

"I hope I won't offend any of you, but you know Brietbart, right?  All they do is take other peoples' content, slap a caption and a paragraph of text on it, and republish it.  It's super low-cost, and that's part of why they can put out the volume that they do.  And it's super-effective.  Sure, they have "writers", but they don't really write, you know?"

A man is working on a system to add a feedback form into wifi captive portals at hotels and restaurants, so that owners can get feedback and fix issues before they turn into negative reviews, each of which "costs a restaurant 30 customers".  His company is at a local startup incubator, making him a popular fellow.

A singularitarian works for a startup that makes house-calls via phone camera.  He's convinced strong/general AI is coming in the next decade, and talks about Bostrom's Superintelligence, Calico's life extension work, and China's use of CRISPR on humans.  Missed my chance to ask him if he reads LessWrong.

A designer talks about the time her startup found a dead rat in their coworking space five minutes before a big client meeting, so she had to move a couch to cover it.  Symbolic of the whole startup experience, really.


Overall I found the whole thing darkly comedic.  I recall reading an incisive observation of a tech conference once, that "everyone is selling new ways of selling to each other", and it held some recognizable truth here.

07 February 2017

Debugging Journal 2: splitext, copypasta, failure to orient

Recently I wrote, and then fixed, a very stupid bug.  For the record, this journal is not the sum total of my debugging, only episodes that bear some examination, which typically means I did something dumb (in other words: I'm not like this all the time, I swear.  I'd be fired if I were).

Fatigue state: tired (4.5 hours of solid sleep and 1.5 of fade-in-fade-out, late afternoon, having some difficulty understanding coworkers when they were talking), hungry
Emotional state: slightly up (working on getting a thing finally merged after a long time, slight fatigue/hunger mania)

As noted, I was making edits during code-review on a branch that I've been developing on for a couple months.  In a piece of code that drops xinetd service config files, I had a line of code that looked something like this:

# strips ".xinetd" suffix from service file name
service_name = '.'.join(os.path.basename(service_path).split('.')[:-1])

My reviewing coworker suggested that I use os.path.splitext (I'm not generally one for library functions that I could write myself, especially when they have lousy names).  He pasted to me in the merge request comments:

os.path.splitext("path_to_file")[0]

In my not-proudest moment, I pasted that line over the '.'.join...  code I'd been using previously and fired up a test run of our whole system on AWS.  Better to trust my coworkers too much than too little, I suppose?

Predictably, it did not operate as intended, and now we get to the debugging part.

I SSHed into the VM and noted that none of the services xinetd was supposed to be running were running, despite logs noting that they deployed correctly.  Checking the xinetd/conf.d folder, I saw six entries, none of which had the names of the services that should've been deployed.  I could tell that something was wrong, though, because there were usually five default service config files.  And indeed, one of those six files was named "path_to_file".  But I didn't catch it; I just knew that my services weren't up, and they weren't in the conf.d directory, and I was left with a feeling that something was off.

So I dug into the code for deploying services and ran it at various debug levels and eventually realized that the whole basename operation had gone missing and we were writing to path_to_file.  Notably, despite editing the log levels of printing operations in the same function as the offending line, I did not scrutinize the line itself until the debug output made it very obvious that that was where things were going wrong.
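
The fixed line, keeping the basename call that the paste had clobbered, ended up being something like:

# strips the extension from the service file name, as originally intended
service_name = os.path.splitext(os.path.basename(service_path))[0]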

Total time elapsed, probably around an hour between breaking it and fixing it.  Ouch.

Post-mortem:

Not my finest work, by any means.  The main thing about this that strikes me is that I saw the "path_to_file" in conf.d, but failed to make sense of it.  To use OODA-loop terminology, I observed, but did not orient on that information / make it part of my mental model of what had gone wrong.  If last time I drew a conclusion too strong for the available data, this time I had data and ignored it in hypothesis formation.  "Observed but failed to orient" is also somewhat true of the paste that created the bug in the first place.  This is what you get for programming in Kahneman System 1.

I also failed to take the bug apart temporally.  Having missed the conf.d/path_to_file indicator, had I then started re-reading recently changed code, I would've caught it faster than the deploy-read-deploy cycle I ended up in.

Other morals: implement a "trust but verify" policy towards coworker code, get more sleep.  Maybe I should try to shift my debugging to first thing in the morning when possible; this lets me sleep on the bug and historically has been very effective.  Both this bug and the previous sub-par journaled bug were late in the afternoon, around 4PM.  Two data points is hardly a trend, but is more than no data points.

19 January 2017

Debugging Journal 1: PAM, chattr, and docker

For a year or two, I've been considering starting a debugging journal in order to reflect on my experiences and improve my capabilities through examination.  This may or may not be interesting to anyone else, and is probably incomprehensible without some unix background.  But here it begins anyway.

Today I fixed a (pretty easy) bug.

Fatigue state: tired (poor sleep previous two days, late afternoon, had just given a demo to a customer)
Emotional state: up (demo went well, happy)

We host a shell server which runs a shellinabox daemon so that users of our website can log into it, in order to learn to use the unix shell.  To let users log in to the shell server with the same credentials as they use on the website, we have a custom PAM module which asks for credentials and hits an API on the website to validate the user.  The PAM module also handles creating unix accounts for these users.

This user-creation component of the PAM module seemed to be malfunctioning; it created the users, but was generating the wrong output to the shellinabox frontend on the site.  A coworker who had been working in this area told me that it was a very simple known issue, a mismatch between what the authorization server was returning and what the PAM client was expecting.  I was pretty sure I had fixed that mismatch issue in November, but there was certainly something going wrong, so I dug in.

I was able to reproduce the bug immediately, and it happened consistently.  Some preliminary investigation by taking the client apart and calling individual functions showed that this assumed cause was incorrect; all functions interacting with the website's API appeared to be operating correctly across a number of test users.

The code for the main function of the PAM module looked something like this in pseudopython:

if username doesn't exist on the website:
  return PAM auth unsure, fall back to another PAM module
try:
  verify that username is in unix passwd file
  prompt user for password
  validate password for user against website
  return PAM auth success, user logged in
except user not in passwd file:
  try: 
    create unix user on shell server
    copy some default rcfiles into user's home directory
    print a welcome message, ask user to refresh page and login
    return PAM auth denied / no login
  except any error:
    return PAM auth unsure, fall back to another PAM module

Here I made a minor error: while editing this function in order to generate more output to the shell, I fat-fingered a syntax error near the beginning of the file, which caused the PAM module to fail immediately and always fall back.  I'm still not sure how I did this (I opened the file in vim and hit shift-G to go to the end, but ended up also capitalizing an i in "import"), but I quickly realized that I had erred tremendously and restored the file to its original state.  I abandoned the output-based approach in favor of reading the code and thinking about it.

On further inspection, the behavior I was seeing was consistent with falling back to another PAM module.  I also knew that the user accounts were being created.  This meant that an exception had to be happening in the second try block, after creating the unix user, which led to the fallback.  I dug into the function that copied the rcfiles.

This function had the following structure:

create an empty bash_history file in the user's home directory
set permissions on it
set its filesystem attributes to append-only
copy a default bashrc file from /opt/ to the user's home directory
set permissions on it
set filesystem attributes to append-only
copy a default bash profile from /opt/ to the user's home directory
set perms
set filesystem attributes to append-only
set some general permissions on the user's home directory

Besides the fact that it might've been prettier as a loop, there didn't seem to be anything immediately objectionable about this function.  Examining a test user's home directory, I observed that the history file had been created and had the correct permissions set on it, but the bashrc and profile had more permissive permissions than expected.  Hashing the user's bashrc and the bashrc in /opt/ revealed that they differed.  I concluded that we must have failed to copy the bashrc.  This was only half-correct.

The bashrc in /opt/ had permissions 544, so we should've been able to read it, and indeed I could.  I wasn't totally sure which user the PAM module was running as, but the PAM user had had permissions to create unix users, and checking the sudoers file I observed no users besides root with that power.  At this point I got confused and asked the coworker familiar with this area, who was returning from a meeting, what he made of it.  He asked if we might've set filesystem attributes on the user's home directory before copying the bashrc, which could conceivably have stopped even root.  We hadn't, but it got us looking at the chattr call on the bash_history as a possible suspect.  While attempting to list the filesystem attributes on bash_history with lsattr, we discovered that the system we were running in did not support attributes; therefore the call to change attributes had failed.  This made sense; we had recently shoved the codebase into docker containers, and docker apparently does not support filesystem attributes.

Removing all calls to change filesystem attributes caused the system to function as intended, albeit with slightly reduced security against malicious users of the shell machine.  Total time from beginning of investigation to pushing fix was about 45 minutes.  I'd like to say that that's not terrible for starting from an incorrect premise and then getting confused while debugging a piece of code that I hadn't looked at in two months, but frankly I don't have the data to back that up...  yet.  Another purpose of this journal, I suppose.
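
A more defensive variant (hypothetical, not what we actually shipped) would have been to tolerate filesystems that don't support attributes rather than dropping the calls entirely; something like:

import subprocess

# Hypothetical: try to mark the file append-only, but don't blow up the whole
# account-creation path on filesystems (e.g. some docker storage drivers)
# where attributes aren't supported.
def try_chattr_append_only(path):
    result = subprocess.run(['chattr', '+a', path], capture_output=True, text=True)
    if result.returncode != 0:
        print('warning: chattr +a failed on', path, '-', result.stderr.strip())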

I committed two logical errors:

The first was ignoring the chattr calls.  I did this because I did not understand them, and presumed them to be noise and unimportant.  This is a bad habit picked up from reverse-engineering large assembly codebases, where most of what goes on is just noise.  Moral: when a thing is not understood, investigate it more thoroughly, not less.  Code is guilty until proven innocent.  Seems rather obvious when you write it down, but that's the way of things.

The second was fixating on the copy.  I drew a too-general conclusion (that the copy was the source of the failure) from the available data (that the copy had not executed).

A third possible error was not considering sooner that the problem was docker-related.  We've had the code in docker for a couple of months now though, so that seemed like a pretty stale bug-cause (in fact this failure-mode had been hidden behind other failures that were only recently resolved).

15 January 2017

Why Cognitive Biases?

An old friend of mine expressed confusion and dismay recently, that the rationalist movement is as concerned with cognitive biases as it is.  Given that being aware of biases does not significantly reduce susceptibility to them, he believes identifying useful heuristics might be a better approach.

I have a heuristic that I like: "If something seems totally unfit for its purpose, you have probably misidentified that purpose."

Two examples: The function of a typical "dysfunctional" civil service bureaucracy is not to complete any nominal mission on behalf of the public, but to provide stable, low-risk jobs and petty administrative fiefdoms for those inside it; to fulfill the material and psychological needs of the people making up the organization.  It performs this task admirably.  The function of the public school system is not to provide learning of any useful material; its primary function is instead to grind down and separate those who are docile and trainable from those who are not, for use as raw material by industry and the state.

I work in computer security.  I like doing offense; you only have to be right once, while defense has to be right every time.  My impression of Thinking Fast and Slow was "This book is a weapon.  No, an armory."

The function of studying biases is not to make our reasoning clearer.  The function of studying biases is to win arguments, make sales, and otherwise influence the behavior of others.  It serves the social exercise of the will to power, not the pursuit of truth (this is true of philosophy generally.  Socrates was a glorified mugger armed with the cudgel of reason).  In the weakest case, the study of biases provides a feeling of superiority over others whom you witness falling victim to biases.  I am a cynic about human nature, and the sort of people who end up at LessWrong are, by my impression, not especially well-adjusted, high-status, or spectacularly successful people (nor am I).  Understanding cognitive biases, identifying them in others, and feeling superior serves a psychological function.  I really don't think there's much more to it than that.

04 January 2017

Experimental Computer Science

I studied computer science in college, and was struck by the lack of science in the discipline.  Computer science as a field is fundamentally applied mathematics in the style of theoretical physics.  Software engineering, the other side of the coin, is as superstitious as theoretical computer science is formal.  Given the long time-periods and budgets required to construct large software projects, it is little surprise that software engineering is still largely imitative in character ("Well Google did it this way...").  We cannot afford to conduct worthwhile experiments, and the art suffers as a result.

A senior colleague at my first internship was so kind as to reveal to me the mundane nature of experimental computer science, however.  I had encountered a bug in my code, and was frustrated.  He came over, sat down on his trademark inflatable exercise ball, asked me what my hypothesis was, and started bouncing beatifically.  And so I learned that lowly debugging was the experimental computer science that I had long sought.  You consider your known facts.  You formulate a set of hypotheses about what might have happened consistent with those facts.  You find a way to test your hypotheses and gather more facts.  Repeat until phenomenon is explained, mystery solved.

Engineering builds artifacts using facts; experimental science builds facts using artifacts.  Debugging is most certainly in the latter category.

In the years since, debugging has come to be probably my favorite part of my job, and in the style of Lakoff's Metaphors We Live By, I've picked up a couple more perspectives on it.

The professor of my operating systems class once said: "Debugging is telling two stories.  One is the story of what you wanted the computer to do, and the other is the story of what the computer did.  Where they diverge, there is your bug."  This narrative view is a very temporal way to think about debugging, well-suited to stepping through code in a debugger.

A third view that I have used while debugging distributed systems is that of police procedural / forensics.  A symptom of a "crime" appears, an invariant violated.  Careful notes are taken on the evidence; places and times, the frequency if repeated or irregular, any commonalities between multiple events.  A list of "suspect" components is drawn up.  "Means, motive, and opportunity" sort of still holds; components with the permissions and logic to do something like the crime, as well as components which are historically known to be buggy, or which have been changed in a recent commit.  Then you investigate the suspects, entertaining the possibility that they acted alone or in conspiracy with each other.  Fundamentally this differs from the scientific approach in two respects: chunking and anthropomorphization.  Anthropomorphization is a dangerous falsehood to permit oneself, but it works very well for me, perhaps because it lets me leverage some measure of social intelligence in an otherwise asocial situation.  I have had some great successes with this method, in several cases correctly calling complex sequences of race conditions in a cluster within single-digit minutes of receiving a bug report, without looking at any code.

So, three faces of debugging:
  • Science
  • Storytelling
  • Policework
There are, to be sure, more such lenses to be found and played with.  I look forward to it.