19 January 2017

Debugging Journal 1: PAM, chattr, and docker

For a year or two, I've been considering starting a debugging journal in order to reflect on my experiences and improve my capabilities through examination.  This may or may not be interesting to anyone else, and is probably incomprehensible without some unix background.  But here it begins anyway.

Today I fixed a (pretty easy) bug.

Fatigue state: tired (poor sleep previous two days, late afternoon, had just given a demo to a customer)
Emotional state: up (demo went well, happy)

We host a shell server which runs a shellinabox daemon so that users of our website can log into it, in order to learn to use the unix shell.  To let users log in to the shell server with the same credentials as they use on the website, we have a custom PAM module which asks for credentials and hits an API on the website to validate the user.  The PAM module also handles creating unix accounts for these users.

This user-creation component of the PAM module seemed to be malfunctioning; it created the users, but was generating the wrong output to the shellinabox frontend on the site.  A coworker who had been working in this area told me that it was a very simple known issue, a mismatch between what the authorization server was returning and what the PAM client was expecting.  I was pretty sure I had fixed that mismatch issue in November, but there was certainly something going wrong, so I dug in.

I was able to reproduce the bug immediately, and it happened consistently.  Some preliminary investigation by taking the client apart and calling individual functions showed that this assumed cause was incorrect; all functions interacting with the website's API appeared to be operating correctly across a number of test users.

The code for the main function of the PAM module looked something like this in pseudopython:

if username doesn't exist on the website:
  return PAM auth unsure, fall back to another PAM module
try:
  verify that username is in unix passwd file
  prompt user for password
  validate password for user against website
  return PAM auth success, user logged in
except user not in passwd file:
  try:
    create unix user on shell server
    copy some default rcfiles into user's home directory
    print a welcome message, ask user to refresh page and login
    return PAM auth denied / no login
  except any error:
    return PAM auth unsure, fall back to another PAM module
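
The flow above can be sketched as runnable Python with the PAM stack stubbed out.  All the helper names here (website_has_user, check_passwd_entry, and so on) are hypothetical stand-ins for the real module's internals, and the return-code values are only illustrative; the shape of the nested try/except is the point:

```python
# Illustrative return codes (names as in Linux-PAM; only the names matter here).
PAM_SUCCESS, PAM_AUTH_ERR, PAM_IGNORE = 0, 7, 25

def authenticate(username, password, helpers):
    """Mirror of the module's control flow.  `helpers` is a dict of
    hypothetical callables standing in for the website API and the local
    account machinery; password prompting is elided into a parameter."""
    if not helpers["website_has_user"](username):
        return PAM_IGNORE                        # fall back to another PAM module
    try:
        helpers["check_passwd_entry"](username)  # raises KeyError if absent
        if helpers["validate_password"](username, password):
            return PAM_SUCCESS                   # user logged in
        return PAM_AUTH_ERR
    except KeyError:
        try:
            helpers["create_unix_user"](username)
            helpers["copy_rcfiles"](username)    # <- where the trouble hid
            print("Account created; please refresh the page and log in.")
            return PAM_AUTH_ERR                  # deny this attempt by design
        except Exception:
            return PAM_IGNORE                    # any error: fall back
```

Note that an exception raised anywhere in the second try block, even after the unix user has been created, silently turns into a fallback to another PAM module.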

Here I made a minor error: while editing this function in order to generate more output to the shell, I fat-fingered a syntax error near the beginning of the file, which caused the PAM module to fail immediately and always fall back.  I'm still not sure how I did this (I opened the file in vim and hit shift-G to go to the end, but ended up also capitalizing an i in "import"), but I quickly realized that I had erred tremendously and restored the file to its original state.  I abandoned the output-based approach for reading code and thinking about it.

On further inspection, the behavior I was seeing was consistent with falling back to another PAM module.  I also knew that the user accounts were being created.  This meant that an exception had to be happening in the second try block, after creating the unix user, which led to the fallback.  I dug into the function that copied the rcfiles.

This function had the following structure:

create an empty bash_history file in the user's home directory
set permissions on it
set its filesystem attributes to append-only
copy a default bashrc file from /opt/ to the user's home directory
set permissions on it
set its filesystem attributes to append-only
copy a default bash profile from /opt/ to the user's home directory
set permissions on it
set its filesystem attributes to append-only
set some general permissions on the user's home directory

Besides the fact that it might've been prettier as a loop, there didn't seem to be anything immediately objectionable about this function.  Examining a test user's home directory, I observed that the history file had been created and had the correct permissions set on it, but the bashrc and profile had more permissive permissions than expected.  Hashing the user's bashrc and the bashrc in /opt/ revealed that they differed.  I concluded that we must have failed to copy the bashrc.  This was only half-correct.
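
Written as the loop just imagined, it might look something like this.  Every concrete detail here, the dotfile names, the /opt/ paths, the modes, is an assumption reconstructed from the description rather than the module's actual code:

```python
import os
import shutil
import subprocess

def install_rcfiles(home, uid, gid, defaults_dir="/opt"):
    """Sketch of the rcfile setup as a loop.  Filenames, paths, and
    modes are assumptions, not the real module's values."""
    for name in ("bash_history", "bashrc", "bash_profile"):
        dest = os.path.join(home, "." + name)
        if name == "bash_history":
            open(dest, "w").close()                           # empty history file
        else:
            shutil.copy(os.path.join(defaults_dir, name), dest)
        os.chmod(dest, 0o600)
        os.chown(dest, uid, gid)
        # The append-only attribute.  If chattr fails here on the first
        # iteration, nothing after it, including the copies, ever runs.
        subprocess.run(["chattr", "+a", dest], check=True)
    os.chmod(home, 0o700)
```

In the loop form the failure mode is easier to see: one raised exception in an early iteration leaves every later file uncopied, which is exactly the half of the story the hashes revealed.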

The bashrc in /opt/ had permissions 544, so we should've been able to read it, and indeed I could.  I wasn't totally sure which user the PAM module was running as, but the PAM user had had permission to create unix users, and checking the sudoers file I observed no users besides root with that power.  At this point I got confused and asked the coworker familiar with this area, who was returning from a meeting, what he made of it.  He asked if we might've set filesystem attributes on the user's home directory before copying the bashrc, which could conceivably have stopped even root.  We hadn't, but it got us looking at the chattr call on the bash_history as a possible suspect.  While attempting to list the filesystem attributes on bash_history with lsattr, we discovered that the filesystem we were running on did not support attributes; therefore the call to change attributes had failed.  This made sense; we had recently shoved the codebase into docker containers, and the filesystems docker uses apparently do not support attributes.
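
The same probe can be scripted: lsattr exits nonzero on a filesystem that lacks attribute support, so a guard like the following (a sketch, assuming lsattr from e2fsprogs is on the PATH) would have surfaced the problem before any chattr call:

```python
import subprocess

def supports_file_attributes(path):
    """Return True if `lsattr` can read filesystem attributes at `path`.
    On filesystems without attribute support (as in our docker containers),
    lsattr exits nonzero and this returns False."""
    try:
        result = subprocess.run(
            ["lsattr", "-d", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
    except FileNotFoundError:  # lsattr not installed at all
        return False
```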

Removing all calls to change filesystem attributes caused the system to function as intended, albeit with slightly reduced security against malicious users of the shell machine.  Total time from beginning of investigation to pushing fix was about 45 minutes.  I'd like to say that that's not terrible for starting from an incorrect premise and then getting confused while debugging a piece of code that I hadn't looked at in two months, but frankly I don't have the data to back that up...  yet.  Another purpose of this journal, I suppose.

I committed two logical errors:

The first was ignoring the chattr calls.  I did this because I did not understand them, and presumed them to be noise and unimportant.  This is a bad habit picked up from reverse-engineering large assembly codebases, where most of what goes on is just noise.  Moral: when a thing is not understood, investigate it more thoroughly, not less.  Code is guilty until proven innocent.  Seems rather obvious when you write it down, but that's the way of things.

The second was fixating on the copy.  I drew a too-general conclusion (that the copy was the source of the failure) from the available data (that the copy had not executed).

A third possible error was not considering sooner that the problem was docker-related.  We've had the code in docker for a couple of months now though, so that seemed like a pretty stale bug-cause (in fact this failure-mode had been hidden behind other failures that were only recently resolved).

15 January 2017

Why Cognitive Biases?

An old friend of mine expressed confusion and dismay recently, that the rationalist movement is as concerned with cognitive biases as it is.  Given that being aware of biases does not significantly reduce susceptibility to them, he believes identifying useful heuristics might be a better approach.

I have a heuristic that I like: "If something seems totally unfit for its purpose, you have probably misidentified that purpose."

Two examples: The function of a typical "dysfunctional" civil service bureaucracy is not to complete any nominal mission on behalf of the public, but to provide stable, low-risk jobs and petty administrative fiefdoms for those inside it; to fulfill the material and psychological needs of the people making up the organization.  It performs this task admirably.  The function of the public school system is not to provide learning of any useful material; its primary function is instead to grind down and separate those who are docile and trainable from those who are not, for use as raw material by industry and the state.

I work in computer security.  I like doing offense; you only have to be right once, while defense has to be right every time.  My impression of Thinking Fast and Slow was "This book is a weapon.  No, an armory."

The function of studying biases is not to make our reasoning clearer.  The function of studying biases is to win arguments, make sales, and otherwise influence the behavior of others.  It serves the social exercise of the will to power, not the pursuit of truth (this is true of philosophy generally.  Socrates was a glorified mugger armed with the cudgel of reason).  In the weakest case, the study of biases provides a feeling of superiority over others whom you witness falling victim to biases.  I am a cynic about human nature, and the sort of people who end up at LessWrong are, by my impression, not especially well-adjusted, high-status, or spectacularly successful people (nor am I).  Understanding cognitive biases, identifying them in others, and feeling superior serves a psychological function.  I really don't think there's much more to it than that.

04 January 2017

Experimental Computer Science

I studied computer science in college, and was struck by the lack of science in the discipline.  Computer science as a field is fundamentally applied mathematics in the style of theoretical physics.  Software engineering, the other side of the coin, is as superstitious as theoretical computer science is formal.  Given the long time-periods and budgets required to construct large software projects, it is little surprise that software engineering is still largely imitative in character ("Well Google did it this way...").  We cannot afford to conduct worthwhile experiments, and the art suffers as a result.

A senior colleague at my first internship was so kind as to reveal to me the mundane nature of experimental computer science, however.  I had encountered a bug in my code, and was frustrated.  He came over, sat down on his trademark inflatable exercise ball, asked me what my hypothesis was, and started bouncing beatifically.  And so I learned that lowly debugging was the experimental computer science that I had long sought.  You consider your known facts.  You formulate a set of hypotheses about what might have happened consistent with those facts.  You find a way to test your hypotheses and gather more facts.  Repeat until phenomenon is explained, mystery solved.

Engineering builds artifacts using facts; experimental science builds facts using artifacts.  Debugging is most certainly in the latter category.

In the years since, debugging has come to be probably my favorite part of my job, and in the style of Lakoff's Metaphors We Live By, I've picked up a couple more perspectives on it.

The professor of my operating systems class once said: "Debugging is telling two stories.  One is the story of what you wanted the computer to do, and the other is the story of what the computer did.  Where they diverge, there is your bug."  This narrative view is a very temporal way to think about debugging, well-suited to stepping through code in a debugger.

A third view that I have used while debugging distributed systems is that of police procedural / forensics.  A symptom of a "crime" appears, an invariant violated.  Careful notes are taken on the evidence; places and times, the frequency if repeated or irregular, any commonalities between multiple events.  A list of "suspect" components is drawn up.  "Means, motive, and opportunity" sort of still holds; components with the permissions and logic to do something like the crime, as well as components which are historically known to be buggy, or which have been changed in a recent commit.  Then you investigate the suspects, entertaining the possibility that they acted alone or in conspiracy with each other.  Fundamentally this differs from the scientific approach in two respects: chunking and anthropomorphization.  Anthropomorphization is a dangerous falsehood to permit oneself, but it works very well for me, perhaps because it lets me leverage some measure of social intelligence in an otherwise asocial situation.  I have had some great successes with this method, in several cases correctly calling complex sequences of race conditions in a cluster within single-digit minutes of receiving a bug report, without looking at any code.

So, three faces of debugging:
  • Science
  • Storytelling
  • Policework
There are, to be sure, more such lenses to be found and played with.  I look forward to it.