19 January 2017

Debugging Journal 1: PAM, chattr, and docker

For a year or two, I've been considering starting a debugging journal in order to reflect on my experiences and improve my capabilities through examination.  This may or may not be interesting to anyone else, and is probably incomprehensible without some unix background.  But here it begins anyway.

Today I fixed a (pretty easy) bug.

Fatigue state: tired (poor sleep previous two days, late afternoon, had just given a demo to a customer)
Emotional state: up (demo went well, happy)

We host a shell server which runs a shellinabox daemon so that users of our website can log into it, in order to learn to use the unix shell.  To let users log in to the shell server with the same credentials as they use on the website, we have a custom PAM module which asks for credentials and hit an API on the website to validate the user.  The PAM module also handles creating unix accounts for these users.

This user-creation component of the PAM module seemed to be malfunctioning; it created the users, but was generating the wrong output to the shellinabox frontend on the site.  A coworker who had been working in this area told me that it was a very simple known issue, a mismatch between what the authorization server was returning and what the PAM client was expecting.  I was pretty sure I had fixed that mismatch issue in November, but there was certainly something going wrong, so I dug in.

I was able to reproduce the bug immediately, and it happened consistently.  Some preliminary investigation by taking the client apart and calling individual functions showed that this assumed cause was incorrect; all functions interacting with the website's API appeared to be operating correctly across a number of test users.

The code for the main function of the PAM module looked something like this in pseudopython:

if username doesn't exist on the website:
  return PAM auth unsure, fall back to another PAM module
  verify that username is in unix passwd file
  prompt user for password
  validate password for user against website
  return PAM auth success, user logged in
except user not in passwd file:
    create unix user on shell server
    copy some default rcfiles into user's home directory
    print a welcome message, ask user to refresh page and login
    return PAM auth denied / no login
  except any error:
    return PAM auth unsure, fall back to another PAM module

Here I made a minor error: while editing this function in order to generate more output to the shell, I fat-fingered a syntax error near the beginning of the file, which caused the PAM module to fail immediately and always fall back.  I'm still not sure how I did this (I opened the file in vim and hit shift-G to go to the end, but ended up also capitalizing an i in "import"), but I quickly realized that I had erred tremendously and restored the file to its original state.  I abandoned the output-based approach for reading code and thinking about it.

On further inspection, the behavior I was seeing was consistent with falling back to another PAM module.  I also knew that the user accounts were being created.  This meant that an exception had to be happening in the second try block, after creating the unix user, which led to the fallback.  I dug into the function that copied the rcfiles.

This function had the following structure:

create an empty bash_history file to the user's home directory
set permissions on it
set its filesystem attributes to append-only
copy a default bashrc file from /opt/ to the user's home directory
set permissions on it
set filesystem attributes to append-only
copy a default bash profile from /opt/ to the user's home directory
set perms
set filesystem attributes to append-only
set some general permissions on the user's home directory

Besides the fact that it might've been prettier as a loop, there didn't seem anything immediately objectionable about this function.  Examining a test user's home directory, I observed that the history file had been created and had the correct permissions set on it, but the bashrc and profile had more permissive permissions than expected.  Hashing the user's bashrc and the bashrc in /opt/ revealed that they differed.  I concluded that we must have failed to copy the bashrc.  This was only half-correct.

The bashrc in /opt/ had permissions 544, so we should've been able to read it, and indeed I could.  I wasn't totally sure which user the PAM module was running as, but the PAM user had had permissions to create unix users, and checking the sudoers file I observed no users besides root with that power.  At this point I got confused and asked the coworker familiar with this area, who was returning from a meeting, what he made of it.  He asked if we might've set filesystem attributes on the user's home directory before copying the bashrc, which could concievably stopped even root.  We hadn't, but it got us looking at the chattr call on the bash_history as a possible suspect.  While attempting to list the filesystem attributes on bash_history with lsattr, we discovered that the system we were running in did not support attributes; therefore the call to change attributes had failed.  This made sense; we had recently shoved the codebase into docker containers, and docker apparently does not support filesystem attributes.

Removing all calls to change filesystem attributes caused the system to function as intended, albeit with slightly reduced security against malicious users of the shell machine.  Total time from beginning of investigation to pushing fix was about 45 minutes.  I'd like to say that that's not terrible for starting from an incorrect premise and then getting confused while debugging a piece of code that I hadn't looked at in two months, but frankly I don't have the data to back that up...  yet.  Another purpose of this journal, I suppose.

I committed two logical errors:

The first was ignoring the chattr calls.  I did this because I did not understand them, and presumed them to be noise and unimportant.  This is a bad habit picked up from reverse-engineering large assembly codebases, where most of what goes on is just noise.  Moral: when a thing is not understood, investigate it more thoroughly, not less.  Code is guilty until proven innocent.  Seems rather obvious when you write it down, but that's the way of things.

The second was fixating on the copy.  I drew a too-general conclusion, that the copy was the source of the failure, from the available data, that the copy had not executed.

A third possible error was not considering sooner that the problem was docker-related.  We've had the code in docker for a couple of months now though, so that seemed like a pretty stale bug-cause (in fact this failure-mode had been hidden behind other failures that were only recently resolved).

No comments:

Post a Comment