04 September 2017

Reading / Links, 4 Sept 17

Stuff I've been reading in the last week or two:

Network Science, chapters 3 and 4. Pretty funny; he throws some shade on Erdos and Strogatz.  The editing / proofreading continues to disappoint, but the material is decent.  The main thing I want out of this book is an understanding of cascade failures (which he claims to have a good model for in the introduction); a graph theory refresher doesn't hurt though.

The Mind Illuminated, Chapter 4, Interlude 4, beginning of Chapter 5.  Interlude 4 was very interesting - consciousness is quantized and cut up into frames, like network packets, and dullness is packets dropping.  I really wish he'd include footnote references for the science behind this stuff, given that he's a neuro guy...  Given chapters 3, 4, and 5, it seems like I'm somewhere in late phase 3 or early phase 4 (modulo the fact that my practice is still irregular).

The Systems Bible.  Has nothing to do with systems programming, except inasmuch as programmers build systems.  Describes ways in which complex systems evolve and dysfunction.  Not at all rigorous, to the point where it doesn't bother to define "system", but it has some parallels with Rao's Gervais Principle in the organizational context (organizations are constructed with backdoors that allow actual work to get done, and eventually collapse under their own entropy) and with some of Scott's criticisms of high modernism in Seeing Like a State (the designed system opposes its own intended function and scales in unpredictable ways).  It also seems loosely linked to The Dispossessed, with its point about the emergence of effectively-bureaucratic systems under anarchist conditions.

Introduction to the DWARF Debugging Format.  I'm looking for stupid dwarf tricks, and was excited to find that DWARF contains at least two sorts of bytecode for generating tables of debugging information.
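
For my own reference, the flavor of one of those bytecodes (the line-number program in .debug_line) is roughly the following: a tiny state machine whose only output is an address-to-source-line table.  The opcode numbering and operand encoding in this sketch are simplified inventions of mine - the real format uses LEB128 operands and packed "special" opcodes - but the shape is right.

```c
/* Toy model of a DWARF-style line-number program: bytecode driving a
 * state machine whose side effect is an address -> source-line table.
 * Opcode values and operand sizes here are invented for illustration;
 * real DWARF uses LEB128 operands and packed "special" opcodes. */
#include <stdint.h>
#include <stdio.h>

enum {
    OP_END          = 0,  /* stop interpreting                       */
    OP_ADVANCE_PC   = 1,  /* operand: unsigned bytes to add to addr  */
    OP_ADVANCE_LINE = 2,  /* operand: signed delta applied to line   */
    OP_COPY         = 3,  /* emit a table row from the current state */
};

static void run_line_program(const int8_t *prog, size_t len)
{
    uint64_t addr = 0x400000;  /* assumed start address of the function */
    int line = 1;

    for (size_t i = 0; i < len; ) {
        switch (prog[i++]) {
        case OP_ADVANCE_PC:   addr += (uint8_t)prog[i++]; break;
        case OP_ADVANCE_LINE: line += prog[i++];          break;
        case OP_COPY:
            printf("0x%llx  line %d\n", (unsigned long long)addr, line);
            break;
        case OP_END:
            return;
        }
    }
}

int main(void)
{
    /* A little "program" describing three rows of the line table. */
    const int8_t prog[] = {
        OP_COPY,
        OP_ADVANCE_PC, 16, OP_ADVANCE_LINE, 2,  OP_COPY,
        OP_ADVANCE_PC, 32, OP_ADVANCE_LINE, -1, OP_COPY,
        OP_END,
    };
    run_line_program(prog, sizeof prog);
    return 0;
}
```

The other bytecode I had in mind (DWARF expressions, the DW_OP_* opcodes) has the same flavor, but is a little stack machine for computing locations of variables rather than line-table rows.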

Relatedly, Funky File Formats.

Documentation on ptrace, proc, ELF, bpf, more bpf...  There's all kinds of fun stuff in /proc that I didn't know about.
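
(Note to self on the /proc side: most of it is just text files you can read like any others.  A minimal sketch, using only the standard procfs paths:)

```c
/* Poke at procfs: dump a couple of the standard per-process files for
 * the current process. These are plain text; no special API needed. */
#include <stdio.h>

static void dump(const char *path)
{
    FILE *f = fopen(path, "r");
    char line[512];

    if (!f) {
        perror(path);
        return;
    }
    printf("==== %s ====\n", path);
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);
    fclose(f);
}

int main(void)
{
    dump("/proc/self/maps");    /* address-space layout of this process */
    dump("/proc/self/status");  /* human-readable task state            */
    return 0;
}
```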

23 August 2017

Paper Notes - Predicting Bugs from History, 2008

Notes on this chapter of what seems to be a book on "software evolution".

"The defects we measure from history can only be mapped to components because they have been fixed."  That seems rather optimistic...

"This mistake manifests itself as an error in some development artefact, be it requirements, specification, or a design document."  This is getting concerningly software-engineering-ish...

Sort of weird that everything is module-scoped.  I guess it makes sense for very large systems, but I'd tend to think of things more on the function / basic-block level.

Also weird that all of their coupling metrics (except class coupling) are for global variables.  Any / all shared state is a potential bug source; global just happens to be the most egregious possible case (except for system-globals, like files on disk and registry entries...).

"Never blindly trust a metric."  Wiser words are rarely written.

Not a whole lot of surprises.  Typical complexity metrics correlate with bugs.  Churn correlates with bugs.  Tricky problem domains correlate with bugs (though I feel like their example of Eclipse compiler internals vs GUI is sort of disingenuous; if the compiler internals are a little broken, Eclipse cannot perform its core function, but if the UI is a little broken, end users can often work around it or just live with it.  So is it a function of the inherent difficulty of the problem domain, or of the centrality of that problem domain to the function of the project?).  Buggy dependencies correlate with bugs, but the effect falls off with distance.  Would've been interesting to see the d=4 case for the "domino effect in Windows Server 2003".

Sort of bummed that their references weren't included.

Potential follow-ups:

20 August 2017

Paper Notes - Valgrind 2007

I've been doing some work with Valgrind recently, and the suggested way to get a big-picture understanding of how Valgrind works was to read this paper.  Having read it, I think it was a good recommendation.  Some notes, primarily for my own benefit.

Dynamic recompilation seems very similar to my under-informed understanding of QEMU's approach.  Substantially more complex than our hacked-up approach to static binary instrumentation.  Would probably be a lot easier to implement nowadays with LLVM than it was in 2007.  Interesting loading procedure, though it has the same issue that PIN does where it shares an address space with its target (and a target seeking to interfere with analysis will likely be able to).  The dispatcher / scheduler translation execution mechanism is also interesting; doesn't do translation block chaining like QEMU does (we ran into an issue with QEMU's tb-linking a couple weeks ago), but has a very tight "dispatcher" mechanism that checks a cache and executes known / hot translations, with the slower "scheduler" as fallback.  Coming from writing system call models in PIN, the events system sounds pretty great; I wonder how much of Valgrind's syscall models are stealable for use in other dynamic instrumentation frameworks.
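
To pin down what I mean by that dispatcher / scheduler split, here's a conceptual sketch - emphatically not Valgrind's actual code; the cache layout, the names, and the stubbed-out translator are my own assumptions - of a tight loop that checks a direct-mapped translation cache and only falls back to the slow path on a miss.

```c
/* Conceptual sketch of a dispatcher / scheduler split in a dynamic binary
 * translator. Not Valgrind's real code: the cache layout, names, and the
 * stubbed-out translator are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define CACHE_SLOTS 4096

typedef void (*translation_fn)(void);

struct tt_entry {
    uint64_t       guest_addr;  /* guest address the translation starts at */
    translation_fn code;        /* host code implementing that guest block */
};

static struct tt_entry cache[CACHE_SLOTS];  /* direct-mapped fast cache */

static void dummy_translation(void)
{
    /* Stands in for generated host code. */
}

/* Slow path ("scheduler"): translate the guest block and install it in the
 * fast cache. Stubbed out here; a real translator would emit host code. */
static translation_fn translate_block(uint64_t guest_addr)
{
    printf("scheduler: translating block at 0x%llx\n",
           (unsigned long long)guest_addr);
    return dummy_translation;
}

/* Fast path ("dispatcher"): a tight lookup. On a hit, return the cached
 * translation immediately; on a miss, fall back to the scheduler. */
static translation_fn dispatch(uint64_t guest_addr)
{
    struct tt_entry *e = &cache[(guest_addr >> 2) % CACHE_SLOTS];

    if (e->code && e->guest_addr == guest_addr)
        return e->code;                     /* hot case: cache hit */

    e->guest_addr = guest_addr;             /* miss: translate and install */
    e->code = translate_block(guest_addr);
    return e->code;
}

int main(void)
{
    dispatch(0x400000)();  /* first visit goes through the scheduler */
    dispatch(0x400000)();  /* second visit hits the cache directly   */
    return 0;
}
```

QEMU's tb-chaining, as I understand it, goes one step further: the end of one cached translation gets patched to jump directly into the next, so the hot path skips even this lookup.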

Follow-up topics I should read more on:

13 August 2017

Two Technical Contradictions

Observed two "technical contradictions" in the style of TRIZ at work the other day:

We want to show the user all of this data, but to most users it won't be useful and we don't have the space on the webpage.

We want the performance benefits that this unsafe optimization give us, but we observe that this unsafe optimization is causing a huge amount of incorrect behavior (much greater than we expected when we enabled it).

The first was resolved by proposing to pick out the important bits of information for the user, bring those to attention on the main page, and then make the details available on request.

The performance / correctness tradeoff was temporarily resolved with preference to correctness, but has not yet been fully resolved because we do not understand the root cause of the incorrectness caused by the optimization (and this flows into TRIZ's root cause analysis procedure).

But it's curious that I noticed these contradictions in terms of TRIZ, especially since both came up on the same day and I hadn't read or thought about TRIZ in some months.  This is particularly curious because I've been reading up on John Gall and Systemantics, which suggest that TRIZ (as a system) is precisely not what you want if you want to get results (which agrees with general intuition - the US outperformed the Soviet Union at research with a much less structured approach, though there are an infinitude of confounding factors).

02 August 2017

Debugging a Weird GDB Misbehavior

I was debugging a crashing program with gdb last week, and I observed a very strange behavior.  I was using gdb to print the faulting address following a segfault, and it was getting it wrong sometimes.  I determined from the registers and the faulting instruction that the faulting address was something like 0xce4a414100000000, but gdb was reporting the faulting address as 0x0.  WTF?

So I started playing with it.  I observed that this failure mode did not occur when I threw the same crash against a 32-bit version of the target; it still crashed, but gdb reported the faulting address correctly.  Weird.

So I checked out a copy of the gdb source tree and started poking around.  I was under the impression that there was a way to get the faulting address of a segfault from the kernel inside a segfault handler, and googling revealed that it's available in a siginfo_t struct.  Reading some man pages suggested that a ptracing program should be able to get such information about a fault in the program it's tracing using PTRACE_GETSIGINFO, so I looked for all the places that gdb was using this option to ptrace.  I found some code that seemed to be translating 64-bit siginfo_t structs into 32-bit siginfo_t structs, which seemed like a likely candidate.  There were only one or two places where GETSIGINFO was used, so I added some debugging printfs, compiled gdb, and...  compilation failed, I forget why exactly.  I chalked this up as a hazard of using the bleeding-edge source version; I downloaded a source release tarball, added my printfs again, and got it to compile, but it didn't seem to be hitting the debug prints around the PTRACE_GETSIGINFO calls, and it complained about not having its python installation in the correct place, so I was somewhat suspicious of its correct operation.  I did confirm that I still got the wrong address even in the freshest version of gdb, though.
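
(For future reference, the mechanism I was hunting for looks roughly like this - a minimal sketch, not production code: the child traces itself and segfaults on an arbitrary bad pointer, and the parent reads the resulting siginfo_t with PTRACE_GETSIGINFO.)

```c
/* Minimal sketch of PTRACE_GETSIGINFO: the child traces itself and then
 * segfaults on a deliberately bad (but ordinary, canonical) pointer; the
 * parent waits for the signal stop and asks the kernel for the siginfo_t,
 * which carries si_code and the faulting address in si_addr.
 * The bad address is arbitrary and error handling is mostly omitted. */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == 0) {                                 /* child: the tracee */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        *(volatile int *)(uintptr_t)0xdeadbeef = 1; /* force a SIGSEGV   */
        _exit(0);
    }

    int status;
    waitpid(pid, &status, 0);          /* child stops on signal delivery */

    if (WIFSTOPPED(status)) {
        siginfo_t si;
        if (ptrace(PTRACE_GETSIGINFO, pid, NULL, &si) == 0)
            printf("signal %d, si_code %d, si_addr %p\n",
                   si.si_signo, si.si_code, si.si_addr);
    }

    kill(pid, SIGKILL);                /* clean up the stopped tracee */
    waitpid(pid, NULL, 0);
    return 0;
}
```

For an ordinary unmapped user address like this one, si_addr should come back as the bad pointer and si_code as SEGV_MAPERR - which, as it turns out below, is not what I was seeing.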

At this point, after a morning of debugging and poking around in the gdb source, I told my boss that it was going to take longer to get to the bottom of this rat-hole than I expected, and tabled the gdb issue to investigate later.

The weekend rolled around, and I decided that if printf debugging wasn't going to cut it, I should use gdb to debug gdb.  Some googling indicated that this was a thing that people do, and suggested that I use gdbserver for it.  So I fired up gdbserver running gdb running my crashing program, then fired up gdb and used target remote to connect to the gdbserver, hit run...  and my ssh session to my work machine died.  Some pinging around the work network revealed that my host was down.  My suspicions of a kernel panic were confirmed on Monday morning; nobody had been in the office over the weekend to notice and reboot it.

So I was left to debug this thing on my own machine.  I checked out the crashing project, built it, ran it in gdb, and observed the same failure mode, the incorrect faulting address.  Deciding that gdb-on-gdb action was just too hot, I gave strace a shot.  stracing gdb revealed that ptrace(PTRACE_GETSIGINFO, ...) was returning a faulting address of 0...  from the kernel!  So this wasn't a gdb bug at all, but a weird kernel behavior.  Along with the weird faulting address, ptrace's siginfo_t struct also had a weird si_code value of SI_KERNEL.  When I ran the same gdb command under strace on some other crashing programs, si_code was usually SEGV_MAPERR.

Some googling later I found this stackoverflow answer.  The relevant part is:
A segmentation violation that occurs as a result of userspace process accessing virtual memory above the TASK_SIZE limit will cause a segmentation violation with an si_code of SI_KERNEL. In other words, the TASK_SIZE limit is the highest virtual address that any process is allowed to access. This is normally 3GB unless the kernel is configured for high memory support. The area above the TASK_SIZE limit is referred to as the "kernel segment".
And indeed, the address that I was faulting on was above the TASK_SIZE limit.  But what I found odd about this whole thing was that it wasn't even really a kernel address; looking at this description of 48-bit memory layout, my faulting address fell into the noncanonical zone.
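
For my own future reference: under the usual 48-bit scheme an address is canonical iff bits 63 through 47 are all copies of bit 47.  A quick check (the helper and the example values below are mine, not anything from the kernel or gdb):

```c
/* Classify 64-bit addresses under the common 48-bit virtual-address
 * scheme: canonical means bits 63..47 are all copies of bit 47.
 * The helper and example values are mine, purely for illustration. */
#include <stdint.h>
#include <stdio.h>

static int is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* bits 63..47, 17 bits in all */
    return top == 0 || top == 0x1ffff;  /* all zeros or all ones       */
}

int main(void)
{
    const uint64_t samples[] = {
        0x00007fffffffe000ULL,  /* typical userspace stack address    */
        0xffff880000000000ULL,  /* kernel-half address                */
        0xce4a414100000000ULL,  /* (roughly) the address my crash hit */
    };

    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        printf("0x%016llx -> %s\n",
               (unsigned long long)samples[i],
               is_canonical_48(samples[i]) ? "canonical" : "noncanonical");
    return 0;
}
```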

So anyway, the moral of this story: if, on 64-bit Linux, gdb is telling you that the faulting address of a segfault is 0, it might be lying, and the address might just be in the noncanonical region.

And that's what I did this Saturday.

Analysis / post-mortem:

I don't think this was a terrible performance on my part.  Total time elapsed was something like four hours, and some of that was spent compiling gdb variants.  I did a decent job of changing approaches when something seemed unproductive.  I did not make maximum use of the early observation that the behavior was different on 32-bit; instead it sent me into the gdb source, to the code that translates between 64- and 32-bit siginfo_t structs, which was a false lead, but at least I didn't get stuck on it.  I googled early and often.  I should probably have resorted to strace earlier; it's a very strong tool.  Arguably I should have known beforehand that this was a 48-bit issue, but this is how you learn.

28 June 2017

Google and Grothendieck

There is a piece of open source software that we use occasionally.  Its primary author is a single Google employee, whose work on it is (as far as we can tell) a large part of his employment.  I was reading part of the source today, and remarked that it was surprisingly bad code stylistically - enormous functions, enormous files, many global variables, and so forth - but tremendously functional.  A coworker replied:

You can really tell who at Google is good enough that they're left alone to do their own thing.

And it reminded me of this, something that Alexander Grothendieck supposedly said:

In those critical years I learned how to be alone. [But even] this formulation doesn't really capture my meaning. I didn't, in any literal sense learn to be alone, for the simple reason that this knowledge had never been unlearned during my childhood. It is a basic capacity in all of us from the day of our birth. However these three years of work in isolation [1945–1948], when I was thrown onto my own resources, following guidelines which I myself had spontaneously invented, instilled in me a strong degree of confidence, unassuming yet enduring, in my ability to do mathematics, which owes nothing to any consensus or to the fashions which pass as law.... By this I mean to say: to reach out in my own way to the things I wished to learn, rather than relying on the notions of the consensus, overt or tacit, coming from a more or less extended clan of which I found myself a member, or which for any other reason laid claim to be taken as an authority. This silent consensus had informed me, both at the lycée and at the university, that one shouldn't bother worrying about what was really meant when using a term like "volume," which was "obviously self-evident," "generally known," "unproblematic," etc.... It is in this gesture of "going beyond," to be something in oneself rather than the pawn of a consensus, the refusal to stay within a rigid circle that others have drawn around one—it is in this solitary act that one finds true creativity. All others things follow as a matter of course.


Since then I've had the chance, in the world of mathematics that bid me welcome, to meet quite a number of people, both among my "elders" and among young people in my general age group, who were much more brilliant, much more "gifted" than I was. I admired the facility with which they picked up, as if at play, new ideas, juggling them as if familiar with them from the cradle—while for myself I felt clumsy, even oafish, wandering painfully up an arduous track, like a dumb ox faced with an amorphous mountain of things that I had to learn (so I was assured), things I felt incapable of understanding the essentials or following through to the end. Indeed, there was little about me that identified the kind of bright student who wins at prestigious competitions or assimilates, almost by sleight of hand, the most forbidding subjects.

In fact, most of these comrades who I gauged to be more brilliant than I have gone on to become distinguished mathematicians. Still, from the perspective of thirty or thirty-five years, I can state that their imprint upon the mathematics of our time has not been very profound. They've all done things, often beautiful things, in a context that was already set out before them, which they had no inclination to disturb. Without being aware of it, they've remained prisoners of those invisible and despotic circles which delimit the universe of a certain milieu in a given era. To have broken these bounds they would have had to rediscover in themselves that capability which was their birthright, as it was mine: the capacity to be alone.
And it lines up - I'm pretty sure this developer of ours wrote his magnum opus in solitude first, and then was hired by Google to develop it after it had shaken up the field.  No concern for "best practices", no design by committee, just wrestling with the hard problems and solving them by whatever means necessary, in the time required to do it correctly and efficiently, releasing when it's good and ready.

O!  To program like that!

...  Well, what're you doing this weekend?

(I am in turn reminded of something Hamming said in "You and Your Research": you'll get the resources to do the job after you've proven you can do it without them, on your own time)

25 June 2017

Linkpost, 19-25 June 2017

Some things I read this week:

News: 

Give the FSB your source code, they said.  It'll be fun, they said.

Bulletin of the Atomic Scientists analyzes the feasibility of a North Korean chemical bombardment of Seoul - a highly improbable scenario, but an interesting (if pessimistic) analysis nonetheless.

Blogs / Culture War:

SSC: To understand polarization, understand conservatism's failures

Samzdat: The meridian of her greatness - sounds to me like Polanyi had an accurate view of the world (but coming from reading a bunch of James C. Scott in the last year, I would say that).  Reminds me somewhat of this.

SSC: Against murderism

David Brin: The Jefferson Rifle - came up at work because a coworker claimed that no compromise on gun control was possible.  Which may be correct, but part of his argument was an unavailability heuristic - he had never even heard of a good-faith proposal for compromise (granted: young, very work-focused engineer).

David Brin: A Time for Colonels, Part 3 - I think he takes Lakoff entirely too seriously.  Might work in the short term, but I suspect there's a good reason for that norm that even retired officers mostly stay out of tribal politics.  Potential "guilt by association" backfire failure mode of "the officer corps is now publicly aligned with the Blue Tribe, ergo the officer corps is no longer to be trusted."

Speeches / Lectures:

Alan Kay: [pdf warning] The Power of the Context

Marvin Minsky: Turing Award address - a bit dated, but a novel perspective on education:
– To help people learn is to help them build, in their heads, various kinds of computational models.
– This can best be done by a teacher who has, in his head, a reasonable model of what is in the pupil's head.
– For the same reason the student, when debugging his own models and procedures, should have a model of what he is doing, and must know good debugging techniques, such as how to formulate simple but critical test cases.
– It will help the student to know something about computational models and programming. The idea of debugging itself, for example, is a very powerful concept - in contrast to the helplessness promoted by our cultural heritage about gifts, talents, and aptitudes. The latter encourages "I'm not good at this" instead of "How can I make myself better at it?"
...
The child needs models: to understand the city he may use the organism model: it must eat, breathe, excrete, defend itself, etc. Not a very good model, but useful enough. The metabolism of a real organism he can understand, in turn, by comparison with an engine. But to model his own self he cannot use the engine or the organism or the city or the telephone switchboard; nothing will serve at all but the computer with its programs and their bugs. Eventually, programming itself will become more important even than mathematics in early education.

Richard Hamming: n-dimensional spaces - impressively fast derivations, and man my calc is rusty.  Not my favorite Hamming lecture.  Interesting notes on testing at the very end.

Tensorflow without a PhD

Books:

Not a great week for books.  Read a little of Generatingfunctionology after an epiphany in the shower, started Barabasi's Network Science, stalled on The Strategy of Technology and The Mind Illuminated.