Holistic Bug Fixing

The other day, while doing my morning jog, I was thinking about a particularly nasty bug I had been chasing for quite some time. I was wearing a portable radio and the guy on the radio was talking about Paul Simon’s all-time classic “Fifty ways to leave your lover”. He wasn’t so much talking about the song — rather about the amazing drum beat created by drummer Steve Gadd. Steve’s performance is incredible — about 100 beats per minute, it just sounds like “prrrrrrrrrr…”. Anyway, the broadcast led my thoughts astray and I suddenly thought: “How many ways are there to fix a bug?” Which brings us back to the topic of this post…

After having chased down a nasty bug, all you have to do is fix it. But a fix is not a fix, right?

Imagine this setting. There is a test execution framework that you and your teammates are using to run automated regression tests. Since your software is configurable (i.e. certain features can be turned on/off either at build time or at run-time), your tests need to be configurable, too. The framework sports a preprocessor (implemented in Perl) that gives you just that, very much like what the C preprocessor gives to C developers. Here is what a typical test script looks like:


    |File: sample.test|
    ...
    #include "config.h"

    Execute test step 1
    Execute test step 2
    #if CONFIGURED_FEATURE_A
    Execute test step 3
    #endif
    Execute test step 4
    #ifndef CONFIGURED_FEATURE_Z
    Execute test step 5
    #endif
    ...

During execution of this test script, a log file is updated:


    |File: sample.test.log|

    # Log file for test script 'sample.test'
    PASS test step 1
    PASS test step 2
    PASS test step 4
    ...
    FAIL test step 25
    ...
    PASS test step 42

This is obviously just an artificial (or at least, simplified) example, but I guess it is sufficient to explain the idea.

Now, your test execution framework probably has code like this:


    |File: testexec.sh|

    ...
    for test in *.test ; do
        # Preprocess.
        preprocess_test.pl -I dev/project/build/config "$test"

        # Execute.
        execute_test.pl "$test"

        # Now check if there are any FAILs in the log.
        # grep exits with 0 if it found at least one match.
        if grep -q '^FAIL' "$test.log" ; then
            echo "FAIL $test"
        else
            echo "PASS $test"
        fi
    done
    ...

A loop iterates over all test scripts in the current directory. In a first step, the test script is fed through a preprocessor Perl script where the configuration business is done. Next, the preprocessed test script is executed and finally, it is checked whether the execution of the test script resulted in any failing test steps.

So far so good. But on a dark and rainy day, you find out by coincidence that some of your test scripts have not been executed at all; worse yet, testexec.sh reported success on them! The logs clearly show that nothing has been executed due to a fatal error:


    |File: mytest.test.log|

    preprocess_test.pl: cannot find include file 'config_ex.h'.

You immediately know what the problem is. A couple of weeks back you added another include directive to some of your test scripts since they depend on additional configuration switches:


    |File: mytest.test|

    #include "config.h"
    #include "config_ex.h"
    ...

Unfortunately, config_ex.h is located in a different directory than config.h, so the preprocessor, which is given only a single include base path (dev/project/build/config), cannot find it.

What possibilities do we have to get rid of this problem?

Level 1: Fix the symptom.

A very simple fix would be to change the failing test script by changing the #include statement to include config_ex.h based on an explicit path:


    |File: mytest.test|

    #include "config.h"
    #include "dev/project/build/other/config_ex.h"
    ...

This would do the job. Yet, this approach is ugly: other developers (including yourself) can easily step into the same pitfall again (most likely they have already).

You would never only fix the symptom, would you? EVERYONE knows that it is bad to only fix the symptom!

But, hey, such a hack is not bad per se. Sometimes, you need a quick fix to be able to carry on. Maybe you cannot change the test execution script yourself because it is located on a remote testbot. Or your company doesn’t like the idea of collective code ownership and only Sam is allowed to change the test execution script and — unfortunately — Sam has already left for the weekend. Anyway, fixing the symptom is sometimes appropriate, but at the very least you should ensure that it will be cleaned up later, by (for instance) using TODOs that can be tracked automatically:


    |File: mytest.test|

    #include "config.h"
    #include "dev/project/build/other/config_ex.h" // TODO:2009-07-29:ralf:Quick fix to get my test running again
                                                   // (missing include base path in test execution script)
    ...
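
One way to make such markers trackable is a small script that lists them with file name and line number. The following is only a sketch: the function name `list_todos` is made up for illustration, and it assumes the `TODO:<date>:<user>:` layout used in the example above.

```shell
# list_todos DIR: print every tag of the form TODO:YYYY-MM-DD:user:
# found in *.test files below DIR, as file:line:text.
# (list_todos is a hypothetical helper, not part of any framework.)
list_todos() {
    # /dev/null as an extra file makes grep print the file name
    # even if find passes only a single file.
    find "$1" -name '*.test' -exec \
        grep -n -E 'TODO:[0-9]{4}-[0-9]{2}-[0-9]{2}:[a-z]+:' /dev/null {} +
}
```

Run regularly (for instance as `list_todos .` from a nightly job), the list shows which quick fixes are still waiting to be cleaned up and how old they are.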

Level 2: Fix the immediate problem.

If you can, you’d better fix the bug in the test execution script directly by adding the missing include base path:


    |File: testexec.sh|

    ...
    for test in *.test ; do
        # Preprocess.
        preprocess_test.pl -I dev/project/build/config -I dev/project/build/other "$test"

        # Execute.
        execute_test.pl "$test"
        ...

Now this looks good and you can claim that you “fixed the problem, not the symptom”, right? While this level 2 fix is certainly more pleasing than the level 1 fix above, I think we can do much better.

Level 3: Prevent the bug from happening again.

Our level 2 fix still has shortcomings. The same mistake (i. e. forgetting to add or update an include path in the execution script) can and will lead to the same misery. So we need to safeguard against future errors.

Looking at the invocation of the preprocessor, we can clearly see that there is no error handling at all: either preprocess_test.pl doesn’t return a non-zero exit code in case of fatal errors or our test execution script doesn’t evaluate it. So here is a potential level 3 fix:


    ...
    for test in *.test ; do
        # Preprocess.
        preprocess_test.pl -I dev/project/build/config -I dev/project/build/other "$test"
        if (($?)) ; then
            echo "Fatal: Preprocessing of file $test failed."
            exit 2
        fi

        # Execute.
        execute_test.pl "$test"
        ...

Now this is slick, isn’t it? Never will a wrong or missing include path trouble you again. But the best is yet to come, please bear with me.

Level 4: Prevent a whole class of bugs.

Our previous fix makes sure that a particular kind of error will not occur again. In a highly automated environment (where only machines look at output) this is not enough. Consider what happens if somebody makes a modification to execute_test.pl that leads to a crash. In this case, no output would be produced and hence no FAIL messages would be generated; as a result, grep wouldn’t find any FAILs and the test script would be reported as passed.
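
This failure mode is easy to reproduce. In the sketch below, `verdict_from_log` is a made-up name that wraps the kind of grep check used in testexec.sh so that it can be tried in isolation; the log file is a temporary file created just for the demonstration.

```shell
# verdict_from_log LOGFILE: grep-based verdict, as in testexec.sh.
# (verdict_from_log is a hypothetical name chosen for this sketch.)
verdict_from_log() {
    if grep -q '^FAIL' "$1"; then
        echo FAIL
    else
        echo PASS
    fi
}

log=$(mktemp)                          # empty log, as left behind by a crash
verdict_from_log "$log"                # prints PASS, although nothing ran
printf 'FAIL test step 25\n' > "$log"
verdict_from_log "$log"                # prints FAIL
rm -f "$log"
```

An empty log and a log full of PASS lines are indistinguishable to this check, which is exactly the problem.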

Of course, this can only happen because of the “textual” interface of execute_test.pl. A better design would use exit codes instead of grep — 0 for no errors, 1 for “normal” test errors and anything else for fatal errors:


    for test in *.test ; do
        # Preprocess.
        preprocess_test.pl -I dev/project/build/config -I dev/project/build/other "$test"
        if (($?)) ; then
            echo "Fatal: Preprocessing of file $test failed."
            exit 2
        fi

        # Execute.
        execute_test.pl "$test"
        # Save exit code.
        error=$?
        if ((error == 0)) ; then
            echo "PASS $test"
        elif ((error == 1)) ; then
            echo "FAIL $test"
        else
            echo "Fatal: Test execution of file $test failed."
            exit 2
        fi

        ...

You probably think that this only solves yet another kind of bug, but not really a whole class of bugs. What if somebody someday adds another step and forgets to check an exit code again? Wouldn’t we get the same problem again? We would, of course, at least until we pull out our level 4 laser gun: post-condition checking.

What our test execution script actually promises to do is this: “If you give me a set of N test scripts I will give you back a set of P passed test scripts and a set of F failed test scripts. Either P or F may be zero, but P + F is always N”. This is the post-condition and it holds as long as the pre-conditions (e.g. well-formed test scripts) are respected. So here we have our (hopefully) bullet-proof level 4 version:


    test_glob=*.test

    # Get total number of test scripts.
    tests_total=$(ls -1 $test_glob | wc -l)
    tests_ok=0
    tests_bad=0

    for test in $test_glob ; do
        # Preprocess.
        preprocess_test.pl -I dev/project/build/config -I dev/project/build/other "$test"
        if (($?)) ; then
            echo "Fatal: Preprocessing of file $test failed."
            exit 2
        fi

        # Execute.
        execute_test.pl "$test"
        error=$?
        if ((error == 0)) ; then
            tests_ok=$((tests_ok + 1))
            echo "PASS $test"
        elif ((error == 1)) ; then
            tests_bad=$((tests_bad + 1))
            echo "FAIL $test"
        else
            echo "Fatal: Test execution of file $test failed."
            exit 2
        fi
    done

    # Check post-condition (after all test scripts have been processed).
    if ((tests_ok + tests_bad != tests_total)) ; then
        echo "Fatal: Not all tests were executed."
        exit 2
    fi

    ...
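
The post-condition check can also be factored out and tried with made-up numbers. This is a sketch only; `check_postcondition` and its argument order are assumptions made for this illustration.

```shell
# check_postcondition PASSED FAILED TOTAL
# Returns 0 if PASSED + FAILED equals TOTAL; otherwise prints a message
# to stderr and returns 2.
# (check_postcondition is a hypothetical helper for this sketch.)
check_postcondition() {
    if [ $(( $1 + $2 )) -ne "$3" ]; then
        echo "Fatal: Not all tests were executed." >&2
        return 2
    fi
    return 0
}

check_postcondition 40 2 42 && echo "post-condition holds"
check_postcondition 40 1 42 || echo "post-condition violated"
```

The second call models a test script that was silently skipped: it is counted neither as passed nor as failed, so the sum no longer adds up and the check fires.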

I used a testing example to show you the many facets of bug-fixing but these principles equally apply to “real” source code. Here is a summary of what I wanted to show:

– It is fine to do a quick and dirty “symptom-level” bug-fix every now and then, as long as you are explicit about it.
– Repeatedly zoom out in search of the root cause; zoom out as far as possible, but stay within your circle of influence (“We wouldn’t have all of these bugs if these spec coders were fired” is clearly beyond your circle of influence).
– Fortify your fix by making sure that the same or similar bugs will not creep in again.

On Pragmatic Thinking and Learning (and Cats)

I just finished reading “Pragmatic Thinking and Learning” by Andy Hunt, one of the “Pragmatic Programmers”.

Besides explaining how our brain works and how to acquire knowledge effectively, he tells many interesting stories and gives pragmatic tips that are invaluable for any knowledge worker.

Most likely this book is going to make it on my yet-to-be-published top 10 of books that radically influenced my professional life. Some positions are already allocated. For instance, number one is Steve McConnell’s “Code Complete” because it turned me from a hobbyist programmer into a professional software developer (I hope). Next comes “Peopleware” by Tom DeMarco and Timothy Lister, because it taught me that software development is all about people and only remotely about tools or methodologies. The third place goes to “The Pragmatic Programmer” (here they go again!) for tripling the level of passion I have for programming.

So what is so special about those pragmatic guys?

I guess it is the way they entertain their audience with immediately usable advice; they are very good presenters. I first met them in 2000 at a conference about Java and object-oriented programming (JAOO) in Denmark. They gave one of the best (correction: the best) presentations I have attended so far: full of wit, full of fun. Nobody knew them at the time (I vividly recall when at the end of the presentation somebody asked “Who are you guys, anyway?”). Regretfully, I missed the opportunity to buy their then brand-new book “The Pragmatic Programmer” and have them sign it on the spot (I learned from my mistake and had Bjarne Stroustrup sign my copy of “The Design and Evolution of C++” the next day).

Not all of what they write or say is truly novel. Frequently, they reuse material and wisdom from others (a very pragmatic habit, indeed) but they always present it in a context that makes me think “Wow!”. Take, for instance, this quote from Mark Twain at the beginning of chapter 7 “Gain Experience” in “Pragmatic Thinking and Learning”:

“We should be careful to get out of an experience only the wisdom that is in it and stop there; lest we be like the cat that sits on a hot stove-lid; he will never sit on a hot stove-lid again—and that is well; but also he will never sit on a cold one anymore.”

Very profound, isn’t it?

Towel Day 2009

Towel Day is an annual tribute to Douglas Adams, the brilliant author of “The Hitchhiker’s Guide to the Galaxy”. The idea is that you take your towel everywhere you go, as, according to The Guide, “a towel is about the most massively useful thing an interstellar hitch hiker can have”.

Douglas, who indisputably died too soon, had the idea for the book while lying drunk in a field on a camping-site: “The idea for the title first cropped up, while I was lying drunk in a field in Innsbruck, Austria, in 1971. Not particularly drunk, just the sort of drunk you get when you have a couple of stiff Gössers after not having eaten for two days straight.”

So me and some co-hackers thought it would be a good idea (and a great tribute) to go to Innsbruck on that day and try to find the exact location. Alas, it turns out that the camping-site doesn’t exist anymore — it has been replaced by a nursing home. Here are the exact coordinates, in case you want to go there as well:
47°16'33.79" N 11°25'24.61" E
On the train from Munich to Innsbruck I reread a couple of chapters from Douglas’ last book, “The Salmon of Doubt”; one of the chapters is a transcript of a fascinating talk he gave in 1998, which is entitled “Is there an artificial god?”.

The reason why I like this talk so much is that Douglas succinctly explains the origins and purposes of religions as well as how technology and scientific progress — he differentiates between “four ages of sand” — shaped our view of the world and religions. Very inspiring words… a must read (hear) for any carbon-based, ape-descendant, bipedal life form.

Bad Hiring Strategy, Great Interview Question

You have to bear in mind that this story happened in 1999, more than a year before the dot com bubble burst; senseless hiring of people was considered perfectly normal at that time.

I was working for a well-known consumer electronics company, developing mobile phones. Our CEO had a problem. The problem (which is not so uncommon among CEOs) was that the investors weren’t happy because the company didn’t sell enough handsets. So the boss went to sales, and they claimed they couldn’t sell more because the development guys didn’t give them enough cool products, products the customers reeeeeelly wanted; and even if they did, they’d finish too late: they’d come out with a phone in March, missing the Christmas sales season by threeeeee months.

So our boss went to the development manager to find out what the problem was; our development manager told him what almost every development manager tells in such a situation: “We could develop soooooo many cool products in sooooooo little time, if we only had moooooore developers!”

This procedure repeated a couple of times until our CEO freaked out. He went to our human resources manager and commanded him to “get more software developers, no matter what!”

Our HR manager didn’t know much about hiring software developers, but he surely had a plan: he wanted to hire 100 software developers in 100 days. He had t-shirts printed, carrying these words:

[company]
100
in
100
Do YOU have good
soft(ware) skills?

Our job was to wear these (poor quality) t-shirts and attract potential software developers for our team.

This strategy was unbelievably stupid for many reasons. First, it emphasized quantity, not quality. Second, it read as if soft skills were more important than software skills. Last, it was insulting, as it was based on the idea that good software developers stupidly fall prey to such bad HR campaigns.

Anyway, we got lots of candidates — many more than we could interview; our HR manager celebrated a victory. We, however, had to separate the wheat from the chaff and thus we developed programming tests that every candidate had to take.

One of the most successful questions was this one:

Write a routine that sorts an array of ‘n’ integers; write best-quality code.

Innocuous as it looks, this question has several good characteristics: it requires the candidate to actually write real code; since it has to be ‘best-quality code’ you can find out about his/her skills and quality standards. For instance, does the candidate

– pay attention to style issues (indentation, layout and consistency in general)?
– choose meaningful identifier names?
– write (good) comments?
– use assertions and/or checks for boundary cases?

This question can reveal even more: since no programming language is given, you can find out about a candidate’s favorite programming language. But most importantly, you can find out if a developer is smart and has a questioning attitude.

Every smart developer knows that there is no perfect sorting algorithm. The choice depends on many constraints — constraints that are not given in the problem statement and hence must be investigated. I remember one applicant commented along these lines:

[…] It all depends on how big ‘n’ is and whether we have write access to the array (that is, we can sort in-place). Are there any code/RAM restrictions? Depending on factors like these I would choose the best sorting algorithm from a text book. Unless further information is given, I would use the sorting algorithm that comes with the standard library (e. g. qsort() in C, java.util.Arrays.sort() in Java). Since I know that you want me to write some code, I’ll implement an insertion sort algorithm, which is easy to code and whose O(n^2) behavior is acceptable for ‘n’ < 1000 […]

Naturally, we made an offer to him, which he turned down a couple of days later. I guess he was scared of having to work with too many soft skills experts.

How to become a better programmer

Good programmers often wonder how to become even better programmers. They constantly seek new tools and techniques that help them get their job done better and faster.

If you want to know what helps the most, here is some advice:

“You must do two things above all others: read a lot and write a lot. There’s no way around these two things that I’m aware of, no shortcut.”

These words are from the best-selling author Stephen King; he should really know — he makes 45 million bucks per year from his books.

I believe that programming and writing novels have a lot in common and that King’s words of wisdom are applicable to software development as well.

Let’s first focus on reading. It’s a well known fact that programmers read too little. In their book “Peopleware”, Tom DeMarco and Timothy Lister assert that the average developer doesn’t own a single book on the subject of his or her work. If this is true, it might be an explanation as to why our industry is performing so badly: if developers don’t know about the fundamentals of software engineering (not just coding issues — also topics like software quality, configuration management and peopleware in general) how can they explain them to non-technical folks like sales and upper management once they’ve become technical leaders?

What about code reading? Fortunately, we live in very privileged times. Twenty years ago, almost all code was closed source; nowadays, there are billions of lines of open source code out there from which we can learn. Alas, there is the fundamental law of programming: “It’s harder to read code than it is to write it.”

If browsing through huge open source code bases gives you headaches, check out “Code Reading” or “Code Quality” by Diomidis Spinellis. These two fine books quote (and criticize) countless examples from open source projects — in my view, a lightweight and often entertaining way to improve your programming Kung Fu.

But what about code writing? Isn’t a professional software developer already writing enough code? Not so! Typically, software developers only spend a fraction of their time writing code. In fact, most of their time is devoted to meetings, email, reading specs, writing documentation and so on. With this little time given for writing code, it is vitally important that developers keep their programming skills active.

A good way to practice is by contributing to an open source project. Another possibility is doing Code Katas — little practice sessions, based on a concept borrowed from karate and other martial arts, where the practitioner fights against an imaginary opponent. But by far the best way is to work for your employer in your leisure time — for free!

Have you recovered?

I presume that to most people, this idea sounds shocking, almost insane — but I really mean it. Often, good ideas arise during the day that your boss doesn’t understand and hence doesn’t approve. If you think your idea is challenging and useful for the company — do it at home! Not only does this improve your coding skills, it helps your company; as a bonus, your reputation within the company increases. So we have at least a win-win, if not a double-win-win situation. But only choose interesting topics, things that improve your skills; leave the drudgework for the office.

Constant reading and practicing is the key to success. It doesn’t take much time, but it needs to be done habitually. Don’t expect that your company or your boss or anyone but you is responsible for improving your skills. Even if those days existed in the past, they certainly don’t exist anymore.

The Pizza Box Problem

Consider this real-world problem: In a UMTS network, short messages (SMS) consist of a protocol header and the actual text message. The text message can be encoded in many different formats, but for the sake of this example I want to focus only on two encodings: 8-bit characters and 7-bit characters.

Since the standard SMS alphabet only uses 7-bit characters, it often makes sense to use a 7-bit encoding for the text message, as you can squeeze more characters in the available 140 octets (an octet is a byte comprising 8 bits; remember that it is not specified how many bits there are in a byte).

In the header, there is a so-called ‘user data length’ element that tells how many characters the text message comprises. The stress is on characters – you don’t know over how many of the following octets the message is distributed. But of course, you can find out. A so-called ‘data coding scheme’ octet tells you whether the text message uses 7-bit encoding or 8-bit encoding. Thus, calculating the total number of used octets should be straightforward:


    int char_count = get_char_count(buffer);
    int octet_count;

    // If 7-bit encoding.
    if (get_dcs(buffer) & DCS_7BIT_ENCODING) {
        octet_count = (char_count * 7) / 8;
    // If 8-bit encoding.
    } else {
        octet_count = char_count;
    }

This code is short, simple and – unfortunately – wrong. I’ve seen this mistake in several guises and the reason for this bug is that programmers obviously don’t know about what I call the ‘Pizza Box Problem’. It goes like this.

You are a pizza delivery guy and you have to deliver pizzas (stored in pizza boxes) to your customers. To keep your pizzas hot, you stuff them into thermal bags, each of which is capable of holding 8 pizza boxes.

How many bags do you need to deliver, say, 21 pizza boxes?

Every pizza delivery guy immediately knows the answer: 3. It is not 21 / 8, since integer division causes the result to be equal to 2!

What you need is this: if your division yields a fractional part, you want to increase the result of the integer division by one. You could resort to floating point arithmetic (and use the ceil() function, for instance) but that would be inefficient.

The trick is that you add the divisor minus one to the dividend before performing the integer division:


    bags = (boxes + bag_size - 1) / bag_size;

It works like this: if ‘boxes’ is already evenly divisible by ‘bag_size’, adding one less than the ‘bag_size’ doesn’t change the overall result; otherwise, the dividend will be increased such that the next ‘bag_size’ multiple is crossed:


    bags = (21 + 7) / 8 == 3
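
The same rounding trick can be tried out quickly in shell, whose arithmetic is integer-only, like integer division in C. The function name `ceil_div` is made up for this sketch, and it assumes a non-negative dividend and a positive divisor.

```shell
# ceil_div DIVIDEND DIVISOR: integer division, rounded up.
# Assumes DIVIDEND >= 0 and DIVISOR > 0.
# (ceil_div is a hypothetical name chosen for this sketch.)
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

ceil_div 21 8    # prints 3 (plain 21 / 8 would give 2)
ceil_div 16 8    # prints 2 (already a multiple of 8, nothing changes)
ceil_div 17 8    # prints 3
ceil_div 0 8     # prints 0
```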

Applying what we have just learned to our SMS problem, we conclude that the code in the if block should look like this:


    ...
        octet_count = (char_count * 7 + 7) / 8;
    ...

We have ‘char_count’ * 7 bits (pizza boxes) that we want to store in octets (thermal bags) of size 8.

Enjoy your pizza!

The Return of the Pizza Delivery Guy

[update 2009-03-29: The equation

    octet_count = (char_count * 7 + 7) / 8

can be simplified to

    octet_count = char_count - char_count / 8

Here is the proof. Integer division does not distribute over sums, so we cannot simply split the fraction term by term. Instead, write char_count = 8 * q + r with 0 <= r < 8, that is, q = char_count / 8:

    octet_count = (char_count * 7 + 7) / 8
                = (56 * q + 7 * r + 7) / 8
                = 7 * q + (8 * r + (7 - r)) / 8

Since 0 <= 7 - r <= 7, the remaining division yields exactly r, and we get

    octet_count = 7 * q + r = (8 * q + r) - q = char_count - char_count / 8  (q.e.d.)

— end update]
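
The simplification can also be checked by brute force. The sketch below compares both formulas for every character count from 0 to 160, which is the most that fits into 140 octets with 7-bit encoding (140 * 8 / 7 = 160). The function name `formulas_agree` is made up for this illustration.

```shell
# formulas_agree: compare the rounded-up formula with the simplified one
# for all valid 7-bit character counts (0..160).
# (formulas_agree is a hypothetical name chosen for this sketch.)
formulas_agree() {
    n=0
    while [ "$n" -le 160 ]; do
        rounded_up=$(( (n * 7 + 7) / 8 ))
        simplified=$(( n - n / 8 ))
        if [ "$rounded_up" -ne "$simplified" ]; then
            echo "mismatch at $n"
            return 1
        fi
        n=$(( n + 1 ))
    done
    echo "formulas agree"
}

formulas_agree    # prints "formulas agree"
```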

Approxion

Code – People – Everything
