Code Kata 2: The “Average” Developer

You might be tempted to think that calculating the average of two integers is a trivial thing. If you think that it is as easy as this:


inline unsigned int average(unsigned int a, unsigned int b) {
    return (a + b) / 2;
}

inline unsigned int average(unsigned int a, unsigned int b) {

return (a + b) / 2;

}

you are wrong. But don’t be depressed — you are in good company. Code like this is written every day — not just by average developers.

(Task#1: Hold on for a minute. Can you spot any problems with this code?)

I’m not talking about the fact that the result is not precise — it can’t be, due to the limitations of the integer type. If ‘a’ is 5 and ‘b’ is 6, the mathematically correct result is 5.5. With integer division, however, it is 5, which is perfectly OK for our purposes; otherwise, we would have used ‘float’ or ‘double’. No, that’s not what I’m looking for…

The trouble is that if a developer looks at the signature, (s)he will probably think “Well, this function takes two integers and returns their average value. I can pass ANY unsigned integer to this function.”, which is of course not true. Why? If the integers are large enough, their sum will overflow the unsigned int range and hence the result of the division will be wrong. There you have it!

(Task#2: Write tests that show when and how overflow happens.)

You could use a larger type internally:


inline unsigned int average(unsigned int a, unsigned int b) {
    return ((long long)a + (long long)b) / 2;
}

inline unsigned int average(unsigned int a, unsigned int b) {

return ((long long)a + (long long)b) / 2;

}

but maybe ‘long long’ is not available on your platform, or you are already using ‘long long’ and there is no bigger type, or you simply don’t want to sacrifice efficiency, especially by using a floating point type. (Note that ‘long’ is not sufficient in our example as on 32-bit platforms ‘int’ and ‘long’ are usually of the same size). So how can you solve this problem without resorting to a larger (more expensive) type?

Years ago I found a solution (I don’t claim that I’m the first who invented this, and I won’t be the last to reinvent it, either) by reasoning like this:

In pure mathematical terms, with a mathematical division operator, (a + b) / 2 is identical to a / 2 + b / 2. By applying this principle — and “wasting” another division — one could avoid the dreaded overflow in the sum. Unfortunately, integer division doesn’t exactly work like mathematical division. For instance, if ‘a’ is 5 and ‘b’ is 7, (5 + 7) / 2 is 6, but 5 / 2 + 7 / 2 = 2 + 3, which is 5. So it is possible that the result is one less than what we expect. How can we compensate for that?

Here is the insight: For both divisions, the integer division operator leaves a remainder behind and if the sum of the remainders (of both divisions) is equal to two, we have to add 1 to our result. Think of it this way: in such cases, the sum of ‘a’ and ‘b’ contains a 2 more than the parts of the sum themselves:


2 2 2 2 2 2       // There a six 2's in the sum of 5 and 7
- - 2 2 2 2       // The division of 5 by 2 consumes two 2's
- - - - - 2       // The division of 7 by 2 consumes three 2's

2 2 2 2 2 2 // There a six 2's in the sum of 5 and 7

- - 2 2 2 2 // The division of 5 by 2 consumes two 2's

- - - - - 2 // The division of 7 by 2 consumes three 2's

Here is an implementation that applies this correction:


unsigned int div = a / 2 + b / 2;
if (a % 2 + b % 2 == 2) {
    ++div;
}
return div;

unsigned int div = a / 2 + b / 2;

if (a % 2 + b % 2 == 2) {

++div;

}

return div;

Unfortunately, this is not only ugly, it is not very efficient, either. Now we have two divisions, two modulo operations (which are as expensive as division operations), two additions, an if-statement and an increment operation. But as every bit-fiddler knows, there is a simpler modulo operation for divisors that are a power of two:


x % k == x & (k - 1)    // If k is 1, 2, 4, 8, 16...

x % k == x & (k - 1) // If k is 1, 2, 4, 8, 16...

Equipped with this knowledge we get:


if (a & 1 + b & 1 == 2) {

if (a & 1 + b & 1 == 2) {

Which is identical to:


if (a & 1 && b & 1) {

if (a & 1 && b & 1) {

In C, results of boolean operations are either 0 or 1 (in C++, they are ‘false’ or ‘true’, but ‘false’ and ‘true’ are converted implicitly to 0 or 1 in arithmetic expressions), so we can shorten our average function:


inline unsigned int average(unsigned int a, unsigned int b) {
    return a / 2 + b / 2 + (a & 1 && b & 1);
}

inline unsigned int average(unsigned int a, unsigned int b) {

return a / 2 + b / 2 + (a & 1 && b & 1);

}

Cool isn’t it? But it’s not unlikely that your compiler will apply these optimizations for you automatically, in case you write your code like in the first, ugly solution. But it certainly won’t add the correction that is missing from the first implementation — trust me on this one.

(Task#3: Write a small program that uses this implementation of average() and test it.)
(Task#4: If you change the types used by average() from ‘unsigned int’ to ‘int’, will the code also work, specifically for negative values of ‘a’ and ‘b’? Why? Write some tests to proof your theory.)
(Task#5: Implement and test an average4() function that calculates the average of four integers.)

The Manager in the Room

Early in my career, I had a boss who was spending most of his time in his office—and what a huge office he had! In fact, his office was so big, you could hardly see him, as his desk was located opposite to the door at the far-end of the room. Hunched over his keyboard, he was diligently hacking away, coding on his (more or less) private little project that he enjoyed so much. Little information flowed from him and if you dared entering his palace, you always felt a sense of guilt for disturbing him.

In retrospect, I think it is obvious why he behaved like this: what he really loved to do was programming, but since there was no technical career ladder, he had to become a manager in order to advance his financial situation. So this introvert, technophile person took on a job he wasn’t qualified for, a job that required dealing with people first and computers second.

But what exactly is the job of a manager? There are many answers to this question, but in my view, there are two things that stand out:

1. Removing obstacles
2. Showing that one cares

Removing obstacles means that a manager has to do everything humanly possible to make sure that the team can work as effectively and efficiently as possible. This includes many of the obvious tasks, like getting proper infrastructure, money, and staff. But that’s just the bare basics. I have heard of a manager who didn’t get approval for buying a coffee machine for the team, so he bought one out of his own pocket. I guess the money was paid back nicely by his teammates—not in money, but in increased motivation and loyalty. Lack of space? If he has a better/bigger/quieter office than the team, he should offer to switch rooms. And if there is a problem with the bathroom, well, let’s hope for our manager that he can find a janitor…

I once worked for a company where the department manager (another “hide-in-the-room” kind of manager) delegated organizing a company outing to a senior developer, an assignment that dragged on for days. I think that this brilliant manager did a wonderful job of demonstrating his superb status, but wasted valuable development time—which is, to quote Peopleware, the ultimate management sin.

“Removing obstacles” really boils down to what Robert Greenleaf called “Servant Leadership” in the 1970s, but the concept is much older, summarized in the words of Chanakya from the fourth century B. C.:

“The king shall consider as good, not what pleases himself but what pleases his subjects. The king is a paid servant and enjoys the resources of the state together with the people.”

Even though removing obstacles is very important, it is not at all sufficient. We all know that perfect service is unnoticeable and thus the team might not know about all the fine work that is going on behind the scenes. Even if the manager works like crazy to keep the ball running, the team might get the impression that what they are doing is not important after all.

There is only one way to fight this: the manager has to get out of his room. Often. Very often. For a manager it is not enough to spend most of a day sitting in meetings with peer managers and/or top management: a manager has to spend time with the team, everyday, without giving them the impression that he is checking up on them and—needless to say—without wasting their precious time. All a manager has to do is clearly convey the message that he truly cares for the work his people are doing. By walking around, by asking how it is going, by telling jokes, by asking technical questions (without being afraid that most of his people are much smarter than him), by asking if he can lend a hand (for instance, it’s a disgrace if a developer under time pressure has to convert some data manually—a mundane job that’s perfectly suited for a servant leader).

Being a successful manager requires many skills and practices, but almost everything derives from these two items. Managers everywhere: Don’t insist on your status and be present; be a good servant to the team and clearly show that you care—for your team and the work that they produce.

A Small Matter of Style?

Some claim that there is a reason for everything. Well, I’m not so sure about this statement, but surely there is a reason for Code Kata 1: it was inspired by a routine that I found in a real-world product some time ago. It looked more or less like this:


 1	int findBestMatch(double value, const vector<double>& values)
 2	{
 3	    double bestDistance = 1.7976931348623158e+308;
 4	    int bestIndex = 0;
 5
 6	    for (int i = 0; i < values.size(); ++i) {
 7	        double currentValue = values[i];
 8	        if (fabs(value - currentValue) < bestDistance) {
 9	            bestDistance = fabs(value - currentValue);
10	            bestIndex = i;
11	        }
12	    }
13
14	    if (bestIndex < values.size()) {
15	        return bestIndex;
16	    }
17	    return -1;
18	}

1 int findBestMatch(double value, const vector<double>& values)

2 {

3 double bestDistance = 1.7976931348623158e+308;

4 int bestIndex = 0;

6 for (int i = 0; i < values.size(); ++i) {

7 double currentValue = values[i];

8 if (fabs(value - currentValue) < bestDistance) {

9 bestDistance = fabs(value - currentValue);

10 bestIndex = i;

11 }

12 }

14 if (bestIndex < values.size()) {

15 return bestIndex;

16 }

17 return -1;

18 }

The job of this routine is to find a value in a vector that matches the given reference value as close as possible. It works by visiting each element of the vector, calculating the absolute difference to the reference value and, if the new difference is smaller than the previous, updating the ‘bestIndex’, the index of the value that matches best.

This routine has proven to work reliably, at least no errors have been found so far. But in my view, it nevertheless leaves a lot to be desired. Let’s dissect this code step by step.

First of all, there are no comments, but I don’t think this is a big issue here. In my view, the code is quite short and the identifier names give sufficient meaning.

The most noticeable peculiarity about this code is the hideous floating point literal 1.7976931348623158e+308 on line 3. What the heck is this?

If you study the for loop and find out how the algorithm works it becomes clear: ‘bestDistance’ is initialized to the largest possible value such that the condition on line 8 becomes ‘true’ for the first element, in order to get a first candidate.

But using a hard coded floating-point literal is the worst approach to achieve this. Why not use ‘DBL_MAX’ (from cfloat/float.h) or ‘numeric_limits::max()’ (from limits)? Especially the former is much easier to type, less error-prone, and you get portability for free: recall that the limits of numeric types are not fixed in C++ (for double the upper bound is specified to be at least 1E+37, but that’s about all that the language standard promises).

The algorithm itself is not perfect either. Like I said, during the first iteration, the comparison on line 8 will always fail as ‘bestDistance’ is initialized to the largest possible value. Why not save one iteration by initializing ‘bestDistance’ to the distance between the reference value and the first vector element? Sounds like a good idea, but this approach will work only if we know that the vector is not empty. Which brings us to the topic of the odd way in which errors are handled in this routine.

Have a look ath lines 14 and 15. The routine returns the found ‘bestIndex’ only if it is within the bounds of the vector; otherwise it returns -1 (for failure). Under what circumstances is it possible that ‘bestIndex’ is outside the range of the vector? It turns out that it can’t — unless the vector is completely empty. So why not make this check explicit and put it near the top of the routine before the actual search starts?

Here is an improved implementation, based on my previous observations:


int findBestMatch(double value, const vector<double>& value)
{
    if (values.size() == 0) {
        return -1;
    }

    int bestIndex = 0;
    double bestDistance = fabs(value - values[0]);

    for (int i = 1; i < values.size(); ++i) {
        double currentValue = values[i];
        double currentDistance = fabs(value - currentValue);
        if (currentDistance < bestDistance) {
            bestDistance = currentDistance;
            bestIndex = i;
        }
    }

    return bestIndex;
}

int findBestMatch(double value, const vector<double>& value)

{

if (values.size() == 0) {

return -1;

}

int bestIndex = 0;

double bestDistance = fabs(value - values[0]);

for (int i = 1; i < values.size(); ++i) {

double currentValue = values[i];

double currentDistance = fabs(value - currentValue);

if (currentDistance < bestDistance) {

bestDistance = currentDistance;

bestIndex = i;

}

return bestIndex;

}

But I’m still not happy with this routine: as it stands it works for doubles only, but isn’t it of general utility? The code would be identical for vectors of chars, ints, pointers, and complex numbers. Why not turn this code into a template function such that it can be put in a library?


template<typename T>
int findBestMatch(T value, const vector<T>& values)
    ...

template<typename T>

int findBestMatch(T value, const vector<T>& values)

...

There is a snag, however. The code uses ‘fabs‘ a function that returns the absolute value of a floating point value. We need it to be able to compare the distances of vector elements to our given value, but it only works with floats/doubles and ‘abs’ is for ints/longs only, so we cannot write truly generic code. But wait a minute, doesn’t cmath provide overloaded versions of ‘abs‘ that work with all types?


int abs(int x);                   // from cstdlib.
long abs(long x);                 // from cstdlib.
double abs(double x );            // from cmath.
float abs(float x );              // from cmath.
long double abs(long double x);   // from cmath.

int abs(int x); // from cstdlib.

long abs(long x); // from cstdlib.

double abs(double x ); // from cmath.

float abs(float x ); // from cmath.

long double abs(long double x); // from cmath.

So our generic implementation could look like this:


template<typename T>
int findBestMatch(T value, const vector<T>& values)
{
    if (values.size() == 0) {
        return -1;
    }

    int bestIndex = 0;
    T bestDistance = abs(value - values[0]);

    for (int i = 1; i < values.size(); ++i) {
        T currentValue = values[i];
        T currentDistance = abs(value - currentValue);
        if (currentDistance < bestDistance) {
            bestDistance = currentDistance;
            bestIndex = i;
        }
    }

    return bestIndex;
}

template<typename T>

int findBestMatch(T value, const vector<T>& values)

{

if (values.size() == 0) {

return -1;

}

int bestIndex = 0;

T bestDistance = abs(value - values[0]);

for (int i = 1; i < values.size(); ++i) {

T currentValue = values[i];

T currentDistance = abs(value - currentValue);

if (currentDistance < bestDistance) {

bestDistance = currentDistance;

bestIndex = i;

}

return bestIndex;

}

Looks neat, but we still have a problem: for simple scalar types, we can assume that ‘abs’ looks like this:


T abs(const T& value);

T abs(const T& value);

which means that the return type is the same as the argument type. Hence, code like


T bestDistance = abs(value - values[0]);

T bestDistance = abs(value - values[0]);

is perfectly fine.

But for complex types, like complex<>, the picture is different as complex is itself parameterized by a type:


complex<int> ci(1, 4);          // complex number based on ints.
complex<double> cd(1.1, 4.4);   // complex number based on doubles.

complex<int> ci(1, 4); // complex number based on ints.

complex<double> cd(1.1, 4.4); // complex number based on doubles.

and according to the mathematical definition, the return value of ‘abs’ on a complex number is not a complex number but rather a real number:


cout << abs(cd) << endl;     // prints 4.53542 (ie. sqrt(1.1 * 1.1 + 4.4 * 4.4))

cout << abs(cd) << endl; // prints 4.53542 (ie. sqrt(1.1 * 1.1 + 4.4 * 4.4))

Therefore, if T is complex<double>, this line


T bestDistance = abs(value - values[0]);

T bestDistance = abs(value - values[0]);

doesn’t compile: ‘bestDistance’ is of type complex<double> but the return value of ‘abs’ is double.

Are we stuck in a rut? Sometimes the result of ‘abs’ is T (eg. for int, float), sometimes it is something totally different (eg. for complex, valarray, pointers). It looks like we cannot provide a truly generic implementation.

Not so fast! If we use C++0x and one of my favorite features, automatic type inference, we can:


template<typename T>
int findBestMatch(T value, const vector<T>& values)
{
    if (values.size() == 0) {
        return -1;
    }

    int bestIndex = 0;
    auto bestDistance = abs(value - values[0]);

    for (int i = 1; i < values.size(); ++i) {
        T currentValue = values[i];
        auto currentDistance = abs(value - currentValue);
        if (currentDistance < bestDistance) {
            bestDistance = currentDistance;
            bestIndex = i;
        }
    }

    return bestIndex;
}

template<typename T>

int findBestMatch(T value, const vector<T>& values)

{

if (values.size() == 0) {

return -1;

}

int bestIndex = 0;

auto bestDistance = abs(value - values[0]);

for (int i = 1; i < values.size(); ++i) {

T currentValue = values[i];

auto currentDistance = abs(value - currentValue);

if (currentDistance < bestDistance) {

bestDistance = currentDistance;

bestIndex = i;

}

return bestIndex;

}

The ‘auto’ keyword is magic: in C++0x a variable declared as ‘auto’ gets the type of the expression that is assigned to it — what a great tool for developing generic code!

There you have it. Even though the original code worked fine, the final version is nicer: it provides better readability, portability, and genericity. On top of that, it is theoretically slightly more efficient because it saves one redundant comparison. Why theoretically? Because I haven’t measured execution times. What a breach of style. Shame on me.

Killing Your Loved Ones

You have been working on this new feature for days. Suddenly you are stuck. You are faced with a problem that you haven’t thought of so far. After some painful hours of blood, sweat, and tears you finally have a Eureka! moment: a brilliant idea that will solve your problem in a very elegant way, but it will take a few days to implement.

After you’re done, you feel overjoyed: what a nice solution, what a beautiful design (you even used template metaprogramming), well-crafted doxygen comments, no compiler warnings and — needless to say — all the unit tests pass. You walk around the office with a smile on your face, you want to tell everybody what marvels you have achieved… alas, the third person you bump into tells you that the problem you have solved is in fact no problem at all — neither was nor is. Perhaps in the distant future, but surely not today.

Your heart is pounding, you feel dizzy, you feel sick. You try to defend your work, but your colleague’s arguments are compelling.

What to do, you ask, after having gone through so much pain and effort; what to do with this beautiful code? I tell you what: admit defeat and take it out again, period.

We’ve all been there. Sometimes it’s just a few lines, sometimes many classes, worth many person-days: It’s not really required for the product, but it is already implemented. In such miserable moments, you can add the most value to your project by deleting what’s not needed. I know it is hard, due to what psychologists call cognitive dissonance, but it must be done. Remember Antoine de Saint-Exupéry’s words?

“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”

Usually, programmers have little difficulty accepting software minimalism principles as advocated by agile development practices like Extreme Programming (XP). As Ron Jeffries notes “Always implement things when you actually need them; never when you just foresee that you need them”. Clearly, every feature requires time to write, document, and test. And according to the YAGNI principle: “You Ain’t Gonna Need It”. But what if the time and money is already spent? Why not keep the code, it surely doesn’t harm?

The sad truth is that it does harm, even if it is already written, tested, and documented. The more code you have, the more complex a system becomes. Every additional line of code makes a system harder to read and understand, especially for new team members. It makes the system a little bit harder to refactor and maintain. Maybe someone wants to reuse your feature in another project, but rather reinvents the wheel because he doesn’t want to read and understand all that extra code. Further, such unnecessary code often leads to more unnecessary code in the same or other parts of the system.

Let me put it this way: “Anything that’s not worth implementing is not worth having in your code base”.

You needn’t (and shouldn’t) throw your code into a black hole, though. Push it on an experimental branch, or tar it up and put it in your project’s Wiki. This helps you to get over your loss and if you really need it someday, it’s there for you to be resurrected. Until that day comes, it’s best for your and your teammates if your code rests in peace.

Soap Bubbles

The other day I was flipping through “Beautiful Code“, presumably due to the fact that I felt a bit uneasy about what I recently wrote about imperfect code. By coincidence I’ve found this memorable quote by Alan Perlis:

“Is it possible that software is not like anything else; that it is meant to be discarded: that the whole point is to see it as a soap bubble?”

The Beauty of Imperfection

There once was a master programmer who wrote unstructured programs. A novice programmer, seeking to imitate him, also began to write unstructured programs. When the novice asked the master to evaluate his progress, the master criticized him for writing unstructured programs, saying, “What is appropriate for the master is not appropriate for the novice. You must understand the Tao before transcending structure.” (from The Tao of Programming)

Some time ago, I worked on a little command-line tool, when all of a sudden I stopped. Did I really just use a ‘goto’ statement?


int main(int argc, char *argv[])
{
    if (argc < 3) {
        ARG_ERROR:
        cerr << "Usage: " << argv[0] << " 0|1 <iterations>" << endl;
        exit(1);
    }
    int mode = atoi(argv[1]);
    int iterations = atoi(argv[2]);
    if (mode < 0 || mode > 1 || iterations < 0) goto ARG_ERROR;
    ...
}

int main(int argc, char *argv[])

{

if (argc < 3) {

ARG_ERROR:

cerr << "Usage: " << argv[0] << " 0|1 <iterations>" << endl;

exit(1);

}

int mode = atoi(argv[1]);

int iterations = atoi(argv[2]);

if (mode < 0 || mode > 1 || iterations < 0) goto ARG_ERROR;

...

}

What surprised me most about this code was that I didn’t feel any sense of guilt; somehow, it was just natural to use a ‘goto’ in this context.

Isn’t it considered harmful to use ‘goto’ statements? Or global variables? Or public attributes instead of using accessor methods? Or to duplicate code and data structures? In general, I think it is (of course) but there are circumstances when it is perfectly OK to write such “imperfect” code: If you know what you are doing and if you sacrifice perfection for a higher-level goal. This is very similar to a gambit in chess-playing: you trade a piece for a strategic long-term advantage over your opponent.

In my case, the higher-level goal was to actually do write a little tool for my personal use that allowed me to do things automatically — things I used to do manually (and due to the pain of manual work, frequently omitted). Granted, it would not have taken much effort to put the error reporting code into a routine of its own. But I guess that if I had spent enough time thinking about all software quality aspects like readability, testability and documentation, I would have never written this otherwise useful tool. It’s always better to have a tool that saves you from manual labor but uses goto statements than no tool at all. So my use of a ‘goto’ is just one particular manifestation of my “go-as-fast-as-possible-or-it-will-never-be-done” mindset.

You could also call this style of programming “deliberate hacking”: you know that you do something suboptimal but you do it on purpose for good (sometimes noble) reasons. This style of hacking should not be confused with bad hacking, where you write bad code either unconsciously or consciously for a bad reason (like pure laziness or obscuring your code to make yourself indispensable).

It is a good idea to keep software quality gambits in the private implementation details, where as little people as possible are affected. As a C programmer, do you care about any hacks in the ‘make’ program’s source code? Most likely not! However, many ‘make’ users complain about the fact that ‘make’ is whitespace sensitive: you have to use TAB characters to indent build commands — spaces may result in obscure errors.

For sure, a software quality gambit increases technical debt, but it is important to note that you get an asset with a much higher value in return. Ideally, the value of your asset doesn’t decrease (or only slowly decreases) over time, so that the technical debt isn’t noticeable or at least doesn’t harm. I’m convinced that the decision to use TAB characters in ‘make’ was a good hack at the time, but its asset value has long become zero. One possibility would be to change ‘make’ such that it works with TABs and spaces, but is it really worth it? For instance, how confusing and annoying would it be if makefiles that use spaces (and work with the latest version of our imaginary ‘make’) result in obscure errors on systems where only older versions of ‘make’ are present? It’s probably the best to embrace this historical beauty mark as a given fact.

In general, I think we must fight perfectionism in software development to avoid losing track of our main goal: shipping working software with good enough quality. Next time you find a hack, try to understand its nature and motives. Was it a good hack or a bad hack (frequently good hacks are marked with explanatory comments, like TODO comments). Maybe you can even develop a sense for appreciating such good hacks. This is akin to “Wabi Sabi“, a Japanese art concept. Wabi Sabi is sometimes described as seeing beauty in things that are imperfect, impermanent and incomplete. Wabi Sabi art acknowledges three simple realities: nothing lasts, nothing is finished, and nothing is perfect. I think we should accept these realities for software as well.

Approxion

Code – People – Everything

Code Kata 2: The “Average” Developer

The Manager in the Room

A Small Matter of Style?

Killing Your Loved Ones

Soap Bubbles

The Beauty of Imperfection