Some claim that there is a reason for everything. Well, I’m not so sure about this statement, but surely there is a reason for Code Kata 1: it was inspired by a routine that I found in a real-world product some time ago. It looked more or less like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
1 int findBestMatch(double value, const vector<double>& values) 2 { 3 double bestDistance = 1.7976931348623158e+308; 4 int bestIndex = 0; 5 6 for (int i = 0; i < values.size(); ++i) { 7 double currentValue = values[i]; 8 if (fabs(value - currentValue) < bestDistance) { 9 bestDistance = fabs(value - currentValue); 10 bestIndex = i; 11 } 12 } 13 14 if (bestIndex < values.size()) { 15 return bestIndex; 16 } 17 return -1; 18 } |
The job of this routine is to find a value in a vector that matches the given reference value as close as possible. It works by visiting each element of the vector, calculating the absolute difference to the reference value and, if the new difference is smaller than the previous, updating the ‘bestIndex’, the index of the value that matches best.
This routine has proven to work reliably, at least no errors have been found so far. But in my view, it nevertheless leaves a lot to be desired. Let’s dissect this code step by step.
First of all, there are no comments, but I don’t think this is a big issue here. In my view, the code is quite short and the identifier names give sufficient meaning.
The most noticeable peculiarity about this code is the hideous floating point literal 1.7976931348623158e+308 on line 3. What the heck is this?
If you study the for loop and find out how the algorithm works it becomes clear: ‘bestDistance’ is initialized to the largest possible value such that the condition on line 8 becomes ‘true’ for the first element, in order to get a first candidate.
But using a hard coded floating-point literal is the worst approach to achieve this. Why not use ‘DBL_MAX’ (from cfloat/float.h) or ‘numeric_limits
The algorithm itself is not perfect either. Like I said, during the first iteration, the comparison on line 8 will always fail as ‘bestDistance’ is initialized to the largest possible value. Why not save one iteration by initializing ‘bestDistance’ to the distance between the reference value and the first vector element? Sounds like a good idea, but this approach will work only if we know that the vector is not empty. Which brings us to the topic of the odd way in which errors are handled in this routine.
Have a look ath lines 14 and 15. The routine returns the found ‘bestIndex’ only if it is within the bounds of the vector; otherwise it returns -1 (for failure). Under what circumstances is it possible that ‘bestIndex’ is outside the range of the vector? It turns out that it can’t — unless the vector is completely empty. So why not make this check explicit and put it near the top of the routine before the actual search starts?
Here is an improved implementation, based on my previous observations:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
int findBestMatch(double value, const vector<double>& value) { if (values.size() == 0) { return -1; } int bestIndex = 0; double bestDistance = fabs(value - values[0]); for (int i = 1; i < values.size(); ++i) { double currentValue = values[i]; double currentDistance = fabs(value - currentValue); if (currentDistance < bestDistance) { bestDistance = currentDistance; bestIndex = i; } } return bestIndex; } |
But I’m still not happy with this routine: as it stands it works for doubles only, but isn’t it of general utility? The code would be identical for vectors of chars, ints, pointers, and complex numbers. Why not turn this code into a template function such that it can be put in a library?
1 2 3 4 5 |
template<typename T> int findBestMatch(T value, const vector<T>& values) ... |
There is a snag, however. The code uses ‘fabs‘ a function that returns the absolute value of a floating point value. We need it to be able to compare the distances of vector elements to our given value, but it only works with floats/doubles and ‘abs’ is for ints/longs only, so we cannot write truly generic code. But wait a minute, doesn’t cmath provide overloaded versions of ‘abs‘ that work with all types?
1 2 3 4 5 6 7 |
int abs(int x); // from cstdlib. long abs(long x); // from cstdlib. double abs(double x ); // from cmath. float abs(float x ); // from cmath. long double abs(long double x); // from cmath. |
So our generic implementation could look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
template<typename T> int findBestMatch(T value, const vector<T>& values) { if (values.size() == 0) { return -1; } int bestIndex = 0; T bestDistance = abs(value - values[0]); for (int i = 1; i < values.size(); ++i) { T currentValue = values[i]; T currentDistance = abs(value - currentValue); if (currentDistance < bestDistance) { bestDistance = currentDistance; bestIndex = i; } } return bestIndex; } |
Looks neat, but we still have a problem: for simple scalar types, we can assume that ‘abs’ looks like this:
1 2 3 |
T abs(const T& value); |
which means that the return type is the same as the argument type. Hence, code like
1 2 3 |
T bestDistance = abs(value - values[0]); |
is perfectly fine.
But for complex types, like complex<>, the picture is different as complex is itself parameterized by a type:
1 2 3 4 |
complex<int> ci(1, 4); // complex number based on ints. complex<double> cd(1.1, 4.4); // complex number based on doubles. |
and according to the mathematical definition, the return value of ‘abs’ on a complex number is not a complex number but rather a real number:
1 2 3 |
cout << abs(cd) << endl; // prints 4.53542 (ie. sqrt(1.1 * 1.1 + 4.4 * 4.4)) |
Therefore, if T is complex<double>, this line
1 2 3 |
T bestDistance = abs(value - values[0]); |
doesn’t compile: ‘bestDistance’ is of type complex<double> but the return value of ‘abs’ is double.
Are we stuck in a rut? Sometimes the result of ‘abs’ is T (eg. for int, float), sometimes it is something totally different (eg. for complex, valarray, pointers). It looks like we cannot provide a truly generic implementation.
Not so fast! If we use C++0x and one of my favorite features, automatic type inference, we can:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
template<typename T> int findBestMatch(T value, const vector<T>& values) { if (values.size() == 0) { return -1; } int bestIndex = 0; auto bestDistance = abs(value - values[0]); for (int i = 1; i < values.size(); ++i) { T currentValue = values[i]; auto currentDistance = abs(value - currentValue); if (currentDistance < bestDistance) { bestDistance = currentDistance; bestIndex = i; } } return bestIndex; } |
The ‘auto’ keyword is magic: in C++0x a variable declared as ‘auto’ gets the type of the expression that is assigned to it — what a great tool for developing generic code!
There you have it. Even though the original code worked fine, the final version is nicer: it provides better readability, portability, and genericity. On top of that, it is theoretically slightly more efficient because it saves one redundant comparison. Why theoretically? Because I haven’t measured execution times. What a breach of style. Shame on me.