Category Archives: C/C++/Embedded

“inline” Is Yet Another Word For “Premature Optimization”

The fact that some C++ developers use the ‘inline’ keyword so much has always been a conundrum to me — I’ve never liked it. Why? First and foremost because it clutters up header files and exposes implementation details to the users of a class.

Most likely, inline aficionados believe that these disadvantages are more than compensated for by the fact that inlining gives them faster code, but this is not necessarily the case: according to the C++ standard (ISO/IEC 14882:2014), the compiler is allowed to silently ignore the ‘inline’ keyword:

“An implementation is not required to perform this inline substitution at the point of call”

Believing is not knowing, as the old saying goes. This is another reason why I don’t like the ‘inline’ keyword: it doesn’t guarantee you anything.

But let’s attack the ‘inline’ keyword from another angle. Even if we knew that declaring a method inline made it faster, shouldn’t we have to ask ourselves first if there is actually a performance case? Without profiling, without a proven need, any optimization is premature optimization, which — according to Donald Knuth — is the root of all evil. The fact that an optimization gives a local improvement doesn’t justify it sufficiently — it’s the overall improvement of the major use cases that matters. Otherwise we would implement all of our functions with inline assembly, wouldn’t we?

In the old days of C programming, developers used the ‘register’ keyword as a hint to tell the compiler what variables should be kept in registers for performance reasons. Nowadays, every C compiler is much better at allocating variables to registers than any human being. Consequently, the ‘register’ keyword was deprecated in C++11 (and removed as a storage class specifier in C++17); in C it still exists, but compilers are free to ignore the hint.

By the same token, today’s C++ compilers do a much better job of figuring out which functions should be inlined than we are able to do. Therefore, instead of giving hints to the compiler, we should rather rely on automated, transparent inlining that doesn’t clutter up class interfaces.

As an example, at optimization level -O2, the g++ compiler automatically inlines all functions that are small or called only once. Specifying -finline-functions (enabled by default at -O3) makes the compiler use a heuristic to determine whether it’s worthwhile to inline a function or not, without the need for any developer intervention.

To me, it’s about time that ‘inline’ goes the way of the ‘register’ keyword.

Counting Down Correctly in C

The countdown to the New Year is nearing its end, so I want to take this opportunity to discuss how to implement loops that count down from an upper boundary to a lower boundary. I know it sounds mundane, but I will present a technique that is, at least in my experience, not widely known, not even amongst seasoned C coders (with the notable exception of Chuck Norris, of course).

But first, please take a moment to look at the following routine that employs a countdown for-loop and decide if it works correctly or not:


/* Return the highest index of given character in array. Search from 
   upper (exclusive) to lower (inclusive) index. If character is not
   found, return upper index. */
size_t rfind(const char* array, size_t lower, size_t upper, char c) {
    size_t i;

    for (i = upper - 1; i >= lower; --i) {
        if (array[i] == c) {
            return i;
        }
    }

    return upper;
}

This code appears to be fine, but it has a flaw that shows only when the ‘lower’ index is 0: ‘size_t’ is an unsigned type, and when ‘i’ becomes 0, subtracting 1 yields a very large positive number (due to integer wrap-around) which in turn causes an out-of-bounds access to the given ‘array’. So what do we need to change such that the code works as expected, even for a lower bound of 0?

Most developers’ knee-jerk reaction is to change the type of the indices to a signed type, like ‘int’, but this is unfortunate, as it limits (at least halves) the available value range. As often in life, the proper solution is not to fight the enemy but to turn him into a friend: let’s use unsigned wrap-around to our advantage:


size_t rfind(const char* array, size_t lower, size_t upper, char c) {
    size_t i;

    for (i = upper - 1; i != lower - 1; --i) {
        if (array[i] == c) {
            return i;
        }
    }

    return upper;
}

Instead of the greater-than-or-equal operator, we now use the not-equals operator, and instead of comparing against ‘lower’ we now compare against one less than ‘lower’. If ‘lower’ happens to be 0, ‘lower - 1’ (again, due to integer wrap-around) yields the maximum value representable by type ‘size_t’. The same happens to the loop counter ‘i’ when it has a value of 0 and is decremented once more. As a consequence, the expression ‘i != lower - 1’ becomes false and the loop terminates, as desired.
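To try the wrap-around version in isolation, here is a minimal, self-contained sketch. The name ‘rfind_wrap’ and the ‘assert’ precondition are my own additions for illustration; the loop is the one from the corrected routine.

```c
#include <assert.h>
#include <stddef.h>

/* Same search as the corrected rfind; the name rfind_wrap is only used
   in this sketch. Returns 'upper' if 'c' is not found in [lower, upper). */
size_t rfind_wrap(const char* array, size_t lower, size_t upper, char c) {
    size_t i;

    assert(lower <= upper);  /* assumed precondition, not stated in the post */

    /* For lower == 0, 'lower - 1' wraps to SIZE_MAX, which is exactly the
       value 'i' takes on when it is decremented past 0. */
    for (i = upper - 1; i != lower - 1; --i) {
        if (array[i] == c) {
            return i;
        }
    }

    return upper;
}
```

With the string "hello" and the range [0, 5), searching for ‘l’ yields index 3, while searching for a character that is not present yields 5, even though the lower bound is 0.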

A Happy New Year to all of my faithful readers!

Dangerously Confusing Interfaces III

Just like the other “Dangerously Confusing Interfaces” posts, this one was also inspired by a real-world blunder that I made.

Here’s the background: usually, routines that accept data via a pointer from the caller either execute synchronously or copy the data into their own internal data structures for later processing. Take the venerable ‘fwrite’ from the C standard library as an example:


size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream);

‘fwrite’ blocks until the data has been written, either to disk or to an internal buffer. In either case, once ‘fwrite’ returns, it doesn’t care about the original data anymore. That’s why it’s safe (and common practice) to pass a pointer to a local buffer on the stack:


void LogHello() {
    const char message[] = "Hello";
    fwrite(message, sizeof(message), 1, g_log_file);
}

Almost all standard library and POSIX APIs behave like ‘fwrite’, which is both safe and convenient (‘putenv’, which keeps the pointer it is given, is a rare exception). However, with embedded systems, the story is different: in some cases, memory is so tight that additional buffers/internal storage can’t be afforded. Such functions don’t copy the provided data but only store a pointer to your data and expect the memory pointed to by this pointer to be still valid long after the function call has returned. Here is an example from the AUTOSAR standard, which is used by almost all embedded automotive products:


Std_ReturnType NvM_WriteBlock(NvM_blockIdType blockId, const uint8* data);

‘NvM_WriteBlock’ is used to store data to a given non-volatile memory block. However, what this function does is only enqueue a request for the given block ID together with the data pointer (not a copy of your data). This is done for the sake of efficiency, because there can be multiple write requests in parallel. The queue is later processed in another task, long after any local buffer would have been removed from the stack.

Passing a pointer to a buffer with automatic storage is an easy mistake to make, especially since such “non-copy” interfaces are so rarely encountered. How can “write-like” interfaces that don’t make a copy of the provided data be made safer, such that misuse is less likely? Obviously, just adding documentation is not enough — nobody reads documentation, especially in the heat of the moment.
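To make the pitfall tangible without invoking undefined behavior, here is a small sketch of a “non-copy” interface. Everything in it (‘enqueue_write’, ‘process_queue’, the single-slot queue) is hypothetical and not part of AUTOSAR; it only mimics the store-the-pointer behavior described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical single-slot "queue" that stores only the caller's pointer,
   in the spirit of NvM_WriteBlock. None of these names are AUTOSAR APIs. */
static const unsigned char* pending_data = NULL;
static size_t pending_len = 0;

/* Enqueue a write request: no copy of the data is made. */
void enqueue_write(const unsigned char* data, size_t len) {
    assert(data != NULL);
    pending_data = data;
    pending_len = len;
}

/* Runs "later" (think: another task). Copies whatever the stored pointer
   refers to at that moment into 'out' and returns the number of bytes. */
size_t process_queue(unsigned char* out, size_t out_size) {
    size_t n = pending_len < out_size ? pending_len : out_size;

    if (pending_data == NULL) {
        return 0;
    }
    memcpy(out, pending_data, n);
    pending_data = NULL;
    pending_len = 0;
    return n;
}

/* A buffer with static storage duration stays valid until the queue is
   processed; a local array would not. */
static unsigned char block[4] = { 'A', 'A', 'A', 'A' };

/* Shows that the deferred write sees later changes to the buffer.
   Returns the first byte that actually gets "written". */
unsigned char demo_deferred_write(void) {
    unsigned char written[4] = { 0 };

    enqueue_write(block, sizeof(block));
    memset(block, 'B', sizeof(block));   /* modified before processing */
    process_queue(written, sizeof(written));
    return written[0];                   /* 'B', not 'A' */
}
```

The deferred write picks up the modified contents because only the pointer was queued. With a local buffer instead of ‘block’, the pointer would dangle by the time the queue is processed.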

In my view, the root of the problem is that such functions accept just about any pointer. What if the caller was forced to explicitly cast the pointer to another type? A type with a cunningly chosen typename, one that reminded the caller of the potential pitfall? Here is my approach:


typedef struct {
    char dummy;
} uncopied_memory;

void SomeWritelikeFunction(const uncopied_memory* data, size_t len);

Whenever a pointer is passed to this function, developers have to write something like this to make the compiler happy:


SomeWritelikeFunction((uncopied_memory*) data, sizeof(data));

Typing ‘uncopied_memory’ should shake up even the most focused developers and remind them to double-check what they are passing into this function.

Of course, within ‘SomeWritelikeFunction’, the provided pointer needs to be cast back into something more useful, like a ‘const uint8_t*’. Further, note that the ‘dummy’ member within ‘uncopied_memory’ must not be used; it only exists to make sure that the cast to ‘uncopied_memory*’ in the calling function is safe: a struct is aligned at least as strictly as its first member, and since that member is a ‘char’, which has the weakest alignment requirement of all types, any object pointer is suitably aligned for ‘uncopied_memory’.
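Putting the pieces together, a complete round trip might look like the following sketch. The bookkeeping variables ‘stored_data’ and ‘stored_len’, the ‘settings’ buffer and ‘SaveSettings’ are assumptions made up for this example; a real implementation would enqueue the request instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char dummy;   /* never accessed; only gives the type a size */
} uncopied_memory;

/* Hypothetical bookkeeping for this sketch: the last request is kept in
   two file-scope variables instead of a real queue. */
static const uint8_t* stored_data = NULL;
static size_t stored_len = 0;

void SomeWritelikeFunction(const uncopied_memory* data, size_t len) {
    assert(data != NULL);
    /* Cast back to a byte pointer; the data itself is NOT copied. */
    stored_data = (const uint8_t*) data;
    stored_len = len;
}

/* Static storage duration: still valid when the request is processed. */
static uint8_t settings[3] = { 1, 2, 3 };

void SaveSettings(void) {
    /* The explicit cast is the reminder: this memory is not copied. */
    SomeWritelikeFunction((const uncopied_memory*) settings, sizeof(settings));
}
```

Note that the pointer is only converted back and forth; nothing is ever read or written through the ‘uncopied_memory’ type itself.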

Using the C Preprocessor to Perform Compile-time Computations

Sometimes, it is desirable to perform computations already at compile-time — either for efficiency or to avoid redundancy. Alas, what a compiler can compute at compile-time is rather limited — mostly just a combination of unary and binary operators. What if you need to do something more complex?

For the sake of illustration, I chose computing the (floored) log2 of an integer as an example, but the techniques presented below can be easily adapted to other use cases:


const int MY_CONST = STATIC_LOG2(1234); // Compute log2(1234)
                                        // at compile-time

In C++ — provided you’re brave enough — you can always resort to template metaprogramming:


#include <climits> // CHAR_BIT

template<int value, int shift>
struct static_log2_impl {
    enum { result = value & (1 << shift) ? shift : 
        static_log2_impl<value, shift - 1>::result 
    };
};

template<int value>
struct static_log2_impl<value, -1> {
    enum { result = -1 };
};

#define STATIC_LOG2(value)\
    (static_log2_impl<value, sizeof(int) * CHAR_BIT - 1>::result)

But since template metaprogramming was not deliberately built into C++ but rather discovered by accident, template metaprogramming code is far from pleasant to look at. If you are lucky and your compiler supports C++11 (or rather C++11’s constexpr feature), you have a better option:


#include <climits> // CHAR_BIT

constexpr int static_log2_constexpr_impl(int value, int shift) {
    return shift == -1 ? -1 : 
        value & (1<< shift) ? shift :
            static_log2_constexpr_impl(value, shift - 1);
}

#define STATIC_LOG2(value)\
    (static_log2_constexpr_impl(value, sizeof(int) * CHAR_BIT - 1))

It’s still recursive, but at least this solution uses real functions and not structs to achieve its goal, which is much easier on the eyes!

But what if you code at the other end of the spectrum? What if you are limited to plain C?

Many years ago, I discovered the following technique that has proven useful in quite a few situations; it is used like this:


#define STATIC_LOG2_ARG 89
#include "static_log2.h"
char myarray[STATIC_LOG2_VALUE]; // sizeof(myarray) == 6

The “argument” is passed via a symbolic constant (STATIC_LOG2_ARG), the computation is done by “calling the function” (by including static_log2.h) and the “return value” is stored in another symbolic constant (STATIC_LOG2_VALUE).

Here’s an excerpt of what’s contained in the static_log2.h header file:


#if STATIC_LOG2_ARG & (1 << 31)
#define STATIC_LOG2_VALUE 31

#elif STATIC_LOG2_ARG & (1 << 30)
#define STATIC_LOG2_VALUE 30

#elif STATIC_LOG2_ARG & (1 << 29)
#define STATIC_LOG2_VALUE 29 

#elif STATIC_LOG2_ARG & (1 << 28)
#define STATIC_LOG2_VALUE 28 

...

#elif STATIC_LOG2_ARG & (1 << 1)
#define STATIC_LOG2_VALUE 1

#elif STATIC_LOG2_ARG & (1 << 0)
#define STATIC_LOG2_VALUE 0

#else
#define STATIC_LOG2_VALUE -1 
#endif

In the C++ examples, iteration is done using recursion, but here everything is unrolled/inlined.
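For experimenting in a single file, the technique can be reduced to the sketch below. It is an assumption-laden miniature: the #if chain is written inline instead of living in static_log2.h, and it only covers arguments below 256. The ‘static_assert’ (C11) is my addition to check the result at compile-time.

```c
#include <assert.h>   /* static_assert (C11) */

/* Reduced, inlined variant of the header technique for arguments < 256.
   In the post, this #if chain lives in static_log2.h and covers 32 bits. */
#define STATIC_LOG2_ARG 89

#if STATIC_LOG2_ARG & (1 << 7)
#define STATIC_LOG2_VALUE 7
#elif STATIC_LOG2_ARG & (1 << 6)
#define STATIC_LOG2_VALUE 6
#elif STATIC_LOG2_ARG & (1 << 5)
#define STATIC_LOG2_VALUE 5
#elif STATIC_LOG2_ARG & (1 << 4)
#define STATIC_LOG2_VALUE 4
#elif STATIC_LOG2_ARG & (1 << 3)
#define STATIC_LOG2_VALUE 3
#elif STATIC_LOG2_ARG & (1 << 2)
#define STATIC_LOG2_VALUE 2
#elif STATIC_LOG2_ARG & (1 << 1)
#define STATIC_LOG2_VALUE 1
#elif STATIC_LOG2_ARG & (1 << 0)
#define STATIC_LOG2_VALUE 0
#else
#define STATIC_LOG2_VALUE -1
#endif

/* 6 elements, since 64 <= 89 < 128 */
char myarray[STATIC_LOG2_VALUE];

static_assert(sizeof(myarray) == 6, "floor(log2(89)) must be 6");
```

The first branch whose bit is set wins, so the chain has to be ordered from the highest bit to the lowest.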

For another case where this approach is employed, check out this post. You probably don’t need this technique very often, but it’s good to have it in your bag of macro tricks.

Bug Hunting Adventures #11: Bad Weather

“It is only in sorrow bad weather masters us;
in joy we face the storm and defy it”
— Amelia Barr

Imagine a weather monitoring system where environmental data is collected by various sensors and distributed via messages to other components for further processing.

In the code below, produce_env_measurement() represents a task that constantly produces messages containing various environmental measurements while another task (represented by process_env_measurement()) consumes them. To ensure data integrity, a Fletcher-16 checksum is appended to every message, but the application nevertheless doesn’t work reliably. Where’s the bug?

Code
Solution

Distrusting Experts

“You don’t know what you don’t know until you know it.”
— anon

I recently wrote a post about Scott Meyers, one of my programming heroes. The one I’m writing about today ranks even higher: Michael Abrash. For decades, I have admired his ability to craft awesome, non-obvious, super-tight code. But he’s not just an excellent performance programmer; he is also an excellent writer and storyteller.

It’s no coincidence that he’s also the author of one of my favorite programming books: “ZEN of Code Optimization” [1]. Released in 1994, it predates the Internet age (lo and behold, it comes with a 3.5″ floppy disk), but most of the content is timeless advice for every developer who cares about performance. (We all should. Michael’s rule #1: “From the user’s perspective, performance is fundamental”.)

In chapter 10, “Patient Coding, Faster Code”, he explains that just doing micro optimizations (or recoding in assembly language) usually doesn’t cut it; all you end up with is “fast slow code”. For best results, you need key insights; that is, view the problem from a different angle and use an offbeat approach to solve it.

He presents the venerable “greatest common divisor of two given numbers” problem as a case in point. After toying with a brute-force approach (incrementing a variable and checking if it divides both numbers) he employs Euclid’s key insight, which improves the run-time by orders of magnitude:

gcd(int1, int2) ≡ gcd(int2, int1 % int2)

Or in plain English: the GCD of a larger positive integer ‘int1’ and a smaller positive integer ‘int2’ is the same as the GCD of ‘int2’ and the remainder of ‘int1’ divided by ‘int2’.

This is his original code:


unsigned int gcd(unsigned int int1, unsigned int int2) {
   unsigned int temp;

   /* Swap if necessary to make sure that int1 >= int2 */
   if (int1 < int2) {
      temp = int1;
      int1 = int2;
      int2 = temp;
   }
   /* Now loop, dividing int1 by int2 and checking the remainder,
      until the remainder is 0. At each step, if the remainder isn't
      0, assign int2 to int1, and the remainder to int2, then
      repeat */
   for (;;) {
      /* If the remainder of int1 divided by int2 is 0, then int2 is
         the gcd */
      if ((temp = int1 % int2) == 0) {
         return(int2);
      }
      /* Make int2 the larger integer and the remainder the
         smaller integer, and repeat the process */
      int1 = int2;
      int2 = temp;
   }
}

Needless to say, being the assembly wizard that he is, he goes on and reimplements this routine in assembly language to make it even faster. But as for C, this is probably as fast as it can get. Or is it?

I’ve used this code blindly and almost literally for many, many years. After all, it’s from Michael Abrash, and to me, his code has always been beyond any doubt. Definitely not the right attitude, but let me explain.

I’m a big fan of code katas, little exercises that you do (and redo time and again, sometimes in different programming languages) to flex your programming muscle. I don’t know why, but when I did the GCD code kata this time, I stopped and scratched my head. What the heck is this int swapping business about? Why is this code necessary, as the comment says?

Just for the fun of it, I took it out and reran the unit tests — they all passed.

Actually, it’s not that hard to see why it’s not necessary. Let’s assume ‘int1’ is 366 and ‘int2’ is 60. Since ‘int1’ is greater than ‘int2’, no swapping is necessary. Within the ‘for’ loop, the variables take on the following values:

iteration   temp            int1   int2   return
1           366 % 60 = 6    60     6
2           60 % 6 = 0      60     6      6

Now if ‘int1’ is 60 and ‘int2’ is 366, the code as it is would swap their values, but let’s pretend there was no swapping code:

iteration   temp             int1   int2   return
1           60 % 366 = 60    366    60
2           366 % 60 = 6     60     6
3           60 % 6 = 0       60     6      6

Aha! There is one more iteration and this additional iteration (actually the very first iteration) swaps ‘int1’ and ‘int2’. The ‘for’ loop by itself takes care of rearranging the arguments; there is no need for the extra swap test at the beginning of the ‘gcd’ routine. I was puzzled. Is it really possible that I have discovered redundant code in Michael Abrash’s sacred code? I felt quite uneasy because I wanted my hero to be infallible.
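For reference, this is roughly what the routine looks like with the swap test taken out. The name ‘gcd_no_swap’ and the ‘assert’ are mine, not Michael Abrash’s; treat it as a sketch of my experiment rather than as his code.

```c
#include <assert.h>

/* Euclid's algorithm without the initial swap test. The name gcd_no_swap
   is only used in this sketch. */
unsigned int gcd_no_swap(unsigned int int1, unsigned int int2) {
    unsigned int temp;

    assert(int2 != 0);  /* assumed precondition: avoids division by zero */

    for (;;) {
        /* If int1 < int2, the remainder is int1 itself, so the first
           iteration simply exchanges the two values. */
        if ((temp = int1 % int2) == 0) {
            return int2;
        }
        int1 = int2;
        int2 = temp;
    }
}
```

Both ‘gcd_no_swap(366, 60)’ and ‘gcd_no_swap(60, 366)’ return 6; the second call just needs one more trip through the loop.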

A couple of days ago, I sent him an email about my findings. His reply was short and to the point:

“Believe me, I never claimed to be immune to cycle-wasting! And anyway, There Ain’t No Such Thing As The Fastest Code :)”

There you have it. It’s that simple. I found unnecessary code in a routine that is more than 20 years old. That’s what I call delayed gratification!

The basic problem was that I blindly trusted what I saw, probably because it came from an expert. However, nobody is perfect; every once in a while, even the best make blunders. Have a flexible mind, assume nothing, and question everything: this is the right attitude for every software developer, and it is proclaimed in many of Michael’s writings [2].

—

[1] “ZEN of Code Optimization” is available online as part I of “Michael Abrash’s Graphics Programming Black Book, Special Edition”.

[2] Some years ago, I gave a presentation titled “The Art of Writing Efficient Software”. It includes a lot of Michael Abrash’s advice plus more. Check it out here. I truly believe that efficiency is still (and forever will be) of utmost importance for end users.

Approxion

Code – People – Everything