Bug Hunting Adventures #11: Bad Weather

“It is only in sorrow bad weather masters us;
in joy we face the storm and defy it”
— Amelia Barr

Imagine a weather monitoring system where environmental data is collected by various sensors and distributed via messages to other components for further processing.

In the code below, produce_env_measurement() represents a task that constantly produces messages containing various environmental measurements while another task (represented by process_env_measurement()) consumes them. To ensure data integrity, a Fletcher-16 checksum is appended to every message, but the application nevertheless doesn’t work reliably. Where’s the bug?

Code
Solution
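
The puzzle's own code is not reproduced here. For readers who have not met the checksum before, the following is a minimal sketch of Fletcher-16 as it is commonly defined (two running sums, each taken modulo 255). The function name `fletcher16` and its signature are assumptions made for illustration and are not taken from the puzzle's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Fletcher-16 as commonly defined: two running sums, each modulo 255.
   Name and signature are illustrative, not taken from the puzzle code. */
uint16_t fletcher16(const uint8_t *data, size_t len) {
    uint16_t sum1 = 0;  /* plain sum of all bytes */
    uint16_t sum2 = 0;  /* sum of the intermediate values of sum1 */
    for (size_t i = 0; i < len; ++i) {
        sum1 = (uint16_t)((sum1 + data[i]) % 255);
        sum2 = (uint16_t)((sum2 + sum1) % 255);
    }
    /* sum2 goes into the high byte, sum1 into the low byte */
    return (uint16_t)((sum2 << 8) | sum1);
}
```

Because the second sum depends on the order of the bytes, the checksum covers every byte that is handed to it, in sequence, which is worth keeping in mind when hunting for the bug.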

Distrusting Experts

“You don’t know what you don’t know until you know it.”
— anon

I recently wrote a post about Scott Meyers, one of my programming heroes. The one I’m writing about today ranks even higher: Michael Abrash. For decades, I have admired his ability to craft awesome, non-obvious, super-tight code. But he’s not just an excellent performance programmer; he is also an excellent writer and storyteller.

It’s no coincidence that he’s also the author of one of my favorite programming books: “ZEN of Code Optimization” [1]. Released in 1994, it predates the Internet age (lo and behold, it comes with a 3.5″ floppy disk), but most of the content is timeless advice for every developer who cares about performance. (We all should. Michael’s rule #1: “From the user’s perspective, performance is fundamental”.)

In chapter 10, “Patient Coding, Faster Code”, he explains that just doing micro optimizations (or recoding in assembly language) usually doesn’t cut it; all you end up with is “fast slow code”. For best results, you need key insights; that is, view the problem from a different angle and use an offbeat approach to solve it.

He presents the venerable “greatest common divisor of two given numbers” problem as a case in point. After toying with a brute-force approach (incrementing a variable and checking if it divides both numbers) he employs Euclid’s key insight, which improves the run-time by orders of magnitude:

gcd(int1, int2) ≡ gcd(int2, int1 % int2)

Or in plain English: the GCD of a larger positive integer ‘int1’ and a smaller positive integer ‘int2’ is the same as the GCD of ‘int2’ and the remainder of ‘int1’ divided by ‘int2’.

This is his original code:


unsigned int gcd(unsigned int int1, unsigned int int2) {
   unsigned int temp;

   /* Swap if necessary to make sure that int1 >= int2 */
   if (int1 < int2) {
      temp = int1;
      int1 = int2;
      int2 = temp;
   }
   /* Now loop, dividing int1 by int2 and checking the remainder,
      until the remainder is 0. At each step, if the remainder isn't
      0, assign int2 to int1, and the remainder to int2, then
      repeat */
   for (;;) {
      /* If the remainder of int1 divided by int2 is 0, then int2 is
         the gcd */
      if ((temp = int1 % int2) == 0) {
         return(int2);
      }
      /* Make int2 the larger integer and the remainder the
         smaller integer, and repeat the process */
      int1 = int2;
      int2 = temp;
   }
}

Needless to say, being that assembly wizard that he is, he goes on and reimplements this routine in assembly language to make it even faster. But as for C, this is probably as fast as it can get. Or is it?

I’ve used this code blindly and almost literally for many, many years. After all, it’s from Michael Abrash, and to me, his code has always been beyond any doubt. Definitely not the right attitude, but let me explain.

I’m a big fan of code katas, little exercises that you do (and redo time and again, sometimes in different programming languages) to flex your programming muscle. I don’t know why, but when I did the GCD code kata this time, I stopped and scratched my head. What the heck is this int swapping business about? Why is this code necessary, as the comment says?

Just for the fun of it, I took it out and reran the unit tests — they all passed.

Actually, it’s not that hard to see why it’s not necessary. Let’s assume ‘int1’ is 366 and ‘int2’ is 60. Since ‘int1’ is greater than ‘int2’, no swapping is necessary. Within the ‘for’ loop, the variables take on the following values:

iteration	temp	int1	int2	return
1	366 % 60	60	6
2	60 % 6	60	6	6

Now if ‘int1’ is 60 and ‘int2’ is 366, the code as it is would swap their values, but let’s pretend there was no swapping code:

iteration	temp	int1	int2	return
1	60 % 366	366	60
2	366 % 60	60	6
3	60 % 6	60	6	6

Aha! There is one more iteration, and this additional iteration (actually the very first one) swaps ‘int1’ and ‘int2’: since 60 % 366 is 60, ‘int2’ ends up with the old value of ‘int1’ while ‘int1’ takes the old value of ‘int2’. The ‘for’ loop by itself takes care of rearranging the arguments; there is no need for the extra swap test at the beginning of the ‘gcd’ routine. I was puzzled. Is it really possible that I have discovered redundant code in Michael Abrash’s sacred code? I felt quite uneasy because I wanted my hero to be infallible.
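
As a sketch of the observation above, here is the routine with the swap test removed. The name `gcd_no_swap` is made up for this illustration, and the `assert` merely documents the assumption that the second argument is not zero (the text only considers positive integers).

```c
#include <assert.h>

/* Euclid's algorithm without the initial swap.
   If int1 < int2, the first pass computes temp == int1 and thereby
   exchanges the two values on its own.
   Assumption: int2 is not 0 (the text only deals with positive integers). */
unsigned int gcd_no_swap(unsigned int int1, unsigned int int2) {
    unsigned int temp;

    assert(int2 != 0);
    for (;;) {
        /* If the remainder is 0, int2 is the gcd */
        if ((temp = int1 % int2) == 0) {
            return int2;
        }
        /* Otherwise continue with int2 and the remainder */
        int1 = int2;
        int2 = temp;
    }
}
```

Calling it with the arguments in either order, for example (366, 60) and (60, 366), yields the same result; the second call simply needs one more pass through the loop.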

A couple of days ago, I sent him an email about my findings. His reply was short and to the point:

“Believe me, I never claimed to be immune to cycle-wasting! And anyway, There Ain’t No Such Thing As The Fastest Code :)”

There you have it. It’s that simple. I found unnecessary code in a routine that is more than 20 years old. That’s what I call delayed gratification!

The basic problem was that I blindly trusted what I saw, probably because it came from an expert. However, nobody is perfect; every once in a while, even the best make blunders. Have a flexible mind, assume nothing, and question everything: this is the right attitude for every software developer, and it is proclaimed in many of Michael’s writings [2].

—

[1] “ZEN of Code Optimization” is available online as part I of “Michael Abrash’s Graphics Programming Black Book, Special Edition”.

[2] Some years ago, I gave a presentation titled “The Art of Writing Efficient Software”. It includes a lot of Michael Abrash’s advice plus more. Check it out here. I truly believe that efficiency is still (and forever will be) of utmost importance for end users.

A Micro Logger for Constrained Systems

Logging is one of the oldest and most widely used software debugging tools: for decades, developers have put printf-like statements in their code to find out what their code is really doing, in situ. Even today, logging is still crucial for finding bugs in software, especially sporadic problems in concurrent and event-driven systems.

Fortunately, most programming environments come with powerful logging facilities (like Android’s Log) — but what if you are working on a highly-constrained system that simply can’t afford them?

Let’s have a look at an example that uses printf-like logging:


#include "log.h"

uint32_t readInput() {
    log_info("Enter readInput"); 
    uint32_t sampledValue = IoGetRawInput();
    ...
    if (sampledValue > SAMPLED_VALUE_MAX) {
        log_warn("Sampled value exceeded threshold: %d", sampledValue);
        sampledValue = 0;
    }
    ...
    log_info("Exit readInput");
    return sampledValue;
}

Traditional (printf-like) logging is expensive because:

Log strings consume memory (ROM), which can easily increase the overall size of the executable by 10% to 30%.
printf-based logging is extremely flexible, but the parsing of format strings can incur a significant run-time overhead.
The amount of log data produced at run-time is quite large because all the log strings need to be transmitted (or stored locally, in case you keep logging data in a file), usually accompanied with timestamps and other bookkeeping information.

I recently had a crazy idea for how to compress a log string of (almost) arbitrary size to a single byte: instead of using real (read: expensive) C strings, why not define a constant whose symbol name serves as the “log string” and only log its address?


... 
static const uint8_t A_log_string_that_consumes_only_a_single_byte = 0;
mlog_impl(&A_log_string_that_consumes_only_a_single_byte);
...

On the receiving side, you can reconstruct the original log strings from the map file produced by the linker.

Given an appropriate set of magic macros, the example above looks like this:


#include "mlog.h"

uint32_t readInput() {
    MLOG_INFO(Enter_readInput); 
    uint32_t sampledValue = IoGetRawInput();
    ...
    if (sampledValue > SAMPLED_VALUE_MAX) {
        MLOG_WARN_VALUE(Sampled_value_exceeded_threshold, sampledValue);
        sampledValue = 0;
    }
    ...
    MLOG_INFO(Exit_readInput); 
    return sampledValue;
}

It may be hard to believe, but these three “log strings” really consume only three bytes of memory. Sending out each address takes four bytes (assuming 32-bit addresses), and ‘sampledValue’ requires the transmission of another four bytes. Let’s compare the two approaches, assuming that ‘sampledValue’ is 42 (two digits, so the formatted message is exactly as long as its format string with ‘%d’) and taking into account that the size of a C string is the number of characters plus one for the trailing ‘\0’:

	footprint (bytes)	transmission (bytes)
traditional logging	16 + 37 + 15	16 + 37 + 15
MLOG logging	1 + 1 + 1	4 + 4 + 4 + 4
savings %	96	76

That’s what I call an improvement!
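
The figures in the table can be double-checked with a few lines of C. This is only a sketch of the arithmetic: the enum constants and the helper `savings_percent` are names invented here, and timestamps or other bookkeeping data are ignored, as in the table.

```c
/* sizeof on a string literal counts the characters plus the trailing '\0'. */
enum {
    TRADITIONAL_FOOTPRINT = sizeof("Enter readInput")
                          + sizeof("Sampled value exceeded threshold: %d")
                          + sizeof("Exit readInput"),
    /* With sampledValue == 42, the formatted warning has the same
       length as its format string ("42" replaces "%d"). */
    TRADITIONAL_TRANSMISSION = sizeof("Enter readInput")
                             + sizeof("Sampled value exceeded threshold: 42")
                             + sizeof("Exit readInput"),
    MLOG_FOOTPRINT = 1 + 1 + 1,         /* three one-byte symbols */
    MLOG_TRANSMISSION = 4 + 4 + 4 + 4   /* three 32-bit addresses, one 32-bit value */
};

/* Savings in percent, rounded to the nearest integer. */
unsigned savings_percent(unsigned before, unsigned after) {
    return (100u * (before - after) + before / 2) / before;
}
```

Both traditional totals come to 68 bytes, against 3 bytes of footprint and 16 bytes of transmission for MLOG, which is where the 96% and 76% in the table come from.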

I’ve implemented this idea for the GCC toolchain. The log server emits MLOG traces via UDP, which are received by a little tool that converts them back into human-readable log messages by exchanging MLOG symbol addresses with (beautified) MLOG symbol names from the linker map file. Extracting the MLOG symbols from the map file is the most GCC-dependent part. I believe it is a straightforward task to port MLOG to other platforms and toolchains.
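
To make the idea more tangible, here is a minimal sketch of what such a macro could look like. It is not the actual MLOG source: the capture buffer, `MLOG_CAPACITY`, and `demo_read_input` are hypothetical stand-ins (the real implementation sends the addresses via UDP), and whether a given toolchain lists function-local statics in its map file is an assumption that would need checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capture buffer standing in for the real transport. */
#define MLOG_CAPACITY 16
uintptr_t mlog_buffer[MLOG_CAPACITY];
size_t mlog_count = 0;

/* Record the address of a log symbol; the symbol's value is never read. */
void mlog_impl(const uint8_t *symbol) {
    if (mlog_count < MLOG_CAPACITY) {
        mlog_buffer[mlog_count++] = (uintptr_t)symbol;
    }
}

/* Define a one-byte constant whose name is the "log string"
   and log its address. */
#define MLOG_INFO(symbol)                     \
    do {                                      \
        static const uint8_t symbol = 0;      \
        mlog_impl(&symbol);                   \
    } while (0)

/* Two log statements produce two distinct one-byte objects. */
void demo_read_input(void) {
    MLOG_INFO(Enter_readInput);
    MLOG_INFO(Exit_readInput);
}
```

Since distinct objects have distinct addresses, each log statement is identified by a unique address, and the receiving tool only needs a table that maps addresses back to symbol names.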

Check it out at GitHub.

EPILOGUE:
This is my 100th blog post (actually 101st, damn)! Thanks, faithful reader, for bearing with me through all these years.
— Yours Truly

Bug Hunting Adventures #10: For Whom The Bell Tolls

“Then later that night when the ship’s bell rang
Could it be the north wind they’d been feelin’?”

“The Wreck Of The Edmund Fitzgerald”
— Gordon Lightfoot

At my home, I’m using a Raspberry Pi as a watchdog (aptly named “Brutus”) for all kinds of tasks: burglar detection, network intrusion detection, and server monitoring, just to name a few. Still, most of the time, my watchdog hangs around, idling away. That’s not the way I like it, so I’m constantly on the lookout for new jobs that I can assign to Brutus, small or big.

My current plan is to create a little ship’s bell app that emits pleasing bell sounds every 30 minutes, just like it has been done traditionally on all ships since the 16th century: double-strikes for full hours and an additional single-strike for half an hour. But unlike civil clocks, ship’s bells don’t have dedicated indications for every one of the 12 (or rather 24) hours in a day; instead, bell patterns repeat every four hours:

Bell pattern	Time (a.m. and p.m.)
1	12:30	4:30	8:30
2	1:00	5:00	9:00
2 1	1:30	5:30	9:30
2 2	2:00	6:00	10:00
2 2 1	2:30	6:30	10:30
2 2 2	3:00	7:00	11:00
2 2 2 1	3:30	7:30	11:30
2 2 2 2	4:00	8:00	12:00

In this table, a “2” denotes a double-strike whereas a “1” signifies a single-strike of the bell.
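
The repetition every four hours can be expressed compactly. The following is a sketch of the traditional scheme only, not the puzzle's ‘compute_strikes’; the function names are invented for this illustration, and the time is assumed to fall on a full or half hour.

```c
/* Number of bells for a local time given as hour (0-23) and minute (0-59).
   One bell is added per half hour since the start of the four-hour cycle;
   the cycle ends with eight bells. */
unsigned bells_at(unsigned hour, unsigned minute) {
    unsigned half_hours = (hour % 4) * 2 + (minute >= 30 ? 1 : 0);
    return half_hours == 0 ? 8 : half_hours;
}

/* Each pair of bells is rung as a double-strike ("2" in the table). */
unsigned double_strikes(unsigned bells) {
    return bells / 2;
}

/* An odd bell count ends with one single-strike ("1" in the table). */
unsigned single_strikes(unsigned bells) {
    return bells % 2;
}
```

For 3:30, for instance, this gives seven bells, rung as three double-strikes followed by one single-strike.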

The code below is a first draft of my ship’s bell app. It is running as a thread, sleeping most of the time (so you can still call Brutus a lazy dog). When it wakes up, it checks the current local time and determines how many strikes are to be done (‘compute_strikes’). Afterwards, the thread puts itself to rest again. However, I didn’t want to wake it up every second to check the wall time — that would be too inefficient. Instead, I base the sleep time on the temporal distance between now and the next half hour (‘compute_sleep_time’) and sleep for half of this time before checking again.

Alas, my initial implementation comes with a bug and the bell doesn’t work as it is supposed to. Can you spot it? (The bug is in the algorithm — it has nothing to do with any Python language quirks, of course.)

Code
Solution
Ship’s Bell app code at GitHub.

Code Kata 4: Struct Member Alignment

“The whole is greater than the sum of its parts”
— Aristotle

How time flies! More than two years have passed since I posted my last kata, so it’s high time for a new one. Today’s kata is based on a real problem that I had to solve last week when I worked on a code generator that produced C/C++ code. I hope you enjoy this task as much as I did!

When you define a struct in C, the overall size of the struct is not necessarily the sum of its members. Much like in real life, the whole is greater than the sum of the parts. Consider this example:


struct Foo {
    uint8_t a;
    uint32_t b;
};

Only novice programmers would assume that sizeof(struct Foo) equals 5; experienced programmers know that 8 is a much safer bet. How come?

Many computer architectures can access data only (or only efficiently) if it is properly aligned in memory; e.g., a 32-bit (4-byte) integer must be stored at an address that is evenly divisible by 4. The compiler usually achieves this by inserting invisible padding bytes between the struct members. Thus, internally, struct Foo is likely to look like this:


struct Foo_internal {
    uint8_t a;
    uint8_t _pad1;
    uint8_t _pad2;
    uint8_t _pad3;
    uint32_t b;
};

As a first step, familiarize yourself with struct padding. Check out this must-read by the legendary Eric S. Raymond, especially if you are a systems programmer.

Now that you have read it, you should understand why sometimes there is also trailing padding (but never leading padding) and hence why the size of the following struct is most likely 12:


struct Bar {
    uint8_t a;
    uint32_t b;
    uint8_t c;
};
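
If you would rather measure than guess, `sizeof` and `offsetof` reveal where the compiler has put the padding. The sketch below assumes a typical ABI on which a uint32_t is 4-byte aligned; the two helper functions are names made up for this illustration, and other targets or packing pragmas may give different numbers.

```c
#include <stddef.h>
#include <stdint.h>

struct Foo {
    uint8_t a;
    uint32_t b;
};

struct Bar {
    uint8_t a;
    uint32_t b;
    uint8_t c;
};

/* Padding bytes the compiler inserted between 'a' and 'b' in struct Foo. */
size_t foo_inner_padding(void) {
    return offsetof(struct Foo, b) - (offsetof(struct Foo, a) + sizeof(uint8_t));
}

/* Trailing padding bytes after 'c' in struct Bar. */
size_t bar_trailing_padding(void) {
    return sizeof(struct Bar) - (offsetof(struct Bar, c) + sizeof(uint8_t));
}
```

On such an ABI, both helpers are expected to report three padding bytes: three bytes in front of ‘b’ in Foo and three bytes after ‘c’ in Bar, the latter so that ‘b’ stays aligned in every element of an array of Bar.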

Equipped with this knowledge, we are ready to tackle our first programming task: assume that every primitive struct member has a power-of-two size and is to be aligned on a boundary equal to its size (a 2-byte integer on an address that is evenly divisible by 2, a 4-byte integer on an address that is evenly divisible by 4, and so on). Implement an algorithm that computes the overall size of a struct given an ordered list of its members. Instead of real types, provide a list of integer values where the values represent the sizes of the members. Here are examples for Foo and Bar (in Python):


assert(struct_size([1, 4]) == 8)      # Foo
assert(struct_size([1, 4, 1]) == 12)  # Bar

One weakness of this approach is that you cannot differentiate between a uint32_t (size is 4, alignment is 4) and an array of four uint8_ts (size is 4, alignment is 1):


struct Baz {
    uint8_t a;
    uint8_t b[4];
    uint8_t c;
};

Extend your algorithm to accept a list of pairs, where the first pair member specifies the size and the second pair member specifies the alignment:


assert(struct_size([(1, 1), (4, 4)]) == 8)           # Foo
assert(struct_size([(1, 1), (4, 4), (1, 1)]) == 12)  # Bar
assert(struct_size([(1, 1), (4, 1), (1, 1)]) == 6)   # Baz

But there is one more feature we need to add before we’re done: support for arbitrarily nested structs:


struct Fuz {
    uint8_t a;
    struct Foo b;
    uint8_t c;
};

How does a member that itself is a struct impact the alignment of the outer struct? Devise a suitable data structure for passing in nested struct specifications.

[Update 2016-03-12: I’ve uploaded a sample solution to GitHub]

So long Scott Meyers, So long C++?

Last month, Scott Meyers wrote a blog post where he announced that he will withdraw from active involvement in C++.

Scott is famous for at least two things: first, his excellent way of explaining dry technical stuff in an entertaining way (I read all his books on C++, except for “Effective Modern C++”, which is still on my to-be-read pile) and second, his He-Man-like hairdo.

After 25 years of dedicating his life exclusively to C++, he has become tired: tired of a language that gets more and more complex at a seemingly ever-increasing rate, all in the name of backwards compatibility and efficiency. (He didn’t say that, but this is my view.)

Ah, speaking of efficiency, the “e” word.

Just because C++ gives you control over efficiency doesn’t mean that you will get it in the end. Due to a lack of compiler and hardware knowledge, many C++ developers have a wrong (insufficient, at least) notion about efficiency. There are many misconceptions, because they don’t know how compilers, CPUs, or memories work.

One example is not understanding the effects of caching. Many C++ developers blindly trust std::map’s or std::unordered_map’s O(log n) and O(1) promises, but there are situations where an O(n) search through a std::vector (or plain C-style array) can be orders of magnitude faster because it accesses memory in a cache-friendly way. There is a nice talk by Scott on YouTube where he gives a good overview of caching and its consequences.

Another common efficiency fallacy is illustrated by this little for loop:


for (uint8_t i = 0; i < arrayLength; ++i) {
    if (array[i] == 42) {
        ...
    }
}

Many developers I’ve met believe that using a ‘uint8_t’ for the loop counter is more efficient than using a plain ‘int’. But what most likely will happen is that by using ‘uint8_t’ the code becomes both bigger and slower, especially on modern RISC-style processor architectures like ARM and PowerPC. Why? If the value of ‘arrayLength’ is not known at compile-time, the compiler has to create additional code that ensures that ‘i’ wraps around to 0 instead of reaching 256. Internally, the compiler assigns a 32-bit register to ‘i’ (provided you are targeting a 32-bit platform), and adding 1 to 255 in a 32-bit register is different from adding 1 to 255 in an 8-bit register. Behind the scenes, your compiler rewrites your loop to look like this:


for (int i = 0; i < arrayLength; i = (i + 1) & 0xFF) {
    if (array[i] == 42) {
        ...
    }
}

Granted, in most situations this additional code will not amount to much, but it may well matter in a low-level driver or a communications stack, which are exactly the situations that systems languages like C++ were made for. But this example shows a problem that many would-be efficiency experts share: for the sake of (false) efficiency, they increase complexity and risk correctness and security. What happens if some day ‘arrayLength’ can be larger than 255? Since ‘i’ can never exceed 255, the loop condition stays true and the for loop will loop forever, of course.
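
The endless loop is easy to demonstrate. In the sketch below, the function `count_iterations` and its `cap` parameter are additions made for this demo only: the cap is a safety limit, so that the function returns even when the loop itself would not.

```c
#include <stddef.h>
#include <stdint.h>

/* Runs the loop header from the text and counts the iterations.
   'cap' is a safety limit added for this demo so that the function
   always returns; the original loop has no such limit. */
size_t count_iterations(size_t arrayLength, size_t cap) {
    size_t iterations = 0;
    for (uint8_t i = 0; i < arrayLength; ++i) {
        if (++iterations >= cap) {
            break;  /* safety limit reached */
        }
    }
    return iterations;
}
```

With an ‘arrayLength’ of up to 255, the function returns ‘arrayLength’ as long as the cap is larger. From 256 onwards, ‘i’ wraps from 255 back to 0 before the condition can become false, so only the cap stops the loop.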

So while C++ is a language that has the potential to yield extremely efficient systems, you neither get efficiency automatically nor for free. C++ has a steep learning curve and there are many pitfalls. I truly believe that much of C++’s efficiency is wasted on too many developers. If you don’t need utmost efficiency or don’t know how to put the corresponding language features to best use, better keep away from C++ and use a managed language. You will be much more productive and create programs that are also (probably) more secure by default.

Getting back to Scott Meyers, I must admit that I’m somewhat happy about his decision. Not because he left C++ per se but because he now has time to focus on other important topics — topics that he will explain with the same quality he is renowned for. Like some programmers say: when one curly brace closes, another one opens.

Approxion

Code – People – Everything
