Software Project Democracy

“It has been said that democracy is the worst form of government except all the others that have been tried”
— Sir Winston Churchill

Considering all that fuss going on about agile processes these days, I sometimes wonder if this is yet another case of violating the first principle of the Agile Manifesto:

“Value individuals and interactions over processes and tools”

Given that there are so many books and consultants on the subject of Scrum/Kanban/Lean Product Development, isn’t it possible that people once again focus on processes instead of individuals?

While this hype is certainly annoying at times, the answer is a clear “No”. Agile processes care a lot about individuals and give democracy to the whole team. There is a clear concept of cross-functional, self-managing teams. Developers “pull” their work and decide (or at least largely influence) what and how to do it; work is not foisted upon them by a single dictator (aka. traditional project manager). In Scrum, the project manager role is filled by the product owner together with the rest of the team. This distributed style of project management is democratic, empowering, and very motivating for all members.

Yet, the most people-valuating “process” I’ve ever heard of comes from Tom DeMarco and Timothy Lister, mentioned in “Peopleware”:

1. Get the right people
2. Make them happy so they don’t want to leave
3. Turn them loose

To me, this is the best recipe for project success ever. If you’d follow this “process”, you wouldn’t need Scrum/Kanban/XP at all. If your team just consisted of smart and motivated people, everything else would follow. Why? Because as part of the team — and most importantly — you would also have the “right” leader, one who clearly says what to do and in what order. I’m not talking about a Gantt-chart/PowerPoint virtuoso, but a leader with the following properties and skills:

1. Charismatic, a born leader
2. Systematic, follows-up issues
3. Technically adept
4. Cares about the product and the team more than his/her personal career
5. Has the respect of upper management and the rest of the team
6. Shows that (s)he cares by inspecting and criticizing the main work product (the “Source Code”)

Alas, such gifted leaders are very hard to find.

Some claim that in real life benevolent dictatorship is by far the best form of government, much better than democracy, which is accompanied by frequent shady compromises. But also in real life, benevolent dictators are hard (if not impossible) to find and that’s why democracy was invented. So despite of all its shortcomings, democracy is the best government system in practice.

In some sense, you can view the appearance of agile methodologies as a result of our industry’s inability to find the “right” leaders. Yes, there are training and certification programmes for project managers, but in my honest opinion, they don’t help a lot; if anything, they focus on the “systematic” part, yielding people that spend a lot of time on upfront planning, estimation, and project reporting (“defending”) to upper management. They not only waste their own time but also the time and motivation of the rest of the team.

Hence, I think this strong shift from software management absolutism to software management democracy — from traditional project management to lean and agile management — it is not just hype. It is a natural and logical consequence. It is an uprising from the disappointed and suppressed “doers” who have had to suffer far too long from “traditional” project management practices.

Until our trade has developed a system for breeding benevolent leaders (like the guild system did hundreds of years ago with the apprentice/journeyman/master model), I vote for agile software development and the software process democracy that it gives me.

Dangerously Confusing Interfaces II

Last week was a sad week for me. A bug in my code made it into the final version that was shipped to an important customer.

When something like this happens, it is almost always due to fact that there is a higher-level “bug” in the software development process, but I don’t want to go there. Instead, I want to focus on the technicalities. Once more, I got bitten by another instance of a dangerously confusing interface. Let me explain.

I’ve always had a liking for interfaces that are self-evident; that is, one knows immediately what’s going on by just looking at how the interface is used — without having to consult the documentation. Let me give you a counter example, an interface that is far from what I desire:


sort_temperatures(values, values_count, true);

sort_temperatures(values, values_count, true);

The interface to ‘sort_temperatures’ is not at all self-explanatory. What is clear is that it sorts ‘values_count’ values, which are obviously temperature values, but what the heck does the ‘true’ argument stand for? In order to find out, you have to look at the declaration of ‘sort_temperatures’ and/or its API documentation:


/** Sorts an array of temperature values. 
 * @param values Begin of temperature values array. 
 * @param values_count Total number of temperature values. 
 * @param ascending If true, sort values in ascending order; 
 *        if false, sort in descending order.
 */
void sort_temperatures(double* values, size_t values_count, 
    bool ascending);

/** Sorts an array of temperature values.

* @param values Begin of temperature values array.

* @param values_count Total number of temperature values.

* @param ascending If true, sort values in ascending order;

* if false, sort in descending order.

void sort_temperatures(double* values, size_t values_count,

bool ascending);

Now it is clear what the boolean parameter is for, but at the cost of having to take a detour. Maybe you got so distracted by this detour that you forgot what you originally were about to do.

Programmers often make this mistake when designing interfaces. They use boolean parameters when they have two mutual exclusive modes of operation. This is easy for the implementer of the routine, but confusing to the ones who have to use it. Contrast this with this alternative:


sort_temperatures(values, values_count, SORT_MODE_ASCENDING | 
    SORT_MODE_REMOVE_DUPLICATES);

sort_temperatures(values, values_count, SORT_MODE_ASCENDING |

SORT_MODE_REMOVE_DUPLICATES);

By replacing the boolean parameter with symbolic constants you not only make the code more readable, you also open it up for future extension: adding more modes becomes straightforward.

Now have a look at this code and try to guess what it does:


uint8_t hash[HASH_SHA256_LEN];
size_t hash_len;
if (!calc_hash(HASH_SHA256,mydata, mydata_len, hash, &hash_len) || hash_len !=
    HASH_SHA256_LEN) {
    LOG << "Hash computation failed" << endl; return false; 
}

uint8_t hash[HASH_SHA256_LEN];

size_t hash_len;

if (!calc_hash(HASH_SHA256,mydata, mydata_len, hash, &hash_len) || hash_len !=

HASH_SHA256_LEN) {

LOG << "Hash computation failed" << endl; return false;

}

That’s fairly simple: ‘calc_hash’ calculates a SHA-256 checksum over ‘mydata’ (which is ‘mydata_len’ bytes in size) and stores it into the provided buffer ‘hash’. The length of the hash is stored to ‘hash_len’ via call-by-reference.

But is this code correct? The answer is — you guessed it — no. If you can’t spot the bug, you are in good company. You can’t see it with just the information given. ‘calc_hash’ interface is not self-evident.

I wrote this code more or less one year ago. It contains a bug that remained dormant until the product was in the hands of our customer. And it is there because of a silly interface.

The last (pointer) parameter ‘hash_len’ actually serves as both, an input AND output parameter. When you call ‘calc_hash’ it is expected that ‘*hash_len’ contains the size of the provided ‘hash’ buffer; on return ‘*hash_len’ will contain the actual number of bytes used by the hash algorithm; that is,the size (or length) of the SHA-256 checksum stored in ‘hash’. The whole idea behind this is that ‘calc_hash’ (or rather its author) wants to offer protection against buffer overruns — for cases where the provided ‘hash’buffer is not large enough to accommodate the checksum.

So the problem here is that ‘hash_len’ (being a stack variable) is not properly initialized to ‘HASH_SHA256_LEN’; it’s value is more or less arbitrary. If it is by chance greater or equal to 32 (the value of the ‘HASH_SHA256_LEN’ symbolic constant) everything is fine and the checksum is correctly calculated. If it is not, ‘calc_hash’ returns ‘false’ and an error is reported.

For as long as a year — by sheer coincidence — ‘*hashLen’ was never below 32 (which is not that unlikely, given that ‘size_t’ can accommodate values ranging from 0 to 4,294,967,296); but in the hands of the customer — and very much in line with Murphy’s Law — it happened.

OK, accuse me of not having initialized ‘*hashLen’ properly, accuse me of not having read the API documentation carefully. Maybe I did read the API documentation and then I was interrupted. I don’t know. But what I know for sure is that this bug would have never happened if the interface had been clearer.

The first problem with ‘calc_hash’ is that ‘hash_len’ is an IN/OUT parameter, which is unusual. I’m not aware of any function in the C/C++ standard library (or the POSIX libraries) which makes use of IN/OUT parameters. Since the input value (not just the output value) is passed by reference, neither the compiler nor static analysis tools like PC-Lint are able to detect its uninitialized state. One obvious improvement is to pass the length of the buffer by value:


bool calc_hash(HASH_TYPE type, const void* src, size_t src_len, void* hash,
    size_t hash_buf_len, size_t* hash_len);

bool calc_hash(HASH_TYPE type, const void* src, size_t src_len, void* hash,

size_t hash_buf_len, size_t* hash_len);

Granted, there is now one more argument to pass (‘hash_buf_len’), but if the unlucky programmer ever forgets to initialize it, the compiler will issue a warning.

But let’s not stop here. I’d like to pose the following question: what good is the hash buffer length check, anyway?

In my view, it is not at all necessary. The length of a particular checksum is constant and known a priori — that’s why the hash buffer is allocated statically like this:


uint8_t hash[HASH_SHA256_LEN];

uint8_t hash[HASH_SHA256_LEN];

What additional benefit does a developer get by providing the same information again to the ‘calc_hash’? Isn’t this redundancy that just begs for consistency errors?

And what use is the output value that tells how long the checksum is? Again, this should, no it MUST be known a priori, there should never be a mismatch between what the caller of ‘calc_hash’ expects and what is returned. Of course, there can be error conditions but if ‘calc_hash’ fails, it should return ‘false’ and not a length different to what the caller expects.

Note that ‘calc_hash’ is not at all comparable to functions in the C API that add an additional output buffer length parameter like ‘strncpy’ or ‘snprintf’. These functions carry the length parameter for completely legitimate safety reasons because the total length of the output is usually not known a priori (for instance, as some input may stem from a human user and one has little control over how many characters (s)he will enter).

Based on these arguments, I dismiss the ‘hash_len’ parameter altogether and propose the following simplified interface:


bool calc_hash(HASH_TYPE type, const void* src, size_t src_len, void* hash);

bool calc_hash(HASH_TYPE type, const void* src, size_t src_len, void* hash);

and would use it like this:


uint8_t hash[HASH_SHA256_LEN];
if (!calc_hash(HASH_SHA256, mydata, mydata_len, hash)) {
    LOG << "Hash computation failed" << endl;
    return false;
}

uint8_t hash[HASH_SHA256_LEN];

if (!calc_hash(HASH_SHA256, mydata, mydata_len, hash)) {

LOG << "Hash computation failed" << endl;

return false;

}

Easy to read, hard to get wrong. It is impossible to forget to initialize a variable that just isn’t there.

Circular Adventures II: A Ring within a Ring

“Is all that we see or seem
But a dream within a dream?”
–Edgar Allan Poe

In part one of this series I’ve discussed the mod operator and demonstrated that it is a flexible tool for solving circular problems. I’ve also explained that the mod operator, as implemented by most (system) programming languages, is not able to handle problems involving negative dividends. As a remedy, I’ve introduced an efficient alternative of the mod operator for divisors that are base-2 numbers.

Today, I want to have a look at some first circular use-cases. Again, we meet our good old friend, the ring buffer.

Sometimes, you need to store data and keep previous versions of that data, but the memory available is limited. In such situations, an obvious approach is to use a ring buffer.


    +---------------------------------+
 0  | First data set                  | 
    +---------------------------------+ 
 1  | Second data set                 | 
    +---------------------------------+ 
 2  | ...                             | 
    +---------------------------------+ 
 3  | ...                             | 
    +---------------------------------+ 
 4  | ...                             | 
    +---------------------------------+ 
 5  | ...                             | 
    +---------------------------------+ 
 6  | ...                             | 
    +---------------------------------+ 
 7  | Eighth data set                 | 
    +---------------------------------+

+---------------------------------+

0 | First data set |

+---------------------------------+

1 | Second data set |

+---------------------------------+

2 | ... |

+---------------------------------+

3 | ... |

+---------------------------------+

4 | ... |

+---------------------------------+

5 | ... |

+---------------------------------+

6 | ... |

+---------------------------------+

7 | Eighth data set |

+---------------------------------+

In this example, we have a ring buffer (again, pretend that the last and first element are joined) comprising eight (N = 8) fixed-size slots for storing information. Each slot contains a data set. We start by writing to slot 0, then slot 1 and so on; when the ring buffer is full (that is, slot 7 has been written to), we wrap around to slot 0 and overwrite what has been stored there previously, thus kicking out the oldest data set. By using this scheme, we not only have access to the most recent data set, but also to up to seven predecessors.

In addition to the actual ring buffer, we need a variable to keep track of the most-recently updated (“MRU”) slot number; that is, an index that points to the “newest” data set. Based on the current value of the MRU we can easily calculate which slot is to be updated upon the next write access:


mru = (mru + 1) mod N

mru = (mru + 1) mod N

Updating the ring buffer is just one part of the story, reading another. For instance, to get the predecessor of the current data set we need to calculate the slot index based on the previous value of MRU. Again, by using the mod operator, this can be achieved easily:


i = (mru - 1) mod N

i = (mru - 1) mod N

Dumping the whole ring buffer in reverse chronological order is achieved with this simple loop:


for i in (0 .. N - 1) 
    print buffer[(mru - i) mod N]     
end

for i in (0 .. N - 1)

print buffer[(mru - i) mod N]

end

Note that for this to work, the mod operator has to be a mathematical modulo operator since the dividends may become negative. If you don’t have access to a true mathematical mod operator, you should make N a base-2 number and use the modulo replacement operation given in the previous installment (mod_base2).

Now let’s make our ring buffer example a bit more interesting. Let’s pretend our ring buffer must be stored in persistent, non-volatile memory (NVM), like flash. One problem with flash memory is that it wears out over time: the more often you update flash memory, the shorter the data retention time gets. (It’s like using a bucket with a crack to carry water from one place to another and the crack gets bigger with every lap you go.)

Therefore, storing the most-recently updated value in a single flash variable is not a good idea, since this memory location would wear-out quickly: every update to a ring buffer would require an update to the MRU variable, which means that the update load on the MRU variable is N times the update load on the individual slots.

A frequently used countermeasure is to have a “distributed MRU variable”, which is achieved by prefixing or suffixing every slot with a so-called “age counter” field.

The age counter starts at zero and is incremented with every write to a slot. So after three updates to an initially empty ring buffer, we would have this:


    +---------------------------------+-----+
 0  | First data set                  |   1 |
    +---------------------------------+-----+
 1  | Second data set                 |   2 |
    +---------------------------------+-----+
 2  | Third data set                  |   3 |
    +---------------------------------+-----+
 3  |                                 |   0 |
    +---------------------------------+-----+
 4  |                                 |   0 |
    +---------------------------------+-----+
 5  |                                 |   0 |
    +---------------------------------+-----+
 6  |                                 |   0 |
    +---------------------------------+-----+
 7  |                                 |   0 |
    +---------------------------------+-----+

+---------------------------------+-----+

0 | First data set | 1 |

+---------------------------------+-----+

1 | Second data set | 2 |

+---------------------------------+-----+

2 | Third data set | 3 |

+---------------------------------+-----+

3 | | 0 |

+---------------------------------+-----+

4 | | 0 |

+---------------------------------+-----+

5 | | 0 |

+---------------------------------+-----+

6 | | 0 |

+---------------------------------+-----+

7 | | 0 |

+---------------------------------+-----+

After ten updates, the ring buffer would look like this:


    +---------------------------------+-----+
 0  | ...                             |   9 |
    +---------------------------------+-----+
 1  | ...                             |  10 |
    +---------------------------------+-----+
 2  | ...                             |   3 |
    +---------------------------------+-----+
 3  | ...                             |   4 |
    +---------------------------------+-----+
 4  | ...                             |   5 |
    +---------------------------------+-----+
 5  | ...                             |   6 |
    +---------------------------------+-----+
 6  | ...                             |   7 |
    +---------------------------------+-----+
 7  | ...                             |   8 |
    +---------------------------------+-----+

+---------------------------------+-----+

0 | ... | 9 |

+---------------------------------+-----+

1 | ... | 10 |

+---------------------------------+-----+

2 | ... | 3 |

+---------------------------------+-----+

3 | ... | 4 |

+---------------------------------+-----+

4 | ... | 5 |

+---------------------------------+-----+

5 | ... | 6 |

+---------------------------------+-----+

6 | ... | 7 |

+---------------------------------+-----+

7 | ... | 8 |

+---------------------------------+-----+

The slot containing the age counter with the highest value must be the MRU slot, slot 1 in our example. For obvious reasons, the value range of the age counter must be larger than the slot range of the ring buffer, but sooner or later, even the age counter will wrap around. After many writes to the ring buffer we might get this picture:


    +---------------------------------+-----+
 0  | ...                             | 253 |
    +---------------------------------+-----+
 1  | ...                             | 254 |
    +---------------------------------+-----+
 2  | ...                             | 255 |
    +---------------------------------+-----+
 3  | ...                             |   0 |
    +---------------------------------+-----+
 4  | ...                             |   1 |
    +---------------------------------+-----+
 5  | ...                             |   2 |
    +---------------------------------+-----+
 6  | ...                             |   3 |
    +---------------------------------+-----+
 7  | ...                             | 252 |
    +---------------------------------+-----+

+---------------------------------+-----+

0 | ... | 253 |

+---------------------------------+-----+

1 | ... | 254 |

+---------------------------------+-----+

2 | ... | 255 |

+---------------------------------+-----+

3 | ... | 0 |

+---------------------------------+-----+

4 | ... | 1 |

+---------------------------------+-----+

5 | ... | 2 |

+---------------------------------+-----+

6 | ... | 3 |

+---------------------------------+-----+

7 | ... | 252 |

+---------------------------------+-----+

So which slot is the MRU slot now? The age counter is obviously an unsigned 8-bit type, since it wrapped around at value 255. Starting with 253 (the age counter value of slot 0) and taken modulo 256, all counter values increase by one up to 3. But the next value, 252, is not a successor of 3; it is too far ahead (or rather behind). This means that the slot containing age counter value 3 must be the most recently updated slot (which is 6).

The MRU is the basis for all ring buffer operations and it makes sense to maintain its value in a RAM variable. But how can one determine the MRU programmatically, for instance after a reset when all RAM content is lost? Here is a method to search for it:


for mru in (0..N-2)
    if (buffer[mru].counter + 1) mod 256 != (buffer[mru + 1].counter)
        break
end

for mru in (0..N-2)

if (buffer[mru].counter + 1) mod 256 != (buffer[mru + 1].counter)

break

end

It is a good exercise to convince yourself that this algorithm actually works. Think about the initial “empty” ring buffer case where all age counters are zero as well as the case where the “highest” age counter is stored in the last slot.

Luckily, the dividends in this algorithm cannot become negative, so if we wanted to code this in C, we could use the ‘%’ operator. But since we are doing a modulo 256 operation, we can apply the optimization given in the first installement, which is replacing the modulo 256 operation with a (uint8_t) cast:


int mru;
for (mru = 0; mru <= N - 2; ++mru) {
    if ( (uint8_t) (buffer[mru].counter + 1) != (buffer[mru + 1].counter) )
        break;
}

int mru;

for (mru = 0; mru <= N - 2; ++mru) {

if ( (uint8_t) (buffer[mru].counter + 1) != (buffer[mru + 1].counter) )

break;

}

Once we are in possession of the MRU, we can work with this ring buffer just like we did with the previous ring buffer. The only thing we have to do in addition is to memorize the highest (or current) age counter value for future updates to the ring buffer:


# Determine current age counter.
age_counter = buffer[mru].counter
...
# Print most recent data set.
print buffer.dataset[mru]     
# Print oldest data set.
print buffer.dataset[(mru - N - 1) mod N]     
# Write a new data set.
mru = (mru + 1) mod N
age_counter = (age_counter + 1) mod 256
buffer[mru].age_counter = age_counter
buffer[mru].dataset = "some new data"

# Determine current age counter.

age_counter = buffer[mru].counter

...

# Print most recent data set.

print buffer.dataset[mru]

# Print oldest data set.

print buffer.dataset[(mru - N - 1) mod N]

# Write a new data set.

mru = (mru + 1) mod N

age_counter = (age_counter + 1) mod 256

buffer[mru].age_counter = age_counter

buffer[mru].dataset = "some new data"

Finding the MRU in our second ring buffer was not particularly difficult — we just needed to compare the age counters until we found a “discontinuity”. We knew that the slots are updated, one after another, cyclically, with increasing age counter values. Things get more complicated when there is no such strict sequential ordering. Let me explain.

Imaginge you want to keep track of when certain events occur. For each kind of event, you would store a timestamp from a 32-bit timer source that indicates when this particular event occurred:


    +---------------------------------+-----------+
 0  | First type of event             | 353011    |
    +---------------------------------+-----------+
 1  | Second type of event            | 1643      |
    +---------------------------------+-----------+
 2  | Third type of event             | 4         |
    +---------------------------------+-----------+
 3  | ...                             | ...       |
    +---------------------------------+-----------+
    :                                 :           :

+---------------------------------+-----------+

0 | First type of event | 353011 |

+---------------------------------+-----------+

1 | Second type of event | 1643 |

+---------------------------------+-----------+

2 | Third type of event | 4 |

+---------------------------------+-----------+

3 | ... | ... |

+---------------------------------+-----------+

: : :

Further, assume that the 32-bit timer can wrap around (otherwise we wouldn’t have a circular problem), just like our 8-bit age counter wrapped around in our ring buffer example. How would you know which event occurred first or last? How would you calculate the relative time between two events? This and more is the topic of the next part of this series. Stay tuned…

More circular adventures…

Circular Adventures I: The Modulo Operation

“All things from eternity are of like forms and come round in a circle”
–Marcus Aurelius, AD 121-180

In computer science, just like in real life, many things go round in circles. There are ring buffers, circular lists, wrap-around counters and timers, just to name a few. In this multi-part article, I want to explore the theory behind circular behavior. Equipped with this knowledge, I attempt to provide solutions to “recurring” problems.

Let’s start with the very basics; that is, the movement of an index within a ring buffer.

Assume you have a ring buffer containing N = 10 elements and an index i:


0 1 2 3 4 5 6 7 8 9
      ^
      i

0 1 2 3 4 5 6 7 8 9

(For simplicity, I depict circular structures as a flat arrays — just imagine the first and last element are joined to form a ring.)

Question: How do you advance the index by a signed offset n, taking wrap-around into account?
Answer: i = (i + n) mod N

Many circular problems like ring buffer operations can be elegantly expressed by using the ‘mod’ operator. Alas, only theoretically, as in practice, there are at least two problems surrounding the ‘mod’ operator: portability and efficiency.

The ‘mod’ operator, as it is used in the equation above, is the mathematical ‘mod’ operator — the one that is used in number theory. Programming languages, however, sport many different flavors of the ‘mod’ operator, which vary in the way they treat negative operands.

All of them obey this equation:


a mod N = a - ( a div N ) * N

a mod N = a - ( a div N ) * N

Yet the results for negative operands depend on whether the ‘div’ operator rounds towards zero or negative infinity. If the ‘div’ operator rounds towards zero, the expression -5 div 2 yields -2; if it rounds towards negative infinity the result is -3.

The following example illustrates how this influences the calculation of -3 mod 8:


                          Round towards
                        zero  neg infinity
-3 div 8                  0      -1
(-3 div 8) * 8            0      -8
-3 - ((-3 div 8) * 8)    -3       5

Round towards

zero neg infinity

-3 div 8 0 -1

(-3 div 8) * 8 0 -8

-3 - ((-3 div 8) * 8) -3 5

This means that -3 mod 8 might be -3 or +5 depending on the style the implementers of your programming language have chosen. Some languages (like Ada) have even two modulo operators (rem and mod), while others allow to control the behavior at run-time (Perl: ‘use integer’). Still others (like C90 and C++98) leave it as ‘implementation-defined’. Have a look at this for a nice overview on how different programming languages implement the modulo operator.

(And just in case you haven’t guessed it already: C/C++/Java’s approximation of ‘mod’ is ‘%’ and ‘div’ is ‘/’.)

Now, if you only travel through a circle in positive direction (by adding positive offsets), either rounding style will do; however, if you intend to travel backwards (or calculate differences between indices that yield negative values), only the ‘negative-infinity’ modulus operator will do the job; that is, wrap (in the 10 element ring buffer example) from -1, -2, -3 … to 9, 8, 7 …, respectively.

How about efficiency? As can be seen in the table above, the calculation of the remainder is based on a division operation, which is, — alas — a rather expensive operation on many platforms. On most processors, the ‘div’ operation is many times slower than other primitive operations. As an example, some popular ARM processors don’t even have a ‘div’ instruction, so division is done through a software library which consumes up to 40 cycles. Compare this to the 5 cycle multiplication instruction and the other instructions that typically execute in a single cycle.

Due to these portability and efficiency issues, many developers shun the modulo operator in performance-critical code. Instead of


i = (i + n) mod N

i = (i + n) mod N

they use the computationally equivalent


i = i + n
while i >= N
    i = i - N
while i < 0
    i = i + N

i = i + n

while i >= N

i = i - N

while i < 0

i = i + N

or, if it guaranteed that our index is not more than N – 1 off the ends


if i >= N
    i = i - N
if i < 0
    i = i + N

if i >= N

i = i - N

if i < 0

i = i + N

But this approach is still not very efficient, since it uses branches and on modern multi-stage pipelined processors a branch might invalidate already prefetched and preprocessed instructions. Isn’t there a true mathematical ‘mod’ operator out there that is also efficient at the same time? You bet there is, but only if N happens to be a base-2 number.

If you are lucky and N is a base-2 number (like 64, 1024, or 4096) i mod N is computationally equivalent to


i and (N - 1)

i and (N - 1)

where ‘and’ is the binary and operator (‘&’ in C, C++, and Java). This works even if i is negative, but requires that your environment stores negative numbers in two’s complement fashion, which is the case for pretty much all systems that you will ever program for.

As an example, consider -2 mod 16. -2 is 0xFFFFFFFE in 32-bit two’s complement notation and 16 – 1 corresponds to 0x0000000F:


  0xFFFFFFFE
& 0x0000000F
  ----------
  0x0000000E

0xFFFFFFFE

& 0x0000000F

----------

0x0000000E

which is 14, exactly, what a mathematical ‘mod’ operator would yield.

The ‘and’ operation is easy to compute for any processor. In C/C++ we might define an optimized ‘mod’ function like this:


inline int mod_base2(int dividend, int divisor) {
    // Ensure divisor is positive.
    assert(divisor >= 0);
    // Ensure divisor is a base-2 number.
    assert((divisor & (divisor - 1)) == 0);
    return dividend & (divisor - 1);
}

inline int mod_base2(int dividend, int divisor) {

// Ensure divisor is positive.

assert(divisor >= 0);

// Ensure divisor is a base-2 number.

assert((divisor & (divisor - 1)) == 0);

return dividend & (divisor - 1);

}

It is a good idea to use this optimization whenever possible. Sometimes, it makes even sense to round-up the size of circular structures to a base-2 boundary, just to be able to use this kind of optimization.

A variant of this theme is casting a (potentially negative) value to an unsigned type in C/C++. Casting x to an uint8_t is equivalent to calculating x mod 256. While most optimizing compilers will generate the same code for x & 0xFF and (uint8_t)x, there is a certain likelihood that the latter might be a bit faster. The obvious disadvantage of casting to unsigned is, that this approach practically limits N to the value range of uint8_t, uint16_t, and uint32_t, which is 256, 65536, and 4294967296, respectively. Because of the fact that the performance gain in only hypothetical, it is usually much wiser to go for the ‘mod_base2’ optimization, though.

This concludes my first installment, which is mainly about some of the many facets of the ‘mod’ operator. Just like the constant PI appears in all circular problems in mathematics, some variant of the ‘mod’ operator appears in all circular problems in computer science. Next time, I will explore what it means to have two indices into a circular structure, which turns out to be the foundation of many interesting circular use cases.

More circular adventures…

Dear Compiler, What’s My Size?

Sometimes, when you work with classes or structs, your code is not only dependent on particular fields (attributes, member variables) but on all of them at the same time.

The copy constructor, assignment operator, and the equality operator are prominent examples: if you add a field to your class, these functions need to be maintained in parallel; that is, all of them potentially need to be updated to support this new field. And we all know that it is so easy to forget to update one of them…

I got bitten by this when I used a class that lacked an operator to stream out all of its members. Since I didn’t own the source code of the class (it was part of a library), I couldn’t modify it directly. (I actually could have modified it, but this would have been outright stupid). Instead, I decided to add the missing functionality in my own code. Here is a simplified example:


// Class that I didn't own.
class Fruit {
...
private:
    std::string name;
    int id;
    Color color;
...
};

// Class that I didn't own.

class Fruit {

...

private:

std::string name;

int id;

Color color;

...

};

Here’s what I added in my code:


#include "Fruit.h"

// Stream-out all attributes of Fruit.
ostream& operator<<(ostream& s, const Fruit& fruit) {
    s << fruit.getName() << ":"
      << fruit.getId() << ":"
      << fruit.getColor() << endl;
    return s;
}

#include "Fruit.h"

// Stream-out all attributes of Fruit.

ostream& operator<<(ostream& s, const Fruit& fruit) {

s << fruit.getName() << ":"

<< fruit.getId() << ":"

<< fruit.getColor() << endl;

return s;

}

It wasn’t long until I got a new version of the library in which “Fruit” was extended by another attribute, “weight”. Since my code didn’t support this new attribute yet, it was broken. Even worse: by (unconsciously) ignoring the new attribute my code failed silently at run-time! Just in case you didn’t know: If there is one thing I absolutely can’t stand it’s if CODE FAILS SILENTLY AT RUN-TIME.

Something had to be done to prevent this from happening again. After a while I came up with the idea of checking the size of class Fruit at compile-time against a hard-coded value. Even though this wasn’t perfect (you can easily think of cases where you would get false positives) it gave me a good chance of catching similar modifications next time I rebuilt my own code:


// Stream-out all attributes of Fruit.
ostream& operator<<(ostream& s, const Fruit& fruit) {
    static_assert(sizeof(Fruit) == 24, "Fruit has been tampered with");
    s << fruit.getName() << ":"
      << fruit.getId() << ":"
      << fruit.getColor() << ":"
      << fruit.getWeight() << endl;
    return s;
}

// Stream-out all attributes of Fruit.

ostream& operator<<(ostream& s, const Fruit& fruit) {

static_assert(sizeof(Fruit) == 24, "Fruit has been tampered with");

s << fruit.getName() << ":"

<< fruit.getId() << ":"

<< fruit.getColor() << ":"

<< fruit.getWeight() << endl;

return s;

}

(Note: static assertions are like plain assertions but checked by the compiler at compile-time, not at run-time. I’m using a C++ 2011 feature here, but you can implement static assertions yourself, if your compiler lacks support for them. Read more here.)

Now, if the compiler flagged a failed assertion I would review the public interface of class Fruit and look for any new attributes, update my code along with the reference value of the size and everything would be fine. But how would I get the reference size value (i. e. 24) in the first place?

Calculating the size of a class or struct by hand is not trivial. There can be hundreds of members, including union members, padding bytes, hidden pointers to vtables and the like. A viable solution is to add a temporary print statement that writes sizeof(Fruit) to stdout, but that would require linking and running the program. And what if you are working on an embedded system where there is no stdout or where it is impossible to inspect the value in a debugger?

No, these approaches are too slow and cumbersome. If the compiler knows the size of Fruit it should be able to tell me. But how?

This question haunted me for quite some time. I devised a cunning plan: why not add a statement containing a syntax error that somehow forced the compiler to spit out the size of my class. Among other things, I tried this:


extern int foo[-(int)sizeof(Fruit)];

extern int foo[-(int)sizeof(Fruit)];

This hack exploits the fact that in C/C++ it is illegal to declare an array of negative size. I had hoped to get a message like this:


FruitApp.cpp:39 error: size of array ‘foo[-24]’ is negative.

FruitApp.cpp:39 error: size of array ‘foo[-24]’ is negative.

Alas, all compilers that I checked-out just gave a worthless error message among these lines:


FruitApp.cpp:39 error: size of array ‘foo’ is negative.

FruitApp.cpp:39 error: size of array ‘foo’ is negative.

There was no clue as to what the actual size was. Too bad. The compiler knows the answer but it doesn’t want to tell me!

I was just about to give up when I suddenly had another idea: what the compiler would always tell me is the line on which an error occurred. Why not attempt to create many arrays, where the size is decremented, starting from sizeof(Fruit):


// Fails if sizeof(Fruit) == 1
extern int foo_1[(int)sizeof(Fruit) - 1 - 1];
// Fails if sizeof(Fruit) == 2
extern int foo_2[(int)sizeof(Fruit) - 1 - 2];
...
// Fails if sizeof(Fruit) == 42
extern int foo_42[(int)sizeof(Fruit) - 1 - 42];
...

// Fails if sizeof(Fruit) == 1

extern int foo_1[(int)sizeof(Fruit) - 1 - 1];

// Fails if sizeof(Fruit) == 2

extern int foo_2[(int)sizeof(Fruit) - 1 - 2];

...

// Fails if sizeof(Fruit) == 42

extern int foo_42[(int)sizeof(Fruit) - 1 - 42];

...

I would just include this header file and the compiler would compile line after line until it hits a syntax error due to a negative array size. When it does, it would report the corresponding line number which in turn corresponds to the size of Fruit. G-r-e-a-t!

Generating this include file (getsize.h) was straightforward: I cobbled together a simple Bash for-loop:


for ((x=1; x<=65535; ++x)); do
    echo "extern int size_is_$x[(int)sizeof(GETSIZE_TYPE)-1-$x];";
done > getsize.h

for ((x=1; x<=65535; ++x)); do

echo "extern int size_is_$x[(int)sizeof(GETSIZE_TYPE)-1-$x];";

done > getsize.h

I used it like this:


ostream& operator<<(ostream& s, const Fruit& fruit) {
    #define GETSIZE_TYPE Fruit
    #include "getsize.h"
    static_assert(sizeof(Fruit) == 16, "Fruit has been tampered with");
    s << fruit.getName() << ":"
      << fruit.getId() << ":"
      << fruit.getColor() << ":"
      << fruit.getWeight() << endl;
    return s;
}

ostream& operator<<(ostream& s, const Fruit& fruit) {

#define GETSIZE_TYPE Fruit

#include "getsize.h"

static_assert(sizeof(Fruit) == 16, "Fruit has been tampered with");

s << fruit.getName() << ":"

<< fruit.getId() << ":"

<< fruit.getColor() << ":"

<< fruit.getWeight() << endl;

return s;

}

Compiling this code yields the following error messages (the exact messages may vary, depending on the compiler that you use):


getsize.h:24 error: size of array ‘size_is_24’ is negative
getsize.h:25 error: size of array ‘size_is_25’ is negative
getsize.h:26 error: size of array ‘size_is_26’ is negative
getsize.h:27 error: size of array ‘size_is_27’ is negative
...

getsize.h:24 error: size of array ‘size_is_24’ is negative

getsize.h:25 error: size of array ‘size_is_25’ is negative

getsize.h:26 error: size of array ‘size_is_26’ is negative

getsize.h:27 error: size of array ‘size_is_27’ is negative

...

The first line that is reported as erroneous is the size of class Fruit, 24 in our example. Once you know the size you just need to update the reference value and comment out the two lines to have them ready when you need them again.


ostream& operator<<(ostream& s, const Fruit& fruit) {
    // #define GETSIZE_TYPE Fruit
    // #include "getsize.h"
    static_assert(sizeof(Fruit) == 24, "Fruit has been tampered with");
    s << fruit.getName() << ":"
      << fruit.getId() << ":"
      << fruit.getColor() << ":"
      << fruit.getWeight() << endl;
    return s;
}

ostream& operator<<(ostream& s, const Fruit& fruit) {

// #define GETSIZE_TYPE Fruit

// #include "getsize.h"

static_assert(sizeof(Fruit) == 24, "Fruit has been tampered with");

s << fruit.getName() << ":"

<< fruit.getId() << ":"

<< fruit.getColor() << ":"

<< fruit.getWeight() << endl;

return s;

}

You might find all of this a bit hacky, but I happen to like it very much. It serves a good purpose and nicely solved my problem. On top of that, it gave me a really good time. What more can one expect?

[Update 2015-12-06: Bernd Hanebaum sent me a nice implementation of this idea that is based on template metaprogramming — many thanks, Bernd!]

Approxion

Code – People – Everything

Software Project Democracy

Circular Adventures II: A Ring within a Ring

Circular Adventures I: The Modulo Operation

Dear Compiler, What’s My Size?