Category Archives: Code

The Need for an Hippocratic Oath for Software Engineers

Two months ago, I wrote about an incident where a fault in an airbag system was responsible for the death of a child. (Actually, as a court ruled later, it was not the airbag system that was responsible, but rather an engineer who kept quiet about a bug that he had discovered earlier.)

Now, it seems like we have another case of a software-based disaster: Volkswagen recently admitted that they used software in their cars that detects official test situations and then reconfigures the engine to reduce pollution — just to get the figures right. When not in this “cheat mode”, that is, under real driving conditions, Volkswagen’s NOx emission rates are up to 40 times higher.

I think this cunning piece of German engineering has the potential to smash Volkswagen. Not extinguish entirely, of course, but Volkswagen’s stock value might decrease so much that it can be taken over easily by the competition. In any case, it will cost this company, whose bosses probably thought it was too big to fail, big: billions of dollars, thousands of jobs, an immeasurable amount of reputation and credibility.

What makes this story so exceptionally shocking is that we are not talking about a bug (like in the failing airbag case) but a feature. This software was put in deliberately, so it is rather a sin of commission than a sin of omission.

While it would be interesting to know what made people act in such a criminal way, what really matters is this: the whole mess could have probably been avoided if someone along the chain of command would have stood up and said “No!”.

I will use this sad story as an opportunity to introduce the “Software Engineering Code of Ethics and Professional Practice” to you, a joint effort by IEEE and ACM. As for the Volkswagen scandal, it looks as if at least the following principles have been grossly violated:

Software Engineers shall:

1.03. Approve software only if they have a well-founded belief that it is safe, meets specifications, passes appropriate tests, and does not diminish quality of life, diminish privacy or harm the environment. The ultimate effect of the work should be to the public good.

1.04. Disclose to appropriate persons or authorities any actual or potential danger to the user, the public, or the environment, that they reasonably believe to be associated with software or related documents.

1.06. Be fair and avoid deception in all statements, particularly public ones, concerning software or related documents, methods and tools.

2.06. Identify, document, collect evidence and report to the client or the employer promptly if, in their opinion, a project is likely to fail, to prove too expensive, to violate intellectual property law, or otherwise to be problematic.

3.03. Identify, define and address ethical, economic, cultural, legal and environmental issues related to work projects.

5.11. Not ask a software engineer to do anything inconsistent with this Code.

6.13. Report significant violations of this Code to appropriate authorities when it is clear that consultation with people involved in these significant violations is impossible, counter-productive or dangerous.

In a world in which an ever-increasing part is driven by software, “The Code” should be considered the “Hippocratic Oath” of software engineers, something we are obliged to swear before we can call ourselves software professionals, before we are unleashed to an unsuspecting world. I wonder how many more software catastrophes it will take until we finally get there.

Bug Hunting Adventures #8: Just Like a Rubber Ball…

Two incidents gave rise to this Bug Hunting episode: first, I’m currently beefing up my burglar alarm system with a Raspberry Pi and second, I’ve just come across the old song from the ’60s “Rubber Ball” by Bobby Vee while randomly surfing YouTube (“Bouncy, bouncy, bouncy, bouncy”).

The alarm system extension I have in mind is this: I want to be able to configure temporary inhibition of motion detection via a simple push-button: if you press it once, you suppress motion detection by one hour, if you press it n times, you suppress it by n hours, respectively. If you press the button after you haven’t pressed it for five seconds, any suppression is undone. There are status LEDs that give visual feedback such that you know what you’ve configured.

Thanks to the fine RPi.GPIO library, accessing GPIO ports is usually a piece of Pi. However, there is a fundamental problem when reading digital signals from switches and buttons called “switch bounce”, a phenomenon that has driven many brave engineers nuts.

If you flip a switch, there never is a nice transition from zero to one (or one to zero, depending on how you flipped the switch). Instead, the signal bounces between zero and one, in a rather indeterminate fashion for many milliseconds. The bouncing characteristics depend on various factors and can even vary for the same type of switch.

The takeaway is this: with switches (and buttons, of course), expect the unexpected! You must use some sort of debouncing mechanism, implemented in either hardware or software. For some background information check out this excellent article by embedded systems luminary and die-hard veteran Jack Ganssle.

But back to my alarm system. I skimmed the RPi.GPIO documentation and was happy to read that since version 0.5.7, RPi.GPIO has support for debouncing of digital input signals, so there was no need to debounce the push-button myself. Exuberantly, I sat down and wrote this test code:

To avoid arbitrary, floating input voltages, I setup the port such that it is pulled down (to ground) by default. As a consequence, reading that port when the button is not pressed yields a clean zero.

I didn’t want to poll the button port, as this would burn valuable CPU cycles, so I registered an event handler with the RPi.GPIO library that would be called back once a rising edge had been detected. I added a ‘bouncetime’ parameter of 200 milliseconds, to avoid multiple calls to my ‘button_pressed’ handler, in full accordance with the RPi.GPIO documentation:

To debounce using software, add the ‘bouncetime=’ parameter to a function where you specify a callback function. Bouncetime should be specified in milliseconds. For example:

In other words, after a rising edge has been detected due to a button press, I won’t be bothered with spurious rising edges for 200 ms. (200 ms is plenty of time — no switch I’ve ever seen bounced for more than 100 ms. Yet, 200 ms is short enough to handle cases where the user presses the button in rapid succession.)

I tested my button for a minute and it worked like a charm; then, I varied my press/release scheme and suddenly some spurious “Button pressed!” messages appeared. After some head scratching, I discovered the problem — can you spot it, too? (Hint: it’s a logic error, not something related to misuse of Python or the RPi.GPIO library.)

Epilogue: Once the problem was understood, the bug was fixed easily; getting rid of this darn, “bouncy, bouncy” earworm was much harder, though.

Solution

Bug Hunting Adventures #7: Random Programming

If you want to draw m random samples out of a set of n elements, Knuth’s algorithm ‘S’ (ref. The Art of Computer Programming, Vol 2, 3.4.2) comes in handy. It starts out by attempting to select the first element with a probability of m/n; subsequent elements are selected with a probability that depends on whether a previous element was drawn or not.

Let’s do an example. To select 2 random samples out of a set of 5 (say, the digits ‘0’, ‘1’, ‘2’, ‘3’, and ‘4’), we select ‘0’ with a probability of 2/5. If ‘0’ is not chosen, we attempt to select ‘1’ with a probability of 2/4 (since there are only 4 candidates left); otherwise (if ‘0’ has been selected) with a probability of 1/4, since one attempt to draw has already been used up. It’s easy to show the corresponding element selection probabilities in a binary tree:

In this tree, the nodes represent the likelihood that the element in the column on the right will be picked. Transitions from a node to the right mean that the element was picked, while transistions to the left denote that the corresponding element was not picked. In the ‘selected’ case, the numerator is decreased by one; the denominator is decreased in either case.

Here is an implementation of Knuth’s algorithm in C:

For the sake of demonstration, assume that bigrand() is a function that returns a random number that is “big enough”, meaning a value that is greater or equal to 0 and has an upper bound of at least n – 1 (a larger upper bound does not harm). Spend some time to understand how the ‘if’ statement implements the selection with the correct probablity.

So far, so good. But how would this clever algorithm look in another language? Let’s port it to Ruby!

Not being a Ruby programmer (but being an experienced Online Programmer), I had to ask Auntie Google many questions in order to cobble together this code:

This Ruby code looks remarkably similar to the C version. It doesn’t contain any syntax errors, but nevertheless doesn’t output what it is supposed to. Can you see why?

For those folks who don’t want to waste their time on Ruby, here is the Python version, sporting the same behavior and thus the same blunder:

Solution

Oh No! The ‘in’ Operator Strikes Again!

depressed

“Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.”
— Scott Adams

When I discovered the eerie truth about JavaScript’s ‘in’ operator, I thought that it couldn’t get any weirder. But boy-oh-boy, was I wrong!

Let’s first recap what’s so bad about JavaScript’s implementation of ‘in’. If you have a list (aka. array) in JavaScript and use the ‘in’ operator to test for membership, you are asking for trouble:

Contrary to what you might expect, the ‘in’ operator doesn’t test whether the left-hand side value is a member of the array, but rather whether it is a valid index into the array. This makes absolutely no sense for arrays, but it does for dictionaries, since it tests whether a given key is present or not:

The moral of the story? Don’t use the ‘in’ operator in JavaScript on arrays to test for membership; with dictionaries everything is fine.

The other day, I found another shocking behavior of the ‘in’ operator, but this time in Groovy; not when used on arrays — when used on dictionaries! Consider this routine. It has two parameters, a list of words (the input text) and a dictionary which contains an entry for every word whose number of occurrences in the input text is to be counted:

The initial word counts in ‘wordsToCount’ are set to 0; the word counts are increased for every corresponding word found in ‘wordsList’. For instance:

should print:

At least, this is what most people would expect and what one would get in Python or JavaScript. Alas, what you will get in Groovy is this:

When used on dictionaries, Groovy’s ‘in’ operator not only checks whether the given key on its left-hand side exists in the dictionary, it also overeagerly checks if the corresponding value is ‘true’ according to the Groovy Truth Rules; that is, 0, null, empty strings and empty lists will make the ‘in’ operator return false.

The ‘in’ operator should only test for membership; that is, it should test if the left-hand side is in the collection on the right-hand side. On no account should it interpret an element’s value.

Not following a well-established convention is bad enought, but being inconsistent within the same language is a lot worse: when used with lists/arrays, the ‘in’ operator just checks for membership and doesn’t care a bag of beans about the value of a member:

It won’t get any weirder — it simply can’t!

Bug Hunting Adventures #6: Logarithmic Errors

Sometimes, it is necessary to find out how an application uses dynamically memory. Especially on constrained (think embedded) systems you should know, for instance, what block sizes are typically requested, how much memory is allocated at any one time, and how much memory has been requested at most so far (allocation high-water mark). Data like this allows you to fine-tune your memory allocator and/or roll out your own highly-efficient, special-purpose allocator for popular block sizes.

I recently implemented such a statistics-enhanced version of operator new/delete and one of its tasks was to find out how frequently blocks of certain sizes were requested. Instead of keeping track of exact request sizes (which would have been prohibitively expensive on the target system) I came up with a simpler, but sufficient scheme where I counted allocations to block sizes, rounded to the next base-2 boundary:

log2ceil is the ‘ceiled’ (rounded-up) binary logarithm of a given value. As an example, log2ceil(100) and log2ceil(128) both yield the same value of 7. Hence, the value g_allocations[7] would tell me how many allocations of block sizes in the range of 65 – 128 bytes there have been.

Below is the code of a log2ceil function; it works by counting the number of leading (left-most) zero bits in a word and subtracting this value from 31. In order to get the desired rounding behavior, a well-known trick is applied: first, decrement the argument by one and then increment the result by one.

This implementation is straightforward and all the tests pass. Still, it has a subtle portability issue. Can you see it?

Code
Solution

Bug Hunting Adventures #5: False Diagnosis

Today’s cars host dozens of ECUs (electronic control units) that exchange tens of thousands of messages and signals on various buses (eg. CAN, MOST, Flexray, Ethernet). On top of that, ECUs provide hundredths of diagnostic services, that allow factories, maintenance personnel and even devices on the car to obtain (and change) various parameters, including vehicle speed, diagnostic trouble codes, and sensor/actuator data. By using diagnostic services, you can literally remote control your car, provided you get passed the security mechanisms, of course.

Diagnostic commands are exchanged based on the UDS protocol, as specified by ISO 14229-1:2013, via request/response pairs. Requests and responses are coded like this:

Request: SA, DA, SID1, ..., SIDn
Response (OK): SA, DA, SID1|0x40, SID2, ..., SIDn, RESP-PAYLOAD
Response (error): SA, DA, 0x7F, SID1, ERROR-CODE

Every ECU in a car has a unique 1-byte diagnostic address that uniquely identifies it (signified as source address SA and destination address DA, depending on whether a device acts as a sender or receiver) and each diagnostic service of an ECU is addressed by a 1 to n byte long service ID (SID). (Note that in reality, the n-byte “SID” consists of the real SID and a sub-level function ID, but this detail is of no importance for our considerations.)

If, for example, a device with SA = 0x20 wants to read all diagnostic trouble codes from an ECU with DA = 0x60, it would issue this request:

0x20 0x60 0x19 0x0A

where 0x19 0x0A is the two-byte service ID for the “READ DIAGNOSTIC TROUBLE CODES” service.

The positive response of the targeted ECU might look like this:

0x60 0x20 0x59 0x0A 0xFF 0x02 0x40 0x00 0x50 0x02 ...

Notice that the values of SA and DA are swapped; the replying ECU is now the sender and the requesting ECU the receiver. 0x59 is the first byte of the service ID (SID1) or’ed with the “response OK” flag 0x40. The response payload (the requested information, that is, the actual trouble codes) starts with 0xFF 0x02… .

If an ECU cannot fulfill a request, it sends a negative response indicator, followed by an error code that gives details on what went wrong:

0x60 0x20 0x7F 0x019 0x11

Here, 0x7F denotes ‘error’, 0x19 is the first SID byte of the original request (this time without the “response OK” flag) and 0x11 is the error code for “SERVICE NOT SUPPORTED”.

There are two error code values that usually require special treatment: If an ECU responds with error code 0x21 it means that it is currently busy and unable to process the request; if it responds with 0x78 it denotes that it is able to process the request but needs more time to complete it.

This description of the UDS protocol is of course an oversimplification, but it is enough for our bug-hunting purposes. Attached is an extract of an application that incorrectly processes the response to a diagnostic service request. Can you diagnose the problem?

Code
Solution