Category Archives: C/C++/Embedded

Dangerously Confusing Interfaces IV: The Perils of C’s “safe” String Functions

“It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.”
–Mark Twain
Buffer overflows are among the most frequent causes of security flaws in software. They typically arise in situations such as when a programmer is 100% certain that the buffer to hold a user’s name is big enough — until a guy from India logs in. Thus, well-behaved developers always use the bounded-length versions of string functions. Alas, they come with differing, dangerously confusing interfaces.

THE GOOD

Let’s start with ‘fgets‘:


char buffer[30]; /* 30 bytes ought to be enough for everyone! */
fgets(buffer, sizeof(buffer), stdin);

No matter what users type into their terminals, ‘fgets’ will ensure that ‘buffer’ is a well-formed, zero-terminated string of at most 29 characters (one character is needed for the ‘\0’ terminator). The same goes for the ‘snprintf’ function. After executing the following code


char buffer[4];
snprintf(buffer, sizeof(buffer), "The quick brown fox");

‘buffer’ will contain the string “The”, again, properly zero-terminated.

Both functions follow the same, easy-to-grasp pattern: you pass a pointer to a target buffer as well as the buffer’s total size and get back a terminated string that doesn’t overflow the buffer. Awesome!

THE BAD

In order to copy strings safely, developers often reach for ‘strncpy‘ to guard themselves against dreaded buffer overruns:


char buffer[30]; /* 30 bytes ought to be enough for everyone! */
strncpy(buffer, user_name, sizeof(buffer)); /* safer than good ol' strcpy? */

Unfortunately, this is not how ‘strncpy’ works! We assumed that it followed the pattern established by ‘fgets’ and ‘snprintf’, but that’s not the case. Even though ‘strncpy’ promises never to overflow the target buffer, it doesn’t necessarily zero-terminate it. What it does is copy up to ‘sizeof(buffer)’ bytes from ‘user_name’ to ‘buffer’, but if none of the copied bytes is ‘\0’ (i.e. ‘user_name’ comprises ‘sizeof(buffer)’ or more characters), ‘strncpy’ leaves you with an unterminated string! A traditional approach to work around this shortcoming is to enforce zero-termination by putting a ‘\0’ character into the last element of the target buffer after the call to ‘strncpy’:


strncpy(buffer, user_name, sizeof(buffer));
buffer[sizeof(buffer) - 1] = '\0';

Using ‘strncpy’ without such explicit string termination is almost always an error — a rather insidious one, as your code will work most of the time but not when the buffer is completely filled (i. e. your Indian colleague “Villupuram Chinnaih Pillai Ganesan” logs on).
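
The effect is easy to observe. The following is a minimal sketch of my own, not a definitive implementation; the helper names ‘is_terminated’, ‘copy_raw’ and ‘copy_terminated’ are made up for illustration:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the first 'size' bytes of 'buf'
 * contain a '\0' terminator, 0 otherwise. */
int is_terminated(const char *buf, size_t size) {
    return memchr(buf, '\0', size) != NULL;
}

/* Plain 'strncpy': may leave 'dst' without a terminator. */
void copy_raw(char *dst, size_t dst_size, const char *src) {
    strncpy(dst, src, dst_size);
}

/* 'strncpy' plus the traditional fix: force a '\0' into the last element. */
void copy_terminated(char *dst, size_t dst_size, const char *src) {
    assert(dst_size > 0);
    strncpy(dst, src, dst_size);
    dst[dst_size - 1] = '\0';
}
```

With an 8-byte buffer and the 8-character source "12345678", ‘copy_raw’ fills all eight bytes and ‘is_terminated’ reports 0, whereas ‘copy_terminated’ leaves the string "1234567" behind.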

Boy, oh boy is this inconsistent! ‘fgets’ and ‘snprintf’ give you guaranteed zero-termination but ‘strncpy’ doesn’t. A clear violation of the principle of least surprise. Apparently, ‘strncpy’ fixes one safety problem and at the same time lays the foundation for another one.

THE UGLY

Can it get worse? You bet! How do you think ‘strncat‘, the bounded-length string concatenation function, behaves? Ponder this code:


const char* string1 = "123";
const char* string2 = "4567890";

char buffer[7];

/* First, safely fill buffer with string1. */
strncpy(buffer, string1, sizeof(buffer));
buffer[sizeof(buffer) - 1] = '\0';

/* Next, concatenate strings. */
strncat(buffer, string2, sizeof(buffer));

But this is wrong, of course: the third argument to ‘strncat’ (let’s call this argument ‘n’) is not the size of the target buffer. It is the maximum number of characters to copy from the source string (‘string2’) to the destination buffer (‘buffer’). If the length of the source string is greater than or equal to ‘n’, ‘strncat’ copies ‘n’ characters plus a ‘\0’ to terminate the target string, so up to ‘n + 1’ bytes are written after the text that is already in ‘buffer’. Confused? Don’t worry, here’s how you would use it to avoid concatenation buffer overruns:


strncat(buffer, string2, sizeof(buffer) - strlen(buffer) - 1); 
    // -1 to account for '\0'.

Yuck! What’s the likelihood that people remember this correctly?
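
One way to avoid having to remember the formula is to hide it in a small wrapper. This is only a sketch under the assumption that ‘dst’ already holds a terminated string; the name ‘append_bounded’ is hypothetical:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: appends 'src' to the terminated string in 'dst'
 * without writing past 'dst_size' bytes. Excess characters are dropped. */
void append_bounded(char *dst, size_t dst_size, const char *src) {
    size_t used = strlen(dst);
    assert(used < dst_size);   /* 'dst' must already be terminated */
    /* Room left for characters, minus one byte for the '\0'. */
    strncat(dst, src, dst_size - used - 1);
}
```

Applied to the 7-byte buffer from above, which already contains "123", appending "4567890" yields "123456": three characters fit, the rest is silently dropped.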

THE REMEDY

Even if the different interfaces and behaviors of the bounded-length string functions in the C API make sense for certain use cases (or made sense at some point in time), the upshot is that they confuse programmers and potentially lead to new security holes when in fact they were supposed to plug them. What’s a poor C coder supposed to do?

As always, you can roll your own versions of bounded/safe string functions or use my safe version of ‘strcpy’. If you’d rather use something from the standard library, I’d suggest that you use ‘snprintf’ as a replacement for both ‘strncpy’ and ‘strncat’:


/* Safe replacement for 'strncpy' */
snprintf(buffer, sizeof(buffer), "%s", string1);

/* Safe replacement for 'strncat' */
snprintf(buffer, sizeof(buffer), "%s%s", string1, string2);

Looks like ‘snprintf’ is the Swiss Army knife of safe string processing, doesn’t it? The moral is this: use whatever you’re comfortable with, but refrain from using ‘strncpy’ or ‘strncat’ directly.
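
A nice side effect of ‘snprintf’ is its return value: since C99 it reports the length the complete string would have had, so truncation can be detected. Here is a minimal sketch; the helper name ‘join_strings’ is my own invention, and the source strings must not overlap with the target buffer:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes 'first' followed by 'second' into 'dst'.
 * Returns 1 if everything fit, 0 if the result was truncated or an
 * encoding error occurred. 'dst' is terminated in the truncation case. */
int join_strings(char *dst, size_t dst_size,
                 const char *first, const char *second) {
    assert(dst != NULL && dst_size > 0);
    int needed = snprintf(dst, dst_size, "%s%s", first, second);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

Joining "123" and "4567890" into a 7-byte buffer leaves "123456" in the buffer and returns 0, which tells the caller that characters were lost.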

Playgrounds Revamped

“Play is the highest form of research.”
— Albert Einstein

Many years ago, I wrote about the importance of having playgrounds, that is, easy-to-access try-out areas for carrying out programming-related experiments with the overall goal of exploring and learning.

Recently, I’ve reworked my C++ playground and uploaded it to GitHub. Compared to my previous C++ playground, the new one comes with the following major advantages:

Shared access to playgrounds from multiple computers — since it is based on a Git repository.
Every experiment has its own subdirectory — the top-level playground directory stays clean and clearly arranged.
Unit test support through Google Test — running ‘make’ not just builds the experiment but also executes contained unit tests.

Once cloned and installed, you can start a new experiment like this:


cd ~/pg-cpp
. pg-setup init_within_loop_body

‘pg-setup’ will create a directory called ‘init_within_loop_body’ along with a ‘Makefile’ and an ‘init_within_loop_body.cpp’ source file. Plus, if you have defined your ‘EDITOR’ environment variable properly, it will open ‘init_within_loop_body.cpp’ in your favorite editor for you. All that’s left to do is add your experiment’s code to the testcase template:


// This experiment tests if a variable inside a loop body
// is initialized with every iteration.
TEST(init_within_loop_body, simple) {
    for (int i = 0; i < 10; ++i) {
        int k = 0;
        // Assume that k is initialized every time.
        EXPECT_EQ(0, k);
        ++k;
    }
}

Now, just type/execute ‘make’ (either from within your editor or from the command-line) and your code will be compiled and run:


g++  -W -Wall -g -pthread -I /home/ralf/get-me-gtest/googletest-release-1.8.0/googletest/include -I /home/ralf/get-me-gtest/googletest-release-1.8.0/googlemock/include -L /home/ralf/get-me-gtest/googletest-release-1.8.0/googlemock  init_within_loop_body.cpp  -l gmock_main -o init_within_loop_body
./init_within_loop_body
Running main() from gmock_main.cc
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from init_within_loop_body
[ RUN      ] init_within_loop_body.simple
[       OK ] init_within_loop_body.simple (0 ms)
[----------] 1 test from init_within_loop_body (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 1 test.

Pointers in C, Part II: CV-Qualifiers

“A teacher is never a giver of truth; he is a guide, a pointer to the truth that each student must find for himself.”
— Bruce Lee

In part I of this series, I explained what pointers are in general, how they are similar to arrays, and (more importantly) where, when, and why they are different from arrays. Today, I’ll shed some light on the so-called ‘cv-qualifiers’, which are frequently encountered in pointer contexts.

CV-QUALIFIER BASICS

CV-qualifiers allow you to supplement a type declaration with the keywords ‘const’ or ‘volatile’ in order to give a type (or rather an object of a certain type) special treatment. Take ‘const’, for instance:


const double PI = 3.1415927;
PI = 1.23;  // Error, PI is constant.
PI += 1;    // ditto.

‘const’ is a guarantee that a value isn’t (inadvertently) changed by a developer. On top of that, it gives the compiler some leeway to perform certain optimizations, like placing ‘const’ objects in ROM/non-volatile memory instead of (expensive) RAM, or even not storing the object at all and instead inlining the literal value wherever it’s needed.

‘volatile’, on the other hand, prevents optimizations. It’s a hint to the compiler that the value of an object can change in ways not known by the compiler and thus the value must never be cached in a processor register (or inlined) but instead always loaded from memory. Apart from this ‘don’t optimize’ behavior, there’s little that ‘volatile’ guarantees. In particular, and contrary to common belief, it’s no cure for typical race condition problems. It’s mostly used in signal handlers and to access memory-mapped hardware devices.
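
To illustrate the signal handler case, here is a minimal sketch built on the standard ‘signal’ and ‘raise’ functions. The names ‘got_signal’, ‘on_signal’ and ‘raise_and_check’ are made up for this example:

```c
#include <assert.h>
#include <signal.h>

/* Flag written by the handler and read by regular code. 'volatile' forces
 * the compiler to reload it from memory instead of caching it. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signum) {
    (void)signum;
    got_signal = 1;
}

/* Installs the handler, sends SIGINT to the running program and returns
 * the flag. 'raise' only returns after the handler has finished. */
int raise_and_check(void) {
    got_signal = 0;
    void (*previous)(int) = signal(SIGINT, on_signal);
    assert(previous != SIG_ERR);
    int rc = raise(SIGINT);
    assert(rc == 0);
    (void)rc;
    signal(SIGINT, previous);   /* Restore the former handler. */
    return got_signal;
}
```

Note that this only covers communication between a handler and the code it interrupts; for data shared between threads, ‘volatile’ is not a substitute for proper synchronization.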

Even if it sounds silly at first, it’s possible to combine ‘const’ and ‘volatile’. The following code declares a constant that shall not be inlined/optimized:


const volatile int MAX_SENSORS = 4;
...
for (int i = 0; i < MAX_SENSORS; ++i) {  // Always load MAX_SENSORS
                                         // value from memory.
    sum += sensors[i].value;
}

Using both ‘const’ and ‘volatile’ together makes sense when you want to ensure that developers can’t change the value of a constant and at the same time retain the possibility to update the value through some other means, later. In such a setting, you would place ‘MAX_SENSORS’ in a dedicated non-volatile memory section (i.e. flash or EEPROM) that is independent of the code, e.g. a section that only hosts configuration values^*. By combining ‘const’ and ‘volatile’ you ensure that the latest configuration values are used and that these configuration values cannot be altered by the programmer (i.e. from within the software).

To sum it up, ‘const’ means “not modifiable by the programmer” whereas ‘volatile’ denotes “modifiable in unforeseeable ways”.

CV-QUALIFIERS COMBINED WITH POINTERS

As I stated in the intro, cv-qualifiers often appear in pointer declarations. However, this poses a problem because we have to differentiate between cv-qualifying the pointer and cv-qualifying the pointed-to object. There are “pointers to ‘const’” and “‘const’ pointers”, two terms that are often confused. Here’s code involving a pointer to a constant value:


const int MAX_RATE = 200;
const int MIN_RATE = 10;
int default_rate = 42;

const int* rate;
rate = &MAX_RATE;    // Point to memory containing MAX_RATE.
rate = &MIN_RATE;    // Now point to memory containing MIN_RATE.

*rate = 1000;        // Error: pointer-to-const cannot modify
                     // pointed-to object.

rate = &default_rate; // Point to non-const value.
*rate = 1000;        // Error: pointer-to-const cannot modify
                     // pointed-to object.

Since the pointer is declared as pointing to ‘const’, no changes through this pointer are possible, even if it points to a mutable object in reality.
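
Keep in mind that a pointer to ‘const’ is merely a read-only view: the object itself can still change through other means. A small sketch (the function name ‘observe_through_const’ is hypothetical):

```c
#include <assert.h>

/* Returns the value seen through a pointer-to-const after the pointed-to
 * object has been changed through its own (non-const) name. */
int observe_through_const(void) {
    int default_rate = 42;
    const int *rate = &default_rate;   /* Read-only view of a mutable int. */
    assert(*rate == 42);
    default_rate = 1000;               /* Legal: the object is not const. */
    return *rate;                      /* The view reflects the update. */
}
```

In other words, ‘const int*’ restricts what can be done through this particular pointer; it makes no promise that the value never changes.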

Constant pointers, on the other hand, behave differently. Have a look at this example:


int default_rate = 42;  // Non-const value.
int current_rate = 19;  // ditto.

int* const p;                   // Error: const pointers must be
                                // initialized.
int* const p = &current_rate;   // Fine, point to a non-const value.
*p = 50;                        // Indirectly update current rate.
p = &default_rate;              // Error: const pointers can't be
                                // bound to another object.
++p;                            // ditto.

The takeaway is this: if the ‘const’ keyword appears to the left of the ‘*’, the pointed-to value is ‘const’ and hence we are dealing with a pointer to ‘const’; if the ‘const’ keyword is to the right of the ‘*’, the pointer itself is ‘const’. Of course, it’s possible to have the ‘const’ qualifier on both sides at the same time:


const int * const rate = &MAX_RATE;
*rate = 42;                     // Error: pointer to const can't 
                                // modify value.
++rate;                         // Error: const pointer can't 
                                // point elsewhere.

The same goes for multi-level pointers:


const int * const * v;

Here, ‘v’ is a regular (non-‘const’) pointer to a ‘const’ pointer to a ‘const’ integer.

Yuck! Sometimes, I really wish the inventors of C had used ‘<-‘ instead of ‘*’ for pointer declarations — the resulting code would have been easier on the eyes! Consider:


int* p;

versus


int <- p;    // say: "p is a POINTER TO int"


const int <- const <- v;

would read from right to left as “v is a POINTER TO const POINTER TO const int”. Life would be so much simpler… but let’s face reality and stop daydreaming!

Everything I said about ‘const’ equally applies to pointers to ‘volatile’ and ‘volatile’ pointers: pointers to ‘volatile’ ensure that the pointed-to value is always loaded from memory whenever a pointer is dereferenced; with ‘volatile’ pointers, the pointer itself is always loaded from memory (and never kept in registers).

Things really get complicated when there is a free mix of ‘volatile’ and ‘const’ keywords with pointers involving more than two levels of indirection:


volatile int * const volatile * volatile * p;

Let’s better not go there! If you are in multi-level pointer trouble, remember that there’s a little tool called ‘cdecl‘ which I showcased in the previous episode. But now let’s move on to the topic of how and when cv-qualified pointers can be assigned to each other.

ASSIGNMENT COMPATIBILITY I

Pointers are assignable if the pointer on the left hand side of the ‘=’ sign is not more capable than the pointer on the right hand side. In other words: you can assign a less constrained pointer to a more constrained pointer, but not vice versa. If you could, the promise made by the constrained pointer would be broken:


const int* pc;
int* p;

pc = p;     // OK, since 'p' is a read/write pointer and
            // 'pc' is a read-only pointer.
p = pc;     // Error: 'pc' is more constrained than 'p'.

If the previous statement were legal, a programmer could suddenly get write access to a read-only variable:


const int VALUE = 42;
const int* pc = &VALUE;     // Equal restrictiveness on both 
                            // sides (ie. const).
*pc = 43;                   // Error: no write access.
int* p = pc;                // Let's pretend this was legal...
*p = 43;                    // const value updated!

Again, the same restrictions hold for pointers to ‘volatile’. In general, pointers to cv-qualified objects are more constrained than their non-qualified counterparts and hence may not be assigned to a less qualified pointer. By the same token, this is not legal:


const volatile int* pcv;
const int* pc;
pc = pcv;               // Error: right hand side is more constrained...
pcv = pc;               // OK.

ASSIGNMENT COMPATIBILITY II

The rule which requires that the right hand side must not be more constrained than the left hand side might lead you to the conclusion that the following code is perfectly kosher:


int value = 100;
int* p = &value;
int** pp = &p;

const int** ppc = pp;   // Error: incompatible assignment.

However, it’s not, and for good reason, as I will explain shortly. But it’s far from obvious and it’s a conundrum to most — even seasoned — C developers. Why is it possible to assign a pointer to non-const to a pointer to ‘const’:


const int *pc;
int* p;
pc = p;             // OK.

but not a pointer to a pointer to non-const to a pointer to a pointer to ‘const’?


const int** ppc;
int** pp;
ppc = pp;           // Error.

Here is why. Imagine this example:


const int VALUE = 42;
int* p;
const int** ppc;
ppc = &p;           // Error, but let's pretend this was legal.

Graphically, our situation is this. ‘ppc’ points to ‘p’ which in turn points to some random memory location, as it hasn’t been initialized yet:


VALUE       0x00B00010: 00 00 00 2A     // 42
:           :
p           0x00004220: ?? ?? ?? ??     // Points to random location
ppc         0x00004224: 00 00 42 20     // Points to 'p'

Now, when we dereference ‘ppc’ one time, we get to our pointer ‘p’. Let’s point it to ‘VALUE’:


*ppc = &VALUE;

It shouldn’t surprise you that this assignment is valid: the right hand side (pointer to const int) is not more constrained than the left hand side (also pointer to const int). The resulting picture is this:


VALUE       0x00B00010: 00 00 00 2A     // 42
:           :
p           0x00004220: 00 B0 00 10     // Now points to 'VALUE'
ppc         0x00004224: 00 00 42 20     // Points to 'p'

Everything looks safe. If we attempt to update ‘VALUE’, we won’t succeed:


**ppc = 666; // Error: can't update through pointer to 'const'.

But we are far from safe. Remember that we also (indirectly) updated ‘p’, which was declared as pointing to a non-const int? The compiler would happily accept the following assignment:


*p = 666;

which leads to undefined behavior, as the C language standard calls it.

This example should have convinced you that it’s a good thing that the compiler rejects the assignment from ‘int**’ to ‘const int**’: it would open up a backdoor for granting write access to more constrained objects. Finding the corresponding words in the C language standard is not so easy, however, and requires some digging. If you feel “qualified” enough (sorry for the pun), look at chapter “6.5.16.1 Simple assignment”, which states the rules of object assignability. You probably also need to have a look at “6.7.5.1 Pointer declarators”, which details pointer type compatibility, as well as “6.7.3 Type qualifiers”, which specifies compatibility of qualified types. Putting this all into a cohesive picture is left as an exercise to the diligent reader.

________________________________
^*) Separating code from configuration values is generally a good idea in embedded contexts, as it allows you to replace either of them independently.

Pointers in C, Part I: Pointers vs. Arrays

“Remember, When You Point a Finger at Someone, There Are Three More Pointing Back at You”
— Unknown
It’s easy to meet even long-time C programmers who don’t fully grok pointers, let alone beginners. Because of this and the fact that pointers play such a crucial role in the C programming language, I’ve decided to launch a new series of blog posts on pointers. I want to start off with an episode that sheds some light on similarities and — more importantly — differences between pointers and arrays.

POINTERS AND ARRAYS: THE BASICS

An array is a sequence of same-sized objects, integers, for instance:


int array[] = { 
    0xA, 
    0xBBBB,
    0xCC000000
};

On a big-endian machine, ‘array’ could be stored like this (that it starts at memory address 0xB00010 is just an example):


array       0x00B00010: 00 00 00 0A     // First integer.
            0x00B00014: 00 00 BB BB     // Second integer.
            0x00B00018: CC 00 00 00     // Third integer.

The compiler (or rather the linker) places the array at a fixed memory location. Thus, when you think array, think memory.

By contrast, a pointer is an object that holds a memory address. Pointers are used to refer to memory where an object of a specific type (like ‘int’) resides.


int value = 42;
int* pointer = &value;


value       0x00B00800: 00 00 00 2A     // 0x2A == 42.
:
pointer     0xD70012A0: 00 B0 08 00     // Holds address of 'value', thus we say
                                        // that 'pointer' points to 'value'.

Pointers are used for flexibility: you can refer to another object at run-time by changing the memory address stored inside the pointer variable:


pointer = &array[1];    // Now point to 2nd 'array' element.


pointer     0xD70012A0: 00 B0 00 14     // Now contains address of 
                                        // 2nd array element.

A pointer introduces a level of indirection: in order to access the actual object it refers to (and not the pointer variable itself), you dereference it:


*pointer = 0x1234;  // Don't update address it points to
                    // but value of object it points to.


array       0x00B00010: 00 00 00 0A
            0x00B00014: 00 00 12 34     // Memory updated.
            0x00B00018: CC 00 00 00

DIRECT ACCESS VS. INDIRECT ACCESS

The crucial difference between pointers and arrays is how memory is accessed. For instance, when you retrieve the first array element:


int n = array[0];   // Direct access.

the compiler generates code along these lines:

1. Load address of beginning of array into register A
2. Load data at address stored in A into register B

Whereas when you fetch the first array element via a pointer pointing to it:


pointer = &array[0];  // Point to 1st 'array' element.
...
int n = *pointer;     // Indirect access.

The generated code will access memory indirectly very much like this:

1. Load address of pointer into register X
2. Load data at address in register X into register Y
3. Load data at address in register Y into register B

So as you can see, pointers and arrays use different ways to access memory and hence are fundamentally different beasts.
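
The generated machine code is hard to show portably, but the difference is also visible from within C. The following sketch (all names are mine, chosen for illustration) checks that an array is its own storage while a pointer is a separate object that merely refers to storage elsewhere:

```c
#include <assert.h>
#include <stddef.h>

static int array[3] = { 10, 20, 30 };
static int *pointer = array;   /* Points to the first element of 'array'. */

/* Element count of a real array (not usable on pointers). */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

size_t array_length(void) {
    return ARRAY_LEN(array);
}

void compare_array_and_pointer(void) {
    /* The array occupies the memory of all its elements... */
    assert(sizeof(array) == 3 * sizeof(int));
    /* ...and starts exactly where its first element starts. */
    assert((void *)&array == (void *)&array[0]);
    /* The pointer is an object of its own, stored somewhere else. */
    assert((void *)&pointer != (void *)pointer);
    /* Both nevertheless lead to the same elements. */
    assert(pointer[1] == array[1]);
}
```

All four assertions hold on any conforming implementation; the third one is exactly the extra level of indirection described above.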

WHEN POINTERS LOOK LIKE ARRAYS AND VICE VERSA

Nevertheless, there are cases where pointers and arrays appear to be the same thing.

The C language comes with a little bit of syntactic sugar. In certain situations you can use an array like you would use a pointer:


int x = *array;     // Get first element of 'array'.

This looks like you are dereferencing a pointer named ‘array’, but looks can be deceiving. What this really compiles to is this:


int x = array[0];

Why? According to the C language standard, in expressions, the name of an array acts as a pointer to the first array element. Hence, the compiler really sees this:


int x = *(&array[0]);

which is equivalent to


int x = array[0];

Similarly, you can dereference pointers not just by using the ‘*’ operator but also by using the subscript operator [], which is another form of syntactic sugar — one that makes you believe you are accessing an array instead of a pointer:


// Plain pointer access:
int x1 = *pointer;       // Indirectly access first element.
int x2 = *(pointer + 2); // Indirectly access third element.
int x3 = *(2 + pointer); // ditto (commutative law).

// Array-like access:
int x4 = pointer[0];     // Indirectly access first element.
int x5 = pointer[2];     // Indirectly access third element.
int x6 = 2[pointer];     // ditto (commutative law, who knew?).

All this syntactic sugar makes C code involving pointers and arrays easier on the eyes, with the compiler doing some access magic behind the scenes. The downside is that it deludes people into believing that pointers and arrays are the same, which is not the case: arrays employ direct access, pointers indirect access.

Contrary to expressions, such syntactic sugar is not available in declarations. If you define an array in one translation unit (file):


const int VALUES[4] = { 
    0x1111,
    0x2222,
    0x3333,
    0x4444,
};

and foolishly attempt to import it into another translation unit via this forward declaration:


extern const int* VALUES;    // Import 'VALUES' into translation unit.
int x = *VALUES;             // Indirect access, undefined behavior!

you risk a crash because dereferencing ‘VALUES’ will indirectly access memory when a direct access was required. Let’s assume that the array is stored like this, as defined in the first translation unit:


VALUES      0x00B00210: 00 00 11 11
            0x00B00214: 00 00 22 22
            0x00B00218: 00 00 33 33
            0x00B0021C: 00 00 44 44

Now, dereferencing ‘VALUES’ declared as a pointer will lead to these steps:

1. Load address of pointer ‘VALUES’ into register X (X = 0x00B00210)
2. Load data at address in register X into register Y (Y = 0x00001111)
3. Load data at address in register Y into register B (B = ???)

What this means in practice depends on whether the address 0x00001111 is a valid address or not. If it is, arbitrary data will be read; otherwise, the memory management unit (MMU) will raise an exception. Therefore, make sure that your array declarations exactly match your definitions:


extern const int VALUES[4];  // Matches definition.
int x = VALUES[0];  // Direct access.
int y = *VALUES;    // ditto, syntactic sugar.

PASSING ARRAYS TO FUNCTIONS

So far so good (or bad). Another source of confusion is the fact that arrays are the only objects in C that are implicitly passed by reference:^* You always provide a pointer to the first array element to get an array into a function:


int sum(int* nums, size_t len) {
    size_t i;
    int sum = 0;
    for (i = 0; i < len; ++i) {
        sum += nums[i];  // Indirect access, syntactic sugar.
    }
    return sum;
}

At the caller’s site, the code looks like this:


int total1 = sum(array, 3);       // Pass pointer to 1st elem, syntactic sugar.
int total2 = sum(&array[0], 3);   // ditto, but explicitly.
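
Because only a pointer travels into the function, the callee works on the caller’s memory, and the element count has to be passed separately. A minimal sketch; ‘ARRAY_LEN’, ‘double_all’ and ‘sum_after_doubling’ are hypothetical names of my own:

```c
#include <assert.h>
#include <stddef.h>

/* Element count of a real array. Do not apply this to a pointer
 * parameter: there, sizeof yields the size of the pointer. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Doubles every element in place. 'nums' points to the caller's first
 * element, so the caller's array is modified, not a copy. */
void double_all(int *nums, size_t len) {
    assert(nums != NULL || len == 0);
    for (size_t i = 0; i < len; ++i) {
        nums[i] *= 2;
    }
}

/* Calls 'double_all' on a local array and returns the sum afterwards. */
int sum_after_doubling(void) {
    int values[] = { 1, 2, 3 };
    double_all(values, ARRAY_LEN(values));
    return values[0] + values[1] + values[2];
}
```

Computing the length with ‘ARRAY_LEN’ at the call site, where ‘values’ is still a real array, avoids hard-coding the element count.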

TYPE-SAFETY THAT ISN’T

Sometimes you want to ensure at compile time that only arrays of a certain size can enter your function. Imagine you have a function that builds a 64-bit random value in an array of eight bytes:


void get_random(uint8_t* random) {
    for (size_t i = 0; i < 8; ++i) {
        random[i] = *get_random_byte();
    }
}


‘get_random’ assumes that it is passed the address of eight bytes of memory, but nobody prevents the caller from passing an array that is not big enough:


uint8_t myrand[4];    // Short by 4 bytes.
get_random(myrand);   // but compiles fine...


Which will — of course — lead to a dreaded buffer overrun.

Is it possible to make ‘get_random’ type-safe, such that arrays with a length different to eight lead to compile-time errors?

One (ill-fated) approach is to employ a C feature that allows you to declare arguments using array-like notation:


void get_random(uint8_t random[8]) {
    ...
}


However, this doesn’t give you any extra type safety. To the compiler, ‘random’ is still a pointer to a ‘uint8_t’ and if you ask for the size of ‘random’ (via sizeof(random)) in the body of the function, you will still get the value returned by sizeof(uint8_t*). Few developers are aware of this fact. To me, it’s a source of nasty bugs.

Since this array-ish syntax fools people into believing that a real array was passed to a function (by value) I don’t recommend using it.
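
If you want to convince yourself that the ‘8’ carries no type information, a C11 compiler can check it for you. The sketch below uses hypothetical names (‘fill_arrayish’, ‘TAKES_PLAIN_POINTER’, ‘arrayish_is_plain_pointer’) and assumes ‘_Generic’ and ‘static_assert’ are available. It asks whether a function declared with array-like syntax has exactly the type of a function taking a plain ‘uint8_t*’.

```c
#include <assert.h>
#include <stdint.h>

/* Declared with array-like syntax; the '8' is ignored by the type system. */
void fill_arrayish(uint8_t random[8]) {
    for (int i = 0; i < 8; ++i) {
        random[i] = 0xAB;  /* Fixed pattern, just to have a body. */
    }
}

/* Evaluates to 1 if 'fn' has the type "function taking a plain uint8_t*". */
#define TAKES_PLAIN_POINTER(fn) \
    _Generic(&(fn), void (*)(uint8_t*): 1, default: 0)

/* Holds at compile time: the parameter really is a 'uint8_t*'. */
static_assert(TAKES_PLAIN_POINTER(fill_arrayish) == 1,
              "array-like parameter is adjusted to a pointer");

int arrayish_is_plain_pointer(void) {
    return TAKES_PLAIN_POINTER(fill_arrayish);
}
```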

TYPE-SAFETY DONE RIGHT

You can get real type-safety for your “array” arguments through so-called “pointers to arrays”. Alas, this C feature tends to confuse the heck out of programmers.

In the previous examples, we passed an array (conceptually) by passing a pointer to the first element:


uint8_t randval[8];
get_random(randval);      // Implicitly.
get_random(&randval[0]);  // Explicitly.


The real type of the array and the size of the array is lost in this process; the called function only sees a pointer to a ‘uint8_t’. By contrast, the following syntax allows you to obtain a pointer to an array that preserves the full type information:


typedef uint8_t RANDVAL[8];
RANDVAL randval;
RANDVAL* pointer = &randval;  // note the '&'


This ‘pointer’ is completely type-safe:


int* p = pointer;         // Doesn't compile, incompatible pointers.
get_random(pointer);      // ditto.
int x = (*pointer)[7];    // OK: extract 8th (last) element.


To add type-safety to our ‘get_random’ function, we could define it like this:


void get_random_type_safe(RANDVAL* random) {
    for (size_t i = 0; i < sizeof(*random); ++i) {
        (*random)[i] = *get_random_byte();
    }
}


With this change, ‘get_random_type_safe’ only accepts pointers to 8 element arrays of uint8_t’s. Passing any other kind of pointer will result in a compile-time error.
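
Here is a minimal, compilable sketch of the caller’s side. Since ‘get_random_byte’ is not shown in this post, the sketch substitutes a hypothetical, deliberately non-random helper (‘fake_random_byte’) so that the result is reproducible; ‘demo_type_safe’ is made up as well. How loudly the rejected call is reported depends on your compiler and flags: some C compilers only warn about incompatible pointer types unless told otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t RANDVAL[8];

/* Hypothetical stand-in for a real entropy source: a fixed byte pattern
   so that the example is reproducible. NOT random. */
static uint8_t fake_random_byte(size_t i) {
    return (uint8_t)(0xA0u + i);
}

void get_random_type_safe(RANDVAL* random) {
    /* sizeof(*random) is the size of the whole array, i.e. 8. */
    for (size_t i = 0; i < sizeof(*random); ++i) {
        (*random)[i] = fake_random_byte(i);
    }
}

/* Shows the call syntax; returns the last byte that was written. */
uint8_t demo_type_safe(void) {
    RANDVAL value = { 0 };
    get_random_type_safe(&value);  /* OK: argument has type uint8_t (*)[8]. */
    /* uint8_t small[4];
       get_random_type_safe(&small);   <- diagnosed: uint8_t (*)[4] does not match. */
    return value[7];
}
```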

We know that in expressions, using an array’s name like ‘array’ is short for “pointer to first element in array” but that doesn’t mean that ‘&array’ is a pointer to a pointer to the first element — the ‘&’ operator doesn’t create another level of indirection, even though it looks like it did. In the previous example, the value stored in ‘pointer’ is still the address of the first element of the array. Hence, this assertion holds:


assert((size_t) array == (size_t) &array); // Casting to 'size_t' obtains 
                                           // numeric value of address.
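
The addresses are equal, but the types are not, and that shows up in pointer arithmetic. In this sketch (the names ‘bytes’, ‘step_of_element_pointer’, ‘step_of_array_pointer’ and ‘same_address’ are invented for the illustration), adding 1 to the decayed array name advances by one element, whereas adding 1 to the pointer to the array advances by the whole array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t bytes[8];

/* Bytes skipped by "+ 1" on the decayed array name (type uint8_t*). */
size_t step_of_element_pointer(void) {
    return (size_t)((char*)(bytes + 1) - (char*)bytes);
}

/* Bytes skipped by "+ 1" on the pointer to the array (type uint8_t (*)[8]). */
size_t step_of_array_pointer(void) {
    return (size_t)((char*)(&bytes + 1) - (char*)&bytes);
}

/* Both expressions nevertheless designate the same address. */
int same_address(void) {
    return (void*)bytes == (void*)&bytes;
}
```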


Since the actual pointer values are the same, you can still use legacy APIs that only accept pointers to ‘uint8_t’s (like the original ‘get_random’ function), if you apply type casts:


uint8_t* p = (uint8_t*) pointer;   // OK, but type-safety lost.
get_random(p);                     // Fine.
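
As an alternative to the cast, dereferencing works too: ‘*pointer’ designates the array itself, which then decays to a pointer to its first element as usual. The sketch below demonstrates this with a hypothetical legacy function (‘legacy_fill’) that writes a fixed pattern instead of random data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical legacy API: expects room for eight bytes. Writes a fixed
   pattern instead of random data so the result is reproducible. */
void legacy_fill(uint8_t* out) {
    assert(out != NULL);
    for (size_t i = 0; i < 8; ++i) {
        out[i] = (uint8_t)(i + 1);
    }
}

uint8_t demo_no_cast(void) {
    uint8_t value[8] = { 0 };
    uint8_t (*pointer)[8] = &value;
    legacy_fill(*pointer);  /* '*pointer' is the array; it decays to uint8_t*. */
    return value[7];
}
```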


You don’t need typedefs like ‘RANDVAL’ if you want to employ pointers to arrays. I mainly used it to avoid overwhelming you with the hideous pointer-to-array syntax. Without typedefs, you would need to type in things like this:


uint8_t randval[8];
uint8_t (*pointer)[8] = &randval;
void get_random_type_safe(uint8_t (*random)[8]) {
    for (size_t i = 0; i < sizeof(*random); ++i) {
        (*random)[i] = *get_random_byte();
    }
}


The syntax to declare pointers to arrays is similar to the syntax to declare pointers to functions and takes a little getting used to. If in doubt, ask the Linux tool ‘cdecl’ which is also available online:


cdecl> explain int (*x[10])[42]
declare x as array 10 of pointer to array 42 of int
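
You can cross-check such a reading with the compiler itself. The following sketch declares exactly this ‘x’ and lets C11’s ‘static_assert’ confirm both dimensions; the helper ‘store_last’ is invented here to show how an element is accessed through such a declaration.

```c
#include <assert.h>
#include <stddef.h>

int (*x[10])[42];   /* array 10 of pointer to array 42 of int */

static_assert(sizeof(x) / sizeof(x[0]) == 10, "x has 10 elements");
static_assert(sizeof(*x[0]) == 42 * sizeof(int),
              "each element points to an int[42]");

/* Stores 'value' in the last int of the array that x[slot] points to. */
int store_last(size_t slot, int (*target)[42], int value) {
    assert(slot < 10 && target != NULL);
    x[slot] = target;
    (*x[slot])[41] = value;
    return (*target)[41];
}
```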


Do I recommend using pointers to arrays? No, at least not in general. It confuses way too many developers and leads to ugly casts in order to access plain pointer interfaces. Still, pointers to arrays make sense every now and then and it’s always good to know your options.

This concludes my first installment on pointers. There is more to come. Stay tuned!

________________________________

*) The language designers of C believed that passing an array by value (e.g. as a copy via the stack) would be extremely inefficient and dangerous (think: stack overflow), so there is no direct way to do it. However, they were not so fearful regarding structs (which can also get quite large and overflow the stack), so you can pass an array by value if you wrap it inside a struct:


typedef struct {
    int data[3];
} MY_ARRAY;
void some_func(MY_ARRAY the_array) {
   the_array.data[0] = ...
   ...
}
MY_ARRAY array2 = { 1, 2, 3 };
some_func(array2); // Pass by value, i.e. duplicate array2 on the stack.
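
A complete variant of this idea, with the elided parts filled in by made-up code (‘modify_copy’ and ‘caller_is_unchanged’ are hypothetical), shows that the callee really works on a copy:

```c
#include <assert.h>

typedef struct {
    int data[3];
} MY_ARRAY;

/* Receives a copy; changes stay local to this function. */
int modify_copy(MY_ARRAY the_array) {
    the_array.data[0] = 99;
    return the_array.data[0];
}

/* Returns 1 if the caller's array is untouched after the call. */
int caller_is_unchanged(void) {
    MY_ARRAY original = { { 1, 2, 3 } };
    int seen_inside = modify_copy(original);
    return seen_inside == 99 && original.data[0] == 1;
}
```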


Bug Hunting Adventures #12: String Limits

“The limits of my language mean the limits of my world.”
— Ludwig Wittgenstein

The aim of the routine below (‘reduce_string’) is to limit a given ‘string’ to at most ‘max_len’ characters. If the length of ‘string’ exceeds ‘max_len’, characters are removed from around the middle and filled with an ‘ellipsis’ string. Here are some examples that demonstrate what ‘reduce_string’ is supposed to do:


char text1[] = "The quick brown fox";
reduce_string(text1, 8, "..");
// -> "The..fox"
char text2[] = "The quick brown fox";
reduce_string(text2, 4, "");
// -> "Thox"
char text3[] = "I am the spirit that denies!";
reduce_string(text3, 7, "---");
// -> "I ---s!"


But as always in this series, a bug slipped in. Can you find it?


char* reduce_string(char* string, int max_len, const char* ellipsis) {
    assert(string != NULL);
    assert(ellipsis != NULL);

    int string_len = strlen(string);
    int excess_chars = string_len - max_len;

    if (excess_chars > 0) {
        int ellipsis_len = strlen(ellipsis);
        // Number of chars to be removed from the original string.
        int to_be_dropped = excess_chars + ellipsis_len;
        int to_be_dropped_half = to_be_dropped / 2;
        int middle = string_len / 2;

        // Drop chars from the middle to the left;
        // what remains is called the 'left part'.
        int p = middle - to_be_dropped_half;

        // If ellipsis longer than string, skip left part;
        // i.e. the resulting string starts with ellipsis.
        if (p < 0) {
            p = 0;
        }

        // Append ellipsis after left part.
        for (int i = 0; i < ellipsis_len; ++i) {
            // Ensure that maximum length is respected.
            if (p >= max_len) {
                break;
            }
            string[p++] = ellipsis[i];
        }

        // Append right part.
        int r = middle + to_be_dropped - to_be_dropped_half;
        while (p < max_len) {
            string[p++] = string[r++];
        }
    }

    return string;
}


Solution

Random Casting

Recently, a security-related bug slipped into libcurl 7.52.0.

For those of you who don’t know, libcurl is a popular open source library that supports many protocols and greatly simplifies data transfer over the Internet; an uncountable number of open- and closed-source projects depend on it.

Because of the bug, this particular version of libcurl doesn’t use random numbers when it should, which is really bad for security:


static CURLcode randit(struct Curl_easy *data, unsigned int *rnd)
{
  // ... 24 lines ...
  result = Curl_ssl_random(data, (unsigned char *)&rnd, sizeof(rnd));
  //...
}


Since all the surrounding code is stripped away it is pretty easy to see what went wrong, right?

Within ‘randit’ there is an attempt to obtain a random number by calling ‘Curl_ssl_random’. However, ‘Curl_ssl_random’ is not passed the pointer ‘rnd’, but instead a pointer to ‘rnd’. Hence, the memory pointed to by ‘rnd’ is not filled with a random number but rather the pointer ‘rnd’ will point to a random memory location.
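
The effect is easy to reproduce in isolation. In the sketch below, ‘fill_bytes’ is a hypothetical stand-in for ‘Curl_ssl_random’ that writes a fixed byte pattern instead of random data, and ‘randit_buggy’ and ‘randit_fixed’ are invented names; the second one shows what the call was presumably meant to look like, with both the pointer and the size referring to the integer rather than to the pointer variable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for 'Curl_ssl_random': fills 'length' bytes with a
   fixed pattern (0x5A) so the effect is reproducible. NOT random. */
static void fill_bytes(unsigned char* entropy, size_t length) {
    memset(entropy, 0x5A, length);
}

/* Mirrors the bug: overwrites the local pointer variable 'rnd' itself. */
void randit_buggy(unsigned int* rnd) {
    fill_bytes((unsigned char*)&rnd, sizeof(rnd));
    /* 'rnd' is garbage now; the caller's integer was never touched. */
}

/* Presumably intended: fill the integer that 'rnd' points to. */
void randit_fixed(unsigned int* rnd) {
    assert(rnd != NULL);
    fill_bytes((unsigned char*)rnd, sizeof(*rnd));
}
```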

How did this bug come about? I’m pretty sure that — initially — the unlucky developer had accidentally typed this:


static CURLcode randit(struct Curl_easy *data, unsigned int *rnd)
{
  // ... 24 lines ...
  result = Curl_ssl_random(data, &rnd, sizeof(rnd));
  // ...
}


When (s)he compiled the code with gcc, the following error message was produced:


rand.c:63 error: cannot convert ‘unsigned int**’ to ‘unsigned char*’ for argument ‘2’ 
    to ‘CURLcode Curl_ssl_random(void*, unsigned char*, size_t)’


Which exactly explains the problem, but most likely, the developer only skimmed the error message and jumped to the wrong conclusion; that is, (s)he thought that a cast was needed because of a simple pointer incompatibility (unsigned int* vs. unsigned char*) when in fact there is a severe pointer incompatibility (pointer to pointer vs. pointer).

I’ve seen this many times before: developers apply casts to get rid of warnings from the compiler (or a static analysis tool) without a second thought. Don’t do this. Listen carefully when your compiler speaks to you. A cast, on the other hand, will silence it forever.
