kaashif's blog

Programming, with some mathematics on the side

Differences in backwards incompatibility between Rust and C++

2024-01-02

Why is the following change to a Rust struct backwards incompatible?

 struct S {
+    y: i8,
     x: i32,
 }

And why is the following change to a C++ struct backwards incompatible?

 struct S {
+    char y;
     int x;
 };

The answers are different and may surprise you. By default, Rust provides fewer guarantees about struct layout than C++, but more compile-time guarantees about code that interacts with structs.

C++ is batshit crazy as always, providing guarantees no-one cares about while allowing you to invoke UB by accident.

Worth thinking about when deciding whether you need to bump the major version number of your Rust crate.

Why the C++ change is backwards incompatible

Let's start with the easy one. The C++ standard defines a standard layout struct as:

If you could write it in C, it's a standard layout class.

-- https://en.cppreference.com/w/cpp/language/classes#Standard-layout_class

Okay, that's not what it says - there are classes that are only expressible in C++ that are "standard layout" (e.g. using inheritance), but anything you could write in C is standard layout.

Standard layout classes come with several guarantees (see https://en.cppreference.com/w/cpp/language/data_members#Standard-layout), but most importantly:

A pointer to an object of standard-layout class type can be reinterpret_cast to pointer to its first non-static non-bitfield data member

So given the struct:

struct S {
    int x;
};

the C++ standard guarantees that this code will work as expected:

#include <iostream>

int main() {
    S s{1};
    // Guaranteed OK: S is standard layout and x is its first member.
    int x = *reinterpret_cast<int*>(&s);
    std::cout << x;
}

If we make the change above, this code compiles but is now undefined behaviour:

#include <iostream>

struct S {
    char y;
    int x;
};

int main() {
    S s{1};  // y = 1, x = 0
    // UB: the first member is now a char, not an int.
    int x = *reinterpret_cast<int*>(&s);
    std::cout << x;
}

The first member is now a char, so an S* can be cast to a char* and read through, but not an int*! Reading through the int* is now UB.

What will likely happen in practice on a little endian machine is that it will still print 1. We're reading the 0x01 at the start, and the three padding bytes after it are probably zeroes. But they're not guaranteed to be - the C++ standard says nothing about the content of the padding in structs. It could be all 0xff!

This does start to "matter" on big endian machines (quotes because no-one cares), where the change is observable.

On a big endian machine where sizeof(int) == 4, the first struct may look like:

00 00 00 01

where reading the first 4 bytes gives an int with value 1, while the second struct looks like:

01 [00 00 00] 00 00 00 00

The [bytes] are padding bytes to make sure the int starts at an address divisible by 4.

Reading the first 4 bytes of this gives a completely different value: 0x01000000.
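You don't need a big endian machine to check that arithmetic. Here's a minimal sketch, written in Rust so that everything stays well-defined, which interprets the same four bytes in both byte orders. The function and constant names are made up for illustration, and it assumes the padding bytes happen to be zero.

```rust
/// The first four bytes of the second struct: y = 0x01, then three padding
/// bytes. ASSUMPTION: the padding is zero, which nothing guarantees.
const FIRST_FOUR_BYTES: [u8; 4] = [0x01, 0x00, 0x00, 0x00];

/// What a 4-byte int read sees on a little endian machine.
fn read_as_little_endian(bytes: [u8; 4]) -> i32 {
    i32::from_le_bytes(bytes)
}

/// What the same read sees on a big endian machine.
fn read_as_big_endian(bytes: [u8; 4]) -> i32 {
    i32::from_be_bytes(bytes)
}

fn main() {
    println!("little endian: {:#010x}", read_as_little_endian(FIRST_FOUR_BYTES));
    println!("big endian:    {:#010x}", read_as_big_endian(FIRST_FOUR_BYTES));
}
```

The little endian read comes out as 1 and the big endian read as 0x01000000, matching the byte dumps above.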

Two notes:

  1. In aggregate initialization, members without an explicit initializer are value-initialized, which means zero for an int. That's why x is 0 in S s{1}.

  2. The contents of padding bytes aren't guaranteed to be 0x00.

Overall, this is as expected. The surprising part comes next.

Why the Rust change is backwards incompatible

Let's suppose you swapped two elements in a struct:

struct S {
    y: i8,
    x: i32,
}

to

struct S {
    x: i32,
    y: i8,
}

This is totally backwards compatible if you're using safe Rust! Rust makes almost no guarantees about the layout of a struct in memory! Unlike C and C++, Rust doesn't even guarantee the order of elements in memory matches the order in the struct declaration.

It's true!

From https://doc.rust-lang.org/reference/type-layout.html, on the default representation:

The only data layout guarantees made by this representation are those required for soundness. They are:

  1. The fields are properly aligned.

  2. The fields do not overlap.

  3. The alignment of the type is at least the maximum alignment of its fields.

There is no guarantee about order or pointer conversions.

If you scroll down on that page, you'll notice Rust has #[repr(C)] which gives you the same layout as C, if you want some stronger guarantees.
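To make that concrete, here's a rough sketch that measures where the fields of a #[repr(C)] struct end up. The struct name CLayout and the helper field_offsets are invented for this example. The offsets are measured with plain address arithmetic on a live value, so it doesn't depend on any particular Rust version.

```rust
use std::mem::{align_of, size_of};

// Same fields as the struct in the post, but with the C layout requested.
// The name CLayout is made up for this example.
#[repr(C)]
struct CLayout {
    y: i8,
    x: i32,
}

/// Byte offsets of (y, x) from the start of the struct, measured with safe
/// address arithmetic on a live value.
fn field_offsets(s: &CLayout) -> (usize, usize) {
    let base = s as *const CLayout as usize;
    let y = &s.y as *const i8 as usize - base;
    let x = &s.x as *const i32 as usize - base;
    (y, x)
}

fn main() {
    let s = CLayout { y: 1, x: 2 };
    let (y_off, x_off) = field_offsets(&s);
    // With repr(C), fields are laid out in declaration order: y first, then
    // padding up to the alignment of i32, then x.
    println!("y at {}, x at {}, size {}", y_off, x_off, size_of::<CLayout>());
    assert_eq!(y_off, 0);
    assert_eq!(x_off, align_of::<i32>());
}
```

Drop the #[repr(C)] and the two assertions in main are no longer backed by any guarantee, even if they happen to pass with today's compiler.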

But why is our first change above still backwards incompatible? This code works:

struct S {
    x: i32,
}

pub fn main() {
    let my_s = S{ x: 1 };

    match my_s {
        S{x} => x,
    };
}

but this code doesn't compile:

struct S {
    x: i32,
    y: i8
}

pub fn main() {
    let my_s = S{ x: 1 };  // error[E0063]: missing field `y`

    match my_s {
        S{x} => x,  // error[E0027]: pattern does not mention field `y`
    };
}

because the initialization is missing y, and the pattern match is missing y.

This is also totally unsurprising.
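For what it's worth, code that can see the struct's fields can be written so that it survives this kind of change. This is only a sketch of one option, not something the post's example does: the `..` rest pattern ignores fields the match doesn't name, and struct update syntax fills in unnamed fields from a default value. The Default derive and the helper names make_s and get_x are assumptions added for the example.

```rust
// ASSUMPTION: S derives Default, which the struct in the post does not.
#[derive(Default)]
struct S {
    x: i32,
    y: i8, // the newly added field
}

/// Builds an S naming only x; every other field comes from S::default().
fn make_s(x: i32) -> S {
    S { x, ..Default::default() }
}

/// Extracts x; the `..` ignores any fields not named in the pattern.
fn get_x(s: &S) -> i32 {
    match s {
        S { x, .. } => *x,
    }
}

fn main() {
    let my_s = make_s(1);
    println!("x = {}, y = {}", get_x(&my_s), my_s.y);
}
```

Neither make_s nor get_x mentions y, so both compile unchanged whether or not the field exists.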

Conclusion

The moral here is:

  1. C and C++'s standard layout provides some guarantees, but not really that many.

  2. Rust's default struct memory layout provides almost no guarantees. But your Rust code is still safe, since the compiler can make sure you don't accidentally assume things about structs.

Technically, Rust might write winning lottery ticket numbers as padding at the start of your structs. Make sure to check.

Addendum: sizeof

One naive answer is that adding fields is backwards incompatible because the sizeof of the struct changes.

This isn't exactly right, since nothing in the C or C++ standards guarantees that the sizeof of a struct is the same even between compilations of the same program. This is trivially true because, for example, long may be 4 or 8 bytes depending on the platform and C implementation: long is always 4 bytes on Windows and is usually 8 bytes on 64 bit Linux. (Yes, I know, the standard doesn't say anything about "bytes" in relation to int and long, only value ranges.)

So the sizeof of a struct could change arbitrarily depending on the time of day you compiled the program - you can't rely on that anyway.

Same with Rust:

In general, the size of a type is not stable across compilations

-- https://doc.rust-lang.org/std/mem/fn.size_of.html

Funnily enough, Rust guarantees that the std::mem::size_of of a #[repr(C)] struct is stable as long as all of its fields have a stable size. So that guarantee is actually stronger than the C standard's guarantee.
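Here's a small sketch of what that means in practice. The struct names and the expected_c_size helper are invented for the example. The helper just replays the C padding rules by hand for a struct containing an i8 followed by an i32, so nothing is hard-coded to a particular platform.

```rust
use std::mem::{align_of, size_of};

// Default (Rust) representation: the compiler picks the layout.
#[allow(dead_code)]
struct RustLayout {
    y: i8,
    x: i32,
}

// C representation: declaration order, C padding rules.
#[allow(dead_code)]
#[repr(C)]
struct CLayout {
    y: i8,
    x: i32,
}

/// Size the C layout rules give for { i8, i32 }: y, padding up to the
/// alignment of i32, x, then tail padding up to the struct's alignment.
fn expected_c_size() -> usize {
    let align = align_of::<i32>();
    let x_offset = align; // the 1 byte of y, rounded up to align
    let end = x_offset + size_of::<i32>();
    (end + align - 1) / align * align
}

fn main() {
    println!("repr(Rust): {}", size_of::<RustLayout>());
    println!("repr(C):    {}", size_of::<CLayout>());
    // Only the repr(C) size is something you can predict from the rules.
    assert_eq!(size_of::<CLayout>(), expected_c_size());
}
```

For the default representation, about all you can say is that the size is at least 5 bytes and a multiple of the alignment.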

This is unfair because GCC and Clang do guarantee a lot about struct layouts; it's just the C standard that doesn't. rustc and "the Rust language" are hard to separate because there's really only one implementation that people use, and no standard.

Comparing things unfairly is fun though.