# kaashif's blog

Programming, with some mathematics on the side

## Why doesn't GCC do this "easy" NRVO optimization?

### 2022-01-25

Since C++17, we C++ programmers have rejoiced in the fact that when we return something from a function, the standard now guarantees that the value won't be copied. This is known as return value optimization (RVO) or copy/move elision, and it happens in cases like this:

    MyType myfunc() {
        // ...
        return MyType{arg1, arg2};
    }


That is, MyType is constructed once and never copied or moved.

But some compilers still don't perform this elision in some cases. It turns out that's because, by the standard's definition, RVO refers only to returning unnamed values; named RVO (NRVO) is not covered by the guarantee. "Named" means something like:

    MyType x{};
    x.do_something();
    return x;


And gcc (11.2) doesn't always perform NRVO, even if it "obviously" can. Why? Do other compilers do better? I tried to find out.

## An example that works

Let's look at a code sample to see what the problem isn't, i.e. an example where NRVO does happen. We have this code:

    void doSomething(char*);

    struct MyType {
        char buffer[100000];
    };

    MyType wantNrvo() {
        MyType x;
        x.buffer[0] = '\0';
        return x;
    }

    int main() {
        auto x = wantNrvo();
        doSomething(x.buffer);
    }


We are interested in what the wantNrvo function compiles down to with -O3 - we want gcc to give it its best shot. The doSomething call is there to stop x from being optimised away. We could instead have marked x volatile, but I hardly ever see that in real code - it would just be a distraction. A function call is more realistic.

Anyway, the generated assembly is:

    wantNrvo():
            mov     BYTE PTR [rdi], 0
            mov     rax, rdi
            ret
    main:
            sub     rsp, 100008
            mov     rdi, rsp
            mov     BYTE PTR [rsp], 0
            call    doSomething(char*)
            xor     eax, eax
            ret


We can see there's obviously no copying of the huge 100,000-byte buffer there, since there's no memcpy (or equivalent) anywhere. That's what we're looking for.

It's interesting to note that NRVO means the caller has to make space on the stack for the returned object before the function is even run; the object is then built in that space.

The function is passed the location to build the object in rdi. Thus, NRVO can be viewed as rewriting our function to take a pointer to a block of memory big enough for MyType, and using placement new.

That might look like this:

    #include <new>

    void wantNrvo(char *memory) {
        MyType *x = new (memory) MyType{};
        x->buffer[0] = '\0';
    }


The point is that there is one single block of memory where the return value is constructed, and this is allocated by the caller. That seems fine: the caller knows the size of the block and can allocate it - what could go wrong?
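To make this concrete, here is a runnable sketch of that hypothetical rewriting. The names, the void* slot parameter, and the callerSide helper are all mine for illustration - gcc doesn't literally emit anything named like this:

```cpp
#include <cassert>
#include <new>

struct MyType {
    char buffer[100000];
};

// Hypothetical lowered form: the caller passes in the return slot,
// and the callee constructs the object there with placement new.
// No copy ever happens.
void wantNrvoLowered(void *memory) {
    MyType *x = new (memory) MyType{};
    x->buffer[0] = '\0';
}

// The caller's side: it allocates the slot itself, before the call,
// just as the real caller reserves stack space before `call`.
MyType *callerSide(void *slot) {
    wantNrvoLowered(slot);
    return static_cast<MyType *>(slot);
}
```

The key property is the same as in the assembly above: exactly one 100,000-byte block exists, and it's owned by the caller.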

## What goes wrong

Sometimes it's hard to tell which object out of multiple objects is going to be returned, meaning we don't know which object to construct in the return value area.

That is kind of non-obvious, so here's an example:

    MyType wantNrvo(bool test) {
        MyType x;
        MyType y;
        x.buffer[0] = '\0';
        return test ? x : y;
    }


There is obviously no way to know at compile time which of x or y should be constructed in the return value area. Using our knowledge of this program, you might think: hey, why not just check test and decide which of x and y to construct in the return slot? Alas, gcc, even at -O3, isn't that smart. This is the assembly produced:

    wantNrvo(bool):
            sub     rsp, 200008
            mov     r9d, esi
            mov     edx, 100000
            lea     rax, [rsp+100000]
            test    r9b, r9b
            mov     rsi, rsp
            mov     BYTE PTR [rsp], 0
            cmove   rsi, rax
            call    memcpy
            ret


It's there, plain as day: memcpy! The huge buffer is being copied!

## It gets worse!

There are other, even more obvious cases where gcc fails. Maybe the problem is that we're constructing two objects, then picking one to return. Maybe that's confusing for some reason. Let's try rewriting the code:

    MyType wantNrvo(bool test) {
        if (test) {
            MyType x;
            x.buffer[0] = '\0';
            return x;
        } else {
            MyType y;
            return y;
        }
    }


This code is starting to get a bit contrived. But let's see what that compiles to:

    wantNrvo(bool) [clone .part.0]:
            sub     rsp, 100008
            mov     edx, 100000
            mov     rsi, rsp
            call    memcpy
            ret
    wantNrvo(bool):
            push    r12
            mov     r12, rdi
            test    sil, sil
            je      .L5
            mov     rax, r12
            mov     BYTE PTR [rdi], 0
            pop     r12
            ret
    .L5:
            call    wantNrvo(bool) [clone .part.0]
            mov     rax, r12
            pop     r12
            ret


Now this is bizarre! There are two branches of this function:

• The test = true branch, which uses NRVO and writes directly to the block of memory pointed to by rdi.

• The test = false branch, which constructs y in freshly allocated stack space, then memcpys it into the area pointed to by rdi.
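Rendered back into C++ - my reconstruction of what the assembly does, not anything gcc literally produces - the function behaves roughly like this:

```cpp
#include <cassert>
#include <cstring>
#include <new>

struct MyType {
    char buffer[100000];
};

// Reconstruction of gcc 11.2's output: the true branch constructs
// directly in the caller's slot (NRVO), but the false branch builds
// y in separate stack space and memcpys all 100,000 bytes across.
void wantNrvoLowered(void *ret, bool test) {
    if (test) {
        MyType *x = new (ret) MyType;
        x->buffer[0] = '\0';
    } else {
        // A second 100,000-byte stack object. (The original leaves y
        // uninitialised; zero-initialised here so the sketch is
        // well-defined.)
        MyType y{};
        std::memcpy(ret, &y, sizeof y);  // the copy we hoped to avoid
    }
}
```

The asymmetry is the whole bug report in miniature: one branch gets the optimisation, the other pays for a full extra object plus a memcpy.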

Surely this is only optimised badly because we never actually call it, right? Don't be so sure: when we call it, it gets inlined, but still uses memcpy:

    int main() {
        auto x = wantNrvo(false);
        doSomething(x.buffer);
    }

This compiles to:

    main:
            sub     rsp, 200008
            mov     edx, 100000
            lea     rsi, [rsp+100000]
            mov     rdi, rsp
            call    memcpy
            mov     rdi, rsp
            call    doSomething(char*)
            xor     eax, eax
            ret


## Why, God? Why?

I am not a compiler author. I have never written a compiler and am much dumber than those behind gcc's NRVO.

But NRVO is pretty important, and I expect it sometimes! That's backed up by this comment on gcc bug 51571: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51571

> I would like to strongly oppose the notion that this "just a missed optimisation and not very critical, really".
>
> NRVO is not just "an optimisation". It's actually one that is explcitly permitted to change observable behaviour of the program and it's extremely powerful.
>
> And it it required for performant C++. Just try to return a std::vector by value to see the importance of this optimisation. This is not missed optimisation. This is premature pessimisation.
>
> You could just as well stop all optimisation work for the C++ frontend until this is implemented, because any other optimisation effords are dwarfed by the overhead when NRVO is expected by the developer but not applied.
>
> Please make this a top priority. Every C++ program will benefit both in text size and in runtime performance - dramatically.
>
> -- Marc Mutz

But then again, there are often hyperbolic, borderline rude comments left on projects saying "make this a top priority", so who knows?

## Let's see clang's business card

Let's look at clang's machine code - does it do any better? Yes!

(using clang 13.0.0)

    wantNrvo(bool):                           # @wantNrvo(bool)
            mov     rax, rdi
            test    esi, esi
            je      .LBB0_2
            mov     byte ptr [rax], 0
    .LBB0_2:
            ret


This is exactly the machine code we want. Give us some memory in rdi, and the branches collapse down to writing or not writing a nul byte to it. Easy, no memcpy!
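Translated back into the pointer-passing picture (again my reconstruction, not clang's actual IR), clang has effectively noticed that both x and y live in the same slot:

```cpp
#include <cassert>

struct MyType {
    char buffer[100000];
};

// Reconstruction of clang's output: both candidate return objects
// are constructed in the caller's slot, so the whole function
// collapses to "conditionally write one byte".
void wantNrvoLowered(void *ret, bool test) {
    if (test) {
        static_cast<MyType *>(ret)->buffer[0] = '\0';
    }
}
```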

## Lessons to learn

Although you might see some text on cppreference.com saying:

> Return value optimization is mandatory and no longer considered as copy elision; see above. (since C++17)

This doesn't refer to NRVO! Assuming that NRVO will happen can be very dangerous, resulting in copies of huge objects, and that hurts performance a lot.

Try as hard as possible to return prvalues to take advantage of C++17's guaranteed RVO.
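For example, here is a version of the branchy function where every return statement returns a prvalue, so elision is guaranteed on both paths. This is a sketch - it assumes a zero-initialised MyType is an acceptable result for the else branch, which the original (deliberately uninitialised) code didn't require:

```cpp
#include <cassert>

struct MyType {
    char buffer[100000];
};

// Every return here is a prvalue, so since C++17 the object is
// constructed directly in the caller's storage on both paths.
// No NRVO heroics required from the compiler.
MyType wantRvo(bool test) {
    if (test) {
        return MyType{{'\0'}};  // prvalue: elision guaranteed
    }
    return MyType{};            // prvalue: elision guaranteed
}
```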

It's a dog-eat-RVO world out there. Be careful.