Why doesn't GCC do this "easy" NRVO optimization?
2022-01-25
Since C++17, we C++ programmers have rejoiced in the fact that when we return something from a function, the standard now guarantees that the value won't be copied or moved. This is known as return value optimization (RVO) or copy/move elision, and it happens in cases like this:
MyType myfunc() {
    // ...
    return MyType{arg1, arg2};
}
That is, MyType is constructed once and never copied or moved.
But some compilers still fail to elide the copy in some cases. It turns out this is because the guarantee only covers returning unnamed values (prvalues). Named RVO (NRVO) is not part of the standard's guarantee at all - it remains an optional optimization. "Named" means something like:
MyType myfunc() {
    MyType x{};
    x.do_something();
    return x;
}
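You can see the difference at the language level with a type whose copy and move constructors are deleted (a small sketch of my own; NoCopy, makeUnnamed and makeNamed are made-up names). Returning a prvalue still compiles, because since C++17 no copy or move is even considered; returning a named local does not, because NRVO is merely an optional optimization and the copy/move constructor must still be usable:
struct NoCopy {
    NoCopy() = default;
    NoCopy(const NoCopy&) = delete;
    NoCopy(NoCopy&&) = delete;
};

NoCopy makeUnnamed() {
    return NoCopy{};  // OK: guaranteed elision, no copy or move needed
}

// NoCopy makeNamed() {
//     NoCopy x;
//     return x;      // error: would use the deleted move constructor
// }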
And gcc (11.2) doesn't always perform NRVO, even if it "obviously" can. Why? Do other compilers do better? I tried to find out.
An example that works
Let's look at a code sample to see what the problem isn't, i.e. an example where NRVO does happen. We have this code:
void doSomething(char*);

struct MyType {
    char buffer[100000];
};

MyType wantNrvo() {
    MyType x;
    x.buffer[0] = '\0';
    return x;
}

int main() {
    auto x = wantNrvo();
    doSomething(x.buffer);
}
We are interested in what the wantNrvo function compiles down to with -O3; we want gcc to give it its best shot. That's why we add the doSomething function: to stop x from being optimised away entirely. We could have added volatile to x instead, but I hardly ever see that in real code, so it would just confuse us. A function call is more realistic.
Anyway, the generated assembly is:
wantNrvo():
mov BYTE PTR [rdi], 0
mov rax, rdi
ret
main:
sub rsp, 100008
mov rdi, rsp
mov BYTE PTR [rsp], 0
call doSomething(char*)
xor eax, eax
add rsp, 100008
ret
We can see there's obviously no copying of the huge 100,000-byte buffer there, since there's no memcpy (or equivalent) anywhere. That's what we're looking for.
It's interesting to note that NRVO means the caller has to make space on its stack for the returned object before the function even runs; the object is then built in the space given.
The function is passed the location to build the object in via rdi. Thus, RVO can be viewed as rewriting our function to take a pointer to a block of memory big enough for MyType and construct into it with placement new.
That might look like this:
#include <new>  // for placement new

void wantNrvo(char *memory) {
    MyType *x = new (memory) MyType;  // default-init, just like `MyType x;` above
    x->buffer[0] = '\0';
}
The point is that there is one single block of memory where the return value is constructed, and this is allocated by the caller. That seems fine: the caller knows the size of the block and can allocate it - what could go wrong?
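The caller side of that rewriting would then look roughly like this (a sketch; storage is a made-up name, and in reality the compiler simply reserves the space inside main's stack frame, as the sub rsp, 100008 in the assembly above shows):
int main() {
    // The caller reserves suitably sized (and aligned) storage up front...
    alignas(MyType) char storage[sizeof(MyType)];
    // ...and hands the function somewhere to construct its "return value".
    wantNrvo(storage);
    // buffer sits at offset 0, so storage is also where the chars live.
    doSomething(storage);
}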
What goes wrong
Sometimes it's hard to tell which of several objects is going to be returned, meaning we don't know which one to construct in the return value area.
That is kind of non-obvious, so here's an example:
MyType wantNrvo(bool test) {
    MyType x;
    MyType y;
    x.buffer[0] = '\0';
    return test ? x : y;
}
There is obviously no way to know at compile time which of x or y we should construct in the return value area. Using our knowledge of this program, you might think: hey, why don't we just check test and decide which of x and y to construct in the return value? Alas, gcc, even with -O3, isn't that smart. This is the machine code produced:
wantNrvo(bool):
sub rsp, 200008
mov r9d, esi
mov edx, 100000
lea rax, [rsp+100000]
test r9b, r9b
mov rsi, rsp
mov BYTE PTR [rsp], 0
cmove rsi, rax
call memcpy
add rsp, 200008
ret
It's there, plain as day: memcpy! The huge buffer is being copied!
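For reference, here's roughly what we were hoping gcc would do, written by hand in the placement-new view from earlier (a sketch of mine, not anything gcc emits; wantNrvoByHand is a made-up name): branch on test first, and only construct the object that actually gets returned in the caller-provided slot.
#include <new>

// MyType has no constructor side effects, so the object that isn't
// returned never needs to exist at all.
void wantNrvoByHand(char *memory, bool test) {
    if (test) {
        MyType *x = new (memory) MyType;  // x built directly in the return slot
        x->buffer[0] = '\0';
    } else {
        new (memory) MyType;              // y built directly in the return slot
    }
}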
It gets worse!
There are other, even more obvious cases where gcc fails. Maybe the problem is that we're constructing two objects and then picking one to return. Maybe that's confusing for some reason. We can attempt to rewrite the code:
MyType wantNrvo(bool test) {
    if (test) {
        MyType x;
        x.buffer[0] = '\0';
        return x;
    } else {
        MyType y;
        return y;
    }
}
This code is starting to get a bit contrived. But let's see what that compiles to:
wantNrvo(bool) [clone .part.0]:
sub rsp, 100008
mov edx, 100000
mov rsi, rsp
call memcpy
add rsp, 100008
ret
wantNrvo(bool):
push r12
mov r12, rdi
test sil, sil
je .L5
mov rax, r12
mov BYTE PTR [rdi], 0
pop r12
ret
.L5:
call wantNrvo(bool) [clone .part.0]
mov rax, r12
pop r12
ret
Now this is bizarre! There are two branches of this function:
The test = true branch, which uses NRVO and writes directly to the block of memory pointed to by rdi.
The test = false branch, which allocates fresh memory on its own stack frame and then memcpys that just-allocated memory into the area pointed to by rdi. (Both branches are sketched in C++-ish form just below.)
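Translated back into the placement-new view (my reading of the assembly, with made-up names - not anything gcc literally outputs), the generated code behaves roughly like this:
#include <cstring>

// ret_slot is the caller-provided return area (what arrives in rdi).
static void wantNrvo_cold_part(char *ret_slot) {
    char local_y[100000];                            // y lives in this frame...
    std::memcpy(ret_slot, local_y, sizeof local_y);  // ...then gets copied out
}

char *wantNrvo_impl(char *ret_slot, bool test) {
    if (test) {
        ret_slot[0] = '\0';            // x is built directly in the return slot
    } else {
        wantNrvo_cold_part(ret_slot);  // y is built locally, then memcpy'd across
    }
    return ret_slot;
}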
Surely this is just optimized badly because we don't actually call it, right? Don't be so sure: when we call it, it gets inlined but still uses memcpy:
int main() {
    auto x = wantNrvo(false);
    doSomething(x.buffer);
}
main:
sub rsp, 200008
mov edx, 100000
lea rsi, [rsp+100000]
mov rdi, rsp
call memcpy
mov rdi, rsp
call doSomething(char*)
xor eax, eax
add rsp, 200008
ret
Why, God? Why?
I am not a compiler author. I have never written a compiler and am much dumber than those behind gcc's NRVO.
But NRVO is pretty important, and I expect it sometimes! That's backed up by this comment on this gcc bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51571
I would like to strongly oppose the notion that this "just a missed optimisation and not very critical, really".
NRVO is not just "an optimisation". It's actually one that is explcitly permitted to change observable behaviour of the program and it's extremely powerful.
And it it required for performant C++. Just try to return a std::vector by value to see the importance of this optimisation. This is not missed optimisation. This is premature pessimisation.
You could just as well stop all optimisation work for the C++ frontend until this is implemented, because any other optimisation effords are dwarfed by the overhead when NRVO is expected by the developer but not applied.
Please make this a top priority. Every C++ program will benefit both in text size and in runtime performance - dramatically.
-- Marc Mutz
But then again, there are often hyperbolic, borderline rude comments left on projects saying "make this a top priority", so who knows?
Let's see clang's business card
Let's see clang's machine code. Does it do any better? Yes!
(using clang 13.0.0)
wantNrvo(bool): # @wantNrvo(bool)
mov rax, rdi
test esi, esi
je .LBB0_2
mov byte ptr [rax], 0
.LBB0_2:
ret
This is exactly the machine code we want: given some memory in rdi, the branches collapse down to either writing a nul byte to it or not. Easy, no memcpy!
Lessons to learn
Although you might see some text on cppreference.com saying:
Return value optimization is mandatory and no longer considered as copy elision; see above. (since C++17)
This doesn't refer to NRVO! Assuming that NRVO will happen can be very dangerous, resulting in copies of huge objects, and that hurts performance a lot.
Try as hard as possible to return prvalues to take advantage of C++17's guaranteed RVO.
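One way to do that here (a sketch, assuming we're free to give the type a constructor that does the set-up; MyTypeV2 and guaranteedRvo are made-up names) is to build the object in the return statement itself, so every return is a prvalue and C++17's guaranteed elision applies on any conforming compiler:
// Hypothetical variant of MyType whose constructor does the set-up.
struct MyTypeV2 {
    char buffer[100000];
    explicit MyTypeV2(char first) { buffer[0] = first; }
};

MyTypeV2 guaranteedRvo(bool test) {
    if (test) {
        return MyTypeV2{'\0'};  // prvalue: constructed straight into the caller's storage
    }
    return MyTypeV2{'x'};       // prvalue: also guaranteed, no copy and no NRVO needed
}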
It's a dog-eat-RVO world out there. Be careful.