Saturday, April 20, 2013

C++ COW Craziness

Note: This isn't one of those Linus'ish articles that bitches about C++. I like C++ and I would just like to point out one of the many nuances in the language that could affect the performance of your program without your knowledge.


C++ STL's string class promises Copy-on-write. What that means is that, you can make as many copies of the string, but the actual memory duplication will happen only when one of the strings are actually written to (i.e.) no memory duplication will be made for copies that are made for pure reads. Or atleast that's what I thought, until I discovered today that, if you use the [ ] operator on the string, you rig the COW functionality of it forever. It is something that you normally don't do, but doing so could cost you a lot of performance. Let's run through an example.
string s1(1024 * 1024 * 16, 'g');
for(int i = 0; i < 1000; i++) {
  string s2 = s1;
}
This runs in 19 milliseconds. That's because (obviously) there are no actual copies made. Just 1000 pointers being created to the existing 16 megabytes of data. Now, lets try modifiying the copied string.
string s1(1024 * 1024 * 16, 'g');
for(int i = 0; i < 1000; i++) {
  string s2 = s1;
  s2[0] = 'v';
}
This runs in 4.3 seconds. That's right, from 19 milliseconds to 4.3 seconds for making 1000 actual copies of 16 MB of data. This is the expected behavior, a copy is done when you try to write to it. Next comes the weird part, consider the following code:
string s1(1024 * 1024 * 16, 'g');
for(int i = 0; i < 1000; i++) {
  string s2 = s1;
  s2[0];
}
Guess how much time this should take? Intuitively it seems like this should hit the COW fast path (i.e.) no actual copies, because there is no "write" here. This takes 4.3 seconds too! The problem behind the [ ] operator is that, you can easily stash away a pointer to some portion of the string and modify it later thereby screwing up the state. So, it is impossible to perform COW once you use the [ ] operator on a string. The following snippet illustrates this:
string s1("hello");
char *p = &s1[2];
string s2 = s1;
*p = 'v';
You see what happened there? You stashed away a pointer to the middle of the string and then tried to change it later after the copy. This is sort of an indirect write, and there is no way for the compiler to determine this. So, the moment is sees the [ ] operator, it removes the COW functionality for that string. One way to do such a read without rigging the COW functionality is to do a crazy cast like this:
string s1(1024 * 1024 * 16, 'g');
const_cast<const string &>(s1)[0];
for(int i = 0; i < 1000; i++) {
  string s2 = s1;
}
This snippet takes the fast COW path and runs in 20 milliseconds. The takeaway from this article is that, do not use the [ ] (or the .at()) operator on strings, especially large string that could be copied later on. Even though you think you're doing an harmless read, you are rigging the COW functionality of that string forever. You are paying the price for that pointer you stashed away (or may be even released long back) without knowing.

 -Vignesh