Andre's Blog
Perfection is when there is nothing left to take away
Is C++ slower than C?

This question seems silly on the surface, to put it mildly, as both languages are compiled into highly optimized code, often by the same compiler engine using the same optimization techniques. However, once widely-used programming practices and common libraries are factored in, the answer is not as straightforward as it may seem.

C++ makes it easy to write more maintainable code by simplifying memory management for things like strings. In places where a C developer would think twice before allocating memory, a C++ developer will often do so, sometimes without even realizing memory is being allocated. Let's take a look at a typical operation: getting a string from some component.

There are many cases where one needs to get a string out of some component: getting the name of a database column from a database schema model, getting a serialized value from a JSON parser, getting a file name from its handle, etc. Many of those functions avoid memory allocation and instead fill a caller-supplied buffer of a fixed maximum size, such as the Win32 function GetFinalPathNameByHandle with this signature:

DWORD GetFinalPathNameByHandle(
  HANDLE hFile,
  LPSTR  lpszFilePath,
  DWORD  cchFilePath,
  DWORD  dwFlags
);

Other platforms have similar mechanisms, such as the fcntl function used with the F_GETPATH command on macOS. The actual function is not particularly important; it only serves to establish a usage pattern.

Based on this pattern, if one wanted to introduce a portable C function to get a file name from a handle, it would look like this in C (for simplicity, the handle is just an integer):

size_t get_file_name(int handle, char *namebuf, size_t bufsize);

The return value would be the actual number of characters written to the buffer.
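As a minimal sketch of this signature, the function below copies a fixed fake file name instead of asking the file system. The fake name, the fake_name constant and the convention of returning 0 when the buffer is too small are assumptions made for illustration; they are not part of any real API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical 20-character name standing in for a real file system lookup.
static const char fake_name[] = "/tmp/fake-file-1.txt";

// C-style form: the caller owns the buffer, so nothing is allocated here.
std::size_t get_file_name(int handle, char *namebuf, std::size_t bufsize)
{
    (void) handle;                      // the sketch ignores the handle
    assert(namebuf != nullptr);

    const std::size_t len = sizeof(fake_name) - 1;

    // Assumption: 0 means the buffer cannot hold the name and its terminator.
    if (bufsize <= len)
        return 0;

    std::strcpy(namebuf, fake_name);    // copies the name and the null terminator
    return len;
}
```

A caller would typically keep one buffer, e.g. char buf[260], on the stack or in a long-lived object and pass it to every call.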

A similar C++ function would often have this signature:

std::string get_file_name(int handle);

A more performance-conscious, but less frequently used form would look like this:

std::string& get_file_name(int handle, std::string& name);

This form allows the caller to allocate the string once and then reuse allocated memory for subsequent calls.
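Both C++ forms could be sketched roughly as follows, again with a hypothetical fixed name (fake_file_name) in place of a real lookup and a hypothetical handle check. This is an illustration of the two signatures, not the exact code used in the test.

```cpp
#include <cassert>
#include <string>

// Hypothetical 20-character name standing in for a real file system lookup.
static const char fake_file_name[] = "/tmp/fake-file-1.txt";

// Convenient form: every call builds a new string, which has to allocate
// whenever the name does not fit into the short string buffer.
std::string get_file_name(int handle)
{
    assert(handle >= 0);                // hypothetical validity check
    std::string name;
    name = fake_file_name;
    return name;
}

// Reusing form: the caller owns the string, so once its capacity is large
// enough, later calls can copy the characters without allocating.
std::string& get_file_name(int handle, std::string& name)
{
    assert(handle >= 0);                // hypothetical validity check
    name.assign(fake_file_name);
    return name;
}
```

With the second form the caller declares one std::string outside of the loop and passes it into each call, which is what lets the allocated memory be reused.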

Let's put these signatures through a test and have a look at the profiler results.

I didn't want to add the file system into the mix, so I used simple functions with just the code I wanted to test and a few volatile assignments to make sure important code is not optimized out. The C function used strcpy to copy a fake file name from a fixed location and C++ functions just created a local std::string variable, populated it from the same location and returned that variable. In all cases I went through the assembly to make sure all important calls or loops are still there. strcpy was used because it is commonly used in these cases, even though memcpy or memmove would actually run faster.

The test ran 500 million iterations divided between 4 threads for each method on a 4-core, 8-logical CPU laptop that was plugged in. The test was compiled with full optimization in Visual Studio 2019 and profiled in Intel's VTune on Windows 10.

The size of the string copied in all tests was 20 characters to avoid the short string optimization in VC++, which maintains a 16-byte buffer for short strings to avoid memory allocations.

This is a VTune screenshot of per-thread CPU usage for each of the test cases.

4 threads, 200 M iterations

Each brown bar shows CPU utilization for its thread. Given that all threads ran tight loops, they show 100% CPU utilization throughout their entire run time, with occasional dark green specks for when they were running, but utilizing less than 100% of one CPU. The brown bar at the bottom shows overall CPU utilization on the test computer.

The four threads calling the C-style function finished in about 0.9 sec.

The four threads calling a C++ function returning an std::string instance finished in about 11.9 sec.

The four threads calling a C++ function taking a string reference that is reused between calls finished in about 2.2 sec.

The C-style function just copied 20 characters and a null terminator byte-by-byte. No surprise there. memmove would run faster, but other than that it's as fast as it gets.

The C++ test in the middle creates a temporary string, which is then copied into the result string. This extra copy and the additional memory allocation are what take the extra time. Here is the function call stack breakdown (all values must be divided by 4 threads to get the duration):

Time (seconds)  Source Function Stack
        47.635  string_perf_test::enum_files_cpp_sc
        47.635    [Loop at line 113 in string_perf_test::enum_files_cpp_sc]
        26.805      std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >::assign
        21.562        operator new
         2.156        memcpy
         0.067        [Import thunk memcpy]
         0.182      operator delete
         0.062      operator delete
        13.538      free_base
         1.871      free

This shows that memory allocation is far from being free and should be treated with more respect.
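One way to see where these allocations come from without a profiler is to count them. The sketch below is not the code from the test above: it swaps std::string for a string type with a hypothetical counting_allocator, and it uses a longer fake name (an assumption) so that it exceeds the short string buffer of the common standard library implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>

static std::size_t alloc_count = 0;     // incremented on every allocation

// Hypothetical minimal allocator that only counts calls to allocate().
template <typename T>
struct counting_allocator {
    using value_type = T;

    counting_allocator() = default;
    template <typename U>
    counting_allocator(const counting_allocator<U>&) noexcept {}

    T* allocate(std::size_t n)
    {
        ++alloc_count;
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};

template <typename T, typename U>
bool operator==(const counting_allocator<T>&, const counting_allocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const counting_allocator<T>&, const counting_allocator<U>&) noexcept { return false; }

using counted_string =
    std::basic_string<char, std::char_traits<char>, counting_allocator<char>>;

// Assumption: long enough to defeat the short string optimization.
static const char long_fake_name[] = "/tmp/some-directory/fake-file-0001.txt";

counted_string get_name_by_value()
{
    counted_string name;
    name.assign(long_fake_name);
    return name;
}

counted_string& get_name_reuse(counted_string& name)
{
    name.assign(long_fake_name);
    return name;
}

// Number of allocations made when each call returns a new string.
std::size_t count_by_value(int iterations)
{
    alloc_count = 0;
    for (int i = 0; i < iterations; ++i) {
        counted_string name = get_name_by_value();
        assert(!name.empty());
    }
    return alloc_count;
}

// Number of allocations made when one string is reused across calls.
std::size_t count_reuse(int iterations)
{
    alloc_count = 0;
    counted_string name;
    for (int i = 0; i < iterations; ++i)
        get_name_reuse(name);
    return alloc_count;
}
```

Comparing count_by_value(n) with count_reuse(n) should show at least one allocation per call for the first and only the initial allocation for the second, although the exact numbers may vary with the standard library and build mode.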

The C++ function that takes an std::string reference does a couple of extra calls and comparisons for its memory management, so it's not as fast as the C counterpart, but in the grand scheme of things it is usable when one wants to maintain a good balance between code maintainability and performance.

This example may seem somewhat contrived, because in many projects the cost of in-memory operations, like copying strings, is dwarfed by the underlying I/O costs, such as reaching into the file system or getting database information. In large projects, however, this pattern pops up among profiler hot spots more often than one would expect, because it is used pretty much everywhere.

It is also worth mentioning that there is nothing wrong with using this pattern when it is clear that the underlying component will never be called often enough to show up in profiler results, but sometimes it's hard to anticipate how some values will be used.

For example, application configuration is often thought of as code that runs only during application start-up and configuration updates, so it is convenient to return a string copy and avoid locks or a detachable configuration state for configuration updates. But when such a string represents something like a MongoDB collection name for a multi-threaded client pool and is requested in a tight request processing loop, these extra allocations will take their toll.

One more trick in the C++ toolbox is string views. When one can guarantee that the source characters are not going away for as long as a string view is in use, returning std::string_view will provide a huge benefit. The four thread bars on the far right of the profiling diagram above represent the same test using string views. The benefit is very visible, but one has to design the application in a way that the underlying memory outlives all string views. Otherwise the views are left dangling, and reading through them is undefined behavior that typically ends in crashes or corrupted data.
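A minimal sketch of the string view form, assuming C++17 and a hypothetical name stored in static memory (stable_fake_name), which lives for the whole run of the program and therefore outlives every view of it:

```cpp
#include <cassert>
#include <string_view>

// Hypothetical name with static storage duration; it is never destroyed
// while the program runs, so views of it stay valid.
static const char stable_fake_name[] = "/tmp/fake-file-1.txt";

// Returns a pointer and a length, with no copy and no allocation.
std::string_view get_file_name_view(int handle)
{
    assert(handle >= 0);                // hypothetical validity check
    return std::string_view(stable_fake_name);
}

// What must be avoided: returning a view of a function-local std::string,
// e.g. "std::string s = ...; return std::string_view(s);". The string is
// destroyed when the function returns and the view is left dangling.
```

The same function becomes unsafe as soon as the name is kept in storage that can be released or overwritten while callers still hold their views, for example a configuration value that is replaced on reload.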

All this shows that while C++ source is compiled into the same highly-optimized machine code as is C source, using C++ conveniences without paying attention to the costs involved may often negate the benefits of a fast compiled language and its fast runtime environment. This may not matter much for a single-user desktop application, but will most certainly be felt in a server-like application handling large volumes of requests and/or data.

Having said that, it is worth noting that even though the C function outperforms the equivalent C++ function, an average C developer will introduce enough quirks, such as if(!strlen(str)), to slow down a typical C application and level the field with equivalent, well thought-out C++ code that uses tried-and-true optimized standard libraries, while also leaving the C application more fragile and harder to maintain. In other words, the point is not to go back to coding in assembly, but rather to be mindful of the costs of C++ conveniences in applications where performance matters.
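To illustrate the quirk mentioned above: strlen has to walk the entire string to find its terminator, while checking for an empty string only requires looking at the first character. The function names below are hypothetical, and an optimizing compiler may well turn the first form into the second one on its own, so this is a sketch of the idea rather than a measured claim.

```cpp
#include <cassert>
#include <cstring>

// As written, scans the whole string just to learn whether its length is zero.
bool is_empty_via_strlen(const char *str)
{
    assert(str != nullptr);
    return !std::strlen(str);
}

// Looks only at the first character, regardless of the string length.
bool is_empty_via_first_char(const char *str)
{
    assert(str != nullptr);
    return *str == '\0';
}
```

Both functions return the same result for any valid null-terminated string; only the amount of work that the source code asks for differs.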