C/C++ Performance, mmap, and string_view

A C++ discussion group I participated in got a challenge last week. Someone posted a short Perl program, and claimed that it was faster than the corresponding C version. It was to this effect (with slight modifications):

open IN, "$ARGV[0]";
my $gc = 0;
while (my $line = ) {
  $gc += ($line =~ tr/cCgG//);
print "$gc\n";
close IN

The program simply searched and counted all occurrences of the letters ‘C’ and ‘G’ from the input file in a case-insensitive way. Since the posted C code was incomplete, I wrote a naïve implementation to test, which did turn out to be slower than the Perl code. It was about two times as slow on Linux, and about 10 times as slow on macOS.1

FILE* fp = fopen(argv[1], "rb");
int count = 0;
int ch;
while ((ch = getc(fp)) != EOF) {
    if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') {

Of course, it actually shows how optimized the Perl implementation is, instead of how inferior the C language is. Before I had time to test my mmap_line_reader, another person posted an mmap-based solution, to the following effect (with modifications):

int fd = open(argv[1], O_RDONLY);
struct stat st;
fstat(fd, &st);
int len = st.st_size;
char ch;
int count = 0;
char* ptr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
char* begin = ptr;
char* end = ptr + len;
while (ptr < end) {
    ch = *ptr++;
    if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G')
munmap(begin, len);

When I tested my mmap_line_reader, I found that its performance was only on par with the Perl code, but slower than the handwritten mmap-based code. It is not surprising, considering that mmap_line_reader copies the line content, while the C code above searches directly in the mmap’d buffer.

I then thought of the C++17 string_view.2 It seemed a good chance of using it to return a line without copying its content. It was actually easy refactoring (on code duplicated from mmap_line_reader), and most of the original code did not require any changes. I got faster code for this test, but the implementations of mmap_line_reader and the new mmap_line_reader_sv were nearly identical, except for a few small differences.

Naturally, the next step was to refactor again to unify the implementations. I made a common base to store the bottommost mmap-related logic, and made the difference between string and string_view disappear with a class template. Now mmap_line_reader and mmap_line_reader_sv were just two aliases of instantiations of basic_mmap_line_reader!

While mmap_line_reader_sv was faster than mmap_line_reader, it was still slower than the mmap-based C code. So I made another abstraction, a ‘container’ that allowed iteration over all of the file content. Since the mmap-related logic was already mostly separated, only some minor modifications were needed to make that base class fully independent of the line reading logic. After that, adding mmap_char_reader was easy, which was simply a normal container that did not need to mess with platform-specific logic.

At this point, all seemed well—except one thing: it only worked on Unix. I did not have an immediate need to make it work on Windows, but I really wanted to show that the abstraction provided could work seamlessly regardless of the platform underneath. After several revisions, in which I dealt with proper Windows support,3 proper 32- and 64-bit support, and my own mistakes, I finally made it work. You can check out the current code in the nvwa repository. With it, I can finally make the following simple code work on Linux, macOS, and Windows, with the same efficiency as raw C code when fully optimized:4

#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_byte_reader.h"

int main(int argc, char* argv[])
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;

    try {
        int count = 0;
        for (char ch : nvwa::mmap_char_reader(argv[1]))
            if (ch == 'c' || ch == 'C' ||
                ch == 'g' || ch == 'G')
        std::cout << count << std::endl;
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;

Even though it is not used in the final code, I love the C++17 string_view. And I like the simplicity I finally achieved. Do you?

P.S. The string_view-based test code is posted here too as a reference. Line-based processing is common enough!5

#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_line_reader.h"

int main(int argc, char* argv[])
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;

    try {
        int count = 0;
        for (const auto& line :
            for (char ch : line)
                if (ch == 'c' || ch == 'C' ||
                    ch == 'g' || ch == 'G')
        std::cout << count << std::endl;
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;

Update (18 September 2017)

Thanks to Alex Maystrenko (see the comments below), It is now understood that the reason why getc was slow was because there was an implicit lock around file operations. I did not expect it, as I grew from an age when multi-threading was the exception, and I had not learnt about the existence of getc_unlocked until he mentioned it! According to the getc_unlocked page in the POSIX specification:

Some I/O functions are typically implemented as macros for performance reasons (for example, putc() and getc()). For safety, they need to be synchronized, but it is often too expensive to synchronize on every character. Nevertheless, it was felt that the safety concerns were more important; consequently, the getc(), getchar(), putc(), and putchar() functions are required to be thread-safe. However, unlocked versions are also provided with names that clearly indicate the unsafe nature of their operation but can be used to exploit their higher performance.

After replacing getc with getc_unlocked, the naïve implementation immediately outperforms the Perl code on Linux, though not on macOS.6

Another interesting thing to notice is that GCC provides vastly optimized code for the comparison with ‘c’, ‘C’, ‘g’, and ‘G’,7 which is extremely unlikely for interpreted languages like Perl. Observing the codes for the characters are:

  • 010000112 or 6710 (‘C’)
  • 010001112 or 7110 (‘G’)
  • 011000112 or 9910 (‘c’)
  • 011001112 or 10310 (‘g’)

GCC basically ANDs the input character with 110110112, and compares the result with 010000112. In Intel-style assembly:

        movzx   ecx, BYTE PTR [r12]
        and     ecx, -37
        cmp     cl, 67

It is astoundingly short and efficient!

  1. I love my Mac, but I do feel Linux has great optimizations. 
  2. If you are not familiar with string_view, check out its reference
  3. Did I mention how unorthogonal and uncomfortable the Win32 API is, when compared with POSIX? I am very happy to hide all the ugliness from the application code. 
  4. Comparison was only made on Unix (using the POSIX mmap API), under GCC with ‘-O3’ optimization (‘-O2’ was not enough). 
  5. -std=c++17’ must be specified for GCC, and ‘/std:c++latest’ must be specified for Visual C++ 2017. 
  6. However, the performance improvement is more dramatic on macOS, from about 10 times slower to 5% slower. 
  7. Of course, this is only possible when the characters are given as compile-time constants. 

Performance of My Line Readers

After I wrote the article about Python yield and C++ Coroutines, I felt that I needed to test the performance of istream_line_reader. The immediate result was both good and bad: good in that there was no actual difference between the straightforward std::getline and my istream_line_reader (as anticipated), and bad in that neither version performed well (a surprise to me). I vaguely remember that sync_with_stdio(false) may affect the performance, so I also tested calling this function in the beginning. However, it did not seem to matter. By the way, my favourite compiler has always been Clang recently (and I use a Mac).

Seeing that istream_line_reader had a performance problem, I tried other approaches. One thing I tried was using the traditional C I/O functions. I wrote another file_line_reader, which used either fgets or fread to read the data, depending what the delimiter is. (fgets could only use ‘\n’ as the delimiter, but it performed better than fread, for I could fgets into the final line buffer, but had to fread into a temporary buffer first.) I also added a switch on whether to strip the delimiter, something not possible with the getline function. The result achieved a more than 10x performance improvement (from about 28 MB/s to 430 MB/s). I was happy, and presented this on the last slide of my presentation on C++ and Functional Programming in the 2016 C++ and System Software Summit (China).

Until C++11, modifying the character array accessed through string::data() has undefined behaviour. To be on the safe side, I implemented a self-expanding character buffer on my own, which complicated the implementation a little bit. It also made the interface slightly different from istream_line_reader, which can be demonstrated in the following code snippets.

Iteration with istream_line_reader:

for (auto& line : istream_line_reader(cin)) {

Iteration with file_line_reader:

for (auto& line : file_line_reader(stdin)) {

I.e. each iteration with file_line_reader returns a char* instead of a string. This should be OK, as a raw character pointer is often enough. One can always construct a string from char* easily, anyway.

After the presentation, I turned to implementing a small enhancement—iterating over the lines with mmap. This proved interesting work. Not only did it improved the line reading performance, but the code was simplified as well. As I could access the file content directly with a pointer, I was able to copy the lines to a string simply with string::assign. As I used string again, there was no need to define a custom copy constructor, copy assignment operator, move constructor, and move assignment operator as well. The performance was, of course, also good: the throughput rate reached 650 MB/s, a 50% improvement! The only negative side was that it could not work on stdin, so testing it required more lines. Apart from that, I was quite satisfied. And I had three different line readers that could take an istream&, FILE*, or file descriptor as the input source. So all situations were dealt with. Not bad!

One thing of note about the implementation. I tried copying (a character at a time) while searching, before adopting the current method of searching first before assigning to the string. The latter proved faster when dealing with long lines. I can see two reasons:

  1. Strings are normally (and required to be since C++11) null-terminated, so copying one character at a time has a big overhead of zeroing the next byte. I confirmed the case from the libc++ source code of Clang.
  2. Assignment can use memcpy or memmove internally, which normally has a fast platform-specific implementation. In the case of string::assign(const char*, size_t), I verified that libc++ used memmove indeed.

If you are interested, this is the assembly code I finally traced into on my Mac (comments are my analysis; you may need to scroll horizontally to see them all):

   0x7fff9291fcbd:  pushq  %rbp
   0x7fff9291fcbe:  movq   %rsp, %rbp
   0x7fff9291fcc1:  movq   %rdi, %r11           ; save dest
   0x7fff9291fcc4:  movq   %rdi, %rax
   0x7fff9291fcc7:  subq   %rsi, %rax           ; dest - src
   0x7fff9291fcca:  cmpq   %rdx, %rax
   0x7fff9291fccd:  jb     0x7fff9291fd04       ; dest in (src, src + len)?
   ; Entry condition: dest <= src or dest >= src + len; copy starts from front
   0x7fff9291fccf:  cmpq   $80, %rdx
   0x7fff9291fcd3:  ja     0x7fff9291fd09       ; len > 128?
   ; Entry condition: len <= 128
   0x7fff9291fcd5:  movl   %edx, %ecx
   0x7fff9291fcd7:  shrl   $2, %ecx             ; len / 4
   0x7fff9291fcda:  je     0x7fff9291fcec       ; len < 4?
   0x7fff9291fcdc:  movl   (%rsi), %eax         ; 4-byte read
   0x7fff9291fcde:  addq   $4, %rsi             ; src <- src + 4
   0x7fff9291fce2:  movl   %eax, (%rdi)         ; 4-byte write
   0x7fff9291fce4:  addq   $4, %rdi             ; dest <- dest + 4
   0x7fff9291fce8:  decl   %ecx
   0x7fff9291fcea:  jne    0x7fff9291fcdc       ; more 4-byte blocks?
   ; Entry condition: len < 4
   0x7fff9291fcec:  andl   $3, %edx
   0x7fff9291fcef:  je     0x7fff9291fcff       ; len == 0?
   0x7fff9291fcf1:  movb   (%rsi), %al          ; 1-byte read
   0x7fff9291fcf3:  incq   %rsi                 ; src <- src + 1
   0x7fff9291fcf6:  movb   %al, (%rdi)          ; 1-byte write
   0x7fff9291fcf8:  incq   %rdi                 ; dest <- dest + 1
   0x7fff9291fcfb:  decl   %edx
   0x7fff9291fcfd:  jne    0x7fff9291fcf1       ; more bytes?
   0x7fff9291fcff:  movq   %r11, %rax           ; restore dest
   0x7fff9291fd02:  popq   %rbp
   0x7fff9291fd03:  ret
   0x7fff9291fd04:  jmpq   0x7fff9291fdb9
   ; Entry condition: len > 128
   0x7fff9291fd09:  movl   %edi, %ecx
   0x7fff9291fd0b:  negl   %ecx
   0x7fff9291fd0d:  andl   $15, %ecx            ; 16 - dest % 16
   0x7fff9291fd10:  je     0x7fff9291fd22       ; dest 16-byte aligned?
   0x7fff9291fd12:  subl   %ecx, %edx           ; adjust len
   0x7fff9291fd14:  movb   (%rsi), %al          ; one-byte read
   0x7fff9291fd16:  incq   %rsi                 ; src <- src + 1
   0x7fff9291fd19:  movb   %al, (%rdi)          ; one-byte write
   0x7fff9291fd1b:  incq   %rdi                 ; dest <- dest + 1
   0x7fff9291fd1e:  decl   %ecx
   0x7fff9291fd20:  jne    0x7fff9291fd14       ; until dest is aligned
   ; Entry condition: dest is 16-byte aligned
   0x7fff9291fd22:  movq   %rdx, %rcx           ; len
   0x7fff9291fd25:  andl   $63, %edx            ; len % 64
   0x7fff9291fd28:  andq   $-64, %rcx           ; len <- align64(len)
   0x7fff9291fd2c:  addq   %rcx, %rsi           ; src <- src + len
   0x7fff9291fd2f:  addq   %rcx, %rdi           ; src <- dest + len
   0x7fff9291fd32:  negq   %rcx                 ; len <- -len
   0x7fff9291fd35:  testl  $15, %esi
   0x7fff9291fd3b:  jne    0x7fff9291fd80       ; src not 16-byte aligned?
   0x7fff9291fd3d:  jmp    0x7fff9291fd40
   0x7fff9291fd3f:  nop
   ; Entry condition: both src and dest are 16-byte aligned
   0x7fff9291fd40:  movdqa (%rsi,%rcx), %xmm0   ; aligned 16-byte read
   0x7fff9291fd45:  movdqa 16(%rsi,%rcx), %xmm1
   0x7fff9291fd4b:  movdqa 32(%rsi,%rcx), %xmm2
   0x7fff9291fd51:  movdqa 48(%rsi,%rcx), %xmm3
   0x7fff9291fd57:  movdqa %xmm0, (%rdi,%rcx)   ; aligned 16-byte write
   0x7fff9291fd5c:  movdqa %xmm1, 16(%rdi,%rcx)
   0x7fff9291fd62:  movdqa %xmm2, 32(%rdi,%rcx)
   0x7fff9291fd68:  movdqa %xmm3, 48(%rdi,%rcx)
   0x7fff9291fd6e:  addq   $64, %rcx
   0x7fff9291fd72:  jne    0x7fff9291fd40       ; more 64-byte blocks?
   0x7fff9291fd74:  jmpq   0x7fff9291fcd5
   0x7fff9291fd79:  nopl   (%rax)               ; 7-byte nop
   ; Entry condition: src is NOT 16-byte aligned but dest is
   0x7fff9291fd80:  movdqu (%rsi,%rcx), %xmm0   ; unaligned 16-byte read
   0x7fff9291fd85:  movdqu 16(%rsi,%rcx), %xmm1
   0x7fff9291fd8b:  movdqu 32(%rsi,%rcx), %xmm2
   0x7fff9291fd91:  movdqu 48(%rsi,%rcx), %xmm3
   0x7fff9291fd97:  movdqa %xmm0, (%rdi,%rcx)   ; aligned 16-byte write
   0x7fff9291fd9c:  movdqa %xmm1, 16(%rdi,%rcx)
   0x7fff9291fda2:  movdqa %xmm2, 32(%rdi,%rcx)
   0x7fff9291fda8:  movdqa %xmm3, 48(%rdi,%rcx)
   0x7fff9291fdae:  addq   $64, %rcx
   0x7fff9291fdb2:  jne    0x7fff9291fd80       ; more 64-byte blocks?
   0x7fff9291fdb4:  jmpq   0x7fff9291fcd5
   ; Entry condition: dest > src and dest < src + len; copy starts from back
   0x7fff9291fdb9:  addq   %rdx, %rsi           ; src <- src + len
   0x7fff9291fdbc:  addq   %rdx, %rdi           ; dest <- dest + len
   0x7fff9291fdbf:  cmpq   $80, %rdx
   0x7fff9291fdc3:  ja     0x7fff9291fdf6       ; len > 128?
   ; Entry condition: len < 128
   0x7fff9291fdc5:  movl   %edx, %ecx
   0x7fff9291fdc7:  shrl   $3, %ecx             ; len / 8
   0x7fff9291fdca:  je     0x7fff9291fdde       ; len < 8?
   ; Entry condition: len >= 8
   0x7fff9291fdcc:  subq   $8, %rsi             ; src <- src - 8
   0x7fff9291fdd0:  movq   (%rsi), %rax         ; 8-byte read
   0x7fff9291fdd3:  subq   $8, %rdi             ; dest <- dest - 8
   0x7fff9291fdd7:  movq   %rax, (%rdi)         ; 8-byte write
   0x7fff9291fdda:  decl   %ecx
   0x7fff9291fddc:  jne    0x7fff9291fdcc       ; until len < 8
   ; Entry condition: len < 8
   0x7fff9291fdde:  andl   $7, %edx
   0x7fff9291fde1:  je     0x7fff9291fdf1       ; len == 0?
   0x7fff9291fde3:  decq   %rsi                 ; src <- src - 1
   0x7fff9291fde6:  movb   (%rsi), %al          ; 1-byte read
   0x7fff9291fde8:  decq   %rdi                 ; dest <- dest - 1
   0x7fff9291fdeb:  movb   %al, (%rdi)          ; 1-byte write
   0x7fff9291fded:  decl   %edx
   0x7fff9291fdef:  jne    0x7fff9291fde3       ; more bytes?
   0x7fff9291fdf1:  movq   %r11, %rax           ; restore dest
   0x7fff9291fdf4:  popq   %rbp
   0x7fff9291fdf5:  ret
   ; Entry condition: len > 128
   0x7fff9291fdf6:  movl   %edi, %ecx
   0x7fff9291fdf8:  andl   $15, %ecx
   0x7fff9291fdfb:  je     0x7fff9291fe0e       ; dest 16-byte aligned?
   0x7fff9291fdfd:  subq   %rcx, %rdx           ; adjust len
   0x7fff9291fe00:  decq   %rsi                 ; src <- src - 1
   0x7fff9291fe03:  movb   (%rsi), %al          ; one-byte read
   0x7fff9291fe05:  decq   %rdi                 ; dest <- dest - 1
   0x7fff9291fe08:  movb   %al, (%rdi)          ; one-byte write
   0x7fff9291fe0a:  decl   %ecx
   0x7fff9291fe0c:  jne    0x7fff9291fe00       ; until dest is aligned
   ; Entry condition: dest is 16-byte aligned
   0x7fff9291fe0e:  movq   %rdx, %rcx           ; len
   0x7fff9291fe11:  andl   $63, %edx            ; len % 64
   0x7fff9291fe14:  andq   $-64, %rcx           ; len <- align64(len)
   0x7fff9291fe18:  subq   %rcx, %rsi           ; src <- src - len
   0x7fff9291fe1b:  subq   %rcx, %rdi           ; dest <- dest - len
   0x7fff9291fe1e:  testl  $15, %esi
   0x7fff9291fe24:  jne    0x7fff9291fe61       ; src 16-byte aligned?
   ; Entry condition: both src and dest are 16-byte aligned
   0x7fff9291fe26:  movdqa -16(%rsi,%rcx), %xmm0; aligned 16-byte read
   0x7fff9291fe2c:  movdqa -32(%rsi,%rcx), %xmm1
   0x7fff9291fe32:  movdqa -48(%rsi,%rcx), %xmm2
   0x7fff9291fe38:  movdqa -64(%rsi,%rcx), %xmm3
   0x7fff9291fe3e:  movdqa %xmm0, -16(%rdi,%rcx); aligned 16-byte write
   0x7fff9291fe44:  movdqa %xmm1, -32(%rdi,%rcx)
   0x7fff9291fe4a:  movdqa %xmm2, -48(%rdi,%rcx)
   0x7fff9291fe50:  movdqa %xmm3, -64(%rdi,%rcx)
   0x7fff9291fe56:  subq   $64, %rcx
   0x7fff9291fe5a:  jne    0x7fff9291fe26       ; more 64-byte blocks?
   0x7fff9291fe5c:  jmpq   0x7fff9291fdc5
   ; Entry condition: src is NOT 16-byte aligned but dest is
   0x7fff9291fe61:  movdqu -16(%rsi,%rcx), %xmm0; unaligned 16-byte read
   0x7fff9291fe67:  movdqu -32(%rsi,%rcx), %xmm1
   0x7fff9291fe6d:  movdqu -48(%rsi,%rcx), %xmm2
   0x7fff9291fe73:  movdqu -64(%rsi,%rcx), %xmm3
   0x7fff9291fe79:  movdqa %xmm0, -16(%rdi,%rcx); aligned 16-byte write
   0x7fff9291fe7f:  movdqa %xmm1, -32(%rdi,%rcx)
   0x7fff9291fe85:  movdqa %xmm2, -48(%rdi,%rcx)
   0x7fff9291fe8b:  movdqa %xmm3, -64(%rdi,%rcx)
   0x7fff9291fe91:  subq   $64, %rcx
   0x7fff9291fe95:  jne    0x7fff9291fe61       ; more 64-byte blocks?
   0x7fff9291fe97:  jmpq   0x7fff9291fdc5

I am happy that I can take advantage of such optimizations, but do not need to write such code on my own—there are so many different cases to deal with!

Of couse, nothing is simple regarding performance. More tests revealed more facts that are interesting and/or surprising:

  • While libc++ (it is the library, but not the compiler, that matters here) seems to completely ignore sync_with_stdio, it makes a big difference in libstdc++. The same function call gets a more than 10x performance improvement when the istream_line_reader test program is compiled with GCC (which uses libstdc++), from ~28 MB/s to ~390 MB/s. It shows that I made a wrong assumption! Interestingly, reading from stdin (piped from the pv tool) is slightly faster than reading from a file on my Mac (when compiled with GCC).
  • On a CentOS 6.5 Linux system, sync_with_stdio(false) has a bigger performance win (~23 MB/s vs. ~800 MB/s). Reading from a file directly is even faster at 1100 MB/s. That totally beats my file_line_reader (~550 MB/s reading a file directly) and mmap_line_reader (~600 MB/s reading a file directly) on the same machine. I was stunned when first seeing this performance difference of nearly 40 times!

So, apart from the slight difference in versatility, the first and simplest form of my line readers is also the best on Linux, while the mmap-based version may be a better implementation on OS X—though your mileage may vary depending on the different combinations of OS versions, compilers, and hardware. Should I be happy, or sad?

You can find the implementation of istream_line_reader among my example code for the ‘C++ and Functional Programming’ presentation, and the implementations of file_line_reader and mmap_line_reader in the Nvwa repository. And the test code is as follows:


#include <fstream>
#include <iostream>
#include <string>
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include "istream_line_reader.h"

using namespace std;

int main(int argc, char* argv[])
    char optch;
    while ( (optch = getopt(argc, argv, "s")) != EOF) {
        switch (optch) {
        case 's':
    if (!(optind == argc || optind == argc - 1)) {
                "Only one file name can be specified\n");

    istream* is = nullptr;
    ifstream ifs;
    if (optind == argc) {
        is = &cin;
    } else {
        if (!ifs) {
                    "Cannot open file '%s'\n",
        is = &ifs;

    for (auto& line : istream_line_reader(*is)) {


#include <stdio.h>
#include <stdlib.h>
#include <nvwa/file_line_reader.h>

using nvwa::file_line_reader;

int main(int argc, char* argv[])
    FILE* fp = stdin;
    if (argc == 2) {
        fp = fopen(argv[1], "r");
        if (!fp) {
                    "Cannot open file '%s'\n",

        reader(fp, '\n',
    for (auto& line : reader) {
        fputs(line, stdout);


#include <stdio.h>
#include <stdlib.h>
#include <stdexcept>
#include <nvwa/mmap_line_reader.h>

using nvwa::mmap_line_reader;

int main(int argc, char* argv[])
    if (argc != 2) {
                "A file name shall be provided\n");

    try {
            reader(argv[1], '\n',

        for (auto& str : reader) {
            fputs(str.c_str(), stdout);
    catch (std::runtime_error& e) {