C/C++ Performance, mmap, and string_view

A C++ discussion group I participated in got a challenge last week. Someone posted a short Perl program and claimed that it was faster than the corresponding C version. It was to this effect (with slight modifications):

open IN, "$ARGV[0]";
my $gc = 0;
while (my $line = <IN>) {
  $gc += ($line =~ tr/cCgG//);
}
print "$gc\n";
close IN

The program simply counted, case-insensitively, all occurrences of the letters ‘C’ and ‘G’ in the input file. Since the posted C code was incomplete, I wrote a naïve implementation to test, which did turn out to be slower than the Perl code: about two times as slow on Linux, and about 10 times as slow on macOS.[1]

#include <stdio.h>

int main(int argc, char* argv[])
{
    FILE* fp = fopen(argv[1], "rb");
    int count = 0;
    int ch;
    while ((ch = getc(fp)) != EOF) {
        if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') {
            ++count;
        }
    }
    fclose(fp);
    printf("%d\n", count);
    return 0;
}

Of course, this actually shows how optimized the Perl implementation is, rather than how inferior the C language is. Before I had time to test my mmap_line_reader, another person posted an mmap-based solution, to the following effect (with modifications):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    int len = st.st_size;
    char ch;
    int count = 0;
    char* ptr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the close */
    char* begin = ptr;
    char* end = ptr + len;
    while (ptr < end) {
        ch = *ptr++;
        if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G')
            ++count;
    }
    munmap(begin, len);
    printf("%d\n", count);
    return 0;
}

When I tested my mmap_line_reader, I found that its performance was only on par with the Perl code, but slower than the handwritten mmap-based code. This was not surprising, considering that mmap_line_reader copies the line content, while the C code above searches directly in the mmap’d buffer.

I then thought of the C++17 string_view.[2] It seemed a good opportunity to use it to return a line without copying its content. The refactoring (on code duplicated from mmap_line_reader) was actually easy, and most of the original code did not require any changes. I got faster code for this test, but the implementations of mmap_line_reader and the new mmap_line_reader_sv were nearly identical, differing only in a few small places.
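
To illustrate the idea (a toy snippet of mine, not the nvwa code): a string_view is just a pointer plus a length, so a ‘line’ can refer directly into the mapped buffer without allocating or copying anything.

#include <iostream>
#include <string_view>

int main()
{
    // Stand-in for the mmap'd file content
    const char* buffer = "ACGT\nggcc\n";
    // Refers to the four bytes of "ACGT"; nothing is copied
    std::string_view first_line(buffer, 4);
    std::cout << first_line << std::endl;
}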

Naturally, the next step was to refactor again to unify the implementations. I made a common base class to hold the lowest-level mmap-related logic, and made the difference between string and string_view disappear behind a class template. Now mmap_line_reader and mmap_line_reader_sv were just two aliases for instantiations of basic_mmap_line_reader!
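
The following compressed sketch is mine, with invented names, and is far simpler than the real basic_mmap_line_reader; it only shows the shape of the unification: the scanning logic is written once, and the string type handed out is a template parameter.

#include <cstddef>
#include <iostream>
#include <string>
#include <string_view>

template <typename String>
class basic_line_reader {
public:
    basic_line_reader(const char* begin, const char* end)
        : pos_(begin), end_(end) {}
    // Fills in the next line (without the newline); returns false at end
    bool next(String& line)
    {
        if (pos_ == end_)
            return false;
        const char* eol = pos_;
        while (eol != end_ && *eol != '\n')
            ++eol;
        // Copies for std::string, but only refers for std::string_view
        line = String(pos_, static_cast<std::size_t>(eol - pos_));
        pos_ = (eol == end_) ? end_ : eol + 1;
        return true;
    }
private:
    const char* pos_;
    const char* end_;
};

using line_reader    = basic_line_reader<std::string>;
using line_reader_sv = basic_line_reader<std::string_view>;

int main()
{
    const char buf[] = "ACGT\nggcc";
    line_reader_sv reader(buf, buf + sizeof buf - 1);
    std::string_view line;
    while (reader.next(line))
        std::cout << line << '\n';
}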

While mmap_line_reader_sv was faster than mmap_line_reader, it was still slower than the mmap-based C code. So I made yet another abstraction: a ‘container’ that allows iteration over all of the file content. Since the mmap-related logic was already mostly separated, only minor modifications were needed to make the base class fully independent of the line-reading logic. After that, adding mmap_char_reader was easy: it is simply a normal container that does not need to mess with platform-specific logic.
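
Below is a rough, Unix-only sketch of that structure, again with hypothetical names (the real code in the nvwa repository is cross-platform and much more careful): a base class owns the mapping, and the character ‘container’ merely exposes begin() and end() over the mapped bytes.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <iostream>
#include <stdexcept>

// Base class: owns the mapping; nothing about lines or characters here
class mmap_base {
public:
    explicit mmap_base(const char* path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            throw std::runtime_error("cannot open file");
        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            throw std::runtime_error("cannot stat file");
        }
        size_ = static_cast<std::size_t>(st.st_size);
        void* ptr = mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  // the mapping survives the close
        if (ptr == MAP_FAILED)
            throw std::runtime_error("cannot mmap file");
        data_ = static_cast<const char*>(ptr);
    }
    ~mmap_base() { munmap(const_cast<char*>(data_), size_); }
    mmap_base(const mmap_base&) = delete;            // no accidental copies
    mmap_base& operator=(const mmap_base&) = delete;
    const char* data() const { return data_; }
    std::size_t size() const { return size_; }
private:
    const char* data_{};
    std::size_t size_{};
};

// The character 'container': no platform-specific logic left here
class mmap_char_reader : private mmap_base {
public:
    using mmap_base::mmap_base;
    const char* begin() const { return data(); }
    const char* end() const { return data() + size(); }
};

int main(int argc, char* argv[])
{
    if (argc != 2)
        return 1;
    int count = 0;
    for (char ch : mmap_char_reader(argv[1]))
        if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G')
            ++count;
    std::cout << count << std::endl;
}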

At this point, all seemed well—except one thing: it only worked on Unix. I did not have an immediate need to make it work on Windows, but I really wanted to show that the abstraction provided could work seamlessly regardless of the platform underneath. After several revisions, in which I dealt with proper Windows support,[3] proper 32- and 64-bit support, and my own mistakes, I finally made it work. You can check out the current code in the nvwa repository. With it, I can finally make the following simple code work on Linux, macOS, and Windows, with the same efficiency as raw C code when fully optimized:[4]

#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_byte_reader.h"

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;
        exit(1);
    }

    try {
        int count = 0;
        for (char ch : nvwa::mmap_char_reader(argv[1]))
            if (ch == 'c' || ch == 'C' ||
                ch == 'g' || ch == 'G')
                ++count;
        std::cout << count << std::endl;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}

Even though it is not used in the final code, I love the C++17 string_view. And I like the simplicity I finally achieved. Do you?

P.S. The string_view-based test code is posted here too as a reference. Line-based processing is common enough![5]

#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_line_reader.h"

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;
        exit(1);
    }

    try {
        int count = 0;
        for (const auto& line :
                nvwa::mmap_line_reader_sv(argv[1]))
            for (char ch : line)
                if (ch == 'c' || ch == 'C' ||
                    ch == 'g' || ch == 'G')
                    ++count;
        std::cout << count << std::endl;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}

Update (18 September 2017)

Thanks to Alex Maystrenko (see the comments below), it is now understood that the reason getc was slow is that there is an implicit lock around file operations. I did not expect that, as I grew up in an age when multi-threading was the exception, and I had not learnt about the existence of getc_unlocked until he mentioned it! According to the getc_unlocked page in the POSIX specification:

Some I/O functions are typically implemented as macros for performance reasons (for example, putc() and getc()). For safety, they need to be synchronized, but it is often too expensive to synchronize on every character. Nevertheless, it was felt that the safety concerns were more important; consequently, the getc(), getchar(), putc(), and putchar() functions are required to be thread-safe. However, unlocked versions are also provided with names that clearly indicate the unsafe nature of their operation but can be used to exploit their higher performance.

After replacing getc with getc_unlocked, the naïve implementation immediately outperforms the Perl code on Linux, though not on macOS.[6]
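
For code that must stay thread-safe, POSIX pairs getc_unlocked with flockfile/funlockfile, so the lock is taken once per file instead of once per character. A minimal sketch of that pattern (my example, not from the discussion):

#include <stdio.h>

int main(int argc, char* argv[])
{
    if (argc != 2)
        return 1;
    FILE* fp = fopen(argv[1], "rb");
    if (fp == NULL)
        return 1;
    int count = 0;
    int ch;
    flockfile(fp);              /* take the stream lock once */
    while ((ch = getc_unlocked(fp)) != EOF) {
        if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G')
            ++count;
    }
    funlockfile(fp);            /* release after the whole loop */
    fclose(fp);
    printf("%d\n", count);
    return 0;
}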

Another interesting thing to notice is that GCC generates highly optimized code for the comparison with ‘c’, ‘C’, ‘g’, and ‘G’,[7] something extremely unlikely for interpreted languages like Perl. Observe that the codes for the characters are:

  • 01000011₂ or 67₁₀ (‘C’)
  • 01000111₂ or 71₁₀ (‘G’)
  • 01100011₂ or 99₁₀ (‘c’)
  • 01100111₂ or 103₁₀ (‘g’)

GCC basically ANDs the input character with 11011011₂, and compares the result with 01000011₂. In Intel-style assembly:

        movzx   ecx, BYTE PTR [r12]   ; load the next byte, zero-extended
        and     ecx, -37              ; low byte of -37 is 11011011₂
        cmp     cl, 67                ; 67₁₀ is ‘C’

It is astoundingly short and efficient!
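
A brute-force check (my addition) confirms the equivalence: ANDing with 0xDB, i.e. 11011011₂, and comparing with ‘C’ matches exactly the same four characters as the original condition.

#include <iostream>

int main()
{
    // Compare the four explicit tests with the single mask-and-compare
    for (int ch = 0; ch < 128; ++ch) {
        bool naive = (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G');
        bool masked = ((ch & 0xDB) == 'C');   // 0xDB == 11011011 in binary
        if (naive != masked) {
            std::cout << "mismatch at " << ch << '\n';
            return 1;
        }
    }
    std::cout << "equivalent for all 7-bit characters\n";
}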


  1. I love my Mac, but I do feel Linux has great optimizations. 
  2. If you are not familiar with string_view, check out its reference. 
  3. Did I mention how non-orthogonal and uncomfortable the Win32 API is, when compared with POSIX? I am very happy to hide all the ugliness from the application code. 
  4. Comparison was only made on Unix (using the POSIX mmap API), under GCC with ‘-O3’ optimization (‘-O2’ was not enough). 
  5. ‘-std=c++17’ must be specified for GCC, and ‘/std:c++latest’ must be specified for Visual C++ 2017. 
  6. However, the performance improvement is more dramatic on macOS, from about 10 times slower to 5% slower. 
  7. Of course, this is only possible when the characters are given as compile-time constants. 

7 thoughts on “C/C++ Performance, mmap, and string_view”

    • No, you are wrong. getc/fgetc is not a system call. The C runtime reads a file by chunks, and these functions only return data from the buffer. Only when a buffer is exhausted will a system call be invoked to retrieve the next chunk. In addition, getc is normally a macro in the C world, so even the function call overhead is eliminated.

      In fact, people even tried reading the whole file with fread. It still performed worse than the Perl code.


  1. So the problem is in getc locks…

    $ du -h file
    489M    file
    
    $ cat abc.pl
    #!/usr/bin/perl
    open IN, "$ARGV[0]";
    print "FILE: $ARGV[0]\n";
    my $gc = 0;
    while (my $line = <IN>) {
      print "L\n";
      $gc += ($line =~ tr/cCgG//);
    }
    print "$gc\n";
    close IN
    $ cat abc.pl 
    #!/usr/bin/perl
    open IN, "$ARGV[0]";
    my $gc = 0;
    while (my $line = <IN>) {
      $gc += ($line =~ tr/cCgG//);
    }
    print "$gc\n";
    close IN
    $ time ./abc.pl file
    33024960
    
    real    0m0.638s
    user    0m0.570s
    sys     0m0.068s
    
    $ cat abc.cpp 
    #include <stdio.h>
    int main(int argc, char* argv[])
    {
        FILE* fp = fopen(argv[1], "r");
        int count = 0;
        int ch;
        while ((ch = getc(fp)) != EOF) {
            if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') {
                ++count;
            }
        }
        fclose(fp);
        printf("%d", count);
    }
    $ g++ -O2 abc.cpp -o abc && time ./abc file
    33024960
    real    0m3.257s
    user    0m3.181s
    sys     0m0.076s
    
    $ cat abc_unlocked.cpp
    #include <stdio.h>
    int main(int argc, char* argv[])
    {
        FILE* fp = fopen(argv[1], "r");
        int count = 0;
        int ch;
        while ((ch = getc_unlocked(fp)) != EOF) {
            if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') {
                ++count;
            }
        }
        fclose(fp);
        printf("%d", count);
    }
    $ g++ -O2 abc_unlocked.cpp -o abc_unlocked && time ./abc_unlocked file
    33024960
    real    0m0.566s
    user    0m0.486s
    sys     0m0.080s
    

    P.S. Sorry, I don’t know how to format text nicely in your blog comments.


    • Thank you, I think you have solved the problem of why getc was slow. I was unaware of getc_unlocked, which is a relatively new POSIX function; it is not in the standard C/C++ library.

      The comments use Markdown. To simplify things, I just applied the code block for all your code (three back ticks).


  2. size should be a size_t, not an int; there are lots of systems where “int size” would choke on files bigger than 2 GB (and an unsigned int would choke on 4 GB, and so on). For example, on Ubuntu 18.04 with GCC 8 on x86_64, even though it is a 64-bit system with a 64-bit compiler, int is just 4 bytes, while size_t is 8 bytes.

    in other news tinnitus is fking killing me


    • You are right, if we ever want to process big files. Also be aware that some people argue that ssize_t or ptrdiff_t may be better: an unsigned type can sometimes cause unexpected trouble, and make it harder for the compiler to optimize.

