A C++ discussion group I participated in got a challenge last week. Someone posted a short Perl program, and claimed that it was faster than the corresponding C version. It was to this effect (with slight modifications):
open IN, "$ARGV[0]"; my $gc = 0; while (my $line = ) { $gc += ($line =~ tr/cCgG//); } print "$gc\n"; close IN
The program simply searched and counted all occurrences of the letters ‘C’ and ‘G’ from the input file in a case-insensitive way. Since the posted C code was incomplete, I wrote a naïve implementation to test, which did turn out to be slower than the Perl code. It was about two times as slow on Linux, and about 10 times as slow on macOS.1
FILE* fp = fopen(argv[1], "rb"); int count = 0; int ch; while ((ch = getc(fp)) != EOF) { if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') { ++count; } } fclose(fp);
Of course, it actually shows how optimized the Perl implementation is, instead of how inferior the C language is. Before I had time to test my mmap_line_reader
, another person posted an mmap
-based solution, to the following effect (with modifications):
int fd = open(argv[1], O_RDONLY); struct stat st; fstat(fd, &st); int len = st.st_size; char ch; int count = 0; char* ptr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0); close(fd); char* begin = ptr; char* end = ptr + len; while (ptr < end) { ch = *ptr++; if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') ++count; } munmap(begin, len);
When I tested my mmap_line_reader
, I found that its performance was only on par with the Perl code, but slower than the handwritten mmap
-based code. It is not surprising, considering that mmap_line_reader
copies the line content, while the C code above searches directly in the mmap
’d buffer.
I then thought of the C++17 string_view
.2 It seemed a good chance of using it to return a line without copying its content. It was actually easy refactoring (on code duplicated from mmap_line_reader
), and most of the original code did not require any changes. I got faster code for this test, but the implementations of mmap_line_reader
and the new mmap_line_reader_sv
were nearly identical, except for a few small differences.
Naturally, the next step was to refactor again to unify the implementations. I made a common base to store the bottommost mmap
-related logic, and made the difference between string
and string_view
disappear with a class template. Now mmap_line_reader
and mmap_line_reader_sv
were just two aliases of instantiations of basic_mmap_line_reader
!
While mmap_line_reader_sv
was faster than mmap_line_reader
, it was still slower than the mmap
-based C code. So I made another abstraction, a ‘container’ that allowed iteration over all of the file content. Since the mmap
-related logic was already mostly separated, only some minor modifications were needed to make that base class fully independent of the line reading logic. After that, adding mmap_char_reader
was easy, which was simply a normal container that did not need to mess with platform-specific logic.
At this point, all seemed well—except one thing: it only worked on Unix. I did not have an immediate need to make it work on Windows, but I really wanted to show that the abstraction provided could work seamlessly regardless of the platform underneath. After several revisions, in which I dealt with proper Windows support,3 proper 32- and 64-bit support, and my own mistakes, I finally made it work. You can check out the current code in the nvwa repository. With it, I can finally make the following simple code work on Linux, macOS, and Windows, with the same efficiency as raw C code when fully optimized:4
#include <iostream> #include <stdlib.h> #include "nvwa/mmap_byte_reader.h" int main(int argc, char* argv[]) { if (argc != 2) { std::cerr << "A file name is needed" << std::endl; exit(1); } try { int count = 0; for (char ch : nvwa::mmap_char_reader(argv[1])) if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') ++count; std::cout << count << std::endl; } catch (std::exception& e) { std::cerr << e.what() << std::endl; } }
Even though it is not used in the final code, I love the C++17 string_view
. And I like the simplicity I finally achieved. Do you?
P.S. The string_view
-based test code is posted here too as a reference. Line-based processing is common enough!5
#include <iostream> #include <stdlib.h> #include "nvwa/mmap_line_reader.h" int main(int argc, char* argv[]) { if (argc != 2) { std::cerr << "A file name is needed" << std::endl; exit(1); } try { int count = 0; for (const auto& line : nvwa::mmap_line_reader_sv(argv[1])) for (char ch : line) if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') ++count; std::cout << count << std::endl; } catch (std::exception& e) { std::cerr << e.what() << std::endl; } }
Update (18 September 2017)
Thanks to Alex Maystrenko (see the comments below), It is now understood that the reason why getc
was slow was because there was an implicit lock around file operations. I did not expect it, as I grew from an age when multi-threading was the exception, and I had not learnt about the existence of getc_unlocked
until he mentioned it! According to the getc_unlocked page in the POSIX specification:
Some I/O functions are typically implemented as macros for performance reasons (for example,
putc()
andgetc()
). For safety, they need to be synchronized, but it is often too expensive to synchronize on every character. Nevertheless, it was felt that the safety concerns were more important; consequently, thegetc()
,getchar()
,putc()
, andputchar()
functions are required to be thread-safe. However, unlocked versions are also provided with names that clearly indicate the unsafe nature of their operation but can be used to exploit their higher performance.
After replacing getc
with getc_unlocked
, the naïve implementation immediately outperforms the Perl code on Linux, though not on macOS.6
Another interesting thing to notice is that GCC provides vastly optimized code for the comparison with ‘c’, ‘C’, ‘g’, and ‘G’,7 which is extremely unlikely for interpreted languages like Perl. Observing the codes for the characters are:
01000011
2 or67
10 (‘C’)01000111
2 or71
10 (‘G’)01100011
2 or99
10 (‘c’)01100111
2 or103
10 (‘g’)
GCC basically ANDs the input character with 11011011
2, and compares the result with 01000011
2. In Intel-style assembly:
movzx ecx, BYTE PTR [r12] and ecx, -37 cmp cl, 67
It is astoundingly short and efficient!
- I love my Mac, but I do feel Linux has great optimizations. ↩
-
If you are not familiar with
string_view
, check out its reference. ↩ - Did I mention how unorthogonal and uncomfortable the Win32 API is, when compared with POSIX? I am very happy to hide all the ugliness from the application code. ↩
-
Comparison was only made on Unix (using the POSIX
mmap
API), under GCC with ‘-O3
’ optimization (‘-O2
’ was not enough). ↩ -
‘
-std=c++17
’ must be specified for GCC, and ‘/std:c++latest
’ must be specified for Visual C++ 2017. ↩ - However, the performance improvement is more dramatic on macOS, from about 10 times slower to 5% slower. ↩
- Of course, this is only possible when the characters are given as compile-time constants. ↩