A C++ discussion group I participated in got a challenge last week. Someone posted a short Perl program, and claimed that it was faster than the corresponding C version. It was to this effect (with slight modifications):
```perl
open IN, "$ARGV[0]";
my $gc = 0;
while (my $line = <IN>) {
    $gc += ($line =~ tr/cCgG//);
}
print "$gc\n";
close IN;
```
The program simply searched for and counted all occurrences of the letters ‘C’ and ‘G’ in the input file, in a case-insensitive way. Since the posted C code was incomplete, I wrote a naïve implementation to test, which did turn out to be slower than the Perl code: about two times as slow on Linux, and about 10 times as slow on macOS.[1]
```c
FILE* fp = fopen(argv[1], "rb");
int count = 0;
int ch;
while ((ch = getc(fp)) != EOF) {
    if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') {
        ++count;
    }
}
fclose(fp);
```
Of course, this actually shows how well optimized the Perl implementation is, rather than how inferior the C language is. Before I had time to test my `mmap_line_reader`, another person posted an `mmap`-based solution, to the following effect (with modifications):
```c
int fd = open(argv[1], O_RDONLY);
struct stat st;
fstat(fd, &st);
int len = st.st_size;
char ch;
int count = 0;
char* ptr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
char* begin = ptr;
char* end = ptr + len;
while (ptr < end) {
    ch = *ptr++;
    if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G')
        ++count;
}
munmap(begin, len);
```
When I tested my `mmap_line_reader`, I found that its performance was only on par with the Perl code, and slower than the handwritten `mmap`-based code. This is not surprising, considering that `mmap_line_reader` copies the line content, while the C code above searches directly in the `mmap`’d buffer.
I then thought of the C++17 `string_view`.[2] It seemed a good opportunity to use it to return a line without copying its content. The refactoring (on code duplicated from `mmap_line_reader`) was actually easy, and most of the original code did not require any changes. I got faster code for this test, but the implementations of `mmap_line_reader` and the new `mmap_line_reader_sv` were nearly identical, except for a few small differences.
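As a quick illustration of the difference (my own sketch, not the nvwa code): a `std::string` line owns a copy of the bytes, while a `std::string_view` line merely records a pointer and a length into the mapped buffer.

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-in for the mmap'd file content.
inline const char* mapped = "ACGT\nGGCC\n";

// Returning std::string copies the line's bytes into owned storage...
inline std::string line_as_string()
{
    return std::string(mapped, 4);
}

// ...while std::string_view only refers to the bytes in place.
inline std::string_view line_as_view()
{
    return std::string_view(mapped, 4);
}
```

Both results compare equal to `"ACGT"`, but only the view still points into the original buffer, which is exactly what makes the non-copying reader faster.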
Naturally, the next step was to refactor again to unify the implementations. I made a common base to hold the bottommost `mmap`-related logic, and made the difference between `string` and `string_view` disappear with a class template. Now `mmap_line_reader` and `mmap_line_reader_sv` were just two aliases of instantiations of `basic_mmap_line_reader`!
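The idea can be sketched roughly as follows. This is a simplified illustration with hypothetical names, not the actual nvwa implementation (which exposes iterators and keeps the `mmap` logic in a base class):

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// One template over the line type: instantiating with std::string gives
// the copying behaviour, std::string_view the non-copying behaviour.
template <typename String>
class basic_line_reader {
public:
    basic_line_reader(const char* buf, std::size_t len)
        : buf_(buf), len_(len) {}

    // Simplified stand-in for the real iterator interface: return the
    // first line as a String.
    String first_line() const
    {
        std::size_t i = 0;
        while (i < len_ && buf_[i] != '\n')
            ++i;
        return String(buf_, i);
    }

private:
    const char* buf_;
    std::size_t len_;
};

// The two readers become mere aliases of instantiations.
using line_reader    = basic_line_reader<std::string>;
using line_reader_sv = basic_line_reader<std::string_view>;
```

All the shared logic lives in the template; only the construction of the returned `String` differs between the two instantiations.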
While `mmap_line_reader_sv` was faster than `mmap_line_reader`, it was still slower than the `mmap`-based C code. So I made another abstraction: a ‘container’ that allows iteration over all of the file content. Since the `mmap`-related logic was already mostly separated, only some minor modifications were needed to make that base class fully independent of the line-reading logic. After that, adding `mmap_char_reader` was easy; it is simply a normal container that does not need to mess with platform-specific logic.
At this point, all seemed well—except one thing: it only worked on Unix. I did not have an immediate need to make it work on Windows, but I really wanted to show that the abstraction could work seamlessly regardless of the platform underneath. After several revisions, in which I dealt with proper Windows support,[3] proper 32- and 64-bit support, and my own mistakes, I finally made it work. You can check out the current code in the nvwa repository. With it, I can finally make the following simple code work on Linux, macOS, and Windows, with the same efficiency as raw C code when fully optimized:[4]
```cpp
#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_byte_reader.h"

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;
        exit(1);
    }
    try {
        int count = 0;
        for (char ch : nvwa::mmap_char_reader(argv[1]))
            if (ch == 'c' || ch == 'C' ||
                ch == 'g' || ch == 'G')
                ++count;
        std::cout << count << std::endl;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}
```
Even though it is not used in the final code, I love the C++17 `string_view`. And I like the simplicity I finally achieved. Do you?
P.S. The `string_view`-based test code is posted here too as a reference. Line-based processing is common enough![5]
```cpp
#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_line_reader.h"

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;
        exit(1);
    }
    try {
        int count = 0;
        for (const auto& line : nvwa::mmap_line_reader_sv(argv[1]))
            for (char ch : line)
                if (ch == 'c' || ch == 'C' ||
                    ch == 'g' || ch == 'G')
                    ++count;
        std::cout << count << std::endl;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}
```
Update (18 September 2017)
Thanks to Alex Maystrenko (see the comments below), it is now understood that the reason `getc` was slow is an implicit lock around file operations. I did not expect it, as I grew up in an age when multi-threading was the exception, and I had not learnt of the existence of `getc_unlocked` until he mentioned it! According to the `getc_unlocked` page in the POSIX specification:
> Some I/O functions are typically implemented as macros for performance reasons (for example, `putc()` and `getc()`). For safety, they need to be synchronized, but it is often too expensive to synchronize on every character. Nevertheless, it was felt that the safety concerns were more important; consequently, the `getc()`, `getchar()`, `putc()`, and `putchar()` functions are required to be thread-safe. However, unlocked versions are also provided with names that clearly indicate the unsafe nature of their operation but can be used to exploit their higher performance.
After replacing `getc` with `getc_unlocked`, the naïve implementation immediately outperforms the Perl code on Linux, though not on macOS.[6]
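For reference, the changed loop looks like this (a sketch of the naïve counter with the lock-free call; `count_gc` is my own helper name, not from the original post):

```cpp
#include <stdio.h>

// Same loop as the naïve version, but getc_unlocked (POSIX) skips the
// per-character stream lock.  This is safe here because only one thread
// ever touches the stream.
inline int count_gc(FILE* fp)
{
    int count = 0;
    int ch;
    while ((ch = getc_unlocked(fp)) != EOF) {
        if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G')
            ++count;
    }
    return count;
}
```

The call is a drop-in replacement for `getc` in single-threaded code; `flockfile`/`funlockfile` would be needed around it in multi-threaded code.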
Another interesting thing to notice is that GCC generates vastly optimized code for the comparison with ‘c’, ‘C’, ‘g’, and ‘G’,[7] something extremely unlikely for interpreted languages like Perl. Observe the codes of the characters:
- ‘C’: 01000011₂ (67₁₀)
- ‘G’: 01000111₂ (71₁₀)
- ‘c’: 01100011₂ (99₁₀)
- ‘g’: 01100111₂ (103₁₀)
GCC basically ANDs the input character with 11011011₂, and compares the result with 01000011₂. In Intel-style assembly:
```asm
movzx ecx, BYTE PTR [r12]
and   ecx, -37
cmp   cl, 67
```
It is astoundingly short and efficient!
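The same transformation can be written by hand and verified exhaustively (`is_cg` is my own name for this sketch):

```cpp
// Clear bit 5 (0x20, the case bit) and bit 2 (0x04, the only bit in
// which 'C' and 'G' differ), then compare with 'C'.  The mask
// ~0x24 == 0xDB is 11011011 in binary, i.e. -37 as a signed byte.
inline bool is_cg(unsigned char ch)
{
    return (ch & 0xDB) == 'C';
}
```

Checking all 256 byte values against the four explicit comparisons confirms that the masked test accepts exactly the same characters.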
1. I love my Mac, but I do feel Linux has great optimizations. ↩
2. If you are not familiar with `string_view`, check out its reference. ↩
3. Did I mention how unorthogonal and uncomfortable the Win32 API is, when compared with POSIX? I am very happy to hide all the ugliness from the application code. ↩
4. Comparison was only made on Unix (using the POSIX `mmap` API), under GCC with `-O3` optimization (`-O2` was not enough). ↩
5. `-std=c++17` must be specified for GCC, and `/std:c++latest` must be specified for Visual C++ 2017. ↩
6. However, the performance improvement is more dramatic on macOS, from about 10 times slower to 5% slower. ↩
7. Of course, this is only possible when the characters are given as compile-time constants. ↩
Hi,
You should read the file in chunks, not by single characters. Each getc() causes a kernel-space call, which is way too expensive.
No, you are wrong. `getc`/`fgetc` is not a system call. The C runtime reads a file in chunks, and these functions only return data from the buffer. Only when a buffer is exhausted will a system call be invoked to retrieve the next chunk. In addition, `getc` is normally a macro in the C world, so even the function call overhead is eliminated.

In fact, people even tried reading the whole file with `fread`. It still performed worse than the Perl code.
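For reference, a chunked `fread` version of the counting loop might look like this (my own sketch, not the exact code that was tested):

```cpp
#include <stdio.h>

// Read the file in large chunks so that per-character call overhead
// (and stream locking) disappears, then scan each chunk in memory.
inline int count_gc_chunked(FILE* fp)
{
    char buf[65536];
    size_t n;
    int count = 0;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; ++i) {
            char ch = buf[i];
            if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G')
                ++count;
        }
    }
    return count;
}
```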
OK, would it make sense to adjust this buffering?
Did you also try a plain read()?
Do you think that Perl does mmap internally?
So the problem is in getc locks…
P.S. Sorry, I don’t know how to format text nicely in your blog comments.
Thank you, I think you solved the problem of why `getc` was slow. I was unaware of `getc_unlocked`, which is a relatively new POSIX function. It is not in the standard C/C++ library.

The comments use Markdown. To simplify things, I just applied the code block for all your code (three back ticks).
size should be a size_t, not an int; there are lots of systems where “int size” would choke on files bigger than 2 GB (and unsigned int would choke on 4 GB, and so on). For example, on Ubuntu 18.04 with GCC 8 on x86_64, even though it is a 64-bit system with a 64-bit compiler, “int” is just 4 bytes, while size_t is 8 bytes.
in other news tinnitus is fking killing me
You are right, if we ever want to process big files. Also be aware that some people argue `ssize_t` or `ptrdiff_t` may be better: an unsigned type can sometimes cause unexpected trouble, and can make it more difficult for the compiler to optimize.