Another Microsoft Unicode I/O Problem

I encountered an annoying bug in Visual C++ 2017 recently. It started when I found my favourite logging library, Easylogging++, outputting Chinese as garbage characters on Windows. Checking the documentation carefully, I noted that I should have used the macro START_EASYLOGGINGPP. It turned out to be worse: all output starting from the Chinese character was gone. Puzzled but busy, I put it down and worked on something else.

I spent another hour of detective work on this issue today. The result was quite surprising.

  • First, it is not an issue with Easylogging++. The problem occurs even when I use only std::wcout.
  • Second, the magical thing about START_EASYLOGGINGPP is that it invokes std::locale::global(std::locale("")). This is the switch that leads to the different behaviour (see the sketch after this list).
  • Mysteriously, with the correct locale setting, I can get the correct result with both std::wcout and Easylogging++ in a test program. I was not able to get it working in my real project.
  • Finally, it turns out that the difference above is caused by /MT vs. /MD! The former (default if neither is specified on the command line) tells the Visual C++ compiler to use the static multi-threading library, and the latter (set by default in Visual Studio projects) tells the compiler to use the dynamic multi-threading library.
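
For reference, here is a minimal sketch of the kind of test that reproduces the behaviour (the wide-string literal is only illustrative; the key line is the global locale call that START_EASYLOGGINGPP performs):

#include <iostream>
#include <locale>

int main()
{
    // What START_EASYLOGGINGPP does behind the scenes: without this
    // call the Chinese characters come out as garbage (or the output
    // simply stops); with it, the result still depends on /MT vs. /MD
    // as described above.
    std::locale::global(std::locale(""));
    std::wcout << L"Test: \u4E2D\u6587" << std::endl;
}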

People may remember that I wrote about the MSVCRT.DLL Console I/O Bug. While Visual C++ 2013 shows consistent behaviour between /MT and /MD, Visual C++ 2015 and 2017 exhibit the same woeful bug when /MD is specified on the command line. This is perplexingly weird: it seems someone at Microsoft messed up the MSVCRT.DLL shipped with Windows first (around 2006), and the problem then spread to the Visual C++ runtime DLL nearly a decade later!

I am using many modern C++ features, so I definitely do not want to go back to Visual C++ 2013 for new projects. It seems I have to tolerate garbage characters in the log for now. Meanwhile, I submitted a bug to Microsoft. Given that I have a bug report that has been deferred for four years, I am not very hopeful. But let us wait and see.

Update (20 December 2017)

A few more tests show that the debug versions (/MTd and /MDd) both work well. So only the default release build (using the dynamic C runtime) exhibits this problem, where the executable depends on DLLs like api-ms-win-crt-stdio-l1-1-0.dll. It seems this issue is related to the Universal C Runtime introduced in Visual Studio 2015 and Windows 10…

Update (25 March 2018)

The bug was closed, and a Microsoft developer indicated that the issue had already been fixed since the Windows 10 Anniversary Update SDK (build 10.0.14393). Actually, I had had build 10.0.15063 installed. The reason why I still saw the problem was that the Universal C Runtime on Windows 7 had not been updated (‘the issue will be fixed in a future update to the Universal C Runtime on Windows 7’); I should not have seen the problem on a Windows 10 box. The current workaround is either to use static linking (as I did), or to copy the redistributable DLLs under C:\Program Files (x86)\Windows Kits\10\Redist\ucrt\DLLs\x86 (or x64, etc.) to the app directory (the so-called ‘app-local deployment’, which should not be used on Windows 10, as the system version is always preferred). My test showed that copying ucrtbase.dll alone was enough to fix my test case.

C/C++ Performance, mmap, and string_view

A C++ discussion group I participated in got a challenge last week. Someone posted a short Perl program, and claimed that it was faster than the corresponding C version. It was to this effect (with slight modifications):

open IN, "$ARGV[0]";
my $gc = 0;
while (my $line = <IN>) {
  $gc += ($line =~ tr/cCgG//);
}
print "$gc\n";
close IN;

The program simply searched and counted all occurrences of the letters ‘C’ and ‘G’ from the input file in a case-insensitive way. Since the posted C code was incomplete, I wrote a naïve implementation to test, which did turn out to be slower than the Perl code. It was about two times as slow on Linux, and about 10 times as slow on macOS.1

FILE* fp = fopen(argv[1], "rb");
int count = 0;
int ch;
while ((ch = getc(fp)) != EOF) {
    if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') {
        ++count;
    }
}
fclose(fp);

Of course, it actually shows how optimized the Perl implementation is, rather than how inferior the C language is. Before I had time to test my mmap_line_reader, another person posted an mmap-based solution, to the following effect (with modifications):

int fd = open(argv[1], O_RDONLY);
struct stat st;
fstat(fd, &st);
int len = st.st_size;
char ch;
int count = 0;
char* ptr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
char* begin = ptr;
char* end = ptr + len;
while (ptr < end) {
    ch = *ptr++;
    if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G')
        ++count;
}
munmap(begin, len);

When I tested my mmap_line_reader, I found that its performance was only on par with the Perl code, but slower than the handwritten mmap-based code. It is not surprising, considering that mmap_line_reader copies the line content, while the C code above searches directly in the mmap’d buffer.

I then thought of the C++17 string_view.2 It seemed a good chance to use it to return a line without copying its content. The refactoring (on code duplicated from mmap_line_reader) was actually easy, and most of the original code did not require any changes. I got faster code for this test, but the implementations of mmap_line_reader and the new mmap_line_reader_sv were nearly identical, except for a few small differences.

Naturally, the next step was to refactor again to unify the implementations. I made a common base to store the bottommost mmap-related logic, and made the difference between string and string_view disappear with a class template. Now mmap_line_reader and mmap_line_reader_sv were just two aliases of instantiations of basic_mmap_line_reader!
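
To illustrate the idea, here is a toy sketch (not the actual nvwa code: it reads lines from an in-memory buffer instead of an mmap’d file, and the names are made up), in which the line type is a template parameter and the two readers are mere aliases:

#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

template <typename Line>                 // std::string or std::string_view
class basic_buffer_line_reader {         // hypothetical name for illustration
public:
    basic_buffer_line_reader(const char* data, std::size_t size)
        : begin_(data), end_(data + size) {}

    class iterator {
    public:
        iterator(const char* pos, const char* end) : pos_(pos), end_(end) {}
        Line operator*() const
        {   // Line(ptr, len): copies for std::string, references for string_view
            return Line(pos_, next() - pos_);
        }
        iterator& operator++()
        {
            const char* p = next();
            pos_ = (p == end_) ? end_ : p + 1;  // skip the '\n'
            return *this;
        }
        bool operator!=(const iterator& rhs) const { return pos_ != rhs.pos_; }
    private:
        const char* next() const
        {   // position of the next '\n', or the end of the buffer
            const void* p = std::memchr(pos_, '\n', end_ - pos_);
            return p ? static_cast<const char*>(p) : end_;
        }
        const char* pos_;
        const char* end_;
    };

    iterator begin() const { return iterator(begin_, end_); }
    iterator end() const { return iterator(end_, end_); }

private:
    const char* begin_;
    const char* end_;
};

using buffer_line_reader    = basic_buffer_line_reader<std::string>;
using buffer_line_reader_sv = basic_buffer_line_reader<std::string_view>;

The only place where the two instantiations differ is the Line(pointer, length) construction: std::string copies the bytes, while std::string_view merely references them.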

While mmap_line_reader_sv was faster than mmap_line_reader, it was still slower than the mmap-based C code. So I made another abstraction, a ‘container’ that allowed iteration over all of the file content. Since the mmap-related logic was already mostly separated, only some minor modifications were needed to make that base class fully independent of the line-reading logic. After that, it was easy to add mmap_char_reader, which is simply a normal container that does not need to mess with platform-specific logic.

At this point, all seemed well—except one thing: it only worked on Unix. I did not have an immediate need to make it work on Windows, but I really wanted to show that the abstraction provided could work seamlessly regardless of the platform underneath. After several revisions, in which I dealt with proper Windows support,3 proper 32- and 64-bit support, and my own mistakes, I finally made it work. You can check out the current code in the nvwa repository. With it, I can finally make the following simple code work on Linux, macOS, and Windows, with the same efficiency as raw C code when fully optimized:4

#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_byte_reader.h"

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;
        exit(1);
    }

    try {
        int count = 0;
        for (char ch : nvwa::mmap_char_reader(argv[1]))
            if (ch == 'c' || ch == 'C' ||
                ch == 'g' || ch == 'G')
                ++count;
        std::cout << count << std::endl;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}

Even though it is not used in the final code, I love the C++17 string_view. And I like the simplicity I finally achieved. Do you?

P.S. The string_view-based test code is posted here too as a reference. Line-based processing is common enough!5

#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_line_reader.h"

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;
        exit(1);
    }

    try {
        int count = 0;
        for (const auto& line :
                nvwa::mmap_line_reader_sv(argv[1]))
            for (char ch : line)
                if (ch == 'c' || ch == 'C' ||
                    ch == 'g' || ch == 'G')
                    ++count;
        std::cout << count << std::endl;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}

Update (18 September 2017)

Thanks to Alex Maystrenko (see the comments below), it is now understood that getc was slow because there was an implicit lock around file operations. I did not expect it, as I came from an age when multi-threading was the exception, and I had not learnt about the existence of getc_unlocked until he mentioned it! According to the getc_unlocked page in the POSIX specification:

Some I/O functions are typically implemented as macros for performance reasons (for example, putc() and getc()). For safety, they need to be synchronized, but it is often too expensive to synchronize on every character. Nevertheless, it was felt that the safety concerns were more important; consequently, the getc(), getchar(), putc(), and putchar() functions are required to be thread-safe. However, unlocked versions are also provided with names that clearly indicate the unsafe nature of their operation but can be used to exploit their higher performance.

After replacing getc with getc_unlocked, the naïve implementation immediately outperforms the Perl code on Linux, though not on macOS.6
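
For clarity, the change to the naïve loop shown earlier is just that one call (getc_unlocked is POSIX, not ISO C, so this fragment is Unix-only):

FILE* fp = fopen(argv[1], "rb");
int count = 0;
int ch;
while ((ch = getc_unlocked(fp)) != EOF) {  // no per-character locking
    if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') {
        ++count;
    }
}
fclose(fp);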

Another interesting thing to notice is that GCC generates highly optimized code for the comparison with ‘c’, ‘C’, ‘g’, and ‘G’,7 which is extremely unlikely for interpreted languages like Perl. Observe that the codes for the characters are:

  • 01000011₂ or 67₁₀ (‘C’)
  • 01000111₂ or 71₁₀ (‘G’)
  • 01100011₂ or 99₁₀ (‘c’)
  • 01100111₂ or 103₁₀ (‘g’)

GCC basically ANDs the input character with 11011011₂, and compares the result with 01000011₂. In Intel-style assembly:

        movzx   ecx, BYTE PTR [r12]
        and     ecx, -37
        cmp     cl, 67

It is astoundingly short and efficient!
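
The same trick can of course be written by hand, although GCC derives it automatically from the four plain comparisons. A sketch (the magic constant works only for this particular set of characters):

// 0xDB is 11011011 in binary: it clears the case bit (0x20) and the bit
// that distinguishes ‘C’ from ‘G’ (0x04).  Exactly the four characters
// ‘c’, ‘C’, ‘g’, and ‘G’ are mapped to ‘C’ (0x43) by this mask.
if ((ch & 0xDB) == 'C') {
    ++count;
}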


  1. I love my Mac, but I do feel Linux has great optimizations. 
  2. If you are not familiar with string_view, check out its reference.
  3. Did I mention how unorthogonal and uncomfortable the Win32 API is, when compared with POSIX? I am very happy to hide all the ugliness from the application code. 
  4. Comparison was only made on Unix (using the POSIX mmap API), under GCC with ‘-O3’ optimization (‘-O2’ was not enough). 
  5. ‘-std=c++17’ must be specified for GCC, and ‘/std:c++latest’ must be specified for Visual C++ 2017.
  6. However, the performance improvement is more dramatic on macOS, from about 10 times slower to 5% slower. 
  7. Of course, this is only possible when the characters are given as compile-time constants.