MSVCRT.DLL Console I/O Bug

I have been quite annoyed by a Windows bug that causes a huge number of open-source command-line tools to choke on multi-byte characters at the Windows Command Prompt. The MSVCRT.DLL shipped with Windows Vista and later has serious trouble with such characters. Microsoft's own tools and compilers after Visual Studio 6.0 no longer use this DLL, but the GNU tools on Windows, usually built with MinGW or Mingw-w64, depend on it and suffer from the problem. One cannot even use ls to display a Chinese file name when the system locale is set to Chinese.

The following simple code snippet demonstrates the problem:

#include <locale.h>
#include <stdio.h>

char msg[] = "\xd7\xd6\xb7\xfb Char";
wchar_t wmsg[] = L"字符 char";

void Test1()
{
    char* ptr = msg;
    printf("Test 1: ");
    while (*ptr) {
        putchar(*ptr++);
    }
    putchar('\n');
}

void Test2()
{
    printf("Test 2: ");
    puts(msg);
}

void Test3()
{
    wchar_t* ptr = wmsg;
    printf("Test 3: ");
    while (*ptr) {
        putwchar(*ptr++);
    }
    putwchar(L'\n');
}

int main()
{
    puts("Default C locale");
    Test1();
    Test2();
    Test3();
    putchar('\n');
    puts("Chinese locale");
    setlocale(LC_CTYPE, "Chinese_China.936");
    Test1();
    Test2();
    Test3();
    putchar('\n');
    puts("English locale");
    setlocale(LC_CTYPE, "English_United States.1252");
    Test1();
    Test2();
    Test3();
}
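
Assuming the source is saved as testio.c (a file name I just picked for this post), the two binaries compared below can be built with something like:

cl testio.c
gcc testio.c -o testio-gcc.exe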

When built with a modern version of Visual Studio, it gives the expected output (console code page is 936):

Default C locale
Test 1: 字符 Char
Test 2: 字符 Char
Test 3:  char

Chinese locale
Test 1: 字符 Char
Test 2: 字符 Char
Test 3: 字符 char

English locale
Test 1: ×?·? Char
Test 2: ×?·? Char
Test 3:  char

That is, when the locale is the default ‘C’ locale, the ‘ANSI’ character output routines successfully output both single-byte and multi-byte characters, while putwchar, the ‘Unicode’ version of putchar, fails on the Chinese characters (reasonably, as the C locale does not know how to convert them). When the locale is correctly set to code page 936 (Simplified Chinese), everything is correct. When the locale is set to code page 1252 (Latin), the ‘ANSI’ routines show the code page 1252 characters with the same byte values as the original Chinese text (‘×Ö·û’ instead of ‘字符’), though ‘Ö’ (\xd6) and ‘û’ (\xfb) come out as ‘?’ because they do not exist in the console’s code page 936. The Chinese characters, of course, cannot be shown with putwchar in this locale, just as in the C locale.

When built with GCC, the result is woeful:

Default C locale
Test 1: 字符 Char
Test 2: 字符 Char
Test 3:  char

Chinese locale
Test 1:  Char
Test 2: 字符 Char
Test 3:  char

English locale
Test 1: ×?·? Char
Test 2: ×?·? Char
Test 3:  char

Two things are worth noticing:

  • putchar stops working for Chinese when the locale is correctly set.
  • putwchar never works for Chinese.

Horrible and thoroughly broken! (Keep in mind that Microsoft is to blame here. You can compile the program with MSVC 6.0 using the /MD option, and the result will be the same—an executable that works in Windows XP but not in Windows Vista or later.)

I attacked this problem a few years ago, and tried some workarounds. The solution I came up with looked so fragile that I did not push it up to the MinGW library. It was a personal failure, as well as an indication that working around a buggy implementation without affecting the application code can be very difficult or just impossible.


The problem occurs only with the console, where the Microsoft runtime does some translation (broken in MSVCRT.DLL, but OK in newer MSVC runtimes); it vanishes when the output is redirected away from the console. So one solution is not to use the Command Prompt at all. The Cygwin Terminal may be a good choice, especially for people familiar with Linux/Unix. I have Cygwin installed, but sometimes I still want to do things in the more Windows-y way. I figured I could make a small tool (like cat) that reads everything from stdin and forwards it to stdout; as long as this tool is compiled with a Microsoft compiler, its console output should be fine. Then I thought a script would be even quicker to make. Finally, I came up with putting the following line into an mbf.bat:

@perl -p -e ""

(Perl is still wonderful for text processing, even in this ‘empty’ program!)
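
If you would rather have a compiled forwarder than depend on Perl, a minimal sketch of such a cat-like tool could look like the following. The file name mbf.c is my own invention; the only requirement is that it be built with a Microsoft compiler (e.g. cl /O2 mbf.c), so that a non-broken CRT talks to the console:

/* mbf.c: copy stdin to stdout byte by byte; the MSVC-built
 * CRT then performs the console translation correctly. */
#include <stdio.h>

int main(void)
{
    int ch;
    while ((ch = getchar()) != EOF)
        putchar(ch);
    return 0;
}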

Now the executables built by GCC and by MSVC give the same result if we append ‘| mbf’ on the command line (e.g. testio | mbf):

Default C locale
Test 1: 字符 Char
Test 2: 字符 Char
Test 3:  char

Chinese locale
Test 1: 字符 Char
Test 2: 字符 Char
Test 3: 字符 char

English locale
Test 1: 字符 Char
Test 2: 字符 Char
Test 3:  char

If you know how to make Microsoft fix the DLL problem, do it. Otherwise you know at least a workaround now. 🙂


The following code is my original partial solution to the problem, and it may be helpful to your GCC-based project. The idea is to hold back a DBCS lead byte and emit it together with the following trail byte in a single call, which apparently keeps the broken console translation from ever seeing a lone lead byte. I claim no copyright on it, nor will I take any responsibility for its use.

/* mingw_mbcs_safe_io.c */

#include <mbctype.h>
#include <stdio.h>

/* Output functions that work with the Windows Vista+
 * MSVCRT.DLL for multi-byte characters on the console.
 * Please notice that buffering must not be enabled for the
 * console (e.g. by calling setvbuf); otherwise weird things
 * may occur. */

int __cdecl _mgw_flsbuf(int ch, FILE* fp)
{
  static __thread char lead = '\0';  /* pending DBCS lead byte */
  int ret = 1;

  if (lead != '\0')
    {
      /* Emit the stashed lead byte and this trail byte together */
      ret = fprintf(fp, "%c%c", lead, ch);
      lead = '\0';
      if (ret < 0)
        return EOF;
    }
  else if (_ismbblead(ch))
    lead = ch;                 /* hold the lead byte back for now */
  else
    return _flsbuf(ch, fp);    /* single-byte: use the real _flsbuf */

  return ch;
}

int __cdecl putc(int ch, FILE* fp)
{
  static __thread char lead = '\0';  /* pending DBCS lead byte */
  int ret = 1;

  if (lead != '\0')
    {
      /* Emit the stashed lead byte and this trail byte together */
      ret = fprintf(fp, "%c%c", lead, ch);
      lead = '\0';
    }
  else if (_ismbblead(ch))
    lead = ch;                 /* hold the lead byte back for now */
  else
    ret = fprintf(fp, "%c", ch);

  if (ret < 0)
    return EOF;
  else
    return ch;
}

int __cdecl putchar(int ch)
{
  return putc(ch, stdout);
}

int __cdecl _mgwrt_putchar(int ch)
{
  return putc(ch, stdout);
}
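
A minimal sketch of the intended use, assuming the usual GCC/MinGW link behaviour where a symbol defined in your own object files takes precedence over the one in the CRT import library (the file name demo.c is, of course, made up):

/* demo.c: build together with the file above, e.g.
 *   gcc demo.c mingw_mbcs_safe_io.c -o demo
 * so that the putchar above pairs DBCS lead and trail bytes. */
#include <stdio.h>

int main(void)
{
    const char* p = "\xd7\xd6\xb7\xfb\n";  /* GBK bytes of "字符" */
    while (*p)
        putchar(*p++);
    return 0;
}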

A Complaint about ODF’s Asian Language Support

I recently read some news about better support for ODF from Google. The author went on to complain that neither Google nor Microsoft makes ‘it easy to use ODF as part of a workflow’. This reminds me that maybe I should write down a long-time complaint of mine about ODF.

I have always loved open standards. However, standards are not just open or proprietary; there are also good and bad ones. And ODF looks pretty bad regarding Asian language support, as the following image demonstrates powerfully:

[Image: ODF Issue]

If you are interested in it, you can download the document yourself. It simply contains four lines:

  • The first line has a left quotation mark, the English word ‘Test’, and the corresponding Chinese word. It looks OK.
  • The second line is a duplication of the first line, with an additional colon added at the beginning. It immediately changes the font of the left quotation mark.
  • The third line is a duplication of the second line, with the Chinese word removed. Both quotation marks are now using the default Western font ‘Times New Roman’.
  • The fourth line is a duplication of the third line, with the leading colon removed. Weirdly enough, the left quotation mark now uses the Chinese font. (This may be related to my using the Chinese OpenOffice version or Chinese Windows OS.)

Isn’t it ridiculous that adding or removing a single character can change how other characters are rendered? Still, I would not blog about it if it had only been a bug in OpenOffice (actually I filed three bug reports back in 2006—more ancient than I thought—and the bug remains unfixed). It actually seems to be a problem in the ODF standard. After extracting the content of the .ODT file (it is simply a zip file), I can shrink the document down to these XML lines (content.xml with irrelevant contents removed and the result reformatted):

<office:font-face-decls>
<style:font-face
    style:name="Times New Roman"
    style:font-family-generic="roman"
    style:font-pitch="variable"/>
<style:font-face
    style:name="宋体"
    style:font-family-generic="system"
    style:font-pitch="variable"/>
</office:font-face-decls>
<office:automatic-styles>
<style:style
    style:name="P1"
    style:family="paragraph"
    style:parent-style-name="Standard">
<style:text-properties
    fo:font-size="12pt" fo:language="en" fo:country="GB"
    style:language-asian="zh" style:country-asian="CN"/>
</style:style>
</office:automatic-styles>
<office:body>
<office:text>
<text:p text:style-name="P1">“Test测试”</text:p>
<text:p text:style-name="P1">:“Test测试”</text:p>
<text:p text:style-name="P1">:“Test”</text:p>
<text:p text:style-name="P1">“Test”</text:p>
</office:text>
</office:body>

The problem is that instead of specifying a single language for a given piece of text, the style specifies both a ‘fo:language’ and a ‘style:language-asian’, leaving the renderer to decide, character by character, which one applies. The designer of this feature clearly did not think carefully about the fact that many symbols, the curly quotation marks here included, exist in both Asian and non-Asian contexts and are often rendered differently!

When I repeated the same process in Microsoft Word (on Windows), all the text appeared correctly: Microsoft applications recognize which keyboard I use and which language it represents. Pasting as plain text did introduce one error (as no language information was present). Even in that case, fixing the problem was easier: in OpenOffice I have to change the font manually, while in Microsoft Word I only need to specify the correct language (‘Office, this is English, not Chinese’). That is much more intuitive and natural.

I also analysed the XML in the resulting .DOCX file. Its styles.xml contained this:

<w:lang w:val="en-US" w:eastAsia="zh-CN" w:bidi="ar-SA"/>

So these are just the default languages. I had to use UK English and Traditional Chinese to force Word to specify the languages explicitly in the document. The embedded document.xml then contains content like the following:

<w:p>
<w:r>
<w:rPr>
<w:rFonts w:eastAsia="PMingLiU" w:hint="eastAsia"/>
<w:lang w:eastAsia="zh-TW"/>
</w:rPr>
<w:t>“</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:eastAsia="PMingLiU"/>
<w:lang w:val="en-GB" w:eastAsia="zh-TW"/>
</w:rPr>
<w:t>Test</w:t>
</w:r>
<w:r>
<w:rPr>
<w:rFonts w:eastAsia="PMingLiU" w:hint="eastAsia"/>
<w:lang w:eastAsia="zh-TW"/>
</w:rPr>
<w:t>測試”</w:t>
</w:r>
</w:p>
...
<w:p>
<w:r>
<w:rPr>
<w:rFonts w:eastAsia="PMingLiU"/>
<w:lang w:val="en-GB" w:eastAsia="zh-TW"/>
</w:rPr>
<w:t>“Test”</w:t>
</w:r>
</w:p>

One can argue that the structure is somewhat similar (compare ‘w:val’ in <w:lang> with ‘fo:language’ and ‘fo:country’, and ‘w:eastAsia’ with ‘style:language-asian’ and ‘style:country-asian’), but the semantics are obviously different, and text in different languages is not mixed in a single run. The English text carries the language attribute <w:lang w:val="en-GB" w:eastAsia="zh-TW"/>, while the Chinese text has only <w:lang w:eastAsia="zh-TW"/>. It looks to me like a more robust approach to processing mixed text.

Although it might be true that Microsoft lobbied strongly to get OOXML approved as an international standard, I do not think ODF’s openness alone is enough to make people truly adopt it.