A VPN Issue with MTU

One environment I have access to uses a PPTP VPN to allow people to connect to the site remotely.1 One thing that had been troublesome was that there were always people complaining that they could not access the Internet after connecting to the VPN.

I was not concerned in the beginning as my test showed no problem: it seemed my browser had no problems opening http://www.taobao.com/ after connecting to the VPN. Actually, my test was flawed and limited, as I only accessed one or two sites in a virtual machine (my laptop ran a macOS version that no longer supported PPTP). More on this immediately.

Our previous VPN server had a problem, and we switched to the Linux-based pptpd last week. After the set-up was done, I checked with other users and found the web access problem persisted. This time I sat down with one user and looked into the problem together. It turned out that, after connecting to the VPN, he was able to access http://www.taobao.com/, but not http://www.baidu.com/, which was actually the default web page for many people. And I could reproduce this behaviour in my virtual machine. . . .

My experience told me that it was very much like an MTU-related problem (I have encountered plenty of MTU-related networking problems). I checked the server-side script, and found it already clamped the MSS value to 1356, while the MTU value for the PPP connections was 1396. All seemed quite reasonable.

When in doubt with a network problem, a sniffer should always be in your weaponry. I launched tcpdump on the server, and analysed the result in Wireshark. Something became clearer soon.

For the traffic between the pptpd server and Baidu (when a client visited the web site), the following things occurred:

  1. The pptpd server started a connection to the web server, with MSS = 1356
  2. The web server responded with MSS = 1380
  3. The web server soon sent a packet as large as 1420 bytes (TCP payload length is 1380 bytes)
  4. The pptpd server responded with ICMP Destination unreachable (Fragmentation needed), in which the next-hop MTU of 1396 was reported
  5. The above two steps were repeated, and nothing was improved

For the traffic between the pptpd server and Taobao, things were slightly different:

  1. The pptpd server started a connection to the web server, with MSS = 1356
  2. The web server responded with MSS = 1380
  3. The web server soon sent a packet as large as 1420 bytes (TCP payload length is 1380 bytes)
  4. The pptpd server responded with ICMP Destination unreachable (Fragmentation needed), in which the next-hop MTU of 1396 was reported
  5. A few milliseconds later, the web server began to send TCP packets no larger than 1396 bytes
  6. Now the pptpd server and the web server continued to exchange packets without any problems

Apparently there was an ICMP black hole between our server and the Baidu server, but not between our server and the Taobao server.

Once the issue was found, the solution was easy. Initially, I just ran a cron job to check all the PPP connections and changed their MTU value to 1468 (though 1420 should be good enough in my case). The better way, of course, was to change the MTU on new client connections. It could be done via the script /etc/ppp/ip-up, but the environment variable name for the network interface—which I found on the web—was wrong in the beginning. After dumping all the existing environment variables in the script, I finally got the correct name. The following line in /etc/ppp/ip-up was able to get the job done:

ifconfig $IFNAME mtu 1468

Only one thing remained mysterious now: why didn’t the MSS value in the server script take effect? A packet capture on a server I could control confirmed what I guessed, i.e. the MSS value in the TCP SYN packets from our pptpd server was clamped to 1380. It could be the router, or the ISP. Whatever it is, it really should not have clamped the value up.

In summary, problems occurred because:

  • The MSS value was increased, but pptpd did not know and still enforced a small MTU value on the PPP connections, which no longer matched the MSS
  • Path MTU discovery also failed because of the existence of ICMP black holes

Bad things can always happen, and we sometimes just have to find a way around.


  1. PPTP is not considered secure enough, but is quite convenient, especially because UDP port 500 is not usable in our case due to a router compatibility problem. 😔 

Fixing A VS2017 15.6 Installation Problem

After installing the latest Visual Studio 2017 15.6.6 (Community Edition), I found my custom setting of environment variables INCLUDE lost effect in the Developer Command Prompt. Strangely, LIB was still there. Some tracing indicated that it was a bug in the .BAT files Microsoft provided to initialize the environment. The offending lines are the following (in C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\Tools\vsdevcmd\core\winsdk.bat; one of them is very long and requires scrolling for viewing):

@REM the folowing are architecture neutral
set __tmpwinsdk_include=
if "%INCLUDE%" NEQ "" set "__tmp_include=;%INCLUDE%"
set "INCLUDE=%WindowsSdkDir%include\%WindowsSDKVersion%shared;%WindowsSdkDir%include\%WindowsSDKVersion%um;%WindowsSdkDir%include\%WindowsSDKVersion%winrt;%WindowsSdkDir%include\%WindowsSDKVersion%cppwinrt%__tmpwinsdk_include%"
set __tmpwinsdk_include=

Apparently somebody missed renaming __tmp_include to __tmpwinsdk_include. Doing that myself fixed the problem.

I’ve reported the problem to Microsoft. In the meanwhile, you know how to fix it if you encounter the same problem.

On the Use of She as a Generic Pronoun

When reading the August 2017 issue of Communications of the ACM, I have been continually distracted by the use of she as a generic pronoun:

Instead of a field engineer constantly traveling between locations, she could troubleshoot machinery and refine product designs in real time . . .

There were times when one person had to be in charge while she captured the organization of the emerging article . . .

. . . we can let the user specify how much precision she wants . . .

A mathematician using “brute force” is a kind of barbaric monster, is she not?

I am not sure whether this is just my personal problem, but I find this usage obtrusive and annoying. I was reading something supposed to be objective and scientific, but the images of women kept surfacing. The last case was especially so, as I could not help envisioning a female mathematician (er, how many female mathematicians have there been?) who was also a barbaric monster, oops, how bad it was!

I dug around for a while for related resources. Before long, I realized one thing: my view is at least partly shaped by my education, which taught me that he be used as the third-person singular pronoun when the gender is unknown, for both English and Chinese. My unscientific survey shows that while many of my female friends are uncomfortable with either he or she used generically, most Chinese female friends actually prefer he to she! According to an online discussion, at least some peoples in Continental Europe still use the masculine pronoun when the gender is unknown, say, hij in Dutch and il/ils in French.1 I think the French example is quite interesting to Chinese speakers, as neither French nor Chinese has a gender-neutral third-person plural pronoun: the generic forms ils and 他们 are actually masculine forms. Unlike the English they, we never had a nice and simple way to escape the problem.

Talking about they, one fact during the search surprised me. My favourite English author, Jane Austen, apparently preferred they/their in her novels.2 Examples (emphasis is mine):

You wanted me, I know, to say ‘Yes,’ that you might have the pleasure of despising my taste; but I always delight in overthrowing those kind of schemes, and cheating a person of their premeditated contempt.

To be sure, you knew no actual good of me—but nobody thinks of that when they fall in love.

Digging deeper, it is revealed that they has been used after words like each, everybody, nobody, etc. since the Middle Ages. The entries everybody and their in the Oxford English Dictionary are nearly a demonstration of such usages, with a note in the latter entry that writes ‘Not favoured by grammarians’.3 Professor Steven Pinker also argues that using they/their/them after everyone is not only correct, but logical as well.4 Oops to the prescriptivist grammarians and my English education!

Accidentally, I encountered an old article by Douglas R. Hofstadter,5 author of the famous book Gödel, Escher, Bach: An Eternal Golden Braid (also known as GEB). It is vastly satirical, and it attacks most points I have for supporting the use of man and he (go read it; it is highly recommended even though I do not fully agree). It influenced my thinking, even though it ignored the etymology of man. The Oxford Dictionary of English has this usage note:6

Traditionally the word man has been used to refer not only to adult males but also to human beings in general, regardless of sex. There is a historical explanation for this: in Old English the principal sense of man was ‘a human being’, and the words wer and wif were used to refer specifically to ‘a male person’ and ‘a female person’ respectively. Subsequently, man replaced wer as the normal term for ‘a male person’, but at the same time the older sense ‘a human being’ remained in use. In the second half of the twentieth century the generic use of man to refer to ‘human beings in general’ (as in ‘reptiles were here long before man appeared on the earth’) became problematic; the use is now often regarded as sexist or at best old-fashioned.

Etymology is not a good representation of word meaning, but I want to point out that Hofstadter had a logical fallacy in comparing man/woman with white/black. Man did include woman at one point of time; one probably cannot say the same for white and black.

This said, the war for continued use of -man is already lost. Once aware of this issue, I do not think I want to use words like policeman again when the gender is unknown. I still do not think words like mankind, manhole, actress, or mother tongue are bad.7 The society and culture are probably a much bigger headache for women facing inequalities. . . .8

I started being angry, but ended up more understanding. And I also reached a different conclusion than I had expected. It is apparent that somebody will be offended, whether I use he, she, he or she, or they after a noun of unknown gender. I think offending grammarians would now probably be my default choice.

P.S. I have also found Professor Ellen Spertus’s article ‘Why are There so Few Female Computer Scientists?’ worth reading.9 Recommended.


  1. StackExchange discussion: Is using “he” for a gender-neutral third-person correct? Retrieved on 21 October 2017. 
  2. Henry Churchyard: Singular “their” in Jane Austen and elsewhere: Anti-pedantry page. 1999. Internet Archive. 
  3. Oxford English Dictionary. Oxford University Press, 2nd edition, 1989. 
  4. Steven Pinker: On the English singular “their” construction—from The Language Instinct. 1994. Internet Archive. 
  5. Douglas R. Hofstadter: A Person Paper on Purity in Language. 1985. Internet Archive. 
  6. Oxford Dictionary of English. Oxford University Press, macOS built-in edition, 2016. This is different from the famous OED
  7. These words are already banned in some places. See entry sexist language in R. W. Burchfield: Fowler’s Modern English Usage. Oxford University Press, revised 3rd edition, 2004. 
  8. Henry Etzkowitz et al.: Barriers to Women in Academic Science and Engineering. 1994. Internet Archive. 
  9. Ellen Spertus: Why are There so Few Female Computer Scientists? 1991. Internet Archive. 

A Journey of Purely Static Linking

As I mentioned last time, I found Microsoft has really messed up its console Unicode support when the C runtime DLL (as versus the static runtime library) is used. So I decided to have a try with linking everything statically in my project that uses C++ REST SDK (a.k.a. cpprestsdk). This is not normally recommended, but in my case it has two obvious advantages:

  • It would solve the Unicode I/O problem.
  • It would be possible to ship just the binaries without requiring the target PC to install the Microsoft Visual C++ runtime.

It took me several hours to get it rolling, but I felt it was worthwhile.

Before I start, I need to mention that cpprestsdk has a solution file that supports building a static library. It turned out not satisfactory:

  • It used NuGet packages for Boost and OpenSSL, and both versions were out of date. Worse, my Visual Studio 2017 IDE hung while I tried to update the packages. Really a nuisance.
  • The static library, as well as all its dependencies like Boost and OpenSSL, still uses the C runtime DLL. I figured it might be easier to go completely on my own.

Prerequisites

Boost

This part is straightforward. After going into the Boost directory, I only need to type (I use version 1.65.1):

bootstrap.bat
.\b2.exe toolset=msvc -j 2 --with-chrono --with-date_time --with-regex --with-system --with-thread release link=static runtime-link=static stage
.\b2.exe toolset=msvc -j 2 --with-chrono --with-date_time --with-regex --with-system --with-thread release link=shared stage

(The last line is needed only because ‘cmake ..’ would otherwise fail to detect Boost in the ‘Building C++ REST SDK’ stage.)

OpenSSL

As I already have Perl and NASM installed, installing OpenSSL is trivial too (I use version 1.0.2l):

perl Configure VC-WIN32 --prefix=C:/Libraries/OpenSSL
ms\do_nasm
nmake -f ms\nt.mak
nmake -f ms\nt.mak install

zlib

This part requires a small change to the build script (for version 1.2.11). I need to open win32\Makefile.msc and change all occurrences of ‘-MD’ to ‘-MT’. Then these commands will work:

nmake -f win32\Makefile.msc zlib.lib
mkdir C:\Libraries\zlib
mkdir C:\Libraries\zlib\include
mkdir C:\Libraries\zlib\lib
copy zconf.h C:\Libraries\zlib\include
copy zlib.h C:\Libraries\zlib\include
copy zlib.lib C:\Libraries\zlib\lib

Building C++ REST SDK

We need to set some environment variables to help the CMake build system find where the libraries are. I set them in ‘Control Panel > System > Advanced system settings > Environment variables’:1

BOOST_ROOT=C:\src\boost_1_65_1
OPENSSL_ROOT_DIR=C:\Libraries\OpenSSL
INCLUDE=C:\Libraries\zlib\include
LIB=C:\Libraries\zlib\lib

(The above setting assumes Boost is unpacked under C:\src.)

We would need to create the solution files for the current environment under cpprestsdk:

cd Release
mkdir build
cd build
cmake ..

If the environment is set correctly, the last command should succeed and report no errors. A cpprest.sln should be generated now.

We then open this solution file. As we only need the release files, we should change the ‘Solution Configuration’ from ‘Debug’ to ‘Release’. After that, we need to find the project ‘cpprest’ in ‘Solution Explorer’, go to its ‘Properties’, and make the following changes under ‘General’:

  • Set Target Name to ‘cpprest’.
  • Set Target Extension to ‘.lib’.
  • Set Configuration Type to ‘Static library (.lib)’.

And the most important change we need under ‘C/C++ > Code Generation’:

  • Set Runtime Library to ‘Multi-threaded (/MT)’.

Click on ‘OK’ to accept the changes. Then we can build this project.

Like zlib, we need to copy the header and library files to a new path, and add the include and lib directories to environment variables INCLUDE and LIB, respectively. In my case, I have:

INCLUDE=C:\Libraries\cpprestsdk\include;C:\Libraries\zlib\include
LIB=C:\Libraries\cpprestsdk\lib;C:\Libraries\zlib\lib

Change to my Project

Of course, the cpprestsdk-based project needs to be adjusted too. I will first show the diff, and then give some explanations:

--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -24,14 +24,25 @@ set(CMAKE_CXX_FLAGS "${ELPP_FLAGS}")

 if(WIN32)
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}\
- -DELPP_UNICODE -D_WIN32_WINNT=0x0601")
-set(USED_LIBS Boost::dynamic_linking ${Boost_DATE_TIME_LIBRARY} ${Boost_SYSTEM_LIBRARY} ${Boost_THREAD_LIBRARY})
+ -DELPP_UNICODE -D_WIN32_WINNT=0x0601 -D_NO_ASYNCRTIMP")
+set(USED_LIBS Winhttp httpapi bcrypt crypt32 zlib)
 else(WIN32)
 set(USED_LIBS "-lcrypto" OpenSSL::SSL ${Boost_CHRONO_LIBRARY} ${Boost_SYSTEM_LIBRARY} ${Boost_THREAD_LIBRARY})
 endif(WIN32)

 if(MSVC)
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /EHsc /W3")
+set(CompilerFlags
+        CMAKE_CXX_FLAGS
+        CMAKE_CXX_FLAGS_DEBUG
+        CMAKE_CXX_FLAGS_RELEASE
+        CMAKE_C_FLAGS
+        CMAKE_C_FLAGS_DEBUG
+        CMAKE_C_FLAGS_RELEASE)
+foreach(CompilerFlag ${CompilerFlags})
+  string(REPLACE "/MD" "/MT" ${CompilerFlag}
+         "${${CompilerFlag}}")
+endforeach()
 else(MSVC)
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -W -Wall -Wfatal-errors")
 endif(MSVC)

There are two blocks of changes. In the first block, one can see that the Boost libraries are no longer needed, but, instead, one needs to link the Windows dependencies of cpprestsdk (I found the list in Release\build\src\cpprest.vcxproj), as well as zlib. One also needs to explicitly define _NO_ASYNCRTIMP so that the cpprestsdk functions will be not treated as dllimport.

As CMake defaults to using ‘/MD’, the second block of changes replaces all occurrences of ‘/MD’ with ‘/MT’ in the compiler flags.2 With these changes, I am able to generate an executable without any external dependencies.

A Gotcha

I am now used to using cmake without specifying the ‘-G’ option on Windows. By default, CMake generates Visual Studio project files: they have several advantages, including multiple configurations (selectable on the MSBuild command line like ‘/p:Configuration=Release’), and parallel building (say, using ‘/m:2’ to take advantage of two processor cores). Neither is possible with nmake. However, the executables built by this method still behave abnormally regarding outputting non-ASCII characters. Actually, I believe everything is still OK at the linking stage, but the build process then touches the executables in some mysterious way and the result becomes bad. I am not familiar with MSBuild well enough to manipulate the result, so I am going back to using ‘cmake -G "NMake Makefiles" -DCMAKE_BUILD_TYPE=Release’ followed by ‘nmake’ for now.


  1. CMake can recognize Boost and OpenSSL by some known environment variables. I failed to find one that really worked for zlib, so the INCLUDE and LIB variables need to be explicitly set. 
  2. This technique is shamelessly copied from a StackOverflow answer

Another Microsoft Unicode I/O Problem

I encountered an annoying bug in Visual C++ 2017 recently. It started when I found my favourite logging library, Easylogging++, output Chinese as garbage characters on Windows. Checking the documentation carefully, I noted that I should have used the macro START_EASYLOGGINGPP. It turned out to be worse: all output starting from the Chinese character was gone. Puzzled but busy, I put it down and worked on something else.

I spend another hour of detective work on this issue today. The result was quite surprising.

  • First, it is not an issue with Easylogging++. The problem can occur if I purely use std::wcout.
  • Second, the magical thing about START_EASYLOGGINGPP is that it will invoke std::locale::global(std::locale("")). This is the switch that leads to the different behaviour.
  • Myteriously, with the correct locale setting, I can get the correct result with both std::wcout and Easylogging++ in a test program. I was not able to get it working in my real project.
  • Finally, it turns out that the difference above is caused by /MT vs. /MD! The former (default if neither is specified on the command line) tells the Visual C++ compiler to use the static multi-threading library, and the latter (set by default in Visual Studio projects) tells the compiler to use the dynamic multi-threading library.

People may remember that I wrote about MSVCRT.DLL Console I/O Bug. While Visual C++ 2013 shows consistent behaviour between /MT and /MD, Visual C++ 2015 and 2017 exhibit the same woeful bug when /MD is specified on the command line. This is something perplexingly weird: it seems someone at Microsoft messed up with the MSVCRT.DLL shipped with Windows first (around 2006), and then the problem spread to the Visual C++ runtime DLL after nearly a decade!

I am using many modern C++ features, so I definitely do not want to go back to Visual C++ 2013 for new projects. It seems I have to tolerate garbage characters in the log for now. Meanwhile, I submitted a bug to Microsoft. Given that I have a bug report that is deferred for four years, I am not very hopeful. But let us wait and see.

Update (20 December 2017)

A few more tests show that the debug versions (/MTd and /MDd) both work well. So only the default release build (using the dynamic C runtime) exhibits this problem, where the executable depends on DLLs like api-ms-win-crt-stdio-l1-1-0.dll. It seems this issue is related to the Universal C Runtime introduced in Visual Studio 2015 and Windows 10. . . .

Update (25 March 2018)

The bug was closed, and a Microsoft developer indicated that the issue had already been fixed since the Windows 10 Anniversary Update SDK (build 10.0.14393). Actually I had had build 10.0.15063 installed. The reason why I still saw the problem was that the Universal C Runtime on Windows 7 had not been updated (‘the issue will be fixed in a future update to the Universal C Runtime on Windows 7’), and I should not have seen the problem on a Windows 10 box. The current workaround is either use static linking (as I did), or copy the redistributable DLLs under C:\Program Files (x86)\Windows Kits\10\Redist\ucrt\DLLs\x86 (or x64 etc.) to the app directory (so called ‘app-local deployment’; which should not be used on Windows 10, as the system version is always preferred). My test showed that copying ucrtbase.dll was enough to fix my test case.

C/C++ Performance, mmap, and string_view

A C++ discussion group I participated in got a challenge last week. Someone posted a short Perl program, and claimed that it was faster than the corresponding C version. It was to this effect (with slight modifications):

open IN, "$ARGV[0]";
my $gc = 0;
while (my $line = ) {
  $gc += ($line =~ tr/cCgG//);
}
print "$gc\n";
close IN

The program simply searched and counted all occurrences of the letters ‘C’ and ‘G’ from the input file in a case-insensitive way. Since the posted C code was incomplete, I wrote a naïve implementation to test, which did turn out to be slower than the Perl code. It was about two times as slow on Linux, and about 10 times as slow on macOS.1

FILE* fp = fopen(argv[1], "rb");
int count = 0;
int ch;
while ((ch = getc(fp)) != EOF) {
    if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') {
        ++count;
    }
}
fclose(fp);

Of course, it actually shows how optimized the Perl implementation is, instead of how inferior the C language is. Before I had time to test my mmap_line_reader, another person posted an mmap-based solution, to the following effect (with modifications):

int fd = open(argv[1], O_RDONLY);
struct stat st;
fstat(fd, &st);
int len = st.st_size;
char ch;
int count = 0;
char* ptr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
char* begin = ptr;
char* end = ptr + len;
while (ptr < end) {
    ch = *ptr++;
    if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G')
        ++count;
}
munmap(begin, len);

When I tested my mmap_line_reader, I found that its performance was only on par with the Perl code, but slower than the handwritten mmap-based code. It is not surprising, considering that mmap_line_reader copies the line content, while the C code above searches directly in the mmap’d buffer.

I then thought of the C++17 string_view.2 It seemed a good chance of using it to return a line without copying its content. It was actually easy refactoring (on code duplicated from mmap_line_reader), and most of the original code did not require any changes. I got faster code for this test, but the implementations of mmap_line_reader and the new mmap_line_reader_sv were nearly identical, except for a few small differences.

Naturally, the next step was to refactor again to unify the implementations. I made a common base to store the bottommost mmap-related logic, and made the difference between string and string_view disappear with a class template. Now mmap_line_reader and mmap_line_reader_sv were just two aliases of instantiations of basic_mmap_line_reader!

While mmap_line_reader_sv was faster than mmap_line_reader, it was still slower than the mmap-based C code. So I made another abstraction, a ‘container’ that allowed iteration over all of the file content. Since the mmap-related logic was already mostly separated, only some minor modifications were needed to make that base class fully independent of the line reading logic. After that, adding mmap_char_reader was easy, which was simply a normal container that did not need to mess with platform-specific logic.

At this point, all seemed well—except one thing: it only worked on Unix. I did not have an immediate need to make it work on Windows, but I really wanted to show that the abstraction provided could work seamlessly regardless of the platform underneath. After several revisions, in which I dealt with proper Windows support,3 proper 32- and 64-bit support, and my own mistakes, I finally made it work. You can check out the current code in the nvwa repository. With it, I can finally make the following simple code work on Linux, macOS, and Windows, with the same efficiency as raw C code when fully optimized:4

#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_byte_reader.h"

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;
        exit(1);
    }

    try {
        int count = 0;
        for (char ch : nvwa::mmap_char_reader(argv[1]))
            if (ch == 'c' || ch == 'C' ||
                ch == 'g' || ch == 'G')
                ++count;
        std::cout << count << std::endl;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}

Even though it is not used in the final code, I love the C++17 string_view. And I like the simplicity I finally achieved. Do you?

P.S. The string_view-based test code is posted here too as a reference. Line-based processing is common enough!5

#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_line_reader.h"

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;
        exit(1);
    }

    try {
        int count = 0;
        for (const auto& line :
                nvwa::mmap_line_reader_sv(argv[1]))
            for (char ch : line)
                if (ch == 'c' || ch == 'C' ||
                    ch == 'g' || ch == 'G')
                    ++count;
        std::cout << count << std::endl;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}

Update (18 September 2017)

Thanks to Alex Maystrenko (see the comments below), It is now understood that the reason why getc was slow was because there was an implicit lock around file operations. I did not expect it, as I grew from an age when multi-threading was the exception, and I had not learnt about the existence of getc_unlocked until he mentioned it! According to the getc_unlocked page in the POSIX specification:

Some I/O functions are typically implemented as macros for performance reasons (for example, putc() and getc()). For safety, they need to be synchronized, but it is often too expensive to synchronize on every character. Nevertheless, it was felt that the safety concerns were more important; consequently, the getc(), getchar(), putc(), and putchar() functions are required to be thread-safe. However, unlocked versions are also provided with names that clearly indicate the unsafe nature of their operation but can be used to exploit their higher performance.

After replacing getc with getc_unlocked, the naïve implementation immediately outperforms the Perl code on Linux, though not on macOS.6

Another interesting thing to notice is that GCC provides vastly optimized code for the comparison with ‘c’, ‘C’, ‘g’, and ‘G’,7 which is extremely unlikely for interpreted languages like Perl. Observing the codes for the characters are:

  • 010000112 or 6710 (‘C’)
  • 010001112 or 7110 (‘G’)
  • 011000112 or 9910 (‘c’)
  • 011001112 or 10310 (‘g’)

GCC basically ANDs the input character with 110110112, and compares the result with 010000112. In Intel-style assembly:

        movzx   ecx, BYTE PTR [r12]
        and     ecx, -37
        cmp     cl, 67

It is astoundingly short and efficient!


  1. I love my Mac, but I do feel Linux has great optimizations. 
  2. If you are not familiar with string_view, check out its reference
  3. Did I mention how unorthogonal and uncomfortable the Win32 API is, when compared with POSIX? I am very happy to hide all the ugliness from the application code. 
  4. Comparison was only made on Unix (using the POSIX mmap API), under GCC with ‘-O3’ optimization (‘-O2’ was not enough). 
  5. -std=c++17’ must be specified for GCC, and ‘/std:c++latest’ must be specified for Visual C++ 2017. 
  6. However, the performance improvement is more dramatic on macOS, from about 10 times slower to 5% slower. 
  7. Of course, this is only possible when the characters are given as compile-time constants. 

Annoying Vim Behaviour on Ubuntu 16.04

When I started using Ubuntu 16.04 LTS, I was overall happy with it, except for one thing. Its Vim installation behaved strangely, to say the least. On the first look, it tried to be user-friendly: it automatically provided pop-up prompts when I typed something. It also tried to be smart, e.g. path candidates were provided when I was typing a path. Sounds great? Er . . . actually no, at least not to experienced Vim users.

The problem with it was that it was far too intrusive. If what I was typing had at least one autocompletion candidate when I typed Enter, the first candidate was chosen, instead of inserting a newline. What if I really meant to insert a newline? Oh, I had to press Ctrl-E to cancel the autocompletion first (by the way, this is a Vim command I had to do a Google search to find; before that I used Esc followed by o).

Now think about it: Which thing does an advanced user do more often, consulting the screen and choosing an autocompletion candidate, or continuously typing onto newlines, probably without even looking at the screen? When we do need to autocomplete, we all know well how to invoke autocompletion by Ctrl-P and Ctrl-N.

I tolerated this behaviour for a few months. Today I decided I really needed to get rid of it. I first searched the web, but did not see any obvious results. After a few experiments, I checked the /usr/share/vim/vim74/plugin directory, and immediately found a likely suspect—acp.vim! It turned out to be the AutoComplPop plugin. After checking its code, I simply inserted the following line into my .vimrc file:

let acp_enableAtStartup=0

Problem solved, and I am feeling happy again!

I do not know whether Canonical thought carefully about enabling this plugin by default, which might help new users but hurt experienced users. Maybe they thought experienced users could find a way out. I guess it might be true, but I wasted 20 minutes finding the solution, and then spent another hour writing up this article. . . . I do hope this article can help another disgruntled Vim user on Ubuntu. 🙂

Performance of My Line Readers

After I wrote the article about Python yield and C++ Coroutines, I felt that I needed to test the performance of istream_line_reader. The immediate result was both good and bad: good in that there was no actual difference between the straightforward std::getline and my istream_line_reader (as anticipated), and bad in that neither version performed well (a surprise to me). I vaguely remember that sync_with_stdio(false) may affect the performance, so I also tested calling this function in the beginning. However, it did not seem to matter. By the way, my favourite compiler has always been Clang recently (and I use a Mac).

Seeing that istream_line_reader had a performance problem, I tried other approaches. One thing I tried was using the traditional C I/O functions. I wrote another file_line_reader, which used either fgets or fread to read the data, depending what the delimiter is. (fgets could only use ‘\n’ as the delimiter, but it performed better than fread, for I could fgets into the final line buffer, but had to fread into a temporary buffer first.) I also added a switch on whether to strip the delimiter, something not possible with the getline function. The result achieved a more than 10x performance improvement (from about 28 MB/s to 430 MB/s). I was happy, and presented this on the last slide of my presentation on C++ and Functional Programming in the 2016 C++ and System Software Summit (China).

Until C++11, modifying the character array accessed through string::data() has undefined behaviour. To be on the safe side, I implemented a self-expanding character buffer on my own, which complicated the implementation a little bit. It also made the interface slightly different from istream_line_reader, which can be demonstrated in the following code snippets.

Iteration with istream_line_reader:

for (auto& line : istream_line_reader(cin)) {
    puts(line.c_str());
}

Iteration with file_line_reader:

for (auto& line : file_line_reader(stdin)) {
    puts(line);
}

I.e. each iteration with file_line_reader returns a char* instead of a string. This should be OK, as a raw character pointer is often enough. One can always construct a string from char* easily, anyway.


After the presentation, I turned to implementing a small enhancement—iterating over the lines with mmap. This proved interesting work. Not only did it improved the line reading performance, but the code was simplified as well. As I could access the file content directly with a pointer, I was able to copy the lines to a string simply with string::assign. As I used string again, there was no need to define a custom copy constructor, copy assignment operator, move constructor, and move assignment operator as well. The performance was, of course, also good: the throughput rate reached 650 MB/s, a 50% improvement! The only negative side was that it could not work on stdin, so testing it required more lines. Apart from that, I was quite satisfied. And I had three different line readers that could take an istream&, FILE*, or file descriptor as the input source. So all situations were dealt with. Not bad!

One thing of note about the implementation. I tried copying (a character at a time) while searching, before adopting the current method of searching first before assigning to the string. The latter proved faster when dealing with long lines. I can see two reasons:

  1. Strings are normally (and required to be since C++11) null-terminated, so copying one character at a time has a big overhead of zeroing the next byte. I confirmed the case from the libc++ source code of Clang.
  2. Assignment can use memcpy or memmove internally, which normally has a fast platform-specific implementation. In the case of string::assign(const char*, size_t), I verified that libc++ used memmove indeed.

If you are interested, this is the assembly code I finally traced into on my Mac (comments are my analysis; you may need to scroll horizontally to see them all):

libsystem_c.dylib`memcpy$VARIANT$sse42:
   0x7fff9291fcbd:  pushq  %rbp
   0x7fff9291fcbe:  movq   %rsp, %rbp
   0x7fff9291fcc1:  movq   %rdi, %r11           ; save dest
   0x7fff9291fcc4:  movq   %rdi, %rax
   0x7fff9291fcc7:  subq   %rsi, %rax           ; dest - src
   0x7fff9291fcca:  cmpq   %rdx, %rax
   0x7fff9291fccd:  jb     0x7fff9291fd04       ; dest in (src, src + len)?
   ; Entry condition: dest <= src or dest >= src + len; copy starts from front
   0x7fff9291fccf:  cmpq   $80, %rdx
   0x7fff9291fcd3:  ja     0x7fff9291fd09       ; len > 128?
   ; Entry condition: len <= 128
   0x7fff9291fcd5:  movl   %edx, %ecx
   0x7fff9291fcd7:  shrl   $2, %ecx             ; len / 4
   0x7fff9291fcda:  je     0x7fff9291fcec       ; len < 4?
   0x7fff9291fcdc:  movl   (%rsi), %eax         ; 4-byte read
   0x7fff9291fcde:  addq   $4, %rsi             ; src <- src + 4
   0x7fff9291fce2:  movl   %eax, (%rdi)         ; 4-byte write
   0x7fff9291fce4:  addq   $4, %rdi             ; dest <- dest + 4
   0x7fff9291fce8:  decl   %ecx
   0x7fff9291fcea:  jne    0x7fff9291fcdc       ; more 4-byte blocks?
   ; Entry condition: len < 4
   0x7fff9291fcec:  andl   $3, %edx
   0x7fff9291fcef:  je     0x7fff9291fcff       ; len == 0?
   0x7fff9291fcf1:  movb   (%rsi), %al          ; 1-byte read
   0x7fff9291fcf3:  incq   %rsi                 ; src <- src + 1
   0x7fff9291fcf6:  movb   %al, (%rdi)          ; 1-byte write
   0x7fff9291fcf8:  incq   %rdi                 ; dest <- dest + 1
   0x7fff9291fcfb:  decl   %edx
   0x7fff9291fcfd:  jne    0x7fff9291fcf1       ; more bytes?
   0x7fff9291fcff:  movq   %r11, %rax           ; restore dest
   0x7fff9291fd02:  popq   %rbp
   0x7fff9291fd03:  ret
   0x7fff9291fd04:  jmpq   0x7fff9291fdb9
   ; Entry condition: len > 128
   0x7fff9291fd09:  movl   %edi, %ecx
   0x7fff9291fd0b:  negl   %ecx
   0x7fff9291fd0d:  andl   $15, %ecx            ; 16 - dest % 16
   0x7fff9291fd10:  je     0x7fff9291fd22       ; dest 16-byte aligned?
   0x7fff9291fd12:  subl   %ecx, %edx           ; adjust len
   0x7fff9291fd14:  movb   (%rsi), %al          ; one-byte read
   0x7fff9291fd16:  incq   %rsi                 ; src <- src + 1
   0x7fff9291fd19:  movb   %al, (%rdi)          ; one-byte write
   0x7fff9291fd1b:  incq   %rdi                 ; dest <- dest + 1
   0x7fff9291fd1e:  decl   %ecx
   0x7fff9291fd20:  jne    0x7fff9291fd14       ; until dest is aligned
   ; Entry condition: dest is 16-byte aligned
   0x7fff9291fd22:  movq   %rdx, %rcx           ; len
   0x7fff9291fd25:  andl   $63, %edx            ; len % 64
   0x7fff9291fd28:  andq   $-64, %rcx           ; len <- align64(len)
   0x7fff9291fd2c:  addq   %rcx, %rsi           ; src <- src + len
   0x7fff9291fd2f:  addq   %rcx, %rdi           ; src <- dest + len
   0x7fff9291fd32:  negq   %rcx                 ; len <- -len
   0x7fff9291fd35:  testl  $15, %esi
   0x7fff9291fd3b:  jne    0x7fff9291fd80       ; src not 16-byte aligned?
   0x7fff9291fd3d:  jmp    0x7fff9291fd40
   0x7fff9291fd3f:  nop
   ; Entry condition: both src and dest are 16-byte aligned
   0x7fff9291fd40:  movdqa (%rsi,%rcx), %xmm0   ; aligned 16-byte read
   0x7fff9291fd45:  movdqa 16(%rsi,%rcx), %xmm1
   0x7fff9291fd4b:  movdqa 32(%rsi,%rcx), %xmm2
   0x7fff9291fd51:  movdqa 48(%rsi,%rcx), %xmm3
   0x7fff9291fd57:  movdqa %xmm0, (%rdi,%rcx)   ; aligned 16-byte write
   0x7fff9291fd5c:  movdqa %xmm1, 16(%rdi,%rcx)
   0x7fff9291fd62:  movdqa %xmm2, 32(%rdi,%rcx)
   0x7fff9291fd68:  movdqa %xmm3, 48(%rdi,%rcx)
   0x7fff9291fd6e:  addq   $64, %rcx
   0x7fff9291fd72:  jne    0x7fff9291fd40       ; more 64-byte blocks?
   0x7fff9291fd74:  jmpq   0x7fff9291fcd5
   0x7fff9291fd79:  nopl   (%rax)               ; 7-byte nop
   ; Entry condition: src is NOT 16-byte aligned but dest is
   0x7fff9291fd80:  movdqu (%rsi,%rcx), %xmm0   ; unaligned 16-byte read
   0x7fff9291fd85:  movdqu 16(%rsi,%rcx), %xmm1
   0x7fff9291fd8b:  movdqu 32(%rsi,%rcx), %xmm2
   0x7fff9291fd91:  movdqu 48(%rsi,%rcx), %xmm3
   0x7fff9291fd97:  movdqa %xmm0, (%rdi,%rcx)   ; aligned 16-byte write
   0x7fff9291fd9c:  movdqa %xmm1, 16(%rdi,%rcx)
   0x7fff9291fda2:  movdqa %xmm2, 32(%rdi,%rcx)
   0x7fff9291fda8:  movdqa %xmm3, 48(%rdi,%rcx)
   0x7fff9291fdae:  addq   $64, %rcx
   0x7fff9291fdb2:  jne    0x7fff9291fd80       ; more 64-byte blocks?
   0x7fff9291fdb4:  jmpq   0x7fff9291fcd5
   ; Entry condition: dest > src and dest < src + len; copy starts from back
   0x7fff9291fdb9:  addq   %rdx, %rsi           ; src <- src + len
   0x7fff9291fdbc:  addq   %rdx, %rdi           ; dest <- dest + len
   0x7fff9291fdbf:  cmpq   $80, %rdx
   0x7fff9291fdc3:  ja     0x7fff9291fdf6       ; len > 128?
   ; Entry condition: len < 128
   0x7fff9291fdc5:  movl   %edx, %ecx
   0x7fff9291fdc7:  shrl   $3, %ecx             ; len / 8
   0x7fff9291fdca:  je     0x7fff9291fdde       ; len < 8?
   ; Entry condition: len >= 8
   0x7fff9291fdcc:  subq   $8, %rsi             ; src <- src - 8
   0x7fff9291fdd0:  movq   (%rsi), %rax         ; 8-byte read
   0x7fff9291fdd3:  subq   $8, %rdi             ; dest <- dest - 8
   0x7fff9291fdd7:  movq   %rax, (%rdi)         ; 8-byte write
   0x7fff9291fdda:  decl   %ecx
   0x7fff9291fddc:  jne    0x7fff9291fdcc       ; until len < 8
   ; Entry condition: len < 8
   0x7fff9291fdde:  andl   $7, %edx
   0x7fff9291fde1:  je     0x7fff9291fdf1       ; len == 0?
   0x7fff9291fde3:  decq   %rsi                 ; src <- src - 1
   0x7fff9291fde6:  movb   (%rsi), %al          ; 1-byte read
   0x7fff9291fde8:  decq   %rdi                 ; dest <- dest - 1
   0x7fff9291fdeb:  movb   %al, (%rdi)          ; 1-byte write
   0x7fff9291fded:  decl   %edx
   0x7fff9291fdef:  jne    0x7fff9291fde3       ; more bytes?
   0x7fff9291fdf1:  movq   %r11, %rax           ; restore dest
   0x7fff9291fdf4:  popq   %rbp
   0x7fff9291fdf5:  ret
   ; Entry condition: len > 128
   0x7fff9291fdf6:  movl   %edi, %ecx
   0x7fff9291fdf8:  andl   $15, %ecx
   0x7fff9291fdfb:  je     0x7fff9291fe0e       ; dest 16-byte aligned?
   0x7fff9291fdfd:  subq   %rcx, %rdx           ; adjust len
   0x7fff9291fe00:  decq   %rsi                 ; src <- src - 1
   0x7fff9291fe03:  movb   (%rsi), %al          ; one-byte read
   0x7fff9291fe05:  decq   %rdi                 ; dest <- dest - 1
   0x7fff9291fe08:  movb   %al, (%rdi)          ; one-byte write
   0x7fff9291fe0a:  decl   %ecx
   0x7fff9291fe0c:  jne    0x7fff9291fe00       ; until dest is aligned
   ; Entry condition: dest is 16-byte aligned
   0x7fff9291fe0e:  movq   %rdx, %rcx           ; len
   0x7fff9291fe11:  andl   $63, %edx            ; len % 64
   0x7fff9291fe14:  andq   $-64, %rcx           ; len <- align64(len)
   0x7fff9291fe18:  subq   %rcx, %rsi           ; src <- src - len
   0x7fff9291fe1b:  subq   %rcx, %rdi           ; dest <- dest - len
   0x7fff9291fe1e:  testl  $15, %esi
   0x7fff9291fe24:  jne    0x7fff9291fe61       ; src 16-byte aligned?
   ; Entry condition: both src and dest are 16-byte aligned
   0x7fff9291fe26:  movdqa -16(%rsi,%rcx), %xmm0; aligned 16-byte read
   0x7fff9291fe2c:  movdqa -32(%rsi,%rcx), %xmm1
   0x7fff9291fe32:  movdqa -48(%rsi,%rcx), %xmm2
   0x7fff9291fe38:  movdqa -64(%rsi,%rcx), %xmm3
   0x7fff9291fe3e:  movdqa %xmm0, -16(%rdi,%rcx); aligned 16-byte write
   0x7fff9291fe44:  movdqa %xmm1, -32(%rdi,%rcx)
   0x7fff9291fe4a:  movdqa %xmm2, -48(%rdi,%rcx)
   0x7fff9291fe50:  movdqa %xmm3, -64(%rdi,%rcx)
   0x7fff9291fe56:  subq   $64, %rcx
   0x7fff9291fe5a:  jne    0x7fff9291fe26       ; more 64-byte blocks?
   0x7fff9291fe5c:  jmpq   0x7fff9291fdc5
   ; Entry condition: src is NOT 16-byte aligned but dest is
   0x7fff9291fe61:  movdqu -16(%rsi,%rcx), %xmm0; unaligned 16-byte read
   0x7fff9291fe67:  movdqu -32(%rsi,%rcx), %xmm1
   0x7fff9291fe6d:  movdqu -48(%rsi,%rcx), %xmm2
   0x7fff9291fe73:  movdqu -64(%rsi,%rcx), %xmm3
   0x7fff9291fe79:  movdqa %xmm0, -16(%rdi,%rcx); aligned 16-byte write
   0x7fff9291fe7f:  movdqa %xmm1, -32(%rdi,%rcx)
   0x7fff9291fe85:  movdqa %xmm2, -48(%rdi,%rcx)
   0x7fff9291fe8b:  movdqa %xmm3, -64(%rdi,%rcx)
   0x7fff9291fe91:  subq   $64, %rcx
   0x7fff9291fe95:  jne    0x7fff9291fe61       ; more 64-byte blocks?
   0x7fff9291fe97:  jmpq   0x7fff9291fdc5

I am happy that I can take advantage of such optimizations, but do not need to write such code on my own—there are so many different cases to deal with!


Of couse, nothing is simple regarding performance. More tests revealed more facts that are interesting and/or surprising:

  • While libc++ (it is the library, but not the compiler, that matters here) seems to completely ignore sync_with_stdio, it makes a big difference in libstdc++. The same function call gets a more than 10x performance improvement when the istream_line_reader test program is compiled with GCC (which uses libstdc++), from ~28 MB/s to ~390 MB/s. It shows that I made a wrong assumption! Interestingly, reading from stdin (piped from the pv tool) is slightly faster than reading from a file on my Mac (when compiled with GCC).
  • On a CentOS 6.5 Linux system, sync_with_stdio(false) has a bigger performance win (~23 MB/s vs. ~800 MB/s). Reading from a file directly is even faster at 1100 MB/s. That totally beats my file_line_reader (~550 MB/s reading a file directly) and mmap_line_reader (~600 MB/s reading a file directly) on the same machine. I was stunned when first seeing this performance difference of nearly 40 times!

So, apart from the slight difference in versatility, the first and simplest form of my line readers is also the best on Linux, while the mmap-based version may be a better implementation on OS X—though your mileage may vary depending on the different combinations of OS versions, compilers, and hardware. Should I be happy, or sad?


You can find the implementation of istream_line_reader among my example code for the ‘C++ and Functional Programming’ presentation, and the implementations of file_line_reader and mmap_line_reader in the Nvwa repository. And the test code is as follows:

test_istream_line_reader.cpp:

#include <fstream>
#include <iostream>
#include <string>
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include "istream_line_reader.h"

using namespace std;

int main(int argc, char* argv[])
{
    char optch;
    while ( (optch = getopt(argc, argv, "s")) != EOF) {
        switch (optch) {
        case 's':
            cin.sync_with_stdio(false);
            break;
        }
    }
    if (!(optind == argc || optind == argc - 1)) {
        fprintf(stderr,
                "Only one file name can be specified\n");
        exit(1);
    }

    istream* is = nullptr;
    ifstream ifs;
    if (optind == argc) {
        is = &cin;
    } else {
        ifs.open(argv[optind]);
        if (!ifs) {
            fprintf(stderr,
                    "Cannot open file '%s'\n",
                    argv[optind]);
            exit(1);
        }
        is = &ifs;
    }

    for (auto& line : istream_line_reader(*is)) {
        puts(line.c_str());
    }
}

test_file_line_reader.cpp:

#include <stdio.h>
#include <stdlib.h>
#include <nvwa/file_line_reader.h>

using nvwa::file_line_reader;

int main(int argc, char* argv[])
{
    FILE* fp = stdin;
    if (argc == 2) {
        fp = fopen(argv[1], "r");
        if (!fp) {
            fprintf(stderr,
                    "Cannot open file '%s'\n",
                    argv[1]);
            exit(1);
        }
    }

    file_line_reader
        reader(fp, '\n',
               file_line_reader::no_strip_delimiter);
    for (auto& line : reader) {
        fputs(line, stdout);
    }
}

test_mmap_line_reader.cpp:

#include <stdio.h>
#include <stdlib.h>
#include <stdexcept>
#include <nvwa/mmap_line_reader.h>

using nvwa::mmap_line_reader;

int main(int argc, char* argv[])
{
    if (argc != 2) {
        fprintf(stderr,
                "A file name shall be provided\n");
        exit(1);
    }

    try {
        mmap_line_reader
            reader(argv[1], '\n',
                   mmap_line_reader::no_strip_delimiter);

        for (auto& str : reader) {
            fputs(str.c_str(), stdout);
        }
    }
    catch (std::runtime_error& e) {
        puts(e.what());
    }
}

Upgrading to Boost 1.61 in MacPorts

The Boost version in MacPorts was still 1.59.0—a year old now. When I wrote about Boost.Coroutine2, I found I had to install the latest Boost version 1.61.0. So I had two sets of Boost libraries on my hard drive, which made things . . . er . . . a little bit complicated. After I built Microsoft’s cpprestsdk last night—I managed to make it find and use the MacPorts Boost libraries—I feel more urged to change the situation. So this morning I subscribed to the MacPorts mailing list and posted the question about the outdated version problem. With the help from Mr Michael Dickens and Google, I have a working port of Boost 1.61.0 now. This article will document the procedure how it works.

The first thing one needs to do is check out the port files from the MacPorts Subversion repository. In my case, The boost files are under devel/boost. So I checked out only the boost directory into ~/Programming/MacPorts/devel.

One then needs to tell MacPorts to look for ports in that directory. There are two steps involved:

  1. Add the URL of the local ports directory (e.g. ‘file:///Users/yongwei/Programming/MacPorts’ in my case) to /opt/local/etc/macports/sources.conf, above the default rsync URL.
  2. Run the portindex command under that directory. It needs to be rerun every time a Portfile is changed.

Now MacPorts should find ports first in my local ports directory and then the system default. And I could begin patching the files.

It turned out that people tried to update boost half a year ago for Boost 1.60, but they found there were failing ports and the ABI was incompatible with 1.59. The patch was still good to me, as I had now a good example. I simply applied the patch, ran portindex again, and went ahead to port upgrade boost.

The procedure turned out quite smooth, though mkvtoolnix, the only installed port that depended on boost on my laptop, failed to run after the upgrade. I had to port uninstall it and then port install it again (rebuilding it).

After I had some confidence, I began to change the port files. I changed first Portfile, which contained the version information and file checksums. Updating them was trivial. When I could see the new version 1.61.0 from port info boost, I kicked off the build with port upgrade boost again.

Then came the more painful process of fixing the patch files under devel/boost/files (the ‘patch’ I mentioned a moment ago actually contained patches for these patch files). Most of these MacPorts-specific patch files could be applied without any problems, but one of them failed. It was actually due to trivial code changes in Boost, but I still had to check all the rejections, manually apply the changes, and generate a new patch file. After that, everything went on smoothly.

Against all my hopes, I found that I had to rebuild mkvtoolnix yet again. So the ABI instability is really an issue, and I understand now why boost was stuck at the old version for such a long time. However, I consider my task completed, when I uploaded the updated patch to the MacPorts ticket. At least I have the new working port of boost for myself now. And you can have it too.

Python yield and C++ Coroutines

Back in 2008, an old friend challenged me with a programming puzzle, when we both attended a wedding. He later gave a solution in Python. Comparing with my C++ solution, the Python one has about half the code lines. I was not smart enough to begin learning Python then, but instead put an end to my little interest in Python with a presentation in C++ Conference China 2009. I compared the basic constructs, and concluded they were equivalent, not realizing that Python had more to offer than that trivial programming exercise showed.

Fast forwarding to today (2016), I am really writing some (small) programs in Python. I have begun to appreciate Python more and more. For many of the tasks, the performance loss in executing the code is ignorable, but the productivity boost is huge. I have also realized that there are constructs in Python that are not easily reproducible in other languages. Generator/yield is one of them.

The other day, I was writing a small program to sort hosts based on dot-reversed order so as to group the host names in a more reasonable order (regarding ‘www.wordpress.com’ as ‘com.wordpress.www’). I quickly came up with a solution in Python. The code is very short and I can show it right here:

def backsort(lines):
    result = {}
    for line in lines:
        result['.'.join(reversed(line.split('.')))] = line
    return map(lambda item: item[1],
               sorted(result.items()))

Of course, we can implement a similar function in C++11. We will immediately find that there are no standard implementations for split and join (see Appendix below for my implementation). Regardless, we can write some code like:

template <typename C>
vector<string> backsort(C&& lines)
{
    map<string, string> rmap;
    for (auto& line : lines) {
        auto split_line = split(line, '.');
        reverse(split_line.begin(), split_line.end());
        rmap[join(split_line, '.')] = line;
    }
    vector<string> result(rmap.size());
    transform(rmap.begin(), rmap.end(), result.begin(),
              [](const pair<string, string>& pr)
              {
                  return pr.second;
              });
    return result;
}

Even though it has twice the non-trivial lines of code and is a function template, there is immediately something Python can do readily but C++ cannot. I can give the Python file handle (like os.stdin) directly to backsort, and the for line will iterate through the file content.1 This is because the Python file object implements the iterator protocol over lines of text, but the C++ istream does not do anything similar.

Let us forget this C++ detail, and focus on the problem. My Python code accepts an iterator, and ‘backsorts’ all the input lines. Can we make it process multiple files (like the cat command line), without changing the backsort function?

Of course it can be done. There is a traditional way, and there is a smart way. The traditional way is write a class that implements the iterator protocol (which can be readily modelled by C++):

class cat:
    def __init__(self, files):
        self.files = files
        self.cur_file = None

    def __iter__(self):
        return self

    def next(self):
        while True:
            if self.cur_file:
                line = self.cur_file.readline()
                if line:
                    return line.rstrip('\n')
                self.cur_file.close()
            if self.files:
                self.cur_file = open(self.files[0])
                self.files = self.files[1:]
            else:
                raise StopIteration()

We can then cat files by the following lines:

if __name__ == '__main__':
    if sys.argv[1:]:
        for line in cat(sys.argv[1:]):
            print(line)

Using yield, we can reduce the 18 lines of code of cat to only 5:

def cat(files):
    for fn in files:
        with open(fn) as f:
            for line in f:
                yield line.rstrip('\n')

There is no more bookkeeping of the current file and the unprocessed files, and everything is wrapped in simple loops. Isn’t that amazing? I actually learnt about the concept before (in C#), but never used it in real code—perhaps because I was too much framed by existing code, using callbacks, observer pattern, and the like.—Those ‘patterns’ now look ugly, when compared to the simplicity of generators.

Here comes the real challenge for C++ developers: Can we do the same in C++? Can we do something better than inelegant callbacks? 2


My investigations so far indicate the following: No C++ standards (up to C++14) support such constructs, and there is no portable way to implement them as a library.

Are we doomed? No. Apart from standardization efforts regarding coroutines (which is the ancient name for a superset of generators, dated from 1958) in C++,3 there have been at least five cross-platform implementations for C++:

  • The unofficial Boost.Coroutine by Giovanni P. Deretta (2006), compatible with Windows, Linux, and maybe a few Unix variants (tested not working on OS X); apparently abandoned.4
  • The official Boost.Coroutine by Oliver Kowalke (2013), compatible with ARM, MIPS, PPC, SPARC, x86, and x86-64 architectures.
  • The official Boost.Coroutine2 by Oliver Kowalke (2015), compatible with the same hardware architectures but only C++ compilers/code conformant to the C++14 standard.
  • Mordor by Mozy (2010), compatible with Windows, Linux, and OS X, but seemingly no longer maintained.
  • CO2 by Jamboree (2015), supporting stackless coroutines only, using preprocessor tricks, and requiring C++14.

As Boost.Coroutine2 looks modern, is well-maintained, and is very much comparable to the Python constructs, I will use it in the rest of this article.5 It hides all the platform details with the help of Boost.Context. Now I can write code simply as follows for cat:

typedef boost::coroutines2::coroutine<const string&>
    coro_t;

void cat(coro_t::push_type& yield,
         int argc, char* argv[])
{
    for (int i = 1; i < argc; ++i) {
        ifstream ifs(argv[i]);
        for (;;) {
            string line;
            if (getline(ifs, line)) {
                yield(line);
            } else {
                break;
            }
        }
    }
}

int main(int argc, char* argv[])
{
    using namespace std::placeholders;
    for (auto& line : coro_t::pull_type(
             boost::coroutines2::fixedsize_stack(),
             bind(cat, _1, argc, argv))) {
        cout << line << endl;
    }
}

Is this simple and straightforward? The only thing that is not quite intuitive is the detail that the constructor of pull_type expects the second argument to be a function object that takes a push_type& as the only argument. That is why we need to use bind to generate it—a lambda expression being the other alternative.

I definitely believe being able to write coroutines is a big step forward to make C++ more expressive. I can foresee many tasks simplified, like recursive parsing. I believe this will prove very helpful in the C++ weaponry. I only wish we could see it standardized soon.

Appendix

The complete backsort code in Python:

#!/usr/bin/env python
#coding: utf-8

import sys

def cat(files):
    for fn in files:
        with open(fn) as f:
            for line in f:
                yield line.rstrip('\n')

def backsort(lines):
    result = {}
    for line in lines:
        result['.'.join(reversed(line.split('.')))] = line
    return map(lambda item: item[1],
               sorted(result.items()))

def main():
    if sys.argv[1:]:
        result = backsort(cat(sys.argv[1:]))
    else:
        result = backsort(map(
                lambda line: line.rstrip('\n'), sys.stdin))
    for line in result:
        print(line)

if __name__ == '__main__':
    main()

The complete backsort code in C++:

#include <assert.h>         // assert
#include <algorithm>        // std::reverse/transform
#include <fstream>          // std::ifstream
#include <functional>       // std::bind
#include <iostream>         // std::cin/cout
#include <map>              // std::map
#include <string>           // std::string
#include <vector>           // std::vector
#include <boost/coroutine2/all.hpp>

using namespace std;

typedef boost::coroutines2::coroutine<const string&>
    coro_t;

void cat(coro_t::push_type& yield,
         int argc, char* argv[])
{
    for (int i = 1; i < argc; ++i) {
        ifstream ifs(argv[i]);
        for (;;) {
            string line;
            if (getline(ifs, line)) {
                yield(line);
            } else {
                break;
            }
        }
    }
}

vector<string> split(const string& str, char delim)
{
    vector<string> result;
    string::size_type last_pos = 0;
    string::size_type pos = str.find(delim);
    while (pos != string::npos) {
        result.push_back(
            str.substr(last_pos, pos - last_pos));
        last_pos = pos + 1;
        pos = str.find(delim, last_pos);
        if (pos == string::npos) {
            result.push_back(str.substr(last_pos));
        }
    }
    return result;
}

template <typename C>
string join(const C& str_list, char delim)
{
    string result;
    for (auto& item : str_list) {
        result += item;
        result += delim;
    }
    if (result.size() != 0) {
        result.resize(result.size() - 1);
    }
    return result;
}

template <typename C>
vector<string> backsort(C&& lines)
{
    map<string, string> rmap;
    for (auto& line : lines) {
        auto split_line = split(line, '.');
        reverse(split_line.begin(), split_line.end());
        rmap[join(split_line, '.')] = line;
    }
    vector<string> result(rmap.size());
    transform(rmap.begin(), rmap.end(), result.begin(),
              [](const pair<string, string>& pr)
              {
                  return pr.second;
              });
    return result;
}

class istream_line_reader {
public:
    class iterator { // implements InputIterator
    public:
        typedef const string& reference;
        typedef string value_type;

        iterator() : stream_(nullptr) {}
        explicit iterator(istream& is) : stream_(&is)
        {
            ++*this;
        }

        reference operator*()
        {
            assert(stream_ != nullptr);
            return line_;
        }
        value_type* operator->()
        {
            assert(stream_ != nullptr);
            return &line_;
        }
        iterator& operator++()
        {
            getline(*stream_, line_);
            if (!*stream_) {
                stream_ = nullptr;
            }
            return *this;
        }
        iterator operator++(int)
        {
            iterator temp(*this);
            ++*this;
            return temp;
        }

        bool operator==(const iterator& rhs) const
        {
            return stream_ == rhs.stream_;
        }
        bool operator!=(const iterator& rhs) const
        {
            return !operator==(rhs);
        }

    private:
        istream* stream_;
        string line_;
    };

    explicit istream_line_reader(istream& is)
        : stream_(is)
    {
    }
    iterator begin() const
    {
        return iterator(stream_);
    }
    iterator end() const
    {
        return iterator();
    }

private:
    istream& stream_;
};

int main(int argc, char* argv[])
{
    using namespace std::placeholders;
    vector<string> result;
    if (argc > 1) {
        result = backsort(coro_t::pull_type(
            boost::coroutines2::fixedsize_stack(),
            bind(cat, _1, argc, argv)));
    } else {
        result = backsort(istream_line_reader(cin));
    }
    for (auto& item : result) {
       cout << item << endl;
    }
}

The istream_line_reader class is not really necessary, and we can simplify it with coroutines. I am including it here only to show what we have to write ‘normally’ (if we cannot use coroutines). Even if we remove it entirely, the C++ version will still have about three times as many non-trivial lines of code as the Python equivalent. It is enough proof to me that I should move away from C++ a little bit. . . .


  1. There is one gotcha: the ‘\n’ character will be part of the string. It will be handled in my solution. 
  2. Generally speaking, callbacks or similar techniques are what C++ programmers tend to use in similar circumstances, if the ‘producer’ part is complicated (otherwise the iterator pattern may be more suitable). Unfortunately, we cannot then combine the use of two simple functions like cat and backsort simultaneously. If we used callbacks, backsort would need to be modified and fragmented. 
  3. P0057 is one such effort, which is experimentally implemented in Visual Studio 2015
  4. According to the acknowledgement pages of next two Boost projects, Giovanni Deretta contributed to them. So his work was not in vain. 
  5. This said, CO2 is also well-maintained, and is more efficient if only a stackless coroutine is needed. See Jamboree’s answer on StackOverflow. The difference looks small to me, and preprocessor tricks are not my favourites, so I will not spend more time on CO2 for now.