Performance of My Line Readers

After I wrote the article about Python yield and C++ Coroutines, I felt that I needed to test the performance of istream_line_reader. The immediate result was both good and bad: good in that there was no actual difference between the straightforward std::getline and my istream_line_reader (as anticipated), and bad in that neither version performed well (a surprise to me). I vaguely remembered that sync_with_stdio(false) might affect the performance, so I also tested calling this function at the beginning. However, it did not seem to matter. By the way, Clang has been my favourite compiler recently (and I use a Mac).

Seeing that istream_line_reader had a performance problem, I tried other approaches. One thing I tried was using the traditional C I/O functions. I wrote another file_line_reader, which used either fgets or fread to read the data, depending on what the delimiter is. (fgets could only use ‘\n’ as the delimiter, but it performed better than fread, for I could fgets into the final line buffer, but had to fread into a temporary buffer first.) I also added a switch on whether to strip the delimiter, something not possible with the getline function. The result achieved a more than 10x performance improvement (from about 28 MB/s to 430 MB/s). I was happy, and presented this on the last slide of my presentation on C++ and Functional Programming in the 2016 C++ and System Software Summit (China).
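The idea of letting fgets write straight into a growing line buffer can be sketched as below. This is not the Nvwa implementation: the function name read_line, the use of std::vector<char> as the buffer, and the initial size of 16 are all assumptions for illustration. It also assumes the input contains no embedded null characters, since strlen is used to find how much fgets read.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical helper: reads one '\n'-delimited line from `fp` into `buf`,
// growing `buf` as needed.  The delimiter is kept, as fgets keeps it.
// Returns false only when nothing at all could be read (EOF or error).
bool read_line(std::FILE* fp, std::vector<char>& buf)
{
    if (buf.size() < 16) {
        buf.resize(16);
    }
    std::size_t len = 0;
    for (;;) {
        // fgets writes directly into the final buffer: no temporary copy
        if (!std::fgets(buf.data() + len,
                        static_cast<int>(buf.size() - len), fp)) {
            buf[len] = '\0';
            return len != 0;        // a partial last line still counts
        }
        len += std::strlen(buf.data() + len);
        if (buf[len - 1] == '\n') {
            return true;            // complete line
        }
        if (len + 1 < buf.size()) {
            return true;            // buffer not full: last line, no '\n'
        }
        buf.resize(buf.size() * 2); // line longer than the buffer: go on
    }
}
```

After a successful call, buf.data() is a null-terminated C string, which matches the char* interface described for file_line_reader.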

Until C++17, modifying the character array accessed through string::data() has undefined behaviour. To be on the safe side, I implemented a self-expanding character buffer on my own, which complicated the implementation a little bit. It also made the interface slightly different from istream_line_reader, which can be demonstrated in the following code snippets.

Iteration with istream_line_reader:

for (auto& line : istream_line_reader(cin)) {
    puts(line.c_str());
}

Iteration with file_line_reader:

for (auto& line : file_line_reader(stdin)) {
    puts(line);
}

I.e. each iteration with file_line_reader returns a char* instead of a string. This should be OK, as a raw character pointer is often enough. One can always construct a string from char* easily, anyway.


After the presentation, I turned to implementing a small enhancement: iterating over the lines with mmap. This proved interesting work. Not only did it improve the line reading performance, but the code was simplified as well. As I could access the file content directly with a pointer, I was able to copy the lines to a string simply with string::assign. As I used string again, there was no need to define a custom copy constructor, copy assignment operator, move constructor, and move assignment operator as well. The performance was, of course, also good: the throughput rate reached 650 MB/s, a 50% improvement! The only negative side was that it could not work on stdin, so testing it required more lines. Apart from that, I was quite satisfied. And I had three different line readers that could take an istream&, FILE*, or file descriptor as the input source. So all situations were dealt with. Not bad!
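A minimal sketch of the mmap approach, assuming a POSIX system, might look like the following. It is not the Nvwa mmap_line_reader: the function name read_lines_mmap is hypothetical, it takes a path rather than a file descriptor, it collects all lines into a vector instead of exposing iterators, and it always strips the delimiter.

```cpp
#include <cstddef>
#include <cstring>      // memchr
#include <stdexcept>
#include <string>
#include <vector>
#include <fcntl.h>      // open
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <unistd.h>     // close

// Hypothetical helper: maps the whole file and returns its lines.
std::vector<std::string> read_lines_mmap(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        throw std::runtime_error("cannot open file");
    }
    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        throw std::runtime_error("cannot stat file");
    }
    std::vector<std::string> lines;
    std::size_t size = static_cast<std::size_t>(st.st_size);
    if (size == 0) {                // zero-length mappings are rejected
        close(fd);
        return lines;
    }
    void* addr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      // the mapping stays valid after close
    if (addr == MAP_FAILED) {
        throw std::runtime_error("cannot map file");
    }
    const char* p = static_cast<const char*>(addr);
    const char* end = p + size;
    std::string line;
    while (p != end) {
        const void* nl =
            std::memchr(p, '\n', static_cast<std::size_t>(end - p));
        const char* stop = nl ? static_cast<const char*>(nl) : end;
        line.assign(p, static_cast<std::size_t>(stop - p)); // one bulk copy
        lines.push_back(line);
        p = nl ? stop + 1 : end;
    }
    munmap(addr, size);
    return lines;
}
```

For brevity the sketch does not release the mapping if push_back throws; a real implementation would wrap the mapping in an RAII object.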

One thing of note about the implementation. I tried copying (a character at a time) while searching, before adopting the current method of searching first before assigning to the string. The latter proved faster when dealing with long lines. I can see two reasons:

  1. Strings are normally (and required to be since C++11) null-terminated, so copying one character at a time has a big overhead of zeroing the next byte. I confirmed the case from the libc++ source code of Clang.
  2. Assignment can use memcpy or memmove internally, which normally has a fast platform-specific implementation. In the case of string::assign(const char*, size_t), I verified that libc++ used memmove indeed.
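The two strategies compared above can be sketched side by side. Both function names are hypothetical, and the use of memchr for the search is an assumption; the text only says that the search comes first and the assignment second.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Strategy 1: copy one character at a time while looking for the
// delimiter.  Every push_back also has to keep the string terminated.
std::size_t copy_while_searching(const char* p, std::size_t n,
                                 char delim, std::string& out)
{
    out.clear();
    std::size_t i = 0;
    for (; i < n && p[i] != delim; ++i) {
        out.push_back(p[i]);
    }
    return i;   // length of the line, delimiter excluded
}

// Strategy 2: find the delimiter first, then copy the line in bulk.
std::size_t search_then_assign(const char* p, std::size_t n,
                               char delim, std::string& out)
{
    const void* hit = std::memchr(p, delim, n);
    std::size_t len =
        hit ? static_cast<std::size_t>(static_cast<const char*>(hit) - p)
            : n;
    out.assign(p, len);
    return len;
}
```

Both functions produce the same string; they differ only in how many small writes the string has to absorb on the way.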

If you are interested, this is the assembly code I finally traced into on my Mac (comments are my analysis; you may need to scroll horizontally to see them all):

libsystem_c.dylib`memcpy$VARIANT$sse42:
   0x7fff9291fcbd:  pushq  %rbp
   0x7fff9291fcbe:  movq   %rsp, %rbp
   0x7fff9291fcc1:  movq   %rdi, %r11           ; save dest
   0x7fff9291fcc4:  movq   %rdi, %rax
   0x7fff9291fcc7:  subq   %rsi, %rax           ; dest - src
   0x7fff9291fcca:  cmpq   %rdx, %rax
   0x7fff9291fccd:  jb     0x7fff9291fd04       ; dest in (src, src + len)?
   ; Entry condition: dest <= src or dest >= src + len; copy starts from front
   0x7fff9291fccf:  cmpq   $80, %rdx
   0x7fff9291fcd3:  ja     0x7fff9291fd09       ; len > 128?
   ; Entry condition: len <= 128
   0x7fff9291fcd5:  movl   %edx, %ecx
   0x7fff9291fcd7:  shrl   $2, %ecx             ; len / 4
   0x7fff9291fcda:  je     0x7fff9291fcec       ; len < 4?
   0x7fff9291fcdc:  movl   (%rsi), %eax         ; 4-byte read
   0x7fff9291fcde:  addq   $4, %rsi             ; src <- src + 4
   0x7fff9291fce2:  movl   %eax, (%rdi)         ; 4-byte write
   0x7fff9291fce4:  addq   $4, %rdi             ; dest <- dest + 4
   0x7fff9291fce8:  decl   %ecx
   0x7fff9291fcea:  jne    0x7fff9291fcdc       ; more 4-byte blocks?
   ; Entry condition: len < 4
   0x7fff9291fcec:  andl   $3, %edx
   0x7fff9291fcef:  je     0x7fff9291fcff       ; len == 0?
   0x7fff9291fcf1:  movb   (%rsi), %al          ; 1-byte read
   0x7fff9291fcf3:  incq   %rsi                 ; src <- src + 1
   0x7fff9291fcf6:  movb   %al, (%rdi)          ; 1-byte write
   0x7fff9291fcf8:  incq   %rdi                 ; dest <- dest + 1
   0x7fff9291fcfb:  decl   %edx
   0x7fff9291fcfd:  jne    0x7fff9291fcf1       ; more bytes?
   0x7fff9291fcff:  movq   %r11, %rax           ; restore dest
   0x7fff9291fd02:  popq   %rbp
   0x7fff9291fd03:  ret
   0x7fff9291fd04:  jmpq   0x7fff9291fdb9
   ; Entry condition: len > 128
   0x7fff9291fd09:  movl   %edi, %ecx
   0x7fff9291fd0b:  negl   %ecx
   0x7fff9291fd0d:  andl   $15, %ecx            ; 16 - dest % 16
   0x7fff9291fd10:  je     0x7fff9291fd22       ; dest 16-byte aligned?
   0x7fff9291fd12:  subl   %ecx, %edx           ; adjust len
   0x7fff9291fd14:  movb   (%rsi), %al          ; one-byte read
   0x7fff9291fd16:  incq   %rsi                 ; src <- src + 1
   0x7fff9291fd19:  movb   %al, (%rdi)          ; one-byte write
   0x7fff9291fd1b:  incq   %rdi                 ; dest <- dest + 1
   0x7fff9291fd1e:  decl   %ecx
   0x7fff9291fd20:  jne    0x7fff9291fd14       ; until dest is aligned
   ; Entry condition: dest is 16-byte aligned
   0x7fff9291fd22:  movq   %rdx, %rcx           ; len
   0x7fff9291fd25:  andl   $63, %edx            ; len % 64
   0x7fff9291fd28:  andq   $-64, %rcx           ; len <- align64(len)
   0x7fff9291fd2c:  addq   %rcx, %rsi           ; src <- src + len
   0x7fff9291fd2f:  addq   %rcx, %rdi           ; dest <- dest + len
   0x7fff9291fd32:  negq   %rcx                 ; len <- -len
   0x7fff9291fd35:  testl  $15, %esi
   0x7fff9291fd3b:  jne    0x7fff9291fd80       ; src not 16-byte aligned?
   0x7fff9291fd3d:  jmp    0x7fff9291fd40
   0x7fff9291fd3f:  nop
   ; Entry condition: both src and dest are 16-byte aligned
   0x7fff9291fd40:  movdqa (%rsi,%rcx), %xmm0   ; aligned 16-byte read
   0x7fff9291fd45:  movdqa 16(%rsi,%rcx), %xmm1
   0x7fff9291fd4b:  movdqa 32(%rsi,%rcx), %xmm2
   0x7fff9291fd51:  movdqa 48(%rsi,%rcx), %xmm3
   0x7fff9291fd57:  movdqa %xmm0, (%rdi,%rcx)   ; aligned 16-byte write
   0x7fff9291fd5c:  movdqa %xmm1, 16(%rdi,%rcx)
   0x7fff9291fd62:  movdqa %xmm2, 32(%rdi,%rcx)
   0x7fff9291fd68:  movdqa %xmm3, 48(%rdi,%rcx)
   0x7fff9291fd6e:  addq   $64, %rcx
   0x7fff9291fd72:  jne    0x7fff9291fd40       ; more 64-byte blocks?
   0x7fff9291fd74:  jmpq   0x7fff9291fcd5
   0x7fff9291fd79:  nopl   (%rax)               ; 7-byte nop
   ; Entry condition: src is NOT 16-byte aligned but dest is
   0x7fff9291fd80:  movdqu (%rsi,%rcx), %xmm0   ; unaligned 16-byte read
   0x7fff9291fd85:  movdqu 16(%rsi,%rcx), %xmm1
   0x7fff9291fd8b:  movdqu 32(%rsi,%rcx), %xmm2
   0x7fff9291fd91:  movdqu 48(%rsi,%rcx), %xmm3
   0x7fff9291fd97:  movdqa %xmm0, (%rdi,%rcx)   ; aligned 16-byte write
   0x7fff9291fd9c:  movdqa %xmm1, 16(%rdi,%rcx)
   0x7fff9291fda2:  movdqa %xmm2, 32(%rdi,%rcx)
   0x7fff9291fda8:  movdqa %xmm3, 48(%rdi,%rcx)
   0x7fff9291fdae:  addq   $64, %rcx
   0x7fff9291fdb2:  jne    0x7fff9291fd80       ; more 64-byte blocks?
   0x7fff9291fdb4:  jmpq   0x7fff9291fcd5
   ; Entry condition: dest > src and dest < src + len; copy starts from back
   0x7fff9291fdb9:  addq   %rdx, %rsi           ; src <- src + len
   0x7fff9291fdbc:  addq   %rdx, %rdi           ; dest <- dest + len
   0x7fff9291fdbf:  cmpq   $80, %rdx
   0x7fff9291fdc3:  ja     0x7fff9291fdf6       ; len > 128?
   ; Entry condition: len <= 128
   0x7fff9291fdc5:  movl   %edx, %ecx
   0x7fff9291fdc7:  shrl   $3, %ecx             ; len / 8
   0x7fff9291fdca:  je     0x7fff9291fdde       ; len < 8?
   ; Entry condition: len >= 8
   0x7fff9291fdcc:  subq   $8, %rsi             ; src <- src - 8
   0x7fff9291fdd0:  movq   (%rsi), %rax         ; 8-byte read
   0x7fff9291fdd3:  subq   $8, %rdi             ; dest <- dest - 8
   0x7fff9291fdd7:  movq   %rax, (%rdi)         ; 8-byte write
   0x7fff9291fdda:  decl   %ecx
   0x7fff9291fddc:  jne    0x7fff9291fdcc       ; until len < 8
   ; Entry condition: len < 8
   0x7fff9291fdde:  andl   $7, %edx
   0x7fff9291fde1:  je     0x7fff9291fdf1       ; len == 0?
   0x7fff9291fde3:  decq   %rsi                 ; src <- src - 1
   0x7fff9291fde6:  movb   (%rsi), %al          ; 1-byte read
   0x7fff9291fde8:  decq   %rdi                 ; dest <- dest - 1
   0x7fff9291fdeb:  movb   %al, (%rdi)          ; 1-byte write
   0x7fff9291fded:  decl   %edx
   0x7fff9291fdef:  jne    0x7fff9291fde3       ; more bytes?
   0x7fff9291fdf1:  movq   %r11, %rax           ; restore dest
   0x7fff9291fdf4:  popq   %rbp
   0x7fff9291fdf5:  ret
   ; Entry condition: len > 128
   0x7fff9291fdf6:  movl   %edi, %ecx
   0x7fff9291fdf8:  andl   $15, %ecx
   0x7fff9291fdfb:  je     0x7fff9291fe0e       ; dest 16-byte aligned?
   0x7fff9291fdfd:  subq   %rcx, %rdx           ; adjust len
   0x7fff9291fe00:  decq   %rsi                 ; src <- src - 1
   0x7fff9291fe03:  movb   (%rsi), %al          ; one-byte read
   0x7fff9291fe05:  decq   %rdi                 ; dest <- dest - 1
   0x7fff9291fe08:  movb   %al, (%rdi)          ; one-byte write
   0x7fff9291fe0a:  decl   %ecx
   0x7fff9291fe0c:  jne    0x7fff9291fe00       ; until dest is aligned
   ; Entry condition: dest is 16-byte aligned
   0x7fff9291fe0e:  movq   %rdx, %rcx           ; len
   0x7fff9291fe11:  andl   $63, %edx            ; len % 64
   0x7fff9291fe14:  andq   $-64, %rcx           ; len <- align64(len)
   0x7fff9291fe18:  subq   %rcx, %rsi           ; src <- src - len
   0x7fff9291fe1b:  subq   %rcx, %rdi           ; dest <- dest - len
   0x7fff9291fe1e:  testl  $15, %esi
   0x7fff9291fe24:  jne    0x7fff9291fe61       ; src not 16-byte aligned?
   ; Entry condition: both src and dest are 16-byte aligned
   0x7fff9291fe26:  movdqa -16(%rsi,%rcx), %xmm0; aligned 16-byte read
   0x7fff9291fe2c:  movdqa -32(%rsi,%rcx), %xmm1
   0x7fff9291fe32:  movdqa -48(%rsi,%rcx), %xmm2
   0x7fff9291fe38:  movdqa -64(%rsi,%rcx), %xmm3
   0x7fff9291fe3e:  movdqa %xmm0, -16(%rdi,%rcx); aligned 16-byte write
   0x7fff9291fe44:  movdqa %xmm1, -32(%rdi,%rcx)
   0x7fff9291fe4a:  movdqa %xmm2, -48(%rdi,%rcx)
   0x7fff9291fe50:  movdqa %xmm3, -64(%rdi,%rcx)
   0x7fff9291fe56:  subq   $64, %rcx
   0x7fff9291fe5a:  jne    0x7fff9291fe26       ; more 64-byte blocks?
   0x7fff9291fe5c:  jmpq   0x7fff9291fdc5
   ; Entry condition: src is NOT 16-byte aligned but dest is
   0x7fff9291fe61:  movdqu -16(%rsi,%rcx), %xmm0; unaligned 16-byte read
   0x7fff9291fe67:  movdqu -32(%rsi,%rcx), %xmm1
   0x7fff9291fe6d:  movdqu -48(%rsi,%rcx), %xmm2
   0x7fff9291fe73:  movdqu -64(%rsi,%rcx), %xmm3
   0x7fff9291fe79:  movdqa %xmm0, -16(%rdi,%rcx); aligned 16-byte write
   0x7fff9291fe7f:  movdqa %xmm1, -32(%rdi,%rcx)
   0x7fff9291fe85:  movdqa %xmm2, -48(%rdi,%rcx)
   0x7fff9291fe8b:  movdqa %xmm3, -64(%rdi,%rcx)
   0x7fff9291fe91:  subq   $64, %rcx
   0x7fff9291fe95:  jne    0x7fff9291fe61       ; more 64-byte blocks?
   0x7fff9291fe97:  jmpq   0x7fff9291fdc5

I am happy that I can take advantage of such optimizations, but do not need to write such code on my own—there are so many different cases to deal with!


Of course, nothing is simple regarding performance. More tests revealed more facts that are interesting and/or surprising:

  • While libc++ (it is the library, but not the compiler, that matters here) seems to completely ignore sync_with_stdio, it makes a big difference in libstdc++. The same function call gets a more than 10x performance improvement when the program is compiled with GCC (which uses libstdc++), from ~28 MB/s to ~390 MB/s. It shows that I made a wrong assumption! Interestingly, reading from stdin (piped from the pv tool) is slightly faster than reading from a file on my Mac (when compiled with GCC).
  • On a CentOS 6.5 Linux system, sync_with_stdio(false) has a bigger performance win (~23 MB/s vs. ~800 MB/s). Reading from a file directly is even faster at 1100 MB/s. That totally beats my file_line_reader (~550 MB/s reading a file directly) and mmap_line_reader (~600 MB/s reading a file directly) on the same machine. I was stunned when first seeing this performance difference of nearly 40 times!

So, apart from the slight difference in versatility, the first and simplest form of my line readers is also the best on Linux, while the mmap-based version may be a better implementation on OS X—though your mileage may vary depending on the different combinations of OS versions, compilers, and hardware. Should I be happy, or sad?


You can find the implementation of istream_line_reader among my example code for the ‘C++ and Functional Programming’ presentation, and the implementations of file_line_reader and mmap_line_reader in the Nvwa repository. And the test code is as follows:

test_istream_line_reader.cpp:

#include <fstream>
#include <iostream>
#include <string>
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include "istream_line_reader.h"

using namespace std;

int main(int argc, char* argv[])
{
    int optch;  // getopt returns int; char may be unsigned on some platforms
    while ( (optch = getopt(argc, argv, "s")) != -1) {
        switch (optch) {
        case 's':
            cin.sync_with_stdio(false);
            break;
        }
    }
    if (!(optind == argc || optind == argc - 1)) {
        fprintf(stderr,
                "Only one file name can be specified\n");
        exit(1);
    }

    istream* is = nullptr;
    ifstream ifs;
    if (optind == argc) {
        is = &cin;
    } else {
        ifs.open(argv[optind]);
        if (!ifs) {
            fprintf(stderr,
                    "Cannot open file '%s'\n",
                    argv[optind]);
            exit(1);
        }
        is = &ifs;
    }

    for (auto& line : istream_line_reader(*is)) {
        puts(line.c_str());
    }
}

test_file_line_reader.cpp:

#include <stdio.h>
#include <stdlib.h>
#include <nvwa/file_line_reader.h>

using nvwa::file_line_reader;

int main(int argc, char* argv[])
{
    FILE* fp = stdin;
    if (argc == 2) {
        fp = fopen(argv[1], "r");
        if (!fp) {
            fprintf(stderr,
                    "Cannot open file '%s'\n",
                    argv[1]);
            exit(1);
        }
    }

    file_line_reader
        reader(fp, '\n',
               file_line_reader::no_strip_delimiter);
    for (auto& line : reader) {
        fputs(line, stdout);
    }
}

test_mmap_line_reader.cpp:

#include <stdio.h>
#include <stdlib.h>
#include <stdexcept>
#include <nvwa/mmap_line_reader.h>

using nvwa::mmap_line_reader;

int main(int argc, char* argv[])
{
    if (argc != 2) {
        fprintf(stderr,
                "A file name shall be provided\n");
        exit(1);
    }

    try {
        mmap_line_reader
            reader(argv[1], '\n',
                   mmap_line_reader::no_strip_delimiter);

        for (auto& str : reader) {
            fputs(str.c_str(), stdout);
        }
    }
    catch (std::runtime_error& e) {
        puts(e.what());
    }
}

Upgrading to Boost 1.61 in MacPorts

The Boost version in MacPorts was still 1.59.0, a year old now. When I wrote about Boost.Coroutine2, I found I had to install the latest Boost version 1.61.0. So I had two sets of Boost libraries on my hard drive, which made things . . . er . . . a little bit complicated. After I built Microsoft’s cpprestsdk last night (I managed to make it find and use the MacPorts Boost libraries), I felt a stronger urge to change the situation. So this morning I subscribed to the MacPorts mailing list and posted a question about the outdated version. With the help of Mr Michael Dickens and Google, I have a working port of Boost 1.61.0 now. This article documents the procedure.

The first thing one needs to do is check out the port files from the MacPorts Subversion repository. In my case, the boost files are under devel/boost. So I checked out only the boost directory into ~/Programming/MacPorts/devel.

One then needs to tell MacPorts to look for ports in that directory. There are two steps involved:

  1. Add the URL of the local ports directory (e.g. ‘file:///Users/yongwei/Programming/MacPorts’ in my case) to /opt/local/etc/macports/sources.conf, above the default rsync URL.
  2. Run the portindex command under that directory. It needs to be rerun every time a Portfile is changed.

Now MacPorts should find ports first in my local ports directory and then the system default. And I could begin patching the files.

It turned out that people tried to update boost half a year ago for Boost 1.60, but they found there were failing ports and the ABI was incompatible with 1.59. The patch was still good to me, as I had now a good example. I simply applied the patch, ran portindex again, and went ahead to port upgrade boost.

The procedure turned out quite smooth, though mkvtoolnix, the only installed port that depended on boost on my laptop, failed to run after the upgrade. I had to port uninstall it and then port install it again (rebuilding it).

After I had some confidence, I began to change the port files. I first changed Portfile, which contained the version information and file checksums. Updating them was trivial. When I could see the new version 1.61.0 from port info boost, I kicked off the build with port upgrade boost again.

Then came the more painful process of fixing the patch files under devel/boost/files (the ‘patch’ I mentioned a moment ago actually contained patches for these patch files). Most of these MacPorts-specific patch files could be applied without any problems, but one of them failed. It was actually due to trivial code changes in Boost, but I still had to check all the rejections, manually apply the changes, and generate a new patch file. After that, everything went on smoothly.

Against all my hopes, I found that I had to rebuild mkvtoolnix yet again. So the ABI instability is really an issue, and I understand now why boost was stuck at the old version for such a long time. However, I considered my task completed when I uploaded the updated patch to the MacPorts ticket. At least I have the new working port of boost for myself now. And you can have it too.

Python yield and C++ Coroutines

Back in 2008, an old friend challenged me with a programming puzzle when we both attended a wedding. He later gave a solution in Python. Compared with my C++ solution, the Python one had about half the lines of code. I was not smart enough to begin learning Python then, but instead put an end to my little interest in Python with a presentation at C++ Conference China 2009. I compared the basic constructs, and concluded they were equivalent, not realizing that Python had more to offer than that trivial programming exercise showed.

Fast forwarding to today (2016), I am really writing some (small) programs in Python. I have begun to appreciate Python more and more. For many of the tasks, the performance loss in executing the code is negligible, but the productivity boost is huge. I have also realized that there are constructs in Python that are not easily reproducible in other languages. Generator/yield is one of them.

The other day, I was writing a small program to sort hosts based on dot-reversed order so as to group the host names in a more reasonable order (regarding ‘www.wordpress.com’ as ‘com.wordpress.www’). I quickly came up with a solution in Python. The code is very short and I can show it right here:

def backsort(lines):
    result = {}
    for line in lines:
        result['.'.join(reversed(line.split('.')))] = line
    return map(lambda item: item[1],
               sorted(result.items()))

Of course, we can implement a similar function in C++11. We will immediately find that there are no standard implementations for split and join (see Appendix below for my implementation). Regardless, we can write some code like:

template <typename C>
vector<string> backsort(C&& lines)
{
    map<string, string> rmap;
    for (auto& line : lines) {
        auto split_line = split(line, '.');
        reverse(split_line.begin(), split_line.end());
        rmap[join(split_line, '.')] = line;
    }
    vector<string> result(rmap.size());
    transform(rmap.begin(), rmap.end(), result.begin(),
              [](const pair<string, string>& pr)
              {
                  return pr.second;
              });
    return result;
}

Even though it has twice the non-trivial lines of code and is a function template, there is immediately something Python can do readily but C++ cannot. I can give the Python file object (like sys.stdin) directly to backsort, and the for line will iterate through the file content.1 This is because the Python file object implements the iterator protocol over lines of text, but the C++ istream does not do anything similar.

Let us forget this C++ detail, and focus on the problem. My Python code accepts an iterator, and ‘backsorts’ all the input lines. Can we make it process multiple files (like the cat command line), without changing the backsort function?

Of course it can be done. There is a traditional way, and there is a smart way. The traditional way is to write a class that implements the iterator protocol (which can be readily modelled by C++):

class cat:
    def __init__(self, files):
        self.files = files
        self.cur_file = None

    def __iter__(self):
        return self

    def next(self):
        while True:
            if self.cur_file:
                line = self.cur_file.readline()
                if line:
                    return line.rstrip('\n')
                self.cur_file.close()
            if self.files:
                self.cur_file = open(self.files[0])
                self.files = self.files[1:]
            else:
                raise StopIteration()

We can then cat files by the following lines:

if __name__ == '__main__':
    if sys.argv[1:]:
        for line in cat(sys.argv[1:]):
            print(line)

Using yield, we can reduce the 18 lines of code of cat to only 5:

def cat(files):
    for fn in files:
        with open(fn) as f:
            for line in f:
                yield line.rstrip('\n')

There is no more bookkeeping of the current file and the unprocessed files, and everything is wrapped in simple loops. Isn’t that amazing? I actually learnt about the concept before (in C#), but never used it in real code, perhaps because I was too much framed by existing code, using callbacks, the observer pattern, and the like. Those ‘patterns’ now look ugly when compared to the simplicity of generators.

Here comes the real challenge for C++ developers: Can we do the same in C++? Can we do something better than inelegant callbacks? 2


My investigations so far indicate the following: No C++ standards (up to C++14) support such constructs, and there is no portable way to implement them as a library.

Are we doomed? No. Apart from standardization efforts regarding coroutines (which is the ancient name for a superset of generators, dated from 1958) in C++,3 there have been at least five cross-platform implementations for C++:

  • The unofficial Boost.Coroutine by Giovanni P. Deretta (2006), compatible with Windows, Linux, and maybe a few Unix variants (tested not working on OS X); apparently abandoned.4
  • The official Boost.Coroutine by Oliver Kowalke (2013), compatible with ARM, MIPS, PPC, SPARC, x86, and x86-64 architectures.
  • The official Boost.Coroutine2 by Oliver Kowalke (2015), compatible with the same hardware architectures but only C++ compilers/code conformant to the C++14 standard.
  • Mordor by Mozy (2010), compatible with Windows, Linux, and OS X, but seemingly no longer maintained.
  • CO2 by Jamboree (2015), supporting stackless coroutines only, using preprocessor tricks, and requiring C++14.

As Boost.Coroutine2 looks modern, is well-maintained, and is very much comparable to the Python constructs, I will use it in the rest of this article.5 It hides all the platform details with the help of Boost.Context. Now I can write code simply as follows for cat:

typedef boost::coroutines2::coroutine<const string&> coro_t;

void cat(coro_t::push_type& yield, int argc, char* argv[])
{
    for (int i = 1; i < argc; ++i) {
        ifstream ifs(argv[i]);
        for (;;) {
            string line;
            if (getline(ifs, line)) {
                yield(line);
            } else {
                break;
            }
        }
    }
}

int main(int argc, char* argv[])
{
    using namespace std::placeholders;
    for (auto& line : coro_t::pull_type(
             boost::coroutines2::fixedsize_stack(),
             bind(cat, _1, argc, argv))) {
        cout << line << endl;
    }
}

Is this simple and straightforward? The only thing that is not quite intuitive is the detail that the constructor of pull_type expects the second argument to be a function object that takes a push_type& as the only argument. That is why we need to use bind to generate it—a lambda expression being the other alternative.
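The equivalence of the bind form and the lambda form can be shown without Boost. In the sketch below, sink_t is only a stand-in for coro_t::push_type, and produce, with_bind, and with_lambda are hypothetical names. The point is just that both expressions yield a callable taking the sink as its only argument, which is the shape the pull_type constructor is described as expecting.

```cpp
#include <functional>
#include <string>

// Stand-in for coro_t::push_type (an assumption for illustration only)
typedef std::function<void(const std::string&)> sink_t;

// Same parameter layout as cat: the sink first, then the extra arguments
void produce(sink_t& yield, int argc, const char* const argv[])
{
    for (int i = 1; i < argc; ++i) {
        yield(argv[i]);
    }
}

// bind fixes the trailing arguments and leaves the sink as placeholder _1
std::function<void(sink_t&)> with_bind(int argc,
                                       const char* const argv[])
{
    using namespace std::placeholders;
    return std::bind(produce, _1, argc, argv);
}

// The lambda captures the same arguments and takes the sink explicitly
std::function<void(sink_t&)> with_lambda(int argc,
                                         const char* const argv[])
{
    return [argc, argv](sink_t& yield) { produce(yield, argc, argv); };
}
```

Either result could be invoked with a sink object; the lambda spells out the parameter type, which some readers may find easier to follow than the placeholder.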

I definitely believe being able to write coroutines is a big step forward to make C++ more expressive. I can foresee many tasks simplified, like recursive parsing. I believe this will prove very helpful in the C++ weaponry. I only wish we could see it standardized soon.

Appendix

The complete backsort code in Python:

#!/usr/bin/env python
#coding: utf-8

import sys

def cat(files):
    for fn in files:
        with open(fn) as f:
            for line in f:
                yield line.rstrip('\n')

def backsort(lines):
    result = {}
    for line in lines:
        result['.'.join(reversed(line.split('.')))] = line
    return map(lambda item: item[1],
               sorted(result.items()))

def main():
    if sys.argv[1:]:
        result = backsort(cat(sys.argv[1:]))
    else:
        result = backsort(map(
                lambda line: line.rstrip('\n'), sys.stdin))
    for line in result:
        print(line)

if __name__ == '__main__':
    main()

The complete backsort code in C++:

#include <assert.h>         // assert
#include <algorithm>        // std::reverse/transform
#include <fstream>          // std::ifstream
#include <functional>       // std::bind
#include <iostream>         // std::cin/cout
#include <map>              // std::map
#include <string>           // std::string
#include <vector>           // std::vector
#include <boost/coroutine2/all.hpp>

using namespace std;

typedef boost::coroutines2::coroutine<const string&> coro_t;

void cat(coro_t::push_type& yield, int argc, char* argv[])
{
    for (int i = 1; i < argc; ++i) {
        ifstream ifs(argv[i]);
        for (;;) {
            string line;
            if (getline(ifs, line)) {
                yield(line);
            } else {
                break;
            }
        }
    }
}

vector<string> split(const string& str, char delim)
{
    vector<string> result;
    string::size_type last_pos = 0;
    string::size_type pos;
    // Every delimiter ends a field; whatever follows the last delimiter
    // is the final field, so a string without delimiters yields itself
    while ((pos = str.find(delim, last_pos)) != string::npos) {
        result.push_back(
            str.substr(last_pos, pos - last_pos));
        last_pos = pos + 1;
    }
    result.push_back(str.substr(last_pos));
    return result;
}

template <typename C>
string join(const C& str_list, char delim)
{
    string result;
    for (auto& item : str_list) {
        result += item;
        result += delim;
    }
    if (result.size() != 0) {
        result.resize(result.size() - 1);
    }
    return result;
}

template <typename C>
vector<string> backsort(C&& lines)
{
    map<string, string> rmap;
    for (auto& line : lines) {
        auto split_line = split(line, '.');
        reverse(split_line.begin(), split_line.end());
        rmap[join(split_line, '.')] = line;
    }
    vector<string> result(rmap.size());
    transform(rmap.begin(), rmap.end(), result.begin(),
              [](const pair<string, string>& pr)
              {
                  return pr.second;
              });
    return result;
}

class istream_line_reader {
public:
    class iterator { // implements InputIterator
    public:
        typedef const string& reference;
        typedef string value_type;

        iterator() : stream_(nullptr) {}
        explicit iterator(istream& is) : stream_(&is)
        {
            ++*this;
        }

        reference operator*()
        {
            assert(stream_ != nullptr);
            return line_;
        }
        value_type* operator->()
        {
            assert(stream_ != nullptr);
            return &line_;
        }
        iterator& operator++()
        {
            getline(*stream_, line_);
            if (!*stream_) {
                stream_ = nullptr;
            }
            return *this;
        }
        iterator operator++(int)
        {
            iterator temp(*this);
            ++*this;
            return temp;
        }

        bool operator==(const iterator& rhs) const
        {
            return stream_ == rhs.stream_;
        }
        bool operator!=(const iterator& rhs) const
        {
            return !operator==(rhs);
        }

    private:
        istream* stream_;
        string line_;
    };

    explicit istream_line_reader(istream& is)
        : stream_(is)
    {
    }
    iterator begin() const
    {
        return iterator(stream_);
    }
    iterator end() const
    {
        return iterator();
    }

private:
    istream& stream_;
};

int main(int argc, char* argv[])
{
    using namespace std::placeholders;
    vector<string> result;
    if (argc > 1) {
        result = backsort(coro_t::pull_type(
            boost::coroutines2::fixedsize_stack(),
            bind(cat, _1, argc, argv)));
    } else {
        result = backsort(istream_line_reader(cin));
    }
    for (auto& item : result) {
       cout << item << endl;
    }
}

The istream_line_reader class is not really necessary, and we can simplify it with coroutines. I am including it here only to show what we have to write ‘normally’ (if we cannot use coroutines). Even if we remove it entirely, the C++ version will still have about three times as many non-trivial lines of code as the Python equivalent. It is enough proof to me that I should move away from C++ a little bit. . . .


  1. There is one gotcha: the ‘\n’ character will be part of the string. It will be handled in my solution. 
  2. Generally speaking, callbacks or similar techniques are what C++ programmers tend to use in similar circumstances, if the ‘producer’ part is complicated (otherwise the iterator pattern may be more suitable). Unfortunately, we cannot then combine the use of two simple functions like cat and backsort simultaneously. If we used callbacks, backsort would need to be modified and fragmented. 
  3. P0057 is one such effort, which is experimentally implemented in Visual Studio 2015.
  4. According to the acknowledgement pages of the next two Boost projects, Giovanni Deretta contributed to them. So his work was not in vain. 
  5. This said, CO2 is also well-maintained, and is more efficient if only a stackless coroutine is needed. See Jamboree’s answer on StackOverflow. The difference looks small to me, and preprocessor tricks are not my favourites, so I will not spend more time on CO2 for now. 

Choosing a Multi-Precision Library for C++—A Critique

The Problem

After reading From Mathematics to Generic Programming,1 I accumulated some template code related to the RSA algorithm,2 but I tested it only with native integer types. Some recent events required me to use it for calculations that involve data sizes slightly bigger than those of the built-in types. I had to find some big-number libraries to help the calculation.

This is what the code looks like:

template <Integer N>
inline bool odd(N n)
{
    return n % 2 == 1;
}

template <Integer N>
inline N half(N n)
{
    return n / 2;
}

…

template <EuclideanDomain E>
std::pair<E, E> extended_gcd(E a, E b)
{
    …
}

template <Regular A, Integer N, SemigroupOperation Op>
A power_accumulate_semigroup(A r, A a, N n, Op op)
{
    …
}

template <Regular A, Integer N, SemigroupOperation Op>
A power_semigroup(A a, N n, Op op)
{
    assert(n > 0);
    while (!odd(n)) {
        a = op(a, a);
        n = half(n);
    }
    if (n == 1) {
        return a;
    }
    return power_accumulate_semigroup(a, op(a, a),
                                      half(n - 1), op);
}

template <Integer N>
struct modulo_multiply {
    modulo_multiply(const N& i) : modulus(i) {}
    N operator()(const N& n, const N& m) const
    {
        return (n * m) % modulus;
    }
private:
    N modulus;
};

int main()
{
    typedef … int_type;

    int_type p1 = …;
    int_type p2 = …;
    int_type n = p1 * p2;
    int_type phi = (p1 - 1) * (p2 - 1);
    int_type pub = …;

    pair<int_type, int_type> p = extended_gcd(pub, phi);
    // Check that p.second == 1
    int_type prv = p.first;
    …

    int_type encrypt = …;

    cout << "Encrypted message is " << encrypt << endl;
    cout << "Decrypted message is "
         << power_semigroup(encrypt, prv,
                            modulo_multiply<int_type>(n))
         << endl;
}

There are a few details omitted here, but the point is that I already had code that worked when int_type was defined as int64_t, and I needed types that could represent higher precision and work with the existing code with minimal changes.
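To make the starting point concrete, here is a minimal sketch of the same square-and-multiply idea with the integer type fixed to int64_t. The name mod_pow and the tiny key in the comment (n = 61 × 53 = 3233, public exponent 17, private exponent 2753, a common textbook example) are assumptions for illustration and not part of the code above.

```cpp
#include <cstdint>

// Hypothetical helper: square-and-multiply modular exponentiation, the same
// idea as power_semigroup combined with modulo_multiply, but hard-wired to
// int64_t. The intermediate product must fit in 63 bits, so the modulus has
// to stay below roughly 3 * 10^9. That limit is what makes a big-number
// type necessary for realistic key sizes.
inline std::int64_t mod_pow(std::int64_t base, std::int64_t exponent,
                            std::int64_t modulus)
{
    std::int64_t result = 1;
    base %= modulus;
    while (exponent > 0) {
        if (exponent % 2 == 1) {
            result = (result * base) % modulus;
        }
        base = (base * base) % modulus;
        exponent /= 2;
    }
    return result;
}

// Example with the textbook key: mod_pow(65, 17, 3233) encrypts the
// message 65, and mod_pow(2790, 2753, 3233) decrypts it again.
```

With this toy key everything fits comfortably in 64 bits; the product of two primes near 4 × 10^9, as used later in this article, does not.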

The Exploration

NTL

One of the first libraries I tried was NTL,3 which seemed to support the standard mathematical operators well. It did not take me long to make it work with my program, and I was able to get the result successfully. However, I saw several problems that made me think I probably wanted a more mature solution in the long run:

  • Its class name for big integers is ‘NTL::ZZ’. ‘ZZ’ looks ugly to me, not aesthetically comfortable as a type name.
  • It does not provide a make mechanism for Windows. Luckily, it does not require external libraries and it is easy enough to build it manually with GCC.
  • Code like ‘NTL::ZZ pub = 3’ does not compile, which is a minor annoyance (but ‘NTL::ZZ pub(3)’ is an easy workaround, anyway).
  • Code like ‘NTL::ZZ p1("3440890133")’ does not work. This is a problem for big integers that cannot be represented by a native integer type. The workaround is using std::istringstream, which would require more lines and clumsiness.
  • There is no support for getting the input or output in hexadecimal numbers.

CLN

Another library I tried at the same time was CLN.4 It is not friendly to Windows users either, so I simply installed it from Cygwin.5 It seems to be in stark contrast to NTL in some aspects:

  • The class name for big integers is more reasonably named ‘cln::cl_I’.
  • Code lines like ‘cln::cl_I pub = 3’ and ‘cln::cl_I p1("3440890133")’ work.
  • CLN provides support for hexadecimal input (using a special ‘#x’ prefix in the number string) and output (using the fprinthexadecimal function).

However, CLN is quite terrible in its handling of C++ operators:

  • % is not overloaded, and I have to call the mod function instead.
  • Division is not implemented on cl_I. I have to, in the general case, use a function that returns the {quotient, remainder} pair, or use an exquo function when I can guarantee that the remainder is zero. Luckily, in my specific case, I can substitute ‘>> 1’ for ‘/ 2’. If shifts could not be used, I would have to replace ‘n / 2’ with something like ‘truncate2(n, 2).quotient’. Providing a series of division functions that return both the quotient and remainder is good; forcing people to use them is not.
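The point about division can be illustrated with a small sketch. The type toy_integer and the function quotient_remainder below are hypothetical stand-ins, not CLN's API; they only show that operator/ and operator% can be layered on top of a combined quotient/remainder function with a few lines of code.

```cpp
#include <utility>

// Hypothetical integer type that natively offers only a combined
// quotient/remainder function (this is NOT CLN's real interface).
struct toy_integer {
    long long value;
};

inline std::pair<toy_integer, toy_integer>
quotient_remainder(const toy_integer& a, const toy_integer& b)
{
    return std::make_pair(toy_integer{a.value / b.value},
                          toy_integer{a.value % b.value});
}

// Thin wrappers: the familiar operators simply pick one half of the pair.
inline toy_integer operator/(const toy_integer& a, const toy_integer& b)
{
    return quotient_remainder(a, b).first;
}

inline toy_integer operator%(const toy_integer& a, const toy_integer& b)
{
    return quotient_remainder(a, b).second;
}
```

Callers who need both results can still call quotient_remainder directly, while generic code such as half and odd keeps working unchanged.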

Unlike NTL, whose problems stem from immaturity, CLN appears to have made these design choices deliberately. Still, it looks bad enough to me. The API design of CLN shows the hauteur of the authors: your time is not important to us; read the fucking manual, and do things the correct way, the way we want them done. This condescending attitude runs completely against the trend.

Boost.Multiprecision

Finally, I found out that I should have looked no further than just the famous Boost libraries.6 A multi-precision template library is among Boost’s 100+ libraries, simply named ‘Boost.Multiprecision’.7 I wondered why I missed it in the beginning. But, anyway, it fulfilled all my needs wonderfully:

  • Using the basic cpp_int type does not require building any libraries. This makes it work on any platform that has a decent C++ compiler.
  • All needed operators (like +, -, *, /, and %) are implemented.
  • Initialization from native integer types and C strings works.
  • Hexadecimal input and output are implemented in a natural way: inputs can have the ‘0x’ prefix, and the hex manipulator can be used to make the big integer output to iostreams in the hexadecimal form.8
  • In addition, it supports the C++11 user-defined literals.9 So, instead of writing something like ‘cpp_int encrypt("0xB570BF8E4BDABA4C")’, you can have more efficient code by writing ‘cpp_int encrypt(0xB570BF8E4BDABA4C_cppi)’.

This said, one problem halted me when I first used its cpp_int type: very weird compilation errors occurred, spanning several screens. Actually, the solution is described in the introduction of the library, as well as in the first answer of its FAQ, so I figured it out the next day (I did not read the documentation carefully on the first night). I needed to either replace expressions like ‘half(n - 1)’ with explicit type-casts like ‘half(N(n - 1))’, or simply use an alternative typedef to turn off expression templates, which I did:

    typedef boost::multiprecision::number<
            boost::multiprecision::cpp_int_backend<>,
            boost::multiprecision::et_off> int_type;

You can read the Boost documentation for more details. It is related to performance. It is also worth noting that with C++11 move semantics, the expression-template-disabled form I use can still have performance close enough (no more than 10% slower) to the expression-template-enabled form. And the first template parameter probably has a bigger impact—GMP can be used as the backend and is considered faster.10
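The deduction problem behind those errors can be reproduced without Boost. The sketch below uses toy types (toy_int and toy_minus_expr are hypothetical names, far simpler than Boost's real expression templates) to show why ‘half(n - 1)’ goes wrong and why the explicit cast helps.

```cpp
#include <type_traits>
#include <utility>

// Toy number type (hypothetical; stands in for a big-integer class).
struct toy_int {
    long value;
    explicit toy_int(long v) : value(v) {}
};

// Toy expression template: operator- returns this proxy instead of a
// toy_int, and the subtraction is carried out only on conversion.
struct toy_minus_expr {
    long lhs;
    long rhs;
    operator toy_int() const { return toy_int(lhs - rhs); }
};

inline toy_minus_expr operator-(const toy_int& a, long b)
{
    return toy_minus_expr{a.value, b};
}

inline toy_int operator/(const toy_int& a, long b)
{
    return toy_int(a.value / b);
}

template <typename N>
N half(N n)
{
    return n / 2;
}

// The subtraction yields the proxy, so half(n - 1) would deduce
// N = toy_minus_expr rather than toy_int.
static_assert(
    std::is_same<decltype(std::declval<toy_int>() - 1),
                 toy_minus_expr>::value,
    "subtraction yields the proxy type");

inline long half_of_predecessor(const toy_int& n)
{
    // half(n - 1) would not compile here: n / 2 produces a toy_int, which
    // cannot be converted back to the deduced return type toy_minus_expr.
    // The explicit cast pins N to toy_int.
    return half(toy_int(n - 1)).value;
}
```

Turning expression templates off corresponds to making operator- return toy_int directly, in which case no cast is needed.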

In my humble opinion, Boost.Multiprecision should change the default cpp_int definition to have et_off. Developers who want the ultimate performance will always read the documentation, but it does not seem necessary to force other developers to have failures, read documentation, and change their code. In my case, it took several seconds to compile the program and a small fraction of a second to run it, but several hours to find the correct library and learn how to use it.

The Critique

I would argue that the following three criteria should be the foremost in choosing (and thus providing) a good library:

  • Correctness. I think this is self-evident. All three libraries described here satisfy this criterion.
  • Standard and intuitive interface. This is where NTL and CLN fail. CLN does the worst here by intentionally failing to provide operator/. Boost.Multiprecision satisfies my needs without requiring me to look up the documentation (mostly), but it is not perfect, in that the default types can cause horrendous error messages and that its iostream routines do not honour uppercase and nouppercase.11
  • Performance. Yes, performance comes last among these three. It is still important, but I would argue we should put performance aside when it conflicts with the other two criteria (much as we treat ‘premature optimization’), and developers can read the documentation and turn performance options back on when they really need them. It is nice that Boost.Multiprecision supports different backends and the expression template option, but I am not persuaded that expression templates should be enabled by default.

Of course, there are many other criteria, like portability, (lack of) dependency, etc., but they tend to be more subjective and can vary from project to project. The three criteria listed above are the most important to me.

Correctness and developer productivity should be preferred to code performance. This should be true for both scripting languages and traditional compiled languages.

Appendix: Source Listing

The complete RSA sample code that builds with Boost.Multiprecision is listed below for you to play with. We can optimize the code a little bit by substituting the Boost.Multiprecision divide_qr function for the handwritten quotient_remainder. That can be the small exercise for you, dear reader. 🙂

#include <assert.h>             // assert
#include <iostream>             // std::cout/endl
#include <utility>              // std::pair/make_pair
#include <boost/multiprecision/cpp_int.hpp>

// Concepts
#define EuclideanDomain         typename
#define Integer                 typename
#define Regular                 typename
#define SemigroupOperation      typename

template <Integer N>
inline bool odd(N n)
{
    return n % 2 == 1;
}

template <Integer N>
inline N half(N n)
{
    return n / 2;
}

template <Integer N>
N largest_doubling(N a, N b)
{
    assert(b != 0);
    for (;;) {
        N c = b + b;
        if (a < c)
            break;
        b = c;
    }
    return b;
}

template <Integer N>
std::pair<N, N> quotient_remainder(N a, N b)
{
    assert(b > 0);
    if (a < b) {
        return std::make_pair(N(0), a);
    }
    N c = largest_doubling(a, b);
    N n(1);
    a = a - c;
    while (c != b) {
        c = half(c);
        n = n + n;
        if (c <= a) {
            a = a - c;
            n = n + 1;
        }
    }
    return std::make_pair(n, a);
}

template <EuclideanDomain E>
std::pair<E, E> extended_gcd(E a, E b)
{
    E x0(1);
    E x1(0);
    while (b != E(0)) {
        // compute new r and x
        std::pair<E, E> qr = quotient_remainder(a, b);
        E x2 = x0 - qr.first * x1;
        // shift r and x
        x0 = x1;
        x1 = x2;
        a = b;
        b = qr.second;
    }
    return std::make_pair(x0, a);
}

template <Regular A, Integer N, SemigroupOperation Op>
A power_accumulate_semigroup(A r, A a, N n, Op op)
{
    assert(n >= 0);
    if (n == 0) {
        return r;
    }
    for (;;) {
        if (odd(n)) {
            r = op(r, a);
            if (n == 1) {
                return r;
            }
        }
        n = half(n);
        a = op(a, a);
    }
}

template <Regular A, Integer N, SemigroupOperation Op>
A power_semigroup(A a, N n, Op op)
{
    assert(n > 0);
    while (!odd(n)) {
        a = op(a, a);
        n = half(n);
    }
    if (n == 1) {
        return a;
    }
    return power_accumulate_semigroup(a, op(a, a),
                                      half(n - 1), op);
}

template <Integer N>
struct modulo_multiply {
    modulo_multiply(const N& i) : modulus(i) {}
    N operator()(const N& n, const N& m) const
    {
        return (n * m) % modulus;
    }
private:
    N modulus;
};

int main()
{
    using namespace std;
    typedef boost::multiprecision::number<
            boost::multiprecision::cpp_int_backend<>,
            boost::multiprecision::et_off> int_type;

    int_type p1 = 3440890133;
    int_type p2 = 4006628849;
    int_type n = p1 * p2;
    int_type phi = (p1 - 1) * (p2 - 1);
    int_type pub = 65537;

    pair<int_type, int_type> p = extended_gcd(pub, phi);
    if (p.second != 1) {
        // pub is not coprime with phi
        cout << "Please choose another public key!" << endl;
        return 1;
    }

    int_type prv = p.first;
    if (prv < 0) {
        prv += phi;
    }

    cout << "Public key is (" << pub << ", " << n << ")\n";
    cout << "Private key is (" << prv << ", " << n << ")\n";

    int_type encrypt("0xB570BF8E4BDABA4C");
    cout << hex;
    cout << "Encrypted message is " << encrypt << endl;
    cout << "Decrypted message is "
         << power_semigroup(encrypt, prv,
                            modulo_multiply<int_type>(n))
         << endl;
}

  1. Alexander A. Stepanov and Daniel E. Rose: From Mathematics to Generic Programming. Addison-Wesley Professional, 2014. 
  2. Wikipedia: RSA (cryptosystem)
  3. Victor Shoup: NTL: A Library for doing Number Theory
  4. Bruno Haible and Richard B. Kreckel: CLN — Class Library for Numbers 
  5. Cygwin
  6. Boost
  7. John Maddock and Christopher Kormanyos: Boost.Multiprecision
  8. Cppreference.com: std::hex
  9. Cppreference.com: User-defined literals
  10. Free Software Foundation: GMP — The GNU Multiple Precision Arithmetic Library
  11. Cppreference.com: std::uppercase

Generic Lambdas and the compose Function

I am about two thirds through Scott Meyers’ Effective Modern C++, and I have discovered the power of generic lambdas. Actually I had read about generic lambdas before in the Wikipedia entry on C++14, but that was far from enough for me to get it—and I was not smart enough to investigate deeper. Anyway, Item 33 in Effective Modern C++ gives enough examples to show me the power, and it is exactly the tool I need to solve the problems in my compose function.

Let me start from my (poor) exclamation in my last blog:

Alas, a lambda is only a function, but not a type-deducing function template.

Wrong, wrong, wrong! The generic lambda is exactly what I claimed it was not.

Type deduction was the problem that caused my compose function template to fail. Recall its definition and the failure case:

template <typename Tp>
auto compose()
{
    return apply<Tp>;
}

template <typename Tp, typename Fn, typename... Fargs>
auto compose(Fn fn, Fargs... args)
{
    return [=](Tp&& x) -> decltype(auto)
    {
        return fn(compose<Tp>(args...)(forward<Tp>(x)));
    };
}

…
Obj obj(0);
auto const op_nr = compose<Obj>(clone);
test(op_nr(obj));

The error was that obj, as an lvalue, could not be bound to Obj&& (the lambda parameter on line 10 of the snippet). Although I tried to use the perfect forwarding pattern, it did not work, as there was no type deduction: Tp was specified by the caller. Why so? Because I thought a lambda could not be a type-deducing function template.

I won’t go into details about generic lambdas per se; you can find plenty of information about them in Scott’s book or by searching online. Instead, I only want to show you that generic lambdas help solve the type deduction problem and give a function template to suit my needs.

Without further ado, I am showing you the improved version of compose that uses generic lambdas:

auto compose()
{
    return [](auto&& x) -> decltype(auto)
    {
        return forward<decltype(x)>(x);
    };
}

template <typename Fn, typename... Fargs>
auto compose(Fn fn, Fargs... args)
{
    return [=](auto&& x) -> decltype(auto)
    {
        return fn(compose(args...)(forward<decltype(x)>(x)));
    };
}

You can immediately notice the following:

  • The original template type parameter Tp is gone.
  • Tp&& is now changed to auto&&, allowing type deduction to work.
  • In order to make perfect forwarding work when the type of x is unknown, forward<decltype(x)> is used.
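The deduction that auto&& performs can be observed directly. In this small sketch (the names is_lvalue_arg, lvalue_case and rvalue_case are made up for illustration), decltype(x) is an lvalue reference when an lvalue is passed and an rvalue reference otherwise, which is exactly the information forward<decltype(x)> relies on.

```cpp
#include <type_traits>

// Hypothetical helper: reports whether the argument arrived as an lvalue.
auto const is_lvalue_arg = [](auto&& x) {
    return std::is_lvalue_reference<decltype(x)>::value;
};

inline bool lvalue_case()
{
    int n = 0;
    return is_lvalue_arg(n);   // decltype(x) is int&
}

inline bool rvalue_case()
{
    return is_lvalue_arg(42);  // decltype(x) is int&&
}
```

The same lambda object handles both cases, which a lambda with a fixed Tp&& parameter cannot do.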

With this definition, we no longer need to differentiate between op_klvr, op_rvr, etc. No way to do so, anyway. Things are beautifully unified, until the moment you need to put it in an std::function. Run the code in the final listing to see the differences.

Although it already looks perfect, we have a bonus since we no longer specify Tp. We are no longer constrained by only one argument. A small change will make multiple arguments work:

auto compose()
{
    return [](auto&& x) -> decltype(auto)
    {
        return forward<decltype(x)>(x);
    };
}

template <typename Fn>
auto compose(Fn fn)
{
    return [=](auto&&... x) -> decltype(auto)
    {
        return fn(forward<decltype(x)>(x)...);
    };
}

template <typename Fn, typename... Fargs>
auto compose(Fn fn, Fargs... args)
{
    return [=](auto&&... x) -> decltype(auto)
    {
        return fn(
            compose(args...)(forward<decltype(x)>(x)...));
    };
}

The first compose function is no longer useful when we have at least one function passed to compose, but I am keeping it for now. The parameter pack plus the generic lambda makes a perfect combination here.
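As a quick illustration of the multi-argument form, the sketch below composes three functions. The two compose overloads are repeated with explicit std:: qualification so that the block is self-contained, and add, twice and increment are hypothetical helpers. Only the last function listed receives the original arguments; the others each take the result of the one to their right.

```cpp
#include <utility>

template <typename Fn>
auto compose(Fn fn)
{
    return [=](auto&&... x) -> decltype(auto)
    {
        return fn(std::forward<decltype(x)>(x)...);
    };
}

template <typename Fn, typename... Fargs>
auto compose(Fn fn, Fargs... args)
{
    return [=](auto&&... x) -> decltype(auto)
    {
        return fn(
            compose(args...)(std::forward<decltype(x)>(x)...));
    };
}

// Hypothetical helpers for the demonstration.
inline int add(int a, int b) { return a + b; }
inline int twice(int n) { return 2 * n; }
inline int increment(int n) { return n + 1; }

inline int composed_example()
{
    // Reads right to left: add first, then twice, then increment.
    auto const op = compose(increment, twice, add);
    return op(3, 4);  // increment(twice(add(3, 4)))
}
```

Here composed_example() evaluates to 15, since add(3, 4) is 7, twice(7) is 14, and increment(14) is 15.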

Finally, a code listing for you to play with is provided below (also available as test_compose.cpp in the zip file download for my last blog):

#include <functional>
#include <iostream>
#include <utility>

using namespace std;

#define PRINT_AND_TEST(x)           \
    cout << " " << #x << ":\n  ";   \
    test(x);                        \
    cout << endl;

auto compose()
{
    return [](auto&& x) -> decltype(auto)
    {
        return forward<decltype(x)>(x);
    };
}

template <typename Fn>
auto compose(Fn fn)
{
    return [=](auto&&... x) -> decltype(auto)
    {
        return fn(forward<decltype(x)>(x)...);
    };
}

template <typename Fn, typename... Fargs>
auto compose(Fn fn, Fargs... args)
{
    return [=](auto&&... x) -> decltype(auto)
    {
        return fn(
            compose(args...)(forward<decltype(x)>(x)...));
    };
}

struct Obj {
    int value;
    explicit Obj(int n) : value(n)
    {
        cout << "Obj(){" << value << "} ";
    }
    Obj(const Obj& rhs) : value(rhs.value)
    {
        cout << "Obj(const Obj&){" << value << "} ";
    }
    Obj(Obj&& rhs) : value(rhs.value)
    {
        rhs.value = -1;
        cout << "Obj(Obj&&){" << value << "} ";
    }
    ~Obj()
    {
        cout << "~Obj(){" << value << "} ";
    }
};

void test(Obj& x)
{
    cout << "=> Obj&:" << x.value << "\n  ";
}

void test(Obj&& x)
{
    cout << "=> Obj&&:" << x.value << "\n  ";
}

void test(const Obj& x)
{
    cout << "=> const Obj&:" << x.value << "\n  ";
}

Obj clone(Obj x)
{
    cout << "=> clone(Obj):" << x.value << "\n  ";
    return x;
}

void test()
{
    Obj obj(0);
    cout << endl;

    auto const op = compose(clone);
    std::function<Obj(const Obj&)> fn_klvr = op;
    std::function<Obj(Obj&&)> fn_rvr = op;
    std::function<Obj(Obj)> fn_nr = op;
    PRINT_AND_TEST(op(obj));
    PRINT_AND_TEST(fn_klvr(obj));
    PRINT_AND_TEST(fn_nr(obj));
    cout << endl;
    PRINT_AND_TEST(op(Obj(1)));
    PRINT_AND_TEST(fn_klvr(Obj(1)));
    PRINT_AND_TEST(fn_rvr(Obj(1)));
    PRINT_AND_TEST(fn_nr(Obj(1)));
}

template <typename T1, typename T2>
auto sum(T1 x, T2 y)
{
    return x + y;
}

template <typename T1, typename T2, typename... Targ>
auto sum(T1 x, T2 y, Targ... args)
{
    return sum(x + y, args...);
}

template <typename T>
auto sqr(T x)
{
    return x * x;
}

int main()
{
    test();
    cout << endl;
    auto const op = compose(sqr<int>,
                            sum<int, int, int, int, int>);
    cout << op(1, 2, 3, 4, 5) << endl;
}

Happy hacking!

Type Deduction and My Reference Mistakes

I had read Scott Meyers’ three previous books in the Effective series 1 2 3, before I began to read his new Effective Modern C++ 4 at Safari Books Online. I always expect to learn from Scott, but it surprised me how fast it could be. After reading only a few items about type deduction, I found that I implemented apply and compose (pipeline) incorrectly in my first WordPress blog 5. What a shame! But I thought I was fortunate to find the problem so soon, and I would like to record my mistakes and solutions here.

(A side note: As WordPress does not allow uploading C++ source files, I have put all the test code in a zip file. I will quote the relevant source code in the blog directly for an easy read, but you are welcome to download and try the test code yourself!)

First, my implementation of apply had the wrong return type, as shown in the following program (test1.cpp):

#include <iostream>
#include <utility>

using namespace std;

#define PRINT_AND_TEST(x)    \
    cout << #x << ": ";      \
    test(x);                 \
    cout << endl;

template <typename Tp>
auto apply(Tp&& data)
{
    return forward<Tp>(data);
}

template <typename Tp, typename Fn, typename... Fargs>
auto apply(Tp&& data, Fn fn, Fargs... args)
{
    return apply(fn(forward<Tp>(data)), args...);
}

struct Obj {
    int value;
    explicit Obj(int n) : value(n)
    {
    }
    Obj(const Obj& rhs) : value(rhs.value)
    {
    }
    Obj(Obj&& rhs) : value(rhs.value)
    {
        rhs.value = -1;
    }
    ~Obj()
    {
    }
};

void test(Obj& x)
{
    cout << "Obj&:" << x.value;
}

void test(Obj&& x)
{
    cout << "Obj&&:" << x.value;
}

void test(const Obj& x)
{
    cout << "const Obj&:" << x.value;
}

int main()
{
    Obj obj(0);
    Obj& nref = obj;
    const Obj& cref = obj;

    PRINT_AND_TEST(obj);
    PRINT_AND_TEST(nref);
    PRINT_AND_TEST(cref);
    PRINT_AND_TEST(Obj(1));
    cout << endl;

    PRINT_AND_TEST(apply(obj));
    PRINT_AND_TEST(apply(nref));
    PRINT_AND_TEST(apply(cref));
    PRINT_AND_TEST(apply(Obj(1)));
    cout << endl;
}

It gives the following output:

obj: Obj&:0
nref: Obj&:0
cref: const Obj&:0
Obj(1): Obj&&:1

apply(obj): Obj&&:0
apply(nref): Obj&&:0
apply(cref): Obj&&:0
apply(Obj(1)): Obj&&:1

Apparently I did not get the types right. The reason was my ignorant use of auto as the return type, without realizing that it always deduces a non-reference type. C++14 provides a special decltype(auto) syntax, which preserves reference-ness, and it seems to work here. The ‘fixed’ code is as follows (test2.cpp):

template <typename Tp>
decltype(auto) apply(Tp&& data)
{
    return forward<Tp>(data);
}

template <typename Tp, typename Fn, typename... Fargs>
decltype(auto) apply(Tp&& data, Fn fn, Fargs... args)
{
    return apply(fn(forward<Tp>(data)), args...);
}

After that, the program can output the correct result:

obj: Obj&:0
nref: Obj&:0
cref: const Obj&:0
Obj(1): Obj&&:1

apply(obj): Obj&:0
apply(nref): Obj&:0
apply(cref): const Obj&:0
apply(Obj(1)): Obj&&:1
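The difference between the two return types can also be checked at compile time. In this sketch the function names forward_by_auto and forward_by_decltype_auto are hypothetical, and int is used instead of Obj to keep it short.

```cpp
#include <type_traits>
#include <utility>

template <typename Tp>
auto forward_by_auto(Tp&& data)
{
    return std::forward<Tp>(data);
}

template <typename Tp>
decltype(auto) forward_by_decltype_auto(Tp&& data)
{
    return std::forward<Tp>(data);
}

// A plain auto return type drops the reference: an lvalue comes back
// as a copy.
static_assert(
    std::is_same<decltype(forward_by_auto(std::declval<int&>())),
                 int>::value,
    "auto deduces a non-reference type");

// decltype(auto) keeps the reference and the constness.
static_assert(
    std::is_same<decltype(forward_by_decltype_auto(std::declval<int&>())),
                 int&>::value,
    "decltype(auto) keeps the lvalue reference");
static_assert(
    std::is_same<
        decltype(forward_by_decltype_auto(std::declval<const int&>())),
        const int&>::value,
    "decltype(auto) keeps const");

inline bool returns_same_object()
{
    int n = 7;
    // The returned reference refers to n itself, not to a copy.
    return &forward_by_decltype_auto(n) == &n;
}
```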

Wait—is the code really correct?

Actually a little more testing reveals a bigger problem, which the original code did not exhibit. Here is the additional test code (test3.cpp):

…
Obj clone(Obj x)
{
    cout << "clone(Obj):" << x.value << " => ";
    return x;
}

int main()
{
    …
    PRINT_AND_TEST(clone(obj));
    PRINT_AND_TEST(clone(nref));
    PRINT_AND_TEST(clone(cref));
    PRINT_AND_TEST(clone(Obj(2)));
    cout << endl;

    PRINT_AND_TEST(apply(obj, clone));
    PRINT_AND_TEST(apply(nref, clone));
    PRINT_AND_TEST(apply(cref, clone));
    PRINT_AND_TEST(apply(Obj(2), clone));
    cout << endl;
}

And its horrendous output:

…
clone(obj): clone(Obj):0 => Obj&&:0
clone(nref): clone(Obj):0 => Obj&&:0
clone(cref): clone(Obj):0 => Obj&&:0
clone(Obj(2)): clone(Obj):2 => Obj&&:2

apply(obj, clone): clone(Obj):0 => Obj&&:1875662080
apply(nref, clone): clone(Obj):0 => Obj&&:1875662080
apply(cref, clone): clone(Obj):0 => Obj&&:1875662080
apply(Obj(2), clone): clone(Obj):2 => Obj&&:1875662080

Let us go back and analyse the case. Since all four calls have problems, we’ll just check the first one.

The call apply(obj, clone) makes the template parameter Tp be deduced to Obj&, as obj is an lvalue. This instantiation of apply(Obj&, Fn) contains the following function body, after all arguments are passed in:

    return apply(clone(forward<Obj&>(obj)));

Please notice that clone returns a temporary object, whose lifetime ends when the return statement completes. As clone returns an rvalue, Tp for the next apply call is deduced to Obj. The function is instantiated as follows (check out the definition of std::forward if you are not familiar with it 6):

Obj&& apply(Obj&& data)
{
    return forward<Obj>(data);
}

Although the cloned object is destroyed after apply(Obj&, Fn) returns, its rvalue reference is still returned. Oops!

Seeing the reason, I only need to make sure an object type is returned when this apply is called. I experimented with the enable_if template 7, but it turns out that the fix is simpler than I expected (test4.cpp):

template <typename Tp>
Tp apply(Tp&& data)
{
    return forward<Tp>(data);
}

template <typename Tp, typename Fn, typename... Fargs>
decltype(auto) apply(Tp&& data, Fn fn, Fargs... args)
{
    return apply(fn(forward<Tp>(data)), args...);
}

I just have to change the first decltype(auto) to Tp to take advantage of the type deduction rules of ‘universal references’. While it is explained most clearly in Item 1 of Scott Meyers’ Effective Modern C++, his online article already states it clearly 8:

During type deduction for a template parameter that is a universal reference, lvalues and rvalues of the same type are deduced to have slightly different types. In particular, lvalues of type T are deduced to be of type T& (i.e., lvalue reference to T), while rvalues of type T are deduced to be simply of type T. (Note that while lvalues are deduced to be lvalue references, rvalues are not deduced to be rvalue references!)

Therefore:

  • If an lvalue is passed to apply, Tp (and the return type) is deduced to an lvalue reference (say, Obj&). This is what we expect to reduce copying.
  • If an rvalue is passed to apply, Tp (and the return type) is deduced to the object type (say, Obj). Now the returned object is move-constructed—what we would like to see.
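These two cases can be verified with a small sketch. The function is renamed apply_fixed here (a hypothetical name, chosen so that it stands apart from the article's apply), and std::string takes the place of Obj.

```cpp
#include <string>
#include <type_traits>
#include <utility>

template <typename Tp>
Tp apply_fixed(Tp&& data)
{
    return std::forward<Tp>(data);
}

// Lvalue argument: Tp is std::string&, so a reference is returned and
// nothing is copied.
static_assert(
    std::is_same<decltype(apply_fixed(std::declval<std::string&>())),
                 std::string&>::value,
    "lvalues come back by reference");

// Rvalue argument: Tp is std::string, so an object is returned by value
// (move-constructed) instead of a reference to a dying temporary.
static_assert(
    std::is_same<decltype(apply_fixed(std::declval<std::string>())),
                 std::string>::value,
    "rvalues come back by value");

inline std::string from_temporary()
{
    // Safe: the result owns its data after the temporary is destroyed.
    return apply_fixed(std::string("temporary"));
}
```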

I then also checked compose. At first, I tested with these lines added to the end of main (after adding the definition of compose, of course; see test5.cpp):

    auto const op1 = compose<Obj&&>(apply<Obj&&>);
    PRINT_AND_TEST(op1(Obj(3)));
    auto const op2 = compose<const Obj&>(apply<const Obj&>);
    PRINT_AND_TEST(op2(Obj(3)));

Both output lines contain ‘test(Obj&&):3’. The second line is obviously wrong, and it is exactly like the apply case, so the solution is similar too. I should not have relied on the wrong auto return type deduction—using decltype(auto) would fix it. The changed compose is like the following (test6.cpp):

template <typename Tp>
auto compose()
{
    return apply<Tp>;
}

template <typename Tp, typename Fn, typename... Fargs>
auto compose(Fn fn, Fargs... args)
{
    return [=](Tp&& x) -> decltype(auto)
    {
        return fn(compose<Tp>(args...)(forward<Tp>(x)));
    };
}

However, when I test code like below, it will not even compile (test7.cpp):

    auto const op_nr = compose<Obj>(clone);
    test(op_nr(obj));

Clang reports errors:

test7.cpp:106:10: error: no matching function for call to object of type 'const (lambda at test.cpp:26:12)'
    test(op_nr(obj));
         ^~~~~
test7.cpp:31:12: note: candidate function not viable: no known conversion from 'Obj' to 'Obj &&' for 1st argument
    return [=](Tp&& x) -> decltype(auto)
           ^
1 error generated.

!@#$%^…

Actually I can ‘fix’ the problem like this:

    auto const op_klvr = compose<const Obj&>(clone);
    test(op_klvr(obj));

Or this:

    auto const op_rvr  = compose<Obj&&>(clone);
    test(op_rvr(Obj(3)));

The two forms above would actually work quite well, if one knew the argument type and was careful. However, there would be a difference if one was careless, as the revised test program shows by tracking the objects’ lifetimes (test8.cpp). Only the relevant changes are shown below:

…
#define PRINT_AND_TEST(x)           \
    cout << " " << #x << ":\n  ";   \
    test(x);                        \
    cout << endl;
…
struct Obj {
    int value;
    explicit Obj(int n) : value(n)
    {
        cout << "Obj(){" << value << "} ";
    }
    Obj(const Obj& rhs) : value(rhs.value)
    {
        cout << "Obj(const Obj&){" << value << "} ";
    }
    Obj(Obj&& rhs) : value(rhs.value)
    {
        rhs.value = -1;
        cout << "Obj(Obj&&){" << value << "} ";
    }
    ~Obj()
    {
        cout << "~Obj(){" << value << "} ";
    }
};

void test(Obj& x)
{
    cout << "=> Obj&:" << x.value << "\n  ";
}

void test(Obj&& x)
{
    cout << "=> Obj&&:" << x.value << "\n  ";
}

void test(const Obj& x)
{
    cout << "=> const Obj&:" << x.value << "\n  ";
}

Obj clone(Obj x)
{
    cout << "=> clone(Obj):" << x.value << "\n  ";
    return x;
}

int main()
{
    Obj obj(0);
    Obj& nref = obj;
    const Obj& cref = obj;
    cout << endl;
    …
    auto const op_klvr = compose<const Obj&>(clone);
    auto const op_rvr  = compose<Obj&&>(clone);
    auto const op_nr   = compose<Obj>(clone);
    PRINT_AND_TEST(op_klvr(obj));
    PRINT_AND_TEST(op_klvr(Obj(3)));
    PRINT_AND_TEST(op_rvr(Obj(3)));
    PRINT_AND_TEST(op_nr(Obj(3)));
    //PRINT_AND_TEST(op_nr(obj));
}

The commented-out line cannot compile yet. The rest works fine, and the program will generate the following output (edited):

Obj(){0}
 obj:
  => Obj&:0
 …

 clone(obj):
  Obj(const Obj&){0} => clone(Obj):0
  Obj(Obj&&){0} => Obj&&:0
  ~Obj(){0} ~Obj(){-1}
 clone(nref):
  Obj(const Obj&){0} => clone(Obj):0
  Obj(Obj&&){0} => Obj&&:0
  ~Obj(){0} ~Obj(){-1}
 clone(cref):
  Obj(const Obj&){0} => clone(Obj):0
  Obj(Obj&&){0} => Obj&&:0
  ~Obj(){0} ~Obj(){-1}
 clone(Obj(2)):
  Obj(){2} => clone(Obj):2
  Obj(Obj&&){2} => Obj&&:2
  ~Obj(){2} ~Obj(){-1}

 apply(obj, clone):
  Obj(const Obj&){0} => clone(Obj):0
  Obj(Obj&&){0} Obj(Obj&&){0} ~Obj(){-1} ~Obj(){-1} => Obj&&:0
  ~Obj(){0}
 …
 apply(Obj(2), clone):
  Obj(){2} Obj(Obj&&){2} => clone(Obj):2
  Obj(Obj&&){2} Obj(Obj&&){2} ~Obj(){-1} ~Obj(){-1} => Obj&&:2
  ~Obj(){2} ~Obj(){-1}

 op_klvr(obj):
  Obj(const Obj&){0} => clone(Obj):0
  Obj(Obj&&){0} ~Obj(){-1} => Obj&&:0
  ~Obj(){0}
 op_klvr(Obj(3)):
  Obj(){3} Obj(const Obj&){3} => clone(Obj):3
  Obj(Obj&&){3} ~Obj(){-1} => Obj&&:3
  ~Obj(){3} ~Obj(){3}
 op_rvr(Obj(3)):
  Obj(){3} Obj(Obj&&){3} => clone(Obj):3
  Obj(Obj&&){3} ~Obj(){-1} => Obj&&:3
  ~Obj(){3} ~Obj(){-1}
 op_nr(Obj(3)):
  Obj(){3} Obj(Obj&&){3} => clone(Obj):3
  Obj(Obj&&){3} ~Obj(){-1} => Obj&&:3
  ~Obj(){3} ~Obj(){-1}
~Obj(){0}

We can see that apply(Obj(2), clone) generates one more move-construction than clone(Obj(2)), and this is expected from our implementation. We can also see the differences in the last group, where the use of op_klvr, op_rvr, or op_nr can affect whether copy-construction or move-construction is used. I would like to make op_nr work on an lvalue too (so the last code line can be uncommented).

I have finally implemented a version of compose with tag dispatching 9, treating reference types and non-reference types differently. The strategy is as follows:

  • If the template argument is a reference type, the old logic still applies.
  • If template argument is not a reference type, a temporary object will be constructed in the pass-by-value parameter (with either the copy constructor or move constructor), and its rvalue reference will be used to invoke the reference-branch logic. An extra move operation may result, so it is still better to use ‘compose’ if it is known that the argument will be an value. (Alas, a lambda is only a function, but not a type-deducing function template. Damn, Scott hit me right on the face, again, with his nice introduction of the C++14 generic lambdas in Effective Modern C++. It provides for a far nicer solution. I won’t change the content here, but the part about compose is now largely obsoleted. Check out ‘Generic Lambdas and the compose Function’ for an update.—I hate love you, Scott!)

The extra move overhead makes this solution less attractive, but it only adds to the choices. Also, it would be a little awkward if one was forced to type ‘compose’. So my final compose is here (test9.cpp):

template <typename Tp>
auto compose_ref()
{
    return apply<Tp>;
}

template <typename Tp, typename Fn, typename... Fargs>
auto compose_ref(Fn fn, Fargs... args)
{
    return [=](Tp&& x) -> decltype(auto)
    {
        return fn(compose_ref<Tp>(args...)(forward<Tp>(x)));
    };
}

template <typename Tp, typename... Fargs>
auto compose_impl(false_type, Fargs... args)
{
    return [=](Tp x) -> decltype(auto)
    {
        return compose_ref<Tp&&>(args...)(move(x));
    };
}

template <typename Tp, typename... Fargs>
auto compose_impl(true_type, Fargs... args)
{
    return compose_ref<Tp>(args...);
}

template <typename Tp, typename... Fargs>
auto compose(Fargs... args)
{
    return compose_impl<Tp>(
        typename is_reference<Tp>::type(), args...);
}
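
The dispatch on is_reference used by compose_impl can also be looked at in isolation. The following is only a minimal sketch of the technique: describe and describe_impl are hypothetical names introduced for illustration, and they return a string instead of building a composed function.

```cpp
#include <string>
#include <type_traits>

// Overload selected when the tag is true_type (Tp is a reference type).
template <typename Tp>
std::string describe_impl(std::true_type)
{
    return "reference";
}

// Overload selected when the tag is false_type (Tp is not a reference).
template <typename Tp>
std::string describe_impl(std::false_type)
{
    return "non-reference";
}

// Hypothetical front end: builds a tag object from is_reference<Tp> and
// lets overload resolution choose the implementation at compile time.
template <typename Tp>
std::string describe()
{
    return describe_impl<Tp>(typename std::is_reference<Tp>::type());
}
```

With these definitions, describe<int>() selects the false_type overload, while describe<int&>() and describe<int&&>() select the true_type overload, which mirrors how compose<Tp> picks one of the two compose_impl overloads.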

Lessons learnt:

  • C++ programmers should always read Scott’s books (at least 99.99% should).
  • Although type deduction is very helpful, one needs to understand its rules and what auto actually means in each case; otherwise it is easy to make (terrible) mistakes.
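
As a small reminder of the second lesson, the sketch below contrasts a plain auto return type with decltype(auto), as used in compose_ref above. The function names are hypothetical and exist only for this illustration.

```cpp
#include <type_traits>
#include <utility>

// Returns its argument by reference.
inline int& pass_through(int& x) { return x; }

// Plain auto follows the template argument deduction rules: the
// reference is dropped, so this function returns a copy (int).
inline auto by_auto(int& x) { return pass_through(x); }

// decltype(auto) keeps the exact type of the returned expression (int&).
inline decltype(auto) by_decltype_auto(int& x) { return pass_through(x); }

static_assert(
    std::is_same<decltype(by_auto(std::declval<int&>())), int>::value,
    "auto drops the reference");
static_assert(
    std::is_same<decltype(by_decltype_auto(std::declval<int&>())),
                 int&>::value,
    "decltype(auto) keeps the reference");
```

Assigning through by_decltype_auto(v) therefore modifies v itself, while the result of by_auto(v) is an independent copy.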

  1. Scott Meyers: Effective C++. Addison-Wesley, 3rd edition, 2005. 
  2. Scott Meyers: More Effective C++. Addison-Wesley, 1996. 
  3. Scott Meyers: Effective STL. Addison-Wesley, 2001. 
  4. Scott Meyers: Effective Modern C++. O’Reilly Media, 2014. 
  5. Yongwei Wu: Study Notes: Functional Programming with C++
  6. Thomas Becker: Rvalue References Explained, p. 8
  7. Cppreference.com: std::enable_if
  8. Scott Meyers: Universal References in C++11
  9. David Abrahams and Douglas Gregor: Generic Programming in C++: Techniques, section ‘Tag Dispatching’

Installing Clang 3.5 for Windows

I had used LLVM 3.4 on Windows for quite some time. It had worked well, and had all the features I needed, mostly the beautiful C++11/C++14 compiler and the easy-to-use Clang-Format. However, the C++ compiler only works when some specific GCC versions are installed together, and the GCC version 4.6.3 I installed for Clang has a conflict with the GCC 4.9 I use. The major issue is the C++ run-time library libstdc++-6.dll, which actually has many variants due to the combination of different thread models and different exception handling methods. The result is that GCC 4.9-generated executables will crash when the libstdc++-6.dll from GCC 4.6.3 appears earlier in the path, and Clang-generated executables will crash when the libstdc++-6.dll from GCC 4.9 appears earlier in the path. I do not like this situation. So recently I tried new combinations when I installed LLVM 3.5, and made sure everything worked together. I would like to share the result.

Let me first list the binary files one needs to download:

I install Clang to the default location, C:\Program Files (x86)\LLVM. For the rest of this article, I assume GCC 4.9.2 is extracted to C:\ (so all files are under C:\mingw32), and GCC 4.8.2 is extracted to C:\Temp (all files are under C:\Temp\mingw32).

Although I need GCC 4.9 for the best and latest C++ features, Clang does not work with it. One can tell from the error output of Clang that it should work with the MinGW-w64 GCC 4.8.2:

ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.0"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.0/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.0/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.0/backward"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.1"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.1/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.1/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.1/backward"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.2"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.2/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.2/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.2/backward"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.3"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.3/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.3/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.3/backward"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.0"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.0/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.0/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.0/backward"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.1"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.1/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.1/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.1/backward"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.2"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.2/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.2/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.2/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.0/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.0/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.0/include/c++/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.1/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.1/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.1/include/c++/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.2/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.2/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.2/include/c++/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.3/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.3/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.3/include/c++/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.0/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.0/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.0/include/c++/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.1/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.1/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.1/include/c++/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.2/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.2/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.2/include/c++/backward"
ignoring nonexistent directory "/usr/include/c++/4.4"
ignoring nonexistent directory "/usr/local/include"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0\../../../x86_64-w64-mingw32/include"
ignoring nonexistent directory "/mingw/include"
ignoring nonexistent directory "c:/mingw/include"
ignoring nonexistent directory "/usr/include"

(As one may expect from the error messages, the official MinGW GCC, currently at version 4.8.1, also works with Clang. I personally prefer MinGW-w64, as its GCC is more usable—e.g., the MinGW version supports only Win32 threads, and therefore does not support std::thread. MinGW does not provide GCC 4.9 yet, and you can’t put C:\MinGW\bin in the path, if you want to use MinGW-w64 GCC 4.9 simultaneously. You do need to put either C:\MinGW\bin or C:\mingw32\bin—for MinGW-w64 GCC 4.9—in the path, as Clang cannot find a working GCC for linking otherwise. If you use only MinGW GCC 4.8.1, or only MinGW-w64 GCC 4.9, this configuration works.)

Now back to MinGW-w64 GCC 4.8.2. Depending on the size of your hard disk, you may want to tailor it. In my case, I removed all traces of Fortran, Ada, and Objective-C, as well as build-info.txt, etc, license, opt, and shared from C:\Temp\mingw32. After that, you need to do the following to make GCC 4.8.2 work for Clang:

  • Make directory c++ under C:\Temp\mingw32\include.
  • Make directory 4.8.2 under C:\Temp\mingw32\include\c++.
  • Copy all contents under C:\Temp\mingw32\i686-w64-mingw32\include\c++ to C:\Temp\mingw32\include\c++\4.8.2.
  • Move all contents under C:\Temp\mingw32 to C:\Program Files (x86)\LLVM, merging with existing directories there.
  • Remove the empty C:\Temp\mingw32.

You can now add both C:\mingw32\bin and C:\Program Files (x86)\LLVM to the path: both Clang and GCC are at your hand and won’t conflict with each other.

Y Combinator and C++

If one searches for ‘Y combinator’ now, it is likely that the first hit is Paul Graham’s incubator company Y Combinator. I am not sure whether Paul likes this fact or not—as a Lisp hacker, he must like the Y combinator very much to name his company with it, but now people looking for information for the mathematical/programmatic concept may be distracted by his company information. If he is truly still a Lisp hacker, he may even regret it a bit—though it apparently benefits his company.

The Y combinator is an intriguing concept. When I first saw it, I spent a whole weekend reading and studying about it. Still, I did not fully get it. People think the Y combinator is an important concept in functional programming[1]:

… we can similarly use knowledge of the Y combinator as a dividing line between programmers who are “functionally literate” (i.e. have a reasonably deep knowledge of functional programming) and those who aren’t.

There are a lot of existing materials that introduce the Y combinator, but I do not yet know of one that is written for C++ programmers. I would like to fill this gap.

That said, I will still start from functional languages, but you, dear reader, are not required to know one in advance. The focus is on the concept, and I will show running C++ code soon.

Now the journey begins.

Lambda Calculus

Of course you can find a huge amount of materials about lambda calculus, and I will describe just enough for C++ programmers to have the basic concept. Instead of the more common way of defining a square function as ‘sqr(x) = x · x’, one can define it as a lambda expression:

     sqr = λx.x · x

And that is exactly how languages like Haskell and Scheme define such a function, bar the syntax difference.

Haskell:

sqr = \ x -> x * x

Scheme:

(define sqr (lambda (x) (* x x)))

The key benefit is that you do not need a name for functions now. In order to apply the sqr function to 3, one can simply write (you may notice that parentheses are only used to override associativity, but not required around 3: this is the usual convention):

     (λx.x · x) 3

It is also intuitive how it is evaluated (called β-reduction in the terminology of lambda calculus): one replaces the occurrences of x with 3, and removes the initial λx. So the result is simply 3 · 3 = 9.

The C++ code for printing the result would be (assuming int type; also note that parentheses are required around ‘3’ in C++):

cout << [](int x) { return x * x; } (3) << endl;

However, considering that C++’s grammar is not really helpful to see lambda expressions clearly, it is probably better to use a name wherever appropriate:

auto const sqr = [](int x) { return x * x; };
cout << sqr(3) << endl;

Please be aware that auto is used here, as each C++ lambda expression has a unique type. However, a lambda expression can be converted to a function object (defined in <functional>). The example above can be changed as follows (‘std::’ is omitted), and the code will still compile and run (maybe with a negligible overhead):

function<int(int)> const sqr = [](int x) { return x * x; };
cout << sqr(3) << endl;
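
The point about unique lambda types can be checked at compile time. The following sketch uses hypothetical names (sqr_a, sqr_b, fn_a, fn_b) and shows that two textually identical lambda expressions still have different types, while their std::function wrappers share one type:

```cpp
#include <functional>
#include <type_traits>

// Two textually identical lambda expressions...
auto const sqr_a = [](int x) { return x * x; };
auto const sqr_b = [](int x) { return x * x; };

// ...still have two distinct closure types.
static_assert(!std::is_same<decltype(sqr_a), decltype(sqr_b)>::value,
              "each lambda expression has its own type");

// Wrapped in std::function, both objects have the same type, which is
// what makes it possible to name the type in a declaration.
std::function<int(int)> const fn_a = sqr_a;
std::function<int(int)> const fn_b = sqr_b;
static_assert(std::is_same<decltype(fn_a), decltype(fn_b)>::value,
              "std::function erases the closure type");
```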

Fixed-point Combinator

A fixed-point combinator[2] is a higher-order function that satisfies the following equation:

     y f = f (y f)

One can see immediately that this definition can lead to infinite expansion:

     y f = f (y f) = f (f (y f)) = f (… f (y f)…)

And that is exactly where the power lies. If the function y is given, it is possible to define a recursive function in a non-recursive way! Let us take factorial as an example. The traditional definition would look like this:

     fact n = If IsZero n then 1 else n × fact (n − 1)

Assume a function F exists so that fact = y F:

     (y F) n = If IsZero n then 1 else n × (y F) (n − 1)
     F (y F) n = If IsZero n then 1 else n × (y F) (n − 1)

Replace y F with f:

     F f n = If IsZero n then 1 else n × f (n − 1)

We now get a function F that can be defined without any recursion. We can define factorial exactly this way in Haskell:

fix f = f (fix f)
fact = fix $ \ f n -> if (n == 0) then 1 else n * f (n - 1)

This definition has two problems:

  • It works only in a lazy language[3] like Haskell.
  • Recursion occurs in the definition of fix.

I will address the first problem in the next section, and leave the second to the section after.
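
The infinite expansion y f = f (f (… f (y f)…)) can be made concrete even in a strict language if it is cut off after a fixed number of steps. The following C++ sketch does exactly that. The names unfold and not_expanded are hypothetical and used only for this illustration, and std::function<int(int)> is assumed as the type of the factorial function.

```cpp
#include <functional>
#include <stdexcept>

typedef std::function<int(int)> fn_1ord;

// F from the text: F f n = if n == 0 then 1 else n * f (n - 1).
// F itself contains no recursion.
inline fn_1ord F(fn_1ord f)
{
    return [f](int n) { return n == 0 ? 1 : n * f(n - 1); };
}

// Stand-in for the innermost, not yet expanded "(y F)".
inline int not_expanded(int)
{
    throw std::logic_error("expansion too shallow");
}

// Hypothetical helper: builds F (F (... (F not_expanded)...)), with F
// applied `depth` times.
inline fn_1ord unfold(int depth)
{
    fn_1ord f = not_expanded;
    for (int i = 0; i < depth; ++i)
        f = F(f);
    return f;
}
```

The function returned by unfold(k) computes the factorial correctly for arguments smaller than k and throws for larger ones: unfold(6)(5) gives 120, while unfold(3)(5) reaches not_expanded. A real fixed-point combinator has to provide as many expansions as the argument requires, on demand.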

How to Overcome Laziness

The laziness of Haskell is a great power, but it is not available in most other languages. As I mentioned in my last blog[4], the solution is quite simple: an additional layer of indirection. A lambda expression is not evaluated before the argument is given, so the following transformation will help (η-abstraction; assuming f takes one argument):

     f ⇒ λx.f x

The following program works in lazy Scheme:

(define Y
  (lambda (f)
    (f (Y f))))

(define F
  (lambda (f)
    (lambda (n)
      (cond
        ((eq? n 0) 1)
        (else (* n (f (- n 1))))))))

(define fact (Y F))

Before going ahead to transform the code for strict[5] Scheme, let us analyse the involved types first. That will be very important when we try to implement it in C++, and the misunderstanding of the types led me to think that the Y combinator was unimplementable in C++. A Haskell-like notation is used below:

  • f (in F) ∷ int → int
  • F f ∷ int → int
  • Y F ∷ int → int
  • F ∷ (int → int) → int → int
  • Y ∷ ((int → int) → int → int) → int → int

We can see that f (see line 10 in the code above), F f (see line 7), and Y F (which is the factorial function) are all first-order functions that take an integer and return an integer. Therefore, F is a second-order function that takes such a first-order function and returns a first-order function. Our Y takes a second-order function like F, and returns a first-order function.

Seeing that Y f is a first-order function, we can now apply the η-abstraction on it to eliminate the infinite expansion. The following definition of Y makes the program work again in strict Scheme:

(define Y
  (lambda (f)
    (f (lambda (x) ((Y f) x)))))

In the meantime, we have prepared enough for an implementation in C++, and I can show you the code now:

#include <iostream>
#include <functional>

using namespace std;

template <typename T, typename R>
function<R(T)> Y(function<function<R(T)>(function<R(T)>)> f)
{   // Y f = f (λx.(Y f) x)
    return f([=](T x) { return Y(f)(x); });
}

typedef function<int(int)>         fn_1ord;
typedef function<fn_1ord(fn_1ord)> fn_2ord;

fn_2ord almost_fact = [](fn_1ord f)
{
    return [f](int n)
    {
        if (n == 0) return 1; else return n * f(n - 1);
    };
};

int main()
{
    fn_1ord fact = Y(almost_fact);
    cout << "fact(5) = " << fact(5) << endl;
}

Hopefully everything now looks familiar and natural (except the C++ syntax, which is still quite cumbersome for the job).

Curry’s Y Combinator

Now we can come back and address the recursive definition problem. It is probably not a problem for practical usage, but it is very interesting indeed. The solution we encounter most often, the Y combinator, is attributed to Haskell Curry[6] (yes, the Haskell language is named after him). It is also mind-boggling, like any of the self-referencing puzzles and paradoxes. I am amazed by it, and that is why I want to write this blog.

There are some good references about deriving the Y combinator[7][8][9], and I will simply present the result here:

     Y = λf.(λx.f (x x)) (λx.f (x x))

I also want to show to you that this definition satisfies the fixed-point equation:

     Y F = (λf.(λx.f (x x)) (λx.f (x x))) F   (expand Y)
           = (λx.F (x x)) (λx.F (x x))          (β-reduction on the bound variable f)
           = F ((λx.F (x x)) (λx.F (x x)))   (β-reduction on the first bound variable x)
           = F (Y F)

Notice there is an equivalent form (β-reduce it once to get the Y above):

     Y = λf.(λx.x x) (λx.f (x x))

Basically, this is the definition in lazy Scheme:

(define Y
  (lambda (f)
    ((lambda (x) (x x))
     (lambda (x) (f (x x))))))

Recalling that the function argument of Y—which is f above—takes a first-order function as argument, we know x x should return a first-order function. So we can apply the η-abstraction to get the strict Scheme version:

(define Y
  (lambda (f)
    ((lambda (x) (x x))
     (lambda (x) (f (lambda (y) ((x x) y)))))))

Before implementing it in C++, we have one more difficulty. A statically typed language cannot make a function call like x x, at least not directly so, as the type system will forbid it. Even Haskell, which normally allows people to write very succinct code, cannot express the Y as directly as Scheme. One has to fool the type system with a proxy object.

Here is how it goes in C++:

template <typename F>
struct self_ref_func {
    function<F(self_ref_func)> fn;
};

I.e. an object of this type contains a function that takes an object of its own type and returns an object of the template type, which is again a function. E.g.:

typedef function<int(int)>     fn_1ord;
typedef self_ref_func<fn_1ord> fn_self_ref;
fn_self_ref x = { … };         // assign a suitable value to x
x.fn(x);                       // (x x)
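
To see the self-application at work with a concrete value, here is a minimal sketch. The helper make_sum_to is a hypothetical name, and the value it builds (a function summing the integers from 0 to n) is chosen only for illustration. The only "recursion" in it goes through self.fn(self), i.e. through (x x).

```cpp
#include <functional>

template <typename F>
struct self_ref_func {
    std::function<F(self_ref_func)> fn;
};

typedef std::function<int(int)>     fn_1ord;
typedef self_ref_func<fn_1ord>      fn_self_ref;

// Hypothetical example value: applied to itself, it yields a first-order
// function that sums 0..n without referring to any named function.
inline fn_self_ref make_sum_to()
{
    fn_self_ref x = {
        [](fn_self_ref self) -> fn_1ord {
            return [self](int n) {
                // self.fn(self) is (x x), evaluated only when needed
                return n == 0 ? 0 : n + self.fn(self)(n - 1);
            };
        }
    };
    return x;
}
```

Given fn_self_ref x = make_sum_to(), the expression x.fn(x) is a fn_1ord, and x.fn(x)(4) evaluates to 10.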

With the help of this self-referencing type, Curry’s Y combinator can finally be implemented in C++:

template <typename T, typename R>
function<R(T)> Y(function<function<R(T)>(function<R(T)>)> f)
{   // Y = λf.(λx.x x) (λx.f (λy.(x x) y))
    typedef function<R(T)>          fn_1ord;
    typedef self_ref_func<fn_1ord>  fn_self_ref;
    fn_self_ref r = {
        [f](fn_self_ref x)
        {   // λx.f (λy.(x x) y)
            return f(fn_1ord([x](T y)
                             {
                                 return x.fn(x)(y);
                             }));
        }
    };
    return r.fn(r);
}

Summary

I have shown how the famous Y combinator can be implemented in C++, and what techniques are required to do it. I hope you have found the information useful. If you have read this far, you will probably also be interested in reading about practical uses of the Y combinator[10], and you will probably want a decent Haskell implementation[11] and Scheme implementation[12]. I have also put the C++ code for Y (slightly changed) in a public repository[13], as ready-to-use library code. People have actually implemented the Y combinator in more than five dozen languages[14].

My C++ code is tested on Clang 3.5 and GCC 4.9 in C++11 mode. Haskell code is tested on GHC 7.8.3, and Scheme code is tested on DrRacket 6.1.

Have fun!

Update (5 May 2016)

The code here only demonstrates what can be done in C++, but is not optimized for speed. I have just noticed that there is now a proposal[15] to add a Y combinator to the C++ standard library. To my amazement, the reference implementation is both simple and fast: in my test, its speed is close to that of native recursive functions (-O3 under Clang, or -O2 under GCC). You probably want to check it out!
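
The general idea behind implementations that avoid std::function can be sketched as follows: a small wrapper stores the callable and passes itself as the first argument on each call. This is only an illustrative sketch that assumes a C++14 compiler (for generic lambdas and decltype(auto)); fix_wrapper, make_fix, and fact_direct are hypothetical names, and the code is not the reference implementation from the proposal.

```cpp
#include <utility>

// Stores a callable and hands a reference to itself to that callable,
// so the callable can "recurse" through its first parameter.
template <typename Fn>
class fix_wrapper {
public:
    explicit fix_wrapper(Fn fn) : fn_(std::move(fn)) {}

    template <typename... Args>
    decltype(auto) operator()(Args&&... args) const
    {
        return fn_(*this, std::forward<Args>(args)...);
    }

private:
    Fn fn_;
};

template <typename Fn>
fix_wrapper<Fn> make_fix(Fn fn)
{
    return fix_wrapper<Fn>(std::move(fn));
}

// Example: factorial written without naming itself. The explicit
// return type (-> int) is needed because self(...) is used in the body.
inline int fact_direct(int n)
{
    auto const fact = make_fix([](auto const& self, int m) -> int {
        return m == 0 ? 1 : m * self(m - 1);
    });
    return fact(n);
}
```

Since no type erasure is involved, the compiler sees all the calls directly, which is one plausible reason why such code can be optimized much better than the std::function-based versions above.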


  1. Mike Vanier: The Y Combinator (Slight Return) or How to Succeed at Recursion Without Really Recursing, section ‘Why Y?’. 
  2. Wikipedia: Fixed-point combinator
  3. Wikipedia: Lazy evaluation
  4. Yongwei Wu: Study Notes: Functional Programming with C++
  5. Wikipedia: Strict programming language
  6. Felice Cardone and J. Roger Hindley: History of Lambda-calculus and Combinatory Logic (PDF), pp. 8–9. 
  7. Mike Vanier: The Y Combinator (Slight Return) or How to Succeed at Recursion Without Really Recursing, section ‘Deriving the Y combinator’. 
  8. Matthias Felleisen: A Lecture on the Why of Y (PS). 
  9. Daniel P. Friedman and Matthias Felleisen: The Little Schemer, chapter 9. MIT Press, Cambridge, Mass., 4th edition, 1995. 
  10. Bruce J. McAdams: That About Wraps it Up: Using FIX to Handle Errors Without Exceptions, and Other Programming Tricks
  11. GHC is the standard Haskell implementation. 
  12. Racket is a nice Scheme platform. 
  13. functional.h contains fix_simple, fix_curry, and fix_function_converter make_curry, among other templates. 
  14. Rosetta Code: Y Combinator
  15. Yegor Derevenets: A Proposal to Add Y Combinator to the Standard Library

Study Notes: Functional Programming with C++

I have been enthused with functional programming recently. As C++ is my favourite language (or most familiar language), I have also tried to implement in C++ some of the techniques I learnt. And this blog serves to record my learning experience. It is my sincere hope that it will be useful to other people that come to functional programming with an imperative programming background.

Without further ado, let me describe first what I wanted to implement:

  • Map
  • Reduce
  • Pipeline

If you are not familiar with these concepts, do not worry, as I will show you the code in both C++ and some ‘more functional’ languages. The functional language implementations are always easy to understand, even if you do not know the languages!

Map

The concept of the map function is quite simple: it applies the function parameter to each item in a list parameter, and returns the results as a list. The following example shows the application of the successor function to a list of numbers (in Haskell):

> print (map succ [1,2,3,4])
[2,3,4,5]

Map is normally a built-in function in a functional language, but implementing it is trivial too. The Haskell implementation might be the most succinct:

map f [] = []
map f (x:xs) = f x : map f xs

It does the work recursively: apply f to the head of the list (x), and concatenate the result with that of map f applied to the rest of the list (xs) until the list is empty. The procedure may be more explicit in the Scheme code below:

(define (map f l)
  (cond
    ((null? l) l)
    (else (cons (f (car l)) (map f (cdr l))))))

So how do we implement it in C++?

Actually C++98[1] already has something quite close: the std::transform[2] function template. The problem is that it is not as composable as the functional equivalent, and you cannot just take the return result and print it. The equivalent code for the Haskell example above is as follows (in C++11[3] style):

#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

using namespace std;

template <typename Type>
ostream& operator<<(ostream& os, const vector<Type>& v)
{
    os << "[ ";
    copy(v.begin(), v.end(),
         ostream_iterator<Type>(os, " "));
    os << "]";
    return os;
}

int main()
{
    auto const succ = [](int x) { return x + 1; };
    vector<int> v1{1, 2, 3, 4};
    vector<int> v2;
    transform(v1.begin(), v1.end(), back_inserter(v2), succ);
    cout << v2 << endl;
}

The programmer has to define the return variable first, and something like back_inserter[4] needs to be used explicitly. The flexibility is there, but the programmer has to bear an extra burden.

Of course, the C++ language provides enough facilities to define an alternative function. It is actually pretty simple[5] (though the C++ syntax is intimidating indeed for people newly coming into the template world):

template <template <typename,typename> class OutCont=vector,
          template <typename> class Alloc = allocator,
          typename Fn, class Cont>
OutCont<typename Cont::value_type,
        Alloc<typename Cont::value_type>>
map(Fn mapfn, const Cont& inputs)
{
    OutCont<typename Cont::value_type,
        Alloc<typename Cont::value_type>> result;
    for (auto& item : inputs)
        result.push_back(mapfn(item));
    return result;
}

With this function template, you can now write

    cout << map(succ, v1) << endl;

or even

    cout << map(succ, vector<int>{1, 2, 3, 4}) << endl;
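
Note that the map template above always uses the element type of the input container for the result, so the mapping function has to return something convertible to that type. One possible variation, shown here only as a sketch under the assumption of a C++14 compiler, deduces the result element type from the function instead. The names map_to_vector and to_strings are hypothetical, and the result is always a std::vector for simplicity.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <type_traits>
#include <vector>

// Variation of map: the element type of the result is whatever mapfn
// returns (after decay), not the element type of the input container.
template <typename Fn, class Cont>
auto map_to_vector(Fn mapfn, const Cont& inputs)
{
    typedef typename std::decay<
        decltype(mapfn(*std::begin(inputs)))>::type result_type;
    std::vector<result_type> result;
    std::transform(std::begin(inputs), std::end(inputs),
                   std::back_inserter(result), mapfn);
    return result;
}

// Example use: the element type changes from int to std::string.
inline std::vector<std::string> to_strings(const std::vector<int>& v)
{
    return map_to_vector([](int x) { return std::to_string(x); }, v);
}
```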

Reduce

The reduce function, also called fold, reduces a list to a single value. Its usage can be powerfully demonstrated in the following Haskell example:

> foldl (+) 0 [1..100]
5050

The implementation of foldl (called thus as there is also a foldr function) is essentially as follows:

foldl f z []     = z
foldl f z (x:xs) = foldl f (f z x) xs

I.e., this function applies the two-argument function f recursively over the list items, and the parameter z is used as the initial value. I will show immediately my C++ code for comparison:

template <typename Fn, class Cont>
typename Cont::value_type
reduce(Fn reducefn, const Cont& inputs,
       typename Cont::value_type initval =
       typename Cont::value_type())
{
    auto result = initval;
    for (auto& item : inputs)
        result = reducefn(result, item);
    return result;
}

One can use the template like the code below:

    cout << reduce(plus<int>(), vector<int>{1, 2, 3, 4, 5})
         << endl;

It needs to be mentioned that C++ has a std::accumulate[6] function template, which is similar to reduce, but it suffers from a problem similar to that of std::transform. I will not elaborate on the details here.
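
For comparison, the summation shown with reduce can be written with std::accumulate roughly as follows. The wrapper function sum_with_accumulate is a hypothetical name used only to keep the sketch self-contained.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// std::accumulate works on an iterator pair and needs an explicit
// initial value, so the container must have a name before the call.
inline int sum_with_accumulate(const std::vector<int>& v)
{
    return std::accumulate(v.begin(), v.end(), 0, std::plus<int>());
}
```

The iterator-pair interface is what prevents passing a temporary such as vector<int>{1, 2, 3, 4, 5} directly to std::accumulate in a single expression.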

Pipeline

Pipelining is about composing functions to form a new function. Haskell supports composition directly in the language with the operator ‘.’. In order to calculate sqrt(x + 1), one only needs to define a new function as follows:

plus_1_sqrt = sqrt . succ

The following is something similar in Python using the reduce function:

from functools import reduce  # built in on Python 2, required on Python 3

def pipeline_func(data, fns):
    return reduce(lambda a, x: x(a),
                  fns,
                  data)

One could use the function as follows:

import math

def plus_1(x):
    return x + 1

pipeline_func(3, [plus_1, math.sqrt])  # result is 2.0

I actually was frustrated at this point about how to implement it in C++. I simply did not have a clue. After some googling, and especially after finding the insightful blogs of Bartosz Milewski[7], I had better ideas. Specifically, the blog ‘What Does Haskell Have to Do with C++?’[8] enlightened me. I quickly came up with this solution[9]:

template <typename Tp>
auto apply(Tp&& data)
{
    return forward<Tp>(data);
}

template <typename Tp, typename Fn, typename... Fargs>
auto apply(Tp&& data, Fn fn, Fargs... args)
{
    return apply(fn(forward<Tp>(data)), args...);
}

In order to make type inference work (which depends on the input and return types of the passed function arguments), a C++14[10] compiler is needed. Using Clang 3.4+ or GCC 4.9 (-std=c++1y needs to be specified on the command line), the following code will print the correct result 55:

    auto const sqr = [](int x) { return x * x; };
    auto const square_list =
        [=](const vector<int>& data) {
            return map(sqr, data);
        };
    auto const sum_list =
        [=](const vector<int>& data) {
            return reduce(plus<int>(), data);
        };
    cout << apply(vector<int>{1, 2, 3, 4, 5},
                  square_list,
                  sum_list)
         << endl;

There is a minor problem, though. In Haskell, the result of composition is a function that can be passed around (with type inference); in Python, the function list can be passed around (no type inference, due to its dynamic nature). In my implementation of apply, the programmer has to specify the function list exactly at the point of calling in order to make type inference work. This significantly limits its usefulness.

I realized the solution only a few weeks later. The key issue was that Haskell is a lazy language[11], but C++ is an eager language. Actually the answer was always in front of me, but I just failed to see it for a long time. All I needed was an extra layer of indirection, which lambda expressions fit nicely. After learning that, everything seems simple:

template <typename Tp>
auto compose()
{
    return apply<Tp>;
}

template <typename Tp, typename Fn, typename... Fargs>
auto compose(Fn fn, Fargs... args)
{
    return [=](Tp&& x)
    {
        return fn(compose<Tp>(args...)(forward<Tp>(x)));
    };
}

You can see that it is very much like the apply implementation, but the additional lambda expression makes lazy evaluation possible. Incidentally, compose() with no argument returns the identity function (id in Haskell).

The following code demonstrates its use:

    auto const squared_sum =
        compose<const vector<int>&>(sum_list, square_list);
    cout << squared_sum(vector<int>{1, 2, 3, 4, 5}) << endl;

(Please be aware that apply and compose take function arguments in the opposite order—the latter takes arguments like the Haskell ‘.’ operator.)
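
The difference in argument order can be seen in the following simplified sketch. It deliberately passes everything by value and uses C++14 generic lambdas instead of an explicit Tp parameter, so it is not the apply and compose defined above; pipe_simple and compose_simple are hypothetical names used only for this illustration.

```cpp
#include <utility>

// compose_simple() with no functions is the identity.
inline auto compose_simple()
{
    return [](auto x) { return x; };
}

// compose_simple(f, g, ...)(x) == f(g(...(x))): the last function
// listed is applied first, like the Haskell '.' operator.
template <typename Fn, typename... Fargs>
auto compose_simple(Fn fn, Fargs... args)
{
    return [=](auto x) {
        return fn(compose_simple(args...)(std::move(x)));
    };
}

// pipe_simple(x) with no functions returns the data unchanged.
template <typename Tp>
Tp pipe_simple(Tp data)
{
    return data;
}

// pipe_simple(x, f, g, ...) == ...(g(f(x))): the first function listed
// is applied first, like apply.
template <typename Tp, typename Fn, typename... Fargs>
auto pipe_simple(Tp data, Fn fn, Fargs... args)
{
    return pipe_simple(fn(std::move(data)), args...);
}
```

With inc adding one and dbl doubling its argument, pipe_simple(5, inc, dbl) first increments and then doubles, giving 12, whereas compose_simple(inc, dbl)(5) first doubles and then increments, giving 11.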

Ranges

During the search for functional programming information, I encountered Eric Niebler’s blog ‘Range Comprehensions’[12], which is simply amazing. If you have read thus far and are interested in functional programming with C++, definitely have a look at his blog. He provided a comprehensive library with many Haskell-like features, and the C++ standard committee liked it too! Hopefully we will see it in some production libraries soon.

Summary

As C++ evolves, more features are added to the language to enable a more functional style for programming. This is a good thing, as it allows people to be more productive, while keeping easy interaction with the traditional C/C++ code. I am learning, and wish my learning experience could be useful to others too.

May Imperative and Functional live happily together ever after!

Footnotes

1. C++03
2. transform – C++ Reference
3. C++11
4. back_inserter – C++ Reference
5. There are ways to optimize the code for efficiency (and make the code more complicated). You may want to check out a more complete implementation here.
6. accumulate – C++ Reference
7. Bartosz Milewski’s Programming Cafe
8. What Does Haskell Have to Do with C++?
9. To be honest, perfect forwarding was added later. Worse, I did not make it right. See Type Deduction and My Reference Mistakes for my updated code.
10. C++14
11. Lazy evaluation
12. Range Comprehensions