Compile-Time Strings

I have encountered many compile-time uses of strings in my projects in the past few years. I would like to summarize my experience today.

Choice of Types

std::string is mostly unsuitable for compile-time string manipulations. There are several reasons:

  • Before C++20 one cannot use strings at all at compile time. In addition, the support for compile-time strings comes quite late among the major compilers. MSVC was the front runner in this regard, GCC came second with GCC 12 (released a short while ago), and Clang has not yet had a formal release with compile-time string support.
  • With C++20 one can use strings at compile time, but there are still a lot of inconveniences, the most obvious being that strings generated at compile time cannot be used at run time. Besides, a string cannot be declared constexpr.
  • A string cannot be used as a template argument.

So we have to give up this apparent choice, but explore other possibilities. The candidates are:

  • const char pointer, which is what a string literal can naturally decay to
  • string_view, a powerful tool added by C++17: it has similar member functions to those of string, but they are mostly marked as constexpr!
  • array, with which we can generate brand-new strings

We will try these types in the following discussion.

Functions Commonly Needed

Getting the String Length

One of the most basic functions on a string is getting its length. Here we cannot use the C function strlen, as it is not constexpr.

We will try several different ways to implement it.

First, we can implement strlen manually, and mark the function constexpr:

namespace strtools {

constexpr size_t length(const char* str)
{
    size_t count = 0;
    while (*str != '\0') {
        ++str;
        ++count;
    }
    return count;
}

} // namespace strtools

However, is there an existing mechanism to retrieve the length of a string in the standard library? The answer is a definite Yes. The standard library does support getting the length of a string of any of the standard character types, like char, wchar_t, etc. With the most common character type char, we can write:

constexpr size_t length(const char* str)
{
    return char_traits<char>::length(str);
}

Starting with C++17, the methods of char_traits can be used at compile time. (However, you may encounter problems with older compiler versions, like GCC 8.)

Assuming you can use C++17, string_view is definitely worth a try:

constexpr size_t length(string_view sv)
{
    return sv.size();
}

Regardless of the approach used, now we can use the following code to verify that we can indeed check the length of a string at compile time:

static_assert(strtools::length("Hi") == 2);

At present, the string_view implementation seems the most convenient.

Finding a Character

Finding a specific character is also quite often needed. We can’t use strchr, but again, we can choose from a few different implementations. The code is pretty simple, whether implemented with char_traits or with string_view.

Here is the version with char_traits:

constexpr const char* find(const char* str, char ch)
{
    return char_traits<char>::find(str, length(str),
                                   ch);
}

Here is the version with string_view:

constexpr string_view::size_type find(string_view sv,
                                      char ch)
{
    return sv.find(ch);
}

I am not going to show the manual lookup code this time. (Unless you have to use an old compiler, simpler is better.)

Comparing Strings

The next functions are string comparisons. Here string_view wins hands down: string_view supports the standard comparisons directly, and you do not need to write any code.

Getting Substrings

It seems that string_views are very convenient, and we should use string_views wherever possible. However, is string_view::substr enough for getting substrings? This is difficult to answer without an actual usage scenario. One real scenario I encountered in projects was that the __FILE__ macro may contain the full path at compile time, resulting in different binaries when compiling under different paths. We wanted to truncate the path completely so that the absolute paths would not show up in binaries.

My tests showed that string_view::substr could not handle this job. With the following code:

puts("/usr/local"sv.substr(5).data());

We will see assembly output like the following from the compiler (see https://godbolt.org/z/1dssd96vz):

.LC0:
        .string "/usr/local"
        …
        mov     edi, OFFSET FLAT:.LC0+5
        call    puts

We have to find another way. . . .

Let’s try array. It’s easy to think of code like the following:

constexpr auto substr(string_view sv, size_t offset,
                      size_t count)
{
    array<char, count + 1> result{};
    copy_n(&sv[offset], count, result.data());
    return result;
}

The intention of the code should be very clear: generate a brand-new character array of the requested size and zero it out (constexpr variables must be initialized on declaration before C++20); copy what we need; and then return the result. Unfortunately, the code won’t compile. . . .

There are two problems in the code:

  • Functions parameters are not constexpr, and cannot be used as template arguments.
  • copy_n is not constexpr before C++20, and cannot be used in compile-time programming.

The second problem is easy to fix: a manual loop will do. We shall focus on the first problem.

A constexpr function can be evaluated at compile time or at run time, so its function arguments are not treated as compile-time constants, and cannot be used in places where compile-time constants are required, such as template arguments.

Furthermore, this problem still exists with the C++20 consteval function, where the function is only invoked at compile time. The main issue is that if we allow function parameters to be used as compile-time constants, then we can write a function where its arguments of different values (same type) can produce return values of different types. For example (currently illegal):

consteval auto make_constant(int n)
{
    return integral_constant<int, n>{};
}

This is unacceptable in the current type system: we still require that the return values of a function have a unique type. If we want a value to be used as a template argument inside a function, it must be passed to the function template as a template argument (rather than as a function argument to a non-template function). In this case, each distinct template argument implies a different template specialization, so the issue of a multiple-return-type function does not occur.

By the way, a standard proposal P1045 tried to solve this problem, but its progress seems stalled. As there are workarounds (to be discussed below), we are still able to achieve the desired effect.

Let’s now return to the substr function and convert the count parameter into a template parameter. Here is the result:

template <size_t Count>
constexpr auto substr(string_view sv, size_t offset = 0)
{
    array<char, Count + 1> result{};
    for (size_t i = 0; i < Count; ++i) {
        result[i] = sv[offset + i];
    }
    return result;
}

The code can really work this time. With ‘puts(substr("/usr/local", 5).data())’, we no longer see "/usr/" in the compiler output.


Regretfully, we now see how compilers are challenged with abstractions: With the latest versions of GCC (12.1) and MSVC (19.32) on Godbolt, this version of substr does not generate the optimal output. There are also some compatibility issues with older compiler versions. So, purely from a practical point of view, I recommend the following implementation that does not use string_view:

template <size_t Count>
constexpr auto substr(const char* str,
                      size_t offset = 0)
{
    array<char, Count + 1> result{};
    for (size_t i = 0; i < Count; ++i) {
        result[i] = str[offset + i];
    }
    return result;
}

If you are interested, you can compare the assembly outputs of these two different versions of the code:

Only Clang is able to generate the same efficient assembly code with both versions:

        mov     word ptr [rsp + 4], 108
        mov     dword ptr [rsp], 1633906540
        mov     rdi, rsp
        call    puts

If you don’t understand why there are the numbers 108 and 1633906540, let me remind you that the hexadecimal representations of these two numbers are 0x6C and 0x61636F6C, respectively. Check the ASCII table and you should be able to understand.


Since we stopped using string_view in the function parameters, the parameter offset becomes much less useful. Hence, I will get rid of this parameter, and rename the function to copy_str:

template <size_t Count>
constexpr auto copy_str(const char* str)
{
    array<char, Count + 1> result{};
    for (size_t i = 0; i < Count; ++i) {
        result[i] = str[i];
    }
    return result;
}

Passing Arguments at Compile Time

When you try composing the compile-time functions together, you will find something lacking. For example, if you wanted to remove the first segment of a path automatically (like from "/usr/local" to "local"), you might try some code like the following:

constexpr auto remove_head(const char* path)
{
    if (*path == '/') {
        ++path;
    }
    auto start = find(path, '/');
    if (start == nullptr) {
        return copy_str<length(path)>(path);
    } else {
        return copy_str<length(start + 1)>(start + 1);
    }
}

The problem is still that it won’t compile. And did you notice that this code violates exactly the constraint I mentioned above that the return type of a function must be consistent and unique?

I have adopted a solution described by Michael Park: using lambda expressions to encapsulate ‘compile-time arguments’. I have defined three macros for convenience and readability:

#define CARG typename
#define CARG_WRAP(x) [] { return (x); }
#define CARG_UNWRAP(x) (x)()

‘CARG’ means ‘constexpr argument’, a compile-time constant argument. We can now make make_constant really work:

template <CARG Int>
constexpr auto make_constant(Int cn)
{
    constexpr int n = CARG_UNWRAP(cn);
    return integral_constant<int, n>{};
}

And it is easy to verify that it works:

auto result = make_constant(CARG_WRAP(2));
static_assert(std::is_same_v<integral_constant<int, 2>,
                             decltype(result)>);

A few explanations follow. In the template parameter, I use CARG (instead of typename) for code readability: it indicates the intention that the template parameter is essentially a type wrapper for compile-time constants. Int is the name of this special type. We will not provide this type when instantiating the function template, but instead let the compiler deduce it. When calling the ‘function’ (make_constant(CARG_WRAP(2))), we provide a lambda expression ([] { return (2); }), which encapsulates the constant we need. When we need to use this parameter, we use CARG_UNWRAP (evaluate: [] { return (2); }()) to get the constant back.

Now we can rewrite the remove_head function:

template <CARG Str>
constexpr auto remove_head(Str cpath)
{
    constexpr auto path = CARG_UNWRAP(cpath);
    constexpr int skip = (*path == '/') ? 1 : 0;
    constexpr auto pos = path + skip;
    constexpr auto start = find(pos, '/');
    if constexpr (start == nullptr) {
        return copy_str<length(pos)>(pos);
    } else {
        return copy_str<length(start + 1)>(start + 1);
    }
}

This function is similar in structure to the previous version, but there are many detail changes. In order to pass the result to copy_str as a template argument, we have to use constexpr all the way along. So we have to give up mutability, and write code in a quite functional style.

Does it really work? Let’s put the following statement into the main function:

puts(strtools::remove_head(CARG_WRAP("/usr/local"))
         .data());

And here is the optimized assembly output from GCC on x86-64 (see https://godbolt.org/z/Mv5YanPvq):

main:
        sub     rsp, 24
        mov     eax, DWORD PTR .LC0[rip]
        lea     rdi, [rsp+8]
        mov     DWORD PTR [rsp+8], eax
        mov     eax, 108
        mov     WORD PTR [rsp+12], ax
        call    puts
        xor     eax, eax
        add     rsp, 24
        ret
.LC0:
        .byte   108
        .byte   111
        .byte   99
        .byte   97

As you can see clearly, the compiler will put the ASCII codes for "local" on the stack, assign its starting address to the rdi register, and then call the puts function. There is absolutely no trace of "/usr/" in the output. In fact, there is no difference between the output of the puts statement above and that of ‘puts(substr("/usr/local", 5).data())’.

I would like to remind you that it is safe to pass and store the character array, but it is not safe to store the pointer obtained from its data() method. It is possible to use such a pointer immediately in calling other functions (like puts above), as the lifetime of array will extend till the current statement finishes execution. However, if you saved this pointer, it would become dangling after the current statement, and dereferencing it would then be undefined behaviour.

String Template Parameters

We have tried turning strings into types (via lambda expressions) for compile-time argument passing, but unlike integers and integral_constants, there is no one-to-one correspondence between the two. This is often inconvenient: for two integral_constants, we can directly use is_same to determine whether they are the same; for strings represented as lambda expressions, we cannot do the same—two lambda expressions always have different types.

Direct use of string literals as non-type template arguments is not allowed in C++, because strings may appear repeatedly in different translation units, and they do not have proper comparison semantics—comparing two strings is just a comparison of two pointers, which cannot achieve what users generally expect. To use string literals as template arguments, we need to find a way to pass the string as a sequence of characters to the template. We have two methods available:

  • The non-standard GNU extension used by GCC and Clang (which can be used prior to C++20)
  • The C++20 approach suitable for any conformant compilers (including GCC and Clang)

Let’s have a look one by one.

The GNU Extension

GCC and Clang have implemented the standard proposal N3599, which allows us to use strings as template arguments. The compiler will expand the string into characters, and the rest is standard C++.

Here is an example:

template <char... Cs>
struct compile_time_string {
    static constexpr char value[]{Cs..., '\0'};
};

template <typename T, T... Cs>
constexpr compile_time_string<Cs...> operator""_cts()
{
    return {};
}

The definition of the class template is standard C++, so that compile_time_string is a valid type and, at the same time, by taking the value member of this type, we can get "Hi". The GNU extension is the string literal operator template—we can now write ‘"Hi"_cts’ to get an object of type compile_time_string. The following code will compile with the above definitions:

constexpr auto a = "Hi"_cts;
constexpr auto b = "Hi"_cts;
static_assert(is_same_v<decltype(a), decltype(b)>);

The C++20 Approach

Though the above method is simple and effective, it failed to reach consensus in the C++ standards committee and did not become part of the standard. However, with C++20, we can use more types in non-type template parameters. In particular, user-defined literal types are amongst them. Here is an example:

template <size_t N>
struct compile_time_string {
    constexpr compile_time_string(const char (&str)[N])
    {
        copy_n(str, N, value);
    }
    char value[N]{};
};

template <compile_time_string cts>
constexpr auto operator""_cts()
{
    return cts;
}

Again, the first class template is not special, but allowing this compile_time_string to be used as the type of a non-type template parameter (quite a mouthful😝), as well as the string literal operator template, is a C++20 improvement. We can now write ‘"Hi"_cts’ to generate a compile_time_string object. Note, however, that this object is of type compile_time_string, so "Hi"_cts and "Ha"_cts are of the same type—which is very different from the results of the GNU extension. However, the important thing is that compile_time_string can now be used as type of a template parameter, so we can just add another layer:

template <compile_time_string cts>
struct cts_wrapper {
    static constexpr compile_time_string str{cts};
};

Corresponding to the previous compile-time string type comparison, we now need to write:

auto a = cts_wrapper<"Hi"_cts>{};
auto b = cts_wrapper<"Hi"_cts>{};
static_assert(is_same_v<decltype(a), decltype(b)>);

Or we can further simplify it to (as compile_time_string has a non-explicit constructor):

auto a = cts_wrapper<"Hi">{};
auto b = cts_wrapper<"Hi">{};
static_assert(is_same_v<decltype(a), decltype(b)>);

Summary

In this blog I have discussed two things:

  • Compile-time string manipulations
  • Strings as non-type template parameters

They have proved to be useful in my real projects. When having time, I will explore some usages later. Stay tuned!

Contextual Memory Tracing

The Need

A long, long time ago I wrote about memory debugging. I redefined new as a macro, took advantage of placement new, and replaced the global operator new. However, only the replacement of the global operator new was useful in catching memory leaks in the end, while the other facilities became more or less futile; for memory allocation was becoming more and more implicit with the spread use of STL containers and smart pointers, and direct use of new was discouraged. It is even more so today, with many coding conventions basically banning the use of new and delete. So I would like to revisit this topic.

First, there is still need to trace memory usage, even though memory leakage, in the way of unmatched new, is unlikely today. People still need to know how memory is used, by which parts of the program, in which functions and modules, and so on. The exact point of memory allocation is becoming less relevant, as memory allocation is becoming less direct. It probably occurs in some libraries, instead of in application code. So now tracing memory usage means recording the usage context, instead of the exact code position.

Be Contextual

Usage contexts can be set up in a stack-like data structure, and I have done so several times in the past. What needs to be recorded in the context is something one needs to decide beforehand. If you only want to trace memory usage, you can do as I do below. But you may want to fit the interface with your specific memory manager, adding what needs to be passed to it. Anyway, you decide what should be in. My example code is as follows:

struct context {
    const char* file;
    const char* func;
};

We want to record the context automatically, and RAII can be used for this purpose:

class checkpoint {
public:
    explicit checkpoint(const context& ctx);
    ~checkpoint();

private:
    const context ctx_;
};

#define CTX_MEMORY_CHECKPOINT()       \
    checkpoint memory_checkpoint{     \
        context{__FILE__, __func__}}

thread_local std::deque<context>
    context_stack{
        context{"<UNKNOWN>", "<UNKNOWN>"}};

void save_context(const context& ctx)
{
    context_stack.push_back(ctx);
}

void restore_context(const context& ctx)
{
    assert(!context_stack.empty() &&
           context_stack.back() == ctx);
    context_stack.pop_back();
}

const context& get_current_context()
{
    assert(!context_stack.empty());
    return context_stack.back();
}

checkpoint::checkpoint(const context& ctx) : ctx_(ctx)
{
    save_context(ctx);
}

checkpoint::~checkpoint()
{
    restore_context(ctx_);
}

Please notice that the context_stack needs to be thread_local, which was something not standardized when I last wrote about tracing memory usage. It is very convenient to save the information on a per-stack basis.

Fitting with Real Memory Managers

Before we go define the operator new and operator delete functions (normally called ‘allocation’ and ‘deallocation’ functions, and there are a lot of them), let us define first the generic/sample functions that do the real allocation and deallocation. We just pass on the necessary arguments to the system memory manager here (but you may want to do more to make it work with an existing memory manager) and we still use the C convention that a memory allocation failure is indicated by a null pointer:

void* ctx_alloc(size_t size, size_t alignment,
                const context* /*unused*/)
{
#ifdef _WIN32
    return _aligned_malloc(size, alignment);
#elif defined(__unix) || defined(__unix__)
    void* memptr{};
    int result = posix_memalign(&memptr, alignment, size);
    if (result == 0) {
        return memptr;
    } else {
        return nullptr;
    }
#else
    // No alignment guarantees on other platforms
    (void)alignment;
    return malloc(size);
#endif
}

void ctx_free(void* ptr, const context* /*unused*/)
{
#ifdef _WIN32
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}

operator new & operator delete

Now we can go to the allocation and deallocation functions. The declarations with our new context parameter are the following:

void* operator new  (size_t size,
                     const context& ctx);
void* operator new[](size_t size,
                     const context& ctx);
void* operator new  (size_t size,
                     std::align_val_t align_val,
                     const context& ctx);
void* operator new[](size_t size,
                     std::align_val_t align_val,
                     const context& ctx);

void operator delete  (void* ptr,
                       const context&) noexcept;
void operator delete[](void* ptr,
                       const context&) noexcept;
void operator delete  (void* ptr,
                       std::align_val_t align_val,
                       const context&) noexcept;
void operator delete[](void* ptr,
                       std::align_val_t align_val,
                       const context&) noexcept;

But these are not all the function we need to rewrite. We need to replace the non-contextual version too, and they are actually key to the memory tracing functionality. Our saved context can be used, like the following:

void* operator new(size_t size)
{
    return operator new(size, get_current_context());
}

void* operator new[](size_t size)
{
    return operator new[](size, get_current_context());
}

Assuming the existence of an alloc_mem and a free_mem function, we can make the rest of the allocation and deallocation functions basically forwarders:

void* operator new(size_t size,
                   const std::nothrow_t&) noexcept
{
    return alloc_mem(size, get_current_context(),
                     alloc_is_not_array);
}

void* operator new[](size_t size,
                     const std::nothrow_t&) noexcept
{
    return alloc_mem(size, get_current_context(),
                     alloc_is_array);
}

void* operator new(size_t size,
                   std::align_val_t align_val)
{
    return operator new(size, align_val,
                        get_current_context());
}

void* operator new[](size_t size,
                     std::align_val_t align_val)
{
    return operator new[](size, align_val,
                          get_current_context());
}

void* operator new(size_t size,
                   std::align_val_t align_val,
                   const std::nothrow_t&) noexcept
{
    return alloc_mem(size, get_current_context(),
                     alloc_is_not_array,
                     size_t(align_val));
}

void* operator new[](size_t size,
                     std::align_val_t align_val,
                     const std::nothrow_t&) noexcept
{
    return alloc_mem(size, get_current_context(),
                     alloc_is_array,
                     size_t(align_val));
}

void* operator new(size_t size, const context& ctx)
{
    void* ptr = alloc_mem(size, ctx, alloc_is_not_array);
    if (ptr == nullptr) {
        throw std::bad_alloc();
    }
    return ptr;
}

void* operator new[](size_t size, const context& ctx)
{
    void* ptr = alloc_mem(size, ctx, alloc_is_array);
    if (ptr == nullptr) {
        throw std::bad_alloc();
    }
    return ptr;
}

void* operator new(size_t size,
                   std::align_val_t align_val,
                   const context& ctx)
{
    void* ptr = alloc_mem(size, ctx, alloc_is_not_array,
                          size_t(align_val));
    if (ptr == nullptr) {
        throw std::bad_alloc();
    }
    return ptr;
}

void* operator new[](size_t size,
                     std::align_val_t align_val,
                     const context& ctx)
{
    void* ptr = alloc_mem(size, ctx, alloc_is_array,
                          size_t(align_val));
    if (ptr == nullptr) {
        throw std::bad_alloc();
    }
    return ptr;
}

void operator delete(void* ptr) noexcept
{
    free_mem(ptr, alloc_is_not_array);
}

void operator delete[](void* ptr) noexcept
{
    free_mem(ptr, alloc_is_array);
}

void operator delete(void* ptr, size_t) noexcept
{
    free_mem(ptr, alloc_is_not_array);
}

void operator delete[](void* ptr, size_t) noexcept
{
    free_mem(ptr, alloc_is_array);
}

void operator delete(
    void* ptr, std::align_val_t align_val) noexcept
{
    free_mem(ptr, alloc_is_not_array,
                 size_t(align_val));
}

void operator delete[](
    void* ptr, std::align_val_t align_val) noexcept
{
    free_mem(ptr, alloc_is_array,
                 size_t(align_val));
}

void operator delete(void* ptr,
                     const context&) noexcept
{
    operator delete(ptr);
}

void operator delete[](void* ptr,
                       const context&) noexcept
{
    operator delete[](ptr);
}

void operator delete(void* ptr,
                     std::align_val_t align_val,
                     const context&) noexcept
{
    operator delete(ptr, align_val);
}

void operator delete[](void* ptr,
                       std::align_val_t align_val,
                       const context&) noexcept
{
    operator delete[](ptr, align_val);
}

Contexts and Allocation/Deallocation

Now let us focus on the two functions that do the real job:

enum is_array_t : uint32_t {
    alloc_is_not_array,
    alloc_is_array
};

void* alloc_mem(size_t size,
                const context& ctx,
                is_array_t is_array,
                size_t alignment);

void free_mem(void* ptr,
              is_array_t is_array,
              size_t alignment);

Considering this interface, the context information can only be stored immediately before the memory returned to the user. In order to trace leaked memory, we need to link the allocated memory into a linked list, and the control block is as follows:

struct new_ptr_list_t {
    new_ptr_list_t* next;
    new_ptr_list_t* prev;
    size_t          size;
    context         ctx;
    uint32_t        head_size : 31;
    uint32_t        is_array : 1;
    uint32_t        magic;
};

The first four fields should be very clear in meaning. head_size probably requires some explanation. While the struct is fixed in size, alignments can be different across allocations, resulting in different offsets from the struct pointer to the memory pointer the user gets. So this fields records the aligned struct size. is_array records whether the allocation is done by an operator new[]; we use this piece of information to detect the new[]/delete or new/delete[] mismatch, as well as allowing for special offsets required by array allocations. magic is used to mark that the memory is allocated by this implementation so that when freeing the memory we can detect corrupt memory, double freeing, and suchlike.

We also need the list head of control blocks, a mutex to protect its access, a function to align data size, and the magic number constant:

new_ptr_list_t new_ptr_list = {
    &new_ptr_list, &new_ptr_list, 0, {},
    alloc_is_not_array, 0, CTX_MAGIC};

std::mutex new_ptr_lock;

constexpr uint32_t align(size_t s, size_t alignment)
{
    return static_cast<uint32_t>((s + alignment - 1) &
                                 ~(alignment - 1));
}

constexpr uint32_t CTX_MAGIC = 0x4D585443; // "CTXM"

alloc_mem is then quite straightforward:

void* alloc_mem(size_t size, const context& ctx,
                is_array_t is_array,
                size_t alignment =
                    __STDCPP_DEFAULT_NEW_ALIGNMENT__)
{
    assert(alignment >=
           __STDCPP_DEFAULT_NEW_ALIGNMENT__);

    uint32_t aligned_list_item_size =
        align(sizeof(new_ptr_list_t), alignment);
    size_t s = size + aligned_list_item_size;
    auto ptr = static_cast<new_ptr_list_t*>(
        ctx_alloc(s, alignment, ctx));
    if (ptr == nullptr) {
        return nullptr;
    }
    auto usr_ptr = reinterpret_cast<char*>(ptr) +
                   aligned_list_item_size;
    ptr->ctx = ctx;
    ptr->is_array = is_array;
    ptr->size = size;
    ptr->head_size = aligned_list_item_size;
    ptr->magic = CTX_MAGIC;
    {
        std::lock_guard guard{new_ptr_lock};
        ptr->prev = new_ptr_list.prev;
        ptr->next = &new_ptr_list;
        new_ptr_list.prev->next = ptr;
        new_ptr_list.prev = ptr;
    }
    return usr_ptr;
}

I.e. it does the following things:

  1. Allocates memory enough to satisfy the user requirement and the additional metadata (new_ptr_list_t)
  2. Fills in the metadata
  3. Chains the allocated memory blocks into a list
  4. Returns the pointer after the metadata

free_mem does the opposite thing. Apparently, we need a function to convert the user pointer back to the originally allocated pointer, which is not really trivial, considering the potential cases of bad pointer and unmatched use of array and non-array versions of new and delete. It is the convert_user_ptr function:

new_ptr_list_t* convert_user_ptr(void* usr_ptr,
                                 size_t alignment)
{
    auto offset = static_cast<char*>(usr_ptr) -
                  static_cast<char*>(nullptr);
    auto adjusted_ptr = static_cast<char*>(usr_ptr);
    bool is_adjusted = false;

    // Check alignment first
    if (offset % alignment != 0) {
        offset -= sizeof(size_t);
        if (offset % alignment != 0) {
            return nullptr;
        }
        // Likely caused by new[] followed by delete, if
        // we arrive here
        adjusted_ptr = static_cast<char*>(usr_ptr) -
                       sizeof(size_t);
        is_adjusted = true;
    }
    auto ptr = reinterpret_cast<new_ptr_list_t*>(
        adjusted_ptr -
        align(sizeof(new_ptr_list_t), alignment));
    if (ptr->magic == CTX_MAGIC &&
        (!is_adjusted || ptr->is_array)) {
        return ptr;
    }

    if (!is_adjusted && alignment > sizeof(size_t)) {
        // Again, likely caused by new[] followed by
        // delete, as aligned new[] allocates alignment
        // extra space for the array size.
        ptr = reinterpret_cast<new_ptr_list_t*>(
            reinterpret_cast<char*>(ptr) - alignment);
        is_adjusted = true;
    }
    if (ptr->magic == CTX_MAGIC &&
        (!is_adjusted || ptr->is_array)) {
        return ptr;
    }

    return nullptr;
}

With this, free_mem then becomes easy:

void free_mem(void* usr_ptr, is_array_t is_array,
              size_t alignment =
                  __STDCPP_DEFAULT_NEW_ALIGNMENT__)
{
    assert(alignment >=
           __STDCPP_DEFAULT_NEW_ALIGNMENT__);
    if (usr_ptr == nullptr) {
        return;
    }

    auto ptr = convert_user_ptr(usr_ptr, alignment);
    if (ptr == nullptr) {
        fprintf(stderr,
                "delete%s: invalid pointer %p\n",
                is_array ? "[]" : "", usr_ptr);
        abort();
    }
    if (is_array != ptr->is_array) {
        const char* msg = is_array
                              ? "delete[] after new"
                              : "delete after new[]";
        fprintf(stderr,
                "%s: pointer %p (size %zu)\n",
                msg, usr_ptr, ptr->size);
        abort();
    }
    {
        std::lock_guard guard{new_ptr_lock};
        ptr->magic = 0;
        ptr->prev->next = ptr->next;
        ptr->next->prev = ptr->prev;
    }
    ctx_free(ptr, &(ptr->ctx));
}

I.e.:

  1. It invokes convert_user_ptr to convert the user-provided pointer to a new_ptr_list_t*.
  2. It checks whether array-ness matches in the memory allocation and deallocation.
  3. It unlinks the memory block from the linked list.
  4. If anything bad happens, it prints a message and aborts the whole program (as the program already has undefined behaviour).

One More Thing

It is now nearly complete: we have set up the mechanisms to record memory contexts in memory allocation and deallocation functions. However, I have omitted one important detail so far. If you used my code verbatim as above, the program would crash on first memory allocation. When the global allocation and deallocation functions are replaced, care must be taken when we need additional memory inside those functions. If we somehow use the generic C++ memory allocation mechanisms, it will invoke operator new in the end, causing an infinite recursion. It is still OK to use malloc/free, so we need to use a malloc_allocator for the context stack:

template <typename T>
struct malloc_allocator {
    typedef T value_type;

    typedef std::true_type is_always_equal;
    typedef std::true_type
        propagate_on_container_move_assignment;

    malloc_allocator() = default;
    template <typename U>
    malloc_allocator(const malloc_allocator<U>&) {}

    template <typename U>
    struct rebind {
        typedef malloc_allocator<U> other;
    };

    T* allocate(size_t n)
    {
        return static_cast<T*>(malloc(n * sizeof(T)));
    }
    void deallocate(T* p, size_t)
    {
        free(p);
    }
};

thread_local std::deque<context,
                        malloc_allocator<context>>
    context_stack{context{"<UNKNOWN>", "<UNKNOWN>"}};

Everything Put Together

You can find the real code with more details and a working memory leak checker in this repository:

https://github.com/adah1972/nvwa/tree/master/nvwa

You need to add the root directory of Nvwa to your include path, and nvwa/memory_trace.cpp and nvwa/aligned_memory.cpp to your project. In order to add a new memory checkpoint, use the macro NVWA_MEMORY_CHECKPOINT (Nvwa macros are usually prefixed with ‘NVWA_’). A very short test program follows:

#include <nvwa/memory_trace.h>

int main()
{
    char* ptr1 = new char[20];
    NVWA_MEMORY_CHECKPOINT();
    char* ptr2 = new char[42];
}

The output would be like the following:

Leaked object at 0x57697e30 (size 20, context: <UNKNOWN>/<UNKNOWN>)
Leaked object at 0x57697e70 (size 42, context: test.cpp/int main())
*** 2 leaks found

Notes about Using IWYU on macOS

I have recently found IWYU, a very useful tool to identify whether you have included header files correctly. It can be cleanly installed in Ubuntu by apt, though some configuration is needed to make it identify problems more correctly, i.e., let it know that a header file is private and we should use a public header file that includes it, or a symbol should be defined in a certain header file and we should not care where it is really defined in the implementation.

I did encounter some problems in macOS. I installed it, but it had problems with Xcode header files, causing something like “fatal error: ‘stdarg.h’ file not found” when used. A quick search showed that it was a known problem, and people mentioned that it seemed a problem that deteriorated with more recent versions of LLVM, which IWYU used internally.

I happened to have LLVM 7.0 installed from Homebrew, so I had a try. Here are the simple steps:

  1. Make sure you have llvm@7. If not, ‘brew install llvm@7’ would do.
  2. Check out IWYU to a directory.
  3. Execute ‘git checkout clang_7.0’ inside the IWYU directory to choose the Clang 7 branch.
  4. Execute ‘mkdir build && cd build’ to use a build directory.
  5. Execute ‘CC=/usr/local/opt/llvm@7/bin/clang CXX=/usr/local/opt/llvm@7/bin/clang++ cmake -DCMAKE_PREFIX_PATH=/usr/local/opt/llvm@7 ..’ to configure IWYU to use the Homebrew LLVM 7.0.
  6. Execute ‘make’ to build IWYU.
  7. Execute ‘mkdir -k lib/clang/7.1.0 && ln -s /usr/local/opt/llvm@7/lib/clang/7.1.0/include lib/clang/7.1.0/’ to symlink the Clang 7 include directory inside IWYU. This step is critical to solve the “file not found” problem, but regretfully it does not work with a more recent LLVM version like LLVM 11.
  8. Symlink executables to your bin directory for quick access. Something like:
cd ~/bin
ln -s ~/code/include-what-you-use/build/bin/include-what-you-use .
ln -s ~/code/include-what-you-use/iwyu_tool.py iwyu_tool
ln -s include-what-you-use iwyu

Now IWYU is ready to use.

Please be aware that IWYU does not often work out of the box, and some configuration is needed. The key document to read is IWYU Mappings, and the bundled mapping files (.imp) can be good examples. You probably want to use libcxx.imp as a start. Some mappings are already included by default, and you can find them in the file iwyu_include_picker.cc.

While it is not perfect, it did help me identify many inclusion issues. This commit is a result of using IWYU.

Happy hacking!

My Opinions Regarding the Top Five TIOBE Languages

I have written C++ for nearly 30 years. I had been advocating that it was the best language 🤣, until my love moved to Python a few years ago. I will still say C++ is a very powerful and unique language. It is probably the only language that intersects many different software layers. It lets programmers control the bit-level details, and it has the necessary mechanisms to allow programmers to make appropriate abstractions—arguably one of the best as it provides powerful generics, which are becoming better and better with the upcoming concepts and ranges in C++20. It has very decent optimizing compilers, and suitably written C++ code performs better than nearly all other languages. Therefore, C++ has been widely used in not only low-level stuff like drivers, but also libraries and applications, especially where performance is wanted, like scientific computing and games. It is still widely used in desktop applications, say, Microsoft Office and Adobe Photoshop. The power does come with a price: it is probably the most complicated computer language today. Mastering the language takes a long time (and with 30 years’ experience I dare not say I have mastered the language). Generic code also tends to take a long time to compile. Error messages can be overwhelming, especially to novices. I can go on and on, but it is better to stop here, with a note that the complexity and cost are sometimes worthwhile, in exchange for reduced latency and reduced power usage (from CPU and memory).

Python is, on the other hand, easy to learn. It is not a toy language, though: it is handy not only to novices, but also to software veterans like me. The change-and-run cycle is much shorter than C++. Code in Python is very readable, partly because lists, sets, and dictionaries are supported literal types (you cannot write in C++ an expression like {"one": 1} and let compiler deduce it is a dictionary). It has features that C++ has lacked for many years: generator/coroutine, lazy range, and so on. Generics do not need special support, as it is dynamically typed (but it also does not surprise programmers by allowing error-prone expressions like "1" + 2, as in some script languages). With a good IDE, the argument on its lack of compile-time check can be crushed—programmers can enjoy edit-time checks. It has a big ecosystem with a huge number of third-party libraries, and they are easier to take and use than in C++ (thanks to pip). The only main remaining shortcoming to me is performance, but: 1) one may write C/C++ extensions where necessary; and 2) the lack of performance may not matter at all, if your application is not CPU-bound. See my personal experience of 25x performance boost in two hours.

I used Java a long time ago. I do not like it (mostly for its verbosity), and its desktop/server implementation makes it unsuitable for short-time applications due to its sluggish launch time. However, it has always been a workhorse on the server side, and it has a successful ecosystem, if not much harmed by Oracle’s lawyers. Android also brought life to the old language and the development communities (ignoring for now the bad effects Oracle has brought about).

C# started as Microsoft’s answer to Java, but they have differed more and more since then. I actually like C#, and my experience has shown it is very suitable for Windows application development (I do not have experience with Mono, and I don’t do server development on Windows). Many of its features, like LINQ and on-stack structs, are very likeable.

C is a simple and elegant language, and it can be regarded as the ancestor of three languages above (except Python), at least in syntax. It is the most widely supported. It is the closest to metal, and is still very popular in embedded systems, OS development, and cases where maximum portability is wanted (thus the wide offerings from the open-source communities). It is the most dangerous language, as you can easily have buffer overflows. Incidentally, two of the three current answers to ‘How do you store a list of names input by the user into an array in C (not C++ or C#)?’ can have buffer overflows (and I wrote the other answer). Programmers need to tend to many details themselves.

I myself will code everything in Python where possible, as it usually requires the fewest lines of code and takes the least amount of time. If performance is wanted, I’ll go to C++. For Windows GUI applications, I’ll prefer C#. I will write in C if maximum portability and memory efficiency are wanted. I do not feel I will write in Java, except modifying existing code or when the environment supports Java only.

[I first posted it as a Quora answer, but it is probably worth a page of its own.]

C/C++ Performance, mmap, and string_view

A C++ discussion group I participated in got a challenge last week. Someone posted a short Perl program, and claimed that it was faster than the corresponding C version. It was to this effect (with slight modifications):

open IN, "$ARGV[0]";
my $gc = 0;
while (my $line = ) {
  $gc += ($line =~ tr/cCgG//);
}
print "$gc\n";
close IN

The program simply searched and counted all occurrences of the letters ‘C’ and ‘G’ from the input file in a case-insensitive way. Since the posted C code was incomplete, I wrote a naïve implementation to test, which did turn out to be slower than the Perl code. It was about two times as slow on Linux, and about 10 times as slow on macOS.1

FILE* fp = fopen(argv[1], "rb");
int count = 0;
int ch;
while ((ch = getc(fp)) != EOF) {
    if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') {
        ++count;
    }
}
fclose(fp);

Of course, it actually shows how optimized the Perl implementation is, instead of how inferior the C language is. Before I had time to test my mmap_line_reader, another person posted an mmap-based solution, to the following effect (with modifications):

int fd = open(argv[1], O_RDONLY);
struct stat st;
fstat(fd, &st);
int len = st.st_size;
char ch;
int count = 0;
char* ptr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
char* begin = ptr;
char* end = ptr + len;
while (ptr < end) {
    ch = *ptr++;
    if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G')
        ++count;
}
munmap(begin, len);

When I tested my mmap_line_reader, I found that its performance was only on par with the Perl code, but slower than the handwritten mmap-based code. It is not surprising, considering that mmap_line_reader copies the line content, while the C code above searches directly in the mmap’d buffer.

I then thought of the C++17 string_view.2 It seemed a good chance of using it to return a line without copying its content. It was actually easy refactoring (on code duplicated from mmap_line_reader), and most of the original code did not require any changes. I got faster code for this test, but the implementations of mmap_line_reader and the new mmap_line_reader_sv were nearly identical, except for a few small differences.

Naturally, the next step was to refactor again to unify the implementations. I made a common base to store the bottommost mmap-related logic, and made the difference between string and string_view disappear with a class template. Now mmap_line_reader and mmap_line_reader_sv were just two aliases of instantiations of basic_mmap_line_reader!

While mmap_line_reader_sv was faster than mmap_line_reader, it was still slower than the mmap-based C code. So I made another abstraction, a ‘container’ that allowed iteration over all of the file content. Since the mmap-related logic was already mostly separated, only some minor modifications were needed to make that base class fully independent of the line reading logic. After that, adding mmap_char_reader was easy, which was simply a normal container that did not need to mess with platform-specific logic.

At this point, all seemed well—except one thing: it only worked on Unix. I did not have an immediate need to make it work on Windows, but I really wanted to show that the abstraction provided could work seamlessly regardless of the platform underneath. After several revisions, in which I dealt with proper Windows support,3 proper 32- and 64-bit support, and my own mistakes, I finally made it work. You can check out the current code in the nvwa repository. With it, I can finally make the following simple code work on Linux, macOS, and Windows, with the same efficiency as raw C code when fully optimized:4

#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_byte_reader.h"

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;
        exit(1);
    }

    try {
        int count = 0;
        for (char ch : nvwa::mmap_char_reader(argv[1]))
            if (ch == 'c' || ch == 'C' ||
                ch == 'g' || ch == 'G')
                ++count;
        std::cout << count << std::endl;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}

Even though it is not used in the final code, I love the C++17 string_view. And I like the simplicity I finally achieved. Do you?

P.S. The string_view-based test code is posted here too as a reference. Line-based processing is common enough!5

#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_line_reader.h"

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;
        exit(1);
    }

    try {
        int count = 0;
        for (const auto& line :
                nvwa::mmap_line_reader_sv(argv[1]))
            for (char ch : line)
                if (ch == 'c' || ch == 'C' ||
                    ch == 'g' || ch == 'G')
                    ++count;
        std::cout << count << std::endl;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}

Update (18 September 2017)

Thanks to Alex Maystrenko (see the comments below), It is now understood that the reason why getc was slow was because there was an implicit lock around file operations. I did not expect it, as I grew from an age when multi-threading was the exception, and I had not learnt about the existence of getc_unlocked until he mentioned it! According to the getc_unlocked page in the POSIX specification:

Some I/O functions are typically implemented as macros for performance reasons (for example, putc() and getc()). For safety, they need to be synchronized, but it is often too expensive to synchronize on every character. Nevertheless, it was felt that the safety concerns were more important; consequently, the getc(), getchar(), putc(), and putchar() functions are required to be thread-safe. However, unlocked versions are also provided with names that clearly indicate the unsafe nature of their operation but can be used to exploit their higher performance.

After replacing getc with getc_unlocked, the naïve implementation immediately outperforms the Perl code on Linux, though not on macOS.6

Another interesting thing to notice is that GCC provides vastly optimized code for the comparison with ‘c’, ‘C’, ‘g’, and ‘G’,7 which is extremely unlikely for interpreted languages like Perl. Observing the codes for the characters are:

  • 010000112 or 6710 (‘C’)
  • 010001112 or 7110 (‘G’)
  • 011000112 or 9910 (‘c’)
  • 011001112 or 10310 (‘g’)

GCC basically ANDs the input character with 110110112, and compares the result with 010000112. In Intel-style assembly:

        movzx   ecx, BYTE PTR [r12]
        and     ecx, -37
        cmp     cl, 67

It is astoundingly short and efficient!


  1. I love my Mac, but I do feel Linux has great optimizations. 
  2. If you are not familiar with string_view, check out its reference
  3. Did I mention how unorthogonal and uncomfortable the Win32 API is, when compared with POSIX? I am very happy to hide all the ugliness from the application code. 
  4. Comparison was only made on Unix (using the POSIX mmap API), under GCC with ‘-O3’ optimization (‘-O2’ was not enough). 
  5. -std=c++17’ must be specified for GCC, and ‘/std:c++latest’ must be specified for Visual C++ 2017. 
  6. However, the performance improvement is more dramatic on macOS, from about 10 times slower to 5% slower. 
  7. Of course, this is only possible when the characters are given as compile-time constants. 

Python yield and C++ Coroutines

Back in 2008, an old friend challenged me with a programming puzzle, when we both attended a wedding. He later gave a solution in Python. Comparing with my C++ solution, the Python one has about half the code lines. I was not smart enough to begin learning Python then, but instead put an end to my little interest in Python with a presentation in C++ Conference China 2009. I compared the basic constructs, and concluded they were equivalent, not realizing that Python had more to offer than that trivial programming exercise showed.

Fast forwarding to today (2016), I am really writing some (small) programs in Python. I have begun to appreciate Python more and more. For many of the tasks, the performance loss in executing the code is ignorable, but the productivity boost is huge. I have also realized that there are constructs in Python that are not easily reproducible in other languages. Generator/yield is one of them.

The other day, I was writing a small program to sort hosts based on dot-reversed order so as to group the host names in a more reasonable order (regarding ‘www.wordpress.com’ as ‘com.wordpress.www’). I quickly came up with a solution in Python. The code is very short and I can show it right here:

def backsort(lines):
    result = {}
    for line in lines:
        result['.'.join(reversed(line.split('.')))] = line
    return map(lambda item: item[1],
               sorted(result.items()))

Of course, we can implement a similar function in C++11. We will immediately find that there are no standard implementations for split and join (see Appendix below for my implementation). Regardless, we can write some code like:

template <typename C>
vector<string> backsort(C&& lines)
{
    map<string, string> rmap;
    for (auto& line : lines) {
        auto split_line = split(line, '.');
        reverse(split_line.begin(), split_line.end());
        rmap[join(split_line, '.')] = line;
    }
    vector<string> result(rmap.size());
    transform(rmap.begin(), rmap.end(), result.begin(),
              [](const pair<string, string>& pr)
              {
                  return pr.second;
              });
    return result;
}

Even though it has twice the non-trivial lines of code and is a function template, there is immediately something Python can do readily but C++ cannot. I can give the Python file handle (like os.stdin) directly to backsort, and the for line will iterate through the file content.1 This is because the Python file object implements the iterator protocol over lines of text, but the C++ istream does not do anything similar.

Let us forget this C++ detail, and focus on the problem. My Python code accepts an iterator, and ‘backsorts’ all the input lines. Can we make it process multiple files (like the cat command line), without changing the backsort function?

Of course it can be done. There is a traditional way, and there is a smart way. The traditional way is write a class that implements the iterator protocol (which can be readily modelled by C++):

class cat:
    def __init__(self, files):
        self.files = files
        self.cur_file = None

    def __iter__(self):
        return self

    def next(self):
        while True:
            if self.cur_file:
                line = self.cur_file.readline()
                if line:
                    return line.rstrip('\n')
                self.cur_file.close()
            if self.files:
                self.cur_file = open(self.files[0])
                self.files = self.files[1:]
            else:
                raise StopIteration()

We can then cat files by the following lines:

if __name__ == '__main__':
    if sys.argv[1:]:
        for line in cat(sys.argv[1:]):
            print(line)

Using yield, we can reduce the 18 lines of code of cat to only 5:

def cat(files):
    for fn in files:
        with open(fn) as f:
            for line in f:
                yield line.rstrip('\n')

There is no more bookkeeping of the current file and the unprocessed files, and everything is wrapped in simple loops. Isn’t that amazing? I actually learnt about the concept before (in C#), but never used it in real code—perhaps because I was too much framed by existing code, using callbacks, observer pattern, and the like.—Those ‘patterns’ now look ugly, when compared to the simplicity of generators.

Here comes the real challenge for C++ developers: Can we do the same in C++? Can we do something better than inelegant callbacks? 2


My investigations so far indicate the following: No C++ standards (up to C++14) support such constructs, and there is no portable way to implement them as a library.

Are we doomed? No. Apart from standardization efforts regarding coroutines (which is the ancient name for a superset of generators, dated from 1958) in C++,3 there have been at least five cross-platform implementations for C++:

  • The unofficial Boost.Coroutine by Giovanni P. Deretta (2006), compatible with Windows, Linux, and maybe a few Unix variants (tested not working on OS X); apparently abandoned.4
  • The official Boost.Coroutine by Oliver Kowalke (2013), compatible with ARM, MIPS, PPC, SPARC, x86, and x86-64 architectures.
  • The official Boost.Coroutine2 by Oliver Kowalke (2015), compatible with the same hardware architectures but only C++ compilers/code conformant to the C++14 standard.
  • Mordor by Mozy (2010), compatible with Windows, Linux, and OS X, but seemingly no longer maintained.
  • CO2 by Jamboree (2015), supporting stackless coroutines only, using preprocessor tricks, and requiring C++14.

As Boost.Coroutine2 looks modern, is well-maintained, and is very much comparable to the Python constructs, I will use it in the rest of this article.5 It hides all the platform details with the help of Boost.Context. Now I can write code simply as follows for cat:

typedef boost::coroutines2::coroutine<const string&>
    coro_t;

void cat(coro_t::push_type& yield,
         int argc, char* argv[])
{
    for (int i = 1; i < argc; ++i) {
        ifstream ifs(argv[i]);
        for (;;) {
            string line;
            if (getline(ifs, line)) {
                yield(line);
            } else {
                break;
            }
        }
    }
}

int main(int argc, char* argv[])
{
    using namespace std::placeholders;
    for (auto& line : coro_t::pull_type(
             boost::coroutines2::fixedsize_stack(),
             bind(cat, _1, argc, argv))) {
        cout << line << endl;
    }
}

Is this simple and straightforward? The only thing that is not quite intuitive is the detail that the constructor of pull_type expects the second argument to be a function object that takes a push_type& as the only argument. That is why we need to use bind to generate it—a lambda expression being the other alternative.

I definitely believe being able to write coroutines is a big step forward to make C++ more expressive. I can foresee many tasks simplified, like recursive parsing. I believe this will prove very helpful in the C++ weaponry. I only wish we could see it standardized soon.

Appendix

The complete backsort code in Python:

#!/usr/bin/env python
#coding: utf-8

import sys

def cat(files):
    for fn in files:
        with open(fn) as f:
            for line in f:
                yield line.rstrip('\n')

def backsort(lines):
    result = {}
    for line in lines:
        result['.'.join(reversed(line.split('.')))] = line
    return map(lambda item: item[1],
               sorted(result.items()))

def main():
    if sys.argv[1:]:
        result = backsort(cat(sys.argv[1:]))
    else:
        result = backsort(map(
                lambda line: line.rstrip('\n'), sys.stdin))
    for line in result:
        print(line)

if __name__ == '__main__':
    main()

The complete backsort code in C++:

#include <assert.h>         // assert
#include <algorithm>        // std::reverse/transform
#include <fstream>          // std::ifstream
#include <functional>       // std::bind
#include <iostream>         // std::cin/cout
#include <map>              // std::map
#include <string>           // std::string
#include <vector>           // std::vector
#include <boost/coroutine2/all.hpp>

using namespace std;

typedef boost::coroutines2::coroutine<const string&>
    coro_t;

void cat(coro_t::push_type& yield,
         int argc, char* argv[])
{
    for (int i = 1; i < argc; ++i) {
        ifstream ifs(argv[i]);
        for (;;) {
            string line;
            if (getline(ifs, line)) {
                yield(line);
            } else {
                break;
            }
        }
    }
}

vector<string> split(const string& str, char delim)
{
    vector<string> result;
    string::size_type last_pos = 0;
    string::size_type pos = str.find(delim);
    while (pos != string::npos) {
        result.push_back(
            str.substr(last_pos, pos - last_pos));
        last_pos = pos + 1;
        pos = str.find(delim, last_pos);
        if (pos == string::npos) {
            result.push_back(str.substr(last_pos));
        }
    }
    return result;
}

template <typename C>
string join(const C& str_list, char delim)
{
    string result;
    for (auto& item : str_list) {
        result += item;
        result += delim;
    }
    if (result.size() != 0) {
        result.resize(result.size() - 1);
    }
    return result;
}

template <typename C>
vector<string> backsort(C&& lines)
{
    map<string, string> rmap;
    for (auto& line : lines) {
        auto split_line = split(line, '.');
        reverse(split_line.begin(), split_line.end());
        rmap[join(split_line, '.')] = line;
    }
    vector<string> result(rmap.size());
    transform(rmap.begin(), rmap.end(), result.begin(),
              [](const pair<string, string>& pr)
              {
                  return pr.second;
              });
    return result;
}

class istream_line_reader {
public:
    class iterator { // implements InputIterator
    public:
        typedef const string& reference;
        typedef string value_type;

        iterator() : stream_(nullptr) {}
        explicit iterator(istream& is) : stream_(&is)
        {
            ++*this;
        }

        reference operator*()
        {
            assert(stream_ != nullptr);
            return line_;
        }
        value_type* operator->()
        {
            assert(stream_ != nullptr);
            return &line_;
        }
        iterator& operator++()
        {
            getline(*stream_, line_);
            if (!*stream_) {
                stream_ = nullptr;
            }
            return *this;
        }
        iterator operator++(int)
        {
            iterator temp(*this);
            ++*this;
            return temp;
        }

        bool operator==(const iterator& rhs) const
        {
            return stream_ == rhs.stream_;
        }
        bool operator!=(const iterator& rhs) const
        {
            return !operator==(rhs);
        }

    private:
        istream* stream_;
        string line_;
    };

    explicit istream_line_reader(istream& is)
        : stream_(is)
    {
    }
    iterator begin() const
    {
        return iterator(stream_);
    }
    iterator end() const
    {
        return iterator();
    }

private:
    istream& stream_;
};

int main(int argc, char* argv[])
{
    using namespace std::placeholders;
    vector<string> result;
    if (argc > 1) {
        result = backsort(coro_t::pull_type(
            boost::coroutines2::fixedsize_stack(),
            bind(cat, _1, argc, argv)));
    } else {
        result = backsort(istream_line_reader(cin));
    }
    for (auto& item : result) {
       cout << item << endl;
    }
}

The istream_line_reader class is not really necessary, and we can simplify it with coroutines. I am including it here only to show what we have to write ‘normally’ (if we cannot use coroutines). Even if we remove it entirely, the C++ version will still have about three times as many non-trivial lines of code as the Python equivalent. It is enough proof to me that I should move away from C++ a little bit. . . .


  1. There is one gotcha: the ‘\n’ character will be part of the string. It will be handled in my solution. 
  2. Generally speaking, callbacks or similar techniques are what C++ programmers tend to use in similar circumstances, if the ‘producer’ part is complicated (otherwise the iterator pattern may be more suitable). Unfortunately, we cannot then combine the use of two simple functions like cat and backsort simultaneously. If we used callbacks, backsort would need to be modified and fragmented. 
  3. P0057 is one such effort, which is experimentally implemented in Visual Studio 2015
  4. According to the acknowledgement pages of next two Boost projects, Giovanni Deretta contributed to them. So his work was not in vain. 
  5. This said, CO2 is also well-maintained, and is more efficient if only a stackless coroutine is needed. See Jamboree’s answer on StackOverflow. The difference looks small to me, and preprocessor tricks are not my favourites, so I will not spend more time on CO2 for now. 

Choosing a Multi-Precision Library for C++—A Critique

The Problem

After reading From Mathematics to Generic Programming,1 I accumulated some template code related to the RSA algorithm,2 but I tested it only with native integer types. Some recent events required me to use it for calculations that involve data sizes slightly bigger than those of the built-in types. I had to find some big-number libraries to help the calculation.

This is what the code looks like:

template <Integer N>
inline bool odd(N n)
{
    return n % 2 == 1;
}

template <Integer N>
inline N half(N n)
{
    return n / 2;
}

…

template <EuclideanDomain E>
std::pair<E, E> extended_gcd(E a, E b)
{
    …
}

template <Regular A, Integer N, SemigroupOperation Op>
A power_accumulate_semigroup(A r, A a, N n, Op op)
{
    …
}

template <Regular A, Integer N, SemigroupOperation Op>
A power_semigroup(A a, N n, Op op)
{
    assert(n > 0);
    while (!odd(n)) {
        a = op(a, a);
        n = half(n);
    }
    if (n == 1) {
        return a;
    }
    return power_accumulate_semigroup(a, op(a, a),
                                      half(n - 1), op);
}

template <Integer N>
struct modulo_multiply {
    modulo_multiply(const N& i) : modulus(i) {}
    N operator()(const N& n, const N& m) const
    {
        return (n * m) % modulus;
    }
private:
    N modulus;
};

int main()
{
    typedef … int_type;

    int_type p1 = …;
    int_type p2 = …;
    int_type n = p1 * p2;
    int_type phi = (p1 - 1) * (p2 - 1);
    int_type pub = …;

    pair<int_type, int_type> p = extended_gcd(pub, phi);
    // Check that p.second == 1
    int_type prv = p.first;
    …

    int_type encrypt = …;

    cout << "Encryped message is " << encrypt << endl;
    cout << "Decrypted message is "
         << power_semigroup(encrypt, prv,
                            modulo_multiply<int_type>(n))
         << endl;
}

There are a few details omitted here, but the point is that I had already the code that worked when int_type was defined to int64_t, and I needed some types that could represent higher precisions and work with the existing code with minimal changes.

The Exploration

NTL

One of the first libraries I tried was NTL,3 which seemed to support the standard mathematical operators well. It did not take me long to make it work with my program, and I was able to get the result successfully. However, I saw several problems that made me think I probably wanted a more mature solution in the long run:

  • Its class name for big integers is ‘NTL::ZZ’. ‘ZZ’ looks ugly to me, not aesthetically comfortable as a type name.
  • It does not provide a make mechanism for Windows. Luckily, it does not require external libraries and it is easy enough to build it manually with GCC.
  • Code like ‘NTL::ZZ pub = 3’ does not compile, which is a minor annoyance (but ‘NTL::ZZ pub(3)’ is an easy workaround, anyway).
  • Code like ‘NTL::ZZ p1("3440890133")’ does not work. This is a problem for big integers that cannot be represented by a native integer type. The workaround is using std::istringstream, which would require more lines and clumsiness.
  • There is no support for getting the input or output in hexadecimal numbers.

CLN

Another library I tried at the same time was CLN.4 It is not friendly to Windows users either, so I simply installed it from Cygwin.5 It seems to be in stark contrast to NTL in some aspects:

  • The class name for big integers is more reasonably named ‘cln::cl_I’.
  • Code lines like ‘cln::cl_I pub = 3’ and ‘cln::cl_I p1("3440890133")’ work.
  • CLN provides support for hexadecimal input (using a special ‘#x’ prefix in the number string) and output (using the fprinthexadecimal function).

However, CLN is quite terrible in its handling of C++ operators:

  • % is not overriden, and I have to call the mod function instead.
  • Division is not implemented on cl_I. I have to, in the general case, use a function that returns the {quotient, remainder} pair, or use an exquo function when I can guarantee that the remainder is zero. Luckily, in my specific case, I can substitute ‘>> 1’ for ‘/ 2’. If shifts could not be used, I would have to replace ‘n / 2’ with something like ‘truncate2(n, 2).quotient’. Providing a series of division functions that return both the quotient and remainder is good; forcing people to use them is not.

Unlike the immaturity of NTL, it looks like that CLN deliberately made the design choices to be this way. Still, it looks bad enough to me. The API design of CLN shows the hauteur of the authors: Your time is not important to me; read the fucking manual, and do things the correct way we want it to be. This condescending attitude is completely against the trend.

Boost.Multiprecision

Finally, I found out that I should have looked no further than just the famous Boost libraries.6 A multi-precision template library is among Boost’s 100+ libraries, simply named ‘Boost.Multiprecision’.7 I wondered why I missed it in the beginning. But, anyway, it fulfilled all my needs wonderfully:

  • Using the basic cpp_int type does not require building any libraries. This makes it work on any platform that has a decent C++ compiler.
  • All needed operators (like +, -, *, /, and %) are implemented.
  • Initialization from native integer types and C strings works.
  • Hexadecimal input and output are implemented in a natural way: inputs can have the ‘0x’ prefix, and the hex manipulator can be used to make the big integer output to iostreams in the hexadecimal form.8
  • In addition, it supports the C++11 user-defined literals.9 So, instead of writing something like ‘cppint encrypt("0xB570BF8E4BDABA4C")’, you can have more efficient code by writing ‘cppint encrypt(0xB570BF8E4BDABA4C_cppi)’.

This said, one problem halted me when I first used its cpp_int type: very weird compilation errors occurred, spanning several screens. Actually, the solution is described in the introduction of the library, as well as in the first answer of its FAQ, so I figured it out the next day (I did not read carefully the documentation on the first night). I needed to either replace expressions like ‘half(n - 1)’ with explicit type-casts like ‘half(N(n - 1))’, or simply use an alternative typedef to turn off expression templates—which I did:

    typedef boost::multiprecision::number<
            boost::multiprecision::cpp_int_backend<>,
            boost::multiprecision::et_off> int_type;

You can read the Boost documentation for more details. It is related to performance. It is also worth noting that with C++11 move semantics, the expression-template-disabled form I use can still have performance close enough (no more than 10% slower) to the expression-template-enabled form. And the first template parameter probably has a bigger impact—GMP can be used as the backend and is considered faster.10

In my humble opinions, Boost.Multiprecision should change the default cpp_int definition to have et_off. Developers who want the ultimate performance will always read the documentation, but it does not seem necessary to force other developers to have failures, read documentation, and change their code. In my case, it takes several seconds to compile the program, a small fraction of a second to run the program, but several hours to find the correct library and learn how to use it.

The Critique

I would argue that the following three criteria should be the foremost in choosing (and thus providing) a good library:

  • Correctness. I think this is self-evident. All three libraries described here satisfy this criterion.
  • Standard and intuitive interface. This is where NTL and CLN fail. CLN does the worst here by intentionally failing to provide operator/. Boost.Multiprecision satisfies my needs without requiring me to look up the documentation (mostly), but it is not perfect, in that the default types can cause horrendous error messages and that its iostream routines do not honour uppercase and nouppercase.11
  • Performance. Yes, performance comes the last among these three. It is still important, but I would argue we should put performance aside when it conflicts with the other two criteria (like treating ‘premature optimization’), and developers can read the documentation and turn performance options back on when they really need it. Boost.Multiprecision is nice to support different backends and the expression template option, but I am not persuaded that expression templates should be enabled by default.

Of course, there are many other criteria, like portability, (lack of) dependency, etc., but they tend to be more subjective and can vary from project to project. The three criteria listed above are the most important to me.

Correctness and developer productivity should be preferred to code performance. This should be true for both scripting languages and traditional compiled languages.

Appendix: Source Listing

The complete RSA sample code that builds with Boost.Multiprecision is listed below for you to play with. We can optimize the code a little bit by substituting the Boost.Multiprecision divide_qr function for the handwritten quotient_remainder. That can be the small exercise for you, dear reader. 🙂

#include <assert.h>             // assert
#include <iostream>             // std::cout/endl
#include <utility>              // std::pair/make_pair
#include <boost/multiprecision/cpp_int.hpp>

// Concepts
#define EuclideanDomain         typename
#define Integer                 typename
#define Regular                 typename
#define SemigroupOperation      typename

template <Integer N>
inline bool odd(N n)
{
    return n % 2 == 1;
}

template <Integer N>
inline N half(N n)
{
    return n / 2;
}

template <Integer N>
N largest_doubling(N a, N b)
{
    assert(b != 0);
    for (;;) {
        N c = b + b;
        if (a < c)
            break;
        b = c;
    }
    return b;
}

template <Integer N>
std::pair<N, N> quotient_remainder(N a, N b)
{
    assert(b > 0);
    if (a < b) {
        return std::make_pair(N(0), a);
    }
    N c = largest_doubling(a, b);
    N n(1);
    a = a - c;
    while (c != b) {
        c = half(c);
        n = n + n;
        if (c <= a) {
            a = a - c;
            n = n + 1;
        }
    }
    return std::make_pair(n, a);
}

template <EuclideanDomain E>
std::pair<E, E> extended_gcd(E a, E b)
{
    E x0(1);
    E x1(0);
    while (b != E(0)) {
        // compute new r and x
        std::pair<E, E> qr = quotient_remainder(a, b);
        E x2 = x0 - qr.first * x1;
        // shift r and x
        x0 = x1;
        x1 = x2;
        a = b;
        b = qr.second;
    }
    return std::make_pair(x0, a);
}

template <Regular A, Integer N, SemigroupOperation Op>
A power_accumulate_semigroup(A r, A a, N n, Op op)
{
    assert(n >= 0);
    if (n == 0) {
        return r;
    }
    for (;;) {
        if (odd(n)) {
            r = op(r, a);
            if (n == 1) {
                return r;
            }
        }
        n = half(n);
        a = op(a, a);
    }
}

template <Regular A, Integer N, SemigroupOperation Op>
A power_semigroup(A a, N n, Op op)
{
    assert(n > 0);
    while (!odd(n)) {
        a = op(a, a);
        n = half(n);
    }
    if (n == 1) {
        return a;
    }
    return power_accumulate_semigroup(a, op(a, a),
                                      half(n - 1), op);
}

template <Integer N>
struct modulo_multiply {
    modulo_multiply(const N& i) : modulus(i) {}
    N operator()(const N& n, const N& m) const
    {
        return (n * m) % modulus;
    }
private:
    N modulus;
};

int main()
{
    using namespace std;
    typedef boost::multiprecision::number<
            boost::multiprecision::cpp_int_backend<>,
            boost::multiprecision::et_off> int_type;

    int_type p1 = 3440890133;
    int_type p2 = 4006628849;
    int_type n = p1 * p2;
    int_type phi = (p1 - 1) * (p2 - 1);
    int_type pub = 65537;

    pair<int_type, int_type> p = extended_gcd(pub, phi);
    if (p.second != 1) {
        // pub is not coprime with phi
        cout << "Please choose another public key!" << endl;
        return 1;
    }

    int_type prv = p.first;
    if (prv < 0) {
        prv += phi;
    }

    cout << "Public key is (" << pub << ", " << n << ")\n";
    cout << "Private key is (" << prv << ", " << n << ")\n";

    int_type encrypt("0xB570BF8E4BDABA4C");
    cout << hex;
    cout << "Encrypted message is " << encrypt << endl;
    cout << "Decrypted message is "
         << power_semigroup(encrypt, prv,
                            modulo_multiply<int_type>(n))
         << endl;
}

  1. Alexander A. Stepanov and Daniel E. Rose: From Mathematics to Generic Programming. Addison-Wesley Professional, 2014. 
  2. Wikipedia: RSA (cryptosystem)
  3. Victor Shoup: NTL: A Library for doing Number Theory
  4. Bruno Haible and Richard B. Kreckel: CLN — Class Library for Numbers 
  5. Cygwin
  6. Boost
  7. John Maddock and Christopher Kormanyos: Boost.Multiprecision
  8. Cppreference.com: std::hex
  9. Cppreference.com: User-defined literals
  10. Free Software Foundation: GMP — The GNU Multiple Precision Arithmetic Library
  11. Cppreference.com: std::uppercase

Generic Lambdas and the compose Function

I am about two thirds through Scott Meyers’ Effective Modern C++, and I have discovered the power of generic lambdas. Actually I had read about generic lambdas before in the Wikipedia entry on C++14, but that was far from enough for me to get it—and I was not smart enough to investigate deeper. Anyway, Item 33 in Effective Modern C++ gives enough examples to show me the power, and it is exactly the tool I need to solve the problems in my compose function.

Let me start from my (poor) exclamation in my last blog:

Alas, a lambda is only a function, but not a type-deducing function template.

Wrong, wrong, wrong! The generic lambda is exactly what I claimed it was not.

Type deduction was the problem that caused my compose function template to fail. Recall its definition and the failure case:

template <typename Tp>
auto compose()
{
    return apply<Tp>;
}

template <typename Tp, typename Fn, typename... Fargs>
auto compose(Fn fn, Fargs... args)
{
    return [=](Tp&& x) -> decltype(auto)
    {
        return fn(compose<Tp>(args...)(forward<Tp>(x)));
    };
}

…
Obj obj(0);
auto const op_nr = compose<Obj>(clone);
test(op_nr(obj));

The error was that obj, as an lvalue, could not be bound to Obj&& (on line 10). Although I tried to use the perfect forwarding pattern, it did not work, as there was no type deduction—Tp was specified by the caller. Why so? Because I thought a lambda could not be a type-deducing function template.

I won’t go into details about generic lambdas per se, of which you can get a lot of information in Scott’s book or by Google. Instead, I only want to show you that generic lambdas help solve the type deduction problem and give a function template to suit my needs.

Without further ado, I am showing you the improved version of compose that uses generic lambdas:

auto compose()
{
    return [](auto&& x) -> decltype(auto)
    {
        return forward<decltype(x)>(x);
    };
}

template <typename Fn, typename... Fargs>
auto compose(Fn fn, Fargs... args)
{
    return [=](auto&& x) -> decltype(auto)
    {
        return fn(compose(args...)(forward<decltype(x)>(x)));
    };
}

You can immediately notice the following:

  • The original template type parameter Tp is gone.
  • Tp&& is now changed to auto&&, allowing type deduction to work.
  • In order to make perfect forwarding work when the type of x is unknown, forward<decltype(x)> is used.

With this definition, We no longer need to differentiate between op_klvr, op_rvr, etc. No way to do so, anyway. Things are beautifully unified, until the moment you need to put it in an std::function. Run the code at the final listing to see their differences.

Although it already looks perfect, we have a bonus since we no longer specify Tp. We are no longer constrained by only one argument. A small change will make multiple arguments work:

auto compose()
{
    return [](auto&& x) -> decltype(auto)
    {
        return forward<decltype(x)>(x);
    };
}

template <typename Fn>
auto compose(Fn fn)
{
    return [=](auto&&... x) -> decltype(auto)
    {
        return fn(forward<decltype(x)>(x)...);
    };
}

template <typename Fn, typename... Fargs>
auto compose(Fn fn, Fargs... args)
{
    return [=](auto&&... x) -> decltype(auto)
    {
        return fn(
            compose(args...)(forward<decltype(x)>(x)...));
    };
}

The first compose function is no longer useful when we have at least one function passed to compose, but I am keeping it for now. The parameter pack plus the generic lambda makes a perfect combination here.

Finally, a code listing for you to play with is provided below (also available as test_compose.cpp in the zip file download for my last blog):

#include <functional>
#include <iostream>

using namespace std;

#define PRINT_AND_TEST(x)           \
    cout << " " << #x << ":\n  ";   \
    test(x);                        \
    cout << endl;

auto compose()
{
    return [](auto&& x) -> decltype(auto)
    {
        return forward<decltype(x)>(x);
    };
}

template <typename Fn>
auto compose(Fn fn)
{
    return [=](auto&&... x) -> decltype(auto)
    {
        return fn(forward<decltype(x)>(x)...);
    };
}

template <typename Fn, typename... Fargs>
auto compose(Fn fn, Fargs... args)
{
    return [=](auto&&... x) -> decltype(auto)
    {
        return fn(
            compose(args...)(forward<decltype(x)>(x)...));
    };
}

struct Obj {
    int value;
    explicit Obj(int n) : value(n)
    {
        cout << "Obj(){" << value << "} ";
    }
    Obj(const Obj& rhs) : value(rhs.value)
    {
        cout << "Obj(const Obj&){" << value << "} ";
    }
    Obj(Obj&& rhs) : value(rhs.value)
    {
        rhs.value = -1;
        cout << "Obj(Obj&&){" << value << "} ";
    }
    ~Obj()
    {
        cout << "~Obj(){" << value << "} ";
    }
};

void test(Obj& x)
{
    cout << "=> Obj&:" << x.value << "\n  ";
}

void test(Obj&& x)
{
    cout << "=> Obj&&:" << x.value << "\n  ";
}

void test(const Obj& x)
{
    cout << "=> const Obj&:" << x.value << "\n  ";
}

Obj clone(Obj x)
{
    cout << "=> clone(Obj):" << x.value << "\n  ";
    return x;
}

void test()
{
    Obj obj(0);
    cout << endl;

    auto const op = compose(clone);
    std::function<Obj(const Obj&)> fn_klvr = op;
    std::function<Obj(Obj&&)> fn_rvr = op;
    std::function<Obj(Obj)> fn_nr = op;
    PRINT_AND_TEST(op(obj));
    PRINT_AND_TEST(fn_klvr(obj));
    PRINT_AND_TEST(fn_nr(obj));
    cout << endl;
    PRINT_AND_TEST(op(Obj(1)));
    PRINT_AND_TEST(fn_klvr(Obj(1)));
    PRINT_AND_TEST(fn_rvr(Obj(1)));
    PRINT_AND_TEST(fn_nr(Obj(1)));
}

template <typename T1, typename T2>
auto sum(T1 x, T2 y)
{
    return x + y;
}

template <typename T1, typename T2, typename... Targ>
auto sum(T1 x, T2 y, Targ... args)
{
    return sum(x + y, args...);
}

template <typename T>
auto sqr(T x)
{
    return x * x;
}

int main()
{
    test();
    cout << endl;
    auto const op = compose(sqr<int>,
                            sum<int, int, int, int, int>);
    cout << op(1, 2, 3, 4, 5) << endl;
}

Happy hacking!

Type Deduction and My Reference Mistakes

I had read Scott Meyers’ three previous books in the Effective series 1 2 3, before I began to read his new Effective Modern C++ 4 at Safari Books Online. I always expect to learn from Scott, but it surprised me how fast it could be. After reading only a few items about type deduction, I found that I implemented apply and compose (pipeline) incorrectly in my first WordPress blog 5. What a shame! But I thought I was fortunate to find the problem so soon, and I would like to record my mistakes and solutions here.

(A side note: As WordPress does not allow uploading C++ source files, I have put all the test code in a zip file. I will quote the relevant source code in the blog directly for an easy read, but you are welcome to download and try the test code yourself!)

First, my implementation of apply had the wrong return type, as shown in the following program (test1.cpp):

#include <iostream>

using namespace std;

#define PRINT_AND_TEST(x)    \
    cout << #x << ": ";      \
    test(x);                 \
    cout << endl;

template <typename Tp>
auto apply(Tp&& data)
{
    return forward<Tp>(data);
}

template <typename Tp, typename Fn, typename... Fargs>
auto apply(Tp&& data, Fn fn, Fargs... args)
{
    return apply(fn(forward<Tp>(data)), args...);
}

struct Obj {
    int value;
    explicit Obj(int n) : value(n)
    {
    }
    Obj(const Obj& rhs) : value(rhs.value)
    {
    }
    Obj(Obj&& rhs) : value(rhs.value)
    {
        rhs.value = -1;
    }
    ~Obj()
    {
    }
};

void test(Obj& x)
{
    cout << "Obj&:" << x.value;
}

void test(Obj&& x)
{
    cout << "Obj&&:" << x.value;
}

void test(const Obj& x)
{
    cout << "const Obj&:" << x.value;
}

int main()
{
    Obj obj(0);
    Obj& nref = obj;
    const Obj& cref = obj;

    PRINT_AND_TEST(obj);
    PRINT_AND_TEST(nref);
    PRINT_AND_TEST(cref);
    PRINT_AND_TEST(Obj(1));
    cout << endl;

    PRINT_AND_TEST(apply(obj));
    PRINT_AND_TEST(apply(nref));
    PRINT_AND_TEST(apply(cref));
    PRINT_AND_TEST(apply(Obj(1)));
    cout << endl;
}

It gives the following output:

obj: Obj&:0
nref: Obj&:0
cref: const Obj&:0
Obj(1): Obj&&:1

apply(obj): Obj&&:0
apply(nref): Obj&&:0
apply(cref): Obj&&:0
apply(Obj(1)): Obj&&:1

Apparently I did not make the types correct. The reason was my ignorant use of auto as the return type, without realizing that it always results in a non-reference type. C++14 provides a special decltype(auto) syntax, which keeps the reference-ness, and it seems to work here. The ‘fixed’ code is as follows (test2.cpp):

template <typename Tp>
decltype(auto) apply(Tp&& data)
{
    return forward<Tp>(data);
}

template <typename Tp, typename Fn, typename... Fargs>
decltype(auto) apply(Tp&& data, Fn fn, Fargs... args)
{
    return apply(fn(forward<Tp>(data)), args...);
}

After that, the program can output the correct result:

obj: Obj&:0
nref: Obj&:0
cref: const Obj&:0
Obj(1): Obj&&:1

apply(obj): Obj&:0
apply(nref): Obj&:0
apply(cref): const Obj&:0
apply(Obj(1)): Obj&&:1

Wait—is the code really correct?

Actually a little more testing reveals a bigger problem, which the original code did not exhibit. Here is the additional test code (test3.cpp):

…
Obj clone(Obj x)
{
    cout << "clone(Obj):" << x.value << " => ";
    return x;
}

int main()
{
    …
    PRINT_AND_TEST(clone(obj));
    PRINT_AND_TEST(clone(nref));
    PRINT_AND_TEST(clone(cref));
    PRINT_AND_TEST(clone(Obj(2)));
    cout << endl;

    PRINT_AND_TEST(apply(obj, clone));
    PRINT_AND_TEST(apply(nref, clone));
    PRINT_AND_TEST(apply(cref, clone));
    PRINT_AND_TEST(apply(Obj(2), clone));
    cout << endl;
}

And its horrendous output:

…
clone(obj): clone(Obj):0 => Obj&&:0
clone(nref): clone(Obj):0 => Obj&&:0
clone(cref): clone(Obj):0 => Obj&&:0
clone(Obj(2)): clone(Obj):2 => Obj&&:2

apply(obj, clone): clone(Obj):0 => Obj&&:1875662080
apply(nref, clone): clone(Obj):0 => Obj&&:1875662080
apply(cref, clone): clone(Obj):0 => Obj&&:1875662080
apply(Obj(2), clone): clone(Obj):2 => Obj&&:1875662080

Let us go back and analyse the case. Since all four functions have problems, we’ll just check the first one.

The call apply(obj, clone) makes the template parameter Tp be deduced to Obj&, as obj is an lvalue. This instantiation of apply(Obj&, Fn) contains the following function body, after all arguments are passed in:

    return apply(clone(forward<Obj&>(obj)));

Please notice that clone returns a temporary object, and its lifetime ends after return statement. As clone returns an rvalue, Tp for the next apply call is deduced to Obj. The function is instantiated as follows (check out the definition of std::forward if you are not familiar with it 6):

Obj&& apply(Obj&& data)
{
    return forward<Obj>(data);
}

Although the cloned object is destroyed after apply(Obj&, Fn) returns, its rvalue reference is still returned. Oops!

Seeing the reason, I only need to make sure an object type is returned when this apply is called. I experimented with the enable_if template 7, but it turns out that the fix is simpler than I expected (test4.cpp):

template <typename Tp>
Tp apply(Tp&& data)
{
    return forward<Tp>(data);
}

template <typename Tp, typename Fn, typename... Fargs>
decltype(auto) apply(Tp&& data, Fn fn, Fargs... args)
{
    return apply(fn(forward<Tp>(data)), args...);
}

I just have to change the first decltype(auto) to Tp to take advantage of the type deduction rules of ‘universal references’. While it is explained most clearly in Item 1 of Scott Meyers’ Effect Modern C++, his online article already has it clearly 8:

During type deduction for a template parameter that is a universal reference, lvalues and rvalues of the same type are deduced to have slightly different types. In particular, lvalues of type T are deduced to be of type T& (i.e., lvalue reference to T), while rvalues of type T are deduced to be simply of type T. (Note that while lvalues are deduced to be lvalue references, rvalues are not deduced to be rvalue references!)

Therefore:

  • If an lvalue is passed to apply, Tp (and the return type) is deduced to an lvalue reference (say, Obj&). This is what we expect to reduce copying.
  • If an rvalue is passed to apply, Tp (and the return type) is deduced to the object type (say, Obj). Now the returned object is move-constructed—what we would like to see.

I then also checked compose. At first, I tested with these lines added to the end of main (after adding the definition of compose, of course; see test5.cpp):

    auto const op1 = compose<Obj&&>(apply<Obj&&>);
    PRINT_AND_TEST(op1(Obj(3)));
    auto const op2 = compose<const Obj&>(apply<const Obj&>);
    PRINT_AND_TEST(op2(Obj(3)));

Both output lines contain ‘test(Obj&&):3’. The second line is obviously wrong, and it is exactly like the apply case, so the solution is similar too. I should not have relied on the wrong auto return type deduction—using decltype(auto) would fix it. The changed compose is like the following (test6.cpp):

template <typename Tp>
auto compose()
{
    return apply<Tp>;
}

template <typename Tp, typename Fn, typename... Fargs>
auto compose(Fn fn, Fargs... args)
{
    return [=](Tp&& x) -> decltype(auto)
    {
        return fn(compose<Tp>(args...)(forward<Tp>(x)));
    };
}

However, when I test code like below, it will not even compile (test7.cpp):

    auto const op_nr = compose<Obj>(clone);
    test(op_nr(obj));

Clang reports errors:

test7.cpp:106:10: error: no matching function for call to object of type 'const (lambda at test.cpp:26:12)'
    test(op_nr(obj));
         ^~~~~
test7.cpp:31:12: note: candidate function not viable: no known conversion from 'Obj' to 'Obj &&' for 1st argument
    return [=](Tp&& x) -> decltype(auto)
           ^
1 error generated.

!@#$%^…

Actually I can ‘fix’ the problem like this:

    auto const op_klvr = compose<const Obj&>(clone);
    test(op_klvr(obj));

Or this:

    auto const op_rvr  = compose<Obj&&>(clone);
    test(op_rvr(Obj()));

The two forms above would work actually quite well, if one knew the argument type and was careful. However, there would be a difference if one was careless, as could be shown by the revised test program with tracking information of the objects’ lifetime (test8.cpp). Only the relevant changes are shown below:

…
#define PRINT_AND_TEST(x)           \
    cout << " " << #x << ":\n  ";   \
    test(x);                        \
    cout << endl;
…
struct Obj {
    int value;
    explicit Obj(int n) : value(n)
    {
        cout << "Obj(){" << value << "} ";
    }
    Obj(const Obj& rhs) : value(rhs.value)
    {
        cout << "Obj(const Obj&){" << value << "} ";
    }
    Obj(Obj&& rhs) : value(rhs.value)
    {
        rhs.value = -1;
        cout << "Obj(Obj&&){" << value << "} ";
    }
    ~Obj()
    {
        cout << "~Obj(){" << value << "} ";
    }
};

void test(Obj& x)
{
    cout << "=> Obj&:" << x.value << "\n  ";
}

void test(Obj&& x)
{
    cout << "=> Obj&&:" << x.value << "\n  ";
}

void test(const Obj& x)
{
    cout << "=> const Obj&:" << x.value << "\n  ";
}

Obj clone(Obj x)
{
    cout << "=> clone(Obj):" << x.value << "\n  ";
    return x;
}

int main()
{
    Obj obj(0);
    Obj& nref = obj;
    const Obj& cref = obj;
    cout << endl;
    …
    auto const op_klvr = compose<const Obj&>(clone);
    auto const op_rvr  = compose<Obj&&>(clone);
    auto const op_nr   = compose<Obj>(clone);
    PRINT_AND_TEST(op_klvr(obj));
    PRINT_AND_TEST(op_klvr(Obj(3)));
    PRINT_AND_TEST(op_rvr(Obj(3)));
    PRINT_AND_TEST(op_nr(Obj(3)));
    //PRINT_AND_TEST(op_nr(obj));
}

The commented-out line cannot compile yet. The rest works fine, and the program will generate the following output (edited):

Obj(){0}
 obj:
  => Obj&:0
 …

 clone(obj):
  Obj(const Obj&){0} => clone(Obj):0
  Obj(Obj&&){0} => Obj&&:0
  ~Obj(){0} ~Obj(){-1}
 clone(nref):
  Obj(const Obj&){0} => clone(Obj):0
  Obj(Obj&&){0} => Obj&&:0
  ~Obj(){0} ~Obj(){-1}
 clone(cref):
  Obj(const Obj&){0} => clone(Obj):0
  Obj(Obj&&){0} => Obj&&:0
  ~Obj(){0} ~Obj(){-1}
 clone(Obj(2)):
  Obj(){2} => clone(Obj):2
  Obj(Obj&&){2} => Obj&&:2
  ~Obj(){2} ~Obj(){-1}

 apply(obj, clone):
  Obj(const Obj&){0} => clone(Obj):0
  Obj(Obj&&){0} Obj(Obj&&){0} ~Obj(){-1} ~Obj(){-1} => Obj&&:0
  ~Obj(){0}
 …
 apply(Obj(2), clone):
  Obj(){2} Obj(Obj&&){2} => clone(Obj):2
  Obj(Obj&&){2} Obj(Obj&&){2} ~Obj(){-1} ~Obj(){-1} => Obj&&:2
  ~Obj(){2} ~Obj(){-1}

 op_klvr(obj):
  Obj(const Obj&){0} => clone(Obj):0
  Obj(Obj&&){0} ~Obj(){-1} => Obj&&:0
  ~Obj(){0}
 op_klvr(Obj(3)):
  Obj(){3} Obj(const Obj&){3} => clone(Obj):3
  Obj(Obj&&){3} ~Obj(){-1} => Obj&&:3
  ~Obj(){3} ~Obj(){3}
 op_rvr(Obj(3)):
  Obj(){3} Obj(Obj&&){3} => clone(Obj):3
  Obj(Obj&&){3} ~Obj(){-1} => Obj&&:3
  ~Obj(){3} ~Obj(){-1}
 op_nr(Obj(3)):
  Obj(){3} Obj(Obj&&){3} => clone(Obj):3
  Obj(Obj&&){3} ~Obj(){-1} => Obj&&:3
  ~Obj(){3} ~Obj(){-1}
~Obj(){0}

We can see that apply(Obj(2), clone) generates one more move-construction than clone(Obj(2)), and this is expected from our implementation. We can also see the differences in the last group, where the use of op_klvr, op_rvr, or op_nr can affect whether copy-construction or move-construction is used. I would like to make op_nr work on an lvalue too (so the last code line can be uncommented).

I have finally implemented a version of compose with tag dispatching 9, treating reference types and non-reference types differently. The strategy is as follows:

  • If template argument is a reference type, the old logic still applies.
  • If template argument is not a reference type, a temporary object will be constructed in the pass-by-value parameter (with either the copy constructor or move constructor), and its rvalue reference will be used to invoke the reference-branch logic. An extra move operation may result, so it is still better to use ‘compose’ if it is known that the argument will be an value. (Alas, a lambda is only a function, but not a type-deducing function template. Damn, Scott hit me right on the face, again, with his nice introduction of the C++14 generic lambdas in Effective Modern C++. It provides for a far nicer solution. I won’t change the content here, but the part about compose is now largely obsoleted. Check out ‘Generic Lambdas and the compose Function’ for an update.—I hate love you, Scott!)

The extra move overhead makes this solution less attractive, but it only adds to the choices. Also, it would be a little awkward if one was forced to type ‘compose’. So my final compose is here (test9.cpp):

template <typename Tp>
auto compose_ref()
{
    return apply<Tp>;
}

template <typename Tp, typename Fn, typename... Fargs>
auto compose_ref(Fn fn, Fargs... args)
{
    return [=](Tp&& x) -> decltype(auto)
    {
        return fn(compose_ref<Tp>(args...)(forward<Tp>(x)));
    };
}

template <typename Tp, typename... Fargs>
auto compose_impl(false_type, Fargs... args)
{
    return [=](Tp x) -> decltype(auto)
    {
        return compose_ref<Tp&&>(args...)(move(x));
    };
}

template <typename Tp, typename... Fargs>
auto compose_impl(true_type, Fargs... args)
{
    return compose_ref<Tp>(args...);
}

template <typename Tp, typename... Fargs>
auto compose(Fargs... args)
{
    return compose_impl<Tp>(
        typename is_reference<Tp>::type(), args...);
}

Lessons learnt:

  • C++ programmers should always read Scott’s books (at least 99.99% should).
  • Although type deduction is very helpful, one needs to understand its rules and what auto actually means in each case; otherwise it is easy to make (terrible) mistakes.

  1. Scott Meyers: Effective C++. Addison-Wesley, 3rd edition, 2005. 
  2. Scott Meyers: More Effective C++. Addison-Wesley, 1996. 
  3. Scott Meyers: Effective STL. Addison-Wesley, 2001. 
  4. Scott Meyers: Effective Modern C++. O’Reilly Media, 2014. 
  5. Yongwei Wu: Study Notes: Functional Programming with C++
  6. Thomas Becker: Rvalue References Explained, p. 8
  7. Cppreference.com: std::enable_if
  8. Scott Meyers: Universal References in C++11
  9. David Abrahams and Douglas Gregor: Generic Programming in C++: Techniques, section ‘Tag Dispatching’

Installing Clang 3.5 for Windows

I had used LLVM 3.4 on Windows for quite some time. It had worked well, and had all the features I needed—mostly the beautiful C++11/C++14 compiler and the easy-to-use Clang-Format. However, the C++ compiler only works when some specific GCC versions are installed together, and the GCC version 4.6.3 I installed for Clang has a conflict with the GCC 4.9 I use. The major issue is the C++ run-time library libstdc++-6.dll, which actually has many variants due to the combination of different thread models and different exception handling methods. The result is that GCC 4.9 generated executables will crash when the libstdc++-6.dll from GCC 4.6.3 appears earlier in path, and Clang generated executables will crash when the libstdc++-6.dll from GCC 4.9 appears earlier in path. I do not like this situation. So recently I tried new combinations when I installed LLVM 3.5, and made sure everything work together. I would like to share the result.

Let me first list the binary files one needs to download:

I install Clang to the default location, C:\Program Files (x86)\LLVM. For the rest of this article, I assume GCC 4.9.2 is extracted to C:\ (so all files are under C:\mingw32), and GCC 4.8.2 is extracted to C:\Temp (all files are under C:\Temp\mingw32).

Although I need GCC 4.9 for the best and latest C++ features, Clang does not work with it. One can tell from the error output of Clang that it should work with the MinGW-w64 GCC 4.8.2:

ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.0"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.0/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.0/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.0/backward"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.1"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.1/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.1/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.1/backward"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.2"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.2/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.2/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.2/backward"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.3"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.3/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.3/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.7.3/backward"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.0"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.0/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.0/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.0/backward"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.1"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.1/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.1/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.1/backward"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.2"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.2/x86_64-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.2/i686-w64-mingw32"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0/../../../include/c++/4.8.2/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.0/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.0/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.0/include/c++/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.1/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.1/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.1/include/c++/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.2/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.2/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.2/include/c++/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.3/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.3/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.7.3/include/c++/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.0/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.0/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.0/include/c++/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.1/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.1/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.1/include/c++/backward"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.2/include/c++"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.2/include/c++/mingw32"
ignoring nonexistent directory "c:/MinGW/lib/gcc/mingw32/4.8.2/include/c++/backward"
ignoring nonexistent directory "/usr/include/c++/4.4"
ignoring nonexistent directory "/usr/local/include"
ignoring nonexistent directory "C:\Program Files (x86)\LLVM\bin\..\lib\clang\3.5.0\../../../x86_64-w64-mingw32/include"
ignoring nonexistent directory "/mingw/include"
ignoring nonexistent directory "c:/mingw/include"
ignoring nonexistent directory "/usr/include"

(As one may expect from the error messages, the official MinGW GCC, currently at version 4.8.1, also works with Clang. I personally prefer MinGW-w64, as its GCC is more usable—e.g., the MinGW version supports only Win32 threads, and therefore does not support std::thread. MinGW does not provide GCC 4.9 yet, and you can’t put C:\MinGW\bin in the path, if you want to use MinGW-w64 GCC 4.9 simultaneously. You do need to put either C:\MinGW\bin or C:\mingw32\bin—for MinGW-w64 GCC 4.9—in the path, as Clang cannot find a working GCC for linking otherwise. If you use only MinGW GCC 4.8.1, or only MinGW-w64 GCC 4.9, this configuration works.)

Now back to MinGW-w64 GCC 4.8.2. Depending on the size of your hard disk, you may want to tailor it. In my case, I removed all traces of Fortran, Ada, and Objective-C, as well as build-info.txt, etc, license, opt, and shared from C:\Temp\mingw32. After that, you need to do the following to make GCC 4.8.2 work for Clang:

  • Make directory c++ under C:\Temp\mingw32\include.
  • Make directory 4.8.2 under C:\Temp\mingw32\include\c++.
  • Copy all contents under C:\Temp\mingw32\i686-w64-mingw32\include\c++ to C:\Temp\mingw32\include\c++\4.8.2.
  • Move all contents under C:\Temp\mingw32 to C:\Program Files (x86)\LLVM, merging with existing directories there.
  • Remove the empty C:\Temp\mingw32.

You can now add both C:\mingw32\bin and C:\Program Files (x86)\LLVM to the path: both Clang and GCC are at your hand and won’t conflict with each other.