Compile-Time Strings

I have encountered many compile-time uses of strings in my projects in the past few years. I would like to summarize my experience today.

Choice of Types

std::string is mostly unsuitable for compile-time string manipulations. There are several reasons:

  • Before C++20 one cannot use strings at all at compile time. In addition, the support for compile-time strings comes quite late among the major compilers. MSVC was the front runner in this regard, GCC came second with GCC 12 (released a short while ago), and Clang has not yet had a formal release with compile-time string support.
  • With C++20 one can use strings at compile time, but there are still a lot of inconveniences, the most obvious being that strings generated at compile time cannot be used at run time. Besides, a string cannot be declared constexpr.
  • A string cannot be used as a template argument.

So we have to give up this apparent choice, but explore other possibilities. The candidates are:

  • const char pointer, which is what a string literal can naturally decay to
  • string_view, a powerful tool added by C++17: it has similar member functions to those of string, but they are mostly marked as constexpr!
  • array, with which we can generate brand-new strings

We will try these types in the following discussion.

Functions Commonly Needed

Getting the String Length

One of the most basic functions on a string is getting its length. Here we cannot use the C function strlen, as it is not constexpr.

We will try several different ways to implement it.

First, we can implement strlen manually, and mark the function constexpr:

namespace strtools {

constexpr size_t length(const char* str)
{
    size_t count = 0;
    while (*str != '\0') {
        ++str;
        ++count;
    }
    return count;
}

} // namespace strtools

However, is there an existing mechanism to retrieve the length of a string in the standard library? The answer is a definite Yes. The standard library does support getting the length of a string of any of the standard character types, like char, wchar_t, etc. With the most common character type char, we can write:

constexpr size_t length(const char* str)
{
    return char_traits<char>::length(str);
}

Starting with C++17, the methods of char_traits can be used at compile time. (However, you may encounter problems with older compiler versions, like GCC 8.)

Assuming you can use C++17, string_view is definitely worth a try:

constexpr size_t length(string_view sv)
{
    return sv.size();
}

Regardless of the approach used, now we can use the following code to verify that we can indeed check the length of a string at compile time:

static_assert(strtools::length("Hi") == 2);

At present, the string_view implementation seems the most convenient.

Finding a Character

Finding a specific character is also quite often needed. We can’t use strchr, but again, we can choose from a few different implementations. The code is pretty simple, whether implemented with char_traits or with string_view.

Here is the version with char_traits:

constexpr const char* find(const char* str, char ch)
{
    return char_traits<char>::find(str, length(str),
                                   ch);
}

Here is the version with string_view:

constexpr string_view::size_type find(string_view sv,
                                      char ch)
{
    return sv.find(ch);
}

I am not going to show the manual lookup code this time. (Unless you have to use an old compiler, simpler is better.)

Comparing Strings

The next functions are string comparisons. Here string_view wins hands down: string_view supports the standard comparisons directly, and you do not need to write any code.

Getting Substrings

It seems that string_views are very convenient, and we should use string_views wherever possible. However, is string_view::substr enough for getting substrings? This is difficult to answer without an actual usage scenario. One real scenario I encountered in projects was that the __FILE__ macro may contain the full path at compile time, resulting in different binaries when compiling under different paths. We wanted to truncate the path completely so that the absolute paths would not show up in binaries.

My tests showed that string_view::substr could not handle this job. With the following code:

puts("/usr/local"sv.substr(5).data());

We will see assembly output like the following from the compiler (see https://godbolt.org/z/1dssd96vz):

.LC0:
        .string "/usr/local"
        …
        mov     edi, OFFSET FLAT:.LC0+5
        call    puts

We have to find another way. . . .

Let’s try array. It’s easy to think of code like the following:

constexpr auto substr(string_view sv, size_t offset,
                      size_t count)
{
    array<char, count + 1> result{};
    copy_n(&sv[offset], count, result.data());
    return result;
}

The intention of the code should be very clear: generate a brand-new character array of the requested size and zero it out (constexpr variables must be initialized on declaration before C++20); copy what we need; and then return the result. Unfortunately, the code won’t compile. . . .

There are two problems in the code:

  • Functions parameters are not constexpr, and cannot be used as template arguments.
  • copy_n is not constexpr before C++20, and cannot be used in compile-time programming.

The second problem is easy to fix: a manual loop will do. We shall focus on the first problem.

A constexpr function can be evaluated at compile time or at run time, so its function arguments are not treated as compile-time constants, and cannot be used in places where compile-time constants are required, such as template arguments.

Furthermore, this problem still exists with the C++20 consteval function, where the function is only invoked at compile time. The main issue is that if we allow function parameters to be used as compile-time constants, then we can write a function where its arguments of different values (same type) can produce return values of different types. For example (currently illegal):

consteval auto make_constant(int n)
{
    return integral_constant<int, n>{};
}

This is unacceptable in the current type system: we still require that the return values of a function have a unique type. If we want a value to be used as a template argument inside a function, it must be passed to the function template as a template argument (rather than as a function argument to a non-template function). In this case, each distinct template argument implies a different template specialization, so the issue of a multiple-return-type function does not occur.

By the way, a standard proposal P1045 tried to solve this problem, but its progress seems stalled. As there are workarounds (to be discussed below), we are still able to achieve the desired effect.

Let’s now return to the substr function and convert the count parameter into a template parameter. Here is the result:

template <size_t Count>
constexpr auto substr(string_view sv, size_t offset = 0)
{
    array<char, Count + 1> result{};
    for (size_t i = 0; i < Count; ++i) {
        result[i] = sv[offset + i];
    }
    return result;
}

The code can really work this time. With ‘puts(substr("/usr/local", 5).data())’, we no longer see "/usr/" in the compiler output.


Regretfully, we now see how compilers are challenged with abstractions: With the latest versions of GCC (12.1) and MSVC (19.32) on Godbolt, this version of substr does not generate the optimal output. There are also some compatibility issues with older compiler versions. So, purely from a practical point of view, I recommend the following implementation that does not use string_view:

template <size_t Count>
constexpr auto substr(const char* str,
                      size_t offset = 0)
{
    array<char, Count + 1> result{};
    for (size_t i = 0; i < Count; ++i) {
        result[i] = str[offset + i];
    }
    return result;
}

If you are interested, you can compare the assembly outputs of these two different versions of the code:

Only Clang is able to generate the same efficient assembly code with both versions:

        mov     word ptr [rsp + 4], 108
        mov     dword ptr [rsp], 1633906540
        mov     rdi, rsp
        call    puts

If you don’t understand why there are the numbers 108 and 1633906540, let me remind you that the hexadecimal representations of these two numbers are 0x6C and 0x61636F6C, respectively. Check the ASCII table and you should be able to understand.


Since we stopped using string_view in the function parameters, the parameter offset becomes much less useful. Hence, I will get rid of this parameter, and rename the function to copy_str:

template <size_t Count>
constexpr auto copy_str(const char* str)
{
    array<char, Count + 1> result{};
    for (size_t i = 0; i < Count; ++i) {
        result[i] = str[i];
    }
    return result;
}

Passing Arguments at Compile Time

When you try composing the compile-time functions together, you will find something lacking. For example, if you wanted to remove the first segment of a path automatically (like from "/usr/local" to "local"), you might try some code like the following:

constexpr auto remove_head(const char* path)
{
    if (*path == '/') {
        ++path;
    }
    auto start = find(path, '/');
    if (start == nullptr) {
        return copy_str<length(path)>(path);
    } else {
        return copy_str<length(start + 1)>(start + 1);
    }
}

The problem is still that it won’t compile. And did you notice that this code violates exactly the constraint I mentioned above that the return type of a function must be consistent and unique?

I have adopted a solution described by Michael Park: using lambda expressions to encapsulate ‘compile-time arguments’. I have defined three macros for convenience and readability:

#define CARG typename
#define CARG_WRAP(x) [] { return (x); }
#define CARG_UNWRAP(x) (x)()

‘CARG’ means ‘constexpr argument’, a compile-time constant argument. We can now make make_constant really work:

template <CARG Int>
constexpr auto make_constant(Int cn)
{
    constexpr int n = CARG_UNWRAP(cn);
    return integral_constant<int, n>{};
}

And it is easy to verify that it works:

auto result = make_constant(CARG_WRAP(2));
static_assert(std::is_same_v<integral_constant<int, 2>,
                             decltype(result)>);

A few explanations follow. In the template parameter, I use CARG (instead of typename) for code readability: it indicates the intention that the template parameter is essentially a type wrapper for compile-time constants. Int is the name of this special type. We will not provide this type when instantiating the function template, but instead let the compiler deduce it. When calling the ‘function’ (make_constant(CARG_WRAP(2))), we provide a lambda expression ([] { return (2); }), which encapsulates the constant we need. When we need to use this parameter, we use CARG_UNWRAP (evaluate: [] { return (2); }()) to get the constant back.

Now we can rewrite the remove_head function:

template <CARG Str>
constexpr auto remove_head(Str cpath)
{
    constexpr auto path = CARG_UNWRAP(cpath);
    constexpr int skip = (*path == '/') ? 1 : 0;
    constexpr auto pos = path + skip;
    constexpr auto start = find(pos, '/');
    if constexpr (start == nullptr) {
        return copy_str<length(pos)>(pos);
    } else {
        return copy_str<length(start + 1)>(start + 1);
    }
}

This function is similar in structure to the previous version, but there are many detail changes. In order to pass the result to copy_str as a template argument, we have to use constexpr all the way along. So we have to give up mutability, and write code in a quite functional style.

Does it really work? Let’s put the following statement into the main function:

puts(strtools::remove_head(CARG_WRAP("/usr/local"))
         .data());

And here is the optimized assembly output from GCC on x86-64 (see https://godbolt.org/z/Mv5YanPvq&gt;):

main:
        sub     rsp, 24
        mov     eax, DWORD PTR .LC0[rip]
        lea     rdi, [rsp+8]
        mov     DWORD PTR [rsp+8], eax
        mov     eax, 108
        mov     WORD PTR [rsp+12], ax
        call    puts
        xor     eax, eax
        add     rsp, 24
        ret
.LC0:
        .byte   108
        .byte   111
        .byte   99
        .byte   97

As you can see clearly, the compiler will put the ASCII codes for "local" on the stack, assign its starting address to the rdi register, and then call the puts function. There is absolutely no trace of "/usr/" in the output. In fact, there is no difference between the output of the puts statement above and that of ‘puts(substr("/usr/local", 5).data())’.

I would like to remind you that it is safe to pass and store the character array, but it is not safe to store the pointer obtained from its data() method. It is possible to use such a pointer immediately in calling other functions (like puts above), as the lifetime of array will extend till the current statement finishes execution. However, if you saved this pointer, it would become dangling after the current statement, and dereferencing it would then be undefined behaviour.

String Template Parameters

We have tried turning strings into types (via lambda expressions) for compile-time argument passing, but unlike integers and integral_constants, there is no one-to-one correspondence between the two. This is often inconvenient: for two integral_constants, we can directly use is_same to determine whether they are the same; for strings represented as lambda expressions, we cannot do the same—two lambda expressions always have different types.

Direct use of string literals as non-type template arguments is not allowed in C++, because strings may appear repeatedly in different translation units, and they do not have proper comparison semantics—comparing two strings is just a comparison of two pointers, which cannot achieve what users generally expect. To use string literals as template arguments, we need to find a way to pass the string as a sequence of characters to the template. We have two methods available:

  • The non-standard GNU extension used by GCC and Clang (which can be used prior to C++20)
  • The C++20 approach suitable for any conformant compilers (including GCC and Clang)

Let’s have a look one by one.

The GNU Extension

GCC and Clang have implemented the standard proposal N3599, which allows us to use strings as template arguments. The compiler will expand the string into characters, and the rest is standard C++.

Here is an example:

template <char... Cs>
struct compile_time_string {
    static constexpr char value[]{Cs..., '\0'};
};

template <typename T, T... Cs>
constexpr compile_time_string<Cs...> operator""_cts()
{
    return {};
}

The definition of the class template is standard C++, so that compile_time_string is a valid type and, at the same time, by taking the value member of this type, we can get "Hi". The GNU extension is the string literal operator template—we can now write ‘"Hi"_cts’ to get an object of type compile_time_string. The following code will compile with the above definitions:

constexpr auto a = "Hi"_cts;
constexpr auto b = "Hi"_cts;
static_assert(is_same_v<decltype(a), decltype(b)>);

The C++20 Approach

Though the above method is simple and effective, it failed to reach consensus in the C++ standards committee and did not become part of the standard. However, with C++20, we can use more types in non-type template parameters. In particular, user-defined literal types are amongst them. Here is an example:

template <size_t N>
struct compile_time_string {
    constexpr compile_time_string(const char (&str)[N])
    {
        copy_n(str, N, value);
    }
    char value[N]{};
};

template <compile_time_string cts>
constexpr auto operator""_cts()
{
    return cts;
}

Again, the first class template is not special, but allowing this compile_time_string to be used as the type of a non-type template parameter (quite a mouthful😝), as well as the string literal operator template, is a C++20 improvement. We can now write ‘"Hi"_cts’ to generate a compile_time_string object. Note, however, that this object is of type compile_time_string, so "Hi"_cts and "Ha"_cts are of the same type—which is very different from the results of the GNU extension. However, the important thing is that compile_time_string can now be used as type of a template parameter, so we can just add another layer:

template <compile_time_string cts>
struct cts_wrapper {
    static constexpr compile_time_string str{cts};
};

Corresponding to the previous compile-time string type comparison, we now need to write:

auto a = cts_wrapper<"Hi"_cts>{};
auto b = cts_wrapper<"Hi"_cts>{};
static_assert(is_same_v<decltype(a), decltype(b)>);

Or we can further simplify it to (as compile_time_string has a non-explicit constructor):

auto a = cts_wrapper<"Hi">{};
auto b = cts_wrapper<"Hi">{};
static_assert(is_same_v<decltype(a), decltype(b)>);

Summary

In this blog I have discussed two things:

  • Compile-time string manipulations
  • Strings as non-type template parameters

They have proved to be useful in my real projects. When having time, I will explore some usages later. Stay tuned!

Contextual Memory Tracing

The Need

A long, long time ago I wrote about memory debugging. I redefined new as a macro, took advantage of placement new, and replaced the global operator new. However, only the replacement of the global operator new was useful in catching memory leaks in the end, while the other facilities became more or less futile; for memory allocation was becoming more and more implicit with the spread use of STL containers and smart pointers, and direct use of new was discouraged. It is even more so today, with many coding conventions basically banning the use of new and delete. So I would like to revisit this topic.

First, there is still need to trace memory usage, even though memory leakage, in the way of unmatched new, is unlikely today. People still need to know how memory is used, by which parts of the program, in which functions and modules, and so on. The exact point of memory allocation is becoming less relevant, as memory allocation is becoming less direct. It probably occurs in some libraries, instead of in application code. So now tracing memory usage means recording the usage context, instead of the exact code position.

Be Contextual

Usage contexts can be set up in a stack-like data structure, and I have done so several times in the past. What needs to be recorded in the context is something one needs to decide beforehand. If you only want to trace memory usage, you can do as I do below. But you may want to fit the interface with your specific memory manager, adding what needs to be passed to it. Anyway, you decide what should be in. My example code is as follows:

struct context {
    const char* file;
    const char* func;
};

We want to record the context automatically, and RAII can be used for this purpose:

class checkpoint {
public:
    explicit checkpoint(const context& ctx);
    ~checkpoint();

private:
    const context ctx_;
};

#define CTX_MEMORY_CHECKPOINT()       \
    checkpoint memory_checkpoint{     \
        context{__FILE__, __func__}}

thread_local std::deque<context>
    context_stack{
        context{"<UNKNOWN>", "<UNKNOWN>"}};

void save_context(const context& ctx)
{
    context_stack.push_back(ctx);
}

void restore_context(const context& ctx)
{
    assert(!context_stack.empty() &&
           context_stack.back() == ctx);
    context_stack.pop_back();
}

const context& get_current_context()
{
    assert(!context_stack.empty());
    return context_stack.back();
}

checkpoint::checkpoint(const context& ctx) : ctx_(ctx)
{
    save_context(ctx);
}

checkpoint::~checkpoint()
{
    restore_context(ctx_);
}

Please notice that the context_stack needs to be thread_local, which was something not standardized when I last wrote about tracing memory usage. It is very convenient to save the information on a per-stack basis.

Fitting with Real Memory Managers

Before we go define the operator new and operator delete functions (normally called ‘allocation’ and ‘deallocation’ functions, and there are a lot of them), let us define first the generic/sample functions that do the real allocation and deallocation. We just pass on the necessary arguments to the system memory manager here (but you may want to do more to make it work with an existing memory manager) and we still use the C convention that a memory allocation failure is indicated by a null pointer:

void* ctx_alloc(size_t size, size_t alignment,
                const context* /*unused*/)
{
#ifdef _WIN32
    return _aligned_malloc(size, alignment);
#elif defined(__unix) || defined(__unix__)
    void* memptr{};
    int result = posix_memalign(&memptr, alignment, size);
    if (result == 0) {
        return memptr;
    } else {
        return nullptr;
    }
#else
    // No alignment guarantees on other platforms
    (void)alignment;
    return malloc(size);
#endif
}

void ctx_free(void* ptr, const context* /*unused*/)
{
#ifdef _WIN32
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}

operator new & operator delete

Now we can go to the allocation and deallocation functions. The declarations with our new context parameter are the following:

void* operator new  (size_t size,
                     const context& ctx);
void* operator new[](size_t size,
                     const context& ctx);
void* operator new  (size_t size,
                     std::align_val_t align_val,
                     const context& ctx);
void* operator new[](size_t size,
                     std::align_val_t align_val,
                     const context& ctx);

void operator delete  (void* ptr,
                       const context&) noexcept;
void operator delete[](void* ptr,
                       const context&) noexcept;
void operator delete  (void* ptr,
                       std::align_val_t align_val,
                       const context&) noexcept;
void operator delete[](void* ptr,
                       std::align_val_t align_val,
                       const context&) noexcept;

But these are not all the function we need to rewrite. We need to replace the non-contextual version too, and they are actually key to the memory tracing functionality. Our saved context can be used, like the following:

void* operator new(size_t size)
{
    return operator new(size, get_current_context());
}

void* operator new[](size_t size)
{
    return operator new[](size, get_current_context());
}

Assuming the existence of an alloc_mem and a free_mem function, we can make the rest of the allocation and deallocation functions basically forwarders:

void* operator new(size_t size,
                   const std::nothrow_t&) noexcept
{
    return alloc_mem(size, get_current_context(),
                     alloc_is_not_array);
}

void* operator new[](size_t size,
                     const std::nothrow_t&) noexcept
{
    return alloc_mem(size, get_current_context(),
                     alloc_is_array);
}

void* operator new(size_t size,
                   std::align_val_t align_val)
{
    return operator new(size, align_val,
                        get_current_context());
}

void* operator new[](size_t size,
                     std::align_val_t align_val)
{
    return operator new[](size, align_val,
                          get_current_context());
}

void* operator new(size_t size,
                   std::align_val_t align_val,
                   const std::nothrow_t&) noexcept
{
    return alloc_mem(size, get_current_context(),
                     alloc_is_not_array,
                     size_t(align_val));
}

void* operator new[](size_t size,
                     std::align_val_t align_val,
                     const std::nothrow_t&) noexcept
{
    return alloc_mem(size, get_current_context(),
                     alloc_is_array,
                     size_t(align_val));
}

void* operator new(size_t size, const context& ctx)
{
    void* ptr = alloc_mem(size, ctx, alloc_is_not_array);
    if (ptr == nullptr) {
        throw std::bad_alloc();
    }
    return ptr;
}

void* operator new[](size_t size, const context& ctx)
{
    void* ptr = alloc_mem(size, ctx, alloc_is_array);
    if (ptr == nullptr) {
        throw std::bad_alloc();
    }
    return ptr;
}

void* operator new(size_t size,
                   std::align_val_t align_val,
                   const context& ctx)
{
    void* ptr = alloc_mem(size, ctx, alloc_is_not_array,
                          size_t(align_val));
    if (ptr == nullptr) {
        throw std::bad_alloc();
    }
    return ptr;
}

void* operator new[](size_t size,
                     std::align_val_t align_val,
                     const context& ctx)
{
    void* ptr = alloc_mem(size, ctx, alloc_is_array,
                          size_t(align_val));
    if (ptr == nullptr) {
        throw std::bad_alloc();
    }
    return ptr;
}

void operator delete(void* ptr) noexcept
{
    free_mem(ptr, alloc_is_not_array);
}

void operator delete[](void* ptr) noexcept
{
    free_mem(ptr, alloc_is_array);
}

void operator delete(void* ptr, size_t) noexcept
{
    free_mem(ptr, alloc_is_not_array);
}

void operator delete[](void* ptr, size_t) noexcept
{
    free_mem(ptr, alloc_is_array);
}

void operator delete(
    void* ptr, std::align_val_t align_val) noexcept
{
    free_mem(ptr, alloc_is_not_array,
                 size_t(align_val));
}

void operator delete[](
    void* ptr, std::align_val_t align_val) noexcept
{
    free_mem(ptr, alloc_is_array,
                 size_t(align_val));
}

void operator delete(void* ptr,
                     const context&) noexcept
{
    operator delete(ptr);
}

void operator delete[](void* ptr,
                       const context&) noexcept
{
    operator delete[](ptr);
}

void operator delete(void* ptr,
                     std::align_val_t align_val,
                     const context&) noexcept
{
    operator delete(ptr, align_val);
}

void operator delete[](void* ptr,
                       std::align_val_t align_val,
                       const context&) noexcept
{
    operator delete[](ptr, align_val);
}

Contexts and Allocation/Deallocation

Now let us focus on the two functions that do the real job:

enum is_array_t : uint32_t {
    alloc_is_not_array,
    alloc_is_array
};

void* alloc_mem(size_t size,
                const context& ctx,
                is_array_t is_array,
                size_t alignment);

void free_mem(void* ptr,
              is_array_t is_array,
              size_t alignment);

Considering this interface, the context information can only be stored immediately before the memory returned to the user. In order to trace leaked memory, we need to link the allocated memory into a linked list, and the control block is as follows:

struct new_ptr_list_t {
    new_ptr_list_t* next;
    new_ptr_list_t* prev;
    size_t          size;
    context         ctx;
    uint32_t        head_size : 31;
    uint32_t        is_array : 1;
    uint32_t        magic;
};

The first four fields should be very clear in meaning. head_size probably requires some explanation. While the struct is fixed in size, alignments can be different across allocations, resulting in different offsets from the struct pointer to the memory pointer the user gets. So this fields records the aligned struct size. is_array records whether the allocation is done by an operator new[]; we use this piece of information to detect the new[]/delete or new/delete[] mismatch, as well as allowing for special offsets required by array allocations. magic is used to mark that the memory is allocated by this implementation so that when freeing the memory we can detect corrupt memory, double freeing, and suchlike.

We also need the list head of control blocks, a mutex to protect its access, a function to align data size, and the magic number constant:

new_ptr_list_t new_ptr_list = {
    &new_ptr_list, &new_ptr_list, 0, {},
    alloc_is_not_array, 0, CTX_MAGIC};

std::mutex new_ptr_lock;

constexpr uint32_t align(size_t s, size_t alignment)
{
    return static_cast<uint32_t>((s + alignment - 1) &
                                 ~(alignment - 1));
}

constexpr uint32_t CTX_MAGIC = 0x4D585443; // "CTXM"

alloc_mem is then quite straightforward:

void* alloc_mem(size_t size, const context& ctx,
                is_array_t is_array,
                size_t alignment =
                    __STDCPP_DEFAULT_NEW_ALIGNMENT__)
{
    assert(alignment >=
           __STDCPP_DEFAULT_NEW_ALIGNMENT__);

    uint32_t aligned_list_item_size =
        align(sizeof(new_ptr_list_t), alignment);
    size_t s = size + aligned_list_item_size;
    auto ptr = static_cast<new_ptr_list_t*>(
        ctx_alloc(s, alignment, ctx));
    if (ptr == nullptr) {
        return nullptr;
    }
    auto usr_ptr = reinterpret_cast<char*>(ptr) +
                   aligned_list_item_size;
    ptr->ctx = ctx;
    ptr->is_array = is_array;
    ptr->size = size;
    ptr->head_size = aligned_list_item_size;
    ptr->magic = CTX_MAGIC;
    {
        std::lock_guard guard{new_ptr_lock};
        ptr->prev = new_ptr_list.prev;
        ptr->next = &new_ptr_list;
        new_ptr_list.prev->next = ptr;
        new_ptr_list.prev = ptr;
    }
    return usr_ptr;
}

I.e. it does the following things:

  1. Allocates memory enough to satisfy the user requirement and the additional metadata (new_ptr_list_t)
  2. Fills in the metadata
  3. Chains the allocated memory blocks into a list
  4. Returns the pointer after the metadata

free_mem does the opposite thing. Apparently, we need a function to convert the user pointer back to the originally allocated pointer, which is not really trivial, considering the potential cases of bad pointer and unmatched use of array and non-array versions of new and delete. It is the convert_user_ptr function:

new_ptr_list_t* convert_user_ptr(void* usr_ptr,
                                 size_t alignment)
{
    auto offset = static_cast<char*>(usr_ptr) -
                  static_cast<char*>(nullptr);
    auto adjusted_ptr = static_cast<char*>(usr_ptr);
    bool is_adjusted = false;

    // Check alignment first
    if (offset % alignment != 0) {
        offset -= sizeof(size_t);
        if (offset % alignment != 0) {
            return nullptr;
        }
        // Likely caused by new[] followed by delete, if
        // we arrive here
        adjusted_ptr = static_cast<char*>(usr_ptr) -
                       sizeof(size_t);
        is_adjusted = true;
    }
    auto ptr = reinterpret_cast<new_ptr_list_t*>(
        adjusted_ptr -
        align(sizeof(new_ptr_list_t), alignment));
    if (ptr->magic == CTX_MAGIC &&
        (!is_adjusted || ptr->is_array)) {
        return ptr;
    }

    if (!is_adjusted && alignment > sizeof(size_t)) {
        // Again, likely caused by new[] followed by
        // delete, as aligned new[] allocates alignment
        // extra space for the array size.
        ptr = reinterpret_cast<new_ptr_list_t*>(
            reinterpret_cast<char*>(ptr) - alignment);
        is_adjusted = true;
    }
    if (ptr->magic == CTX_MAGIC &&
        (!is_adjusted || ptr->is_array)) {
        return ptr;
    }

    return nullptr;
}

With this, free_mem then becomes easy:

void free_mem(void* usr_ptr, is_array_t is_array,
              size_t alignment =
                  __STDCPP_DEFAULT_NEW_ALIGNMENT__)
{
    assert(alignment >=
           __STDCPP_DEFAULT_NEW_ALIGNMENT__);
    if (usr_ptr == nullptr) {
        return;
    }

    auto ptr = convert_user_ptr(usr_ptr, alignment);
    if (ptr == nullptr) {
        fprintf(stderr,
                "delete%s: invalid pointer %p\n",
                is_array ? "[]" : "", usr_ptr);
        abort();
    }
    if (is_array != ptr->is_array) {
        const char* msg = is_array
                              ? "delete[] after new"
                              : "delete after new[]";
        fprintf(stderr,
                "%s: pointer %p (size %zu)\n",
                msg, usr_ptr, ptr->size);
        abort();
    }
    {
        std::lock_guard guard{new_ptr_lock};
        ptr->magic = 0;
        ptr->prev->next = ptr->next;
        ptr->next->prev = ptr->prev;
    }
    ctx_free(ptr, &(ptr->ctx));
}

I.e.:

  1. It invokes convert_user_ptr to convert the user-provided pointer to a new_ptr_list_t*.
  2. It checks whether array-ness matches in the memory allocation and deallocation.
  3. It unlinks the memory block from the linked list.
  4. If anything bad happens, it prints a message and aborts the whole program (as the program already has undefined behaviour).

One More Thing

It is now nearly complete: we have set up the mechanisms to record memory contexts in memory allocation and deallocation functions. However, I have omitted one important detail so far. If you used my code verbatim as above, the program would crash on first memory allocation. When the global allocation and deallocation functions are replaced, care must be taken when we need additional memory inside those functions. If we somehow use the generic C++ memory allocation mechanisms, it will invoke operator new in the end, causing an infinite recursion. It is still OK to use malloc/free, so we need to use a malloc_allocator for the context stack:

template <typename T>
struct malloc_allocator {
    typedef T value_type;

    typedef std::true_type is_always_equal;
    typedef std::true_type
        propagate_on_container_move_assignment;

    malloc_allocator() = default;
    template <typename U>
    malloc_allocator(const malloc_allocator<U>&) {}

    template <typename U>
    struct rebind {
        typedef malloc_allocator<U> other;
    };

    T* allocate(size_t n)
    {
        return static_cast<T*>(malloc(n * sizeof(T)));
    }
    void deallocate(T* p, size_t)
    {
        free(p);
    }
};

thread_local std::deque<context,
                        malloc_allocator<context>>
    context_stack{context{"<UNKNOWN>", "<UNKNOWN>"}};

Everything Put Together

You can find the real code with more details and a working memory leak checker in this repository:

https://github.com/adah1972/nvwa/tree/master/nvwa

You need to add the root directory of Nvwa to your include path, and nvwa/memory_trace.cpp and nvwa/aligned_memory.cpp to your project. In order to add a new memory checkpoint, use the macro NVWA_MEMORY_CHECKPOINT (Nvwa macros are usually prefixed with ‘NVWA_’). A very short test program follows:

#include <nvwa/memory_trace.h>

int main()
{
    char* ptr1 = new char[20];
    NVWA_MEMORY_CHECKPOINT();
    char* ptr2 = new char[42];
}

The output would be like the following:

Leaked object at 0x57697e30 (size 20, context: <UNKNOWN>/<UNKNOWN>)
Leaked object at 0x57697e70 (size 42, context: test.cpp/int main())
*** 2 leaks found

Notes about Using IWYU on macOS

I have recently found IWYU, a very useful tool to identify whether you have included header files correctly. It can be cleanly installed in Ubuntu by apt, though some configuration is needed to make it identify problems more correctly, i.e., let it know that a header file is private and we should use a public header file that includes it, or a symbol should be defined in a certain header file and we should not care where it is really defined in the implementation.

I did encounter some problems in macOS. I installed it, but it had problems with Xcode header files, causing something like “fatal error: ‘stdarg.h’ file not found” when used. A quick search showed that it was a known problem, and people mentioned that it seemed a problem that deteriorated with more recent versions of LLVM, which IWYU used internally.

I happened to have LLVM 7.0 installed from Homebrew, so I had a try. Here are the simple steps:

  1. Make sure you have llvm@7. If not, ‘brew install llvm@7’ would do.
  2. Check out IWYU to a directory.
  3. Execute ‘git checkout clang_7.0’ inside the IWYU directory to choose the Clang 7 branch.
  4. Execute ‘mkdir build && cd build’ to use a build directory.
  5. Execute ‘CC=/usr/local/opt/llvm@7/bin/clang CXX=/usr/local/opt/llvm@7/bin/clang++ cmake -DCMAKE_PREFIX_PATH=/usr/local/opt/llvm@7 ..’ to configure IWYU to use the Homebrew LLVM 7.0.
  6. Execute ‘make’ to build IWYU.
  7. Execute ‘mkdir -k lib/clang/7.1.0 && ln -s /usr/local/opt/llvm@7/lib/clang/7.1.0/include lib/clang/7.1.0/’ to symlink the Clang 7 include directory inside IWYU. This step is critical to solve the “file not found” problem, but regretfully it does not work with a more recent LLVM version like LLVM 11.
  8. Symlink executables to your bin directory for quick access. Something like:
cd ~/bin
ln -s ~/code/include-what-you-use/build/bin/include-what-you-use .
ln -s ~/code/include-what-you-use/iwyu_tool.py iwyu_tool
ln -s include-what-you-use iwyu

Now IWYU is ready to use.

Please be aware that IWYU does not often work out of the box, and some configuration is needed. The key document to read is IWYU Mappings, and the bundled mapping files (.imp) can be good examples. You probably want to use libcxx.imp as a start. Some mappings are already included by default, and you can find them in the file iwyu_include_picker.cc.

While it is not perfect, it did help me identify many inclusion issues. This commit is a result of using IWYU.

Happy hacking!

Enum Filter

I have recently encountered code that is structurally similar to the following:

enum class number {
    zero,
    one,
    two,
    three,
    four,
    five,
    six,
    seven,
    end
};

…

if (value == number::two ||
    value == number::three ||
    value == number::five ||
    value == number::seven) {
    …
}

The manual comparisons do not look good to me, as it is repetitive, error-prone, and not expressing the intent. So the natural questions comes: How can we make the code ‘better’?

While this is a fake example, I hope you can see the point that enumerators have specific properties (which I will call ‘traits’ in this article, as per C++ traditions), and I want the code to show the intent as expressed by traits.

However, let us get rid of ‘value ==’ first. Any repetitions are bad, right?

My first take is something as follows:

template <typename T>
bool is_in(const T& value,
           std::initializer_list<T> value_list)
{
    for (const auto& item : value_list) {
        if (value == item) {
            return true;
        }
    }
    return false;
}

Very simply and straightforward, but not good enough. How can we generate the list, given some criteria?

If you are familiar with the concept of template metaprogramming, you know that this is a compile-time programming topic: compile-time filtering.

In order to filter on the enumerators, we need to describe them with traits. The following code could be good enough for our current purpose:

template <number n>
struct number_traits;

template <>
struct number_traits<number::zero> {
    constexpr bool is_prime = false;
}

template <>
struct number_traits<number::one> {
    constexpr bool is_prime = false;
}

template <>
struct number_traits<number::two> {
    constexpr bool is_prime = true;
}

template <>
struct number_traits<number::three> {
    constexpr bool is_prime = true;
}

template <>
struct number_traits<number::four> {
    constexpr bool is_prime = false;
}

template <>
struct number_traits<number::five> {
    constexpr bool is_prime = true;
}

template <>
struct number_traits<number::six> {
    constexpr bool is_prime = false;
}

template <>
struct number_traits<number::seven> {
    constexpr bool is_prime = true;
}

So, let us try figuring out a way to generate such a list.

After some study, you will know that initializer_list is not fit for such manipulations. tuple is a better utility. The main reason is that we had better manipulate types, instead of values, in template metaprogramming. An initializer_list is not capable of doing that, whereas C++ already has a facility to convert compile-time integral constants into types, its name being exactly integral_constant.

Its approximate definition is as follows, in case you are not familiar with it:

template<class T, T v>
struct integral_constant {
    static constexpr T value = v;
    using value_type = T;
    using type = integral_constant;
    constexpr operator value_type() const noexcept
    {
        return value;
    }
    constexpr value_type operator()() const noexcept
    {
        return value;
    }
};

Such a definition is already provided by the standard library. So, instead of having an initializer_list like {number::two, number::three, number::five}, we would have something like the following:

std::make_tuple(
    std::integral_constant<number, number::two>{},
    std::integral_constant<number, number::three>{},
    std::integral_constant<number, number::five>{})

It would be safe to pass such ‘arguments’ for compile-time programming, as only their types matter. We would not need their values, as each type has exactly one unique value.

The next questions are:

  1. How can we generate the constants for all possible enumerators?—I.e. compile-time iteration.
  2. How can we filter to get only the values we want?—I.e. compile-time filtering.
  3. How can we check whether a value is equal to one of the constanta we present?—I.e. (compile-time or run-time) checking like the is_in above.

The answer to the first question is that we need to generate a sequence, and we need to know what the last enumerator is. As far as I know, there is currently no way in C++ to enumerate all the enumerators of an enum type. I have to resort to an agreement to mark the end of a continuous enumeration, and my choice is that we use end to mark the end, as in the enum class listed in the very beginning of this article. That is, I need to generate the sequence from integral_constant<number, number{0}> to integral_constant<number, number::end>, exclusive.

This job can be easily done with the following code, using the standard tuple and index_sequence technique:

template <typename E, size_t... ints>
constexpr auto make_all_enum_consts_impl(
    std::index_sequence<int...>)
{
    return std::make_tuple(std::integral_constant<
        E, E(ints)>{}...);
}

template <typename E>
constexpr auto make_all_enum_consts()
{
    return make_all_enum_consts_impl<E>(
        std::make_index_sequence<size_t(E::end)>{});
}

Now we have come to the second, and the really difficult, part: how can we filter the values to get only those we need?

The answer is use apply, tuple_cat, and conditional_t, three important tools in the C++ template metaprogramming world:

  • With apply, we can call a function with all elements of a tuple as arguments. I.e. apply(f, make_tuple(42, "answer")) would be equivalent to f(42, "answer").
  • With tuple_cat, we can concatenate elements of tuples into a new tuple. I.e. tuple_cat(make_tuple(42, "answer"), make_tuple("of", "everything")) would result in the tuple {42, "answer", "of", "everything"}.
  • With conditional_t, we can get one of the given types based on a compile-time Boolean expression. I.e. conditional_t<true, int, string> would result in int, but conditional_t<false, int, string> would result in string.

Each tool may look trivial individually, but they can be combined together to work wonders. Specifically, it can do what we now need.

This is the final form I use (mainly inspired by this Stack Overflow answer):

#define ENUM_FILTER_FROM(E, T, tup)                     \
    std::apply(                                         \
        [](auto... ts) {                                \
            return std::tuple_cat(                      \
                std::conditional_t<                     \
                    E##_traits<decltype(ts)::value>::T, \
                    std::tuple<decltype(ts)>,           \
                    std::tuple<>>{}...);                \
        },                                              \
        tup)

Let me explain what it does:

  • The macro takes an enumeration type, a trait name, and a tuple of enumerator constants, which are created by make_all_enum_consts above. The reason why a tuple of constants are used is that the result of calling ENUM_FILTER_FROM can be filtered again.
  • std::apply invokes the generic lambda with the tuple of arguments
  • The generic lambda does the compile-time computation of concatenating (tuple_cat) the arguments into a new tuple
  • The arguments of tuple_cat is either a tuple of one enumerator constant, if the type satisfies the trait, or an empty tuple otherwise
  • So the end result of executing the code in the macro is a tuple of enumerator constants that satisfy the trait

The answer to the third question is relatively simple. For maximal flexibility, I am splitting it into two steps:

  • Convert the tuple of types into a tuple of values
  • Check whether a value is in the tuple with a fold expression

Here is the code:

template <typename Tuple, size_t... ints>
constexpr auto make_values_from_consts_impl(
    Tuple tup, std::index_sequence<ints...>)
{
    return std::make_tuple(std::get<ints>(tup)()...);
}

template <typename Tuple>
constexpr auto make_values_from_consts(Tuple tup)
{
    return make_values_from_consts_impl(
        tup, std::make_index_sequence<
                 std::tuple_size_v<Tuple>>{});
}

template <typename T, typename Tuple, size_t... ints>
constexpr bool is_in_impl(const T& value,
                          const Tuple& tup,
                          std::index_sequence<ints...>)
{
    return ((value == std::get<ints>(tup)) || ...);
}

template <typename T, typename Tuple>
constexpr std::enable_if_t<
    std::is_same_v<T, std::decay_t<decltype(std::get<0>(
                          std::declval<Tuple>()))>>,
    bool>
is_in(const T& value, const Tuple& tup)
{
    return is_in_impl(value, tup,
                      std::make_index_sequence<
                          std::tuple_size_v<Tuple>>{});
}

Finally, we can define the function is_prime:

constexpr bool is_prime(number n)
{
    return is_in(
        n, make_values_from_consts(ENUM_FILTER_FROM(
               number, is_prime,
               make_all_enum_consts<number>())));
}

More interestingly, the result of invoking ENUM_FILTER_FROM can be passed to ENUM_FILTER_FROM again. If we defined the trait is_even as well as is_prime, we would be able to write:

ENUM_FILTER_FROM(number, is_even, \
    ENUM_FILTER_FROM(number, is_prime, …)

Is that nice?

Do note that there is an asymmetry here. It is trivial to implement make_values_from_consts, but it seems impossible to implement its inverse constexpr function make_consts_from_values. This is because there are no constexpr arguments in C++ (check out this discussion if you are interested in the reasons). No arguments are regarded constexpr, even in a constexpr function. You can work around the problem in a cumbersome way, but for this post I am sticking to using types as long as possible.

That’s it, my experience of using compile-time filtering. I have found the techniques presented here useful, and I wish you would find them useful too.

Fixing A VS2017 15.6 Installation Problem

After installing the latest Visual Studio 2017 15.6.6 (Community Edition), I found my custom setting of environment variables INCLUDE lost effect in the Developer Command Prompt. Strangely, LIB was still there. Some tracing indicated that it was a bug in the .BAT files Microsoft provided to initialize the environment. The offending lines are the following (in C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\Tools\vsdevcmd\core\winsdk.bat; one of them is very long and requires scrolling for viewing):

@REM the folowing are architecture neutral
set __tmpwinsdk_include=
if "%INCLUDE%" NEQ "" set "__tmp_include=;%INCLUDE%"
set "INCLUDE=%WindowsSdkDir%include\%WindowsSDKVersion%shared;%WindowsSdkDir%include\%WindowsSDKVersion%um;%WindowsSdkDir%include\%WindowsSDKVersion%winrt;%WindowsSdkDir%include\%WindowsSDKVersion%cppwinrt%__tmpwinsdk_include%"
set __tmpwinsdk_include=

Apparently somebody missed renaming __tmp_include to __tmpwinsdk_include. Doing that myself fixed the problem.

I’ve reported the problem to Microsoft. In the meanwhile, you know how to fix it if you encounter the same problem.

A Journey of Purely Static Linking

As I mentioned last time, I found Microsoft has really messed up its console Unicode support when the C runtime DLL (as versus the static runtime library) is used. So I decided to have a try with linking everything statically in my project that uses C++ REST SDK (a.k.a. cpprestsdk). This is not normally recommended, but in my case it has two obvious advantages:

  • It would solve the Unicode I/O problem.
  • It would be possible to ship just the binaries without requiring the target PC to install the Microsoft Visual C++ runtime.

It took me several hours to get it rolling, but I felt it was worthwhile.

Before I start, I need to mention that cpprestsdk has a solution file that supports building a static library. It turned out not satisfactory:

  • It used NuGet packages for Boost and OpenSSL, and both versions were out of date. Worse, my Visual Studio 2017 IDE hung while I tried to update the packages. Really a nuisance.
  • The static library, as well as all its dependencies like Boost and OpenSSL, still uses the C runtime DLL. I figured it might be easier to go completely on my own.

Prerequisites

Boost

This part is straightforward. After going into the Boost directory, I only need to type (I use version 1.65.1):

bootstrap.bat
.\b2.exe toolset=msvc -j 2 --with-chrono --with-date_time --with-regex --with-system --with-thread release link=static runtime-link=static stage
.\b2.exe toolset=msvc -j 2 --with-chrono --with-date_time --with-regex --with-system --with-thread release link=shared stage

(The last line is needed only because ‘cmake ..’ would otherwise fail to detect Boost in the ‘Building C++ REST SDK’ stage.)

OpenSSL

As I already have Perl and NASM installed, installing OpenSSL is trivial too (I use version 1.0.2l):

perl Configure VC-WIN32 --prefix=C:/Libraries/OpenSSL
ms\do_nasm
nmake -f ms\nt.mak
nmake -f ms\nt.mak install

zlib

This part requires a small change to the build script (for version 1.2.11). I need to open win32\Makefile.msc and change all occurrences of ‘-MD’ to ‘-MT’. Then these commands will work:

nmake -f win32\Makefile.msc zlib.lib
mkdir C:\Libraries\zlib
mkdir C:\Libraries\zlib\include
mkdir C:\Libraries\zlib\lib
copy zconf.h C:\Libraries\zlib\include
copy zlib.h C:\Libraries\zlib\include
copy zlib.lib C:\Libraries\zlib\lib

Building C++ REST SDK

We need to set some environment variables to help the CMake build system find where the libraries are. I set them in ‘Control Panel > System > Advanced system settings > Environment variables’:1

BOOST_ROOT=C:\src\boost_1_65_1
OPENSSL_ROOT_DIR=C:\Libraries\OpenSSL
INCLUDE=C:\Libraries\zlib\include
LIB=C:\Libraries\zlib\lib

(The above setting assumes Boost is unpacked under C:\src.)

We would need to create the solution files for the current environment under cpprestsdk:

cd Release
mkdir build
cd build
cmake ..

If the environment is set correctly, the last command should succeed and report no errors. A cpprest.sln should be generated now.

We then open this solution file. As we only need the release files, we should change the ‘Solution Configuration’ from ‘Debug’ to ‘Release’. After that, we need to find the project ‘cpprest’ in ‘Solution Explorer’, go to its ‘Properties’, and make the following changes under ‘General’:

  • Set Target Name to ‘cpprest’.
  • Set Target Extension to ‘.lib’.
  • Set Configuration Type to ‘Static library (.lib)’.

And the most important change we need under ‘C/C++ > Code Generation’:

  • Set Runtime Library to ‘Multi-threaded (/MT)’.

Click on ‘OK’ to accept the changes. Then we can build this project.

Like zlib, we need to copy the header and library files to a new path, and add the include and lib directories to environment variables INCLUDE and LIB, respectively. In my case, I have:

INCLUDE=C:\Libraries\cpprestsdk\include;C:\Libraries\zlib\include
LIB=C:\Libraries\cpprestsdk\lib;C:\Libraries\zlib\lib

Change to my Project

Of course, the cpprestsdk-based project needs to be adjusted too. I will first show the diff, and then give some explanations:

--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -24,14 +24,25 @@ set(CMAKE_CXX_FLAGS "${ELPP_FLAGS}")

 if(WIN32)
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}\
- -DELPP_UNICODE -D_WIN32_WINNT=0x0601")
-set(USED_LIBS Boost::dynamic_linking ${Boost_DATE_TIME_LIBRARY} ${Boost_SYSTEM_LIBRARY} ${Boost_THREAD_LIBRARY})
+ -DELPP_UNICODE -D_WIN32_WINNT=0x0601 -D_NO_ASYNCRTIMP")
+set(USED_LIBS Winhttp httpapi bcrypt crypt32 zlib)
 else(WIN32)
 set(USED_LIBS "-lcrypto" OpenSSL::SSL ${Boost_CHRONO_LIBRARY} ${Boost_SYSTEM_LIBRARY} ${Boost_THREAD_LIBRARY})
 endif(WIN32)

 if(MSVC)
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /EHsc /W3")
+set(CompilerFlags
+        CMAKE_CXX_FLAGS
+        CMAKE_CXX_FLAGS_DEBUG
+        CMAKE_CXX_FLAGS_RELEASE
+        CMAKE_C_FLAGS
+        CMAKE_C_FLAGS_DEBUG
+        CMAKE_C_FLAGS_RELEASE)
+foreach(CompilerFlag ${CompilerFlags})
+  string(REPLACE "/MD" "/MT" ${CompilerFlag}
+         "${${CompilerFlag}}")
+endforeach()
 else(MSVC)
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -W -Wall -Wfatal-errors")
 endif(MSVC)

There are two blocks of changes. In the first block, one can see that the Boost libraries are no longer needed, but, instead, one needs to link the Windows dependencies of cpprestsdk (I found the list in Release\build\src\cpprest.vcxproj), as well as zlib. One also needs to explicitly define _NO_ASYNCRTIMP so that the cpprestsdk functions will be not treated as dllimport.

As CMake defaults to using ‘/MD’, the second block of changes replaces all occurrences of ‘/MD’ with ‘/MT’ in the compiler flags.2 With these changes, I am able to generate an executable without any external dependencies.

A Gotcha

I am now used to using cmake without specifying the ‘-G’ option on Windows. By default, CMake generates Visual Studio project files: they have several advantages, including multiple configurations (selectable on the MSBuild command line like ‘/p:Configuration=Release’), and parallel building (say, using ‘/m:2’ to take advantage of two processor cores). Neither is possible with nmake. However, the executables built by this method still behave abnormally regarding outputting non-ASCII characters. Actually, I believe everything is still OK at the linking stage, but the build process then touches the executables in some mysterious way and the result becomes bad. I am not familiar with MSBuild well enough to manipulate the result, so I am going back to using ‘cmake -G "NMake Makefiles" -DCMAKE_BUILD_TYPE=Release’ followed by ‘nmake’ for now.


  1. CMake can recognize Boost and OpenSSL by some known environment variables. I failed to find one that really worked for zlib, so the INCLUDE and LIB variables need to be explicitly set. 
  2. This technique is shamelessly copied from a StackOverflow answer

Another Microsoft Unicode I/O Problem

I encountered an annoying bug in Visual C++ 2017 recently. It started when I found my favourite logging library, Easylogging++, output Chinese as garbage characters on Windows. Checking the documentation carefully, I noted that I should have used the macro START_EASYLOGGINGPP. It turned out to be worse: all output starting from the Chinese character was gone. Puzzled but busy, I put it down and worked on something else.

I spend another hour of detective work on this issue today. The result was quite surprising.

  • First, it is not an issue with Easylogging++. The problem can occur if I purely use std::wcout.
  • Second, the magical thing about START_EASYLOGGINGPP is that it will invoke std::locale::global(std::locale("")). This is the switch that leads to the different behaviour.
  • Myteriously, with the correct locale setting, I can get the correct result with both std::wcout and Easylogging++ in a test program. I was not able to get it working in my real project.
  • Finally, it turns out that the difference above is caused by /MT vs. /MD! The former (default if neither is specified on the command line) tells the Visual C++ compiler to use the static multi-threading library, and the latter (set by default in Visual Studio projects) tells the compiler to use the dynamic multi-threading library.

People may remember that I wrote about MSVCRT.DLL Console I/O Bug. While Visual C++ 2013 shows consistent behaviour between /MT and /MD, Visual C++ 2015 and 2017 exhibit the same woeful bug when /MD is specified on the command line. This is something perplexingly weird: it seems someone at Microsoft messed up with the MSVCRT.DLL shipped with Windows first (around 2006), and then the problem spread to the Visual C++ runtime DLL after nearly a decade!

I am using many modern C++ features, so I definitely do not want to go back to Visual C++ 2013 for new projects. It seems I have to tolerate garbage characters in the log for now. Meanwhile, I submitted a bug to Microsoft. Given that I have a bug report that is deferred for four years, I am not very hopeful. But let us wait and see.

Update (20 December 2017)

A few more tests show that the debug versions (/MTd and /MDd) both work well. So only the default release build (using the dynamic C runtime) exhibits this problem, where the executable depends on DLLs like api-ms-win-crt-stdio-l1-1-0.dll. It seems this issue is related to the Universal C Runtime introduced in Visual Studio 2015 and Windows 10. . . .

Update (25 March 2018)

The bug was closed, and a Microsoft developer indicated that the issue had already been fixed since the Windows 10 Anniversary Update SDK (build 10.0.14393). Actually I had had build 10.0.15063 installed. The reason why I still saw the problem was that the Universal C Runtime on Windows 7 had not been updated (‘the issue will be fixed in a future update to the Universal C Runtime on Windows 7’), and I should not have seen the problem on a Windows 10 box. The current workaround is either use static linking (as I did), or copy the redistributable DLLs under C:\Program Files (x86)\Windows Kits\10\Redist\ucrt\DLLs\x86 (or x64 etc.) to the app directory (so called ‘app-local deployment’; which should not be used on Windows 10, as the system version is always preferred). My test showed that copying ucrtbase.dll was enough to fix my test case.

C/C++ Performance, mmap, and string_view

A C++ discussion group I participated in got a challenge last week. Someone posted a short Perl program, and claimed that it was faster than the corresponding C version. It was to this effect (with slight modifications):

open IN, "$ARGV[0]";
my $gc = 0;
while (my $line = ) {
  $gc += ($line =~ tr/cCgG//);
}
print "$gc\n";
close IN

The program simply searched and counted all occurrences of the letters ‘C’ and ‘G’ from the input file in a case-insensitive way. Since the posted C code was incomplete, I wrote a naïve implementation to test, which did turn out to be slower than the Perl code. It was about two times as slow on Linux, and about 10 times as slow on macOS.1

FILE* fp = fopen(argv[1], "rb");
int count = 0;
int ch;
while ((ch = getc(fp)) != EOF) {
    if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G') {
        ++count;
    }
}
fclose(fp);

Of course, it actually shows how optimized the Perl implementation is, instead of how inferior the C language is. Before I had time to test my mmap_line_reader, another person posted an mmap-based solution, to the following effect (with modifications):

int fd = open(argv[1], O_RDONLY);
struct stat st;
fstat(fd, &st);
int len = st.st_size;
char ch;
int count = 0;
char* ptr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
close(fd);
char* begin = ptr;
char* end = ptr + len;
while (ptr < end) {
    ch = *ptr++;
    if (ch == 'c' || ch == 'C' || ch == 'g' || ch == 'G')
        ++count;
}
munmap(begin, len);

When I tested my mmap_line_reader, I found that its performance was only on par with the Perl code, but slower than the handwritten mmap-based code. It is not surprising, considering that mmap_line_reader copies the line content, while the C code above searches directly in the mmap’d buffer.

I then thought of the C++17 string_view.2 It seemed a good chance of using it to return a line without copying its content. It was actually easy refactoring (on code duplicated from mmap_line_reader), and most of the original code did not require any changes. I got faster code for this test, but the implementations of mmap_line_reader and the new mmap_line_reader_sv were nearly identical, except for a few small differences.

Naturally, the next step was to refactor again to unify the implementations. I made a common base to store the bottommost mmap-related logic, and made the difference between string and string_view disappear with a class template. Now mmap_line_reader and mmap_line_reader_sv were just two aliases of instantiations of basic_mmap_line_reader!

While mmap_line_reader_sv was faster than mmap_line_reader, it was still slower than the mmap-based C code. So I made another abstraction, a ‘container’ that allowed iteration over all of the file content. Since the mmap-related logic was already mostly separated, only some minor modifications were needed to make that base class fully independent of the line reading logic. After that, adding mmap_char_reader was easy, which was simply a normal container that did not need to mess with platform-specific logic.

At this point, all seemed well—except one thing: it only worked on Unix. I did not have an immediate need to make it work on Windows, but I really wanted to show that the abstraction provided could work seamlessly regardless of the platform underneath. After several revisions, in which I dealt with proper Windows support,3 proper 32- and 64-bit support, and my own mistakes, I finally made it work. You can check out the current code in the nvwa repository. With it, I can finally make the following simple code work on Linux, macOS, and Windows, with the same efficiency as raw C code when fully optimized:4

#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_byte_reader.h"

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;
        exit(1);
    }

    try {
        int count = 0;
        for (char ch : nvwa::mmap_char_reader(argv[1]))
            if (ch == 'c' || ch == 'C' ||
                ch == 'g' || ch == 'G')
                ++count;
        std::cout << count << std::endl;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}

Even though it is not used in the final code, I love the C++17 string_view. And I like the simplicity I finally achieved. Do you?

P.S. The string_view-based test code is posted here too as a reference. Line-based processing is common enough!5

#include <iostream>
#include <stdlib.h>
#include "nvwa/mmap_line_reader.h"

int main(int argc, char* argv[])
{
    if (argc != 2) {
        std::cerr << "A file name is needed" << std::endl;
        exit(1);
    }

    try {
        int count = 0;
        for (const auto& line :
                nvwa::mmap_line_reader_sv(argv[1]))
            for (char ch : line)
                if (ch == 'c' || ch == 'C' ||
                    ch == 'g' || ch == 'G')
                    ++count;
        std::cout << count << std::endl;
    }
    catch (std::exception& e) {
        std::cerr << e.what() << std::endl;
    }
}

Update (18 September 2017)

Thanks to Alex Maystrenko (see the comments below), It is now understood that the reason why getc was slow was because there was an implicit lock around file operations. I did not expect it, as I grew from an age when multi-threading was the exception, and I had not learnt about the existence of getc_unlocked until he mentioned it! According to the getc_unlocked page in the POSIX specification:

Some I/O functions are typically implemented as macros for performance reasons (for example, putc() and getc()). For safety, they need to be synchronized, but it is often too expensive to synchronize on every character. Nevertheless, it was felt that the safety concerns were more important; consequently, the getc(), getchar(), putc(), and putchar() functions are required to be thread-safe. However, unlocked versions are also provided with names that clearly indicate the unsafe nature of their operation but can be used to exploit their higher performance.

After replacing getc with getc_unlocked, the naïve implementation immediately outperforms the Perl code on Linux, though not on macOS.6

Another interesting thing to notice is that GCC provides vastly optimized code for the comparison with ‘c’, ‘C’, ‘g’, and ‘G’,7 which is extremely unlikely for interpreted languages like Perl. Observing the codes for the characters are:

  • 010000112 or 6710 (‘C’)
  • 010001112 or 7110 (‘G’)
  • 011000112 or 9910 (‘c’)
  • 011001112 or 10310 (‘g’)

GCC basically ANDs the input character with 110110112, and compares the result with 010000112. In Intel-style assembly:

        movzx   ecx, BYTE PTR [r12]
        and     ecx, -37
        cmp     cl, 67

It is astoundingly short and efficient!


  1. I love my Mac, but I do feel Linux has great optimizations. 
  2. If you are not familiar with string_view, check out its reference
  3. Did I mention how unorthogonal and uncomfortable the Win32 API is, when compared with POSIX? I am very happy to hide all the ugliness from the application code. 
  4. Comparison was only made on Unix (using the POSIX mmap API), under GCC with ‘-O3’ optimization (‘-O2’ was not enough). 
  5. -std=c++17’ must be specified for GCC, and ‘/std:c++latest’ must be specified for Visual C++ 2017. 
  6. However, the performance improvement is more dramatic on macOS, from about 10 times slower to 5% slower. 
  7. Of course, this is only possible when the characters are given as compile-time constants. 

Performance of My Line Readers

After I wrote the article about Python yield and C++ Coroutines, I felt that I needed to test the performance of istream_line_reader. The immediate result was both good and bad: good in that there was no actual difference between the straightforward std::getline and my istream_line_reader (as anticipated), and bad in that neither version performed well (a surprise to me). I vaguely remember that sync_with_stdio(false) may affect the performance, so I also tested calling this function in the beginning. However, it did not seem to matter. By the way, my favourite compiler has always been Clang recently (and I use a Mac).

Seeing that istream_line_reader had a performance problem, I tried other approaches. One thing I tried was using the traditional C I/O functions. I wrote another file_line_reader, which used either fgets or fread to read the data, depending what the delimiter is. (fgets could only use ‘\n’ as the delimiter, but it performed better than fread, for I could fgets into the final line buffer, but had to fread into a temporary buffer first.) I also added a switch on whether to strip the delimiter, something not possible with the getline function. The result achieved a more than 10x performance improvement (from about 28 MB/s to 430 MB/s). I was happy, and presented this on the last slide of my presentation on C++ and Functional Programming in the 2016 C++ and System Software Summit (China).

Until C++11, modifying the character array accessed through string::data() has undefined behaviour. To be on the safe side, I implemented a self-expanding character buffer on my own, which complicated the implementation a little bit. It also made the interface slightly different from istream_line_reader, which can be demonstrated in the following code snippets.

Iteration with istream_line_reader:

for (auto& line : istream_line_reader(cin)) {
    puts(line.c_str());
}

Iteration with file_line_reader:

for (auto& line : file_line_reader(stdin)) {
    puts(line);
}

I.e. each iteration with file_line_reader returns a char* instead of a string. This should be OK, as a raw character pointer is often enough. One can always construct a string from char* easily, anyway.


After the presentation, I turned to implementing a small enhancement—iterating over the lines with mmap. This proved interesting work. Not only did it improved the line reading performance, but the code was simplified as well. As I could access the file content directly with a pointer, I was able to copy the lines to a string simply with string::assign. As I used string again, there was no need to define a custom copy constructor, copy assignment operator, move constructor, and move assignment operator as well. The performance was, of course, also good: the throughput rate reached 650 MB/s, a 50% improvement! The only negative side was that it could not work on stdin, so testing it required more lines. Apart from that, I was quite satisfied. And I had three different line readers that could take an istream&, FILE*, or file descriptor as the input source. So all situations were dealt with. Not bad!

One thing of note about the implementation. I tried copying (a character at a time) while searching, before adopting the current method of searching first before assigning to the string. The latter proved faster when dealing with long lines. I can see two reasons:

  1. Strings are normally (and required to be since C++11) null-terminated, so copying one character at a time has a big overhead of zeroing the next byte. I confirmed the case from the libc++ source code of Clang.
  2. Assignment can use memcpy or memmove internally, which normally has a fast platform-specific implementation. In the case of string::assign(const char*, size_t), I verified that libc++ used memmove indeed.

If you are interested, this is the assembly code I finally traced into on my Mac (comments are my analysis; you may need to scroll horizontally to see them all):

libsystem_c.dylib`memcpy$VARIANT$sse42:
   0x7fff9291fcbd:  pushq  %rbp
   0x7fff9291fcbe:  movq   %rsp, %rbp
   0x7fff9291fcc1:  movq   %rdi, %r11           ; save dest
   0x7fff9291fcc4:  movq   %rdi, %rax
   0x7fff9291fcc7:  subq   %rsi, %rax           ; dest - src
   0x7fff9291fcca:  cmpq   %rdx, %rax
   0x7fff9291fccd:  jb     0x7fff9291fd04       ; dest in (src, src + len)?
   ; Entry condition: dest <= src or dest >= src + len; copy starts from front
   0x7fff9291fccf:  cmpq   $80, %rdx
   0x7fff9291fcd3:  ja     0x7fff9291fd09       ; len > 128?
   ; Entry condition: len <= 128
   0x7fff9291fcd5:  movl   %edx, %ecx
   0x7fff9291fcd7:  shrl   $2, %ecx             ; len / 4
   0x7fff9291fcda:  je     0x7fff9291fcec       ; len < 4?
   0x7fff9291fcdc:  movl   (%rsi), %eax         ; 4-byte read
   0x7fff9291fcde:  addq   $4, %rsi             ; src <- src + 4
   0x7fff9291fce2:  movl   %eax, (%rdi)         ; 4-byte write
   0x7fff9291fce4:  addq   $4, %rdi             ; dest <- dest + 4
   0x7fff9291fce8:  decl   %ecx
   0x7fff9291fcea:  jne    0x7fff9291fcdc       ; more 4-byte blocks?
   ; Entry condition: len < 4
   0x7fff9291fcec:  andl   $3, %edx
   0x7fff9291fcef:  je     0x7fff9291fcff       ; len == 0?
   0x7fff9291fcf1:  movb   (%rsi), %al          ; 1-byte read
   0x7fff9291fcf3:  incq   %rsi                 ; src <- src + 1
   0x7fff9291fcf6:  movb   %al, (%rdi)          ; 1-byte write
   0x7fff9291fcf8:  incq   %rdi                 ; dest <- dest + 1
   0x7fff9291fcfb:  decl   %edx
   0x7fff9291fcfd:  jne    0x7fff9291fcf1       ; more bytes?
   0x7fff9291fcff:  movq   %r11, %rax           ; restore dest
   0x7fff9291fd02:  popq   %rbp
   0x7fff9291fd03:  ret
   0x7fff9291fd04:  jmpq   0x7fff9291fdb9
   ; Entry condition: len > 128
   0x7fff9291fd09:  movl   %edi, %ecx
   0x7fff9291fd0b:  negl   %ecx
   0x7fff9291fd0d:  andl   $15, %ecx            ; 16 - dest % 16
   0x7fff9291fd10:  je     0x7fff9291fd22       ; dest 16-byte aligned?
   0x7fff9291fd12:  subl   %ecx, %edx           ; adjust len
   0x7fff9291fd14:  movb   (%rsi), %al          ; one-byte read
   0x7fff9291fd16:  incq   %rsi                 ; src <- src + 1
   0x7fff9291fd19:  movb   %al, (%rdi)          ; one-byte write
   0x7fff9291fd1b:  incq   %rdi                 ; dest <- dest + 1
   0x7fff9291fd1e:  decl   %ecx
   0x7fff9291fd20:  jne    0x7fff9291fd14       ; until dest is aligned
   ; Entry condition: dest is 16-byte aligned
   0x7fff9291fd22:  movq   %rdx, %rcx           ; len
   0x7fff9291fd25:  andl   $63, %edx            ; len % 64
   0x7fff9291fd28:  andq   $-64, %rcx           ; len <- align64(len)
   0x7fff9291fd2c:  addq   %rcx, %rsi           ; src <- src + len
   0x7fff9291fd2f:  addq   %rcx, %rdi           ; src <- dest + len
   0x7fff9291fd32:  negq   %rcx                 ; len <- -len
   0x7fff9291fd35:  testl  $15, %esi
   0x7fff9291fd3b:  jne    0x7fff9291fd80       ; src not 16-byte aligned?
   0x7fff9291fd3d:  jmp    0x7fff9291fd40
   0x7fff9291fd3f:  nop
   ; Entry condition: both src and dest are 16-byte aligned
   0x7fff9291fd40:  movdqa (%rsi,%rcx), %xmm0   ; aligned 16-byte read
   0x7fff9291fd45:  movdqa 16(%rsi,%rcx), %xmm1
   0x7fff9291fd4b:  movdqa 32(%rsi,%rcx), %xmm2
   0x7fff9291fd51:  movdqa 48(%rsi,%rcx), %xmm3
   0x7fff9291fd57:  movdqa %xmm0, (%rdi,%rcx)   ; aligned 16-byte write
   0x7fff9291fd5c:  movdqa %xmm1, 16(%rdi,%rcx)
   0x7fff9291fd62:  movdqa %xmm2, 32(%rdi,%rcx)
   0x7fff9291fd68:  movdqa %xmm3, 48(%rdi,%rcx)
   0x7fff9291fd6e:  addq   $64, %rcx
   0x7fff9291fd72:  jne    0x7fff9291fd40       ; more 64-byte blocks?
   0x7fff9291fd74:  jmpq   0x7fff9291fcd5
   0x7fff9291fd79:  nopl   (%rax)               ; 7-byte nop
   ; Entry condition: src is NOT 16-byte aligned but dest is
   0x7fff9291fd80:  movdqu (%rsi,%rcx), %xmm0   ; unaligned 16-byte read
   0x7fff9291fd85:  movdqu 16(%rsi,%rcx), %xmm1
   0x7fff9291fd8b:  movdqu 32(%rsi,%rcx), %xmm2
   0x7fff9291fd91:  movdqu 48(%rsi,%rcx), %xmm3
   0x7fff9291fd97:  movdqa %xmm0, (%rdi,%rcx)   ; aligned 16-byte write
   0x7fff9291fd9c:  movdqa %xmm1, 16(%rdi,%rcx)
   0x7fff9291fda2:  movdqa %xmm2, 32(%rdi,%rcx)
   0x7fff9291fda8:  movdqa %xmm3, 48(%rdi,%rcx)
   0x7fff9291fdae:  addq   $64, %rcx
   0x7fff9291fdb2:  jne    0x7fff9291fd80       ; more 64-byte blocks?
   0x7fff9291fdb4:  jmpq   0x7fff9291fcd5
   ; Entry condition: dest > src and dest < src + len; copy starts from back
   0x7fff9291fdb9:  addq   %rdx, %rsi           ; src <- src + len
   0x7fff9291fdbc:  addq   %rdx, %rdi           ; dest <- dest + len
   0x7fff9291fdbf:  cmpq   $80, %rdx
   0x7fff9291fdc3:  ja     0x7fff9291fdf6       ; len > 128?
   ; Entry condition: len < 128
   0x7fff9291fdc5:  movl   %edx, %ecx
   0x7fff9291fdc7:  shrl   $3, %ecx             ; len / 8
   0x7fff9291fdca:  je     0x7fff9291fdde       ; len < 8?
   ; Entry condition: len >= 8
   0x7fff9291fdcc:  subq   $8, %rsi             ; src <- src - 8
   0x7fff9291fdd0:  movq   (%rsi), %rax         ; 8-byte read
   0x7fff9291fdd3:  subq   $8, %rdi             ; dest <- dest - 8
   0x7fff9291fdd7:  movq   %rax, (%rdi)         ; 8-byte write
   0x7fff9291fdda:  decl   %ecx
   0x7fff9291fddc:  jne    0x7fff9291fdcc       ; until len < 8
   ; Entry condition: len < 8
   0x7fff9291fdde:  andl   $7, %edx
   0x7fff9291fde1:  je     0x7fff9291fdf1       ; len == 0?
   0x7fff9291fde3:  decq   %rsi                 ; src <- src - 1
   0x7fff9291fde6:  movb   (%rsi), %al          ; 1-byte read
   0x7fff9291fde8:  decq   %rdi                 ; dest <- dest - 1
   0x7fff9291fdeb:  movb   %al, (%rdi)          ; 1-byte write
   0x7fff9291fded:  decl   %edx
   0x7fff9291fdef:  jne    0x7fff9291fde3       ; more bytes?
   0x7fff9291fdf1:  movq   %r11, %rax           ; restore dest
   0x7fff9291fdf4:  popq   %rbp
   0x7fff9291fdf5:  ret
   ; Entry condition: len > 128
   0x7fff9291fdf6:  movl   %edi, %ecx
   0x7fff9291fdf8:  andl   $15, %ecx
   0x7fff9291fdfb:  je     0x7fff9291fe0e       ; dest 16-byte aligned?
   0x7fff9291fdfd:  subq   %rcx, %rdx           ; adjust len
   0x7fff9291fe00:  decq   %rsi                 ; src <- src - 1
   0x7fff9291fe03:  movb   (%rsi), %al          ; one-byte read
   0x7fff9291fe05:  decq   %rdi                 ; dest <- dest - 1
   0x7fff9291fe08:  movb   %al, (%rdi)          ; one-byte write
   0x7fff9291fe0a:  decl   %ecx
   0x7fff9291fe0c:  jne    0x7fff9291fe00       ; until dest is aligned
   ; Entry condition: dest is 16-byte aligned
   0x7fff9291fe0e:  movq   %rdx, %rcx           ; len
   0x7fff9291fe11:  andl   $63, %edx            ; len % 64
   0x7fff9291fe14:  andq   $-64, %rcx           ; len <- align64(len)
   0x7fff9291fe18:  subq   %rcx, %rsi           ; src <- src - len
   0x7fff9291fe1b:  subq   %rcx, %rdi           ; dest <- dest - len
   0x7fff9291fe1e:  testl  $15, %esi
   0x7fff9291fe24:  jne    0x7fff9291fe61       ; src 16-byte aligned?
   ; Entry condition: both src and dest are 16-byte aligned
   0x7fff9291fe26:  movdqa -16(%rsi,%rcx), %xmm0; aligned 16-byte read
   0x7fff9291fe2c:  movdqa -32(%rsi,%rcx), %xmm1
   0x7fff9291fe32:  movdqa -48(%rsi,%rcx), %xmm2
   0x7fff9291fe38:  movdqa -64(%rsi,%rcx), %xmm3
   0x7fff9291fe3e:  movdqa %xmm0, -16(%rdi,%rcx); aligned 16-byte write
   0x7fff9291fe44:  movdqa %xmm1, -32(%rdi,%rcx)
   0x7fff9291fe4a:  movdqa %xmm2, -48(%rdi,%rcx)
   0x7fff9291fe50:  movdqa %xmm3, -64(%rdi,%rcx)
   0x7fff9291fe56:  subq   $64, %rcx
   0x7fff9291fe5a:  jne    0x7fff9291fe26       ; more 64-byte blocks?
   0x7fff9291fe5c:  jmpq   0x7fff9291fdc5
   ; Entry condition: src is NOT 16-byte aligned but dest is
   0x7fff9291fe61:  movdqu -16(%rsi,%rcx), %xmm0; unaligned 16-byte read
   0x7fff9291fe67:  movdqu -32(%rsi,%rcx), %xmm1
   0x7fff9291fe6d:  movdqu -48(%rsi,%rcx), %xmm2
   0x7fff9291fe73:  movdqu -64(%rsi,%rcx), %xmm3
   0x7fff9291fe79:  movdqa %xmm0, -16(%rdi,%rcx); aligned 16-byte write
   0x7fff9291fe7f:  movdqa %xmm1, -32(%rdi,%rcx)
   0x7fff9291fe85:  movdqa %xmm2, -48(%rdi,%rcx)
   0x7fff9291fe8b:  movdqa %xmm3, -64(%rdi,%rcx)
   0x7fff9291fe91:  subq   $64, %rcx
   0x7fff9291fe95:  jne    0x7fff9291fe61       ; more 64-byte blocks?
   0x7fff9291fe97:  jmpq   0x7fff9291fdc5

I am happy that I can take advantage of such optimizations, but do not need to write such code on my own—there are so many different cases to deal with!


Of couse, nothing is simple regarding performance. More tests revealed more facts that are interesting and/or surprising:

  • While libc++ (it is the library, but not the compiler, that matters here) seems to completely ignore sync_with_stdio, it makes a big difference in libstdc++. The same function call gets a more than 10x performance improvement when the istream_line_reader test program is compiled with GCC (which uses libstdc++), from ~28 MB/s to ~390 MB/s. It shows that I made a wrong assumption! Interestingly, reading from stdin (piped from the pv tool) is slightly faster than reading from a file on my Mac (when compiled with GCC).
  • On a CentOS 6.5 Linux system, sync_with_stdio(false) has a bigger performance win (~23 MB/s vs. ~800 MB/s). Reading from a file directly is even faster at 1100 MB/s. That totally beats my file_line_reader (~550 MB/s reading a file directly) and mmap_line_reader (~600 MB/s reading a file directly) on the same machine. I was stunned when first seeing this performance difference of nearly 40 times!

So, apart from the slight difference in versatility, the first and simplest form of my line readers is also the best on Linux, while the mmap-based version may be a better implementation on OS X—though your mileage may vary depending on the different combinations of OS versions, compilers, and hardware. Should I be happy, or sad?


You can find the implementation of istream_line_reader among my example code for the ‘C++ and Functional Programming’ presentation, and the implementations of file_line_reader and mmap_line_reader in the Nvwa repository. And the test code is as follows:

test_istream_line_reader.cpp:

#include <fstream>
#include <iostream>
#include <string>
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include "istream_line_reader.h"

using namespace std;

int main(int argc, char* argv[])
{
    char optch;
    while ( (optch = getopt(argc, argv, "s")) != EOF) {
        switch (optch) {
        case 's':
            cin.sync_with_stdio(false);
            break;
        }
    }
    if (!(optind == argc || optind == argc - 1)) {
        fprintf(stderr,
                "Only one file name can be specified\n");
        exit(1);
    }

    istream* is = nullptr;
    ifstream ifs;
    if (optind == argc) {
        is = &cin;
    } else {
        ifs.open(argv[optind]);
        if (!ifs) {
            fprintf(stderr,
                    "Cannot open file '%s'\n",
                    argv[optind]);
            exit(1);
        }
        is = &ifs;
    }

    for (auto& line : istream_line_reader(*is)) {
        puts(line.c_str());
    }
}

test_file_line_reader.cpp:

#include <stdio.h>
#include <stdlib.h>
#include <nvwa/file_line_reader.h>

using nvwa::file_line_reader;

int main(int argc, char* argv[])
{
    FILE* fp = stdin;
    if (argc == 2) {
        fp = fopen(argv[1], "r");
        if (!fp) {
            fprintf(stderr,
                    "Cannot open file '%s'\n",
                    argv[1]);
            exit(1);
        }
    }

    file_line_reader
        reader(fp, '\n',
               file_line_reader::no_strip_delimiter);
    for (auto& line : reader) {
        fputs(line, stdout);
    }
}

test_mmap_line_reader.cpp:

#include <stdio.h>
#include <stdlib.h>
#include <stdexcept>
#include <nvwa/mmap_line_reader.h>

using nvwa::mmap_line_reader;

int main(int argc, char* argv[])
{
    if (argc != 2) {
        fprintf(stderr,
                "A file name shall be provided\n");
        exit(1);
    }

    try {
        mmap_line_reader
            reader(argv[1], '\n',
                   mmap_line_reader::no_strip_delimiter);

        for (auto& str : reader) {
            fputs(str.c_str(), stdout);
        }
    }
    catch (std::runtime_error& e) {
        puts(e.what());
    }
}

Upgrading to Boost 1.61 in MacPorts

The Boost version in MacPorts was still 1.59.0—a year old now. When I wrote about Boost.Coroutine2, I found I had to install the latest Boost version 1.61.0. So I had two sets of Boost libraries on my hard drive, which made things . . . er . . . a little bit complicated. After I built Microsoft’s cpprestsdk last night—I managed to make it find and use the MacPorts Boost libraries—I feel more urged to change the situation. So this morning I subscribed to the MacPorts mailing list and posted the question about the outdated version problem. With the help from Mr Michael Dickens and Google, I have a working port of Boost 1.61.0 now. This article will document the procedure how it works.

The first thing one needs to do is check out the port files from the MacPorts Subversion repository. In my case, The boost files are under devel/boost. So I checked out only the boost directory into ~/Programming/MacPorts/devel.

One then needs to tell MacPorts to look for ports in that directory. There are two steps involved:

  1. Add the URL of the local ports directory (e.g. ‘file:///Users/yongwei/Programming/MacPorts’ in my case) to /opt/local/etc/macports/sources.conf, above the default rsync URL.
  2. Run the portindex command under that directory. It needs to be rerun every time a Portfile is changed.

Now MacPorts should find ports first in my local ports directory and then the system default. And I could begin patching the files.

It turned out that people tried to update boost half a year ago for Boost 1.60, but they found there were failing ports and the ABI was incompatible with 1.59. The patch was still good to me, as I had now a good example. I simply applied the patch, ran portindex again, and went ahead to port upgrade boost.

The procedure turned out quite smooth, though mkvtoolnix, the only installed port that depended on boost on my laptop, failed to run after the upgrade. I had to port uninstall it and then port install it again (rebuilding it).

After I had some confidence, I began to change the port files. I changed first Portfile, which contained the version information and file checksums. Updating them was trivial. When I could see the new version 1.61.0 from port info boost, I kicked off the build with port upgrade boost again.

Then came the more painful process of fixing the patch files under devel/boost/files (the ‘patch’ I mentioned a moment ago actually contained patches for these patch files). Most of these MacPorts-specific patch files could be applied without any problems, but one of them failed. It was actually due to trivial code changes in Boost, but I still had to check all the rejections, manually apply the changes, and generate a new patch file. After that, everything went on smoothly.

Against all my hopes, I found that I had to rebuild mkvtoolnix yet again. So the ABI instability is really an issue, and I understand now why boost was stuck at the old version for such a long time. However, I consider my task completed, when I uploaded the updated patch to the MacPorts ticket. At least I have the new working port of boost for myself now. And you can have it too.