Should We Use std::printf in C++?

The very first C++ standard, C++98, introduced headers such as <cstddef>, <cstdio>, and <cstdlib> to replace the corresponding standard C headers like <stddef.h>, <stdio.h>, and <stdlib.h>. The difference between the new headers and the standard C headers is that all names in these new headers, except macros (which have no namespaces), are considered to be declared or defined in the namespace std. Furthermore, names defined in the C headers are considered equivalent to being defined in the namespace std and then placed into the global namespace with using-declarations:

Each C header, whose name has the form name.h, behaves as if each name placed in the Standard library namespace by the corresponding cname header is also placed within the namespace scope of the namespace std and is followed by an explicit using-declaration (7.3.3).

(C++98 depr.c.headers D.5/2)

Do you notice the name of this section is ‘depr.c.headers’? C headers are considered deprecated—generally speaking, this means that one day, C++ code that includes <stdio.h> might not compile!

Perhaps for this reason, some coding standards explicitly prohibit the use of C headers. The MISRA C++ 2008 standard is one such example:

Rule 18-0-1 (Required) The C library shall not be used.

Rationale

Some C++ libraries (e.g. <cstdio>) also have corresponding C versions (e.g. <stdio.h>). This rule requires that the C++ version is used.

In other words, under such coding standards, you should include <cstdio>, and naturally, you should use std::printf instead of printf.

Setting aside for now whether writing code this way has any benefit (we’ll come back to this point later), I have seen a lot of code, written under similar coding standards, that includes headers like <cstdio> but continues to use names like printf. . . .

Wait, according to the C++ standard, shouldn’t this fail to compile? In headers like <cstdio>, shouldn’t printf be declared in the namespace std ?

The C++98 standard expected the following (too optimistically): the contents of the standard library, except for macros, should all be defined or declared in the namespace std, including those coming from C. So, it believed <cstdio> would declare printf in the namespace std, and that <stdio.h> would also put things into std first and then inject them into the global namespace.

Let’s see what actually happened. All major compiler vendors believed <stdio.h> was a C language header, only shared with C++. Therefore, they never implemented it the way C++98 expected. In fact, their approach is just the opposite: <cstdio> includes <stdio.h>, and then uses using-declarations to inject the names that were expected to be defined or declared in the namespace std into that namespace. Like the following:

// cstdio

#include <stdio.h>

namespace std
{
  using ::FILE;
  using ::fpos_t;

  using ::clearerr;
  using ::fclose;
  using ::printf;
  …
}

Therefore, C++ code that includes the <cstdio> header can still use printf under such implementations, although it does not comply with the standard.

By the way, this implementation approach has been completely legal since C++11. For example, the same D.5/2 section of the C++11 standard was as follows:

Every C header, each of which has a name of the form name.h, behaves as if each name placed in the standard library namespace by the corresponding cname header is placed within the global namespace scope. It is unspecified whether these names are first declared or defined within namespace scope (3.3.6) of the namespace std and are then injected into the global namespace scope by explicit using-declarations (7.3.3).

In this case, does it make sense to use <cstdio>? At least, it’s completely impossible not to pollute the global namespace. Does std::printf look better and more concise than printf?

From a practical perspective, I see no benefits at all.

Only disadvantages.

When proficient C programmers migrate to C++, do they need to modify their original C code? This is especially awkward for Unix (including Linux) programmers: if you wanted to write code in the way the C++ standard wanted, you would need to write std::fopen(…), but not std::open(…). Both fopen and open are part of the POSIX specification (i.e. the Unix standard), but only fopen is part of C++.

Should a Unix programmer care whether a POSIX function is included in the C++ standard?—I would consider it a ridiculous requirement.

If you are in doubt, keep in mind that you will never observe std (or its mangled form) for C functions in the object files. std::printf, in essence, is just the C printf.
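A small check can make this concrete. The sketch below assumes an implementation on which <cstdio> and <stdio.h> expose one and the same function under both names, which is the arrangement described above; the helper name same_printf is made up for illustration.

```cpp
#include <cstdio>   // provides std::printf
#include <stdio.h>  // provides ::printf

// Hypothetical helper: reports whether the qualified and the global
// name denote the same function.
inline bool same_printf()
{
    int (*from_std)(const char*, ...) = &std::printf;
    int (*from_global)(const char*, ...) = &::printf;
    return from_std == from_global;
}
```

On the common implementations discussed here, same_printf() is expected to return true, since the name in std is only a using-declaration for the C function.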

If one included <cstdio> but continued to use printf, it would be even worse: the code would be non-compliant and in danger of not compiling one day.


But the C headers have been deprecated. . . . Wouldn’t it be dangerous to keep using them?

In fact, when reviewing the deprecated facilities of the C++ standard, the C++ Standards Committee has at least twice (P0619 and P2139) proposed to change the deprecated status of the C headers. We can clearly see people’s opinion on this issue from the documents:

Finally, it seems clear that the C headers will be retained essentially forever, as a vital compatibility layer with C and POSIX. After two decades of deprecation, not a single compiler provides a deprecation warning on including these headers, and it is difficult to imagine circumstances where they would, as any use of a third-party C library by a C++ project is almost guaranteed to include one of these headers, at least indirectly. Therefore, it seems appropriate to undeprecate the headers, and find a home for them as a compatibility layer in the main standard.

After P2139, there is another proposal specifically to clarify the status of the ‘C headers’. P2340 further writes:

Ever since C++ has been an ISO standard, the C headers have been “deprecated”, which in ISO parlance means “no longer in use”, “discouraged”, or “subject to future removal”. However, this is inappropriate: The C headers are a necessary part of the C++ ecosystem. . . .

In this author’s opinion . . . which seems to resonate with many other committee members, that original decision (which goes back to the first standard, C++98) was never quite perfect in the first place, and many have since come to regard the deprecated status of the headers as a mistake.

I am pleased to see that the C headers are finally undeprecated in C++23. From the draft standard N4950, we can see that the C headers are no longer part of ‘Annex D Compatibility features’, but appear in ‘17 Language support library’. We no longer need to worry about the legitimacy of using <stdio.h> (but the C++ standard still discourages the use of C headers, about which I have a different opinion).


Does this mean we should always prefer <name.h> to <cname>? No, I won’t go that far.

My personal opinion is that this is a style issue, and the style should depend on the context. It should depend on which way is more natural, and which way makes the code more readable. Does your C++ code purely use C++ facilities, or does it use ‘C facilities’? In my view, printf is a C facility, while size_t is much less C-specific (code that doesn’t call any C functions at all still needs to use std::size_t quite often). So, I would use <stdio.h> and printf, but I would also, very often, use <cstddef> and std::size_t.

Of course, there are always contexts in which neither way seems obviously better. For example, is uint8_t considered C or C++? I am inclined to think that if the code uses a lot of C functions, then it might as well use <stdint.h> and uint8_t. This is simpler and more concise, if nothing else.

Another point to note is that there are things defined in headers like <cname> but completely absent in C. The typical cases I can think of immediately are:

  • std::byte is defined in the <cstddef> header. To use std::byte, you must include <cstddef>.
  • The C abs function only accepts int arguments, while the C++ std::abs has overloads that can accept arguments of other types like long or double. If you want to use those overloads, you absolutely need to use std::abs and <cstdlib>/<cmath>, instead of abs and <stdlib.h>.
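The second point can be illustrated with a minimal sketch. The two magnitude helpers are hypothetical names; the only thing they rely on is that <cstdlib> and <cmath> together declare std::abs overloads for integer and floating-point types.

```cpp
#include <cmath>    // std::abs for floating-point types
#include <cstdlib>  // std::abs for integer types

// Hypothetical helpers: overload resolution picks the std::abs overload
// that matches the argument type, so no value is narrowed to int.
inline double magnitude(double x) { return std::abs(x); }
inline long magnitude(long x) { return std::abs(x); }
```

With these definitions, magnitude(-2.5) keeps its fractional part, and magnitude(-7L) stays a long.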

In fact, I’ve been using the <name.h> in C++ for many years, but in recent years I’ve also been using <cstddef> happily when writing pure C++ code. This article is partially a justification of my intuition and style.

In any case, coding rules that unconditionally require avoiding <stdio.h> should be abandoned: modifications incurred by such rules do not bring practical value in most cases, and are quite likely to introduce problems that make the code fail to comply with the C++ standard.

C++ Exceptions and Memory Allocation Failure

Background

C++ exceptions are habitually disabled in many software projects. A related issue is that these projects also encourage the use of new (nothrow) instead of the more common new, as the latter may throw an exception. This choice is kind of self-deceptive, as people don’t usually disable completely all mechanisms that potentially throw exceptions, such as standard library containers and string. In fact, every time we initialize or modify a string, vector, or map, we may be allocating memory on the heap. If we think that new will end in an exception (and therefore choose to use new (nothrow)), exceptions may also occur when using these mechanisms. In a program that has disabled exceptions, the result will inevitably be a program crash.
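The following sketch shows why the new (nothrow) habit protects less than it appears to. Message and make_message are hypothetical names, and the example assumes an ordinary std::string that takes its storage from the throwing operator new.

```cpp
#include <new>
#include <string>

struct Message {
    std::string text;
};

// Hypothetical factory in the new (nothrow) style: the explicit allocation
// is checked, but the string member still allocates with the throwing
// operator new behind the scenes.
inline Message* make_message(const char* text)
{
    Message* msg = new (std::nothrow) Message;  // nullptr on failure
    if (msg == nullptr) {
        return nullptr;
    }
    msg->text = text;  // may allocate; a failure here throws std::bad_alloc
    return msg;
}
```

The null check covers only the first allocation. If the assignment to text runs out of memory, the program still ends up with an exception, which becomes a crash when exceptions are disabled.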

However, it seems that the crashes I described are unlikely to occur. . . . When was the last time you saw a memory allocation failure? Before I tested to check this issue, the last time I saw a memory allocation failure was when there was a memory corruption in a program: there was still plenty of memory in the system, but the memory manager of the program could no longer work reliably. In this case, there was already undefined behaviour, and checking for memory allocation failure ceased to make sense. A crash was already inevitable, and it was a good thing if the crash occurred earlier, whether due to an uncaught exception, a null pointer dereference, or something else.

Now the question is: If there is no undefined behaviour in the program, will memory allocation ever fail? This seems worth exploring.

Test of memory allocation failure

Due to the limitation of address space, there is an obvious upper limit to the amount of memory that one process can allocate. On a 32-bit system, this limit is 2^32 bytes, or 4 GiB. However, on typical 64-bit systems like x64, this limit is not 2^64 bytes, but 2^48 bytes instead, or 256 TiB.
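The arithmetic behind these limits can be checked at compile time. This is only a sketch; the names GiB, TiB, and address_space_bytes are introduced here for illustration.

```cpp
#include <cstdint>

constexpr std::uint64_t GiB = std::uint64_t{1} << 30;
constexpr std::uint64_t TiB = std::uint64_t{1} << 40;

// Size in bytes of an address space with the given number of usable
// address bits (must be less than 64).
constexpr std::uint64_t address_space_bytes(unsigned bits)
{
    return std::uint64_t{1} << bits;
}

static_assert(address_space_bytes(32) == 4 * GiB);    // 32-bit systems
static_assert(address_space_bytes(48) == 256 * TiB);  // typical x64
```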

On a 32-bit system, when a process’s memory usage reaches around 2 GiB (it may vary depending on the specific system, but will not exceed 4 GiB), memory allocation is guaranteed to fail. The physical memory of a real-world 32-bit system can often reach 4 GiB, so we do expect to see memory allocation failures.

The more interesting question is: What happens when the amount of physical memory is far less than the address space size? Ignoring abnormal scenarios like allocating more memory than the physical memory size at a time (which would likely be a logic error in the program), can a reasonable program still experience memory allocation failures?

The core logic of my test code is shown below:

try {
  std::size_t total_alloc = 0;
  for (;;) {
    char* ptr = new char[chunk_size];
    if (zero_mem) {
      memset(ptr, 0, chunk_size);
    }
    total_alloc += chunk_size;
    std::cout << "Allocated "
              << (zero_mem ? "and initialized "
                           : "")
              << total_alloc << " B\n";
  }
}
catch (std::bad_alloc&) {
  std::cout << "Successfully caught bad_alloc "
               "exception\n";
}

I.e. the program allocates memory repeatedly—optionally zeroing the allocated memory—until it catches a bad_alloc exception.

The test shows that Windows and Linux exhibit significantly different behaviour in this regard. These two are the major platforms concerned, and macOS behaves similarly to Linux.

Windows

I conducted the test on Windows 10 (x64). According to the Microsoft documentation, the total amount of memory that an application can allocate is determined by the size of RAM and that of the page file. When managed by the OS, the maximum size of the page file is three times the size of the memory, and cannot exceed one-eighth the size of the volume where the page file resides. This total memory limit is shared by all applications.

The program’s output is shown below (allocating 1 GiB at a time on a test machine with 6 GiB of RAM):

Allocated 1 GiB
Allocated 2 GiB
Allocated 3 GiB
…
Allocated 14 GiB
Allocated 15 GiB
Successfully caught bad_alloc exception
Press ENTER to quit

The outputs are the same, regardless of whether the memory is zeroed or not, but zeroing the memory makes the program run much slower. You can observe in the Task Manager that the memory actually used by the program is smaller than the amount of allocated memory, even when the memory is zeroed; and that when the amount of allocated (and zeroed) memory gets close to that of available memory, the program’s execution is further slowed down, and disk I/O increases significantly—Windows starts paging in order to satisfy the program’s memory needs.

As I mentioned a moment ago, there is an overall memory limit shared by all applications. If a program encounters a memory allocation failure, other programs will immediately experience memory issues too, until the former one exits. After running the program above, if I don’t press the Enter key to quit, the results of newly opened programs are as follows (even if the physical memory usage remains low):

Successfully caught bad_alloc exception
Press ENTER to quit

Assuming that a program does not allocate a large amount of memory and only uses a small portion (so we exclude some special types of applications, which will be briefly discussed later), when it catches a memory allocation failure, the total memory allocated will be about 4 times the physical memory, and the system should have already slowed down significantly due to frequent paging. In other words, even if the program can continue to run normally, the user experience has already been pretty poor.

Linux

I conducted the test on Ubuntu Linux 22.04 LTS (x64), and the result was quite different from Windows. If I do not zero the memory, memory allocation will only fail when the total allocated memory gets near 128 TiB. The output below is from a run which allocates 4 GiB at a time:

Allocated 4 GiB
Allocated 8 GiB
Allocated 12 GiB
…
Allocated 127.988 TiB
Allocated 127.992 TiB
Successfully caught bad_alloc exception
Press ENTER to quit 

In other words, the program can catch the bad_alloc exception only when it runs out of memory address space. . . .

Another thing different from Windows is that other programs are not affected if memory is allocated but not used (zeroed). A second copy of the test program still gets close to 128 TiB happily.

Of course, we get very different results if we really use the memory. When the allocated memory exceeds the available memory (physical memory plus the swap partition), the program is killed by the Linux OOM killer (out-of-memory killer). An example run is shown below (on a test machine with 3 GiB memory, allocating 1 GiB at a time):

Allocated and initialized 1 GiB
Allocated and initialized 2 GiB
Allocated and initialized 3 GiB
Allocated and initialized 4 GiB
Allocated and initialized 5 GiB
Allocated and initialized 6 GiB
Killed

The program had successfully allocated and used 6 GiB memory, and was killed by the OS when it was initializing the 7th chunk of memory. In a typical 64-bit Linux environment, memory allocation will never fail unless you request an apparently unreasonable size (possible only for new Obj[size] or operator new(size), but not new Obj). You cannot catch the memory allocation failure.

Modify the overcommit_memory setting?

We can modify the overcommit_memory setting, you probably have shouted out. What I described above was the default Linux behaviour, when /proc/sys/vm/overcommit_memory was set to 0 (heuristic overcommit handling). If its value is set to 1 (always overcommit), memory allocation will always succeed, as long as there is enough virtual memory address space: you can successfully allocate 32 TiB memory on a machine with only 32 GiB memory—this can actually be useful for applications like sparse matrix computations. Yet another possible value is 2 (don’t overcommit), which allows the user to fine-tune the amount of allocatable memory, usually with the help of /proc/sys/vm/overcommit_ratio.

In the don’t-overcommit mode, the default overcommit ratio (a confusing name) is 50 (%), a quite conservative value. It means the total address space commit for the system is not allowed to exceed swap + 50% of physical RAM. In a general-purpose Linux system, especially in the GUI environment, this mode is unsuitable, as it can cause applications to fail unexpectedly. However, for other systems (like embedded ones) it might be the appropriate mode to use, ensuring that applications can really catch the memory allocation failures and that there is little (or no) thrashing.
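As a rough illustration of the formula just described, the commit limit can be computed as below. This is a simplified sketch: commit_limit is a hypothetical helper, and it assumes that huge pages and any other kernel adjustments can be ignored.

```cpp
#include <cstdint>

constexpr std::uint64_t GiB = std::uint64_t{1} << 30;

// Simplified commit limit in don't-overcommit mode:
// swap plus ratio_percent of physical RAM.
// Assumption: huge pages and other kernel adjustments are ignored, and
// ram_bytes * ratio_percent does not overflow 64 bits.
constexpr std::uint64_t commit_limit(std::uint64_t ram_bytes,
                                     std::uint64_t swap_bytes,
                                     unsigned ratio_percent)
{
    return swap_bytes + ram_bytes * ratio_percent / 100;
}

// 8 GiB of RAM, 2 GiB of swap, default ratio of 50.
static_assert(commit_limit(8 * GiB, 2 * GiB, 50) == 6 * GiB);
```

Under these assumptions, a machine with 8 GiB of RAM and 2 GiB of swap would let all processes together commit only 6 GiB, which shows how conservative the default is.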

(Before you ask, no, you cannot, in general, change the overcommit setting in your code. It is global, not per process; and it requires the root privilege.)

Summary of memory allocation failure behaviour

Looking at the test results above, we can see that normal memory allocations will not fail on general Linux systems, but may fail on Windows or special-purpose Linux systems that have turned off overcommitting.

Strategies for memory allocation failure

We can classify systems into two categories:

  • Those on which memory allocation will not fail
  • Those on which memory allocation can fail

The strategy for the former category is simple: we can simply ignore all memory allocation failures. If such a failure does occur, it must be due to a logic error or even undefined behaviour in the code. In such a system, you cannot encounter a memory allocation failure unless the requested size is invalid (or when the memory is already corrupt). I assume you must have checked that size is valid for expressions like new Obj[size] or malloc(size), haven’t you?
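A size check of this kind could look like the sketch below. Both the 1 GiB cap and the names max_alloc_bytes and is_valid_alloc are assumptions made for illustration; a real project would pick a limit that fits its own data.

```cpp
#include <cstddef>

// Hypothetical application-wide cap on a single allocation: 1 GiB.
constexpr std::size_t max_alloc_bytes = std::size_t{1} << 30;

// True if count elements of elem_size bytes fit under the cap.
// Dividing instead of multiplying avoids overflow in count * elem_size.
constexpr bool is_valid_alloc(std::size_t count, std::size_t elem_size)
{
    return elem_size != 0 && count <= max_alloc_bytes / elem_size;
}

static_assert(is_valid_alloc(1024, sizeof(int)));
static_assert(!is_valid_alloc(static_cast<std::size_t>(-1), sizeof(int)));
```

Calling such a check before new Obj[size] turns an unreasonable request into an ordinary, expected error instead of an allocation failure.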

The strategy for the latter category is much more complicated. Depending on the requirements, we can have different solutions:

  1. Use new (nothrow), do not use the C++ standard library, and disable exceptions. If we turned off exceptions, we would not be able to express the failure to establish invariants in constructors or other places where we cannot return an error code. We would have to resort to the so-called ‘two-phase construction’ and other techniques, which would make the code more complicated and harder to maintain. However, I need to emphasize that notwithstanding all these shortcomings, this solution is self-consistent (it trades trouble for robustness), though I am not inclined to work on such projects.
  2. Use new (nothrow), use the C++ standard library, and disable exceptions. This is a troublesome and self-deceiving approach. It brings about troubles but no safety. If memory is really insufficient, your program can still blow up.
  3. Plan memory use carefully, use new, use the C++ standard library, and disable exceptions; in addition, set up recovery/restart mechanisms for long-running processes. This might be appropriate for special-purpose Linux devices, especially when there is already a lot of code that is not exception-safe. The basic assumption of this scenario is that memory should be sufficient, but the system should still have reasonable behaviour when memory allocation fails.
  4. Use new, use the C++ standard library, and enable exceptions. When the bad_alloc exception does happen, we can catch it and deal with the situation appropriately. When serving external requests, we can wrap the entire service code with try ... catch, and perform rollback actions and error logging when an exception (not just bad_alloc) occurs. This may not be the easiest solution, as it requires that the developers know how to write exception-safe code. But neither is it very difficult, if RAII is already properly used in the code and there are not many raw owning pointers. In fact, refactoring old code with RAII (including smart pointers) can be beneficial per se, even without considering whether we want exception safety or not.
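The try ... catch wrapping mentioned in the last solution might be sketched as follows. The status enumeration and the run_guarded function are hypothetical; a real service would add its own rollback and logging in the handlers.

```cpp
#include <exception>
#include <new>

enum class status { ok, out_of_memory, failed };

// Hypothetical request boundary: runs one unit of work and converts any
// escaping exception into a status code.
template <typename Work>
status run_guarded(Work&& work) noexcept
{
    try {
        work();
        return status::ok;
    }
    catch (const std::bad_alloc&) {
        // Roll back and log here; memory is tight, so keep this path lean.
        return status::out_of_memory;
    }
    catch (const std::exception&) {
        return status::failed;
    }
    catch (...) {
        return status::failed;
    }
}
```

The code inside work can use the standard library freely; only this outer boundary needs to know about exceptions.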

Somebody may think: Can we modify the C++ standard library so that it does not throw exceptions? Let us have a look at what a standard library that does not throw exceptions might look like.

Standard library that does not throw?

If we do not use exceptions, we still need a way to express errors. Traditionally we use error codes, but they have the huge problem that no universal scheme exists: errno encodes errors in its way, your system has your way, and yet a third-party library may have its own way. When you put all things together, you may find that the only thing in common is that 0 means successful. . . .

Assuming that you have solved the problem after tremendous efforts (making all subsystems use a single set of error codes, or adopting something like std::error_code), you will still face the big question of when to check for errors. Programmers who are used to the standard library behaviour may not realize that using the following vector is no longer safe:

my::vector<int> v{1, 2, 3, 4, 5};

The constructor of vector may allocate memory, which may fail but it cannot report the error. So you must check for its validity when using v. Something like:

if (auto e = v.error_status();
    e != my::no_error) {
  return e;
}
use(v);

OK. . . . At least a function can use an object passed in by reference from its caller, right?

my::error_t process(const my::string& msg)
{
  use(msg);
  …
}

Naïve! If my::string behaves similarly to std::string and supports implicit construction from a string literal—i.e. people can call this function with process("hello world!")—the constructor of the temporary string object may fail. If we really intend to have complete safety (like in Solution 1 above), we need to write:

my::error_t process(const my::string& msg)
{
  if (auto e = msg.error_status();
      e != my::no_error) {
    return e;
  }
  use(msg);
  …
}

And we cannot use overloaded operators if they may fail. vector::operator[] returns a reference, and it is still OK. map::operator[] may create new map nodes, and can cause problems. Code like the following needs to be rewritten:

class manager {
public:
  void process(int idx, const std::string& msg)
  {
    store_[idx].push_back(msg);
  }

private:
  std::map<int, std::vector<std::string>> store_;
};

The very simple manager::process would become many lines in its exception-less and safe version:

class manager {
public:
  my::error_t process(int idx,
                  const my::string& msg)
  {
    if (auto e = msg.error_status();
        e != my::no_error) {
      return e;
    }
    auto* ptr =
      store_.find_or_insert_default(idx);
    if (auto e = store_.error_status();
        e != my::no_error) {
      return e;
    }
    ptr->push_back(msg);
    return ptr->error_status();
  }
  …

private:
  my::map<int, my::vector<my::string>> store_;
};

Ignoring how verbose it is, writing such code correctly seems more complicated than making one’s code exception-safe, right? It is not an easy thing just to remember which APIs will always succeed and which APIs may return errors.

And obviously you can see that such code would be in no way compatible with the current C++ standard library. The code that uses the current standard library would need to be rewritten, third-party code that uses the standard library could not be used directly, and developers would need to be re-trained (if they did not flee).

Recommended strategy

I would like to emphasize first that how to deal with memory allocation failure is part of the system design, and it should not be just the decision of some local code. This is especially true if the ‘failure’ is unlikely to happen and the cost of ‘prevention’ is too high. (For similar reasons, we do not have checkpoints at the front of each building. Safety is important only when the harm can be higher than the prevention cost.)

Returning to the four solutions I discussed earlier, my order of recommendations is 4, 3, 1, and 2.

  • Solution 4 allows the use of exceptions so that we can catch bad_alloc and other exceptions while using the standard library (or other code). You don’t have to make your code 100% bullet-proof right in the beginning. Instead, you can first enable exceptions and deal with exceptions in some outside constructs, without actually throwing anything in your code. When memory allocation failure happens, you can at least catch it, save critical data, print diagnostics or log something, and quit gracefully (a service probably needs to have some restart mechanism external to itself). In addition, exceptions are friendly to testing and debugging. We should also remember that error codes and exceptions are not mutually exclusive: even in a system where exceptions are enabled, exceptions should only be used for exceptional scenarios. Expected errors, like an unfound file in the specified path or an invalid user input, should not normally be dealt with as exceptions.
  • Solution 3 does not use exceptions, while recognizing that memory failure handling is part of the system design, not deserving local handling anywhere in the code. For a single-run command, crashing on insufficient memory may not be a bad choice (of course, good diagnostics would be better, but then we would need to go to Solution 4). For a long-running service, fast recovery/restart must be taken into account. This is the second best to me.
  • Solution 1 does not use exceptions and rejects all overhead related to exception handling, time- or space-wise. It considers that safety is foremost and is worth extra labour. If your project requires such safety, you need to consider this approach. In fact, it may be the only reasonable approach for real-time control systems (aviation, driving, etc.), as typical C++ implementations have a high penalty when an exception is really thrown.
  • Solution 2 is the worst, neither convenient nor safe. Unfortunately, it seems quite popular due to historical momentum, with its users unaware how bad it is. . . .

Keep in mind that C++ is not C: the C-style check-and-return can look much worse in C++ than in C. This is because C++ code tends to use dynamic memory more often, which is arguably a good thing—it makes C++ code safer and more flexible. Although fixed-size buffers (common in C) are fast, they are inflexible and susceptible to buffer overflows.

Actually, the main reason I wanted to write this article was to point out the problems of Solution 2 and to describe the alternatives. We should not follow existing practices blindly, but make rational choices based on requirements and facts.

Test code

The complete code for testing the memory failure behaviour is available below:

You can clearly see that I am quite happy with exceptions. 😉

Compile-Time Strings

I have encountered many compile-time uses of strings in my projects in the past few years. I would like to summarize my experience today.

Choice of Types

std::string is mostly unsuitable for compile-time string manipulations. There are several reasons:

  • Before C++20 one cannot use strings at all at compile time. In addition, the support for compile-time strings came quite late among the major compilers. MSVC was the front runner in this regard, GCC came second with GCC 12 (released a short while ago), and Clang has not yet had a formal release with compile-time string support.
  • With C++20 one can use strings at compile time, but there are still a lot of inconveniences, the most obvious being that strings generated at compile time cannot be used at run time. Besides, a string cannot be declared constexpr.
  • A string cannot be used as a template argument.

So we have to give up this apparent choice, but explore other possibilities. The candidates are:

  • const char pointer, which is what a string literal can naturally decay to
  • string_view, a powerful tool added by C++17: it has similar member functions to those of string, but they are mostly marked as constexpr!
  • array, with which we can generate brand-new strings

We will try these types in the following discussion.

Functions Commonly Needed

Getting the String Length

One of the most basic functions on a string is getting its length. Here we cannot use the C function strlen, as it is not constexpr.

We will try several different ways to implement it.

First, we can implement strlen manually, and mark the function constexpr:

namespace strtools {

constexpr size_t length(const char* str)
{
    size_t count = 0;
    while (*str != '\0') {
        ++str;
        ++count;
    }
    return count;
}

} // namespace strtools

However, is there an existing mechanism to retrieve the length of a string in the standard library? The answer is a definite Yes. The standard library does support getting the length of a string of any of the standard character types, like char, wchar_t, etc. With the most common character type char, we can write:

constexpr size_t length(const char* str)
{
    return char_traits<char>::length(str);
}

Starting with C++17, the methods of char_traits can be used at compile time. (However, you may encounter problems with older compiler versions, like GCC 8.)

Assuming you can use C++17, string_view is definitely worth a try:

constexpr size_t length(string_view sv)
{
    return sv.size();
}

Regardless of the approach used, now we can use the following code to verify that we can indeed check the length of a string at compile time:

static_assert(strtools::length("Hi") == 2);

At present, the string_view implementation seems the most convenient.

Finding a Character

Finding a specific character is also quite often needed. We can’t use strchr, but again, we can choose from a few different implementations. The code is pretty simple, whether implemented with char_traits or with string_view.

Here is the version with char_traits:

constexpr const char* find(const char* str, char ch)
{
    return char_traits<char>::find(str, length(str),
                                   ch);
}

Here is the version with string_view:

constexpr string_view::size_type find(string_view sv,
                                      char ch)
{
    return sv.find(ch);
}

I am not going to show the manual lookup code this time. (Unless you have to use an old compiler, simpler is better.)

Comparing Strings

The next functions are string comparisons. Here string_view wins hands down: string_view supports the standard comparisons directly, and you do not need to write any code.
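For completeness, here is a small sketch showing the comparisons at work in constant expressions. The wrappers equal and less are hypothetical and exist only to give the checks a name; the operators themselves come from <string_view>.

```cpp
#include <string_view>

namespace strtools {

// Hypothetical wrappers: the string_view operators are already constexpr
// in C++17, so the functions only forward to them.
constexpr bool equal(std::string_view lhs, std::string_view rhs)
{
    return lhs == rhs;
}

constexpr bool less(std::string_view lhs, std::string_view rhs)
{
    return lhs < rhs;
}

} // namespace strtools

static_assert(strtools::equal("Hi", "Hi"));
static_assert(strtools::less("apple", "banana"));
static_assert(!strtools::less("banana", "apple"));
```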

Getting Substrings

It seems that string_views are very convenient, and we should use string_views wherever possible. However, is string_view::substr enough for getting substrings? This is difficult to answer without an actual usage scenario. One real scenario I encountered in projects was that the __FILE__ macro may contain the full path at compile time, resulting in different binaries when compiling under different paths. We wanted to truncate the path completely so that the absolute paths would not show up in binaries.

My tests showed that string_view::substr could not handle this job. With the following code:

puts("/usr/local"sv.substr(5).data());

We will see assembly output like the following from the compiler (see https://godbolt.org/z/1dssd96vz):

.LC0:
        .string "/usr/local"
        …
        mov     edi, OFFSET FLAT:.LC0+5
        call    puts

We have to find another way. . . .

Let’s try array. It’s easy to think of code like the following:

constexpr auto substr(string_view sv, size_t offset,
                      size_t count)
{
    array<char, count + 1> result{};
    copy_n(&sv[offset], count, result.data());
    return result;
}

The intention of the code should be very clear: generate a brand-new character array of the requested size and zero it out (before C++20, variables in constexpr functions must be initialized on declaration); copy what we need; and then return the result. Unfortunately, the code won’t compile. . . .

There are two problems in the code:

  • Function parameters are not constexpr, and cannot be used as template arguments.
  • copy_n is not constexpr before C++20, and cannot be used in compile-time programming.

The second problem is easy to fix: a manual loop will do. We shall focus on the first problem.

A constexpr function can be evaluated at compile time or at run time, so its function arguments are not treated as compile-time constants, and cannot be used in places where compile-time constants are required, such as template arguments.

Furthermore, this problem still exists with the C++20 consteval function, where the function is only invoked at compile time. The main issue is that if we allow function parameters to be used as compile-time constants, then we can write a function where its arguments of different values (same type) can produce return values of different types. For example (currently illegal):

consteval auto make_constant(int n)
{
    return integral_constant<int, n>{};
}

This is unacceptable in the current type system: we still require that the return values of a function have a unique type. If we want a value to be used as a template argument inside a function, it must be passed to the function template as a template argument (rather than as a function argument to a non-template function). In this case, each distinct template argument implies a different template specialization, so the issue of a multiple-return-type function does not occur.
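
For comparison, here is a sketch of the legal form, where the value is passed as a template argument. I reuse the name make_constant purely for illustration:

```cpp
#include <type_traits>

// The value is a template argument, so each value gets its
// own specialization with its own (unique) return type
template <int N>
constexpr auto make_constant()
{
    return std::integral_constant<int, N>{};
}

static_assert(
    std::is_same_v<decltype(make_constant<2>()),
                   std::integral_constant<int, 2>>);
static_assert(
    !std::is_same_v<decltype(make_constant<2>()),
                    decltype(make_constant<3>())>);
```

make_constant<2> and make_constant<3> are two different functions here, so there is no conflict in their having different return types.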

By the way, a standard proposal P1045 tried to solve this problem, but its progress seems stalled. As there are workarounds (to be discussed below), we are still able to achieve the desired effect.

Let’s now return to the substr function and convert the count parameter into a template parameter. Here is the result:

template <size_t Count>
constexpr auto substr(string_view sv, size_t offset = 0)
{
    array<char, Count + 1> result{};
    for (size_t i = 0; i < Count; ++i) {
        result[i] = sv[offset + i];
    }
    return result;
}

The code really works this time. With ‘puts(substr<5>("/usr/local", 5).data())’, we no longer see "/usr/" in the compiler output.


Regretfully, we now see how compilers are challenged with abstractions: With the latest versions of GCC (12.1) and MSVC (19.32) on Godbolt, this version of substr does not generate the optimal output. There are also some compatibility issues with older compiler versions. So, purely from a practical point of view, I recommend the following implementation that does not use string_view:

template <size_t Count>
constexpr auto substr(const char* str,
                      size_t offset = 0)
{
    array<char, Count + 1> result{};
    for (size_t i = 0; i < Count; ++i) {
        result[i] = str[offset + i];
    }
    return result;
}

If you are interested, you can compare the assembly outputs of these two different versions of the code. Only Clang is able to generate the same efficient assembly code with both versions:

        mov     word ptr [rsp + 4], 108
        mov     dword ptr [rsp], 1633906540
        mov     rdi, rsp
        call    puts

If you don’t understand why there are the numbers 108 and 1633906540, let me remind you that the hexadecimal representations of these two numbers are 0x6C and 0x61636F6C, respectively. Check the ASCII table and you should be able to understand.
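
If you prefer to let the compiler do the checking, the decoding can be sketched as follows. It assumes an ASCII execution character set and a little-endian target (as x86-64 is); byte_at is a hypothetical helper of mine:

```cpp
#include <cstdint>

// Extracts the byte that a little-endian machine stores at
// the given offset from the start of the value
constexpr char byte_at(std::uint32_t value, unsigned index)
{
    return static_cast<char>((value >> (8 * index)) & 0xFF);
}

static_assert(1633906540 == 0x61636F6C);
static_assert(byte_at(1633906540, 0) == 'l');  // 0x6C
static_assert(byte_at(1633906540, 1) == 'o');  // 0x6F
static_assert(byte_at(1633906540, 2) == 'c');  // 0x63
static_assert(byte_at(1633906540, 3) == 'a');  // 0x61

// The 16-bit store of 108 supplies the last character and
// the terminating null
static_assert(byte_at(108, 0) == 'l');         // 0x6C
static_assert(byte_at(108, 1) == '\0');
```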


Since we stopped using string_view in the function parameters, the parameter offset becomes much less useful. Hence, I will get rid of this parameter, and rename the function to copy_str:

template <size_t Count>
constexpr auto copy_str(const char* str)
{
    array<char, Count + 1> result{};
    for (size_t i = 0; i < Count; ++i) {
        result[i] = str[i];
    }
    return result;
}
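
A short usage sketch follows, assuming C++17. The function is repeated so that the example is self-contained, and the variable names are mine:

```cpp
#include <array>
#include <cstddef>
#include <string_view>

template <std::size_t Count>
constexpr auto copy_str(const char* str)
{
    std::array<char, Count + 1> result{};
    for (std::size_t i = 0; i < Count; ++i) {
        result[i] = str[i];
    }
    return result;
}

constexpr const char* full_path = "/usr/local";

// Copy the five characters that follow "/usr/"; the offset
// is now applied by the caller
constexpr auto local = copy_str<5>(full_path + 5);

// Five characters plus the terminating null
static_assert(local.size() == 6);
static_assert(local[5] == '\0');
static_assert(std::string_view{local.data()} == "local");
```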

Passing Arguments at Compile Time

When you try composing the compile-time functions together, you will find something lacking. For example, if you wanted to remove the first segment of a path automatically (like from "/usr/local" to "local"), you might try some code like the following:

constexpr auto remove_head(const char* path)
{
    if (*path == '/') {
        ++path;
    }
    auto start = find(path, '/');
    if (start == nullptr) {
        return copy_str<length(path)>(path);
    } else {
        return copy_str<length(start + 1)>(start + 1);
    }
}

The problem is still that it won’t compile. And did you notice that this code violates exactly the constraint I mentioned above that the return type of a function must be consistent and unique?

I have adopted a solution described by Michael Park: using lambda expressions to encapsulate ‘compile-time arguments’. I have defined three macros for convenience and readability:

#define CARG typename
#define CARG_WRAP(x) [] { return (x); }
#define CARG_UNWRAP(x) (x)()

‘CARG’ means ‘constexpr argument’, a compile-time constant argument. We can now make make_constant really work:

template <CARG Int>
constexpr auto make_constant(Int cn)
{
    constexpr int n = CARG_UNWRAP(cn);
    return integral_constant<int, n>{};
}

And it is easy to verify that it works:

auto result = make_constant(CARG_WRAP(2));
static_assert(std::is_same_v<integral_constant<int, 2>,
                             decltype(result)>);

A few explanations follow. In the template parameter, I use CARG (instead of typename) for code readability: it indicates the intention that the template parameter is essentially a type wrapper for compile-time constants. Int is the name of this special type. We will not provide this type when instantiating the function template, but instead let the compiler deduce it. When calling the ‘function’ (make_constant(CARG_WRAP(2))), we provide a lambda expression ([] { return (2); }), which encapsulates the constant we need. When we need to use this parameter, we use CARG_UNWRAP (evaluate: [] { return (2); }()) to get the constant back.

Now we can rewrite the remove_head function:

template <CARG Str>
constexpr auto remove_head(Str cpath)
{
    constexpr auto path = CARG_UNWRAP(cpath);
    constexpr int skip = (*path == '/') ? 1 : 0;
    constexpr auto pos = path + skip;
    constexpr auto start = find(pos, '/');
    if constexpr (start == nullptr) {
        return copy_str<length(pos)>(pos);
    } else {
        return copy_str<length(start + 1)>(start + 1);
    }
}

This function is similar in structure to the previous version, but there are many detail changes. In order to pass the result to copy_str as a template argument, we have to use constexpr all the way along. So we have to give up mutability, and write code in a quite functional style.

Does it really work? Let’s put the following statement into the main function:

puts(strtools::remove_head(CARG_WRAP("/usr/local"))
         .data());

And here is the optimized assembly output from GCC on x86-64 (see https://godbolt.org/z/Mv5YanPvq):

main:
        sub     rsp, 24
        mov     eax, DWORD PTR .LC0[rip]
        lea     rdi, [rsp+8]
        mov     DWORD PTR [rsp+8], eax
        mov     eax, 108
        mov     WORD PTR [rsp+12], ax
        call    puts
        xor     eax, eax
        add     rsp, 24
        ret
.LC0:
        .byte   108
        .byte   111
        .byte   99
        .byte   97

As you can see clearly, the compiler will put the ASCII codes for "local" on the stack, assign its starting address to the rdi register, and then call the puts function. There is absolutely no trace of "/usr/" in the output. In fact, there is no difference between the output of the puts statement above and that of ‘puts(substr<5>("/usr/local", 5).data())’.

I would like to remind you that it is safe to pass and store the character array, but it is not safe to store the pointer obtained from its data() method. It is possible to use such a pointer immediately in calling other functions (like puts above), as the lifetime of array will extend till the current statement finishes execution. However, if you saved this pointer, it would become dangling after the current statement, and dereferencing it would then be undefined behaviour.
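
The difference can be sketched as follows. The copy_str function is repeated as a stand-in for any function returning an array by value, and safe_uses is a hypothetical name of mine:

```cpp
#include <array>
#include <cstddef>
#include <cstring>

template <std::size_t Count>
constexpr auto copy_str(const char* str)
{
    std::array<char, Count + 1> result{};
    for (std::size_t i = 0; i < Count; ++i) {
        result[i] = str[i];
    }
    return result;
}

std::size_t safe_uses()
{
    // OK: the temporary array lives until the end of the
    // full-expression, so strlen reads valid memory
    std::size_t n1 =
        std::strlen(copy_str<5>("local").data());

    // OK: store the array itself, and take the pointer
    // only when it is needed
    auto arr = copy_str<5>("local");
    std::size_t n2 = std::strlen(arr.data());

    // NOT OK (left as a comment on purpose): the temporary
    // array is destroyed at the semicolon, so p would
    // dangle immediately
    // const char* p = copy_str<5>("local").data();

    return n1 + n2;
}
```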

String Template Parameters

We have tried turning strings into types (via lambda expressions) for compile-time argument passing, but unlike integers and integral_constants, there is no one-to-one correspondence between the two. This is often inconvenient: for two integral_constants, we can directly use is_same to determine whether they are the same; for strings represented as lambda expressions, we cannot do the same—two lambda expressions always have different types.

Direct use of string literals as non-type template arguments is not allowed in C++, because strings may appear repeatedly in different translation units, and they do not have proper comparison semantics—comparing two strings is just a comparison of two pointers, which cannot achieve what users generally expect. To use string literals as template arguments, we need to find a way to pass the string as a sequence of characters to the template. We have two methods available:

  • The non-standard GNU extension used by GCC and Clang (which can be used prior to C++20)
  • The C++20 approach suitable for any conformant compiler (including GCC and Clang)

Let’s have a look one by one.

The GNU Extension

GCC and Clang have implemented the standard proposal N3599, which allows us to use strings as template arguments. The compiler will expand the string into characters, and the rest is standard C++.

Here is an example:

template <char... Cs>
struct compile_time_string {
    static constexpr char value[]{Cs..., '\0'};
};

template <typename T, T... Cs>
constexpr compile_time_string<Cs...> operator""_cts()
{
    return {};
}

The definition of the class template is standard C++, so that compile_time_string<'H', 'i'> is a valid type and, at the same time, by taking the value member of this type, we can get "Hi". The GNU extension is the string literal operator template: we can now write ‘"Hi"_cts’ to get an object of type compile_time_string<'H', 'i'>. The following code will compile with the above definitions:

constexpr auto a = "Hi"_cts;
constexpr auto b = "Hi"_cts;
static_assert(is_same_v<decltype(a), decltype(b)>);

The C++20 Approach

Though the above method is simple and effective, it failed to reach consensus in the C++ standards committee and did not become part of the standard. However, with C++20, we can use more types in non-type template parameters. In particular, user-defined literal types are amongst them. Here is an example:

template <size_t N>
struct compile_time_string {
    constexpr compile_time_string(const char (&str)[N])
    {
        copy_n(str, N, value);
    }
    char value[N]{};
};

template <compile_time_string cts>
constexpr auto operator""_cts()
{
    return cts;
}

Again, the first class template is not special, but allowing this compile_time_string to be used as the type of a non-type template parameter (quite a mouthful😝), as well as the string literal operator template, is a C++20 improvement. We can now write ‘"Hi"_cts’ to generate a compile_time_string object. Note, however, that this object is of type compile_time_string<3>, so "Hi"_cts and "Ha"_cts are of the same type, which is very different from the results of the GNU extension. However, the important thing is that compile_time_string can now be used as the type of a template parameter, so we can just add another layer:

template <compile_time_string cts>
struct cts_wrapper {
    static constexpr compile_time_string str{cts};
};

Corresponding to the previous compile-time string type comparison, we now need to write:

auto a = cts_wrapper<"Hi"_cts>{};
auto b = cts_wrapper<"Hi"_cts>{};
static_assert(is_same_v<decltype(a), decltype(b)>);

Or we can further simplify it to (as compile_time_string has a non-explicit constructor):

auto a = cts_wrapper<"Hi">{};
auto b = cts_wrapper<"Hi">{};
static_assert(is_same_v<decltype(a), decltype(b)>);

Summary

In this blog I have discussed two things:

  • Compile-time string manipulations
  • Strings as non-type template parameters

They have proved to be useful in my real projects. When I have time, I will explore some usages later. Stay tuned!

Contextual Memory Tracing

The Need

A long, long time ago I wrote about memory debugging. I redefined new as a macro, took advantage of placement new, and replaced the global operator new. However, only the replacement of the global operator new was useful in catching memory leaks in the end, while the other facilities became more or less futile; for memory allocation was becoming more and more implicit with the widespread use of STL containers and smart pointers, and direct use of new was discouraged. It is even more so today, with many coding conventions basically banning the use of new and delete. So I would like to revisit this topic.

First, there is still a need to trace memory usage, even though memory leakage, in the way of unmatched new, is unlikely today. People still need to know how memory is used, by which parts of the program, in which functions and modules, and so on. The exact point of memory allocation is becoming less relevant, as memory allocation is becoming less direct. It probably occurs in some libraries, instead of in application code. So now tracing memory usage means recording the usage context, instead of the exact code position.

Be Contextual

Usage contexts can be set up in a stack-like data structure, and I have done so several times in the past. What needs to be recorded in the context is something one needs to decide beforehand. If you only want to trace memory usage, you can do as I do below. But you may want to fit the interface with your specific memory manager, adding what needs to be passed to it. Anyway, you decide what should be in. My example code is as follows:

struct context {
    const char* file;
    const char* func;
};

We want to record the context automatically, and RAII can be used for this purpose:

class checkpoint {
public:
    explicit checkpoint(const context& ctx);
    ~checkpoint();

private:
    const context ctx_;
};

#define CTX_MEMORY_CHECKPOINT()       \
    checkpoint memory_checkpoint{     \
        context{__FILE__, __func__}}

thread_local std::deque<context>
    context_stack{
        context{"<UNKNOWN>", "<UNKNOWN>"}};

void save_context(const context& ctx)
{
    context_stack.push_back(ctx);
}

void restore_context(const context& ctx)
{
    assert(!context_stack.empty() &&
           context_stack.back() == ctx);
    context_stack.pop_back();
}

const context& get_current_context()
{
    assert(!context_stack.empty());
    return context_stack.back();
}

checkpoint::checkpoint(const context& ctx) : ctx_(ctx)
{
    save_context(ctx);
}

checkpoint::~checkpoint()
{
    restore_context(ctx_);
}

Please notice that the context_stack needs to be thread_local, which was something not standardized when I last wrote about tracing memory usage. It is very convenient to save the information on a per-stack basis.
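
To see how nested checkpoints behave, here is a condensed, self-contained sketch. It makes a few assumptions of its own: contexts compare equal when their strings are equal (the comparison operator is not shown above), std::vector stands in for std::deque, the checkpoints are written out instead of using the macro, and inner and outer are hypothetical functions:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

struct context {
    const char* file;
    const char* func;
};

// Assumption: two contexts are equal if their strings are
bool operator==(const context& lhs, const context& rhs)
{
    return std::strcmp(lhs.file, rhs.file) == 0 &&
           std::strcmp(lhs.func, rhs.func) == 0;
}

thread_local std::vector<context> context_stack{
    context{"<UNKNOWN>", "<UNKNOWN>"}};

const context& get_current_context()
{
    assert(!context_stack.empty());
    return context_stack.back();
}

class checkpoint {
public:
    explicit checkpoint(const context& ctx) : ctx_(ctx)
    {
        context_stack.push_back(ctx);
    }
    ~checkpoint()
    {
        assert(context_stack.back() == ctx_);
        context_stack.pop_back();
    }
    checkpoint(const checkpoint&) = delete;
    checkpoint& operator=(const checkpoint&) = delete;

private:
    const context ctx_;
};

const char* inner()
{
    checkpoint cp{context{__FILE__, __func__}};
    // The innermost checkpoint is the current context
    return get_current_context().func;
}

const char* outer()
{
    checkpoint cp{context{__FILE__, __func__}};
    inner();  // pushes, then pops, the context of inner
    // The context of outer is current again
    return get_current_context().func;
}
```

After outer returns, the stack is back to its initial single ‘<UNKNOWN>’ entry, as every checkpoint destructor undoes what its constructor did.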

Fitting with Real Memory Managers

Before we define the operator new and operator delete functions (normally called ‘allocation’ and ‘deallocation’ functions, and there are a lot of them), let us define first the generic/sample functions that do the real allocation and deallocation. We just pass on the necessary arguments to the system memory manager here (but you may want to do more to make it work with an existing memory manager) and we still use the C convention that a memory allocation failure is indicated by a null pointer:

void* ctx_alloc(size_t size, size_t alignment,
                const context* /*unused*/)
{
#ifdef _WIN32
    return _aligned_malloc(size, alignment);
#elif defined(__unix) || defined(__unix__)
    void* memptr{};
    int result = posix_memalign(&memptr, alignment, size);
    if (result == 0) {
        return memptr;
    } else {
        return nullptr;
    }
#else
    // No alignment guarantees on other platforms
    (void)alignment;
    return malloc(size);
#endif
}

void ctx_free(void* ptr, const context* /*unused*/)
{
#ifdef _WIN32
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}

operator new & operator delete

Now we can go to the allocation and deallocation functions. The declarations with our new context parameter are the following:

void* operator new  (size_t size,
                     const context& ctx);
void* operator new[](size_t size,
                     const context& ctx);
void* operator new  (size_t size,
                     std::align_val_t align_val,
                     const context& ctx);
void* operator new[](size_t size,
                     std::align_val_t align_val,
                     const context& ctx);

void operator delete  (void* ptr,
                       const context&) noexcept;
void operator delete[](void* ptr,
                       const context&) noexcept;
void operator delete  (void* ptr,
                       std::align_val_t align_val,
                       const context&) noexcept;
void operator delete[](void* ptr,
                       std::align_val_t align_val,
                       const context&) noexcept;

But these are not all the functions we need to rewrite. We need to replace the non-contextual versions too, and they are actually key to the memory tracing functionality. Our saved context can be used, like the following:

void* operator new(size_t size)
{
    return operator new(size, get_current_context());
}

void* operator new[](size_t size)
{
    return operator new[](size, get_current_context());
}

Assuming the existence of an alloc_mem and a free_mem function, we can make the rest of the allocation and deallocation functions basically forwarders:

void* operator new(size_t size,
                   const std::nothrow_t&) noexcept
{
    return alloc_mem(size, get_current_context(),
                     alloc_is_not_array);
}

void* operator new[](size_t size,
                     const std::nothrow_t&) noexcept
{
    return alloc_mem(size, get_current_context(),
                     alloc_is_array);
}

void* operator new(size_t size,
                   std::align_val_t align_val)
{
    return operator new(size, align_val,
                        get_current_context());
}

void* operator new[](size_t size,
                     std::align_val_t align_val)
{
    return operator new[](size, align_val,
                          get_current_context());
}

void* operator new(size_t size,
                   std::align_val_t align_val,
                   const std::nothrow_t&) noexcept
{
    return alloc_mem(size, get_current_context(),
                     alloc_is_not_array,
                     size_t(align_val));
}

void* operator new[](size_t size,
                     std::align_val_t align_val,
                     const std::nothrow_t&) noexcept
{
    return alloc_mem(size, get_current_context(),
                     alloc_is_array,
                     size_t(align_val));
}

void* operator new(size_t size, const context& ctx)
{
    void* ptr = alloc_mem(size, ctx, alloc_is_not_array);
    if (ptr == nullptr) {
        throw std::bad_alloc();
    }
    return ptr;
}

void* operator new[](size_t size, const context& ctx)
{
    void* ptr = alloc_mem(size, ctx, alloc_is_array);
    if (ptr == nullptr) {
        throw std::bad_alloc();
    }
    return ptr;
}

void* operator new(size_t size,
                   std::align_val_t align_val,
                   const context& ctx)
{
    void* ptr = alloc_mem(size, ctx, alloc_is_not_array,
                          size_t(align_val));
    if (ptr == nullptr) {
        throw std::bad_alloc();
    }
    return ptr;
}

void* operator new[](size_t size,
                     std::align_val_t align_val,
                     const context& ctx)
{
    void* ptr = alloc_mem(size, ctx, alloc_is_array,
                          size_t(align_val));
    if (ptr == nullptr) {
        throw std::bad_alloc();
    }
    return ptr;
}

void operator delete(void* ptr) noexcept
{
    free_mem(ptr, alloc_is_not_array);
}

void operator delete[](void* ptr) noexcept
{
    free_mem(ptr, alloc_is_array);
}

void operator delete(void* ptr, size_t) noexcept
{
    free_mem(ptr, alloc_is_not_array);
}

void operator delete[](void* ptr, size_t) noexcept
{
    free_mem(ptr, alloc_is_array);
}

void operator delete(
    void* ptr, std::align_val_t align_val) noexcept
{
    free_mem(ptr, alloc_is_not_array,
                 size_t(align_val));
}

void operator delete[](
    void* ptr, std::align_val_t align_val) noexcept
{
    free_mem(ptr, alloc_is_array,
                 size_t(align_val));
}

void operator delete(void* ptr,
                     const context&) noexcept
{
    operator delete(ptr);
}

void operator delete[](void* ptr,
                       const context&) noexcept
{
    operator delete[](ptr);
}

void operator delete(void* ptr,
                     std::align_val_t align_val,
                     const context&) noexcept
{
    operator delete(ptr, align_val);
}

void operator delete[](void* ptr,
                       std::align_val_t align_val,
                       const context&) noexcept
{
    operator delete[](ptr, align_val);
}

Contexts and Allocation/Deallocation

Now let us focus on the two functions that do the real job:

enum is_array_t : uint32_t {
    alloc_is_not_array,
    alloc_is_array
};

void* alloc_mem(size_t size,
                const context& ctx,
                is_array_t is_array,
                size_t alignment);

void free_mem(void* ptr,
              is_array_t is_array,
              size_t alignment);

Considering this interface, the context information can only be stored immediately before the memory returned to the user. In order to trace leaked memory, we need to link the allocated memory into a linked list, and the control block is as follows:

struct new_ptr_list_t {
    new_ptr_list_t* next;
    new_ptr_list_t* prev;
    size_t          size;
    context         ctx;
    uint32_t        head_size : 31;
    uint32_t        is_array : 1;
    uint32_t        magic;
};

The first four fields should be very clear in meaning. head_size probably requires some explanation. While the struct is fixed in size, alignments can be different across allocations, resulting in different offsets from the struct pointer to the memory pointer the user gets. So this field records the aligned struct size. is_array records whether the allocation is done by an operator new[]; we use this piece of information to detect the new[]/delete or new/delete[] mismatch, as well as allowing for special offsets required by array allocations. magic is used to mark that the memory is allocated by this implementation so that when freeing the memory we can detect corrupt memory, double freeing, and suchlike.

We also need the list head of control blocks, a mutex to protect its access, a function to align data size, and the magic number constant:

new_ptr_list_t new_ptr_list = {
    &new_ptr_list, &new_ptr_list, 0, {},
    0, alloc_is_not_array, CTX_MAGIC};

std::mutex new_ptr_lock;

constexpr uint32_t align(size_t s, size_t alignment)
{
    return static_cast<uint32_t>((s + alignment - 1) &
                                 ~(alignment - 1));
}

constexpr uint32_t CTX_MAGIC = 0x4D585443; // "CTXM"
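
The rounding done by align can be checked with a few numbers. The sketch below assumes that alignment is a power of two (which holds for valid alignment values), and the 56-byte control block size is purely hypothetical:

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t align(std::size_t s,
                              std::size_t alignment)
{
    return static_cast<std::uint32_t>(
        (s + alignment - 1) & ~(alignment - 1));
}

// With a hypothetical 56-byte control block:
static_assert(align(56, 16) == 64);    // 4 * 16
static_assert(align(56, 32) == 64);    // 2 * 32
static_assert(align(56, 64) == 64);    // 1 * 64
static_assert(align(56, 128) == 128);  // 1 * 128

// A size that is already a multiple stays unchanged
static_assert(align(64, 16) == 64);
```

This is why head_size has to be stored per allocation: the same control block occupies 64 bytes in front of a 16-byte-aligned user pointer, but 128 bytes in front of a 128-byte-aligned one.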

alloc_mem is then quite straightforward:

void* alloc_mem(size_t size, const context& ctx,
                is_array_t is_array,
                size_t alignment =
                    __STDCPP_DEFAULT_NEW_ALIGNMENT__)
{
    assert(alignment >=
           __STDCPP_DEFAULT_NEW_ALIGNMENT__);

    uint32_t aligned_list_item_size =
        align(sizeof(new_ptr_list_t), alignment);
    size_t s = size + aligned_list_item_size;
    auto ptr = static_cast<new_ptr_list_t*>(
        ctx_alloc(s, alignment, &ctx));
    if (ptr == nullptr) {
        return nullptr;
    }
    auto usr_ptr = reinterpret_cast<char*>(ptr) +
                   aligned_list_item_size;
    ptr->ctx = ctx;
    ptr->is_array = is_array;
    ptr->size = size;
    ptr->head_size = aligned_list_item_size;
    ptr->magic = CTX_MAGIC;
    {
        std::lock_guard guard{new_ptr_lock};
        ptr->prev = new_ptr_list.prev;
        ptr->next = &new_ptr_list;
        new_ptr_list.prev->next = ptr;
        new_ptr_list.prev = ptr;
    }
    return usr_ptr;
}

I.e. it does the following things:

  1. Allocates memory enough to satisfy the user requirement and the additional metadata (new_ptr_list_t)
  2. Fills in the metadata
  3. Chains the allocated memory blocks into a list
  4. Returns the pointer after the metadata

free_mem does the opposite thing. Obviously, we need a function to convert the user pointer back to the originally allocated pointer, which is not really trivial, considering the potential cases of bad pointers and mismatched use of array and non-array versions of new and delete. It is the convert_user_ptr function:

new_ptr_list_t* convert_user_ptr(void* usr_ptr,
                                 size_t alignment)
{
    auto offset = static_cast<char*>(usr_ptr) -
                  static_cast<char*>(nullptr);
    auto adjusted_ptr = static_cast<char*>(usr_ptr);
    bool is_adjusted = false;

    // Check alignment first
    if (offset % alignment != 0) {
        offset -= sizeof(size_t);
        if (offset % alignment != 0) {
            return nullptr;
        }
        // Likely caused by new[] followed by delete, if
        // we arrive here
        adjusted_ptr = static_cast<char*>(usr_ptr) -
                       sizeof(size_t);
        is_adjusted = true;
    }
    auto ptr = reinterpret_cast<new_ptr_list_t*>(
        adjusted_ptr -
        align(sizeof(new_ptr_list_t), alignment));
    if (ptr->magic == CTX_MAGIC &&
        (!is_adjusted || ptr->is_array)) {
        return ptr;
    }

    if (!is_adjusted && alignment > sizeof(size_t)) {
        // Again, likely caused by new[] followed by
        // delete, as aligned new[] allocates alignment
        // extra space for the array size.
        ptr = reinterpret_cast<new_ptr_list_t*>(
            reinterpret_cast<char*>(ptr) - alignment);
        is_adjusted = true;
    }
    if (ptr->magic == CTX_MAGIC &&
        (!is_adjusted || ptr->is_array)) {
        return ptr;
    }

    return nullptr;
}

With this, free_mem then becomes easy:

void free_mem(void* usr_ptr, is_array_t is_array,
              size_t alignment =
                  __STDCPP_DEFAULT_NEW_ALIGNMENT__)
{
    assert(alignment >=
           __STDCPP_DEFAULT_NEW_ALIGNMENT__);
    if (usr_ptr == nullptr) {
        return;
    }

    auto ptr = convert_user_ptr(usr_ptr, alignment);
    if (ptr == nullptr) {
        fprintf(stderr,
                "delete%s: invalid pointer %p\n",
                is_array ? "[]" : "", usr_ptr);
        abort();
    }
    if (is_array != ptr->is_array) {
        const char* msg = is_array
                              ? "delete[] after new"
                              : "delete after new[]";
        fprintf(stderr,
                "%s: pointer %p (size %zu)\n",
                msg, usr_ptr, ptr->size);
        abort();
    }
    {
        std::lock_guard guard{new_ptr_lock};
        ptr->magic = 0;
        ptr->prev->next = ptr->next;
        ptr->next->prev = ptr->prev;
    }
    ctx_free(ptr, &(ptr->ctx));
}

I.e.:

  1. It invokes convert_user_ptr to convert the user-provided pointer to a new_ptr_list_t*.
  2. It checks whether array-ness matches in the memory allocation and deallocation.
  3. It unlinks the memory block from the linked list.
  4. If anything bad happens, it prints a message and aborts the whole program (as the program already has undefined behaviour).

One More Thing

It is now nearly complete: we have set up the mechanisms to record memory contexts in memory allocation and deallocation functions. However, I have omitted one important detail so far. If you used my code verbatim as above, the program would crash on first memory allocation. When the global allocation and deallocation functions are replaced, care must be taken when we need additional memory inside those functions. If we somehow use the generic C++ memory allocation mechanisms, it will invoke operator new in the end, causing an infinite recursion. It is still OK to use malloc/free, so we need to use a malloc_allocator for the context stack:

template <typename T>
struct malloc_allocator {
    typedef T value_type;

    typedef std::true_type is_always_equal;
    typedef std::true_type
        propagate_on_container_move_assignment;

    malloc_allocator() = default;
    template <typename U>
    malloc_allocator(const malloc_allocator<U>&) {}

    template <typename U>
    struct rebind {
        typedef malloc_allocator<U> other;
    };

    T* allocate(size_t n)
    {
        return static_cast<T*>(malloc(n * sizeof(T)));
    }
    void deallocate(T* p, size_t)
    {
        free(p);
    }
};

thread_local std::deque<context,
                        malloc_allocator<context>>
    context_stack{context{"<UNKNOWN>", "<UNKNOWN>"}};
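
The effect can be observed with a small experiment. The following is only a sketch under my own assumptions: it replaces the global operator new with a version that merely counts calls, uses a trimmed-down malloc_allocator (with a null check added), and the counter and function names are hypothetical:

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <type_traits>
#include <vector>

// Counts calls to the replaced global operator new
std::size_t new_call_count = 0;

void* operator new(std::size_t size)
{
    ++new_call_count;
    void* ptr = std::malloc(size != 0 ? size : 1);
    if (ptr == nullptr) {
        throw std::bad_alloc();
    }
    return ptr;
}

void operator delete(void* ptr) noexcept
{
    std::free(ptr);
}

void operator delete(void* ptr, std::size_t) noexcept
{
    std::free(ptr);
}

// Trimmed-down allocator that bypasses operator new
template <typename T>
struct malloc_allocator {
    typedef T value_type;
    typedef std::true_type is_always_equal;

    malloc_allocator() = default;
    template <typename U>
    malloc_allocator(const malloc_allocator<U>&) {}

    T* allocate(std::size_t n)
    {
        void* ptr = std::malloc(n * sizeof(T));
        if (ptr == nullptr) {
            throw std::bad_alloc();
        }
        return static_cast<T*>(ptr);
    }
    void deallocate(T* p, std::size_t)
    {
        std::free(p);
    }
};

template <typename T, typename U>
bool operator==(const malloc_allocator<T>&,
                const malloc_allocator<U>&)
{
    return true;
}

template <typename T, typename U>
bool operator!=(const malloc_allocator<T>&,
                const malloc_allocator<U>&)
{
    return false;
}

// Returns how many times operator new is called while a
// vector using malloc_allocator grows to 1000 elements
std::size_t count_new_calls_with_malloc_allocator()
{
    std::size_t before = new_call_count;
    {
        std::vector<int, malloc_allocator<int>> v;
        for (int i = 0; i < 1000; ++i) {
            v.push_back(i);
        }
    }
    return new_call_count - before;
}
```

The function should return zero: all memory of the container comes from malloc directly, so using such a container inside operator new cannot cause recursion.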

Everything Put Together

You can find the real code with more details and a working memory leak checker in this repository:

https://github.com/adah1972/nvwa/tree/master/nvwa

You need to add the root directory of Nvwa to your include path, and nvwa/memory_trace.cpp and nvwa/aligned_memory.cpp to your project. In order to add a new memory checkpoint, use the macro NVWA_MEMORY_CHECKPOINT (Nvwa macros are usually prefixed with ‘NVWA_’). A very short test program follows:

#include <nvwa/memory_trace.h>

int main()
{
    char* ptr1 = new char[20];
    NVWA_MEMORY_CHECKPOINT();
    char* ptr2 = new char[42];
}

The output would be like the following:

Leaked object at 0x57697e30 (size 20, context: <UNKNOWN>/<UNKNOWN>)
Leaked object at 0x57697e70 (size 42, context: test.cpp/int main())
*** 2 leaks found

Notes about Using IWYU on macOS

I have recently found IWYU, a very useful tool to identify whether you have included header files correctly. It can be cleanly installed in Ubuntu by apt, though some configuration is needed to make it identify problems more correctly, i.e., let it know that a header file is private and we should use a public header file that includes it, or a symbol should be defined in a certain header file and we should not care where it is really defined in the implementation.

I did encounter some problems on macOS. I installed it, but it had problems with Xcode header files, causing something like “fatal error: ‘stdarg.h’ file not found” when used. A quick search showed that it was a known problem, and people mentioned that it seemed to have worsened with more recent versions of LLVM, which IWYU uses internally.

I happened to have LLVM 7.0 installed from Homebrew, so I had a try. Here are the simple steps:

  1. Make sure you have llvm@7. If not, ‘brew install llvm@7’ would do.
  2. Check out IWYU to a directory.
  3. Execute ‘git checkout clang_7.0’ inside the IWYU directory to choose the Clang 7 branch.
  4. Execute ‘mkdir build && cd build’ to use a build directory.
  5. Execute ‘CC=/usr/local/opt/llvm@7/bin/clang CXX=/usr/local/opt/llvm@7/bin/clang++ cmake -DCMAKE_PREFIX_PATH=/usr/local/opt/llvm@7 ..’ to configure IWYU to use the Homebrew LLVM 7.0.
  6. Execute ‘make’ to build IWYU.
  7. Execute ‘mkdir -p lib/clang/7.1.0 && ln -s /usr/local/opt/llvm@7/lib/clang/7.1.0/include lib/clang/7.1.0/’ to symlink the Clang 7 include directory inside IWYU. This step is critical to solve the “file not found” problem, but regretfully it does not work with a more recent LLVM version like LLVM 11.
  8. Symlink executables to your bin directory for quick access. Something like:
cd ~/bin
ln -s ~/code/include-what-you-use/build/bin/include-what-you-use .
ln -s ~/code/include-what-you-use/iwyu_tool.py iwyu_tool
ln -s include-what-you-use iwyu

Now IWYU is ready to use.

Please be aware that IWYU does not often work out of the box, and some configuration is needed. The key document to read is IWYU Mappings, and the bundled mapping files (.imp) can be good examples. You probably want to use libcxx.imp as a start. Some mappings are already included by default, and you can find them in the file iwyu_include_picker.cc.

While it is not perfect, it did help me identify many inclusion issues. This commit is a result of using IWYU.

Happy hacking!

Update 2023: It seems LLVM 16 works with IWYU again, and step 7 above is no longer needed.

Enum Filter

I have recently encountered code that is structurally similar to the following:

enum class number {
    zero,
    one,
    two,
    three,
    four,
    five,
    six,
    seven,
    end
};

…

if (value == number::two ||
    value == number::three ||
    value == number::five ||
    value == number::seven) {
    …
}

The manual comparisons do not look good to me, as they are repetitive and error-prone, and do not express the intent. So the natural question arises: How can we make the code ‘better’?

While this is a fake example, I hope you can see the point that enumerators have specific properties (which I will call ‘traits’ in this article, as per C++ traditions), and I want the code to show the intent as expressed by traits.

However, let us get rid of ‘value ==’ first. Any repetitions are bad, right?

My first take is something as follows:

template <typename T>
bool is_in(const T& value,
           std::initializer_list<T> value_list)
{
    for (const auto& item : value_list) {
        if (value == item) {
            return true;
        }
    }
    return false;
}
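
As a quick check, the original condition can be rewritten on top of is_in roughly as sketched below. The helper name is_small_prime is hypothetical, and is_in is marked constexpr here (an addition of this sketch) only so that the result can be verified at compile time.

```cpp
#include <initializer_list>

enum class number { zero, one, two, three, four, five, six, seven, end };

// Same logic as is_in above; constexpr is added in this sketch only so
// that the checks at the bottom can run at compile time.
template <typename T>
constexpr bool is_in(const T& value,
                     std::initializer_list<T> value_list)
{
    for (const auto& item : value_list) {
        if (value == item) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper replacing the chain of `value == ...` comparisons.
constexpr bool is_small_prime(number value)
{
    return is_in(value, {number::two, number::three,
                         number::five, number::seven});
}

static_assert(is_small_prime(number::five), "five is in the list");
static_assert(!is_small_prime(number::six), "six is not in the list");
```

The repetition of ‘value ==’ is gone, but the list of enumerators is still written by hand.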

Very simple and straightforward, but not good enough. How can we generate the list, given some criteria?

If you are familiar with the concept of template metaprogramming, you know that this is a compile-time programming topic: compile-time filtering.

In order to filter on the enumerators, we need to describe them with traits. The following code could be good enough for our current purpose:

template <number n>
struct number_traits;

template <>
struct number_traits<number::zero> {
    static constexpr bool is_prime = false;
};

template <>
struct number_traits<number::one> {
    static constexpr bool is_prime = false;
};

template <>
struct number_traits<number::two> {
    static constexpr bool is_prime = true;
};

template <>
struct number_traits<number::three> {
    static constexpr bool is_prime = true;
};

template <>
struct number_traits<number::four> {
    static constexpr bool is_prime = false;
};

template <>
struct number_traits<number::five> {
    static constexpr bool is_prime = true;
};

template <>
struct number_traits<number::six> {
    static constexpr bool is_prime = false;
};

template <>
struct number_traits<number::seven> {
    static constexpr bool is_prime = true;
};
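
A trait is then queried at compile time as number_traits<...>::is_prime. The sketch below shows such queries, using a more compact layout than the one above as an assumption of this example: the primary template defaults is_prime to false, and only the primes are specialized.

```cpp
enum class number { zero, one, two, three, four, five, six, seven, end };

// Primary template: an enumerator is not prime unless stated otherwise.
// (A compact variant for this sketch; the article declares the primary
// template without a body and specializes every enumerator.)
template <number n>
struct number_traits {
    static constexpr bool is_prime = false;
};

template <>
struct number_traits<number::two> {
    static constexpr bool is_prime = true;
};

template <>
struct number_traits<number::three> {
    static constexpr bool is_prime = true;
};

template <>
struct number_traits<number::five> {
    static constexpr bool is_prime = true;
};

template <>
struct number_traits<number::seven> {
    static constexpr bool is_prime = true;
};

// The traits are ordinary compile-time constants.
static_assert(number_traits<number::three>::is_prime, "three is prime");
static_assert(!number_traits<number::four>::is_prime, "four is not prime");
```

The trade-off of the compact form is that forgetting to describe a new enumerator goes unnoticed, whereas a primary template without a body turns that omission into a compile error.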

So, let us try figuring out a way to generate such a list.

After some study, you will know that initializer_list is not fit for such manipulations. tuple is a better utility. The main reason is that we had better manipulate types, instead of values, in template metaprogramming. An initializer_list is not capable of doing that, whereas C++ already has a facility to convert compile-time integral constants into types, its name being exactly integral_constant.

Its approximate definition is as follows, in case you are not familiar with it:

template<class T, T v>
struct integral_constant {
    static constexpr T value = v;
    using value_type = T;
    using type = integral_constant;
    constexpr operator value_type() const noexcept
    {
        return value;
    }
    constexpr value_type operator()() const noexcept
    {
        return value;
    }
};

Such a definition is already provided by the standard library. So, instead of having an initializer_list like {number::two, number::three, number::five}, we would have something like the following:

std::make_tuple(
    std::integral_constant<number, number::two>{},
    std::integral_constant<number, number::three>{},
    std::integral_constant<number, number::five>{})

It would be safe to pass such ‘arguments’ for compile-time programming, as only their types matter. We would not need their values, as each type has exactly one unique value.
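
The claim that only the types matter can be checked directly. In the sketch below, the aliases two_c, three_c, and primes_t are names made up for this example.

```cpp
#include <tuple>
#include <type_traits>

enum class number { zero, one, two, three, four, five, six, seven, end };

// Hypothetical aliases, for brevity.
using two_c = std::integral_constant<number, number::two>;
using three_c = std::integral_constant<number, number::three>;

// The value can be recovered from the type alone ...
static_assert(two_c::value == number::two);
// ... or from any object of that type, via operator() or conversion.
static_assert(three_c{}() == number::three);
static_assert(static_cast<number>(three_c{}) == number::three);

// A tuple of such constants is fully described by its type.
using primes_t = std::tuple<two_c, three_c>;
static_assert(std::is_same_v<
              decltype(std::make_tuple(two_c{}, three_c{})), primes_t>);
static_assert(std::tuple_element_t<1, primes_t>::value == number::three);
```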

The next questions are:

  1. How can we generate the constants for all possible enumerators?—I.e. compile-time iteration.
  2. How can we filter to get only the values we want?—I.e. compile-time filtering.
  3. How can we check whether a value is equal to one of the constants we present?—I.e. (compile-time or run-time) checking like the is_in above.

The answer to the first question is that we need to generate a sequence, and we need to know what the last enumerator is. As far as I know, there is currently no way in C++ to enumerate all the enumerators of an enum type. I have to resort to an agreement to mark the end of a continuous enumeration, and my choice is that we use end to mark the end, as in the enum class listed in the very beginning of this article. That is, I need to generate the sequence from integral_constant<number, number{0}> to integral_constant<number, number::end>, exclusive.

This job can be easily done with the following code, using the standard tuple and index_sequence technique:

template <typename E, size_t... ints>
constexpr auto make_all_enum_consts_impl(
    std::index_sequence<ints...>)
{
    return std::make_tuple(std::integral_constant<
        E, E(ints)>{}...);
}

template <typename E>
constexpr auto make_all_enum_consts()
{
    return make_all_enum_consts_impl<E>(
        std::make_index_sequence<size_t(E::end)>{});
}
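
To see what the generated tuple looks like, its type can be inspected at compile time. The following is a self-contained sketch; the alias all_numbers is a name introduced for this example.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

enum class number { zero, one, two, three, four, five, six, seven, end };

template <typename E, std::size_t... ints>
constexpr auto make_all_enum_consts_impl(
    std::index_sequence<ints...>)
{
    // One integral_constant per index, converted to the enum type.
    return std::make_tuple(std::integral_constant<
        E, E(ints)>{}...);
}

template <typename E>
constexpr auto make_all_enum_consts()
{
    // E::end is used only as the element count; it is not included.
    return make_all_enum_consts_impl<E>(
        std::make_index_sequence<std::size_t(E::end)>{});
}

// Hypothetical alias for the resulting tuple type.
using all_numbers = decltype(make_all_enum_consts<number>());

static_assert(std::tuple_size_v<all_numbers> == 8);
static_assert(std::is_same_v<
              std::tuple_element_t<3, all_numbers>,
              std::integral_constant<number, number::three>>);
static_assert(std::is_same_v<
              std::tuple_element_t<7, all_numbers>,
              std::integral_constant<number, number::seven>>);
```

Note that this relies on the enumerators being contiguous and starting from zero, as agreed above.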

Now we have come to the second, and the really difficult, part: how can we filter the values to get only those we need?

The answer is to use apply, tuple_cat, and conditional_t, three important tools in the C++ template metaprogramming world:

  • With apply, we can call a function with all elements of a tuple as arguments. E.g. apply(f, make_tuple(42, "answer")) would be equivalent to f(42, "answer").
  • With tuple_cat, we can concatenate elements of tuples into a new tuple. E.g. tuple_cat(make_tuple(42, "answer"), make_tuple("of", "everything")) would result in the tuple {42, "answer", "of", "everything"}.
  • With conditional_t, we can get one of the given types based on a compile-time Boolean expression. E.g. conditional_t<true, int, string> would result in int, but conditional_t<false, int, string> would result in string.
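
The three tools can be tried in isolation before they are combined. The following sketch uses made-up values and a hypothetical function add; all the checks happen at compile time.

```cpp
#include <string>
#include <tuple>
#include <type_traits>

// Hypothetical function used to demonstrate apply.
constexpr int add(int a, int b)
{
    return a + b;
}

// apply: the tuple elements become the arguments of the call.
static_assert(std::apply(add, std::make_tuple(40, 2)) == 42);

// tuple_cat: tuples are concatenated; an empty tuple contributes nothing.
static_assert(std::tuple_cat(std::make_tuple(1, 2),
                             std::tuple<>{},
                             std::make_tuple(3)) ==
              std::make_tuple(1, 2, 3));

// conditional_t: a compile-time Boolean selects one of two types.
static_assert(std::is_same_v<
              std::conditional_t<true, int, std::string>, int>);
static_assert(std::is_same_v<
              std::conditional_t<false, int, std::string>,
              std::string>);
```

The fact that an empty tuple vanishes under tuple_cat is what makes filtering possible: an element can be kept as a one-element tuple or dropped as an empty one.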

Each tool may look trivial individually, but they can be combined to work wonders. Specifically, they can do what we now need.

This is the final form I use (mainly inspired by this Stack Overflow answer):

#define ENUM_FILTER_FROM(E, T, tup)                     \
    std::apply(                                         \
        [](auto... ts) {                                \
            return std::tuple_cat(                      \
                std::conditional_t<                     \
                    E##_traits<decltype(ts)::value>::T, \
                    std::tuple<decltype(ts)>,           \
                    std::tuple<>>{}...);                \
        },                                              \
        tup)

Let me explain what it does:

  • The macro takes an enumeration type, a trait name, and a tuple of enumerator constants, such as the one created by make_all_enum_consts above. The reason why a tuple of constants is used is that the result of calling ENUM_FILTER_FROM can be filtered again.
  • std::apply invokes the generic lambda with the elements of the tuple as arguments.
  • The generic lambda does the compile-time computation of concatenating (tuple_cat) its arguments into a new tuple.
  • Each argument of tuple_cat is either a tuple of one enumerator constant, if the enumerator satisfies the trait, or an empty tuple otherwise.
  • So the end result of executing the code in the macro is a tuple of enumerator constants that satisfy the trait.
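
The effect of the macro can be observed on a small hand-written tuple. This is a sketch under a few assumptions: the alias number_c and the variable filtered are names made up here, and the traits use a compact layout (false by default, with only two and three specialized) to keep the example short.

```cpp
#include <tuple>
#include <type_traits>

enum class number { zero, one, two, three, four, five, six, seven, end };

// Compact traits for this sketch only: not prime unless specialized.
template <number n>
struct number_traits {
    static constexpr bool is_prime = false;
};
template <>
struct number_traits<number::two> {
    static constexpr bool is_prime = true;
};
template <>
struct number_traits<number::three> {
    static constexpr bool is_prime = true;
};

#define ENUM_FILTER_FROM(E, T, tup)                     \
    std::apply(                                         \
        [](auto... ts) {                                \
            return std::tuple_cat(                      \
                std::conditional_t<                     \
                    E##_traits<decltype(ts)::value>::T, \
                    std::tuple<decltype(ts)>,           \
                    std::tuple<>>{}...);                \
        },                                              \
        tup)

// Hypothetical shorthand for an enumerator constant.
template <number n>
using number_c = std::integral_constant<number, n>;

// Conceptually this evaluates
//   tuple_cat(tuple<>{}, tuple<two>{}, tuple<three>{}, tuple<>{})
constexpr auto filtered = ENUM_FILTER_FROM(
    number, is_prime,
    std::make_tuple(number_c<number::one>{}, number_c<number::two>{},
                    number_c<number::three>{},
                    number_c<number::four>{}));

static_assert(std::is_same_v<
              std::remove_const_t<decltype(filtered)>,
              std::tuple<number_c<number::two>,
                         number_c<number::three>>>);
```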

The answer to the third question is relatively simple. For maximal flexibility, I am splitting it into two steps:

  • Convert the tuple of types into a tuple of values
  • Check whether a value is in the tuple with a fold expression

Here is the code:

template <typename Tuple, size_t... ints>
constexpr auto make_values_from_consts_impl(
    Tuple tup, std::index_sequence<ints...>)
{
    return std::make_tuple(std::get<ints>(tup)()...);
}

template <typename Tuple>
constexpr auto make_values_from_consts(Tuple tup)
{
    return make_values_from_consts_impl(
        tup, std::make_index_sequence<
                 std::tuple_size_v<Tuple>>{});
}

template <typename T, typename Tuple, size_t... ints>
constexpr bool is_in_impl(const T& value,
                          const Tuple& tup,
                          std::index_sequence<ints...>)
{
    return ((value == std::get<ints>(tup)) || ...);
}

template <typename T, typename Tuple>
constexpr std::enable_if_t<
    std::is_same_v<T, std::decay_t<decltype(std::get<0>(
                          std::declval<Tuple>()))>>,
    bool>
is_in(const T& value, const Tuple& tup)
{
    return is_in_impl(value, tup,
                      std::make_index_sequence<
                          std::tuple_size_v<Tuple>>{});
}

Finally, we can define the function is_prime:

constexpr bool is_prime(number n)
{
    return is_in(
        n, make_values_from_consts(ENUM_FILTER_FROM(
               number, is_prime,
               make_all_enum_consts<number>())));
}
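
For convenience, here is a condensed sketch that puts all the pieces into one translation unit that should compile as C++17. Two simplifications are assumptions of this sketch, not part of the code above: the traits default to false in the primary template, and the enable_if_t guard on is_in is omitted.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

enum class number { zero, one, two, three, four, five, six, seven, end };

// Compact traits for this sketch: not prime unless specialized.
template <number n>
struct number_traits { static constexpr bool is_prime = false; };
template <>
struct number_traits<number::two> { static constexpr bool is_prime = true; };
template <>
struct number_traits<number::three> { static constexpr bool is_prime = true; };
template <>
struct number_traits<number::five> { static constexpr bool is_prime = true; };
template <>
struct number_traits<number::seven> { static constexpr bool is_prime = true; };

// Step 1: one integral_constant per enumerator before `end`.
template <typename E, std::size_t... ints>
constexpr auto make_all_enum_consts_impl(std::index_sequence<ints...>)
{
    return std::make_tuple(std::integral_constant<E, E(ints)>{}...);
}

template <typename E>
constexpr auto make_all_enum_consts()
{
    return make_all_enum_consts_impl<E>(
        std::make_index_sequence<std::size_t(E::end)>{});
}

// Step 2: keep only the constants whose trait T is true.
#define ENUM_FILTER_FROM(E, T, tup)                     \
    std::apply(                                         \
        [](auto... ts) {                                \
            return std::tuple_cat(                      \
                std::conditional_t<                     \
                    E##_traits<decltype(ts)::value>::T, \
                    std::tuple<decltype(ts)>,           \
                    std::tuple<>>{}...);                \
        },                                              \
        tup)

// Step 3a: turn the tuple of constants into a tuple of values.
template <typename Tuple, std::size_t... ints>
constexpr auto make_values_from_consts_impl(
    Tuple tup, std::index_sequence<ints...>)
{
    return std::make_tuple(std::get<ints>(tup)()...);
}

template <typename Tuple>
constexpr auto make_values_from_consts(Tuple tup)
{
    return make_values_from_consts_impl(
        tup, std::make_index_sequence<std::tuple_size_v<Tuple>>{});
}

// Step 3b: compare against every element with a fold expression.
template <typename T, typename Tuple, std::size_t... ints>
constexpr bool is_in_impl(const T& value, const Tuple& tup,
                          std::index_sequence<ints...>)
{
    return ((value == std::get<ints>(tup)) || ...);
}

template <typename T, typename Tuple>
constexpr bool is_in(const T& value, const Tuple& tup)
{
    return is_in_impl(
        value, tup,
        std::make_index_sequence<std::tuple_size_v<Tuple>>{});
}

constexpr bool is_prime(number n)
{
    return is_in(
        n, make_values_from_consts(ENUM_FILTER_FROM(
               number, is_prime, make_all_enum_consts<number>())));
}

static_assert(is_prime(number::seven));
static_assert(!is_prime(number::six));
```

is_prime also works with run-time arguments; the static_assert lines merely show that the whole pipeline can be evaluated by the compiler.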

More interestingly, the result of invoking ENUM_FILTER_FROM can be passed to ENUM_FILTER_FROM again. If we defined the trait is_even as well as is_prime, we would be able to write:

ENUM_FILTER_FROM(number, is_even,
    ENUM_FILTER_FROM(number, is_prime, …))
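
A chained filter can be checked at compile time as well. In the sketch below, the traits are computed by a single primary template from the numeric value of the enumerator; this layout, the helper is_prime_value, and the variable even_primes are assumptions made to keep the example short.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

enum class number { zero, one, two, three, four, five, six, seven, end };

// Hypothetical helper: primality for the small range covered by `number`.
constexpr bool is_prime_value(int v)
{
    return v == 2 || v == 3 || v == 5 || v == 7;
}

// Both traits are derived from the enumerator's numeric value.
template <number n>
struct number_traits {
    static constexpr bool is_prime =
        is_prime_value(static_cast<int>(n));
    static constexpr bool is_even = static_cast<int>(n) % 2 == 0;
};

template <typename E, std::size_t... ints>
constexpr auto make_all_enum_consts_impl(std::index_sequence<ints...>)
{
    return std::make_tuple(std::integral_constant<E, E(ints)>{}...);
}

template <typename E>
constexpr auto make_all_enum_consts()
{
    return make_all_enum_consts_impl<E>(
        std::make_index_sequence<std::size_t(E::end)>{});
}

#define ENUM_FILTER_FROM(E, T, tup)                     \
    std::apply(                                         \
        [](auto... ts) {                                \
            return std::tuple_cat(                      \
                std::conditional_t<                     \
                    E##_traits<decltype(ts)::value>::T, \
                    std::tuple<decltype(ts)>,           \
                    std::tuple<>>{}...);                \
        },                                              \
        tup)

// The inner filter keeps the primes; the outer one keeps the even ones.
constexpr auto even_primes = ENUM_FILTER_FROM(
    number, is_even,
    ENUM_FILTER_FROM(number, is_prime,
                     make_all_enum_consts<number>()));

// Only `two` survives both filters.
static_assert(std::is_same_v<
              std::remove_const_t<decltype(even_primes)>,
              std::tuple<std::integral_constant<number, number::two>>>);
```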

Isn’t that nice?

Do note that there is an asymmetry here. It is trivial to implement make_values_from_consts, but it seems impossible to implement its inverse constexpr function make_consts_from_values. This is because there are no constexpr arguments in C++ (check out this discussion if you are interested in the reasons). No arguments are regarded constexpr, even in a constexpr function. You can work around the problem in a cumbersome way, but for this post I am sticking to using types as long as possible.
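
The asymmetry can be seen in a few lines. In this sketch, next_const and forty_one are hypothetical names; the value-based counterpart is shown only as a comment because it does not compile.

```cpp
#include <type_traits>

// A value carried in the type can feed further compile-time computation.
template <typename C>
constexpr auto next_const(C)
{
    return std::integral_constant<typename C::value_type,
                                  C::value + 1>{};
}

// The value-based counterpart is ill-formed: inside the function body,
// `n` is not a constant expression, even though the function itself is
// declared constexpr.
//
// constexpr auto next_value_as_const(int n)
// {
//     return std::integral_constant<int, n + 1>{};  // error
// }

using forty_one = std::integral_constant<int, 41>;
static_assert(decltype(next_const(forty_one{}))::value == 42);
```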

That’s it, my experience of using compile-time filtering. I have found the techniques presented here useful, and I wish you would find them useful too.

Time Zones in Python

Python datetimes are naïve by default, in that they do not include time zone (or time offset) information. E.g. one might be surprised to find that (datetime.now() - datetime.utcnow()).total_seconds() is basically the local time offset (28800 in my case for UTC+08:00). I personally kind of expected a value near zero. This said, datetime is able to handle time zones, but the definitions of time zones are not included in the Python standard library. A third-party library is necessary for handling time zones. In our project, a developer introduced pytz in the beginning. It all looked well, until I found the following:

>>> from datetime import datetime
>>> from pytz import timezone
>>> timezone('Asia/Shanghai')
<DstTzInfo 'Asia/Shanghai' LMT+8:06:00 STD>
>>> (datetime(2017, 6, 1, tzinfo=timezone('Asia/Shanghai'))
...  - datetime(2017, 6, 1, tzinfo=timezone('UTC'))
... ).total_seconds()
-29160.0

Sh*t! Was pytz a joke? The time zone of Shanghai (or China) should be UTC+08:00, and I did not care a bit about its local mean time (I was, of course, expecting -28800 on the last line). What was the author thinking about? Besides, it did not provide a local time zone function, and we had to hardcode our time zone to 'Asia/Shanghai', which was ugly.—Disappointed, I searched for an alternative, and I found dateutil.tz. From then on, I routinely use code like the following:

from datetime import datetime
from dateutil.tz import tzlocal, tzutc
…
datetime.now(tzlocal())  # for local time
datetime.now(tzutc())    # for UTC time

When answering a StackOverflow question, I realized I misunderstood pytz. I still thought it had some bad design decisions; however, it would have been able to achieve everything I needed, if I had read its manual carefully (I cannot help remembering the famous acronym ‘RTFM’). It was explicitly mentioned in the manual that passing a pytz time zone to the datetime constructor (as I did above) ‘“does not work” with pytz for many timezones’. One has to use the pytz localize method or the standard astimezone method of datetime.

As tzlocal and tzutc from dateutil.tz fulfilled all my needs and were easy to use, I continued to use them. The fact that I got a few downvotes on StackOverflow certainly did not make me like pytz better.


When introducing apscheduler to our project, we noticed that it required that the time zone be provided by pytz—it ruled out the use of dateutil.tz. I wondered what was special about it. I also became aware of a Python package called tzlocal, which was able to provide a pytz time zone conforming to the local system settings. More searching and reading revealed facts that I had missed so far:

  • The Python datetime object does not store or handle daylight-saving status. Adding a timedelta to it does not alter its time zone information, and can result in an invalid local time (say, adding one day to the last day of daylight-saving time does not result in a datetime in standard time).
  • The time zone provided by dateutil.tz does not handle all corner cases. E.g. it does not know that Russia observed all-year daylight-saving time from 2012 to 2014, and it does not know that China observed daylight-saving time from 1986 to 1991.
  • The pytz localize and normalize methods can handle all these complexities, and this is partly the reason why pytz requires people to use its localize method instead of passing the time zone to datetime.

So pytz can actually do more, and correctly. I can do things like finding out in which years China observed daylight-saving time:

from datetime import datetime, timedelta
from pytz import timezone
china = timezone('Asia/Shanghai')
utc = timezone('UTC')
expect_diff = timedelta(hours=8)
for year in range(1980, 2000):
    dt = datetime(year, 6, 1)
    if utc.localize(dt) - china.localize(dt) != expect_diff:
        print(year)

It is now clear to me that the pytz-style time zone is necessary when apscheduler handles a past or future local time.


A few benchmarks regarding the related functions in ipython (not that they are very important):

from datetime import datetime
import dateutil.tz
import pytz
import tzlocal
dateutil_utc = dateutil.tz.tzutc()
dateutil_local = dateutil.tz.tzlocal()
pytz_utc = pytz.utc
pytz_local = tzlocal.get_localzone()
%timeit datetime.utcnow()
310 ns ± 0.405 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit datetime.now()
745 ns ± 1.65 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit datetime.now(dateutil_utc)
924 ns ± 0.907 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit datetime.now(pytz_utc)
2.28 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit datetime.now(dateutil_local)
17.4 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit datetime.now(pytz_local)
5.54 µs ± 11.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

My final recommendations:

  • One should consider using naïve UTC datetimes everywhere, as they are easy and fast to work with.
  • The next best is using offset-aware UTC. Both dateutil.tz and pytz can be used in this case without any problems.
  • In all other cases, pytz (as well as tzlocal) is preferred, but one should beware of the peculiar behaviour of pytz time zones.

My Opinions Regarding the Top Five TIOBE Languages

I have written C++ for nearly 30 years. I had been advocating that it was the best language 🤣, until my love moved to Python a few years ago. I will still say C++ is a very powerful and unique language. It is probably the only language that intersects many different software layers. It lets programmers control the bit-level details, and it has the necessary mechanisms to allow programmers to make appropriate abstractions—arguably one of the best as it provides powerful generics, which are becoming better and better with the upcoming concepts and ranges in C++20. It has very decent optimizing compilers, and suitably written C++ code performs better than nearly all other languages. Therefore, C++ has been widely used in not only low-level stuff like drivers, but also libraries and applications, especially where performance is wanted, like scientific computing and games. It is still widely used in desktop applications, say, Microsoft Office and Adobe Photoshop. The power does come with a price: it is probably the most complicated computer language today. Mastering the language takes a long time (and with 30 years’ experience I dare not say I have mastered the language). Generic code also tends to take a long time to compile. Error messages can be overwhelming, especially to novices. I can go on and on, but it is better to stop here, with a note that the complexity and cost are sometimes worthwhile, in exchange for reduced latency and reduced power usage (from CPU and memory).

Python is, on the other hand, easy to learn. It is not a toy language, though: it is handy not only to novices, but also to software veterans like me. The change-and-run cycle is much shorter than C++. Code in Python is very readable, partly because lists, sets, and dictionaries are supported literal types (you cannot write in C++ an expression like {"one": 1} and let compiler deduce it is a dictionary). It has features that C++ has lacked for many years: generator/coroutine, lazy range, and so on. Generics do not need special support, as it is dynamically typed (but it also does not surprise programmers by allowing error-prone expressions like "1" + 2, as in some script languages). With a good IDE, the argument on its lack of compile-time check can be crushed—programmers can enjoy edit-time checks. It has a big ecosystem with a huge number of third-party libraries, and they are easier to take and use than in C++ (thanks to pip). The only main remaining shortcoming to me is performance, but: 1) one may write C/C++ extensions where necessary; and 2) the lack of performance may not matter at all, if your application is not CPU-bound. See my personal experience of 25x performance boost in two hours.

I used Java a long time ago. I do not like it (mostly for its verbosity), and its desktop/server implementation makes it unsuitable for short-time applications due to its sluggish launch time. However, it has always been a workhorse on the server side, and it has a successful ecosystem, if not much harmed by Oracle’s lawyers. Android also brought life to the old language and the development communities (ignoring for now the bad effects Oracle has brought about).

C# started as Microsoft’s answer to Java, but they have differed more and more since then. I actually like C#, and my experience has shown it is very suitable for Windows application development (I do not have experience with Mono, and I don’t do server development on Windows). Many of its features, like LINQ and on-stack structs, are very likeable.

C is a simple and elegant language, and it can be regarded as the ancestor of three languages above (except Python), at least in syntax. It is the most widely supported. It is the closest to metal, and is still very popular in embedded systems, OS development, and cases where maximum portability is wanted (thus the wide offerings from the open-source communities). It is the most dangerous language, as you can easily have buffer overflows. Incidentally, two of the three current answers to ‘How do you store a list of names input by the user into an array in C (not C++ or C#)?’ can have buffer overflows (and I wrote the other answer). Programmers need to tend to many details themselves.

I myself will code everything in Python where possible, as it usually requires the fewest lines of code and takes the least amount of time. If performance is wanted, I’ll go to C++. For Windows GUI applications, I’ll prefer C#. I will write in C if maximum portability and memory efficiency are wanted. I do not feel I will write in Java, except modifying existing code or when the environment supports Java only.

[I first posted it as a Quora answer, but it is probably worth a page of its own.]

25x Performance Boost in Two Hours

Our system has a find_child_regions API, which, as the name indicates, can find subregions of a region up to a certain level. It needs to look up two MongoDB collections, combine the data in a certain structure, and return the result in JSON.

One day, it was reported that the API was slow for big data sets. Tests showed that it took more than 50 seconds to return close to 6000 records. Er . . . that means the average processing speed is only about 100 records a second—not terribly slow, but definitely not ideal.

When there is a performance problem, a profiler is always your friend.1 Profiling quickly revealed that a database read function was called about twice the number of returned records, and occupied the biggest chunk of time. The reason was that the function first found out all the IDs of the regions to return, and then it read all the data and generated the result. Since the data were already read once when the IDs were returned, they could be saved and reused. I had to write a new function, which resembled the function that returned region IDs, but returned objects that contained all the data read instead (we had such a class already). I also needed to split the result-generating function into two, so that either the region IDs, or the data objects, could be accepted. (I could not change these functions directly, as they have many other users than find_child_regions; changing all of them at once would have been both risky and unnecessary.)

In about 30 minutes, this change generated the expected improvement: call time was shortened to about 30 seconds. A good start!

While the improvement percentage looked nice, the absolute time taken was still a bit long. So I continued to look for further optimization chances.

Seeing that database reading was still the bottleneck and the database read function was still called for each record returned, I thought I should try batch reading. Fortunately, I found I only needed to change one function. Basically, I needed to change something like the following

result = []
for x in xs:
    object_id = f(x)
    obj = get_from_db(object_id, …)
    if obj:
        result.append(obj)
return result

to

object_ids = [f(x) for x in xs]
return find_in_db({"_id": {"$in": object_ids}}, …)

I.e. in that specific function, all data of one level of subregions were read in one batch. Getting four levels of subregions took only four database reads, instead of 6000. This reduced the latency significantly.

In 30 minutes, the call time was again reduced, from 30 seconds to 14 seconds. Not bad!

Again, the profiler showed that database reading was still the bottleneck. I made more experiments, and found that the data object could be sizeable, whereas we did not always need all data fields. We might only need, say, 100 bytes from each record, but the average size of each region was more than 50 KB. The functions involved always read the full record, something equivalent to the traditional SQL statement ‘SELECT * FROM ...’. It was convenient, but not efficient. MongoDB APIs provided a projection parameter, which allowed callers to specify which fields to read from the collection, so I tried it. We had the infrastructure in place, and it was not very difficult. It took me about an hour to make it fully work, as many functions needed to be changed to pass the (optional) projection/field names around. When it finally worked, the result was stunning: if one only needed the basic fields about the regions, the call time could be less than 2 seconds. Terrific!

While Python is not a performant language, and I still like C++, I am glad that Python was chosen for this project. The performance improvement by the C++ language would have been negligible when the call time was more than 50 seconds, and still a small number when I improved its performance to less than 2 seconds. In the meanwhile, it would have been simply impossible for me to refactor the code and achieve the same performance in two hours if the code had been written in C++. I highly doubt whether I could have finished the job in a full day. I would probably have been fighting with the compiler and type system most of the time, instead of focusing on the logic and testing.

Life is short—choose your language wisely.


  1. Being able to profile Python programs easily was actually the main reason I purchased a professional licence of PyCharm, instead of just using the Community Edition. 

Pipenv and Relocatable Virtual Environments

Pipenv is a very useful tool to create and maintain independent Python working environments. Using it feels like a breeze. There are enough online tutorials about it, and I will only talk about one specific thing in this article: how to move a virtual environment to another machine.

The reason I need to make virtual environments movable is that our clients do not usually allow direct Internet access in production environments, therefore we cannot install packages from online sources on production servers. They also often enforce a certain directory structure. So we need to prepare the environment in our test environment, and it would be better if we did not need to worry about where we put the result on the production server. Virtual environments, especially with the help of Pipenv, seem to provide a nice and painless way of achieving this effect—if we can just make the result of pipenv install movable, or, in the term of virtualenv, relocatable.

virtualenv is already able to make most of the virtual environment relocatable. When working with Pipenv, it can be as simple as

virtualenv --relocatable `pipenv --venv`

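There are two problems, though:

  • The activate script is not made relocatable: it still hardcodes the absolute path of the virtual environment in VIRTUAL_ENV.
  • The lib64 symlink inside the virtual environment, where it exists, points to lib by an absolute path.
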
They are not difficult to solve, and we can conquer them one by one.

As pointed out in the issue discussion, one only needs to replace one line in activate to make it relocatable. What is originally

VIRTUAL_ENV="/home/yongwei/.local/share/virtualenvs/something--PD5l8nP"

should be changed to

VIRTUAL_ENV=$(cd $(dirname "$BASH_SOURCE"); dirname `pwd`)

To be on the safe side, I would look for exactly the same line and replace it, so some sed tricks are needed. I also need to take care of the differences between BSD sed and GNU sed, but it is a problem already solved before.

The second problem is even easier. Creating a new relative symlink solves the problem.

I’ll share the final result here, a simple script that makes a virtual environment relocatable and creates a tarball from it. The archive has ‘-venv-platform’ as the suffix, but it does not include a root directory. Keep this in mind when you unpack the tarball.

#!/bin/sh

case $(sed --version 2>&1) in
  *GNU*) sed_i () { sed -i "$@"; };;
  *) sed_i () { sed -i '' "$@"; };;
esac

sed_escape() {
  echo "$1" | sed -e 's/[]\/$*.^[]/\\&/g'
}

VENV_PATH=`pipenv --venv`
if [ $? -ne 0 ]; then
  exit 1
fi
virtualenv --relocatable "$VENV_PATH"

VENV_PATH_ESC=`sed_escape "$VENV_PATH"`
RUN_PATH=`pwd`
BASE_NAME=`basename "$RUN_PATH"`
PLATFORM=`python -c 'import sys; print(sys.platform)'`
cd "$VENV_PATH"
sed_i "s/^VIRTUAL_ENV=\"$VENV_PATH_ESC\"/VIRTUAL_ENV=\$(cd \$(dirname \"\$BASH_SOURCE\"); dirname \`pwd\`)/" bin/activate
[ -h lib64 ] && rm -f lib64 && ln -s lib lib64
tar cvfz "$RUN_PATH/$BASE_NAME-venv-$PLATFORM.tar.gz" .

After running the script, I can copy the resulting tarball to another machine with the same OS, unpack it, and then either use the activate script or set the PYTHONPATH environment variable to make my Python program work. Problem solved.

A last note: I have not touched activate.csh and activate.fish, as I do not use them. If you do, you will need to update the script accordingly. That would be your homework as an open-source user. 😼


  1. I tried removing it, and Pipenv was very unhappy.