Time Zones in Python

Python datetimes are naïve by default, in that they do not include time zone (or time offset) information. E.g. one might be surprised to find that (datetime.now() - datetime.utcnow()).total_seconds() is basically the local time offset (28800 in my case for UTC+08:00); I personally had expected a value near zero. This said, datetime is able to handle time zones, but the time zone definitions themselves are not included in the Python standard library, so a third-party library is necessary. In our project, a developer introduced pytz in the beginning. It all looked fine, until I found the following:

>>> from datetime import datetime
>>> from pytz import timezone
>>> timezone('Asia/Shanghai')
<DstTzInfo 'Asia/Shanghai' LMT+8:06:00 STD>
>>> (datetime(2017, 6, 1, tzinfo=timezone('Asia/Shanghai'))
...  - datetime(2017, 6, 1, tzinfo=timezone('UTC'))
... ).total_seconds()
-29160.0

Sh*t! Was pytz a joke? The time zone of Shanghai (or China) should be UTC+08:00, and I did not care a bit about its local mean time (I was, of course, expecting -28800 on the last line). What was the author thinking about? Besides, it did not provide a local time zone function, and we had to hardcode our time zone to 'Asia/Shanghai', which was ugly. Disappointed, I searched for an alternative and found dateutil.tz. From then on, I routinely used code like the following:

from datetime import datetime
from dateutil.tz import tzlocal, tzutc
…
datetime.now(tzlocal())  # for local time
datetime.now(tzutc())    # for UTC time

When answering a StackOverflow question, I realized I misunderstood pytz. I still thought it had some bad design decisions; however, it would have been able to achieve everything I needed, if I had read its manual carefully (I cannot help remembering the famous acronym ‘RTFM’). It was explicitly mentioned in the manual that passing a pytz time zone to the datetime constructor (as I did above) ‘“does not work” with pytz for many timezones’. One has to use the pytz localize method or the standard astimezone method of datetime.
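For the record, the correct usage looks like the following (localize attaches the right offset for the given date):

>>> from datetime import datetime
>>> from pytz import timezone
>>> china = timezone('Asia/Shanghai')
>>> china.localize(datetime(2017, 6, 1))
datetime.datetime(2017, 6, 1, 0, 0, tzinfo=<DstTzInfo 'Asia/Shanghai' CST+8:00:00 STD>)
>>> (china.localize(datetime(2017, 6, 1))
...  - datetime(2017, 6, 1, tzinfo=timezone('UTC'))
... ).total_seconds()
-28800.0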

As tzlocal and tzutc from dateutil.tz fulfilled all my needs and were easy to use, I continued to use them. The fact that I got a few downvotes on StackOverflow certainly did not make me like pytz better.


When introducing apscheduler to our project, we noticed that it required the time zone to be provided by pytz, ruling out the use of dateutil.tz. I wondered what was special about pytz. I also became aware of a Python package called tzlocal, which is able to provide a pytz time zone conforming to the local system settings. More searching and reading revealed facts that I had missed so far:

  • The Python datetime object does not store or handle daylight-saving status. Adding a timedelta to it does not alter its time zone information, and can result in an invalid local time (say, adding one day to the last day of daylight-saving time does not result in a datetime in standard time).
  • The time zone provided by dateutil.tz does not handle all corner cases. E.g. it does not know that Russia observed all-year daylight-saving time from 2012 to 2014, and it does not know that China observed daylight-saving time from 1986 to 1991.
  • The pytz localize and normalize methods can handle all these complexities, and this is partly the reason why pytz requires people to use its localize method instead of passing the time zone to the datetime constructor (see the sketch after this list).
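
For example, here is a minimal sketch of the localize/normalize dance, with a zone that does observe daylight-saving time (US/Eastern; the dates are mine, chosen for illustration):

from datetime import datetime, timedelta
from pytz import timezone
eastern = timezone('US/Eastern')
# 2018-11-03 12:00 is in DST (EDT, UTC-04:00); DST ends at 02:00
# the next morning
dt = eastern.localize(datetime(2018, 11, 3, 12, 0))
shifted = dt + timedelta(days=1)   # arithmetic keeps the old offset
print(shifted)                     # 2018-11-04 12:00:00-04:00, an invalid local time
print(eastern.normalize(shifted))  # 2018-11-04 11:00:00-05:00, corrected to EST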

So pytz can actually do more, and correctly. I can do things like finding out in which years China observed daylight-saving time:

from datetime import datetime, timedelta
from pytz import timezone
china = timezone('Asia/Shanghai')
utc = timezone('UTC')
expect_diff = timedelta(hours=8)
for year in range(1980, 2000):
    dt = datetime(year, 6, 1)
    if utc.localize(dt) - china.localize(dt) != expect_diff:
        print(year)

It is now clear to me that the pytz-style time zone is necessary when apscheduler handles a past or future local time.
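
For illustration, a minimal sketch of the combination (assuming APScheduler 3.x; the job itself is hypothetical):

from apscheduler.schedulers.background import BackgroundScheduler
import tzlocal

def tick():  # a hypothetical job
    print('tick')

# tzlocal provides a pytz time zone conforming to the system
# settings, which is exactly what apscheduler wants
scheduler = BackgroundScheduler(timezone=tzlocal.get_localzone())
scheduler.add_job(tick, 'cron', hour=8)
scheduler.start()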


A few benchmarks of the related functions in IPython (not that they are very important):

from datetime import datetime
import dateutil.tz
import pytz
import tzlocal
dateutil_utc = dateutil.tz.tzutc()
dateutil_local = dateutil.tz.tzlocal()
pytz_utc = pytz.utc
pytz_local = tzlocal.get_localzone()
%timeit datetime.utcnow()
310 ns ± 0.405 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit datetime.now()
745 ns ± 1.65 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit datetime.now(dateutil_utc)
924 ns ± 0.907 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit datetime.now(pytz_utc)
2.28 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit datetime.now(dateutil_local)
17.4 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit datetime.now(pytz_local)
5.54 µs ± 11.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

My final recommendations:

  • One should consider using naïve UTC everywhere, as naïve datetimes are easy and fast to work with.
  • The next best is using offset-aware UTC. Both dateutil.tz and pytz can be used in this case without any problems (see the example after this list).
  • In all other cases, pytz (as well as tzlocal) is preferred, but one should beware of the peculiar behaviour of pytz time zones.
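
To illustrate the second recommendation: either of the following produces a correct offset-aware UTC timestamp, as UTC has no DST transitions to trip over:

from datetime import datetime
from dateutil.tz import tzutc
import pytz

datetime.now(tzutc())   # dateutil.tz
datetime.now(pytz.utc)  # pytz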

25x Performance Boost in Two Hours

Our system has a find_child_regions API, which, as the name indicates, can find subregions of a region up to a certain level. It needs to look up two MongoDB collections, combine the data in a certain structure, and return the result in JSON.

One day, it was reported that the API was slow for big data sets. Tests showed that it took more than 50 seconds to return close to 6000 records. Er . . . that means the average processing speed is only about 100 records a second—not terribly slow, but definitely not ideal.

When there is a performance problem, a profiler is always your friend.1 Profiling quickly revealed that a database read function was called about twice as many times as there were returned records, and that it occupied the biggest chunk of time. The reason was that the function first found all the IDs of the regions to return, and then read all the data and generated the result. Since the data had already been read once when the IDs were found, they could be saved and reused. I had to write a new function, which resembled the function that returned region IDs, but returned objects containing all the data read instead (we had such a class already). I also needed to split the result-generating function into two, so that it could accept either the region IDs or the data objects. (I could not change these functions directly, as they had many users other than find_child_regions; changing all of them at once would have been both risky and unnecessary.)

In about 30 minutes, this change generated the expected improvement: call time was shortened to about 30 seconds. A good start!

While the improvement percentage looked nice, the absolute time taken was still a bit long. So I continued to look for further optimization chances.

Seeing that database reading was still the bottleneck and the database read function was still called for each record returned, I thought I should try batch reading. Fortunately, I found I only needed to change one function. Basically, I needed to change something like the following

result = []
for x in xs:
    object_id = f(x)
    obj = get_from_db(object_id, …)
    if obj:
        result.append(obj)
return result

to

object_ids = [f(x) for x in xs]
return find_in_db({"_id": {"$in": object_ids}}, …)

I.e. in that specific function, all data of one level of subregions were read in one batch. Getting four levels of subregions took only four database reads, instead of 6000. This reduced the latency significantly.

In 30 minutes, the call time was again reduced, from 30 seconds to 14 seconds. Not bad!

Again, the profiler showed that database reading was still the bottleneck. I made more experiments, and found that the data object could be sizeable, whereas we did not always need all data fields. We might only need, say, 100 bytes from each record, but the average size of each region was more than 50 KB. The functions involved always read the full record, something equivalent to the traditional SQL statement ‘SELECT * FROM ...’. It was convenient, but not efficient. MongoDB APIs provided a projection parameter, which allowed callers to specify which fields to read from the collection, so I tried it. We had the infrastructure in place, and it was not very difficult. It took me about an hour to make it fully work, as many functions needed to be changed to pass the (optional) projection/field names around. When it finally worked, the result was stunning: if one only needed the basic fields about the regions, the call time could be less than 2 seconds. Terrific!
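
In pymongo terms, the final shape of the query was something like the following sketch (the collection and field names are illustrative, not our real schema):

from pymongo import MongoClient

db = MongoClient().regiondb      # hypothetical database name
object_ids = [f(x) for x in xs]  # as in the earlier loop
# One batch read with $in, fetching only the needed fields via the
# projection parameter ("_id" is included by default)
regions = list(db.regions.find(
    {'_id': {'$in': object_ids}},
    {'name': True, 'level': True},
))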

While Python is not a performant language, and I still like C++, I am glad that Python was chosen for this project. The performance gain from using C++ would have been negligible when the call time was more than 50 seconds, and still small after I had brought the call time below 2 seconds. Meanwhile, it would have been simply impossible for me to refactor the code and achieve the same performance in two hours had the code been written in C++. I highly doubt that I could have finished the job even in a full day; I would probably have been fighting with the compiler and the type system most of the time, instead of focusing on the logic and testing.

Life is short—choose your language wisely.


  1. Being able to profile Python programs easily was actually the main reason I purchased a professional licence of PyCharm, instead of just using the Community Edition. 

Pipenv and Relocatable Virtual Environments

Pipenv is a very useful tool to create and maintain independent Python working environments. Using it feels like a breeze. There are enough online tutorials about it, and I will only talk about one specific thing in this article: how to move a virtual environment to another machine.

The reason I need to make virtual environments movable is that our clients do not usually allow direct Internet access in production environments, so we cannot install packages from online sources on production servers. They also often enforce a certain directory structure. So we need to prepare the environment on a test machine, and it would be better if we did not need to worry about where to put the result on the production server. Virtual environments, especially with the help of Pipenv, seem to provide a nice and painless way of achieving this effect, if we can just make the result of pipenv install movable (or, to use the virtualenv term, relocatable).

virtualenv is already able to make most of the virtual environment relocatable. When working with Pipenv, it can be as simple as

virtualenv --relocatable `pipenv --venv`

There are two problems, though:

  • The activate script is not made relocatable: it still hardcodes the absolute path of the virtual environment in the VIRTUAL_ENV variable.
  • The lib64 symbolic link inside the environment is absolute, so it breaks once the environment is moved.1

They are not difficult to solve, and we can conquer them one by one.

As pointed out in the discussion of the relevant virtualenv issue, one only needs to replace one line in activate to make it relocatable. What is originally

VIRTUAL_ENV="/home/yongwei/.local/share/virtualenvs/something--PD5l8nP"

should be changed to

VIRTUAL_ENV=$(cd $(dirname "$BASH_SOURCE"); dirname `pwd`)

As the new line derives VIRTUAL_ENV from the location of the activate script itself, the hardcoded path is no longer needed. To be on the safe side, I would look for exactly the original line and replace it, so some sed tricks are needed. I also needed to take care of the differences between BSD sed and GNU sed, but that was a problem I had already solved before.

The second problem is even easier. Creating a new relative symlink solves the problem.

I’ll share the final result here: a simple script that makes a virtual environment relocatable and creates a tarball from it. The archive name carries a ‘-venv-platform’ suffix, but the archive does not include a root directory, so keep this in mind when you unpack the tarball.

#!/bin/sh

case $(sed --version 2>&1) in
  *GNU*) sed_i () { sed -i "$@"; };;
  *) sed_i () { sed -i '' "$@"; };;
esac

sed_escape() {
  echo "$1" | sed -e 's/[]\/$*.^[]/\\&/g'
}

VENV_PATH=`pipenv --venv`
if [ $? -ne 0 ]; then
  exit 1
fi
virtualenv --relocatable "$VENV_PATH"

VENV_PATH_ESC=`sed_escape "$VENV_PATH"`
RUN_PATH=`pwd`
BASE_NAME=`basename "$RUN_PATH"`
PLATFORM=`python -c 'import sys; print(sys.platform)'`
cd "$VENV_PATH"
sed_i "s/^VIRTUAL_ENV=\"$VENV_PATH_ESC\"/VIRTUAL_ENV=\$(cd \$(dirname \"\$BASH_SOURCE\"); dirname \`pwd\`)/" bin/activate
[ -h lib64 ] && rm -f lib64 && ln -s lib lib64
tar cvfz "$RUN_PATH/$BASE_NAME-venv-$PLATFORM.tar.gz" .

After running the script, I can copy the resulting tarball to another machine running the same OS, unpack it, and then either use the activate script or set the PYTHONPATH environment variable to make my Python program work. Problem solved.
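
For example, deployment on the production server can then be as simple as (paths and names hypothetical):

mkdir -p /opt/myapp/venv
tar xzf myapp-venv-linux2.tar.gz -C /opt/myapp/venv
. /opt/myapp/venv/bin/activate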

A last note: I have not touched activate.csh and activate.fish, as I do not use them. If you did, you would need to update the script accordingly. That would be your homework as an open-source user. 😼


  1. I tried removing it, and Pipenv was very unhappy. 

Python yield and C++ Coroutines

Back in 2008, an old friend challenged me with a programming puzzle when we both attended a wedding. He later gave a solution in Python. Compared with my C++ solution, the Python one had about half the lines of code. I was not smart enough to begin learning Python then; instead, I put an end to my little interest in Python with a presentation at C++ Conference China 2009, in which I compared the basic constructs and concluded they were equivalent, not realizing that Python had more to offer than that trivial programming exercise showed.

Fast-forward to today (2016): I am really writing some (small) programs in Python. I have begun to appreciate Python more and more. For many tasks, the performance loss in executing the code is negligible, but the productivity boost is huge. I have also realized that there are constructs in Python that are not easily reproducible in other languages. Generator/yield is one of them.

The other day, I was writing a small program to sort hosts based on dot-reversed order so as to group the host names in a more reasonable order (regarding ‘www.wordpress.com’ as ‘com.wordpress.www’). I quickly came up with a solution in Python. The code is very short and I can show it right here:

def backsort(lines):
    result = {}
    for line in lines:
        result['.'.join(reversed(line.split('.')))] = line
    return map(lambda item: item[1],
               sorted(result.items()))

Of course, we can implement a similar function in C++11. We will immediately find that there are no standard implementations for split and join (see Appendix below for my implementation). Regardless, we can write some code like:

template <typename C>
vector<string> backsort(C&& lines)
{
    map<string, string> rmap;
    for (auto& line : lines) {
        auto split_line = split(line, '.');
        reverse(split_line.begin(), split_line.end());
        rmap[join(split_line, '.')] = line;
    }
    vector<string> result(rmap.size());
    transform(rmap.begin(), rmap.end(), result.begin(),
              [](const pair<string, string>& pr)
              {
                  return pr.second;
              });
    return result;
}

Even though it has twice the non-trivial lines of code and is a function template, there is immediately something Python can do readily but C++ cannot. I can give a Python file object (like sys.stdin) directly to backsort, and the for loop will iterate through the file content.1 This is because the Python file object implements the iterator protocol over lines of text, but the C++ istream does not do anything similar.

Let us forget this C++ detail, and focus on the problem. My Python code accepts an iterator, and ‘backsorts’ all the input lines. Can we make it process multiple files (like the cat command line), without changing the backsort function?

Of course it can be done. There is a traditional way, and there is a smart way. The traditional way is to write a class that implements the iterator protocol (which can be readily modelled in C++):

class cat:
    def __init__(self, files):
        self.files = files
        self.cur_file = None

    def __iter__(self):
        return self

    def next(self):
        while True:
            if self.cur_file:
                line = self.cur_file.readline()
                if line:
                    return line.rstrip('\n')
                self.cur_file.close()
            if self.files:
                self.cur_file = open(self.files[0])
                self.files = self.files[1:]
            else:
                raise StopIteration()

We can then cat files with the following lines:

if __name__ == '__main__':
    if sys.argv[1:]:
        for line in cat(sys.argv[1:]):
            print(line)

Using yield, we can reduce the 18 lines of code of cat to only 5:

def cat(files):
    for fn in files:
        with open(fn) as f:
            for line in f:
                yield line.rstrip('\n')

There is no more bookkeeping of the current file and the unprocessed files, and everything is wrapped in simple loops. Isn’t that amazing? I actually learnt about the concept before (in C#), but never used it in real code, perhaps because I was too much framed by existing code that used callbacks, the observer pattern, and the like. Those ‘patterns’ now look ugly when compared to the simplicity of generators.
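
As a bonus, the generator version composes with backsort unchanged, answering the question raised above (this is essentially what main does in the appendix):

if __name__ == '__main__':
    import sys
    for line in backsort(cat(sys.argv[1:])):
        print(line)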

Here comes the real challenge for C++ developers: Can we do the same in C++? Can we do something better than inelegant callbacks?2


My investigations so far indicate the following: No C++ standards (up to C++14) support such constructs, and there is no portable way to implement them as a library.

Are we doomed? No. Apart from standardization efforts regarding coroutines in C++ (coroutine is the ancient name, dating from 1958, for a superset of generators),3 there have been at least five cross-platform implementations for C++:

  • The unofficial Boost.Coroutine by Giovanni P. Deretta (2006), compatible with Windows, Linux, and maybe a few Unix variants (tested not working on OS X); apparently abandoned.4
  • The official Boost.Coroutine by Oliver Kowalke (2013), compatible with ARM, MIPS, PPC, SPARC, x86, and x86-64 architectures.
  • The official Boost.Coroutine2 by Oliver Kowalke (2015), compatible with the same hardware architectures but only C++ compilers/code conformant to the C++14 standard.
  • Mordor by Mozy (2010), compatible with Windows, Linux, and OS X, but seemingly no longer maintained.
  • CO2 by Jamboree (2015), supporting stackless coroutines only, using preprocessor tricks, and requiring C++14.

As Boost.Coroutine2 looks modern, is well-maintained, and is very much comparable to the Python constructs, I will use it in the rest of this article.5 It hides all the platform details with the help of Boost.Context. Now I can write code simply as follows for cat:

typedef boost::coroutines2::coroutine<const string&>
    coro_t;

void cat(coro_t::push_type& yield,
         int argc, char* argv[])
{
    for (int i = 1; i < argc; ++i) {
        ifstream ifs(argv[i]);
        for (;;) {
            string line;
            if (getline(ifs, line)) {
                yield(line);
            } else {
                break;
            }
        }
    }
}

int main(int argc, char* argv[])
{
    using namespace std::placeholders;
    for (auto& line : coro_t::pull_type(
             boost::coroutines2::fixedsize_stack(),
             bind(cat, _1, argc, argv))) {
        cout << line << endl;
    }
}

Is this simple and straightforward? The only thing that is not quite intuitive is the detail that the constructor of pull_type expects the second argument to be a function object that takes a push_type& as the only argument. That is why we need to use bind to generate it—a lambda expression being the other alternative.

I definitely believe that being able to write coroutines is a big step forward in making C++ more expressive. I can foresee many tasks simplified, like recursive parsing. I believe this will prove a very helpful addition to the C++ arsenal. I only wish we could see it standardized soon.

Appendix

The complete backsort code in Python:

#!/usr/bin/env python
#coding: utf-8

import sys

def cat(files):
    for fn in files:
        with open(fn) as f:
            for line in f:
                yield line.rstrip('\n')

def backsort(lines):
    result = {}
    for line in lines:
        result['.'.join(reversed(line.split('.')))] = line
    return map(lambda item: item[1],
               sorted(result.items()))

def main():
    if sys.argv[1:]:
        result = backsort(cat(sys.argv[1:]))
    else:
        result = backsort(map(
                lambda line: line.rstrip('\n'), sys.stdin))
    for line in result:
        print(line)

if __name__ == '__main__':
    main()

The complete backsort code in C++:

#include <assert.h>         // assert
#include <algorithm>        // std::reverse/transform
#include <fstream>          // std::ifstream
#include <functional>       // std::bind
#include <iostream>         // std::cin/cout
#include <map>              // std::map
#include <string>           // std::string
#include <vector>           // std::vector
#include <boost/coroutine2/all.hpp>

using namespace std;

typedef boost::coroutines2::coroutine<const string&>
    coro_t;

void cat(coro_t::push_type& yield,
         int argc, char* argv[])
{
    for (int i = 1; i < argc; ++i) {
        ifstream ifs(argv[i]);
        for (;;) {
            string line;
            if (getline(ifs, line)) {
                yield(line);
            } else {
                break;
            }
        }
    }
}

vector<string> split(const string& str, char delim)
{
    vector<string> result;
    string::size_type last_pos = 0;
    string::size_type pos = str.find(delim);
    while (pos != string::npos) {
        result.push_back(
            str.substr(last_pos, pos - last_pos));
        last_pos = pos + 1;
        pos = str.find(delim, last_pos);
    }
    // Push the last field too, so that input without any
    // delimiter results in one whole-string field
    result.push_back(str.substr(last_pos));
    return result;
}

template <typename C>
string join(const C& str_list, char delim)
{
    string result;
    for (auto& item : str_list) {
        result += item;
        result += delim;
    }
    if (result.size() != 0) {
        result.resize(result.size() - 1);
    }
    return result;
}

template <typename C>
vector<string> backsort(C&& lines)
{
    map<string, string> rmap;
    for (auto& line : lines) {
        auto split_line = split(line, '.');
        reverse(split_line.begin(), split_line.end());
        rmap[join(split_line, '.')] = line;
    }
    vector<string> result(rmap.size());
    transform(rmap.begin(), rmap.end(), result.begin(),
              [](const pair<string, string>& pr)
              {
                  return pr.second;
              });
    return result;
}

class istream_line_reader {
public:
    class iterator { // implements InputIterator
    public:
        typedef const string& reference;
        typedef string value_type;

        iterator() : stream_(nullptr) {}
        explicit iterator(istream& is) : stream_(&is)
        {
            ++*this;
        }

        reference operator*()
        {
            assert(stream_ != nullptr);
            return line_;
        }
        value_type* operator->()
        {
            assert(stream_ != nullptr);
            return &line_;
        }
        iterator& operator++()
        {
            getline(*stream_, line_);
            if (!*stream_) {
                stream_ = nullptr;
            }
            return *this;
        }
        iterator operator++(int)
        {
            iterator temp(*this);
            ++*this;
            return temp;
        }

        bool operator==(const iterator& rhs) const
        {
            return stream_ == rhs.stream_;
        }
        bool operator!=(const iterator& rhs) const
        {
            return !operator==(rhs);
        }

    private:
        istream* stream_;
        string line_;
    };

    explicit istream_line_reader(istream& is)
        : stream_(is)
    {
    }
    iterator begin() const
    {
        return iterator(stream_);
    }
    iterator end() const
    {
        return iterator();
    }

private:
    istream& stream_;
};

int main(int argc, char* argv[])
{
    using namespace std::placeholders;
    vector<string> result;
    if (argc > 1) {
        result = backsort(coro_t::pull_type(
            boost::coroutines2::fixedsize_stack(),
            bind(cat, _1, argc, argv)));
    } else {
        result = backsort(istream_line_reader(cin));
    }
    for (auto& item : result) {
       cout << item << endl;
    }
}

The istream_line_reader class is not really necessary, and we can simplify it with coroutines. I am including it here only to show what we have to write ‘normally’ (if we cannot use coroutines). Even if we remove it entirely, the C++ version will still have about three times as many non-trivial lines of code as the Python equivalent. It is enough proof to me that I should move away from C++ a little bit. . . .


  1. There is one gotcha: the ‘\n’ character will be part of the string. It will be handled in my solution. 
  2. Generally speaking, callbacks or similar techniques are what C++ programmers tend to use in similar circumstances, if the ‘producer’ part is complicated (otherwise the iterator pattern may be more suitable). Unfortunately, we cannot then combine the use of two simple functions like cat and backsort simultaneously. If we used callbacks, backsort would need to be modified and fragmented. 
  3. P0057 is one such effort, which is experimentally implemented in Visual Studio 2015. 
  4. According to the acknowledgement pages of the next two Boost projects, Giovanni Deretta contributed to them. So his work was not in vain. 
  5. This said, CO2 is also well-maintained, and is more efficient if only a stackless coroutine is needed. See Jamboree’s answer on StackOverflow. The difference looks small to me, and preprocessor tricks are not my favourites, so I will not spend more time on CO2 for now. 

A Small Experiment of System Scripting in Python

My main laptop is still on Mac OS X Lion (10.7). I know I am guilty of exposing my laptop to potential security risks,1 but some of my paid applications do not work on newer OS X versions without an upgrade. I am an austere person and do not want to pay the money yet. In addition, I am also a little bit nostalgic about the skeuomorphic design, though I know some day I will have to use a Mac that has the latest macOS version in order to use new applications. Anyway, I am just procrastinating now, until some sexy new laptop from Apple makes me take out my wallet, or my old laptop goes crazy.

Sorry for this verbose beginning. What I really want to whine about is that Homebrew has stopped supporting my obsolete version of OS X, and I am relying more and more on MacPorts.2 I even had to rebuild most of my ‘ports’ (the term for packages in MacPorts) because the ‘standard’ way of building ports on Lion does not use libc++, which some ports require.3 Unlike Homebrew, MacPorts does not show whether a dependency of a port is already installed. Worse, MacPorts packages often have heavy dependencies. For example, the command-line tool mkvtoolnix currently has 20 (recursive) dependencies in Homebrew, but 60 in MacPorts. My default compiler is clang-3.7, which alone has 46 dependencies. That pretty much makes the ‘port rdeps’ command useless.

A Google search showed this port command could be helpful:

port echo rdepof:PORT_NAME and not installed

However, more investigation showed there were several problems:

  • One cannot specify variants (like ‘+openmp’).
  • An option (like ‘configure.compiler=macports-clang-3.7’) can affect dependencies, but options do not have the intended effect in the ‘port echo’ command.
  • The recursion is not ‘cut’ when a port is already installed, which can result in unnecessary ports.

This problem had vexed me for some time before I finally decided to take action. Naturally, the ultimate solution was to write some code. I normally use Bash or Perl for such scripting tasks, but, as I have become more and more interested in Python recently, I decided to give Python a try as well to see how it handles such tasks.

I first wrote a Bash version for comparison purposes. It was not recursive, though (too cumbersome for Bash):

#!/bin/bash
function escape {
  printf "%s" "$1" | sed 's/[.*\[]/\\&/g'
}

INSTALLED=`port installed \
         | sed -n 's/^  \([A-Za-z_][^ ]*\).*/-e ^\1$/p'`
INSTALLED_ESC=`escape "$INSTALLED"`
port deps "$@" | sed -n 's/.*Dependencies:[[:space:]]*//p' \
               | sed $'s/, /\\\n/g' \
               | sort \
               | uniq \
               | grep -v $INSTALLED_ESC

Let me explain the code quickly (assuming you are familiar with the basic use of Bash and common Unix tools). ‘port installed’ returns the installed ports; every line beginning with two spaces contains a port name followed by other information (like the version). I retrieve the port names, and wrap each of them in ‘-e ^…$’. Since they will be used with grep, special characters need to be escaped (practically only ‘.’). I then invoke ‘port deps’ with the command-line arguments, look for lines containing ‘Dependencies:’, take everything after it, split at the commas to get the individual dependencies, sort them, remove duplicates, and filter out the installed ports from the result.

It basically works, and the code is succinct. It is also far from elegant, and quite error-prone. A Bash function feels like a hack. The quotation rules are tricky (when invoking escape, $INSTALLED must be quoted; but when invoking grep, $INSTALLED_ESC must not be quoted). Escaping can easily get problematic when used inside quotation marks. And so on. . . . It is difficult to imagine people can write Bash scripts without some trial and error, even though only a few lines are written.

I knew some Python but was not very familiar with it, so I was basically writing while Googling. I got the first version, sort of an equivalent of the Bash script, in about two hours:

#!/usr/bin/env python
#coding: utf-8

import re
import sys
import subprocess

# Gets command output as a list of lines
def popen_readlines(cmd):
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    p.wait()
    if p.returncode != 0:
        raise subprocess.CalledProcessError(p.returncode, \
                                            cmd)
    else:
        return map(lambda line: line.rstrip('\n'), \
                   p.stdout.readlines())

# Gets the port name from a line like
# "  gcc6 @6.1.0_0 (active)"
def get_port_name(port_line):
    return re.sub(r'^  (\S+).*', r'\1', port_line)

# Gets installed ports as a set
def get_installed():
    installed_ports_lines = \
            popen_readlines(['port', 'installed'])[1:]
    installed_ports = \
            set(map(get_port_name, installed_ports_lines))
    return installed_ports

# Gets dependencies for the given port list (which may
# contain options etc.), as a list, excluding items in
# ignored_ports
def get_deps(ports, ignored_ports):
    deps_raw = popen_readlines(['port', 'deps'] + ports)
    uninstalled_ports = []
    for line in deps_raw:
        if re.search(r'Dependencies:', line):
            deps = re.sub(r'.*Dependencies:\s*', '', \
                          line).split(', ')
            uninstalled_ports += \
                [x for x in deps if x not in ignored_ports]
            ignored_ports |= set(deps)
    return uninstalled_ports

def main():
    if sys.argv[1:]:
        installed_ports = get_installed()
        uninstalled_ports = get_deps(sys.argv[1:], \
                                     installed_ports)
        for port in uninstalled_ports:
            print port

if __name__ == '__main__':
    main()

A few things immediately caught my attention:

  • The code is apparently more verbose than Bash or Perl, but arguably also clearer and more readable.
  • Strings are ubiquitous in Bash, but lists are ubiquitous in Python. The older Python ways of capturing command output (like os.popen and the commands module) are deprecated in favour of the subprocess routines, which accept the command line as a list.
  • The set is a built-in type and is a breeze to use.
  • I/O is not as easy as in Perl (thinking of <> and chomp now), but can be easily simplified with helper functions, as composability is very good.
  • List comprehension and map are very helpful to keep the code concise.

That is not all. The real fun was that it was easy to make the code work recursively on all depended ports. I only needed to add or change seven lines of code, at the beginning and end of get_deps:

def get_deps(ports, ignored_ports):
    # New code to end the recursion
    if ports == []:
        return []

    # This part is not changed
    deps_raw = popen_readlines(['port', 'deps'] + ports)
    uninstalled_ports = []
    for line in deps_raw:
        if re.search(r'Dependencies:', line):
            deps = re.sub(r'.*Dependencies:\s*', '', \
                          line).split(', ')
            uninstalled_ports += \
                [x for x in deps if x not in ignored_ports]
            ignored_ports |= set(deps)

    # New code to call recursively and collect the result
    results = []
    for port in uninstalled_ports:
        results.append(port)
        results += get_deps([port], ignored_ports)
    return results

The output did not show any indentation yet, and I later found another problem: command-line items like variants and options were not separated from real port names. The improved final code looks as follows:

#!/usr/bin/env python
#coding: utf-8

import re
import sys
import subprocess

# Gets command output as a list of lines
def popen_readlines(cmd):
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    p.wait()
    if p.returncode != 0:
        raise subprocess.CalledProcessError(p.returncode, \
                                            cmd)
    else:
        return map(lambda line: line.rstrip('\n'), \
                   p.stdout.readlines())

# Gets the port name from a line like
# "  gcc6 @6.1.0_0 (active)"
def get_port_name(port_line):
    return re.sub(r'^  (\S+).*', r'\1', port_line)

# Gets installed ports as a set
def get_installed():
    installed_ports_lines = \
            popen_readlines(['port', 'installed'])[1:]
    installed_ports = \
            set(map(get_port_name, installed_ports_lines))
    return installed_ports

# Gets port names from items that may contain version
# specifications, variants, or options
def get_ports(ports_and_specs):
    requested_ports = set()
    for item in ports_and_specs:
        if not (re.search(r'^[-+@]', item) or \
                re.search(r'=', item)):
            requested_ports.add(item)
    return requested_ports

# Gets dependencies for the given port list (which may
# contain options etc.), as a list of tuples (combining
# with level), excluding items in ignored_ports
def get_deps(ports, ignored_ports, level):
    if ports == []:
        return []

    deps_raw = popen_readlines(['port', 'deps'] + ports)
    uninstalled_ports = []
    for line in deps_raw:
        if re.search(r'Dependencies:', line):
            deps = re.sub(r'.*Dependencies:\s*', '', \
                          line).split(', ')
            uninstalled_ports += \
                [x for x in deps if x not in ignored_ports]
            ignored_ports |= set(deps)

    port_level_pairs = []
    for port in uninstalled_ports:
        port_level_pairs += [(port, level)]
        port_level_pairs += get_deps([port], \
                                     ignored_ports, \
                                     level + 1)
    return port_level_pairs

def main():
    if sys.argv[1:]:
        ports_and_specs = sys.argv[1:]
        ignored_ports = get_installed() | \
                        get_ports(ports_and_specs)
        uninstalled_ports = get_deps(ports_and_specs, \
                                     ignored_ports, 0)
        for (port, level) in uninstalled_ports:
            print ' ' * (level * 2) + port

if __name__ == '__main__':
    main()

I would say I am very happy, even excited, with the experiment results. No wonder Python has been a great success, despite being verbose and having a slightly weird syntax :-). I guess I would do more Python in the future.

By the way, the code in this article is in Python 2. Python 3 is stricter and even more verbose: I do not see the benefits of using it for system scripting (yet).


  1. Not really. My MacBook Pro has the firewall turned on, it is behind the home router nearly at all times, and I do not visit strange web sites—not with Safari at least. 
  2. Honestly, it is not the fault of Homebrew, or even Apple. However, I do miss the support lifecycle that Microsoft provided for Windows XP. 
  3. For more details, Using libc++ on older system explains the why and how.