This is a short story about performance, profiling and optimization.

First, some backstory. For my current job I’m doing quantum chemical calculations on proteins. If this doesn’t mean anything to you, don’t worry, it’s not important for this article. What matters is that I spend a lot of time working with data files that describe protein structures. The most popular file format is called PDB, and the part of it that interests me is the information about atoms: atom names, atom types, atom positions and so on.

After I started learning Rust I decided to write a command line tool to help me parse and process these files so I wouldn’t have to keep doing it manually. For one or two files that’s totally fine, but for bulk processing it quickly becomes unwieldy.

I won’t go into the details of writing this tool; I’ll just say that I started writing my own parser for the file format but discovered after a while that there was already a library out there doing the same thing, just better.

So I decided to rely on this library instead of what I had written myself. Every once in a while I left comments and feature requests, and eventually I also started contributing a bit. All in all a quite pleasant experience, as I felt I was learning a lot in the process.

Everything was fine until I came across something strange. I read that someone had implemented a similar library in Haskell (which is not weird in itself; many languages have parsers for this file format), but it claimed to be exceedingly fast, needing only ~8 s to parse a really big file. I was intrigued and tried parsing the same file with the Rust library I was using myself. After all, Rust is supposed to be fast, right?

Turns out, I had to wait about 20 minutes for the parsing to finish. All things considered, this seemed unsatisfactory; something was clearly wrong here. We couldn’t lose to Haskell, could we? But in all seriousness: although this is obviously not an apples-to-apples comparison, it seemed to me to indicate some deficiency in our parsing algorithm.

Okay, but what to do now? Reading through thousands of lines of code and looking for issues by hand was obviously not a good idea, not least because humans aren’t necessarily good at optimizing code. That goes double for me, since I’m not experienced enough for this kind of thing yet. What I consider inelegant or inefficient may actually be cheap or even completely optimized away by the compiler. On the other hand, something I regard as good code may in fact be quite costly. How can I find out for sure?

I’ll share some words of wisdom that I was given a while ago:

“When in doubt, measure.”

So that’s what I resolved to do. This is where profiling comes into play. It’s, roughly speaking, a way of measuring where in your code the computer spends its time during execution. In reality it’s a quite involved art of which I have only scratched the surface.

I decided not to let that deter me from finding out what was wrong with the library. I looked around for tools and instructions on how to use them and came across two things that proved useful: flamegraph and perf.

flamegraph is a tool that uses perf (a Linux profiling tool) as a backend and builds graphs showing the call stack of your program and how much time each function call takes relative to the total execution time.

This serves as a first glance at what the computer is doing when running your executable, so I installed it and ran it.

Side note: I installed flamegraph with cargo, the Rust package manager, but called it with sudo because the underlying perf needs elevated permissions. For that to work you either need to configure this or call flamegraph with its full path so root can find it (as the binary will usually reside in the home folder of the user who installed it). Also, in order to get more useful output, it’s a good idea to include debug symbols in the binary you want to investigate. For Rust this can, e.g., be done via the following snippet in Cargo.toml:

[profile.release]
debug = true

Just recompile with this and you’re done.

This is what the resulting graph then looked like:

Flamegraph before optimization

Note that I used the --flamechart option, which means that the x axis in this graph actually represents the passage of time, i.e. the program is executed from left to right. This is not the default.

What you can see here is the call stack, starting from the bottom. As you work your way up the stack you eventually reach the point where your binary enters the scene, in this case in the line saying pdbtest::main. As you can see, the call stack seems very sparse. The function that deals with parsing the test PDB file I used for this binary is called open_pdb, and it takes up all the computing time. Since calling it is the only thing the test binary does, this is hardly surprising. Working our way up the stack, we have the following functions, each of which calls the next one in turn:

  • open_pdb_raw
  • Model::add_atom
  • Chain::add_atom
  • core::tuple::<impl core::cmp::PartialEq for (A,B)>::eq

This last function sits at the top of its stack and takes up 80-90% of the total time. Clearly we have a bottleneck here! But what code is responsible for this? Let’s have a look at the Chain::add_atom function:

pub fn add_atom(
    &mut self,
    new_atom: Atom,
    residue_id: (isize, Option<&str>),
    conformer_id: (&str, Option<&str>),
) {
    let mut found = false;
    let mut new_residue = Residue::new(residue_id.0, residue_id.1, None)
        .expect("Invalid chars in Residue creation");
    let mut current_residue = &mut new_residue;
    // Search the existing residues for one with a matching id.
    for residue in &mut self.residues_mut() {
        if residue.id() == residue_id {
            current_residue = residue;
            found = true;
            break;
        }
    }
    // If no residue matched, store the freshly created one and use it instead.
    if !found {
        self.residues.push(new_residue);
        current_residue = self.residues.last_mut().unwrap();
    }

    current_residue.add_atom(new_atom, conformer_id);
}

This won’t make much sense to you without any context, so I’ll try to explain. There is a struct called Chain which contains an array of Residue structs, each of which in turn contains an array of Atom structs. The function shown here is a method on the Chain struct. Each line in the PDB file we want to parse describes an Atom and its data, including which Residue that Atom belongs to. After parsing a line, we need to figure out which Residue to add the Atom to, and this code takes care of that. It searches all Residue structs that have already been parsed. If the one the currently parsed Atom belongs to is already present, the Atom is added there. If not, a new Residue is created and the Atom is added to it.
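
To make that hierarchy a bit more concrete, here is a rough sketch of how I picture these structs. The field names and types are simplified stand-ins for illustration, not the library’s actual definitions:

// Simplified sketch of the data layout (illustrative only, not the real API).
struct Atom {
    name: String,
    position: (f64, f64, f64), // coordinates are floats, which will matter later
}

struct Residue {
    id: (isize, Option<String>),
    atoms: Vec<Atom>,
}

struct Chain {
    residues: Vec<Residue>,
}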

I hope that made some sense, but if it didn’t, don’t worry; it’s not crucial to the tale. The person who originally wrote the library had a feeling that this piece of code was the problem. He reasoned that much time is probably lost searching for a Residue to add Atoms to. By necessity, this results in a lot of pointless searching, because every time an Atom belonging to a not-yet-parsed Residue is added, the whole structure is traversed in search of something that doesn’t exist yet.

As it turns out this assessment was exactly right. The last function in the call stack seen in the flamegraph is a comparison of tuples. This is what happens in this line:

if residue.id() == residue_id {

Obviously, the enclosing for loop is executed a lot and so is this comparison. So what can be done?

An easy thing to try is reversing the order of the traversal. Most of the time a parsed Atom belongs to the last Residue that was parsed, so it would be the last item in the array of Residues, and searching from the back should save us a lot of pointless traversing. That’s easy enough to do by changing one line:

- for residue in &mut self.residues_mut() {
+ for residue in &mut self.residues.iter_mut().rev() {

I recompiled and had a look at the flamegraph again:

Flamegraph with some optimization

This looks better, but the bottleneck has only been reduced, not removed. Why is that? Well, as I wrote before, traversing the array of Residues in reverse order is only part of the solution, because each time an Atom is parsed that belongs to a Residue that has not been parsed yet, the function still traverses the entire array of present Residues, searching in vain for a match that doesn’t exist. That means we can expect one useless full search per Residue in the PDB file, and each of those searches scans all Residues parsed up to that point. For a typical file of mine the number of Residues ranges in the tens of thousands, which results in the bottleneck we are still seeing in the graph. Or, to put it in more technical terms, this function scales as O(n²) in the number of Residues. That’s not good enough.

We discussed a variety of options for how to deal with this. It seemed clear to me that we needed to check for the presence of a given Residue without traversing the whole array of present Residues, preferably in constant time. Something like that can, in principle, be achieved with sets or hashmaps, which offer membership checks in constant time. At first I tried using a HashSet, but this proved difficult because the Atom struct contains floats, which do not implement Hash, so I couldn’t simply use it. Also, sets are unordered, which is inconvenient. I tried working around this with IndexSet, which is an ordered set, but then I stumbled over the problem that you cannot get mutable references to the elements of a set in Rust (mutating an element could change its hash and corrupt the set). Unfortunately, this was a deal breaker for the functionality of the library.

At the end of the day, my collaborator came up with a nifty idea: parse everything into a hashmap, storing the Residues as keys and using the constant lookup times to keep track of what had already been parsed. Then, after everything has been parsed, that hashmap is turned into the proper arrays the library uses for its functionality. I was pretty impressed, and I’m sure I couldn’t have come up with something this elegant.

The only remaining issue was that hashmaps, too, are unordered, so everything would have to be sorted after parsing. Luckily, there is the indexmap crate (which I kind of mentioned above), which includes an ordered hashmap that can be used pretty much as a drop-in replacement for Rust’s HashMap. So that’s what I did to get rid of the need for sorting everything.
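
To illustrate the general idea, here is a minimal sketch of that approach. The types and the collect_residues function are made up and heavily simplified for this example; the library’s actual implementation looks different:

use indexmap::IndexMap;

// Simplified stand-ins for the real structs (illustrative only).
struct Atom { name: String }
struct Residue { id: (isize, Option<String>), atoms: Vec<Atom> }

// Gather atoms into an IndexMap keyed by residue id, then convert the map into
// the Vec<Residue> the rest of the code works with. The map gives (amortised)
// constant-time lookups while parsing, and IndexMap preserves insertion order,
// so no sorting pass is needed afterwards.
fn collect_residues(parsed: Vec<((isize, Option<String>), Atom)>) -> Vec<Residue> {
    let mut map: IndexMap<(isize, Option<String>), Vec<Atom>> = IndexMap::new();
    for (residue_id, atom) in parsed {
        map.entry(residue_id).or_insert_with(Vec::new).push(atom);
    }
    map.into_iter()
        .map(|(id, atoms)| Residue { id, atoms })
        .collect()
}

With something along these lines, the cost of finding the right Residue for an Atom no longer depends on how many Residues have already been parsed.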

Sweet! Time for another profiling run, I’d say. Let’s see how it turned out:

Flamegraph after optimization

Now that looks very different, and I’ll tell you why: the whole call stack corresponding to the pdbtest testing crate that you can see here is equivalent to the jagged mess that made up only the last 10-20% of the previous profiling runs. Remember that we are looking at relative timings here. This means that, apparently, we were quite thorough in eliminating the bottleneck.

A quick, non-quantitative “benchmark” confirms this. Whereas parsing my test file had taken some three seconds earlier, it now takes 300-600 ms. I think that’s impressive.

This was my first glance at profiling an application I had contributed to, and it was a fascinating experience. Before I was able to produce anything useful I spent a lot of time reading up on possible tools and skimming their documentation for simple instructions and the like. In fact, I did use perf itself for profiling as well, but realized that, albeit instructive, it didn’t provide any insight that the simple flamegraph hadn’t already, so I didn’t include an account of my experience with it here.

That’s a task for future me.