Saturday, September 19, 2009

Linux Builds Part II: The Acceleration Incantation

Ubuntu offers a system monitor that can show graphs of how system resources are being used. It's very interesting to turn on all the graphs on and build a large project without fiddling with the build. You'll see some interesting things.

First, the CPU usage will jump up and down, and so will the disk activity - but you'll rarely see them both high at the same time. That's because the compiler typically operates in three phases on a source file:
  1. It reads the source file and all the headers. This is disk-intensive but not CPU-intensive.
  2. Then it does all the usual compiling stuff like lexical analysis and parsing and code generation and optimizing. This makes heavy use of the CPU and RAM, but doesn't hit the hard disk much.
  3. Then it writes the object file out to disk. Again, the disk is very busy, and the CPU just waits around.
So at any one time, the compiler is making good use of the CPU or the disk, but not both. If you could keep them both busy, things would go faster.

The answer to this is parallel builds. Common build tools like make and jam offer command line options to compile multiple files in parallel, using separate compiler instances in separate processes. That way, if one compiler process is waiting for the disk, the Linux kernel will give the CPU to another compiler process that's waiting for the CPU. Even on a single-CPU, single-core computer, a parallel build will make better use of the system and speed things up.

Second, if you're running on a multi-CPU or multi-core system and not doing much else, even at its peak, CPU usage won't peg out at the top of the panel. That's because builds are typically sequential, so they only use one core in one CPU, and any other compute power you have is sitting idle. If you could make use of those other CPUs/cores, things would go faster. And again, the answer is parallel builds.

Fortunately, the major C/C++ build systems support parallel builds, including GNU make, jam, and SCons. In particular, GNU make and jam both offer the "-j X" parameter, where X is the number of parallel jobs to compile at the same time.
The graph above shows what I would generally expect the results of parallel builds to be on a particular hardware configuration, going from left to right.
  • When running with one compile at a time, sequentially, system resources are poorly utilized, so a build takes a long time.
  • As the number of compiles running in parallel increases, the wall time for the build drops, until you hit a minimum. This level of parallelization provides the balanced utilization of CPU, disk, and memory we're looking for. We'll call this number of parallel compiles N.
  • As the number of compiles passes N, the compile processes will increasingly contend for system resources and become blocked, so the build time will rise a bit.
  • Then as the number of parallel compiles continues to rise, more and more of the compile processes will be blocked at any time, but roughly N of them will still be operating efficiently. So the build time will flatten out, and asymptotically approach some limit.
Anticipating further posts, that is what you actually see, except the rise after the minimum is tiny, often to the point where the times in the flat tail are only a tiny bit higher than the minimum time.

A Brief Aside On Significance

In any physical system, there's always some variation in measurements, and the same is true of computer benchmarks. So an important question in this kind of experimentation is: when you see a difference, is it meaningful or just noise?

To answer that, I ran parallelized benchmarks on Valentine (a two-core Sony laptop) and Godzilla (an eight-core Mac Pro). In each case, the Linux kernel was built twenty times with the same settings. Here are the results:
  • Valentine, cached build, j=3. Average 335.91 seconds, standard deviation (sigma) 2.15, or 0.64% of the average.
  • Valentine, non-cached build, j=3. Average 340.09 seconds, standard deviation 4.22, or 1.24% of the average.
  • Godzilla, non-cached build, j=12. Average 67.82 seconds, standard deviation 0.54, or 0.79% of the average.
Generally speaking, a difference of one sigma or less is probably not significant, while a difference of two sigma or more is probably significant. So I'll generally use the rule of thumb, based on the above, that differences between individual values of 2% or less are probably not significant and may easily be due to experimental error (noise).

Linux Build Optimization I: The Need for Speed

gcc may be many things, but it most certainly is slow. Anybody who's worked with a really fast C/C++ compiler - like the late, lamented Metrowerks Codewarrior for MacOS and Windows - will be happy to tell you that.

But build speed matters a lot! As Joel Spolsky points out:
If your compilation process takes more than a few seconds, getting the latest and greatest computer is going to save you time. If compiling takes even 15 seconds, programmers will get bored while the compiler runs and switch over to reading The Onion, which will suck them in and kill hours of productivity.
I currently work with a team on a Linux-based system involving about twenty million lines of C/C++. Building everything using the default settings on our normal development hardware takes hours. So if someone changes a file down in the bowels of some low-level component half the system relies on, here's how we spend our time:

What can we do?

Here are some options for speeding up the build cycle. I think they would all be great if everybody could do them, but only one of them is universally applicable:

Change to a faster compiler that will do the job
That would be lovely if there were one - but unfortunately, the easily-available option I know of for replacing gcc is the combination of LLVM and clang. I have high hopes for that compiler system someday, but at present, real-world benchmarks don't indicate it's hugely faster at compiling than gcc, and clang doesn't support critical C++ features many programmers need.

Throw money at hardware
This would also be lovely if everyone were in a position to do that. But if you just don't have much spare money, this is a non-starter. And even if you do have a fair bit of change to spend on hardware, what I plan to discuss will still help you improve your use of it.

Be clever
Now we get to the meat of these posts: taking your existing gcc compiler, and your existing hardware, and tweaking things so that the build cycle is quicker. The best case would be without spending a dime, and the worst case would involve spending very little money.

The Ground Rules

These posts will largely consist of a series of experiments. Each experiment will involve applying a technique that might accelerate builds, benchmarking it, and analyzing the results.

Unless otherwise specified, the tests will involve:
  • Building the Linux kernel using the default x86 configuration. (If you look at the various components used for the distcc benchmarks, the Linux kernel seems like a pretty representative large set of code.)
  • Running under Ubuntu 9.04 using gcc version 4.3.3.
  • The benchmark is the first thing done after rebooting the computer, with nothing but a couple of terminal windows running.
  • The benchmark script, by default, avoids unrealistically fast builds due to disk caching from previous passes. (It does this by unpackaging the kernel tarball on every build pass.) It also has options to support disk caching, and allow adjustment of build options such as parallelization.
The results are all based on "wall time" - what you'd see on a clock on the wall - because I mainly care about not wasting my time waiting for builds.

Major Factors That Determine Build Speed

When you're doing a build, the main things that happen are:
  • The CPU calculates things and makes decisions
  • The hard drive reads and writes files
  • Memory-based data is read from and written to RAM.
The fastest build you could do on a particular set of hardware would strike a balance among the CPU, hard drive, and RAM, so that all of them are constantly busy, with none of them ever waiting any of the others.

Realistically, you will never achieve that 100% utilization on all three components. Even if you adjusted the system so that file foo.c would compile at 100% utilization on all three, if file bar.c used more or larger header files, compiling it might not achieve 100% CPU utilization because the system would spend more time reading header files from the disk, and the CPU would have to wait for that. Also, settings that are optimal for the compiler might not be optimal for the linker. So all we can aim at is an overall good build time.

There are other resources one can apply to compiles which I also plan to address - in particular, underutilized CPU horsepower out on the network via distcc.

So on to the next post... and I would love to get feedback and suggestions!