Tuesday, 1 May 2012

Linpack (HPL) results on an HPC (Beowulf-style) cluster using CentOS 6.2

In my previous post, I described how to install and run Linpack (HPL) on a two-node HPC (Beowulf-style) cluster running CentOS 6.2. In this post I will discuss some of the results from the Linpack runs that I have conducted over the past few days.

The first thing to note is that the HPL.dat file that is available after installation is of no use for extracting meaningful performance numbers, so the file needs to be edited, but how? There is an online tool that generates an HPL.dat file, and this is what I have been using for guidance on which values to use. I then varied the number of equations (the problem size, N) so that I could plot performance against problem size.

I ran the first two tests using the configuration described in my previous post, which uses Atlas. I then recompiled Linpack with MKL and reran the tests; see figure 1 below for the results.

Figure 1 - WR11C2R4 test for various problem sizes with fixed block size of 128.
The highest result for Atlas is 25.13 GFlops, whereas the highest result for MKL is 60.03 GFlops, so MKL is roughly 2.4 times faster. I was expecting a good increase in performance with MKL, but more than doubling it is extremely impressive. It's a shame that MKL is not free, but in a real cluster it's probably worth the cost.

The tool suggests that it should be possible to run a test with a problem size of ~41000. However, performance tanks beyond a problem size of 30000 for Atlas. MKL holds up better, but its performance also drops. Execution time for a problem size of 35000 was ~7000 seconds with Atlas; I did not try such a large problem size with MKL. The cause is probably memory swapping, as memory usage is higher than expected, which is something that I will need to investigate.
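To put the swapping suspicion into numbers, here is a rough sketch of how much memory the HPL matrix alone needs for a given problem size. HPL solves a dense N x N system in double precision, so the matrix takes about 8 * N^2 bytes, spread across the nodes. The 80% fraction is a commonly cited rule of thumb rather than something taken from my runs, and the 16 GiB total used in the example is a hypothetical figure, not a statement about this cluster.

```python
# Rough memory estimate for an HPL run (matrix only, double precision).
# Assumptions: 8 bytes per element; the 80% usable fraction is a rule of
# thumb; total_ram_gib=16 below is a hypothetical value for illustration.

BYTES_PER_DOUBLE = 8
GIB = 2 ** 30


def hpl_matrix_gib(n):
    """Memory needed for the N x N matrix, in GiB, across the whole cluster."""
    return n * n * BYTES_PER_DOUBLE / GIB


def max_problem_size(total_ram_gib, fraction=0.8):
    """Largest N whose matrix fits in `fraction` of the total cluster RAM."""
    usable_bytes = total_ram_gib * GIB * fraction
    return int((usable_bytes / BYTES_PER_DOUBLE) ** 0.5)


if __name__ == "__main__":
    for n in (20000, 25000, 30000, 35000, 41000):
        print(f"N={n}: {hpl_matrix_gib(n):.1f} GiB")
    print("Largest N for a hypothetical 16 GiB total:", max_problem_size(16))
```

If the matrix size for a given N gets close to the physical memory that is actually free (after the OS and other processes), swapping becomes a plausible explanation for the drop in performance.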

The second test was intended to investigate the effect of block size. I fixed the problem size (N) and varied the block size (NB); see figure 2 below.

Figure 2 - Influence of Block Size on performance
The gains from increased block size appear to top out at a block size of 168 for a problem size of 20000 and 256 for a problem size of 25000. I did run with a block size of 268, but performance was actually lower (60.1 GFlops). The netlib guidelines recommend a block size of less than 256, so it shouldn't be surprising that a bigger block size yields worse performance. Block size is a balancing act between data distribution and computational granularity.

It is interesting to note that the maximum performance (68.7 GFlops) was achieved for a problem size of 30000 and a block size of 192, although to be fair, the difference between a block size of 192 and 256 is only 4%.

Also interesting is how much the results vary for a problem size of 30000. All I can say is that the servers in the cluster don't have a separate network, and thus performance is unlikely to ever be constant.

The efficiency of the cluster is actually only 46%, which is appalling, but given the various limitations in the system it's perhaps not that surprising.
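For reference, HPL efficiency is usually taken as the measured result (Rmax) divided by the theoretical peak (Rpeak), where Rpeak = nodes x cores per node x clock speed (GHz) x floating point operations per cycle. Below is a minimal sketch of that calculation. The hardware values in the example are hypothetical placeholders, not the specification of this cluster, and the implied peak assumes that the 46% figure refers to the best run (68.7 GFlops).

```python
# Sketch of the usual HPL efficiency calculation: Rmax / Rpeak.
# All hardware numbers used in the example are hypothetical placeholders.


def theoretical_peak_gflops(nodes, cores_per_node, clock_ghz, flops_per_cycle):
    """Rpeak in GFlops for a homogeneous cluster."""
    return nodes * cores_per_node * clock_ghz * flops_per_cycle


def efficiency(rmax_gflops, rpeak_gflops):
    """Fraction of the theoretical peak that the benchmark achieved."""
    return rmax_gflops / rpeak_gflops


def implied_peak_gflops(rmax_gflops, efficiency_fraction):
    """Rpeak implied by a measured Rmax and a quoted efficiency."""
    return rmax_gflops / efficiency_fraction


if __name__ == "__main__":
    # Hypothetical example: 2 nodes, 4 cores each, 2.5 GHz, 8 flops per cycle.
    rpeak = theoretical_peak_gflops(2, 4, 2.5, 8)
    print(f"Rpeak = {rpeak:.1f} GFlops")
    print(f"Efficiency at 68.7 GFlops = {efficiency(68.7, rpeak):.1%}")
    # Assumes the 46% figure was computed from the 68.7 GFlops run.
    print(f"Implied Rpeak = {implied_peak_gflops(68.7, 0.46):.1f} GFlops")
```

The flops per cycle value depends on the processor architecture (for example, on which vector instructions it supports), so it has to be looked up for the specific CPU.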

In my next post, I will discuss how to install HPCC, which is a more comprehensive benchmark tool.


  1. Hi, could you tell me how you calculate your efficiency? I have a small 3 node HPC cluster with a single i7 and 8GB of RAM in each node. Do you perhaps know the IPC of an i7 processor?

  2. It depends on which architecture your i7 is. My advice is to have a look at the Wiki page for the i7 and follow the links.