A new version of the QuickSched library is now available here!
This new version brings a number of small bug fixes, thread re-use when running with pthreads, and some fixes to the examples.
This post has little to do with parallel or high-performance computing, but is just a useful observation on using Matlab.
I was recently running a large-ish Matlab script that would create vectors of data and append them to a large array, either as rows or columns. By “large” I mean several hundred thousand rows. After letting the script run for a while, I realized that what was taking the most time was not the creation of the data, which was a non-trivial process, but appending it to the array.
In Matlab, there are several ways you can do this, and most of them are described here:
- Pre-allocate the array and fill in the columns. This only works if you know the final size of your matrix beforehand, which was not the case for me.
- A = [ A , B ] or A = horzcat( A , B ), which creates a new array concatenating the old and new columns, and stores it in the old variable.
- A(:,end+1:end+size(B,2)) = B, i.e. write the new columns directly to the end of the old matrix. Since the indices on the left-hand side are beyond the current size of A, the matrix is grown to accommodate them.
So which of the last two approaches is fastest? Only one way to find out…
>> n = 100; B = rand(n); A = []; tic; for k=1:1000, A = [ A , B ]; end; toc
Elapsed time is 30.309510 seconds.
>> n = 100; B = rand(n); A = []; tic; for k=1:1000, A = horzcat( A , B ); end; toc
Elapsed time is 28.960844 seconds.
>> n = 100; B = rand(n); A = []; tic; for k=1:1000, A(:,end+1:end+n) = B; end; toc
Elapsed time is 0.190347 seconds.
So using concatenation is more than 150x slower than directly inserting the new values into the matrix. The most likely reason is that concatenation allocates a brand-new matrix, copies the contents of both A and B into it, and discards the old A, so every append copies the entire accumulated array. This is not particularly efficient, and I’m a bit surprised that Matlab doesn’t optimize it away.
Another interesting question is whether we should add rows or columns to A. Matlab stores its data in column-major order, so adding a row to a matrix means all of its data has to be re-shuffled. So how much slower is this?
>> n = 100; B = rand(n); A = []; tic; for k=1:1000, A = [ A ; B ]; end; toc
Elapsed time is 26.628549 seconds.
>> n = 100; B = rand(n); A = []; tic; for k=1:1000, A = vertcat( A , B ); end; toc
Elapsed time is 23.476315 seconds.
>> n = 100; B = rand(n); A = []; tic; for k=1:1000, A(end+1:end+n,:) = B; end; toc
Elapsed time is 9.268510 seconds.
So appending rows to a matrix is roughly 50x more expensive than appending columns.
Admittedly, directly inserting the columns at the end of a matrix is syntactically not particularly pretty, and it is confusing to read. However, if simply appending data is taking longer than the actual computations, it may well be worth it.
After almost two years of work, the first journal article on the task-based algorithms for SPH has been submitted to SISC! For the impatient, here is the abstract:
This paper describes a new fast and implicitly parallel approach to neighbour-finding in multi-resolution Smoothed Particle Hydrodynamics (SPH) simulations. This new approach is based on hierarchical cell decompositions and sorted interactions, within a task-based formulation. It is shown to be faster than traditional tree-based codes, and to scale better than domain decomposition-based approaches on hybrid shared/distributed-memory parallel architectures, e.g. clusters of multi-cores, achieving a 40x speedup over the Gadget-2 simulation code.
A preprint of the manuscript has been deposited on the arXiv here; an ECS technical report will follow soon.
I am happy to announce the first public version of the QuickSched library!
QuickSched implements task-based parallelism using both dependencies and conflicts. It is written in plain C and can be compiled with any compiler that supports OpenMP or pthreads. The above figures show the strong scaling and parallel efficiency for both a task-based Barnes-Hut tree-code over 1’000’000 particles and for a tiled QR decomposition of a 2048×2048 matrix.
QuickSched is available for download from SourceForge here. A more detailed description of the underlying algorithms and results are the subject of a forthcoming paper.
A new version of SWIFT is available for download at our SourceForge site!
Version 1.0.0 includes the new hybrid shared/distributed-memory parallelism, as well as several improvements to the task scheduler.
In order to use its distributed-memory features, you should install a parallel version of the HDF5 library, as well as METIS version 5.1 for the task-based domain decomposition.
Following our talk at the Exascale Computing in Astrophysics Conference (Centro Stefano Franscini, Monte Verita, Ascona, Switzerland, 8 September – 13 September 2013), I am proud to announce the latest version of SWIFT, which is now capable of hybrid shared/distributed-memory parallelism!
The above figure shows the strong scaling for both Gadget-2 and SWIFT for a ~51M particle SPH-only simulation (EAGLE-box at z=0.5), starting at one single core and scaling up to 1024 cores of the COSMA5 cluster (16 Intel E5-2670 cores per node) at Durham University. The speedup plot (left) is scaled linearly, while the x-axis of the efficiency plot (right) is scaled logarithmically. The markers in the efficiency plot indicate the total number of cores.
I’ve been playing around on the new 64-core AMD Opteron 6376 we received a few weeks ago from Broadberry, courtesy of Durham University’s Seedcorn Research Fund, trying to see how well SWIFT will perform on it.
Before we begin, a few words on the machine itself: It consists of four 16-core AMD Opteron 6376 processors at 2.3GHz with a total of 32GB of memory installed. The cache hierarchy, as seen by hwloc/lstopo, is as follows:
Each 16-core processor is split into two banks of eight cores, each with its own 6MB L3 cache. Each bank is then split into sets of two cores sharing a 2MB L2 and a 64KB instruction cache. Each core then has its own 16KB L1 data cache.
There are a few features worth pointing out:
- Each L3 cache (6MB) is smaller than the sum of the L2 caches below it (4x2MB). This seems somewhat odd and quite different to the Intel Xeon X7550 processors used previously, which boast an 18MB L3 cache for eight cores with 8x256KB L2 caches.
- The L1 data cache is relatively small, i.e. half of what it is on the X7550.
On our system, we de-activated the Turbo Core feature to get consistent scaling benchmarks. We also used gcc 4.8, as it supports the AVX and FMA3/4 instructions provided by the CPU.
During test runs, it quickly became obvious that several modifications had to be made to SWIFT, as it had never been tested on more than 32 cores. The most significant change involved the task queues, which were replaced by a proper scheduler. This new scheduler differs from the previous queues in a number of ways:
- All tasks now reside in a common pool in the task scheduler, and are only assigned to specific queues once all their dependencies are satisfied. This keeps the queues short and makes picking a task off the queue more efficient.
- Each cell is assigned a preferred queue via a crude partition of space. Tasks involving a single cell are assigned to that cell’s queue when they become available. If a task involves more than one cell, and thus potentially more than one preferred queue, it is assigned to the shortest of those queues.
- Within each queue, the tasks are organized in a binary heap, sorted by their weight. The weight of a task is defined as its approximate computational cost plus the maximum weight over all tasks that depend on it, i.e. the cost of the longest chain of tasks it unlocks. This gives preference to tasks with a longer critical path.
- The heap in each queue is stored as an array and traversed linearly until a conflict-free task is found. This means that the tasks are not traversed strictly in order of their weight, but this is a compromise to make updating the queues, i.e. inserting/removing tasks, efficient.
- If a thread’s queue is empty and several attempts at work-stealing have failed, the thread simply quits. This was done to avoid banging on the memory bus for too long while the last few tasks were completing.
- The threads are statically assigned to specific CPUs so as to best spread them out over the different processors/banks/blocks. The core IDs are assigned in the order 0, 32, 16, 48, 8, 24, 40, 56, 4, 12, etc…
Since the scheduler is becoming quite sophisticated, I am currently working on branching it off as a separate software project, which will probably be announced here sometime soon.
There were also, as always, several bugs that popped up, and several parts of the non-task-based code had to be re-implemented to avoid bottlenecks.
The benchmark systems run were the following:
- Sedov Blast: 1’048’576 particles initialized on a FCC lattice with a density of ρ=1 and energy u=1.5e-5, with the central 28 particles set to explode with u=3.5714 each.
- Cosmological Volume: 1’841’127 particles in a 6.4 Mpc cubic box with a realistic distribution of mass and a mix of smoothing lengths spanning three orders of magnitude.
Timings were collected for 200 and 100 runs of each respectively, on 1 to 64 cores. The times cover the entire time step, i.e. updating the cell hierarchy when necessary, computing the interactions, moving the particles, collecting energies; in short, everything.
While both simulations show a parallel efficiency of ~80% at 32 cores, this is also where they stop scaling, achieving less than 50% parallel efficiency at 64 cores. This is not particularly good.
To see where efficiency is lost, we can have a look at the times for each part of the simulation for increasing numbers of cores for the Cosmo Volume case:
What we can see here is that the loss of scaling is not due to task overheads (the purple line near the bottom), but due to an increased cost of the computations themselves, i.e. the pair and self tasks.
We can have an even closer look at what’s going on by looking at the individual task times for the SedovBlast test case at 32, 48 and 64 cores:
In these figures, each row represents a core, and each coloured block represents a task. Blue tasks are self-interactions, green tasks are pair interactions, yellow tasks are the ghost tasks between the density and force computations, and the orange tasks are the integrator’s kick tasks. No sort tasks are shown as they are only evaluated when the cell hierarchy needs to be reconstructed.
Although the number of cores doubles between the first and last figure, the total time is not halved. The main culprit seems to be a number of tasks near the end of the step that get stretched at 48, and more so at 64 cores. This is assumed to be due to the concurrently executing kick tasks: As opposed to the pair and self tasks, these tasks run through the particles of a given cell and update their velocities and positions, which makes for a lot less computation per memory access, and therefore makes them significantly more memory intensive. This increased activity on the memory bus seems to be a problem for other tasks, slowing them down significantly. These lagging tasks also wreak havoc on the load balancing, as their dependent tasks are delayed.
Considering the amount of bad publicity these processors get, the scaling up to 32 cores is actually quite good. In any SPH simulation, obtaining ~80% efficiency with only ~32K particles per core is still quite exceptional. The code is also still 10x faster than Gadget-2.
Beyond 32 cores, I assume that the poor caching situation (two cores having to share each L2 cache, and those L2 caches together overwhelming the shared L3 cache between them and the memory bus) starts having a severe impact on the parallel efficiency.
The way around this problem would be to make the tasks even more memory/cache efficient. This, however, is a particularly tall order: The kick tasks, which seem to cause the most problems, simply run through a chunk of the particle data and update positions and velocities. The memory is accessed only once, so there’s not much that can be optimized therein.
A good indication that the code itself is quite cache efficient is the total lack of any so-called NUMA-related effects in the efficiency from 1 to 32 cores. There are no significant jumps in efficiency at powers of two where supposedly one core would start accessing another’s memory. This is thanks to the inherently cache-friendly nature of the task-based paradigm, i.e. that each task does a large amount of work on a restricted piece of memory.
The problem in this case is, I think, that the caches are not fed quickly enough because of the undersized shared L3 cache, which is smaller than the sum of the L2 caches below it.
Stay tuned now for some mdcore benchmarks!