New Preprint: Efficient and Scalable Algorithms for Smoothed Particle Hydrodynamics on Hybrid Shared/Distributed-Memory Architectures

After almost two years of work, the first journal article on the task-based algorithms for SPH has been submitted to SISC! For the impatient, here is the abstract:

This paper describes a new fast and implicitly parallel approach to neighbour-finding in multi-resolution Smoothed Particle Hydrodynamics (SPH) simulations. This new approach is based on hierarchical cell decompositions and sorted interactions, within a task-based formulation. It is shown to be faster than traditional tree-based codes, and to scale better than domain decomposition-based approaches on hybrid shared/distributed-memory parallel architectures, e.g. clusters of multi-cores, achieving a 40x speedup over the Gadget-2 simulation code.
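To give a flavour of what "sorted interactions" means in practice, below is a minimal sketch of the idea for a single pair of neighbouring cells. It is an illustration only, written under simplifying assumptions (one smoothing length per particle, no periodic wrapping, and made-up names such as dopair_sorted), and not the actual SWIFT code:

#include <stdlib.h>

/* Toy particle: position and smoothing length. */
struct part { double x[3]; double h; };

/* Particle index plus its position projected onto the cell-pair axis. */
struct entry { double d; int ind; };

static int cmp_entry(const void *a, const void *b) {
    const double da = ((const struct entry *)a)->d;
    const double db = ((const struct entry *)b)->d;
    return (da > db) - (da < db);
}

/* Interact the particles of two neighbouring cells ci and cj whose centres are
 * joined by the unit vector 'axis' (pointing from ci to cj). Both sets of
 * particles are first sorted along that axis; the inner loop can then stop as
 * soon as the separation along the axis alone exceeds the smoothing length,
 * avoiding the count_i x count_j distance checks of a naive double loop. */
void dopair_sorted(struct part *parts_i, int count_i,
                   struct part *parts_j, int count_j, const double axis[3]) {
    struct entry *sort_i = malloc(sizeof *sort_i * count_i);
    struct entry *sort_j = malloc(sizeof *sort_j * count_j);
    for (int k = 0; k < count_i; k++) {
        sort_i[k].ind = k;
        sort_i[k].d = parts_i[k].x[0] * axis[0] + parts_i[k].x[1] * axis[1] +
                      parts_i[k].x[2] * axis[2];
    }
    for (int k = 0; k < count_j; k++) {
        sort_j[k].ind = k;
        sort_j[k].d = parts_j[k].x[0] * axis[0] + parts_j[k].x[1] * axis[1] +
                      parts_j[k].x[2] * axis[2];
    }
    qsort(sort_i, count_i, sizeof *sort_i, cmp_entry);
    qsort(sort_j, count_j, sizeof *sort_j, cmp_entry);

    /* Walk the particles of ci from the cell interface backwards... */
    for (int i = count_i - 1; i >= 0; i--) {
        struct part *pi = &parts_i[sort_i[i].ind];
        /* ...and the particles of cj from the interface forwards, stopping as
         * soon as pj is more than pi->h away along the axis (and hence also in
         * 3D, since the projected distance is a lower bound). */
        for (int j = 0; j < count_j && sort_j[j].d - sort_i[i].d < pi->h; j++) {
            struct part *pj = &parts_j[sort_j[j].ind];
            const double dx[3] = { pi->x[0] - pj->x[0], pi->x[1] - pj->x[1],
                                   pi->x[2] - pj->x[2] };
            const double r2 = dx[0] * dx[0] + dx[1] * dx[1] + dx[2] * dx[2];
            if (r2 < pi->h * pi->h) {
                /* evaluate the SPH density/force kernel for pi and pj here */
            }
        }
    }
    free(sort_i);
    free(sort_j);
}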

A preprint of the manuscript has been deposited on the arXiv here; an ECS technical report will follow soon.

Introducing QuickSched: Task-based parallelism with conflicts made easy

I am happy to announce the first public version of the QuickSched library!

[Figures BH_scaling and QR_scaling: strong scaling and parallel efficiency of the Barnes-Hut and tiled QR benchmarks described below.]

QuickSched implements task-based parallelism using both dependencies and conflicts. It is written in plain C and can be compiled with any compiler that supports OpenMP or pthreads. The above figures show the strong scaling and parallel efficiency for both a task-based Barnes-Hut tree-code over 1’000’000 particles and a tiled QR decomposition of a 2048×2048 matrix.
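To illustrate the difference between the two kinds of constraints, here is a toy description of a three-task graph, e.g. two SPH pair tasks that both touch the same cell plus a ghost task that must run after them. This is deliberately not the QuickSched API; the types and names (struct task, struct resource, etc.) are made up for the example and only show what a scheduler has to enforce in each case:

#include <stddef.h>

/* Something a task needs exclusive access to, e.g. a cell of particles. */
struct resource { int hold; };

struct task {
    const char *name;
    int wait;                    /* dependencies: number of tasks that must finish before this one */
    struct task *unlocks[4];     /* dependencies: tasks whose 'wait' count drops when this one ends */
    int nr_unlocks;
    struct resource *locks[4];   /* conflicts: resources held exclusively while this task runs */
    int nr_locks;
};

int main(void) {
    struct resource cell_a = {0}, cell_b = {0}, cell_c = {0};

    /* Two pair tasks share cell_b: a conflict. They may run in either order,
     * but never at the same time. */
    struct task pair_ab = { .name = "pair_ab", .locks = { &cell_a, &cell_b }, .nr_locks = 2 };
    struct task pair_bc = { .name = "pair_bc", .locks = { &cell_b, &cell_c }, .nr_locks = 2 };

    /* The ghost task depends on both pair tasks: a fixed ordering. */
    struct task ghost_b = { .name = "ghost_b", .wait = 2, .locks = { &cell_b }, .nr_locks = 1 };
    pair_ab.unlocks[pair_ab.nr_unlocks++] = &ghost_b;
    pair_bc.unlocks[pair_bc.nr_unlocks++] = &ghost_b;

    /* A scheduler enforces dependencies by only queueing tasks whose 'wait'
     * count has dropped to zero, and conflicts by try-locking all of a task's
     * resources before running it, skipping it if any lock is already held. */
    return 0;
}

QuickSched takes this kind of description (tasks, the resources they lock, and the dependencies between them) and executes it on a pool of threads.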

QuickSched is available for download from SourceForge here. A more detailed description of the underlying algorithms and results is the subject of a forthcoming paper.

New SWIFT version 1.0.0

A new version of SWIFT is available for download at our SourceForge site!

Version 1.0.0 includes the new hybrid shared/distributed-memory parallelism, as well as several improvements to the task scheduler.

In order to use its distributed-memory features, you should install a parallel build of the HDF5 library, as well as METIS version 5.1 for the task-based domain decomposition.

SWIFT goes hybrid: Announcing the hybrid shared/distributed-memory parallel version of SWIFT

Following our talk at the Exascale Computing in Astrophysics Conference (Centro Stefano Franscini, Monte Verita, Ascona, Switzerland, 8 September – 13 September 2013), I am proud to announce the latest version of SWIFT, which is now capable of hybrid shared/distributed-memory parallelism!

[Figure BigCosmoVolume_cosma5: strong scaling of Gadget-2 and SWIFT on the COSMA5 cluster.]

The above figure shows the strong scaling for both Gadget-2 and SWIFT for a ~51M particle SPH-only simulation (EAGLE-box at z=0.5), starting from a single core and scaling up to 1024 cores of the COSMA5 cluster (16 Intel E5-2670 cores per node) at Durham University. The speedup plot (left) is scaled linearly, while the x-axis of the efficiency plot (right) is scaled logarithmically. The markers in the efficiency plot indicate the total number of cores.

First results for SWIFT on a 64-core AMD Opteron 6376

I’ve been playing around with the new 64-core AMD Opteron 6376 we received a few weeks ago from Broadberry, courtesy of Durham University’s Seedcorn Research Fund, to see how well SWIFT performs on it.

The System

Before we begin, a few words on the machine itself: It consists of four 16-core AMD Opteron 6376 processors at 2.3GHz with a total of 32GB of memory installed. The cache hierarchy, as seen by hwloc/lstopo, is as follows:

[Figure topo: cache hierarchy of the machine as reported by hwloc/lstopo.]

Each 16-core processor is split into two banks of eight cores, each with its own 6MB L3 cache. Each bank is then split into sets of two cores sharing a 2MB L2 and a 64KB instruction cache. Each core then has its own 16KB L1 cache.

There are a few features worth pointing out:

  • Each L3 cache (6MB) is smaller than the sum of the L2 caches below it (4x2MB). This seems somewhat odd and quite different from the Intel Xeon X7550 processors used previously, which boast an 18MB L3 cache for eight cores with 8x256KB L2 caches.
  • The L1 data cache is relatively small, i.e. half of what it is on the X7550.

On our system, we de-activated the Turbo Core feature to get consistent scaling benchmarks. We also used gcc 4.8, as it supports the AVX and FMA3/4 instructions provided by the CPU.

The Software

During test runs, it quickly became obvious that several modifications had to be made to SWIFT, as it had never been tested on more than 32 cores. The most significant change involved the task queues, which were replaced by a proper scheduler. This new scheduler differs from the previous queues in a number of ways:

  • All tasks now reside in a common pool in the task scheduler, and are only assigned to specific queues once all their dependencies are satisfied. This keeps the queues short and makes picking a task off the queue more efficient.
  • Each cell is assigned a preferred queue via a crude partition of space. Tasks involving a single cell are assigned to that cell’s queue when they become available. If a task involves more than one cell, and thus potentially more than one preferred queue, it is assigned to the shorter of the two queues.
  • Within each queue, the tasks are organized in a binary heap, sorted by their weight. The weight of a task is defined as its approximate computational cost plus the maximum weight of any of the tasks that depend on it. This gives preference to tasks with a longer critical path.
  • The heap in each queue is stored as an array and traversed linearly until a conflict-free task is found. This means that the tasks are not picked strictly in order of their weight, but this is a compromise to make updating the queues, i.e. inserting/removing tasks, efficient (see the sketch after this list).
  • If a thread’s queue is empty and several attempts at work-stealing have failed, the thread simply quits. This was done to avoid banging on the memory bus for too long while the last few tasks were completing.
  • The threads are statically assigned to specific CPUs so as to best spread them out over the different processors/banks/blocks. The core IDs are assigned in the order 0, 32, 16, 48, 8, 24, 40, 56, 4, 12, etc…
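To make the queue logic above more concrete, here is a minimal sketch of such a queue: a heap ordered by task weight whose get-task operation scans the array front to back and returns the first task whose cells it manages to lock. This is an illustration only; the structs and helper names (cell_trylock, queue_gettask) are invented for the example, and the real scheduler is considerably more involved:

#include <stdlib.h>

/* Hypothetical data structures, for illustration only. */
struct cell { volatile int lock; /* ...particle data... */ };

struct task {
    struct cell *ci, *cj;   /* cells this task operates on (cj may be NULL) */
    float weight;           /* approximate cost plus the weight of the heaviest task waiting on this one */
    /* ...type, dependency counters, etc... */
};

struct queue {
    struct task **heap;     /* binary max-heap of ready tasks, ordered by weight */
    int count;
};

/* Try to get exclusive access to a cell, e.g. with an atomic compare-and-swap. */
static int cell_trylock(struct cell *c) {
    return __sync_bool_compare_and_swap(&c->lock, 0, 1);
}
static void cell_unlock(struct cell *c) { c->lock = 0; }

/* Scan the heap array front to back, i.e. roughly in order of decreasing
 * weight, and return the first task whose cells can all be locked. Tasks may
 * thus be picked slightly out of weight order (the compromise mentioned in the
 * list above), but insertions and removals stay cheap. */
struct task *queue_gettask(struct queue *q) {
    for (int k = 0; k < q->count; k++) {
        struct task *t = q->heap[k];
        if (!cell_trylock(t->ci)) continue;
        if (t->cj != NULL && !cell_trylock(t->cj)) {
            cell_unlock(t->ci);
            continue;
        }
        /* Got a conflict-free task: remove entry k and hand the task back.
         * (Restoring the heap property via a sift-down is omitted here.) */
        q->heap[k] = q->heap[--q->count];
        return t;
    }
    return NULL;   /* queue empty, or every ready task currently conflicts */
}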

Since the scheduler is becoming quite sophisticated, I am currently working on branching it off as a separate software project, which will probably be announced here sometime soon.

There were also, as always, several bugs that popped up, and several parts of the non-task-based code had to be re-implemented to avoid bottlenecks.

The Results

The benchmark systems run were the following:

  • Sedov Blast: 1’048’576 particles initialized on an FCC lattice with a density of ρ=1 and energy u=1.5e-5, with the central 28 particles set to explode with u=3.5714 each (see the sketch after this list).
  • Cosmological Volume: 1’841’127 particles in a 6.4 Mpc cubic box with a realistic distribution of mass and a mix of smoothing lengths spanning three orders of magnitude.
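As an aside, initial conditions of the Sedov type above are straightforward to generate. The sketch below shows one way of doing it, with made-up names (make_sedov_ics and friends): it places 4·64³ = 1’048’576 particles on an FCC lattice in a unit box and deposits the blast energy into the particles nearest the box centre; particle masses (and hence the handling of ρ=1) are left out for brevity.

#include <stdlib.h>

/* Toy particle: position and specific internal energy. */
struct part { double x[3]; double u; };

/* Offsets of the four particles in a conventional FCC cell. */
static const double fcc[4][3] = { {0.0, 0.0, 0.0}, {0.5, 0.5, 0.0},
                                  {0.5, 0.0, 0.5}, {0.0, 0.5, 0.5} };

struct dist_ind { double r2; int ind; };
static int cmp_dist(const void *a, const void *b) {
    const double ra = ((const struct dist_ind *)a)->r2;
    const double rb = ((const struct dist_ind *)b)->r2;
    return (ra > rb) - (ra < rb);
}

/* Fill 'parts' with 4*n*n*n particles on an FCC lattice in a unit box at
 * u = 1.5e-5, then set the 'nblast' particles closest to the box centre to
 * u = u_blast. Calling this with n=64, nblast=28 and u_blast=3.5714 matches
 * the setup described above. */
void make_sedov_ics(struct part *parts, int n, int nblast, double u_blast) {
    int count = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                for (int b = 0; b < 4; b++) {
                    parts[count].x[0] = (i + fcc[b][0]) / n;
                    parts[count].x[1] = (j + fcc[b][1]) / n;
                    parts[count].x[2] = (k + fcc[b][2]) / n;
                    parts[count].u = 1.5e-5;
                    count++;
                }

    /* Rank the particles by distance from the box centre and light the fuse. */
    struct dist_ind *d = malloc(sizeof *d * count);
    for (int k = 0; k < count; k++) {
        const double dx = parts[k].x[0] - 0.5, dy = parts[k].x[1] - 0.5,
                     dz = parts[k].x[2] - 0.5;
        d[k].r2 = dx * dx + dy * dy + dz * dz;
        d[k].ind = k;
    }
    qsort(d, count, sizeof *d, cmp_dist);
    for (int k = 0; k < nblast; k++) parts[d[k].ind].u = u_blast;
    free(d);
}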

Timings were collected for 200 and 100 runs of the two setups respectively, on 1 to 64 cores. The times are for the complete time step, i.e. updating the cell hierarchy when necessary, computing the interactions, moving the particles, collecting energies; in short, everything.

[Figures CosmoVolume_scaling and SedovBlast_scaling: strong scaling and parallel efficiency of both test cases on the 64-core Opteron.]

While both simulations show a parallel efficiency of ~80% at 32 cores, this is also where they stop scaling, achieving less than 50% parallel efficiency at 64 cores. This is not particularly good.

To see where efficiency is lost, we can have a look at the times for each part of the simulation for increasing numbers of cores for the Cosmo Volume case:

[Figure CosmoVolume_times: time spent in each part of the simulation versus number of cores for the Cosmo Volume case.]

What we can see here is that the loss of scaling is not due to task overheads (the purple line near the bottom), but due to an increased cost of the computations themselves, i.e. the pair and self tasks.

We can have an even closer look at what’s going on by looking at the individual task times for the SedovBlast test case at 32, 48 and 64 cores:

[Figures SedovBlast_tasks_32, SedovBlast_tasks_48, SedovBlast_tasks_64: per-core task timelines for the SedovBlast test case at 32, 48, and 64 cores.]

In these figures, each row represents a core, and each coloured block represents a task. Blue tasks are self-interactions, green tasks are pair interactions, yellow tasks are the ghost tasks between the density and force computations, and the orange tasks are the integrator’s kick tasks. No sort tasks are shown, as they are only evaluated when the cell hierarchy needs to be reconstructed.

Although the number of cores doubles between the first and last figure, the total time is not halved. The main culprit seems to be a number of tasks near the end of the step that get stretched out at 48, and more so at 64, cores. This is assumed to be due to the concurrently executing kick tasks: as opposed to the pair and self tasks, these tasks run through the particles of a given cell and update their velocities and positions, which makes for a lot less computation per memory access and therefore makes them significantly more memory-intensive. This increased activity on the memory bus seems to be a problem for the other tasks, slowing them down significantly. These lagging tasks also wreak havoc on the load balancing, as their dependent tasks are delayed.
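For context, a kick task is essentially a streaming update of the following form (a simplified sketch with a made-up particle layout, not the actual SWIFT integrator): every particle is touched exactly once and only a handful of floating-point operations are done per particle loaded, so the loop is bounded by memory bandwidth rather than by the cores.

/* Simplified sketch of a kick-style update over one cell's particles. */
struct part {
    double x[3];   /* position */
    float v[3];    /* velocity */
    float a[3];    /* acceleration accumulated by the force tasks */
};

void kick_cell(struct part *parts, int count, float dt) {
    for (int k = 0; k < count; k++) {
        for (int i = 0; i < 3; i++) {
            parts[k].v[i] += parts[k].a[i] * dt;   /* a few flops...            */
            parts[k].x[i] += parts[k].v[i] * dt;   /* ...per cache line loaded. */
        }
    }
}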

Summary

Considering the amount of bad publicity these processors get, the scaling up to 32 cores is actually quite good. For an SPH simulation, obtaining ~80% parallel efficiency with only ~32K particles per core is still quite exceptional. The code is also still 10x faster than Gadget-2.

Beyond 32 cores, I assume that the poor caching performance, i.e. two cores having to share each L2 cache, and these L2 caches in turn overwhelming the shared L3 cache that sits between them and the memory bus, starts to have a severe impact on the parallel efficiency.

The way around this problem would be to make the tasks even more memory/cache efficient. This, however, is particularly difficult here: the kick tasks, which seem to cause the most problems, simply run through a chunk of the particle data and update positions and velocities. Each byte of memory is accessed only once, so there is not much left to optimize.

A good indication that the code itself is quite cache efficient is the total lack of any so-called NUMA-related effects in the efficiency from 1 to 32 cores. There are no significant jumps in efficiency at powers of two where supposedly one core would start accessing another’s memory. This is thanks to the inherently cache-friendly nature of the task-based paradigm, i.e. that each task does a large amount of work on a restricted piece of memory.

The problem in this case is, I think, that the caches are not fed quickly enough because of the undersized shared L3 cache, which is smaller than the sum of the L2 caches below it.

Stay tuned now for some mdcore benchmarks!

mdcore: New release 0.1.7!

We’ve just uploaded a new release of mdcore, v. 0.1.7, to the SourceForge site.

The new release incorporates a number of new features, including:

  • Better GPU implementation using task-based parallelism directly on the device,
  • Ability to simultaneously use multiple CUDA-capable GPUs,
  • Domain decomposition using the METIS library when running with MPI,
  • Added P-SHAKE implementation for faster constraint resolution,
  • Several fixes to improve energy conservation in both single and double precision.

Since mdcore is a work in progress and thus under constant development, it may be advisable to use the latest version of the source code from the SourceForge Subversion repository. You can check it out by executing

svn co http://svn.code.sf.net/p/mdcore/code/trunk mdcore

on the command line.

New results using SWIFT

We’ve just submitted a paper to the 8th International SPHERIC Workshop in Trondheim, Norway, with some benchmarks of SWIFT on the following three simulation setups:

  • Sod-shock: A cubic periodic domain containing a high-density region of 800 000 particles with P = 1 and ρ = 1 on one half, and a low-density region of 200 000 particles with P = 0.1795 and ρ = 1/4 on the other.
  • Sedov blast: A cubic grid of 101 × 101 × 101 particles at rest with P = 10⁻⁵ and ρ = 1, yet with the energy of the central 3³ particles set to 100/3³/m each.
  • Cosmological volume: Realistic distribution of matter in a periodic volume of the universe. The simulation consists of 1 841 127 particles with a mix of smoothing lengths spanning three orders of magnitude, providing a test case for neighbour finding and parallel scaling in a real-world scenario.

All three simulations were run on a 4×Intel Xeon E5-4640 32-core machine running at 2.4 GHz with CentOS release 6.2 Linux for x86_64.

[Figures SodShock_scaling, SedovBlast_scaling, CosmoVolume_scaling: strong scaling for the three test cases.]

In all three test cases, SWIFT achieves a parallel efficiency of more than 75%. The numbers in the scaling plots are the number of milliseconds per time step on 32 cores. The results for the cosmological volume also show the timings for Gadget-2 for comparison.

[Figure SodShock_profile: density, pressure, and velocity profiles of the Sod-shock at t=0.12.]

The density, pressure, and velocity profiles of the Sod-shock were computed at t=0.12 and are comparable to both the analytical results (dotted lines) and the results obtained with Gadget-2 for the same initial conditions. Sharper shocks could be obtained at higher resolution.

[Figure SedovBlast_density: average radial density profiles of the Sedov blast at t = {0.075, 0.15, 0.275}.]

Finally, the average radial density profiles for the Sedov blast computed at t = {0.075, 0.15, 0.275} also show good agreement with the analytical solution (dotted lines), and are comparable to the results obtained with Gadget-2 for the same initial conditions.

We are currently preparing the SWIFT source code for an official Open-Source release.