I’ve been playing around on the new 64-core AMD Opteron 6376 we received a few weeks ago from Broadberry, courtesy of Durham University’s Seedcorn Research Fund, trying to see how well SWIFT will perform on it.
Before we begin, a few words on the machine itself: It consists of four 16-core AMD Opteron 6376 processors at 2.3GHz with a total of 32GB of memory installed. The cache hierarchy, as seen by hwloc/lstopo, is as follows:
Each 16-core processor is split into two banks of eight cores, each with its own 6MB L3 cache. Each bank is then split into sets of two cores sharing a 2MB L2 cache and a 64KB instruction cache. Each core then has its own 16KB L1 data cache.
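Schematically, one 16-core package therefore looks roughly like this (a sketch of the hierarchy described above, not the literal lstopo output):

```
Package (16 cores)
├─ Bank 0: 6MB L3, shared by 8 cores
│   ├─ Module: 2MB L2 + 64KB I-cache, shared by 2 cores
│   │   ├─ Core: 16KB L1d
│   │   └─ Core: 16KB L1d
│   └─ … (4 modules per bank)
└─ Bank 1: 6MB L3 (same layout)
```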
There are a few features worth pointing out:
- Each L3 cache (6MB) is smaller than the sum of the L2 caches below it (4x2MB). This seems somewhat odd and quite different from the Intel Xeon X7550 processors used previously, which boast an 18MB L3 cache for eight cores with 8x256KB L2 caches.
- The L1 data cache is relatively small, at half the size of that on the X7550.
On our system, we deactivated the Turbo Core feature to get consistent scaling benchmarks. We also used gcc 4.8, as it supports the AVX and FMA3/4 instructions provided by the CPU.
During test runs, it quickly became obvious that several modifications had to be made to SWIFT, as it had never been tested on more than 32 cores. The most significant change involved the task queues, which were replaced by a proper scheduler. This new scheduler differs from the previous queues in a number of ways:
- All tasks now reside in a common pool in the task scheduler, and are only assigned to specific queues once all their dependencies are satisfied. This keeps the queues short and makes picking a task off the queue more efficient.
- Each cell is assigned a preferred queue via a crude partition of space. Tasks involving a single cell are assigned to that cell’s queue when they become available. If a task involves more than one cell, and thus potentially more than one preferred queue, it is assigned to the shorter of the two queues.
- Within each queue, the tasks are organized in a binary heap, sorted by their weight. The weight of a task is defined as its approximate computational cost plus the maximum cost of any of its dependent tasks. This gives preference to tasks with a longer critical path.
- The heap in each queue is stored as an array and traversed linearly until a conflict-free task is found. This means that the tasks are not traversed strictly in order of their weight, but this is a compromise to make updating the queues, i.e. inserting/removing tasks, efficient.
- If a thread’s queue is empty and several attempts at work-stealing have failed, the thread simply quits. This was done to avoid banging on the memory bus for too long while the last few tasks were completing.
- The threads are statically assigned to specific CPUs so as to best spread them out over the different processors/banks/blocks. The core IDs are assigned in the order 0, 32, 16, 48, 8, 24, 40, 56, 4, 12, etc…
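The weight-based ordering at the heart of the scheduler can be sketched as follows. This is an illustrative simplification, not SWIFT's actual data structures: all names (`task_t`, `queue_t`, the `locked` flag standing in for the real cell-conflict check) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative task: a cost, a heap weight, and the tasks it unlocks. */
typedef struct task {
    float cost;             /* approximate computational cost */
    float weight;           /* cost + max weight over dependent tasks */
    int nr_unlocks;         /* number of tasks depending on this one */
    struct task **unlocks;  /* the dependent tasks themselves */
    int locked;             /* stand-in for a cell conflict with a running task */
} task_t;

/* Weight = own cost plus the maximum weight of any dependent task,
   which gives preference to tasks on the longest critical path. */
float task_weight(task_t *t) {
    float max_dep = 0.0f;
    for (int k = 0; k < t->nr_unlocks; k++) {
        float wd = task_weight(t->unlocks[k]);
        if (wd > max_dep) max_dep = wd;
    }
    return t->cost + max_dep;
}

/* A queue stores its ready tasks as a binary max-heap in a flat array. */
typedef struct queue {
    task_t **tasks;
    int count;
} queue_t;

/* Append a task and sift it up towards the root. */
void queue_insert(queue_t *q, task_t *t) {
    int i = q->count++;
    q->tasks[i] = t;
    while (i > 0) {
        int p = (i - 1) / 2;
        if (q->tasks[p]->weight >= q->tasks[i]->weight) break;
        task_t *tmp = q->tasks[p];
        q->tasks[p] = q->tasks[i];
        q->tasks[i] = tmp;
        i = p;
    }
}

/* Scan the heap array linearly, i.e. not in strict weight order, for
   the first conflict-free task; heavy tasks still tend to sit near the
   front since they live near the root of the heap. */
task_t *queue_get(queue_t *q) {
    for (int i = 0; i < q->count; i++)
        if (!q->tasks[i]->locked) {
            task_t *t = q->tasks[i];
            /* Fill the hole with the last element; a sketch like this
               skips the re-heapify step the real code would need. */
            q->tasks[i] = q->tasks[--q->count];
            return t;
        }
    return NULL;  /* nothing runnable: time to try work-stealing */
}
```

The linear scan is the compromise mentioned above: it trades strict priority order for cheap inserts and removals.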
Since the scheduler is becoming quite sophisticated, I am currently working on branching it off as a separate software project, which will probably be announced here sometime soon.
There were also, as always, several bugs that popped up, and several parts of the non-task-based code had to be re-implemented to avoid bottlenecks.
The benchmark systems run were the following:
- Sedov Blast: 1’048’576 particles initialized on a FCC lattice with a density of ρ=1 and energy u=1.5e-5, with the central 28 particles set to explode with u=3.5714 each.
- Cosmological Volume: 1’841’127 particles in a 6.4 Mpc cubic box with a realistic distribution of mass and a mix of smoothing lengths spanning three orders of magnitude.
Timings were collected for 200 and 100 runs of each respectively over 1 to 64 cores. The times are for the total time step, i.e. updating the cell hierarchy when necessary, computing the interactions, moving particles, collecting energies, just everything.
While both simulations show a parallel efficiency of ~80% at 32 cores, this is also where they stop scaling, achieving less than 50% parallel efficiency at 64 cores. This is not particularly good.
To see where efficiency is lost, we can have a look at the times for each part of the simulation for increasing numbers of cores for the Cosmo Volume case:
What we can see here is that the loss of scaling is not due to task overheads (the purple line near the bottom), but due to an increased cost of the computations themselves, i.e. the pair and self tasks.
We can have an even closer look at what’s going on by looking at the individual task times for the SedovBlast test case at 32, 48 and 64 cores:
In these figures, each row represents a core, and each coloured block represents a task. Blue tasks are self-interactions, green tasks are pair interactions, yellow tasks are the ghost tasks between the density and force computations, and the orange tasks are the integrator’s kick tasks. No sort tasks are shown as they are only evaluated when the cell hierarchy needs to be reconstructed.
Although the number of cores doubles between the first and last figure, the total time is not halved. The main culprit seems to be a number of tasks near the end of the step that get stretched at 48, and more so at 64 cores. This is assumed to be due to the concurrently executing kick tasks: As opposed to the pair and self tasks, these tasks run through the particles of a given cell and update their velocities and positions, which makes for a lot less computation per memory access, and therefore makes them significantly more memory intensive. This increased activity on the memory bus seems to be a problem for other tasks, slowing them down significantly. These lagging tasks also wreak havoc on the load balancing, as their dependent tasks are delayed.
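To see why a kick is so memory intensive, consider what it boils down to: a single streaming pass over the particle array with only a handful of floating-point operations per value loaded. The sketch below is a hypothetical simplification (the `part_t` layout and `kick` signature are not SWIFT's), but it captures the arithmetic intensity of the operation:

```c
#include <stddef.h>

/* Minimal particle layout, for illustration only. */
typedef struct part {
    double x[3];  /* position     */
    double v[3];  /* velocity     */
    double a[3];  /* acceleration */
} part_t;

/* A kick-like update: one streaming pass over the particles, with only
   a few flops per byte loaded, hence bound by memory bandwidth rather
   than by compute. */
void kick(part_t *parts, size_t count, double dt) {
    for (size_t i = 0; i < count; i++)
        for (int k = 0; k < 3; k++) {
            parts[i].v[k] += parts[i].a[k] * dt;  /* update velocity */
            parts[i].x[k] += parts[i].v[k] * dt;  /* update position */
        }
}
```

Each particle is touched exactly once, so the caches provide no reuse, and several kick tasks running concurrently on different banks will compete for the same memory bus.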
Considering the amount of bad publicity these processors get, the scaling up to 32 cores is actually quite good. In any SPH simulation, obtaining ~80% efficiency with only ~32K particles per core is still quite exceptional. The code is also still 10x faster than Gadget-2.
Beyond 32 cores, I assume the poor caching performance, i.e. two cores having to share each L2 cache, and these L2 caches overwhelming the shared L3 cache between them and the memory bus, starts having a severe impact on the parallel efficiency.
The way around this problem would be to make the tasks even more memory/cache efficient. This, however, is a particularly hard sell: The kick tasks, which seem to cause the most problems, simply run through a chunk of the particle data and update positions and velocities. The memory is accessed only once, so there’s not much that can be optimized therein.
A good indication that the code itself is quite cache efficient is the total lack of any so-called NUMA-related effects in the efficiency from 1 to 32 cores. There are no significant jumps in efficiency at powers of two where supposedly one core would start accessing another’s memory. This is thanks to the inherently cache-friendly nature of the task-based paradigm, i.e. that each task does a large amount of work on a restricted piece of memory.
The problem in this case is, I think, that the caches are not fed quickly enough because of the undersized shared L3 cache, which is smaller than the sum of the L2 caches below it.
Stay tuned now for some mdcore benchmarks!