## Author e-mails

haoran@cita.utoronto.ca

## Author affiliations

^{1} Tsung-Dao Lee Institute, Shanghai Jiao Tong University, Shanghai, 200240, People's Republic of China; haoran@cita.utoronto.ca

^{2} Canadian Institute for Theoretical Astrophysics, University of Toronto, Toronto, ON M5H 3H8, Canada

^{3} Department of Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, People's Republic of China

^{4} Dunlap Institute for Astronomy and Astrophysics, University of Toronto, Toronto, ON M5S 3H4, Canada

^{5} Canadian Institute for Advanced Research, CIFAR Program in Gravitation and Cosmology, Toronto, ON M5G 1Z8, Canada

^{6} Perimeter Institute for Theoretical Physics, Waterloo, ON N2L 2Y5, Canada

## ORCID iDs

Hao-Ran Yu https://orcid.org/0000-0001-5277-4882

## Dates

- Received 2017 December 17
- Revised 2018 May 24
- Accepted 2018 May 25
- Published 2018 July 26

Keywords

large-scale structure of universe; methods: numerical

0067-0049/237/2/24

## Abstract

Cosmological large-scale structure *N*-body simulations are computation-light, memory-heavy problems in supercomputing. The considerable amount of memory is usually dominated by an inefficient way of storing more than sufficient phase space information of particles. We present a new parallel, information-optimized, particle-mesh-based *N*-body code CUBE, in which information-efficiency and memory-efficiency are increased by nearly an order of magnitude. This is accomplished by storing particle's relative phase space coordinates instead of global values, and in the format of fixed point as light as 1 byte. The remaining information is given by complementary density and velocity fields (negligible in memory space) and proper ordering of particles (no extra memory). Our numerical experiments show that this information-optimized *N*-body algorithm provides accurate results within the error of the particle-mesh algorithm. This significant lowering of the memory-to-computation ratio breaks the bottleneck of scaling up and speeding up large cosmological *N*-body simulations on multi-core and heterogeneous computing systems.

Export citation and abstractBibTeXRIS

- NASA ADS Record
- About Related Links

## 1.Introduction

The *N*-body simulation, a dynamical simulation of a group of particles, is a powerful tool in physics and astronomy (Hockney & Eastwood 1988). It is widely used in cosmology to model the large-scale structure (LSS) of the universe (Davis et al. 1985). Current percent and sub-percent level LSS measurements of cosmological parameters, via the matter power spectrum (Rimes & Hamilton 2005; Takahashi et al. 2011), baryonic acoustic oscillations (BAO) (Eisenstein et al. 2005; Takahashi et al. 2009), weak gravitational lensing (Vale & White 2003; Hilbert et al. 2009; Sato et al. 2009) etc., require understandings of the nonlinear dynamics of the cosmic structure, and rely on high-resolution and high dynamic range *N*-body simulations.

When *N* is large, the brute force pairwise particle–particle (PP) force brings unaffordable computations, so many algorithms were designed to alleviate it. Various fast-multipole methods (Rokhlin 1985; Dehnen 2014; Potter & Stadel 2016) improve the complexity to even *o*(*N*), among which the most popular one is "tree," like `GADGET` (Springel et al. 2001; Springel 2005) and its simulation "Millennium" (Springel et al. 2005; Angulo et al. 2012), `TPM` (Xu 1995) and `GOTPM` (Dubinski et al. 2004). Other methods include adaptive grid algorithms like `HYDRA` (Couchman et al. 1995) and `RAMSES` (Teyssier 2010), as well as mesh-refined codes (Couchman 1991) and moving adaptive particle-mesh (PM) codes (Pen 1995). The standard PM algorithm (Hockney & Eastwood 1988) is most memory and computational efficient if we focus on large cosmological scales. The load-balancing problem is minimized because the matter distribution is rather hom*ogeneous, and the speed benefits from the fast Fourier transform (FFT) libraries, such as `FFTW3` (Frigo & Johnson 2005). `PMFAST` (Merz et al. 2005) introduces a 2-level PM algorithm, aiming to push PM codes toward speed, memory compactness, and scalability. After subsequent developments on `PMFAST`, `CUBEP3M` (Harnois-Déraps et al. 2013) uses cubic spatial decomposition, and adds PP force and many other features.

In addition to the new methodology, the fast development of parallel supercomputers enables us to simulate a system of more than a trillion (10^{12}) *N*-body particles. To date, the largest *N*-body simulation in application is the "TianNu" (Emberson et al. 2017; Yu et al. 2017a) run on the TianHe-2 supercomputer. With the code `CUBEP3M` adding neutrino modules, it uses 3**×**10^{12} particles to simulate the cold dark matter (CDM) and cosmic neutrino evolution through the cosmic age.

Relative to the optimized computation optimizations, *N*-body simulations use a considerable amount of memory to store the information of particles. Their phase space coordinates are stored as at least six single-precision floating point numbers (a total of 24 bytes). On the other hand, modern supercomputer systems use multi-cores, many integrated cores (MIC) and even densely parallelized GPUs, bringing orders of magnitude higher computing power, whereas these architectures usually have limited memory allocation. Thus, these computation-light but memory-heavy applications, compared to matrix multiplication and decomposition calculations, are currently less suitable for full usage of the computing power of modern supercomputers. For example, although native and offload modes of `CUBEP3M` are able to run on the Intel Xeon-PHI MIC architectures, with the requirement of enough memory, TianNu simulations were done on TianHe-2 using only its CPUs—73% of the total memory but only 13% of the total computing power. We investigate how the greatest amount of information on particles can be deduced while still preserving the accuracy of *N*-body simulations, with the aim of optimizing the total memory usage for a given *N*.

In the following sections we present a new, information-optimized parallel cosmological *N*-body simulation code `CUBE` (Yu et al. 2018), using as little as 6 bytes per particle (bpp). It gives accurate results in cosmological LSS simulations—the error induced by information optimization is below the error from the PM algorithm. In Section 2 we show how the memory can be saved by using an "integer-based storage" and how the PM *N*-body algorithm is adapted with this storage format. In Section 3 we quantify the accuracy of this algorithm using groups of simulations from `CUBEP3M` and `CUBE`. Discussions and conclusions are provided in Section 4.

## 2.Method

The most memory-consuming part of an *N*-body simulation is usually the phase space coordinates of *N*-body particles—24 bpp—which contains 4 single-precision floating numbers that must be used to store each particle's three-dimensional position and velocity vectors. `CUBEP3M`, an example of a memory-efficient parallel *N*-body code, can use as little as 40 bpp when sacrificing computing speed (Harnois-Déraps et al. 2013). This includes the phase coordinates (24 bpp) for particles in the physical domain and buffered region, a linked list (4 bpp), and a global coarse mesh and local fine mesh. Sometimes 4-byte real numbers are not necessarily adequate in representing the *global* coordinates in simulations. If the box size is many orders of magnitude larger than the interactive distance between particles, especially in the field of resimulation of dense subregions, double-precision (8 byte) coordinates are needed to avoid round-off errors. Another solution is to record *relative* coordinates for both position and velocity. `CUBE` replaces the coordinates and linked list 24+4=28 bpp memory usage with an integer-based storage, thus reducing the basic memory usage from 28 bpp down to 6 bpp, as described in Sections 2.1 and 2.2. The algorithm is described in Section 2.3.

### 2.1.Particle Position Storage

We construct a uniform mesh throughout the space and each particle belongs to its parent cell of the mesh. Instead of storing the global coordinates of each particle, we store its offset relative to its parent cell that contains the particle. This is similar to storing the quantities of nodes/clumps (structures of a tree in a tree code) relative to their parents (Appel 1985). We divide the cell, in each dimension *d*, evenly into 2^{8}=256 bins, and use a 1 byte (8 bits) integer *χ*_{d}∈{−128, −127, ..., 127} to indicate which bin it locates in this dimension. The global locations of particles are given by a cell-ordered format in memory space, and a complementary number count of particle numbers in this mesh (density field) will give complete information on the particle distribution in the mesh. Then, the global coordinate in the *d*th dimension *x*_{d} is given by , where is the index of the coarse grid. The mesh is chosen to be coarse enough such that the density field takes negligible memory. This coarse density field can be further compressed into a 1-byte integer format, such that a 1-byte integer shows the particle number in this coarse cell in a range from 0 to 255. In the densest cells (this rarely happens) where there are ≥255 particles, we can just write 255, and write the actual number as a 4-byte integer in another file.

In a simulation with volume *L*^{3} and *N*_{c}^{3} coarse cells, particle positions are stored with a resolution of *L*/(256*N*_{c}). The force calculation (e.g., softening length) should be configured to be much finer than this resolution, as discussed in later sections. On the other hand, particle position can also be stored as 2-byte (16 bits) integers to increase the resolution. In this case, each coarse cell is divided into 2^{16}=65536 bins and the position resolution is *L*/(65536*N*_{c}), which is precise compared to using 4-byte global coordinates (see later results). We denote this case as "x2" and denote the case in which we use 1-byte integers for positions as "x1."

We collectively write the general position conversion formulae as

where [] is the operator to take the integer part. *n*_{χ}∈{1, 2} is the number of bytes used for each integer, and *x*_{d} and *χ*_{d} are floating and integer versions of the coordinate. The velocity counterparts of these integers are *n*_{ν}=1, 2, *v*_{d} and *ν*_{d}. The position resolution for an *n*_{χ}-byte integer, "*xn*_{χ}," is .

As a *n*_{χ}=1, 1D (*d* = 1), 4-coarse-cell (*N*_{c}=4) example, if

and particle number density

then in units of coarse cells, the accurate positions of these four particles are

### 2.2.Particle Velocity Storage

Similarly, the actual velocity in the *d*th dimension *v*_{d} is decomposed into an averaged velocity field on the same coarse grid *v*_{c} and a residual Δ*v* relative to this field:

*v*_{c} is always recorded and kept updated, and should not occupy considerable memory. We then divide velocity space Δ*v* into uneven bins, and use a *n*_{ν}-byte integer to indicate which Δ*v* bin the particle is located.

The reason why we use uneven bins is that slower particles are more abundant compared to faster ones, and one should better resolve slower particles by tracing at least linear evolution. On the other hand, there could be extreme scattering particles (in case of PP force), and we can safely ignore or resolve those nonphysical particles. One of the solutions is that, if we know the probability distribution function (PDF) *f*(Δ*v*) we divide its cumulative distribution function (CDF) *F*(Δ*v*)∈(0, 1) into bins to determine the boundary of Δ*v* bins, and particles should evenly distribute in the corresponding uneven Δ*v* bins. Practically we find that either *f*(*v*_{d}) or *f*(Δ*v*) is close to Gaussian, so we can use Gaussian CDF, or any convenient analytic functions that are close to Gaussian, to convert velocity between real numbers and integers.

The essential parameter of the velocity distribution is its variance. On a nonlinear scale, the velocity distribution function is non-Gaussian. However, to the first-order approximation, we simply assume it as Gaussian and characterized it by the variance

where *a*(*z*) is the scale factor, *H*(*z*) is the Hubble parameter, *D* is the linear growth factor, , and *P*(*k*) is the linear power spectrum of density contrast at redshift zero. *W(k,R*) is the Fourier transform of the real space top-hat window function with a smoothing scale *r*. In Figure 1 we plot *σ*_{v}(*a, r*) as a function of *a* for a few smoothing scale *r*. Δ*v* in Equation (3) is the velocity dispersion relative to the coarse grid, so we approximate its variance as

where *r*_{c} is the scale of the coarse grid, and *r*_{p} is the scale of the average particle separation. In each dimension of the 3D velocity field, we use according to the equipartition theorem. On different scales, we measure the statistics of *v*_{d}, *v*_{c}, and Δ*v* and find good agreement with the above model.

The simulation results are very insensitive if we manually tune the variance of the model *σ*_{Δ} within an order of magnitude. However, in the *n*_{ν}=1 case, the method of using uneven bins gets much better results than simply using equal bins between minimum and maximum values . So, one can safely use a standard ΛCDM (cold dark matter with a cosmological constant as dark energy) for slightly different cosmological models, in Equation (4). In `CUBE`, the velocity conversion takes the formula

where is the operator to take the nearest integer. Tangent functions are convenient and computing-efficient. Compared to the error functions used in the Gaussian case, they take the same variance at *ν*_{d}=0 but resolve high velocities relatively better. All possible choices of conversion formulae and *σ*_{Δ} are unbiased in the conversion Equations (6) and (7), however, proper choices optimize the velocity space sampling and can result in more precise results.

Initially, particles are generated by an initial condition generator, at a higher redshift. The coarse grid velocity field *v*_{c} is also generated at this step by averaging all particles in the coarse cell. A global *σ*_{Δ} is calculated by Equation (5), where linear approximation holds. Then, velocities are stored by Equation (6). During the simulation, *v*_{c} is updated every time-step, and a nonlinear *σ*_{Δ} is measured directly from the simulation, and can be simply used in the next time-step, after scaled by the ratio of growth factors between two adjacent time-steps. For more details see Section 2.3.3.

### 2.3.Code Overview

`CUBE` uses a 2-level PM force calculation. In order to apply the integer-based format to the *N*-body simulation, substantial structural changes need to be done. `CUBE` is written in Coarray Fortran, where Coarray features replace MPI (Message Passing Interface) communications between computation nodes/images.^{7} The algorithm is described in this language.

#### 2.3.1.Spatial Decomposition

`CUBE` decomposes the global simulation volume into *M*_{g}^{3} cubic sub-volumes with *N*_{c} coarse grids or *N*_{f}=*RN*_{c} fine grids per side. The fine mesh is usually *R*=4 times finer than the coarse mesh. Each of these sub-volumes is assigned to a coarray *image*. Inside of an image, the sub-volume is further decomposed into *M*_{t}^{3} cubic *tiles* (defined as *V*_{p}, or a *physical* region) with *N*_{c}/(*M*_{g} *M*_{t}) coarse grids per side. Each *V*_{p} is surrounded by a *buffer* region *V*_{b}, which is *N*_{b} coarse cells thick. We define the extended tile region as . *V*_{e} is designed for two purposes:

(1) If the short fine mesh force has a cut-off, *N*_{b} , and is computed on *V*_{e}, then in *V*_{p} is guaranteed to be correct.

(2) *V*_{e} is able to collect all particles that are able to travel to *V*_{p}.

Figure 2 shows the spatial decomposition in a two-dimensional analogy, with *M*_{g}=2 and *M*_{t}=2.

According to this spatial decomposition, and as discussed in the last two subsections, we declare using Fortran language:

where covers the buffer region on both sides, *M*_{t} is the tile dimensions, and *M*_{g} is the image *co-dimensions*.^{8} We denote the actual number of particles in a given image as , and is a value (discussed in Section 2.4) large enough to store particles in *V*_{e}. *χ*_{d} and *ν*_{d} must be sorted according to the same memory layout as *ρ*_{c}, such that *n*_{c} and *x*_{d} can be obtained from Equation (2). provides complete information on the positions and velocities of particles, and we call it a checkpoint.

An additional particle-ID (PID) array, *I*_{P}, can also be declared to differentiate particle types (e.g., CDM and neutrino particles) or to differentiate every particle, using an *n*_{I}-byte integer per particle. If PID is turned on, the array *I*_{P} is also included in the checkpoints, and the ordering of *I*_{P} is the same as that for *χ*_{d} and *ν*_{d}.

#### 2.3.2.Initial Conditions

The cosmological initial condition generator is compiled and run separately from the main *N*-body code. Here, we briefly describe it for completeness.

The first step is the calculation of the displacement potential Φ. At an initial redshift *z*_{i}, we generate a linear density fluctuation on Lagrangian grid by multiplying a Gaussian random field with the transfer function *T*(*k*) (given by the assumed cosmological model) in Fourier space. Then, we solve for the potential of a curlless displacement field by Poisson equation in Fourier space.

The second step to generate particles and displace them by Zel'dovich approximation (ZA) (Zel'dovich 1970), where the displacement field is obtained by differentiating Φ, is in real space. This step is done on *V*_{e} of each tile.

We iterate twice over particles' in *V*_{e}. The first iteration calculates particles' and by ZA and obtains *ρ*_{c} and *v*_{c} on the coarse grid. The second iteration's and are calculated again and are converted to *χ*_{d} and *ν*_{d} by Equations (1) and (6) and placed in a certain order according to *ρ*_{c}. Lastly, we delete particles in *V*_{b}, and convert the ones in *V*_{p} and write of this tile to disk. The above is similar to `update`_`xp` of Section 2.3.3. If PIDs are needed, they are also generated here. After working on all tiles, we sum up and . We summarize the above steps into a pseudocode in Figure 3. During this step, the only major memory usage is Φ on the fine mesh. If the number of particles per fine grid *P*_{f}=1, the memory consumption of this in-place FFT is 4 bpp.

#### 2.3.3.Algorithm

Figure 4 shows the overall structure of the main code.

`initialize` creates fine mesh and coarse mesh FFT plans, and reads in configuration files telling the program at which redshifts we need to do checkpoints, halofinds, or stop the simulation. Force kernels *K*_{c}, *K*_{f} are also computed or loaded.

`read`_`particles`, from the disk, reads in a checkpoint for each image. Because they exist only in the *V*_{p} of every tile, they are *disjoint*, and they provide *complete* information on the whole simulation volume—we call it "*disjoint state*." In this state, *ρ*_{c}, *v*_{c}'s values in buffer regions, and and are 0's. Because *I*_{P} is generated and manipulated together with *ν*_{d}, we do not explicitly mention *I*_{P}.

`buffer`_`density`, `buffer`_`x`, and `buffer`_`v` convert the "disjoint" state to the "*buffered state*." In `buffer`_`density`, *V*_{b} regions of *ρ*_{c} are synchronized between tiles and images. By `buffer`_`x`, *χ*_{d} is updated to contain common, buffered particles, and they are sorted according to the buffered *ρ*_{c}. `buffer`_`v` deals with *ν*_{d} in a similar manner.

`timestep` is a second-order Runge–Kutta method used in the time integration, i.e., for *n* time-steps, we update positions (*D*=drift) and velocities (*K*=kick) at interlaced half time-steps by operator splitting; the operation (DKKD)^{n} is second-order accurate. The actual simulation applies varied time-steps by `timestep`, where a time increment is constrained by particles' maximum velocities, accelerations, cosmic expansion, and any other desired conditions.

`update`_`xp` is used, according to , `to update` particle positions (drift *D*) in a "*gather*" algorithm tile by tile. For each particle, *χ*_{d} and *ν*_{d} are converted to *x*_{d} and *v*_{d} by Equations (2), (7).

In order to keep particles ordered, for each tile, we first perform on all particles to obtain an updated density and velocity field on the tile, and . Then, this calculation is done on the same tile again^{9} to generate a new, local particle list and by Equations (1) and (6). Here, the ordering of and relies on . Then, the third iteration is done on this tile to delete buffer regions of . Then the disjoint state of replaces the old . Finally, and are updated. These steps are summarized in Figure 5.

`For update`_`vp` the PM or PP particle-mesh (P^{3}M) algorithm is applied in this subroutine to update particles' velocities (kick *K*). We first call `buffer`_`density` and `buffer`_`xp` to place the particle positions in a buffered state. Then, according to the particle distributions on *V*_{e}, we calculate the fine mesh force and update the particle velocities in the *V*_{p}. An optional PP force can be called to increase the force resolution.

The compensating coarse grid force is globally computed using a coarser- (usually by a factor of *R* = 4) mesh by dimensional splitting—the cubic distributed coarse density field *ρ*_{c} is transposed (inter-image) and Fourier transformed (inner-image) in three consecutive dimensions. After the multiplication of memory-distributed force kernel *K*_{c}, the inverse transform takes place to get the cubic distributed coarse force field , upon which velocities are updated again.

For each type of velocity update, the collective operations are Equation (7), , and Equation (6). We also update according to the new *v*_{d}−*v*_{c}. These steps are summarized in Figure 6.

After the updating of *ν*_{d} in *V*_{p}, we simply call `buffer`_`v` again to bring *ν*_{d} into a buffered state, such that the `update`_`x` in the next iteration will be done correctly.

`For checkpoint`, if a desired redshift is reached, we execute the last drift step in the (DKKD)^{n} operation by `update`_`xp`, and call `checkpoint` to save the disjoint state of on the disk. Other operations like run-time halo finder, density projections are also done at this point.

Finally, in the `finalize` subroutine we destroy all the FFT plans and finish up any timing or statistics taken in the simulation.

### 2.4.Memory Layout

Here, we list the memory-consuming arrays and how they scale with different configurations of the simulation. We classify them into (1) arrays of particles, (2) coarse mesh arrays, and (3) fine mesh arrays.

#### 2.4.1.Arrays of Particles

These arrays comprise the majority of the memory usage, and contain checkpoint arrays *χ*_{d}, *ν*_{d}, *I*_{P}, and temporary arrays , , . The former uses memory , where

and is the average number of particles per image. The second term, proportional to *V*_{e}/*V*_{p}, lets us store additional particles in *V*_{b}, and the third term takes into account the inhom*ogeneity of on different images. When each image models smaller physical scales, _{image} should be set larger.

Temporary , , store particles only on tiles, and the particle number is set to be

where controls the inhom*ogeneity on scales of tiles. Larger *M*_{t} causes more inhom*ogeneity on smaller tiles, and can be much larger than _{image}; however, the term decreases much faster. Practically, the majority memory is occupied by *χ*_{d}, *ν*_{d}, *I*_{P}.

Summarizing the above, we find that the memory usage per particle (bpp) , given and *n*_{I}=0, is

Because , we can minimize Equation (10) by tuning *M*_{t}.

#### 2.4.2.Coarse Mesh Arrays

On coarse mesh arrays, *ρ*_{c} (4-byte integers), *v*_{c}, and force kernel *K*_{c} should always be kept. They are usually configured to be *R*=4 times coarser than fine grids and particle number density. They have use memories of bytes per image, or

where *P*_{c} is the average number of particles per coarse cell. Other coarse-grid-based arrays are , and* F*_{c} pencil-FFT arrays. exists only on tiles; *F*_{c} and pencil-FFT arrays can be equivalenced with other temporary arrays. Thus, the majority of the memory usage from coarse FFT arrays comes from Equation (11).

#### 2.4.3.Fine Mesh Arrays

On the local fine mesh, only a force kernel array *K*_{f} needs to be kept. It has memory of 3**×**4**×**(*RN*_{e}*M*_{t})^{3/2} per image, or

where *P*_{f} is the average number of particles per fine cell. For the fine mesh density arrays, force field arrays are temporary. Since , , coarse force arrays, and pencil-FFT arrays are also temporary and are not used in any calculation simultaneously, they can be overlapped in memory using equivalent statements.

#### 2.4.4.Compare with Traditional Algorithms

To illustrate the improvement of memory usage, we refer to the TianNu simulation (Yu et al. 2017a) run on the Tianhe-2 supercomputer, which used a traditional *N*-body code `CUBEP3M`. TianNu's particle number (shown in Table 2) is limited by memory per computing node—for each computing node, an average of 576^{3} neutrino particles and 288^{3} CDM particles are used, and consumes ,^{10} or about . A memory-efficient configuration of `CUBEP3M` using large physical scales and at costs of speed, still uses about .

If the same amount of memory is allocated to `CUBE`, we can set parameters as , , *N*_{b}=6, , and thus . Setting *M*_{t}=3 (27 tiles per image) minimizes and uses about , corresponding to . This can be done on most of the supercomputers, even modern laptops.

Table 1 shows the memory consumption for this test simulation. The memory-consuming arrays are listed and classified into the three types above mentioned, and their memory usages are in units of GB (10^{9} byte), bpp, and their percentage of the total memory usage. The parenthesized numbers show overlapped memory, which is saved by equivalencing them with the underscored numbers. There are other unlisted variables that are memory-light, and can also be equivalenced with the listed variables.

**Table 1.**Memory Layout for a Certain Configuration

Memory Usage | ||||
---|---|---|---|---|

Type | Array | /GB | /bpp | Percentage |

Particles | χ_{d}, ν_{d} | 29.9 | 8.24 | 83.8% |

, | 3.16 | 0.872 | 8.87% | |

I_{P}, | 0 | 0 | 0% | |

Subtotal | 33.0 | 9.12 | 92.7% | |

Coarse Mesh | ρ_{c} | 0.296 | 0.0818 | 0.831% |

v_{c} | 0.889 | 0.245 | 2.49% | |

K_{c} | 0.340 | 0.0939 | 0.954% | |

F_{c} | (0.690) | (0.190) | (1.94%) | |

0.0140 | 0.00388 | 0.0394% | ||

Pencil-FFT | (0.454) | (0.125) | (1.27%) | |

Subtotal | 1.55 | 0.429 | 4.36% | |

Fine Mesh | K_{f} | 1.06 | 0.292 | 2.97% |

F_{f} | (1.63) | (0.450) | (4.57%) | |

Fine-FFT | (1.41) | (0.389) | (3.96%) | |

Subtotal | 1.06 | 0.292 | 2.97% | |

Total | 35.6 | 9.84 | 100% | |

Optimal Limit | 21.7 | 6 | 61.0% |

Download table as: ASCIITypeset image

In the bottom of Table 1 we stress that the optimal memory usage is 6 bpp, or 21.7 GB, 61% of the actual . The dominating departure from this limit is that *χ*_{d} and *ν*_{d} already occupy 8.24 bpp, which come from the term of Equation (10). All other variables occupy an additional 8%. On modern supercomputers, the memory per computing node is usually much larger, and by scaling up the number of particles per node, the buffer ratio *N*_{b}/*N*_{t} will be lowered and we can approach closer to the 6 bpp limit.

## 3.Accuracy

We run a group of simulations to test the accuracy of `CUBE`. We use the same seeds to generate the same Gaussian random fields in the initial condition generators of `CUBEP3M` and `CUBE`, and then they produce initial conditions in their own formats . Then, the main *N*-body codes run their own initial conditions to redshift *z*=0. We use the same force kernels as `CUBEP3M` without PP force. Note that near fine grid scales, it is possible to enhance the force kernel to better match the nonlinear power spectrum predictions; however, we use the conservative mode of `CUBEP3M` in this paper. An extended PP force and an unbiased force matching algorithm will be added to `CUBE`. The power spectrum studies are presented in Harnois-Déraps et al. (2013), while here we focus on the cross-correlations between different integer-based methods.

First, using different configurations—different numbers of computing nodes, box sizes, particle resolutions, tiles per node/image, etc., we find that using 2-byte integers for both positions and velocities (, or x2v2) allows `CUBE` to give exact results compared to `CUBEP3M`. So if sufficient memory is provided, one can always use x2v2 to get exact results as `CUBEP3M`, and the optimal memory limit of this case is 12 bpp, which is still much lower than traditional methods. Next, we focus on the accuracy of the other three cases—x1v2, x2v1, and x1v1.

We list the names and configurations of the simulations used in Table 2, where , *L*, *z*_{i}, *N*_{p}, and *m*_{p} are respectively the number of computing nodes used, the length of the side of the box, initial redshift, total number of particles, and particle mass. These configurations are run by `CUBEP3M`, x2v2, x1v2, x2v1 ,and x1v1 versions of `CUBE` with the same initial seeds. Using different numbers of tiles per image gives the exact same results. We also list the configurations for TianNu and TianZero (Emberson et al. 2017; Yu et al. 2017a) simulations as a reference.

**Table 2.**Simulation Configurations

Configurations | |||||
---|---|---|---|---|---|

Name | N_{node} | L/(Mpc h^{−1}) | z_{i} | N_{p} | m_{p}/ |

S512 | 8 | 200 | 49 | 512^{3} | 7.5×10^{9} |

S256 | 1 | 80 | 49 | 256^{3} | 3.8×10^{9} |

S2048S | 64 | 400 | 49 | 2048^{3} | 9.4×10^{8} |

S2048L | 64 | 1200 | 49 | 2048^{3} | 2.5×10^{10} |

TianNu | 13824 | 1200 | 100 | 6912^{3} | 6.9×10^{8} |

5 | 13824^{3} | 3.2×10^{5} | |||

TianZero | 13824 | 1200 | 100 | 6912^{3} | 7.0×10^{8} |

Download table as: ASCIITypeset image

### 3.1.Power Spectrum

### 3.2.Displacement of Particles

In S512, we zoom-in on a small region of and compare the particle distribution between `CUBEP3M` and `CUBE`-x1v1 in Figure 7. For clarity, `CUBEP3M` particles are marked with the larger, gray dots, whereas the smaller red dots are `CUBE`-x1v1 particles overplotted onto them. The difference between them is clear. The 1 Mpc/*h*, fine grid, and coarse grid scales are shown in the figure. The position resolution of particles in `CUBE`-x1v1 is 1/256 of a coarse grid, or 1/64 of a fine grid.

To quantify the offset in the final particle distributions, we use PIDs to track the displacement of every particle (Yu et al. 2017b), where and are Eulerian and Lagrangian coordinates of the particle. Then, we calculate the absolute value of the offset vector

Here, stands for `CUBEP3M` and subscript _{i} can stand for x2v2, x1v2, x2v1, or x1v1. The PDFs and CDFs of in S256 are shown in Figure 8. Results from x2v2, x1v2, x2v1, or x1v1 are in black, red, blue, and green, respectively. The results from the absolute displacement of particles (by replacing with in Equation (13)) are shown in gray for comparison.

For x2v2, almost all particles are accurate up to 1/100 of a fine grid, and the worst particle is of a fine grid away from its counterpart in `CUBEP3M`. The difference is caused by round-off errors and is negligible in physical and cosmological applications. The accuracy of x1v2 is between x2v2 and x1v1, and x2v1 gives only a minor improvement from x1v1. We also run a simulation with the same number of particles but with *L*=600 Mpc/*h* (), and find that the accuracy of x1v2 is in turns between x2v1 and x1v1. We interpret that, in this latter case, particles have lower mass resolution, so they move slower and need higher position resolution but need lower velocity resolution, thus x2v1 outperforms x1v2.

S2048S and S2048L are two simulations with 2048^{3} particles in small (*L*=400 Mpc/*h*) and large (*L*=1200 Mpc/*h*) box sizes. We compare their accuracy by their power spectra and their cross-correlations with `CUBEP3M` at *z*=0. The physical scale of S2048S is designed such that the particle mass resolution *m*_{p} is comparable to TianNu and TianZero simulations (their parameters are also listed in Table 2). On the other hand, S2048L focuses on larger structures, on which scale one can study weak gravitational lensing, BAO (Eisenstein et al. 2005), and its reconstruction (Eisenstein et al. 2007; Wang et al. 2017), etc.

For each simulation the particles are first cloud-in-cell (CIC) interpolated onto the fine mesh grid, and from the density field *ρ* we define the density contrast . We define the cross power spectrum between two fields and ( for the auto-power spectrum) in Fourier space as

where is the 3D Dirac delta function. In cosmology we usually consider the dimensionless power spectrum . The cross-correlation coefficient is defined as

In the upper two panels of Figure 9 we show the power spectra of `CUBE`. In both plots of S2048S and S2048L the four solid curves of different colors show the results of x2v2, x1v2, x2v1, and x1v1, and they almost overlapped with each other. The dashed curves are the nonlinear prediction of the matter power spectrum by CLASS (Blas et al. 2011).

We label *k*=0.2 *k*_{Nyquist} as vertical dashed lines, where is the scale of fine mesh grids, and the scale of average particle separations. On this scale, the power spectra are offset from nonlinear predictions by at least 20%. This error is from the PM algorithm and one has to increase the resolution of the simulation to correct these offsets. We do not plot `CUBEP3M` because we found that `CUBEP3M` and x2v2 of `CUBE` produce same results. Thus, the differences between `CUBEP3M` and different integer formats of `CUBE` are negligible compared to the error of the PM algorithm.

In the lower parts of these four panels we study the cross-correlations. We compare everything with `CUBE`-x2v2, to emphasize the decorrelation (i.e., ) by different integer formats. Note that x2v2 is perfectly correlated with `CUBEP3M`. In the lower two panels of Figure 9 the solid curves show the decorrelation *ξ* of x1v2, x2v1, and x1v1. The higher resolution (S2048S) in general comes with more decorrelations, and x1v2 cross-correlates with x2v2 better than the other two cases. In order to quantify the PM error in terms of decorrelations, we run x2v2 simulations with the same initial conditions, but 8 times more particles and cells, and measure the decorrelation caused by coarser resolutions. These results are shown as dashed curves (labeled "PM"). We conclude that the cross-correlation of x1v1, in all cases, is not worse than the PM errors.

To summarize from Figure 9, in terms of either power spectrum deviation or cross-correlation, the error induced by information optimization (even for x1v1) is lower than the error from the PM algorithm, and we can safely use x1v1 for most of the LSS studies.

## 4.Discussion and Conclusion

We present a parallel, information-optimized *N*-body algorithm. This open-source code, `CUBE`, has recently been used in many studies of LSS, e.g., Yu et al. (2017b), Wang et al. (2017), Pan et al. (2017). It requires very low memory usage, approaching 6 bpp.

The accuracy of this code is adjustable in that we can choose 1-byte/2-byte integers separately for positions and velocities of particles. In the case of using 2 byte integers for both positions and velocities ("x2v2," and memory can be 12 bpp), the algorithm gives the exact results given by traditional *N*-body algorithms. Note that the results are exactly the same in that they not only produce the same physical statistics of LSS, but also the same error (not physical) from the PM algorithm near Nyquist frequencies. In other words, the positions and velocities of each particle are exact. In practice, we only require that the errors from information optimization are much lower than the errors from the PM algorithm. In Figure 9 we see that this is the case even for the most memory-efficient configuration, x1v1. This shows that in most LSS studies, when our scales of interest are smaller than , six 1-byte fixed point numbers contain sufficient information for every *N*-body particle.

Another benefit of this algorithm is LSS simulations with neutrinos, although the neutrino modules of `CUBE` are in development. Neutrinos have a high velocity dispersion and move much faster than CDM, and their small-scale errors are dominated by their Poisson noise. We expect that, compared to CDM, x1v1 gives less power spectrum deviation to neutrinos, as they behave more Gaussian and are less clustered. LSS-neutrino simulations, like TianNu, can contain much more (for TianNu, 8 times more) neutrino *N*-body particles than CDM, which dominate the memory. For a TianNu-like simulation, one can safely use x1v2 or x2v2 for CDM and x1v1 for neutrinos, and the memory usage can approach 6 bpp. This allows much more particles to be included in the simulation and can lower the Poisson noise of neutrinos prominently. A 1-byte PID per particle increases minor memory usage and can differentiate eight kinds of particles, or we can store different particles in different arrays without using PID.

We did not include PP force in `CUBE` and `CUBEP3M` in this paper. If one wants to focus on smaller scales, like halo masses and profiles, extended PP forces, which act up to adjacent fine cells, should be taken into account. The memory consumption for PP force is only local and is negligible compared to particles. In these cases where even x2v2 is used, a 12 bpp memory usage is still much lower than that of traditional *N*-body algorithms.

Traditional *N*-body codes consume considerable memory while performing relatively light computations. `CUBE` is designed to optimize the efficiency of information and the memory usage in *N*-body simulations. `CUBE` is written in Coarray Fortran—concise Coarray features are used instead of complicated MPI—and the code itself is much more concise than `CUBEP3M` for future maintenance and development. The next steps are optimization of the code and adapting it for various kinds of heterogeneous computing systems, e.g., MIC and GPUs. Optimizing the velocity storage may further improve the accuracy of x1v1, whose effects on neutrino-LSS simulation are yet to be discovered.

We acknowledge funding from NSERC. H.R.Y. thanks Derek Inman for many helpful discussions. We thank the anonymous referee for many helpful suggestions that improved the paper. The simulations were performed on the GPC supercomputer at the SciNet HPC Consortium. The code is publicly available on github.com under `yuhaoran/CUBE`.

7

Images are the concept of computing nodes or MPI tasks in Coarray Fortran. We use this terminology in this paper.

8

Coarray Fortran concept. Co-dimensions can enable communications between images.

9

This repetition scales as

*o*(*N*) and is computationally inexpensive.10

Additional memory is used for OpenMP parallelization and for particle IDs to differentiate different particle spices.

Please wait… references are loading.