CUBE: An Information-optimized Parallel Cosmological N-body Algorithm

Author e-mails

haoran@cita.utoronto.ca

Author affiliations

1 Tsung-Dao Lee Institute, Shanghai Jiao Tong University, Shanghai, 200240, People's Republic of China; haoran@cita.utoronto.ca

2 Canadian Institute for Theoretical Astrophysics, University of Toronto, Toronto, ON M5H 3H8, Canada

3 Department of Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, People's Republic of China

4 Dunlap Institute for Astronomy and Astrophysics, University of Toronto, Toronto, ON M5S 3H4, Canada

5 Canadian Institute for Advanced Research, CIFAR Program in Gravitation and Cosmology, Toronto, ON M5G 1Z8, Canada

6 Perimeter Institute for Theoretical Physics, Waterloo, ON N2L 2Y5, Canada

Dates

  1. Received 2017 December 17
  2. Revised 2018 May 24
  3. Accepted 2018 May 25
  4. Published 2018 July 26


Keywords

large-scale structure of universe; methods: numerical


Abstract

Cosmological large-scale structure N-body simulations are computation-light but memory-heavy problems in supercomputing. The considerable memory footprint is usually dominated by an inefficient way of storing more-than-sufficient phase space information of particles. We present a new parallel, information-optimized, particle-mesh-based N-body code, CUBE, in which information efficiency and memory efficiency are increased by nearly an order of magnitude. This is accomplished by storing particles' relative phase space coordinates instead of global values, in fixed point formats as light as 1 byte. The remaining information is given by complementary density and velocity fields (negligible in memory space) and by the proper ordering of particles (no extra memory). Our numerical experiments show that this information-optimized N-body algorithm provides accurate results within the error of the particle-mesh algorithm. This significant lowering of the memory-to-computation ratio breaks the bottleneck of scaling up and speeding up large cosmological N-body simulations on multi-core and heterogeneous computing systems.


1. Introduction

The N-body simulation, a dynamical simulation of a group of particles, is a powerful tool in physics and astronomy (Hockney & Eastwood 1988). It is widely used in cosmology to model the large-scale structure (LSS) of the universe (Davis et al. 1985). Current percent- and sub-percent-level LSS measurements of cosmological parameters, via the matter power spectrum (Rimes & Hamilton 2005; Takahashi et al. 2011), baryonic acoustic oscillations (BAO) (Eisenstein et al. 2005; Takahashi et al. 2009), weak gravitational lensing (Vale & White 2003; Hilbert et al. 2009; Sato et al. 2009), etc., require an understanding of the nonlinear dynamics of cosmic structure, and rely on high-resolution, high-dynamic-range N-body simulations.

When N is large, the brute force pairwise particle–particle (PP) force requires an unaffordable O(N²) of computation, so many algorithms have been designed to alleviate it. Various fast-multipole methods (Rokhlin 1985; Dehnen 2014; Potter & Stadel 2016) improve the complexity to O(N log N) or even O(N); among the most popular are the "tree" codes, like GADGET (Springel et al. 2001; Springel 2005) and its simulation "Millennium" (Springel et al. 2005; Angulo et al. 2012), TPM (Xu 1995), and GOTPM (Dubinski et al. 2004). Other methods include adaptive grid algorithms like HYDRA (Couchman et al. 1995) and RAMSES (Teyssier 2010), as well as mesh-refined codes (Couchman 1991) and moving adaptive particle-mesh (PM) codes (Pen 1995). The standard PM algorithm (Hockney & Eastwood 1988) is the most memory- and computation-efficient if we focus on large cosmological scales. The load-balancing problem is minimized because the matter distribution is rather homogeneous, and the speed benefits from fast Fourier transform (FFT) libraries such as FFTW3 (Frigo & Johnson 2005). PMFAST (Merz et al. 2005) introduced a 2-level PM algorithm, pushing PM codes toward speed, memory compactness, and scalability. Building on PMFAST, CUBEP3M (Harnois-Déraps et al. 2013) uses cubic spatial decomposition and adds the PP force and many other features.

In addition to new methodology, the fast development of parallel supercomputers enables us to simulate systems of more than a trillion (10¹²) N-body particles. To date, the largest N-body simulation in application is "TianNu" (Emberson et al. 2017; Yu et al. 2017a), run on the TianHe-2 supercomputer. Using the code CUBEP3M with added neutrino modules, it evolves 3×10¹² particles to simulate cold dark matter (CDM) and cosmic neutrinos through the cosmic age.

Compared to these computational optimizations, N-body simulations still use a considerable amount of memory to store the information of particles. The phase space coordinates (x, v) are stored as at least six single-precision floating point numbers (a total of 24 bytes). On the other hand, modern supercomputer systems use multi-core CPUs, many integrated cores (MIC), and even densely parallelized GPUs, bringing orders of magnitude higher computing power, whereas these architectures usually have limited memory. Thus, these computation-light but memory-heavy applications (compared to, e.g., matrix multiplication and decomposition calculations) are currently less able to use the full computing power of modern supercomputers. For example, although the native and offload modes of CUBEP3M are able to run on Intel Xeon Phi MIC architectures given enough memory, the TianNu simulation was done on TianHe-2 using only its CPUs: 73% of the total memory but only 13% of the total computing power. We investigate how much the information of particles can be reduced while still preserving the accuracy of N-body simulations, with the aim of optimizing the total memory usage for a given N.

In the following sections we present a new, information-optimized parallel cosmological N-body simulation code CUBE (Yu et al. 2018), using as little as 6 bytes per particle (bpp). It gives accurate results in cosmological LSS simulations—the error induced by information optimization is below the error from the PM algorithm. In Section 2 we show how the memory can be saved by using an "integer-based storage" and how the PM N-body algorithm is adapted with this storage format. In Section 3 we quantify the accuracy of this algorithm using groups of simulations from CUBEP3M and CUBE. Discussions and conclusions are provided in Section 4.

2. Method

The most memory-consuming part of an N-body simulation is usually the phase space coordinates of the N-body particles: 24 bpp, i.e., six single-precision floating point numbers storing each particle's three-dimensional position and velocity vectors. CUBEP3M, an example of a memory-efficient parallel N-body code, can use as little as 40 bpp when sacrificing computing speed (Harnois-Déraps et al. 2013). This includes the phase space coordinates (24 bpp) for particles in the physical domain and buffered region, a linked list (4 bpp), and a global coarse mesh and local fine mesh. Sometimes 4-byte real numbers are not even adequate to represent global coordinates: if the box size is many orders of magnitude larger than the interaction distance between particles, especially in resimulations of dense subregions, double-precision (8-byte) coordinates are needed to avoid round-off errors. A general solution is to record relative coordinates for both position and velocity. CUBE replaces the 24+4=28 bpp memory usage of the coordinates and linked list with an integer-based storage, reducing the basic memory usage from 28 bpp down to 6 bpp, as described in Sections 2.1 and 2.2. The algorithm is described in Section 2.3.

2.1. Particle Position Storage

We construct a uniform mesh throughout the space, and each particle belongs to a parent cell of this mesh. Instead of storing the global coordinates of each particle, we store its offset relative to the parent cell that contains it. This is similar to storing the quantities of nodes/clumps (structures of a tree in a tree code) relative to their parents (Appel 1985). We divide the cell, in each dimension d, evenly into 2^8=256 bins, and use a 1-byte (8-bit) integer χd∈{−128, −127, ..., 127} to indicate which bin the particle occupies in this dimension. The global locations of particles are given by a cell-ordered format in memory space, and a complementary number count of particles on this mesh (a density field) gives complete information on the particle distribution. Then, the global coordinate in the dth dimension is given by xd = nc + (χd + 128 + 1/2)/256, where nc∈{0, 1, ..., Nc−1} is the index of the coarse grid. The mesh is chosen to be coarse enough that the density field takes negligible memory. This coarse density field can be further compressed into a 1-byte integer format, such that a 1-byte integer counts the particles in a coarse cell in a range from 0 to 255. In the rare densest cells containing ≥255 particles, we can just write 255, and write the actual number as a 4-byte integer in another file.

In a simulation with volume L³ and Nc³ coarse cells, particle positions are stored with a resolution of L/(256Nc). This position resolution should be configured to be much finer than the force resolution (e.g., the softening length), as discussed in later sections. On the other hand, particle positions can also be stored as 2-byte (16-bit) integers to increase the resolution. In this case, each coarse cell is divided into 2^16=65536 bins and the position resolution is L/(65536Nc), which is comparable in precision to using 4-byte global coordinates (see later results). We denote this case as "x2" and the case in which we use 1-byte integers for positions as "x1."

We collectively write the general position conversion formulae as

χd = [2^(8nχ) (xd − nc)] − 2^(8nχ−1),    (1)

xd = nc + 2^(−8nχ) (χd + 2^(8nχ−1) + 1/2),    (2)

where [] is the operator taking the integer part, nχ∈{1, 2} is the number of bytes used for each integer, and xd and χd are the floating-point and integer versions of the coordinate. The velocity counterparts of these quantities are nν∈{1, 2}, vd, and νd. The position resolution for an nχ-byte integer, "xnχ", is L/(2^(8nχ) Nc).

As an nχ=1, 1D (d = 1), 4-coarse-cell (Nc=4) example, if

χ1 = {−128, 0, 64, −64}

and the particle number density (counts per coarse cell) is

ρc = {2, 0, 1, 1},

then, in units of coarse cells, the accurate positions of these four particles are

x1 ≈ {0.002, 0.502, 2.752, 3.252},

where the cell ordering together with ρc determines each particle's parent cell: the first two particles belong to the first coarse cell, the third to the third cell, and the fourth to the fourth cell.
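The conversions of Equations (1) and (2) are cheap bit-level operations. The following minimal Fortran sketch (our illustration, not the CUBE source; variable names are ours) encodes one coordinate into a 1-byte integer and decodes it back to the bin center:

  program demo_position
    implicit none
    integer, parameter :: nxi = 1      ! bytes per position integer (n_chi)
    integer(1) :: chi                  ! 1-byte offset integer, -128..127
    integer :: nc                      ! parent coarse cell index
    real :: x, xrec
    x = 2.7519                         ! global position in units of coarse cells
    nc = floor(x)
    ! Equation (1): bin the in-cell offset into 2^(8 n_chi) bins
    chi = int(floor(2.0**(8*nxi)*(x - nc)) - 2.0**(8*nxi - 1), 1)
    ! Equation (2): decode to the center of the bin
    xrec = nc + 2.0**(-8*nxi)*(chi + 2.0**(8*nxi - 1) + 0.5)
    print *, 'chi =', chi, ' decoded x =', xrec   ! error < 1/512 coarse cell
  end program

Running this recovers x ≈ 2.752 from nc = 2 and χ = 64, consistent with the third particle in the example above.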

2.2. Particle Velocity Storage

Similarly, the actual velocity vd in the dth dimension is decomposed into an averaged velocity field vc on the same coarse grid and a residual Δv relative to this field:

vd = vc + Δv.    (3)

vc is always recorded and kept updated, and does not occupy considerable memory. We then divide the velocity space Δv into uneven bins, and use an nν-byte integer to indicate which Δv bin the particle is located in.

The reason why we use uneven bins is that slower particles are more abundant than faster ones, and one should better resolve the slower particles to trace at least the linear evolution. On the other hand, there can be extremely scattered particles (in the case of PP force), and we can safely sacrifice resolution for those nonphysical outliers. One solution is that, if we know the probability distribution function (PDF) f(Δv), we divide its cumulative distribution function (CDF) F(Δv)∈(0, 1) into 2^(8nν) even bins to determine the boundaries of the Δv bins, such that particles distribute evenly over the corresponding uneven Δv bins. Practically, we find that either f(vd) or f(Δv) is close to Gaussian, so we can use a Gaussian CDF, or any convenient analytic function close to Gaussian, to convert velocities between real numbers and integers.

The essential parameter of the velocity distribution is its variance. On nonlinear scales, the velocity distribution function is non-Gaussian. However, to first order, we simply assume it to be Gaussian and characterize it by the variance

σv²(a, r) = (a H f D)²/(2π²) ∫ P(k) W²(k, r) dk,    (4)

where a(z) is the scale factor, H(z) is the Hubble parameter, D is the linear growth factor, f ≡ dlnD/dlna, and P(k) is the linear power spectrum of the density contrast at redshift zero. W(k, r) is the Fourier transform of the real space top-hat window function with smoothing scale r. In Figure 1 we plot σv(a, r) as a function of a for a few smoothing scales r. Δv in Equation (3) is the velocity dispersion relative to the coarse grid, so we approximate its variance as

σΔ²(a) ≈ σv²(a, rp) − σv²(a, rc),    (5)

where rc is the scale of the coarse grid, and rp is the scale of the average particle separation. In each dimension of the 3D velocity field we use σΔ,1D = σΔ/√3, according to the equipartition theorem. On different scales, we measure the statistics of vd, vc, and Δv and find good agreement with the above model.

Figure 1. σv(a, r) as a function of the scale factor a for several smoothing scales r.
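The integral in Equation (4) and the difference in Equation (5) are straightforward to evaluate numerically. A simplified Fortran sketch (our illustration; the tabulated P(k), the prefactor aHfD, and the smoothing scales are toy placeholders):

  module sigma_mod
    implicit none
  contains
    real function tophat(k, r)             ! Fourier-space top-hat window W(k,r)
      real, intent(in) :: k, r
      real :: kr
      kr = k*r
      tophat = 3.0*(sin(kr) - kr*cos(kr))/kr**3
    end function
    real function sigma_v2(aHfD, k, P, r)  ! Equation (4), trapezoidal rule
      real, intent(in) :: aHfD, k(:), P(:), r
      real, parameter :: pi = 3.1415927
      integer :: i
      sigma_v2 = 0.0
      do i = 2, size(k)
        sigma_v2 = sigma_v2 + 0.5*(k(i) - k(i-1)) &
                   *(P(i)*tophat(k(i), r)**2 + P(i-1)*tophat(k(i-1), r)**2)
      end do
      sigma_v2 = sigma_v2*aHfD**2/(2.0*pi**2)
    end function
  end module

  program demo_sigma
    use sigma_mod
    implicit none
    integer :: i
    real :: k(100), P(100)
    do i = 1, 100
      k(i) = 0.01*i                             ! toy k sampling (h/Mpc)
      P(i) = 1.0e4*k(i)/(1.0 + (k(i)/0.02)**3)  ! toy linear P(k), placeholder
    end do
    ! Equation (5): residual variance between particle and coarse-grid scales
    print *, 'sigma_Delta^2 =', sigma_v2(1.0, k, P, 0.5) - sigma_v2(1.0, k, P, 2.0)
  end program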

The simulation results are rather insensitive to the model variance σΔ: manually tuning it within an order of magnitude changes little. However, in the nν=1 case, the method of uneven bins gives much better results than simply using equal bins between the minimum and maximum values of Δv. So, in Equation (4) one can safely use the linear power spectrum of a standard ΛCDM model (cold dark matter with a cosmological constant as dark energy), even for slightly different cosmological models. In CUBE, the velocity conversion takes the form

νd = nint[(2^(8nν)/π) arctan(√(π/2) Δv/σΔ)],    (6)

Δv = σΔ √(2/π) tan(π νd/2^(8nν)),    (7)

where nint[·] is the operator taking the nearest integer. Tangent functions are convenient and computationally efficient: compared to the error functions used in the Gaussian case, they sample with the same variance at νd=0 but resolve high velocities relatively better. All choices of conversion formulae and σΔ are unbiased in the conversions of Equations (6) and (7); however, proper choices optimize the velocity-space sampling and give more precise results.
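In effect, Equations (6) and (7) are an arctan companding of Δv. A minimal Fortran sketch of our reconstruction of this mapping (the √(π/2) factor matches the sampling density of a Gaussian CDF at Δv = 0, as described above; σΔ here is an assumed value):

  program demo_velocity
    implicit none
    real, parameter :: pi = 3.1415927
    integer, parameter :: nnu = 1        ! bytes per velocity integer (n_nu)
    real :: sigma = 300.0                ! sigma_Delta, assumed value (km/s)
    real :: dv, dvrec
    integer(1) :: nu
    dv = 143.0                           ! velocity residual Delta v = v_d - v_c
    ! Equation (6): compress Delta v into a 1-byte integer (clamped at the ends)
    nu = int(max(-128, min(127, &
         nint(2.0**(8*nnu)/pi*atan(sqrt(pi/2.0)*dv/sigma)))), 1)
    ! Equation (7): decompress to the representative Delta v of the bin
    dvrec = sigma*sqrt(2.0/pi)*tan(pi*real(nu)/2.0**(8*nnu))
    print *, 'nu =', nu, ' decoded dv =', dvrec
  end program

Here dv = 143.0 encodes to ν = 44 and decodes to ≈143.6, illustrating the finer sampling of slow particles and the coarser sampling of the rare fast ones.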

Initially, particles are generated by an initial condition generator at a higher redshift. The coarse grid velocity field vc is also generated at this step, by averaging all particles in each coarse cell. A global σΔ is calculated by Equation (5), where the linear approximation holds. Then, velocities are stored by Equation (6). During the simulation, vc is updated every time-step, and a nonlinear σΔ is measured directly from the simulation and used in the next time-step, after being scaled by the ratio of growth factors between the two adjacent time-steps. For more details see Section 2.3.3.

2.3. Code Overview

CUBE uses a 2-level PM force calculation. In order to apply the integer-based format to the N-body simulation, substantial structural changes are needed. CUBE is written in Coarray Fortran, where Coarray features replace MPI (Message Passing Interface) communications between computing nodes/images.7 The algorithm is described in this language.

2.3.1. Spatial Decomposition

CUBE decomposes the global simulation volume into Mg³ cubic sub-volumes, each with Nc/Mg coarse grids or RNc/Mg fine grids per side, where the fine mesh is usually R=4 times finer than the coarse mesh. Each of these sub-volumes is assigned to a coarray image. Inside an image, the sub-volume is further decomposed into Mt³ cubic tiles (each defining a physical region Vp) with Nt ≡ Nc/(Mg Mt) coarse grids per side. Each Vp is surrounded by a buffer region Vb, which is Nb coarse cells thick. We define the extended tile region as Ve ≡ Vp ∪ Vb. Ve is designed for two purposes:

(1) If the short-range fine mesh force Ff has a cut-off no longer than Nb coarse cells and is computed on Ve, then Ff in Vp is guaranteed to be correct.

(2) Ve is able to collect all particles that may travel into Vp within a time-step.

Figure 2 shows the spatial decomposition in a two-dimensional analogy, with Mg=2 and Mt=2.

Figure 2. A two-dimensional analogy of the spatial decomposition, with Mg=2 and Mt=2.

According to this spatial decomposition, and as discussed in the last two subsections, we declare ρc in Fortran as

  integer(4) :: rho_c(1-Nb:Nt+Nb, 1-Nb:Nt+Nb, 1-Nb:Nt+Nb, Mt, Mt, Mt)[Mg, Mg, Mg]

where the range 1-Nb:Nt+Nb covers the buffer region on both sides, Mt gives the tile dimensions, and Mg gives the image co-dimensions.8 We denote the actual number of particles in a given image as Np^image, and Np^max is a value (discussed in Section 2.4) large enough to store the particles in Ve. χd and νd must be sorted according to the same memory layout as ρc, such that nc and xd can be obtained from Equation (2). The set {ρc, vc, χd, νd} provides complete information on the positions and velocities of the particles, and we call it a checkpoint.
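Because χd is stored in the same cell order as ρc, a particle's parent cell nc is implicit in its position within the array. A 1D toy decoding loop (our illustration, reusing the four-particle example of Section 2.1):

  program demo_decode
    implicit none
    integer, parameter :: ncells = 4
    integer(1) :: rho(ncells) = int((/2, 0, 1, 1/), 1)   ! particle counts per cell
    integer(1) :: chi(4) = int((/-128, 0, 64, -64/), 1)  ! cell-ordered offsets
    integer :: ip, nc, j
    real :: x
    ip = 0
    do nc = 1, ncells            ! sweep cells; particles are stored cell by cell
      do j = 1, int(rho(nc))
        ip = ip + 1
        x = (nc - 1) + (real(chi(ip)) + 128.5)/256.0     ! Equation (2), n_chi = 1
        print *, 'particle', ip, ' x =', x
      end do
    end do
  end program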

An additional particle-ID (PID) array, IP, can also be declared to differentiate particle types (e.g., CDM and neutrino particles) or to differentiate every particle, using an nI-byte integer per particle. If PID is turned on, the array IP is also included in the checkpoints, and the ordering of IP is the same as that for χd and νd.

2.3.2. Initial Conditions

The cosmological initial condition generator is compiled and run separately from the main N-body code. Here, we briefly describe it for completeness.

The first step is the calculation of the displacement potential Φ. At an initial redshift zi, we generate a linear density fluctuation δL(q) on the Lagrangian grid q by multiplying a Gaussian random field with the transfer function T(k) (given by the assumed cosmological model) in Fourier space. Then, we solve for the potential Φ(q) of a curlless displacement field ψ via the Poisson equation ∇²Φ = δL in Fourier space.

The second step is to generate particles and displace them by the Zel'dovich approximation (ZA) (Zel'dovich 1970), where the displacement field ψ = −∇Φ is obtained by differentiating Φ in real space. This step is done on Ve of each tile.

We iterate twice over the particles' Lagrangian positions q in Ve. The first iteration calculates the particles' positions x and velocities v by ZA and accumulates ρc and vc on the coarse grid. In the second iteration, x and v are calculated again and converted to χd and νd by Equations (1) and (6), and placed in order according to ρc. Lastly, we delete particles in Vb and write {ρc, vc, χd, νd} of this tile to disk. The above is similar to update_xp of Section 2.3.3. If PIDs are needed, they are also generated here. After working on all tiles, we collect global statistics such as σΔ. We summarize the above steps as pseudocode in Figure 3. During this step, the only major memory usage is Φ on the fine mesh; if the number of particles per fine grid is Pf=1, the memory consumption of this in-place FFT is 4 bpp.

Figure 3. Pseudocode of the initial condition generator.
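Given Φ already solved on the grid, the ZA step is just a finite difference and a shift. A 1D periodic Fortran sketch (our illustration; a random field stands in for the actual potential, and units are grid cells):

  program demo_za
    implicit none
    integer, parameter :: n = 8
    real :: phi(n), psi, x
    integer :: q, qm, qp
    call random_number(phi)            ! stand-in for the solved potential Phi
    do q = 1, n                        ! one particle per Lagrangian grid point
      qm = mod(q - 2 + n, n) + 1       ! periodic left neighbor
      qp = mod(q, n) + 1               ! periodic right neighbor
      psi = -0.5*(phi(qp) - phi(qm))   ! psi = -grad(Phi), central difference
      x = real(q) - 0.5 + psi          ! displace the particle from the cell center
      print *, 'particle', q, ' x =', x
    end do
  end program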

2.3.3. Algorithm

Figure 4 shows the overall structure of the main code.

Figure 4. Overall structure of the main code.

initialize creates fine mesh and coarse mesh FFT plans, and reads in configuration files telling the program at which redshifts we need to do checkpoints, halofinds, or stop the simulation. Force kernels Kc, Kf are also computed or loaded.

read_particles reads in a checkpoint {ρc, vc, χd, νd} from the disk for each image. Because particles exist only in the Vp of every tile, they are disjoint, and they provide complete information on the whole simulation volume; we call this the "disjoint state." In this state, the buffer-region values of ρc and vc, as well as the unused entries of χd and νd, are 0's. Because IP is generated and manipulated together with νd, we do not explicitly mention IP below.

buffer_density, buffer_x, and buffer_v convert the "disjoint" state to the "buffered state." In buffer_density, Vb regions of ρc are synchronized between tiles and images. By buffer_x, χd is updated to contain common, buffered particles, and they are sorted according to the buffered ρc. buffer_v deals with νd in a similar manner.

timestep implements a second-order Runge–Kutta time integration, i.e., for n time-steps, we update positions (D=drift) and velocities (K=kick) at interlaced half time-steps by operator splitting; the operation (DKKD)^n is second-order accurate. The actual simulation uses varied time-steps from timestep, where the time increment is constrained by the particles' maximum velocities, accelerations, cosmic expansion, and any other desired conditions.
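Schematically, one (DKKD) step is a half drift, two half kicks, and another half drift, which reduces to standard second-order leapfrog. A toy Fortran version on a harmonic oscillator (our illustration; in CUBE the drifts and kicks are update_xp and update_vp, with varied step sizes chosen by timestep):

  program demo_dkkd
    implicit none
    real :: x = 0.0, v = 1.0, dt = 0.1
    integer :: istep
    do istep = 1, 100
      x = x + v*dt/2                  ! D: drift half step
      v = v + accel(x)*dt/2           ! K: kick half step
      v = v + accel(x)*dt/2           ! K: kick half step
      x = x + v*dt/2                  ! D: drift half step
    end do
    print *, 'x =', x, ' v =', v      ! compare with x = sin(10), v = cos(10)
  contains
    real function accel(x)            ! toy force: harmonic oscillator
      real, intent(in) :: x
      accel = -x
    end function
  end program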

update_xp updates the particle positions (drift D) in a "gather" algorithm, tile by tile, according to {ρc, vc, χd, νd}. For each particle, χd and νd are converted to xd and vd by Equations (2) and (7).

In order to keep particles ordered, for each tile we first drift all particles to obtain an updated density and velocity field on the tile, ρ̃c and ṽc. Then this calculation is done on the same tile again9 to generate a new, local particle list χ̃d and ν̃d by Equations (1) and (6); here, the ordering of χ̃d and ν̃d relies on ρ̃c. A third iteration over the tile then deletes the buffer regions of {ρ̃c, ṽc, χ̃d, ν̃d}, and the resulting disjoint state replaces the old {ρc, vc, χd, νd}. Finally, global quantities such as the particle count and σΔ are updated. These steps are summarized in Figure 5.

Figure 5. Pseudocode of update_xp.
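The reordering inside update_xp amounts to a counting sort keyed on the parent cell: one sweep accumulates the new ρc, a prefix sum converts the counts into insertion cursors, and a second sweep scatters each particle into its cell-ordered slot. A 1D toy Fortran version (our illustration):

  program demo_sort
    implicit none
    integer, parameter :: ncells = 4, np = 4
    integer :: cell(np) = (/3, 1, 4, 1/)   ! parent cell of each particle after drift
    integer :: rho(ncells), cur(0:ncells), inew(np)
    integer :: i, c
    rho = 0
    do i = 1, np                     ! pass 1: count particles per cell (new rho_c)
      rho(cell(i)) = rho(cell(i)) + 1
    end do
    cur(0) = 0                       ! prefix sum -> first slot of each cell
    do c = 1, ncells
      cur(c) = cur(c-1) + rho(c)
    end do
    do i = 1, np                     ! pass 2: scatter to cell-ordered slots
      c = cell(i)
      cur(c-1) = cur(c-1) + 1
      inew(i) = cur(c-1)
    end do
    print *, 'new index of each particle:', inew   ! prints 3 1 4 2
  end program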

In update_vp, the PM or particle-particle particle-mesh (P3M) algorithm is applied to update the particle velocities (kick K). We first call buffer_density and buffer_x to place the particle positions in a buffered state. Then, according to the particle distribution on Ve, we calculate the fine mesh force Ff and update the particle velocities in Vp. An optional PP force can be called to increase the force resolution.

The compensating coarse grid force Fc is computed globally, using a coarser (usually by a factor of R = 4) mesh, via dimensional splitting: the cubic-distributed coarse density field ρc is transposed (inter-image) and Fourier transformed (intra-image) in three consecutive dimensions. After multiplication by the memory-distributed force kernel Kc, the inverse transform yields the cubic-distributed coarse force field Fc, upon which the velocities are updated again.

For each type of velocity update, the collective operations are Equation (7), the velocity kick itself, and Equation (6). We also update σΔ according to the new vd − vc. These steps are summarized in Figure 6.

Figure 6. Pseudocode of update_vp.

After updating νd in Vp, we simply call buffer_v again to bring νd into a buffered state, such that update_xp in the next iteration is done correctly.

If a desired redshift is reached, we execute the last drift step of the (DKKD)^n operation by update_xp, and call checkpoint to save the disjoint state of {ρc, vc, χd, νd} to disk. Other operations, such as the run-time halo finder and density projections, are also done at this point.

Finally, in the finalize subroutine we destroy all the FFT plans and finish up any timing or statistics taken in the simulation.

2.4. Memory Layout

Here, we list the memory-consuming arrays and how they scale with different configurations of the simulation. We classify them into (1) arrays of particles, (2) coarse mesh arrays, and (3) fine mesh arrays.

2.4.1. Arrays of Particles

These arrays comprise the majority of the memory usage, and contain the checkpoint arrays χd, νd, IP and the temporary arrays χ̃d, ν̃d, ĨP. The former use (3nχ + 3nν + nI) Np^max bytes per image, where

Np^max = N̄p (Ve/Vp) C_image,    (8)

and N̄p = Np/Mg³ is the average number of particles per image. The second factor, Ve/Vp, lets us store additional particles in Vb, and the third factor, C_image ≥ 1, takes into account the inhomogeneity of the particle number across images. When each image models smaller physical scales, C_image should be set larger.

The temporary arrays χ̃d, ν̃d, ĨP store particles only on tiles, and their particle capacity is set to

Np^tile = N̄p Mt^(−3) (Ve/Vp) C_tile,    (9)

where C_tile controls the inhomogeneity on the scale of tiles. A larger Mt brings more inhomogeneity on the smaller tiles, and C_tile can be much larger than C_image; however, the term C_tile Mt^(−3) decreases much faster. In practice, the majority of the memory is occupied by χd, νd, IP.

Summarizing the above, we find that the memory usage per particle, given nχ = nν = 1 and nI = 0, is

bpp = 6 (Ve/Vp)(C_image + C_tile Mt^(−3)) = 6 (1 + 2Nb/Nt)³ (C_image + C_tile Mt^(−3)).    (10)

Because Ve/Vp = (1 + 2Nb/Nt)³ increases with Mt (through Nt = Nc/(Mg Mt)) while the term C_tile Mt^(−3) decreases, we can minimize Equation (10) by tuning Mt.
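For concreteness, the sketch below scans Equation (10) over Mt for an image of 384³ coarse cells with Nb = 6 (the inhomogeneity factors C_image and C_tile are assumed placeholders):

  program demo_bpp
    implicit none
    integer, parameter :: nc_img = 384, nb = 6     ! coarse cells per image side; buffer depth
    real, parameter :: c_img = 1.05, c_tile = 1.6  ! assumed inhomogeneity factors
    integer :: mts(5) = (/1, 2, 3, 4, 6/)          ! Mt must divide the image side
    integer :: j, mt, nt
    real :: bpp
    do j = 1, size(mts)
      mt = mts(j)
      nt = nc_img/mt                               ! coarse cells per tile side, Nt
      ! Equation (10): 6 (1 + 2Nb/Nt)^3 (C_image + C_tile/Mt^3)
      bpp = 6.0*(1.0 + 2.0*real(nb)/real(nt))**3*(c_img + c_tile/real(mt)**3)
      print '(a,i2,a,f7.2)', ' Mt =', mt, '   bpp =', bpp
    end do
  end program

With these placeholder factors, the minimum falls near Mt = 3, consistent with the configuration discussed in Section 2.4.4.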

2.4.2. Coarse Mesh Arrays

On the coarse mesh, ρc (4-byte integers), vc, and the force kernel Kc are always kept. They are usually configured to be R=4 times coarser than the fine grids and the particle number density. In our configuration they count 4 bytes per cell for ρc, 12 for vc (both including buffer regions), and 6 for the symmetric kernel Kc, i.e.,

bpp_c = [16 (1 + 2Nb/Nt)³ + 6]/Pc,    (11)

where Pc is the average number of particles per coarse cell. Other coarse-grid-based arrays are the per-tile density, the coarse force Fc, and the pencil-FFT arrays. The per-tile density exists only on tiles; Fc and the pencil-FFT arrays can be equivalenced with other temporary arrays. Thus, the majority of the coarse mesh memory usage comes from Equation (11).

2.4.3. Fine Mesh Arrays

On the local fine mesh, only the force kernel array Kf needs to be kept. It has a memory of 3×4×(RNe)³/2 bytes per image, where Ne = Nt + 2Nb is the buffered tile width, or

bpp_f = 6 Pf^(−1) Mt^(−3) (1 + 2Nb/Nt)³,    (12)

where Pf is the average number of particles per fine cell. The fine mesh density and force field arrays are temporary. Since the fine mesh density, the fine force Ff, the coarse force arrays, and the pencil-FFT arrays are not used in any calculation simultaneously, they can be overlapped in memory using equivalence statements.

2.4.4. Comparison with Traditional Algorithms

To illustrate the improvement in memory usage, we refer to the TianNu simulation (Yu et al. 2017a) run on the TianHe-2 supercomputer, which used the traditional N-body code CUBEP3M. TianNu's particle number (shown in Table 2) was limited by the memory per computing node: each node held an average of 576³ neutrino particles and 288³ CDM particles, consuming most of the available memory,10 i.e., well over 100 bpp. Even a memory-efficient configuration of CUBEP3M, simulating large physical scales at the cost of speed, still uses about 40 bpp.

If the same amount of memory is allocated to CUBE, we can set parameters as Mg=4 (64 images), Nf=6144 (Nc=1536), Nb=6, Pf=1 (Pc=64), and thus Np=6144³≈2.3×10¹¹. Setting Mt=3 (27 tiles per image) minimizes Equation (10) and uses about 35.6 GB per image, corresponding to 9.84 bpp. This can be done on most supercomputers, and even on modern laptops.

Table 1 shows the memory consumption for this test configuration. The memory-consuming arrays are listed and classified into the three types mentioned above, and their memory usages are given in units of GB (10⁹ bytes), in bpp, and as a percentage of the total memory usage. Parenthesized numbers show overlapped memory, which is saved by equivalencing those arrays with other temporary arrays. Other, unlisted variables are memory-light and can also be equivalenced with the listed variables.

Table 1. Memory Layout for a Certain Configuration

Type          Array         /GB       /bpp       Percentage
Particles     χd, νd        29.9      8.24       83.8%
              χ̃d, ν̃d        3.16      0.872      8.87%
              IP, ĨP        0         0          0%
              Subtotal      33.0      9.12       92.7%
Coarse Mesh   ρc            0.296     0.0818     0.831%
              vc            0.889     0.245      2.49%
              Kc            0.340     0.0939     0.954%
              Fc            (0.690)   (0.190)    (1.94%)
              ρc (tile)     0.0140    0.00388    0.0394%
              Pencil-FFT    (0.454)   (0.125)    (1.27%)
              Subtotal      1.55      0.429      4.36%
Fine Mesh     Kf            1.06      0.292      2.97%
              Ff            (1.63)    (0.450)    (4.57%)
              Fine-FFT      (1.41)    (0.389)    (3.96%)
              Subtotal      1.06      0.292      2.97%
Total                       35.6      9.84       100%
Optimal Limit               21.7      6          61.0%


At the bottom of Table 1 we stress that the optimal memory usage is 6 bpp, or 21.7 GB, i.e., 61% of the actual 9.84 bpp (35.6 GB). The dominant departure from this limit is that χd and νd already occupy 8.24 bpp, which comes from the (1 + 2Nb/Nt)³ buffer term of Equation (10). All other variables occupy an additional 8%. On modern supercomputers, the memory per computing node is usually much larger; by scaling up the number of particles per node, the buffer ratio Nb/Nt will be lowered and we can approach closer to the 6 bpp limit.

3. Accuracy

We run a group of simulations to test the accuracy of CUBE. We use the same seeds to generate the same Gaussian random fields in the initial condition generators of CUBEP3M and CUBE, which then produce initial conditions in their own formats. The main N-body codes then evolve their own initial conditions to redshift z=0. We use the same force kernels as CUBEP3M, without PP force. Note that near fine grid scales it is possible to enhance the force kernel to better match nonlinear power spectrum predictions; however, we use the conservative mode of CUBEP3M in this paper. An extended PP force and an unbiased force-matching algorithm will be added to CUBE. Power spectrum studies are presented in Harnois-Déraps et al. (2013), while here we focus on the cross-correlations between the different integer-based methods.

First, using different configurations (different numbers of computing nodes, box sizes, particle resolutions, tiles per node/image, etc.), we find that using 2-byte integers for both positions and velocities (nχ = nν = 2, or x2v2) allows CUBE to give exact results compared to CUBEP3M. So if sufficient memory is provided, one can always use x2v2 to get exactly the results of CUBEP3M; the optimal memory limit in this case is 12 bpp, which is still much lower than traditional methods. Next, we focus on the accuracy of the other three cases: x1v2, x2v1, and x1v1.

We list the names and configurations of the simulations used in Table 2, where Nnode, L, zi, Np, and mp are, respectively, the number of computing nodes used, the side length of the box, the initial redshift, the total number of particles, and the particle mass. These configurations are run by CUBEP3M and by the x2v2, x1v2, x2v1, and x1v1 versions of CUBE with the same initial seeds. Using different numbers of tiles per image gives exactly the same results. We also list the configurations of the TianNu and TianZero simulations (Emberson et al. 2017; Yu et al. 2017a) as a reference.

Table 2. Simulation Configurations

Name        Nnode    L/(Mpc h⁻¹)    zi     Np        mp/(h⁻¹ M☉)
S512        8        200            49     512³      7.5×10⁹
S256        1        80             49     256³      3.8×10⁹
S2048S      64       400            49     2048³     9.4×10⁸
S2048L      64       1200           49     2048³     2.5×10¹⁰
TianNu      13824    1200           100    6912³     6.9×10⁸
                                    5      13824³    3.2×10⁵
TianZero    13824    1200           100    6912³     7.0×10⁸


3.1. Displacement of Particles

In S512, we zoom in on a small region and compare the particle distribution between CUBEP3M and CUBE-x1v1 in Figure 7. For clarity, CUBEP3M particles are marked with larger gray dots, whereas the smaller red dots are CUBE-x1v1 particles overplotted onto them. The difference between them is clear. The 1 Mpc/h, fine grid, and coarse grid scales are indicated in the figure. The position resolution of particles in CUBE-x1v1 is 1/256 of a coarse grid, or 1/64 of a fine grid.

Figure 7. Particle distributions in a zoomed-in region of S512: CUBEP3M particles (larger gray dots) and CUBE-x1v1 particles (smaller red dots), with the 1 Mpc/h, fine grid, and coarse grid scales indicated.

To quantify the offset in the final particle distributions, we use PIDs to track the displacement s ≡ xE − xL of every particle (Yu et al. 2017b), where xE and xL are the Eulerian and Lagrangian coordinates of the particle. Then, we calculate the absolute value of the offset vector

Δs_i = |s_i − s_P3M|.    (13)

Here, the subscript P3M stands for CUBEP3M and the subscript i can stand for x2v2, x1v2, x2v1, or x1v1. The PDFs and CDFs of Δs_i in S256 are shown in Figure 8. Results from x2v2, x1v2, x2v1, and x1v1 are in black, red, blue, and green, respectively. The results for the absolute displacements of particles (replacing s_i − s_P3M with s_i in Equation (13)) are shown in gray for comparison.

Figure 8. PDFs and CDFs of the offsets Δs in S256, for x2v2 (black), x1v2 (red), x2v1 (blue), and x1v1 (green); the absolute displacements are shown in gray.

For x2v2, almost all particles are accurate to within 1/100 of a fine grid, and even the worst particle is only a small fraction of a fine grid away from its counterpart in CUBEP3M. The difference is caused by round-off errors and is negligible in physical and cosmological applications. The accuracy of x1v2 is between x2v2 and x1v1, and x2v1 gives only a minor improvement over x1v1. We also run a simulation with the same number of particles but with L=600 Mpc/h, and find that the accuracy of x1v2 is instead between x2v1 and x1v1. We interpret this as follows: in the latter case, particles have lower mass resolution, so they move more slowly and need higher position resolution but lower velocity resolution; thus x2v1 outperforms x1v2.

3.2. Power Spectrum

S2048S and S2048L are two simulations with 2048³ particles in small (L=400 Mpc/h) and large (L=1200 Mpc/h) boxes. We compare their accuracy via their power spectra and their cross-correlations with CUBEP3M at z=0. The physical scale of S2048S is designed such that the particle mass resolution mp is comparable to the TianNu and TianZero simulations (their parameters are also listed in Table 2). S2048L, on the other hand, focuses on larger structures, on whose scales one can study weak gravitational lensing, BAO (Eisenstein et al. 2005) and its reconstruction (Eisenstein et al. 2007; Wang et al. 2017), etc.

For each simulation the particles are first cloud-in-cell (CIC) interpolated onto the fine mesh grid, and from the density field ρ we define the density contrast δ ≡ ρ/ρ̄ − 1. We define the cross power spectrum P12(k) between two fields δ1 and δ2 (δ1 = δ2 for the auto power spectrum) in Fourier space as

⟨δ1(k) δ2*(k′)⟩ = (2π)³ δD³(k − k′) P12(k),    (14)

where δD³ is the 3D Dirac delta function. In cosmology we usually consider the dimensionless power spectrum Δ²(k) ≡ k³P(k)/(2π²). The cross-correlation coefficient is defined as

ξ(k) = P12/(P11 P22)^(1/2).    (15)
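On a periodic mesh, Equations (14) and (15) reduce to shell averages of products of Fourier modes. A minimal Fortran sketch (our illustration, assuming delta1k and delta2k already hold the FFTs of the two density contrasts; normalization constants are omitted):

  subroutine cross_power(delta1k, delta2k, n, nbins, p12)
    implicit none
    integer, intent(in) :: n, nbins
    complex, intent(in) :: delta1k(n,n,n), delta2k(n,n,n)
    real, intent(out) :: p12(nbins)
    integer :: i, j, k, ki, kj, kk, b, cnt(nbins)
    real :: kmag
    p12 = 0.0; cnt = 0
    do k = 1, n; do j = 1, n; do i = 1, n
      ki = merge(i-1, i-1-n, i <= n/2)   ! signed mode numbers of the FFT layout
      kj = merge(j-1, j-1-n, j <= n/2)
      kk = merge(k-1, k-1-n, k <= n/2)
      kmag = sqrt(real(ki*ki + kj*kj + kk*kk))
      b = min(nbins, 1 + int(kmag))      ! shell index, bin width = fundamental mode
      p12(b) = p12(b) + real(delta1k(i,j,k)*conjg(delta2k(i,j,k)))
      cnt(b) = cnt(b) + 1
    end do; end do; end do
    where (cnt > 0) p12 = p12/cnt        ! shell average, proportional to P12(k)
  end subroutine

The coefficient ξ(k) of Equation (15) then follows from three such calls: ξ = P12/√(P11 P22).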

In the upper two panels of Figure 9 we show the power spectra of CUBE. In both the S2048S and S2048L plots, the four solid curves of different colors show the results of x2v2, x1v2, x2v1, and x1v1, and they almost overlap with each other. The dashed curves are the nonlinear prediction of the matter power spectrum by CLASS (Blas et al. 2011).

Figure 9. Power spectra (upper panels) and cross-correlations with x2v2 (lower panels) for S2048S and S2048L. Dashed curves show the CLASS nonlinear prediction (upper) and the PM-error benchmark (lower).

We label k = 0.2 kNyquist with vertical dashed lines, where kNyquist = πNf/L corresponds to the scale of the fine mesh grids and, since Pf=1, of the average particle separation. At this scale, the power spectra are offset from the nonlinear predictions by at least 20%. This error comes from the PM algorithm, and one has to increase the resolution of the simulation to correct these offsets. We do not plot CUBEP3M because we found that CUBEP3M and the x2v2 mode of CUBE produce the same results. Thus, the differences between CUBEP3M and the different integer formats of CUBE are negligible compared to the error of the PM algorithm.

In the lower parts of these panels we study the cross-correlations. We compare everything with CUBE-x2v2 to emphasize the decorrelation (i.e., 1 − ξ) caused by the different integer formats; note that x2v2 is perfectly correlated with CUBEP3M. In the lower two panels of Figure 9, the solid curves show the decorrelation of x1v2, x2v1, and x1v1. The higher resolution run (S2048S) in general shows more decorrelation, and x1v2 cross-correlates with x2v2 better than the other two cases. In order to quantify the PM error in terms of decorrelation, we run x2v2 simulations with the same initial conditions but 8 times more particles and cells, and measure the decorrelation caused by the coarser resolution. These results are shown as dashed curves (labeled "PM"). We conclude that the cross-correlation of x1v1 is, in all cases, not worse than the PM errors.

To summarize Figure 9: in terms of either power spectrum deviation or cross-correlation, the error induced by information optimization (even for x1v1) is lower than the error from the PM algorithm, and we can safely use x1v1 for most LSS studies.

4. Discussion and Conclusion

We present a parallel, information-optimized N-body algorithm. This open-source code, CUBE, has recently been used in many studies of LSS, e.g., Yu et al. (2017b), Wang et al. (2017), and Pan et al. (2017). It requires very low memory usage, approaching 6 bpp.

The accuracy of this code is adjustable in that we can choose 1-byte or 2-byte integers separately for the positions and velocities of particles. In the case of 2-byte integers for both positions and velocities ("x2v2", with memory as low as 12 bpp), the algorithm gives exactly the results of traditional N-body algorithms. Note that the results are exact in that they not only produce the same physical statistics of LSS, but also the same (nonphysical) PM-algorithm error near Nyquist frequencies; in other words, the positions and velocities of each particle are exact. In practice, we only require that the errors from information optimization be much lower than the errors from the PM algorithm. In Figure 9 we see that this is the case even for the most memory-efficient configuration, x1v1. This shows that in most LSS studies, when the scales of interest satisfy k < 0.2 kNyquist, six 1-byte fixed point numbers contain sufficient information for every N-body particle.

Another benefit of this algorithm concerns LSS simulations with neutrinos, although the neutrino modules of CUBE are still in development. Neutrinos have a high velocity dispersion and move much faster than CDM, and their small-scale errors are dominated by their Poisson noise. We expect that, compared to CDM, x1v1 gives less power spectrum deviation for neutrinos, as they behave more Gaussian and are less clustered. LSS-neutrino simulations like TianNu can contain many more (for TianNu, 8 times more) neutrino N-body particles than CDM, and these dominate the memory. For a TianNu-like simulation, one can safely use x1v2 or x2v2 for CDM and x1v1 for neutrinos, and the memory usage can approach 6 bpp. This allows many more particles to be included in the simulation and can significantly lower the Poisson noise of the neutrinos. A 1-byte PID per particle adds minor memory usage and can differentiate eight kinds of particles; alternatively, we can store different particle species in different arrays without using PIDs.

We did not include the PP force in either CUBE or CUBEP3M in this paper. If one wants to focus on smaller scales, such as halo masses and profiles, extended PP forces, which act up to adjacent fine cells, should be taken into account. The memory consumption of the PP force is only local and is negligible compared to the particles. Even if x2v2 is then used, a 12 bpp memory usage is still much lower than that of traditional N-body algorithms.

Traditional N-body codes consume considerable memory while performing relatively light computations. CUBE is designed to optimize the information efficiency and memory usage of N-body simulations. CUBE is written in Coarray Fortran, with concise Coarray features used instead of complicated MPI, and the code itself is much more concise than CUBEP3M, which eases future maintenance and development. The next steps are optimizing the code and adapting it to various heterogeneous computing systems, e.g., MIC and GPUs. Optimizing the velocity storage may further improve the accuracy of x1v1, whose effect on neutrino-LSS simulations is yet to be explored.

We acknowledge funding from NSERC. H.R.Y. thanks Derek Inman for many helpful discussions. We thank the anonymous referee for many helpful suggestions that improved the paper. The simulations were performed on the GPC supercomputer at the SciNet HPC Consortium. The code is publicly available on github.com under yuhaoran/CUBE.

  • 7

    Images are the concept of computing nodes or MPI tasks in Coarray Fortran. We use this terminology in this paper.

  • 8

    Coarray Fortran concept. Co-dimensions can enable communications between images.

  • 9

This repetition scales as O(N) and is computationally inexpensive.

  • 10

Additional memory is used for OpenMP parallelization and for particle IDs to differentiate particle species.
