CUBE: An Information-optimized Parallel Cosmological N-body Algorithm (2024)

Author e-mails

haoran@cita.utoronto.ca

Author affiliations

¹ Tsung-Dao Lee Institute, Shanghai Jiao Tong University, Shanghai, 200240, People's Republic of China; haoran@cita.utoronto.ca

² Canadian Institute for Theoretical Astrophysics, University of Toronto, Toronto, ON M5H 3H8, Canada

³ Department of Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, People's Republic of China

⁴ Dunlap Institute for Astronomy and Astrophysics, University of Toronto, Toronto, ON M5S 3H4, Canada

⁵ Canadian Institute for Advanced Research, CIFAR Program in Gravitation and Cosmology, Toronto, ON M5G 1Z8, Canada

⁶ Perimeter Institute for Theoretical Physics, Waterloo, ON N2L 2Y5, Canada

ORCID iDs

Hao-Ran Yu https://orcid.org/0000-0001-5277-4882

Xin Wang https://orcid.org/0000-0002-2472-6485

Dates

Received 2017 December 17
Revised 2018 May 24
Accepted 2018 May 25
Published 2018 July 26

Keywords

large-scale structure of universe; methods: numerical

Journal RSS

Create or edit your corridor alerts

What are corridors?

0067-0049/237/2/24

Abstract

Cosmological large-scale structure N-body simulations are computation-light, memory-heavy problems in supercomputing. The considerable amount of memory is usually dominated by an inefficient way of storing more than sufficient phase space information of particles. We present a new parallel, information-optimized, particle-mesh-based N-body code CUBE, in which information-efficiency and memory-efficiency are increased by nearly an order of magnitude. This is accomplished by storing particle's relative phase space coordinates instead of global values, and in the format of fixed point as light as 1 byte. The remaining information is given by complementary density and velocity fields (negligible in memory space) and proper ordering of particles (no extra memory). Our numerical experiments show that this information-optimized N-body algorithm provides accurate results within the error of the particle-mesh algorithm. This significant lowering of the memory-to-computation ratio breaks the bottleneck of scaling up and speeding up large cosmological N-body simulations on multi-core and heterogeneous computing systems.

Export citation and abstractBibTeXRIS

Previous article in issue

Next article in issue

NASA ADS Record
About Related Links

1.Introduction

The N-body simulation, a dynamical simulation of a group of particles, is a powerful tool in physics and astronomy (Hockney & Eastwood 1988). It is widely used in cosmology to model the large-scale structure (LSS) of the universe (Davis et al. 1985). Current percent and sub-percent level LSS measurements of cosmological parameters, via the matter power spectrum (Rimes & Hamilton 2005; Takahashi et al. 2011), baryonic acoustic oscillations (BAO) (Eisenstein et al. 2005; Takahashi et al. 2009), weak gravitational lensing (Vale & White 2003; Hilbert et al. 2009; Sato et al. 2009) etc., require understandings of the nonlinear dynamics of the cosmic structure, and rely on high-resolution and high dynamic range N-body simulations.

When N is large, the brute force pairwise particle–particle (PP) force brings unaffordable computations, so many algorithms were designed to alleviate it. Various fast-multipole methods (Rokhlin 1985; Dehnen 2014; Potter & Stadel 2016) improve the complexity to even o(N), among which the most popular one is "tree," like GADGET (Springel et al. 2001; Springel 2005) and its simulation "Millennium" (Springel et al. 2005; Angulo et al. 2012), TPM (Xu 1995) and GOTPM (Dubinski et al. 2004). Other methods include adaptive grid algorithms like HYDRA (Couchman et al. 1995) and RAMSES (Teyssier 2010), as well as mesh-refined codes (Couchman 1991) and moving adaptive particle-mesh (PM) codes (Pen 1995). The standard PM algorithm (Hockney & Eastwood 1988) is most memory and computational efficient if we focus on large cosmological scales. The load-balancing problem is minimized because the matter distribution is rather hom*ogeneous, and the speed benefits from the fast Fourier transform (FFT) libraries, such as FFTW3 (Frigo & Johnson 2005). PMFAST (Merz et al. 2005) introduces a 2-level PM algorithm, aiming to push PM codes toward speed, memory compactness, and scalability. After subsequent developments on PMFAST, CUBEP3M (Harnois-Déraps et al. 2013) uses cubic spatial decomposition, and adds PP force and many other features.

In addition to the new methodology, the fast development of parallel supercomputers enables us to simulate a system of more than a trillion (10¹²) N-body particles. To date, the largest N-body simulation in application is the "TianNu" (Emberson et al. 2017; Yu et al. 2017a) run on the TianHe-2 supercomputer. With the code CUBEP3M adding neutrino modules, it uses 3×10¹² particles to simulate the cold dark matter (CDM) and cosmic neutrino evolution through the cosmic age.

Relative to the optimized computation optimizations, N-body simulations use a considerable amount of memory to store the information of particles. Their phase space coordinates are stored as at least six single-precision floating point numbers (a total of 24 bytes). On the other hand, modern supercomputer systems use multi-cores, many integrated cores (MIC) and even densely parallelized GPUs, bringing orders of magnitude higher computing power, whereas these architectures usually have limited memory allocation. Thus, these computation-light but memory-heavy applications, compared to matrix multiplication and decomposition calculations, are currently less suitable for full usage of the computing power of modern supercomputers. For example, although native and offload modes of CUBEP3M are able to run on the Intel Xeon-PHI MIC architectures, with the requirement of enough memory, TianNu simulations were done on TianHe-2 using only its CPUs—73% of the total memory but only 13% of the total computing power. We investigate how the greatest amount of information on particles can be deduced while still preserving the accuracy of N-body simulations, with the aim of optimizing the total memory usage for a given N.

In the following sections we present a new, information-optimized parallel cosmological N-body simulation code CUBE (Yu et al. 2018), using as little as 6 bytes per particle (bpp). It gives accurate results in cosmological LSS simulations—the error induced by information optimization is below the error from the PM algorithm. In Section 2 we show how the memory can be saved by using an "integer-based storage" and how the PM N-body algorithm is adapted with this storage format. In Section 3 we quantify the accuracy of this algorithm using groups of simulations from CUBEP3M and CUBE. Discussions and conclusions are provided in Section 4.

2.Method

The most memory-consuming part of an N-body simulation is usually the phase space coordinates of N-body particles—24 bpp—which contains 4 single-precision floating numbers that must be used to store each particle's three-dimensional position and velocity vectors. CUBEP3M, an example of a memory-efficient parallel N-body code, can use as little as 40 bpp when sacrificing computing speed (Harnois-Déraps et al. 2013). This includes the phase coordinates (24 bpp) for particles in the physical domain and buffered region, a linked list (4 bpp), and a global coarse mesh and local fine mesh. Sometimes 4-byte real numbers are not necessarily adequate in representing the global coordinates in simulations. If the box size is many orders of magnitude larger than the interactive distance between particles, especially in the field of resimulation of dense subregions, double-precision (8 byte) coordinates are needed to avoid round-off errors. Another solution is to record relative coordinates for both position and velocity. CUBE replaces the coordinates and linked list 24+4=28 bpp memory usage with an integer-based storage, thus reducing the basic memory usage from 28 bpp down to 6 bpp, as described in Sections 2.1 and 2.2. The algorithm is described in Section 2.3.

2.1.Particle Position Storage

We construct a uniform mesh throughout the space and each particle belongs to its parent cell of the mesh. Instead of storing the global coordinates of each particle, we store its offset relative to its parent cell that contains the particle. This is similar to storing the quantities of nodes/clumps (structures of a tree in a tree code) relative to their parents (Appel 1985). We divide the cell, in each dimension d, evenly into 2⁸=256 bins, and use a 1 byte (8 bits) integer χ_d∈{−128, −127, ..., 127} to indicate which bin it locates in this dimension. The global locations of particles are given by a cell-ordered format in memory space, and a complementary number count of particle numbers in this mesh (density field) will give complete information on the particle distribution in the mesh. Then, the global coordinate in the dth dimension x_d is given by , where is the index of the coarse grid. The mesh is chosen to be coarse enough such that the density field takes negligible memory. This coarse density field can be further compressed into a 1-byte integer format, such that a 1-byte integer shows the particle number in this coarse cell in a range from 0 to 255. In the densest cells (this rarely happens) where there are ≥255 particles, we can just write 255, and write the actual number as a 4-byte integer in another file.

In a simulation with volume L³ and N_c³ coarse cells, particle positions are stored with a resolution of L/(256N_c). The force calculation (e.g., softening length) should be configured to be much finer than this resolution, as discussed in later sections. On the other hand, particle position can also be stored as 2-byte (16 bits) integers to increase the resolution. In this case, each coarse cell is divided into 2¹⁶=65536 bins and the position resolution is L/(65536N_c), which is precise compared to using 4-byte global coordinates (see later results). We denote this case as "x2" and denote the case in which we use 1-byte integers for positions as "x1."

2.2.Particle Velocity Storage

Similarly, the actual velocity in the dth dimension v_d is decomposed into an averaged velocity field on the same coarse grid v_c and a residual Δv relative to this field:

v_c is always recorded and kept updated, and should not occupy considerable memory. We then divide velocity space Δv into uneven bins, and use a n_ν-byte integer to indicate which Δv bin the particle is located.

The reason why we use uneven bins is that slower particles are more abundant compared to faster ones, and one should better resolve slower particles by tracing at least linear evolution. On the other hand, there could be extreme scattering particles (in case of PP force), and we can safely ignore or resolve those nonphysical particles. One of the solutions is that, if we know the probability distribution function (PDF) f(Δv) we divide its cumulative distribution function (CDF) F(Δv)∈(0, 1) into bins to determine the boundary of Δv bins, and particles should evenly distribute in the corresponding uneven Δv bins. Practically we find that either f(v_d) or f(Δv) is close to Gaussian, so we can use Gaussian CDF, or any convenient analytic functions that are close to Gaussian, to convert velocity between real numbers and integers.

The essential parameter of the velocity distribution is its variance. On a nonlinear scale, the velocity distribution function is non-Gaussian. However, to the first-order approximation, we simply assume it as Gaussian and characterized it by the variance

where a(z) is the scale factor, H(z) is the Hubble parameter, D is the linear growth factor, , and P(k) is the linear power spectrum of density contrast at redshift zero. W(k,R) is the Fourier transform of the real space top-hat window function with a smoothing scale r. In Figure 1 we plot σ_v(a, r) as a function of a for a few smoothing scale r. Δv in Equation (3) is the velocity dispersion relative to the coarse grid, so we approximate its variance as

where r_c is the scale of the coarse grid, and r_p is the scale of the average particle separation. In each dimension of the 3D velocity field, we use according to the equipartition theorem. On different scales, we measure the statistics of v_d, v_c, and Δv and find good agreement with the above model.

CUBE: An Information-optimized Parallel Cosmological N-body Algorithm (19)

The simulation results are very insensitive if we manually tune the variance of the model σ_Δ within an order of magnitude. However, in the n_ν=1 case, the method of using uneven bins gets much better results than simply using equal bins between minimum and maximum values . So, one can safely use a standard ΛCDM (cold dark matter with a cosmological constant as dark energy) for slightly different cosmological models, in Equation (4). In CUBE, the velocity conversion takes the formula

where is the operator to take the nearest integer. Tangent functions are convenient and computing-efficient. Compared to the error functions used in the Gaussian case, they take the same variance at ν_d=0 but resolve high velocities relatively better. All possible choices of conversion formulae and σ_Δ are unbiased in the conversion Equations (6) and (7), however, proper choices optimize the velocity space sampling and can result in more precise results.

Initially, particles are generated by an initial condition generator, at a higher redshift. The coarse grid velocity field v_c is also generated at this step by averaging all particles in the coarse cell. A global σ_Δ is calculated by Equation (5), where linear approximation holds. Then, velocities are stored by Equation (6). During the simulation, v_c is updated every time-step, and a nonlinear σ_Δ is measured directly from the simulation, and can be simply used in the next time-step, after scaled by the ratio of growth factors between two adjacent time-steps. For more details see Section 2.3.3.

2.3.Code Overview

CUBE uses a 2-level PM force calculation. In order to apply the integer-based format to the N-body simulation, substantial structural changes need to be done. CUBE is written in Coarray Fortran, where Coarray features replace MPI (Message Passing Interface) communications between computation nodes/images.⁷ The algorithm is described in this language.

2.3.1.Spatial Decomposition

CUBE decomposes the global simulation volume into M_g³ cubic sub-volumes with N_c coarse grids or N_f=RN_c fine grids per side. The fine mesh is usually R=4 times finer than the coarse mesh. Each of these sub-volumes is assigned to a coarray image. Inside of an image, the sub-volume is further decomposed into M_t³ cubic tiles (defined as V_p, or a physical region) with N_c/(M_g M_t) coarse grids per side. Each V_p is surrounded by a buffer region V_b, which is N_b coarse cells thick. We define the extended tile region as . V_e is designed for two purposes:

(1) If the short fine mesh force has a cut-off, N_b , and is computed on V_e, then in V_p is guaranteed to be correct.

(2) V_e is able to collect all particles that are able to travel to V_p.

Figure 2 shows the spatial decomposition in a two-dimensional analogy, with M_g=2 and M_t=2.

CUBE: An Information-optimized Parallel Cosmological N-body Algorithm (30)

According to this spatial decomposition, and as discussed in the last two subsections, we declare using Fortran language:

where covers the buffer region on both sides, M_t is the tile dimensions, and M_g is the image co-dimensions.⁸ We denote the actual number of particles in a given image as , and is a value (discussed in Section 2.4) large enough to store particles in V_e. χ_d and ν_d must be sorted according to the same memory layout as ρ_c, such that n_c and x_d can be obtained from Equation (2). provides complete information on the positions and velocities of particles, and we call it a checkpoint.

An additional particle-ID (PID) array, I_P, can also be declared to differentiate particle types (e.g., CDM and neutrino particles) or to differentiate every particle, using an n_I-byte integer per particle. If PID is turned on, the array I_P is also included in the checkpoints, and the ordering of I_P is the same as that for χ_d and ν_d.

2.3.2.Initial Conditions

The cosmological initial condition generator is compiled and run separately from the main N-body code. Here, we briefly describe it for completeness.

The first step is the calculation of the displacement potential Φ. At an initial redshift z_i, we generate a linear density fluctuation on Lagrangian grid by multiplying a Gaussian random field with the transfer function T(k) (given by the assumed cosmological model) in Fourier space. Then, we solve for the potential of a curlless displacement field by Poisson equation in Fourier space.

The second step to generate particles and displace them by Zel'dovich approximation (ZA) (Zel'dovich 1970), where the displacement field is obtained by differentiating Φ, is in real space. This step is done on V_e of each tile.

We iterate twice over particles' in V_e. The first iteration calculates particles' and by ZA and obtains ρ_c and v_c on the coarse grid. The second iteration's and are calculated again and are converted to χ_d and ν_d by Equations (1) and (6) and placed in a certain order according to ρ_c. Lastly, we delete particles in V_b, and convert the ones in V_p and write of this tile to disk. The above is similar to update_xp of Section 2.3.3. If PIDs are needed, they are also generated here. After working on all tiles, we sum up and . We summarize the above steps into a pseudocode in Figure 3. During this step, the only major memory usage is Φ on the fine mesh. If the number of particles per fine grid P_f=1, the memory consumption of this in-place FFT is 4 bpp.

CUBE: An Information-optimized Parallel Cosmological N-body Algorithm (51)

2.3.3.Algorithm

Figure 4 shows the overall structure of the main code.

CUBE: An Information-optimized Parallel Cosmological N-body Algorithm (52)

initialize creates fine mesh and coarse mesh FFT plans, and reads in configuration files telling the program at which redshifts we need to do checkpoints, halofinds, or stop the simulation. Force kernels K_c, K_f are also computed or loaded.

read_particles, from the disk, reads in a checkpoint for each image. Because they exist only in the V_p of every tile, they are disjoint, and they provide complete information on the whole simulation volume—we call it "disjoint state." In this state, ρ_c, v_c's values in buffer regions, and and are 0's. Because I_P is generated and manipulated together with ν_d, we do not explicitly mention I_P.

buffer_density, buffer_x, and buffer_v convert the "disjoint" state to the "buffered state." In buffer_density, V_b regions of ρ_c are synchronized between tiles and images. By buffer_x, χ_d is updated to contain common, buffered particles, and they are sorted according to the buffered ρ_c. buffer_v deals with ν_d in a similar manner.

timestep is a second-order Runge–Kutta method used in the time integration, i.e., for n time-steps, we update positions (D=drift) and velocities (K=kick) at interlaced half time-steps by operator splitting; the operation (DKKD)ⁿ is second-order accurate. The actual simulation applies varied time-steps by timestep, where a time increment is constrained by particles' maximum velocities, accelerations, cosmic expansion, and any other desired conditions.

update_xp is used, according to , to update particle positions (drift D) in a "gather" algorithm tile by tile. For each particle, χ_d and ν_d are converted to x_d and v_d by Equations (2), (7).

In order to keep particles ordered, for each tile, we first perform on all particles to obtain an updated density and velocity field on the tile, and . Then, this calculation is done on the same tile again⁹ to generate a new, local particle list and by Equations (1) and (6). Here, the ordering of and relies on . Then, the third iteration is done on this tile to delete buffer regions of . Then the disjoint state of replaces the old . Finally, and are updated. These steps are summarized in Figure 5.

CUBE: An Information-optimized Parallel Cosmological N-body Algorithm (71)

For update_vp the PM or PP particle-mesh (P³M) algorithm is applied in this subroutine to update particles' velocities (kick K). We first call buffer_density and buffer_xp to place the particle positions in a buffered state. Then, according to the particle distributions on V_e, we calculate the fine mesh force and update the particle velocities in the V_p. An optional PP force can be called to increase the force resolution.

The compensating coarse grid force is globally computed using a coarser- (usually by a factor of R = 4) mesh by dimensional splitting—the cubic distributed coarse density field ρ_c is transposed (inter-image) and Fourier transformed (inner-image) in three consecutive dimensions. After the multiplication of memory-distributed force kernel K_c, the inverse transform takes place to get the cubic distributed coarse force field , upon which velocities are updated again.

For each type of velocity update, the collective operations are Equation (7), , and Equation (6). We also update according to the new v_d−v_c. These steps are summarized in Figure 6.

CUBE: An Information-optimized Parallel Cosmological N-body Algorithm (79)

After the updating of ν_d in V_p, we simply call buffer_v again to bring ν_d into a buffered state, such that the update_x in the next iteration will be done correctly.

For checkpoint, if a desired redshift is reached, we execute the last drift step in the (DKKD)ⁿ operation by update_xp, and call checkpoint to save the disjoint state of on the disk. Other operations like run-time halo finder, density projections are also done at this point.

Finally, in the finalize subroutine we destroy all the FFT plans and finish up any timing or statistics taken in the simulation.

2.4.Memory Layout

Here, we list the memory-consuming arrays and how they scale with different configurations of the simulation. We classify them into (1) arrays of particles, (2) coarse mesh arrays, and (3) fine mesh arrays.

2.4.1.Arrays of Particles

These arrays comprise the majority of the memory usage, and contain checkpoint arrays χ_d, ν_d, I_P, and temporary arrays , , . The former uses memory , where

and is the average number of particles per image. The second term, proportional to V_e/V_p, lets us store additional particles in V_b, and the third term takes into account the inhom*ogeneity of on different images. When each image models smaller physical scales, _image should be set larger.

Temporary , , store particles only on tiles, and the particle number is set to be

where controls the inhom*ogeneity on scales of tiles. Larger M_t causes more inhom*ogeneity on smaller tiles, and can be much larger than _image; however, the term decreases much faster. Practically, the majority memory is occupied by χ_d, ν_d, I_P.

Summarizing the above, we find that the memory usage per particle (bpp) , given and n_I=0, is

Because , we can minimize Equation (10) by tuning M_t.

2.4.2.Coarse Mesh Arrays

On coarse mesh arrays, ρ_c (4-byte integers), v_c, and force kernel K_c should always be kept. They are usually configured to be R=4 times coarser than fine grids and particle number density. They have use memories of bytes per image, or

where P_c is the average number of particles per coarse cell. Other coarse-grid-based arrays are , and F_c pencil-FFT arrays. exists only on tiles; F_c and pencil-FFT arrays can be equivalenced with other temporary arrays. Thus, the majority of the memory usage from coarse FFT arrays comes from Equation (11).

2.4.3.Fine Mesh Arrays

On the local fine mesh, only a force kernel array K_f needs to be kept. It has memory of 3×4×(RN_eM_t)^3/2 per image, or

where P_f is the average number of particles per fine cell. For the fine mesh density arrays, force field arrays are temporary. Since , , coarse force arrays, and pencil-FFT arrays are also temporary and are not used in any calculation simultaneously, they can be overlapped in memory using equivalent statements.

2.4.4.Compare with Traditional Algorithms

To illustrate the improvement of memory usage, we refer to the TianNu simulation (Yu et al. 2017a) run on the Tianhe-2 supercomputer, which used a traditional N-body code CUBEP3M. TianNu's particle number (shown in Table 2) is limited by memory per computing node—for each computing node, an average of 576³ neutrino particles and 288³ CDM particles are used, and consumes ,¹⁰ or about . A memory-efficient configuration of CUBEP3M using large physical scales and at costs of speed, still uses about .

If the same amount of memory is allocated to CUBE, we can set parameters as , , N_b=6, , and thus . Setting M_t=3 (27 tiles per image) minimizes and uses about , corresponding to . This can be done on most of the supercomputers, even modern laptops.

Table 1 shows the memory consumption for this test simulation. The memory-consuming arrays are listed and classified into the three types above mentioned, and their memory usages are in units of GB (10⁹ byte), bpp, and their percentage of the total memory usage. The parenthesized numbers show overlapped memory, which is saved by equivalencing them with the underscored numbers. There are other unlisted variables that are memory-light, and can also be equivalenced with the listed variables.

Table 1.Memory Layout for a Certain Configuration

		Memory Usage
Type	Array	/GB	/bpp	Percentage
Particles	χ_d, ν_d	29.9	8.24	83.8%
	,	3.16	0.872	8.87%
	I_P,	0	0	0%
	Subtotal	33.0	9.12	92.7%

Coarse Mesh	ρ_c	0.296	0.0818	0.831%
	v_c	0.889	0.245	2.49%
	K_c	0.340	0.0939	0.954%
	F_c	(0.690)	(0.190)	(1.94%)
		0.0140	0.00388	0.0394%
	Pencil-FFT	(0.454)	(0.125)	(1.27%)
	Subtotal	1.55	0.429	4.36%

Fine Mesh	K_f	1.06	0.292	2.97%
	F_f	(1.63)	(0.450)	(4.57%)
	Fine-FFT	(1.41)	(0.389)	(3.96%)
	Subtotal	1.06	0.292	2.97%

Total		35.6	9.84	100%

Optimal Limit		21.7	6	61.0%

Download table as: ASCIITypeset image

In the bottom of Table 1 we stress that the optimal memory usage is 6 bpp, or 21.7 GB, 61% of the actual . The dominating departure from this limit is that χ_d and ν_d already occupy 8.24 bpp, which come from the term of Equation (10). All other variables occupy an additional 8%. On modern supercomputers, the memory per computing node is usually much larger, and by scaling up the number of particles per node, the buffer ratio N_b/N_t will be lowered and we can approach closer to the 6 bpp limit.

3.Accuracy

We run a group of simulations to test the accuracy of CUBE. We use the same seeds to generate the same Gaussian random fields in the initial condition generators of CUBEP3M and CUBE, and then they produce initial conditions in their own formats . Then, the main N-body codes run their own initial conditions to redshift z=0. We use the same force kernels as CUBEP3M without PP force. Note that near fine grid scales, it is possible to enhance the force kernel to better match the nonlinear power spectrum predictions; however, we use the conservative mode of CUBEP3M in this paper. An extended PP force and an unbiased force matching algorithm will be added to CUBE. The power spectrum studies are presented in Harnois-Déraps et al. (2013), while here we focus on the cross-correlations between different integer-based methods.

First, using different configurations—different numbers of computing nodes, box sizes, particle resolutions, tiles per node/image, etc., we find that using 2-byte integers for both positions and velocities (, or x2v2) allows CUBE to give exact results compared to CUBEP3M. So if sufficient memory is provided, one can always use x2v2 to get exact results as CUBEP3M, and the optimal memory limit of this case is 12 bpp, which is still much lower than traditional methods. Next, we focus on the accuracy of the other three cases—x1v2, x2v1, and x1v1.

We list the names and configurations of the simulations used in Table 2, where , L, z_i, N_p, and m_p are respectively the number of computing nodes used, the length of the side of the box, initial redshift, total number of particles, and particle mass. These configurations are run by CUBEP3M, x2v2, x1v2, x2v1 ,and x1v1 versions of CUBE with the same initial seeds. Using different numbers of tiles per image gives the exact same results. We also list the configurations for TianNu and TianZero (Emberson et al. 2017; Yu et al. 2017a) simulations as a reference.

Table 2.Simulation Configurations

	Configurations
Name	N_node	L/(Mpc h⁻¹)	z_i	N_p	m_p/
S512	8	200	49	512³	7.5×10⁹
S256	1	80	49	256³	3.8×10⁹
S2048S	64	400	49	2048³	9.4×10⁸
S2048L	64	1200	49	2048³	2.5×10¹⁰

TianNu	13824	1200	100	6912³	6.9×10⁸
			5	13824³	3.2×10⁵
TianZero	13824	1200	100	6912³	7.0×10⁸

Download table as: ASCIITypeset image

3.1.Power Spectrum

3.2.Displacement of Particles

In S512, we zoom-in on a small region of and compare the particle distribution between CUBEP3M and CUBE-x1v1 in Figure 7. For clarity, CUBEP3M particles are marked with the larger, gray dots, whereas the smaller red dots are CUBE-x1v1 particles overplotted onto them. The difference between them is clear. The 1 Mpc/h, fine grid, and coarse grid scales are shown in the figure. The position resolution of particles in CUBE-x1v1 is 1/256 of a coarse grid, or 1/64 of a fine grid.

CUBE: An Information-optimized Parallel Cosmological N-body Algorithm (130)

To quantify the offset in the final particle distributions, we use PIDs to track the displacement of every particle (Yu et al. 2017b), where and are Eulerian and Lagrangian coordinates of the particle. Then, we calculate the absolute value of the offset vector

Here, stands for CUBEP3M and subscript _i can stand for x2v2, x1v2, x2v1, or x1v1. The PDFs and CDFs of in S256 are shown in Figure 8. Results from x2v2, x1v2, x2v1, or x1v1 are in black, red, blue, and green, respectively. The results from the absolute displacement of particles (by replacing with in Equation (13)) are shown in gray for comparison.

CUBE: An Information-optimized Parallel Cosmological N-body Algorithm (140)

For x2v2, almost all particles are accurate up to 1/100 of a fine grid, and the worst particle is of a fine grid away from its counterpart in CUBEP3M. The difference is caused by round-off errors and is negligible in physical and cosmological applications. The accuracy of x1v2 is between x2v2 and x1v1, and x2v1 gives only a minor improvement from x1v1. We also run a simulation with the same number of particles but with L=600 Mpc/h (), and find that the accuracy of x1v2 is in turns between x2v1 and x1v1. We interpret that, in this latter case, particles have lower mass resolution, so they move slower and need higher position resolution but need lower velocity resolution, thus x2v1 outperforms x1v2.

S2048S and S2048L are two simulations with 2048³ particles in small (L=400 Mpc/h) and large (L=1200 Mpc/h) box sizes. We compare their accuracy by their power spectra and their cross-correlations with CUBEP3M at z=0. The physical scale of S2048S is designed such that the particle mass resolution m_p is comparable to TianNu and TianZero simulations (their parameters are also listed in Table 2). On the other hand, S2048L focuses on larger structures, on which scale one can study weak gravitational lensing, BAO (Eisenstein et al. 2005), and its reconstruction (Eisenstein et al. 2007; Wang et al. 2017), etc.

For each simulation the particles are first cloud-in-cell (CIC) interpolated onto the fine mesh grid, and from the density field ρ we define the density contrast . We define the cross power spectrum between two fields and ( for the auto-power spectrum) in Fourier space as

where is the 3D Dirac delta function. In cosmology we usually consider the dimensionless power spectrum . The cross-correlation coefficient is defined as

In the upper two panels of Figure 9 we show the power spectra of CUBE. In both plots of S2048S and S2048L the four solid curves of different colors show the results of x2v2, x1v2, x2v1, and x1v1, and they almost overlapped with each other. The dashed curves are the nonlinear prediction of the matter power spectrum by CLASS (Blas et al. 2011).

CUBE: An Information-optimized Parallel Cosmological N-body Algorithm (156)

We label k=0.2 k_Nyquist as vertical dashed lines, where is the scale of fine mesh grids, and the scale of average particle separations. On this scale, the power spectra are offset from nonlinear predictions by at least 20%. This error is from the PM algorithm and one has to increase the resolution of the simulation to correct these offsets. We do not plot CUBEP3M because we found that CUBEP3M and x2v2 of CUBE produce same results. Thus, the differences between CUBEP3M and different integer formats of CUBE are negligible compared to the error of the PM algorithm.

In the lower parts of these four panels we study the cross-correlations. We compare everything with CUBE-x2v2, to emphasize the decorrelation (i.e., ) by different integer formats. Note that x2v2 is perfectly correlated with CUBEP3M. In the lower two panels of Figure 9 the solid curves show the decorrelation ξ of x1v2, x2v1, and x1v1. The higher resolution (S2048S) in general comes with more decorrelations, and x1v2 cross-correlates with x2v2 better than the other two cases. In order to quantify the PM error in terms of decorrelations, we run x2v2 simulations with the same initial conditions, but 8 times more particles and cells, and measure the decorrelation caused by coarser resolutions. These results are shown as dashed curves (labeled "PM"). We conclude that the cross-correlation of x1v1, in all cases, is not worse than the PM errors.

To summarize from Figure 9, in terms of either power spectrum deviation or cross-correlation, the error induced by information optimization (even for x1v1) is lower than the error from the PM algorithm, and we can safely use x1v1 for most of the LSS studies.

4.Discussion and Conclusion

We present a parallel, information-optimized N-body algorithm. This open-source code, CUBE, has recently been used in many studies of LSS, e.g., Yu et al. (2017b), Wang et al. (2017), Pan et al. (2017). It requires very low memory usage, approaching 6 bpp.

The accuracy of this code is adjustable in that we can choose 1-byte/2-byte integers separately for positions and velocities of particles. In the case of using 2 byte integers for both positions and velocities ("x2v2," and memory can be 12 bpp), the algorithm gives the exact results given by traditional N-body algorithms. Note that the results are exactly the same in that they not only produce the same physical statistics of LSS, but also the same error (not physical) from the PM algorithm near Nyquist frequencies. In other words, the positions and velocities of each particle are exact. In practice, we only require that the errors from information optimization are much lower than the errors from the PM algorithm. In Figure 9 we see that this is the case even for the most memory-efficient configuration, x1v1. This shows that in most LSS studies, when our scales of interest are smaller than , six 1-byte fixed point numbers contain sufficient information for every N-body particle.

Another benefit of this algorithm is LSS simulations with neutrinos, although the neutrino modules of CUBE are in development. Neutrinos have a high velocity dispersion and move much faster than CDM, and their small-scale errors are dominated by their Poisson noise. We expect that, compared to CDM, x1v1 gives less power spectrum deviation to neutrinos, as they behave more Gaussian and are less clustered. LSS-neutrino simulations, like TianNu, can contain much more (for TianNu, 8 times more) neutrino N-body particles than CDM, which dominate the memory. For a TianNu-like simulation, one can safely use x1v2 or x2v2 for CDM and x1v1 for neutrinos, and the memory usage can approach 6 bpp. This allows much more particles to be included in the simulation and can lower the Poisson noise of neutrinos prominently. A 1-byte PID per particle increases minor memory usage and can differentiate eight kinds of particles, or we can store different particles in different arrays without using PID.

We did not include PP force in CUBE and CUBEP3M in this paper. If one wants to focus on smaller scales, like halo masses and profiles, extended PP forces, which act up to adjacent fine cells, should be taken into account. The memory consumption for PP force is only local and is negligible compared to particles. In these cases where even x2v2 is used, a 12 bpp memory usage is still much lower than that of traditional N-body algorithms.

Traditional N-body codes consume considerable memory while performing relatively light computations. CUBE is designed to optimize the efficiency of information and the memory usage in N-body simulations. CUBE is written in Coarray Fortran—concise Coarray features are used instead of complicated MPI—and the code itself is much more concise than CUBEP3M for future maintenance and development. The next steps are optimization of the code and adapting it for various kinds of heterogeneous computing systems, e.g., MIC and GPUs. Optimizing the velocity storage may further improve the accuracy of x1v1, whose effects on neutrino-LSS simulation are yet to be discovered.

We acknowledge funding from NSERC. H.R.Y. thanks Derek Inman for many helpful discussions. We thank the anonymous referee for many helpful suggestions that improved the paper. The simulations were performed on the GPC supercomputer at the SciNet HPC Consortium. The code is publicly available on github.com under yuhaoran/CUBE.

7
Images are the concept of computing nodes or MPI tasks in Coarray Fortran. We use this terminology in this paper.
8
Coarray Fortran concept. Co-dimensions can enable communications between images.
9
This repetition scales as o(N) and is computationally inexpensive.
10
Additional memory is used for OpenMP parallelization and for particle IDs to differentiate different particle spices.

Please wait… references are loading.