Hydrodynamic free-surface flows requiring accurate pressure prediction occur in many engineering problems including breaking-wave impacts, sloshing tanks and nuclear sludge flows. There is a clear need for an accurate incompressible free-surface solver capable of handling highly distorted flows. The development of numerical meshless particle methods has given the opportunity to form robust solutions for these flows. A popular particle method, Smoothed Particle Hydrodynamics (SPH) by Gingold and Monaghan~cite{GingoldMonaghan1977} and Lucy~cite{Lucy1977} was originally created for the field of astrophysics and is now widely applied to free-surface flows~cite{Monaghan1994}.

The original formulation of SPH for fluid dynamics, known as weakly-compressible SPH (WCSPH), uses an equation of state to link density with pressure. The density is permitted to vary theoretically within about 1 \% of the reference value, but this leads to fluctuations in the pressure and velocity fields~cite{VioleauRogers2016}. Since an artificial speed of sound is required, the timestep size is limited by the Courant-Friedrich-Levy condition~cite{GomezGesteira2010}. In 1999, Cummins and Rudman~cite{CumminsRudman1999} presented a method for enforcing incompressibility by solving the pressure field with a pressure Poisson equation (PPE) following Chorin’s projection step method~cite{Chorin1968}; the semi-implicit scheme came to be known as Incompressible SPH (ISPH).

ISPH has been shown to exhibit a noise-free pressure field and provide accurate pressure force predictions for fluid-structure impacts~cite{Khayyer2009}cite{Shao2009}cite{Lind2016}. Without explcit dependence on a speed of sound, a larger maximum timestep size is possible~cite{VioleauLeroy2015}. The significant development of the particle shifting methodology in SPH, by Xu et al.~cite{Xu2009}, to maintain near uniform particle arrangements enables ISPH to be applied to high Reynolds number flows. Lind et al.~cite{Lind2012} then developed this to include free surfaces using a Fickian-based particle shifting technique which has been further generalised by Khayyer et al.~cite{Khayyer2017Shifting}. While there have been significant advances in improving the accuracy of ISPH~cite{Gotoh2014}cite{GotohKhayyer2016}, very large numbers of particles ($10^7$ to $10^9$), with the solution of large sparse matrices for the PPE, are still required for most applications, and it is the associated computational expense which is addressed here. This is important because the Poisson solver is responsible for over 90 \% of the computation time in a serial code. Hence, parallelisation is necessary for practical computation.

Parallelisation of ISPH is a relatively new area and to date has been undertaken using the message passing interface (MPI) by Guo et al.~cite{Guo2013} for 100s of millions of particles. Using a distributed memory architecture, however, is complex and requires optimisation of load balancing for 1000s of partitions or processes. Furthermore, massively parallel computing is expensive and not feasible for many engineering industries and researchers. Yeylaghi et al.~cite{Yeylaghi2016OpenMP}~cite{Yeylaghi2016MPI} have also parallelised ISPH with both open multi-processing (OpenMP) and MPI, although their scheme solves the Poisson equation explicitly without the use of a matrix, which affects accuracy and limits the timestep size.

In recent years, graphics processing units (GPUs) have emerged as a viable, cheap and highly portable alternative to large and expensive high-performance computing clusters. A single GPU provides a cost-effective approach in the move towards high performance computing. There have been successful applications of WCSPH on the GPU~cite{Harada2007}cite{Herault2010}cite{Crespo2011}cite{Mokos2015}cite{Fourtakas2016}cite{CercosPita2016}. These codes have achieved speedups of between 10-100 compared to optimised CPU-based codes. GPUs have also been utilised for the moving particle semi-implicit (MPS) method, which is closely related to ISPH~cite{SoutoIglesias2013}cite{SoutoIglesias2014}. The MPS method was first implemented on the GPU for 2-D simulations in 2011 by Hori et al.~cite{Hori2011} and Zhu et al.~cite{Zhu2011}, where the latter achieved a 26 times speed up over a traditional CPU-based MPS code. MPS on a GPU has since been applied to multi-phase flows~cite{Gou2016} and 3-D free-surface flows~cite{Li2015}. Developments in MPS also include GPU-based neighbour list algorithms for MPS by Murotani et al.~cite{Murotani2015} and the comparison of openMP/MPI parallelised solvers for MPS~cite{Duan2015}. Such advances in MPS are to be noted when developing ISPH.

To the best of the authors’ knowledge, there is only one published case of ISPH implemented on a GPU by Leroy~cite{Leroy2015}, where the subject was only briefly described without algorithmic implementation details. For ISPH on the GPU there are a number of challenges:

egin{itemize}

item A new PPE matrix must be constructed and solved every timestep for moving computation points (particles) on streaming multiprocessors.

item Use of memory-limited hardware for an inherently memory-expensive method.

item The fast solution of a particle method’s Poisson equation on a GPU.

end{itemize}

Fast and scalable linear solvers and preconditioning for a Poisson equation is a well-researched area for finite-element methods with fixed meshes~cite{Tomczak2013}cite{Elman2014}. For particle methods however, the solution of a linear system presents a different challenge because the particle connectivity is constantly changing with the evolution of the flow. In Section~

ef{subsubsec:Preconditioners}, the complexity of an evolving ISPH system is demonstrated by the performance comparison of two preconditioners for a fragmented flow.

Numerous open-source libraries for GPU-based linear solvers are now available~cite{ViennaCL}cite{CUSP}cite{Cusparse}. To obtain a fast ISPH code accelerated on a GPU, we combine the highly-optimised WCSPH DualSPHysics GPU code~cite{DualSPHysics} with the ViennaCL linear algebra library~cite{ViennaCL} to solve the PPE matrix. This includes employing an algebraic multigrid preconditioner which requires modification for ISPH particles. The ISPH algorithm also required a change of the SPH boundary condition within the DualSPHysics code by extending its fixed dummy boundary particles approach to that of fixed internal mirror particles by Marrone et al.~cite{Marrone2011}.

This paper is structured as follows. Section~

ef{sec:Methodology} presents the ISPH methodology used in this work. Section~

ef{sec:DualSPHysics} explains the use of DualSPHysics and the changes to the code required for ISPH. The details for the novel implementation of a parallel ISPH algorithm on a GPU are then explained in Section~

ef{sec:ISPHonGPU}. Numerical results and performance analysis of the code is presented in Section~

ef{sec:results}. Finally, in Section~

ef{sec:conclusions}, conclusions of the work presented are made.