High Performance Computing

Parallel Multi-Scale Hemodynamics


Efthimios KAXIRAS



We outline a computational method for clinical cardiovascular diagnosis based on massive simulations running on the IBM Blue Gene architecture and Graphical Processing Units (GPU) commodity hardware. Our approach is based on the combination of microscopic Molecular Dynamics (MD) with a hydrokinetic Lattice Boltzmann (LB) method and takes into explicit account the hydrodynamic interactions among the suspended bodies and the surrounding fluid. The end result is a multi-physics/scale parallel code, called MUPHY. The software exhibits excellent scalability on the Blue Gene/L platform and multi-GPUs. By developing a series of ad hoc optimization techniques, MUPHY proves extremely efficient in simulating irregular domains, such as for blood flows in extended coronary arteries, with the capability of accessing physical information at micrometric scale.


In the last several decades, there has been growing interest towards understanding the circulation of blood (hemodynamics) in the human cardiovascular system. This is a very complex network: the heart pumps the blood through the large arteries to the smaller diameter arteries, which become capillaries and eventually venules, where the deoxygenated blood is passed through veins back to the heart, imposing a circular road map through the whole body. In such diverse morphological conditions, blood exhibits a widely variable behavior, ranging from the continuum to the corpuscular nature of the biofluid, rich in red blood cells and other suspended particles.
The phenomenology of hemodynamics has a significant impact in initiating and evolving many cardiovascular diseases. Atherosclerosis is the most common disease that affects the arterial blood vessels resulting in coronary heart disease, the most common cause of mortality and morbidity in developed countries. About 50% of annual deaths due to this disease occur suddenly and with no prior symptoms [1]. Although the development of the disease depends on the presence of systemic risk factors, such as high cholesterol, diabetes and high blood pressure, the clinical manifestations - heart attack, sudden coronary death and angina pectoris - are focal, resulting from the accumulation of lipid molecules and inflammatory cells at specific locations within the wall of the coronary arteries.
The main objective of multi-scale hemodynamics is to develop and deploy a general-purpose methodology to study flow patterns in complex morphological environments. A direct benefit of this activity would be the enhanced biomedical understanding of the causes and evolution of plaques in the heart arteries, with important implications for predicting the course of atherosclerosis and possibly preventing or mediating its effects. The principle underlying our approach is to resolve the motion of the relevant degrees of freedom (particles, molecules, red blood cells, etc.) and their (environment) in a consistent way. The irrelevant components of the system are handled in effective terms, without sacrificing the realism of the overall description. In this way, we are able to approach systems of mesoscopic or macroscopic size, with a number of elemental components that cannot be handled by conventional simulation methods.
Studying complex flow patterns implies the simulation of suspended bodies and particles convected by the underlying fluid plasma. The interaction exerted between the fluid (solvent) and suspended bodies (solute) is bi-directional, with an action-reaction principle underlying the motion of both solute and solvent. We have thus developed a method to couple two computational entities by combining methods borrowed from different contexts, the Lattice Boltzmann (LB) method from the computational fluid-dynamic community, and Molecular Dynamics (MD) from the atomistic computational community. The developed concurrent multi-scale methodology is stable, reliable and suitable for a number of extensions and applications.
For a number of years, highly tuned, system-specific application codes have been used to run large-scale simulations in many different fields of computational physics. However, attempts of coupling such application codes to simulate more complex, interdisciplinary phenomena have been often based on a simple sequential paradigm, that is, the output of the microscopic code provides the input of the macroscopic code, with limited (if any) run-time concurrency and system integration. The end result of our effort has been the development of the software package MUPHY, a general-purpose multi-scale simulation tool that can be used to study a variety of phenomena. Due to a number of technical advancements in High Performance Computing, MUPHY can seamlessly handle real-life geometrical set-ups, such as blood flows in human arteries in the presence of white cells, lipids, drug-delivery carriers and other suspended bodies of biological interest. Moreover, MUPHY has been designed to exploit cutting-edge hardware resources, such as the IBM Blue Gene platform or clusters equipped with high performance graphical processing units (GPU) that are currently attracting considerable attention in the computational community.

Multiscale Method

In order to highlight the main features of our multi-scale approach, in this section we review the basic methodology, for which further details can be found in previous works [2].
The fluid dynamic equations are solved via the LB method, a particularly well suited approach to handle multi-scale problems for several reasons: first, free-streaming of the fluid proceeds along straight trajectories which greatly facilitates the imposition of geometrically complex boundary conditions, such as those appearing in complex cardiovascular environments. Second, fluid diffusivity emerges from the first-order LB relaxation-propagation dynamics, so that the kinetic scheme can march in time-steps scaling only linearly with the mesh resolution. Third, since both fluid-fluid and fluid-particle collisions are completely local, the LB scheme is well suited to parallel computing. Finally, the LB/MD coupling method typically exhibits an order-of magnitude more stable behavior as compared to Navier-Stokes/MD coupling.
In the LB method the basic quantity is fi(\vec{x},t), representing the probability of finding a fluid particle at the spatial mesh location \vec{x} and at time t with discrete speed \vec{c}i and mesh spacing \Delta{x}. Actually, fluid particles do not correspond to individual solvent molecules, but they represent instead the collective motion of a group of physical particles (populations). For our studies, we use the common three-dimensional 19-speed lattice where the discrete velocities \vec{c}i connect mesh points to first and second topological neighbors. Once the discrete populations f_i are known, the kinetic moments are obtained by a direct summation upon all discrete populations at a given lattice site, with \rho(\vec{x},{t})= \sum {f_i}(\vec{x},{t}) being the local density, \rho \vec{u}(\vec{x},{t})= \sum \vec{c}_{i} {f_i}(\vec{x},{t}) the flow current and \vec{P}(\vec{x},{t})= \sum_{i} \vec{c}_{i} \vec{c}_{i}{f}_{i}(\vec{x},{t}) the momentum-flux tensor.
The fluid populations are advanced in time through the following evolution equation: {f}_{i}(\vec{x} + \vec{c}_{i} \Delta{t},{t}+ \Delta {t})= {f}_{i}(\vec{x},{t} + \omega \Delta{t}({f}_{i} - f_{i}^{eq}) (\vec{x},{t}) + \Delta {tS_i}(\vec{x},{t}) where the right hand side represents the effect of fluid-fluid molecular collisions, through a relaxation towards a local equilibrium, {f}_{i} - f_{i}^{eq}, typically a second-order expansion in the fluid velocity of a local Maxwellian with speed \vec{u}:

{f}_{i} - f_{i}^{eq} (\vec{x},{t}) = {w}_{i} \rho \left[{1 + \frac{{\vec{u}}.{\vec{c}_{i}}}{c^2} + \frac{\vec{u}\vec{u}: (\vec{c}_{i}\vec{c}_{i} - c^2 \vec{I}}{2c^4}}\right] c = frac{{1}{\sqrt{3}}} being the sound speed, wi a set of weights normalized to unity, and \vec{I} the unit tensor in Cartesian space. The relaxation frequency \omega controls the kinematic viscosity of the fluid, v = cê \Delta t \left(\frac{1} {\omega}} - {\frac{{1}{2}}\right).

The source term {S}_{i} (\vec{x},{t}) accounts for the presence of particles embedded in the LB solvent and represents the momentum and momentum-flux input per unit time due to the influence of particles on the fluid, reading S_i = - w_i \frac{\vec{F}^f . \vec{c}_i}{c^2}, \vec{F}_f being the mechanical friction acting between a single particle and the surrounding fluid.
The Molecular Dynamics component propagates in time a number of solute particles having position \vec{r}_{p}, velocity \vec{v}_p, and variable orientation, and experiencing mechanical forces due to body-body and body-wall forces. The resulting MD algorithm is well known and well documented. When modeling the frictional term, the embedded solute can be chosen to have a finite extension and shape, as in a detailed representation of red blood cells, or a simple point-like structure. For the latter choice, however, we need to take into account the anisotropic hydrodynamic response (as for a biconcave discoidal body) in effective terms. In the simplest scheme, an isotropic point-like object has a frictional force \vec{F}^f = \gamma ({\vec{u}_p} -{\vec{v}_p}, with  ({\vec{u}_p} = {\vec{u}({\vec{r}_p})} being the fluid velocity evaluated at the particle position, and \gamma a friction coefficient.

Computation Hemodynamics

Prior research, primarily observational and in vitro, has established that the foci of atherosclerosis appear in regions of disturbed blood flow, where the local endothelial shear stress (ESS) is low or of alternating direction. Therefore, atherosclerotic lesions frequently form near arterial branches and bifurcation, where there is always disturbed flow. Moreover, it primarily affects the luminal side of arteries since the distribution of wall stress through the thickness of the wall arising from pressure is higher on the inner surface of the artery. The evidence for ESS in the localization and progression of atherosclerosis is compelling and widely accepted. However, to date there is no direct route to predict the occurrence of atherosclerosis. In fact, because there is no direct method for measuring ESS in vivo, the prediction where disease is likely to develop and what form it could take is basically impossible. In close collaboration with the Cardiovascular and Radiological Units of the Brigham and Women’s Hospital at Harvard Medical School, we have undertaken a systematic investigation of extended coronary systems enveloping the complete human heart. The joint usage of vascular profiling techniques, such as Multi-Detector Computed Tomography (MDCT), and MUPHY, allowed us to simulate coronary systems of unprecedented size and accuracy. MDCT is an emerging noninvasive modality for coronary artery imaging with the potential to assess the coronary artery lumen, the wall, and plaque with improved spatial and temporal resolution as compared to prior acquisitions, permitting the creation of a 3D image of the entire coronary artery system in a few minutes [3]. We have developed a set of tools to reconstruct, regularize and finally simulate coronary arteries in silico. The Lattice Boltzmann method proved particularly favorable in handling complex arterial geometries since the use of a regular cartesian mesh greatly facilitates handling irregular coronary geometries, as compared to traditional fluid-dynamic approaches. Moreover, one can access the crucial quantities, such as ESS, locally on the mesh, without awkward interpolation methods, as illustrated in Fig. 1, for a real-life example of a coronary artery [4].

JPEG - 11.4 kb
fig. 1
Human Left Coronary Artery colored according to the local ESS, computed with 250,000,000 LB voxels. The upper region is directly attached to the Aorta and subdivides in nine different bifurcations. The regions of pathologically low ESS correspond to the deep blue coloring, occurring in proximity of bifurcations, for local enlargements of the vessel diameter, and generally along the inner region of vessel bending. The inset illustrates the original multi-detector computed tomography scan, with the coronary artery system descending from the aorta

Our study focuses on the reconstruction of ESS maps in left and right coronary arteries. Given the high level of ramification in coronaries, sophisticated 3D graphical software allows us to visualize scalar, vectorial and tensorial fields and locate the foci of low ESS [4]. We can thus monitor the coronary hot spots in different hemodynamic conditions, as in Fig. 2. In particular, we have identified a number of observational quantities that allow us to characterize irregular flow patterns, with repercussions on the underlying endothelial cell alignments and adhesion properties of lipid-rich material by the inner coronary wall. Such quantities prove of great interest for a correct interpretation of incipient pathologies by the medical community. We can now handle disparate geometries and flow conditions at the centimeter and millimeter level. Further work is in progress aimed at reproducing the rheological properties of blood at micrometer level. Here, the corpuscular nature of blood modulates viscosity in specific ways, and needs to be taken into account via the reproduction of motion of red blood cells, with full deployment of the multi-scale methodology.

JPEG - 10.8 kb
fig. 2
Diagrammatic ESS map eliciting the connectivity of the vessels, the attachment points, and the longitudinal (vertical axis) and azimuthal (horizontal axis) modulation of ESS, indicated by the color scheme shown at the bottom. The horizontal direction is proportional to the artery local diameter and labels indicate the conventional naming of coronary vessels. The white compact regions on the ESS map represent attachment regions on the mother vessel, where proper surface points are locally absent

Parallelism on Blue Gene and GPU clusters

Our simulation software, MUPHY, swiftly pipelines the different phases of the hemodynamic simulation and exploits the power of state-of-art hardware resources in a highly efficient manner. The code has been initially developed on clusters of PC and the IBM Blue Gene system. The parallelization of the Lattice Boltzmann method and Molecular Dynamics algorithms, as two distinct problems, has been extensively studied for a number of years [5,6]. However, the coupling of these techniques, in view of the dual nature of the multi-scale approach, raises new issues that need to be solved in order to achieve scalability and efficiency for large-scale simulations. For both methods we rely on domain decomposition derived from a k-way multilevel algorithm [7] with optimal load balancing in terms of voxel content and communication minimization. The fluid-dynamic solver is the most time-consuming component of the method. Indeed, the LB algorithm for the update of the populations has an unfavorable ratio between number of floating point operations and number of memory accesses. In addition, it is not possible to exploit the SIMD-like operations of the PowerPC 440 processor since they require stride one access, whereas the LB method has a “scattered” data access pattern. Therefore, we first took advantage of single-core optimizations in the LB kernel, by removal of redundant operations, buffering of multiply-used operations and “fusion” of the collision and streaming steps in a single loop, resulting in performances in line with other highly tuned LB kernels. Next, the intrinsic structure of the LB kernel and the large bandwidth of the Blue Gene architecture were leveraged resulting in optimal scalability. For the Molecular Dynamics section, we developed a parallelization strategy suitable for the multi-scale hemodynamic problem. In particular, we addressed the problem that in typical MD applications the spatial distribution of suspended bodies may be inhomogeneous. A straightforward approach to achieve a good load balancing is to resort to a domain decomposition such that each task has approximately the same number of particles. In this way, the size of the spatial sub-domains assigned to each task may vary substantially. In a stand-alone Molecular Dynamics simulation, this is acceptable but in our case the LB component would be hindered, since the computational load is proportional to the size of the spatial domain assigned to each task. One might opt for two separate domain decompositions for the LB and the MD part of the simulation. However, the exchange of momentum among particles and the surrounding fluid would become a non-local operation, with a very high cost due to the long-range point-to-point communications imposed on the underlying hardware/software platform. For the IBM Blue Gene such communications are explicitly discouraged. We circumvented the problem arising from inhomogeneous distributions by developing a local dynamic load transfer method and applying it to the most time-consuming part of MD, the computation of body-body interactions. Whenever the number of interacting pairs owned by a given task exceeds the number of pairs assigned to a neighbor by a threshold, a fraction of the excess pairs is sent to that neighbor, as illustrated in Fig. 3. In this way, a local load balancing for the computation of forces is achieved. A set of precedence rules prevents situations in which a task sends pairs to a neighbor and receives pairs from another. The receiving task computes the corresponding forces and sends them back to the task that actually owns the particles. For the system under study, the communication/computation ratio is such that the strategy results in a sizeable advantage for the global efficiency [8]. We made scalability tests on a Blue Gene/L computer up to 32,768 nodes, for a system composed by 256,000 particles embedded in a fluid mesh made by 512 x 512 x 512 voxels (Table 1). The test refers to biopolymer translocation experiments [3] and includes hydrodynamic fluctuations, with an additional cost as compared to noise-free hemodynamic simulations. The LB part appears to scale linearly and at times even superlinearly, probably due to a positive side effect of the reduction of lattice nodes per task on cache usage. To be noted that also the MD part exhibits a significant speed-up, showing that the saturation regime is still far from being hit. Without the dynamic local load balancing, the time required by the MD part of the simulation increases, on average, by about 30%. The excellent scalability of both the LB and MD components confirms the efficiency of the IBM Blue Gene communication network. The LB part of the simulation performs slightly more than 210 MFLOPS/sec per task. As for the MD part, on average, each task performs slightly below 160 MFLOPS/sec. Taken together, on average, each task performs at about 190 MFLOPS/sec. On the largest configuration at our disposal, these figures lead to an estimate of a total of 6.2 TeraFLOPS/sec aggregate performances.

JPEG - 35.2 kb
fig. 3
Local load balancing in a simplified 2D case. To balance the load for computing pair-wise forces, node 2 “virtually” transfers the coordinates of four beads to node 0 but remains the “owner” of those beads

Given its particularly favorable price/performance ratio, the Graphics Processing Units represent the latest technological breakthrough in scientific computing, accessible as commodity hardware. Among the GPU, those developed by NVIDIA appear particularly suitable to support general purpose processing thanks to their programming technology named CUDA. We had access to a NVIDIA Tesla C870 equipped with 16 multiprocessors with 8 processors each, for a total of 128 computational cores that can execute at a clock rate of 1.3 GHz. The total on-board global memory on the Tesla C870 amounts to 1.5 GByte with a 384-bit memory interface to the GPU that delivers 76.8 GByte/sec memory bandwidth. The latency for the access to this global memory is approximately 200 cycles (two-order of magnitude slower than access to shared memory) with any location of the global memory visible by any thread, whereas shared memory variables are local to the threads running within a single multiprocessor.

SystemExecution time (in sec)MFLUPS
1 C870 GPU76053
2 GT200 GPUs159252
8 GT200 GPUs41.9955

Table 2 - Timing of 10,000 iterations on an irregular domain with 4,000,000 voxels on multi-GPU architectures. The performance is measured by using the de-facto standard unit for Lattice Boltzmann, that is Million of FLUid node lattice Updates per Second (MFLUPS). A MFLUPS is equal, approximately, to 200 million floating point operations per second (MFLOPS).

To highlight the computing capabilities provided by the GPU, we report in Table 2 the results of the multi-GPU version of MUPHY parallel code [9] obtained on a cluster composed as follows: 4 Quad core Xeon 2.8Ghz connected by Infiniband and equipped each one with two pre-production S1070 GPU systems (for a total of 8 GT200 GPUs). For the test we used an irregular domain with a large bounding box (1057 x 692 x 1446 dimension) and a total number of fluid nodes 4,000,000. To obtain overall high performances, we took special care of the communication patterns among the GPUs due to an intermediate passage through the host CPUs. Overall, the reconstruction of extremely accurate ESS maps of complete coronary systems requires to have mesh spacing of 20 µm resulting in 250,000,000 fluid mesh points, that is a global memory allocation of 60 GByte in single precision representation. Such requirement is met by exploiting a cluster of 8 Tesla GT200 each equipped with 16 GByte memory.

Number of tasksTotal timeTotal efficiencyLB timeMD time
1024 (CO)883.9N/A581.4260.0
2048 (CO)452.398%290.0135.2
4096 (CO)233.395%144.370.5
8192 (CO)118.693%72.138.1
16384 (CO)59.293%36.121.1
16384 (VN)66.283%40.123.0
32768 (VN)36.176%20.113.0

Table 2 - Times (in seconds) for 1000 iterations of a 256000 particles multi-biopolymer translocation in a 512 x 512 x 512 lattice. The runs have been made on a Blue Gene/L architecture running in co-processor (CO) and virtual node (VN) modalities.

To conclude, performance tests have shown that MUPHY can sustain highly competitive performances on Blue Gene and multi-GPU architectures, notwithstanding the underlying general-purpose design of the software components. Given the accessible price of the GPU platforms, MUPHY is ready to use for routine medical analysis. In prospect, the joint usage of simulation and imaging on the same hardware will allow to non-invasively and inexpensively screen large numbers of patients for incipient coronary disease, and intervene at clinical level prior to the occurrence of catastrophic events.


The authors are grateful to M. Fyta, M. Bisson, J. Sircar, F. Rybicky and C.L. Feldman for encouragement and support during the whole development of the project. This work was partially supported by Harvard’s Initiative in Innovative Computing and by the Cyber Infrastructure Laboratory of the Harvard School of Engineering and Applied Sciences.


[1] Heart and stroke encyclopedia, American Heart Association (2009).
[2] M. Fyta, S. Melchionna, E. Kaxiras, S. Succi, Comput. Sci. & Eng., 10, 10 (2008).
[3] F.J. Rybicki et al., Intl. J. of Cardiovasc. Imaging, 15 Jan 2009, DOI 10.1007/s10554-008-9418-x (2009).
[4] S. Melchionna et al., submitted (2009).
[5] G. Amati, R. Piva, S. Succi, Intl. J. ModernPhys. C 4, 869 (1997).
[6] S. Plimpton, J. Comput. Chem. 117, 1 (1995).
[7] glaros.dtc.umn.edu/gkhome/views/metis
[8] M. Bernaschi, S. Melchionna, S. Succi, M. Fyta, E. Kaxiras, J.K. Sircar. Comput. Phys. Comm., 180, 1495 (2009).
[9] M. Bernaschi, M. Fatica, S. Melchionna, S. Succi, E. Kaxiras, Concurrency & Comput., 10.1002/cpe.1466, in press.

Cherchez ...

- dans tous les Flash informatique
(entre 1986 et 2001: seulement sur les titres et auteurs)
- par mot-clé


Cette page est un article d'une publication de l'EPFL.
Le contenu et certains liens ne sont peut-être plus d'actualité.


Les articles n'engagent que leurs auteurs, sauf ceux qui concernent de façon évidente des prestations officielles (sous la responsabilité du DIT ou d'autres entités). Toute reproduction, même partielle, n'est autorisée qu'avec l'accord de la rédaction et des auteurs.

Archives sur clé USB

Le Flash informatique ne paraîtra plus. Le dernier numéro est daté de décembre 2013.

Taguage des articles

Depuis 2010, pour aider le lecteur, les articles sont taggués:
  •   tout public
    que vous soyiez utilisateur occasionnel du PC familial, ou bien simplement propriétaire d'un iPhone, lisez l'article marqué tout public, vous y apprendrez plein de choses qui vous permettront de mieux appréhender ces technologies qui envahissent votre quotidien
  •   public averti
    l'article parle de concepts techniques, mais à la portée de toute personne intéressée par les dessous des nouvelles technologies
  •   expert
    le sujet abordé n'intéresse que peu de lecteurs, mais ceux-là seront ravis d'approfondir un thème, d'en savoir plus sur un nouveau langage.