FLASH INFORMATIQUE FI



Thermal-Aware Design for 3D Multi-Processor


A new design approach for 3D multiprocessor systems that takes their thermal behavior into account is based on a system-level view and a combination of hardware and software approaches.



New Thermal-Aware 3D Multi-Processor Design with Liquid Cooling for Energy-Efficient HPC systems.


David ATIENZA ALONSO


Abstract

Recent trends point to 3D Multi-Processor System-on-Chip (MPSoC) design as a promising way to keep increasing the performance of next-generation high-performance computing (HPC) systems. However, as the power density of HPC systems increases with the arrival of 3D MPSoCs, supplying electrical power to the computing equipment and continuously removing the generated heat is rapidly becoming the dominant cost in any HPC facility. Power and thermal/cooling implications are therefore becoming a major concern in the design of new HPC systems, given the current energy constraints in our society. Moreover, stacking computational units and storage components on top of each other can lead to degraded performance and even malfunction if thermal-aware design and thermal management are not properly handled at the different levels of system integration.
Therefore, the Embedded Systems Laboratory (ESL) of EPFL has spent the last three years developing novel thermal-aware design paradigms for 3D MPSoCs, in cooperation with the Microelectronics Systems Laboratory (LSM) and the Laboratory of Heat and Mass Transfer (LTCM) of EPFL, IBM, ETHZ and ST Microelectronics, within the CMOSAIC Nano-Tera.ch program and the PRO3D EU FP7 project. The goal of ESL's work in this framework is twofold: to explore the thermal modeling and system-level design methods needed to develop 3D MPSoCs with inter-tier liquid cooling, and to deploy energy-efficient run-time thermal control strategies. Together, these should yield cooling mechanisms efficient enough to compress almost 1 Tera nano-sized functional units into one cubic centimeter, with 10 to 100 times higher connectivity than otherwise possible. The proposed thermal-aware design paradigm therefore explores the synergies of hardware-, software- and mechanical-based thermal control techniques as a fundamental step in designing 3D MPSoCs for HPC systems. It covers inter-tier coolants ranging from liquid water and two-phase coolants to novel engineered, environmentally friendly nano-fluids, together with specifically designed micro-channel arrangements, combined with system-level dynamic thermal management that tunes the coolant flow rate in each microchannel to achieve thermally balanced 3D ICs.

Introduction

The power density of high-performance systems continues to increase with every process technology generation. Hence, the next generation of supercomputers cannot be designed using the existing solutions for high-performance computing (HPC) systems. As the density of supercomputers increases with the arrival of multi-core and multi-threaded chip processors, supplying electrical power to the computing equipment and continuously removing the generated heat is rapidly becoming the dominant cost in any HPC facility, exceeding even the cost of deploying new servers (see Figure 1). Thus, both power and thermal/cooling implications play an increasingly important role in the design of new HPC systems, especially given the current energy constraints in our society.

Fig. 1: projections of energy cost expenditure in new HPC deployment equipment and maintenance costs (power and cooling) in the US market

In addition, 3D integration (i.e., multiple layers of processors, memories, etc. in the same chip stack) is a recently proposed design method for overcoming the delay, bandwidth, and power consumption limitations of the interconnects in large MPSoC chips, while reducing the chip footprint and improving the fabrication yield. However, one of the main challenges in designing 3D circuits is the elevated temperatures that result from higher thermal resistivity and spread unevenly across the whole 3D chip stack, as shown by the 3D 50-core thermal test stacks, based on the Sun Niagara SPARC architecture, developed within CMOSAIC (see Figure 2). As Figure 2 shows, removing heat from 3D systems is more difficult than from conventional 2D MPSoCs. Moreover, 3D MPSoCs are prone to larger thermal variations: cores located at different tiers, or at different coordinates within a tier, have significantly different heating and cooling rates. Such large thermal gradients have very adverse effects on system reliability, performance, and the overall cooling costs of large HPC systems. Temperature-induced problems are therefore exacerbated by 3D stacking and are a major concern that must be addressed and managed as early as possible in 3D MPSoC design. Furthermore, alternative cooling techniques must be developed for future 3D-based HPC systems.

Fig. 2: (a) manufactured 5-tier stack chip at EPFL for 3D MPSoC thermal calibration and (b) transient thermal map (in K) of the Sun-based 50-core MPSoC emulated system with only 10 cores active

Thermal Modeling of 3D MPSoCs with Liquid Cooling

Conventional back-side heat removal strategies, such as air-cooled heat sinks and microchannel cold plates, scale only with the die size and are insufficient to cool 3D MPSoCs with hotspot heat fluxes of up to 250 W/cm², as expected in forthcoming 3D MPSoC stacks. In contrast, inter-tier liquid cooling is a potential solution to the high temperatures in 3D MPSoCs, owing to the higher heat removal capability of liquids compared to air. This technology, which is being developed by the CMOSAIC Nano-Tera.ch RTD project and is shown in Figure 3(a)(b), involves injecting water through microchannels between the tiers of a 3D stack. It scales with the number of dies and is compatible with area-array vertical interconnects (through-silicon vias, or TSVs), which are needed to connect the different electrical components in 3D processing architectures.

Fig. 3: (a) detail of the microchannels and TSVs in the 3D stack chip; (b) 3D stacked MPSoC with inter-tier liquid cooling; (c) SEM micrograph showing the TSV structures
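The advantage of liquids over air mentioned above can be made concrete with a simple energy balance (heat = density × flow × specific heat × coolant temperature rise). The following is a minimal sketch, not taken from the article: the fluid properties are approximate textbook values, and the 1 cm² region and the 10 K allowed coolant temperature rise are hypothetical. Only the 250 W/cm² heat flux comes from the text.

```python
# Approximate room-temperature fluid properties (textbook values, used here as assumptions).
WATER = {"density_kg_m3": 997.0, "cp_j_kg_k": 4180.0}
AIR = {"density_kg_m3": 1.2, "cp_j_kg_k": 1005.0}


def required_flow_m3_s(heat_w, coolant_rise_k, fluid):
    """Volumetric flow needed to carry heat_w with a coolant temperature rise of coolant_rise_k.

    Energy balance: heat = density * flow * cp * temperature_rise.
    """
    if coolant_rise_k <= 0:
        raise ValueError("coolant temperature rise must be positive")
    return heat_w / (fluid["density_kg_m3"] * fluid["cp_j_kg_k"] * coolant_rise_k)


if __name__ == "__main__":
    # Hypothetical case: a 1 cm^2 region dissipating 250 W/cm^2, 10 K allowed coolant rise.
    heat_w = 250.0 * 1.0
    water = required_flow_m3_s(heat_w, 10.0, WATER)
    air = required_flow_m3_s(heat_w, 10.0, AIR)
    # 1 m^3/s equals 60 000 L/min.
    print(f"water: {water * 6e4:.3f} L/min, air: {air * 6e4:.1f} L/min")
    print(f"air needs about {air / water:.0f}x the volumetric flow of water")
```

Under these assumptions, air needs a volumetric flow more than three orders of magnitude larger than water to carry the same heat, which is why pushing air between thin tiers is not a realistic option.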

A magnified image of the TSV structures is shown in Figure 3(c). Such complex cooling technologies require a fast thermal modeling platform that enables cooling-aware design of 3D MPSoCs during the early design stages. Accurate thermal models are needed to predict the cost of operating the liquid cooling (pumping power) in HPC systems, to determine the overall energy budget, and to perform run-time thermal management. However, it is impractical to use extremely detailed simulation methods to analyze thermal effects and model the temperature distribution in the target 3D MPSoC designs. Figure 4 shows layouts of the 3D MPSoCs targeted by this liquid cooling technology. They consist of two or more stacked tiers (with cores, L2 caches, crossbar, memory control units, etc.), with water microchannels distributed uniformly between the tiers.

Fig. 4: Sun-SPARC 3D MPSoC layouts with inter-tier liquid cooling

Figure 5 shows the detailed internal structure of the inter-tier cooling in these new 3D MPSoCs, as well as the 5-tier 3D IC test structures with liquid cooling developed in CMOSAIC.
Thermal and cooling exploration based on detailed numerical analysis, such as finite-element methods (FEM), is too time-consuming for design-time and run-time thermal management of such complex 3D MPSoCs. We therefore apply multi-scale modeling concepts to system-level thermal modeling, which makes the evaluation of cooling requirements computationally efficient. To this end, ESL has developed 3D-ICE [1] (3D Interlayer Cooling Emulator), a compact transient thermal model library, written in C, for the thermal simulation of 3D ICs with multiple inter-tier liquid cooling microchannels. 3D-ICE is compatible with existing CAD tools for MPSoC design and offers significant speed-ups (up to 975x) over typical commercial computational fluid dynamics and thermal simulation tools, while preserving accuracy (a maximum temperature error of 3.4%) for 3D MPSoCs.

Fig. 5: (a) 3D stacked MPSoC architecture with interlayer microchannel cooling; (b) a 3D IC test vehicle with lateral wire-bonds for electrical I/O, fabricated with a fluid manifold mounted on a printed circuit board; (c’) detail of the wire bonding process
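To illustrate what a compact transient thermal model computes, here is a minimal sketch of a lumped resistor-capacitor (RC) network for a stack of tiers, advanced in time with the implicit (backward) Euler method. It is not the 3D-ICE API or its formulation: it assumes one temperature node per tier, a single conductance between adjacent tiers, and a fixed-temperature coolant coupled to every tier. All names and parameter values are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def step_stack(temps, powers, dt, c_tier, g_tier, g_cool, t_coolant):
    """Advance tier temperatures by one backward-Euler step.

    Per tier i: C dT_i/dt = P_i + G (T_{i-1} - T_i) + G (T_{i+1} - T_i)
                            + G_cool (T_coolant - T_i)
    """
    n = len(temps)
    lower = [0.0] * n
    diag = [0.0] * n
    upper = [0.0] * n
    rhs = [0.0] * n
    for i in range(n):
        g_sum = g_cool
        if i > 0:  # conduction to the tier below
            lower[i] = -g_tier
            g_sum += g_tier
        if i < n - 1:  # conduction to the tier above
            upper[i] = -g_tier
            g_sum += g_tier
        diag[i] = c_tier / dt + g_sum
        rhs[i] = c_tier / dt * temps[i] + powers[i] + g_cool * t_coolant
    return solve_tridiagonal(lower, diag, upper, rhs)


if __name__ == "__main__":
    # Hypothetical 3-tier stack with only the bottom tier active.
    temps = [300.0, 300.0, 300.0]
    for _ in range(200):
        temps = step_stack(temps, [20.0, 0.0, 0.0], 0.1, 1.0, 5.0, 2.0, 300.0)
    print([round(t, 2) for t in temps])
```

Because each time step only requires solving a small linear system, such models run far faster than FEM or CFD simulation, at the cost of spatial detail. A larger coolant conductance, which loosely stands in for a higher flow rate, lowers all tier temperatures.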

System-Level Thermal Management of 3D MPSoCs

Although inter-tier liquid cooling is a potentially effective cooling technology for future 3D MPSoCs, the small cross-section of the inter-tier microchannels (channel width and thickness below 100 μm and 50 μm, respectively) means that the energy spent by the pump that injects the coolant can be very significant. In an HPC cluster, a central pump, such as a centrifugal pump (e.g., EMB MHIE), injects the fluid into a cluster of nodes. This pump can produce large discharge rates at small pressure heads. The liquid is delivered from the pump to the stacks through a pumping network. To enable a different flow rate for each stack, the cooling infrastructure includes valves in the network, such as normally closed valves (NCVs). NCVs use external power to reduce the pressure drop and increase the flow rate.
Figure 6 shows the pump and valve power consumption at three flow rate settings for a single stack deployed in a cluster of 60 similar 3D MPSoC stacks, as proposed in the IBM Aquasar data-center design. The maximum power required to inject the fluid into all stacks is a significant overhead for the whole system: it amounts to about 70 W, which is similar to the overall power consumption of a 2-tier 3D MPSoC. Hence, the cooling energy required by the liquid injection architecture must be minimized at system level to achieve energy-efficient cooling infrastructures for 3D MPSoC-based HPC systems.

Fig. 6: power consumption and flow rates of the cooling infrastructure for one stack in a cluster of 60 3D chips
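The cost of pumping can be sketched as follows. The first function uses the standard relation between hydraulic power, pressure drop and flow rate. The second estimates how much cooling energy a variable flow rate saves compared with always running at the highest setting. Only the roughly 70 W maximum comes from the text; the two lower power values, the setting names and the efficiency figure are hypothetical placeholders, since the values in Figure 6 are not reproduced here.

```python
# Hypothetical pump + valve power (W) per flow setting.
# Only the ~70 W maximum is taken from the text; the other values are placeholders.
COOLING_POWER_W = {"low": 30.0, "medium": 50.0, "high": 70.0}


def hydraulic_power_w(pressure_drop_pa, flow_m3_s, pump_efficiency=1.0):
    """Electrical power needed to push flow_m3_s against pressure_drop_pa."""
    if not 0.0 < pump_efficiency <= 1.0:
        raise ValueError("pump efficiency must be in (0, 1]")
    return pressure_drop_pa * flow_m3_s / pump_efficiency


def cooling_energy_j(settings, interval_s, power_table=COOLING_POWER_W):
    """Energy spent when each entry of settings is held for interval_s seconds."""
    return sum(power_table[s] for s in settings) * interval_s


def saving_vs_worst_case(settings, interval_s, power_table=COOLING_POWER_W):
    """Fraction of cooling energy saved relative to always using the highest setting."""
    if not settings:
        raise ValueError("settings trace must not be empty")
    worst = max(power_table.values()) * len(settings) * interval_s
    return 1.0 - cooling_energy_j(settings, interval_s, power_table) / worst


if __name__ == "__main__":
    trace = ["low", "low", "medium", "high"]  # hypothetical controller decisions
    print(f"saving: {saving_vs_worst_case(trace, 1.0):.1%}")
```

The saving depends entirely on how often the workload lets the controller stay at a low setting, which is what motivates run-time flow rate control.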

Therefore, building on 3D-ICE, ESL has recently developed a new family of global hardware-software temperature controllers (Active-Adapt3D Thermal Controllers [2][3]) for energy-efficient cooling management of 3D MPSoCs. As Figure 7 shows, these controllers include a thermal-aware job scheduler that weights the workload by its expected temperature increase and distributes the tasks among the cores of each tier, balancing the global temperature across the 3D system to maximize cooling efficiency. They then forecast the maximum system temperature using Auto-Regressive Moving Average (ARMA) models and proactively set the liquid flow rate through the pump controller.
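The idea of temperature-weighted load balancing can be sketched with a simple greedy assignment. This is an illustrative assumption, not the scheduler of [2][3]: each core gets a hypothetical thermal weight (larger for cores that heat up faster, for example because they sit farther from a microchannel), and each job goes to the core whose weighted load would be lowest after receiving it.

```python
def assign_jobs(job_loads, core_weights):
    """Greedy thermal-aware assignment.

    job_loads:    estimated load of each job (arbitrary units).
    core_weights: hypothetical thermal weight per core; a larger weight means
                  the same load causes a larger temperature increase.
    Returns (assignment, loads): job index -> core index, and load per core.
    """
    if not core_weights:
        raise ValueError("at least one core is required")
    loads = [0.0] * len(core_weights)
    assignment = {}
    # Place the heaviest jobs first so that small jobs can even out the result.
    for job_id, load in sorted(enumerate(job_loads), key=lambda p: -p[1]):
        core = min(
            range(len(core_weights)),
            key=lambda c: core_weights[c] * (loads[c] + load),
        )
        loads[core] += load
        assignment[job_id] = core
    return assignment, loads


if __name__ == "__main__":
    # Core 1 is assumed to heat up twice as fast as core 0.
    assignment, loads = assign_jobs([4, 3, 2, 1], [1.0, 2.0])
    print(assignment, loads)
```

With these weights, the thermally disadvantaged core ends up with less raw load, so the weighted loads (a rough proxy for temperature rise) are closer to each other than the raw loads are.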

Fig. 7: architecture of the Active-Adapt3D Thermal Controllers for 3D MPSoC architectures
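The forecasting and proactive flow-rate steps can be sketched as follows. For brevity, this sketch swaps the ARMA model named in the text for a plain first-order autoregressive model, x[t+1] = c + phi · x[t], fitted by ordinary least squares. The discrete flow levels, the 5 °C safety margin and the function names are hypothetical; only the 90 °C threshold appears in the article.

```python
def fit_ar1(series):
    """Ordinary least-squares fit of x[t+1] = c + phi * x[t]; returns (c, phi)."""
    if len(series) < 3:
        raise ValueError("need at least three samples")
    prev, nxt = series[:-1], series[1:]
    mean_prev = sum(prev) / len(prev)
    mean_next = sum(nxt) / len(nxt)
    var = sum((p - mean_prev) ** 2 for p in prev)
    if var == 0.0:
        return mean_next, 0.0  # flat history: predict the mean
    cov = sum((p - mean_prev) * (n - mean_next) for p, n in zip(prev, nxt))
    phi = cov / var
    return mean_next - phi * mean_prev, phi


def forecast(series, steps):
    """Iterate the fitted model to predict the next `steps` samples."""
    c, phi = fit_ar1(series)
    value = series[-1]
    predictions = []
    for _ in range(steps):
        value = c + phi * value
        predictions.append(value)
    return predictions


def next_flow_level(level, max_level, predicted_peak_c,
                    threshold_c=90.0, margin_c=5.0):
    """Raise the flow before the threshold is reached; lower it when there is slack."""
    if predicted_peak_c >= threshold_c - margin_c:
        return min(level + 1, max_level)
    if predicted_peak_c < threshold_c - 2.0 * margin_c:
        return max(level - 1, 0)
    return level


if __name__ == "__main__":
    history = [50.0, 60.0, 68.0, 74.4, 79.52]  # hypothetical maximum temperatures
    peak = max(forecast(history, 5))
    print(round(peak, 2), next_flow_level(1, 2, peak))
```

The point of forecasting is that the flow rate is raised while the measured temperature is still below the threshold, so the coolant has time to act before a hot spot forms.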

As illustrated in Figure 8, the latest experiments on 4-tier 3D MPSoCs (cf. Figure 4) indicate that this new generation of 3D thermal management and cooling controllers, i.e., a liquid cooling controller based on fuzzy logic (LC_fuzzy) or one targeting global load balancing (LC_LB), prevents the system from exceeding the given threshold temperature (90 °C) while reducing cooling energy by up to 50% with respect to a worst-case flow rate setting. These controllers also improve system-level energy by up to 21% with respect to state-of-the-art temperature controllers for air-cooled MPSoC architectures, i.e., dynamic load balancing (AC_LB), temperature-triggered task migration (AC_TTMig) or temperature-triggered dynamic voltage and frequency scaling (AC_TDVFS).

Fig. 8: percentage of time with hot spots per core and across the entire die for 4-tier 3D MPSoC designs running different system-level thermal control policies (average case across all workloads, and Web-high, the hottest workload benchmark set)
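As a rough illustration of how a fuzzy-logic flow controller such as LC_fuzzy might map temperature to flow rate, the sketch below uses three fuzzy sets and a weighted average of constant rule outputs (a zero-order Sugeno-style scheme). It is not the rule base of [2]: the membership breakpoints (60, 75 and 90 °C), the set names and the flow fractions are all hypothetical, chosen only so that full flow is reached at the 90 °C threshold mentioned in the text.

```python
# Hypothetical rule outputs: fraction of the maximum coolant flow rate.
RULE_FLOW = {"cool": 0.2, "warm": 0.6, "hot": 1.0}


def clamp01(x):
    """Limit x to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


def memberships(temp_c, low_c=60.0, mid_c=75.0, high_c=90.0):
    """Degree to which temp_c belongs to each fuzzy set; the degrees sum to 1."""
    cool = clamp01((mid_c - temp_c) / (mid_c - low_c))
    hot = clamp01((temp_c - mid_c) / (high_c - mid_c))
    warm = 1.0 - cool - hot  # triangular set peaking at mid_c
    return {"cool": cool, "warm": warm, "hot": hot}


def flow_fraction(temp_c):
    """Defuzzify by averaging rule outputs, weighted by membership degree."""
    degrees = memberships(temp_c)
    total = sum(degrees.values())
    return sum(degrees[name] * RULE_FLOW[name] for name in degrees) / total


if __name__ == "__main__":
    for temp in (55.0, 70.0, 82.5, 92.0):
        print(temp, round(flow_fraction(temp), 3))
```

Unlike a fixed threshold switch, the output changes smoothly with temperature, which avoids abrupt jumps in flow rate around the set points.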

Conclusions

Microchannel-based liquid cooling is a promising technology for overcoming the thermal challenges of 3D MPSoCs in HPC architectures. However, intelligent control of the coolant flow rate is needed to avoid wasting energy on over-cooling when the system is under-utilized. ESL and its research partners are therefore developing novel system-level thermal-aware design methodologies as an effective, combined (mechanical-electrical) technology approach to achieving thermally balanced 3D MPSoCs for HPC systems.

Acknowledgements

The author thanks Prof. Ayse K. Coskun (Boston Univ.), Prof. Yusuf Leblebici (LSM), Thomas Brunschwiler and Dr Bruno Michel (Advanced Packaging Group - IBM Zürich), and the ESL members for their contributions to this work. This research is partially funded by the Nano-Tera RTD project CMOSAIC (ref. 123618), financed by the Swiss Confederation and scientifically evaluated by SNSF, by the PRO3D EU FP7-ICT-248776 project, and by a grant from the OpenSPARC University Program of Sun Microsystems-ORACLE to ESL.

References

[1] 3D-ICE: Fast compact transient thermal modeling for 3D-ICs with inter-tier liquid cooling, Arvind Sridhar, Alessandro Vincenzi, Martino Ruggiero, David Atienza, Thomas Brunschwiler, Proc. of International Conference on Computer-Aided Design (ICCAD), San Jose, CA, USA, ACM and IEEE Press, ISBN: 978 1-4244-8192 7/10, pp. 463-470, November 7-11, 2010. Simulator available for download at: esl.epfl.ch/3d-ice.html.
[2] Fuzzy Control for Enforcing Energy Efficiency in High-Performance 3D Systems, Mohamed Sabry, Ayse K. Coskun, David Atienza, Proc. of International Conference on Computer-Aided Design (ICCAD), San Jose, CA, USA, ACM and IEEE Press, ISBN: 978 1-4244-8192 7/10, pp. 642-648, November 7-11, 2010.
[3] Energy-Efficient Variable-Flow Liquid Cooling in 3D Stacked Architectures, Ayse K. Coskun, David Atienza, Tajana Simunic, Thomas Brunschwiler, Bruno Michel, Proc. of Design, Automation and Test in Europe (DATE ’10), Dresden, Germany, ACM and IEEE Press, ISSN: 1530-1591/05, ISBN: 978-3-9810801-6-2, pp. 1-6, March 8-12, 2010.


