Treffer: Automated parallel execution of distributed task graphs with FPGA clusters

Title:
Automated parallel execution of distributed task graphs with FPGA clusters
Contributors:
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors, Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. PM - Programming Models
Publisher Information:
Elsevier
Publication Year:
2024
Collection:
Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge
Document Type:
Fachzeitschrift article in journal/newspaper
File Description:
17 p.; application/pdf
Language:
English
DOI:
10.1016/j.future.2024.06.041
Rights:
http://creativecommons.org/licenses/by/4.0/ ; Open Access ; Attribution 4.0 International
Accession Number:
edsbas.1DBB3854
Database:
BASE

Weitere Informationen

Over the years, Field Programmable Gate Arrays (FPGA) have been gaining popularity in the High Performance Computing (HPC) field, because their reconfigurability enables very fine-grained optimizations with low energy cost. However, the different characteristics, architectures, and network topologies of the clusters have hindered the use of FPGAs at a large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension to OmpSs-2, that aims at unifying all FPGA clusters by using a message-passing interface that is compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas, and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), where the user does not need to call any message-passing API. Instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are only connected to the network through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at a large scale thanks to simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation and Cholesky decomposition benchmarks, and show that FPGA clusters get 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat. ; This work was supported by the Horizon 2020 TEXTA-ROSSA project [grant number 956831]; the Spanish Government [grant numbers PCI2021-121964, PDC2022-133323-I00, PID2019 - 107255GBC21, MCIN/AEI/10.13039/501100011 033]; the ...