Transparent Deployment of Machine Learning Models on Many-Accelerator Architectures
The growing demand for machine learning (ML) and signal processing in embedded and edge systems introduces challenges in balancing performance, energy efficiency, and software development simplicity. Heterogeneous systems-on-chip (SoCs), which combine general-purpose processors with specialized accelerators, offer a promising solution. However, deploying applications efficiently on such architectures requires new hardware/software co-design methodologies, integration strategies, and runtime resource management.

This dissertation presents an approach that separates application logic from accelerator integration, enabling more efficient deployment of ML and signal processing workloads on many-accelerator SoCs. To begin, I introduce ESP4ML, an open-source design flow that combines ESP, a modular SoC platform, with hls4ml, a high-level synthesis tool for generating ML accelerators. ESP4ML provides an embedded runtime software application programming interface (API) for managing accelerators dynamically within Linux, and it enables direct data transfer between accelerators, reducing memory accesses. These features allow rapid prototyping of SoCs and are demonstrated through FPGA-based implementations executing end-to-end embedded workloads.

Next, I analyze communication models by comparing memory-based and point-to-point (p2p) data transfer mechanisms for accelerators. Using synthetic and real-world benchmarks, such as Nightvision and Wide-Area Motion Imagery (WAMI), I show that p2p communication consistently delivers better performance and energy efficiency, particularly in multi-threaded and tile-based accelerator systems.

I then present a co-design strategy that integrates the Eigen C++ linear algebra library with ESP. It enables high-level software to transparently offload computations to accelerators, achieving significant gains in both performance and energy efficiency without compromising software simplicity.
To further improve the transparency of deploying ML workloads, I introduce WOLT, a lightweight ...