Treffer: Performance-Driven Algorithm Engineering ; Optimising Pairwise Sequence Alignment and Pattern Matching Algorithms in the Era of Pangenomic Sequence Analysis ; Leistungsorientierte Algorithmenentwicklung
Weitere Informationen
A number of technological advancements in high-throughput genome sequencing have led to the generation of exabyte-scale sequencing data worldwide in recent years. These developments have facilitated large-scale resequencing projects like the 1000 Genomes Project, which aim to catalog the genetic diversity of organisms and specific populations. In the context of medical research, incorporating population data, is of particular interest. However, existing applications are often optimised for analysing a few sequences, and the methods used cannot be easily transferred to these vast datasets without exceeding system resources or producing results within a suitable time frame. Simultaneously, the execution model of processors has evolved from sequential to highly parallel process execution, thanks to the addition of processor cores, vector processing units, and advances in superscalar processor designs. This ongoing development in high-performance computing requires applications and algorithms to scale with the growing levels of parallelism. However, highly optimised algorithms are often embedded within applications, making them practically inaccessible to the scientific community. In this dissertation, we first investigate a generic approach to parallelise and vectorise pairwise sequence alignments using a dynamically scalable concurrency model. We explore various techniques and code-level optimisations to effectively utilise the available parallelisms on modern high-performance CPUs. The results demonstrate that the dynamically accelerated pairwise sequence alignment scales proportionally with the number of cores and provides speed-ups of up to a factor of 2500 compared to the sequential reference implementation on modern hardware. Second, we propose a general solution for data-compressed acceleration of pattern matching algorithms by compressing a large collection of related DNA sequences and providing a set of composable algorithms to refine and optimise the applicable operations. Our research on data-compressed ...