Result: LEVERAGING PYSPARK FOR HIGH-PERFORMANCE ANALYTICS IN HIVE: STRATEGIES AND BENCHMARKS

Title:

LEVERAGING PYSPARK FOR HIGH-PERFORMANCE ANALYTICS IN HIVE: STRATEGIES AND BENCHMARKS

Authors:

Researcher

Contributors:

Vishnu Vardhan Reddy Chilukoori, Srikanth Gangarapu

Publisher Information:

Zenodo

Publication Year:

2024

Collection:

Zenodo

Subject Terms:

PySpark, Real-Time Analytics, SQL-On-Hadoop, Data Partitioning, Hive Integration

Document Type:

Academic journal; article in journal/newspaper

Language:

English

Relation:

https://zenodo.org/records/13692181; oai:zenodo.org:13692181; https://doi.org/10.5281/zenodo.13692181

DOI:

10.5281/zenodo.13692181

Availability:

https://doi.org/10.5281/zenodo.13692181
https://zenodo.org/records/13692181

Rights:

Creative Commons Attribution 4.0 International ; cc-by-4.0 ; https://creativecommons.org/licenses/by/4.0/legalcode

Accession Number:

edsbas.AB680C87

Database:

BASE

More Information

This article investigates the integration of PySpark with Hive data warehouses to enable high-performance real-time analytics. We explore the synergies between PySpark's distributed computing capabilities and Hive's data storage infrastructure, focusing on performance optimization techniques for large-scale data processing. The article presents a comprehensive framework for leveraging PySpark in Hive environments, including best practices for code optimization, Spark configuration tuning, and effective data partitioning strategies. Through a series of benchmarks and case studies, we demonstrate significant performance improvements in complex analytical tasks and machine learning applications compared to traditional Hive queries. Our findings reveal that PySpark can accelerate data processing by up to 10x in certain scenarios, while enabling more sophisticated real-time analytics. The article also addresses challenges in scaling PySpark solutions and provides insights into emerging trends in big data analytics. This article contributes to the growing body of knowledge on modernizing data warehouses and offers practical guidance for data engineers and analysts seeking to enhance their Hive-based analytics capabilities using PySpark.
