Abstract
Data Science applications represent a growing fraction of the scientific computing workload, and many of them are written in Python. The goal of this paper is to compare two popular parallel programming models, MPI and Apache Spark, for Python-based Data Science applications. The paper presents communication and file I/O microbenchmarks to evaluate the MPI support for Python applications and uses two application use cases from Natural Language Processing to compare the performance of the MPI and Spark versions. Our results indicate that the MPI version shows better scalability and performance than the PySpark version of the code. On the other hand, the MPI applications are significantly larger than their PySpark counterparts and took significantly longer to develop, because some of the built-in functionality provided by Spark had to be reimplemented.
Recommended Citation
M. Saxena et al., "Comparison of MPI and Spark for Data Science Applications," Proceedings of the 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2020), pp. 682-690, article no. 9150426, Institute of Electrical and Electronics Engineers, May 2020.
The definitive version is available at https://doi.org/10.1109/IPDPSW50202.2020.00123
Department(s)
Engineering Management and Systems Engineering
International Standard Book Number (ISBN)
978-172817445-7
Document Type
Article - Conference proceedings
Document Version
Citation
File Type
text
Language(s)
English
Rights
© 2025 Institute of Electrical and Electronics Engineers, All rights reserved.
Publication Date
01 May 2020
