Abstract

Data Science applications, many of them written in Python, represent a growing fraction of the scientific computing workload. The goal of this paper is to compare two popular parallel programming models, MPI and Apache Spark, for Python-based Data Science applications. The paper presents communication and file I/O microbenchmarks to evaluate the MPI support for Python applications, and uses two application use cases from Natural Language Processing to compare the performance of the MPI and the Spark versions. Our results indicate that the MPI versions show better scalability and performance than the PySpark versions of the code. On the other hand, the MPI applications are significantly larger than their PySpark counterparts and took significantly longer to develop, because some of the functionality that Spark provides built in had to be implemented by hand.

Department(s)

Engineering Management and Systems Engineering

International Standard Book Number (ISBN)

978-172817445-7

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Rights

© 2025 Institute of Electrical and Electronics Engineers, All rights reserved.

Publication Date

01 May 2020
