Engineering Management and Systems Engineering Faculty Research & Creative Works

Comparison of MPI and Spark for Data Science Applications

Manvi Saxena
Shweta Jha
Saba Khan
John Rodgers
Peggy Lindner, Missouri University of Science and Technology
Edgar Gabriel

Abstract

Data Science applications represent a growing fraction of the scientific computing workload, and many of them are written in Python. The goal of this paper is to compare two popular parallel programming models, namely MPI and Apache Spark, for Python-based Data Science applications. The paper presents communication and file I/O microbenchmarks to evaluate the MPI support for Python applications and uses two application use cases from Natural Language Processing to compare the performance of the MPI and Spark versions. Our results indicate that the MPI version shows better scalability and performance than the PySpark version of the code. On the other hand, the MPI applications are significantly larger than their PySpark counterparts and took significantly longer to develop, because some of the functionality built into Spark had to be implemented manually.

Recommended Citation

M. Saxena et al., "Comparison of MPI and Spark for Data Science Applications," Proceedings 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020, pp. 682-690, article no. 9150426, Institute of Electrical and Electronics Engineers, May 2020.

The definitive version is available at https://doi.org/10.1109/IPDPSW50202.2020.00123

Department(s)

Engineering Management and Systems Engineering

International Standard Book Number (ISBN)

978-172817445-7

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Rights

Publication Date

01 May 2020

Included in

Operations Research, Systems Engineering and Industrial Engineering Commons
