Abstract

Forecasts of daily pollutant levels have become a standard part of weather reports on television, online, and in newspapers. Research groups also need to analyze longer time frames across more locations to correlate long-term developments in different pollutants with serious health effects such as asthma. This paper presents a comparison of the Hadoop MapReduce and Spark programming models for air quality simulations, to guide future code development by research groups interested in these analyses. Two use cases are considered: (i) calculating the eight-hour rolling average of pollutants in a restricted region, and (ii) identifying clusters of sensors that show similar patterns in pollutant concentration over multiple years in the state of Texas. The data set used in this analysis is air pollution data collected over fifteen years at 179 monitor sites across the state of Texas for a variety of pollutants. Our results reveal 20-25% performance benefits for the Spark solutions over MapReduce. Furthermore, the paper documents performance benefits of the Spark MLlib machine learning library over the Mahout library, which is based on the MapReduce programming model.
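
To make the first use case concrete, the following is a minimal single-machine sketch of an eight-hour rolling average over hourly readings from one sensor. It is not the MapReduce or Spark code evaluated in the paper. The function name `rolling_average`, the `min_valid` completeness threshold, and the use of `None` for missing readings are all assumptions made for illustration.

```python
from collections import deque
from typing import Iterable, Iterator, Optional


def rolling_average(
    values: Iterable[Optional[float]],
    window: int = 8,      # eight hourly readings per average
    min_valid: int = 6,   # hypothetical completeness threshold (assumption)
) -> Iterator[Optional[float]]:
    """Yield the trailing-window mean for each hourly reading.

    A result is None until a full window has been seen, or when fewer
    than `min_valid` readings in the window are present (not None).
    """
    buf = deque(maxlen=window)  # keeps only the most recent `window` readings
    for v in values:
        buf.append(v)
        valid = [x for x in buf if x is not None]
        if len(buf) == window and len(valid) >= min_valid:
            yield sum(valid) / len(valid)
        else:
            yield None


if __name__ == "__main__":
    # Hypothetical hourly concentrations for one sensor; None marks a gap.
    hourly = [30.0, 32.0, None, 35.0, 40.0, 42.0, 41.0, 38.0, 36.0]
    for hour, avg in enumerate(rolling_average(hourly)):
        print(hour, avg)
```

In a distributed setting the same per-sensor logic would typically run after grouping readings by sensor and sorting them by time; how the paper partitions the data is not stated in this abstract.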

Department(s)

Engineering Management and Systems Engineering

Comments

National Science Foundation, Grant CRI-0958464

Keywords and Phrases

Air Quality Simulations; MapReduce; Spark

International Standard Book Number (ISBN)

978-150902251-9

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Rights

© 2025 Institute of Electrical and Electronics Engineers, All rights reserved.

Publication Date

19 May 2016
