Issue title: Exploring Languages for Expressing Medium to Massive On-Chip Parallelism
Article type: Research Article
Authors: Shan, Hongzhang; | Blagojević, Filip | Min, Seung-Jai | Hargrove, Paul | Jin, Haoqiang | Fuerlinger, Karl | Koniges, Alice | Wright, Nicholas J.
Affiliations: Future Technology Group, Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA | NAS Division, NASA Ames Research Center, Moffett Field, CA, USA | University of California at Berkeley, EECS Department, Computer Science Division, Berkeley, CA, USA | NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Note: Corresponding author: Hongzhang Shan, Future Technology Group, Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA. E-mail: hshan@lbl.gov.
Abstract: Harnessing the power of multicore platforms is challenging due to the additional levels of parallelism present. In this paper we use the NAS Parallel Benchmarks to study three programming models, MPI, OpenMP and PGAS, to understand their performance and memory-usage characteristics on current multicore architectures. To understand these characteristics we use the Integrated Performance Monitoring tool and other methods to measure communication versus computation time, as well as the fraction of the run time spent in OpenMP. The benchmarks are run on two different Cray XT5 systems and an InfiniBand cluster. Our results show that, in general, the three programming models exhibit very similar performance characteristics. In a few cases OpenMP is significantly faster because it explicitly avoids communication; for these cases we were able to rewrite the UPC versions and achieve performance equal to OpenMP. OpenMP was also the most advantageous in terms of memory usage. We also compare performance differences between the two Cray systems, which have quad-core and hex-core processors, and show that at scale performance is almost always slower on the hex-core system because of increased contention for network resources.
Keywords: Programming model, performance study, UPC, OpenMP, MPI, memory usage
DOI: 10.3233/SPR-2010-0306
Journal: Scientific Programming, vol. 18, no. 3-4, pp. 153-167, 2010
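Context for the abstract's note that the UPC versions of a few benchmarks were rewritten to avoid communication and match OpenMP: the paper itself gives the details, but rewrites of this kind typically rely on the standard UPC pointer-privatization idiom, sketched below. A thread's slice of a shared array is cast to an ordinary C pointer so that elements with local affinity are accessed with plain loads and stores rather than through the runtime's shared-pointer machinery. The array name and sizes here are hypothetical illustrations, not taken from the paper or its benchmarks.

    #include <upc.h>
    #include <stdio.h>

    #define N 1024                 /* hypothetical per-thread element count */

    /* Default cyclic layout: a[i] has affinity to thread i % THREADS. */
    shared double a[N * THREADS];

    int main(void) {
        /* Shared-pointer version: every reference to a[i] goes through
           the UPC runtime, even when the element is local to the
           calling thread. */
        upc_forall (int i = 0; i < N * THREADS; i++; &a[i])
            a[i] = 2.0 * i;

        upc_barrier;

        /* Privatized version: cast this thread's slice of the shared
           array to a plain C pointer. With the cyclic layout, local[i]
           aliases a[MYTHREAD + i * THREADS], and each access is an
           ordinary load/store with no communication overhead. */
        double *local = (double *)&a[MYTHREAD];
        for (int i = 0; i < N; i++)
            local[i] += 1.0;

        upc_barrier;
        if (MYTHREAD == 0)
            printf("done\n");
        return 0;
    }

The cast is legal only because &a[MYTHREAD] has affinity to the calling thread; privatizing local data this way is the usual route to OpenMP-like performance in UPC, since remote elements would still require shared accesses.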