Scientific Programming - Volume 21, issue 3-4

Special Issue: Selected Papers from Super Computing 2012

Authors: Vetter, Jeffrey S. | Raghavan, Padma

Article Type: Other

DOI: 10.3233/SPR-130376

Citation: Scientific Programming, vol. 21, no. 3-4, pp. 63-64, 2013

Compiler-directed file layout optimization for hierarchical storage systems

Authors: Ding, Wei | Zhang, Yuanrui | Kandemir, Mahmut | Son, Seung Woo

Article Type: Research Article

Abstract: File layout of array data is a critical factor that affects the behavior of storage caches, and has so far received little attention in the context of hierarchical storage systems. The main contribution of this paper is a compiler-driven file layout optimization scheme for hierarchical storage caches. This approach, fully automated within an optimizing compiler, analyzes a multi-threaded application code and determines a file layout for each disk-resident array referenced by the code, such that the performance of the target storage cache hierarchy is maximized. We tested our approach using 16 I/O intensive application programs and compared its performance against two previously proposed approaches under different cache space management schemes. Our experimental results show that the proposed approach improves the execution time of these parallel applications by 23.7% on average.

Keywords: File layout, compiler optimization

DOI: 10.3233/SPR-130365

Citation: Scientific Programming, vol. 21, no. 3-4, pp. 65-78, 2013
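
To illustrate why the file layout of a disk-resident array matters to a block-granular storage cache, the sketch below maps array indices to file offsets under two layouts and counts how many cache blocks a given access pattern touches. It is not the paper's compiler analysis; the function names, the 8-byte element size, and the 4096-byte block size are assumptions made for the example.

```python
def file_offset(i, j, rows, cols, layout, elem_size=8):
    """Byte offset of element (i, j) of a rows x cols array stored in a file."""
    if layout == "row-major":
        return (i * cols + j) * elem_size
    if layout == "column-major":
        return (j * rows + i) * elem_size
    raise ValueError(f"unknown layout: {layout}")


def blocks_touched(accesses, rows, cols, layout, block_size=4096, elem_size=8):
    """Number of distinct fixed-size blocks touched by a list of (i, j) accesses."""
    return len({
        file_offset(i, j, rows, cols, layout, elem_size) // block_size
        for i, j in accesses
    })


# A thread that walks down the first column of a 1024 x 1024 array.
column_walk = [(i, 0) for i in range(1024)]
row_major_blocks = blocks_touched(column_walk, 1024, 1024, "row-major")
column_major_blocks = blocks_touched(column_walk, 1024, 1024, "column-major")
```

Under these assumptions the same column walk touches far fewer blocks when the file is stored column-major, which is the kind of difference a layout optimizer would try to exploit.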

Efficient and reliable network tomography in heterogeneous networks using BitTorrent broadcasts and clustering algorithms

Authors: Dichev, Kiril | Reid, Fergal | Lastovetsky, Alexey

Article Type: Research Article

Abstract: In the area of network performance and discovery, network tomography focuses on reconstructing network properties using only end-to-end measurements at the application layer. One challenging problem in network tomography is reconstructing available bandwidth along all links during multiple source/multiple destination transmissions. The traditional measurement procedures used for bandwidth tomography are extremely time consuming. We propose a novel solution to this problem. Our method counts the fragments exchanged during a BitTorrent broadcast. While this measurement has a high level of randomness, it can be obtained very efficiently, and aggregated into a reliable metric. This data is then analyzed with state-of-the-art algorithms, which correctly reconstruct logical clusters of nodes interconnected by high bandwidth, as well as bottlenecks between these logical clusters. Our experiments demonstrate that the proposed two-phase approach efficiently solves the presented problem for a number of settings on a complex grid infrastructure.

Keywords: Network tomography, BitTorrent, clustering, bandwidth, bottleneck link

DOI: 10.3233/SPR-130366

Citation: Scientific Programming, vol. 21, no. 3-4, pp. 79-92, 2013
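
The two phases described in the abstract above (aggregate noisy fragment counts, then cluster) can be sketched as follows. This is a simplified stand-in: the abstract does not name the clustering algorithms used, so a plain threshold with union-find replaces them here, and the function names, the input format, and the threshold are hypothetical.

```python
from collections import defaultdict


def aggregate_counts(broadcasts):
    """Sum fragment counts per unordered node pair over several broadcasts.

    Each broadcast is a dict {(node_a, node_b): fragments_exchanged}
    with node_a != node_b.
    """
    totals = defaultdict(int)
    for counts in broadcasts:
        for (a, b), n in counts.items():
            totals[frozenset((a, b))] += n
    return totals


def cluster_by_threshold(nodes, totals, threshold):
    """Group nodes whose aggregated pairwise count reaches the threshold."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair, n in totals.items():
        if n >= threshold:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(g) for g in groups.values())


# Two made-up broadcasts: a-b and c-d exchange many fragments, b-c few.
runs = [
    {("a", "b"): 90, ("c", "d"): 100, ("b", "c"): 4},
    {("b", "a"): 110, ("d", "c"): 80, ("b", "c"): 6},
]
clusters = cluster_by_threshold("abcd", aggregate_counts(runs), threshold=100)
```

In this toy input the low count between b and c plays the role of a bottleneck link between the two clusters.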

A divide and conquer strategy for scaling weather simulations with multiple regions of interest

Article Type: Research Article

Abstract: Accurate and timely prediction of weather phenomena, such as hurricanes and flash floods, requires high-fidelity, compute-intensive simulations of multiple finer regions of interest within a coarse simulation domain. Current weather applications execute these nested simulations sequentially using all the available processors, which is sub-optimal due to their sub-linear scalability. In this work, we present a strategy for parallel execution of multiple nested domain simulations based on partitioning the 2-D processor grid into disjoint rectangular regions associated with each domain. We propose a novel combination of performance prediction, processor allocation methods and topology-aware mapping of the regions on torus interconnects. Experiments on IBM Blue Gene systems using WRF show that the proposed strategies result in performance improvement of up to 33% with topology-oblivious mapping and up to an additional 7% with topology-aware mapping over the default sequential strategy.

Keywords: Weather simulation, performance modeling, processor allocation, topology-aware mapping

DOI: 10.3233/SPR-130367

Citation: Scientific Programming, vol. 21, no. 3-4, pp. 93-107, 2013
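
One simple way to partition a 2-D processor grid into disjoint rectangles, one per nested domain, is to slice it into vertical strips whose widths follow each domain's predicted cost. The sketch below does only that; it is not the allocation method of the paper, which the abstract does not detail, and the function name and the cost values are hypothetical. It assumes the grid has at least as many columns as there are domains.

```python
def partition_columns(grid_rows, grid_cols, predicted_costs):
    """Split a grid_rows x grid_cols processor grid into vertical strips.

    Returns one (row_start, col_start, height, width) rectangle per domain,
    with widths roughly proportional to predicted_costs and at least 1.
    """
    regions = []
    col_start = 0
    remaining_cols = grid_cols
    remaining_cost = sum(predicted_costs)
    for index, cost in enumerate(predicted_costs):
        domains_left = len(predicted_costs) - index
        if domains_left == 1:
            width = remaining_cols  # last domain takes what is left
        else:
            width = round(remaining_cols * cost / remaining_cost)
            # Leave at least one column for every domain still to be placed.
            width = max(1, min(width, remaining_cols - (domains_left - 1)))
        regions.append((0, col_start, grid_rows, width))
        col_start += width
        remaining_cols -= width
        remaining_cost -= cost
    return regions


# Two nested domains, the first predicted to cost three times the second.
regions = partition_columns(16, 32, [3, 1])
```

Each domain then runs concurrently on its own rectangle instead of taking turns on the whole grid.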

MPI runtime error detection with MUST: Advances in deadlock detection

Authors: Hilbrich, Tobias | Protze, Joachim | Schulz, Martin | de Supinski, Bronis R. | Müller, Matthias S.

Article Type: Research Article

Abstract: The widely used Message Passing Interface (MPI) is complex and rich. As a result, application developers require automated tools to avoid and to detect MPI programming errors. We present the Marmot Umpire Scalable Tool (MUST) that detects such errors with significantly increased scalability. We present improvements to our graph-based deadlock detection approach for MPI, which cover future MPI extensions. Our enhancements also check complex MPI constructs that no previous graph-based detection approach handled correctly. Finally, we present optimizations for the processing of MPI operations that reduce runtime deadlock detection overheads. Existing approaches often require 𝒪(p) analysis time per MPI operation, for p processes. We empirically observe that our improvements lead to sub-linear or better analysis time per operation for a wide range of real world applications.

Keywords: Deadlock detection, message passing interface, correctness checking

DOI: 10.3233/SPR-130368

Citation: Scientific Programming, vol. 21, no. 3-4, pp. 109-121, 2013
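
As a minimal illustration of graph-based deadlock detection in general, the sketch below looks for a cycle in a wait-for graph. It assumes a deliberately simplified model in which every blocked rank waits on exactly one other rank (for example, a blocking receive from a named source). It is not MUST's algorithm and does not cover the complex MPI constructs the abstract refers to; the function name and input format are hypothetical.

```python
def find_deadlock(waits_for):
    """Return a list of ranks forming a wait-for cycle, or None.

    waits_for maps each blocked rank to the single rank it is waiting on.
    Ranks that are not blocked do not appear as keys.
    """
    for start in waits_for:
        path = []
        rank = start
        # Follow the chain of waits until it leaves the blocked set or repeats.
        while rank in waits_for and rank not in path:
            path.append(rank)
            rank = waits_for[rank]
        if rank in path:
            return path[path.index(rank):]
    return None


# Ranks 0, 1 and 2 each wait on the next one in a ring.
cycle = find_deadlock({0: 1, 1: 2, 2: 0})
# Rank 2 is still running, so the chain 0 -> 1 -> 2 is not a deadlock.
no_cycle = find_deadlock({0: 1, 1: 2})
```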

Characterizing and mitigating work time inflation in task parallel programs

Authors: Olivier, Stephen L. | de Supinski, Bronis R. | Schulz, Martin | Prins, Jan F.

Article Type: Research Article

Abstract: Task parallelism raises the level of abstraction in shared memory parallel programming to simplify the development of complex applications. However, task parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation – additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3X compared to the Intel OpenMP task scheduler.

Keywords: Task parallel programming, locality, task scheduling, affinity, NUMA, OpenMP

DOI: 10.3233/SPR-130369

Citation: Scientific Programming, vol. 21, no. 3-4, pp. 123-136, 2013
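
The definition of work time inflation given in the abstract above translates directly into arithmetic: add up the time each thread spends doing useful work in the parallel run and compare it with the time the same work takes sequentially. The sketch below does this for made-up timings; the function names and the numbers are illustrative assumptions, and how the per-thread work time is measured is outside its scope.

```python
def work_time_inflation(sequential_work, per_thread_work):
    """Extra time (same unit as the inputs) spent on work in the parallel run."""
    return sum(per_thread_work) - sequential_work


def inflation_factor(sequential_work, per_thread_work):
    """Total parallel work time as a multiple of the sequential work time."""
    return sum(per_thread_work) / sequential_work


# Hypothetical timings in seconds: 10 s sequentially, four threads in parallel.
extra = work_time_inflation(10.0, [3.0, 3.5, 3.0, 3.5])
factor = inflation_factor(10.0, [3.0, 3.5, 3.0, 3.5])
```

Idle time and scheduling overhead are deliberately excluded from `per_thread_work`, since the abstract treats them as separate sources of lost efficiency.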

Direction-optimizing breadth-first search

Authors: Beamer, Scott | Asanović, Krste | Patterson, David

Article Type: Research Article

Abstract: Breadth-First Search is an important kernel used by many graph-processing applications. In many of these emerging applications of BFS, such as analyzing social networks, the input graphs are low-diameter and scale-free. We propose a hybrid approach that is advantageous for low-diameter graphs, which combines a conventional top-down algorithm with a novel bottom-up algorithm. The bottom-up algorithm can dramatically reduce the number of edges examined, which in turn accelerates the search as a whole. On a multi-socket server, our hybrid approach demonstrates speedups of 3.3–7.8 on a range of standard synthetic graphs and speedups of 2.4–4.6 on graphs from real social networks when compared to a strong baseline. We also typically double the performance of prior leading shared memory (multicore and GPU) implementations.

Keywords: Graph algorithms, breadth-first search

DOI: 10.3233/SPR-130370

Citation: Scientific Programming, vol. 21, no. 3-4, pp. 137-148, 2013
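
The hybrid idea in the abstract above can be sketched in a few lines for an undirected graph stored as an adjacency dictionary. The top-down step expands every edge out of the frontier; the bottom-up step instead lets each unvisited vertex scan its neighbours and stop at the first one found in the frontier, which is where edge examinations are saved. The switching rule used here (frontier larger than a fixed fraction of the vertices) and the `switch_fraction` parameter are assumptions made for this example, not the heuristic from the paper.

```python
def top_down_step(adj, frontier, parent):
    """Expand every edge leaving the current frontier."""
    next_frontier = set()
    for u in frontier:
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                next_frontier.add(v)
    return next_frontier


def bottom_up_step(adj, frontier, parent):
    """Let each unvisited vertex look for any neighbour in the frontier."""
    next_frontier = set()
    for v in adj:
        if v in parent:
            continue
        for u in adj[v]:
            if u in frontier:
                parent[v] = u
                next_frontier.add(v)
                break  # one parent is enough; remaining edges are skipped
    return next_frontier


def hybrid_bfs(adj, source, switch_fraction=0.05):
    """Return a {vertex: parent} BFS tree; the source is its own parent.

    adj must list every vertex as a key and be symmetric (undirected).
    """
    parent = {source: source}
    frontier = {source}
    while frontier:
        if len(frontier) > switch_fraction * len(adj):
            frontier = bottom_up_step(adj, frontier, parent)
        else:
            frontier = top_down_step(adj, frontier, parent)
    return parent


graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
tree = hybrid_bfs(graph, 0)
```

Both steps produce a valid BFS tree, so the choice between them at each level affects only the amount of work, not the depths found.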

McrEngine: A scalable checkpointing system using data-aware aggregation and compression

Article Type: Research Article

Abstract: High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely-used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.

Keywords: Data-aware, checkpoint restart, distributed applications, distributed systems, fault tolerance, aggregation, bottleneck, multiple-processor systems, application-level checkpointing, rollback recovery, system reliability, distributed programming, fault tolerant computing, software reliability, system recovery

DOI: 10.3233/SPR-130371

Citation: Scientific Programming, vol. 21, no. 3-4, pp. 149-163, 2013
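
The contrast the abstract draws between simple concatenation and data-aware aggregation can be illustrated with plain byte strings. In the sketch below, each process's checkpoint is a dictionary from variable name to raw bytes, standing in for the metadata that a library such as HDF5 or netCDF would provide; one function concatenates checkpoints process by process, the other places the same variable from all processes side by side before compressing. This is not mcrEngine's implementation: the function names and the use of `zlib` are assumptions for the example, and no claim is made about the resulting compression ratios.

```python
import zlib


def compress_concatenated(checkpoints):
    """Concatenate whole checkpoints, one process after another, then compress."""
    blob = b"".join(data for ckpt in checkpoints for data in ckpt.values())
    return zlib.compress(blob)


def compress_variable_grouped(checkpoints):
    """Place each variable from all processes contiguously, then compress."""
    names = sorted({name for ckpt in checkpoints for name in ckpt})
    blob = b"".join(
        ckpt[name] for name in names for ckpt in checkpoints if name in ckpt
    )
    return zlib.compress(blob)


# Two hypothetical processes holding the same two variables.
checkpoints = [
    {"temperature": b"\x00" * 64, "ids": bytes(range(64))},
    {"temperature": b"\x00" * 64, "ids": bytes(range(64))},
]
simple = compress_concatenated(checkpoints)
grouped = compress_variable_grouped(checkpoints)
```

Both outputs decompress to the same bytes in a different order; only the grouping that the compressor sees changes.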

Efficient backprojection-based synthetic aperture radar computation with many-core processors

Authors: Park, Jongsoo | Tang, Ping Tak Peter | Smelyanskiy, Mikhail | Kim, Daehyun | Benson, Thomas

Article Type: Research Article

Abstract: Tackling computationally challenging problems with high efficiency often requires the combination of algorithmic innovation, advanced architecture, and thorough exploitation of parallelism. We demonstrate this synergy through synthetic aperture radar (SAR) via backprojection, an image reconstruction method that can require hundreds of TFLOPS. Computation cost is significantly reduced by our new algorithm of approximate strength reduction; data movement cost is economized by software locality optimizations facilitated by advanced architecture support; parallelism is fully harnessed in various patterns and granularities. We deliver over 35 billion backprojections per second throughput per compute node on an Intel® Xeon® processor E5-2670-based cluster, equipped with Intel® Xeon Phi™ coprocessors. This corresponds to processing a 3K×3K image within a second using a single node. Our study can be extended to other settings: backprojection is applicable elsewhere including medical imaging, approximate strength reduction is a general code transformation technique, and many-core processors are emerging as a solution to energy-efficient computing.

Keywords: Transcendental functions, approximate computing, wide-vector many-core processors, Xeon Phi™ coprocessor, synthetic aperture radar, backprojection, streaming

DOI: 10.3233/SPR-130372

Citation: Scientific Programming, vol. 21, no. 3-4, pp. 165-179, 2013

A framework for low-communication 1-D FFT

Authors: Tang, Ping Tak Peter | Park, Jongsoo | Kim, Daehyun | Petrov, Vladimir

Article Type: Research Article

Abstract: In high-performance computing on distributed-memory systems, communication often represents a significant part of the overall execution time. The relative cost of communication will certainly continue to rise as compute-density growth follows the current technology and industry trends. Design of lower-communication alternatives to fundamental computational algorithms has become an important field of research. For distributed 1-D FFT, communication cost has hitherto remained high as all industry-standard implementations perform three all-to-all internode data exchanges (also called global transposes). These communication steps indeed dominate execution time. In this paper, we present a mathematical framework from which many single-all-to-all and easy-to-implement 1-D FFT algorithms can be derived. For large-scale problems, our implementation can be twice as fast as leading FFT libraries on state-of-the-art computer clusters. Moreover, our framework allows a tradeoff between accuracy and performance, further boosting performance if reduced accuracy is acceptable.

Keywords: FFT, low communication, hybrid convolution theorem, Poisson summation formula

DOI: 10.3233/SPR-130373

Citation: Scientific Programming, vol. 21, no. 3-4, pp. 181-195, 2013

Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems

Article Type: Research Article

Abstract: This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are a programming construct that enables applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine and application hierarchies and to enable hierarchical state preservation, restoration and recovery. We evaluate the scalability and efficiency of containment domains using generalized trace-driven simulation and analytical analysis and show that containment domains are superior to both checkpoint restart and redundant execution approaches.

Keywords: Exascale, resilience, flexible reliability, fault-tolerance

DOI: 10.3233/SPR-130374

Citation: Scientific Programming, vol. 21, no. 3-4, pp. 197-212, 2013
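
The preserve, detect, restore and retry cycle that the abstract attributes to containment domains can be mimicked with a small helper. The sketch below preserves a copy of the state on entry, re-executes the body after a detected error, and escalates to the enclosing domain once its retries are exhausted, which is what makes nesting useful. It is only an illustration of the concept: `run_in_domain`, `DetectedError`, and the dictionary-based state are hypothetical names, not the authors' API.

```python
import copy


class DetectedError(Exception):
    """Stands in for an error reported by some detection mechanism."""


def run_in_domain(state, body, max_retries=3):
    """Run body(state); on a detected error restore the state and retry."""
    preserved = copy.deepcopy(state)  # state preservation on domain entry
    for _ in range(max_retries + 1):
        try:
            body(state)
            return
        except DetectedError:
            # Restoration: roll the state back to what was preserved.
            state.clear()
            state.update(copy.deepcopy(preserved))
    # Recovery failed locally, so hand the error to the enclosing domain.
    raise DetectedError("escalating to the parent domain")


# A body that fails once with a simulated error and then succeeds.
remaining_failures = {"count": 1}


def flaky_increment(state):
    state["x"] += 1
    if remaining_failures["count"] > 0:
        remaining_failures["count"] -= 1
        raise DetectedError("simulated fault")


state = {"x": 0}
run_in_domain(state, flaky_increment)
```

A body may itself call `run_in_domain` for its sub-steps, so a fault that an inner domain cannot recover from is retried at the next level up rather than restarting everything.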

Author Index Volume 21 (2013)

Article Type: Other

Citation: Scientific Programming, vol. 21, no. 3-4, pp. 213-214, 2013

