David K Lowenthal
- Professor, Computer Science
- Member of the Graduate Faculty
- (520) 626-8282
- Gould-Simpson, Rm. 705
- Tucson, AZ 85721
- dkl1@arizona.edu
Biography
David Lowenthal is Professor of Computer Science at the University of Arizona, where he has been a faculty member since January 2009. Prior to that, he was on the faculty in Computer Science at the University of Georgia. He holds a B.S. degree in Computer Science and Math from the University of California, Davis, and M.S. and Ph.D. degrees in Computer Science from the University of Arizona. His research spans all aspects of parallel and distributed computing, with a particular focus on solving fundamental problems in the field via system-software techniques. Currently, most of his research centers on power and performance issues in parallel and distributed computing. He was Program Co-Chair of the inaugural IEEE Green Computing Conference in 2010, General Chair of the ACM International Conference on Supercomputing in 2011, and has served on numerous program committees, including SC, PPOPP, ICS, and IPDPS.
Degrees
- Ph.D. Computer Science
- The University of Arizona
Awards
- Best Paper
- ACM Symposium on High-Performance Distributed Computing, Summer 2021
Interests
Teaching
Parallel and distributed computing, operating systems
Research
Parallel and distributed computing
Courses
2024-25 Courses
- Intro Paral+Dist Program, CSC 422 (Spring 2025)
- Special Topics in Comp Science, CSC 496 (Fall 2024)
2023-24 Courses
- Intro Paral+Dist Program, CSC 422 (Spring 2024)
- Parallel+Distribut Compt, CSC 522 (Fall 2023)
2021-22 Courses
- Independent Study, CSC 399 (Summer I 2022)
- Independent Study, CSC 499 (Summer I 2022)
- Intro Paral+Dist Program, CSC 422 (Spring 2022)
2020-21 Courses
- Intro Paral+Dist Program, CSC 422 (Fall 2020)
2019-20 Courses
- Dissertation, CSC 920 (Spring 2020)
- Dissertation, CSC 920 (Fall 2019)
- Independent Study, CSC 599 (Fall 2019)
- Parallel+Distribut Compt, CSC 522 (Fall 2019)
2018-19 Courses
- Dissertation, CSC 920 (Spring 2019)
- Dissertation, CSC 920 (Fall 2018)
- Independent Study, CSC 599 (Fall 2018)
2017-18 Courses
- Adv Tpcs:Doctoral Colloq, CSC 695C (Spring 2018)
- Dissertation, CSC 920 (Spring 2018)
- Adv Tpcs:Doctoral Colloq, CSC 695C (Fall 2017)
- Dissertation, CSC 920 (Fall 2017)
- Independent Study, CSC 599 (Fall 2017)
2016-17 Courses
- Adv Tpcs:Doctoral Colloq, CSC 695C (Spring 2017)
- Dissertation, CSC 920 (Spring 2017)
- Adv Tpcs:Doctoral Colloq, CSC 695C (Fall 2016)
- Independent Study, CSC 599 (Fall 2016)
- Parallel+Distribut Compt, CSC 522 (Fall 2016)
- Research, CSC 900 (Fall 2016)
2015-16 Courses
- Independent Study, CSC 499 (Spring 2016)
- Independent Study, CSC 599 (Spring 2016)
- Preceptorship, CSC 391 (Spring 2016)
- Research, CSC 900 (Spring 2016)
- Thesis, CSC 910 (Spring 2016)
Scholarly Contributions
Journals/Publications
- Marathe, A., Harris, R., Lowenthal, D. K., de Supinski, B. R., Rountree, B., & Schulz, M. (2016). Exploiting Redundancy for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2. IEEE Transactions on Parallel and Distributed Systems.
- Marathe, A., Harris, R., Lowenthal, D. K., de Supinski, B. R., Rountree, B., Schulz, M., & Yuan, X. (2013). A comparative study of high-performance computing on the cloud. HPDC 2013 - Proceedings of the 22nd ACM International Symposium on High-Performance Parallel and Distributed Computing, 239-250. Abstract: The popularity of Amazon's EC2 cloud platform has increased in recent years. However, many high-performance computing (HPC) users consider dedicated high-performance clusters, typically found in large compute centers such as those in national laboratories, to be far superior to EC2 because of the significant communication overhead of the latter. Our view is that this is quite narrow and that the proper metrics for comparing high-performance clusters to EC2 are turnaround time and cost. In this paper, we compare the top-of-the-line EC2 cluster to HPC clusters at Lawrence Livermore National Laboratory (LLNL) based on turnaround time and total cost of execution. When measuring turnaround time, we include expected queue wait time on HPC clusters. Our results show that although, as expected, standard HPC clusters are superior in raw performance, EC2 clusters may produce better turnaround times. To estimate cost, we developed a pricing model - relative to EC2's node-hour prices - to set node-hour prices for (currently free) LLNL clusters. We observe that the cost-effectiveness of running an application on a cluster depends on raw performance and application scalability. © 2013 ACM.
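As an illustration of the turnaround-time-plus-cost comparison this paper argues for, the sketch below compares two hypothetical platforms; the queue waits, run times, and node-hour prices are invented placeholders, not numbers from the study.

```
# Illustrative sketch (not the paper's model): compare a dedicated cluster and a
# cloud cluster by turnaround time (queue wait + run time) and total cost.
# All figures below are hypothetical placeholders.

def turnaround_hours(queue_wait_h: float, run_h: float) -> float:
    """Turnaround time = time waiting in the queue plus execution time."""
    return queue_wait_h + run_h

def cost_dollars(nodes: int, run_h: float, node_hour_price: float) -> float:
    """Total cost, charged only for the hours the job actually runs."""
    return nodes * run_h * node_hour_price

if __name__ == "__main__":
    # Hypothetical job on 64 nodes of either platform.
    hpc = {"queue_wait_h": 10.0, "run_h": 2.0, "price": 1.20}   # assumed node-hour price
    ec2 = {"queue_wait_h": 0.25, "run_h": 3.5, "price": 1.60}   # slower but starts sooner

    for name, p in (("HPC cluster", hpc), ("EC2 cluster", ec2)):
        t = turnaround_hours(p["queue_wait_h"], p["run_h"])
        c = cost_dollars(64, p["run_h"], p["price"])
        print(f"{name}: turnaround {t:.2f} h, cost ${c:.2f}")
```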
- Patki, T., Lowenthal, D. K., Rountree, B., Schulz, M., & de Supinski, B. R. (2013). Exploring hardware overprovisioning in power-constrained, high performance computing. Proceedings of the International Conference on Supercomputing, 173-182. Abstract: Most recent research in power-aware supercomputing has focused on making individual nodes more efficient and measuring the results in terms of flops per watt. While this work is vital in order to reach exascale computing at 20 megawatts, there has been a dearth of work that explores efficiency at the whole system level. Traditional approaches in supercomputer design use worst-case power provisioning: the total power allocated to the system is determined by the maximum power draw possible per node. In a world where power is plentiful and nodes are scarce, this solution is optimal. However, as power becomes the limiting factor in supercomputer design, worst-case provisioning becomes a drag on performance. In this paper we demonstrate how a policy of overprovisioning hardware with respect to power combined with intelligent, hardware-enforced power bounds consistently leads to greater performance across a range of standard benchmarks. In particular, leveraging overprovisioning requires that applications use effective configurations; the best configuration depends on application scalability and memory contention. We show that using overprovisioning leads to an average speedup of more than 50% over worst-case provisioning. © 2013 ACM.
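The overprovisioning trade-off can be made concrete with a toy model (not the paper's methodology): enumerate configurations that fit under a fixed system power budget and score them with an assumed scalability and power-cap penalty. Every constant below is a made-up assumption.

```
# Toy sketch of the overprovisioning idea: given a fixed system power budget,
# compare "fewer nodes at full power" against "more nodes, each capped below
# full power", using an invented speedup model.

FULL_NODE_POWER = 400.0   # watts at full power (hypothetical)
MIN_NODE_POWER = 200.0    # lowest allowed per-node cap (hypothetical)
SYSTEM_BUDGET = 40_000.0  # total watts available (hypothetical)

def perf(nodes: int, node_cap: float) -> float:
    """Toy performance model: sublinear scaling in nodes, and performance that
    degrades as the per-node power cap drops below full power."""
    scaling = nodes ** 0.8                          # imperfect application scalability
    cap_factor = (node_cap / FULL_NODE_POWER) ** 0.5
    return scaling * cap_factor

best = None
for nodes in range(1, 401):
    cap = min(FULL_NODE_POWER, SYSTEM_BUDGET / nodes)
    if cap < MIN_NODE_POWER:
        break  # assume capping any lower than this is unusable
    p = perf(nodes, cap)
    if best is None or p > best[0]:
        best = (p, nodes, cap)

print(f"best config: {best[1]} nodes at a {best[2]:.0f} W cap "
      f"(worst-case provisioning would use {int(SYSTEM_BUDGET // FULL_NODE_POWER)} nodes)")
```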
- Rountree, B., Gamblin, T., de Supinski, B. R., Schulz, M., Lowenthal, D. K., Cobb, G., & Tufo, H. (2013). Parallelizing heavyweight debugging tools with mpiecho. Parallel Computing, 39(3), 156-166. Abstract: Idioms created for debugging execution on single processors and multicore systems have been successfully scaled to thousands of processors, but there is little hope that this class of techniques can continue to be scaled out to tens of millions of cores. In order to allow development of more scalable debugging idioms we introduce mpiecho, a novel runtime platform that enables cloning of MPI ranks. Given identical execution on each clone, we then show how heavyweight debugging approaches can be parallelized, reducing their overhead to a fraction of the serialized case. We also show how this platform can be useful in isolating the source of hardware-based nondeterministic behavior and provide a case study based on a recent processor bug at LLNL. While total overhead will depend on the individual tool, we show that the platform itself contributes little: 512x tool parallelization incurs at worst 2x overhead across the NAS Parallel benchmarks, and hardware fault isolation contributes at worst an additional 44% overhead. Finally, we show how mpiecho can lead to near-linear reduction in overhead when combined with maid, a heavyweight memory tracking tool provided with Intel's pin platform. We demonstrate overhead reduction from 1466% to 53% and from 740% to 14% for cg (class D, 64 processes) and lu (class D, 64 processes), respectively, using only an additional 64 cores. © 2012 Elsevier B.V. All rights reserved.
- Gu, Z., Small, M., Yuan, X., Marathe, A., & Lowenthal, D. K. (2013). Protocol Customization for Improving MPI Performance on RDMA-Enabled Clusters. International Journal of Parallel Programming, 41(5), 682-703. Abstract: Optimizing Message Passing Interface (MPI) point-to-point communication for large messages is of paramount importance since most communications in MPI applications are performed by such operations. Remote Direct Memory Access (RDMA) allows one-sided data transfer and provides great flexibility in the design of efficient communication protocols for large messages. However, achieving high point-to-point communication performance on RDMA-enabled clusters is challenging due to both the complexity in communication protocols and the impact of the protocol invocation scenario on the performance of a given protocol. In this work, we analyze existing protocols and show that they are not ideal in many situations, and propose to use protocol customization, that is, different protocols for different situations, to improve MPI performance. More specifically, by leveraging the RDMA capability, we develop a set of protocols that can provide high performance for all protocol invocation scenarios. Armed with this set of protocols that can collectively achieve high performance in all situations, we demonstrate the potential of protocol customization by developing a trace-driven toolkit that allows the appropriate protocol to be selected for each communication in an MPI application to maximize performance. We evaluate the performance of the proposed techniques using micro-benchmarks and application benchmarks. The results indicate that protocol customization can out-perform traditional communication schemes by a large degree in many situations. © 2013 Springer Science+Business Media New York.
- Chen, J., Ramaswamy, L., Lowenthal, D. K., & Kalyanaraman, S. (2012). Comet: Decentralized complex event detection in mobile delay tolerant networks. Proceedings - 2012 IEEE 13th International Conference on Mobile Data Management, MDM 2012, 131-136. Abstract: Increased commodity use of mobile devices has the potential to enable mission-critical monitoring applications. However, these mobile-enabled monitoring applications have to often work in environments where a delay-tolerant network (DTN) is the only feasible communication paradigm. Detection of complex (composite) events is fundamental to monitoring applications. However, the existing plan-based CED techniques are mostly centralized, and hence are inherently unscalable for DTNs. In this paper, we create Comet - a decentralized plan-based, efficient and scalable CED for DTNs. Comet shares the task of detecting complex events (CEs) among multiple nodes, with each node detecting a part of the CE by aggregating two or more primitive events or sub-CEs. Comet uses a unique h-function to construct cost and delay efficient CED trees. As finding an optimal CED plan requires exponential-time, Comet finds near-optimal detection plans for individual CEs through a novel multi-level push-pull conversion algorithm. Performance results show that Comet reduces cost by up to 89% compared to pushing all primitive events and over 60% compared to a two-level exhaustive search algorithm. © 2012 IEEE.
- Rountree, B., Ahn, D. H., de Supinski, B. R., Lowenthal, D. K., & Schulz, M. (2012). Beyond DVFS: A first look at performance under a hardware-enforced power bound. Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012, 947-953. Abstract: Dynamic Voltage Frequency Scaling (DVFS) has been the tool of choice for balancing power and performance in high-performance computing (HPC). With the introduction of Intel's Sandy Bridge family of processors, researchers now have a far more attractive option: user-specified, dynamic, hardware-enforced processor power bounds. In this paper we provide a first look at this technology in the HPC environment and detail both the opportunities and potential pitfalls of using this technique to control processor power. As part of this evaluation we measure power and performance for single-processor instances of several of the NAS Parallel Benchmarks. Additionally, we focus on the behavior of a single benchmark, MG, under several different power bounds. We quantify the well-known manufacturing variation in processor power efficiency and show that, in the absence of a power bound, this variation has no correlation to performance. We then show that execution under a power bound translates this variation in efficiency into variation in performance. © 2012 IEEE.
- Lim, M. Y., Freeh, V. W., & Lowenthal, D. K. (2011). Adaptive, transparent CPU scaling algorithms leveraging inter-node MPI communication regions. Parallel Computing, 37(10-11), 667-683. Abstract: Although users of high-performance computing are most interested in raw performance, both energy and power consumption have become critical concerns. Because the CPU is often the major power consumer, some microprocessors allow frequency and voltage scaling, which enables a system to efficiently reduce CPU performance and power. When the CPU is not on the critical path, such dynamic frequency and voltage scaling can produce significant energy savings with little performance penalty. This paper presents an MPI runtime system that dynamically reduces CPU frequency and voltage during communication phases in MPI programs. It dynamically identifies such phases and, without a priori knowledge, selects the CPU frequency in order to minimize energy-delay product. All analysis and subsequent frequency and voltage scaling is within MPI and so is entirely transparent to the application. This means that the large number of existing MPI programs, as well as new ones being developed, can use our system without modification. Results show that the median reduction in energy-delay product for twelve benchmarks is 8%, the median energy reduction is 11%, and the median increase in execution time is only 2%. © 2011 Elsevier B.V. All rights reserved.
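A minimal sketch of the underlying idea, under assumed P-states and a simple power model (this is not the actual runtime system): for a communication-dominated phase, pick the frequency that minimizes energy-delay product, given that only the CPU-bound part of the phase stretches as the clock slows.

```
# Sketch of frequency selection for a communication phase (assumed model):
# CPU-bound time stretches as f drops, network time does not, and power is
# approximated by a simple f * V^2 proxy. P-states below are hypothetical.

FREQS_GHZ = [1.2, 1.6, 2.0, 2.4]      # hypothetical available frequencies
VOLTS     = [0.9, 1.0, 1.1, 1.2]      # hypothetical voltage at each frequency

def phase_edp(t_cpu_s: float, t_net_s: float, f: float, v: float,
              f_max: float = 2.4) -> float:
    """Energy-delay product of one phase under the toy model."""
    time = t_cpu_s * (f_max / f) + t_net_s
    power = f * v * v                 # arbitrary units; relative comparison only
    return power * time * time

def pick_frequency(t_cpu_s: float, t_net_s: float) -> float:
    """Frequency (GHz) with the lowest modeled EDP for this phase."""
    return min(zip(FREQS_GHZ, VOLTS),
               key=lambda fv: phase_edp(t_cpu_s, t_net_s, *fv))[0]

# Example: a phase that spends 5 ms on CPU work and 45 ms blocked in MPI calls.
print("chosen frequency:", pick_frequency(0.005, 0.045), "GHz")
```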
- Lowenthal, D. K. (2011). Message from the general chair. Proceedings of the International Conference on Supercomputing, iii.
- Marathe, A., Lowenthal, D. K., Gu, Z., Small, M., & Yuan, X. (2011). Profile guided MPI protocol selection for point-to-point communication calls. IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, 733-739. Abstract: Improving communication performance is critical to achieving high performance in message-passing programs. Designing new, efficient protocols to realize point-to-point and collective communication operations has therefore been an active area of research. However, the best protocol for a given communication routine is both application and architecture specific. This paper contributes a new method of selection of the optimal protocol for a given point-to-point communication pair. Our technique analyzes the MPI communication call profile of an application and uses a computation and communication model we have developed to choose the proper protocol for each communication phase. We have applied our system to MPI applications such as CG, Sweep3D and Sparse Matrix multiplication, as well as synthetic applications. Our scheme yields an improvement in total execution time of up to 20% compared to MVAPICH2 and up to 3.2% compared to the best, highly optimized communication protocol for the real applications. Furthermore, experiments on the synthetic applications show that the savings can be much more pronounced. © 2011 IEEE.
- Ge, R., Gioiosa, R., Bellosa, F., Boku, T., Chen, Y., Cher, C., Cesati, M., de Supinski, B. R., Feng, X., Feng, W., Hsu, C., Isci, C., Knauerhase, R., Lefevre, L., Lowenthal, D., Nakashima, H., Nathuji, R., Schwan, K., & Torres, J. (2011). High-performance, power-aware computing - HPPAC. IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, 795.
- Rountree, B., Lowenthal, D. K., Schulz, M., & de Supinski, B. R. (2011). Practical performance prediction under dynamic voltage frequency scaling. 2011 International Green Computing Conference and Workshops, IGCC 2011. Abstract: Predicting performance under Dynamic Voltage Frequency Scaling (DVFS) remains an open problem. Current best practice explores available performance counters to serve as input to linear regression models that predict performance. However, the inaccuracies of these models require that large-scale DVFS runtime algorithms predict performance conservatively in order to avoid significant consequences of mispredictions. Recent theoretical work based on interval analysis advocates a more accurate and reliable solution based on a single new performance counter, Leading Loads. In this paper, we evaluate a processor-independent analytic framework for existing performance counters based on this interval analysis model. We begin with an analysis of the counters used in many published models. We then briefly describe the Leading Loads architectural model and describe how we can use Leading Loads Cycles to predict performance under DVFS. We validate this approach for the NAS Parallel Benchmarks and SPEC CPU 2006 benchmarks, demonstrating an order of magnitude improvement in both error and standard deviation compared to the best existing approaches. © 2011 IEEE.
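The intuition behind this kind of prediction can be shown with a simplified two-component model (an assumption-laden sketch, not the paper's exact formulation): wall-clock time attributed to off-chip leading-load cycles stays roughly constant as frequency changes, while the remaining core-bound time stretches with the clock period.

```
# Simplified DVFS performance prediction (placeholder fractions, not measured
# counter values): split measured time at f0 into a frequency-independent
# memory component and a core component that scales as f0 / f.

def predict_time(t_measured_s: float, leading_load_fraction: float,
                 f0_ghz: float, f_ghz: float) -> float:
    t_memory = t_measured_s * leading_load_fraction   # unchanged by frequency
    t_core = t_measured_s - t_memory                   # stretches with clock period
    return t_memory + t_core * (f0_ghz / f_ghz)

# Example: 10 s measured at 2.4 GHz with 40% of time attributed to leading
# loads, predicted at 1.6 GHz.
print(f"{predict_time(10.0, 0.40, 2.4, 1.6):.2f} s")
```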
- Barnes, B., Garren, J., Lowenthal, D. K., Reeves, J., de Supinski, B. R., Schulz, M., & Rountree, B. (2010). Using focused regression for accurate time-constrained scaling of scientific applications. Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010. Abstract: Many large-scale clusters now have hundreds of thousands of processors, and processor counts will be over one million within a few years. Computational scientists must scale their applications to exploit these new clusters. Time-constrained scaling, which is often used, tries to hold total execution time constant while increasing the problem size along with the processor count. However, complex interactions between parameters, the processor count, and execution time complicate determining the input parameters that achieve this goal. In this paper we develop a novel gray-box, focused regression-based approach that assists the computational scientist with maintaining constant run time on increasing processor counts. Combining application-level information from a small set of training runs, our approach allows prediction of the input parameters that result in similar per-processor execution time at larger scales. Our experimental validation across seven applications showed that median prediction errors are less than 13%. © 2010 IEEE.
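A toy version of the regression idea, with fabricated training data (the paper's focused-regression method is more sophisticated): fit a log-log model of run time against problem size and processor count from a few small runs, then invert it to suggest a problem size that holds run time roughly constant at a larger scale.

```
# Toy time-constrained scaling sketch. Training data is fabricated; the model
# log(time) ~ a*log(n) + b*log(p) + c is an illustrative assumption.
import numpy as np

runs = [  # (problem size n, processors p, measured seconds) -- invented numbers
    (1_000_000, 16, 120.0), (2_000_000, 16, 235.0),
    (1_000_000, 32, 64.0), (2_000_000, 32, 126.0),
    (4_000_000, 64, 130.0), (2_000_000, 64, 66.0),
]
X = np.array([[np.log(n), np.log(p), 1.0] for n, p, _ in runs])
y = np.log([t for *_, t in runs])
(a, b, c), *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_n(target_s: float, p: int) -> float:
    """Invert the fitted model to get a problem size for a target time."""
    return np.exp((np.log(target_s) - b * np.log(p) - c) / a)

print(f"suggested n for 120 s on 1024 processors: {predict_n(120.0, 1024):,.0f}")
```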
- Chen, J., Ramaswamy, L., Lowenthal, D. K., & Kalyanaraman, S. (2010). CAEVA: A customizable and adaptive event aggregation framework for collaborative broker overlays. Proceedings of the 6th International Conference on Collaborative Computing: Networking, Applications and Worksharing, CollaborateCom 2010. Abstract: The publish-subscribe (pub-sub) paradigm is maturing and integrating into community-oriented collaborative applications. Because of this, pub-sub systems are faced with an event stream that may potentially contain large numbers of redundant and partial messages. Most pub-sub systems view partial and redundant messages as unique, which wastes resources not only at routers, but also at possibly resource constrained subscribers. In this paper, we present Caeva, a customizable and adaptive event aggregation framework. The design of Caeva exhibits three novel features. First, the tasks of merging messages and eliminating redundancies are shared among multiple, physically distributed brokers called aggregators. Second, we design a decentralized aggregator placement scheme that continuously adapts to decrease messaging overheads in the face of changing event publishing patterns. Third, we allow subscribers to choose a notification schedule that meets their specific needs. Results of extensive experiments show that Caeva is quite effective in providing flexibility and efficiency. © 2010 ICST.
- Rountree, B., Lowenthal, D. K., de Supinski, B. R., Schulz, M., Freeh, V. W., & Bletsch, T. (2009). Adagio: Making DVS practical for complex HPC applications. Proceedings of the International Conference on Supercomputing, 460-469. Abstract: Power and energy are first-order design constraints in high performance computing. Current research using dynamic voltage scaling (DVS) relies on trading increased execution time for energy savings, which is unacceptable for most high performance computing applications. We present Adagio, a novel runtime system that makes DVS practical for complex, real-world scientific applications by incurring only negligible delay while achieving significant energy savings. Adagio improves and extends previous state-of-the-art algorithms by combining the lessons learned from static energy-reducing CPU scheduling with a novel runtime mechanism for slack prediction. We present results using Adagio for two real-world programs, UMT2K and ParaDiS, along with the NAS Parallel Benchmark suite. While requiring no modification to the application source code, Adagio provides total system energy savings of 8% and 20% for UMT2K and ParaDiS, respectively, with less than 1% increase in execution time. Copyright 2009 ACM.
- Barnes, B. J., Reeves, J., Rountree, B., de Supinski, B. R., Lowenthal, D. K., & Schulz, M. (2008). A regression-based approach to scalability prediction. Proceedings of the International Conference on Supercomputing, 368-377. Abstract: Many applied scientific domains are increasingly relying on large-scale parallel computation. Consequently, many large clusters now have thousands of processors. However, the ideal number of processors to use for these scientific applications varies with both the input variables and the machine under consideration, and predicting this processor count is rarely straightforward. Accurate prediction mechanisms would provide many benefits, including improving cluster efficiency and identifying system configuration or hardware issues that impede performance. We explore novel regression-based approaches to predict parallel program scalability. We use several program executions on a small subset of the processors to predict execution time on larger numbers of processors. We compare three different regression-based techniques: one based on execution time only; another that uses per-processor information only; and a third one based on the global critical path. These techniques provide accurate scaling predictions, with median prediction errors between 6.2% and 17.3% for seven applications. Copyright 2008 ACM.
- Freeh, V. W., Kappiah, N., Lowenthal, D. K., & Bletsch, T. K. (2008). Just-in-time dynamic voltage scaling: Exploiting inter-node slack to save energy in MPI programs. Journal of Parallel and Distributed Computing, 68(9), 1175-1185. Abstract: Although users of high-performance computing are most interested in raw performance, both energy and power consumption have become critical concerns. As a result, improving energy efficiency of nodes on HPC machines has become important, and the prevalence of power-scalable clusters, where the frequency and voltage can be dynamically modified, has increased. On power-scalable clusters, one opportunity for saving energy with little or no loss of performance exists when the computational load is not perfectly balanced. This situation occurs frequently, as keeping the load balanced between nodes is one of the long-standing fundamental problems in parallel and distributed computing. Indeed, despite the large body of research aimed at balancing load both statically and dynamically, this problem is quite difficult to solve. This paper presents a system called Jitter that reduces the frequency and voltage on nodes that are assigned less computation and, therefore, have idle or slack time. This saves energy on these nodes, and the goal of Jitter is to attempt to ensure that they arrive "just in time" so that they avoid increasing overall execution time. Specifically, we dynamically determine which nodes have enough slack time such that they can execute at a reduced frequency with little performance cost, which will greatly reduce the consumed energy on that node. In particular, Jitter saves 12.8% energy with a 0.4% time increase, which is essentially the same as a hand-tuned solution, on the Aztec benchmark. © 2008 Elsevier Inc. All rights reserved.
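The slack-to-frequency calculation at the heart of the "just in time" idea can be sketched as follows; the available frequencies and timings are hypothetical, and the real system measures slack dynamically across iterations rather than taking it as a given.

```
# Sketch of just-in-time frequency selection (simplified, not the Jitter
# implementation): a node that finishes its iteration early has slack, so it
# can run its compute portion slower and still reach the synchronization point
# on time. P-states below are hypothetical.

AVAILABLE_GHZ = [2.4, 2.0, 1.8, 1.6, 1.4, 1.2]  # assumed available frequencies

def just_in_time_freq(t_compute_s: float, slack_s: float,
                      f_max: float = 2.4) -> float:
    """Lowest available frequency whose stretched compute time still fits
    within compute time plus observed slack."""
    budget = t_compute_s + slack_s
    for f in sorted(AVAILABLE_GHZ):              # try slowest first
        if t_compute_s * (f_max / f) <= budget:
            return f
    return f_max

# Node that computes for 0.8 s per iteration and then waits 0.3 s at a barrier.
print(just_in_time_freq(0.8, 0.3), "GHz")
```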
- Freeh, V. W., Lowenthal, D. K., Pan, F., Kappiah, N., Springer, R., Rountree, B. L., & Femal, M. E. (2007). Analyzing the energy-time trade-off in high-performance computing applications. IEEE Transactions on Parallel and Distributed Systems, 18(6), 835-848. Abstract: Although users of high-performance computing are most interested in raw performance, both energy and power consumption have become critical concerns. One approach to lowering energy and power is to use high-performance cluster nodes that have several power-performance states so that the energy-time trade-off can be dynamically adjusted. This paper analyzes the energy-time trade-off of a wide range of applications, serial and parallel, on a power-scalable cluster. We use a cluster of frequency- and voltage-scalable AMD-64 nodes, each equipped with a power meter. We study the effects of memory and communication bottlenecks via direct measurement of time and energy. We also investigate metrics that can, at runtime, predict when each type of bottleneck occurs. Our results show that, for programs that have a memory or communication bottleneck, a power-scalable cluster can save significant energy with only a small time penalty. Furthermore, we find that, for some programs, it is possible to both consume less energy and execute in less time by increasing the number of nodes while reducing the frequency-voltage setting of each node. © 2007 IEEE.
- Rountree, B., Lowenthal, D. K., Funk, S., Freeh, V. W., de Supinski, B. R., & Schulz, M. (2007). Bounding energy consumption in large-scale MPI programs. Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC'07. Abstract: Power is now a first-order design constraint in large-scale parallel computing. Used carefully, dynamic voltage scaling can execute parts of a program at a slower CPU speed to achieve energy savings with a relatively small (possibly zero) time delay. However, the problem of when to change frequencies in order to optimize energy savings is NP-complete, which has led to many heuristic energy-saving algorithms. To determine how closely these algorithms approach optimal savings, we developed a system that determines a bound on the energy savings for an application. Our system uses a linear programming solver that takes as inputs the application communication trace and the cluster power characteristics and then outputs a schedule that realizes this bound. We apply our system to three scientific programs, two of which exhibit load imbalance - particle simulation and UMT2K. Results from our bounding technique show particle simulation is more amenable to energy savings than UMT2K. © 2007 ACM.
- Faraj, A., Yuan, X., & Lowenthal, D. (2006). STAR-MPI: Self tuned adaptive routines for MPI collective operations. Proceedings of the International Conference on Supercomputing, 199-208. Abstract: Message Passing Interface (MPI) collective communication routines are widely used in parallel applications. In order for a collective communication routine to achieve high performance for different applications on different platforms, it must be adaptable to both the system architecture and the application workload. Current MPI implementations do not support such software adaptability and are not able to achieve high performance on many platforms. In this paper, we present STAR-MPI (Self Tuned Adaptive Routines for MPI collective operations), a set of MPI collective communication routines that are capable of adapting to system architecture and application workload. For each operation, STAR-MPI maintains a set of communication algorithms that can potentially be efficient at different situations. As an application executes, a STAR-MPI routine applies the Automatic Empirical Optimization of Software (AEOS) technique at run time to dynamically select the best performing algorithm for the application on the platform. We describe the techniques used in STAR-MPI, analyze STAR-MPI overheads, and evaluate the performance of STAR-MPI with applications and benchmarks. The results of our study indicate that STAR-MPI is robust and efficient. It is able to find efficient algorithms with reasonable overheads, and it out-performs traditional MPI implementations to a large degree in many cases. Copyright © 2006 ACM.
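An AEOS-style selection loop can be sketched outside of MPI as follows; the candidate callables are placeholders standing in for different collective algorithms, and the real STAR-MPI performs this tuning inside the MPI library.

```
# Sketch of empirical, run-time algorithm selection (simplified): time each
# candidate implementation for a few invocations, then commit to the fastest.
import time
from typing import Callable, Dict, List, Optional

class SelfTunedOp:
    def __init__(self, candidates: Dict[str, Callable[[], None]], trials: int = 3):
        self.candidates = candidates
        self.trials = trials
        self.samples: Dict[str, List[float]] = {name: [] for name in candidates}
        self.chosen: Optional[str] = None

    def __call__(self) -> None:
        if self.chosen is not None:
            self.candidates[self.chosen]()          # production phase
            return
        # Tuning phase: pick the least-measured candidate and time it.
        name = min(self.samples, key=lambda n: len(self.samples[n]))
        start = time.perf_counter()
        self.candidates[name]()
        self.samples[name].append(time.perf_counter() - start)
        if all(len(s) >= self.trials for s in self.samples.values()):
            self.chosen = min(self.samples,
                              key=lambda n: sum(self.samples[n]) / len(self.samples[n]))

# Hypothetical stand-ins for two collective-communication algorithms.
op = SelfTunedOp({"ring": lambda: time.sleep(0.002),
                  "pairwise": lambda: time.sleep(0.001)})
for _ in range(10):
    op()
print("selected:", op.chosen)
```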
- Lim, M. Y., Freeh, V. W., & Lowenthal, D. K. (2006). Adaptive, transparent frequency and voltage scaling of communication phases in MPI programs. Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC'06. Abstract: Although users of high-performance computing are most interested in raw performance, both energy and power consumption have become critical concerns. Some microprocessors allow frequency and voltage scaling, which enables a system to reduce CPU performance and power when the CPU is not on the critical path. When properly directed, such dynamic frequency and voltage scaling can produce significant energy savings with little performance penalty. This paper presents an MPI runtime system that dynamically reduces CPU performance during communication phases in MPI programs. It dynamically identifies such phases and, without profiling or training, selects the CPU frequency in order to minimize energy-delay product. All analysis and subsequent frequency and voltage scaling is within MPI and so is entirely transparent to the application. This means that the large number of existing MPI programs, as well as new ones being developed, can use our system without modification. Results show that the average reduction in energy-delay product over the NAS benchmark suite is 10%; the average energy reduction is 12%, while the average execution time increase is only 2.1%. © 2006 IEEE.
- Springer, R., Lowenthal, D. K., Rountree, B., & Freeh, V. W. (2006). Minimizing execution time in MPI programs on an energy-constrained, power-scalable cluster. Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, 2006, 230-238. Abstract: Recently, the high-performance computing community has realized that power is a performance-limiting factor. One reason for this is that supercomputing centers have limited power capacity and machines are starting to hit that limit. In addition, the cost of energy has become increasingly significant, and the heat produced by higher-energy components tends to reduce their reliability. One way to reduce power (and therefore energy) requirements is to use high-performance cluster nodes that are frequency- and voltage-scalable (e.g., AMD-64 processors). The problem we address in this paper is: given a target program, a power-scalable cluster, and an upper limit for energy consumption, choose a schedule (number of nodes and CPU frequency) that simultaneously (1) satisfies an external upper limit for energy consumption and (2) minimizes execution time. There are too many schedules for an exhaustive search. Therefore, we find a schedule through a novel combination of performance modeling, performance prediction, and program execution. Using our technique, we are able to find a near-optimal schedule for all of our benchmarks in just a handful of partial program executions. Copyright © 2006 ACM.
- Weatherly, D. B., Lowenthal, D. K., Nakazawa, M., & Lowenthal, F. (2006). Dyn-MPI: Supporting MPI on medium-scale, non-dedicated clusters. Journal of Parallel and Distributed Computing, 66(6), 822-838. Abstract: Distributing data is a fundamental problem in implementing efficient distributed-memory parallel programs. The problem becomes more difficult in environments where the participating nodes are not dedicated to a parallel application. We are investigating the data distribution problem in non-dedicated environments in the context of explicit message-passing programs. To address this problem, we have designed and implemented an extension to MPI called dynamic MPI (Dyn-MPI). The key component of Dyn-MPI is its run-time system, which efficiently and automatically redistributes data on the fly when there are changes in the application or the underlying environment. Dyn-MPI supports efficient memory allocation, precise measurement of system load and computation time, and node removal. Performance results show that programs that use Dyn-MPI execute efficiently in non-dedicated environments, including up to almost a threefold improvement compared to programs that do not redistribute data and a 25% improvement over standard adaptive load balancing techniques. © 2006 Elsevier Inc. All rights reserved.
- Zhou, W., & Lowenthal, D. K. (2006). A parallel, out-of-core algorithm for RNA secondary structure prediction. Proceedings of the International Conference on Parallel Processing, 74-81. Abstract: RNA pseudoknot prediction is an algorithm for RNA sequence search and alignment. An important building block towards pseudoknot prediction is RNA secondary structure prediction. The difficulty of extending the secondary structure prediction algorithm to a parallel program is (1) it has complicated data dependences, and (2) it has a large data set that typically cannot fit completely in main memory. In this paper, we propose a new out-of-core, distributed-memory algorithm for RNA secondary structure prediction. Its novelty lies in its redundant file scheme, I/O-reducing in-core buffer mechanism, and dynamic load balancing algorithm. Experimental results obtained on 16 Sun UltraSPARC IIIi nodes provide evidence that our approach achieves good speedup. Furthermore, we found that counterintuitively, the size of the in-memory buffer is critical to efficiency of the parallel program. © 2006 IEEE.
- Freeh, V. W., Pan, F., Kappiah, N., & Lowenthal, D. K. (2005). Using multiple energy gears in MPI programs on a power-scalable cluster. Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, 164-173. Abstract: Recently, system architects have built low-power, high-performance clusters, such as Green Destiny. The idea behind these clusters is to improve the energy efficiency of nodes. However, these clusters save power at the expense of performance. Our approach is instead to use high-performance cluster nodes that are frequency- and voltage-scalable; energy can then be saved by scaling down the CPU. Our prior work has examined the costs and benefits of executing an entire application at a single reduced frequency. This paper presents a framework for executing a single application in several frequency-voltage settings. The basic idea is to first divide programs into phases and then execute a series of experiments, with each phase assigned a prescribed frequency. During each experiment, we measure energy consumption and time and then use a heuristic to choose the assignment of frequency to phase for the next experiment. Our results show that significant energy can be saved without an undue performance penalty; particularly, our heuristic finds assignments of frequency to phase that are superior to any fixed-frequency solution. Specifically, this paper shows that more than half of the NAS benchmarks exhibit a better energy-time tradeoff using multiple gears than using a single gear. For example, IS using multiple gears uses 9% less energy and executes in 1% less time than the closest single-gear solution. Compared to no frequency scaling, multiple gear IS uses 16% less energy while executing only 1% longer. Copyright 2005 ACM.
- Freeh, V. W., Pan, F., Kappiah, N., Lowenthal, D. K., & Springer, R. (2005). Exploring the energy-time tradeoff in MPI programs on a power-scalable cluster. Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005, 2005, 4a. Abstract: Recently, energy has become an important issue in high-performance computing. For example, supercomputers designed with energy in mind, such as BlueGene/L, have been built; the idea is to improve the energy efficiency of nodes. Our approach, which uses off-the-shelf, high-performance cluster nodes that are frequency scalable, allows energy saving by scaling down the CPU. This paper investigates the energy consumption and execution time of applications from a standard benchmark suite (NAS) on a power-scalable cluster. We study via direct measurement and simulation both intra-node and inter-node effects of memory and communication bottlenecks, respectively. Additionally, we compare energy consumption and execution time across different numbers of nodes. Our results show that a power-scalable cluster has the potential to save energy by scaling the processor down to lower energy levels. Furthermore, we found that for some programs, it is possible to both consume less energy and execute in less time when using a larger number of nodes, each at reduced energy. Additionally, we developed and validated a model that enables us to predict the energy-time tradeoff of larger clusters.
- Kappiah, N., Freeh, V. W., & Lowenthal, D. K. (2005). Just in time dynamic voltage scaling: Exploiting inter-node slack to save energy in MPI programs. Proceedings of the ACM/IEEE 2005 Supercomputing Conference, SC'05, 2005. Abstract: Recently, improving the energy efficiency of HPC machines has become important. As a result, interest in using power-scalable clusters, where frequency and voltage can be dynamically modified, has increased. On power-scalable clusters, one opportunity for saving energy with little or no loss of performance exists when the computational load is not perfectly balanced. This situation occurs frequently, as balancing load between nodes is one of the long-standing problems in parallel and distributed computing. In this paper we present a system called Jitter, which reduces the frequency on nodes that are assigned less computation and therefore have slack time. This saves energy on these nodes, and the goal of Jitter is to attempt to ensure that they arrive "just in time" so that they avoid increasing overall execution time. For example, in Aztec, from the ASCI Purple suite, our algorithm uses 8% less energy while increasing execution time by only 2.6%. © 2005 IEEE.
- Karwande, A., Yuan, X., & Lowenthal, D. K. (2005). An MPI prototype for compiled communication on Ethernet switched clusters. Journal of Parallel and Distributed Computing, 65(10), 1123-1133. Abstract: Compiled communication has recently been proposed to improve communication performance for clusters of workstations. The idea of compiled communication is to apply more aggressive optimizations to communications whose information is known at compile time. Existing MPI libraries do not support compiled communication. In this paper, we present an MPI prototype, CC-MPI, that supports compiled communication on Ethernet switched clusters. The unique feature of CC-MPI is that it allows the user to manage network resources such as multicast groups directly and to optimize communications based on the availability of the communication information. CC-MPI optimizes one-to-all, one-to-many, all-to-all, and many-to-many collective communication routines using the compiled communication technique. We describe the techniques used in CC-MPI and report its performance. The results show that communication performance of Ethernet switched clusters can be significantly improved through compiled communication. © 2005 Elsevier Inc. All rights reserved.
- McCreary, D., Li, K., Watterson, S. A., & Lowenthal, D. K. (2005). TCP-RC: A receiver-centered TCP protocol for delay-sensitive applications. Proceedings of SPIE - The International Society for Optical Engineering, 5680, 126-130. Abstract: TCP is the de-facto standard transport-layer protocol in the Internet. However, TCP is generally considered to be inappropriate for delay-sensitive applications such as multimedia. This paper proposes a novel receiver-centered TCP (TCP-RC), which is a TCP modification at the receiver that is intended for delay-sensitive applications. The basic principle behind TCP-RC is that it achieves low latency at the expense of reliability. In particular, TCP-RC forges lost packets, passing them on to an enabled application. This allows low-latency transmission for a class of applications that do not demand full reliability. Results obtained from emulated experiments show that over a range of loss rates and round-trip times, TCP-RC has a significantly smaller average- and worst-case per-packet delay than regular TCP. © 2005 SPIE and IS&T.
- Nakazawa, M., Lowenthal, D. K., & Zhou, W. (2005). The MHETA execution model for heterogeneous clusters. Proceedings of the ACM/IEEE 2005 Supercomputing Conference, SC'05, 2005. Abstract: The availability of inexpensive "off the shelf" machines increases the likelihood that parallel programs run on heterogeneous clusters of machines. These programs are increasingly likely to be out of core, meaning that portions of their datasets must be stored on disk during program execution. This results in significant, per-iteration, I/O cost. This paper describes an execution model, called MHETA, which is the key component to finding an effective data distribution on heterogeneous clusters. MHETA takes into account computation, communication, and I/O costs of iterative scientific applications. MHETA uses automatically extracted information from a single iteration to predict the execution time of the remaining iterations. Results show that MHETA predicts with on average 98% accuracy the execution time of several scientific benchmarks (with and without prefetching) and one full-scale scientific program that utilize pipelined and other communication. MHETA is thus an effective tool when searching for the most effective distribution on a heterogeneous cluster. © 2005 IEEE.
- Veal, B., Li, K., & Lowenthal, D. (2005). New methods for passive estimation of TCP round-trip times. Lecture Notes in Computer Science, 3431, 121-134. Abstract: We propose two methods to passively measure and monitor changes in round-trip times (RTTs) throughout the lifetime of a TCP connection. Our first method associates data segments with the acknowledgments (ACKs) that trigger them by leveraging the TCP timestamp option. Our second method infers TCP RTT by observing the repeating patterns of segment clusters where the pattern is caused by TCP self-clocking. We evaluate the two methods using both emulated and real Internet tests. © Springer-Verlag Berlin Heidelberg 2005.
- Yan, H., & Lowenthal, D. (2005). Towards cooperation fairness in mobile ad hoc networks. IEEE Wireless Communications and Networking Conference, WCNC, 4, 2143-2148. Abstract: For the sustainable operation of ad hoc networks, incentive mechanisms are required to encourage cooperation. More importantly, we must enforce available bandwidth fairness among nodes in ad hoc networks. While such fair sharing of bandwidth in traditional networks is carried out by the dedicated routers with a variety of algorithms, in mobile ad-hoc networks, it must be performed by each node because nodes can act as routers as well as end hosts. Here, we say that bandwidth sharing is fair if a node's available bandwidth is proportional to its forwarding contribution. In this paper we achieve fair bandwidth sharing through a new packet scheduling algorithm, called cooperative queueing, on each node. In cooperative queueing, which is analogous to fair queueing, packet scheduling is based on a new abstraction that we call cooperation coefficient. The cooperation coefficient quantifies how much a given node contributes to and consumes from the ad-hoc network; the larger the cooperation coefficient, the more bandwidth a node can obtain. We exploit the widely used dynamic source routing information to obtain the cooperation coefficient. We evaluate the effectiveness of cooperative queueing with different parameters and network configurations. We demonstrate that our algorithm is able to encourage cooperation and ensure fair sharing of bandwidth between nodes. We show that cooperative queueing is simple and has little overhead. © 2005 IEEE.
- Yan, H., Li, K., Watterson, S., & Lowenthal, D. (2005). Improving passive estimation of TCP round-trip times using TCP timestamps. 2004 IEEE Workshop on IP Operations and Management Proceedings, IPOM 2004: Self-Measurement and Self-Management of IP Networks and Services, 181-185. Abstract: In order to make accurate routing and queueing decisions, passive measurement of TCP flows is becoming increasingly common. Passive measurement techniques have the advantage that they do not inject extra traffic into the connection. These techniques work by associating packet pairs (e.g., a data packet and its acknowledgement during slow start) and then measuring the RTT for those pairs. Unfortunately, such association is difficult in the general case, primarily because an acknowledgement can rarely be associated with the data packets that it triggers. This paper presents a new passive measurement technique that associates packet pairs using TCP timestamps. The basic idea is to capture all packets that pass through the measurement point, and match two pairs of packets, where one timestamp is common to both pairs. Unlike previous techniques, this allows us to obtain samples throughout the lifetime of the connection. Results show that our technique has less than a 1% error on average for an ftp download. As most end hosts currently employ the timestamp option, our technique is widely applicable in practice. ©2004 IEEE.
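A simplified sketch of the timestamp-matching idea (not the paper's exact algorithm): remember when each TSval is first seen in each direction at the monitoring point; when the opposite direction echoes that value as TSecr, the elapsed time approximates the RTT component on the far side of the monitor, and the two directional components sum to a full-RTT estimate. The toy trace below is invented.

```
# Sketch of passive RTT estimation from TCP timestamps at a monitoring point.
from typing import Dict, Tuple

first_seen: Dict[Tuple[str, int], float] = {}   # (direction, TSval) -> capture time
rtt_parts: Dict[str, float] = {}                # direction -> latest component estimate

def observe(t: float, direction: str, tsval: int, tsecr: int) -> None:
    """Feed one observed segment: capture time, 'c2s' or 's2c', and its timestamps."""
    first_seen.setdefault((direction, tsval), t)
    other = "s2c" if direction == "c2s" else "c2s"
    if (other, tsecr) in first_seen:
        rtt_parts[other] = t - first_seen[(other, tsecr)]

# Toy trace: client segment (TSval=100) at t=0.00, server echo of TSecr=100 at
# t=0.04, client echo of the server's TSval=500 at t=0.07.
observe(0.00, "c2s", tsval=100, tsecr=90)
observe(0.04, "s2c", tsval=500, tsecr=100)
observe(0.07, "c2s", tsval=101, tsecr=500)
if len(rtt_parts) == 2:
    print(f"estimated RTT ~= {sum(rtt_parts.values()) * 1000:.0f} ms")
```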
- Yan, H., Lowenthal, D. K., & Li, K. (2005). ACE: An active, client-directed method for reducing energy during web browsing. Proceedings of the International Workshop on Network and Operating System Support for Digital Audio and Video, 27-32. Abstract: In mobile devices, the wireless network interface card (WNIC) consumes a significant portion of overall system energy. One way to reduce energy consumed by a device is to transition its WNIC to a lower-power sleep mode when data is not being received or transmitted. This paper develops ACE, an active, client-directed technique to improve energy efficiency during web browsing. ACE actively retrieves buffered packets from an access point based on predictions made through client-side connection tracking. The key novel implementation technique used in ACE is connection rescheduling, which results in a better energy/time tradeoff for interactive applications such as web browsing. We demonstrate the effectiveness of ACE through actual experiments to real Internet servers. Copyright 2005 ACM.
- Bentley, C., Watterson, S. A., Lowenthal, D. K., & Rountree, B. (2004). Implicit Java array bounds checking on 64-bit architectures. Proceedings of the International Conference on Supercomputing, 227-236. Abstract: Interest in using Java for high-performance parallel computing has increased in recent years. One obstacle that has inhibited Java from widespread acceptance in the scientific community is the language requirement that all array accesses must be checked to ensure they are within bounds. In practice, array bounds checking in scientific applications may increase execution time by more than a factor of 2. Previous research has explored optimizations to statically eliminate bounds checks, but the dynamic nature of many scientific codes makes this difficult or impossible. Our approach is instead to create a new Java implementation that does not generate explicit bounds checks. It instead places arrays inside of Index Confinement Regions (ICRs), which are large, isolated, mostly unmapped virtual memory regions. Any array reference outside of its bounds will cause a protection violation; this provides implicit bounds checking. Our results show that our new Java implementation reduces the overhead of bounds checking from an average of 63% to an average of 9% on our benchmarks.
- Gundlach, M., Doster, S., Yan, H., Lowenthal, D. K., Watterson, S. A., & Chandra, S. (2004). Dynamic, power-aware scheduling for mobile clients using a transparent proxy. Proceedings of the International Conference on Parallel Processing, 557-565. Abstract: Mobile computers consume significant amounts of energy when receiving large files. The wireless network interface card (WNIC) is the primary source of this energy consumption. One way to reduce the energy consumed is to transmit the packets to clients in a predictable fashion. Specifically, the packets can be sent in bursts to clients, who can then switch to a lower power sleep state between bursts. This technique is especially effective when the bandwidth of a stream is small. This paper investigates techniques for saving energy in a multiple-client scenario, where clients may be receiving either UDP or TCP data. Energy is saved by using a transparent proxy that is invisible to both clients and servers. The proxy implementation maintains separate connections to the client and server so that a large increase in transmission time is avoided. The proxy also buffers data and dynamically generates a global transmission schedule that includes all active clients. Results show that energy savings within 10-15% of optimal are common, with little packet loss.
- Yan, H., Krishnan, R., Watterson, S. A., & Lowenthal, D. K. (2004). Client-centered energy savings for concurrent HTTP connections. Proceedings of the International Workshop on Network and Operating System Support for Digital Audio and Video, 62-67. Abstract: In mobile devices, the wireless network interface card (WNIC) consumes a significant portion of overall system energy. One way to reduce energy consumed by a WNIC is to transition it to a lower-power sleep mode when data is not being received or transmitted. This paper investigates client-centered techniques for saving energy during web browsing. The basic idea is that the client predicts when packets will arrive, keeping the WNIC in high-power mode only when necessary. This is challenging because web browsing generally results in concurrent HTTP connections. To handle this, we maintain the state of each open connection on the client and then transition the WNIC to sleep mode when no connection is receiving data. Our technique is compatible with standard TCP and does not rely on any assistance from the server, a proxy, or IEEE 802.11b power-saving mode (PSM). Our technique combines the performance of regular TCP with nearly all the energy-saving of PSM during web downloads, and we save more energy than PSM during client think times. Results show that over an entire web browsing session (downloads and think times), our scheme saves up to 21% energy compared to PSM and incurs less than a 1% increase in transmission time compared to regular TCP.
- Karwande, A., Yuan, X., & Lowenthal, D. K. (2003). CC-MPI: A compiled communication capable MPI prototype for ethernet switched clusters. ACM SIGPLAN Notices, 38(10), 95-106. Abstract: Compiled communication has recently been proposed to improve communication performance for clusters of workstations. The idea of compiled communication is to apply more aggressive optimizations to communications whose information is known at compile time. Existing MPI libraries do not support compiled communication. In this paper, we present an MPI prototype, CC-MPI, that supports compiled communication on Ethernet switched clusters. The unique feature of CC-MPI is that it allows the user to manage network resources such as multicast groups directly and to optimize communications based on the availability of the communication information. CC-MPI optimizes one-to-all, one-to-many, all-to-all, and many-to-many collective communication routines using the compiled communication technique. We describe the techniques used in CC-MPI and report its performance. The results show that communication performance of Ethernet switched clusters can be significantly improved through compiled communication.
- Karwande, A., Yuan, X., & Lowenthal, D. K. (2003). CC-MPI: A compiled communication capable MPI prototype for ethernet switched clusters. Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, 95-106. Abstract: Compiled communication has recently been proposed to improve communication performance for clusters of workstations. The idea of compiled communication is to apply more aggressive optimizations to communications whose information is known at compile time. Existing MPI libraries do not support compiled communication. In this paper, we present an MPI prototype, CC-MPI, that supports compiled communication on Ethernet switched clusters. The unique feature of CC-MPI is that it allows the user to manage network resources such as multicast groups directly and to optimize communications based on the availability of the communication information. CC-MPI optimizes one-to-all, one-to-many, all-to-all, and many-to-many collective communication routines using the compiled communication technique. We describe the techniques used in CC-MPI and report its performance. The results show that communication performance of Ethernet switched clusters can be significantly improved through compiled communication.
- Lowenthal, D. K., & Subramanian, R. (2003). HyFi: Architecture-independent parallelism on networks of multiprocessors. International Journal of Computers and Applications, 25(4), 272-282. Abstract: A network of parallel workstations promises cost-effective parallel computing. This article presents the HyFi (Hybrid Filaments) package, which can be used to create architecture-independent parallel programs; that is, programs that are portable and efficient across different parallel machines. HyFi integrates Shared Filaments (SF), which provides parallelism on shared-memory multiprocessors, and Distributed Filaments (DF), which extracts parallelism from networks of uniprocessors. This enables parallelism on any architecture, including homogeneous networks of multiprocessors. HyFi uses fine-grain parallelism and implicit shared-variable communication to provide a uniform programming model. HyFi adopts the same basic execution model as SF and DF; this work discusses the modifications necessary to develop the hybrid system. In particular, HyFi modifies the signal-thread model as well as the software distributed shared memory of DF. It also unifies the SF and DF reduction operations as well as the dynamic load-balancing mechanism of fork-join filaments. Application programs written using the HyFi API can run unchanged on any architecture. Performance is encouraging on fork/join applications, where excellent speedup is achieved. Also, the fine-grain model of HyFi allows up to a 14.5% improvement due to overlap of communication and computation. Unfortunately, we find that iterative applications do not speed up well due to the inability of the Pentium Xeon architecture to efficiently support concurrent memory accesses.
- Price, G. W., & Lowenthal, D. K. (2003). A comparative analysis of fine-grain threads packages. Journal of Parallel and Distributed Computing, 63(11), 1050-1063. Abstract: The rising availability of multiprocessing platforms has increased the importance of providing programming models that allow users to express parallelism simply, portably, and efficiently. One popular way to write parallel programs is to use threads for concurrent sections of code. User-level threads packages allow programmers to implement multithreaded programs in which thread creation, thread management, and thread synchronization are relatively inexpensive. Fine-grain programs are multithreaded programs in which the work is divided into a large number of threads, where each thread contains a relatively small amount of work. The potential benefits of large numbers of threads include easier load balancing, better scalability, greater potential for overlapping communication and computation, and improved platform-independence. However, fine-grain programs are largely considered inefficient due to the overheads involved in managing numerous threads. In this paper, we survey several thread packages that take different approaches to the problem of efficiently supporting the creation and management of large numbers of fine-grain threads. Each package is compared based on its level of support of the general thread model as well as its performance on a set of fine-grain parallel programs. We find that while the thread packages we tested may support medium-grain parallelism efficiently, they do not always support fine-grain parallelism. Although no package supports fine-grain parallelism and a general thread model, we believe that it can potentially be done with help from the compiler. © 2003 Published by Elsevier Inc.
- Weatherly, D. B., Lowenthal, D. K., Nakazawa, M., & Lowenthal, F. (2003). Dyn-MPI: Supporting MPI on non-dedicated clusters. Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, SC 2003.More infoAbstract: Distributing data is a fundamental problem in implementing efficient distributed-memory parallel programs. The problem becomes more difficult in environments where the participating nodes are not dedicated to a parallel application. We are investigating the data distribution problem in non-dedicated environments in the context of explicit message-passing programs. To address this problem, we have designed and implemented an extension to MPI called Dynamic MPI (Dyn-MPI). The key component of Dyn-MPI is its run-time system, which efficiently and automatically redistributes data on the fly when there are changes in the application or the underlying environment. Dyn-MPI supports efficient memory allocation, precise measurement of system load and computation time, and node removal. Performance results show that programs that use Dyn-MPI execute efficiently in non-dedicated environments, including up to almost a three-fold improvement compared to programs that do not redistribute data and a 25% improvement over standard adaptive load balancing techniques. (A sketch of the underlying load-proportional repartitioning idea appears after this publication list.) © 2003 ACM.
- Morris III, D. G., & Lowenthal, D. K. (2001). Accurate data redistribution cost estimation in software distributed shared memory systems. SIGPLAN Notices (ACM Special Interest Group on Programming Languages), 36(7), 62-71.More infoAbstract: Distributing data is one of the key problems in implementing efficient distributed-memory parallel programs. The problem becomes more difficult in programs where data redistribution between computational phases is considered. The global data distribution problem is to find the optimal distribution in multi-phase parallel programs. Solving this problem requires accurate knowledge of data redistribution cost. We are investigating this problem in the context of a software distributed shared memory (SDSM) system, in which obtaining accurate redistribution cost estimates is difficult. This is because SDSM communication is implicit: it depends on access patterns, page locations, and the SDSM consistency protocol. We have developed integrated compile- and run-time analysis for SDSM systems to determine accurate redistribution cost estimates with low overhead. Our resulting system, SUIF-Adapt, can efficiently and accurately estimate execution time, including redistribution, to within 5% of the actual time in all of our test cases and is often much closer. These precise costs enable SUIF-Adapt to find efficient global data distributions in multiple-phase programs. (A sketch of the multi-phase distribution-selection problem appears after this publication list.) Copyright 2001 ACM.
- Hauschildt, P. H., Lowenthal, D. K., & Baron, E. (2001). Parallel implementation of the PHOENIX generalized stellar atmosphere program. III. A parallel algorithm for direct opacity sampling. Astrophysical Journal, Supplement Series, 134(2), 323-329.More infoAbstract: We describe two parallel algorithms for line opacity calculations based on a local file and on a global file approach. The performance and scalability of both approaches are discussed for different test cases and very different parallel computing systems. The results show that a global file approach is more efficient on high-performance parallel supercomputers with a dedicated parallel I/O subsystem, whereas the local file approach is very useful on farms of workstations, e.g., cheap PC clusters.
- Lowenthal, D. K. (2000). Accurately selecting block size at runtime in pipelined parallel programs. International Journal of Parallel Programming, 28(3), 245-274.More infoAbstract: Loops that contain cross-processor data dependencies, known as DOACROSS loops, are often found in scientific programs. Efficiently parallelizing such loops is important yet nontrivial. One useful parallelization technique for DOACROSS loops is pipelining, where each processor (node) performs its computation in blocks; after each, it sends data to the next node in the pipeline. The amount of computation before sending a message is called the block size; its choice, although difficult to make statically, is important for efficient execution. This paper describes a flexible runtime approach to choosing the block size. Rather than rely on static estimation of workload, our system takes measurements during the first two iterations of a program and then uses the results to build an execution model and choose an appropriate block size which, unlike a static choice, may be nonuniform. To increase accuracy of the chosen block size, our execution model takes intra- and inter-node performance into account. It is important to note that our system finds an effective block size automatically, without the experimentation that is necessary when using a statically chosen block size. Performance on a network of workstations shows that programs that use our runtime analysis outperform those that use static block sizes by as much as 18% when the workload is unbalanced. When the workload is balanced, competitive performance is achieved as long as the initial overhead is sufficiently amortized. (A simple illustrative block-size cost model appears after this publication list.)
- Lowenthal, D. K., & Freeh, V. W. (2000). Architecture-independent parallelism for both shared- and distributed-memory machines using the Filaments package. Parallel Computing, 26(10), 1297-1323.More infoAbstract: This paper presents the Filaments package, which can be used to create architecture-independent parallel programs - that is, programs that are portable and efficient across vastly different parallel machines. Filaments virtualizes the underlying machine in terms of the number of processors and the interconnection, allowing fine-grain, shared-memory programs to be written or generated. Furthermore, Filaments uses a carefully designed API along with machine-specific runtime libraries and preprocessing that allow programs to run unchanged on both shared- and distributed-memory machines. Performance is not sacrificed, as almost all kernels and applications we tested achieve a speedup of over 4 on 8 processors of both an SGI Challenge and a cluster of Pentium Pros.
- Lowenthal, D. K., & James, M. (1999). Run-time selection of block size in pipelined parallel programs. Proceedings of the International Parallel Processing Symposium, IPPS, 82-87.More infoAbstract: Parallelizing compiler technology has improved in recent years. One area in which compilers have made progress is in handling DOACROSS loops, where cross-processor data dependencies can inhibit efficient parallelization. In regular DOACROSS loops, where dependencies can be determined at compile time, a useful parallelization technique is pipelining, where each processor (node) performs its computation in blocks; after each, it sends data to the next processor in the pipeline. The amount of computation before sending a message is called the block size; its choice, although difficult for a compiler to make, is critical to the efficiency of the program. Compilers typically use a static estimation of workload, which cannot always produce an effective block size. This paper describes a flexible run-time approach to choosing the block size. Our system takes measurements during the first iteration of the program and then uses the results to build an execution model and choose an appropriate block size which, unlike those chosen by compiler analysis, may be nonuniform. Performance on a network of workstations shows that programs using our run-time analysis outperform those that use static block sizes when the workload is either unbalanced or unanalyzable. On more regular programs, our programs are competitive with their static counterparts.
- Lowenthal, D. K., Freeh, V. W., & Andrews, G. R. (1998). Efficient support for fine-grain parallelism on shared-memory machines. Concurrency Practice and Experience, 10(3), 157-173.More infoAbstract: A coarse-grain parallel program typically has one thread (task) per processor, whereas a fine-grain program has one thread for each independent unit of work. Although there are several advantages to fine-grain parallelism, conventional wisdom is that coarse-grain parallelism is more efficient. This paper illustrates the advantages of fine-grain parallelism and presents an efficient implementation for shared-memory machines. The approach has been implemented in a portable software package called Filaments, which employs a unique combination of techniques to achieve efficiency. The performance of the fine-grain programs discussed in this paper is always within 13% of a hand-coded coarse-grain program and is usually within 5%. © 1998 John Wiley & Sons, Ltd.
- Lowenthal, D. K., & Andrews, G. R. (1996). Adaptive approach to data placement. IEEE Symposium on Parallel and Distributed Processing - Proceedings, 349-353.More infoAbstract: Programming distributed-memory machines requires careful placement of data to balance the computational load among the nodes and minimize excess data movement between the nodes. Most current approaches to data placement require the programmer or compiler to place data initially and then possibly to move it explicitly during a computation. This paper describes a new, adaptive approach. It is implemented in the Adapt system, which takes an initial data placement, efficiently monitors how well it performs, and changes the placement whenever the monitoring indicates that a different placement would perform better. Adapt frees the programmer from having to specify data placements, and it can use run-time information to find better placements than compilers. Moreover, Adapt automatically supports a 'variable block' placement, which is especially useful for applications with nearest-neighbor communication but an imbalanced workload. For applications in which the best data placement varies dynamically, using Adapt can lead to better performance than using any statically determined data placement. (A small sketch of computing a variable-block placement from measured per-row costs appears after this publication list.)
- Lowenthal, D. K., Freeh, V. W., & Andrews, G. R. (1996). Using fine-grain threads and run-time decision making in parallel computing. Journal of Parallel and Distributed Computing, 37(1), 41-54.More infoAbstract: Programming distributed-memory multiprocessors and networks of workstations requires deciding what can execute concurrently, how processes communicate, and where data is placed. These decisions can be made statically by a programmer or compiler, or they can be made dynamically at run time. Using run-time decisions leads to a simpler interface - because decisions are implicit - and it can lead to better decisions - because more information is available. This paper examines the costs, benefits, and details of making decisions at run time. The starting point is explicit fine-grain parallelism with any number (even thousands) of threads. Five specific techniques are considered: (1) implicitly coarsening the granularity of parallelism, (2) using implicit communication implemented by a distributed shared memory, (3) overlapping computation and communication, (4) adaptively moving threads and data between nodes to minimize communication and balance load, and (5) dynamically remapping data to pages to avoid false sharing. Details are given on the performance of each of these techniques as well as on their overall performance for several scientific applications. (A small sketch of the false-sharing effect that technique (5) avoids appears after this publication list.) © 1996 Academic Press, Inc.
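The fine-grain threads papers above (Price & Lowenthal 2003; Lowenthal, Freeh & Andrews 1998) hinge on the cost of creating and managing very large numbers of short-lived threads. The following is a minimal, generic POSIX-threads microbenchmark, not taken from any of the surveyed packages, that makes that overhead visible; the thread count and per-thread workload are arbitrary.

```c
/* Minimal sketch: measure the cost of creating and joining many short-lived
 * POSIX threads, the overhead that makes naive one-thread-per-unit-of-work
 * (fine-grain) programs slow compared to coarse-grain, one-thread-per-
 * processor programs. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS 10000
#define WORK_PER_THREAD 100   /* deliberately tiny: fine-grain */

static void *tiny_task(void *arg)
{
    (void)arg;
    volatile double x = 0.0;
    for (int i = 0; i < WORK_PER_THREAD; i++)
        x += i * 0.5;          /* a few flops, nothing more */
    return NULL;
}

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < NTHREADS; i++) {
        pthread_t t;
        if (pthread_create(&t, NULL, tiny_task, NULL) != 0) {
            perror("pthread_create");
            exit(1);
        }
        pthread_join(t, NULL);
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%d create/join pairs: %.3f s (%.1f us per thread)\n",
           NTHREADS, secs, 1e6 * secs / NTHREADS);
    return 0;
}
```

On typical systems the per-thread cost measured this way is on the order of tens of microseconds, which is exactly the overhead that user-level, fine-grain packages such as Filaments avoid by multiplexing many logical threads onto a few kernel threads.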
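For the Dyn-MPI entry above (Weatherly et al. 2003), the following MPI sketch illustrates the basic idea of repartitioning rows in proportion to observed node speed. It is not the Dyn-MPI run-time system: the stand-in work loop, the 256-rank limit, and the omission of the actual data movement are simplifications made purely for illustration.

```c
/* Minimal sketch: each rank times its share of an iteration, the times are
 * gathered, and a new row partition is computed in proportion to each rank's
 * observed speed.  The data movement itself is omitted; a real system would
 * also ship the rows whose owner changed and fix up rounding so the counts
 * sum to TOTAL_ROWS. */
#include <mpi.h>
#include <stdio.h>

#define TOTAL_ROWS 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int my_rows = TOTAL_ROWS / nprocs;          /* equal initial partition */

    /* --- one compute phase, timed --- */
    double t0 = MPI_Wtime();
    volatile double x = 0.0;
    for (int i = 0; i < my_rows * 100000; i++)  /* stand-in for real work  */
        x += i;
    double my_time = MPI_Wtime() - t0;

    /* --- exchange timings and repartition proportionally to speed --- */
    double times[256];                          /* assumes nprocs <= 256   */
    MPI_Allgather(&my_time, 1, MPI_DOUBLE, times, 1, MPI_DOUBLE,
                  MPI_COMM_WORLD);

    /* every rank started with the same my_rows, so rows/second on rank p
       is simply my_rows / times[p] */
    double speed_sum = 0.0;
    for (int p = 0; p < nprocs; p++)
        speed_sum += my_rows / times[p];

    int new_rows = (int)(TOTAL_ROWS * (my_rows / my_time) / speed_sum + 0.5);
    printf("rank %d: %d rows -> %d rows (%.3f s)\n",
           rank, my_rows, new_rows, my_time);

    MPI_Finalize();
    return 0;
}
```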
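For the SUIF-Adapt work above (Morris III & Lowenthal 2001), the sketch below shows the shape of the global data distribution problem once per-phase compute costs and phase-to-phase redistribution costs are known: a small dynamic program picks one distribution per phase to minimize total time. The cost tables here are invented; obtaining accurate values for them, especially the redistribution costs under an SDSM, is precisely what the paper addresses.

```c
/* Minimal sketch of multi-phase distribution selection: choose one
 * distribution per phase so that the sum of per-phase compute cost plus the
 * redistribution cost between consecutive phases is minimized. */
#include <stdio.h>
#include <float.h>

#define PHASES 3
#define DISTS  2          /* 0 = block by rows, 1 = block by columns */

int main(void)
{
    /* estimated compute time of each phase under each distribution (s) */
    double comp[PHASES][DISTS]  = {{1.0, 3.0}, {4.0, 1.5}, {1.2, 2.5}};
    /* estimated cost of changing distribution between phases (s) */
    double redist[DISTS][DISTS] = {{0.0, 0.8}, {0.8, 0.0}};

    double best[PHASES][DISTS];
    int    prev[PHASES][DISTS] = {{0}};

    for (int d = 0; d < DISTS; d++)
        best[0][d] = comp[0][d];

    for (int p = 1; p < PHASES; p++)
        for (int d = 0; d < DISTS; d++) {
            best[p][d] = DBL_MAX;
            for (int e = 0; e < DISTS; e++) {
                double c = best[p - 1][e] + redist[e][d] + comp[p][d];
                if (c < best[p][d]) { best[p][d] = c; prev[p][d] = e; }
            }
        }

    /* recover the cheapest plan by walking the predecessor table backward */
    int d = (best[PHASES - 1][0] <= best[PHASES - 1][1]) ? 0 : 1;
    int plan[PHASES];
    for (int p = PHASES - 1; p >= 0; p--) {
        plan[p] = d;
        if (p > 0) d = prev[p][d];
    }
    printf("total %.2f s, plan:", best[PHASES - 1][plan[PHASES - 1]]);
    for (int p = 0; p < PHASES; p++)
        printf(" phase%d=%s", p, plan[p] ? "col-block" : "row-block");
    printf("\n");
    return 0;
}
```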
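For the pipelined DOACROSS work above (Lowenthal & James 1999; Lowenthal 2000), the following uses a textbook-style cost model, not the measured, possibly nonuniform model built at run time in those papers, to show why block size matters and how a good value can be chosen once per-column compute time and per-message overhead are known. The constants P, N, w, and a are arbitrary.

```c
/* Minimal sketch of the block-size trade-off in a pipelined DOACROSS loop.
 * With P pipeline stages, N columns, per-column compute time w per stage,
 * and per-message overhead a, processing b columns per message costs roughly
 *      T(b) = (P - 1 + N / b) * (b * w + a)
 * Small b drowns in message overhead; large b delays pipeline fill. */
#include <stdio.h>

static double model_time(int b, int P, int N, double w, double a)
{
    double blocks = (double)N / b;
    return (P - 1 + blocks) * (b * w + a);
}

int main(void)
{
    const int P = 8, N = 4096;
    const double w = 2e-6;        /* 2 us of work per column per stage */
    const double a = 1e-4;        /* 100 us of per-message overhead    */

    int best_b = 1;
    double best_t = model_time(1, P, N, w, a);
    for (int b = 2; b <= N; b++) {
        double t = model_time(b, P, N, w, a);
        if (t < best_t) { best_t = t; best_b = b; }
    }
    printf("model-optimal block size: %d columns (predicted %.4f s)\n",
           best_b, best_t);
    printf("b=1: %.4f s, b=N: %.4f s\n",
           model_time(1, P, N, w, a), model_time(N, P, N, w, a));
    return 0;
}
```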
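For the Adapt entry above (Lowenthal & Andrews 1996), this sketch computes a 'variable block' row placement from measured per-row costs: each node receives a contiguous band of rows whose total cost is roughly equal, rather than an equal number of rows. The cost array is invented, and a real implementation would also guarantee every node at least one row.

```c
/* Minimal sketch (not Adapt itself): compute a variable-block placement
 * from per-row costs, extending each node's band while doing so brings its
 * cumulative cost closer to that node's ideal share. */
#include <math.h>
#include <stdio.h>

#define ROWS  16
#define NODES 4

int main(void)
{
    /* per-row costs as a run-time monitor might observe them (invented) */
    double cost[ROWS] = {1, 1, 1, 2, 2, 4, 4, 8, 8, 4, 4, 2, 2, 1, 1, 1};

    double total = 0.0;
    for (int r = 0; r < ROWS; r++)
        total += cost[r];

    double cum = 0.0;     /* cumulative cost assigned so far */
    int start = 0;
    for (int n = 0; n < NODES; n++) {
        double boundary = (n + 1) * total / NODES;  /* ideal cumulative cost */
        int end = start;
        double band = 0.0;
        while (end < ROWS &&
               (n == NODES - 1 || end == start ||
                fabs(cum + cost[end] - boundary) <= fabs(cum - boundary))) {
            cum  += cost[end];
            band += cost[end];
            end++;
        }
        printf("node %d: rows [%2d, %2d)  cost %.1f\n", n, start, end, band);
        start = end;
    }
    return 0;
}
```

With the example costs above, the four nodes end up with bands of cost 11, 12, 12, and 11, whereas an equal-rows placement would assign one node most of the expensive middle rows.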
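Technique (5) in the entry above (Lowenthal, Freeh & Andrews 1996) targets false sharing. The sketch below demonstrates the effect at cache-line granularity with two POSIX threads; in a page-based software DSM the same pathology occurs at page granularity, which is why remapping data to pages helps. The 64-byte pad and iteration count are arbitrary.

```c
/* Minimal sketch of false sharing: two threads update logically independent
 * counters.  When the counters share one cache line (or, in an SDSM, one
 * page), every update invalidates the other thread's copy; padding them
 * apart removes the interference. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 50000000L

struct shared_line { volatile long a; volatile long b; };               /* false sharing   */
struct padded_line { volatile long a; char pad[64]; volatile long b; }; /* 64-byte padding */

static struct shared_line s;
static struct padded_line p;

static void *bump_shared_a(void *x) { (void)x; for (long i = 0; i < ITERS; i++) s.a++; return NULL; }
static void *bump_shared_b(void *x) { (void)x; for (long i = 0; i < ITERS; i++) s.b++; return NULL; }
static void *bump_padded_a(void *x) { (void)x; for (long i = 0; i < ITERS; i++) p.a++; return NULL; }
static void *bump_padded_b(void *x) { (void)x; for (long i = 0; i < ITERS; i++) p.b++; return NULL; }

static double run(void *(*fa)(void *), void *(*fb)(void *))
{
    struct timespec t0, t1;
    pthread_t ta, tb;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, fa, NULL);
    pthread_create(&tb, NULL, fb, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    printf("same line: %.2f s\n", run(bump_shared_a, bump_shared_b));
    printf("padded:    %.2f s\n", run(bump_padded_a, bump_padded_b));
    return 0;
}
```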
Proceedings Publications
- Smith, S., & Lowenthal, D. K. (2021, June). Jigsaw: A High-Utilization, Interference-Free Job Scheduler for Fat-Tree Clusters. In ACM Symposium on High Performance Distributed Computing.More infoThis paper was named Best Paper of the conference. HPDC is a CSRankings conference and is arguably the best conference in the area (it is either Supercomputing or HPDC).
- Bhatele, A., Thiagarajan, J., Groves, T., Anirudh, R., Smith, S., Cook, B., & Lowenthal, D. K. (2020, May). The Case of Performance Variability on Dragonfly-based Systems. In International Parallel and Distributed Processing Symposium.More infoThis paper is based on Staci's early dissertation work, which was rejected a couple of times; it was taken to the finish line by Abhinav Bhatele (now an Assistant Professor at Maryland) and others, because Staci wanted to focus on other work. IPDPS is a CORE A conference, which is a joke: IPDPS is a second-tier conference, yet the CORE rankings rate *all* the HPC conferences as A, from the elite venues such as SC, HPDC, and ICS all the way down to CCGRID and ICPP, which I view as garbage bins.
- Savoie, L., Lowenthal, D. K., de Supinski, B., Mohror, K., & Jain, N. (2019, Sep). Mitigating Inter-Job Interference via Process-Level Quality-of-Service. In Proceedings of the IEEE International Conference on Cluster Computing.More infoCORE A Conference
- Savoie, L., Lowenthal, D. K., de Supinski, B., & Mohror, K. (2018, May). A Study of Network Quality of Service in Many-Core MPI Applications. In Workshop on Run-Time and Operating Systems for the Many-Core Era.
- Smith, S. A., Cromey, C. E., Lowenthal, D. K., Domke, J., Jain, N., Thiagarajan, J. J., & Bhatele, A. (2018, Nov). Mitigating Inter-Job Interference Using Adaptive Flow-Aware Routing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC'18.More infoNotes: (1) This paper was one of five Best Student Paper nominees. (2) This conference (Supercomputing) is one of the three CSRankings venues in the HPC field and is a CORE "A" conference.
- Patki, T., Lowenthal, D. K., Rountree, B., Schulz, M., & de Supinski, B. R. (2016, November). Economic Viability of Hardware Overprovisioning in Power-Constrained High Performance Computing. In Workshop on Energy-Efficient Supercomputing.
- Savoie, L., Lowenthal, D. K., de Supinski, B. R., Islam, T., Mohror, K., Rountree, B., & Schulz, M. (2016, May). I/O Aware Power Shifting. In International Parallel and Distributed Processing Symposium.
- Bailey, P., Marathe, A., Lowenthal, D. K., Rountree, B., & Schulz, M. (2015, Nov). Finding the Limits of Power-Constrained Application Performance. In Supercomputing 2015.
- Inadomi, Y., Patki, T., Inoue, K., Aoyagi, M., Rountree, B., Schulz, M., Lowenthal, D. K., Wada, Y., Fukazawa, K., Ueda, M., Kondo, M., & Miyoshi, I. (2015, Nov). Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing. In Supercomputing 2015.
- Marathe, A., Bailey, P. E., Lowenthal, D. K., Rountree, B., Schulz, M., & de Supinski, B. R. (2015). A Run-Time System for Power-Constrained HPC Applications. In High Performance Computing: ISC High Performance 2015, 9137, 394-408.
- Patki, T., Lowenthal, D. K., Sasidharan, A., Maiterth, M., Rountree, B., Schulz, M., & de Supinski, B. R. (2015, June). Practical Resource Management in Power-Constrained, High Performance Computing. In Symposium on High-Performance Parallel and Distributed Computing.
- Lowenthal, D. K., Bailey, P. E., Ravi, V., de Supinski, B., Rountree, B., & Schulz, M. (2014, September). Adaptive Configuration Selection for Power-Constrained Heterogeneous Systems. In 43rd IEEE International Conference on Parallel Processing.
- Marathe, A., Harris, R., Lowenthal, D., de Supinski, B. R., Rountree, B., & Schulz, M. (2014, June). Exploiting Redundancy for Cost-Effective, Time-Constrained Execution of HPC Applications on Amazon EC2. In 23rd ACM Symposium on High-Performance Parallel and Distributed Computing.
- Marathe, A., Harris, R., Lowenthal, D., de Supinski, B. R., Rountree, B., & Schulz, M. (2013, June). A Comparative Study of High-Performance Computing on the Cloud. In 22nd ACM Symposium on High-Performance Parallel and Distributed Computing.
- Patki, T., Lowenthal, D., Rountree, B., Schulz, M., & de Supinski, B. R. (2013, June). Exploring Hardware Overprovisioning in Power-Constrained High Performance Computing. In 27th ACM International Conference on Supercomputing (ICS).
- Bailey, P., Patki, T., Striemer, G., Akoglu, A., Lowenthal, D., Bradbury, P., Vaughn, M., Wang, L., & Goff, S. (2012, May). Quantitative Trait Locus Analysis Using a Partitioned Linear Model on a GPU Cluster. In 11th Workshop on High-Performance Computational Molecular Biology.
- Chen, J., Ramaswamy, L., Lowenthal, D., & Kalyanaraman, S. (2012, July). Comet: Decentralized Complex Event Detection in Mobile Delay Tolerant Networks. In 13th IEEE International Conference on Mobile Data Management.
- Rountree, B., Ahn, D., de Supinski, B. R., Lowenthal, D., & Schulz, M. (2012, May). Beyond DVFS: A First Look at Performance Under a Hardware-Enforced Power Bound. In 8th Workshop on High-Performance, Power-Aware Computing.
Presentations
- Kranzlmuller, D., Lowenthal, D. K., Rountree, B., & Schulz, M. (2017, November). Power-Aware High Performance Computing: Challenges and Opportunities for Application and System Developers. Supercomputing 2017. Denver: ACM/IEEE.More infoTutorial, equally presented by the four co-authors.
- Kranzlmuller, D., Lowenthal, D. K., Rountree, B., & Schulz, M. (2016, November). Power-Aware High Performance Computing: Challenges and Opportunities for Application and System Developers. Supercomputing 2016. Salt Lake City: ACM/IEEE.More infoTutorial, equally presented by the four co-authors.