ABSTRACT
The era of extremely heterogeneous supercomputing brings with it increased performance variation and reduced reproducibility. The HPC community lacks an understanding of how network traffic, power limits, concurrency tuning, and interference from other jobs, when considered simultaneously, impact application performance.
In this paper, we design a methodology that allows both HPC users and system administrators to understand the trade-off space between optimal and reproducible performance. We present a first-of-its-kind dataset that simultaneously varies multiple system- and user-level parameters on a production cluster, and introduce a new metric, called the desirability score, which enables comparison across different system configurations. We develop a novel, model-agnostic machine learning methodology based on graph signal theory for comparing the influence of parameters on application predictability and, using a new visualization technique, make practical suggestions for best practices in multi-objective HPC environments.
Performance optimality or reproducibility: that is the question