Abstract
This article studies the I/O write behaviors of the Titan supercomputer and its Lustre parallel file stores under production load. The results can inform the design, deployment, and configuration of file systems, along with the design of I/O software in applications, operating systems, and adaptive I/O libraries.
We propose a statistical benchmarking methodology to measure write performance across I/O configurations, hardware settings, and system conditions, and we introduce two relative measures to quantify the write-performance behaviors of hardware components under production load. Beyond the designed experiments and benchmarks on Titan, we verify the experimental results with a real application and a real application I/O kernel, XGC and HACC IO, respectively; both are widely used and representative of typical application I/O behaviors.
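For concreteness, the following is a minimal sketch of the statistical idea, not the article's actual harness: repeat a fixed-size write many times and summarize with the median, since individual trials vary with transient load. The trial count, write size, and output path are illustrative assumptions.

```c
/* Minimal sketch: time repeated fixed-size writes and report the
 * median bandwidth. Illustrative only; not the paper's harness. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define TRIALS      20          /* illustrative trial count */
#define WRITE_BYTES (64 << 20)  /* 64 MiB per trial (assumed size) */

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void) {
    char *buf = malloc(WRITE_BYTES);
    double bw[TRIALS];
    memset(buf, 'x', WRITE_BYTES);

    for (int t = 0; t < TRIALS; t++) {
        int fd = open("bench.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        ssize_t n = write(fd, buf, WRITE_BYTES);
        fsync(fd);              /* flush so the timing covers real I/O */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(fd);
        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        bw[t] = (n > 0) ? n / sec / (1 << 20) : 0.0;  /* MiB/s */
    }
    qsort(bw, TRIALS, sizeof(double), cmp_double);
    printf("median write bandwidth: %.1f MiB/s\n", bw[TRIALS / 2]);
    free(buf);
    return 0;
}
```

Comparing such medians across stripe counts, client counts, and system conditions is the essence of the methodology; the study itself controls many more factors than this sketch.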
In summary, we find that Titan’s I/O system is variable across the machine at fine time scales. This variability has two major implications. First, stragglers lessen the benefit of coupled I/O parallelism (striping). Peak median output bandwidths are obtained with parallel writes to many independent files, with no striping or write sharing of files across clients (compute nodes). I/O parallelism is most effective when the application, or its I/O libraries, distributes the I/O load so that each target stores files for multiple clients and each client writes files on multiple targets, in a balanced way with minimal contention. Second, our results suggest that the potential benefit of dynamic adaptation is limited. In particular, it is not fruitful to attempt to identify “good locations” in the machine or in the file system: component performance is driven by transient load conditions, and past performance is not a useful predictor of future performance. For example, we observe no predictable diurnal load patterns.
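As an illustration of the file-per-process pattern that these results favor, the hedged sketch below has each MPI rank write its own file, with no striping or write sharing across clients. It assumes the output directory's default stripe count was set to 1 beforehand (for example, with `lfs setstripe -c 1 outdir`); the path format and write size are hypothetical.

```c
/* Sketch of file-per-process output: one independent, unstriped file
 * per rank, so there is no write sharing across clients. Assumes
 * `outdir` exists and was pre-set to stripe count 1. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_BYTES (64 << 20)  /* 64 MiB per rank (assumed size) */

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each client writes its own file: no cross-client contention
     * on a single shared, striped file. */
    char path[256];
    snprintf(path, sizeof(path), "outdir/rank%05d.dat", rank);

    char *buf = malloc(CHUNK_BYTES);
    memset(buf, 'x', CHUNK_BYTES);

    FILE *f = fopen(path, "wb");
    if (f) {
        fwrite(buf, 1, CHUNK_BYTES, f);
        fclose(f);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Balancing which targets back which files (so each target serves multiple clients and each client spreads files over multiple targets) is then a placement decision left to the application or its I/O library.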