Abstract
Recent research has shown that applications often implement crash consistency incorrectly. We present the Crash-Consistent File System (ccfs), a file system that improves the correctness of application-level crash-consistency protocols while maintaining high performance. A key idea in ccfs is the abstraction of a stream. Within a stream, updates are committed in program order, improving correctness; across streams, there are no ordering restrictions, enabling scheduling flexibility and high performance. We empirically demonstrate that applications running atop ccfs achieve high levels of crash consistency. Further, we show that ccfs performs well under standard file-system benchmarks: in the worst case it is on par with the highest-performing modes of Linux ext4, and in some cases it is notably better. Overall, we demonstrate that a modern file system can deliver both application correctness and high performance.
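To make the stream ordering rule concrete, here is a toy Python model of the rule as stated above. It is only an illustration under stated assumptions, not the ccfs implementation: the stream names, the update labels, and the functions `legal_crash_states` and `is_legal` are all hypothetical. The model assumes that, after a crash, the surviving updates of each stream form a prefix of that stream's program order, while prefixes of different streams may be combined in any way.

```python
from itertools import product
from typing import Dict, List, Tuple

# Toy model only (hypothetical, not the ccfs implementation): each stream
# maps to the updates the application issued on it, in program order.
Streams = Dict[str, List[str]]
CrashState = Dict[str, Tuple[str, ...]]


def legal_crash_states(streams: Streams) -> List[CrashState]:
    """Enumerate every post-crash state allowed by the stream rule.

    Within a stream, the persisted updates must be a prefix of program
    order; across streams, any combination of prefixes is allowed.
    """
    names = sorted(streams)
    # Each stream independently persists its first k updates, 0 <= k <= len.
    choices = [range(len(streams[name]) + 1) for name in names]
    states: List[CrashState] = []
    for lengths in product(*choices):
        states.append(
            {name: tuple(streams[name][:k]) for name, k in zip(names, lengths)}
        )
    return states


def is_legal(streams: Streams, persisted: CrashState) -> bool:
    """Return True if every stream's persisted updates form a program-order prefix."""
    for name, updates in streams.items():
        survived = tuple(persisted.get(name, ()))
        if tuple(updates[: len(survived)]) != survived:
            return False
    return True


if __name__ == "__main__":
    # Stream "A": an atomic-replace protocol (write a temp file, then rename it).
    # Stream "B": an unrelated log append with no ordering tie to stream "A".
    streams = {
        "A": ["write(tmp)", "rename(tmp, file)"],
        "B": ["append(log)"],
    }
    print(len(legal_crash_states(streams)))  # 3 prefixes of A x 2 of B = 6
    # A rename that survives without its preceding write is ruled out.
    print(is_legal(streams, {"A": ("rename(tmp, file)",), "B": ()}))  # False
```

The point the model captures is that ordering is enforced only where the application expresses it, within one stream, so the log append in stream B may persist before, between, or after the updates of stream A. A file system that reordered freely could additionally leave the rename on disk without the write it depends on, which is the kind of reordering that can break such an update protocol.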
Index Terms
- Application Crash Consistency and Performance with CCFS