Abstract
Hardware consolidation in the datacenter often leads to scalability bottlenecks from heavy utilization of critical resources, such as storage and network bandwidth. Client-side caching on durable media is already applied at the block level to reduce the storage backend load, but it has been criticized for added overhead, restricted sharing, and possible data loss on client crash. We introduce a journal to the kernel-level client of an object-based distributed filesystem to improve durability at high I/O performance and reduced shared resource utilization. Storage virtualization at the file interface achieves clear consistency semantics across data and metadata, supports native file sharing among clients, and provides flexible configuration of durable data staging at the host. Using a prototype that we have implemented, we experimentally quantify the performance and efficiency of the proposed Arion system in comparison to a production system. We run microbenchmarks and application-level workloads over a local cluster and a public cloud. We demonstrate latency reduced by 60% and performance improved by up to 150%, with server network and disk bandwidth reduced by 41% and 77%, respectively. The performance improvement reaches 92% with 16 relational databases as clients and gets as high as 11.3x with two key-value stores as clients.
Index Terms
- Client-Side Journaling for Durable Shared Storage