ABSTRACT
We introduce split-level I/O scheduling, a new framework that splits I/O scheduling logic across handlers at three layers of the storage stack: block, system call, and page cache. We demonstrate that traditional block-level I/O schedulers are unable to meet throughput, latency, and isolation goals. Using the split-level framework, we build a variety of novel schedulers that readily achieve these goals: our Actually Fair Queuing scheduler reduces priority misallocation by 28x; our Split-Deadline scheduler reduces tail latencies by 4x; our Split-Token scheduler reduces sensitivity to interference by 6x. We show that the framework is general and operates correctly with disparate file systems (ext4 and XFS). Finally, we demonstrate that split-level scheduling serves as a useful foundation for databases (SQLite and PostgreSQL), hypervisors (QEMU), and distributed file systems (HDFS), delivering improved isolation and performance in these important application scenarios.