ABSTRACT
We take a comprehensive look at packet corruption in data center networks, which leads to packet losses and application performance degradation. By studying 350K links across 15 production data centers, we find that the extent of corruption losses is significant and that its characteristics differ markedly from congestion losses. Corruption impacts fewer links than congestion but imposes a heavier loss rate; unlike congestion, the corruption rate on a link is stable over time and is not correlated with the link's utilization.
Based on these observations, we developed CorrOpt, a system to mitigate corruption. To minimize corruption losses, it intelligently selects which corrupting links can be safely disabled, while ensuring that each top-of-rack switch has a minimum number of paths to reach other switches. CorrOpt also recommends specific actions (e.g., replace cables, clean connectors) to repair disabled links, based on our analysis of common symptoms of different root causes of corruption. Our recommendation engine has been deployed in over seventy data centers of a large cloud provider. Our analysis shows that, compared to the current state of the art, CorrOpt can reduce corruption losses by three to six orders of magnitude and improve repair accuracy by 60%.
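The link-selection idea described above can be illustrated with a minimal sketch. It is not CorrOpt's actual algorithm, which the abstract does not specify. It assumes a simple greedy heuristic over a tiered topology: corrupting links are visited worst first, and a link is disabled only if every top-of-rack (ToR) switch still keeps at least min_paths upward paths to the top tier. All names here (count_paths, select_links_to_disable, min_paths, the toy topology and the corruption rates) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A link is (lower-tier switch, upper-tier switch); traffic climbs upward.
Link = Tuple[str, str]


def count_paths(switch: str, active: Set[Link], top_tier: Set[str]) -> int:
    """Count upward paths from `switch` to any top-tier switch over active links."""
    if switch in top_tier:
        return 1
    return sum(count_paths(up, active, top_tier)
               for (low, up) in active if low == switch)


def select_links_to_disable(links: List[Link],
                            corruption_rate: Dict[Link, float],
                            tors: List[str],
                            top_tier: Set[str],
                            min_paths: int) -> List[Link]:
    """Greedy heuristic (an assumption, not the paper's method): try to
    disable the most-corrupting links first, and skip any link whose removal
    would leave some ToR with fewer than `min_paths` paths."""
    active = set(links)
    disabled: List[Link] = []
    worst_first = sorted(corruption_rate, key=corruption_rate.get, reverse=True)
    for link in worst_first:
        trial = active - {link}
        if all(count_paths(t, trial, top_tier) >= min_paths for t in tors):
            active = trial
            disabled.append(link)
    return disabled


# Toy three-tier topology: 2 ToRs, 2 aggregation switches, 2 core switches.
links = [("T1", "A1"), ("T1", "A2"), ("T2", "A1"), ("T2", "A2"),
         ("A1", "C1"), ("A1", "C2"), ("A2", "C1"), ("A2", "C2")]
# Made-up corruption loss rates for three corrupting links.
rates = {("T1", "A1"): 1e-3, ("A2", "C1"): 1e-4, ("A2", "C2"): 1e-5}

result = select_links_to_disable(links, rates, ["T1", "T2"], {"C1", "C2"}, 2)
print(result)
```

In this toy case, only the worst link can be disabled when each ToR must keep two paths: once T1 loses its uplink to A1, removing either A2 uplink would drop T1 to a single path. A greedy pass like this is simple but need not find the best set of links to disable.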
Index Terms
- Understanding and Mitigating Packet Corruption in Data Center Networks