DOI: 10.1145/3098822.3098849
research-article

Understanding and Mitigating Packet Corruption in Data Center Networks

Published: 07 August 2017

ABSTRACT

We take a comprehensive look at packet corruption in data center networks, which leads to packet losses and application performance degradation. By studying 350K links across 15 production data centers, we find that the extent of corruption losses is significant and that its characteristics differ markedly from congestion losses. Corruption impacts fewer links than congestion, but imposes a heavier loss rate; and unlike congestion, the corruption rate on a link is stable over time and is not correlated with the link's utilization.

Based on these observations, we developed CorrOpt, a system to mitigate corruption. To minimize corruption losses, it intelligently selects which corrupting links can be safely disabled, while ensuring that each top-of-rack switch has a minimum number of paths to reach other switches. CorrOpt also recommends specific actions (e.g., replace cables, clean connectors) to repair disabled links, based on our analysis of common symptoms of different root causes of corruption. Our recommendation engine has been deployed in over seventy data centers of a large cloud provider. Our analysis shows that, compared to the current state of the art, CorrOpt can reduce corruption losses by three to six orders of magnitude and improve repair accuracy by 60%.
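
The link-selection idea described above can be illustrated with a minimal sketch. This is not CorrOpt's actual algorithm, which the abstract does not specify. It assumes a layered topology given as (lower, upper) switch pairs, a simple greedy order (worst corruption rate first), and a capacity rule that every top-of-rack switch must keep at least a fraction `min_fraction` of its original upward paths. All function and variable names are hypothetical.

```python
from collections import defaultdict


def count_paths(node, uplinks, disabled, memo):
    """Count enabled upward paths from `node` to any top-tier switch."""
    if node in memo:
        return memo[node]
    ups = uplinks.get(node, [])
    if not ups:
        # A switch with no uplinks is in the top tier: one (empty) path.
        memo[node] = 1
        return 1
    total = sum(
        count_paths(up, uplinks, disabled, memo)
        for up in ups
        if (node, up) not in disabled
    )
    memo[node] = total
    return total


def select_links_to_disable(links, tors, corruption, min_fraction):
    """Greedily disable corrupting links, worst first (an assumed policy).

    A link is disabled only if every ToR still keeps at least
    `min_fraction` of the upward paths it had with all links enabled.
    """
    uplinks = defaultdict(list)
    for lower, upper in links:
        uplinks[lower].append(upper)

    baseline = {t: count_paths(t, uplinks, set(), {}) for t in tors}
    disabled = set()
    for link in sorted(corruption, key=corruption.get, reverse=True):
        trial = disabled | {link}
        memo = {}
        if all(
            count_paths(t, uplinks, trial, memo) >= min_fraction * baseline[t]
            for t in tors
        ):
            disabled = trial
    return disabled


# Hypothetical example: 2 ToRs, 2 aggregation switches, 2 spines, full mesh.
links = [
    ("t1", "a1"), ("t1", "a2"), ("t2", "a1"), ("t2", "a2"),
    ("a1", "s1"), ("a1", "s2"), ("a2", "s1"), ("a2", "s2"),
]
tors = ["t1", "t2"]
# Made-up corruption loss rates for three corrupting links.
corruption = {("t1", "a1"): 1e-3, ("a1", "s1"): 1e-4, ("t1", "a2"): 1e-5}

print(sorted(select_links_to_disable(links, tors, corruption, 0.5)))
# → [('a1', 's1'), ('t1', 'a1')]
```

In this example the third corrupting link, `("t1", "a2")`, stays enabled because disabling it would leave `t1` with no upward paths. Raising `min_fraction` to 0.75 keeps `("t1", "a1")` enabled as well, which shows how a stricter capacity constraint trades off against corruption losses.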

Published in

SIGCOMM '17: Proceedings of the Conference of the ACM Special Interest Group on Data Communication
August 2017
515 pages
ISBN: 9781450346535
DOI: 10.1145/3098822

Copyright © 2017 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 554 of 3,547 submissions, 16%
