ABSTRACT
Network operators often face remote outages in transit networks that cause significant downtime, sometimes on the order of minutes. The underlying problem is that BGP, the Internet routing protocol, often converges slowly after such outages, because large bursts of messages have to be processed and propagated router by router.
In this paper, we present SWIFT, a fast-reroute framework that enables routers to restore connectivity within a few seconds of a remote outage. SWIFT is based on two novel techniques. First, SWIFT deals with slow outage notification by predicting the overall extent of a remote failure from a few control-plane (BGP) messages. The key insight is that significant inference speed can be gained at the price of some accuracy. Second, SWIFT introduces a new data-plane encoding scheme, which enables quick and flexible updates of the affected forwarding entries. SWIFT is deployable on existing devices, without modifying BGP.
We present a complete implementation of SWIFT and demonstrate that it is both fast and accurate. In our experiments with real BGP traces, SWIFT predicts the extent of a remote outage within a few seconds with an accuracy of ~90% and can restore connectivity for 99% of the affected destinations.
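To make the first technique more concrete, the following is a minimal sketch of predicting the extent of a remote failure from a handful of withdrawals. It is an illustration only, not the SWIFT algorithm: the scoring rule, the function names (`links_of`, `build_link_index`, `infer_failed_link`, `predict_affected`), the AS numbers, and the prefixes are all assumptions made for this example. The per-link index loosely mirrors the idea of grouping forwarding entries so that they can be updated together, but it is not the data-plane encoding scheme itself.

```python
from collections import Counter, defaultdict


def links_of(as_path):
    """Return the AS links (adjacent AS pairs) along an AS path."""
    return list(zip(as_path, as_path[1:]))


def build_link_index(rib):
    """Map each AS link to the set of prefixes whose path crosses it."""
    index = defaultdict(set)
    for prefix, as_path in rib.items():
        for link in links_of(as_path):
            index[link].add(prefix)
    return index


def infer_failed_link(rib, index, withdrawn):
    """Pick the link that best explains the withdrawals seen so far.

    Hypothetical score (an assumption of this sketch): the share of
    withdrawals that cross the link, multiplied by the share of the
    link's prefixes that have already been withdrawn.
    """
    if not withdrawn:
        raise ValueError("need at least one withdrawal to infer a failure")
    hits = Counter()
    for prefix in withdrawn:
        for link in links_of(rib[prefix]):
            hits[link] += 1

    def score(link):
        return (hits[link] / len(withdrawn)) * (hits[link] / len(index[link]))

    return max(hits, key=score)


def predict_affected(index, link, withdrawn):
    """Prefixes that cross the inferred link but are not yet withdrawn."""
    return index[link] - set(withdrawn)


# Toy routing table: prefix -> AS path (all values are made up).
rib = {
    "10.0.1.0/24": [1, 2, 3, 4],
    "10.0.2.0/24": [1, 2, 3, 5],
    "10.0.3.0/24": [1, 2, 3, 6],
    "10.0.4.0/24": [1, 2, 7],
    "10.0.5.0/24": [1, 8, 9],
}
index = build_link_index(rib)

# Only two withdrawals have arrived so far.
early_withdrawals = ["10.0.1.0/24", "10.0.2.0/24"]
failed = infer_failed_link(rib, index, early_withdrawals)
to_reroute = predict_affected(index, failed, early_withdrawals)
print(failed, sorted(to_reroute))
```

In this toy table, both early withdrawals cross the link between AS 2 and AS 3, and two of the three prefixes on that link are already withdrawn, so the sketch selects that link and flags the remaining prefix on it for rerouting before its own withdrawal arrives. This is the speed-for-accuracy trade mentioned in the abstract: the guess can be wrong, but it is available after very few messages.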
Index Terms
- SWIFT: Predictive Fast Reroute