DOI: 10.1145/2934872.2934879
research-article
Public Access

Trumpet: Timely and Precise Triggers in Data Centers

Published: 22 August 2016

ABSTRACT

As data centers grow larger and strive to provide tight performance and availability SLAs, their monitoring infrastructure must move from passive systems that provide aggregated inputs to human operators to active systems that enable programmed control. In this paper, we propose Trumpet, an event monitoring system that leverages CPU resources and end-host programmability to monitor every packet and report events at millisecond timescales. Trumpet users can express many *network-wide events*, and the system efficiently detects these events using *triggers* at end-hosts. Through careful design, Trumpet can evaluate triggers by inspecting every packet at full line rate even on future generations of NICs, scale to thousands of triggers per end-host while bounding packet processing delay to a few microseconds, and report events to a controller within 10 milliseconds, even in the presence of attacks. We demonstrate these properties using an implementation of Trumpet, and also show that it allows operators to describe new network events such as detecting correlated bursts and loss, identifying the root cause of transient congestion, and detecting short-term anomalies at the scale of a data center tenant.
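The abstract describes a split of work: end-hosts evaluate triggers against every packet, and a controller combines their reports into network-wide events. The Python sketch below illustrates that split under loudly stated assumptions; it is not Trumpet's implementation, event language, or API. The names `Trigger`, `Packet`, and `network_wide_events`, the 5-tuple flow key, the "bytes per flow exceed a threshold" predicate, the fixed 10 ms interval, and the "reported by at least k hosts" rule are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, DefaultDict, Dict, List, Set, Tuple

# Hypothetical flow key: (src_ip, dst_ip, src_port, dst_port, proto)
FlowKey = Tuple[str, str, int, int, str]
# Hypothetical event record: (trigger name, flow, bytes seen in the interval)
Event = Tuple[str, FlowKey, int]


@dataclass
class Packet:
    ts_ms: float   # arrival time in milliseconds
    flow: FlowKey
    size: int      # bytes


@dataclass
class Trigger:
    """Illustrative end-host trigger (hypothetical): report every matching
    flow whose byte count within one fixed time interval exceeds a threshold."""
    name: str
    match: Callable[[Packet], bool]   # per-packet filter
    threshold_bytes: int
    interval_ms: float = 10.0         # assumed interval length
    _window_start: float = field(default=0.0, init=False)
    _counts: DefaultDict[FlowKey, int] = field(
        default_factory=lambda: defaultdict(int), init=False)

    def on_packet(self, pkt: Packet) -> List[Event]:
        """Inspect one packet; return events from any interval it closes."""
        events: List[Event] = []
        if pkt.ts_ms - self._window_start >= self.interval_ms:
            events = self.end_interval()
            # Start the new interval on the interval grid containing pkt.
            self._window_start = pkt.ts_ms - (pkt.ts_ms % self.interval_ms)
        if self.match(pkt):
            # Per-packet work is only a filter check and a counter update.
            self._counts[pkt.flow] += pkt.size
        return events

    def end_interval(self) -> List[Event]:
        """Evaluate the predicate for every counted flow, then reset state."""
        events = [(self.name, flow, n) for flow, n in self._counts.items()
                  if n > self.threshold_bytes]
        self._counts.clear()
        return events


def network_wide_events(host_events: Dict[str, List[Event]],
                        min_hosts: int) -> List[str]:
    """Illustrative controller check (hypothetical rule): a trigger becomes a
    network-wide event when at least `min_hosts` distinct hosts report it."""
    hosts_per_trigger: DefaultDict[str, Set[str]] = defaultdict(set)
    for host, events in host_events.items():
        for name, _flow, _nbytes in events:
            hosts_per_trigger[name].add(host)
    return sorted(name for name, hosts in hosts_per_trigger.items()
                  if len(hosts) >= min_hosts)


if __name__ == "__main__":
    tcp_only = Trigger("heavy-tcp-flow", match=lambda p: p.flow[4] == "tcp",
                       threshold_bytes=1000)
    a = ("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
    b = ("10.0.0.3", "10.0.0.2", 5678, 53, "udp")
    events: List[Event] = []
    for pkt in [Packet(1.0, a, 600), Packet(2.0, a, 600),
                Packet(3.0, b, 5000), Packet(12.0, a, 100)]:
        events.extend(tcp_only.on_packet(pkt))
    print(events)  # one event for flow a (1200 bytes); UDP flow b is filtered out
    print(network_wide_events({"h1": events, "h2": events, "h3": []}, min_hosts=2))
```

In this sketch the per-packet path only filters and increments a counter, and the threshold test runs once per interval; that is one plausible way to keep per-packet cost low, but the paper's own data structures and optimizations are not reproduced here.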

Supplemental Material

p129.mp4 (MP4, 380.5 MB)

          Published in

          SIGCOMM '16: Proceedings of the 2016 ACM SIGCOMM Conference
          August 2016
          645 pages
          ISBN: 9781450341936
          DOI: 10.1145/2934872

          Copyright © 2016 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          SIGCOMM '16 Paper Acceptance Rate: 39 of 231 submissions, 17%. Overall Acceptance Rate: 554 of 3,547 submissions, 16%.
