ABSTRACT
As data centers grow larger and strive to provide tight performance and availability SLAs, their monitoring infrastructure must move from passive systems that provide aggregated inputs to human operators to active systems that enable programmed control. In this paper, we propose Trumpet, an event monitoring system that leverages CPU resources and end-host programmability to monitor every packet and report events at millisecond timescales. Trumpet users can express many *network-wide events*, and the system efficiently detects these events using *triggers* at end-hosts. Through careful design, Trumpet can evaluate triggers by inspecting every packet at full line rate even on future generations of NICs, scale to thousands of triggers per end-host while bounding packet processing delay to a few microseconds, and report events to a controller within 10 milliseconds, even in the presence of attacks. We demonstrate these properties using an implementation of Trumpet, and also show that it allows operators to describe new network events, such as detecting correlated bursts and loss, identifying the root cause of transient congestion, and detecting short-term anomalies at the scale of a data center tenant.
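To make the idea of an end-host trigger concrete, the following is a minimal sketch, not Trumpet's implementation. It assumes a trigger consists of a packet filter, a flow key, and a byte-count threshold checked once per 10 ms window; the `Trigger` class, its field names, and the dictionary-based packet record are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical packet record with keys "src", "dst", "size", "ts_ms".
Packet = Dict[str, Any]
Event = Tuple[str, Tuple, int]  # (trigger name, flow key, bytes in window)


@dataclass
class Trigger:
    """Illustrative trigger: filter + flow key + threshold (assumed structure)."""
    name: str
    match: Callable[[Packet], bool]      # which packets the trigger observes
    key: Callable[[Packet], Tuple]       # how packets are grouped into flows
    threshold_bytes: int                 # report flows exceeding this volume
    interval_ms: int = 10                # evaluation window (assumed value)
    _bytes: Dict[Tuple, int] = field(default_factory=lambda: defaultdict(int))
    _window_start: int = 0

    def on_packet(self, pkt: Packet) -> List[Event]:
        """Update per-flow counters; return events when a window closes."""
        events: List[Event] = []
        if pkt["ts_ms"] - self._window_start >= self.interval_ms:
            events = self.flush()
            # Align the new window to a multiple of the interval.
            self._window_start = pkt["ts_ms"] - (pkt["ts_ms"] % self.interval_ms)
        if self.match(pkt):
            self._bytes[self.key(pkt)] += pkt["size"]
        return events

    def flush(self) -> List[Event]:
        """Evaluate the predicate for every flow, then reset the window."""
        events = [(self.name, k, v) for k, v in self._bytes.items()
                  if v > self.threshold_bytes]
        self._bytes.clear()
        return events


burst = Trigger(
    name="burst",
    match=lambda p: True,
    key=lambda p: (p["src"], p["dst"]),
    threshold_bytes=1000,
)
reported: List[Event] = []
for pkt in [
    {"src": "a", "dst": "b", "size": 600, "ts_ms": 0},
    {"src": "a", "dst": "b", "size": 600, "ts_ms": 3},
    {"src": "c", "dst": "d", "size": 100, "ts_ms": 5},
    {"src": "a", "dst": "b", "size": 50, "ts_ms": 10},  # closes the first window
]:
    reported.extend(burst.on_packet(pkt))
```

In this sketch every packet updates a counter and the threshold is checked only when a window closes, which keeps per-packet work small. A real system would also need to forward the resulting events to a controller, which is omitted here.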