ABSTRACT
We present dRMT (disaggregated Reconfigurable Match-Action Table), a new architecture for programmable switches. dRMT overcomes two important restrictions of RMT, the predominant pipeline-based architecture for programmable switches: (1) table memory is local to an RMT pipeline stage, implying that memory not used by one stage cannot be reclaimed by another, and (2) RMT is hardwired to always sequentially execute matches followed by actions as packets traverse pipeline stages. We show that these restrictions make it difficult to execute programs efficiently on RMT.
dRMT resolves both issues by disaggregating the memory and compute resources of a programmable switch. Specifically, dRMT moves table memories out of pipeline stages and into a centralized pool that is accessible through a crossbar. In addition, dRMT replaces RMT's pipeline stages with a cluster of processors that can execute match and action operations in any order.
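As a rough illustration of restriction (1) and of what pooling the memory buys, the toy model below compares stage-local memory with a shared pool. It is only a sketch under stated assumptions, not the paper's model: the tables are assumed to form a dependency chain, so no two tables can share a stage; memory is counted in abstract blocks; and the function names, table sizes, and per-stage capacity are all hypothetical.

```python
import math
from typing import List


def stages_needed_stage_local(table_blocks: List[int], blocks_per_stage: int) -> int:
    """Toy stage-local model: each table occupies whole stages of its own.

    Assumes the tables form a dependency chain, so leftover blocks in a
    stage cannot be handed to the next table.
    """
    return sum(math.ceil(blocks / blocks_per_stage) for blocks in table_blocks)


def stage_equivalents_needed_pooled(table_blocks: List[int], blocks_per_stage: int) -> int:
    """Toy pooled model: all tables draw from one shared pool of blocks.

    The result is expressed in stage-sized units so the two models can be
    compared directly.
    """
    return math.ceil(sum(table_blocks) / blocks_per_stage)


# Hypothetical workload: four dependent tables, 8 memory blocks per stage.
tables = [3, 9, 2, 10]
local = stages_needed_stage_local(tables, blocks_per_stage=8)
pooled = stage_equivalents_needed_pooled(tables, blocks_per_stage=8)
print(f"stage-local: {local} stages, pooled: {pooled} stage-equivalents")
```

In this toy setting the stage-local layout strands the unused blocks in each stage, while the pooled layout only needs enough memory to cover the total demand.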
We show how to schedule a P4 program on dRMT at compile time to guarantee deterministic throughput and latency. We also present a hardware design for dRMT and analyze its feasibility and chip area. Our results show that dRMT can run programs at line rate with fewer processors compared to RMT, and avoids performance cliffs when there are not enough processors to run a program at line rate. dRMT's hardware design incurs a modest increase in chip area relative to RMT, mainly due to the crossbar.