Future of Networking on
Current Cray Products
May 21, 2001

Jay Blakeborough

Cray Inc.
1340 Mendota Heights Road
Mendota Heights, MN 55212
USA

jb@cray.com
www.cray.com

ABSTRACT:
This white paper discusses networking issues confronting Cray Inc. on currently shipping Cray platforms.  It briefly notes the directions that we have investigated. The primary focus of the paper is the current status of the Cray L7R (Layer 7 Router), which Cray has licensed from Essential Communications for integration into our products. Planned interfaces are outlined and available performance data on Gigabit Ethernet is reported.

 
KEYWORDS:
Networking, Gigabit Ethernet, Cray L7R

Introduction

While I might be the author of this paper, I make absolutely no claims about being a networking expert.  Nearly everything that I've learned on the subject, and the directions chosen, have come from a collaborative effort by several folks at Cray Inc. (Cray) and Essential Communications.  Please refer to the Acknowledgments section for a list of contributors.

This paper grew out of Cray customers' desire for network access to currently shipping Cray platforms via Gigabit Ethernet (GigE).  Given the outcome of our efforts to provide just that, we believe a fair amount of history and explanation is necessary to understand the challenges we have experienced and will face as we move forward.  To that end, the paper briefly discusses Cray Networking History, UNICOS TCP/IP, Research On UNICOS Networking, Options Considered, The Cray L7R, Current Performance, and Future Direction.
 

Cray Networking History

It's not quite the big bang, but for Cray, the first semblance of networking was through Cray Station software to the Cray Operating System (COS).  The connections were made via FEI hardware and a proprietary protocol called the Station Control Protocol (SCP).  For migration purposes, the SCP protocol was brought forward to the UNICOS Operating System as USCP.  Since UNICOS was based on AT&T System V  UNIX and the Berkeley Software Distribution (BSD) UNIX, it provided the Internet Protocol (IP) and the Transmission Control Protocol (TCP) and hence a path to the future of networking.

On Cray systems with Model E I/O (skipping over Model D and VME-based I/O for brevity's sake), UNICOS supported TCP/IP across HIPPI, FDDI and HyperChannel.  The driver protocol between the mainframe and the I/O Subsystem (IOS) was message or packet based.  That is, for every read posting or write to the actual hardware device, a message was sent between the mainframe and the IOS.  The actual TCP/IP packet was transferred to or from mainframe memory via Direct Memory Access (DMA) by the IOS.  Nearly all hardware protocols were provided by native interfaces in the IOS.  The exception was ATM OC-3, which required a Bus-Based Gateway (BBG) connection through HIPPI.

The introduction of the GigaRing-based I/O systems provided an opportunity to move forward with standard interfaces using SBus-based network interface cards.  The Multi-purpose I/O Node (MPN) supported SBus versions of 10/100 BaseT Ethernet, FDDI, and ATM OC-3.  The performance characteristics of HIPPI suggested the requirement for a special purpose node, the HPN.  The GigaRing also brought the advent of a new unified network driver protocol between the mainframe and the I/O nodes.  The generic driver was designed as a table-based protocol with put and take pointers.  The intent was that neither the mainframe nor the I/O node would have to interrupt the other with a message each time a network I/O was to be processed.  Simply reading and writing the requests and responses and updating the respective put and take table pointers should have provided increased and more scalable networking throughput.  In reality, the first implementation of the protocol proved to be unstable and gave unpredictable timings.  It was extended to include messaging between the mainframe and the I/O node to ensure that data was sent or received in a more timely fashion.  The unfortunate result was a decrease in maximum transfer rates on media which do not support large Maximum Transfer Units (MTUs).  Performance across HIPPI, using 64K-byte MTUs, was acceptable, while performance on 10/100 BaseT Ethernet with 1500-byte MTUs was, to be truthful, abysmal (approximately one third of the maximum potential).
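
The put and take pointer idea can be illustrated with a minimal sketch.  This is not the GigaRing driver: the class name, method names, and table size are hypothetical, and a real implementation would place the table in memory visible to both the mainframe and the I/O node rather than in a single process.  The sketch only shows why neither side needs to send the other a message for each request.

```python
class RequestTable:
    """Fixed-size table shared by one producer and one consumer (hypothetical)."""

    def __init__(self, slots):
        # One slot always stays empty so that put == take can only mean "empty".
        self.entries = [None] * slots
        self.put = 0   # next slot to write; advanced only by the producer
        self.take = 0  # next slot to read; advanced only by the consumer

    def post(self, request):
        """Producer side: add a request without signalling the consumer."""
        next_put = (self.put + 1) % len(self.entries)
        if next_put == self.take:
            return False  # table is full; the caller retries later
        self.entries[self.put] = request
        self.put = next_put
        return True

    def poll(self):
        """Consumer side: return the next request, or None if the table is empty."""
        if self.take == self.put:
            return None
        request = self.entries[self.take]
        self.take = (self.take + 1) % len(self.entries)
        return request
```

Because the consumer only learns of new work when it next calls poll(), latency in such a scheme depends on how often each side looks at the table, which is one plausible reading of the unpredictable timings described above.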

The following section discusses UNICOS TCP/IP at a very high level and concentrates on the GigaRing-based systems implementation.
 

UNICOS TCP/IP

The focus of this paper is UNICOS.  The UNICOS/mk Operating System for the Cray T3E contains virtually the same TCP/IP stack and unfortunately has some additional challenges which are beyond the scope of this work.

The UNICOS TCP/IP stack is based primarily on an early BSD version of the code.  It was ported to the Cray Parallel Vector Processor (PVP) hardware long before the UNICOS kernel was multi-threaded.  Even when the code was multi-threaded, the philosophy was to avoid interrupting user work or even other system work to handle networking interrupts.  There was a time when the debate raged regarding providing interactive access to the Cray at all.  Cray CPUs were expensive and many of our customers didn't want a CPU interrupted because some user hit <ENTER> on their keyboard.  Consequently, the handling of networking interrupts was relegated to a lower service priority in UNICOS.

The Multi-Level Security (MLS) features were also added during the 8.0 and the 9.0 releases of UNICOS.  The code considerably changed the landscape of the TCP/IP kernel stack with nearly countless security IF tests.  The networking stack in BSD UNIX has since changed and improved significantly, but the MLS code and its inherent promises reduced our ability to take advantage of newer BSD or other code.

All performance efforts prior to UNICOS 10.0.0.8 were directed toward optimizing large transfers using large MTU interfaces.  For example, multi-stream HIPPI performance on UNICOS GigaRing-based systems is approximately 700 Mb/s (87% of theoretical peak).   Similar multi-stream transfers on 10/100 BaseT Ethernet yield approximately 40 Mb/s (less than 50% of theoretical peak).

In late 1995 and early 1996 many customers were promised that Cray would improve networking performance and provide new connectivity on GigaRing-based systems.  Things have not improved much since then.  What happened, you ask?  SGI (then Silicon Graphics) purchased Cray Research, Inc. in 1996 and quickly reallocated nearly all networking resources to support future product directions like the Origin line.  UNICOS wasn't particularly good at networking and Irix performed exceptionally well.  To some decision makers, it was clear where to focus our efforts.
 

Research On UNICOS Networking

After being sold by SGI in 2000, we began more concerted research efforts into UNICOS networking performance.  Many customers had asked, and continue to ask, for GigE connectivity.  Given that we perform poorly on 1500-byte MTU interfaces, and not as well as expected on 9000-byte MTU interfaces, providing GigE posed a significant challenge.

We began to research the bottlenecks.  There certainly was enough blame to go around.  Since the only major change going from Model E I/O-based machines to GigaRing-based machines was the creation of the unified network driver architecture, that received much of the initial focus. The table-based (with additional messages) protocol did reduce bandwidth.  It appears that it did so by exacerbating timing issues within UNICOS.  TCP is very sensitive to the number of and the speed at which acknowledgement packets can be returned to the other end point of the connection.  The new driver dramatically increased the delay in returning acknowledgements to the sending side causing the other end of the connection to assume a network congestion problem and thus reduce the speed at which it would send data.   This is normal behavior for TCP - necessary to be considered a good networking neighbor.
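
Setting congestion control aside, a simpler bound shows why acknowledgement latency matters so much.  A TCP sender can have at most one window of unacknowledged data outstanding, so its rate cannot exceed the window size divided by the time it takes for an acknowledgement to come back.  The sketch below computes that ceiling; the function name and the example values are assumptions for illustration, not measurements from UNICOS.

```python
def window_limited_rate(window_bytes, ack_delay_seconds):
    """Upper bound in bits per second for a sender that may have at most
    window_bytes unacknowledged and waits ack_delay_seconds for each ack."""
    return window_bytes * 8 / ack_delay_seconds


if __name__ == "__main__":
    # Hypothetical values: a 64K-byte window with 1 ms and 10 ms ack delays.
    for delay in (0.001, 0.010):
        rate = window_limited_rate(64 * 1024, delay)
        print(f"ack delay {delay * 1000:.0f} ms: at most {rate / 1e6:.1f} Mb/s")
```

With a fixed window, ten times the acknowledgement delay means one tenth of the ceiling, before the sender's congestion response reduces the rate any further.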

To determine just how fast UNICOS could send TCP packets, a UNICOS kernel was built that simply sent TCP packets as fast as possible and ignored the ability of a hardware interface to actually get the packets out to the network.  The UNICOS TCP stack produced approximately 6000 packets per second in this scenario.  Given even modest overhead for getting the packets out to the network, we hypothesize that the highest rate we could reasonably expect from UNICOS would be somewhere between 4000 and 5000 packets per second.  It's important to note that these rates were researched on a Cray SV1 system and that the TCP stack is mostly single threaded, suggesting that the highest rate achievable would be roughly 5000 packets per second, per CPU.

To put things in perspective, consider what kind of packet rates are required to run selected interfaces at theoretical speeds given their MTU sizes.  Note that for every two data packets received, one TCP acknowledgement should be returned to the sender.  Since there are cases where this does not happen and given some additional estimates on my part, the numbers below should definitely be considered approximations.  It is the magnitude that is important for this discussion.
 

TCP Packet Rates Necessary to Achieve Theoretical Peak

Interface Type   Max. Transfer Unit (MTU) Size   Approx. Packets/Second
HIPPI            64K bytes                       1,600
GigE             9000 bytes (Jumbo Frames)       14,000
GigE             1500 bytes                      85,000

Assuming we could almost perfectly thread the UNICOS TCP stack, and given the results of the packets per second experiment noted above, we would monopolize nearly 3 Cray SV1 CPUs to run GigE at industry-standard rates with 9000-byte MTUs.
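
The arithmetic behind these figures can be sketched as follows.  The link rates (800 Mb/s for HIPPI, 1 Gb/s for GigE) are nominal values assumed here, the function names are hypothetical, and the calculation counts only full-size data packets; it does not try to reproduce the acknowledgement traffic or the additional estimates that went into the table.

```python
def data_packets_per_second(link_bits_per_second, mtu_bytes):
    """Full-size data packets per second needed to fill the link."""
    return link_bits_per_second / (mtu_bytes * 8)


def cpus_needed(required_pps, per_cpu_pps):
    """CPUs consumed if packet processing scaled perfectly across CPUs."""
    return required_pps / per_cpu_pps


# Assumed nominal link rates: 800 Mb/s for HIPPI, 1 Gb/s for GigE.
INTERFACES = [
    ("HIPPI, 64K-byte MTU", 800e6, 64 * 1024),
    ("GigE, 9000-byte MTU", 1e9, 9000),
    ("GigE, 1500-byte MTU", 1e9, 1500),
]

PER_CPU_PPS = 5000  # upper end of the 4000 to 5000 estimate in the text

if __name__ == "__main__":
    for name, rate, mtu in INTERFACES:
        pps = data_packets_per_second(rate, mtu)
        print(f"{name}: {pps:,.0f} packets/s, "
              f"{cpus_needed(pps, PER_CPU_PPS):.1f} CPUs")
```

The raw rates come out at roughly 1,500, 13,900, and 83,300 packets per second, slightly below the rounded table values.  Using the table's 14,000 packets per second, cpus_needed(14000, 5000) gives 2.8, in line with the "nearly 3" CPUs quoted above; the same calculation for 1500-byte MTUs gives 17.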

Finally, it has been assumed by many (including the author at one time) that we could simply connect to a reasonably fast networking box or switch using HIPPI and have it route packets to the rest of the world via GigE.  This reasoning breaks down because of something that I refer to as the MTU challenge.  This challenge is actually a feature of TCP.
 
When a TCP connection is established across multiple systems and/or routers, the overall connection MTU is negotiated down to the least common denominator.  Once again, this kind of behavior is expected if one is to be considered a good network neighbor.  Assume the Cray ignored this and sent out 64K-byte packets even when some system along the connection path could only handle 1500-byte packets.  The system in the path prior to the small packet system would have to fragment the 64K-byte packets into 1500-byte chunks.  The ultimate data receiver would have to buffer up the packets for reassembly since they could potentially come in out of order.  This causes additional, unnecessary work for other systems on the network and is therefore frowned upon by the networking powers that be.
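
The cost of ignoring the smallest MTU can be estimated with a short sketch.  It assumes IPv4 with a 20-byte header and no options, and it uses a 65280-byte datagram to stand in for the "64K-byte" packet, since an IPv4 datagram cannot exceed 65535 bytes; both the datagram size and the function names are assumptions made for illustration.

```python
import math

IP_HEADER_BYTES = 20  # IPv4 header without options (assumption)


def path_mtu(link_mtus):
    """The usable MTU of a path is that of its most restrictive link."""
    return min(link_mtus)


def fragment_count(datagram_bytes, mtu):
    """Number of IPv4 fragments needed to carry one datagram across a link."""
    payload = datagram_bytes - IP_HEADER_BYTES
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (mtu - IP_HEADER_BYTES) // 8 * 8
    return math.ceil(payload / per_fragment)


if __name__ == "__main__":
    mtu = path_mtu([65280, 9000, 1500])  # HIPPI, jumbo GigE, standard GigE
    print("path MTU:", mtu)
    print("fragments per large datagram:", fragment_count(65280, mtu))
```

Under these assumptions, each large datagram turns into 45 fragments on the 1500-byte link, all of which the receiver must hold until the last one arrives.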

Options Considered

After some investigation, the prospects did not look too good.  We considered the following three options.

Do Nothing

Some management personnel, who have since left Cray, had noted that Cray is not a networking company.  We should choose one interface that we can perform well on (e.g. HIPPI) and say, "That's it.  Take it or leave it."  Believe me, this sounded like a reasonable course of action from the Cray perspective.  Even if we are able to provide GigE, what will be desired next?

After further consideration, and prompting from some of our most important customers, it was decided that this approach was not justifiable in the long run.  We need to help our customers connect our systems to industry standard networks.

New Hardware Channel for the Cray SV1 system to Allow PCI

Since the unified network driver and the GigaRing appeared to be at least some of the problem, we considered replacing the GigaRing node chips with a new chip-set to support PCI connectivity directly to the Cray SV1 system.  This approach had the following drawbacks:

Work with Networking Vendor on a TCP Circuit Router (L7R)

Research into the UNICOS networking performance problems suggested a possible course of action.  We could use another system to fragment and coalesce TCP data packets without having our neighbors contact the networking police.  The concept was further developed as TCP Circuits and, in work with Essential Communications, was dubbed a Layer-7 Router.

This router allows the Cray system to communicate at MTU sizes above 9000 bytes over HIPPI while the ultimate connection from the router could be at 9000-byte or even 1500-byte MTUs.  The router could initially provide GigE, but had the potential to add new interfaces without changing the connection to the Cray system.
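
The underlying idea can be sketched in a few lines.  TCP delivers a byte stream, so the boundaries between segments carry no meaning and a device in the middle may re-cut the stream to suit each side.  The function below is purely illustrative and is not how the L7R is implemented; a real circuit router must also deal with acknowledgements, sequence numbers, and flow control on both sides.  The name resegment and the segment sizes are hypothetical.

```python
def resegment(segments, out_size):
    """Re-cut a TCP byte stream, given as a list of payloads, into payloads
    of at most out_size bytes.  The byte order is preserved exactly."""
    stream = b"".join(segments)
    return [stream[i:i + out_size] for i in range(0, len(stream), out_size)]


if __name__ == "__main__":
    from_cray = [bytes(60000)]                 # one large payload from the HIPPI side
    to_network = resegment(from_cray, 1460)    # split for a 1500-byte MTU link
    back_to_cray = resegment(to_network, 60000)  # coalesce in the other direction
    print(len(to_network), "small payloads;", len(back_to_cray), "large payload")
```

In this arrangement the Cray system handles a small number of large packets while the router absorbs the high packet rate on the GigE side.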

Besides the need for our customers to purchase additional networking hardware, one drawback or shortcoming of this solution is that it provides no performance benefits for UDP transfers.  To create the circuit, the router requires static connection information provided by TCP.  UDP is a connectionless networking protocol running over IP and consequently the router cannot make necessary assumptions about packet flow.  The bad news is that UNICOS NFS performance will therefore not be increased by this solution.

One final caveat with this solution is that the TCP Circuit concept relies on having all inbound and outbound TCP packets come through the L7R.  It has been common for several sites to configure their Cray networks as nonsymmetrical, using two separate network connections: one for access to and one for access from the Cray.  This can no longer be done when using the L7R.

The Cray L7R

Given the heading, it seems clear which option was chosen.  We have been working with Essential Communications to create the Cray L7R.  We chose to partner with them because of their commitment to HIPPI and because of their willingness to explore additional networking interfaces for the L7R.

Cray Inc. has licensed the product from Essential and has been working on a Cray-integrated version.   The integration provides us the ability to boot and manage the L7R as you would another GigaRing ION and should make product support more common in that regard.  This also allows us to specifically tune the product for use with our systems.

Connecting to the L7R requires a serial HIPPI connection.  The HPN HIPPI implementation is parallel.  The connection can be made through an appropriately configured HIPPI switch or a parallel-to-serial HIPPI modem.   Both of these HIPPI products are manufactured by Essential and can be purchased from and will be supported by Cray for connections to the Cray systems.
 
The first customer ship of the Cray L7R has been planned for late May 2001 and will support GigE.  A 10/100 BaseT interface will be included and will be supported shortly.  An ATM OC-12 interface is planned for later in the fall of 2001.

Current Performance

All efforts to date have been concentrated on UNICOS performance with a Cray SV1 system through a Cray L7R to an Origin 2000 running GigE.  Cray T3E performance numbers are included for reference but very little effort has been made to tune for UNICOS/mk at this point.

The method chosen to present performance was to:

  1. Vary the MTU size of the connection between the Cray system and the Cray L7R over a range from 9000 to 65000 bytes using multi-stream tests to determine the optimal MTU size.  (Data for these tests are not presented.)
  2. Given the chosen MTU, single-stream tests were run on 1500-byte and 9000-byte MTU GigE connections across a range of user-level write sizes.
  3. The write size just below that which achieved the peak single-stream rate was chosen to perform the multi-stream tests for 1500-byte and 9000-byte MTU.
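
Step 3 of the procedure above can be expressed as a small helper.  This is a sketch of the selection rule only; the function name is hypothetical, and the rates passed in would be the measured single-stream results, which are not reproduced here.

```python
def write_size_for_multistream(single_stream_rates):
    """Pick the write size just below the one that gave the peak rate.

    single_stream_rates maps user write size in bytes to the measured
    single-stream rate.  If the peak occurs at the smallest write size
    tested, that size itself is returned.
    """
    sizes = sorted(single_stream_rates)
    peak_size = max(sizes, key=lambda size: single_stream_rates[size])
    index = sizes.index(peak_size)
    return sizes[index - 1] if index > 0 else peak_size
```

The helper would be called once per MTU, with the single-stream results for the 1500-byte and 9000-byte GigE connections respectively.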

During our UNICOS/L7R tuning efforts we discovered the following issues, some of which should benefit general networking on Cray PVP systems.

Finally, all performance numbers quoted below were obtained using the nettest program which reports numbers as multiplied by 1000.

SV1 Performance Graphs

The first graph shows single-stream rates with varying user write sizes using 1500- and 9000-byte MTUs on GigE.

Maximum single-stream rates:

From the previous data, we chose write sizes which approached the maximum single stream bandwidth for the two MTUs as the basis for the multi-stream tests presented below:

Maximum multi-stream rates:

T3E Performance Graphs

The first graph shows single-stream rates with varying user write sizes using 1500- and 9000-byte MTUs on GigE.

Maximum single-stream rates:

From the previous data, we chose write sizes which approached the maximum single stream bandwidth for the two MTUs as the basis for the multi-stream tests presented below:

Maximum multi-stream rates:

Future Direction

Cray Inc. plans to pursue the following networking directions for Cray PVP, T3E, and MTA hardware.

Acknowledgments

The author wishes to thank the following Cray employees and Essential Communications partners for their contributions:

Author Biography

Jay Blakeborough is a Manager in the Software Development organization at Cray Inc.  He has worked on Cray products (for Cray Research, Inc., SGI, and Cray Inc.) as a developer and manager for over 15 years.