This paper was motivated by Cray customers' desire for network access to currently shipping Cray platforms via Gigabit Ethernet (GigE). Given the outcome of our efforts to provide just that, we believe a fair amount of history and explanation is necessary to help understand the challenges we have experienced and will face as we move forward. To that end, the paper briefly discusses Cray Networking History, UNICOS TCP/IP, Research on UNICOS Networking, Options Considered, the Cray L7R, Current Performance, and Future Directions.
On Cray systems with Model E I/O (skipping over Model D and VME-based I/O for brevity's sake), UNICOS supported TCP/IP across HIPPI, FDDI and HyperChannel. The driver protocol between the mainframe and the I/O Subsystem (IOS) was message or packet based. That is, for every read or write posted to the actual hardware device, a message was sent between the mainframe and the IOS. The actual TCP/IP packet was transferred to or from mainframe memory via Direct Memory Access (DMA) by the IOS. Nearly all hardware protocols were provided by native interfaces in the IOS. The exception was ATM OC-3, which required a Bus-Based Gateway (BBG) connection through HIPPI.
The introduction of the GigaRing-based I/O systems provided an opportunity to move forward with standard interfaces using SBus-based network interface cards. The Multi-purpose I/O Node (MPN) supported SBus versions of 10/100 BaseT Ethernet, FDDI, and ATM OC-3. The performance characteristics of HIPPI suggested the requirement for a special-purpose node, the HPN. The GigaRing also brought the advent of a new unified network driver protocol between the mainframe and the I/O nodes. The generic driver was designed as a table-based protocol with put and take pointers. The intent was that neither the mainframe nor the I/O node would have to interrupt the other with a message each time a network I/O was to be processed. Simply reading and writing the requests and responses and updating the respective put and take table pointers should have provided increased and more scalable networking throughput. In reality, the first implementation of the protocol proved to be unstable and gave unpredictable timings. It was extended to include messaging between the mainframe and the I/O node to ensure that data was sent or received in a more timely fashion. The unfortunate result was a decrease in maximum transfer rates on media which do not support large Maximum Transmission Units (MTUs). Performance across HIPPI, using 64K-byte MTUs, was acceptable, while performance on 10/100 BaseT Ethernet with 1500-byte MTUs was, to be truthful, abysmal (approximately one third of the maximum potential).
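The put/take scheme described above is essentially a single-producer, single-consumer ring buffer shared between the mainframe and the I/O node. The following sketch illustrates the general idea only; the class, slot count, and method names are illustrative assumptions, not Cray's actual driver interface:

```python
class RequestRing:
    """Single-producer/single-consumer request table.

    The mainframe advances `put` when posting a request; the I/O node
    advances `take` when consuming one. As long as both sides poll the
    pointers, neither needs to interrupt the other to exchange work.
    """

    def __init__(self, size=64):
        self.slots = [None] * size
        self.size = size
        self.put = 0    # next free slot, written only by the producer
        self.take = 0   # next pending slot, written only by the consumer

    def post(self, request):
        """Mainframe side: queue a network I/O request."""
        nxt = (self.put + 1) % self.size
        if nxt == self.take:
            return False          # table full; caller must retry later
        self.slots[self.put] = request
        self.put = nxt
        return True

    def consume(self):
        """I/O node side: take the next pending request, if any."""
        if self.take == self.put:
            return None           # table empty
        request = self.slots[self.take]
        self.take = (self.take + 1) % self.size
        return request
```

The weakness the paper describes follows directly from this design: if the consumer polls too slowly, requests sit in the table, so the protocol had to be extended with explicit messages to prod the other side.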
The following section discusses UNICOS TCP/IP at a very high level and concentrates on the implementation on GigaRing-based systems.
The UNICOS TCP/IP stack is based primarily on an early BSD version of the code. It was ported to the Cray Parallel Vector Processor (PVP) hardware long before the UNICOS kernel was multi-threaded. Even after the code was multi-threaded, the philosophy was to avoid interrupting user work, or even other system work, to handle networking interrupts. There was a time when debate raged over whether to provide interactive access to the Cray at all. Cray CPUs were expensive, and many of our customers didn't want a CPU interrupted because some user pressed <ENTER> on a keyboard. Consequently, the handling of networking interrupts was relegated to a lower service priority in UNICOS.
The Multi-Level Security (MLS) features were added during the UNICOS 8.0 and 9.0 releases. That code considerably changed the landscape of the kernel TCP/IP stack with nearly countless security IF tests. The industry-standard BSD stack has since changed and improved significantly, but the MLS code and the guarantees it carries reduced our ability to take advantage of the newer BSD or other code.
All performance efforts prior to UNICOS 10.0.0.8 were directed toward optimizing large transfers using large MTU interfaces. For example, multi-stream HIPPI performance on UNICOS GigaRing-based systems is approximately 700 Mb/s (87% of theoretical peak). Similar multi-stream transfers on 10/100 BaseT Ethernet yield approximately 40 Mb/s (less than 50% of theoretical peak).
In late 1995 and early 1996 many customers were promised that Cray would improve networking performance and provide new connectivity on GigaRing-based systems. Things have not improved much since then. What happened, you ask? SGI (then Silicon Graphics) purchased Cray Research, Inc. in 1996 and quickly reallocated nearly all networking resources to support future product directions like the Origin line. UNICOS wasn't particularly good at networking and Irix performed exceptionally well. To some decision makers, it was clear where to focus our efforts.
We began to research the bottlenecks. There certainly was enough blame to go around. Since the only major change going from Model E I/O-based machines to GigaRing-based machines was the creation of the unified network driver architecture, that received much of the initial focus. The table-based (with additional messages) protocol did reduce bandwidth. It appears that it did so by exacerbating timing issues within UNICOS. TCP is very sensitive to the number of acknowledgement packets and the speed at which they can be returned to the other end point of the connection. The new driver dramatically increased the delay in returning acknowledgements, causing the other end of the connection to assume a network congestion problem and thus reduce the speed at which it sent data. This is normal behavior for TCP; it is necessary to be considered a good networking neighbor.
To determine just how fast UNICOS could send TCP packets, a UNICOS kernel was built that simply sent TCP packets as fast as possible and ignored the ability of a hardware interface to actually get the packets out to the network. The UNICOS TCP stack produced approximately 6000 packets per second in this scenario. Given even modest overhead for getting the packets out to the network, we hypothesize that the highest rate we could reasonably expect from UNICOS would be somewhere between 4000 and 5000 packets per second. It is important to note that these rates were measured on a Cray SV1 system and that the TCP stack is mostly single-threaded, suggesting that the highest rate achievable would be roughly 5000 packets per second, per CPU.
To put things in perspective, consider what kind of packet rates are
required to run selected interfaces at theoretical speeds given their MTU
sizes. Note that for every two data packets received, one TCP acknowledgement
should be returned to the sender. Since there are cases where this
does not happen and given some additional estimates on my part, the numbers
below should definitely be considered approximations. It is the magnitude
that is important for this discussion.
Interface Type | Maximum Transmission Unit (MTU) Size | Approx. Packets/Second
HIPPI          | 64K bytes                            | 1,600
GigE           | 9000 bytes (Jumbo Frames)            | 14,000
GigE           | 1500 bytes                           | 85,000
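The magnitudes in the table can be reproduced with simple arithmetic: divide the link rate by the MTU in bits, then add roughly one acknowledgement for every two data segments, as noted above. The following sketch makes the calculation explicit (the link rates of 800 Mb/s for HIPPI and 1 Gb/s for GigE are assumptions based on the theoretical peaks discussed in the paper):

```python
def required_pps(link_bps, mtu_bytes):
    """Approximate packet rates needed to saturate a link at a given MTU.

    Returns (data packets/second, ACK packets/second), using the
    rule of thumb of one TCP ACK per two full-size data segments.
    """
    data_pps = link_bps / (mtu_bytes * 8)
    ack_pps = data_pps / 2
    return data_pps, ack_pps

for name, bps, mtu in [("HIPPI", 800e6, 64 * 1024),
                       ("GigE (jumbo)", 1e9, 9000),
                       ("GigE", 1e9, 1500)]:
    data, acks = required_pps(bps, mtu)
    print(f"{name:13s} ~{data:8.0f} data pkt/s + ~{acks:7.0f} ACK pkt/s")
```

The data-packet rates come out near 1,500, 14,000, and 83,000 respectively, matching the approximations in the table; with a per-CPU ceiling of roughly 5000 packets per second, the jumbo-frame GigE case alone already implies nearly three SV1 CPUs.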
Assuming we could almost perfectly thread the UNICOS TCP stack, and given the results of the packets per second experiment noted above, we would monopolize nearly 3 Cray SV1 CPUs to run GigE at industry-standard rates with 9000-byte MTUs.
Finally, many (including the author at one time) have assumed that we could simply connect to a reasonably fast networking box or switch using HIPPI and have it route packets to the rest of the world via GigE. This reasoning breaks down because of something I refer to as the MTU challenge, which is actually a feature of TCP.
When a TCP connection is established across multiple systems and/or routers, the overall connection MTU is negotiated down to the smallest MTU along the path. Once again, this kind of behavior is expected if one is to be considered a good network neighbor. Assume the Cray ignored this and sent out 64K-byte packets even when some system along the connection path could only handle 1500-byte packets. The system in the path prior to the small-packet system would have to fragment the 64K-byte packets into 1500-byte chunks. The ultimate data receiver would have to buffer up the packets for reassembly, since they could potentially arrive out of order. This causes additional, unnecessary work for other systems on the network and is therefore frowned upon by the networking powers that be.
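The two effects described above, negotiating down to the smallest MTU on the path and the fragmentation cost of ignoring it, are easy to quantify. The sketch below is illustrative only; it assumes IPv4 with a plain 20-byte header and no options, and the hop MTUs are made-up example values:

```python
import math

def path_mtu(link_mtus):
    # The effective MTU of a multi-hop path is the smallest link MTU
    # along the path; TCP negotiates its segment size down to this.
    return min(link_mtus)

def fragments_needed(packet_bytes, link_mtu, ip_header=20):
    # If a sender ignores the path MTU, an intermediate router must
    # fragment: each fragment carries at most (link_mtu - ip_header)
    # bytes of payload, and the receiver must buffer and reassemble.
    return math.ceil(packet_bytes / (link_mtu - ip_header))

# HIPPI hop (64K), then a jumbo-frame hop, then plain Ethernet:
print(path_mtu([65536, 9000, 1500]))   # → 1500
print(fragments_needed(65536, 1500))   # → 45
```

A single 64K-byte packet forced through a 1500-byte hop becomes 45 fragments, each of which must arrive and be reassembled before the data can be delivered, which is exactly the unnecessary work the MTU challenge describes.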
After further consideration, and prompting from some of our most important customers, it was decided that this approach was not justifiable in the long run. We need to help our customers connect our systems to industry standard networks.
The L7R router allows the Cray system to communicate at MTU sizes above 9000 bytes over HIPPI, while the ultimate connection from the router can be at 9000-byte or even 1500-byte MTUs. The router initially provides GigE, but has the potential to add new interfaces without changing the connection to the Cray system.
Besides requiring our customers to purchase additional networking hardware, one drawback of this solution is that it provides no performance benefit for UDP transfers. To create the circuit, the router requires static connection information provided by TCP. UDP is a connectionless networking protocol running over IP, and consequently the router cannot make the necessary assumptions about packet flow. The bad news is that UNICOS NFS performance will therefore not be increased by this solution.
One final caveat with this solution is that the TCP Circuit concept relies on having all inbound and outbound TCP packets pass through the L7R. It has been common for several sites to configure their Cray networks asymmetrically, using two separate network connections: one for access to the Cray and one for access from it. This can no longer be done when using the L7R.
Cray Inc. has licensed the product from Essential and has been working on a Cray-integrated version. The integration gives us the ability to boot and manage the L7R as you would another GigaRing ION, and should make product support more consistent in that regard. It also allows us to specifically tune the product for use with our systems.
Connecting to the L7R requires a serial HIPPI connection, while the HPN HIPPI implementation is parallel. The connection can be made through an appropriately configured HIPPI switch or a parallel-to-serial HIPPI modem. Both of these HIPPI products are manufactured by Essential; they can be purchased from, and will be supported by, Cray for connections to Cray systems.
The first customer ship of the Cray L7R is planned for late May 2001 and will support GigE. A 10/100 BaseT interface will be included and supported shortly thereafter. An ATM OC-12 interface is planned for the fall of 2001.
The method chosen to present performance was to report:
Maximum single-stream rates:
Maximum multi-stream rates: