GSN and ST:
Infrastructures for High Bandwidth, Low Latency Communications

Brad Strand
Silicon Graphics, Inc.

bstrand@sgi.com

ABSTRACT:
Gigabyte System Network (GSN) and Scheduled Transfers (ST) are two new networking technologies that SGI is currently developing. These technologies are described here, including some details regarding SGI's implementation of them. GSN and ST are shown to map well into certain problem spaces of interest to SGI's customers. Finally, GSN and ST performance numbers are given, and compared to other networking technologies.
KEYWORDS: High Speed Networking, GSN, ST, Gigabyte System Network, Scheduled Transfers, protocols

Design Goals

When SGI engineering and marketing teams first began to analyze the key characteristics that would be required in next generation networking technologies, several factors quickly jumped to the top of the list. Perhaps most obvious was the need for increased bandwidth. For several years, the bandwidth of available networking media had improved less rapidly than processing power, memory bandwidth, memory density, and secondary storage capacity. At that time, "best of breed" networking technologies included HIPPI-800, with a theoretical maximum bandwidth of 800 Mbits per second (Mbps). Clearly, the next generation networking technology needed to have a much higher bandwidth, in order to bring large systems architectures back into balance.

For many types of applications, low message latency is at least as important as high bandwidth. Message latency refers to the amount of time it takes to send a single message from the memory of one machine, across the network, and into the memory of another machine. The smaller this amount of time, the less time one machine needs to wait, potentially idle, for the arrival of a message from another machine telling it how to proceed. Clearly this measure of performance is critical to applications which do any volume of message passing, including clustered computing. SGI HIPPI-800 networks are able to send messages from the memory of one machine to the memory of another machine in about 90 millionths of a second, or 90 microseconds (usec), when an application uses the very lowest level interfaces. We wanted to improve this dramatically.

In an ideal world, networks would be able to support very different types of traffic efficiently. One of the criticisms leveled at some networking technologies, HIPPI-800 for example, is that some types of traffic "interfere" with other types. For example, large packets, which are very common and the most efficient for some uses, such as storage area networking, can monopolize the bandwidth on the wire for long periods of time, and thus can have a markedly negative impact on small packet traffic. This makes running standard clustering applications or even traditional networking applications problematic on networks which are also being used for storage area networking. As a result, most storage area networking solutions today are run on dedicated networks. We felt that a mechanism to allow for better coexistence of multiple types of data was also an important design goal.

While higher bandwidth and lower latencies were desirable, it was clearly not acceptable to deliver these at the expense of overutilizing other system resources. For example, we knew that our design would need to be very efficient with respect to CPU utilization and memory bandwidth utilization, so these became important design goals, as well.

We also recognized the need to keep the cost of the solution attractive when compared to alternate competing technologies. We decided that our goal was to be competitive along the price/performance curve. In other words, we wanted to deliver a solution that was attractive when measured in terms of "dollars per MB/s".

The Origin family of computing products enjoys a highly scalable system architecture. The networking solutions we needed to develop must scale to a degree which matches the scalability of the overall system architecture. In other words, we must be able to configure systems in which aggregate performance of the network solutions improves to allow larger systems to benefit from the use of more than one network interface.

One of the criticisms of early HIPPI-800 networks was that it was a complicated technology to use and administer. HIPPI-800 was unlike most other networking technologies because it required configuration of either the HIPPI-800 switch fabric, or statically managed tables on each host, if any network topology more complex than point-to-point connectivity was desired. As customer cost models moved beyond considering simply purchase cost, to models based on overall cost of ownership, the importance of simpler administration methodology increased dramatically. Thus, we knew that simpler methods of network management and administration must be designed into the next generation of networking technologies we were planning.

Finally, we heard our customers' requests for standard solutions. Customers want their SGI products to be able to interoperate with hardware and software provided by other vendors. In the networking realm, this implies standards-based solutions, both in hardware and in software. Thus, a final key design criterion was to have whatever hardware and software technologies we developed be conformant with industry-based standards. The last thing we wanted to develop was an SGI-proprietary solution.

Gigabyte System Network (GSN)

After many months of analysis and design work, the hardware technology that SGI is now promoting as a solution to many next generation networking problems is Gigabyte System Network, or GSN. GSN has often been called "HIPPI-6400" or "Super HIPPI" in some of the earlier literature. GSN provides 6400 Mbps bi-directional user payload bandwidth, or eight times the bandwidth of HIPPI-800, about six times the bandwidth of Gigabit Ethernet, and about three times the bandwidth of ATM OC48.
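The ratios quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of any SGI software: the GSN and HIPPI-800 rates come from the text, while the Gigabit Ethernet and OC-48 figures are standard nominal line rates supplied here as assumptions (the paper may have compared payload rates instead).

```c
#include <stdio.h>

/* Nominal rates in Mbit/s. GSN and HIPPI-800 are the figures quoted in the
 * text; the Gigabit Ethernet and OC-48 values are nominal line rates and
 * are assumptions of this sketch. */
#define GSN_MBPS       6400.0
#define HIPPI800_MBPS   800.0
#define GIGE_MBPS      1000.0
#define OC48_MBPS      2488.32

/* Convert megabits per second to megabytes per second (8 bits per byte). */
double mbps_to_mbytes(double mbps)
{
    return mbps / 8.0;
}

/* How many times faster GSN is than a link running at other_mbps. */
double gsn_speedup(double other_mbps)
{
    return GSN_MBPS / other_mbps;
}

/* Print the comparison table for the three technologies named in the text. */
void print_gsn_comparison(void)
{
    printf("GSN payload rate: %.0f MB/s\n", mbps_to_mbytes(GSN_MBPS));
    printf("vs HIPPI-800:        %.2fx\n", gsn_speedup(HIPPI800_MBPS));
    printf("vs Gigabit Ethernet: %.2fx\n", gsn_speedup(GIGE_MBPS));
    printf("vs ATM OC-48:        %.2fx\n", gsn_speedup(OC48_MBPS));
}
```

With these nominal rates, the Gigabit Ethernet and OC-48 ratios come out near 6.4 and 2.6, in line with the rounded "about six times" and "about three times" figures.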

The GSN standard was developed in the ANSI T11.1 committee. The standard provides a 6400 Mbps bi-directional user payload, plus encoding and signaling. The cabling is copper today, with segments of up to 40m. Optical cabling is in development, with maximum segment lengths being extended to up to 1km.

Key GSN Features

In addition to the great bandwidth, GSN was designed with several other important features that improve its utility. One of the most useful attributes of GSN is that it is a reliable medium. GSN delivers its packets reliably, in-order, and without duplication. It does so by generating and checking an ECC at the micropacket level, every 32 bytes. Corrupted packets which are not correctable are automatically retransmitted by the hardware. This greatly improves the efficiency of packet retransmission, normally handled only by the software.
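To make the micropacket idea concrete, here is a rough sketch of checking a payload in 32-byte units and locating the first unit that would need to be resent. The check value used here is a plain XOR of the bytes, chosen only for brevity; it is a stand-in and is not the ECC defined by the GSN standard (it can detect some errors but cannot correct any). All function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MICROPACKET_BYTES 32u  /* check interval stated in the text */

/* Number of micropackets needed to carry payload_bytes
 * (the last one may be only partly filled). */
size_t micropacket_count(size_t payload_bytes)
{
    return (payload_bytes + MICROPACKET_BYTES - 1) / MICROPACKET_BYTES;
}

/* Stand-in check value: XOR of all bytes. This is NOT the GSN ECC. */
uint8_t micropacket_check(const uint8_t *data, size_t len)
{
    uint8_t x = 0;
    for (size_t i = 0; i < len; i++)
        x ^= data[i];
    return x;
}

/* Return the index of the first micropacket whose check value does not
 * match checks[i], or micropacket_count(payload_bytes) if all match.
 * In hardware, a mismatching micropacket would trigger retransmission. */
size_t first_bad_micropacket(const uint8_t *payload, size_t payload_bytes,
                             const uint8_t *checks)
{
    size_t n = micropacket_count(payload_bytes);
    for (size_t i = 0; i < n; i++) {
        size_t off = i * MICROPACKET_BYTES;
        size_t len = payload_bytes - off;
        if (len > MICROPACKET_BYTES)
            len = MICROPACKET_BYTES;
        if (micropacket_check(payload + off, len) != checks[i])
            return i;
    }
    return n;
}
```

Because errors are found at 32-byte granularity, only the damaged unit has to be resent, rather than the whole packet.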

Another important feature of GSN is that the hardware provides end-to-end flow control of the data. Packets are not allowed to be injected into the GSN fabric until end-to-end resources for packet transmission have already been reserved and allocated.

GSN also defines four separately managed virtual channels. These virtual channels, or VCs, are used to logically separate different types of traffic. The GSN standard uses VC0 for control messages, VC1 for short messages such as IP, and VC2 and VC3 for longer messages, such as Scheduled Transfers, which will be described in more detail a bit later. By having separate VCs, GSN eliminates the problem of long messages effectively starving out small messages, since small messages can be interleaved on the wire with larger messages, and can be handled by different resources on the host. Thus, small control messages can be delivered in the midst of a large data packet transfer without disruption.
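The channel assignment described above can be pictured as a small mapping from traffic type to VC number. The sketch below is illustrative only: the enum and function names are hypothetical, and the rule used to spread long messages across VC2 and VC3 (alternating on a stream identifier) is an assumption, since the text does not say how the two are shared.

```c
/* Traffic classes taken from the description in the text. */
enum traffic_kind {
    TRAFFIC_CONTROL,  /* control messages            -> VC0 */
    TRAFFIC_SHORT,    /* short messages such as IP   -> VC1 */
    TRAFFIC_BULK      /* longer messages such as ST  -> VC2 or VC3 */
};

/* Pick a virtual channel (0..3) for a message.
 * Assumption: bulk streams alternate between VC2 and VC3 based on the
 * low bit of a caller-supplied stream id. Returns -1 for unknown kinds. */
int select_vc(enum traffic_kind kind, unsigned stream_id)
{
    switch (kind) {
    case TRAFFIC_CONTROL: return 0;
    case TRAFFIC_SHORT:   return 1;
    case TRAFFIC_BULK:    return 2 + (int)(stream_id & 1u);
    }
    return -1;
}
```

Keeping control and short messages on their own channels is what lets them be interleaved with a long transfer instead of queuing behind it.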

SGI's GSN Implementation

SGI designed two ASICs used in its GSN product. The first, called SuMAC (an acronym for Super-HIPPI Media Access Controller), is the MAC which controls GSN encoding and signaling. SGI has licensed SuMAC to several vendors who use it as the MAC for their GSN products, which include network interface cards, switch ports, and media bridge ports. To date, SGI has shipped over 1000 SuMACs to its licensees.

The second ASIC that SGI developed for its GSN product is called SHAC, which is an acronym for the Super-HIPPI Host Adapter Chip. SHAC has detailed knowledge of ST and IP, including packet formats and buffer control structures. This allows SHAC to provide functions in hardware that would otherwise have to be done in software or firmware. In addition to the "standard" features of ST and GSN, SHAC has dozens of additional features that allow SGI to deliver much of GSN's performance to the user. Many of these features are mechanisms designed to get the most performance out of the Origin system architecture. While an entire paper could be written devoted to nothing other than these interesting SHAC features, only one will be called out here, since it exemplifies the design philosophy of SGI's GSN product.

One of the constraints of the Origin architecture is that SGI's Crossbow ASIC, which one can think of as the IO "crossbar" in the system, has a bandwidth of something on the order of 680 MB/s. Rather than slow down the speedy 800 MB/s GSN, SGI designed the GSN NIC to be optionally connected to two XIO ports. Thus, for applications where the highest bandwidth is required, SGI's GSN product is available in a dual-attached configuration, in which data is routed through another Crossbow port via a cable attached to a Crosstown board. For configurations not quite so bandwidth intensive, SGI also supports the GSN adapter in a single board configuration.

Scheduled Transfers (ST)

Scheduled Transfers is a new protocol that is also being developed in the ANSI T11.1 committee. Scheduled Transfers, or ST, occupies what we normally consider as levels 2 through 4 of the OSI networking model, or in other words, the datalink, network, and transport layers.

Scheduled Transfers was designed to accommodate the high bandwidth and low latencies desired for the next generation of networking. Perhaps the most important feature of Scheduled Transfers is its flow control model. Most popular networking protocols allow applications to send data whenever they desire, and then impose some sort of flow control "after the fact," whether the nature of the problem is on a per-connection basis, or general networking congestion. ST is designed to give the eventual receiving host or device much more control over the flow of data. Indeed, ST does not allow data to be sent until the resources to support that transfer have been allocated and reserved on the receiver. This feature gives rise to the name "Scheduled Transfers," since data transfers must be scheduled in advance.

There are two basic modes of operation in ST. Data sent using the first mode requires resources to be acquired on a per transfer basis. Because of this, a three-way handshake is required prior to the data transmit operation: Request-To-Send Msg (RTS), Clear-To-Send Msg (CTS), Data Msg (Data). Note that ST allows the transmission of up to 2^64 bytes of data in a single transfer, so the overhead of this three-way handshake sequence can be amortized over a very large amount of data.
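The RTS, CTS, Data ordering can be sketched as a tiny state machine seen from the sender, together with a helper showing how the handshake cost shrinks relative to the transfer as the transfer grows. This is an illustration under stated assumptions, not the ST wire protocol: the state and function names are hypothetical, and the handshake time passed to the helper is a free parameter, not a measured value.

```c
/* Message types named in the text. */
enum st_msg { ST_RTS, ST_CTS, ST_DATA };

/* Hypothetical sender-side states for one scheduled transfer. */
enum st_state { ST_IDLE, ST_RTS_SENT, ST_CLEARED, ST_DONE, ST_ERROR };

/* Advance the sender state: RTS is sent, CTS is received, Data is sent.
 * Any message that arrives out of this order yields ST_ERROR. */
enum st_state st_sender_step(enum st_state s, enum st_msg m)
{
    switch (s) {
    case ST_IDLE:     return m == ST_RTS  ? ST_RTS_SENT : ST_ERROR;
    case ST_RTS_SENT: return m == ST_CTS  ? ST_CLEARED  : ST_ERROR;
    case ST_CLEARED:  return m == ST_DATA ? ST_DONE     : ST_ERROR;
    default:          return ST_ERROR;
    }
}

/* Fraction of total time spent moving data rather than handshaking.
 * 1 MB/s equals 1 byte per microsecond, so bytes / mbytes_per_s is the
 * transfer time in microseconds. handshake_us is an assumed input. */
double st_transfer_efficiency(double handshake_us, double bytes,
                              double mbytes_per_s)
{
    double transfer_us = bytes / mbytes_per_s;
    return transfer_us / (transfer_us + handshake_us);
}
```

For example, with an assumed 10 microsecond handshake on an 800 MB/s link, an 8000-byte transfer spends half its time in the handshake, while a transfer a thousand times larger spends about 0.1% there.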

The second mode of data transmission in ST uses the notion of "persistent memory", in which a region of memory on the destination is acquired by the sender ahead of time. The sender is then free to write into this memory at any time. Since the transmission resources are acquired ahead of time, there is no need for a protocol handshake to occur prior to sending data. It is this mode of operation that is used for the low latency messaging mentioned earlier in this paper. Indeed, one can think of this as a true distributed DMA, since SGI's implementation moves data from sender's memory to receiver's memory without a context switch or data copy on either side.
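As a local analogy for the persistent memory mode, the sketch below models a region reserved once and then written repeatedly with no per-write negotiation, only a bounds check against the reservation. It is not SGI's API: the struct and function names are hypothetical, and memcpy stands in for what the hardware does across the network.

```c
#include <stddef.h>
#include <string.h>

/* A region of destination memory reserved ahead of time (hypothetical). */
struct persistent_region {
    unsigned char *base;  /* start of the reserved memory */
    size_t len;           /* size of the reservation in bytes */
};

/* Write n bytes at the given offset inside the reservation.
 * Returns 0 on success, or -1 if the write would fall outside the region.
 * No handshake is modeled: the reservation already exists. */
int region_write(struct persistent_region *r, size_t offset,
                 const void *src, size_t n)
{
    if (offset > r->len || n > r->len - offset)
        return -1;
    memcpy(r->base + offset, src, n);
    return 0;
}
```

The point of the model is the cost structure: the reservation is paid for once, after which each write carries only the data itself.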

In addition, ST supports the notion of atomic "Fetch and Op" functions, in which the contents of a remote data location are retrieved and operated upon in a single operation. This allows easy implementation of data structures such as distributed synchronization primitives, which enable efficient distributed programming.
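To show why a fetch-and-op primitive is enough to build synchronization, here is a ticket lock built only on fetch-and-add. The sketch uses C11 atomics on local memory as a stand-in for the remote operation; the ST operation itself and its interface are not shown in this paper, so the names below are hypothetical.

```c
#include <stdatomic.h>

/* Ticket lock: each arrival takes a number, and is served in order. */
struct ticket_lock {
    atomic_uint next_ticket;  /* next number to hand out */
    atomic_uint now_serving;  /* number currently allowed in */
};

/* Take a ticket. The fetch-and-add returns the old value and increments
 * the counter as one indivisible step, so no two callers share a ticket. */
unsigned ticket_take(struct ticket_lock *l)
{
    return atomic_fetch_add(&l->next_ticket, 1u);
}

/* Nonzero when the holder of `ticket` may enter the critical section. */
int ticket_is_my_turn(struct ticket_lock *l, unsigned ticket)
{
    return atomic_load(&l->now_serving) == ticket;
}

/* Leave the critical section and admit the next ticket holder. */
void ticket_release(struct ticket_lock *l)
{
    atomic_fetch_add(&l->now_serving, 1u);
}
```

If the two counters lived on a remote host and were updated with a remote fetch-and-add, the same scheme would order participants across machines.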

Another useful feature of ST is its support for the "striping" of a single logical connection over multiple physical interfaces. One use for this is to allow a stream of ST data to travel out a high-speed interface (e.g. GSN) and then into a bridge which can split the stream into multiple components, each of which can be sent over a separate lower-speed interface to the destination host. Another use for this is for configurations which want to achieve greater single-stream bandwidth between hosts without requiring them to move to a faster networking medium. Instead, a single logical connection can be "striped" across multiple interfaces.
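One simple way to picture striping is as a plan that divides a transfer into one contiguous piece per interface. The sketch below is an assumption for illustration: the text does not specify the granularity or layout ST uses, and the names are hypothetical.

```c
#include <stddef.h>

/* One piece of a striped transfer: where it starts and how long it is. */
struct stripe {
    size_t offset;
    size_t len;
};

/* Split `total` bytes into `n` contiguous stripes, one per interface.
 * The first (total % n) stripes carry one extra byte so the pieces cover
 * the whole transfer. `out` must have room for n entries.
 * Returns the number of stripes written, or 0 if n is 0. */
size_t stripe_plan(size_t total, size_t n, struct stripe *out)
{
    if (n == 0)
        return 0;
    size_t base = total / n;
    size_t extra = total % n;
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        out[i].offset = off;
        out[i].len = base + (i < extra ? 1 : 0);
        off += out[i].len;
    }
    return n;
}
```

Each stripe could then be handed to a different physical interface, with the receiver reassembling the pieces by offset.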

SGI's ST Implementation

SGI's implementation of ST uses the familiar BSD sockets programming model and Application Programming Interfaces (APIs). We decided that it made more sense to make small extensions to an existing and ubiquitous set of APIs than to try to develop something entirely new.

One advantage of this is that converting existing sockets-based programs to use ST is often relatively trivial. Indeed, converting a program from using UDP/IP to using ST can be as simple as changing a single line of code. Whereas a UDP-based application might create a socket with a line such as

sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

the analogous ST-based application could contain the line

sock = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_STP);

Since SGI's implementation of ST uses IP addressing, we create ST sockets using the AF_INET address family, so the first parameter of the socket() call remains the same. The second parameter is changed from SOCK_DGRAM to SOCK_SEQPACKET. SOCK_SEQPACKET specifies a "sequential packet" socket. This is a special type of socket which provides the application with a series of datagrams that are delivered reliably, in-order, and without duplication. A perfect match for GSN hardware! Finally, we specify the protocol to use as STP instead of UDP. This tells the socket layer to hand the request to the ST stack once inside the IRIX kernel.

Since the UDP-based application is expecting to send and receive a series of datagrams, potentially out of order, and potentially with lost and/or duplicated messages, the fact that ST delivers a series of datagrams, in order, with no lost or duplicated packets, is likely to be transparent to the remainder of the application.

Nearly all of SGI's other sockets-based APIs, such as accept(), bind(), connect(), etc. support ST sockets, so no further code changes are generally required to use any of them. One of the only constraints of ST sockets is that they must be connected before they can send or receive data. As a result, the sendto() and recvfrom() calls are not supported for ST sockets.
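Because ST sockets must be connected before use, a converted program would follow the connect() then send()/recv() pattern rather than sendto()/recvfrom(). The following is a minimal sketch of that pattern, not SGI's reference code. IPPROTO_STP is the constant named in this paper and is not defined on most systems, so the sketch takes the ST path only when the macro is available and otherwise falls back to UDP; the function name is hypothetical.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/* Open a socket and connect it to ip:port.
 * want_st selects ST (SOCK_SEQPACKET / IPPROTO_STP) when the platform
 * headers define IPPROTO_STP; otherwise UDP is used as a stand-in.
 * Returns a connected descriptor, or -1 on any failure. */
int open_connected_socket(const char *ip, unsigned short port, int want_st)
{
    int type = SOCK_DGRAM;
    int proto = IPPROTO_UDP;
#ifdef IPPROTO_STP
    if (want_st) {
        type = SOCK_SEQPACKET;   /* reliable, in-order datagrams */
        proto = IPPROTO_STP;     /* hand the socket to the ST stack */
    }
#else
    (void)want_st;               /* ST not available in these headers */
#endif

    int sock = socket(AF_INET, type, proto);
    if (sock < 0)
        return -1;

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port = htons(port);

    /* Reject a malformed address before attempting to connect. */
    if (inet_pton(AF_INET, ip, &peer.sin_addr) != 1 ||
        connect(sock, (struct sockaddr *)&peer, sizeof peer) < 0) {
        close(sock);
        return -1;
    }
    return sock;  /* caller uses send()/recv(), then close() */
}
```

Once the descriptor is connected, the rest of the program can use send() and recv() identically for either protocol, which is what keeps the conversion small.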

The same sockets-based APIs are used for SGI's implementation of the low-latency persistent memory based ST data transfers. In addition, there are several ST-specific setsockopt() functions which are used for such tasks as reserving memory on remote hosts, and selecting how the applications are to be notified when data arrives.

Applications of ST...Today

One of the most obvious applications of ST, and particularly ST over GSN, is any situation where high-bandwidth data movement between hosts is of paramount importance. If a situation calls for a custom-built application, the ST sockets APIs are available. If a situation calls for a high-bandwidth distributed filesystem, which can deliver high bandwidth to an application through a standard file interface, SGI can provide this as well. The Bulk Data Service, or BDS, which was always designed with high-performance networking in mind, has been modified to support the ST protocol. As a result, users can enjoy the resource-sharing and administrative benefits of a distributed filesystem, but still enjoy local filesystem performance.

A second obvious use of ST is in the area of clustering. The low latency messaging provided by the ST OS bypass interfaces provides an ideal framework upon which a clustering infrastructure can be built. SGI plans to port the Message Passing Interface (MPI) library to use the ST OS bypass API. To help protect our users' software investments, and to help accelerate the state of the art in low latency API design, SGI is also leading the effort to standardize the ST OS bypass API in the ANSI standards body.

Applications of ST...Future

SGI is working to broaden and accelerate the use of ST in a variety of applications. We are currently working on porting ST to run on our Gigabit Ethernet network adaptors. We see this as a very compelling infrastructure for low-cost clustering products.

SGI is planning to develop a SCSI over ST product, as well. We see great potential for an ST-based solution in the storage area networking (StAN) space. Some ST features, such as third-party transfers, are especially well-suited for StAN.

ST's "Fetch and Op" functions are ideal primitives for distributed synchronization constructs. SGI is planning on implementing an extremely low-latency distributed database based on this technology.

It should be noted that ST is not an IRIX-only product for SGI. ST is also being ported to Linux. In fact, SGI made a presentation at Linux Expo the weekend of May 22, 1999, reporting on our progress in this area. In addition, an implementation of ST on UNICOS is in beta test at one customer site.

GSN and ST Efforts Outside of SGI

Several other vendors are beginning to deliver products based on GSN and ST technologies. The Essential division of ODS Networks, for example, is currently selling their ESN10000 32-port GSN switch.

Genroco is another company heavily vested in GSN technology. Genroco is developing a wide variety of GSN products, including PCI-based NICs, 8-port switches, and bridges between GSN and HIPPI-800, Gigabit Ethernet, and Fibre Channel.

Status of GSN and ST at SGI

SGI's ST implementation has been in early access release to a small number of customers since June, 1998. The initial release was ST over HIPPI-800. The first ST over GSN solution was made available to a select set of customers in October, 1998. Several incrementally improved software releases have been made since then. SGI is now preparing for a broader beta-release of TCP/GSN in June, 1999. SGI plans to release TCP/GSN in CQ3Y99, and ST/GSN, including the ST OS bypass and BDS support for ST, in CQ4Y99.

GSN and ST Performance

The theoretical maximum bandwidth of GSN is 800 MB/s. SGI has demonstrated that its GSN adaptor can send and receive data at 796 MB/s unidirectionally, and 791 MB/s bidirectionally. This compares with slightly under 100 MB/s for HIPPI-800, and slightly more than 100 MB/s for Gigabit Ethernet.

At the transport level, TCP over GSN sends data at around 280 MB/s, measured from memory to memory. This compares to less than 100 MB/s for TCP over both HIPPI-800 and Gigabit Ethernet. And while UDP over HIPPI-800 and Gigabit Ethernet both have performance in the same ~100 MB/s range, SGI's UDP over GSN performance jumps to over 500 MB/s. Also note that for Gigabit Ethernet, packet lengths of 1500 bytes (or 9000 bytes when "jumbograms" are used) mean that a lot of CPU utilization is typically required to service the large number of interrupts these rates generate.

SGI's ST over GSN implementation yields some very impressive performance numbers. When sending a single stream of ST data from memory to memory, we observe transfer rates of 560 MB/s, while utilizing only 5-8% of a single CPU as overhead. This rate is constrained by the memory bandwidth through the HUB ASIC. When we enable the special "memory striping" mode in ST to send the stream of data through more HUBs, the single-stream rates jump to 769 MB/s. (SGI's implementation of ST supports 1-way, 2-way, 4-way, 6-way, and 8-way memory striping.)

When we add BDS to the model, the performance remains impressive. We have measured BDS over ST over GSN performance at over 450 MB/s through a single file descriptor. Our beta customers have measured performance up to 480 MB/s, with a different disk subsystem configuration. When we move away from a simple data flow model, and consider performance using 2-way memory striping on both ends, we have measured BDS over ST over GSN memory to memory performance of 690 MB/s. With most distributed filesystems, a significant tradeoff in performance is usually the cost that must be incurred in order to enjoy the benefits of resource sharing and centralized administration. As mentioned earlier, with BDS over ST over GSN, SGI can provide these benefits, while also delivering data to applications at speeds greater than that allowed by a single memory subsystem.

In terms of latency, HIPPI-800 sends messages from memory to memory in about 90 microseconds. With GSN, the one-way message latency is 4 to 9 microseconds, depending on the cable length. This represents a latency improvement of about a factor of 15. Perhaps even more impressive is the "packet repetition rate", which is the rate at which packets can be sent, once the pipeline is filled. With ST and GSN, SGI is able to send 1.45 million packets per second. This compares to well under 100,000 per second with HIPPI-800.
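The latency and packet-rate figures above can be put in perspective with a short calculation. This sketch uses only the numbers quoted in the text; the function names are hypothetical.

```c
/* Ratio of the HIPPI-800 latency to a GSN latency (both in microseconds). */
double latency_improvement(double hippi_us, double gsn_us)
{
    return hippi_us / gsn_us;
}

/* Average spacing between packets, in nanoseconds, at a given packet rate. */
double packet_interval_ns(double packets_per_second)
{
    return 1e9 / packets_per_second;
}
```

With 90 microseconds for HIPPI-800, the improvement works out to 10x at the 9 microsecond end of the GSN range and 22.5x at the 4 microsecond end, so the factor of about 15 is a mid-range figure. At 1.45 million packets per second, a new packet leaves roughly every 690 nanoseconds.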

Finally, in terms of scaling, SGI has demonstrated that ST over GSN performance scales exceptionally well. Indeed, we have shown that as we move from one to three GSN adaptors in an Origin 2000 system, aggregated ST bandwidth scales at 98% of linear. This allows GSN and ST to solve some of the most important networking problems for a broad class of SGI customers.

Summary

Although a paper of this length does not allow many of the most interesting details of GSN, ST, and SGI's implementations of them to be described in any great detail, a few of the points of more general interest have been called out. Unfortunately, many important topics were omitted, including the HIPPI ARP (HARP) standards for address resolution over GSN and HIPPI, the true parallelism of SGI's ST protocol stack, and the important differences between TCP and ST. Nevertheless, it should be clear that with its GSN and ST products, SGI stands on the cusp of leading the industry into the next generation of high performance networking.

More Information

Readers interested in more information on GSN, ST, or the companies investing in these technologies are encouraged to follow the links below.
