boisseau@sdsc.edu, wilkinsn@sdsc.edu
www.sdsc.edu
The mission of the National Science Foundation's Partnerships for Advanced Computational Infrastructure (PACI) program is to advance science and engineering research through the use of high performance computing. NPACI (National Partnership for Advanced Computational Infrastructure) is one of the two partnerships funded by the PACI program. NPACI is a partnership of academic computing centers, universities, and government labs led by the San Diego Supercomputer Center (SDSC) at the University of California San Diego. There are three main kinds of partners in NPACI: resource partners (computational, data), research and development partners, and education and outreach partners. Together, the NPACI partners are developing an advanced computational infrastructure that individual HPC centers could not have built under the NSF's previous Metacenter program. This advanced computational infrastructure includes not only HPC resources but also data intensive computing and archival resources, interaction computing resources, high speed networks and security infrastructure, software for improved user environments and productivity, and education and outreach activities.
Of these components, the HPC resources are the ones most users work with directly in their research; the benefits of the other activities are sometimes obvious (e.g., archival resources) but sometimes transparent to the users. NPACI offers HPC resources at five resource partner sites: SDSC, the University of Texas, the University of Michigan, Caltech, and the University of California at Berkeley. These sites provide HPC resources to over 5000 users, many of whom have accounts on machines at multiple sites. The situation at the other NSF partnership, the National Computational Science Alliance (NCSA), is similar. Thus, the provision of user services by such a partnership must be somewhat different from the model used by individual computing centers.
The partnership model offers many potential benefits to users:
The NSF determined that the provision of these capabilities could only be achieved with maximum effect if leading edge sites developed extensive, integrated partnerships with plans for leveraging the particular expertise of each partner. The successful integration of the activities and resources of the partners provides an advanced computational infrastructure that enables users to be more effective as scientists.
The partnership model also presents some potential problems for users including: a potentially steep learning curve for unfamiliar resources; a wider variety of resource interfaces and environments to understand; and transition from the previous NSF supercomputer centers not chosen to lead partnerships. User services must become fully integrated across a geographically distributed group of resource sites to help offset these problems and make users more effective on all resources.
These potential problems are challenges for the support staff. Issues staff must overcome include a lack of experience with resources and services of other sites, different local procedures/philosophies/traditions at each site, and effective communication among staff at these geographically distributed sites. Staff members must be adaptable and must be able to learn new systems quickly. They must be able to field user queries on a variety of platforms and then combine their own knowledge of these systems with the expertise of staff members throughout the partnership when responding to the user. They must offer consistent documentation and training to lessen the difficulties for users with this array of resource options.
The goal for the support staff in the partnership is to provide excellent user services to users of all resources by combining and leveraging staff expertise at all sites while reducing redundant effort and minimizing overhead. If the partners can successfully leverage the expertise already present at each site, the users will transparently benefit from a total amount of expertise that exceeds that at any of the sites individually. If the staff can achieve this through combining efforts effectively, the total time spent per staff person on basic user support activities can be reduced while the quality of the user services activities is increased. This benefits both users and staff, and allows staff to spend more time on other issues that may benefit users, such as developing the advanced computational infrastructure itself.
To explore resources and services issues for the partnership, NPACI has established a Resources Working Group (RWG). This group includes senior staff from each resource partner and representatives from systems and user services groups. The group communicates regularly via a mailing list, monthly teleconferences, and semi-annual meetings. This is in addition to regular direct (phone, e-mail) communication between systems staff, consulting staff, training staff, etc. on those specific activities (see below). The NPACI Assistant Director for Resources coordinates these meetings. The purpose of this group is to identify the goals in providing the best possible user environment, and then to determine the actions that should be taken to reach these goals.
NPACI provides computational resources at five resource partner sites on eight platforms (http://www.npaci.edu/Resources). The RWG has already initiated and completed projects to provide common login environments through the creation of standard NPACI login scripts and created common queues on NPACI machines. A current project is the elimination of plaintext passwords from NPACI systems, which will be complete at SDSC by 9/1/98 and some time thereafter for the other resource partner sites (http://www.npaci.edu/Security).
NPACI developed an integrated allocations process for all platforms in the partnership (http://www.npaci.edu/Allocations). In addition, NPACI and NCSA together participate in a PACI-wide allocations process. NPACI facilitated the transfer of allocations and data of many former users of previous Metacenter sites that no longer receive funding (Pittsburgh Supercomputer Center and Cornell Theory Center). We also created documentation to help researchers choose the appropriate resources at the partner sites.
NPACI applications software information is listed for users by field of science and also by platform on the web (http://www.npaci.edu/Applications). The partnership resources make it possible to offer a wider variety of applications; essentially every major vendor application and programming tool can be offered on at least one of NPACI's eight allocated HPC systems. Development is underway to define a core set of applications software provided for all (appropriate) platforms, which should help users who receive allocations on multiple systems with similar capabilities.
Effort is also underway to define common installation procedures, which should achieve the goal of reducing redundant effort on the part of the staff. For example, only one person in the partnership will need to learn the nuances of the installation of a particular package; that person will then install the software on all relevant NPACI platforms, rather than having people at each site having to learn these nuances (and perform the installations).
An important point of the partnership's support for applications software and programming tools is that NPACI partner sites can still make local choices. Any site can choose to install an application or tool that is not part of the NPACI core set but is advantageous to users of that particular site's resource(s). However, installing an application on one machine in the partnership can allow users to become dependent upon this software (e.g., a data format software package such as HDF), making it difficult or impossible for them to utilize other resources in the partnership that do not have that software. Therefore, while such decisions are made frequently, they are generally made only for 'niche' applications or for software requested by users who are likely to compute only on that particular resource.
NPACI Consulting is a group of support staff composed of personnel at SDSC, UT and UM (http://www.npaci.edu/Consulting). By combining the efforts of staff at these three resource partners, NPACI provides a 12-hour window of hotline coverage across three time zones. Users get an increased amount of time during which someone is immediately available to answer their questions, without unduly burdening the consulting staff at any site.
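The arithmetic behind a 12-hour window can be illustrated with a small sketch. The staffing hours below (08:00 to 17:00 local time at each site) and the fixed standard-time UTC offsets are assumptions made for this example only; the paper does not state the actual hotline schedule. The sketch converts each site's local shift to UTC and merges the overlapping intervals.

```python
# Hypothetical hotline shifts: 08:00-17:00 local time at each site.
# These hours and the fixed standard-time UTC offsets are assumptions
# for illustration, not the actual NPACI schedule.
SITES = {
    "SDSC": {"utc_offset": -8, "start": 8, "end": 17},  # Pacific
    "UT": {"utc_offset": -6, "start": 8, "end": 17},    # Central
    "UM": {"utc_offset": -5, "start": 8, "end": 17},    # Eastern
}


def to_utc_interval(site):
    """Convert a local (start, end) shift to hours on a UTC clock."""
    return (site["start"] - site["utc_offset"], site["end"] - site["utc_offset"])


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage_hours(sites):
    """Total hours per day during which at least one site is staffed."""
    intervals = [to_utc_interval(site) for site in sites.values()]
    return sum(end - start for start, end in merge_intervals(intervals))


print(coverage_hours(SITES))
```

Under these assumed shifts, three nine-hour days staggered across the Eastern, Central, and Pacific zones combine into one continuous window that is longer than any single site's shift.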
NPACI Consultants share information, which decreases the time it takes to solve a problem and helps provide the highest quality responses. All user submissions are entered as 'tickets' directly into a client-server application called 'Remedy ARS'. The clients have been installed at all partnership resource sites, including those not directly staffing the hotline but offering HPC resources (Caltech, UCB). This makes it possible to leverage local expertise for any ticket that does not require immediate attention. Remedy has also been invaluable in tracking information on types of problems and solutions. In the near future, we will develop automated procedures to mine the data that accumulates in the Remedy database to identify improvements to our user services activities and resource configurations.
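As a rough illustration of what mining ticket data might involve, the sketch below counts which (platform, problem category) pairs occur most often. The record layout, field names, and sample values are hypothetical; they are not the Remedy ARS schema or API, and the sketch assumes the tickets have already been exported as plain records.

```python
from collections import Counter

# Hypothetical exported ticket records. The field names and values are
# assumptions for this sketch and do not reflect the Remedy ARS schema.
SAMPLE_TICKETS = [
    {"platform": "T3E", "category": "batch queues"},
    {"platform": "T3E", "category": "batch queues"},
    {"platform": "T3E", "category": "compilers"},
    {"platform": "other_platform", "category": "login"},
    {"platform": "other_platform", "category": "batch queues"},
    {"platform": "T3E", "category": "batch queues"},
]


def most_common_problems(tickets, top_n=3):
    """Return the top_n most frequent (platform, category) pairs with counts."""
    counts = Counter((t["platform"], t["category"]) for t in tickets)
    return counts.most_common(top_n)


for (platform, category), count in most_common_problems(SAMPLE_TICKETS):
    print(f"{platform}: {category} ({count} tickets)")
```

A recurring pair near the top of such a list would suggest where a documentation update, a training topic, or a configuration change might reduce the ticket load.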
Regular communication between consulting staff at resource sites is facilitated by a well-used mailing list and by intranet pages that include consulting procedures, a consulting FAQ and a contact list. Teleconferences between the consulting staff at the partner sites will also be starting shortly, as it was determined that the aforementioned RWG teleconferences were not adequate to address the issues of the consulting staff, who must work very closely with one another.
Training is also shared by the same resource partners that share consulting (http://www.npaci.edu/Training). A rotating calendar for on-site classes at different resource partner sites and combined training materials minimize the impact on each partner site's staff. The combined effort enables us to develop higher quality training content. This savings of effort through collaboration allows the partnership to spend more effort addressing the needs of more advanced users as well. We do this with specialized classes on such topics as Fortran 90 and MPI and workshops for advanced users such as the NPACI Parallel Computing Institute and the NPACI Advanced Parallel Applications Optimization Workshop.
The rotating calendar maximizes the geographical distribution of classes for the users. Some users find it easier to get to one resource site than another. Users have also expressed interest in classes at their home institutions, which we have started offering when there is sufficient demand, and in distance training materials. Work has begun on distance training at one of the partner sites (UM); it will leverage the content created for the 'regular' training classes, which itself leverages the time and expertise of the staff at the resource partners as just described.
Support staff from all resource partners contribute to the user documentation (http://www.npaci.edu/Documentation). The Resources and Services web pages of the NPACI web server (http://www.npaci.edu/Resources) are the main pages of interest to NPACI users. Integrated user guides help users use multiple similar machines in the partnership (e.g., the integrated T3E user guide will be relevant for users of the T3Es at SDSC and at UT). A database capable of being modified by staff at the partner sites is currently being implemented so machine configuration information can be centrally updated.
NPACI Online is a biweekly publication for news about the NPACI and SDSC community that contains a Resources and Services section. Staff members across the partnership contribute to this, highlighting advanced programming issues, new machine capabilities and related information of interest to the users. These articles often become part of the user guides in an advanced section. Users are informed of more immediate news across the partnership through NPACI Resources and Services News (http://www.npaci.edu/UserNews), a web site that can also be subscribed to via email.
Documentation is an area with much potential for leveraging the expertise across the partnership while reducing redundant effort for the staff. Users benefit from having documentation written by the most experienced people on that topic in the partnership (and being reviewed by potentially many other people with excellent expertise). Staff benefit as there are now many more qualified experts to write important technical documentation, thus reducing the burden on any one site to provide a comprehensive suite of technical information. Documentation is an important user service since it is the only one available to all users at all times (via the web), so the benefits of the partnership model are particularly important in this area.
The collaborative efforts of so many experienced, skilled support staff in the partnership enable the provision of new user services that would have been impossible to implement effectively by an individual center (due to time limitations if not also lack of comprehensive expertise). One such effort is the NPACI Strategic Applications Collaborations program (http://www.npaci.edu/SAC). The program establishes collaborations between NPACI staff and users to advance the capabilities of Grand Challenge-class research being conducted by NPACI users. The success of the program depends on establishing enough collaborations per year to generate significant results and solutions that benefit entire research communities and the user community, not just the specific researcher. Conducting such collaborations can be done by an individual center and indeed has been initiated with SDSC staff, but conducting a number sufficient to generate results to carry the program forward and make it worthwhile will require more staff than even SDSC can devote by itself. Staff from the resource partners will help in the coming months. In addition, the research partners will play a critical role as this program matures; the technologies they are developing may be the tools needed to advance the user's research to the next level.
Another project of interest in a partnership environment is the NPACI User HotPage (http://www.npaci.edu/Hotpage). The HotPage provides a distinctly user-oriented interface for information about all resources and services in the partnership. Equally important, the HotPage provides active content, such as the machine operational status and queue information for all NPACI resources, and functions such as the capability to generate batch scripts for any NPACI resource without the user having to learn batch commands for multiple resources. While this project can be undertaken by an individual center, the value to users is increased substantially in a partnership of resource sites. The HotPage gives users a single view into the resources and services of the partnership, thus making it feel like one computational center, not several.
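One way such a batch script generator could work is with a per-machine template that is filled in from a single set of user inputs. The following is a minimal sketch under that assumption; it is not the HotPage implementation. The machine names and directive syntax are placeholders invented for illustration and do not correspond to any real NPACI system or batch scheduler.

```python
# Hypothetical per-machine templates. The machine names and the directive
# syntax are placeholders, not real NPACI systems or real batch commands.
TEMPLATES = {
    "machine_a": (
        "#BATCH_A processors={processors}\n"
        "#BATCH_A time_limit_seconds={seconds}\n"
        "{command}\n"
    ),
    "machine_b": (
        "#BATCH_B tasks = {processors}\n"
        "#BATCH_B wall_clock = {hhmmss}\n"
        "{command}\n"
    ),
}


def generate_batch_script(machine, processors, minutes, command):
    """Fill in the template for one machine from common user inputs."""
    if machine not in TEMPLATES:
        raise ValueError(f"no template for machine: {machine}")
    hours, mins = divmod(minutes, 60)
    # Each template picks the time format it needs; unused fields are ignored.
    return TEMPLATES[machine].format(
        processors=processors,
        seconds=minutes * 60,
        hhmmss=f"{hours:02d}:{mins:02d}:00",
        command=command,
    )


print(generate_batch_script("machine_b", 16, 90, "./a.out"))
```

The user supplies the same three values (processor count, time limit, command) regardless of the target, and the differences between schedulers stay inside the templates.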
As the integration of user services becomes even tighter and more efficient, more staff resources can be focused on addressing user needs with additional new user service projects.
In the first 8 months we have encountered some problems and difficulties, all of which have been or are being addressed. The three major difficulties we have faced are:
In addition to the potential problems that would affect any partnership (described in Potential Problems for a Partnership above), we experienced a few additional stresses that contributed to some of our difficulties:
Our transition to a partnership was brought about by a change in the entire NSF program for providing HPC resources to academic users. Accompanying this transition was an increase in funding for new resources at NPACI (and NCSA) coupled with a decrease (and eventual suspension) of funding for PSC and CTC. These facts meant that we were attempting to integrate user services while at the same time bringing new HPC platforms into production at some of the sites and helping conduct the migration of users and data from PSC and CTC to NPACI. There was even more work involved than we anticipated, so we set priorities to provide the most essential user services at the highest possible quality first. However, most difficulties experienced in integrating user services across the partnership can be at least partially attributed to these stresses.
The successes involved in integrating our user services activities across the partnership have, however, been much greater than the difficulties (and downright heroic!). The biggest success was a transition over the course of eight months to providing extended consulting hours to a much larger user population (5000+ users!) than any individual site (including SDSC) had experienced, on a great variety of resources. We did this while simultaneously installing and supporting more powerful resources across the partnership and while migrating many users and data from the former NSF centers at PSC and CTC. The much larger user base generated a large number of transitional questions as users (and staff) were becoming familiar with new platforms. The volume of tickets submitted to NPACI Consulting was double the monthly number submitted to SDSC just before the new program began.
There have been many other successes along the way, such as integrating the other user services (training, documentation), developing partnership allocations procedures, providing all popular (i.e. oft-requested) software on at least one NPACI machine, and initiating new kinds of support activities. By and large, the integration of all user services across the partnership into NPACI Scientific Computing Services (http://www.npaci.edu/Services) has been very successful. More work remains to be done (see below), but the most difficult part is in the past and was achieved professionally and relatively quickly.
Of course, it is difficult to determine exactly how well we are doing without getting feedback directly from the users. We often receive useful input in the submissions to the consulting hotline, but more thorough means of gathering information about the comprehensive set of user service activities are desirable. We have recently notified our users of the first NPACI User Survey (http://www.npaci.edu/Survey). The main goal of this survey is to poll our users to determine their level of satisfaction with resources performance and configuration, applications software selection and features, consulting, training, and documentation. In addition, we are also asking for information about future computational needs.
As described above, we have experienced both successes and some problems in our attempts to integrate the user services activities across the partnership. Therefore, we expect some suggestions and criticisms due to the described difficulties of transitioning to a new model and integrating user services activities. In particular, we expect that users will have some criticisms of the consulting efforts for reasons discussed below. Of course, we expect the majority of our users will be satisfied with the user services we have provided and will help us to improve these services even further. We will take all suggestions seriously and plan to reply to anyone who provides detailed feedback and an e-mail address.
In our efforts to develop high-quality user services for all NPACI users by integrating the efforts of the staff at our resource partner sites, we have learned and profited from both our successes and our setbacks. Some of these lessons have to do specifically with our attempts to integrate the user services activities of the resource partners, whereas others undoubtedly have more general validity. We list some of these points here for others who are considering or planning to embark on a similar effort.
The most important lesson we have learned at NPACI is that superior user services can be integrated in a partnership that combines and leverages expertise while reducing redundant effort and minimizing overhead ... but it is not easy.
The transition of SDSC and its partners to the NPACI appears to have been largely successful. However, there is still much work to be done to provide a seamless user environment across the partnership. Some of the tasks still in progress or in the immediate future at NPACI are:
There are many other projects that we will undertake to provide this seamless environment as we evaluate our computational infrastructure and receive feedback from our users. NPACI is committed to advancing the state of science and engineering by providing an advanced computational infrastructure that makes users more effective as scientists and engineers. If you have suggestions, comments, or complaints, please contact either of the authors of this paper.
The authors wish to thank the many people who participate in the activities of the NPACI Resources Working Group and in the NPACI Scientific Computing Services activities. The authors also thank the systems staffs of the NPACI resource partner sites for their assistance in helping us provide support to our users.
Jay Boisseau is the HPC Consulting Group Manager at the San Diego Supercomputer Center (SDSC) and the manager for NPACI Scientific Computing Services. Nancy Wilkins-Diehr works in the HPC Consulting Group at SDSC and is the NPACI Consulting Coordinator. Both participate in the NPACI Resources Working Group (RWG).
Jay Boisseau's contact info:
Nancy Wilkins-Diehr's contact information: