
Pushing/Dragging Users Towards Better Utilisation

Guy Robinson
Arctic Region Supercomputing Center,
University of Alaska Fairbanks, P.O. Box 756020, Fairbanks, AK 99775-6020.

robinson@arsc.edu

ABSTRACT:

This paper describes ARSC's experiences in the operation of MPP Cray systems, with particular attention to issues of user/centre interaction.

It is hoped that this paper will promote discussion of the need for centres to become more involved with their users, and of how this can be achieved.

KEYWORDS: parallel scheduling, performance tools

Introduction

The ARSC mission is clearly stated as follows.

To support high performance computational research in science and engineering with emphasis on high latitudes and the Arctic.

In support of this mission ARSC operates a number of high performance computing systems, including a CRAY T3E MPP and a CRAY J90se, and provides networking, storage and visualisation facilities which are made available to researchers in a range of scientific fields.

Full details of ARSC computing resources and links to many of the above projects can be found at the ARSC web site, http://www.arsc.edu/.

The users of ARSC systems have widely varying computational needs, and their experience ranges from experts writing parallel codes to solve grand challenges to relative novices just starting to apply computational techniques in their fields of research.

The focus of this paper is on the support and encouragement a centre should provide to such users in order that they improve their codes. These support activities are described in the sections that follow.

Utilisation in this paper is not defined as an exact numerical metric by which a centre might be judged. It is a less quantitative, more qualitative concept, embracing both the individual user's productivity and the centre's goal of being fair to all users.

Making Users Improve

Perhaps the greatest challenge facing any centre is keeping users current with the latest developments and encouraging them to implement those advances which give the greatest reward. Some users find the rapidly changing arena of high performance computing to be a constant struggle against change, and find it difficult to see any consistent set of rules by which to develop parallel, portable code that also performs well. Others are more than happy to develop codes to exploit all the various features of the different systems available. The majority, however, adopt the approach of exploiting a reduced, lowest-common-denominator set of features. While this achieves a degree of success, in that the user can migrate rapidly to new architectures or different centres, it does not necessarily assist the user in achieving scientific goals. Users need a background in the areas described below if they are to be productive.

General Education of Users

Some users arrive at ARSC with a background in high performance computing and are familiar with the general principles and best practice. For these, ARSC provides short guides to the local configuration, installed software and general policies. Such users might also need an introduction to the ARSC systems' hardware and software, having used other platforms at other sites. These users are perhaps the easiest to bring to a centre. For others it can be the first time they have crossed the threshold into a shared high performance environment. Some general education is necessary, and ARSC provides this through a mixture of basic training, web pages and direct contact with the individual user. This training covers such topics as the basic use of the system, queues, and mass storage facilities. Having online versions helps greatly in answering user questions: the consultant can point at a consistent and validated set of examples and information on the web, in addition to answering the question itself during the initial contact. Pages of particular note, which have been popular with users, cover CRL, NQS and storage, all of which are important aspects for any scientist or engineer considering a large computation.

Code Performance Improvements

The majority of users at ARSC are interested in performing large scientific and engineering computations and are therefore looking to achieve certain levels of performance. Indeed, ARSC also has a responsibility to see that resources are used efficiently.

Detecting and Correcting Low Performance

All users are encouraged to consider the performance of their code closely and to use tools to inspect it. On the ARSC vector systems the UNICOS utility 'hpm' is used to monitor performance and to help identify those users whose code does not achieve an acceptable level of performance. These users are then contacted and asked if any help can be provided or if training is required. Sometimes the cause is a very simple coding or algorithmic problem; in extreme cases the remedy is to provide accounts on a different, more suitable resource for all or part of the work. This monitoring system was also used during the recent replacement of the 8-processor CRAY Y-MP with an equivalent 12-processor CRAY J90se system.

It is more difficult to monitor the performance of the MPP platforms, and it is often left to the users themselves to make use of tools such as Apprentice, PAT and VAMPIR to inspect performance. Monitoring of jobs is performed through the accounting software, NQS, and grmview logs. Accounting and NQS allow the detection of unusual behaviour, such as short or terminated jobs, and the user is then questioned as to the nature of the problem, if any. grmview is used to monitor memory use, and users whose jobs use large numbers of processors with relatively low memory demands are reminded of the nature of distributed memory systems. However, it should be noted that for some users who desire short turnaround this is acceptable behaviour.

In all cases of monitoring usage it is necessary to be aware of users' actual demands and of the fact that computational backgrounds vary greatly. In general this process has been successful and has resulted in significant performance improvements after relatively little effort. In other cases there are sound reasons behind the low performance observed, or little can be done because of the large amount of work needed to improve matters or the user's lack of knowledge about the code's internal workings. In the vast majority of cases, interaction between ARSC and centre users arising from the above inspection process has led to some benefit, either in a code change or in a reconfiguration of the local systems.

Tools for Debugging and Performance Inspection

There are many tools available for the inspection of code and the evaluation of performance. Apprentice gives an immense, perhaps overwhelming, amount of information and can be a great help in finding areas where performance can be improved. However, for many users who are just starting to look at the performance of their codes, a simpler solution can be more productive in the beginning. A simple tool that gives basic information on which routines are the most expensive, and has a less dramatic impact on the runtime of the code than Apprentice, is PAT. Simpler still is to measure the performance of the code with streams turned on and off. By taking these simple steps several users have been able rapidly to identify a few routines which were targets for modification, and to obtain significant performance improvements with relatively little recoding effort. All users are pointed to the various items of documentation regarding parallel performance provided by Cray and by other centres; such sources of information are important both for users and for consultants at ARSC. The visualisation of the actual data exchange that occurs within a program is of great use in determining possible optimisations, both when writing code from scratch and when porting code between different architectures. Here ARSC has strongly promoted VAMPIR. VAMPIR is also used to generate images of message passing for the example programs covered in ARSC training classes and in the newsletter.
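
The basic information a first profiling pass needs, which routines cost the most, can be illustrated with a small sketch (illustrative Python, not an ARSC or Cray tool; the routine names are hypothetical): accumulate wall-clock time per routine, then list the most expensive first.

```python
import time
from collections import defaultdict

elapsed = defaultdict(float)

def timed(fn):
    """Accumulate wall-clock time spent in each decorated routine."""
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed[fn.__name__] += time.perf_counter() - t0
    return wrapper

@timed
def expensive_routine():       # hypothetical stand-in for a solver kernel
    return sum(i * i for i in range(100_000))

@timed
def cheap_routine():           # hypothetical stand-in for setup code
    return sum(range(100))

for _ in range(5):
    expensive_routine()
    cheap_routine()

# Report routines by cost, most expensive first -- enough to pick
# the few routines worth modifying before reaching for Apprentice.
report = sorted(elapsed.items(), key=lambda kv: -kv[1])
for name, t in report:
    print(f"{name:20s} {t:.6f} s")
```

A user who starts from a report like this, rather than from a full trace, usually finds the one or two routines worth attention much sooner.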

Performance evaluation is but one part of working in parallel. Many users find debugging a particular challenge: they encounter new problems, such as codes that work on P processors but fail or give incorrect results on P+1, or that occasionally fail at random. The Cray version of TotalView is particularly useful, but it is often a clear and careful approach to the testing of parallel code, and the creation of good parallel test cases, which speeds the identification of errors, rather than the existence of such tools. Many errors are still found by a combination of code knowledge and strategic write statements, often a lengthy process. However, tools are advancing, and several advanced users have noted that newer debugging tools provide greater detail, in particular for message passing codes.

Perhaps the greatest problem with all these tools is actually encouraging users to adopt them. The most successful technique so far has been examples of use by peers, rather than any in-depth tools survey, training class or other method. Users respond best to work carried out by other users, and are more likely to follow in the keystrokes of another scientist than of a specialist from a computing centre or a tools expert. Various examples of all the tools mentioned above can be found in the ARSC newsletter, http://www.arsc.edu/pubs/MPPnews.shtml.

Algorithms

As has already been mentioned, the users of ARSC resources vary greatly in the techniques they use, from highly complex codes which employ advanced solvers on problems requiring nearly all the processors and memory of the ARSC systems, to users who are making their first steps into numerical simulation.

One of the simplest changes a user can make is to ensure that a code is flexible in the number of processors it can use. This has multiple benefits, the most important being the freedom it provides in moving between systems and in actually getting jobs run. Not only can the user work with any available number of processors (subject to some basic constraints such as memory limits), but often the job can expand to use larger numbers should they be available. The most important principles are that the amount of work per processor can vary and may differ across processors; in short, users are encouraged not to work with magic numbers. Another important point is that IO should write a single file which can be read by any number of processors, rather than a number of files, or a format which depends on the number of processors used in the last run.
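
The "no magic numbers" principle can be sketched in a few lines (illustrative Python; the function name is ours): a block decomposition that is valid for any processor count, including counts that do not divide the problem size evenly.

```python
def block_range(n_items, n_procs, rank):
    """Return (start, count) for this rank's share of n_items.

    Works for any processor count: the first (n_items % n_procs)
    ranks each take one extra item, so no 'magic' divisibility is
    assumed anywhere in the code.
    """
    base, extra = divmod(n_items, n_procs)
    count = base + (1 if rank < extra else 0)
    start = rank * base + min(rank, extra)
    return start, count

# Every item is covered exactly once, for any processor count.
for p in (3, 7, 12):
    covered = []
    for r in range(p):
        s, c = block_range(100, p, r)
        covered.extend(range(s, s + c))
    assert covered == list(range(100))
```

A code built on a routine like this can run on whatever processors are free, and can expand or shrink between runs without any source changes.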

There have also been several cases where users porting serial codes to parallel have struggled to make the parallel code produce results identical to its serial version, or indeed to make runs with different numbers of processors produce identical results. While in many cases this is a laudable effort, it can be difficult if not impossible. For example, simple changes in the order of computation of global sum operations, or a different update order for nodes in a mesh, can lead to slightly different results. In the case of global sums the parallel code can be changed to perform an identical summation sequence, but this is often costly in terms of performance and memory. A better solution is to change the serial code to mimic the parallel summation process. The origin and nature of these numerical differences must be considered carefully: an algorithm which is sensitive to the ordering of events should be carefully validated in terms of the accuracy and correctness of its results. (While many users are aware that different levels of optimisation can give different results, many new parallel users are not aware of the similar problems in parallel, for example with the sum of sums which occurs in a parallel summation.)
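
The suggested fix, changing the serial code to mimic the parallel summation, can be sketched as follows (illustrative Python; a real code would do this in its own Fortran or C, and the function name is ours):

```python
def blocked_sum(values, n_procs):
    """Sum 'values' in the order a P-processor reduction would:
    each 'processor' sums its contiguous block, then the partial
    sums are combined in rank order. Running this serially with
    the production P reproduces the parallel rounding exactly.
    """
    base, extra = divmod(len(values), n_procs)
    partials = []
    start = 0
    for r in range(n_procs):
        count = base + (1 if r < extra else 0)
        block = 0.0
        for v in values[start:start + count]:
            block += v
        partials.append(block)
        start += count
    total = 0.0
    for p in partials:      # combine in rank order
        total += p
    return total

data = [0.1] * 1000

# Results for different P are close but need not be bit-for-bit
# identical: the rounding depends on the order of the additions,
# which is exactly the "sum of sums" effect described above.
print(blocked_sum(data, 1), blocked_sum(data, 16))
```

The same P always gives the same bits, which is what matters for validating a port; different P values should be compared with a tolerance, not with equality.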

After parallelism is conquered, the basic numerical methods often leave something to be desired. The expansion of problem size made possible by parallel computing is often not accompanied by a similar investment in the development of efficient numerical methods. Indeed, many users happily increase the size of the problem being tackled in sudden steps and then wonder at the lack of scalability. Often this is due not to a lack of scalability in the system or the parallel method, but to the numerical cost of solving a larger or finer-resolution problem.

Training Provided

As described earlier, ARSC attempts to monitor the usage of the systems and to find those users who might benefit from improvements. The actual improvements come from a variety of changes made by the user. These have been collected together and form part of two ARSC training modules: one, 'Case Studies', outlines the various methods for making common scientific codes parallel and is illustrated by examples selected from the ARSC user community; the other gives guidance on developing 'Parallel, Productive and Portable' code.

The 'Case Studies' module covers topics from embarrassingly parallel work through task farming to domain decomposition, finishing with a global transpose code. This covers the full range of parallelism and leaves users in a position to choose which approach might be most applicable to their own problem. The case studies also discuss the differences between data-parallel languages such as HPF and the MPI and shmem libraries, and their relative advantages and disadvantages for different algorithms.

'Parallel, Productive and Portable' discusses code development methods and such issues as parallel code debugging, performance tuning and validation. In particular, the concept of code testing and validation is stressed so that users can hopefully spend less time debugging. The process behind the correct choice of algorithm is covered, both in terms of its parallel nature and its scientific or mathematical correctness. The reasons for different results in parallel are also covered.
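
The testing-and-validation advice can be made concrete with a small harness (a hypothetical Python sketch, not part of the ARSC training material): run the same computation at several simulated processor counts and check that the results agree to a stated tolerance, rather than bit-for-bit.

```python
import math

def check_invariance(compute, proc_counts, rel_tol=1e-9):
    """Run compute(p) for each simulated processor count and verify
    all results agree with the first to within rel_tol. Returns the
    reference result if the check passes."""
    results = [(p, compute(p)) for p in proc_counts]
    ref_p, ref = results[0]
    for p, r in results[1:]:
        if not math.isclose(r, ref, rel_tol=rel_tol):
            raise AssertionError(f"P={p}: {r!r} disagrees with P={ref_p}: {ref!r}")
    return ref

def blocked_sum_of(data):
    """A computation whose result depends slightly on P: sum the data
    in P contiguous blocks, then combine the block sums."""
    def compute(p):
        base, extra = divmod(len(data), p)
        total, start = 0.0, 0
        for r in range(p):
            count = base + (1 if r < extra else 0)
            total += sum(data[start:start + count])
            start += count
        return total
    return compute

ref = check_invariance(blocked_sum_of([0.1] * 10_000), [1, 3, 16, 97])
```

A harness like this turns "the answers look about the same" into a repeatable parallel test case, which is exactly what speeds the identification of real errors.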

User Trends

In the past year there has been a notable increase in the number of users employing so-called "embarrassingly parallel" approaches to meet their computational needs. While this might not seem particularly challenging, it presents a great opportunity for users and the centre to work together to improve overall utilisation for all. Several of the users in this category have been encouraged to make their codes as flexible as possible in the number of processors used. Since the majority simply need to process a large number of near-identical tasks, it is relatively easy for them to exploit the fragments of the system left over between other work, or to choose a queue which currently has relatively little work scheduled. In one particular case, high demand from one such "embarrassingly parallel" user resulted in the ARSC Cray T3E system working at nearly 95% utilisation for the whole of January 1998.
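
Why such users can soak up leftover fragments is easy to see in a sketch (illustrative Python; the function name is ours): with near-identical independent tasks, a greedy task farm balances almost perfectly over any worker count, however awkward.

```python
import heapq

def farm(task_costs, n_workers):
    """Greedy task farm: always hand the next task to the least
    loaded worker. Returns the sorted per-worker loads."""
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    for cost in task_costs:
        load, w = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, w))
    return sorted(load for load, _ in heap)

# 1000 near-identical tasks over an awkward, non-power-of-2 worker
# count still split almost evenly -- so the code runs efficiently on
# whatever fragment of the machine happens to be free.
loads = farm([1.0] * 1000, 13)
print(min(loads), max(loads))
```

Because the balance is insensitive to the worker count, such jobs can be asked to fit around other work rather than demanding a fixed partition.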

Mixing Languages

While the centre's users are mostly Fortran users, and much of their code is still Fortran 77, there is an increasing trend to mix C and Fortran in applications. The need arises from the number of graphical and data handling libraries available only in C. Users generally struggle with the conversion process and with passing data between the two languages with matching types, word sizes and array shapes. A few simple examples have been created and placed in the ARSC newsletter to cover this, and it is now common practice, particularly for codes which make use of the ARSC visualisation resources as well as the computational resources.
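
The array-shape pitfall stems from C storing arrays row-major while Fortran stores them column-major. The mismatch can be sketched as an index calculation (illustrative Python used for the arithmetic; the function names are ours):

```python
def c_offset(i, j, n_rows, n_cols):
    """Flat memory offset of a[i][j] in a C (row-major)
    n_rows x n_cols array."""
    return i * n_cols + j

def fortran_offset(i, j, n_rows, n_cols):
    """Flat memory offset of A(i+1, j+1) in a Fortran (column-major)
    array of the same shape, using 0-based i, j for comparison."""
    return j * n_rows + i

# The same logical element lands at different offsets, so an array
# passed unchanged between the two languages appears transposed.
assert c_offset(1, 2, 3, 4) == 6
assert fortran_offset(1, 2, 3, 4) == 7
```

The practical consequence is the usual advice in the newsletter examples: either transpose at the language boundary, or swap the index order (and the declared shape) on one side.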

High Performance Fortran has been installed at the centre, since many scientists wish to exploit the potential of high performance systems without making a great investment in learning message passing interfaces such as MPI. ARSC usually directs towards HPF only those users whose problems are considered amenable to the language's data distribution capabilities, and strongly suggests that users build up through some simple examples before attempting a major application program. So far this has proven a safe approach, aided by advice from SGI/CRAY and PGI. Users of HPF still need to be aware of the nature of parallelism in order to write good parallel programs: a non-parallel algorithm will not run in parallel.

Improving the System

So far this paper has discussed the changes a user can make to improve code performance. Yet there is much that can be done by the centre, both to improve performance in absolute terms by suitable configuration and to reinforce good user behaviour. Several such successful changes have been undertaken at ARSC, and some are described below.

Scheduling

The scheduling of MPP systems to ensure they are well used is perhaps the greatest challenge faced by any centre. Here 'well used' means that all processors and memory are kept busy with useful work, and that all users feel progress is being made with their projects. The range of users at ARSC presents both challenges and advantages in trying to get the best from the limited compute resources while keeping all of the users happy. ARSC users fall into several distinct groups.

Since much has been made in the preceding sections of asking users to be flexible, it would be a failure if ARSC were unable to reward this by offering improved queue throughput. Indeed, many of the above changes are actually needed if a user is to take advantage of some aspects of the ARSC queue structure.

The current queue system is based on two time limits: quick queues of 30 minutes and longer queues of 8 hours. There is also a prioritised queue for the Grand Challenge work, which takes the next suitable slot available. A queue policy for all users sets out a number of simple rules so that no one user dominates the service. The queues divide jobs by processor count. On the site's previous Cray T3D system the division was based around powers of 2, but the flexibility of the Cray T3E to run jobs with any number of processors has allowed the boundaries to be shifted away from these numbers. This was done in part to promote the use of non-power-of-2 processor counts, and so to encourage users to be flexible in the number of processors any job needs. While this has taken some time, it has resulted in fewer jobs spending time waiting for free resources.

When the Cray T3E first arrived at ARSC in early 1997 there were a number of limitations to the system. It was noted that on several occasions there were enough free processors for a particular job to run, but since they were not contiguous the job could not start. ARSC developed a simple policy of migration to place all free processors together to prevent such blocking, and this was coded to run as an exit within NQS. This resulted in an improvement in utilisation.
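
The effect of that migration policy can be sketched as follows (illustrative Python; the real mechanism was an NQS exit operating on the T3E's processor map, not this code):

```python
def largest_free_run(pe_busy):
    """Length of the longest contiguous run of free (False) PEs."""
    run = best = 0
    for busy in pe_busy:
        run = 0 if busy else run + 1
        best = max(best, run)
    return best

def compact(pe_busy):
    """Model of migration: move running jobs towards PE 0 so that all
    free PEs form a single contiguous block at the top of the machine."""
    n_busy = sum(pe_busy)
    return [True] * n_busy + [False] * (len(pe_busy) - n_busy)

# Eight PEs, four free but fragmented: a 4-PE job is blocked before
# compaction, and can start afterwards.
machine = [True, False, True, True, False, False, True, False]
assert largest_free_run(machine) < 4
assert largest_free_run(compact(machine)) == 4
```

On a torus-connected machine the real placement constraints are richer than this one-dimensional model, but the fragmentation-versus-compaction trade-off is the same.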

With the arrival of a Department of Defense High Performance Computing Modernization Office Grand Challenge project which needed to utilise the entire system, it became necessary to stop all queues until running work was complete, run the Grand Challenge job, and then restart the queues. This resulted in many idle hours of system time, minimised to some extent by informing users that the queues were held, allowing more interactive work to be run, and actively suggesting that users look at performance and debugging during these sessions. However, a better solution was needed. Fortunately for ARSC, Cray added a checkpoint facility to UNICOS/mk at about the same time as the Grand Challenge went into full production in early 1998. This now allows the Grand Challenge jobs, or other large jobs, to be run at any time, and allows work to be held through maintenance downtimes and reboots. Operators can also occasionally juggle jobs to ensure a high rate of utilisation, or particular work can be prioritised.

While the NQS modification and the manual use of checkpointing might seem simple compared with some of the latest queue/scheduler systems available, these two items greatly improve the utilisation of ARSC resources. Replacement with NQE or some other scheduling software has often been considered, but until there is a change in usage patterns, or a suitable scheduler appears which works as well, there seems little reason to change. (Note that there was an interesting and informative discussion/tutorial on scheduling at CUG98.)

Interactive sessions

As described in the introduction, the majority of scientists use cycles at ARSC in production, and there is actually little interactive work demanded by users on a regular basis. The need typically arises when a user wishes to use the interactive tools available, or needs to make a number of runs in quick succession rather than proceed through the normal queues. It is often asked whether it is the setup which causes this behaviour, and the concept of interactive pools or dedicated sessions is frequently discussed at the periodic configuration review meetings. Currently, sessions are provided at key times during the week, typically before test times. (Monitoring shows that little work is actually performed, but it is always a concern that this is simply because of the high batch load on the system.) Quick queues are provided for short benchmarking and testing runs, and checkpointing allows a periodic turnaround of these queues during the day without introducing excessive administrative overheads.

The checkpointing process and machine loading still require extensive human intervention. This is considered worthwhile, since it provides an important performance advantage over the still relatively 'dumb' schedulers available, without imposing on users the cost of a complex queue structure. It has been hard to estimate the quantitative improvement checkpointing has made to the system, but feedback from users on the improved throughput and turnaround, and negative feedback when the policy has not been enforced, suggests it is a considerable qualitative improvement from the users' point of view.

Passing the Word!

There is a common belief that users can and do read documentation, and actually understand not only its literal contents but also its intended meaning. The problem is sometimes compounded by the presentation of large volumes of information on the WWW. Keeping this material current is also a major task, both for the information provider and for those who make it available and reference it. ARSC has tried to provide all of this so that users can search such resources if necessary, but has also tried to direct users to the most suitable parts, for example general information on optimisation and introductions to key topics.

ARSC MPP newsletter

As part of the PhAROh project (http://www.osc.edu/Pharoh/) ARSC started a newsletter for users of the ARSC MPP platforms. The newsletter is currently distributed to a mailing list of over 500 interested parties from a range of backgrounds covering vendors, computer scientists and, of course, users. It is also made available at http://www.arsc.edu/pubs/MPPnews.shtml and provides a useful reference tool for users, with many example programs and helpful advice. Contributions come from ARSC staff and centre users; the main purpose is to pass on experience of using MPPs to the wider community and to encourage users to try out new software or ideas after seeing the successes of others. The newsletter is also used by ARSC to promote good behaviour by users, by discussing tools, queues, algorithms and other matters.

Acknowledgment and Thanks

I would like to thank all users of ARSC for contributing both directly and indirectly to the contents of this paper, along with the staff of ARSC user services and technical services. Particular thanks go to all those users who have contributed to the ARSC T3D/T3E/MPP newsletter, taking the time and effort to pass on their experiences for the benefit of other users.

Conclusion

This paper has described some of the techniques used by both ARSC staff and users to ensure that the centre's resources are well utilised and that the users themselves are productive. A major part of this is a constant effort to encourage users to continue developing their codes to adapt to the ever-changing face of high performance computing. This is in itself a difficult task, since the majority of the users are scientists and engineers who are primarily focused on furthering scientific knowledge, and for whom computing is one tool among many. Much of the success has been due to the centre and the scientist working together, both to configure systems and to modify code and working practices.
