System Support SIG

During the open SIG session, Cray presented details on huge page support for Cray systems. The presentation covered how to use the new functionality as well as its benefits to applications. Dual core systems will see 8 TLB entries for huge pages, for a total of 16 MB. Quad core systems will see 128 entries, for a total of 512 MB. The use of huge pages requires SLES 10 SP1. Using this functionality is as simple as specifying compile-time options and options to aprun.

All in attendance had the opportunity to raise questions and concerns with Cray in an open forum. A number of questions regarding software quality were raised. Areas of concern were ALPS, Lustre, and batch systems. Out of this, Cray recognized the sites’ concerns and will relay them to others within Cray. One thing that was pointed out is that although a concern may appear to be with a particular piece of software, the problem area could be somewhere else.

CrayPort, the replacement for CRInform, was announced. While this is the initial rollout, enhancements will continue. In addition to managing cases and bugs, users can order and download software and access hardware and software documentation. All existing SPRs will be transferred into CrayPort. To gain access, visit http://crayport.cray.com and register.

Sites expressed a desire to review design documents and white papers early in the development process. Getting customer feedback early provides the ability to effect change, resulting in a package that meets customers’ needs. While Cray was open to this idea, it was unclear how to make it work, and more discussion will be needed.

A concern was raised over having only a single parallel file system, Lustre, available for Cray systems. Sites viewed this as limiting and expressed the desire to have multiple choices. One suggested product to include is IBM’s General Parallel File System (GPFS).
Having multiple parallel file systems would allow customers to choose the one that best fits their needs.

The reliability of Cray XT systems was questioned, as customers seem to be noticing a decline. A focus is needed on getting bugs fixed and tested. There is a difficult balance between bug fixes and new functionality. Enhancements to the CLE 2.1 software stack were required to support the XT5 systems. Cray is maintaining a single software stack across the XT systems. A new feature aimed at improving reliability is Node Health. Node Health is targeted at identifying suspect compute nodes and taking appropriate action to prevent greater system problems. This feature will continue to be enhanced over time.

A Best Practices wiki is maintained internally at Cray for the express use of Cray field personnel to communicate and share ideas and utilities. Many sites have developed processes and utilities to overcome problems and enhance the system. In many cases, a lot of time is spent re-inventing the same processes and utilities. This wiki provides a forum to exchange ideas, processes, and utilities.

Nick Cardo (NERSC)
Copyright 2008 All Rights Reserved.