Performance Evaluation of Radioss-CFD on the Cray X1

Alexander Akkerman
Ford Motor Company
MD-10, ECC Building, 20000 Rotunda Drive, Dearborn, Michigan 48121
Phone 313-337-1634
aakkerma@ford.com

Dr. Hang-Sheng Hou
Ford Motor Company
MD-203, PDC Building, 20901 Oakwood Blvd, Dearborn, Michigan 48124
Phone 313-317-2322
hhou@ford.com

Dimitri Nicolopoulos
MCube
54 rue Montgrand, BP232, 13178 Marseille Cedex, France
Phone +33 (0) 4 91 59 92 10
dimitri@mcube.fr

Herve Chevanne
Cray Inc
Bat. Hightech2, 22, Avenue De La Baltigue, Villebon Sur Yvette, 91953 Courtaboeuf, France
Phone +33 1 72 86 20 08
chevanne@cray.com

Dave Strenski
Cray Inc
7077 Fieldcrest Road, Suite 202, Brighton, Michigan 48116
Phone 313-317-4438
stren@cray.com

ABSTRACT:
Aero-acoustics is the study of noise generated within a moving fluid, such as air or water, often interacting with structures. MCube began working with Cray in 2000 to look at a key automotive problem, the noise created by the side-view mirrors. Since then we have investigated more complex engineering problems, including exhaust pipes, intakes, HVAC systems and centrifugal and axial fans. This talk will present results from the latest efforts from MCube and Cray in porting and optimizing Radioss-CFD to the Cray X1.

KEYWORDS:
Cray X1, Radioss-CFD, Exhaust Model

Introduction

Last year at the Cray User Group (CUG) meeting in Columbus, Ohio, we presented some early results for the Ford internal automotive crash code FCRASH [1]. Those results showed that the Cray X1 had the potential to solve this class of problems, but that further optimization and improvements in compiler technology were required to achieve our performance goals. Because the CAE workload in the automotive industry is dominated by commercial codes, we decided to switch our attention to one such code, Radioss-CFD, which has high vector content and is therefore likely to benefit from the performance of the Cray X1.

Radioss-CFD uses a 3D compressible Navier-Stokes formulation, which has been incorporated into the Radioss-Crash program. This CFD code fully couples the deformable mesh of the crash code with the transient fluid flow, thus allowing the simulation of problems with fluid-structure interactions, such as the noise coming from a vibrating exhaust manifold. In the past, problems of this nature could not be solved because of their complexity and extremely long simulation times. They can now be solved using a combination of the advanced algorithms in Radioss-CFD and more powerful computers such as the Cray X1.

This paper will present the results of an exhaust manifold analysis using Radioss-CFD, along with a few examples of the optimization needed within Radioss-CFD to meet the turnaround time required to make these simulations possible.

Radioss-CFD background

The field of aero-acoustics deals with noise generated by a turbulent fluid flow interacting with a vibrating structure. These simulations differ from the pure acoustic domain, where the objective is to model the propagation of acoustic pressure waves, including reflection, diffraction and absorption, in a medium at rest. Aero-acoustic questions arise in many industrial design problems and are heavily represented in nuisance noise related to the transportation industry. Every pedestrian's life is directly affected by noise generated by trains and automotive vehicles in the urban setting and by planes taking off, landing, or roaring overhead. Similarly, every passenger is annoyed by the noise within cars, planes, and trains, generated by the exterior wind, engine fans, exhaust, and ventilation fans. All these sources of acoustic noise are at best annoying, but may cause harm at high enough decibel levels or over long enough durations. The situation is amplified in large urban settings, where the concentration of these noises is higher [2].

A classification of aero-acoustic problems can be made using the following three categories:
  1. External wind noise transmitted to the inside through a structure. In the automotive industry the "A" pillar, side mirror and windshield wiper noise are typical sources in this category.
  2. Internal flow noise transmitted to the outside through a structure. Examples of this class of problems are exhaust manifolds, HVAC and intake noises.
  3. Rotating machine noise. Axial and centrifugal fans are noisy components that bring with them many aero-acoustic problems.
The trend in the automotive industry is to decrease the mass of new designs to achieve better fuel economy. Unfortunately, the best way to solve a vibration problem is to add mass to the system. There is therefore an increased need for engineering tools and methods that can not only predict the noise levels of a given design but also visualize the noise sources, helping design engineers understand where the noise originates and how to reduce it. This is why automotive NVH departments are increasingly concerned with the simulation of aero-acoustics.

To date, most aero-acoustic research work has been performed experimentally, but this approach has some shortcomings. Although it is relatively simple to set up a microphone, measure a noise level and derive a spectrum at any given location in space, the correct analysis of an aero-acoustic problem involves the use of more advanced experimental techniques:
  1. Aero-acoustic measurements need to be performed within a relatively silent flow in order to capture the noise contribution of the object under study. Flow noise in a standard wind tunnel is in the 80 dB to 90 dB range, higher than most aero-acoustic noise levels. Flow noise in an acoustic wind tunnel is in the 40 dB range. In other words, the wind flow in these tunnels is barely audible.
  2. One of the difficulties in aero-acoustics is the transparency of the fluids, which hinders the ability to visualize the flow. Noise generation zones are often located in the high-intensity turbulent regions of the flow, and tones depend on Strouhal behavior tied to the vortex shedding. Therefore, correct flow visualization, including the turbulence patterns, is critical.
  3. Other difficulties are associated with the so-called "pseudo" noise, the noise induced by convection of pressure variations (very energetic, but it does not radiate into the far field like acoustic waves), and with self-induced noise generated by the interaction of the moving fluid and the microphone. These effects can be eliminated with a combination of advanced microphone measurement techniques, such as careful choice of sensors (smooth shape and flush mounted), and post-processing techniques like array de-correlation, which discriminate between the acoustic waves (propagation) and the locally generated signals (pseudo and self-induced noise).
  4. Locating the source of acoustic noise requires methods such as acoustic beam forming or holography. Acoustic imaging with an array of microphones maps the sources by post-processing the signals reaching the array with time shifts tied to the finite propagation speed from the source.
These methods are complex to use and fairly expensive, which motivates the development of aero-acoustic CAE tools. Those tools complement the experiments and allow thorough visualization and understanding of the pressure and velocity fields as well as the structural vibrations. Furthermore, parametric studies can be performed at little additional cost, since modifying the numerical model is often straightforward and computing time is becoming more readily available.

Test Problem

Our test problem is a vehicle exhaust model. The structural part of the model consists of the exhaust manifold that is bolted to the engine block, an exhaust "Y" pipe that brings the two manifolds together and routes them under the vehicle to the catalytic converter, a section of exhaust pipe connecting the catalytic converter to the muffler, and a tail pipe that brings the exhaust over the rear axle and out the back of the vehicle. The fluid part of the problem is modeled by pulses of gas from alternating exhaust ports with a velocity of 250 m/s. The model represents a V10 engine. One firing cycle, in which each cylinder fires once, corresponds to 720 degrees of engine rotation (two revolutions). At 2000 RPM, this gives a firing interval of 6 ms per cylinder. As the gas pulse moves down the exhaust manifold and through the pipes, converter and muffler, it causes the structure to vibrate, generating noise. The goal of this project was to identify the source of the noise and change the design to reduce the noise level.

A typical exhaust model can reach hundreds of thousands of nodes and a similar number of shell and solid elements. With an exhaust gas pulse every 6 milliseconds, the model needs to simulate at least 60 milliseconds to capture each cylinder firing at least once, but it typically runs for 150 to 200 milliseconds. The structure is constrained in space at the exhaust system mounting points.
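
The 6 ms pulse spacing and the 60 ms minimum simulated time follow directly from the engine speed. The short worked check below assumes evenly spaced firing; the symbols $T_{\text{rev}}$, $T_{\text{cycle}}$ and $\Delta t_{\text{fire}}$ are introduced here for illustration only.

```latex
\begin{align*}
T_{\text{rev}} &= \frac{60\ \text{s/min}}{2000\ \text{rev/min}} = 30\ \text{ms},\\
T_{\text{cycle}} &= 2\,T_{\text{rev}} = 60\ \text{ms} \quad (720^\circ \text{ of rotation}),\\
\Delta t_{\text{fire}} &= \frac{T_{\text{cycle}}}{10\ \text{cylinders}} = 6\ \text{ms}.
\end{align*}
```

A run of 150 to 200 ms therefore covers roughly 2.5 to 3.3 complete firing cycles.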

For the model illustrated above, the simulation was compared to a physical experiment. The figures that follow show that the pressure and decibel levels of the simulated exhaust results (CAE) match up very well with the experimental results (Test) in both the time and frequency domains. For this paper we are only showing the comparison of the results at location "d" which is near the exhaust manifold. References [3] and [4] provide more details on the results.

Having shown that the simulation predicts the correct response, we now consider the computational performance of the Radioss-CFD code. From a historical perspective, a model similar to this was run only a few years ago on a single-processor Cray C90 in about 550 hours. That same model was later run on a Cray SX-6 in just under 100 hours.

For the Cray X1 port we used a similar, but smaller, model. This model consists of 110,000 nodes and 120,000 elements, with a simulation time of 200 milliseconds. It was run on 6 Cray T90 processors in 28 hours to establish a performance baseline. The following are the current elapsed times of Radioss-CFD on the Cray T90 and the Cray X1 using this baseline model and simulation time.
     6 processor T90      102,068 seconds
     4 processor  X1       64,586 seconds
     8 processor  X1       39,262 seconds
    16 processor  X1       32,090 seconds

From the X1 results on 4 and 8 processors, we estimated a 6-processor time of approximately 50,000 seconds. On a per-processor basis, this makes the current X1 performance about twice that of the T90.
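
The scaling implied by the timings above can be made explicit. The paper does not state how the 6-processor figure was derived, so the two interpolations below are assumptions that simply bracket the quoted value of approximately 50,000 seconds. The notation $t_p$ for the X1 elapsed time on $p$ processors is introduced here for illustration.

```latex
\begin{align*}
\text{Speedup, 4 to 8 processors:}\quad & t_4/t_8 = 64{,}586/39{,}262 \approx 1.65\\
\text{Speedup, 8 to 16 processors:}\quad & t_8/t_{16} = 39{,}262/32{,}090 \approx 1.22\\
\text{Linear interpolation in time:}\quad & t_6 \approx \tfrac{1}{2}\,(t_4 + t_8) = 51{,}924\ \text{s}\\
\text{Linear interpolation in rate:}\quad & t_6 \approx \frac{2}{1/t_4 + 1/t_8} \approx 48{,}836\ \text{s}\\
\text{T90 to X1 ratio at 6 processors:}\quad & 102{,}068/50{,}000 \approx 2.04
\end{align*}
```

Doubling from 8 to 16 processors gains only about 22 percent, which suggests that parallel scaling, and not only single-processor speed, limits turnaround for a model of this size.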

Sample Optimization Used to Increase Performance

The initial port of Radioss-CFD to the Cray X1 was completed in late 2003. After the port was complete and passed the test suite, the focus shifted to optimization. The code was profiled to see where it spent most of its time, and several routines were identified as optimization targets. Because this is a proprietary commercial code, we can only show examples of the loop structures that were optimized to improve performance.

The first example illustrates the value of rearranging a double nested loop from non-unit stride to unit stride memory access. The original code had the following structure:
                integer ndim
                real    a0(3,ndim),a1(3,ndim),a2(3,ndim),a3(3,ndim)
  MV----<       do 100 j=1,ndim
  MV              a0(1,j) = a1(1,j) + a2(1,j) + a3(1,j)
  MV              a0(2,j) = a1(2,j) + a2(2,j) + a3(2,j)
  MV              a0(3,j) = a1(3,j) + a2(3,j) + a3(3,j)
  MV---->   100 continue

The loop marks on the left are generated by the compiler option -rm. The "M" means the loop was streamed and the "V" shows that the loop was also vectorized. On the surface, this appears to be a very well optimized loop, since the inner loop with a trip count of three has been manually unrolled and the loop was both streamed and vectorized. Because the loop was streamed, the compiler takes the trip count of ndim and divides it into four equal streams, giving each SSP one quarter of the work. It then splits each stream into chunks of 64, the SSP's vector length, and vectorizes the work. As a result, the loop is vectorized on the j index, and because j is the second array index, each vector memory access has a non-unit stride of three. To improve the performance of this code section, the loop with a trip count of three has been re-rolled as follows:
                integer ndim
                real    a0(3,ndim),a1(3,ndim),a2(3,ndim),a3(3,ndim)
  C-----<       do 110 j=1,ndim
  C MV--<         do 100 i=1,3
  C MV              a0(i,j) = a1(i,j) + a2(i,j) + a3(i,j)
  C MV-->   100   continue
  C----->   110 continue

Note how the loop marks have changed: the inner short loop is streamed and vectorized, and the outer loop is marked with "C", which means it has been collapsed into the inner short loop. The compiler is now streaming and vectorizing on the i index of the inner loop and stacking the short loops one after another, yielding a trip count of 3*ndim with unit-stride memory access. The resulting code runs about twice as fast as the original.

Typically, such restructuring of the code would be expected to be performed by the compiler automatically. Because it did not in this case, we filed a Software Problem Report (SPR) to address this problem.
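
The stride argument in the first example can be checked with a small stand-alone program. This is only a sketch under stated assumptions: it is written in free-form Fortran 90, not the fixed-form style of the listings above, the value ndim = 8 and the test data are arbitrary, and the explicit k loop merely imitates what the collapsed loop nest amounts to. It is not the code the Cray compiler generates.

```fortran
! Minimal sketch (free-form Fortran 90): why vectorizing on j gives stride 3.
! Array names follow the paper; ndim = 8 and the test data are arbitrary.
program stride_demo
  implicit none
  integer, parameter :: ndim = 8
  real :: a1(3,ndim), a2(3,ndim), a3(3,ndim)
  real :: a0_unrolled(3,ndim), a0_collapsed(3,ndim)
  integer :: i, j, k

  ! Fill the inputs with small integers so the sums are exact in real arithmetic.
  do j = 1, ndim
    do i = 1, 3
      a1(i,j) = real(i)
      a2(i,j) = real(10*j)
      a3(i,j) = real(i + j)
    end do
  end do

  ! Column-major storage: element (i,j) sits at offset (i-1) + 3*(j-1).
  ! Stepping j with i fixed therefore moves 3 elements at a time.
  print *, 'offsets of a0(1,1:4):', (3*(j-1), j = 1, 4)
  ! Stepping i fastest and then j visits consecutive offsets: unit stride.
  print *, 'offsets, collapsed  :', (((i-1) + 3*(j-1), i = 1, 3), j = 1, 2)

  ! Form 1: hand-unrolled, vectorized over j (stride 3 per statement).
  do j = 1, ndim
    a0_unrolled(1,j) = a1(1,j) + a2(1,j) + a3(1,j)
    a0_unrolled(2,j) = a1(2,j) + a2(2,j) + a3(2,j)
    a0_unrolled(3,j) = a1(3,j) + a2(3,j) + a3(3,j)
  end do

  ! Form 2: what the collapsed nest amounts to, one loop of 3*ndim
  ! iterations over contiguous memory. The (i,j) pair is recovered from k
  ! here only to keep the sketch standard-conforming.
  do k = 0, 3*ndim - 1
    i = mod(k, 3) + 1
    j = k/3 + 1
    a0_collapsed(i,j) = a1(i,j) + a2(i,j) + a3(i,j)
  end do

  if (any(a0_unrolled /= a0_collapsed)) then
    print *, 'MISMATCH'
  else
    print *, 'both forms give identical results'
  end if
end program stride_demo
```

Both forms compute the same values. Only the order in which memory is touched differs, which is why the change is safe for the compiler, or the programmer, to make.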

The second example shows a case where rearranging the loops allows the compiler to notice common work in several loops and fuse the loops into one. Like the first example, it shows how to optimize code for better memory access. Here is the original code:
                integer ndim
                real    a0(3,ndim),a1(3,ndim),a2(3,ndim),a3(3,ndim)
                logical cond1
  C-----<       do 110 j=1,ndim
  C               if(cond1) then
  C MV--<           do 100 i=1,3
  C MV                a0(i,j) = a1(i,j) + a2(i,j) + a3(i,j)
  C MV-->   100     continue
  C               endif
  C----->   110 continue
  C-----<       do 210 j=1,ndim
  C               if(cond1) then
  C MV--<           do 200 i=1,3
  C MV                a0(i,j) = a1(i,j) + a2(i,j) + a3(i,j)
  C MV-->   200     continue
  C               endif
  C----->   210 continue
  C-----<       do 310 j=1,ndim
  C               if(cond1) then
  C MV--<           do 300 i=1,3
  C MV                a0(i,j) = a1(i,j) + a2(i,j) + a3(i,j)
  C MV-->   300     continue
  C               endif
  C----->   310 continue

In this fragment, the compiler optimized each set of loops for unit-stride memory access. However, the compiler failed to recognize that each loop is doing the same work and that the three loops could be combined into one. In this simple test case, each loop calculates the same results redundantly, so fusing the loops eliminates two thirds of the work. This is not the case in the Radioss-CFD code; the redundancy here only serves to illustrate the optimization technique. To help the compiler recognize this redundant code, the loops must be unrolled as follows:
                integer ndim
                real    a0(3,ndim),a1(3,ndim),a2(3,ndim),a3(3,ndim)
                logical cond1
  MV----<       do 100 j=1,ndim
  MV              if(cond1) then
  MV                a0(1,j) = a1(1,j) + a2(1,j) + a3(1,j)
  MV                a0(2,j) = a1(2,j) + a2(2,j) + a3(2,j)
  MV                a0(3,j) = a1(3,j) + a2(3,j) + a3(3,j)
  MV              endif
  MV---->   100 continue
  f-----<       do 200 j=1,ndim
  f               if(cond1) then
  f                 a0(1,j) = a1(1,j) + a2(1,j) + a3(1,j)
  f                 a0(2,j) = a1(2,j) + a2(2,j) + a3(2,j)
  f                 a0(3,j) = a1(3,j) + a2(3,j) + a3(3,j)
  f               endif
  f----->   200 continue
  f-----<       do 300 j=1,ndim
  f               if(cond1) then
  f                 a0(1,j) = a1(1,j) + a2(1,j) + a3(1,j)
  f                 a0(2,j) = a1(2,j) + a2(2,j) + a3(2,j)
  f                 a0(3,j) = a1(3,j) + a2(3,j) + a3(3,j)
  f               endif
  f----->   300 continue

The loop marks in this modified code show that the compiler has recognized that the three loops share common, redundant work and has fused them into one loop. Fusion minimizes memory traffic by performing multiple operations on the same data before storing it back to memory. As in the first example, this code modification is something that the compiler should have performed automatically, and an SPR has been entered against this test case as well. Another observation about these first two examples is that the best improvement is achieved by both fusing loops and rearranging the memory access to unit stride.
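
Since the test case above is deliberately redundant, a sketch with non-redundant work may show more clearly what fusion saves. The output arrays b, c and d and their formulas are hypothetical, chosen only for illustration; the program is free-form Fortran 90 and makes no claim about the loops actually present in Radioss-CFD.

```fortran
! Minimal sketch (free-form Fortran 90) of loop fusion on non-redundant work.
! The arrays b, c, d and their formulas are hypothetical illustrations.
program fusion_demo
  implicit none
  integer, parameter :: ndim = 1000
  real :: a1(3,ndim), a2(3,ndim), a3(3,ndim)
  real :: b(3,ndim), c(3,ndim), d(3,ndim)
  real :: b_f(3,ndim), c_f(3,ndim), d_f(3,ndim)
  integer :: i, j

  do j = 1, ndim
    do i = 1, 3
      a1(i,j) = real(i)
      a2(i,j) = real(j)
      a3(i,j) = real(i + j)
    end do
  end do

  ! Separate loops: a1, a2 and a3 are each read from memory twice.
  do j = 1, ndim
    do i = 1, 3
      b(i,j) = a1(i,j) + a2(i,j)
    end do
  end do
  do j = 1, ndim
    do i = 1, 3
      c(i,j) = a1(i,j) - a3(i,j)
    end do
  end do
  do j = 1, ndim
    do i = 1, 3
      d(i,j) = a2(i,j) + a3(i,j)
    end do
  end do

  ! Fused loop: each operand can be loaded once and reused for all three results.
  do j = 1, ndim
    do i = 1, 3
      b_f(i,j) = a1(i,j) + a2(i,j)
      c_f(i,j) = a1(i,j) - a3(i,j)
      d_f(i,j) = a2(i,j) + a3(i,j)
    end do
  end do

  if (all(b == b_f) .and. all(c == c_f) .and. all(d == d_f)) then
    print *, 'fused and separate loops agree'
  else
    print *, 'MISMATCH'
  end if
end program fusion_demo
```

Fusion is legal here because no loop reads a value written by another. When such a dependence exists, the loops can only be fused if the order of reads and writes is preserved.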

The third example is a bit more complicated. The loop structure has a complicated exit condition based on a logical flag. The outer loop cannot be multi-streamed because it contains an inner loop with an alternate exit. The compiler also fails to vectorize the inner loop because of the complex code preceding the conditional exit from the loop. This is what the original code looks like:
              integer ndim
              real a0(ndim), a1(ndim), a2(ndim), a3(ndim)
              logical cond1(3,ndim)
  1-----<     do 120 i=1,ndim
  1 2---<       do 100 j=1,3
  1 2             ml = 11
  1 2             if(cond1(j,i)) ml = 52
  1 2             if (ml.ne.11) goto 110
  1 2---> 100   continue
  1       110   continue
  1             if(ml.ne.11) then
  1               a0(i) = a1(i) + a2(i) + a3(i)
  1             else
  1               a0(i) = a0(i) * 20.0
  1             endif
  1-----> 120 continue

The above loop is neither vectorized nor streamed. To enable both, the loop indices must first be split into two lists, one for the iterations where the condition is true and one for those where it is false. Each list can then be processed in both vector and streaming mode. Here is the revised code fragment:
              integer ndim
              real    a0(ndim),a1(ndim),a2(ndim),a3(ndim)
              logical cond1(3,ndim)
              integer count_mlne11, list_mlne11(ndim)
              integer count_mleq11, list_mleq11(ndim)
              count_mleq11 = 0
              count_mlne11 = 0
  V------<    do 100 i=1,ndim
  V             ml = 11
  V             if(cond1(1,i)) ml=52
  V             if(ml.ne.11) goto 110
  V
  V             if(cond1(2,i)) ml=52
  V             if(ml.ne.11) goto 110
  V
  V             if(cond1(3,i)) ml=52
  V             if(ml.ne.11) goto 110
  V
  V             count_mleq11 = count_mleq11 + 1
  V             list_mleq11(count_mleq11) = i
  V       110   continue
  V             if(ml.ne.11) then
  V               count_mlne11 = count_mlne11 + 1
  V               list_mlne11(count_mlne11) = i
  V             endif
  V-----> 100 continue
         CDIR$ CONCURRENT
  MVr---<     do 200 idrive=1,count_mlne11
  MVr           i = list_mlne11(idrive)
  MVr           a0(i) = a1(i) + a2(i) + a3(i)
  MVr---> 200 continue
          CDIR$ CONCURRENT
  MVr----<    do 300 idrive=1,count_mleq11
  MVr           i = list_mleq11(idrive)
  MVr           a0(i) = a0(i) * 20.0
  MVr---> 300 continue

With the logical array cond1(3,ndim) primed with random true and false values, this new code runs in both vector and streaming modes and is 13 times faster than the original fragment. The CDIR$ CONCURRENT directives were added to tell the compiler that the indices in these new lists do not overlap, so the loop iterations can be processed concurrently.
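
The index-list technique can be checked against the original branching logic with a small stand-alone program. This is a sketch only: it is written in free-form Fortran 90, it uses the any() intrinsic in place of the goto chain, and the names list_true, list_false, n_true and n_false, the value of ndim and the random test data are all choices made for this illustration. Compiler directives are omitted.

```fortran
! Minimal sketch (free-form Fortran 90): index-list splitting checked against
! the original branchy loop. ndim, the random data and the use of any() in
! place of the goto chain are choices made for this sketch.
program index_list_demo
  implicit none
  integer, parameter :: ndim = 1000
  real :: a1(ndim), a2(ndim), a3(ndim), a0_ref(ndim), a0_new(ndim), r(3,ndim)
  logical :: cond1(3,ndim)
  integer :: list_true(ndim), list_false(ndim)
  integer :: n_true, n_false, i, idrive

  call random_number(r)
  cond1 = r > 0.8          ! roughly half of the columns get at least one true

  do i = 1, ndim
    a1(i) = real(i)
    a2(i) = 2.0
    a3(i) = 3.0
  end do
  a0_ref = 1.0
  a0_new = 1.0

  ! Reference: same logic as the original loop, one branch per element.
  do i = 1, ndim
    if (any(cond1(:,i))) then
      a0_ref(i) = a1(i) + a2(i) + a3(i)
    else
      a0_ref(i) = a0_ref(i) * 20.0
    end if
  end do

  ! Pass 1: sort the indices into two lists.
  n_true = 0
  n_false = 0
  do i = 1, ndim
    if (any(cond1(:,i))) then
      n_true = n_true + 1
      list_true(n_true) = i
    else
      n_false = n_false + 1
      list_false(n_false) = i
    end if
  end do

  ! Pass 2: branch-free loops driven by the lists. Each index occurs once,
  ! which is the property the CONCURRENT directive asserts in the paper's code.
  do idrive = 1, n_true
    i = list_true(idrive)
    a0_new(i) = a1(i) + a2(i) + a3(i)
  end do
  do idrive = 1, n_false
    i = list_false(idrive)
    a0_new(i) = a0_new(i) * 20.0
  end do

  if (all(a0_ref == a0_new) .and. n_true + n_false == ndim) then
    print *, 'index-list version matches the reference'
  else
    print *, 'MISMATCH'
  end if
end program index_list_demo
```

The list-building pass costs extra work, so the technique pays off when the branch-free loops in the second pass dominate the run time.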

Summary

This paper described how aero-acoustic simulation results compare to experimental data for an exhaust model. There are other examples [2] where Radioss-CFD has correctly predicted the acoustic noise levels of vibrating structures. The paper also presented the performance of Radioss-CFD on the Cray X1 and some examples of the optimizations required to achieve that level of performance. However, much higher levels of performance are required to address more complex problems, such as modeling noise levels up to the 20 kHz range (the limit of audible hearing) with a substantially higher degree of detail in the models. These models would consist of millions of elements and, because of the higher density of elements, would require smaller simulation time steps, demanding an even higher level of performance. Such full vehicle models would result in computational requirements several orders of magnitude higher than the example used in this paper, so this is only the beginning. Further optimization, along with improvements in parallel performance and processor technology, is necessary to successfully address such problems in the near future.

References

[1] A. Akkerman, D. Strenski, "Porting FCRASH to the Cray X1 Architecture", 2003 Cray User Group meeting in Columbus, Ohio, http://www.cug.org/6-archives/previous_conferences/2003/CUG2003/pages/1-program/final_program/60.all_abstracts-table.htm

[2] D. Nicolopoulos, A. Jacques, F. Périé, "Direct Numerical Simulation of Aero-Acoustic phenomena", MCube Internal Whitepaper available at http://www.mcube.fr/M-Cube/papers.html

[3] H. Hou, B. Shahidi, "Coupled Fluid/Structure Exhaust System Noise Analysis", September 2002 ICEM-CFD conference, http://www.icemcfd.com/auto_day/agenda_02.html

[4] D. Nicolopoulos, F. Périé, "Recent Computational Aero Acoustic Developments and Industrial Applications with Radioss-CFD", Special Mini-Symposium at the Second M.I.T. Conference "Advanced Applications in Computational Fluid and Solid Mechanics with Established Software", June 2003, http://www.mecalog.co.jp/Events/MIT_Radioss_06_2003.pdf