**Team:** 28

**School:** MANZANO HIGH

**Area of Science:** Mathematics, Computer Science

**Interim:**

The problem we are trying to solve is assessing the overall efficiency of high performance computing platforms and determining what might be changed to improve that efficiency.

Our plan for solving this problem experimentally is to run benchmarking codes on multiple machines and then statistically process the data from their output. These codes include the High Performance Computing (HPC) Challenge benchmark code[2], which consists of many tests within one package, and the Numerical Aerospace Simulation (NAS) Parallel Benchmarks[7].

Of the many NAS Parallel Benchmarks, we have begun with Embarrassingly Parallel (EP), Conjugate Gradient (CG), and LU Decomposition (LU). EP is a pseudo-random number generator code in which every processor performs the same computation independently, with little communication. Each run starts by placing the application code on each of the nodes; each node then receives a different generation seed, a number that initializes its random number generator. CG is an iterative method for solving systems of linear equations; this type of code is well suited to optimization problems because it converges more quickly and is easier to apply than the method of steepest descent. LU is a parallel code that decomposes an N x N matrix into a lower and an upper triangular matrix.
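
As a small worked example of the factorization that LU computes, with illustrative numbers of our own choosing rather than values from the benchmark:

```latex
A = LU:\qquad
\begin{pmatrix} 4 & 3 \\ 6 & 3 \end{pmatrix}
=
\underbrace{\begin{pmatrix} 1 & 0 \\ 1.5 & 1 \end{pmatrix}}_{\text{lower triangular } L}
\underbrace{\begin{pmatrix} 4 & 3 \\ 0 & -1.5 \end{pmatrix}}_{\text{upper triangular } U}
```

Multiplying L by U reproduces A; the benchmark performs the same kind of factorization at much larger scale, in parallel.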

The HPC Challenge benchmark codes that we are going to use to assess the overall efficiency of the supercomputers are High Performance LINPACK (HPL), PTRANS, MPI RandomAccess (GUPS), and the Fast Fourier Transform (FFT). The HPL code solves a large, dense system of linear equations by a direct method. PTRANS performs parallel matrix transposition. The GUPS test allocates a large amount of memory across many different nodes and randomly accesses memory locations among the processors. The FFT code performs a fast Fourier transform.
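
To illustrate the kind of memory access pattern GUPS measures, here is a minimal single-process sketch in Java. This is our own simplification: the real RandomAccess test distributes the table across many nodes with MPI and uses its own update rule, and the table size and seed below are arbitrary choices for illustration.

```java
import java.util.Random;

// Simplified, single-process illustration of the GUPS access pattern:
// allocate a large table, then update entries at random indices.
public class RandomAccessSketch {
    public static void main(String[] args) {
        final int tableSize = 1 << 22;        // ~4 million entries (assumed size)
        final long[] table = new long[tableSize];
        final Random rng = new Random(42);    // fixed seed for repeatability

        final int updates = 4 * tableSize;    // number of random updates
        long start = System.nanoTime();
        for (int i = 0; i < updates; i++) {
            int index = rng.nextInt(tableSize);   // pick a random location
            table[index] ^= rng.nextLong();       // update it in place
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%.3f million updates per second%n",
                          updates / seconds / 1e6);
    }
}
```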

For every code that is run on one of these high performance computing platforms, an output file is produced containing data such as how long the job took, the size and class of the job, and other measurable quantities. The next step in the process is collecting the correct data for the analysis. This will be done with a Perl script[4] that goes through a file and picks out all of the data needed for the statistical analysis. The Perl script creates another file and writes the data it finds in a specific section of the benchmark output to a single line of the outfile. This produces a file that is much easier to work with when computing values and separating values by job size. The outfile generated by the Perl script will then be processed by a C++ or Java program that performs the mathematical, and therefore statistical, analysis.
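
As a sketch of what the analysis program will do, the following Java fragment reads measurements from an outfile and computes the sample mean and standard deviation. The file name "nas_ep.out" and the one-value-per-line format are assumptions for illustration, not the actual output of our Perl script.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the analysis step: read one numeric value per line,
// then compute the sample mean and the sample standard deviation.
// Assumes the outfile holds at least two measurements.
public class StatsSketch {
    public static void main(String[] args) throws IOException {
        List<Double> values = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("nas_ep.out"))) {
            line = line.trim();
            if (!line.isEmpty()) {
                values.add(Double.parseDouble(line));
            }
        }
        int n = values.size();
        double sum = 0.0;
        for (double v : values) sum += v;
        double mean = sum / n;

        double sumSq = 0.0;
        for (double v : values) sumSq += (v - mean) * (v - mean);
        double stdDev = Math.sqrt(sumSq / (n - 1)); // sample standard deviation

        System.out.printf("n = %d, mean = %.4f, std dev = %.4f%n", n, mean, stdDev);
    }
}
```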

So far, we have been able to get the NAS Parallel Benchmarks running on multiple platforms[1] and the HPC Challenge benchmark code running on some platforms. We are still working on getting these benchmarks onto other platforms in order to have a broader range of data for each system and a better feel for what performs best on a supercomputer. The Perl[4] script for the NAS Parallel Benchmarks has been written and works properly. The HPC Challenge Perl script is being modified to correctly select values for analysis. The Java program has been started; it has a few errors left that we will debug before we can see how well it works, and we are also working out the correct formulas[5,6] for a proper statistical analysis.

One of the issues we have run into concerns the units of the values used to compute the sample mean. The analysis calls for time per unit of work (e.g., seconds per million operations, or s/Mop), but the benchmarks report a rate in millions of operations per second (Mop/s). We have decided to deal with this by inverting the numerical value, turning the Mop/s rate into s/Mop.
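
For example, with a made-up rate chosen only to illustrate the conversion, the change of units is just a reciprocal:

```latex
r = 250\ \mathrm{Mop/s}
\quad\Longrightarrow\quad
t = \frac{1}{r} = \frac{1}{250}\ \mathrm{s/Mop} = 0.004\ \mathrm{s/Mop}
```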

We have also been speaking with other experts in this field through our mentor, Sue Goudy. She has taken a statistical analysis class and has spoken with the professor, Rob Easterling[6], about the project. He told her that we are on the right track and that the first things we should do are the analyses of the sample mean and standard deviation. These equations are standard formulas found in Lilja's text[5]; we have begun this work and will continue it. He also confirmed that it is proper to convert millions of operations per second into seconds per million operations by taking the reciprocal. This has helped us keep pushing forward to the analysis portion of the project.
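
For reference, the standard formulas for the sample mean and sample standard deviation of n measurements x_1, ..., x_n, as given in Lilja's text[5], are:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```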

**Results expected:**

We have a few hypotheses so far. One is that EP will run the most efficiently because it requires the least communication between the nodes. The efficiency of EP will thus depend almost solely on which processor is in use, unlike other benchmark codes that also depend upon the network fabric. The other hypotheses we have come up with so far are not yet settled; we still need to do more research to clarify them.

We also know that there will be discrepancies[3] in every run that is performed, and there are multiple possible causes. Many of the reasons that jobs slow down have to do with the operating system in use. For example, if an operating system daemon wakes up and decides that a necessary operation must be performed on a node with only one processor, the processor stops the benchmark calculation and its communication, and the job may even be removed from the node completely. If that job is in communication with other nodes that must wait on a value it is computing, the entire computation can be interrupted, causing a massive slowdown within the running benchmark. Another major cause of job slowdown is bad hardware[3]. If something goes wrong with a node's hardware, such as memory errors or a bad hard drive, the job allocated to that node cannot function as well as the jobs on other nodes, causing a communication slowdown or, if the node fails entirely, a total failure.

**Citations:**

1. Sandia's C-Plant website
2. HPC Challenge website: http://icl.cs.utk.edu/hpcc/
3. Interview with Donna Brown (Sandia National Laboratories), with contributions from Paula McAllister (Sandia National Laboratories), private communication
4. Schwartz, Randal L., *Learning Perl*, O'Reilly & Associates, Inc., 1993; Wall, Larry & Schwartz, Randal L., *Programming Perl*, O'Reilly & Associates, Inc., 1991
5. Lilja, David J., *Measuring Computer Performance: A Practitioner's Guide*, Cambridge University Press, 2000
6. Rob Easterling, private communication
7. NAS Parallel Benchmarks website: http://www.nas.nasa.gov/Software/NPB/Specs/npb2_0/npb2_0.html
8. Greenberg, "Java Tutorials", from Java class at CEC; Sierra, Kathy & Bates, Bert, *Head First Java*, O'Reilly Media, Inc., 2003

**Team Members:**

Stephanie McAllister

Vincent Moore

**Sponsoring Teacher:** Stephen Schum